Qwen3.5-4B-MTP-NVFP4-GGUF

NVFP4 (NVIDIA Blackwell FP4) GGUF quantization of Qwen/Qwen3.5-4B with Multi-Token Prediction (MTP) support.

Thinking is enabled by default, matching the original model's behavior. To disable it, set enable_thinking=false in the chat template arguments.
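As a rough sketch of the opt-out against llama.cpp's OpenAI-compatible server, the helper below builds a chat request that passes enable_thinking=false through a chat_template_kwargs field. The server address, the helper names, and the assumption that your llama-server build accepts a per-request chat_template_kwargs field are not from this model card, so verify them against your build.

```python
import json
import urllib.request

# Assumed local llama-server address; adjust to your setup.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt, enable_thinking=True, max_tokens=256):
    """Build an OpenAI-style chat payload with the thinking switch set."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload, url=SERVER_URL):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("Hello!", enable_thinking=False)
print(json.dumps(payload, indent=2))
```

With a server running, send_chat_request(payload) would submit the request; nothing is sent when the block is merely loaded.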

About NVFP4

NVFP4 is NVIDIA's native 4-bit floating-point format for Blackwell architecture GPUs (RTX 50 series). Each weight is stored as a 4-bit E2M1 value, and every block of 16 weights shares an FP8 (E4M3) scale factor. It offers several advantages over traditional integer quantization formats:

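| Feature | NVFP4 (E2M1 values, E4M3 block scale) | INT4 (Q4_K_M, etc.) |
| --- | --- | --- |
| Block size | 16 elements | 32 elements |
| Dynamic range | ±6 per element, times a block scale of up to 448 | ±7 |
| Hardware acceleration | Blackwell Tensor Cores | CPU/CUDA cores |
| Dequantization overhead | None (native) | Required |
| BPW | ~4.5-4.8 | ~4.5 |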
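To make the block structure concrete, here is a minimal sketch of quantizing one 16-element block onto the E2M1 value grid with a shared scale. It is a simplification, not the llama.cpp kernel: the scale is kept as a Python float rather than being rounded to FP8 E4M3, and the function names are illustrative.

```python
# Magnitudes representable by a 4-bit E2M1 value (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(values):
    """Return (scale, codes) for one block; codes are signed grid values."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    # Map the largest magnitude onto the top of the grid (6.0).
    # Simplification: the real format stores this scale in FP8 E4M3.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its grid codes."""
    return [scale * c for c in codes]


def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Average storage per weight: element bits plus the amortized scale."""
    return (block_size * element_bits + scale_bits) / block_size


block = [0.1 * i - 0.8 for i in range(BLOCK_SIZE)]  # sample values, -0.8 to 0.7
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} max_error={max_error:.4f} bpw={bits_per_weight()}")
```

The block-level arithmetic gives 4.5 bits per weight. A whole-file average above that, such as the 4.81 BPW listed under Quantization Details, presumably reflects tensors kept at higher precision.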

When to use NVFP4:

  • You have an RTX 5060/5070/5080/5090 (Blackwell GPU)
  • You want optimal inference speed with native FP4 tensor cores
  • You need the best quality-per-bit among 4-bit formats

When to use other formats:

  • Pre-Blackwell GPUs: Use Q4_K_M or Q4_0
  • CPU inference: Use Q4_K_M or Q5_K_M
  • Maximum quality: Use Q6_K or Q8_0

With MTP enabled, the model can produce 2 tokens per inference step (1 base + 1 predicted), which can raise text-generation throughput by as much as 30-40% while maintaining quality.
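A back-of-the-envelope model shows where a figure in that range can come from. It assumes the predicted token is kept only when it passes verification by the base model, and that running the MTP head adds a fixed relative cost per step. The acceptance rates and the 10% overhead below are hypothetical inputs, not measurements of this model.

```python
def mtp_speedup(acceptance_rate, draft_overhead):
    """Estimated throughput ratio from one MTP draft token per step.

    acceptance_rate: fraction of drafted tokens that pass verification (0..1).
    draft_overhead: extra cost of a step with MTP, relative to a plain step.
    """
    tokens_per_step = 1.0 + acceptance_rate  # 1 base token + accepted draft
    cost_per_step = 1.0 + draft_overhead     # relative to a plain step
    return tokens_per_step / cost_per_step


# Hypothetical inputs chosen for illustration, not measured on this model.
for rate in (0.4, 0.5, 0.6):
    gain = mtp_speedup(rate, draft_overhead=0.10) - 1.0
    print(f"acceptance={rate:.0%} -> throughput change {gain:+.0%}")
```

Under this model the gain depends heavily on how predictable the text is, so treat the upper figure as a best case.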

Files

| Filename | Type | Size | Description |
| --- | --- | --- | --- |
| qwen35-4b-mtp-nvfp4.gguf | Model | 2.48 GB | NVFP4 quantized text model with MTP |
| mmproj-qwen35-4b-f16.gguf | Vision encoder | 672 MB | Multimodal projector (SigLIP ViT, F16) |

Quantization Details

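| Property | Value |
| --- | --- |
| Format | NVFP4 (E2M1 values, E4M3 block scales) |
| Block size | 16 |
| BPW | 4.81 |
| HW target | NVIDIA Blackwell (RTX 50 series) |
| VRAM required | ~3.2 GB of weights (with mmproj), plus KV cache and compute buffers |
| MTP heads | 1 (nextn_predict_layers=1) |
| Thinking | Enabled by default (opt-out via enable_thinking=false) |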
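The figures above can be cross-checked with simple arithmetic. This sketch assumes the listed sizes are decimal gigabytes and that the 4.81 BPW average applies to the whole text-model file; the function name is illustrative.

```python
GB = 1_000_000_000  # Assumption: sizes on the card are decimal gigabytes.

model_bytes = 2.48 * GB    # qwen35-4b-mtp-nvfp4.gguf
mmproj_bytes = 0.672 * GB  # mmproj-qwen35-4b-f16.gguf
bits_per_weight = 4.81     # BPW from the table


def implied_parameters(file_bytes, bpw):
    """Parameter count implied by a file size and an average bits-per-weight."""
    return file_bytes * 8 / bpw


weights_gb = (model_bytes + mmproj_bytes) / GB
params_billion = implied_parameters(model_bytes, bits_per_weight) / 1e9
print(f"weights on GPU: {weights_gb:.2f} GB")
print(f"implied text-model parameters: {params_billion:.2f} B")
```

Runtime memory use is higher than the weight total, because the KV cache and compute buffers come on top and grow with the context length you configure.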

Model Description

Qwen3.5-4B is a multilingual multimodal model from the Qwen team at Alibaba, featuring:

  • 2560-dimensional hidden states
  • 9216-dimensional feed-forward layers
  • 32 base transformer layers + 1 MTP head (33 total blocks)
  • Multi-Head Latent Attention (MLA) with per-head KV norms
  • Hybrid architecture: Attention + SSM (Mamba-2) layers
  • 262,144 token context window
  • 3D MRoPE (Multi-modal Rotary Position Embedding)
  • Built-in vision encoder (SigLIP ViT)

Usage

llama.cpp CLI (text-only)

./llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  -p "Hello, how are you?" \
  -n 256

llama.cpp CLI (multimodal — with vision)

./llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  --mmproj mmproj-qwen35-4b-f16.gguf \
  --image photo.jpg \
  -p "Describe this image" \
  -n 256

llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="qwen35-4b-mtp-nvfp4.gguf",
    n_ctx=32768,
    n_gpu_layers=-1,  # Full GPU offload
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
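Because thinking is on by default, the returned text may contain the model's reasoning as well as its answer. The helper below is a small sketch for separating the two. It assumes the reasoning arrives inline, wrapped in <think>...</think> tags as in Qwen3-family models. Some runtimes return it in a separate field instead, in which case no splitting is needed.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from a raw completion string."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Drop the tagged span and keep whatever surrounds it as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\nThe user greeted me.\n</think>\n\nHello! How can I help?"
reasoning, answer = split_thinking(raw)
print(answer)
```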

Download from HuggingFace Hub

huggingface-cli download FreedomAISVR/Qwen3.5-4B-MTP-NVFP4-GGUF \
  qwen35-4b-mtp-nvfp4.gguf \
  --local-dir . --local-dir-use-symlinks False

# Also download mmproj for vision support:
huggingface-cli download FreedomAISVR/Qwen3.5-4B-MTP-NVFP4-GGUF \
  mmproj-qwen35-4b-f16.gguf \
  --local-dir . --local-dir-use-symlinks False
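The same download can be scripted from Python. This is a minimal sketch that assumes the huggingface_hub package is installed; the helper names are illustrative.

```python
REPO_ID = "FreedomAISVR/Qwen3.5-4B-MTP-NVFP4-GGUF"
FILES = [
    "qwen35-4b-mtp-nvfp4.gguf",   # text model
    "mmproj-qwen35-4b-f16.gguf",  # vision projector (optional)
]


def select_files(with_vision):
    """Pick the files needed for text-only or multimodal use."""
    return FILES if with_vision else FILES[:1]


def download_files(filenames, local_dir=".", repo_id=REPO_ID):
    """Download each file into local_dir and return the local paths."""
    # Imported lazily so this module loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in filenames
    ]
```

Calling download_files(select_files(with_vision=True)) fetches both files; pass with_vision=False to skip the projector.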

Conversion Pipeline

# 1. Download source model
huggingface-cli download Qwen/Qwen3.5-4B --local-dir ./qwen35-4b-src

# 2. Convert to F16 GGUF (MTP auto-included)
python convert_hf_to_gguf.py ./qwen35-4b-src \
  --outfile qwen35-4b-f16.gguf --outtype f16

# 3. Extract vision encoder mmproj
python convert_hf_to_gguf.py ./qwen35-4b-src \
  --mmproj --outfile mmproj-qwen35-4b-f16.gguf --outtype f16

# 4. Quantize to NVFP4
./llama-quantize qwen35-4b-f16.gguf \
  qwen35-4b-mtp-nvfp4.gguf NVFP4
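After each step it can help to confirm that the output is a readable GGUF file. The sketch below parses only the fixed-size header, assuming GGUF version 2 or later, where the tensor and metadata counts are 64-bit little-endian integers. It does not validate tensors or the quantization type.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Little-endian: 4-byte magic, uint32 version, uint64 tensor count,
# uint64 metadata key/value count.
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Read the fixed-size GGUF header and return its fields as a dict."""
    with open(path, "rb") as f:
        raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("file too short to be GGUF")
    magic, version, tensor_count, kv_count = HEADER.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic {magic!r}")
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}
```

For example, read_gguf_header("qwen35-4b-mtp-nvfp4.gguf") returns the version and counts, and raises ValueError if the file is truncated or not GGUF.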

Hardware

| Component | Specification |
| --- | --- |
| GPU | RTX 5060 Ti 16GB (Blackwell) |
| VRAM | 16 GB GDDR7 |
| System RAM | 64 GB |
| OS | Windows 11 |

License

This quantization is distributed under Apache 2.0. The original Qwen3.5-4B model is released under Apache 2.0 by Qwen (Alibaba Group).
