Qwen3.5-35B-A3B — IQ3_S with MTP Tensors

One of the first GGUF quantizations of Qwen3.5-35B-A3B that preserves the Multi-Token Prediction (MTP) head. Most public quantizations strip MTP tensors during conversion. This GGUF retains them, enabling speculative decoding via MTP.

Key specs

| Metric | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B |
| File size | 11 GB |
| Quantization | IQ3_S custom-mix (base) + BF16 MTP tensors |
| Total layers | 41 (40 main + 1 MTP) |
| Total tensors | 753 |
| MTP acceptance | 84.8% (warm cache), 61-63% (cold) |
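
To check the layer and tensor counts above against a downloaded file, one option is to bucket tensor names by block index. This is a minimal sketch under two assumptions: that the MTP tensors sit under a `blk.40.` prefix (inferred from the "40 main + 1 MTP" layout; the actual names in this GGUF may differ), and that the `gguf` Python package is installed for the optional reader helper. The function names and the sample tensor names are illustrative and not part of any tool mentioned here.

```python
import re
from collections import Counter

# Assumption: main layers are blocks 0-39 and the MTP layer is block 40.
MTP_BLOCK = 40
_BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def split_tensor_names(names, mtp_block=MTP_BLOCK):
    """Count tensors per group: main blocks, MTP block(s), and non-block tensors."""
    counts = Counter()
    for name in names:
        match = _BLOCK_RE.match(name)
        if match is None:
            counts["non_block"] += 1  # embeddings, output head, norms, ...
        elif int(match.group(1)) >= mtp_block:
            counts["mtp"] += 1
        else:
            counts["main"] += 1
    return counts


def tensor_names_from_gguf(path):
    """Read tensor names from a GGUF file (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the rest runs without it
    return [tensor.name for tensor in GGUFReader(path).tensors]


# Illustrative names only; real names come from tensor_names_from_gguf(path).
sample = [
    "token_embd.weight",
    "blk.0.attn_q.weight",
    "blk.39.ffn_down_exps.weight",
    "blk.40.hypothetical_mtp.weight",
]
print(dict(split_tensor_names(sample)))
```

On the real file, a non-zero `mtp` count would indicate that the MTP tensors survived conversion, and the three counts should sum to the total in the table.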

MTP Results

Tested with ik_llama.cpp (patched for Qwen3.5 MoE MTP support):

| Config | Speed | Acceptance | Notes |
|---|---|---|---|
| Without -mtp | 94 tok/s | n/a | Normal inference |
| -mtp --draft-max 1 | 47.6 tok/s | 84.8% warm / 61% cold | Works but slower |
| -mtp (multi-step) | 12.7 tok/s | 12% | Acceptance collapse (known MTP issue) |

Why MTP is slower here

The MTP layer (layer 40) is itself a full MoE layer with 256 experts. On consumer hardware with CPU expert offloading (-ncmoe 4), each MTP forward costs nearly as much as a main model forward. The bottleneck is MoE expert transfer over PCIe, not the number of forward passes.

This is an informative negative result: in this setup, MTP speculative decoding is not cost-effective when the MTP head is itself an MoE layer. A dense MTP head would avoid the extra expert transfers and should fare better.
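
The measured numbers can be turned into a rough per-step cost estimate. The sketch below uses a simplified model that is an assumption, not something the benchmark reports: each drafted token is accepted independently with the stated acceptance rate, drafting stops at the first rejection, and the 47.6 tok/s figure is paired with the warm-cache acceptance of 84.8%. The function names are illustrative.

```python
def tokens_per_step(acceptance: float, draft_len: int = 1) -> float:
    """Expected tokens emitted per speculative step.

    One token always comes from the main model; drafted token k is kept
    only if all k drafted tokens so far were accepted.
    """
    return 1.0 + sum(acceptance ** k for k in range(1, draft_len + 1))


def implied_step_cost(baseline_tps: float, mtp_tps: float,
                      acceptance: float, draft_len: int = 1) -> float:
    """Duration of one speculative step, in units of one baseline decode step."""
    step_seconds = tokens_per_step(acceptance, draft_len) / mtp_tps
    baseline_seconds = 1.0 / baseline_tps
    return step_seconds / baseline_seconds


break_even = tokens_per_step(0.848)              # step cost must stay below this
cost = implied_step_cost(94.0, 47.6, 0.848)      # figures from the results table
print(f"break-even step cost: {break_even:.3f}x baseline")
print(f"implied step cost:    {cost:.2f}x baseline")
```

Under these assumptions, a speculative step (draft, verification, and any expert transfers) would have to cost less than about 1.85 baseline steps to break even, while the measured throughput implies roughly 3.65. That gap is consistent with the explanation above that expert transfer, rather than acceptance rate, dominates the cost.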

MTP patches for ik_llama.cpp

This GGUF requires a patched ik_llama.cpp with MTP support for Qwen3.5 MoE. The 5 patches, which fix 8 bugs in tensor mapping, compute buffers, and MoE handling, are available at: chimere/patches/ik-llama-mtp

How it was built

# Inject MTP tensors from original BF16 shards into IQ3_S base
python build_mtp_gguf_v3.py
# Source: IQ3_S custom-mix base + shards 13-14 from Qwen/Qwen3.5-35B-A3B

Build tool: ramp-quant/tools/build_mtp_gguf_v3.py

Author

Kevin Remondiere — Independent ML researcher, Oloron-Sainte-Marie, France

License

Apache 2.0. Base model follows Qwen's license.
