
Separate MTP GGUF

This repo contains a Multi-Token Prediction (MTP) GGUF for llama.cpp, extracted from the base model (Qwen/Qwen3.6-35B-A3B).
It can be paired with a target model using the --spec-draft-model flag.
See PR: https://github.com/ggml-org/llama.cpp/pull/22673
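
As a rough illustration of the pairing, the sketch below assembles a llama-server command line that loads a target GGUF together with a standalone MTP GGUF. Only `--spec-draft-model` is named in the text above. `--spec-type draft-mtp` and `--spec-draft-n-max 3` are assumed to be the command-line forms of the keys used in the models.ini further down this page, and the binary name and both file names are placeholders. Check the PR for the flag names your build actually accepts.

```python
import shlex


def build_server_command(target_gguf, mtp_gguf, n_max=3, binary="llama-server"):
    """Return an argv list pairing a target model with a separate MTP draft GGUF.

    Assumption: the models.ini keys (spec-type, spec-draft-n-max) map one-to-one
    onto CLI flags with a leading "--". Verify against your llama.cpp build.
    """
    return [
        binary,
        "--model", target_gguf,
        "--spec-draft-model", mtp_gguf,    # flag named in this README
        "--spec-type", "draft-mtp",        # assumed CLI form of the ini key
        "--spec-draft-n-max", str(n_max),  # assumed CLI form of the ini key
    ]


# Placeholder file names, not files shipped in this repo.
cmd = build_server_command("target-model.gguf", "mtp-draft.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps paths with spaces intact if it is later passed to `subprocess.run`.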

If you're looking for an MTP GGUF for transplanting/"grafting" onto your model, check out:

am17an:

yes it can be loaded separately using --spec-draft-model. The convert_hf_to_gguf.py changes have an option of --mtp which just outputs the MTP gguf.

Using the "grafted" on MTP is more VRAM efficient though.

Another thing is that -hf option will try to look for the MTP gguf like it does for mmproj in case spec-draft-type draft-mtp is mentioned.

Discussion: https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078


Findings

Original MTP tensors:

Shared embeddings/output weights:

Ref: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model.safetensors.index.json

Filters (convert_hf_to_gguf.py)

    @classmethod
    def filter_tensors(cls, item):
        name, _ = item
        if name.startswith("mtp."):
            if cls.no_mtp:
                return None
            return item
        if cls.mtp_only:
            # In --mtp mode, drop trunk weights and keep only the shared embeddings/output
            # tensors that the standalone MTP graph references at inference time.
            canonical = name.replace("language_model.", "")
            keep = canonical in (
                "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
                "embed_tokens.weight", "norm.weight",
            )
            if not keep:
                return None
        return super().filter_tensors(item)  # ty: ignore[unresolved-attribute]
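
To make the effect of the filter easier to see, here is a minimal standalone sketch of the same decision logic as a plain function. It is not the converter's code: the parent class's `filter_tensors` is assumed to keep everything, and the sample tensor names are illustrative rather than read from the checkpoint index.

```python
# Shared tensors the standalone MTP graph needs, as listed in the filter above.
SHARED_TENSORS = (
    "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
    "embed_tokens.weight", "norm.weight",
)


def keep_tensor(name, mtp_only=False, no_mtp=False):
    """Return True if a checkpoint tensor would survive the filter.

    Simplification: the real method defers to super().filter_tensors() for
    non-MTP tensors; this sketch treats that parent filter as keep-all.
    """
    if name.startswith("mtp."):
        # MTP tensors are kept unless they were explicitly excluded.
        return not no_mtp
    if mtp_only:
        # In --mtp mode only the shared embedding/norm/output tensors remain.
        canonical = name.replace("language_model.", "")
        return canonical in SHARED_TENSORS
    return True


# Illustrative tensor names (assumptions, not copied from the index file).
samples = [
    "mtp.fc.weight",
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.mlp.gate.weight",
    "lm_head.weight",
]
for name in samples:
    print(f"{name}: keep in --mtp mode = {keep_tensor(name, mtp_only=True)}")
```

In `--mtp` mode the trunk layer is dropped, while the MTP tensor, the embeddings and the output head are kept.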

MTP tensors to GGUF conversion:

    python convert_hf_to_gguf.py ../Qwen3.6-35B-A3B --outtype bf16 --outfile ../Qwen3.6-35B-A3B-MTP/Qwen3.6-35B-A3B-MTP-bf16.gguf --mtp

Conversion log: conversion.log

INFO:hf-to-gguf:Loading model: Qwen3.6-35B-A3B
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model-00001-of-00026.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00025-of-00026.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00026-of-00026.safetensors'

INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:token_embd.weight,                    torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:blk.40.ffn_gate_exps.weight,          torch.bfloat16 --> BF16, shape = {2048, 512, 256}
INFO:hf-to-gguf:blk.40.ffn_up_exps.weight,            torch.bfloat16 --> BF16, shape = {2048, 512, 256}
INFO:hf-to-gguf:output.weight,                        torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:output_norm.weight,                   torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.eh_proj.weight,          torch.bfloat16 --> BF16, shape = {4096, 2048}
INFO:hf-to-gguf:blk.40.attn_norm.weight,              torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.ffn_down_exps.weight,          torch.bfloat16 --> BF16, shape = {512, 2048, 256}
INFO:hf-to-gguf:blk.40.ffn_gate_inp.weight,           torch.bfloat16 --> BF16, shape = {2048, 256}
INFO:hf-to-gguf:blk.40.ffn_down_shexp.weight,         torch.bfloat16 --> BF16, shape = {512, 2048}
INFO:hf-to-gguf:blk.40.ffn_gate_shexp.weight,         torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.ffn_up_shexp.weight,           torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.ffn_gate_inp_shexp.weight,     torch.bfloat16 --> BF16, shape = {2048, 1}
INFO:hf-to-gguf:blk.40.post_attention_norm.weight,    torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.attn_k_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.40.attn_k.weight,                 torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.attn_output.weight,            torch.bfloat16 --> BF16, shape = {4096, 2048}
INFO:hf-to-gguf:blk.40.attn_q_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.40.attn_q.weight,                 torch.bfloat16 --> BF16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.40.attn_v.weight,                 torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.nextn.shared_head_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.enorm.weight,            torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.hnorm.weight,            torch.bfloat16 --> F32, shape = {2048}

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:..\Qwen3.6-35B-A3B-MTP\Qwen3.6-35B-A3B-MTP-bf16.gguf: n_tensors = 23, total_size = 3.7G
Writing: 100%|██████████| 3.72G/3.72G [00:22<00:00, 167Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to ..\Qwen3.6-35B-A3B-MTP\Qwen3.6-35B-A3B-MTP-bf16.gguf
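
As a sanity check on the log, the sketch below recomputes the tensor count, parameter count and payload size from the shapes and output types printed above. It assumes 2 bytes per BF16 element and 4 bytes per F32 element, and it ignores GGUF metadata and alignment padding, so treat the result as an approximation of the file size.

```python
import math

BYTES_PER_ELEMENT = {"BF16": 2, "F32": 4}

# (tensor name, output type, shape) copied from the conversion log above.
TENSORS = [
    ("token_embd.weight", "BF16", (2048, 248320)),
    ("blk.40.ffn_gate_exps.weight", "BF16", (2048, 512, 256)),
    ("blk.40.ffn_up_exps.weight", "BF16", (2048, 512, 256)),
    ("output.weight", "BF16", (2048, 248320)),
    ("output_norm.weight", "F32", (2048,)),
    ("blk.40.nextn.eh_proj.weight", "BF16", (4096, 2048)),
    ("blk.40.attn_norm.weight", "F32", (2048,)),
    ("blk.40.ffn_down_exps.weight", "BF16", (512, 2048, 256)),
    ("blk.40.ffn_gate_inp.weight", "BF16", (2048, 256)),
    ("blk.40.ffn_down_shexp.weight", "BF16", (512, 2048)),
    ("blk.40.ffn_gate_shexp.weight", "BF16", (2048, 512)),
    ("blk.40.ffn_up_shexp.weight", "BF16", (2048, 512)),
    ("blk.40.ffn_gate_inp_shexp.weight", "BF16", (2048, 1)),
    ("blk.40.post_attention_norm.weight", "F32", (2048,)),
    ("blk.40.attn_k_norm.weight", "F32", (256,)),
    ("blk.40.attn_k.weight", "BF16", (2048, 512)),
    ("blk.40.attn_output.weight", "BF16", (4096, 2048)),
    ("blk.40.attn_q_norm.weight", "F32", (256,)),
    ("blk.40.attn_q.weight", "BF16", (2048, 8192)),
    ("blk.40.attn_v.weight", "BF16", (2048, 512)),
    ("blk.40.nextn.shared_head_norm.weight", "F32", (2048,)),
    ("blk.40.nextn.enorm.weight", "F32", (2048,)),
    ("blk.40.nextn.hnorm.weight", "F32", (2048,)),
]


def tensor_bytes(dtype, shape):
    """Size of one tensor's data in bytes, without padding."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]


n_tensors = len(TENSORS)
total_params = sum(math.prod(shape) for _, _, shape in TENSORS)
total_bytes = sum(tensor_bytes(dtype, shape) for _, dtype, shape in TENSORS)

print(f"tensors: {n_tensors}")                 # tensors: 23
print(f"params:  {total_params / 1e9:.2f} B")  # params:  1.86 B
print(f"size:    {total_bytes / 1e9:.2f} GB")  # size:    3.72 GB
```

The totals agree with the writer's `n_tensors = 23` and `3.72G`. About 1.02B of the 1.86B parameters sit in the shared `token_embd.weight` and `output.weight` tensors, which is consistent with the remark that the grafted variant is more VRAM efficient.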

Test Results

WebUI (llama.cpp) image

Harness (OpenCode) image

Harness (Claude Code) image

Config (models.ini)

version = 1

[*]
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
batch-size = 2048
ubatch-size = 512
cache-type-k = q4_0
cache-type-v = q4_0
jinja = true
direct-io = off
cache-prompt = true
cache-ram = 28672
n-gpu-layers = 99
reasoning = off
reasoning-budget = 0
min-p = 0
presence-penalty = 1.5
top-k = 40
defrag-thold = 0.1
parallel = 1
#dflash = on
#spec-type = dflash

[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Genesis-MXFP4_MOE.gguf
mmproj = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis.f16.gguf
#spec-draft-model = /root/.cache/llama.cpp/Qwen3.6-35B-A3B-MTP-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
#spec-type = draft-mtp
#spec-draft-n-max = 3
seed = 42

[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Genesis-MXFP4_MOE.gguf
mmproj = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis.f16.gguf
spec-draft-model = /root/.cache/llama.cpp/Qwen3.6-35B-A3B-MTP-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
spec-type = draft-mtp
spec-draft-n-max = 3
seed = 42
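
The two presets above differ only in the three `spec-*` keys. A quick way to confirm which presets in such a file have MTP drafting enabled is to parse it with Python's `configparser`, as in the sketch below. Two assumptions are baked in: `[*]` is treated as a defaults section inherited by every preset, and the top-level `version = 1` line is wrapped in a synthetic `[__global__]` section because `configparser` rejects keys that precede the first header. The sample text is a trimmed stand-in with placeholder section names and paths, and llama.cpp's own parser may behave differently.

```python
import configparser

# Trimmed stand-in for models.ini; section names and paths are placeholders.
SAMPLE = """\
version = 1

[*]
top-k = 40
#spec-type = dflash

[base]
model = target.gguf
#spec-draft-model = mtp.gguf
#spec-type = draft-mtp
top-k = 20

[base-mtp]
model = target.gguf
spec-draft-model = mtp.gguf
spec-type = draft-mtp
spec-draft-n-max = 3
top-k = 20
"""


def load_presets(text):
    """Parse preset text, treating [*] as defaults shared by all sections."""
    parser = configparser.ConfigParser(default_section="*", interpolation=None)
    # Wrap keys that appear before the first header in a synthetic section.
    parser.read_string("[__global__]\n" + text)
    return parser


def mtp_presets(parser):
    """Names of the sections whose effective spec-type is draft-mtp."""
    return [
        name for name in parser.sections()
        if parser.get(name, "spec-type", fallback=None) == "draft-mtp"
    ]


parser = load_presets(SAMPLE)
print(mtp_presets(parser))         # ['base-mtp']
print(parser.get("base", "top-k"))  # 20 (the preset overrides the [*] value)
```

Lines starting with `#` are skipped as comments, so the commented-out `spec-*` keys in the first preset leave it without speculative decoding.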

Hardware tested (low-budget mini PC)

  • Model: Machenike GTR Mini PC (~$600)
  • CPU: AMD R7-H255 (780M iGPU)
  • RAM: 32 GB DDR5 (shared/unified memory)
  • Backend: llama.cpp (Vulkan)