Instructions to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test",
	filename="Qwen3.6-35B-A3B-MTP-bf16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16
# Run inference directly in the terminal:
llama-cli -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16
# Run inference directly in the terminal:
llama-cli -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16
# Run inference directly in the terminal:
./llama-cli -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Use Docker

docker model run hf.co/lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

LM Studio
Jan
Ollama
How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with Ollama:
```
ollama run hf.co/lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16
```

Unsloth Studio new

How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test to start chatting

Pi new

How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Run Hermes

hermes

Docker Model Runner
How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with Docker Model Runner:
```
docker model run hf.co/lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16
```

Lemonade

How to use lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull lym00/Qwen3.6-35B-A3B-MTP-GGUF-Test:BF16

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-MTP-GGUF-Test-BF16

List all available models

lemonade list

Separate MTP GGUF

This repo contains Multi-Token Prediction (MTP) GGUF for LLaMA.cpp extracted from the base model (Qwen/Qwen3.6-35B-A3B).
It can be paired with a target model using the --spec-draft-model flag.
See PR: https://github.com/ggml-org/llama.cpp/pull/22673

If you’re looking for an MTP GGUF for transplanting/"grafting" onto your model, check out:

https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF

https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-35A3B-MTP-TENSORS-ONLY

am17an:

yes it can be loaded separately using --spec-draft-model. The convert_hf_to_gguf.py changes have an option of --mtp which just outputs the MTP gguf.

Using the "grafted" on MTP is more VRAM efficient though.

Another thing is that -hf option will try to look for the MTP gguf like it does for mmproj in case spec-draft-type draft-mtp is mentioned.

Discussion: https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078

Findings

Original MTP tensors:

https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model-00025-of-00026.safetensors (mtp.layers.0.mlp.experts.gate_up_proj -> blk.40.ffn_up_exps.weight blk.40.ffn_gate_exps.weight)
https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model-00026-of-00026.safetensors

Shared embeddings/output weights:

https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model-00001-of-00026.safetensors

Ref: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model.safetensors.index.json

Filters (convert_hf_to_gguf.py)

    @classmethod
    def filter_tensors(cls, item):
        name, _ = item
        if name.startswith("mtp."):
            if cls.no_mtp:
                return None
            return item
        if cls.mtp_only:
            # In --mtp mode, drop trunk weights and keep only the shared embeddings/output
            # tensors that the standalone MTP graph references at inference time.
            canonical = name.replace("language_model.", "")
            keep = canonical in (
                "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
                "embed_tokens.weight", "norm.weight",
            )
            if not keep:
                return None
        return super().filter_tensors(item)  # ty: ignore[unresolved-attribute]

MTP tensors to GGUF conversion: python convert_hf_to_gguf.py ../Qwen3.6-35B-A3B --outtype bf16 --outfile ../Qwen3.6-35B-A3B-MTP/Qwen3.6-35B-A3B-MTP-bf16.gguf --mtp

Conversion log: conversion.log

INFO:hf-to-gguf:Loading model: Qwen3.6-35B-A3B
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model-00001-of-00026.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00025-of-00026.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00026-of-00026.safetensors'

INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:token_embd.weight,                    torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:blk.40.ffn_gate_exps.weight,          torch.bfloat16 --> BF16, shape = {2048, 512, 256}
INFO:hf-to-gguf:blk.40.ffn_up_exps.weight,            torch.bfloat16 --> BF16, shape = {2048, 512, 256}
INFO:hf-to-gguf:output.weight,                        torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:output_norm.weight,                   torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.eh_proj.weight,          torch.bfloat16 --> BF16, shape = {4096, 2048}
INFO:hf-to-gguf:blk.40.attn_norm.weight,              torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.ffn_down_exps.weight,          torch.bfloat16 --> BF16, shape = {512, 2048, 256}
INFO:hf-to-gguf:blk.40.ffn_gate_inp.weight,           torch.bfloat16 --> BF16, shape = {2048, 256}
INFO:hf-to-gguf:blk.40.ffn_down_shexp.weight,         torch.bfloat16 --> BF16, shape = {512, 2048}
INFO:hf-to-gguf:blk.40.ffn_gate_shexp.weight,         torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.ffn_up_shexp.weight,           torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.ffn_gate_inp_shexp.weight,     torch.bfloat16 --> BF16, shape = {2048, 1}
INFO:hf-to-gguf:blk.40.post_attention_norm.weight,    torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.attn_k_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.40.attn_k.weight,                 torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.attn_output.weight,            torch.bfloat16 --> BF16, shape = {4096, 2048}
INFO:hf-to-gguf:blk.40.attn_q_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.40.attn_q.weight,                 torch.bfloat16 --> BF16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.40.attn_v.weight,                 torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.nextn.shared_head_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.enorm.weight,            torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.hnorm.weight,            torch.bfloat16 --> F32, shape = {2048}

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:..\Qwen3.6-35B-A3B-MTP\Qwen3.6-35B-A3B-MTP-bf16.gguf: n_tensors = 23, total_size = 3.7G
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.72G/3.72G [00:22<00:00, 167Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to ..\Qwen3.6-35B-A3B-MTP\Qwen3.6-35B-A3B-MTP-bf16.gguf

Test Results

WebUI (llama.cpp)

Harness (OpenCode)

Harness (Claude Code)

Config (models.ini)

version = 1

[*]
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
batch-size = 2048
ubatch-size = 512
cache-type-k = q4_0
cache-type-v = q4_0
jinja = true
direct-io = off
cache-prompt = true
cache-ram = 28672
n-gpu-layers = 99
reasoning = off
reasoning-budget = 0
min-p = 0
presence-penalty = 1.5
top-k = 40
defrag-thold = 0.1
parallel = 1
#dflash = on
#spec-type = dflash

[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Genesis-MXFP4_MOE.gguf
mmproj = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis.f16.gguf
#spec-draft-model = /root/.cache/llama.cpp/Qwen3.6-35B-A3B-MTP-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
#spec-type = draft-mtp
#spec-draft-n-max = 3
seed = 42

[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Genesis-MXFP4_MOE.gguf
mmproj = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis.f16.gguf
spec-draft-model = /root/.cache/llama.cpp/Qwen3.6-35B-A3B-MTP-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
spec-type = draft-mtp
spec-draft-n-max = 3
seed = 42

Hardware tested (low-budget mini PC)

Model: Machenike GTR Mini PC (~$600)
CPU: AMD R7-H255 (780M iGPU)
RAM: 32G DDR5 (Shared/Unified memory)
Backend: llama.cpp (Vulkan)

Downloads last month: 1,231

GGUF

Model size

2B params

Architecture

qwen35moe

Hardware compatibility

4-bit

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support