Separate MTP GGUF
This repo contains a Multi-Token Prediction (MTP) GGUF for llama.cpp, extracted from the base model (Qwen/Qwen3.6-35B-A3B).
It can be paired with a target model using the --spec-draft-model flag.
See PR: https://github.com/ggml-org/llama.cpp/pull/22673
If you're looking for an MTP GGUF for transplanting/"grafting" onto your model, check out:
yes it can be loaded separately using --spec-draft-model. The convert_hf_to_gguf.py changes have an option of --mtp which just outputs the MTP gguf. Using the "grafted" on MTP is more VRAM efficient though.
Another thing is that -hf option will try to look for the MTP gguf like it does for mmproj in case spec-draft-type draft-mtp is mentioned.
Discussion: https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078
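As a rough illustration of the pairing described above, the following sketch assembles a llama-server command line that loads a target model together with the standalone MTP GGUF. It assumes the models.ini keys used later in this card (spec-draft-model, spec-type, spec-draft-n-max) are also accepted as command-line flags of the same name; the target file name is hypothetical.

```python
import shlex


def build_server_command(target_gguf, mtp_gguf, draft_n_max=3):
    """Assemble a llama-server argument list pairing a target model with an MTP GGUF.

    Assumption: the spec-* keys from models.ini map one-to-one onto CLI flags.
    """
    return [
        "llama-server",
        "--model", target_gguf,
        "--spec-draft-model", mtp_gguf,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


# "target-model.gguf" is a placeholder for whichever target GGUF you use.
cmd = build_server_command("target-model.gguf", "Qwen3.6-35B-A3B-MTP-bf16.gguf")
print(shlex.join(cmd))
```

Building the argument list in code rather than as a single string keeps paths with spaces intact when the list is later handed to subprocess.run.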
Findings
Original MTP tensors:
- https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model-00025-of-00026.safetensors (mtp.layers.0.mlp.experts.gate_up_proj -> blk.40.ffn_up_exps.weight, blk.40.ffn_gate_exps.weight)
- https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model-00026-of-00026.safetensors
Shared embeddings/output weights:
Ref: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/model.safetensors.index.json
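The mapping above suggests that the fused expert tensor is split into separate gate and up tensors during conversion. Here is a shape-only sketch of that split, under two assumptions this card does not confirm: the fused tensor is laid out as [experts, 2 * intermediate, hidden], and the two halves have equal size. The fused shape (256, 1024, 2048) is inferred from the conversion log rather than read from the checkpoint, and the log appears to print dimensions in reverse order.

```python
def split_fused_gate_up(fused_shape):
    """Return (gate_shape, up_shape) for a fused [experts, 2*ffn, hidden] shape.

    Assumption: the fused dimension is split evenly into two halves.
    """
    experts, fused_ffn, hidden = fused_shape
    if fused_ffn % 2 != 0:
        raise ValueError("fused gate/up dimension must be even")
    half = (experts, fused_ffn // 2, hidden)
    return half, half


def as_log_shape(shape):
    """Format a shape with dimensions reversed, as the conversion log seems to."""
    return "{" + ", ".join(str(n) for n in reversed(shape)) + "}"


# Fused shape inferred from the log, not read from the safetensors file.
gate_shape, up_shape = split_fused_gate_up((256, 1024, 2048))
print("blk.40.ffn_gate_exps.weight", as_log_shape(gate_shape))
print("blk.40.ffn_up_exps.weight", as_log_shape(up_shape))
```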
Filters (convert_hf_to_gguf.py)
@classmethod
def filter_tensors(cls, item):
    name, _ = item
    if name.startswith("mtp."):
        if cls.no_mtp:
            return None
        return item
    if cls.mtp_only:
        # In --mtp mode, drop trunk weights and keep only the shared embeddings/output
        # tensors that the standalone MTP graph references at inference time.
        canonical = name.replace("language_model.", "")
        keep = canonical in (
            "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
            "embed_tokens.weight", "norm.weight",
        )
        if not keep:
            return None
    return super().filter_tensors(item)  # ty: ignore[unresolved-attribute]
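To see what the filter keeps in --mtp mode, here is a standalone sketch of the same decision logic as a plain function. It is a simplification, not the converter's code: it returns a boolean instead of delegating to the parent class, and every sample tensor name except mtp.layers.0.mlp.experts.gate_up_proj is hypothetical.

```python
SHARED_TENSORS = (
    "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
    "embed_tokens.weight", "norm.weight",
)


def keeps_tensor(name, mtp_only=False, no_mtp=False):
    """Mirror the filter's keep/drop decision for a single tensor name."""
    if name.startswith("mtp."):
        # MTP tensors survive unless they are explicitly excluded.
        return not no_mtp
    if mtp_only:
        # Only the shared embedding/output tensors survive alongside the MTP block.
        canonical = name.replace("language_model.", "")
        return canonical in SHARED_TENSORS
    return True


# Hypothetical names, apart from the first one, which appears in this card.
SAMPLE_NAMES = [
    "mtp.layers.0.mlp.experts.gate_up_proj",
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.mlp.gate.weight",
    "lm_head.weight",
]

kept = [n for n in SAMPLE_NAMES if keeps_tensor(n, mtp_only=True)]
for n in kept:
    print("keep:", n)
```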
MTP tensors to GGUF conversion:
python convert_hf_to_gguf.py ../Qwen3.6-35B-A3B --outtype bf16 --outfile ../Qwen3.6-35B-A3B-MTP/Qwen3.6-35B-A3B-MTP-bf16.gguf --mtp
Conversion log: conversion.log
INFO:hf-to-gguf:Loading model: Qwen3.6-35B-A3B
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model-00001-of-00026.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00025-of-00026.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00026-of-00026.safetensors'
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:blk.40.ffn_gate_exps.weight, torch.bfloat16 --> BF16, shape = {2048, 512, 256}
INFO:hf-to-gguf:blk.40.ffn_up_exps.weight, torch.bfloat16 --> BF16, shape = {2048, 512, 256}
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.eh_proj.weight, torch.bfloat16 --> BF16, shape = {4096, 2048}
INFO:hf-to-gguf:blk.40.attn_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.ffn_down_exps.weight, torch.bfloat16 --> BF16, shape = {512, 2048, 256}
INFO:hf-to-gguf:blk.40.ffn_gate_inp.weight, torch.bfloat16 --> BF16, shape = {2048, 256}
INFO:hf-to-gguf:blk.40.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {512, 2048}
INFO:hf-to-gguf:blk.40.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.ffn_up_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.ffn_gate_inp_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 1}
INFO:hf-to-gguf:blk.40.post_attention_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.40.attn_k.weight, torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2048}
INFO:hf-to-gguf:blk.40.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.40.attn_q.weight, torch.bfloat16 --> BF16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.40.attn_v.weight, torch.bfloat16 --> BF16, shape = {2048, 512}
INFO:hf-to-gguf:blk.40.nextn.shared_head_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.enorm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.nextn.hnorm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:..\Qwen3.6-35B-A3B-MTP\Qwen3.6-35B-A3B-MTP-bf16.gguf: n_tensors = 23, total_size = 3.7G
Writing: 100%|██████████| 3.72G/3.72G [00:22<00:00, 167Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to ..\Qwen3.6-35B-A3B-MTP\Qwen3.6-35B-A3B-MTP-bf16.gguf
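The log can be cross-checked against the reported file size by tallying element counts per tensor. This is a small sketch that parses the export lines, assuming 2 bytes per BF16 element and 4 bytes per F32 element and ignoring GGUF metadata overhead; the sample holds three lines copied from the log above.

```python
import math
import re

# Matches lines such as:
# INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> BF16, shape = {2048, 248320}
LINE_RE = re.compile(
    r"INFO:hf-to-gguf:(\S+),\s+\S+\s+-->\s+(\w+),\s+shape = \{([^}]*)\}"
)
BYTES_PER_ELEMENT = {"BF16": 2, "F32": 4}


def tensor_sizes(log_text):
    """Map each exported tensor name to its payload size in bytes."""
    sizes = {}
    for match in LINE_RE.finditer(log_text):
        name, out_type, dims = match.groups()
        n_elements = math.prod(int(d) for d in dims.split(","))
        sizes[name] = n_elements * BYTES_PER_ELEMENT[out_type]
    return sizes


SAMPLE_LOG = """\
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> BF16, shape = {2048, 248320}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.40.ffn_gate_exps.weight, torch.bfloat16 --> BF16, shape = {2048, 512, 256}
"""

sizes = tensor_sizes(SAMPLE_LOG)
for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.3f} GB")
```

Run over the full log, the same tally should cover all 23 tensors and come to roughly 3.72 GB, in line with the total the writer reports. Most of that is the two vocabulary-sized tensors and the three expert tensors.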
Test Results
Config (models.ini)
version = 1
[*]
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
batch-size = 2048
ubatch-size = 512
cache-type-k = q4_0
cache-type-v = q4_0
jinja = true
direct-io = off
cache-prompt = true
cache-ram = 28672
n-gpu-layers = 99
reasoning = off
reasoning-budget = 0
min-p = 0
presence-penalty = 1.5
top-k = 40
defrag-thold = 0.1
parallel = 1
#dflash = on
#spec-type = dflash
[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Genesis-MXFP4_MOE.gguf
mmproj = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis.f16.gguf
#spec-draft-model = /root/.cache/llama.cpp/Qwen3.6-35B-A3B-MTP-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
#spec-type = draft-mtp
#spec-draft-n-max = 3
seed = 42
[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Genesis-MXFP4_MOE.gguf
mmproj = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Genesis-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis.f16.gguf
spec-draft-model = /root/.cache/llama.cpp/Qwen3.6-35B-A3B-MTP-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
spec-type = draft-mtp
spec-draft-n-max = 3
seed = 42
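The two presets above differ only in their speculative-decoding keys. Below is a sketch that makes that comparison mechanical, using Python's configparser on a shortened stand-in for models.ini. It assumes the [*] section acts as shared defaults that each preset overrides. The section names and file names in the sample are hypothetical.

```python
import configparser

# Shortened stand-in for models.ini; names and paths are placeholders.
SAMPLE_INI = """\
version = 1
[*]
flash-attn = on
top-k = 40
[base]
model = target.gguf
top-k = 20
#spec-type = draft-mtp
[base-mtp]
model = target.gguf
top-k = 20
spec-draft-model = mtp.gguf
spec-type = draft-mtp
spec-draft-n-max = 3
"""


def load_presets(text):
    """Return {preset: effective settings}, layering each preset over [*]."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    # configparser rejects keys before the first section header, so the
    # leading "version = 1" line is wrapped in a synthetic section.
    parser.read_string("[__top__]\n" + text)
    shared = dict(parser["*"]) if parser.has_section("*") else {}
    presets = {}
    for name in parser.sections():
        if name in ("__top__", "*"):
            continue
        presets[name] = {**shared, **dict(parser[name])}
    return presets


presets = load_presets(SAMPLE_INI)
extra_keys = sorted(set(presets["base-mtp"]) - set(presets["base"]))
print("keys only in the MTP preset:", extra_keys)
```

Lines starting with # are treated as comments by configparser, so the disabled spec-* entries in the non-MTP preset do not show up in its effective settings.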
Hardware tested (low-budget mini PC)
- Model: Machenike GTR Mini PC (~$600)
- CPU: AMD R7-H255 (780M iGPU)
- RAM: 32 GB DDR5 (shared/unified memory)
- Backend: llama.cpp (Vulkan)