Instructions to use froggeric/Qwen3.6-27B-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use froggeric/Qwen3.6-27B-MTP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="froggeric/Qwen3.6-27B-MTP-GGUF", filename="Qwen3.6-27B-F16-mtp.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use froggeric/Qwen3.6-27B-MTP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Use Docker
docker model run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use froggeric/Qwen3.6-27B-MTP-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "froggeric/Qwen3.6-27B-MTP-GGUF" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "froggeric/Qwen3.6-27B-MTP-GGUF", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker
docker model run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
- Ollama
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Ollama:
ollama run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
- Unsloth Studio new
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for froggeric/Qwen3.6-27B-MTP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for froggeric/Qwen3.6-27B-MTP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for froggeric/Qwen3.6-27B-MTP-GGUF to start chatting
- Pi new
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Docker Model Runner:
docker model run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
- Lemonade
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.6-27B-MTP-GGUF-Q4_K_M
List all available models
lemonade list
- Qwen3.6-27B with MTP
- 2.5x faster with MTP · 262K context on 48 GB · Fixed chat template
Dense 27B model with vision, thinking, and tool use — self-speculative decoding,
configurable KV cache (f16 for quality, q8_0/q4_0 for longer context), fixed Jinja template (tool calls and thinking actually work in C++ runtimes),
and a server with both OpenAI and Anthropic APIs.
One command. Both APIs. No cloud.
Note: Vision works with MTP speculative decoding on llama.cpp b9240+. Older builds (PR #22673) crashed when combining vision + MTP — this is fixed in mainline.
Start the server
You need llama.cpp b9180 or newer (released 2026-05-16, includes MTP support). Install via Homebrew:
brew install llama.cpp
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
That's it. Three optimizations in one command:
| Flag | What it does | Impact |
|---|---|---|
--spec-type draft-mtp --spec-draft-n-max 3 |
Multi-Token Prediction (built into the model) | 2.5x faster generation |
--cache-type-k q8_0 --cache-type-v q8_0 |
8-bit KV cache (instead of 16-bit) | Half the KV memory, negligible quality loss |
-c 262144 |
262K context window | Full native context on 48 GB Mac with q8_0 KV |
Tip for Apple Silicon users at long context (>64K): If you experience slow prefill or timeouts, try disabling Flash Attention with
-fa off. On hybrid attention+SSM models like Qwen3.6, this can improve prefill speed 37-53% and unlock longer contexts that would otherwise time out.
Adjust -m, -c, and --cache-type-k/v for your hardware — see the Which quant should I download? table below.
Which quant should I download?
Find your hardware below — each row gives the best quant, KV cache type, and max context that fits.
Apple Silicon
Qwen3.6-27B is a hybrid model — only 16 of 65 layers use KV cache (verified). The other 48 are linear attention (fixed 898 MiB recurrent state). KV memory is ~4× less than a standard dense model. Runtimes that don't handle this (e.g. vllm) allocate KV for all 65 layers and show much higher memory usage.
Numbers below are total memory used (model + KV cache + 0.9 GB recurrent state). Must leave ≥ 8 GB for macOS (16 GB Macs excepted).
| RAM | Quant | KV cache | Max context | Total used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M |
q8_0 |
42K | 12.0 GB | ✗ |
| 24 GB | IQ3_M |
46K | 16.0 GB | ✗ | |
| 24 GB | IQ3_M |
q8_0 |
91K | 16.0 GB | ✗ |
| 32 GB | Q5_K_M |
74K | 24.0 GB | ✗ | |
| 32 GB | Q5_K_M |
q8_0 |
147K | 24.0 GB | ✗ |
| 32 GB | Q4_K_M |
99K | 24.0 GB | ✓ | |
| 48 GB | Q6_K |
262K | 39.7 GB | ✓ | |
| 48 GB | Q8_0 |
173K | 40.0 GB | ✓ | |
| 48 GB | Q8_0 |
q8_0 |
262K | 37.3 GB | ✓ |
| 64 GB | Q8_0 |
262K | 45.8 GB | ✓ | |
| 96 GB | Q8_0 |
262K | 45.8 GB | ✓ |
NVIDIA GPU
Same model memory as Apple Silicon, plus ~1 GB CUDA overhead.
| VRAM | Quant | KV cache | Max context | Total VRAM used | Vision |
|---|---|---|---|---|---|
| 12 GB | IQ2_M |
q8_0 |
11K | 12.0 GB | ✗ |
| 16 GB | IQ3_M |
30K | 16.0 GB | ✗ | |
| 16 GB | IQ3_M |
q8_0 |
60K | 16.0 GB | ✗ |
| 24 GB | Q4_K_M |
83K | 24.0 GB | ✓ | |
| 24 GB | Q4_K_M |
q8_0 |
167K | 24.0 GB | ✓ |
| 24 GB | Q5_K_M |
58K | 24.0 GB | ✗ | |
| 48 GB | Q6_K |
262K | 40.7 GB | ✓ | |
| 48 GB | Q8_0 |
262K | 46.8 GB | ✓ | |
| 80 GB | Q8_0 |
262K | 46.8 GB | ✓ |
16 GB Mac:
IQ2_M/q8_0 — 42K text-only. No vision.24 GB Mac:
IQ3_M— 46K (f16 KV) or 91K (q8_0). Vision at 32–65K.32 GB Mac:
Q5_K_M— 74K text-only (f16 KV), 147K (q8_0).Q4_K_Mfor vision at 99K.48 GB Mac:
Q6_K/f16 KV — 262K with vision.Q8_0/q8_0 KV for 262K at higher model quality.64 GB+ Mac:
Q8_0/f16 KV — 262K with vision. Maximum quality at practical speed.12 GB GPU:
IQ2_M/q8_0 — 11K. Very limited, no vision.16 GB GPU:
IQ3_M— 30K (f16 KV) or 60K (q8_0). No vision.24 GB GPU:
Q4_K_M— 83K with vision (f16 KV).Q5_K_M— 58K text-only (f16 KV), 116K (q8_0).48 GB+ GPU:
Q6_K/f16 KV — 262K with vision.Q8_0for max quality.
Leave KV cache at f16 (blank column) for best quality. Use q8_0 KV only when f16 doesn't give enough context. q4_0 KV should not exceed 64K context.
Vision adds ~0.9 GB for mmproj. macOS needs ≥ 8 GB for itself (16 GB Macs excepted — use ~4 GB). You can increase available memory by raising the wired memory limit, e.g. for a 96 GB Mac: sudo sysctl iogpu.wired_limit_mb=90112 (88 GB). NVIDIA reserves ~1 GB for CUDA.
API usage
OpenAI-compatible (/v1/chat/completions)
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}]}'
Works with any OpenAI client — just point it at http://localhost:8081/v1.
Anthropic-compatible (/v1/messages)
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{"model":"qwen","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'
Works with any Anthropic client — the server natively speaks the Messages API with streaming, tool use, and vision.
Claude Code
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 claude
Claude Code uses the Anthropic Messages API. With this one env var, it talks to your local Qwen3.6-27B instead of the cloud.
Tool use (both APIs)
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}],
"messages": [{"role": "user", "content": "What is the weather in Paris?"}]
}'
Vision
Vision works with MTP enabled — just add --mmproj to the server command:
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"max_tokens": 1024,
"messages": [{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "'$(base64 < photo.jpg)'"}},
{"type": "text", "text": "Describe this image"}
]}]
}'
Direct CLI usage
# Text generation
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 -c 4096 -n 2048 --temp 0.7 -ngl 99 \
-p "Your prompt here"
# Vision
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 -c 4096 -n 2048 --temp 0.7 -ngl 99 \
--image photo.jpg \
-p "Describe this image"
KV cache options
The --cache-type-k and --cache-type-v flags control KV cache precision. Lower precision = less memory = longer context on the same hardware.
| Type | Bits/val | KV size (80K ctx) | Quality | Speed | When to use |
|---|---|---|---|---|---|
f16 |
16 | 5.3 GB | Full | Baseline | Best quality — use when RAM allows |
q8_0 |
8 | 2.8 GB | Negligible loss | Faster than f16 | When f16 KV doesn't give enough context |
q4_0 |
4 | 1.5 GB | Minor loss | Slightly slower | Max context on limited RAM (≤64K only) |
Recommendation: Leave KV at f16 for best quality. Use q8_0 when f16 doesn't give enough context. Reserve q4_0 for tight RAM — and only up to 64K context.
Effect on hardware requirements (Q5_K_M, 80K context):
| KV type | Model + recurrent + KV | Hardware |
|---|---|---|
| f16 | 24 GB | 48 GB Mac |
| q8_0 | 22 GB | 32 GB Mac |
Speculative decoding modes
MTP (recommended — 2.5x faster)
The model predicts 5 extra tokens per step using its own MTP heads, then verifies them in one pass. No extra model needed.
--spec-type draft-mtp --spec-draft-n-max 3 -np 1
MTP currently requires -np 1 (single-sequence mode).
Tune --spec-draft-n-max: 3 is optimal for general use (83% acceptance rate). Values of 1–2 are more conservative; 4–5 waste compute on rejected tokens.
Draft model (~2.3x faster)
Pair with a smaller Qwen 3.5/3.6 model that shares the same tokenizer.
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
-md Qwen3.5-0.8B-Q8_0.gguf \
--spec-draft-n-max 10 -ngl 99 -ngld 99 \
-c 4096 -n 2048 --temp 0.7 \
-p "Your prompt"
ngram-mod (no extra model, benefits repeat prompts)
Uses cached n-grams from previous prompts.
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64 \
--repeat-penalty 1.0
Downloads
| File | Size | Min. (4K ctx) | Recommended (80K ctx) | Max (262K ctx) |
|---|---|---|---|---|
Qwen3.6-27B-F16-mtp.gguf |
51 GB | 64 GB Mac · 80 GB GPU | 64 GB Mac · 80 GB GPU | 96 GB Mac · 80 GB GPU |
Qwen3.6-27B-Q8_0-mtp.gguf |
27 GB | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-Q6_K-mtp.gguf |
21 GB | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-Q5_K_M-mtp.gguf |
18 GB | 32 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-Q4_K_M-mtp.gguf |
16 GB | 32 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-IQ4_XS-mtp.gguf |
14 GB | 24 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 32 GB Mac · 48 GB GPU |
Qwen3.6-27B-IQ3_M-mtp.gguf |
12 GB | 24 GB Mac · 16 GB GPU | 24 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU |
Qwen3.6-27B-IQ2_M-mtp.gguf |
9.5 GB | 16 GB Mac · 16 GB GPU | 24 GB Mac · 16 GB GPU | 32 GB Mac · 24 GB GPU |
mmproj-Qwen3.6-27B-f16.gguf |
885 MB | Vision encoder (optional, any tier) | — | — |
All tiers include MTP heads and were quantized directly from the F16 conversion for maximum precision. I-quant tiers (IQ4_XS, IQ3_M, IQ2_M) use unsloth's importance matrix. Q5_K_M is the sweet spot — use Q4_K_M if you're tight on RAM, Q8_0 for high quality, or F16 for long agentic coding sessions where quantization artifacts compound noticeably. GPU means NVIDIA (RTX 3060 = 12 GB, RTX 3090/4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB).
Hardware numbers assume f16 KV for "Min." (4K) and q8_0 KV for "Recommended" (80K) and "Max" (262K). Add --cache-type-k q8_0 --cache-type-v q8_0 to reach the recommended or max context on smaller hardware.
Memory requirements
Approximate VRAM on Apple Silicon (unified memory), using Q5_K_M as reference. Includes 0.9 GB recurrent state (constant, does not scale with context). Only 16 of 65 layers use KV cache — the other 48 use linear attention.
| Context | Model | KV (f16) | KV (q8_0) | Total (f16) | Total (q8_0) | Min. Mac |
|---|---|---|---|---|---|---|
| 4K | 18 GB | 0.3 GB | 0.1 GB | 19 GB | 19 GB | 32 GB |
| 8K | 18 GB | 0.5 GB | 0.3 GB | 19 GB | 19 GB | 32 GB |
| 32K | 18 GB | 2.1 GB | 1.0 GB | 20 GB | 20 GB | 32 GB |
| 64K | 18 GB | 4.1 GB | 2.1 GB | 21 GB | 21 GB | 32 GB |
| 80K (recommended) | 18 GB | 5.2 GB | 2.6 GB | 22 GB | 22 GB | 32 GB |
| 128K | 18 GB | 8.3 GB | 4.1 GB | 25 GB | 23 GB | 32 GB |
| 262K (max native) | 18 GB | 17.0 GB | 8.5 GB | 34 GB | 27 GB | 48 GB |
"Total" = model + recurrent state + KV cache. macOS needs ≥ 8 GB (16 GB Macs excepted). With vision: add 0.9 GB for the mmproj.
Memory for all quant tiers (4K context, q8_0 KV)
| Quant | Model | KV + recurrent | Total | Min. Mac |
|---|---|---|---|---|
| Q8_0 | 27 GB | 1.0 GB | 28 GB | 48 GB |
| Q6_K | 21 GB | 1.0 GB | 22 GB | 32 GB |
| Q5_K_M | 18 GB | 1.0 GB | 19 GB | 32 GB |
| Q4_K_M | 16 GB | 1.0 GB | 17 GB | 32 GB |
| IQ4_XS | 14 GB | 1.0 GB | 15 GB | 24 GB |
| IQ3_M | 12 GB | 1.0 GB | 13 GB | 24 GB |
| IQ2_M | 9.5 GB | 1.0 GB | 11 GB | 16 GB |
System prompt
The first line must be:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
The model underperforms without it. Append anything after that line.
Thinking toggle
Drop <|think_on|> or <|think_off|> in any message to toggle thinking. The template strips the tag so the model never sees it.
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
Fast answer, no internal reasoning.
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
The model thinks step by step, then answers.
Sampling
From the official Qwen authors. Reserve 128K+ context for thinking mode.
| Mode | temp | top_p | top_k | repeat_penalty |
|---|---|---|---|---|
| Thinking (coding) | 0.6 | 0.95 | 20 | 1.0 |
| Thinking (general) | 1.0 | 0.95 | 20 | 1.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.0 |
Compatibility
| Runtime | Status | Why |
|---|---|---|
| llama.cpp (b9180+ / Homebrew) | Works fully | MTP support merged in b9180 (2026-05-16). brew install llama.cpp |
| llama.cpp (pre-b9180) | Does not load | missing tensor — MTP heads not recognized |
| LM Studio | Does not load | Bundled llama.cpp may not yet include b9180+ |
| Ollama | Does not load | No speculative decoding support yet |
| koboldcpp | Unknown | Depends on bundled llama.cpp version |
LM Studio users: use the MLX 8-bit or MLX 4-bit instead — full vision + tools + thinking, no MTP.
Chat template fixes
The bundled Jinja template fixes several bugs in the official Qwen 3.6 template:
- Tool calls crash on C++ engines. The official template uses Python's
|itemsfilter and|safe, which don't exist in C++ Jinja runtimes (llama.cpp, LM Studio). This template uses direct dictionary key lookups. - The
developerrole crashes. Modern APIs sendmessage.role == "developer". The official template throws an exception. This template maps it tosystem. - Empty
preserve_thinkingspam. The official template wraps every past turn in empty<think/>blocks, wasting context tokens. This template only emits thinking blocks with actual content. </thinking>hallucination handling. The model sometimes generates</thinking>instead of the expected closing tag. Both are handled gracefully.
See Qwen-Fixed-Chat-Templates for the standalone template repo.
Note: The fixed template works in llama.cpp but may cause errors in some frameworks (oh-my-pi, Codex, etc.) — typically
Jinja Exception: System message must be at the beginning.If you hit this, use the default (unfixed) template instead. The fixes for tool calls and thedeveloperrole are only needed in llama.cpp's C++ Jinja runtime; other frameworks may already handle them natively or reject the modified template.
Architecture details
| Spec | Value |
|---|---|
| Total params | 27.8B (dense, all active) |
| Layers | 65 (3x linear attention + 1x full attention, 16 repetitions) + 1 MTP layer |
| Attention | 24 Q heads, 4 KV heads (GQA), head_dim 256 |
| Linear attention | 16 QK heads, 48 V heads, head_dim 128 |
| FFN | intermediate_size 17408 |
| Context | 262K native, 1M+ with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens |
| Multi-token prediction | 1 MTP draft layer (15 tensors) |
| model_type | qwen3_5 |
Conversion details
Converted from official Qwen3.6-27B safetensors using mainline convert_hf_to_gguf.py from llama.cpp (b9180+, Homebrew v9240). MTP tensors are included by default — no custom build needed. The fixed chat template (v19) from Qwen-Fixed-Chat-Templates was embedded in tokenizer_config.json before conversion.
Quantization source: F16 (not Q8_0) — all tiers are quantized directly from the F16 conversion for maximum precision, avoiding double-quantization artifacts. Standard K-quant tiers (Q8_0, Q6_K, Q5_K_M, Q4_K_M) use no importance matrix. I-quant tiers (IQ4_XS, IQ3_M, IQ2_M) use unsloth's importance matrix (calibrated with chat template at 6K–12K context, 76 chunks, 496 entries). IQ2_M keeps MTP tensors at Q4_K since the importance matrix doesn't cover MTP layer tensors.
# Prerequisites
brew install llama.cpp
git clone --depth 1 --filter=blob:none --sparse https://github.com/ggml-org/llama.cpp.git llama.cpp-source
cd llama.cpp-source && git sparse-checkout set convert_hf_to_gguf.py conversion gguf-py
python3 -m venv .venv && .venv/bin/pip install torch numpy tqdm transformers sentencepiece pyyaml requests
.venv/bin/pip install -e gguf-py
# Embed fixed chat template (v19) into source tokenizer_config.json
python3 -c "
import json
with open('Qwen/Qwen3.6-27B/tokenizer_config.json') as f: d = json.load(f)
with open('Qwen-Fixed-Chat-Templates/chat_template_oneline.txt') as f: t = f.read().strip()
d['chat_template'] = t
with open('Qwen/Qwen3.6-27B/tokenizer_config.json', 'w') as f: json.dump(d, f, indent=2, ensure_ascii=False)
"
# Convert to F16 (text + MTP, ~51 GB, ~30-40 min)
.venv/bin/python convert_hf_to_gguf.py Qwen/Qwen3.6-27B/ \
--outtype f16 --outfile Qwen3.6-27B-F16-mtp.gguf --verbose
# Extract vision encoder
.venv/bin/python convert_hf_to_gguf.py Qwen/Qwen3.6-27B/ \
--outtype f16 --mmproj --outfile mmproj-Qwen3.6-27B-f16.gguf --verbose
# Quantize K-quant tiers from F16 (no imatrix)
F16=Qwen3.6-27B-F16-mtp.gguf
llama-quantize $F16 Qwen3.6-27B-Q8_0-mtp.gguf Q8_0
llama-quantize $F16 Qwen3.6-27B-Q6_K-mtp.gguf Q6_K
llama-quantize $F16 Qwen3.6-27B-Q5_K_M-mtp.gguf Q5_K_M
llama-quantize $F16 Qwen3.6-27B-Q4_K_M-mtp.gguf Q4_K_M
# Quantize I-quant tiers from F16 with unsloth imatrix
IMATRIX=imatrix_unsloth.gguf_file
llama-quantize --imatrix $IMATRIX $F16 Qwen3.6-27B-IQ4_XS-mtp.gguf IQ4_XS
llama-quantize --imatrix $IMATRIX $F16 Qwen3.6-27B-IQ3_M-mtp.gguf IQ3_M
# IQ2_M: MTP tensors at Q4_K (imatrix doesn't cover them)
llama-quantize --imatrix $IMATRIX --tensor-type blk.64.=q4_K $F16 Qwen3.6-27B-IQ2_M-mtp.gguf IQ2_M
Links
- Original model
- MLX 8-bit (LM Studio, Apple Silicon native, no MTP)
- MLX 4-bit
- Fixed chat templates
- Qwen3.6 blog post
Authorship
| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| GGUF conversion + MTP + vision + fixed chat template + quantization | froggeric |
| Importance matrix | unsloth |
License
Apache-2.0, inherited from Qwen3.6.
- Downloads last month
- 75,861
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
Model tree for froggeric/Qwen3.6-27B-MTP-GGUF
Base model
Qwen/Qwen3.6-27B