Instructions to use froggeric/Qwen3.6-27B-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use froggeric/Qwen3.6-27B-MTP-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="froggeric/Qwen3.6-27B-MTP-GGUF",
	filename="Qwen3.6-27B-F16-mtp.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use froggeric/Qwen3.6-27B-MTP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Use Docker

docker model run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

LM Studio
Jan

vLLM

How to use froggeric/Qwen3.6-27B-MTP-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "froggeric/Qwen3.6-27B-MTP-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "froggeric/Qwen3.6-27B-MTP-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Ollama
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Ollama:
```
ollama run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
```

Unsloth Studio new

How to use froggeric/Qwen3.6-27B-MTP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for froggeric/Qwen3.6-27B-MTP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for froggeric/Qwen3.6-27B-MTP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for froggeric/Qwen3.6-27B-MTP-GGUF to start chatting

Pi new

How to use froggeric/Qwen3.6-27B-MTP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use froggeric/Qwen3.6-27B-MTP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use froggeric/Qwen3.6-27B-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M
```

Lemonade

How to use froggeric/Qwen3.6-27B-MTP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull froggeric/Qwen3.6-27B-MTP-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-27B-MTP-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-27B with MTP

2.5x faster with MTP · 262K context on 48 GB · Fixed chat template

Dense 27B model with vision, thinking, and tool use — self-speculative decoding,
configurable KV cache (f16 for quality, q8_0/q4_0 for longer context), fixed Jinja template (tool calls and thinking actually work in C++ runtimes),
and a server with both OpenAI and Anthropic APIs.

One command. Both APIs. No cloud.

Note: Vision works with MTP speculative decoding on llama.cpp b9240+. Older builds (PR #22673) crashed when combining vision + MTP — this is fixed in mainline.

Start the server

You need llama.cpp b9180 or newer (released 2026-05-16, includes MTP support). Install via Homebrew:

brew install llama.cpp

llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081

That's it. Three optimizations in one command:

Flag	What it does	Impact
`--spec-type draft-mtp --spec-draft-n-max 3`	Multi-Token Prediction (built into the model)	2.5x faster generation
`--cache-type-k q8_0 --cache-type-v q8_0`	8-bit KV cache (instead of 16-bit)	Half the KV memory, negligible quality loss
`-c 262144`	262K context window	Full native context on 48 GB Mac with q8_0 KV

Tip for Apple Silicon users at long context (>64K): If you experience slow prefill or timeouts, try disabling Flash Attention with -fa off. On hybrid attention+SSM models like Qwen3.6, this can improve prefill speed 37-53% and unlock longer contexts that would otherwise time out.

Adjust -m, -c, and --cache-type-k/v for your hardware — see the Which quant should I download? table below.

Which quant should I download?

Find your hardware below — each row gives the best quant, KV cache type, and max context that fits.

Apple Silicon

Qwen3.6-27B is a hybrid model — only 16 of 65 layers use KV cache (verified). The other 48 are linear attention (fixed 898 MiB recurrent state). KV memory is ~4× less than a standard dense model. Runtimes that don't handle this (e.g. vllm) allocate KV for all 65 layers and show much higher memory usage.

Numbers below are total memory used (model + KV cache + 0.9 GB recurrent state). Must leave ≥ 8 GB for macOS (16 GB Macs excepted).

RAM	Quant	KV cache	Max context	Total used	Vision
16 GB	`IQ2_M`	`q8_0`	42K	12.0 GB	✗
24 GB	`IQ3_M`		46K	16.0 GB	✗
24 GB	`IQ3_M`	`q8_0`	91K	16.0 GB	✗
32 GB	`Q5_K_M`		74K	24.0 GB	✗
32 GB	`Q5_K_M`	`q8_0`	147K	24.0 GB	✗
32 GB	`Q4_K_M`		99K	24.0 GB	✓
48 GB	`Q6_K`		262K	39.7 GB	✓
48 GB	`Q8_0`		173K	40.0 GB	✓
48 GB	`Q8_0`	`q8_0`	262K	37.3 GB	✓
64 GB	`Q8_0`		262K	45.8 GB	✓
96 GB	`Q8_0`		262K	45.8 GB	✓

NVIDIA GPU

Same model memory as Apple Silicon, plus ~1 GB CUDA overhead.

VRAM	Quant	KV cache	Max context	Total VRAM used	Vision
12 GB	`IQ2_M`	`q8_0`	11K	12.0 GB	✗
16 GB	`IQ3_M`		30K	16.0 GB	✗
16 GB	`IQ3_M`	`q8_0`	60K	16.0 GB	✗
24 GB	`Q4_K_M`		83K	24.0 GB	✓
24 GB	`Q4_K_M`	`q8_0`	167K	24.0 GB	✓
24 GB	`Q5_K_M`		58K	24.0 GB	✗
48 GB	`Q6_K`		262K	40.7 GB	✓
48 GB	`Q8_0`		262K	46.8 GB	✓
80 GB	`Q8_0`		262K	46.8 GB	✓

16 GB Mac: IQ2_M/q8_0 — 42K text-only. No vision.

24 GB Mac: IQ3_M — 46K (f16 KV) or 91K (q8_0). Vision at 32–65K.

32 GB Mac: Q5_K_M — 74K text-only (f16 KV), 147K (q8_0). Q4_K_M for vision at 99K.

48 GB Mac: Q6_K/f16 KV — 262K with vision. Q8_0/q8_0 KV for 262K at higher model quality.

64 GB+ Mac: Q8_0/f16 KV — 262K with vision. Maximum quality at practical speed.

12 GB GPU: IQ2_M/q8_0 — 11K. Very limited, no vision.

16 GB GPU: IQ3_M — 30K (f16 KV) or 60K (q8_0). No vision.

24 GB GPU: Q4_K_M — 83K with vision (f16 KV). Q5_K_M — 58K text-only (f16 KV), 116K (q8_0).

48 GB+ GPU: Q6_K/f16 KV — 262K with vision. Q8_0 for max quality.

Leave KV cache at f16 (blank column) for best quality. Use q8_0 KV only when f16 doesn't give enough context. q4_0 KV should not exceed 64K context.

Vision adds ~0.9 GB for mmproj. macOS needs ≥ 8 GB for itself (16 GB Macs excepted — use ~4 GB). You can increase available memory by raising the wired memory limit, e.g. for a 96 GB Mac: sudo sysctl iogpu.wired_limit_mb=90112 (88 GB). NVIDIA reserves ~1 GB for CUDA.

API usage

OpenAI-compatible (`/v1/chat/completions`)

curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}]}'

Works with any OpenAI client — just point it at http://localhost:8081/v1.

Anthropic-compatible (`/v1/messages`)

curl http://localhost:8081/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'

Works with any Anthropic client — the server natively speaks the Messages API with streaming, tool use, and vision.

Claude Code

ANTHROPIC_BASE_URL=http://127.0.0.1:8081 claude

Claude Code uses the Anthropic Messages API. With this one env var, it talks to your local Qwen3.6-27B instead of the cloud.

Tool use (both APIs)

curl http://localhost:8081/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'

Vision

Vision works with MTP enabled — just add --mmproj to the server command:

llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081

curl http://localhost:8081/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": [
      {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "'$(base64 < photo.jpg)'"}},
      {"type": "text", "text": "Describe this image"}
    ]}]
  }'

Direct CLI usage

# Text generation
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 -c 4096 -n 2048 --temp 0.7 -ngl 99 \
  -p "Your prompt here"

# Vision
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 -c 4096 -n 2048 --temp 0.7 -ngl 99 \
  --image photo.jpg \
  -p "Describe this image"

KV cache options

The --cache-type-k and --cache-type-v flags control KV cache precision. Lower precision = less memory = longer context on the same hardware.

Type	Bits/val	KV size (80K ctx)	Quality	Speed	When to use
`f16`	16	5.3 GB	Full	Baseline	Best quality — use when RAM allows
`q8_0`	8	2.8 GB	Negligible loss	Faster than f16	When f16 KV doesn't give enough context
`q4_0`	4	1.5 GB	Minor loss	Slightly slower	Max context on limited RAM (≤64K only)

Recommendation: Leave KV at f16 for best quality. Use q8_0 when f16 doesn't give enough context. Reserve q4_0 for tight RAM — and only up to 64K context.

Effect on hardware requirements (Q5_K_M, 80K context):

KV type	Model + recurrent + KV	Hardware
f16	24 GB	48 GB Mac
q8_0	22 GB	32 GB Mac

Speculative decoding modes

MTP (recommended — 2.5x faster)

The model predicts 5 extra tokens per step using its own MTP heads, then verifies them in one pass. No extra model needed.

--spec-type draft-mtp --spec-draft-n-max 3 -np 1

MTP currently requires -np 1 (single-sequence mode).

Tune --spec-draft-n-max: 3 is optimal for general use (83% acceptance rate). Values of 1–2 are more conservative; 4–5 waste compute on rejected tokens.

Draft model (~2.3x faster)

Pair with a smaller Qwen 3.5/3.6 model that shares the same tokenizer.

llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  -md Qwen3.5-0.8B-Q8_0.gguf \
  --spec-draft-n-max 10 -ngl 99 -ngld 99 \
  -c 4096 -n 2048 --temp 0.7 \
  -p "Your prompt"

ngram-mod (no extra model, benefits repeat prompts)

Uses cached n-grams from previous prompts.

--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64 \
--repeat-penalty 1.0

Downloads

File	Size	Min. (4K ctx)	Recommended (80K ctx)	Max (262K ctx)
`Qwen3.6-27B-F16-mtp.gguf`	51 GB	64 GB Mac · 80 GB GPU	64 GB Mac · 80 GB GPU	96 GB Mac · 80 GB GPU
`Qwen3.6-27B-Q8_0-mtp.gguf`	27 GB	48 GB Mac · 48 GB GPU	48 GB Mac · 48 GB GPU	48 GB Mac · 48 GB GPU
`Qwen3.6-27B-Q6_K-mtp.gguf`	21 GB	32 GB Mac · 24 GB GPU	48 GB Mac · 48 GB GPU	48 GB Mac · 48 GB GPU
`Qwen3.6-27B-Q5_K_M-mtp.gguf`	18 GB	32 GB Mac · 24 GB GPU	32 GB Mac · 24 GB GPU	48 GB Mac · 48 GB GPU
`Qwen3.6-27B-Q4_K_M-mtp.gguf`	16 GB	32 GB Mac · 24 GB GPU	32 GB Mac · 24 GB GPU	48 GB Mac · 48 GB GPU
`Qwen3.6-27B-IQ4_XS-mtp.gguf`	14 GB	24 GB Mac · 24 GB GPU	32 GB Mac · 24 GB GPU	32 GB Mac · 48 GB GPU
`Qwen3.6-27B-IQ3_M-mtp.gguf`	12 GB	24 GB Mac · 16 GB GPU	24 GB Mac · 24 GB GPU	32 GB Mac · 24 GB GPU
`Qwen3.6-27B-IQ2_M-mtp.gguf`	9.5 GB	16 GB Mac · 16 GB GPU	24 GB Mac · 16 GB GPU	32 GB Mac · 24 GB GPU
`mmproj-Qwen3.6-27B-f16.gguf`	885 MB	Vision encoder (optional, any tier)	—	—

All tiers include MTP heads and were quantized directly from the F16 conversion for maximum precision. I-quant tiers (IQ4_XS, IQ3_M, IQ2_M) use unsloth's importance matrix. Q5_K_M is the sweet spot — use Q4_K_M if you're tight on RAM, Q8_0 for high quality, or F16 for long agentic coding sessions where quantization artifacts compound noticeably. GPU means NVIDIA (RTX 3060 = 12 GB, RTX 3090/4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB).

Hardware numbers assume f16 KV for "Min." (4K) and q8_0 KV for "Recommended" (80K) and "Max" (262K). Add --cache-type-k q8_0 --cache-type-v q8_0 to reach the recommended or max context on smaller hardware.

Memory requirements

Approximate VRAM on Apple Silicon (unified memory), using Q5_K_M as reference. Includes 0.9 GB recurrent state (constant, does not scale with context). Only 16 of 65 layers use KV cache — the other 48 use linear attention.

Context	Model	KV (f16)	KV (q8_0)	Total (f16)	Total (q8_0)	Min. Mac
4K	18 GB	0.3 GB	0.1 GB	19 GB	19 GB	32 GB
8K	18 GB	0.5 GB	0.3 GB	19 GB	19 GB	32 GB
32K	18 GB	2.1 GB	1.0 GB	20 GB	20 GB	32 GB
64K	18 GB	4.1 GB	2.1 GB	21 GB	21 GB	32 GB
80K (recommended)	18 GB	5.2 GB	2.6 GB	22 GB	22 GB	32 GB
128K	18 GB	8.3 GB	4.1 GB	25 GB	23 GB	32 GB
262K (max native)	18 GB	17.0 GB	8.5 GB	34 GB	27 GB	48 GB

"Total" = model + recurrent state + KV cache. macOS needs ≥ 8 GB (16 GB Macs excepted). With vision: add 0.9 GB for the mmproj.

Memory for all quant tiers (4K context, q8_0 KV)

Quant	Model	KV + recurrent	Total	Min. Mac
Q8_0	27 GB	1.0 GB	28 GB	48 GB
Q6_K	21 GB	1.0 GB	22 GB	32 GB
Q5_K_M	18 GB	1.0 GB	19 GB	32 GB
Q4_K_M	16 GB	1.0 GB	17 GB	32 GB
IQ4_XS	14 GB	1.0 GB	15 GB	24 GB
IQ3_M	12 GB	1.0 GB	13 GB	24 GB
IQ2_M	9.5 GB	1.0 GB	11 GB	16 GB

System prompt

The first line must be:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

The model underperforms without it. Append anything after that line.

Thinking toggle

Drop <|think_on|> or <|think_off|> in any message to toggle thinking. The template strips the tag so the model never sees it.

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Fast answer, no internal reasoning.

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

The model thinks step by step, then answers.

Sampling

From the official Qwen authors. Reserve 128K+ context for thinking mode.

Mode	temp	top_p	top_k	repeat_penalty
Thinking (coding)	0.6	0.95	20	1.0
Thinking (general)	1.0	0.95	20	1.0
Non-thinking (general)	0.7	0.8	20	1.0

Compatibility

Runtime	Status	Why
llama.cpp (b9180+ / Homebrew)	Works fully	MTP support merged in b9180 (2026-05-16). `brew install llama.cpp`
llama.cpp (pre-b9180)	Does not load	`missing tensor` — MTP heads not recognized
LM Studio	Does not load	Bundled llama.cpp may not yet include b9180+
Ollama	Does not load	No speculative decoding support yet
koboldcpp	Unknown	Depends on bundled llama.cpp version

LM Studio users: use the MLX 8-bit or MLX 4-bit instead — full vision + tools + thinking, no MTP.

Chat template fixes

The bundled Jinja template fixes several bugs in the official Qwen 3.6 template:

Tool calls crash on C++ engines. The official template uses Python's |items filter and |safe, which don't exist in C++ Jinja runtimes (llama.cpp, LM Studio). This template uses direct dictionary key lookups.
The developer role crashes. Modern APIs send message.role == "developer". The official template throws an exception. This template maps it to system.
Empty preserve_thinking spam. The official template wraps every past turn in empty <think/> blocks, wasting context tokens. This template only emits thinking blocks with actual content.
</thinking> hallucination handling. The model sometimes generates </thinking> instead of the expected closing tag. Both are handled gracefully.

See Qwen-Fixed-Chat-Templates for the standalone template repo.

Note: The fixed template works in llama.cpp but may cause errors in some frameworks (oh-my-pi, Codex, etc.) — typically Jinja Exception: System message must be at the beginning. If you hit this, use the default (unfixed) template instead. The fixes for tool calls and the developer role are only needed in llama.cpp's C++ Jinja runtime; other frameworks may already handle them natively or reject the modified template.

Architecture details

Spec	Value
Total params	27.8B (dense, all active)
Layers	65 (3x linear attention + 1x full attention, 16 repetitions) + 1 MTP layer
Attention	24 Q heads, 4 KV heads (GQA), head_dim 256
Linear attention	16 QK heads, 48 V heads, head_dim 128
FFN	intermediate_size 17408
Context	262K native, 1M+ with YaRN
RoPE	theta 10M, partial_rotary_factor 0.25, mrope_interleaved
Vocab	248K tokens
Multi-token prediction	1 MTP draft layer (15 tensors)
model_type	`qwen3_5`

Conversion details

Converted from official Qwen3.6-27B safetensors using mainline convert_hf_to_gguf.py from llama.cpp (b9180+, Homebrew v9240). MTP tensors are included by default — no custom build needed. The fixed chat template (v19) from Qwen-Fixed-Chat-Templates was embedded in tokenizer_config.json before conversion.

Quantization source: F16 (not Q8_0) — all tiers are quantized directly from the F16 conversion for maximum precision, avoiding double-quantization artifacts. Standard K-quant tiers (Q8_0, Q6_K, Q5_K_M, Q4_K_M) use no importance matrix. I-quant tiers (IQ4_XS, IQ3_M, IQ2_M) use unsloth's importance matrix (calibrated with chat template at 6K–12K context, 76 chunks, 496 entries). IQ2_M keeps MTP tensors at Q4_K since the importance matrix doesn't cover MTP layer tensors.

# Prerequisites
brew install llama.cpp
git clone --depth 1 --filter=blob:none --sparse https://github.com/ggml-org/llama.cpp.git llama.cpp-source
cd llama.cpp-source && git sparse-checkout set convert_hf_to_gguf.py conversion gguf-py
python3 -m venv .venv && .venv/bin/pip install torch numpy tqdm transformers sentencepiece pyyaml requests
.venv/bin/pip install -e gguf-py

# Embed fixed chat template (v19) into source tokenizer_config.json
python3 -c "
import json
with open('Qwen/Qwen3.6-27B/tokenizer_config.json') as f: d = json.load(f)
with open('Qwen-Fixed-Chat-Templates/chat_template_oneline.txt') as f: t = f.read().strip()
d['chat_template'] = t
with open('Qwen/Qwen3.6-27B/tokenizer_config.json', 'w') as f: json.dump(d, f, indent=2, ensure_ascii=False)
"

# Convert to F16 (text + MTP, ~51 GB, ~30-40 min)
.venv/bin/python convert_hf_to_gguf.py Qwen/Qwen3.6-27B/ \
  --outtype f16 --outfile Qwen3.6-27B-F16-mtp.gguf --verbose

# Extract vision encoder
.venv/bin/python convert_hf_to_gguf.py Qwen/Qwen3.6-27B/ \
  --outtype f16 --mmproj --outfile mmproj-Qwen3.6-27B-f16.gguf --verbose

# Quantize K-quant tiers from F16 (no imatrix)
F16=Qwen3.6-27B-F16-mtp.gguf
llama-quantize $F16 Qwen3.6-27B-Q8_0-mtp.gguf  Q8_0
llama-quantize $F16 Qwen3.6-27B-Q6_K-mtp.gguf  Q6_K
llama-quantize $F16 Qwen3.6-27B-Q5_K_M-mtp.gguf Q5_K_M
llama-quantize $F16 Qwen3.6-27B-Q4_K_M-mtp.gguf Q4_K_M

# Quantize I-quant tiers from F16 with unsloth imatrix
IMATRIX=imatrix_unsloth.gguf_file
llama-quantize --imatrix $IMATRIX $F16 Qwen3.6-27B-IQ4_XS-mtp.gguf IQ4_XS
llama-quantize --imatrix $IMATRIX $F16 Qwen3.6-27B-IQ3_M-mtp.gguf  IQ3_M
# IQ2_M: MTP tensors at Q4_K (imatrix doesn't cover them)
llama-quantize --imatrix $IMATRIX --tensor-type blk.64.=q4_K $F16 Qwen3.6-27B-IQ2_M-mtp.gguf IQ2_M

Authorship

Role	Author
Original model	Alibaba Cloud (Qwen team)
GGUF conversion + MTP + vision + fixed chat template + quantization	froggeric
Importance matrix	unsloth

License

Apache-2.0, inherited from Qwen3.6.

Downloads last month: 75,861

GGUF

Model size

27B params

Architecture

qwen35

Hardware compatibility

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Model tree for froggeric/Qwen3.6-27B-MTP-GGUF

Base model

Qwen/Qwen3.6-27B

Quantized

(360)

this model

froggeric
/

Qwen3.6-27B-MTP-GGUF

Qwen3.6-27B with MTP

2.5x faster with MTP · 262K context on 48 GB · Fixed chat template

Start the server

Which quant should I download?

Apple Silicon

NVIDIA GPU

API usage

OpenAI-compatible (`/v1/chat/completions`)

Anthropic-compatible (`/v1/messages`)

Claude Code

Tool use (both APIs)

Vision

Direct CLI usage

KV cache options

Speculative decoding modes

MTP (recommended — 2.5x faster)

Draft model (~2.3x faster)

ngram-mod (no extra model, benefits repeat prompts)

Downloads

Memory requirements

System prompt

Thinking toggle

Sampling

Compatibility

Chat template fixes

Links

Authorship

License

Model tree for froggeric/Qwen3.6-27B-MTP-GGUF

Qwen3.6-27B with MTP

2.5x faster with MTP · 262K context on 48 GB · Fixed chat template

Start the server

Which quant should I download?

Apple Silicon

NVIDIA GPU

API usage

OpenAI-compatible (/v1/chat/completions)

Anthropic-compatible (/v1/messages)

Claude Code

Tool use (both APIs)

Vision

Direct CLI usage

KV cache options

Speculative decoding modes

MTP (recommended — 2.5x faster)

Draft model (~2.3x faster)

ngram-mod (no extra model, benefits repeat prompts)

Downloads

Memory requirements

System prompt

Thinking toggle

Sampling

Compatibility

Chat template fixes

Links

Authorship

License

Model tree for froggeric/Qwen3.6-27B-MTP-GGUF

OpenAI-compatible (`/v1/chat/completions`)

Anthropic-compatible (`/v1/messages`)