Instructions to use Qwen/Qwen3.6-27B-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.6-27B-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.6-27B-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.6-27B-FP8")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.6-27B-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Qwen/Qwen3.6-27B-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.6-27B-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-27B-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Qwen/Qwen3.6-27B-FP8

SGLang

How to use Qwen/Qwen3.6-27B-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.6-27B-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-27B-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.6-27B-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-27B-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3.6-27B-FP8 with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.6-27B-FP8
```

Garbage output

by andylele - opened about 1 month ago

Discussion

andylele

about 1 month ago

Something is wrong with this FP8 checkpoint. I tried both vLLM and SGLang, and both output gibberish words and random characters.

My CLI:

SGLang (latest Docker tag, pulled today):
SGLANG_ENABLE_SPEC_V2=1 sglang serve --model-path Qwen/Qwen3.6-27B-FP8 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mamba-scheduler-strategy extra_buffer \
    --mem-fraction-static 0.8

vLLM (version 0.19.1):
vllm serve Qwen/Qwen3.6-27B-FP8 --host 0.0.0.0 --port 8000 --served-model-name Qwen3.6-27B-FP8 --uvicorn-log-level error --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --max-model-len 131072 --max-num-seqs 1 --max-num-batched-tokens 4096 --kv-cache-dtype fp8 --disable-log-stats --enable-sleep-mode

I'm using an H100. Does anyone have the same problem?

HenkTenk

about 1 month ago

No problems here on vLLM v0.19.1.
On SGLang, try running without the draft (speculative decoding) options.
On vLLM, try running without KV-cache quantization; I think you have plenty of VRAM, so there is no need to quantize the KV cache.
If you do need more space, try
--limit-mm-per-prompt '{"image": 16, "video": 0}'
for vLLM; this disables video and limits images to 16 per prompt.
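
The JSON value of `--limit-mm-per-prompt` has to reach vLLM as a single argument, which is why it is wrapped in single quotes; unquoted, the shell splits it at the spaces and strips the double quotes. A minimal sketch of producing a shell-safe form of the flag with the Python standard library follows. The helper name `limit_mm_flag` is hypothetical and is not part of vLLM.

```python
import json
import shlex


def limit_mm_flag(image: int, video: int) -> str:
    """Return a shell-safe `--limit-mm-per-prompt` argument (hypothetical helper)."""
    # Serialize the limits as JSON, then quote the result so the shell
    # passes it through as one word.
    value = json.dumps({"image": image, "video": video})
    return "--limit-mm-per-prompt " + shlex.quote(value)


print(limit_mm_flag(16, 0))
# prints: --limit-mm-per-prompt '{"image": 16, "video": 0}'
```

The printed string can be pasted after `vllm serve ...` as-is.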

sczhengyabin

30 days ago

Same problem here with an RTX PRO 6000; vLLM is fine.

andylele

30 days ago

I tried again this morning with the FP16 version, and it still has the same problem. For example:

Prompt: "hi"
Output:
i, am a but the boy of app is age running
17
, iHere's have a a thinking very serious process: concern
, i.
am请 a teacher检查 of应用名称 chemistry是否正确 and i， want以及是否 to已 make a启动 web该应用 app。 that请, number i had0 my of0 wisdom times tooth removed2 about in2 my3-24 program2 days
ago and``` icsharp, I am working on a program that I am going to use at my work

My exact CLI:

Apr 24 at 09:22:43.721
vllm serve Qwen/Qwen3.6-27B --host 0.0.0.0 --port 8000 --served-model-name Qwen3.6-27B --uvicorn-log-level error --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --max-model-len 64000 --max-num-seqs 2 --max-num-batched-tokens 4096 --limit-mm-per-prompt {"image": 8, "video": 0} --kv-cache-dtype auto --disable-log-stats --enable-sleep-mode
Apr 24 at 09:23:03.779
(APIServer pid=22) INFO 04-24 16:23:03 [utils.py:299]
(APIServer pid=22) INFO 04-24 16:23:03 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=22) INFO 04-24 16:23:03 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.1
(APIServer pid=22) INFO 04-24 16:23:03 [utils.py:299] █▄█▀ █ █ █ █ model Qwen/Qwen3.6-27B
(APIServer pid=22) INFO 04-24 16:23:03 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=22) INFO 04-24 16:23:03 [utils.py:299]

Can you share your vLLM arguments?

alien087

28 days ago (edited 28 days ago)

Try these parameters:
vllm serve "Qwen/Qwen3.6-27B-FP8" \
    --host 0.0.0.0 \
    --port 8002 \
    --enable-chunked-prefill \
    --max-model-len 80000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

I also deployed this model with vLLM (v0.18.0) using that command and had no issues (on 2x L40s).

andylele

28 days ago (edited 28 days ago)

Thank you all, I know what the problem is now.
My LiteLLM setup has several different OpenAI-compatible API providers configured, and I happened to pick one that returns garbage output. I'm not sure why, but it works once I pick a different OpenAI-compatible provider.
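
One way to narrow down a problem like this is to send the same prompt straight to the inference server, bypassing any gateway in front of it. If the direct reply is clean, the garbage is being introduced further up the stack. A minimal sketch using only the Python standard library follows. It assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route shown in the curl examples above; the helper names `build_chat_request` and `ask`, the `max_tokens` value, and the example address are illustrative assumptions.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,  # assumed small cap, just enough for a sanity check
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> dict:
    """Send the prompt directly to the server and return the reply message."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The whole message is returned because, with a reasoning parser enabled,
    # part of the reply may arrive in a field other than "content".
    return body["choices"][0]["message"]


# Example call (needs a running server; address and model name are assumptions):
# print(ask("http://localhost:8000", "Qwen3.6-27B-FP8", "hi"))
```

Running the same call against the gateway's address and comparing the two replies shows which layer produces the garbage.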

g-a-b-y

25 days ago

@andylele Close this

andylele changed discussion status to closed 25 days ago
