Benchmarks (Q8_0-Q4_0, RTX 5090, EPYC)

#5
by sousekd - opened

Here are benchmarks of Q8_0-Q4_0 running a single RTX 5090 and EPYC 9355:

128K context @ f16

```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 131072 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 256
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 256 | 0 | 18.455 | 443.90 | 13.719 | 18.66 |
| 8192 | 256 | 8192 | 20.105 | 407.47 | 14.089 | 18.17 |
| 8192 | 256 | 16384 | 21.961 | 373.03 | 14.435 | 17.73 |
| 8192 | 256 | 24576 | 23.750 | 344.92 | 15.003 | 17.06 |
| ... | ... | ... | ... | ... | ... | ... |
| 8192 | 256 | 98304 | 42.663 | 192.01 | 16.907 | 15.14 |
| 8192 | 256 | 106496 | 44.635 | 183.53 | 17.095 | 14.97 |
| 8192 | 256 | 114688 | 46.665 | 175.55 | 17.559 | 14.58 |
| 8192 | 256 | 122880 | 48.700 | 168.21 | 17.604 | 14.54 |

192K context @ q8_0

```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 -c 196608 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 128
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4096 | 128 | 0 | 15.120 | 270.90 | 6.918 | 18.50 |
| 4096 | 128 | 4096 | 15.601 | 262.54 | 7.105 | 18.02 |
| 4096 | 128 | 8192 | 16.057 | 255.09 | 7.150 | 17.90 |
| 4096 | 128 | 12288 | 16.468 | 248.72 | 7.353 | 17.41 |
| ... | ... | ... | ... | ... | ... | ... |

Thank you for the quants, @ubergarm.

After doing more perplexity measurements, I'm not sure q4_0 is the best choice despite fairly closely matching the original QAT target format... Needs more research...
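For anyone wanting to reproduce the comparison, a perplexity run in ik_llama.cpp / llama.cpp looks roughly like this. This is a hedged sketch, not the exact invocation behind the numbers discussed here: the wiki.test.raw path and the context size are assumptions.

```shell
# Sketch: measure perplexity over wikitext-2 (file path and -c value are
# assumptions, not the settings used for the measurements in this thread).
./llama-perplexity \
    --model "$MODEL_PATH" \
    -f wiki.test.raw \
    -c 512 \
    -ngl 999 -ot exps=CPU \
    --threads 16
```

Lower final PPL is better; the comparison is only meaningful when the q8_0 and q8_0-q4_0 quants are run over the identical text file with identical context settings.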

Any findings? 😀

> Any findings? 😀

@sousekd

Thanks for the benchmarks! I'm back at my desk today and gonna do some more perplexity measures, but you can see in the updated graph that the full-q8_0 quant is scoring "better" than the q8_0-q4_0, which I wouldn't expect if the QAT were completely effective.

Some folks on the AI Beavers Discord want to measure the original with vLLM and try to get an apples-to-apples comparison with a GGUF, but that is tricky to do, especially at these large sizes.

The main thing I've seen is to run this with an updated chat template (given the original was patched a couple of times since release), as well as adding --special to get it to output <think> tags - but then you have to deal with the stop token on the client side...
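As a sketch of what that looks like in practice (flag names are from llama.cpp; the template filename and context size are assumptions, not a verified setup):

```shell
# Sketch: serve with a patched chat template and render special tokens so
# <think> tags show up in the output (template filename is hypothetical).
./llama-server \
    --model "$MODEL_PATH" \
    --chat-template-file kimi-k2-thinking-fixed.jinja \
    --special \
    -c 32768 -ngl 999 -ot exps=CPU
```

With --special the client also receives the raw stop/end special tokens in the stream, which is exactly the "deal with the stop token on the client side" part.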

> I'm back at my desk today and gonna do some more perplexity measures, but you can see in the updated graph that the full q8_0 everything is scoring "better" than the q8_0-q4_0 which I wouldn't expect if the QAT was completely effective maybe?

Interesting. I know nothing about the quantization process, but I can imagine that converting between floating-point formats can lead to some data loss, so a direct conversion would be preferable...
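As a toy illustration of that loss, here is a Q4_0-inspired blockwise scheme: one shared floating-point scale per block, each weight rounded to a small integer. The sample weights are made up; real Q4_0 blocks are 32 weights with a [-8, 7] range, so this is only a sketch of the idea.

```shell
# Toy blockwise 4-bit quantization (Q4_0-inspired: one shared scale per
# block, weights rounded to integers in [-7, 7]); sample weights are made up.
echo "0.91 -0.43 0.27 -0.88 0.15 0.62 -0.05 0.33" | awk '{
  max = 0
  for (i = 1; i <= NF; i++) { a = ($i < 0 ? -$i : $i); if (a > max) max = a }
  s = max / 7                                  # shared scale for the block
  worst = 0
  for (i = 1; i <= NF; i++) {
    q = int($i / s + ($i >= 0 ? 0.5 : -0.5))   # round to nearest integer
    d = $i - q * s; if (d < 0) d = -d          # reconstruction error
    if (d > worst) worst = d
  }
  printf "scale=%.4f max_abs_err=%.4f\n", s, worst
}'
```

Every weight in the block shares one scale, so some round-off error is unavoidable; the point of QAT is to train the model to tolerate exactly this kind of error, which is why re-quantizing through a different format can still add loss on top.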

> The main thing I've seen is to run this with an updated chat template [...]

...and the tool calling does not work yet - even on mainline llama.cpp.

> ...and the tool calling does not work yet - even on mainline llama.cpp.

Ohh really?? It seems like tool calling and MCP stuff can be so sensitive to exact implementation details and chat templates - which is why Kimi released an entire K2-Vendor-Verifier GitHub project to score setups on how well tool calling actually works...

> Ohh really??

Yes: https://github.com/ggml-org/llama.cpp/issues/17155

I wanted to post a feature request on ik_llama, but it probably makes sense to wait for the llama.cpp implementation first.
There are still more models with broken tool calling on ik_llama. I hope to find a bit of time soon to test them all against mainline and report back...
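A quick way to smoke-test tool calling against a running llama-server / ik_llama server is a raw OpenAI-style request. This is a hedged sketch: the port is the llama-server default, and the get_weather tool is a made-up example, not part of any real setup mentioned here.

```shell
# Sketch: probe /v1/chat/completions with a tools array and check whether
# the reply contains a structured tool_calls field (get_weather is made up).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Prague?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | grep -o '"tool_calls"'
```

If the model answers in plain text instead of emitting "tool_calls", the template/parser side is broken for that model, which matches the linked llama.cpp issue.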

It is interesting how tool calling does not seem to be that important to many.
Even vLLM still has serious issues with gpt-oss tool calling - and it is not a new model by any means.
I guess most people use local models just for RP 😀. And for coding, where the clients have their own implementations.

Just to compare with GLM 4.7 IQ5_K, here is a run of Kimi-K2 Thinking Q4_X with the maximum supported context on an RTX PRO 6000. It only fills 50% of VRAM (47 GB), while GLM 4.7 IQ5_K takes it all. This is without --grouped-expert-routing --merge-qkv, as I'm not quite sure what those do:

Single RTX 6000, 256K context (max supported)

```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -amb 512 \
    -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 262144 \
    -ngl 999 -ncmoe 999 \
    --threads 24 \
    --threads-batch 32 \
    --warmup-batch \
    -n 256
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 256 | 0 | 17.652 | 464.10 | 11.589 | 22.09 |
| 8192 | 256 | 8192 | 18.808 | 435.55 | 12.222 | 20.95 |
| 8192 | 256 | 16384 | 19.770 | 414.36 | 12.432 | 20.59 |
| 8192 | 256 | 24576 | 20.992 | 390.25 | 12.674 | 20.20 |
| 8192 | 256 | 32768 | 22.070 | 371.19 | 13.044 | 19.63 |
| 8192 | 256 | 40960 | 23.333 | 351.09 | 13.287 | 19.27 |
| 8192 | 256 | 49152 | 24.608 | 332.90 | 13.427 | 19.07 |
| 8192 | 256 | 57344 | 25.767 | 317.93 | 13.685 | 18.71 |
| 8192 | 256 | 65536 | 27.495 | 297.94 | 13.927 | 18.38 |
| 8192 | 256 | 73728 | 28.796 | 284.49 | 14.245 | 17.97 |
| 8192 | 256 | 81920 | 29.940 | 273.62 | 14.515 | 17.64 |
| 8192 | 256 | 90112 | 31.135 | 263.12 | 14.767 | 17.34 |
| 8192 | 256 | 98304 | 32.714 | 250.41 | 14.923 | 17.15 |
| 8192 | 256 | 106496 | 34.027 | 240.75 | 15.303 | 16.73 |
| 8192 | 256 | 114688 | 35.290 | 232.13 | 15.435 | 16.59 |
| 8192 | 256 | 122880 | 36.775 | 222.76 | 15.891 | 16.11 |
| 8192 | 256 | 131072 | 40.133 | 204.12 | 16.004 | 16.00 |
| 8192 | 256 | 139264 | 41.578 | 197.03 | 16.846 | 15.20 |
| 8192 | 256 | 147456 | 43.047 | 190.30 | 17.153 | 14.92 |
| 8192 | 256 | 155648 | 44.519 | 184.01 | 17.322 | 14.78 |
| 8192 | 256 | 163840 | 46.067 | 177.83 | 16.904 | 15.14 |
| 8192 | 256 | 172032 | 47.584 | 172.16 | 17.842 | 14.35 |
| 8192 | 256 | 180224 | 49.081 | 166.91 | 18.053 | 14.18 |
| 8192 | 256 | 188416 | 50.508 | 162.19 | 18.324 | 13.97 |
| 8192 | 256 | 196608 | 52.107 | 157.22 | 18.577 | 13.78 |
| 8192 | 256 | 204800 | 53.662 | 152.66 | 18.910 | 13.54 |
| 8192 | 256 | 212992 | 55.044 | 148.83 | 19.082 | 13.42 |
| 8192 | 256 | 221184 | 56.519 | 144.94 | 19.332 | 13.24 |
| 8192 | 256 | 229376 | 58.427 | 140.21 | 19.614 | 13.05 |
| 8192 | 256 | 237568 | 58.924 | 139.03 | 19.879 | 12.88 |
| 8192 | 256 | 245760 | 59.585 | 137.48 | 20.551 | 12.46 |

GPU power capped at 450 W. CPU set to a “balanced”-style power profile in BIOS, also limited a bit.
Running in a Proxmox VM with only 32 physical CPU cores assigned, leaving the SMT threads for the host.
Full specs here for anyone interested.
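For reference, a power cap like the 450 W one above can be set with standard nvidia-smi options (the 450 value comes from this post, not from any default):

```shell
# Cap GPU board power at 450 W (requires root; resets on reboot).
sudo nvidia-smi -pm 1     # enable persistence mode first
sudo nvidia-smi -pl 450   # set the power limit in watts
nvidia-smi -q -d POWER    # verify the applied limit
```

The allowed range is bounded by the board's min/max power limits, which the POWER query also reports.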
