Benchmarks (Q8_0-Q4_0, RTX 5090, EPYC)

#5
by sousekd - opened

Here are benchmarks of Q8_0-Q4_0 running a single RTX 5090 and EPYC 9355:

128K context @ f16

```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 131072 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 256
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 256 | 0 | 18.455 | 443.90 | 13.719 | 18.66 |
| 8192 | 256 | 8192 | 20.105 | 407.47 | 14.089 | 18.17 |
| 8192 | 256 | 16384 | 21.961 | 373.03 | 14.435 | 17.73 |
| 8192 | 256 | 24576 | 23.750 | 344.92 | 15.003 | 17.06 |
| ... | ... | ... | ... | ... | ... | ... |
| 8192 | 256 | 98304 | 42.663 | 192.01 | 16.907 | 15.14 |
| 8192 | 256 | 106496 | 44.635 | 183.53 | 17.095 | 14.97 |
| 8192 | 256 | 114688 | 46.665 | 175.55 | 17.559 | 14.58 |
| 8192 | 256 | 122880 | 48.700 | 168.21 | 17.604 | 14.54 |

192K context @ q8_0

```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 -c 196608 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 128
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4096 | 128 | 0 | 15.120 | 270.90 | 6.918 | 18.50 |
| 4096 | 128 | 4096 | 15.601 | 262.54 | 7.105 | 18.02 |
| 4096 | 128 | 8192 | 16.057 | 255.09 | 7.150 | 17.90 |
| 4096 | 128 | 12288 | 16.468 | 248.72 | 7.353 | 17.41 |
| ... | ... | ... | ... | ... | ... | ... |

Thank you for the quants, @ubergarm.

After doing more perplexity measurements, I'm not sure q4_0 is the best choice despite fairly closely matching the original QAT target format... Needs more research...
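For anyone wanting to reproduce the comparison, a perplexity run in ik_llama.cpp / llama.cpp looks roughly like this. This is a hedged sketch, not the exact invocation behind the numbers discussed here: the wiki.test.raw path and the context size are assumptions.

```shell
# Sketch: measure perplexity over wikitext-2 (file path and -c value are
# assumptions, not the settings used for the measurements in this thread).
./llama-perplexity \
    --model "$MODEL_PATH" \
    -f wiki.test.raw \
    -c 512 \
    -ngl 999 -ot exps=CPU \
    --threads 16
```

Lower final PPL is better; the comparison is only meaningful when the q8_0 and q8_0-q4_0 quants are run over the identical text file with identical context settings.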

Any findings? 😀

> Any findings? 😀

@sousekd

Thanks for the benchmarks! I'm back at my desk today and gonna do some more perplexity measures, but you can see in the updated graph that the full-q8_0 quant is scoring "better" than the q8_0-q4_0, which I wouldn't expect if the QAT were completely effective.

Some folks on the AI Beavers Discord want to measure the original with vLLM and try to get an apples-to-apples comparison with a GGUF, but that is tricky to do, especially at these large sizes.

The main thing I've seen is to run this with an updated chat template (given the original was patched a couple of times since release), as well as adding --special to get it to output <think> tags - but then you have to deal with the stop token on the client side...
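As a sketch of what that looks like in practice (flag names are from llama.cpp; the template filename and context size are assumptions, not a verified setup):

```shell
# Sketch: serve with a patched chat template and render special tokens so
# <think> tags show up in the output (template filename is hypothetical).
./llama-server \
    --model "$MODEL_PATH" \
    --chat-template-file kimi-k2-thinking-fixed.jinja \
    --special \
    -c 32768 -ngl 999 -ot exps=CPU
```

With --special the client also receives the raw stop/end special tokens in the stream, which is exactly the "deal with the stop token on the client side" part.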

> I'm back at my desk today and gonna do some more perplexity measures, but you can see in the updated graph that the full q8_0 everything is scoring "better" than the q8_0-q4_0 which I wouldn't expect if the QAT was completely effective maybe?

Interesting. I know nothing about the quantization process, but I can imagine that converting between floating-point formats can lead to some data loss, so a direct conversion would be preferable...
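As a toy illustration of that loss, here is a Q4_0-inspired blockwise scheme: one shared floating-point scale per block, each weight rounded to a small integer. The sample weights are made up; real Q4_0 blocks are 32 weights with a [-8, 7] range, so this is only a sketch of the idea.

```shell
# Toy blockwise 4-bit quantization (Q4_0-inspired: one shared scale per
# block, weights rounded to integers in [-7, 7]); sample weights are made up.
echo "0.91 -0.43 0.27 -0.88 0.15 0.62 -0.05 0.33" | awk '{
  max = 0
  for (i = 1; i <= NF; i++) { a = ($i < 0 ? -$i : $i); if (a > max) max = a }
  s = max / 7                                  # shared scale for the block
  worst = 0
  for (i = 1; i <= NF; i++) {
    q = int($i / s + ($i >= 0 ? 0.5 : -0.5))   # round to nearest integer
    d = $i - q * s; if (d < 0) d = -d          # reconstruction error
    if (d > worst) worst = d
  }
  printf "scale=%.4f max_abs_err=%.4f\n", s, worst
}'
```

Every weight in the block shares one scale, so some round-off error is unavoidable; the point of QAT is to train the model to tolerate exactly this kind of error, which is why re-quantizing through a different format can still add loss on top.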

> The main thing I've seen is to run this with an updated chat template [...]

...and the tool calling does not work yet - even on mainline llama.cpp.

> ...and the tool calling does not work yet - even on mainline llama.cpp.

Ohh really?? It seems like tool calling and MCP stuff can be so sensitive to exact implementation details and chat templates - which is why Kimi released an entire K2-Vendor-Verifier GitHub project to score setups on how well tool calling actually works...

> Ohh really??

Yes: https://github.com/ggml-org/llama.cpp/issues/17155

I wanted to post a feature request on ik_llama, but it probably makes sense to wait for the llama.cpp implementation first.
There are still more models with broken tool calling on ik_llama. I hope to find a bit of time soon to test them all against mainline and report back...
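A quick way to smoke-test tool calling against a running llama-server / ik_llama server is a raw OpenAI-style request. This is a hedged sketch: the port is the llama-server default, and the get_weather tool is a made-up example, not part of any real setup mentioned here.

```shell
# Sketch: probe /v1/chat/completions with a tools array and check whether
# the reply contains a structured tool_calls field (get_weather is made up).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Prague?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | grep -o '"tool_calls"'
```

If the model answers in plain text instead of emitting "tool_calls", the template/parser side is broken for that model, which matches the linked llama.cpp issue.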

It is interesting how tool calling does not seem to be that important to many.
Even vLLM still has serious issues with gpt-oss tool calling - and it is not a new model by any means.
I guess most people use local models just for RP 😀. And for coding, where the clients have their own implementations.

Just to compare with GLM 4.7 IQ5_K, here is a run of Kimi-K2 Thinking Q4_X with the maximum supported context on an RTX PRO 6000. It only fills 50% of VRAM (47 GB), while GLM 4.7 IQ5_K takes it all. This is without --grouped-expert-routing --merge-qkv, as I'm not quite sure what those do:

Single RTX 6000, 256K context (max supported)

```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -amb 512 \
    -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 262144 \
    -ngl 999 -ncmoe 999 \
    --threads 24 \
    --threads-batch 32 \
    --warmup-batch \
    -n 256
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 256 | 0 | 17.652 | 464.10 | 11.589 | 22.09 |
| 8192 | 256 | 8192 | 18.808 | 435.55 | 12.222 | 20.95 |
| 8192 | 256 | 16384 | 19.770 | 414.36 | 12.432 | 20.59 |
| 8192 | 256 | 24576 | 20.992 | 390.25 | 12.674 | 20.20 |
| 8192 | 256 | 32768 | 22.070 | 371.19 | 13.044 | 19.63 |
| 8192 | 256 | 40960 | 23.333 | 351.09 | 13.287 | 19.27 |
| 8192 | 256 | 49152 | 24.608 | 332.90 | 13.427 | 19.07 |
| 8192 | 256 | 57344 | 25.767 | 317.93 | 13.685 | 18.71 |
| 8192 | 256 | 65536 | 27.495 | 297.94 | 13.927 | 18.38 |
| 8192 | 256 | 73728 | 28.796 | 284.49 | 14.245 | 17.97 |
| 8192 | 256 | 81920 | 29.940 | 273.62 | 14.515 | 17.64 |
| 8192 | 256 | 90112 | 31.135 | 263.12 | 14.767 | 17.34 |
| 8192 | 256 | 98304 | 32.714 | 250.41 | 14.923 | 17.15 |
| 8192 | 256 | 106496 | 34.027 | 240.75 | 15.303 | 16.73 |
| 8192 | 256 | 114688 | 35.290 | 232.13 | 15.435 | 16.59 |
| 8192 | 256 | 122880 | 36.775 | 222.76 | 15.891 | 16.11 |
| 8192 | 256 | 131072 | 40.133 | 204.12 | 16.004 | 16.00 |
| 8192 | 256 | 139264 | 41.578 | 197.03 | 16.846 | 15.20 |
| 8192 | 256 | 147456 | 43.047 | 190.30 | 17.153 | 14.92 |
| 8192 | 256 | 155648 | 44.519 | 184.01 | 17.322 | 14.78 |
| 8192 | 256 | 163840 | 46.067 | 177.83 | 16.904 | 15.14 |
| 8192 | 256 | 172032 | 47.584 | 172.16 | 17.842 | 14.35 |
| 8192 | 256 | 180224 | 49.081 | 166.91 | 18.053 | 14.18 |
| 8192 | 256 | 188416 | 50.508 | 162.19 | 18.324 | 13.97 |
| 8192 | 256 | 196608 | 52.107 | 157.22 | 18.577 | 13.78 |
| 8192 | 256 | 204800 | 53.662 | 152.66 | 18.910 | 13.54 |
| 8192 | 256 | 212992 | 55.044 | 148.83 | 19.082 | 13.42 |
| 8192 | 256 | 221184 | 56.519 | 144.94 | 19.332 | 13.24 |
| 8192 | 256 | 229376 | 58.427 | 140.21 | 19.614 | 13.05 |
| 8192 | 256 | 237568 | 58.924 | 139.03 | 19.879 | 12.88 |
| 8192 | 256 | 245760 | 59.585 | 137.48 | 20.551 | 12.46 |

GPU power capped at 450 W. CPU set to a “balanced”-style power profile in BIOS, also limited a bit.
Running in a Proxmox VM with only 32 physical CPU cores assigned, leaving the SMT threads for the host.
Full specs here for anyone interested.
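For reference, a power cap like the 450 W one above can be set with standard nvidia-smi options (the 450 value comes from this post, not from any default):

```shell
# Cap GPU board power at 450 W (requires root; resets on reboot).
sudo nvidia-smi -pm 1     # enable persistence mode first
sudo nvidia-smi -pl 450   # set the power limit in watts
nvidia-smi -q -d POWER    # verify the applied limit
```

The allowed range is bounded by the board's min/max power limits, which the POWER query also reports.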
