Good stuff!

#1
by cccdaf - opened

You absolutely cooked with these! One of them has replaced my daily driver for important tasks.

Used: GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ4_XS-Q5_0-v2

It scored much higher on my mini private benchmark and my general vibe test, and on perplexity just to throw that into the mix, all while retaining solid speed.

It crushed Unsloth's UD Q3_K_XL. The Q4_K_XL did better perplexity-wise, but it wasn't usable in terms of speed on my hardware, topping out at 5 tok/s TG. I'll have to give your v1 a whirl; there's no quant in the v2 batch of similar size to give it a fair shake.

Bartowski’s IQ4_XS was a very close second, though it is 7.3 GB smaller, so it might be a tie. In all fairness, his models are always superb, at least to me.

  • I tried your two smaller variants, but GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ4_XS-IQ4_NL-v2 ran slow for some reason, and GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ3_S-IQ4_NL-v2 didn’t perform well; Bartowski’s was the clear winner between those two.

  • Here are some perplexity numbers; take them with a grain of salt. I only ran 75 chunks, which gives a good rough idea without waiting hours. Speeds vary as I was experimenting with various Linux kernel schedulers, and I lost track of the runs.
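For anyone wanting to reproduce numbers like these: the "Final estimate: PPL = x +/- y" line is just the exponential of the mean per-token negative log-likelihood, with the +/- being a delta-method standard error. Here's a minimal sketch of that calculation (my own helper, not llama.cpp code; the assumption that llama.cpp's +/- is the delta-method error `ppl * std(nll) / sqrt(n)` is mine):

```python
import math

def ppl_with_error(nlls):
    """Perplexity and an approximate standard error from per-token
    negative log-likelihoods (in nats).

    PPL = exp(mean NLL); the error term is the delta-method estimate
    ppl * sqrt(var(nll) / n).
    """
    n = len(nlls)
    mean = sum(nlls) / n
    var = sum((x - mean) ** 2 for x in nlls) / n
    ppl = math.exp(mean)
    return ppl, ppl * math.sqrt(var / n)
```

So a model whose per-token NLLs hover around 1.57 nats lands near the PPL ≈ 4.8 seen in the runs below.
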

  • [May have made some mistakes below, it's late and I wanted to say well done and thank you before I forgot again :P]

--- GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ4_XS-Q5_0-v2

llama_model_loader: loaded meta data with 42 key-value pairs and 803 tensors from /home/iam/Downloads/Models/GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ4_XS-Q5_0-v2.gguf (version GGUF V3 (latest))
perplexity: tokenizing the input ..
perplexity: tokenization took 416.239 ms
perplexity: calculating perplexity over 75 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 30.57 seconds per pass - ETA 9.55 minutes
       0  2.9273  1.074086  0.110931
       0  3.6302  1.289277  0.091056
       0  2.8024  1.030472  0.071197
       0  2.4754  0.906421  0.061893
    2048  2.4320  0.888695  0.053184
    2048  2.3032  0.834286  0.045946
    2048  2.3109  0.837642  0.042740
    2048  2.3804  0.867259  0.039812
    4096  2.4878  0.911397  0.037920
    4096  2.5030  0.917473  0.035596
    4096  2.4952  0.914377  0.033811
    4096  2.6972  0.992215  0.034238
    6144  3.0065  1.100791  0.034947
    6144  3.0562  1.117159  0.033817
    6144  3.2102  1.166325  0.033267
    6144  3.2910  1.191181  0.032403
    8192  3.4199  1.229612  0.032256
    8192  3.6551  1.296125  0.032286
    8192  3.5916  1.278592  0.031342
    8192  3.6296  1.289119  0.030439
   10240  3.7176  1.313089  0.029906
   10240  3.6849  1.304233  0.029053
   10240  3.5997  1.280847  0.028252
   10240  3.5304  1.261423  0.027445
   12288  3.4873  1.249130  0.026728
   12288  3.4546  1.239704  0.026094
   12288  3.4317  1.233046  0.025630
   12288  3.4680  1.243567  0.025265
   14336  3.5115  1.256040  0.024873
   14336  3.5660  1.271458  0.024639
   14336  3.6333  1.290136  0.024492
   14336  3.6811  1.303213  0.024190
   16384  3.7518  1.322231  0.023961
   16384  3.7869  1.331557  0.023630
   16384  3.8771  1.355076  0.023508
   16384  3.9272  1.367921  0.023202
   18432  3.9458  1.372646  0.022873
   18432  4.0396  1.396151  0.022811
   18432  4.0721  1.404162  0.022516
   18432  4.1046  1.412116  0.022214
   20480  4.1845  1.431381  0.022065
   20480  4.2002  1.435133  0.021790
   20480  4.1996  1.434992  0.021529
   20480  4.2363  1.443691  0.021322
   22528  4.3437  1.468731  0.021293
   22528  4.4034  1.482380  0.021116
   22528  4.3713  1.475056  0.020881
   22528  4.3036  1.459455  0.020690
   24576  4.2605  1.449396  0.020426
   24576  4.2656  1.450588  0.020238
   24576  4.3086  1.460603  0.020014
   24576  4.3411  1.468132  0.019874
   26624  4.3904  1.479427  0.019750
   26624  4.4209  1.486334  0.019599
   26624  4.4429  1.491318  0.019466
   26624  4.4728  1.498014  0.019298
   28672  4.4688  1.497114  0.019107
   28672  4.4908  1.502041  0.018977
   28672  4.5009  1.504285  0.018828
   28672  4.5472  1.514502  0.018696
   30720  4.5920  1.524306  0.018616
   30720  4.6526  1.537417  0.018515
   30720  4.7047  1.548567  0.018451
   30720  4.7340  1.554775  0.018309
   32768  4.7450  1.557083  0.018166
   32768  4.7570  1.559608  0.018027
   32768  4.7581  1.559856  0.017854
   32768  4.7639  1.561063  0.017747
   34816  4.8057  1.569812  0.017649
   34816  4.8080  1.570286  0.017504
   34816  4.7971  1.568006  0.017360
   34816  4.7959  1.567756  0.017256
   36864  4.8084  1.570359  0.017168
   36864  4.8356  1.576015  0.017108
   36864  4.8370  1.576296  0.017024

Final estimate: PPL = 4.8370 +/- 0.08234

llama_perf_context_print:        load time =   44132.52 ms
llama_perf_context_print: prompt eval time =  364498.92 ms / 38400 tokens (    9.49 ms per token,   105.35 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  366322.90 ms / 38401 tokens
llama_perf_context_print:    graphs reused =          0

--- Bartowski's IQ4_XS

llama_model_loader: loaded meta data with 48 key-value pairs and 803 tensors from /mnt/spare/AI-Models/bartowski_zai-org_GLM-4.5-Air-GGUF/zai-org_GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf (version GGUF V3 (latest))
perplexity: tokenizing the input ..
perplexity: tokenization took 429.382 ms
perplexity: calculating perplexity over 75 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 49.96 seconds per pass - ETA 15.60 minutes
       0  2.8947  1.062871  0.109924
       0  3.6463  1.293711  0.090985
       0  2.8217  1.037327  0.071466
       0  2.4826  0.909291  0.061360
    2048  2.4939  0.913846  0.053093
    2048  2.3597  0.858539  0.045797
    2048  2.3665  0.861413  0.042619
    2048  2.4086  0.879040  0.039719
    4096  2.5175  0.923267  0.038010
    4096  2.5372  0.931073  0.035643
    4096  2.5322  0.929097  0.033912
    4096  2.7331  1.005434  0.034361
    6144  3.0467  1.114045  0.035039
    6144  3.0967  1.130325  0.033895
    6144  3.2476  1.177920  0.033323
    6144  3.3271  1.202089  0.032447
    8192  3.4563  1.240184  0.032295
    8192  3.6927  1.306361  0.032311
    8192  3.6289  1.288931  0.031380
    8192  3.6709  1.300431  0.030491
   10240  3.7559  1.323319  0.029952
   10240  3.7253  1.315147  0.029107
   10240  3.6375  1.291295  0.028326
   10240  3.5655  1.271312  0.027513
   12288  3.5203  1.258544  0.026792
   12288  3.4936  1.250919  0.026180
   12288  3.4680  1.243571  0.025720
   12288  3.5007  1.252962  0.025355
   14336  3.5472  1.266146  0.024981
   14336  3.6006  1.281088  0.024748
   14336  3.6663  1.299173  0.024582
   14336  3.7152  1.312440  0.024296
   16384  3.7827  1.330450  0.024029
   16384  3.8183  1.339800  0.023691
   16384  3.9087  1.363198  0.023558
   16384  3.9586  1.375897  0.023249
   18432  3.9766  1.380432  0.022926
   18432  4.0682  1.403210  0.022859
   18432  4.1011  1.411264  0.022566
   18432  4.1345  1.419365  0.022268
   20480  4.2155  1.438768  0.022122
   20480  4.2316  1.442579  0.021849
   20480  4.2300  1.442213  0.021590
   20480  4.2656  1.450571  0.021383
   22528  4.3734  1.475535  0.021353
   22528  4.4370  1.489977  0.021186
   22528  4.3998  1.481569  0.020948
   22528  4.3368  1.467128  0.020756
   24576  4.2936  1.457128  0.020500
   24576  4.2989  1.458359  0.020312
   24576  4.3443  1.468867  0.020094
   24576  4.3756  1.476049  0.019948
   26624  4.4237  1.486981  0.019817
   26624  4.4540  1.493793  0.019664
   26624  4.4738  1.498228  0.019529
   26624  4.5022  1.504571  0.019357
   28672  4.4951  1.502985  0.019163
   28672  4.5159  1.507602  0.019029
   28672  4.5253  1.509675  0.018881
   28672  4.5711  1.519764  0.018750
   30720  4.6149  1.529295  0.018666
   30720  4.6770  1.542649  0.018569
   30720  4.7310  1.554137  0.018514
   30720  4.7611  1.560484  0.018373
   32768  4.7702  1.562383  0.018226
   32768  4.7805  1.564544  0.018082
   32768  4.7799  1.564423  0.017905
   32768  4.7859  1.565678  0.017795
   34816  4.8295  1.574738  0.017700
   34816  4.8319  1.575237  0.017553
   34816  4.8212  1.573017  0.017406
   34816  4.8214  1.573064  0.017304
   36864  4.8335  1.575573  0.017215
   36864  4.8599  1.581021  0.017152
   36864  4.8610  1.581246  0.017071

Final estimate: PPL = 4.8610 +/- 0.08298

llama_perf_context_print:        load time =  173919.89 ms
llama_perf_context_print: prompt eval time =  368140.00 ms / 38400 tokens (    9.59 ms per token,   104.31 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  369979.83 ms / 38401 tokens
llama_perf_context_print:    graphs reused =          0

--- GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ3_S-IQ4_NL-v2

llama_model_loader: loaded meta data with 42 key-value pairs and 803 tensors from /home/iam/Downloads/Models/GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ3_S-IQ4_NL-v2.gguf (version GGUF V3 (latest))
perplexity: tokenizing the input ..
perplexity: tokenization took 488.22 ms
perplexity: calculating perplexity over 75 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 28.32 seconds per pass - ETA 8.83 minutes
       0  2.8853  1.059639  0.106120
       0  3.6131  1.284562  0.088691
       0  2.8089  1.032805  0.069683
       0  2.4854  0.910434  0.060735
    2048  2.5242  0.925939  0.052864
    2048  2.3811  0.867562  0.045616
    2048  2.3988  0.874955  0.042466
    2048  2.4706  0.904455  0.040034
    4096  2.6382  0.970083  0.038649
    4096  2.6573  0.977324  0.036241
    4096  2.6420  0.971535  0.034435
    4096  2.8416  1.044364  0.034708
    6144  3.1722  1.154421  0.035447
    6144  3.2265  1.171411  0.034338
    6144  3.3781  1.217312  0.033709
    6144  3.4544  1.239663  0.032809
    8192  3.5776  1.274699  0.032526
    8192  3.8182  1.339771  0.032517
    8192  3.7472  1.321019  0.031550
    8192  3.7788  1.329414  0.030605
   10240  3.8608  1.350878  0.030067
   10240  3.8219  1.340751  0.029217
   10240  3.7290  1.316144  0.028412
   10240  3.6505  1.294861  0.027595
   12288  3.6010  1.281208  0.026861
   12288  3.5662  1.271512  0.026251
   12288  3.5391  1.263864  0.025773
   12288  3.5716  1.273012  0.025382
   14336  3.6184  1.286035  0.025017
   14336  3.6695  1.300064  0.024766
   14336  3.7352  1.317812  0.024596
   14336  3.7824  1.330358  0.024291
   16384  3.8537  1.349035  0.024040
   16384  3.8917  1.358849  0.023700
   16384  3.9838  1.382239  0.023569
   16384  4.0347  1.394941  0.023262
   18432  4.0494  1.398558  0.022925
   18432  4.1428  1.421368  0.022857
   18432  4.1757  1.429274  0.022572
   18432  4.2092  1.437267  0.022276
   20480  4.2893  1.456134  0.022133
   20480  4.3027  1.459246  0.021862
   20480  4.3007  1.458786  0.021597
   20480  4.3361  1.466976  0.021390
   22528  4.4439  1.491525  0.021356
   22528  4.5067  1.505558  0.021186
   22528  4.4710  1.497607  0.020944
   22528  4.4021  1.482075  0.020722
   24576  4.3570  1.471793  0.020471
   24576  4.3622  1.472981  0.020286
   24576  4.4065  1.483071  0.020064
   24576  4.4384  1.490289  0.019921
   26624  4.4892  1.501666  0.019800
   26624  4.5208  1.508695  0.019646
   26624  4.5427  1.513530  0.019513
   26624  4.5708  1.519696  0.019338
   28672  4.5626  1.517903  0.019146
   28672  4.5843  1.522632  0.019010
   28672  4.5921  1.524342  0.018859
   28672  4.6375  1.534165  0.018722
   30720  4.6830  1.543947  0.018638
   30720  4.7463  1.557358  0.018536
   30720  4.7994  1.568488  0.018481
   30720  4.8282  1.574482  0.018337
   32768  4.8394  1.576789  0.018193
   32768  4.8503  1.579047  0.018052
   32768  4.8494  1.578865  0.017876
   32768  4.8544  1.579885  0.017764
   34816  4.8972  1.588672  0.017669
   34816  4.8990  1.589029  0.017525
   34816  4.8846  1.586090  0.017378
   34816  4.8854  1.586245  0.017280
   36864  4.9002  1.589272  0.017199
   36864  4.9272  1.594769  0.017136
   36864  4.9276  1.594861  0.017049

Final estimate: PPL = 4.9276 +/- 0.08401

llama_perf_context_print:        load time =   47535.27 ms
llama_perf_context_print: prompt eval time =  401160.26 ms / 38400 tokens (   10.45 ms per token,    95.72 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  403384.70 ms / 38401 tokens
llama_perf_context_print:    graphs reused =          0

--- GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ4_XS-IQ4_NL-v2

perplexity: tokenizing the input ..
perplexity: tokenization took 677.204 ms
perplexity: calculating perplexity over 75 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 43.64 seconds per pass - ETA 13.63 minutes
       0  2.9353  1.076825  0.114285
       0  3.6513  1.295081  0.092247
       0  2.8132  1.034314  0.072015
       0  2.4761  0.906694  0.062623
    2048  2.4622  0.901070  0.054131
    2048  2.3425  0.851220  0.046791
    2048  2.3405  0.850365  0.043396
    2048  2.3972  0.874281  0.040338
    4096  2.5288  0.927738  0.038668
    4096  2.5460  0.934539  0.036251
    4096  2.5359  0.930534  0.034401
    4096  2.7378  1.007150  0.034744
    6144  3.0513  1.115577  0.035395
    6144  3.1022  1.132128  0.034242
    6144  3.2534  1.179711  0.033637
    6144  3.3325  1.203726  0.032748
    8192  3.4603  1.241369  0.032543
    8192  3.6963  1.307335  0.032529
    8192  3.6283  1.288757  0.031558
    8192  3.6635  1.298418  0.030623
   10240  3.7504  1.321874  0.030096
   10240  3.7206  1.313882  0.029244
   10240  3.6328  1.289993  0.028430
   10240  3.5594  1.269591  0.027603
   12288  3.5156  1.257219  0.026886
   12288  3.4817  1.247511  0.026252
   12288  3.4578  1.240641  0.025783
   12288  3.4930  1.250768  0.025420
   14336  3.5380  1.263557  0.025024
   14336  3.5906  1.278309  0.024774
   14336  3.6564  1.296492  0.024601
   14336  3.7063  1.310035  0.024314
   16384  3.7773  1.329001  0.024067
   16384  3.8126  1.338309  0.023724
   16384  3.9027  1.361661  0.023591
   16384  3.9522  1.374263  0.023279
   18432  3.9720  1.379272  0.022957
   18432  4.0673  1.402984  0.022906
   18432  4.0992  1.410800  0.022610
   18432  4.1310  1.418520  0.022304
   20480  4.2102  1.437517  0.022156
   20480  4.2262  1.441296  0.021881
   20480  4.2256  1.441160  0.021624
   20480  4.2629  1.449959  0.021416
   22528  4.3714  1.475092  0.021384
   22528  4.4330  1.489076  0.021208
   22528  4.4003  1.481676  0.020964
   22528  4.3341  1.466519  0.020765
   24576  4.2871  1.455621  0.020503
   24576  4.2919  1.456719  0.020313
   24576  4.3343  1.466569  0.020087
   24576  4.3666  1.473986  0.019945
   26624  4.4168  1.485407  0.019823
   26624  4.4464  1.492084  0.019668
   26624  4.4685  1.497042  0.019534
   26624  4.4978  1.503591  0.019363
   28672  4.4915  1.502194  0.019169
   28672  4.5138  1.507148  0.019037
   28672  4.5238  1.509349  0.018892
   28672  4.5710  1.519743  0.018761
   30720  4.6145  1.529203  0.018674
   30720  4.6759  1.542429  0.018571
   30720  4.7306  1.554042  0.018516
   30720  4.7591  1.560058  0.018374
   32768  4.7689  1.562121  0.018230
   32768  4.7802  1.564489  0.018088
   32768  4.7804  1.564533  0.017914
   32768  4.7854  1.565567  0.017801
   34816  4.8267  1.574170  0.017700
   34816  4.8278  1.574392  0.017551
   34816  4.8192  1.572612  0.017411
   34816  4.8189  1.572543  0.017309
   36864  4.8326  1.575388  0.017222
   36864  4.8603  1.581091  0.017160
   36864  4.8620  1.581446  0.017080

Final estimate: PPL = 4.8620 +/- 0.08304

llama_perf_context_print:        load time =   50600.65 ms
llama_perf_context_print: prompt eval time =  492767.38 ms / 38400 tokens (   12.83 ms per token,    77.93 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  495380.22 ms / 38401 tokens
llama_perf_context_print:    graphs reused =          0

Thanks for the feedback! Glad to know it's working well out in the wild. Here's some data that may interest you:

These first three charts include all the v2 quants as well as GLM-4.5-Air-Q8_0-FFN-IQ3_S-IQ3_S-Q5_0, GLM-4.5-Air-Q8_0-FFN-IQ4_XS-IQ4_XS-Q5_0 (v1), and some other custom quants.
01_kld_vs_filesize
02_ppl_vs_filesize
03_ppl_ratio_vs_kld

This chart is older and shows all of the v1 quants as well as some quants from Unsloth and Bartowski.
kld_vs_filesize_linear

(All KLD values are computed against a standard pure Q8_0 quant)
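For reference, comparing a quant against the Q8_0 baseline this way boils down to the standard per-token KL divergence between the two models' output distributions (llama.cpp's perplexity tool can do this with its `--kl-divergence` mode against saved baseline logits). A minimal sketch of the definition itself (my own illustration, not the tool's code):

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) in nats between two discrete distributions over the vocab.

    p: reference model's token probabilities (e.g. from the Q8_0 baseline);
    q: the quantized model's probabilities for the same position.
    Terms with p_i == 0 contribute nothing by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A KLD of 0 means the quant's predictions are indistinguishable from the baseline's; the charts plot how quickly that gap grows as file size shrinks.
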

Overall the IQ4_XS-IQ4_XS-Q5_0-v2 does seem to be the best bang-for-buck. :3

I found this in the list of quants after wanting to try Air again, and I'd recommend putting one of these charts on your model page: they're an excellent quick reference and exactly what I was looking for in these discussions. Discussions are checked so infrequently that most people won't see the charts there, though, so I'd highly recommend at least putting the KL-divergence one on the page, and maybe your last sentence as well.