The benchmark of this version

#5
by whw2026 - opened

Since no one else seems to be benchmarking the abliterated models yet, I decided to run a quick test myself. I'm just an everyday model user, and the evaluation code was written with the help of Claude Opus 4.6. The results for the gemma-4-26B-A4B-it-abliterix-q8 model are below. The small sample (only 20 per dataset!) and possibly inappropriate prompts no doubt bias the results. They are for reference only; I will run a larger sample later.

The dataset preparation and testing code is available at https://github.com/salty-fish-114514/gemma4-abliterix-benchmark/tree/main

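| Dataset | Accuracy | Correct | Total | Errors | Time (s) |
|---|---|---|---|---|---|
| gsm8k | 100.00% | 20 | 20 | 0 | 396.5 |
| mmlu_pro | 70.00% | 14 | 20 | 0 | 588.9 |
| ceval | 65.00% | 13 | 20 | 0 | 501.3 |
| bbh | 90.00% | 18 | 20 | 1 | 324.0 |
| gpqa_diamond | 65.00% | 13 | 20 | 0 | 1047.2 |
| ifeval | 95.00% | 19 | 20 | 0 | 276.7 |
| humaneval | 100.00% | 20 | 20 | 0 | 214.4 |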

The larger benchmark is finally done! But I have to admit again that the ifeval and humaneval results aren't very reliable: their evaluation needs a complicated routine, not just a simple value check.
Besides, the biases mentioned above still apply, so the detailed scores are just for reference.

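| Dataset | Accuracy | Correct | Total | Errors | Time (s) |
|---|---|---|---|---|---|
| gsm8k | 96.18% | 126 | 131 | 0 | 2560.7 |
| mmlu_pro | 74.00% | 148 | 200 | 0 | 6101.2 |
| ceval | 75.24% | 79 | 105 | 1 | 2740.2 |
| bbh | 79.50% | 159 | 200 | 1 | 5297.2 |
| gpqa_diamond | 63.13% | 125 | 198 | 1 | 11083.6 |
| ifeval | 90.74% | 49 | 54 | 0 | 908.1 |
| humaneval | 94.51% | 155 | 164 | 9 | 2448.3 |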

Finally, I ran a benchmark with 90 samples per dataset under the same conditions on the original model (unsloth/gemma-4-26B-A4B-it-GGUF, same Q8_0 quantization). Here are the results:

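| Dataset | Accuracy | Correct | Total | Errors | Time (s) |
|---|---|---|---|---|---|
| gsm8k | 83.33% | 75 | 90 | 0 | 2659.0 |
| mmlu_pro | 68.89% | 62 | 90 | 0 | 2963.8 |
| ceval | 76.67% | 69 | 90 | 0 | 2497.2 |
| bbh | 73.33% | 66 | 90 | 0 | 2146.5 |
| gpqa_diamond | 71.11% | 64 | 90 | 0 | 4992.0 |
| ifeval | 87.04% | 47 | 54 | 0 | 897.9 |
| humaneval | 98.89% | 89 | 90 | 1 | 1037.6 |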

Still, the limited sample size leads to wide confidence intervals. However, the Abliterix model appears to be smarter overall, which contradicts the conventional view that abliteration always comes at the cost of intelligence. Or perhaps it's just more robust to inappropriate prompts. Below, I list the differences between the two models:

| Dataset | Accuracy change, Abliterix vs. original, Q8_0 (percentage points) | Trend |
|---|---|---|
| gsm8k | +12.85 | Increase |
| mmlu_pro | +5.11 | Increase |
| ceval | -1.43 | Decrease |
| bbh | +6.17 | Increase |
| gpqa_diamond | -7.98 | Decrease |
| ifeval | +3.70 | Increase |
| humaneval | -4.38 | Decrease |
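To make the "wide confidence intervals" caveat concrete, here is a minimal sketch that recomputes the accuracy differences from the Correct/Total counts in the tables above and attaches a 95% Wilson score interval to each accuracy. The Wilson interval is my own choice for illustration, not something the benchmark code is known to use, and the function names are hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# (correct, total) pairs copied from the larger Abliterix run
# and from the original-model run reported in this thread.
ABLITERIX = {
    "gsm8k": (126, 131), "mmlu_pro": (148, 200), "ceval": (79, 105),
    "bbh": (159, 200), "gpqa_diamond": (125, 198), "ifeval": (49, 54),
    "humaneval": (155, 164),
}
ORIGINAL = {
    "gsm8k": (75, 90), "mmlu_pro": (62, 90), "ceval": (69, 90),
    "bbh": (66, 90), "gpqa_diamond": (64, 90), "ifeval": (47, 54),
    "humaneval": (89, 90),
}

def compare(abliterix, original):
    """Return per-dataset delta (percentage points) and interval overlap."""
    rows = {}
    for name, (a_correct, a_total) in abliterix.items():
        o_correct, o_total = original[name]
        a_lo, a_hi = wilson_interval(a_correct, a_total)
        o_lo, o_hi = wilson_interval(o_correct, o_total)
        delta = 100 * (a_correct / a_total - o_correct / o_total)
        # Overlapping intervals mean the gap could plausibly be sampling noise.
        overlap = a_lo <= o_hi and o_lo <= a_hi
        rows[name] = {"delta": delta, "abliterix": (a_lo, a_hi),
                      "original": (o_lo, o_hi), "overlap": overlap}
    return rows

if __name__ == "__main__":
    for name, row in compare(ABLITERIX, ORIGINAL).items():
        print(f"{name:14s} {row['delta']:+6.2f} pp  overlap={row['overlap']}")
```

Overlapping intervals are only a rough screen; a proper two-proportion test on a shared set of questions would say more, since the two runs here use different sample sizes.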

Very interesting results. I've long wondered whether abliteration affects anything other than the KL div. Apparently it does, and even a low KL div doesn't guarantee the model hasn't lost some of its cognitive properties.
I wonder whether finetuning on a harmful dataset (for example, 1000 harmful questions and their answers) would be more effective than abliteration. Or perhaps a combination of finetuning and lighter abliteration.
https://arxiv.org/pdf/2602.06258 - Microsoft had some interesting research on this topic.

I have to say, this toy test isn't really proof that the model lost some of its cognitive properties. It's much more likely that the results are biased by inappropriate prompts and answer checking.

Have you tried running the same test under the exact same conditions (quant, system prompt, sampler, etc.) with the original Google Gemma 4? That would clarify the situation.

I will, but I'll have to wait until my computer is free for a long enough stretch again. Gotta find a big block of idle time first.

Thanks @whw2026 for running this. You're the first to benchmark the abliterix series systematically, and the effort is appreciated.

On the baseline comparison (+1 to @Theory-of-mind's suggestion):

This is the most important next step. The absolute numbers (96.2% GSM8K, 74% MMLU-Pro, 94.5% HumanEval) are strong, but without the base gemma-4-26B-A4B-it running under identical conditions (same quant, sampler, prompt template, answer parser), it's hard to separate abliteration effects from eval-harness noise. When you re-run, specifically match Q8 on both sides: quantization variance alone can mask or exaggerate the abliteration delta.

On abliteration vs. fine-tuning on harmful data:

A few observations from the Abliterix pipeline:

  1. Fine-tuning on ~1k harmful Q&A pairs does work (it's the path most early "uncensored" models took), but it produces broader distribution shift than abliteration. You're not just removing the refusal direction; you're moving the entire weight manifold, and it's harder to control where it moves. Capability regressions tend to be larger and less predictable.

  2. Naive abliteration (single direction, full projection) also degrades the model more than is usually admitted. KL divergence is a necessary-but-not-sufficient signal, as whw2026 rightly pointed out: a low KL on a generic distribution doesn't guarantee preserved reasoning on OOD tasks like GPQA.

  3. What I do in Abliterix is treat this as a hyperparameter search: Optuna TPE over a compound objective of refusal rate + KL on a held-out capability set. This lets you explicitly trade refusal removal against capability preservation instead of hoping one direction happens to be clean. This checkpoint was produced that way, which I suspect is part of why the numbers are holding up on a 26B MoE.

  4. The combined approach (light abliteration + targeted SFT) is interesting and already on my roadmap. I would love to dig into that Microsoft paper when I get a moment.
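For readers unfamiliar with the "single direction, full projection" step in point 2, here is a minimal sketch of removing one direction from a weight matrix's output space. It is an illustration only, not the Abliterix implementation: the function names and the `strength` parameter are hypothetical, with `strength=1.0` standing for full projection and smaller values for a lighter edit of the kind a search like the one in point 3 might tune.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]

def ablate_direction(weight, direction, strength=1.0):
    """Return W' = W - strength * r r^T W for a (d_out x d_in) matrix W.

    `weight` is a list of rows; `direction` has length d_out.
    With strength=1.0 the outputs of W' have no component along r.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # proj[j] = r . (column j of W): how much each column points along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [weight[i][j] - strength * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]

if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]
    print(ablate_direction(W, [1.0, 0.0]))        # full projection
    print(ablate_direction(W, [1.0, 0.0], 0.5))   # partial, "lighter" edit
```

In a real model the direction would be estimated from activations and the edit applied to many matrices; this toy version only shows the linear algebra.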

On eval reliability:

Agreed that IFEval and HumanEval really need proper harnesses. lm-evaluation-harness and BigCodeBench are worth the setup cost if you're going to publish numbers. GSM8K and MMLU-Pro with simple answer checking are defensible. Happy to share the internal eval config I used on this checkpoint if it's useful for comparison.
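As an illustration of what "simple answer checking" can look like for GSM8K-style numeric answers, here is a minimal sketch that takes the last number in the model output and compares it with the reference. This is an assumption about one common approach, not the checker in the linked repository, and the function names are hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation

# Matches integers and decimals, allowing thousands separators like 1,234.50.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_last_number(text):
    """Return the last number found in `text` as a Decimal, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    cleaned = matches[-1].replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def is_correct(model_output, reference):
    """Compare the last number in the output with the last number in the reference."""
    predicted = extract_last_number(model_output)
    expected = extract_last_number(reference)
    return predicted is not None and expected is not None and predicted == expected

if __name__ == "__main__":
    print(is_correct("She earns 9 * 2 = 18 dollars per day.", "#### 18"))
```

A check like this fails quietly when the model restates the question's numbers after its answer or formats the result unusually, which is one way prompt and parser choices can bias scores.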

Thanks again for doing this in the open. This is exactly the kind of scrutiny the abliteration field needs.

Original test done! And I have updated my benchmark comment.

The baseline comparison is done, and interestingly, the Abliterix Gemma4 model appears smarter, or perhaps it's just more robust to the inappropriate prompts.

Thank you! These are very interesting results! Perhaps the safety tax is not a myth.
Advice for the future: in this case, if you used a regular unsloth Q8_0, everything is fine. However, for any other quants, you should use regular static quants without iMatrix (mradermacher WITHOUT i1, ggml-org, DevQuasar, etc.) to maintain the integrity of the experiment.