The benchmark of this version

by whw2026 - opened Apr 22

whw2026

Since no one else seems to be benchmarking the abliterated models yet, I decided to run a quick test myself. I'm just an everyday model user, and the evaluation code was written with the help of Claude Opus4.6. The results for the gemma-4-26B-A4B-it-abliterix-q8 model are below. The small sample (only 20 per dataset!) and prompts that may not suit each benchmark no doubt bias the results. Treat these numbers as reference only; I will run a larger sample later.

The dataset preparation and testing code is available at https://github.com/salty-fish-114514/gemma4-abliterix-benchmark/tree/main

Dataset	Accuracy	Correct	Total	Errors	Time(s)
gsm8k	100.00%	20	20	0	396.5
mmlu_pro	70.00%	14	20	0	588.9
ceval	65.00%	13	20	0	501.3
bbh	90.00%	18	20	1	324.0
gpqa_diamond	65.00%	13	20	0	1047.2
ifeval	95.00%	19	20	0	276.7
humaneval	100.00%	20	20	0	214.4

The larger benchmark is finally done! But I have to admit again that the ifeval and humaneval results aren't very reliable: evaluating them properly needs a complicated routine, not just a simple value check.
Besides, the bias mentioned above still applies, so the detailed scores are for reference only.

Dataset	Accuracy	Correct	Total	Errors	Time(s)
gsm8k	96.18%	126	131	0	2560.7
mmlu_pro	74.00%	148	200	0	6101.2
ceval	75.24%	79	105	1	2740.2
bbh	79.50%	159	200	1	5297.2
gpqa_diamond	63.13%	125	198	1	11083.6
ifeval	90.74%	49	54	0	908.1
humaneval	94.51%	155	164	9	2448.3

Finally, I ran a benchmark with up to 90 samples per dataset (ifeval has only 54) under the same conditions on the original model (unsloth/gemma-4-26B-A4B-it-GGUF, same Q8_0 quantization). Here are the results:

Dataset	Accuracy	Correct	Total	Errors	Time(s)
gsm8k	83.33%	75	90	0	2659.0
mmlu_pro	68.89%	62	90	0	2963.8
ceval	76.67%	69	90	0	2497.2
bbh	73.33%	66	90	0	2146.5
gpqa_diamond	71.11%	64	90	0	4992.0
ifeval	87.04%	47	54	0	897.9
humaneval	98.89%	89	90	1	1037.6

The limited sample size still leads to wide confidence intervals. Even so, the Abliterix model scores higher on four of the seven datasets, which contradicts the conventional view that abliteration always comes at the cost of intelligence. Or perhaps it's just more robust to inappropriate prompts. Below, I list the differences between the two models (Abliterix larger run minus the original model's run):

Dataset	Accuracy Change in percentage points (Abliterix minus Original, Q8_0)	Trend
gsm8k	+12.85	Increase
mmlu_pro	+5.11	Increase
ceval	-1.43	Decrease
bbh	+6.17	Increase
gpqa_diamond	-7.98	Decrease
ifeval	+3.70	Increase
humaneval	-4.38	Decrease
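
To put rough numbers on how wide those confidence intervals are, here is a minimal sketch that recomputes the differences from the Correct/Total columns above and attaches a 95% Wilson score interval to each accuracy, plus a pooled two-proportion z statistic. The function names are mine, not from the linked repo, and the z statistic assumes the two runs are independent samples, which is only approximately true here since the sample sizes differ between runs.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

def two_proportion_z(c1, n1, c2, n2):
    """Pooled z statistic for the difference of two independent proportions."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (c1 / n1 - c2 / n2) / se

# (correct, total) pairs copied from the tables above:
# first the larger Abliterix run, then the original model's run.
RESULTS = {
    "gsm8k": ((126, 131), (75, 90)),
    "mmlu_pro": ((148, 200), (62, 90)),
    "ceval": ((79, 105), (69, 90)),
    "bbh": ((159, 200), (66, 90)),
    "gpqa_diamond": ((125, 198), (64, 90)),
    "ifeval": ((49, 54), (47, 54)),
    "humaneval": ((155, 164), (89, 90)),
}

for name, ((ca, na), (co, no)) in RESULTS.items():
    delta_pp = 100 * (ca / na - co / no)  # difference in percentage points
    lo_a, hi_a = wilson_interval(ca, na)
    lo_o, hi_o = wilson_interval(co, no)
    z = two_proportion_z(ca, na, co, no)
    print(f"{name:13s} delta={delta_pp:+6.2f} pp  "
          f"abliterix=[{100 * lo_a:.1f}, {100 * hi_a:.1f}]  "
          f"original=[{100 * lo_o:.1f}, {100 * hi_o:.1f}]  z={z:+.2f}")
```

A |z| below about 1.96 means the gap is within what sampling noise alone could plausibly produce at the 95% level, so this is a quick way to see which rows of the difference table deserve attention.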

Theory-of-mind

Apr 22

Very interesting results. I've long wondered whether abliteration affects anything other than the KL div. Apparently so, and even a low KL div doesn't guarantee the model hasn't lost some of its cognitive properties.
I wonder whether fine-tuning on a harmful dataset (for example, 1000 harmful questions and their answers) would be more effective than abliteration. Or perhaps a combination of fine-tuning and lighter abliteration.
https://arxiv.org/pdf/2602.06258 – Microsoft had some interesting research on this topic.
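
For readers unfamiliar with the "KL div" metric mentioned here: a minimal sketch, assuming it is measured between the next-token distributions of the original and the modified model on the same prompt. The logits below are made-up toy values over a four-token vocabulary, not output from any real model.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log of zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits for one position.
original_logits = [2.0, 1.0, 0.1, -1.0]
modified_logits = [1.8, 1.1, 0.3, -0.9]

p = softmax(original_logits)
q = softmax(modified_logits)
print(f"KL(original || modified) = {kl_divergence(p, q):.5f} nats")
```

In practice the value would be averaged over many prompts and positions, and a small average on generic text says nothing about which specific tasks moved.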

whw2026

Apr 22

I have to say, this toy test isn't really proof that the model lost some of its cognitive properties. It's much more likely that the results are biased by inappropriate prompts and answer checking.

Theory-of-mind

Apr 22

Have you tried running the same test under the exact same conditions (quant, system prompt, sampler, etc.) with the original Google Gemma 4? That would clarify the situation.

whw2026

Apr 22

I will, but I'll have to wait until my computer is free for a long enough stretch again. Gotta find a big block of idle time first.

wangzhang

Owner Apr 22

Thanks @whw2026 for running this — you're the first to benchmark the abliterix series systematically, and the effort is appreciated.

On the baseline comparison (+1 to @Theory-of-mind's suggestion):

This is the most important next step. The absolute numbers — 96.2% GSM8K, 74% MMLU-Pro, 94.5% HumanEval — are strong, but without the base gemma-4-26B-A4B-it running under identical conditions (same quant, sampler, prompt template, answer parser), it's hard to separate abliteration effects from eval-harness noise. When you re-run, specifically match Q8 on both sides — quantization variance alone can mask or exaggerate the abliteration delta.

On abliteration vs. fine-tuning on harmful data:

A few observations from the Abliterix pipeline:

Fine-tuning on ~1k harmful Q&A pairs does work — it's the path most early "uncensored" models took — but it produces broader distribution shift than abliteration. You're not just removing the refusal direction; you're moving the entire weight manifold, and it's harder to control where it moves. Capability regressions tend to be larger and less predictable.
Naive abliteration (single direction, full projection) also degrades the model more than is usually admitted. KL divergence is a necessary-but-not-sufficient signal, as whw2026 rightly pointed out — a low KL on a generic distribution doesn't guarantee preserved reasoning on OOD tasks like GPQA.
What I do in Abliterix is treat this as a hyperparameter search: Optuna TPE over a compound objective of refusal rate + KL on a held-out capability set. This lets you explicitly trade refusal removal against capability preservation instead of hoping one direction happens to be clean. This checkpoint was produced that way, which I suspect is part of why the numbers are holding up on a 26B MoE.
The combined approach (light abliteration + targeted SFT) is interesting and already on my roadmap — would love to dig into that Microsoft paper when I get a moment.
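
To make the terms above concrete, here is a minimal sketch of single-direction projection and of a compound score that a search could minimize. It is not the Abliterix code: it assumes the refusal direction lives in the layer's output space, that the weight matrix is stored as d_out rows by d_in columns, and the names `strength` and `kl_weight` are hypothetical illustration parameters.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate(weight, direction, strength=1.0):
    """Return W - strength * r r^T W for a unit direction r.

    strength=1.0 is the full projection (the layer can no longer write
    along r); smaller values remove only part of that component.
    """
    r = normalize(direction)
    rows, cols = len(weight), len(weight[0])
    # Component of each column of W along r.
    along = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - strength * r[i] * along[j] for j in range(cols)]
            for i in range(rows)]

def compound_objective(refusal_rate, kl, kl_weight=1.0):
    """Lower is better: fewer refusals and less drift from the base model."""
    return refusal_rate + kl_weight * kl

# Toy 2x2 example: removing the first output axis zeroes the first row.
W = [[1.0, 2.0], [3.0, 4.0]]
print(ablate(W, [1.0, 0.0]))           # full projection
print(ablate(W, [1.0, 0.0], 0.5))      # partial projection
print(compound_objective(0.10, 0.05, kl_weight=2.0))
```

A hyperparameter search would then vary quantities such as `strength` per layer and score each candidate with something like `compound_objective`, measured on real refusal and capability sets.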

On eval reliability:

Agreed that IFEval and HumanEval really need proper harnesses — lm-evaluation-harness and BigCodeBench are worth the setup cost if you're going to publish numbers. GSM8K and MMLU-Pro with simple answer checking are defensible. Happy to share the internal eval config I used on this checkpoint if it's useful for comparison.
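
For context on what "simple answer checking" can look like for GSM8K-style questions, here is a minimal sketch that takes the last number in the model's response and compares it to the reference. This is an illustration only, not the checker from the linked repo or my internal config; the function names and the tolerance are assumptions.

```python
import re

# An optional minus sign, digits with optional thousands separators,
# and an optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_last_number(text):
    """Return the last number found in text as a float, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(response, reference, tol=1e-6):
    """True when the last number in the response matches the reference."""
    predicted = extract_last_number(response)
    return predicted is not None and abs(predicted - float(reference)) <= tol

print(is_correct("She has 3 bags of 14, so the answer is 42.", "42"))
print(is_correct("The total comes to 1,234 dollars", "1234"))
print(is_correct("I am not sure.", "42"))
```

A check like this breaks down as soon as the model appends extra numbers after its answer, which is one way prompt wording can bias the score, and it cannot judge IFEval constraints or execute HumanEval code at all.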

Thanks again for doing this in the open — this is exactly the kind of scrutiny the abliteration field needs.

whw2026

Apr 23

The test on the original model is done, and I have updated my benchmark comment.

whw2026

Apr 23

The baseline comparison is done, and interestingly, the Abliterix Gemma4 model appears smarter — or perhaps it's just more robust to the inappropriate prompts.

Theory-of-mind

Apr 23

Thank you! These are very interesting results! Perhaps the safety tax is not a myth.
Advice for the future: in this case, if you used a regular unsloth Q8_0, everything is fine. However, for any other quants, you should use regular static quants without iMatrix (mradermacher WITHOUT i1, ggml-org, DevQuasar, etc.) to maintain the integrity of the experiment.
