All-experts calibration & linearization
LLM Compressor requires model-specific modeling changes to ensure all experts are calibrated. Otherwise, if the calibration dataset is not exhaustive enough, some experts see too few tokens (a minimum of 64 is suggested for AWQ: https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf), or may not see any tokens at all. See:
Quantizing MoEs
To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under llama4_example.py. The pipeline automatically applies the appropriate MoE calibration context, which:
- Linearizes the model to enable quantization and execution in vLLM. This is required because the native model definition does not include torch.nn.Linear layers in its MoE blocks, which LLM Compressor needs in order to run quantization.
- Ensures experts are quantized correctly, as not all experts are activated during calibration.
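To illustrate what the linearization step does, here is a schematic plain-Python sketch (not llm-compressor's actual code, and the class names are made up for illustration): a fused per-expert weight tensor is unpacked into one Linear-style module per expert, so a quantization pass has `torch.nn.Linear`-like submodules to find and replace, while the computed values stay identical:

```python
def matvec(w, x):
    """w: list of rows, x: vector -> the product w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

class FusedExperts:
    """All expert weights packed in one 3-D array [num_experts][out][in];
    a quantizer looking for Linear layers cannot see inside this."""
    def __init__(self, packed):
        self.packed = packed
    def forward(self, expert_idx, x):
        return matvec(self.packed[expert_idx], x)

class LinearExpert:
    """Stand-in for torch.nn.Linear: one weight matrix per expert, so a
    quantization pass can locate and replace it."""
    def __init__(self, weight):
        self.weight = weight
    def forward(self, x):
        return matvec(self.weight, x)

def linearize(fused):
    """Unpack the fused tensor into a list of per-expert modules."""
    return [LinearExpert(w) for w in fused.packed]

packed = [[[1, 0], [0, 1]], [[2, 0], [0, 3]]]  # 2 experts, 2x2 weights
fused = FusedExperts(packed)
experts = linearize(fused)
x = [4, 5]
assert fused.forward(1, x) == experts[1].forward(x)  # same math, new structure
print(experts[1].forward(x))  # [8, 15]
```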
For example, this is what is done for Qwen3 MoEs: https://github.com/vllm-project/llm-compressor/blob/0.9.0/src/llmcompressor/modeling/qwen3_moe.py#L78-L84
```python
for expert_idx, expert_layer in enumerate(self.experts):
    idx, top_x = torch.where(expert_mask[expert_idx].squeeze(0))

    if self.calibrate_all_experts:
        expert_out = expert_layer(hidden_states)[top_x]
    else:
        expert_out = expert_layer(hidden_states[top_x])
```
And this is what is done for DeepSeek-V3:
```python
# Begin MoE
final_hidden_states = torch.zeros_like(hidden_states, dtype=topk_weights.dtype)
expert_mask = torch.nn.functional.one_hot(
    topk_indices, num_classes=len(self.experts)
)
expert_mask = expert_mask.permute(2, 0, 1)

for expert_idx, expert in enumerate(self.experts):
    token_indices, weight_indices = torch.where(expert_mask[expert_idx])
    has_tokens = token_indices.numel() > 0

    if self.calibrate_all_experts:
        expert_input = hidden_states
        expert_output = expert(expert_input)

        if has_tokens:
            expert_weights = topk_weights[token_indices, weight_indices]
            routed_output = expert_output[
                token_indices
            ] * expert_weights.unsqueeze(-1)
            final_hidden_states.index_add_(0, token_indices, routed_output)
    else:
        # Normal MoE: only process tokens routed to this expert
        if has_tokens:
            expert_input = hidden_states[token_indices]
            expert_output = expert(expert_input)
            expert_weights = topk_weights[token_indices, weight_indices]
            routed_output = expert_output * expert_weights.unsqueeze(-1)
            final_hidden_states.index_add_(0, token_indices, routed_output)
# End MoE
```
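The same `expert_mask` bookkeeping can double as a diagnostic: counting how many tokens each expert actually receives during calibration flags experts that would otherwise be quantized on little or no data. A plain-Python sketch (the 64-token minimum is the AWQ guideline cited above):

```python
def tokens_per_expert(topk_indices, num_experts):
    """Count calibration tokens routed to each expert.

    topk_indices: for each token, the list of expert ids it is routed to --
    the same information the one-hot expert_mask encodes.
    """
    counts = [0] * num_experts
    for token_choices in topk_indices:
        for expert_idx in token_choices:
            counts[expert_idx] += 1
    return counts

# Toy run: 4 experts, top-2 routing, 5 tokens; expert 3 is never selected.
topk_indices = [[0, 1], [0, 2], [1, 2], [0, 1], [0, 2]]
counts = tokens_per_expert(topk_indices, num_experts=4)
print(counts)  # [4, 3, 3, 0]

# Experts below the ~64-token AWQ guideline need a larger or more diverse
# calibration set -- or the calibrate_all_experts path shown in the snippet.
starved = [i for i, c in enumerate(counts) if c < 64]
```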
Can you confirm how many samples were used and whether you had custom modeling to calibrate all experts?
Thank you for raising this. All recent models quantized with cyankiwi AWQ v1.0 use a modified modeling to calibrate the model more thoroughly, including all-experts MoE calibration. So yes, this model has all experts calibrated, with a fairly extensive set of calibration samples.
Also, I would like to draw your attention to this thread.
AWQ models without diverse multilingual calibration data are effectively lobotomized in some languages. Tested your quant just now with Greek.
It's not just a problem of quality degradation; the quant can't even generate a coherent Greek sentence. Probably Greek text gets mixed up with mathematical notation during calibration.
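One cheap sanity check before quantizing (an illustrative sketch, not part of any existing tool) is to verify that the calibration set contains at least some text in each target script, here Greek, using Unicode character names as a rough proxy:

```python
import unicodedata

def has_script(text, script):
    """True if any character in text belongs to the given Unicode script,
    judged by character-name prefix (e.g. 'GREEK SMALL LETTER ALPHA')."""
    return any(unicodedata.name(ch, "").startswith(script) for ch in text)

calibration_samples = [
    "The derivative of x^2 is 2x.",
    "Η πρωτεύουσα της Ελλάδας είναι η Αθήνα.",  # one Greek sample
]
# Number of samples containing Greek text; zero would predict exactly the
# kind of degradation described above.
print(sum(has_script(s, "GREEK") for s in calibration_samples))  # 1
```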
For that purpose, I had created a small dataset here with 34 samples (one per language, for 34 languages), so that calibration datasets are more robust to multilingualism.
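The intended use can be sketched as a simple merge (illustrative only; `blend_calibration` is a hypothetical helper, and in practice the merged list is what gets passed to the calibration dataloader):

```python
def blend_calibration(base_samples, per_language_samples):
    """Append one sample per language to the base calibration set so the
    quantizer observes activations from every language."""
    merged = list(base_samples)
    merged.extend(per_language_samples.values())
    return merged

base = [f"english sample {i}" for i in range(3)]
per_lang = {"el": "Ελληνικό δείγμα", "fr": "échantillon en français"}
calib = blend_calibration(base, per_lang)
print(len(calib))  # 5
```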
Thank you very much!
I have one more useful calibration dataset here, which contains 102 rows:
- 94 with multilingual multiturn dialogues with reasoning covering 24 EU languages
- 4 with coding agent trajectories (OpenHands-like scaffold)
- 4 with terminal agent trajectories with reasoning (Terminus2-XML-like scaffold)
Its aim is not only multilingualism and multi-turn dialogue, but also SOTA agentic use cases.