All-experts calibration & linearization

#1
by mratsim - opened

LLM Compressor requires specific modeling to ensure all experts are calibrated: if the calibration dataset is not exhaustive enough, some experts don't see enough tokens (a minimum of 64 per expert for AWQ - https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf) or might not see any tokens at all, see:

Quantizing MoEs

To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under llama4_example.py. The pipeline automatically applies the appropriate MoE calibration context which:

  1. Linearizes the model to enable quantization and execution in vLLM. This is required as the native model definition does not include torch.nn.Linear layers in its MoE blocks, a requirement for LLM Compressor to run quantization.
  2. Ensures experts are quantized correctly, since not all experts are activated during calibration
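To get a feel for point 2, here is a back-of-envelope sketch of how many calibration tokens are needed so every expert sees the 64-token AWQ minimum. It assumes perfectly uniform routing (real routers are skewed, so treat it as a lower bound); the function name and the example expert counts are hypothetical, not from LLM Compressor:

```python
# Back-of-envelope: minimum calibration tokens so that every expert sees at
# least min_tokens_per_expert tokens, assuming *uniform* top-k routing.
# Real routing is skewed, so this is only a lower bound.

def min_calibration_tokens(num_experts: int, top_k: int,
                           min_tokens_per_expert: int = 64) -> int:
    """Each token activates top_k of num_experts experts, so an expert sees
    on average top_k / num_experts of all tokens. Ceil-divide to guarantee
    the per-expert minimum."""
    return -(-min_tokens_per_expert * num_experts // top_k)

# Hypothetical MoE shapes for illustration:
print(min_calibration_tokens(num_experts=128, top_k=8))  # 1024
print(min_calibration_tokens(num_experts=256, top_k=8))  # 2048
```

With hundreds of experts and few active per token, even a few thousand calibration tokens only barely cover the uniform-routing bound, which is why skewed routing easily starves some experts entirely.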

For example, this is what is done for Qwen3 MoEs: https://github.com/vllm-project/llm-compressor/blob/0.9.0/src/llmcompressor/modeling/qwen3_moe.py#L78-L84

        for expert_idx, expert_layer in enumerate(self.experts):
            # top_x: tokens routed to this expert; idx: their top-k slot
            idx, top_x = torch.where(expert_mask[expert_idx].squeeze(0))

            if self.calibrate_all_experts:
                # Calibration: run *every* token through the expert so the
                # quantizer observes activations, then keep only routed rows
                expert_out = expert_layer(hidden_states)[top_x]
            else:
                # Normal inference: run only the routed tokens
                expert_out = expert_layer(hidden_states[top_x])
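The two branches produce identical routed outputs because an expert MLP acts row-wise per token, so indexing commutes with the expert: `f(h)[top_x] == f(h[top_x])`. Only the calibration coverage differs. A minimal sketch in plain Python (the toy `expert_layer` below stands in for the real expert MLP):

```python
# Sketch: for a per-token expert function f, f(h)[top_x] == f(h[top_x]).
# Plain Python lists stand in for torch tensors.

def expert_layer(rows):
    """Toy per-token expert: scales each token's features (stand-in for an MLP)."""
    return [[2.0 * x for x in row] for row in rows]

hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
top_x = [0, 2]  # tokens routed to this expert

# calibrate_all_experts=True: all tokens pass through the expert,
# then only the routed rows are kept.
out_all = [expert_layer(hidden_states)[i] for i in top_x]

# calibrate_all_experts=False: only the routed tokens pass through.
out_routed = expert_layer([hidden_states[i] for i in top_x])

print(out_all)     # [[2.0, 4.0], [10.0, 12.0]]
print(out_routed)  # [[2.0, 4.0], [10.0, 12.0]]
```

So all-experts calibration changes which activations the quantizer observes, not the model's output.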

and this is what is done for DeepSeek v3:

        # Begin MoE
        final_hidden_states = torch.zeros_like(hidden_states, dtype=topk_weights.dtype)
        expert_mask = torch.nn.functional.one_hot(
            topk_indices, num_classes=len(self.experts)
        )
        expert_mask = expert_mask.permute(2, 0, 1)

        for expert_idx, expert in enumerate(self.experts):
            token_indices, weight_indices = torch.where(expert_mask[expert_idx])
            has_tokens = token_indices.numel() > 0

            if self.calibrate_all_experts:
                expert_input = hidden_states
                expert_output = expert(expert_input)

                if has_tokens:
                    expert_weights = topk_weights[token_indices, weight_indices]
                    routed_output = expert_output[
                        token_indices
                    ] * expert_weights.unsqueeze(-1)
                    final_hidden_states.index_add_(0, token_indices, routed_output)
            else:
                # Normal MoE: only process tokens routed to this expert
                if has_tokens:
                    expert_input = hidden_states[token_indices]
                    expert_output = expert(expert_input)
                    expert_weights = topk_weights[token_indices, weight_indices]
                    routed_output = expert_output * expert_weights.unsqueeze(-1)
                    final_hidden_states.index_add_(0, token_indices, routed_output)
        # End MoE
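For readers unfamiliar with the mask bookkeeping above, here is a plain-Python sketch (toy sizes, hypothetical helper name) of what `one_hot(topk_indices).permute(2, 0, 1)` followed by `torch.where(expert_mask[expert_idx])` computes: for each expert, which tokens routed to it (`token_indices`) and which top-k slot they occupied (`weight_indices`, later used to look up `topk_weights`):

```python
# Sketch of the per-expert index extraction in the DeepSeek-v3 loop above.
# Toy sizes; hypothetical helper, not LLM Compressor code.

def expert_token_map(topk_indices, num_experts):
    """Return {expert_idx: (token_indices, weight_indices)}."""
    result = {e: ([], []) for e in range(num_experts)}
    for tok, experts in enumerate(topk_indices):
        for slot, e in enumerate(experts):
            result[e][0].append(tok)   # which token routed to expert e
            result[e][1].append(slot)  # which top-k slot it occupied
    return result

# 3 tokens, top-2 routing over 3 experts
routing = expert_token_map([[0, 2], [2, 1], [0, 1]], num_experts=3)
print(routing[0])  # ([0, 2], [0, 0]): tokens 0 and 2 chose expert 0, both in slot 0
```

In the `calibrate_all_experts` branch, the expert still runs on all tokens, but only the rows at `token_indices`, scaled by the matching `topk_weights`, are accumulated into `final_hidden_states`.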

Can you confirm how many samples were used and whether you had custom modeling to calibrate all experts?

cyankiwi org

Thank you for your concerns. All recent models quantized with cyankiwi AWQ v1.0 use a modified modeling to better calibrate the model, including all-experts MoE calibration. So yes, this model has all experts calibrated, with a fairly extensive calibration sample.

Also, I would like to draw your attention to this thread.

AWQ models without diverse multilingual calibration data are effectively lobotomized in some languages. Tested your quant just now with Greek.

It's not just a problem of quality degradation; the quant can't even generate a coherent Greek sentence. Greek text probably gets mixed up with mathematical notation during calibration.

I created a small dataset with 34 samples for that purpose here, which contains 1 sample per language for 34 languages, so that calibration datasets are more robust to multilingual use.

Thank you very much!

cyankiwi org

@droussis Thank you a lot! I had just commented on other threads asking for multilingual calibration data suggestions, and your comment popped right up!

@cpatonn

I have one more useful calibration dataset here which contains 102 rows:

  • 94 with multilingual multi-turn dialogues with reasoning, covering 24 EU languages
  • 4 with coding agent trajectories (OpenHands-like scaffold)
  • 4 with terminal agent trajectories with reasoning (Terminus2-XML-like scaffold)

Its aim is not only multilingual and multi-turn coverage, but also SOTA agentic use cases.
