Qwen3-REAP-15B-A3B

An MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-30B-A3B.

30B → 15B total parameters | ~3B active per token | 50% expert pruning


Model Summary

                            Original              Pruned
Model                       Qwen/Qwen3-30B-A3B    Qwen3-REAP-15B-A3B
Total Parameters            ~30B                  ~15B
Active Parameters           ~3B                   ~3B
Experts per Layer           128                   64
Experts Routed per Token    8                     8
Hidden Layers               48                    48
Hidden Size                 2048                  2048
MoE Intermediate Size       768                   768
Context Length              40,960                40,960
Precision                   BF16                  BF16
Disk Size                   ~57 GB                ~30 GB
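The halving of total parameters follows almost entirely from the expert count. A back-of-envelope check, assuming each expert contributes three projection matrices of shape hidden_size × moe_intermediate_size (gate, up, down), reproduces the table's numbers:

```python
# Rough MoE parameter count for the table above (sketch, not an exact audit:
# attention, embeddings, and routers are excluded).
hidden_size = 2048
moe_intermediate = 768
n_layers = 48
experts_before, experts_after = 128, 64

params_per_expert = 3 * hidden_size * moe_intermediate  # gate, up, down projections

def moe_params(n_experts):
    return n_layers * n_experts * params_per_expert

print(f"MoE params before: {moe_params(experts_before) / 1e9:.1f}B")  # 29.0B
print(f"MoE params after:  {moe_params(experts_after) / 1e9:.1f}B")   # 14.5B
```

Removing half of the ~29B expert parameters accounts for the ~15B drop in total size, while the per-token active count stays at ~3B because routing still selects 8 experts.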

What Is REAP?

REAP (Router-weighted Expert Activation Pruning) is a pruning method from Cerebras Research that combines router gate statistics with expert activation norms to determine which experts to prune. Unlike frequency-only pruning, REAP weighs each expert's contribution by both how often it's selected and how much it actually activates — producing a more informed saliency score.
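As a rough illustration of the idea (this is not Cerebras' implementation; all names and the record format are hypothetical), the saliency score can be sketched as the mean of gate weight times activation norm over the tokens routed to each expert, with the least salient experts chosen for pruning:

```python
# Hypothetical REAP-style saliency sketch: combine router gate statistics
# with expert activation norms, then rank experts by the resulting score.
from collections import defaultdict

def reap_saliency(routing_records):
    """routing_records: (expert_id, gate_weight, activation_norm) per routed token."""
    totals, counts = defaultdict(float), defaultdict(int)
    for expert_id, gate, act_norm in routing_records:
        totals[expert_id] += gate * act_norm  # weight activation by router gate
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}

def experts_to_prune(saliency, compression_ratio=0.5):
    ranked = sorted(saliency, key=saliency.get)  # least salient first
    return ranked[: int(len(ranked) * compression_ratio)]
```

A frequency-only method would score experts by `counts` alone; the gate-times-norm product is what lets REAP distinguish a rarely-but-strongly used expert from one that fires often with negligible contribution.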

What changed:

  • 50% of MoE experts pruned globally (128 → 64 per layer, across all 48 layers)
  • Router weights pruned accordingly (only retained expert rows kept)
  • Routing mechanism (num_experts_per_tok = 8) unchanged
  • Attention layers, embeddings, and all non-MoE components untouched
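The structural edit above can be sketched minimally, assuming each layer holds a list of per-expert weights and a router matrix with one row per expert (names illustrative, not the actual REAP code):

```python
# Sketch of the per-layer pruning step: drop pruned experts and keep only
# the matching router rows, leaving everything else untouched.
def prune_layer(experts, router_rows, keep_ids):
    """experts: per-expert weight objects; router_rows: one row per expert."""
    keep_ids = sorted(keep_ids)  # preserve original expert ordering
    kept_experts = [experts[i] for i in keep_ids]
    kept_router = [router_rows[i] for i in keep_ids]  # rows of retained experts only
    return kept_experts, kept_router
```

Because `num_experts_per_tok` stays at 8, the router simply softmaxes over 64 logits instead of 128; no retraining of the routing mechanism is involved.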

Calibration Data

Calibration data was used only to measure expert activation patterns during the observation phase (not for fine-tuning). 1,000 samples were packed into 2,048-token sequences:

Source                Proportion   Dataset                               Description
Agentic trajectories  40%          togethercomputer/CoderForge-Preview   Passing SWE-agent trajectories
Raw code              30%          bigcode/the-stack-smol (Python)       Python source code
General web text      10%          allenai/c4 (English)                  Pretraining distribution proxy
Broad coverage        20%          NeelNanda/pile-10k                    Mixed general text
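Assembling such a mixture can be sketched as follows. The dataset names and proportions come from the table; the sampling and packing helpers are illustrative, not the pipeline actually used:

```python
# Sketch: turn the calibration table into per-dataset sample counts and
# pack token streams into fixed-length sequences for observation.
MIX = {
    "togethercomputer/CoderForge-Preview": 0.40,
    "bigcode/the-stack-smol": 0.30,
    "NeelNanda/pile-10k": 0.20,
    "allenai/c4": 0.10,
}

def sample_counts(total=1000, mix=MIX):
    """Number of samples to draw from each source."""
    return {name: int(total * frac) for name, frac in mix.items()}

def pack(token_ids, seq_len=2048):
    """Concatenate token ids and cut into full seq_len-sized sequences."""
    return [token_ids[i:i + seq_len]
            for i in range(0, len(token_ids) - seq_len + 1, seq_len)]
```

Packing matters because expert routing statistics are collected per token position; fixed-length sequences keep the observation pass uniform across sources.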

REAP Configuration

prune_method: reap
compression_ratio: 0.5
seed: 42
distance_measure: cosine
samples_per_category: 1024
model_max_length: 2048
record_pruning_metrics_only: true

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "atbender/Qwen3-REAP-15B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
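For sizing hardware, a quick estimate of the weight memory needed to hold the model in BF16 (a rough sketch that excludes KV cache, activations, and framework overhead):

```python
# Back-of-envelope VRAM estimate for the BF16 weights alone.
total_params = 15e9
bytes_per_param = 2  # bfloat16 is 2 bytes per parameter
weight_gib = total_params * bytes_per_param / 2**30
print(f"~{weight_gib:.0f} GiB for weights")  # ~28 GiB
```

This is consistent with the ~30 GB on-disk size and explains why `device_map="auto"` is useful: on a single 24 GB consumer GPU the weights alone will spill to CPU or a second device.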

Hardware Used

  • GPUs: 2× NVIDIA RTX A6000 48GB
  • Observation phase: ~74 minutes (231 packed samples × ~19s/sample)
  • Pruning phase: <1 second
  • Model save: ~50 seconds
  • Total wall time: ~76 minutes

Intended Use

  • Research on MoE pruning and compression techniques
  • Practice run / reference for larger MoE compression (e.g., Step-3.5-Flash)
  • Exploring sparsity–performance trade-offs in MoE architectures
  • Local deployment of a smaller Qwen3 MoE variant

Limitations

  • No post-pruning fine-tuning — this is a raw prune, quality degradation expected on tail tasks
  • Aggressive compression — 50% expert removal is significant; some capabilities will be lost
  • Calibration bias — 70% code-focused calibration data may bias retention toward code-relevant experts
  • Not benchmarked — no formal evals run yet; contributions welcome

Acknowledgements

  • Qwen team — Qwen3-30B-A3B base model
  • Cerebras Research — REAP method
  • OpenMOSE — Reference implementation and model card inspiration

Citation

@misc{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

License

Apache License 2.0 — same as the base Qwen3-30B-A3B model.
