# Qwen3-REAP-15B-A3B

MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-30B-A3B.

**30B → 15B total parameters | ~3B active per token | 50% expert pruning**
## Model Summary

| | Original | Pruned |
|---|---|---|
| Model | Qwen/Qwen3-30B-A3B | Qwen3-REAP-15B-A3B |
| Total Parameters | ~30B | ~15B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 128 | 64 |
| Experts Routed per Token | 8 | 8 |
| Hidden Layers | 48 | 48 |
| Hidden Size | 2048 | 2048 |
| MoE Intermediate Size | 768 | 768 |
| Context Length | 40,960 | 40,960 |
| Precision | BF16 | BF16 |
| Disk Size | ~57 GB | ~30 GB |
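The ~30B → ~15B drop follows almost entirely from the expert counts in the table above. A back-of-envelope check (assuming each expert is a SwiGLU MLP with three projection matrices and no biases, and ignoring attention/embedding parameters):

```python
# Rough expert parameter count from the summary table (assumption: each
# expert uses gate/up/down projections of shape hidden x intermediate).
hidden, inter, experts, layers = 2048, 768, 128, 48

per_expert = 3 * hidden * inter          # gate + up + down projections
expert_params = per_expert * experts * layers
print(round(expert_params / 1e9, 1))     # 29.0 -> experts dominate the ~30B total

pruned_away = expert_params // 2         # removing half the experts...
print(round(pruned_away / 1e9, 1))       # 14.5 -> leaves roughly 15B
```

This is why pruning 50% of experts roughly halves the total parameter count while leaving active parameters (~3B) unchanged.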
## What Is REAP?
REAP (Router-weighted Expert Activation Pruning) is a pruning method from Cerebras Research that combines router gate statistics with expert activation norms to determine which experts to prune. Unlike frequency-only pruning, REAP weighs each expert's contribution by both how often it's selected and how much it actually activates — producing a more informed saliency score.
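A minimal sketch of that kind of saliency score, assuming a simplified form (router gate weight times expert output norm, averaged over the tokens actually routed to each expert; the exact Cerebras formulation may differ):

```python
import numpy as np

def reap_saliency(gate_weights, expert_outputs, topk_mask):
    """Toy REAP-style saliency per expert.
    gate_weights: (T, E) router gate values per token/expert
    expert_outputs: (T, E, D) expert outputs per token
    topk_mask: (T, E) bool, True where the expert was among the routed top-k
    """
    # Weight each expert's output norm by its gate value...
    contrib = gate_weights * np.linalg.norm(expert_outputs, axis=-1)  # (T, E)
    # ...but only for tokens that actually selected that expert.
    contrib = np.where(topk_mask, contrib, 0.0)
    counts = topk_mask.sum(axis=0).clip(min=1)   # avoid divide-by-zero
    return contrib.sum(axis=0) / counts          # (E,) saliency per expert

rng = np.random.default_rng(0)
T, E, D = 16, 8, 4                               # toy sizes
gates = rng.random((T, E))
outs = rng.standard_normal((T, E, D))
mask = gates >= np.sort(gates, axis=-1)[:, [-2]] # toy top-2 routing
scores = reap_saliency(gates, outs, mask)
keep = np.argsort(scores)[-E // 2 :]             # retain the top half of experts
```

A frequency-only method would use `topk_mask.sum(axis=0)` alone; the gate-times-norm weighting is what distinguishes REAP's saliency.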
**What changed:**

- 50% of MoE experts pruned globally (128 → 64 per layer, across all 48 layers)
- Router weights pruned accordingly (only retained expert rows kept)
- Routing mechanism (`num_experts_per_tok = 8`) unchanged
- Attention layers, embeddings, and all non-MoE components untouched
## Calibration Data

Calibration data was used only to measure expert activation patterns during the observation phase (NOT for fine-tuning). 1,000 samples, packed into 2,048-token sequences:
| Source | Proportion | Dataset | Description |
|---|---|---|---|
| Agentic trajectories | 40% | togethercomputer/CoderForge-Preview | Passing SWE-agent trajectories |
| Raw code | 30% | bigcode/the-stack-smol (Python) | Python source code |
| General web text | 10% | allenai/c4 (English) | Pretraining distribution proxy |
| Broad coverage | 20% | NeelNanda/pile-10k | Mixed general text |
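Packing calibration samples into fixed 2,048-token sequences can be sketched as below. The concatenate-and-chunk strategy is an assumption for illustration; the actual REAP pipeline may pack differently (e.g. per-category or with padding):

```python
def pack_sequences(token_lists, seq_len=2048):
    """Concatenate tokenized samples and slice into fixed-length blocks,
    dropping the trailing remainder (one common packing strategy)."""
    flat = [tok for sample in token_lists for tok in sample]
    n_blocks = len(flat) // seq_len
    return [flat[i * seq_len : (i + 1) * seq_len] for i in range(n_blocks)]

# Toy token IDs standing in for three tokenized calibration samples.
samples = [[1] * 1500, [2] * 1500, [3] * 1200]   # 4200 tokens total
packed = pack_sequences(samples, seq_len=2048)
print(len(packed))                               # 2 full blocks; 104 tokens dropped
```

Packing keeps every observation forward pass at a uniform length, which is why the hardware section below reports throughput in packed samples rather than raw samples.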
## REAP Configuration

```yaml
prune_method: reap
compression_ratio: 0.5
seed: 42
distance_measure: cosine
samples_per_category: 1024
model_max_length: 2048
record_pruning_metrics_only: true
```
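The "router weights pruned accordingly" step amounts to row-slicing each layer's router matrix so it only scores retained experts. A toy NumPy sketch (the retained indices here are random placeholders, not the real saliency ranking):

```python
import numpy as np

def prune_router(router_weight, keep_idx):
    """Keep only the rows of the (num_experts, hidden) router matrix
    that correspond to retained experts."""
    return router_weight[np.asarray(keep_idx)]

# Shapes taken from the model summary: 128 experts, hidden size 2048.
w = np.random.randn(128, 2048)
keep = sorted(np.random.default_rng(42).choice(128, size=64, replace=False))
w_pruned = prune_router(w, keep)   # (64, 2048)
```

Because the router now emits 64 logits instead of 128, top-8 routing works unchanged over the smaller expert pool, which is why `num_experts_per_tok` stays at 8.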
## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "atbender/Qwen3-REAP-15B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # loads the BF16 weights as stored
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
## Hardware Used
- GPUs: 2× NVIDIA RTX A6000 48GB
- Observation phase: ~74 minutes (231 packed samples × ~19s/sample)
- Pruning phase: <1 second
- Model save: ~50 seconds
- Total wall time: ~76 minutes
## Intended Use
- Research on MoE pruning and compression techniques
- Practice run / reference for larger MoE compression (e.g., Step-3.5-Flash)
- Exploring sparsity–performance trade-offs in MoE architectures
- Local deployment of a smaller Qwen3 MoE variant
## Limitations
- No post-pruning fine-tuning — this is a raw prune, quality degradation expected on tail tasks
- Aggressive compression — 50% expert removal is significant; some capabilities will be lost
- Calibration bias — 70% code-focused calibration data may bias retention toward code-relevant experts
- Not benchmarked — no formal evals run yet; contributions welcome
## Acknowledgements
- Qwen team — Qwen3-30B-A3B base model
- Cerebras Research — REAP method
- OpenMOSE — Reference implementation and model card inspiration
## Citation

```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}
```
## License
Apache License 2.0 — same as the base Qwen3-30B-A3B model.