# Qwen3-REAP-15B-A3B

MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-30B-A3B.

**30B → 15B total parameters | ~3B active per token | 50% expert pruning**
## Model Summary

| | Original | Pruned |
|---|---|---|
| Model | Qwen/Qwen3-30B-A3B | Qwen3-REAP-15B-A3B |
| Total Parameters | ~30B | ~15B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 128 | 64 |
| Experts Routed per Token | 8 | 8 |
| Hidden Layers | 48 | 48 |
| Hidden Size | 2048 | 2048 |
| MoE Intermediate Size | 768 | 768 |
| Context Length | 40,960 | 40,960 |
| Precision | BF16 | BF16 |
| Disk Size | ~57 GB | ~30 GB |
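The ~30B → ~15B drop follows almost entirely from the expert counts in the table above. A back-of-envelope check (assuming each expert is a SwiGLU MLP with three projection matrices and no biases, and ignoring attention/embedding parameters):

```python
# Rough expert parameter count from the summary table (assumption: each
# expert uses gate/up/down projections of shape hidden x intermediate).
hidden, inter, experts, layers = 2048, 768, 128, 48

per_expert = 3 * hidden * inter          # gate + up + down projections
expert_params = per_expert * experts * layers
print(round(expert_params / 1e9, 1))     # 29.0 -> experts dominate the ~30B total

pruned_away = expert_params // 2         # removing half the experts...
print(round(pruned_away / 1e9, 1))       # 14.5 -> leaves roughly 15B
```

This is why pruning 50% of experts roughly halves the total parameter count while leaving active parameters (~3B) unchanged.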
## What Is REAP?
REAP (Router-weighted Expert Activation Pruning) is a pruning method from Cerebras Research that combines router gate statistics with expert activation norms to determine which experts to prune. Unlike frequency-only pruning, REAP weighs each expert's contribution by both how often it's selected and how much it actually activates — producing a more informed saliency score.
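A minimal sketch of that kind of saliency score, assuming a simplified form (router gate weight times expert output norm, averaged over the tokens actually routed to each expert; the exact Cerebras formulation may differ):

```python
import numpy as np

def reap_saliency(gate_weights, expert_outputs, topk_mask):
    """Toy REAP-style saliency per expert.
    gate_weights: (T, E) router gate values per token/expert
    expert_outputs: (T, E, D) expert outputs per token
    topk_mask: (T, E) bool, True where the expert was among the routed top-k
    """
    # Weight each expert's output norm by its gate value...
    contrib = gate_weights * np.linalg.norm(expert_outputs, axis=-1)  # (T, E)
    # ...but only for tokens that actually selected that expert.
    contrib = np.where(topk_mask, contrib, 0.0)
    counts = topk_mask.sum(axis=0).clip(min=1)   # avoid divide-by-zero
    return contrib.sum(axis=0) / counts          # (E,) saliency per expert

rng = np.random.default_rng(0)
T, E, D = 16, 8, 4                               # toy sizes
gates = rng.random((T, E))
outs = rng.standard_normal((T, E, D))
mask = gates >= np.sort(gates, axis=-1)[:, [-2]] # toy top-2 routing
scores = reap_saliency(gates, outs, mask)
keep = np.argsort(scores)[-E // 2 :]             # retain the top half of experts
```

A frequency-only method would use `topk_mask.sum(axis=0)` alone; the gate-times-norm weighting is what distinguishes REAP's saliency.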
**What changed:**

- 50% of MoE experts pruned globally (128 → 64 per layer, across all 48 layers)
- Router weights pruned accordingly (only retained expert rows kept)
- Routing mechanism (`num_experts_per_tok = 8`) unchanged
- Attention layers, embeddings, and all non-MoE components untouched
## Calibration Data

Calibration data was used only to measure expert activation patterns during the observation phase (NOT for fine-tuning). 1,000 samples, packed into 2,048-token sequences:
| Source | Proportion | Dataset | Description |
|---|---|---|---|
| Agentic trajectories | 40% | togethercomputer/CoderForge-Preview | Passing SWE-agent trajectories |
| Raw code | 30% | bigcode/the-stack-smol (Python) | Python source code |
| General web text | 10% | allenai/c4 (English) | Pretraining distribution proxy |
| Broad coverage | 20% | NeelNanda/pile-10k | Mixed general text |
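Packing calibration samples into fixed 2,048-token sequences can be sketched as below. The concatenate-and-chunk strategy is an assumption for illustration; the actual REAP pipeline may pack differently (e.g. per-category or with padding):

```python
def pack_sequences(token_lists, seq_len=2048):
    """Concatenate tokenized samples and slice into fixed-length blocks,
    dropping the trailing remainder (one common packing strategy)."""
    flat = [tok for sample in token_lists for tok in sample]
    n_blocks = len(flat) // seq_len
    return [flat[i * seq_len : (i + 1) * seq_len] for i in range(n_blocks)]

# Toy token IDs standing in for three tokenized calibration samples.
samples = [[1] * 1500, [2] * 1500, [3] * 1200]   # 4200 tokens total
packed = pack_sequences(samples, seq_len=2048)
print(len(packed))                               # 2 full blocks; 104 tokens dropped
```

Packing keeps every observation forward pass at a uniform length, which is why the hardware section below reports throughput in packed samples rather than raw samples.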
## REAP Configuration

```yaml
prune_method: reap
compression_ratio: 0.5
seed: 42
distance_measure: cosine
samples_per_category: 1024
model_max_length: 2048
record_pruning_metrics_only: true
```
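The "router weights pruned accordingly" step amounts to row-slicing each layer's router matrix so it only scores retained experts. A toy NumPy sketch (the retained indices here are random placeholders, not the real saliency ranking):

```python
import numpy as np

def prune_router(router_weight, keep_idx):
    """Keep only the rows of the (num_experts, hidden) router matrix
    that correspond to retained experts."""
    return router_weight[np.asarray(keep_idx)]

# Shapes taken from the model summary: 128 experts, hidden size 2048.
w = np.random.randn(128, 2048)
keep = sorted(np.random.default_rng(42).choice(128, size=64, replace=False))
w_pruned = prune_router(w, keep)   # (64, 2048)
```

Because the router now emits 64 logits instead of 128, top-8 routing works unchanged over the smaller expert pool, which is why `num_experts_per_tok` stays at 8.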
## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "atbender/Qwen3-REAP-15B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # loads the BF16 weights as stored
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
## Hardware Used
- GPUs: 2× NVIDIA RTX A6000 48GB
- Observation phase: ~74 minutes (231 packed samples × ~19s/sample)
- Pruning phase: <1 second
- Model save: ~50 seconds
- Total wall time: ~76 minutes
## Intended Use
- Research on MoE pruning and compression techniques
- Practice run / reference for larger MoE compression (e.g., Step-3.5-Flash)
- Exploring sparsity–performance trade-offs in MoE architectures
- Local deployment of a smaller Qwen3 MoE variant
## Limitations
- No post-pruning fine-tuning — this is a raw prune, quality degradation expected on tail tasks
- Aggressive compression — 50% expert removal is significant; some capabilities will be lost
- Calibration bias — 70% code-focused calibration data may bias retention toward code-relevant experts
- Not benchmarked — no formal evals run yet; contributions welcome
## Acknowledgements
- Qwen team — Qwen3-30B-A3B base model
- Cerebras Research — REAP method
- OpenMOSE — Reference implementation and model card inspiration
## Citation

```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}
```
## License
Apache License 2.0 — same as the base Qwen3-30B-A3B model.