MiniMax-M2.7-REAP-172B-A10B-BF16
REAP-pruned BF16 variant of MiniMaxAI/MiniMax-M2.7. Uses Cerebras's REAP methodology with a 6-dataset agentic calibration mix to preserve tool-use, code-generation, and reasoning capabilities while reducing expert count by 25%.
This is the intermediate BF16 artifact — the REAP'd model before quantization. Use this if you want to apply your own quantization scheme. For direct single-Spark deployment on NVIDIA DGX Spark (GB10) and Blackwell-family hardware, use the NVFP4 variant: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10.
Model Details
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 192 experts, top-K=8) |
| Total Parameters | 172B (down from 230B, -25%) |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Experts per Layer | 192 (pruned from 256) |
| Format | BF16 safetensors (70 shards) |
| Size on Disk | ~345 GB |
| License | Other (inherited from MiniMaxAI/MiniMax-M2.7) |
REAP Pruning Details
- Method: REAP (Router-weighted Expert Activation Pruning), Lasby et al., arXiv:2510.13999
- Keep ratio: 0.75 (kept 192 of 256 experts per layer)
- Scoring: router weight × expert output magnitude, accumulated via forward-pass hooks (see the sketch after this list)
- Calibration samples: 250 per dataset × 6 = 1500 total (1234 after the ≥32-token filter)
- Max sequence length: 2048 tokens
- Hardware used: Hugging Face Jobs, 4× NVIDIA H200 141 GB
- Recipe script: `reap-bf16-agentic.py`
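A minimal, self-contained sketch of that scoring rule on dummy tensors. The shapes are toy-sized and it ignores top-K routing for brevity; the actual hook-based accumulation lives in `reap-bf16-agentic.py`:

```python
import torch

# Toy REAP-style saliency: score each expert by
# (router gate weight) × (L2 norm of the expert's output),
# summed over calibration tokens, then keep the top 75%.
num_experts, num_tokens, hidden = 8, 32, 16
router_probs = torch.softmax(torch.randn(num_tokens, num_experts), dim=-1)
expert_out = torch.randn(num_experts, num_tokens, hidden)  # stand-in for hook captures

# [num_experts, num_tokens] × [num_experts, num_tokens], summed over tokens
saliency = (router_probs.T * expert_out.norm(dim=-1)).sum(dim=-1)

keep = int(0.75 * num_experts)  # keep ratio 0.75, as in this recipe
kept_experts = saliency.topk(keep).indices.sort().values
print("experts kept:", kept_experts.tolist())
```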
Agentic calibration mix
"Which experts survive pruning depends on which activations the calibration data triggers." If you calibrate on pure chat, experts specialized for code / tool-calling / math get scored low and pruned — the resulting model loses those capabilities. This mix is chosen to preserve capabilities for agent-framework deployment (OpenClaw, Aider, Claude Code-style assistants):
| Dataset | Domain |
|---|---|
| theblackcat102/evol-codealpaca-v1 | Code generation |
| Salesforce/xlam-function-calling-60k | Tool calling / function invocation |
| open-r1/Mixture-of-Thoughts (code config) | Code reasoning |
| open-r1/Mixture-of-Thoughts (math config) | Mathematical reasoning |
| open-r1/Mixture-of-Thoughts (science config) | Scientific reasoning |
| SWE-bench/SWE-smith-trajectories (tool split) | Software-engineering agent trajectories |
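For reference, a rough sketch of assembling this mix with the Hugging Face `datasets` library. The split names, streaming sampling, and per-dataset text extraction are simplified assumptions (the ≥32-token filter is applied downstream); the authoritative logic lives in `reap-bf16-agentic.py`:

```python
from itertools import islice
from datasets import load_dataset

# (repo, config) pairs for the six-dataset agentic mix.
SOURCES = [
    ("theblackcat102/evol-codealpaca-v1", None),
    ("Salesforce/xlam-function-calling-60k", None),
    ("open-r1/Mixture-of-Thoughts", "code"),
    ("open-r1/Mixture-of-Thoughts", "math"),
    ("open-r1/Mixture-of-Thoughts", "science"),
    ("SWE-bench/SWE-smith-trajectories", None),  # split/config assumed; see correction below
]

calibration_rows = []
for repo, config in SOURCES:
    ds = load_dataset(repo, name=config, split="train", streaming=True)
    calibration_rows.extend(islice(ds, 250))  # 250 samples per dataset
```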
Performance
Benchmarks pending GB10 deployment validation runs — will be added when available.
Usage
Load with transformers (requires trust_remote_code=True):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "saricles/MiniMax-M2.7-REAP-172B-A10B-BF16",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "saricles/MiniMax-M2.7-REAP-172B-A10B-BF16",
    trust_remote_code=True,
)
```
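Once loaded, a quick smoke test. This assumes the tokenizer ships a chat template inherited from the base model; adjust the prompt format if yours differs:

```python
messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(
    inputs, max_new_tokens=256,
    do_sample=True, temperature=1.0, top_p=0.95, top_k=40,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```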
At ~345 GB (172B parameters × 2 bytes each, before KV cache), this model does not fit on a single consumer GPU. For serving, either run multi-GPU tensor parallelism or apply a quantization scheme (NVFP4, AWQ, GPTQ, etc.). For GB10 / Blackwell deployment, consider the NVFP4 variant.
Recommended Sampling Parameters
{ "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01 }
When to choose this BF16 variant vs. full MiniMax-M2.7
- Use this REAP BF16 if: you want a 25%-smaller MoE model with preserved agentic/coder capabilities, or you plan to apply your own quantization (AWQ, GPTQ, GGUF, different NVFP4 recipe).
- Use MiniMaxAI/MiniMax-M2.7 if: you need the full 256-expert model and can afford the larger deployment footprint.
Target Hardware
This is BF16 — hardware-agnostic. Any inference stack supporting custom transformers modeling code + BF16 safetensors (transformers, vLLM, TGI, etc.) can load it, given enough aggregate GPU memory.
Calibration notes (2026-04-17 correction)
During the REAP scoring pass that produced this artifact, the dataset extractor silently dropped texts from SWE-bench/SWE-smith-trajectories because that dataset stores the messages field as a JSON-encoded string rather than a list of dicts, as the other datasets in the mix do. Our extractor iterated the string character by character, found no dict entries, and collected zero usable texts from SWE-smith.
Net effect on this artifact: 5 of the 6 documented calibration datasets contributed to expert-importance scoring — evol-codealpaca, xlam-function-calling, and the three Mixture-of-Thoughts configs (code, math, science). SWE-smith-trajectories did NOT contribute, contrary to the original claim in this card.
Fix: the recipe script reap-bf16-agentic.py has been updated to parse string-encoded messages fields with json.loads(), plus a per-dataset assertion that fails the run if any dataset yields zero texts. Future REAP variants will include SWE-smith as originally intended.
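In outline, the corrected extraction looks roughly like this (variable and field names are illustrative, not the exact recipe code):

```python
import json

def extract_texts(row):
    # SWE-smith stores `messages` as a JSON-encoded string; the other
    # datasets in the mix store it as a list of dicts. Normalize first.
    messages = row.get("messages", [])
    if isinstance(messages, str):
        messages = json.loads(messages)
    return [
        m["content"]
        for m in messages
        if isinstance(m, dict) and isinstance(m.get("content"), str)
    ]

def assert_nonempty(dataset_name, texts):
    # Per-dataset guard: abort the run instead of silently calibrating
    # without a dataset (the failure mode behind this correction).
    assert texts, f"{dataset_name}: extractor yielded zero calibration texts"
```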
Practical implication: agentic tool-use preservation still benefited from xlam-function-calling. What's missing is the specific activation pattern of multi-turn SWE-agent trajectories (long-horizon read→run→patch→verify chains). If you run this model in long-chain SWE-style agents and see degraded behavior, this gap is the likely cause; please file an issue.
Acknowledgments
- Base model by MiniMax
- REAP methodology + reference implementation: Cerebras Research (Lasby et al., arXiv:2510.13999)
- Calibration-mix design: in the spirit of the multi-domain calibration mix Cerebras recommends for REAP; the specific configs (MoT code/math/science split, SWE-smith tool split) are optimized for agentic workloads