m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4
v1.1.1 — router-gate quantization fix (2026-04-16)
What happened: The initial upload (2026-04-15) used ignore=["lm_head"] in the llm-compressor recipe, which meant the 62 MoE routers (block_sparse_moe.gate) got quantized along with the expert weights. vLLM's MiniMax-M2 loader expects an unquantized ReplicatedLinear router and fails at engine-init with:
```
KeyError: 'layers.0.block_sparse_moe.gate.weight_scale'        # FP8
KeyError: 'layers.0.block_sparse_moe.gate.input_global_scale'  # NVFP4
```
This is a hard load failure — the engine never initializes, so no tokens are generated. (The earlier "degraded output" framing understated the severity.)
Root cause: Missing MoE-aware entries in the llm-compressor ignore list. The correct pattern (per saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10):
```python
ignore = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]
```
Fix: This variant was re-rolled 2026-04-16 with the corrected recipe. quantization_config.ignore now lists all 62 per-layer router gates alongside lm_head.
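For reference, a minimal sketch of the corrected one-shot flow, assuming recent llm-compressor entry points (the oneshot/save API varies slightly between versions; the output directory name is illustrative, not the exact command used for this repo):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B"  # BF16 parent

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)

# NVFP4A16 weight-only quantization; routers, embeddings, and lm_head stay in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "model.embed_tokens",
        r"re:.*block_sparse_moe\.gate$",  # the 62 per-layer MoE routers
    ],
)

oneshot(model=model, recipe=recipe)

OUT = "m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4"
model.save_pretrained(OUT, save_compressed=True)  # compressed-tensors format
tokenizer.save_pretrained(OUT)
```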
Verification: config.json on this repo now contains 62 model.layers.N.block_sparse_moe.gate entries in the ignore list. Loaders should open the model without the KeyError above.
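A quick way to check this locally, assuming the compressed-tensors config layout described above (repo and key names as in this card):
```python
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(
    repo_id="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
    filename="config.json",
)
with open(cfg_path) as f:
    ignore = json.load(f)["quantization_config"]["ignore"]

gates = [n for n in ignore if "block_sparse_moe.gate" in n]
print(f"router gates ignored: {len(gates)}")    # expect 62
print(f"lm_head ignored: {'lm_head' in ignore}")
```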
Credit: Thanks to the community user who reported this first on the NVFP4-GB10 DGX Spark load. The saricles reference repo was invaluable for confirming the exact pattern.
Unaffected variants (no re-roll needed): BF16 safetensors, all GGUF quantizations.
NVFP4 quantization of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B — the first publicly available REAP 40% pruned variant of MiniMax-M2.7 — targeting NVIDIA Blackwell (B100 / B200) for native FP4 tensor-core acceleration.
| Aspect | Value |
|---|---|
| Base model | dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B (BF16) |
| Quantization | NVFP4A16 (4-bit microscaled floating point weights, FP16 activations) |
| Format | compressed-tensors (vLLM-native) |
| Tool | llmcompressor |
| File size | ~75 GB across ~25 safetensors shards |
| Ignored layers | lm_head, model.embed_tokens, and all 62 MoE router gates (block_sparse_moe.gate), kept in BF16 per the v1.1.1 fix above |
What is NVFP4?
NVFP4 is NVIDIA's 4-bit floating-point microscaling format introduced with the Blackwell architecture. It uses small block-wise scale factors to maintain quality at extreme compression, and benefits from dedicated FP4 tensor cores on B100/B200 hardware.
Compared to INT4 / AWQ quantization, NVFP4 typically preserves quality better at the same weight budget, particularly on reasoning-heavy workloads. Our REAP-pruned base model is an ideal candidate — the structural pruning has already reduced parameter count, and NVFP4 then packs each remaining weight into 4 bits.
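As a rough illustration of the microscaling idea (simplified; real NVFP4 stores E2M1 values with FP8 block scales plus a tensor-level scale), a small NumPy sketch that quantizes weights in blocks of 16 with one scale per block:
```python
import numpy as np

# Positive magnitudes representable by an E2M1 (FP4) value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, grid=FP4_GRID):
    """Map one block of weights to FP4 magnitudes, returning (quantized, scale)."""
    scale = max(np.abs(block).max() / grid[-1], 1e-12)  # one scale per block
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - grid[None, :]).argmin(axis=1)
    return np.sign(scaled) * grid[idx], scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=64).astype(np.float32)
blocks = w.reshape(-1, 16)                               # NVFP4 block size is 16
recon = np.concatenate([q * s for q, s in (quantize_block(b) for b in blocks)])
print("max abs reconstruction error:", np.abs(w - recon).max())
```
Because each block carries its own scale, a single outlier weight only degrades its own 16-value block rather than the whole tensor, which is why microscaling holds up better than per-tensor 4-bit schemes.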
Hardware & deployment
Native FP4 tensor-core acceleration requires Blackwell (B100 / B200). The quantized weights also load and run on Hopper (H100 / H200) and Ampere (A100) via FP4-to-higher-precision upcasting — functional but not at Blackwell speed.
Memory footprint: ~75 GB weights + KV cache. Recommended:
- 1× B100 / B200 (native NVFP4, best performance)
- 2× H100 80 GB or 1× H200 141 GB (functional, no native FP4 cores)
- Memory-constrained: combine with KV cache quantization (see vLLM docs and the sketch after the vLLM example below)
Inference
vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
    tensor_parallel_size=1,   # 1 for a single Blackwell GPU; set to 2 for 2x H100 80 GB
    trust_remote_code=True,
    max_model_len=32768,
)

params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
out = llm.generate(["Explain REAP pruning briefly."], params)
print(out[0].outputs[0].text)
```
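For the memory-constrained setups listed above, the KV cache can also be quantized; a hedged variant assuming your vLLM build supports the kv_cache_dtype option for this architecture:
```python
llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4",
    tensor_parallel_size=2,   # e.g. 2x H100 80 GB
    trust_remote_code=True,
    max_model_len=32768,
    kv_cache_dtype="fp8",     # FP8 KV cache roughly halves KV memory vs FP16
)
```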
TensorRT-LLM
Supported via the compressed-tensors loader in TensorRT-LLM 0.14+ with NVFP4 scheme. Consult NVIDIA's deployment guide for Blackwell-specific kernels.
Quality
Inference quality was validated on the BF16 parent via a 5 / 5 pre-publish smoke test and full HumanEval evaluation (see parent safetensors card). NVFP4A16 is expected to track FP8 / BF16 quality very closely thanks to microscaling — activations remain in FP16 so only weights are compressed.
Systematic NVFP4-on-REAP evaluation is pending; we will update this card if there is community demand.
Base model summary
| Property | Value |
|---|---|
| Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196,608 |
| Vocabulary size | 200,064 |
| Pruning | REAP 40 %, seed 42, calibration on 3 × 2,048 samples (code / math / tool) |
See the parent safetensors card for full architecture, pruning details, evaluation numbers, and the known minor layer-0 bias imperfection.
Recommended generation parameters
- temperature: 1.0
- top_p: 0.95
- top_k: 40
- repeat_penalty: 1.05
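In vLLM these map onto SamplingParams as follows (llama.cpp's repeat_penalty corresponds to vLLM's repetition_penalty):
```python
params = SamplingParams(
    temperature=1.0,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.05,
    max_tokens=2048,
)
```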
Companion repos
- Parent safetensors (BF16): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
- GGUF (Mac / llama.cpp / Ollama / LM Studio): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF
- FP8 (Hopper-native): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8
- AWQ-4bit (vLLM / HF Transformers INT4): coming soon
Citation
See the safetensors repo for full citations. Core references:
- Lasby et al., REAP the Experts (arXiv:2510.13999)
- MiniMax AI, MiniMax-M2.7
License
Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.
Published by m51Lab — open-source LLM contributions from the M51 AI OS group.