MiniMax-M2.7 AWQ-G32-STRIX-2H
MiniMax-M2.7-AWQ-G32-STRIX-2H is a mixed-precision AWQ quantization of amd/MiniMax-M2.7-BF16, built for two-node AMD Strix Halo (gfx1151) inference with vLLM + Ray tensor parallelism.
The quantization recipe keeps attention, routing, embeddings, normalization, and the final four main-model MoE expert layers in BF16, while quantizing the bulk of the MoE expert weights to INT4 W4A16 AWQ with group size 32. The goal is to preserve long-context behavior and reasoning quality while fitting MiniMax-M2.7 into a 2× Strix Halo deployment target.
Model details
| Field | Value |
|---|---|
| Public name | MiniMax-M2.7-AWQ-G32-STRIX-2H |
| Suggested vLLM served name | minimax-m2-7-awq-g32-strix-2h |
| Base model | amd/MiniMax-M2.7-BF16 |
| Base revision | 92d4d55827de5231e493f0cf6e66e1b255749592 |
| Quantization format | compressed-tensors AWQ metadata + safetensors |
| Weight precision | Mixed BF16 + INT4 W4A16 |
| INT4 group size | 32 |
| Estimated model size in memory | ~145 GiB / 155.27 GB |
| Safetensors shards | 32 |
| Target runtime | vLLM + Ray tensor parallelism on 2× Strix Halo |
| Configured/tested max context | 196,608 tokens |
| Estimated memory-budget context ceiling | ~230K-280K tokens for one active sequence, depending on runtime overhead and how much Strix Halo UMA is exposed to ROCm/vLLM |
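The two size figures in the table are the same quantity expressed in binary and decimal units. A minimal check, using the byte count reported in the artifact validation section of this card:

```python
# Convert the reported artifact size into decimal GB and binary GiB.
ARTIFACT_BYTES = 155_271_743_469  # byte count reported under "Artifact validation"


def to_gb(n_bytes: int) -> float:
    """Decimal gigabytes (10**9 bytes)."""
    return n_bytes / 10**9


def to_gib(n_bytes: int) -> float:
    """Binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


size_gb = to_gb(ARTIFACT_BYTES)
size_gib = to_gib(ARTIFACT_BYTES)
print(f"{size_gb:.2f} GB = {size_gib:.1f} GiB")  # prints: 155.27 GB = 144.6 GiB
```

The table rounds the binary figure up to ~145 GiB.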
Intended deployment target
This quant is intended for a two-system AMD Strix Halo setup:
- 2× Strix Halo / gfx1151 GPUs
- vLLM OpenAI-compatible serving
- Ray distributed executor
- tensor parallel size 2
- ROCm-based runtime
It is not designed for single-GPU Strix Halo serving. The model size and long-context KV cache budget assume tensor parallelism across two Strix Halo systems.
Strix Halo vLLM setup
For the Strix Halo ROCm/vLLM environment, use the Strix Halo vLLM toolbox, kyuz0/amd-strix-halo-vllm-toolboxes.
That project provides a Strix Halo-oriented vLLM container/toolbox environment for AMD Ryzen AI Max / Strix Halo (gfx1151) systems. The model-specific settings used for this quant are listed below.
vLLM launch example
vllm serve /path/to/model \
--served-model-name minimax-m2-7-awq-g32-strix-2h \
--host 127.0.0.1 \
--port "${PORT:-8000}" \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--enforce-eager \
--gpu-memory-utilization 0.92 \
--max-model-len 196608 \
--max-num-seqs 2 \
--max-num-batched-tokens 20480 \
--dtype auto \
--load-format safetensors \
--trust-remote-code \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--override-generation-config '{"max_new_tokens": 10000, "temperature": 0.2, "top_p": 0.9, "repetition_penalty": 1.08}'
Recommended environment flags for ROCm/vLLM testing:
export VLLM_ROCM_USE_AITER=0
export OMP_NUM_THREADS=1
export TOKENIZERS_PARALLELISM=false
export TORCHDYNAMO_DISABLE=1
export RAY_CGRAPH_get_timeout=1800
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_SAMPLER=0
If this mixed BF16/INT4 MoE artifact hits a ROCm/vLLM MoE backend issue, also test with:
export VLLM_USE_FLASHINFER_MOE_FP16=0
Backend behavior can vary across vLLM, ROCm, and toolbox builds.
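Once the server is up, it can be queried through the OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch, assuming the server listens on 127.0.0.1:8000 (the default `PORT` in the launch example) and uses the served model name from that example. The helper names `build_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"        # assumption: default PORT from the launch example
MODEL = "minimax-m2-7-awq-g32-strix-2h"   # must match --served-model-name


def build_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 600.0) -> str:
    """Send one user message and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Compute 47*53 step by step.")` sends the same smoke prompt used for artifact validation. The long default timeout is a guess that leaves room for slow first-token latency on a two-node setup.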
Quantization recipe
Precision map
| Component | Precision | Rationale |
|---|---|---|
| Self-attention q/k/v/o projections | BF16 | Preserves long-context coherence; attention was intentionally not quantized. |
| Router / MoE gate | BF16 | Protects expert routing quality at low memory cost. |
| Embeddings and lm_head | BF16 | Standard low-risk preservation choice. |
| LayerNorms | BF16 | Small memory cost; avoids unnecessary numerical risk. |
| Main-model MoE experts, layers 0-57, w1/w2/w3 | INT4 W4A16 AWQ, group size 32 | Main parameter savings. |
| Main-model MoE experts, layers 58-61, w1/w2/w3 | BF16 | Late-layer carve-out intended to protect reasoning/generation behavior. |
| MTP module experts | INT4 W4A16, group size 32 | Quantized by the broad expert target pattern. |
AWQ recipe
The recipe uses a single AWQModifier, rather than splitting AWQ smoothing and quantization into separate modifier passes.
default_stage:
  default_modifiers:
    AWQModifier:
      mappings:
        - smooth_layer: "re:.*post_attention_layernorm$"
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: "re:.*w3$"
          balance_layers: ["re:.*w2$"]
      ignore:
        - "lm_head"
        - "embed_tokens"
        - "re:.*self_attn.*"
        - "re:.*block_sparse_moe\\.gate$"
        - "re:.*\\.layers\\.(58|59|60|61)\\.block_sparse_moe\\.experts\\.[0-9]+\\.(w1|w2|w3)$"
      config_groups:
        mlp_experts_projections:
          targets:
            - "re:.*block_sparse_moe\\.experts\\.[0-9]+\\.(w1|w2|w3)$"
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            observer: minmax
      duo_scaling: true
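To see how the `targets` and `ignore` patterns divide the model, the sketch below applies the regex entries from the recipe to a few module names. The module names are hypothetical, chosen only to fit the patterns. Treating a `re:` entry as a Python `re.match` against the full module name is an assumption about the quantization tool's matching logic. The plain-name entries (`lm_head`, `embed_tokens`) are omitted because they are never hit by the expert target anyway.

```python
import re

# Regex bodies copied from the recipe (YAML "\\." becomes "\." in a raw string).
TARGET = r".*block_sparse_moe\.experts\.[0-9]+\.(w1|w2|w3)$"
IGNORE = [
    r".*self_attn.*",
    r".*block_sparse_moe\.gate$",
    r".*\.layers\.(58|59|60|61)\.block_sparse_moe\.experts\.[0-9]+\.(w1|w2|w3)$",
]


def precision(module_name: str) -> str:
    """Return the precision this sketch assigns to a module name."""
    if any(re.match(pattern, module_name) for pattern in IGNORE):
        return "BF16"  # explicitly excluded from quantization
    if re.match(TARGET, module_name):
        return "INT4"  # matched by the expert target pattern
    return "BF16"      # never targeted, so left untouched


# Hypothetical module names, for illustration only.
examples = [
    "model.layers.57.block_sparse_moe.experts.0.w1",
    "model.layers.58.block_sparse_moe.experts.0.w1",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.block_sparse_moe.gate",
]
for name in examples:
    print(f"{precision(name):5s} {name}")
```

Under these assumptions, the layer 58-61 carve-out works because the `ignore` entry is checked before the broader expert target.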
Calibration
| Setting | Value |
|---|---|
| Calibration samples | 256 |
| Max calibration sequence length | 4,096 |
| Seed | 42 |
| Quantization hardware | 8× A100 80GB, 1+ TB NVMe scratch, ~512 GB system RAM |
Calibration data mix:
- 40% NVIDIA Llama-Nemotron SFT chat/science
- 25% code from NVIDIA Llama-Nemotron SFT code and bigcode/the-stack-smol
- 20% math from NVIDIA Llama-Nemotron SFT math, GSM8K, and MATH-500
- 15% long-context samples from DKYoon/SlimPajama-6B with at least 4K-token texts
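The percentages do not divide 256 samples evenly (40% of 256 is 102.4), so some rounding rule is needed. The card does not say which rule was used. The sketch below shows one possible approach, largest-remainder allocation; the function name `allocate` and the category keys are illustrative.

```python
def allocate(total: int, percents: dict[str, int]) -> dict[str, int]:
    """Split `total` samples by integer percentages using largest remainders."""
    assert sum(percents.values()) == 100
    # divmod keeps everything in integers: (whole samples, remainder out of 100)
    parts = {name: divmod(total * pct, 100) for name, pct in percents.items()}
    counts = {name: whole for name, (whole, _) in parts.items()}
    leftover = total - sum(counts.values())
    # Hand leftover samples to the largest remainders; ties keep listing order.
    by_remainder = sorted(parts, key=lambda name: parts[name][1], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


mix = {"chat_science": 40, "code": 25, "math": 20, "long_context": 15}
counts = allocate(256, mix)
print(counts, sum(counts.values()))
```

Any rule that keeps the total at 256 and stays within one sample of each percentage would be equally consistent with the stated mix.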
Why this quantization recipe is different
This quant intentionally spends more memory than smaller all-expert INT4 quants. The main design choices are:
- Group size 32 AWQ: smaller groups increase metadata/scale overhead but generally improve weight reconstruction quality compared with larger groups.
- BF16 attention: attention projections are left unquantized to protect long-context behavior.
- BF16 router gates: MoE routing is left unquantized to avoid compounding expert-selection errors.
- BF16 final expert layers: the last four main-model expert layers are preserved in BF16 as a late-generation/reasoning carve-out.
- Diverse calibration set: calibration uses code, math, chat/science, and long-context samples instead of a single-domain corpus.
- Unified AWQ modifier: smoothing and quantization remain in one AWQ modifier path, avoiding the earlier split-modifier failure mode observed during development.
These choices make the artifact larger than a data-free INT4 AWQ quant, but are intended to preserve more of the BF16 model's behavior under long-context and reasoning-heavy workloads.
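The group-size trade-off can be illustrated with plain round-to-nearest group quantization. This is deliberately not AWQ: there is no activation-aware scaling, only symmetric INT4 with one min-max scale per group, applied to synthetic Gaussian weights. The 16-bit scale width in `bits_per_weight` is an assumption about storage, not a figure taken from the artifact.

```python
import random


def quantize_dequantize(weights, group_size, bits=4):
    """Symmetric per-group quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / qmax if absmax > 0 else 1.0
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))
            restored.append(q * scale)
    return restored


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bits_per_weight(group_size, bits=4, scale_bits=16):
    """Effective storage per weight: payload plus one scale shared per group."""
    return bits + scale_bits / group_size


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic weights
errors = {g: mse(weights, quantize_dequantize(weights, g)) for g in (32, 128)}
for g, err in errors.items():
    print(f"group={g:3d} bits/weight={bits_per_weight(g):.3f} mse={err:.5f}")
```

Smaller groups see a smaller local maximum, so each scale step is finer, at the cost of storing four times as many scales as group size 128.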
Memory and context behavior on 2× Strix Halo
The model weights occupy approximately 145 GiB in memory. With vLLM tensor parallelism across two Strix Halo systems, the tested configuration uses:
- --tensor-parallel-size 2
- --max-model-len 196608
- --max-num-seqs 2
- --max-num-batched-tokens 20480
- --gpu-memory-utilization 0.92
Estimated BF16 KV-cache usage for MiniMax-M2.7 at 196,608 tokens:
| Active sequences | KV cache estimate | Practical implication |
|---|---|---|
| 1 | ~46.51 GiB | The configured 196,608-token context is the intended long-context target. |
| 2 | ~93.03 GiB | Two simultaneous full-length 196K requests are not expected to fit comfortably. |
With all remaining memory assigned to KV cache, the one-sequence theoretical context ceiling is roughly 230K-280K tokens depending on runtime overhead and how much Strix Halo unified memory ROCm/vLLM exposes as usable GPU memory. The public serving configuration is still capped at 196,608 tokens because that is the tested long-context target with activation and startup headroom.
In this configuration, --max-num-seqs=2 should be interpreted as a peak-concurrency setting. For two simultaneous long-context requests, practical effective context is expected to be substantially lower than 196K per request, around the 110K-120K range per request for this size class.
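Because KV-cache size grows linearly with the number of cached tokens, the figures above can be rescaled to other context lengths and concurrency levels. The sketch below derives a per-token cost from the 46.51 GiB estimate rather than from the model architecture, so it inherits whatever assumptions went into that estimate. The function names are illustrative.

```python
GIB = 2 ** 30
TESTED_CONTEXT = 196_608            # --max-model-len from the tested configuration
KV_GIB_AT_TESTED_CONTEXT = 46.51    # one-sequence estimate from the table above


def kv_bytes_per_token() -> float:
    """Per-token KV cost implied by the card's one-sequence estimate."""
    return KV_GIB_AT_TESTED_CONTEXT * GIB / TESTED_CONTEXT


def kv_gib(tokens: int, num_seqs: int = 1) -> float:
    """KV-cache size in GiB for `num_seqs` sequences of `tokens` tokens each."""
    return kv_bytes_per_token() * tokens * num_seqs / GIB


def max_tokens_per_seq(budget_gib: float, num_seqs: int = 1) -> int:
    """Longest per-sequence context that fits in a KV budget shared evenly."""
    return int(budget_gib * GIB / (kv_bytes_per_token() * num_seqs))


# Budget implied by the lower and upper single-sequence ceiling estimates.
for ceiling in (230_000, 280_000):
    budget = kv_gib(ceiling)
    print(f"ceiling={ceiling} budget={budget:.1f} GiB "
          f"two-seq context={max_tokens_per_seq(budget, 2)}")
```

Splitting the budget implied by the 230K-token ceiling across two sequences gives roughly 115K tokens each, which falls in the 110K-120K range quoted above.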
Benchmarks and validation
Benchmark results
The following benchmark comparison was run against the unquantized reference, QuantTrio's AWQ quant, and this quant:
| Benchmark | Unquantized | QuantTrio | MiniMax-M2.7-AWQ-G32-STRIX-2H |
|---|---|---|---|
| HumanEval | 91.46% | 84.15% | 90.24% |
| MBPP Plus | 93.29% | 87.20% | 92.07% |
| MMLU College CS | 96.00% | 99.00% | 97.00% |
| MMLU Computer Security | 90.00% | 89.00% | 88.00% |
| MMLU Machine Learning | 91.96% | 91.07% | 89.29% |
| GSM8K slice | 100.00% | 100.00% | 100.00% |
Summary versus QuantTrio:
- HumanEval: +6.09 percentage points over QuantTrio, within 1.22 points of the unquantized reference.
- MBPP Plus: +4.87 percentage points over QuantTrio, within 1.22 points of the unquantized reference.
- GSM8K slice: tied at 100.00% across all three.
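These results match the recipe goal: the larger mixed BF16/INT4 artifact recovers most of the unquantized code benchmark performance while substantially outperforming the smaller QuantTrio AWQ quant on HumanEval and MBPP Plus. On the three MMLU subsets, QuantTrio scores 1 to 2 points higher than this quant, and all three models fall within 3 points of each other.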
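The percentage-point differences can be recomputed directly from the benchmark table. A small sketch, with the scores copied from the table in the order unquantized, QuantTrio, this quant:

```python
# (unquantized, QuantTrio, this quant), in percent, copied from the table above.
SCORES = {
    "HumanEval": (91.46, 84.15, 90.24),
    "MBPP Plus": (93.29, 87.20, 92.07),
    "MMLU College CS": (96.00, 99.00, 97.00),
    "MMLU Computer Security": (90.00, 89.00, 88.00),
    "MMLU Machine Learning": (91.96, 91.07, 89.29),
    "GSM8K slice": (100.00, 100.00, 100.00),
}


def deltas(benchmark: str) -> tuple[float, float]:
    """Return (gain over QuantTrio, gap to unquantized) in percentage points."""
    reference, quanttrio, this_quant = SCORES[benchmark]
    return round(this_quant - quanttrio, 2), round(reference - this_quant, 2)


for name in SCORES:
    gain, gap = deltas(name)
    print(f"{name:24s} vs QuantTrio {gain:+.2f}  gap to unquantized {gap:+.2f}")
```

A negative gain means QuantTrio scored higher on that benchmark, and a negative gap means this quant scored above the unquantized reference.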
Artifact validation
This artifact has been validated for structural recovery and serving:
- 32 safetensors shards recovered.
- Artifact size verified at 155,271,743,469 bytes.
- vLLM route started successfully on a 2× Strix Halo Ray setup.
- OpenAI-compatible /v1/chat/completions smoke test returned HTTP 200.
- Smoke prompt "Compute 47*53 step by step." returned the correct result, 2491.
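The structural checks can be repeated on a local download. The sketch below counts `*.safetensors` files and sums file sizes in a directory. Whether the reported byte count covers every file or only the shards is not stated, so summing all files is an assumption; the names `summarize` and `matches_card` are illustrative. The demo at the end runs on a throwaway directory, not on the real artifact.

```python
import tempfile
from pathlib import Path

EXPECTED_SHARDS = 32
EXPECTED_BYTES = 155_271_743_469  # reported artifact size


def summarize(model_dir) -> tuple[int, int]:
    """Return (number of safetensors shards, total bytes of all files)."""
    root = Path(model_dir)
    shards = sorted(root.glob("*.safetensors"))
    total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return len(shards), total


def matches_card(model_dir) -> bool:
    """True if shard count and total size equal the values reported in the card."""
    return summarize(model_dir) == (EXPECTED_SHARDS, EXPECTED_BYTES)


# Demo on a temporary directory with two fake shards and one config file.
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [
        ("model-00001.safetensors", b"abc"),
        ("model-00002.safetensors", b"def"),
        ("config.json", b"{}\n\n\n"),
    ]:
        (Path(tmp) / name).write_bytes(payload)
    demo = summarize(tmp)
    demo_ok = matches_card(tmp)

print(demo, demo_ok)
```

If the total differs on a real download while the shard count is 32, the first thing to check is whether extra files such as caches or metadata are being counted.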
Recommended additional evaluation set for users who want to reproduce or extend the comparison:
- HumanEval
- MBPP / MBPP+
- MMLU subsets
- GSM8K
- MATH-500
- AIME 2024
- GPQA
- RULER at long-context lengths such as 32K, 64K, and 128K
Intended use
- Local and research inference on 2× Strix Halo systems.
- Long-context testing with vLLM on AMD ROCm.
- A/B testing mixed BF16/INT4 precision strategies for MiniMax-M2.7.
- Reasoning, code, math, and long-context experiments where local inference is preferred.
Limitations
- Not intended for single-GPU Strix Halo deployment.
- Benchmarks are limited to the listed suite and should not be interpreted as broad safety or capability evaluation.
- The artifact is larger than data-free all-expert INT4 quants because it intentionally preserves BF16 attention, routing, and late expert layers.
- Effective context depends on vLLM version, ROCm build, allocator behavior, runtime overhead, and concurrency.
License
This quant is a derivative of amd/MiniMax-M2.7-BF16 / MiniMaxAI/MiniMax-M2.7 and follows the upstream MiniMax-M2.7 license.
MiniMax-M2.7 is released under a custom non-commercial license: non-commercial use is permitted under MIT-style terms, while commercial use requires prior written authorization from MiniMax. See the included LICENSE file and the upstream license for details.
Citation / attribution
- Base model: amd/MiniMax-M2.7-BF16
- Strix Halo vLLM toolbox: kyuz0/amd-strix-halo-vllm-toolboxes
- Quantization approach: AWQ W4A16 with compressed-tensors metadata