Instructions for using ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H with libraries and local apps.

Libraries

How to use ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H with Docker Model Runner:
```
docker model run hf.co/ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H
```

MiniMax-M2.7 AWQ-G32-STRIX-2H

MiniMax-M2.7-AWQ-G32-STRIX-2H is a mixed-precision AWQ quantization of amd/MiniMax-M2.7-BF16, built for two-node AMD Strix Halo (gfx1151) inference with vLLM + Ray tensor parallelism.

The quantization recipe keeps attention, routing, embeddings, normalization, and the final four main-model MoE expert layers in BF16, while quantizing the bulk of the MoE expert weights to INT4 W4A16 AWQ with group size 32. The goal is to preserve long-context behavior and reasoning quality while fitting MiniMax-M2.7 into a 2× Strix Halo deployment target.

Model details

| Field | Value |
| --- | --- |
| Public name | `MiniMax-M2.7-AWQ-G32-STRIX-2H` |
| Suggested vLLM served name | `minimax-m2-7-awq-g32-strix-2h` |
| Base model | `amd/MiniMax-M2.7-BF16` |
| Base revision | `92d4d55827de5231e493f0cf6e66e1b255749592` |
| Quantization format | compressed-tensors AWQ metadata + safetensors |
| Weight precision | Mixed BF16 + INT4 W4A16 |
| INT4 group size | 32 |
| Estimated model size in memory | ~145 GiB (155.27 GB) |
| Safetensors shards | 32 |
| Target runtime | vLLM + Ray tensor parallelism on 2× Strix Halo |
| Configured/tested max context | 196,608 tokens |
| Estimated memory-budget context ceiling | ~230K-280K tokens for one active sequence, depending on runtime overhead and how much Strix Halo UMA is exposed to ROCm/vLLM |

Intended deployment target

This quant is intended for a two-system AMD Strix Halo setup:

2× Strix Halo / gfx1151 GPUs
vLLM OpenAI-compatible serving
Ray distributed executor
tensor parallel size 2
ROCm-based runtime

It is not designed for single-GPU Strix Halo serving. The model size and long-context KV cache budget assume tensor parallelism across two Strix Halo systems.

Strix Halo vLLM setup

For the Strix Halo ROCm/vLLM environment, use the Strix Halo vLLM toolbox:

kyuz0/amd-strix-halo-vllm-toolboxes

That project provides a Strix Halo-oriented vLLM container/toolbox environment for AMD Ryzen AI Max / Strix Halo (gfx1151) systems. The model-specific settings used for this quant are listed below.

vLLM launch example

vllm serve /path/to/model \
  --served-model-name minimax-m2-7-awq-g32-strix-2h \
  --host 127.0.0.1 \
  --port "${PORT:-8000}" \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --enforce-eager \
  --gpu-memory-utilization 0.92 \
  --max-model-len 196608 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 20480 \
  --dtype auto \
  --load-format safetensors \
  --trust-remote-code \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice \
  --override-generation-config '{"max_new_tokens": 10000, "temperature": 0.2, "top_p": 0.9, "repetition_penalty": 1.08}'
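
Once the server from the launch example is running, requests go to the OpenAI-compatible endpoint under the served model name. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on 127.0.0.1:8000 (the default of `${PORT:-8000}`); the helper names `build_chat_request` and `ask`, and the `max_tokens` and `timeout` values, are illustrative choices rather than part of the original setup.

```python
import json
import urllib.request

# Assumed to match the launch example: --host 127.0.0.1, default port 8000,
# and --served-model-name minimax-m2-7-awq-g32-strix-2h.
BASE_URL = "http://127.0.0.1:8000"
SERVED_MODEL = "minimax-m2-7-awq-g32-strix-2h"


def build_chat_request(prompt, base_url=BASE_URL, model=SERVED_MODEL, max_tokens=256):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative value
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=600):
    """Send the request and return the assistant message content."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Compute 47*53 step by step."))` sends a single chat request. The generous timeout is a guess meant to leave room for long prefill on long-context prompts.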

Recommended environment flags for ROCm/vLLM testing:

export VLLM_ROCM_USE_AITER=0
export OMP_NUM_THREADS=1
export TOKENIZERS_PARALLELISM=false
export TORCHDYNAMO_DISABLE=1
export RAY_CGRAPH_get_timeout=1800
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_SAMPLER=0

If this mixed BF16/INT4 MoE artifact hits a ROCm/vLLM MoE backend issue, also test with:

export VLLM_USE_FLASHINFER_MOE_FP16=0

Backend behavior can vary across vLLM, ROCm, and toolbox builds.

Quantization recipe

Precision map

| Component | Precision | Rationale |
| --- | --- | --- |
| Self-attention `q/k/v/o` projections | BF16 | Preserves long-context coherence; attention was intentionally not quantized. |
| Router / MoE gate | BF16 | Protects expert routing quality at low memory cost. |
| Embeddings and `lm_head` | BF16 | Standard low-risk preservation choice. |
| LayerNorms | BF16 | Small memory cost; avoids unnecessary numerical risk. |
| Main-model MoE experts, layers 0-57, `w1/w2/w3` | INT4 W4A16 AWQ, group size 32 | Main parameter savings. |
| Main-model MoE experts, layers 58-61, `w1/w2/w3` | BF16 | Late-layer carve-out intended to protect reasoning/generation behavior. |
| MTP module experts | INT4 W4A16, group size 32 | Quantized by the broad expert target pattern. |

AWQ recipe

The recipe uses a single AWQModifier, rather than splitting AWQ smoothing and quantization into separate modifier passes.

default_stage:
  default_modifiers:
    AWQModifier:
      mappings:
        - smooth_layer: "re:.*post_attention_layernorm$"
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: "re:.*w3$"
          balance_layers: ["re:.*w2$"]
      ignore:
        - "lm_head"
        - "embed_tokens"
        - "re:.*self_attn.*"
        - "re:.*block_sparse_moe\\.gate$"
        - "re:.*\\.layers\\.(58|59|60|61)\\.block_sparse_moe\\.experts\\.[0-9]+\\.(w1|w2|w3)$"
      config_groups:
        mlp_experts_projections:
          targets:
            - "re:.*block_sparse_moe\\.experts\\.[0-9]+\\.(w1|w2|w3)$"
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            observer: minmax
      duo_scaling: true
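
To see how the `ignore` and `targets` patterns in the recipe interact, the regex entries can be tried against a few module names. This is only a sketch: the module names are hypothetical examples in the style the patterns imply, the plain `lm_head` and `embed_tokens` entries are left out, and it assumes the `re:` patterns are applied with `re.match` semantics and that `ignore` takes precedence over `targets`.

```python
import re

# Regex entries copied from the recipe, with the YAML double-backslash escapes
# resolved to single backslashes and the "re:" prefix removed.
IGNORE = [
    r".*self_attn.*",
    r".*block_sparse_moe\.gate$",
    r".*\.layers\.(58|59|60|61)\.block_sparse_moe\.experts\.[0-9]+\.(w1|w2|w3)$",
]
TARGET = r".*block_sparse_moe\.experts\.[0-9]+\.(w1|w2|w3)$"


def classify(module_name):
    """Return 'ignored', 'int4', or 'untouched' for a module name."""
    # Assumption: an ignore match wins over a target match.
    if any(re.match(pattern, module_name) for pattern in IGNORE):
        return "ignored"
    if re.match(TARGET, module_name):
        return "int4"
    return "untouched"


# Hypothetical module names, used only to exercise the patterns.
for name in [
    "model.layers.57.block_sparse_moe.experts.0.w1",
    "model.layers.58.block_sparse_moe.experts.12.w2",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.block_sparse_moe.gate",
    "model.layers.3.post_attention_layernorm",
]:
    print(f"{name}: {classify(name)}")
```

Under these assumptions, expert projections in layers 0-57 fall through to the INT4 target, while layers 58-61, attention, and the router gate stay in BF16, which is consistent with the precision map.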

Calibration

| Setting | Value |
| --- | --- |
| Calibration samples | 256 |
| Max calibration sequence length | 4,096 |
| Seed | 42 |
| Quantization hardware | 8× A100 80GB, 1+ TB NVMe scratch, ~512 GB system RAM |

Calibration data mix:

40% NVIDIA Llama-Nemotron SFT chat/science
25% code from NVIDIA Llama-Nemotron SFT code and bigcode/the-stack-smol
20% math from NVIDIA Llama-Nemotron SFT math, GSM8K, and MATH-500
15% long-context samples from DKYoon/SlimPajama-6B with at least 4K-token texts
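
The card gives the mix as percentages of 256 samples but not the exact per-bucket counts, and 40%, 20%, and 15% of 256 are not whole numbers. The sketch below shows one possible way to turn the percentages into integer counts (largest-remainder rounding). The bucket names and the rounding rule are assumptions for illustration, not the procedure actually used.

```python
# Percentages from the calibration mix; 256 samples from the calibration table.
# Bucket names are illustrative labels.
MIX_PERCENT = {
    "chat_science": 40,
    "code": 25,
    "math": 20,
    "long_context": 15,
}
TOTAL_SAMPLES = 256


def allocate(total, mix_percent):
    """Split `total` samples by integer percentages using largest-remainder rounding."""
    assert sum(mix_percent.values()) == 100
    counts = {k: total * p // 100 for k, p in mix_percent.items()}
    remainders = {k: total * p % 100 for k, p in mix_percent.items()}
    leftover = total - sum(counts.values())
    # Give the leftover samples to the buckets with the largest remainders
    # (ties broken by dictionary order, an arbitrary choice for this sketch).
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts


print(allocate(TOTAL_SAMPLES, MIX_PERCENT))
```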

Why this quantization recipe is different

This quant intentionally spends more memory than smaller all-expert INT4 quants. The main design choices are:

Group size 32 AWQ: smaller groups increase metadata/scale overhead but generally improve weight reconstruction quality compared with larger groups.
BF16 attention: attention projections are left unquantized to protect long-context behavior.
BF16 router gates: MoE routing is left unquantized to avoid compounding expert-selection errors.
BF16 final expert layers: the last four main-model expert layers are preserved in BF16 as a late-generation/reasoning carve-out.
Diverse calibration set: calibration uses code, math, chat/science, and long-context samples instead of a single-domain corpus.
Unified AWQ modifier: smoothing and quantization remain in one AWQ modifier path, avoiding the earlier split-modifier failure mode observed during development.

These choices make the artifact larger than an all-expert INT4 quant, but are intended to preserve more of the BF16 model's behavior under long-context and reasoning-heavy workloads.
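
The group-size trade-off can be made concrete with a rough storage estimate. The sketch below assumes one 16-bit scale per group and no stored zero point (the recipe sets `symmetric: true`); the 16-bit scale width is an assumption about the storage format, and the function name is illustrative.

```python
def effective_bits_per_weight(num_bits=4, group_size=32, scale_bits=16, zero_point_bits=0):
    """Average storage bits per quantized weight, including per-group metadata."""
    return num_bits + (scale_bits + zero_point_bits) / group_size


for group_size in (32, 64, 128):
    print(group_size, effective_bits_per_weight(group_size=group_size))
```

Under these assumptions, group size 32 costs 4.5 bits per quantized weight versus 4.125 bits at group size 128, roughly 9% more storage for the INT4 portion in exchange for finer-grained scales.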

Memory and context behavior on 2× Strix Halo

The model weights occupy approximately 145 GiB in memory. With vLLM tensor parallelism across two Strix Halo systems, the tested configuration uses:

--tensor-parallel-size 2
--max-model-len 196608
--max-num-seqs 2
--max-num-batched-tokens 20480
--gpu-memory-utilization 0.92

Estimated BF16 KV-cache usage for MiniMax-M2.7 at 196,608 tokens:

| Active sequences | KV cache estimate | Practical implication |
| --- | --- | --- |
| 1 | ~46.51 GiB | The configured 196,608-token context is the intended long-context target. |
| 2 | ~93.03 GiB | Two simultaneous full-length 196K requests are not expected to fit comfortably. |

With all remaining memory assigned to KV cache, the one-sequence theoretical context ceiling is roughly 230K-280K tokens depending on runtime overhead and how much Strix Halo unified memory ROCm/vLLM exposes as usable GPU memory. The public serving configuration is still capped at 196,608 tokens because that is the tested long-context target with activation and startup headroom.

In this configuration, `--max-num-seqs 2` should be read as a peak-concurrency limit, not as a guarantee that two full-length requests fit. With two simultaneous long-context requests, the practical context per request is expected to be substantially lower than 196K, roughly 110K-120K tokens.
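
The KV-cache estimates above can be approximated with the usual formula: two tensors (keys and values) per layer, per KV head, per token. The architecture values below are assumptions that this card does not state (only the layer count is implied by main-model layers 0-61), so they should be checked against the model's `config.json` before relying on the numbers.

```python
# Assumed architecture values (NOT stated in this card; check config.json).
NUM_LAYERS = 62        # consistent with main-model layers 0-61 in the precision map
NUM_KV_HEADS = 8       # assumption
HEAD_DIM = 128         # assumption
BYTES_PER_ELEMENT = 2  # BF16

GIB = 1024 ** 3


def kv_cache_gib(tokens, sequences=1):
    """KV-cache size in GiB: keys and values for every layer, KV head, and token."""
    per_token_bytes = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token_bytes * tokens * sequences / GIB


print(kv_cache_gib(196_608))               # one full-length sequence
print(kv_cache_gib(196_608, sequences=2))  # two full-length sequences
```

With these assumed values the sketch gives 46.5 GiB and 93.0 GiB, close to the ~46.51 GiB and ~93.03 GiB estimates in the table. The small difference may come from terms this sketch does not model.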

Benchmarks and validation

Benchmark results

The following benchmark comparison was run against the unquantized reference, QuantTrio's AWQ quant, and this quant:

| Benchmark | Unquantized | QuantTrio | MiniMax-M2.7-AWQ-G32-STRIX-2H |
| --- | --- | --- | --- |
| HumanEval | 91.46% | 84.15% | 90.24% |
| MBPP Plus | 93.29% | 87.20% | 92.07% |
| MMLU College CS | 96.00% | 99.00% | 97.00% |
| MMLU Computer Security | 90.00% | 89.00% | 88.00% |
| MMLU Machine Learning | 91.96% | 91.07% | 89.29% |
| GSM8K slice | 100.00% | 100.00% | 100.00% |

Summary versus QuantTrio:

HumanEval: +6.09 percentage points over QuantTrio, within 1.22 points of the unquantized reference.
MBPP Plus: +4.87 percentage points over QuantTrio, within 1.22 points of the unquantized reference.
GSM8K slice: tied at 100.00% across all three.

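These results match the recipe goal on the code benchmarks: the larger mixed BF16/INT4 artifact recovers most of the unquantized performance on HumanEval and MBPP Plus and substantially outperforms the smaller QuantTrio AWQ quant there. The MMLU subsets are mixed: all three models fall within a few points of each other, and this quant trails QuantTrio on each of the three subsets by 1.00 to 2.00 points.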
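
The percentage-point differences quoted in the summary can be recomputed from the benchmark table. The snippet below is a small sketch that does so for every row; the dictionary keys are illustrative names, and the scores are copied from the table.

```python
# Scores (percent) copied from the benchmark table.
RESULTS = {
    "HumanEval": {"unquantized": 91.46, "quanttrio": 84.15, "this_quant": 90.24},
    "MBPP Plus": {"unquantized": 93.29, "quanttrio": 87.20, "this_quant": 92.07},
    "MMLU College CS": {"unquantized": 96.00, "quanttrio": 99.00, "this_quant": 97.00},
    "MMLU Computer Security": {"unquantized": 90.00, "quanttrio": 89.00, "this_quant": 88.00},
    "MMLU Machine Learning": {"unquantized": 91.96, "quanttrio": 91.07, "this_quant": 89.29},
    "GSM8K slice": {"unquantized": 100.00, "quanttrio": 100.00, "this_quant": 100.00},
}


def deltas(row):
    """Return (this quant minus QuantTrio, this quant minus unquantized) in points."""
    vs_quanttrio = round(row["this_quant"] - row["quanttrio"], 2)
    vs_unquantized = round(row["this_quant"] - row["unquantized"], 2)
    return vs_quanttrio, vs_unquantized


for name, row in RESULTS.items():
    vs_quanttrio, vs_unquantized = deltas(row)
    print(f"{name:24s} vs QuantTrio {vs_quanttrio:+.2f}  vs unquantized {vs_unquantized:+.2f}")
```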

Artifact validation

This artifact has been validated for structural recovery and serving:

32 safetensors shards recovered.
Artifact size verified at 155,271,743,469 bytes.
vLLM route started successfully on a 2× Strix Halo Ray setup.
OpenAI-compatible `/v1/chat/completions` smoke test returned HTTP 200.
The smoke prompt "Compute 47*53 step by step." returned the correct result, 2491.
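
A downloaded copy can be checked against the shard count and byte size listed above with a short script. This is a sketch under assumptions: the card does not say whether the 155,271,743,469-byte figure covers only the `*.safetensors` shards or every file in the repository, so a mismatch in total size is not necessarily an error. The function names are illustrative.

```python
from pathlib import Path

EXPECTED_SHARDS = 32
EXPECTED_BYTES = 155_271_743_469  # from the validation list


def summarize_artifact(model_dir):
    """Return (number of *.safetensors files, their combined size in bytes)."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    return len(shards), sum(path.stat().st_size for path in shards)


def to_gb_and_gib(num_bytes):
    """Convert bytes to decimal gigabytes (10^9) and binary gibibytes (2^30)."""
    return num_bytes / 1000**3, num_bytes / 1024**3


gb, gib = to_gb_and_gib(EXPECTED_BYTES)
print(f"{gb:.2f} GB, {gib:.2f} GiB")
```

Calling `summarize_artifact("/path/to/model")` and comparing the result with `EXPECTED_SHARDS` and `EXPECTED_BYTES` gives a quick structural check. The conversion also shows why the card quotes both 155.27 GB and roughly 145 GiB for the same artifact.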

Recommended additional evaluation set for users who want to reproduce or extend the comparison:

HumanEval
MBPP / MBPP+
MMLU subsets
GSM8K
MATH-500
AIME 2024
GPQA
RULER at long-context lengths such as 32K, 64K, and 128K

Intended use

Local and research inference on 2× Strix Halo systems.
Long-context testing with vLLM on AMD ROCm.
A/B testing mixed BF16/INT4 precision strategies for MiniMax-M2.7.
Reasoning, code, math, and long-context experiments where local inference is preferred.

Limitations

Not intended for single-GPU Strix Halo deployment.
Benchmarks are limited to the listed suite and should not be interpreted as broad safety or capability evaluation.
The artifact is larger than data-free all-expert INT4 quants because it intentionally preserves BF16 attention, routing, and late expert layers.
Effective context depends on vLLM version, ROCm build, allocator behavior, runtime overhead, and concurrency.

License

This quant is a derivative of amd/MiniMax-M2.7-BF16 / MiniMaxAI/MiniMax-M2.7 and follows the upstream MiniMax-M2.7 license.

MiniMax-M2.7 is released under a custom non-commercial license: non-commercial use is permitted under MIT-style terms, while commercial use requires prior written authorization from MiniMax. See the included LICENSE file and the upstream license for details.

Citation / attribution

Base model: amd/MiniMax-M2.7-BF16
Strix Halo vLLM toolbox: kyuz0/amd-strix-halo-vllm-toolboxes
Quantization approach: AWQ W4A16 with compressed-tensors metadata
