Gemma 4 31B Dense AWQ 4-bit

AWQ 4-bit quantization of Gemma 4 31B-it optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details

Base model google/gemma-4-31b-it
Architecture Dense with sliding window attention (50 SWA + 10 full attention layers)
Parameters 31B
Layers 60
Context 8K (tested)
Quantization AWQ 4-bit, group_size=128. Converted from Intel AutoRound GPTQ (sym=True) via full FP32 dequant→requant to handle 50.4% negative scales.

Performance (2x AMD Radeon AI PRO R9700, TP=2)

  • Decode speed: 15 tok/s single-user on 2x R9700
  • Launch: scripts/launch.sh gemma4-31b

Notes

Gemma 31B requires BF16 activations (FP16 overflows). Uses Triton AWQ GEMV with FP32 dequantization for decode and torch_native attention. Source GPTQ: Intel/gemma-4-31B-it-int4-AutoRound.

Known Limitations

  • Vision: WORKING — Vision encoder weights merged from BF16 base model (356 tensors, 1.15 GB in FP16). Tested: correctly identifies a red square image.
  • Triton attention: Triton decode attention degrades at 400+ tokens on RDNA4. Uses torch_native attention as workaround (15 tok/s vs potential 17 tok/s).

Usage with SGLang

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh gemma4-31b

See the RDNA4 Inference Repository for full setup instructions, patches, and benchmarks.

Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.

Downloads last month
148
Safetensors
Model size
31B params
Tensor type
I32
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support