Devstral-24B AWQ 4-bit
AWQ 4-bit quantization of Devstral Small 24B optimized for AMD RDNA4 (gfx1201) inference with SGLang.
Model Details
| Base model | mistralai/Devstral-Small-2507 |
| Architecture | Dense |
| Parameters | 24B |
| Layers | 40 |
| Context | 32K (tested), 393K (max) |
| Quantization | AWQ 4-bit, group_size=128 |
Performance (2x AMD Radeon AI PRO R9700, TP=2)
- Decode speed: 37 tok/s single-user on 2x R9700
- Launch:
scripts/launch.sh devstral
Notes
GPTQ-calibrated with 128 samples. BOS token removed from chat template (fixes <unk> output). Text-only warmup to avoid radix cache pollution from vision tokens.
Known Limitations
- Vision: WORKING. Vision tower weights preserved in original precision (
modules_to_not_convertincludesvision_tower,multi_modal_projector). Tested: correctly identifies a red square image.
Usage with SGLang
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh devstral
See the RDNA4 Inference Repository for full setup instructions, patches, and benchmarks.
Hardware
Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.
- Downloads last month
- 66
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support
Model tree for mattbucci/Devstral-24B-AWQ
Base model
mistralai/Mistral-Small-3.1-24B-Base-2503 Finetuned
mistralai/Devstral-Small-2507