Want a bigger model? Download Sarvam-105B!
Introduction
Sarvam-30B is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.
This repository provides FP8 quantized weights for Sarvam-30B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.
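For intuition on what the E4M3 format can represent, here is a minimal sketch that decodes a single FP8 (E4M3) byte into its real value, following the OCP FP8 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7). This is purely illustrative of the number format, not how ModelOpt stores or dequantizes weights internally:

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (OCP layout: 1 sign, 4 exponent, 3 mantissa, bias 7)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:   # E4M3 has NaN codes but no infinities
        return float("nan")
    if exp == 0:                     # subnormal: no implicit leading 1
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

# Largest finite magnitude is 448, which is why FP8 quantization pairs each
# tensor with a scale factor to fit values into this narrow range.
print(decode_e4m3(0x7E))  # -> 448.0
```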
A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.
Sarvam-30B is open-sourced under the Apache License. For more details, see our blog.
Architecture
The 30B MoE model is designed for throughput and memory efficiency: fewer layers, grouped KV attention, and smaller experts. It uses 19 layers, a dense FFN `intermediate_size` of 8192, a `moe_intermediate_size` of 1024, top-6 routing, grouped KV heads (`num_key_value_heads=4`), and an unusually high `rope_theta` (8e6) for long-context stability without RoPE scaling. It has 128 routed experts plus a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
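To make the routing numbers concrete, here is a toy sketch of top-6 selection over 128 experts with the routed scaling factor of 2.5. It deliberately omits the shared expert and the auxiliary-loss-free balancing bias, and the function name is our own; it only illustrates how few experts are active per token:

```python
import math

def route_token(router_logits, top_k=6, routed_scaling_factor=2.5):
    """Pick the top-k experts and return scaled mixing weights (toy sketch)."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    # Softmax over the selected experts only, then scale by routed_scaling_factor.
    return {i: routed_scaling_factor * e / z for i, e in zip(top, exps)}

weights = route_token([0.01 * i for i in range(128)])
# Only 6 of 128 experts fire per token; their weights sum to 2.5.
```

Because only 6 of 128 expert FFNs run per token, active parameters stay far below the total parameter count, which is what makes the model cheap to serve.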
Inference
SGLang
Install the latest SGLang from source:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
Launch Server
```shell
sglang serve --model-path sarvamai/sov_30b_fp8 \
  --port 3002 --host 0.0.0.0 \
  --mem-fraction-static 0.70 \
  --trust-remote-code \
  --tp 2 \
  --enable-dp-attention --dp 2 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend fa3 \
  --ep 2 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3
```
vLLM
Note: a PR adding native support for the Sarvam models to vLLM is currently open (link). Until it is merged, there are two options.
Option 1: install from source (hard)

Option 2: hot-patch (easy)
- Run `hotpatch_vllm.py`, which will:
  - install `vllm==0.15.0`
  - add 2 model entries to `registry.py`
  - download the model executors for `sarvam-105b` and `sarvam-30b`
Once this is done, you can launch the vLLM server.
Important: you must set `VLLM_USE_FLASHINFER_MOE_FP8=0` as an environment variable, otherwise the server will hang during compilation and crash.
```shell
VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --port 3002
```
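Both launch commands above expose an OpenAI-compatible API on port 3002, so a chat request needs nothing beyond the Python standard library. The endpoint path and payload shape below follow the standard `/v1/chat/completions` contract; the prompt and helper names are just examples:

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:3002"):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": "sarvamai/sarvam-30b-fp8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With a server running, `print(chat("Namaste! Introduce yourself in Hindi."))` returns the model's reply.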
Citation
```bibtex
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```
