
Want a bigger model? Download Sarvam-105B!

Index

  1. Introduction
  2. Architecture
  3. Inference
  4. Citation

Introduction

Sarvam-30B is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.

This repository provides FP8 quantized weights for Sarvam-30B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.
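An E4M3 value packs into one byte as 1 sign bit, 4 exponent bits (bias 7), and 3 mantissa bits, trading precision for a compact dynamic range up to ±448. As a minimal pure-Python sketch of the decoding rule (following the OCP FP8 convention, which reserves only the all-ones exponent/mantissa pattern for NaN and has no infinities):

```python
def e4m3_to_float(byte: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")          # only NaN encoding; E4M3 has no infinities
    if exp == 0:
        return sign * (mant / 8) * 2.0 ** -6   # subnormal
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)

print(e4m3_to_float(0x7E))  # 448.0, the largest finite E4M3 value
```

The quantizer maps each weight to the nearest of these 256 representable values (with a per-tensor or per-channel scale), which is why model quality is largely preserved at half the bytes of BF16.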

A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

Sarvam-30B is open-sourced under the Apache License. For more details, see our blog.

Architecture

The 30B MoE model is designed for throughput and memory efficiency, achieved through fewer layers, grouped KV attention, and small experts. It uses 19 layers, a dense FFN intermediate_size of 8192, a moe_intermediate_size of 1024, top-6 routing over 128 experts plus a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing. Attention uses grouped KV heads (num_key_value_heads=4), and an unusually high rope_theta (8e6) provides long-context stability without RoPE scaling.
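Gathered in one place, the key hyperparameters look like the sketch below. The field names follow common HF-style MoE configs and are assumptions; the repository's config.json is authoritative.

```python
# Architecture sketch of Sarvam-30B (field names assumed; see config.json)
cfg = {
    "num_hidden_layers": 19,
    "intermediate_size": 8192,        # dense FFN
    "moe_intermediate_size": 1024,    # per-expert FFN
    "n_routed_experts": 128,
    "n_shared_experts": 1,
    "num_experts_per_tok": 6,         # top-6 routing
    "num_key_value_heads": 4,         # grouped KV attention
    "rope_theta": 8e6,                # no RoPE scaling needed
    "routed_scaling_factor": 2.5,
}

# Each token activates the shared expert plus only 6 of 128 routed experts,
# so a small fraction of routed-expert weights is touched per token:
active_fraction = cfg["num_experts_per_tok"] / cfg["n_routed_experts"]
print(active_fraction)  # 0.046875
```

This sparsity is what lets a 30B-parameter model run with only 2.4B non-embedding active parameters per token.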

Inference

SGLang

Install the latest SGLang from source:

git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

Launch Server

sglang serve --model-path sarvamai/sov_30b_fp8 \
  --port 3002 --host 0.0.0.0 \
  --mem-fraction-static 0.70 \
  --trust-remote-code \
  --tp 2 \
  --enable-dp-attention --dp 2 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend fa3 \
  --ep 2 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3
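Once up, the server exposes an OpenAI-compatible API on the chosen port. A minimal stdlib-only client sketch (assuming the server is running locally on port 3002 as launched above):

```python
import json
import urllib.request

payload = {
    "model": "sarvamai/sov_30b_fp8",
    "messages": [{"role": "user", "content": "नमस्ते! अपना परिचय दीजिए।"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:3002/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```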

vLLM

Note: a PR adding native support for the Sarvam models to vLLM is currently open (link), so there are two options in the meantime.

Option 1: install from source (hard)

  • Use the custom fork here: link
  • Follow the instructions here to install from source: link

Option 2: hot-patch (easy)

  • Run hotpatch_vllm.py
  • The script will:
    • install vllm==0.15.0
    • add two model entries to registry.py
    • download the model executors for sarvam-105b and sarvam-30b

Once this is done, you can launch the vLLM server.

Important: You must set VLLM_USE_FLASHINFER_MOE_FP8=0 as an environment variable, otherwise the server will get stuck during compilation and crash.

VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --port 3002
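Sarvam-30B supports tool calls (the SGLang launch above enables them via --tool-call-parser), and the OpenAI-compatible endpoint accepts standard OpenAI-style tool definitions. A hedged sketch of such a request payload, where get_weather is a hypothetical function used only for illustration:

```python
import json

payload = {
    "model": "sarvamai/sarvam-30b-fp8",
    "messages": [{"role": "user", "content": "What is the weather in Mumbai right now?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
body = json.dumps(payload).encode("utf-8")
# POST this body to http://localhost:3002/v1/chat/completions; if the model
# decides to call the tool, the response's choices[0]["message"]["tool_calls"]
# carries the function name and its JSON-encoded arguments.
```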

Citation

@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}