GigaAM-v3

GigaAM-v3 is a Conformer-based foundation model with 220–240M parameters, pretrained on diverse Russian speech data using the HuBERT-CTC objective. It is the third generation of the GigaAM family and provides state-of-the-art performance on Russian ASR across a wide range of domains.

GigaAM-v3 includes the following model variants:

  • ssl — self-supervised HuBERT–CTC encoder pre-trained on 700,000 hours of Russian speech
  • ctc — ASR model fine-tuned with a CTC decoder
  • rnnt — ASR model fine-tuned with an RNN-T decoder
  • e2e_ctc — end-to-end CTC model with punctuation and text normalization
  • e2e_rnnt — end-to-end RNN-T model with punctuation and text normalization

GigaAM-v3 training incorporates new internal datasets: call-center conversations, speech with background music, natural speech, and speech with atypical characteristics. The models perform 30% better on average on these new domains while maintaining the same quality as previous GigaAM generations on public benchmarks.

The table below reports the Word Error Rate (%) for GigaAM-v3 and other existing models over diverse domains.

| Set Name | V3_CTC | V3_RNNT | T-One + LM | Whisper |
|---|---|---|---|---|
| Open Datasets | 3.0 | 2.6 | 5.7 | 12.0 |
| Golos Farfield | 4.5 | 3.9 | 12.2 | 16.7 |
| Natural Speech | 7.8 | 6.9 | 14.5 | 13.6 |
| Disordered Speech | 20.6 | 19.2 | 51.0 | 59.3 |
| Callcenter | 10.3 | 9.5 | 13.5 | 23.9 |
| Average | 9.2 | 8.4 | 19.4 | 25.1 |
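
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function names are illustrative and are not part of GigaAM. Any text normalization applied before scoring is not described here, so none is assumed. The last lines check that the Average row is consistent with an unweighted mean of the five domain rows.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = number of edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)


def macro_average(values):
    """Unweighted mean over domains."""
    return sum(values) / len(values)


# Per-domain WER values copied from the table above (Average row excluded).
V3_CTC = [3.0, 4.5, 7.8, 20.6, 10.3]
V3_RNNT = [2.6, 3.9, 6.9, 19.2, 9.5]

print(word_error_rate("a b c d", "a x c"))  # one substitution + one deletion
print(round(macro_average(V3_CTC), 1), round(macro_average(V3_RNNT), 1))
```

Because the average is unweighted, the Disordered Speech domain contributes as much to it as the much lower Open Datasets figure.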

The end-to-end ASR models (e2e_ctc and e2e_rnnt) produce punctuated, normalized text directly. In side-by-side comparisons against Whisper-large-v3, with Gemini 2.5 Pro as an LLM-as-a-judge, the GigaAM-v3 end-to-end models win by an average margin of 70:30.

For detailed results, see metrics.

FP8 quantization

The e2e_ctc, ctc, e2e_rnnt, and rnnt branches additionally carry FP8 (E4M3) quantized weights alongside the original fp16 weights:

  • model.safetensors — original fp16 weights
  • model_fp8.safetensors — FP8 E4M3 weights (per-output-channel scales) + per-tensor activation scales (model_fp8.safetensors.activation_scales.json)

Quantization targets the GEMM layers: the encoder feed-forward and attention projections and, for RNN-T, the joint network's encoder and predictor projections (enc/pred). The CTC variants use post-training quantization (PTQ); the RNN-T variants additionally use grid-level knowledge distillation against the fp16 teacher to recover accuracy.
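
To illustrate what per-output-channel E4M3 scaling means numerically, here is a small pure-Python simulation. It is a sketch under stated assumptions, not the pipeline used to produce model_fp8.safetensors: the weight layout (one row per output channel), the absmax-based choice of scale, and all function names are assumptions made for illustration. E4M3 has 3 mantissa bits and a largest finite value of 448, so each row is scaled to place its largest magnitude at 448 and then rounded to the nearest representable value.

```python
import math

E4M3_MAX = 448.0        # largest finite value in the FP8 E4M3 format
E4M3_MIN_EXP = -6       # exponent of the smallest normal binade
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), E4M3_MAX)
    # frexp gives magnitude = m * 2**e with m in [0.5, 1), so the binade is e - 1.
    # Clamping at E4M3_MIN_EXP makes small values use the subnormal spacing.
    exponent = max(math.frexp(magnitude)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exponent - E4M3_MANTISSA_BITS)
    return math.copysign(round(magnitude / step) * step, x)


def quantize_per_output_channel(weight):
    """weight: list of rows, one row per output channel (assumed layout).

    Returns (codes, scales) such that weight[i][j] ~= codes[i][j] * scales[i].
    """
    codes, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        scale = absmax / E4M3_MAX if absmax > 0 else 1.0  # absmax scaling: an assumption
        scales.append(scale)
        codes.append([round_to_e4m3(v / scale) for v in row])
    return codes, scales


def dequantize(codes, scales):
    """Map E4M3 codes back to real values using the per-row scales."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]


weight = [[0.5, -0.25, 0.1], [8.0, 2.0, -4.0]]
codes, scales = quantize_per_output_channel(weight)
print(codes)
print(dequantize(codes, scales))
```

With one scale per row, a channel with small weights is not forced onto the coarse grid implied by a large-magnitude channel elsewhere in the same matrix, which is the usual motivation for per-channel over per-tensor weight scales.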

Measured over 1000 held-out audio samples, FP8 transcription closely tracks the fp16 baseline. Transcripts are byte-identical for 93–99% of samples, and FP8 WER against ground truth is within ±0.02 percentage points of fp16:

| Variant | Word disagreement (FP8 vs fp16) | Transcripts identical | ΔWER vs fp16 |
|---|---|---|---|
| e2e_ctc | 1.71% | 93.1% | +0.02 |
| ctc | 1.67% | 93.2% | +0.00 |
| e2e_rnnt | 1.35% | 96.1% | +0.02 |
| rnnt | 0.29% | 98.7% | −0.02 |
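
The agreement columns can be reproduced from two aligned lists of transcripts. The sketch below makes two assumptions about definitions that are not stated above: word disagreement is taken as the pooled word-level edit distance between FP8 and fp16 transcripts, with fp16 as the reference, and "identical" means exact string equality. The function names and the sample transcripts are illustrative only.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a single rolling row."""
    row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        diagonal, row[0] = row[0], i
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = diagonal + (ref_word != hyp_word)
            # row[j] still holds the previous row (deletion); row[j - 1] is already updated (insertion)
            diagonal, row[j] = row[j], min(substitution, row[j] + 1, row[j - 1] + 1)
    return row[-1]


def agreement_report(fp16_transcripts, fp8_transcripts):
    """Compare FP8 output against fp16 output, treating fp16 as the reference."""
    if len(fp16_transcripts) != len(fp8_transcripts):
        raise ValueError("transcript lists must be aligned one-to-one")
    identical = sum(a == b for a, b in zip(fp16_transcripts, fp8_transcripts))
    edits = words = 0
    for ref, hyp in zip(fp16_transcripts, fp8_transcripts):
        ref_words, hyp_words = ref.split(), hyp.split()
        edits += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return {
        "transcripts_identical_pct": 100.0 * identical / len(fp16_transcripts),
        "word_disagreement_pct": 100.0 * edits / words,
    }


# Hypothetical transcripts, for illustration only.
fp16_out = ["привет мир", "как дела", "добрый день всем", "спасибо"]
fp8_out = ["привет мир", "как дела", "добрый день", "спасибо"]
print(agreement_report(fp16_out, fp8_out))
```

Note that a low ΔWER can coexist with a higher word disagreement: FP8 and fp16 may make different errors on the same audio while ending up equally far from the ground truth.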

License: MIT

Paper: GigaAM: Efficient Self-Supervised Learner for Speech Recognition (Interspeech 2025)
