GigaAM-v3

GigaAM-v3 is a Conformer-based foundation model with 220–240M parameters, pretrained on diverse Russian speech data using the HuBERT-CTC objective. It is the third generation of the GigaAM family and provides state-of-the-art performance on Russian ASR across a wide range of domains.

GigaAM-v3 includes the following model variants:

ssl: self-supervised HuBERT-CTC encoder pretrained on 700,000 hours of Russian speech
ctc: ASR model fine-tuned with a CTC decoder
rnnt: ASR model fine-tuned with an RNN-T decoder
e2e_ctc: end-to-end CTC model with punctuation and text normalization
e2e_rnnt: end-to-end RNN-T model with punctuation and text normalization

GigaAM-v3 training incorporates new internal datasets: call-center conversations, speech with background music, natural speech, and speech with atypical characteristics. The models perform on average 30% better on these new domains while maintaining the same quality as previous GigaAM generations on public benchmarks.

The table below reports the Word Error Rate (%) for GigaAM-v3 and other existing models over diverse domains.

Set Name	V3_CTC	V3_RNNT	T-One + LM	Whisper
Open Datasets	3.0	2.6	5.7	12.0
Golos Farfield	4.5	3.9	12.2	16.7
Natural Speech	7.8	6.9	14.5	13.6
Disordered Speech	20.6	19.2	51.0	59.3
Callcenter	10.3	9.5	13.5	23.9
Average	9.2	8.4	19.4	25.1
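
For readers unfamiliar with the metric, the sketch below shows one common way to compute Word Error Rate: word-level edit distance divided by the number of reference words, pooled over the corpus. The function names, the whitespace tokenization, and the sample sentences are assumptions made for illustration; the exact text normalization behind the numbers in the table is not described here.

```python
from typing import List, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    # prev[j] is the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """WER in percent over (reference, hypothesis) pairs, pooled over the corpus."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()  # assumption: plain whitespace tokenization
        errors += word_edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return 100.0 * errors / ref_words


# Hypothetical sample data, not taken from any benchmark.
pairs = [
    ("привет как дела", "привет как дела"),         # 0 errors, 3 reference words
    ("включи свет на кухне", "включи свет кухне"),  # 1 deletion, 4 reference words
]
print(f"WER: {corpus_wer(pairs):.2f}%")  # prints WER: 14.29%
```

Pooling errors over the whole corpus, instead of averaging per-utterance rates, keeps short utterances from dominating the score.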

The end-to-end ASR models (e2e_ctc and e2e_rnnt) produce punctuated, normalized text directly. In end-to-end ASR comparisons of e2e_ctc and e2e_rnnt against Whisper-large-v3, using Gemini 2.5 Pro as an LLM-as-a-judge, GigaAM-v3 models win by an average margin of 70:30.

For detailed results, see metrics.

FP8 quantization

The e2e_ctc, ctc, e2e_rnnt, and rnnt branches additionally carry FP8 (E4M3) quantized weights alongside the original fp16 weights:

model.safetensors: original fp16 weights
model_fp8.safetensors: FP8 E4M3 weights with per-output-channel scales, plus per-tensor activation scales stored in model_fp8.safetensors.activation_scales.json
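
To make "per-output-channel scales" concrete, here is a minimal pure-Python simulation of symmetric FP8 quantization with one scale per weight row. It is a sketch under stated assumptions, not the code used to produce these checkpoints: it assumes the E4M3 "fn" variant (largest finite value 448), a max-abs scale per row, and round-to-nearest-even; the function names and the sample matrix are hypothetical.

```python
import math
from typing import List, Tuple

E4M3_MAX = 448.0    # largest finite value of the assumed E4M3 "fn" variant
E4M3_MIN_EXP = -6   # exponent of the smallest normal number
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)              # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, E4M3_MIN_EXP)      # subnormals share the minimum exponent
    step = 2.0 ** (e - MANTISSA_BITS)   # spacing of representable values near a
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_per_output_channel(
    weight: List[List[float]],
) -> Tuple[List[List[float]], List[float]]:
    """Quantize an [out_features][in_features] matrix with one scale per row."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the row onto the top of the FP8 range.
        scale = max_abs / E4M3_MAX if max_abs > 0 else 1.0
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[float]], scales: List[float]) -> List[List[float]]:
    """Recover approximate weights by multiplying each row by its scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical 2x4 weight matrix: two output channels with different ranges.
weight = [
    [0.02, -0.31, 0.07, 0.15],
    [1.80, 0.004, -0.95, 0.40],
]
q_rows, scales = quantize_per_output_channel(weight)
restored = dequantize(q_rows, scales)
for row, back in zip(weight, restored):
    worst = max(abs(a - b) for a, b in zip(row, back))
    print(f"max abs error in channel: {worst:.5f}")
```

A separate scale per row lets a channel with small weights use the full FP8 range instead of being crushed by the largest channel in the matrix. Real inference would use a hardware FP8 dtype and fused GEMM kernels instead of this float simulation.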

Quantization targets the GEMM layers: the encoder feed-forward and attention projections, plus the encoder and predictor projections of the RNN-T joint network (joint enc/pred). CTC variants use post-training quantization (PTQ); RNN-T variants add grid-level knowledge distillation against the fp16 teacher to recover accuracy.

Measured over 1000 held-out audio samples, FP8 transcription closely tracks the fp16 baseline: transcripts are byte-identical for 93–99% of samples, and FP8 WER against ground truth is within ±0.02% of fp16.

Variant	Word disagreement (FP8 vs fp16)	Transcripts identical	ΔWER vs fp16
`e2e_ctc`	1.71%	93.1%	+0.02
`ctc`	1.67%	93.2%	+0.00
`e2e_rnnt`	1.35%	96.1%	+0.02
`rnnt`	0.29%	98.7%	−0.02
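
As a rough illustration of the "Transcripts identical" column, the sketch below computes the share of samples whose FP8 and fp16 transcripts match byte for byte. The function name and the sample transcripts are hypothetical, and the evaluation script behind the table may differ in detail.

```python
from typing import List


def identical_rate(fp16_texts: List[str], fp8_texts: List[str]) -> float:
    """Percentage of samples whose two transcripts are byte-for-byte equal."""
    if len(fp16_texts) != len(fp8_texts):
        raise ValueError("both lists must cover the same samples")
    if not fp16_texts:
        raise ValueError("no samples to compare")
    same = sum(
        a.encode("utf-8") == b.encode("utf-8")
        for a, b in zip(fp16_texts, fp8_texts)
    )
    return 100.0 * same / len(fp16_texts)


# Hypothetical transcripts for three audio samples; the last pair differs.
fp16_texts = ["добрый день", "спасибо за звонок", "до свидания"]
fp8_texts = ["добрый день", "спасибо за звонок", "до свиданья"]
print(f"identical: {identical_rate(fp16_texts, fp8_texts):.1f}%")  # prints identical: 66.7%
```

Byte-level equality is a strict criterion: a single differing character, including punctuation or casing in the end-to-end variants, counts the whole transcript as different.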

License: MIT

Paper: GigaAM: Efficient Self-Supervised Learner for Speech Recognition (InterSpeech 2025)

arXiv: 2506.01192 (published Jun 1, 2025)