Enhanced Distilled Urdu ASR - wav2vec2-base

State-of-the-art lightweight Urdu ASR model using advanced knowledge distillation techniques.

🎯 Model Overview

This model uses cutting-edge distillation techniques:

  • βœ… Feature-level distillation: Matches intermediate layer representations
  • βœ… Temperature scheduling: Adaptive softening of targets (4.0 β†’ 1.5)
  • βœ… SpecAugment: Time/frequency masking for robustness
  • βœ… Multi-loss optimization: Logits + CTC + Features

Performance

  • βœ… Speed: 2.4x faster inference
  • βœ… Size: 3.3x smaller (94M vs 315M parameters)
  • βœ… Accuracy: Only 11.3% WER degradation

πŸ“Š Detailed Results

Student Model (This Model)

  • WER: 49.40%
  • CER: 19.38%
  • Parameters: 94,417,083 (94M)
  • Inference Speed: 0.010s/batch

Teacher Model (Original)

  • WER: 38.06%
  • CER: 14.75%
  • Parameters: 315,499,195 (315M)
  • Inference Speed: 0.023s/batch

Improvements vs Standard Distillation

  • Better WER retention through feature matching
  • More robust via SpecAugment regularization
  • Smoother training via temperature scheduling

πŸ”¬ Technical Details

Distillation Architecture

Teacher (Large)          Student (Base)
     β”‚                        β”‚
     β”œβ”€ Layer 0 ────────────> Layer 0  (Feature Match)
     β”œβ”€ Layer 6 ────────────> Layer 3  (Feature Match)
     β”œβ”€ Layer 12 ───────────> Layer 6  (Feature Match)
     β”œβ”€ Layer 18 ───────────> Layer 9  (Feature Match)
     └─ Layer 24 ───────────> Layer 12 (Feature Match)
             β”‚                    β”‚
             └─── Logits KL-Div β”€β”€β”˜

Loss Function

L = Ξ±Β·KL(S||T) + Ξ²Β·CTC(S,y) + Ξ³Β·MSE(H_s,H_t)

Where:
  Ξ± = 0.4 (logit distillation weight)
  Ξ² = 0.4 (hard CTC weight)
  Ξ³ = 0.2 (feature distillation weight)
  T = temperature (scheduled 4.0β†’1.5)

SpecAugment Configuration

  • Time masking: 2 masks Γ— 80 frames
  • Frequency masking: 2 masks Γ— 27 bins

πŸ’» Usage

Basic Inference

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import torchaudio

# Load model
processor = Wav2Vec2Processor.from_pretrained("abidanoaman/urdu-asr-distilled-base-enhanced")
model = Wav2Vec2ForCTC.from_pretrained("abidanoaman/urdu-asr-distilled-base-enhanced")

# Load audio
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# Transcribe
inputs = processor(audio.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(pred_ids[0])
print(transcription)

πŸš€ Deployment

Perfect for production environments requiring:

  • βœ… Real-time transcription (< 0.01s latency)
  • βœ… Low memory footprint (94MB)
  • βœ… Edge deployment (Raspberry Pi, mobile)
  • βœ… Cost-efficient scaling

πŸ“ˆ Training Details

Enhanced Techniques

  1. Feature-level distillation: Match hidden representations across 5 layer pairs
  2. Temperature scheduling: Linear decay from 4.0 to 1.5
  3. SpecAugment: Robust to time/frequency variations

Hyperparameters

  • Epochs: 50
  • Learning rate: 3e-05
  • Batch size: 4
  • Loss weights: Ξ±=0.4, Ξ²=0.4, Ξ³=0.2

🎯 Benchmark Comparison

Model Technique WER Size Speed
This Model All 49.4% 94M 2.4x
Standard KD Logits only ~51.4% 94M 2.4x
Teacher - 38.1% 315M 1.0x

πŸ“š Citation

@misc{urdu-asr-enhanced-2024,
  author = {Abid Anoaman},
  title = {Enhanced Distilled Urdu ASR with Feature Matching and SpecAugment},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/abidanoaman/urdu-asr-distilled-base-enhanced}
}

πŸ“„ License

Apache 2.0

πŸ™ Acknowledgments

  • Teacher Model: abidanoaman/urdu-asr-complete-ablation
  • Base Architecture: facebook/wav2vec2-base
  • Techniques: Feature Distillation, Temperature Scheduling, SpecAugment
Downloads last month
1
Safetensors
Model size
94.4M params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for abidanoaman/urdu-asr-distilled-base-enhanced

Finetuned
(929)
this model

Dataset used to train abidanoaman/urdu-asr-distilled-base-enhanced

Evaluation results

  • Word Error Rate on Common Voice + Emotion Voice
    self-reported
    49.40%
  • Character Error Rate on Common Voice + Emotion Voice
    self-reported
    19.38%