🎭 Urdu-English Speech Emotion Recognition Model (V3)

Model Description

This model is a fine‑tuned version of facebook/wav2vec2-xls-r-300m for 7‑class emotion recognition from speech. It supports Urdu and English (including code‑mixed audio).

It is the third and best‑performing iteration (V3), developed for a final‑year project on a multimodal mental health companion (text + voice + vision).

  • Developed by: Muhammad Suleman
  • Project: Multimodal AI Mental Health Companion (FYP)
  • Model type: Audio classification (speech emotion recognition)
  • Language(s): English, Urdu (including Roman Urdu)
  • License: MIT
  • Finetuned from model: facebook/wav2vec2-xls-r-300m

Uses

Direct Use

This model can be used for research on multilingual speech emotion recognition and as a component in a multimodal mental health application (text, voice, vision). It is intended for academic demonstration and proof‑of‑concept.

Downstream Use

The model can be integrated into a larger pipeline that fuses text, voice, and facial expression predictions. It is not intended for standalone medical diagnosis.
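The card does not describe how the three modalities are fused. As one possible illustration only, a weighted late‑fusion step could look like the sketch below. The function name `fuse_predictions`, the equal weights, and the probability values are all hypothetical and are not taken from the project.

```python
def fuse_predictions(per_modality, weights):
    """Late fusion: weighted average of per-modality emotion probabilities."""
    total_weight = sum(weights[name] for name in per_modality)
    fused = {}
    for name, probs in per_modality.items():
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + weights[name] * p / total_weight
    return fused


# Hypothetical outputs of three separate models (illustrative values only)
predictions = {
    "voice": {"anger": 0.6, "neutral": 0.3, "sadness": 0.1},
    "text": {"anger": 0.2, "neutral": 0.2, "sadness": 0.6},
    "vision": {"anger": 0.5, "neutral": 0.4, "sadness": 0.1},
}
weights = {"voice": 1.0, "text": 1.0, "vision": 1.0}  # assumed equal weighting

fused = fuse_predictions(predictions, weights)
top_label = max(fused, key=fused.get)
print(top_label, fused)
```

In practice the weights would be tuned on validation data, and a learned fusion layer is an equally valid alternative.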

Out-of-Scope Use

  • Real‑time safety‑critical systems.
  • Deployment without proper privacy and bias evaluation.
  • Emotion recognition from non‑speech audio.

Bias, Risks, and Limitations

  • Surprise emotion is present only in English (RAVDESS). UrduSER has no surprise samples → the model will likely fail on Urdu surprise.
  • Disgust, fear, and sadness have low recall (0.37, 0.40, and 0.33 on the test set) and are often confused with anger or neutral.
  • Performance may degrade in noisy, real‑world environments because training data is mostly studio‑quality.
  • The model processes audio only. Facial‑expression recognition, including Pakistani‑specific expressions, is handled by a separate component of the project.
  • The test set (2,575 samples) is speaker‑disjoint, but the model may still exhibit gender or age bias inherited from the pre‑trained model.

Recommendations

Users should validate the model on their own data, especially for Urdu surprise, and consider additional fine‑tuning if needed. Always combine with other modalities (text, vision) for robust predictions.
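A minimal sketch of such a validation step is shown below, assuming you already have ground‑truth labels and the model's top‑1 predictions for your own clips. The helper name `per_class_recall` and the toy labels are illustrative, not part of the released code.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of clips of each true emotion that were predicted correctly."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in support.items()}


# Toy example: true labels vs. top-1 predictions on your own data
y_true = ["surprise", "surprise", "anger", "neutral", "sadness", "sadness"]
y_pred = ["surprise", "joy", "anger", "neutral", "neutral", "sadness"]

for label, recall in sorted(per_class_recall(y_true, y_pred).items()):
    print(f"{label:10s} recall = {recall:.2f}")
```

A low recall for a specific class and language (for example Urdu surprise) is a signal that additional fine‑tuning data is needed for that class.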


How to Get Started with the Model

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor, pipeline

# Load the fine-tuned classifier and its matching feature extractor
model = AutoModelForAudioClassification.from_pretrained("muhammadsuleman1533/urdu-ser-model-v3")
feature_extractor = AutoFeatureExtractor.from_pretrained("muhammadsuleman1533/urdu-ser-model-v3")

# Build an audio-classification pipeline and run it on a local file
pipe = pipeline("audio-classification", model=model, feature_extractor=feature_extractor)
result = pipe("path/to/audio.wav")
print(result)  # e.g., [{'label': 'joy', 'score': 0.92}, ...]


Training Details

Training Data

The model was trained on a combined dataset of three sources (after discarding calm and boredom):

| Dataset | Language | Style | Samples (after cleaning) |
|---|---|---|---|
| RAVDESS | English | Acted (24 actors) | ~1,250 |
| CREMA‑D | English | Semi‑acted (91 actors) | 7,442 |
| UrduSER | Urdu | TV dramas (acted) | ~3,000 |

Total after filtering: 11,684 samples (duration 0.5‑10 s).

Preprocessing:

  • Resampled to 16 kHz, truncated to 10 seconds.
  • Normalised by wav2vec2 feature extractor.
  • Mild augmentation (training only):
    • Gaussian noise (p=0.3, amplitude 0.001‑0.012)
    • Pitch shift ±1 semitone (p=0.4)
    • No time stretch (disabled)
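The truncation and noise steps above can be sketched as follows. This is a dependency‑free illustration on plain Python lists, not the project's training code: reading "amplitude" as the noise standard deviation is an assumption, and pitch shifting is omitted because it needs an audio library.

```python
import random

SAMPLE_RATE = 16_000  # Hz, as stated in the preprocessing list
MAX_SECONDS = 10      # clips are truncated to 10 seconds


def truncate(waveform, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Keep at most max_seconds of audio."""
    return waveform[: sample_rate * max_seconds]


def add_gaussian_noise(waveform, rng, p=0.3, min_amp=0.001, max_amp=0.012):
    """With probability p, add zero-mean Gaussian noise to every sample.

    Assumption: the amplitude range is used as the noise standard deviation.
    """
    if rng.random() >= p:
        return list(waveform)  # unchanged copy
    amp = rng.uniform(min_amp, max_amp)
    return [x + rng.gauss(0.0, amp) for x in waveform]


rng = random.Random(0)
clip = [0.0] * (SAMPLE_RATE * 12)  # 12 s of silence standing in for real audio
clip = truncate(clip)              # now 10 s long
augmented = add_gaussian_noise(clip, rng, p=1.0)  # p=1.0 forces noise for the demo
print(len(clip), max(abs(x) for x in augmented))
```

Augmentation of this kind is applied to training clips only, so validation and test audio stay untouched.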

Split (speaker‑disjoint):

  • Train: 70% (8,176 samples)
  • Validation: 10% (1,139 samples), used for early stopping
  • Test: 20% (2,575 samples), used only for final evaluation
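A speaker‑disjoint split assigns whole speakers, not individual clips, to each partition. The sketch below shows one way to do this with the standard library; the function name, the `(sample_id, speaker_id)` input format, and the seed are assumptions, and the card does not say which tool was actually used.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split (sample_id, speaker_id) pairs so that no speaker spans two splits."""
    by_speaker = defaultdict(list)
    for sample_id, speaker_id in samples:
        by_speaker[speaker_id].append((sample_id, speaker_id))

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = round(len(speakers) * fractions[0])
    n_val = round(len(speakers) * fractions[1])
    groups = (
        speakers[:n_train],
        speakers[n_train:n_train + n_val],
        speakers[n_train + n_val:],
    )
    # Expand each group of speakers back into its clips
    return tuple([s for spk in g for s in by_speaker[spk]] for g in groups)


# Toy data: 20 speakers with 5 clips each
samples = [(f"clip_{spk}_{i}", f"spk_{spk}") for spk in range(20) for i in range(5)]
train, val, test = speaker_disjoint_split(samples)
print(len(train), len(val), len(test))
```

Because the fractions apply to speakers, the resulting sample proportions only approximate 70/10/20 when speakers contribute different numbers of clips.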

Training Procedure

Hardware: Google Colab T4 GPU (16 GB) – Training time ~3‑4 hours.

Software: PyTorch + Hugging Face Transformers + PEFT (LoRA).

Training Hyperparameters

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Warmup ratio | 0.1 |
| Scheduler | Cosine decay |
| Per‑device batch size | 4 |
| Gradient accumulation steps | 4 (effective batch = 16) |
| Epochs | 12 (early stopping patience = 5, not triggered) |
| Loss | Focal Loss (γ=1.2) + class weights (alpha) |
| Freezing | Feature encoder + first 6 transformer layers |
| Dropout | 0.2 |
| Precision | fp16 |
| Gradient checkpointing | Enabled |
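To make the schedule concrete, the sketch below derives the effective batch size and traces a linear‑warmup plus cosine‑decay learning rate. It assumes the 8,176 training samples from the split above, one scheduler step per optimizer step, and decay to zero, which is the shape produced by `get_cosine_schedule_with_warmup` in Transformers. The function `lr_at_step` is illustrative, not the training code.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
EPOCHS = 12
TRAIN_SAMPLES = 8_176  # assumption: training-set size from the split section

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps)
for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The learning rate peaks at 2e-5 once warmup ends (after roughly 10% of the optimizer steps) and falls smoothly afterwards.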

Speeds, Sizes, Times

  • Model size: 300M parameters (~1.2 GB)
  • Training time: ~3‑4 hours on T4 GPU
  • Inference time: ~0.5‑1.0 seconds per audio clip (CPU)

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Test set size: 2,575 samples (20% of total data, speaker‑disjoint)
  • Distribution: approximately balanced across languages and across six of the seven emotions (380‑433 samples each); surprise has only 32 samples.

Metrics

  • Accuracy, weighted F1, macro F1, and per‑class precision/recall/F1.

Results

Overall Performance

| Metric | Value |
|---|---|
| Accuracy | 59.2% |
| Weighted F1 | 0.578 |
| Macro F1 | 0.594 |

Per‑Class Performance

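| Emotion | Precision | Recall | F1‑score | Support |
|---|---|---|---|---|
| Anger | 0.78 | 0.85 | 0.81 | 433 |
| Disgust | 0.61 | 0.37 | 0.46 | 431 |
| Fear | 0.69 | 0.40 | 0.51 | 433 |
| Joy | 0.64 | 0.66 | 0.65 | 433 |
| Neutral | 0.41 | 0.96 | 0.57 | 380 |
| Sadness | 0.71 | 0.33 | 0.45 | 433 |
| Surprise | 0.54 | 1.00 | 0.70 | 32 |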
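As a consistency check, the overall metrics can be approximately recomputed from the per‑class rows above. The sketch relies on two standard identities for single‑label classification: macro F1 is the unweighted mean of per‑class F1, and accuracy equals the support‑weighted mean of per‑class recall. Small differences from the reported values are expected because the table entries are rounded to two decimals.

```python
# emotion: (precision, recall, f1, support), copied from the per-class table
per_class = {
    "anger":    (0.78, 0.85, 0.81, 433),
    "disgust":  (0.61, 0.37, 0.46, 431),
    "fear":     (0.69, 0.40, 0.51, 433),
    "joy":      (0.64, 0.66, 0.65, 433),
    "neutral":  (0.41, 0.96, 0.57, 380),
    "sadness":  (0.71, 0.33, 0.45, 433),
    "surprise": (0.54, 1.00, 0.70, 32),
}

total = sum(support for _, _, _, support in per_class.values())

# Macro F1: every class counts equally, regardless of support
macro_f1 = sum(f1 for _, _, f1, _ in per_class.values()) / len(per_class)

# Weighted F1: each class contributes in proportion to its support
weighted_f1 = sum(f1 * n for _, _, f1, n in per_class.values()) / total

# Accuracy: correctly classified clips = recall * support, summed over classes
accuracy = sum(rec * n for _, rec, _, n in per_class.values()) / total

print(f"support={total} accuracy={accuracy:.3f} "
      f"weighted_f1={weighted_f1:.3f} macro_f1={macro_f1:.3f}")
```

Macro F1 comes out above weighted F1 here because surprise, with only 32 samples, has a relatively high F1 and counts as much as any other class in the macro average.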

Version History – Why V3 Is the Best

| Version | Loss | Frozen Transformer Layers | Augmentation | Learning Rate | Validation Split | Test Accuracy | Weighted F1 |
|---|---|---|---|---|---|---|---|
| V1 (baseline) | Weighted CE | 0 (full fine‑tuning) | None | 1e‑5 constant | No (test used for early stopping) | 57.3% | 0.554 |
| V2 (failed) | Focal Loss (γ=2.0) | 18 | Heavy (pitch ±2, time stretch) | 3e‑5 + warmup | No | 54.7% | 0.520 |
| V3 (✅ final) | Focal Loss (γ=1.2) + class weights | 6 | Mild (pitch ±1, no stretch) | 2e‑5 + warmup + cosine | Yes (10% speaker‑disjoint) | 59.2% | 0.578 |
| V4 (alternative) | Weighted CE | 6 | Mild (as V3) | 2e‑5 + warmup + cosine | Yes | 58.8% | 0.574 |

Why V3 is superior:
1. Optimised Focal Loss (γ=1.2 + class weights) – Fixes over‑penalisation of neutral/fear (V2) while still focusing on hard classes.
2. Selective freezing (first 6 layers) – More adaptation than V2, less overfitting than V1.
3. Mild augmentation – Generalisation without distortion (unlike V2’s heavy augmentation).
4. Proper validation split – Prevents test set leakage; V3’s metrics are truly held‑out.
5. Moderate learning rate (2e-5, between V1's 1e-5 and V2's 3e-5) with warmup & cosine decay – Stable training with more trainable layers.
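To illustrate point 1, the sketch below implements the standard focal loss for a single sample, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), and compares γ=1.2 with γ=2.0. The probability vectors are made up, and the actual class weights (alpha) used in training are not given in this card, so alpha defaults to 1 here.

```python
import math


def focal_loss(probs, target, gamma=1.2, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = probs[target]
    alpha_t = 1.0 if alpha is None else alpha[target]  # per-class weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.90, 0.05, 0.05]  # confident, correct prediction for class 0
hard = [0.30, 0.40, 0.30]  # uncertain prediction for class 0

for name, probs in (("easy", easy), ("hard", hard)):
    ce = focal_loss(probs, 0, gamma=0.0)  # gamma=0 reduces to cross-entropy
    fl_12 = focal_loss(probs, 0, gamma=1.2)
    fl_20 = focal_loss(probs, 0, gamma=2.0)
    print(f"{name}: CE={ce:.4f}  FL(1.2)={fl_12:.4f}  FL(2.0)={fl_20:.4f}")
```

A larger γ shrinks the loss of every sample, and of well‑classified samples most of all, so γ=2.0 suppresses the gradient signal from easier classes more aggressively than γ=1.2.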
