🎭 Urdu-English Speech Emotion Recognition Model (V3)

Model Description

This model is a fine‑tuned version of facebook/wav2vec2-xls-r-300m for 7‑class emotion recognition from speech. It supports Urdu and English (including code‑mixed audio).

It is the third and best‑performing iteration (V3), developed for a final‑year project on a multimodal mental health companion (text + voice + vision).

Developed by: Muhammad Suleman
Project: Multimodal AI Mental Health Companion (FYP)
Model type: Audio classification (speech emotion recognition)
Language(s): English, Urdu (including Roman Urdu)
License: MIT
Finetuned from model: facebook/wav2vec2-xls-r-300m

Model Sources

Repository: Hugging Face model page (muhammadsuleman1533/urdu-ser-model-v3)

Uses

Direct Use

This model can be used for research on multilingual speech emotion recognition and as a component in a multimodal mental health application (text, voice, vision). It is intended for academic demonstration and proof‑of‑concept.

Downstream Use

The model can be integrated into a larger pipeline that fuses text, voice, and facial expression predictions. It is not intended for standalone medical diagnosis.
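One common way to fuse per-modality predictions is late fusion: each modality emits a probability per emotion, and the pipeline takes a weighted average. The sketch below illustrates that idea only. The function name `fuse_predictions`, the modality weights, and the example scores are hypothetical and are not taken from the project.

```python
from typing import Dict

# Label names are assumed to follow the lowercase style of the pipeline output.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def fuse_predictions(per_modality: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> Dict[str, float]:
    """Late fusion: weighted average of per-modality emotion probabilities."""
    total_weight = sum(weights[name] for name in per_modality)
    fused = {}
    for emotion in EMOTIONS:
        score = sum(weights[name] * probs.get(emotion, 0.0)
                    for name, probs in per_modality.items())
        fused[emotion] = score / total_weight
    return fused


# Hypothetical scores from a voice model and a text model.
voice = {"anger": 0.6, "neutral": 0.3, "sadness": 0.1}
text = {"anger": 0.2, "neutral": 0.2, "sadness": 0.6}

fused = fuse_predictions({"voice": voice, "text": text},
                         {"voice": 0.5, "text": 0.5})
top_emotion = max(fused, key=fused.get)
print(top_emotion, round(fused[top_emotion], 3))
```

Equal weights are used here for simplicity; in practice the weights would likely be tuned on validation data, for example giving less weight to the voice model on classes where its recall is low.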

Out-of-Scope Use

Real‑time safety‑critical systems.
Deployment without proper privacy and bias evaluation.
Emotion recognition from non‑speech audio.

Bias, Risks, and Limitations

Surprise is present only in the English data (RAVDESS). UrduSER has no surprise samples, so the model will likely fail on Urdu surprise.
Disgust and sadness have low recall (0.37 and 0.33 on the test set; often confused with anger or neutral), and fear is similarly low (0.40).
Performance may degrade in noisy, real‑world environments because training data is mostly studio‑quality.
This model handles audio only. Recognition of Pakistani‑specific facial expressions is a separate component of the project.
The test set (2,575 samples) is speaker‑disjoint, but the model may still exhibit gender or age bias inherited from the pre‑trained model.

Recommendations

Users should validate the model on their own data, especially for Urdu surprise, and consider additional fine‑tuning if needed. Always combine with other modalities (text, vision) for robust predictions.

How to Get Started with the Model

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor, pipeline

# Load the fine-tuned classifier and its matching feature extractor from the Hub.
model = AutoModelForAudioClassification.from_pretrained("muhammadsuleman1533/urdu-ser-model-v3")
feature_extractor = AutoFeatureExtractor.from_pretrained("muhammadsuleman1533/urdu-ser-model-v3")

# Wrap both in an audio-classification pipeline and classify one file.
pipe = pipeline("audio-classification", model=model, feature_extractor=feature_extractor)
result = pipe("path/to/audio.wav")
print(result)  # e.g., [{'label': 'joy', 'score': 0.92}, ...]

Training Details

Training Data

The model was trained on a combined dataset from three sources, after discarding the calm and boredom classes:

Dataset	Language	Style	Samples (after cleaning)
RAVDESS	English	Acted (24 actors)	~1,250
CREMA‑D	English	Semi‑acted (91 actors)	7,442
UrduSER	Urdu	TV dramas (acted)	~3,000

Total after filtering: 11,684 samples (duration 0.5‑10 s).

Preprocessing:

Resampled to 16 kHz, truncated to 10 seconds.
Normalised by wav2vec2 feature extractor.
Mild augmentation (training only):
- Gaussian noise (p=0.3, amplitude 0.001‑0.012)
- Pitch shift ±1 semitone (p=0.4)
- No time stretch (disabled)
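The truncation and Gaussian-noise steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: it assumes "amplitude" means the standard deviation of the added noise, drawn uniformly from the stated range, and it omits pitch shifting because that needs an audio library. The function names are hypothetical.

```python
import random
from typing import List

SAMPLE_RATE = 16_000   # Hz, the resampling target stated above
MAX_SECONDS = 10       # clips are truncated to 10 seconds


def truncate(waveform: List[float]) -> List[float]:
    """Keep at most the first 10 seconds of a 16 kHz waveform."""
    return waveform[: SAMPLE_RATE * MAX_SECONDS]


def add_gaussian_noise(waveform: List[float], rng: random.Random,
                       p: float = 0.3, min_amp: float = 0.001,
                       max_amp: float = 0.012) -> List[float]:
    """With probability p, add zero-mean Gaussian noise.

    Assumption: the noise standard deviation is sampled uniformly
    from [min_amp, max_amp] once per clip.
    """
    if rng.random() >= p:
        return list(waveform)  # unchanged copy
    amp = rng.uniform(min_amp, max_amp)
    return [x + rng.gauss(0.0, amp) for x in waveform]


rng = random.Random(0)
clip = truncate([0.0] * (SAMPLE_RATE * 12))       # a silent 12 s clip, cut to 10 s
noisy = add_gaussian_noise(clip, rng, p=1.0)      # force the augmentation on
untouched = add_gaussian_noise(clip, rng, p=0.0)  # force it off
print(len(clip), len(noisy))
```

A real training loop would apply this to arrays or tensors rather than Python lists, and only to the training split.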

Split (speaker‑disjoint):

Train: 70% (8,176 samples)
Validation: 10% (1,139 samples) – used for early stopping
Test: 20% (2,575 samples) – used only for final evaluation
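A speaker-disjoint split assigns whole speakers, not individual clips, to each partition, so no voice heard in training appears in validation or test. The sketch below shows one way to do this under assumed inputs: the `(speaker_id, path)` tuple format, the function name, and the seed are hypothetical. Because the ratios are applied to speakers, the resulting sample proportions are only approximate.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (speaker_id, audio_path), a hypothetical format


def speaker_disjoint_split(samples: List[Sample],
                           ratios: Tuple[float, float, float] = (0.7, 0.1, 0.2),
                           seed: int = 42) -> Dict[str, List[Sample]]:
    """Split samples so that each speaker lands in exactly one partition."""
    by_speaker = defaultdict(list)
    for speaker_id, path in samples:
        by_speaker[speaker_id].append((speaker_id, path))

    speakers = sorted(by_speaker)            # sort first for reproducibility
    random.Random(seed).shuffle(speakers)

    n_train = round(ratios[0] * len(speakers))
    n_val = round(ratios[1] * len(speakers))
    groups = {
        "train": speakers[:n_train],
        "val": speakers[n_train:n_train + n_val],
        "test": speakers[n_train + n_val:],
    }
    return {name: [s for spk in spks for s in by_speaker[spk]]
            for name, spks in groups.items()}


# 20 hypothetical speakers with 5 clips each.
samples = [(f"spk{i:02d}", f"clip_{i}_{j}.wav")
           for i in range(20) for j in range(5)]
splits = speaker_disjoint_split(samples)
speaker_sets = {name: {spk for spk, _ in part} for name, part in splits.items()}
print({name: len(part) for name, part in splits.items()})
```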

Training Procedure

Hardware: Google Colab T4 GPU (16 GB) – Training time ~3‑4 hours.

Software: PyTorch + Hugging Face Transformers + PEFT (LoRA).

Training Hyperparameters

Parameter	Value
Optimizer	AdamW
Learning rate	`2e-5`
Warmup ratio	`0.1`
Scheduler	Cosine decay
Per‑device batch size	4
Gradient accumulation steps	4 (effective batch = 16)
Epochs	12 (early stopping patience = 5, not triggered)
Loss	Focal Loss (`γ=1.2`) + class weights (alpha)
Freezing	Feature encoder + first 6 transformer layers
Dropout	0.2
Precision	`fp16`
Gradient checkpointing	Enabled
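To make the loss row concrete, the sketch below computes class-weighted focal loss for a single example, using the standard form -alpha_t * (1 - p_t)^gamma * log(p_t). It is framework-free so the arithmetic stays visible; actual training would use a tensor implementation. The alpha values and logits are hypothetical, since the card does not list the class weights.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], target: int,
               alpha: Sequence[float], gamma: float = 1.2) -> float:
    """Focal loss for one example: -alpha[target] * (1 - p_t)**gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)


alpha = [1.0] * 7                                # hypothetical uniform class weights
logits = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # model is confident in class 0

easy = focal_loss(logits, target=0, alpha=alpha)             # correct, confident
hard = focal_loss(logits, target=1, alpha=alpha)             # wrong, confident
ce_hard = focal_loss(logits, target=1, alpha=alpha, gamma=0.0)
print(f"easy={easy:.4f} hard={hard:.4f} weighted-CE hard={ce_hard:.4f}")
```

Setting gamma to 0 reduces the expression to weighted cross-entropy, while a larger gamma shrinks the contribution of examples the model already classifies confidently.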

Speeds, Sizes, Times

Model size: 300M parameters (~1.2 GB)
Training time: ~3‑4 hours on T4 GPU
Inference time: ~0.5‑1.0 second per audio clip (CPU)

Evaluation

Testing Data, Factors & Metrics

Testing Data

Test set size: 2,575 samples (20% of total data, speaker‑disjoint)
Distribution: roughly balanced across emotions and languages, except surprise, which has only 32 samples (English only).

Metrics

Accuracy, weighted F1, macro F1, and per‑class precision/recall/F1.

Results

Overall Performance

Metric	Value
Accuracy	59.2%
Weighted F1	0.578
Macro F1	0.594

Per‑Class Performance

Emotion	Precision	Recall	F1‑score	Support
Anger	0.78	0.85	0.81	433
Disgust	0.61	0.37	0.46	431
Fear	0.69	0.40	0.51	433
Joy	0.64	0.66	0.65	433
Neutral	0.41	0.96	0.57	380
Sadness	0.71	0.33	0.45	433
Surprise	0.54	1.00	0.70	32
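As a consistency check, the overall metrics can be recomputed from the per-class rows. The sketch below copies the table values; because those values are rounded to two decimals, the recomputed aggregates agree with the reported ones only approximately.

```python
# (precision, recall, f1, support) copied from the per-class table above
per_class = {
    "Anger":    (0.78, 0.85, 0.81, 433),
    "Disgust":  (0.61, 0.37, 0.46, 431),
    "Fear":     (0.69, 0.40, 0.51, 433),
    "Joy":      (0.64, 0.66, 0.65, 433),
    "Neutral":  (0.41, 0.96, 0.57, 380),
    "Sadness":  (0.71, 0.33, 0.45, 433),
    "Surprise": (0.54, 1.00, 0.70, 32),
}

total = sum(support for _, _, _, support in per_class.values())

# Macro F1: unweighted mean of per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1, _ in per_class.values()) / len(per_class)

# Weighted F1: per-class F1 weighted by class support.
weighted_f1 = sum(f1 * s for _, _, f1, s in per_class.values()) / total

# Accuracy: correctly classified samples (recall * support) over all samples.
accuracy = sum(r * s for _, r, _, s in per_class.values()) / total

print(f"total={total} macro_f1={macro_f1:.3f} "
      f"weighted_f1={weighted_f1:.3f} accuracy={accuracy:.3f}")
```

The support column sums to 2,575, matching the stated test set size, and the three recomputed aggregates land within about 0.002 of the reported values. Macro F1 exceeds weighted F1 here because surprise, with only 32 samples, counts as much as any other class in the macro average.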

Version History – Why V3 Is the Best

Version	Loss	Frozen Transformer Layers	Augmentation	Learning Rate	Validation Split	Test Accuracy	Weighted F1
V1 (baseline)	Weighted CE	0 (full fine‑tuning)	None	1e‑5 constant	No (test used for early stopping)	57.3%	0.554
V2 (failed)	Focal Loss (γ=2.0)	18	Heavy (pitch ±2, time stretch)	3e‑5 + warmup	No	54.7%	0.520
V3 (✅ final)	Focal Loss (γ=1.2) + class weights	6	Mild (pitch ±1, no stretch)	2e‑5 + warmup + cosine	Yes (10% speaker‑disjoint)	59.2%	0.578
V4 (alternative)	Weighted CE	6	Mild (as V3)	2e‑5 + warmup + cosine	Yes	58.8%	0.574

Why V3 is superior:
1. Optimised Focal Loss (γ=1.2 + class weights) – Fixes over‑penalisation of neutral/fear (V2) while still focusing on hard classes.
2. Selective freezing (first 6 layers) – More adaptation than V2, less overfitting than V1.
3. Mild augmentation – Generalisation without distortion (unlike V2’s heavy augmentation).
4. Proper validation split – Prevents test set leakage; V3’s metrics are truly held‑out.
5. Moderate learning rate (2e‑5, between V1's 1e‑5 and V2's 3e‑5) with warmup and cosine decay – Stable training with more trainable layers.
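The warmup and cosine decay in point 5 can be sketched as a function of the training step. This is an illustration under assumptions: linear warmup over the first 10% of steps, a cosine curve that decays to zero at the final step, and a hypothetical total of 1,000 steps. The function name `lr_at_step` is invented for the example.

```python
import math


def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the real value depends on dataset and batch size
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

The warmup keeps early updates small while the newly initialised classification head settles, and the cosine tail lowers the rate gradually instead of holding it constant as V1 did.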
