🎭 Urdu-English Speech Emotion Recognition Model (V3)
Model Description
This model is a fine‑tuned version of facebook/wav2vec2-xls-r-300m for 7‑class emotion recognition from speech. It supports Urdu and English (including code‑mixed audio).
It is the third and best performing iteration (V3) developed for a final‑year project on multimodal mental health (text + voice + vision).
- Developed by: Muhammad Suleman
- Project: Multimodal AI Mental Health Companion (FYP)
- Model type: Audio classification (speech emotion recognition)
- Language(s): English, Urdu (including Roman Urdu)
- License: MIT
- Finetuned from model: facebook/wav2vec2-xls-r-300m
Model Sources
- Repository: Hugging Face Model Page
Uses
Direct Use
This model can be used for research on multilingual speech emotion recognition and as a component in a multimodal mental health application (text, voice, vision). It is intended for academic demonstration and proof‑of‑concept.
Downstream Use
The model can be integrated into a larger pipeline that fuses text, voice, and facial expression predictions. It is not intended for standalone medical diagnosis.
Out-of-Scope Use
- Real‑time safety‑critical systems.
- Deployment without proper privacy and bias evaluation.
- Emotion recognition from non‑speech audio.
Bias, Risks, and Limitations
- Surprise emotion is present only in English (RAVDESS). UrduSER has no surprise samples → the model will likely fail on Urdu surprise.
- Disgust and sadness have low recall (0.37 and 0.33 on the test set) and are often confused with anger or neutral.
- Performance may degrade in noisy, real‑world environments because training data is mostly studio‑quality.
- The model is not fine‑tuned on Pakistani‑specific facial expressions – that is a separate component.
- The test set (2,575 samples) is speaker‑disjoint, but the model may still exhibit gender or age bias inherited from the pre‑trained model.
Recommendations
Users should validate the model on their own data, especially for Urdu surprise, and consider additional fine‑tuning if needed. Always combine with other modalities (text, vision) for robust predictions.
How to Get Started with the Model
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor, pipeline

model = AutoModelForAudioClassification.from_pretrained("muhammadsuleman1533/urdu-ser-model-v3")
feature_extractor = AutoFeatureExtractor.from_pretrained("muhammadsuleman1533/urdu-ser-model-v3")

pipe = pipeline("audio-classification", model=model, feature_extractor=feature_extractor)
result = pipe("path/to/audio.wav")
print(result)  # e.g., [{'label': 'joy', 'score': 0.92}, ...]
Training Details
Training Data
The model was trained on a combined dataset of three sources (after discarding calm and boredom):
| Dataset | Language | Style | Samples (after cleaning) |
|---|---|---|---|
| RAVDESS | English | Acted (24 actors) | ~1,250 |
| CREMA‑D | English | Semi‑acted (91 actors) | 7,442 |
| UrduSER | Urdu | TV dramas (acted) | ~3,000 |
Total after filtering: 11,684 samples (duration 0.5‑10 s).
Preprocessing:
- Resampled to 16 kHz, truncated to 10 seconds.
- Normalised by wav2vec2 feature extractor.
- Mild augmentation (training only):
- Gaussian noise (p=0.3, amplitude 0.001‑0.012)
- Pitch shift ±1 semitone (p=0.4)
- No time stretch (disabled)
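The truncation and Gaussian-noise steps above can be sketched without any audio library. This is a minimal illustration, not the author's training code: the function names are hypothetical, the waveform is a plain list of floats, and interpreting "amplitude 0.001‑0.012" as the noise standard deviation is an assumption. Pitch shifting is left out; a library routine such as `librosa.effects.pitch_shift` would be one way to add it.

```python
import random

SAMPLE_RATE = 16_000  # Hz, as stated in the preprocessing list
MAX_SECONDS = 10      # clips are truncated to 10 seconds


def truncate(waveform, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Keep at most `max_seconds` of audio (waveform is a list of floats)."""
    return waveform[: sample_rate * max_seconds]


def add_gaussian_noise(waveform, rng, p=0.3, min_amp=0.001, max_amp=0.012):
    """With probability p, add zero-mean Gaussian noise to every sample.

    The noise standard deviation is drawn uniformly from [min_amp, max_amp];
    reading "amplitude" as the standard deviation is an assumption.
    """
    if rng.random() >= p:
        return list(waveform)  # unchanged copy
    sigma = rng.uniform(min_amp, max_amp)
    return [x + rng.gauss(0.0, sigma) for x in waveform]


rng = random.Random(0)
clip = [0.0] * (SAMPLE_RATE * 12)  # 12 s of silence as dummy input
clip = truncate(clip)              # now exactly 10 s long
augmented = add_gaussian_noise(clip, rng, p=1.0)  # force noise for the demo
```

Augmentation of this kind would be applied to training clips only, so validation and test audio stay untouched.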
Split (speaker‑disjoint):
- Train: 70% (8,176 samples)
- Validation: 10% (1,139 samples) – used for early stopping
- Test: 20% (2,575 samples) – used only for final evaluation
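A speaker‑disjoint split assigns whole speakers, not individual clips, to each subset. The sketch below shows one way to do that; the function name, the `(speaker_id, path)` input format, and the toy data are hypothetical, and the card does not say how the actual split was produced.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(samples, fractions=(0.7, 0.1, 0.2), seed=42):
    """Split (speaker_id, path) pairs so each speaker lands in one subset only.

    Speakers are shuffled and assigned, so the fractions hold only
    approximately at the clip level.
    """
    by_speaker = defaultdict(list)
    for speaker_id, path in samples:
        by_speaker[speaker_id].append(path)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = round(len(speakers) * fractions[0])
    n_val = round(len(speakers) * fractions[1])
    groups = {
        "train": speakers[:n_train],
        "val": speakers[n_train:n_train + n_val],
        "test": speakers[n_train + n_val:],  # remainder goes to test
    }
    return {name: [(s, p) for s in spk for p in by_speaker[s]]
            for name, spk in groups.items()}


# Hypothetical toy data: 10 speakers with 3 clips each.
toy = [(f"spk{i}", f"clip_{i}_{j}.wav") for i in range(10) for j in range(3)]
split = speaker_disjoint_split(toy)
```

Because no speaker appears in more than one subset, test accuracy reflects generalisation to unseen voices.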
Training Procedure
Hardware: Google Colab T4 GPU (16 GB) – Training time ~3‑4 hours.
Software: PyTorch + Hugging Face Transformers + PEFT (LoRA).
Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Warmup ratio | 0.1 |
| Scheduler | Cosine decay |
| Per‑device batch size | 4 |
| Gradient accumulation steps | 4 (effective batch = 16) |
| Epochs | 12 (early stopping patience = 5, not triggered) |
| Loss | Focal Loss (γ=1.2) + class weights (alpha) |
| Freezing | Feature encoder + first 6 transformer layers |
| Dropout | 0.2 |
| Precision | fp16 |
| Gradient checkpointing | Enabled |
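The loss row refers to focal loss with per‑class weights. In its standard formulation, the loss for one example is -α_t (1 - p_t)^γ log(p_t), where p_t is the softmax probability of the true class and α_t its class weight. The sketch below implements that formula in plain Python; the class‑weight values and the exact variant used in training are not given in this card, so the numbers here are purely illustrative.

```python
import math


def focal_loss(logits, target, gamma=1.2, class_weights=None):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    # Numerically stable softmax probability of the target class.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    alpha_t = 1.0 if class_weights is None else class_weights[target]
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Seven logits, one per emotion class; the values are made up.
easy = focal_loss([4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], target=0)  # confident, correct
hard = focal_loss([0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0], target=0)  # confident, wrong
```

With γ = 0 the expression reduces to weighted cross‑entropy. Larger γ shrinks the contribution of well‑classified examples, which is why a smaller γ (1.2 instead of 2.0) penalises easy classes less aggressively.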
Speeds, Sizes, Times
- Model size: 300M parameters (~1.2 GB)
- Training time: ~3‑4 hours on T4 GPU
- Inference time: ~0.5‑1.0 second per audio clip (CPU)
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Test set size: 2,575 samples (20% of total data, speaker‑disjoint)
- Distribution: Balanced across languages and roughly balanced across six of the seven emotions (380 to 433 samples each); surprise has only 32 samples.
Metrics
- Accuracy, weighted F1, macro F1, and per‑class precision/recall/F1.
Results
Overall Performance
| Metric | Value |
|---|---|
| Accuracy | 59.2% |
| Weighted F1 | 0.578 |
| Macro F1 | 0.594 |
Per‑Class Performance
| Emotion | Precision | Recall | F1‑score | Support |
|---|---|---|---|---|
| Anger | 0.78 | 0.85 | 0.81 | 433 |
| Disgust | 0.61 | 0.37 | 0.46 | 431 |
| Fear | 0.69 | 0.40 | 0.51 | 433 |
| Joy | 0.64 | 0.66 | 0.65 | 433 |
| Neutral | 0.41 | 0.96 | 0.57 | 380 |
| Sadness | 0.71 | 0.33 | 0.45 | 433 |
| Surprise | 0.54 | 1.00 | 0.70 | 32 |
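As a consistency check, the overall metrics can be recomputed from the per‑class table: macro F1 is the unweighted mean of the per‑class F1 scores, weighted F1 weights each score by its support, and accuracy equals the support‑weighted mean recall. The sketch below uses only the rounded values from the table, so small differences from the reported figures are expected.

```python
# (precision, recall, f1, support) copied from the per-class table
per_class = {
    "anger":    (0.78, 0.85, 0.81, 433),
    "disgust":  (0.61, 0.37, 0.46, 431),
    "fear":     (0.69, 0.40, 0.51, 433),
    "joy":      (0.64, 0.66, 0.65, 433),
    "neutral":  (0.41, 0.96, 0.57, 380),
    "sadness":  (0.71, 0.33, 0.45, 433),
    "surprise": (0.54, 1.00, 0.70, 32),
}

total = sum(support for _, _, _, support in per_class.values())

# Macro F1: every class counts equally, regardless of support.
macro_f1 = sum(f1 for _, _, f1, _ in per_class.values()) / len(per_class)

# Weighted F1: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * s for _, _, f1, s in per_class.values()) / total

# Accuracy: correct predictions per class are recall * support.
accuracy = sum(r * s for _, r, _, s in per_class.values()) / total

print(f"samples={total} macro_f1={macro_f1:.3f} "
      f"weighted_f1={weighted_f1:.3f} accuracy={accuracy:.3f}")
```

The supports sum to the 2,575 test samples, and the recomputed values match the reported accuracy, weighted F1, and macro F1 to within rounding.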
Version History – Why V3 Is the Best
| Version | Loss | Frozen Transformer Layers | Augmentation | Learning Rate | Validation Split | Test Accuracy | Weighted F1 |
|---|---|---|---|---|---|---|---|
| V1 (baseline) | Weighted CE | 0 (full fine‑tuning) | None | 1e‑5 constant | No (test used for early stopping) | 57.3% | 0.554 |
| V2 (failed) | Focal Loss (γ=2.0) | 18 | Heavy (pitch ±2, time stretch) | 3e‑5 + warmup | No | 54.7% | 0.520 |
| V3 (✅ final) | Focal Loss (γ=1.2) + class weights | 6 | Mild (pitch ±1, no stretch) | 2e‑5 + warmup + cosine | Yes (10% speaker‑disjoint) | 59.2% | 0.578 |
| V4 (alternative) | Weighted CE | 6 | Mild (as V3) | 2e‑5 + warmup + cosine | Yes | 58.8% | 0.574 |
Why V3 is superior:
1. Optimised Focal Loss (γ=1.2 + class weights) – Fixes over‑penalisation of neutral/fear (V2) while still focusing on hard classes.
2. Selective freezing (first 6 layers) – More adaptation than V2, less overfitting than V1.
3. Mild augmentation – Generalisation without distortion (unlike V2’s heavy augmentation).
4. Proper validation split – Prevents test set leakage; V3’s metrics are truly held‑out.
5. Lower learning rate than V2 (2e-5 vs 3e-5) with warmup & cosine decay – Stable training with more trainable layers than V2.
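The warmup and cosine decay schedule from point 5 can be written as a function of the optimiser step. This is a sketch of the usual formulation (linear warmup to the peak rate, then half‑cosine decay to zero), not the training script itself. The step count is an assumption derived from the card's figures: 8,176 training samples, an effective batch of 16, and 12 epochs on a single GPU.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed step count: ceil(8176 / 16) optimiser steps per epoch for 12 epochs.
STEPS_PER_EPOCH = math.ceil(8176 / 16)
TOTAL_STEPS = STEPS_PER_EPOCH * 12
schedule = [lr_at_step(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

The rate starts at zero, peaks at 2e-5 once warmup ends (the first 10% of steps), and decays smoothly to zero at the final step.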