# Model Card for Khubaib01/roman-urdu-sentiment-xlm-r
This model performs sentiment classification on Roman Urdu text, labeling input sentences as Positive, Negative, or Neutral. It is fine-tuned from xlm-roberta-base and is designed to handle Roman Urdu, code-mixed Urdu-English, and social media style text.
## Try the Model
You can test the Roman Urdu Sentiment Analysis model live on Hugging Face Spaces.
## Model Details
### Model Description
This model is a fine-tuned XLM-RoBERTa Base transformer for sentiment analysis in Roman Urdu. It was trained on a balanced dataset of ~54k sentences labeled Positive, Negative, or Neutral. The data comprises WhatsApp chat messages and YouTube comments, preprocessed to remove media placeholders, deleted messages, very short messages, and emoji-only messages.
- Developed by: Muhammad Khubaib Ahmad
- Model type: XLM-RoBERTa Base (transformers)
- Language(s): Roman Urdu, code-mixed Urdu-English
- License: MIT
- Finetuned from model: `xlm-roberta-base`
### Model Sources
- Pretrained base model: [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)
## Uses
### Direct Use
- Classifying sentiment in Roman Urdu social media text.
- Performing automated content moderation.
- Enabling analytics on Roman Urdu datasets.
### Downstream Use
- Can be used as a feature extractor for multi-task NLP pipelines (a sketch follows this list).
- Can be adapted to other Roman Urdu classification tasks with further fine-tuning.
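A minimal sketch of the feature-extractor use, assuming the fine-tuned encoder is loaded without its classification head and sentence embeddings are taken by mean-pooling the last hidden layer over non-padding tokens; the pooling strategy is an assumption, not part of the published model.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "Khubaib01/roman-urdu-sentiment-xlm-r"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # encoder only; the classifier head is dropped

def embed(texts):
    # Mean-pool the last hidden layer over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

features = embed(["ye banda bohot acha hai", "mujhe ye pasand nahi"])
print(features.shape)  # torch.Size([2, 768])
```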
### Out-of-Scope Use
- Not suitable for Urdu written in its native Perso-Arabic script; such text should be transliterated to Roman Urdu first.
- Not intended for detecting sarcasm, irony, or extremely domain-specific slang beyond the training data.
## Bias, Risks, and Limitations
- The model may inherit biases from the training data, including slang, offensive terms, and informal language.
- Accuracy may drop for extremely rare or new Roman Urdu slang not seen during training.
- May misclassify mixed-script or heavily non-standard text.
### Recommendations
- Users should always validate predictions in critical applications (one simple approach is sketched below).
- Consider retraining or augmenting with additional domain-specific data for specialized use-cases.
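One lightweight way to follow this recommendation, sketched below under the assumption that the pipeline API is used; the 0.7 confidence threshold is an arbitrary placeholder to tune per application.

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="Khubaib01/roman-urdu-sentiment-xlm-r",
    truncation=True,
)

result = pipe("ye banda bohot acha hai")[0]
# Route low-confidence predictions to human review rather than acting on them.
if result["score"] < 0.7:  # threshold is an assumption; tune per application
    print("flag for human review:", result)
else:
    print("accepted:", result)
```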
## How to Get Started with the Model
### Quick Start
```python
from transformers import pipeline

# Load the fine-tuned classifier; truncation guards against over-long inputs.
pipe = pipeline(
    "text-classification",
    model="Khubaib01/roman-urdu-sentiment-xlm-r",
    truncation=True,
)

# Returns a list like [{"label": ..., "score": ...}]
print(pipe("ye banda bohot acha hai"))
```
### Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Khubaib01/roman-urdu-sentiment-xlm-r"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

inputs = tokenizer("ye insan acha nahi lagta", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
label_id = logits.argmax(dim=-1).item()

id2label = {0: "Positive", 1: "Negative", 2: "Neutral"}
print(id2label[label_id])
```
## Training Details
### Training Data
- Sources: WhatsApp chat exports, YouTube comments in Roman Urdu.
- Preprocessing: Removed media messages, deleted messages, emoji-only messages, and very short messages (illustrated in the sketch after this list).
- Dataset size: 54k sentences (18k per class, balanced)
- Labels: Positive, Negative, Neutral
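The exact preprocessing script is not published; the sketch below illustrates the kind of filtering described above. The marker strings and the three-word minimum are assumptions.

```python
import re

# WhatsApp export placeholders (assumed markers; actual exports vary by locale).
MEDIA_MARKERS = {"<Media omitted>", "This message was deleted"}

def keep_message(text: str, min_words: int = 3) -> bool:
    text = text.strip()
    if text in MEDIA_MARKERS:
        return False  # media / deleted-message placeholders
    if not re.sub(r"[^A-Za-z]", "", text):
        return False  # emoji-only or punctuation-only messages
    if len(text.split()) < min_words:
        return False  # very short messages
    return True

messages = ["<Media omitted>", "ok", "ye banda bohot acha hai", "😂😂😂"]
print([m for m in messages if keep_message(m)])  # ['ye banda bohot acha hai']
```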
### Training Procedure
- Base Model: xlm-roberta-base
- Training regime: fp16 mixed precision, batch size 16, gradient accumulation 4
- Epochs: 4
- Learning Rate: 2e-5
- Optimizer: AdamW
- Evaluation Strategy: Epoch-level evaluation with stratified split
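The training script itself is not published; the following is a hedged reconstruction of the setup with the Hugging Face `Trainer`, using the hyperparameters listed above (the output directory and save strategy are assumptions; `Trainer` uses AdamW by default).

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

args = TrainingArguments(
    output_dir="roman-urdu-sentiment-xlm-r",  # assumed
    num_train_epochs=4,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch size of 64
    fp16=True,                      # mixed-precision training
    eval_strategy="epoch",          # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
)

# With tokenized `train_ds` / `eval_ds` from the stratified split:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```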
## Evaluation
### Testing Data
10% held-out subset of the balanced dataset (stratified per class)
### Metrics
- Accuracy, F1-score per class
- Confusion matrix to inspect misclassification between classes
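A minimal sketch of how these metrics can be computed with scikit-learn; the label ids below are dummy values standing in for real predictions on the held-out split.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["Positive", "Negative", "Neutral"]
# Dummy ids so the snippet runs standalone; in practice these come from
# running the model over the 10% held-out split.
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 2, 0, 2, 2]

print(classification_report(y_true, y_pred, target_names=labels))
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted
```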
### Results
- Overall accuracy: ~85–90% (depending on dataset variation)
- Accuracy and F1 on the validation set: 81.57%
- Balanced F1 across Positive, Negative, and Neutral
## Environmental Impact
- Hardware: NVIDIA T4 GPU (Colab)
- Training Time: ~4–5 hours
- Precision: fp16
- Estimated CO₂ Emissions: Low (approx. 1–2 kg CO₂eq, Colab)
## Technical Specifications
- Model Architecture: XLM-RoBERTa Base (Transformer encoder, 12 layers, 768 hidden size)
- Objective: Sentiment classification (3-class softmax output)
- Software: Python, PyTorch, HuggingFace Transformers
- Tokenizer: SentencePiece-based multilingual tokenizer
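As a quick illustration of how the SentencePiece tokenizer segments Roman Urdu input (the exact tokens depend on the shared multilingual vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Khubaib01/roman-urdu-sentiment-xlm-r")
print(tokenizer.tokenize("ye banda bohot acha hai"))
```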
## Citation
- BibTeX:

```bibtex
@misc{roman_urdu_sentiment2025,
  title={Roman Urdu Sentiment Analysis Model},
  author={Muhammad Khubaib Ahmad},
  year={2025},
  note={HuggingFace Transformers model, fine-tuned from xlm-roberta-base}
}
```
- APA:
Muhammad Khubaib Ahmad. (2025). Roman Urdu Sentiment Analysis Model. HuggingFace Transformers. Fine-tuned from xlm-roberta-base.
## Glossary
- Roman Urdu: Urdu language written using the Latin alphabet.
- Code-mixed text: Sentences containing a mix of Roman Urdu and English words.
- fp16: Half-precision floating-point training, used to speed up GPU training and reduce memory use.
## Model Card Author
Muhammad Khubaib Ahmad
## Model Card Contact