# LusakaLang – Multilingual Topic Classification Model
## 🧠 Model Description
LusakaLang is a fine‑tuned multilingual BERT model designed for topic classification in Zambian English, Bemba, Nyanja, and code‑switched text.
It is optimized for ride‑hailing feedback, customer complaints, and general service‑related text commonly found in Lusaka’s multilingual digital communication.
The model captures:
- Lusaka‑style English
- Bemba & Nyanja idioms
- Natural code‑switching
- Local slang and pragmatic expressions
This makes it highly effective for real‑world Zambian NLP applications.
## 🎯 Task: Topic Classification
The model predicts one of the following topics:
- customer_support
- driver_behaviour
- payment_issues
- others (neutral, positive, unrelated, or out‑of‑domain text)
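For downstream use, these topics map to integer class ids. The mapping below is a hypothetical sketch for illustration — the authoritative order is defined in the model's `config.json` (`id2label`):

```python
# Hypothetical id/label mapping; verify the order against the model's config.json.
ID2LABEL = {
    0: "customer_support",
    1: "driver_behaviour",
    2: "payment_issues",
    3: "others",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```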
## 🧪 Training Details

### Training Setup

- Base model: `bert-base-multilingual-cased`
- Epochs: 20
- Class weights: enabled (to correct class imbalance)
- Optimizer: AdamW
- Loss: Weighted cross‑entropy
- Temperature scaling: T = 2.3 (applied at inference time)
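The weighted cross-entropy objective can be sketched as follows. The card does not say how the class weights were computed, so the inverse-frequency ("balanced") heuristic below is an assumption; the loss itself matches PyTorch's `CrossEntropyLoss(weight=...)` with `'mean'` reduction:

```python
import numpy as np

def balanced_class_weights(labels, num_classes):
    # Inverse-frequency weighting — an assumption; the card only states
    # that class weights were "enabled", not how they were derived.
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts.sum() / (num_classes * counts)

def weighted_cross_entropy(logits, labels, class_weights):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the true class, weighted per sample and
    # normalised by total weight (PyTorch's weighted 'mean' reduction).
    nll = -np.log(probs[np.arange(len(labels)), labels])
    w = class_weights[labels]
    return (w * nll).sum() / w.sum()
```

Rare classes receive larger weights, so their misclassifications contribute more to the gradient — the mechanism the card credits with correcting class imbalance.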
### Why Temperature Scaling?
Class‑weighted training sharpens logits.
Temperature scaling at T = 2.3 improves:
- Confidence calibration
- Noise robustness
- Handling of positive/neutral text
- Foreign‑language generalization
- Reduction of overconfident misclassifications
Temperature scaling does not require retraining.
## 🧪 Training Data Creation & Local Review Process
The dataset was primarily synthetic, generated to simulate realistic ride‑hailing feedback in Zambia.
To ensure authenticity:
- All samples were reviewed by a native Zambian speaker
- Code‑switching patterns were corrected
- Local idioms and slang were added
- Unnatural AI‑generated phrasing was removed
- Bemba/Nyanja grammar and tone were validated
This hybrid approach ensures the dataset reflects real Lusaka communication patterns.
## 📊 Evaluation Results (Validation Set)
| Metric | Score |
|---|---|
| Accuracy | 99.26% |
| Precision | 98.73% |
| Recall | 99.13% |
| Macro F1 | 98.93% |
| Micro F1 | 99.26% |
| Val Loss | 0.0523 |
These results show excellent generalization, strong multilingual performance, and balanced class behavior.
## 🔢 Confusion Matrix (Validation Set)

### Interpretation
- No confusion between major complaint categories
- Only 3 total misclassifications across 4 classes
- Strong separation between “others” and complaint categories
- Excellent multilingual robustness
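The matrix itself is not reproduced here, but the kind of tally behind the interpretation above can be sketched with a minimal helper (the labels below are illustrative only, not the model's actual predictions):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # Rows index the true class, columns the predicted class.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Illustrative values only — not taken from the validation set.
y_true = [0, 0, 1, 2, 3, 3]
y_pred = [0, 0, 1, 2, 3, 0]
cm = confusion_matrix(y_true, y_pred, 4)
```

Diagonal entries are correct predictions; any off-diagonal mass between the complaint classes would contradict the "no confusion between major complaint categories" claim.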
## 🔥 Inference Guide (Production)

Recommended temperature: **2.3**

```python
scaled_logits = logits / 2.3
probs = softmax(scaled_logits)
```
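Applied to concrete numbers (a toy logit vector, not actual model output), dividing by T = 2.3 leaves the predicted class unchanged while pulling the top probability down — the calibration effect described above:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax for a 1-D logit vector.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([6.0, 1.0, 0.5, 0.2])   # toy "sharp" logits
raw = softmax(logits)                      # overconfident distribution
calibrated = softmax(logits / 2.3)         # temperature-scaled distribution
```

Because division by a constant preserves the logit ordering, temperature scaling changes confidence but never the argmax — which is why it needs no retraining.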
## Model Tree for Kelvinmbewe/mbert_LusakaLang_Topic

- Base model: `google-bert/bert-base-multilingual-cased`


