LusakaLang – Multilingual Topic Classification Model

🧠 Model Description

LusakaLang is a fine‑tuned multilingual BERT model designed for topic classification in Zambian English, Bemba, Nyanja, and code‑switched text.
It is optimized for ride‑hailing feedback, customer complaints, and general service‑related text commonly found in Lusaka’s multilingual digital communication.

The model captures:

  • Lusaka‑style English
  • Bemba & Nyanja idioms
  • Natural code‑switching
  • Local slang and pragmatic expressions

This makes it highly effective for real‑world Zambian NLP applications.


🎯 Task: Topic Classification

The model predicts one of the following topics:

  • customer_support
  • driver_behaviour
  • payment_issues
  • others (neutral, positive, unrelated, or out‑of‑domain text)
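
The mapping from integer class ids to these topic names ships with the model configuration. A minimal sketch of reading it, assuming the repository id Kelvinmbewe/mbert_LusakaLang_Topic:

```python
from transformers import AutoConfig

# Load only the config to inspect the label mapping without downloading the weights.
config = AutoConfig.from_pretrained("Kelvinmbewe/mbert_LusakaLang_Topic")
print(config.id2label)  # the four topic names above, keyed by class id
```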

🧪 Training Details

Training Setup

  • Base model: bert-base-multilingual-cased
  • Epochs: 20
  • Class weights: enabled (to correct class imbalance)
  • Optimizer: AdamW
  • Loss: Weighted cross‑entropy (a minimal sketch of this setup follows the list)
  • Temperature scaling: T = 2.3 (applied at inference time)
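
The class‑weighted loss can be wired in by overriding the Trainer loss computation. This is only a sketch under stated assumptions: the class weights shown are hypothetical illustrative values, and the exact training script used for this model is not published on this card.

```python
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, Trainer

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to the cross-entropy loss."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Weighted cross-entropy: under-represented classes contribute more to the loss.
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=4
)
# Hypothetical inverse-frequency weights for the four topics; the real values are not published.
class_weights = torch.tensor([1.2, 1.5, 1.8, 0.7])
# WeightedTrainer is then constructed like a regular Trainer, with TrainingArguments
# (AdamW is the default optimizer), the datasets, and num_train_epochs=20.
```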

Why Temperature Scaling?

Class‑weighted training tends to sharpen the logits, which can leave the model overconfident.
Rescaling the logits with a temperature of T = 2.3 at inference time improves:

  • Confidence calibration
  • Noise robustness
  • Handling of positive/neutral text
  • Foreign‑language generalization
  • Reduction of overconfident misclassifications

Because it only rescales the logits at inference time, temperature scaling does not require retraining.
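
The card does not state how T = 2.3 was chosen. A common approach is to fit a single scalar temperature by minimising negative log‑likelihood on held‑out validation logits; the sketch below shows that generic procedure, not necessarily the one used for this model.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor, steps: int = 200) -> float:
    """Fit a scalar temperature by minimising NLL on validation logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```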


🧪 Training Data Creation & Local Review Process

The dataset was primarily synthetic, generated to simulate realistic ride‑hailing feedback in Zambia.
To ensure authenticity:

  • All samples were reviewed by a native Zambian speaker
  • Code‑switching patterns were corrected
  • Local idioms and slang were added
  • Unnatural AI‑generated phrasing was removed
  • Bemba/Nyanja grammar and tone were validated

This hybrid approach ensures the dataset reflects real Lusaka communication patterns.


📊 Evaluation Results (Validation Set)

Metric      Score
Accuracy    99.26%
Precision   98.73%
Recall      99.13%
Macro F1    98.93%
Micro F1    99.26%
Val Loss    0.0523

These results show excellent generalization, strong multilingual performance, and balanced class behavior.
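
The averaging used for precision and recall is not stated on the card; assuming macro averaging, the reported metrics can be reproduced from validation predictions along these lines (y_true and y_pred below are placeholders for the real label and prediction arrays):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# y_true / y_pred: integer class ids (0-3) for the validation set; placeholders here.
y_true, y_pred = [0, 1, 2, 3], [0, 1, 2, 3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("Macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("Micro F1 :", f1_score(y_true, y_pred, average="micro"))
```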



🔢 Confusion Matrix (Validation Set)

[Figure: confusion matrix on the validation set]

Interpretation

  • No confusion between major complaint categories
  • Only 3 total misclassifications across 4 classes
  • Strong separation between “others” and complaint categories
  • Excellent multilingual robustness



🔥 Inference Guide (Production)

Recommended Temperature: 2.3

import torch.nn.functional as F

scaled_logits = logits / 2.3              # divide the raw model logits by the temperature
probs = F.softmax(scaled_logits, dim=-1)  # calibrated topic probabilities
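
A fuller end‑to‑end sketch, assuming the repository id Kelvinmbewe/mbert_LusakaLang_Topic; the classify helper and the sample sentence are illustrative, and tokenisation length and device handling should be adapted to your deployment:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "Kelvinmbewe/mbert_LusakaLang_Topic"
TEMPERATURE = 2.3  # recommended inference temperature from this card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def classify(text: str) -> dict:
    """Return temperature-scaled topic probabilities for a single feedback message."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits / TEMPERATURE, dim=-1).squeeze(0)
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

# Illustrative input only; real traffic may be English, Bemba, Nyanja, or code-switched.
print(classify("The driver was late and the app charged me twice"))
```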