NagameseBERT
A foundational BERT model for Nagamese Creole - a compact, efficient language model for a low-resource Northeast Indian language.
Overview
NagameseBERT is a 7M-parameter, RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being roughly 15× smaller than multilingual models such as mBERT (110M parameters) and XLM-RoBERTa (125M parameters), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.
Key Features:
- Compact: 6.9M parameters (15× smaller than mBERT)
- Efficient: Pre-trained in ~35 minutes on a single NVIDIA A40 GPU
- Custom tokenizer: 8K BPE vocabulary optimized for Nagamese
- Rigorous evaluation: Multi-seed testing (n=3) with reproducible results
- Open: Model, code, and data splits publicly available
Performance
Multi-seed evaluation results (mean ± std, n=3):
| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|---|---|---|---|---|---|
| NagameseBERT | 7M | 88.35 ± 0.71% | 0.807 ± 0.013 | 91.74 ± 0.68% | 0.565 ± 0.054 |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |
Trade-off: roughly 4-7 percentage points lower accuracy than the multilingual baselines (depending on task) in exchange for a 15× parameter reduction, enabling deployment in resource-constrained settings.
Model Details
Architecture
- Type: RoBERTa-style BERT (no token type embeddings)
- Hidden size: 256
- Layers: 6 transformer blocks
- Attention heads: 4 per layer
- Intermediate size: 1,024
- Max sequence length: 64 tokens
- Total parameters: 6,878,528
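For reference, a model of this shape could be instantiated with transformers roughly as sketched below. The hyperparameters are taken from the list above; the exact max_position_embeddings value is an assumption, and the config shipped with the Hub checkpoint remains authoritative.

from transformers import RobertaConfig, RobertaForMaskedLM

# Hyperparameters copied from the architecture list above; treat this as a
# sketch, not the exact released config.
config = RobertaConfig(
    vocab_size=8000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=64 + 2,  # assumption: RoBERTa-style padding offset
    type_vocab_size=1,               # no token type embeddings
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # ≈ 6.9M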
Tokenizer
- Type: Byte-Pair Encoding (BPE)
- Vocabulary size: 8,000 tokens
- Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
- Normalization: NFD Unicode + accent stripping
- Case: Preserved (for proper nouns and code-switched English)
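A tokenizer with these settings could be trained with the Hugging Face tokenizers library roughly as follows. The pre-tokenization strategy and the corpus file name are assumptions for illustration; the released tokenizer on the Hub is the authoritative artifact.

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Sketch of a BPE tokenizer with the settings listed above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),           # NFD Unicode normalization
    normalizers.StripAccents(),  # accent stripping; case is left untouched
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption, not confirmed by the card
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["nagamese_corpus.txt"], trainer=trainer)  # placeholder corpus file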
Training Data
- Corpus size: 42,552 Nagamese sentences
- Average length: 11.82 tokens/sentence
- Split: 90% train (38,296) / 10% validation (4,256)
- Sources: Web, social media, community contributions (deduplicated)
Pre-training
- Objective: Masked Language Modeling (15% masking)
- Optimizer: AdamW (lr=5e-4, weight_decay=0.01)
- Batch size: 64
- Epochs: 50
- Training time: ~35 minutes
- Hardware: NVIDIA A40 (48GB)
- Final validation loss: 2.79
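The recipe above corresponds roughly to the following sketch of an MLM pre-training setup with transformers. Here model is a freshly initialised RobertaForMaskedLM (see the Architecture sketch), and train_dataset/eval_dataset are tokenized Nagamese sentences assumed to be prepared elsewhere.

from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # 15% masking
)
args = TrainingArguments(
    output_dir="./pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    weight_decay=0.01,
)
Trainer(
    model=model,                 # freshly initialised RobertaForMaskedLM (assumed)
    args=args,
    train_dataset=train_dataset, # tokenized corpus (assumed prepared)
    eval_dataset=eval_dataset,
    data_collator=collator,
).train()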
Usage
Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub
model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage: encode a Nagamese sentence and obtain contextual embeddings
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (batch, seq_len, 256)
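The base model returns contextual token embeddings rather than task predictions. If a single sentence vector is needed, one common recipe (not specific to this model) is attention-mask-aware mean pooling, continuing from the snippet above:

# Mean-pool token embeddings over non-padding positions to get a sentence vector.
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                # torch.Size([1, 256])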
Fine-tuning for Token Classification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load the model with a fresh token-classification head
# (num_labels is the size of your tag set, e.g. 13 for the POS task)
num_labels = 13
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
)

# Train on a tokenized, label-aligned dataset (train_dataset / eval_dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
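After fine-tuning, the saved checkpoint can be applied to new Nagamese text, for example through the token-classification pipeline. The checkpoint path below is a placeholder for wherever Trainer saved your model.

from transformers import pipeline

# Load a fine-tuned checkpoint; "./results/checkpoint-final" is a placeholder path.
tagger = pipeline(
    "token-classification",
    model="./results/checkpoint-final",
    tokenizer="MWirelabs/nagamesebert",
)
print(tagger("Toi moi laga sathi hobo pare?"))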
Evaluation
Dataset
- Source: NagaNLP Annotated Corpus
- Total: 214 sentences
- Split (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- POS tags: 13 Universal Dependencies tags
- NER tags: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
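An 80/10/10 sentence-level split with a fixed seed can be produced as sketched below; this reproduces the reported sizes (171/21/22 on 214 sentences) but is not necessarily the authors' exact script, and sentences is a placeholder for the list of annotated examples.

from sklearn.model_selection import train_test_split

# 80/10/10 split with a fixed seed; on 214 sentences this yields 171/21/22.
train_sents, rest = train_test_split(sentences, test_size=0.2, random_state=42)
dev_sents, test_sents = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train_sents), len(dev_sents), len(test_sents))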
Experimental Setup
- Seeds: 42, 123, 456 (n=3 for variance estimation)
- Batch size: 32
- Learning rate: 3e-5
- Epochs: 100
- Optimization: AdamW with 100 warmup steps
- Hardware: NVIDIA A40
- Metrics: Token-level accuracy and macro-averaged F1
Data Leakage Statement: All splits were created with a fixed seed (42), with no sentence overlap between the train/dev/test sets.
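A compute_metrics function matching the reported metrics (token-level accuracy and macro-averaged F1) could look roughly like the sketch below. It assumes padded and special-token positions are labelled -100, as is conventional for token classification with transformers, and can be passed to Trainer via compute_metrics=compute_metrics.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is (logits, labels) as provided by Trainer during evaluation
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    keep = labels != -100            # drop padding / special-token positions
    y_true, y_pred = labels[keep], preds[keep]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }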
Limitations
- Corpus size: 42K sentences is modest; expansion to 100K+ could improve performance
- Evaluation scale: Small test set (22 sentences) limits statistical power
- Task scope: Only evaluated on token classification; needs broader task assessment
- Efficiency metrics: No quantitative inference benchmarks (latency, memory) yet provided
- Data documentation: Complete data provenance and licenses to be formalized
Citation
If you use NagameseBERT in your research, please cite:
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
Contact
MWire Labs
Shillong, Meghalaya, India
Website: MWire Labs
License
This model is released under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to:
- Share — copy and redistribute the material
- Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit to MWire Labs
Acknowledgments
We thank the Nagamese-speaking community for their contributions to corpus development and validation.