NagameseBERT
A foundational BERT model for Nagamese Creole - a compact, efficient language model for a low-resource Northeast Indian language.
Overview
NagameseBERT is a 7M-parameter, RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being roughly 15× smaller than multilingual models such as mBERT (110M parameters) and XLM-RoBERTa (125M parameters), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.
Key Features:
- Compact: 6.9M parameters (15× smaller than mBERT)
- Efficient: Pre-trained in ~35 minutes on a single NVIDIA A40 GPU
- Custom tokenizer: 8K BPE vocabulary optimized for Nagamese
- Rigorous evaluation: Multi-seed testing (n=3) with reproducible results
- Open: Model, code, and data splits publicly available
Performance
Multi-seed evaluation results (mean ± std, n=3):
| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|---|---|---|---|---|---|
| NagameseBERT | 7M | 88.35 ± 0.71% | 0.807 ± 0.013 | 91.74 ± 0.68% | 0.565 ± 0.054 |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |
Trade-off: roughly 4-7 percentage points lower accuracy than the multilingual baselines (depending on task) in exchange for a 15× parameter reduction, enabling deployment in resource-constrained settings.
Model Details
Architecture
- Type: RoBERTa-style BERT (no token type embeddings)
- Hidden size: 256
- Layers: 6 transformer blocks
- Attention heads: 4 per layer
- Intermediate size: 1,024
- Max sequence length: 64 tokens
- Total parameters: 6,878,528
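For reference, a model of this shape could be instantiated with transformers roughly as sketched below. The hyperparameters are taken from the list above; the exact max_position_embeddings value is an assumption, and the config shipped with the Hub checkpoint remains authoritative.

from transformers import RobertaConfig, RobertaForMaskedLM

# Hyperparameters copied from the architecture list above; treat this as a
# sketch, not the exact released config.
config = RobertaConfig(
    vocab_size=8000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=64 + 2,  # assumption: RoBERTa-style padding offset
    type_vocab_size=1,               # no token type embeddings
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # ≈ 6.9M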
Tokenizer
- Type: Byte-Pair Encoding (BPE)
- Vocabulary size: 8,000 tokens
- Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
- Normalization: NFD Unicode + accent stripping
- Case: Preserved (for proper nouns and code-switched English)
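A tokenizer with these settings could be trained with the Hugging Face tokenizers library roughly as follows. The pre-tokenization strategy and the corpus file name are assumptions for illustration; the released tokenizer on the Hub is the authoritative artifact.

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Sketch of a BPE tokenizer with the settings listed above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),           # NFD Unicode normalization
    normalizers.StripAccents(),  # accent stripping; case is left untouched
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption, not confirmed by the card
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["nagamese_corpus.txt"], trainer=trainer)  # placeholder corpus file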
Training Data
- Corpus size: 42,552 Nagamese sentences
- Average length: 11.82 tokens/sentence
- Split: 90% train (38,296) / 10% validation (4,256)
- Sources: Web, social media, community contributions (deduplicated)
Pre-training
- Objective: Masked Language Modeling (15% masking)
- Optimizer: AdamW (lr=5e-4, weight_decay=0.01)
- Batch size: 64
- Epochs: 50
- Training time: ~35 minutes
- Hardware: NVIDIA A40 (48GB)
- Final validation loss: 2.79
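The recipe above corresponds roughly to the following sketch of an MLM pre-training setup with transformers. Here model is a freshly initialised RobertaForMaskedLM (see the Architecture sketch), and train_dataset/eval_dataset are tokenized Nagamese sentences assumed to be prepared elsewhere.

from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # 15% masking
)
args = TrainingArguments(
    output_dir="./pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    weight_decay=0.01,
)
Trainer(
    model=model,                 # freshly initialised RobertaForMaskedLM (assumed)
    args=args,
    train_dataset=train_dataset, # tokenized corpus (assumed prepared)
    eval_dataset=eval_dataset,
    data_collator=collator,
).train()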
Usage
Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub
model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage: encode a Nagamese sentence and obtain contextual embeddings
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (batch, seq_len, 256)
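The base model returns contextual token embeddings rather than task predictions. If a single sentence vector is needed, one common recipe (not specific to this model) is attention-mask-aware mean pooling, continuing from the snippet above:

# Mean-pool token embeddings over non-padding positions to get a sentence vector.
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                # torch.Size([1, 256])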
Fine-tuning for Token Classification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load the model with a fresh token-classification head
# (num_labels is the size of your tag set, e.g. 13 for the POS task)
num_labels = 13
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
)

# Train on a tokenized, label-aligned dataset (train_dataset / eval_dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
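After fine-tuning, the saved checkpoint can be applied to new Nagamese text, for example through the token-classification pipeline. The checkpoint path below is a placeholder for wherever Trainer saved your model.

from transformers import pipeline

# Load a fine-tuned checkpoint; "./results/checkpoint-final" is a placeholder path.
tagger = pipeline(
    "token-classification",
    model="./results/checkpoint-final",
    tokenizer="MWirelabs/nagamesebert",
)
print(tagger("Toi moi laga sathi hobo pare?"))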
Evaluation
Dataset
- Source: NagaNLP Annotated Corpus
- Total: 214 sentences
- Split (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- POS tags: 13 Universal Dependencies tags
- NER tags: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
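An 80/10/10 sentence-level split with a fixed seed can be produced as sketched below; this reproduces the reported sizes (171/21/22 on 214 sentences) but is not necessarily the authors' exact script, and sentences is a placeholder for the list of annotated examples.

from sklearn.model_selection import train_test_split

# 80/10/10 split with a fixed seed; on 214 sentences this yields 171/21/22.
train_sents, rest = train_test_split(sentences, test_size=0.2, random_state=42)
dev_sents, test_sents = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train_sents), len(dev_sents), len(test_sents))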
Experimental Setup
- Seeds: 42, 123, 456 (n=3 for variance estimation)
- Batch size: 32
- Learning rate: 3e-5
- Epochs: 100
- Optimization: AdamW with 100 warmup steps
- Hardware: NVIDIA A40
- Metrics: Token-level accuracy and macro-averaged F1
Data Leakage Statement: All splits were created with a fixed seed (42), with no sentence overlap between the train/dev/test sets.
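A compute_metrics function matching the reported metrics (token-level accuracy and macro-averaged F1) could look roughly like the sketch below. It assumes padded and special-token positions are labelled -100, as is conventional for token classification with transformers, and can be passed to Trainer via compute_metrics=compute_metrics.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is (logits, labels) as provided by Trainer during evaluation
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    keep = labels != -100            # drop padding / special-token positions
    y_true, y_pred = labels[keep], preds[keep]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }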
Limitations
- Corpus size: 42K sentences is modest; expansion to 100K+ could improve performance
- Evaluation scale: Small test set (22 sentences) limits statistical power
- Task scope: Only evaluated on token classification; needs broader task assessment
- Efficiency metrics: No quantitative inference benchmarks (latency, memory) yet provided
- Data documentation: Complete data provenance and licenses to be formalized
Citation
If you use NagameseBERT in your research, please cite:
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
Contact
MWire Labs
Shillong, Meghalaya, India
Website: MWire Labs
License
This model is released under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to:
- Share — copy and redistribute the material
- Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit to MWire Labs
Acknowledgments
We thank the Nagamese-speaking community for their contributions to corpus development and validation.