arXiv:2512.12537

NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

Published on Dec 14, 2025

Abstract

A novel synthetic-hybrid methodology combining LLM-generated and human-validated data produces high-quality NLP resources for Nagamese, significantly improving performance on linguistic tasks over zero-shot baselines.

AI-generated summary

The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline in which an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and 0.75 F1-Macro on Named Entity Recognition, substantially outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a perplexity of 3.85, an order-of-magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
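
The summary describes the generate-then-validate pipeline only at a high level. Below is a minimal sketch of how such a human-in-the-loop loop can be structured, assuming a simple candidate/approval data model; every name in it (CandidatePair, generate_candidates, native_speaker_review) is a hypothetical illustration, not the released NagaNLP code.

```python
from dataclasses import dataclass

@dataclass
class CandidatePair:
    """One synthetic prompt/response candidate awaiting validation."""
    prompt: str
    response: str
    approved: bool = False

def generate_candidates(seed_prompts, n_per_prompt=5):
    """Stand-in for the expert-guided LLM stage (the paper uses Gemini).
    A real implementation would call an LLM API with curated prompts;
    here we emit placeholders to keep the sketch self-contained."""
    return [CandidatePair(p, f"<LLM draft reply {i} to: {p}>")
            for p in seed_prompts for i in range(n_per_prompt)]

def native_speaker_review(pairs):
    """Stand-in for the human validation stage: native speakers edit
    or reject every candidate before it can enter the corpus."""
    for pair in pairs:
        pair.approved = True  # placeholder for a real annotation decision
    return [p for p in pairs if p.approved]

corpus = native_speaker_review(generate_candidates(["<Nagamese seed prompt>"]))
print(len(corpus), "validated pairs")
```

The point of the structure is that nothing reaches the corpus without an explicit human decision, which is what separates this synthetic-hybrid data from purely LLM-generated text.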
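
On the discriminative side, fine-tuning XLM-RoBERTa-base for POS tagging follows the standard Hugging Face token-classification recipe, the main subtlety being the alignment of word-level tags with subword tokens. A minimal sketch; the tagset is an assumed Universal-POS-style inventory, not necessarily the paper's annotation scheme.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed Universal-POS-style tagset; the paper's scheme may differ.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "ADP",
            "DET", "NUM", "CONJ", "PART", "PUNCT", "X"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(POS_TAGS))

def align_labels(words, word_tags):
    """Tokenize a pre-split sentence and align word-level POS tags with
    XLM-R subwords, masking continuation pieces with -100 so the loss
    ignores them (the usual token-classification convention)."""
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    labels, prev = [], None
    for wid in enc.word_ids():
        labels.append(-100 if wid is None or wid == prev
                      else POS_TAGS.index(word_tags[wid]))
        prev = wid
    return enc, labels

# Placeholder tokens stand in for real Nagamese words.
enc, labels = align_labels(["<word1>", "<word2>"], ["PRON", "VERB"])
```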
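
The generative comparison (perplexity 3.85 for NagaLLaMA versus 96.76 few-shot) can be reproduced with the standard causal-LM loss once the checkpoints are in hand. A minimal sketch; the model ID below is the public Llama-3.2-3B Instruct base checkpoint, while the fine-tuned NagaLLaMA checkpoint location is not given on this page.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Public base model; swap in the released NagaLLaMA checkpoint to
# reproduce the 3.85 figure (its hub ID is not listed on this page).
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm = AutoModelForCausalLM.from_pretrained(MODEL_ID)
lm.eval()

def perplexity(text: str) -> float:
    """exp of the mean token negative log-likelihood; passing
    labels=input_ids makes Transformers return the average
    cross-entropy over the (shifted) sequence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(input_ids=ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("<held-out Nagamese text>"))
```

For scale, 96.76 / 3.85 is roughly 25x, so the abstract's order-of-magnitude framing is, if anything, conservative.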

Models citing this paper: 0
Datasets citing this paper: 3
Spaces citing this paper: 1
Collections including this paper: 0