---
language:
- en
- kha
license: mit
library_name: fasttext
tags:
- embeddings
- word-embeddings
- khasi
- multilingual
- northeast-india
- low-resource
- Meghalaya
datasets:
- custom
metrics:
- cosine_similarity
model-index:
- name: Badnyal/khasi-english-embeddings
  results:
  - task:
      type: word-similarity
      name: Cross-lingual Word Similarity
    dataset:
      name: Khasi-English Parallel Corpus
      type: custom
    metrics:
    - type: cosine_similarity
      value: 0.29
      name: Cross-lingual Similarity Score
---

# Khasi-English Word Embeddings

## Model Description

This model provides the first comprehensive word embeddings for the Khasi language, trained on a bilingual Khasi-English corpus. Khasi is an Austroasiatic language of the Mon-Khmer branch, spoken primarily in Meghalaya, Northeast India.

## Model Architecture

- **Model Type**: FastText (Skip-gram)
- **Embedding Dimension**: 300
- **Vocabulary Size**: 38,220 tokens
- **Training Algorithm**: Hierarchical Softmax
- **Context Window**: 5 words

## Training Data

The model was trained on a curated corpus containing:

- **63,909 Khasi sentences** from diverse sources
- **65,239 English sentences** for cross-lingual alignment
- **65,241 parallel translation pairs**

### Data Sources

- Clean Khasi text corpus
- Processed historical documents
- Bilingual translation datasets
- Cultural and administrative texts

## Performance Metrics

| Metric | Value |
|--------|-------|
| Vocabulary Coverage | 38,220 words |
| Cross-lingual Similarity | 0.290 |
| Training Epochs | 20 |
| Embedding Dimension | 300 |

## Usage

### Loading the Model

```python
import fasttext

# Load the model
model = fasttext.load_model('khasi_embeddings.bin')

# Get word vector
vector = model.get_word_vector('__khasi__ ka')

# Find similar words
similar_words = model.get_nearest_neighbors('__khasi__ ka', k=10)
```

### Cross-lingual Queries

```python
from sklearn.metrics.pairwise import cosine_similarity

# English to Khasi semantic similarity
khasi_word = model.get_word_vector('__khasi__ bad')
english_word = model.get_word_vector('__english__ and')

# Calculate similarity
similarity = cosine_similarity([khasi_word], [english_word])[0][0]
```

## Language Coverage

### Khasi Language Features

- Native script support
- Morphological variations
- Cultural terminology
- Administrative vocabulary

### Cross-lingual Capabilities

- Khasi-English semantic alignment
- Translation assistance
- Cultural concept mapping

## Limitations

- **Cross-lingual alignment**: Limited by structural differences between Khasi and English
- **Domain coverage**: Primarily trained on formal/administrative texts
- **Dialectal variations**: May not capture all regional Khasi variants

## Intended Use

This model is designed for:

- **Research**: Computational linguistics studies on Khasi
- **Language preservation**: Digital archiving and analysis
- **Educational tools**: Language learning applications
- **Cultural preservation**: Maintaining indigenous knowledge

## Ethical Considerations

This model was developed with respect for Khasi cultural heritage and language preservation goals. Users are encouraged to collaborate with Khasi language communities when deploying this model.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{khasi-embeddings-2025,
  title={Khasi-English Word Embeddings: First Comprehensive Embeddings for Khasi Language},
  author={Badnyal},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Badnyal/khasi-english-embeddings}}
}
```

## Acknowledgments

Special thanks to the contributors to the preservation of indigenous languages of Northeast India.

## Contact

For questions, collaborations, or feedback regarding this model, please open an issue in the model repository.

---

*This model represents pioneering work in Khasi language processing and serves as a foundation for future research in Northeast Indian computational linguistics.*
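
## Appendix: Cosine Similarity Sketch

The score computed in the Cross-lingual Queries section is a cosine similarity between two word vectors. For readers who prefer not to install scikit-learn, the following is a minimal, dependency-free sketch of the same quantity. The function name `cosine_similarity` here is a local helper written for illustration, not part of fastText, and the 3-dimensional vectors are toy stand-ins for the model's 300-dimensional embeddings, not real model output.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, NOT taken from the model).
khasi_vec = [1.0, 2.0, 2.0]
english_vec = [2.0, 1.0, 2.0]

score = cosine_similarity(khasi_vec, english_vec)
print(round(score, 4))  # 0.8889 (dot product 8, both norms 3)
```

With the real model, the two lists would be replaced by the arrays returned from `model.get_word_vector(...)`; the helper only needs sequences of numbers of equal length.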