sentence-transformers
Safetensors
xlm-roberta
embeddings
multilingual
NLP
Indic-languages
semantic-search
similarity
Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages
This is a sentence-transformers model fine-tuned from sentence-transformers/stsb-xlm-r-multilingual. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
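# Download from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")
# Run inference: the first two sentences are a Hindi/English translation pair,
# the third is unrelated
sentences = [
    "मैं अपने दोस्त से मिला",
    "I met my friend",
    "I love you",
]
embeddings = np.array(model.encode(sentences))
# Similarity between the Hindi sentence and its English translation
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017
# Similarity between the Hindi sentence and the unrelated English sentence
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127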
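The same cosine scores can drive a small semantic search: embed a query and a set of candidate sentences, then sort the candidates by similarity. The sketch below is a minimal illustration in plain Python. The helper names `cosine` and `rank_corpus` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 768-dimensional output of `model.encode(...)`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_corpus(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): in practice these would come from
# model.encode(query) and model.encode(corpus_sentences).
query = [0.9, 0.1, 0.0]
corpus = [
    [0.0, 1.0, 0.0],  # points in a different direction
    [0.8, 0.2, 0.1],  # nearly parallel to the query
    [0.1, 0.0, 0.9],  # points in a different direction
]

ranking = rank_corpus(query, corpus)
best_index, best_score = ranking[0]
print(best_index, round(best_score, 3))
```

For larger corpora, a vectorised or approximate nearest-neighbour search would usually replace the Python loop; the ranking logic stays the same.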
Evaluation/Benchmarking
Dataset: Flores cross-lingual sentence retrieval task of the IndicXtreme benchmark
| Language | MuRIL | IndicBERT | Vyakyarth | jina-embeddings-v3 |
|---|---|---|---|---|
| Bengali | 77.0 | 91.0 | 98.7 | 97.4 |
| Gujarati | 67.0 | 92.4 | 98.7 | 97.3 |
| Hindi | 84.2 | 90.5 | 99.9 | 98.8 |
| Kannada | 88.4 | 89.1 | 99.2 | 96.8 |
| Malayalam | 82.2 | 89.2 | 98.7 | 96.3 |
| Marathi | 83.9 | 92.5 | 98.8 | 97.1 |
| Sanskrit | 36.4 | 30.4 | 90.1 | 84.1 |
| Tamil | 79.4 | 90.0 | 97.9 | 95.8 |
| Telugu | 43.5 | 88.6 | 97.5 | 97.3 |
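The card does not state which metric the table reports. Assuming it is top-1 retrieval accuracy in percent (a common choice for cross-lingual sentence retrieval), a score could be computed roughly as follows: each source sentence retrieves its nearest target sentence by cosine similarity, and a hit is counted when that target is the aligned translation. The function name `retrieval_accuracy` and the toy 2-dimensional vectors are illustrative assumptions, not part of the benchmark code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieval_accuracy(source_vecs, target_vecs):
    """Percentage of source sentences whose nearest target is the aligned one.

    Assumes source_vecs[i] and target_vecs[i] embed translations of the
    same sentence.
    """
    hits = 0
    for i, src in enumerate(source_vecs):
        scores = [cosine(src, tgt) for tgt in target_vecs]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == i:
            hits += 1
    return 100.0 * hits / len(source_vecs)

# Toy embeddings (assumption): real inputs would be model.encode(...) outputs
# for parallel sentences in two languages.
source = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.3]]
target = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2], [0.95, 0.3]]

# The third source sentence retrieves the wrong target, so 3 of 4 are hits.
print(retrieval_accuracy(source, target))
```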
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
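The card gives no context for the parameter block above. In sentence-transformers, `scale` and `similarity_fct` are parameters of in-batch ranking losses such as `MultipleNegativesRankingLoss`, so one plausible reading is that these are the training-loss settings; the source does not confirm which loss was used. Under that assumption, the sketch below shows what the two parameters do: cosine similarities are multiplied by the scale and fed into a softmax cross-entropy in which the other positives in the batch act as negatives. The function name `scaled_ranking_loss` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def scaled_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) pairs.

    positives[i] is the correct match for anchors[i]; every other
    positive in the batch is treated as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong positive

print(scaled_ranking_loss(anchors, aligned))
print(scaled_ranking_loss(anchors, swapped))
```

Cosine similarity is bounded in [-1, 1], so without a scale factor the softmax would stay nearly flat; multiplying by 20 sharpens it so that correct pairs can reach a loss close to zero.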
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
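The architecture above has two stages: the XLM-RoBERTa transformer produces one 768-dimensional vector per token (inputs longer than `max_seq_length` = 128 tokens are truncated), and the pooling layer, with `pooling_mode_mean_tokens` set to `True`, averages those token vectors into a single sentence vector. A minimal plain-Python sketch of masked mean pooling follows; the function name `mean_pool` and the 2-dimensional toy vectors are assumptions for illustration, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]

# Three token vectors; the last one is padding and must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```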
License
This code repository and the model weights are licensed under the Krutrim Community License.
Citation
@misc{vyakyarth2024,
  author = {Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
  title = {Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}
Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
