Malaysian MaskLM
Collection
Trained on 17B tokens (81GB of cleaned text); able to understand standard Malay, local Malay, local Mandarin, Manglish, and local Tamil.
How to use mesolitica/malaysian-mistral-191M-MLM-512 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="mesolitica/malaysian-mistral-191M-MLM-512", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("mesolitica/malaysian-mistral-191M-MLM-512", trust_remote_code=True)
model = AutoModel.from_pretrained("mesolitica/malaysian-mistral-191M-MLM-512", trust_remote_code=True)

Replicating https://github.com/McGill-NLP/llm2vec using https://huggingface.co/mesolitica/malaysian-mistral-191M-4096, done by https://github.com/aisyahrzk (https://twitter.com/aisyahhhrzk).
Source code at https://github.com/mesolitica/malaya/tree/master/session/llm2vec
WandB logs at https://wandb.ai/aisyahrazak/mistral-191M-mlm?nw=nwuseraisyahrazak
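The "feature-extraction" pipeline shown in the usage snippet returns one vector per token, so comparing two sentences needs a pooling step. The following is a minimal sketch of that step, not the authors' method: it assumes the pipeline returns nested lists shaped [1][seq_len][hidden] for a single string (the default Transformers behaviour; the custom code loaded through trust_remote_code may differ), and it assumes mean pooling is a reasonable choice for this checkpoint, which the model card does not confirm. The helper names `mean_pool`, `cosine_similarity`, `embed`, and `main`, and the two example sentences, are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average a [seq_len][hidden] list of token vectors into one [hidden] vector."""
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    n = len(token_vectors)
    # zip(*rows) walks the hidden dimensions column by column
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed(pipe, text):
    """Turn one string into one vector using a feature-extraction pipeline.

    Assumption: pipe(text) returns nested lists shaped [1][seq_len][hidden].
    """
    return mean_pool(pipe(text)[0])


def main():
    # Imported here so the helpers above stay usable without transformers installed
    from transformers import pipeline

    pipe = pipeline(
        "feature-extraction",
        model="mesolitica/malaysian-mistral-191M-MLM-512",
        trust_remote_code=True,
    )
    # Hypothetical example sentences in standard Malay
    a = embed(pipe, "Saya suka makan nasi lemak.")
    b = embed(pipe, "Nasi lemak ialah makanan kegemaran saya.")
    print(cosine_similarity(a, b))
```

Calling `main()` downloads the checkpoint and prints a single similarity score. Mean pooling here also averages over any special tokens the tokenizer adds; if that matters for your use case, load the model directly with `AutoTokenizer` and `AutoModel` and pool over the attention mask instead.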