Cendol-mT5 ID↔MAD (15 Epochs)
Overview
Cendol-mT5 ID↔MAD (15ep) is a machine translation model based on mT5-small, fine-tuned to translate between Indonesian and Madurese in both directions (bidirectional translation).
The model is intended for:
- NLP research on regional (low-resource) languages
- Indonesian-Madurese machine translation experiments
- educational and computational linguistics applications
Model Details
- Base model: google/mt5-small
- Architecture: Encoder-Decoder (Text-to-Text Transformer)
- Fine-tuning epochs: 15
- Framework: HuggingFace Transformers
- Tokenizer: SentencePiece (shared multilingual vocab)
Training Data
The model was trained on an Indonesian-Madurese parallel dataset built from the following source and preparation steps:
- NusaX Dataset (Madurese-Indonesian)
- Data cleaning (text normalization & filtering)
- Data split: train / validation / test
The Madurese text includes non-standard vocabulary and spelling variants, reflecting how the language is actually written (naturally noisy data).
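The cleaning and splitting steps listed above are not specified in detail, so the following is only a minimal sketch of what such a pipeline could look like. The function names (`normalize`, `clean_pairs`, `split_pairs`), the length-ratio filter, the 80/10/10 split, and the seed are all assumptions made for illustration, not the procedure actually used to train this model.

```python
import random
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def clean_pairs(pairs, min_chars=1, max_ratio=3.0):
    """Normalize (indonesian, madurese) pairs and drop unusable ones.

    The thresholds are hypothetical defaults, not values from the model card.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if len(src) < min_chars or len(tgt) < min_chars:
            continue  # one side is empty or too short
        ratio = max(len(src), len(tgt)) / max(min(len(src), len(tgt)), 1)
        if ratio > max_ratio:
            continue  # very different lengths suggest a misaligned pair
        if (src, tgt) in seen:
            continue  # exact duplicate after normalization
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then split into train / validation / test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Spelling variants in the Madurese side are deliberately left untouched here: only whitespace and Unicode form are normalized, which keeps the naturally noisy character of the data.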
Evaluation Results
The model was evaluated with BLEU and ROUGE on the validation and test sets.
Validation Set
- ID → MAD
  - BLEU ≈ 24–25
  - ROUGE-L ≈ 0.49
- MAD → ID
  - BLEU ≈ 38
  - ROUGE-L ≈ 0.61
Test Set
- ID → MAD
  - BLEU ≈ 27
  - ROUGE-L ≈ 0.53
- MAD → ID
  - BLEU ≈ 38
  - ROUGE-L ≈ 0.62
Scores are higher in the Madurese → Indonesian direction, which is common when the low-resource language is the source.
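To make the ROUGE-L figures above easier to interpret, here is a minimal sketch of sentence-level ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and equal weighting of precision and recall. The model card does not say which evaluation package or settings produced the reported scores, so this sketch should not be expected to reproduce them exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """Sentence-level ROUGE-L F1 with whitespace tokenization (an assumption)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of hypothesis tokens in the LCS
    recall = lcs / len(ref)     # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score can be approximated by averaging `rouge_l_f1` over all reference/translation pairs. For reported results, an established evaluation library is the safer choice.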
How to Use
Installation
pip install transformers sentencepiece torch
Load Model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "addinda/cendol-mt5-id-mad-15ep"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
Indonesian → Madurese
text = "Saya sedang belajar bahasa Madura."
inputs = tokenizer(
    "translate Indonesian to Madurese: " + text,
    return_tensors="pt"
)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Madurese → Indonesian
text = "Engkok badha ka sakola."
inputs = tokenizer(
    "translate Madurese to Indonesian: " + text,
    return_tensors="pt"
)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
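Both directions differ only in the task prefix, so the two snippets above can be folded into one helper. The sketch below is a convenience wrapper, not part of the model's own API: the names `PREFIXES`, `build_prompt`, and `translate_batch` and the direction keys `"id2mad"` / `"mad2id"` are hypothetical. It assumes the `tokenizer` and `model` objects loaded earlier are passed in.

```python
# Task prefixes taken from the usage examples; the dictionary keys are
# hypothetical labels chosen for this sketch.
PREFIXES = {
    "id2mad": "translate Indonesian to Madurese: ",
    "mad2id": "translate Madurese to Indonesian: ",
}


def build_prompt(text: str, direction: str) -> str:
    """Prepend the task prefix for the requested translation direction."""
    if direction not in PREFIXES:
        raise ValueError(f"direction must be one of {sorted(PREFIXES)}")
    return PREFIXES[direction] + text.strip()


def translate_batch(texts, direction, tokenizer, model, max_new_tokens=128):
    """Translate a list of sentences in one padded batch."""
    prompts = [build_prompt(t, direction) for t in texts]
    # Padding lets sentences of different lengths share one tensor batch.
    inputs = tokenizer(
        prompts, return_tensors="pt", padding=True, truncation=True
    )
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

For example, `translate_batch(["Saya sedang belajar bahasa Madura."], "id2mad", tokenizer, model)` returns a list with one Madurese translation.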