Cendol-mT5 ID–MAD (15 Epochs)

📌 Overview

Cendol-mT5 ID–MAD (15ep) is a machine translation model based on mT5-small, fine-tuned to translate between Indonesian and Madurese in both directions (Indonesian ↔ Madurese, bidirectional translation).

The model is intended for:

  • NLP research on regional (low-resource) languages
  • Indonesian–Madurese machine translation experiments
  • educational and computational linguistics applications

🔧 Model Details

  • Base model: google/mt5-small
  • Architecture: Encoder–Decoder (Text-to-Text Transformer)
  • Fine-tuning epochs: 15
  • Framework: HuggingFace Transformers
  • Tokenizer: SentencePiece (shared multilingual vocab)

📚 Training Data

The model was trained on an Indonesian–Madurese parallel dataset, prepared as follows:

  • Source: NusaX Dataset (Madurese–Indonesian)
  • Data cleaning (text normalization & filtering)
  • Data split: train / validation / test

The Madurese text includes non-standard vocabulary and spelling variants, reflecting how Madurese is actually written (naturally noisy data).
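
The card does not describe the exact cleaning rules or split ratios, so the following is only a minimal sketch of what the normalization, filtering, and splitting steps listed above might look like. Everything in it is an assumption made for illustration: NFKC normalization, whitespace collapsing, dropping empty, duplicate, or badly length-mismatched pairs, and the 80/10/10 fractions. The function names (`normalize`, `clean_pairs`, `split_pairs`) are hypothetical as well.

```python
import random
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace (assumed rules, not the authors')."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs, max_len_ratio=3.0):
    """Normalize each (source, target) pair, then drop empty, duplicate,
    or strongly length-mismatched pairs. The ratio threshold is a placeholder."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


def split_pairs(pairs, valid_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then split into train / validation / test.
    The fractions are placeholders, not the split used for this model."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_valid = int(len(pairs) * valid_frac)
    n_test = int(len(pairs) * test_frac)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test
```

Because the data is described as naturally noisy, a sketch like this deliberately leaves spelling variants untouched and only removes pairs that are unusable for training.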


📊 Evaluation Results

The model was evaluated with BLEU and ROUGE on the validation and test sets.

🔹 Validation Set

  • ID → MAD
    • BLEU ≈ 24–25
    • ROUGE-L ≈ 0.49
  • MAD → ID
    • BLEU ≈ 38
    • ROUGE-L ≈ 0.61

🔹 Test Set

  • ID → MAD
    • BLEU ≈ 27
    • ROUGE-L ≈ 0.53
  • MAD → ID
    • BLEU ≈ 38
    • ROUGE-L ≈ 0.62

The model scores higher in the Madurese → Indonesian direction, a pattern commonly observed when translating from a low-resource language into a higher-resource one.
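
The ROUGE-L values above are on a 0 to 1 scale, which is consistent with an F-measure over the longest common subsequence (LCS) of reference and hypothesis tokens. The card does not say which implementation or tokenization was used, so the following is a minimal sketch under stated assumptions: lowercased whitespace tokenization and a plain, unweighted F1. The function names are hypothetical, and scores from standard evaluation packages may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """Sentence-level ROUGE-L F1 on lowercased whitespace tokens
    (an assumed tokenization, not necessarily the one used for this card)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level figure could then be approximated by averaging `rouge_l_f1` over all reference/translation pairs in a split.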


🚀 How to Use

Installation

pip install transformers sentencepiece torch

Load Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "addinda/cendol-mt5-id-mad-15ep"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Indonesian → Madurese

text = "Saya sedang belajar bahasa Madura."
inputs = tokenizer(
    "translate Indonesian to Madurese: " + text,
    return_tensors="pt"
)

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Madurese → Indonesian

text = "Engkok badha ka sakola."
inputs = tokenizer(
    "translate Madurese to Indonesian: " + text,
    return_tensors="pt"
)

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
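
Both snippets above differ only in the task prefix, so translating several sentences at once can be wrapped in a small helper. The sketch below is one possible way to do that. The direction keys (`"id2mad"`, `"mad2id"`) and the function names are hypothetical, while the prefixes are copied from the examples above. It expects the `tokenizer` and `model` objects from the Load Model step to be passed in.

```python
# Task prefixes copied from the usage snippets; the dictionary keys are
# hypothetical labels chosen for this example.
PREFIXES = {
    "id2mad": "translate Indonesian to Madurese: ",
    "mad2id": "translate Madurese to Indonesian: ",
}


def build_prompt(text: str, direction: str) -> str:
    """Prepend the task prefix for the requested direction."""
    if direction not in PREFIXES:
        raise ValueError(f"direction must be one of {sorted(PREFIXES)}")
    return PREFIXES[direction] + text.strip()


def translate_batch(texts, direction, tokenizer, model, max_new_tokens=128):
    """Translate a list of sentences with a single generate() call."""
    prompts = [build_prompt(t, direction) for t in texts]
    # padding=True pads every prompt to the longest one so the batch
    # can be stacked into a single tensor.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

For example, `translate_batch(["Saya sedang belajar bahasa Madura."], "id2mad", tokenizer, model)` should return a one-element list containing the Madurese translation.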