ByT5-Small Khmer to English Translation
This model is a fine-tuned version of google/byt5-small for Khmer (KM) to English (EN) translation.
Unlike standard token-based models (such as BERT or XLM-R), ByT5 operates at the byte level. This makes it exceptionally robust for Khmer, which is written without explicit word boundaries (scriptio continua) and is often subject to non-standard Romanization ("Sing Khmer") or inconsistent Unicode typing.
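For intuition, ByT5's vocabulary is essentially the 256 UTF-8 byte values plus a handful of special tokens. The sketch below (not part of this model card's code) illustrates the byte-to-ID mapping, assuming ByT5's convention of reserving IDs 0–2 for `<pad>`, `</s>`, and `<unk>` so each byte `b` maps to ID `b + 3`:

```python
# Illustration of ByT5's byte-level "tokenization": each UTF-8 byte
# becomes one token ID, offset by 3 to leave room for the special
# tokens <pad>=0, </s>=1, <unk>=2.
def byt5_ids(text: str) -> list:
    return [b + 3 for b in text.encode("utf-8")]

# A single Khmer letter (KA, U+1780) occupies 3 UTF-8 bytes and so
# produces 3 token IDs, while ASCII "a" produces only 1.
print(byt5_ids("ក"))   # [228, 161, 131]
print(byt5_ids("a"))   # [100]
```

This is why ByT5 needs no language-specific tokenizer for Khmer: any byte sequence, including unsegmented or unusually typed text, is representable.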
Model Details
- Model Architecture: ByT5 (Byte-Level T5)
- Base Model: google/byt5-small
- Task: Machine Translation (Khmer to English)
- Training Strategy: Single-task fine-tuning (no task prefix).
- Preprocessing: Inputs were normalized using Unicode NFC to ensure consistency in Khmer vowel representation.
How to Use
For the best results, it is highly recommended to use NFC Normalization on your input text before passing it to the model. This matches the data processing pipeline used during training.
Python Inference Script
```python
import torch
import unicodedata
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load model
MODEL_ID = "Darayut/byt5-small-khm-en-translation"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(device)

def translate(text):
    # --- Preprocessing (must match training) ---
    # Normalize to NFC (fixes hidden Khmer vowel issues).
    # This is crucial for ByT5, since it reads raw bytes.
    text = unicodedata.normalize("NFC", text.strip())

    # Tokenize
    inputs = tokenizer(text, return_tensors="pt").input_ids.to(device)

    # Generate (max_length=128 is usually enough for a single English sentence)
    outputs = model.generate(inputs, max_length=128)

    # Decode
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# --- Example usage ---
khmer_text = "ជីវិតគឺជាការធ្វើដំណើរដែលបំពេញដោយបទពិសោធន៍"
result = translate(khmer_text)
print(f"Output: {result}")
# Expected Output: "Life is a journey fulfilled by the experience"
```
Training Data & Procedure
The model was fine-tuned on a curated dataset of Khmer-English pairs (including subsets from OPUS Paracrawl, MTEB, and Polynews).
Key Training Configurations:
- Normalization: All data was pre-processed using unicodedata.normalize("NFC", text).
- Prefix: No task prefix was used.
- Optimizer: AdamW.
- Precision: FP32 (Full Precision) was used to ensure stability and prevent gradient collapse common in small ByT5 models on low-resource datasets.
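The full training script is not included in this card. As a rough illustration only, the configurations above might map onto transformers Seq2SeqTrainingArguments as follows; the optimizer and FP32 choices come from the card, while every other value is a hypothetical placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative sketch only: optim (AdamW) and full-precision training are
# stated in the card; output_dir and the remaining flags are hypothetical.
training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-small-khm-en",  # hypothetical path
    optim="adamw_torch",             # AdamW, as stated in the card
    fp16=False,                      # FP32 full precision...
    bf16=False,                      # ...to avoid gradient collapse
    predict_with_generate=True,      # generate translations during eval
)
```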
Limitations
- Inference Speed: Since ByT5 generates text byte-by-byte (and each Khmer character occupies three UTF-8 bytes), inference is slower than standard subword-level models.
- Context: The model is optimized for sentence-level translation. Extremely long documents should be split into sentences before processing.
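For the sentence-splitting advice above, a minimal sketch (not part of the model's own pipeline) that assumes sentences end with the Khmer khan "។" or a Western terminator:

```python
import re

def split_sentences(text: str) -> list:
    # Split after the Khmer khan (។) or after ., ?, !, keeping the
    # terminator attached to its sentence via a lookbehind.
    parts = re.split(r"(?<=[។.?!])\s*", text)
    return [p.strip() for p in parts if p.strip()]

doc = "សួស្តី។ How are you? ខ្ញុំសុខសប្បាយ។"
print(split_sentences(doc))
# ['សួស្តី។', 'How are you?', 'ខ្ញុំសុខសប្បាយ។']
```

Each resulting sentence can then be passed to translate() individually, which also keeps generation well under the max_length budget.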