ByT5-Small Khmer to English Translation

This model is a fine-tuned version of google/byt5-small for Khmer (KM) to English (EN) translation.

Unlike standard subword token-based models (such as BERT or XLM-R), ByT5 operates directly on UTF-8 bytes. This makes it exceptionally robust for Khmer, which is written without explicit word boundaries (scriptio continua) and is often subject to non-standard romanization ("Sing Khmer") or inconsistent Unicode typing.

Model Details

  • Model Architecture: ByT5 (Byte-Level T5)
  • Base Model: google/byt5-small
  • Task: Machine Translation (Khmer to English)
  • Training Strategy: Single-task fine-tuning (no task prefix).
  • Preprocessing: Inputs were normalized using Unicode NFC to ensure consistency in Khmer vowel representation.

How to Use

For the best results, it is highly recommended to use NFC Normalization on your input text before passing it to the model. This matches the data processing pipeline used during training.
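If you prefer to verify normalization rather than apply it unconditionally, Python's standard unicodedata module (3.8+) can check whether a string is already in NFC. A small sketch, illustrated with a generic decomposed Latin sequence since the effect on Khmer input depends on how it was typed:

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # No-op if the input is already NFC; otherwise re-compose it.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

# A decomposed sequence (base letter + combining mark) is re-composed:
assert ensure_nfc("e\u0301") == "\u00e9"
```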

Python Inference Script

import torch
import unicodedata
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load Model
MODEL_ID = "Darayut/byt5-small-khm-en-translation"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(device)

def translate(text):
    # --- PREPROCESSING (Must match training) ---
    # 1. Normalize to NFC (Fixes hidden Khmer vowel issues)
    # This is crucial for ByT5 as it reads raw bytes.
    text = unicodedata.normalize("NFC", text.strip())
    
    # 2. Tokenize
    inputs = tokenizer(text, return_tensors="pt").input_ids.to(device)
    
    # 3. Generate
    # max_length=128 is usually enough for English sentences
    outputs = model.generate(inputs, max_length=128)
    
    # 4. Decode
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

# --- Example Usage ---
khmer_text = "αž‡αžΈαžœαž·αžαž‚αžΊαž‡αžΆαž€αžΆαžšαž’αŸ’αžœαžΎαžŠαŸ†αžŽαžΎαžšαžŠαŸ‚αž›αž–αŸ„αžšαž–αŸαž‰αžŠαŸ„αž™αž”αž‘αž–αž·αžŸαŸ„αž’αž“αŸ"
result = translate(khmer_text)
print(f"Output: {result}")

# Expected Output: "Life is a journey fulfilled by the experience"

Training Data & Procedure

The model was fine-tuned on a curated dataset of Khmer-English pairs (including subsets from OPUS Paracrawl, MTEB, and Polynews).

Key Training Configurations:

  • Normalization: All data was pre-processed using unicodedata.normalize("NFC", text).
  • Prefix: No task prefix was used.
  • Optimizer: AdamW.
  • Precision: FP32 (Full Precision) was used to ensure stability and prevent gradient collapse common in small ByT5 models on low-resource datasets.

Limitations

  • Inference Speed: Since ByT5 generates text byte-by-byte (character-by-character), inference is slower than standard word-level models.
  • Context: The model is optimized for sentence-level translation. Extremely long documents should be split into sentences before processing.
Model Size & Format

  β€’ Parameters: ~0.3B
  β€’ Tensor type: F32 (Safetensors)