---
language:
- zu
- en
tags:
- translation
- african-languages
- scientific-translation
- afriscience-mt
- m2m100
license: apache-2.0
base_model: facebook/m2m100_1.2B
datasets:
- afriscience-mt
pipeline_tag: translation
model-index:
- name: m2m100_1.2b-zul-eng
  results:
  - task:
      type: translation
    metrics:
    - name: BLEU (test)
      type: bleu
      value: 30.05
    - name: chrF (test)
      type: chrf
      value: 52.24
    - name: SSA-COMET (test)
      type: comet
      value: 60.13
---

# m2m100_1.2b-zul-eng

[![Model on HF](https://huggingface.co/datasets/huggingface/badges/raw/main/model-on-hf-sm.svg)](https://huggingface.co/AfriScience-MT/m2m100_1.2b-zul-eng)

This model is part of the **AfriScience-MT** project, which focuses on machine translation of scientific texts for African languages.

## Model Description

| Property | Value |
|----------|-------|
| **Model Type** | Seq2Seq Translation |
| **Translation Direction** | isiZulu → English |
| **Base Model** | [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B) |
| **Domain** | Scientific/Academic texts |
| **Training** | Full fine-tuning on AfriScience-MT dataset |

## Evaluation Results

Performance on the AfriScience-MT validation and test sets:

| Split | BLEU | chrF | SSA-COMET |
|-------|------|------|-----------|
| Validation | 31.39 | 53.03 | 61.09 |
| **Test** | **30.05** | **52.24** | **60.13** |

**Metrics explanation:**

- **BLEU**: Measures n-gram overlap with reference translations (0-100, higher is better)
- **chrF**: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
- **SSA-COMET**: Neural metric trained for Sub-Saharan African languages, shown as a percentage (0-100, higher is better) ([McGill-NLP/ssa-comet-stl](https://huggingface.co/McGill-NLP/ssa-comet-stl))

## Usage

### Quick Start

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-zul-eng"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language (isiZulu)
tokenizer.src_lang = "zu"

# Translate
# NOTE: the model translates isiZulu → English, so replace this English
# placeholder with the isiZulu source text you want to translate.
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate with the target language (English) forced as the first token
forced_bos_token_id = tokenizer.get_lang_id("en")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=256,
    num_beams=5,
)

translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```

### Batch Translation

```python
# Reuses `model`, `tokenizer`, and `forced_bos_token_id` from Quick Start.
# NOTE: replace these English placeholders with isiZulu source sentences.
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development.",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=256,
    num_beams=5,
)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 1 |
| Learning Rate | 2e-05 |

### Training Data

- **Dataset**: AfriScience-MT
- **Domain**: Scientific abstracts and papers
- **Languages**: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)

## Reproducibility

To reproduce this model:

```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang zul \
    --target_lang eng \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```

## Limitations

- **Domain Specificity**: This model is optimized for scientific/academic texts and may perform poorly on colloquial or informal text.
- **Language Coverage**: Supports only the isiZulu → English direction.
- **Input Length**: Maximum input length is 256 tokens; longer texts should be split into segments.

## Citation

If you use this model, please cite the AfriScience-MT project:

```bibtex
@inproceedings{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

- Built on top of [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B)
- Evaluation using [SSA-COMET](https://huggingface.co/McGill-NLP/ssa-comet-stl) for African language assessment
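
## Splitting Long Inputs

The Limitations section notes that inputs longer than 256 tokens should be split into segments. The following is a minimal sketch of one way to do that, not part of the AfriScience-MT codebase: the function name `split_into_segments`, the regex-based sentence splitter, and the default whitespace token counter are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def split_into_segments(
    text: str,
    max_tokens: int = 256,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into segments of at most max_tokens.

    Hypothetical helper: `count_tokens` defaults to a whitespace word count,
    which only approximates the model's subword token count.
    """
    # Naive sentence split after ., ! or ? followed by whitespace (assumption;
    # abbreviation-heavy scientific text may need a proper sentence splitter).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    # A single sentence longer than max_tokens is kept whole here and would
    # still be truncated by the tokenizer at translation time.
    return segments
```

To count with the model's own tokenizer instead of whitespace words, one could pass `count_tokens=lambda s: len(tokenizer(s).input_ids)`, where `tokenizer` is the object loaded in Quick Start. The resulting segments can then be fed to the batch translation code and the outputs joined in order.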