---
language:
- zu
- en
tags:
- translation
- african-languages
- scientific-translation
- afriscience-mt
- m2m100
license: apache-2.0
base_model: facebook/m2m100_1.2B
datasets:
- afriscience-mt
pipeline_tag: translation
model-index:
- name: m2m100_1.2b-zul-eng
  results:
  - task:
      type: translation
    metrics:
    - name: BLEU (test)
      type: bleu
      value: 30.05
    - name: chrF (test)
      type: chrf
      value: 52.24
    - name: SSA-COMET (test)
      type: comet
      value: 60.13
---

# m2m100_1.2b-zul-eng

[![Model on HF](https://huggingface.co/datasets/huggingface/badges/raw/main/model-on-hf-sm.svg)](https://huggingface.co/AfriScience-MT/m2m100_1.2b-zul-eng)

This model is part of the **AfriScience-MT** project, focused on machine translation of scientific texts for African languages.

## Model Description

| Property | Value |
|----------|-------|
| **Model Type** | Seq2Seq Translation |
| **Translation Direction** | isiZulu → English |
| **Base Model** | [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B) |
| **Domain** | Scientific/Academic texts |
| **Training** | Full fine-tuning on AfriScience-MT dataset |

## Evaluation Results

Performance on the AfriScience-MT validation and test sets:

| Split | BLEU | chrF | SSA-COMET |
|-------|------|------|-----------|
| Validation | 31.39 | 53.03 | 61.09 |
| **Test** | **30.05** | **52.24** | **60.13** |

**Metrics explanation:**
- **BLEU**: Measures n-gram overlap with reference translations (0-100, higher is better)
- **chrF**: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
- **SSA-COMET**: Neural metric trained for Sub-Saharan African languages ([McGill-NLP/ssa-comet-stl](https://huggingface.co/McGill-NLP/ssa-comet-stl)); scores are reported as percentages (0-100, higher is better)
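
To make the chrF idea concrete, here is a minimal sketch of a character n-gram F-score. It is a simplified illustration and not the implementation used to produce the scores above: the function names `char_ngrams` and `simple_chrf` are hypothetical, and the defaults (n-grams up to order 6, beta of 2, whitespace removed) are assumptions chosen to resemble common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("the cell divides", "the cells divide"))
```

Because matching happens on characters rather than whole words, a hypothesis that gets a stem right but an affix wrong still earns partial credit, which is why chrF is often preferred for agglutinative languages such as isiZulu.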

## Usage

### Quick Start

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-zul-eng"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language
tokenizer.src_lang = "zu"

# Translate (the input must be isiZulu, the model's source language)
text = "Ngiyabonga kakhulu."  # simple isiZulu placeholder; replace with your scientific text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate with target language
forced_bos_token_id = tokenizer.get_lang_id("en")
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```

### Batch Translation

```python
# Reuses `model`, `tokenizer`, and `forced_bos_token_id` from Quick Start.
# Inputs must be isiZulu; these short phrases are placeholders for your own text.
texts = [
    "Sawubona.",
    "Ngiyabonga kakhulu.",
    "Unjani?",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```
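
For larger collections, the batch call above can be applied to fixed-size chunks so that memory use stays bounded. The following is a minimal sketch, not part of the AfriScience-MT codebase: `chunks` and `translate_in_chunks` are hypothetical helpers, and `translate_fn` is assumed to be any callable that maps a list of source strings to a list of translations (for example, a small wrapper around the tokenizer, `model.generate`, and `batch_decode` calls shown above).

```python
from typing import Callable, Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_chunks(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    chunk_size: int = 8,
) -> List[str]:
    """Translate `texts` chunk by chunk, preserving the input order."""
    results: List[str] = []
    for chunk in chunks(texts, chunk_size):
        results.extend(translate_fn(chunk))
    return results


# Stand-in for a real translation function, used only to show the call shape.
demo = translate_in_chunks(["a", "b", "c"], lambda batch: [s.upper() for s in batch], chunk_size=2)
print(demo)
```

The chunk size of 8 is an arbitrary default; a suitable value depends on available memory and typical sentence length.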

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 1 |
| Learning Rate | 2e-05 |

### Training Data

- **Dataset**: AfriScience-MT
- **Domain**: Scientific abstracts and papers
- **Languages**: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)


## Reproducibility

To reproduce this model:

```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang zul \
    --target_lang eng \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```

## Limitations

- **Domain Specificity**: This model is optimized for scientific/academic texts and may perform poorly on colloquial or informal text.
- **Language Coverage**: Supports only the isiZulu → English direction.
- **Input Length**: Maximum input length is 256 tokens; longer texts should be split into segments.
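
One way to respect the input length limit is to pack whole sentences into segments that stay under a token budget. The sketch below is an assumption-laden illustration rather than the project's own preprocessing: `split_into_segments` is a hypothetical helper, sentence boundaries are detected with a naive punctuation regex, and `count_tokens` is any callable that returns the token count of a string.

```python
import re
from typing import Callable, List


def split_into_segments(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 256,
) -> List[str]:
    """Greedily pack sentences into segments of at most `max_tokens` tokens.

    A single sentence that exceeds the budget is kept as its own segment,
    so it may still be truncated at translation time.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)  # close the segment before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


# Whitespace word count as a stand-in token counter for demonstration.
word_count = lambda s: len(s.split())
print(split_into_segments("One two. Three four. Five six.", word_count, max_tokens=4))
```

With the real tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer(s)["input_ids"])`, so the budget is measured in the same units the model sees.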

## Citation

If you use this model, please cite the AfriScience-MT project:

```bibtex
@misc{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

- Built on top of [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B)
- Evaluation using [SSA-COMET](https://huggingface.co/McGill-NLP/ssa-comet-stl) for African language assessment