QT V.4.1 32K UltraLingo: SuperBPE Tokenizer
Quartz Data Infrastructure (quartz.host) | AENEA Global (aeneaglobal.com)
A 32,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Prelude model series (sub-500M parameters). Part of the QuartzTokenizer (QT) family.
Key Results
Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):
| Metric | QT V.4.1 32K | Llama 3 (128K) |
|---|---|---|
| Vocabulary size | 32,000 | 128,256 |
| Mean fertility (tokens/word) | 4.343 | 5.716 |
| Median fertility | 2.812 | 2.700 |
| Equity ratio (max/min fertility) | 38.6x | 118.6x |
| Total tokens (204 langs) | 14,138,900 | 16,764,198 |
| Token savings | -15.7% | baseline |
At roughly one quarter of the vocabulary size, QT V.4.1 32K produces 15.7% fewer total tokens than Llama 3 across the 204 FLORES-200 languages, with a 3.1x lower max/min equity ratio (38.6x vs 118.6x).
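The headline figures can be re-derived from the table above. The following is a quick arithmetic check; the variable names are illustrative and the numbers are copied from the table.

```python
# Figures copied from the Key Results table.
qt_tokens = 14_138_900
llama_tokens = 16_764_198
qt_equity, llama_equity = 38.6, 118.6
qt_vocab, llama_vocab = 32_000, 128_256

token_savings = 1 - qt_tokens / llama_tokens   # fraction of tokens saved vs Llama 3
equity_gain = llama_equity / qt_equity         # how many times lower the max/min ratio is
vocab_fraction = qt_vocab / llama_vocab        # QT vocabulary relative to Llama 3

print(f"token savings: {token_savings:.1%}")   # 15.7%
print(f"equity gain:   {equity_gain:.1f}x")    # 3.1x
print(f"vocab ratio:   {vocab_fraction:.2f}")  # 0.25
```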
Architecture
QT V.4.1 is a two-stage SuperBPE tokenizer with three innovations over standard BPE:
1. Two-Stage SuperBPE Training
- Stage 1 (28,800 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, character sequences.
- Stage 2 (3,200 tokens, 10%): SuperBPE lifts the whitespace boundary constraint, allowing merges to span word boundaries. Learns high-frequency multi-word superword tokens. Sentence boundary protection prevents cross-sentence tokens.
Based on Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models" (+4.0% downstream, +8.2% MMLU, -27% inference compute).
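The two stages can be illustrated with a toy character-level BPE trainer. This is a minimal sketch and not the QT training code: it assumes `str.split()` as the pre-tokenizer, works on characters rather than bytes, and treats each input string as one sentence so that no merge can cross a sentence boundary. All function names are hypothetical.

```python
from collections import Counter

def count_pairs(seqs):
    """Count adjacent symbol pairs across all sequences."""
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs

def apply_merge(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def train_merges(seqs, num_merges, min_freq=1):
    """Greedy BPE: repeatedly merge the most frequent pair (ties broken by pair order)."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seqs)
        if not pairs:
            break
        pair, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_freq:
            break
        merges.append(pair)
        seqs = [apply_merge(s, pair) for s in seqs]
    return merges, seqs

def two_stage_bpe(sentences, stage1_merges, stage2_merges):
    # Stage 1: every word (with its leading space) is a separate sequence,
    # so merges cannot cross whitespace.
    words = [[" " + w for w in s.split()] for s in sentences]
    merges1, _ = train_merges([list(w) for ws in words for w in ws], stage1_merges)

    # Stage 2: re-segment each sentence with the stage 1 merges, then keep
    # merging over the whole sentence so pairs may span word boundaries.
    sentence_seqs = []
    for ws in words:
        seq = []
        for w in ws:
            symbols = list(w)
            for pair in merges1:
                symbols = apply_merge(symbols, pair)
            seq.extend(symbols)
        sentence_seqs.append(seq)
    merges2, final = train_merges(sentence_seqs, stage2_merges)
    return merges1, merges2, final

merges1, merges2, final = two_stage_bpe(
    ["of the people", "of the state", "of the law"], stage1_merges=5, stage2_merges=2
)
print(merges1)
print(merges2)
print(final[0])
```

On this toy corpus, stage 1 can only build tokens such as " of" and " the", while stage 2 produces a token that contains an interior space. The real tokenizer applies the same idea with the 28,800 / 3,200 split and the minimum frequencies listed under Training Configuration.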
2. Script-Aware Pre-Tokenization (Indic Only)
- Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
- Preserves conjunct consonants by not breaking across virama (halant) marks
- CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
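One way to picture virama-aware segmentation is a pass that never starts a new segment on a combining mark or directly after a virama. The sketch below is an assumption about how such a rule could look, not the QT implementation; `segment_indic` is a hypothetical name, and zero-width joiners are not handled.

```python
import unicodedata

# Virama (halant) code points for the ten Indic scripts listed above.
VIRAMAS = {
    "\u094d",  # Devanagari
    "\u09cd",  # Bengali
    "\u0a4d",  # Gurmukhi
    "\u0acd",  # Gujarati
    "\u0b4d",  # Odia
    "\u0bcd",  # Tamil
    "\u0c4d",  # Telugu
    "\u0ccd",  # Kannada
    "\u0d4d",  # Malayalam
    "\u0dca",  # Sinhala (al-lakuna)
}

def segment_indic(text):
    """Split text into segments without breaking across a virama."""
    segments = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")  # vowel signs, viramas, other marks
        after_virama = bool(segments) and segments[-1][-1] in VIRAMAS
        joins_conjunct = after_virama and category.startswith("L")
        if segments and (is_mark or joins_conjunct):
            segments[-1] += ch   # stay inside the current conjunct or syllable
        else:
            segments.append(ch)  # start a new segment
    return segments

# Devanagari "kshatriya": KA + virama + SSA, TA + virama + RA + vowel sign I, YA
word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
segments = segment_indic(word)
print(segments)
```

Here the conjuncts stay intact: the word splits into three segments rather than eight code points, so BPE merges start from units that respect the script.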
3. Streaming Sharded Training
- Corpus sharded to disk for RAM-bounded training
- Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
- Enables SuperBPE training on consumer hardware (16 GB RAM)
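A minimal sketch of the sharding idea follows, assuming plain-text shards with one record per line and Bernoulli sampling per line. The function names, file naming scheme, and tiny shard size are illustrative only; the actual pipeline uses 500 MB shards.

```python
import random
import tempfile
from pathlib import Path

def write_shards(lines, shard_dir, max_bytes):
    """Write an iterable of lines to numbered shard files of at most ~max_bytes each."""
    shard_dir = Path(shard_dir)
    shard_dir.mkdir(parents=True, exist_ok=True)
    paths, buffer, size = [], [], 0

    def flush():
        path = shard_dir / f"shard_{len(paths):05d}.txt"
        path.write_text("".join(buffer), encoding="utf-8")
        paths.append(path)

    for line in lines:
        record = line.rstrip("\n") + "\n"
        n_bytes = len(record.encode("utf-8"))
        if buffer and size + n_bytes > max_bytes:
            flush()
            buffer, size = [], 0
        buffer.append(record)
        size += n_bytes
    if buffer:
        flush()
    return paths

def stream_sample(paths, sample_ratio, seed=42):
    """Yield a random subset of lines, reading one shard file at a time."""
    rng = random.Random(seed)
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if rng.random() < sample_ratio:
                    yield line.rstrip("\n")

with tempfile.TemporaryDirectory() as tmp:
    corpus = [f"line {i}" for i in range(100)]
    shards = write_shards(corpus, tmp, max_bytes=64)
    n_shards = len(shards)
    full_pass = list(stream_sample(shards, sample_ratio=1.0))
    sampled = list(stream_sample(shards, sample_ratio=0.35))

print(n_shards, len(full_pass), len(sampled))
```

Only one shard buffer is held in memory while writing, and only one line at a time while reading, so peak memory is bounded by the shard size rather than by the corpus size. A separate `sample_ratio` per call mirrors the separate Stage 1 and Stage 2 ratios.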
Training Data
Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):
| Category | Share | Description |
|---|---|---|
| Wikipedia | 70.7% | 72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language |
| Stack Exchange | 21.7% | English reasoning, STEM, humanities, multilingual Q&A |
| Code | 8.0% | Python, JavaScript, Java, C/C++, Go/Rust, Shell |
Corpus design follows the iterative fertility balancing of "The Art of Breaking Words" (arXiv 2508.06533) and the script/family bucket approach of "One Tokenizer to Rule Them All".
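Sqrt-proportional sampling with a per-language floor can be sketched as follows. How the floor is applied is an assumption here: languages whose share falls below the floor are pinned to it, and the remaining mass is redistributed in proportion to the square root of corpus size. The function name and the example sizes are hypothetical.

```python
import math

def sqrt_sampling_shares(sizes, floor=0.003):
    """Return per-language sampling shares proportional to sqrt(size), with a minimum share."""
    if floor * len(sizes) >= 1:
        raise ValueError("floor is too large for this many languages")
    raw = {lang: math.sqrt(n) for lang, n in sizes.items()}
    total = sum(raw.values())
    shares = {lang: r / total for lang, r in raw.items()}
    floored = set()
    while True:
        below = {lang for lang, s in shares.items() if lang not in floored and s < floor}
        if not below:
            return shares
        floored |= below
        free_mass = 1.0 - floor * len(floored)  # mass left for unfloored languages
        free_total = sum(r for lang, r in raw.items() if lang not in floored)
        shares = {
            lang: floor if lang in floored else free_mass * raw[lang] / free_total
            for lang in raw
        }

# Hypothetical corpus sizes (arbitrary units), not the real Wikipedia sizes.
shares = sqrt_sampling_shares({"en": 1_000_000, "de": 250_000, "bo": 1})
print(shares)
```

In this example English is four times larger than German but receives only twice the share, and the very small third language is lifted to the 0.3% floor.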
Per-Script Performance
FLORES-200 benchmark, mean tokens per word (lower is better):
| Script | QT V.4.1 32K | Llama 3 (128K) | Languages |
|---|---|---|---|
| Latin | 2.53 | 2.39 | 37 |
| Arabic | 2.42 | 2.70 | 2 |
| Gurmukhi | 2.75 | 8.23 | 1 |
| Hebrew | 2.78 | 5.76 | 1 |
| Cyrillic | 2.82 | 2.59 | 5 |
| Devanagari | 2.98 | 3.52 | 3 |
| Armenian | 3.36 | 12.23 | 1 |
| Bengali | 3.40 | 8.07 | 1 |
| Sinhala | 3.57 | 11.37 | 1 |
| Tamil | 3.62 | 12.45 | 1 |
| Gujarati | 3.79 | 10.02 | 1 |
| Odia | 3.80 | 16.90 | 1 |
| Telugu | 4.07 | 13.36 | 1 |
| Georgian | 4.24 | 15.47 | 1 |
| Kannada | 4.36 | 15.01 | 1 |
| Ethiopic | 4.42 | 11.95 | 1 |
| Malayalam | 4.61 | 16.33 | 1 |
| Greek | 3.33 | 2.58 | 1 |
| Myanmar | 7.28 | 29.77 | 1 |
| Thai | 14.05 | 14.03 | 1 |
| Khmer | 15.70 | 40.91 | 1 |
| CJK | 21.23 | 19.75 | 4 |
| Tibetan | 38.60 | 149.79 | 1 |
| Lao | 58.01 | 39.60 | 1 |
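For readers who want to reproduce this kind of measurement, the sketch below computes tokens per word and a max/min equity ratio. It assumes that words are counted by whitespace splitting, which this document does not state explicitly, and it uses a toy character-level encoder in place of a real tokenizer. The function names are illustrative.

```python
def fertility(sentences, encode):
    """Mean tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(encode(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

def equity_ratio(fertility_by_language):
    """Ratio of the worst to the best fertility across languages."""
    values = fertility_by_language.values()
    return max(values) / min(values)

def char_encode(text):
    """Toy stand-in for a tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]

spaced = ["ab cd ef"]   # 6 tokens over 3 whitespace words
unspaced = ["abcdef"]   # the same 6 tokens, but only 1 whitespace "word"

scores = {
    "spaced": fertility(spaced, char_encode),
    "unspaced": fertility(unspaced, char_encode),
}
print(scores)                # {'spaced': 2.0, 'unspaced': 6.0}
print(equity_ratio(scores))  # 3.0
```

With the tokenizer from the Usage section, `encode` could be `lambda s: tok.encode(s).ids`. The unspaced case shows why scripts written without whitespace score high on a tokens-per-word metric even when the token count is identical.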
Special Tokens
14 structural tokens + 72 language tags = 86 special tokens total.
| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|padding\|>` | Padding |
| 1 | `<\|bos\|>` | Beginning of sequence |
| 2 | `<\|endoftext\|>` | End of text |
| 3 | `<\|unk\|>` | Unknown |
| 4 | `<\|sep\|>` | Separator |
| 5 | `<\|system\|>` | System prompt |
| 6 | `<\|user\|>` | User turn |
| 7 | `<\|assistant\|>` | Assistant turn |
| 8 | `<\|tool_call\|>` | Tool invocation |
| 9 | `<\|tool_result\|>` | Tool response |
| 10 | `<\|thinking\|>` | Thinking open |
| 11 | `<\|/thinking\|>` | Thinking close |
| 12 | `<\|code\|>` | Code open |
| 13 | `<\|/code\|>` | Code close |
| 14-85 | `<\|lang:xx\|>` | Language tags (72 languages) |
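The ID layout in the table can be expressed as a small mapping. This is a sketch only: the 72 language codes and their order are not listed in this document, so the codes below are placeholders, and `build_special_token_ids` is a hypothetical helper.

```python
# Structural tokens in the order given by the table (IDs 0-13).
STRUCTURAL_TOKENS = [
    "<|padding|>", "<|bos|>", "<|endoftext|>", "<|unk|>", "<|sep|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool_call|>", "<|tool_result|>",
    "<|thinking|>", "<|/thinking|>", "<|code|>", "<|/code|>",
]

def build_special_token_ids(lang_codes):
    """Map structural tokens to IDs 0-13 and language tags to the IDs that follow."""
    lang_tags = [f"<|lang:{code}|>" for code in lang_codes]
    return {token: idx for idx, token in enumerate(STRUCTURAL_TOKENS + lang_tags)}

# Placeholder codes standing in for the real 72 language codes.
placeholder_codes = [f"l{i:02d}" for i in range(72)]
special_ids = build_special_token_ids(placeholder_codes)

print(len(special_ids))             # 86
print(special_ids["<|assistant|>"])  # 7
```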
Usage
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids) # Token IDs
print(encoded.tokens) # Token strings
# Decode
text = tok.decode(encoded.ids)
print(text)
Intended Use
QT V.4.1 32K is designed as the tokenizer for the AENEA Prelude model series (sub-500M parameters). The 32K vocabulary is chosen to maximise parameter efficiency for small models, following the findings of "The Depth Delusion" (arXiv 2601.20994) that 32K is optimal for sub-500M parameter models.
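The parameter-efficiency argument rests on the fact that an embedding matrix has vocabulary size times hidden size parameters. The sketch below compares the two vocabularies under an assumed hidden size of 512 with tied input and output embeddings; neither value comes from this document, and the Prelude architecture may differ.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the token embedding, plus the output projection if untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model

D_MODEL = 512           # hypothetical hidden size, not taken from the source
BUDGET = 500_000_000    # upper end of the sub-500M range

for name, vocab in [("QT 32K", 32_000), ("Llama 3 128K", 128_256)]:
    params = embedding_params(vocab, D_MODEL)
    print(f"{name}: {params:,} embedding params ({params / BUDGET:.1%} of budget)")
```

Under these assumptions the larger vocabulary spends roughly four times as many parameters on embeddings, which matters more the smaller the model is.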
Optimised for:
- Multilingual language modelling across 72 languages
- Cross-lingual transfer with equitable compression across scripts
- Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
- Mathematical and scientific text
- Instruction-following with dedicated chat tokens
Recommended Pairing
| Model Size | Tokenizer | Vocab |
|---|---|---|
| Sub-500M (Prelude series) | QT V.4.1 32K (this model) | 32,000 |
| 500M-2B (Overture series) | QT V.4.1 64K | 64,000 |
Training Configuration
Algorithm: SuperBPE (two-stage)
Stage 1 vocab: 28,800 (90%, subword with whitespace boundaries)
Stage 2 vocab: 3,200 (10%, superword, no whitespace constraint)
Min frequency: Stage 1: 2, Stage 2: 50
Sample ratio: Stage 1: 0.35, Stage 2: 0.08
Pre-tokenization: Script-aware (Indic virama segmentation)
Training mode: Streaming sharded (500 MB shards)
Seed: 42
Training time: ~98 minutes (RTX 4060, 16 GB RAM)
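For scripting against these settings, the configuration can be captured in a small dataclass. The class and field names below are hypothetical and do not correspond to a published config format; the values are copied from the listing above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuperBPEConfig:
    stage1_vocab: int = 28_800          # subword merges, whitespace boundaries kept
    stage2_vocab: int = 3_200           # superword merges, no whitespace constraint
    stage1_min_freq: int = 2
    stage2_min_freq: int = 50
    stage1_sample_ratio: float = 0.35
    stage2_sample_ratio: float = 0.08
    shard_size_mb: int = 500
    seed: int = 42

    @property
    def total_vocab(self):
        return self.stage1_vocab + self.stage2_vocab

    @property
    def stage1_share(self):
        return self.stage1_vocab / self.total_vocab

cfg = SuperBPEConfig()
print(cfg.total_vocab)             # 32000
print(f"{cfg.stage1_share:.0%}")   # 90%
```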
Limitations
- Lao remains the weakest script (58.0 TPW) due to limited training data and absence of whitespace word boundaries.
- Tibetan (38.6 TPW) has improved over Llama 3 (149.8 TPW) but remains high. Future versions will increase the Tibetan corpus weight.
- At 32K vocabulary, there is less headroom for low-resource scripts compared to the 64K variant. If your use case is primarily multilingual, consider the 64K version.
- Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
- The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.
References
- Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models"
- arXiv 2511.03237, IndicSuperTokenizer: SOTA fertility on 22 Indic languages
- arXiv 2508.06533, "The Art of Breaking Words": iterative fertility-driven reweighting
- Tao et al., NeurIPS 2024, Scaling Laws with Vocabulary
- arXiv 2601.20994, "The Depth Delusion": width > depth, 32K optimal for sub-500M
- NeurIPS 2025 Workshop, "From Bias to Balance": balanced tokenizer datasets
- Arnett et al. 2025, Crosslingual Tokenizer Inequities
Citation
@misc{downey2026qt,
title={QT V.4.1 UltraLingo: A Streaming Script-Aware SuperBPE Tokenizer for Equitable Multilingual Language Modelling},
author={Downey, James},
year={2026},
publisher={AENEA Global Ltd},
url={https://huggingface.co/JamesQuartz/qt-v4.1-32k-ultralingo}
}
About
Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).
- Quartz: Open-source data pipelines and tokenizers (quartz.host)
- AENEA: Language model laboratory (aenea.app)
- Crassus: Institutional credit intelligence (crassus.info)