QT V.4.4 32K UltraLingo: SuperBPE Tokenizer
Quartz Data Infrastructure: quartz.host | AENEA Global: aeneaglobal.com
A 32,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Prelude model series (sub-500M parameters). Part of the QuartzTokenizer (QT) family.
QT V.4.4 introduces equity-balanced Stage 2 corpus construction: a four-bucket system that oversamples underserved scripts for SuperBPE training, achieving the best cross-lingual equity ratio of any QT tokenizer at any vocabulary size.
Key Results
Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):
| Metric | QT V.4.4 32K | Llama 3 (128K) |
|---|---|---|
| Vocabulary size | 32,000 | 128,256 |
| Mean fertility (tokens/word) | 4.231 | 5.716 |
| Median fertility | 2.843 | 2.700 |
| Equity ratio (max/min fertility) | 31.5x | 118.6x |
| Total tokens (204 langs) | 14,125,437 | 16,764,198 |
| Token savings | -15.7% | baseline |
At one quarter the vocabulary size, QT V.4.4 32K produces 15.7% fewer total tokens than Llama 3 with 3.8x better cross-lingual equity. The 31.5x equity ratio is the best of any QT tokenizer, including the 64K variant (32.3x).
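The headline figures can be re-derived from the table values. The sketch below only repeats that arithmetic; the function names are illustrative and not part of any QT tooling.

```python
# Values copied from the Key Results table.
QT_TOTAL_TOKENS = 14_125_437
LLAMA3_TOTAL_TOKENS = 16_764_198
QT_EQUITY, LLAMA3_EQUITY = 31.5, 118.6
QT_VOCAB, LLAMA3_VOCAB = 32_000, 128_256


def token_savings(ours: int, baseline: int) -> float:
    """Fraction of tokens saved relative to the baseline tokenizer."""
    return 1.0 - ours / baseline


def equity_gain(ours: float, baseline: float) -> float:
    """How many times smaller our max/min fertility ratio is."""
    return baseline / ours


savings_pct = round(100 * token_savings(QT_TOTAL_TOKENS, LLAMA3_TOTAL_TOKENS), 1)
gain = round(equity_gain(QT_EQUITY, LLAMA3_EQUITY), 1)
vocab_fraction = round(QT_VOCAB / LLAMA3_VOCAB, 2)
```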
Evolution: V.4.1 → V.4.4
| Metric | V.4.4 32K | V.4.1 32K | Change (lower is better) |
|---|---|---|---|
| Mean fertility | 4.231 | 4.343 | -2.6% |
| Equity ratio | 31.5x | 38.6x | -18.4% (fairer) |
| Tibetan | 26.70 | 38.60 | -11.90 TPW |
| Thai | 12.88 | 14.05 | -1.17 TPW |
| Myanmar | 6.27 | 7.28 | -1.01 TPW |
| Khmer | 13.95 | 15.70 | -1.75 TPW |
| CJK (avg) | 19.85 | 21.23 | -1.38 TPW |
| Bengali | 3.12 | 3.40 | -0.28 TPW |
| Kannada | 3.96 | 4.36 | -0.40 TPW |
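Every row in this table is a reduction in tokens per word (TPW) from V.4.1 to V.4.4. A short check of the per-row arithmetic, using values copied from the table (helper names are illustrative):

```python
def tpw_reduction(v44: float, v41: float) -> float:
    """Absolute drop in tokens per word from V.4.1 to V.4.4."""
    return round(v41 - v44, 2)


def relative_change_pct(v44: float, v41: float) -> float:
    """Signed percentage change from V.4.1 to V.4.4 (negative = fewer tokens)."""
    return round(100 * (v44 - v41) / v41, 1)


# (V.4.4, V.4.1) pairs from the table.
ROWS = {
    "Tibetan": (26.70, 38.60),
    "Thai": (12.88, 14.05),
    "Khmer": (13.95, 15.70),
}
reductions = {name: tpw_reduction(new, old) for name, (new, old) in ROWS.items()}
```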
Architecture
QT V.4.4 is a two-stage SuperBPE tokenizer with four innovations over standard BPE:
1. Two-Stage SuperBPE Training
- Stage 1 (28,800 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, character sequences.
- Stage 2 (3,200 tokens, 10%): SuperBPE lifts the whitespace boundary constraint, allowing merges to span across word boundaries. Learns high-frequency multi-word superword tokens. Sentence boundary protection prevents cross-sentence tokens.
Based on Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models" (+4.0% downstream, +8.2% MMLU, -27% inference compute).
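The two stages can be illustrated with a toy character-level BPE. This is a minimal sketch of the idea, not the QT trainer: `learn_merges`, `train_two_stage`, and the merge budgets are made-up names and values for illustration. Stage 1 treats each whitespace-delimited word as its own sequence, Stage 2 replays those merges over whole sentences and keeps merging. Sentences stay separate sequences, which is one simple way to get sentence boundary protection.

```python
from collections import Counter


def most_frequent_pair(sequences, min_freq):
    """Most frequent adjacent symbol pair, or None if below min_freq."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # Ties are broken by the pair itself so the result is deterministic.
    pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return pair if freq >= min_freq else None


def apply_merge(seq, pair):
    """Replace non-overlapping occurrences of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, n_merges, min_freq=2):
    """Greedy BPE: learn up to n_merges merges over the given sequences."""
    merges = []
    for _ in range(n_merges):
        pair = most_frequent_pair(sequences, min_freq)
        if pair is None:
            break
        merges.append(pair)
        sequences = [apply_merge(seq, pair) for seq in sequences]
    return merges, sequences


def train_two_stage(sentences, stage1_merges, stage2_merges):
    # Stage 1: one sequence per word (leading space kept on the word),
    # so no merge can cross a whitespace boundary.
    words = [
        list((" " if i else "") + w)
        for s in sentences
        for i, w in enumerate(s.split())
    ]
    merges1, _ = learn_merges(words, stage1_merges)

    # Stage 2: one sequence per sentence, pre-encoded with the Stage 1 merges.
    # Merges may now span spaces, but never span two sentences.
    encoded = []
    for s in sentences:
        seq = list(" ".join(s.split()))
        for pair in merges1:
            seq = apply_merge(seq, pair)
        encoded.append(seq)
    merges2, encoded = learn_merges(encoded, stage2_merges)
    return merges1, merges2, encoded


merges1, merges2, encoded = train_two_stage(
    ["of the cat", "of the dog", "of the owl"], stage1_merges=10, stage2_merges=2
)
```

On this toy corpus Stage 1 learns word-internal units such as `of` and ` the`, and Stage 2 then joins them into the superword `of the`.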
2. Script-Aware Pre-Tokenization (Indic Only)
- Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
- Preserves conjunct consonants by not breaking across virama (halant) marks
- CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
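A minimal sketch of what virama-aware segmentation can look like, assuming it means "keep a consonant joined to the next consonant when a virama sits between them, and keep combining marks with their base". The actual QT pre-tokenizer rules are not specified here, and `segment_clusters` is an illustrative name.

```python
import unicodedata

# Virama (halant) code points for the ten Indic scripts listed above.
VIRAMAS = {
    "\u094d",  # Devanagari
    "\u09cd",  # Bengali
    "\u0a4d",  # Gurmukhi
    "\u0acd",  # Gujarati
    "\u0b4d",  # Odia
    "\u0bcd",  # Tamil
    "\u0c4d",  # Telugu
    "\u0ccd",  # Kannada
    "\u0d4d",  # Malayalam
    "\u0dca",  # Sinhala
}


def segment_clusters(text: str) -> list[str]:
    """Split text into clusters without breaking across a virama.

    A character joins the previous cluster if it is a combining mark
    (vowel sign, virama, nukta, ...) or if it is a letter that directly
    follows a virama, which keeps conjunct consonants in one piece.
    """
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1][-1] in VIRAMAS and ch.isalpha()
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Devanagari "kshatriya": KA + virama + SSA, TA + virama + RA + vowel sign I, YA
word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
clusters = segment_clusters(word)  # three clusters: ksha, tri, ya
```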
3. Streaming Sharded Training
- Corpus sharded to disk for RAM-bounded training
- Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
- Enables SuperBPE training on consumer hardware (16 GB RAM)
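A rough sketch of disk sharding with a per-stage sample ratio, assuming one document per line and plain UTF-8 text shards. File naming, the `write_shards` / `iter_shards` helpers, and their parameters are hypothetical. The point is that only one document is held in memory at a time.

```python
import random
from pathlib import Path


def write_shards(texts, out_dir, max_bytes=500 * 1024 * 1024,
                 sample_ratio=1.0, seed=42):
    """Stream `texts` into line-delimited shard files of at most ~max_bytes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)
    paths, handle, written = [], None, 0
    for text in texts:
        if rng.random() >= sample_ratio:
            continue  # document not sampled for this stage
        line = text.replace("\n", " ") + "\n"
        size = len(line.encode("utf-8"))
        if handle is None or written + size > max_bytes:
            if handle is not None:
                handle.close()
            path = out_dir / f"shard_{len(paths):05d}.txt"
            handle = path.open("w", encoding="utf-8", newline="\n")
            paths.append(path)
            written = 0
        handle.write(line)
        written += size
    if handle is not None:
        handle.close()
    return paths


def iter_shards(paths):
    """Yield documents lazily, one shard file at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                yield line.rstrip("\n")
```

Calling `write_shards` twice with different `sample_ratio` values would give each stage its own sample of the same source stream.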
4. Equity-Balanced Stage 2 Corpus (V.4.4 Innovation)
Instead of randomly sampling a small fraction of the corpus for Stage 2 (which starves low-resource scripts), V.4.4 builds a purpose-constructed Stage 2 training corpus with four buckets:
| Bucket | Share | Max Chunk | Scripts | Rationale |
|---|---|---|---|---|
| CJK | 20% | 1000 chars | Chinese, Japanese, Korean | Needs volume with long sequences for multi-character merges |
| Underserved | 40% | 200 chars | Tibetan, Thai, Lao, Khmer, Myanmar, Indic (10 scripts), Ethiopic | Needs presence; short chunks cap RAM while ensuring meaningful superword merges |
| Latin | 25% | 500 chars | Latin-script languages | Already well-served by Stage 1, included for multi-word expressions |
| Other | 15% | 500 chars | Cyrillic, Hebrew, Arabic, Greek, Armenian, Georgian, mixed | Already well-served by Stage 1 subword merges |
The corpus is built by streaming the full training data (low RAM) and classifying each text by dominant script. Per-bucket chunk size limits prevent the BPE pair frequency table from exceeding available RAM, while ensuring each underserved script receives sufficient data for the trainer to learn meaningful merges.
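One way to sketch the classification step, assuming "dominant script" means the script with the most alphabetic characters and that over-long texts are split into chunks of at most the bucket limit (they could equally be truncated). Script detection via `unicodedata.name` prefixes is a simplification chosen for this example, not the QT implementation.

```python
import unicodedata
from collections import Counter

# Share (percent of the Stage 2 corpus) and max chunk length per bucket.
BUCKETS = {
    "cjk": {"share_pct": 20, "max_chars": 1000},
    "underserved": {"share_pct": 40, "max_chars": 200},
    "latin": {"share_pct": 25, "max_chars": 500},
    "other": {"share_pct": 15, "max_chars": 500},
}
CJK_SCRIPTS = {"CJK", "HIRAGANA", "KATAKANA", "HANGUL"}
UNDERSERVED_SCRIPTS = {
    "TIBETAN", "THAI", "LAO", "KHMER", "MYANMAR", "ETHIOPIC",
    "DEVANAGARI", "BENGALI", "TAMIL", "TELUGU", "KANNADA", "MALAYALAM",
    "GUJARATI", "GURMUKHI", "ORIYA", "SINHALA",  # Unicode names Odia "ORIYA"
}


def dominant_script(text: str) -> str:
    """Most common script among alphabetic characters, by Unicode name prefix."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def bucket_for(text: str) -> str:
    script = dominant_script(text)
    if script in CJK_SCRIPTS:
        return "cjk"
    if script in UNDERSERVED_SCRIPTS:
        return "underserved"
    if script == "LATIN":
        return "latin"
    return "other"


def chunk(text: str) -> list[str]:
    """Split a text into pieces no longer than its bucket's max chunk size."""
    limit = BUCKETS[bucket_for(text)]["max_chars"]
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def bucket_budgets_mb(total_mb: float = 300) -> dict[str, float]:
    """Target size of each bucket for a Stage 2 corpus of `total_mb`."""
    return {name: total_mb * b["share_pct"] / 100 for name, b in BUCKETS.items()}
```

With the 300 MB Stage 2 corpus, the shares translate to 60 MB CJK, 120 MB underserved, 75 MB Latin, and 45 MB other.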
Training Data
Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):
| Category | Share | Description |
|---|---|---|
| Wikipedia | 70.7% | 72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language |
| Stack Exchange | 21.7% | English reasoning, STEM, humanities, multilingual Q&A |
| Code | 8.0% | Python, JavaScript, Java, C/C++, Go/Rust, Shell |
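Sqrt-proportional sampling with a floor can be read in more than one way. The sketch below assumes one plausible interpretation: each language's share is proportional to the square root of its corpus size, any language falling below the floor is pinned to the floor, and the remaining budget is redistributed among the rest. The function name and the example sizes are made up.

```python
import math


def sqrt_sampling_shares(sizes: dict[str, float], floor: float = 0.003) -> dict[str, float]:
    """Per-language sampling shares: sqrt-proportional, with a minimum share."""
    if floor * len(sizes) > 1:
        raise ValueError("floor is too large for this many languages")
    weights = {lang: math.sqrt(n) for lang, n in sizes.items()}
    floored: set[str] = set()
    while True:
        # Budget left for languages that are not pinned to the floor.
        budget = 1.0 - floor * len(floored)
        free_total = sum(w for lang, w in weights.items() if lang not in floored)
        shares = {
            lang: floor if lang in floored else budget * w / free_total
            for lang, w in weights.items()
        }
        newly_floored = {
            lang for lang, s in shares.items() if lang not in floored and s < floor
        }
        if not newly_floored:
            return shares
        floored |= newly_floored


# Hypothetical corpus sizes (arbitrary units), not the real Wikipedia sizes.
shares = sqrt_sampling_shares({"en": 1_000_000, "de": 250_000, "bo": 1})
```

In this example the 4:1 size gap between `en` and `de` shrinks to a 2:1 sampling gap, and `bo` is lifted to the 0.3% floor.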
Per-Script Performance
FLORES-200 benchmark β mean tokens per word (lower is better):
| Script | QT V.4.4 32K | Llama 3 (128K) | Languages |
|---|---|---|---|
| Latin | 2.57 | 2.39 | 37 |
| Devanagari | 2.84 | 3.52 | 3 |
| Gurmukhi | 2.82 | 8.23 | 1 |
| Hebrew | 2.87 | 5.76 | 2 |
| Cyrillic | 2.90 | 2.59 | 5 |
| Arabic | 2.34 | 2.70 | 2 |
| Bengali | 3.12 | 8.07 | 1 |
| Greek | 3.22 | 2.58 | 1 |
| Armenian | 3.22 | 12.23 | 1 |
| Gujarati | 3.35 | 10.02 | 1 |
| Sinhala | 3.50 | 11.37 | 1 |
| Tamil | 3.88 | 12.45 | 1 |
| Odia | 3.93 | 16.90 | 1 |
| Kannada | 3.96 | 15.01 | 1 |
| Ethiopic | 4.06 | 11.95 | 1 |
| Telugu | 4.13 | 13.36 | 1 |
| Georgian | 4.51 | 15.47 | 1 |
| Malayalam | 4.54 | 16.33 | 1 |
| Myanmar | 6.27 | 29.77 | 1 |
| Thai | 12.88 | 14.03 | 1 |
| Khmer | 13.95 | 40.91 | 1 |
| CJK | 19.85 | 19.75 | 4 |
| Tibetan | 26.70 | 149.79 | 1 |
| Lao | 58.40 | 39.60 | 1 |
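For readers who want to reproduce this kind of number, here is a minimal sketch of the two metrics. It assumes words are counted by whitespace splitting and that fertility is total tokens over total words; the benchmark's exact word counting and averaging are not documented here, so treat both as assumptions. `tokenize` is any callable that returns a list of tokens.

```python
from typing import Callable, Iterable, Mapping


def fertility(sentences: Iterable[str], tokenize: Callable[[str], list]) -> float:
    """Tokens per whitespace-delimited word over a set of sentences."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    return n_tokens / n_words


def equity_ratio(fertility_by_language: Mapping[str, float]) -> float:
    """Worst-served language's fertility divided by the best-served one's."""
    values = list(fertility_by_language.values())
    return max(values) / min(values)
```

With the `tokenizers` library, `tokenize` could be `lambda s: tok.encode(s).tokens`. Under whitespace word counting, a sentence in a script written without spaces counts as very few "words", which inflates its tokens-per-word figure.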
Special Tokens
14 structural tokens + 72 language tags = 86 special tokens total.
| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|padding\|>` | Padding |
| 1 | `<\|bos\|>` | Beginning of sequence |
| 2 | `<\|endoftext\|>` | End of text |
| 3 | `<\|unk\|>` | Unknown |
| 4 | `<\|sep\|>` | Separator |
| 5 | `<\|system\|>` | System prompt |
| 6 | `<\|user\|>` | User turn |
| 7 | `<\|assistant\|>` | Assistant turn |
| 8 | `<\|tool_call\|>` | Tool invocation |
| 9 | `<\|tool_result\|>` | Tool response |
| 10 | `<\|thinking\|>` | Thinking open |
| 11 | `<\|/thinking\|>` | Thinking close |
| 12 | `<\|code\|>` | Code open |
| 13 | `<\|/code\|>` | Code close |
| 14-85 | `<\|lang:xx\|>` | Language tags (72 languages) |
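The ID layout above can be mirrored in a few lines. The token strings and IDs come from the table; `format_chat` is a purely hypothetical prompt layout to show how the role tokens might be combined, since no official chat template is given here.

```python
STRUCTURAL_TOKENS = [
    "<|padding|>", "<|bos|>", "<|endoftext|>", "<|unk|>", "<|sep|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool_call|>",
    "<|tool_result|>", "<|thinking|>", "<|/thinking|>", "<|code|>", "<|/code|>",
]
TOKEN_TO_ID = {token: i for i, token in enumerate(STRUCTURAL_TOKENS)}

NUM_LANGUAGE_TAGS = 72
FIRST_LANGUAGE_TAG_ID = len(STRUCTURAL_TOKENS)                      # 14
LAST_LANGUAGE_TAG_ID = FIRST_LANGUAGE_TAG_ID + NUM_LANGUAGE_TAGS - 1  # 85
TOTAL_SPECIAL_TOKENS = len(STRUCTURAL_TOKENS) + NUM_LANGUAGE_TAGS     # 86


def format_chat(messages: list[tuple[str, str]]) -> str:
    """Hypothetical layout: <|bos|>, then <|role|>content per turn,
    ending with an open <|assistant|> turn for the model to complete."""
    parts = ["<|bos|>"]
    for role, content in messages:
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"unknown role: {role}")
        parts.append(f"<|{role}|>{content}")
    parts.append("<|assistant|>")
    return "".join(parts)
```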
Usage
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids) # Token IDs
print(encoded.tokens) # Token strings
# Decode
text = tok.decode(encoded.ids)
print(text)
Intended Use
QT V.4.4 32K is designed as the tokenizer for the AENEA Prelude model series (sub-500M parameters). The 32K vocabulary is chosen to maximise parameter efficiency for small models, following the findings of "The Depth Delusion" (arXiv 2601.20994) that 32K is optimal for sub-500M parameter models.
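One way to see why vocabulary size matters at this scale is to count embedding parameters. The hidden size of 768 and the 500M total below are illustrative assumptions, not Prelude specifications, and the argument is a generic one rather than a summary of the cited paper.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the token embedding (and the output head if untied)."""
    tables = 1 if tied else 2
    return vocab_size * d_model * tables


def embedding_share(vocab_size: int, d_model: int, total_params: int,
                    tied: bool = True) -> float:
    """Fraction of the total parameter budget spent on embeddings."""
    return embedding_params(vocab_size, d_model, tied) / total_params


D_MODEL = 768               # assumed hidden size, for illustration only
TOTAL_PARAMS = 500_000_000  # upper end of the sub-500M range

share_32k = embedding_share(32_000, D_MODEL, TOTAL_PARAMS)    # about 5%
share_128k = embedding_share(128_256, D_MODEL, TOTAL_PARAMS)  # about 20%
```

Under these assumptions a 128K vocabulary would spend roughly four times as much of the budget on embeddings as a 32K one.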
Optimised for:
- Multilingual language modelling across 72 languages
- Cross-lingual transfer with equitable compression across scripts
- Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
- Mathematical and scientific text
- Instruction-following with dedicated chat tokens
Recommended Pairing
| Model Size | Tokenizer | Vocab | Equity |
|---|---|---|---|
| Sub-500M (Prelude series) | QT V.4.4 32K (this model) | 32,000 | 31.5x |
| 500M-2B (Overture series) | QT V.4.1 64K | 64,000 | 32.3x |
Training Configuration
Algorithm: SuperBPE (two-stage)
Stage 1 vocab: 28,800 (90%, subword with whitespace boundaries)
Stage 2 vocab: 3,200 (10%, superword, no whitespace constraint)
Stage 1 min freq: 2
Stage 2 min freq: 50
Stage 1 sample ratio: 0.35
Stage 2 corpus: 300 MB equity-balanced (4-bucket system)
Pre-tokenization: Script-aware (Indic virama segmentation)
Training mode: Streaming sharded (500 MB shards)
Seed: 42
Training time: ~52 minutes (RTX 4060, 16 GB RAM)
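The same settings can be expressed as a small config object, which also makes the 90/10 vocabulary split explicit. The class and field names are hypothetical; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuperBPEConfig:
    vocab_size: int = 32_000
    stage2_fraction: float = 0.10      # share of the vocabulary for superwords
    stage1_min_freq: int = 2
    stage2_min_freq: int = 50
    stage1_sample_ratio: float = 0.35
    stage2_corpus_mb: int = 300        # equity-balanced, four buckets
    shard_mb: int = 500
    seed: int = 42

    @property
    def stage2_vocab(self) -> int:
        """Number of Stage 2 (superword) vocabulary entries."""
        return round(self.vocab_size * self.stage2_fraction)

    @property
    def stage1_vocab(self) -> int:
        """Number of Stage 1 (subword) vocabulary entries."""
        return self.vocab_size - self.stage2_vocab


config = SuperBPEConfig()
```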
Limitations
- Lao remains the weakest script (58.4 TPW) due to limited training data and absence of whitespace word boundaries.
- Tibetan (26.7 TPW) has improved dramatically over previous versions (V.4.1: 38.6, Llama 3: 149.8) but remains elevated due to the lack of whitespace delimiters.
- English fertility (1.89 TPW) is higher than V.4.1 (1.50) because Stage 2 merge budget is redistributed toward underserved scripts. This is a deliberate equity tradeoff.
- Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
- The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.
References
- Liu et al., COLM 2025: "SuperBPE: Space Travel for Language Models"
- arXiv 2511.03237: IndicSuperTokenizer, SOTA fertility on 22 Indic languages
- arXiv 2508.06533: "The Art of Breaking Words", iterative fertility-driven reweighting
- Tao et al., NeurIPS 2024: Scaling Laws with Vocabulary
- arXiv 2601.20994: "The Depth Delusion", width > depth, 32K optimal for sub-500M
- NeurIPS 2025 Workshop: "From Bias to Balance", balanced tokenizer datasets
- Arnett et al. 2025: Crosslingual Tokenizer Inequities
Citation
@misc{downey2026qt,
title={QT V.4.4 UltraLingo: Equity-Balanced Streaming SuperBPE for Multilingual Language Modelling},
author={Downey, James},
year={2026},
publisher={AENEA Global Ltd},
url={https://huggingface.co/JamesQuartz/qt-v4.4-32k-ultralingo}
}
About
Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).
- Quartz: Open-source data pipelines and tokenizers (quartz.host)
- AENEA: Language model laboratory (aenea.app)