QT V.4.1 32K UltraLingo — SuperBPE Tokenizer

Quartz Data Infrastructure — quartz.host | AENEA Global — aeneaglobal.com

A 32,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Prelude model series (sub-500M parameters). Part of the QuartzTokenizer (QT) family.

Key Results

Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):

Metric	QT V.4.1 32K	Llama 3 (128K)
Vocabulary size	32,000	128,256
Mean fertility (tokens/word)	4.343	5.716
Median fertility	2.812	2.700
Equity ratio (max/min fertility)	38.6x	118.6x
Total tokens (204 langs)	14,138,900	16,764,198
Token savings	−15.7%	baseline

At one quarter the vocabulary size, QT V.4.1 32K produces 15.7% fewer total tokens than Llama 3 with 3.1x better cross-lingual equity.

Architecture

QT V.4.1 is a two-stage SuperBPE tokenizer with three innovations over standard BPE:

1. Two-Stage SuperBPE Training

Stage 1 (28,800 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units — roots, affixes, morphemes, character sequences.
Stage 2 (3,200 tokens, 10%): SuperBPE — lifts the whitespace boundary constraint, allowing merges to span across word boundaries. Learns high-frequency multi-word superword tokens. Sentence boundary protection prevents cross-sentence tokens.

Based on Liu et al., COLM 2025 — "SuperBPE: Space Travel for Language Models" (+4.0% downstream, +8.2% MMLU, −27% inference compute).

2. Script-Aware Pre-Tokenization (Indic Only)

Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
Preserves conjunct consonants by not breaking across virama (halant) marks
CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning

3. Streaming Sharded Training

Corpus sharded to disk for RAM-bounded training
Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
Enables SuperBPE training on consumer hardware (16 GB RAM)

Training Data

Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):

Category	Share	Description
Wikipedia	70.7%	72 languages, 27 scripts — sqrt-proportional sampling with 0.3% floor per language
Stack Exchange	21.7%	English reasoning, STEM, humanities, multilingual Q&A
Code	8.0%	Python, JavaScript, Java, C/C++, Go/Rust, Shell

Corpus design follows "The Art of Breaking Words" (arXiv 2508.06533) iterative fertility balancing and "One Tokenizer to Rule Them All" script/family bucket approach.

Per-Script Performance

FLORES-200 benchmark — mean tokens per word (lower is better):

Script	QT V.4.1 32K	Llama 3 (128K)	Languages
Latin	2.53	2.39	37
Arabic	2.42	2.70	2
Gurmukhi	2.75	8.23	1
Hebrew	2.78	5.76	1
Cyrillic	2.82	2.59	5
Devanagari	2.98	3.52	3
Armenian	3.36	12.23	1
Bengali	3.40	8.07	1
Sinhala	3.57	11.37	1
Tamil	3.62	12.45	1
Gujarati	3.79	10.02	1
Odia	3.80	16.90	1
Telugu	4.07	13.36	1
Georgian	4.24	15.47	1
Kannada	4.36	15.01	1
Ethiopic	4.42	11.95	1
Malayalam	4.61	16.33	1
Greek	3.33	2.58	1
Myanmar	7.28	29.77	1
Thai	14.05	14.03	1
Khmer	15.70	40.91	1
CJK	21.23	19.75	4
Tibetan	38.60	149.79	1
Lao	58.01	39.60	1

Special Tokens

14 structural tokens + 72 language tags = 86 special tokens total.

ID	Token	Purpose
0	`<\|padding\|>`	Padding
1	`<\|bos\|>`	Beginning of sequence
2	`<\|endoftext\|>`	End of text
3	`<\|unk\|>`	Unknown
4	`<\|sep\|>`	Separator
5	`<\|system\|>`	System prompt
6	`<\|user\|>`	User turn
7	`<\|assistant\|>`	Assistant turn
8	`<\|tool_call\|>`	Tool invocation
9	`<\|tool_result\|>`	Tool response
10	`<\|thinking\|>`	Thinking open
11	`<\|/thinking\|>`	Thinking close
12	`<\|code\|>`	Code open
13	`<\|/code\|>`	Code close
14–85	`<\|lang:xx\|>`	Language tags (72 languages)

Usage

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids)      # Token IDs
print(encoded.tokens)   # Token strings

# Decode
text = tok.decode(encoded.ids)
print(text)

Intended Use

QT V.4.1 32K is designed as the tokenizer for the AENEA Prelude model series (sub-500M parameters). The 32K vocabulary is chosen to maximise parameter efficiency for small models, following the findings of "The Depth Delusion" (arXiv 2601.20994) that 32K is optimal for sub-500M parameter models.

Optimised for:

Multilingual language modelling across 72 languages
Cross-lingual transfer with equitable compression across scripts
Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
Mathematical and scientific text
Instruction-following with dedicated chat tokens

Recommended Pairing

Model Size	Tokenizer	Vocab
Sub-500M (Prelude series)	QT V.4.1 32K (this model)	32,000
500M–2B (Overture series)	QT V.4.1 64K	64,000

Training Configuration

Algorithm:            SuperBPE (two-stage)
Stage 1 vocab:        28,800 (90% — subword with whitespace boundaries)
Stage 2 vocab:        3,200 (10% — superword, no whitespace constraint)
Min frequency:        Stage 1: 2, Stage 2: 50
Sample ratio:         Stage 1: 0.35, Stage 2: 0.08
Pre-tokenization:     Script-aware (Indic virama segmentation)
Training mode:        Streaming sharded (500 MB shards)
Seed:                 42
Training time:        ~98 minutes (RTX 4060, 16 GB RAM)

Limitations

Lao remains the weakest script (58.0 TPW) due to limited training data and absence of whitespace word boundaries.
Tibetan (38.6 TPW) has improved over Llama 3 (149.8 TPW) but remains high. Future versions will increase the Tibetan corpus weight.
At 32K vocabulary, there is less headroom for low-resource scripts compared to the 64K variant. If your use case is primarily multilingual, consider the 64K version.
Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.

References

Liu et al., COLM 2025 — "SuperBPE: Space Travel for Language Models"
arXiv 2511.03237 — IndicSuperTokenizer: SOTA fertility on 22 Indic languages
arXiv 2508.06533 — "The Art of Breaking Words": iterative fertility-driven reweighting
Tao et al., NeurIPS 2024 — Scaling Laws with Vocabulary
arXiv 2601.20994 — "The Depth Delusion": width > depth, 32K optimal for sub-500M
NeurIPS 2025 Workshop — "From Bias to Balance": balanced tokenizer datasets
Arnett et al. 2025 — Crosslingual Tokenizer Inequities

Citation

@misc{downey2026qt,
  title={QT V.4.1 UltraLingo: A Streaming Script-Aware SuperBPE Tokenizer for Equitable Multilingual Language Modelling},
  author={Downey, James},
  year={2026},
  publisher={AENEA Global Ltd},
  url={https://huggingface.co/JamesQuartz/qt-v4.1-32k-ultralingo}
}

About

Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).

Quartz — Open-source data pipelines and tokenizers (quartz.host)
AENEA — Language model laboratory (aenea.app)
Crassus — Institutional credit intelligence (crassus.info)

Downloads last month: -; Downloads are not tracked for this model. How to track