QT V.4.4 32K UltraLingo – SuperBPE Tokenizer

Quartz Data Infrastructure – quartz.host | AENEA Global – aeneaglobal.com

A 32,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Prelude model series (sub-500M parameters). Part of the QuartzTokenizer (QT) family.

QT V.4.4 introduces equity-balanced Stage 2 corpus construction: a four-bucket system that oversamples underserved scripts for SuperBPE training, achieving the best cross-lingual equity ratio of any QT tokenizer at any vocabulary size.

Key Results

Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):

| Metric | QT V.4.4 32K | Llama 3 (128K) |
|---|---|---|
| Vocabulary size | 32,000 | 128,256 |
| Mean fertility (tokens/word) | 4.231 | 5.716 |
| Median fertility | 2.843 | 2.700 |
| Equity ratio (max/min fertility) | 31.5x | 118.6x |
| Total tokens (204 langs) | 14,125,437 | 16,764,198 |
| Token savings | -15.7% | baseline |

At one quarter the vocabulary size, QT V.4.4 32K produces 15.7% fewer total tokens than Llama 3 with 3.8x better cross-lingual equity. The 31.5x equity ratio is the best of any QT tokenizer, including the 64K variant (32.3x).
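
For reference, fertility and the equity ratio can be computed as in the sketch below; the whitespace-based word count and the helper names are illustrative assumptions, not the exact benchmark harness.

```python
# Sketch: computing fertility and the equity ratio from per-language results.
from statistics import mean, median

def fertility(tokenizer, sentences):
    """Mean tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(tokenizer.encode(s).ids) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

def summarize(per_language_fertility):
    """per_language_fertility: dict mapping language code -> fertility."""
    values = list(per_language_fertility.values())
    return {
        "mean_fertility": mean(values),
        "median_fertility": median(values),
        # Equity ratio: worst-compressed language relative to best-compressed.
        "equity_ratio": max(values) / min(values),
    }
```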

Evolution: V.4.1 β†’ V.4.4

| Metric / script | V.4.4 32K | V.4.1 32K | Improvement |
|---|---|---|---|
| Mean fertility | 4.231 | 4.343 | -2.6% |
| Equity ratio | 31.5x | 38.6x | -18.4% (fairer) |
| Tibetan | 26.70 | 38.60 | -11.90 TPW |
| Thai | 12.88 | 14.05 | -1.17 TPW |
| Myanmar | 6.27 | 7.28 | -1.01 TPW |
| Khmer | 13.95 | 15.70 | -1.75 TPW |
| CJK (avg) | 19.85 | 21.23 | -1.38 TPW |
| Bengali | 3.12 | 3.40 | -0.28 TPW |
| Kannada | 3.96 | 4.36 | -0.40 TPW |

Architecture

QT V.4.4 is a two-stage SuperBPE tokenizer with four innovations over standard BPE:

1. Two-Stage SuperBPE Training

  • Stage 1 (28,800 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, character sequences.
  • Stage 2 (3,200 tokens, 10%): SuperBPE lifts the whitespace boundary constraint, allowing merges to span word boundaries. Learns high-frequency multi-word superword tokens. Sentence boundary protection prevents cross-sentence tokens. The sketch below illustrates the boundary difference between the two stages.
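
A minimal, illustrative reconstruction of the constraint change between the two stages, assuming simple whitespace and sentence-final punctuation rules; this is not the actual training code.

```python
# Stage 1 pre-tokenizes on whitespace, so no BPE merge can cross a space;
# Stage 2 drops that constraint but still stops merges at sentence boundaries.
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def merge_regions_stage1(text: str) -> list[str]:
    # Each whitespace-delimited word is an independent merge region,
    # yielding subword tokens (roots, affixes, morphemes).
    return text.split()

def merge_regions_stage2(text: str) -> list[str]:
    # Whole sentences are merge regions, so frequent word sequences such as
    # "of the" can become single superword tokens, but no token spans two sentences.
    return [s for s in SENTENCE_END.split(text) if s]

example = "The history of the Roman Empire spans centuries. It began in Italy."
print(merge_regions_stage1(example))  # ['The', 'history', 'of', ...]
print(merge_regions_stage2(example))  # two sentence-level merge regions
```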

Based on Liu et al., COLM 2025 – "SuperBPE: Space Travel for Language Models" (+4.0% downstream, +8.2% MMLU, -27% inference compute).

2. Script-Aware Pre-Tokenization (Indic Only)

  • Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
  • Preserves conjunct consonants by not breaking across virama (halant) marks (see the sketch below)
  • CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
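
A minimal sketch of the virama-aware segmentation described above. The virama code points listed and the handling of dependent vowel signs are assumptions about the described behaviour, not the tokenizer's actual pre-tokenizer code.

```python
# A new cluster starts at each base character, except that nothing is split
# off immediately after a virama, so conjuncts (consonant + virama + consonant)
# stay intact.
import unicodedata

# Viramas / halants for the ten Indic scripts listed above (one per block).
VIRAMAS = {
    "\u094D",  # Devanagari
    "\u09CD",  # Bengali
    "\u0A4D",  # Gurmukhi
    "\u0ACD",  # Gujarati
    "\u0B4D",  # Odia
    "\u0BCD",  # Tamil
    "\u0C4D",  # Telugu
    "\u0CCD",  # Kannada
    "\u0D4D",  # Malayalam
    "\u0DCA",  # Sinhala (al-lakuna)
}

def segment(word: str) -> list[str]:
    clusters: list[str] = []
    for ch in word:
        attach = bool(clusters) and (
            clusters[-1][-1] in VIRAMAS                   # never break after a virama
            or unicodedata.category(ch) in ("Mn", "Mc")   # keep dependent signs attached
            or ch in VIRAMAS                              # a virama attaches to its consonant
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "विद्या" (vidyā) keeps the conjunct द् + या together instead of splitting at the virama.
print(segment("विद्या"))  # ['वि', 'द्या']
```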

3. Streaming Sharded Training

  • Corpus sharded to disk for RAM-bounded training
  • Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
  • Enables SuperBPE training on consumer hardware (16 GB RAM); see the sharding sketch below
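
A rough sketch of the sharding step under the stated constraints (disk-backed shards, later re-streamed with a per-stage sample ratio); the shard naming, file layout, and helper names are illustrative assumptions.

```python
# Stream a text iterable into ~500 MB newline-delimited shards on disk so
# training never holds the full corpus in RAM.
import random
from pathlib import Path
from typing import Iterable, Iterator

def write_shards(texts: Iterable[str], out_dir: str, shard_bytes: int = 500 * 1024**2) -> list[Path]:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards, buf, size = [], [], 0
    for text in texts:
        line = text.replace("\n", " ") + "\n"
        buf.append(line)
        size += len(line.encode("utf-8"))
        if size >= shard_bytes:
            path = out / f"shard_{len(shards):05d}.txt"
            path.write_text("".join(buf), encoding="utf-8")
            shards.append(path)
            buf, size = [], 0
    if buf:
        path = out / f"shard_{len(shards):05d}.txt"
        path.write_text("".join(buf), encoding="utf-8")
        shards.append(path)
    return shards

def stream_shards(shards: list[Path], sample_ratio: float) -> Iterator[str]:
    """Re-stream shard lines with a per-stage sample ratio (e.g. 0.35 for Stage 1)."""
    for path in shards:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if random.random() < sample_ratio:
                    yield line.rstrip("\n")
```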

4. Equity-Balanced Stage 2 Corpus (V.4.4 Innovation)

Instead of randomly sampling a small fraction of the corpus for Stage 2 (which starves low-resource scripts), V.4.4 constructs a dedicated Stage 2 training corpus with four buckets:

| Bucket | Share | Max chunk | Scripts | Rationale |
|---|---|---|---|---|
| CJK | 20% | 1000 chars | Chinese, Japanese, Korean | Needs volume with long sequences for multi-character merges |
| Underserved | 40% | 200 chars | Tibetan, Thai, Lao, Khmer, Myanmar, Indic (10 scripts), Ethiopic | Needs presence: short chunks cap RAM while ensuring meaningful superword merges |
| Latin | 25% | 500 chars | Latin-script languages | Already well served by Stage 1; included for multi-word expressions |
| Other | 15% | 500 chars | Cyrillic, Hebrew, Arabic, Greek, Armenian, Georgian, mixed | Already well served by Stage 1 subword merges |

The corpus is built by streaming the full training data (low RAM) and classifying each text by dominant script. Per-bucket chunk size limits prevent the BPE pair frequency table from exceeding available RAM, while ensuring each underserved script receives sufficient data for the trainer to learn meaningful merges.
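
A sketch of how the bucket assignment and chunking could work, using the shares and chunk caps from the table above; classifying the dominant script via Unicode character names is an assumption, not the actual implementation.

```python
import unicodedata

BUCKETS = {
    "cjk":         {"share": 0.20, "max_chunk": 1000},
    "underserved": {"share": 0.40, "max_chunk": 200},
    "latin":       {"share": 0.25, "max_chunk": 500},
    "other":       {"share": 0.15, "max_chunk": 500},
}

CJK_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")
UNDERSERVED_PREFIXES = (
    "TIBETAN", "THAI", "LAO", "KHMER", "MYANMAR", "ETHIOPIC",
    "DEVANAGARI", "BENGALI", "TAMIL", "TELUGU", "KANNADA",
    "MALAYALAM", "GUJARATI", "GURMUKHI", "ORIYA", "SINHALA",
)

def dominant_bucket(text: str) -> str:
    """Classify a text by its dominant script bucket."""
    counts = {"other": 0, "latin": 0, "underserved": 0, "cjk": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue
        if name.startswith(CJK_PREFIXES):
            counts["cjk"] += 1
        elif name.startswith(UNDERSERVED_PREFIXES):
            counts["underserved"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return max(counts, key=counts.get)

def stage2_chunks(texts):
    """Yield (bucket, chunk) pairs with per-bucket length caps; a downstream
    step would subsample each bucket to its target share of the 300 MB corpus."""
    for text in texts:
        bucket = dominant_bucket(text)
        cap = BUCKETS[bucket]["max_chunk"]
        for i in range(0, len(text), cap):
            yield bucket, text[i:i + cap]
```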

Training Data

Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):

| Category | Share | Description |
|---|---|---|
| Wikipedia | 70.7% | 72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language |
| Stack Exchange | 21.7% | English reasoning, STEM, humanities, multilingual Q&A |
| Code | 8.0% | Python, JavaScript, Java, C/C++, Go, Rust, Shell |
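
A minimal sketch of the sqrt-proportional sampling with a 0.3% per-language floor noted in the Wikipedia row above; renormalising after applying the floor is an assumption about the exact procedure.

```python
import math

def sampling_weights(sizes: dict[str, float], floor: float = 0.003) -> dict[str, float]:
    """sizes: raw per-language corpus sizes (e.g. bytes); returns sampling weights."""
    sqrt_sizes = {lang: math.sqrt(s) for lang, s in sizes.items()}
    total = sum(sqrt_sizes.values())
    weights = {lang: v / total for lang, v in sqrt_sizes.items()}
    # Apply the per-language floor, then renormalise so weights still sum to 1.
    weights = {lang: max(w, floor) for lang, w in weights.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Example: a large, a medium, and a small Wikipedia.
print(sampling_weights({"en": 20_000, "hi": 1_500, "bo": 40}))
```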

Per-Script Performance

FLORES-200 benchmark – mean tokens per word (lower is better):

| Script | QT V.4.4 32K | Llama 3 (128K) | Languages |
|---|---|---|---|
| Latin | 2.57 | 2.39 | 37 |
| Devanagari | 2.84 | 3.52 | 3 |
| Gurmukhi | 2.82 | 8.23 | 1 |
| Hebrew | 2.87 | 5.76 | 2 |
| Cyrillic | 2.90 | 2.59 | 5 |
| Arabic | 2.34 | 2.70 | 2 |
| Bengali | 3.12 | 8.07 | 1 |
| Greek | 3.22 | 2.58 | 1 |
| Armenian | 3.22 | 12.23 | 1 |
| Gujarati | 3.35 | 10.02 | 1 |
| Sinhala | 3.50 | 11.37 | 1 |
| Tamil | 3.88 | 12.45 | 1 |
| Odia | 3.93 | 16.90 | 1 |
| Kannada | 3.96 | 15.01 | 1 |
| Ethiopic | 4.06 | 11.95 | 1 |
| Telugu | 4.13 | 13.36 | 1 |
| Georgian | 4.51 | 15.47 | 1 |
| Malayalam | 4.54 | 16.33 | 1 |
| Myanmar | 6.27 | 29.77 | 1 |
| Thai | 12.88 | 14.03 | 1 |
| Khmer | 13.95 | 40.91 | 1 |
| CJK | 19.85 | 19.75 | 4 |
| Tibetan | 26.70 | 149.79 | 1 |
| Lao | 58.40 | 39.60 | 1 |

Special Tokens

14 structural tokens + 72 language tags = 86 special tokens total.

| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|padding\|>` | Padding |
| 1 | `<\|bos\|>` | Beginning of sequence |
| 2 | `<\|endoftext\|>` | End of text |
| 3 | `<\|unk\|>` | Unknown |
| 4 | `<\|sep\|>` | Separator |
| 5 | `<\|system\|>` | System prompt |
| 6 | `<\|user\|>` | User turn |
| 7 | `<\|assistant\|>` | Assistant turn |
| 8 | `<\|tool_call\|>` | Tool invocation |
| 9 | `<\|tool_result\|>` | Tool response |
| 10 | `<\|thinking\|>` | Thinking open |
| 11 | `<\|/thinking\|>` | Thinking close |
| 12 | `<\|code\|>` | Code open |
| 13 | `<\|/code\|>` | Code close |
| 14–85 | `<\|lang:xx\|>` | Language tags (72 languages) |

Usage

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids)      # Token IDs
print(encoded.tokens)   # Token strings

# Decode
text = tok.decode(encoded.ids)
print(text)
```
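
The special tokens listed above can be combined into a chat-style prompt. The exact template (turn ordering, placement of the language tag, whether closing tags follow each turn) is an assumption; only the token strings themselves come from the table.

```python
# Sketch: a chat-formatted prompt built from the special tokens. The tokenizer
# file must register these strings as special tokens for them to encode as
# single IDs.
prompt = (
    "<|bos|><|lang:en|>"
    "<|system|>You are a helpful assistant."
    "<|user|>Summarise the history of the Roman Empire."
    "<|assistant|>"
)

encoded = tok.encode(prompt)
print(encoded.tokens[:6])  # the leading special tokens should appear as single tokens
```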

Intended Use

QT V.4.4 32K is designed as the tokenizer for the AENEA Prelude model series (sub-500M parameters). The 32K vocabulary is chosen to maximise parameter efficiency for small models, following the findings of "The Depth Delusion" (arXiv 2601.20994) that 32K is optimal for sub-500M parameter models.

Optimised for:

  • Multilingual language modelling across 72 languages
  • Cross-lingual transfer with equitable compression across scripts
  • Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
  • Mathematical and scientific text
  • Instruction-following with dedicated chat tokens

Recommended Pairing

| Model size | Tokenizer | Vocab | Equity |
|---|---|---|---|
| Sub-500M (Prelude series) | QT V.4.4 32K (this model) | 32,000 | 31.5x |
| 500M–2B (Overture series) | QT V.4.1 64K | 64,000 | 32.3x |

Training Configuration

Algorithm:            SuperBPE (two-stage)
Stage 1 vocab:        28,800 (90%; subword with whitespace boundaries)
Stage 2 vocab:        3,200 (10%; superword, no whitespace constraint)
Stage 1 min freq:     2
Stage 2 min freq:     50
Stage 1 sample ratio: 0.35
Stage 2 corpus:       300 MB equity-balanced (4-bucket system)
Pre-tokenization:     Script-aware (Indic virama segmentation)
Training mode:        Streaming sharded (500 MB shards)
Seed:                 42
Training time:        ~52 minutes (RTX 4060, 16 GB RAM)

Limitations

  • Lao remains the weakest script (58.4 TPW) due to limited training data and absence of whitespace word boundaries.
  • Tibetan (26.7 TPW) has improved dramatically over previous versions (V.4.1: 38.6, Llama 3: 149.8) but remains elevated due to the lack of whitespace delimiters.
  • English fertility (1.89 TPW) is higher than V.4.1 (1.50) because Stage 2 merge budget is redistributed toward underserved scripts. This is a deliberate equity tradeoff.
  • Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
  • The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.

References

  • Liu et al., COLM 2025 – "SuperBPE: Space Travel for Language Models"
  • arXiv 2511.03237 – IndicSuperTokenizer: SOTA fertility on 22 Indic languages
  • arXiv 2508.06533 – "The Art of Breaking Words": iterative fertility-driven reweighting
  • Tao et al., NeurIPS 2024 – Scaling Laws with Vocabulary
  • arXiv 2601.20994 – "The Depth Delusion": width > depth, 32K optimal for sub-500M
  • NeurIPS 2025 Workshop – "From Bias to Balance": balanced tokenizer datasets
  • Arnett et al. 2025 – Crosslingual Tokenizer Inequities

Citation

@misc{downey2026qt,
  title={QT V.4.4 UltraLingo: Equity-Balanced Streaming SuperBPE for Multilingual Language Modelling},
  author={Downey, James},
  year={2026},
  publisher={AENEA Global Ltd},
  url={https://huggingface.co/JamesQuartz/qt-v4.4-32k-ultralingo}
}

About

Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).

  • Quartz – Open-source data pipelines and tokenizers (quartz.host)
  • AENEA – Language model laboratory (aenea.app)