QT V.4.4 32K UltraLingo – SuperBPE Tokenizer

Quartz Data Infrastructure – quartz.host | AENEA Global – aeneaglobal.com

A 32,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Prelude model series (sub-500M parameters). Part of the QuartzTokenizer (QT) family.

QT V.4.4 introduces equity-balanced Stage 2 corpus construction: a four-bucket system that oversamples underserved scripts for SuperBPE training, achieving the best cross-lingual equity ratio of any QT tokenizer at any vocabulary size.

Key Results

Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):

| Metric | QT V.4.4 32K | Llama 3 (128K) |
|---|---|---|
| Vocabulary size | 32,000 | 128,256 |
| Mean fertility (tokens/word) | 4.231 | 5.716 |
| Median fertility | 2.843 | 2.700 |
| Equity ratio (max/min fertility) | 31.5x | 118.6x |
| Total tokens (204 langs) | 14,125,437 | 16,764,198 |
| Token savings | -15.7% | baseline |

At one quarter the vocabulary size, QT V.4.4 32K produces 15.7% fewer total tokens than Llama 3 with 3.8x better cross-lingual equity. The 31.5x equity ratio is the best of any QT tokenizer, including the 64K variant (32.3x).
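
For reference, fertility and the equity ratio can be computed as in the sketch below; the whitespace-based word count and the helper names are illustrative assumptions, not the exact benchmark harness.

```python
# Sketch: computing fertility and the equity ratio from per-language results.
from statistics import mean, median

def fertility(tokenizer, sentences):
    """Mean tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(tokenizer.encode(s).ids) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

def summarize(per_language_fertility):
    """per_language_fertility: dict mapping language code -> fertility."""
    values = list(per_language_fertility.values())
    return {
        "mean_fertility": mean(values),
        "median_fertility": median(values),
        # Equity ratio: worst-compressed language relative to best-compressed.
        "equity_ratio": max(values) / min(values),
    }
```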

Evolution: V.4.1 β†’ V.4.4

| Metric / script | V.4.4 32K | V.4.1 32K | Improvement |
|---|---|---|---|
| Mean fertility | 4.231 | 4.343 | -2.6% |
| Equity ratio | 31.5x | 38.6x | -18.4% (fairer) |
| Tibetan | 26.70 | 38.60 | -11.90 TPW |
| Thai | 12.88 | 14.05 | -1.17 TPW |
| Myanmar | 6.27 | 7.28 | -1.01 TPW |
| Khmer | 13.95 | 15.70 | -1.75 TPW |
| CJK (avg) | 19.85 | 21.23 | -1.38 TPW |
| Bengali | 3.12 | 3.40 | -0.28 TPW |
| Kannada | 3.96 | 4.36 | -0.40 TPW |

Architecture

QT V.4.4 is a two-stage SuperBPE tokenizer with four innovations over standard BPE:

1. Two-Stage SuperBPE Training

  • Stage 1 (28,800 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, character sequences.
  • Stage 2 (3,200 tokens, 10%): SuperBPE lifts the whitespace boundary constraint, allowing merges to span word boundaries. Learns high-frequency multi-word superword tokens. Sentence boundary protection prevents cross-sentence tokens. The sketch below illustrates the boundary difference between the two stages.
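
A minimal, illustrative reconstruction of the constraint change between the two stages, assuming simple whitespace and sentence-final punctuation rules; this is not the actual training code.

```python
# Stage 1 pre-tokenizes on whitespace, so no BPE merge can cross a space;
# Stage 2 drops that constraint but still stops merges at sentence boundaries.
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def merge_regions_stage1(text: str) -> list[str]:
    # Each whitespace-delimited word is an independent merge region,
    # yielding subword tokens (roots, affixes, morphemes).
    return text.split()

def merge_regions_stage2(text: str) -> list[str]:
    # Whole sentences are merge regions, so frequent word sequences such as
    # "of the" can become single superword tokens, but no token spans two sentences.
    return [s for s in SENTENCE_END.split(text) if s]

example = "The history of the Roman Empire spans centuries. It began in Italy."
print(merge_regions_stage1(example))  # ['The', 'history', 'of', ...]
print(merge_regions_stage2(example))  # two sentence-level merge regions
```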

Based on Liu et al., COLM 2025 – "SuperBPE: Space Travel for Language Models" (+4.0% downstream, +8.2% MMLU, -27% inference compute).

2. Script-Aware Pre-Tokenization (Indic Only)

  • Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
  • Preserves conjunct consonants by not breaking across virama (halant) marks (see the sketch below)
  • CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
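
A minimal sketch of the virama-aware segmentation described above. The virama code points listed and the handling of dependent vowel signs are assumptions about the described behaviour, not the tokenizer's actual pre-tokenizer code.

```python
# A new cluster starts at each base character, except that nothing is split
# off immediately after a virama, so conjuncts (consonant + virama + consonant)
# stay intact.
import unicodedata

# Viramas / halants for the ten Indic scripts listed above (one per block).
VIRAMAS = {
    "\u094D",  # Devanagari
    "\u09CD",  # Bengali
    "\u0A4D",  # Gurmukhi
    "\u0ACD",  # Gujarati
    "\u0B4D",  # Odia
    "\u0BCD",  # Tamil
    "\u0C4D",  # Telugu
    "\u0CCD",  # Kannada
    "\u0D4D",  # Malayalam
    "\u0DCA",  # Sinhala (al-lakuna)
}

def segment(word: str) -> list[str]:
    clusters: list[str] = []
    for ch in word:
        attach = bool(clusters) and (
            clusters[-1][-1] in VIRAMAS                   # never break after a virama
            or unicodedata.category(ch) in ("Mn", "Mc")   # keep dependent signs attached
            or ch in VIRAMAS                              # a virama attaches to its consonant
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "विद्या" (vidyā) keeps the conjunct द् + या together instead of splitting at the virama.
print(segment("विद्या"))  # ['वि', 'द्या']
```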

3. Streaming Sharded Training

  • Corpus sharded to disk for RAM-bounded training
  • Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
  • Enables SuperBPE training on consumer hardware (16 GB RAM); see the sharding sketch below
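
A rough sketch of the sharding step under the stated constraints (disk-backed shards, later re-streamed with a per-stage sample ratio); the shard naming, file layout, and helper names are illustrative assumptions.

```python
# Stream a text iterable into ~500 MB newline-delimited shards on disk so
# training never holds the full corpus in RAM.
import random
from pathlib import Path
from typing import Iterable, Iterator

def write_shards(texts: Iterable[str], out_dir: str, shard_bytes: int = 500 * 1024**2) -> list[Path]:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards, buf, size = [], [], 0
    for text in texts:
        line = text.replace("\n", " ") + "\n"
        buf.append(line)
        size += len(line.encode("utf-8"))
        if size >= shard_bytes:
            path = out / f"shard_{len(shards):05d}.txt"
            path.write_text("".join(buf), encoding="utf-8")
            shards.append(path)
            buf, size = [], 0
    if buf:
        path = out / f"shard_{len(shards):05d}.txt"
        path.write_text("".join(buf), encoding="utf-8")
        shards.append(path)
    return shards

def stream_shards(shards: list[Path], sample_ratio: float) -> Iterator[str]:
    """Re-stream shard lines with a per-stage sample ratio (e.g. 0.35 for Stage 1)."""
    for path in shards:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if random.random() < sample_ratio:
                    yield line.rstrip("\n")
```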

4. Equity-Balanced Stage 2 Corpus (V.4.4 Innovation)

Instead of randomly sampling a small fraction of the corpus for Stage 2 (which starves low-resource scripts), V.4.4 constructs a dedicated Stage 2 training corpus with four buckets:

| Bucket | Share | Max chunk | Scripts | Rationale |
|---|---|---|---|---|
| CJK | 20% | 1000 chars | Chinese, Japanese, Korean | Needs volume with long sequences for multi-character merges |
| Underserved | 40% | 200 chars | Tibetan, Thai, Lao, Khmer, Myanmar, Indic (10 scripts), Ethiopic | Needs presence: short chunks cap RAM while ensuring meaningful superword merges |
| Latin | 25% | 500 chars | Latin-script languages | Already well served by Stage 1; included for multi-word expressions |
| Other | 15% | 500 chars | Cyrillic, Hebrew, Arabic, Greek, Armenian, Georgian, mixed | Already well served by Stage 1 subword merges |

The corpus is built by streaming the full training data (low RAM) and classifying each text by dominant script. Per-bucket chunk size limits prevent the BPE pair frequency table from exceeding available RAM, while ensuring each underserved script receives sufficient data for the trainer to learn meaningful merges.
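
A sketch of how the bucket assignment and chunking could work, using the shares and chunk caps from the table above; classifying the dominant script via Unicode character names is an assumption, not the actual implementation.

```python
import unicodedata

BUCKETS = {
    "cjk":         {"share": 0.20, "max_chunk": 1000},
    "underserved": {"share": 0.40, "max_chunk": 200},
    "latin":       {"share": 0.25, "max_chunk": 500},
    "other":       {"share": 0.15, "max_chunk": 500},
}

CJK_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")
UNDERSERVED_PREFIXES = (
    "TIBETAN", "THAI", "LAO", "KHMER", "MYANMAR", "ETHIOPIC",
    "DEVANAGARI", "BENGALI", "TAMIL", "TELUGU", "KANNADA",
    "MALAYALAM", "GUJARATI", "GURMUKHI", "ORIYA", "SINHALA",
)

def dominant_bucket(text: str) -> str:
    """Classify a text by its dominant script bucket."""
    counts = {"other": 0, "latin": 0, "underserved": 0, "cjk": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue
        if name.startswith(CJK_PREFIXES):
            counts["cjk"] += 1
        elif name.startswith(UNDERSERVED_PREFIXES):
            counts["underserved"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return max(counts, key=counts.get)

def stage2_chunks(texts):
    """Yield (bucket, chunk) pairs with per-bucket length caps; a downstream
    step would subsample each bucket to its target share of the 300 MB corpus."""
    for text in texts:
        bucket = dominant_bucket(text)
        cap = BUCKETS[bucket]["max_chunk"]
        for i in range(0, len(text), cap):
            yield bucket, text[i:i + cap]
```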

Training Data

Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):

| Category | Share | Description |
|---|---|---|
| Wikipedia | 70.7% | 72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language |
| Stack Exchange | 21.7% | English reasoning, STEM, humanities, multilingual Q&A |
| Code | 8.0% | Python, JavaScript, Java, C/C++, Go, Rust, Shell |
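
A minimal sketch of the sqrt-proportional sampling with a 0.3% per-language floor noted in the Wikipedia row above; renormalising after applying the floor is an assumption about the exact procedure.

```python
import math

def sampling_weights(sizes: dict[str, float], floor: float = 0.003) -> dict[str, float]:
    """sizes: raw per-language corpus sizes (e.g. bytes); returns sampling weights."""
    sqrt_sizes = {lang: math.sqrt(s) for lang, s in sizes.items()}
    total = sum(sqrt_sizes.values())
    weights = {lang: v / total for lang, v in sqrt_sizes.items()}
    # Apply the per-language floor, then renormalise so weights still sum to 1.
    weights = {lang: max(w, floor) for lang, w in weights.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Example: a large, a medium, and a small Wikipedia.
print(sampling_weights({"en": 20_000, "hi": 1_500, "bo": 40}))
```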

Per-Script Performance

FLORES-200 benchmark – mean tokens per word (lower is better):

| Script | QT V.4.4 32K | Llama 3 (128K) | Languages |
|---|---|---|---|
| Latin | 2.57 | 2.39 | 37 |
| Devanagari | 2.84 | 3.52 | 3 |
| Gurmukhi | 2.82 | 8.23 | 1 |
| Hebrew | 2.87 | 5.76 | 2 |
| Cyrillic | 2.90 | 2.59 | 5 |
| Arabic | 2.34 | 2.70 | 2 |
| Bengali | 3.12 | 8.07 | 1 |
| Greek | 3.22 | 2.58 | 1 |
| Armenian | 3.22 | 12.23 | 1 |
| Gujarati | 3.35 | 10.02 | 1 |
| Sinhala | 3.50 | 11.37 | 1 |
| Tamil | 3.88 | 12.45 | 1 |
| Odia | 3.93 | 16.90 | 1 |
| Kannada | 3.96 | 15.01 | 1 |
| Ethiopic | 4.06 | 11.95 | 1 |
| Telugu | 4.13 | 13.36 | 1 |
| Georgian | 4.51 | 15.47 | 1 |
| Malayalam | 4.54 | 16.33 | 1 |
| Myanmar | 6.27 | 29.77 | 1 |
| Thai | 12.88 | 14.03 | 1 |
| Khmer | 13.95 | 40.91 | 1 |
| CJK | 19.85 | 19.75 | 4 |
| Tibetan | 26.70 | 149.79 | 1 |
| Lao | 58.40 | 39.60 | 1 |

Special Tokens

14 structural tokens + 72 language tags = 86 special tokens total.

| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|padding\|>` | Padding |
| 1 | `<\|bos\|>` | Beginning of sequence |
| 2 | `<\|endoftext\|>` | End of text |
| 3 | `<\|unk\|>` | Unknown |
| 4 | `<\|sep\|>` | Separator |
| 5 | `<\|system\|>` | System prompt |
| 6 | `<\|user\|>` | User turn |
| 7 | `<\|assistant\|>` | Assistant turn |
| 8 | `<\|tool_call\|>` | Tool invocation |
| 9 | `<\|tool_result\|>` | Tool response |
| 10 | `<\|thinking\|>` | Thinking open |
| 11 | `<\|/thinking\|>` | Thinking close |
| 12 | `<\|code\|>` | Code open |
| 13 | `<\|/code\|>` | Code close |
| 14–85 | `<\|lang:xx\|>` | Language tags (72 languages) |

Usage

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids)      # Token IDs
print(encoded.tokens)   # Token strings

# Decode
text = tok.decode(encoded.ids)
print(text)
```
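
The special tokens listed above can be combined into a chat-style prompt. The exact template (turn ordering, placement of the language tag, whether closing tags follow each turn) is an assumption; only the token strings themselves come from the table.

```python
# Sketch: a chat-formatted prompt built from the special tokens. The tokenizer
# file must register these strings as special tokens for them to encode as
# single IDs.
prompt = (
    "<|bos|><|lang:en|>"
    "<|system|>You are a helpful assistant."
    "<|user|>Summarise the history of the Roman Empire."
    "<|assistant|>"
)

encoded = tok.encode(prompt)
print(encoded.tokens[:6])  # the leading special tokens should appear as single tokens
```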

Intended Use

QT V.4.4 32K is designed as the tokenizer for the AENEA Prelude model series (sub-500M parameters). The 32K vocabulary is chosen to maximise parameter efficiency for small models, following the findings of "The Depth Delusion" (arXiv 2601.20994) that 32K is optimal for sub-500M parameter models.

Optimised for:

  • Multilingual language modelling across 72 languages
  • Cross-lingual transfer with equitable compression across scripts
  • Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
  • Mathematical and scientific text
  • Instruction-following with dedicated chat tokens

Recommended Pairing

| Model size | Tokenizer | Vocab | Equity |
|---|---|---|---|
| Sub-500M (Prelude series) | QT V.4.4 32K (this model) | 32,000 | 31.5x |
| 500M–2B (Overture series) | QT V.4.1 64K | 64,000 | 32.3x |

Training Configuration

Algorithm:            SuperBPE (two-stage)
Stage 1 vocab:        28,800 (90%; subword with whitespace boundaries)
Stage 2 vocab:        3,200 (10%; superword, no whitespace constraint)
Stage 1 min freq:     2
Stage 2 min freq:     50
Stage 1 sample ratio: 0.35
Stage 2 corpus:       300 MB equity-balanced (4-bucket system)
Pre-tokenization:     Script-aware (Indic virama segmentation)
Training mode:        Streaming sharded (500 MB shards)
Seed:                 42
Training time:        ~52 minutes (RTX 4060, 16 GB RAM)

Limitations

  • Lao remains the weakest script (58.4 TPW) due to limited training data and absence of whitespace word boundaries.
  • Tibetan (26.7 TPW) has improved dramatically over previous versions (V.4.1: 38.6, Llama 3: 149.8) but remains elevated due to the lack of whitespace delimiters.
  • English fertility (1.89 TPW) is higher than V.4.1 (1.50) because Stage 2 merge budget is redistributed toward underserved scripts. This is a deliberate equity tradeoff.
  • Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
  • The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.

References

  • Liu et al., COLM 2025 – "SuperBPE: Space Travel for Language Models"
  • arXiv 2511.03237 – IndicSuperTokenizer: SOTA fertility on 22 Indic languages
  • arXiv 2508.06533 – "The Art of Breaking Words": iterative fertility-driven reweighting
  • Tao et al., NeurIPS 2024 – Scaling Laws with Vocabulary
  • arXiv 2601.20994 – "The Depth Delusion": width > depth, 32K optimal for sub-500M
  • NeurIPS 2025 Workshop – "From Bias to Balance": balanced tokenizer datasets
  • Arnett et al. 2025 – Crosslingual Tokenizer Inequities

Citation

@misc{downey2026qt,
  title={QT V.4.4 UltraLingo: Equity-Balanced Streaming SuperBPE for Multilingual Language Modelling},
  author={Downey, James},
  year={2026},
  publisher={AENEA Global Ltd},
  url={https://huggingface.co/JamesQuartz/qt-v4.4-32k-ultralingo}
}

About

Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).

  • Quartz – Open-source data pipelines and tokenizers (quartz.host)
  • AENEA – Language model laboratory (aenea.app)