QT V.4.1 32K UltraLingo: SuperBPE Tokenizer

Quartz Data Infrastructure (quartz.host) | AENEA Global (aeneaglobal.com)

A 32,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Prelude model series (sub-500M parameters). Part of the QuartzTokenizer (QT) family.

Key Results

Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):

Metric                             QT V.4.1 32K   Llama 3 (128K)
Vocabulary size                    32,000         128,256
Mean fertility (tokens/word)       4.343          5.716
Median fertility                   2.812          2.700
Equity ratio (max/min fertility)   38.6x          118.6x
Total tokens (204 langs)           14,138,900     16,764,198
Token savings                      −15.7%         baseline

At one quarter the vocabulary size, QT V.4.1 32K produces 15.7% fewer total tokens than Llama 3 and a 3.1x lower equity ratio (38.6x vs 118.6x), i.e. more even compression across languages.
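The headline figures can be re-derived from the table. The snippet below is a quick arithmetic check using only the numbers reported above; the variable names are illustrative.

```python
# Figures copied from the Key Results table.
qt_tokens = 14_138_900
llama_tokens = 16_764_198
qt_equity = 38.6
llama_equity = 118.6
qt_vocab = 32_000
llama_vocab = 128_256

token_savings = 1 - qt_tokens / llama_tokens   # fraction of tokens saved vs Llama 3
equity_gain = llama_equity / qt_equity         # how many times lower the max/min ratio is
vocab_fraction = qt_vocab / llama_vocab        # relative vocabulary size

print(f"Token savings:  {token_savings:.1%}")
print(f"Equity gain:    {equity_gain:.1f}x")
print(f"Vocab fraction: {vocab_fraction:.2f}")
```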

Architecture

QT V.4.1 is a two-stage SuperBPE tokenizer with three innovations over standard BPE:

1. Two-Stage SuperBPE Training

  • Stage 1 (28,800 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, character sequences.
  • Stage 2 (3,200 tokens, 10%): SuperBPE lifts the whitespace boundary constraint, allowing merges to span word boundaries. Learns high-frequency multi-word superword tokens. Sentence boundary protection prevents cross-sentence tokens.

Based on Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models", which reports +4.0% downstream, +8.2% MMLU, and −27% inference compute.
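The two-stage idea can be illustrated with a toy trainer. This is a minimal sketch, not the QT training code: it works on characters rather than bytes, recounts pairs from scratch at every step, and all function names are hypothetical. Stage 1 keeps one symbol sequence per word, so merges stay inside words; Stage 2 rejoins the words into one sequence per sentence, so merges may cross spaces but never cross sentences.

```python
from collections import Counter

def count_pairs(sequences):
    """Count adjacent symbol pairs across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def apply_merge(seq, pair):
    """Replace each occurrence of `pair` in `seq` with the merged symbol."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(sequences, n_merges, min_freq):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(n_merges):
        pairs = count_pairs(sequences)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break
        sequences = [apply_merge(seq, pair) for seq in sequences]
        merges.append(pair)
    return merges, sequences

def train_two_stage(sentences, stage1_merges, stage2_merges, min_freq=2):
    # Stage 1: one sequence per whitespace-delimited word. The space is attached
    # to the front of every non-initial word, so no merge can cross a boundary.
    words_per_sentence = [
        [w if i == 0 else " " + w for i, w in enumerate(s.split())]
        for s in sentences
    ]
    word_seqs = [list(w) for words in words_per_sentence for w in words]
    m1, word_seqs = learn_merges(word_seqs, stage1_merges, min_freq)

    # Stage 2: rejoin the words into one sequence per sentence and keep merging.
    sent_seqs, k = [], 0
    for words in words_per_sentence:
        seq = []
        for _ in words:
            seq.extend(word_seqs[k])
            k += 1
        sent_seqs.append(seq)
    m2, sent_seqs = learn_merges(sent_seqs, stage2_merges, min_freq)
    return m1, m2, sent_seqs

m1, m2, encoded = train_two_stage(["of the"] * 3, stage1_merges=10, stage2_merges=5)
print(m1)  # subword merges, none of which crosses a space
print(m2)  # superword merges learned after the boundary constraint is lifted
```

On this tiny corpus Stage 1 can only build "of" and " the"; the superword "of the" becomes reachable in Stage 2.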

2. Script-Aware Pre-Tokenization (Indic Only)

  • Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
  • Preserves conjunct consonants by not breaking across virama (halant) marks
  • CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
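One way to picture virama-aware segmentation is the sketch below. It is an assumption about the general technique, not the QT pre-tokenizer: the function name is hypothetical, and the rule used (attach combining marks to the preceding character, and attach a letter that directly follows a virama) ignores refinements such as ZWJ/ZWNJ handling.

```python
import unicodedata

# Virama (halant) code points for the ten Indic scripts listed above.
VIRAMAS = {
    "\u094D",  # Devanagari
    "\u09CD",  # Bengali
    "\u0A4D",  # Gurmukhi
    "\u0ACD",  # Gujarati
    "\u0B4D",  # Odia
    "\u0BCD",  # Tamil
    "\u0C4D",  # Telugu
    "\u0CCD",  # Kannada
    "\u0D4D",  # Malayalam
    "\u0DCA",  # Sinhala
}

def indic_clusters(text):
    """Split text into clusters without breaking a conjunct at its virama."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc", "Me")
        # A letter right after a virama continues the conjunct consonant.
        joins_conjunct = (
            bool(clusters) and clusters[-1][-1] in VIRAMAS and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "namaste" in Devanagari: NA, MA, SA + VIRAMA + TA + VOWEL SIGN E
word = "\u0928\u092E\u0938\u094D\u0924\u0947"
clusters = indic_clusters(word)
print(len(clusters), clusters)
```

The conjunct SA + VIRAMA + TA and its vowel sign stay in a single cluster, so BPE merges start from whole conjuncts rather than from a dangling virama.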

3. Streaming Sharded Training

  • Corpus sharded to disk for RAM-bounded training
  • Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
  • Enables SuperBPE training on consumer hardware (16 GB RAM)
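The memory bound comes from never holding the full corpus at once. A minimal sketch of that pattern follows, assuming plain UTF-8 text shards; the helper names and file naming are hypothetical, and simple word counts stand in for the pair statistics a BPE trainer would accumulate.

```python
import os
import tempfile
from collections import Counter

def write_shards(lines, out_dir, max_bytes):
    """Write an iterable of text lines to numbered shard files of at most ~max_bytes."""
    paths, handle, written = [], None, 0
    for line in lines:
        data = line.rstrip("\n") + "\n"
        size = len(data.encode("utf-8"))
        # Start a new shard when the current one would overflow.
        if handle is None or (written and written + size > max_bytes):
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, f"shard_{len(paths):05d}.txt")
            handle = open(path, "w", encoding="utf-8")
            paths.append(path)
            written = 0
        handle.write(data)
        written += size
    if handle is not None:
        handle.close()
    return paths

def stream_word_counts(paths):
    """Accumulate counts shard by shard; only one line of text is in memory at a time."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return counts

with tempfile.TemporaryDirectory() as tmp:
    # Tiny shard size for demonstration only.
    shards = write_shards(["a b", "b c", "c d"], tmp, max_bytes=8)
    counts = stream_word_counts(shards)

print(len(shards), dict(counts))
```

Peak memory then scales with the size of the count table rather than with the corpus, which is what makes a 16 GB machine workable.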

Training Data

Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):

Category         Share   Description
Wikipedia        70.7%   72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language
Stack Exchange   21.7%   English reasoning, STEM, humanities, multilingual Q&A
Code             8.0%    Python, JavaScript, Java, C/C++, Go/Rust, Shell

Corpus design follows the iterative fertility balancing of "The Art of Breaking Words" (arXiv 2508.06533) and the script/family bucket approach of "One Tokenizer to Rule Them All".
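Sqrt-proportional sampling with a floor can be sketched as follows. How the floor is applied in the actual pipeline is not stated, so this is one plausible reading: languages that fall below the floor are pinned to it and the remaining share is redistributed in proportion to the square root of corpus size. The function name and the example sizes are hypothetical.

```python
import math

def sqrt_shares(sizes, floor=0.003):
    """Sampling shares proportional to sqrt(size), with a minimum share per language.

    Assumes floor * len(sizes) < 1 so the floored languages cannot use up all mass.
    """
    floored = set()
    while True:
        free_mass = 1.0 - floor * len(floored)
        free_total = sum(
            math.sqrt(size) for lang, size in sizes.items() if lang not in floored
        )
        shares = {
            lang: floor if lang in floored else free_mass * math.sqrt(size) / free_total
            for lang, size in sizes.items()
        }
        # Pin any language that still falls below the floor, then recompute.
        newly_floored = {
            lang for lang, share in shares.items()
            if share < floor and lang not in floored
        }
        if not newly_floored:
            return shares
        floored |= newly_floored

# Illustrative corpus sizes (arbitrary units), not the real Wikipedia sizes.
shares = sqrt_shares({"en": 1_000_000, "de": 250_000, "lo": 1})
for lang, share in shares.items():
    print(f"{lang}: {share:.4f}")
```

With these sizes English is 4x larger than German but receives only 2x the share, and the smallest language is lifted to the 0.3% floor.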

Per-Script Performance

FLORES-200 benchmark β€” mean tokens per word (lower is better):

Script       QT V.4.1 32K   Llama 3 (128K)   Languages
Latin        2.53           2.39             37
Arabic       2.42           2.70             2
Gurmukhi     2.75           8.23             1
Hebrew       2.78           5.76             1
Cyrillic     2.82           2.59             5
Devanagari   2.98           3.52             3
Armenian     3.36           12.23            1
Bengali      3.40           8.07             1
Sinhala      3.57           11.37            1
Tamil        3.62           12.45            1
Gujarati     3.79           10.02            1
Odia         3.80           16.90            1
Telugu       4.07           13.36            1
Georgian     4.24           15.47            1
Kannada      4.36           15.01            1
Ethiopic     4.42           11.95            1
Malayalam    4.61           16.33            1
Greek        3.33           2.58             1
Myanmar      7.28           29.77            1
Thai         14.05          14.03            1
Khmer        15.70          40.91            1
CJK          21.23          19.75            4
Tibetan      38.60          149.79           1
Lao          58.01          39.60            1
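The sketch below shows how a tokens-per-word figure and an equity ratio could be computed. It assumes words are counted by splitting on whitespace, which the card does not state explicitly; the function names are hypothetical, and a character-level encoder stands in for a real tokenizer so the example runs on its own.

```python
def fertility(sentences, encode):
    """Tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(encode(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

def equity_ratio(fertility_by_lang):
    """Ratio of the worst (highest) to the best (lowest) per-language fertility."""
    values = list(fertility_by_lang.values())
    return max(values) / min(values)

# Stand-in encoder: one token per character, including spaces.
char_encode = list

spaced = fertility(["ab cd"], char_encode)     # 5 tokens over 2 words
unspaced = fertility(["abcd"], char_encode)    # 4 tokens over 1 "word"
ratio = equity_ratio({"spaced": spaced, "unspaced": unspaced})
print(spaced, unspaced, ratio)
```

With a loaded tokenizer, `encode` could be `lambda s: tok.encode(s).ids`. Under a whitespace word count, a sentence with no spaces counts as very few words, which helps explain the high values for Thai, Khmer, CJK, Tibetan, and Lao.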

Special Tokens

14 structural tokens + 72 language tags = 86 special tokens total.

ID      Token             Purpose
0       <|padding|>       Padding
1       <|bos|>           Beginning of sequence
2       <|endoftext|>     End of text
3       <|unk|>           Unknown
4       <|sep|>           Separator
5       <|system|>        System prompt
6       <|user|>          User turn
7       <|assistant|>     Assistant turn
8       <|tool_call|>     Tool invocation
9       <|tool_result|>   Tool response
10      <|thinking|>      Thinking open
11      <|/thinking|>     Thinking close
12      <|code|>          Code open
13      <|/code|>         Code close
14–85   <|lang:xx|>       Language tags (72 languages)
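A small sanity check of the ID layout might look like the sketch below. The helper name is hypothetical; it accepts any token-to-ID lookup, so with a loaded tokenizer one could pass `tok.token_to_id`. A dictionary stub is used here so the example runs without `tokenizer.json`.

```python
# Structural tokens in ID order, copied from the table above.
STRUCTURAL_TOKENS = [
    "<|padding|>", "<|bos|>", "<|endoftext|>", "<|unk|>", "<|sep|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool_call|>", "<|tool_result|>",
    "<|thinking|>", "<|/thinking|>", "<|code|>", "<|/code|>",
]
N_LANG_TAGS = 72
FIRST_LANG_ID = len(STRUCTURAL_TOKENS)           # language tags start after the structural block
LAST_LANG_ID = FIRST_LANG_ID + N_LANG_TAGS - 1   # last language tag ID

def mismatched_ids(token_to_id):
    """Return (token, expected_id, actual_id) for each structural token whose ID differs."""
    return [
        (token, expected, token_to_id(token))
        for expected, token in enumerate(STRUCTURAL_TOKENS)
        if token_to_id(token) != expected
    ]

# Stub lookup standing in for a real tokenizer's token_to_id method.
stub_vocab = {token: i for i, token in enumerate(STRUCTURAL_TOKENS)}
problems = mismatched_ids(stub_vocab.get)
print(FIRST_LANG_ID, LAST_LANG_ID, problems)
```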

Usage

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids)      # Token IDs
print(encoded.tokens)   # Token strings

# Decode
text = tok.decode(encoded.ids)
print(text)

Intended Use

QT V.4.1 32K is designed as the tokenizer for the AENEA Prelude model series (sub-500M parameters). The 32K vocabulary is chosen to maximise parameter efficiency for small models, following the findings of "The Depth Delusion" (arXiv 2601.20994) that 32K is optimal for sub-500M parameter models.
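The parameter-efficiency argument is easy to see with a back-of-envelope count of embedding parameters. The hidden size below is a hypothetical value chosen for illustration, not a Prelude specification, and the function name is likewise illustrative.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the token embedding, plus the output projection if untied."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

D_MODEL = 1024  # hypothetical hidden size, not taken from the Prelude series

for name, vocab in [("QT V.4.1 32K", 32_000), ("Llama 3 (128K)", 128_256)]:
    params = embedding_params(vocab, D_MODEL)
    print(f"{name}: {params / 1e6:.1f}M embedding parameters")
```

Under this assumption a 128K vocabulary would spend roughly 131M parameters on tied embeddings, about a quarter of a 500M budget, while 32K spends about 33M and leaves the rest for the transformer layers.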

Optimised for:

  • Multilingual language modelling across 72 languages
  • Cross-lingual transfer with equitable compression across scripts
  • Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
  • Mathematical and scientific text
  • Instruction-following with dedicated chat tokens

Recommended Pairing

Model Size                   Tokenizer                   Vocab
Sub-500M (Prelude series)    QT V.4.1 32K (this model)   32,000
500M–2B (Overture series)    QT V.4.1 64K                64,000

Training Configuration

Algorithm:            SuperBPE (two-stage)
Stage 1 vocab:        28,800 (90%, subword with whitespace boundaries)
Stage 2 vocab:        3,200 (10%, superword, no whitespace constraint)
Min frequency:        Stage 1: 2, Stage 2: 50
Sample ratio:         Stage 1: 0.35, Stage 2: 0.08
Pre-tokenization:     Script-aware (Indic virama segmentation)
Training mode:        Streaming sharded (500 MB shards)
Seed:                 42
Training time:        ~98 minutes (RTX 4060, 16 GB RAM)
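For scripting experiments, the settings above could be captured in a small config object. This is only a sketch: the class and field names are hypothetical and do not correspond to a published QT API; the values are copied from the configuration listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuperBPEConfig:
    vocab_size: int = 32_000
    stage1_vocab: int = 28_800          # subword merges, whitespace boundaries kept
    stage2_vocab: int = 3_200           # superword merges, boundaries lifted
    stage1_min_freq: int = 2
    stage2_min_freq: int = 50
    stage1_sample_ratio: float = 0.35
    stage2_sample_ratio: float = 0.08
    shard_size_mb: int = 500
    seed: int = 42

    def __post_init__(self):
        # The two stage budgets must add up to the full vocabulary.
        if self.stage1_vocab + self.stage2_vocab != self.vocab_size:
            raise ValueError("stage vocabularies must sum to vocab_size")

cfg = SuperBPEConfig()
stage2_share = cfg.stage2_vocab / cfg.vocab_size
print(f"Stage 2 share of vocabulary: {stage2_share:.0%}")
```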

Limitations

  • Lao remains the weakest script (58.0 tokens per word, TPW) due to limited training data and the absence of whitespace word boundaries.
  • Tibetan (38.6 TPW) improves on Llama 3 (149.8 TPW) but remains high. Future versions will increase the Tibetan corpus weight.
  • At 32K vocabulary, there is less headroom for low-resource scripts compared to the 64K variant. If your use case is primarily multilingual, consider the 64K version.
  • Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
  • The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.

References

  • Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models"
  • arXiv 2511.03237, IndicSuperTokenizer: SOTA fertility on 22 Indic languages
  • arXiv 2508.06533, "The Art of Breaking Words": iterative fertility-driven reweighting
  • Tao et al., NeurIPS 2024, Scaling Laws with Vocabulary
  • arXiv 2601.20994, "The Depth Delusion": width > depth, 32K optimal for sub-500M
  • NeurIPS 2025 Workshop, "From Bias to Balance": balanced tokenizer datasets
  • Arnett et al. 2025, Crosslingual Tokenizer Inequities

Citation

@misc{downey2026qt,
  title={QT V.4.1 UltraLingo: A Streaming Script-Aware SuperBPE Tokenizer for Equitable Multilingual Language Modelling},
  author={Downey, James},
  year={2026},
  publisher={AENEA Global Ltd},
  url={https://huggingface.co/JamesQuartz/qt-v4.1-32k-ultralingo}
}

About

Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).

  • Quartz: Open-source data pipelines and tokenizers (quartz.host)
  • AENEA: Language model laboratory (aenea.app)
  • Crassus: Institutional credit intelligence (crassus.info)