Papers
arxiv:2602.05400

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Published on Feb 5
· Submitted by Xuan Ouyang on Feb 11
#1 Paper of the day

Abstract

AI-generated summary: OPUS is a dynamic data selection framework that improves pre-training efficiency by scoring data candidates based on optimizer-induced update projections in a stable proxy-derived target space, achieving superior performance with reduced computational overhead.

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.

Community

Post by the paper author and submitter:

In this paper, we argue that LLM pre-training is entering a “data-wall” regime where readily available high-quality public text is approaching exhaustion, so progress must shift from more tokens to better tokens chosen at the right time. While most existing pipelines either (i) apply static, training-agnostic quality filters or (ii) use dynamic selection criteria defined in raw gradient space, modern LLMs are actually trained with adaptive optimizers like AdamW or Muon whose preconditioning reshapes the effective update direction—creating a fundamental mismatch between “how we score data” and “how training truly updates the model.” To bridge this gap, we introduce OPUS (Optimizer-induced Projected Utility Selection), a dynamic selection framework that defines data utility directly in the optimizer-induced update space: a sample is valuable insofar as its optimizer-shaped effective update aligns with the descent direction of a stable, high-quality target distribution (our proxy).
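
As a rough illustration of that scoring idea (a minimal NumPy sketch under our own naming, not the paper's closed-form estimator; bias correction and weight decay are omitted), a candidate's gradient is pushed through an AdamW-style preconditioner to obtain its effective update, which is then projected onto the effective update induced by the proxy batch:

```python
import numpy as np

def adamw_direction(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """Effective (preconditioned) update direction an AdamW-style optimizer
    would take for this gradient, given current moment estimates m and v
    (bias correction and weight decay omitted for brevity)."""
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    return m_new / (np.sqrt(v_new) + eps)

def opus_style_utility(candidate_grad, proxy_grad, m, v):
    """Utility of a candidate = projection of its optimizer-shaped update
    onto the target direction derived from the stable proxy batch."""
    u_cand = adamw_direction(candidate_grad, m, v)
    u_target = adamw_direction(proxy_grad, m, v)
    return float(u_cand @ u_target / (np.linalg.norm(u_target) + 1e-12))
```

A raw-gradient criterion would instead score `candidate_grad @ proxy_grad` directly, ignoring how the preconditioner re-weights coordinates, which is exactly the mismatch described above.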


Concretely, OPUS operationalizes this idea through a principled objective, a scalable estimator, and a diversity-preserving selection rule. Our key contributions are: (1) an optimizer-aware utility for dynamic selection, with closed-form approximations for effective update directions under AdamW and Muon, aligning scoring with real training geometry; (2) BENCH-PROXY, an in-distribution proxy construction method that retrieves benchmark-aligned samples from the pre-training corpus to stabilize the target direction; (3) scalable utility estimation using the Ghost technique with CountSketch projections to avoid per-sample gradient materialization; and (4) Boltzmann sampling with redundancy control to prevent diversity collapse under non-stationary streams. Empirically, OPUS delivers strong data/compute efficiency: it incurs only ~4.7% additional compute overhead for selection while achieving large gains across datasets, optimizers, and scales, including a +2.2% average accuracy improvement over 10 benchmarks and an 8× compute reduction in one highlighted setting, outperforming industrial static/dynamic baselines and even matching or exceeding much longer-token training in several regimes.
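
For the Muon side of contribution (1), the effective update is (roughly) an orthogonalized momentum matrix. The NumPy sketch below shows the quintic Newton-Schulz orthogonalization used by the public Muon reference implementation; treating this as the paper's exact closed form is our assumption, and `muon_direction` is an illustrative name:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a 2-D momentum/gradient matrix via the
    quintic Newton-Schulz iteration (coefficients from the open-source Muon
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization for convergence
    flip = X.shape[0] > X.shape[1]
    if flip:                             # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X

def muon_direction(momentum_matrix):
    """Illustrative effective update under a Muon-style optimizer: the
    orthogonalized momentum. A candidate's utility is then its projection
    onto the proxy's orthogonalized update, as in the AdamW sketch above."""
    return newton_schulz(momentum_matrix)
```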

Main Results of OPUS

Overview

OPUS (Optimizer-induced Projected Utility Selection) is a dynamic data selection framework for LLM pre-training that aligns data selection with the optimizer's actual update geometry (supporting both AdamW and Muon optimizers). It achieves superior data efficiency with minimal computational overhead.

Key Quantitative Results

1. Pre-training from Scratch (FineWeb, 30B tokens)

Figure 1 & Table 3: OPUS outperforms all compute-matched baselines across model scales and optimizers:

  • GPT-2 XL (Muon): Achieves 41.75% average accuracy vs. 40.29% for random selection (1.46 point improvement), outperforming even the 60B-token random baseline (41.29%) while using only half the tokens.
  • GPT-2 Large (AdamW): Achieves 41.43% vs. 39.29% for random (2.14 point improvement).
  • Cross-optimizer consistency: OPUS achieves best compute-matched performance under both Muon (matrix preconditioning) and AdamW (diagonal preconditioning), validating that optimizer-aware selection matters.

2. Robustness to Data Quality (FineWeb-Edu)

Table 4: OPUS demonstrates remarkable efficiency even with lower-quality data:

  • When selecting from mid-quality data (score 3), OPUS matches or exceeds static baselines trained on high-quality data (scores 4-5).
  • GPT-2 XL (Muon): OPUS achieves 44.99% average accuracy when selecting from score-3 data, outperforming all baselines trained on the superior score-4/5 partition (best baseline: 42.59%).

3. Continued Pre-training Efficiency

Figure 5 & Figure 6: On Qwen3-8B-Base continued pre-training with SciencePedia:

  • OPUS achieves 6× data efficiency: Using only 0.5B tokens, OPUS outperforms full training with 3B tokens on science benchmarks (OlympicArena and SciAssess).
  • Superior performance in specialized domains (physics, chemistry, biology, medicine, materials science).

4. Computational Efficiency

Figure 7:

  • Overhead: Only 4.7% additional compute cost compared to random selection.
  • By contrast, naive dynamic selection implementations incur a >3.5× slowdown.
  • Achieved through the Ghost technique with CountSketch projections (see the sketch below).
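
As a rough illustration of the CountSketch half of that estimator (the Ghost-style computation of per-sample quantities from layer inputs and output gradients is not shown; names and the default sketch dimension are our own), per-sample update vectors are hashed into a much smaller space where inner products are preserved in expectation, so utilities can be scored without materializing full gradients:

```python
import numpy as np

def countsketch(vec, sketch_dim=8192, seed=0):
    """Standard CountSketch projection: each coordinate is hashed to one of
    sketch_dim buckets with a random sign. Inner products between sketches
    are unbiased estimates of inner products between the original vectors."""
    rng = np.random.default_rng(seed)        # shared seed => shared hash functions
    buckets = rng.integers(0, sketch_dim, size=vec.shape[0])
    signs = rng.choice(np.array([-1.0, 1.0]), size=vec.shape[0])
    out = np.zeros(sketch_dim)
    np.add.at(out, buckets, signs * vec)     # scatter-add signed coordinates
    return out

# Usage: score a candidate against the proxy target direction in sketch space.
# target_sketch = countsketch(proxy_update)
# score = countsketch(candidate_update) @ target_sketch
```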

Comparative Performance

vs. Static Methods

OPUS consistently outperforms industrial-level static filters:

  • QuRating, DSIR, DCLM-FastText, FineWeb-Edu, UltraFineweb (Table 3, Table 4)
  • Static methods suffer from training-agnostic heuristics; OPUS adapts to model state.

vs. Dynamic Methods

  • High-PPL (perplexity-based): OPUS beats it by ~2% average accuracy.
  • GREATS: OPUS outperforms it while being more scalable (GREATS assumes SGD geometry; OPUS handles adaptive optimizers correctly).

Ablation Insights

Table 7 & Table 8:

  • Boltzmann sampling (temperature τ=0.9) outperforms greedy top-k selection (41.75% vs. 40.49%), preventing diversity collapse; a minimal sampling sketch follows this list.
  • Bench-Proxy (benchmark-aligned proxy) improves over a standard proxy (41.75% vs. 41.03%).
  • Robust to hyperparameters: works across buffer sizes (16-64) and projection dimensions (4096-16384).
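
A minimal sketch of the Boltzmann selection rule referenced above (our own naming; the redundancy control described in the paper summary is omitted, and sampling without replacement is a simple approximation):

```python
import numpy as np

def boltzmann_select(scores, k, tau=0.9, rng=None):
    """Pick k candidate indices with probability proportional to
    exp(score / tau). As tau -> 0 this approaches greedy top-k; a moderate
    tau (0.9 in the ablation) keeps lower-scoring but diverse samples in play."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=np.float64)
    logits = (scores - scores.max()) / tau   # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)
```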

Qualitative Analysis

Appendix A: OPUS selects a more diverse mixture of documents (instructional content + general web text) compared to:

  • High-PPL: Concentrates on high-loss but potentially noisy samples.
  • QuRating: Extreme preference for "educational" patterns only.
  • Static filters: Fixed heuristics that don't adapt to training dynamics.

Summary

OPUS achieves an 8× computation reduction on GPT-2 XL while improving accuracy by 2.2% over random selection. It is the first dynamic selection method that properly accounts for modern optimizer geometries (AdamW, Muon), enabling principled, scalable, and diverse data selection at every training iteration with only 4.7% overhead.
