Papers - Training
SELF: Language-Driven Self-Evolution for Large Language Model
Paper
• 2310.00533
• Published
• 2
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
Paper
• 2310.00576
• Published
• 2
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper
• 2305.13169
• Published
• 3
Transformers Can Achieve Length Generalization But Not Robustly
Paper
• 2402.09371
• Published
• 14
Triple-Encoders: Representations That Fire Together, Wire Together
Paper
• 2402.12332
• Published
• 2
Veagle: Advancements in Multimodal Representation Learning
Paper
• 2403.08773
• Published
• 10
Training Compute-Optimal Large Language Models
Paper
• 2203.15556
• Published
• 11
Hash Layers For Large Sparse Models
Paper
• 2106.04426
• Published
• 2
Chain-of-Verification Reduces Hallucination in Large Language Models
Paper
• 2309.11495
• Published
• 40
Contrastive Decoding Improves Reasoning in Large Language Models
Paper
• 2309.09117
• Published
• 40
Qwen2 Technical Report
Paper
• 2407.10671
• Published
• 168
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Paper
• 2404.05405
• Published
• 10
Scaling Laws for Precision
Paper
• 2411.04330
• Published
• 7
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Paper
• 1806.07572
• Published
• 1
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Paper
• 2411.12580
• Published
• 2
Studying Large Language Model Generalization with Influence Functions
Paper
• 2308.03296
• Published
• 14
Scaling and evaluating sparse autoencoders
Paper
• 2406.04093
• Published
• 3
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paper
• 2308.11466
• Published
• 1
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Paper
• 2105.13626
• Published
• 4
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Paper
• 2103.06874
• Published
• 2
Phi-4 Technical Report
Paper
• 2412.08905
• Published
• 122
An Evolved Universal Transformer Memory
Paper
• 2410.13166
• Published
• 6
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper
• 2412.11768
• Published
• 43
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper
• 2412.09871
• Published
• 108
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Paper
• 2410.20771
• Published
• 3
Memory Layers at Scale
Paper
• 2412.09764
• Published
• 5
DeepSeek-V3 Technical Report
Paper
• 2412.19437
• Published
• 76