kaizuberbuehler's Collection: LM Architectures
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length (arXiv:2404.08801, 66 upvotes)
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (arXiv:2404.07839, 48 upvotes)
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence (arXiv:2404.05892, 40 upvotes)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752, 150 upvotes)
Multi-Head Mixture-of-Experts (arXiv:2404.15045, 60 upvotes)
Jamba: A Hybrid Transformer-Mamba Language Model (arXiv:2403.19887, 112 upvotes)
KAN: Kolmogorov-Arnold Networks (arXiv:2404.19756, 116 upvotes)
Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737, 81 upvotes)
Contextual Position Encoding: Learning to Count What's Important (arXiv:2405.18719, 5 upvotes)
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (arXiv:2405.21060, 68 upvotes)
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415, 51 upvotes)
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models (arXiv:2406.09416, 29 upvotes)
Transformers meet Neural Algorithmic Reasoners (arXiv:2406.09308, 44 upvotes)
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling (arXiv:2406.07522, 40 upvotes)
Explore the Limits of Omni-modal Pretraining at Scale (arXiv:2406.09412, 11 upvotes)
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B (arXiv:2406.07394, 29 upvotes)
VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816, 26 upvotes)
Mixture of A Million Experts (arXiv:2407.04153, 5 upvotes)
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (arXiv:2407.12854, 31 upvotes)
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (arXiv:2407.21770, 22 upvotes)
Transformer Explainer: Interactive Learning of Text-Generative Models (arXiv:2408.04619, 175 upvotes)
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale (arXiv:2408.12570, 32 upvotes)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (arXiv:2408.12528, 51 upvotes)
LLMs + Persona-Plug = Personalized LLMs (arXiv:2409.11901, 35 upvotes)
MonoFormer: One Transformer for Both Diffusion and Autoregression (arXiv:2409.16280, 18 upvotes)
[title missing] (arXiv:2410.05258, 180 upvotes)
Byte Latent Transformer: Patches Scale Better Than Tokens (arXiv:2412.09871, 108 upvotes)
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows (arXiv:2412.01169, 13 upvotes)
Monet: Mixture of Monosemantic Experts for Transformers (arXiv:2412.04139, 13 upvotes)
MH-MoE: Multi-Head Mixture-of-Experts (arXiv:2411.16205, 26 upvotes)
Hymba: A Hybrid-head Architecture for Small Language Models (arXiv:2411.13676, 47 upvotes)
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration (arXiv:2411.10958, 57 upvotes)
BitNet a4.8: 4-bit Activations for 1-bit LLMs (arXiv:2411.04965, 69 upvotes)
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models (arXiv:2411.04996, 50 upvotes)
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (arXiv:2501.04519, 288 upvotes)
MiniMax-01: Scaling Foundation Models with Lightning Attention (arXiv:2501.08313, 300 upvotes)
Tensor Product Attention Is All You Need (arXiv:2501.06425, 90 upvotes)
Transformer^2: Self-adaptive LLMs (arXiv:2501.06252, 55 upvotes)
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation (arXiv:2501.09755, 35 upvotes)
FAST: Efficient Action Tokenization for Vision-Language-Action Models (arXiv:2501.09747, 28 upvotes)
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback (arXiv:2501.12895, 61 upvotes)
Autonomy-of-Experts Models (arXiv:2501.13074, 44 upvotes)
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models (arXiv:2501.12370, 11 upvotes)
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (arXiv:2501.16975, 32 upvotes)
PixelWorld: Towards Perceiving Everything as Pixels (arXiv:2501.19339, 17 upvotes)
Scaling Embedding Layers in Language Models (arXiv:2502.01637, 24 upvotes)
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration (arXiv:2502.01068, 18 upvotes)
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (arXiv:2502.05171, 152 upvotes)
CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference (arXiv:2502.04416, 12 upvotes)
The Curse of Depth in Large Language Models (arXiv:2502.05795, 40 upvotes)
[title missing] (arXiv:2502.06049, 31 upvotes)
TransMLA: Multi-head Latent Attention Is All You Need (arXiv:2502.07864, 57 upvotes)
LLM Pretraining with Continuous Concepts (arXiv:2502.08524, 30 upvotes)
Large Language Diffusion Models (arXiv:2502.09992, 126 upvotes)
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (arXiv:2502.11089, 167 upvotes)
Continuous Diffusion Model for Language Modeling (arXiv:2502.11564, 53 upvotes)
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation (arXiv:2502.13145, 38 upvotes)
MoBA: Mixture of Block Attention for Long-Context LLMs (arXiv:2502.13189, 17 upvotes)
Token-Efficient Long Video Understanding for Multimodal LLMs (arXiv:2503.04130, 96 upvotes)
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (arXiv:2503.01743, 89 upvotes)
Transformers without Normalization (arXiv:2503.10622, 170 upvotes)
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models (arXiv:2503.09573, 75 upvotes)
Forgetting Transformer: Softmax Attention with a Forget Gate (arXiv:2503.02130, 32 upvotes)
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (arXiv:2503.08686, 19 upvotes)
RWKV-7 "Goose" with Expressive Dynamic State Evolution (arXiv:2503.14456, 153 upvotes)
Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models (arXiv:2503.11224, 28 upvotes)
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models (arXiv:2503.16257, 27 upvotes)
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers (arXiv:2503.11579, 21 upvotes)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders (arXiv:2503.18878, 119 upvotes)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (arXiv:2503.19757, 51 upvotes)
FFN Fusion: Rethinking Sequential Computation in Large Language Models (arXiv:2503.18908, 19 upvotes)
[title missing] (arXiv:2504.00927, 56 upvotes)
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers (arXiv:2504.00502, 26 upvotes)
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing (arXiv:2504.07964, 62 upvotes)
Scaling Laws for Native Multimodal Models (arXiv:2504.07951, 30 upvotes)
TransMamba: Flexibly Switching between Transformer and Mamba (arXiv:2503.24067, 21 upvotes)
BitNet b1.58 2B4T Technical Report (arXiv:2504.12285, 83 upvotes)
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models (arXiv:2504.10449, 15 upvotes)
EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models (arXiv:2504.15133, 26 upvotes)