Abstract
Sliding Window Attention Adaptation (SWAA) enables Transformer-based Large Language Models (LLMs) to use sliding window attention without retraining, recovering long-context performance through a combination of adaptation techniques.
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively switching entirely to SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. This raises a question: can FA-pretrained LLMs be effectively adapted to SWA without additional pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT) prompting; and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
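To make recipes (1) and (2) concrete, the minimal sketch below builds a boolean attention mask that combines a causal sliding window with preserved "sink" tokens. The function name, masking convention, and parameters are our own illustrative assumptions, not the paper's implementation (which is available in the linked repository).

```python
import torch

def swa_mask_with_sinks(seq_len: int, window: int, num_sinks: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for causal sliding-window
    attention that additionally keeps the first `num_sinks` tokens visible.

    Illustrative sketch only: such a mask would be applied during prefilling
    (recipe 1), with "sink" tokens preserved (recipe 2).
    """
    q = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions, shape (1, seq_len)
    causal = k <= q                          # no attention to future tokens
    in_window = (q - k) < window             # only the most recent `window` keys
    sinks = k < num_sinks                    # always keep the initial sink tokens
    return causal & (in_window | sinks)

# Example: 8 tokens, window of 4, 1 sink token.
print(swa_mask_with_sinks(8, 4, 1).int())
```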
Community
We propose a set of practical recipes that allow a full-attention LLM to use sliding window attention for more efficient inference. For example, some configurations nearly double long-context inference speed while retaining about 90% of the original accuracy; others accelerate inference by only about 30% but retain nearly 100% of the accuracy.
Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
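Recipe (3) interleaves full-attention and sliding-window layers. The sketch below shows one way such a layer pattern could be expressed; the helper name, the 1-in-4 interleaving ratio, and the per-layer window format are assumptions for illustration, not the repository's actual API.

```python
# Illustrative sketch of FA/SWA layer interleaving (recipe 3): every
# `fa_every`-th decoder layer keeps full attention, the rest use a sliding
# window. Names and the pattern are assumptions, not the paper's configuration.
def layer_window_sizes(num_layers: int, window: int, fa_every: int = 4) -> list:
    """Per-layer attention window; None means the layer keeps full attention."""
    return [None if i % fa_every == 0 else window for i in range(num_layers)]

# Example: a 12-layer model with a 4096-token window on the SWA layers.
print(layer_window_sizes(12, 4096))
# [None, 4096, 4096, 4096, None, 4096, 4096, 4096, None, 4096, 4096, 4096]
```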
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction (2025)
- FlashEVA: Accelerating LLM inference via Efficient Attention (2025)
- Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies (2025)
- InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models (2025)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling (2025)
- Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models (2025)
- Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference (2025)