Latest Long-Context Modeling Research Papers
The newest Long-Context Modeling papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Long-Context Modeling so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing · CoRR 2026 · Dec 31, 2026
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficienc…
- Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It · Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li et al. · arXiv · Jun 9, 2026
Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet an…
- End-to-End Context Compression at Scale · Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra et al. · arXiv · Jun 8, 2026
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time …
- Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection · Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li et al. · arXiv · May 28, 2026
Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual informati…
- Separating Semantic Competition from Context Length in RAG Reading · Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar et al. · arXiv · May 26, 2026
Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. T…
- Language Models Need Sleep · Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti · arXiv · May 25, 2026
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model perio…
- Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching · Abdalrahman Wael · arXiv · May 13, 2026
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense bas…
- KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference · Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez · arXiv · May 12, 2026
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the a…
- The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents · Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo et al. · arXiv · May 8, 2026
Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades…
- Long Context Pre-Training with Lighthouse Attention · Bowen Peng, Subho Ghosh, Jeffrey Quesnelle · arXiv · May 7, 2026
Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hie…
- The Impossibility Triangle of Long-Context Modeling · Yan Zhou · arXiv · May 6, 2026
We identify and prove a fundamental trade-off governing long-sequence models: no model can simultaneously achieve (i) per-step computation independent of sequence length (Efficiency), (ii) state size independent of sequence length (Compactn…
- Safety and accuracy follow different scaling laws in clinical large language models · Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia et al. · arXiv · May 5, 2026
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, …
- Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory · Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia et al. · arXiv · May 1, 2026
Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand-cra…
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models · Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling · Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari et al. · arXiv · Apr 27, 2026
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Tran…
- Skill Retrieval Augmentation for Agentic AI · Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang et al. · arXiv · Apr 27, 2026
As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incor…
- BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering · Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne · arXiv · Apr 24, 2026
A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribut…
- QuantClaw: Precision Where It Matters for OpenClaw · Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu et al. · arXiv · Apr 24, 2026
Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While …