Latest Long-Context Modeling Research Papers

The newest Long-Context Modeling papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Long-Context Modeling so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
CoRR 2026 · Dec 31, 2026
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficienc…
Prompt Design at Scale: How Format, Instruction Count, and Context Length Shape Instruction Adherence and Hallucination in Large Language Models
Netanel Eliav · arXiv · Jul 21, 2026
Practitioners make three prompt-design decisions with almost no controlled evidence behind them: how to format instructions and context (markdown, plain text, prose, or tabular), how many simultaneous instructions a system prompt can carry …
Extending LLM Context via Associative Recurrent Memory
Gleb Kuzmin, Ivan Rodkin, Aydar Bulatov, Yuri Kuratov et al. · arXiv · Jul 13, 2026
Extending the context length of large language models (LLMs) is critical for many real-world applications, yet standard transformers remain constrained by quadratic compute and linear memory scaling. In this work, we investigate the Associa…
A Sovereign, Open-Source Foundation Model for German and English
The Soofi-Team: Benedikt Droste, David Fitzek et al. · arXiv · Jul 10, 2026
We present Soofi S 30B-A3B, a sovereign, open-source Mixture-of-Experts (MoE) hybrid Mamba Transformer foundation model for German and English. Its hybrid design activates only 3B of 30B parameters per token and keeps the inference cache ne…
Self-Guided Test-Time Training for Long-Context LLMs
Xinyu Zhu, Zhe Xu, Xiaohan Wei, Yunchen Pu et al. · arXiv · Jul 10, 2026
Long-context processing has become increasingly important for large language models (LLMs), but simply extending the context window does not guarantee effective utilization of long inputs. As input length grows, accuracy often degrades, ind…
WILDTRACE: Benchmarking Natural Evidence Trails in Long-Context Reasoning
Zixin Chen, Peng Liu, Haobo Li, Rui Sheng et al. · arXiv · Jul 10, 2026
Answering complex questions over long documents frequently requires integrating evidence that the source itself disperses naturally across distant passages. In an incident report, the operating condition, design flaw, and missed safety chec…
Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents
Yifan Wu, Lizhu Zhang, Yuhang Zhou, Mingyi Wang et al. · arXiv · Jul 9, 2026
In long-horizon tasks, decision-relevant state is often scattered across an expanding trajectory, while the action agent must surface it and act. As trajectories grow, task requirements, environment facts, prior attempts, diagnoses, and ope…
LongCrafter: Towards Diverse Long-Context Understanding via Evidence-Graph-Guided Instruction Synthesis
Chenhao Yuan, Yinhao Xu, Shuwen Xu, Xizhi Yang et al. · arXiv · Jul 7, 2026
Synthesizing long-context supervised fine-tuning (SFT) data is a scalable way to enhance the long-context understanding of large language models (LLMs), yet existing approaches share three limitations: narrow task coverage, insufficient ins…
Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent
Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui et al. · arXiv · Jun 29, 2026
We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and…
Morphing into Hybrid Attention Models
Disen Lan, Jianbin Zheng, Yuxi Ren, Xin Xia et al. · arXiv · Jun 29, 2026
Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically …
Less is More: Quality-Aware Training Data Selection for Scientific Summarization
Maria Nefeli Paraskevopoulou, Tatiana Passali, Grigorios Tsoumakas · arXiv · Jun 23, 2026
Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific summariz…
Self-Compacting Language Model Agents
Tianjian Li, Jingyu Zhang, William Jurayj, Xi Wang et al. · arXiv · Jun 22, 2026
Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate it with fixed-interval compaction triggered a…
TriggerBench: Investigating Prospective Memory for Large Language Models
Tianhua Zhang, Xinjiang Wang, Qianxi Zhang, Qi Chen et al. · arXiv · Jun 22, 2026
While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously re…
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
Aueaphum Aueawatthanaphisut · arXiv · Jun 18, 2026
Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generat…
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu et al. · arXiv · Jun 18, 2026
The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent…
Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction
Jingyi Zhou, Senlin Luo, Haofan Chen · arXiv · Jun 17, 2026
Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modelin…
Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions
Hui Zhang, Shuren Song · arXiv · Jun 17, 2026
The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in S…
KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing
Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang et al. · arXiv · Jun 15, 2026
Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally i…
GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge
Pavan C Shekar, Abhishek H S, Aswanth Krishnan · arXiv · Jun 12, 2026
Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code…
Uncertainty-Aware Hybrid Retrieval for Long-Document RAG
Hoin Jung, Xiaoqian Wang · arXiv · Jun 11, 2026
Retrieval-augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer-bearing evidence and worsen…
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu et al. · arXiv · Jun 10, 2026
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and pred…
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li et al. · arXiv · Jun 9, 2026
Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet an…
End-to-End Context Compression at Scale
Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra et al. · arXiv · Jun 8, 2026
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time …
Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection
Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li et al. · arXiv · May 28, 2026
Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual informati…
Separating Semantic Competition from Context Length in RAG Reading
Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar et al. · arXiv · May 26, 2026
Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. T…
Language Models Need Sleep
Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti · arXiv · May 25, 2026
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model perio…
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gheorghe Comanici, E. Bieber, Mike Schaekermann, Ice Pasupat et al. · arXiv.org · Jul 7, 2025
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on fronti…
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Hongli Yu, Ting Chen, Jiangtao Feng, Jiangjie Chen et al. · arXiv.org · Jul 2, 2025
Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text…
BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Thomas Sounack, Joshua Davis, B. Durieux, Antoine Chaffin et al. · arXiv.org · Jun 12, 2025
Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text…
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang et al. · Neural Information Processing Systems · Apr 21, 2025
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist…
