Latest Agentic AI & LLM Agents Research Papers
The newest Agentic AI & LLM Agents papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Agentic AI & LLM Agents so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.
Recent papers
- EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
Weixian Xu, Shilong Liu, Mengdi Wang · arXiv · Jun 9, 2026
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings…
- ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity
Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman et al. · arXiv · Jun 9, 2026
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previo…
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao et al. · arXiv · Jun 9, 2026
Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and…
- SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research
Pu Ning, Quan Chen, Kun Tao, Xinyu Tang et al. · arXiv · Jun 8, 2026
Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main …
- Observability for Delegated Execution in Agentic AI Systems
Abhinav Mishra, Kumar Sharad · arXiv · Jun 8, 2026
Delegation-scoped execution is not identifiable from standard observables: audit logs and execution traces can be identical under multiple incompatible delegation assignments. This gap is especially acute in LLM-based agentic systems, where…
- MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang et al. · arXiv · Jun 5, 2026
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and …
- Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu et al. · arXiv · Jun 5, 2026
As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution f…
- HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou et al. · arXiv · Jun 4, 2026
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial re…
- Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement
Jui-Hui Chung, Ziyang Cai, Zihao Li, Qishuo Yin et al. · arXiv · Jun 4, 2026
We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First,…
- Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
Thamilvendhan Munirathinam · arXiv · Jun 4, 2026
As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it h…
- Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens et al. · arXiv · Jun 4, 2026
LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessio…
- Bridging the Last Mile of Time Series Forecasting with LLM Agents
Yuhua Liao, Zetian Wang, Qiangqiang Nie, Zhenhua Zhang · arXiv · Jun 1, 2026
Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on numerical extrapolation. However, in real-world forecasting settings, a statistically plausible basel…
- Monitoring Agentic Systems Before They're Reliable
Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens et al. · arXiv · Jun 1, 2026
Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: …
- Iteris: Agentic Research Loops for Computational Mathematics
Leheng Chen, Zihao Liu, Wanyi He, Bin Dong · arXiv · Jun 1, 2026
Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational m…
- Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools
Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler · arXiv · Jun 1, 2026
Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the ca…
- MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang et al. · arXiv · Jun 1, 2026
The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms.…
- Beyond One-shot: AI Agents for Learning in Field Experiments
Junjie Luo, Ritu Agarwal, Gordon Gao · arXiv · Jun 1, 2026
Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experime…
- Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
Anany Kotawala · arXiv · May 28, 2026
Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this local…
- Gram: Assessing sabotage propensities via automated alignment auditing
David Lindner, Victoria Krakovna, Sebastian Farquhar · arXiv · May 28, 2026
We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini m…
- Calibrating Conservatism for Scalable Oversight
William Overman, Mohsen Bayati · arXiv · May 27, 2026
Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches…
- Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval
Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy · arXiv · May 27, 2026
In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) f…
- From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
Shangding Gu · arXiv · May 25, 2026
This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the…
- CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai et al. · arXiv · May 25, 2026
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is …
- The Modality Paradox in Autonomous LLM Engineering: Asymmetric Agent Loops and Mathematical Halting.
Manpreet Chadha · PubMed · May 24, 2026
BACKGROUND: Long-acting sandostatin (S-LAR; octreotide acetate) is well tolerated and effective for symptom control and possibly disease control in gastroenteropancreatic neuroendocrine tumors (GEP-NETs). We undertook a retrospective analys…
- Tool-Entropy Collapse: A Cross-Architecture Signature of Agent WANDERING Failure
Caio Vicentino · Open MIND · May 24, 2026
We identify a 34% blind spot in probe-based LLM agent failure monitoring on Qwen3.6-27B SWE-bench Pro: the WANDERING sub-class where probe says success but agent never emits finish_tool. We test six detector designs across three signal chan…
- MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Wei Xue et al. · arXiv · May 21, 2026
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all …
- Code as Agent Harness
Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei et al. · arXiv · May 18, 2026
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a tar…
- SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang et al. · arXiv · May 18, 2026
As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and document…
- Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
S. Bensalem, Y. Dong, M. Franzle, X. Huang et al. · arXiv · May 18, 2026
This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents -- a structural consequence of how agent execution works, not a con…