Latest Artificial Intelligence Research Papers
The newest Artificial Intelligence papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Artificial Intelligence so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An et al. · arXiv · Jun 9, 2026
Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot targe…
- EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
Weixian Xu, Shilong Liu, Mengdi Wang · arXiv · Jun 9, 2026
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings…
- The Role of Feedback Alignment in Self-Distillation
Semih Kara, Oğuzhan Ersoy · arXiv · Jun 9, 2026
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by…
- Piper: A Programmable Distributed Training System
Megan Frisella, Shubham Tiwari, Andy Ruan, Yi Pan et al. · arXiv · Jun 9, 2026
Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretrain…
- Flaws in the LLM Automation Narrative
George Perrett, Javae Elliott, Jennifer Hill, Marc Scott · arXiv · Jun 9, 2026
Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance acro…
- ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei et al. · arXiv · Jun 9, 2026
Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, …
- ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity
Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman et al. · arXiv · Jun 9, 2026
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previo…
- Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques
Guido Di Federico, Wenchao Teng, Louis J. Durlofsky · arXiv · Jun 9, 2026
Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional …
- Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth · arXiv · Jun 9, 2026
Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced e…
- Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football
Andrew Kang, Priya Narasimhan · arXiv · Jun 9, 2026
We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent t…
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA
Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano · arXiv · Jun 9, 2026
Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the pro…
- Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News
Pooja Prajod · arXiv · Jun 9, 2026
As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures speci…
- FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes et al. · arXiv · Jun 9, 2026
A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segment…
- PhantomBench: Benchmarking the Non-existential Threat of Language Models
Haeji Jung, Hila Gonen · arXiv · Jun 9, 2026
Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavio…
- RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning
Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang et al. · arXiv · Jun 9, 2026
Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fi…
- Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning
Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li et al. · arXiv · Jun 9, 2026
Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imita…
- Unifying Local Communications and Local Updates for LLM Pretraining
Pietro Cagnasso, Eugene Belilovsky, Edouard Oyallon · arXiv · Jun 9, 2026
Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data centers, and lower-bandwidth links. Many practical methods reduce communication frequency but still rely o…
- A History-Aware Visually Grounded Critic for Computer Use Agents
Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen et al. · arXiv · Jun 9, 2026
Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, exi…
- Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du et al. · arXiv · Jun 9, 2026
With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality con…
- T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao et al. · arXiv · Jun 9, 2026
Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and…
- OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang et al. · arXiv · Jun 8, 2026
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lac…
- An Agency-Transferring Model-Free Policy Enhancement Technique
Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko · arXiv · Jun 8, 2026
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal polic…
- PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws
Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins et al. · arXiv · Jun 8, 2026
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structu…
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu et al. · arXiv · Jun 8, 2026
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world predictio…
- Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy et al. · arXiv · Jun 8, 2026
AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a…
- Topological Neural Operators
Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal · arXiv · Jun 8, 2026
We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features…
- Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts
Udvas Das, Waris Radji, Debabrota Basu, Odalric-Ambrym Maillard · arXiv · Jun 8, 2026
We consider a variant of the linear contextual stochastic multi-armed bandits, where the learner must provide recommendations to a group of users, each having its personalized preference vector, and in the presence of context distributions …
- Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution
Yifan Wang · arXiv · Jun 8, 2026
Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still have learned nothing about safety: the filt…
- SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation
Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang et al. · arXiv · Jun 8, 2026
Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool int…