Latest Embodied AI Research Papers
The newest Embodied AI papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Embodied AI so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation
Yujie Zang, Yuhang Zheng, Xian Nie, Yupeng Zheng et al. · arXiv · Jun 9, 2026
Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control …
- JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation
Drake Moore, Matt Cheng, Xiang Zhi Tan, Taşkın Padır · arXiv · Jun 9, 2026
Assistive mobility and manipulation platforms have received increasing attention as a means of restoring independence to individuals with disabilities. While effective for many basic activities of daily living (ADLs), a significant percenta…
- EM-Fall: Embodied mmWave Sensing for Day-and-Night Fall Detection on Humanoid Robots
Yanshuo Lu, Yuxuan Hu, Shenghai Yuan, Xinyu Zhou et al. · arXiv · Jun 9, 2026
Falls are one of the leading causes of injury and hospitalization among elderly individuals, making reliable fall awareness an essential capability for safety monitoring in residential environments. However, existing fall detection systems …
- Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection
Batu Candan, Mohammed Atallah, Simone Servadio, Saeed Arabi · arXiv · Jun 9, 2026
Precise state estimation for navigation of autonomous agricultural robots is often compromised by sensor outages (GNSS/LiDAR/Visual) and high-frequency vibrations inherent in off-road environments. This paper proposes a robust navigation al…
- AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning
Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu et al. · arXiv · Jun 9, 2026
Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle …
- Task Robustness via Re-Labelling Vision-Action Robot Data
Artur Kuramshin, Özgür Aslan, Cyrus Neary, Glen Berseth · arXiv · Jun 9, 2026
The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instruction…
- AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation
Tianhao Zang, Siwei Cheng, Haidong Huang, Shanze Wang et al. · arXiv · Jun 9, 2026
Monocular local navigation is attractive for lightweight robots, but existing vision-based policies often couple perception to a specific body, camera height, and footprint, making transfer from wheeled bases to legged platforms dependent o…
- MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation
Yinchen Tian, Huan Li, Muyao Peng, Xi Wang et al. · arXiv · Jun 9, 2026
Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm manipulation, bimanual manipulation is equipped with multiple cameras to capture information from different viewpoints. However, existing multi-v…
- GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation
Liang Wang, Jin Jin, KanZhong Yao, YiBin Wu et al. · arXiv · Jun 9, 2026
Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overh…
- IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation
Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen et al. · arXiv · Jun 9, 2026
Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previou…
- MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models
Hao Shi, Weiye Li, Bin Xie, Yulin Wang et al. · arXiv · Jun 8, 2026
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore strug…
- iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
Zhenyu Wu, Xiuwei Xu, Yukun Zhou, Yifan Li et al. · arXiv · Jun 8, 2026
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint ang…
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu et al. · arXiv · Jun 8, 2026
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world predictio…
- AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning
Hong Li, Yue Xu, Yihan Tang, Yankang Dong et al. · arXiv · Jun 8, 2026
Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To addres…
- Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models
Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi et al. · arXiv · Jun 8, 2026
Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene…
- ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models
Fan Zhang, Seongbin Park, Baharan Mirzasoleiman, Shariar Talebi et al. · arXiv · Jun 8, 2026
Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipulation within their training distribution, yet their generalization capabilities remain fundamentally limited. They lack the rob…
- ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang et al. · arXiv · Jun 8, 2026
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework…
- DexPIE: Stable Dexterous Policy Improvement from Real-World Experience
Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin et al. · arXiv · Jun 8, 2026
Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors durin…
- Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning
Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser · arXiv · Jun 8, 2026
Cooperative object transportation is essential in numerous domains, from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved b…
- CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu et al. · arXiv · Jun 8, 2026
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this …
- Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications
Tao Li, Liang Liu, Jianli Han, Weimin Lv · arXiv · Jun 8, 2026
With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localiz…
- Targeting World Models to Compromise Robot Learning Pipelines
Ethan Rathbun, Ahmed Agha, Saaduddin Mahmud, Christopher Amato et al. · arXiv · Jun 8, 2026
World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into t…
- Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling
Carlos Vélez García, Miguel Cazorla, Jorge Pomares · arXiv · Jun 8, 2026
Hindsight relabeling usually turns achieved future states into exact goals, which can overconstrain offline robot learning when task success depends only on a subset of the state. We propose Goal-Set Hindsight Relabeling (GS-HER), a predica…
- ω-EVA: Envision, Verify, and Act with Latent Interactive World Models
Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll · arXiv · Jun 8, 2026
Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect …
- Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation
Tuba Girgin, Jose Castelblanco, Gabriel Rodriguez, Emre Girgin et al. · arXiv · Jun 5, 2026
The object manipulation capabilities of quadruped robots remain an open research challenge. While previous studies have focused on low-level policy learning, task execution still relies on expert-designed high-level trajectories. Autonomous sel…
- Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping
Kaijie Shi, Wanglong Lu, Huiling Chen, Vinicius Prado da Fonseca et al. · arXiv · Jun 5, 2026
Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising…
- Spline Policy: A Structured Representation for Robot Policies
Mengze Tian, Yiming Li, Sichao Liu, Auke Ijspeert et al. · arXiv · Jun 5, 2026
Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spl…
- RhinoVLA Technical Report
Huixi Intelligence, Chen Zhang, Chenyang Zhou et al. · arXiv · Jun 5, 2026
Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment…
- CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning
Cong Chen, Haowen Wang, Zhixiang Zhang, Pei Ren et al. · arXiv · Jun 5, 2026
Embodied agents need to predict the future consequences of candidate actions in order to plan effectively before execution. Existing visual dynamics models learn by reconstructing future visual states or rolling out dense latent representat…
- Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu et al. · arXiv · Jun 5, 2026
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint pred…