Latest Spatial AI & SLAM Research Papers
The newest Spatial AI & SLAM papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Spatial AI & SLAM so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning · Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu et al. · arXiv · Jun 9, 2026
Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle …
- Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments · Mason Peterson, Qingyuan Li, Yixuan Jia, Fernando Cladera et al. · arXiv · Jun 4, 2026
Successful robot automation requires accurate global localization to support repeatability, task planning, goal specification, and safe operation. However, reliable localization in GNSS-denied environments remains an open problem. Overhead …
- RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning · Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki · arXiv · Jun 4, 2026
Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is th…
- Breaking Time: A Fully Gaussian Framework for Distributed and Continuous-Time SLAM · Davide Ceriola, Simone Ferrari, Luca Di Giammarino, Leonardo Brizi et al. · arXiv · Jun 4, 2026
Continuous-time SLAM provides a principled framework for fusing heterogeneous sensors while estimating smooth trajectories, and is particularly well-suited for handling heterogeneous, asynchronous sensor streams with non-uniform readout pat…
- DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding · Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li · arXiv · May 28, 2026
Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross…
- Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments · Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin et al. · arXiv · May 18, 2026
We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly e…
- StableVLA: Towards Robust Vision-Language-Action Models without Extra Data · Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng et al. · arXiv · May 18, 2026
It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, pa…
- RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots · Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini · arXiv · May 18, 2026
Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB came…
- Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN · Ziyi Xia, Chaoran Xiong, Litao Wei, Xinhao Hu et al. · arXiv · May 14, 2026
Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models…
- X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction · Kai Xiong, Hongjie Fang, Lixin Yang, Cewu Lu · arXiv · May 12, 2026
Effectively handling the interplay between spatial perception and action generation remains a critical bottleneck in robotic manipulation. Existing methods typically treat spatial perception and action execution as decoupled or strictly uni…
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation · Junjin Xiao, Dongyang Li, Yandan Yang, Shuang Zeng et al. · arXiv · May 12, 2026
This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views…
- MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction · Zhihao Cao, Qi Shao, Shuhao Zhai, Jing Zhang et al. · arXiv · May 11, 2026
Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large-scale scene capture for virtual production and cooperative multi-robot exploration. While recent 3D Gaussian Splatting (3DGS) SLAM algorithms can genera…
- OpenSGA: Efficient 3D Scene Graph Alignment in the Open World · Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora · arXiv · May 11, 2026
Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a pla…
- AERO-VIS: Asynchronous Event-based Real-time Onboard Visual-Inertial SLAM · Yannick Burkhardt, Sebastián Barbas Laina, Simon Boche, Leonard Freißmuth et al. · arXiv · May 8, 2026
The robustness of event cameras to high dynamic range and motion blur holds the potential to improve visual odometry systems in challenging environments. Although their high temporal resolution does not require synchronous processing, most …
- Dr-PoGO: Direct Radar Pose-Graph Optimization · Cedric Le Gentil, Weican Li, Leonardo Brizi, Timothy D. Barfoot · arXiv · May 6, 2026
This paper introduces Dr-PoGO, a method for Simultaneous Localization And Mapping (SLAM) using a 2D spinning radar. Unlike cameras or lidars that require line-of-sight, millimetre-wave radars can 'see' through dust, falling snow, rain, etc.…
- Robust Visual SLAM for UAV Navigation in GPS-Denied and Degraded Environments: A Multi-Paradigm Evaluation and Deployment Study · Prasoon Kumar, Akshay Deepak, Sandeep Kumar · arXiv · May 5, 2026
Reliable localization in GPS-denied, visually degraded environments is critical for autonomous UAV operations. This paper presents a systematic comparative evaluation of five V-SLAM systems: ORB-SLAM3, DPVO, DROID-SLAM, DUSt3R, and MASt3R …
- RLDX-1 Technical Report · Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang et al. · arXiv · May 5, 2026
While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited fr…
- DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation · Danil Tokhchukov, Veronika Morozova, Gonzalo Ferrer · arXiv · May 4, 2026
Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this…
- Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation · Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang et al. · arXiv · May 4, 2026
Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited envi…
- MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation · Xianbo Cai, Hideyuki Ichiwara, Masaki Yoshikawa, Tetsuya Ogata · arXiv · May 1, 2026
Real-world fine manipulation, particularly in bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drif…
- Learning-Based Hierarchical Scene Graph Matching for Robot Localization Leveraging Prior Maps · Nimrod Millenium Ndulue, Jose Andres Millan-Romera, Matteo Giorgi, Holger Voos et al. · arXiv · Apr 30, 2026
Accurate localization is a fundamental requirement for autonomous robots operating in indoor environments. Scene graphs encode the spatial structure of an environment as a hierarchy of semantic entities and their relationships, and can be c…
- Robust Graph Matching through Semantic Relationship Generation for SLAM · David Perez-Saura, Jose Andres Millan-Romera, Miguel Fernandez-Cortizas, Holger Voos et al. · arXiv · Apr 28, 2026
Graph-based representations such as Scene Graphs enable localization in structured indoor environments by matching a locally observed graph, constructed from sensor data, to a prior map. This process is particularly challenging in environme…
- COMPASS: COmpact Multi-channel Prior-map And Scene Signature for Floor-Plan-Based Visual Localization · Muhammad Shaheer, Miguel Fernandez-Cortizas, Asier Bikandi-Noya, Holger Voos et al. · arXiv · Apr 28, 2026
Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we pres…
- Passage-Aware Structural Mapping for RGB-D Visual SLAM · Ali Tourani, Miguel Fernandez-Cortizas, Saad Ejaz, David Pérez Saura et al. · arXiv · Apr 27, 2026
Doorways and passages are critical structural elements for indoor robot navigation, yet they remain underexplored in modern Visual SLAM (VSLAM) frameworks. This paper presents a passage-aware structural mapping approach for RGB-D VSLAM that…
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation · Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu et al. · arXiv · Apr 27, 2026
Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vis…
- SLAM as a Stochastic Control Problem with Partial Information: Optimal Solutions and Rigorous Approximations · Ilir Gusija, Fady Alajaji, Serdar Yüksel · arXiv · Apr 23, 2026
Simultaneous localization and mapping (SLAM) is a foundational state estimation problem in robotics in which a robot accurately constructs a map of its environment while also localizing itself within this construction. We study the active S…
- Driving Scene Understanding: How much temporal context and spatial resolution is necessary? · Ramashish Gaurav, Bryan P. Tripp, Apurva Narayan · Canadian AI 2021 · Jan 1, 2021
Driving Scene Understanding is a broad field which addresses the problem of recognizing a variety of on-road situations; namely driver behaviour/intention recognition, driver-action causal reasoning, pedestrians’ and nearby vehicles’ intent…