Latest Video Understanding Research Papers
The newest Video Understanding papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Video Understanding so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- Next Forcing: Causal World Modeling with Multi-Chunk Prediction
Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu et al. · arXiv · Jun 9, 2026
Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the t…
- AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference
Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu et al. · arXiv · Jun 9, 2026
Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxilia…
- Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization
Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee et al. · arXiv · Jun 9, 2026
Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, t…
- WorldOlympiad: Can Your World Model Survive a Triathlon?
Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang et al. · arXiv · Jun 9, 2026
We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or s…
- Latent Spatial Memory for Video World Models
Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen et al. · arXiv · Jun 8, 2026
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE enco…
- MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models
Hao Shi, Weiye Li, Bin Xie, Yulin Wang et al. · arXiv · Jun 8, 2026
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore strug…
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu et al. · arXiv · Jun 8, 2026
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world predictio…
- Echo-Memory: A Controlled Study of Memory in Action World Models
Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang et al. · arXiv · Jun 8, 2026
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure i…
- SemDINO: A DINOv3-Driven Network for Cross-Temporal Semantic Alignment in Change Detection
Xinyu Tong, Meihua Zhou, Jinxiao Sun, Yingjie Tang et al. · arXiv · Jun 8, 2026
Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify semantic categories before and after transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale rep…
- Hybrid Robustness Verification for Spatio-Temporal Neural Networks
Sherwin Varghese, Matthew Wicker, Alessio Lomuscio · arXiv · Jun 8, 2026
With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive co…
- GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development
Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan et al. · arXiv · Jun 8, 2026
Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoi…
- Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra et al. · arXiv · Jun 8, 2026
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhy…
- MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding
Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang et al. · arXiv · Jun 8, 2026
The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridg…
- MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang et al. · arXiv · Jun 5, 2026
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and …
- Streaming Video Generation with Streaming Force Control
Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv et al. · arXiv · Jun 5, 2026
We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, …
- Planning-aligned Token Compression for Long-Context Autonomous Driving
Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus et al. · arXiv · Jun 5, 2026
Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for comple…
- Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao et al. · arXiv · Jun 5, 2026
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse e…
- Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation
Danial Hamdi, Fardin Ayar, Mahdi Javanmardi · arXiv · Jun 5, 2026
In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates ide…
- Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos
Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen et al. · arXiv · Jun 5, 2026
Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. Th…
- Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition
Xucheng Shen, Kun Li, Fei Wang, Wei Qian et al. · arXiv · Jun 5, 2026
Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal…
- Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen et al. · arXiv · Jun 4, 2026
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consi…
- StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset
Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang et al. · IJCV · Jun 4, 2026
Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. T…
- RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling
Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang et al. · arXiv · Jun 4, 2026
Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Exi…
- Towards One-to-Many Temporal Grounding
Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng et al. · arXiv · Jun 4, 2026
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments f…
- RoboDream: Compositional World Models for Scalable Robot Data Synthesis
Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li et al. · arXiv · Jun 1, 2026
Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling…
- From Zero to Hero: Training-Free Custom Concept Spawning in World Models
Kiymet Akdemir, Pinar Yanardag · arXiv · Jun 1, 2026
Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or …
- AdaCodec: A Predictive Visual Code for Video MLLMs
Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si et al. · arXiv · Jun 1, 2026
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visu…
- Policy-based Foveated Imaging and Perception
Howard Xiao, Jan Ackermann, Boyang Deng, Gordon Wetzstein · arXiv · Jun 1, 2026
Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, lat…
- VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao et al. · arXiv · Jun 1, 2026
The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle …
- LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
Qixin Hu, Shuai Yang, Wei Huang, Song Han et al. · arXiv · Jun 1, 2026
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during gen…