Latest Video Understanding Research Papers

The newest Video Understanding papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Video Understanding so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.

Recent papers

A Framework for Understanding Moving Image Music Concerts: Repertoire, Reception and Recontextualisation
Elizabeth Hunt · Open MIND · Jan 1, 2029
This thesis investigates the phenomenon of moving image music concerts, focussing on the live orchestral presentation of music from film, television, and video games. With orchestras increasingly turning to such concerts to attract new audi…
3D-Aware VLMs with Implicit and Explicit Geometries
Wenhao Li, Xueying Jiang, Quanhao Qian, Deli Zhao et al. · arXiv · Jul 23, 2026
Despite rapid progress, most existing vision-language models (VLMs) built from 2D visual inputs often struggle when handling various 3D tasks that require fine-grained spatial understanding and reasoning. To bridge this gap, we present VLM-…
Streaming Multi-Agent Autoregressive Diffusion Model with World State Registers
Sicheng Mo, Yuheng Li, Ziyang Leng, Krishna Kumar Singh et al. · arXiv · Jul 23, 2026
Multi-agent interactive world models should not only generate consistent observations, but also maintain world states that persist across agents and evolve across views. Existing autoregressive video diffusion pipelines carry forward observ…
Unified Video Dense Prediction from Disjoint Data
Yihong Sun, Seoung Wug Oh, Jiahui Huang, Bharath Hariharan et al. · arXiv · Jul 23, 2026
Scene understanding requires simultaneous prediction about geometry, appearance, and semantics. However, existing task-specific annotations are fragmented across incompatible, domain-specific datasets. Current unified systems circumvent thi…
GraphVid: Interactive Graph-Controllable Video Generation
Vedant Shah, Onkar Susladkar, Tushar Prakash, Kiet Nguyen et al. · arXiv · Jul 23, 2026
Controllable video generation remains challenging due to the difficulty of specifying precise multi-object interactions using text prompts or motion-control inputs that primarily constrain pixel movement. In practice, trajectory-based contr…
Self-Supervised Learning of Structured Dynamics from Videos
Lukas Knobel, Andrew Zisserman, Yuki M. Asano · arXiv · Jul 23, 2026
Understanding motion in video is a fundamental challenge for visual learning, as frame-to-frame change entangles two sources of dynamics: camera motion and object motion. This decomposition has remained underexplored in representation learn…
SANA-Video 2.0: Hybrid Linear Attention with Attention Residuals for Efficient Video Generation
Junsong Chen, Jincheng Yu, Yitong Li, Shuchen Xue et al. · arXiv · Jul 23, 2026
We introduce SANA-Video 2.0, a hybrid video diffusion transformer instantiated at 5B and 14B scales under a unified architecture. Designed to generate high-quality video up to 720p on a single GPU, SANA-Video 2.0 matches full-softmax video …
ElasticTTT: Prior-Preserving Test-Time Tuning for Video Editing
Yueyi Liu, Chi Zhang, Sen Cui, Miao Liu · arXiv · Jul 23, 2026
Test-Time Tuning (TTT) on pretrained diffusion models has emerged as a powerful paradigm for video editing. However, there exists a foundational mismatch between the distribution-mapping nature of generative models and the single-point opti…
Texture++: Elevating 3D Asset Texture Resolution with a Region-Aware Diffusion Model
Shuaiwei Wang, Shi Li, Jieting Xu, Yuchi Huo et al. · arXiv · Jul 23, 2026
Numerous 3D assets are discarded due to low texture resolution, while current super-resolution models ignore texture maps and focus on natural images. An efficient and generalizable texture super-resolution model can revitalize a large corp…
Adaptive Identity Anchoring: Closed-Loop Keyframe Placement for Synthetic Paired Supervision in Video Face Swapping
Logan Robbins · arXiv · Jul 23, 2026
Video face swapping has no natural paired supervision: no real footage exists of one person's face performing another person's video. The strongest current answer, DreamID-V's SyncID-Pipe, mints pairs by replacing the identity in exactly tw…
PercepCap: Video Captioner with Structured Spatio-Temporal Perception
Yifan Xu, Zihao Wang, Zhixiao Wang, Jiaming Zhang et al. · arXiv · Jul 22, 2026
Video captioning requires fine-grained spatio-temporal understanding of videos, including spatial perception of where objects are located and temporal perception of when events occur. Existing MLLMs usually generate captions directly from v…
Self Gradient Forcing: Native Long Video Extrapolation
Junhao Zhuang, Shiyi Zhang, Yuxuan Bian, Yaowei Li et al. · arXiv · Jul 22, 2026
Recent autoregressive video diffusion methods are increasingly built upon Self Forcing, where the student is trained on histories produced by its own rollout rather than ground-truth video contexts. This reduces exposure bias, but the histo…
Vera: Identity-Faithful Human Subject-to-Video Generation
Yulong Xu, Xinyue Liu, Shujuan Li, Huafeng Shi et al. · arXiv · Jul 22, 2026
Subject-to-video (S2V) generation has made substantial progress in preserving reference subjects across diverse categories, yet generic subject consistency remains insufficient for human-centric generation. A video may appear globally consi…
StreamHOI: Interaction-aware Temporal Memory Adaptation for Streaming HOI Video Generation
Zejing Rao, Haoxian Zhang, Xiaoqiang Liu, Yiping Meng et al. · arXiv · Jul 22, 2026
Existing human-object interaction (HOI) video generation methods are largely limited to offline short-video generation with complex driving conditions, making them unsuitable for real-time interactive applications. We present StreamH…
HeadCast: Casting Attention Heads for Efficient Autoregressive Video Generation
Jinliang Shen, Lianghao Su, Zheming Li, Kang He et al. · arXiv · Jul 22, 2026
Autoregressive (AR) video diffusion models have become a promising paradigm for long and streaming video synthesis, but the continuously growing Key-Value (KV) cache makes attention the dominant inference cost, especially at high resolution…
Masked Visual Actions for Unified World Modeling
Hadi Alzayer, Wenlong Huang, Haonan Chen, Christopher Luey et al. · arXiv · Jul 21, 2026
Video models absorb rich priors over how the visual world moves, interacts, and responds to contact, making them promising substrates for robotic world modeling. The central challenge is how to communicate action to such models in a form al…
OmniReasoner: Thinking with Long Audio-Video via Native Tool Use
Yu Chen, Caorui Li, Ziyu Xiong, Yidong Wang et al. · arXiv · Jul 21, 2026
Long audio-video reasoning is difficult for omnimodal LLMs because the decisive evidence is often sparse, cross-modal, and too expensive to preserve with uniformly high-fidelity inputs. We introduce OmniReasoner, a tool-use post-training fr…
InstructMixup: Instruction-Guided Salient Patch Editing for Robust Data Augmentation
Khawar Islam, Arif Mahmood, Xin Jin, Naveed Akhtar · arXiv · Jul 21, 2026
In image and video technologies, data augmentation is widely used to improve the generalization of deep visual models, and mixup-based strategies that interpolate between samples have become the dominant approach. However, computing informa…
ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU
Fan Jiang, Zhaoxu Sun, Mengchao Wang, Ziyu Zhu et al. · arXiv · Jul 21, 2026
We present ABot-World-0, an action-conditioned video world model for real-time, long-horizon closed-loop interaction, supported by a multi-source data infrastructure spanning AAA games, simulation engines, and internet videos to learn contr…
FlexiAvatar: Unified 3D Gaussian Human Avatars Under Arbitrary Body Visibility
Yihalem Yimolal Tiruneh, Muhammad Salman Ali, Uyoung Jeong, Muneeb A. Khan et al. · arXiv · Jul 21, 2026
Reconstructing animatable 3D human avatars from monocular video is a fundamental problem in computer vision with broad applications in AR/VR and digital content creation. Existing approaches typically couple parametric body models with neur…
Wavefront Parallelization for Efficient Learned Image Compression
Shimon Murai, Fangzheng Lin, Kasidis Arunruangsirilert, Jiro Katto · arXiv · Jul 21, 2026
Autoregressive context models are foundational for learned image compression, but they suffer from slow serial inference. Existing acceleration methods such as checkerboard context require architectural changes and retraining, thus are inapp…
Context-structured Video Anomaly Detection with Large Vision-Language Models
Dongjun Kim, Changjae Oh, Andrea Cavallaro, Jeonghoon Mo · arXiv · Jul 21, 2026
Training video anomaly detectors is challenging due to the difficulty and cost of annotating diverse and rare abnormal events. Although recent large vision-language models enable training-free inference, existing approaches mostly rely on h…
Hierarchical Denoising For Multi-Step Visual Reasoning
Zezhong Qian, Xiaowei Chi, Chak-Wing Mak, Tianze Zhou et al. · arXiv · Jul 16, 2026
Video models are evolving into vision foundation models, yet they still lack human-like multi-step reasoning. Streaming autoregressive diffusion models are efficient but limited in reasoning, while bidirectional diffusion enables global rev…
Online Neural Space Time Memory for Dynamic Novel View Synthesis
Baback Elmieh, Lynn Tsai, Zeman Li, Srinivas Kaza et al. · arXiv · Jul 16, 2026
Online novel view synthesis from multi-view streaming videos faces a fundamental trade-off: maintaining a persistent, long-horizon memory to reconstruct temporarily occluded regions while operating under strict real-time constraints. While …
Divergent Gaze Patterns in Artistic Viewing: Spatial and Temporal Signatures of Attention Across Autistic Individuals, Artists, and Neurotypical Observers
Mohammed Amine Kerkouri, Daphné Senggaran, Renaud Jusiak, Océane Lehmann et al. · arXiv · Jul 16, 2026
How different populations visually explore artworks bears on cognitive science and on accessibility design, yet most eye-tracking work in autism has used social scenes rather than art, and has analysed where the eyes land while ignoring whe…
MAGiSt3R: Multi-Agent Feed-forward 3D Reconstruction from Monocular RGB Videos
Ziren Gong, Xiaohan Li, Fabio Tosi, Ninghui Xu et al. · arXiv · Jul 16, 2026
This paper presents MAGiSt3R, a multi-agent 3D reconstruction framework performing reconstruction and camera tracking for monocular RGB videos at almost 10 FPS. MAGiSt3R relies on a feed-forward model from the 3R family to process RGB video…
ESAR: Event-Based Synthetic Aperture Reconstruction
Harbir Antil, Daniel Blauvelt, David Sayre · arXiv · Jul 16, 2026
Event cameras report asynchronous polarity events when changes in log-radiance exceed a fixed contrast threshold, producing signed temporal contrast measurements rather than conventional image frames. We formulate monocular event-based ima…
Video = World + Event Stream
Lianghua Huang, Zhi-Fan Wu, Yupeng Shi, Wei Wang et al. · arXiv · Jul 16, 2026
We present Wan-Streamer v0.3, which reframes our native-streaming interaction model under a single organizing view: a video is a world plus an event stream. The world is the persistent context in which a video unfolds, including the environ…
From Draft to Draft-Free: One-Step Video Object Removal via Privileged Distillation and Fast Planting
Zizhao Chen, Ping Wei, Guang Dai, Jingdong Wang et al. · arXiv · Jul 16, 2026
Video object removal is a fundamental yet challenging task in video editing. Despite recent progress, existing methods typically fall into two categories. Traditional approaches based on optical flow or attention mechanisms often introduce …
The Seriality Gap in Video Diffusion Models
Jorge Diaz Chao, Konpat Preechakul, Yuxi Liu, Yutong Bai · arXiv · Jul 14, 2026
When one ball strikes another, then another, video models should predict the consequences of each bounce. In controlled experiments on multi-ball hard-sphere dynamics, we find that the performance of standard bidirectional video diffusion d…
