Latest Video Generation Research Papers

The newest Video Generation papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Video Generation so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Streaming Multi-Agent Autoregressive Diffusion Model with World State Registers
Sicheng Mo, Yuheng Li, Ziyang Leng, Krishna Kumar Singh et al. · arXiv · Jul 23, 2026
Multi-agent interactive world models should not only generate consistent observations, but also maintain world states that persist across agents and evolve across views. Existing autoregressive video diffusion pipelines carry forward observ…
GraphVid: Interactive Graph-Controllable Video Generation
Vedant Shah, Onkar Susladkar, Tushar Prakash, Kiet Nguyen et al. · arXiv · Jul 23, 2026
Controllable video generation remains challenging due to the difficulty of specifying precise multi-object interactions using text prompts or motion-control inputs that primarily constrain pixel movement. In practice, trajectory-based contr…
SANA-Video 2.0: Hybrid Linear Attention with Attention Residuals for Efficient Video Generation
Junsong Chen, Jincheng Yu, Yitong Li, Shuchen Xue et al. · arXiv · Jul 23, 2026
We introduce SANA-Video 2.0, a hybrid video diffusion transformer instantiated at 5B and 14B scales under a unified architecture. Designed to generate high-quality video up to 720p on a single GPU, SANA-Video 2.0 matches full-softmax video …
Self Gradient Forcing: Native Long Video Extrapolation
Junhao Zhuang, Shiyi Zhang, Yuxuan Bian, Yaowei Li et al. · arXiv · Jul 22, 2026
Recent autoregressive video diffusion methods are increasingly built upon Self Forcing, where the student is trained on histories produced by its own rollout rather than ground-truth video contexts. This reduces exposure bias, but the histo…
Vera: Identity-Faithful Human Subject-to-Video Generation
Yulong Xu, Xinyue Liu, Shujuan Li, Huafeng Shi et al. · arXiv · Jul 22, 2026
Subject-to-video (S2V) generation has made substantial progress in preserving reference subjects across diverse categories, yet generic subject consistency remains insufficient for human-centric generation. A video may appear globally consi…
StreamHOI: Interaction-aware Temporal Memory Adaptation for Streaming HOI Video Generation
Zejing Rao, Haoxian Zhang, Xiaoqiang Liu, Yiping Meng et al. · arXiv · Jul 22, 2026
Existing human-object interaction (HOI) video generation methods are largely limited to offline short-video generation with complex driving conditions, making them unsuitable for real-time interactive applications. We present StreamH…
HeadCast: Casting Attention Heads for Efficient Autoregressive Video Generation
Jinliang Shen, Lianghao Su, Zheming Li, Kang He et al. · arXiv · Jul 22, 2026
Autoregressive (AR) video diffusion models have become a promising paradigm for long and streaming video synthesis, but the continuously growing Key-Value (KV) cache makes attention the dominant inference cost, especially at high resolution…
The Seriality Gap in Video Diffusion Models
Jorge Diaz Chao, Konpat Preechakul, Yuxi Liu, Yutong Bai · arXiv · Jul 14, 2026
When one ball strikes another, then another, video models should predict the consequences of each bounce. In controlled experiments on multi-ball hard-sphere dynamics, we find that the performance of standard bidirectional video diffusion d…
FlowWAM: Optical Flow as a Unified Action Representation for World Action Models
Yixiang Chen, Peiyan Li, Yuan Xu, Qisen Ma et al. · arXiv · Jul 14, 2026
World Action Models (WAMs) are able to leverage pretrained video generators for both world modeling and action prediction. However, directly leveraging such video generators for control raises a new challenge: how to represent actions in a …
Cycle-World: Mitigating Error Accumulation in Long-term Video World Models via Reverse-Prediction Cycle Consistency
Zihan Su, Teng Hu, Jiangning Zhang, Ruiyan Wang et al. · arXiv · Jul 13, 2026
Autoregressive diffusion models have enabled high-quality video generation, yet their sequential nature inherently suffers from error accumulation. In long-horizon video synthesis, minor prediction deviations compound over time, inevitably …
Wan-Dancer: A Hierarchical Framework for Minute-scale Coherent Music-to-Dance Generation
Mingyang Huang, Peng Zhang, Li Hu, Guangyuan Wang et al. · arXiv · Jul 10, 2026
Generating long-duration, high-definition, and rhythmically synchronized dance videos directly from music remains a significant challenge, primarily due to the temporal constraints of current diffusion models, which typically fail beyond 20…
LongE2V: Long-Horizon Event-based Video Reconstruction, Prediction, and Frame Interpolation with Video Diffusion Models
Cheng-De Fan, Chun-Wei Tuan Mu, Chen-Wei Chang, Chin-Yang Lin et al. · arXiv · Jul 9, 2026
Recovering high-quality video from sparse event streams is a challenging task. Regression methods often blur textures, while existing generative models struggle with long-term stability. We propose LongE2V, a novel approach that leverages p…
OPSD-V: On-Policy Self-Distillation for Post-Training Few-Step Autoregressive Video Generators
Hongyu Liu, Chun Wang, Feng Gao, Xuanhua He et al. · arXiv · Jul 9, 2026
We propose OPSD-V, an on-policy self-distillation paradigm for post-training few-step autoregressive (AR) video diffusion models. Existing few-step AR video generators can produce long videos with low latency, but still suffer from error ac…
OpenCoF: Learning to Reason Through Video Generation
Xinyan Chen, Ziyu Guo, Renrui Zhang, Dongzhi Jiang et al. · arXiv · Jul 9, 2026
Reasoning has become a core capability for large models, especially when reliable decisions require understanding logical consequences. Recent video generation models offer a reasoning path distinct from previous Chain-of-Thought (CoT): rea…
HumanForge: A Human-Centric Deepfake Video Benchmark with Multi-Agent Forgery Rationales
Wenbo Xu, Zhimin Chen, Xiaojie Liang, Hengrui Liu et al. · arXiv · Jul 9, 2026
Rapid advancements in video diffusion models and temporal editing tools have enabled the generation of highly realistic human-centric videos, posing unprecedented challenges to digital content forensics. Existing benchmarks primarily focus …
Native Video-Action Pretraining for Generalizable Robot Control
Qihang Zhang, Lin Li, Luyao Zhang, Shuai Yang et al. · arXiv · Jul 9, 2026
The advent of video-action models offers a promising path for robot control. Nevertheless, we argue that repurposing video generative models designed for digital content creation is inherently inadequate for physical environments. To bridge…
Scaling Mixture-of-Experts Video Pretraining for Embodied Intelligence
Shuailei Ma, Jiaqi Liao, Xinyang Wang, Jingjing Wang et al. · arXiv · Jul 8, 2026
Despite the recent promise in robot control, video generative models suffer from a domain mismatch due to their primary focus on content creation. For example, their design inherently prioritizes visual fidelity and creativity over computat…
Point as Skeleton: Accumulated Point Cloud Enhanced Autoregressive Generation for Closed-Loop Autonomous Driving Simulation
Songbur Wong, Xiaosong Jia, Junqi You, Bo Zhang et al. · arXiv · Jul 7, 2026
Evaluating end-to-end autonomous driving (E2E-AD) remains challenging, as existing driving simulation methods often trade off closed-loop interactivity (e.g., CARLA) and real-world visual fidelity (e.g., nuScenes). We present …
Prompt-Adapter Context Routing for Parameter-Efficient Multi-Shot Long Video Extrapolation
Anna Córdoba, Adam Puente Tercero, Nerea Angulo Hijo, Mar Linares Tercero et al. · arXiv · Jul 7, 2026
We present PACR-Video, a parameter-efficient framework for multi-shot long video extrapolation that preserves recurring entities, scene structure, visual style, and causal progression without full generator fine-tuning. PACR-Video keeps a t…
OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers
Donghyun Lee, Jitesh Chavan, Duy Nguyen, Sam Huang et al. · arXiv · Jul 2, 2026
Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activatio…
Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models
Yue Han, Chong Li, Zhening Liu, Cong Huang et al. · arXiv · Jul 1, 2026
Recent 3D generative models can synthesize high-quality geometry but often struggle to reproduce intricate textures from reference images, largely due to the scarcity of large-scale 3D training data with rich surface appearance. In contrast…
EcoVideo: Entropy-Orchestrated Video Generation Paradigm in Cloud-Edge Dynamics
Jiayu Chen, Hengyi Zhang, Maoliang Li, Minyu Li et al. · arXiv · Jun 29, 2026
DiT video generation is latency-intensive due to iterative full-frame denoising, while prior cloud-edge methods largely rely on static inter-step decoupling and cannot leverage inter-frame similarity or adapt to system dynamics. We propose …
PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
Peiwen Zhang, Yufan Deng, Shangkun Sun, Juncheng Ma et al. · arXiv · Jun 26, 2026
Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, incl…
RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation
Minghao Yin, Jiahao Lu, Wenbo Hu, Wang Zhao et al. · arXiv · Jun 25, 2026
Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes -- a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between …
MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation
JoungBin Lee, Jaewoo Jung, Jongmin Lee, Tongmin Kim et al. · arXiv · Jun 24, 2026
Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representat…
DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation
Nan Chen, Yiyang Cai, Rongchang Xie, Junwen Pan et al. · arXiv · Jun 24, 2026
Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject features as much as …
FunPiQ: A New Benchmark for Pixel-Level Quality Assessment in Fundus Images
Pengwei Wang, José Morano, Virginia Mares, Hrvoje Bogunović · arXiv · Jun 24, 2026
Color fundus photography (CFP) is the most common ophthalmic imaging modality for large-scale screening. However, it is highly susceptible to degradations, making robust fundus image quality assessment (FIQA) crucial. The criteria for what …
FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation
Orest Kupyn, Goutam Bhat, Philipp Henzler, Fabian Manhardt et al. · arXiv · Jun 23, 2026
Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode mul…
GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction
Chenrui Fan, Paolo Favaro · arXiv · Jun 23, 2026
Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes. For these outputs, visual plausibility is insufficient: the generated frames s…
OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis
Chenrui Fan, Paolo Favaro · arXiv · Jun 23, 2026
Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today's generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, …
