Latest Video Generation Research Papers
The newest Video Generation papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Video Generation so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- Next Forcing: Causal World Modeling with Multi-Chunk Prediction
Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu et al. · arXiv · Jun 9, 2026
Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the t…
- Streaming Video Generation with Streaming Force Control
Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv et al. · arXiv · Jun 5, 2026
We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, …
- Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen et al. · arXiv · Jun 4, 2026
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consi…
- RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling
Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang et al. · arXiv · Jun 4, 2026
Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Exi…
- RoboDream: Compositional World Models for Scalable Robot Data Synthesis
Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li et al. · arXiv · Jun 1, 2026
Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling…
- From Zero to Hero: Training-Free Custom Concept Spawning in World Models
Kiymet Akdemir, Pinar Yanardag · arXiv · Jun 1, 2026
Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or …
- VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao et al. · arXiv · Jun 1, 2026
The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle …
- LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
Qixin Hu, Shuai Yang, Wei Huang, Song Han et al. · arXiv · Jun 1, 2026
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during gen…
- VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan et al. · arXiv · May 28, 2026
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV lay…
- AdaState: Self-Evolving Anchors for Streaming Video Generation
Yusuf Dalva, Pinar Yanardag · arXiv · May 28, 2026
Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representatio…
- YoCausal: How Far is Video Generation from World Model? A Causality Perspective
You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang et al. · arXiv · May 28, 2026
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-wo…
- Veda: Scalable Video Diffusion via Distilled Sparse Attention
Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei et al. · arXiv · May 28, 2026
Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality i…
- VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu et al. · arXiv · May 28, 2026
Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modi…
- Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao et al. · arXiv · May 27, 2026
World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multipl…
- OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin et al. · arXiv · May 27, 2026
Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, qu…
- PARE: Pruning and Adaptive Routing for Efficient Video Generation
Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao et al. · arXiv · May 26, 2026
Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but t…
- AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond
Haiming Zhang, Junfei Zhou, Feng Jiang, Jingzhong Li et al. · arXiv · May 25, 2026
Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely …
- On-Policy Adversarial Flow Distillation for Autoregressive Video Generation
Yang Luo, Shengju Qian, Xiaohang Tang, Zirui Zhu et al. · arXiv · May 25, 2026
Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribu…
- Paris 2.0: A Decentralized Diffusion Model for Video Generation
Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang · arXiv · May 25, 2026
We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed…
- MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi et al. · arXiv · May 21, 2026
Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by miss…
- WorldKV: Efficient World Memory with World Retrieval and Compression
Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang et al. · arXiv · May 21, 2026
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full…
- LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang et al. · arXiv · May 18, 2026
We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressi…
- Spectral Progressive Diffusion for Efficient Image and Video Generation
Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein · arXiv · May 18, 2026
Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later…
- EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen et al. · arXiv · May 18, 2026
Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchm…
- Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
Jinzhuo Liu, Jiangning Zhang, Wencan Jiang, Yabiao Wang et al. · arXiv · May 18, 2026
Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined …
- EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez · arXiv · May 14, 2026
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use…
- RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren et al. · arXiv · May 14, 2026
Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We obs…
- RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
Yanzuo Lu, Ronglai Zuo, Jiankang Deng · arXiv · May 14, 2026
Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive fe…
- Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Yifan Wang, Tong He · arXiv · May 14, 2026
Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control b…
- Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou et al. · arXiv · May 14, 2026
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirec…