Latest Diffusion Models Research Papers
The newest Diffusion Models papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Diffusion Models so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization · Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee et al. · arXiv · Jun 9, 2026
Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, t…
- Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models · An Zhao, Shengyuan Zhang, Zhongjian Sun, Yixiang Zhou et al. · arXiv · Jun 9, 2026
Flow Matching models have demonstrated strong performance across a wide range of generative tasks. However, their reliance on ODE-based iterative sampling incurs substantial computational overhead in inference, which limits their applicabil…
- UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors · Zhiwen Yang, Yang Zhou, Haowei Chen, Hui Zhang et al. · arXiv · Jun 9, 2026
Most existing deep learning-based PET image denoising methods assume a fixed and known dose reduction factor (DRF) for low-dose PET images. However, these methods encounter significant performance degradation when the DRF varies beyond the …
- U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training · Zhiwen Yang, Jiayin Li, Hao Lu, Hui Zhang et al. · arXiv · Jun 9, 2026
Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of genera…
- PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws · Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins et al. · arXiv · Jun 8, 2026
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structu…
- Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles · Xiao Li, Yixuan Jia, Zekai Zhang, Xiang Li et al. · arXiv · Jun 8, 2026
Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from …
- Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints · Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag et al. · arXiv · Jun 8, 2026
The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated remarkable performance in the face generation task. However, they fail to effectively capture cross-modality semantic information in c…
- Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis · Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra et al. · arXiv · Jun 8, 2026
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhy…
- DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation · Tony Danjun Wang, Tolga Birdal, Nassir Navab · arXiv · Jun 5, 2026
Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this lea…
- Complexity-Balanced Diffusion Splitting · Noam Issachar, Dani Lischinski, Raanan Fattal · arXiv · Jun 4, 2026
Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deployi…
- Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them · Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen et al. · arXiv · Jun 4, 2026
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consi…
- RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling · Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang et al. · arXiv · Jun 4, 2026
Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Exi…
- SAM-Flow: Source-Anchored Masked Flow for Training-Free Image Editing · Haowang Cui, Rui Chen, Tao Luo, Tao Guo et al. · arXiv · Jun 4, 2026
Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based an…
- RoboDream: Compositional World Models for Scalable Robot Data Synthesis · Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li et al. · arXiv · Jun 1, 2026
Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling…
- LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation · Qixin Hu, Shuai Yang, Wei Huang, Song Han et al. · arXiv · Jun 1, 2026
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during gen…
- Improving Combined Detection and Classification of TEM Defects via Mask-Conditioned Latent Diffusion Augmentation · Ni Li, Nuohao Liu, Ryan Jacobs, Ajay Annamareddy et al. · arXiv · Jun 1, 2026
Analyzing microstructural defects in transmission electron microscopy (TEM) images, particularly in irradiated metal alloys, is often limited by the availability of high-quality, labeled data. To address this, we introduce a generative data…
- Drifting Preference Optimization for One-Step Generative Models · Zhou Jiang, Yandong Wen, Zhen Liu · arXiv · Jun 1, 2026
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denois…
- VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion · Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan et al. · arXiv · May 28, 2026
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV lay…
- AdaState: Self-Evolving Anchors for Streaming Video Generation · Yusuf Dalva, Pinar Yanardag · arXiv · May 28, 2026
Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representatio…
- YoCausal: How Far is Video Generation from World Model? A Causality Perspective · You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang et al. · arXiv · May 28, 2026
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-wo…
- Colored Noise Diffusion Sampling · Hadar Davidson, Noam Issachar, Sagie Benaim · arXiv · May 28, 2026
Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stoc…
- Veda: Scalable Video Diffusion via Distilled Sparse Attention · Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei et al. · arXiv · May 28, 2026
Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality i…
- Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling · Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu et al. · arXiv · May 27, 2026
Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quant…
- OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning · Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin et al. · arXiv · May 27, 2026
Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, qu…
- Towards Controllable Image Generation through Representation-Conditioned Diffusion Models · Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen · arXiv · May 26, 2026
Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text p…
- PARE: Pruning and Adaptive Routing for Efficient Video Generation · Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao et al. · arXiv · May 26, 2026
Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but t…
- MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale · Zhicong Tang, Zhao Zhang, Jingye Chen, Mohan Zhou et al. · arXiv · May 26, 2026
Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains …
- Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection · Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock · arXiv · May 26, 2026
Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually o…
- Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation · Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li et al. · arXiv · May 25, 2026
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-mod…
- Reinforcing Few-step Generators via Reward-Tilted Distribution Matching · Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang et al. · arXiv · May 25, 2026
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-sta…