Latest Image Generation Research Papers
The newest Image Generation papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance.
Recent papers
- IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong et al. · arXiv · Jun 9, 2026
Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quali…
- Echo-Memory: A Controlled Study of Memory in Action World Models
Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang et al. · arXiv · Jun 8, 2026
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure i…
- Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints
Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag et al. · arXiv · Jun 8, 2026
State-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models, have demonstrated remarkable performance in the face generation task. However, they fail to effectively capture cross-modality semantic information in c…
- LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu et al. · arXiv · Jun 1, 2026
Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To ad…
- Drifting Preference Optimization for One-Step Generative Models
Zhou Jiang, Yandong Wen, Zhen Liu · arXiv · Jun 1, 2026
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denois…
- Colored Noise Diffusion Sampling
Hadar Davidson, Noam Issachar, Sagie Benaim · arXiv · May 28, 2026
Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stoc…
- Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu · arXiv · May 27, 2026
Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we p…
- Towards Controllable Image Generation through Representation-Conditioned Diffusion Models
Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen · arXiv · May 26, 2026
Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text p…
- MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale
Zhicong Tang, Zhao Zhang, Jingye Chen, Mohan Zhou et al. · arXiv · May 26, 2026
Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains …
- Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li et al. · arXiv · May 25, 2026
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-mod…
- Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang et al. · arXiv · May 25, 2026
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-sta…
- InstructSAM: Segment Any Instance with Any Instructions
Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin et al. · arXiv · May 25, 2026
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction …
- Paris 2.0: A Decentralized Diffusion Model for Video Generation
Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang · arXiv · May 25, 2026
We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed…
- Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore et al. · arXiv · May 25, 2026
Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\text…
- A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring
Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten et al. · arXiv · May 25, 2026
Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the siz…
- Swift Sampling: Selecting Temporal Surprises via Taylor Series
Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta et al. · arXiv · May 21, 2026
While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we…
- SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell et al. · arXiv · May 21, 2026
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by m…
- PIXLRelight: Controllable Relighting via Intrinsic Conditioning
Miguel Farinha, Ronald Clark · arXiv · May 18, 2026
We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse a…
- EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen et al. · arXiv · May 18, 2026
Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchm…
- Aligning Latent Geometry for Spherical Flow Matching in Image Generation
Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe, Adil Kaan Akan et al. · arXiv · May 14, 2026
Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even…
- Does Synthetic Layered Design Data Benefit Layered Design Decomposition?
Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang et al. · arXiv · May 14, 2026
Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-ge…
- AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
Runhui Huang, Jie Wu, Rui Yang, Zhe Liu et al. · arXiv · May 12, 2026
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start st…
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou et al. · arXiv · May 12, 2026
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent …
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang et al. · arXiv · May 8, 2026
While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to t…
- STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang et al. · arXiv · May 8, 2026
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive lang…
- GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation
Ziyu Zhai, Siyou Li, Juexi Shao, Juntao Yu · arXiv · May 7, 2026
Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks …
- Hyperbolic Concept Bottleneck Models
Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes · arXiv · May 7, 2026
Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in fla…
- FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
Mingfeng Lin, Jiakun Chen, Liang Han, Liqiang Nie · arXiv · May 7, 2026
Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous…
- Taming Outlier Tokens in Diffusion Transformers
Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen et al. · arXiv · May 6, 2026
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limite…
- D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang et al. · arXiv · May 6, 2026
The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant cha…