Latest Computer Vision Research Papers
The newest Computer Vision papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Computer Vision so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations · Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu et al. · arXiv · Jun 9, 2026
This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete se…
- Next Forcing: Causal World Modeling with Multi-Chunk Prediction · Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu et al. · arXiv · Jun 9, 2026
Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the t…
- AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference · Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu et al. · arXiv · Jun 9, 2026
Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxilia…
- Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization · Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee et al. · arXiv · Jun 9, 2026
Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, t…
- Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories · Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu et al. · arXiv · Jun 9, 2026
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an an…
- Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models · An Zhao, Shengyuan Zhang, Zhongjian Sun, Yixiang Zhou et al. · arXiv · Jun 9, 2026
Flow Matching models have demonstrated strong performance across a wide range of generative tasks. However, their reliance on ODE-based iterative sampling incurs substantial computational overhead in inference, which limits their applicabil…
- P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning · Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou et al. · arXiv · Jun 9, 2026
Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchm…
- MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On · Xiaoyu Han, Chenyang Wang, Jing Wang, Shunyuan Zheng et al. · arXiv · Jun 9, 2026
Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life…
- UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors · Zhiwen Yang, Yang Zhou, Haowei Chen, Hui Zhang et al. · arXiv · Jun 9, 2026
Most existing deep learning-based PET image denoising methods assume a fixed and known dose reduction factor (DRF) for low-dose PET images. However, these methods encounter significant performance degradation when the DRF varies beyond the …
- WorldOlympiad: Can Your World Model Survive a Triathlon? · Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang et al. · arXiv · Jun 9, 2026
We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or s…
- Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football · Andrew Kang, Priya Narasimhan · arXiv · Jun 9, 2026
We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent t…
- Multimodal Brain Tumour Classification Using Feature Fusion · Wajih ul Islam, Muhammad Yaqoob, Javed Ali Khan, Volker Steuber · arXiv · Jun 9, 2026
Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT …
- FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model · Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes et al. · arXiv · Jun 9, 2026
A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segment…
- IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder · Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong et al. · arXiv · Jun 9, 2026
Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quali…
- A History-Aware Visually Grounded Critic for Computer Use Agents · Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen et al. · arXiv · Jun 9, 2026
Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, exi…
- U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training · Zhiwen Yang, Jiayin Li, Hao Lu, Hui Zhang et al. · arXiv · Jun 9, 2026
Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of genera…
- An Uncertainty Estimation Framework for Dose Accumulation in Adaptive Radiotherapy: Application to CBCT-Guided Radiotherapy for Cervical Cancer · Cedric Hemon, Delphine Lebret, Jean-Claude Nunes, Valentin Boussot et al. · arXiv · Jun 9, 2026
Background and purpose: oART enables daily plan adaptation to interfraction anatomical variations, but cumulative dose estimation remains limited by DIR, segmentation, and anatomical uncertainties. We introduce IMPACT-DoseAcc, an uncertaint…
- IPSM-Bench: A New Intermediate Phase Segmentation Benchmark in Microstructure Images of Zinc-Based Absorbable Biomaterials · Jinglin Xu, Shangyan Zhao, Jiabo Wang, Xinghong Mu et al. · arXiv · Jun 9, 2026
Zinc-based alloys are indispensable emerging absorbable metallic biomaterials, and their macroscopic performance is governed by microstructural characteristics. Intermediate phases, key microstructural constituents, are pivotal in regulating …
- AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects · Yiming Zhao, Haoyu Sun, Aoyu Wang · arXiv · Jun 9, 2026
While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for cate…
- Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks · Pradnya Halady, Jiale Wei, Zdravko Marinov, Alexander Jaus et al. · arXiv · Jun 9, 2026
Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby can adapt to new vision tasks at test-time. Yet, the evaluation of the adaptation capabil…
- Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans · Fedor Rodionov, Aleksandar Cvejic, Michael Birsak, John Femiani et al. · arXiv · Jun 9, 2026
Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan …
- Latent Spatial Memory for Video World Models · Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen et al. · arXiv · Jun 8, 2026
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE enco…
- MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models · Hao Shi, Weiye Li, Bin Xie, Yulin Wang et al. · arXiv · Jun 8, 2026
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore strug…
- OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics · Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang et al. · arXiv · Jun 8, 2026
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lac…
- PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws · Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins et al. · arXiv · Jun 8, 2026
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structu…
- iMaC: Translating Actions into Motion and Contact Images for Embodied World Models · Zhenyu Wu, Xiuwei Xu, Yukun Zhou, Yifan Li et al. · arXiv · Jun 8, 2026
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint ang…
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing · Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu et al. · arXiv · Jun 8, 2026
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world predictio…
- Echo-Memory: A Controlled Study of Memory in Action World Models · Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang et al. · arXiv · Jun 8, 2026
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure i…
- Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction · Ewa Miazga, Jorge Condor, Piotr Didyk · arXiv · Jun 8, 2026
View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-bas…
- End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout · Archer Wang, Joshua Chen, Sachin Vaidya, Marin Soljačić · arXiv · Jun 8, 2026
End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging …