Latest Multimodal Learning Research Papers
The newest Multimodal Learning papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Multimodal Learning so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations
Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu et al. · arXiv · Jun 9, 2026
This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete se…
- AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference
Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu et al. · arXiv · Jun 9, 2026
Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxilia…
- Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu et al. · arXiv · Jun 9, 2026
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an an…
- P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning
Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou et al. · arXiv · Jun 9, 2026
Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchm…
- Multimodal Brain Tumour Classification Using Feature Fusion
Wajih ul Islam, Muhammad Yaqoob, Javed Ali Khan, Volker Steuber · arXiv · Jun 9, 2026
Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT …
- FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes et al. · arXiv · Jun 9, 2026
A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segment…
- MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models
Hao Shi, Weiye Li, Bin Xie, Yulin Wang et al. · arXiv · Jun 8, 2026
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore strug…
- OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang et al. · arXiv · Jun 8, 2026
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lac…
- Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints
Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag et al. · arXiv · Jun 8, 2026
The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated remarkable performance in the face generation task. However, they fail to effectively capture cross-modality semantic information in c…
- Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani et al. · arXiv · Jun 8, 2026
Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-v…
- MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang et al. · arXiv · Jun 5, 2026
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and …
- TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller et al. · arXiv · Jun 5, 2026
Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has …
- Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao et al. · arXiv · Jun 5, 2026
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse e…
- OpenGlass: Open-Source Smart Glasses for On-Device Event-Based Gesture Recognition
Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer et al. · arXiv · Jun 5, 2026
Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporti…
- VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning
Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon · arXiv · Jun 5, 2026
Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, …
- PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang et al. · arXiv · Jun 4, 2026
Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remai…
- Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao et al. · arXiv · Jun 4, 2026
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobse…
- EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models
Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye et al. · arXiv · Jun 4, 2026
Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtl…
- GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention
Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello · arXiv · Jun 4, 2026
Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwis…
- Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor · arXiv · Jun 1, 2026
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language mod…
- Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling
Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee et al. · arXiv · Jun 1, 2026
Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to …
- ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning
Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou · arXiv · Jun 1, 2026
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning …
- AdaCodec: A Predictive Visual Code for Video MLLMs
Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si et al. · arXiv · Jun 1, 2026
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visu…
- VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao et al. · arXiv · Jun 1, 2026
The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle …
- Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang et al. · arXiv · Jun 1, 2026
Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are de…
- GPIC: A Giant Permissive Image Corpus for Visual Generation
Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang et al. · arXiv · May 28, 2026
Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images caption…
- Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Chong Bao, Shichen Liu, Lijun Yu, David Futschik et al. · arXiv · May 28, 2026
Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretraine…
- Grounded 3D-Aware Spatial Vision-Language Modeling
An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu et al. · arXiv · May 28, 2026
We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grou…
- LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang et al. · arXiv · May 28, 2026
Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its …
- From Pixels to Words -- Towards Native One-Vision Models at Scale
Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong et al. · arXiv · May 27, 2026
Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixe…