Latest Multimodal Learning Research Papers

The newest Multimodal Learning papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Multimodal Learning so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.

Recent papers

Effects of secretome and EVs from neural crest-derived stem cells on glioblastoma multiforme cells
Atiyeh Asadpour · CentAUR (University of Read... · Jan 1, 2027
Glioblastoma multiforme (GBM), an incurable primary brain cancer, is a highly heterogeneous and aggressive cancer with poor patient outcomes despite multimodal therapy. The secretomes of GBM cells contribute to enhanced tumour sustai…
Multimodal analysis of professionals’ multidimensional identities: From threat appraisers to coping innovators
Carmen Daniela Maier · OpenAlex · Jan 1, 2027
Fostering Engagement through a Latency-Optimized LLM-based Dialogue System for Multimodal ECA Responses - Supplemental Material
Kühlem, Konstantin W., Ehret, Jonathan, Kuhlen, Torsten W., Bönsch, Andrea · Zenodo (CERN European Organ... · Dec 18, 2026
AI Applicability in Law: Opportunities, Limitations, and Governance in a Hybrid Legal Ecosystem
Ojha Anjan Kumar · Zenodo (CERN European Organ... · Dec 13, 2026
Artificial Intelligence (AI) is transforming the legal domain far beyond prior waves of digitization and workflow automation. Modern AI systems including large language models, multimodal reasoning engines, neural retrieval systems, and pre…
Conserved Multimodal Nonlinear Dynamics in Brain Activity and Structure
Calvin Grant · Zenodo (CERN European Organ... · Dec 5, 2026
Summary: The brain has been generating the same rhythms (delta, theta, alpha, beta, gamma) in every subject ever recorded. The boundaries between them are reproducible to within a hertz across species, ages, anesthesia, wakefulness, epilep…
Reconstruction and Renaissance of Dongbei China: A Multimodal Metaphor Analysis of New Media on Harbin
Amily Wang Guenier (ORCID 0000-0001-5583-9442) · Research Explorer (The Univ... · Dec 1, 2026
Towards Early and Accurate Disease Detection Through Multimodal Predictive Modeling: Fusion of Electronic Health Records, Medical Imaging, And Omics Data Using Interpretable Machine Learning.
Muhammad Ahsan Hayat, Jahangir Baig, Shayan Ahmed, Ahmed Faraz Ayubi · Zenodo (CERN European Organ... · Nov 3, 2026
Early detection of disease is a cornerstone for improving patient outcomes, reducing costs, and enabling preventative interventions. Traditional predictive models often rely on a single type of data (e.g., imaging, clinical labs, or genomic…
AI-Driven Biomarker Discovery & Progression Modeling for Precision Diagnosis of Glaucoma
Cheng Huang · SMU Scholar (Southern Metho... · Oct 1, 2026
This dissertation presents a comprehensive study on the integration of artificial intelligence (AI) for glaucoma diagnosis and retinal image analysis. Leveraging multimodal imaging data including fundus photography, Optical Coherence Tomogr…
Cross-modal Image Recommendation for News Articles by Multimodal Foundation Models-based Retrieval-Reranking
Damianos Galanopoulos, Andreas Goulas, Vasileios Mezaris · Open MIND · Oct 1, 2026
Retrieving relevant images for a given news article is challenging and can be considered a special version of the cross-modal retrieval problem. This notebook paper presents our solution for the MediaEval NewsImages 2025 benchmarking task. …
3D-Aware VLMs with Implicit and Explicit Geometries
Wenhao Li, Xueying Jiang, Quanhao Qian, Deli Zhao et al. · arXiv · Jul 23, 2026
Despite rapid progress, most existing vision-language models (VLMs) built from 2D visual inputs often struggle when handling various 3D tasks that require fine-grained spatial understanding and reasoning. To bridge this gap, we present VLM-…
UnDA: Unpaired Domain Alignment for Cross-Modal Knowledge Transfer in Medical Imaging
Rafsan Jany, Shadab Tanjeed Ahmad, Ahsan Bulbul, Tahsinul Islam et al. · arXiv · Jul 23, 2026
Multimodal approaches often outperform single-modality approaches in downstream tasks because the different modalities provide complementary information, yet acquiring paired clinical data remains a significant challenge in real-world scen…
Look Less, Think Faster: Joint Token-Compute Adaptation for Multimodal LLMs
Pengcheng Wang, Zhiquan Wang, Jayoung Lee, Zhuoyan Xu et al. · arXiv · Jul 22, 2026
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across vision-language tasks. However, their high inference cost, arising from both the large number of input visual tokens and the heavy computation of …
Test-Time Training for Modality Order Consistency in Vision-Language Models
Aditi Gupta, Yossi Gandelsman · arXiv · Jul 22, 2026
We find that vision-language models are sensitive to a specific semantically irrelevant change: the order in which the image and question are presented. Across three models and three benchmarks, image-first prompting consistently outperform…
Diverse-Intent Multi-Turn Fashion Image Retrieval
Mingqiang Tang, Haokun Wen, Meng Liu, Yupeng Hu et al. · arXiv · Jul 22, 2026
Real-world fashion search involves interactive retrieval across multiple turns. However, existing multi-turn retrieval methods are built on a restrictive assumption that every interaction follows the same attribute-editing paradigm, leaving…
Multimodal Large Language Models for Remote Sensing Image Understanding: Domain-Specific or General-Purpose?
Qiwei Ma, Chunping Qiu, Xinjun Cheng, Xiaoyu Zhang et al. · arXiv · Jul 22, 2026
The rapid development of multimodal large language models (MLLMs) has introduced a flexible paradigm for remote sensing image scene understanding (RSISU), enabling natural-language interaction with remote sensing imagery. However, a systema…
Appearance Pointers -- Multimodal Region Control of Diffusion Transformers
Rahul Sajnani, Yulia Gryaditskaya, Radomír Měch, Srinath Sridhar et al. · arXiv · Jul 21, 2026
Controllable image generation remains challenging for creative professionals, who often require precise regional control over materials, object identities, and spatial arrangements that cannot be reliably achieved through text prompting alo…
ExpertVerse: A General-Purpose Benchmark for Expert-Level Reasoning in Knowledge-Intensive Visual Synthesis
Yuan Wang, Yongchao Du, Mengting Chen, Jinsong Lan et al. · arXiv · Jul 21, 2026
Recent advances in multimodal generative models have enabled instruction-based image generation to move beyond semantic manipulation to knowledge-driven visual reasoning. However, these methods focus on explicit commonsense reasoning, shall…
OmniReasoner: Thinking with Long Audio-Video via Native Tool Use
Yu Chen, Caorui Li, Ziyu Xiong, Yidong Wang et al. · arXiv · Jul 21, 2026
Long audio-video reasoning is difficult for omnimodal LLMs because the decisive evidence is often sparse, cross-modal, and too expensive to preserve with uniformly high-fidelity inputs. We introduce OmniReasoner, a tool-use post-training fr…
No Training, Better Flights: Test-Time Scaled VLMs for UAV Navigation
Feinan Cheng, Dongliang Xu, Wenli Nong, Zhiheng Zhang et al. · arXiv · Jul 21, 2026
Test-time scaling offers a promising method to improve the inference performance of Vision-Language Models (VLMs) without additional training. Existing approaches to vision-language navigation (VLN) for Unmanned Aerial Vehicle (UAV) typical…
PathAgentBench: Benchmarking Evidence-Seeking Vision-Language Models on Whole-Slide Pathology Image
Dankai Liao, Tianyi Zhang, Yufeng Wu, Xinyue Zhang et al. · arXiv · Jul 21, 2026
Whole-slide image (WSI) diagnosis requires identifying diagnostically relevant regions, examining them across magnifications, and integrating multi-scale evidence. However, most existing pathology benchmarks evaluate models on pre-cropped p…
Cognitive Dual-Process Planning for Autonomous Driving with Structured Scene Knowledge and Verifiable Reasoning-Action Consistency
Zhongyao Yang, Haoyu Li, Yu Yan, Zhuangxuan Yu et al. · arXiv · Jul 21, 2026
High-level planning for autonomous driving is a knowledge-intensive engineering decision task that requires accurate scene understanding, timely inference, and internally consistent action selection. Vision-language models (VLMs) can make i…
HoloGeo: Mitigating Landmark Bias in Geo-localization via Evidence-Driven Reasoning
Pengcheng Zhou, Xuanyu Liu, Yanchen Yin, Bobo Li et al. · arXiv · Jul 16, 2026
Recent advances in Vision-Language Models (VLMs) have significantly improved image geo-localization, yet existing models remain susceptible to landmark bias, causing them to overlook geographical cues or form spurious correlations, ultimate…
Beyond the Leaderboard: Design Lessons for Trustworthy Multimodal VQA
Sushant Gautam, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen et al. · arXiv · Jul 16, 2026
Healthcare multimodal AI must combine visual and textual evidence while remaining reliable and interpretable. Using MediaEval Medico 2025 as a retrospective GI endoscopy case study, we analyze design choices across nine documented systems f…
Structural-Semantic Reciprocal Learning for Unsupervised Visible-Infrared Person Re-Identification
Moyao Tian, Shijia Liu, Yan Yang, Xin Yuan et al. · arXiv · Jul 16, 2026
Unsupervised visible-infrared person re-identification (USVI-ReID) is challenging due to the large modality gap and the lack of cross-modal identity annotations. Progressive association paradigms have been proposed to gradually bridge the g…
Symbal: Detecting Systematic Misalignments in Model-Generated Captions
Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier, Akshay Chaudhari et al. · arXiv · Jul 16, 2026
Multimodal large language models (MLLMs) often introduce errors when generating image captions, resulting in misaligned image-text pairs. Our work focuses on a class of captioning errors that we refer to as systematic misalignments, where a…
AlphaWiSE: Adaptive Weight Interpolation for Continual Multimodal Representation Learning
Sarthak Jain, Qiran Hu, Zhen Zhu, Yaoyao Liu · arXiv · Jul 16, 2026
Multimodal models such as CLIP learn a shared embedding space for cross-modal retrieval, but continual adaptation to sequentially arriving data can disrupt the cross-modal alignment acquired from earlier phases. Conventional continual-learn…
