Latest Question Answering Research Papers

The newest Question Answering papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Question Answering so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

MedGame: Storytelling Gamification Empowered by Large Language Models for Medical Education
Qian Wu, Xinrong Zhou, Zizhan Ma, Kai Chen et al. · arXiv · Jul 23, 2026
Large Language Models (LLMs) show promise for medical education, but most existing systems focus on localized interactions such as question answering or single-turn feedback, rather than organizing an entire clinical case into a decision-ce…
When Trivia Is Not Trivial: Everyday Knowledge Failures in Multilingual LLMs
Anna Mosolova, Djamé Seddah · arXiv · Jul 23, 2026
Quiz rooms, trivia nights, and quiz shows challenge human knowledge across a wide range of topics, from canonical facts to everyday culture. In this paper, we examine whether large language models (LLMs) can perform competitively in such se…
Capital Markets LLM Reliability Score (CM-LRS): From Plausible to Bankable
Prerit Ahuja · arXiv · Jul 23, 2026
In capital-markets workflows the question is rarely whether a large language model can produce a fluent draft, but whether the draft is bankable: defensible in front of a counter-party or a regulator, with the documents in hand. Existing me…
WaveformQA: Benchmarking LLM Temporal Reasoning on Digital Waveforms
Yichuan Liu, Daniel Cummings, Nick Vadlamudi · arXiv · Jul 22, 2026
Large Language Models (LLMs) have demonstrated strong capabilities in code generation and reasoning, yet their ability to perform temporal reasoning over digital waveform data remains largely unexplored. Although reasoning over digital wave…
HalluTruthQA: A Fine-Grained Benchmark for Hallucination Detection, Localization, and Explanation in Arabic Question Answering
Abdessalam Bouchekif, Mohammed-En-Nadhir Zighem, Salah Eddine Bekhouche, Hichem Telli et al. · arXiv · Jul 22, 2026
Large language models (LLMs) can generate fluent Arabic answers, yet factual errors remain difficult to detect, localize, explain, and verify. Existing hallucination benchmarks often provide response-level labels, with limited support for i…
Efficient Chain-of-Modality Reasoning via Progressive Compression for Spoken Language Models
Pengchao Feng, Chao-Hong Tan, Qian Chen, Wen Wang et al. · arXiv · Jul 22, 2026
Spoken language models (SLMs) enable natural human-computer interaction, but their reasoning ability still lags behind that of text-based large language models, especially on spoken mathematical question answering tasks. One important reaso…
DAIS: Dependency-Aware Intermediate QA Supervision for Complex Reasoning
Yu Wang, Ming Fan, Xicheng Zhang, Zhiyong Li et al. · arXiv · Jul 21, 2026
Chain-of-thought (CoT) supervision exposes intermediate rationales, but flat rationale targets usually optimize a single reasoning sequence and provide limited supervision on how local conclusions should support later decisions. We introduc…
AILQA: Evaluating AI-Driven Legal Question Answering Systems for the Indian Legal System
Shubham Kumar Nigam, Shubham Kumar Mishra, Noel Shallum, Kripabandhu Ghosh et al. · arXiv · Jul 21, 2026
This comprehensive study introduces an advanced Artificial Intelligence for Indian Legal Question Answering (AILQA) system tailored to the Indian legal context. AILQA leverages a variety of embedding and generative models, including recent …
Find Before You Fine-Tune: A Diagnostic Study of Small LLMs for Cybersecurity QA
Shaswata Mitra, Subash Neupane, Trisha Chakraborty, Himanshu Tripathi et al. · arXiv · Jul 21, 2026
Large Language Models (LLMs) are increasingly fine-tuned for critical-domain Question-Answering (QA), yet choosing which small model to adapt, before paying the cost of adaptation, remains difficult. Fine-tuning can improve domain alignment…
Search-on-Graph-R1: Training Large Language Models to Search Knowledge Graphs with Reinforcement Learning
Jia Ao Sun, Hao Yu, Fengran Mo, Zhan Su et al. · arXiv · Jul 20, 2026
Knowledge graph question answering (KGQA) requires navigating from topic entities to an answer several relations away. Recent methods prompt a frontier LLM to explore the graph through a retrieval tool, but their reliance on frontier-scale …
FinSAgent: Corpus-Aligned Multi-Agent RAG Framework for Evidence-Grounded SEC Filing Question Answering
Jijun Chi, Zhenghan Tai, Hanwei Wu, Tung Sum Thomas Kwok et al. · arXiv · Jul 20, 2026
Financial question answering over U.S. Securities and Exchange Commission (SEC) filings requires retrieving and synthesizing heterogeneous evidence dispersed across long, standardized, and highly redundant disclosures. Existing retrieval-au…
Beyond the Leaderboard: Design Lessons for Trustworthy Multimodal VQA
Sushant Gautam, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen et al. · arXiv · Jul 16, 2026
Healthcare multimodal AI must combine visual and textual evidence while remaining reliable and interpretable. Using MediaEval Medico 2025 as a retrospective GI endoscopy case study, we analyze design choices across nine documented systems f…
CoTu at EXACT 2026: Neuro-Symbolic Reasoning for Transparent Educational QA
Quoc-Khang Tran, Minh-Thien Nguyen, Phu-An Thai, Xuan-Tung Bui et al. · arXiv · Jul 16, 2026
Transparent educational question answering asks for answers that are not only correct but explainable, and doing so with small models rules out the reasoning power of the largest proprietary systems. The EXACT 2026 competition poses this pr…
Gold-Guided Programmatic Distillation for Financial Reasoning over Hybrid Tables and Text
Yun Dong, Erica Zhao, Elana Chen · arXiv · Jul 16, 2026
Financial question answering over hybrid tabular and textual data may require multi-source reasoning and precise numerical computation. While large language models (LLMs) can generate intermediate reasoning steps, natural-language rationale…
Stop Thinking, Start Looking: Efficient Post-Training for Multimodal Document Question Answering via Reasoning-Free Alignment
Harikrishnan P M, Goutham Vignesh, Ganesh Parab, Saisubramaniam Gopalakrishnan et al. · arXiv · Jul 16, 2026
Efficient multimodal document question answering with explicit visual grounding, locating the precise document region that supports each answer, remains an open challenge. Current approaches bifurcate into Supervised Fine-Tuning (SFT), which…
MARS: Multi-hop Adaptive Retrieval and SPARQL Generation for KGQA
Nikit Srivastava, Daniel Vollmers, René Speck, Nikolaos Karalis et al. · arXiv · Jul 16, 2026
Large language models (LLMs) have demonstrated strong reasoning performance, but their tendency to hallucinate limits their reliability in knowledge-intensive tasks requiring up-to-date and grounded information. Combining knowledge graphs (…
DS@GT ARC at LongEval: Citation Integrity and Factual Grounding in Scientific QA
Brandon Michaels, Brendon Johnson · arXiv · Jul 15, 2026
This paper describes DS@GT ARC's submission to the CLEF 2026 LongEval Task 4 on Retrieval-Augmented Generation (RAG). In this submission, we examine a divergence between traditional natural language evaluation metrics and citation integrity…
DeepStress: Stress-Testing Deep Search Agents
Ismael Rousseau, Geraldine Damnati, Frederic Bechet · arXiv · Jul 15, 2026
While search agents demonstrate impressive capabilities in multi-step question answering, their robustness to poor-quality evidence remains under-explored. This phenomenon occurs rarely in realistic benchmarks but can lead to dramatic failu…
MemOps: Benchmarking Lifecycle Memory Operations in Long-Horizon Conversations
Xixuan Hao, Zeyu Zhang, Zehao Lin, Yihang Sun et al. · arXiv · Jul 14, 2026
Long-term memory has become a foundational capability for LLM-based agents that accompany users across extended, multi-session interactions. Existing benchmarks, however, evaluate such memory almost exclusively through downstream question a…
Task-Specific Multimodal Question Answering Agents via Confidence Calibration and Incremental Reasoning for QANTA 2026
Nirjhar Das, Md. Al-Mamun Provath · arXiv · Jul 10, 2026
We present our submission to the QANTA 2026 shared challenge at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). QANTA evaluates multimodal quizbowl systems that answer pyramid-style questions from incrementally …
WebSwarm: Recursive Multi-Agent Orchestration for Deep-and-Wide Web Search
Xiaoshuai Song, Liancheng Zhang, Kangzhi Zhao, Yutao Zhu et al. · arXiv · Jul 9, 2026
Large language model (LLM)-based web search agents are transforming information seeking from simple factoid question answering into complex, deep-and-wide search and research-oriented tasks. A single ReAct-style agent is constrained by one …
Two Axes of LLM Abstention: Answer Correctness and Question Answerability
Benedikt J. Wagner · arXiv · Jul 9, 2026
A model should refuse two different things: answers it would get wrong, and questions it should not answer at all, such as unanswerable ones or ones resting on a false premise. The usual recipe thresholds a single confidence score, which ca…
LEXIC: Lightweight Eye-tracking eXtension via Injected Complexity
Sumin Lee, Kyeonghun Kim, Subeen Lee, Jiwon Yang et al. · arXiv · Jul 9, 2026
On the recent EyeBench benchmark, predicting reading comprehension from eye movements exposes a stark gap: text-aware models using pretrained language models reach 56–63% AUROC, while gaze-only models operate at chance. We ask how far a ga…
Evaluating RAG Metrics in Applied Contexts: An Experiment, Its Findings and Its Limitations
Quentin Brabant · arXiv · Jul 8, 2026
This paper reports an empirical study evaluating the relevance of several RAG metrics. The experiment is based on a question-answering dataset created by human annotators from business data. The generated responses and retrieved spans of a …
From Voting to Agent Collaboration: Answer-Type-Aware LLM Pipelines for BioASQ 14b
Taeyun Roh, Eunha Lee, Wonjune Jang, Sohyun Chung et al. · arXiv · Jul 7, 2026
Biomedical question answering requires not only accurate extraction of information from scientific literature but also reliable integration of evidence across multiple documents. This study presents a question-type-specific large language m…
Healthier LLMs: Retrieval-Augmented Generation for Public Health Question Answering
Felix Feldman, Joshua Harris, Timothy Laurence, Leo Loman et al. · arXiv · Jul 7, 2026
Large language models (LLMs) achieve promising results on medical question answering benchmarks, yet their use in public health is constrained by hallucinations and the rapid evolution of official guidance. Retrieval-Augmented Generation (R…
Estimating Uncertainty from Reasoning: A Large-Scale Study of Multi- and Crosslingual MCQA Performance in LLMs
Andrea Alfarano, Andrea Bacciu, Saab Mansour, Amin Mantrach et al. · arXiv · Jul 7, 2026
Uncertainty estimation (UE) enables LLM-powered systems to recognize when to abstain, yet existing research has predominantly focused on English. We present the first large-scale evaluation of UE methods across 22 languages, spanning high-,…
AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations
Javier Irigoyen, Roberto Daza, Francisco Jurado, Julian Fierrez et al. · arXiv · Jul 2, 2026
This work introduces AIriskEval-edu-db2, a new dataset designed to train and evaluate auditors based on LLMs for an explainable pedagogical risk assessment in instructional content for grades K-12. The dataset comprises 1,639 explanations f…
Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
Valentin J. J. Kreileder, Johannes Reisinger, Andreas Fischer · arXiv · Jul 2, 2026
Retrieval-Augmented Generation (RAG) systems use the question-answering capabilities of Large Language Models (LLMs) to access information outside their parameters. We evaluate if cluster-based semantic chunking improves retrieval and answe…
MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering
Dang Quang Thien Tran, Quang V. Dang, Vinamra Tyagi, Sai Soorya Rao Veeravalli et al. · arXiv · Jul 1, 2026
As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal attributions have been explored in depth, the multimodal set…
