Latest Question Answering Research Papers
The newest Question Answering papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance.
Recent papers
- Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering · Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang et al. · arXiv · Jun 9, 2026
Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. A…
- Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval · João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski et al. · arXiv · Jun 9, 2026
Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing…
- LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake · Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya et al. · arXiv · Jun 9, 2026
Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurat…
- SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks · Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang et al. · arXiv · Jun 8, 2026
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simul…
- Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory · Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin et al. · arXiv · Jun 8, 2026
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing m…
- ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China · Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu et al. · arXiv · Jun 8, 2026
We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with …
- When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations · Mahdi Alkaeed · arXiv · Jun 5, 2026
Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt pertur…
- Modeling semantic association in self-paced reading with language model embeddings · Sara Møller Østergaard, Kenneth Enevoldsen, Afra Alishahi, Bruno Nicenboim · arXiv · Jun 5, 2026
Semantic association between a word and its context has been identified as an important component of reading comprehension, even when word predictability is accounted for. Recent research has highlighted the potential of language model (LM…
- EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering · Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang et al. · arXiv · Jun 5, 2026
Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks …
- CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification · Chenshuo Pan, Yu Zhao, Jie Zhang, Changzai Pan et al. · arXiv · Jun 5, 2026
Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limit…
- Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning · Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha et al. · arXiv · Jun 4, 2026
Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address…
- Improving Answer Extraction in Context-based Question Answering Systems Using LLMs · Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi · arXiv · Jun 4, 2026
Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly wh…
- MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering · Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu et al. · arXiv · Jun 4, 2026
Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve e…
- Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version) · Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała · arXiv · Jun 4, 2026
Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) sy…
- YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition · PSBC LLM Team, Huawei LLM Team, Ruihan Long, Junjie Wu et al. · arXiv · Jun 4, 2026
Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this…
- MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA · Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang et al. · arXiv · Jun 4, 2026
Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate r…
- ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning · Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou et al. · arXiv · Jun 1, 2026
The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answerin…
- A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL · Lei Yang, Siyu Ding, Deyi Xiong · arXiv · Jun 1, 2026
Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades …
- CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning · Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu et al. · arXiv · Jun 1, 2026
Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a promi…
- Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues · Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee et al. · arXiv · May 28, 2026
A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses o…
- ExCAM: Explainable Cultural Awareness Metrics · Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei et al. · arXiv · May 28, 2026
Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like beha…
- CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation · Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu et al. · arXiv · May 28, 2026
Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce ex…
- Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering · Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu et al. · arXiv · May 28, 2026
Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statem…
- GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering · Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu et al. · arXiv · May 28, 2026
Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA…
- When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs · Marcell Fekete, Johannes Bjerva, Tamás Káldi · arXiv · May 27, 2026
Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using inf…
- AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering? · Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig et al. · arXiv · May 27, 2026
AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distin…
- Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation · Evgenii Palnikov, Elizaveta Gavrilova · arXiv · May 27, 2026
We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pair…
- ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering · Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu et al. · arXiv · May 27, 2026
Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG …
- SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter · Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun et al. · arXiv · May 27, 2026
Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplore…
- What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA · Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu et al. · arXiv · May 25, 2026
Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. We find that the checker's output distribution during training, not its held-out accuracy, decides wh…