Latest Large Language Models Research Papers
The newest Large Language Models papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Large Language Models so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.
Recent papers
- Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation · Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth · arXiv · Jun 9, 2026
Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced e…
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning · Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- PhantomBench: Benchmarking the Non-existential Threat of Language Models · Haeji Jung, Hila Gonen · arXiv · Jun 9, 2026
Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavio…
- The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models · Hakan Mehmetcik · arXiv · Jun 9, 2026
This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis…
- Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models · Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du et al. · arXiv · Jun 9, 2026
With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality con…
- T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains · Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao et al. · arXiv · Jun 9, 2026
Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and…
- Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It · Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li et al. · arXiv · Jun 9, 2026
Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet an…
- Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models · Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV et al. · arXiv · Jun 9, 2026
Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment b…
- AuRA: Internalizing Audio Understanding into LLMs as LoRA · Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu et al. · arXiv · Jun 9, 2026
Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pr…
- Generative Archetype-Grounded Item Representations for Sequential Recommendation · Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen et al. · arXiv · Jun 9, 2026
Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models…
- Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer · Maria Milkova, Maksim Rudnev · arXiv · Jun 9, 2026
Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English …
- Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions · Parisa Suchdev, Juniper Lovato · arXiv · Jun 9, 2026
Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and re…
- Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? · Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia et al. · arXiv · Jun 9, 2026
The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal en…
- It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO · Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott et al. · arXiv · Jun 9, 2026
Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily s…
- Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering · Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang et al. · arXiv · Jun 9, 2026
Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. A…
- Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation · Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu et al. · arXiv · Jun 9, 2026
Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on h…
- Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization · Lena S. Bolliger, Lena A. Jäger · arXiv · Jun 9, 2026
Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more b…
- Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs · Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou · arXiv · Jun 9, 2026
LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from select…
- Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models · Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro · arXiv · Jun 9, 2026
Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predict…
- K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling · Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao et al. · arXiv · Jun 9, 2026
Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and …
- Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use · Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo et al. · arXiv · Jun 9, 2026
Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the us…
- Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning · Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang · arXiv · Jun 9, 2026
Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression …
- Causally Evaluating the Learnability of Formal Language Tasks · Vésteinn Snæbjarnarson, Anej Svete, Josef Valvoda, Reda Boumasmoud et al. · arXiv · Jun 8, 2026
Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are h…
- The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model · Wendy K. Tam · arXiv · Jun 8, 2026
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human…
- IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking · Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li et al. · arXiv · Jun 8, 2026
Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they s…
- Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO · Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich · arXiv · Jun 8, 2026
AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works ha…
- PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models · Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen et al. · arXiv · Jun 8, 2026
Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or…
- Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery · Suraj Biswas, Saurabh Gupta, Pritam Mukherjee · arXiv · Jun 8, 2026
Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a cor…
- SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks · Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang et al. · arXiv · Jun 8, 2026
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simul…
- End-to-End Context Compression at Scale · Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra et al. · arXiv · Jun 8, 2026
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time …