Latest AI Safety & Alignment Research Papers

The newest AI Safety & Alignment papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks AI Safety & Alignment so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Knowledge and attitudes of the role of artificial intelligence in healthcare among undergraduate nursing students in Mardan, Pakistan: A cross-sectional study
Abdur Rahman, Muhammad Tariq, Ismail Shahid, Khadija Khadija et al. · DOAJ (DOAJ: Directory of Op... · Sep 1, 2026
Artificial intelligence (AI) is gradually emerging as a breakthrough in healthcare provision, enhancing clinical decision-making, patient safety, and efficiency. Nursing students must be adequately equipped to understand and leverage AI tec…
Herding Toward AI: How following the trend affects firm investments?
Pedro Godoy, Hao Zhong, Yuanyang Liu, Chuanren Liu · Journal of the Association ... · Aug 15, 2026
We examine how firms' AI skill adoption aligns with competitors' choices and how this alignment influences their ability to attract capital investment. We propose a novel firm-level herding measure constructed from employee AI skills observ…
Same Dangerous Objective, Opposite Advice: Direct Exposure versus Multi-Agent Mediation
Linjun Li · arXiv · Jul 23, 2026
Even a current high-capability LLM can appear safer when shown a dangerous objective directly than when other agents transform and relay its direction. Using OpenAI's gpt-5.6-sol model alias, we test 25 pre-specified mirrored trade-off prof…
Sound Probabilistic Safety Bounds for Large Language Models
Mahdi Nazeri, Anne-Kathrin Schmuck, Sadegh Soudjani, Alessandro Abate · arXiv · Jul 22, 2026
We propose a novel framework for computing rigorous bounds on the probability that a large language model (LLM) generates harmful output to a given prompt. We study a new application of the Clopper-Pearson confidence intervals to obtain pro…
The Maskability Index: Predicting Task-Objective Alignment in Pretrained Language Models
Ahmad Pouramini, Mahsa Afsharzadeh · arXiv · Jul 22, 2026
Large-scale pretrained language models such as T5 and BERT have demonstrated strong capabilities for generating structured knowledge. However, their performance depends on how closely the prompting strategy matches the objectives used durin…
ResearchArena: Evaluating Sabotage and Monitoring in Automated AI R&D
Lena Libon, Ben Rank, Jehyeok Yeon, David Schmotz et al. · arXiv · Jul 21, 2026
As AI agents begin to automate AI R&D, we need ways to assess whether their outputs are safe to deploy, even when the agents themselves may be untrusted. AI control offers one such approach: rather than trusting the agent, it treats it …
From Distances to Trajectories: Real-Time Signed Distance Function Mapping and Distance-Accelerated Motion Planning for UAVs
Jason Stanley, Zhirui Dai, Qihao Qian, Tzu-Chin Ho et al. · arXiv · Jul 21, 2026
Autonomous flight in cluttered environments requires a robot to build a geometric map of its surroundings and plan safe, dynamically feasible trajectories, all onboard and in real time. Conventional approaches treat mapping and planning as …
The safety failures we are not instrumenting: a perspective on hidden safety-critical challenges in modern AI systems
Gjergji Kasneci, Enkelejda Kasneci · Nature · Jul 21, 2026
Current AI safety discourse still focuses disproportionately on visible failures, including obvious harms, dramatic misuse, and hypothetical catastrophic scenarios. That focus is incomplete. In deployed systems, many of the most consequenti…
teLLMe Why (Ain't Nothing but a Jam): Exploratory Causal Analysis of Urban Driving Data
Qiwei Li, Jorge Ortiz · arXiv · Jul 16, 2026
Traffic agencies now have access to large volumes of video-derived data for studying safety and congestion. Most of these data are observational and collected without interventions, which makes causal questions such as "How would rain chang…
When Words Are Safe But Actions Kill: Probing Physical Danger Beyond Text Safety in Hidden-State Risk Space
Weimeng Wang, Ziqiang Wang, Zihang Zhan, Chuanpu Fu et al. · arXiv · Jul 16, 2026
Large language models (LLMs) increasingly serve as high-level planners for embodied agents, where linguistically benign instructions can become unsafe once grounded in the physical world. We study whether this physically grounded danger is …
Symbal: Detecting Systematic Misalignments in Model-Generated Captions
Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier, Akshay Chaudhari et al. · arXiv · Jul 16, 2026
Multimodal large language models (MLLMs) often introduce errors when generating image captions, resulting in misaligned image-text pairs. Our work focuses on a class of captioning errors that we refer to as systematic misalignments, where a…
Self-Evolving Human-Centered Framework for Explainable Depression Symptom Annotation
Hoang-Loc Cao, Van Pham, Truong Thanh Hung Nguyen, Phuc Truong Loc Nguyen et al. · arXiv · Jul 16, 2026
Annotation quality is a major bottleneck in building reliable and explainable artificial intelligence (XAI) systems for mental health research. In depression-related datasets, labels are often assigned without structured evidence, symptom-l…
MedFailBench: A Clinician-Built Open-Source Benchmark for Medical AI Safety Boundary Inspection
Goktug Ozkan · arXiv · Jul 16, 2026
Most medical AI benchmarks measure whether a model knows the correct answer. MedFailBench asks a different question: which safety boundary failed? We present a clinician-built synthetic benchmark and failure atlas that labels medical AI err…
Evaluating RE Practices for Explainability: Synthesizing Insights from Daimler Truck into an Explainable RE Framework Proposal
Umm-e- Habiba, Lucas Mauser, Jonas Fritzsch, Justus Bogner et al. · arXiv · Jul 13, 2026
Explainability has emerged as a critical requirement for AI-based systems, particularly in safety-critical and regulated domains. Although prior research has proposed frameworks, patterns, and user-centered approaches to support explainabil…
Agent Hacks Agent: Autoresearch for Production-Agent Red-Teaming
Xutao Mao, Xiang Zheng, Cong Wang · arXiv · Jul 13, 2026
Production LLM agents such as Claude Code and Codex operate over untrusted content, files, commands, and workspace state, making safety failures directly actionable. Red-teaming must therefore keep pace with evolving models and tools. Exist…
AUTOPILOT VQA: Benchmarking Vision-Language Models for Incident-Centric Dashcam Understanding
Siddharth Damodharan, Radhika Gupta, Ali Alshami, Ryan Rabinowitz et al. · arXiv · Jul 9, 2026
Recent advances in Vision-Language Models, Large Language Models, and Multimodal Large Language Models have improved autonomous driving tasks such as scene understanding, decision making, trajectory prediction, and visual question answering…
SkillCenter: A Large-Scale Source-Grounded Skill Library for Autonomous AI Agents
Tianming Sha, Yue Zhao, Lichao Sun, Yushun Dong · arXiv · Jul 8, 2026
Autonomous AI agents can execute complex tasks with limited human review, yet they often lack the grounded operational knowledge to make their outputs not just executable but correct, secure, and maintainable. We introduce SkillCenter, to o…
CARLA-GS: Decoupling Representation, Reasoning, and Physics Simulation for Autonomous Driving Corner-Case Synthesis
Kaicong Huang, Meng Ma, Ruimin Ke · arXiv · Jul 8, 2026
Safety evaluation for autonomous driving is dominated by rare, safety-critical interactions, motivating simulators that can deliberately synthesize corner cases with photorealistic observations. Corner-case generation is inherently a multi-…
Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment
Han-Jun Ko, Jr-Jen Chen, Haobo Yuan, Hsin-Ying Lee et al. · arXiv · Jul 7, 2026
Vision-language models (VLMs) struggle to generalize in interactive physical reasoning, particularly under unseen tasks and environments. Two key failure modes are prominent: hallucinated chain-of-thought (CoT) reasoning that contradicts ph…
Online Safety Monitoring for LLMs
Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth et al. · arXiv · Jul 2, 2026
Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor th…
Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Ji et al. · arXiv · Jul 2, 2026
Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from…
Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity
Brett Reynolds · arXiv · Jul 1, 2026
Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or…
Sequentially-Controlled Interactive Multi-Particle Flow-Maps for Online Feedback-Driven Search
Binglin Ji, Anindya Sarkar, Hengchang Lu, Jens Sjölund et al. · arXiv · Jul 1, 2026
While generative models have enabled training-free reward alignment, current methods typically excel in local exploration within narrow regions of the underlying distribution. These approaches struggle when preferences are unknown a priori …
Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · arXiv · Jun 29, 2026
Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward …
Agent-Native Immune System: Architecture, Taxonomy, and Engineering
Bo Shen, Lifeng Chang, Tianyuan Wei, Yunpeng Li et al. · arXiv · Jun 26, 2026
The transition from static chat bots to autonomous agents--equipped with persistent memory, tool-use protocols, and multi-agent collaboration--has fundamentally expanded the AI threat landscape. Current defense mechanisms, such as perimeter…
Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
Yanchen Yin, Dongqi Han, Linghui Li · arXiv · Jun 26, 2026
Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not comprehensively eliminate safety features, but instead selectively suppress specific attention heads. We i…
Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection
Liming Liu, Chao Hu, Mingfei Lu, Yiwei Ge et al. · arXiv · Jun 26, 2026
Graph-based fraud detection is essential for safeguarding large-scale transaction systems, where undetected anomalies may lead to substantial financial losses and security risks. Real-world fraud graphs pose two coupled challenges: sparse a…
Understanding Domain-Aware Distribution Alignment in Budgeted Entity Matching
Nicholas Pulsone, Gregory Goren, Roee Shraga · arXiv · Jun 25, 2026
Entity Matching (EM) is a core operation in the data integration pipeline, where records from different sources are compared to determine whether they refer to the same real-world entity. Recent work has incorporated domain information and …
Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment
Aditya Singh, Gerson Kroiz, Senthooran Rajamanoharan, Neel Nanda · arXiv · Jun 24, 2026
A central goal of safety research is determining whether a model is misaligned. Prior work has largely focused on detecting concerning behavior. But behavior alone does not establish misalignment: a concerning action can arise from benign c…
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems
Seth Dobrin, Łukasz Chmiel · arXiv · Jun 24, 2026
AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libra…
