Latest AI Safety & Alignment Research Papers
The newest AI Safety & Alignment papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks AI Safety & Alignment so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- The Role of Feedback Alignment in Self-Distillation · Semih Kara, Oğuzhan Ersoy · arXiv · Jun 9, 2026
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by…
- Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution · Yifan Wang · arXiv · Jun 8, 2026
Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still have learned nothing about safety: the filt…
- Hybrid Robustness Verification for Spatio-Temporal Neural Networks · Sherwin Varghese, Matthew Wicker, Alessio Lomuscio · arXiv · Jun 8, 2026
With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive co…
- TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment · Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller et al. · arXiv · Jun 5, 2026
Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has …
- Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability · Chaitanya Shinde, Hadi Hajieghrary, Paul Schmitt, Adam Shoemaker et al. · arXiv · Jun 5, 2026
The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the abse…
- Watch, Remember, Reason: Human-View Video Understanding with MLLMs · Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao et al. · arXiv · Jun 5, 2026
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse e…
- RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation · Qi Lan, Yining Tang, Yu Shen, Yi Zhou et al. · arXiv · Jun 4, 2026
Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their ite…
- Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks · Boyi Chen, Shengqin Chu, Zicheng Wang, Brian Baetz et al. · arXiv · Jun 4, 2026
Autonomous driving technology has the potential to reduce the large number of road traffic accidents caused by human error each year, but it also brings new types of risks that need to be evaluated from the aspects of technology, ethics and…
- Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics · Haimin Hu · arXiv · Jun 1, 2026
Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuri…
- SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment · Hao Li, Jingkun An, Zijun Song, Pengyu Zhu et al. · arXiv · Jun 1, 2026
Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or …
- Gram: Assessing sabotage propensities via automated alignment auditing · David Lindner, Victoria Krakovna, Sebastian Farquhar · arXiv · May 28, 2026
We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini m…
- SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks · Edwin Jose · arXiv · May 27, 2026
Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing ap…
- Utility-Aware Multimodal Contrastive Learning for Product Image Generation · Xiaohang Feng, Yiling Xie · arXiv · May 27, 2026
Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not …
- Paper 17 — Terms of Structure and Traceable Document-State Layers: Structural Conditions for AI+AGI-Generated Documents in the AGI Era · The First Waters · Zenodo (CERN European Organ... · May 26, 2026
This working paper introduces Terms of Structure and Traceable Document-State Layers as a non-executive structural framework for AI+AGI-generated documents, document-state records, revision histories, structural conditions, and reference co…
- Paper 15 — Telecommunication Edge Reference Nodes: Network-Side Structural Roles, Personalized Boundary Configuration, and Essential-Function Parity Hardware in the AGI Era · The First Waters · Zenodo (CERN European Organ... · May 26, 2026
This working paper introduces Telecommunication Edge Reference Nodes as network-side structural reference nodes for the AI+AGI era. Building on the AGI Structural Alignment Series, this paper addresses how telecommunications edge environmen…
- Retrying vs Resampling in AI Control · James Lucassen, Adam Kaufman · arXiv · May 25, 2026
AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We f…
- Paper 8 — HTS-Based Evidence Sealing in the AGI Era: Preserving Output History, Human Discretion, and Evidentiary References · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces HTS-Based Evidence Sealing as a non-executable history-reference and time-based sealing reference framework for the AGI era. Building on Papers 1 through 7 of the AGI Structural Alignment Series, this paper add…
- Paper 10 — Consent and Order Candidate Layers in the AGI Era: Non-Executable Action Candidates Before Human Discretion · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces Consent and Order Candidate Layers as a non-executable framework for AI+AGI-generated consent phrases, order phrases, approval phrases, payment request phrases, contract phrases, and action request phrases in t…
- Paper 2 — SCD + Security: Output Reference Boundary Structures in the AGI Era · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces SCD + Security as a non-executable output reference boundary structure for the AGI era. Building on Paper 1 of the AGI Structural Alignment Series, which established BIFACE-Based Sentence Coordinate Documents a…
- Paper 4 — Role Society, Human Discretion, Accountability, and Non-Identifiable Role Coordination · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces Role Society as a social structure for the AGI era in which the basic unit of coordination is no longer the account, direct identifier, or personal ID, but a role-based unit of human discretion and accountabili…
- Paper 6 — Multi-Layer Cross Verification in the AGI Era: Role Feasibility, Output Reference, and Pre-Transaction Validation · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces Multi-Layer Cross Verification as a structural methodology for the AGI era. Building on Papers 1 through 5 of the AGI Structural Alignment Series, this paper addresses how AI+AGI outputs may be cross-checked be…
- Paper 9 — GSL-Based Authority-Condition Boundary Structures in the AGI Era · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces GSL-Based Authority-Condition Boundary Structures as a non-executable boundary framework for the AGI era. Building on Papers 1 through 8 of the AGI Structural Alignment Series, this paper addresses the structur…
- Paper 7 — AGI Output Governance: Candidate Outputs, Human Discretion, and Evidentiary Boundaries in the AGI Era · The First Waters · Zenodo (CERN European Organ... · May 24, 2026
This working paper introduces AGI Output Governance as a structural framework for positioning AI+AGI outputs as candidate, reference, or assistive outputs before they are mistaken for human judgment, consent, approval, responsibility, evide…
- The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning · Vishal Rajput · arXiv · May 21, 2026
Robustness, domain adaptation, photometric and occlusion invariance, compositional generalisation, temporal robustness, alignment safety, and classical anisotropic regularisation are usually treated as separate problems with separate method…
- LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems · Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri et al. · arXiv · May 21, 2026
Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication,…
- MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data · Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie et al. · arXiv · May 21, 2026
Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challeng…
- Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning · Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza · arXiv · May 21, 2026
Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where ot…
- Paper 2 — SCD + Security: Output Reference Boundary Structures in the AGI Era · The First Waters · Zenodo (CERN European Organ... · May 21, 2026
This working paper introduces SCD + Security as a non-executable output reference boundary structure for the AGI era. Building on Paper 1 of the AGI Structural Alignment Series, which established BIFACE-Based Sentence Coordinate Documents a…
- Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment · S. Bensalem, Y. Dong, M. Franzle, X. Huang et al. · arXiv · May 18, 2026
This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents -- a structural consequence of how agent execution works, not a con…
- Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment · Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss · arXiv · May 14, 2026
Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually …