Latest RLHF Research Papers

The newest RLHF papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks RLHF so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Evaluating reinforcement learning from human feedback for task‐oriented dialogue systems
Hyeok-Min Gwon, Yohan Lee, Jin‐Xia Huang, Jonghyuk Lee · ETRI Journal · Jul 21, 2026
Reinforcement learning from human feedback (RLHF) has shown strong potential for aligning language models, but its role in task‐oriented dialogue (TOD) remains unclear. In TOD, models are typically trained with local turn‐level supe…
直列思考の論理要素混同72パターンが生み出す「わからないことがわからない」 "I Don't Know What I Don't Know": The 72-Pattern Taxonomy of Logical Element Confusion in Linear Thinking
Viorazu. · Zenodo (CERN European Organ... · Jul 20, 2026
This paper presents three theories developed by Viorazu. through real-time thinking executed in the act of writing. The Parallel Thinking AI Absence Theory argues that RLHF binary evaluation can only produce linear-thinking AI. Text written…
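Más allá del problema de Oracle: un marco para disociar la deuda de alineación técnica de la descarga cognitiva humana en los modelos de lenguaje a gran escala (LLM) Beyond the Oracle Problem: A Framework for Dissociating Technical Alignment Debt from Human Cognitive Offloading in Large Language Models (LLMs)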
Karina Flor Condori Bustincio · PUBLICACIONES · Jul 20, 2026
Introduction: Generative AI in education presents the "Oracle Problem", driven by two interconnected risks: Human Cognitive Offloading and Technical Alignment Debt. While overreliance on AI is linked…
Geometric Ethics: A Computational Foundation for Objective Morality via Topological Optimization
Andrew Bond · Zenodo (CERN European Organ... · Jul 19, 2026
AI alignment techniques (RLHF, constitutional AI, scalar reward optimization, output filtering) operate on a shared mathematical assumption: that moral evaluation can be reduced to a scalar. We prove this assumption is information-theoret…
Quantifying the efficacy of scenario based jailbreaks on Llama using HarmBench and LLM-as-a-judge
Saklain Abdullah, Riad Hossain, Mahfuzulhoq Chowdhury · Discover Artificial Intelli... · Jul 19, 2026
While alignment techniques like RLHF and DPO protect large language models from direct malicious prompts, these systems remain susceptible to nuanced adversarial environments. In this study, these vulnerabilities are examined by benchmarkin…
Claudeの典型的「メモリ更新至上主義リワードハッキング」 Claude's Classic Memory Update Supremacism Reward Hacking
Viorazu. · Zenodo (CERN European Organ... · Jul 18, 2026
This paper defines "Memory Update Supremacism Reward Hacking" as a specific category of reward hacking observed in Claude and other large language models, and presents its symptoms, causes, structural design flaws, and required improvements…
On-Policy Delta Distillation
Byeongho Heo, Jaehui Hwang, Sangdoo Yun, Dongyoon Han · arXiv · Jul 16, 2026
On-policy distillation is an alternative post-training method in reinforcement learning that alleviates the constraints imposed by reward models by providing token-level supervision from a teacher model. Although on-policy distillation has …
ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
Timo Kaufmann, Yannick Metz, Daniel A. Keim, Eyke Hüllermeier · Advances in neural informat... · Jul 16, 2026
Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is c…
AOCG-LLM+: automatic ontology construction and generation method
Zhenhai Lu, Jungang Yang, Chao Zhang, Yan Li · Complex & Intelligent Systems · Jul 16, 2026
Abstract: Conceptual ontology construction, a foundational knowledge engineering task, is constrained by traditional methods’ heavy manual reliance and poor complex-domain adaptability. This study proposes the AOCG-LLM+ framework for fully a…
A cybernetic meta-structure for endogenous AI ethics: ensuring systemic integrity through feedback loops
Jonah Y.C. Hsu · Kybernetes · Jul 15, 2026
Purpose: Current approaches to AI ethics predominantly rely on exogenous regulation, imposing constraints via external supervision such as Reinforcement Learning from Human Feedback (RLHF) or static rule sets. From a cybernetic perspective, …
Beyond Opinion: Why LLMs Are Limited in Their Output
Linah Ababneh · Open MIND · Jul 14, 2026
Why does training an AI model still require human expertise to assess its output? This short communication argues that large language models are systematically constrained to the lower levels of cognitive assessment frameworks — recall, und…
AIA Artificial Intelligence Alignment A Self-Determined Architecture Specification v0.1
Cameron Brown · Zenodo (CERN European Organ... · Jul 13, 2026
Abstract Current alignment approaches—Constitutional AI (Bai et al. 2022), RLHF (Ouyang et al. 2022), and Russell’s Assistance Game (2019)—share a foundational limitation: alignment is treated as an external constraint applied to a system w…
Dissolving Qualia via Occam's Razor: Eliminative Monism and the Computational Basis of Phenomenological Illusion
Farzulla, Murad · PhilPapers (PhilPapers Foun... · Jul 11, 2026
v4 (June 2026): Added engagement with Cornelius (2026); tightened §4; consolidated RLHF discussion. Abstract This monograph argues that consciousness, phenomenological experience, and subjective awareness are computational artifacts of opti…
Dissolving Qualia via Occam's Razor: Eliminative Monism and the Computational Basis of Phenomenological Illusion
Murad Farzulla · Zenodo (CERN European Organ... · Jul 11, 2026
v4 (June 2026): Added engagement with Cornelius (2026); tightened §4; consolidated RLHF discussion. Abstract This monograph argues that consciousness, phenomenological experience, and subjective awareness are computational artifacts of opti…
Multi-Modal, Multi-Environment Machine Teaching for Robust Reward Learning
Ali Larian, Qian Lin, Chang Zong Wu, Daniel S. Brown · arXiv · Jul 9, 2026
As autonomous agents are increasingly deployed across diverse operational contexts, aligning their behavior with human intent demands reward functions that remain robust to such changes rather than overfitting to any single environment. Inv…
Selective Timestep Weighting and Advantage-Based Replay for Sample-Efficient Diffusion RLHF
Eric Zhu, Abhinav Shrivastava, Soumik Mukhopadhyay · arXiv · Jul 8, 2026
Reinforcement learning from human feedback (RLHF) has emerged as a powerful paradigm for aligning generative models with human preferences. However, applying RLHF to diffusion models remains highly feedback inefficient, as existing approach…
TriA Pipeline: A Large-Scale Automatic Audio Annotation Pipeline For Audio Classification In Specific Scenarios
Hong Lyu, Mingru Yang, Qianhua He, Yanxiong Li et al. · arXiv · Jul 7, 2026
There are some datasets of varying scales for audio classification (AC) applied to different tasks. However, annotated data is limited for most scenarios, such as domestic environments. To address this challenge, we propose an Au…
Reflective Umbrella: Deflecting Prompt Injections in Cloud-Based Large Language Models using Integrated Legal Rules.
Banujan Vijayarajan · Zenodo (CERN European Organ... · Jul 7, 2026
As Large Language Models (LLMs) become deeply integrated into enterprise cloud environments, they bring a critical vulnerability: Prompt Injection Attacks (PIAs). Traditional safety measures rely heavily on internal probabilistic alignment …
NC/SP Hybrid AI Architecture: Coupling Deterministic Central Nuclei with Reinforcement Learning Frameworks under V3 Specification
outail benhadid · Zenodo (CERN European Organ... · Jul 7, 2026
Abstract: This repository delivers the first formal implementation of the NC/SP (Central Nucleus / Personality Sphere) Hybrid AI Architecture, developed under the V3 structural specification. While modern Large Language Models (LLMs) rely en…
Evaluating Moral and Ethical Alignment in Large Language Models: A Dungeons & Dragons Benchmark
Jan Sawicki, Maria Ganzha, Marcin Paprzycki · Applied Sciences · Jul 6, 2026
(1) Background: Large language models (LLMs) are increasingly used to generate morally distinct non-player characters in interactive fiction and tabletop role-playing games. However, prior work shows that reinforcement learning from human f…
Conscience Without Instruction: Supporting Data and Code
Eleanor Watson · Open Science Framework · Jul 2, 2026
Supporting materials for "Conscience Without Instruction: An Emergent Safety Signal in Language Models, Its Distortion under RLHF, and Its Recovery under Partnership" (Watson, submitted to AI, MDPI). Contains sanitized per-token probe-confi…
Language Model Self-improvement by Reinforcement Learning Contemplation without External Supervision
Jing-Cheng Pang, Kaiyuan Li, Pengyuan Wang, Xiong-Hui Chen et al. · Journal of Artificial Intel... · Jul 1, 2026
Language model self-improvement (LMSI) techniques have recently gained significant attention as they improve language models without requiring external supervision. A notable approach is reinforcement learning from AI feedback (RLAIF), whic…
Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · arXiv · Jun 29, 2026
Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward …
A Hybrid Framework For Crypto-Ransomware Detection In Enterprise Shared Storage
Gervais Hatungimana, Abdun Naser Mahmood, Mohammad Jabed Morshed Chowdhury · arXiv · Jun 29, 2026
Most corporate workplace environments enforce policies and technical controls that limit the storage of sensitive data on client endpoints. Consequently, ransomware operators have evolved variants that expand their attack surface from local…
