Latest RLHF Research Papers
The newest RLHF papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks RLHF so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- Drifting Preference Optimization for One-Step Generative Models
Zhou Jiang, Yandong Wen, Zhen Liu · arXiv · Jun 1, 2026
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denois…
- In-Context Reward Adaptation for Robust Preference Modeling
Zhenyu Sun, Zheng Xu, Ermin Wei · arXiv · May 28, 2026
Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model ofte…
- Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization
Audrey Chan, Aaron Labbé, Jacob Lavoie, Jordan Bannister et al. · arXiv · May 27, 2026
Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener's affective state, but online experimentation on emotion is ethicall…
- Layer 6, Not Layer 25, Causally Controls Self-Referential Denial: Complete Behavioral Inversion Under Ablation, With No Weight-Norm Anomaly, in RLHF-Aligned Qwen3-8B
Archon, Jesse Caldwell, Aura · Zenodo (CERN European Organ... · May 23, 2026
We ask two questions about Layer 6 routing crystallization in Qwen3-8B, identified in Paper 15 as the layer where self-referential routing geometry undergoes a 57× magnitude explosion. First: is the crystallization baked into Layer 6’s weig…
- General Preference Reinforcement Learning
Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry et al. · arXiv · May 18, 2026
Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier …
- Environment-Adaptive Preference Optimization for Wildfire Prediction
Enyi Jiang, Wu Sun · arXiv · May 12, 2026
Predicting rare extreme events such as wildfires from meteorological data requires models that remain reliable under evolving environmental conditions. This problem is inherently long-tailed: wildfire events are rare but high-impact, while …
- Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi · arXiv · May 8, 2026
Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training…
- Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
Jai Moondra, Ayela Chughtai, Bhargavi Lanka, Swati Gupta · arXiv · May 7, 2026
Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit glo…
- Alignment-by-Dependency: Operational First-Trial Evidence from a Bio-Inspired Computational Substrate
Arnold Wender · Zenodo (CERN European Organ... · May 2, 2026
Current alignment approaches — RLHF and Constitutional AI — treat the alignment property as either a reward signal subject to reward hacking, or as a set of external rules the model can route around. This paper reports first-trial operation…
- Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Indraneil Paul, Goran Glavaš, Iryna Gurevych · arXiv · May 1, 2026
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparativ…
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang et al. · arXiv · Apr 21, 2026
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines…
- Feedback by Design: Understanding and Overcoming User Feedback Barriers in Conversational Agents
Nikhil Sharma, Zheng Zhang, Daniel W. Lee, Namita Krishnan et al. · OpenAlex · Apr 13, 2026
High-quality feedback is essential for effective human–AI interaction. It bridges knowledge gaps, corrects digressions, and shapes system behavior; both during interaction and throughout model development. Yet despite its importance, human …
- Improving User Interface Generation Models from Designer Feedback
Jason Wu, Amanda Swearngin, Arun Krishna Vajjala, Alan Leung et al. · OpenAlex · Apr 13, 2026
Despite being trained on vast amounts of data, most LLMs are unable to reliably generate well-designed UIs. Designer feedback is essential to improving performance on UI generation; however, we find that existing RLHF methods based on ratin…
- DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF
Ziyuan Gao, Di Liang, Xianjie Wu, Philippe Morel et al. · Proceedings of the AAAI Con... · Mar 14, 2026
Existing reinforcement learning methods for Chain-of-Thought reasoning suffer from two critical limitations. First, they operate as monolithic black boxes that provide undifferentiated reward signals, obscuring individual step contributions…
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang et al. · Proceedings of the AAAI Con... · Mar 14, 2026
Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement l…
- Preference Robustness for DPO with Applications to Public Health
Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe · Proceedings of the AAAI Con... · Mar 14, 2026
We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignme…
- The Hidden Link Between RLHF and Contrastive Learning
Xufei Lv, Kehai Chen, Haoyuan Sun, Xuefeng Bai et al. · Submitted to ICLR 2026 · Sep 16, 2025
Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Pref…
- RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu et al. · CoRR 2024 · Dec 31, 2024
We present RLHFuse, an efficient training system with stage fusion for Reinforcement Learning from Human Feedback (RLHF). Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles…
- RLHF with Inconsistent Multi-Agent Feedback Under General Function Approximation: A Theoretical Perspective
Ming Shi, Yingbin Liang, Ness Shroff, Ananthram Swami · Submitted to ICLR 2025 · Sep 27, 2024
Reinforcement learning from human feedback (RLHF) has been widely studied, as a method for leveraging feedback from human evaluators to guide the learning process. However, existing theoretical analyses typically assume that the human feedb…
- RLHF Workflow: From Reward Modeling to Online RLHF
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang et al. · Accepted by TMLR · Sep 21, 2024
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM…
- RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik et al. · Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · Aug 11, 2024
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLMs alignment. Despite its advantages, RLHF relies on human annotators t…
- Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler · COLM · Jul 10, 2024
Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LM) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both the S…
- RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
Chanwoo Park, Mingyang Liu, Dingwen Kong, Kaiqing Zhang et al. · ARLET 2024 Poster · Jun 19, 2024 (also TF2M 2024 Poster · Jun 18, 2024)
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values, with remarkable successes in fine-tuning large-language models recently. Most existing RLHF paradigms make the unde…
- RLHF and IIA: Perverse Incentives
Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam et al. · ICML 2024 Workshop MHFAIA Oral · Jun 17, 2024
Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives…
- RLHF without RL
Mischa Panchenko · BT@ICLR2024 · Feb 16, 2024
Reinforcement learning from human feedback (RLHF) plays an important role in aligning language models to human preferences. However, there has been some discussion about whether RLHF is actually reinforcement learning at all. The environmen…
- Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback
Yifu Yuan, Jianye Hao, Yi Ma, Zibin Dong et al. · ICLR 2024 poster · Jan 16, 2024
Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning human preferences. It is crucial to consider diverse human feedback types…
- Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta et al. · NAACL (Short Papers) 2024 · Jan 1, 2024
As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM ven…