Reinforcement learning from human feedback — citation graph of key research
Explore the most influential research on reinforcement learning from human feedback as an interactive citation graph on Constellation. The papers below are connected by direct citations and shared references — open any one to center the graph on it and discover related work.
Top papers on reinforcement learning from human feedback
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse et al. — 2022 · arXiv (Cornell University) · 364 citations
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi et al. — 2023 · arXiv (Cornell University) · 89 citations
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor et al. — 2023 · arXiv (Cornell University) · 67 citations
- A Survey of Reinforcement Learning from Human Feedback
Timo Kaufmann, Paul Weng, Viktor Bengs et al. — 2023 · arXiv (Cornell University) · 34 citations
- RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari et al. — 2025 · ACM Computing Surveys · 44 citations
- Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al. — 2022 · arXiv (Cornell University) · 4,283 citations
- The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities
Timothy R. McIntosh, Teo Sušnjak, Tong Liu et al. — 2024 · IEEE Transactions on Cognitive and Developmental Systems · 54 citations
- Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun et al. — 2023 · arXiv (Cornell University) · 20 citations
- Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing
Stefano Palminteri, Germain Lefebvre, Emma J. Kilford et al. — 2017 · PLoS Computational Biology · 245 citations
- Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback
Adam Dahlgren Lindström, Leila Methnani, Lea Krause et al. — 2025 · Ethics and Information Technology · 25 citations
- Reinforcement Learning from Human Feedback in LLMs: Whose Culture, Whose Values, Whose Perspectives?
Kristian González Barman, Simon Lohse, Henk W. de Regt — 2025 · Philosophy & Technology · 15 citations
- Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback
Lu Wang, Chaoyun Zhang, Ruomeng Ding et al. — 2023 · 23 citations