Reinforcement learning from human feedback — citation graph of key research

Explore influential research on reinforcement learning from human feedback (RLHF) as an interactive citation graph on Constellation. The papers below are connected by direct citations and shared references. Open any one to center the graph on it and discover related work.

Top papers on reinforcement learning from human feedback

  1. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    Yuntao Bai, Andy Jones, Kamal Ndousse et al. — 2022 · arXiv (Cornell University) · 364 citations
  2. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
    Stephen Casper, Xander Davies, Claudia Shi et al. — 2023 · arXiv (Cornell University) · 89 citations
  3. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
    Harrison Lee, Samrat Phatale, Hassan Mansoor et al. — 2023 · arXiv (Cornell University) · 67 citations
  4. A Survey of Reinforcement Learning from Human Feedback
    Timo Kaufmann, Paul Weng, Viktor Bengs et al. — 2023 · arXiv (Cornell University) · 34 citations
  5. RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
    Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari et al. — 2025 · ACM Computing Surveys · 44 citations
  6. Training language models to follow instructions with human feedback
    Long Ouyang, Jeff Wu, Xu Jiang et al. — 2022 · arXiv (Cornell University) · 4,283 citations
  7. The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities
    Timothy R. McIntosh, Teo Sušnjak, Tong Liu et al. — 2024 · IEEE Transactions on Cognitive and Developmental Systems · 54 citations
  8. Safe RLHF: Safe Reinforcement Learning from Human Feedback
    Josef Dai, Xuehai Pan, Ruiyang Sun et al. — 2023 · arXiv (Cornell University) · 20 citations
  9. Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing
    Stefano Palminteri, Germain Lefebvre, Emma J. Kilford et al. — 2017 · PLoS Computational Biology · 245 citations
  10. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback
    Adam Dahlgren Lindström, Leila Methnani, Lea Krause et al. — 2025 · Ethics and Information Technology · 25 citations
  11. Reinforcement Learning from Human Feedback in LLMs: Whose Culture, Whose Values, Whose Perspectives?
    Kristian González Barman, Simon Lohse, Henk W. de Regt — 2025 · Philosophy & Technology · 15 citations
  12. Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback
    Lu Wang, Chaoyun Zhang, Ruomeng Ding et al. — 2023 · 23 citations