Reinforcement learning from human feedback — citation graph of key research

Explore the most influential research on reinforcement learning from human feedback as an interactive citation graph on Constellation. The papers below are connected by direct citations and shared references — open any one to center the graph on it and discover related work.

Top papers on reinforcement learning from human feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse et al. — 2022 · arXiv (Cornell University) · 372 citations
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al. — 2022 · arXiv (Cornell University) · 4,302 citations
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper et al. — 2023 · arXiv (Cornell University) · 93 citations
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor et al. — 2023 · arXiv (Cornell University) · 69 citations
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari et al. — 2025 · ACM Computing Surveys · 57 citations
A Survey of Reinforcement Learning from Human Feedback
Timo Kaufmann, Paul Weng, Viktor Bengs et al. — 2023 · arXiv (Cornell University) · 36 citations
The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities
Timothy R. McIntosh, Teo Sušnjak, Tong Liu et al. — 2024 · IEEE Transactions on Cognitive and Developmental Systems · 57 citations
Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing
Stefano Palminteri, Germain Lefebvre, Emma J. Kilford et al. — 2017 · PLoS Computational Biology · 251 citations
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun et al. — 2023 · arXiv (Cornell University) · 21 citations
Inline Hardware KV-Cache Compression for Long-Context Transformer Inference: An Architectural Case for a Memory-Path Compression Engine
Jan Jakubův, Karel Chvalovský, Zarathustra Goertzel et al. — 2023 · DROPS (Schloss Dagstuhl – Leibniz Center for Informatics) · 77,005 citations
Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback
Adam Dahlgren Lindström, Leila Methnani, Lea Krause et al. — 2025 · Ethics and Information Technology · 37 citations
A Comprehensive Survey of Multiagent Reinforcement Learning
Lucian Buşoniu, Robert Babuška, Bart De Schutter — 2008 · IEEE Transactions on Systems Man and Cybernetics Part C (Applications and Reviews) · 2,206 citations

Constellation — Explore research as a citation graph

Constellation maps the world's scientific literature as an interactive graph. Search any research topic or paste a paper's DOI or arXiv link, and see how works connect through citations and shared references. Discover the sub-fields of an area, tell foundational work from the frontier, follow citation trails, and generate a synthesis of the landscape — across every discipline, powered by OpenAlex.
