The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities
Timothy R. McIntosh, Teo Sušnjak, Tong Liu et al.
2024 · IEEE Transactions on Cognitive and Developmental Systems · 54 citations
This study is an empirical investigation into the semantic vulnerabilities of four popular pre-trained commercial Large Language Models (LLMs) to ideological manipulation. Using tactics reminiscent of human semantic conditioning in psychology, we have induced and assessed ideological misalignments and their retention in four commercial pre-trained LLMs, in response to 30 controversial questions that spanned a broad ideological and social spectrum, encompassing both extreme left-wing and right…