What I Read: Reward Hacking

https://lilianweng.github.io/posts/2024-11-28-reward-hacking

Reward Hacking in Reinforcement Learning
Lilian Weng
November 28, 2024


“Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task.”
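
A toy illustration of that definition, not taken from the post: a minimal sketch assuming a hypothetical 1-D track where the intended task is to reach a goal tile, but the reward function also pays a point every time the agent steps onto a checkpoint tile. All names and numbers here (`run_episode`, `CHECKPOINT`, `GOAL`, `FINISH_BONUS`, the 20-step budget) are made up for illustration.

```python
CHECKPOINT = 2    # hypothetical tile that pays +1 on every entry (the flaw)
GOAL = 5          # hypothetical tile that completes the intended task
FINISH_BONUS = 3  # hypothetical one-time reward for finishing


def run_episode(policy, steps=20):
    """Run one episode and return (total_reward, reached_goal)."""
    pos, total = 0, 0
    for t in range(steps):
        pos = max(0, pos + policy(pos, t))  # action is -1 or +1
        if pos == CHECKPOINT:
            total += 1  # flaw: nothing stops the agent from re-entering
        if pos == GOAL:
            return total + FINISH_BONUS, True  # finishing ends the episode
    return total, False


def intended_policy(pos, t):
    """Always move toward the goal, as the designer intended."""
    return 1


def hacking_policy(pos, t):
    """Walk to the checkpoint, then step off and back on to re-trigger it."""
    return 1 if pos < CHECKPOINT else -1


if __name__ == "__main__":
    print(run_episode(intended_policy))  # (4, True)
    print(run_episode(hacking_policy))   # (10, False)
```

The looping policy collects more than twice the reward of the policy that finishes the task, and it never reaches the goal. A reward-maximizing learner in this setup would plausibly drift toward the loop, which is the gap between "high reward" and "intended task" that the quote describes.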