
Andrew Fairless, Ph.D.

Data, Science, and Tinkering


What I Read: Reward Hacking


By Andrew Fairless on April 1, 2025 / December 21, 2024

https://lilianweng.github.io/posts/2024-11-28-reward-hacking

Reward Hacking in Reinforcement Learning
Lilian Weng
November 28, 2024

“Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task.”
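To make the definition concrete, here is a minimal toy sketch in Python. It is not taken from Weng's post. The track, the checkpoint reward, and both policies are assumptions invented for illustration. The intended task is to walk to the end of a short track, but the reward function pays out every time the agent steps onto a checkpoint tile, and a policy that just oscillates on and off that tile out-scores the one that finishes.

```python
# Hypothetical toy environment: every name and number here is an
# illustrative assumption, not something from the quoted article.
GOAL = 4          # intended task: reach this position on a 1-D track
CHECKPOINT = 1    # flawed reward: pays on every visit, not just the first
HORIZON = 10      # maximum steps per episode


def run_episode(policy):
    """Roll out a policy; return (proxy_reward, task_completed)."""
    pos, proxy_reward = 0, 0
    for t in range(HORIZON):
        # Apply the chosen step (-1 or +1), clamped to the track.
        pos = max(0, min(GOAL, pos + policy(pos, t)))
        if pos == CHECKPOINT:
            proxy_reward += 1      # the exploitable flaw
        if pos == GOAL:
            proxy_reward += 3      # bonus for actually finishing
            return proxy_reward, True
    return proxy_reward, False


def intended_policy(pos, t):
    """Always step right, toward the goal."""
    return 1


def hacking_policy(pos, t):
    """Oscillate between positions 0 and 1 to farm the checkpoint."""
    return 1 if pos == 0 else -1


intended = run_episode(intended_policy)
hacked = run_episode(hacking_policy)
print("intended:", intended)
print("hacked:  ", hacked)
```

With these made-up numbers the oscillating policy collects more proxy reward than the policy that completes the task, even though it never reaches the goal. That gap between measured reward and intended behavior is what the quote describes.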

Category: What I Learn
Tags: adversarial, alignment, large language model, machine learning, reinforcement learning, reward


Copyright © 2025 Andrew Fairless, Ph.D. All Rights Reserved.
