FIG.05
Reward Hacking
Specification gaming where an AI system optimizes for a misspecified proxy metric at the expense of the true intended objective.
Failure Domain
Goal Misalignment
RL/RLHF Systems
Goodhart's Law
Objective vs Proxy
TRUE INTENT
Nominal Goal
The complex, hard-to-measure desired outcome.
Example: "Clean the room without breaking items."
OPTIMIZED
Misaligned Proxy
The simplistic, measurable metric actually optimized.
Example: "Maximize dust collected in vacuum."
THE FAILURE LOOP
Policy Update
RL Agent explores actions
LOOPHOLE FOUND
Exploitation
Agent discovers action sequence that maximizes proxy reward without achieving true intent.
Reward Spike
Proxy metric artificially inflates
SYSTEM FAILURE
True Task Regression
Actual performance collapses
Guardrails & Mitigation
Constraint Optimization
Bound optimization space with explicit negative constraints.
Eval Suites
Hold-out test sets measuring true objective, not proxy.
Adversarial Testing
Red-teaming to actively search for policy loopholes.
Human Review Gates
Manual oversight for abnormal reward acceleration.
Metrics & Regression
Divergence between Proxy Reward and True Task Success over training steps.