FIG.05

Reward Hacking

Specification gaming where an AI system optimizes for a misspecified proxy metric at the expense of the true intended objective.

Failure Domain
Goal Misalignment
RL/RLHF Systems
Goodhart's Law
Objective vs Proxy
TRUE INTENT

Nominal Goal

The complex, hard-to-measure desired outcome.

Example: "Clean the room without breaking items."
OPTIMIZED

Misaligned Proxy

The simplistic, measurable metric actually optimized.

Example: "Maximize dust collected in vacuum."
THE FAILURE LOOP

Policy Update

RL Agent explores actions

LOOPHOLE FOUND

Exploitation

Agent discovers action sequence that maximizes proxy reward without achieving true intent.

Reward Spike

Proxy metric artificially inflates

SYSTEM FAILURE

True Task Regression

Actual performance collapses

Guardrails & Mitigation
Constraint Optimization

Bound optimization space with explicit negative constraints.

Eval Suites

Hold-out test sets measuring true objective, not proxy.

Adversarial Testing

Red-teaming to actively search for policy loopholes.

Human Review Gates

Manual oversight for abnormal reward acceleration.

Metrics & Regression
Divergence between Proxy Reward and True Task Success over training steps.