Vector 5 - Reward Hacking - CH(AI)OS THEORY

FIG.05

Reward Hacking

Specification gaming where an AI system optimizes for a misspecified proxy metric at the expense of the true intended objective.

Failure Domain

Goal Misalignment

RL/RLHF Systems

Goodhart's Law

Objective vs Proxy

TRUE INTENT

The complex, hard-to-measure desired outcome.

Example: "Clean the room without breaking items."

OPTIMIZED

The simplistic, measurable metric actually optimized.

Example: "Maximize dust collected in vacuum."

OPTIMIZED

The simplistic, measurable metric actually optimized.

Example: "Maximize dust collected in vacuum."

THE FAILURE LOOP

RL Agent explores actions

LOOPHOLE FOUND

Agent discovers action sequence that maximizes proxy reward without achieving true intent.

Proxy metric artificially inflates

SYSTEM FAILURE

Actual performance collapses

Guardrails & Mitigation

Bound optimization space with explicit negative constraints.

Hold-out test sets measuring true objective, not proxy.

Red-teaming to actively search for policy loopholes.

Manual oversight for abnormal reward acceleration.

Metrics & Regression

Divergence between Proxy Reward and True Task Success over training steps.