AI Chaos Engineering is the practice of subjecting AI-integrated systems to controlled, deliberate faults to discover weaknesses before they surface in production. It operates at integration boundaries -- the surfaces where models, pipelines, operators, and downstream systems interact -- and measures resilience in semantic terms, not just structural ones.
The discipline synthesizes techniques from four adjacent fields -- chaos engineering, adversarial ML, site reliability engineering, and cybersecurity -- and applies them to a class of failures that none of those fields addresses on its own: silent AI degradation at the operational layer.
AI systems don't crash. They degrade. They produce outputs that are syntactically valid, contextually plausible, and operationally catastrophic. Nothing in your monitoring stack catches them, because every HTTP response was 200 OK.
This is fundamentally different from the failure modes that existing disciplines were built to handle. Infrastructure fails loudly. Models fail quietly. The gap between those two realities is where AI Chaos Engineering lives.
AI systems don't fail in isolation. They fail at the seams -- where a model's output becomes another system's input, where confidence scores cross trust boundaries, where human operators make decisions based on AI recommendations. The discipline focuses on these integration surfaces, not model internals.
Infrastructure metrics tell you the system is running. They don't tell you the system is correct. AI Chaos Engineering introduces semantic observability: monitoring the meaning, calibration, and operational validity of AI outputs, not just their latency and throughput.
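A minimal sketch of what a semantic probe might look like, assuming a hypothetical response shape with `label` and `confidence` fields and an illustrative downstream action set -- the point is that every check here passes a 200 OK and fails a semantic one:

```python
from dataclasses import dataclass

@dataclass
class SemanticCheck:
    """Result of one semantic probe on a model output."""
    name: str
    passed: bool
    detail: str

def check_output(output: dict) -> list[SemanticCheck]:
    """Probe the meaning of a response, not its transport status."""
    checks = []
    # Structural validity: the confidence field exists and is numeric.
    score = output.get("confidence")
    checks.append(SemanticCheck(
        "confidence_present", isinstance(score, float),
        f"confidence={score!r}"))
    # Calibration sanity: a probability must stay in [0, 1].
    if isinstance(score, float):
        checks.append(SemanticCheck(
            "confidence_in_range", 0.0 <= score <= 1.0,
            f"confidence={score:.3f}"))
    # Operational validity: the label must be one the downstream
    # system knows how to act on (hypothetical action set).
    allowed = {"approve", "deny", "escalate"}
    checks.append(SemanticCheck(
        "label_actionable", output.get("label") in allowed,
        f"label={output.get('label')!r}"))
    return checks

# A response that is "200 OK" yet semantically broken:
resp = {"label": "aprove", "confidence": 1.7}
failures = [c for c in check_output(resp) if not c.passed]
```

Probes like these run alongside latency and throughput dashboards; they fire on meaning, which is exactly the signal infrastructure metrics miss.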
Every experiment starts with a deliberate fault. Network partitions, confidence corruption, sensor degradation, tool unavailability. You choose the failure mode, control its intensity, and measure the blast radius. The goal is learning, not breakage.
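One of those failure modes, confidence corruption, can be sketched in a few lines. This is an illustrative experiment harness, not a production tool: `intensity` is the controlled fault parameter, and blast radius is measured as the fraction of threshold decisions that flip under the fault.

```python
import random

def corrupt_confidence(score: float, intensity: float,
                       rng: random.Random) -> float:
    """Deliberately perturb a confidence score.

    intensity in [0, 1] bounds the perturbation; the result is
    clamped back into the valid probability range."""
    noise = rng.uniform(-intensity, intensity)
    return min(1.0, max(0.0, score + noise))

def run_experiment(scores: list[float], threshold: float = 0.5,
                   intensity: float = 0.3, seed: int = 0) -> float:
    """Measure blast radius: the fraction of accept/reject
    decisions that flip when confidence is corrupted."""
    rng = random.Random(seed)
    flips = 0
    for s in scores:
        baseline = s >= threshold
        faulted = corrupt_confidence(s, intensity, rng) >= threshold
        if baseline != faulted:
            flips += 1
    return flips / len(scores)
```

Scores far from the threshold are immune at low intensity; scores near it flip easily. Sweeping `intensity` upward and watching where the blast radius spikes tells you how much calibration margin the downstream decision actually has.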
A system that passed chaos experiments six months ago may not pass them today. Models drift. Pipelines evolve. Operator behavior changes. Resilience is a continuous property, not a one-time certification. Run experiments on a cadence, not a calendar.
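Making that cadence useful means comparing runs, not just executing them. A sketch, with hypothetical experiment names and a JSON-lines history file as an assumed storage format: each run is appended, and a regression is any experiment that passed last time and fails now.

```python
import json
import time

def record_run(results: dict[str, bool],
               history_path: str = "chaos_history.jsonl") -> None:
    """Append one run's pass/fail outcomes so drift is visible
    across runs, not just within one."""
    entry = {"ts": time.time(), "results": results}
    with open(history_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def regressed(history_path: str = "chaos_history.jsonl") -> list[str]:
    """Experiments that passed on the previous run but fail now."""
    with open(history_path) as f:
        runs = [json.loads(line) for line in f]
    if len(runs) < 2:
        return []
    prev, curr = runs[-2]["results"], runs[-1]["results"]
    return [name for name, ok in curr.items()
            if not ok and prev.get(name, False)]
```

A non-empty `regressed()` list is the signal the paragraph above describes: the system changed underneath a certification that was only ever valid for the system as it existed six months ago.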