Your system runs every day. The Observatory says GREEN. Components healthy. Outcomes tracking. You start to trust it.
Then one morning your primary data source returns empty. Or your LLM provider has a 4-hour outage. Or someone pushes a config change that silently flips a threshold. And your beautiful autonomous system produces garbage — confidently, automatically, at scale.
You did not know it could break that way because you never tested it.
Netflix runs Chaos Monkey — it randomly kills production servers to prove the system survives. Google runs DiRT (Disaster Recovery Testing) — it simulates entire datacenter failures. These companies do not test because they are paranoid. They test because the only systems you can trust are the ones you have tried to break.
This issue gives you three prompts that build a chaos engineering practice for your AI system. Not theoretical. Not a framework diagram. Actual tests you can run this week that will find the failures hiding in your system right now.
The Five Failure Modes
Every AI system breaks in one of five ways. Your stress tests must cover all five.
| Mode | What Happens | Why It Is Dangerous |
|---|---|---|
| Input Starvation | A data source returns empty, stale, or malformed data | The pipeline runs “successfully” on bad inputs and produces confident wrong outputs |
| Provider Failure | Your LLM API, database, or external service goes down | Cascading failures: one timeout causes a retry storm that takes down adjacent systems |
| Config Drift | A threshold, weight, or parameter changes silently | No error is thrown. The system simply starts making different decisions with no alert. |
| Volume Spike | 10x the normal input volume hits at once | Rate limits, memory exhaustion, queue backlogs. What processes 100 items in 2 minutes may choke on 1,000. |
| Feedback Corruption | The system’s self-improvement loop learns the wrong lesson | The most insidious mode: the system “improves” itself into worse performance and the improvement metric says it is getting better. |
The blind spot: Most people test Mode 2 (provider failure) because it is the most visible. Mode 5 (feedback corruption) is almost never tested — and it is the one that can destroy months of work silently.
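One of these modes can be guarded against cheaply even before any chaos test runs: config drift throws no error, but a fingerprint of the last reviewed config turns a silent change into an alert. A minimal sketch, assuming your config is a JSON-serializable dict (the function name is illustrative):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Deterministic hash of the live config. Store the fingerprint of
    the last reviewed config and alert whenever the two diverge."""
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Checking this once per pipeline run costs microseconds and closes the "would any monitoring catch this within 24 hours?" question for Mode 3.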
Prompt 1 — The Failure Scenario Generator
Before you can test, you need to know what to test. This prompt maps your system’s attack surface.
You are a chaos engineering specialist for AI systems. Given a system
description, generate failure scenarios that are specific, testable,
and ranked by blast radius.
SYSTEM DESCRIPTION: [Describe your AI system: what it does, its data
sources, its decision points, its outputs, and who/what consumes
those outputs.]
For EACH of the 5 failure modes, generate 3 scenarios:
## MODE 1: INPUT STARVATION
For each data source your system depends on:
- Scenario A: Source returns empty response (HTTP 200, no data)
- Scenario B: Source returns stale data (valid format, 48hrs old)
- Scenario C: Source returns corrupted data (wrong schema, partial)
For each: What does your system do RIGHT NOW? What SHOULD it do?
## MODE 2: PROVIDER FAILURE
For each external service (LLM, database, API):
- Scenario A: Complete outage (connection refused)
- Scenario B: Degraded service (10x latency, intermittent 500s)
- Scenario C: Silent wrong answers (200 OK, garbage content)
For each: What is the blast radius? Which downstream systems break?
## MODE 3: CONFIG DRIFT
For each configurable parameter:
- Scenario A: Value doubled from current
- Scenario B: Value set to zero
- Scenario C: Value set to opposite sign / inverted boolean
For each: Would any monitoring catch this within 24 hours?
## MODE 4: VOLUME SPIKE
- Scenario A: 10x normal input volume
- Scenario B: 100x normal input volume
- Scenario C: Normal volume but 10x complexity per item
For each: What is the first bottleneck? At what multiple does the
system fail vs degrade gracefully?
## MODE 5: FEEDBACK CORRUPTION
For each self-improvement mechanism:
- Scenario A: Outcome data is inverted (wins recorded as losses)
- Scenario B: Training signal has 20% label noise
- Scenario C: Optimization metric diverges from actual goal
For each: How many cycles before damage is detectable?
OUTPUT FORMAT (per scenario):
- ID: [MODE]-[LETTER] (e.g., M1-A)
- Description: One sentence
- Blast radius: What breaks (list components)
- Detection time: How long before you would notice
- Current resilience: NONE / PARTIAL / FULL
- Test method: How to simulate this scenario safely
- Priority: P0-P3 based on (blast radius x probability)
What you get: A complete attack surface map of your AI system. Most teams discover 3–5 P0 scenarios they had never considered. The prompt forces you to think about each failure mode systematically rather than testing whatever comes to mind first.
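The per-scenario output format maps cleanly onto a record type, which makes the P0-P3 ranking reproducible instead of ad hoc. A minimal Python sketch; the field names and priority thresholds here are assumptions to tune, not part of the prompt:

```python
from dataclasses import dataclass

@dataclass
class FailureScenario:
    """One row of the attack surface map produced by Prompt 1."""
    scenario_id: str    # e.g. "M1-A"
    description: str
    blast_radius: int   # 1 (one component) to 5 (whole system + consumers)
    probability: float  # 0.0-1.0, estimated likelihood per quarter
    resilience: str     # "NONE" / "PARTIAL" / "FULL"

    @property
    def priority(self) -> str:
        # Hypothetical cutoffs on (blast radius x probability);
        # adjust them to your own risk tolerance.
        score = self.blast_radius * self.probability
        if score >= 3.0:
            return "P0"
        if score >= 1.5:
            return "P1"
        if score >= 0.5:
            return "P2"
        return "P3"
```

Keeping the scenarios as data rather than prose also feeds directly into the scorecard in Prompt 3.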
Prompt 2 — The Controlled Burn
Generating scenarios is step one. Running them safely is the hard part. This prompt designs test harnesses that break things without burning the house down.
You are a test harness designer for AI system chaos
experiments. Given a failure scenario, design a safe test
that validates system resilience without affecting
production outputs.
SCENARIO: [Paste one scenario from Prompt 1]
Design the test following these rules:
## ISOLATION
- The test MUST run against a shadow/staging copy.
Never inject failures into production pipelines.
- Define exactly which files, databases, and APIs the
shadow copy needs. List them.
- Specify what "reset to clean state" looks like after
the test completes.
## INJECTION METHOD
- How to simulate the failure:
- For data sources: mock file / modified API response
- For services: timeout wrapper / error-returning stub
- For config: parameter override file (not editing prod)
- For volume: replay tool with multiplier
- For feedback: synthetic outcome data with known errors
- The injection must be: reversible, logged, and scoped
to the test environment only.
## OBSERVATION POINTS
- What metrics to capture DURING the test:
- System behavior: Did it detect the failure?
- Degradation: Did quality degrade gracefully or cliff?
- Recovery: When the failure is removed, does the system
recover automatically or need manual intervention?
- Blast radius: Which components were affected? Which
were correctly isolated?
- Expected vs actual comparison template.
## PASS / FAIL CRITERIA
Define exactly what "passed" means:
- PASS: System detected failure within [X] minutes,
degraded gracefully, recovered automatically, and
no incorrect outputs reached downstream consumers.
- PARTIAL: System degraded gracefully but did not detect
or auto-recover. Human intervention needed.
- FAIL: Incorrect outputs reached downstream consumers
OR system failed to degrade gracefully (crash, hang,
silent wrong answers).
## RUNBOOK
Step-by-step instructions to run this test:
1. Set up shadow environment
2. Verify baseline (run system once, confirm normal output)
3. Inject failure
4. Observe for [duration]
5. Remove failure
6. Observe recovery for [duration]
7. Collect metrics and compare to pass/fail criteria
8. Reset environment to clean state
9. Document results
Run your first Controlled Burn on the scenario you are most confident your system handles. If it fails that one, you know the rest of your assumptions are also wrong. Start with your strongest case to calibrate your expectations before testing the scary ones.
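The injection methods the prompt lists for services (error-returning stub, timeout wrapper, garbage responses) can share one wrapper around your real client. A sketch, assuming a client with a single `complete(prompt)` method; the class and method names are illustrative:

```python
import random
import time

class FlakyProviderStub:
    """Wraps a real LLM/API client and injects Mode 2 failures on demand.
    Reversible (drop the wrapper), logged (self.calls), and scoped to
    whichever environment you instantiate it in."""

    def __init__(self, real_client, failure_mode=None, failure_rate=1.0):
        self.real = real_client
        self.failure_mode = failure_mode  # None, "outage", "latency", "garbage"
        self.failure_rate = failure_rate
        self.calls = []  # audit log of every injection decision

    def complete(self, prompt):
        inject = self.failure_mode is not None and random.random() < self.failure_rate
        self.calls.append((self.failure_mode if inject else "pass", prompt[:40]))
        if not inject:
            return self.real.complete(prompt)
        if self.failure_mode == "outage":      # Scenario M2-A
            raise ConnectionError("injected: connection refused")
        if self.failure_mode == "latency":     # Scenario M2-B
            time.sleep(10)
            return self.real.complete(prompt)
        return "lorem ipsum"                   # Scenario M2-C: 200 OK, garbage
```

Swapping the wrapper in only on the shadow copy keeps the isolation rule intact: production code never imports it.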
Prompt 3 — The Resilience Scorecard
After running your tests, you need a way to track what you found, what you fixed, and what still needs work. This prompt turns raw test results into a prioritized action plan.
You are a resilience auditor. Given chaos test results, produce a
scorecard that tells the system owner exactly where they stand and
what to fix first.
TEST RESULTS: [Paste results from your Controlled Burn runs]
## RESILIENCE SCORE
Calculate an overall score (0-100) using:
- Each P0 FAIL scenario: -20 points
- Each P1 FAIL scenario: -10 points
- Each P2 FAIL scenario: -5 points
- Each PARTIAL (any priority): -3 points
- Each PASS (any priority): +0 (baseline expectation)
- Bonus: +5 for each auto-recovery confirmed
- Bonus: +3 for each failure detected within 5 minutes
Start at 100, apply penalties and bonuses. Floor at 0.
## FAILURE MAP
A table showing:
| Scenario | Result | Detection Time | Recovery | Fix Effort |
For each failed or partial scenario, include:
- Root cause: Why did the system fail this test?
- Fix category: GUARD (add input validation), FALLBACK (add
  degraded-mode behavior), CIRCUIT BREAKER (add timeout/retry
  limits), MONITOR (add detection), or ARCHITECTURE (structural
  change needed)
- Estimated fix effort: HOURS / DAYS / WEEKS
- Dependencies: What else must change for this fix to work?
## PRIORITY MATRIX
Rank all fixes by: (blast radius x probability) / effort
The highest-value fixes go first. Present as an ordered list with:
1. [Fix name] — [One sentence] — Effort: [X] — Value: [HIGH/MED/LOW]
## RETEST SCHEDULE
For each fix implemented, when to rerun the chaos test:
- Immediately after fix (regression check)
- 7 days later (stability check)
- 30 days later (drift check)
Track retest results over time to confirm fixes hold.
## TREND ANALYSIS
If this is not the first scorecard:
- Score trend: improving, stable, or declining?
- New failures discovered since last test?
- Previously fixed items that regressed?
- Overall system resilience trajectory
What you get: A single number that captures your system’s resilience, a prioritized fix list ordered by value, and a retest schedule that prevents fixes from regressing. The scorecard becomes your tracking document — run it monthly and watch the number climb.
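The scoring rules in the prompt are mechanical enough to compute directly, which keeps the number consistent from one monthly run to the next. A sketch, assuming each test result is a small dict (the key names are assumptions):

```python
def resilience_score(results):
    """Apply the scorecard rules: start at 100, penalize FAIL and
    PARTIAL results, reward auto-recovery and fast detection, floor at 0."""
    fail_penalty = {"P0": 20, "P1": 10, "P2": 5}
    score = 100
    for r in results:
        if r["result"] == "FAIL":
            score -= fail_penalty.get(r["priority"], 0)
        elif r["result"] == "PARTIAL":
            score -= 3  # any priority
        if r.get("auto_recovered"):
            score += 5
        detection = r.get("detection_minutes")
        if detection is not None and detection <= 5:
            score += 3
    return max(score, 0)
```

Because the function is deterministic, a declining score in the trend analysis always means the system changed, not the measurement.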
The Testing Cadence
Chaos testing is not a one-time event. It is a practice with a rhythm.
| Frequency | What to Test | Time Required |
|---|---|---|
| Weekly | One random P0 scenario from your failure map. Rotate through them. | 30 minutes |
| After every change | The specific failure modes related to what you changed. Changed a data source? Run Mode 1 tests. Changed a threshold? Run Mode 3. | 15 minutes |
| Monthly | Full suite across all 5 modes. Update the resilience scorecard. | 2 hours |
| After any real incident | The exact failure that occurred in production. Add it to your scenario library. It is now a permanent test. | 1 hour |
The golden rule: Every production incident becomes a chaos test. If it happened once, it will happen again. The only question is whether your system handles it automatically next time.
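Promoting an incident into the permanent scenario library can be one small function; a sketch, assuming a JSON file as the scenario store (the path and field names are illustrative):

```python
import datetime
import json
import pathlib

LIBRARY = pathlib.Path("chaos_scenarios.json")  # hypothetical location

def promote_incident(incident_id, mode, description, test_method):
    """Append a real production incident to the permanent scenario
    library so it is rerun on the normal cadence forever after."""
    scenarios = json.loads(LIBRARY.read_text()) if LIBRARY.exists() else []
    scenarios.append({
        "id": incident_id,
        "mode": mode,              # 1-5, one of the five failure modes
        "description": description,
        "test_method": test_method,
        "source": "production incident",
        "added": datetime.date.today().isoformat(),
    })
    LIBRARY.write_text(json.dumps(scenarios, indent=2))
```

Making promotion this cheap removes the usual excuse for skipping it during post-incident cleanup.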
The Uncomfortable Truth About AI Systems
Every AI system you have not stress-tested is running on luck. The data sources have been stable so far. The LLM provider has been up so far. The config has not drifted so far.
“So far” is not a resilience strategy. It is a countdown.
The organizations that survive are not the ones with the most sophisticated AI. They are the ones that know exactly how their system fails — because they made it fail first.
Start with Mode 1 (Input Starvation). It is the easiest to simulate (just rename a data file), the most common in practice, and the most likely to reveal that your pipeline silently produces outputs from stale data. Most people discover their system has been running on yesterday’s data at least once — and nobody noticed.
Try It This Week
Pick your system’s most critical data source. Rename it. Run your pipeline. Watch what happens. Does it error? Does it use cached data? Does it produce output anyway with no warning? Whatever happens, you just learned something no amount of normal testing would have revealed.
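That experiment can be scripted so the source file is always restored, even when the pipeline throws. A minimal Python sketch; the pipeline callable and verdict labels are assumptions:

```python
import pathlib

def starve_and_run(pipeline, data_file):
    """Input-starvation (Mode 1) smoke test: hide the source, run the
    pipeline, classify the behavior, and always restore the file."""
    src = pathlib.Path(data_file)
    hidden = src.with_suffix(src.suffix + ".chaos")
    src.rename(hidden)
    try:
        output = pipeline()
        # Output despite no input is the dangerous outcome: silent garbage.
        verdict = "SILENT" if output else "EMPTY"
    except Exception as exc:
        verdict = f"ERROR: {type(exc).__name__}"  # loud failure is detectable
    finally:
        hidden.rename(src)  # restore no matter what happened
    return verdict
```

An ERROR verdict means the failure is at least visible; SILENT is the result that should send you straight back to Prompt 2.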
The system you can trust is the system you have tried to break. Start breaking.