Issue #27

The Stress Test: How to Break Your AI System on Purpose So It Never Breaks by Accident

The AI Playbook · 15 min read · 3 prompts
← Issue #26: The Observatory

Your system runs every day. The Observatory says GREEN. Components healthy. Outcomes tracking. You start to trust it.

Then one morning your primary data source returns empty. Or your LLM provider has a 4-hour outage. Or someone pushes a config change that silently flips a threshold. And your beautiful autonomous system produces garbage — confidently, automatically, at scale.

You did not know it could break that way because you never tested it.

Netflix runs Chaos Monkey — it randomly kills production servers to prove the system survives. Google runs DiRT (Disaster Recovery Testing) — it simulates entire datacenter failures. These companies do not test because they are paranoid. They test because the only systems you can trust are the ones you have tried to break.

This issue gives you three prompts that build a chaos engineering practice for your AI system. Not theoretical. Not a framework diagram. Actual tests you can run this week that will find the failures hiding in your system right now.


The Five Failure Modes

Every AI system breaks in one of five ways. Your stress tests must cover all five.

| Mode | What Happens | Why It Is Dangerous |
| --- | --- | --- |
| Input Starvation | A data source returns empty, stale, or malformed data | The pipeline runs “successfully” on bad inputs and produces confident wrong outputs |
| Provider Failure | Your LLM API, database, or external service goes down | Cascading failures: one timeout causes a retry storm that takes down adjacent systems |
| Config Drift | A threshold, weight, or parameter changes silently | No error is thrown. The system simply starts making different decisions with no alert. |
| Volume Spike | 10x the normal input volume hits at once | Rate limits, memory exhaustion, queue backlogs. What processes 100 items in 2 minutes may choke on 1,000. |
| Feedback Corruption | The system’s self-improvement loop learns the wrong lesson | The most insidious mode: the system “improves” itself into worse performance and the improvement metric says it is getting better. |

The blind spot: Most people test Mode 2 (provider failure) because it is the most visible. Mode 5 (feedback corruption) is almost never tested — and it is the one that can destroy months of work silently.


Prompt 1 — The Failure Scenario Generator

Before you can test, you need to know what to test. This prompt maps your system’s attack surface.

Prompt 1 — The Failure Scenario Generator
You are a chaos engineering specialist for AI systems.
Given a system description, generate failure scenarios
that are specific, testable, and ranked by blast radius.

SYSTEM DESCRIPTION:
[Describe your AI system: what it does, its data sources,
its decision points, its outputs, and who/what consumes
those outputs.]

For EACH of the 5 failure modes, generate 3 scenarios:

## MODE 1: INPUT STARVATION
For each data source your system depends on:
- Scenario A: Source returns empty response (HTTP 200, no data)
- Scenario B: Source returns stale data (valid format, 48hrs old)
- Scenario C: Source returns corrupted data (wrong schema, partial)
For each: What does your system do RIGHT NOW? What SHOULD it do?

## MODE 2: PROVIDER FAILURE
For each external service (LLM, database, API):
- Scenario A: Complete outage (connection refused)
- Scenario B: Degraded service (10x latency, intermittent 500s)
- Scenario C: Silent wrong answers (200 OK, garbage content)
For each: What is the blast radius? Which downstream systems break?

## MODE 3: CONFIG DRIFT
For each configurable parameter:
- Scenario A: Value doubled from current
- Scenario B: Value set to zero
- Scenario C: Value set to opposite sign / inverted boolean
For each: Would any monitoring catch this within 24 hours?

## MODE 4: VOLUME SPIKE
- Scenario A: 10x normal input volume
- Scenario B: 100x normal input volume
- Scenario C: Normal volume but 10x complexity per item
For each: What is the first bottleneck? At what multiple
does the system fail vs degrade gracefully?

## MODE 5: FEEDBACK CORRUPTION
For each self-improvement mechanism:
- Scenario A: Outcome data is inverted (wins recorded as losses)
- Scenario B: Training signal has 20% label noise
- Scenario C: Optimization metric diverges from actual goal
For each: How many cycles before damage is detectable?

OUTPUT FORMAT (per scenario):
- ID: [MODE]-[LETTER] (e.g., M1-A)
- Description: One sentence
- Blast radius: What breaks (list components)
- Detection time: How long before you would notice
- Current resilience: NONE / PARTIAL / FULL
- Test method: How to simulate this scenario safely
- Priority: P0-P3 based on (blast radius x probability)

What you get: A complete attack surface map of your AI system. Most teams discover 3–5 P0 scenarios they had never considered. The prompt forces you to think about each failure mode systematically rather than testing whatever comes to mind first.
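Mode 3 findings usually translate into a drift tripwire: fingerprint the approved config once, then compare on every run. A minimal sketch, assuming your config lives in a JSON-serializable dict (the function names are hypothetical):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash a config dict deterministically so any silent change is visible."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def detect_drift(current: dict, baseline_fp: str) -> bool:
    """True if the live config no longer matches the approved baseline."""
    return config_fingerprint(current) != baseline_fp
```

The point is not the hashing; it is that a doubled threshold or inverted boolean now trips an alert within one run instead of going unnoticed for 24 hours.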


Prompt 2 — The Controlled Burn

Generating scenarios is step one. Running them safely is the hard part. This prompt designs test harnesses that break things without burning the house down.

Prompt 2 — The Controlled Burn
You are a test harness designer for AI system chaos
experiments. Given a failure scenario, design a safe test
that validates system resilience without affecting
production outputs.

SCENARIO: [Paste one scenario from Prompt 1]

Design the test following these rules:

## ISOLATION
- The test MUST run against a shadow/staging copy.
  Never inject failures into production pipelines.
- Define exactly which files, databases, and APIs the
  shadow copy needs. List them.
- Specify what "reset to clean state" looks like after
  the test completes.

## INJECTION METHOD
- How to simulate the failure:
  - For data sources: mock file / modified API response
  - For services: timeout wrapper / error-returning stub
  - For config: parameter override file (not editing prod)
  - For volume: replay tool with multiplier
  - For feedback: synthetic outcome data with known errors
- The injection must be: reversible, logged, and scoped
  to the test environment only.

## OBSERVATION POINTS
- What metrics to capture DURING the test:
  - System behavior: Did it detect the failure?
  - Degradation: Did quality degrade gracefully or cliff?
  - Recovery: When the failure is removed, does the system
    recover automatically or need manual intervention?
  - Blast radius: Which components were affected? Which
    were correctly isolated?
- Expected vs actual comparison template.

## PASS / FAIL CRITERIA
Define exactly what "passed" means:
- PASS: System detected failure within [X] minutes,
  degraded gracefully, recovered automatically, and
  no incorrect outputs reached downstream consumers.
- PARTIAL: System degraded gracefully but did not detect
  or auto-recover. Human intervention needed.
- FAIL: Incorrect outputs reached downstream consumers
  OR system failed to degrade gracefully (crash, hang,
  silent wrong answers).

## RUNBOOK
Step-by-step instructions to run this test:
1. Set up shadow environment
2. Verify baseline (run system once, confirm normal output)
3. Inject failure
4. Observe for [duration]
5. Remove failure
6. Observe recovery for [duration]
7. Collect metrics and compare to pass/fail criteria
8. Reset environment to clean state
9. Document results
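The injection methods above (error-returning stub, latency wrapper, silent garbage) can be sketched as a single wrapper around whatever call your shadow environment makes to a provider. This is an illustrative sketch for test environments only; `ChaosWrapper` and its mode names are made-up, not a real library.

```python
import random
import time

class ChaosWrapper:
    """Wrap a provider call with one injected failure mode.

    Modes: "outage" raises, "latency" delays, "garbage" returns a
    well-formed response with junk content, None passes through.
    """

    def __init__(self, real_call, mode=None, delay=2.0, seed=None):
        self.real_call = real_call
        self.mode = mode
        self.delay = delay
        self.rng = random.Random(seed)  # seeded so runs are reproducible

    def __call__(self, *args, **kwargs):
        if self.mode == "outage":
            raise ConnectionError("injected: connection refused")
        if self.mode == "latency":
            time.sleep(self.delay)  # simulate degraded service
            return self.real_call(*args, **kwargs)
        if self.mode == "garbage":
            # The silent-wrong-answer case: success shape, useless content.
            return {"status": "ok",
                    "content": "".join(self.rng.choices("xyz", k=16))}
        return self.real_call(*args, **kwargs)
```

Because the injection lives in a wrapper rather than in the provider itself, it is reversible by construction: removing the wrapper restores the baseline.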
Pro Tip

Run your first Controlled Burn on the scenario you are most confident your system handles. If it fails that one, you know the rest of your assumptions are also wrong. Start with your strongest case to calibrate your expectations before testing the scary ones.


Prompt 3 — The Resilience Scorecard

After running your tests, you need a way to track what you found, what you fixed, and what still needs work. This prompt turns raw test results into a prioritized action plan.

Prompt 3 — The Resilience Scorecard
You are a resilience auditor. Given chaos test results,
produce a scorecard that tells the system owner exactly
where they stand and what to fix first.

TEST RESULTS: [Paste results from your Controlled Burn runs]

## RESILIENCE SCORE
Calculate an overall score (0-100) using:
- Each P0 FAIL scenario: -20 points
- Each P1 FAIL scenario: -10 points
- Each P2 FAIL scenario: -5 points
- Each PARTIAL (any priority): -3 points
- Each PASS (any priority): +0 (baseline expectation)
- Bonus: +5 for each auto-recovery confirmed
- Bonus: +3 for each failure detected within 5 minutes
Start at 100, apply penalties and bonuses. Floor at 0.

## FAILURE MAP
A table showing:
| Scenario | Result | Detection Time | Recovery | Fix Effort |
For each failed or partial scenario, include:
- Root cause: Why did the system fail this test?
- Fix category: GUARD (add input validation), FALLBACK
  (add degraded-mode behavior), CIRCUIT BREAKER (add
  timeout/retry limits), MONITOR (add detection), or
  ARCHITECTURE (structural change needed)
- Estimated fix effort: HOURS / DAYS / WEEKS
- Dependencies: What else must change for this fix to work?

## PRIORITY MATRIX
Rank all fixes by: (blast radius x probability) / effort
The highest-value fixes go first. Present as an ordered
list with:
1. [Fix name] — [One sentence] — Effort: [X] — Value: [HIGH/MED/LOW]

## RETEST SCHEDULE
For each fix implemented, when to rerun the chaos test:
- Immediately after fix (regression check)
- 7 days later (stability check)
- 30 days later (drift check)
Track retest results over time to confirm fixes hold.

## TREND ANALYSIS
If this is not the first scorecard:
- Score trend: improving, stable, or declining?
- New failures discovered since last test?
- Previously fixed items that regressed?
- Overall system resilience trajectory

What you get: A single number that captures your system’s resilience, a prioritized fix list ordered by value, and a retest schedule that prevents fixes from regressing. The scorecard becomes your tracking document — run it monthly and watch the number climb.
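The scoring rules are plain arithmetic, so you can compute the number yourself rather than trusting the model's math. A sketch of the same rules in Python; the shape of each `results` record is an assumption, and the prompt does not assign a penalty to P3 failures, so none is applied here.

```python
def resilience_score(results: list[dict]) -> int:
    """Start at 100, apply the scorecard's penalties and bonuses, floor at 0."""
    score = 100
    fail_penalty = {"P0": 20, "P1": 10, "P2": 5}  # P3 FAIL: no stated penalty
    for r in results:
        if r["result"] == "FAIL":
            score -= fail_penalty.get(r["priority"], 0)
        elif r["result"] == "PARTIAL":
            score -= 3
        # PASS is the baseline expectation: no points either way.
        if r.get("auto_recovered"):
            score += 5
        if r.get("detection_minutes") is not None and r["detection_minutes"] <= 5:
            score += 3
    return max(0, score)
```

Keeping the calculation in code also gives you a stable trend line: the number means the same thing on every monthly run, regardless of which model produced the scorecard.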


The Testing Cadence

Chaos testing is not a one-time event. It is a practice with a rhythm.

| Frequency | What to Test | Time Required |
| --- | --- | --- |
| Weekly | One random P0 scenario from your failure map. Rotate through them. | 30 minutes |
| After every change | The specific failure modes related to what you changed. Changed a data source? Run Mode 1 tests. Changed a threshold? Run Mode 3. | 15 minutes |
| Monthly | Full suite across all 5 modes. Update the resilience scorecard. | 2 hours |
| After any real incident | The exact failure that occurred in production. Add it to your scenario library. It is now a permanent test. | 1 hour |

The golden rule: Every production incident becomes a chaos test. If it happened once, it will happen again. The only question is whether your system handles it automatically next time.


What This Looks Like After 90 Days

Resilience score after 3 monthly cycles: 87
You started at 34. Most AI systems do. After three months of weekly chaos tests, targeted fixes, and retests, your system handles failures that would have caused silent data corruption on day one. You sleep better. Not because nothing breaks — but because you know what breaks and how your system responds.

The Uncomfortable Truth About AI Systems

Every AI system you have not stress-tested is running on luck. The data sources have been stable so far. The LLM provider has been up so far. The config has not drifted so far.

“So far” is not a resilience strategy. It is a countdown.

The organizations that survive are not the ones with the most sophisticated AI. They are the ones that know exactly how their system fails — because they made it fail first.

Pro Tip

Start with Mode 1 (Input Starvation). It is the easiest to simulate (just rename a data file), the most common in practice, and the most likely to reveal that your pipeline silently produces outputs from stale data. Most people discover their system has been running on yesterday’s data at least once — and nobody noticed.


Try It This Week

Pick your system’s most critical data source. Rename it. Run your pipeline. Watch what happens. Does it error? Does it use cached data? Does it produce output anyway with no warning? Whatever happens, you just learned something no amount of normal testing would have revealed.
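If you want that drill to be repeatable, wrap it in a script that always restores the file, even when the pipeline crashes. A hedged sketch: `source` and `pipeline_cmd` are placeholders for your own data file and entry point, and this should only ever touch a staging copy.

```python
from pathlib import Path
import shutil
import subprocess

def starvation_drill(source: str, pipeline_cmd: list[str]) -> int:
    """Hide the input, run the pipeline, restore the input. Returns exit code."""
    src = Path(source)
    hidden = Path(str(src) + ".chaos")
    shutil.move(src, hidden)           # inject: the source "disappears"
    try:
        result = subprocess.run(pipeline_cmd)
        return result.returncode       # did it fail loudly, or exit 0 anyway?
    finally:
        shutil.move(hidden, src)       # always restore, even on crash
```

An exit code of 0 here is the interesting result: it means your pipeline ran "successfully" with its primary input missing, which is exactly the Mode 1 blind spot.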

The system you can trust is the system you have tried to break. Start breaking.

Next Issue

Issue #28: The Immune System

Chaos testing finds vulnerabilities. But what if your system could detect and heal them automatically? Issue #28 builds a self-healing layer that responds to failures in real time — without waiting for you to notice.
