Issue #27 taught you how to break your system on purpose. You found the failures hiding in your pipeline. You know where it cracks.
But here is the problem with chaos testing: it still requires you. You run the tests. You read the results. You decide what to fix. If you are not watching, the failures you found are just notes in a document nobody reads.
The immune system in your body does not wait for you to notice an infection. It detects threats, classifies them, and mounts a response — all before you feel a single symptom. The best-engineered AI systems work the same way.
A self-healing system has three layers: detection (know something is wrong), diagnosis (know what is wrong), and response (fix it or degrade gracefully — automatically). This issue gives you three prompts that build each layer.
Why Most “Monitoring” Fails
Every team has monitoring. Dashboards, alerts, uptime checks. But monitoring is not healing. Monitoring tells you the patient is sick. Healing is what happens next.
| What Most Teams Have | What Self-Healing Requires |
|---|---|
| Uptime check: “Is the service responding?” | Semantic check: “Is the response correct?” |
| Error rate alert: “Errors exceeded threshold” | Root cause classifier: “This error pattern means X” |
| Dashboard: “Human reads it, human decides” | Playbook: “For pattern X, execute response Y automatically” |
| Alert fatigue: 200 alerts/day, all ignored | Signal clarity: 3 actionable events/day, all resolved |
The key insight: A self-healing system does not need to fix every problem. It needs to handle the 5–10 failure patterns that account for 90% of incidents. Everything else can still page a human. Start narrow. Expand over time.
The Three Layers
Your immune system operates in three layers that cascade:
- Detection — Sentinels that notice anomalies before they become outages. Not “is the server up” but “is the output sane.”
- Diagnosis — Pattern matching that classifies the anomaly into a known failure mode. This converts a vague “something is wrong” into a specific “this is a Mode 3 config drift in the threshold parameter.”
- Response — Pre-written playbooks that execute the correct fix for each classified failure. No human in the loop for known patterns.
Each prompt below builds one layer. By the end, you will have an immune system that handles the failure patterns you discovered in Issue #27’s stress tests.
Prompt 1 — The Sentinel Network
Sentinels are lightweight checks that run after every pipeline execution. They do not just verify “did it run” — they verify “does the output make sense.”
You are a reliability engineer designing health sentinels for an AI
system. Sentinels are lightweight checks that run AFTER each pipeline
execution and flag anomalies BEFORE outputs reach consumers.
SYSTEM DESCRIPTION:
[Describe your AI system: inputs, processing steps, outputs,
and who consumes those outputs.]
KNOWN FAILURE MODES:
[Paste your failure map from Issue #27's Resilience Scorecard,
or list the top 5 ways your system breaks.]
Design a sentinel for EACH of these categories:
## 1. FRESHNESS SENTINELS
For each data source and output:
- Max acceptable age (e.g., "signals must be <4 hours old")
- Check method: timestamp comparison, file modification time,
or API response header
- Alert threshold vs auto-heal threshold (e.g., 4hr = warn,
8hr = use fallback data)
## 2. SANITY SENTINELS
For each output your system produces:
- Range checks: "Score must be between 0 and 100"
- Distribution checks: "At least 5 BUY and 5 SELL signals
expected on a normal day. Zero of either = anomaly."
- Consistency checks: "If VIX > 30 and system says STRONG_BUY
on 50 tickers, something is wrong."
- Delta checks: "Output changed by >50% from yesterday with
no corresponding market move = anomaly."
## 3. DEPENDENCY SENTINELS
For each external service:
- Health check endpoint or test query
- Timeout threshold (in seconds, not minutes)
- Fallback behavior when dependency is degraded vs down
- Circuit breaker threshold: "After N failures in M minutes,
stop trying and use cached data."
## 4. INTEGRITY SENTINELS
For the pipeline itself:
- Input/output count validation: "Processed N inputs, produced
M outputs. M should be within 10% of N."
- Schema validation: "Output JSON must match this schema."
- Duplicate detection: "Same output produced twice = error."
- Regression detection: "Model accuracy dropped >5% from
rolling 30-day average."
OUTPUT FORMAT (per sentinel):
- Name: [SHORT_NAME] (e.g., FRESH_SIGNALS)
- Layer: Freshness / Sanity / Dependency / Integrity
- Check: What it verifies (one sentence)
- Frequency: After every run / hourly / daily
- Alert level: INFO / WARN / CRITICAL
- Auto-heal action: What to do if triggered (or "none — page human")
- Implementation: Pseudocode (10 lines max)
What you get: A complete sentinel network tailored to your system. Most teams discover they have been relying on “is the server up?” when what they actually need is “is the output correct?” The four sentinel categories force you to check freshness, sanity, dependencies, and integrity — the four dimensions where AI systems silently fail.
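To make the sentinel idea concrete, here is a minimal sketch of one freshness sentinel and one sanity sentinel in Python. The names (`FRESH_SIGNALS`, `SANITY_SCORE_RANGE`), thresholds (4hr warn, 8hr auto-heal, scores in 0–100), and result structure are illustrative assumptions, not a fixed API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SentinelResult:
    name: str
    level: str   # "OK", "WARN", or "CRITICAL"
    detail: str

def fresh_signals(last_updated: datetime, now: datetime) -> SentinelResult:
    """Freshness sentinel: warn at 4 hours, auto-heal threshold at 8 hours."""
    age = now - last_updated
    if age > timedelta(hours=8):
        return SentinelResult("FRESH_SIGNALS", "CRITICAL",
                              "past auto-heal threshold; switch to fallback data")
    if age > timedelta(hours=4):
        return SentinelResult("FRESH_SIGNALS", "WARN", "approaching staleness limit")
    return SentinelResult("FRESH_SIGNALS", "OK", "data is fresh")

def sanity_score_range(scores: list[float]) -> SentinelResult:
    """Sanity sentinel: every score must fall inside [0, 100]."""
    out_of_range = [s for s in scores if not 0 <= s <= 100]
    if out_of_range:
        return SentinelResult("SANITY_SCORE_RANGE", "CRITICAL",
                              f"{len(out_of_range)} scores outside [0, 100]")
    return SentinelResult("SANITY_SCORE_RANGE", "OK", "all scores in range")
```

Each sentinel returns a uniform result object, which is what lets a diagnosis layer later consume alerts from all four categories through one interface.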
Prompt 2 — The Diagnosis Engine
Sentinels tell you something is wrong. The diagnosis engine tells you what is wrong. It takes a cluster of sentinel alerts and maps them to a specific root cause.
You are building an automated diagnosis engine for an AI
system. The engine takes sentinel alerts as input and
produces a specific, actionable root cause diagnosis.
SENTINEL ALERTS:
[Paste current alerts, or describe a scenario:
"FRESH_SIGNALS fired (data 6hrs old), SANITY_SCORE_RANGE
fired (3 scores outside bounds), DEPENDENCY_API healthy"]
KNOWN FAILURE PATTERNS:
Build a pattern library — a decision tree that maps
combinations of sentinel alerts to root causes:
## PATTERN MATCHING RULES
For each known failure mode, define:
Pattern: [NAME]
Trigger: [Which sentinels fire, in what combination]
Root cause: [One sentence — the actual problem]
Confidence: HIGH / MEDIUM / LOW
Evidence needed: [What additional data to collect to
confirm the diagnosis]
Disambiguation: [How to tell this apart from similar
patterns]
Example patterns to define:
1. STALE_SOURCE
Trigger: FRESH_* fires + DEPENDENCY_* healthy
Root cause: Upstream source is online but publishing
stale data. API returns 200 but content is old.
Confidence: HIGH if source timestamp confirms staleness.
2. PROVIDER_OUTAGE
Trigger: DEPENDENCY_* fires + FRESH_* fires shortly after
Root cause: External service down, causing cascade.
Confidence: HIGH if dependency health check fails.
3. CONFIG_DRIFT
Trigger: SANITY_* fires + FRESH_* healthy + DEPENDENCY_*
healthy (all inputs fresh and sources up, but output
is wrong)
Root cause: A parameter or threshold was changed.
Confidence: MEDIUM — needs config diff to confirm.
4. DATA_CORRUPTION
Trigger: INTEGRITY_* fires (schema validation, count
mismatch, duplicates)
Root cause: Input data is structurally broken.
Confidence: HIGH if schema validation pinpoints field.
5. CAPACITY_OVERLOAD
Trigger: Pipeline duration > 2x normal + INTEGRITY
count mismatch (some items dropped)
Root cause: Volume exceeded processing capacity.
Confidence: HIGH if input count confirms spike.
6. FEEDBACK_LOOP_DRIFT
Trigger: REGRESSION_* fires (accuracy declining) +
all other sentinels healthy
Root cause: The self-improvement loop is learning
wrong lessons from its own outputs.
Confidence: LOW — hardest to diagnose. Requires
manual review of recent learning updates.
## UNKNOWN PATTERN HANDLER
When sentinel alerts do not match any known pattern:
- Log all active alerts + system state snapshot
- Classify as UNKNOWN with confidence LOW
- Page human with full context
- After resolution: add the new pattern to the library
OUTPUT: For each diagnosed pattern, produce:
{
  "pattern": "STALE_SOURCE",
  "confidence": "HIGH",
  "root_cause": "...",
  "evidence": ["..."],
  "recommended_response": "...",
  "escalate_to_human": true/false
}
What you get: A pattern library that turns vague “something is wrong” alerts into specific, actionable diagnoses. The key innovation is the unknown pattern handler — every novel failure gets logged, and after resolution, it becomes a new pattern. Your diagnosis engine gets smarter with every incident.
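A pattern library of this shape reduces to a small decision function in code. The sketch below is a deliberately simplified matcher over the sentinel prefixes from Prompt 1; the sentinel names and the priority ordering of the rules are assumptions you would tune to your own system:

```python
def diagnose(firing: set[str]) -> dict:
    """Map a set of firing sentinel names to a diagnosed failure pattern."""
    dep = any(a.startswith("DEPENDENCY_") for a in firing)
    fresh = any(a.startswith("FRESH_") for a in firing)
    sanity = any(a.startswith("SANITY_") for a in firing)
    integrity = any(a.startswith("INTEGRITY_") for a in firing)

    # Check the most specific combinations first
    if dep:
        # Dependency down, often cascading into stale data
        return {"pattern": "PROVIDER_OUTAGE", "confidence": "HIGH",
                "escalate_to_human": False}
    if fresh:
        # Source is up but publishing old content
        return {"pattern": "STALE_SOURCE", "confidence": "HIGH",
                "escalate_to_human": False}
    if sanity:
        # Inputs healthy, output wrong: suspect a changed parameter
        return {"pattern": "CONFIG_DRIFT", "confidence": "MEDIUM",
                "escalate_to_human": False}
    if integrity:
        return {"pattern": "DATA_CORRUPTION", "confidence": "HIGH",
                "escalate_to_human": False}
    # Unknown pattern handler: log, classify LOW, page a human
    return {"pattern": "UNKNOWN", "confidence": "LOW",
            "escalate_to_human": True}
```

A real engine would also collect the "evidence needed" fields before committing to a diagnosis; the point here is only that the trigger table translates directly into ordered conditionals.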
Prompt 3 — The Response Playbook
A diagnosis without a response is just a fancier alert. This prompt builds automated response playbooks — specific actions the system takes for each diagnosed failure, without waiting for a human.
You are a site reliability engineer designing automated
response playbooks for an AI system. Each playbook
handles one diagnosed failure pattern end-to-end.
DIAGNOSIS: [Paste output from the Diagnosis Engine]
For each failure pattern, design a response playbook:
## PLAYBOOK STRUCTURE
### Immediate Response (seconds)
What to do RIGHT NOW to stop the bleeding:
- Halt: Should the pipeline stop processing? (Y/N + why)
- Quarantine: Should current outputs be quarantined
(marked as potentially bad) before reaching consumers?
- Notify: Who/what gets notified? (Slack, email, dashboard)
### Automated Fix (minutes)
Pre-scripted remediation steps:
For STALE_SOURCE:
1. Switch to backup data source (if available)
2. If no backup: use last-known-good cached data
3. Tag all outputs as "DEGRADED — stale source"
4. Set retry timer to re-check primary source every 5min
5. After 3 successful checks: restore to primary
For PROVIDER_OUTAGE:
1. Activate circuit breaker (stop retries)
2. Switch to fallback mode (cached responses, reduced
feature set, or graceful degradation)
3. Monitor provider status endpoint every 60s
4. After 3 consecutive successes: close circuit breaker
5. Run integrity check on first post-recovery output
For CONFIG_DRIFT:
1. Snapshot current config
2. Diff against last-known-good config
3. If diff is small (<3 parameters): auto-revert
4. If diff is large: quarantine outputs + page human
5. Log the drift event for pattern analysis
For DATA_CORRUPTION:
1. Reject corrupted input batch
2. Attempt re-fetch from source
3. If re-fetch also corrupt: use last-known-good data
4. Flag affected outputs for manual review
5. Add corruption signature to integrity sentinel
For CAPACITY_OVERLOAD:
1. Enable rate limiting on input queue
2. Process in priority order (highest value first)
3. Drop or defer lowest-priority items
4. Alert if >20% of items are dropped
5. Post-incident: adjust capacity thresholds
For FEEDBACK_LOOP_DRIFT:
1. FREEZE the learning loop (stop all parameter updates)
2. Snapshot current model state
3. Compare outputs to 7-day-ago model state
4. Page human with divergence report
5. Do NOT auto-fix — this requires human judgment
### Recovery Verification (minutes to hours)
After the fix is applied:
- Re-run sentinels to verify fix worked
- Compare output quality to pre-incident baseline
- Confirm no downstream consumers received bad data
- Update incident log with timeline + resolution
### Post-Incident Learning
After every incident (automated or manual):
- Was the diagnosis correct?
- Was the response appropriate?
- How long from detection to resolution?
- Should this pattern's response be upgraded?
- Add any new sentinel patterns discovered
OUTPUT per playbook:
- Pattern name
- Response time target (how fast should this resolve?)
- Automation level: FULL (no human needed) / PARTIAL
(human approves before fix executes) / MANUAL (human
required)
- Rollback plan: How to undo the automated response
if it makes things worse
- Success criteria: How to know the fix worked
What you get: A complete set of automated response playbooks. The critical detail is the automation level classification. Some failures (stale data, provider outage) can be handled fully automatically. Others (feedback loop drift) must always involve a human. Knowing which is which prevents both under-automation (humans doing what machines should) and over-automation (machines making judgment calls they should not).
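One way to wire diagnoses to playbooks is a registry keyed by pattern name, with the automation level gating what runs unattended. The step names below are hypothetical stubs standing in for real remediation code; the structure, not the contents, is the point:

```python
# Automation levels: FULL = no human, PARTIAL = human approves the fix,
# MANUAL = human required (machine only runs safety steps and pages).
PLAYBOOKS = {
    "STALE_SOURCE":        {"automation": "FULL",
                            "steps": ["switch_to_backup_source",
                                      "tag_outputs_degraded",
                                      "retry_primary_every_5min"]},
    "PROVIDER_OUTAGE":     {"automation": "FULL",
                            "steps": ["open_circuit_breaker",
                                      "enter_fallback_mode",
                                      "poll_provider_status"]},
    "CONFIG_DRIFT":        {"automation": "PARTIAL",
                            "steps": ["snapshot_config",
                                      "diff_against_last_known_good",
                                      "auto_revert_if_small"]},
    "FEEDBACK_LOOP_DRIFT": {"automation": "MANUAL",
                            "steps": ["freeze_learning_loop",
                                      "snapshot_model_state",
                                      "page_human"]},
}

def respond(diagnosis: dict) -> str:
    """Select and describe the response for a diagnosed pattern."""
    playbook = PLAYBOOKS.get(diagnosis["pattern"])
    if playbook is None or diagnosis.get("escalate_to_human", False):
        return "QUARANTINE outputs and page human with full context"
    if playbook["automation"] == "MANUAL":
        return "QUARANTINE outputs, run safety steps, page human"
    approval = " (await human approval)" if playbook["automation"] == "PARTIAL" else ""
    return "EXECUTE: " + " -> ".join(playbook["steps"]) + approval
```

Note that the MANUAL path never attempts a fix: for judgment-call failures like feedback loop drift, the machine's job is to freeze, snapshot, and escalate.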
The Maturity Model
Self-healing does not happen overnight. It is a progression:
| Level | Capability | What It Looks Like |
|---|---|---|
| 0 — Blind | No monitoring beyond “is it running” | You discover failures when users complain |
| 1 — Aware | Sentinels detect anomalies | You get alerts, but you still diagnose and fix manually |
| 2 — Diagnostic | Diagnosis engine classifies failures | Alerts tell you what is wrong, not just that something is wrong |
| 3 — Responsive | Automated playbooks handle known patterns | Common failures resolve without human intervention |
| 4 — Adaptive | System learns new patterns from incidents | Every incident makes the immune system stronger |
Most teams are at Level 0 or 1. Getting to Level 3 with the three prompts above puts you ahead of 95% of AI system operators. Level 4 is the long game — building it is Issue #29’s topic.
The uncomfortable truth: A Level 3 system with 10 well-defined playbooks outperforms a Level 0 system with a dedicated 24/7 ops team. The playbooks respond in seconds. Humans respond in minutes to hours — if they are awake, if they see the alert, if they remember the fix.
The Architecture in Practice
Here is what a self-healing pipeline looks like end to end:
- Pipeline runs. Fetches data, processes signals, generates outputs.
- Sentinels run. Freshness, sanity, dependency, integrity checks — all within 30 seconds of pipeline completion.
- If sentinels pass: Outputs are released to consumers. Business as usual.
- If sentinels fire: Diagnosis engine receives the alert cluster. Classifies the root cause within 5 seconds.
- If diagnosis matches a known pattern: Response playbook executes automatically. Outputs are quarantined until fix is verified.
- If diagnosis is unknown: Outputs quarantined, human paged with full context, incident logged for post-mortem.
- After resolution: Recovery verification runs. Sentinels confirm the fix. Incident is logged. If new, the pattern is added to the library.
The entire cycle — detection, diagnosis, response, verification — completes in under 2 minutes for known patterns. Compare that to the typical sequence: alert fires at 2 AM, engineer wakes up at 2:15, logs into the system by 2:30, starts diagnosing at 2:45, pushes a fix by 3:30. That is 90 minutes vs 2 minutes.
Start with exactly one playbook for your most common failure. Do not try to build all six at once. Run it for two weeks. Once you trust it, add the second. The confidence that comes from seeing your first automated recovery changes how you think about every subsequent one.
Try It This Week
Take your most critical pipeline. Add three sentinels: one freshness check, one sanity check, and one dependency check. Run them after every pipeline execution. Log the results. After one week, you will have a baseline of what “normal” looks like — and when something deviates, you will know immediately instead of discovering it three days later.
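A starter for that week-one exercise can be as small as this sketch: three checks with stubbed inputs, logged as one JSON line per run so the baseline accumulates in a file. The input parameters and the 4-hour freshness cutoff are assumptions; replace the stubs with reads from your real pipeline:

```python
import json
from datetime import datetime, timezone

def run_checks(data_age_hours: float, scores: list[float], api_ok: bool) -> list[dict]:
    """One freshness, one sanity, and one dependency check per pipeline run."""
    return [
        {"sentinel": "FRESHNESS", "ok": data_age_hours < 4, "value": data_age_hours},
        {"sentinel": "SANITY", "ok": all(0 <= s <= 100 for s in scores),
         "value": len(scores)},
        {"sentinel": "DEPENDENCY", "ok": api_ok, "value": api_ok},
    ]

def log_run(results: list[dict], log_path: str = "sentinel_baseline.jsonl") -> None:
    # Append one JSON line per run; a week of lines is your "normal" baseline
    record = {"ts": datetime.now(timezone.utc).isoformat(), "results": results}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Call `log_run(run_checks(...))` as the last step of the pipeline, and graph the `ok` rates after a week.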
The system that heals itself is the system you can trust with your sleep. Start building your immune system.