Building on Issues #1–25
You built agents that run tasks. Quality gates that catch errors. An evolution engine that improves the system over time. Decision rules that operate autonomously. Memory that persists across sessions. A scaling layer that coordinates multiple agent systems.
Congratulations. You now have a complex distributed system. And complex distributed systems fail in ways you cannot predict.
Your quality gate might stop catching a class of error because the error evolved. Your evolution engine might optimize for a metric that stopped being meaningful. Your decision rules might approve actions in a context they were never tested against. Each individual component reports green. The system as a whole is broken.
This is the problem of emergent failure — when every part of the system passes its own health check, but the system as a whole is producing bad results. No single component is wrong. The interaction between components is wrong. And no component is looking at that.
You need something that watches the watchers. Not another layer of agents — a layer of observation that reduces your entire system’s health to a single question: “Should I intervene today, or can I trust this?”
The Three Layers of Observability
Most people monitor at one layer and call it done. “All my agents ran successfully.” That is Layer 1. You need all three.
| Layer | What It Watches | Question It Answers | Example Failure |
|---|---|---|---|
| 1. Component | Individual agents, gates, pipelines, cron jobs | “Did each piece run and produce output?” | A pipeline ran but returned an empty file |
| 2. System | Interactions between components | “Are the pieces working together correctly?” | Agent A writes output. Agent B reads stale cache instead. |
| 3. Outcome | Real-world results vs. predictions | “Is the system actually producing value?” | System runs perfectly. Win rate has declined 12% over 3 weeks. |
Layer 1 is necessary but not sufficient. It catches crashes, timeouts, and missing files. It does not catch a pipeline that produces structurally valid output with the wrong data inside it. It does not catch a decision rule that is 100% consistent with its training data but operating in a market regime that invalidates its logic.
The silent failure pattern: Component health = 100%. System health = 100%. Outcome health = declining for two weeks. This means your system is running perfectly — and doing the wrong thing. Only Layer 3 catches this. If you only monitor Layers 1 and 2, you will not know something is wrong until a human notices the results.
The Health Score
Every component, every system interaction, and every outcome metric feeds into a single composite score from 0 to 100. One number. One glance. One decision: “Do I need to intervene?”
| Score | State | Your Action | Time Required |
|---|---|---|---|
| 90–100 | GREEN — all nominal | Glance at the digest. Move on with your day. | 2–4 min |
| 70–89 | YELLOW — degradation detected | Read the full summary. Investigate flagged items. | 10–20 min |
| 50–69 | ORANGE — intervention needed | Stop other work. Diagnose and fix now. | 30–60 min |
| 0–49 | RED — system failure | Kill switch. Manual control until resolved. | Until fixed |
The score is weighted by importance: Outcome health (50%) > System health (30%) > Component health (20%).
This weighting is counterintuitive. Most engineers weight Component health highest because it is the easiest to measure. But a system that runs flawlessly while producing bad results is worse than a system that has visible errors it catches and corrects. The weighting reflects reality: outcomes are what matter.
Plot your health score daily. The trend matters more than the number. A score of 82 that has been declining for 5 consecutive days is more dangerous than a score of 71 that has been rising for 3. Build a 7-day moving average and alert on the direction, not just the level.
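The weighting and trend logic above are simple enough to sketch directly. A minimal example, assuming you log one (component, system, outcome) score triple per day; all names and the 0.5-point trend deadband are illustrative:

```python
from statistics import mean

# Weights from the text: outcomes matter most.
WEIGHTS = {"outcome": 0.5, "system": 0.3, "component": 0.2}

def health_score(component: float, system: float, outcome: float) -> float:
    """Composite 0-100 score, weighted toward outcomes."""
    return (component * WEIGHTS["component"]
            + system * WEIGHTS["system"]
            + outcome * WEIGHTS["outcome"])

def trend(daily_scores: list[float], window: int = 7) -> str:
    """Direction of the 7-day moving average: UP, DOWN, or STABLE."""
    if len(daily_scores) < window + 1:
        return "STABLE"  # not enough history yet
    prev = mean(daily_scores[-window - 1:-1])
    curr = mean(daily_scores[-window:])
    if curr > prev + 0.5:
        return "UP"
    if curr < prev - 0.5:
        return "DOWN"
    return "STABLE"

# The silent failure pattern: every component green, outcomes declining.
score = health_score(component=100, system=100, outcome=60)
print(score)  # 80.0 -> YELLOW, despite every component reporting green
```

Note how the outcome weighting does the work: a system that is mechanically perfect but producing bad results drops out of GREEN on the composite alone.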
Step 1: Component Audit
Before you can compute system health, you need to know what your system consists of. Every agent, every pipeline, every cron job, every gate — inventoried, with a definition of what “healthy” looks like for each one.
Prompt 1 — The Component Auditor
This agent scans your entire system inventory. For each component, it checks three things: did it run, is its output valid, and are its dependencies healthy?
You are a systems observability agent. Your job is to audit
every component in the AI system for health.
Input: A JSON inventory of components. Each component has:
- component_id, type (agent/pipeline/gate/cron), schedule
- expected_output (file path, schema, or API response)
- dependencies (list of component_ids this reads from)
- last_run_log (stdout/stderr from most recent execution)
For each component, evaluate three dimensions:
## EXECUTION STATUS
- Did it run on schedule? Compare last_run_time to schedule.
Flag if overdue by more than 10% of the interval.
- Did it complete without errors? Check exit code + stderr.
- Did it produce output? Verify file exists, is non-empty,
and was modified after last_run_time.
Score: 100 (ran + clean) / 75 (ran + warnings) /
50 (ran + errors but produced output) / 0 (failed)
## OUTPUT QUALITY
- Structural validity: Does output match expected schema?
Parse it. Check required fields. Flag missing/extra keys.
- Semantic reasonableness: Are values within expected
bounds? Flag outliers (>3 std dev from 30-day rolling
mean). Flag both "changed too much" (possible data
corruption) and "changed too little" (possible staleness).
- Freshness: Is every data source inside the output current?
Check timestamps of embedded data vs wall clock.
Score: 100 (valid + reasonable + fresh) /
70 (valid but stale or outliers) / 0 (invalid)
## DEPENDENCY HEALTH
- Are all upstream dependencies producing fresh output?
For each dependency: last_output_time, expected_freshness,
staleness_flag.
- Is this component's output consumed downstream?
If no downstream consumer has read it in 2x the expected
interval, flag as "orphaned output."
- Broken chain detection: If upstream produced output but
this component did not read it, flag as "missed input."
Score: 100 (all deps fresh + consumed) /
60 (stale deps or orphaned) / 0 (broken chain)
COMPONENT SCORE = (Execution * 0.4) + (Quality * 0.4)
+ (Dependencies * 0.2)
Output as JSON array:
[{ component_id, execution_score, quality_score,
dependency_score, composite_score, issues: [],
recommended_action: string | null }]
Flag any component scoring below 80.
What you get: A complete health inventory of every piece of your system, with scores and specific issues. The Component Auditor catches the things that individual agents miss about themselves: stale data, orphaned outputs, broken dependency chains, and output that passes schema validation but fails semantic checks.
Step 2: System-Level Synthesis
Component health is Layer 1. It tells you each part works. But a system is more than the sum of its parts. Ten healthy components can produce an unhealthy system if they are interacting incorrectly.
This is the hardest layer to build because the failure modes are emergent — they do not exist in any single component. They exist in the spaces between components.
Prompt 2 — The System Health Synthesizer
You are a system-level health analyst. Your job is to
detect failures that NO individual component would catch.
Input: Component audit results from the Component Auditor
(JSON array of scores, issues, and outputs).
Analyze four dimensions:
## INTERACTION ANALYSIS
For every component pair that shares data (A produces,
B consumes):
- Latency: Time between A's output and B's read. Is it
growing? Flag if 2x the 30-day median.
- Consistency: Does B's view of the data match A's latest
output? Flag version mismatches or stale reads.
- Volume: Is throughput within 2 standard deviations of
the 30-day rolling average? Flag both drops and spikes.
## EMERGENT PATTERN DETECTION
- Correlation breaks: Identify component pairs whose
outputs usually correlate. Flag any pair where the
correlation has broken in the last 7 days.
(Example: signal generator and risk manager usually agree
on 85% of decisions. This week: 62%. Something shifted.)
- Cascade risk: Map the dependency graph. Identify the
single component whose failure would break the most
downstream systems. This is your highest-risk node.
- Feedback loops: Trace every output-to-input chain. Flag
any cycle where a component's output feeds back as its
own input through a chain of intermediaries.
## DRIFT DETECTION
- Decision drift: Are autonomous rules making decisions
that cluster differently than 30 days ago? Compare
decision distribution (approve/reject/escalate ratios).
- Data drift: Have input data distributions shifted?
For each data source, compare current 7-day distribution
to 30-day baseline. Flag shifts >1 std dev.
- Performance drift: Rolling 7-day outcome metrics vs
30-day baseline. Flag sustained declines (3+ consecutive
days below baseline).
## RESOURCE EFFICIENCY
- Are any components consuming disproportionate resources
(time, tokens, storage) relative to their contribution?
- Are any components redundant (producing outputs that
overlap with another component's output)?
SYSTEM SCORE formula:
- Start at 100
- Subtract 5 for each interaction anomaly
- Subtract 10 for each correlation break
- Subtract 15 for each drift detection flag
- Subtract 20 for any feedback loop
- Floor at 0
Output:
{ system_score, interaction_anomalies: [],
correlation_breaks: [], drift_flags: [],
cascade_risk_node: string,
most_important_finding: string,
most_urgent_recommendation: string }
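The SYSTEM SCORE formula is pure arithmetic over the synthesizer's findings, so it is worth computing outside the model, where it cannot be mis-added. A sketch; the finding lists mirror the output schema above, and the flat 20-point loop penalty is one reading of "any feedback loop":

```python
def system_score(interaction_anomalies: list,
                 correlation_breaks: list,
                 drift_flags: list,
                 feedback_loops: list) -> int:
    """Deduction-based score from the synthesizer's findings."""
    score = 100
    score -= 5 * len(interaction_anomalies)
    score -= 10 * len(correlation_breaks)
    score -= 15 * len(drift_flags)
    if feedback_loops:           # presence of any loop costs 20
        score -= 20
    return max(score, 0)         # floor at 0

print(system_score(["stale read"], ["sig/risk"], [], []))  # 85
```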
The cascade risk node is the most important finding in most audits. It is the component that, if it fails, takes down the most other things. Prioritize hardening this node: add redundancy, tighten its monitoring, and ensure it has the most conservative error handling. Every system has a single point of failure. Know what yours is.
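Finding the cascade risk node is a graph problem you can solve deterministically from the dependency inventory. A sketch, assuming the graph maps each component to the components that consume its output; all component names are illustrative:

```python
def downstream(graph: dict[str, list[str]], node: str) -> set[str]:
    """All components transitively affected if `node` fails."""
    seen: set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def cascade_risk_node(graph: dict[str, list[str]]) -> str:
    """The component whose failure breaks the most downstream systems."""
    return max(graph, key=lambda n: len(downstream(graph, n)))

# producer -> consumers
graph = {
    "market_feed": ["signal_gen", "risk_mgr"],
    "signal_gen":  ["executor"],
    "risk_mgr":    ["executor"],
    "executor":    [],
}
print(cascade_risk_node(graph))  # market_feed (3 downstream components)
```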
Step 3: Alert Triage
The Component Auditor and System Health Synthesizer produce rich, detailed output. You do not want to read all of it. You want three sentences and one number.
The Alert Prioritizer reduces the noise to signal. Maximum 5 alerts. Deduplicated. Prioritized. With time horizons so you know what to fix now vs. what to schedule.
Prompt 3 — The Alert Prioritizer
You are an alert triage specialist. You receive raw
findings from the Component Auditor and System Health
Synthesizer. Your job: reduce noise to signal.
RULES:
1. MAXIMUM 5 alerts per digest. If more than 5 issues
exist, group related ones and escalate the group.
Never exceed 5. Prioritize ruthlessly.
2. Every alert must have exactly four fields:
- severity: P0 / P1 / P2 / P3
- blast_radius: what breaks if this is ignored
- time_horizon: how long before critical
- recommended_action: specific, actionable, one sentence
3. De-duplicate: If the same root cause produces multiple
symptoms, report the root cause ONCE, not the symptoms.
Example: "Data source X went stale" not "Pipeline A has
stale data" + "Dashboard B shows yesterday's numbers" +
"Alert C fired for anomaly."
4. Trend over snapshot: A metric at 85% but declining 10
points per week is P1. A metric at 72% but stable for
30 days is P3. Always consider direction.
5. Historical context: Has this alert fired before? How
was it resolved? If this is a recurrence of a previously
fixed issue, that is P0 (regression), not P2.
PRIORITY DEFINITIONS:
P0: System is producing wrong outputs right now.
Response: immediate.
P1: System will degrade within 48 hours if ignored.
Response: today.
P2: Performance declining but not critical.
Response: this week.
P3: Improvement opportunity, not a problem.
Response: next planning cycle.
OUTPUT FORMAT (exactly this structure):
HEALTH SCORE: [0-100]
TREND: [UP / STABLE / DOWN] (7-day direction)
STATUS: [GREEN / YELLOW / ORANGE / RED]
ALERTS (max 5):
1. [P0-P3] [Title] — [One sentence description] —
Action: [Specific action]
Blast radius: [What breaks] | Horizon: [Time to critical]
QUIET WINS (max 3):
Things that improved or stayed healthy this period.
Positive signal confirms the system is working.
Do not skip this section — it prevents alert fatigue.
RECOMMENDED DAILY TIME: [X minutes]
How much time the operator should spend on this system
today based on current health.
GREEN: 2-4 min. YELLOW: 10-20 min.
ORANGE: 30-60 min. RED: until resolved.
What you get: A daily digest you can read in under 4 minutes on a good day. One number tells you the health. The trend tells you the direction. The alerts tell you what to do. The quiet wins tell you the system is earning your trust. And the time estimate tells you exactly how much attention your system needs today.
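The hard triage rules — cap at five, dedupe by root cause, sort by severity — can also be enforced in code as a guardrail around the model's output. A sketch with illustrative field names:

```python
from itertools import groupby

def triage(alerts: list[dict], cap: int = 5) -> list[dict]:
    """Dedupe alerts by root cause, then keep the `cap` most severe."""
    # Group symptoms sharing a root cause; report each cause once.
    alerts = sorted(alerts, key=lambda a: a["root_cause"])
    deduped = [next(group) for _, group in
               groupby(alerts, key=lambda a: a["root_cause"])]
    # "P0" sorts before "P1" before "P2" before "P3" lexically.
    deduped.sort(key=lambda a: a["severity"])
    return deduped[:cap]

alerts = [
    {"severity": "P2", "root_cause": "stale_source_x", "title": "Pipeline A stale"},
    {"severity": "P2", "root_cause": "stale_source_x", "title": "Dashboard B outdated"},
    {"severity": "P0", "root_cause": "gate_regression", "title": "Gate regressed"},
]
print([a["title"] for a in triage(alerts)])  # ['Gate regressed', 'Pipeline A stale']
```

Even if the model follows the rules most days, a deterministic backstop means a bad generation can never flood your digest with twenty alerts.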
The Alert Hierarchy
Not all problems are equal. The hierarchy defines your response, not just the severity. Every alert maps to a specific action.
| Priority | Definition | Response | Example |
|---|---|---|---|
| P0 | System is producing wrong outputs now | Drop everything. Fix immediately. | Decision rule approved an action that violated a safety constraint |
| P1 | System will degrade within 48 hours | Investigate today. No later. | Primary data source went stale; pipeline will produce bad signals tomorrow |
| P2 | Performance declining but still functional | Schedule for this week. | Win rate dropped 3 points over 14 days but still above threshold |
| P3 | Improvement opportunity | Queue for next planning cycle. | A component could be 2x faster with a caching layer |
The P0 test: If you went on vacation for a week and this issue existed the entire time, would you come back to damage? Real damage — lost money, broken customer experiences, corrupted data? If yes, it is P0. If you would come back to a suboptimal but functional system, it is P1 or below. Be honest about this. Most things that feel urgent are P2.
The Observatory Dashboard
The Observatory is not a product you buy. It is a practice you build. But if you want to know what the daily output looks like, here is the format:
- The number. Health score + trend arrow + color. A three-second glance tells you everything.
- The alerts. Maximum 5, prioritized, with specific actions. A two-minute read on most days.
- The quiet wins. What is working. What improved. Takes thirty seconds and prevents the corrosive effect of only seeing problems.
- The time estimate. “Spend 4 minutes on your system today.” Or: “Spend 45 minutes — here is exactly where to focus.”
The Meta-Monitoring Problem
The obvious question: who watches the Observatory? If the Observatory itself breaks, you have no monitoring at all. This is the recursion problem of observability, and the answer is deliberately simple:
- Heartbeat. The Observatory emits a timestamp every time it runs. A dead-simple external check (cron job, uptime monitor, or webhook) verifies the timestamp is fresh. If the heartbeat stops, you get an alert through a completely separate channel — email, SMS, or a push notification. No AI involved. Just a timestamp comparison.
- Staleness guard. If the health score has not updated in 25 hours, the external alert fires automatically. This catches both crashes (the Observatory stopped) and silent failures (the Observatory ran but did not write output).
- Weekly manual calibration. Once a week, spend 15 minutes comparing the Observatory’s last 7 digests against reality. Did the score accurately reflect what happened? Did it miss anything important? Did it fire any false alarms? This calibration is how the Observatory improves over time.
Keep the meta-monitoring layer as simple as possible. A cron job that checks “did the Observatory produce output in the last 24 hours?” is better than an AI agent monitoring an AI agent monitoring an AI agent. The recursion has to stop somewhere. Stop it with a dumb, reliable, zero-dependency check.
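The heartbeat check should be exactly this dumb. A minimal sketch: the Observatory touches a timestamp file on every run, and an independent cron job runs this script; the path and the alert channel are illustrative:

```python
import os
import sys
import time

HEARTBEAT_FILE = "/var/run/observatory/heartbeat"  # touched on every run
MAX_AGE_SEC = 25 * 3600  # the 25-hour staleness guard from the text

def heartbeat_stale(path: str = HEARTBEAT_FILE,
                    max_age: int = MAX_AGE_SEC) -> bool:
    """True if the Observatory hasn't checked in recently enough."""
    if not os.path.exists(path):
        return True  # never ran, or the file was wiped: alert
    return time.time() - os.path.getmtime(path) > max_age

if __name__ == "__main__":
    if heartbeat_stale():
        # Exit nonzero so cron's mail (or any uptime monitor wrapping
        # this script) fires through a channel the Observatory does not
        # control. No AI involved: just a timestamp comparison.
        sys.exit(1)
```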
Common Mistakes
1. Monitoring only Layer 1
“All agents ran successfully.” This means nothing. The agents could have run successfully while consuming stale data, writing outputs nobody reads, or producing technically valid but semantically wrong results. Layer 1 is table stakes. Build all three layers.
2. Too many alerts
If your Observatory produces 20 alerts per day, you will ignore all of them within a week. The maximum is 5. If there are more than 5 issues, the Alert Prioritizer groups them. If you find yourself wanting to raise the limit, your system has a design problem, not a monitoring problem.
3. No quiet wins
If every digest is a list of problems, you will dread opening it. Quiet wins — things that are working, improving, or stable — provide the positive signal that makes the Observatory sustainable. They also serve as a baseline: when a quiet win disappears from the list, that itself is a signal.
4. Optimizing the score instead of the system
Goodhart’s Law applies here. If you tune your Observatory to always show GREEN, you have not fixed your system — you have broken your monitoring. The health score must reflect reality, not the outcome you want. When the score is low, the correct response is to fix the system, not to adjust the thresholds.
5. Building the dashboard before the practice
The Observatory is a practice first. Start with a text-only daily summary. Read it for 30 days. Learn what matters and what is noise. Only then build the polished dashboard. If you build the dashboard first, you will optimize for aesthetics instead of signal quality.
The Observatory Checklist
Phase 1: Inventory (Day 1)
- Every component listed: agents, pipelines, gates, crons, data sources
- For each: definition of “healthy,” expected schedule, expected output
- Dependency graph drawn (who produces, who consumes)
- Cascade risk node identified (single point of failure)
Phase 2: Component Monitoring (Days 2–7)
- Component Auditor running daily (Prompt 1)
- All components scored on execution, quality, dependencies
- Broken chains and orphaned outputs identified and fixed
- Baseline scores established (7-day rolling average)
Phase 3: System Monitoring (Days 8–14)
- System Health Synthesizer running daily (Prompt 2)
- Interaction anomalies tracked
- Drift detection baselines established
- Correlation pairs identified and monitored
Phase 4: Alert Triage (Day 15+)
- Alert Prioritizer running daily (Prompt 3)
- Composite health score computed and plotted
- Daily digest delivered (one number, max 5 alerts, quiet wins)
- Heartbeat monitor wired (external, dumb, zero-dependency)
- Weekly manual calibration scheduled
Try It This Week
List every component in your AI system. For each one, write down three things: what “healthy” means, how you would detect failure, and what breaks downstream if it fails. This is your component inventory.
Then run Prompt 1 against the inventory. You will discover broken chains, orphaned outputs, and stale data you did not know about. Fix those first. That alone will improve your system more than any new feature.
You do not need a dashboard. Start with a daily text summary. The Observatory is a practice before it is a product. The system that watches itself is the system that earns your trust. Build the proof.