Issue #20

The Control Room: A Unified Monitoring Interface for Your AI System

The AI Playbook · 14 min read · 3 prompts

You built the system. Persistent agents that run on schedule. A dashboard that aggregates their output. Quality gates that catch errors. Feedback loops that track drift. An autonomy framework that promotes your best agents. A safety net for when things break. An orchestra that coordinates multi-agent workflows.

Each piece has its own interface. Its own log file. Its own way of telling you what happened.

Your quality gate writes to quality_report.json. Your safety net writes to incidents/. Your feedback loop writes to metrics/. Your orchestra writes to run_summary.json. Your autonomy framework writes to trust_tiers.json.

To understand the state of your system, you open five files. You mentally cross-reference timestamps. You check whether the run summary's failure count matches the incident log. You compare this week's accuracy to last week's by scrolling through a JSONL file.

This is not monitoring. This is archaeology.

The control room fixes this. One interface. Every agent, every workflow, every incident, every metric -- visible in a single view. Not a dashboard for one agent's output. A dashboard for the system itself.


The Framework: Four Layers

Layer     | Question It Answers    | Data Source
----------|------------------------|---------------------------
Heartbeat | Is everything running? | run_state.json
Quality   | Is it running well?    | gate_results, accuracy
Incidents | What went wrong?       | incident_log, validations
Trends    | Is it getting better?  | weekly_snapshots.jsonl

Key insight: Each layer answers a different question at a different timescale. Heartbeat = right now. Quality = this week. Incidents = recent past. Trends = trajectory over months. Together, they give you the complete picture in under 90 seconds.


Layer 1: Heartbeat Monitor

The heartbeat monitor answers one question: "Is everything running?" It reads every agent's last execution timestamp and compares it to the expected schedule. If an agent should run every hour and the last run was 3 hours ago, that is red.

Heartbeat Monitor
You are a system monitor. Build a heartbeat status page for a
multi-agent AI system.

Read these data sources:
- Agent schedules from ~/workflows/dependency_map.json
  (the "schedule" field for each agent)
- Last run timestamps from ~/workflows/run_state.json
- Current time from the system clock

For each agent, compute:

1. STATUS: Based on expected schedule and last run timestamp.
   - GREEN: Last run within 1x expected interval.
     (Hourly agent ran less than 1 hour ago.)
   - YELLOW: Last run between 1x and 2x expected interval.
     (Hourly agent ran 1-2 hours ago.)
   - RED: Last run beyond 2x expected interval.
     (Hourly agent has not run in over 2 hours.)
   - GRAY: Agent has never run (no timestamp found).

2. LAST RUN: Exact timestamp and "X minutes/hours ago" relative.

3. NEXT EXPECTED: When the agent should run next based on schedule.

4. LAST RESULT: SUCCESS, FAILED, or TIMEOUT from the most recent
   run_state entry.

5. UPTIME (7-day): Percentage of expected runs that actually
   completed successfully in the last 7 days.

Output as HTML (save to ~/control_room/heartbeat.html):
- Summary bar: X agents green, Y yellow, Z red
- Grid of agent cards sorted: RED first, YELLOW, GREEN
- Each card: name, status dot, last run, next expected,
  result, 7-day uptime
- Auto-refresh every 60 seconds
- Red cards pulse to draw attention

Design: Dark bg (#06040F), monospace font, dense layout.
Status colors: green #22C55E, yellow #EAB308, red #EF4444,
gray #64748B.

Rules:
- No run_state.json = all agents GRAY with "Never run"
- RED + FAILED = "NEEDS ATTENTION" label
- Summary counts must equal total agents (verify)
- Include "Last updated" timestamp at bottom

The heartbeat monitor surfaces something every multi-agent system has: at least one process that is silently behind schedule. It has been late for days, maybe weeks, and you did not notice because nothing checks. The heartbeat makes the invisible visible.

Pro Tip

Set the RED threshold at 2x the expected interval, not 1x. Agents have natural variance -- a cron job that runs at :00 sometimes starts at :01. The 2x buffer prevents false alarms while still catching genuine failures.
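The status logic is simple enough to sketch. Here is a minimal Python version, assuming last-run timestamps have already been parsed into timezone-aware datetimes; the function name and signature are illustrative, not part of the prompt:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the heartbeat classifier. Thresholds follow the prompt:
# GREEN within 1x the expected interval, YELLOW between 1x and 2x,
# RED beyond 2x, GRAY if the agent has never run.
def heartbeat_status(last_run, interval_minutes, now=None):
    if last_run is None:
        return "GRAY"                      # never run: no timestamp found
    now = now or datetime.now(timezone.utc)
    age = now - last_run
    interval = timedelta(minutes=interval_minutes)
    if age <= interval:
        return "GREEN"
    if age <= 2 * interval:
        return "YELLOW"
    return "RED"
```

An hourly agent that last ran 90 minutes ago lands in YELLOW; past the 2-hour mark it turns RED.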


Layer 2: Quality Scorecard

The heartbeat tells you what is running. The quality scorecard tells you what is running well. It reads quality gate results, accuracy metrics, and drift scores to produce a per-agent quality summary.

Quality Scorecard
You are a quality analyst for a multi-agent AI system. Produce a
scorecard showing how well each agent performs -- not just whether
it runs.

Read:
- Quality gate results: ~/quality/gate_results.jsonl
  (each line: agent, timestamp, result, details)
- Accuracy metrics: ~/metrics/agent_accuracy.json
  (per-agent scores, updated weekly)
- Drift detection: ~/metrics/drift_scores.json
  (per-agent drift vs baseline)
- Handoff validations: ~/workflows/validations/

For each agent, compute:

1. GATE PASS RATE (7-day):
   Runs passed / total runs. Display as percentage.
   Color: >95% green, 80-95% yellow, <80% red

2. ACCURACY (latest):
   From agent_accuracy.json. Show trend arrow:
   up if improved, down if declined, flat if within 1%

3. DRIFT SCORE:
   LOW (green) = consistent with baseline
   MEDIUM (yellow) = noticeable change, monitor
   HIGH (red) = significant deviation, investigate

4. HANDOFF HEALTH:
   % of handoffs passing all checks.
   List recurring validation failures.

5. COMPOSITE SCORE:
   40% accuracy + 30% gate rate + 20% handoff + 10% inverse drift
   Letter grade: A (>90), B (80-90), C (70-80), D (60-70), F (<60)

Output as HTML (~/control_room/quality.html):
- System composite at top (average across agents)
- Sortable table: one row per agent, columns for each metric
- Color-coded cells, expandable detail rows
- Match heartbeat.html dark theme styling

Rules:
- No gate results = "No data" (not 0%)
- Accuracy >7 days old = STALE flag
- HIGH drift = inline warning about downstream trust
- Missing data = re-weight remaining metrics

The composite score is a leading indicator. When an agent drops from B to C, investigate immediately -- do not wait for it to hit D. The quality scorecard catches degradation while there is still time to fix it before downstream consumers are affected.
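The weighting and the "re-weight remaining metrics" rule can be sketched in a few lines of Python. The key names are illustrative, and all inputs are assumed to be on a 0-100 scale:

```python
# Sketch of the composite score from the prompt: 40% accuracy,
# 30% gate pass rate, 20% handoff health, 10% inverse drift.
# Metrics reported as None are dropped and the remaining weights are
# re-normalized, implementing the "re-weight remaining metrics" rule.
WEIGHTS = {"accuracy": 0.40, "gate_rate": 0.30,
           "handoff": 0.20, "inverse_drift": 0.10}

def composite_score(metrics):
    present = {k: v for k, v in metrics.items() if v is not None}
    if not present:
        return None                         # no data at all: no score
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight

def letter_grade(score):
    for grade, floor in (("A", 90), ("B", 80), ("C", 70), ("D", 60)):
        if score >= floor:
            return grade
    return "F"
```

Re-normalizing keeps a missing metric from silently dragging the score down: an agent with no handoff data is graded on the metrics it does have, not penalized as if it scored zero.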


Layer 3: Incident Feed

The incident feed is a unified timeline of everything that went wrong. Instead of making you check five different log files, it puts every failure, timeout, validation error, and escalation into one chronological feed.

Incident Aggregator
You are an incident aggregator for a multi-agent AI system.
Produce a unified incident feed from multiple data sources.

Read:
- Safety net incidents: ~/incidents/incident_log.jsonl
- Workflow failures: ~/workflows/run_summary_*.json
  (the "failures" array from each run summary)
- Quality gate failures: ~/quality/gate_results.jsonl
  (entries where result = "FAIL")
- Handoff validation failures: ~/workflows/validations/
  (entries where overall = "FAIL")
- Escalation log: ~/incidents/escalations.jsonl

Normalize each to:
{
  "timestamp": "[ISO]",
  "source": "safety_net" | "workflow" | "quality_gate" |
            "handoff" | "escalation",
  "severity": "CRITICAL" | "HIGH" | "MEDIUM" | "LOW",
  "agent": "[agent name]",
  "summary": "[1-line description]",
  "details": "[full context]",
  "resolved": true/false
}

Severity assignment:
- CRITICAL: Agent down >2x interval, escalation triggered,
  data corruption
- HIGH: Workflow failure impacting downstream, autonomous
  agent gate failure
- MEDIUM: Supervised agent gate failure, handoff warning
- LOW: Self-resolved timeout, minor validation warning

Deduplicate: same agent + same failure type within 5 minutes
= one incident with count.

Output as HTML (~/control_room/incidents.html):
- Filter bar: All | Critical | High | Medium | Low | Unresolved
- Timeline: newest first, incident cards
- Each card: severity badge, timestamp (relative), agent name,
  source badge, summary, expandable details
- "Mark resolved" button on each card
- Critical always at top until resolved

Rules:
- Max 100 incidents displayed. Older archived.
- Unresolved >7 days = "STALE -- investigate or close"
- Must load in under 1 second (pre-compute)

The four data sources -- safety net, workflow summaries, quality gates, handoff validations -- already exist if you built Issues #15 through #19. The incident feed is not new data collection. It is data unification. One view instead of five.
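The 5-minute deduplication rule is the only non-obvious piece of the aggregator. A sketch, assuming each normalized incident is a dict with "timestamp" (a datetime), "agent", and "summary", and using the summary as a stand-in for "failure type" -- an assumption, since the normalized schema has no explicit type field:

```python
from datetime import datetime, timedelta

# Sketch of the dedup rule: same agent + same failure type within
# 5 minutes collapses into one incident record with a count.
def dedupe(incidents, window_minutes=5):
    merged = []
    for inc in sorted(incidents, key=lambda i: i["timestamp"]):
        prev = merged[-1] if merged else None
        if (prev
                and prev["agent"] == inc["agent"]
                and prev["summary"] == inc["summary"]
                and inc["timestamp"] - prev["timestamp"]
                    <= timedelta(minutes=window_minutes)):
            prev["count"] += 1              # fold into the open incident
        else:
            merged.append(dict(inc, count=1))
    return merged
```

The window here is measured from the first incident in a cluster, so repeats that trickle in past the 5-minute mark open a new record rather than extending the old one indefinitely.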


Layer 4: The Six Metrics That Matter

The trend dashboard tracks six metrics over time. Not sixty. Six. Each one changes your behavior when it moves.

Metric            | What It Measures                       | Direction
------------------|----------------------------------------|--------------------------------
Pipeline Duration | Total time, Wave 1 to last wave        | Lower is better
Success Rate      | % of runs completing successfully      | Higher is better (north star)
MTTD              | Mean time from incident to detection   | Lower is better
MTTR              | Mean time from detection to resolution | Lower is better
Gate Pass Rate    | % of runs passing quality checks       | Leading indicator (drops first)
Autonomy Dist.    | Agents at each trust tier              | More autonomous over time

The test for every metric: "If this number doubled, would I do something different?" If the answer is no, the metric does not belong in the control room. Six metrics, not sixty. Every metric you add dilutes attention.

You do not need a prompt-based generator for the trend dashboard. Build a weekly snapshot aggregator script that runs every Sunday, reads all data sources, computes the six metrics, and appends the results to weekly_snapshots.jsonl. The trend dashboard reads that file and renders the charts.
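The aggregator's output step can be sketched as an append-only writer. Computing the six metrics is system-specific and elided here; this shows only the JSONL shape the trend dashboard reads. The default path is an assumption:

```python
import json
from datetime import date
from pathlib import Path

# Sketch of the weekly snapshot aggregator's write step: one JSON
# object per week, appended so history is never rewritten.
def append_snapshot(metrics, path="~/control_room/weekly_snapshots.jsonl"):
    snapshot = {"week": date.today().isoformat(), **metrics}
    p = Path(path).expanduser()
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a") as f:
        f.write(json.dumps(snapshot) + "\n")
    return snapshot
```

Append-only matters: each Sunday adds one line, and the trend charts are just a read over the whole file.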


Wiring the Control Room

Step 1: Build the heartbeat monitor first. This is the most immediately useful layer. Run it against your existing run_state.json and dependency_map.json. Within 10 minutes, you have a live status page showing which agents are healthy and which need attention.

Step 2: Add the quality scorecard. This requires quality gate results and accuracy metrics from Issues #15 and #16. If you built those, you already have the data. Wire it into the scorecard.

Step 3: Wire the incident feed. This aggregates data you are already producing -- incident logs from Issue #18, validation results from Issue #19, quality gate failures from Issue #15. The incident feed just unifies them into one view.

Step 4: Build the trend dashboard. Create the weekly snapshot aggregator script. Run it manually for the first few weeks, then wire it into your orchestra as a weekly Wave 0 agent.

Step 5: Build the unified control room. A single HTML page with four tabs -- Heartbeat, Quality, Incidents, Trends -- that loads each layer. This is the one URL you open every morning.
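The shell page does not need to be fancy. A sketch that generates one index.html with four tabs, each loading a layer page in an iframe -- the filenames match the prompts' output paths, and trends.html is assumed to exist alongside them:

```python
from pathlib import Path

# Sketch of the Step 5 shell: tab buttons toggle which layer's
# iframe is visible. Heartbeat is shown by default.
TABS = ["heartbeat", "quality", "incidents", "trends"]

def build_shell(out_dir="~/control_room"):
    buttons = "".join(
        f'<button onclick="show(\'{t}\')">{t.title()}</button>' for t in TABS)
    frames = "".join(
        f'<iframe id="{t}" src="{t}.html" '
        'style="display:none;width:100%;height:90vh;border:0"></iframe>'
        for t in TABS)
    html = (
        "<!doctype html><html><body style='margin:0;background:#06040F'>"
        f"<nav>{buttons}</nav>{frames}"
        "<script>function show(t){for(const f of "
        "document.querySelectorAll('iframe'))"
        "f.style.display=f.id===t?'block':'none';}show('heartbeat');"
        "</script></body></html>")
    out = Path(out_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    (out / "index.html").write_text(html)
    return html
```

Iframes keep each layer independent: a broken quality page cannot take down the heartbeat tab.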


What the Control Room Looks Like Running

7:00 AM -- Heartbeat Tab

9 agents green, 1 yellow. The yellow agent is the news fetcher -- ran 50 minutes ago, expected interval is 60 minutes. Not alarming, but worth watching. Zero red.

7:00:30 AM -- Quality Tab

System composite: B+ (87%). Research synthesizer dropped from B to C -- drift score went from LOW to MEDIUM. Output length increased 40% vs baseline. Probably a prompt change from last week. Worth reviewing.

7:01 AM -- Incidents Tab

3 incidents in 24 hours. One MEDIUM: handoff validation caught a missing field, auto-resolved via fallback. Two LOW: API timeouts that succeeded on retry. Nothing requiring action.

7:01:30 AM -- Trends Tab

Success rate: 91% to 94% over 4 weeks. Pipeline duration stable at ~8 min. MTTD improved from 8 min to 3.5 min. One agent promoted from supervised to semi-autonomous this week.

Compare this to the old approach: cat run_state.json, grep FAIL quality/*.jsonl, tail -50 incidents/incident_log.jsonl, mental math to figure out if things are getting better or worse. Twenty minutes of file-reading to get the same picture.

90 seconds: a complete system health check -- every agent, every metric.
Four layers, four questions answered: Is it running? Is it running well? What went wrong? Is it getting better? One URL, one morning check, complete operational visibility.

Common Mistakes

  1. Too many metrics. If a metric does not change your behavior when it moves, remove it. The control room shows six metrics, not sixty. Every additional metric dilutes attention from the ones that matter.
  2. No thresholds, no alerts. A dashboard without thresholds is a screensaver. Every metric needs green/yellow/red boundaries. Every red threshold should trigger an alert. The control room is not just for manual checking -- it is the trigger for automated responses.
  3. Building before data exists. Heartbeat and quality scorecard work from day one. The trend dashboard needs 4+ weeks of snapshots to show actual trends. Build them in order.
  4. Forgetting to monitor the monitor. If the control room page is more than 2 hours old, display a red banner: "STALE -- control room has not refreshed since [timestamp]." Monitor the monitor.
  5. Over-engineering the UI. This is an operations tool. Optimize for information density, not aesthetics. If you spend more time on CSS than on the data pipeline, your priorities are wrong.
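Mistake #4 is cheap to avoid. A sketch of the freshness check, using the generated page's file modification time as the signal; return True and you render the red STALE banner:

```python
import time
from pathlib import Path

# Sketch of "monitor the monitor": the control room page is stale if
# its file has not been rewritten in over 2 hours.
def is_stale(path, max_age_hours=2):
    p = Path(path).expanduser()
    if not p.exists():
        return True                         # never generated counts as stale
    age_seconds = time.time() - p.stat().st_mtime
    return age_seconds > max_age_hours * 3600
```

Run it from anything that is not the control room itself -- a cron job, a shell alias -- so a dead generator cannot vouch for its own freshness.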

The Bottom Line

You built the pieces: agents, quality gates, feedback loops, autonomy tiers, a safety net, and an orchestra. Each piece produces its own data, writes to its own files, tells its own story.

The control room unifies those stories into one view. Four layers -- heartbeat, quality, incidents, trends -- give you a complete picture of your system's health in under two minutes.

The heartbeat tells you what is running. The scorecard tells you what is running well. The incident feed tells you what went wrong. The trend dashboard tells you whether the system is getting better or worse.

Without the control room, you have a system that works but requires detective work to verify. With it, you have a system that tells you its own status -- clearly, quickly, and honestly.


Try It This Week

Build just the heartbeat monitor. Take every scheduled process you have -- agents, cron jobs, scripts, anything that runs on a timer -- and create a status page that shows when each one last ran and whether it is on schedule.

You will discover something interesting: at least one process is silently behind schedule. It has been late for days, maybe weeks, and you did not notice because nothing checks. The heartbeat monitor makes the invisible visible.

Once you see the heartbeat working, you will want the other three layers. That is the point -- operational visibility is addictive because it replaces anxiety with information.

Start with the heartbeat. Reply with your status page screenshot and I will review your thresholds and suggest which agents need tighter monitoring intervals.

Next Issue: Issue #21

The Memory

Your agents run every day, but they forget everything between sessions. The control room shows you what is happening now. The memory system gives your agents access to what happened before -- past decisions, past errors, past successes. We will build a structured memory layer that makes your agents smarter over time, not just persistent.

Get the next issue

One tested AI workflow, delivered every week. No fluff.

Free forever. One email per week. Unsubscribe anytime.