Issue #16

The Feedback Loop: A Metrics Layer That Makes Your AI System Improve Itself

The AI Playbook · 10 min read · 3 prompts

Last issue, you built quality gates that catch errors before they reach your dashboard. Stale data, hallucinated facts, broken logic — all flagged before you see them.

But here is the question nobody asks: are your agents actually getting better over time?

Without measurement, you have no idea. Your quality gates catch errors today, but the same errors might be recurring every week. The same agent might fail in the same way every Monday. You are firefighting, not improving.

The fix is a feedback loop — a metrics layer that aggregates your quality gate results into weekly scores, detects when an agent is degrading, and tells you exactly what to fix next. It turns your AI system from "runs every day" into "improves every week."


What Gets Measured Gets Fixed

Your quality gates from Issue #15 already produce structured JSON for every check. That data is sitting in your ~/quality/ folder. Right now, you read it once and move on. That is a waste.

Those JSON files are a goldmine. They contain the complete error history of every agent you run. Aggregated over time, they tell you which agents are improving, which are slipping, which error types recur, and where a fix would pay off most.

Key insight: A quality gate without a feedback loop is like a smoke detector without a fire department. It tells you something is wrong, but it does not fix anything. The feedback loop closes the circuit: detect, measure, prioritize, fix, verify the fix worked.


Component 1: Weekly Accuracy Tracker

The accuracy tracker reads all your quality gate JSON files from the past week and produces a single report: how accurate was each agent, broken down by check type?

Weekly Accuracy Tracker
You are an accuracy analyst. Your job is to aggregate quality
gate results into a weekly accuracy report.

Read ALL JSON files in ~/quality/ from the past 7 days.
Group by agent name and check type (freshness, factcheck, logic).

For each agent, calculate:
1. TOTAL CHECKS: how many checks ran this week
2. PASS RATE: percentage that passed (no issues found)
3. ERROR BREAKDOWN: count by severity (MINOR, MAJOR, CRITICAL)
4. MOST COMMON ERROR: the error type that appeared most often
5. TREND: compare to last week's report (if it exists)

Output format (save to ~/quality/weekly_accuracy_[DATE].json):
{
  "week_ending": "[DATE]",
  "agents": {
    "[AGENT_NAME]": {
      "total_checks": [number],
      "pass_rate": [percentage],
      "freshness_pass_rate": [percentage],
      "factcheck_pass_rate": [percentage],
      "logic_pass_rate": [percentage],
      "errors": {
        "minor": [count],
        "major": [count],
        "critical": [count]
      },
      "most_common_error": "[description]",
      "trend": "IMPROVING" or "STABLE" or "DEGRADING"
    }
  },
  "system_pass_rate": [overall percentage],
  "worst_agent": "[name]",
  "worst_check_type": "[type]"
}

Also output a plain-text summary (save to
~/quality/weekly_accuracy_[DATE].txt):
- One line per agent with pass rate and trend arrow
- Overall system score
- Top 3 issues to fix this week
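The pass-rate arithmetic is deterministic, so you can sanity-check the model's report with a few lines of Python. A minimal sketch, assuming each gate writes one JSON file containing `agent` and `passed` fields — those field names are my assumption, so match them to whatever your Issue #15 gates actually emit:

```python
import glob
import json
import os
import time
from collections import defaultdict

WEEK_SECONDS = 7 * 24 * 3600

def weekly_pass_rates(quality_dir):
    """Aggregate the past week's gate results into per-agent pass rates."""
    cutoff = time.time() - WEEK_SECONDS
    stats = defaultdict(lambda: {"total": 0, "passed": 0})
    for path in glob.glob(os.path.join(quality_dir, "*.json")):
        if os.path.getmtime(path) < cutoff:
            continue  # only files written in the past 7 days
        with open(path) as f:
            result = json.load(f)
        agent = result["agent"]
        stats[agent]["total"] += 1
        stats[agent]["passed"] += bool(result.get("passed"))
    # Percentage pass rate per agent, rounded to one decimal place
    return {
        agent: round(100 * s["passed"] / s["total"], 1)
        for agent, s in stats.items() if s["total"]
    }
```

Point it at ~/quality/ and compare its numbers to the model's report. If they disagree, trust the script — the model should do the grouping and explanation, not the division.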

Run this every Sunday evening. It takes under a minute and gives you a clear picture of your system's health.

Here is what the output looks like after four weeks:

Agent       Week 1   Week 2   Week 3   Week 4   Trend
Research      82%      85%      91%      94%     +12%
Financial     76%      79%      78%      88%     +12%
Email         91%      88%      85%      83%      -8%

Two agents improving. One degrading. Without the tracker, you would not know the email agent is getting worse — its quality gate still catches errors, so your dashboard still looks clean. But the underlying accuracy is slipping. The feedback loop makes that visible.


Component 2: Drift Detector

The drift detector is an early warning system. It compares this week's accuracy to the trailing average and flags anything that is degrading beyond normal variance.

Drift Detector
You are a drift detector. Your job is to identify agents or
check types that are getting worse over time.

Read the weekly accuracy reports in ~/quality/weekly_accuracy_*.json
(use all available history).

For each agent and each check type:
1. Calculate the trailing 4-week average pass rate
2. Compare this week's pass rate to the trailing average
3. Flag any drop greater than 5 percentage points as DRIFT
4. Flag any drop greater than 15 percentage points as ALARM
5. Flag any agent with 2+ consecutive weeks of decline as TREND

Output format (save to ~/quality/drift_report_[DATE].json):
{
  "week_ending": "[DATE]",
  "alerts": [
    {
      "agent": "[name]",
      "check_type": "[freshness/factcheck/logic/overall]",
      "current_week": [percentage],
      "trailing_avg": [percentage],
      "delta": [percentage points],
      "severity": "DRIFT" or "ALARM",
      "consecutive_declines": [count],
      "likely_cause": "[your best assessment]"
    }
  ],
  "stable_agents": ["[names of agents with no drift]"],
  "system_trend": "IMPROVING" or "STABLE" or "DEGRADING"
}

Rules:
- Ignore the first 2 weeks (not enough baseline data)
- A single bad week is DRIFT. Two bad weeks is ALARM.
- If an agent's pass rate has been below 80% for 3+ weeks,
  flag it as CHRONIC regardless of trend direction
- Always suggest a likely cause based on error patterns

Pro Tip

The "likely cause" field is where this gets powerful. If the drift detector says "email agent freshness failures increased 3x this week — 4 of 5 failures occurred on Monday mornings," that tells you exactly where to look. Maybe your email provider has Monday maintenance. Maybe your cron job runs before the inbox updates. The pattern is in the data.
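The thresholds in the drift rules reduce to simple arithmetic, which is worth computing deterministically rather than trusting the model's math. A sketch under the rules above — the function names and the "list of weekly pass rates, oldest first" input shape are my assumptions:

```python
def classify_drift(history, drift_pts=5, alarm_pts=15):
    """Classify the latest week against the trailing 4-week average.

    history: weekly pass rates (percentages), oldest first.
    Returns "OK", "DRIFT", or "ALARM"; None if baseline is too short.
    """
    if len(history) < 3:  # rule: ignore the first 2 weeks
        return None
    current = history[-1]
    trailing = history[-5:-1]  # up to 4 prior weeks
    delta = sum(trailing) / len(trailing) - current
    if delta > alarm_pts:
        return "ALARM"
    if delta > drift_pts:
        return "DRIFT"
    return "OK"

def consecutive_declines(history):
    """Count weeks of uninterrupted decline ending at the latest week."""
    n = 0
    for i in range(len(history) - 1, 0, -1):
        if history[i] < history[i - 1]:
            n += 1
        else:
            break
    return n
```

Feed the model the classifications and let it write the "likely_cause" field — pattern-spotting is where it earns its keep, not subtraction.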


Component 3: Improvement Prioritizer

You have limited time. The prioritizer looks at all your error data and answers one question: what single change would improve your system the most?

Improvement Prioritizer
You are an improvement analyst. Your job is to recommend the
single highest-impact fix for this week.

Read all quality gate results from ~/quality/ (past 30 days).
Read the drift report at ~/quality/drift_report_[DATE].json.
Read the weekly accuracy report at ~/quality/weekly_accuracy_[DATE].json.

Analyze error patterns:
1. GROUP errors by category (stale data, hallucination, logic,
   priority misrank, math error, missing alert, etc.)
2. COUNT frequency of each category over the past 30 days
3. WEIGHT by severity (CRITICAL=3, MAJOR=2, MINOR=1)
4. CALCULATE impact score = frequency x severity weight
5. RANK categories by impact score

Output format (save to ~/quality/priorities_[DATE].json):
{
  "week_ending": "[DATE]",
  "recommendations": [
    {
      "rank": 1,
      "category": "[error category]",
      "impact_score": [number],
      "frequency": "[X times in 30 days]",
      "affected_agents": ["[agent names]"],
      "suggested_fix": "[specific, actionable recommendation]",
      "estimated_improvement": "[X% pass rate improvement]",
      "effort": "LOW" or "MEDIUM" or "HIGH"
    }
  ],
  "quick_wins": ["[fixes that are LOW effort + HIGH impact]"],
  "systemic_issues": ["[patterns affecting multiple agents]"]
}

Rules:
- Limit to top 5 recommendations (focus, not volume)
- "Suggested fix" must be specific and actionable, not generic
  Bad: "Improve the research agent"
  Good: "Add a URL reachability check to the research agent's
  fact-check gate — 8 of 12 CRITICAL errors last month were
  hallucinated source URLs"
- Always include at least one quick win (low effort, high impact)
- If the same error category appears in 3+ agents, flag it as
  systemic and recommend a shared fix
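The impact-score ranking is also plain arithmetic you can verify. A sketch that sums severity weights per occurrence (one reading of "frequency x severity weight" — exact if all errors in a category share a severity). The error-record shape with `category` and `severity` keys is my assumption:

```python
from collections import Counter

SEVERITY_WEIGHT = {"CRITICAL": 3, "MAJOR": 2, "MINOR": 1}

def rank_categories(errors, top_n=5):
    """Rank error categories by impact score.

    errors: iterable of dicts with "category" and "severity" keys.
    Returns the top_n categories, highest impact first.
    """
    scores = Counter()
    counts = Counter()
    for e in errors:
        scores[e["category"]] += SEVERITY_WEIGHT[e["severity"]]
        counts[e["category"]] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [
        {"rank": i + 1, "category": c,
         "impact_score": scores[c], "frequency": counts[c]}
        for i, c in enumerate(ranked)
    ]
```

Note how the weighting changes priorities: three MINOR stale-data errors (score 3) lose to two CRITICAL hallucinations (score 6), even though stale data is more frequent.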

Wiring the Feedback Loop

Step 1: Schedule the tracker. Add a Sunday evening task that runs the Weekly Accuracy Tracker prompt. It reads the week's quality gate files and produces the summary. This takes about 2 minutes.

Step 2: Schedule the detector. Immediately after the tracker, run the Drift Detector. It reads the tracker's output plus historical reports and flags anything degrading. Another 2 minutes.

Step 3: Schedule the prioritizer. After the detector, run the Improvement Prioritizer. It reads everything and gives you one clear recommendation for the week. 2 more minutes.

Step 4: Act on the recommendation. This is the part that matters. Every Monday morning, you open the prioritizer's output and implement the top recommendation. Just one. This takes 15-30 minutes depending on complexity.

Step 5: Verify next Sunday. When the tracker runs again, check whether your fix actually improved the score. If yes, move to the next recommendation. If no, investigate why and try a different approach.

The cycle: Sunday: measure → Monday: fix → Week: observe → Sunday: measure again. Every week, one fix. Every week, better accuracy. After 8 weeks, your system is unrecognizable from where it started.


What the Numbers Actually Look Like

Here is a realistic trajectory for a system that starts measuring and fixing one thing per week:

Week   System Accuracy   Fix Applied                       Impact
1           78%          Baseline (no fix yet)                --
2           78%          Added API retry for stale data    measuring
3           84%          Retry worked (+6%)                  +6%
4           86%          Added URL check for sources         +2%
6           91%          Fixed Monday cron timing            +5%
8           95%          Added threshold sanity check        +4%

78% to 95% in 8 weeks. Not from smarter AI. Not from more expensive models. From measuring what breaks and fixing one thing per week.


What Could Go Wrong

  1. Vanity metrics. A 95% pass rate means nothing if your quality gates are too lenient. The feedback loop is only as good as the gates feeding it. If you lower your thresholds to get a better score, you are lying to yourself. Keep gates strict.
  2. Fixing symptoms instead of causes. The prioritizer might say "research agent hallucinated 5 times this week." The temptation is to add a hallucination check. The real fix might be to shorten the agent's context window or use a more specific prompt. Look for root causes.
  3. Over-optimizing one agent. If your research agent is at 98% and your email agent is at 72%, stop improving the research agent. Diminishing returns are real. The prioritizer handles this — trust its rankings.
  4. Skipping the "verify" step. You applied a fix last week. This week's score went up 3%. Was it your fix or random variance? Check the specific error category you targeted. If that category dropped but another rose, your fix worked but exposed a new issue. That is progress.
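That last check — did the category you targeted actually drop, and did anything else rise — is mechanical enough to script. A sketch, assuming both weeks are summarized as `{category: error count}` dicts (the shape and verdict strings are mine, not from the prompts):

```python
def verify_fix(last_week, this_week, target):
    """Did this week's fix reduce the targeted error category?

    last_week / this_week: {category: error count} dicts.
    Returns a short verdict string.
    """
    before = last_week.get(target, 0)
    after = this_week.get(target, 0)
    if after < before:
        # The target dropped; check whether another category rose instead.
        regressions = sorted(
            c for c in this_week
            if c != target and this_week[c] > last_week.get(c, 0)
        )
        if regressions:
            return "FIX WORKED, new issues in: " + ", ".join(regressions)
        return "FIX WORKED"
    if after == before:
        return "NO CHANGE, likely random variance"
    return "REGRESSED, investigate the fix"
```

A "fix worked, new issues" verdict is still progress — you eliminated one failure mode and surfaced the next one to prioritize.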

The Bottom Line

A quality gate catches errors. A feedback loop eliminates them. The difference is measurement.

Without measurement, you are guessing which improvements matter. With measurement, you are making one precise fix per week backed by data. In two months, you have a system that is measurably, provably better than the one you started with.

78% → 95%
System accuracy after 8 weeks of measured improvement
One fix per week. Measured every Sunday. Verified the following Sunday. No guessing. No over-engineering. Just the data telling you what to fix next.

Try It This Week

If you have quality gates running from Issue #15, you already have the data. Run the Weekly Accuracy Tracker prompt on Sunday. Look at your pass rates. Pick the worst number and investigate why.

If you do not have gates yet, start there. Issue #15 will get you set up. Then come back here.

One measurement. One fix. Every week. That is the entire system. Reply to this email with your Week 1 accuracy scores — I will tell you exactly what to fix first.

Next Issue: Issue #17

The Handoff

Your metrics prove which agents are reliable. Now let them act on their own. We will build an earned-autonomy framework — agents that graduate from "review required" to "autonomous" based on their accuracy track record.

Get the next issue

One tested AI workflow, delivered every week. No fluff.

Free forever. One email per week. Unsubscribe anytime.