Issue #15

The Quality Gate: An AI Review Loop That Catches Errors Before You Do

The AI Playbook 10 min read 3 prompts

Last issue, you built a decision dashboard — one morning view that ranks everything your agents found.

Here is the uncomfortable truth: that dashboard is only as good as the data feeding it. And your agents will get things wrong.

A research agent will hallucinate a statistic. A financial agent will pull stale prices because an API timed out. An email agent will misclassify something urgent as routine. These are not edge cases. They are Tuesday.

The fix is not smarter agents. It is a quality gate — a review layer that sits between your agents and your dashboard, checking their output before you ever see it.

This issue shows you how to build one. Three checks. Three prompts. No unflagged errors reaching your morning view.


Why Agents Make Mistakes

Understanding the failure modes helps you write better checks. Agent errors fall into three categories:

1. Stale data. The most common failure. An API call fails silently, and the agent processes yesterday's data as if it is fresh. Your financial dashboard shows Monday's prices on Wednesday morning. You make a decision based on information that is 48 hours old.

2. Hallucinated facts. The agent invents a number, a name, or a trend that does not exist in the source material. This is especially dangerous in research agents that summarize long documents — the summary sounds authoritative even when it is fabricated.

3. Logic errors. The agent's reasoning is wrong. It marks something as P2 that should be P0. It calculates a percentage change incorrectly. It misses a threshold that should trigger an alert. These are the hardest to catch because the output looks well-formatted and confident.

Key insight: You cannot prevent these errors. LLMs hallucinate. APIs fail. Logic breaks at edge cases. But you can catch every one of them before the output reaches your dashboard — the same way software teams catch bugs before code reaches production.


The Three-Check Architecture

A quality gate runs three checks on every piece of agent output. If any check fails, the output is flagged — not deleted, not rewritten, just flagged so you know exactly what to trust and what to verify.

Check 1: Freshness

Is the data actually from today? This check is mechanical — no AI needed. Compare timestamps in the agent's output against the current date. If any data source is older than your threshold (usually 24 hours), flag it.
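Because this check is mechanical, you can sketch it in plain Python with no AI call at all. This is a minimal version, assuming each agent's output includes ISO-8601 timestamps — the key names (`field`, `timestamp`) are illustrative, so adapt them to your agents' actual output:

```python
from datetime import datetime, timezone

STALE_AFTER_HOURS = 24  # flag anything older than this; tune per source

def check_freshness(items, now=None):
    """Flag any item whose timestamp is older than the threshold.

    `items` is a list of dicts with a "field" name and an ISO-8601
    "timestamp". Returns a report shaped like the gate's JSON output.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for item in items:
        ts = datetime.fromisoformat(item["timestamp"])
        age_hours = (now - ts).total_seconds() / 3600
        if age_hours > STALE_AFTER_HOURS:
            stale.append({
                "field": item["field"],
                "age_hours": round(age_hours, 1),
                # mirror the prompt's rule of thumb: very old data is CRITICAL
                "severity": "CRITICAL" if age_hours > 48 else "WARNING",
            })
    return {"status": "FAIL" if stale else "PASS", "stale_items": stale}
```

Timestamp math does not hallucinate, which is exactly why this is the check to build first.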

Freshness Gate Prompt
You are a quality gate. Your ONLY job is to check data freshness.

Read the agent output file at: [AGENT_OUTPUT_PATH]

For every data point, date, or metric in the output:
1. Extract the date or timestamp it references
2. Compare against today's date: [TODAY]
3. Flag anything older than 24 hours

Output format (save to ~/quality/freshness_[AGENT]_[TODAY].json):
{
  "agent": "[AGENT_NAME]",
  "checked_at": "[NOW]",
  "status": "PASS" or "FAIL",
  "items_checked": [count],
  "stale_items": [
    {
      "field": "what data point",
      "data_date": "when the data is from",
      "age_hours": [number],
      "severity": "WARNING" or "CRITICAL"
    }
  ]
}

Rules:
- Market data older than 1 trading day = CRITICAL
- News older than 48 hours = WARNING
- Reference data (company descriptions, etc.) = exempt
- If you cannot determine the date of a data point, flag it as
  "UNDATED" with severity WARNING

Check 2: Fact Verification

Does the output match its source material? This is where a second AI pass adds real value. The reviewing agent reads both the source and the summary, looking for claims in the summary that are not supported by the source.

Fact Check Prompt
You are a fact-checking gate. Your ONLY job is to verify that
the agent's output is supported by its source material.

Source material: [SOURCE_FILE_PATH]
Agent output: [AGENT_OUTPUT_PATH]

For every factual claim in the agent's output:
1. Find the supporting evidence in the source material
2. Verify numbers match exactly (not approximately)
3. Verify names, dates, and entities are correct
4. Flag any claim that appears in the output but NOT in the source

Output format (save to ~/quality/factcheck_[AGENT]_[TODAY].json):
{
  "agent": "[AGENT_NAME]",
  "status": "PASS" or "FAIL",
  "claims_checked": [count],
  "unsupported_claims": [
    {
      "claim": "what the agent said",
      "source_says": "what the source actually says (or 'NOT FOUND')",
      "severity": "MINOR" (rounding) or "MAJOR" (wrong fact)
        or "CRITICAL" (fabricated)
    }
  ]
}

Rules:
- Rounding differences under 2% = MINOR (acceptable)
- Wrong numbers, names, or dates = MAJOR
- Claims with no source at all = CRITICAL (likely hallucination)
- Opinions and analysis are exempt — only check factual claims
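Part of this check can also be made mechanical before any AI pass: extract every number from the summary and confirm it appears in the source, within the 2% rounding tolerance from the rules above. This is a crude sketch — it will not catch wrong names or fabricated trends, only numbers, and naive regex matching means dates and ordinals count as numbers too:

```python
import re

NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")

def unsupported_numbers(summary, source, tolerance=0.02):
    """Return numbers in the summary with no close match in the source.

    Matches within `tolerance` (2%, mirroring the MINOR rounding rule)
    are accepted; anything else is a candidate hallucination.
    """
    source_nums = [float(m) for m in NUM_RE.findall(source)]
    missing = []
    for m in NUM_RE.findall(summary):
        val = float(m)
        supported = any(
            s == val or (s != 0 and abs(val - s) / abs(s) <= tolerance)
            for s in source_nums
        )
        if not supported:
            missing.append(m)
    return missing
```

Run this first; hand only the clean summaries to the full AI fact check, and you cut both cost and false confidence.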

Check 3: Logic Audit

Are the agent's conclusions consistent with its own data? This check catches the most dangerous errors — the ones where the data is correct but the interpretation is wrong.

Logic Audit Prompt
You are a logic auditor. Your ONLY job is to check whether
the agent's conclusions follow from its own data.

Agent output: [AGENT_OUTPUT_PATH]

Check these logic patterns:
1. PRIORITY RANKING: Are P0/P1/P2 assignments consistent?
   Would any P2 item be P0 under the stated rules?
2. MATH: Recalculate every percentage, average, and comparison.
   Does 2+2 actually equal 4 in every case?
3. THRESHOLDS: If the agent says "X is above/below threshold Y",
   verify the comparison is correct.
4. CONTRADICTIONS: Does the agent say one thing in the summary
   and the opposite in the details?
5. MISSING ALERTS: Based on the data, should any alert have
   fired that did not?

Output format (save to ~/quality/logic_[AGENT]_[TODAY].json):
{
  "agent": "[AGENT_NAME]",
  "status": "PASS" or "FAIL",
  "checks_run": [count],
  "issues": [
    {
      "type": "PRIORITY" or "MATH" or "THRESHOLD"
        or "CONTRADICTION" or "MISSING_ALERT",
      "description": "what is wrong",
      "agent_said": "the agent's claim",
      "correct_answer": "what it should be",
      "severity": "MINOR" or "MAJOR" or "CRITICAL"
    }
  ]
}
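The priority-ranking pattern (check 1 in the prompt) can also be enforced in code once your P0 rules are explicit. A sketch, assuming each email is a dict with `sender`, `subject`, and `priority` keys — the rules and field names here are hypothetical examples, not a fixed schema:

```python
def audit_priorities(emails, p0_rules):
    """Flag emails whose assigned priority contradicts the stated P0 rules.

    `p0_rules` is a list of predicates; if any predicate matches an
    email, that email must be marked P0.
    """
    issues = []
    for email in emails:
        should_be_p0 = any(rule(email) for rule in p0_rules)
        if should_be_p0 and email["priority"] != "P0":
            issues.append({
                "type": "PRIORITY",
                "description": f"{email['subject']!r} matches a P0 rule "
                               f"but was marked {email['priority']}",
            })
    return {"status": "FAIL" if issues else "PASS", "issues": issues}

# Example rules (illustrative): CFO emails and "urgent" subjects are P0.
p0_rules = [
    lambda e: e["sender"] == "cfo@company.com",
    lambda e: "urgent" in e["subject"].lower(),
]
```

The point of writing rules as predicates: when the agent misclassifies something, you add one lambda instead of re-prompting and hoping.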

Wiring It Together

Step 1: Create the quality folder. Make ~/quality/ to hold all gate output. Each check writes its own JSON file. This gives you a complete audit trail.

Step 2: Run checks after each agent. Your daily schedule becomes: each agent runs, its quality gates run on the output as soon as it finishes, and the collector runs last — reading both the agent outputs and the gate reports.

Step 3: Feed gate results into your dashboard. Your collector from Issue #14 should read the quality JSON files and attach a status badge to each agent's section. A green check means all three gates passed. A yellow warning means minor issues. A red flag means do not trust this section until you verify it manually.
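The badge logic can be a small function in the collector. A sketch, assuming the gates saved files named `freshness_`, `factcheck_`, and `logic_` as in the prompts above (the PASS/WARN/FAIL collapse rule is one reasonable choice, not the only one):

```python
import json
from pathlib import Path

def agent_badge(agent, today, quality_dir=Path.home() / "quality"):
    """Collapse the three gate reports for one agent into a single badge.

    Returns "PASS" (all gates clean), "WARN" (only MINOR/WARNING issues),
    or "FAIL" (any MAJOR/CRITICAL issue, or a missing gate file).
    """
    worst = "PASS"
    for check in ("freshness", "factcheck", "logic"):
        path = quality_dir / f"{check}_{agent}_{today}.json"
        if not path.exists():
            return "FAIL"  # a gate that never ran is itself a failure
        report = json.loads(path.read_text())
        if report["status"] == "FAIL":
            findings = (report.get("stale_items", [])
                        + report.get("unsupported_claims", [])
                        + report.get("issues", []))
            severities = [f.get("severity", "MAJOR") for f in findings]
            if any(s in ("CRITICAL", "MAJOR") for s in severities):
                return "FAIL"
            worst = "WARN"
    return worst
```

Note the missing-file rule: treating an absent report as FAIL stops a crashed gate from silently turning into a green badge.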

Here is what your morning view looks like with quality gates:

Pass Research Agent — all 3 checks passed. 14 claims verified against source. Data from today.
Warn Financial Agent — freshness check flagged bond data (18 hours old, market closed). Facts and logic passed.
Fail Email Agent — logic audit found 2 emails marked P2 that match P0 rules (sender is CFO + contains "urgent"). Review manually.
Pro Tip

Start with the freshness check only. It catches the most common failure (stale data), requires no AI for basic cases, and takes under a minute. Add the fact check and logic audit once the freshness gate is running reliably.


The Self-Improving Gate

This is where quality gates become genuinely powerful. Every time you catch an error that the gate missed, you add a new check. The gate grows with every mistake.

Week 1: Your gate has the three standard checks. It catches stale data and a hallucinated number.

Week 3: You notice the financial agent keeps rounding aggressively. You add a check: "If any percentage change exceeds 10%, verify against raw price data." Gate now has 4 checks.

Week 6: The research agent cited a news article that does not exist. You add a check: "For any named source, verify the URL is reachable." Gate now has 5 checks.

Week 12: Your gate has 8 checks. The error rate reaching your dashboard has dropped from roughly 15% to under 2%. Not because your agents got smarter — because the gate learned every failure mode.

The pattern: Error reaches you → diagnose root cause → write a check that catches this class of error → add it to the gate → verify the check works. Every mistake makes the system permanently better. This is how production software teams work. Now your AI system works the same way.
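If you keep the gate in code, "add a check" can literally mean registering one more function. A minimal sketch of that pattern — the week-3 rounding check from above is included as an example, and the output keys (`percent_changes`, `verified_against_raw`) are assumptions about your agent's format:

```python
CHECKS = []  # the gate: a growing list of (name, function) pairs

def check(name):
    """Decorator that registers a new check. Add one per failure mode."""
    def register(fn):
        CHECKS.append((name, fn))
        return fn
    return register

@check("percent_change_sanity")  # added in week 3 after the rounding bug
def percent_change_sanity(output):
    """Large swings must be verified against raw price data."""
    return all(abs(pc) <= 10 or output.get("verified_against_raw", False)
               for pc in output.get("percent_changes", []))

def run_gate(output):
    """Run every registered check; return the names of those that failed."""
    return [name for name, fn in CHECKS if not fn(output)]
```

Each new failure mode becomes one more decorated function; the gate grows without touching the code that runs it.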


What Could Go Wrong

  1. The gate itself hallucinates. A reviewing AI can flag correct output as wrong. Solution: make your checks as mechanical as possible. Timestamp comparisons do not hallucinate. Number recalculations rarely do. Save the subjective judgment for the logic audit, and weight its findings lower than the other two.
  2. Too many false positives. If your gate flags 20 items every morning and 18 are fine, you will start ignoring it. Tune your thresholds. Start strict and loosen — it is easier to relax a check than to tighten one after you have stopped paying attention.
  3. Gate slows down the pipeline. Three AI checks per agent add 5-10 minutes to your daily run. If this matters, run the freshness check only (instant, no AI needed) and batch the other two to run in parallel.
  4. Over-engineering on day one. Do not build 12 checks before running any of them. Start with freshness. Add one check per week based on actual errors you encounter. The best quality systems grow from real failures, not hypothetical ones.
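Batching the two AI checks from point 3 is a few lines with a thread pool. The gate callables here are placeholders — in practice each one would invoke the corresponding prompt against your model of choice:

```python
from concurrent.futures import ThreadPoolExecutor

def run_gates_parallel(gates, output_path):
    """Run several gate checks side by side on one agent's output.

    `gates` maps a check name to a callable that takes the output path
    and returns a report dict (the callable's contract is an assumption).
    """
    with ThreadPoolExecutor(max_workers=len(gates)) as pool:
        futures = {name: pool.submit(fn, output_path)
                   for name, fn in gates.items()}
        return {name: f.result() for name, f in futures.items()}
```

Since the fact check and logic audit read the same file and write separate reports, they have no reason to wait for each other.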

The Bottom Line

Your agents will make mistakes. The question is whether you catch those mistakes before or after you act on them.

A quality gate turns your AI system from "mostly right, sometimes dangerously wrong" into "verified correct, with clear flags on anything uncertain." That is the difference between a tool you check and a tool you trust.

<2% error rate
Instead of ~15% undetected errors reaching your dashboard
Three automated checks. Self-improving with every mistake. Complete audit trail. You only review what is flagged — everything else is verified.

Try It This Week

Pick your most error-prone agent. Add the freshness check. Run it for three days. Count how many stale-data incidents it catches that you would have missed.

Then add the fact check. Then the logic audit. By next week, that agent's output will be the most reliable data source on your dashboard.

Reply to this email with what your gate caught — the best stories will be featured in a future issue.

Next Issue: Issue #16

The Feedback Loop

Your system runs, but is it getting better? We will build a metrics layer that tracks what your agents get right, what they get wrong, and what to fix next — automatic improvement, measured weekly.

Get the next issue

One tested AI workflow, delivered every week. No fluff.

Free forever. One email per week. Unsubscribe anytime.