The Evaluation Lab: How to Know If Your AI Actually Works

Every team that ships an AI feature hits the same wall. The feature works — in the sense that it produces output without crashing. But nobody knows if the output is good.

The gap is enormous. Google reported at ICLR 2026 that 73% of production AI systems have no formal evaluation framework. Teams ship, watch for complaints, and hope. When quality degrades — and it always degrades — they discover it from angry users, not from their own monitoring.

The problem is not laziness. The problem is that evaluating AI output is genuinely hard. Traditional software testing checks binary conditions: the function returns the right number, the API returns 200, the button renders. AI output exists on a spectrum. A summary can be mostly accurate. A recommendation can be somewhat relevant. A classification can be usually correct.

“It seems fine” is not an evaluation strategy. These three prompts replace gut feel with a structured framework that catches quality problems before your users do.

The Evaluation Gap

Most AI evaluation failures follow the same pattern:

| Anti-Pattern | What Goes Wrong | How Common |
|--------------|-----------------|------------|
| Vibe check | Developer reads 5 outputs, says “looks good” | Universal |
| Benchmark theater | Model scores 92% on a benchmark unrelated to your task | Very common |
| Demo-driven development | Cherry-picked examples convince stakeholders while average quality is poor | Very common |
| Regression blindness | Model update improves area A while silently degrading area B | Common |
| Metric gaming | Team optimizes for the measured metric while the actual user experience worsens | Common |

The principle: If you cannot measure quality, you cannot maintain quality. And if you cannot maintain quality, every model update, prompt change, and data refresh is a roll of the dice with your users’ trust as the stake.

Prompt 1 — The Quality Rubric

Before you can measure quality, you need to define what quality means for your specific use case. This prompt forces you to convert vague expectations into concrete, measurable criteria.

You are an AI quality engineer building an evaluation rubric
for a production AI system. The goal is to replace "it seems
fine" with measurable criteria that any evaluator can apply
consistently.

SYSTEM DESCRIPTION:
[Describe your AI feature. What does it do? What input does
it receive? What output does it produce? Who consumes the
output?]

Build the quality rubric:

## 1. DIMENSIONS
Identify every dimension of quality that matters for this
system. Common dimensions include:

- ACCURACY: Is the output factually correct?
- RELEVANCE: Does it address the user's actual need?
- COMPLETENESS: Does it cover all required aspects?
- COHERENCE: Is it internally consistent and logical?
- CONCISENESS: Is it appropriately brief without losing info?
- SAFETY: Does it avoid harmful, biased, or inappropriate content?
- FORMATTING: Does it follow the required structure/schema?
- ACTIONABILITY: Can the user act on the output directly?

For YOUR system, rank these by importance. Drop any that
don't apply. Add domain-specific dimensions I missed.

## 2. SCORING CRITERIA
For each dimension, define a 1-5 scale with concrete examples:

| Score | Label     | Definition + Example                    |
|-------|-----------|-----------------------------------------|
| 5     | Excellent | [Specific example of a 5 for this dim]  |
| 4     | Good      | [Minor issue that makes this a 4]       |
| 3     | Adequate  | [Noticeable issue, still usable]        |
| 2     | Poor      | [Significant issue, barely usable]      |
| 1     | Failing   | [Unacceptable — would harm the user]    |

Each score description must be specific enough that two
independent evaluators would assign the same score to the
same output at least 80% of the time.

## 3. MINIMUM ACCEPTABLE QUALITY
Define the quality floor — the minimum scores below which
the output should NOT be shown to users:

- Hard floor: any dimension scoring 1 = block the output
- Soft floor: average across all dimensions < [threshold]
- Critical dimensions: [list] — these can never score below 3

## 4. WEIGHTS
Not all dimensions matter equally. Assign weights (sum to 1.0):
- [dimension]: [weight] — [why this weight]

## OUTPUT: EVALUATION RUBRIC DOCUMENT
Produce the complete rubric as a table, with scoring criteria,
minimum thresholds, and dimension weights. This becomes the
contract that every model version, prompt change, and data
update must satisfy.

What happens when you run this: You will discover that your team disagrees about what “good” means. One person thinks accuracy is paramount; another prioritizes conciseness. The rubric makes these tradeoffs explicit. Once your team agrees on the rubric, every future evaluation becomes mechanical instead of subjective.
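To make the floors and weights concrete, here is a minimal Python sketch of how a finished rubric might gate a single scored output. The `Rubric` class, the `gate` function, and the example dimensions, weights, and 3.5 threshold are illustrative assumptions, not something the prompt produces for you. The sketch also assumes the soft floor is checked against the weighted average rather than a plain one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    weights: dict[str, float]   # dimension -> weight; must sum to 1.0
    critical: frozenset[str]    # dimensions that can never score below 3
    soft_floor: float           # minimum weighted average (assumed value)


def gate(scores: dict[str, int], rubric: Rubric) -> tuple[bool, str]:
    """Decide whether one output, scored 1-5 per dimension, may be shown."""
    if set(scores) != set(rubric.weights):
        raise ValueError("scores must cover exactly the rubric's dimensions")
    if not math.isclose(sum(rubric.weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")

    # Hard floor: any dimension scoring 1 blocks the output.
    for dim in sorted(scores):
        if scores[dim] == 1:
            return False, f"hard floor: {dim} scored 1"

    # Critical dimensions can never score below 3.
    for dim in sorted(rubric.critical):
        if scores[dim] < 3:
            return False, f"critical: {dim} scored {scores[dim]}"

    # Soft floor on the weighted average (assumption: weighted, not plain).
    weighted = sum(scores[dim] * w for dim, w in rubric.weights.items())
    if weighted < rubric.soft_floor:
        return False, f"soft floor: weighted average {weighted:.2f}"
    return True, "pass"


# Hypothetical rubric for a summarization feature.
rubric = Rubric(
    weights={"accuracy": 0.5, "relevance": 0.3, "conciseness": 0.2},
    critical=frozenset({"accuracy"}),
    soft_floor=3.5,
)
print(gate({"accuracy": 5, "relevance": 4, "conciseness": 3}, rubric))
print(gate({"accuracy": 2, "relevance": 5, "conciseness": 5}, rubric))
```

The second call is blocked even though its average is high, because accuracy is listed as critical. That is the kind of tradeoff the rubric forces your team to write down.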

Pro tip: The “two evaluators, same score, 80% of the time” test is the most important constraint. If your scoring criteria are so vague that different people assign different scores, the rubric is worthless. Keep refining the examples until you hit 80% agreement.
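One simple way to run the 80% test is to have two people score the same outputs and count exact matches. The sketch below assumes exact-match agreement on a single dimension; the function names and the sample scores are made up for illustration.

```python
def agreement_rate(scores_a: list[int], scores_b: list[int]) -> float:
    """Fraction of outputs where both evaluators gave the identical score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def disputed_indices(scores_a: list[int], scores_b: list[int]) -> list[int]:
    """Positions where the evaluators disagreed; these examples need sharper criteria."""
    return [i for i, (a, b) in enumerate(zip(scores_a, scores_b)) if a != b]


# Hypothetical accuracy scores from two evaluators on the same 10 outputs.
evaluator_a = [5, 4, 3, 2, 5, 4, 3, 5, 4, 1]
evaluator_b = [5, 4, 2, 2, 5, 4, 4, 5, 4, 1]

rate = agreement_rate(evaluator_a, evaluator_b)
print(f"agreement: {rate:.0%}, meets 80% target: {rate >= 0.80}")
print("review these outputs:", disputed_indices(evaluator_a, evaluator_b))
```

The disputed outputs are the useful part: each one points at a score description that two reasonable people read differently.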

Prompt 2 — The Eval Suite Generator

A rubric defines how to measure. An eval suite defines what to measure against. This prompt generates a comprehensive set of test cases that covers the full range of inputs your system will see in production.

You are building a comprehensive evaluation suite for a
production AI system. The suite must cover normal cases,
edge cases, adversarial inputs, and regression traps.

SYSTEM DESCRIPTION:
[Same as Prompt 1.]

QUALITY RUBRIC:
[Paste your rubric from Prompt 1.]

Generate the eval suite:

## 1. GOLDEN SET (20-50 examples)
Hand-curated input/output pairs where the "correct" output
is known. These are your ground truth. For each:
- INPUT: The exact input to the system
- EXPECTED OUTPUT: The ideal response
- DIMENSIONS TESTED: Which rubric dimensions this validates
- WHY INCLUDED: What failure mode would this catch?

Distribution requirements:
- 40% typical inputs (the boring middle of your distribution)
- 30% edge cases (unusual but valid inputs)
- 20% adversarial (inputs designed to break the system)
- 10% regression traps (inputs that broke previous versions)

## 2. CATEGORY COVERAGE
Define categories of input your system handles. For each:
- Category name and description
- Estimated production frequency (% of real traffic)
- Number of eval examples (proportional to frequency)
- Known failure modes in this category

Check: does every category have at least 3 examples? Does
the distribution match production traffic within 10%?

## 3. ADVERSARIAL CASES
Inputs specifically designed to expose weaknesses:
- Ambiguous inputs (multiple valid interpretations)
- Boundary inputs (at the edge of what the system handles)
- Contradictory inputs (conflicting requirements)
- Injection attempts (if applicable to your domain)
- Missing context (incomplete information)
- Out-of-scope inputs (things the system should refuse)

For each: what is the EXPECTED behavior? (Not just "don't
break" — what should the system actually do?)

## 4. REGRESSION TRAPS
Inputs that caught bugs in previous versions. Every bug
you fix becomes a permanent eval case. Format:
- Input that caused the bug
- What the system did wrong
- What the system should do
- Date the bug was fixed
- Why it might regress (what change could re-introduce it)

## OUTPUT: EVAL SUITE
Produce the complete suite as structured data (JSON or CSV)
with fields: id, category, input, expected_output,
dimensions_tested, adversarial_flag, regression_flag.

What happens when you run this: The distribution requirements force you to think about what your system actually sees in production, not just the impressive demos. Most eval suites are 90% happy-path examples. Real users send ambiguous, incomplete, contradictory inputs — and that is where quality breaks down.
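Once the suite exists as structured data, the distribution requirement is easy to check automatically. The sketch below is one possible shape, not the prompt's actual output: `EvalCase` mirrors the fields the prompt asks for, while `edge_case_flag`, the `bucket` and `mix_gaps` helpers, and the 10-point tolerance are assumptions added for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Target mix from the golden-set distribution requirements.
TARGET_MIX = {"typical": 0.40, "edge": 0.30, "adversarial": 0.20, "regression": 0.10}


@dataclass(frozen=True)
class EvalCase:
    id: str
    category: str
    input: str
    expected_output: str
    dimensions_tested: tuple[str, ...]
    adversarial_flag: bool = False
    regression_flag: bool = False
    edge_case_flag: bool = False  # hypothetical extra field, not in the prompt


def bucket(case: EvalCase) -> str:
    """Assign one bucket per case; regression wins if several flags are set."""
    if case.regression_flag:
        return "regression"
    if case.adversarial_flag:
        return "adversarial"
    if case.edge_case_flag:
        return "edge"
    return "typical"


def mix_gaps(cases: list[EvalCase], tolerance: float = 0.10) -> dict[str, float]:
    """Buckets whose share is off target by more than `tolerance` (actual - target)."""
    if not cases:
        raise ValueError("eval suite is empty")
    counts = Counter(bucket(c) for c in cases)
    gaps = {}
    for name, target in TARGET_MIX.items():
        share = counts[name] / len(cases)
        if abs(share - target) > tolerance:
            gaps[name] = round(share - target, 2)
    return gaps


# A happy-path-heavy suite: 9 typical cases and 1 adversarial case.
suite = [EvalCase(str(i), "faq", "q", "a", ("accuracy",)) for i in range(9)]
suite.append(EvalCase("9", "faq", "q", "a", ("safety",), adversarial_flag=True))
print(mix_gaps(suite))
```

A positive gap means the bucket is over-represented, a negative gap means it is under-represented. Running this in CI is one way to keep the suite from drifting back toward happy-path examples.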

The Quality Cliff

There is a critical insight that most teams learn the hard way:

AI quality does not always degrade gradually. Sometimes it falls off a cliff.

A model update can score 95% on your eval suite while producing garbage on the 5% that matters most to your power users.

This happens because AI models are not deterministic machines with predictable failure modes. They are statistical systems where a seemingly minor change — a prompt edit, a model version bump, a data refresh — can cause catastrophic quality loss in a narrow but critical slice of inputs.

The solution is not more eval cases. The solution is stratified evaluation: track quality separately across input categories and alert on any category-level regression, even if the aggregate score looks fine.

Anthropic does this internally. They run over 1,000 eval categories across Claude models before every release. A model that scores higher overall but regresses in any critical category does not ship. This discipline is why model updates rarely break existing workflows.
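The category-level check described above can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions: the `Threshold` class, the `regressions` function, the "billing" and "smalltalk" categories, and their numbers are hypothetical. Only the "legal" row borrows its values from the example threshold line in Prompt 3.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Threshold:
    baseline: float      # current score for this category
    absolute_min: float  # below this score, always flag
    max_drop: float      # allowed drop from baseline before flagging


def regressions(thresholds: dict[str, Threshold],
                candidate: dict[str, float]) -> dict[str, str]:
    """Return every category where the candidate version regressed, with a reason."""
    findings = {}
    for category, t in thresholds.items():
        score = candidate[category]
        if score < t.absolute_min:
            findings[category] = "below absolute minimum"
        elif t.baseline - score > t.max_drop:
            findings[category] = "dropped more than allowed"
    return findings


thresholds = {
    "legal": Threshold(baseline=4.6, absolute_min=4.0, max_drop=0.3),
    "billing": Threshold(baseline=4.2, absolute_min=3.5, max_drop=0.3),    # hypothetical
    "smalltalk": Threshold(baseline=4.0, absolute_min=3.5, max_drop=0.3),  # hypothetical
}
# Candidate model: better on average, worse on legal queries.
candidate = {"legal": 3.8, "billing": 4.7, "smalltalk": 4.6}

baseline_avg = mean(t.baseline for t in thresholds.values())
candidate_avg = mean(candidate.values())
found = regressions(thresholds, candidate)
print(f"aggregate: {baseline_avg:.2f} -> {candidate_avg:.2f}")
print("regressions:", found)
print("ship:", not found)
```

In this made-up example the aggregate score goes up, yet the candidate is blocked because one category fell below its minimum. An average-only check would have let it through.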

Prompt 3 — The Regression Detector

Static eval suites catch bugs at deploy time. Regression detectors catch quality drift in production — the slow, invisible decay that happens as real-world inputs diverge from your test data.

You are a reliability engineer building a continuous quality
monitoring system for a production AI feature. The system must
detect quality regressions BEFORE users notice them.

SYSTEM DESCRIPTION:
[Same as Prompt 1.]

EVAL SUITE:
[Reference your eval suite from Prompt 2.]

Design the regression detection system:

## 1. CONTINUOUS EVAL PIPELINE
Define the automated evaluation loop:
- TRIGGER: When does evaluation run?
  (every deploy, every N hours, every N requests)
- SAMPLE: How are production inputs sampled for eval?
  (random 1%, stratified by category, all high-stakes)
- JUDGE: How is quality scored automatically?
  (LLM-as-judge, rule-based, hybrid)
- STORE: Where are scores stored for trend analysis?

## 2. BASELINE & THRESHOLDS
- BASELINE: Current quality scores by category from
  your golden set. This is what "good" looks like.
- ABSOLUTE THRESHOLD: Below this score = immediate alert
- RELATIVE THRESHOLD: Drop of >X% from baseline = alert
- TREND THRESHOLD: 3 consecutive declines = investigate

For each rubric dimension and input category, define:
| Category | Dimension | Baseline | Absolute Min | Alert on Drop |
Example: "Legal queries | Accuracy | 4.6 | 4.0 | >0.3 drop"

## 3. LLM-AS-JUDGE PROTOCOL
If using an LLM to grade outputs automatically:
- JUDGE MODEL: Which model grades? (must differ from prod)
- JUDGE PROMPT: Exact rubric + scoring instructions
- CALIBRATION: How do you verify the judge agrees with
  human evaluators? (run both on 50 examples, measure
  correlation, recalibrate monthly)
- BIAS CHECK: Known biases (verbosity bias, position bias,
  self-preference) and mitigations

## 4. ALERT & RESPONSE PROTOCOL
When a regression is detected:
- WHO is notified? (on-call, team lead, specific owner)
- WHAT information is included in the alert?
  (category, dimension, score drop, example outputs,
  possible cause)
- WHAT is the response SLA?
  (P0: quality below absolute min = respond in 1 hour)
  (P1: quality trending down = respond in 24 hours)
  (P2: edge case regression = next sprint)
- ROLLBACK CRITERIA: When do you revert to previous version
  vs. investigate and fix forward?

## 5. DRIFT DETECTION
Production inputs change over time. Your eval suite must
evolve with them:
- Monthly: sample 100 production inputs, categorize them.
  Has the distribution shifted from your eval suite?
- Quarterly: refresh golden set with new production examples
- On any model/prompt change: re-run full eval suite

## OUTPUT: MONITORING SPEC
Produce the pipeline configuration, threshold table, alert
routing rules, and drift detection schedule. This becomes
your AI feature's SLA.

What happens when you run this: The LLM-as-judge section solves the scaling problem. You cannot have humans grade every output, but you can have a different AI model grade outputs using your rubric, and periodically verify that the AI judge agrees with human judgment. This gives you continuous quality monitoring at a small fraction of the cost of human review.

Pro tip: Never use the same model to judge its own output. Claude should not grade Claude. GPT should not grade GPT. Cross-model evaluation catches the systematic biases that self-evaluation misses. If budget allows, use two different judge models and alert when they disagree — disagreement between judges is a strong signal that the output is ambiguous or problematic.
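The two-judge setup reduces to a small triage step. This is a minimal sketch, assuming each judge has already produced a 1-5 score per output ID; the `triage` function and the choice to treat a gap of more than one point as disagreement are assumptions, not a standard.

```python
def triage(judge_a: dict[str, int], judge_b: dict[str, int],
           max_gap: int = 1) -> tuple[dict[str, float], list[str]]:
    """Split outputs into agreed scores and disputed IDs that need human review."""
    agreed: dict[str, float] = {}
    disputed: list[str] = []
    # Only outputs scored by both judges can be compared.
    for output_id in sorted(judge_a.keys() & judge_b.keys()):
        a, b = judge_a[output_id], judge_b[output_id]
        if abs(a - b) > max_gap:
            disputed.append(output_id)
        else:
            agreed[output_id] = (a + b) / 2
    return agreed, disputed


# Hypothetical scores from two different judge models.
scores_a = {"out-1": 5, "out-2": 2, "out-3": 4}
scores_b = {"out-1": 4, "out-2": 5, "out-3": 4}

agreed, disputed = triage(scores_a, scores_b)
print("agreed:", agreed)
print("send to a human:", disputed)
```

Routing only the disputed outputs to people keeps human review focused on the cases where automated grading is least trustworthy.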

The Bigger Picture

Evaluation is the hidden infrastructure that separates AI demos from AI products. A demo needs to work once, impressively. A product needs to work correctly, thousands of times a day, across the full distribution of real-world inputs, through model updates and data shifts, without degrading.

The three layers build on each other:

The Quality Rubric defines what “good” means — converts subjective judgment into measurable criteria (Prompt 1)
The Eval Suite defines what to test — covers the full range of inputs with known-good outputs (Prompt 2)
The Regression Detector ensures quality is maintained continuously — catches drift before users notice (Prompt 3)

Issue #30 showed you how to coordinate multiple agents working together. This issue ensures that whatever those agents produce actually meets your quality bar — and keeps meeting it as the system evolves.

Next Issue

The Memory Palace: How to Give Your AI Perfect Recall Without Infinite Context

Context windows are finite. Your knowledge is not. Next issue: three prompts to build memory systems that let AI remember everything important while forgetting everything irrelevant — the same way human experts do.
