Every team that ships an AI feature hits the same wall. The feature works — in the sense that it produces output without crashing. But nobody knows if the output is good.
The gap is enormous. Google reported at ICLR 2026 that 73% of production AI systems have no formal evaluation framework. Teams ship, watch for complaints, and hope. When quality degrades — and it always degrades — they discover it from angry users, not from their own monitoring.
The problem is not laziness. The problem is that evaluating AI output is genuinely hard. Traditional software testing checks binary conditions: the function returns the right number, the API returns 200, the button renders. AI output exists on a spectrum. A summary can be mostly accurate. A recommendation can be somewhat relevant. A classification can be usually correct.
“It seems fine” is not an evaluation strategy. These three prompts replace gut feel with a structured framework that catches quality problems before your users do.
The Evaluation Gap
Most AI evaluation failures follow the same pattern:
| Anti-Pattern | What Goes Wrong | How Common |
|---|---|---|
| Vibe check | Developer reads 5 outputs, says “looks good” | Universal |
| Benchmark theater | Model scores 92% on a benchmark unrelated to your task | Very common |
| Demo-driven development | Cherry-picked examples convince stakeholders while average quality is poor | Very common |
| Regression blindness | Model update improves area A while silently degrading area B | Common |
| Metric gaming | Team optimizes for the measured metric while the actual user experience worsens | Common |
The principle: If you cannot measure quality, you cannot maintain quality. And if you cannot maintain quality, every model update, prompt change, and data refresh is a roll of the dice with your users’ trust as the stake.
Prompt 1 — The Quality Rubric
Before you can measure quality, you need to define what quality means for your specific use case. This prompt forces you to convert vague expectations into concrete, measurable criteria.
You are an AI quality engineer building an evaluation rubric for a production AI system. The goal is to replace "it seems fine" with measurable criteria that any evaluator can apply consistently.

SYSTEM DESCRIPTION:
[Describe your AI feature. What does it do? What input does it receive? What output does it produce? Who consumes the output?]

Build the quality rubric:

## 1. DIMENSIONS
Identify every dimension of quality that matters for this system. Common dimensions include:
- ACCURACY: Is the output factually correct?
- RELEVANCE: Does it address the user's actual need?
- COMPLETENESS: Does it cover all required aspects?
- COHERENCE: Is it internally consistent and logical?
- CONCISENESS: Is it appropriately brief without losing info?
- SAFETY: Does it avoid harmful, biased, or inappropriate content?
- FORMATTING: Does it follow the required structure/schema?
- ACTIONABILITY: Can the user act on the output directly?

For YOUR system, rank these by importance. Drop any that don't apply. Add domain-specific dimensions I missed.

## 2. SCORING CRITERIA
For each dimension, define a 1-5 scale with concrete examples:

| Score | Label | Definition + Example |
|-------|-----------|-----------------------------------------|
| 5 | Excellent | [Specific example of a 5 for this dim] |
| 4 | Good | [Minor issue that makes this a 4] |
| 3 | Adequate | [Noticeable issue, still usable] |
| 2 | Poor | [Significant issue, barely usable] |
| 1 | Failing | [Unacceptable: would harm the user] |

Each score description must be specific enough that two independent evaluators would assign the same score to the same output at least 80% of the time.

## 3. MINIMUM ACCEPTABLE QUALITY
Define the quality floor, the minimum scores below which the output should NOT be shown to users:
- Hard floor: any dimension scoring 1 = block the output
- Soft floor: average across all dimensions < [threshold]
- Critical dimensions: [list] (these can never score below 3)

## 4. WEIGHTS
Not all dimensions matter equally. Assign weights (sum to 1.0):
- [dimension]: [weight], [why this weight]

## OUTPUT: EVALUATION RUBRIC DOCUMENT
Produce the complete rubric as a table, with scoring criteria, minimum thresholds, and dimension weights. This becomes the contract that every model version, prompt change, and data update must satisfy.
What happens when you run this: You will discover that your team disagrees about what “good” means. One person thinks accuracy is paramount; another prioritizes conciseness. The rubric makes these tradeoffs explicit. Once your team agrees on the rubric, every future evaluation becomes mechanical instead of subjective.
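Once the rubric exists, the floors and weights from sections 3 and 4 can be applied mechanically. The sketch below is one possible way to do that, not a prescribed implementation: the dimension names, weights, critical set, and the 3.5 soft floor are all hypothetical placeholders for whatever your team agrees on.

```python
# Hypothetical weights and thresholds; replace with the values from your own rubric.
WEIGHTS = {"accuracy": 0.4, "relevance": 0.3, "completeness": 0.2, "conciseness": 0.1}
CRITICAL = {"accuracy"}  # dimensions that may never score below 3
SOFT_FLOOR = 3.5         # minimum acceptable unweighted average


def weighted_score(scores: dict[str, int]) -> float:
    """Weighted average of 1-5 scores; WEIGHTS is assumed to sum to 1.0."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


def passes_quality_floor(scores: dict[str, int]) -> tuple[bool, str]:
    """Apply the hard floor, then the critical-dimension floor, then the soft floor."""
    for dim, score in scores.items():
        if score == 1:
            return False, f"hard floor: {dim} scored 1"
    for dim in sorted(CRITICAL):
        if scores[dim] < 3:
            return False, f"critical dimension {dim} scored below 3"
    average = sum(scores.values()) / len(scores)
    if average < SOFT_FLOOR:
        return False, f"soft floor: average {average:.2f} is below {SOFT_FLOOR}"
    return True, "ok"


example = {"accuracy": 5, "relevance": 4, "completeness": 4, "conciseness": 3}
print(round(weighted_score(example), 2), passes_quality_floor(example))
```

The floor check runs before any weighting on purpose: a high weighted score should not be able to hide a single failing dimension.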
Pro tip: The “two evaluators, same score, 80% of the time” test is the most important constraint. If your scoring criteria are so vague that different people assign different scores, the rubric is worthless. Keep refining the examples until you hit 80% agreement.
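The 80% test is easy to check once two people have scored the same outputs. Here is a minimal sketch, assuming each evaluator's scores for one dimension are stored as a list in the same output order; the function names and sample scores are illustrative only.

```python
def agreement_rate(rater_a: list[int], rater_b: list[int]) -> float:
    """Fraction of outputs where both evaluators assigned the same score."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same non-empty set of outputs")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def disagreements(rater_a: list[int], rater_b: list[int]) -> list[int]:
    """Indexes of outputs where the scores differ; review these to tighten the rubric."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]


# Made-up scores from two evaluators on the same ten outputs.
alice = [5, 4, 3, 2, 1, 4, 4, 3, 5, 2]
bob = [5, 4, 3, 3, 1, 4, 5, 3, 5, 2]
print(agreement_rate(alice, bob), disagreements(alice, bob))
```

Raw agreement does not correct for matches that would happen by chance. If you want a stricter measure, Cohen's kappa is a common alternative.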
Prompt 2 — The Eval Suite Generator
A rubric defines how to measure. An eval suite defines what to measure against. This prompt generates a comprehensive set of test cases that covers the full range of inputs your system will see in production.
You are building a comprehensive evaluation suite for a production AI system. The suite must cover normal cases, edge cases, adversarial inputs, and regression traps.

SYSTEM DESCRIPTION:
[Same as Prompt 1.]

QUALITY RUBRIC:
[Paste your rubric from Prompt 1.]

Generate the eval suite:

## 1. GOLDEN SET (20-50 examples)
Hand-curated input/output pairs where the "correct" output is known. These are your ground truth. For each:
- INPUT: The exact input to the system
- EXPECTED OUTPUT: The ideal response
- DIMENSIONS TESTED: Which rubric dimensions this validates
- WHY INCLUDED: What failure mode would this catch?

Distribution requirements:
- 40% typical inputs (the boring middle of your distribution)
- 30% edge cases (unusual but valid inputs)
- 20% adversarial (inputs designed to break the system)
- 10% regression traps (inputs that broke previous versions)

## 2. CATEGORY COVERAGE
Define categories of input your system handles. For each:
- Category name and description
- Estimated production frequency (% of real traffic)
- Number of eval examples (proportional to frequency)
- Known failure modes in this category

Check: does every category have at least 3 examples? Does the distribution match production traffic within 10%?

## 3. ADVERSARIAL CASES
Inputs specifically designed to expose weaknesses:
- Ambiguous inputs (multiple valid interpretations)
- Boundary inputs (at the edge of what the system handles)
- Contradictory inputs (conflicting requirements)
- Injection attempts (if applicable to your domain)
- Missing context (incomplete information)
- Out-of-scope inputs (things the system should refuse)

For each: what is the EXPECTED behavior? (Not just "don't break". What should the system actually do?)

## 4. REGRESSION TRAPS
Inputs that caught bugs in previous versions. Every bug you fix becomes a permanent eval case.

Format:
- Input that caused the bug
- What the system did wrong
- What the system should do
- Date the bug was fixed
- Why it might regress (what change could re-introduce it)

## OUTPUT: EVAL SUITE
Produce the complete suite as structured data (JSON or CSV) with fields: id, category, input, expected_output, dimensions_tested, adversarial_flag, regression_flag.
What happens when you run this: The distribution requirements force you to think about what your system actually sees in production, not just the impressive demos. Most eval suites are 90% happy-path examples. Real users send ambiguous, incomplete, contradictory inputs — and that is where quality breaks down.
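The two checks in the Category Coverage section can be automated against the structured suite. This is a rough sketch under stated assumptions: each eval case is a dict with a `category` field as in the output format above, "within 10%" is read as 10 percentage points, and the category names and traffic shares are invented for illustration.

```python
from collections import Counter


def coverage_problems(cases, production_share, min_examples=3, tolerance=0.10):
    """Return human-readable coverage problems; an empty list means the suite passes."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    problems = []
    for category, share in production_share.items():
        n = counts.get(category, 0)
        if n < min_examples:
            problems.append(f"{category}: only {n} example(s), need at least {min_examples}")
        suite_share = n / total if total else 0.0
        if abs(suite_share - share) > tolerance:
            problems.append(
                f"{category}: {suite_share:.0%} of suite vs {share:.0%} of production traffic"
            )
    return problems


# Hypothetical suite: only the `category` field matters for this check.
suite = (
    [{"id": f"b{i}", "category": "billing"} for i in range(6)]
    + [{"id": f"r{i}", "category": "refunds"} for i in range(3)]
    + [{"id": "l0", "category": "legal"}]
)
traffic = {"billing": 0.6, "refunds": 0.3, "legal": 0.1}
print(coverage_problems(suite, traffic))
```

In this example the suite mirrors production traffic exactly, yet it still fails because the rare category has a single example. Proportional sampling alone tends to starve low-frequency, high-stakes categories, which is why the minimum-count check is kept separate.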
The Quality Cliff
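There is a critical insight that most teams learn the hard way: AI quality rarely degrades evenly. It can collapse in one narrow slice of inputs while the aggregate score barely moves.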
This happens because AI models are not deterministic machines with predictable failure modes. They are statistical systems where a seemingly minor change — a prompt edit, a model version bump, a data refresh — can cause catastrophic quality loss in a narrow but critical slice of inputs.
The solution is not more eval cases. The solution is stratified evaluation: track quality separately across input categories and alert on any category-level regression, even if the aggregate score looks fine.
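To make the idea concrete, here is a small sketch of a stratified comparison between a baseline and a candidate version. The category names, scores, and the 0.3 drop threshold are made up for illustration, and the code assumes both runs cover the same categories.

```python
from statistics import mean


def aggregate(scores_by_category: dict[str, list[float]]) -> float:
    """Overall mean across every scored example, ignoring categories."""
    return mean([s for scores in scores_by_category.values() for s in scores])


def category_regressions(baseline, candidate, max_drop=0.3):
    """Map each regressed category to its drop in mean score versus the baseline."""
    regressions = {}
    for category, base_scores in baseline.items():
        drop = mean(base_scores) - mean(candidate[category])
        if drop > max_drop:
            regressions[category] = round(drop, 2)
    return regressions


# Hypothetical scores: the candidate improves the common case and breaks a rare one.
baseline = {"general": [4.0] * 8, "legal": [4.6] * 2}
candidate = {"general": [4.5] * 8, "legal": [3.6] * 2}

print(aggregate(baseline), aggregate(candidate))   # the aggregate goes up
print(category_regressions(baseline, candidate))   # but "legal" regressed
```

A release gate built only on `aggregate` would approve this candidate. Gating on `category_regressions` being empty would block it.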
Anthropic does this internally. They run over 1,000 eval categories across Claude models before every release. A model that scores higher overall but regresses in any critical category does not ship. This discipline is why model updates rarely break existing workflows.
Prompt 3 — The Regression Detector
Static eval suites catch bugs at deploy time. Regression detectors catch quality drift in production — the slow, invisible decay that happens as real-world inputs diverge from your test data.
You are a reliability engineer building a continuous quality monitoring system for a production AI feature. The system must detect quality regressions BEFORE users notice them.

SYSTEM DESCRIPTION:
[Same as Prompt 1.]

EVAL SUITE:
[Reference your eval suite from Prompt 2.]

Design the regression detection system:

## 1. CONTINUOUS EVAL PIPELINE
Define the automated evaluation loop:
- TRIGGER: When does evaluation run? (every deploy, every N hours, every N requests)
- SAMPLE: How are production inputs sampled for eval? (random 1%, stratified by category, all high-stakes)
- JUDGE: How is quality scored automatically? (LLM-as-judge, rule-based, hybrid)
- STORE: Where are scores stored for trend analysis?

## 2. BASELINE & THRESHOLDS
- BASELINE: Current quality scores by category from your golden set. This is what "good" looks like.
- ABSOLUTE THRESHOLD: Below this score = immediate alert
- RELATIVE THRESHOLD: Drop of >X% from baseline = alert
- TREND THRESHOLD: 3 consecutive declines = investigate

For each rubric dimension and input category, define:

| Category | Dimension | Baseline | Absolute Min | Alert on Drop |

Example: "Legal queries | Accuracy | 4.6 | 4.0 | >0.3 drop"

## 3. LLM-AS-JUDGE PROTOCOL
If using an LLM to grade outputs automatically:
- JUDGE MODEL: Which model grades? (must differ from prod)
- JUDGE PROMPT: Exact rubric + scoring instructions
- CALIBRATION: How do you verify the judge agrees with human evaluators? (run both on 50 examples, measure correlation, recalibrate monthly)
- BIAS CHECK: Known biases (verbosity bias, position bias, self-preference) and mitigations

## 4. ALERT & RESPONSE PROTOCOL
When a regression is detected:
- WHO is notified? (on-call, team lead, specific owner)
- WHAT information is included in the alert? (category, dimension, score drop, example outputs, possible cause)
- WHAT is the response SLA?
  (P0: quality below absolute min = respond in 1 hour)
  (P1: quality trending down = respond in 24 hours)
  (P2: edge case regression = next sprint)
- ROLLBACK CRITERIA: When do you revert to previous version vs. investigate and fix forward?

## 5. DRIFT DETECTION
Production inputs change over time. Your eval suite must evolve with them:
- Monthly: sample 100 production inputs, categorize them. Has the distribution shifted from your eval suite?
- Quarterly: refresh golden set with new production examples
- On any model/prompt change: re-run full eval suite

## OUTPUT: MONITORING SPEC
Produce the pipeline configuration, threshold table, alert routing rules, and drift detection schedule. This becomes your AI feature's SLA.
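The three threshold types in the prompt translate into a few lines of code. The following is a sketch only: it assumes you keep a chronological list of mean scores for one category and dimension, and the 5% relative drop is a hypothetical stand-in for the X% you choose.

```python
def check_thresholds(history, baseline, absolute_min, max_relative_drop=0.05):
    """Return the alert types triggered by the latest score.

    history: mean scores for one (category, dimension) pair, oldest first.
    """
    alerts = []
    latest = history[-1]
    if latest < absolute_min:
        alerts.append("absolute")  # treat as P0 under the prompt's SLA
    if (baseline - latest) / baseline > max_relative_drop:
        alerts.append("relative")
    # Three consecutive declines need four data points to observe.
    last_four = history[-4:]
    if len(last_four) == 4 and all(b < a for a, b in zip(last_four, last_four[1:])):
        alerts.append("trend")     # treat as P1: investigate
    return alerts


# Values loosely based on the "Legal queries | Accuracy" example row.
print(check_thresholds([4.6, 4.5, 4.4, 4.3], baseline=4.6, absolute_min=4.0))
print(check_thresholds([4.6, 3.9], baseline=4.6, absolute_min=4.0))
```

The first call shows why the trend check earns its place: the score is still above the absolute minimum, but it has slipped three evaluations in a row.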
What happens when you run this: The LLM-as-judge section solves the scaling problem. You cannot have humans grade every output, but you can have a different AI model grade outputs using your rubric, and periodically verify that the AI judge agrees with human judgment. This gives you continuous quality monitoring at a small fraction of the cost of human review.
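The calibration step (run judge and humans on the same examples, then measure correlation) might look like the sketch below. It uses plain Pearson correlation plus a mean absolute difference check. Both cutoffs, 0.8 and 0.5, are hypothetical and should be tuned to your own tolerance.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


def judge_is_calibrated(human, judge, min_correlation=0.8, max_mean_abs_diff=0.5):
    """True if the judge both tracks human scores and stays close to them."""
    mean_abs_diff = sum(abs(h - j) for h, j in zip(human, judge)) / len(human)
    return pearson(human, judge) >= min_correlation and mean_abs_diff <= max_mean_abs_diff


human_scores = [5, 4, 3, 2, 1]
lenient_judge = [5, 5, 4, 3, 2]  # ranks outputs like the humans but scores higher
print(round(pearson(human_scores, lenient_judge), 2))
print(judge_is_calibrated(human_scores, lenient_judge))
```

Correlation alone would pass the lenient judge here, since it orders outputs the same way humans do. The absolute-difference check catches the systematic offset, which matters when the judge's scores are compared against fixed thresholds.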
Pro tip: Never use the same model to judge its own output. Claude should not grade Claude. GPT should not grade GPT. Cross-model evaluation catches the systematic biases that self-evaluation misses. If budget allows, use two different judge models and alert when they disagree — disagreement between judges is a strong signal that the output is ambiguous or problematic.
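A two-judge setup needs only a simple comparison step. In this sketch each judge's scores are assumed to be a dict keyed by output ID, and a gap of more than one point counts as disagreement. Both the data layout and the gap size are illustrative choices.

```python
def flag_disagreements(judge_a: dict[str, int], judge_b: dict[str, int], max_gap: int = 1) -> list[str]:
    """IDs of outputs where the two judges' scores differ by more than max_gap."""
    shared = judge_a.keys() & judge_b.keys()  # only compare outputs both judges scored
    return sorted(oid for oid in shared if abs(judge_a[oid] - judge_b[oid]) > max_gap)


# Hypothetical scores from two different judge models on the same three outputs.
scores_a = {"out-1": 5, "out-2": 4, "out-3": 2}
scores_b = {"out-1": 5, "out-2": 2, "out-3": 3}
print(flag_disagreements(scores_a, scores_b))
```

Flagged outputs are good candidates for human review, and recurring ones are good candidates for the golden set.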
The Bigger Picture
Evaluation is the hidden infrastructure that separates AI demos from AI products. A demo needs to work once, impressively. A product needs to work correctly, thousands of times a day, across the full distribution of real-world inputs, through model updates and data shifts, without degrading.
The three layers build on each other:
- The Quality Rubric defines what “good” means — converts subjective judgment into measurable criteria (Prompt 1)
- The Eval Suite defines what to test — covers the full range of inputs with known-good outputs (Prompt 2)
- The Regression Detector ensures quality is maintained continuously — catches drift before users notice (Prompt 3)
Issue #30 showed you how to coordinate multiple agents working together. This issue ensures that whatever those agents produce actually meets your quality bar — and keeps meeting it as the system evolves.