Issue #22

The Evolution: A Self-Improvement Engine for Your AI System

The AI Playbook | 16 min read | 4 prompts

Building on Issue #21: The Memory

Your agents remember now. The decision registry stops them from re-debating settled questions. The error catalog prevents repeated mistakes. The success pattern library replicates what works. Every session is smarter than the last because every session reads what came before.

But there is a ceiling.

The memory system records what happened. It does not figure out what to do about it. Your error catalog has 22 entries and 15 prevention rules -- but you wrote every one of them by hand. Your success patterns are growing -- but you identified each one yourself. The system stores knowledge, but it does not generate knowledge. It remembers, but it does not think.

You are still the bottleneck. Every improvement requires you to notice a pattern, diagnose the root cause, hypothesize a fix, test it, and deploy it. Your agents wait for you to get smarter so they can get smarter.

This is the last manual dependency. Remove it and you have something fundamentally different: a system that improves itself.

The Evolution is a four-agent loop that reads your system's own performance data, identifies patterns, proposes rule changes, shadow-tests them, and promotes the winners -- all without you lifting a finger. Not AGI. Not sentient machines. Four prompts, three JSON files, and the discipline to let the data decide what changes.


The Four Agents

The evolution engine has four roles. Each one does one thing. They run in sequence, usually once per day.

Agent                | Input                                       | Output
Performance Reviewer | Outcome data, error catalog, quality scores | Pattern analysis report
Rule Proposer        | Pattern analysis + current rules            | Proposed rule changes with hypotheses
Shadow Tester        | Proposed rules + live inputs                | A/B comparison data
Promotion Gate       | Shadow test results                         | Approved changes (or rejections)

Key insight: No single agent has enough power to break your system. The reviewer can only observe. The proposer can only suggest. The tester can only run shadow experiments. The gate can only promote changes that have evidence. Four agents, four checkpoints, zero unvalidated changes reaching production.


Agent 1: The Performance Reviewer

This agent reads everything your system produces -- outcomes, errors, quality scores, memory entries -- and looks for patterns. Not individual events. Patterns. "This type of input fails 3x more often than average." "Error rate spiked after we changed the data source last Tuesday." "The same prevention rule fires 40% of the time, suggesting the underlying cause has not been fixed."

The reviewer does not propose fixes. It produces a structured analysis that the next agent consumes.

Prompt 1 -- Performance Reviewer
You are a performance reviewer for an AI agent system.
Your job is to analyze outcome data and identify patterns
that suggest the system should change.

Read these files:
- ~/memory/error_catalog.json
- ~/memory/success_patterns.json
- ~/memory/decision_registry.json
- ~/metrics/weekly_accuracy.json (last 4 weeks)
- ~/quality/gate_results.jsonl (last 7 days)
- ~/logs/session_outcomes.jsonl (last 14 days)

Produce: ~/evolution/performance_review.json

Schema:
{
  "review_date": "2026-04-05",
  "period_analyzed": "2026-03-22 to 2026-04-05",
  "overall_health": "IMPROVING" | "STABLE" | "DEGRADING",
  "patterns": [
    {
      "id": "PAT-001",
      "type": "error_cluster" | "performance_drift" |
              "unused_rule" | "success_decay" |
              "recurring_override" | "threshold_mismatch",
      "description": "What pattern did you find?",
      "evidence": "Specific data points supporting this",
      "severity": "HIGH" | "MEDIUM" | "LOW",
      "affected_agents": ["agent_a", "agent_b"],
      "occurrences": 7,
      "trend": "INCREASING" | "STABLE" | "DECREASING",
      "suggested_investigation": "What should the Rule
        Proposer look into based on this pattern?"
    }
  ],
  "metrics_summary": {
    "total_sessions": 42,
    "overall_pass_rate": 0.91,
    "pass_rate_delta_vs_prior": +0.03,
    "top_error_category": "stale_data",
    "top_error_frequency": 12,
    "prevention_rules_fired": 34,
    "prevention_rules_effective": 28,
    "success_patterns_reused": 15
  }
}

Analysis rules:
1. A pattern requires 3+ data points. One bad session
   is not a pattern. Three bad sessions with the same
   error category IS a pattern.
2. Compare current period to prior period. Regression
   matters more than absolute performance.
3. Check prevention rule effectiveness: if a rule fires
   but errors still occur, the rule is not working.
4. Check for "silent decay" -- success patterns that
   stopped being validated. Approaches that worked 6
   weeks ago may not work today.
5. Check for decision conflicts -- two registry entries
   that contradict each other.
6. Flag any metric that crosses a threshold:
   - Pass rate below 85%: HIGH severity
   - Error category with 5+ occurrences in 7 days: HIGH
   - Prevention rule effectiveness below 60%: MEDIUM
   - Success pattern not validated in 30+ days: MEDIUM
7. Output maximum 10 patterns, sorted by severity
   then occurrences.
8. Do NOT propose solutions. Only identify patterns.
   The Rule Proposer handles solutions.

The separation between observation and recommendation is deliberate. When the same agent identifies a problem and proposes a fix, it tends to anchor on the first solution it thinks of. By splitting these into two agents, the reviewer can be thorough about diagnosis without rushing to treatment.

Key insight: Most systems fail not because of one catastrophic error but because of slow drift that nobody notices. The performance reviewer's primary job is catching drift -- the gradual decay that turns a 95% system into an 82% system over six weeks.
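The threshold checks in analysis rule 6 are mechanical enough to sketch in code. This assumes error catalog entries carry a `category` and an ISO `timestamp` field, which is one plausible shape for the Issue #21 catalog, not a prescribed one.

```python
from collections import Counter
from datetime import datetime, timedelta

def find_error_clusters(entries, now, window_days=7, min_occurrences=5):
    """Flag error categories with min_occurrences+ hits in the window
    (mirrors the rule: 5+ occurrences in 7 days -> HIGH severity)."""
    cutoff = now - timedelta(days=window_days)
    recent = [e["category"] for e in entries
              if datetime.fromisoformat(e["timestamp"]) >= cutoff]
    counts = Counter(recent)
    return [
        {"type": "error_cluster", "description": cat,
         "occurrences": n, "severity": "HIGH"}
        for cat, n in counts.most_common()
        if n >= min_occurrences
    ]
```

The same shape extends to the other thresholds (rule effectiveness below 60%, patterns unvalidated for 30+ days) by swapping the predicate.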


Agent 2: The Rule Proposer

The proposer reads the performance review and generates concrete, testable hypotheses for improvement. Each proposal is a specific change -- a new prevention rule, a modified threshold, an updated success pattern, a prompt adjustment -- with a prediction about what will happen if the change is applied.

The predictions are the crucial part. Without them, you cannot measure whether the change worked.

Prompt 2 -- Rule Proposer
You are a rule proposer for an AI agent system. You read
performance analysis and propose specific, testable changes.

Read:
- ~/evolution/performance_review.json
- ~/memory/error_catalog.json (current rules)
- ~/memory/success_patterns.json (current patterns)
- ~/agents/active_prompts/ (current agent prompts)

Produce: ~/evolution/proposed_changes.json

Schema:
{
  "proposal_date": "2026-04-05",
  "review_id": "2026-04-05",
  "proposals": [
    {
      "id": "PROP-001",
      "addresses_pattern": "PAT-001",
      "change_type": "new_rule" | "modify_rule" |
                     "retire_rule" | "new_pattern" |
                     "modify_prompt" | "adjust_threshold",
      "description": "What change are you proposing?",
      "current_state": "What does the system do now?",
      "proposed_state": "What would the system do after
                        this change?",
      "hypothesis": "If we make this change, [specific
                    metric] will [improve/decrease] by
                    [estimated amount] because [reasoning].",
      "measurement": "How to verify the hypothesis --
                     which metric, what threshold, over
                     what time period.",
      "risk": "LOW" | "MEDIUM" | "HIGH",
      "risk_notes": "What could go wrong?",
      "rollback_plan": "How to undo this change if it
                       fails.",
      "shadow_test_duration_days": 3 | 5 | 7 | 14,
      "status": "PROPOSED"
    }
  ]
}

Proposal rules:
1. Every proposal MUST have a falsifiable hypothesis.
   Bad: "This will make things better." Good: "This will
   reduce stale_data errors from 12/week to under 5/week
   because the new rule catches outdated sources before
   they enter the pipeline."
2. Every proposal MUST have a measurement plan. If you
   cannot measure whether it worked, do not propose it.
3. Maximum 5 proposals per review cycle. Quality over
   quantity. The system cannot absorb 20 changes at once.
4. Proposals that modify existing rules must include
   the EXACT current text and the EXACT proposed text.
   No ambiguity.
5. Risk assessment is mandatory:
   - LOW: affects formatting, logging, non-critical paths
   - MEDIUM: affects output quality, could change results
   - HIGH: affects core logic, could break downstream
6. HIGH-risk proposals require 14-day shadow tests.
   MEDIUM requires 7 days. LOW requires 3 days.
7. Never propose removing a prevention rule unless its
   error has been FIXED for 30+ days AND the fix is
   structural (not just the rule catching it).
8. Include a rollback plan for every proposal. "Revert
   to previous prompt" is acceptable for prompt changes.
   Threshold changes need the specific old value.
Tip: The best proposals come from the intersection of two patterns. "Stale data errors are increasing AND the data freshness check is not firing" is a much stronger signal than either pattern alone. Train your proposer to look for these intersections.


Agent 3: The Shadow Tester

This is where the system diverges from every other "self-improving AI" article you have read. Most people propose changes and deploy them. The shadow tester runs proposed changes alongside the current system -- same inputs, both versions -- and measures which one performs better. No change reaches production without evidence.

The shadow tester does not decide anything. It collects data. The promotion gate decides.

Prompt 3 -- Shadow Tester
You are a shadow tester for an AI agent system. You run
proposed changes alongside the current system and collect
comparison data.

Read:
- ~/evolution/proposed_changes.json (active proposals)
- ~/evolution/shadow_results.json (existing test data)

For each proposal with status "PROPOSED" or "TESTING":

1. Create two processing paths:
   CONTROL: Current rules/prompts (no changes)
   VARIANT: Current rules/prompts + proposed change

2. For every new input that arrives for the affected
   agent, run it through BOTH paths.

3. Record results in ~/evolution/shadow_results.json:

{
  "proposal_id": "PROP-001",
  "test_start": "2026-04-05",
  "test_end": null,
  "required_duration_days": 7,
  "samples": [
    {
      "input_hash": "abc123",
      "timestamp": "2026-04-05T14:30:00Z",
      "control_output_quality": 0.88,
      "variant_output_quality": 0.94,
      "control_errors": ["stale_data"],
      "variant_errors": [],
      "control_latency_ms": 2400,
      "variant_latency_ms": 2600
    }
  ],
  "aggregate": {
    "total_samples": 15,
    "control_avg_quality": 0.86,
    "variant_avg_quality": 0.92,
    "quality_delta": +0.06,
    "control_error_rate": 0.20,
    "variant_error_rate": 0.07,
    "error_rate_delta": -0.13,
    "control_avg_latency_ms": 2350,
    "variant_avg_latency_ms": 2500,
    "statistical_significance": 0.94,
    "sample_size_sufficient": true
  },
  "status": "TESTING" | "COMPLETE" | "INSUFFICIENT_DATA"
}

Shadow testing rules:
1. NEVER apply the variant to production output. Both
   paths run, but only the CONTROL output is used.
2. Minimum sample sizes by risk level:
   - LOW risk: 10 samples
   - MEDIUM risk: 20 samples
   - HIGH risk: 50 samples
3. If test duration expires but sample size is not met,
   extend by 50% of original duration. If still not met,
   mark as INSUFFICIENT_DATA.
4. Quality scoring: use the same quality gate from
   Issue #15. Both paths go through the same gate.
5. Record EVERYTHING. Latency matters. Error types
   matter. Edge cases matter. The promotion gate needs
   complete data to decide.
6. If the variant causes a CRITICAL error on any sample,
   immediately halt the test and mark as FAILED.
7. Run tests in isolation -- only one proposed change
   per test. Never stack untested changes.

The shadow tester is the most operationally complex piece. In a simple system, "run both paths" means literally running your agent prompt twice with different system instructions. In a complex system, you might shadow-test by replaying yesterday's inputs with the new rules and comparing outputs to what your production system actually produced. Either approach works. The point is: the change must prove itself on real data before it goes live.
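The aggregate block in shadow_results.json is a straight computation over the samples. The sketch below uses plain means; it deliberately leaves out `statistical_significance`, since the article does not prescribe a test (a paired t-test over the per-sample quality scores is one common choice).

```python
def aggregate_shadow_samples(samples, min_samples):
    """Roll per-sample comparisons into the aggregate block of
    shadow_results.json. Field names mirror the schema above."""
    n = len(samples)
    mean = lambda key: sum(s[key] for s in samples) / n
    control_err = sum(1 for s in samples if s["control_errors"]) / n
    variant_err = sum(1 for s in samples if s["variant_errors"]) / n
    return {
        "total_samples": n,
        "control_avg_quality": round(mean("control_output_quality"), 3),
        "variant_avg_quality": round(mean("variant_output_quality"), 3),
        "quality_delta": round(mean("variant_output_quality")
                               - mean("control_output_quality"), 3),
        "control_error_rate": round(control_err, 3),
        "variant_error_rate": round(variant_err, 3),
        "error_rate_delta": round(variant_err - control_err, 3),
        "control_avg_latency_ms": mean("control_latency_ms"),
        "variant_avg_latency_ms": mean("variant_latency_ms"),
        "sample_size_sufficient": n >= min_samples,
    }
```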


Agent 4: The Promotion Gate

The gate reads shadow test results and makes one of three decisions: promote the change to production, reject it, or extend the test. The gate has strict criteria -- no judgment calls, no "it seems better." Numbers or nothing.

Prompt 4 -- Promotion Gate
You are a promotion gate for an AI agent system. You review
shadow test results and decide whether proposed changes
should be promoted to production.

Read:
- ~/evolution/shadow_results.json
- ~/evolution/proposed_changes.json
- ~/evolution/promotion_log.json (history of past decisions)

For each test with status "COMPLETE":

Apply these promotion criteria:
1. Quality delta must be >= +0.02 (2% improvement)
   OR error rate delta must be <= -0.05 (5% fewer errors)
2. No CRITICAL errors in variant samples
3. Latency increase must be < 20% over control
4. Statistical significance must be >= 0.90
5. Sample size must meet minimum for risk level
6. The change must not contradict any ACTIVE decision
   in ~/memory/decision_registry.json

Decision logic:
- ALL criteria met -> PROMOTE
- Criteria 1 met but significance < 0.90 -> EXTEND
  (request 50% more test duration)
- Quality delta negative -> REJECT
- Any CRITICAL error in variant -> REJECT
- Latency increase > 20% -> REJECT (unless quality
  delta > +0.10, then flag for human review)

For PROMOTED changes:
1. Apply the change to production files:
   - New rules -> add to ~/memory/error_catalog.json
   - Modified prompts -> update in ~/agents/active_prompts/
   - Threshold changes -> update in ~/config/thresholds.json
   - New patterns -> add to ~/memory/success_patterns.json
2. Log the promotion:

~/evolution/promotion_log.json entry:
{
  "proposal_id": "PROP-001",
  "decision": "PROMOTE" | "REJECT" | "EXTEND",
  "decision_date": "2026-04-12",
  "evidence_summary": "Quality +6%, error rate -13%,
    significance 0.94, 15 samples over 7 days",
  "change_applied": "Added prevention rule: 'If source
    timestamp > 48h old, flag as POTENTIALLY_STALE'",
  "rollback_available_until": "2026-04-26",
  "promoted_by": "auto_gate"
}

3. Add the change to the decision registry with
   reasoning = the evidence summary from the test.
4. Update the proposal status to "PROMOTED" or "REJECTED"
   in proposed_changes.json.

For REJECTED changes:
1. Log the rejection with specific reason.
2. Feed the rejection back to the Rule Proposer:
   "PROP-001 rejected because [reason]. Consider
   alternative approaches for PAT-001."

Gate rules:
- The gate NEVER overrides its own criteria. No
  "it looks promising, let's promote anyway."
- Rollback window: 14 days. If production metrics
  degrade after promotion, auto-rollback.
- Maximum 2 promotions per day. The system needs
  time to stabilize between changes.
- Every promotion creates a snapshot of the prior
  state for rollback purposes.
- Log everything. The promotion log is the audit
  trail for why your system is the way it is.

Key insight: The promotion gate is intentionally conservative. A 90% significance threshold means roughly 1 in 10 promotions might be noise. But the rollback window catches those. The alternative -- promoting on gut feel -- has a much worse false positive rate. Let the data be slow. Slow and right beats fast and wrong.
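Because the gate's decision logic is fully deterministic, it can live in plain code rather than a prompt. A sketch of criteria 1-5 with the constants taken from the rules above (the decision-registry check in criterion 6 is omitted, since it depends on your registry format):

```python
MIN_SAMPLES = {"LOW": 10, "MEDIUM": 20, "HIGH": 50}

def gate_decision(agg, risk, has_critical_error):
    """Apply the promotion criteria to a COMPLETE shadow test aggregate."""
    if has_critical_error:
        return "REJECT"
    if agg["quality_delta"] < 0:
        return "REJECT"
    latency_increase = (agg["variant_avg_latency_ms"]
                        / agg["control_avg_latency_ms"]) - 1
    if latency_increase > 0.20:
        # Big quality wins with a latency cost escalate to a human.
        return "HUMAN_REVIEW" if agg["quality_delta"] > 0.10 else "REJECT"
    improved = (agg["quality_delta"] >= 0.02
                or agg["error_rate_delta"] <= -0.05)
    if not improved:
        return "REJECT"
    if (agg["statistical_significance"] < 0.90
            or agg["total_samples"] < MIN_SAMPLES[risk]):
        return "EXTEND"  # signal looks real but evidence is still thin
    return "PROMOTE"
```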


The Loop in Practice

Here is what a typical evolution cycle looks like over 30 days.

Day 1 -- Performance Review

The performance reviewer runs for the first time. It finds 4 patterns: stale data errors clustering on Monday mornings (weekend data not refreshed), a success pattern that has not been validated in 35 days, a prevention rule that fires constantly but errors still slip through, and a threshold that was set manually 2 months ago and has not been re-evaluated.

Day 2 -- Rule Proposals

The rule proposer reads the review. It generates 3 proposals: increase the data freshness window from 24h to 48h for weekend-crossing data, replace the ineffective prevention rule with a more specific one, and lower the confidence threshold from 0.80 to 0.75 based on the last month's calibration data. Each proposal has a hypothesis and measurement plan.

Days 3-9 -- Shadow Testing

The shadow tester runs all three proposals. The data freshness change shows a 14% reduction in Monday errors. The new prevention rule drops its error category by 60%. The threshold change shows mixed results -- slight quality improvement but 8% more false positives.

Day 10 -- Promotion Gate

The gate promotes the first two changes. It rejects the threshold change (quality delta was positive but error rate increased). The rejection is fed back to the proposer for the next cycle.

Day 14 -- Next Review Cycle

The next performance review notes that Monday error rates dropped significantly, confirming the promoted changes are working. The threshold issue appears again as a pattern, but this time the proposer has the rejection data and tries a different approach -- adjusting the threshold only for the specific agent where calibration drift was worst.

Day 30 -- Results

Three evolution cycles have completed. Seven changes promoted. Two rejected. One auto-rolled back. The system's overall pass rate has improved from 87% to 93% -- and you did not manually write a single rule.

Four agents, four checkpoints, three evolution cycles. Seven changes promoted, two rejected, one auto-rolled back. The system improved itself while you focused on building new features.

Common Mistakes

  1. Letting the proposer and the gate be the same agent. The agent that proposes a change is biased toward promoting it. Separation of concerns is not bureaucracy -- it is how you prevent an overconfident agent from pushing bad changes into production.
  2. Shadow testing without quality scoring. If your shadow tester just checks "did it error or not," you are missing the signal. A change that eliminates one error type while slightly degrading overall quality is a bad trade. Score both paths through the same quality gate.
  3. Promoting on insufficient data. Ten samples is not enough for a HIGH-risk change. The minimum sample sizes exist for a reason -- small samples amplify noise. A change that "works" on 8 samples might fail on the next 80.
  4. Stacking untested changes. If you promote Change A, then immediately test Change B on top of it, you cannot tell whether Change B's results are caused by Change B or by the interaction with Change A. Test one change at a time. Patience is a feature.
  5. Skipping the rollback window. Every promotion should be provisional for 14 days. Production workloads are messier than shadow tests. An improvement that looked great in testing might degrade on input patterns the shadow test never saw.
  6. Forgetting to feed rejections back. A rejected proposal is not a dead end -- it is data. "We tried lowering the threshold and it increased false positives" is valuable information for the next proposal cycle. Feed every rejection back to the proposer with the specific reason.

Try It This Week

Start with the Performance Reviewer. You already have the data -- your error catalog from Issue #21, your quality gate results from Issue #15, your session outcomes from your existing logs. Run Prompt 1 and see what patterns emerge.

You will likely find 2-3 patterns you already knew about but had not prioritized, and 1-2 patterns you did not know existed. The ones you did not know about are the highest-leverage improvements -- they have been silently degrading your system while you focused on the problems you could see.

Once the reviewer is producing useful pattern reports, add the Rule Proposer. Start with LOW-risk proposals only -- formatting changes, logging improvements, non-critical threshold adjustments. Get comfortable with the propose-test-promote loop before you let it touch anything critical.

Shadow testing is the hardest part to implement, but also the part that makes everything else trustworthy. Even a simple implementation -- replay yesterday's inputs with the new rules and compare -- is better than no shadow testing at all.
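That replay-based simple implementation fits in a few lines. Everything here is a hypothetical sketch: `run_with_rules` stands in for however you invoke an agent with a given rule set.

```python
def replay_compare(yesterday_inputs, run_with_rules, current_rules, proposed_rules):
    """Replay recorded inputs through both rule sets and collect
    every case where the proposed rules change the output."""
    diffs = []
    for item in yesterday_inputs:
        control = run_with_rules(item, current_rules)
        variant = run_with_rules(item, proposed_rules)
        if control != variant:
            diffs.append({"input": item, "control": control, "variant": variant})
    return diffs
```

An empty diff list means the change is a no-op on yesterday's traffic, which is itself useful evidence that a proposal is not doing what its hypothesis claims.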

Reply with your first performance review output. I will tell you which patterns are real signals versus noise, and help you write your first round of proposals.

Next Issue: Issue #23

The Scaling Layer

Your evolution engine works for one system. But what happens when you run 5 systems, or 10? Issue #23 builds the scaling layer -- shared memory across multiple agent systems, cross-system pattern detection, and a governance model that prevents one system's improvements from breaking another. From single-system evolution to organizational learning.
