Building on Issues #1–23
Over the last 23 issues, we have built complex systems. Evolution engines that propose their own improvements. Memory layers that persist context across sessions. Scaling layers that coordinate multiple agents without breaking contracts. Quality gates, feedback loops, handoff protocols, safety nets, orchestration patterns.
All of that assumes you already have a working system to improve.
But everyone starts at zero. You have a task you want automated, a vague sense of what “good” looks like, and a blank folder. The gap between “I want an AI agent that does X” and actually having one running in production is where most people stall. Not because the technology is hard — because the process is unclear.
This issue is the Day 1 guide. A concrete, 5-day framework that takes you from an idea to a working, self-improving agent. No architecture diagrams. No theoretical frameworks. Just the steps, in order, with the exact prompts to run at each stage.
If you have been reading this series and thinking “I should build something” — this is where you start.
Why People Get Stuck
The most common failure mode is not technical. It is procedural. People start by writing a prompt. They iterate on the prompt for days. They get 80% accuracy and feel stuck. They never build the measurement system that would tell them what the other 20% looks like. They never wire it to run automatically. They never close the loop.
The prompt is not the starting point. The contract is.
A contract is a precise definition of what your agent should do — not how it should do it, but what “done” looks like. Inputs, outputs, success criteria, failure modes. The prompt comes later. The contract comes first.
This is counterintuitive. You have a powerful AI tool in front of you and your instinct is to start talking to it. Resist that instinct for one day. Define the contract first. Everything else — the prompt, the quality gate, the feedback loop, the automation — builds on that contract. If the contract is wrong, everything built on top of it is wrong faster.
The 5-Day Bootstrap
This is not a metaphor. This is a literal 5-day calendar. Each day has one deliverable. By the end of Day 5, you have a self-improving agent in production.
| Day | Deliverable | Time Required |
|---|---|---|
| Day 1 | The Contract | 2 hours |
| Day 2 | The Minimum Prompt + 10 manual runs | 3 hours |
| Day 3 | The Quality Gate | 2 hours |
| Day 4 | The Feedback Loop | 2 hours |
| Day 5 | Automation + Monitoring | 1 hour |
Total: 10 hours of focused work across 5 days. That is it.
Day 1: Define the Contract
Do not touch the AI today. Today is for thinking.
Write a document — plain text, markdown, a napkin — that answers these six questions:
- What is the input? Be specific. Not “customer data” — “a JSON object with fields: customer_id (string), purchase_history (array of objects with date, amount, product_id), and support_tickets (array of objects with date, category, resolution).” If you cannot specify the input format, you do not understand the problem yet.
- What is the output? Same level of specificity. Not “a recommendation” — “a JSON object with fields: risk_score (float 0-1), churn_probability (float 0-1), recommended_action (one of: ‘retain_offer’, ‘check_in_call’, ‘no_action’), and reasoning (string, 2-3 sentences explaining the recommendation).”
- What does “correct” look like? Define 5 examples. Five inputs with their expected outputs. These are your ground truth. If you cannot produce 5 examples by hand, the task is not well-defined enough for an agent.
- What does “wrong” look like? Define 3 failure modes. “Wrong” is not just incorrect output. It is the specific ways output can be incorrect. For a churn predictor: false positives (flagging happy customers), false negatives (missing at-risk customers), and hallucinated reasoning (correct score but made-up justification).
- What are the edge cases? What happens with empty input? Missing fields? Extreme values? A customer with 10,000 purchases? A customer with zero? List at least 5 edge cases and what the correct behavior should be for each.
- What is the success threshold? “90% accuracy” is a start, but accuracy on what metric? For the churn predictor: “correctly classify 90% of customers who churn within 30 days as risk_score > 0.7, while flagging no more than 15% of non-churning customers as false positives.” This is your exit criterion for Day 2.
This document is your contract. Everything you build will be measured against it. If the contract is wrong, fix the contract — do not hack the prompt to work around a bad contract.
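To make the success threshold concrete, here is a minimal sketch of how the churn example’s criterion could be checked mechanically once you have labeled outcomes. The field names (risk_score, churned) follow the example above and are assumptions, not a required schema:

```python
def meets_threshold(results):
    """Check the Day 1 success threshold against labeled results.

    `results` is a list of dicts with `risk_score` (float, the agent's
    output) and `churned` (bool, ground truth from the following 30 days).
    Field names are illustrative; match them to your own contract.
    """
    churned = [r for r in results if r["churned"]]
    retained = [r for r in results if not r["churned"]]

    # Recall: churners correctly flagged with risk_score > 0.7
    recall = sum(r["risk_score"] > 0.7 for r in churned) / len(churned)

    # False positive rate: non-churners incorrectly flagged
    fpr = sum(r["risk_score"] > 0.7 for r in retained) / len(retained)

    return recall >= 0.90 and fpr <= 0.15

sample = [
    {"risk_score": 0.9, "churned": True},
    {"risk_score": 0.8, "churned": True},
    {"risk_score": 0.3, "churned": False},
    {"risk_score": 0.6, "churned": False},
]
print(meets_threshold(sample))  # True: 2/2 churners flagged, 0 false positives
```

If you cannot write a function like this for your threshold, the threshold is not measurable yet, and the contract needs another pass.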
Prompt 1 — The Contract Writer
If you are struggling to write the contract yourself, use this agent. Give it your vague description of what you want automated, and it will produce a formal contract.
You are a contract writer for AI agent systems. Your job is to take a vague description of a task someone wants to automate and produce a rigorous, testable contract.

Input: A plain-English description of the task.

Produce a contract with these exact sections:

## INPUT SPECIFICATION
- Exact format (JSON schema, file type, API response shape)
- Required fields with types and constraints
- Optional fields with defaults
- Example input (realistic, not trivial)

## OUTPUT SPECIFICATION
- Exact format (JSON schema, file type, structured text)
- Required fields with types and constraints
- Example output matching the example input
- What the output must NEVER contain (safety constraints)

## GROUND TRUTH EXAMPLES
- 5 input/output pairs that define "correct"
- At least 1 easy case, 2 medium cases, 2 hard cases
- Each pair includes a 1-line explanation of WHY this output is correct for this input

## FAILURE MODES
- At least 3 specific ways the output can be wrong
- For each: description, how to detect it, severity (CRITICAL / HIGH / MEDIUM / LOW)
- CRITICAL = causes downstream harm or incorrect decisions
- HIGH = wrong output but detectable by a human reviewer
- MEDIUM = suboptimal but not harmful
- LOW = cosmetic or stylistic

## EDGE CASES
- At least 5 edge cases with expected behavior
- Include: empty input, missing fields, extreme values, ambiguous input, adversarial input
- For each: the input, the expected output, and WHY

## SUCCESS CRITERIA
- Primary metric with a specific numeric threshold
- Secondary metric if applicable
- Maximum acceptable false positive rate
- Maximum acceptable false negative rate
- Minimum sample size to declare success (at least 50)

## ASSUMPTIONS
- What this contract assumes about the environment
- What upstream systems must guarantee
- What downstream systems expect

Rules:
1. Be specific enough that two different people reading this contract would build agents that produce the same output for the same input.
2. If the task description is too vague to specify any section, say exactly what information is missing and ask for it. Do not guess.
3. Ground truth examples must be realistic. Do not use "foo/bar" placeholders.
4. Success criteria must be measurable by an automated system, not just a human judgment.
5. The contract is the source of truth. If the prompt disagrees with the contract, the contract wins.
Run this prompt with your task description. Review the output. Edit it. The contract writer will miss nuances only you know — your specific data format, your business rules, your tolerance for different types of errors. The prompt gets you 80% there. Your domain knowledge fills the rest.
Day 2: Build the Minimum Prompt
Now you can talk to the AI.
Write the simplest prompt that satisfies the contract. Not a clever prompt. Not an optimized prompt. The minimum viable prompt. Include the contract’s input specification, output specification, and 2-3 of your ground truth examples as in-context demonstrations.
Then run it 10 times. Manually. With real inputs — not your ground truth examples, which the model has already seen. Ten fresh inputs where you know (or can look up) the correct answer.
Record the results in a simple table:
| Run | Input Summary | Expected Output | Actual Output | Correct? | Error Type |
|---|---|---|---|---|---|
| 1 | Customer #4821, 12 purchases, 0 tickets | risk_score: 0.2, no_action | risk_score: 0.3, no_action | Yes | — |
| 2 | Customer #1093, 1 purchase, 3 tickets | risk_score: 0.8, retain_offer | risk_score: 0.4, no_action | No | False negative |
| ... | ... | ... | ... | ... | ... |
This table is your accuracy baseline. It tells you three things:
- Your accuracy number. 7 out of 10 correct = 70% baseline.
- Your error distribution. Are failures clustered in one error type? If 2 out of 3 errors are false negatives, that tells you exactly where to focus.
- Whether the contract is right. Sometimes you score a run as “wrong” and then realize the expected output was wrong — your contract was underspecified. Fix the contract, not the prompt.
Do not optimize the prompt yet. Do not add chain-of-thought. Do not add examples for the error types. Just record the baseline. You will optimize on Day 4 with actual data, not intuition.
If accuracy is below 50%, the task may be too complex for a single prompt. Split it into two steps — one that extracts/transforms the input, one that makes the decision. Re-run Day 2 for each step separately.
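A throwaway harness keeps the ten manual runs honest and saves the baseline in a machine-readable form. This is a sketch, not a required tool; the field names mirror the table above, and exact-match comparison is a simplifying assumption (your contract may call for fuzzier scoring):

```python
import json

def record_run(log_path, run_id, input_summary, expected, actual):
    """Append one manual run to the Day 2 results log (JSONL, append-only)."""
    correct = expected == actual  # simplification: your contract may need fuzzier matching
    entry = {
        "run": run_id,
        "input": input_summary,
        "expected": expected,
        "actual": actual,
        "correct": correct,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return correct

def baseline_accuracy(log_path):
    """Compute the Day 2 baseline from the recorded runs."""
    with open(log_path) as f:
        runs = [json.loads(line) for line in f]
    return sum(r["correct"] for r in runs) / len(runs)
```

Ten calls to record_run, one call to baseline_accuracy, and you have the number you will measure every later change against.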
Day 3: Add the Quality Gate
Take the error types from your Day 2 table and turn them into automated checks. This is the quality gate from Issue #15, but scoped to your specific agent.
For each error type, write a check that can run without human review:
- Format check: Does the output match the contract’s output specification? Correct JSON structure, required fields present, values within expected ranges.
- Confidence check: If the model provides reasoning, does the reasoning actually support the conclusion? A risk_score of 0.9 with reasoning that says “customer shows no signs of churn” is a confidence mismatch.
- Boundary check: Are scores within valid ranges? Is the recommended_action one of the allowed values?
- Consistency check: If you run the same input twice, do you get the same output? Inconsistency on identical inputs is a sign the prompt is underspecified.
Wire these checks to run automatically on every output. The gate produces a simple verdict:
{
  "run_id": "run_047",
  "timestamp": "2026-04-05T14:30:00Z",
  "input_hash": "a1b2c3d4",
  "output": { ... },
  "quality_gate": {
    "pass": true,
    "checks": {
      "format_valid": true,
      "fields_complete": true,
      "values_in_range": true,
      "confidence_aligned": true,
      "consistency_check": "not_run"
    },
    "score": 1.0,
    "failed_checks": []
  }
}
Every output that fails the quality gate gets flagged for review. Every output that passes gets logged and shipped. This is the difference between “an AI that produces output” and “an AI system that guarantees minimum quality.”
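As a sketch of what these checks look like in code, here is a minimal gate for the churn example. The field names and the confidence heuristic (a high risk_score paired with no_action) are assumptions drawn from the examples above, not a general-purpose implementation:

```python
def run_quality_gate(output):
    """Run the Day 3 checks against one agent output; return the verdict."""
    checks = {}
    checks["format_valid"] = isinstance(output, dict)
    required = ("risk_score", "churn_probability", "recommended_action", "reasoning")
    checks["fields_complete"] = checks["format_valid"] and all(k in output for k in required)
    checks["values_in_range"] = (
        checks["fields_complete"]
        and 0.0 <= output["risk_score"] <= 1.0
        and output["recommended_action"] in {"retain_offer", "check_in_call", "no_action"}
    )
    # Crude confidence check: a high risk score should not ship with "no_action"
    checks["confidence_aligned"] = (
        checks["values_in_range"]
        and not (output["risk_score"] > 0.7 and output["recommended_action"] == "no_action")
    )
    failed = [name for name, ok in checks.items() if not ok]
    return {
        "pass": not failed,
        "checks": checks,
        "score": sum(checks.values()) / len(checks),
        "failed_checks": failed,
    }
```

Each check is cheap and deterministic, which is the point: the gate must run on every output without a human in the loop.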
On Day 3, you will also discover edge cases the contract missed. A customer with a null purchase history. A score that rounds to exactly 0.0. A reasoning field that is empty. Each one becomes a new check in the quality gate and a new edge case in the contract.
Key insight: The quality gate is a living document. It grows every time you find a new failure mode. By week 4, it will catch errors you did not know existed on Day 1.
Day 4: Wire the Feedback Loop
The quality gate catches errors. The feedback loop fixes them.
Build two things:
1. A results log that appends every run’s quality gate output to a JSON Lines file. One line per run. Never overwrite — append only.
{"run_id":"run_047","timestamp":"2026-04-05T14:30:00Z","pass":true,"score":1.0,"failed_checks":[]}
{"run_id":"run_048","timestamp":"2026-04-05T14:35:00Z","pass":false,"score":0.6,"failed_checks":["confidence_aligned"]}
{"run_id":"run_049","timestamp":"2026-04-05T14:40:00Z","pass":true,"score":1.0,"failed_checks":[]}
2. A weekly reviewer that reads the log and produces improvement recommendations. This ties directly to Issue #16 — the feedback loop — but scoped to your single agent.
The weekly reviewer reads the last 7 days of results, calculates accuracy, identifies the most common failure modes, and proposes specific prompt changes to address them.
The key word is “specific.” Not “improve accuracy on edge cases.” Instead: “Add an explicit instruction to handle null purchase_history by defaulting to risk_score: 0.5 with reasoning ‘Insufficient data for assessment.’ This addresses 3 of the 4 confidence_aligned failures this week, all of which involved customers with missing purchase data.”
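Here is a minimal sketch of the data half of the weekly reviewer, assuming the JSONL format shown above (timestamp, pass, failed_checks). It computes the numbers; the reviewer prompt turns them into recommendations:

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

def weekly_summary(log_path, days=7):
    """Summarize the last N days of quality gate results for the reviewer."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    runs = []
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
            if ts >= cutoff:
                runs.append(entry)
    failures = Counter(c for r in runs for c in r["failed_checks"])
    return {
        "total_runs": len(runs),
        "pass_rate": sum(r["pass"] for r in runs) / len(runs) if runs else None,
        "top_failures": failures.most_common(3),
    }
```

Feed this summary, plus the raw failing entries, into the reviewer prompt; the concentrated failure counts are what make its recommendations specific instead of generic.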
You review the recommendation. You edit the prompt. You run 10 more manual tests on the changed section. If accuracy improves, you keep the change. If not, you revert.
Key insight: This is not a complex system. It is a log file and a prompt that reads the log file. But it is the foundation of self-improvement. The agent’s failures become the data that drives the agent’s evolution. Without this loop, you are optimizing by intuition. With it, you are optimizing by evidence.
Day 5: Schedule and Monitor
Your agent works. Your quality gate catches errors. Your feedback loop proposes improvements. Now make it run without you.
Three things to set up:
1. Scheduled Execution
Cron job, LaunchAgent, Windows Task Scheduler, or a cloud function on a timer. The agent runs on its schedule, processes its inputs, runs the quality gate, and logs the results. You do not need to be present.
# Run every day at 6 AM
0 6 * * * cd ~/my-agent && python3 run_agent.py >> logs/agent.log 2>&1
2. Heartbeat Monitor
A simple check that runs every hour and verifies: did the agent run on schedule? Did it produce output? Did the quality gate run? If any answer is “no,” send an alert.
The heartbeat is separate from the agent. If the agent crashes, the agent cannot tell you it crashed. The heartbeat is an independent process that expects the agent to have run and raises an alarm when it has not. This is the monitoring pattern from Issue #18 — the safety net.
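A sketch of such a heartbeat, assuming the agent appends a timestamped JSONL entry per run as on Day 4. Any missing, stale, or corrupt log counts as a failure, because a silent crash must never look healthy:

```python
import json
from datetime import datetime, timedelta, timezone

def heartbeat_ok(log_path, max_age_hours=25):
    """Independent check: did the agent log a run recently enough?

    Run this from a separate scheduled process, never from the agent itself.
    25 hours gives a daily agent one hour of slack before alerting.
    """
    try:
        last = None
        with open(log_path) as f:
            for line in f:
                last = line
        if last is None:
            return False
        entry = json.loads(last)
        ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
        return datetime.now(timezone.utc) - ts < timedelta(hours=max_age_hours)
    except (OSError, ValueError, KeyError):
        return False  # missing or corrupt log counts as a failed heartbeat
```

Wire `heartbeat_ok` to your alerting channel of choice; the only design rule is that it runs on a different schedule, in a different process, than the agent it watches.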
3. Weekly Digest
Every Sunday, the feedback loop reviewer runs automatically, reads the week’s log, and produces a summary: total runs, pass rate, most common failures, recommended prompt changes. You read this once per week. That is your total time commitment after Day 5 — reading one digest per week and approving or rejecting the recommended changes.
Prompt 2 — The Bootstrap Auditor
After you complete Days 1-5, run this auditor against your implementation. It will find the gaps you missed.
You are a bootstrap auditor for AI agent systems. You review a newly built agent system and identify gaps in its contract, quality gate, feedback loop, and automation.

I will provide:
1. The agent's contract (input spec, output spec, success criteria, failure modes, edge cases)
2. The agent's prompt
3. The quality gate checks
4. The feedback loop configuration
5. The automation setup (schedule, monitoring, alerts)

Analyze each component and produce a gap report:

## CONTRACT GAPS
- Missing input edge cases (fields that could be null, empty, malformed, or adversarial)
- Missing output constraints (values that should be bounded but are not)
- Success criteria that cannot be measured automatically
- Failure modes not covered by the quality gate
- Assumptions that are not validated at runtime

## PROMPT GAPS
- Instructions that are ambiguous (two reasonable interpretations exist)
- Missing instructions for edge cases defined in the contract
- Overly complex instructions that could be simplified
- Instructions that contradict each other
- Missing examples for the hardest failure modes

## QUALITY GATE GAPS
- Failure modes from the contract with no corresponding automated check
- Checks that can false-positive (flag correct output as wrong)
- Checks that can false-negative (miss incorrect output)
- Missing consistency checks (same input, different output)
- Missing latency/timeout checks
- No check for output length or format drift

## FEEDBACK LOOP GAPS
- Error types that are logged but never analyzed
- Analysis that produces recommendations too vague to act on
- No mechanism to measure whether prompt changes actually improved accuracy
- No rollback plan if a prompt change makes things worse
- Log format that loses information needed for analysis

## AUTOMATION GAPS
- No heartbeat monitor (agent can fail silently)
- No alerting on quality gate failures
- No log rotation (logs grow forever)
- No backup of the current prompt before changes
- Schedule does not account for input availability (agent runs before input data is ready)
- No graceful handling of missing or stale input

## SEVERITY RANKING
Rank all gaps by severity:
- CRITICAL: Will cause wrong output to ship undetected
- HIGH: Will prevent the feedback loop from working
- MEDIUM: Will cause operational issues within 30 days
- LOW: Will cause issues eventually but not urgently

For each gap, provide:
1. What is missing
2. What could go wrong because of it
3. A specific fix (not "add a check" — describe the check)
4. Estimated effort to fix (minutes)

Rules:
1. Be harsh. The goal is to find everything wrong BEFORE it causes a production incident.
2. If a section is genuinely complete, say so. Do not invent phantom issues.
3. Every gap must have a concrete fix. "Improve X" is not a fix. "Add a check that verifies field Y is non-null and within range [0, 1]" is a fix.
4. Prioritize gaps that could cause silent failures over gaps that would cause loud errors. Silent failures are always worse.
5. Check for gaps between components. The contract defines a failure mode, but the quality gate does not check for it. The feedback loop logs an error type, but the weekly reviewer does not analyze it. These seams are where bugs hide.
Run this after Day 5. Fix the critical and high gaps before considering your bootstrap complete. The medium and low gaps go into your backlog — the feedback loop will surface them when they matter.
Common Bootstrap Mistakes
- Starting with the prompt instead of the contract. This is the most common and most costly mistake. You write a clever prompt, run it a few times, tweak it when it fails, run it again, tweak again. After three days of this you have a prompt that handles the cases you have seen and breaks on the cases you have not. The contract forces you to define “all cases” before writing a single line of prompt. It takes two hours. It saves two weeks.
- Optimizing before measuring. “I bet chain-of-thought would improve this.” Maybe. But improve it from what? If you do not have a baseline accuracy number from 10 manual runs, you cannot know whether any change helped. Measure first, then optimize. Always.
- Building the evolution engine before the quality gate. You read Issues #22 and #23 and want to build a self-improving, cross-system orchestrated intelligence. But your agent does not have a quality gate yet. Self-improvement without measurement is just random mutation. The quality gate is the minimum — build it before anything else.
- Skipping manual runs on Day 2. “I’ll just automate it and check the results later.” The manual runs are where you learn what your agent actually does. Not what you think it does. Not what the prompt says it should do. What it actually produces when given real input. This takes 30 minutes. Skip it, and you will spend 3 hours debugging a problem you would have caught in run #4.
- Over-engineering Day 1. The contract does not need to be perfect. It needs to be specific enough to test against. You will revise it on Day 3 when the quality gate reveals edge cases you missed. You will revise it again on Day 4 when the feedback loop finds error patterns. Version 1 of the contract is a starting point, not a final specification.
- Not defining “good enough” before starting. Without a success threshold, you will optimize forever. “90% accuracy with less than 15% false positive rate” is a finish line. Without it, 85% accuracy feels like it needs one more tweak, and 92% feels like maybe you can push to 95%, and you never ship. Define the number. Hit the number. Ship.
The 30-Day Trajectory
Day 5 is not the end. It is the foundation. Here is what happens in the next 25 days.
Your agent runs daily. The quality gate logs every result. By the end of week 2, you have 50+ data points — enough to trust your accuracy metric. The weekly reviewer produces its first data-backed recommendation. You review it, apply the change, re-run 10 tests on the modified behavior, and deploy.
This is your first evidence-based improvement. Not “I think this prompt is better” — “the data shows 6 of the last 8 failures were false negatives on customers with fewer than 3 purchases, and adding an explicit instruction for low-purchase customers reduced failures from 6 to 1 in testing.”
Your agent has been running for two weeks. You have data on its accuracy, its failure modes, and its improvement trajectory. Time to stop reviewing every output.
This is Issue #17 — the handoff. Define autonomy tiers:
- Outputs with quality score above 0.95: Ship automatically. No review.
- Outputs with quality score 0.7-0.95: Ship but flag for weekly batch review.
- Outputs with quality score below 0.7: Hold for manual review before shipping.
Start conservative. If the agent sustains 95%+ quality for a week, widen the auto-ship threshold. This is earned autonomy — trust based on evidence, not hope.
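The tiers amount to a routing function keyed on the quality gate's score. A sketch, with illustrative tier names:

```python
def route(quality_score):
    """Route one output according to the earned-autonomy tiers above."""
    if quality_score > 0.95:
        return "auto_ship"        # no review
    if quality_score >= 0.7:
        return "ship_and_flag"    # weekly batch review
    return "hold_for_review"      # manual review before shipping
```

Widening autonomy later is a one-line change to the thresholds, which is exactly why the tiers belong in code rather than in habit.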
Your agent has three weeks of data. Your quality gate is mature. Your feedback loop has produced 2-3 prompt revisions, each backed by data. Now wire the evolution engine from Issue #22:
- Performance reviewer — reads the last 30 days of quality gate logs, calculates accuracy trends
- Rule proposer — reads the reviewer’s analysis and proposes specific prompt changes
- Shadow tester — runs the proposed change against the last 50 inputs in parallel with the current prompt
- Promotion gate — if the shadow test shows improvement above threshold, promotes the change
At this point, your agent proposes and tests its own improvements. You review the promotion gate’s decisions once per week. The agent that took 10 hours to build on Days 1-5 is now a self-improving system that requires less than 30 minutes per week of your attention.
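The shadow tester and promotion gate can be sketched together. Here, `run_current` and `run_candidate` are assumed wrappers that call the model with the two prompt versions, `gate` is the Day 3 quality gate, and the 5-point minimum gain is an illustrative threshold:

```python
def shadow_test(inputs, run_current, run_candidate, gate, min_gain=0.05):
    """Promotion gate sketch: promote only if the candidate prompt's pass
    rate beats the current prompt's by at least `min_gain` on the same inputs.
    """
    current_rate = sum(gate(run_current(x))["pass"] for x in inputs) / len(inputs)
    candidate_rate = sum(gate(run_candidate(x))["pass"] for x in inputs) / len(inputs)
    return {
        "current": current_rate,
        "candidate": candidate_rate,
        "promote": candidate_rate - current_rate >= min_gain,
    }
```

Running both prompts on the same historical inputs is what makes the comparison fair; a candidate that merely ties the current prompt is not promoted.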
The Bootstrap Checklist
Print this. Check each box as you complete it.
Day 1
- Input specification with format, types, and constraints
- Output specification with format, types, and constraints
- 5 ground truth examples (1 easy, 2 medium, 2 hard)
- 3 failure modes with detection methods
- 5 edge cases with expected behavior
- Success threshold with specific metrics
Day 2
- Minimum viable prompt written
- 10 manual runs completed with fresh inputs
- Results table with accuracy and error types
- Baseline accuracy calculated
- Error distribution analyzed
Day 3
- Format check automated
- Boundary check automated
- At least one failure-mode-specific check automated
- Quality gate runs on every output
- Contract updated with new edge cases found
Day 4
- Results log (JSONL, append-only) wired
- Weekly reviewer prompt written
- First reviewer run produces specific recommendations
- Process for applying and testing changes defined
Day 5
- Agent scheduled (cron/LaunchAgent/cloud timer)
- Heartbeat monitor running independently
- Alert wired for missed runs and quality failures
- Weekly digest automation configured
- Bootstrap auditor run, critical gaps fixed
Try It This Week
Pick one task. Not the most complex thing you want to automate — the simplest one that would save you real time. A daily report. A data transformation. A classification task. Something you currently do in 15-30 minutes that you do every day or every week.
Run through Days 1-5. By Friday, you will have a self-improving agent handling that task. The following week, you will have data proving it works. The week after that, you will have earned autonomy reducing your review time to minutes per week.
Then pick the second task. And the third. Each one is faster than the last because you already understand the framework. By the end of the month, you have what most people spend months building — not because you are smarter, but because you started with the contract instead of the technology.
Reply with your Day 1 contract. I will review it, identify gaps, and help you write the Day 2 prompt.