Building on Issues #1–23
Over the last 23 issues, we have built complex systems. Evolution engines that propose their own improvements. Memory layers that persist context across sessions. Scaling layers that coordinate multiple agents without breaking contracts. Quality gates, feedback loops, handoff protocols, safety nets, orchestration patterns.
All of that assumes you already have a working system to improve.
But everyone starts at zero. You have a task you want automated, a vague sense of what “good” looks like, and a blank folder. The gap between “I want an AI agent that does X” and actually having one running in production is where most people stall. Not because the technology is hard — because the process is unclear.
This issue is the Day 1 guide. A concrete, 5-day framework that takes you from an idea to a working, self-improving agent. No architecture diagrams. No theoretical frameworks. Just the steps, in order, with the exact prompts to run at each stage.
If you have been reading this series and thinking “I should build something” — this is where you start.
Why People Get Stuck
The most common failure mode is not technical. It is procedural. People start by writing a prompt. They iterate on the prompt for days. They get 80% accuracy and feel stuck. They never build the measurement system that would tell them what the other 20% looks like. They never wire it to run automatically. They never close the loop.
The prompt is not the starting point. The contract is.
A contract is a precise definition of what your agent should do — not how it should do it, but what “done” looks like. Inputs, outputs, success criteria, failure modes. The prompt comes later. The contract comes first.
This is counterintuitive. You have a powerful AI tool in front of you and your instinct is to start talking to it. Resist that instinct for one day. Define the contract first. Everything else — the prompt, the quality gate, the feedback loop, the automation — builds on that contract. If the contract is wrong, everything built on top of it is wrong faster.
The 5-Day Bootstrap
This is not a metaphor. This is a literal 5-day calendar. Each day has one deliverable. By the end of Day 5, you have a self-improving agent in production.
| Day | Deliverable | Time Required |
|---|---|---|
| Day 1 | The Contract | 2 hours |
| Day 2 | The Minimum Prompt + 10 manual runs | 3 hours |
| Day 3 | The Quality Gate | 2 hours |
| Day 4 | The Feedback Loop | 2 hours |
| Day 5 | Automation + Monitoring | 1 hour |
Total: 10 hours of focused work across 5 days. That is it.
Day 1: Define the Contract
Do not touch the AI today. Today is for thinking.
Write a document — plain text, markdown, a napkin — that answers these six questions:
- What is the input? Be specific. Not “customer data” — “a JSON object with fields: customer_id (string), purchase_history (array of objects with date, amount, product_id), and support_tickets (array of objects with date, category, resolution).” If you cannot specify the input format, you do not understand the problem yet.
- What is the output? Same level of specificity. Not “a recommendation” — “a JSON object with fields: risk_score (float 0-1), churn_probability (float 0-1), recommended_action (one of: ‘retain_offer’, ‘check_in_call’, ‘no_action’), and reasoning (string, 2-3 sentences explaining the recommendation).”
- What does “correct” look like? Define 5 examples. Five inputs with their expected outputs. These are your ground truth. If you cannot produce 5 examples by hand, the task is not well-defined enough for an agent.
- What does “wrong” look like? Define 3 failure modes. “Wrong” is not just incorrect output. It is the specific ways output can be incorrect. For a churn predictor: false positives (flagging happy customers), false negatives (missing at-risk customers), and hallucinated reasoning (correct score but made-up justification).
- What are the edge cases? What happens with empty input? Missing fields? Extreme values? A customer with 10,000 purchases? A customer with zero? List at least 5 edge cases and what the correct behavior should be for each.
- What is the success threshold? “90% accuracy” is a start, but accuracy on what metric? For the churn predictor: “correctly classify 90% of customers who churn within 30 days as risk_score > 0.7, while flagging no more than 15% of non-churning customers as false positives.” This is your exit criterion for Day 2.
This document is your contract. Everything you build will be measured against it. If the contract is wrong, fix the contract — do not hack the prompt to work around a bad contract.
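To make the success threshold concrete, here is a minimal sketch of how the churn example’s criterion could be checked mechanically once you have labeled outcomes. The field names (risk_score, churned) follow the example above and are assumptions, not a required schema:

```python
def meets_threshold(results):
    """Check the Day 1 success threshold against labeled results.

    `results` is a list of dicts with `risk_score` (float, the agent's
    output) and `churned` (bool, ground truth from the following 30 days).
    Field names are illustrative; match them to your own contract.
    """
    churned = [r for r in results if r["churned"]]
    retained = [r for r in results if not r["churned"]]

    # Recall: churners correctly flagged with risk_score > 0.7
    recall = sum(r["risk_score"] > 0.7 for r in churned) / len(churned)

    # False positive rate: non-churners incorrectly flagged
    fpr = sum(r["risk_score"] > 0.7 for r in retained) / len(retained)

    return recall >= 0.90 and fpr <= 0.15

sample = [
    {"risk_score": 0.9, "churned": True},
    {"risk_score": 0.8, "churned": True},
    {"risk_score": 0.3, "churned": False},
    {"risk_score": 0.6, "churned": False},
]
print(meets_threshold(sample))  # True: 2/2 churners flagged, 0 false positives
```

If you cannot write a function like this for your threshold, the threshold is not measurable yet, and the contract needs another pass.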
Prompt 1 — The Contract Writer
If you are struggling to write the contract yourself, use this agent. Give it your vague description of what you want automated, and it will produce a formal contract.
You are a contract writer for AI agent systems. Your job is to take a vague description of a task someone wants to automate and produce a rigorous, testable contract.

Input: A plain-English description of the task.

Produce a contract with these exact sections:

## INPUT SPECIFICATION
- Exact format (JSON schema, file type, API response shape)
- Required fields with types and constraints
- Optional fields with defaults
- Example input (realistic, not trivial)

## OUTPUT SPECIFICATION
- Exact format (JSON schema, file type, structured text)
- Required fields with types and constraints
- Example output matching the example input
- What the output must NEVER contain (safety constraints)

## GROUND TRUTH EXAMPLES
- 5 input/output pairs that define "correct"
- At least 1 easy case, 2 medium cases, 2 hard cases
- Each pair includes a 1-line explanation of WHY this output is correct for this input

## FAILURE MODES
- At least 3 specific ways the output can be wrong
- For each: description, how to detect it, severity (CRITICAL / HIGH / MEDIUM / LOW)
- CRITICAL = causes downstream harm or incorrect decisions
- HIGH = wrong output but detectable by a human reviewer
- MEDIUM = suboptimal but not harmful
- LOW = cosmetic or stylistic

## EDGE CASES
- At least 5 edge cases with expected behavior
- Include: empty input, missing fields, extreme values, ambiguous input, adversarial input
- For each: the input, the expected output, and WHY

## SUCCESS CRITERIA
- Primary metric with a specific numeric threshold
- Secondary metric if applicable
- Maximum acceptable false positive rate
- Maximum acceptable false negative rate
- Minimum sample size to declare success (at least 50)

## ASSUMPTIONS
- What this contract assumes about the environment
- What upstream systems must guarantee
- What downstream systems expect

Rules:
1. Be specific enough that two different people reading this contract would build agents that produce the same output for the same input.
2. If the task description is too vague to specify any section, say exactly what information is missing and ask for it. Do not guess.
3. Ground truth examples must be realistic. Do not use "foo/bar" placeholders.
4. Success criteria must be measurable by an automated system, not just a human judgment.
5. The contract is the source of truth. If the prompt disagrees with the contract, the contract wins.
Run this prompt with your task description. Review the output. Edit it. The contract writer will miss nuances only you know — your specific data format, your business rules, your tolerance for different types of errors. The prompt gets you 80% there. Your domain knowledge fills the rest.
Day 2: Build the Minimum Prompt
Now you can talk to the AI.
Write the simplest prompt that satisfies the contract. Not a clever prompt. Not an optimized prompt. The minimum viable prompt. Include the contract’s input specification, output specification, and 2-3 of your ground truth examples as in-context demonstrations.
Then run it 10 times. Manually. With real inputs — not your ground truth examples, which the model has already seen. Ten fresh inputs where you know (or can look up) the correct answer.
Record the results in a simple table:
| Run | Input Summary | Expected Output | Actual Output | Correct? | Error Type |
|---|---|---|---|---|---|
| 1 | Customer #4821, 12 purchases, 0 tickets | risk_score: 0.2, no_action | risk_score: 0.3, no_action | Yes | — |
| 2 | Customer #1093, 1 purchase, 3 tickets | risk_score: 0.8, retain_offer | risk_score: 0.4, no_action | No | False negative |
| ... | ... | ... | ... | ... | ... |
This table is your accuracy baseline. It tells you three things:
- Your accuracy number. 7 out of 10 correct = 70% baseline.
- Your error distribution. Are failures clustered in one error type? If 2 out of 3 errors are false negatives, that tells you exactly where to focus.
- Whether the contract is right. Sometimes you score a run as “wrong” and then realize the expected output was wrong — your contract was underspecified. Fix the contract, not the prompt.
Do not optimize the prompt yet. Do not add chain-of-thought. Do not add examples for the error types. Just record the baseline. You will optimize on Day 4 with actual data, not intuition.
If accuracy is below 50%, the task may be too complex for a single prompt. Split it into two steps — one that extracts/transforms the input, one that makes the decision. Re-run Day 2 for each step separately.
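A throwaway harness keeps the ten manual runs honest and saves the baseline in a machine-readable form. This is a sketch, not a required tool; the field names mirror the table above, and exact-match comparison is a simplifying assumption (your contract may call for fuzzier scoring):

```python
import json

def record_run(log_path, run_id, input_summary, expected, actual):
    """Append one manual run to the Day 2 results log (JSONL, append-only)."""
    correct = expected == actual  # simplification: your contract may need fuzzier matching
    entry = {
        "run": run_id,
        "input": input_summary,
        "expected": expected,
        "actual": actual,
        "correct": correct,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return correct

def baseline_accuracy(log_path):
    """Compute the Day 2 baseline from the recorded runs."""
    with open(log_path) as f:
        runs = [json.loads(line) for line in f]
    return sum(r["correct"] for r in runs) / len(runs)
```

Ten calls to record_run, one call to baseline_accuracy, and you have the number you will measure every later change against.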
Day 3: Add the Quality Gate
Take the error types from your Day 2 table and turn them into automated checks. This is the quality gate from Issue #15, but scoped to your specific agent.
For each error type, write a check that can run without human review:
- Format check: Does the output match the contract’s output specification? Correct JSON structure, required fields present, values within expected ranges.
- Confidence check: If the model provides reasoning, does the reasoning actually support the conclusion? A risk_score of 0.9 with reasoning that says “customer shows no signs of churn” is a confidence mismatch.
- Boundary check: Are scores within valid ranges? Is the recommended_action one of the allowed values?
- Consistency check: If you run the same input twice, do you get the same output? Inconsistency on identical inputs is a sign the prompt is underspecified.
Wire these checks to run automatically on every output. The gate produces a simple verdict:
{
  "run_id": "run_047",
  "timestamp": "2026-04-05T14:30:00Z",
  "input_hash": "a1b2c3d4",
  "output": { ... },
  "quality_gate": {
    "pass": true,
    "checks": {
      "format_valid": true,
      "fields_complete": true,
      "values_in_range": true,
      "confidence_aligned": true,
      "consistency_check": "not_run"
    },
    "score": 1.0,
    "failed_checks": []
  }
}
Every output that fails the quality gate gets flagged for review. Every output that passes gets logged and shipped. This is the difference between “an AI that produces output” and “an AI system that guarantees minimum quality.”
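As a sketch of what these checks look like in code, here is a minimal gate for the churn example. The field names and the confidence heuristic (a high risk_score paired with no_action) are assumptions drawn from the examples above, not a general-purpose implementation:

```python
def run_quality_gate(output):
    """Run the Day 3 checks against one agent output; return the verdict."""
    checks = {}
    checks["format_valid"] = isinstance(output, dict)
    required = ("risk_score", "churn_probability", "recommended_action", "reasoning")
    checks["fields_complete"] = checks["format_valid"] and all(k in output for k in required)
    checks["values_in_range"] = (
        checks["fields_complete"]
        and 0.0 <= output["risk_score"] <= 1.0
        and output["recommended_action"] in {"retain_offer", "check_in_call", "no_action"}
    )
    # Crude confidence check: a high risk score should not ship with "no_action"
    checks["confidence_aligned"] = (
        checks["values_in_range"]
        and not (output["risk_score"] > 0.7 and output["recommended_action"] == "no_action")
    )
    failed = [name for name, ok in checks.items() if not ok]
    return {
        "pass": not failed,
        "checks": checks,
        "score": sum(checks.values()) / len(checks),
        "failed_checks": failed,
    }
```

Each check is cheap and deterministic, which is the point: the gate must run on every output without a human in the loop.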
On Day 3, you will also discover edge cases the contract missed. A customer with a null purchase history. A score that rounds to exactly 0.0. A reasoning field that is empty. Each one becomes a new check in the quality gate and a new edge case in the contract.
Key insight: The quality gate is a living document. It grows every time you find a new failure mode. By week 4, it will catch errors you did not know existed on Day 1.
Day 4: Wire the Feedback Loop
The quality gate catches errors. The feedback loop fixes them.
Build two things:
1. A results log that appends every run’s quality gate output to a JSON Lines file. One line per run. Never overwrite — append only.
{"run_id":"run_047","timestamp":"2026-04-05T14:30:00Z","pass":true,"score":1.0,"failed_checks":[]}
{"run_id":"run_048","timestamp":"2026-04-05T14:35:00Z","pass":false,"score":0.6,"failed_checks":["confidence_aligned"]}
{"run_id":"run_049","timestamp":"2026-04-05T14:40:00Z","pass":true,"score":1.0,"failed_checks":[]}
2. A weekly reviewer that reads the log and produces improvement recommendations. This ties directly to Issue #16 — the feedback loop — but scoped to your single agent.
The weekly reviewer reads the last 7 days of results, calculates accuracy, identifies the most common failure modes, and proposes specific prompt changes to address them.
The key word is “specific.” Not “improve accuracy on edge cases.” Instead: “Add an explicit instruction to handle null purchase_history by defaulting to risk_score: 0.5 with reasoning ‘Insufficient data for assessment.’ This addresses 3 of the 4 confidence_aligned failures this week, all of which involved customers with missing purchase data.”
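Here is a minimal sketch of the data half of the weekly reviewer, assuming the JSONL format shown above (timestamp, pass, failed_checks). It computes the numbers; the reviewer prompt turns them into recommendations:

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

def weekly_summary(log_path, days=7):
    """Summarize the last N days of quality gate results for the reviewer."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    runs = []
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
            if ts >= cutoff:
                runs.append(entry)
    failures = Counter(c for r in runs for c in r["failed_checks"])
    return {
        "total_runs": len(runs),
        "pass_rate": sum(r["pass"] for r in runs) / len(runs) if runs else None,
        "top_failures": failures.most_common(3),
    }
```

Feed this summary, plus the raw failing entries, into the reviewer prompt; the concentrated failure counts are what make its recommendations specific instead of generic.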
You review the recommendation. You edit the prompt. You run 10 more manual tests on the changed section. If accuracy improves, you keep the change. If not, you revert.
Key insight: This is not a complex system. It is a log file and a prompt that reads the log file. But it is the foundation of self-improvement. The agent’s failures become the data that drives the agent’s evolution. Without this loop, you are optimizing by intuition. With it, you are optimizing by evidence.
Day 5: Schedule and Monitor
Your agent works. Your quality gate catches errors. Your feedback loop proposes improvements. Now make it run without you.
Three things to set up:
1. Scheduled Execution
Cron job, LaunchAgent, Windows Task Scheduler, or a cloud function on a timer. The agent runs on its schedule, processes its inputs, runs the quality gate, and logs the results. You do not need to be present.
# Run every day at 6 AM
0 6 * * * cd ~/my-agent && python3 run_agent.py >> logs/agent.log 2>&1
2. Heartbeat Monitor
A simple check that runs every hour and verifies: did the agent run on schedule? Did it produce output? Did the quality gate run? If any answer is “no,” send an alert.
The heartbeat is separate from the agent. If the agent crashes, the agent cannot tell you it crashed. The heartbeat is an independent process that expects the agent to have run and raises an alarm when it has not. This is the monitoring pattern from Issue #18 — the safety net.
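A sketch of such a heartbeat, assuming the agent appends a timestamped JSONL entry per run as on Day 4. Any missing, stale, or corrupt log counts as a failure, because a silent crash must never look healthy:

```python
import json
from datetime import datetime, timedelta, timezone

def heartbeat_ok(log_path, max_age_hours=25):
    """Independent check: did the agent log a run recently enough?

    Run this from a separate scheduled process, never from the agent itself.
    25 hours gives a daily agent one hour of slack before alerting.
    """
    try:
        last = None
        with open(log_path) as f:
            for line in f:
                last = line
        if last is None:
            return False
        entry = json.loads(last)
        ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
        return datetime.now(timezone.utc) - ts < timedelta(hours=max_age_hours)
    except (OSError, ValueError, KeyError):
        return False  # missing or corrupt log counts as a failed heartbeat
```

Wire `heartbeat_ok` to your alerting channel of choice; the only design rule is that it runs on a different schedule, in a different process, than the agent it watches.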
3. Weekly Digest
Every Sunday, the feedback loop reviewer runs automatically, reads the week’s log, and produces a summary: total runs, pass rate, most common failures, recommended prompt changes. You read this once per week. That is your total time commitment after Day 5 — reading one digest per week and approving or rejecting the recommended changes.
Prompt 2 — The Bootstrap Auditor
After you complete Days 1-5, run this auditor against your implementation. It will find the gaps you missed.
You are a bootstrap auditor for AI agent systems. You review a newly built agent system and identify gaps in its contract, quality gate, feedback loop, and automation.

I will provide:
1. The agent's contract (input spec, output spec, success criteria, failure modes, edge cases)
2. The agent's prompt
3. The quality gate checks
4. The feedback loop configuration
5. The automation setup (schedule, monitoring, alerts)

Analyze each component and produce a gap report:

## CONTRACT GAPS
- Missing input edge cases (fields that could be null, empty, malformed, or adversarial)
- Missing output constraints (values that should be bounded but are not)
- Success criteria that cannot be measured automatically
- Failure modes not covered by the quality gate
- Assumptions that are not validated at runtime

## PROMPT GAPS
- Instructions that are ambiguous (two reasonable interpretations exist)
- Missing instructions for edge cases defined in the contract
- Overly complex instructions that could be simplified
- Instructions that contradict each other
- Missing examples for the hardest failure modes

## QUALITY GATE GAPS
- Failure modes from the contract with no corresponding automated check
- Checks that can false-positive (flag correct output as wrong)
- Checks that can false-negative (miss incorrect output)
- Missing consistency checks (same input, different output)
- Missing latency/timeout checks
- No check for output length or format drift

## FEEDBACK LOOP GAPS
- Error types that are logged but never analyzed
- Analysis that produces recommendations too vague to act on
- No mechanism to measure whether prompt changes actually improved accuracy
- No rollback plan if a prompt change makes things worse
- Log format that loses information needed for analysis

## AUTOMATION GAPS
- No heartbeat monitor (agent can fail silently)
- No alerting on quality gate failures
- No log rotation (logs grow forever)
- No backup of the current prompt before changes
- Schedule does not account for input availability (agent runs before input data is ready)
- No graceful handling of missing or stale input

## SEVERITY RANKING
Rank all gaps by severity:
- CRITICAL: Will cause wrong output to ship undetected
- HIGH: Will prevent the feedback loop from working
- MEDIUM: Will cause operational issues within 30 days
- LOW: Will cause issues eventually but not urgently

For each gap, provide:
1. What is missing
2. What could go wrong because of it
3. A specific fix (not "add a check" — describe the check)
4. Estimated effort to fix (minutes)

Rules:
1. Be harsh. The goal is to find everything wrong BEFORE it causes a production incident.
2. If a section is genuinely complete, say so. Do not invent phantom issues.
3. Every gap must have a concrete fix. "Improve X" is not a fix. "Add a check that verifies field Y is non-null and within range [0, 1]" is a fix.
4. Prioritize gaps that could cause silent failures over gaps that would cause loud errors. Silent failures are always worse.
5. Check for gaps between components. The contract defines a failure mode, but the quality gate does not check for it. The feedback loop logs an error type, but the weekly reviewer does not analyze it. These seams are where bugs hide.
Run this after Day 5. Fix the critical and high gaps before considering your bootstrap complete. The medium and low gaps go into your backlog — the feedback loop will surface them when they matter.
Common Bootstrap Mistakes
- Starting with the prompt instead of the contract. This is the most common and most costly mistake. You write a clever prompt, run it a few times, tweak it when it fails, run it again, tweak again. After three days of this you have a prompt that handles the cases you have seen and breaks on the cases you have not. The contract forces you to define “all cases” before writing a single line of prompt. It takes two hours. It saves two weeks.
- Optimizing before measuring. “I bet chain-of-thought would improve this.” Maybe. But improve it from what? If you do not have a baseline accuracy number from 10 manual runs, you cannot know whether any change helped. Measure first, then optimize. Always.
- Building the evolution engine before the quality gate. You read Issues #22 and #23 and want to build a self-improving, cross-system orchestrated intelligence. But your agent does not have a quality gate yet. Self-improvement without measurement is just random mutation. The quality gate is the minimum — build it before anything else.
- Skipping manual runs on Day 2. “I’ll just automate it and check the results later.” The manual runs are where you learn what your agent actually does. Not what you think it does. Not what the prompt says it should do. What it actually produces when given real input. This takes 30 minutes. Skip it, and you will spend 3 hours debugging a problem you would have caught in run #4.
- Over-engineering Day 1. The contract does not need to be perfect. It needs to be specific enough to test against. You will revise it on Day 3 when the quality gate reveals edge cases you missed. You will revise it again on Day 4 when the feedback loop finds error patterns. Version 1 of the contract is a starting point, not a final specification.
- Not defining “good enough” before starting. Without a success threshold, you will optimize forever. “90% accuracy with less than 15% false positive rate” is a finish line. Without it, 85% accuracy feels like it needs one more tweak, and 92% feels like maybe you can push to 95%, and you never ship. Define the number. Hit the number. Ship.
The 30-Day Trajectory
Day 5 is not the end. It is the foundation. Here is what happens in the next 25 days.
Your agent runs daily. The quality gate logs every result. By the end of week 2, you have 50+ data points — enough to trust your accuracy metric. The weekly reviewer produces its first data-backed recommendation. You review it, apply the change, re-run 10 tests on the modified behavior, and deploy.
This is your first evidence-based improvement. Not “I think this prompt is better” — “the data shows 6 of the last 8 failures were false negatives on customers with fewer than 3 purchases, and adding an explicit instruction for low-purchase customers reduced failures from 6 to 1 in testing.”
Your agent has been running for two weeks. You have data on its accuracy, its failure modes, and its improvement trajectory. Time to stop reviewing every output.
This is Issue #17 — the handoff. Define autonomy tiers:
- Outputs with quality score above 0.95: Ship automatically. No review.
- Outputs with quality score 0.7-0.95: Ship but flag for weekly batch review.
- Outputs with quality score below 0.7: Hold for manual review before shipping.
Start conservative. If the agent sustains 95%+ quality for a week, widen the auto-ship threshold. This is earned autonomy — trust based on evidence, not hope.
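The tiers amount to a routing function keyed on the quality gate's score. A sketch, with illustrative tier names:

```python
def route(quality_score):
    """Route one output according to the earned-autonomy tiers above."""
    if quality_score > 0.95:
        return "auto_ship"        # no review
    if quality_score >= 0.7:
        return "ship_and_flag"    # weekly batch review
    return "hold_for_review"      # manual review before shipping
```

Widening autonomy later is a one-line change to the thresholds, which is exactly why the tiers belong in code rather than in habit.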
Your agent has three weeks of data. Your quality gate is mature. Your feedback loop has produced 2-3 prompt revisions, each backed by data. Now wire the evolution engine from Issue #22:
- Performance reviewer — reads the last 30 days of quality gate logs, calculates accuracy trends
- Rule proposer — reads the reviewer’s analysis and proposes specific prompt changes
- Shadow tester — runs the proposed change against the last 50 inputs in parallel with the current prompt
- Promotion gate — if the shadow test shows improvement above threshold, promotes the change
At this point, your agent proposes and tests its own improvements. You review the promotion gate’s decisions once per week. The agent that took 10 hours to build on Days 1-5 is now a self-improving system that requires less than 30 minutes per week of your attention.
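The shadow tester and promotion gate can be sketched together. Here, `run_current` and `run_candidate` are assumed wrappers that call the model with the two prompt versions, `gate` is the Day 3 quality gate, and the 5-point minimum gain is an illustrative threshold:

```python
def shadow_test(inputs, run_current, run_candidate, gate, min_gain=0.05):
    """Promotion gate sketch: promote only if the candidate prompt's pass
    rate beats the current prompt's by at least `min_gain` on the same inputs.
    """
    current_rate = sum(gate(run_current(x))["pass"] for x in inputs) / len(inputs)
    candidate_rate = sum(gate(run_candidate(x))["pass"] for x in inputs) / len(inputs)
    return {
        "current": current_rate,
        "candidate": candidate_rate,
        "promote": candidate_rate - current_rate >= min_gain,
    }
```

Running both prompts on the same historical inputs is what makes the comparison fair; a candidate that merely ties the current prompt is not promoted.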
The Bootstrap Checklist
Print this. Check each box as you complete it.
Day 1
- Input specification with format, types, and constraints
- Output specification with format, types, and constraints
- 5 ground truth examples (1 easy, 2 medium, 2 hard)
- 3 failure modes with detection methods
- 5 edge cases with expected behavior
- Success threshold with specific metrics
Day 2
- Minimum viable prompt written
- 10 manual runs completed with fresh inputs
- Results table with accuracy and error types
- Baseline accuracy calculated
- Error distribution analyzed
Day 3
- Format check automated
- Boundary check automated
- At least one failure-mode-specific check automated
- Quality gate runs on every output
- Contract updated with new edge cases found
Day 4
- Results log (JSONL, append-only) wired
- Weekly reviewer prompt written
- First reviewer run produces specific recommendations
- Process for applying and testing changes defined
Day 5
- Agent scheduled (cron/LaunchAgent/cloud timer)
- Heartbeat monitor running independently
- Alert wired for missed runs and quality failures
- Weekly digest automation configured
- Bootstrap auditor run, critical gaps fixed
Try It This Week
Pick one task. Not the most complex thing you want to automate — the simplest one that would save you real time. A daily report. A data transformation. A classification task. Something you currently do in 15-30 minutes that you do every day or every week.
Run through Days 1-5. By Friday, you will have a self-improving agent handling that task. The following week, you will have data proving it works. The week after that, you will have earned autonomy reducing your review time to minutes per week.
Then pick the second task. And the third. Each one is faster than the last because you already understand the framework. By the end of the month, you have what most people spend months building — not because you are smarter, but because you started with the contract instead of the technology.
Reply with your Day 1 contract. I will review it, identify gaps, and help you write the Day 2 prompt.