Building on Issues #1–24
You have built the agent. You have the quality gate, the feedback loop, the evolution engine, the memory layer, the scaling governance, and the bootstrap framework. Your system works. It catches errors, proposes its own improvements, coordinates across multiple agents, and runs on a schedule.
But every morning, you open a digest. You read through 15 recommendations. You approve 12, reject 2, and defer 1. You check a dashboard, notice a metric drifting, and decide whether to intervene or wait another day. You glance at a queue of proposed rule changes, evaluate each one, and click “promote” or “discard.”
You are the bottleneck.
Not because you are slow. Because you are the only entity in the system with judgment. Your agents can detect patterns, propose changes, test hypotheses, and report results. But they cannot decide. Every decision waits for you. And the system’s throughput is capped at the number of decisions you can make per day.
This issue fixes that. Not by removing you from the system — but by teaching the system what you would decide, so it only escalates the decisions you actually need to see.
The Three Types of Decisions
Before you can codify judgment, you need to classify the decisions your system asks you to make. Every decision falls into one of three categories:
| Type | Description | Example | Can Automate? |
|---|---|---|---|
| Mechanical | Same inputs always produce the same output. No judgment required. | “If quality score > 0.95, ship automatically.” | Yes |
| Pattern | Requires judgment, but your judgment follows a consistent pattern you could articulate. | “I always approve rule changes that improve PF by >5% with n > 50.” | After extraction |
| Novel | Requires context, intuition, or information the system does not have. | “Should we pivot our pricing model?” | No — escalate |
Most people assume their decisions are mostly Novel. They are not. When you audit your actual decisions over a 30-day period, you will find that 60–80% are Mechanical or Pattern decisions — things you decide the same way every time, based on data the system already has.
The Decision Engine automates the Mechanical and Pattern decisions. It escalates the Novel ones. And it continuously learns from your Novel decisions to reclassify them as Pattern decisions over time.
The compounding effect: Every time you make a Novel decision, the system records your reasoning. After you make the same type of decision 5+ times with consistent logic, the Decision Engine proposes a new rule. If you approve, that entire class of decisions never reaches you again. Your inbox of decisions shrinks every week.
The Decision Rule Format
A decision rule has five components. All five are required. If any one is missing, the rule is incomplete and will not be promoted to autonomous operation.
| Component | What It Is | Example |
|---|---|---|
| Condition | The specific, measurable trigger | shadow_pf_improvement > 0.05 AND sample_size > 50 |
| Action | What the system does when the condition is true | promote_rule_to_production |
| Confidence | How certain the system should be before acting | 0.92 (based on 12/13 historical approvals matching this pattern) |
| Fallback | What happens if confidence is below threshold | escalate_to_human with summary + recommendation |
| Audit Trail | What the system records for every autonomous decision | rule_id, timestamp, inputs, decision, confidence, reasoning |
The Audit Trail is the component most people skip. It is the most important one. Without it, you have an autonomous system you cannot inspect. When something goes wrong — and something will go wrong — the audit trail is how you diagnose, learn, and tighten the rule. Every autonomous decision must be more traceable than every human decision, not less.
Start every rule with a confidence threshold of 0.95. This means the system will only act autonomously when it is extremely certain. As the rule proves itself over 30+ correct autonomous decisions, you can lower the threshold to 0.85 or 0.80. Never start low and raise later. Starting high is safe. Starting low creates bad habits the system internalizes.
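The five components can be sketched as one small data structure. This is a minimal illustration under stated assumptions, not a reference implementation: the field and method names (`DecisionRule`, `decide`, and so on) are invented for the sketch, and the constant-confidence lambda stands in for a real confidence calculation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass
class DecisionRule:
    """A decision rule with all five required components."""
    rule_id: str
    condition: Callable[[dict], bool]      # Condition: measurable trigger
    action: str                            # Action: what to do when condition is true
    confidence: Callable[[dict], float]    # Confidence: how certain the system is
    threshold: float = 0.95                # start high; lower only after 30+ correct decisions
    audit_log: list = field(default_factory=list)

    def decide(self, inputs: dict) -> str:
        conf = self.confidence(inputs)
        if self.condition(inputs) and conf >= self.threshold:
            decision, autonomous = self.action, True
        else:
            decision, autonomous = "escalate_to_human", False   # Fallback
        # Audit Trail: record every decision, autonomous or escalated.
        self.audit_log.append({
            "rule_id": self.rule_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "inputs": inputs,
            "decision": decision,
            "confidence": conf,
            "autonomous": autonomous,
        })
        return decision

# Example: the promotion rule from the table above.
rule = DecisionRule(
    rule_id="promote-shadow-rule",
    condition=lambda x: x["shadow_pf_improvement"] > 0.05 and x["sample_size"] > 50,
    action="promote_rule_to_production",
    confidence=lambda x: 0.92,   # placeholder; a real rule computes this from history
)

# Confidence 0.92 is below the starting 0.95 threshold, so this escalates.
print(rule.decide({"shadow_pf_improvement": 0.08, "sample_size": 120}))
```

Note how the high starting threshold plays out: even a rule with 92% historical consistency escalates until it has earned a lower bar.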
Step 1: The Decision Audit
Before you write a single rule, you need data. Specifically, you need a record of every decision you have made in the last 30 days and the reasoning behind each one.
If your system already logs your approvals and rejections (it should — Issue #16 covered this), pull those logs. If not, start logging now and come back in 30 days. You cannot build a Decision Engine without decision data.
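If you are starting from zero, logging can be as simple as appending one JSON line per decision. A minimal sketch; the record fields mirror the schema Prompt 1 expects, and the filename is arbitrary:

```python
import json
from datetime import datetime, timezone

def log_decision(path: str, decision_id: str, context: str,
                 decision: str, reasoning: str = "") -> None:
    """Append one decision record as a JSON line."""
    record = {
        "decision_id": decision_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "context": context,
        "decision": decision,                    # approve / reject / defer / modify
        "reasoning": reasoning or "not recorded",
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_decision("decisions.jsonl", "d-001",
             "Rule change: +6.2% PF over 54 samples",
             "approve", "Improvement > 5% with adequate sample size")
```

Thirty days of one-line records like this is all the Decision Auditor needs.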
Once you have the data, run Prompt 1.
Prompt 1 — The Decision Auditor
This agent reads your decision history and classifies each decision into the three types. It also identifies patterns — groups of decisions where you applied the same logic repeatedly.
You are a decision pattern analyst. Your job is to read a
log of human decisions and classify each one, then identify
repeating patterns that could be automated.
Input: A JSON array of decision records. Each record has:
- decision_id: unique identifier
- timestamp: when the decision was made
- context: what the system presented to the human
- decision: what the human chose (approve/reject/defer/modify)
- reasoning: why (if recorded), or "not recorded"
Produce a report with these exact sections:
## DECISION CLASSIFICATION
For each decision, classify as:
- MECHANICAL: Same inputs always produce same output
- PATTERN: Judgment required, but follows a consistent rule
- NOVEL: Requires context the system does not have
Include: decision_id, classification, confidence (0-1),
and a 1-line justification for the classification.
## PATTERN EXTRACTION
For each group of PATTERN decisions that share logic:
- pattern_id: a short descriptive name
- decisions: list of decision_ids that follow this pattern
- rule_draft: the decision rule in plain English
("When X is true and Y > threshold, the human always Z")
- consistency: what percentage of decisions in this group
follow the rule exactly (must be > 80% to qualify)
- exceptions: any decisions that ALMOST fit but diverged,
with notes on why
## AUTOMATION CANDIDATES
Rank all patterns by:
1. Frequency (how often this decision type occurs)
2. Consistency (how reliably the human follows the pattern)
3. Impact (what happens if the rule is wrong once)
Top candidates = high frequency + high consistency + low
impact if wrong.
## NOVEL DECISIONS
List all NOVEL decisions. For each, explain:
- Why this cannot currently be automated
- What additional data or context would be needed to
eventually automate it
- Whether it could become a PATTERN decision if the system
tracked specific additional information
Be conservative. If you are unsure whether a decision is
PATTERN or NOVEL, classify it as NOVEL. False negatives
(missing an automatable pattern) are safe. False positives
(automating a decision that requires human judgment) are
dangerous.
What you get: A ranked list of your most automatable decisions, with draft rules and consistency scores. This is the roadmap for your Decision Engine. Start with the top 3 candidates — the decisions you make most often, most consistently, with the lowest downside if wrong.
Step 2: Write the Rules
Take the top 3 automation candidates from your audit. For each one, you need to convert the plain-English pattern into an executable decision rule with all five components.
This is where most people make the critical mistake: they write the rule too broadly. A pattern that says “I usually approve performance improvements” becomes a rule that says if improvement > 0, approve. That rule will approve a 0.1% improvement with a sample size of 3. You would never approve that.
The rule must be tighter than your judgment, not looser. If you are unsure whether you would approve something, the rule should escalate. The system earns autonomy by being more conservative than you, not less.
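The difference between a loose and a tight condition is concrete. A sketch using the thresholds from the example pattern above:

```python
def loose_condition(x: dict) -> bool:
    # Too broad: this approves a 0.1% improvement with a sample size of 3.
    return x["improvement"] > 0

def tight_condition(x: dict) -> bool:
    # Tighter than human judgment: both thresholds come from historical approvals.
    return x["improvement"] > 0.05 and x["sample_size"] > 50

weak_case = {"improvement": 0.001, "sample_size": 3}
print(loose_condition(weak_case))   # True  -- the loose rule would auto-approve
print(tight_condition(weak_case))   # False -- the tight rule escalates instead
```

Anything that fails the tight condition falls through to the fallback and reaches a human, which is exactly the behavior you want while a rule is earning trust.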
Prompt 2 — The Rule Writer
You are a decision rule engineer. Your job is to convert a
plain-English decision pattern into a precise, executable
decision rule with safety guarantees.
Input:
- pattern_description: the plain-English pattern from the
  Decision Auditor
- historical_decisions: the specific decisions that formed
  this pattern (with context and reasoning)
- consistency_score: how often the human followed this
  pattern exactly
For each pattern, produce a decision rule with ALL FIVE
components:
## CONDITION
- Express as a boolean formula using only measurable fields
- Every threshold must come from the historical data (not
  invented)
- Use the TIGHTEST threshold that captures 90%+ of
  historical approvals
- Include a minimum sample size requirement (never less
  than n=30)
- Example: improvement_pf > 0.05 AND sample_size >= 50
  AND days_in_shadow >= 10
## ACTION
- One of: approve, reject, defer, escalate, modify
- If modify: specify exactly what changes
- If defer: specify the re-evaluation trigger
## CONFIDENCE CALCULATION
- How to compute confidence for this specific rule
- Must use historical consistency as the baseline
- Formula: (matching_historical_decisions /
  total_historical_decisions) * recency_weight
- Recency weight: decisions from last 7 days count 2x,
  last 30 days count 1x, older counts 0.5x
## FALLBACK
- What happens when confidence < threshold (default 0.95)
- Must include: escalation path, summary format, and
  recommended action with reasoning
- The human must see: the data, what the rule WOULD have
  decided, and why the confidence was below threshold
## AUDIT RECORD
- JSON schema for the audit log entry
- Must include: rule_id, timestamp, all input values,
  computed confidence, decision made, and whether it was
  autonomous or escalated
SAFETY CONSTRAINTS:
- No rule may have a confidence threshold below 0.80
- No rule may act on fewer than 30 historical examples
- Every rule must have a kill switch: if the rule makes 3
  consecutive decisions that a human later overrides, it
  automatically reverts to full escalation mode
- Every rule must expire after 90 days and require
  re-validation against fresh data
The kill switch and expiration are non-negotiable. A rule that was 95% accurate six months ago may be 60% accurate today because the underlying data distribution shifted. The 90-day expiration forces you to re-validate — and more importantly, it forces the system to re-validate, because the Decision Engine should handle re-validation automatically. If a rule expires and its historical accuracy is still above threshold, it re-promotes itself. If not, it escalates for human review.
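Prompt 2's recency-weighted formula can be read as a weighted agreement score: each historical decision contributes a weight (2x, 1x, or 0.5x by age), and confidence is the weighted share that matched the rule. A sketch of that interpretation; the function names and record shape are assumptions:

```python
def recency_weight(age_days: float) -> float:
    """Recent decisions count more: 2x within 7 days, 1x within 30, 0.5x older."""
    if age_days <= 7:
        return 2.0
    if age_days <= 30:
        return 1.0
    return 0.5

def rule_confidence(history: list) -> float:
    """Weighted fraction of historical decisions that matched the rule's output.

    Each record: {"age_days": float, "matched_rule": bool}
    """
    total = sum(recency_weight(h["age_days"]) for h in history)
    matched = sum(recency_weight(h["age_days"]) for h in history if h["matched_rule"])
    return matched / total if total else 0.0

history = (
    [{"age_days": 3, "matched_rule": True}] * 5      # recent, all matched (weight 2x)
    + [{"age_days": 20, "matched_rule": True}] * 6   # this month, matched (weight 1x)
    + [{"age_days": 60, "matched_rule": False}] * 2  # older misses count half (0.5x)
)
print(round(rule_confidence(history), 3))  # → 0.941
```

The weighting means a rule that was consistent last quarter but wobbly this week loses confidence quickly, which is the behavior the drift and expiration safeguards also aim for.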
Step 3: The Graduation Protocol
You do not deploy a rule straight to autonomous operation. You graduate it through tiers, exactly like Issue #17’s earned autonomy model — but applied to decisions instead of outputs.
Tier 0: Shadow Mode (14 days)
The rule evaluates every decision but takes no action. It logs what it would have decided alongside what you actually decided. At the end of 14 days, you compare. If the rule matches your decisions 95%+ of the time, it graduates to Tier 1.
Key metric: Shadow accuracy. The percentage of decisions where the rule’s output matches yours.
Tier 1: Suggest Mode (14 days)
The rule makes recommendations that appear in your daily digest, clearly labeled as [AUTO-SUGGEST]. You still make every decision, but you can see what the rule would have done. This catches cases where shadow accuracy was high but the rule’s reasoning was wrong — right answer, wrong logic.
Key metric: Override rate. How often you choose differently from the suggestion. If override rate is below 5%, graduate to Tier 2.
Tier 2: Act-and-Report (30 days)
The rule acts autonomously, but every decision appears in your digest for review. You do not need to approve each one — you just scan for errors. If you see one, you override it and the system logs the override as training data for the next rule revision.
Key metric: Post-action override rate. If you override fewer than 2% of autonomous decisions over 30 days, graduate to Tier 3.
Tier 3: Full Autonomy
The rule acts autonomously. Decisions appear in the weekly summary, not the daily digest. You review them once per week in aggregate. The audit trail records everything. The kill switch remains active — 3 consecutive overrides in a single review session automatically demotes the rule back to Tier 1.
Key metric: Weekly override rate. Should remain below 1%. If it rises above 3% in any week, automatic demotion.
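The tiers and their promotion thresholds amount to a small state machine. A sketch under stated assumptions: the tier names follow the checklist later in this issue, the kill switch demotes to Tier 1 per the Full Autonomy description, and a weekly override rate above 3% is treated here as a one-tier demotion (the text does not specify how far).

```python
TIERS = ["shadow", "suggest", "act_and_report", "full_autonomy"]  # Tier 0..3

# Promotion thresholds from the graduation protocol above.
PROMOTE = {
    "shadow": lambda m: m["shadow_accuracy"] >= 0.95,       # 14-day shadow accuracy
    "suggest": lambda m: m["override_rate"] < 0.05,         # <5% override rate
    "act_and_report": lambda m: m["override_rate"] < 0.02,  # <2% over 30 days
}

def next_tier(tier: str, metrics: dict) -> str:
    """Promote one tier when the metric clears its bar; demote on the
    kill switch (3 consecutive overrides) or a >3% weekly override rate."""
    if metrics.get("consecutive_overrides", 0) >= 3:
        return "suggest"  # kill switch: back to Tier 1
    if tier == "full_autonomy":
        return "act_and_report" if metrics["override_rate"] > 0.03 else tier
    i = TIERS.index(tier)
    return TIERS[i + 1] if PROMOTE[tier](metrics) else tier

print(next_tier("shadow", {"shadow_accuracy": 0.97}))       # → suggest
print(next_tier("full_autonomy", {"override_rate": 0.04}))  # → act_and_report
```

Driving tier changes through one function like this also gives the audit trail a single place to record every promotion and demotion.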
Step 4: The Learning Loop
The Decision Engine is not a one-time build. It is a living system that gets smarter as you use it.
Every time you make a Novel decision, the system records it. After 5 Novel decisions of the same type with consistent logic, the Decision Auditor proposes a new pattern. After 10, the Rule Writer drafts a rule. After 30, the rule enters shadow mode automatically.
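The 5/10/30 thresholds amount to a counter per decision type. A minimal sketch; the stage names are illustrative, not from the source:

```python
from collections import Counter

# Checked in descending order so a type reports only the stage it just reached.
THRESHOLDS = [(30, "shadow_mode"), (10, "draft_rule"), (5, "propose_pattern")]

novel_counts = Counter()

def record_novel_decision(decision_type: str):
    """Count consistent Novel decisions; return the stage a type just reached, if any."""
    novel_counts[decision_type] += 1
    for needed, stage in THRESHOLDS:
        if novel_counts[decision_type] == needed:
            return stage
    return None

stages = [record_novel_decision("pricing_exception") for _ in range(30)]
print([s for s in stages if s])  # → ['propose_pattern', 'draft_rule', 'shadow_mode']
```

A real implementation would also check that the recorded reasoning is consistent across those decisions, not just that the count was reached.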
This is the flywheel:
- You make decisions. The system watches and records.
- Patterns emerge. The auditor identifies them.
- Rules are drafted. The rule writer formalizes them.
- Rules graduate. Shadow → Suggest → Act-and-Report → Full Autonomy.
- Your decision load shrinks. You make fewer decisions, each one more important.
- The remaining decisions are harder. Which means they are more valuable for the system to learn from.
- Repeat.
After 6 months, the decisions that reach you are genuinely novel — the ones that require your unique context, intuition, or values. Everything else runs on rules you validated and the system maintains.
Prompt 3 — The Graduation Monitor
This agent runs weekly. It checks every active rule’s performance, proposes graduations and demotions, and identifies new pattern candidates from your Novel decisions.
You are a decision rule governance agent. You run weekly
to maintain the health of all active decision rules.
Input:
- active_rules: JSON array of all rules with their current
  tier, accuracy metrics, and audit logs
- human_decisions: all human decisions from the past 7 days
- overrides: any cases where a human overrode an
  autonomous decision
Produce a weekly governance report with these sections:
## RULE HEALTH
For each active rule:
- rule_id, current_tier, days_at_tier
- accuracy_this_week, accuracy_all_time
- decisions_made_this_week (autonomous vs escalated)
- override_count_this_week
- status: HEALTHY / WARNING / DEMOTE / EXPIRE
## GRADUATION CANDIDATES
Rules ready to move up a tier:
- rule_id, current_tier, proposed_tier
- evidence: accuracy %, override rate, sample size
- recommendation: GRADUATE or HOLD (with reasoning)
## DEMOTION TRIGGERS
Rules that should move down a tier:
- rule_id, current_tier, proposed_tier
- trigger: what went wrong (override spike, accuracy drop,
  distribution shift)
- recommendation: DEMOTE or INVESTIGATE
## EXPIRING RULES
Rules within 14 days of their 90-day expiration:
- rule_id, expiration_date
- current_accuracy vs original_accuracy
- recommendation: RENEW (accuracy held) or RETIRE (decayed)
- If RENEW: updated confidence thresholds based on recent
  data
## NEW PATTERN CANDIDATES
Novel decisions from the past 7 days that match existing
Novel decisions:
- proposed_pattern_name
- matching_decision_ids (must be >= 5)
- draft_rule in plain English
- consistency_score
- recommendation: DRAFT RULE or NEEDS MORE DATA
Format the entire report as structured JSON so it can be
consumed programmatically by the Decision Engine.
The Safety Architecture
Autonomous decisions require stronger safety guarantees than human decisions. When a human makes a bad decision, they notice immediately and correct it. When a rule makes a bad decision, it may not be caught until the weekly review — and by then, the damage may have compounded.
Your Decision Engine needs four safety layers:
| Layer | What It Does | When It Fires |
|---|---|---|
| Kill Switch | Demotes rule to Tier 1 after 3 consecutive overrides | Real-time, on every override |
| Drift Detector | Monitors input distribution. If inputs look different from training data, escalates. | On every decision |
| Impact Cap | Limits the blast radius of any single autonomous decision | Before action execution |
| Expiration | Forces re-validation every 90 days | On schedule |
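The kill switch from the table is just a consecutive-override counter that resets on any confirmed decision. A sketch; the class and attribute names are assumptions:

```python
class KillSwitch:
    """Demote a rule to Tier 1 after 3 consecutive human overrides."""
    LIMIT = 3

    def __init__(self):
        self.consecutive_overrides = 0
        self.tripped = False

    def record(self, overridden: bool) -> None:
        if overridden:
            self.consecutive_overrides += 1
            if self.consecutive_overrides >= self.LIMIT:
                self.tripped = True   # demote the rule back to escalation mode
        else:
            self.consecutive_overrides = 0   # any confirmed decision resets the streak

ks = KillSwitch()
for overridden in [True, True, False, True, True, True]:
    ks.record(overridden)
print(ks.tripped)  # → True: three consecutive overrides at the end
```

Note that the reset matters: two overrides followed by a confirmed decision is normal noise, not a failing rule.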
The Drift Detector deserves special attention. A rule trained on bull-market decisions will make bad decisions in a bear market. Not because the rule is wrong — because the world changed. The drift detector compares every new decision’s input features against the historical distribution. If any feature is more than 2 standard deviations from the training mean, the decision escalates regardless of confidence score.
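The 2-standard-deviation check is a per-feature z-score against training statistics. A sketch using only the standard library; it assumes each rule saved per-feature mean and standard deviation at training time:

```python
from statistics import mean, stdev

def fit_stats(training_rows: list) -> dict:
    """Per-feature (mean, stdev) computed from the rule's training inputs."""
    features = training_rows[0].keys()
    return {f: (mean(r[f] for r in training_rows),
                stdev(r[f] for r in training_rows)) for f in features}

def is_drifted(inputs: dict, stats: dict, z_limit: float = 2.0) -> bool:
    """Escalate if any feature sits more than z_limit standard deviations
    from the training mean, regardless of the rule's confidence score."""
    for feature, value in inputs.items():
        mu, sigma = stats[feature]
        if sigma > 0 and abs(value - mu) / sigma > z_limit:
            return True
    return False

training = [{"improvement": 0.05 + i * 0.001, "sample_size": 50 + i} for i in range(40)]
stats = fit_stats(training)
print(is_drifted({"improvement": 0.06, "sample_size": 60}, stats))  # in distribution
print(is_drifted({"improvement": 0.30, "sample_size": 60}, stats))  # drifted: escalate
```

A per-feature z-score ignores correlations between features, so it is a floor, not a ceiling; it still catches the bull-market-to-bear-market shifts described above.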
The Impact Cap is context-dependent. A rule that auto-approves a content edit has a small blast radius — the worst case is a bad paragraph that gets fixed next day. A rule that auto-approves a deployment has a large blast radius. Set impact caps proportional to the cost of being wrong, not the frequency of the decision.
What This Looks Like in Practice
After 90 days with the Decision Engine, a typical system looks like this:
Here is what your daily workflow looks like before and after:
| Activity | Before | After |
|---|---|---|
| Morning digest review | 45 min (read everything, decide everything) | 8 min (scan autonomous decisions, decide on 3–5 escalations) |
| Rule change approvals | 20 min (evaluate each proposal) | 0 min (graduated rules handle standard promotions) |
| Quality gate overrides | 15 min (review flagged outputs) | 5 min (only novel edge cases escalate) |
| System monitoring | 10 min (check dashboards manually) | 0 min (monitoring rules act autonomously, alert on anomaly) |
| Total daily time | 90 min | 13 min |
That is 77 minutes per day. 9 hours per week. 38 hours per month. Almost a full work week, every month, permanently freed.
Common Mistakes
1. Automating Novel decisions
If you cannot articulate why you made a decision — if it was intuition, gut feeling, or “I just knew” — it is a Novel decision. Do not write a rule for it. Let the system collect more examples. Intuition is often a pattern you have not consciously identified yet. After 20+ examples, the pattern may become clear. Or it may not, and it stays human-only. Both are fine.
2. Starting at Tier 2
The graduation protocol exists for a reason. Shadow mode catches logic errors. Suggest mode catches reasoning errors. Skipping tiers is how you end up with a rule that makes 50 bad decisions before you notice. Two weeks of shadow is cheap insurance.
3. Ignoring the drift detector
Your rules were trained on historical conditions. Markets shift. Customer behavior changes. Team priorities evolve. A rule that was 98% accurate last quarter may be 70% accurate this quarter because the world it was trained on no longer exists. The drift detector and the 90-day expiration are not bureaucracy — they are the immune system.
4. Too many rules at once
Start with 3. Graduate them fully. Learn from the process. Then add 3 more. If you write 20 rules on day one, you will spend all your time managing rules instead of making decisions. The goal is less human effort, not more.
5. No audit trail
If you cannot explain why the system made a decision, you do not have an autonomous system. You have a black box. When something goes wrong — and it will — the audit trail is the difference between a 5-minute fix and a week of debugging.
The Decision Engine Checklist
Print this. Check each box as you complete it.
Phase 1: Audit (Week 1)
- [ ] Decision log collected (30+ days of decisions with reasoning)
- [ ] Decision Auditor run (Prompt 1)
- [ ] Decisions classified: Mechanical, Pattern, Novel
- [ ] Top 3 automation candidates identified
- [ ] Consistency scores verified (all > 80%)
Phase 2: Rule Writing (Week 2)
- [ ] Rule Writer run for each of the top 3 candidates (Prompt 2)
- [ ] All 5 components present for each rule
- [ ] Confidence thresholds set (starting at 0.95)
- [ ] Kill switches wired
- [ ] Expiration dates set (90 days from today)
- [ ] Audit trail schema defined
Phase 3: Graduation (Weeks 3–10)
- [ ] All 3 rules in Shadow mode (Tier 0)
- [ ] 14-day shadow accuracy calculated
- [ ] Rules meeting 95%+ graduated to Suggest mode (Tier 1)
- [ ] Override rates tracked for 14 days
- [ ] Rules with <5% override graduated to Act-and-Report (Tier 2)
- [ ] 30-day post-action override rate measured
- [ ] Rules with <2% override graduated to Full Autonomy (Tier 3)
Phase 4: Governance (Ongoing)
- [ ] Graduation Monitor running weekly (Prompt 3)
- [ ] Drift detector active on all Tier 2+ rules
- [ ] Impact caps set for each rule
- [ ] 90-day re-validation calendar set
- [ ] Novel decision log reviewed monthly for new patterns
Try It This Week
Pull your decision log from the last 30 days. If you do not have one, start today — every time you approve, reject, or modify something your AI system suggests, write one line: what you decided and why. In 30 days, you will have enough data to run the Decision Auditor.
If you already have the log, run Prompt 1 now. Identify your top 3 automation candidates. Write the rules. Put them in shadow mode. In roughly 60 days (14 in shadow, 14 in suggest, 30 in act-and-report), those decisions will never reach you again.
The goal is not to remove yourself from the system. The goal is to ensure that every minute you spend on the system is spent on decisions that actually require you. Not the decisions you make on autopilot — the decisions that make a difference.