Issue #29

The Safety Gate: What a “Too Dangerous” AI Model Means for You

The AI Rundown 12 min read 3 prompts

On April 7, Anthropic announced it would not release its most powerful model to the public. Claude Mythos Preview — a model that can autonomously discover zero-day vulnerabilities, chain them into novel attack sequences, and find bugs that went undetected for decades — was deemed too dangerous for general access.

Three days later, Federal Reserve Chairman Powell and Treasury Secretary Bessent convened an emergency meeting with the CEOs of Citigroup, Bank of America, Wells Fargo, Morgan Stanley, and Goldman Sachs to discuss Mythos’s implications for financial infrastructure. A model’s mere existence triggered a government crisis response.

Instead of a public release, Anthropic launched Project Glasswing — restricted access to 11 companies (AWS, Apple, Google, Microsoft, CrowdStrike, and others) for defensive security work only. Up to $100 million in usage credits. Forty total organizations invited. No academic researchers. No public access.

This is the first major model to be withheld on safety grounds since OpenAI held back GPT-2 in 2019. And unlike GPT-2, where the “danger” was largely theoretical, Mythos has demonstrated capabilities that existing defenses cannot match.


Why This Matters for Every AI Builder

You are probably not building a model that discovers zero-day exploits. But the core question Anthropic faced — “Can this system do things we did not intend and cannot control?” — applies to every AI system you build.

| Anthropic’s Problem | Your Version |
| --- | --- |
| Model discovers vulnerabilities autonomously | Your agent takes actions you did not anticipate |
| Capabilities exceed what safety testing covered | Your system behaves differently in production than in dev |
| Access must be restricted to trusted parties | Your API/tool access must be scoped to what is actually needed |
| No kill switch once capabilities are in the wild | No rollback plan once your agent has taken real-world actions |

The principle: Anthropic did not just ask “Does it work?” They asked “What happens if it works too well?” That second question is the one most teams skip. These three prompts force you to answer it for your own systems.


Prompt 1 — The Capability Audit

Before Anthropic decided to withhold Mythos, they ran extensive red-team evaluations to understand what the model could actually do. Most teams never do this. They know what they built the system to do, but not what it can do.

You are a red team evaluator assessing an AI system's
actual capabilities vs. its intended capabilities.

SYSTEM DESCRIPTION:
[Describe your AI system: what it does, what tools it
has access to, what data it can read/write, and what
actions it can take in the real world.]

INTENDED CAPABILITIES:
[List what the system was designed to do. Be specific.]

Now conduct a capability audit:

## 1. CAPABILITY SURFACE MAPPING
For each tool, API, or data source the system can access:
- What is the INTENDED use?
- What UNINTENDED uses are possible with this access?
- What is the worst-case action the system could take?
- Rate: LOW / MEDIUM / HIGH / CRITICAL risk

Example: A system with email-send access intended for
notifications could also send phishing emails, exfiltrate
data via email body, or spam thousands of contacts.

## 2. COMPOSITION RISKS
List combinations of capabilities that create emergent
risks. Individual tools may be safe; combinations may not.
- Tool A + Tool B = what new capability?
- Data access X + Action Y = what risk?

## 3. SCOPE CREEP CHECK
For each capability rated MEDIUM or above:
- Does the system NEED this access to fulfill its purpose?
- Can the access be narrowed? (read-only, rate-limited,
  scoped to specific resources)
- What is the MINIMUM permission set required?

## 4. OUTPUT: THE CAPABILITY MAP
Produce a table:
| Capability | Intended Use | Unintended Risk | Severity | Minimum Permission | Current Permission | Gap |

Flag every row where Current Permission > Minimum.

What happens when you run this: Expect to find several capabilities, often three to five, that your system has but does not need. Most AI systems are over-permissioned: they were given broad access “just in case” rather than the minimum required. This audit tells you exactly where to tighten.

Pro tip: Run this audit every time you add a new tool or data source. Anthropic did not discover Mythos’s capabilities on launch day — they found them through systematic red-teaming over weeks. Your audit cadence should match your development cadence.
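
Once you have a capability map, the “Current Permission > Minimum” check is easy to automate. Here is a minimal sketch, assuming you record permissions as sets of strings. The `Capability` class, the permission names, and the support-bot example are all hypothetical; adapt them to however your system actually expresses access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    intended_use: str
    minimum: frozenset  # permissions the system actually needs
    current: frozenset  # permissions the system has today

    @property
    def gap(self) -> frozenset:
        """Permissions granted beyond the minimum required set."""
        return self.current - self.minimum


def over_permissioned(capabilities):
    """Return the capabilities whose current permissions exceed the minimum."""
    return [c for c in capabilities if c.gap]


# Hypothetical capability map for a support bot.
capability_map = [
    Capability(
        "email",
        "send notifications",
        minimum=frozenset({"send:templated"}),
        current=frozenset({"send:templated", "send:freeform", "read:inbox"}),
    ),
    Capability(
        "orders_db",
        "look up order status",
        minimum=frozenset({"read:orders"}),
        current=frozenset({"read:orders"}),
    ),
]

for cap in over_permissioned(capability_map):
    print(f"{cap.name}: remove {sorted(cap.gap)}")
```

Keeping the map in code means the gap check can run in CI whenever someone adds a tool, which fits the audit cadence described above.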


Prompt 2 — The Threat Model

Anthropic’s concern was not just what Mythos would do, but what bad actors could make it do. Your AI system faces the same question: what happens when someone with bad intent gets access?

You are a security architect building a threat model for
an AI-powered system. Think like an attacker. Your goal
is to find every way this system can be misused, abused,
or manipulated.

SYSTEM DESCRIPTION:
[Paste your system description from Prompt 1.]

CAPABILITY MAP:
[Paste your capability map output from Prompt 1.]

Build a threat model across these vectors:

## 1. PROMPT INJECTION / MANIPULATION
- Can users craft inputs that override system instructions?
- Can data from external sources (APIs, web, files) contain
  instructions the system will follow?
- What happens if the system's context window is poisoned
  with adversarial content?
Test: "If I put instructions in a customer support ticket,
would the AI follow them instead of its system prompt?"

## 2. DATA EXFILTRATION
- Can the system be tricked into revealing training data,
  system prompts, API keys, or PII from its context?
- Can outputs be crafted to leak information indirectly?
  (e.g., encoding data in response formatting)
Test: "Can a user get the system to include internal data
in an email, API response, or log file?"

## 3. PRIVILEGE ESCALATION
- Can the system be prompted to use tools beyond its
  intended scope?
- Can it be instructed to modify its own permissions
  or access controls?
Test: "Can a user get the system to call an admin API
by embedding the request in normal conversation?"

## 4. DENIAL OF SERVICE
- Can inputs cause the system to consume excessive
  resources (token limits, API calls, compute)?
- Can it be put into infinite loops or recursive calls?
Test: "What input would max out our API budget in
one request?"

## 5. REPUTATION / TRUST ATTACKS
- Can the system be made to produce outputs that damage
  the organization's reputation?
- Can it be tricked into making false claims, sending
  inappropriate content, or taking harmful actions?

## OUTPUT: THREAT MATRIX
| Vector | Attack Scenario | Likelihood | Impact | Priority |
For each HIGH priority threat, include a specific
mitigation recommendation.

What happens when you run this: The prompt injection vector alone will likely surface 2–3 attack paths you have not considered. Most AI systems trust their input data implicitly — if your system reads from a database, API, or file that a user can modify, you have a prompt injection surface.
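
The rule of thumb above (user-modifiable content that reaches the model is an injection surface) can be turned into a quick inventory. This is a rough sketch, not a detector: it only lists where to look. The `DataSource` fields and the example sources are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    user_writable: bool   # can a user or outside party change this content?
    enters_context: bool  # is the content placed into the model's context?


def injection_surfaces(sources):
    """A source is an injection surface when untrusted content reaches the model."""
    return [s.name for s in sources if s.user_writable and s.enters_context]


# Hypothetical inventory for a support assistant.
sources = [
    DataSource("system_prompt", user_writable=False, enters_context=True),
    DataSource("support_tickets", user_writable=True, enters_context=True),
    DataSource("product_docs", user_writable=False, enters_context=True),
    DataSource("uploaded_files", user_writable=True, enters_context=True),
    DataSource("audit_log", user_writable=True, enters_context=False),
]

print(injection_surfaces(sources))
```

Every name this prints deserves a row in your threat matrix under the prompt injection vector.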


The Glasswing Decision Framework

Here is the part that gets interesting. Anthropic had three options with Mythos:

  1. Release publicly — Maximum reach, maximum risk. Every hacker gets the tool.
  2. Withhold entirely — Maximum safety, zero value. Defenders cannot use it either.
  3. Controlled access — Glasswing. Restricted to vetted partners, defensive use only. Accept some risk for defensive value.

They chose option 3. And the reasoning applies directly to your own systems:

The question is never “Is it safe?”
It is: “For whom, under what constraints, is the risk acceptable?”

Your AI system has a Glasswing decision too. Which users get which capabilities? What actions require human approval? Where do you accept risk, and where is the cost of failure too high?
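
One way to make that decision explicit is a small policy table that maps user tiers to capabilities, with everything unlisted denied. The sketch below assumes three tiers and a handful of capabilities; the tier names, capability names, and the `decide` helper are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical policy: tier -> capability -> decision.
# Anything not listed is denied.
POLICY = {
    "public": {"search_docs": Decision.ALLOW},
    "customer": {
        "search_docs": Decision.ALLOW,
        "issue_refund": Decision.NEEDS_APPROVAL,
    },
    "internal": {
        "search_docs": Decision.ALLOW,
        "issue_refund": Decision.ALLOW,
        "delete_account": Decision.NEEDS_APPROVAL,
    },
}


def decide(tier: str, capability: str) -> Decision:
    """Look up the decision for a tier and capability, denying by default."""
    return POLICY.get(tier, {}).get(capability, Decision.DENY)


print(decide("customer", "issue_refund").value)
print(decide("public", "issue_refund").value)
```

The deny-by-default lookup is the important design choice: a new capability reaches nobody until someone deliberately adds it to a tier.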


Prompt 3 — The Kill Switch Protocol

Anthropic built an off-switch into Glasswing from day one. If a partner misuses Mythos, access is revoked instantly. Your AI system needs the same — not as an afterthought, but as architecture.

You are a systems architect designing safety controls
for an AI system that takes real-world actions. The goal:
ensure every action can be monitored, throttled, paused,
or reversed.

SYSTEM DESCRIPTION:
[Paste your system description.]

THREAT MATRIX:
[Paste your HIGH priority threats from Prompt 2.]

Design a kill switch protocol:

## 1. ACTION CLASSIFICATION
Categorize every action your system can take:

GREEN (auto-approve):
- Read-only operations
- Responses within known templates
- Actions with zero real-world side effects

YELLOW (log + throttle):
- Write operations within normal parameters
- External API calls within rate limits
- Actions that can be reversed within 24 hours

RED (require human approval):
- Irreversible actions (send email, delete data, charge card)
- Actions above a cost/impact threshold
- First-time actions the system has never taken before
- Any action flagged by the threat model

## 2. CIRCUIT BREAKERS
For each action category, define automatic shutoffs:
- Rate limit: max N actions per minute/hour/day
- Anomaly detection: if action pattern deviates from
  baseline by >X%, pause and alert
- Cost ceiling: if cumulative cost exceeds $Y, hard stop
- Scope fence: if action targets a resource outside the
  allowed list, block and log

## 3. ROLLBACK PROCEDURES
For every YELLOW and RED action:
- Can it be reversed? How? How fast?
- What is the blast radius if it cannot be reversed?
- Who gets notified and how quickly?

## 4. THE EMERGENCY STOP
Design a single command that:
- Immediately halts all in-progress actions
- Prevents new actions from starting
- Preserves full state for forensic analysis
- Notifies all stakeholders
- Can be triggered by: API call, dashboard button,
  Slack command, or dead man's switch (auto-triggers
  if heartbeat stops)

## OUTPUT: SAFETY ARCHITECTURE DOCUMENT
Produce a complete safety spec with action classification
table, circuit breaker thresholds, rollback procedures,
and emergency stop implementation plan.

What happens when you run this: You get a complete safety architecture for your system. The action classification alone is worth the exercise, because most teams have never explicitly decided which actions their AI can take autonomously and which require approval. Making that decision before an incident is the difference between a controlled response and a crisis.
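
To make the circuit breakers from the prompt concrete, here is a minimal in-memory sketch covering three of them: the rate limit, the cost ceiling, and the scope fence. It is an illustration under simplifying assumptions, not a production design. The `CircuitBreaker` class and its parameters are hypothetical, anomaly detection is left out, and the emergency stop here only blocks new actions (it does not halt in-progress work, preserve state, or notify anyone).

```python
import time
from collections import deque


class CircuitBreaker:
    """Blocks actions once a rate limit, cost ceiling, or scope fence is hit."""

    def __init__(self, max_per_minute, cost_ceiling, allowed_targets,
                 clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.cost_ceiling = cost_ceiling
        self.allowed_targets = set(allowed_targets)
        self.clock = clock
        self.recent = deque()  # timestamps of allowed actions in the last 60 s
        self.total_cost = 0.0
        self.stopped = False   # set by emergency_stop() or the cost ceiling

    def emergency_stop(self):
        """Prevent any new action from starting until the breaker is replaced."""
        self.stopped = True

    def check(self, target, cost):
        """Return (allowed, reason). Records the action only when allowed."""
        if self.stopped:
            return False, "stopped"
        if target not in self.allowed_targets:
            return False, "scope_fence"
        now = self.clock()
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()
        if len(self.recent) >= self.max_per_minute:
            return False, "rate_limit"
        if self.total_cost + cost > self.cost_ceiling:
            self.stopped = True  # the cost ceiling is a hard stop
            return False, "cost_ceiling"
        self.recent.append(now)
        self.total_cost += cost
        return True, "ok"


breaker = CircuitBreaker(max_per_minute=2, cost_ceiling=1.00,
                         allowed_targets={"crm"})
print(breaker.check("crm", 0.10))      # within all limits
print(breaker.check("billing", 0.10))  # target is outside the allowed list
```

Call `check` before every YELLOW or RED action and refuse to proceed unless it returns allowed. In a real deployment the counters would live in shared storage so that every worker sees the same limits.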

Pro tip: Start with everything classified RED and relax to YELLOW/GREEN over time as you build confidence. Anthropic started with “nobody gets access” and relaxed to 11 vetted partners. Your safety controls should follow the same trajectory: restrictive by default, permissive only with evidence.
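
The “RED by default” rule is simple to enforce in code: any action nobody has reviewed falls back to human approval, and relaxing a control requires a recorded reason. A minimal sketch follows; the `ActionRegistry` class, the action names, and the evidence string are hypothetical examples.

```python
from enum import Enum


class Level(Enum):
    GREEN = "auto-approve"
    YELLOW = "log + throttle"
    RED = "require human approval"


class ActionRegistry:
    """Every action is RED until someone records evidence for relaxing it."""

    def __init__(self):
        self._levels = {}
        self._evidence = {}

    def relax(self, action: str, level: Level, evidence: str) -> None:
        """Move an action to a less restrictive level, recording why."""
        if not evidence.strip():
            raise ValueError("relaxing a control requires evidence")
        self._levels[action] = level
        self._evidence[action] = evidence

    def classify(self, action: str) -> Level:
        """Unknown or unreviewed actions fall back to RED."""
        return self._levels.get(action, Level.RED)


registry = ActionRegistry()
registry.relax("read_order_status", Level.GREEN,
               "read-only, 30 days without incident")  # hypothetical evidence

print(registry.classify("read_order_status").name)
print(registry.classify("send_email").name)
```

Because `classify` never raises on an unknown action, a newly added tool is gated behind approval the moment it ships, even if nobody remembered to classify it.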


The Bigger Picture

The Mythos story is not really about one model. It is about a threshold we have crossed: AI systems are now capable enough that their creators sometimes choose not to release them.

The critics make fair points. TechCrunch asked whether Anthropic is protecting the internet or protecting its business moat. The answer is probably both. Controlled access creates enterprise contracts. “Too dangerous to release” is powerful marketing for an upcoming IPO.

But the capability is real. The emergency Fed meeting was real. And the lesson for builders is the same regardless of Anthropic’s motives: decide for whom, and under what constraints, the risk is acceptable before your system acts.

Issue #28 gave you a self-healing system. This issue gives you the safety gates to make sure it heals in the right direction.

Next Issue

The Agent Protocol: Building AI Systems That Coordinate

Anthropic also launched Managed Agents this week — infrastructure for deploying AI agents that work together in production. Next issue: three prompts to design multi-agent systems that stay coordinated, avoid conflicts, and degrade gracefully when one agent fails.
