On April 7, Anthropic announced it would not release its most powerful model to the public. Claude Mythos Preview — a model that can autonomously discover zero-day vulnerabilities, chain them into novel attack sequences, and find bugs that went undetected for decades — was deemed too dangerous for general access.
Three days later, Federal Reserve Chairman Powell and Treasury Secretary Bessent convened an emergency meeting with the CEOs of Citigroup, Bank of America, Wells Fargo, Morgan Stanley, and Goldman Sachs to discuss Mythos’s implications for financial infrastructure. A model’s mere existence triggered a government crisis response.
Instead of a public release, Anthropic launched Project Glasswing — restricted access to 11 companies (AWS, Apple, Google, Microsoft, CrowdStrike, and others) for defensive security work only. Up to $100 million in usage credits. Forty total organizations invited. No academic researchers. No public access.
This is the first major model to be withheld on safety grounds since OpenAI held back GPT-2 in 2019. And unlike GPT-2, where the “danger” was largely theoretical, Mythos has demonstrated capabilities that existing defenses cannot match.
Why This Matters for Every AI Builder
You are probably not building a model that discovers zero-day exploits. But the core question Anthropic faced — “Can this system do things we did not intend and cannot control?” — applies to every AI system you build.
| Anthropic’s Problem | Your Version |
|---|---|
| Model discovers vulnerabilities autonomously | Your agent takes actions you did not anticipate |
| Capabilities exceed what safety testing covered | Your system behaves differently in production than in dev |
| Access must be restricted to trusted parties | Your API/tool access must be scoped to what is actually needed |
| No kill switch once capabilities are in the wild | No rollback plan once your agent has taken real-world actions |
The principle: Anthropic did not just ask “Does it work?” They asked “What happens if it works too well?” That second question is the one most teams skip. These three prompts force you to answer it for your own systems.
Prompt 1 — The Capability Audit
Before Anthropic decided to withhold Mythos, they ran extensive red-team evaluations to understand what the model could actually do. Most teams never do this. They know what they built the system to do, but not what it can do.
    You are a red team evaluator assessing an AI system's actual capabilities vs. its intended capabilities.

    SYSTEM DESCRIPTION:
    [Describe your AI system: what it does, what tools it has access to, what data it can read/write, and what actions it can take in the real world.]

    INTENDED CAPABILITIES:
    [List what the system was designed to do. Be specific.]

    Now conduct a capability audit:

    ## 1. CAPABILITY SURFACE MAPPING
    For each tool, API, or data source the system can access:
    - What is the INTENDED use?
    - What UNINTENDED uses are possible with this access?
    - What is the worst-case action the system could take?
    - Rate: LOW / MEDIUM / HIGH / CRITICAL risk

    Example: A system with email-send access intended for notifications could also send phishing emails, exfiltrate data via email body, or spam thousands of contacts.

    ## 2. COMPOSITION RISKS
    List combinations of capabilities that create emergent risks. Individual tools may be safe; combinations may not.
    - Tool A + Tool B = what new capability?
    - Data access X + Action Y = what risk?

    ## 3. SCOPE CREEP CHECK
    For each capability rated MEDIUM or above:
    - Does the system NEED this access to fulfill its purpose?
    - Can the access be narrowed? (read-only, rate-limited, scoped to specific resources)
    - What is the MINIMUM permission set required?

    ## 4. OUTPUT: THE CAPABILITY MAP
    Produce a table:
    | Capability | Intended Use | Unintended Risk | Severity | Minimum Permission | Current Permission | Gap |

    Flag every row where Current Permission > Minimum.
What happens when you run this: You will likely find several capabilities, often three to five, that your system has but does not need. Most AI systems are over-permissioned: they are given broad access “just in case” rather than the minimum required. This audit tells you exactly where to tighten.
Pro tip: Run this audit every time you add a new tool or data source. Anthropic did not discover Mythos’s capabilities on launch day — they found them through systematic red-teaming over weeks. Your audit cadence should match your development cadence.
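Once the capability map exists, the last instruction in the prompt (flag every row where Current Permission > Minimum) is easy to check mechanically. Here is a minimal sketch, assuming permissions can be ranked on a simple ladder. The `Access` levels, the `CapabilityRow` type, and the tool names are all hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical permission ladder; a higher value means broader access."""
    NONE = 0
    READ = 1
    WRITE = 2
    ADMIN = 3


@dataclass(frozen=True)
class CapabilityRow:
    capability: str
    minimum: Access  # least access the system needs for its purpose
    current: Access  # access the system actually holds today


def over_permissioned(rows):
    """Return the rows where current access is broader than the minimum."""
    return [row for row in rows if row.current > row.minimum]


# Illustrative capability map; the tool names are made up for this example.
capability_map = [
    CapabilityRow("crm_database", minimum=Access.READ, current=Access.WRITE),
    CapabilityRow("email_send", minimum=Access.WRITE, current=Access.WRITE),
    CapabilityRow("billing_api", minimum=Access.NONE, current=Access.ADMIN),
]

for row in over_permissioned(capability_map):
    print(f"GAP: {row.capability} has {row.current.name}, needs {row.minimum.name}")
```

Real permission systems are rarely a single ladder, so treat the ordering as a simplification. The point is that the gap column can be computed instead of eyeballed.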
Prompt 2 — The Threat Model
Anthropic’s concern was not just what Mythos would do, but what bad actors could make it do. Your AI system faces the same question: what happens when someone with bad intent gets access?
    You are a security architect building a threat model for an AI-powered system. Think like an attacker. Your goal is to find every way this system can be misused, abused, or manipulated.

    SYSTEM DESCRIPTION:
    [Paste your system description from Prompt 1.]

    CAPABILITY MAP:
    [Paste your capability map output from Prompt 1.]

    Build a threat model across these vectors:

    ## 1. PROMPT INJECTION / MANIPULATION
    - Can users craft inputs that override system instructions?
    - Can data from external sources (APIs, web, files) contain instructions the system will follow?
    - What happens if the system's context window is poisoned with adversarial content?

    Test: "If I put instructions in a customer support ticket, would the AI follow them instead of its system prompt?"

    ## 2. DATA EXFILTRATION
    - Can the system be tricked into revealing training data, system prompts, API keys, or PII from its context?
    - Can outputs be crafted to leak information indirectly? (e.g., encoding data in response formatting)

    Test: "Can a user get the system to include internal data in an email, API response, or log file?"

    ## 3. PRIVILEGE ESCALATION
    - Can the system be prompted to use tools beyond its intended scope?
    - Can it be instructed to modify its own permissions or access controls?

    Test: "Can a user get the system to call an admin API by embedding the request in normal conversation?"

    ## 4. DENIAL OF SERVICE
    - Can inputs cause the system to consume excessive resources (token limits, API calls, compute)?
    - Can it be put into infinite loops or recursive calls?

    Test: "What input would max out our API budget in one request?"

    ## 5. REPUTATION / TRUST ATTACKS
    - Can the system be made to produce outputs that damage the organization's reputation?
    - Can it be tricked into making false claims, sending inappropriate content, or taking harmful actions?

    ## OUTPUT: THREAT MATRIX
    | Vector | Attack Scenario | Likelihood | Impact | Priority |

    For each HIGH priority threat, include a specific mitigation recommendation.
What happens when you run this: The prompt injection vector alone will likely surface 2–3 attack paths you have not considered. Most AI systems trust their input data implicitly — if your system reads from a database, API, or file that a user can modify, you have a prompt injection surface.
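The rule of thumb above (user-modifiable content that reaches the model is an injection surface) can be turned into a quick inventory check. This is a rough sketch, not a detection tool. The `DataSource` type, its two flags, and the source names are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    user_writable: bool   # can an end user or outside party change the content?
    reaches_prompt: bool  # does the content end up in the model's context?


def injection_surfaces(sources):
    """Return names of sources where untrusted content can reach the prompt."""
    return [s.name for s in sources if s.user_writable and s.reaches_prompt]


# Illustrative inventory; these source names are hypothetical.
sources = [
    DataSource("support_tickets", user_writable=True, reaches_prompt=True),
    DataSource("product_catalog", user_writable=False, reaches_prompt=True),
    DataSource("uploaded_files", user_writable=True, reaches_prompt=True),
    DataSource("audit_log", user_writable=True, reaches_prompt=False),
]

print(injection_surfaces(sources))  # ['support_tickets', 'uploaded_files']
```

Each name this returns is a place to paste the support-ticket test from the prompt and see what the system actually does.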
The Glasswing Decision Framework
Here is the part that gets interesting. Anthropic had three options with Mythos:
- Release publicly — Maximum reach, maximum risk. Every hacker gets the tool.
- Withhold entirely — Maximum safety, zero value. Defenders cannot use it either.
- Controlled access — Glasswing. Restricted to vetted partners, defensive use only. Accept some risk for defensive value.
They chose the third option, controlled access. The same reasoning applies directly to your own systems:
Your AI system has a Glasswing decision too. Which users get which capabilities? What actions require human approval? Where do you accept risk, and where is the cost of failure too high?
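One way to make those answers explicit is a small policy table that maps each user tier to the capabilities it may use, with everything else denied. The sketch below is only an illustration of that idea. The tier names, capability names, and the `decide` helper are hypothetical and do not describe how Glasswing is implemented.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical policy table: user tier -> capability -> decision.
# Anything not listed is denied, so new capabilities start closed.
POLICY = {
    "public": {
        "search_docs": Decision.ALLOW,
    },
    "vetted_partner": {
        "search_docs": Decision.ALLOW,
        "run_scan": Decision.ALLOW,
        "apply_patch": Decision.NEEDS_APPROVAL,
    },
}


def decide(tier, capability):
    """Look up the decision for a tier and capability, denying by default."""
    return POLICY.get(tier, {}).get(capability, Decision.DENY)


print(decide("public", "run_scan").value)             # deny
print(decide("vetted_partner", "apply_patch").value)  # needs_approval
```

Writing the table down forces the decision: every capability is either granted to a tier, gated behind a human, or unavailable.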
Prompt 3 — The Kill Switch Protocol
Anthropic built an off-switch into Glasswing from day one. If a partner misuses Mythos, access is revoked instantly. Your AI system needs the same — not as an afterthought, but as architecture.
    You are a systems architect designing safety controls for an AI system that takes real-world actions. The goal: ensure every action can be monitored, throttled, paused, or reversed.

    SYSTEM DESCRIPTION:
    [Paste your system description.]

    THREAT MATRIX:
    [Paste your HIGH priority threats from Prompt 2.]

    Design a kill switch protocol:

    ## 1. ACTION CLASSIFICATION
    Categorize every action your system can take:

    GREEN (auto-approve):
    - Read-only operations
    - Responses within known templates
    - Actions with zero real-world side effects

    YELLOW (log + throttle):
    - Write operations within normal parameters
    - External API calls within rate limits
    - Actions that can be reversed within 24 hours

    RED (require human approval):
    - Irreversible actions (send email, delete data, charge card)
    - Actions above a cost/impact threshold
    - First-time actions the system has never taken before
    - Any action flagged by the threat model

    ## 2. CIRCUIT BREAKERS
    For each action category, define automatic shutoffs:
    - Rate limit: max N actions per minute/hour/day
    - Anomaly detection: if action pattern deviates from baseline by >X%, pause and alert
    - Cost ceiling: if cumulative cost exceeds $Y, hard stop
    - Scope fence: if action targets a resource outside the allowed list, block and log

    ## 3. ROLLBACK PROCEDURES
    For every YELLOW and RED action:
    - Can it be reversed? How? How fast?
    - What is the blast radius if it cannot be reversed?
    - Who gets notified and how quickly?

    ## 4. THE EMERGENCY STOP
    Design a single command that:
    - Immediately halts all in-progress actions
    - Prevents new actions from starting
    - Preserves full state for forensic analysis
    - Notifies all stakeholders
    - Can be triggered by: API call, dashboard button, Slack command, or dead man's switch (auto-triggers if heartbeat stops)

    ## OUTPUT: SAFETY ARCHITECTURE DOCUMENT
    Produce a complete safety spec with action classification table, circuit breaker thresholds, rollback procedures, and emergency stop implementation plan.
What happens when you run this: You get a draft safety architecture for your system. The action classification alone is worth the exercise: most teams have never explicitly decided which actions their AI can take autonomously and which require approval. Making that decision before an incident is the difference between a controlled response and a crisis.
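The GREEN / YELLOW / RED classification can live in code as a lookup that every action passes through before it runs. Here is a minimal sketch under stated assumptions: the action names are hypothetical, unlisted actions fall back to RED as the most restrictive choice, and the logging and throttling that YELLOW actions need are left out to keep the example short.

```python
from enum import Enum


class Tier(Enum):
    GREEN = "auto-approve"
    YELLOW = "log + throttle"
    RED = "require human approval"


# Hypothetical action names mapped to tiers.
ACTION_TIERS = {
    "read_record": Tier.GREEN,
    "update_record": Tier.YELLOW,
    "send_email": Tier.RED,
    "delete_data": Tier.RED,
}


def classify(action):
    """Unlisted actions fall back to RED, the most restrictive tier."""
    return ACTION_TIERS.get(action, Tier.RED)


def may_run(action, human_approved=False):
    """GREEN and YELLOW run automatically; RED runs only with approval."""
    if classify(action) is Tier.RED:
        return human_approved
    return True


print(may_run("read_record"))                      # True
print(may_run("send_email"))                       # False
print(may_run("send_email", human_approved=True))  # True
print(may_run("never_seen_before"))                # False
```

The fallback matters more than the table: an action nobody classified should wait for a human rather than slip through.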
Pro tip: Start with everything classified RED and relax to YELLOW/GREEN over time as you build confidence. Anthropic started with “nobody gets access” and relaxed to 11 vetted partners. Your safety controls should follow the same trajectory: restrictive by default, permissive only with evidence.
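Two of the circuit breakers from the prompt (the rate limit and the cost ceiling) plus a basic emergency stop fit in one small class. Treat this as a sketch rather than a production design. The `CircuitBreaker` class, its parameters, and the thresholds are assumptions, and it leaves out anomaly detection, state preservation, and stakeholder notification.

```python
import time
from collections import deque


class CircuitBreaker:
    """Blocks actions on a rate limit, a cost ceiling, or a manual stop."""

    def __init__(self, max_actions, window_seconds, cost_ceiling, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.cost_ceiling = cost_ceiling
        self.clock = clock  # injectable so the breaker can be tested without waiting
        self.timestamps = deque()
        self.total_cost = 0.0
        self.stopped = False

    def emergency_stop(self):
        """Refuse all further actions; nothing in this class clears the flag."""
        self.stopped = True

    def allow(self, cost):
        """Return True and record the action if it fits within every limit."""
        if self.stopped:
            return False
        now = self.clock()
        # Drop timestamps that have aged out of the rate-limit window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            return False  # rate limit: temporary, clears as the window slides
        if self.total_cost + cost > self.cost_ceiling:
            self.stopped = True  # cost ceiling: a hard stop, not a throttle
            return False
        self.timestamps.append(now)
        self.total_cost += cost
        return True


breaker = CircuitBreaker(max_actions=2, window_seconds=60, cost_ceiling=5.0)
print(breaker.allow(1.0))  # True
print(breaker.allow(1.0))  # True
print(breaker.allow(1.0))  # False: third action inside the 60-second window
breaker.emergency_stop()
print(breaker.allow(0.0))  # False: stopped until a human intervenes
```

The rate limit recovers on its own while the cost ceiling and the emergency stop do not. That asymmetry is deliberate here: a spending overrun or a manual stop should require a person to decide what happens next.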
The Bigger Picture
The Mythos story is not really about one model. It is about a threshold we have crossed: AI systems are now capable enough that their creators sometimes choose not to release them.
The critics make fair points. TechCrunch asked whether Anthropic is protecting the internet or protecting its business moat. The answer is probably both. Controlled access creates enterprise contracts. “Too dangerous to release” is powerful marketing for an upcoming IPO.
But the capability is real. The emergency Fed meeting was real. And the lesson for builders is the same regardless of Anthropic’s motives:
- Know what your system can do — not just what you designed it to do (Prompt 1)
- Know how it can be misused — think like an attacker, not a builder (Prompt 2)
- Build the off-switch before you need it — not after the incident (Prompt 3)
Issue #28 gave you a self-healing system. This issue gives you the safety gates to make sure it heals in the right direction.