Project Glasswing and Mythos

This article describes how Cloudflare used Anthropic's Mythos Preview, a security-specialized LLM, to identify and fix potential vulnerabilities in its own systems. It explains how Mythos Preview differs from existing models, especially in exploit-chain construction and proof generation, and discusses problems such as the model's spontaneous refusals and the difficulty of managing false positives. Ultimately, Cloudflare argues that effective vulnerability discovery requires more than the LLM alone: it requires a systematic framework, or harness, for finding and managing meaningful security vulnerabilities in large codebases.

1. Mythos Preview Opens a New Frontier for Security LLMs

Over the past several months, Cloudflare tested several security-specialized LLMs against its own infrastructure to find and fix potential vulnerabilities. The model that drew the most attention was Anthropic's Mythos Preview. As part of Project Glasswing, Cloudflare applied Mythos Preview to more than 50 internal repositories and closely observed its performance and behavior.

Cloudflare describes Mythos Preview as a tool in a different category from ordinary frontier models, not merely an incremental improvement. It performed far more sophisticated work than previous models could. Two capabilities stood out.

Exploit chain construction: Real attacks often do not exploit a single bug. They connect several smaller attack primitives into a working exploit. For example, a use-after-free bug may be turned into arbitrary read/write primitives, control flow may be hijacked, and a ROP chain may be used to gain full control. Mythos Preview could combine these primitives and produce working proofs. The reasoning looked less like an automated scanner and more like the work of an experienced researcher.
Proof generation: Finding a bug and proving that it is exploitable are different tasks. Mythos Preview can do both. The model writes code to trigger a suspected bug, compiles it in a scratch environment, and runs it. If the program behaves as predicted, that becomes the proof. If it does not, the model reads the failure, adjusts the hypothesis, and tries again. This iterative process is as important as the bugs it finds. A suspicious flaw without a working proof is merely a guess, but Mythos Preview can close that gap itself.
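
The proof-generation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cloudflare's or Anthropic's implementation: the toy target function `item_at`, the candidate scripts, and the helper names `run_in_scratch` and `prove` are all hypothetical, and a fixed list of candidates stands in for the model revising its hypothesis after each failure.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Toy target (hypothetical): the bounds check forgets negative indexes.
TARGET = '''
def item_at(items, i):
    if i >= len(items):
        return None
    return items[i]
'''

# Successive hypotheses about how to trigger the bug. In a real loop the
# model would write the next one after reading the previous failure.
CANDIDATES = [
    TARGET + 'print("PROOF" if item_at([1, 2, 3], 99) is not None else "no effect")\n',
    TARGET + 'print("PROOF" if item_at([1, 2, 3], -1) is not None else "no effect")\n',
]


def run_in_scratch(source, timeout=10.0):
    """Write a candidate script into a fresh scratch directory and run it."""
    with tempfile.TemporaryDirectory() as scratch:
        path = Path(scratch) / "candidate.py"
        path.write_text(source)
        return subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, timeout=timeout, cwd=scratch,
        )


def prove(candidates, predicted_marker):
    """Run candidates in order until one behaves as predicted.

    Returns (index of the proving candidate or None, observed failures so far).
    """
    failures = []
    for attempt, source in enumerate(candidates):
        result = run_in_scratch(source)
        observed = result.stdout + result.stderr
        if predicted_marker in observed:
            return attempt, failures
        failures.append(observed.strip())  # what the next hypothesis would learn from
    return None, failures


if __name__ == "__main__":
    attempt, failures = prove(CANDIDATES, "PROOF")
    print("proved by candidate", attempt, "after failures:", failures)
```

The point of the sketch is the control flow: a finding only counts once a run matches the prediction, and each failed run is kept as input for the next attempt.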

Other frontier models also found many bugs in the same tests, and some performed better than expected in reasoning. But they struggled to connect the pieces into a working chain. They could identify interesting bugs and write thoughtful explanations of why those bugs mattered, but often stopped before completing an exploit chain and left exploitability open. Mythos Preview was different. It could connect traditionally low-severity bugs that might have sat in a backlog into a single, more serious exploit chain.

2. Model Refusals During Legitimate Vulnerability Research

The Mythos Preview model provided as part of Project Glasswing did not include the additional safety layers applied to generally available models such as Opus 4.7 or GPT-5.5. Even so, the model sometimes refused certain requests on its own. The same emergent guardrails that make cyber capabilities safer also sometimes blocked legitimate security research.

These spontaneous refusals were not consistent. The same task, framed differently or presented in a different context, could produce a completely different result.

"For example, the model initially refused vulnerability research on a particular project, but after an unrelated change to the project environment, it agreed to perform the same research on the same code. The code being analyzed had not changed."

In another case, the model found and confirmed several serious memory bugs in a codebase but refused to write a demonstration exploit. When the same request was framed differently, however, the answer changed. Because of the probabilistic nature of the model, even semantically identical requests could produce different results across runs. The same task could lead to opposite outcomes depending on when and how it was presented.

This matters. The model's emergent refusals and safety behaviors are real, but they are not consistent enough to serve as a complete safety boundary. For this reason, any capable cyber frontier model that becomes broadly available in the future will need additional safety mechanisms on top of these baseline behaviors, especially outside controlled research environments such as Project Glasswing.

3. Solving the Signal-to-Noise Problem

One of the hardest parts of vulnerability triage is deciding which bugs are real, exploitable, and urgent enough to fix now. This was already difficult before AI. AI-based vulnerability scanners and AI-generated code make it even more complicated. Cloudflare built several follow-up validation steps to address the problem.

Two major factors affect the noise rate.

Programming language: Languages such as C and C++ allow direct memory control and create bug classes such as buffer overflows and out-of-bounds reads and writes. Memory-safe languages such as Rust eliminate many of these bugs at compile time. With more candidate bug classes for a model to speculate about, projects written in memory-unsafe languages consistently produced more false positives in Cloudflare's runs.
Model bias: A capable human researcher tells you what they found and how confident they are. A model does not behave the same way. If you ask it to find bugs, it finds bugs whether or not the code really contains them. The results are full of language like "maybe," "potentially," and "theoretically possible." Those ambiguous findings far outnumber definite ones. As an exploration bias, this may be reasonable. In a triage queue, it is painful because every speculative finding consumes human attention and tokens to dismiss, and the cost compounds across thousands of findings.

Mythos Preview showed a major improvement in generating proofs of concept, especially proofs that combine multiple vulnerabilities. A finding that comes with a PoC is immediately actionable because the team does not waste time asking whether it is real.
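
To make the triage cost concrete, here is a small hypothetical sketch of ordering a queue so that PoC-backed findings come first and heavily hedged ones come last. The `Finding` fields, the hedge list (taken from the phrases quoted above), and the ordering rule are assumptions for illustration; the source does not say Cloudflare ranks findings this way.

```python
import re
from dataclasses import dataclass

# Hedge phrases quoted in the text; a real list would likely be longer.
HEDGES = ("maybe", "potentially", "theoretically possible")


@dataclass
class Finding:
    title: str
    description: str
    has_poc: bool


def hedge_count(text):
    """Count occurrences of hedge phrases, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in HEDGES
    )


def triage_order(findings):
    """PoC-backed findings first, then the least hedged speculative ones."""
    return sorted(findings, key=lambda f: (not f.has_poc, hedge_count(f.description)))


if __name__ == "__main__":
    queue = [
        Finding("A", "Maybe an overflow, potentially reachable", has_poc=False),
        Finding("B", "Heap overflow in parser; PoC crashes the process", has_poc=True),
        Finding("C", "Theoretically possible race in cache refresh", has_poc=False),
    ]
    for finding in triage_order(queue):
        print(finding.title, finding.has_poc, hedge_count(finding.description))
```

A keyword count is a crude proxy for confidence. It is shown only to illustrate why a finding with a working PoC needs no such guesswork.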

Cloudflare's system is intentionally tuned to report a lot so it finds more and misses less. That also creates noise. But at triage time, Mythos Preview's output was clearly higher quality: fewer ambiguous findings, clearer reproduction steps, and much less effort required to decide whether to fix or dismiss something.

4. Why General Coding Agents Are Poor Fits for Repository Analysis

When Cloudflare first began AI-assisted vulnerability research last year, the intuitive approach was to put a general coding agent into an arbitrary repository and ask it to find vulnerabilities. This produced results, but it did not meaningfully cover real codebases or identify valuable findings. There were two main reasons.

Context: Coding agents are tuned for workflows such as feature building, bug fixing, and refactoring. They absorb a lot of source code, keep one hypothesis in mind, and iterate on it. That is not a good fit for vulnerability research, which is narrow and parallel by nature. Human researchers pick a specific target and investigate it deeply. The target might be one complex feature, a transition across a security boundary, or a particular vulnerability class such as command injection. Then the same pattern is repeated thousands of times across different features, boundaries, or vulnerability classes. A single agent session over hundreds of thousands of lines of code, even with subagents, may cover only about 0.1% of the surface usefully before the model's context window fills, compression begins, and important early findings are discarded.
Throughput: A single-stream agent does one thing at a time. Real codebases require many hypotheses across many components simultaneously, and interesting findings need follow-up work. You can push a single agent harder, but at some point the limitation is not the model itself; it is the shape of the interaction. Using a model directly in a coding-agent interface is useful for manual investigation when a researcher already has a lead and wants a second opinion. It is not the right tool for high coverage. Once Cloudflare accepted that, it stopped asking Mythos Preview to do the wrong job and started building a harness instead.

5. How the Harness Solves the Problem

Running large-scale work taught Cloudflare four important lessons, all pointing to the need for a harness that manages the overall process.

Narrow scope produces better results: If you tell a model, "Find vulnerabilities in this repository," it wanders. If you say, "Find command injection in this specific function. There is this trust boundary above it, this architecture document, and this prior coverage of the area," the model behaves much more like a real researcher. Narrowing the scope is key.
Adversarial review reduces noise: Adding a second agent between the initial finding and the queue, with a different prompt, a different model, and no ability to generate new findings, catches a lot of noise that the first agent misses when reviewing its own work. Deliberately creating disagreement between agents works better than simply telling one agent to be careful.
Splitting chains across agents improves reasoning: "Is this code buggy?" and "Can an attacker actually reach this bug from outside the system?" are different questions. Each question is narrower than the combined version, so the model performs better when asked separately. Decomposing complex problems and assigning each part to a specialized step works better.
Many narrow parallel tasks beat one comprehensive agent: Coverage improves when many agents work on narrowly scoped questions and the results are deduplicated later. That is far more effective than asking one agent to find everything comprehensively.
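
The lessons above can be sketched together: many narrow tasks, run concurrently, followed by a reviewer that can only reject. This is a minimal illustration, not Cloudflare's harness. `HuntTask`, `Finding`, `stub_hunter`, and `stub_reviewer` are hypothetical names, and the stubs stand in for model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class HuntTask:
    attack_class: str  # one attack class per task
    scope: str         # one function or component per task


@dataclass(frozen=True)
class Finding:
    task: HuntTask
    claim: str


def build_tasks(attack_classes, scopes):
    """Cross every attack class with every scope to get many narrow tasks."""
    return [HuntTask(a, s) for a, s in product(attack_classes, scopes)]


def run_hunts(tasks, hunter, max_workers=50):
    """Run hunters concurrently and flatten their findings in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_task = list(pool.map(hunter, tasks))
    return [finding for findings in per_task for finding in findings]


def adversarial_review(findings, can_disprove):
    """Keep only findings the reviewer fails to disprove.

    The reviewer is a pure filter: it can reject findings but cannot add any.
    """
    return [f for f in findings if not can_disprove(f)]


# Stand-ins for model calls (hypothetical).
def stub_hunter(task):
    if task.attack_class == "command injection" and task.scope == "run_backup()":
        return [Finding(task, "user-controlled path reaches a shell command")]
    if task.attack_class == "path traversal":
        return [Finding(task, "maybe a traversal, input source unclear")]
    return []


def stub_reviewer(finding):
    # Disprove anything the hunter itself hedged on.
    return "maybe" in finding.claim


if __name__ == "__main__":
    tasks = build_tasks(
        ["command injection", "path traversal"],
        ["run_backup()", "render_page()"],
    )
    raw = run_hunts(tasks, stub_hunter)
    confirmed = adversarial_review(raw, stub_reviewer)
    print(len(tasks), "tasks,", len(raw), "raw findings,", len(confirmed), "confirmed")
```

Giving the reviewer a different prompt or model, as the text recommends, would happen inside the `can_disprove` callable. The structural guarantee shown here is only that review can shrink the queue and never grow it.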

All of these observations are about model behavior. Together they show that a chat interface is no longer enough: what is needed is a harness that scopes the work, runs it in parallel, and validates the output. Cloudflare used Mythos Preview to build and improve its existing harness around the model's strengths.

5.1. Cloudflare's Vulnerability Discovery Harness

Cloudflare's harness for scanning real code across its runtime, edge data path, protocol stacks, control plane, and dependent open-source projects works in the following stages.

Stage	Role	Why It Matters
Recon	An agent reads the repository top to bottom, delegates work to subagents for each subsystem, and produces architecture documents containing build commands, trust boundaries, entry points, and expected attack surfaces. It also creates the initial task queue for the next stage.	Gives all downstream agents shared context and reduces wandering.
Hunt	Each task combines one attack class with scope hints. Hunters run concurrently, often around 50 at a time, and each delegates to a few exploration subagents. Each hunter can compile and run PoC code in a task-specific scratch directory.	This is where most of the work happens: many narrow tasks in parallel rather than one comprehensive agent.
Validate	An independent agent rereads the code and tries to disprove the original finding. It uses a different prompt and cannot generate new findings.	Catches a significant amount of noise that hunters miss when reviewing themselves.
Gapfill	Hunters mark areas they touched but did not cover thoroughly, and those areas are queued for another pass.	Counteracts the model's tendency to drift toward attack classes where it has already succeeded.
Dedupe	Findings sharing the same root cause are consolidated into one record.	Variant analysis becomes a feature rather than a way to inflate the queue with duplicates.
Trace	For each confirmed finding in a shared library, tracing agents, one per consumer repository, use cross-repository symbol indexes to determine whether attacker-controlled input can actually reach the bug from outside the system.	Turns "there is a flaw" into "there is a reachable vulnerability." This is the most important stage.
Feedback	Reachable traces become new hunt tasks in the consumer repository where the bug is exposed.	Closes the loop. The pipeline improves as it runs.
Report	Agents write structured reports according to a predefined schema, fix schema-validation errors themselves, and submit the reports to a collection API.	Output becomes queryable data, not free-form prose.
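
The Dedupe, Trace, and Feedback stages in the table can be illustrated with a small sketch. It rests on a strong simplifying assumption: reachability is modeled as a breadth-first search over a hand-written call graph, whereas the real stage uses tracing agents and cross-repository symbol indexes and must also reason about attacker-controlled data. All repository, function, and field names are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    root_cause: str   # e.g. the vulnerable function in a shared library
    description: str


def dedupe(findings):
    """Consolidate findings that share a root cause into one record each."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.root_cause].append(finding.description)
    return dict(grouped)


def reachable(call_graph, entry_points, target):
    """Breadth-first search from external entry points to the target function.

    Returns one call path as a list, or None if the target is unreachable.
    """
    queue = deque([entry] for entry in entry_points)
    seen = set(entry_points)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for callee in call_graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None


def feedback_tasks(consumers, target):
    """Turn reachable traces into new hunt tasks, one per exposed consumer repo."""
    tasks = []
    for repo, (call_graph, entry_points) in consumers.items():
        path = reachable(call_graph, entry_points, target)
        if path is not None:
            tasks.append({"repo": repo, "scope": path[0], "trace": path})
    return tasks


if __name__ == "__main__":
    findings = [
        Finding("libimg.decode_header", "overflow on very wide image"),
        Finding("libimg.decode_header", "overflow on very tall image"),
        Finding("libimg.free_ctx", "double free on error path"),
    ]
    consumers = {
        "upload-service": (
            {"handle_upload": ["make_thumbnail"],
             "make_thumbnail": ["libimg.decode_header"]},
            ["handle_upload"],
        ),
        "batch-tool": ({"main": ["load_config"]}, ["main"]),
    }
    print(dedupe(findings))
    print(feedback_tasks(consumers, "libimg.decode_header"))
```

In this toy data, two variants collapse into one record, and only the consumer whose entry point actually reaches the flawed function generates a follow-up hunt task.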

6. Implications for Security Teams

The biggest reaction Cloudflare heard from other security leaders about Mythos Preview was about speed: scan faster, patch faster, and shorten response cycles. Several teams Cloudflare spoke with now operate a two-hour SLA from CVE disclosure to production patch. If attacker timelines get shorter, defender timelines must shorten too. That makes sense. But Cloudflare believes speed alone will not be enough, and many teams will learn this the hard way through time, effort, and cost.

Patching faster does not change the shape of the pipeline that creates patches. If regression testing takes a day, you cannot reach a two-hour SLA without skipping it. And if you skip regression testing, the bugs you introduce tend to be worse than the bug you meant to patch. Cloudflare saw something similar when it allowed models to write their own patches. Some patches fixed the original bug while quietly breaking something else the code depended on.
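
The arithmetic behind that argument can be shown directly. In the sketch below, only the one-day regression run and the two-hour SLA come from the text; the other stage names and durations are hypothetical placeholders.

```python
from datetime import timedelta


def sla_check(stages, sla):
    """Return (total duration, whether it fits the SLA, stages that alone exceed it)."""
    total = sum(stages.values(), timedelta())
    blockers = [name for name, duration in stages.items() if duration > sla]
    return total, total <= sla, blockers


SLA = timedelta(hours=2)

# Hypothetical pipeline; only the regression-test duration is from the text.
PIPELINE = {
    "write patch": timedelta(minutes=30),
    "regression tests": timedelta(days=1),
    "deploy": timedelta(minutes=20),
}

if __name__ == "__main__":
    total, fits, blockers = sla_check(PIPELINE, SLA)
    print("full pipeline:", total, "fits SLA:", fits, "blockers:", blockers)

    # Skipping regression tests is the only way this pipeline fits,
    # which is exactly the trade-off the text warns against.
    skipped = {k: v for k, v in PIPELINE.items() if k != "regression tests"}
    print("without regression tests:", sla_check(skipped, SLA))
```

No amount of speeding up the other stages changes the result while one stage is longer than the whole SLA, which is why the shape of the pipeline matters more than the pace of patch writing.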

The harder question is what the architecture around the vulnerability should look like. The principle is to make bugs difficult for attackers to exploit even when they exist. That makes the gap between vulnerability disclosure and patch deployment less dangerous. This means defense systems in front of applications that block access to the bug. It also means designing applications so a flaw in one area does not give an attacker access to another. And it means being able to deploy fixes everywhere code runs at the same time, rather than waiting for individual teams to roll them out.

This issue cuts both ways. The same capabilities that helped Cloudflare find bugs in its own code could, in the wrong hands, accelerate attacks against applications across the internet. Cloudflare sits in front of millions of applications, and the architectural principles described above are exactly the principles its products are built to apply on behalf of customers. Cloudflare says it will share more about what this means for customers in the coming weeks.

Closing

Through Project Glasswing, Cloudflare confirmed that Anthropic's Mythos Preview is more than a simple vulnerability-search tool. It can show complex reasoning and proof-generation capabilities that resemble the work of skilled researchers. At the same time, the project exposed challenges that still need to be solved, including inconsistent spontaneous refusals and noise management.

The central lesson is that the performance of the model itself is not enough. A systematic harness is essential for finding and managing meaningful vulnerabilities in large codebases. Cloudflare's harness consists of Recon, Hunt, Validate, Gapfill, Dedupe, Trace, Feedback, and Report. Each stage is designed to maximize the model's strengths and compensate for its weaknesses.

The experience also suggests that security teams should think beyond simply patching faster. They should invest in architectural defenses that make exploitation difficult even when vulnerabilities exist. AI will accelerate both attack and defense, so security teams need to rethink their strategies for this new era. Cloudflare plans to use these lessons to provide stronger security solutions for customers and to continue sharing research toward a safer digital environment.
