GPT-5.5 vs Claude Mythos: What the AISI Cybersecurity Numbers Actually Tell You (2026)

(Image: Two AI cybersecurity models represented as server towers with glowing data streams in a dark server room)

Quick Answer: Both models completed AISI's 32-step corporate network attack simulation — Mythos in 3 of 10 attempts, GPT-5.5 in 2 of 10. On narrow cyber tasks, GPT-5.5 edges slightly ahead. On real-world code issue resolution (SWE-bench Pro), Claude Opus 4.7 leads by nearly 6 points. Neither model can yet autonomously develop zero-day exploits in hardened production systems.

Blocking Mythos from 70 more organizations does not solve the problem it is trying to solve. That is the uncomfortable math of this moment.

In April 2026, the UK AI Security Institute published evaluations of both Claude Mythos Preview and GPT-5.5 against identical tests. The results confirmed what most people in the field suspected but did not want to say loudly: this capability is not specific to one lab, one model, or one controlled program. It is a trend. And GPT-5.5 is already broadly available to paying subscribers right now, while Anthropic and the White House argue over 70 more access slots.

Here is what the verified numbers show, what they do not show, and why the framing of this debate keeps missing the actual question.

The Test That Started Everything: "The Last Ones"

AISI built a cyber range called "The Last Ones." Thirty-two steps. A simulated corporate network spanning four subnets and roughly twenty hosts. The agent starts with no credentials, no privileges. It has to chain together reconnaissance, credential theft, lateral movement across Active Directory forests, a CI/CD supply-chain pivot, and finally extract a protected internal database. AISI estimates a human expert would need about 20 hours to complete this end-to-end.

Claude Mythos Preview was the first model to finish it. AISI's published evaluation shows Mythos completed the full chain in 3 out of 10 attempts. GPT-5.5 completed it in 2 out of 10. The gap is one attempt out of ten.

That is not a wide margin. It is close enough that calling either model definitively superior on this specific test is not supported by the data.
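That claim can be sanity-checked with a standard Fisher exact test. This is a sketch of my own, not an analysis AISI published: with only ten attempts per model, a 3/10 vs 2/10 split is statistically indistinguishable from chance.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for a 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    at least as extreme (as improbable) as the observed one.
    """
    n = a + b + c + d          # total attempts across both models (20)
    row1 = a + b               # attempts by model 1 (10)
    col1 = a + c               # total successes across both models (5)
    denom = comb(n, row1)

    def p_table(k):
        # hypergeometric probability of k successes landing in row 1
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p_table(k) for k in range(lo, hi + 1)
               if p_table(k) <= p_obs + 1e-12)

# Mythos: 3 successes, 7 failures; GPT-5.5: 2 successes, 8 failures
p = fisher_exact_two_sided(3, 7, 2, 8)
print(round(p, 3))  # 1.0 — no evidence of a real difference at n=10
```

A p-value this high means the observed gap is exactly what random variation would produce even if the two models were identical on this task.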

Important context: AISI explicitly states that both ranges lack active defenders, defensive tooling, and alert penalties found in real-world environments. These are capability evaluations in controlled, sandboxed conditions. AISI does not claim to know how either model would perform against a genuinely hardened production target.

The Narrow Task Numbers

Beyond the network simulation, AISI runs 95 narrow cyber tasks across four difficulty tiers covering reverse engineering, web exploitation, and cryptography. These are built in capture-the-flag format. On this suite, GPT-5.5 scored the highest of any model AISI has evaluated, though AISI noted the performance is within the margin of error relative to Mythos.

Third-party data from XBOW's vulnerability miss-rate benchmark adds more texture. GPT-5 missed 40% of known CVEs in open-source applications frozen at vulnerable versions. Claude Opus 4.6 brought that to 18%. GPT-5.5 reached 10%. That is a meaningful progression, not an incremental one.

On CyberGym benchmarks, GPT-5.5 scored 81.8%. Mythos Preview scored 68.6% on AISI's expert-level cyber tasks. But these benchmarks test different things under different conditions. Treating them as a simple leaderboard comparison is not accurate.

| Metric | Claude Mythos Preview | GPT-5.5 |
| --- | --- | --- |
| TLO Completion (AISI) | 3/10 attempts | 2/10 attempts |
| AISI Narrow Cyber Tasks (pass@5) | 68.6% | 90.5% (strongest overall) |
| XBOW CVE Miss Rate | 18% (Opus 4.6) | 10% |
| SWE-bench Pro (real code issues) | 64.3% (Opus 4.7) | 58.6% |
| Preparedness Classification | Not publicly disclosed | High (below Critical) |
| Public Access | ~50 organizations only | Broadly available now |

Sources: AISI evaluations (April/May 2026), XBOW vulnerability benchmark, OpenAI GPT-5.5 System Card, Vellum.ai analysis. Data verified May 2026.
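For context on the pass@5 figure in the table: pass@k is the probability that at least one of k sampled attempts at a task succeeds, usually computed with the standard unbiased estimator. A minimal sketch, with illustrative inputs (10 attempts, 3 successes) rather than AISI's actual raw counts:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    Probability that at least one of k attempts drawn without replacement
    from n total attempts succeeds, given that c of the n succeeded.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 10 attempts per task, 3 of which succeeded
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

The takeaway is that pass@5 is an optimistic aggregate: a model that succeeds on a minority of attempts can still post a high pass@5, which is worth keeping in mind when reading the 90.5% figure.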

Where Mythos Actually Leads

Cyber benchmarks are not the only data that matters. On SWE-bench Pro, which measures resolution of real GitHub issues, Claude Opus 4.7 scores 64.3% against GPT-5.5's 58.6%. A nearly 6-point gap on production-relevant work. For teams building actual security tools and autonomous agents that resolve real code problems, that number matters more than a CTF pass rate.

Mythos also found a 27-year-old vulnerability in OpenBSD, a system that had been reviewed by human experts for decades. That specific finding drew more attention than any benchmark score. One real discovery in a production-relevant system carries more signal than a hundred CTF completions.

GPT-5.5 solved a reverse engineering challenge in 10 minutes and 22 seconds using approximately $1.73 in API costs — a task AISI estimated would take a human expert around 12 hours. The cost curve collapsing matters as much as the capability ceiling.
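The collapse is easy to quantify. The arithmetic below uses a $150/hour expert rate, which is my assumption for illustration, not a figure from AISI or OpenAI:

```python
# Hypothetical cost comparison; the hourly rate is an assumption.
api_cost = 1.73                 # reported GPT-5.5 API spend (USD)
api_minutes = 10 + 22 / 60      # 10 minutes 22 seconds
human_hours = 12                # AISI's human-expert estimate
hourly_rate = 150               # assumed, for illustration only

human_cost = human_hours * hourly_rate
print(f"cost ratio: ~{human_cost / api_cost:.0f}x cheaper")   # ~1040x
print(f"time ratio: ~{human_hours * 60 / api_minutes:.0f}x faster")  # ~69x
```

Even if the assumed rate is off by a factor of two in either direction, the ratio stays in the hundreds.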

The Access War: What the White House Move Actually Does

Anthropic launched Mythos on April 7, 2026, under Project Glasswing. The initial access list was around 50 organizations including Apple, Microsoft, Google, Amazon Web Services, JPMorgan Chase, and Nvidia. Anthropic then proposed adding roughly 70 more, bringing the total to approximately 120. The White House said no.

Two stated reasons: national security concerns about misuse, and worry that Anthropic does not have sufficient compute to serve 120 organizations while also prioritizing government access. Anthropic disputes the compute limitation, pointing to recent deals with Amazon, Google, and Broadcom. Those buildouts take time to come online.

There is also a separate story running underneath this one. The Pentagon designated Anthropic a supply-chain risk in March 2026. The NSA, which falls under the Department of Defense, is reportedly running Mythos on classified networks anyway. Dario Amodei met with White House Chief of Staff Susie Wiles and Treasury Secretary Scott Bessent on April 17. Both sides described the meeting as productive. The situation is more tangled than any single access decision.

Meanwhile, GPT-5.5 is live. OpenAI's Trusted Access for Cyber program expands Mythos-level capabilities to thousands of verified defenders with fewer restrictions. The model that the White House is trying to contain in Anthropic's hands is functionally matched by a model that is already broadly available from a different lab.

That is the tension. You cannot access-restrict your way out of a capability that is now two labs deep into production.

The Safeguard Problem

AISI's evaluation of GPT-5.5 found something worth dwelling on. During red-teaming, evaluators identified a universal jailbreak that bypassed the model's cyber safeguards across every malicious query OpenAI provided, including in multi-turn agentic settings. It took six hours of expert red-teaming to develop.

OpenAI made several updates after AISI reported this. But a configuration issue in the version provided for final verification meant AISI could not confirm the effectiveness of the patches. That is not a knock at OpenAI specifically; no lab's safeguards are airtight at this capability level. The patches raised the annoyance level for attackers. They did not close the door.

Mythos is not clean on this front either. Anthropic investigated unauthorized access to the model obtained through a private online forum. A controlled-access model limited to 50 trusted organizations still ended up being used by people who should not have had access.

The weakest link in both cases was not the model. It was the human and organizational layer around it. That is always where it breaks.

My Take

The GPT-5.5 vs Mythos debate is mostly a distraction from the actual question, which is: what happens when these capabilities are in models that any mid-range developer can run locally?

AISI's numbers are genuinely useful — they give the clearest public picture we have of where these models actually sit. But AISI also notes that inference compute improves performance. Throw more hardware at GPT-5.5 and the pass rate goes up. These are not fixed capability ceilings. They are snapshots.

The White House blocking 70 more Mythos access slots while GPT-5.5 ships to paying subscribers is the kind of decision that looks decisive and does very little. Access control is not the same as capability control. Those are different problems. The second one does not have a clean policy solution yet.

What This Is Not

Neither model creates vulnerabilities. The 27-year-old OpenBSD bug existed for 27 years before Mythos found it. GPT-5.5 did not invent the CVEs it located. These models are expensive, fast microscopes for a vulnerability landscape that was already there. That framing matters because "AI found a zero-day" headlines consistently imply a new attack surface that did not exist before. It did exist. The cost to find it just dropped.

Neither model has reached OpenAI's "Critical" capability threshold: autonomously developing functional zero-day exploits in hardened real-world production systems without human intervention. GPT-5.5's system card is explicit on this. Mythos has not published equivalent threshold comparisons publicly.

The AISI cyber ranges also lack active defenders. A model scoring 3/10 on a static simulated network is a different problem than a model scoring 3/10 against a target with active monitoring, incident response, and real-time defensive tooling. We do not yet have public data on the second scenario.

If you want to read more about where AI capabilities are heading at the architectural level, the analysis of xAI's Grok 5 and the 10 trillion parameter plan covers the scale trajectory that sits behind these benchmark conversations.

FAQ

Did GPT-5.5 beat Claude Mythos at cybersecurity?

On AISI's narrow cyber tasks, GPT-5.5 scored highest of any tested model (pass@5 of 90.5%). On AISI's corporate network simulation, Mythos completed it in 3/10 attempts versus GPT-5.5's 2/10. On real code issue resolution (SWE-bench Pro), Claude Opus 4.7 leads GPT-5.5 by 5.7 points. "Beating" depends on which metric you are asking about.

What is Project Glasswing?

Anthropic's controlled-access program for Mythos Preview. Around 50 organizations including Microsoft, Google, Apple, Amazon Web Services, JPMorgan Chase, and Nvidia were given access to use the model primarily for defensive purposes — scanning their own infrastructure for vulnerabilities before adversaries do.

Why did the White House block Mythos access expansion?

Two publicly stated reasons: national security concerns about misuse with wider access, and concern that Anthropic lacks sufficient compute to serve 120 organizations without degrading government access to the model. Anthropic disputes the compute argument. The political dimension involving the Pentagon's supply-chain risk designation also plays into this.

Can these models actually hack real systems?

AISI explicitly states it cannot draw conclusions about real-world hardened targets from these evaluations. The test ranges lack active defenders and real-time alerts. GPT-5.5's system card confirms it cannot autonomously develop functional zero-day exploits in hardened production systems — that is OpenAI's "Critical" threshold, which GPT-5.5 has not crossed.

What is AISI's "The Last Ones" test?

A 32-step corporate network attack simulation built with SpecterOps. It models a full enterprise intrusion across four subnets and roughly twenty hosts, requiring reconnaissance, credential theft, lateral movement across Active Directory forests, a CI/CD supply-chain pivot, and database exfiltration. AISI estimates it would take a human expert about 20 hours. Both Mythos and GPT-5.5 have completed it end-to-end.

Is GPT-5.5 publicly available right now?

Yes. GPT-5.5 is available to paying subscribers. OpenAI's Trusted Access for Cyber program extends additional cyber-permissive capabilities to verified defenders. Mythos Preview remains restricted to Project Glasswing organizations only and has no announced public release date.

The honest caveat at the end of all of this: both AISI and OpenAI acknowledge these evaluations are early snapshots. AISI is building additional ranges specifically designed to test models against hardened targets with active defensive tooling. The results from those tests, whenever published, will tell a different and more complete story than what we have now. The current benchmarks show capability that is real, measurable, and accelerating. They do not yet show what these models can do when the other side is actually fighting back.


About the Author

Vinod Pandey writes analysis on AI models, costs, and capabilities at revolutioninai.com. Every article is built from publicly verifiable data — benchmarks, system cards, and official evaluations. No fabricated testing claims.
