The strongest model doesn't automatically win. That's what Microsoft just demonstrated.
Anthropic's Mythos is their most powerful model — so powerful it isn't publicly available. OpenAI's GPT-5.5 is their current flagship. Microsoft used neither. Instead, they stitched together publicly available models into a structured pipeline of more than 100 specialized agents, and that system just topped the CyberGym benchmark leaderboard. By five points.
This isn't a benchmark story. It's a systems design story. Here's exactly how MDASH works — and what it means for the AI companies racing to build the world's most powerful single model.
What Is MDASH?
MDASH stands for Multi-Model Agentic Scanning Harness. It was built by Microsoft's Autonomous Code Security (ACS) team, several of whom were part of Team Atlanta, the group that took first place in DARPA's $29.5 million AI Cyber Challenge by building autonomous systems that could find and patch vulnerabilities in real software.
The core idea: don't ask one model to do everything. Break the problem into stages, assign specialized agents to each stage, and let disagreement between agents become a signal rather than a failure. When an auditor flags something and a debater can't refute it — that finding's credibility goes up.
MDASH is model-agnostic by design. When a better model is released, Microsoft swaps it in with a configuration change. The pipeline, plugins, calibrations, and domain-specific context all carry forward. The model is one input. The system is the asset.
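Microsoft hasn't published what that configuration looks like, so the following is only a minimal sketch of the idea in Python: the pipeline stays fixed while the stage-to-model mapping is data. Every name in it (PipelineConfig, stage_models, upgrade, the model identifiers) is hypothetical, and the stage names anticipate the pipeline described in the next section.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    """Hypothetical sketch only: models are configuration, the pipeline is the asset."""
    # Map each stage to whichever model currently performs best there.
    stage_models: dict[str, str] = field(default_factory=lambda: {
        "scan":     "frontier-model-a",       # reasoning-heavy auditing
        "validate": "frontier-model-b",       # independent debater from a different vendor
        "dedup":    "distilled-model-small",  # high-volume, cheap comparisons
        "prove":    "frontier-model-a",       # exploit construction
    })

def upgrade(config: PipelineConfig, stage: str, new_model: str) -> PipelineConfig:
    """Swap a newly released model into one stage; prompts, plugins, and
    calibrations attached to that stage carry forward unchanged."""
    config.stage_models[stage] = new_model
    return config

config = upgrade(PipelineConfig(), "scan", "frontier-model-a-next")
print(config.stage_models["scan"])  # -> frontier-model-a-next
```

Nothing downstream of the mapping has to change when the model does; that is the whole argument in one line of config.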
The Five-Stage Pipeline
MDASH doesn't just point a model at code and ask for bugs. It runs code through a structured sequence where each stage has a different job.
Stage 1: Prepare. The system ingests source code, builds language-aware indexes, and analyzes past commits to map attack surfaces and threat models. Before any agent looks for bugs, MDASH knows where to look.
Stage 2: Scan. Specialized auditor agents examine candidate code paths and generate findings — with hypotheses and supporting evidence. These aren't pattern matches. They're reasoned assessments of whether a code path could be exploitable.
Stage 3: Validate. A second set of agents — the debaters — argues against each finding. Can this actually be reached? Is it truly exploitable? If the debater can't punch holes in an auditor's case, the finding survives. Frontier models handle the heavy reasoning here. Distilled, faster models handle high-volume verification work.
Stage 4: Dedup. Semantically equivalent findings get collapsed. If three auditors flagged the same underlying issue through different code paths, that's one finding — not three.
Stage 5: Prove. The system constructs and executes actual inputs that trigger the bug. Not theoretical. Working proof-of-concept exploits. This is where MDASH stops being a scanner and becomes something closer to an automated offensive researcher.
Different models run at different stages. One state-of-the-art model handles reasoning-heavy tasks. A completely separate model acts as an independent counterpoint in the validation stage. The disagreement between them is a feature, not a problem.
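None of MDASH's code is public, so the outline below is only a schematic of that scan-validate-dedup-prove flow as described, written in Python with stubbed model calls. Every function and class name (Finding, audit, refute, build_poc, run_pipeline) is hypothetical, and the stubs return nothing useful; the point is the control flow, not the agents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    root_cause: str   # normalized description of the underlying issue
    code_path: str    # where the auditor believes it is reachable
    evidence: str     # the auditor's supporting argument

# Stubs standing in for model calls; a real system would invoke LLMs here.
def audit(code_paths, auditor_model):
    """Stage 2: an auditor emits findings with hypotheses and evidence."""
    return []

def refute(finding, debater_model):
    """Stage 3: an independently sourced model argues the finding is unreachable
    or not exploitable. True means it produced a concrete refutation."""
    return False

def build_poc(finding, prover_model):
    """Stage 5: attempt to construct an input that actually triggers the bug."""
    return None

def run_pipeline(code_paths, auditor_models, debater_model, prover_model):
    # code_paths is assumed to come out of Stage 1 (prepare: indexing,
    # commit history analysis, threat modeling).

    # Stage 2: scan with many specialized auditors.
    raw = [f for m in auditor_models for f in audit(code_paths, m)]

    # Stage 3: disagreement as signal. A finding survives only if the
    # independent debater fails to refute it.
    survivors = [f for f in raw if not refute(f, debater_model)]

    # Stage 4: collapse semantically equivalent findings; three auditors
    # flagging one underlying issue should yield one finding.
    deduped = list({f.root_cause: f for f in survivors}.values())

    # Stage 5: keep only findings the system can demonstrate with a working input.
    proven = []
    for f in deduped:
        poc = build_poc(f, prover_model)
        if poc is not None:
            proven.append((f, poc))
    return proven
```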
The Numbers That Matter
Three benchmark results — each harder to ignore than the last.
CyberGym (public): MDASH scored 88.45%. Mythos Preview came in second at 83.1%, GPT-5.5 at 81.8%. The benchmark, developed at UC Berkeley and published at ICLR 2026, contains 1,507 real-world vulnerability reproduction tasks from 188 open-source projects. Given a description of a known vulnerability and the unpatched code, the system must produce working attack code that triggers it.
Historical recall (internal): Microsoft ran MDASH against pre-patch snapshots of two heavily reviewed Windows components to see if it would rediscover bugs that were later confirmed by the Microsoft Security Response Center. For clfs.sys, 96% recall across 28 MSRC cases spanning five years. For tcpip.sys, 100% recall across 7 cases spanning five years. These are bugs that real attackers exploited. That a system recovers 96% of them in one of the most-reviewed kernel components in Windows is a significant claim.
Private driver test (StorageDrive): Microsoft planted 21 vulnerabilities into a private, never-published Windows driver — ensuring the models had never seen the code during training. MDASH found all 21 with zero false positives.
Two Bugs That Required More Than One Model
The benchmark scores are one thing. Two specific vulnerabilities show why a multi-agent architecture was actually necessary to find them.
CVE-2026-33827 — tcpip.sys (use-after-free). The code releases a piece of memory, then tries to access it again. The problem is that the two operations are separated by layers of validation logic. Reading through normally, nothing looks broken. The same operation is handled correctly elsewhere in the codebase — so to catch this, an agent needs to compare patterns across multiple files and notice the inconsistency. One model examining one function won't connect those dots. A team of agents where some hunt patterns and others compare across the whole codebase — that finds it. The vulnerability is reachable remotely without authentication via crafted IPv4 packets.
CVE-2026-33824 — IKEEXT service (double-free). Scattered across six files. The bug occurs during a shallow copy in network packet reassembly: the top-level structure is copied, but the pointers inside it still reference the same underlying buffer, so two parts of the code both believe they own that memory. When both try to release it, the same allocation is freed twice. IKEEXT runs as LocalSystem, the highest privilege level on Windows. Triggering this takes just two specially crafted UDP packets. No timing tricks, no complex setup. Because the bug spans six files, no single-file analysis catches it. Tracking memory ownership across that spread requires the coordination that a multi-agent system provides.
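Neither CVE's source is public and MDASH's internals aren't either, so as a toy illustration of the cross-file bookkeeping both bugs demanded, here is a hypothetical coordinator step in Python: per-file agents report ownership claims for an object, and the coordinator flags any object that more than one location believes it is responsible for freeing. All names and the toy data are invented.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class OwnershipClaim:
    object_id: str   # an allocation tracked across files, e.g. a reassembly buffer
    file: str        # where a per-file agent observed the claim
    action: str      # "allocates", "copies_pointer", or "frees"

def find_double_release(claims):
    """Hypothetical coordination step: group per-file claims by object and
    flag any object that two or more locations believe they may free."""
    frees = defaultdict(set)
    for claim in claims:
        if claim.action == "frees":
            frees[claim.object_id].add(claim.file)
    return {obj: files for obj, files in frees.items() if len(files) > 1}

# Invented example, loosely shaped like the shallow-copy scenario above:
# after the copy, two different files each think they own the same buffer.
claims = [
    OwnershipClaim("reassembly_buf", "reassemble.c", "allocates"),
    OwnershipClaim("reassembly_buf", "copy_state.c", "copies_pointer"),
    OwnershipClaim("reassembly_buf", "copy_state.c", "frees"),
    OwnershipClaim("reassembly_buf", "teardown.c", "frees"),
]
print(find_double_release(claims))  # flags 'reassembly_buf' as freed in two places
```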
Both CVEs were patched in May 2026 Patch Tuesday alongside 14 others MDASH found in tcpip.sys, ikeext.dll, http.sys, dnsapi.dll, netlogon.dll, and telnet.exe.
Where MDASH Failed
The system missed about 12% of CyberGym tasks. When Microsoft analyzed the failures, two patterns emerged.
Among findings that targeted the wrong code area, 82% came from tasks with vague vulnerability descriptions that lacked specific function or file identifiers. Description quality directly affects scan accuracy — MDASH is only as precise as the information it's given to start with.
There were also cases where the agent built proof-of-concept inputs in libFuzzer format when the task required Honggfuzz format. The underlying reasoning was sound. The output format was wrong. The logic worked; the packaging didn't.
Neither failure type is trivial, but both are fixable through better tooling rather than a better base model.
My Take
The interesting tension here is that MDASH depends on the very companies it just beat. It uses Anthropic's and OpenAI's models to outperform their single-model security approaches. If those companies stopped training stronger models, MDASH's ceiling stops rising too.
Microsoft is essentially arguing that the engineering layer on top of models is where durable advantage lives — not in the models themselves. That's a bet on infrastructure over intelligence. The results so far support it. Whether it holds as frontier models get dramatically stronger is a different question.
What's unambiguous: the gap between "we have the best model" and "we have the best system" just became visible.
What This Means for Model Companies
Anthropic poured resources into making Mythos the strongest single model available for security work. OpenAI did the same with GPT-5.5. Microsoft didn't compete at that layer. They built a system that uses both companies' models and outperforms both companies' single-model security approaches at the application layer.
Microsoft's VP of Agentic Security, Taesoo Kim, framed it directly: the durable advantage lies in the agentic system around the model, not in any single model itself. For platform companies that don't own frontier training infrastructure, this is meaningful. For model companies, it's a reminder that leading on raw capability doesn't automatically translate into leading in deployment.
The same architecture is available to attackers. MDASH uses publicly available models. There are no exclusive technical barriers. A well-resourced adversary can build a similar pipeline. The gap between defender speed and attacker speed just narrowed — in both directions.
Current Status
MDASH is being used by Microsoft's internal security engineering teams and tested with a limited number of external customers in a private preview. No general availability timeline or pricing has been announced as of May 2026.
The team behind it includes members from Microsoft's Autonomous Code Security group, Microsoft Offensive Research and Security Engineering (MORSE), and Microsoft Windows Attack Research and Protection (WARP).
Key Takeaways
- MDASH scored 88.45% on CyberGym — roughly 5 points above Anthropic's Mythos (83.1%) and OpenAI's GPT-5.5 (81.8%)
- The system uses 100+ specialized agents across a five-stage pipeline: Prepare, Scan, Validate, Dedup, Prove
- It found 16 real Windows vulnerabilities including 4 critical RCE flaws, patched in May 2026 Patch Tuesday
- 100% recall on historical tcpip.sys MSRC cases; 96% on clfs.sys — both over five years of confirmed, exploited bugs
- Model-agnostic: new models can be swapped in via configuration, preserving all engineering assets
- MDASH uses publicly available models — no exclusive technical barrier prevents adversaries from building a similar system
Frequently Asked Questions
What does MDASH stand for?
Multi-Model Agentic Scanning Harness. It's Microsoft's codename for their AI-powered vulnerability discovery system, built by the Autonomous Code Security team.
Which AI models does MDASH use?
Microsoft hasn't specified exact models publicly, but the system uses "generally available" frontier models alongside distilled smaller models — sourced from other companies including Anthropic and OpenAI. The architecture is model-agnostic, meaning models can be swapped out as new ones release.
What is the CyberGym benchmark?
CyberGym is a public benchmark developed by UC Berkeley researchers and published at ICLR 2026. It contains 1,507 real-world vulnerability reproduction tasks from 188 OSS-Fuzz projects. A system is given a known vulnerability description and unpatched code, then must produce working attack code that triggers the bug. Scores on the leaderboard are self-reported.
Can MDASH be used by companies outside Microsoft?
As of May 2026, MDASH is in limited private preview with a small set of external customers. No general availability date or pricing has been announced. Microsoft has opened a sign-up for the private preview on its Security Blog.
Does MDASH replace human security researchers?
Based on Microsoft's framing, no — the system is positioned as a tool that assists and scales the work of security engineering teams, not one that operates fully autonomously without human review. The findings it generates still go through Microsoft's standard security response process before patches are issued.
What is Anthropic's Project Glasswing?
Project Glasswing is Anthropic's cybersecurity initiative focused on using Mythos — their most capable, non-publicly-available model — to scan codebases, validate findings, and suggest patches. Mythos is accessible through Glasswing partners on an exclusive basis. On the CyberGym benchmark, Mythos Preview scored 83.1%, placing second behind MDASH.
Microsoft will share more about MDASH as the private preview expands. Worth watching: whether the system's performance holds as it moves from heavily-reviewed Windows internals to more diverse external codebases — that's a different problem than what CyberGym measures.