Microsoft’s Multi-Model Agentic Security System Sets New AI-Powered Benchmark

Key Takeaways

  • Microsoft’s new agentic security system, codenamed MDASH, uses an ensemble of more than 100 specialized AI agents to discover, debate, and prove exploitable bugs end‑to‑end.
  • In a private test, MDASH found all 21 planted vulnerabilities with zero false positives. On five years of confirmed MSRC cases, it achieved 96% recall in clfs.sys and 100% recall in tcpip.sys.
  • The system contributed to the discovery of 16 new CVEs in the latest Patch Tuesday, including four Critical remote‑code‑execution flaws in the Windows kernel TCP/IP stack and IKEv2 service.
  • MDASH scored 88.45% on the public CyberGym benchmark, leading the leaderboard by roughly five points, demonstrating that the surrounding agentic architecture adds substantial value beyond raw model capability.
  • The harness is model‑agnostic; improvements in underlying AI models can be adopted by simply updating a configuration, preserving prior investments in plugins, scopes, and calibrations.

Overview of MDASH
Microsoft has announced a major advance in AI‑powered cyber defense: an agentic security system codenamed MDASH (Microsoft Security multi‑model agentic scanning harness). Developed by the Autonomous Code Security (ACS) team, MDASH orchestrates more than 100 specialized AI agents across an ensemble of frontier and distilled models to discover, debate, and prove exploitable bugs, from source code through to a validated proof of concept. Unlike single‑model approaches, the harness treats the model as just one input; the true value lies in the structured pipeline that surrounds it. The system is already used by Microsoft security engineering teams and is in limited private preview with select customers.

Architecture and Agentic Design
MDASH operates as a staged pipeline:

  • Prepare ingests the target, builds language‑aware indices, and draws attack surfaces from commit history.
  • Scan runs auditor agents that emit candidate findings with hypotheses and evidence.
  • Validate employs debater agents that argue for and against each finding’s reachability and exploitability.
  • Dedup collapses semantically equivalent findings.
  • Prove constructs and executes triggering inputs where the bug class permits, using tools such as ASan to confirm the vulnerability.

Three core properties make this work in practice: an ensemble of diverse models managed by the harness, specialized agents with distinct roles (auditor, debater, prover), and an extensible plugin architecture that lets domain experts inject Microsoft‑specific knowledge such as kernel calling conventions, IRP rules, and lock invariants.
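
As a rough illustration of how the Scan, Validate, Dedup, and Prove stages could chain together, here is a minimal Python sketch. Every name in it (Finding, scan, validate, dedup, prove, and the toy auditors, debaters, and prover) is hypothetical and not taken from MDASH. The Prepare stage is omitted, a simple majority vote stands in for agent debate, and an exact‑key match stands in for semantic deduplication.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    """A candidate bug emitted by an auditor (all fields are illustrative)."""
    location: str
    bug_class: str
    hypothesis: str
    votes_for: int = 0
    votes_against: int = 0
    proven: bool = False


Auditor = Callable[[str], List[Finding]]
Debater = Callable[[Finding], bool]
Prover = Callable[[Finding], bool]


def scan(auditors: List[Auditor], target: str) -> List[Finding]:
    """Scan: every auditor emits its own candidate findings."""
    findings: List[Finding] = []
    for auditor in auditors:
        findings.extend(auditor(target))
    return findings


def validate(findings: List[Finding], debaters: List[Debater]) -> List[Finding]:
    """Validate: keep findings that a strict majority of debaters support."""
    kept = []
    for f in findings:
        for debater in debaters:
            if debater(f):
                f.votes_for += 1
            else:
                f.votes_against += 1
        if f.votes_for > f.votes_against:
            kept.append(f)
    return kept


def dedup(findings: List[Finding]) -> List[Finding]:
    """Dedup: collapse findings that share a (location, bug_class) key."""
    unique: Dict[tuple, Finding] = {}
    for f in findings:
        unique.setdefault((f.location, f.bug_class), f)
    return list(unique.values())


def prove(findings: List[Finding], provers: Dict[str, Prover]) -> List[Finding]:
    """Prove: run a prover only where one exists for the bug class."""
    for f in findings:
        prover = provers.get(f.bug_class)
        if prover is not None:
            f.proven = prover(f)
    return findings


def run_pipeline(target, auditors, debaters, provers) -> List[Finding]:
    return prove(dedup(validate(scan(auditors, target), debaters)), provers)


# Toy agents, invented for this example only.
def auditor_a(target: str) -> List[Finding]:
    return [
        Finding("parse.c:read_hdr", "heap-oob", "length field is unchecked"),
        Finding("util.c:log", "format-string", "format may be user-controlled"),
    ]


def auditor_b(target: str) -> List[Finding]:
    return [Finding("parse.c:read_hdr", "heap-oob", "missing bounds check")]


def debater_reachable(f: Finding) -> bool:
    return f.location.startswith("parse.c")


def debater_exploitable(f: Finding) -> bool:
    return f.bug_class == "heap-oob"


results = run_pipeline(
    "demo-target",
    [auditor_a, auditor_b],
    [debater_reachable, debater_exploitable],
    {"heap-oob": lambda f: True},  # stand-in for building a real trigger
)
for f in results:
    print(f.location, f.bug_class, "proven" if f.proven else "unproven")
```

In this toy run, two auditors report the same heap issue and one reports a second issue that both debaters reject, so a single proven finding survives. The point is the shape of the flow: each stage narrows or strengthens the set of findings before the next stage sees it.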

Evaluation on StorageDrive
To establish a baseline free from training‑data leakage, researchers ran MDASH on StorageDrive, a private device driver used in Microsoft interviews that contains 21 deliberately injected vulnerabilities never seen by any language model. Using the default configuration, the harness identified all 21 ground‑truth bugs with zero false positives. This result suggests that MDASH’s reasoning and vulnerability‑discovery capabilities can approach those of professional offensive researchers on wholly unseen code.

Patch Tuesday Findings
Applying MDASH to the Windows networking and authentication stack yielded the 16 CVEs disclosed in the latest Patch Tuesday. Ten of the flaws reside in kernel mode and six in user mode; the majority are reachable from a network position without credentials. Notable Critical remote‑code‑execution vulnerabilities include CVE-2026-33827 (a use‑after‑free in tcpip.sys via SSRR processing), CVE-2026-33824 (an unauthenticated IKEv2 IKE_SA_INIT double‑free leading to LocalSystem RCE), CVE-2026-41089 (a stack overflow in netlogon.dll), and CVE-2026-41096 (a heap out‑of‑bounds access in dnsapi.dll). The full list also contains several Important‑severity denial‑of‑service, information‑disclosure, and security‑feature‑bypass bugs across tcpip.sys, ikeext.dll, telnet.exe, http.sys, and related components.

Deep Dive: CVE-2026-33827
This Critical flaw resides in the Windows IPv4 receive path (tcpip.sys). A reference‑counted Path object is released prematurely in Ipv4pReceiveRoutingHeader but is used again later during Strict Source and Record Route (SSRR) processing. Because the object’s reference count can reach zero at the release point, the memory may be reclaimed by a per‑processor lookaside allocator, turning the later access into a kernel‑mode use‑after‑free. The bug is reachable by a remote, unauthenticated attacker via crafted IPv4 packets that carry the SSRR option and pass validation. Single‑model systems missed it because the lifetime violation spans non‑trivial control flow and multiple validation checks, which obscures the temporal dependency. Surfacing the exploit path required cross‑file reasoning and concurrency analysis, capabilities embodied in MDASH’s auditor, debater, and prover stages.
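
To make the lifetime problem concrete, the following Python sketch models the general pattern: a reference is dropped too early, a free‑list allocator hands the same object back out, and a stale reference then reads someone else’s data. This is a toy model, not the tcpip.sys code. The class and function names (LookasidePool, receive_buggy, receive_fixed) are invented for illustration, and Python objects only simulate what would be raw memory reuse in the kernel.

```python
class Path:
    """Toy stand-in for a reference-counted routing object."""

    def __init__(self, dest: str):
        self.dest = dest
        self.refcount = 1


class LookasidePool:
    """Toy free list: a released object is reused by the next allocation."""

    def __init__(self):
        self.free_list = []

    def alloc(self, dest: str) -> Path:
        if self.free_list:
            obj = self.free_list.pop()  # recycle the most recently freed slot
            obj.dest = dest
            obj.refcount = 1
            return obj
        return Path(dest)

    def release(self, obj: Path) -> None:
        obj.refcount -= 1
        if obj.refcount == 0:
            self.free_list.append(obj)  # memory becomes eligible for reuse


def receive_buggy(pool: LookasidePool, path: Path) -> str:
    pool.release(path)                 # reference dropped too early
    pool.alloc("attacker-chosen")      # an unrelated allocation reuses the slot
    return path.dest                   # stale read: the use-after-free analogue


def receive_fixed(pool: LookasidePool, path: Path) -> str:
    pool.alloc("attacker-chosen")      # unrelated allocation gets a fresh slot
    dest = path.dest                   # last use happens while still referenced
    pool.release(path)                 # release only after the last use
    return dest


pool_a = LookasidePool()
buggy_result = receive_buggy(pool_a, pool_a.alloc("10.0.0.1"))

pool_b = LookasidePool()
fixed_result = receive_fixed(pool_b, pool_b.alloc("10.0.0.1"))

print("buggy:", buggy_result)
print("fixed:", fixed_result)
```

The buggy path returns data written by the later allocation, while the fixed path returns the original destination. In real kernel code the release and the later use can sit far apart, which is why the ordering is hard to spot from any one function.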

Deep Dive: CVE-2026-33824
Located in the IKEEXT service (ikeext.dll), this Critical vulnerability allows a remote, unauthenticated attacker to achieve LocalSystem code execution by sending a crafted IKE_SA_INIT packet followed by a single IKEv2 fragment. The service duplicates the packet’s receive context with a shallow memcpy, causing both the queued context and the live Main Mode SA to share ownership of an attacker‑supplied security‑realm identifier. During teardown, each context frees the same pointer, resulting in a deterministic double‑free. The bug spans six source files; no single‑file analysis sees the aliasing lifecycle. MDASH’s specialized auditor agents flagged the inconsistent pattern, while debater agents validated the exploitability by referencing the correct counterpart elsewhere in the code base, illustrating the power of staged, cross‑file reasoning.
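
The aliasing pattern described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the ikeext.dll code: ToyHeap, ReceiveContext, shallow_duplicate, and owning_duplicate are invented names, and an integer handle stands in for the pointer to the security‑realm identifier. The standard‑library copy.copy call plays the role of the shallow memcpy.

```python
import copy


class ToyHeap:
    """Tracks live allocations so that a second free can be detected."""

    def __init__(self):
        self.live = set()
        self.next_handle = 1

    def alloc(self) -> int:
        handle = self.next_handle
        self.next_handle += 1
        self.live.add(handle)
        return handle

    def free(self, handle: int) -> None:
        if handle not in self.live:
            raise RuntimeError(f"double free of handle {handle}")
        self.live.remove(handle)


class ReceiveContext:
    """Toy context that believes it owns realm_handle."""

    def __init__(self, heap: ToyHeap, realm_handle: int):
        self.heap = heap
        self.realm_handle = realm_handle

    def teardown(self) -> None:
        self.heap.free(self.realm_handle)


def shallow_duplicate(ctx: ReceiveContext) -> ReceiveContext:
    # Copies fields as-is, so both contexts refer to the same allocation.
    return copy.copy(ctx)


def owning_duplicate(ctx: ReceiveContext) -> ReceiveContext:
    # Gives the duplicate its own allocation, so ownership is not shared.
    return ReceiveContext(ctx.heap, ctx.heap.alloc())


# Shallow duplication: the second teardown frees the same handle again.
heap = ToyHeap()
queued = ReceiveContext(heap, heap.alloc())
live = shallow_duplicate(queued)
queued.teardown()
try:
    live.teardown()
    double_free_detected = False
except RuntimeError:
    double_free_detected = True

# Owning duplication: each context frees only its own handle.
heap2 = ToyHeap()
queued2 = ReceiveContext(heap2, heap2.alloc())
live2 = owning_duplicate(queued2)
queued2.teardown()
live2.teardown()

print("double free detected:", double_free_detected)
print("leaked handles after owning copy:", heap2.live)
```

The copy itself looks harmless in isolation. The defect only appears once the two teardown paths are considered together, which matches the article’s point that no single‑file view sees the full aliasing lifecycle.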

Performance Benchmarks
Beyond the Patch Tuesday cohort, MDASH was evaluated against historical MSRC data. It achieved 96% recall on 28 confirmed bugs in clfs.sys over five years and 100% recall on seven bugs in tcpip.sys over the same period, indicating that the system would have rediscovered nearly all of these confirmed flaws had it been available earlier. On the public CyberGym benchmark, which comprises 1,507 real‑world vulnerability reproduction tasks from 188 OSS‑Fuzz projects, MDASH scored 88.45%, topping the leaderboard by roughly five points. Failure analysis showed that most errors stemmed from vague task descriptions or mismatched input formats (e.g., libFuzzer vs. honggfuzz), which suggests that the system’s core reasoning holds up when it is given clear specifications.
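
The article reports percentages rather than raw counts. Assuming those percentages are conventionally rounded (an assumption, since the article does not say), a short Python sketch can recover which whole‑number counts are consistent with each figure. The helper name counts_matching is made up for this example.

```python
def counts_matching(total: int, reported_pct: float, decimals: int) -> list:
    """Return every hit count k whose rate k/total rounds to reported_pct."""
    return [
        k
        for k in range(total + 1)
        if round(100 * k / total, decimals) == reported_pct
    ]


# Figures taken from the article; the counts are inferred, not reported.
clfs = counts_matching(28, 96, 0)            # 96% recall on 28 clfs.sys bugs
tcpip = counts_matching(7, 100, 0)           # 100% recall on 7 tcpip.sys bugs
cybergym = counts_matching(1507, 88.45, 2)   # 88.45% on 1,507 CyberGym tasks

print("clfs.sys hits consistent with 96%:", clfs)
print("tcpip.sys hits consistent with 100%:", tcpip)
print("CyberGym successes consistent with 88.45%:", cybergym)
```

Under the rounding assumption, each figure is consistent with exactly one count: 27 of 28 for clfs.sys, 7 of 7 for tcpip.sys, and 1,333 of 1,507 for CyberGym. That puts the clfs.sys result at a single missed bug.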

Extensibility and Plugin System
A key strength of MDASH is its plugin‑friendly design. While the foundation models handle general language understanding, domain‑specific plugins supply invariants that models cannot be expected to internalize, such as Windows kernel calling conventions, IRP rules, lock invariants, IPC trust boundaries, or codec state machines. The CLFS proving plugin, for example, knows how to construct a triggering log file from a candidate finding, enabling the Prove stage to validate bugs that would otherwise remain unactionable triage entries. Because the pipeline’s targeting, validation, deduplication, and proof stages are model‑agnostic, organizations can swap in newer models with a configuration change while preserving existing plugins, scopes, and calibrations.
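
One way to picture this separation between plugins and models is the minimal Python sketch below. It is not MDASH’s plugin API, which the article does not describe: Harness, register_plugin, hints_for, with_models, the irp_rules plugin, and the model names are all hypothetical. The only real‑world detail is IoCompleteRequest, the Windows kernel routine after which a driver must not touch the IRP.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Plugin = Callable[[str], List[str]]


@dataclass
class Harness:
    """Toy harness: models are configuration, plugins are durable knowledge."""
    models: List[str]
    plugins: Dict[str, Plugin] = field(default_factory=dict)

    def register_plugin(self, domain: str, plugin: Plugin) -> None:
        self.plugins[domain] = plugin

    def hints_for(self, domain: str, code: str) -> List[str]:
        """Ask the domain plugin, if any, for invariants relevant to the code."""
        plugin = self.plugins.get(domain)
        return plugin(code) if plugin else []

    def with_models(self, models: List[str]) -> "Harness":
        """Swap the model list while carrying every registered plugin over."""
        return Harness(models=list(models), plugins=dict(self.plugins))


def irp_rules(code: str) -> List[str]:
    """Hypothetical plugin encoding one Windows driver invariant."""
    hints = []
    if "IoCompleteRequest" in code:
        hints.append("Check that the IRP is not accessed after IoCompleteRequest.")
    return hints


harness = Harness(models=["model-a", "model-b"])  # placeholder model names
harness.register_plugin("windows-kernel", irp_rules)

# A "configuration change": new models, same plugins.
upgraded = harness.with_models(["model-c"])

snippet = "IoCompleteRequest(Irp, IO_NO_INCREMENT); status = Irp->IoStatus.Status;"
hints = upgraded.hints_for("windows-kernel", snippet)

print(upgraded.models)
print(hints)
```

Because the plugin dictionary is copied rather than rebuilt, the upgraded harness gives the same domain hints as before while pointing at a different model list. That is the property the article attributes to the model‑agnostic design.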

Implications for AI‑Powered Vulnerability Discovery
The results signal a shift from AI vulnerability research as a curiosity to an engineering‑grade capability at enterprise scale. MDASH demonstrates that durable advantage lies not in any single model but in the agentic system that orchestrates models, agents, and plugins. For defenders, the pivotal question becomes: What does the tool do with the model, and what survives when the next model arrives? By focusing on pipeline robustness, validation, and extensibility, security teams can build long‑term resilience against the rapid turnover of foundation models while still benefiting from their advancing reasoning power.

How to Join the Private Preview
Microsoft is offering a limited private preview of MDASH to interested customers. Organizations can sign up via the provided link on the announcement page to gain early access, contribute feedback, and begin integrating the agentic harness into their own security workflows. Participation includes guidance on configuring the harness for specific codebases, developing custom plugins, and interpreting the validated findings produced by the system.

Conclusion
The Microsoft Security multi‑model agentic scanning harness (MDASH) represents a concrete step toward scalable, AI‑driven vulnerability discovery. By combining an ensemble of models with more than 100 specialized agents, a rigorous validate‑prove pipeline, and an extensible plugin architecture, MDASH has already helped uncover 16 new vulnerabilities, including four Critical remote‑code‑execution flaws, across core Windows components. Its strong performance on internal MSRC retrospectives and the public CyberGym benchmark supports the claim that the surrounding agentic system adds substantial value beyond raw model capability. As AI models continue to evolve, MDASH’s model‑agnostic design is intended to let investments in plugins, configurations, and expertise carry forward, providing a durable foundation for proactive defense at enterprise scale.

