Key Takeaways
- High‑quality security attack logs are essential for detection, response, forensics, and compliance, but collecting realistic logs at scale is costly and slow.
- AI‑driven synthetic log generation translates attacker tactics, techniques, and procedures (TTPs) into structured telemetry, accelerating detection engineering while preserving realism.
- Three progressively sophisticated methods were explored: prompt‑engineered generation, an agentic workflow (generator‑evaluator‑improver loop), and reinforcement learning with verifiable rewards (RLVR).
- Agentic collaboration consistently outperformed pure prompting, delivering the highest recall across multiple datasets; reasoning‑enhanced models further improved fidelity.
- Early RLVR experiments show promise for verbatim‑level alignment but require substantial labeled data; data‑augmentation techniques help scale training examples.
- Evaluation on goal‑driven campaigns, the open‑source Security Datasets Project, and the ATLASv2 dataset demonstrates that synthetic logs can capture multi‑event sequences, parent‑child process relationships, and realistic command lines.
- Synthetic logs complement lab‑based simulations by enabling rapid, safe experimentation and broader coverage of rare or emerging threats without exposing sensitive data.
Introduction
Logs and telemetry form the foundation of modern cybersecurity, enabling threat detection, incident response, forensic investigation, and compliance across endpoints, networks, and cloud environments. Despite their importance, obtaining high‑quality security attack logs at scale remains notoriously difficult.
Challenges of Real‑World Telemetry
Real‑world security telemetry is dominated by repeated benign activity, with malicious events occurring very rarely. Gathering, labeling, and maintaining datasets that contain authentic attack logs is both costly and operationally challenging; it requires not only tagging malicious behavior but also fully reconstructing entire attack scenarios. These obstacles slow detection engineering and limit the effectiveness of rule‑based and anomaly‑detection approaches.
AI‑Generated Synthetic Logs: A New Path
To overcome these barriers, the work explores using artificial intelligence to generate realistic, high‑fidelity synthetic security attack logs. By translating attacker behaviors, expressed as TTPs, directly into structured telemetry, the approach aims to accelerate detection development while preserving realism and security.
Relevance to Microsoft Defender Customers
For Microsoft Defender customers, this research is crucial because it directly addresses the bottleneck of acquiring high‑quality, realistic attack logs needed for effective threat detection and response. Leveraging AI‑driven synthetic log generation allows organizations to speed up the creation of detection rules and AI‑based automation, ensure privacy, and reduce operational overhead. Synthetic logs enable simulation of a broader range of attack scenarios—including rare and emerging threats—without exposing sensitive data or relying on expensive lab setups, ultimately enhancing the agility and effectiveness of Microsoft Defender’s detection and response capabilities.
Synthetic Logs vs. Lab Simulations
Synthetic data has long served as a privacy‑conscious substitute for real data in many fields, and in cybersecurity it offers added advantages: safe, shareable datasets that avoid exposing customer information; the ability to simulate rare or emerging attacks that are hard to observe in production; accelerated detection engineering and testing; and reproducible experiments for benchmarking. While synthetic logs do not replace all lab‑based validation, they complement traditional simulations by speeding up early‑stage detection design, testing, and coverage expansion, thereby reducing the slow, labor‑intensive nature of executing real attacks in controlled environments.
Core Idea: From TTPs to Logs
The proposed workflow consumes “TTP + Action” as input and produces structured security logs as output. High‑level attacker TTPs are drawn from the MITRE ATT&CK framework, paired with concrete attacker actions (e.g., using forfiles.exe with obfuscated command lines). The goal is not to reproduce logs verbatim but to generate semantically correct logs that would accurately trigger detections, mirroring real attacker behavior.
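To make the input/output contract concrete, here is a minimal sketch of what a "TTP + Action" input and a corresponding synthetic log record might look like. The field names, schema, and values are illustrative assumptions (loosely modeled on a Sysmon process-creation event), not the actual format used in the research.

```python
# Illustrative "TTP + Action" input and one synthetic log it might map to.
# All field names and values are assumptions, not the research's real schema.

ttp_action = {
    "ttp": "T1202",  # MITRE ATT&CK: Indirect Command Execution (covers forfiles.exe)
    "action": "Use forfiles.exe with an obfuscated command line to proxy execution",
}

synthetic_log = {
    "EventID": 1,  # Sysmon process-creation event
    "Image": r"C:\Windows\System32\forfiles.exe",
    "CommandLine": r'forfiles /p C:\Windows\System32 /m notepad.exe /c "cmd /c @file"',
    "ParentImage": r"C:\Windows\System32\cmd.exe",
}
```

The goal, per the text above, is semantic correctness: a log like this should trigger the same detections as the real attacker behavior, without needing to match a captured log byte for byte.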
Prompt‑Engineered Generation (Baseline)
The first technique relies on a series of expert‑crafted prompts. The workflow involves: (1) prompting the model with a detailed attack scenario and context; (2) iterative generation across multiple turns to maintain coherence; and (3) evaluation by an independent large language model (LLM) acting as a judge, which assesses realism and consistency. The prompts explicitly instruct the model to reason like a cybersecurity researcher, leverage MITRE ATT&CK knowledge, and produce coherent attack narratives.
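The three-step workflow can be sketched as follows. This is a simplified outline assuming a generic `llm_complete(prompt) -> str` function supplied by the caller; the prompt wording, turn count, and judge protocol are illustrative assumptions, not the actual prompts used.

```python
# A minimal sketch of the prompt-engineered baseline. `llm_complete` is an
# assumed caller-supplied function wrapping any LLM; prompts are illustrative.

def build_generation_prompt(scenario: str, prior_logs: list[str]) -> str:
    """Turn-by-turn prompt: scenario context plus the logs generated so far."""
    context = "\n".join(prior_logs) if prior_logs else "(none yet)"
    return (
        "You are a cybersecurity researcher familiar with MITRE ATT&CK.\n"
        f"Attack scenario: {scenario}\n"
        f"Logs generated so far:\n{context}\n"
        "Generate the next coherent log event as a single JSON object."
    )

def generate_logs(scenario: str, llm_complete, turns: int = 3) -> list[str]:
    """Iterative multi-turn generation to keep the attack narrative coherent."""
    logs: list[str] = []
    for _ in range(turns):
        logs.append(llm_complete(build_generation_prompt(scenario, logs)))
    return logs

def judge_prompt(scenario: str, logs: list[str]) -> str:
    """Prompt for an independent LLM acting as judge of realism and consistency."""
    return (
        f"Assess whether these logs realistically depict: {scenario}\n"
        + "\n".join(logs)
        + "\nAnswer PASS or FAIL with reasoning."
    )
```

Feeding each turn the logs produced so far is what lets a single-prompt approach maintain coherence across a multi-event attack narrative.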
Agentic Workflow‑Based Generation
Recognizing that pure prompting struggles with complex, multi‑stage attacks, an agentic workflow was introduced using three specialized agents that collaborate in a generate‑evaluate‑improve loop: the Generator Agent creates an initial set of logs; the Evaluator Agent reviews the logs and provides structured feedback; and the Improver Agent suggests targeted refinements based on that feedback. This cyclical process allows the system to correct errors, fill gaps, and refine details over multiple turns, significantly improving log completeness and fidelity, especially for intricate attack chains.
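The generate-evaluate-improve cycle can be sketched as a short control loop. The agent signatures, the `None`-means-satisfied convention, and the turn cap are assumptions for illustration; in practice each callable would be an LLM-backed agent.

```python
# A minimal sketch of the agentic generate-evaluate-improve loop. The three
# callables stand in for LLM-backed agents; signatures are assumptions.

def agentic_loop(scenario, generator, evaluator, improver, max_turns=5):
    """Refine logs via evaluator feedback until no gaps remain or turns run out."""
    logs = generator(scenario)                     # initial set of logs
    for _ in range(max_turns):
        feedback = evaluator(scenario, logs)       # structured critique, or None
        if feedback is None:
            break                                  # evaluator found no gaps or errors
        logs = improver(scenario, logs, feedback)  # targeted refinement
    return logs
```

Bounding the loop with `max_turns` keeps cost predictable while still allowing several rounds of error correction for intricate attack chains.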
Multi‑Turn Reinforcement Learning with Verifiable Rewards (RLVR)
While agentic workflows produce semantically correct logs, they still diverge from real event logs in details such as process paths, command‑line arguments, and service names. To narrow this gap, experiments employed reinforcement learning with verifiable rewards. An LLM‑as‑a‑Judge compares synthesized data against ground‑truth logs, awarding partial rewards for semantic alignment and imposing penalties for inexact matches, yielding a context‑aware, flexible reward signal. The judge also supplies transparent, auditable reasoning. Because this approach depends heavily on labeled training data, data‑augmentation techniques—paraphrasing attack narratives while preserving technical intent and perturbing parameters (e.g., substituting executable names, re‑ordering flags)—were applied to scale training examples from hundreds to thousands.
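The parameter-perturbation side of the augmentation can be sketched mechanically: permute independent flag/value pairs and vary how the executable is referenced. The example command, path forms, and variant rules here are illustrative assumptions; the research's actual augmentation (including narrative paraphrasing) is LLM-assisted and richer.

```python
import itertools

# A minimal sketch of parameter-perturbation augmentation: re-ordering
# independent flag/value pairs and varying the executable reference while
# preserving technical intent. Path forms and rules are assumptions.

def augment_command(exe: str, flag_pairs: list[tuple[str, str]]) -> list[str]:
    """Generate command-line variants that keep the same attacker intent."""
    exe_forms = [exe, rf"C:\Windows\System32\{exe}"]  # bare name vs. full path
    variants = []
    for exe_form in exe_forms:
        for perm in itertools.permutations(flag_pairs):
            flags = " ".join(f"{flag} {value}" for flag, value in perm)
            variants.append(f"{exe_form} {flags}")
    return variants
```

Each seed example fans out combinatorially (here, 2 path forms x n! flag orders), which is how a few hundred labeled examples can be scaled into thousands of training instances.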
Evaluation Datasets
To assess generalizability, three complementary datasets were used:
- Goal‑Driven (GD) Campaigns: Tightly scoped datasets from repeatable attack simulations conducted by threat researchers, each built around a specific security objective (e.g., detecting credential dumping on Windows servers). Ten GD executions provided clean ground truth and well‑defined attacker actions.
- Security Datasets Project: An open‑source initiative offering malicious and benign datasets from multiple platforms, enabling broader cross‑environment evaluation.
- ATLASv2 Dataset: Comprises Windows Security Auditing, Sysmon, Firefox, and DNS telemetry generated across two Windows VMs executing ten multi‑stage attack scenarios with realistic noise and cross‑host behaviors; evaluation focused on malicious activity during attack windows.
External datasets were used solely for research validation and not in any commercial product development.
Evaluation Methodology
The primary metric was recall, measuring the model’s ability to generate semantically relevant log instances (true positives) expected for a given attack scenario. The LLM‑as‑a‑Judge performed flexible matching—for example, treating a synthetic log containing forfiles.exe as a match for a ground‑truth entry with the full path D:\Windows\System32\forfiles.exe. Experiments were conducted across multiple reasoning and non‑reasoning models, with recall values reported for medium reasoning effort where applicable.
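The recall computation with flexible matching can be sketched as below. A simple suffix check stands in for the LLM-as-a-Judge's semantic matching, and the string-per-event representation is an assumption made for illustration.

```python
# A minimal sketch of recall under flexible matching: a bare executable name
# in a synthetic log counts as matching a full ground-truth path. This string
# heuristic is a stand-in for the LLM-as-a-Judge's semantic comparison.

def flexible_match(synthetic_image: str, ground_truth_image: str) -> bool:
    """Treat a bare executable name as matching its full ground-truth path."""
    return ground_truth_image.lower().endswith(synthetic_image.lower())

def recall(synthetic: list[str], ground_truth: list[str]) -> float:
    """Fraction of expected ground-truth events covered by the synthetic logs."""
    matched = sum(
        any(flexible_match(s, g) for s in synthetic) for g in ground_truth
    )
    return matched / len(ground_truth)
```

Under this scheme a synthetic `forfiles.exe` event matches the ground-truth entry `D:\Windows\System32\forfiles.exe`, exactly as in the example above.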
Key Results
Prompt‑only approaches established a baseline but showed inconsistent performance across datasets. Agentic workflows delivered dramatic recall improvements, outperforming pure prompting in every case. Reasoning‑enhanced models combined with the agentic refinement achieved the highest fidelity. Early RLVR experiments indicated significant promise for achieving verbatim‑level alignment, though they highlighted the need for substantial labeled data; data‑augmentation helped mitigate this requirement. Tables 1 and 2 (referenced in the original text) summarize recall values for prompt‑based and agentic workflow‑based methods, respectively, confirming that agentic collaboration is the most effective technique for high‑quality synthetic attack log generation.
Overall Implications and Future Work
AI‑driven synthetic log generation shows strong potential to produce semantically meaningful logs from TTPs and attacker actions, capture multi‑event sequences, preserve parent‑child process relationships, and generate realistic command lines. This capability can accelerate detection engineering by reducing reliance on costly lab setups and enabling rapid, safe experimentation without sacrificing realism. Continued work on reinforcement learning with verifiable rewards, coupled with expanded data‑augmentation strategies, may further close the gap between synthetic and real logs, ultimately providing security teams with a scalable, privacy‑preserving tool for staying ahead of evolving cyber threats.
References and Further Reading
- ATLASv2: ATLAS Attack Engagements, Version 2 (arXiv:2401.01341).
- Microsoft Defender Security Research, with contributions from Raghav Batta and members of Microsoft Threat Intelligence.

