Zero-Day Response Playbook: From Discovery to Recovery
A zero-day drops. No patch exists. Your threat intel feed lights up. The clock starts. What you do in the first 60 minutes determines whether this becomes a contained incident or a headline-making breach.
This post walks through a complete zero-day response lifecycle — from the moment you learn about a new vulnerability through containment, threat hunting, and recovery. We follow Meridian Healthcare (fictional) through their response to a critical zero-day in their edge gateway appliance, with detection queries, decision trees, and communication templates you can adapt immediately.
1. The Zero-Day Reality
Zero-day vulnerabilities occupy a unique space in security operations. Unlike known CVEs with patches and signatures, zero-days force defenders to operate without their usual safety nets:
- No patch available — vendor is still developing a fix
- No signatures — IDS/IPS rules don't detect the exploitation
- Limited IOCs — threat intel is sparse and evolving rapidly
- Uncertainty — scope of exploitation is unknown
The average time from zero-day disclosure to first exploitation in the wild has dropped from 14 days in 2020 to under 24 hours in 2025. For critical infrastructure, that window is often measured in minutes.
The Golden Hour
The first 60 minutes after zero-day awareness determine your outcome. Organizations with rehearsed playbooks contain incidents 74% faster than those without (Mandiant M-Trends 2025).
2. Zero-Day Response Framework
Phase 0: Preparation (Before the Zero-Day)
The best zero-day response starts months before the vulnerability exists.
Asset Inventory
You cannot protect what you cannot find. Maintain a continuously updated inventory of:
| Asset Category | Key Data Points | Update Frequency |
|---|---|---|
| Network appliances | Vendor, model, firmware version, exposure | Weekly |
| Web applications | Framework, dependencies, internet-facing | Daily |
| Endpoints | OS version, patch level, installed software | Real-time (EDR) |
| Cloud services | Provider, service, API versions, configurations | Daily |
| Third-party integrations | Vendor, data flow, access scope | Monthly |
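An inventory only helps if its records are fresher than the update frequencies in the table. The following is a minimal sketch of a staleness check, assuming a hypothetical record layout (`name`, `category`, `last_updated`) and treating "Real-time (EDR)" as a one-hour threshold, which is an assumption rather than a figure from the table.

```python
from datetime import datetime, timedelta

# Maximum record age per asset category, taken from the table above.
# The one-hour value for endpoints is an assumed stand-in for "real-time".
MAX_AGE = {
    "network_appliance": timedelta(weeks=1),
    "web_application": timedelta(days=1),
    "endpoint": timedelta(hours=1),
    "cloud_service": timedelta(days=1),
    "third_party_integration": timedelta(days=30),
}


def stale_assets(records, now):
    """Return names of assets whose last update is older than their category allows."""
    stale = []
    for rec in records:
        if now - rec["last_updated"] > MAX_AGE[rec["category"]]:
            stale.append(rec["name"])
    return stale


# Hypothetical inventory records for illustration only.
now = datetime(2026, 9, 15, 6, 42)
records = [
    {"name": "edge-vpn-01", "category": "network_appliance",
     "last_updated": datetime(2026, 9, 12)},
    {"name": "edge-vpn-02", "category": "network_appliance",
     "last_updated": datetime(2026, 8, 30)},
    {"name": "patient-portal", "category": "web_application",
     "last_updated": datetime(2026, 9, 15, 1, 0)},
]
print(stale_assets(records, now))  # ['edge-vpn-02']
```

Running a check like this on a schedule surfaces inventory gaps before a zero-day forces the question.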
Pre-positioned Detection
Deploy behavioral detections that catch exploitation patterns regardless of the specific vulnerability:
```kql
// Anomalous process execution from network appliance management interfaces
DeviceProcessEvents
| where Timestamp > ago(24h)
| where InitiatingProcessFileName in ("httpd", "nginx", "sshd", "java")
| where FileName in ("cmd.exe", "powershell.exe", "bash", "sh", "python", "perl")
| where ProcessCommandLine has_any ("whoami", "id", "net user", "cat /etc/passwd")
| project Timestamp, DeviceName, InitiatingProcessFileName, FileName, ProcessCommandLine
| sort by Timestamp desc
```
```spl
index=endpoint sourcetype=sysmon EventCode=1
| search ParentImage IN ("*httpd*", "*nginx*", "*sshd*", "*java*")
| search Image IN ("*cmd.exe", "*powershell.exe", "*bash", "*sh", "*python*", "*perl*")
| search CommandLine IN ("*whoami*", "id", "* id", "*net user*", "*cat /etc/passwd*")
| table _time, host, ParentImage, Image, CommandLine
| sort - _time
```
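The same "service process spawns a shell that runs reconnaissance commands" logic can be prototyped offline against exported process events before it is tuned in a SIEM. This is a minimal sketch, assuming a hypothetical event dictionary with `parent_image`, `image`, and `command_line` keys. It matches `id` only as a whole token, so strings such as `validate_config.py` do not trigger it.

```python
# Process names mirror the queries above; "java.exe" is added for Windows hosts.
SERVICE_PARENTS = {"httpd", "nginx", "sshd", "java", "java.exe"}
SHELL_CHILDREN = {"cmd.exe", "powershell.exe", "bash", "sh", "python", "perl"}
RECON_SUBSTRINGS = ("whoami", "net user", "cat /etc/passwd")
RECON_TOKENS = {"id"}  # matched as whole tokens to limit false positives


def basename(path):
    """Strip directory components from Windows or POSIX style paths."""
    return path.replace("\\", "/").rsplit("/", 1)[-1].lower()


def is_suspicious(event):
    """True when a service process spawned a shell running a recon command."""
    parent = basename(event["parent_image"])
    child = basename(event["image"])
    cmd = event["command_line"].lower()
    recon = any(s in cmd for s in RECON_SUBSTRINGS) or bool(
        RECON_TOKENS & set(cmd.split())
    )
    return parent in SERVICE_PARENTS and child in SHELL_CHILDREN and recon


# Hypothetical events for illustration only.
events = [
    {"parent_image": "/usr/sbin/httpd", "image": "/bin/sh",
     "command_line": "sh -c id"},
    {"parent_image": "/usr/sbin/sshd", "image": "/bin/bash",
     "command_line": "bash -l"},
    {"parent_image": "C:\\Program Files\\Java\\bin\\java.exe",
     "image": "C:\\Windows\\System32\\cmd.exe",
     "command_line": "cmd.exe /c whoami"},
    {"parent_image": "/usr/sbin/nginx", "image": "/usr/bin/python",
     "command_line": "python validate_config.py"},
]
print([is_suspicious(e) for e in events])  # [True, False, True, False]
```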
Phase 1: Awareness (T+0 to T+15 Minutes)
Intelligence Intake
Zero-day awareness typically arrives through one of these channels:
- Vendor advisory — official disclosure with CVSS score
- Threat intel feed — CISA KEV, commercial feeds, ISAC alerts
- Social media / researcher disclosure — Twitter/X, Mastodon, security blogs
- Internal detection — anomalous behavior matching exploitation patterns
- Law enforcement notification — FBI, CISA direct contact
Verify Before Acting
Not every "zero-day" tweet is real. Before activating your response:
- Confirm the vendor/product is in your environment
- Verify the source credibility (official advisory > researcher > anonymous)
- Check if a CVE has been assigned
- Assess exploitability (remote code execution vs. local privilege escalation)
Initial Triage Decision Tree
```text
Is the affected product in our environment?
├── NO → Monitor only, update threat intel
└── YES → Is it internet-facing?
    ├── YES → CRITICAL: activate IR team immediately
    └── NO → Is there confirmed exploitation in the wild?
        ├── YES → HIGH: activate IR team within 1 hour
        └── NO → MEDIUM: schedule assessment within 4 hours
```
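Encoding the decision tree as a function makes the triage outcome repeatable and easy to wire into a ticketing or paging workflow. The sketch below follows the three questions in the tree. The function name and the `MONITOR` label for the "not in our environment" branch are assumptions, since the tree assigns no severity name to that branch.

```python
def triage(in_environment, internet_facing, exploited_in_wild):
    """Map the three triage questions to a (severity, action) pair."""
    if not in_environment:
        # "MONITOR" is an assumed label; the tree only says "monitor only".
        return ("MONITOR", "Monitor only, update threat intel")
    if internet_facing:
        return ("CRITICAL", "Activate IR team immediately")
    if exploited_in_wild:
        return ("HIGH", "Activate IR team within 1 hour")
    return ("MEDIUM", "Schedule assessment within 4 hours")


# Internet-facing appliance: exploitation status does not change the outcome.
print(triage(True, True, False))   # ('CRITICAL', 'Activate IR team immediately')
# Internal-only system with confirmed exploitation in the wild.
print(triage(True, False, True))   # ('HIGH', 'Activate IR team within 1 hour')
```

Note that an internet-facing product is rated critical regardless of whether exploitation in the wild has been confirmed, which matches the tree's ordering of the questions.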
Phase 2: Assessment (T+15 to T+60 Minutes)
Exposure Mapping
Identify every instance of the vulnerable component:
```kql
// Find all instances of vulnerable appliance (example: EdgeGuard VPN)
DeviceNetworkEvents
| where Timestamp > ago(7d)
| where RemoteUrl has "edgeguard" or RemotePort in (443, 8443, 10443)
| summarize
    Connections = count(),
    UniqueDevices = dcount(DeviceName),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by RemoteIP, RemotePort
| sort by Connections desc
```
Exploitation Check
Hunt for evidence that the vulnerability has already been exploited:
```kql
// Check for post-exploitation indicators on network appliances
DeviceProcessEvents
| where Timestamp > ago(30d)
| where DeviceName has_any ("vpn", "gateway", "edge", "fw")
| where FileName in ("curl", "wget", "certutil.exe", "bitsadmin.exe")
| where ProcessCommandLine has_any ("http://", "https://", "ftp://")
| project Timestamp, DeviceName, FileName, ProcessCommandLine, AccountName
| sort by Timestamp desc
```
Phase 3: Containment (T+1 Hour to T+4 Hours)
Immediate Actions
| Priority | Action | Owner | Timeline |
|---|---|---|---|
| P0 | Block exploitation at WAF/IPS (generic rules) | Network team | Immediate |
| P0 | Isolate confirmed-compromised systems | SOC / IR | Immediate |
| P1 | Disable vulnerable service if non-critical | App owner | 30 min |
| P1 | Implement vendor-recommended workaround | Sysadmin | 1 hour |
| P2 | Increase logging verbosity on affected systems | SOC | 1 hour |
| P2 | Deploy additional monitoring rules | Detection eng | 2 hours |
| P3 | Notify executive stakeholders | CISO | 2 hours |
| P3 | Engage external IR if needed | IR lead | 4 hours |
Network Containment
```text
# Example: Emergency ACL to block exploitation attempts
# Appliance management interface: restrict to jump hosts only
# (Adapt to your firewall platform)
# Most platforms stop at the first matching entry, so permits come before denies
# Allow only from authorized management subnet
permit tcp 10.250.0.0/24 host 198.51.100.10 eq 443
permit tcp 10.250.0.0/24 host 198.51.100.10 eq 22
# Block all other access to management ports
deny tcp any host 198.51.100.10 eq 443
deny tcp any host 198.51.100.10 eq 8443
deny tcp any host 198.51.100.10 eq 22
```
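Many firewall platforms evaluate ACL entries top-down and stop at the first match, so rule order decides whether the management subnet keeps access during containment. The sketch below is a simplified first-match evaluator for sanity-checking an ordering before it is pushed. The tuple layout, the `first_match` helper, and the implicit final deny are assumptions for illustration, not the behavior of any particular vendor.

```python
import ipaddress


def first_match(rules, src_ip, dst_ip, port):
    """Return the action of the first rule that matches, top-down."""
    src = ipaddress.ip_address(src_ip)
    for action, src_net, dst_host, dst_port in rules:
        if (src in ipaddress.ip_network(src_net)
                and dst_ip == dst_host
                and port == dst_port):
            return action
    return "deny"  # assumed implicit deny when nothing matches


APPLIANCE = "198.51.100.10"
permits = [
    ("permit", "10.250.0.0/24", APPLIANCE, 443),
    ("permit", "10.250.0.0/24", APPLIANCE, 22),
]
denies = [
    ("deny", "0.0.0.0/0", APPLIANCE, 443),
    ("deny", "0.0.0.0/0", APPLIANCE, 8443),
    ("deny", "0.0.0.0/0", APPLIANCE, 22),
]

# Permits first: jump hosts keep access, everyone else is blocked.
print(first_match(permits + denies, "10.250.0.15", APPLIANCE, 443))   # permit
print(first_match(permits + denies, "203.0.113.50", APPLIANCE, 443))  # deny
# Denies first: the broad deny shadows the permit and locks out the jump hosts.
print(first_match(denies + permits, "10.250.0.15", APPLIANCE, 443))   # deny
```

Testing both orderings this way catches a shadowed permit before the change locks responders out of the appliance they are trying to contain.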
Phase 4: Eradication (T+4 Hours to T+48 Hours)
Once the vulnerability is contained, focus shifts to removing any attacker persistence:
Persistence Hunt Checklist
- [ ] Check for new user accounts created during the exploitation window
- [ ] Review scheduled tasks / cron jobs added recently
- [ ] Inspect web-accessible directories for web shells
- [ ] Check for SSH key additions in authorized_keys
- [ ] Review certificate changes and new TLS certificates
- [ ] Inspect startup scripts and init systems
- [ ] Check for modified system binaries (file integrity monitoring)
```kql
// Hunt for web shells deployed during exploitation window
DeviceFileEvents
| where Timestamp between (datetime(2026-09-15) .. datetime(2026-09-17))
| where FolderPath has_any ("wwwroot", "htdocs", "html", "webapps")
| where FileName endswith_cs ".php" or FileName endswith_cs ".jsp"
    or FileName endswith_cs ".aspx" or FileName endswith_cs ".py"
| where ActionType == "FileCreated"
| project Timestamp, DeviceName, FolderPath, FileName, SHA256, InitiatingProcessFileName
| sort by Timestamp desc
```
```spl
index=endpoint sourcetype=sysmon EventCode=11 earliest="09/15/2026:00:00:00" latest="09/17/2026:23:59:59"
| search TargetFilename IN ("*wwwroot*", "*htdocs*", "*webapps*")
| search TargetFilename IN ("*.php", "*.jsp", "*.aspx")
| table _time, host, TargetFilename, hashes, Image
| sort - _time
```
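Appliances and hosts without EDR or Sysmon coverage will not show up in the queries above, so a direct filesystem sweep can fill the gap. This is a minimal sketch, assuming a hypothetical `files_modified_in_window` helper and read access to the web root. It relies on modification times, which an attacker can alter, so a clean result is not proof of absence.

```python
import os
from datetime import datetime, timezone

# Extensions mirror the KQL hunt above.
WEB_EXTENSIONS = (".php", ".jsp", ".aspx", ".py")


def files_modified_in_window(root, start, end):
    """List script files under root whose mtime falls inside [start, end]."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(WEB_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            if start <= mtime <= end:
                hits.append(path)
    return sorted(hits)


# Example call (the web root path is a placeholder, adjust per host):
# window_start = datetime(2026, 9, 15, tzinfo=timezone.utc)
# window_end = datetime(2026, 9, 17, 23, 59, 59, tzinfo=timezone.utc)
# for path in files_modified_in_window("/var/www/html", window_start, window_end):
#     print(path)
```

Any hit should be compared against a known-good baseline or file integrity monitoring data before it is treated as a web shell.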
Phase 5: Recovery (T+48 Hours to T+7 Days)
Patch Deployment
When the vendor releases a patch:
- Test in staging — deploy patch to non-production first (minimum 2 hours observation)
- Phased rollout — internet-facing systems first, then internal
- Verify remediation — run vulnerability scanner to confirm patch effectiveness
- Remove workarounds — reverse any temporary mitigations that may impact functionality
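The staging gate and phased ordering above are simple enough to enforce in a deployment script. The sketch below assumes a hypothetical asset record with `name` and `internet_facing` keys. It checks the two-hour staging observation and splits assets into two waves, internet-facing first.

```python
from datetime import datetime, timedelta

MIN_STAGING_OBSERVATION = timedelta(hours=2)  # from the staging step above


def staging_complete(staged_at, now):
    """True once the patch has been observed in staging long enough."""
    return now - staged_at >= MIN_STAGING_OBSERVATION


def rollout_waves(assets):
    """Split assets into patch waves: internet-facing first, then internal."""
    external = sorted(a["name"] for a in assets if a["internet_facing"])
    internal = sorted(a["name"] for a in assets if not a["internet_facing"])
    return [external, internal]


# Hypothetical assets and timestamps for illustration only.
assets = [
    {"name": "edge-vpn-01", "internet_facing": True},
    {"name": "core-vpn-01", "internet_facing": False},
    {"name": "edge-vpn-02", "internet_facing": True},
]
staged_at = datetime(2026, 9, 15, 10, 0)

print(staging_complete(staged_at, datetime(2026, 9, 15, 11, 30)))  # False
print(staging_complete(staged_at, datetime(2026, 9, 15, 12, 0)))   # True
print(rollout_waves(assets))
# [['edge-vpn-01', 'edge-vpn-02'], ['core-vpn-01']]
```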
Validation Queries
```kql
// Verify no exploitation attempts after patching
CommonSecurityLog
| where TimeGenerated > ago(7d)
| where DeviceVendor == "EdgeGuard" and DeviceProduct == "VPN"
| where Activity has_any ("exploit", "overflow", "injection", "traversal")
| summarize AttemptCount = count() by bin(TimeGenerated, 1h), SourceIP
| sort by TimeGenerated desc
```
3. Case Study: Meridian Healthcare
Scenario: EdgeGuard VPN Zero-Day (Fictional)
- Organization: Meridian Healthcare (fictional, 12,000 employees, 3 hospitals)
- Vulnerability: Remote code execution in EdgeGuard VPN appliance (CVE-2026-XXXX)
- CVSS: 9.8 (Critical), unauthenticated RCE via crafted SAML assertion
- Initial awareness: CISA emergency directive, 06:42 UTC
Timeline
| Time | Event | Action |
|---|---|---|
| 06:42 | CISA emergency directive received | SOC manager alerted via PagerDuty |
| 06:55 | Confirmed 4 EdgeGuard appliances in environment | All internet-facing |
| 07:10 | Threat hunt initiated — checked 30 days of logs | No IOCs found (clean) |
| 07:30 | Emergency CAB convened | Approved immediate workaround |
| 07:45 | SAML authentication disabled on all appliances | Switched to certificate-based auth |
| 08:15 | Additional monitoring rules deployed | Sysmon + NetFlow on appliance subnets |
| 10:00 | Vendor releases emergency patch | Staged in test environment |
| 14:00 | Patch deployed to production appliances | Phased: DMZ first, then internal |
| 16:00 | SAML re-enabled with patched firmware | Full functionality restored |
| 18:00 | Post-incident review scheduled | Lessons learned session for next week |
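Response metrics for the post-incident review fall straight out of the timeline. The sketch below computes minutes elapsed from initial awareness (06:42) to selected milestones, assuming all timestamps are on the same UTC day. The milestone key names are illustrative.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M"):
    """Whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


awareness = "06:42"
milestones = {
    "assets_confirmed": "06:55",
    "workaround_applied": "07:45",
    "patch_in_production": "14:00",
    "full_service_restored": "16:00",
}

elapsed = {name: minutes_between(awareness, t) for name, t in milestones.items()}
for name, minutes in elapsed.items():
    print(f"{name}: {minutes} min")
# assets_confirmed: 13 min
# workaround_applied: 63 min
# patch_in_production: 438 min
# full_service_restored: 558 min
```

The workaround landed 63 minutes after awareness, just past the 60-minute mark the playbook treats as the golden hour, and full service returned after 9 hours 18 minutes.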
What Went Right
- Asset inventory was current — 4 appliances identified in under 15 minutes
- Pre-positioned behavioral detections gave the team hunt coverage without vendor IOCs, and the 30-day hunt came back clean
- Rehearsed playbook — team followed zero-day playbook without improvising
- Fallback authentication — certificate-based auth was already configured
What Needed Improvement
- No offline backup for VPN access — remote workers lost access for 2 hours
- Patch testing environment didn't match production — 30-minute delay finding compatible test appliance
- Communication gaps — clinical staff weren't notified about VPN disruption until 45 minutes after workaround
4. Communication Templates
Internal Stakeholder Notification
SUBJECT: [URGENT] Zero-Day Vulnerability Response — [Product Name]
STATUS: Active Response
SEVERITY: Critical (CVSS 9.8)
AFFECTED SYSTEMS: [List]
CURRENT ACTIONS:
- Workaround applied at [time]
- Threat hunt in progress — no evidence of exploitation
- Vendor patch expected [timeframe]
BUSINESS IMPACT:
- [Service X] temporarily unavailable
- Workaround in place — [alternative access method]
NEXT UPDATE: [Time]
IR Lead: [Name] | [Contact]
Board/Executive Summary
SUBJECT: Zero-Day Incident Summary — [Date]
A critical vulnerability (CVE-XXXX-XXXX) was discovered in [product],
which is used in our environment for [purpose].
We were notified at [time] and activated our zero-day response playbook.
Within [X] minutes, we confirmed [N] affected systems. No evidence of
exploitation was found. Workarounds were applied within [X] minutes,
and the vendor patch was deployed within [X] hours.
Total business disruption: [X] hours of [service] unavailability.
No data loss or unauthorized access detected.
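Filling templates by hand under pressure invites unfilled placeholders in messages that reach executives. The sketch below renders a shortened version of the internal notification with Python's `string.Template`. The field names and sample values are hypothetical. `substitute()` raises `KeyError` when a field is missing, so an incomplete notice fails instead of being sent.

```python
from string import Template

# Abbreviated form of the internal stakeholder notification above.
INTERNAL_NOTICE = Template(
    "SUBJECT: [URGENT] Zero-Day Vulnerability Response: $product\n"
    "STATUS: Active Response\n"
    "SEVERITY: $severity\n"
    "AFFECTED SYSTEMS: $systems\n"
    "NEXT UPDATE: $next_update\n"
    "IR Lead: $ir_lead | $contact\n"
)


def render_notice(**fields):
    """Render the notice; raises KeyError if any placeholder is missing."""
    return INTERNAL_NOTICE.substitute(**fields)


# Hypothetical values for illustration only.
notice = render_notice(
    product="EdgeGuard VPN",
    severity="Critical (CVSS 9.8)",
    systems="4 EdgeGuard VPN appliances",
    next_update="09:00 UTC",
    ir_lead="[Name]",
    contact="[Contact]",
)
print(notice)
```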
5. Key Takeaways
- Preparation beats reaction — asset inventories, behavioral detections, and rehearsed playbooks compress response time from days to hours
- Hunt backward — when a zero-day drops, assume it was exploited before disclosure and hunt 30-90 days back
- Workarounds first, patches second — don't wait for the patch to act; disable, isolate, or restrict immediately
- Behavioral detection > signature detection — generic detections for "web server spawning shell" catch zero-days that IOC-based rules miss
- Communication is a control — stakeholders who don't know what's happening make decisions that undermine your response
Related Resources
- Zero-Day Response Playbook — full operational playbook
- Chapter 8: Incident Response — IR lifecycle fundamentals
- Chapter 29: Vulnerability Management — vulnerability assessment and remediation
- Chapter 38: Advanced Threat Hunting — proactive hunting techniques
- Chapter 4: Detection Engineering — building behavioral detections
- SC-026: Zero-Day Exploitation — attack scenario
- Detection Query Library — pre-built KQL/SPL queries