Zero-Day Response Playbook: From Discovery to Recovery

A zero-day drops. No patch exists. Your threat intel feed lights up. The clock starts. What you do in the first 60 minutes determines whether this becomes a contained incident or a headline-making breach.

This post walks through a complete zero-day response lifecycle — from the moment you learn about a new vulnerability through containment, threat hunting, and recovery. We follow Meridian Healthcare (fictional) through their response to a critical zero-day in their edge gateway appliance, with detection queries, decision trees, and communication templates you can adapt immediately.

1. The Zero-Day Reality

Zero-day vulnerabilities occupy a unique space in security operations. Unlike known CVEs with patches and signatures, zero-days force defenders to operate without their usual safety nets:

No patch available — vendor is still developing a fix
No signatures — IDS/IPS rules don't detect the exploitation
Limited IOCs — threat intel is sparse and evolving rapidly
Uncertainty — scope of exploitation is unknown

By some industry estimates, the average time from zero-day disclosure to first exploitation in the wild has dropped from 14 days in 2020 to under 24 hours in 2025. For critical infrastructure, that window is often measured in minutes.

The Golden Hour

The first 60 minutes after zero-day awareness determine your outcome. Organizations with rehearsed playbooks contain incidents 74% faster than those without (Mandiant M-Trends 2025).

2. Zero-Day Response Framework

Phase 0: Preparation (Before the Zero-Day)

The best zero-day response starts months before the vulnerability exists.

Asset Inventory

You cannot protect what you cannot find. Maintain a continuously updated inventory of:

Asset Category	Key Data Points	Update Frequency
Network appliances	Vendor, model, firmware version, exposure	Weekly
Web applications	Framework, dependencies, internet-facing	Daily
Endpoints	OS version, patch level, installed software	Real-time (EDR)
Cloud services	Provider, service, API versions, configurations	Daily
Third-party integrations	Vendor, data flow, access scope	Monthly
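
One way to keep the inventory honest is to flag records that have not been refreshed within the cadence listed in the table. The sketch below is illustrative only: the record fields (`name`, `category`, `last_updated`) are hypothetical, and "Real-time (EDR)" is approximated as one hour.

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable record age per asset category, mirroring the table above.
# "Real-time (EDR)" is approximated as one hour; that value is an assumption.
MAX_AGE = {
    "Network appliances": timedelta(weeks=1),
    "Web applications": timedelta(days=1),
    "Endpoints": timedelta(hours=1),
    "Cloud services": timedelta(days=1),
    "Third-party integrations": timedelta(days=30),
}


def stale_assets(inventory, now):
    """Return names of assets whose record is older than its category allows."""
    return [
        asset["name"]
        for asset in inventory
        if now - asset["last_updated"] > MAX_AGE[asset["category"]]
    ]


# Hypothetical sample records.
now = datetime(2026, 9, 16, 7, 0, tzinfo=timezone.utc)
inventory = [
    {"name": "vpn-01", "category": "Network appliances",
     "last_updated": now - timedelta(days=3)},
    {"name": "portal", "category": "Web applications",
     "last_updated": now - timedelta(days=2)},
]
print(stale_assets(inventory, now))  # ['portal']
```

A stale record is itself a finding: during a zero-day you want to know which parts of the inventory you cannot trust before you rely on it for exposure mapping.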

Pre-positioned Detection

Deploy behavioral detections that catch exploitation patterns regardless of the specific vulnerability:

KQL (Microsoft Sentinel)

// Anomalous process execution from network appliance management interfaces
DeviceProcessEvents
| where Timestamp > ago(24h)
| where InitiatingProcessFileName in~ ("httpd", "nginx", "sshd", "java", "java.exe")
| where FileName in~ ("cmd.exe", "powershell.exe", "bash", "sh", "python", "perl")
| where ProcessCommandLine has_any ("whoami", "id", "net user", "cat /etc/passwd")
| project Timestamp, DeviceName, InitiatingProcessFileName, FileName, ProcessCommandLine
| sort by Timestamp desc

SPL (Splunk)

index=endpoint sourcetype=sysmon EventCode=1
| search ParentImage IN ("*httpd*", "*nginx*", "*sshd*", "*java*")
| search Image IN ("*cmd.exe", "*powershell.exe", "*/bash", "*/sh", "*python*", "*perl*")
| regex CommandLine="(?i)\b(whoami|id|net user|cat /etc/passwd)\b"
| table _time, host, ParentImage, Image, CommandLine
| sort - _time

Phase 1: Awareness (T+0 to T+15 Minutes)

Intelligence Intake

Zero-day awareness typically arrives through one of these channels:

Vendor advisory — official disclosure with CVSS score
Threat intel feed — CISA KEV, commercial feeds, ISAC alerts
Social media / researcher disclosure — Twitter/X, Mastodon, security blogs
Internal detection — anomalous behavior matching exploitation patterns
Law enforcement notification — FBI, CISA direct contact

Verify Before Acting

Not every "zero-day" tweet is real. Before activating your response:

Confirm the vendor/product is in your environment
Verify the source credibility (official advisory > researcher > anonymous)
Check if a CVE has been assigned
Assess exploitability (remote code execution vs. local privilege escalation)

Initial Triage Decision Tree

Is the affected product in our environment?
├── NO → Monitor only, update threat intel
└── YES → Is it internet-facing?
    ├── YES → CRITICAL — activate IR team immediately
    └── NO → Is there confirmed exploitation in the wild?
        ├── YES → HIGH — activate IR team within 1 hour
        └── NO → MEDIUM — schedule assessment within 4 hours
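
The decision tree above can be encoded as a small function so that triage is applied the same way on every shift. This is a minimal sketch; the function name and the `MONITOR` label for the "not in our environment" branch are illustrative choices, not part of any standard.

```python
def triage(in_environment: bool, internet_facing: bool, exploited_in_wild: bool):
    """Map the three triage questions to a (severity, action) pair."""
    if not in_environment:
        return ("MONITOR", "Monitor only, update threat intel")
    if internet_facing:
        return ("CRITICAL", "Activate IR team immediately")
    if exploited_in_wild:
        return ("HIGH", "Activate IR team within 1 hour")
    return ("MEDIUM", "Schedule assessment within 4 hours")


# Example: product present, internet-facing, no confirmed exploitation yet.
severity, action = triage(True, True, False)
print(severity, "-", action)
```

Note that the internet-facing check comes before the exploitation check, exactly as in the tree: an exposed appliance is treated as critical even when no exploitation has been confirmed.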

Phase 2: Assessment (T+15 to T+60 Minutes)

Exposure Mapping

Identify every instance of the vulnerable component:

KQL

// Find all instances of vulnerable appliance (example: EdgeGuard VPN)
DeviceNetworkEvents
| where Timestamp > ago(7d)
| where RemoteUrl has "edgeguard" and RemotePort in (443, 8443, 10443)
| summarize 
    Connections = count(),
    UniqueDevices = dcount(DeviceName),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by RemoteIP, RemotePort
| sort by Connections desc

SPL

index=network sourcetype=firewall
| where dest_port IN (443, 8443, 10443)
| where app="edgeguard-vpn"
| stats count as Connections, dc(src) as UniqueClients, 
        earliest(_time) as FirstSeen, latest(_time) as LastSeen by dest, dest_port
| sort - Connections
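
Log queries only surface appliances that generated traffic during the lookback window, so it can help to cross-check the result against the asset inventory. The sketch below assumes a hypothetical inventory format with `name`, `product`, and `internet_facing` fields; the host names are invented for illustration.

```python
def map_exposure(inventory, product):
    """Summarize every inventory record matching `product`, split by exposure."""
    matches = [a for a in inventory if a["product"].lower() == product.lower()]
    return {
        "total": len(matches),
        "internet_facing": sorted(a["name"] for a in matches if a["internet_facing"]),
        "internal": sorted(a["name"] for a in matches if not a["internet_facing"]),
    }


# Hypothetical inventory records.
inventory = [
    {"name": "vpn-01", "product": "EdgeGuard VPN", "internet_facing": True},
    {"name": "vpn-02", "product": "EdgeGuard VPN", "internet_facing": True},
    {"name": "lab-vpn", "product": "edgeguard vpn", "internet_facing": False},
    {"name": "web-01", "product": "nginx", "internet_facing": True},
]
print(map_exposure(inventory, "EdgeGuard VPN"))
```

Any appliance that appears in the inventory but not in the query output (or the reverse) deserves a manual look before the assessment phase closes.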

Exploitation Check

Hunt for evidence that the vulnerability has already been exploited:

KQL

// Check for post-exploitation indicators on network appliances
DeviceProcessEvents
| where Timestamp > ago(30d)
| where DeviceName has_any ("vpn", "gateway", "edge", "fw")
| where FileName in~ ("curl", "wget", "certutil.exe", "bitsadmin.exe")
| where ProcessCommandLine has_any ("http://", "https://", "ftp://")
| project Timestamp, DeviceName, FileName, ProcessCommandLine, AccountName
| sort by Timestamp desc

SPL

index=endpoint sourcetype=sysmon EventCode=1
| search host IN ("vpn-*", "gw-*", "edge-*", "fw-*")
| search Image IN ("*curl*", "*wget*", "*certutil*", "*bitsadmin*")
| search CommandLine IN ("*http://*", "*https://*", "*ftp://*")
| table _time, host, Image, CommandLine, User
| sort - _time

Phase 3: Containment (T+1 Hour to T+4 Hours)

Immediate Actions

Priority	Action	Owner	Timeline
P0	Block exploitation at WAF/IPS (generic rules)	Network team	Immediate
P0	Isolate confirmed-compromised systems	SOC / IR	Immediate
P1	Disable vulnerable service if non-critical	App owner	30 min
P1	Implement vendor-recommended workaround	Sysadmin	1 hour
P2	Increase logging verbosity on affected systems	SOC	1 hour
P2	Deploy additional monitoring rules	Detection eng	2 hours
P3	Notify executive stakeholders	CISO	2 hours
P3	Engage external IR if needed	IR lead	4 hours
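
To keep the containment table actionable during an incident, the timelines can be turned into deadlines and checked against the clock. The sketch below assumes the "Timeline" column is measured from the moment containment starts (the table does not say so explicitly) and treats "Immediate" as zero minutes.

```python
from datetime import datetime, timedelta, timezone

# (priority, action, owner, minutes allowed after containment starts)
ACTIONS = [
    ("P0", "Block exploitation at WAF/IPS", "Network team", 0),
    ("P0", "Isolate confirmed-compromised systems", "SOC / IR", 0),
    ("P1", "Disable vulnerable service if non-critical", "App owner", 30),
    ("P1", "Implement vendor-recommended workaround", "Sysadmin", 60),
    ("P2", "Increase logging verbosity on affected systems", "SOC", 60),
    ("P2", "Deploy additional monitoring rules", "Detection eng", 120),
    ("P3", "Notify executive stakeholders", "CISO", 120),
    ("P3", "Engage external IR if needed", "IR lead", 240),
]


def overdue(actions, start, now, done):
    """Return (priority, action, owner) for unfinished actions past their deadline."""
    return [
        (priority, action, owner)
        for priority, action, owner, minutes in actions
        if action not in done and now > start + timedelta(minutes=minutes)
    ]


# Hypothetical incident clock: containment began at 07:00, it is now 07:45.
start = datetime(2026, 9, 16, 7, 0, tzinfo=timezone.utc)
now = start + timedelta(minutes=45)
done = {"Block exploitation at WAF/IPS", "Isolate confirmed-compromised systems"}
for item in overdue(ACTIONS, start, now, done):
    print(item)
```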

Network Containment

# Example: Emergency ACL to block exploitation attempts
# Appliance management interface: restrict to jump hosts only
# (Adapt to your firewall platform. Most ACLs stop at the first matching
# rule, so the permits must come before the denies.)

# Allow only from authorized management subnet
permit tcp 10.250.0.0/24 host 198.51.100.10 eq 443
permit tcp 10.250.0.0/24 host 198.51.100.10 eq 22

# Block all other access to management ports
deny tcp any host 198.51.100.10 eq 443
deny tcp any host 198.51.100.10 eq 8443
deny tcp any host 198.51.100.10 eq 22
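
On platforms that evaluate ACLs top-down and stop at the first match, the relative order of permit and deny entries decides whether the management subnet keeps access. The sketch below is a toy first-match evaluator, not any vendor's engine; the rule tuples and the implicit-deny default are assumptions made for illustration.

```python
import ipaddress

# (action, source network, destination port), evaluated top-down.
PERMIT_FIRST = [
    ("permit", "10.250.0.0/24", 443),
    ("permit", "10.250.0.0/24", 22),
    ("deny", "0.0.0.0/0", 443),
    ("deny", "0.0.0.0/0", 8443),
    ("deny", "0.0.0.0/0", 22),
]
# The same entries with the denies listed first.
DENY_FIRST = PERMIT_FIRST[2:] + PERMIT_FIRST[:2]


def evaluate(rules, src_ip, dst_port):
    """Return the action of the first rule matching the packet, else 'deny'."""
    source = ipaddress.ip_address(src_ip)
    for action, network, port in rules:
        if port == dst_port and source in ipaddress.ip_network(network):
            return action
    return "deny"  # implicit deny, assumed here as on many platforms


print(evaluate(PERMIT_FIRST, "10.250.0.5", 443))  # permit
print(evaluate(DENY_FIRST, "10.250.0.5", 443))    # deny
```

With the denies first, the jump hosts are locked out along with everyone else, which is the last thing you want while responders still need to reach the appliance.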

Phase 4: Eradication (T+4 Hours to T+48 Hours)

Once the vulnerability is contained, focus shifts to removing any attacker persistence:

Persistence Hunt Checklist

[ ] Check for new user accounts created during the exploitation window
[ ] Review scheduled tasks / cron jobs added recently
[ ] Inspect web shells in web-accessible directories
[ ] Check for SSH key additions in authorized_keys
[ ] Review certificate changes and new TLS certificates
[ ] Inspect startup scripts and init systems
[ ] Check for modified system binaries (file integrity monitoring)

KQL

// Hunt for web shells deployed during exploitation window
DeviceFileEvents
| where Timestamp between (datetime(2026-09-15) .. datetime(2026-09-17 23:59:59))
| where FolderPath has_any ("wwwroot", "htdocs", "html", "webapps")
| where FileName endswith ".php" or FileName endswith ".jsp" 
    or FileName endswith ".aspx" or FileName endswith ".py"
| where ActionType == "FileCreated"
| project Timestamp, DeviceName, FolderPath, FileName, SHA256, InitiatingProcessFileName
| sort by Timestamp desc

SPL

index=endpoint sourcetype=sysmon EventCode=11 earliest="09/15/2026:00:00:00" latest="09/17/2026:23:59:59"
| search TargetFilename IN ("*wwwroot*", "*htdocs*", "*webapps*")
| search TargetFilename IN ("*.php", "*.jsp", "*.aspx")
| table _time, host, TargetFilename, hashes, Image
| sort - _time
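
Where a host has no EDR or Sysmon coverage, a direct filesystem sweep can stand in for the queries above. The following is a minimal sketch that lists web-script files modified inside a suspected exploitation window; the function name and extension set are illustrative, and modification times can be forged by an attacker, so an empty result is not proof of a clean system.

```python
from datetime import datetime, timezone
from pathlib import Path

# Script extensions commonly used for web shells (same set as the KQL hunt).
WEB_EXTENSIONS = {".php", ".jsp", ".aspx", ".py"}


def files_modified_in_window(root, start, end):
    """List web-script files under `root` whose mtime falls within [start, end]."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in WEB_EXTENSIONS:
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if start <= modified <= end:
                hits.append(path)
    return sorted(hits)
```

A call such as `files_modified_in_window("/var/www", window_start, window_end)` with timezone-aware datetimes returns the candidate paths; each hit should then be hashed and compared against a known-good baseline.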

Phase 5: Recovery (T+48 Hours to T+7 Days)

Patch Deployment

When the vendor releases a patch:

Test in staging — deploy patch to non-production first (minimum 2 hours observation)
Phased rollout — internet-facing systems first, then internal
Verify remediation — run vulnerability scanner to confirm patch effectiveness
Remove workarounds — reverse any temporary mitigations that may impact functionality
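
The first two steps above (a staging soak, then internet-facing systems before internal ones) can be expressed as a simple gate and ordering. This is a sketch under assumptions: the asset fields and host names are hypothetical, and the two-hour minimum is taken from the checklist.

```python
from datetime import datetime, timedelta, timezone

MIN_OBSERVATION = timedelta(hours=2)  # staging soak time from the checklist


def staging_complete(staged_at, now):
    """True once the patch has been observed in staging for the minimum period."""
    return now - staged_at >= MIN_OBSERVATION


def rollout_waves(assets):
    """Split assets into waves: internet-facing first, then internal."""
    first = sorted(a["name"] for a in assets if a["internet_facing"])
    second = sorted(a["name"] for a in assets if not a["internet_facing"])
    return [wave for wave in (first, second) if wave]


# Hypothetical assets and staging time.
assets = [
    {"name": "vpn-int-01", "internet_facing": False},
    {"name": "vpn-02", "internet_facing": True},
    {"name": "vpn-01", "internet_facing": True},
]
staged_at = datetime(2026, 9, 16, 10, 0, tzinfo=timezone.utc)
if staging_complete(staged_at, staged_at + timedelta(hours=2)):
    print(rollout_waves(assets))
```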

Validation Queries

KQL

// Verify no exploitation attempts after patching
CommonSecurityLog
| where TimeGenerated > ago(7d)
| where DeviceVendor == "EdgeGuard" and DeviceProduct == "VPN"
| where Activity has_any ("exploit", "overflow", "injection", "traversal")
| summarize AttemptCount = count() by bin(TimeGenerated, 1h), SourceIP
| sort by TimeGenerated desc

SPL

index=network sourcetype=edgeguard earliest=-7d
| search signature IN ("*exploit*", "*overflow*", "*injection*", "*traversal*")
| timechart span=1h count by src

3. Case Study: Meridian Healthcare

Scenario: EdgeGuard VPN Zero-Day (Fictional)

Organization: Meridian Healthcare (fictional, 12,000 employees, 3 hospitals)
Vulnerability: Remote code execution in EdgeGuard VPN appliance (CVE-2026-XXXX)
CVSS: 9.8 (Critical), unauthenticated RCE via crafted SAML assertion
Initial awareness: CISA emergency directive, 06:42 UTC

Timeline

Time	Event	Action
06:42	CISA emergency directive received	SOC manager alerted via PagerDuty
06:55	Confirmed 4 EdgeGuard appliances in environment	All internet-facing
07:10	Threat hunt initiated — checked 30 days of logs	No IOCs found (clean)
07:30	Emergency CAB convened	Approved immediate workaround
07:45	SAML authentication disabled on all appliances	Switched to certificate-based auth
08:15	Additional monitoring rules deployed	Sysmon + NetFlow on appliance subnets
10:00	Vendor releases emergency patch	Staged in test environment
14:00	Patch deployed to production appliances	Phased rollout across the 4 appliances
16:00	SAML re-enabled with patched firmware	Full functionality restored
18:00	Post-incident review scheduled	Lessons learned session for next week
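
The timeline lends itself to simple response metrics, such as minutes from awareness to workaround. The sketch below computes elapsed minutes for selected rows of the table; it assumes all times fall on the same day, and the shortened event labels are paraphrases of the table entries.

```python
from datetime import datetime

# (time, event) pairs taken from the timeline above, labels shortened.
TIMELINE = [
    ("06:42", "Directive received"),
    ("06:55", "Affected appliances confirmed"),
    ("07:45", "Workaround applied"),
    ("10:00", "Vendor patch released"),
    ("14:00", "Patch deployed to production"),
    ("16:00", "Full functionality restored"),
]


def minutes_since_awareness(timeline):
    """Return {event: minutes elapsed since the first entry} for same-day times."""
    def parse(hhmm):
        return datetime.strptime(hhmm, "%H:%M")

    start = parse(timeline[0][0])
    return {
        event: int((parse(clock) - start).total_seconds() // 60)
        for clock, event in timeline
    }


for event, minutes in minutes_since_awareness(TIMELINE).items():
    print(f"T+{minutes:>3} min  {event}")
```

For Meridian this gives 13 minutes to confirm exposure and 63 minutes to apply the workaround, which is the kind of figure worth tracking across tabletop exercises.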

What Went Right

Asset inventory was current — 4 appliances identified in under 15 minutes
Pre-positioned behavioral detections gave the team a way to hunt for exploitation patterns before any IOCs were published
Rehearsed playbook — team followed zero-day playbook without improvising
Fallback authentication — certificate-based auth was already configured

What Needed Improvement

No offline backup for VPN access — remote workers lost access for 2 hours
Patch testing environment didn't match production — 30-minute delay finding compatible test appliance
Communication gaps — clinical staff weren't notified about VPN disruption until 45 minutes after workaround

4. Communication Templates

Internal Stakeholder Notification

SUBJECT: [URGENT] Zero-Day Vulnerability Response — [Product Name]

STATUS: Active Response
SEVERITY: Critical (CVSS 9.8)
AFFECTED SYSTEMS: [List]

CURRENT ACTIONS:
- Workaround applied at [time]
- Threat hunt in progress — no evidence of exploitation
- Vendor patch expected [timeframe]

BUSINESS IMPACT:
- [Service X] temporarily unavailable
- Workaround in place — [alternative access method]

NEXT UPDATE: [Time]

IR Lead: [Name] | [Contact]
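
Pre-loading the template into a script saves minutes when the clock is running. The sketch below renders a shortened version of the notification with Python's `string.Template`; the `$` placeholder names replace the bracketed fields, and the sample values (patch estimate, next update time) are invented for illustration.

```python
from string import Template

NOTIFICATION = Template("""\
SUBJECT: [URGENT] Zero-Day Vulnerability Response: $product

STATUS: Active Response
SEVERITY: $severity
AFFECTED SYSTEMS: $systems

CURRENT ACTIONS:
- Workaround applied at $workaround_time
- Vendor patch expected $patch_eta

NEXT UPDATE: $next_update
""")


def render_notification(**fields):
    """Fill the template; raises KeyError if any placeholder is left unset."""
    return NOTIFICATION.substitute(**fields)


# Sample values are placeholders, not real incident data.
print(render_notification(
    product="EdgeGuard VPN",
    severity="Critical (CVSS 9.8)",
    systems="4 internet-facing appliances",
    workaround_time="07:45 UTC",
    patch_eta="within 24 hours",
    next_update="09:00 UTC",
))
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing field fails loudly instead of sending stakeholders a notice with an unfilled placeholder.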

Board/Executive Summary

SUBJECT: Zero-Day Incident Summary — [Date]

A critical vulnerability (CVE-XXXX-XXXX) was discovered in [product],
which is used in our environment for [purpose]. 

We were notified at [time] and activated our zero-day response playbook.
Within [X] minutes, we confirmed [N] affected systems. No evidence of
exploitation was found. Workarounds were applied within [X] minutes,
and the vendor patch was deployed within [X] hours.

Total business disruption: [X] hours of [service] unavailability.
No data loss or unauthorized access detected.

5. Key Takeaways

Preparation beats reaction — asset inventories, behavioral detections, and rehearsed playbooks compress response time from days to hours
Hunt backward — when a zero-day drops, assume it was exploited before disclosure and hunt 30-90 days back
Workarounds first, patches second — don't wait for the patch to act; disable, isolate, or restrict immediately
Behavioral detection > signature detection — generic detections for "web server spawning shell" catch zero-days that IOC-based rules miss
Communication is a control — stakeholders who don't know what's happening make decisions that undermine your response

Related Resources

Zero-Day Response Playbook — full operational playbook
Chapter 8: Incident Response — IR lifecycle fundamentals
Chapter 29: Vulnerability Management — vulnerability assessment and remediation
Chapter 38: Advanced Threat Hunting — proactive hunting techniques
Chapter 4: Detection Engineering — building behavioral detections
SC-026: Zero-Day Exploitation — attack scenario
Detection Query Library — pre-built KQL/SPL queries
