
Zero-Day Response Playbook: From Discovery to Recovery

A zero-day drops. No patch exists. Your threat intel feed lights up. The clock starts. What you do in the first 60 minutes determines whether this becomes a contained incident or a headline-making breach.

This post walks through a complete zero-day response lifecycle — from the moment you learn about a new vulnerability through containment, threat hunting, and recovery. We follow Meridian Healthcare (fictional) through their response to a critical zero-day in their edge gateway appliance, with detection queries, decision trees, and communication templates you can adapt immediately.


1. The Zero-Day Reality

Zero-day vulnerabilities occupy a unique space in security operations. Unlike known CVEs with patches and signatures, zero-days force defenders to operate without their usual safety nets:

  • No patch available — vendor is still developing a fix
  • No signatures — IDS/IPS rules don't detect the exploitation
  • Limited IOCs — threat intel is sparse and evolving rapidly
  • Uncertainty — scope of exploitation is unknown

The average time from zero-day disclosure to first exploitation in the wild has dropped from 14 days in 2020 to under 24 hours in 2025. For critical infrastructure, that window is often measured in minutes.

The Golden Hour

The first 60 minutes after zero-day awareness determine your outcome. Organizations with rehearsed playbooks contain incidents 74% faster than those without (Mandiant M-Trends 2025).


2. Zero-Day Response Framework

Phase 0: Preparation (Before the Zero-Day)

The best zero-day response starts months before the vulnerability exists.

Asset Inventory

You cannot protect what you cannot find. Maintain a continuously updated inventory of:

Asset Category           | Key Data Points                                  | Update Frequency
-------------------------|--------------------------------------------------|-----------------
Network appliances       | Vendor, model, firmware version, exposure        | Weekly
Web applications         | Framework, dependencies, internet-facing         | Daily
Endpoints                | OS version, patch level, installed software      | Real-time (EDR)
Cloud services           | Provider, service, API versions, configurations  | Daily
Third-party integrations | Vendor, data flow, access scope                  | Monthly
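When an advisory lands, the first question is always "do we run this?" A minimal sketch of that inventory lookup, assuming a hypothetical in-memory inventory (in practice this would be a CMDB or EDR export; the `Appliance` fields and sample hostnames are illustrative, not from any real product):

```python
from dataclasses import dataclass

@dataclass
class Appliance:
    hostname: str
    vendor: str
    model: str
    firmware: str
    internet_facing: bool

# Hypothetical inventory snapshot; real data would come from a CMDB/EDR export.
INVENTORY = [
    Appliance("vpn-dmz-01", "EdgeGuard", "VPN-5000", "9.2.1", True),
    Appliance("gw-core-01", "OtherVendor", "GW-100", "4.0.3", False),
]

def affected(vendor: str, model: str, inventory=INVENTORY):
    """Return appliances matching the advisory's vendor and model."""
    return [a for a in inventory
            if a.vendor.lower() == vendor.lower()
            and a.model.lower() == model.lower()]
```

The point is that this query should take seconds, not hours — which is only possible if the inventory is maintained at the frequencies in the table above.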

Pre-positioned Detection

Deploy behavioral detections that catch exploitation patterns regardless of the specific vulnerability:

// Anomalous process execution from network appliance management interfaces
DeviceProcessEvents
| where Timestamp > ago(24h)
| where InitiatingProcessFileName in ("httpd", "nginx", "sshd", "java")
| where FileName in ("cmd.exe", "powershell.exe", "bash", "sh", "python", "perl")
| where ProcessCommandLine has_any ("whoami", "id", "net user", "cat /etc/passwd")
| project Timestamp, DeviceName, InitiatingProcessFileName, FileName, ProcessCommandLine
| sort by Timestamp desc
Splunk equivalent:

index=endpoint sourcetype=sysmon EventCode=1
| search ParentImage IN ("*httpd*", "*nginx*", "*sshd*", "*java*")
| search Image IN ("*cmd.exe", "*powershell.exe", "*bash", "*sh", "*python*", "*perl*")
| search CommandLine IN ("*whoami*", "*id*", "*net user*", "*cat /etc/passwd*")
| table _time, host, ParentImage, Image, CommandLine
| sort - _time

Phase 1: Awareness (T+0 to T+15 Minutes)

Intelligence Intake

Zero-day awareness typically arrives through one of these channels:

  1. Vendor advisory — official disclosure with CVSS score
  2. Threat intel feed — CISA KEV, commercial feeds, ISAC alerts
  3. Social media / researcher disclosure — Twitter/X, Mastodon, security blogs
  4. Internal detection — anomalous behavior matching exploitation patterns
  5. Law enforcement notification — FBI, CISA direct contact

Verify Before Acting

Not every "zero-day" tweet is real. Before activating your response:

  • Confirm the vendor/product is in your environment
  • Verify the source credibility (official advisory > researcher > anonymous)
  • Check if a CVE has been assigned
  • Assess exploitability (remote code execution vs. local privilege escalation)

Initial Triage Decision Tree

Is the affected product in our environment?
├── NO → Monitor only, update threat intel
└── YES → Is it internet-facing?
    ├── YES → CRITICAL — activate IR team immediately
    └── NO → Is there confirmed exploitation in the wild?
        ├── YES → HIGH — activate IR team within 1 hour
        └── NO → MEDIUM — schedule assessment within 4 hours
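The tree above is simple enough to encode directly, so triage gives the same answer at 3 a.m. as it does in a tabletop exercise. A minimal sketch (function name and return shape are illustrative):

```python
def triage(in_environment: bool, internet_facing: bool,
           exploited_in_wild: bool) -> tuple[str, str]:
    """Map the initial triage decision tree to a (severity, action) pair."""
    if not in_environment:
        return ("NONE", "monitor only, update threat intel")
    if internet_facing:
        return ("CRITICAL", "activate IR team immediately")
    if exploited_in_wild:
        return ("HIGH", "activate IR team within 1 hour")
    return ("MEDIUM", "schedule assessment within 4 hours")
```

Wiring this into an intake form or chat-ops bot removes one judgment call from the first fifteen minutes.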

Phase 2: Assessment (T+15 to T+60 Minutes)

Exposure Mapping

Identify every instance of the vulnerable component:

// Find all instances of vulnerable appliance (example: EdgeGuard VPN)
DeviceNetworkEvents
| where Timestamp > ago(7d)
| where RemoteUrl has "edgeguard" or RemotePort in (443, 8443, 10443)
| summarize 
    Connections = count(),
    UniqueDevices = dcount(DeviceName),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by RemoteIP, RemotePort
| sort by Connections desc
Splunk equivalent:

index=network sourcetype=firewall dest_port IN (443, 8443, 10443) app="edgeguard-vpn"
| stats count as Connections, dc(src) as UniqueClients,
        earliest(_time) as FirstSeen, latest(_time) as LastSeen by dest, dest_port
| sort - Connections

Exploitation Check

Hunt for evidence that the vulnerability has already been exploited:

// Check for post-exploitation indicators on network appliances
DeviceProcessEvents
| where Timestamp > ago(30d)
| where DeviceName has_any ("vpn", "gateway", "edge", "fw")
| where FileName in ("curl", "wget", "certutil.exe", "bitsadmin.exe")
| where ProcessCommandLine has_any ("http://", "https://", "ftp://")
| project Timestamp, DeviceName, FileName, ProcessCommandLine, AccountName
| sort by Timestamp desc
Splunk equivalent:

index=endpoint sourcetype=sysmon EventCode=1 host IN ("vpn-*", "gw-*", "edge-*", "fw-*")
| search Image IN ("*curl*", "*wget*", "*certutil*", "*bitsadmin*")
| search CommandLine IN ("*http://*", "*https://*", "*ftp://*")
| table _time, host, Image, CommandLine, User
| sort - _time

Phase 3: Containment (T+1 Hour to T+4 Hours)

Immediate Actions

Priority | Action                                         | Owner         | Timeline
---------|------------------------------------------------|---------------|----------
P0       | Block exploitation at WAF/IPS (generic rules)  | Network team  | Immediate
P0       | Isolate confirmed-compromised systems          | SOC / IR      | Immediate
P1       | Disable vulnerable service if non-critical     | App owner     | 30 min
P1       | Implement vendor-recommended workaround        | Sysadmin      | 1 hour
P2       | Increase logging verbosity on affected systems | SOC           | 1 hour
P2       | Deploy additional monitoring rules             | Detection eng | 2 hours
P3       | Notify executive stakeholders                  | CISO          | 2 hours
P3       | Engage external IR if needed                   | IR lead       | 4 hours

Network Containment

# Example: Emergency ACL to block exploitation attempts
# Appliance management interface — restrict to jump hosts only
# (Adapt to your firewall platform)

# Block external access to management ports
deny tcp any host 198.51.100.10 eq 443
deny tcp any host 198.51.100.10 eq 8443
deny tcp any host 198.51.100.10 eq 22

# Allow only from authorized management subnet
permit tcp 10.250.0.0/24 host 198.51.100.10 eq 443
permit tcp 10.250.0.0/24 host 198.51.100.10 eq 22

Phase 4: Eradication (T+4 Hours to T+48 Hours)

Once the vulnerability is contained, focus shifts to removing any attacker persistence:

Persistence Hunt Checklist

  • [ ] Check for new user accounts created during the exploitation window
  • [ ] Review scheduled tasks / cron jobs added recently
  • [ ] Inspect web shells in web-accessible directories
  • [ ] Check for SSH key additions in authorized_keys
  • [ ] Review certificate changes and new TLS certificates
  • [ ] Inspect startup scripts and init systems
  • [ ] Check for modified system binaries (file integrity monitoring)

// Hunt for web shells deployed during exploitation window
DeviceFileEvents
| where Timestamp between (datetime(2026-09-15) .. datetime(2026-09-17))
| where FolderPath has_any ("wwwroot", "htdocs", "html", "webapps")
| where FileName endswith_cs ".php" or FileName endswith_cs ".jsp" 
    or FileName endswith_cs ".aspx" or FileName endswith_cs ".py"
| where ActionType == "FileCreated"
| project Timestamp, DeviceName, FolderPath, FileName, SHA256, InitiatingProcessFileName
| sort by Timestamp desc
Splunk equivalent:

index=endpoint sourcetype=sysmon EventCode=11 earliest="09/15/2026:00:00:00" latest="09/17/2026:23:59:59"
| search TargetFilename IN ("*wwwroot*", "*htdocs*", "*html*", "*webapps*")
| search TargetFilename IN ("*.php", "*.jsp", "*.aspx", "*.py")
| table _time, host, TargetFilename, Hashes, Image
| sort - _time
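On appliances where you only have a file listing rather than EDR telemetry, the same web-shell check can be run offline. A minimal sketch over (path, mtime) pairs — the web-root names and extensions below mirror the queries above but are assumptions, not a complete list:

```python
from datetime import datetime

# Assumed web-accessible directories and script extensions; extend per platform.
WEB_ROOTS = ("wwwroot", "htdocs", "html", "webapps")
SHELL_EXTS = (".php", ".jsp", ".aspx", ".py")

def suspicious_files(files, window_start, window_end):
    """files: iterable of (path, mtime) pairs.
    Return paths of script files in a web root modified inside the
    exploitation window."""
    hits = []
    for path, mtime in files:
        in_root = any(root in path for root in WEB_ROOTS)
        scripty = path.endswith(SHELL_EXTS)
        if in_root and scripty and window_start <= mtime <= window_end:
            hits.append(path)
    return hits
```

Feed it the output of a recursive directory listing pulled from the appliance's support shell or a forensic image.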

Phase 5: Recovery (T+48 Hours to T+7 Days)

Patch Deployment

When the vendor releases a patch:

  1. Test in staging — deploy patch to non-production first (minimum 2 hours observation)
  2. Phased rollout — internet-facing systems first, then internal
  3. Verify remediation — run vulnerability scanner to confirm patch effectiveness
  4. Remove workarounds — reverse any temporary mitigations that may impact functionality

Validation Queries

// Verify no exploitation attempts after patching
CommonSecurityLog
| where TimeGenerated > ago(7d)
| where DeviceVendor == "EdgeGuard" and DeviceProduct == "VPN"
| where Activity has_any ("exploit", "overflow", "injection", "traversal")
| summarize AttemptCount = count() by bin(TimeGenerated, 1h), SourceIP
| sort by TimeGenerated desc
Splunk equivalent:

index=network sourcetype=edgeguard
| search signature IN ("*exploit*", "*overflow*", "*injection*", "*traversal*")
| timechart span=1h count by src

3. Case Study: Meridian Healthcare

Scenario: EdgeGuard VPN Zero-Day (Fictional)

Organization: Meridian Healthcare (fictional, 12,000 employees, 3 hospitals)
Vulnerability: Remote code execution in EdgeGuard VPN appliance (CVE-2026-XXXX)
CVSS: 9.8 (Critical), unauthenticated RCE via crafted SAML assertion
Initial awareness: CISA emergency directive, 06:42 UTC

Timeline

Time  | Event                                           | Action
------|-------------------------------------------------|---------------------------------------
06:42 | CISA emergency directive received               | SOC manager alerted via PagerDuty
06:55 | Confirmed 4 EdgeGuard appliances in environment | All internet-facing
07:10 | Threat hunt initiated, checked 30 days of logs  | No IOCs found (clean)
07:30 | Emergency CAB convened                          | Approved immediate workaround
07:45 | SAML authentication disabled on all appliances  | Switched to certificate-based auth
08:15 | Additional monitoring rules deployed            | Sysmon + NetFlow on appliance subnets
10:00 | Vendor releases emergency patch                 | Staged in test environment
14:00 | Patch deployed to production appliances         | Phased: DMZ first, then internal
16:00 | SAML re-enabled with patched firmware           | Full functionality restored
18:00 | Post-incident review scheduled                  | Lessons learned session for next week

What Went Right

  • Asset inventory was current — 4 appliances identified in under 15 minutes
  • Pre-positioned behavioral detections caught the pattern (even without IOCs)
  • Rehearsed playbook — team followed zero-day playbook without improvising
  • Fallback authentication — certificate-based auth was already configured

What Needed Improvement

  • No offline backup for VPN access — remote workers lost access for 2 hours
  • Patch testing environment didn't match production — 30-minute delay finding compatible test appliance
  • Communication gaps — clinical staff weren't notified about VPN disruption until 45 minutes after workaround

4. Communication Templates

Internal Stakeholder Notification

SUBJECT: [URGENT] Zero-Day Vulnerability Response — [Product Name]

STATUS: Active Response
SEVERITY: Critical (CVSS 9.8)
AFFECTED SYSTEMS: [List]

CURRENT ACTIONS:
- Workaround applied at [time]
- Threat hunt in progress — no evidence of exploitation
- Vendor patch expected [timeframe]

BUSINESS IMPACT:
- [Service X] temporarily unavailable
- Workaround in place — [alternative access method]

NEXT UPDATE: [Time]

IR Lead: [Name] | [Contact]

Board/Executive Summary

SUBJECT: Zero-Day Incident Summary — [Date]

A critical vulnerability (CVE-XXXX-XXXX) was discovered in [product],
which is used in our environment for [purpose]. 

We were notified at [time] and activated our zero-day response playbook.
Within [X] minutes, we confirmed [N] affected systems. No evidence of
exploitation was found. Workarounds were applied within [X] minutes,
and the vendor patch was deployed within [X] hours.

Total business disruption: [X] hours of [service] unavailability.
No data loss or unauthorized access detected.

5. Key Takeaways

  1. Preparation beats reaction — asset inventories, behavioral detections, and rehearsed playbooks compress response time from days to hours
  2. Hunt backward — when a zero-day drops, assume it was exploited before disclosure and hunt 30-90 days back
  3. Workarounds first, patches second — don't wait for the patch to act; disable, isolate, or restrict immediately
  4. Behavioral detection > signature detection — generic detections for "web server spawning shell" catch zero-days that IOC-based rules miss
  5. Communication is a control — stakeholders who don't know what's happening make decisions that undermine your response