Zero-Day Response Playbook: From Public Disclosure to Patched Production

Problem

A high-severity CVE is published for a component in your stack. The NVD entry posts at 3 AM. By 6 AM there is a working public PoC. By noon your component appears in threat intelligence feeds as actively exploited. You have a business-critical service running the vulnerable version, no patch is available yet, and the change freeze for the upcoming release is in effect.

This scenario repeats itself across the industry with each major CVE wave — Log4Shell, MOVEit, CitrixBleed, ingress-nginx, and dozens of others that received less media coverage. The gap between “organisation knows about the CVE” and “organisation has effective mitigations in place” is measured in days to weeks for most teams. During that window, the vulnerability is being actively exploited.

The root cause is not a lack of security awareness. Most security teams know they need to patch quickly. The problems are operational:

No inventory of what is running. When a CVE drops for a specific library version, teams cannot immediately answer “do we run this?” without manual investigation across dozens of services, container images, and infrastructure components. SBOM programs exist but are rarely complete or queryable in real time.

No decision framework for compensating controls. When a patch is unavailable or the risk of emergency patching outweighs the risk of brief exposure, teams need a structured way to evaluate and implement compensating controls — WAF rules, network restrictions, feature flags, configuration changes. Without this framework, the response defaults to either “patch everything immediately” (which causes operational incidents from rushed changes) or “wait for the next release cycle” (which leaves the vulnerability open for days).

No escalation path that bypasses change management. Most organisations have change advisory boards (CABs), deployment freezes, and approval workflows. These exist for good reasons, but during a zero-day the friction they create translates directly into additional exposure time. Teams need a documented emergency change management process that stakeholders have accepted in advance, not one improvised under pressure.

No post-patch verification. After applying a patch, many teams mark the ticket as resolved without confirming that the patched version is actually running in all affected environments. Partial deployment (some instances patched, some not) is common and leaves the attack surface partially open.

This playbook provides a structured, repeatable process for zero-day response. It is designed to be adopted before a zero-day occurs and rehearsed, so that when the alert fires at 3 AM the team has a defined procedure rather than improvising.

Target systems: any production infrastructure, whether cloud, on-premises, or Kubernetes. Audience: security teams, SRE/platform teams, and engineering leads responsible for vulnerability response.


Threat Model

The “adversary” in zero-day response is time: the window between public disclosure and active exploitation in your environment. Threat actors — both nation-state and commodity ransomware operators — maintain automated scanning infrastructure that begins probing for newly disclosed vulnerabilities within hours of CVE publication.

Phase 1 (0–6 hours post-disclosure): Advanced threat actors and security researchers publish PoC. Automated scanners begin probing internet-facing services. Internal-only services are not yet at risk from external actors but are at risk if an attacker already has network access.

Phase 2 (6–48 hours): Commodity threat actors integrate the CVE into exploit frameworks (Metasploit, Nuclei templates). Internet-facing vulnerable services are actively being exploited at scale.

Phase 3 (48 hours – 2 weeks): Vulnerability is integrated into ransomware and malware distribution infrastructure. Internal services behind VPNs are targeted via phishing and supply chain. Unpatched services are being systematically compromised.

Phase 4 (2+ weeks): Vulnerability is “baked in” to attacker toolkits. Any unpatched service at this point is a significant liability regardless of other controls.

The goal of the playbook is to reduce the time to effective mitigation (TTEM), the point at which either a patch is applied or compensating controls are verified to be effective, to under 24 hours for critical vulnerabilities in internet-facing services and under 72 hours for internal services. These are outer bounds; the per-tier targets in Step 1.3 are tighter for the most exposed services.


Configuration / Implementation

Phase 1 — Detection and triage (target: 0–2 hours)

Step 1.1: Establish a vulnerability detection feed

Do not rely on passive awareness (news, Twitter, colleagues mentioning CVEs). Subscribe to structured feeds:

# Recommended automated feeds:

# CISA Known Exploited Vulnerabilities (KEV) catalog
# Poll the JSON feed (it is a JSON document, not RSS/Atom):
# https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json

# NVD CVE data (the legacy 1.1 JSON feeds have been retired; use the CVE API 2.0):
# https://services.nvd.nist.gov/rest/json/cves/2.0

# OSV (Open Source Vulnerabilities) API, queryable by component.
# Maven packages are named groupId:artifactId.
curl -s "https://api.osv.dev/v1/query" \
  -d '{"package": {"name": "org.apache.logging.log4j:log4j-core", "ecosystem": "Maven"}}' | \
  jq '.vulns[]? | {id, summary, severity: .severity[0].score}'

# GitHub Security Advisories: list open Dependabot alerts for a repository
# (replace OWNER/REPO; the token needs access to security alerts)
gh api "/repos/OWNER/REPO/dependabot/alerts?state=open"

# Vendor security bulletins — subscribe to mailing lists for:
# - Linux kernel (oss-security@lists.openwall.com)
# - Kubernetes (kubernetes-security-announce@googlegroups.com)
# - Your cloud provider (AWS Security Bulletins, GCP Security Bulletins)

Step 1.2: Query your SBOM inventory for the affected component

# Using grype against your SBOM database
# Assumes SBOM files are stored in s3://sbom-inventory/ as <service>.json
# --include only takes effect after an --exclude, so exclude everything first
aws s3 cp s3://sbom-inventory/ /tmp/sboms/ --recursive --exclude "*" --include "*.json"

for sbom in /tmp/sboms/*.json; do
  SERVICE=$(basename "$sbom" .json)
  # Do not pass --only-fixed here: it hides vulnerabilities that have no fix yet,
  # which is exactly the zero-day case
  RESULT=$(grype sbom:"$sbom" 2>/dev/null)
  if echo "$RESULT" | grep -q "CVE-2026-XXXXX"; then
    echo "AFFECTED: $SERVICE"
  fi
done

# Or use Syft + Grype for live container images (one image per line, de-duplicated):
for image in $(kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | sort -u); do
  syft "$image" -o spdx-json 2>/dev/null | \
    grype --add-cpes-if-none 2>/dev/null | \
    grep -q "CVE-2026-XXXXX" && echo "AFFECTED: $image"
done
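
In the first hours after disclosure, scanner databases may not yet list the new CVE, so a plain component-name search over the SBOM files can serve as a coarse pre-filter. The sketch below assumes one SBOM per service, named `<service>.json`, in a single directory; `sbom_prefilter` is a hypothetical helper name, not part of Grype or Syft, and a name match only means the component is present, not that the version is vulnerable.

```bash
# Hypothetical helper: list services whose SBOM file mentions a component name.
# Assumes one SBOM per service, stored as <dir>/<service>.json.
sbom_prefilter() {
  local dir="$1" component="$2" sbom
  for sbom in "$dir"/*.json; do
    [ -e "$sbom" ] || continue          # directory contained no .json files
    # -F: treat the component name as a fixed string, not a regex
    if grep -qF -- "$component" "$sbom"; then
      basename "$sbom" .json
    fi
  done
}
```

For example, `sbom_prefilter /tmp/sboms log4j-core` prints one service name per line. Treat the result as a list of candidates to confirm with a version-aware scan.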

Step 1.3: Classify affected services by exposure tier

Tier 1 — Internet-facing, processes untrusted input, runs as privileged user
→ Target TTEM: < 4 hours

Tier 2 — Internal API, reachable from authenticated employees or services
→ Target TTEM: < 24 hours

Tier 3 — Batch or background service, no direct input from external sources
→ Target TTEM: < 72 hours

Tier 4 — Dev/test/staging environments
→ Target TTEM: < 1 week
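
The tier targets above can be turned into concrete deadlines as soon as the disclosure time is known. This is a minimal sketch, assuming GNU `date` and ISO 8601 UTC timestamps; the function names `ttem_target_hours` and `ttem_deadline` are illustrative.

```bash
# Map an exposure tier (1-4) to its target TTEM in hours, per the tiers above.
ttem_target_hours() {
  case "$1" in
    1) echo 4 ;;
    2) echo 24 ;;
    3) echo 72 ;;
    4) echo 168 ;;   # 1 week
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Print the UTC mitigation deadline for a tier, given the disclosure time.
# Requires GNU date (for -d).
ttem_deadline() {
  local tier="$1" disclosed="$2" hours start
  hours=$(ttem_target_hours "$tier") || return 1
  start=$(date -u -d "$disclosed" +%s) || return 1
  date -u -d "@$((start + hours * 3600))" +%Y-%m-%dT%H:%M:%SZ
}
```

For example, `ttem_deadline 1 2026-05-12T03:00:00Z` prints the time by which a Tier 1 service should be mitigated, which can be pasted straight into the incident ticket.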

Phase 2 — Compensating controls (target: 2–8 hours)

When a patch is not yet available or emergency patching is not feasible, implement compensating controls. Document each control as: what it blocks, what it does not block, and its expiry condition (the point at which patching is required regardless).

WAF rule (for web application CVEs):

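# AWS WAF: block requests matching known CVE exploit patterns.
# A rule group has no effect until it is referenced from a web ACL.
# With AWS CLI v2, SearchString is a binary field; --cli-binary-format lets you pass it as plain text.
aws wafv2 create-rule-group \
  --cli-binary-format raw-in-base64-out \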
  --name "CVE-2026-XXXXX-block" \
  --scope REGIONAL \
  --capacity 100 \
  --rules '[{
    "Name": "BlockExploitPattern",
    "Priority": 0,
    "Statement": {
      "ByteMatchStatement": {
        "SearchString": "EXPLOIT_PATTERN_HERE",
        "FieldToMatch": {"Body": {}},
        "TextTransformations": [{"Priority": 0, "Type": "URL_DECODE"}],
        "PositionalConstraint": "CONTAINS"
      }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
      "SampledRequestsEnabled": true,
      "CloudWatchMetricsEnabled": true,
      "MetricName": "CVE-2026-XXXXX"
    }
  }]' \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=CVE-Group

# Record the expiry. wall only reaches users logged in to this host, so also file
# the deadline in your ticketing or on-call system (date -d requires GNU date).
echo "CVE-2026-XXXXX WAF rule expires: verify patch by $(date -u -d '+48 hours')" | \
  wall

Network restriction (for network-protocol CVEs):

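# Restrict access to the vulnerable service port temporarily
# Example: limit inbound SSH to the management CIDR
# Record the existing rules first so they can be restored after patching
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx \
  --query 'SecurityGroups[0].IpPermissions' > sg-xxxxxxxx-before.json

# Allow the management network before removing the open rule, to avoid a lockout
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 10.0.0.0/8  # Management network only

aws ec2 revoke-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0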

Feature flag (for application-layer CVEs):

# Disable the vulnerable feature path via feature flag
curl -X POST "https://flags.internal/api/flags/vulnerable-feature/disable" \
  -H "Authorization: Bearer $FLAGS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"reason": "CVE-2026-XXXXX mitigation", "expires_at": "2026-05-13T00:00:00Z"}'

Document the compensating control formally:

## CVE-2026-XXXXX Compensating Control Record

| Field | Value |
|-------|-------|
| CVE | CVE-2026-XXXXX |
| CVSS | 9.8 Critical |
| Component | org.example:library:1.4.2 |
| Control type | WAF rule + network restriction |
| Control implemented by | @engineer-name |
| Control implemented at | 2026-05-12T06:30:00Z |
| Control scope | Production tier 1 services |
| What is blocked | Remote code execution via crafted HTTP request |
| What is NOT blocked | Authenticated attackers with existing access |
| Expiry condition | Patch deployed and verified, or 72 hours (whichever is sooner) |
| Ticket | INC-2026-0512-001 |
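
Expiry conditions are easy to forget once the incident cools down, so a scheduled sweep over a registry of temporary controls can flag the ones that are past due. The sketch below assumes a hypothetical plain-text registry with one control per line (`<control-id> <implemented-at> <max-hours>`) and GNU `date`; neither the file format nor the `expired_controls` name comes from any existing tool.

```bash
# Hypothetical registry format, one control per line:
#   <control-id> <implemented-at ISO 8601 UTC> <max-hours>
# Prints the IDs of controls whose maximum lifetime has elapsed.
expired_controls() {
  local registry="$1" now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  local now_s id implemented hours start
  now_s=$(date -u -d "$now" +%s) || return 1
  while read -r id implemented hours; do
    case "$id" in ''|'#'*) continue ;; esac   # skip blank lines and comments
    start=$(date -u -d "$implemented" +%s) || continue
    if [ "$now_s" -ge $((start + hours * 3600)) ]; then
      echo "$id"
    fi
  done < "$registry"
}
```

Run from cron or a CI schedule, the output can be posted to the incident channel. The second argument exists only so the current time can be pinned when testing.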

Phase 3 — Emergency patch deployment (target: 4–48 hours)

Step 3.1: Obtain and test the patch

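# Check for patch availability, then:
# For container images: rebuild with the updated package (Dockerfile).
# -y is required because docker build cannot answer apt prompts.
FROM ubuntu:22.04
RUN apt-get update && \
    apt-get install -y --only-upgrade vulnerable-package && \
    rm -rf /var/lib/apt/lists/*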

# For Kubernetes deployments — update image tag and redeploy
# Use kustomize overlays for rapid image updates:
# kustomize/overlays/emergency-patch/kustomization.yaml
images:
- name: org/service
  newTag: "1.4.3-patched"  # New safe version

# Apply and verify
kustomize build kustomize/overlays/emergency-patch | kubectl apply -f -
kubectl rollout status deployment/service -n production --timeout=5m

Step 3.2: Emergency change management

Define the emergency change process before a zero-day occurs. Key components:

## Emergency Change Process (for CVE response)

**Criteria for emergency change (bypasses CAB approval):**
- CVE is CVSS ≥ 9.0 and actively exploited (in CISA KEV catalog)
- Or CVE is CVSS ≥ 7.5 and affects internet-facing Tier 1 service
- Or security team has confirmed active exploitation in environment

**Required approvals for emergency change:**
- Security team lead (on-call) ✓
- SRE on-call ✓
- Engineering lead for affected service ✓ (or their delegate)
- NOT required: CAB review, change freeze waiver, full regression test

**Required documentation (ticket and rollback procedure before the change; details may be completed after):**
- Ticket created before change
- Brief change description (1–3 sentences)
- Rollback procedure documented before change
- Post-change verification step defined

**Post-change (within 24 hours):**
- Complete ticket with full change record
- File change freeze exception report
- Schedule post-mortem if service was impacted
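
The emergency-change criteria above are simple enough to encode, which removes one judgment call at 3 AM. This is a sketch of those three criteria only; the function name and the yes/no argument convention are assumptions made for illustration.

```bash
# Usage: qualifies_for_emergency_change <cvss> <in_kev> <tier1> <confirmed_exploitation>
# The last three arguments are "yes" or "no". Returns 0 if any criterion is met.
qualifies_for_emergency_change() {
  local cvss="$1" in_kev="$2" tier1="$3" confirmed="$4"
  # awk handles the floating-point comparisons that shell arithmetic cannot.
  awk -v s="$cvss" -v kev="$in_kev" -v t1="$tier1" -v ex="$confirmed" 'BEGIN {
    ok = (s + 0 >= 9.0 && kev == "yes") ||
         (s + 0 >= 7.5 && t1 == "yes") ||
         (ex == "yes")
    exit !ok
  }'
}
```

For example, `qualifies_for_emergency_change 9.8 yes no no && echo "emergency path"` prints the message, while a CVSS 9.8 issue that is neither in KEV, on a Tier 1 service, nor confirmed exploited goes through the normal process.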

Step 3.3: Phased deployment with rapid rollout

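#!/bin/bash
# emergency-patch-deploy.sh
# Rapid phased rollout with automated rollback on error
# Usage: emergency-patch-deploy.sh <deployment> <image> [namespace]
# Assumes the container is named after the deployment.

set -euo pipefail

SERVICE=$1
NEW_IMAGE=$2
NAMESPACE=${3:-production}

echo "Emergency patch deployment: $SERVICE -> $NEW_IMAGE"
echo "Namespace: $NAMESPACE"
echo "Started at: $(date -u)"

# Step 1: Start the rolling update, then pause it so that only the first batch
# of pods runs the new image. The batch size is set by the deployment's
# maxSurge/maxUnavailable, so it is roughly, not exactly, 10%.
kubectl set image deployment/"$SERVICE" \
  "$SERVICE"="$NEW_IMAGE" \
  -n "$NAMESPACE"
kubectl rollout pause deployment/"$SERVICE" -n "$NAMESPACE"

# Wait and check error rate
sleep 60
# promtool needs the server URL; sum() collapses the result to a single series.
# Output is expected to look like "{} => 0.0123 @[<timestamp>]".
QUERY="sum(rate(http_requests_total{service='$SERVICE',status=~'5..'}[1m])) / sum(rate(http_requests_total{service='$SERVICE'}[1m]))"
ERROR_RATE=$(kubectl exec -n monitoring deploy/prometheus -- \
  promtool query instant http://localhost:9090 "$QUERY" \
  2>/dev/null | awk '/=>/ {print $(NF-1)}' | tail -1 || true)
ERROR_RATE=${ERROR_RATE:-0}  # No samples: treated as healthy; confirm manually

if awk -v r="$ERROR_RATE" 'BEGIN { exit !(r + 0 > 0.05) }'; then
  echo "ERROR: Error rate $ERROR_RATE exceeds threshold; rolling back"
  # A paused deployment cannot be rolled back, so resume first
  kubectl rollout resume deployment/"$SERVICE" -n "$NAMESPACE"
  kubectl rollout undo deployment/"$SERVICE" -n "$NAMESPACE"
  exit 1
fi

echo "First batch healthy. Proceeding to full rollout."

# Step 2: Full rollout
kubectl rollout resume deployment/"$SERVICE" -n "$NAMESPACE"
kubectl rollout status deployment/"$SERVICE" -n "$NAMESPACE" --timeout=10m

echo "Deployment complete at: $(date -u)"
echo "Verify patch: kubectl exec deploy/$SERVICE -n $NAMESPACE -- $SERVICE --version"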

Phase 4 — Verification (target: within 2 hours of patch)

Step 4.1: Confirm patched version is running everywhere

# Verify no old versions remain (checks the current kube context; repeat for each cluster)
kubectl get pods --all-namespaces -o jsonpath=\
'{range .items[*]}{.metadata.namespace}/{.metadata.name}: {range .spec.containers[*]}{.image}{"\n"}{end}{end}' | \
  grep "service-name" | grep -v "1.4.3-patched"
# Expected: no output (all pods on patched version)

# For AWS Lambda — verify function code version
aws lambda list-functions --query \
  "Functions[?starts_with(FunctionName, 'service-')].[FunctionName,Runtime,LastModified]" \
  --output table

# Scan running images with Grype to confirm CVE is resolved
for pod in $(kubectl get pods -n production -l app=service -o name); do
  image=$(kubectl get "$pod" -n production -o jsonpath='{.spec.containers[0].image}')
  # Check grype's exit status: a failed scan must never be reported as PATCHED
  if ! report=$(grype "$image" 2>/dev/null); then
    echo "SCAN FAILED: $pod ($image)"
  elif echo "$report" | grep -q "CVE-2026-XXXXX"; then
    echo "STILL VULNERABLE: $pod ($image)"
  else
    echo "PATCHED: $pod ($image)"
  fi
done
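
To make partial deployment block ticket closure rather than rely on someone reading the output, the version check can return a non-zero exit status when any pod is off the expected tag. The sketch below is one way to do that; `unpatched_pods` is a hypothetical helper that reads `<namespace>/<pod> <image>` lines on standard input, and images pinned by digest will be reported as mismatches because they carry no tag.

```bash
# Reads "<namespace>/<pod> <image>" lines on stdin and prints those whose image
# tag differs from the expected one. Exit status is 0 only if every line matches.
unpatched_pods() {
  local expected_tag="$1"
  awk -v tag="$expected_tag" '
    {
      n = split($2, parts, ":")     # the tag is whatever follows the last colon
      if (parts[n] != tag) { print; bad = 1 }
    }
    END { exit bad }
  '
}
```

Input can be produced with a `kubectl get pods` jsonpath query that prints the namespace, pod name, and first container image per line, filtered to the service in question. A closing step such as `... | unpatched_pods 1.4.3-patched || echo "do not close ticket"` then fails loudly when stragglers remain.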

Step 4.2: Remove compensating controls

Once the patch is verified across all instances, remove temporary controls to avoid leaving unnecessary restrictions in place:

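# Remove WAF rule (detach the rule group from any web ACL first; the ID and
# lock token are returned by: aws wafv2 list-rule-groups --scope REGIONAL)
aws wafv2 delete-rule-group \
  --name "CVE-2026-XXXXX-block" \
  --scope REGIONAL \
  --id "$RULE_GROUP_ID" \
  --lock-token "$LOCK_TOKEN"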

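# Restore network access
# Re-authorize the rule that was in place before the restriction. If the original
# rule was open to 0.0.0.0/0, consider restoring a narrower CIDR instead.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.0/24  # Example CIDR; use the value recorded before the change

# Remove the temporary management-only rule if it was not part of the original set
aws ec2 revoke-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 10.0.0.0/8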

# Re-enable feature flag
curl -X POST "https://flags.internal/api/flags/vulnerable-feature/enable" \
  -H "Authorization: Bearer $FLAGS_TOKEN"

Phase 5 — Post-incident review (within 72 hours)

## Zero-Day Response Post-Mortem Template

**CVE:** CVE-2026-XXXXX
**Discovery time:** [when team first became aware]
**Detection source:** [how we found out: feed, colleague, alert]
**TTEM (Time to Effective Mitigation):** [hours]
**Target TTEM:** [hours based on tier]
**Met target:** [Y/N]

**Timeline:**
- T+0h: CVE disclosed publicly
- T+Xh: Team became aware
- T+Xh: Inventory scan completed; N services identified as affected
- T+Xh: Compensating controls implemented on Tier 1 services
- T+Xh: Patch available
- T+Xh: Patch deployed to all Tier 1 services
- T+Xh: Verification complete
- T+Xh: Compensating controls removed

**What worked:**
- [specific things that went well]

**What didn't work:**
- [blockers, delays, process failures]

**Action items:**
- [ ] Update SBOM coverage for component X (gap found during inventory)
- [ ] Automate Tier 1 detection for this class of vulnerability
- [ ] Pre-approve emergency change process for security team on-call

Expected Behaviour

| Metric | Without playbook | With playbook |
|--------|------------------|---------------|
| Time from CVE disclosure to inventory complete | 4–24 hours (manual) | < 2 hours (automated SBOM query) |
| Time to implement compensating controls | Ad hoc; 2–8 hours | < 4 hours (documented control patterns) |
| Emergency patching requires CAB approval | Yes; 24–48 hour delay | No; emergency process pre-approved, CAB notified after |
| Post-patch verification | Optional; often skipped | Mandatory; scripted check before ticket closed |
| Partial deployment (some instances unpatched) | Common | Detected by post-patch scan; blocked from ticket close |

Metrics to track:

# TTEM by CVE (track in your ticketing system)
# Query your SIEM or ticketing API for vulnerability response metrics:
# - Time from CVE_DISCLOSED to MITIGATED (compensating control active)
# - Time from CVE_DISCLOSED to PATCHED (patch verified)
# - Count of affected services per CVE
# - Percentage of Tier 1 services patched within target SLA
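
If the ticketing system can export the disclosure and mitigation timestamps, TTEM and target compliance reduce to a small calculation. The following is a sketch under that assumption, using GNU `date`; `ttem_hours` and `met_target` are illustrative names.

```bash
# Print the hours elapsed between disclosure and mitigation (one decimal place).
# Both arguments are ISO 8601 timestamps; requires GNU date (for -d).
ttem_hours() {
  local disclosed="$1" mitigated="$2" start end
  start=$(date -u -d "$disclosed" +%s) || return 1
  end=$(date -u -d "$mitigated" +%s) || return 1
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.1f\n", (e - s) / 3600 }'
}

# Return 0 if the measured TTEM (hours) is within the target (hours).
met_target() {
  awk -v h="$1" -v t="$2" 'BEGIN { exit !(h + 0 <= t + 0) }'
}
```

For example, `met_target "$(ttem_hours 2026-05-12T03:00:00Z 2026-05-12T06:30:00Z)" 4` succeeds for a Tier 1 service mitigated three and a half hours after disclosure.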

Trade-offs

| Aspect | Benefit | Cost | Mitigation |
|--------|---------|------|------------|
| Bypassing CAB for emergency changes | Reduces TTEM by 24–48 hours | Risk of rushed change causing service incident | Require rollback procedure before change; phased rollout with automated error-rate gate |
| Compensating controls without patch | Buys time without service disruption from emergency patch | False confidence: controls may not fully block all exploit paths | Document exactly what the control blocks and what it does not; set mandatory expiry |
| Phased rollout for patch | Limits blast radius if patch has regressions | Slower full deployment; Tier 1 partially vulnerable longer | Run phased rollout quickly (10% → 100% in < 2 hours for small clusters) |
| Automated SBOM query | Fast inventory | Accuracy depends on SBOM completeness; gaps exist | Supplement with runtime scanning (Trivy Operator, Grype); treat SBOM gaps as a separate finding |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---------|---------|-----------|----------|
| SBOM inventory is incomplete | CVE scanner finds affected service that SBOM query missed | Post-patch scan of running images finds CVE still present | Treat SBOM gap as a separate P1 finding; add missing service to SBOM pipeline immediately |
| Compensating control bypassed by novel exploit variant | Controls in place but exploitation succeeds | IDS/WAF alert on novel pattern; EDR alert on post-exploit activity | Activate incident response process; patch immediately regardless of change freeze |
| Emergency patch introduces regression | Service error rate spikes post-patch | Automated rollback in deploy script triggers | Rollback to previous version; notify security team; investigate patch; re-deploy with fix |
| Compensating control not removed after patch | WAF rule continues to block legitimate traffic; feature flag remains disabled | Post-patch checklist item missed; user reports feature missing | Implement compensating control expiry as calendar reminders with automated alerts; daily sweep of temporary controls |