Patch SLA Compression in the LLM Exploit Era: From 30 Days to 24 Hours

The Problem

Enterprise vulnerability management programs typically define patch SLAs by CVSS severity: Critical (CVSS ≥ 9.0) within 30 days, High (CVSS 7.0–8.9) within 60 days. These numbers were calibrated in an era when exploit development took days to weeks and attackers needed significant skill to turn a CVE description into working code.

That calibration is obsolete. LLM-assisted exploit pipelines can produce a functional exploit for a well-described CVE within hours. The CISA KEV list grows weekly with CVEs that were weaponised the same week they were published. The 30-day SLA for critical CVEs now means that a significant fraction of organisations will be exploited before their change-control process even schedules the patch.

The right answer is not simply to say “patch faster”; that advice ignores the infrastructure and process constraints that make fast patching hard. The answer is to identify and eliminate the structural blockers that make sub-24-hour patching impossible, then implement the automation that makes it reliable.

The structural blockers are:

Mutable infrastructure. If patching requires SSH-ing into servers and running apt upgrade, you need available engineers, maintenance windows, and manual verification. Immutable images (rebuild and redeploy, never patch in place) make patching a deployment operation, not a system administration operation.

Change-control bureaucracy. Many enterprise change-control processes require a CAB meeting, a change ticket, a rollback plan, and a multi-hour approval window. A 24-hour SLA cannot coexist with a weekly CAB cycle. Security-exception pathways and automated approval workflows are required.

Test/staging environment lag. If patching requires the patch to traverse dev → test → staging → production with manual approval at each gate, the latency is days even with an immutable image model. Parallel staged rollout (5% → 25% → 100%) with automated health gates removes human approval from the hot path.

Escalation paths not defined. A weekend CVE requires an on-call engineer with the authority and access to deploy a patch. Many organisations have on-call for production incidents but not for security patches. The patch SLA timer does not stop on weekends.

Target systems: Any internet-facing production infrastructure; organisations with CVSS 9+ CVEs appearing in their scanner results more than once per quarter; teams currently operating on a 30-day or 60-day patch cycle.

Threat Model

1. Automated exploit deployment targeting unpatched fleet (attacker with LLM exploit pipeline). Objective: scan the internet for services returning version strings or response headers indicating vulnerable software; deploy LLM-generated exploit at scale. Impact: organisations still in their 30-day patch window are compromised before they patch.

2. Targeted attack on high-value organisation (APT or criminal group). Objective: identify a specific target’s technology stack via prior reconnaissance; wait for a CVE affecting their stack; deploy an LLM-generated exploit within 24 hours of publication. Impact: targeted intrusion; data exfiltration; ransomware deployment.

3. Patch regression causing production incident (accidental, during emergency patch). Objective (unintentional): rushing a patch through without adequate testing causes a regression that breaks production. Impact: availability incident; may be worse than the CVE itself for some scenarios. This is the risk the staged rollout approach mitigates.

Hardening Configuration

Step 1: Define SLA Tiers and Measure Baseline

Before you can compress SLAs, you need to measure where you are:

# Measure current patch latency from CVE publication to production deployment
# Query your scanner history and deployment logs

# Example: Trivy history + Kubernetes deployment timestamps
# For each CVE: find the earliest scan showing it, find the first later scan
# not showing it (indicating the patch was deployed), compute the duration

python3 << 'EOF'
import math
import statistics

# Hours from publication to production deployment, one entry per CVE.
# Sample data; in practice, query your scanner's API or database.
patch_times = sorted([72, 144, 336, 48, 720, 24, 168])

n = len(patch_times)
# Nearest-rank percentile: smallest value with at least 90% of samples at or below it
p90 = patch_times[math.ceil(0.9 * n) - 1]

print(f"Median patch time: {statistics.median(patch_times)}h")
print(f"P90 patch time: {p90}h")
print(f"Max patch time: {max(patch_times)}h")
EOF
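
The per-CVE durations fed into that summary can be derived from scan history. The following is a minimal sketch, assuming each scan is available as a timestamp plus the set of CVE IDs it reported; the record layout, the placeholder IDs and the function name patch_latency_hours are illustrative, not part of any scanner's API. It measures from first detection rather than from publication, so add the publication-to-detection gap if your scanner lags.

```python
from datetime import datetime

# Hypothetical scan records: (scan time, CVE IDs reported by that scan).
scans = [
    (datetime(2025, 3, 1, 6, 0), {"CVE-A", "CVE-B"}),
    (datetime(2025, 3, 2, 6, 0), {"CVE-A", "CVE-B"}),
    (datetime(2025, 3, 3, 6, 0), {"CVE-B"}),
    (datetime(2025, 3, 6, 6, 0), set()),
]

def patch_latency_hours(scans):
    """Hours from the first scan reporting a CVE to the first later scan without it."""
    first_seen = {}
    latency = {}
    for when, cves in sorted(scans, key=lambda s: s[0]):
        for cve in cves:
            first_seen.setdefault(cve, when)
        for cve, seen_at in first_seen.items():
            # Record only the first clearance; later reappearances are ignored here.
            if cve not in cves and cve not in latency:
                latency[cve] = (when - seen_at).total_seconds() / 3600
    return latency

print(patch_latency_hours(scans))
```

CVEs that are still present in the latest scan do not appear in the result, so the summary statistics only cover patched findings.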

Define your target SLA tiers:

| Tier | Criteria | Target SLA | Automation Level |
|------|----------|------------|------------------|
| P0 | CISA KEV + internet-exposed | 12 hours | Fully automated; security approval required |
| P1 | CVSS ≥ 9.5 OR EPSS ≥ 0.30 | 24 hours | Automated + emergency change-control |
| P2 | CVSS 9.0–9.4 OR EPSS 0.15 to < 0.30 | 72 hours | Normal fast-track process |
| P3 | CVSS 7.0–8.9, EPSS < 0.15 | 7 days | Standard process |
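
The tier table can be expressed as a pure function, which makes the policy easy to unit-test before any feed lookups are wired in. This is a sketch under two assumptions that the table does not state: a KEV-listed CVE that is not internet-exposed is treated as P1, and anything below the P2 thresholds falls into P3. The name classify_tier and its parameters are illustrative.

```python
def classify_tier(in_kev: bool, internet_exposed: bool,
                  cvss: float, epss: float) -> tuple[str, int]:
    """Return (tier, SLA in hours) following the tier table."""
    if in_kev:
        # Assumption: KEV-listed but not internet-exposed is handled as P1.
        return ("P0", 12) if internet_exposed else ("P1", 24)
    if cvss >= 9.5 or epss >= 0.30:
        return "P1", 24
    if cvss >= 9.0 or epss >= 0.15:
        return "P2", 72
    # Assumption: everything below the P2 thresholds is handled as P3.
    return "P3", 168

print(classify_tier(in_kev=False, internet_exposed=True, cvss=9.8, epss=0.35))
```

Keeping classification separate from data fetching also gives the security team a single place to adjust thresholds.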

Step 2: Immutable Infrastructure Prerequisites

# Base image pattern for immutable deployments
# Always pin to a digest, not a tag
FROM debian:bookworm@sha256:1234abc... AS base

# All package installations happen at build time
RUN apt-get update && apt-get install -y --no-install-recommends \
    libssl3=3.0.15-1 \
    && rm -rf /var/lib/apt/lists/*

# No SSH, no package manager in the final image
FROM gcr.io/distroless/base-debian12 AS final
COPY --from=base /usr/lib /usr/lib
# Assumes an application build step (omitted here) has placed the binary in /app
COPY --from=base /app /app
USER nonroot:nonroot
ENTRYPOINT ["/app/server"]

Automated base image rebuild on upstream CVE patches:

// renovate.json5: auto-update base image digests
// (comments require the JSON5 format; a plain renovate.json must not contain them)
{
  "extends": ["config:base"],
  "docker": {
    "enabled": true,
    "pinDigests": true
  },
  "packageRules": [
    {
      "matchPackagePatterns": ["debian", "alpine", "ubuntu"],
      "automerge": false,   // Security team reviews base image updates
      "labels": ["security", "base-image-update"]
    }
  ]
}

Step 3: Emergency Change-Control Pathway

Define a security change pathway that bypasses the standard CAB cycle for P0/P1 patches:

# change-control-policy.yaml
standard_pathway:
  approval_required: ["change-advisory-board"]
  minimum_lead_time_hours: 48
  applies_to: [P3, P4]

fast_track_pathway:
  approval_required: ["engineering-lead", "security-team"]
  minimum_lead_time_hours: 4
  applies_to: [P2]

emergency_security_pathway:
  approval_required: ["security-oncall", "ciso-delegate"]
  minimum_lead_time_hours: 0    # Can start immediately
  applies_to: [P0, P1]
  constraints:
    - staged_rollout_required: true
    - automated_rollback_required: true
    - post_deployment_review_hours: 24
  notification:
    - ciso@example.com
    - security-leads@example.com

Automate the pathway selection:

#!/usr/bin/env python3
# patch-pathway.py: determines the SLA tier and deadline for a given CVE

import sys

import requests

TIMEOUT = 10  # seconds; fail fast rather than hang the pipeline

def get_patch_pathway(cve_id: str) -> tuple[str, int]:
    """Return (tier, SLA in hours) for a CVE."""
    # Fetch EPSS score; a failed or empty lookup is treated as 0.0
    resp = requests.get(
        f"https://api.first.org/data/v1/epss?cve={cve_id}", timeout=TIMEOUT
    )
    try:
        epss = float(resp.json()["data"][0]["epss"]) if resp.ok else 0.0
    except (KeyError, IndexError, ValueError):
        epss = 0.0

    # Fetch KEV status; a failed KEV lookup must not silently downgrade the tier
    kev_resp = requests.get(
        "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json",
        timeout=TIMEOUT,
    )
    kev_resp.raise_for_status()
    kev_ids = {v["cveID"] for v in kev_resp.json()["vulnerabilities"]}

    # Fetch CVSS score (simplified: CVSS v3.1 only)
    nvd_resp = requests.get(
        f"https://services.nvd.nist.gov/rest/json/cves/2.0?cveId={cve_id}",
        timeout=TIMEOUT,
    )
    try:
        cvss = nvd_resp.json()["vulnerabilities"][0]["cve"]["metrics"]["cvssMetricV31"][0]["cvssData"]["baseScore"]
    except (KeyError, IndexError, ValueError):
        cvss = 0.0

    if cve_id in kev_ids:
        return "P0", 12
    elif cvss >= 9.5 or epss >= 0.30:
        return "P1", 24
    elif cvss >= 9.0 or epss >= 0.15:
        return "P2", 72
    else:
        return "P3", 168

if __name__ == "__main__":
    tier, sla_hours = get_patch_pathway(sys.argv[1])
    print(f"Tier: {tier}, SLA: {sla_hours}h")

Step 4: Staged Rollout with Automated Health Gates

# Argo Rollouts strategy for emergency patches
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5      # 5% of traffic for 5 minutes
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 25
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 100
      # No human approval needed: timed pauses and passing analysis steps advance
      # the canary automatically (autoPromotionEnabled applies to blueGreen, not canary)

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      successCondition: result[0] < 0.005   # < 0.5% error rate
      failureLimit: 1
      interval: 60s
      count: 3          # Bounded number of measurements so the analysis step can complete
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[2m])) /
            sum(rate(http_requests_total[2m]))
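
The gate's decision logic can be checked offline before trusting it with a production rollout. The sketch below assumes that failureLimit is the number of failed measurements tolerated before the metric as a whole fails, and it treats a NaN measurement (for example, no traffic during the window) as a failure; both points should be verified against the Argo Rollouts version in use. The function gate_passes is illustrative and not part of Argo Rollouts.

```python
def gate_passes(error_rates: list[float],
                threshold: float = 0.005,
                failure_limit: int = 1) -> bool:
    """Evaluate a series of error-rate measurements against the health gate.

    A measurement succeeds when it is strictly below the threshold.
    NaN compares as not-below, so it counts as a failure here.
    """
    failures = sum(1 for rate in error_rates if not rate < threshold)
    return failures <= failure_limit

# One noisy measurement is tolerated with failure_limit=1 ...
print(gate_passes([0.001, 0.020, 0.002]))
# ... but a sustained regression is not.
print(gate_passes([0.020, 0.030, 0.025]))
```

Replaying historical error-rate series through a function like this is a cheap way to see whether the chosen threshold would have rolled back past regressions without flagging healthy releases.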

Step 5: SLA Tracking and Escalation

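# sla-tracker.py: tracks CVE patch SLA compliance
# Runs every 15 minutes as a Kubernetes CronJob

from datetime import datetime, timedelta, UTC   # datetime.UTC requires Python 3.11+

# Assumes the pathway script is saved as an importable module (patch_pathway.py)
# with its command-line entry point guarded; a hyphenated filename cannot be imported.
from patch_pathway import get_patch_pathway

class CVESLATracker:
    def __init__(self):
        self.findings = self.load_active_findings()

    def load_active_findings(self) -> list[dict]:
        # Query your scanner API for unpatched CVEs
        # Return list of {cve_id, tier, discovered_at, target_host_count};
        # discovered_at must be a timezone-aware datetime
        return []   # Placeholder until the scanner integration is implemented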

    def check_sla_status(self) -> list[dict]:
        now = datetime.now(UTC)
        results = []
        for finding in self.findings:
            tier, sla_hours = get_patch_pathway(finding["cve_id"])
            deadline = finding["discovered_at"] + timedelta(hours=sla_hours)
            hours_remaining = (deadline - now).total_seconds() / 3600

            status = "on-track"
            if hours_remaining < 0:
                status = "breached"
            elif hours_remaining < sla_hours * 0.25:   # < 25% time remaining
                status = "at-risk"

            results.append({
                "cve_id": finding["cve_id"],
                "tier": tier,
                "sla_hours": sla_hours,
                "hours_remaining": hours_remaining,
                "status": status,
                "affected_hosts": finding["target_host_count"],
            })

        return results

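    def escalate_at_risk(self, findings: list[dict]):
        # notify_oncall, notify_ciso and open_incident are integration hooks
        # (paging, email, ticketing) to implement for your environment
        for f in findings:
            if f["status"] == "at-risk":
                self.notify_oncall(f)
            elif f["status"] == "breached":
                self.notify_ciso(f)
                self.open_incident(f)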

Prometheus metrics for SLA dashboard:

# SLA compliance metric
cat << 'EOF' | curl --data-binary @- http://pushgateway:9091/metrics/job/patch-sla
# HELP cve_patch_sla_hours_remaining Hours until SLA breach for unpatched CVE
# TYPE cve_patch_sla_hours_remaining gauge
cve_patch_sla_hours_remaining{cve="CVE-2026-XXXXX",tier="P1"} 18.5
# HELP cve_patch_sla_breached Whether a CVE has breached its patch SLA
# TYPE cve_patch_sla_breached gauge
cve_patch_sla_breached{cve="CVE-2025-YYYYY",tier="P2"} 0
EOF
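
Rather than hand-writing that payload, the tracker's results can be rendered into the same text format. This is a minimal sketch, assuming result dictionaries shaped like those returned by check_sla_status; render_metrics is an illustrative helper, and the CVE IDs in the usage lines are placeholders.

```python
def render_metrics(results: list[dict]) -> str:
    """Render SLA tracker results in the Prometheus text exposition format."""
    lines = [
        "# HELP cve_patch_sla_hours_remaining Hours until SLA breach for unpatched CVE",
        "# TYPE cve_patch_sla_hours_remaining gauge",
    ]
    for r in results:
        labels = f'cve="{r["cve_id"]}",tier="{r["tier"]}"'
        lines.append(
            f'cve_patch_sla_hours_remaining{{{labels}}} {r["hours_remaining"]:.1f}'
        )
    lines += [
        "# HELP cve_patch_sla_breached Whether a CVE has breached its patch SLA",
        "# TYPE cve_patch_sla_breached gauge",
    ]
    for r in results:
        labels = f'cve="{r["cve_id"]}",tier="{r["tier"]}"'
        breached = 1 if r["status"] == "breached" else 0
        lines.append(f"cve_patch_sla_breached{{{labels}}} {breached}")
    # The Pushgateway expects the payload to end with a newline.
    return "\n".join(lines) + "\n"

sample = [
    {"cve_id": "CVE-0000-0001", "tier": "P1", "hours_remaining": 18.5, "status": "on-track"},
    {"cve_id": "CVE-0000-0002", "tier": "P2", "hours_remaining": -3.0, "status": "breached"},
]
print(render_metrics(sample), end="")
```

The returned string can be sent as the request body to the same Pushgateway URL used in the curl example.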

Step 6: Weekend and Holiday Coverage

# on-call rotation that includes security patch authority
# PagerDuty escalation policy for P0/P1 CVEs

escalation_policy:
  name: Security Patch Response
  rules:
    # escalation_delay_in_minutes: how long an unacknowledged incident stays
    # on a rule before it escalates to the next one
    - escalation_delay_in_minutes: 15
      targets:
        - type: schedule
          id: security-oncall-primary    # Paged immediately
    - escalation_delay_in_minutes: 15
      targets:
        - type: schedule
          id: security-oncall-secondary   # Paged at 15 min
    - escalation_delay_in_minutes: 30
      targets:
        - type: user
          id: ciso-delegate               # Paged at 30 min

Pre-authorise emergency deployments to avoid weekend approval delays:

# Pre-authorisation stored in a secrets manager as a time-bounded token
# The on-call engineer presents this token to the deployment pipeline
# without needing a CAB approval for P0/P1 patches

vault write auth/approle/role/emergency-patch-deployer \
  secret_id_ttl=24h \
  token_ttl=1h \
  token_policies="production-deploy,read-only-config"

Expected Behaviour After Hardening

| Scenario | 30-Day SLA Era | 24-Hour SLA with Hardening |
|----------|----------------|----------------------------|
| CISA KEV published Friday 17:00 | Triage begins Monday; patch deployed next week | On-call paged Friday 17:05; patch deployed Saturday 05:00 |
| CVSS 9.8, EPSS 0.35, active scanning detected | SLA clock starts; 30-day target | P1 classification; 24-hour clock; automated rollout starts |
| Patch causes regression at 5% canary | Full rollout deployed; regression found in production | Argo Rollouts health gate fails; automatic rollback; no wider impact |
| SLA breached for P3 CVE | No monitoring; team discovers weeks later | Metric fires; ticket opened; escalation to engineering lead |
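
The Friday scenario is plain deadline arithmetic: the clock runs in wall-clock hours and does not pause for weekends. A short sketch, using an arbitrarily chosen Friday and an illustrative helper named sla_deadline:

```python
from datetime import datetime, timedelta

def sla_deadline(published: datetime, sla_hours: int) -> datetime:
    """Wall-clock deadline; weekends and holidays do not stop the timer."""
    return published + timedelta(hours=sla_hours)

published = datetime(2025, 1, 3, 17, 0)      # a Friday at 17:00 (arbitrary example date)
p0_deadline = sla_deadline(published, 12)    # P0 tier: 12 hours
print(p0_deadline.strftime("%Y-%m-%d %H:%M"), "weekday index:", p0_deadline.weekday())
```

The result falls on Saturday at 05:00, which is why the P0 tier only works with an on-call engineer who holds deployment authority over the weekend.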

Trade-offs and Operational Considerations

| Aspect | Benefit | Cost | Mitigation |
|--------|---------|------|------------|
| Immutable image model | Patching is a deployment operation; fast and repeatable | Base image rebuild time adds 10-30 min to patch timeline | Pre-build base images nightly; patch only needs app layer rebuild |
| Automated staged rollout | No human approval needed for healthy patches | Auto-rollout deploys a regression if health checks miss it | Define health checks that cover the specific CVE’s attack vector |
| Emergency change pathway | Bypasses CAB latency | Reduced oversight for changes; higher regression risk | Require post-deployment review within 24h; log all emergency deployments |
| EPSS-based tier assignment | Focuses resources on likely-to-be-exploited CVEs | EPSS lags 24-72h post-publication; may misclassify fresh CVEs | Use KEV as a CVSS-independent tier classifier; re-evaluate EPSS at 24h |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---------|---------|-----------|----------|
| On-call engineer lacks deployment access | P0 CVE unpatched despite escalation | SLA metric shows breach; deployment log empty | Pre-provision break-glass credentials with dual-approval for on-call role |
| Staged rollout health check too lenient | Regression reaches 100% before detection | Post-deployment error rate spike | Tighten health check thresholds; add smoke tests specific to the patched component |
| SLA tracker misclassifies CVE tier | Wrong urgency; wrong timeline | Manual review against policy; EPSS/KEV mismatch | Allow security team to manually override tier assignment with justification |
| Immutable image rebuild fails (base image pull error) | Patch timeline extended by build failure | CI build failure alert; SLA clock continues | Pre-pull base images to local mirror; use cached layer if upstream unavailable |