NGINX CVE Patch Management Across Mixed Bare Metal, VM, and Kubernetes Fleets

Problem

Most organisations running NGINX at scale have it deployed in at least two different ways: as a systemd service on bare metal or VMs (managed via Ansible, Salt, or manual configuration), and as the ingress-nginx controller in Kubernetes (managed via Helm). Some have a third tier: NGINX in Docker containers built into application images, often with no automatic update path.

When an NGINX CVE is published (for example CVE-2024-7347, CVE-2025-23419, or the 2025 ingress-nginx annotation injection family), the patching process differs for each tier:

Bare metal / VM tier: NGINX is installed as a distro package or from nginx.org’s repository. Patching requires running the package manager on each host (apt-get install --only-upgrade nginx, or yum update nginx). Ansible can orchestrate this, but the fleet may span different distros, package repositories, and OS configurations. Some hosts may run NGINX compiled from source with custom modules, and those need a recompile rather than a package update.

Kubernetes ingress-nginx tier: NGINX lives inside the ingress-nginx controller image, versioned by Helm chart. Patching requires updating the Helm chart version, which changes the controller image, which triggers a rolling restart of ingress-nginx pods. The challenge is that breaking changes in the Helm chart can affect all inbound traffic to the cluster.

Application container tier: NGINX is baked into application Docker images as a reverse proxy or static file server. These images have their own build pipelines and may use outdated base images (nginx:1.24-alpine from months ago). There is no central way to patch these — each application team must rebuild and redeploy their image.

The inventory problem. Before patching, you need to know what you have. Many organisations cannot answer “how many NGINX instances are running right now, and what versions?” across the full fleet. Without inventory, there is no way to measure patch progress or confirm coverage.
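
As a rough illustration of what "measuring patch progress" means once an inventory exists, the sketch below counts instances per tier and version. It assumes the CSV column names used by the inventory script in Step 1; the host names and versions in SAMPLE are invented for the example.

```python
import csv
import io
from collections import Counter

def summarise_inventory(csv_text: str) -> Counter:
    """Count instances per (tier, version) pair from inventory CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # An empty version cell is counted as "unknown" rather than dropped
    return Counter((row["Tier"], row["NGINX Version"] or "unknown") for row in reader)

# Illustrative data only; real rows come from the Step 1 inventory script
SAMPLE = """Tier,Host/Cluster,Location,NGINX Version,Source,Last Updated
systemd,web-01,/usr/sbin/nginx,1.26.2,package,2025-03-01
systemd,web-02,/usr/sbin/nginx,1.26.3,package,2025-03-01
container,prod,shop/frontend-abc,1.24.0,docker-image,unknown
container,prod,shop/docs-xyz,,docker-image,unknown
"""

if __name__ == "__main__":
    for (tier, version), count in sorted(summarise_inventory(SAMPLE).items()):
        print(f"{tier:10} {version:8} {count}")
```

Running the same summary before and after a patch run gives a simple per-version count to compare.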

The patch window. For critical CVEs, the window between disclosure and active exploitation can be less than 48 hours. A patching process that takes two weeks leaves the fleet exposed. OS-level controls (seccomp filters and capability bounding sets; see the companion article on NGINX worker hardening) buy time in this window, but patching is the actual remediation.

Target systems: any organisation operating NGINX at scale across more than one deployment tier; security teams responsible for CVE SLAs; platform teams who own NGINX infrastructure but not individual application NGINX deployments.


Threat Model

Risk 1 — Extended exposure window. A critical NGINX CVE is disclosed. The fleet has 200 NGINX instances. Without a patch management process, the team must manually identify all instances, determine their versions, and patch them individually. The process takes three weeks; active exploitation begins in week one.

Risk 2 — Hidden NGINX in application containers. Application teams have embedded NGINX in their Docker images as a static file server. These images are not tracked in the NGINX fleet inventory. A CVE that affects all NGINX versions is patched on the infrastructure tier but persists in 15 application containers that the security team doesn’t know exist.

Risk 3 — Package repository pinning blocks patches. A host has a pinned NGINX version (via /etc/apt/preferences.d/nginx or yum.conf excludes) because a previous upgrade caused issues. The pin silently prevents the security update from being applied. The host reports “NGINX up to date” but is running a vulnerable version.
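
One way to surface this risk before a patch run is to look for version pins that name nginx. The sketch below parses the text of an APT preferences file and returns stanzas that pin nginx to a version. It is a simplified reading of the preferences format: it only handles glob-style Package patterns and "Pin: version" lines, and the function name and sample content are hypothetical.

```python
import re
from fnmatch import fnmatch

def find_nginx_version_pins(preferences_text: str) -> list[dict[str, str]]:
    """Return preference stanzas that pin a package matching 'nginx' to a version."""
    pins = []
    # Stanzas are separated by blank lines
    for stanza in re.split(r"\n\s*\n", preferences_text):
        fields: dict[str, str] = {}
        for line in stanza.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
        patterns = fields.get("package", "").split()
        names_nginx = any(fnmatch("nginx", pattern) for pattern in patterns)
        # Only version pins are reported; release and origin pins are ignored here
        if names_nginx and fields.get("pin", "").startswith("version"):
            pins.append(fields)
    return pins

# Illustrative file content, not taken from a real host
SAMPLE = """\
Package: nginx*
Pin: version 1.24.*
Pin-Priority: 1001

Package: *
Pin: release a=stable
Pin-Priority: 500
"""

if __name__ == "__main__":
    for pin in find_nginx_version_pins(SAMPLE):
        print(f"nginx pinned: {pin['pin']} (priority {pin.get('pin-priority', '?')})")
```

An equivalent check for yum or dnf would look at exclude lines instead; that is not covered by this sketch.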


Configuration / Implementation

Step 1 — Build fleet inventory across all tiers

#!/bin/bash
# nginx-fleet-inventory.sh
# Comprehensive NGINX version inventory across all deployment tiers

OUTPUT_FILE="/tmp/nginx-fleet-inventory-$(date +%Y%m%d-%H%M).csv"
echo "Tier,Host/Cluster,Location,NGINX Version,Source,Last Updated" > "$OUTPUT_FILE"

# Tier 1: Systemd hosts via Ansible ad-hoc
# -o prints one line per host: "<host> | CHANGED | rc=0 | (stdout) <version>"
echo "=== Tier 1: Bare metal / VM NGINX versions ==="
HOST_RE='^([^[:space:]|]+)[[:space:]]*\|.*\(stdout\)[[:space:]]*([0-9.]+)'
ansible nginx_hosts -o -m shell \
  -a "nginx -v 2>&1 | grep -oP '(?<=nginx/)[\d.]+'" 2>/dev/null | \
  while IFS= read -r line; do
    # Hosts that are unreachable or lack nginx produce no match and are skipped
    if [[ "$line" =~ $HOST_RE ]]; then
      HOST="${BASH_REMATCH[1]}"
      VERSION="${BASH_REMATCH[2]}"
      echo "systemd,$HOST,/usr/sbin/nginx,$VERSION,package,$(date +%Y-%m-%d)" >> "$OUTPUT_FILE"
    fi
  done

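# Tier 2: Kubernetes ingress-nginx via kubectl
# The image tag is the controller release (e.g. v1.x), not the bundled NGINX
# version, so ask the controller binary in each pod instead (needs exec access).
echo ""
echo "=== Tier 2: Kubernetes ingress-nginx versions ==="
for context in $(kubectl config get-contexts -o name 2>/dev/null); do
  kubectl --context="$context" get pods -A \
    -l "app.kubernetes.io/name=ingress-nginx,app.kubernetes.io/component=controller" \
    --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}' \
    2>/dev/null | \
  while IFS=$'\t' read -r namespace pod; do
    # </dev/null stops kubectl exec from consuming the loop's input
    version=$(kubectl --context="$context" -n "$namespace" exec "$pod" -- \
      /nginx-ingress-controller --version 2>/dev/null </dev/null | \
      grep -oP '(?<=nginx/)[\d.]+' | head -1)
    echo "kubernetes,$context,$namespace/$pod,${version:-unknown},helm,$(date +%Y-%m-%d)" >> "$OUTPUT_FILE"
  done
done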

# Tier 3: NGINX in Docker containers (Kubernetes pods)
echo ""
echo "=== Tier 3: Application containers running NGINX ==="
for context in $(kubectl config get-contexts -o name 2>/dev/null); do
  kubectl --context="$context" get pods -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}' \
    2>/dev/null | \
  while IFS= read -r pod_images; do
    if echo "$pod_images" | grep -qE "nginx:[0-9]|/nginx:"; then
      POD=$(echo "$pod_images" | cut -d: -f1)
      IMAGE=$(echo "$pod_images" | grep -oE "nginx:[^[:space:]]+" | head -1)
      VERSION=$(echo "$IMAGE" | grep -oP '(?<=:)[\d.]+')
      echo "container,$context,$POD,${VERSION:-embedded},docker-image,unknown" >> "$OUTPUT_FILE"
    fi
  done
done

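# Keep a copy at the fixed path that the CVE check and SLA report scripts read
cp "$OUTPUT_FILE" /tmp/nginx-fleet-inventory.csv

echo ""
echo "Inventory written to: $OUTPUT_FILE (copied to /tmp/nginx-fleet-inventory.csv)"
cat "$OUTPUT_FILE"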
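
The Tier 3 matching above relies on grep patterns over image names. The same idea can be sketched in Python to make the edge cases explicit. This is an illustrative helper, not part of the inventory script: the function name is hypothetical, it only recognises repositories named exactly nginx, and an image tag is only a hint at the NGINX version actually running.

```python
import re
from typing import Optional

# Repository path ending in "nginx", with an optional tag and optional digest
_NGINX_IMAGE = re.compile(
    r"(?:^|/)nginx(?::(?P<tag>\w[\w.-]*))?(?:@sha256:[0-9a-f]+)?$"
)

def nginx_version_from_image(image: str) -> Optional[str]:
    """Return the numeric tag prefix, 'unknown' if there is none, or None if not an nginx image."""
    match = _NGINX_IMAGE.search(image)
    if match is None:
        return None
    tag = match.group("tag") or ""
    version = re.match(r"\d+(?:\.\d+)*", tag)
    return version.group(0) if version else "unknown"

if __name__ == "__main__":
    for image in (
        "nginx:1.24-alpine",
        "registry.example.com/library/nginx:1.25.3",
        "nginx:stable-alpine",
        "registry.example.com/shop/frontend:2.3.1",
    ):
        print(image, "->", nginx_version_from_image(image))
```

Images that embed NGINX under an application name (the last example) return None, which is exactly the blind spot described in Risk 2; finding those needs image scanning rather than name matching.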

Step 2 — Check inventory against CVE-affected versions

#!/usr/bin/env python3
# scripts/nginx-cve-check.py
# Compares fleet inventory against known CVE-affected version ranges

import csv
import sys
from dataclasses import dataclass
from typing import Optional
from packaging.version import Version, InvalidVersion

@dataclass
class NginxCVE:
    cve_id: str
    severity: str
    affected_mainline_lt: Optional[str]
    affected_stable_lt: Optional[str]
    description: str

# Current CVE database — update as new CVEs are published
NGINX_CVES = [
    NginxCVE("CVE-2024-7347", "MEDIUM", "1.27.1", "1.26.2",
              "ngx_http_mp4_module buffer over-read"),
    NginxCVE("CVE-2024-24989", "HIGH", "1.25.4", None,
              "QUIC module NULL pointer dereference"),
    NginxCVE("CVE-2024-24990", "HIGH", "1.25.4", None,
              "QUIC module use-after-free"),
    NginxCVE("CVE-2025-23419", "MEDIUM", "1.27.4", "1.26.3",
              "mTLS session resumption bypass"),
]

def parse_version(version_str: str) -> Optional[Version]:
    try:
        # Strip 'v' prefix if present
        return Version(version_str.lstrip('v'))
    except InvalidVersion:
        return None

def check_vulnerabilities(nginx_version: str) -> list[NginxCVE]:
    v = parse_version(nginx_version)
    if v is None:
        return []

    # NGINX numbering: odd minor versions (1.25, 1.27) are mainline,
    # even minor versions (1.24, 1.26) are stable.
    is_mainline = v.minor % 2 == 1

    findings = []
    for cve in NGINX_CVES:
        fixed_in = cve.affected_mainline_lt if is_mainline else cve.affected_stable_lt
        if not fixed_in:
            continue  # this branch is not listed as affected
        # Upper bound only: versions older than the vulnerable code are still flagged
        patched = parse_version(fixed_in)
        if patched and v < patched:
            findings.append(cve)

    return findings

if __name__ == "__main__":
    inventory_file = sys.argv[1] if len(sys.argv) > 1 else "/tmp/nginx-fleet-inventory.csv"

    critical_hosts = []
    vulnerable_hosts = []
    unknown_hosts = []

    with open(inventory_file, newline="") as inventory:
        for row in csv.DictReader(inventory):
            host_id = f"{row['Tier']}/{row['Host/Cluster']}/{row['Location']}"
            version = row.get("NGINX Version") or "unknown"
            # Missing or unparseable versions are reported as unknown, never as patched
            if parse_version(version) is None:
                unknown_hosts.append(host_id)
                print(f"UNKNOWN version: {host_id}")
                continue

            findings = check_vulnerabilities(version)
            if not findings:
                continue

            vulnerable_hosts.append(host_id)
            critical = any(cve.severity in ("HIGH", "CRITICAL") for cve in findings)
            if critical:
                critical_hosts.append(host_id)

            print(f"\n{'CRITICAL' if critical else 'WARNING'}: {host_id}")
            print(f"  Running NGINX {version}")
            for finding in findings:
                print(f"  [{finding.severity}] {finding.cve_id}: {finding.description}")

    print(f"\n\n{len(vulnerable_hosts)} vulnerable, {len(unknown_hosts)} with unknown version")
    if critical_hosts:
        print(f"{len(critical_hosts)} hosts require immediate patching")
        sys.exit(1)
    if not vulnerable_hosts and not unknown_hosts:
        print("All hosts running patched NGINX versions")
    sys.exit(0)
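
The branch-aware comparison is the part of this check that is easiest to get wrong, so here is a minimal standalone sketch of it using plain tuples instead of the packaging library. It assumes the NGINX convention that odd minor versions are mainline and even minor versions are stable; the function names are illustrative, and the fixed-in versions in the usage lines are the ones listed in the CVE table above.

```python
from typing import Optional

def parse(version: str) -> tuple[int, ...]:
    """Turn '1.26.3' or 'v1.26.3' into (1, 26, 3)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))

def is_mainline(version: str) -> bool:
    # Odd minor number means mainline, even means stable
    return parse(version)[1] % 2 == 1

def is_affected(version: str,
                fixed_mainline: Optional[str],
                fixed_stable: Optional[str]) -> bool:
    """Compare a version against the fixed-in release of its own branch."""
    fixed = fixed_mainline if is_mainline(version) else fixed_stable
    if fixed is None:
        return False  # branch not listed as affected
    return parse(version) < parse(fixed)

if __name__ == "__main__":
    # CVE-2025-23419 entry from the table: fixed in 1.27.4 (mainline) and 1.26.3 (stable)
    for candidate in ("1.26.2", "1.26.3", "1.27.3", "1.27.4"):
        print(candidate, is_affected(candidate, "1.27.4", "1.26.3"))
```

Comparing a stable version against the mainline threshold (or the reverse) would mark every patched 1.26.x host as vulnerable, which is why the branch is chosen first.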

Step 3 — Patch Tier 1: bare metal / VM via Ansible

# playbooks/nginx-emergency-patch.yml
# Emergency patching playbook for NGINX CVE remediation

---
- name: Emergency NGINX Patch — CVE Remediation
  hosts: nginx_hosts
  become: true  # Package installs and service restarts need root
  serial: "20%"  # Roll out to 20% of hosts at a time
  max_fail_percentage: 10  # Abort if more than 10% of hosts in a batch fail
  
  vars:
    target_nginx_version_debian: "1.26.3"
    target_nginx_version_rhel: "1.26.3"
    slack_webhook: "{{ lookup('env', 'SLACK_WEBHOOK') }}"
  
  pre_tasks:
    - name: Record current NGINX version
      command: nginx -v
      register: nginx_version_before
      changed_when: false
      ignore_errors: true
    
    - name: Gather service state before patching
      service_facts:
    
    # The original assert only logged a message; end_host actually skips the host
    - name: Skip hosts where NGINX is not running
      meta: end_host
      when: >-
        'nginx.service' not in ansible_facts.services
        or ansible_facts.services['nginx.service'].state != 'running'

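  tasks:
    - name: Patch NGINX on Debian/Ubuntu
      apt:
        name: nginx
        state: latest
        update_cache: yes
      when: ansible_os_family == "Debian"
      notify: restart nginx
    
    - name: Patch NGINX on RHEL/CentOS
      yum:
        name: nginx
        state: latest
        update_cache: yes
      when: ansible_os_family == "RedHat"
      notify: restart nginx
    
    - name: Record new NGINX version
      command: nginx -v
      register: nginx_version_after
      changed_when: false
    
    # Catches pinned or excluded packages that leave the old version in place
    - name: Fail if the installed version is below the target
      assert:
        that: >-
          (nginx_version_after.stderr | regex_search('[0-9]+[.][0-9]+[.][0-9]+'))
          is version(target_nginx_version_rhel if ansible_os_family == 'RedHat'
          else target_nginx_version_debian, '>=')
        fail_msg: "{{ inventory_hostname }} is still below the target NGINX version; check for package pins"
    
    - name: Test NGINX configuration is valid
      command: nginx -t
      register: nginx_test
      changed_when: false
    
    - name: Log patch result
      debug:
        msg: "{{ inventory_hostname }}: {{ nginx_version_before.stderr | default('unknown') }} → {{ nginx_version_after.stderr }}"

  handlers:
    # A reload keeps the old master process and its binary; restart to run the patched one
    - name: restart nginx
      service:
        name: nginx
        state: restarted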

  post_tasks:
    - name: Verify NGINX is serving requests after the patch
      uri:
        url: "http://localhost/health"  # Assumes each host exposes a /health endpoint
        status_code: 200
      ignore_errors: true
      register: health_check
    
    - name: Alert on failed health check
      debug:
        msg: "WARNING: Health check failed on {{ inventory_hostname }} after NGINX patch"
      when: health_check.status is defined and health_check.status != 200
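
The playbook records the nginx -v output before and after the upgrade. As a hedged sketch of how that output could be checked against a target version outside Ansible, the helper below extracts the version from the text nginx prints (for example "nginx version: nginx/1.26.3") and compares it numerically. The function names are illustrative and the sample strings are made up.

```python
import re
from typing import Optional

def installed_version(nginx_v_output: str) -> Optional[tuple[int, ...]]:
    """Extract (major, minor, patch) from the text printed by `nginx -v`."""
    match = re.search(r"nginx/(\d+(?:\.\d+)+)", nginx_v_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

def meets_target(nginx_v_output: str, target: str) -> bool:
    """True if the reported version is at least the target version."""
    installed = installed_version(nginx_v_output)
    if installed is None:
        return False  # unparseable output is treated as not patched
    return installed >= tuple(int(part) for part in target.split("."))

if __name__ == "__main__":
    print(meets_target("nginx version: nginx/1.26.3", "1.26.3"))
    print(meets_target("nginx version: nginx/1.24.0 (Ubuntu)", "1.26.3"))
```

A host that reports success from the package manager but fails this comparison is a candidate for the pinned-package case described in Risk 3.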

Step 4 — Patch Tier 2: Kubernetes ingress-nginx

#!/bin/bash
# scripts/patch-ingress-nginx.sh
# Staged ingress-nginx update across Kubernetes clusters

PATCHED_CHART_VERSION="${1:?Usage: $0 <chart-version> [cluster-context...]}"
CLUSTERS="${*:2}"  # Remaining arguments as one space-separated string

# If no clusters specified, use all contexts
if [[ -z "$CLUSTERS" ]]; then
    CLUSTERS=$(kubectl config get-contexts -o name 2>/dev/null)
fi

STAGING_CLUSTER="${STAGING_CLUSTER:-staging}"
WAIT_MINUTES="${WAIT_MINUTES:-15}"

patch_cluster() {
    local context="$1"
    echo ""
    echo "=== Patching ingress-nginx on cluster: $context ==="
    
    # Get current version
    CURRENT=$(helm -n ingress-nginx list --kube-context="$context" \
        -o json 2>/dev/null | jq -r '.[0].chart')
    echo "Current chart: $CURRENT"
    
    # Upgrade; --atomic rolls back automatically on failure.
    # (A comment after a trailing backslash would break the line continuation.)
    if ! helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
        --kube-context="$context" \
        --namespace ingress-nginx \
        --version "$PATCHED_CHART_VERSION" \
        --wait \
        --timeout 5m \
        --atomic \
        -f "helm/ingress-nginx/values-${context}.yaml" 2>&1; then
        echo "FAIL: Helm upgrade failed on $context; check for automatic rollback"
        return 1
    fi
    
    # Verify new pods are running
    kubectl --context="$context" rollout status \
        deployment/ingress-nginx-controller \
        -n ingress-nginx \
        --timeout=3m || return 1
    
    # Smoke test: send a test request through ingress
    echo "Running smoke test on $context..."
    # Some load balancers publish a hostname instead of an IP
    INGRESS_ADDR=$(kubectl --context="$context" get svc \
        -n ingress-nginx ingress-nginx-controller \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}')
    
    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 5 \
        -H "Host: health.example.com" \
        "http://$INGRESS_ADDR/health" 2>/dev/null)
    
    # Treat a failed smoke test as a failed patch so the staging gate holds
    if [[ "$HTTP_STATUS" != "200" ]]; then
        echo "FAIL: Smoke test returned HTTP $HTTP_STATUS on $context"
        return 1
    fi
    echo "OK: Smoke test passed on $context"
    
    return 0
}

# Always patch staging first
echo "Phase 1: Staging cluster"
patch_cluster "$STAGING_CLUSTER" || {
    echo "ABORT: Staging patch failed — halting production rollout"
    exit 1
}

echo "Waiting ${WAIT_MINUTES} minutes before production rollout..."
echo "Monitor staging: kubectl --context=$STAGING_CLUSTER get pods -n ingress-nginx -w"
sleep "${WAIT_MINUTES}m"

# Patch production clusters
echo ""
echo "Phase 2: Production clusters"
FAILED_CLUSTERS=()
for context in $CLUSTERS; do
    [[ "$context" == "$STAGING_CLUSTER" ]] && continue
    patch_cluster "$context" || FAILED_CLUSTERS+=("$context")
done

if [[ ${#FAILED_CLUSTERS[@]} -gt 0 ]]; then
    echo ""
    echo "FAILED clusters: ${FAILED_CLUSTERS[*]}"
    exit 1
fi

echo ""
echo "Patch complete. Run the CVE check script to verify:"
echo "  python3 scripts/nginx-cve-check.py /tmp/nginx-fleet-inventory.csv"
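
The control flow of the staged rollout (staging acts as a gate, production failures are collected rather than aborting) can be shown in a few lines. This is a simplified sketch with hypothetical names: the patch callable stands in for the Helm upgrade and smoke test, and the waiting period between phases is left out.

```python
from typing import Callable, Iterable

def staged_rollout(clusters: Iterable[str],
                   staging: str,
                   patch: Callable[[str], bool]) -> list[str]:
    """Patch staging first, then every other cluster; return the clusters that failed."""
    if not patch(staging):
        # A staging failure halts the rollout before production is touched
        raise RuntimeError(f"staging patch failed on {staging}; production untouched")
    failed = []
    for cluster in clusters:
        if cluster == staging:
            continue  # already patched in phase 1
        if not patch(cluster):
            failed.append(cluster)
    return failed

if __name__ == "__main__":
    # Stand-in for the real upgrade: pretend one production cluster fails
    def fake_patch(cluster: str) -> bool:
        print(f"patching {cluster}")
        return cluster != "prod-eu"

    print("failed:", staged_rollout(["prod-us", "staging", "prod-eu"], "staging", fake_patch))
```

Collecting production failures instead of stopping at the first one keeps the remaining clusters moving toward the patched version while the failed ones are investigated.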

Step 5 — Track patch progress and SLA compliance

#!/bin/bash
# scripts/nginx-patch-sla-report.sh
# Generates a patch SLA compliance report for a given CVE
# Note: the check covers every CVE in the check script's database;
# CVE_ID is used for the report header only.

CVE_ID="${1:?Usage: $0 <cve-id> <disclosure-date> [sla-days]}"
DISCLOSURE_DATE="${2:?Usage: $0 <cve-id> <disclosure-date> [sla-days]}"  # Format: YYYY-MM-DD
SLA_DAYS="${3:-7}"

INVENTORY_FILE="/tmp/nginx-fleet-inventory.csv"

# date -d requires GNU date
DISCLOSURE_EPOCH=$(date -d "$DISCLOSURE_DATE" +%s)
NOW_EPOCH=$(date +%s)
DAYS_ELAPSED=$(( (NOW_EPOCH - DISCLOSURE_EPOCH) / 86400 ))
SLA_DEADLINE=$(date -d "$DISCLOSURE_DATE + $SLA_DAYS days" +%Y-%m-%d)

echo "=== NGINX $CVE_ID Patch SLA Report ==="
echo "CVE disclosure: $DISCLOSURE_DATE"
echo "SLA deadline: $SLA_DEADLINE (${SLA_DAYS} days)"
echo "Days elapsed: $DAYS_ELAPSED"
echo ""

# Run the check once and keep only per-host lines ("<STATUS>: <tier>/<host>/<location>"),
# so summary lines are not counted as hosts
HOST_LINES=$(python3 scripts/nginx-cve-check.py "$INVENTORY_FILE" 2>/dev/null | \
    grep -E "^(CRITICAL|WARNING|UNKNOWN version): [a-z]+/")

while IFS= read -r line; do
    [[ -z "$line" ]] && continue
    if (( DAYS_ELAPSED > SLA_DAYS )); then
        echo "SLA BREACH: $line"
    else
        echo "OPEN: $line ($(( SLA_DAYS - DAYS_ELAPSED )) days remaining)"
    fi
done <<< "$HOST_LINES"

# Count total vs patched; hosts with unknown versions are not counted as patched
TOTAL=$(( $(wc -l < "$INVENTORY_FILE") - 1 ))  # -1 for header
VULNERABLE=$(grep -cE "^(CRITICAL|WARNING): " <<< "$HOST_LINES")
UNKNOWN=$(grep -c "^UNKNOWN" <<< "$HOST_LINES")
PATCHED=$(( TOTAL - VULNERABLE - UNKNOWN ))

echo ""
if (( TOTAL > 0 )); then
    echo "Summary: $PATCHED/$TOTAL hosts patched ($(( PATCHED * 100 / TOTAL ))%), $UNKNOWN unknown"
else
    echo "Summary: inventory is empty"
fi
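
The date arithmetic in the report is easy to check by hand. The sketch below reproduces it with Python's datetime module, using the same breach rule as the shell script (a breach once more than the SLA number of days has elapsed). The function name is illustrative and the dates are arbitrary examples, not real disclosure dates.

```python
from datetime import date, timedelta

def sla_status(disclosed: date, sla_days: int, today: date) -> dict:
    """Compute the SLA deadline, days elapsed, days remaining and breach flag."""
    elapsed = (today - disclosed).days
    return {
        "deadline": disclosed + timedelta(days=sla_days),
        "days_elapsed": elapsed,
        "days_remaining": sla_days - elapsed,
        "breached": elapsed > sla_days,
    }

if __name__ == "__main__":
    # Arbitrary example: disclosed on 1 March with a 7-day SLA
    for today in (date(2025, 3, 5), date(2025, 3, 10)):
        print(today, sla_status(date(2025, 3, 1), 7, today))
```

With a 7-day SLA, day 4 leaves 3 days remaining and day 9 is a breach with a negative remainder, which matches what the shell report prints for open hosts.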

Expected Behaviour

Scenario | Without process | With process
CVE published; fleet inventory needed | Manual host-by-host audit; 2–3 days | Inventory script runs in 10 minutes; vulnerable hosts identified immediately
Patch Ansible tier | Manual per-host; no coordination | serial: 20% rolling playbook; run aborts if more than 10% of a batch fails; health check post-patch
Patch Kubernetes ingress-nginx | Manual Helm upgrade per cluster; no staging gate | Staged script: staging first, 15-min wait, then production with atomic rollback
Application container tier | Not tracked; patching unknown | Docker image scan in CI catches NGINX base image version; application teams alerted
CVE SLA compliance reporting | Not tracked | SLA report script lists open hosts, the fleet-wide patched percentage, and days remaining

Trade-offs

Aspect | Benefit | Cost | Mitigation
serial: 20% in Ansible | Limits blast radius of bad patch | Takes longer to patch full fleet | Acceptable trade-off for zero-downtime patching; increase serial percentage for critical CVEs
Helm --atomic flag | Automatic rollback on upgrade failure | Rollback may cause brief traffic interruption during rollout | Prefer rollback over leaving a broken ingress state; monitor rollout in real time
Inventorying application container NGINX | Full fleet visibility | Requires scanning all pod images in all clusters | Run inventory script weekly and on CVE publication; pipe into Slack or ticketing
CVE database in script | Immediate check without external dependencies | Must be updated manually as CVEs are published | Subscribe to nginx-announce mailing list; create a bot that opens a PR to update the CVE list

Failure Modes

Failure | Symptom | Detection | Recovery
Ansible NGINX package pinned by OS configuration | Package manager reports nginx as “already latest” but vulnerable version remains | CVE check script still flags host post-patch | Check /etc/apt/preferences.d/ for nginx pins (or yum.conf excludes); remove the pin and retry the patch
Helm --atomic rollback triggers mid-production rollout | Ingress-nginx rolls back; cluster still on old version | Helm release shows previous revision; CVE check flags cluster | Investigate why upgrade failed (helm history ingress-nginx -n ingress-nginx); fix and retry
Application container NGINX not caught by inventory | Hidden vulnerable NGINX in application pods | Post-patch CVE scan reports clean; exploitation via application container | Add image scanning to application CI pipelines; require approved NGINX base image tags
Patch applied but NGINX not restarted | Service still running the old binary from before the update | nginx -v shows the new version but ls -l /proc/$(pgrep -o nginx)/exe shows the old binary marked (deleted) | systemctl restart nginx (not reload) starts the new binary; confirm the service is still active afterwards