Kubernetes Node Kernel Patch Velocity: Draining and Replacing Nodes at Speed After a Critical CVE

Problem

A critical kernel CVE drops Monday morning. Your Kubernetes cluster has 300 nodes across three availability zones running Ubuntu 24.04 with the vulnerable kernel. The public proof-of-concept is already on GitHub. You need to patch all 300 nodes within 24 hours — without taking the cluster down.

The challenge is not conceptually hard: drain each node, apply the patch, reboot, verify. The challenge is operational velocity at scale. Draining a node takes time: workloads need grace periods to shut down cleanly, PodDisruptionBudgets (PDBs) need to be respected, and DaemonSets need to be excluded from draining. If you drain nodes one at a time sequentially and each drain takes five minutes, 300 nodes takes 25 hours — you’ve already missed your window. If you drain too many nodes simultaneously, you violate PDBs and start losing quorum for stateful workloads.
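The arithmetic above can be sketched as a small helper for comparing batch sizes against the deadline. The function name estimate_patch_hours and the 30-minute per-batch figure are illustrative assumptions, not measurements from a real cluster.

```shell
# estimate_patch_hours NODES BATCH_SIZE MINUTES_PER_BATCH
# Prints the wall-clock hours needed when batches run one after another
# and every node inside a batch is handled in parallel.
estimate_patch_hours() {
  awk -v n="$1" -v b="$2" -v m="$3" 'BEGIN {
    batches = int((n + b - 1) / b)   # ceiling division
    printf "%.1f\n", batches * m / 60
  }'
}

estimate_patch_hours 300 1 5     # sequential, 5 min per node: prints 25.0
estimate_patch_hours 300 30 30   # 10% batches, assuming ~30 min per batch: prints 5.0
```

The per-batch time is the number to measure early: it includes the slowest drain in the batch plus the patch and reboot time, so one slow node sets the pace for the whole batch.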

The other challenge is scope clarity. Kernel patching in Kubernetes means three distinct infrastructure patterns depending on your node management model:

  1. Self-managed nodes — VMs you provision and manage directly, where in-place patching (apt upgrade + reboot) is the primary path.
  2. Cloud-managed node groups — AWS managed node groups, GKE node pools, AKS node pools — where the cleanest path is replacing nodes from a pre-patched AMI or OS image.
  3. Cluster API-managed nodes — where a MachineDeployment rolling update handles the replacement automatically.

This article covers all three, with a focus on doing it fast and verifiably. Target systems: Kubernetes 1.28+, any cloud provider, systemd-based Linux nodes.

Threat Model

Kernel LPE from within a container. An attacker who has any code execution inside a Pod — through a compromised application, a supply chain attack, or a stolen service account token that lets them exec into a pod — can exploit a kernel local privilege escalation to escape the container and gain root on the host node. Namespaces, seccomp profiles, and AppArmor policies slow this down in some configurations; they do not block a kernel exploit that bypasses the privilege enforcement mechanism itself.

Node root leads to full node compromise. Once an attacker has root on a node, they have access to everything on that node: all pod filesystems via the container runtime, the kubelet’s credentials in /var/lib/kubelet/, all secrets mounted into pods on that node, and the node’s cloud IAM identity (instance profile, workload identity). The kubelet credential lets the attacker authenticate to the API server as the node. With the default Node authorizer and NodeRestriction admission plugin, that identity can read the Secrets and ConfigMaps referenced by pods scheduled to that node and can update the status of the Node and of those pods.

Slow patch rollout as lateral movement opportunity. In a multi-tenant cluster, every unpatched node is a potential pivot point. An attacker who compromises one tenant’s workload can wait for it to be scheduled onto an unpatched node, escalate to root, steal kubelet credentials, and begin enumerating secrets belonging to other tenants. A 72-hour patch window in a multi-tenant cluster is a significant exposure. A 24-hour window is better. With a public proof-of-concept already circulating, every hour below that further narrows the window in which an attacker can find and use an unpatched node.

The module blacklist bypass. The temporary mitigation for many kernel network LPEs (blacklisting a kernel module) can be undone by a process with root that calls modprobe directly, or by a system service that legitimately reloads network configuration. This mitigation buys time; it is not a substitute for patching.

Configuration and Implementation

Step 1: Fleet-Wide Kernel Version Audit

Before doing anything else, produce a precise inventory of which nodes are running which kernel version. This tells you the blast radius and lets you confirm remediation when you’re done.

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'

To identify only vulnerable nodes — those running a kernel below the fixed version — pipe through a simple filter. In this example the fix landed in 6.8.0-60-generic:

FIXED_VERSION="6.8.0-60"

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}' \
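  | awk -v fixed="$FIXED_VERSION" '
    BEGIN { split(fixed, f, "-") }   # f[1] = "6.8.0", f[2] = "60"
    {
      split($2, ver, "-")            # ver[1] = upstream version, ver[2] = Ubuntu ABI number
      # Within the same upstream version, compare the ABI number numerically.
      # Nodes on a different kernel series are flagged for a separate check.
      if (ver[1] != f[1]) print $1, $2, "CHECK-MANUALLY"
      else if (ver[2]+0 < f[2]+0) print $1, $2, "VULNERABLE"
      else print $1, $2, "OK"
    }
  '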

Save the full output to a file. You will need it to confirm that every node that started vulnerable has been remediated before you declare the incident closed.
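Where nodes run more than one kernel series, a version-aware sort can replace hand-written field comparison. The following is a minimal sketch that assumes GNU sort (for the -V option) and Ubuntu-style version strings; the function names kernel_is_vulnerable and classify_nodes are hypothetical.

```shell
# kernel_is_vulnerable RUNNING FIXED
# Exit status 0 if RUNNING sorts strictly below FIXED, 1 otherwise.
# The flavour suffix (for example "-generic") is stripped before comparing.
kernel_is_vulnerable() {
  running=$(printf '%s\n' "$1" | sed -E 's/^([0-9]+(\.[0-9]+)*-[0-9]+).*/\1/')
  fixed=$2
  [ "$running" = "$fixed" ] && return 1
  lowest=$(printf '%s\n%s\n' "$running" "$fixed" | sort -V | head -n 1)
  [ "$lowest" = "$running" ]
}

# classify_nodes FIXED
# Reads "name<TAB>kernel" lines on stdin and appends VULNERABLE or OK.
classify_nodes() {
  while IFS="$(printf '\t')" read -r name kernel; do
    if kernel_is_vulnerable "$kernel" "$1"; then
      printf '%s\t%s\tVULNERABLE\n' "$name" "$kernel"
    else
      printf '%s\t%s\tOK\n' "$name" "$kernel"
    fi
  done
}
```

Piping the tab-separated kubectl output from this step into classify_nodes 6.8.0-60 yields the same three-column listing. A purely numeric ordering says nothing about whether the fix was backported to an older series, so treat results for other series as a hint and confirm them against the vendor advisory.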

For a larger cluster, get structured JSON output that your incident management tooling can consume:

kubectl get nodes -o json \
  | jq -r '.items[] | [.metadata.name, .status.nodeInfo.kernelVersion, (.status.conditions[] | select(.type=="Ready") | .status)] | @tsv'

Node count by kernel version, useful for a quick summary:

kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' | sort | uniq -c | sort -rn

Step 2: Immediate Module Blacklist as Temporary Mitigation

While the full patch rollout is being prepared and executed, apply a temporary mitigation without a reboot. For CVE-2026-43284/43500 (Dirty Frag), the vulnerability is in the XFRM subsystem used by IPsec; unloading the esp4 and esp6 modules eliminates the exploitable code path on nodes that are not actively using IPsec.

Deploy a privileged DaemonSet to execute this across all nodes immediately:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-module-blacklist
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kernel-module-blacklist
  template:
    metadata:
      labels:
        app: kernel-module-blacklist
    spec:
      hostPID: true
      hostNetwork: true
      tolerations:
        - operator: Exists
      initContainers:
        - name: unload-modules
          image: ubuntu:24.04
          command:
            - /bin/bash
            - -c
            - |
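              set -euo pipefail
              for mod in esp4 esp6 xfrm4_tunnel xfrm6_tunnel; do
                # /proc/modules reflects the host kernel. The ubuntu image ships no kmod tools,
                # so run the host's own modprobe through the /host mount.
                if grep -q "^${mod} " /proc/modules; then
                  chroot /host modprobe -r "$mod" && echo "Unloaded $mod" || echo "Failed to unload $mod (may be in use)"
                else
                  echo "$mod not loaded"
                fi
              done
              # Persist the block. "blacklist" only stops alias-based autoloading;
              # the "install" lines also make a plain modprobe of the module fail.
              cat > /host/etc/modprobe.d/cve-2026-43284-blacklist.conf <<'EOF'
              blacklist esp4
              blacklist esp6
              blacklist xfrm4_tunnel
              blacklist xfrm6_tunnel
              install esp4 /bin/false
              install esp6 /bin/false
              install xfrm4_tunnel /bin/false
              install xfrm6_tunnel /bin/false
              EOF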
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-root
              mountPath: /host
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
      volumes:
        - name: host-root
          hostPath:
            path: /

kubectl apply -f kernel-module-blacklist.yaml

# Verify rollout
kubectl rollout status daemonset/kernel-module-blacklist -n kube-system

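# Confirm modules are unloaded on a sample node (no output means none are loaded).
# /proc/modules is read directly because the ubuntu image does not include lsmod.
kubectl debug node/node-name-001 -it --image=ubuntu:24.04 -- grep -E '^(esp4|esp6|xfrm4_tunnel|xfrm6_tunnel) ' /proc/modules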

Important: this is a mitigation, not a fix. NetworkManager and certain CNI plugins can trigger module reloads. Monitor for reloaded modules (the DaemonSet in Step 7 can be adapted for this). The initContainer re-runs the unload whenever the pod is restarted, but it will not catch a reload triggered by a network event between pod restarts.
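One way to monitor for reloads is to publish module state through node-exporter's textfile collector, as Step 7 does for the kernel version. This is a sketch only: the metric name kernel_module_loaded, the function name module_metrics, and the output path in the comment are assumptions.

```shell
# module_metrics MODULE...
# Reads /proc/modules-formatted text on stdin and prints one gauge line per
# requested module: 1 if the module is currently loaded, 0 if it is not.
module_metrics() {
  awk -v want="$*" '
    BEGIN { n = split(want, mods, " ") }
    { loaded[$1] = 1 }
    END {
      print "# HELP kernel_module_loaded 1 if the named kernel module is loaded"
      print "# TYPE kernel_module_loaded gauge"
      for (i = 1; i <= n; i++)
        printf "kernel_module_loaded{module=\"%s\"} %d\n", mods[i], (mods[i] in loaded) ? 1 : 0
    }'
}

# On a node, refresh the metric file periodically (the path is an assumption):
# module_metrics esp4 esp6 xfrm4_tunnel xfrm6_tunnel < /proc/modules \
#   > /var/node-exporter/kernel_modules.prom
```

An alert on kernel_module_loaded == 1 would then flag nodes where the mitigation has been undone. Running the refresh in a loop inside a long-lived container, rather than in an initContainer, is what closes the gap between pod restarts.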

Step 3: Cordon and Drain Pattern

For in-place patching, the standard workflow is cordon (prevent new scheduling) then drain (evict existing workloads).

NODE="node-name-001"

# Cordon prevents new pods from being scheduled here
kubectl cordon "$NODE"

# Drain evicts all evictable pods
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s

The flags matter:

  • --ignore-daemonsets: DaemonSet pods are not evicted (they can’t be rescheduled elsewhere by design). They will be terminated when the node shuts down.
  • --delete-emptydir-data: Pods using emptyDir volumes will lose that data on eviction. This flag acknowledges that and proceeds. If any workloads store meaningful state in emptyDir, you need to fix that separately.
  • --grace-period=60: Gives every pod 60 seconds to shut down, overriding each pod’s own terminationGracePeriodSeconds. Tune this to match your workloads’ actual shutdown requirements, or omit the flag to honour the per-pod values.
  • --timeout=300s: If the drain hasn’t completed in 5 minutes, fail rather than hang indefinitely. A stuck drain usually means a PDB is blocking eviction (see Failure Modes).

After drain completes and the kernel is patched and the node is back up:

kubectl uncordon "$NODE"

Step 4: Automated Rolling Patch with Node Groups

Patching 300 nodes one at a time is too slow. But patching all 300 simultaneously will violate PDBs and potentially take down services. The right batch size depends on your cluster size and PDB configuration, but 10% is a reasonable starting point — 30 nodes at a time from a 300-node cluster, leaving 270 nodes to absorb the rescheduled workloads.
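The batch arithmetic can be sketched with two helpers. The names batch_size_for and make_batches are hypothetical, and rounding down (with a floor of one node) is a conservative choice rather than a rule.

```shell
# batch_size_for TOTAL PERCENT
# Nodes per batch, rounded down, never below 1.
batch_size_for() {
  awk -v t="$1" -v p="$2" 'BEGIN { b = int(t * p / 100); print (b < 1 ? 1 : b) }'
}

# make_batches SIZE
# Reads node names on stdin (one per line) and prints one batch per line,
# with the names in a batch separated by spaces.
make_batches() {
  awk -v size="$1" '
    { printf "%s%s", (NR % size == 1 || size == 1 ? "" : " "), $0
      if (NR % size == 0) printf "\n" }
    END { if (NR % size != 0) printf "\n" }'
}
```

For example, batch_size_for 300 10 prints 30, and feeding a node list through make_batches 30 gives ten lines that can be reviewed, or reordered so that one batch never drains a whole availability zone, before anything is cordoned.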

A shell script that drives rolling drain across a node list:

#!/usr/bin/env bash
set -euo pipefail

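# Select the nodes labelled as vulnerable (see the labelling command after this script)
NODES=$(kubectl get nodes -l vulnerability=cve-2026-43284 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

BATCH_SIZE=30
DRAIN_TIMEOUT=300
PATCH_TIMEOUT=600  # seconds to wait for the node to come back after reboot

# An empty here-string would still produce one (empty) array element, so stop early
[[ -n "$NODES" ]] || { echo "No nodes carry the vulnerability label; nothing to do."; exit 0; }
mapfile -t NODE_ARRAY <<< "$NODES"
TOTAL=${#NODE_ARRAY[@]}
echo "Total vulnerable nodes: $TOTAL"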

for ((i=0; i<TOTAL; i+=BATCH_SIZE)); do
  BATCH=("${NODE_ARRAY[@]:i:BATCH_SIZE}")
  echo "--- Patching batch $((i/BATCH_SIZE + 1)): ${#BATCH[@]} nodes ---"

  # Cordon all nodes in batch first
  for node in "${BATCH[@]}"; do
    kubectl cordon "$node"
    echo "Cordoned $node"
  done

  # Drain nodes in parallel within the batch
  PIDS=()
  for node in "${BATCH[@]}"; do
    kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --grace-period=60 \
      --timeout="${DRAIN_TIMEOUT}s" &
    PIDS+=($!)
  done

  # Wait for all drains in this batch
  for pid in "${PIDS[@]}"; do
    wait "$pid" || { echo "Drain failed for a node in this batch"; exit 1; }
  done

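  echo "All nodes in batch drained. Apply patches now (manual step or trigger automation)."
  # For in-place patching, this is where you'd invoke your configuration management tool
  # (Ansible, Chef, Puppet) against the batch, or trigger SSM Run Command on AWS.

  # A drained node still reports Ready, so waiting on Ready alone would return before the
  # reboot happens. Record each node's current kernel, then wait for it to change.
  declare -A OLD_KERNEL=()
  for node in "${BATCH[@]}"; do
    OLD_KERNEL[$node]=$(kubectl get node "$node" -o jsonpath='{.status.nodeInfo.kernelVersion}')
  done

  for node in "${BATCH[@]}"; do
    echo "Waiting for $node to reboot into a new kernel..."
    deadline=$((SECONDS + PATCH_TIMEOUT))
    until new=$(kubectl get node "$node" -o jsonpath='{.status.nodeInfo.kernelVersion}') \
        && [[ -n "$new" && "$new" != "${OLD_KERNEL[$node]}" ]]; do
      (( SECONDS < deadline )) || { echo "Timed out waiting for $node to change kernel"; exit 1; }
      sleep 10
    done
    kubectl wait node "$node" --for=condition=Ready --timeout="${PATCH_TIMEOUT}s"
    kubectl uncordon "$node"
    echo "Uncordoned $node"
  done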

  echo "Batch complete. Sleeping 30s before next batch."
  sleep 30
done

echo "All nodes patched."

Label vulnerable nodes at the start of the incident to make selection easy:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}' \
  | grep "6\.8\.0-[0-5][0-9]-generic" \
  | awk '{print $1}' \
  | xargs -I{} kubectl label node {} vulnerability=cve-2026-43284 --overwrite

Step 5: Cluster API MachineDeployment Rolling Upgrade

If you’re using Cluster API, node replacement via a MachineDeployment rolling update is cleaner than in-place patching: you replace nodes from a pre-patched OS image, which gives you a known-good state rather than hoping the in-place patch applied correctly.

First, build or reference the patched AMI/OS image. On AWS, this might be an AMI you’ve built with Packer that uses the 6.8.0-60-generic kernel. Update the MachineDeployment:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-nodes-us-east-1a
  namespace: default
spec:
  replicas: 100
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10   # No more than 10 nodes unavailable at once
      maxSurge: 10         # Bring up 10 new nodes before draining old ones
  template:
    spec:
      bootstrap:
        configRef:
          name: worker-bootstrap-config
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: worker-patched-kernel  # Updated template referencing new AMI

The AWSMachineTemplate references the patched AMI:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
  name: worker-patched-kernel
  namespace: default
spec:
  template:
    spec:
      ami:
        id: ami-0a1b2c3d4e5f67890  # AMI with kernel 6.8.0-60-generic pre-installed
      instanceType: m5.xlarge
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      sshKeyName: cluster-keypair

Apply the updated template and update the MachineDeployment’s infrastructure reference, then watch the rollout:

kubectl apply -f worker-patched-kernel-template.yaml
kubectl patch machinedeployment worker-nodes-us-east-1a \
  --type=merge \
  -p '{"spec":{"template":{"spec":{"infrastructureRef":{"name":"worker-patched-kernel"}}}}}'

# Watch the rollout
kubectl get machinedeployment worker-nodes-us-east-1a -w

Cluster API respects the maxUnavailable and maxSurge settings, cordons and drains old nodes before terminating them, and waits for new nodes to become Ready before proceeding. This is the lowest-risk path if you have pre-built patched images.

Step 6: kured for Automated Reboot After In-Place Patch

kured (Kubernetes Reboot Daemon) watches for the /var/run/reboot-required sentinel file that apt upgrade creates and automates the cordon-drain-reboot-uncordon cycle. For organizations running regular unattended upgrades, kured is the steady-state tool; during an incident it can be configured with a maintenance window of “now” to drive the rolling reboot.

Deploy kured with Helm:

helm repo add kubereboot https://kubereboot.github.io/charts
helm repo update

helm upgrade --install kured kubereboot/kured \
  --namespace kube-system \
  --set configuration.rebootSentinel=/var/run/reboot-required \
  --set configuration.drainTimeout=300s \
  --set configuration.period=1m \
  --set configuration.rebootCommand=/bin/systemctl\ reboot \
  --set tolerations[0].operator=Exists

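For the incident response case, make sure no maintenance window restricts reboots and raise the concurrency:

# If a failed node left kured's reboot lock held, release it by removing the lock annotation.
# This does not change the maintenance window; it only unblocks a stuck rollout.
kubectl annotate daemonset kured -n kube-system \
  weave.works/kured-node-lock-

# Patch the DaemonSet args. With no --start-time, --end-time, or --reboot-days flags,
# kured may reboot at any time. A strategic merge patch merges the container by name;
# --type=merge would replace the whole containers list and drop the image.
kubectl patch daemonset kured -n kube-system \
  --type=strategic \
  -p '{
    "spec": {
      "template": {
        "spec": {
          "containers": [{
            "name": "kured",
            "args": [
              "--reboot-sentinel=/var/run/reboot-required",
              "--period=1m",
              "--drain-timeout=300s",
              "--concurrency=10"
            ]
          }]
        }
      }
    }
  }'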

The --concurrency=10 flag tells kured to reboot up to 10 nodes simultaneously, rather than the default of one at a time. Tune this against your PDB requirements and cluster capacity.

Trigger the apt upgrade across all nodes using your configuration management tooling (Ansible, AWS SSM Run Command, GCP OS Config). kured detects the /var/run/reboot-required file on each node and queues the reboot, respecting the concurrency limit.

# Ansible example — run against all cluster nodes
ansible all -i inventory/production \
  -m apt \
  -a "upgrade=dist update_cache=yes" \
  --become

After the upgrade completes on each node, /var/run/reboot-required is written. kured picks it up within the next poll period (configured as 1 minute above) and begins the drain-reboot cycle.

Step 7: Verification — DaemonSet Kernel Version Check

After the patch rollout completes, verify every node is running the fixed kernel. Deploy a verification DaemonSet that exports a Prometheus gauge:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-version-check
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kernel-version-check
  template:
    metadata:
      labels:
        app: kernel-version-check
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      tolerations:
        - operator: Exists
      containers:
        - name: kernel-check
          image: prom/node-exporter:v1.8.0
          args:
            - --collector.disable-defaults
            - --collector.textfile
            - --collector.textfile.directory=/var/node-exporter
          ports:
            - containerPort: 9100
          volumeMounts:
            - name: textfile-dir
              mountPath: /var/node-exporter
      initContainers:
        - name: write-kernel-metric
          image: ubuntu:24.04
          env:
            # The container hostname is the pod name, so take the node name from the downward API
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          command:
            - /bin/bash
            - -c
            - |
              KERNEL=$(uname -r)
              MIN_PATCH="6.8.0-60"
              # Compares only the Ubuntu ABI number, so this assumes every node is on the 6.8.0 series
              KERNEL_NUM=$(echo "$KERNEL" | grep -oP '^\d+\.\d+\.\d+-\K\d+')
              MIN_NUM=$(echo "$MIN_PATCH" | grep -oP '\d+$')
              if [ "${KERNEL_NUM:-0}" -ge "${MIN_NUM:-999}" ]; then
                COMPLIANT=1
              else
                COMPLIANT=0
              fi
              cat > /var/node-exporter/kernel_patch_compliance.prom <<EOF
              # HELP kernel_patch_compliant 1 if node kernel meets minimum patched version for CVE-2026-43284
              # TYPE kernel_patch_compliant gauge
              kernel_patch_compliant{node="$NODE_NAME",kernel="$KERNEL",min_required="$MIN_PATCH"} $COMPLIANT
              EOF
          volumeMounts:
            - name: textfile-dir
              mountPath: /var/node-exporter
      volumes:
        - name: textfile-dir
          emptyDir: {}

This relies on node-exporter’s textfile collector to expose the metric. A Prometheus alert on kernel_patch_compliant == 0 fires for any node still running the vulnerable kernel:

groups:
  - name: kernel-patch-compliance
    rules:
      - alert: NodeKernelVulnerable
        expr: kernel_patch_compliant == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} running vulnerable kernel {{ $labels.kernel }}"
          description: "Minimum required kernel is {{ $labels.min_required }}. This node must be patched immediately."

For a quick ad-hoc check without deploying a DaemonSet:

# Check kernel version on all nodes via node debug containers.
# Each iteration leaves a completed debug pod behind; delete them afterwards.
for node in $(kubectl get nodes -o name | sed 's|node/||'); do
  echo -n "$node: "
  kubectl debug node/"$node" -it --image=ubuntu:24.04 -- uname -r 2>/dev/null | tr -d '\r'
done
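To close the loop with the inventory saved in Step 1, the baseline can be compared against a fresh listing. A minimal sketch, assuming both files hold "name<TAB>kernel" lines and the baseline contains only the nodes that were flagged as vulnerable; the function name unremediated_nodes and the file names in the usage note are hypothetical.

```shell
# unremediated_nodes BASELINE CURRENT
# Prints every node that appears in both files with an unchanged kernel string.
# A baseline node missing from CURRENT (replaced or removed) is not reported.
unremediated_nodes() {
  awk -F '\t' '
    NR == FNR { before[$1] = $2; next }          # first file: remember baseline kernels
    ($1 in before) && before[$1] == $2 { print $1 "\t" $2 }
  ' "$1" "$2"
}
```

Running unremediated_nodes vulnerable-baseline.tsv current.tsv and getting no output means every node that started vulnerable has either changed kernel or left the cluster. A changed kernel is not proof of a fixed kernel, so pair this with the compliance metric above.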

Handling PodDisruptionBudgets That Block Drains

PDBs that can never be satisfied, such as maxUnavailable: 0 or a minAvailable equal to the replica count, make drain retry eviction until it times out (or indefinitely if no --timeout is set). Identify which PDBs are blocking:

# List all PDBs with their current status
kubectl get pdb -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,MIN_AVAILABLE:.spec.minAvailable,MAX_UNAVAILABLE:.spec.maxUnavailable,DISRUPTIONS_ALLOWED:.status.disruptionsAllowed'

# Find PDBs currently allowing zero disruptions
kubectl get pdb -A -o json \
  | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | [.metadata.namespace, .metadata.name, .status.currentHealthy, .status.desiredHealthy] | @tsv'

For a PDB that is incorrectly configured (for example, a single-replica Deployment with minAvailable: 1 — which can never be drained), you have two options:

Option 1: Fix the PDB — if the application genuinely should be scaled up for high availability, do that now. Increase the Deployment replicas before attempting to drain.

Option 2: Temporarily patch the PDB during the incident — if the PDB is misconfigured and the application owner accepts the disruption:

# Temporarily allow disruption by setting minAvailable to 0
kubectl patch pdb my-app-pdb -n production \
  --type=merge \
  -p '{"spec":{"minAvailable":0}}'

# Drain the node
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Restore the original PDB after the node is uncordoned
kubectl patch pdb my-app-pdb -n production \
  --type=merge \
  -p '{"spec":{"minAvailable":1}}'

Document any PDB modifications made during the incident for the post-incident review.
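The reason a single-replica Deployment with minAvailable: 1 can never be drained follows from how allowed disruptions are derived: healthy pods minus the required minimum, floored at zero. A sketch of that arithmetic, reading the tab-separated namespace, name, currentHealthy, and desiredHealthy columns produced by the jq query above; the function names are hypothetical, and the scale-up hint holds only for an integer minAvailable.

```shell
# pdb_disruptions_allowed CURRENT_HEALTHY DESIRED_HEALTHY
# Prints how many pods may be evicted right now.
pdb_disruptions_allowed() {
  allowed=$(( $1 - $2 ))
  [ "$allowed" -lt 0 ] && allowed=0
  echo "$allowed"
}

# explain_blocked
# Reads "namespace<TAB>name<TAB>currentHealthy<TAB>desiredHealthy" lines and
# reports, for each blocked PDB, how many extra healthy pods would permit one eviction.
explain_blocked() {
  while IFS="$(printf '\t')" read -r ns name current desired; do
    if [ "$(pdb_disruptions_allowed "$current" "$desired")" -eq 0 ]; then
      printf '%s/%s: %s healthy, %s required, scale up by at least %s to allow one eviction\n' \
        "$ns" "$name" "$current" "$desired" "$(( desired - current + 1 ))"
    fi
  done
}
```

For PDBs that use maxUnavailable or a percentage, adding replicas also raises the required minimum, so scaling up may not help and the PDB itself has to change.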

Expected Behaviour

The following table compares patching approaches for a 300-node cluster:

| Approach | Time to Patch 300 Nodes | Disruption Level | Verification Method |
|---|---|---|---|
| Manual drain one by one | 25–40 hours (sequential, ~5 min/node) | Low (single node offline at a time) | Manual uname -r per node |
| Rolling drain script (10% batches) | 4–6 hours | Low-medium (30 nodes offline per batch, PDBs respected) | DaemonSet Prometheus metric |
| kured with concurrency=10 | 6–10 hours (apt + reboot per batch) | Low-medium (10 nodes rebooting at a time) | DaemonSet Prometheus metric |
| MachineDeployment image update | 3–5 hours (surge=10, parallel provisioning) | Low (surge creates new nodes before draining old) | Node kernel version in kubectl get nodes |
| Module blacklist DaemonSet only | 5–10 minutes (no reboot) | None | lsmod via debug container |

The module blacklist DaemonSet is the first action (fastest mitigation), followed by the patching approach best suited to your infrastructure. A MachineDeployment rolling update is the fastest full remediation when nodes are managed by Cluster API and a patched image is available. The rolling drain script is the practical path for self-managed nodes when you don’t have pre-patched images ready.

Trade-offs

| Dimension | Option A | Option B | When to Prefer A | When to Prefer B |
|---|---|---|---|---|
| Patch method | In-place patch (apt upgrade + reboot) | Node replacement (new node from patched AMI) | No pre-built images available; on-prem nodes | Cloud provider with image pipeline; want clean state |
| Batch size | 10% of nodes at a time (30/300) | 25% of nodes at a time (75/300) | Tight PDB constraints; stateful workloads; unknown PDB config | Well-understood PDB config; mostly stateless workloads; time pressure |
| Reboot automation | kured automated reboot | Manual cordon/drain/reboot per node | Steady-state operations; large fleet | Incident with specific ordering requirements; need precise control over sequencing |
| Kernel fix type | Live patch (kpatch, livepatch) | Full kernel update + reboot | Cannot afford any downtime; very short maintenance window | Need complete fix; live patches sometimes incomplete for complex vulnerabilities |

Live patching note: kernel live patching tools like Ubuntu Livepatch or kpatch can apply kernel patches without a reboot. For a critical CVE this is attractive — no node drain required, no workload disruption. The trade-offs: live patches are not always available for a given CVE on the day of disclosure (the patch has to be compiled and distributed), live patches cover only the running kernel and don’t persist across reboots (so a rebooted node would need to reapply), and live patches for complex vulnerabilities can be incomplete or have edge cases not covered by the live patch scope. For an LPE with a public PoC, the conservative path is a full kernel update with reboot.

Failure Modes

| Failure Mode | Symptom | Detection | Recovery |
|---|---|---|---|
| PDB blocking drain | kubectl drain hangs; node stuck in SchedulingDisabled (cordoned) but not drained; pods not evicted after timeout | kubectl get pdb -A shows DISRUPTIONS_ALLOWED=0; kubectl drain output shows “cannot evict pod as it would violate PDB” | Scale up the affected Deployment to allow disruption, or temporarily patch the PDB to minAvailable: 0; document and restore after |
| Node fails to boot after patch | Node status NotReady for >10 minutes after reboot; workloads from that node not rescheduled (DaemonSet pods stuck Terminating) | kubectl get nodes shows NotReady; node age resets; Prometheus alert on kube_node_status_condition{condition="Ready",status="true"} == 0 | SSH to node directly; check journalctl -b for boot failures; may require kernel rollback (grub boot to previous kernel entry) or node termination and replacement |
| kured reboot window too narrow | Nodes not rebooted within maintenance window; /var/run/reboot-required present but kured did not act; incident closure delayed | kured logs show nodes deferred past window end; kubectl logs -n kube-system daemonset/kured shows “not within reboot window”; patch compliance DaemonSet still shows failures | Extend or temporarily remove the maintenance window; increase kured concurrency; consider switching to manual drain for remaining nodes to meet the deadline |
| Module mitigation reloaded by network manager | esp4/esp6 reappear in lsmod after network reconfiguration; CVE-2026-43284 exploitable again on affected nodes | Periodic lsmod check via DaemonSet; Prometheus alert on module presence; network events in journalctl showing NetworkManager or CNI plugin reload | Re-apply module blacklist DaemonSet; accelerate full patch rollout for these nodes; ensure blacklist config file persists in /etc/modprobe.d/ |
| Drain times out on nodes with long-lived connections | kubectl drain hits --timeout before all pods are evicted; some pods remain on the cordoned node | Drain exits with error; kubectl get pods -o wide --field-selector spec.nodeName=node-name shows remaining pods | Identify pods with long graceful shutdown requirements; increase --grace-period; investigate pods that ignore SIGTERM; force-delete as last resort |