Kubernetes CVE Auto-Remediation Operator: Closing the Patch Window Automatically

The Problem

The median time between a CVE publication and a working exploit in the wild has been falling for years. LLM-assisted exploit development has accelerated that trend: a CVE published at 09:00 can have a PoC generated by an AI tool and weaponised by mid-afternoon. Standard enterprise patch windows — monthly, bi-weekly, or even weekly cycles — no longer close the exposure gap for CVSS 9+ vulnerabilities with network-accessible attack surfaces.

Kubernetes clusters are particularly exposed because they concentrate many separately updated components: container base images (often Debian or Alpine), runtimes (containerd, runc), ingress controllers (NGINX, Envoy), sidecar proxies (Istio, Linkerd), and monitoring agents, each with its own CVE stream. A cluster operator tracking patch needs across 50 DaemonSets, 200 Deployments, and 5 node pool images cannot manually triage and act on each new CVE as it drops.

The solution is an operator that automates the remediation loop:

  1. Ingest CVE feeds (OSV, NVD, GitHub Advisories) on a short polling interval.
  2. Correlate findings against the current workload inventory (image digests, package SBOMs stored in Artifact Registry or another OCI registry).
  3. Score each finding using EPSS (exploitation probability) and KEV (Known Exploited Vulnerabilities) membership to prioritise automatic vs manual action.
  4. Remediate automatically for high-confidence, high-severity cases: update the image reference in a Deployment/DaemonSet to the patched digest, trigger a node pool rotation for node-level CVEs, or open a PR if the operator is configured in audit-only mode.
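
Steps 3 and 4 can be sketched as a small triage function. This is only an illustration under assumptions: the Policy type, the Action values, and the Triage function are hypothetical names, and the CVSS floor is applied before the KEV and EPSS checks, which is one possible ordering rather than a required one.

```go
package main

// Action is the outcome of triaging a single finding.
type Action int

const (
	Ignore    Action = iota // below the CVSS floor: no alert, no rollout
	AlertOnly               // reported, but left for manual triage
	Remediate               // qualifies for automatic remediation
)

// Policy holds hypothetical thresholds for the scoring step.
type Policy struct {
	MinCVSS            float64 // minimum CVSS score for any action
	EPSSThreshold      float64 // exploitation probability, 0.0 to 1.0
	AlwaysRemediateKEV bool    // KEV membership overrides the EPSS check
}

// Triage decides what to do with one finding. KEV membership bypasses the
// EPSS threshold but not the CVSS floor (an assumption of this sketch).
func Triage(p Policy, cvss, epss float64, inKEV bool) Action {
	if cvss < p.MinCVSS {
		return Ignore
	}
	if inKEV && p.AlwaysRemediateKEV {
		return Remediate
	}
	if epss >= p.EPSSThreshold {
		return Remediate
	}
	return AlertOnly
}
```

Keeping the decision in a pure function like this makes the policy easy to unit-test separately from the cluster and feed clients.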

This is distinct from EPSS-driven triage tools (which produce prioritised lists) and Copa-style patching (which patches images in place). The operator closes the loop by taking the remediation action autonomously, with configurable gates for human review.

Target systems: Kubernetes 1.28+ clusters; Artifact Registry, ECR, or similar registry with image signing; workloads using image digest pinning; GitOps workflows (ArgoCD, Flux) for deployment configuration management.

Threat Model

1. LLM-generated exploit released same day as CVE (external attacker). Objective: exploit a container base-image CVE before the platform team patches it. Impact: container compromise; potential host escape if the CVE is in runc or containerd. Mitigation: auto-remediation closes the window from days to minutes for qualifying CVEs.

2. Node-level CVE exploited from within a container (attacker with container code execution). Objective: escalate from container to host via a kernel or containerd CVE. Impact: host-level compromise, cluster-wide lateral movement. Mitigation: node pool rotation triggered automatically when node-image CVEs exceed threshold.

3. Operator misconfiguration causing unintended rollouts (operator with write access to cluster). Objective (accidental): operator auto-updates a production DaemonSet image during a traffic peak, causing a rollout that degrades service. Impact: availability incident from an automated patch. Mitigation: rollout gates (maintenance windows, canary deployment integration, Prometheus health checks before proceeding).

4. CVE feed injection (attacker who compromises the CVE feed source or the operator’s network path to it). Objective: cause the operator to believe a non-existent CVE affects a running image; trigger a forced rollout that disrupts operations. Impact: availability attack via the security toolchain. Mitigation: TLS-verified feed fetches; signature verification on OSV JSON payloads.

Hardening Configuration

Operator Architecture

CVE Feeds (OSV, NVD, GitHub) → CVE Ingestor → CVE Store (Redis/etcd)
                                                        ↓
Cluster Workload Scanner → Image Inventory → Correlator → Decision Engine
                                                        ↓
                                              (EPSS < threshold) → Alert only
                                              (EPSS ≥ threshold) → Remediator
                                                        ↓
                                         GitOps PR / Direct kubectl patch

CRD: CVERemediationPolicy

apiVersion: security.example.com/v1alpha1
kind: CVERemediationPolicy
metadata:
  name: production-policy
  namespace: security-ops
spec:
  # EPSS probability threshold above which auto-remediation fires (0.0–1.0)
  epssThreshold: 0.15        # 15% exploitation probability
  # Always auto-remediate if in CISA KEV list, regardless of EPSS
  alwaysRemediateKEV: true
  # Minimum CVSS score to trigger any action (alert or auto-remediate)
  minCvssScore: 7.0
  # Scopes this policy applies to
  scope:
    namespaces: ["production", "staging"]
    excludeNamespaces: ["kube-system"]   # kube-system requires manual approval
  # How remediation is performed
  remediationMode: GitOpsPR   # GitOpsPR | DirectPatch | AlertOnly
  gitops:
    repo: "github.com/example/k8s-manifests"
    branch: "auto-patch"
    reviewers: ["security-team"]
  # Rollout safety gates
  rolloutGates:
    # Only roll out during these windows (UTC)
    maintenanceWindows:
      - start: "02:00"
        end: "06:00"
        days: ["Mon", "Tue", "Wed", "Thu", "Fri"]
    # Check Prometheus before proceeding with rollout
    healthChecks:
      - query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) < 0.01'
        description: "Error rate below 1%"
    maxConcurrentRollouts: 2

Operator Controller Logic

// pkg/controller/remediation_controller.go
package controller

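import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/tools/record"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    // Hypothetical module path for the generated CVERemediationPolicy API types.
    secv1alpha1 "github.com/example/cve-remediator/api/v1alpha1"
)

type RemediationReconciler struct {
    client.Client
    CVEStore      CVEStore
    ImageStore    ImageInventory
    EPSSClient    EPSSClient
    KEVClient     KEVClient
    GitOpsClient  GitOpsClient
    EventRecorder record.EventRecorder // emits the audit events for each remediation
}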

func (r *RemediationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Load the policy
    policy := &secv1alpha1.CVERemediationPolicy{}
    if err := r.Get(ctx, req.NamespacedName, policy); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Scan all images in scope against CVE store
    findings := r.correlateFindings(ctx, policy)

    for _, finding := range findings {
        // Score the finding
        epss := r.EPSSClient.GetScore(finding.CVE.ID)
        isKEV := r.KEVClient.IsKnownExploited(finding.CVE.ID)

        shouldRemediate := (epss >= policy.Spec.EPSSThreshold) ||
            (isKEV && policy.Spec.AlwaysRemediateKEV)

        if !shouldRemediate {
            r.emitAlert(finding, epss, isKEV)
            continue
        }

        if !r.rolloutGatesPassed(ctx, policy) {
            // Outside maintenance window or health checks failing — queue
            r.queueForNextWindow(finding)
            continue
        }

        if err := r.remediate(ctx, policy, finding); err != nil {
            // controller-runtime ignores Result when err is non-nil and
            // requeues the request with its own exponential backoff.
            return ctrl.Result{}, err
        }
    }

    return ctrl.Result{RequeueAfter: 15 * time.Minute}, nil
}

func (r *RemediationReconciler) remediate(
    ctx context.Context,
    policy *secv1alpha1.CVERemediationPolicy,
    finding Finding,
) error {
    // Find the patched image digest from the registry
    patchedDigest, err := r.ImageStore.GetPatchedDigest(
        finding.Image, finding.CVE.ID,
    )
    if err != nil {
        return err
    }

    switch policy.Spec.RemediationMode {
    case "GitOpsPR":
        return r.GitOpsClient.CreatePatchPR(finding, patchedDigest)
    case "DirectPatch":
        return r.patchDeployment(ctx, finding, patchedDigest)
    default:
        return nil  // AlertOnly
    }
}

func (r *RemediationReconciler) patchDeployment(
    ctx context.Context,
    finding Finding,
    patchedDigest string,
) error {
    deploy := &appsv1.Deployment{}
    if err := r.Get(ctx, types.NamespacedName{
        Name: finding.WorkloadName, Namespace: finding.Namespace,
    }, deploy); err != nil {
        return err
    }

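    // Build a strategic merge patch that updates only the affected container.
    // Containers are merged by name, so other containers are left untouched.
    // The image field needs a full reference (repository@sha256:...), so
    // patchedDigest is assumed to carry the repository prefix.
    patch := []byte(`{"spec":{"template":{"spec":{"containers":[{` +
        `"name":"` + finding.ContainerName + `",` +
        `"image":"` + patchedDigest + `"}]}}}}`)

    return r.Patch(ctx, deploy, client.RawPatch(types.StrategicMergePatchType, patch))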
}
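
The controller above refers to several types it does not define. One way they might look, inferred only from how the controller calls them, is sketched below. Every definition here is an assumption; CVEStore is left out because the controller code does not show any of its methods. The staticEPSS and staticKEV types are hypothetical in-memory fakes for unit tests.

```go
package main

// CVE identifies a vulnerability; only the ID is used by the controller.
type CVE struct {
	ID string
}

// Finding links one CVE to one container of one workload.
type Finding struct {
	CVE           CVE
	Image         string // image reference currently deployed
	CurrentDigest string
	Namespace     string
	WorkloadName  string
	ContainerName string
}

// EPSSClient returns the exploitation probability (0.0 to 1.0) for a CVE.
type EPSSClient interface {
	GetScore(cveID string) float64
}

// KEVClient reports membership in the Known Exploited Vulnerabilities list.
type KEVClient interface {
	IsKnownExploited(cveID string) bool
}

// ImageInventory resolves the digest of an image build that fixes a CVE.
type ImageInventory interface {
	GetPatchedDigest(image, cveID string) (string, error)
}

// GitOpsClient opens a pull request that bumps the image reference.
type GitOpsClient interface {
	CreatePatchPR(finding Finding, patchedDigest string) error
}

// staticEPSS is an in-memory EPSSClient; unknown CVEs score 0.
type staticEPSS map[string]float64

func (s staticEPSS) GetScore(cveID string) float64 { return s[cveID] }

// staticKEV is an in-memory KEVClient; unknown CVEs are not in the list.
type staticKEV map[string]bool

func (s staticKEV) IsKnownExploited(cveID string) bool { return s[cveID] }
```

Depending on interfaces rather than concrete feed clients lets the reconciler be tested without network access.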

Image Inventory Integration

The operator needs to know what images are running and their package contents. Integrate with an SBOM store:

# Generate SBOM for each deployed image at push time (in CI)
syft REGION-docker.pkg.dev/PROJECT/repo/myapp:latest \
  -o spdx-json > sbom.json

# Attach the SBOM to the image as an OCI referrer artifact
oras attach \
  --artifact-type application/spdx+json \
  REGION-docker.pkg.dev/PROJECT/repo/myapp@sha256:abc... \
  sbom.json:application/spdx+json

The operator queries the SBOM to identify which packages in each image are affected by a given CVE, enabling precise correlation rather than image-level matching only.
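
Package-level correlation against an SPDX JSON SBOM might look like the sketch below. It is a simplified illustration: the Advisory type and MatchSBOM function are hypothetical, only the name and versionInfo fields of each SPDX package are read, and versions are matched by exact string. A real correlator would need ecosystem-aware version range comparison (for example dpkg or apk ordering).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spdxPackage holds the two SPDX package fields this sketch relies on.
type spdxPackage struct {
	Name        string `json:"name"`
	VersionInfo string `json:"versionInfo"`
}

// spdxDocument is the subset of an SPDX JSON document that is decoded.
type spdxDocument struct {
	Packages []spdxPackage `json:"packages"`
}

// Advisory is a hypothetical, simplified view of one CVE feed entry.
type Advisory struct {
	CVEID              string
	Package            string
	VulnerableVersions map[string]bool // exact-match set, for illustration only
}

// MatchSBOM returns "name@version" for every SBOM package hit by adv.
func MatchSBOM(sbom []byte, adv Advisory) ([]string, error) {
	var doc spdxDocument
	if err := json.Unmarshal(sbom, &doc); err != nil {
		return nil, fmt.Errorf("parse SBOM: %w", err)
	}
	var hits []string
	for _, p := range doc.Packages {
		if p.Name == adv.Package && adv.VulnerableVersions[p.VersionInfo] {
			hits = append(hits, p.Name+"@"+p.VersionInfo)
		}
	}
	return hits, nil
}
```

An empty result means the image contains no affected package version, so the finding can be dropped even if an image-level scanner flagged it.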

Rollout Safety Integration with Argo Rollouts

# When using Argo Rollouts, the operator creates a new rollout analysis
# rather than directly updating the Deployment
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: cve-patch-safety-check
spec:
  # Arguments referenced in the query must be declared on the template
  args:
    - name: deployment-name
  metrics:
    - name: error-rate
      successCondition: result[0] < 0.01
      failureLimit: 2
      interval: 30s
      provider:
        prometheus:
          address: http://prometheus:9090
          query: >
            sum(rate(http_requests_total{status=~"5..",
              deployment="{{ args.deployment-name }}"}[5m])) /
            sum(rate(http_requests_total{
              deployment="{{ args.deployment-name }}"}[5m]))

Audit Log and RBAC

# The operator service account needs only targeted permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cve-remediator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "statefulsets"]
    verbs: ["get", "list", "watch", "patch"]
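  - apiGroups: ["security.example.com"]
    resources: ["cveremediationpolicies"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]   # needed to record remediation events
  # RBAC is additive and has no deny rules: secrets, configmaps, and RBAC
  # resources stay inaccessible because no rule here grants them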

Every remediation action is recorded:

r.EventRecorder.Event(
    deploy,
    corev1.EventTypeNormal,
    "CVEAutoRemediation",
    fmt.Sprintf("Auto-patched container %s from %s to %s for %s (EPSS=%.3f)",
        finding.ContainerName,
        finding.CurrentDigest[:12],
        patchedDigest[:12],
        finding.CVE.ID,
        epss,
    ),
)
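
Slicing a digest with [:12] panics if the string is shorter than 12 bytes, and for a value such as sha256:abc... most of the slice is taken up by the algorithm prefix. A small helper can avoid both problems. ShortDigest is a hypothetical name, and the sketch assumes digests look like algorithm:hex, optionally preceded by a repository reference.

```go
package main

import "strings"

// ShortDigest returns up to the first 12 hex characters of a digest for
// use in log and event messages. It accepts "sha256:<hex>", a full
// "repo@sha256:<hex>" reference, or a bare hex string, and never panics
// on short input.
func ShortDigest(d string) string {
	hex := d
	// Drop everything up to and including the last colon, which removes
	// the algorithm prefix and any repository or port component.
	if i := strings.LastIndex(d, ":"); i >= 0 {
		hex = d[i+1:]
	}
	if len(hex) > 12 {
		hex = hex[:12]
	}
	return hex
}
```

In the event message, ShortDigest(finding.CurrentDigest) and ShortDigest(patchedDigest) would then stand in for the raw slices.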

Expected Behaviour After Hardening

| Scenario | Without Operator | With Operator |
|---|---|---|
| CVSS 9.8 CVE published at 09:00 | Patch queued in weekly cycle | EPSS score checked at 09:15; if ≥ 15%, GitOps PR created by 09:20 |
| CISA KEV list updated with new entry | Manual notification, 2-day triage | Operator sees KEV update; remediates all affected workloads at next maintenance window |
| Operator runs during health check failure | N/A (no automation) | Rollout gated; queued until error rate returns to normal |
| Patched image not yet available in registry | N/A | Operator retries at 15-minute intervals; alerts if patch not available after 4 hours |
| Workload has no matching patched image tag | N/A | Operator opens issue/alert with manual action required |

Trade-offs and Operational Considerations

| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Direct patch mode | Fastest remediation (minutes) | Bypasses human review for critical systems | Restrict DirectPatch to non-production by default; require GitOpsPR for production |
| Maintenance window gates | Prevents disruptive rollouts | Extends exposure until the next window opens | Shorten maintenance window spacing to 6-hour intervals for critical CVEs |
| EPSS threshold | Filters noise; only acts on likely-to-be-exploited CVEs | EPSS lags in the first hours after a CVE is published; the threshold may miss early exploitation | Set alwaysRemediateKEV: true to catch actively exploited CVEs regardless of EPSS |
| SBOM-based correlation | Precise matching; fewer false positives | SBOM generation must be wired into every build | Add SBOM upload as a mandatory CI step; fail build if SBOM absent |
| GitOps PR mode | Human review maintained | Adds review latency | Set merge automation for PRs approved by security team with CODEOWNERS rules |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Patched image digest unavailable | Remediation stalled; CVE window open | Operator metric cve_remediation_stalled fires alert | Trigger manual image build; notify security team; document exception |
| Operator RBAC too broad | Operator modifies unintended resources | RBAC audit log shows unexpected patch operations | Restrict to specific resource names; add OPA admission policy to constrain operator |
| CVE feed rate limiting | Operator unable to fetch updates; stale data | cve_feed_fetch_failures metric spikes | Implement exponential backoff; cache last-known-good feed snapshot |
| Health check flapping prevents rollout | CVEs remain unpatched despite EPSS threshold | cve_rollout_gated_health_check metric persists | Review health check thresholds; allow security override with dual-approval |
| GitOps PR accumulation | 50+ open PRs from operator; review fatigue | PR count metric in GitHub API | Group CVE PRs by workload; set auto-merge policy for critical CVEs after security review |