Kubernetes CVE Auto-Remediation Operator: Closing the Patch Window Automatically

The Problem

The median time between a CVE publication and a working exploit in the wild has been falling for years. LLM-assisted exploit development has accelerated that trend: a CVE published at 09:00 can have a PoC generated by an AI tool and weaponised by mid-afternoon. Standard enterprise patch windows — monthly, bi-weekly, or even weekly cycles — no longer close the exposure gap for CVSS 9+ vulnerabilities with network-accessible attack surfaces.

Kubernetes clusters are particularly exposed because they concentrate many separately-updated components: container base images (often Debian or Alpine), runtimes (containerd, runc), ingress controllers (NGINX, Envoy), sidecar proxies (Istio, Linkerd), and monitoring agents, each with its own CVE stream. A cluster operator tracking patch needs across 50 DaemonSets, 200 Deployments, and 5 node pool images cannot manually triage and act on each new CVE as it drops.

The solution is an operator that automates the remediation loop:

1. Ingest CVE feeds (OSV, NVD, GitHub Advisories) on a short polling interval.
2. Correlate findings against the current workload inventory (image digests, package SBOMs stored in Artifact Registry or another OCI registry).
3. Score each finding using EPSS (exploitation probability) and KEV (Known Exploited Vulnerabilities) membership to decide between automatic and manual action.
4. Remediate automatically for high-confidence, high-severity cases: update the image reference in a Deployment/DaemonSet to the patched digest, trigger a node pool rotation for node-level CVEs, or open a PR instead of patching directly if the operator is configured in audit-only mode.

This is distinct from EPSS-driven triage tools (which produce prioritised lists) and Copa-style patching (which patches images in place). The operator closes the loop by taking the remediation action autonomously, with configurable gates for human review.
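The ingest step of the loop above can be sketched against OSV's public query endpoint (`POST /v1/query`). This is a minimal sketch, not the operator's actual ingestor: the function names are hypothetical, the package, ecosystem, and advisory ID in `main` are placeholder values, and only the advisory IDs are decoded from the response.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// osvQuery is the request body for OSV's POST /v1/query endpoint.
type osvQuery struct {
	Package osvPackage `json:"package"`
	Version string     `json:"version"`
}

type osvPackage struct {
	Name      string `json:"name"`
	Ecosystem string `json:"ecosystem"`
}

// osvResponse keeps only the advisory IDs from the response.
type osvResponse struct {
	Vulns []struct {
		ID string `json:"id"`
	} `json:"vulns"`
}

// BuildOSVQuery encodes a single package/version lookup.
func BuildOSVQuery(name, ecosystem, version string) ([]byte, error) {
	return json.Marshal(osvQuery{
		Package: osvPackage{Name: name, Ecosystem: ecosystem},
		Version: version,
	})
}

// ParseOSVResponse extracts advisory IDs; a response without "vulns" yields none.
func ParseOSVResponse(body []byte) ([]string, error) {
	var r osvResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, fmt.Errorf("decode OSV response: %w", err)
	}
	ids := make([]string, 0, len(r.Vulns))
	for _, v := range r.Vulns {
		ids = append(ids, v.ID)
	}
	return ids, nil
}

// QueryOSV posts the query to endpoint (for example https://api.osv.dev/v1/query).
func QueryOSV(ctx context.Context, c *http.Client, endpoint, name, ecosystem, version string) ([]string, error) {
	payload, err := BuildOSVQuery(name, ecosystem, version)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("OSV returned %s", resp.Status)
	}
	// Cap the body size so a misbehaving feed cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 8<<20))
	if err != nil {
		return nil, err
	}
	return ParseOSVResponse(body)
}

func main() {
	// Offline demonstration: build a request and parse a canned response.
	q, _ := BuildOSVQuery("openssl", "Debian", "1.2.3")
	fmt.Println(string(q))
	ids, err := ParseOSVResponse([]byte(`{"vulns":[{"id":"CVE-0000-0001"}]}`))
	fmt.Println(ids, err)
}
```

Keeping request building and response parsing separate from the HTTP call makes both testable without network access; a real ingestor would call `QueryOSV` once per package found in the workload inventory.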

Target systems: Kubernetes 1.28+ clusters; Artifact Registry, ECR, or similar registry with image signing; workloads using image digest pinning; GitOps workflows (ArgoCD, Flux) for deployment configuration management.

Threat Model

1. LLM-generated exploit released same day as CVE (external attacker). Objective: exploit a container base-image CVE before the platform team patches it. Impact: container compromise; potential host escape if the CVE is in runc or containerd. Mitigation: auto-remediation closes the window from days to minutes for qualifying CVEs.

2. Node-level CVE exploited from within a container (attacker with container code execution). Objective: escalate from container to host via a kernel or containerd CVE. Impact: host-level compromise, cluster-wide lateral movement. Mitigation: node pool rotation triggered automatically when node-image CVEs exceed threshold.

3. Operator misconfiguration causing unintended rollouts (operator with write access to cluster). Objective (accidental): operator auto-updates a production DaemonSet image during a traffic peak, causing a rollout that degrades service. Impact: availability incident from an automated patch. Mitigation: rollout gates (maintenance windows, canary deployment integration, Prometheus health checks before proceeding).

4. CVE feed injection (attacker who compromises the CVE feed source or the operator’s network path to it). Objective: cause the operator to believe a non-existent CVE affects a running image; trigger a forced rollout that disrupts operations. Impact: availability attack via the security toolchain. Mitigation: TLS-verified feed fetches; signature or checksum verification on feed payloads where the source publishes them.

Hardening Configuration

Operator Architecture

CVE Feeds (OSV, NVD, GitHub) → CVE Ingestor → CVE Store (Redis/etcd)
                                                        ↓
Cluster Workload Scanner → Image Inventory → Correlator → Decision Engine
                                                        ↓
                                              (EPSS < threshold) → Alert only
                                              (EPSS ≥ threshold) → Remediator
                                                        ↓
                                         GitOps PR / Direct kubectl patch

CRD: CVERemediationPolicy

apiVersion: security.example.com/v1alpha1
kind: CVERemediationPolicy
metadata:
  name: production-policy
  namespace: security-ops
spec:
  # EPSS probability threshold above which auto-remediation fires (0.0–1.0)
  epssThreshold: 0.15        # 15% exploitation probability
  # Always auto-remediate if in CISA KEV list, regardless of EPSS
  alwaysRemediateKEV: true
  # Minimum CVSS score to trigger any action (alert or auto-remediate)
  minCvssScore: 7.0
  # Scopes this policy applies to
  scope:
    namespaces: ["production", "staging"]
    excludeNamespaces: ["kube-system"]   # kube-system requires manual approval
  # How remediation is performed
  remediationMode: GitOpsPR   # GitOpsPR | DirectPatch | AlertOnly
  gitops:
    repo: "github.com/example/k8s-manifests"
    branch: "auto-patch"
    reviewers: ["security-team"]
  # Rollout safety gates
  rolloutGates:
    # Only roll out during these windows (UTC)
    maintenanceWindows:
      - start: "02:00"
        end: "06:00"
        days: ["Mon", "Tue", "Wed", "Thu", "Fri"]
    # Check Prometheus before proceeding with rollout
    healthChecks:
      - query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) < 0.01'
        description: "Error rate below 1%"
    maxConcurrentRollouts: 2
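
How the three scoring fields of the policy combine can be sketched as a small pure function. This is one possible reading of the comments in the policy, not a confirmed implementation: it assumes `minCvssScore` gates every action (including KEV entries), and the Go type and field names are hypothetical.

```go
package main

import "fmt"

// PolicySpec mirrors the scoring fields of the CVERemediationPolicy above.
// The Go field names are assumptions made for this sketch.
type PolicySpec struct {
	EPSSThreshold      float64
	AlwaysRemediateKEV bool
	MinCVSSScore       float64
}

// Action is the outcome of scoring one finding.
type Action string

const (
	ActionIgnore    Action = "Ignore"
	ActionAlert     Action = "Alert"
	ActionRemediate Action = "Remediate"
)

// Decide applies the policy thresholds to a single finding.
func Decide(p PolicySpec, cvss, epss float64, isKEV bool) Action {
	// Below the CVSS floor nothing happens, not even an alert.
	if cvss < p.MinCVSSScore {
		return ActionIgnore
	}
	// KEV membership overrides the EPSS threshold when enabled.
	if isKEV && p.AlwaysRemediateKEV {
		return ActionRemediate
	}
	if epss >= p.EPSSThreshold {
		return ActionRemediate
	}
	return ActionAlert
}

func main() {
	p := PolicySpec{EPSSThreshold: 0.15, AlwaysRemediateKEV: true, MinCVSSScore: 7.0}
	fmt.Println(Decide(p, 9.8, 0.02, false)) // high severity, low EPSS, not in KEV
	fmt.Println(Decide(p, 9.8, 0.02, true))  // same finding once it is listed in KEV
	fmt.Println(Decide(p, 5.0, 0.90, false)) // below the CVSS floor
}
```

Keeping the decision free of I/O makes the policy semantics easy to unit-test separately from feed and cluster access.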

Operator Controller Logic

// pkg/controller/remediation_controller.go
package controller

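import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    // Generated API types for the CVERemediationPolicy CRD.
    // The module path is a placeholder; substitute your operator's own.
    secv1alpha1 "github.com/example/cve-remediation-operator/api/v1alpha1"
)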

type RemediationReconciler struct {
    client.Client
    CVEStore     CVEStore
    ImageStore   ImageInventory
    EPSSClient   EPSSClient
    KEVClient    KEVClient
    GitOpsClient GitOpsClient
}

func (r *RemediationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Load the policy
    policy := &secv1alpha1.CVERemediationPolicy{}
    if err := r.Get(ctx, req.NamespacedName, policy); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Scan all images in scope against CVE store
    findings := r.correlateFindings(ctx, policy)

    for _, finding := range findings {
        // Score the finding
        epss := r.EPSSClient.GetScore(finding.CVE.ID)
        isKEV := r.KEVClient.IsKnownExploited(finding.CVE.ID)

        shouldRemediate := (epss >= policy.Spec.EPSSThreshold) ||
            (isKEV && policy.Spec.AlwaysRemediateKEV)

        if !shouldRemediate {
            r.emitAlert(finding, epss, isKEV)
            continue
        }

        if !r.rolloutGatesPassed(ctx, policy) {
            // Outside maintenance window or health checks failing — queue
            r.queueForNextWindow(finding)
            continue
        }

        if err := r.remediate(ctx, policy, finding); err != nil {
            // A non-nil error makes controller-runtime requeue with
            // rate-limited backoff; a RequeueAfter returned with it is ignored.
            return ctrl.Result{}, err
        }
    }

    return ctrl.Result{RequeueAfter: 15 * time.Minute}, nil
}

func (r *RemediationReconciler) remediate(
    ctx context.Context,
    policy *secv1alpha1.CVERemediationPolicy,
    finding Finding,
) error {
    // Find the patched image digest from the registry
    patchedDigest, err := r.ImageStore.GetPatchedDigest(
        finding.Image, finding.CVE.ID,
    )
    if err != nil {
        return err
    }

    switch policy.Spec.RemediationMode {
    case "GitOpsPR":
        return r.GitOpsClient.CreatePatchPR(finding, patchedDigest)
    case "DirectPatch":
        return r.patchDeployment(ctx, finding, patchedDigest)
    default:
        return nil  // AlertOnly
    }
}

func (r *RemediationReconciler) patchDeployment(
    ctx context.Context,
    finding Finding,
    patchedDigest string,
) error {
    deploy := &appsv1.Deployment{}
    if err := r.Get(ctx, types.NamespacedName{
        Name: finding.WorkloadName, Namespace: finding.Namespace,
    }, deploy); err != nil {
        return err
    }

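    // Update only the affected container and let controller-runtime compute
    // the strategic merge patch from the diff. Concatenating JSON by hand
    // breaks if a name or image reference contains a quote.
    // patchedDigest is assumed to be a full image reference (repo@sha256:...).
    original := deploy.DeepCopy()
    containers := deploy.Spec.Template.Spec.Containers
    for i := range containers {
        if containers[i].Name == finding.ContainerName {
            containers[i].Image = patchedDigest
        }
    }

    return r.Patch(ctx, deploy, client.StrategicMergeFrom(original))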
}
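
The controller calls `rolloutGatesPassed` without showing it. The maintenance-window half of that gate might look like the following sketch. The `Window` type and helper names are hypothetical; times are interpreted as UTC as in the policy, the end time is treated as exclusive, and windows that cross midnight are deliberately not handled.

```go
package main

import (
	"fmt"
	"time"
)

// Window mirrors one maintenanceWindows entry; Start and End are "HH:MM" in UTC.
type Window struct {
	Start, End string
	Days       []string // three-letter names: "Mon", "Tue", ...
}

// parseHHMM converts "HH:MM" into minutes since midnight.
func parseHHMM(s string) (int, error) {
	t, err := time.Parse("15:04", s)
	if err != nil {
		return 0, fmt.Errorf("invalid time %q: %w", s, err)
	}
	return t.Hour()*60 + t.Minute(), nil
}

// InWindow reports whether now falls inside any configured window.
func InWindow(now time.Time, windows []Window) (bool, error) {
	now = now.UTC()
	minutes := now.Hour()*60 + now.Minute()
	day := now.Weekday().String()[:3] // "Monday" -> "Mon"
	for _, w := range windows {
		start, err := parseHHMM(w.Start)
		if err != nil {
			return false, err
		}
		end, err := parseHHMM(w.End)
		if err != nil {
			return false, err
		}
		for _, d := range w.Days {
			if d == day && minutes >= start && minutes < end {
				return true, nil
			}
		}
	}
	return false, nil
}

// mustTime parses an RFC 3339 timestamp for the demo below.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	ws := []Window{{Start: "02:00", End: "06:00", Days: []string{"Mon", "Tue", "Wed", "Thu", "Fri"}}}
	for _, ts := range []string{"2024-01-03T03:30:00Z", "2024-01-06T03:30:00Z"} {
		ok, err := InWindow(mustTime(ts), ws)
		fmt.Println(ts, ok, err)
	}
}
```

The first timestamp is a Wednesday inside the window and the second is a Saturday, so only the first passes the gate.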

Image Inventory Integration

The operator needs to know what images are running and their package contents. Integrate with an SBOM store:

# Generate SBOM for each deployed image at push time (in CI)
syft REGION-docker.pkg.dev/PROJECT/repo/myapp:latest \
  -o spdx-json > sbom.json

# Attach the SBOM to the image as an OCI referrer artifact
oras attach REGION-docker.pkg.dev/PROJECT/repo/myapp@sha256:abc... \
  --artifact-type application/spdx+json \
  sbom.json:application/spdx+json

The operator queries the SBOM to identify which packages in each image are affected by a given CVE, enabling precise correlation rather than image-level matching only.
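
Package-level correlation against an SPDX JSON SBOM can be sketched as follows. The `Advisory` type is a hypothetical, simplified view of a CVE record that carries an explicit list of affected versions; real advisories may also express version ranges, which this sketch does not evaluate. Only the SPDX `packages[].name` and `packages[].versionInfo` fields are read.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spdxDocument holds the subset of an SPDX JSON SBOM needed for matching.
type spdxDocument struct {
	Packages []spdxPackage `json:"packages"`
}

type spdxPackage struct {
	Name        string `json:"name"`
	VersionInfo string `json:"versionInfo"`
}

// Advisory is a simplified, hypothetical CVE record with enumerated versions.
type Advisory struct {
	ID               string
	Package          string
	AffectedVersions []string
}

// AffectedPackages returns "name@version" for each SBOM package the advisory hits.
func AffectedPackages(sbomJSON []byte, adv Advisory) ([]string, error) {
	var doc spdxDocument
	if err := json.Unmarshal(sbomJSON, &doc); err != nil {
		return nil, fmt.Errorf("decode SBOM: %w", err)
	}
	affected := make(map[string]bool, len(adv.AffectedVersions))
	for _, v := range adv.AffectedVersions {
		affected[v] = true
	}
	var hits []string
	for _, p := range doc.Packages {
		if p.Name == adv.Package && affected[p.VersionInfo] {
			hits = append(hits, p.Name+"@"+p.VersionInfo)
		}
	}
	return hits, nil
}

func main() {
	// Placeholder SBOM and advisory; names and versions are illustrative only.
	sbom := []byte(`{"packages":[
		{"name":"libexample","versionInfo":"1.0.1"},
		{"name":"zlib","versionInfo":"1.3"}]}`)
	adv := Advisory{ID: "CVE-0000-0001", Package: "libexample", AffectedVersions: []string{"1.0.0", "1.0.1"}}
	hits, err := AffectedPackages(sbom, adv)
	fmt.Println(hits, err)
}
```

An image whose SBOM produces no hits can be skipped even if its base image name matches the advisory, which is the precision gain over image-level matching.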

Rollout Safety Integration with Argo Rollouts

# When using Argo Rollouts, the operator creates a new rollout analysis
# rather than directly updating the Deployment
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: cve-patch-safety-check
spec:
  # Declare the argument referenced in the query below
  args:
    - name: deployment-name
  metrics:
    - name: error-rate
      successCondition: result[0] < 0.01
      failureLimit: 2
      interval: 30s
      provider:
        prometheus:
          address: http://prometheus:9090
          query: >
            sum(rate(http_requests_total{status=~"5..",
              deployment="{{ args.deployment-name }}"}[5m])) /
            sum(rate(http_requests_total{
              deployment="{{ args.deployment-name }}"}[5m]))
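
Where Argo Rollouts is not in use, the operator can evaluate the policy's own `healthChecks` query against the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes the query ends in a comparison such as `< 0.01`: PromQL comparisons filter the result, so a non-empty vector means the condition holds. Function names are hypothetical.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// promResponse is the subset of the Prometheus query response used here.
type promResponse struct {
	Status string `json:"status"`
	Error  string `json:"error"`
	Data   struct {
		ResultType string            `json:"resultType"`
		Result     []json.RawMessage `json:"result"`
	} `json:"data"`
}

// GatePassed reports whether a filtered comparison query returned any sample.
func GatePassed(body []byte) (bool, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false, fmt.Errorf("decode Prometheus response: %w", err)
	}
	if r.Status != "success" {
		return false, fmt.Errorf("prometheus query failed: %s", r.Error)
	}
	if r.Data.ResultType != "vector" {
		return false, fmt.Errorf("expected vector result, got %q", r.Data.ResultType)
	}
	return len(r.Data.Result) > 0, nil
}

// CheckGate runs an instant query against baseURL (e.g. http://prometheus:9090).
func CheckGate(ctx context.Context, c *http.Client, baseURL, query string) (bool, error) {
	u := baseURL + "/api/v1/query?query=" + url.QueryEscape(query)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return false, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return false, err
	}
	return GatePassed(body)
}

func main() {
	// Offline demonstration with canned responses.
	healthy := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"0.002"]}]}}`)
	failing := []byte(`{"status":"success","data":{"resultType":"vector","result":[]}}`)
	fmt.Println(GatePassed(healthy))
	fmt.Println(GatePassed(failing))
}
```

An empty result fails the gate, which also covers the case where no matching series exist, so the check fails closed when metrics are missing.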

Audit Log and RBAC

# The operator service account needs only targeted permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cve-remediator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "statefulsets"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["security.example.com"]
    resources: ["cveremediationpolicies"]
    verbs: ["get", "list", "watch", "update", "patch"]
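  # Needed to record remediation Events on the patched workloads
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  # RBAC is additive, so there is nothing to deny explicitly: secrets,
  # configmaps, and RBAC resources stay off-limits because no rule grants them.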

Every remediation action is recorded:

r.EventRecorder.Event(
    deploy,
    corev1.EventTypeNormal,
    "CVEAutoRemediation",
    fmt.Sprintf("Auto-patched container %s from %s to %s for %s (EPSS=%.3f)",
        finding.ContainerName,
        finding.CurrentDigest[:12],
        patchedDigest[:12],
        finding.CVE.ID,
        epss,
    ),
)
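
The `[:12]` slice expressions in the event message panic if a digest string is shorter than 12 bytes, and with a `sha256:` prefix they show only a few hex characters. A small helper, hypothetical and not part of the original controller, makes the truncation safe:

```go
package main

import (
	"fmt"
	"strings"
)

// shortDigest returns up to 12 hex characters of a digest for log output.
// It accepts a bare digest, a "sha256:"-prefixed digest, or a full
// "repo@sha256:..." image reference, and never panics on short input.
func shortDigest(ref string) string {
	if i := strings.LastIndex(ref, "@"); i >= 0 {
		ref = ref[i+1:]
	}
	ref = strings.TrimPrefix(ref, "sha256:")
	if len(ref) > 12 {
		return ref[:12]
	}
	return ref
}

func main() {
	fmt.Println(shortDigest("sha256:0123456789abcdef0123456789abcdef"))
	fmt.Println(shortDigest("registry.example.com/app@sha256:0123456789abcdef"))
	fmt.Println(shortDigest("abc"))
}
```

Calling `shortDigest(finding.CurrentDigest)` and `shortDigest(patchedDigest)` in the `fmt.Sprintf` arguments would keep the event message readable without risking a runtime panic.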

Expected Behaviour After Hardening

Scenario	Without Operator	With Operator
CVSS 9.8 CVE published at 09:00	Patch queued in weekly cycle	EPSS score checked at 09:15; if ≥ 15%, GitOps PR created by 09:20
CISA KEV list updated with new entry	Manual notification, 2-day triage	Operator sees KEV update; remediates all affected workloads at next maintenance window
Operator runs during health check failure	N/A (no automation)	Rollout gated; queued until error rate returns to normal
Patched image not yet available in registry	N/A	Operator retries at 15-minute intervals; alerts if patch not available after 4 hours
Workload has no matching patched image tag	N/A	Operator opens issue/alert with manual action required

Trade-offs and Operational Considerations

Aspect	Benefit	Cost	Mitigation
Direct patch mode	Fastest remediation (minutes)	Bypasses human review for critical systems	Restrict DirectPatch to non-production by default; require GitOpsPR for production
Maintenance window gates	Prevents disruptive rollouts	Extends exposure window until next window	Shorten maintenance window spacing to 6-hour intervals for critical CVEs
EPSS threshold	Filters noise; only acts on likely-to-be-exploited CVEs	EPSS scores are refreshed daily, so a newly published CVE may have no score yet and the threshold can miss early exploitation	Set `alwaysRemediateKEV: true` to catch actively-exploited CVEs regardless of EPSS
SBOM-based correlation	Precise matching; fewer false positives	SBOM generation must be wired into every build	Add SBOM upload as a mandatory CI step; fail build if SBOM absent
GitOps PR mode	Human review maintained	Adds review latency	Enable auto-merge once the security team, set as required reviewers via CODEOWNERS, has approved the PR

Failure Modes

Failure	Symptom	Detection	Recovery
Patched image digest unavailable	Remediation stalled; CVE window open	Operator metric `cve_remediation_stalled` fires alert	Trigger manual image build; notify security team; document exception
Operator RBAC too broad	Operator modifies unintended resources	RBAC audit log shows unexpected patch operations	Restrict to specific resource names; add OPA admission policy to constrain operator
CVE feed rate limiting	Operator unable to fetch updates; stale data	`cve_feed_fetch_failures` metric spikes	Implement exponential backoff; cache last-known-good feed snapshot
Health check flapping prevents rollout	CVEs remain unpatched despite EPSS threshold	`cve_rollout_gated_health_check` metric persists	Review health check thresholds; allow security override with dual-approval
GitOps PR accumulation	50+ open PRs from operator; review fatigue	PR count metric in GitHub API	Group CVE PRs by workload; set auto-merge policy for critical CVEs after security review
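
The recovery for feed rate limiting (exponential backoff plus a cached last-known-good snapshot) can be sketched as below. The `FeedCache` type and its methods are hypothetical; jitter and persistence of the snapshot across restarts are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFeed is a stand-in error used by the demo in main.
var errFeed = errors.New("feed unavailable")

// BackoffDelay doubles base for each consecutive failure after the first,
// capped at max.
func BackoffDelay(failures int, base, max time.Duration) time.Duration {
	d := base
	for i := 1; i < failures; i++ {
		if d > max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// FeedCache keeps the last successfully fetched feed snapshot.
type FeedCache struct {
	snapshot []byte
	failures int
}

// Refresh calls fetch. On success it stores the new snapshot and resets the
// failure count. On failure it returns the previous snapshot together with
// the error and the delay to wait before the next attempt.
func (c *FeedCache) Refresh(fetch func() ([]byte, error), base, max time.Duration) ([]byte, time.Duration, error) {
	data, err := fetch()
	if err != nil {
		c.failures++
		return c.snapshot, BackoffDelay(c.failures, base, max), err
	}
	c.snapshot, c.failures = data, 0
	return data, base, nil
}

func main() {
	cache := &FeedCache{}
	ok := func() ([]byte, error) { return []byte("snapshot-1"), nil }
	fail := func() ([]byte, error) { return nil, errFeed }

	for _, fetch := range []func() ([]byte, error){ok, fail, fail, fail} {
		data, next, err := cache.Refresh(fetch, time.Minute, 30*time.Minute)
		fmt.Printf("data=%s next=%s err=%v\n", data, next, err)
	}
}
```

While the feed is failing, correlation continues against the stale snapshot and the growing delay keeps the operator from amplifying the rate limiting; the `cve_feed_fetch_failures` metric would be incremented wherever `Refresh` returns an error.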