Kubernetes CVE Auto-Remediation Operator: Closing the Patch Window Automatically
The Problem
The median time between a CVE publication and a working exploit in the wild has been falling for years. LLM-assisted exploit development has accelerated that trend: a CVE published at 09:00 can have a PoC generated by an AI tool and weaponised by mid-afternoon. Standard enterprise patch windows — monthly, bi-weekly, or even weekly cycles — no longer close the exposure gap for CVSS 9+ vulnerabilities with network-accessible attack surfaces.
Kubernetes clusters are particularly exposed because they concentrate many separately updated components: container base images (often Debian or Alpine), runtimes (containerd, runc), ingress controllers (NGINX, Envoy), sidecar proxies (Istio, Linkerd), and monitoring agents, each with its own CVE stream. A cluster operator tracking patch needs across 50 DaemonSets, 200 Deployments, and 5 node pool images cannot manually triage and act on each new CVE as it lands.
The solution is an operator that automates the remediation loop:
- Ingest CVE feeds (OSV, NVD, GitHub Advisories) on a short polling interval.
- Correlate findings against the current workload inventory (image digests, plus package SBOMs stored in Artifact Registry or another OCI registry).
- Score each finding using EPSS (exploitation probability) and KEV (Known Exploited Vulnerabilities) membership to prioritise automatic vs manual action.
- Remediate automatically for high-confidence, high-severity cases: update the image reference in a Deployment/DaemonSet to the patched digest, trigger a node pool rotation for node-level CVEs, or open a PR against the GitOps repository when direct patching is disabled.
This is distinct from EPSS-driven triage tools (which produce prioritised lists) and Copa-style patching (which patches images in place). The operator closes the loop by taking the remediation action autonomously, with configurable gates for human review.
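The scoring step can be sketched as a small pure function. This is a minimal illustration rather than the operator's actual code: the `Thresholds` and `Finding` types, the `Decide` function, the example threshold values, and the placeholder CVE IDs are all assumptions made for the example.

```go
package main

import "fmt"

// Action is the outcome of scoring one finding (illustrative names).
type Action string

const (
	Ignore        Action = "ignore"
	Alert         Action = "alert"
	AutoRemediate Action = "auto-remediate"
)

// Thresholds holds hypothetical policy knobs; the values used in main are examples.
type Thresholds struct {
	MinCVSS            float64 // below this score, no action at all
	EPSS               float64 // exploitation probability, 0.0 to 1.0
	AlwaysRemediateKEV bool    // KEV membership overrides the EPSS threshold
}

// Finding is a simplified, hypothetical record for one CVE on one image.
type Finding struct {
	CVE   string
	CVSS  float64
	EPSS  float64
	InKEV bool
}

// Decide maps a scored finding to an action. The CVSS floor is applied
// first, then KEV membership, then the EPSS threshold.
func Decide(th Thresholds, f Finding) Action {
	if f.CVSS < th.MinCVSS {
		return Ignore
	}
	if f.InKEV && th.AlwaysRemediateKEV {
		return AutoRemediate
	}
	if f.EPSS >= th.EPSS {
		return AutoRemediate
	}
	return Alert
}

func main() {
	th := Thresholds{MinCVSS: 7.0, EPSS: 0.15, AlwaysRemediateKEV: true}
	findings := []Finding{
		{CVE: "CVE-0000-0001", CVSS: 9.8, EPSS: 0.42},
		{CVE: "CVE-0000-0002", CVSS: 8.1, EPSS: 0.01, InKEV: true},
		{CVE: "CVE-0000-0003", CVSS: 7.5, EPSS: 0.01},
		{CVE: "CVE-0000-0004", CVSS: 5.0, EPSS: 0.90},
	}
	for _, f := range findings {
		fmt.Printf("%s -> %s\n", f.CVE, Decide(th, f))
	}
}
```

Keeping the decision free of I/O makes it easy to unit test and to replay against historical findings before enabling automatic action.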
Target systems: Kubernetes 1.28+ clusters; Artifact Registry, ECR, or similar registry with image signing; workloads using image digest pinning; GitOps workflows (ArgoCD, Flux) for deployment configuration management.
Threat Model
1. LLM-generated exploit released same day as CVE (external attacker). Objective: exploit a container base-image CVE before the platform team patches it. Impact: container compromise; potential host escape if the CVE is in runc or containerd. Mitigation: auto-remediation shortens the exposure window for qualifying CVEs from days to minutes, or to the next maintenance window where rollout gates apply.
2. Node-level CVE exploited from within a container (attacker with container code execution). Objective: escalate from container to host via a kernel or containerd CVE. Impact: host-level compromise, cluster-wide lateral movement. Mitigation: node pool rotation triggered automatically when node-image CVEs exceed threshold.
3. Operator misconfiguration causing unintended rollouts (operator with write access to cluster). Objective (accidental): operator auto-updates a production DaemonSet image during a traffic peak, causing a rollout that degrades service. Impact: availability incident from an automated patch. Mitigation: rollout gates (maintenance windows, canary deployment integration, Prometheus health checks before proceeding).
4. CVE feed injection (attacker who compromises the CVE feed source or the operator’s network path to it). Objective: cause the operator to believe a non-existent CVE affects a running image; trigger a forced rollout that disrupts operations. Impact: availability attack via the security toolchain. Mitigation: TLS-verified feed fetches; signature or checksum verification of feed payloads where the feed publishes one.
Hardening Configuration
Operator Architecture
CVE Feeds (OSV, NVD, GitHub) → CVE Ingestor → CVE Store (Redis/etcd)
                                                  ↓
Cluster Workload Scanner → Image Inventory → Correlator → Decision Engine
                                                                 ↓
                                                        (EPSS < threshold) → Alert only
                                                        (EPSS ≥ threshold) → Remediator
                                                                                 ↓
                                                                 GitOps PR / Direct kubectl patch
CRD: CVERemediationPolicy
apiVersion: security.example.com/v1alpha1
kind: CVERemediationPolicy
metadata:
  name: production-policy
  namespace: security-ops
spec:
  # EPSS probability threshold above which auto-remediation fires (0.0–1.0)
  epssThreshold: 0.15  # 15% exploitation probability
  # Always auto-remediate if in CISA KEV list, regardless of EPSS
  alwaysRemediateKEV: true
  # Minimum CVSS score to trigger any action (alert or auto-remediate)
  minCvssScore: 7.0
  # Scopes this policy applies to
  scope:
    namespaces: ["production", "staging"]
    excludeNamespaces: ["kube-system"]  # kube-system requires manual approval
  # How remediation is performed
  remediationMode: GitOpsPR  # GitOpsPR | DirectPatch | AlertOnly
  gitops:
    repo: "github.com/example/k8s-manifests"
    branch: "auto-patch"
    reviewers: ["security-team"]
  # Rollout safety gates
  rolloutGates:
    # Only roll out during these windows (UTC)
    maintenanceWindows:
      - start: "02:00"
        end: "06:00"
        days: ["Mon", "Tue", "Wed", "Thu", "Fri"]
    # Check Prometheus before proceeding with rollout
    healthChecks:
      - query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) < 0.01'
        description: "Error rate below 1%"
    maxConcurrentRollouts: 2
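How the operator might evaluate a maintenance window entry can be sketched as follows. The `Window` type, `InWindow` function, and `utc` helper are hypothetical names for this example, and the sketch deliberately ignores windows that cross midnight.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a hypothetical Go mirror of one maintenanceWindows entry.
type Window struct {
	Start string   // "HH:MM", UTC, inclusive
	End   string   // "HH:MM", UTC, exclusive
	Days  []string // three-letter English day names, e.g. "Mon"
}

// InWindow reports whether now falls inside w.
// Windows that cross midnight are not handled in this sketch.
func InWindow(w Window, now time.Time) (bool, error) {
	start, err := time.Parse("15:04", w.Start)
	if err != nil {
		return false, fmt.Errorf("bad start %q: %w", w.Start, err)
	}
	end, err := time.Parse("15:04", w.End)
	if err != nil {
		return false, fmt.Errorf("bad end %q: %w", w.End, err)
	}
	now = now.UTC()
	today := now.Weekday().String()[:3] // "Monday" -> "Mon"
	dayOK := false
	for _, d := range w.Days {
		if d == today {
			dayOK = true
			break
		}
	}
	if !dayOK {
		return false, nil
	}
	minute := now.Hour()*60 + now.Minute()
	return minute >= start.Hour()*60+start.Minute() &&
		minute < end.Hour()*60+end.Minute(), nil
}

// utc builds a UTC timestamp; a small helper for the examples below.
func utc(year, month, day, hh, mm int) time.Time {
	return time.Date(year, time.Month(month), day, hh, mm, 0, 0, time.UTC)
}

func main() {
	w := Window{Start: "02:00", End: "06:00",
		Days: []string{"Mon", "Tue", "Wed", "Thu", "Fri"}}
	// A Wednesday inside the window and a Saturday outside it.
	for _, ts := range []time.Time{utc(2024, 1, 3, 3, 0), utc(2024, 1, 6, 3, 0)} {
		ok, err := InWindow(w, ts)
		fmt.Println(ts.Format(time.RFC3339), ok, err)
	}
}
```

Passing the current time in as a parameter, rather than calling `time.Now()` inside the function, keeps the gate deterministic under test.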
Operator Controller Logic
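// pkg/controller/remediation_controller.go
package controller

import (
	"context"
	"encoding/json"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	// Placeholder import path for the generated CVERemediationPolicy types.
	secv1alpha1 "example.com/cve-operator/api/v1alpha1"
)

// CVEStore, ImageInventory, EPSSClient, KEVClient, GitOpsClient and Finding
// are operator-specific types defined elsewhere in this package.
type RemediationReconciler struct {
	client.Client
	CVEStore      CVEStore
	ImageStore    ImageInventory
	EPSSClient    EPSSClient
	KEVClient     KEVClient
	GitOpsClient  GitOpsClient
	EventRecorder record.EventRecorder // records the audit events for each remediation
}

func (r *RemediationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// Load the policy
	policy := &secv1alpha1.CVERemediationPolicy{}
	if err := r.Get(ctx, req.NamespacedName, policy); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Scan all images in scope against the CVE store. correlateFindings is
	// expected to drop findings below policy.Spec.MinCvssScore.
	findings := r.correlateFindings(ctx, policy)
	for _, finding := range findings {
		// Score the finding
		epss := r.EPSSClient.GetScore(finding.CVE.ID)
		isKEV := r.KEVClient.IsKnownExploited(finding.CVE.ID)
		shouldRemediate := (epss >= policy.Spec.EPSSThreshold) ||
			(isKEV && policy.Spec.AlwaysRemediateKEV)
		if !shouldRemediate {
			r.emitAlert(finding, epss, isKEV)
			continue
		}
		if !r.rolloutGatesPassed(ctx, policy) {
			// Outside maintenance window or health checks failing: queue
			r.queueForNextWindow(finding)
			continue
		}
		if err := r.remediate(ctx, policy, finding); err != nil {
			// controller-runtime ignores Result when a non-nil error is
			// returned, so log the failure and let the periodic requeue
			// below retry it instead of aborting the remaining findings.
			logger.Error(err, "remediation failed",
				"cve", finding.CVE.ID, "workload", finding.WorkloadName)
			continue
		}
	}
	return ctrl.Result{RequeueAfter: 15 * time.Minute}, nil
}

func (r *RemediationReconciler) remediate(
	ctx context.Context,
	policy *secv1alpha1.CVERemediationPolicy,
	finding Finding,
) error {
	// Find the patched image digest from the registry
	patchedDigest, err := r.ImageStore.GetPatchedDigest(
		finding.Image, finding.CVE.ID,
	)
	if err != nil {
		return err
	}
	switch policy.Spec.RemediationMode {
	case "GitOpsPR":
		return r.GitOpsClient.CreatePatchPR(finding, patchedDigest)
	case "DirectPatch":
		return r.patchDeployment(ctx, finding, patchedDigest)
	default:
		return nil // AlertOnly
	}
}

// patchDeployment handles Deployments only; DaemonSets and StatefulSets need
// the same patch applied to their own types.
func (r *RemediationReconciler) patchDeployment(
	ctx context.Context,
	finding Finding,
	patchedDigest string,
) error {
	deploy := &appsv1.Deployment{}
	if err := r.Get(ctx, types.NamespacedName{
		Name: finding.WorkloadName, Namespace: finding.Namespace,
	}, deploy); err != nil {
		return err
	}
	// Build a strategic merge patch that updates only the affected container
	// (containers are merged by name). json.Marshal is used instead of string
	// concatenation so unexpected characters cannot corrupt the payload.
	// The image field needs a full reference (registry/repo@sha256:...), so
	// patchedDigest must include the repository, not a bare digest.
	patch, err := json.Marshal(map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"spec": map[string]any{
					"containers": []map[string]string{{
						"name":  finding.ContainerName,
						"image": patchedDigest,
					}},
				},
			},
		},
	})
	if err != nil {
		return err
	}
	return r.Patch(ctx, deploy, client.RawPatch(types.StrategicMergePatchType, patch))
}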
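Because the `image` field of a container needs a full reference rather than a bare digest, a remediator typically has to turn the currently deployed reference plus a new digest into a pinned reference. The helper below is a sketch of one way to do that; `PinImage` is a hypothetical name, only `sha256:` digests are accepted, and the example references are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// PinImage returns ref with any existing tag or digest replaced by digest,
// for example "repo/app:v1" + "sha256:..." -> "repo/app@sha256:...".
func PinImage(ref, digest string) (string, error) {
	if !strings.HasPrefix(digest, "sha256:") {
		return "", fmt.Errorf("unsupported digest %q", digest)
	}
	// Drop an existing digest, if any.
	if i := strings.Index(ref, "@"); i >= 0 {
		ref = ref[:i]
	}
	// Drop a tag: a colon after the last slash. A colon before the last
	// slash belongs to a registry port such as localhost:5000.
	lastSlash := strings.LastIndex(ref, "/")
	if i := strings.LastIndex(ref, ":"); i > lastSlash {
		ref = ref[:i]
	}
	if ref == "" {
		return "", fmt.Errorf("empty image reference")
	}
	return ref + "@" + digest, nil
}

func main() {
	const newDigest = "sha256:2222" // placeholder, not a real digest
	for _, ref := range []string{
		"nginx:1.25",
		"localhost:5000/team/app:v1",
		"registry.example.com/app@sha256:1111",
	} {
		pinned, err := PinImage(ref, newDigest)
		fmt.Println(pinned, err)
	}
}
```

A production implementation would more likely rely on an established image-reference parsing library than on string handling; the sketch only shows the shape of the transformation.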
Image Inventory Integration
The operator needs to know what images are running and their package contents. Integrate with an SBOM store:
# Generate an SBOM for each image at push time (in CI)
syft REGION-docker.pkg.dev/PROJECT/repo/myapp:latest \
  -o spdx-json > sbom.json
# Attach the SBOM to the image digest as an OCI referrer artifact
oras attach --artifact-type application/spdx+json \
  REGION-docker.pkg.dev/PROJECT/repo/myapp@sha256:abc... \
  sbom.json:application/spdx+json
The operator queries the SBOM to identify which packages in each image are affected by a given CVE, enabling precise correlation rather than image-level matching only.
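Package-level correlation against an SPDX JSON SBOM could look roughly like this. The sketch reads only the `packages[].name` and `packages[].versionInfo` fields of SPDX 2.x JSON; the `Advisory` type, the `AffectedPackages` function, and the sample data are assumptions for illustration, and exact version matching stands in for the version-range comparison a real correlator needs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spdxDocument captures only the SPDX 2.x JSON fields this sketch needs.
type spdxDocument struct {
	Packages []struct {
		Name        string `json:"name"`
		VersionInfo string `json:"versionInfo"`
	} `json:"packages"`
}

// Advisory is a hypothetical, simplified advisory: one package name and an
// explicit list of vulnerable versions.
type Advisory struct {
	ID                 string
	Package            string
	VulnerableVersions []string
}

// AffectedPackages returns "name@version" for every SBOM package that
// matches the advisory's package name and one of its vulnerable versions.
func AffectedPackages(sbom []byte, adv Advisory) ([]string, error) {
	var doc spdxDocument
	if err := json.Unmarshal(sbom, &doc); err != nil {
		return nil, fmt.Errorf("parse SBOM: %w", err)
	}
	vulnerable := make(map[string]bool, len(adv.VulnerableVersions))
	for _, v := range adv.VulnerableVersions {
		vulnerable[v] = true
	}
	var hits []string
	for _, p := range doc.Packages {
		if p.Name == adv.Package && vulnerable[p.VersionInfo] {
			hits = append(hits, p.Name+"@"+p.VersionInfo)
		}
	}
	return hits, nil
}

// sampleSBOM is made-up data shaped like a trimmed SPDX JSON document.
const sampleSBOM = `{"packages":[
  {"name":"openssl","versionInfo":"3.0.11-1"},
  {"name":"zlib","versionInfo":"1.3"}
]}`

func main() {
	adv := Advisory{ID: "EXAMPLE-0001", Package: "openssl",
		VulnerableVersions: []string{"3.0.10-1", "3.0.11-1"}}
	hits, err := AffectedPackages([]byte(sampleSBOM), adv)
	fmt.Println(hits, err)
}
```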
Rollout Safety Integration with Argo Rollouts
# When using Argo Rollouts, the operator creates a new rollout analysis
# rather than directly updating the Deployment
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: cve-patch-safety-check
spec:
  # Arguments referenced in the query must be declared here
  args:
    - name: deployment-name
  metrics:
    - name: error-rate
      successCondition: result[0] < 0.01
      failureLimit: 2
      interval: 30s
      provider:
        prometheus:
          address: http://prometheus:9090
          query: >
            sum(rate(http_requests_total{status=~"5..",
            deployment="{{args.deployment-name}}"}[5m])) /
            sum(rate(http_requests_total{
            deployment="{{args.deployment-name}}"}[5m]))
Audit Log and RBAC
# The operator service account needs only targeted permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cve-remediator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "statefulsets"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["security.example.com"]
    resources: ["cveremediationpolicies"]
    verbs: ["get", "list", "watch", "update", "patch"]
  # Needed to record the audit events emitted for each remediation
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
# RBAC is additive and has no deny rules: access to secrets, configmaps and
# RBAC resources is withheld simply by not granting it here
Every remediation action is recorded:
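// Record an audit event on the patched workload. Full digests are logged:
// slicing with [:12] would keep little more than the "sha256:" prefix and
// would panic on a shorter string.
r.EventRecorder.Event(
	deploy,
	corev1.EventTypeNormal,
	"CVEAutoRemediation",
	fmt.Sprintf("Auto-patched container %s from %s to %s for %s (EPSS=%.3f)",
		finding.ContainerName,
		finding.CurrentDigest,
		patchedDigest,
		finding.CVE.ID,
		epss,
	),
)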
Expected Behaviour After Hardening
| Scenario | Without Operator | With Operator |
|---|---|---|
| CVSS 9.8 CVE published at 09:00 | Patch queued in weekly cycle | EPSS and KEV checked at 09:15; if EPSS ≥ 15% or the CVE is in KEV, GitOps PR created by 09:20, or at the next maintenance window if the rollout gates are closed |
| CISA KEV list updated with new entry | Manual notification, 2-day triage | Operator sees KEV update; remediates all affected workloads at next maintenance window |
| Operator runs during health check failure | N/A (no automation) | Rollout gated; queued until error rate returns to normal |
| Patched image not yet available in registry | N/A | Operator retries at 15-minute intervals; alerts if patch not available after 4 hours |
| Workload has no matching patched image tag | N/A | Operator opens issue/alert with manual action required |
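The "retry, then alert after 4 hours" behaviour in the table can be sketched with a small tracker that remembers when each finding first stalled. `StallTracker` and its methods are hypothetical names, and the in-memory map is a simplification: an operator that restarts would need to persist this state, for example in the policy's status.

```go
package main

import (
	"fmt"
	"time"
)

// StallTracker remembers when each finding was first seen without a patched
// image and reports when it has waited at least the configured deadline.
type StallTracker struct {
	deadline  time.Duration
	firstSeen map[string]time.Time
}

func NewStallTracker(deadline time.Duration) *StallTracker {
	return &StallTracker{deadline: deadline, firstSeen: make(map[string]time.Time)}
}

// Observe records that key is still unpatched at time now and returns true
// once it has been waiting for the deadline or longer.
func (s *StallTracker) Observe(key string, now time.Time) bool {
	first, ok := s.firstSeen[key]
	if !ok {
		s.firstSeen[key] = now
		return false
	}
	return now.Sub(first) >= s.deadline
}

// Resolve forgets key once a patched image has been found and applied.
func (s *StallTracker) Resolve(key string) { delete(s.firstSeen, key) }

// hours and epoch are small helpers for the examples.
func hours(n int) time.Duration { return time.Duration(n) * time.Hour }

var epoch = time.Unix(0, 0).UTC()

func main() {
	tracker := NewStallTracker(hours(4))
	key := "production/web/CVE-0000-0001" // placeholder finding key
	// Simulate a retry every 15 minutes until the tracker reports a stall.
	for i := 0; ; i++ {
		now := epoch.Add(time.Duration(i) * 15 * time.Minute)
		if tracker.Observe(key, now) {
			fmt.Printf("alert: %s stalled after %s\n", key, now.Sub(epoch))
			break
		}
	}
}
```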
Trade-offs and Operational Considerations
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Direct patch mode | Fastest remediation (minutes) | Bypasses human review for critical systems | Restrict DirectPatch to non-production by default; require GitOpsPR for production |
| Maintenance window gates | Prevents disruptive rollouts | Extends exposure window until next window | Shorten maintenance window spacing to 6-hour intervals for critical CVEs |
| EPSS threshold | Filters noise; only acts on likely-to-be-exploited CVEs | EPSS scores lag in the first hours after a CVE is published; the threshold may miss early exploitation | Set alwaysRemediateKEV: true to catch actively-exploited CVEs regardless of EPSS |
| SBOM-based correlation | Precise matching; fewer false positives | SBOM generation must be wired into every build | Add SBOM upload as a mandatory CI step; fail build if SBOM absent |
| GitOps PR mode | Human review maintained | Adds review latency | Set merge automation for PRs approved by security team with CODEOWNERS rules |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Patched image digest unavailable | Remediation stalled; CVE window open | Operator metric cve_remediation_stalled fires alert | Trigger manual image build; notify security team; document exception |
| Operator RBAC too broad | Operator modifies unintended resources | RBAC audit log shows unexpected patch operations | Restrict to specific resource names; add OPA admission policy to constrain operator |
| CVE feed rate limiting | Operator unable to fetch updates; stale data | cve_feed_fetch_failures metric spikes | Implement exponential backoff; cache last-known-good feed snapshot |
| Health check flapping prevents rollout | CVEs remain unpatched despite EPSS threshold | cve_rollout_gated_health_check metric persists | Review health check thresholds; allow security override with dual-approval |
| GitOps PR accumulation | 50+ open PRs from operator; review fatigue | PR count metric in GitHub API | Group CVE PRs by workload; set auto-merge policy for critical CVEs after security review |
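The recovery suggested for feed rate limiting, exponential backoff plus a last-known-good snapshot, can be sketched as below. `BackoffDelay`, `FeedCache`, and `errRateLimited` are hypothetical names for this example; jitter is left out so the delays stay deterministic, though adding it would usually be sensible when several operators poll the same feed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BackoffDelay returns base doubled once per failed attempt, capped at limit.
func BackoffDelay(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= limit/2 {
			return limit
		}
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

// FeedCache keeps the last successfully fetched feed payload.
type FeedCache struct {
	lastGood []byte
}

// Refresh calls fetch. On failure it falls back to the cached snapshot and
// sets stale to true; it returns an error only when nothing is cached yet.
func (c *FeedCache) Refresh(fetch func() ([]byte, error)) (data []byte, stale bool, err error) {
	data, err = fetch()
	if err == nil {
		c.lastGood = data
		return data, false, nil
	}
	if c.lastGood != nil {
		return c.lastGood, true, nil
	}
	return nil, false, fmt.Errorf("feed fetch failed and no snapshot cached: %w", err)
}

// errRateLimited and seconds are helpers for the examples.
var errRateLimited = errors.New("feed returned HTTP 429")

func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

func main() {
	for attempt := 0; attempt <= 7; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, BackoffDelay(attempt, seconds(1), seconds(60)))
	}
	cache := &FeedCache{}
	cache.Refresh(func() ([]byte, error) { return []byte("snapshot-1"), nil })
	data, stale, err := cache.Refresh(func() ([]byte, error) { return nil, errRateLimited })
	fmt.Println(string(data), stale, err)
}
```

Findings correlated against a stale snapshot are still worth acting on, but the operator should surface the staleness (for example through the fetch-failure metric) so that a silent feed outage is not mistaken for a quiet day.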