containerd CVE-2022-23648: Path Traversal That Exposed the Host Filesystem

The Problem

CVE-2022-23648 was disclosed on March 3, 2022, affecting containerd versions before 1.6.1, 1.5.10, and 1.4.13. The vulnerability required no exploit code, no kernel bug, and no special privileges inside the container. A crafted OCI image — one whose image config JSON specified a volume mount with an empty Target.Path — caused containerd to bind-mount the host root filesystem into the container at startup. Every pod that ran a malicious image on an unpatched node had read access to the complete host filesystem for the lifetime of the container.

To understand why this happened, you need to understand how the OCI Image Specification structures image metadata and what containerd actually does with it.

The OCI Image Specification and image-level volume declarations

The OCI Image Format Specification (v1.0.0) defines a container image as three components: an image manifest, an image config JSON, and a set of layer tarballs. The manifest points to the config and lists the layers, referencing each by content digest. The image config JSON is stored as a content-addressable blob and identified in the manifest by mediaType: application/vnd.oci.image.config.v1+json. It contains metadata describing how to run the container: the entrypoint, environment variables, working directory, user, and volume declarations.

Volume declarations in the OCI image config are specified under the Volumes key in the config object:

{
  "config": {
    "Volumes": {
      "/data": {}
    },
    "Entrypoint": ["/app/server"],
    "Env": ["PATH=/usr/local/bin:/usr/bin:/bin"]
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": ["sha256:abc123..."]
  }
}

These image-level volume declarations are not the same as Kubernetes volume mounts. They are hints baked into the image at build time, intended to declare paths that should be treated as externally mounted volumes rather than part of the image’s own filesystem. Docker originated this pattern; the OCI spec inherited it. When a container runtime encounters a volume declaration in the image config, it is expected to prepare that path — typically by creating a directory at that path inside the container’s root and mounting an appropriate backing store there.

The OCI spec defines the structure of volume declarations but is deliberately permissive on the path format. It specifies that keys in the Volumes map are paths, but does not mandate that a runtime validate a path before using it. The Docker image spec format, from which the OCI config format was derived, was equally silent on validation, and containerd inherited that permissiveness.
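Because the specification leaves validation to the runtime, a runtime or a CI check has to apply its own rules. The Go sketch below shows one possible policy. It is an assumption for illustration, not what containerd implements. The sketch decodes the config's Volumes map, a JSON object whose values are empty objects and which decodes naturally into map[string]struct{}. It then rejects any key that is empty, not absolute, or contains a ".." element.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"sort"
	"strings"
)

// imageConfig keeps only the part of an OCI image config this sketch inspects.
type imageConfig struct {
	Config struct {
		Volumes map[string]struct{} `json:"Volumes"`
	} `json:"config"`
}

// suspiciousVolumes returns, sorted, every Volumes key that is empty, not
// absolute, or contains a ".." element (a hypothetical validation policy).
func suspiciousVolumes(raw []byte) ([]string, error) {
	var cfg imageConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parse image config: %w", err)
	}
	var bad []string
	for key := range cfg.Config.Volumes {
		// Container paths are Linux paths, so use package path, not filepath.
		if key == "" || !path.IsAbs(key) || hasDotDot(key) {
			bad = append(bad, key)
		}
	}
	sort.Strings(bad)
	return bad, nil
}

// hasDotDot reports whether any slash-separated element of p is "..".
func hasDotDot(p string) bool {
	for _, elem := range strings.Split(p, "/") {
		if elem == ".." {
			return true
		}
	}
	return false
}

func main() {
	samples := []struct{ name, raw string }{
		{"benign", `{"config":{"Volumes":{"/data":{}}}}`},
		{"empty key", `{"config":{"Volumes":{"":{},"/data":{}}}}`},
		{"traversal", `{"config":{"Volumes":{"/data/../../etc":{}}}}`},
	}
	for _, s := range samples {
		bad, err := suspiciousVolumes([]byte(s.raw))
		if err != nil {
			fmt.Println(s.name, "error:", err)
			continue
		}
		fmt.Printf("%s: suspicious volume keys %q\n", s.name, bad)
	}
}
```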

What containerd does during container creation

When containerd creates a container from an OCI image, the containerd/pkg/cri/server/container_create.go code path processes the image config to prepare the container’s root filesystem. For each volume declaration in the image config’s Volumes map, containerd calls into its snapshot and mount management code to set up the corresponding mount point inside the container’s filesystem.

The processing flow before the fix was roughly:

  1. Extract the Volumes map from the OCI image config.
  2. For each key in Volumes, treat the key as the destination path inside the container.
  3. Resolve that destination path relative to the container’s root bundle directory on the host — this is the host-side path to the container’s filesystem, something like /run/containerd/io.containerd.runtime.v2.task/<namespace>/<container-id>/rootfs.
  4. If the destination path is empty, filepath.Join(bundleRoot, "") resolves to bundleRoot itself — the container root directory.
  5. Prepare a bind mount from the host source for the volume to that resolved destination.

The filepath.Join behaviour in Go is the key: filepath.Join("/run/containerd/.../rootfs", "") returns /run/containerd/.../rootfs without appending anything. The empty path component is silently dropped. The resulting bind mount target is the container’s own root — and when containerd sets up a bind mount for a volume declaration with an empty source path, the default source resolves to or near the host root.

The practical effect: a container image whose Volumes map contains "": {} — an empty string as the key — caused containerd to issue a bind mount with the host root filesystem (/) as the source and the container’s rootfs directory as the destination. After this mount, ls / inside the running container showed the host’s filesystem, not the image’s. The container process could read /etc/, /var/lib/kubelet/, /run/containerd/, /proc/, and everything else on the host filesystem.

Constructing the malicious image

An attacker can create the malicious image config with a minimal Dockerfile-derived build or by constructing the OCI layout directly. The critical piece is the image config JSON:

{
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "Volumes": {
      "": {}
    },
    "Entrypoint": ["/bin/sh"],
    "Env": ["PATH=/usr/local/bin:/usr/bin:/bin"]
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": ["sha256:<layer-hash>"]
  }
}

The layer itself can be a completely benign alpine or scratch image. The vulnerability is in the image config, not the filesystem layer. When pushed to any registry and pulled by a vulnerable containerd node, the "": {} volume declaration triggers the bind mount before the container process even starts. The container’s entrypoint — which can itself be entirely legitimate — already has host filesystem access at startup. No exploit code is required in the container process at all.

This is what makes CVE-2022-23648 particularly severe compared to many container escape vulnerabilities: there is no race condition to win, no kernel bug to trigger, no privilege to escalate. The vulnerability is in the runtime’s image processing, not in the container’s execution. The containerd daemon on the host performs the bind mount as root during container setup, before any user code runs.

Scope

containerd is the container runtime for virtually every major Kubernetes deployment as of 2022. Google Kubernetes Engine, Amazon EKS (after the 2022 transition away from Docker), Azure Kubernetes Service, and self-managed Kubernetes clusters that had migrated from Docker to containerd as their CRI were all affected. The containerd 1.4.x branch was widely deployed in LTS distributions; 1.5.x was the current stable branch; 1.6.x had just launched. All three were vulnerable.

Docker Engine was not affected by this CVE. It processes image volume declarations in its own daemon code and does not go through containerd’s CRI plugin, where the vulnerable handling lives. containerd used as a Kubernetes CRI runtime, however, exercised the vulnerable code path. The coordinated disclosure acknowledged this explicitly: Docker users were not at risk, but users of containerd’s CRI implementation were.

Threat Model

The attack vector is a container image, the fundamental unit of workload delivery in Kubernetes. An operator does not need to misconfigure anything to be vulnerable; they only need to run a container from a malicious image on an unpatched node.

Malicious images without exploit code. The vulnerability is triggered entirely by the image config JSON, which containerd processes before the container process starts. Suppose a supply chain attack injects a malicious Volumes entry into a popular base image such as alpine, ubuntu, node, or python. The entry would propagate to every downstream image built FROM it, because a derived image’s config inherits the volume declarations of its base image. An attacker who can modify a base image’s config, or who controls any stage of the image’s build chain, can inject the empty Volumes key without touching the application code. The malicious entry is easy to miss in docker inspect output: the Volumes map simply contains one key whose value is an empty object, which is syntactically valid and semantically catastrophic.

What host filesystem read access yields.

The bind mount exposes the complete host filesystem, which gives access to:

  • /var/lib/kubelet/pki/kubelet-client-current.pem — the kubelet’s TLS client certificate and private key, used to authenticate to the Kubernetes API server as system:node:<nodename>. With these credentials, an attacker can impersonate the node to the API server and use the node’s permissions. Clusters that use the Node authorizer and the NodeRestriction admission plugin limit a node to the Secrets, ConfigMaps, and other objects referenced by pods bound to it. Clusters without them may grant node credentials much broader access, such as reading Secrets across namespaces.
  • /var/lib/kubelet/pods/*/volumes/kubernetes.io~secret/*/ — every Kubernetes Secret mounted as a volume into any pod running on the node is stored on the node’s filesystem in plaintext. This includes service account tokens, TLS certificates, database credentials, and API keys mounted by co-located workloads.
  • /etc/kubernetes/pki/ — on control plane nodes, this directory contains the cluster CA certificate and key, the API server TLS certificate, and the service account signing key. If the malicious pod is scheduled to a control plane node, the impact is total cluster compromise.
  • /run/containerd/containerd.sock — the containerd Unix domain socket. Connecting to a Unix socket requires write permission on the socket file, and in many configurations the socket is mode 0600 and owned by root. A container process running as UID 0 passes that check; UID 0 is the default when neither the image nor the pod spec sets a user and no user namespace is in use. Such a process can connect to the socket through the host filesystem mount and issue containerd API calls: list all running containers, extract container configs (including environment variables for all pods on the node), pull additional images, and create new privileged containers.
  • /proc/ — the host’s procfs is visible through the bind-mounted host root. A process inside the container can read /proc/<pid>/environ for any process on the host. Environment variables are a common way of passing credentials to application processes, so this exposes those credentials. /proc/<pid>/cmdline reveals the full command lines of all host processes, including any secrets passed as flags to system daemons.

The cloud IMDS angle. On cloud-hosted nodes, the instance metadata service is reachable from the node at http://169.254.169.254/. AWS, GCP, and Azure all use this link-local address, and AWS serves its metadata paths under /latest/. Host filesystem access reveals the node’s network configuration. A container that can also reach the IMDS, either directly or by using the containerd socket to start a host-network container, can then request the instance role’s credentials. On AWS nodes not configured to require IMDSv2 (which requires a PUT request to obtain a session token), IMDSv1 answers a simple HTTP GET from the host network namespace. The node’s IAM instance role typically grants ECR pull access. Depending on the cluster configuration, it may also carry broader permissions, such as S3 access for log shipping.

Multi-tenant clusters. In a cluster where multiple teams share nodes, a malicious image scheduled by one tenant gains read access to secrets from all co-located tenants. The Kubernetes namespace model, RBAC, and PodSecurityPolicies/Pod Security Standards provide no protection here: the isolation boundary is at the node/container runtime level, and that boundary is what CVE-2022-23648 breaks.

The supply chain propagation path. An attacker compromises a single popular base image tag, say alpine:3.15.3, by injecting the "": {} volume declaration into the image config. This does not change the image’s layer content; the SHA256 of the layer tarball is unchanged. Only the image config changes. Because the config is a separate content-addressed object, it gets a new digest, and the manifest that references it must change too. With write access to the registry, the attacker can repoint the existing tag at that new manifest: users pulling by tag see the same tag name, and only the manifest digest differs. Every CI/CD pipeline that runs FROM alpine:3.15.3 and then rebuilds its application image inherits the volume declaration in the final image config. The malicious declaration propagates silently through build systems that do not inspect intermediate image configs.
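Content addressing is why changing the config touches more than the config. A blob’s digest is the algorithm name followed by the hex-encoded hash of its exact bytes, so any edit yields a new config digest. The manifest that references the config must then change as well, even though the tag name can stay the same. Below is a minimal Go sketch using two illustrative config documents that differ only in the injected volume key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digest returns an OCI-style content digest: "sha256:" followed by the
// lowercase hex SHA-256 of the blob's exact bytes.
func digest(blob []byte) string {
	sum := sha256.Sum256(blob)
	return "sha256:" + hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative configs that differ only in the injected Volumes entry.
	// Both reference the same placeholder layer diff_id.
	original := []byte(`{"config":{"Entrypoint":["/bin/sh"]},"rootfs":{"type":"layers","diff_ids":["sha256:<layer-hash>"]}}`)
	tampered := []byte(`{"config":{"Entrypoint":["/bin/sh"],"Volumes":{"":{}}},"rootfs":{"type":"layers","diff_ids":["sha256:<layer-hash>"]}}`)

	fmt.Println("original config digest:", digest(original))
	fmt.Println("tampered config digest:", digest(tampered))
	// The layer list is byte-for-byte identical, but the config digest differs,
	// so a manifest referencing the tampered config is a new manifest.
}
```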

Hardening Configuration

1. Patch Verification

Verify the containerd version on every node. The patched releases are 1.6.1 or later on the 1.6 branch, 1.5.10 or later on the 1.5 branch, and 1.4.13 or later on the 1.4 branch.

# Check containerd version directly on a node
containerd --version
# Patched output:
# containerd github.com/containerd/containerd 1.6.2 ...

# Check all nodes in a cluster via the Kubernetes node info
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
# Output shows containerd version per node:
# node-1   containerd://1.6.2
# node-2   containerd://1.5.10
# node-3   containerd://1.4.13

# Find any nodes running vulnerable containerd versions.
# "([^0-9]|$)" ends each version match, so vendor suffixes such as "+azure" or "-1" still match.
kubectl get nodes -o json | jq -r '
  .items[] |
  select(
    .status.nodeInfo.containerRuntimeVersion |
    test("containerd://1\\.([0-3]\\.|4\\.([0-9]|1[0-2])([^0-9]|$)|5\\.[0-9]([^0-9]|$)|6\\.0([^0-9]|$))")
  ) |
  "\(.metadata.name)\t\(.status.nodeInfo.containerRuntimeVersion) -- VULNERABLE"
'
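You may prefer to run the same check from a Go-based inventory or compliance tool rather than from jq. The sketch below is one way to do it, under stated assumptions. It expects the containerd://MAJOR.MINOR.PATCH form that nodeInfo.containerRuntimeVersion reports and ignores any vendor suffix after the patch number. It treats 1.0 to 1.3 as vulnerable because those branches received no fix, and treats 1.7 and later as fixed.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionRE captures MAJOR, MINOR and PATCH from strings such as
// "containerd://1.5.9" or "containerd://1.4.9+azure"; any suffix is ignored.
var versionRE = regexp.MustCompile(`^containerd://(\d+)\.(\d+)\.(\d+)`)

// firstFixed maps each patched 1.x minor branch to its first fixed patch release.
var firstFixed = map[int]int{4: 13, 5: 10, 6: 1}

// vulnerable reports whether a runtime version string names a containerd
// release affected by CVE-2022-23648. ok is false for strings that are not
// containerd versions (for example "docker://20.10.12").
func vulnerable(runtime string) (vuln, ok bool) {
	m := versionRE.FindStringSubmatch(runtime)
	if m == nil {
		return false, false
	}
	// The regexp guarantees digit-only captures, so Atoi errors are ignored.
	major, _ := strconv.Atoi(m[1])
	minor, _ := strconv.Atoi(m[2])
	patch, _ := strconv.Atoi(m[3])

	switch {
	case major != 1:
		return major < 1, true // 2.x and later include the fix
	case minor < 4:
		return true, true // 1.0 to 1.3 never received a fix
	case minor > 6:
		return false, true // 1.7 and later include the fix
	default:
		return patch < firstFixed[minor], true
	}
}

func main() {
	for _, v := range []string{"containerd://1.6.2", "containerd://1.5.9+azure", "containerd://1.4.13", "docker://20.10.12"} {
		vuln, ok := vulnerable(v)
		switch {
		case !ok:
			fmt.Printf("%s: not a containerd version string\n", v)
		case vuln:
			fmt.Printf("%s: VULNERABLE\n", v)
		default:
			fmt.Printf("%s: patched\n", v)
		}
	}
}
```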

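For nodes where direct SSH is restricted, run a short-lived privileged pod pinned to the node in question (replace <node-name>) to inspect the binary. The default JSON merge override replaces the whole containers list. It therefore re-declares stdin and tty, which kubectl run -it would otherwise set on the container it generates:

kubectl run containerd-check \
  --image=ubuntu:22.04 \
  --restart=Never \
  --rm -it \
  --overrides='{
    "spec": {
      "nodeName": "<node-name>",
      "hostPID": true,
      "tolerations": [{"operator": "Exists"}],
      "containers": [{
        "name": "containerd-check",
        "image": "ubuntu:22.04",
        "stdin": true,
        "tty": true,
        "command": ["nsenter", "--target", "1", "--mount", "--", "containerd", "--version"],
        "securityContext": {"privileged": true}
      }]
    }
  }'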

For continuous version compliance monitoring across a fleet, deploy a reporting DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: runtime-version-reporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: runtime-version-reporter
  template:
    metadata:
      labels:
        app: runtime-version-reporter
    spec:
      hostPID: true
      tolerations:
        - operator: Exists
      containers:
        - name: reporter
          image: ubuntu:22.04
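          # Log the host's containerd version hourly; a bare "containerd --version"
          # exits at once and would leave the DaemonSet pod restarting in a loop.
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                nsenter --target 1 --mount -- containerd --version
                sleep 3600
              done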
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: 10m
              memory: 16Mi
            limits:
              cpu: 50m
              memory: 32Mi

2. Image Admission Control: Restrict to Trusted Registries

The immediate operational mitigation before patching is to block images from untrusted registries at admission time. A malicious image cannot trigger CVE-2022-23648 if it is never admitted to the cluster. Use Kyverno to enforce a registry allowlist:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
  annotations:
    policies.kyverno.io/title: Restrict Image Registries
    policies.kyverno.io/category: Supply Chain Security
    policies.kyverno.io/severity: high
    policies.kyverno.io/description: >-
      All container images must originate from approved registries.
      Images from public registries (docker.io, public.ecr.aws,
      ghcr.io) require an explicit exception documented in the
      registry allowlist ConfigMap.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: >-
          Image "{{ request.object.spec.containers[].image }}" is not
          from an approved registry. Approved registries:
          registry.company.com, gcr.io/company-project.
          Raise a change request to add a registry exception.
        pattern:
          spec:
            containers:
              - image: "registry.company.com/* | gcr.io/company-project/*"
    - name: validate-init-container-registries
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: >-
          Init container image "{{ request.object.spec.initContainers[].image }}"
          is not from an approved registry.
        pattern:
          spec:
            # "=(...)" applies the pattern only when initContainers is present
            =(initContainers):
              - image: "registry.company.com/* | gcr.io/company-project/*"

This policy applies to all pods across all namespaces, including system namespaces. Allow-list your CNI plugin, CSI drivers, and other infrastructure images explicitly before enabling Enforce mode. Start with Audit mode (validationFailureAction: Audit) to discover violations:

# Check Kyverno policy report for violations
kubectl get policyreport --all-namespaces -o json | \
  jq -r '.items[] | .results[]? | select(.result == "fail") | "\(.resources[].namespace)/\(.resources[].name): \(.message)"'
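When reasoning about which images the allowlist admits, it helps to spell out the matching rule. The Go sketch below approximates the two patterns in the policy above as plain prefix checks, treating the trailing * as "anything after this point". Kyverno’s own wildcard matcher is not reproduced here, so treat this as a model for thinking through edge cases rather than a drop-in equivalent.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedPrefixes approximates "registry.company.com/*" and
// "gcr.io/company-project/*" from the policy above.
var allowedPrefixes = []string{
	"registry.company.com/",
	"gcr.io/company-project/",
}

// imageAllowed reports whether an image reference starts with an allowed prefix.
// Keeping the trailing "/" in each prefix stops look-alike hosts such as
// "registry.company.com.attacker.io/..." from matching.
func imageAllowed(image string) bool {
	for _, prefix := range allowedPrefixes {
		if strings.HasPrefix(image, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, image := range []string{
		"registry.company.com/app:v1.2.3",
		"gcr.io/company-project/worker:v2",
		"docker.io/attacker/exploit-image:latest",
		"alpine:3.15", // a short name; runtimes resolve it to docker.io
		"registry.company.com.attacker.io/app:v1",
	} {
		fmt.Printf("%-45s allowed=%v\n", image, imageAllowed(image))
	}
}
```

Short names such as alpine:3.15 fail the check because the raw string in the pod spec carries no registry prefix. For an allowlist, that is the desired outcome.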

3. Image Signature Verification with Sigstore Policy Controller

Registry allowlisting prevents pulling from untrusted registries but does not protect against a compromised internal registry or a compromised image tag within a trusted registry. Require cryptographic signatures on all images using the Sigstore Policy Controller:

apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: require-signed-images
spec:
  images:
    - glob: "registry.company.com/**"
  authorities:
    - keyless:
        url: https://fulcio.sigstore.dev
        identities:
          - issuer: "https://token.actions.githubusercontent.com"
            subjectRegExp: "https://github.com/company-org/.*"

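The Policy Controller enforces policies only in namespaces labelled policy.sigstore.dev/include=true, so add that label to every namespace you want covered. To verify a specific image’s signature before deploying: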

cosign verify \
  --certificate-identity-regexp="https://github.com/company-org/.*" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  registry.company.com/myapp:v1.2.3

# Output on successful verification:
# Verification for registry.company.com/myapp:v1.2.3 --
# The following checks were performed on each of these signatures:
#   - The cosign claims were validated
#   - Existence of the claims in the transparency log was verified offline
#   - The code-signing certificate claims were validated
#
# [{"critical":{"identity":{"docker-reference":"registry.company.com/myapp"},...}

# Inspect the image config for suspicious volume declarations (crane is part of go-containerregistry):
crane config registry.company.com/myapp:v1.2.3 | \
  jq '.config.Volumes // {} | keys[]'
# Prints each declared volume path as a JSON string; a "" line indicates the CVE-2022-23648 config

Before patching, use crane to inspect the configs of images already running in your cluster, to detect any malicious images already present:

# Inspect all unique container and init container images currently running in the cluster
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{range .spec.initContainers[*]}{.image}{"\n"}{end}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | \
  sort -u | \
  while read -r image; do
    # jq -e exits 0 only when the config's Volumes map has an empty-string key
    if crane config "$image" 2>/dev/null | jq -e '(.config.Volumes // {}) | has("")' >/dev/null; then
      echo "ALERT: $image contains empty volume path in config"
    fi
  done

4. Read-Only Root Filesystem and No Unnecessary Volume Mounts

A read-only root filesystem and explicit empty volume declarations in the Kubernetes pod spec do not prevent CVE-2022-23648 — the bind mount happens at the containerd level before the Kubernetes volume configuration is applied — but they establish a defence-in-depth baseline that limits what an attacker can do after gaining host filesystem access via other means:

apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
    fsGroup: 1000
  volumes: []   # No volumes declared; documents intent, but does not by itself block hostPath mounts
  containers:
    - name: app
      image: registry.company.com/app:v1.2.3
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      volumeMounts: []

The readOnlyRootFilesystem: true flag sets the container’s rootfs mount as read-only after containerd completes the container setup. If the CVE-2022-23648 bind mount has already been applied, the host filesystem is still visible, but the container cannot write to host paths through the bind-mounted root. The attacker is limited to read-only exfiltration rather than modification. That is still severe for credential theft, but it prevents persistence mechanisms that require writes (e.g., modifying kubelet configs, dropping SSH keys, installing cron jobs on the host).

Running as a non-root UID (runAsUser: 1000) further limits what the container process can read from the bind-mounted host filesystem: host files owned by root (mode 0640 or tighter) are not readable. Kubelet credentials at /var/lib/kubelet/pki/ are typically owned root:root with mode 0600, making them inaccessible to a non-root container process even through the bind mount.
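The reasoning above is the standard POSIX owner/group/other check. The Go sketch below is a simplified model of it, under three assumptions. First, no user namespaces, so container UID 0 is host UID 0. Second, the default capability set, so root bypasses read checks through CAP_DAC_OVERRIDE. Third, supplementary groups, ACLs, and LSMs are ignored.

```go
package main

import "fmt"

// hostFile describes the ownership and permission bits of a file on the host.
type hostFile struct {
	name     string
	mode     uint32 // permission bits, e.g. 0o600
	uid, gid uint32
}

// canRead applies the POSIX owner/group/other read check for a caller with the
// given UID and primary GID. Simplifications (assumptions): UID 0 always reads,
// modelling default CAP_DAC_OVERRIDE with no user namespace; supplementary
// groups, ACLs, and LSMs such as AppArmor and SELinux are ignored.
func canRead(f hostFile, uid, gid uint32) bool {
	switch {
	case uid == 0:
		return true
	case uid == f.uid: // the owner class applies even if group bits are more permissive
		return f.mode&0o400 != 0
	case gid == f.gid:
		return f.mode&0o040 != 0
	default:
		return f.mode&0o004 != 0
	}
}

func main() {
	files := []hostFile{
		{"kubelet-client-current.pem", 0o600, 0, 0},
		{"root-group-readable", 0o640, 0, 0},
		{"world-readable", 0o644, 0, 0},
	}
	for _, f := range files {
		fmt.Printf("%-28s uid=1000 read=%-5v uid=0 read=%v\n",
			f.name, canRead(f, 1000, 1000), canRead(f, 0, 0))
	}
}
```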

5. Falco Detection: Unexpected Host Path Reads from Containers

The bind mount created by CVE-2022-23648 causes the container process to read files that normal container workloads have no business accessing. Falco rules targeting these access patterns provide detection even on unpatched nodes:

# /etc/falco/rules.d/cve-2022-23648.yaml

- rule: Container reading host kubernetes credentials
  desc: >
    A container process is reading kubelet PKI credentials or Kubernetes
    control plane PKI material from the host filesystem. This is consistent
    with CVE-2022-23648 exploitation where the host filesystem is
    bind-mounted into the container.
  condition: >
    open_read
    and container
    and (fd.name startswith "/var/lib/kubelet/pki"
         or fd.name startswith "/etc/kubernetes/pki"
         or fd.name startswith "/var/lib/kubelet/pods")
    and not proc.name in (kubelet, kube-proxy, node_exporter)
  output: >
    Container reading host k8s credentials
    (user=%user.name user_uid=%user.uid
    proc=%proc.name pid=%proc.pid
    container=%container.name
    image=%container.image.repository:%container.image.tag
    file=%fd.name
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [container, cve-2022-23648, host-filesystem, supply-chain]

- rule: Container accessing containerd socket
  desc: >
    A container process is accessing the containerd Unix domain socket.
    This is only expected from the kubelet and containerd shim processes,
    not from container workloads.
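  # Clients reach a Unix socket with connect(2), not open(2), so match connect exits.
  condition: >
    evt.type = connect and evt.dir = <
    and container
    and fd.typechar = u
    and fd.name contains "/run/containerd/containerd.sock"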
  output: >
    Container accessing containerd socket
    (user=%user.name user_uid=%user.uid
    proc=%proc.name pid=%proc.pid
    container=%container.name
    image=%container.image.repository:%container.image.tag
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [container, runtime-escape, cve-2022-23648]

- rule: Container reading host proc environ
  desc: >
    A container process is reading /proc/<pid>/environ (other than
    /proc/self/environ). Once the host filesystem is exposed, this
    pattern is used to extract credentials from host process
    environment variables.
  condition: >
    open_read
    and container
    and fd.name glob "/proc/*/environ"
    and not fd.name = "/proc/self/environ"
    and not fd.name glob "/proc/*/task/*/environ"
  output: >
    Container reading host process environment
    (user=%user.name user_uid=%user.uid
    proc=%proc.name pid=%proc.pid
    container=%container.name
    image=%container.image.repository:%container.image.tag
    fd=%fd.name
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, cve-2022-23648, credential-access]

Install the rules file, validate it together with the default rules file that defines the open_read and container macros, and reload Falco:

cp cve-2022-23648.yaml /etc/falco/rules.d/
# Validate rule syntax before reloading (-V may be repeated to load several files)
falco -V /etc/falco/falco_rules.yaml -V /etc/falco/rules.d/cve-2022-23648.yaml
# Reload without restart (Falco 0.32+)
kill -HUP $(pidof falco)

6. containerd Socket and Host Path Protection via OPA Gatekeeper

Restrict pods from mounting sensitive host paths using an OPA Gatekeeper constraint. This blocks operators from accidentally exposing the containerd socket or kubelet directories to pods, reducing the blast radius of any container escape:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPHostFilesystem
metadata:
  name: psp-host-filesystem
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: [Pod]
    excludedNamespaces:
      - kube-system
  parameters:
    allowedHostPaths:
      # Only paths explicitly required by infrastructure components
      - pathPrefix: "/var/log"
        readOnly: true
      - pathPrefix: "/tmp/k8s-webhook"
        readOnly: false

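Verify the containerd socket permissions at the OS level. The socket should not be accessible to any non-root user, since connecting to a Unix socket requires write permission on the socket file: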

# Check socket permissions on a node
ls -la /run/containerd/containerd.sock
# Expected: srw------- 1 root root 0 Jan 1 00:00 /run/containerd/containerd.sock
# mode 0600, owned root:root

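# If the socket is group- or world-accessible, tighten it (containerd recreates the
# socket on restart, so reapply this or correct the ownership settings in its config):
chmod 0600 /run/containerd/containerd.sock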

# Verify no pods are mounting containerd-related paths
kubectl get pods --all-namespaces -o json | jq -r '
  .items[] |
  select(any(.spec.volumes[]?;
    (.hostPath.path // "") |
    (startswith("/run/containerd") or
     startswith("/var/lib/kubelet") or
     startswith("/etc/kubernetes")))) |
  "\(.metadata.namespace)/\(.metadata.name)"
'

Expected Behaviour

Patched containerd (>= 1.6.1, 1.5.10, 1.4.13). The fix in containerd adds validation of image volume paths during container creation. When containerd processes an image config whose Volumes map contains an empty string key, it returns an error and refuses to create the container. The exact message varies by version; it takes a form like:

FATA[0000] failed to create container for task: failed to create container:
failed to handle image volumes: invalid volume path "": path must not be empty

The pod status reflects this as a CreateContainerError:

kubectl describe pod <pod-name>
# Events:
#   Warning  Failed  5s   kubelet  Error: failed to create container
#             for task: failed to handle image volumes: invalid volume path "":
#             path must not be empty

The malicious image is inert on patched containerd. The container is never created; no bind mount is issued; the host filesystem is not exposed.

Kyverno policy violation output. When a pod attempts to use an image from a non-allowlisted registry, Kyverno blocks the admission request and returns:

Error from server: admission webhook "validate.kyverno.svc-fail" denied the request:
resource Pod/production/malicious-app was blocked due to the following policies:
restrict-image-registries:
  validate-registries: 'Image "docker.io/attacker/exploit-image:latest" is not
  from an approved registry. Approved registries: registry.company.com,
  gcr.io/company-project. Raise a change request to add a registry exception.'

Falco alert sequence for a CVE-2022-23648 exploit attempt on an unpatched node. An attacker’s container starts and immediately begins reading host credential files. Falco fires within milliseconds:

09:14:22.003812445: CRITICAL Container reading host k8s credentials
  user=root user_uid=0
  proc=sh pid=24891
  container=malicious-workload
  image=docker.io/attacker/exploit:latest
  file=/var/lib/kubelet/pki/kubelet-client-current.pem
  namespace=default pod=malicious-workload-6b9cf4-xk2p7

09:14:22.008341102: CRITICAL Container reading host k8s credentials
  user=root user_uid=0
  proc=sh pid=24891
  container=malicious-workload
  image=docker.io/attacker/exploit:latest
  file=/var/lib/kubelet/pki/kubelet-client-2024-01-15-09-14-00.pem
  namespace=default pod=malicious-workload-6b9cf4-xk2p7

09:14:22.021503887: CRITICAL Container accessing containerd socket
  user=root user_uid=0
  proc=sh pid=24891
  container=malicious-workload
  image=docker.io/attacker/exploit:latest
  namespace=default pod=malicious-workload-6b9cf4-xk2p7

Multiple CRITICAL-priority alerts within 20ms of pod startup, all from the same image, are a strong indicator of CVE-2022-23648 exploitation. At this point the node must be treated as compromised. Cordon and drain it immediately, preserve the disk image for forensics, and rotate all credentials that were accessible on the node: kubelet client certificates, service account tokens mounted into co-located pods, and any cloud instance role credentials.

Trade-offs

Registry allowlisting is the highest-leverage pre-patch mitigation but creates real friction. Development environments routinely pull from Docker Hub (base images, tooling containers, debugging images). Staging environments may pull from a separate registry than production. Any registry allowlist policy applied cluster-wide breaks these workflows immediately. Mitigate this by:

  • Using Kyverno’s namespaceSelector to exclude development namespaces from the strict allowlist and apply a looser policy there.
  • Creating an explicit exception process: a Kyverno PolicyException resource that allows specific images from public registries after review, with an expiry annotation enforced by a custom controller.
  • Starting in Audit mode for two weeks, using the PolicyReport output to identify all current violations, and resolving them before switching to Enforce.

Image signature verification adds a signing step to every CI/CD pipeline. Third-party base images from public registries — the FROM node:20-alpine images that are the foundation of most application containers — are not signed by default unless the upstream project has adopted Sigstore. For base images without signatures, use a policy that requires either a signature from the original publisher or a signature from your internal registry after an internal scanning step (pull → scan → re-push to internal registry with your own signature). This pattern, sometimes called image promotion, also decouples your production workloads from upstream registry availability.

The Sigstore Policy Controller adds an admission webhook to every pod creation. If the Policy Controller is unavailable (its pods are crashing, network connectivity to Fulcio/Rekor is interrupted), pod creation fails. This is the secure-by-default behaviour, but it can cause cascading failures during upgrades or infrastructure disruptions. Configure the Policy Controller with at least two replicas and set its failure policy carefully:

# Policy Controller webhook configuration
failurePolicy: Fail      # Deny pod creation if webhook is unreachable
# vs.
failurePolicy: Ignore    # Allow pod creation if webhook is unreachable (less secure)

In most production configurations, failurePolicy: Fail is the correct security posture, but it requires robust Policy Controller availability.

Falco rules for host path reads generate false positives from legitimate infrastructure components. CSI drivers read from /var/lib/kubelet/pods/ as part of their normal volume management loop. Node-level monitoring agents (Datadog node agent, Prometheus node exporter, Elastic Agent) read from /proc/*/ extensively. CNI plugins read from paths in /etc/kubernetes/. Tune the Falco rules with explicit allow-lists for these processes by name and image:

# Add to the condition for host k8s credential reads:
and not (container.image.repository in (
  "gcr.io/datadoghq/agent",
  "k8s.gcr.io/node-problem-detector/node-problem-detector",
  "registry.k8s.io/metrics-server/metrics-server"
))
and not proc.name in (node_exporter, datadog-agent, elastic-agent)

Tuning Falco rules takes time and requires running in alert-only mode in production before relying on automated responses. Build the rule gradually: deploy, observe false positives for a week, add exceptions for the legitimate ones, and repeat.

Failure Modes

Assuming Kubernetes-level controls prevent runtime vulnerabilities. This is the most dangerous misconception about CVE-2022-23648. PodSecurityAdmission at the Restricted level enforces runAsNonRoot, allowPrivilegeEscalation: false, dropped capabilities, and a RuntimeDefault or Localhost seccomp profile. None of these controls stop the exploit. The bind mount is performed by the containerd daemon, a root process on the host, during container setup and before any pod security context is applied. By the time the container process starts, the bind mount is already in place. PodSecurity restricts what the container process can do within its own namespace; it has no visibility into the mounts that containerd applies during container creation. Operators who believe their Restricted pod security policy makes them immune to containerd CVEs are wrong.

Tracking Kubernetes versions but not containerd versions separately. Managed Kubernetes offerings abstract the relationship between Kubernetes version and container runtime version. On EKS, for example, a node’s containerd version is determined by the AMI release its node group was built from, not by the Kubernetes version alone. Two node groups on the same Kubernetes patch version could run different containerd versions if one was built from an AMI released in January 2022 and the other from one released in April 2022. The Kubernetes node info API (nodeInfo.containerRuntimeVersion) exposes the containerd version, but most cluster dashboard tools do not surface it. Many operators know their Kubernetes version precisely and cannot name the containerd version on any of their nodes. Implement the version compliance DaemonSet described in the hardening section and alert on any node below the minimum containerd patch version.

Pulling base images without inspecting image configs. The attack vector for CVE-2022-23648 is an image config field, Volumes with an empty key, that is easy to overlook. docker image inspect myimage:latest does print Config.Volumes, but only as one field in a long JSON document, where "Volumes": {"": {}} reads like an ordinary declaration. To print only the volume declarations:

docker image inspect myimage:latest --format '{{ json .Config.Volumes }}'
# Or via the registry API:
crane config myimage:latest | jq '.config.Volumes'

Most CI/CD pipelines do not inspect the image config after building or before deploying. Integrate a config inspection step into your image build and deployment pipelines. A short check that fails the build if any volume path is an empty string would have blocked images carrying this signature:

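# In CI, after building the image; fail closed if the config cannot be fetched
config=$(crane config "${IMAGE}:${TAG}") || { echo "ERROR: could not fetch image config"; exit 1; }
# jq -e exits 0 only when the config's Volumes map has an empty-string key
if printf '%s' "$config" | jq -e '(.config.Volumes // {}) | has("")' >/dev/null; then
  echo "ERROR: image config contains empty volume path (CVE-2022-23648 signature)"
  exit 1
fi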

Not monitoring for unexpected host path file accesses after patching. Patching containerd closes the CVE-2022-23648 vector, but host filesystem exposure through container escapes is a recurring class of vulnerability. CVE-2022-23648 is one instance; CVE-2019-5736 (runc binary overwrite), CVE-2021-30465 (runc mount destination race), and future unpatched variants all expose the host filesystem to container processes via different mechanisms. The Falco rules detecting access to /var/lib/kubelet/pki/ and the containerd socket from container processes are not CVE-2022-23648-specific — they are indicators of any container escape that achieves host filesystem access. Deploy and maintain them regardless of the containerd patch status.