Overlayfs Copy-on-Write Container Escape: CVE-2023-0386 and Writeback Race Mitigations

Problem

Every container runtime that uses overlayfs — Docker, containerd, CRI-O — builds the container’s filesystem from two layers stacked in the kernel. The lower layer is read-only and contains the image content: all files baked into the Docker image at build time. The upper layer is a writable scratch area, initially empty, that belongs to this specific container instance. The kernel’s overlayfs driver presents a unified view of both layers to the container process: reads come from upper if the file exists there, otherwise from lower. A third directory, the work directory, is required by the overlayfs implementation for atomic rename and copy operations during writes.

Copy-up is what happens when a container process first writes to a file that exists only in the lower layer. The kernel cannot modify the read-only lower layer, so overlayfs copies the file from lower to upper and redirects the write to the copy. The full copy-up sequence is:

1. Create a new file in the work directory.
2. Copy the lower-layer file’s content into the work directory file.
3. Copy the lower-layer file’s metadata (ownership as uid/gid, permissions as mode bits, and extended attributes as xattrs) to the work directory file.
4. Atomically rename the work directory file into the upper directory.
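
The four steps above can be mimicked in userspace to make the ordering concrete. The sketch below is an illustration with ordinary tools, not the kernel implementation; the function name copy_up_sketch and its directory arguments are hypothetical, it assumes GNU or BusyBox stat and mktemp, and it copies only mode bits, whereas the kernel also copies ownership and xattrs.

```shell
# Userspace illustration of the copy-up ordering; NOT the kernel code path.
# copy_up_sketch LOWER WORK UPPER NAME   (all names are hypothetical)
copy_up_sketch() {
  lower=$1; work=$2; upper=$3; name=$4
  # Step 1: create a temporary file in the work directory.
  tmp=$(mktemp "$work/copyup.XXXXXX") || return 1
  # Step 2: copy the lower-layer content into it.
  cat "$lower/$name" > "$tmp" || return 1
  # Step 3: copy metadata. Only mode bits are copied here; the kernel also
  # copies uid/gid and xattrs, which is where the privilege problem arises.
  chmod "$(stat -c '%a' "$lower/$name")" "$tmp" || return 1
  # Step 4: rename into the upper directory (atomic on the same filesystem).
  mv -f "$tmp" "$upper/$name"
}
```

Because the file becomes visible in the upper directory only at the final rename, readers never observe a half-copied file; the work directory exists to make that possible.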

Step 3 is where the vulnerability class originates. When overlayfs copies metadata from the lower layer to the upper layer, it copies everything: the uid/gid of the original file owner, the POSIX permission bits, and all xattrs. For ordinary files owned by a container-internal user, this is harmless. But container images routinely include setuid binaries — ping, sudo, su, newuidmap — that are owned by root and carry setuid bits. They also include binaries with file capabilities stored as security xattrs (security.capability), which grant specific POSIX capabilities to any process that executes the file, regardless of the process’s user.

The privilege copy problem: when overlayfs performs copy-up of a setuid-root binary or a file with security xattrs from the lower layer to the upper layer, the copy in the upper layer retains the original ownership and capability xattrs. In the host’s view, the upper layer is a directory on the host filesystem — it lives under /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/ or a similar path depending on the runtime. Files written to the upper layer are visible to the host.
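
Because the upper layer is an ordinary host directory, it can be audited from the host with standard tools. A minimal sketch, assuming a POSIX find; the function name find_setuid_files is hypothetical, and the directory to scan is whatever snapshot path your runtime uses.

```shell
# find_setuid_files DIR
# Print regular files under DIR that carry the setuid bit, without
# crossing into other mounted filesystems.
find_setuid_files() {
  find "$1" -xdev -type f -perm -4000 -print 2>/dev/null
}
```

File capabilities can be listed in the same way with getcap -r from libcap, if it is installed on the node.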

CVE-2023-0386 exploits this copy-up path within a user namespace. A user namespace maps unprivileged host UIDs to a synthetic UID 0 inside the namespace. Overlayfs, when mounted inside a user namespace, applies copy-up using the caller’s user namespace credentials. The bug: when a container mounts FUSE (Filesystem in Userspace) inside its user namespace and then mounts overlayfs over the FUSE filesystem, it can manipulate what the lower-layer file appears to be during copy-up. The FUSE filesystem can present a file whose uid is 0 and whose xattrs include security.capability with elevated capabilities. When overlayfs copies this file to the upper layer — which is a real directory on the host filesystem, outside the user namespace — the kernel incorrectly preserves the security.capability xattr on the resulting upper-layer file. The upper-layer file on the host then has file capabilities that any user on the host can exploit by executing it.

The exploit chain:

1. Attacker is an unprivileged user on the host, or is running inside a container with user namespace access.
2. Attacker creates a user namespace (unshare -U) and mounts a FUSE filesystem inside it.
3. The FUSE filesystem is programmed to report a file with uid=0 and security.capability set to cap_setuid+ep.
4. Attacker mounts overlayfs inside the user namespace, using the FUSE filesystem as the lower layer and a real host directory as the upper layer.
5. Attacker triggers copy-up by opening the FUSE-backed file for writing.
6. Overlayfs copies the file, including its security.capability xattr, into the upper directory on the host filesystem.
7. Attacker executes the file from the upper directory. The file runs with cap_setuid, allowing the attacker to call setuid(0) and become root on the host.

Affected systems: Linux kernels before 6.2 that lack the stable backport (the fix was backported to 5.15.90, 5.10.165, 5.4.230, and 4.19.274). Any container runtime that uses overlayfs and permits user namespaces is affected: Docker with user namespace remapping, containerd with the overlayfs snapshotter, and CRI-O. Rootless container modes are affected because they rely on user namespaces. On a standard Ubuntu 22.04 or Debian 12 system (both set kernel.unprivileged_userns_clone=1 by default) running an unpatched kernel, an unprivileged user can exploit this without any container involvement at all.

Threat Model

Threat 1: Container attacker exploiting overlayfs copy-up to gain host capabilities. An attacker who has code execution inside a container — through a compromised application, a supply chain attack, or a misconfigured debug container — can mount overlayfs with a FUSE lower layer inside a user namespace. Kubernetes pods do not require CAP_SYS_ADMIN for user namespace operations when kernel.unprivileged_userns_clone=1. The attacker uses the copy-up path to write a capability-bearing file to the container’s upper layer, which resides on the host filesystem. Executing the file grants host-level capabilities, completing the container escape. The blast radius is host root: the attacker can read all secrets on the node, access the container runtime socket, and pivot to other pods or the Kubernetes API server.

Threat 2: Race in the copy-up writeback window. The copy-up sequence involves a temporary file in the work directory before the atomic rename to the upper directory. A sufficiently fast attacker can attempt to swap the work directory file between the capability-copy step and the rename step. In practice CVE-2023-0386 does not require winning this race — the FUSE lower layer gives the attacker direct control over what overlayfs reads during copy-up — but the race window is a separate attack surface in implementations that modify the work directory file before the rename completes. Kernels that serialize copy-up with the overlayfs inode lock close this window; unpatched kernels may not.

Threat 3: Unprivileged user outside containers gaining elevated file capabilities. On distributions that enable unprivileged user namespaces, any local user account can exploit CVE-2023-0386 without any container or Kubernetes involvement. A developer workstation, a shared build server, or a CI runner where the kernel is not patched is vulnerable to privilege escalation by any local user, including processes running as service accounts for other applications. The scope extends well beyond Kubernetes clusters.

The blast radius in all three scenarios is host root or the equivalent: an attacker who can execute a file with cap_setuid+ep on the host can call setuid(0), read /etc/shadow, write to the container runtime socket, and access all secrets accessible to the node. In a Kubernetes context this means kubelet TLS credentials, all secrets mounted into co-located pods, and instance metadata credentials for cloud IAM roles.

Configuration and Implementation

1. Kernel version verification

Verify that every node is running a kernel that includes the CVE-2023-0386 fix. The fix landed in Linux 6.2 and was backported to the 5.15, 5.10, 5.4, and 4.19 stable series.

# Check kernel version on the current node
uname -r
# Examples of patched versions:
# 6.2.0-26-generic       -- mainline fix
# 5.15.90-1-generic      -- LTS backport
# 5.10.165-1-generic     -- LTS backport

# On a Kubernetes node via kubectl, check all nodes at once
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'

# Parse vulnerable vs patched using a simple version comparison
kubectl get nodes -o json | jq -r '
  .items[] |
  .metadata.name as $name |
  .status.nodeInfo.kernelVersion as $kver |
  "\($name)\t\($kver)"
' | while IFS=$'\t' read -r node kver; do
  # Extract major and minor from the kernel version string
  major=$(echo "$kver" | grep -oP '^\d+')
  minor=$(echo "$kver" | grep -oP '^\d+\.\K\d+')
  # Anything older than 6.2 may or may not carry a backport: flag it for review
  if [ "$major" -lt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -lt 2 ]; }; then
    echo "POSSIBLY VULNERABLE: $node ($kver) -- verify backport status"
  else
    echo "OK: $node ($kver)"
  fi
done
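
The coarse check above can be refined by comparing the patch level against the backport versions listed in this article. The following is a sketch under assumptions: the function name kernel_fix_status is hypothetical, the thresholds are the ones quoted above, and the input is assumed to be an upstream-style version string. Distribution kernels that keep the upstream patch level fixed (for example 5.15.0-NN on Ubuntu) will be reported as vulnerable even when the fix is present, so treat the result as a first pass only.

```shell
# kernel_fix_status VERSION
# Prints "patched", "vulnerable", or "unknown" for an upstream-style kernel
# version, using the fix versions quoted in this article as thresholds.
kernel_fix_status() {
  ver=${1%%-*}                       # drop distro suffix such as "-1-generic"
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  minor=${minor%%[!0-9]*}
  case $rest in
    *.*) patch=${rest#*.}; patch=${patch%%[!0-9]*} ;;
    *)   patch=0 ;;
  esac
  patch=${patch:-0}

  # Mainline 6.2 and later carry the fix.
  if [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 2 ]; }; then
    echo patched; return 0
  fi
  # Stable series with a known backport threshold.
  case "$major.$minor" in
    5.15) min=90 ;;
    5.10) min=165 ;;
    5.4)  min=230 ;;
    4.19) min=274 ;;
    *)    echo unknown; return 0 ;;  # series not listed: check the advisory
  esac
  if [ "$patch" -ge "$min" ]; then echo patched; else echo vulnerable; fi
}
```

On a node it could be called as kernel_fix_status "$(uname -r)"; an "unknown" or "vulnerable" result should be followed by the changelog check.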

For distributions that backport fixes without bumping the minor version, check the package changelog directly:

# Debian/Ubuntu
apt changelog linux-image-$(uname -r) 2>/dev/null | grep -i "CVE-2023-0386" | head -5

# RHEL/CentOS/Rocky
rpm -q --changelog kernel | grep -i "CVE-2023-0386" | head -5

# Inspect the overlayfs copy-up code for the running kernel
# (requires the full kernel source tree; linux-headers packages normally do
# not ship fs/overlayfs/, in which case this prints nothing)
grep -r "security_inode_copy_up" /usr/src/linux-headers-$(uname -r)/fs/overlayfs/ 2>/dev/null

2. Verifying and restricting unprivileged user namespaces

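Unprivileged user namespaces are the prerequisite for CVE-2023-0386 on uncontainerised systems. The kernel.unprivileged_userns_clone sysctl is a Debian/Ubuntu addition that mainline kernels do not provide; elsewhere, user.max_user_namespaces is the available control. Check the current setting: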

# 1 = unprivileged user namespaces enabled (vulnerable on unpatched kernels)
# 0 = disabled
cat /proc/sys/kernel/unprivileged_userns_clone

# Also check the broader user namespace limit
cat /proc/sys/user/max_user_namespaces
# 0 = user namespace creation disabled entirely

Disable unprivileged user namespaces as a temporary mitigation on unpatched kernels:

# Immediate disable (survives until reboot)
sysctl -w kernel.unprivileged_userns_clone=0

# Persist across reboots
cat > /etc/sysctl.d/99-userns-hardening.conf << 'EOF'
# Disable unprivileged user namespace creation.
# This mitigates CVE-2023-0386 and related overlayfs escape classes
# on kernels before the 6.2 / LTS backport fix.
# WARNING: this breaks rootless container modes (Podman rootless,
# Docker user namespace remapping, containerd rootless snapshotter).
# Re-enable after patching the kernel.
kernel.unprivileged_userns_clone = 0
EOF
sysctl --system

Verify the setting took effect:

sysctl kernel.unprivileged_userns_clone
# Expected: kernel.unprivileged_userns_clone = 0

# Test that an unprivileged user can no longer create a user namespace
su - testuser -c 'unshare -U id'
# Expected: unshare: unshare failed: Operation not permitted
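
The two sysctls can be folded into a single check for use in node audits. This is a minimal sketch: the function names userns_status and read_sysctl are hypothetical, and the result reflects only these two settings, not any LSM policy that may also restrict user namespaces.

```shell
# userns_status CLONE_VALUE MAX_VALUE
#   CLONE_VALUE: kernel.unprivileged_userns_clone, or "" when the sysctl is absent
#   MAX_VALUE:   user.max_user_namespaces
# Prints "disabled" when either control forbids creation, otherwise "enabled".
userns_status() {
  if [ "$2" = "0" ] || [ "$1" = "0" ]; then
    echo disabled
  else
    echo enabled
  fi
}

# Read a live value; a missing file yields an empty string.
read_sysctl() { cat "/proc/sys/$1" 2>/dev/null; }

userns_status "$(read_sysctl kernel/unprivileged_userns_clone)" \
              "$(read_sysctl user/max_user_namespaces)"
```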

3. containerd configuration: privileged vs rootless mode

In privileged (root) containerd mode, the snapshotter runs as root and the overlayfs upper layer directories are owned by root on the host. The copy-up race still applies, but the attacker must already be root inside the container (or have a means to reach the overlayfs path from a user namespace). Verify the containerd snapshotter configuration:

# Check which snapshotter is in use
containerd config dump 2>/dev/null | grep -A5 snapshotter
# Expected for non-rootless:
#   [plugins."io.containerd.snapshotter.v1.overlayfs"]
#     root_path = ""

# For containerd 1.7+, check the sandbox runtime
grep -E "snapshotter|rootless" /etc/containerd/config.toml

For rootless containerd (where the daemon runs as an unprivileged user), apply additional restrictions because the daemon itself operates within a user namespace:

# /etc/containerd/config.toml -- rootless hardening
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  # Use native overlayfs; fuse-overlayfs is needed only on kernels
  # that do not support overlayfs within user namespaces.
  # On patched kernels, native overlayfs is preferred.
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      # Disable user namespace passthrough for pods unless explicitly required.
      # This keeps pods in the host user namespace where overlayfs copy-up
      # uses the kernel's full privilege checks.
      UsernsMode = ""

Restart containerd after config changes:

systemctl restart containerd
systemctl is-active containerd

4. Seccomp profile blocking exploit-relevant syscalls

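The CVE-2023-0386 exploit chain requires unshare(2) (to create a user namespace) and mount(2) (to mount FUSE and overlayfs). Kubernetes does not apply a seccomp profile to pods unless one is requested (or the kubelet is configured to default to RuntimeDefault), so by default these syscalls are unfiltered. The RuntimeDefault profile shipped with containerd and Docker already denies unshare and mount to containers that lack CAP_SYS_ADMIN; a custom profile makes the block explicit and independent of capabilities. Either way the exploit is prevented without patching the kernel, at the cost of breaking any workload that legitimately uses namespaces: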

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["unshare"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "comment": "Block user namespace creation; mitigates CVE-2023-0386"
    },
    {
      "names": ["mount"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "comment": "Block overlayfs/FUSE mount inside containers"
    },
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ERRNO",
      "args": [
        {
          "index": 0,
          "value": 268435456,
          "op": "SCMP_CMP_MASKED_EQ",
          "valueTwo": 268435456
        }
      ],
      "comment": "Block CLONE_NEWUSER flag (0x10000000); prevents user namespace creation via clone(2)"
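    },
    {
      "names": ["clone3"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 38,
      "comment": "Return ENOSYS for clone3(2), whose flags live in a struct that seccomp cannot inspect; libc falls back to clone(2)"
    }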
  ]
}
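
The masked comparison in the clone rule is easy to misread, so here is a small sketch of how it evaluates. SCMP_CMP_MASKED_EQ matches when (argument & value) == valueTwo; with both set to 0x10000000 the rule matches any flag word that includes CLONE_NEWUSER. The function names below are hypothetical helpers for illustration only.

```shell
CLONE_NEWUSER=268435456   # 0x10000000

# masked_eq FLAGS MASK VALUE
# Succeeds when (FLAGS & MASK) == VALUE, mirroring SCMP_CMP_MASKED_EQ.
masked_eq() {
  [ $(( $1 & $2 )) -eq "$3" ]
}

# clone_creates_userns FLAGS
# Succeeds when the clone(2) flag word would be caught by the rule above.
clone_creates_userns() {
  masked_eq "$1" "$CLONE_NEWUSER" "$CLONE_NEWUSER"
}
```

A flag word of 268435456 alone, or combined with other namespace flags, is matched; an ordinary fork-style clone that passes only a signal number in the low bits is not.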

Apply the profile to pods through the securityContext.seccompProfile field (Kubernetes 1.19+); the older seccomp annotations are deprecated:

apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
  namespace: production
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: "profiles/block-userns-mount.json"
  containers:
    - name: app
      image: registry.company.com/app:v1.2.3
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 10001

The seccomp profile must be placed in the kubelet’s seccomp profile directory on every node, typically /var/lib/kubelet/seccomp/profiles/:

# Copy the profile to each node's kubelet seccomp directory
# (Automate via DaemonSet or configuration management)
mkdir -p /var/lib/kubelet/seccomp/profiles
cp block-userns-mount.json /var/lib/kubelet/seccomp/profiles/

5. AppArmor and SELinux policies

AppArmor can deny the mount operation class inside containers, blocking the FUSE and overlayfs mount steps of the exploit:

# /etc/apparmor.d/containers/block-overlayfs-escape
#include <tunables/global>

profile container-hardened flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Deny mounting FUSE or overlayfs filesystems inside the container
  deny mount fstype=fuse,
  deny mount fstype=overlay,
  deny mount fstype=overlayfs,

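  # Deny CAP_SYS_ADMIN, which mount(2) requires. This does not by itself
  # stop unprivileged user namespace creation; pair it with the sysctl or
  # seccomp controls described earlier.
  deny capability sys_admin,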

  # Allow normal container operation
  file,
  network,
  signal,
  ptrace peer=@{profile_name},
}

Load and verify the profile:

apparmor_parser -r -W /etc/apparmor.d/containers/block-overlayfs-escape
aa-status | grep container-hardened

Reference the AppArmor profile in Kubernetes via an annotation (containerd applies the profile through the CRI):

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: localhost/container-hardened

On SELinux-enabled systems (RHEL, Rocky, Fedora), the container_t domain that Podman and CRI-O use already restricts mount operations by default when SELinux is enforcing. Verify enforcement:

getenforce
# Expected: Enforcing

# Check that the container process runs in the correct domain
ps -eZ | grep container_t
# Running containers should appear as:
# system_u:system_r:container_t:s0:c123,c456  <pid>  <process>

6. Kubernetes Pod Security Admission enforcement

Apply the Restricted Pod Security Standards profile to namespaces that do not require elevated privileges. The Restricted profile requires allowPrivilegeEscalation: false, requires containers to run as non-root, and mandates the RuntimeDefault or Localhost seccomp profile. These controls do not directly block the overlayfs copy-up path, but they reduce the privilege available to a compromised container and make it harder for the attacker to reach the user namespace creation syscalls through legitimate application code:

# Label a namespace for Restricted enforcement
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=v1.28 \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/warn-version=v1.28

# Verify the label
kubectl get namespace production --show-labels

Use Kyverno to enforce a cluster-wide policy requiring seccomp profiles on all pods, filling the gap for namespaces that have not yet migrated to PSA Restricted:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-seccomp-profile
  annotations:
    policies.kyverno.io/title: Require Seccomp Profile
    policies.kyverno.io/category: Pod Security
    policies.kyverno.io/severity: high
    policies.kyverno.io/description: >-
      All containers must specify a seccomp profile.
      RuntimeDefault or Localhost profiles are accepted.
      Unconfined seccomp is prohibited cluster-wide.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-seccomp-profile
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Containers must specify seccomp profile (RuntimeDefault or Localhost). Unconfined is not permitted."
        pattern:
          spec:
            securityContext:
              seccompProfile:
                type: "RuntimeDefault | Localhost"

7. Falco rule: capability-bearing file creation in container upper layers

The observable signature of a successful CVE-2023-0386 exploit is the creation of a file with security.capability xattr in a path that is on the host filesystem but originated from overlayfs copy-up. Falco can detect the subsequent execution of such a file or the xattr write itself:

# /etc/falco/rules.d/overlayfs-cow-escape.yaml

- rule: Capability xattr written to file in container
  desc: >
    A process inside a container wrote the security.capability extended
    attribute to a file. This is the mechanism used by CVE-2023-0386 to
    plant a capability-bearing file on the host via overlayfs copy-up.
    Legitimate container workloads do not set security.capability xattrs.
  condition: >
    container
    and syscall.type = setxattr
    and setxattr.name = "security.capability"
    and not proc.name in (containerd, containerd-shim, runc)
  output: >
    security.capability xattr written inside container
    (user=%user.name uid=%user.uid
    proc=%proc.name pid=%proc.pid
    file=%fd.name
    container=%container.name
    image=%container.image.repository:%container.image.tag
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [container-escape, overlayfs, cve-2023-0386, capability]

- rule: Setuid binary executed from overlayfs upper layer
  desc: >
    A setuid or capability-bearing binary was executed from a path that
    corresponds to an overlayfs upper layer directory. This is a post-exploit
    indicator of the CVE-2023-0386 chain being used to gain elevated
    privileges on the host.
  condition: >
    spawned_process
    and (proc.is_suid_binary = true or
         fd.name startswith "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots")
    and not container
    and not proc.pname in (containerd, kubelet, systemd)
  output: >
    Setuid/capability binary executed from overlayfs path
    (user=%user.name uid=%user.uid
    proc=%proc.name pid=%proc.pid
    exe=%proc.exe
    args=%proc.args
    parent=%proc.pname ppid=%proc.ppid)
  priority: CRITICAL
  tags: [container-escape, overlayfs, host, privilege-escalation]

- rule: Unprivileged user namespace creation in container
  desc: >
    A container process created a new user namespace. This is the first
    step of CVE-2023-0386 exploitation. Few legitimate containerised
    workloads create user namespaces; this is almost always anomalous.
  condition: >
    container
    and syscall.type = unshare
    and (evt.arg.flags contains CLONE_NEWUSER)
    and not proc.name in (newuidmap, newgidmap)
  output: >
    User namespace creation inside container
    (user=%user.name uid=%user.uid
    proc=%proc.name pid=%proc.pid
    flags=%evt.arg.flags
    container=%container.name
    image=%container.image.repository:%container.image.tag
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container-escape, overlayfs, cve-2023-0386, user-namespace]

Reload Falco after adding rules:

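# Validate rule syntax (the default rules file supplies the macros used above)
falco -V /etc/falco/falco_rules.yaml -V /etc/falco/rules.d/overlayfs-cow-escape.yaml
# Reload live (Falco 0.32+)
kill -HUP $(pgrep -x falco)
# Verify rules loaded: -L lists every rule from the configured rules files
falco -L 2>&1 | grep -iE "capability xattr|overlayfs upper layer|user namespace creation|Error"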

8. Native kernel protection in Linux 6.2+

The actual fix in Linux 6.2 adds a check in the overlayfs copy-up path (fs/overlayfs/copy_up.c) that suppresses security.capability xattr copying when the upper layer is not owned by the same user namespace as the lower layer. The key change is in ovl_copy_up_data() and the surrounding metadata copy routines:

Before the fix: security.capability was copied unconditionally from lower to upper during copy-up.
After the fix: the kernel checks whether the file’s owning user namespace matches the mount’s user namespace. If the upper directory is in a different (more privileged) namespace than the FUSE-backed lower layer, the security.capability xattr is stripped from the copy.

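The following commands confirm only that overlayfs is available in the running kernel; they do not prove the fix is present, which is established by the version and changelog checks in section 1: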

# Confirm overlayfs is built as a module or built-in (not absent)
cat /proc/filesystems | grep overlay
# Expected: nodev overlay

# Confirm the kernel was built with overlayfs user namespace support
# (present in all kernels affected by CVE-2023-0386 and its fix)
zcat /proc/config.gz 2>/dev/null | grep CONFIG_OVERLAY_FS
# Expected: CONFIG_OVERLAY_FS=y or CONFIG_OVERLAY_FS=m

On patched kernels, copy-up from a FUSE lower layer inside a user namespace no longer carries the security.capability xattr to the upper layer. The escape chain is broken at the kernel level, and no additional configuration is required.

Expected Behaviour

Scenario	Unpatched kernel (< 6.2, no backport)	Patched kernel (>= 6.2 or backport)
Exploit: FUSE+overlayfs copy-up from user namespace	`security.capability` xattr copied to upper layer on host; attacker can execute capability-bearing file as host root	`security.capability` xattr stripped during copy-up; upper-layer file has no capabilities; exploit fails
`unshare -U` by unprivileged user	Succeeds if `unprivileged_userns_clone=1`	Succeeds (user namespaces still allowed); exploit blocked at copy-up layer
`unshare -U` with `sysctl kernel.unprivileged_userns_clone=0`	Fails with EPERM; exploit blocked at namespace creation step	Fails with EPERM; defence-in-depth maintained
Falco rule firing on `setxattr security.capability`	Alert fires; human response required	Alert fires on any attempt; useful for detecting probing even on patched systems
Pod with Restricted PSA and RuntimeDefault seccomp	`unshare` blocked by seccomp; exploit fails at step 2	Defence-in-depth; redundant protection
Rootless containerd on unpatched kernel	Vulnerable; FUSE lower layer accessible in rootless context	Not vulnerable; copy-up strips capability xattr

Verify that the kernel protection is working on a patched system:

# As an unprivileged user, attempt the exploit chain (should fail on patched kernel)
# Step 1: create user namespace -- should succeed on patched kernel with userns enabled
# (single quotes so that id runs inside the new namespace, not in the calling shell)
unshare -U --map-root-user bash -c 'echo "User namespace OK: uid=$(id -u)"'

# Step 2: mount FUSE + overlayfs and check if capability xattr survives copy-up
# This requires the full exploit PoC; for verification use the kernel selftest instead:
# https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/filesystems/overlayfs
#
# Run the overlayfs selftests if kernel-testing packages are installed:
cd /usr/lib/linux-tools-$(uname -r)/ 2>/dev/null || \
  cd /usr/share/linux-tools-$(uname -r)/ 2>/dev/null
./run_tests.sh overlayfs 2>&1 | grep -E "PASS|FAIL|CVE"

Trade-offs

Mitigation	Protection Provided	Operational Cost	Breaks
Patch kernel to 6.2+ / LTS backport	Closes CVE-2023-0386 at source; no configuration needed	Node drain and reboot required per node; managed K8s clusters depend on provider release schedule	Nothing; preferred remediation
`sysctl kernel.unprivileged_userns_clone=0`	Blocks exploit at namespace creation; also prevents other user namespace CVEs	Immediate effect; requires reboot persistence via sysctl.d	Rootless Podman, Docker user namespace remapping, containerd rootless snapshotter, Flatpak, Chrome sandboxing, Firefox sandboxing
Seccomp: block `unshare` and `mount`	Blocks exploit within containers even on unpatched kernel	Profile must be distributed to all nodes and referenced in pod specs	Any containerised workload that calls `unshare` or `mount` (legitimate uses: Buildah, kaniko in containers, nested container runtimes)
AppArmor: deny mount fstype=fuse,overlay	Blocks FUSE and overlayfs mount steps	Profile requires AppArmor-enabled kernel and per-node deployment	FUSE-backed filesystems inside containers (e.g., s3fuse, sshfs mounts)
Kubernetes PSA Restricted	Reduces container privileges; blocks `allowPrivilegeEscalation`	Namespace-level opt-in; requires updating existing pod specs	Pods that legitimately need elevated capabilities or root; requires migration effort
Falco detection rules	Detects exploit in progress; enables rapid response	Alert noise tuning required; adds observability overhead	Nothing; detective control only

The strongest approach on the path to full remediation is: disable unprivileged user namespaces immediately on nodes where rootless containers are not in use, deploy the Falco rules cluster-wide, and schedule kernel patching in the next maintenance window. Re-enable user namespaces after the kernel is patched. For clusters that depend on rootless container modes, the seccomp + AppArmor combination provides meaningful containment while the kernel patch is being rolled out, but it requires careful pod-by-pod validation to avoid breaking legitimate workloads.

Failure Modes

Scenario	Why It Fails	Consequence	Correct Response
Container runtime updated but kernel not patched	CVE-2023-0386 is a kernel vulnerability. Updating containerd, Docker, or CRI-O has no effect on the copy-up bug; the runtime calls into the kernel’s overlayfs implementation, which remains vulnerable	Nodes appear patched from a runtime perspective but exploit succeeds; false sense of security	Check kernel version, not runtime version; track kernel CVE status separately from runtime CVE status
`unprivileged_userns_clone=0` breaking legitimate tooling	Flatpak, browser sandboxing, Buildah, rootless Podman, and some CI tools require unprivileged user namespaces; they fail with EPERM after the sysctl change	CI pipelines break; developer tools stop working; pressure to re-enable the setting without patching the kernel	Audit which tools require user namespaces before applying the sysctl; isolate those workloads to already-patched nodes rather than disabling the protection globally
Seccomp profile blocking `unshare` but not `clone` with `CLONE_NEWUSER`	`unshare(2)` is not the only way to create a user namespace; `clone(2)` with `CLONE_NEWUSER` (0x10000000) and `clone3(2)` do the same. A seccomp profile that blocks only `unshare` leaves the `clone`/`clone3` path open	Attacker uses `clone` directly; seccomp mitigation is bypassed	Block `clone` with `CLONE_NEWUSER` mask, `clone3`, and `unshare` in the seccomp profile; use the masked argument comparison shown in the Configuration section
Falco not monitoring overlayfs upper layer paths	The overlayfs upper layer directory path varies by runtime, snapshotter version, and configuration (containerd default vs custom root); a Falco rule hardcoding `/var/lib/containerd/...` misses CRI-O or Docker installations	The “setuid binary executed from overlayfs path” rule does not fire for non-containerd runtimes	Use Falco’s `proc.is_suid_binary` and `setxattr` syscall detection rather than path-prefix rules; these are runtime-agnostic
LTS kernel with backport not recognized as patched	Distributions backport CVE-2023-0386 fix without changing the kernel major.minor version; a version check that only looks at `6.2` misses `5.15.90-1` or `5.10.165-1`	Automated tooling flags nodes as vulnerable when they are patched; wastes remediation effort; or incorrectly marks nodes as patched when backport is absent	Check distribution advisory trackers (USN for Ubuntu, RHSA for RHEL) rather than relying solely on kernel version strings; use package changelog checks as shown in the Configuration section
AppArmor profile in Complain mode rather than Enforce	New AppArmor profiles default to Complain mode in some distributions; mount deny rules log but do not block	Exploit succeeds; AppArmor generates a log entry that may or may not be monitored	Verify `aa-status