User Namespace Privilege Escalation: CVE-2023-32233 and the Netlink Attack Chain
The Problem
CVE-2023-32233 was disclosed on May 8, 2023, alongside a patch authored by Pablo Neira Ayuso, the Netfilter maintainer. A public PoC exploit appeared on GitHub within 48 hours. The vulnerability exists in the nf_tables subsystem — specifically in how the kernel handles anonymous sets during transaction processing — and affects all Linux kernels through 6.3.1. Patched in 6.3.2, 6.2.15, 6.1.29, 5.15.111, and 5.10.180.
The vulnerability class is use-after-free. The exploitation path is: unprivileged user → create user namespace → obtain CAP_NET_ADMIN inside namespace → send crafted Netlink messages to trigger use-after-free in nf_tables → kernel code execution → host root. Total time from shell to root on unpatched Ubuntu 22.04 with the published PoC: under five minutes.
nf_tables Transactions and Anonymous Set Lifetime
The nftables subsystem uses a transactional model for rule changes. Userspace sends a batch of Netlink messages, the kernel processes them as an atomic transaction, and either commits all changes or aborts the entire batch. Inside the transaction machinery, nf_tables allocates kernel objects — rules, chains, sets — and tracks them through nft_trans wrapper structures that hold references until commit or abort.
Anonymous sets are a specific object type in nftables. Unlike named sets (which are persistent and reusable across rules), anonymous sets are bound to a single rule and have no independent lifecycle. They are created implicitly when a rule references them and destroyed when the rule is removed. The kernel allocates an anonymous set, adds it to the transaction object list, and expects the transaction cleanup path to handle its memory.
The bug is in nf_tables_newset(). When processing a transaction that adds an anonymous set and then subsequently deletes it within the same transaction batch, the abort path for the transaction does not correctly account for the fact that the set was both created and slated for deletion in the same batch. The reference counting walks the transaction list and calls nft_set_destroy() on the set during abort — but the set has already been enqueued for deletion by the NFT_MSG_DELSET operation earlier in the same transaction. The result is that nft_set_destroy() is called twice on the same kernel object. The second call operates on freed memory.
The freed memory is a struct nft_set on the kernel’s slab heap. Because the attacker controls the timing and can influence what the kernel allocates between the two references (by sending additional Netlink messages between transactions), this is a classic heap use-after-free with heap grooming potential. The published PoC used the reclaim window to position a controlled object in the freed slot and redirect execution.
The User Namespace Enabler
The nf_tables subsystem requires CAP_NET_ADMIN to send Netlink socket messages. On a normal system, CAP_NET_ADMIN is a privileged capability held only by root and processes granted it explicitly. However, user namespaces change this.
User namespaces (introduced in Linux 3.8) allow an unprivileged process to create a new namespace in which it holds full root capabilities. The kernel maps the user’s host UID (e.g., uid 1000) to uid 0 inside the namespace. Inside that namespace, the process can create child namespaces — including a network namespace — and within the network namespace it holds CAP_NET_ADMIN. That capability is scoped to the new network namespace, not the host network, but the kernel’s nftables code does not adequately enforce that scope when processing Netlink messages. The use-after-free is reachable from a namespace-scoped CAP_NET_ADMIN context.
The syscall sequence to acquire the attack surface:
// Step 1: create a user namespace and a network namespace inside it.
// CLONE_NEWUSER gives us uid 0 inside the new namespace.
// CLONE_NEWNET gives us a private network namespace with CAP_NET_ADMIN.
unshare(CLONE_NEWUSER | CLONE_NEWNET);
// Step 2: set up the uid/gid mapping so the kernel accepts our
// namespace-root identity. Write to /proc/self/uid_map:
// "0 <host_uid> 1\n"
// This maps namespace uid 0 to the current host uid.
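// Step 3: open a NETLINK_NETFILTER socket; nf_tables is the NFNL_SUBSYS_NFTABLES
// subsystem multiplexed over it. Opening the socket is unprivileged; the nf_tables
// configuration messages sent over it require CAP_NET_ADMIN, which we now hold
// in the new network namespace.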
int nl_fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);
// Step 4: craft Netlink message batches that trigger the use-after-free
// in nf_tables_newset() by interleaving NEWSET and DELSET operations
// in the same transaction in a way that defeats the abort path's
// reference counting.
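The mapping line written in step 2 has three fields: the first ID inside the namespace, the first ID in the parent namespace, and the length of the range. The following is a minimal Python sketch of how such a line translates IDs, which can help when reading /proc/<pid>/uid_map during an investigation. The names parse_id_map and to_host_id are illustrative helpers, not kernel or library APIs.

```python
from typing import List, Optional, Tuple

# (inside_start, outside_start, length), as stored in uid_map / gid_map
IdRange = Tuple[int, int, int]


def parse_id_map(text: str) -> List[IdRange]:
    """Parse the contents of a uid_map or gid_map file into ranges."""
    ranges: List[IdRange] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        ranges.append((inside, outside, length))
    return ranges


def to_host_id(ranges: List[IdRange], ns_id: int) -> Optional[int]:
    """Translate an ID seen inside the namespace to the ID outside it."""
    for inside, outside, length in ranges:
        if inside <= ns_id < inside + length:
            return outside + (ns_id - inside)
    return None  # no mapping covers this ID


if __name__ == "__main__":
    ranges = parse_id_map("0 1000 1\n")
    print(to_host_id(ranges, 0))  # namespace uid 0 is host uid 1000
    print(to_host_id(ranges, 1))  # unmapped
```

With the single-entry mapping "0 1000 1", only namespace uid 0 resolves, and it resolves to the unprivileged host uid. The "root" identity exists only inside the namespace.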
On Ubuntu 22.04 (and 20.04, Debian 11, Fedora, and most modern desktop Linux distributions), kernel.unprivileged_userns_clone is set to 1 by default — or the equivalent upstream sysctl user.max_user_namespaces is set to a non-zero value. Any logged-in user, any container process running as a non-root UID with no capabilities, any application sandbox can call unshare(CLONE_NEWUSER) without any permissions check. The kernel hands out namespaced root to anyone who asks.
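As a rough way to see which of the two sysctls governs a given host, the sketch below reads both files under /proc/sys and reports whether unprivileged user namespace creation appears to be allowed. It is an illustration under stated assumptions: it looks only at these two sysctls, ignores AppArmor and seccomp restrictions, and treats a missing file as "no restriction from that knob". The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_int(path: Path) -> Optional[int]:
    """Return the integer stored in a sysctl file, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def unprivileged_userns_allowed(proc_sys: Path = Path("/proc/sys")) -> bool:
    """Best-effort check based only on the two sysctls discussed above."""
    # Debian/Ubuntu-specific boolean; absent on upstream kernels.
    clone = read_int(proc_sys / "kernel" / "unprivileged_userns_clone")
    # Upstream limit on the number of user namespaces; 0 disables creation.
    max_ns = read_int(proc_sys / "user" / "max_user_namespaces")
    if clone == 0:
        return False
    if max_ns == 0:
        return False
    return True


if __name__ == "__main__":
    state = "allowed" if unprivileged_userns_allowed() else "blocked"
    print("unprivileged user namespaces (by sysctl):", state)
```

Passing a different proc_sys root makes the function usable against a mounted host filesystem or a test fixture rather than the live system.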
The Pattern This Belongs To
CVE-2023-32233 is one of three high-profile kernel privilege escalations in two years that an unprivileged local user could reach, two of them specifically through unprivileged user namespace access:
- CVE-2022-0847 (Dirty Pipe): write to read-only page cache files. Reachable from any unprivileged process; user namespaces are not required, because the page cache attack surface is fully accessible without privileges.
- CVE-2022-1015 (Netfilter nft_expr): out-of-bounds write in nft_validate_register_store(). Reachable via user namespace CAP_NET_ADMIN, the same attack chain as CVE-2023-32233.
- CVE-2023-32233 (nf_tables anonymous sets): use-after-free via user namespace CAP_NET_ADMIN.
The design tension is real and acknowledged: user namespaces are a security feature (they enable rootless containers, application sandboxes, Chrome’s renderer isolation, Flatpak/Snap confinement) that simultaneously expands the unprivileged kernel attack surface. Every kernel subsystem feature that is accessible from within a user namespace — nftables, eBPF, io_uring, userfaultfd — is reachable by any process on the system, not just privileged ones. When any of those subsystems has a memory safety bug, it becomes exploitable without any account privileges at all.
The kernel developers have been progressively restricting access from user namespaces to specific subsystems (eBPF had kernel.unprivileged_bpf_disabled added after a string of eBPF-reachable CVEs; io_uring had similar restrictions added after CVE-2022-2602 and related bugs). nftables reachability from user namespaces is the same category of problem.
What the PoC Exploit Does
The published PoC (available on GitHub from user Liuk3r and multiple mirrors within 48 hours of disclosure) works in four phases:
- Namespace setup. Calls unshare(CLONE_NEWUSER | CLONE_NEWNET), writes uid/gid mappings, opens a Netlink socket.
- Heap spray. Sends a series of valid nftables Netlink messages to allocate controlled objects on the kernel heap and establish known heap layout state. This primes the slab allocator to return freed nft_set memory at a predictable offset from the spray objects.
- Use-after-free trigger. Sends a crafted transaction batch: BATCH_BEGIN, NEWSET (allocates anonymous set), DELSET (marks it for deletion), BATCH_END. The abort path triggers the double-free. The freed nft_set slab slot is then reclaimed by a controlled allocation positioned during the spray phase.
- Kernel code execution. The reclaimed slab slot contains attacker-controlled data that overlaps with a function pointer in the repurposed object. A subsequent nftables operation dereferences that function pointer, redirecting execution to a ROP chain or shellcode in user-controlled memory mapped at a fixed address. The shellcode calls commit_creds(prepare_kernel_cred(NULL)) to set the process credentials to uid 0, then returns to userspace. A /bin/sh exec follows.
Total elapsed time on unpatched Ubuntu 22.04 with a 5.15.x kernel: approximately 4 minutes on a fresh VM, including heap spray time. On a system with KASLR and SMEP/SMAP enabled, the PoC includes a kernel address leak stage that adds 30–60 seconds.
Threat Model
Unprivileged local user on any unpatched Linux machine. A user with a shell account and no special permissions creates a user namespace, obtains CAP_NET_ADMIN, and exploits the nftables use-after-free to get a root shell. The only prerequisite is kernel.unprivileged_userns_clone=1, which is the default on Ubuntu 20.04, 22.04, Debian 11, and Fedora. No social engineering, no network access, no service exploitation required.
Container process running as non-root. A container configured with runAsUser: 1000, all capabilities dropped, allowPrivilegeEscalation: false, running inside a Kubernetes pod — if the host kernel has user.max_user_namespaces > 0, that container process can call unshare(CLONE_NEWUSER | CLONE_NEWNET) to create a nested user namespace, get CAP_NET_ADMIN inside it, and exploit this CVE to reach host root. This is a complete bypass of all Kubernetes pod security controls. The pod security admission controller, PodSecurityPolicy, OPA Gatekeeper policies, and securityContext settings all operate above the kernel; a kernel use-after-free that reaches host root bypasses all of them.
Kubernetes node compromise via pod escape. The most impactful scenario in a Kubernetes environment: a workload compromise (application RCE) becomes a node compromise and then a cluster compromise. An attacker with RCE in a container exploits CVE-2023-32233 to reach host root on the node, then reads kubelet credentials from /var/lib/kubelet/pki/, mounts the host filesystem to access secret material, and pivots to the Kubernetes API server. Kubernetes has no visibility into kernel-level privilege escalation; the audit log shows nothing unusual because the exploit happens entirely in kernel memory before any API call is made.
Sandbox escapes. Chrome’s renderer sandbox uses user namespaces to isolate renderer processes from the OS. A kernel CVE reachable from user namespace context can escape the Chrome sandbox and reach the host OS from a compromised renderer. This is the same reason Chrome has shipped with --no-sandbox fallback warnings when user namespaces are restricted — the sandbox depends on user namespace access. The same applies to Flatpak (uses user namespaces for bwrap confinement) and Snap (uses AppArmor + user namespaces). A CVE like this collapses those isolation layers.
Multi-tenant systems. On any shared Linux host — a university login server, a build machine shared by a team, a CI runner without per-job VM isolation — any user with a login can escalate to root and read or modify other users’ data, credentials, and running processes.
Hardening Configuration
1. Disable Unprivileged User Namespaces (Targeted Restriction)
This is the primary control. Disabling unprivileged user namespace creation eliminates the attack vector entirely for this CVE and for the entire class of nftables/Netfilter CVEs that require user namespace CAP_NET_ADMIN.
# Check current setting before making changes
# Debian/Ubuntu: kernel.unprivileged_userns_clone (Debian-specific sysctl)
sysctl kernel.unprivileged_userns_clone
# kernel.unprivileged_userns_clone = 1 (default, unprivileged creation allowed)
# Upstream/Fedora/RHEL: user.max_user_namespaces (standard upstream kernel)
sysctl user.max_user_namespaces
# user.max_user_namespaces = 7640 (default, non-zero = allowed)
# --- Debian/Ubuntu systems ---
# kernel.unprivileged_userns_clone is a boolean (0/1) added by Debian patches.
# Setting to 0 prevents unprivileged processes from calling unshare(CLONE_NEWUSER).
echo "kernel.unprivileged_userns_clone=0" >> /etc/sysctl.d/99-security.conf
sysctl -p /etc/sysctl.d/99-security.conf
# Verify the change took effect immediately:
sysctl kernel.unprivileged_userns_clone
# kernel.unprivileged_userns_clone = 0
# --- Upstream kernel / Fedora / RHEL / CentOS Stream ---
# user.max_user_namespaces controls how many user namespaces can exist concurrently.
# Setting to 0 prevents unprivileged creation entirely.
# NOTE: setting to 0 also prevents root from creating user namespaces for container
# workloads, so use this carefully on container hosts.
echo "user.max_user_namespaces=0" >> /etc/sysctl.d/99-security.conf
sysctl -p /etc/sysctl.d/99-security.conf
# Persist across reboots (in addition to sysctl.d, verify systemd-sysctl loads it):
systemctl status systemd-sysctl
# Confirm the unit reads /etc/sysctl.d/99-security.conf
On Ubuntu/Debian, prefer kernel.unprivileged_userns_clone=0 over user.max_user_namespaces=0. The Debian-specific boolean is more surgical: it blocks unprivileged creation while still allowing the kernel and root processes to create user namespaces for container runtimes running as root. The upstream user.max_user_namespaces=0 is a global hard cap that breaks containerd and Docker on systems where the daemon runs as root.
What breaks when you set this. This is not a free control:
- Rootless Docker/Podman: fails immediately. Rootless container runtimes require the user to call unshare(CLONE_NEWUSER) before spawning containers. Setting kernel.unprivileged_userns_clone=0 makes rootless containers impossible. Users relying on podman run without root will see: Error: cannot re-exec process: fork/exec /proc/self/exe: operation not permitted.
- Chrome/Chromium without --no-sandbox: the browser's renderer sandbox uses CLONE_NEWUSER for process isolation. With the sysctl set to 0, Chrome has to fall back to its setuid sandbox helper where one is installed; without it the browser refuses to start unless launched with --no-sandbox, which runs renderer processes with no OS-level sandbox. That is a security regression for browser usage.
- Flatpak: uses bubblewrap (bwrap), which requires user namespaces. Applications installed via Flatpak will fail to launch.
- Snap: uses AppArmor-based confinement in combination with user namespaces. Snap confinement degrades without user namespace access.
- LXC/LXD without privileged mode: unprivileged LXC containers rely on user namespaces. Privileged LXC containers (run as root) continue working.
- systemd-nspawn: containers created without --private-users=off require user namespace support.
On a dedicated server with no desktop applications, no rootless containers, and no Snap/Flatpak packages, this list is empty and the control is safe to deploy. On a developer workstation running rootless Podman and Chrome, it breaks daily workflow.
2. Seccomp Profile Blocking CLONE_NEWUSER
For environments where disabling unprivileged user namespaces entirely is not acceptable, a seccomp profile can block the specific unshare() and clone() calls with the CLONE_NEWUSER flag (bit 0x10000000, decimal 268435456).
Create /var/lib/kubelet/seccomp/profiles/block-userns.json:
{
"defaultAction": "SCMP_ACT_ALLOW",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_AARCH64",
"SCMP_ARCH_ARM"
],
"syscalls": [
{
"names": ["unshare"],
"action": "SCMP_ACT_ERRNO",
"errnoRet": 1,
"args": [
{
"index": 0,
"value": 268435456,
"op": "SCMP_CMP_MASKED_EQ",
"valueTwo": 268435456
}
]
},
{
"names": ["clone"],
"action": "SCMP_ACT_ERRNO",
"errnoRet": 1,
"args": [
{
"index": 0,
"value": 268435456,
"op": "SCMP_CMP_MASKED_EQ",
"valueTwo": 268435456
}
]
},
{
"names": ["clone3"],
"action": "SCMP_ACT_ERRNO",
"errnoRet": 38
}
]
}
SCMP_CMP_MASKED_EQ with valueTwo = 268435456 (the CLONE_NEWUSER flag bitmask) matches only calls where the CLONE_NEWUSER bit is set in the flags argument. Other unshare() calls (e.g., CLONE_NEWNS for mount namespaces, CLONE_NEWPID) pass through unaffected. clone3 is blocked entirely with ENOSYS (38) because its flags structure is more complex to filter by argument and it’s not required by most container workloads.
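The masked comparison is easier to verify with concrete flag values. The sketch below models the rule in Python: the filter masks the syscall argument with value and compares the result to valueTwo. The flag constants are the kernel's CLONE_* values; masked_eq and blocked_by_profile are illustrative names and this is a model of the rule's logic, not the seccomp API.

```python
# Namespace-related clone flags from the kernel's UAPI headers.
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000


def masked_eq(arg: int, mask: int, expected: int) -> bool:
    """Model of SCMP_CMP_MASKED_EQ: (arg & mask) == expected."""
    return (arg & mask) == expected


def blocked_by_profile(flags: int) -> bool:
    """True when the unshare/clone rule in the profile above would match."""
    # The profile sets both "value" (mask) and "valueTwo" (expected)
    # to CLONE_NEWUSER, so the rule fires whenever that bit is set.
    return masked_eq(flags, CLONE_NEWUSER, CLONE_NEWUSER)


if __name__ == "__main__":
    print(blocked_by_profile(CLONE_NEWUSER | CLONE_NEWNET))  # True
    print(blocked_by_profile(CLONE_NEWNS))                   # False
    print(blocked_by_profile(CLONE_NEWPID | CLONE_NEWNET))   # False
```

Because the mask isolates a single bit, any combination of flags that includes CLONE_NEWUSER is rejected, while calls that only request other namespaces pass through.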
Apply to a Kubernetes pod:
apiVersion: v1
kind: Pod
metadata:
  name: hardened-workload
  annotations:
    # Legacy annotation for older clusters; Kubernetes 1.27+ ignores it
    # and uses only the seccompProfile field below.
    seccomp.security.alpha.kubernetes.io/pod: "localhost/profiles/block-userns.json"
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-userns.json
    runAsNonRoot: true
    runAsUser: 1000
  containers:
  - name: app
    image: myapp:2.1.0
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: true
Set as the cluster-wide default seccomp profile via the kubelet configuration:
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
seccompDefault: true
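Note that seccompDefault only makes RuntimeDefault the fallback for pods that set no profile. RuntimeDefault is whatever profile the container runtime ships as its default, and RuntimeClass has no seccomp field, so pointing the default at the blocking profile has to be done in the runtime's own configuration (section 6) or by setting localhostProfile per pod as shown above.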
For Docker, apply system-wide via /etc/docker/daemon.json:
{
"seccomp-profile": "/etc/docker/seccomp/block-userns.json"
}
Then restart the Docker daemon: systemctl restart docker.
3. AppArmor userns_restrict
Ubuntu 23.10 and later ship with AppArmor 4.0, which includes a userns_restrict feature specifically designed to address the class of vulnerabilities that use user namespaces as an attack staging area. It is distinct from the sysctl approach: rather than a global kernel switch, it applies per-process confinement through AppArmor profiles and can be tuned to allow specific trusted processes to create user namespaces while blocking untrusted ones.
# Check if AppArmor userns restriction is available on your kernel
# (Ubuntu 23.10+ with AppArmor 4.0, or Ubuntu 22.04 with HWE kernel + updated apparmor)
aa-status --json 2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
print('userns_restrict available:', 'unprivileged_userns_create' in str(d))
"
# Enable the global AppArmor user namespace restriction on Ubuntu 23.10+:
# This uses the kernel.apparmor_restrict_unprivileged_userns sysctl
sysctl -w kernel.apparmor_restrict_unprivileged_userns=1
echo "kernel.apparmor_restrict_unprivileged_userns=1" >> /etc/sysctl.d/99-security.conf
For a custom AppArmor profile that blocks user namespace creation for a specific application:
# /etc/apparmor.d/usr.bin.myapp
#include <tunables/global>
profile myapp /usr/bin/myapp {
#include <abstractions/base>
# Allow normal file and network access
/usr/bin/myapp mr,
/usr/lib/** mr,
/usr/share/** r,
/proc/self/status r,
network inet stream,
# Explicitly deny user namespace creation.
# On AppArmor 4.0 with userns_restrict support:
deny capability sys_admin,
# Block write to uid_map / gid_map (required step after unshare CLONE_NEWUSER)
deny /proc/*/uid_map w,
deny /proc/*/gid_map w,
deny /proc/*/setgroups w,
}
Load and enforce:
apparmor_parser -r -W /etc/apparmor.d/usr.bin.myapp
aa-status | grep myapp
# myapp (enforce)
On Ubuntu 22.04 with the standard (non-HWE) AppArmor version, the kernel.apparmor_restrict_unprivileged_userns sysctl is not present. The /proc/*/uid_map write denial is the available mechanism. Verify coverage:
# Attempt to create a user namespace as a non-root user, should produce a denial:
# (run as a non-root user under the confined profile)
unshare --user --map-root-user /bin/bash -c "id"
# If AppArmor denial fires: unshare: unshare failed: Operation not permitted
# Check the AppArmor log:
grep "DENIED" /var/log/audit/audit.log | grep "uid_map"
# type=AVC msg=audit(1715126400.441:8832): apparmor="DENIED" operation="file_perm"
# profile="myapp" name="/proc/12047/uid_map" pid=12047 comm="unshare" requested_mask="w"
4. Kernel Patch Verification
Before deploying any of the above mitigations, confirm whether the running kernel is actually vulnerable. The patched versions by series:
uname -r
# Patched mainline/stable versions:
# >= 6.3.2 (mainline)
# >= 6.2.15 (stable)
# >= 6.1.29 (longterm)
# >= 5.15.111 (longterm)
# >= 5.10.180 (longterm)
# Kernels below 5.10 are unaffected (nf_tables set transaction code is different).
# On Debian/Ubuntu, check the installed package version:
dpkg-query -W -f='${Version}\n' "linux-image-$(uname -r)"
# Example patched output: 5.15.0-75.82 (Ubuntu maps 5.15.0-75 -> 5.15.111 upstream)
# Cross-reference with Ubuntu Security Notices:
# USN-6122-1: linux 5.15.0-75.82 (May 2023) addresses CVE-2023-32233
# On RHEL/CentOS, check for the advisory:
rpm -q kernel | head -5
# kernel-5.14.0-284.11.1.el9_2 or later addresses CVE-2023-32233 on RHEL 9
# Quick automated check — compare running kernel version against known-good series:
python3 -c "
import platform
r = platform.release().split('-')[0]
v = tuple(int(x) for x in r.split('.')[:3])
patched = [
(6, 3, 2), (6, 2, 15), (6, 1, 29), (5, 15, 111), (5, 10, 180)
]
series_ok = any(
v[0] == p[0] and v[1] == p[1] and v[2] >= p[2]
for p in patched
) or v >= (6, 4, 0)
print('Kernel:', platform.release())
print('Status:', 'PATCHED (by version)' if series_ok else 'CHECK DISTRO ERRATA - may be VULNERABLE')
"
Note that distro kernels backport patches to older version numbers. Ubuntu 22.04 ships a 5.15.x kernel and applied the CVE-2023-32233 fix in linux-image-5.15.0-75.82. The version number 5.15.0-75 does not map directly to 5.15.111, so the script above will show CHECK DISTRO ERRATA for Ubuntu kernels. Always cross-reference with the Ubuntu Security Notices (USN-6122-1 for this CVE) rather than relying purely on upstream version comparisons.
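For Ubuntu-style release strings, a comparison against the distribution's own ABI number is closer to what the security notice describes than an upstream version check. The sketch below is a hedged illustration: the only data point in FIXED_ABI is the 5.15.0-75 figure quoted in this section, the helper names are hypothetical, and flavour kernels (cloud or OEM variants) may use different ABI numbering, so the security notice remains the authority.

```python
import re
from typing import Optional, Tuple

# Assumption: fixed ABI number taken from the text above
# (linux-image-5.15.0-75.82 on Ubuntu 22.04). Extend from the USN as needed.
FIXED_ABI = {(5, 15, 0): 75}

Base = Tuple[int, int, int]


def parse_ubuntu_release(release: str) -> Optional[Tuple[Base, int]]:
    """Split '5.15.0-75-generic' into ((5, 15, 0), 75)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        return None
    major, minor, patch, abi = (int(g) for g in match.groups())
    return (major, minor, patch), abi


def ubuntu_status(release: str) -> str:
    """Classify a release string against the known fixed ABI numbers."""
    parsed = parse_ubuntu_release(release)
    if parsed is None:
        return "unknown format"
    base, abi = parsed
    fixed = FIXED_ABI.get(base)
    if fixed is None:
        return "no data for this series"
    return "patched" if abi >= fixed else "vulnerable"


if __name__ == "__main__":
    import platform
    print(platform.release(), "->", ubuntu_status(platform.release()))
```

A result of "no data for this series" or "unknown format" means the check cannot decide, not that the kernel is safe.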
5. Audit Namespace Creation Events
Log every unshare() and clone() call that includes the CLONE_NEWUSER flag. This does not prevent exploitation but provides forensic visibility and supports detection of exploitation attempts.
# Add auditd rules to capture CLONE_NEWUSER syscall arguments.
# The flag CLONE_NEWUSER = 0x10000000 = 268435456 decimal.
# -F 'a0&0x10000000' matches calls where bit 0x10000000 is set in the first argument.
# Quote the expression: an unquoted & makes the shell background the command.
auditctl -a always,exit -F arch=b64 -S unshare \
-F 'a0&0x10000000' -k user_ns_create
auditctl -a always,exit -F arch=b64 -S clone \
-F 'a0&0x10000000' -k user_ns_create
# Persist rules across reboots:
cat >> /etc/audit/rules.d/user-namespaces.rules << 'EOF'
# Audit unprivileged user namespace creation attempts
# CLONE_NEWUSER = 0x10000000 in both unshare() and clone() flags
-a always,exit -F arch=b64 -S unshare -F a0&0x10000000 -k user_ns_create
-a always,exit -F arch=b64 -S clone -F a0&0x10000000 -k user_ns_create
# Also audit clone3 (newer glibc uses this)
-a always,exit -F arch=b64 -S clone3 -k user_ns_create
EOF
augenrules --load
# Verify the rules are active:
auditctl -l | grep user_ns_create
# -a always,exit -F arch=b64 -S unshare -F a0&0x10000000 -k user_ns_create
# -a always,exit -F arch=b64 -S clone -F a0&0x10000000 -k user_ns_create
# Parse recent user namespace creation events:
ausearch -k user_ns_create --raw | aureport --syscall -i | tail -20
# 1. 05/08/2026 14:32:11 unshare <no_session> root 12041 /tmp/exploit success
# 2. 05/08/2026 14:32:11 clone <no_session> root 12042 /tmp/exploit success
# Full raw event for forensic detail:
ausearch -k user_ns_create -ts today | head -40
# ----
# time->Fri May 8 14:32:11 2026
# type=SYSCALL msg=audit(1715175131.441:9021):
# arch=c000003e syscall=272 success=yes exit=0
# a0=10000000 a1=0 a2=0 a3=0
# items=0 ppid=12039 pid=12041 auid=1000 uid=1000 gid=1000
# euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000
# tty=pts0 ses=4 comm="unshare" exe="/usr/bin/unshare"
# key="user_ns_create"
# type=PROCTITLE msg=audit(1715175131.441:9021):
# proctitle=756E7368617265002D2D75736572002D2D6D61702D726F6F742D75736572
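The a0=10000000 confirms the CLONE_NEWUSER bit was set. auid=1000 is the audit (login) UID, which stays fixed for the whole session, and uid=1000 is the real UID; euid is the effective UID. A root process (uid=0) creating user namespaces for container workloads will appear here too. Daemons started outside a login session normally carry an unset auid (reported as 4294967295 or -1), so filter on auid>=1000 or use ausearch -ua 1000 to restrict results to user-initiated calls.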
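The same triage can be scripted. The following Python sketch parses a raw SYSCALL record, checks the CLONE_NEWUSER bit in a0, and keeps only records whose auid belongs to a logged-in user. It assumes raw (uninterpreted) audit output, where a0 is hexadecimal and auid is numeric; the function names are illustrative. It cannot classify clone3 records, because clone3 passes its flags in a struct rather than in a0.

```python
import shlex
from typing import Dict

CLONE_NEWUSER = 0x10000000
UNSET_AUID = 4294967295  # auid of processes with no login session


def parse_audit_fields(record: str) -> Dict[str, str]:
    """Split a raw audit record into its key=value fields."""
    fields: Dict[str, str] = {}
    for token in shlex.split(record):  # shlex strips the quotes around values
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_user_initiated_userns(record: str, min_auid: int = 1000) -> bool:
    """True for unshare/clone records that set CLONE_NEWUSER from a login session."""
    fields = parse_audit_fields(record)
    if fields.get("type") != "SYSCALL":
        return False
    try:
        a0 = int(fields["a0"], 16)  # raw audit records print arguments in hex
        auid = int(fields["auid"])
    except (KeyError, ValueError):
        return False
    if not a0 & CLONE_NEWUSER:
        return False
    return auid >= min_auid and auid != UNSET_AUID


if __name__ == "__main__":
    sample = (
        'type=SYSCALL msg=audit(1715175131.441:9021): arch=c000003e syscall=272 '
        'success=yes exit=0 a0=10000000 a1=0 a2=0 a3=0 items=0 ppid=12039 '
        'pid=12041 auid=1000 uid=1000 gid=1000 comm="unshare" '
        'exe="/usr/bin/unshare" key="user_ns_create"'
    )
    print(is_user_initiated_userns(sample))  # True
```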
For shipping these events to a SIEM, configure audisp-remote or enable the builtin syslog plugin:
# /etc/audit/plugins.d/syslog.conf — ship audit events to syslog for SIEM ingestion
active = yes
direction = out
path = builtin_syslog
type = builtin
args = LOG_INFO
format = string
6. Kubernetes-Specific: Container Runtime Configuration
Containerd and CRI-O can be configured to prevent container processes from creating user namespaces. This is a defence-in-depth control for Kubernetes nodes, applied at the runtime layer below the kubelet.
For containerd, edit /etc/containerd/config.toml:
version = 2
[plugins."io.containerd.grpc.v1.cri"]
# Disable host user namespace sharing (prevents containers from seeing host UIDs)
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
# Disable user namespace remapping at the runtime level.
# This does not prevent exploit; combine with the sysctl below.
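# Apply a default seccomp profile to all containers.
# Confirm that your containerd version recognizes this section before relying
# on it: check its configuration reference and the output of `containerd config dump`.
[plugins."io.containerd.grpc.v1.cri".seccomp]
default_profile = "/etc/containerd/seccomp/block-userns.json"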
For CRI-O, edit /etc/crio/crio.conf:
[crio.runtime]
# Apply a default seccomp profile that blocks CLONE_NEWUSER
seccomp_profile = "/etc/crio/seccomp/block-userns.json"
# Disable additional capabilities beyond the container's explicit cap list
default_capabilities = []
# Prevent privileged containers from being created without explicit annotation
allow_fuse_devices = false
On Kubernetes nodes, also apply the seccomp default via the kubelet’s configuration to ensure the block-userns profile is applied even to pods that do not explicitly set a seccompProfile:
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
seccompDefault: true
# With seccompDefault=true, pods without an explicit seccompProfile get
# RuntimeDefault applied, which blocks dangerous syscalls including
# CLONE_NEWUSER in most production-oriented default profiles.
Verify a running container cannot create user namespaces:
# Test from inside a running container:
kubectl exec -it <pod-name> -- /bin/sh -c "unshare --user --map-root-user id"
# Expected with seccomp profile active:
# unshare: unshare failed: Operation not permitted
# Check the seccomp mode of a running container process:
kubectl exec -it <pod-name> -- cat /proc/self/status | grep Seccomp
# Seccomp: 2
# Mode 2 = SECCOMP_MODE_FILTER (a BPF profile is active)
# Mode 0 = unconfined (no profile — investigate why)
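When checking many processes at once, the Seccomp field can be read programmatically. This is a small Python sketch that maps the value in a /proc/<pid>/status dump to a mode name; seccomp_mode is an illustrative helper, and the mode numbers are the kernel's (0 disabled, 1 strict, 2 filter).

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode name from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly so "Seccomp_filters:" is not picked up.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"


if __name__ == "__main__":
    with open("/proc/self/status") as handle:
        print("seccomp mode:", seccomp_mode(handle.read()))
```

Any container process that reports "disabled" is running without a profile and is worth tracing back to the pod spec or runtime configuration that produced it.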
Expected Behaviour
sysctl kernel.unprivileged_userns_clone=0 — effect on rootless Podman. The failure is immediate and unambiguous:
$ podman run --rm alpine id
Error: cannot re-exec process: fork/exec /proc/self/exe: operation not permitted
# Rootless Podman calls unshare(CLONE_NEWUSER) as its first step.
# With the sysctl=0, the kernel returns EPERM.
# Confirm the error comes from the sysctl, not a capability issue:
$ strace -e unshare podman run --rm alpine true 2>&1 | grep unshare
unshare(CLONE_NEWUSER|CLONE_NEWNS) = -1 EPERM (Operation not permitted)
Audit log for a namespace creation event. A non-root user calling unshare --user with the audit rules active:
type=SYSCALL msg=audit(1715175131.441:9021): arch=c000003e syscall=272
success=yes exit=0 a0=10000000 a1=0 a2=0 a3=0
items=0 ppid=12039 pid=12041 auid=1000 uid=1000 gid=1000
euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000
tty=pts0 ses=4 comm="unshare" exe="/usr/bin/unshare"
subj=unconfined key="user_ns_create"
type=PROCTITLE msg=audit(1715175131.441:9021):
proctitle=756E7368617265002D2D75736572002D2D6D61702D726F6F742D75736572
Decoded: proctitle = unshare --user --map-root-user. The success=yes here is before we applied the sysctl restriction. With kernel.unprivileged_userns_clone=0, the same call produces success=no exit=-1 (EPERM), and the syscall is still logged.
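The proctitle decoding can be reproduced with a few lines of Python: the field is the process's argument vector, hex-encoded with NUL bytes between arguments. The helper name decode_proctitle is illustrative.

```python
def decode_proctitle(hex_value: str) -> str:
    """Decode a hex-encoded audit proctitle into a space-joined command line."""
    raw = bytes.fromhex(hex_value)
    # Arguments are separated by NUL bytes in the original argv buffer.
    args = [part.decode("utf-8", errors="replace") for part in raw.split(b"\x00") if part]
    return " ".join(args)


if __name__ == "__main__":
    encoded = "756E7368617265002D2D75736572002D2D6D61702D726F6F742D75736572"
    print(decode_proctitle(encoded))  # unshare --user --map-root-user
```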
AppArmor userns_restrict denial. On Ubuntu 23.10+ with kernel.apparmor_restrict_unprivileged_userns=1:
$ unshare --user --map-root-user /bin/bash -c "id"
unshare: unshare failed: Permission denied
# AppArmor log entry in /var/log/audit/audit.log:
type=AVC msg=audit(1715175200.832:9044): apparmor="DENIED"
operation="userns_create" class="namespace"
profile="unconfined" pid=12087 comm="unshare"
requested_mask="userns_create"
The operation="userns_create" and class="namespace" are AppArmor 4.0-specific fields for user namespace restriction. Older AppArmor versions will show operation="file_perm" on the /proc/self/uid_map write instead.
Seccomp block on CLONE_NEWUSER. With the seccomp profile applied to a container:
# Inside the container:
$ unshare --user --map-root-user id
unshare: unshare failed: Operation not permitted
# Seccomp audit log on the host:
type=SECCOMP msg=audit(1715175300.221:9067):
auid=1000 uid=0 gid=0 ses=4 subj=docker-default
pid=12104 comm="unshare" exe="/usr/bin/unshare"
sig=0 arch=c000003e syscall=272 compat=0
ip=0x7f4d3a1c2e41 code=0x50000
# syscall=272 is unshare(2) on x86-64; code=0x50000 is SECCOMP_RET_ERRNO
Trade-offs
Disabling unprivileged user namespaces globally. The security benefit is complete elimination of the user namespace exploit chain for all processes. The operational cost is high on machines running rootless containers, desktop applications, or browser sandboxes. On headless server infrastructure running privileged container runtimes, the cost is near zero. This is the correct control for production Kubernetes nodes where container workloads run under the root-owned containerd daemon, not under rootless Podman. It is the wrong default for developer workstations.
Seccomp profile on CLONE_NEWUSER. More targeted than the sysctl. Can be applied per-container, per-pod, or per-runtime-class. Does not break Chrome or Flatpak on the host. The failure mode is coverage gaps: a pod that does not have the profile applied (newly deployed workload, admission webhook misconfiguration, emergency kubectl exec session) has no protection. Seccomp-based controls require thorough admission enforcement to be reliable. An OPA/Gatekeeper policy that rejects pods without a seccompProfile is required to make this meaningful at scale.
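The admission check described above reduces to one rule: every container must end up with an acceptable seccomp profile, either from its own securityContext or inherited from the pod. The Python sketch below expresses that logic over a pod manifest loaded as a dictionary. It is not Gatekeeper or Rego code, the function names are hypothetical, and the accepted profile types are an assumption that a real policy would tighten (for example, to a specific localhostProfile).

```python
from typing import Any, Dict, List, Optional

# Assumption: both of these profile types are considered acceptable.
ACCEPTED_TYPES = {"RuntimeDefault", "Localhost"}


def _profile_type(security_context: Optional[Dict[str, Any]]) -> Optional[str]:
    """Return seccompProfile.type from a securityContext, if present."""
    profile = (security_context or {}).get("seccompProfile") or {}
    return profile.get("type")


def seccomp_violations(pod: Dict[str, Any]) -> List[str]:
    """List containers whose effective seccomp profile is missing or unconfined."""
    spec = pod.get("spec", {})
    pod_type = _profile_type(spec.get("securityContext"))
    violations: List[str] = []
    for kind in ("initContainers", "containers", "ephemeralContainers"):
        for container in spec.get(kind) or []:
            # A container-level profile overrides the pod-level one.
            effective = _profile_type(container.get("securityContext")) or pod_type
            if effective not in ACCEPTED_TYPES:
                name = container.get("name", "?")
                violations.append(f"{kind}/{name}: seccompProfile is {effective or 'unset'}")
    return violations


if __name__ == "__main__":
    pod = {"spec": {"containers": [{"name": "app"}]}}
    print(seccomp_violations(pod))  # ['containers/app: seccompProfile is unset']
```

An admission webhook built on this rule would reject any pod for which the returned list is non-empty.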
AppArmor userns_restrict. The best available option on Ubuntu 23.10+ for mixed environments that need to preserve rootless container and browser functionality for some processes while restricting others. The limitation is that it requires AppArmor 4.0 and a supporting kernel. On Ubuntu 22.04, the feature is not available without a backport. Teams running 22.04 LTS (supported through 2027) cannot rely on this control without upgrading AppArmor manually, which is not a supported configuration.
Audit rules. Negligible performance impact for normal workloads because unshare(CLONE_NEWUSER) is not a hot path. High-frequency container creation environments (CI runners that create thousands of containers per hour) may see measurable audit log volume from container runtime calls to unshare. Filter on auid to distinguish container runtime calls (daemons running as root, whose auid is normally unset and reported as 4294967295) from user-initiated calls (auid >= 1000). The value of the audit trail is primarily forensic: it detects that an exploit attempt occurred; it does not prevent one.
Patching. Ubuntu USN-6122-1 was published the same day as CVE-2023-32233 disclosure (May 8, 2023). The 48-hour PoC window means the patch and the working exploit were available simultaneously. Any deployment that patches on a monthly cadence was vulnerable for a month with a public exploit in circulation. This CVE is a concrete argument for emergency kernel patching procedures activated by CVSS score threshold, not calendar cycle.
Failure Modes
Assuming namespace isolation protects against this. The user namespace is not the target; it is the staging ground. The exploit uses the namespace to acquire CAP_NET_ADMIN, then uses that capability to trigger a kernel memory safety bug that breaks out of all namespace boundaries. Believing that a container “runs in its own namespace” and is therefore protected against this CVE is exactly backwards: the namespace is what makes the exploit reachable to an unprivileged user.
Trusting kernel.apparmor_restrict_unprivileged_userns=1 on Ubuntu 22.04 without verifying AppArmor version. The sysctl key exists on some 22.04 kernels as a stub but the enforcement is not backed by AppArmor profile mediation unless AppArmor 4.0 is installed. Checking sysctl kernel.apparmor_restrict_unprivileged_userns and seeing 1 does not confirm protection if the AppArmor userspace tools are pre-4.0. Verify: apparmor_parser --version (need 4.0.0+) and aa-status should show namespace restrictions as active policies.
Not patching promptly because “we have compensating controls.” The seccomp and sysctl controls above are defence-in-depth measures, not substitutes for patching. A future kernel vulnerability in a different subsystem that is also reachable from a user namespace — and there will be one — is not covered by a CVE-2023-32233-specific seccomp rule. Patching the kernel addresses the specific vulnerability and contributes to the general improvement of kernel memory safety. A 30-day patch cadence is too slow for kernel CVEs with published exploits on day zero.
Confusing user namespace restriction with disabling all namespaces. Setting kernel.unprivileged_userns_clone=0 does not affect PID namespaces, mount namespaces, network namespaces, or UTS namespaces when created by root. Containerd and Docker running as root continue to create and manage namespaces for container workloads without any degradation. The CLONE_NEWUSER flag is specifically gated; other clone() flags are unaffected by this sysctl. Engineers sometimes resist this control under the incorrect impression it will break container networking or process isolation — it will not when the container runtime runs as root.
Insufficient audit coverage leading to false confidence. The audit rules in section 5 log unshare and clone calls with CLONE_NEWUSER. They do not log newuidmap execution, /proc/self/uid_map writes, or Netlink socket operations to the NFNL subsystem. An attacker who successfully exploits the vulnerability and escalates to root will not appear in the user namespace audit log after escalation — the subsequent root actions are not gated by the user_ns_create key. The audit rules detect the attempt; a separate rule set is needed to detect post-escalation root activity (unexpected SUID execution, /etc/passwd modification, new cron entries) that indicates the exploit succeeded.
Overlooking container runtimes that spawn helper processes with user namespace access. Some container build tools (Buildah, kaniko in certain configurations, docker buildx with rootless mode) create user namespaces during image builds. A hardened Kubernetes node that has the sysctl set to 0 will break these tools silently. Audit the container build pipeline before deploying the restriction to nodes that host CI/CD workloads, not just production application nodes.