Linux User Namespace Security: Attack Surface Reduction and Safe Delegation
Problem
User namespaces allow an unprivileged process to appear as root within an isolated namespace, mapping its UID/GID to real UIDs outside the namespace. This enables rootless containers (Podman, rootless Docker, rootless Buildah) without requiring a setuid helper binary.
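The mapping described above is visible in /proc/<pid>/uid_map, where each line reads "first ID inside the namespace, first ID outside, range length". The following is a minimal sketch that prints a map in readable form; the function name describe_uid_map is illustrative and not part of any existing tool.

```shell
#!/usr/bin/env bash
# describe_uid_map: print each UID mapping in readable form.
# Hypothetical helper. Argument: path to a uid_map file, defaulting to the
# current process's own map.
describe_uid_map() {
    local map_file="${1:-/proc/self/uid_map}"
    local inside outside length
    # Each line is: <first ID inside ns> <first ID outside ns> <range length>
    while read -r inside outside length; do
        [ -n "$inside" ] || continue
        printf 'inside %s-%s -> outside %s-%s\n' \
            "$inside" "$((inside + length - 1))" \
            "$outside" "$((outside + length - 1))"
    done < "$map_file"
}
```

Run inside a namespace created with unshare -Ur, the map would typically contain a single line that maps inside UID 0 to the caller's own UID outside.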
The security trade-off is significant. User namespaces expose a large portion of the kernel’s capability-gated code to unprivileged processes: code that was previously reachable only by root. Since unprivileged user namespace creation became possible in Linux 3.8, it has been the enabling primitive for dozens of kernel privilege escalation CVEs:
- CVE-2022-0185: heap overflow in legacy_parse_param() reachable via user namespaces.
- CVE-2022-25636: out-of-bounds write in nft_fwd_dup_netdev_offload(); requires CAP_NET_ADMIN in a user namespace.
- CVE-2023-0386: FUSE ovlfs SUID privilege escalation via user namespaces.
- CVE-2021-3493: overlayfs privilege escalation via user namespaces (Ubuntu-specific, but demonstrates the pattern).
The pattern: an attacker with unprivileged local access creates a user namespace, gaining capabilities within it, then exploits a kernel vulnerability in the namespace-related code paths that require those capabilities. Without user namespaces, these code paths are unreachable from unprivileged processes.
Specific gaps in unmanaged systems:
- Unprivileged user namespace creation enabled globally (kernel.unprivileged_userns_clone=1).
- No monitoring of user namespace creation events; attacker reconnaissance goes undetected.
- Rootless containers used on nodes where rootless is unnecessary (e.g., server infrastructure that doesn’t run developer tooling).
- No seccomp profile applied to processes that create user namespaces; post-namespace syscalls are unrestricted.
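The first gap can be checked mechanically. The sketch below reads the two sysctls named in this document and reports whether unprivileged creation appears open. The function name userns_gap_report and its optional root-directory argument are assumptions made for illustration, and the check does not account for AppArmor-based restrictions.

```shell
#!/usr/bin/env bash
# userns_gap_report: report whether unprivileged user namespace creation
# appears to be open. Hypothetical helper.
# Argument: sysctl root directory (defaults to /proc/sys) so it can be tested.
userns_gap_report() {
    local root="${1:-/proc/sys}"
    local clone_file="$root/kernel/unprivileged_userns_clone"
    local max_file="$root/user/max_user_namespaces"
    local status="open"

    # Debian/Ubuntu-specific switch; absent on other distributions.
    if [ -r "$clone_file" ] && [ "$(cat "$clone_file")" = "0" ]; then
        status="restricted"
    fi
    # Per-user limit; 0 disables user namespaces entirely.
    if [ -r "$max_file" ] && [ "$(cat "$max_file")" = "0" ]; then
        status="restricted"
    fi
    echo "unprivileged user namespaces: $status"
}
```

Calling userns_gap_report with no argument inspects the live system; a result of "open" on a server that runs no rootless workloads is the gap described above.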
Target systems: Linux kernel 5.12+; Ubuntu 22.04+ (has kernel.unprivileged_userns_clone sysctl); RHEL 9+ (user.max_user_namespaces); systemd 252+ (delegate user namespace creation to specific services); Podman 4.x (rootless containers with user namespaces).
Threat Model
- Adversary 1 (Unprivileged kernel exploit via user namespace): An attacker with a local shell (e.g., via a web application RCE running as www-data) creates a user namespace, gaining CAP_NET_ADMIN within it, and exploits a kernel vulnerability in the netfilter or network stack that requires that capability.
- Adversary 2 (Container escape via overlayfs in user namespace): A compromised container uses unshare -Ur to create a new user namespace, then exploits an overlayfs or FUSE vulnerability reachable from within the namespace to escape the container.
- Adversary 3 (Rootless container abuse): A developer running rootless Podman/Docker on a workstation has user namespaces enabled. An attacker with local code execution uses the user namespace capability to reach kernel code paths they couldn’t otherwise access.
- Adversary 4 (UID mapping manipulation): A process creates a user namespace with a crafted UID mapping that produces unexpected UID 0 in the parent namespace during filesystem operations (historical pattern in overlayfs CVEs).
- Access level: All adversaries have unprivileged local code execution (no root required).
- Objective: Escalate from unprivileged user to root on the host; escape a container.
- Blast radius: With unprivileged user namespaces enabled globally, every kernel CVE in the user namespace code paths is exploitable by any local user. Restricting creation to root reduces the attack surface to processes that already have elevated privilege.
Configuration
Step 1: Audit Current User Namespace State
# Check if unprivileged user namespace creation is enabled.
sysctl kernel.unprivileged_userns_clone
# 1 = enabled (Ubuntu default); 0 = disabled
# Check the maximum number of user namespaces per user.
sysctl user.max_user_namespaces
# 0 = disabled; large number = enabled
# List currently existing user namespaces.
lsns -t user
# Shows all user namespaces and the processes using them.
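# Count unshare syscall events from the last 10 minutes (if auditd is enabled).
# "recent" means 10 minutes ago; -i prints syscall names instead of numbers.
# Note: this counts every unshare call, not only those with CLONE_NEWUSER.
ausearch -sc unshare --start recent -i | grep -c "syscall=unshare"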
# List processes running inside non-initial user namespaces (run as root to see every process).
init_ns=$(readlink /proc/1/ns/user 2>/dev/null)
ps -eo pid,user,args --no-headers | while read -r pid user args; do
    ns=$(readlink "/proc/$pid/ns/user" 2>/dev/null)
    if [[ -n "$ns" && "$ns" != "$init_ns" ]]; then
        echo "PID $pid ($user): $args [in user namespace: $ns]"
    fi
done
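For a quick summary rather than a per-process listing, the same /proc links can be grouped by namespace. This is a sketch only; count_procs_per_userns is a hypothetical name, and the optional proc-root argument exists purely so the function can be exercised against a test directory.

```shell
#!/usr/bin/env bash
# count_procs_per_userns: print "<count> <namespace>" for every user namespace
# that has at least one visible process. Hypothetical helper.
# Argument: proc root (defaults to /proc).
count_procs_per_userns() {
    local proc_root="${1:-/proc}"
    local dir ns
    for dir in "$proc_root"/[0-9]*; do
        # readlink fails for processes we may not inspect; skip them.
        ns=$(readlink "$dir/ns/user" 2>/dev/null) || continue
        echo "$ns"
    done | sort | uniq -c | awk '{print $1, $2}'
}
```

On a host with no rootless workloads, a single line (the initial namespace) would be the expected result; any additional line points at a namespace worth identifying.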
Step 2: Restrict Unprivileged User Namespace Creation
On systems that don’t require rootless containers (most production servers), disable unprivileged user namespace creation:
# Ubuntu/Debian: disable unprivileged user namespace creation.
sysctl -w kernel.unprivileged_userns_clone=0
# Make permanent.
echo "kernel.unprivileged_userns_clone=0" >> /etc/sysctl.d/99-user-namespace-security.conf
sysctl --system
# RHEL/CentOS: use user.max_user_namespaces.
# A value of 0 disables user namespace creation for every user, including root.
sysctl -w user.max_user_namespaces=0
echo "user.max_user_namespaces=0" >> /etc/sysctl.d/99-user-namespace-security.conf
# Verify.
sysctl kernel.unprivileged_userns_clone
# Expected: kernel.unprivileged_userns_clone = 0
# Test that unprivileged creation is blocked.
sudo -u nobody unshare -Ur /bin/bash -c "whoami"
# Expected: unshare: unshare failed: Operation not permitted
# (with user.max_user_namespaces=0 the error is "No space left on device" instead)
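Appending with >> adds a duplicate line every time the step is re-run. A minimal sketch of an idempotent alternative follows; set_sysctl_dropin is a hypothetical helper name, and it only edits the file (sysctl --system is still needed to apply the value).

```shell
#!/usr/bin/env bash
# set_sysctl_dropin: set key=value in a sysctl drop-in file without
# accumulating duplicate lines on repeated runs. Hypothetical helper.
# Usage: set_sysctl_dropin <file> <key> <value>
set_sysctl_dropin() {
    local file="$1" key="$2" value="$3"
    local tmp
    tmp=$(mktemp) || return 1
    # Keep every line except an existing assignment of this exact key.
    if [ -f "$file" ]; then
        awk -F= -v k="$key" '{
            name = $1
            gsub(/[ \t]/, "", name)
            if (name != k) print
        }' "$file" > "$tmp"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Copy the content back so the target keeps its ownership and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, set_sysctl_dropin /etc/sysctl.d/99-user-namespace-security.conf kernel.unprivileged_userns_clone 0 can be run from configuration management any number of times and leaves exactly one assignment in the file.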
Impact assessment before disabling: identify which workloads use user namespaces:
# Check if any services require user namespaces.
# Rootless Podman/Docker will break.
systemctl list-units --type=service --no-legend --plain | while read -r unit _; do
    if systemctl show "$unit" -p ExecStart 2>/dev/null | grep -q "podman\|docker\|buildah\|rootless"; then
        echo "User namespace dependency: $unit"
    fi
done
# Check running processes in user namespaces.
lsns -t user -o PID,COMMAND 2>/dev/null | tail -n +2 | awk '{print $2}' | sort -u
Step 3: Delegate User Namespace Creation to Specific Services
Rather than enabling or disabling user namespaces globally, you can use systemd to grant user namespace creation to specific services while keeping it disabled for everything else:
# /etc/systemd/system/rootless-builder.service
[Service]
# Allow this service to create user namespaces even if globally disabled.
# CAP_SYS_ADMIN is a very broad capability; pair it with the syscall filtering in Step 4.
# Available in systemd 252+ and kernels with LSM namespace support.
AmbientCapabilities=CAP_SYS_ADMIN
CapabilityBoundingSet=CAP_SYS_ADMIN
# Or: use UserNamespacePermission (systemd 256+).
# UserNamespacePermission=yes
User=builder
Group=builder
ExecStart=/usr/local/bin/rootless-build.sh
Rootless Podman also relies on the newuidmap and newgidmap helpers, which map a range of subordinate IDs into the namespace the runtime creates. They complement the delegation above rather than replace it: the calling process must still be permitted to create the user namespace.
# These privileged helpers write multi-ID mappings into an existing user namespace.
# They require /etc/subuid and /etc/subgid entries for the user.
ls -la /usr/bin/newuidmap /usr/bin/newgidmap
# Should be -rwsr-xr-x (setuid root); some distributions use file capabilities instead (check with getcap).
# Configure /etc/subuid and /etc/subgid for the builder user.
echo "builder:100000:65536" >> /etc/subuid
echo "builder:100000:65536" >> /etc/subgid
# Rootless Podman needs both pieces: permission to create the namespace,
# and newuidmap/newgidmap with /etc/subuid configured for the ID mapping.
sudo -u builder podman run --rm alpine whoami
# Prints root when namespace creation is permitted in this context; fails with a permission error otherwise.
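Hand-written /etc/subuid entries can collide with ranges already allocated to other accounts, since 100000 is a common starting point for automatically assigned ranges. The sketch below lists overlapping entries; subid_overlaps is a hypothetical helper name, and it assumes the usual "name:start:count" line format.

```shell
#!/usr/bin/env bash
# subid_overlaps: list pairs of entries in a subuid/subgid file whose ID
# ranges overlap. Hypothetical helper. Lines are "name:start:count".
subid_overlaps() {
    local file="${1:-/etc/subuid}"
    awk -F: '
        NF == 3 && $1 !~ /^#/ {
            n++
            name[n] = $1
            lo[n] = $2 + 0            # first subordinate ID
            hi[n] = $2 + $3 - 1       # last subordinate ID
        }
        END {
            # Two ranges overlap when each starts before the other ends.
            for (i = 1; i <= n; i++)
                for (j = i + 1; j <= n; j++)
                    if (lo[i] <= hi[j] && lo[j] <= hi[i])
                        print name[i] " overlaps " name[j]
        }
    ' "$file"
}
```

Running subid_overlaps /etc/subuid and subid_overlaps /etc/subgid after editing the files should print nothing; any output means two accounts could map to the same host IDs.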
Step 4: Seccomp Profile for Processes Using User Namespaces
If user namespaces must remain enabled, restrict what the namespaced process can do with seccomp:
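seccomp-user-ns.json restricts the syscalls available after user namespace creation. Because the default action is SCMP_ACT_ERRNO, the allow-list below is only a starting fragment: extend it with the syscalls the workload needs (execve, read, write, and so on) before applying it. newuidmap and newgidmap are helper binaries, not syscalls, so they do not belong in the list.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "unshare",
        "clone",
        "clone3",
        "setuid",
        "setgid",
        "setgroups"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": [
        "mount",
        "umount2",
        "pivot_root"
      ],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1
    },
    {
      "names": [
        "ptrace"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}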
For systemd services, a comparable deny-list can be expressed with SystemCallFilter=:
[Service]
SystemCallFilter=~@mount @reboot @module @privileged
# Even if the process creates a user namespace, these syscalls remain blocked.
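Whether a filter is actually attached to a running process can be confirmed from the Seccomp: field in /proc/<pid>/status. The following sketch decodes that field; seccomp_mode is a hypothetical helper name, and the second argument (an alternative proc root) is there only to make the function testable.

```shell
#!/usr/bin/env bash
# seccomp_mode: print the seccomp mode of a process as reported by the
# "Seccomp:" field of /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter).
# Hypothetical helper. Usage: seccomp_mode <pid> [proc_root]
seccomp_mode() {
    local pid="$1" proc_root="${2:-/proc}"
    local mode
    mode=$(awk '$1 == "Seccomp:" {print $2}' "$proc_root/$pid/status" 2>/dev/null)
    case "$mode" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}
```

For a systemd service, the PID can be obtained with systemctl show -p MainPID --value followed by the unit name; a result other than "filter" suggests the SystemCallFilter= setting is not in effect.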
Step 5: AppArmor Policy Restricting User Namespace Creation
On Ubuntu, AppArmor can restrict user namespace creation to specific binaries:
# /etc/apparmor.d/restrict-user-namespaces
# With kernel.apparmor_restrict_unprivileged_userns=1, unprivileged unconfined
# processes cannot obtain a privileged user namespace; only binaries whose
# profile carries a "userns," rule can. Requires AppArmor 4.0+.
abi <abi/4.0>,
include <tunables/global>
# Allow specific trusted binaries to create user namespaces.
profile podman /usr/bin/podman flags=(unconfined) {
  userns,
}
profile buildah /usr/bin/buildah flags=(unconfined) {
  userns,
}
# Load the AppArmor profiles.
apparmor_parser -r /etc/apparmor.d/restrict-user-namespaces
# Verify they are loaded.
aa-status | grep -E 'podman|buildah'
Ubuntu 23.10+ supports kernel.apparmor_restrict_unprivileged_userns:
# Restrict user namespace creation via AppArmor (Ubuntu 23.10+).
sysctl -w kernel.apparmor_restrict_unprivileged_userns=1
echo "kernel.apparmor_restrict_unprivileged_userns=1" >> /etc/sysctl.d/99-userns.conf
Step 6: Audit User Namespace Creation with auditd
# /etc/audit/rules.d/user-namespaces.rules
# Audit all unshare syscalls (used to create user namespaces).
-a always,exit -F arch=b64 -S unshare -F a0&0x10000000 -k user_ns_create
# 0x10000000 = CLONE_NEWUSER flag
# Audit clone syscalls with CLONE_NEWUSER.
# clone3 passes its flags in a struct, so an a0 filter cannot match it.
-a always,exit -F arch=b64 -S clone -F a0&0x10000000 -k user_ns_create_clone
# Alert on user namespace creation from unexpected users
# (not root, not the known builder user with UID 1000).
-a always,exit -F arch=b64 -S unshare -F a0&0x10000000 -F uid!=0 -F uid!=1000 -k unexpected_user_ns
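When reading raw audit records, the a0= field carries the flags as a hexadecimal number, and the same bit test the rules use can be reproduced in the shell. A small sketch, with has_clone_newuser as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# has_clone_newuser: succeed if a hexadecimal flags value (as printed in the
# a0= field of a raw audit SYSCALL record) has the CLONE_NEWUSER bit set.
# Hypothetical helper. Accepts the value with or without a 0x prefix.
CLONE_NEWUSER=0x10000000
has_clone_newuser() {
    local hex="${1#0x}"
    [ $(( 0x$hex & CLONE_NEWUSER )) -ne 0 ]
}
```

For example, has_clone_newuser 50000000 succeeds (CLONE_NEWUSER combined with CLONE_NEWNET), while has_clone_newuser 40000000 fails because only the network namespace flag is set.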
Reviewing recorded events:
# Review today's user namespace creation events.
ausearch -k user_ns_create -i --start today | grep -v "^----" | \
awk '/type=SYSCALL/ {print}' | grep -vw "auid=root"
# Shows events whose login UID is not root (-i prints auid as a name, not a number).
Step 7: Telemetry
linux_user_namespace_create_total{uid, process} counter
linux_user_namespace_current_count gauge
linux_user_namespace_creation_denied_total{uid, process} counter
auditd_user_ns_event_total{uid, syscall} counter
Alert on:
- linux_user_namespace_create_total from unexpected UIDs: a non-builder user creating namespaces deserves inspection.
- linux_user_namespace_current_count growing unbounded: a process may be creating namespaces without cleaning up (resource exhaustion or exploit attempt).
- Any user namespace creation on a system where kernel.unprivileged_userns_clone=0: this indicates a privilege escalation (only root can create them when disabled).
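One way to populate auditd_user_ns_event_total is to aggregate raw audit records into Prometheus text format. The sketch below assumes records arrive on stdin in auditd's raw key=value form; audit_to_metrics is a hypothetical helper name, and the syscall label is the raw syscall number rather than its name.

```shell
#!/usr/bin/env bash
# audit_to_metrics: read raw audit SYSCALL records on stdin and print one
# auditd_user_ns_event_total sample per (uid, syscall) pair in Prometheus
# text exposition format. Hypothetical helper.
audit_to_metrics() {
    awk '
        /type=SYSCALL/ {
            uid = ""; sc = ""
            for (i = 1; i <= NF; i++) {
                # Anchored match: auid=, euid= and similar fields are ignored.
                if ($i ~ /^uid=/)     uid = substr($i, 5)
                if ($i ~ /^syscall=/) sc  = substr($i, 9)
            }
            if (uid != "" && sc != "") count[uid "," sc]++
        }
        END {
            for (k in count) {
                split(k, p, ",")
                printf "auditd_user_ns_event_total{uid=\"%s\",syscall=\"%s\"} %d\n", p[1], p[2], count[k]
            }
        }
    ' | sort
}
```

A possible use is piping ausearch -k user_ns_create --raw into audit_to_metrics and writing the result to a file that a metrics agent scrapes. Because each run recounts whatever the log currently holds, the value is only approximately a counter and resets when logs rotate.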
Expected Behaviour
| Signal | Unrestricted user namespaces | Restricted configuration |
|---|---|---|
| Unprivileged kernel CVE via user namespace | Exploitable by any local user | Blocked; unshare returns EPERM |
| Rootless Podman (authorised service) | Works | Works (explicit service delegation or AppArmor allow-list; newuidmap/newgidmap handle the ID mapping) |
| unshare -Ur from a web shell | Creates namespace; exploit proceeds | Blocked by kernel.unprivileged_userns_clone=0 |
| User namespace creation audit trail | None | auditd records all creation events with UID and binary |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| unprivileged_userns_clone=0 | Eliminates the class of user-namespace-enabled kernel exploits for unprivileged users | Breaks rootless containers globally | Identify which services need rootless; grant them service-level delegation or an AppArmor allow-list entry. |
| AppArmor restriction | Per-binary policy; rootless works for approved binaries | AppArmor profile maintenance | AppArmor userns restriction is well-supported on Ubuntu; maintain profiles alongside binary updates. |
| Seccomp on namespaced processes | Post-namespace syscall restriction | Must profile each application | Start with a permissive seccomp, add restrictions based on strace output. |
| Audit user namespace creation | Visibility into all namespace events | Audit log volume | Rate-limit audit events; alert on unexpected sources only. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Disabling breaks a CI tool | CI pipeline fails: unshare: Operation not permitted | CI failure; linux_user_namespace_creation_denied_total rises | Identify the tool; enable namespace creation for its service user only (Step 3 or Step 5). |
| AppArmor profile denies legitimate binary | Process fails unexpectedly | AppArmor DENIED log entry | Add the binary to the AppArmor profile; reload. |
| User namespace count limit reached | New namespace creation fails; containers don’t start | ENOSPC on unshare; container start errors | Increase user.max_user_namespaces, or investigate the leak. |
| Audit log flood from high-frequency namespace creation | Log pipeline overwhelmed | Log volume metrics | Set a global rate limit with auditctl -r, or narrow the rules with additional -F filters (for example by UID). |