User Namespaces for Pods: UID Remapping, Container Escape Defense, and the GA Path in Kubernetes 1.30+
Problem
Container security has long had an awkward asymmetry. A process that runs as root (uid=0) inside a Pod's container is, from the kernel's perspective, also uid=0 on the host. Linux capabilities, seccomp, and SELinux/AppArmor are what keep that host-root from doing host-level damage, but a kernel exploit, a CVE in the runtime, or a slip in the security-context config means the host sees a real uid=0 process attempting privileged operations.
User namespaces remap UIDs and GIDs between the container and the host. With user namespaces enabled, a container's uid=0 is the host's uid=1000000 (or whatever range the runtime assigned). An exploit that defeats runtime-level controls and reaches the host still lands as an unprivileged user: filesystem access is governed by host-side permissions, no CAP_* capabilities apply on the host, and privileged syscalls fail.
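The kernel expresses this mapping in `/proc/<pid>/uid_map` as three columns: the start ID inside the namespace, the start ID outside it, and the range length. A quick way to see the format (the container-side values shown in comments are illustrative):

```shell
# Read the UID map of the current shell. In the initial namespace this is the
# identity mapping covering the whole ID space; inside a user-namespaced
# container the first column starts at 0 and the second at the per-pod host
# offset, e.g.:
#          0    1000000      65536
cat /proc/self/uid_map
# Format: <id-inside-ns> <id-outside-ns> <length>
```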
User namespaces are old (Linux 3.8, 2013), but production Kubernetes adoption has been gated on multiple constraints:
- Filesystem images. UIDs in the image have to be shifted by the runtime; older runtimes did this by chown'ing every file (slow and expensive). The kernel's idmapped mounts (5.12+) made the shift essentially free.
- Volume support. Persistent volumes have to participate in the UID mapping. CSI drivers needed updates.
- Runtime support. containerd 1.7+ and CRI-O 1.25+ added the integration.
- Kubernetes feature stability. Alpha in 1.25, beta in 1.28, GA in 1.30 (April 2024).
By 2026, every supported Kubernetes minor version has user namespaces available; the major cloud-managed offerings (GKE 1.30+, EKS 1.30+, AKS 1.30+) support the feature. The hardening win is substantial: most documented container escapes (CVE-2019-5736, CVE-2022-0847 “Dirty Pipe,” CVE-2024-21626) require host-uid privilege after the escape to do real damage. With user namespaces, that privilege is absent.
The specific gaps in a 1.30+ cluster without user namespaces:
- Pods running as root contain an actual host-uid-0 process.
- Container-escape exploits land as host-uid-0; capability drops apply but kernel-exploit primitives that bypass capabilities still grant root.
- Volume mounts use container-side UIDs verbatim; a malicious image can write files owned by uid 0 onto host-side persistent volumes.
- Cross-pod attacks via shared volumes or hostPath are bounded only by host-side permissions, which assume the kernel is correct.
This article covers enabling user namespaces in Pod specs, the runtime and CSI prerequisites, what protections it adds vs. Pod Security Admission’s restricted profile, and the operational caveats.
Target systems: Kubernetes 1.30+ with UserNamespacesSupport feature gate enabled (default-on); containerd 1.7+ or CRI-O 1.25+; Linux kernel 6.3+ for full idmap-mount coverage; CSI drivers that support idmap mounts.
Threat Model
- Adversary 1 — Container escape via runtime CVE: an attacker exploits a runc / containerd / CRI-O CVE that lets a container break out and run code on the host with the container's effective host UID.
- Adversary 2 — Container escape via kernel CVE: a kernel vulnerability in a syscall the container can invoke (file ops, networking, namespace creation) that bypasses Pod Security Admission and lands as host UID 0.
- Adversary 3 — Volume-mediated cross-pod attack: attacker writes carefully-permissioned files to a shared volume; another Pod consuming the volume reads or executes them with elevated trust.
- Adversary 4 — hostPath mount abuse: a privileged Pod with hostPath accidentally writes to host paths with host-uid-0 ownership.
- Access level: Adversary 1 has code execution inside a container running as root. Adversary 2 has any code execution that can reach the vulnerable syscall. Adversary 3 has writeable access to a shared volume. Adversary 4 has a Pod spec with hostPath.
- Objective: Privileged host-level activity — read host secrets, modify host files, install persistence, pivot to other Pods or the kubelet.
- Blast radius: Without user namespaces, container-escape often equals host-root. With user namespaces, container-escape lands as a per-Pod unprivileged UID; subsequent host-side exploitation requires additional privilege escalation, and ordinary host file permissions limit what the unprivileged ID can read.
Configuration
Step 1: Verify the Cluster Supports It
# Feature gate.
kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}'
# v1.30.x or higher
# Confirm the feature gate is on (default in 1.30+).
kubectl get --raw /metrics | grep 'kubernetes_feature_enabled.*UserNamespacesSupport'
# Container runtime.
ssh worker-1 'crictl version'
# RuntimeName: containerd, Version 1.7.x+
# Kernel.
ssh worker-1 'uname -r'
# 6.3+ for full idmap-mount support; 5.19+ works with older idmap modes.
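Two further node-side checks can be scripted directly against procfs (a sketch; `max_user_namespaces` may be set to 0 on hardened hosts, and unprivileged `unshare` may be blocked inside some build/CI containers):

```shell
# User namespaces enabled at the kernel level? A value of 0 means disabled.
cat /proc/sys/user/max_user_namespaces

# Can an unprivileged process actually create a user namespace?
unshare --user --map-root-user true \
  && echo "userns: OK" \
  || echo "userns: blocked (container or hardened host?)"
```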
Step 2: Enable User Namespaces in a Pod
Single field in the Pod spec:
apiVersion: v1
kind: Pod
metadata:
name: app
namespace: payments
spec:
hostUsers: false # The flag. Setting false enables user namespaces.
containers:
- name: app
image: ghcr.io/myorg/app:1.0
securityContext:
runAsNonRoot: false # Pod can run as uid 0 inside; container-uid 0 is mapped
runAsUser: 0 # to a host-side per-Pod range like 1000000.
ports:
- containerPort: 8080
After applying, observe the host-side UID:
# From the worker node:
ssh worker-1
sudo crictl ps --name app -q | xargs -I{} sudo crictl inspect {} | \
jq '.info.runtimeSpec.linux.uidMappings'
# [
# {"containerID": 0, "hostID": 65536, "size": 65536}
# ]
# Or check the actual process:
ps -eo pid,uid,cmd | grep myapp
# 12345 65536 /usr/local/bin/myapp
Inside the container, processes still see uid=0; on the host they’re uid=65536 or similar. kubectl exec works normally.
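You can also confirm the separate user namespace directly by comparing namespace inodes in procfs (a sketch; run on the worker node and substitute the container process's PID):

```shell
# Every process's user namespace is visible as a symlink in /proc.
readlink /proc/$$/ns/user   # e.g. user:[4026531837]
# Compare against PID 1 (the initial namespace): readlink /proc/1/ns/user
# A different user:[inode] value means the process runs in its own user namespace.
```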
Step 3: Enforce Cluster-Wide via Admission Policy
For real protection, require user namespaces by default. With ValidatingAdmissionPolicy:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
name: require-user-namespaces
spec:
failurePolicy: Fail
matchConstraints:
resourceRules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["pods"]
validations:
- expression: >
has(object.spec.hostUsers) && object.spec.hostUsers == false
message: "Pods must set spec.hostUsers: false (user namespaces required)."
reason: Forbidden
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
name: require-user-namespaces-binding
spec:
policyName: require-user-namespaces
validationActions: [Deny, Audit]
matchResources:
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "kube-public", "kube-node-lease"]
- key: pod-security.kubernetes.io/enforce
operator: NotIn
values: ["privileged"]
The system namespaces and explicitly-privileged namespaces are exempt; everything else must opt in to user namespaces.
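To smoke-test the policy, submit a deliberately non-compliant Pod with a server-side dry run (`kubectl apply --dry-run=server -f probe.yaml`) and confirm it is denied with the policy's message. A minimal probe manifest (names and image are illustrative):

```yaml
# This Pod omits spec.hostUsers, so the ValidatingAdmissionPolicy above
# should reject it in any non-exempt namespace.
apiVersion: v1
kind: Pod
metadata:
  name: userns-policy-probe
  namespace: payments
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
```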
Step 4: Volume Compatibility
User namespaces shift the apparent UID of files in mounts. CSI drivers that support idmap mounts handle this transparently; older drivers may show files owned by uid_in_namespace + offset.
Compatible: hostPath (in some configurations), emptyDir, persistentVolumeClaim with idmap-aware CSI drivers (CSI spec v1.10+ added the negotiation), configMap, secret.
For older CSI drivers, you may need to set fsGroup and fsGroupChangePolicy to remap ownership on attach. Or, for highest compatibility, use an init container that fixes ownership on first mount.
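A minimal sketch of the fsGroup fallback for an older driver (the group ID 3000 and the claim name are illustrative):

```yaml
spec:
  hostUsers: false
  securityContext:
    fsGroup: 3000                       # group ownership applied to volume files
    fsGroupChangePolicy: OnRootMismatch # skip the recursive chown if the root already matches
  containers:
  - name: app
    image: ghcr.io/myorg/app:1.0
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data
```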
Step 5: Per-Pod UID Range Configuration
In Kubernetes 1.30+, the kubelet (not the container runtime) allocates the per-Pod ID range: each user-namespaced Pod gets a dedicated, non-overlapping block of 65536 UIDs/GIDs from the node's ID space, and the mapping is passed to containerd or CRI-O over the CRI. Confirm the range a Pod actually received with the same crictl inspect call from Step 2:
# Each user-namespaced Pod should show a distinct hostID, e.g. 65536, 131072, ...
sudo crictl inspect <container-id> | jq '.info.runtimeSpec.linux.uidMappings'
For very high pod-density nodes, plan for sufficient host UID space. A node with 110 pods × 65536 UIDs/pod needs ~7 million host UIDs available — ample within a 32-bit UID space.
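The arithmetic is easy to sanity-check per node:

```shell
# UID budget: pods per node x IDs per pod, vs. the 32-bit ID space.
pods=110
ids_per_pod=65536
echo $(( pods * ids_per_pod ))                 # 7208960, roughly 7M IDs
echo $(( (1 << 32) / (pods * ids_per_pod) ))   # 595, comfortable headroom
```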
Step 6: Telemetry
Track adoption rate. Most clusters cannot flip hostUsers: false everywhere overnight; measure progress:
# Prometheus metric expression for percent of Pods using userns.
sum(kube_pod_info * on(pod, namespace) group_left(host_users)
label_replace(
kube_pod_spec{...},
...
)) by (namespace)
Actual implementation depends on your kube-state-metrics version; the kube-state-metrics project added hostUsers exposure in 2.13. Once available:
sum by (namespace) (kube_pod_spec_host_users == bool 0)
/
count by (namespace) (kube_pod_spec_host_users)
Alert on namespaces where adoption regresses unexpectedly.
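Beyond metrics, a point-in-time audit is a one-liner. Here is a sketch run against a local sample document (in practice, pipe `kubectl get pods -A -o json` into the same jq filter; jq availability on the admin host is an assumption):

```shell
# Pods whose spec does not set hostUsers: false are not user-namespaced.
cat > /tmp/pods-sample.json <<'EOF'
{"items":[
  {"metadata":{"namespace":"payments","name":"app"},"spec":{"hostUsers":false}},
  {"metadata":{"namespace":"legacy","name":"old-app"},"spec":{}}
]}
EOF
jq -r '.items[]
       | select(.spec.hostUsers != false)
       | "\(.metadata.namespace)/\(.metadata.name)"' /tmp/pods-sample.json
# legacy/old-app
```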
Expected Behaviour
| Signal | Without `hostUsers: false` | With `hostUsers: false` |
|---|---|---|
| Process ownership on host | Container uid=0 = host uid=0 | Container uid=0 = host uid=65536 (or per-pod range) |
| Filesystem ownership of container-created files | Owned by host uid=0 | Owned by mapped host UID |
| Privileged operations from container-root after escape | Succeed (host root) | Fail (host non-root permissions apply) |
| runc / containerd CVE exploitation | Lands as host-root | Lands as unprivileged host UID |
| Cross-namespace volume reuse | Possible if both pods are uid 0 | Each pod sees its own UID range; shared filesystem ownership becomes explicit |
| `kubectl exec` UX | Normal | Normal (in-container uid=0 still works) |
| Performance | Baseline | Negligible (idmap mount avoids per-file chown) |
Verify the protection holds:
# Inside the container.
id
# uid=0(root) gid=0(root)
# From the host worker node.
ps -ef | grep <container-process-name>
# uid=65536, not uid=0
# Negative test: simulate a CVE-2019-5736 style escape trying to write /etc/shadow on host.
# The container-side write attempt as uid 0 maps to host attempt as uid 65536.
# Host /etc/shadow is owned by uid 0, mode 0640 — write fails with EACCES.
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Container-root → unprivileged host UID | Major mitigation for runtime / kernel CVEs | Some workloads break under UID remapping | Run an adoption pilot per app team; identify and fix incompatible workloads. |
| idmap mounts avoid per-file chown | No startup penalty | Requires kernel 5.19+ for stable behavior | Set worker-node minimum kernel as part of cluster baseline. |
| CSI driver requirements | Strong volume integration | Older drivers may need updates | Newer cloud-managed clusters (GKE/EKS/AKS 1.30+) ship supported drivers; for self-managed CSI, check the driver’s release notes. |
| Per-Pod UID space | Strong isolation between Pods | UID-space exhaustion possible on extreme densities | Default containerd allocations easily fit 200+ Pods per node; tune only at unusual densities. |
| Admission-time enforcement | Steady-state security stays even as Pods are created | Onboarding new namespaces requires explicit opt-in | Document the requirement in your platform team’s onboarding flow. |
| Network namespace + user namespace combination | Network operations also affected | Some networking utilities behave unexpectedly | Test network plugins in user-namespaced Pods explicitly. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| CSI driver does not support idmap mounts | Pod fails to start with a volume permission error | `kubectl describe pod` shows mount-permission errors | Upgrade the CSI driver; fall back to fsGroup ownership change on the volume; or accept that the workload runs without user namespaces while you upgrade. |
| Application checks for "real" UID via /proc/self/status | Application errors when the UID doesn't match its expectation | Application logs show UID-mismatch errors | Application bug: the in-container UID is 0 as expected, and the host-side UID outside the namespace is invisible to the container. Patch the application's check. |
| Host-side file permissions break expectations | Files written by the Pod end up owned by host uid=65536, not uid=0 | Operations team reports unexpected file ownership on shared persistent storage | Either accept the new ownership model and update operations runbooks, or use fsGroup to assign a known group ID for shared access. |
| setcap inside the container appears ineffective | A capability set inside the container does not affect the host | Operator confusion: "I gave the container CAP_NET_ADMIN, why isn't it working?" | This is the desired behavior: capabilities apply within the user namespace. The container can perform privileged actions on its own namespaced resources but cannot affect the host. |
| Kernel UID-namespace exploit | Attacker bypasses user-namespace isolation | Host kernel logs show unusual capability claims | Keep kernels patched; user namespaces themselves have had CVEs (CVE-2023-32233, etc.). Subscribe to your distro's security advisory feed. |
| Privileged containers bypass the policy | Pods that demand privilege escape the protection | Audit logs show Pods with `securityContext.privileged: true` | Pair the user-namespace requirement with PSA restricted enforcement. Truly privileged workloads belong in a dedicated, gated namespace. |