runc CVE-2019-5736: Overwriting the Container Runtime from Inside a Container
The Problem
On February 11, 2019, CVE-2019-5736 was publicly disclosed. Discovered by Adam Iwaniuk and Borys Popławski, it is a vulnerability in runc that allowed a malicious container to overwrite the host runc binary and achieve arbitrary code execution as root on the host. Every major container runtime (Docker before 18.09.2, containerd before 1.2.4 when using the affected runc, CRI-O before 1.11.8, and therefore every Kubernetes version running on those runtimes) was affected. All runc releases through 1.0-rc6 are vulnerable. The fix was published as a patch alongside the disclosure and first appeared in a tagged release with runc 1.0-rc7; distributions and runtime vendors backported it in the meantime.
The attack requires only one thing: the ability to run a process as root inside a container that runc then enters, either by exec’ing into a container you control or by shipping a container image that fires the exploit autonomously on startup. That is a remarkably low bar. In a multi-tenant Kubernetes cluster, any user with kubectl exec permission to a pod they own meets it. In a public cloud environment where an attacker has compromised a container image in the supply chain, they meet it without any user interaction at all.
Understanding this vulnerability requires understanding what runc actually does during a docker exec or kubectl exec invocation, and specifically how it exposes itself through /proc.
What runc does and why it touches /proc/self/exe
runc is the OCI-compliant low-level runtime that creates and manages the Linux containers underlying both Docker and Kubernetes. When you run docker exec -it <container> /bin/bash, the Docker daemon calls runc with the exec subcommand. runc must enter the existing container’s namespaces (mount, PID, network, UTS, IPC) and spawn the requested process inside them.
To enter the container’s namespaces, runc uses setns(2) on the file descriptors in /proc/<container-init-pid>/ns/. The parent runc process stays in the host mount namespace; the namespace transition happens in a child, runc init, which is runc re-executing its own binary. In that child, a C constructor (nsexec) joins the namespaces before the Go runtime starts. Here is where the problem begins.
When the Linux kernel executes a binary, it records the binary’s path in the process’s /proc/self/exe symlink. This symlink points to the actual file on the filesystem, regardless of namespace boundaries. When runc exec’s itself as runc init to perform the namespace join, the resulting child process’s /proc/self/exe resolves to the path of the runc binary on the host filesystem — because runc is installed on the host and the /proc/self/exe symlink traverses the host mount namespace, not the container’s.
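The link behaviour described above can be observed harmlessly from any Linux process. The following is a minimal sketch; the helper name exe_target is an illustrative choice and not part of runc or any tool mentioned here. It reads the /proc/<pid>/exe link for a process, which for a running Python script names the interpreter binary rather than the script.

```python
import os
from typing import Optional


def exe_target(pid: str = "self") -> Optional[str]:
    """Return the path that /proc/<pid>/exe points to, or None if unavailable.

    Hypothetical helper for illustration only. Returns None on systems
    without procfs, for PIDs that do not exist, and for processes the
    caller is not permitted to inspect.
    """
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None


if __name__ == "__main__":
    # For a Python script this prints the interpreter's path: the link
    # names the binary that was executed, not the script being run.
    print(exe_target())
```

Run on a host and again inside a container, the same call shows the point made above: the link always names the binary that was actually executed, wherever that binary lives.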
Critically, /proc/<pid>/exe can be opened (subject to the kernel’s usual /proc permission checks), and the resulting descriptor refers to the file itself rather than to a path. A process inside the container can therefore open /proc/<runc-pid>/exe during the brief window when runc is executing inside the container’s PID namespace to set up the new process. Because runc’s child process participates in the container’s PID namespace at this point, the attacker’s process can enumerate /proc/ and find runc’s PID, then open /proc/<runc-pid>/exe.
That file descriptor is a direct handle to the runc binary on the host filesystem. Opening it for writing fails while the binary is executing, because Linux refuses write access to a running executable with ETXTBSY. The exploit’s trick is to hold a read-only descriptor open and wait until the runc process has exited or exec’d another program. At that point the restriction is lifted, and the descriptor, still pointing at the file, can be reopened for writing and used to overwrite it.
The practical exploit sequence:
# Pseudocode for the attack flow
# Step 1: Set the container entrypoint to a script that monitors /proc
# This fires either on container start or is injected via docker exec
# Inside the container:
def exploit():
    # Step 2: Scan /proc for a process whose /proc/<pid>/exe points to
    # a path containing "runc": this is the runc binary executing on the host
    runc_pid = None
    while runc_pid is None:
        for pid in os.listdir("/proc"):
            try:
                exe_link = os.readlink(f"/proc/{pid}/exe")
                if "runc" in exe_link:
                    runc_pid = pid
                    break
            except (PermissionError, FileNotFoundError):
                continue
    # Step 3: Open /proc/<runc_pid>/exe for reading to get a file descriptor
    # pointing at the host runc binary. This is allowed while runc is
    # executing: the binary is open for execution, not locked against reads.
    fd = open(f"/proc/{runc_pid}/exe", "r")
    # Step 4: Wait for the execve to complete. After execve, the write lock
    # on the executing binary is released. The file descriptor remains valid
    # and still points to the runc file on the host filesystem.
    # Poll /proc/<runc_pid>/exe until the O_RDONLY open succeeds for writing
    # (using a /proc/self/fd/<n> path trick to reopen with O_RDWR)
    runc_on_host = f"/proc/self/fd/{fd.fileno()}"
    while True:
        try:
            write_fd = open(runc_on_host, "wb")
            break
        except PermissionError:
            time.sleep(0.0001)
    # Step 5: Overwrite the runc binary with our payload
    # The next invocation of runc (next kubectl exec, next container start)
    # executes our payload as root on the host
    write_fd.write(malicious_elf_binary)
    write_fd.close()
The real exploit is more intricate: it involves timing races and the use of /proc/self/fd/<n> to reopen a read-only descriptor as a writable one once the kernel stops refusing writes to the executable. The proof of concept published by Aleksa Sarai (SUSE, runc maintainer) shortly after the disclosure demonstrated all of this working against the actual runc 1.0-rc5 binary.
Why namespace isolation does not prevent this
The container provides mount namespace isolation, PID namespace isolation, and (if configured) user namespace isolation. None of these defences prevent this attack as implemented:
Mount namespace isolation means the container cannot directly open("/usr/bin/runc") on the host filesystem. The container’s view of the filesystem is its own. But /proc is special: it is mounted per-PID-namespace, not per-mount-namespace. Processes in the container’s PID namespace can see /proc/<pid> entries for any process that shares the PID namespace — including runc during the brief exec window.
PID namespace isolation normally hides host processes from container processes. However, the runc exec flow places the runc child process inside the container’s PID namespace (this is how it joins the container’s namespaces). During that window, the runc process is visible to the container’s /proc, and its /proc/<pid>/exe symlink resolves to the host binary.
User namespace isolation is usually not in effect: without userns-remap or a per-pod user namespace, UID 0 inside the container is UID 0 on the host. The process inside the container therefore acts with host root’s identity for file descriptor operations that reach host files. This is what makes the write possible once the kernel stops refusing writes to the executable.
Scope of affected systems
- Docker Engine < 18.09.2 (all versions that shipped runc 1.0-rc6 or earlier without the fix)
- containerd < 1.2.4 (shim versions using affected runc)
- CRI-O < 1.11.8
- OpenShift 3.x with affected runc versions
- All Kubernetes versions running on any of the above runtimes
- LXC < 3.1.0 (separate but related fix path)
The managed Kubernetes offerings (EKS, GKE, AKS) pushed patched node AMIs and base images within days. However, clusters not enrolled in automatic node pool updates ran vulnerable runc for weeks or months after the disclosure, because the runc version is baked into the node image and updating it requires cycling nodes — not a step operators take without coordination.
Threat Model
The attack surface is any code path that calls runc exec in the context of a container the attacker controls or has compromised.
Vector 1: kubectl exec by a legitimate user into their own pod. A developer with kubectl exec permission to pods in their namespace uses this for debugging. That is enough. The attacker does not need to compromise any other principal. They exec into their own pod, run the exploit, and gain root on the node running the pod.
Vector 2: Malicious container image that auto-triggers on container start. The exploit payload is placed in the container entrypoint or CMD. Any operation that causes runc to exec into the container — container startup itself, any subsequent docker exec or kubectl exec — triggers the exploit. A compromised container image in a public registry can weaponise this: the image runs its declared workload normally but fires the exploit during startup, transparently to the operator.
Vector 3: Compromised adjacent container on the same node. If an attacker compromises any container on a node through an application vulnerability (RCE in a web service, deserialization flaw, etc.), they can run the runc exploit from within that container and escape to the node.
Once the attacker overwrites the runc binary on the host:
- The backdoor is persistent. The replacement binary runs as root on the host. It survives container restarts. The replacement is itself the runc binary — it can behave normally (chain-exec the real runc) while also executing the payload, making detection harder.
- Node-level access enables reading all pod secrets. The node runs every pod assigned to its kubelet. Every container’s environment variables are readable via /proc/<pid>/environ from host root. Environment variables are a primary secret injection mechanism in Kubernetes. All of them are now readable.
- Kubelet credentials are on the node filesystem. The kubelet’s TLS client certificate, used to authenticate to the Kubernetes API server, is stored at a well-known path (typically /var/lib/kubelet/pki/kubelet-client-current.pem). With host root, the attacker reads these credentials and can impersonate the node to the API server.
- Cloud instance metadata service (IMDS) access. On cloud-hosted nodes, the IMDS is reachable from the node. The node’s instance role credentials (AWS IAM role, GCP service account token, Azure managed identity token) are accessible at http://169.254.169.254/latest/meta-data/iam/security-credentials/ (AWS) or equivalent. These credentials are typically scoped for node operations (ECR pull access, S3 access for log shipping, EKS node bootstrap) but may include broader permissions depending on the IAM role configuration.
- Lateral movement to other nodes. Kubelet credentials authenticate as system:node:<nodename> in the system:nodes group in Kubernetes RBAC. Node-to-node lateral movement via credential reuse is limited by RBAC, but stolen bootstrap tokens or shared node credentials in misconfigured clusters enable broader movement.
- Secrets from the API server. If the kubelet credential has get on secrets in any namespace (a misconfiguration, but common in older clusters), all secrets are accessible directly.
The blast radius from a single kubectl exec into an attacker-controlled pod is full node compromise, with realistic paths to cluster-wide secret exposure and cloud account credential theft.
Hardening Configuration
1. Patch Verification
Verify the runc version on every node in the cluster. Any version up to and including 1.0-rc6 is vulnerable unless the distribution backported the fix; 1.0-rc7 and later contain it.
# On the node directly (via SSH or privileged pod):
runc --version
# Output on patched version:
# runc version 1.1.12
# commit: v1.1.12-0-g51d5e946
# spec: 1.0.2-dev
# go: go1.21.9
# libseccomp: 2.5.4
# Check containerd version (containerd ships its own runc binary):
containerd --version
# containerd github.com/containerd/containerd 1.7.18 ...
# On a node, find the runc binary actually used by containerd:
containerd config dump | grep -i runc
# Look for BinaryName under the runc runtime options; if it is unset,
# runc is resolved from the PATH of the containerd shim
# Verify that binary's version:
/usr/bin/runc --version
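The manual check above can be scripted. The sketch below parses the first line of runc --version output and compares it with the first tagged release that contains the upstream fix (1.0-rc7, since runc through 1.0-rc6 is affected). The function names are illustrative, and the comparison is by version number only: it cannot tell whether a distribution backported the fix into an older version string, so treat a False result as "needs a closer look" rather than proof of exposure.

```python
import math
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int, float]

# First tagged release containing the upstream fix. Final releases carry
# rc = math.inf so they sort after every release candidate.
FIRST_FIXED: Version = (1, 0, 0, 7)

_VERSION_RE = re.compile(
    r"runc version (\d+)\.(\d+)(?:\.(\d+))?(?:[-~]rc\.?(\d+))?"
)


def parse_runc_version(output: str) -> Optional[Version]:
    """Extract (major, minor, patch, rc) from `runc --version` output."""
    match = _VERSION_RE.search(output)
    if match is None:
        return None
    major, minor, patch, rc = match.groups()
    return (
        int(major),
        int(minor),
        int(patch) if patch is not None else 0,
        float(rc) if rc is not None else math.inf,
    )


def has_upstream_fix(output: str) -> Optional[bool]:
    """True/False by version number; None if the output is unparseable."""
    version = parse_runc_version(output)
    if version is None:
        return None
    return version >= FIRST_FIXED
```

Feeding it the output captured from each node (for example via subprocess.run(["runc", "--version"], capture_output=True, text=True).stdout) gives a simple per-node flag to export to monitoring.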
For Kubernetes clusters where direct node SSH is restricted, run a privileged pod to check:
kubectl run runc-check \
  --image=ubuntu:22.04 \
  --restart=Never \
  --overrides='{"spec":{"hostPID":true,"containers":[{"name":"check","image":"ubuntu:22.04","command":["nsenter","--target","1","--mount","--","runc","--version"],"securityContext":{"privileged":true}}]}}'
# The command comes from the override; read the result once the pod has completed:
kubectl logs runc-check
kubectl delete pod runc-check
For bulk node verification across a large cluster, a DaemonSet that prints each node’s runc version to its pod log provides continuous visibility (collect the output with kubectl logs or a log shipper):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: runc-version-reporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: runc-version-reporter
  template:
    metadata:
      labels:
        app: runc-version-reporter
    spec:
      hostPID: true
      tolerations:
      - operator: Exists
      containers:
      - name: reporter
        image: ubuntu:22.04
        # Print the host's runc version once, then idle. DaemonSet pods are
        # always restarted, so a command that simply exits would loop.
        command:
        - sh
        - -c
        - nsenter --target 1 --mount -- runc --version; sleep infinity
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: 10m
            memory: 16Mi
2. RBAC: Restrict kubectl exec
The pods/exec and pods/attach sub-resources in the Kubernetes API are the primary vectors for CVE-2019-5736. Restrict access to these sub-resources tightly. Most application workloads do not require users to exec into pods.
# RBAC is purely additive: there is no deny rule, and an empty ClusterRole
# grants nothing but also blocks nothing. This role is kept only as a
# reference. The actual control is removing pods/exec and pods/attach from
# any developer-facing role and making sure no broader ClusterRoleBinding
# grants wildcard resource access.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: no-exec-attach
rules: []
---
# Developer ClusterRole that omits pods/exec and pods/attach.
# Grants typical read and log access without exec capability.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer-readonly
rules:
- apiGroups: [""]
  resources:
  - pods
  - pods/log
  - pods/status
  - services
  - endpoints
  - configmaps
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources:
  - deployments
  - replicasets
  - statefulsets
  - daemonsets
  verbs: ["get", "list", "watch"]
# pods/exec and pods/attach are intentionally absent
Audit all existing ClusterRoles and Roles for exec permissions:
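# Matches explicit pods/exec grants as well as wildcard resources;
# each role is printed once.
kubectl get clusterroles -o json | \
  jq -r '.items[] | select(any(.rules[]?; any(.resources[]?; . == "pods/exec" or . == "*/exec" or . == "*"))) | .metadata.name'
kubectl get roles --all-namespaces -o json | \
  jq -r '.items[] | select(any(.rules[]?; any(.resources[]?; . == "pods/exec" or . == "*/exec" or . == "*"))) | "\(.metadata.namespace)/\(.metadata.name)"'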
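The same audit can be expressed in Python when the result needs to feed a report or a CI check. This is a sketch under stated assumptions: it takes the parsed JSON from kubectl get clusterroles -o json (or roles), treats "*" and the "*/<subresource>" form as wildcards that cover exec and attach, and counts both "create" and "get" as verbs that can open an exec session. The function names are illustrative.

```python
from typing import Any, Dict, List

# Resource strings treated as covering pods/exec or pods/attach.
# Assumption: besides exact matches, RBAC honours "*" and "*/<subresource>".
EXEC_RESOURCES = {"pods/exec", "pods/attach", "*", "*/exec", "*/attach"}
# Assumption: exec sessions are opened with "create" or "get".
EXEC_VERBS = {"create", "get", "*"}
# pods live in the core API group, written as "" in RBAC rules.
CORE_GROUPS = {"", "*"}


def rule_grants_exec(rule: Dict[str, Any]) -> bool:
    """True if a single RBAC rule allows exec or attach on pods."""
    groups = set(rule.get("apiGroups") or [])
    resources = set(rule.get("resources") or [])
    verbs = set(rule.get("verbs") or [])
    return bool(
        groups & CORE_GROUPS
        and resources & EXEC_RESOURCES
        and verbs & EXEC_VERBS
    )


def roles_granting_exec(role_list: Dict[str, Any]) -> List[str]:
    """Names of Roles/ClusterRoles in a kubectl JSON list that grant exec."""
    names = []
    for item in role_list.get("items", []):
        if any(rule_grants_exec(rule) for rule in item.get("rules") or []):
            meta = item["metadata"]
            namespace = meta.get("namespace")
            names.append(f"{namespace}/{meta['name']}" if namespace else meta["name"])
    return names
```

Load the kubectl output with json.load and pass it to roles_granting_exec; unlike a resource-only match, this also requires a matching verb and API group, which cuts down on false positives from read-only wildcard roles.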
Configure the API server audit policy to log all exec and attach requests. These are high-value events — every kubectl exec should be in your audit log.
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log all exec, attach and port-forward requests at RequestResponse level
  - level: RequestResponse
    resources:
      - group: ""
        resources:
          - pods/exec
          - pods/attach
          - pods/portforward
  # Log access to secrets and service account tokens (metadata only, so
  # secret values never land in the audit log)
  - level: Metadata
    omitStages:
      - RequestReceived
    resources:
      - group: ""
        resources:
          - secrets
          - serviceaccounts/token
  # Default: metadata only for everything else
  - level: Metadata
    omitStages:
      - RequestReceived
Apply the audit policy to the API server with:
# kube-apiserver flags:
--audit-log-path=/var/log/kubernetes/audit/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10
--audit-log-maxsize=100
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
3. Kyverno Policy Blocking exec to Production Namespaces
RBAC decides who may call exec at all. A Kyverno policy provides a second layer: it can block exec calls that pass RBAC checks, for example because an overly broad role was granted, or to enforce namespace-level restrictions that RBAC cannot express cleanly.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-exec-into-pods
  annotations:
    policies.kyverno.io/title: Disallow exec into Pods
    policies.kyverno.io/category: Security
    policies.kyverno.io/severity: high
    policies.kyverno.io/description: >-
      Prohibits kubectl exec and kubectl attach to pods in namespaces
      labelled environment=production. Break-glass access requires
      removing the label and re-applying after the maintenance window.
spec:
  validationFailureAction: Enforce
  background: false
  rules:
  - name: deny-exec-production
    match:
      any:
      - resources:
          # Newer Kyverno releases express this as the Pod/exec subresource
          # kind; check the form supported by the version in use.
          kinds:
          - PodExecOptions
        subjects:
        - kind: Group
          name: "system:authenticated"
    context:
    - name: namespaceLabels
      apiCall:
        urlPath: "/api/v1/namespaces/{{ request.namespace }}"
        jmesPath: "metadata.labels"
    preconditions:
      all:
      # exec requests reach admission as CONNECT operations
      - key: "{{ request.operation || 'BACKGROUND' }}"
        operator: Equals
        value: CONNECT
    validate:
      message: >-
        kubectl exec is not permitted in production namespaces.
        Use the break-glass procedure documented in the runbook.
      deny:
        conditions:
          all:
          - key: "{{ namespaceLabels.environment || '' }}"
            operator: Equals
            value: "production"
Label your production namespaces:
kubectl label namespace production environment=production
kubectl label namespace payments environment=production
kubectl label namespace data-pipeline environment=production
4. User Namespace Remapping
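User namespace remapping (userns-remap) maps UID 0 inside all containers to an unprivileged UID on the host. For this CVE the effect is at the overwrite step: the host runc binary is owned by host root, and a container process whose UID 0 maps to an unprivileged host UID cannot open it for writing. The upstream advisory lists user namespaces, with host root not mapped into the container, as blocking the attack.
userns-remap itself is a Docker daemon setting (the userns-remap key in /etc/docker/daemon.json). containerd’s CRI plugin has no equivalent node-wide switch; with containerd, user namespaces are requested per pod, and the runtime configuration only needs the standard runc v2 runtime: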
# /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
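For Kubernetes 1.25+ with native user namespace support in pods (behind a feature gate in earlier releases, and dependent on a container runtime and kernel that support it), enable it per-pod with the hostUsers: false field: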
apiVersion: v1
kind: Pod
metadata:
  name: userns-isolated
  namespace: production
spec:
  hostUsers: false  # Maps UID 0 in container to an unprivileged UID on host
  containers:
  - name: app
    image: myapp:1.0.0
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
With hostUsers: false, the container’s UID 0 maps to a high UID (e.g., 65536) on the host. If an attacker attempts CVE-2019-5736 under this configuration, the process holding the descriptor to the host runc binary acts as UID 65536 on the host: unprivileged, unable to write a root-owned binary, unable to read other users’ files, and unable to access kubelet credentials without additional privilege escalation.
This is not a complete mitigation: it removes the privilege the overwrite depends on, not the exposure of the host binary through /proc. A chained privilege escalation using a separate kernel vulnerability can still recover root. But it breaks the clean single-step path from container escape to root on the host.
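The UID translation can be made concrete with a small sketch. It assumes the three-column format of /proc/<pid>/uid_map (first ID inside the namespace, first ID outside it, range length) and a container whose user namespace sits directly below the host’s; the function name to_host_uid is illustrative.

```python
from typing import Optional


def to_host_uid(uid_map: str, container_uid: int) -> Optional[int]:
    """Translate a container UID to a host UID using uid_map contents.

    Each line of /proc/<pid>/uid_map has three fields: the first ID inside
    the namespace, the first ID outside it, and the length of the range.
    Returns None when the UID is not mapped at all.
    """
    for line in uid_map.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        inside, outside, length = (int(field) for field in fields)
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None
```

With a map of "0 65536 65536", container UID 0 resolves to host UID 65536 and container UID 1000 to 66536, while a container without a user namespace shows the identity map "0 0 4294967295" and container root stays host root.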
5. Kata Containers as Complete Mitigation
Kata Containers eliminates the runc attack surface entirely by running each pod inside a lightweight VM using QEMU or Firecracker. The container process never executes inside the host kernel’s namespace. There is no shared host runc binary to overwrite, because the hypervisor boundary prevents /proc/self/exe from resolving to anything on the host filesystem.
Create a RuntimeClass for Kata:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
scheduling:
  nodeSelector:
    kata-containers: "true"
Label the nodes that have Kata installed:
kubectl label node <node-name> kata-containers=true
Apply the RuntimeClass to pods requiring strong isolation:
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
  namespace: production
spec:
  runtimeClassName: kata-containers
  containers:
  - name: app
    image: myapp:1.0.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "500m"
      limits:
        memory: "512Mi"
        cpu: "1000m"
Use a Kyverno policy to enforce Kata for specific workloads or namespaces:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kata-runtime-production
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-kata-runtimeclass
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [production, payments]
    validate:
      message: "Pods in production namespaces must use the kata-containers RuntimeClass."
      pattern:
        spec:
          runtimeClassName: kata-containers
With Kata, CVE-2019-5736 is not exploitable against the host: whatever runtime process is visible inside the VM belongs to the guest, isolated from the host by the hypervisor. Overwriting it affects only the VM’s own runtime, which is destroyed when the pod terminates. The host runc binary is never accessible from inside the container.
6. Falco Detection
Falco provides runtime detection of the exploit’s characteristic behaviours: opening a file descriptor to a binary in /proc/<pid>/exe from inside a container, and writing to the runc binary path.
# /etc/falco/rules.d/cve-2019-5736.yaml
# Detect when a process inside a container opens /proc/<pid>/exe
# for another process, specifically looking for runc binary access
- rule: Container Opens Proc Exe FD
  desc: >
    A container process opened /proc/<pid>/exe, which can indicate
    an attempt to get a file descriptor to a host binary via the
    CVE-2019-5736 attack pattern.
  condition: >
    container
    and open_read
    and fd.name glob "/proc/*/exe"
    and not proc.name in (ps, top, htop, lsof)
  output: >
    Container opened /proc/*/exe file descriptor
    (user=%user.name user_uid=%user.uid container=%container.name
    image=%container.image.repository:%container.image.tag
    proc=%proc.name pid=%proc.pid fd=%fd.name
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, cve-2019-5736, proc-escape]

# Detect writes to the runc binary path from any process
- rule: Runc Binary Written
  desc: >
    The runc binary was written. This can indicate exploitation of
    CVE-2019-5736 or tampering with the container runtime. Any
    legitimate runc update occurs via the package manager, not direct
    file writes.
  condition: >
    open_write
    and (fd.name = "/usr/bin/runc"
         or fd.name = "/usr/sbin/runc"
         or fd.name glob "*/bin/runc")
    and not proc.name in (dpkg, rpm, apt, yum, dnf, zypper, cp, install)
  output: >
    runc binary written
    (user=%user.name user_uid=%user.uid proc=%proc.name pid=%proc.pid
    container=%container.name image=%container.image.repository:%container.image.tag
    fd=%fd.name namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [host, runtime-integrity, cve-2019-5736]

# Detect when a process attempts to reopen /proc/self/fd/<n> for write
# after initially opening it read-only: the specific reopen trick used
# in the CVE-2019-5736 exploit
- rule: Container Reopens Proc FD for Write
  desc: >
    A container process opened /proc/self/fd/<n> for write. This is
    the file descriptor reopen pattern used in the CVE-2019-5736 exploit
    to convert a read-only fd on the host runc binary to a writable one.
  condition: >
    container
    and open_write
    and fd.name glob "/proc/self/fd/*"
  output: >
    Container process reopened proc fd for write
    (user=%user.name user_uid=%user.uid container=%container.name
    image=%container.image.repository:%container.image.tag
    proc=%proc.name pid=%proc.pid fd=%fd.name
    namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [container, cve-2019-5736, proc-escape]
Apply this rule file and reload Falco:
# Copy rules to the Falco rules directory
cp cve-2019-5736.yaml /etc/falco/rules.d/
# Reload Falco rules without restart (Falco 0.32+)
kill -1 $(pidof falco)
# Falco 0.32+ can also pick up rule file changes on its own when
# watch_config_files is enabled in falco.yaml
Expected Behaviour
On patched runc (1.0-rc7+, or an earlier version carrying the backported fix): the fix makes runc copy its own binary into a sealed in-memory file (memfd) and re-execute from that copy before it joins the container; later releases have refined the mechanism. Inside the container, /proc/<runc-pid>/exe then refers to the sealed copy rather than to the binary on the host filesystem. Writes to the copy are rejected by the file seals, and even a successful write would not touch the host binary. The attack loop cannot proceed.
# Attempting the overwrite against a patched runc:
# $ runc --version
# runc version 1.1.12 ...
# The descriptor obtained through /proc/<runc-pid>/exe no longer refers to
# the on-disk host binary: the write attempt fails and /usr/bin/runc is unchanged.
Audit log output for exec attempts: With the audit policy in place, every kubectl exec produces a RequestResponse level audit event:
{
"kind": "Event",
"apiVersion": "audit.k8s.io/v1",
"level": "RequestResponse",
"auditID": "a3f1e2c0-8d4b-4e5a-b6c7-d8e9f0a1b2c3",
"stage": "ResponseComplete",
"requestURI": "/api/v1/namespaces/production/pods/myapp-6d8f9b-x4v2k/exec",
"verb": "create",
"user": {
"username": "alice@example.com",
"groups": ["system:authenticated"]
},
"sourceIPs": ["10.0.1.45"],
"objectRef": {
"resource": "pods",
"namespace": "production",
"name": "myapp-6d8f9b-x4v2k",
"subresource": "exec"
},
"responseStatus": {"code": 101},
"requestReceivedTimestamp": "2026-05-08T14:23:11.000Z"
}
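Events like the one above are easier to review once reduced to the fields an investigator needs. The following sketch assumes the audit log is written as one JSON event per line (the sample above is pretty-printed for readability) and that the exec command, when present, arrives as repeated command query parameters in requestURI. The function name exec_events and the output keys are illustrative.

```python
import json
from typing import Any, Dict, Iterable, Iterator
from urllib.parse import parse_qs, urlsplit


def exec_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield a summary for every pods/exec or pods/attach audit event."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "pods":
            continue
        if ref.get("subresource") not in ("exec", "attach"):
            continue
        query = parse_qs(urlsplit(event.get("requestURI", "")).query)
        yield {
            "user": (event.get("user") or {}).get("username"),
            "source_ips": event.get("sourceIPs", []),
            "namespace": ref.get("namespace"),
            "pod": ref.get("name"),
            "subresource": ref.get("subresource"),
            # Assumption: the command is carried as repeated "command"
            # query parameters; empty if the URI has none.
            "command": query.get("command", []),
            "time": event.get("requestReceivedTimestamp"),
        }
```

Iterating over an open log file, for example exec_events(open("/var/log/kubernetes/audit/audit.log")), produces one record per exec or attach request, which can then be filtered by namespace or user.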
Falco alert output when the exploit pattern is detected:
14:23:14.887265432: CRITICAL runc binary written
user=root user_uid=0 proc=sh pid=8842
container=compromised-app
image=attacker/malicious-image:latest
fd=/usr/bin/runc
namespace=production
pod=compromised-app-7d9c4b-r3k8p
14:23:11.234567890: WARNING Container opened /proc/*/exe file descriptor
user=root user_uid=0 container=compromised-app
image=attacker/malicious-image:latest
proc=sh pid=8840 fd=/proc/8838/exe
namespace=production pod=compromised-app-7d9c4b-r3k8p
These two alerts firing in sequence — /proc/*/exe open followed by a write to the runc binary path — is a near-certain indicator of CVE-2019-5736 exploitation. The CRITICAL priority alert on the runc write should page on-call immediately. At that point, the node must be treated as fully compromised: isolate it from the cluster (cordon and drain to a quarantine pool), preserve disk forensics, and rotate all secrets that could have been exposed from pod environment variables on that node.
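The sequence check can be automated on Falco’s JSON output. The sketch below is one possible correlation, not a Falco feature: it assumes alerts arrive as dictionaries with the rule, time and output_fields keys that Falco emits when JSON output is enabled, uses the two rule names defined above, and flags a pod when the runc write follows a /proc/*/exe open within a window. The 60-second window and the function names are illustrative choices.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List, Optional, Tuple

OPEN_RULE = "Container Opens Proc Exe FD"
WRITE_RULE = "Runc Binary Written"

_TIME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")

PodKey = Tuple[Optional[str], Optional[str]]


def parse_time(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating nanoseconds to microseconds."""
    match = _TIME_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported timestamp: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    fraction = (match.group(2) or "0")[:6].ljust(6, "0")
    return base.replace(tzinfo=timezone.utc) + timedelta(microseconds=int(fraction))


def correlated_pods(
    alerts: Iterable[Dict[str, Any]],
    window: timedelta = timedelta(seconds=60),  # assumed window, tune as needed
) -> List[PodKey]:
    """Return (namespace, pod) pairs where a runc write followed an exe open."""
    last_open: Dict[PodKey, datetime] = {}
    hits: List[PodKey] = []
    for alert in sorted(alerts, key=lambda a: parse_time(a["time"])):
        fields = alert.get("output_fields") or {}
        key = (fields.get("k8s.ns.name"), fields.get("k8s.pod.name"))
        when = parse_time(alert["time"])
        if alert.get("rule") == OPEN_RULE:
            last_open[key] = when
        elif alert.get("rule") == WRITE_RULE and key in last_open:
            if when - last_open[key] <= window:
                hits.append(key)
    return hits
```

A write alert with no preceding open for the same pod is deliberately not reported here; it should still page on its own, since the CRITICAL rule stands independently of the correlation.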
Trade-offs
Kata Containers is the only complete mitigation that eliminates the attack surface rather than reducing blast radius. The trade-offs are real:
- Pod startup time increases by approximately 100-200ms on QEMU-based Kata (Firecracker is faster, ~50ms overhead). This is imperceptible for long-running services but noticeable for short-lived batch jobs and FaaS-style workloads.
- Nested virtualisation must be available on the node. On bare-metal this is straightforward. On cloud VMs, nested virtualisation support varies by provider and instance type. On AWS, this has generally meant bare-metal instances (.metal); check the current documentation for other instance families before relying on them. On GCP, most n1/n2 instances support it. On Azure, Dv3/Ev3 and later support nested virtualisation.
- Some kernel features behave differently inside a Kata VM. eBPF programs with certain map types may fail. Performance-sensitive syscalls (io_uring in particular) have overhead. Storage drivers that rely on host kernel features (some CSI drivers, FUSE-based volumes) may not work without specific Kata configuration.
- The hypervisor becomes the new trust boundary. Hypervisor CVEs (QEMU has a long history of them) become the relevant attack surface instead of runc CVEs. This is generally a better position — hypervisor escapes are harder than container escapes — but it is not zero-risk.
User namespace remapping blocks the overwrite step rather than removing the exposure, and it has operational costs:
- Workloads that legitimately require UID 0 on the host (NFS client mounts, some storage drivers that use host device paths, some monitoring agents that bind to host ports below 1024) break with userns-remap enabled. The failure mode is silent in some cases: the workload starts but NFS mounts fail at runtime with permission errors.
- Kubernetes user namespace support (hostUsers: false) reached beta in 1.30 and is enabled by default from 1.33, but some CNI plugins do not handle remapped UIDs correctly for network policy enforcement. Test thoroughly before rolling out cluster-wide.
- User namespace remapping adds complexity to securityContext reasoning: UID 1000 in a remapped container might map to UID 66536 on the host. Audit tooling that correlates host UIDs to container processes needs to account for this offset.
Blocking kubectl exec via RBAC and admission policy has the most operational friction:
- Developers rely on kubectl exec for debugging. Removing access creates pressure to find workarounds (deploying debug containers with exec enabled, using port-forward with interactive shells, requesting temporary elevated permissions). Without a defined break-glass procedure, the restriction becomes a tax on incident response.
- A documented break-glass procedure should: require an approval step (PagerDuty-style incident creation, Slack approval from a second engineer), grant time-limited exec access via a temporary RoleBinding with a TTL enforced by a controller, and log all exec activity during the window to a separate audit trail.
- For production debugging without exec, Kubernetes ephemeral containers (kubectl debug) combined with a shared process namespace allow attaching a debug container to a running pod. This still goes through runc and carries the same CVE exposure, but scoped access controls can restrict it to a smaller set of users than general exec.
Failure Modes
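Assuming Pod Security Standards prevent this. Most of a hardened security context is irrelevant to CVE-2019-5736: readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, and dropping all capabilities do not stop it. The attack works by opening a file descriptor to the host runc binary through /proc. It does not write to the container’s filesystem, does not use privilege escalation within the container, and needs no capability, because a process running as UID 0 is already the owner of the root-owned runc binary. The one setting that does matter is the user: the overwrite requires UID 0 in a container without user namespace remapping, so a pod that genuinely runs as non-root cannot complete it. Many images run as root by default, however, and the baseline Pod Security Standard does not forbid that. Pod Security Standards restrict what the container process can do within its own context, not what it can reach through /proc during the runc exec window. Do not assume that a hardened pod security context is a container escape mitigation unless non-root execution is actually enforced.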
Not tracking runc versions across nodes. Managed Kubernetes offerings (EKS, GKE, AKS) update the container runtime as part of node pool updates, but not all clusters are configured to auto-update node pools, and some use custom AMIs or VM images with a manually maintained runc version. In large clusters, individual node groups may be on different runtime versions depending on when they were last recycled. A single vulnerable node in a cluster of 500 is still a viable attack target. Implement the DaemonSet version reporter described above, export its output to a monitoring system, and alert when any node reports a runc version below the required minimum. Version compliance is a host-level metric, not something the Kubernetes control plane surfaces by default.
Missing exec audit logging after the fact. If the API server audit log does not capture pods/exec events, there is no post-incident record of which user exec’d into which pod at what time. An audit entry records the user, source IP, target pod, and request URI, which typically carries the container name and command as query parameters; it does not record what was typed into an interactive session. For CVE-2019-5736 investigation, you need at least: which container, which command, which user, from which IP. Without this, incident response reduces to “a node was compromised, we don’t know how.” Audit log retention of at least 90 days is the minimum for meaningful post-incident analysis.
Treating the problem as fixed after the CVE patch. CVE-2019-5736 is among the most widely publicised runc escapes, but the underlying design tension (runc executing within the container’s PID namespace with access to host-backed /proc entries) is an architectural property of the OCI runtime model, not a one-time code defect. CVE-2021-30465 (runc mount destination race condition), CVE-2023-27561 (a regression of the earlier CVE-2019-19921), and CVE-2024-21626 (runc process.cwd and leaked fds) all followed from the same substrate. The mitigations that matter long-term are architectural: user namespace isolation to limit what a container escape achieves, and strong isolation runtimes (Kata, gVisor) to eliminate the shared-kernel attack surface entirely.