Container Runtime Security: gVisor, Kata Containers, and crun Beyond runc
The Problem
Every container running under the default runc runtime shares the host kernel. The namespace and cgroup layer hides resources from the container process, but the process communicates with the kernel directly via syscalls. Seccomp BPF filters reduce the attack surface by blocking syscalls the container does not need — a typical Docker default profile blocks around 44 of the ~350+ available syscalls. That still leaves more than 300 syscalls reachable, each of which is a potential exploitation path.
This matters concretely:
- CVE-2022-0847 (Dirty Pipe) was exploited through the splice syscall, which is in the Docker default seccomp allowlist.
- CVE-2019-5736 (runc overwrite) exploited the /proc/self/exe path. No syscall filter blocked it because the attacker operated within the process execution model.
- CVE-2024-21626 (runc working directory) exploited a file descriptor leak during container startup, before any seccomp filter took effect.
The root cause is architectural: when the container process and the host share a kernel, a logic flaw in the kernel becomes a container escape. Syscall filtering does not prevent exploitation of the syscalls that remain in the allowlist. It is a reduction of attack surface, not elimination of the attack vector.
For most workloads — internal services, well-understood applications, trusted code — runc with a hardened seccomp profile, AppArmor/SELinux, and a non-root user is the right trade-off. Overhead is minimal and the security posture is acceptable.
For high-risk workloads — untrusted code execution, multi-tenant platforms, AI inference with user-supplied inputs, CI job runners, ingress processing of arbitrary network traffic — the shared-kernel model is not acceptable and an alternative runtime is warranted.
This article covers three alternatives and how to choose between them.
runc’s Security Model: What It Gives You and Where It Stops
runc is the OCI reference runtime. When containerd or Docker creates a container, it calls runc to fork the container process into a set of Linux namespaces (pid, net, mnt, uts, ipc, user) and place it under a cgroup. The container process then runs with:
- Capabilities: a reduced capability set (typically the Docker default drops NET_ADMIN, SYS_PTRACE, SYS_MODULE, and about 25 others)
- Seccomp BPF: a syscall allowlist enforced by the kernel
- AppArmor or SELinux: mandatory access control on file and network operations
The enforcement boundary is the kernel. Every syscall from the container process goes to the same kernel that serves the host. When you apply a seccomp filter, the kernel evaluates the BPF program on each syscall before dispatching it. If the filter allows the syscall, it runs at full privilege in the kernel. A kernel vulnerability in that handler is exploitable by the container process.
# Check whether a seccomp filter is active inside the container
# (/proc/self avoids $$ being expanded by the host shell before docker runs)
docker run --rm ubuntu:24.04 grep Seccomp: /proc/self/status
# Seccomp: 2 (2 = filter mode active)
# Count the distinct syscall names allowed by Docker's default profile
curl -s https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json \
| jq '[.syscalls[] | select(.action == "SCMP_ACT_ALLOW") | .names[]] | unique | length'
# Several hundred names are allowed (some only with specific capabilities or architectures); roughly 44 syscalls are blocked
# View kernel attack surface from within a container
strace -c -f sleep 60 &
# Every unique syscall here is a potential kernel exploitation path
The practical implication: runc is excellent when you trust the code running inside the container. For untrusted code, you need a different isolation primitive.
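To make the Seccomp field in /proc/<pid>/status easier to read in scripts, a small helper can translate the number into a label. This is a minimal sketch; `seccomp_mode` is a hypothetical function name, not part of Docker or runc.

```shell
# seccomp_mode [STATUS_FILE]
# Hypothetical helper: map the numeric Seccomp field of a /proc/<pid>/status
# file to a label. Defaults to the status of the process that reads the file.
seccomp_mode() {
  local status_file="${1:-/proc/self/status}"
  local mode
  # The line looks like "Seccomp:<TAB>2"; print the second field.
  mode="$(awk '$1 == "Seccomp:" { print $2 }' "$status_file")"
  case "$mode" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}
```

Called without arguments inside a container, it reports the mode inherited by the reading process, which under the default Docker profile should be "filter".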
gVisor (runsc): A User-Space Kernel
gVisor intercepts syscalls from the container process before they reach the host kernel. The Sentry — gVisor’s core component — is a user-space process that implements a substantial subset of the Linux kernel API. Container processes make syscalls; those syscalls are caught by gVisor and either handled entirely in user space or forwarded to the host kernel through a minimal, audited interface.
Architecture: Sentry and Gofer
The Sentry handles most Linux syscalls (network, process management, signals, futexes) in Go code running as an unprivileged user-space process. File system operations go through the Gofer, a separate process that mediates all host filesystem access over a 9P protocol connection. The container process never touches the host filesystem directly.
Container Process
        |
        | syscall
        v
[ Sentry - user-space Linux kernel implementation ]
        |                       |
        | host FS ops           | select kernel syscalls (~100 total surface)
        v                       v
   [ Gofer ]             [ Host Kernel ]
        |
        | 9P
        v
[ Host Filesystem ]
The Sentry exposes roughly 240 syscalls to container processes. Of those, about 100 result in a call to the host kernel; the rest are handled entirely within the Sentry. The host kernel attack surface from a gVisor container is a small, audited interface — not the full syscall table.
Platform: KVM vs ptrace
This section compares two gVisor execution platforms; a third, Systrap, is covered under 2025-2026 Developments below.
ptrace platform: Uses ptrace to intercept syscalls. Available on any Linux host without hardware virtualisation. Carries significant performance overhead, because each syscall from the container involves a ptrace stop/resume cycle. Suitable for development environments or workloads where syscall frequency is low.
KVM platform: Runs the Sentry as the guest kernel of a VM using KVM hardware virtualisation. Much lower overhead than ptrace, because syscalls trap into the Sentry through hardware virtualisation support instead of a ptrace stop/resume cycle. Requires KVM access on the host, which means bare metal or a cloud VM that exposes virtualisation extensions (for example GCP n2 with nested virtualisation enabled, AWS metal instances, Azure DCsv3). Where KVM is available, this is the production deployment mode.
# Verify KVM availability for gVisor
ls -la /dev/kvm
# Must exist and be accessible to the containerd process
# Check CPU virtualisation extensions
grep -E 'vmx|svm' /proc/cpuinfo | head -1
# vmx = Intel VT-x, svm = AMD-V
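The two checks above can be folded into a small decision helper for provisioning scripts. The sketch below is an assumption about how one might automate the choice: `pick_platform` is a hypothetical name, and it falls back to the Systrap platform described later in this article when KVM is not usable.

```shell
# pick_platform [DEVICE]
# Hypothetical helper: print "kvm" when DEVICE (default /dev/kvm) is a
# character device the current user can read and write, else "systrap".
pick_platform() {
  local dev="${1:-/dev/kvm}"
  if [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]; then
    echo "kvm"
  else
    echo "systrap"
  fi
}
```

The result could be written into the `platform` key of the runsc configuration. Run the check as the user that containerd runs as, since that is the process that needs access to the device.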
Configuring gVisor with containerd
Install the runsc binary and configure containerd to use it as a runtime handler:
# Install runsc (gVisor runtime binary) and the containerd shim
RUNSC_VERSION="20240930.0"
URL="https://storage.googleapis.com/gvisor/releases/release/${RUNSC_VERSION}/x86_64"
# Download into a scratch directory so sha512sum can find the files it checks
cd "$(mktemp -d)"
curl -fsSL -O "${URL}/runsc" -O "${URL}/runsc.sha512"
curl -fsSL -O "${URL}/containerd-shim-runsc-v1" -O "${URL}/containerd-shim-runsc-v1.sha512"
# Verify both checksums, and install only if they match
sha512sum --check runsc.sha512 containerd-shim-runsc-v1.sha512 \
&& install -m 755 runsc containerd-shim-runsc-v1 /usr/local/bin/
Configure containerd to register the gVisor runtime handler:
# /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
# Use KVM platform for production
ConfigPath = "/etc/containerd/runsc.toml"
# /etc/containerd/runsc.toml
[runsc_config]
platform = "kvm"
network = "sandbox"
debug-log = "/var/log/runsc/%ID%/"
systemctl restart containerd
# Test: run a container with gVisor
# (Docker keeps its own runtime list and does not read containerd's CRI config;
#  register runsc with Docker first, e.g. `runsc install` followed by a Docker restart)
docker run --runtime=runsc --rm ubuntu:24.04 uname -r
# Returns gVisor's emulated kernel release, not the host kernel version
# e.g.: 4.4.0
Performance Characteristics
gVisor adds latency on syscall-heavy workloads. The ranges below are indicative figures reported in benchmarks from the gVisor team and independent sources; measure your own workload before relying on them:
| Workload type | runc | gVisor (KVM) relative to runc | Notes |
|---|---|---|---|
| CPU-bound (no I/O) | baseline | 1-3% slower | Sentry handles few host syscalls |
| Network throughput (TCP bulk) | baseline | 10-20% lower | Network stack in user space |
| Syscall-heavy (forks, stats) | baseline | 2-5x slower | Each syscall has Sentry cost |
| File I/O (random small reads) | baseline | 2-10x slower | Gofer adds round-trip overhead |
For AI inference, REST API servers, or batch processing workloads, the overhead is acceptable. For database engines with high-frequency small I/O or applications that make thousands of syscalls per second, benchmark first.
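One way to "benchmark first" is a crude fork-and-stat loop run under each runtime. The sketch below is illustrative only: `bench_stat` is a hypothetical helper, it assumes GNU date (for `%N`), and it measures wall-clock time for a syscall-heavy pattern rather than anything representative of a real application.

```shell
# bench_stat [N]
# Hypothetical micro-benchmark: fork `stat /` N times (default 200) and print
# the elapsed wall-clock time in milliseconds. Requires GNU date for %N.
bench_stat() {
  local n="${1:-200}" start end i
  start="$(date +%s%N)"
  i=0
  while [ "$i" -lt "$n" ]; do
    stat / > /dev/null
    i=$((i + 1))
  done
  end="$(date +%s%N)"
  echo $(( (end - start) / 1000000 ))
}
```

Running the same function in a runc container and a gVisor container on the same node, and dividing the two results, gives a rough slowdown factor to compare with the syscall-heavy row of the table.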
Kata Containers: VM-Isolated Containers
Kata Containers takes a different approach: instead of intercepting syscalls in user space, Kata wraps each container (or pod) in a lightweight virtual machine. The container process runs inside a VM with its own kernel. A kernel exploit inside the container only compromises the guest kernel — not the host.
Architecture
Pod / Container Group
|
[ kata-agent ] (runs inside VM)
|
[ Guest Kernel ] (lightweight, hardened)
|
[ Hypervisor: QEMU / Cloud Hypervisor / Firecracker ]
|
[ Host Kernel ]
The kata-agent runs as PID 1 inside the guest VM, receives instructions from the Kata Containers runtime shim on the host over a vsock channel, and creates the container processes inside the VM using its own namespace and cgroup setup code (current Kata releases do not need a separate runc or crun binary in the guest). The container appears normal from the application’s perspective: its filesystem is mounted via virtio-fs or device pass-through, and its network is presented via a veth pair bridged through virtio-net.
Hypervisor Backends
Kata supports several hypervisor backends. The three covered here have different trade-offs:
QEMU (default): Full-featured, best compatibility, highest overhead. Boot time ~300-500ms, memory footprint ~200MB per pod overhead. Suitable when compatibility and feature completeness are the priority.
Cloud Hypervisor (ch): Purpose-built for cloud workloads. Written in Rust. Faster boot (~150ms), lower overhead (~130MB). Good balance of security and performance for production deployments.
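Firecracker: AWS’s VMM, designed for serverless workloads. Fastest boot (<125ms in ideal conditions), smallest memory footprint (~50MB overhead), but limited device support (no PCI, no USB). It also lacks virtio-fs, so Kata with Firecracker needs a block-device-based snapshotter such as devmapper for the container root filesystem. Best for function-as-a-service or CI environments with many short-lived containers.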
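The per-pod overhead figures quoted for each backend translate directly into node capacity planning. The following sketch hard-codes this article's rough numbers (treating the quoted MB values as MiB); the function names are hypothetical, and real overhead depends on guest kernel, image, and configuration.

```shell
# kata_pod_overhead_mib BACKEND
# Per-pod memory overhead in MiB, using this article's approximate figures.
kata_pod_overhead_mib() {
  case "$1" in
    qemu) echo 200 ;;
    clh)  echo 130 ;;
    fc)   echo 50 ;;
    *)    echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# kata_node_overhead_mib BACKEND POD_COUNT
# Total VM overhead for POD_COUNT pods on one node.
kata_node_overhead_mib() {
  local per_pod
  per_pod="$(kata_pod_overhead_mib "$1")" || return 1
  echo $(( per_pod * $2 ))
}
```

For example, `kata_node_overhead_mib clh 10` prints 1300, meaning ten Cloud Hypervisor pods reserve roughly 1.3 GiB before any application memory is counted.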
Configuring Kata with containerd
# Install Kata Containers from official release
KATA_VERSION="3.6.0"
curl -fsSL "https://github.com/kata-containers/kata-containers/releases/download/${KATA_VERSION}/kata-static-${KATA_VERSION}-amd64.tar.xz" \
-o kata-static.tar.xz
# The archive's paths already start with opt/kata, so extract at the filesystem root
tar -xf kata-static.tar.xz -C /
# Binaries land in /opt/kata/bin/
# Add kata symlinks to PATH
ln -sf /opt/kata/bin/containerd-shim-kata-v2 /usr/local/bin/containerd-shim-kata-v2
ln -sf /opt/kata/bin/kata-runtime /usr/local/bin/kata-runtime
# Verify hardware virtualisation
kata-runtime check
# Must report: System is capable of running Kata Containers
# /etc/containerd/config.toml — add Kata runtime handlers
version = 2
# Kata with QEMU
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-qemu]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-qemu.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
# Kata with Cloud Hypervisor (preferred for production)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-clh]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-clh.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-clh.toml"
# Kata with Firecracker
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-fc]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-fc.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-fc.toml"
# /opt/kata/share/defaults/kata-containers/configuration-clh.toml (key excerpt)
[hypervisor.clh]
path = "/opt/kata/bin/cloud-hypervisor"
# Standard guest kernel; the -confidential build is intended for confidential_guest = true
kernel = "/opt/kata/share/kata-containers/vmlinux.container"
image = "/opt/kata/share/kata-containers/kata-containers.img"
# Keep vhost-net enabled (set to true to disable it)
disable_vhost_net = false
virtio_fs_daemon = "/opt/kata/libexec/virtiofsd"
# Default VM memory in MiB; budget for it alongside your container memory requests
default_memory = 2048
# Enable confidential computing if hardware supports it
confidential_guest = false
systemctl restart containerd
# Test Kata isolation — guest kernel differs from host
docker run --runtime=kata-clh --rm ubuntu:24.04 uname -r
# Returns the Kata guest kernel version: e.g., 6.1.62-container
# Different from host kernel
crun: A Leaner OCI Runtime
crun is a C implementation of the OCI container runtime spec, developed by Red Hat as an alternative to runc (written in Go). It is not a sandbox or a VM-based runtime — its security posture is comparable to runc. The value proposition is different: lower overhead, smaller codebase, and native cgroup v2 support.
Why crun Matters for Security
A smaller codebase means a smaller attack surface in the runtime itself. By rough count, runc is approximately 70,000 lines of Go including its dependencies, while crun’s core is approximately 7,000 lines of C. Fewer lines generally mean fewer places for bugs, although C lacks Go’s memory safety, so size is not the only factor. The runtime executes as a privileged process during container setup, so vulnerabilities in the runtime binary itself (like CVE-2019-5736) are serious, and crun’s reduced size is a genuine security benefit.
crun added cgroup v2 support before runc and handles the unified hierarchy more cleanly. On systems using cgroup v2 exclusively (Ubuntu 22.04+, RHEL 9, Fedora 31+), crun is the better choice at the runtime level.
# Install crun
apt-get install -y crun # Ubuntu 22.04+
dnf install -y crun # RHEL 9 / Fedora
# Or build from source for the latest version
git clone https://github.com/containers/crun
cd crun && ./autogen.sh && ./configure && make -j$(nproc)
cp crun /usr/local/bin/crun
# Verify crun capabilities
crun --version
# crun version 1.15
# commit: ...
# spec: 1.0.0
# +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
# /etc/containerd/config.toml — replace runc with crun
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# Path from the source build above; packaged installs usually place crun in /usr/bin/crun
BinaryName = "/usr/local/bin/crun"
# crun is a drop-in replacement; all runc options apply
SystemdCgroup = true
Podman uses crun as its default runtime on RHEL/Fedora. The switch from runc to crun is transparent to the container workload.
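Before switching a node to crun on the strength of its cgroup v2 handling, it helps to confirm which cgroup mode the node actually runs. A minimal sketch, assuming GNU coreutils `stat`; `cgroup_mode` is a hypothetical helper that classifies the filesystem type string.

```shell
# cgroup_mode FSTYPE
# Hypothetical helper: classify the filesystem type reported by
#   stat -fc %T /sys/fs/cgroup
# cgroup2fs means the unified (v2) hierarchy; tmpfs means v1 or hybrid.
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1-or-hybrid" ;;
    *)         echo "unknown" ;;
  esac
}
# Usage on a real host:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
```

Taking the type string as an argument keeps the function easy to test and lets the same logic run against output collected from remote nodes.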
Kubernetes RuntimeClass: Mixing Runtimes in a Cluster
Kubernetes RuntimeClass lets you assign different OCI runtimes to different pods. Sensitive pods get gVisor or Kata; trusted workloads stay on runc or crun. The selection is made at the pod level in the pod spec.
RuntimeClass Configuration
First, ensure the node’s containerd config registers all handlers (as shown above). Then create RuntimeClass objects in Kubernetes:
# runtimeclasses.yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
scheduling:
  nodeSelector:
    runtime.kubernetes.io/gvisor: "true"
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-clh
handler: kata-clh
scheduling:
  nodeSelector:
    runtime.kubernetes.io/kata: "true"
  tolerations:
    - key: kata
      operator: Exists
      effect: NoSchedule
overhead:
  podFixed:
    memory: "130Mi" # Kata VM overhead, factored into scheduling
    cpu: "250m"
---
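apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: crun
# Needs a containerd runtime entry named "crun"; if crun was swapped in under
# the existing "runc" entry via BinaryName, use handler: runc instead
handler: crun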
kubectl apply -f runtimeclasses.yaml
# Label nodes that have gVisor installed
kubectl label node worker-01 runtime.kubernetes.io/gvisor=true
# Label nodes that have Kata + KVM
kubectl label node worker-02 runtime.kubernetes.io/kata=true
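A RuntimeClass whose handler has no matching entry in the node's containerd config leaves pods stuck at sandbox creation, so it is worth checking the two against each other. The sketch below is an assumption about how to do that with plain text tools; both function names are hypothetical, and it only understands the `[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.NAME]` table headers used in this article.

```shell
# list_cri_handlers CONFIG
# Hypothetical helper: print the runtime handler names registered under the
# CRI plugin in a containerd config file, one per line.
list_cri_handlers() {
  sed -nE 's/^[[:space:]]*\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\.runtimes\.([A-Za-z0-9_-]+)\][[:space:]]*$/\1/p' "$1"
}

# has_handler CONFIG NAME
# Succeeds only if NAME is registered as a handler in CONFIG.
has_handler() {
  list_cri_handlers "$1" | grep -qxF -- "$2"
}
```

For example, `has_handler /etc/containerd/config.toml kata-clh || echo "kata-clh not registered"` can run on each node before it is labelled.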
Selecting a RuntimeClass in a Pod
# untrusted-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: user-code-runner
  namespace: tenant-sandbox
spec:
  runtimeClassName: gvisor # All containers in this pod use gVisor
  containers:
    - name: runner
      image: python:3.12-slim
      command: ["python", "-c", "import sys; exec(sys.stdin.read())"]
      resources:
        limits:
          memory: "512Mi"
          cpu: "500m"
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
# ci-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: build-job
spec:
  template:
    spec:
      runtimeClassName: kata-clh # Build jobs get VM isolation
      restartPolicy: Never # Job pods must use Never or OnFailure
      containers:
        - name: builder
          image: ubuntu:24.04
          command: ["/bin/bash", "-c", "make all"]
          securityContext:
            runAsNonRoot: false # Build often needs root; the VM provides isolation
Enforcing RuntimeClass with OPA / Kyverno
Allow-listing is not enough — you need to prevent workloads from omitting runtimeClassName in sensitive namespaces:
# kyverno-policy-require-runtime.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-sandbox-runtime
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-runtimeclass
      match:
        resources:
          kinds: [Pod]
          namespaces:
            - "tenant-*"
            - "ci-*"
      validate:
        message: "Pods in tenant and CI namespaces must specify runtimeClassName: gvisor or kata-clh"
        pattern:
          spec:
            runtimeClassName: "gvisor | kata-clh"
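The rule the policy expresses is simple enough to mirror in a pre-flight check, for instance in a CI step that lints manifests before they reach the cluster. This is a sketch of the policy's intent, not of Kyverno's evaluation engine; `runtimeclass_allowed` is a hypothetical function.

```shell
# runtimeclass_allowed NAMESPACE RUNTIMECLASS
# Hypothetical pre-flight check mirroring the policy above: namespaces
# matching tenant-* or ci-* must use gvisor or kata-clh (an empty or missing
# runtime class is rejected); other namespaces are not restricted.
runtimeclass_allowed() {
  case "$1" in
    tenant-*|ci-*)
      case "$2" in
        gvisor|kata-clh) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    *) return 0 ;;
  esac
}
```

A local check like this only gives faster feedback; the admission policy remains the enforcement point.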
Workload Assignment: Which Runtime for Which Pod
Not all workloads justify the overhead of gVisor or Kata. Use this decision matrix:
| Workload | Trust Level | Recommended Runtime | Rationale |
|---|---|---|---|
| Internal microservice | High | runc / crun | Trusted code, low risk, performance matters |
| Third-party container from external registry | Medium | gVisor | Reduced supply-chain risk |
| AI inference with user-supplied inputs | Low | gVisor | Prompt injection → code execution path |
| CI job runner (user-submitted build jobs) | Untrusted | Kata Containers | Build jobs need root; VM isolates the host |
| Ingress/proxy container | Medium-High | gVisor | Directly processes attacker-controlled network data |
| Multi-tenant function execution | Untrusted | gVisor or Kata | Each tenant’s code is untrusted |
| Database (PostgreSQL, MySQL) | High | crun | Trusted, syscall-heavy — crun gives marginal improvement |
| Kubernetes system pods (coredns, kube-proxy) | High | runc / crun | Trusted, perf-sensitive |
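The matrix can be condensed into a first-pass rule: trusted workloads stay on runc or crun, and everything else goes to gVisor unless it needs root, in which case it goes to Kata. The sketch below encodes only that simplification; `recommend_runtime` and its trust labels are hypothetical, and borderline rows (such as multi-tenant functions) still need a human decision.

```shell
# recommend_runtime TRUST [NEEDS_ROOT]
# Hypothetical first-pass mapping of the matrix above.
#   TRUST:      high | medium | low | untrusted
#   NEEDS_ROOT: yes | no (default no)
recommend_runtime() {
  local trust="$1" needs_root="${2:-no}"
  case "$trust" in
    high)
      echo "runc-or-crun"
      ;;
    medium|low|untrusted)
      if [ "$needs_root" = "yes" ]; then
        echo "kata"
      else
        echo "gvisor"
      fi
      ;;
    *)
      echo "unknown trust level: $trust" >&2
      return 1
      ;;
  esac
}
```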
AI/ML Workload Consideration
LLM serving endpoints process user-supplied prompts. A successful prompt injection that leads to arbitrary code execution hits the container boundary. For public-facing inference endpoints, gVisor’s user-space kernel makes a container escape via a kernel CVE much harder: the kernel that the container talks to is gVisor’s Sentry, not the host kernel, so an attacker must also break out of the Sentry. Deploy inference containers with runtimeClassName: gvisor and the KVM platform.
Security Testing: Isolating the Difference
The following tests demonstrate the isolation boundary concretely. They rely on harmless observation (reading /proc from inside the container to see which kernel answers) rather than an active exploit, so they are safe to reproduce.
# Test 1: runc — host kernel visible from container
docker run --rm ubuntu:24.04 cat /proc/version
# Linux version 6.8.0-41-generic (buildd@...) (gcc version 13.2.0...) #41-Ubuntu...
# MATCHES host kernel exactly
# Test 2: gVisor — Sentry kernel visible, not host
docker run --runtime=runsc --rm ubuntu:24.04 cat /proc/version
# Linux version 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016
# gVisor's fake kernel version — host kernel is not visible
# Test 3: Kata — guest kernel visible
docker run --runtime=kata-clh --rm ubuntu:24.04 cat /proc/version
# Linux version 6.1.62-container (kata@...) ...
# Guest kernel version — different from host
# Test the syscall surface (assumes an image with strace installed; ubuntu:24.04 does not ship it)
# runc: every syscall goes to the host kernel, subject only to the seccomp filter
docker run --rm ubuntu:24.04 strace -c ls / 2>&1 | tail -5
# gVisor: the Sentry intercepts these syscalls; few reach the host kernel
docker run --runtime=runsc --rm ubuntu:24.04 strace -c ls / 2>&1 | tail -5
# Note: strace may not work inside gVisor, because it relies on ptrace, which gVisor restricts
# This is itself a security feature
# Boundary check: write to /proc/sysrq-trigger (needs --privileged; "h" only prints sysrq help to the kernel log)
docker run --rm --privileged ubuntu:24.04 sh -c "echo h > /proc/sysrq-trigger" 2>&1
# Under runc --privileged: succeeds and reaches the host kernel
# Under gVisor: expected to fail, because the Sentry's /proc is virtual, not the host's /proc
docker run --runtime=runsc --rm --privileged ubuntu:24.04 sh -c "echo h > /proc/sysrq-trigger" 2>&1
# The exact error text depends on the gVisor version; the write never reaches the host's sysrq handler
The --privileged flag behaves differently across runtimes. Under runc, --privileged gives the container nearly full host kernel access. Under gVisor, --privileged grants more capabilities within gVisor’s user-space kernel but cannot reach the real host kernel. This is the isolation guarantee.
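The three /proc/version tests boil down to one comparison: does the container report the same kernel release as the host? A minimal sketch of that comparison follows; `kernel_boundary` is a hypothetical helper, and a matching release string is only a heuristic, since a guest kernel could coincidentally carry the same version as the host.

```shell
# kernel_boundary HOST_RELEASE CONTAINER_RELEASE
# Hypothetical helper: report whether the container appears to share the
# host kernel, based on comparing `uname -r` output from both sides.
kernel_boundary() {
  if [ "$1" = "$2" ]; then
    echo "shared-kernel"
  else
    echo "separate-kernel"
  fi
}
# Usage on a host with the runtimes registered in Docker:
#   kernel_boundary "$(uname -r)" \
#     "$(docker run --rm --runtime=runsc ubuntu:24.04 uname -r)"
```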
2025-2026 Developments
gVisor: Improved Network Stack and Systrap Platform
The gVisor team released the Systrap platform in 2023 as a replacement for ptrace, and it has since become gVisor’s default platform. Systrap uses seccomp to trap syscalls into the Sentry without requiring ptrace, which reduces context-switch overhead on syscall-heavy workloads (by a reported 30-40%). On hosts without KVM, use Systrap rather than ptrace.
gVisor’s network stack (netstack) gained significant improvements in 2024-2025:
- RACK-TLP (Recent ACKnowledgement - Tail Loss Probe) for better TCP loss recovery
- UDP-GRO (Generic Receive Offload) reducing CPU overhead for high-throughput UDP
- IPv6 extension header support
Enable Systrap in the gVisor config:
# /etc/containerd/runsc.toml
[runsc_config]
platform = "systrap" # New: systrap instead of ptrace for non-KVM hosts
network = "sandbox"
Kata 3.x: Dragonball VMM and Confidential Containers
Kata Containers 3.x introduced Dragonball, a Rust-based VMM developed by Alibaba Cloud. It provides:
- Sub-100ms VM boot times
- Purpose-built for container workloads (no legacy device emulation)
- ~40MB memory overhead per pod (vs ~200MB for QEMU)
- Full integration with Kata’s virtio-fs for shared filesystem access
# configuration-dragonball.toml (Kata 3.x)
[hypervisor.dragonball]
path = "/opt/kata/bin/dragonball"
kernel = "/opt/kata/share/kata-containers/vmlinux.container"
default_memory = 512
enable_iothreads = true
Kata 3.x also deepened Confidential Containers support (CoCo), running the guest VM inside a hardware Trusted Execution Environment using AMD SEV-SNP or Intel TDX. This protects container memory from the hypervisor and host OS — relevant for regulated data processing where even the platform operator should not read workload memory.
# Check for SEV-SNP support (AMD)
dmesg | grep -i sev
# [ 0.000000] SEV-SNP: initialized
# Check for TDX support (Intel)
dmesg | grep -i tdx
# [ 0.000000] tdx: TDX module: attributes 0x0, vendor_id 0x8086
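The two dmesg checks can be combined into a single classifier for node inventory scripts. This is a heuristic sketch: `detect_tee` is a hypothetical helper, it only looks for the substrings shown above, and kernel log wording varies across kernel versions and vendors.

```shell
# detect_tee
# Hypothetical helper: read kernel log text on stdin and print which TEE
# technology it mentions: "sev-snp", "tdx", or "none".
detect_tee() {
  local log
  log="$(cat)"
  if printf '%s\n' "$log" | grep -qi 'sev-snp'; then
    echo "sev-snp"
  elif printf '%s\n' "$log" | grep -qi 'tdx'; then
    echo "tdx"
  else
    echo "none"
  fi
}
# Usage on a real host (may need root to read the kernel log):
#   dmesg | detect_tee
```

A "none" result means only that the log did not mention either technology; firmware settings and kernel configuration should be checked before concluding the hardware lacks support.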
Choosing the Right Runtime
The decision is not binary between runc and an alternative — it is a tiered model:
- Default runtime: crun on cgroup v2 hosts, runc elsewhere. Smallest runtime codebase, full OCI compatibility, no added overhead.
- Reduced kernel attack surface: gVisor (KVM or Systrap platform) for workloads that process untrusted input or run untrusted code but do not need root access or unusual kernel features. Expect up to about 20% overhead for CPU- and network-bound workloads, and 2-10x slowdowns for syscall-heavy or small-file I/O workloads (see Performance Characteristics).
- Full VM isolation: Kata Containers for workloads that legitimately need elevated privileges (build systems, legacy applications requiring root, workloads with unknown syscall profiles), multi-tenant environments, or regulated workloads requiring TEE support. Expect roughly 50-200MB of memory overhead per pod and 100-500ms of startup latency, depending on the hypervisor.
- Layer them: RuntimeClass lets you apply different runtimes per namespace or per workload type in the same cluster. There is no requirement to choose one runtime for all workloads.
The combination of gVisor for untrusted-input processing and Kata for privileged or CI workloads, alongside runc/crun for trusted services, provides defence-in-depth at the runtime layer without requiring architectural changes to applications.