io_uring Security and Hardening: Disabling, Restricting, and Auditing a Bypass-Prone Syscall Interface
Problem
io_uring is a high-performance asynchronous I/O interface introduced in Linux 5.1. Instead of issuing one syscall per operation, a process sets up shared memory ring buffers with the kernel and submits I/O operations by writing submission queue entries (SQEs). The kernel processes them in the background and posts completions to a separate ring.
For database engines, proxies, and network dataplanes this is a meaningful throughput win. It is also one of the hardest kernel subsystems to reason about from a security perspective, for three reasons:
- io_uring bypasses seccomp for the queued operations. A seccomp filter intercepts syscalls. io_uring submits work through shared memory ring buffers; the individual operations (IORING_OP_OPENAT, IORING_OP_READ, IORING_OP_SENDMSG, etc.) are not seccomp-filterable on most kernels. A sandbox that blocks openat at the syscall level does not block IORING_OP_OPENAT.
- The attack surface is large and evolving. Since 2019, io_uring has accumulated over 100 CVEs, including use-after-free, type confusion, and privilege escalation bugs. Google’s kCTF rewarded multiple critical io_uring exploits in 2023–2024. ChromeOS and Android disabled the interface for unprivileged processes. Docker Desktop disables it by default. Major distros restrict it for containers.
- Workloads rarely need it. Most production services perform I/O through glibc, Go’s runtime, or managed runtimes that do not use io_uring. You pay the kernel attack surface cost even when no process benefits.
This article covers four complementary controls: disabling io_uring globally, restricting it per-container via seccomp, constraining which opcodes a process can submit, and auditing io_uring usage across a fleet.
Target systems: Linux kernel 5.15+ (the io_uring_disabled sysctl arrived in 6.6, with backports in RHEL 9.4 and Ubuntu 24.04). Kubernetes 1.28+ for the container-level controls.
Threat Model
- Adversary: Attacker with code execution inside an unprivileged process — compromised application container, malicious userspace process on a shared host, or code executing under a seccomp sandbox (browser renderer, document converter, CI runner).
- Access level: Unprivileged user namespace or container, typically with a seccomp filter and minimal Linux capabilities.
- Objective: Reach vulnerable kernel code paths to achieve privilege escalation, read kernel memory (heap leak, KASLR bypass), or escape the sandbox.
- Blast radius: A successful io_uring kernel exploit grants root on the host. On a Kubernetes node this means access to every pod’s secrets, the kubelet credentials, and in cloud environments the instance metadata credentials. The standard kernel-exploit blast radius applies: one compromised pod becomes one compromised node.
- What this does not defend against: root-equivalent processes that can legitimately use io_uring (database engines running as their service user with required capabilities). The controls here target unprivileged workloads that do not need the interface.
Configuration
Option 1: Disable io_uring Globally
The cleanest control. Refuse any attempt to call io_uring_setup at the kernel level. Since 6.6 (backported to RHEL 9.4 and Ubuntu 24.04), a dedicated sysctl exists:
# /etc/sysctl.d/60-io-uring.conf
# Disable io_uring for all processes.
# 0 = allowed (default)
# 1 = disabled for unprivileged processes (CAP_SYS_ADMIN still allowed)
# 2 = disabled for everyone, including root
kernel.io_uring_disabled = 2
Apply and verify:
sudo sysctl -p /etc/sysctl.d/60-io-uring.conf
cat /proc/sys/kernel/io_uring_disabled
# 2
# Verify: with the sysctl set to 2, io_uring_setup should fail with EPERM for every caller, including root.
strace -e io_uring_setup -f -- ./io_uring_test_program
# io_uring_setup(8, {...}) = -1 EPERM (Operation not permitted)
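Set kernel.io_uring_disabled = 1 instead if one specific daemon legitimately uses io_uring. In that mode, processes with CAP_SYS_ADMIN (and members of the group named in kernel.io_uring_group, if you set one) can still create rings, while every other process gets EPERM. Use 2 when no process on the host needs it.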
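For older kernels (5.15–6.5) without the sysctl, a boot-time switch may be available on some distro kernels. The upstream kernel-parameters documentation does not list an io_uring.disabled parameter, so verify after the reboot that io_uring_setup actually fails; if it still succeeds, rely on Option 2 or on a kernel built without CONFIG_IO_URING. To try it, append to GRUB_CMDLINE_LINUX in /etc/default/grub: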
io_uring.disabled=1
Rebuild grub and reboot:
sudo update-grub # Debian/Ubuntu
sudo grub2-mkconfig -o /boot/grub2/grub.cfg # RHEL/Rocky
sudo reboot
Option 2: Block io_uring_setup via Seccomp
When you cannot disable io_uring host-wide (because one service needs it), block it for everything else via seccomp. The three io_uring syscalls are io_uring_setup (425), io_uring_enter (426), and io_uring_register (427).
For containers, extend the runtime’s default seccomp profile. Recent releases of the Docker and containerd default profiles block these syscalls unless CAP_SYS_ADMIN is granted; older releases allowed them, so verify the behavior of the runtime version you actually run:
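# Check that io_uring_setup is blocked for unprivileged containers.
# "sh -c true" never calls io_uring_setup, so invoke syscall 425 directly.
docker run --rm python:3-slim python3 -c '
import ctypes, os
libc = ctypes.CDLL(None, use_errno=True)
params = ctypes.create_string_buffer(128)  # zeroed io_uring_params
ret = libc.syscall(425, 8, params)
print(ret, os.strerror(ctypes.get_errno()) if ret < 0 else "ring created")
'
# Blocked by the profile: -1 Operation not permitted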
For Kubernetes, use the RuntimeDefault seccomp profile on every pod:
apiVersion: v1
kind: Pod
metadata:
name: app
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:1.0
For a custom seccomp profile, deny the three syscalls explicitly. The minimal profile below allows every other syscall, so merge the rule into your existing profile rather than deploying it as-is:
{
"defaultAction": "SCMP_ACT_ALLOW",
"syscalls": [
{
"names": ["io_uring_setup", "io_uring_enter", "io_uring_register"],
"action": "SCMP_ACT_ERRNO",
"errnoRet": 38
}
]
}
errnoRet: 38 returns ENOSYS, which signals to the application that the syscall is unimplemented. Most runtimes (glibc, Rust’s tokio with io-uring feature, Go’s experimental io_uring support) fall back to synchronous I/O when they see ENOSYS. Returning EPERM instead is more truthful but causes some applications to abort.
Option 3: Restrict Which Opcodes a Process Can Submit
If a service legitimately uses io_uring for file I/O but should never open new files or make network calls through it, constrain the opcodes via io_uring_register with IORING_REGISTER_RESTRICTIONS (available since kernel 5.10). This is set once during initialization and cannot be loosened afterwards.
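// restrict_io_uring.c
// Allow only IORING_OP_READ and IORING_OP_WRITE on pre-registered fds.
// Build (assuming liburing is installed): cc -c restrict_io_uring.c
#include <liburing.h>

// Returns 0 on success or a negative errno value.
int setup_restricted_ring(struct io_uring *ring)
{
    // Restrictions can only be registered while the ring is disabled,
    // so create it with IORING_SETUP_R_DISABLED.
    struct io_uring_params params = { .flags = IORING_SETUP_R_DISABLED };
    int ret = io_uring_queue_init_params(8, ring, &params);
    if (ret < 0)
        return ret;

    struct io_uring_restriction res[4] = {
        // Allow registering the fixed-file table after the ring is enabled.
        { .opcode = IORING_RESTRICTION_REGISTER_OP,
          .register_op = IORING_REGISTER_FILES },
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_READ },
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_WRITE },
        // Require every SQE to reference a registered (fixed) file.
        { .opcode = IORING_RESTRICTION_SQE_FLAGS_REQUIRED,
          .sqe_flags = IOSQE_FIXED_FILE },
    };
    ret = io_uring_register_restrictions(ring, res, 4);
    if (ret == 0)
        ret = io_uring_enable_rings(ring);  // locks the restrictions in place
    if (ret < 0)
        io_uring_queue_exit(ring);
    return ret;
}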
With this in place, a compromised process that controls submission queue entries cannot submit IORING_OP_OPENAT, IORING_OP_CONNECT, or any other opcode outside the allowlist: the kernel fails those requests with -EACCES instead of executing them. This applies only when the service itself sets up the restrictions. For third-party software that uses io_uring, Options 1 or 2 are the only reliable controls.
Option 4: Audit io_uring Usage Across the Fleet
Before disabling io_uring, measure who uses it. auditd records syscall invocations; eBPF lets you record per-opcode submissions without modifying applications.
Audit rule (auditd):
# /etc/audit/rules.d/io-uring.rules
-a always,exit -F arch=b64 -S io_uring_setup -k io_uring_usage
-a always,exit -F arch=b64 -S io_uring_register -k io_uring_usage
Reload and query:
sudo augenrules --load
sudo ausearch -k io_uring_usage --start today | \
awk '/comm=/ {for (i=1;i<=NF;i++) if ($i ~ /^comm=/) print $i}' | \
sort -u
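For opcode-level visibility, use bpftrace. The tracepoint name depends on the kernel version (newer kernels renamed it to io_uring_submit_req), so list the available ones first with bpftrace -l 'tracepoint:io_uring:*' and adjust the probe if needed: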
sudo bpftrace -e '
tracepoint:io_uring:io_uring_submit_sqe {
@opcodes[comm, args->opcode] = count();
}
interval:s:30 { print(@opcodes); clear(@opcodes); }
'
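Expected output: a frequency table of (process, opcode) pairs, with opcodes shown as numbers. Anything unexpected, such as a web service submitting IORING_OP_OPENAT or a sidecar issuing IORING_OP_CONNECT, warrants investigation. The counts do not include file paths or destination addresses, so follow up on suspicious pairs with targeted tracing.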
Expected Behaviour
After applying io_uring_disabled = 2:
| Signal | Before | After |
|---|---|---|
| io_uring_setup() | Returns valid ring fd | Returns EPERM for all users |
| Processes using io_uring | Work normally | Fall back to synchronous I/O or fail at startup |
| Kernel attack surface | io_uring code paths reachable from any process | io_uring code paths unreachable; exploit attempts fail at syscall entry |
| cat /proc/sys/kernel/io_uring_disabled | 0 | 2 |
| Benchmark throughput | High for io_uring-aware apps | Unchanged for apps using read/write/epoll; degraded for io_uring-native apps |
After seccomp-level blocking (Option 2):
- Containers with the RuntimeDefault profile receive EPERM for io_uring_setup.
- Privileged containers (with CAP_SYS_ADMIN) still have access; use Option 1 to block them too.
- Applications fall back to synchronous I/O if they handle the error; abort at startup if they do not.
Trade-offs
| Control | Security Benefit | Cost | Mitigation |
|---|---|---|---|
| io_uring_disabled = 2 | Removes the entire subsystem from the kernel attack surface | Any process that needs io_uring fails. Measure first via auditd/bpftrace. | Whitelist via = 1 and run the exempt service as a user with CAP_SYS_ADMIN. |
| Seccomp block in containers | Blocks unprivileged containers without touching the host kernel | Does not protect against root-in-container (CAP_SYS_ADMIN bypasses the block). Containers with the capability can still exploit. | Combine with allowPrivilegeEscalation: false, drop all capabilities, and use unprivileged user namespaces. |
| IORING_REGISTER_RESTRICTIONS | Fine-grained opcode allowlist enforced by the kernel | Requires source-code modification. Cannot be applied to third-party binaries. | Use for in-house services. Fall back to seccomp for everything else. |
| Performance impact of disabling | N/A | Throughput loss of 10-40% for io_uring-native workloads (benchmarked on Redis with async replication, FoundationDB, ScyllaDB). | Keep io_uring enabled (via = 1) on hosts dedicated to those databases; disable on general application nodes. |
| Auditing via bpftrace | Visibility into which processes use io_uring before you disable it | eBPF programs consume CPU (~1-3% on loaded systems) and require CAP_BPF to run. | Run audits on a representative sample of nodes for 24-48 hours; do not leave bpftrace running continuously. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Legitimate service depends on io_uring | Service fails with io_uring_setup: Operation not permitted or crashes at startup | systemd logs Failed to start, application logs mention io_uring initialization failure | Identify the service via audit logs. Either allow it via io_uring_disabled = 1 and run it with the needed capability, or reconfigure the service to use synchronous I/O (most have a config flag). |
| Seccomp block returns wrong errno | Application aborts instead of falling back | Application logs show unexpected termination on io_uring_setup | Change seccomp rule from errnoRet: 1 (EPERM) to errnoRet: 38 (ENOSYS). Most runtimes handle ENOSYS gracefully. |
| Setting reverts after reboot | /proc/sys/kernel/io_uring_disabled back to 0 | Audit rules show io_uring_setup calls succeeding again | Ensure the sysctl config file is in /etc/sysctl.d/ (persistent) rather than set via sysctl -w (runtime only). Verify with `sudo sysctl -a |
| Container runtime upgrade removes io_uring from default seccomp profile | New containers can call io_uring_setup successfully | ausearch -k io_uring_usage shows new processes using the interface after an upgrade | Pin your seccomp profile as a Localhost profile under /var/lib/kubelet/seccomp/profiles/ and reference it explicitly in pod specs. Do not depend on RuntimeDefault alone. |
| Exploit in io_uring code despite restrictions | Kernel panic, unexpected privilege escalation | Node abruptly reboots; security scanner flags new root processes | Keep kernels current. CVEs in io_uring are patched quickly by upstream and distros. Subscribe to your distro’s security advisory list. Disabling (Option 1) is the only reliable mitigation until patches land. |
| IORING_REGISTER_RESTRICTIONS bypass via opcode not covered | A legal opcode ends up doing something unexpected | Audit logs show unusual file access by the restricted process | Review the opcode allowlist before each kernel upgrade. Because restrictions are an allowlist, newly added opcodes (e.g., IORING_OP_FTRUNCATE in 6.9, IORING_OP_BIND and IORING_OP_LISTEN in 6.11) are rejected unless you add them; the review should confirm that every opcode still on the list is needed and that its behavior has not been extended. |