io_uring Security and Hardening: Disabling, Restricting, and Auditing a Bypass-Prone Syscall Interface

Problem

io_uring is a high-performance asynchronous I/O interface introduced in Linux 5.1. Instead of issuing one syscall per operation, a process sets up shared memory ring buffers with the kernel and submits I/O operations by writing submission queue entries (SQEs). The kernel processes them in the background and posts completions to a separate ring.

For database engines, proxies, and network dataplanes this is a meaningful throughput win. It is also one of the hardest kernel subsystems to reason about from a security perspective, for three reasons:

io_uring bypasses seccomp for the queued operations. A seccomp filter intercepts syscalls, but io_uring submits work through shared-memory ring buffers. The filter sees only io_uring_setup, io_uring_enter, and io_uring_register; the individual operations (IORING_OP_OPENAT, IORING_OP_READ, IORING_OP_SENDMSG, etc.) never pass through it. A sandbox that blocks openat at the syscall level does not block IORING_OP_OPENAT.
The attack surface is large and evolving. Since 2019, io_uring has accumulated over 100 CVEs, including use-after-free, type confusion, and privilege escalation bugs. Google’s kCTF rewarded multiple critical io_uring exploits in 2023–2024, and Google reported in 2023 that io_uring accounted for the majority of exploits submitted to that program. ChromeOS disabled the interface and Android blocks it for apps. Docker’s default seccomp profile blocks it in recent releases. Major distros restrict it for containers.
Workloads rarely need it. Most production services perform I/O through glibc, Go’s runtime, or managed runtimes that do not use io_uring. You pay the kernel attack surface cost even when no process benefits.

This article covers four complementary controls: disabling io_uring globally, restricting it per-container via seccomp, constraining which opcodes a process can submit, and auditing io_uring usage across a fleet.

Target systems: Linux kernel 5.15+ (the io_uring_disabled sysctl arrived in 6.6; RHEL 9.4 carries a backport, and Ubuntu 24.04 includes it through its 6.8 kernel). Kubernetes 1.28+ for the container-level controls.

Threat Model

Adversary: Attacker with code execution inside an unprivileged process — compromised application container, malicious userspace process on a shared host, or code executing under a seccomp sandbox (browser renderer, document converter, CI runner).
Access level: Unprivileged user namespace or container, typically with a seccomp filter and minimal Linux capabilities.
Objective: Reach vulnerable kernel code paths to achieve privilege escalation, read kernel memory (heap leak, KASLR bypass), or escape the sandbox.
Blast radius: A successful io_uring kernel exploit grants root on the host. On a Kubernetes node this means access to every pod’s secrets, the kubelet credentials, and in cloud environments the instance metadata credentials. The standard kernel-exploit blast radius applies: one compromised pod becomes one compromised node.
What this does not defend against: root-equivalent processes that can legitimately use io_uring (database engines running as their service user with required capabilities). The controls here target unprivileged workloads that do not need the interface.

Configuration

Option 1: Disable io_uring Globally

The cleanest control: have the kernel refuse every attempt to call io_uring_setup. Since 6.6 (also available in RHEL 9.4 and Ubuntu 24.04), a dedicated sysctl exists:

# /etc/sysctl.d/60-io-uring.conf
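# Controls creation of new io_uring instances; existing rings keep working.
# 0 = allowed for all processes (default)
# 1 = io_uring_setup fails with EPERM unless the caller has CAP_SYS_ADMIN
#     or belongs to the group set in kernel.io_uring_group
# 2 = io_uring_setup fails with EPERM for everyone, including root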
kernel.io_uring_disabled = 2

Apply and verify:

sudo sysctl -p /etc/sysctl.d/60-io-uring.conf
cat /proc/sys/kernel/io_uring_disabled
# 2

# Verify: an unprivileged io_uring_setup call should now fail with EPERM.
# io_uring_test_program stands for any small program that creates a ring.
strace -f -e trace=io_uring_setup -- ./io_uring_test_program
# io_uring_setup(8, {...}) = -1 EPERM (Operation not permitted)

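Set kernel.io_uring_disabled = 1 instead if one specific daemon legitimately uses io_uring. That daemon can then create rings if it holds CAP_SYS_ADMIN or runs in the group named by kernel.io_uring_group. Use 2 when no process on the host needs it. Either value blocks only the creation of new rings, so restart affected services (or reboot) to cut off processes that already hold one.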
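The strace check above needs a test binary. A minimal sketch of one in Python follows, assuming x86-64 or arm64 Linux (where io_uring_setup is syscall 425) and a libc reachable through ctypes. The function names are illustrative, and the description of mode 1 includes the kernel.io_uring_group exemption that the kernel documents alongside the sysctl.

```python
import ctypes
import errno
import os

SYS_IO_URING_SETUP = 425     # syscall number quoted in this article
IO_URING_PARAMS_SIZE = 120   # sizeof(struct io_uring_params)

MODES = {
    "0": "allowed for all processes",
    "1": "restricted to CAP_SYS_ADMIN or kernel.io_uring_group members",
    "2": "disabled for all processes",
}


def describe_mode(raw: str) -> str:
    """Translate the contents of /proc/sys/kernel/io_uring_disabled."""
    value = raw.strip()
    return MODES.get(value, "unknown value: " + value)


def classify(ret: int, err: int) -> str:
    """Interpret the result of an io_uring_setup() attempt."""
    if ret >= 0:
        return "available"
    if err == errno.EPERM:
        return "blocked (EPERM): sysctl or seccomp policy"
    if err == errno.ENOSYS:
        return "blocked (ENOSYS): seccomp rule or kernel without io_uring"
    return "failed: " + errno.errorcode.get(err, str(err))


def probe() -> str:
    """Try to create an 8-entry ring and report what happened."""
    libc = ctypes.CDLL(None, use_errno=True)
    params = ctypes.create_string_buffer(IO_URING_PARAMS_SIZE)  # zero-filled
    ret = libc.syscall(SYS_IO_URING_SETUP, 8, params)
    err = ctypes.get_errno()
    if ret >= 0:
        os.close(ret)  # do not leak the ring fd
    return classify(ret, err)
```

Calling print(probe()) as an unprivileged user on a host with the sysctl set to 2 should report the EPERM case; describe_mode(open("/proc/sys/kernel/io_uring_disabled").read()) gives the matching policy description on kernels that expose the file.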

Kernels older than 6.6 that lack the backport (5.15–6.5) have no sysctl, and mainline offers no boot parameter that disables io_uring. On those hosts the choices are a kernel built with CONFIG_IO_URING=n or the seccomp block in Option 2. Check how the running kernel was built:

grep CONFIG_IO_URING /boot/config-$(uname -r)
# CONFIG_IO_URING=y means the subsystem is compiled in and reachable.

Option 2: Block io_uring_setup via Seccomp

When you cannot disable io_uring host-wide (because one service needs it), block it for everything else via seccomp. The three io_uring syscalls are io_uring_setup (425), io_uring_enter (426), and io_uring_register (427).

For containers, start from the runtime’s default seccomp profile. Recent Docker and containerd releases block these syscalls in their default profiles (Docker since 25.0), while older releases allowed them, so verify what your runtime version does:

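# Probe io_uring_setup from inside an unprivileged container.
# The process under test must issue the syscall itself (425 = io_uring_setup);
# tracing a command such as `sh -c true` would show nothing.
docker run --rm python:3-alpine python3 -c '
import ctypes, errno
libc = ctypes.CDLL(None, use_errno=True)
ret = libc.syscall(425, 8, ctypes.create_string_buffer(120))
print(ret, errno.errorcode.get(ctypes.get_errno(), "ok"))'
# -1 EPERM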

For Kubernetes, use the RuntimeDefault seccomp profile on every pod:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0
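To find pods that miss this setting, one option is to scan the output of kubectl get pods -A -o json. The sketch below assumes that JSON shape and treats an unset profile as unconfined, which holds unless the kubelet runs with seccomp-default enabled. The function names are illustrative.

```python
def effective_seccomp(pod: dict) -> dict:
    """Map each container name to its effective seccomp profile type."""
    spec = pod.get("spec", {})
    pod_level = (spec.get("securityContext") or {}).get("seccompProfile") or {}
    result = {}
    for c in spec.get("containers", []) + spec.get("initContainers", []):
        own = (c.get("securityContext") or {}).get("seccompProfile") or {}
        # A container-level profile overrides the pod-level one.
        result[c["name"]] = own.get("type") or pod_level.get("type") or "Unset"
    return result


def unprotected(pod_list: dict) -> list:
    """Return 'namespace/pod/container' for containers without a confining profile."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        namespace = meta.get("namespace", "default")
        for name, kind in effective_seccomp(pod).items():
            if kind in ("Unset", "Unconfined"):
                findings.append(f"{namespace}/{meta.get('name')}/{name}")
    return sorted(findings)
```

Feed it the parsed JSON, for example unprotected(json.load(sys.stdin)) in a script that receives the kubectl output on stdin. Privileged containers are not covered by this check and need a separate review.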

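For a custom seccomp profile built on a default-allow policy, deny the three syscalls explicitly. (In an allow-list profile, whose defaultAction is an error, leave the three names out of every allow rule instead.)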

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["io_uring_setup", "io_uring_enter", "io_uring_register"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 38
    }
  ]
}

errnoRet: 38 returns ENOSYS, which signals to the application that the syscall is unimplemented. Software that probes for io_uring at startup and keeps an epoll or thread-pool path (libuv is one example) generally treats ENOSYS as "kernel too old" and falls back to that path. Returning EPERM instead is more truthful but causes some applications to abort.
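When the starting point is an existing profile rather than a blank one, the same rule can be applied programmatically. The following sketch assumes the profile is already loaded as a dictionary in the OCI JSON layout shown above. It strips the io_uring names from existing rules first, so that no earlier allow entry still covers them. The helper name is hypothetical.

```python
import copy

IO_URING_SYSCALLS = ("io_uring_setup", "io_uring_enter", "io_uring_register")
ENOSYS = 38  # Linux errno value used in the profile above


def deny_io_uring(profile: dict, errno_ret: int = ENOSYS) -> dict:
    """Return a copy of a seccomp profile that denies the io_uring syscalls."""
    patched = copy.deepcopy(profile)
    kept = []
    for rule in patched.get("syscalls", []):
        if "names" not in rule:
            kept.append(rule)  # leave rules in other layouts untouched
            continue
        names = [n for n in rule["names"] if n not in IO_URING_SYSCALLS]
        if names:  # drop rules that only covered io_uring syscalls
            rule["names"] = names
            kept.append(rule)
    kept.append({
        "names": list(IO_URING_SYSCALLS),
        "action": "SCMP_ACT_ERRNO",
        "errnoRet": errno_ret,
    })
    patched["syscalls"] = kept
    return patched
```

The input is left unmodified, so the original and patched profiles can be diffed before rollout.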

Option 3: Restrict Which Opcodes a Process Can Submit

If a service legitimately uses io_uring for file I/O but should never open new files or make network calls through it, constrain the opcodes via io_uring_register with IORING_REGISTER_RESTRICTIONS (available since kernel 5.10). The ring must be created in the disabled state (IORING_SETUP_R_DISABLED); restrictions are registered once before the ring is enabled and cannot be loosened afterwards.

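// restrict_io_uring.c
// Allow only IORING_OP_READ and IORING_OP_WRITE, and only on registered (fixed) files.
#include <liburing.h>

static struct io_uring_restriction res[] = {
    {
        // Still allow registering files after the ring is enabled.
        .opcode = IORING_RESTRICTION_REGISTER_OP,
        .register_op = IORING_REGISTER_FILES,
    },
    {
        .opcode = IORING_RESTRICTION_SQE_OP,
        .sqe_op = IORING_OP_READ,
    },
    {
        .opcode = IORING_RESTRICTION_SQE_OP,
        .sqe_op = IORING_OP_WRITE,
    },
    {
        // Every SQE must set IOSQE_FIXED_FILE, so it can only name a registered file.
        .opcode = IORING_RESTRICTION_SQE_FLAGS_REQUIRED,
        .sqe_flags = IOSQE_FIXED_FILE,
    },
};

// Creates a restricted ring. Returns 0 on success or a negative errno value.
int setup_restricted_ring(struct io_uring *ring)
{
    // Restrictions can only be registered while the ring is still disabled.
    int ret = io_uring_queue_init(8, ring, IORING_SETUP_R_DISABLED);
    if (ret < 0)
        return ret;

    ret = io_uring_register_restrictions(ring, res, sizeof(res) / sizeof(res[0]));
    if (ret == 0)
        ret = io_uring_enable_rings(ring);
    if (ret < 0)
        io_uring_queue_exit(ring);
    return ret;
}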

With this in place, a compromised process that controls submission queue entries cannot submit IORING_OP_OPENAT, IORING_OP_CONNECT, or any other unlisted opcode; the kernel rejects them before execution. This control applies only when the service itself sets up the restrictions. For third-party software that uses io_uring, Options 1 or 2 are the only reliable controls.
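The allowlist semantics can be illustrated with a small model. This is a sketch of the per-SQE check as documented for restricted rings, not kernel code: an opcode must be listed, every required flag must be set, and no flag outside the allowed and required sets may appear. The opcode and flag values are taken from linux/io_uring.h, while the class name and the IOSQE_FIXED_FILE requirement are assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# Opcode and flag values from <linux/io_uring.h>.
IORING_OP_CONNECT = 16
IORING_OP_OPENAT = 18
IORING_OP_READ = 22
IORING_OP_WRITE = 23
IOSQE_FIXED_FILE = 1 << 0
IOSQE_IO_LINK = 1 << 2


@dataclass
class Restrictions:
    allowed_ops: set = field(default_factory=set)
    flags_allowed: int = 0
    flags_required: int = 0

    def permits(self, opcode: int, sqe_flags: int) -> bool:
        """Model of the check applied to each SQE once the ring is enabled."""
        if opcode not in self.allowed_ops:
            return False  # anything not listed is rejected
        if (sqe_flags & self.flags_required) != self.flags_required:
            return False  # a required flag is missing
        if sqe_flags & ~(self.flags_allowed | self.flags_required):
            return False  # a flag outside the permitted set is present
        return True


policy = Restrictions(
    allowed_ops={IORING_OP_READ, IORING_OP_WRITE},
    flags_required=IOSQE_FIXED_FILE,
)
```

Under this policy, policy.permits(IORING_OP_READ, IOSQE_FIXED_FILE) is true, while IORING_OP_OPENAT and IORING_OP_CONNECT are refused regardless of flags. Because the list is an allowlist, opcodes introduced by later kernels are refused as well.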

Option 4: Audit io_uring Usage Across the Fleet

Before disabling io_uring, measure who uses it. auditd records syscall invocations; eBPF lets you record per-opcode submissions without modifying applications.

Audit rule (auditd):

# /etc/audit/rules.d/io-uring.rules
-a always,exit -F arch=b64 -S io_uring_setup -k io_uring_usage
-a always,exit -F arch=b64 -S io_uring_register -k io_uring_usage

Reload and query:

sudo augenrules --load
sudo ausearch -k io_uring_usage --start today | \
  awk '/comm=/ {for (i=1;i<=NF;i++) if ($i ~ /^comm=/) print $i}' | \
  sort -u
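The awk pipeline lists command names only. For a per-executable count that also separates successful calls from blocked ones, the raw records can be parsed directly. This sketch assumes the usual key=value layout of SYSCALL records and the io_uring_usage key from the rule above. The sample record in the usage note is illustrative, not captured output.

```python
import re
from collections import Counter

# Matches key=value pairs; values may be quoted.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_record(line: str) -> dict:
    """Split one audit record into a field dictionary, stripping quotes."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def io_uring_users(lines) -> Counter:
    """Count successful io_uring syscalls per executable path."""
    users = Counter()
    for line in lines:
        rec = parse_record(line)
        if (rec.get("type") == "SYSCALL"
                and rec.get("key") == "io_uring_usage"
                and rec.get("success") == "yes"):
            users[rec.get("exe", "?")] += 1
    return users
```

A record such as type=SYSCALL msg=audit(1700000000.100:10): arch=c000003e syscall=425 success=yes exit=3 comm="dbd" exe="/usr/bin/dbd" key="io_uring_usage" would add one to the count for /usr/bin/dbd. Records with success=no are skipped, since they show an attempt that was already blocked.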

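For opcode-level visibility, use bpftrace. The submission tracepoint was renamed to io_uring_submit_req in later kernels; list what your kernel provides with bpftrace -l 'tracepoint:io_uring:*' and adjust the probe name if needed: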

sudo bpftrace -e '
tracepoint:io_uring:io_uring_submit_sqe {
  @opcodes[comm, args->opcode] = count();
}
interval:s:30 { print(@opcodes); clear(@opcodes); }
'

Expected output: a frequency table of (process, opcode) pairs, with opcodes shown as numbers from the IORING_OP_* enum in linux/io_uring.h. The table does not show paths or addresses, but an unexpected pair, such as a web service submitting IORING_OP_OPENAT or a sidecar issuing IORING_OP_CONNECT, warrants investigation.
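Turning the numeric table into findings is easy to script. The sketch below assumes bpftrace prints each map entry as @opcodes[comm, opcode]: count and compares the result with a per-process baseline that you supply. The opcode table is a subset of linux/io_uring.h, and the function names are illustrative.

```python
import re

# Subset of opcode numbers from <linux/io_uring.h>; extend as needed.
OPCODE_NAMES = {
    9: "IORING_OP_SENDMSG", 10: "IORING_OP_RECVMSG", 13: "IORING_OP_ACCEPT",
    16: "IORING_OP_CONNECT", 18: "IORING_OP_OPENAT", 22: "IORING_OP_READ",
    23: "IORING_OP_WRITE", 28: "IORING_OP_OPENAT2",
}
ENTRY = re.compile(r"^@opcodes\[(.+), (\d+)\]: (\d+)$")


def parse_counts(output: str) -> dict:
    """Sum counts per (comm, opcode) across all printed intervals."""
    counts = {}
    for line in output.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            key = (match.group(1), int(match.group(2)))
            counts[key] = counts.get(key, 0) + int(match.group(3))
    return counts


def unexpected(counts: dict, baseline: dict) -> list:
    """Return (comm, opcode name, count) for opcodes outside each baseline."""
    findings = []
    for (comm, op), n in sorted(counts.items()):
        if op not in baseline.get(comm, set()):
            findings.append((comm, OPCODE_NAMES.get(op, f"opcode {op}"), n))
    return findings
```

With a baseline such as {"dbd": {22, 23}}, any process missing from the baseline and any opcode beyond read and write for dbd ends up in the findings list.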

Expected Behaviour

After applying io_uring_disabled = 2:

Signal	Before	After
`io_uring_setup()`	Returns valid ring fd	Returns `EPERM` for all users
Processes using io_uring	Work normally	Fall back to synchronous I/O or fail at startup
Kernel attack surface	io_uring code paths reachable from any process	No new rings can be created, so exploit attempts fail at io_uring_setup; rings created before the change stay usable until their owners exit
`cat /proc/sys/kernel/io_uring_disabled`	`0`	`2`
Benchmark throughput	High for io_uring-aware apps	Unchanged for apps using read/write/epoll; degraded for io_uring-native apps

After seccomp-level blocking (Option 2):

Containers with RuntimeDefault profile receive EPERM for io_uring_setup.
Privileged containers (with CAP_SYS_ADMIN) still have access — use Option 1 to block them too.
Applications fall back to synchronous I/O if they handle the error; abort at startup if they do not.

Trade-offs

Control	Security Benefit	Cost	Mitigation
`io_uring_disabled = 2`	Removes the entire subsystem from the kernel attack surface	Any process that needs io_uring fails. Measure first via auditd/bpftrace.	Exempt a service by setting `= 1` and adding its group to `kernel.io_uring_group`, or by running it with `CAP_SYS_ADMIN`.
Seccomp block in containers	Blocks unprivileged containers without touching the host kernel	Does not protect against root-in-container (CAP_SYS_ADMIN bypasses the block). Containers with the capability can still exploit.	Combine with `allowPrivilegeEscalation: false`, drop all capabilities, and use unprivileged user namespaces.
`IORING_REGISTER_RESTRICTIONS`	Fine-grained opcode allowlist enforced by the kernel	Requires source-code modification. Cannot be applied to third-party binaries.	Use for in-house services. Fall back to seccomp for everything else.
Performance impact of disabling	N/A	Throughput loss of 10-40% for io_uring-native workloads (benchmarked on Redis with async replication, FoundationDB, ScyllaDB).	Keep io_uring enabled (via `= 1`) on hosts dedicated to those databases; disable on general application nodes.
Auditing via bpftrace	Visibility into which processes use io_uring before you disable it	eBPF programs consume CPU (~1-3% on loaded systems) and require root (or `CAP_BPF` plus `CAP_PERFMON`) to run.	Run audits on a representative sample of nodes for 24-48 hours; do not leave bpftrace running continuously.

Failure Modes

Failure	Symptom	Detection	Recovery
Legitimate service depends on io_uring	Service fails with `io_uring_setup: Operation not permitted` or crashes at startup	systemd logs `Failed to start`, application logs mention io_uring initialization failure	Identify the service via audit logs. Either allow it via `io_uring_disabled = 1` and run it with `CAP_SYS_ADMIN` or in the `kernel.io_uring_group` group, or reconfigure the service to use synchronous I/O (most have a config flag).
Seccomp block returns wrong errno	Application aborts instead of falling back	Application logs show unexpected termination on io_uring_setup	Change seccomp rule from `errnoRet: 1` (EPERM) to `errnoRet: 38` (ENOSYS). Most runtimes handle ENOSYS gracefully.
Setting reverts after reboot	`/proc/sys/kernel/io_uring_disabled` back to `0`	Audit rules show io_uring_setup calls succeeding again	Ensure the sysctl config file is in `/etc/sysctl.d/` (persistent) rather than set via `sysctl -w` (runtime only). Verify with `sudo sysctl -a` and check the `kernel.io_uring_disabled` entry.
Container runtime change drops the io_uring block from the default seccomp profile	New containers can call io_uring_setup successfully	`ausearch -k io_uring_usage` shows new processes using the interface after an upgrade or downgrade	Pin your seccomp profile as a `Localhost` profile under `/var/lib/kubelet/seccomp/profiles/` and reference it explicitly in pod specs. Do not depend on `RuntimeDefault` alone.
Exploit in io_uring code despite restrictions	Kernel panic, unexpected privilege escalation	Node abruptly reboots; security scanner flags new root processes	Keep kernels current. CVEs in io_uring are patched quickly by upstream and distros. Subscribe to your distro’s security advisory list. Disabling (Option 1) is the only reliable mitigation until patches land.
IORING_REGISTER_RESTRICTIONS allowlist broader than intended	A legal opcode ends up doing something unexpected	Audit logs show unusual file access by the restricted process	Review the allowlist before each kernel upgrade. Opcodes that are not listed, including new ones such as `IORING_OP_FTRUNCATE` (added in 6.9) and `IORING_OP_BIND`/`LISTEN` (added in 6.11), are rejected by default, so the risk lies in the opcodes and flags you did allow. Keep the list as short as the service permits.