Double-Fetch Vulnerabilities in the Linux Network Stack: skb Races and TOCTOU in Packet Handling

Problem

A double-fetch vulnerability is a specific class of time-of-check time-of-use (TOCTOU) bug in which the kernel reads the same value from userspace memory — or from memory shared between kernel and userspace — on two separate occasions. The first read is used to validate or bound a value. The second read is where that value is acted upon. Because userspace memory can be modified by any thread belonging to the same process, an attacker can race a separate thread to change the value between the two kernel reads, causing the kernel to operate on a value it never validated.

The canonical shape of the bug looks like this in C pseudocode:

/* VULNERABLE: double-fetch pattern */
int setsockopt_handler(struct socket *sock, int level, int optname,
                       char __user *optval, unsigned int optlen)
{
    int len;

    /* First fetch: validate the length */
    if (get_user(len, (int __user *)optval))
        return -EFAULT;

    if (len < 0 || len > MAX_OPT_LEN)
        return -EINVAL;

    /* --- attacker races here, rewrites *optval to a value above MAX_OPT_LEN --- */

    /* Second fetch: use the length to copy */
    if (get_user(len, (int __user *)optval))
        return -EFAULT;

    /* len is no longer the validated value, so this copy can overrun
     * kernel_buf (a MAX_OPT_LEN-byte kernel buffer allocated elsewhere) */
    if (copy_from_user(kernel_buf, optval + sizeof(int), len))
        return -EFAULT;

    return 0;
}

Between the two get_user calls, the attacking process has a window — typically a handful of nanoseconds to a few microseconds, depending on CPU scheduling — in which a racing thread can mmap a new value over the same page, flip a value in shared memory, or use userfaultfd to stall the kernel and then resume it with a substituted value. The userfaultfd technique, in particular, serialises the race to near-100% reliability by causing the kernel to block during the first fault, giving the attacker unlimited time to rewrite the shared page before the second fetch occurs.
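The race can be reasoned about without real threads. The sketch below is a deterministic userspace model, not kernel code: struct user_mem, fetch_len and flip_to_huge are hypothetical names, and the on_fetch callback simply stands in for the racing thread by running immediately after the first read.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPT_LEN 64

/* Hypothetical stand-in for a user page holding an attacker-controlled length. */
struct user_mem {
    int len;
    void (*on_fetch)(struct user_mem *);  /* models the racing thread; may be NULL */
};

/* Models get_user(): return the current value, then let the "racer" run. */
static int fetch_len(struct user_mem *um)
{
    int v;

    assert(um != NULL);
    v = um->len;
    if (um->on_fetch)
        um->on_fetch(um);
    return v;
}

/* Racing write: after the first read, swap in an oversized length once. */
static void flip_to_huge(struct user_mem *um)
{
    um->len = 4096;
    um->on_fetch = NULL;
}

/* Double fetch: validates the first read, returns the second read. */
static int double_fetch_len(struct user_mem *um)
{
    int len = fetch_len(um);            /* check */

    if (len < 0 || len > MAX_OPT_LEN)
        return -1;
    return fetch_len(um);               /* use: never validated */
}

/* Single fetch: validates and returns the same local copy. */
static int single_fetch_len(struct user_mem *um)
{
    int len = fetch_len(um);

    if (len < 0 || len > MAX_OPT_LEN)
        return -1;
    return len;
}
```

With len starting at 16 and flip_to_huge installed, double_fetch_len hands back the oversized value even though validation passed, while single_fetch_len returns the value it actually checked.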

Where Double-Fetches Appear in the Network Stack

Socket option handling (setsockopt/getsockopt). Many socket option handlers read a length field from userspace before copying option data. The pattern recurs across AF_PACKET, AF_UNIX, AF_NETLINK, and IP socket option handlers. When the length field resides in the same userspace buffer as the option payload, a single racing write defeats the validation.

Netlink message parsing. Netlink is the primary kernel-userspace interface for network configuration: routing, firewall rules, interface management. An nlmsghdr carries an nlmsg_len field specifying the total message size. The legacy parser (named nlmsg_parse before the strict variants were introduced, and nlmsg_parse_deprecated in current kernels) validated nlmsg_len and then passed the same pointer further down the stack, where attribute parsers re-examined the length field from the original pointer rather than a kernel-side copy. This opened a window during complex message parsing, when the kernel traversed nested attributes, for the attacker to shrink nlmsg_len after the boundary check had passed.

skb clone operations. Socket buffers (sk_buff, universally abbreviated skb) use reference counting to allow the same packet data to be shared across multiple consumers. For example, a cloned skb can be passed to an AF_PACKET socket monitor while the original continues through the IP stack. The skb_shared_info structure appended after the packet data is in kernel memory, so it is not directly vulnerable to userspace racing. However, the packet data itself, when it has not yet been pulled into linear memory from a user-supplied scatter-gather list, can remain in pages that userspace still has mapped. Drivers and protocol handlers that inspect packet fields twice (once to route and once to process) without calling skb_linearize first have exhibited double-fetch characteristics for remotely-supplied data.

BPF map value access. The BPF verifier analyses each bytecode instruction to prove memory safety before a program is loaded. When a BPF program reads from a map value, the verifier marks the register as containing a bounded pointer at analysis time. If the underlying map value can be modified by another CPU between the verifier’s analysis pass and the JIT-compiled program’s execution, the bounds the verifier proved no longer apply. Several CVEs in this area involve the verifier failing to account for concurrent map writes, allowing a carefully constructed program to read an out-of-bounds kernel address after JIT compilation accepted it.

Historical CVEs

CVE-2016-6516: a double-fetch in the ioctl(FIDEDUPERANGE) path (ioctl_file_dedupe_range). It is not network-specific, but it established the exploit technique against copy_from_user-adjacent code that the network CVEs refined.

CVE-2017-7533: an inotify race against a concurrent rename; again a template for the concurrent-write-then-race technique used in later network CVEs.

CVE-2021-3490: BPF ALU32 bounds tracking error for bitwise operations, allowing out-of-bounds read/write. It is not a classical double-fetch: the verifier derived incorrect 32-bit bounds rather than re-reading a mutated value. It is a cousin of the class in that the kernel again acted on a value outside the range it believed it had validated.

CVE-2022-23222: BPF verifier flaw allowing unprivileged local users to elevate privileges via incorrect pointer arithmetic tracking. Like CVE-2021-3490, it is a validation gap rather than a literal second fetch.

CVE-2023-0179: Netfilter nft_payload stack buffer overflow in the nftables subsystem, allowing local privilege escalation. Public analyses attribute it to an integer underflow in the length used for a VLAN header copy, so the length that was checked and the length that was copied diverged without a second userspace read. Affects kernels 5.5 through 6.1.

Target kernel versions for these bugs cluster in the 5.x and early 6.x series. The net subsystem, BPF, and netfilter collectively account for the largest share of double-fetch CVEs in recent kernel security advisories.


Threat Model

Scenario 1: Local Attacker via setsockopt Double-Fetch → Heap Overflow → LPE

A local unprivileged user creates a socket and calls setsockopt with a crafted option buffer whose embedded length field holds a value at or just below MAX_OPT_LEN. A racing thread rewrites the length to a large value after kernel validation but before the copy_from_user call. The kernel copies attacker-controlled data beyond the allocated heap slab, corrupting adjacent objects. From there, standard kernel heap exploitation techniques (overwriting struct cred, corrupting a function pointer in a neighbouring sk_buff, or targeting modprobe_path) lead to privilege escalation.

With SMAP enabled: The kernel cannot access userspace memory without going through get_user or copy_from_user, which use an explicit stac/clac pair to temporarily set the AC flag. SMAP does not eliminate the double-fetch window. It prevents the attacker from aiming corrupted kernel pointers at fake objects or pivoted stacks in userspace (its companion SMEP blocks execution of userspace shellcode), but the race itself remains exploitable if the attacker's goal is heap corruption rather than code injection.

Without SMAP: The attacker can additionally stage fake objects or ROP stacks in userspace and, where SMEP is also absent, redirect execution from the corrupted heap object directly into userspace shellcode, significantly reducing exploitation complexity.

Scenario 2: Container Attacker with CAP_NET_ADMIN via Netlink Length TOCTOU → Information Disclosure or Memory Corruption

Most netlink configuration requests require CAP_NET_ADMIN, but within a user namespace or container environment where the attacker has obtained that capability legitimately, crafted RTM_NEWROUTE or NFT_MSG_NEWRULE messages with a mutated nlmsg_len can exploit TOCTOU in message length validation. The attacker sends a message whose nlmsg_len passes the outer boundary check, then races a second write to shrink it below the size of a nested attribute the kernel is mid-way through parsing. This can cause the kernel to read attribute data beyond the intended boundary, leading to information disclosure or memory corruption.

Blast radius with namespace isolation: If the container runs in a dedicated network namespace, the attacker controls only the routes and firewall rules within that namespace, and the corrupted kernel memory is constrained to data structures for that namespace. This does not prevent kernel exploitation — kernel memory corruption is kernel memory corruption regardless of namespace — but it does mean the attacker must first obtain CAP_NET_ADMIN within the namespace, raising the bar from a pure unprivileged starting position.

Scenario 3: BPF Verifier Bypass via Map Value Double-Fetch → Arbitrary Kernel Write

An unprivileged user (when kernel.unprivileged_bpf_disabled=0) loads a BPF program that reads a map value into a register. The verifier checks the value's range at load time and approves the program. At runtime, a second userspace thread modifies the map value concurrently with the JIT-compiled program's execution. If the JIT code reads the map value twice, once for a branch condition and once for pointer arithmetic, the second read can produce an out-of-bounds value the verifier never validated. The result is an arbitrary kernel read or write primitive.

With kernel.unprivileged_bpf_disabled=2: Unprivileged users cannot load BPF programs at all. This eliminates Scenario 3 entirely at the cost of restricting observability tools that rely on unprivileged BPF.
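The two-read hazard in Scenario 3 disappears when the shared value is loaded exactly once into a local. The following is a userspace sketch using C11 atomics, not BPF or verifier code; struct shared_slot and lookup_snapshot are hypothetical names, and the slot merely stands in for a map value that another CPU may rewrite at any time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Hypothetical shared slot that a concurrent writer may update. */
struct shared_slot {
    _Atomic unsigned int index;
};

/*
 * Loads the shared index once. The bounds check and the array access
 * both use the local snapshot, so a concurrent store cannot slip in
 * between them.
 */
static int lookup_snapshot(struct shared_slot *slot, const int *table, int *out)
{
    unsigned int idx;

    assert(slot != NULL && table != NULL && out != NULL);
    idx = atomic_load_explicit(&slot->index, memory_order_relaxed);

    if (idx >= TABLE_SIZE)
        return -1;          /* snapshot out of range: reject */
    *out = table[idx];      /* idx is a local and cannot change */
    return 0;
}
```

The design point is that the check and the use consume the same register-resident value; whether the writer wins or loses the race, the function never indexes with a number it did not test.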


Configuration and Implementation

The Correct Pattern: Copy Once, Work from the Kernel Copy

The fundamental fix for every double-fetch vulnerability is identical: copy the untrusted value into a kernel-owned variable once, then use only the kernel-owned copy for all subsequent operations.

/* VULNERABLE: reads optval->len twice from userspace */
static int vulnerable_setsockopt(struct socket *sock,
                                 char __user *optval, unsigned int optlen)
{
    struct my_opt __user *uopt = (struct my_opt __user *)optval;
    int data_len;

    /* First fetch */
    if (get_user(data_len, &uopt->len))
        return -EFAULT;
    if (data_len < 0 || data_len > MY_MAX_LEN)
        return -EINVAL;

    /* RACE WINDOW: attacker mutates uopt->len here */

    /* Second fetch — now data_len may differ from what was validated */
    if (get_user(data_len, &uopt->len))
        return -EFAULT;

    return do_copy(sock, uopt->data, data_len);
}

/* CORRECT: copy the entire structure once into kernel memory */
static int hardened_setsockopt(struct socket *sock,
                                char __user *optval, unsigned int optlen)
{
    struct my_opt kopt;  /* kernel-owned copy */

    if (optlen < sizeof(kopt))
        return -EINVAL;

    /* Single copy_from_user pulls the entire struct in one pass */
    if (copy_from_user(&kopt, optval, sizeof(kopt)))
        return -EFAULT;

    /* All subsequent reads use kopt, not optval */
    if (kopt.len < 0 || kopt.len > MY_MAX_LEN)
        return -EINVAL;

    return do_copy(sock, kopt.data, kopt.len);
}

The copy_from_user call is not atomic in the CPU instruction sense (on x86 it is typically a rep movsb loop), but it is a single traversal of the userspace data. Any mutation the attacker makes after copy_from_user returns has no effect on kopt, and a mutation made during the copy can at worst leave a torn value in kopt that the subsequent validation still checks. The key discipline is never re-reading through the original __user pointer after validation.
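A common variant is a message whose payload size is only known after reading a header, which forces two copies. One way to keep that safe, sketched below in userspace, is to overwrite the length field in the kernel copy with the value that was validated. Everything here is an illustrative assumption: model_copy_from_user stands in for copy_from_user, and struct racer and the race pointer exist only to simulate the attacker's write between the two copies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MY_MAX_LEN 256

/* Hypothetical wire format: fixed header followed by hdr.len payload bytes. */
struct msg_hdr {
    unsigned int len;
};

/* Test hook that models the racing thread; NULL means no race. */
struct racer {
    struct msg_hdr *victim;
    unsigned int new_len;
};
static struct racer *race;

/* Stand-in for copy_from_user(); returns 0 on success like the kernel helper. */
static int model_copy_from_user(void *dst, const void *usrc, size_t n)
{
    memcpy(dst, usrc, n);
    if (race) {                       /* attacker rewrites the header once */
        race->victim->len = race->new_len;
        race = NULL;
    }
    return 0;
}

/* Returns a heap copy of header plus payload, or NULL on error. */
static struct msg_hdr *fetch_msg(const void *umsg)
{
    struct msg_hdr hdr, *kmsg;

    assert(umsg != NULL);
    if (model_copy_from_user(&hdr, umsg, sizeof(hdr)))
        return NULL;
    if (hdr.len > MY_MAX_LEN)
        return NULL;

    kmsg = malloc(sizeof(hdr) + hdr.len);
    if (!kmsg)
        return NULL;
    if (model_copy_from_user(kmsg, umsg, sizeof(hdr) + hdr.len)) {
        free(kmsg);
        return NULL;
    }
    /* The second copy re-fetched the header; discard its length field. */
    kmsg->len = hdr.len;
    return kmsg;
}
```

The allocation size, the copy size and the stored length all derive from the single validated hdr.len, so later code that trusts kmsg->len cannot be misled by the second fetch.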

SMAP Enforcement

Supervisor Mode Access Prevention (SMAP) is an x86 architectural control that makes any kernel access to userspace pages fault unless the AC flag has been set with the explicit stac instruction. It does not prevent double-fetch races, because both fetches go through get_user or copy_from_user, which set and clear the AC flag correctly. What it closes is the secondary attack surface of a corrupted kernel pointer being aimed at attacker-prepared data in userspace pages; its companion SMEP does the same for instruction fetches, which blocks userspace shellcode. Verify SMAP is active:

grep -w smap /proc/cpuinfo | head -1
# Expected: flags: ... smap ...

dmesg | grep -i smap
# Optional cross-check: some kernels may log nothing for SMAP,
# so treat the /proc/cpuinfo flag as the authoritative signal

On virtual machines, the hypervisor must expose the SMAP CPUID bit to the guest. Check the VM CPU model: qemu should use -cpu host or a named model that includes SMAP. VMware and Hyper-V expose SMAP by default on recent releases.
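Where the check has to run inside an agent rather than a shell, the whole-word match that grep -w performs can be reproduced with a small helper. This is a minimal sketch; has_cpu_flag is a hypothetical name, and it assumes the caller has already read the flags line from /proc/cpuinfo.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `flag` appears as a whole whitespace-delimited word in `flags_line`. */
static int has_cpu_flag(const char *flags_line, const char *flag)
{
    size_t n;
    const char *p;

    assert(flags_line != NULL && flag != NULL);
    n = strlen(flag);
    if (n == 0)
        return 0;

    p = flags_line;
    while ((p = strstr(p, flag)) != NULL) {
        int start_ok = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        int end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';

        if (start_ok && end_ok)
            return 1;
        p++;   /* partial match (for example "nosmap"); keep scanning */
    }
    return 0;
}
```

Matching whole words matters here because a substring search would also accept unrelated tokens that merely contain the letters "smap".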

Strict Netlink Parsing

The nlmsg_parse_deprecated family passes nested attribute pointers back to callers, who may re-examine nlmsg_len from the original netlink message header rather than a safely bounded copy. The strict replacement functions copy attribute data into kernel structures during parsing and reject messages with trailing bytes or inconsistent lengths.

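# Check whether the kernel config exposes a strict netlink validation option
# (not every kernel tree defines this symbol, so an empty result is inconclusive)
grep CONFIG_NETLINK_STRICT_CHK /boot/config-$(uname -r)
# Expected where the option exists: CONFIG_NETLINK_STRICT_CHK=y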
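The property a strict parser needs is that every attribute is bounded against one snapshot of the message length and that trailing bytes are rejected. The sketch below models that in userspace. It is not the kernel's nla_parse implementation: struct attr_hdr, ATTR_ALIGN and count_attrs are hypothetical names that only borrow the length-then-type, 4-byte-aligned layout idea from struct nlattr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)

/* Hypothetical attribute header: len covers header plus payload. */
struct attr_hdr {
    uint16_t len;
    uint16_t type;
};

/*
 * Counts well-formed attributes in buf[0..total). `total` is the caller's
 * kernel-side snapshot of the message length and is never re-read from the
 * message. Returns -1 on a short header, an out-of-range length, or
 * trailing bytes.
 */
static int count_attrs(const unsigned char *buf, size_t total)
{
    size_t off = 0;
    int count = 0;

    assert(buf != NULL);
    while (off < total) {
        struct attr_hdr h;
        size_t remaining = total - off;
        size_t step;

        if (remaining < sizeof(h))
            return -1;                      /* trailing bytes */
        memcpy(&h, buf + off, sizeof(h));   /* each header is read once */
        if (h.len < sizeof(h) || h.len > remaining)
            return -1;                      /* length escapes the snapshot */
        count++;
        step = ATTR_ALIGN(h.len);
        off += (step > remaining) ? remaining : step;  /* last attr may be unpadded */
    }
    return count;
}
```

Because the loop compares every attribute length with the remaining bytes of the snapshot, shrinking or growing a length field in the original message after the snapshot was taken cannot move the parser outside the buffer.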

For custom netlink families in out-of-tree drivers, prefer the strict parsers (nlmsg_parse and nla_parse_nested in current kernels) with explicit NLA policy tables that bound every attribute length. Where compatibility with legacy userspace is required, nlmsg_parse_deprecated_strict is a stricter alternative to nlmsg_parse_deprecated. Policy tables defined with NLA_POLICY_EXACT_LEN or NLA_POLICY_MAX_LEN cause the generic parser to reject overlong attributes before any subsystem code runs.

/* nla policy with strict length bounds eliminates double-fetch opportunity */
static const struct nla_policy my_policy[MY_ATTR_MAX + 1] = {
    [MY_ATTR_LEN]  = { .type = NLA_U32 },
    [MY_ATTR_DATA] = NLA_POLICY_MAX_LEN(MY_MAX_DATA),
};
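What the policy table buys can be illustrated with a userspace model of table-driven length checking. This is a sketch of the idea only, not the kernel's validation code: struct len_rule, my_rules and payload_len_ok are hypothetical, the attribute numbering is assumed, and rejecting attributes that have no rule is an assumption about how a strict parser is configured.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed attribute numbering for the illustrative family. */
enum { MY_ATTR_UNSPEC, MY_ATTR_LEN, MY_ATTR_DATA, MY_ATTR_MAX = MY_ATTR_DATA };
#define MY_MAX_DATA 128

enum rule_kind { RULE_NONE, RULE_EXACT_LEN, RULE_MAX_LEN };

struct len_rule {
    enum rule_kind kind;
    size_t len;
};

/* One rule per attribute type, indexed like an nla_policy array. */
static const struct len_rule my_rules[MY_ATTR_MAX + 1] = {
    [MY_ATTR_LEN]  = { RULE_EXACT_LEN, 4 },            /* a 32-bit value */
    [MY_ATTR_DATA] = { RULE_MAX_LEN, MY_MAX_DATA },
};

/* Returns 1 if the payload length is acceptable for this attribute type. */
static int payload_len_ok(unsigned int type, size_t payload_len)
{
    assert(sizeof(my_rules) / sizeof(my_rules[0]) == MY_ATTR_MAX + 1);
    if (type > MY_ATTR_MAX)
        return 0;                                   /* unknown attribute */

    switch (my_rules[type].kind) {
    case RULE_EXACT_LEN:
        return payload_len == my_rules[type].len;
    case RULE_MAX_LEN:
        return payload_len <= my_rules[type].len;
    default:
        return 0;                                   /* no rule: reject */
    }
}
```

Centralising the bounds in one table means handler code receives attributes whose lengths are already known to be in range, so it has no reason to consult a length field again.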

BPF Hardening

Disable unprivileged BPF system-wide. This is the single most effective mitigation for Scenario 3 and eliminates a large class of verifier-bypass CVEs:

# Permanent via sysctl.d
cat > /etc/sysctl.d/90-bpf-hardening.conf << 'EOF'
# Prevent unprivileged users from loading BPF programs
# Value 2 = disabled, but an administrator can re-enable it at runtime
# (value 1 = disabled and locked until reboot)
kernel.unprivileged_bpf_disabled = 2
EOF

sysctl -p /etc/sysctl.d/90-bpf-hardening.conf

# Verify
sysctl kernel.unprivileged_bpf_disabled
# Expected: kernel.unprivileged_bpf_disabled = 2
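A compliance agent may prefer to read the value directly from procfs instead of shelling out to sysctl. The following is a minimal sketch; read_int_file and unprivileged_bpf_blocked are hypothetical helper names, and the path is passed in so the logic can be exercised against an ordinary file.

```c
#include <assert.h>
#include <stdio.h>

/* Reads a single integer from `path`. Returns 0 on success, -1 otherwise. */
static int read_int_file(const char *path, long *out)
{
    FILE *f;
    int ok;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Returns 1 if unprivileged bpf() is disabled (value 1 or 2),
 * 0 if it is enabled, and -1 if the value could not be read.
 */
static int unprivileged_bpf_blocked(const char *path)
{
    long v;

    if (read_int_file(path, &v) != 0)
        return -1;
    return v == 1 || v == 2;
}
```

On a live system the call would be unprivileged_bpf_blocked("/proc/sys/kernel/unprivileged_bpf_disabled"); treating an unreadable file as "unknown" rather than "safe" keeps the audit conservative.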

Enable JIT hardening so that constants supplied by a BPF program are blinded in the JIT-compiled output:

cat >> /etc/sysctl.d/90-bpf-hardening.conf << 'EOF'
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 2
EOF

sysctl -p /etc/sysctl.d/90-bpf-hardening.conf

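bpf_jit_harden = 2 blinds immediate values in the JIT output for all users (a value of 1 applies blinding to unprivileged users only). This prevents an attacker from smuggling chosen byte sequences into executable kernel memory through program constants and then using them as ready-made exploit gadgets.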

Fuzzing with Syzkaller and kcov

Syzkaller drives kernel fuzzing through a combination of coverage-guided syscall generation and a grammar of valid-looking but boundary-violating inputs. For double-fetch discovery, enable kcov coverage:

# Check kernel has kcov enabled
grep CONFIG_KCOV /boot/config-$(uname -r)
# Expected: CONFIG_KCOV=y

grep CONFIG_KCOV_ENABLE_COMPARISONS /boot/config-$(uname -r)
# Expected: CONFIG_KCOV_ENABLE_COMPARISONS=y

A minimal Syzkaller configuration targeting the network subsystem socket paths:

{
  "target": "linux/amd64",
  "http": "localhost:56741",
  "workdir": "/srv/syzkaller-work",
  "kernel_obj": "/path/to/linux/build",
  "image": "/path/to/rootfs.img",
  "enable_syscalls": [
    "socket$inet", "socket$inet6", "socket$netlink", "socket$packet",
    "setsockopt", "getsockopt", "sendmsg", "recvmsg",
    "bpf$PROG_LOAD", "bpf$MAP_CREATE"
  ],
  "sandbox": "namespace",
  "procs": 8
}

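Syzkaller executes the programs it generates from multiple threads, so concurrent syscalls can land inside a double-fetch window without manual race construction. Stalling the kernel with userfaultfd between the two fetches is a complementary technique that may need a purpose-built reproducer once a candidate is found.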

Network Namespace Isolation

Running containerised workloads in dedicated network namespaces limits the scope of netlink exploitation. An attacker who gains CAP_NET_ADMIN inside a container’s network namespace can only corrupt routing and firewall state for that namespace; they cannot directly manipulate the host’s routing table or netfilter rules.

# Verify a process is running in an isolated network namespace
ls -la /proc/<pid>/ns/net
# Should differ from /proc/1/ns/net for a properly isolated container

# Manual isolation for a shell session
# (requires root, or add -r to create a user namespace as well)
unshare -n bash
# Confirm with: ip link show (only lo should be listed)

Do not grant CAP_NET_ADMIN to containers that do not require it. In Kubernetes, ensure the securityContext does not include NET_ADMIN in capabilities.add unless the pod is specifically a network management workload.

Seccomp Filtering of Dangerous Socket Syscalls

Restrict setsockopt, getsockopt, sendmsg, and bpf for unprivileged service processes using seccomp-bpf. This does not eliminate the kernel bug but prevents the vulnerable codepath from being reached:

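/* Example seccomp allowlist; link with -lseccomp */
#include <errno.h>        /* EPERM */
#include <seccomp.h>
#include <sys/socket.h>   /* AF_INET */

int install_socket_seccomp(void)
{
    int rc;

    /* Default action: any syscall without an ALLOW rule fails with EPERM */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (!ctx) return -1;

    /* Allow only the specific socket families the process needs */
    rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(socket), 1,
        SCMP_A0(SCMP_CMP_EQ, AF_INET));
    if (rc < 0)
        goto out;

    /* A real service must also allow the other syscalls it uses
     * (read, write, exit_group, ...) before loading the filter. */

    /* setsockopt and bpf need no rule of their own: omitting them from
     * the allowlist leaves them blocked by the default action, and
     * libseccomp rejects a rule whose action equals the default. */

    rc = seccomp_load(ctx);
out:
    seccomp_release(ctx);   /* the loaded filter stays active */
    return rc;
}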
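When setsockopt cannot be blocked outright, the filter can match on the level and optname arguments instead. The sketch below builds such a rule as a raw classic-BPF seccomp program and includes a tiny offline evaluator so the jump offsets can be checked without installing anything. Several things are assumptions for illustration: the single permitted option (SOL_SOCKET with SO_RCVBUF), the ARG_LO macro (correct as written only on little-endian hosts), the run_filter helper, and the allow-everything-else default, which makes this a fragment rather than a complete allowlist. A production filter would also check seccomp_data.arch first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offset of the low 32 bits of syscall argument n (little-endian hosts). */
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))

/* setsockopt is allowed only for (SOL_SOCKET, SO_RCVBUF); other
 * setsockopt calls get EPERM; all other syscalls are allowed. */
static const struct sock_filter setsockopt_filter[] = {
    /* 0 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_setsockopt, 1, 0),
    /* 2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* 3 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(1)),        /* level */
    /* 4 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SOL_SOCKET, 0, 3),
    /* 5 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(2)),        /* optname */
    /* 6 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SO_RCVBUF, 0, 1),
    /* 7 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* 8 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
};
#define FILTER_LEN (sizeof(setsockopt_filter) / sizeof(setsockopt_filter[0]))

/* Minimal evaluator for the three opcodes used above; offline checks only. */
static unsigned int run_filter(const struct sock_filter *f, size_t n,
                               const struct seccomp_data *sd)
{
    unsigned int acc = 0;
    size_t pc = 0;

    assert(f != NULL && sd != NULL);
    while (pc < n) {
        const struct sock_filter *insn = &f[pc++];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            memcpy(&acc, (const unsigned char *)sd + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL;   /* opcode outside this sketch */
        }
    }
    return SECCOMP_RET_KILL;
}
```

To install a program like this, the array would be wrapped in a struct sock_fprog and passed to prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) after prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0). The same restriction can be expressed in libseccomp by adding argument comparisons on A1 and A2 to a setsockopt rule.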

For systemd services, apply a filter via the unit file:

[Service]
SystemCallFilter = @system-service
# Broad: removes the whole @network-io group (socket, sendmsg, recvmsg, ...)
SystemCallFilter = ~@network-io
# Or more targeted:
SystemCallFilter = ~bpf setsockopt getsockopt
SystemCallErrorNumber = EPERM

Expected Behaviour

The table below summarises what an attacker can achieve against each mitigation configuration, assuming a local unprivileged attacker with a working double-fetch exploit for a socket option handler.

| Mitigation Configuration | Attack Outcome | Notes |
| --- | --- | --- |
| No mitigations (stock kernel, unprivileged BPF enabled) | Full LPE via heap overflow or BPF verifier bypass | Race reliability varies; userfaultfd can bring it close to 100% |
| SMAP enabled, no other mitigations | Heap overflow still possible; redirection to userspace data and (with SMEP) shellcode blocked | SMAP together with SMEP eliminates ret2usr but not heap corruption primitives |
| kernel.unprivileged_bpf_disabled=2 | BPF verifier bypass path closed; setsockopt races unaffected | High-value mitigation for Scenario 3 specifically |
| Strict netlink parsing (CONFIG_NETLINK_STRICT_CHK=y) | Netlink TOCTOU window shrunk; attribute overread blocked | Does not affect non-netlink socket option handlers |
| Seccomp blocking setsockopt/bpf for service process | Vulnerable syscall unreachable; attack impossible for that process | Must be applied per-process; does not help for general user sessions |
| Network namespace isolation | Blast radius limited to namespace; kernel corruption still possible | Raises privilege bar but does not prevent exploitation post-CAP_NET_ADMIN |
| All mitigations combined | Attack requires kernel bug in a syscall not blocked by seccomp, plus a SMAP-resistant exploitation technique | Materially raises exploitation cost |

Verify sysctl state:

sysctl kernel.unprivileged_bpf_disabled net.core.bpf_jit_harden
# Expected:
# kernel.unprivileged_bpf_disabled = 2
# net.core.bpf_jit_harden = 2

grep -E 'CONFIG_NETLINK_STRICT_CHK|CONFIG_BPF_JIT_ALWAYS_ON|CONFIG_KCOV' \
    /boot/config-$(uname -r)
# Expected: all =y

Trade-offs

| Mitigation | What It Breaks | Operational Impact | Recommended Resolution |
| --- | --- | --- | --- |
| kernel.unprivileged_bpf_disabled=2 | Unprivileged observability tools: bpftrace run as non-root, some versions of perf, older tc filters, Cilium in non-privileged mode | Development and SRE workflows that rely on running BPF tools as a normal user are blocked; monitoring agents that drop privileges after startup lose BPF access | Run observability tools with CAP_BPF + CAP_PERFMON (Linux 5.8+) instead of full root; update tool configs to request only necessary capabilities |
| Strict netlink parsing (nlmsg_parse_deprecated_strict) | Legacy userspace tools using iproute2 < 4.15, custom network management daemons built against old netlink ABI, some vendor network agents | Routing or firewall configuration tools may fail with EINVAL on previously accepted messages | Audit and update netlink clients; test in staging with strace -e trace=sendmsg,sendto,recvmsg to identify non-conforming message formats |
| Seccomp filtering of setsockopt/getsockopt | Any application that dynamically adjusts socket behaviour at runtime: TCP congestion control selection, socket buffer tuning, multicast group management | Overbroad filters break legitimate application features; debugging becomes harder as EPERM on socket options may manifest as silent performance degradation | Profile the application's socket option usage with strace -e setsockopt,getsockopt before writing the filter; use allowlist by optname where possible |
| Disabling userfaultfd for unprivileged users (vm.unprivileged_userfaultfd=0) | Container runtimes that use userfaultfd for live migration (CRIU), some memory-efficient runtimes | CRIU-based checkpoint/restore for containers fails without userfaultfd; this setting closes a major race-amplification primitive | Run CRIU with CAP_SYS_PTRACE, which still permits userfaultfd; apply the restriction only where CRIU is not in use |
| bpf_jit_harden=2 | Negligible performance overhead from constant blinding in JIT output; some BPF introspection tools that read JIT output for debugging | Minor throughput reduction for BPF-heavy workloads (XDP, heavy tc filtering); usually less than 1% | Acceptable trade-off for production; disable only in controlled benchmark environments |

Failure Modes

| Failure Mode | Root Cause | Detection | Remediation |
| --- | --- | --- | --- |
| SMAP not active on VM guest | Hypervisor CPU model does not expose SMAP bit to guest; default QEMU machine type pc uses qemu32/qemu64 CPU without SMAP | grep smap /proc/cpuinfo returns empty; dmesg filtered with grep SMAP shows nothing | Configure the hypervisor to expose SMAP to the guest (for QEMU, -cpu host or a named model that includes SMAP) |
| kernel.unprivileged_bpf_disabled silently reset to 0 | Monitoring agent or container runtime (e.g., older Falco versions, some Cilium releases) sets sysctl -w kernel.unprivileged_bpf_disabled=0 at startup to enable its own BPF programs | Periodic sysctl kernel.unprivileged_bpf_disabled audit; auditd rule on sysctl writes; Falco rule for /proc/sys/kernel/unprivileged_bpf_disabled writes | Pin the value with a systemd ExecStartPre that verifies it; use sysctl --system at boot from a drop-in that overrides agent-set values; update monitoring agent to use capability-based BPF access |
| Seccomp allowlist too permissive: setsockopt allowed without optname restriction | Filter allows the syscall number for every optname value; a double-fetch in a rarely-used option handler remains reachable | strace the application and compare allowed optnames against the filter; audit with bpftrace -e 'tracepoint:syscalls:sys_enter_setsockopt { printf("%d\n", args->optname); }' | Refine the seccomp rule to match both the syscall and the specific optname values required; reject all other optnames with EPERM |
| Netlink strict validation disabled by distro kernel config | Downstream kernel build omits CONFIG_NETLINK_STRICT_CHK to maintain compatibility with legacy tools | grep CONFIG_NETLINK_STRICT_CHK /boot/config-$(uname -r) returns # CONFIG_NETLINK_STRICT_CHK is not set | Use a distribution that ships with strict netlink enabled (RHEL 9+, Ubuntu 22.04 HWE, Debian bookworm); or rebuild the kernel with CONFIG_NETLINK_STRICT_CHK=y |
| userfaultfd left enabled for unprivileged users | Default changed back by a container runtime install script or infrastructure automation | sysctl vm.unprivileged_userfaultfd returns 1; check for automation touching /etc/sysctl.d/ | Add a CIS-benchmark compliance check for this sysctl; lock the value in a high-priority sysctl drop-in (/etc/sysctl.d/99-lock-uffd.conf) and ensure no automation overwrites it |
| Kernel not patched for CVE-2023-0179 and similar | LTS kernel branch not updated; organisation pinned to a fixed kernel version for stability | uname -r against vendor advisory fixed-version list; check kernel_version in vulnerability scanner output | Apply vendor errata or backport the patch; prioritise network subsystem CVEs with local LPE severity |