copy_from_user Failure Exploitation: Kernel Copy Fault Handling Vulnerabilities

Problem

copy_from_user and copy_to_user are the kernel’s sanctioned mechanism for transferring data across the user/kernel boundary. Every syscall that accepts a user pointer (recvmsg, setsockopt, ioctl, read, write) eventually calls one of these functions, or a close relative, to bring userspace data into kernel memory or push kernel results back to the caller. The function signature is deceptively simple:

unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);

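It returns the number of bytes that could not be copied. A return value of zero means success; any non-zero value means the copy stopped early, either because the source range failed the access check or because a page fault occurred mid-transfer. In the fault case the destination holds the bytes copied before the fault, and copy_from_user pads the remainder with zero bytes. The lower-level variants __copy_from_user and __copy_from_user_inatomic do not pad, so the tail keeps whatever the buffer held before the call.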
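
The return-value convention is easy to misread as "bytes copied", so a small userspace model may help. The sketch below is not kernel code: model_copy_from_user, model_fetch and the readable parameter are hypothetical names, and readable simply stands in for how many source bytes can be read before a fault would occur. The zero-padding of the tail follows the documented behaviour of copy_from_user.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of copy_from_user() semantics; NOT kernel code.
 * `readable` is a hypothetical stand-in for the number of source bytes
 * that can be read before the access would fault.
 */
unsigned long model_copy_from_user(void *to, const void *from,
                                   unsigned long n, unsigned long readable)
{
    unsigned long copied = n < readable ? n : readable;
    unsigned long left = n - copied;

    assert(to != NULL && from != NULL);
    memcpy(to, from, copied);              /* bytes written before the fault */
    memset((char *)to + copied, 0, left);  /* zero-pad the uncopied tail */
    return left;                           /* number of bytes NOT copied */
}

/* Caller pattern: any non-zero residue is a failure, never a "short read". */
int model_fetch(void *dst, const void *src, unsigned long n,
                unsigned long readable)
{
    if (model_copy_from_user(dst, src, n, readable))
        return -EFAULT;
    return 0;
}
```

With an 8-byte request of which only 3 bytes are readable, the model returns 5, the destination holds 3 copied bytes followed by 5 zero bytes, and model_fetch reports -EFAULT.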

The critical property that produces exploitable conditions is this: the copy can fault after writing some bytes and before writing others. When a syscall allocates a kernel object, starts copying the userspace-supplied struct into it, faults partway through, and then fails to unwind the partially-written allocation before returning EFAULT, the kernel heap contains a live object in an inconsistent state.

How partial copies arise

A user process controls the layout of its own virtual address space. An attacker can deliberately construct a mapping such that the first portion of a struct sits on a valid, readable page and the remainder maps to an unmapped or guard page. When the kernel calls copy_from_user with that address, it successfully copies the first portion — populating fields such as a size, a type discriminant, or a function pointer field — then faults on the second page access and returns a partial byte count.

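On x86, the fault path through copy_from_user relies on STAC/CLAC (on SMAP-capable CPUs) to open the user-access window and on the exception table to recover from the page fault and return the residual byte count. Critically, the bytes written to kernel memory before the fault are not reverted by this mechanism. copy_from_user zero-pads the uncopied tail, but it cannot undo the prefix, and the non-padding variants leave the tail untouched. The handler must discard or reinitialise the object itself, and when it does not, the partially-written object persists.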

The failure pattern in syscall handlers

A concrete pattern that recurs across multiple CVE classes:

/* simplified example of a vulnerable pattern */
static int vulnerable_cmd(struct file *f, unsigned long arg)
{
    struct kernel_obj *obj;

    obj = kmem_cache_alloc(obj_cache, GFP_KERNEL);
    if (!obj)
        return -ENOMEM;

    /* fault here after writing cfg.type but before cfg.len */
    if (copy_from_user(&obj->cfg, (void __user *)arg, sizeof(obj->cfg))) {
        kmem_cache_free(obj_cache, obj);   /* freed, but partially written */
        return -EFAULT;
    }

    list_add(&obj->list, &global_list);
    return 0;
}

In this form the object is freed on error, so the bug depends on something else holding a reference: if a concurrent thread obtained a pointer to obj before the copy finished, the freed, partially populated object becomes a use-after-free primitive with attacker-influenced content. Variants exist where the kmem_cache_free call is simply missing, which leaks the partially-written object, or where the object is added to global_list before the copy and is never removed on failure. These forms have appeared in real driver code.

Structural CVE patterns

Three structural patterns appear repeatedly across Linux kernel copy-fault vulnerabilities:

Incomplete error unwind: The handler frees the allocation but does not remove it from a reference-counted data structure, leaving a dangling pointer. Subsequent accesses dereference the freed slab object.

Missing EFAULT check: The handler ignores the return value of copy_from_user entirely, treating the partial copy as success. This is possible because C does not force callers to consume a return value. copy_from_user itself carries __must_check, but the annotation is not universally applied to internal wrappers.

Type-confused partial init: A struct with a discriminant field early in layout and variable-length payload fields later. A page-boundary fault populates the discriminant (controlling which code path executes) while the payload fields never receive the caller’s data: they are zero-padded when copy_from_user is used, and they keep stale contents from a previous slab occupant when a non-padding variant (__copy_from_user, __copy_from_user_inatomic) or a field-by-field copy targets a non-zeroed allocation. The kernel then dispatches through the discriminant to a code path that trusts the unvalidated payload.
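
The first pattern, incomplete error unwind, is easiest to see in a small userspace model. The sketch below is not kernel code: struct registry, registry_add, registry_del, create_obj and the copy_fails flag are hypothetical names, copy_fails merely simulates a failing copy_from_user call, and the locking or RCU that real publish and unpublish steps need is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_OBJS 4

struct obj { int id; };

struct registry {
    struct obj *slots[MAX_OBJS];   /* NULL means the slot is free */
};

/* Publish obj in the first free slot; returns the slot index or -ENOSPC. */
int registry_add(struct registry *r, struct obj *o)
{
    for (int i = 0; i < MAX_OBJS; i++) {
        if (!r->slots[i]) {
            r->slots[i] = o;
            return i;
        }
    }
    return -ENOSPC;
}

void registry_del(struct registry *r, int slot)
{
    assert(slot >= 0 && slot < MAX_OBJS);
    r->slots[slot] = NULL;
}

/*
 * Model of a handler that publishes an object and then runs a step that
 * can fail. Every failure label undoes exactly the steps that succeeded
 * before it, in reverse order.
 */
int create_obj(struct registry *r, bool copy_fails)
{
    struct obj *o = calloc(1, sizeof(*o));
    int slot, ret;

    if (!o)
        return -ENOMEM;

    slot = registry_add(r, o);
    if (slot < 0) {
        ret = slot;
        goto err_free;
    }

    if (copy_fails) {              /* simulated copy_from_user() failure */
        ret = -EFAULT;
        goto err_unpublish;
    }
    o->id = slot;
    return 0;

err_unpublish:
    registry_del(r, slot);         /* drop the reference before freeing */
err_free:
    free(o);
    return ret;
}
```

Removing the registry_del call reproduces the bug class: the slot would keep pointing at freed memory after create_obj returns -EFAULT.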

Target systems: Linux kernel 5.x and 6.x. All syscall handlers that copy variable-length or multi-field structs from userspace are potentially affected if they perform object allocation before the copy completes, or fail to check copy return values.

Threat Model

Adversary: An unprivileged local attacker with code execution in a normal user process. This covers container escapes where initial code execution is already inside a container as an unprivileged uid, compromised service processes running without CAP_SYS_ADMIN, or an attacker who has achieved RCE in a non-root web service.

Objective: Local privilege escalation (LPE) to uid=0 or direct kernel code execution, whether to escape a container, read /etc/shadow, or install a persistent kernel rootkit.

Attack path 1: page-boundary fault → partial init → type confusion

The attacker allocates a userspace buffer for a target struct. The buffer is laid out so that the first N bytes of the struct (up to the page boundary) are mapped readable, and the bytes immediately following the boundary are either unmapped or mapped as a guard page via mmap(PROT_NONE). The syscall is issued with this address. The kernel copies the first portion, faults, and — if the handler is not correct — leaves a partially-initialised kernel object live on the heap.

If the struct’s early fields control a type discriminant or pointer field, the attacker controls which code path the kernel subsequently executes and with what pointer value. Without SMAP, the attacker can place a fake kernel object at a known userspace address and forge the pointer fields to point into their own mapping.

Attack path 2: copy-fault race window → use-after-free

Some subsystems publish an object (for example on an RCU-protected list or in an ID table) before the copy that initialises it has completed, or handle copy errors asynchronously. An attacker triggers the page fault, then races a second thread that either frees or reallocates the same slab object while the first thread is still in the EFAULT return path. The result is a classic UAF with the attacker controlling the object’s content via the partial copy. This is the same primitive as heap spray–based UAF exploits but with an additional attacker-controlled initialisation step for the first N bytes of the object.

Attack path 3: USMA without SMAP

On kernels or CPU configurations where Supervisor Mode Access Prevention is absent (pre-SMAP hardware, SMAP disabled in bootloader, hypervisor CPUID masking), a kernel dereference of an attacker-controlled pointer does not fault even when the pointer points into userspace. This means an attacker who achieves any control over a kernel pointer — including via the partial-copy primitive above — can point the kernel at a fake structure in userspace memory and have that fake structure be treated as authoritative kernel data.

Without SMAP, the combination of a copy-fault partial-init and a missing bounds check becomes a reliable, low-noise LPE primitive because the attacker can refine the fake object across multiple attempts without needing to control kernel heap layout precisely.

Blast radius comparison:

Condition	Attacker capability
No SMAP, no SLUB hardening, no KASAN	Reliable LPE; fake objects in userspace; no detection
SMAP enabled, no SLUB hardening	LPE still possible but requires kernel heap control; forged pointers must land in kernel memory
SMAP + SLUB freelist randomisation	Heap spray reliability degrades; attacker needs additional info leak to locate target slab
SMAP + SLUB hardening + KASAN	Use-after-free and out-of-bounds accesses detected at access time; freelist corruption detected at free time; most exploit chains break

Configuration / Implementation

Verifying SMAP and SMEP

SMAP (Supervisor Mode Access Prevention) and SMEP (Supervisor Mode Execution Prevention) are hardware features. SMEP causes the CPU to fault when kernel-mode code fetches instructions from a user-mode page. SMAP causes the CPU to fault when kernel-mode code accesses a userspace virtual address without an explicit STAC/CLAC bracket; copy_from_user uses this bracket legitimately, but an attacker-forged kernel pointer dereference outside that bracket faults.

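# Verify SMAP and SMEP are reported by the CPU (prints each flag once if present)
grep -o -w -E 'smap|smep' /proc/cpuinfo | sort -u

# Verify the kernel was not booted with nosmap/nosmep
# (newer kernels drop these parameters; clearcpuid= can disable the same features)
grep -E 'nosmap|nosmep|clearcpuid=' /proc/cmdline && echo "WARNING: SMAP/SMEP may be disabled" || echo "OK"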

SMAP and SMEP are enabled automatically when the CPU reports them and the kernel has not been told to disable them. Recent kernels need no explicit CONFIG_ option (older 5.x kernels gate SMAP behind CONFIG_X86_SMAP, which defaults to y); the kernel detects the CPUID bits at boot and sets CR4.SMAP / CR4.SMEP.

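To look for boot-time confirmation in the kernel log:

# CR4 bit 20 = SMEP, bit 21 = SMAP
# CR4 is a control register, not an MSR, so it cannot be read from userspace;
# rely on the kernel's own reporting instead
dmesg | grep -i 'smap\|smep'

Whether anything is logged depends on the kernel version and configuration, and an empty result does not mean the features are off. The /proc/cpuinfo flags remain the more reliable check, because the kernel clears a flag when it disables the feature. A system that does log them shows output similar to: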

[    0.000000] FEATURE SMEP enabled
[    0.000000] FEATURE SMAP enabled
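
The /proc/cpuinfo check can also be run from a C test harness or a CI gate. The following is a minimal sketch under stated assumptions: flags_line_has and cpu_reports_flag are hypothetical helper names, the code assumes the x86 layout in which features appear on a line starting with "flags", and it assumes that line fits in an 8 KiB buffer. Matching whole tokens matters because a plain substring search for "smap" would also match unrelated longer flag names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated token. */
bool flags_line_has(const char *line, const char *flag)
{
    size_t len;
    const char *p = line;

    assert(line != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        bool end_ok = end == '\0' || end == ' ' || end == '\t' || end == '\n';

        if (start_ok && end_ok)
            return true;
        p += len;   /* keep searching after this partial match */
    }
    return false;
}

/*
 * Check the first "flags" line of /proc/cpuinfo.
 * Returns 1 if the flag is present, 0 if absent, -1 if the file is unreadable.
 */
int cpu_reports_flag(const char *flag)
{
    char line[8192];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int found = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0) {
            found = flags_line_has(line, flag) ? 1 : 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A CI gate could fail the build when cpu_reports_flag("smap") or cpu_reports_flag("smep") returns anything other than 1 on an x86 guest.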

SLUB freelist randomisation and hardening

The SLUB allocator’s freelist randomisation and hardening options significantly raise the bar for heap shaping attacks that exploit copy-fault primitives.

# Check kernel config for SLUB hardening options
zcat /proc/config.gz 2>/dev/null | grep -E 'SLAB_FREELIST|SLUB_DEBUG'
# or
grep -E 'SLAB_FREELIST|SLUB_DEBUG' /boot/config-$(uname -r)

The relevant options:

CONFIG_SLAB_FREELIST_RANDOM=y      # Randomise free object order within each slab
CONFIG_SLAB_FREELIST_HARDENED=y    # Encode freelist pointers; detect corruption at free time
CONFIG_SLUB_DEBUG=y                # Enable runtime slab debugging (production: selective)

CONFIG_SLAB_FREELIST_RANDOM means an attacker cannot rely on deterministic slab ordering to place a target object immediately adjacent to the partially-initialised one. CONFIG_SLAB_FREELIST_HARDENED XORs each stored freelist pointer with a per-cache random value and the address at which the pointer is stored, so an overwritten pointer decodes to an unpredictable, usually invalid address and typically produces an oops (a panic if panic_on_oops is set) instead of a controlled allocation. This catches the allocator-metadata corruption that copy-fault exploits depend on for reliable heap control.
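
The freelist pointer encoding can be illustrated with a simplified userspace model. This is a sketch, not the allocator's code: freelist_encode and freelist_decode are hypothetical names, the secret stands in for the allocator's random value, and the byte swap of the storage address reflects my reading of recent mainline SLUB and should be treated as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model: stored = ptr ^ secret ^ byteswap(storage address).
 * `secret` models the allocator's random value; `slot_addr` is the address
 * at which the encoded pointer is stored.
 */
uint64_t freelist_encode(uint64_t ptr, uint64_t secret, uint64_t slot_addr)
{
    assert(slot_addr != 0);   /* a freelist slot never lives at address 0 */
    return ptr ^ secret ^ __builtin_bswap64(slot_addr);
}

/* XOR is its own inverse, so decoding applies the same mask again. */
uint64_t freelist_decode(uint64_t stored, uint64_t secret, uint64_t slot_addr)
{
    assert(slot_addr != 0);
    return stored ^ secret ^ __builtin_bswap64(slot_addr);
}
```

A legitimately stored pointer round-trips through encode and decode. A raw value written over the slot by a memory-corruption bug is decoded with the same mask and therefore comes out as a different, unpredictable address, which is why corruption tends to crash rather than redirect the next allocation.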

To enable these in a custom kernel build:

scripts/config --enable SLAB_FREELIST_RANDOM
scripts/config --enable SLAB_FREELIST_HARDENED

KASAN and KFENCE for detecting partial initialisation

KASAN (Kernel Address Sanitizer) uses compiler instrumentation to check memory accesses in the kernel. In the copy-fault context, it detects out-of-bounds accesses and accesses to freed objects; it does not track uninitialised memory, which is the job of KMSAN (Kernel Memory Sanitizer). KFENCE provides a lighter-weight sampling-based detector suitable for production kernels.

KASAN (test kernels, CI pipelines):

# Kernel config
grep -E 'KASAN|KFENCE' /boot/config-$(uname -r)

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y        # Software instrumentation (slower, comprehensive)
CONFIG_KASAN_INLINE=y         # Inline checks vs out-of-line (faster, larger binary)

Boot a KASAN kernel and exercise copy paths. Use of an object that was freed on a copy-error path will surface as:

==================================================================
BUG: KASAN: slab-use-after-free in my_driver_process+0x48/0x110
Read of size 8 at addr ffff888012345678 by task syzkaller/1234

KFENCE (production-safe sampling):

# Requires CONFIG_KFENCE=y; the sample interval is in milliseconds
# /etc/default/grub or kernel command line:
GRUB_CMDLINE_LINUX="kfence.sample_interval=100"

KFENCE allocates a small pool of guard-paged objects and randomly serves allocations from it. A use-after-free or out-of-bounds write on a KFENCE object causes an immediate fault with a kernel warning, visible in dmesg.

Fault injection testing with CONFIG_FAULT_INJECTION

The kernel’s fault injection framework lets you force copy_from_user to fail at arbitrary call sites, enabling systematic testing of error unwind paths in kernel modules and drivers.

# Kernel config required
grep FAULT_INJECTION /boot/config-$(uname -r)

CONFIG_FAULT_INJECTION=y
CONFIG_FAULT_INJECTION_USERCOPY=y   # Specifically fault copy_from/to_user calls
CONFIG_FAULT_INJECTION_DEBUG_FS=y

With these enabled, control fault injection via debugfs:

# Mount debugfs if not already mounted
mount -t debugfs none /sys/kernel/debug

# Configure copy_from_user fault injection
echo 1    > /sys/kernel/debug/fail_usercopy/probability   # failure probability, in percent
echo 1    > /sys/kernel/debug/fail_usercopy/interval
echo -1   > /sys/kernel/debug/fail_usercopy/times          # unlimited
echo 1    > /sys/kernel/debug/fail_usercopy/task-filter
echo 1    > /sys/kernel/debug/fail_usercopy/verbose

# Restrict to a specific process by setting its attribute
echo 1 > /proc/$(pgrep target_process)/make-it-fail

Run your syscall test harness against the target driver with fault injection enabled and observe whether:

The kernel panics (missing error unwind, use-after-free of partial object)
dmesg shows KASAN reports
The syscall returns EFAULT cleanly with no memory corruption
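
Fault injection fails whole calls; a test harness can complement it with an argument buffer that ends exactly at an inaccessible page, so the copy really stops partway. The sketch below is one way to build such a buffer in userspace. alloc_straddling, free_straddling and probe_kernel_read are hypothetical helper names, and the pipe write is only a stand-in for the driver call under test.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map two pages, revoke access to the second, and return a pointer with
 * exactly `valid` accessible bytes before the inaccessible page begins.
 * Returns NULL on failure; *base_out receives the mapping base.
 */
void *alloc_straddling(size_t valid, void **base_out)
{
    long page = sysconf(_SC_PAGESIZE);
    char *base;

    assert(base_out != NULL);
    if (page <= 0 || valid == 0 || valid > (size_t)page)
        return NULL;

    base = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    memset(base, 0x41, (size_t)page);          /* recognisable fill pattern */
    if (mprotect(base + page, (size_t)page, PROT_NONE) != 0) {
        munmap(base, 2 * (size_t)page);
        return NULL;
    }
    *base_out = base;
    return base + page - valid;
}

void free_straddling(void *base)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page > 0)
        munmap(base, 2 * (size_t)page);
}

/*
 * Stand-in for the call under test: make the kernel read `len` bytes from
 * `buf` by writing them to a pipe. Returns write()'s result or -errno.
 */
long probe_kernel_read(const void *buf, size_t len)
{
    int fds[2];
    long ret;

    if (pipe(fds) != 0)
        return -errno;
    ret = write(fds[1], buf, len);
    if (ret < 0)
        ret = -errno;
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

In a driver test, the pointer from alloc_straddling would be passed as the ioctl argument with valid set to each field boundary of the struct in turn; the handler should return EFAULT every time and leave no object behind.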

Syzkaller for fuzzing copy paths

Syzkaller is Google’s kernel coverage-guided fuzzer. It generates structured syscall sequences and is highly effective at finding copy-fault handling bugs because it exercises unusual argument shapes, including cross-page-boundary allocations.

A minimal syzkaller configuration targeting a custom driver:

{
    "target": "linux/amd64",
    "http": "127.0.0.1:56741",
    "workdir": "/tmp/syzkaller-work",
    "kernel_obj": "/path/to/kernel/build",
    "image": "/path/to/vm-image.img",
    "sshkey": "/path/to/ssh-key",
    "syzkaller": "/path/to/syzkaller",
    "procs": 4,
    "type": "qemu",
    "vm": {
        "count": 4,
        "kernel": "/path/to/bzImage",
        "cpu": 2,
        "mem": 2048
    },
    "enable_syscalls": ["ioctl$MY_DRIVER_CMD"]
}

Run syzkaller against a KASAN-enabled kernel. The combination of coverage guidance and fault injection (CONFIG_FAULT_INJECTION_USERCOPY) systematically exercises partial-copy paths that manual testing misses.

Writing safe copy paths

A safe copy path has three properties: it checks the return value, it unwinds all kernel state on error, and it uses __user annotations so sparse can enforce the discipline statically.

static int safe_cmd(struct file *f, unsigned long arg)
{
    struct my_cfg cfg;
    struct kernel_obj *obj;
    int ret;

    /* zero-initialise before copy: prevents stale data confusion */
    memset(&cfg, 0, sizeof(cfg));

    if (copy_from_user(&cfg, (void __user *)arg, sizeof(cfg)))
        return -EFAULT;

    /* validate fields before allocating */
    if (cfg.type >= MY_TYPE_MAX || cfg.len > MY_LEN_MAX)
        return -EINVAL;

    obj = kmem_cache_zalloc(obj_cache, GFP_KERNEL);  /* zalloc, not alloc */
    if (!obj)
        return -ENOMEM;

    obj->cfg = cfg;

    ret = register_obj(obj);
    if (ret) {
        kmem_cache_free(obj_cache, obj);
        return ret;
    }

    return 0;
}

Key changes from the vulnerable pattern: copy_from_user runs before allocation (no partial-init of a live kernel object), the return value is checked unconditionally, kmem_cache_zalloc ensures the object is fully zero-initialised even if the copy were to occur after allocation, and field validation runs before the object enters any reachable data structure.

Run the sparse static checker against driver code to catch missing __user annotations:

make C=1 CF="-D__CHECK_ENDIAN__" drivers/my_driver/

Kernel Lockdown LSM

Kernel lockdown restricts access to /dev/mem, /proc/kcore, and kernel module loading in ways that close some post-exploitation access paths. Even after a copy-fault exploitation succeeds, lockdown limits what the attacker can do with kernel write primitives.

# Check lockdown state
cat /sys/kernel/security/lockdown

# Enable in bootloader (grub example)
# GRUB_CMDLINE_LINUX="lockdown=confidentiality"

# Or compile in a default:
# CONFIG_SECURITY_LOCKDOWN_LSM=y
# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY=y
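
The lockdown file lists every mode and marks the active one with square brackets, for example "none [integrity] confidentiality". A monitoring agent can extract the active mode as sketched below; lockdown_active_mode and read_lockdown_mode are hypothetical helper names, and the sketch assumes the bracketed single-line format just described.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) mode from `line` into `out`.
 * Returns 0 on success, -1 if no bracketed token is found or it does not fit.
 */
int lockdown_active_mode(const char *line, char *out, size_t outlen)
{
    const char *lb, *rb;
    size_t len;

    assert(line != NULL && out != NULL);
    lb = strchr(line, '[');
    rb = lb ? strchr(lb, ']') : NULL;
    if (!lb || !rb || outlen == 0)
        return -1;

    len = (size_t)(rb - lb - 1);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the live state; returns 0 and fills `out`, or -1 if unavailable. */
int read_lockdown_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/security/lockdown", "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        ret = lockdown_active_mode(line, out, outlen);
    fclose(f);
    return ret;
}
```

An agent would typically alert when read_lockdown_mode fails (lockdown LSM absent or securityfs not mounted) or when the reported mode is weaker than the one the deployment expects.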

Expected Behaviour

The following table shows what an attacker obtains at each stage of applying the mitigations described above. “Attacker position” assumes an unprivileged local process with the ability to issue arbitrary syscalls.

Configuration	Partial-copy primitive	Heap control	Pointer forgery	LPE reliability
Baseline (no SMAP, no SLUB hardening)	Works; object left on heap	Deterministic slab ordering	Userspace pointers accepted by kernel	High; textbook technique
SMAP enabled	Works	Deterministic	Requires kernel-space target; userspace pointer faults	Reduced; needs heap info leak
SMAP + SLUB freelist random	Works	Non-deterministic order	Requires kernel-space target	Low; spray unreliable without leak
SMAP + SLUB random + SLUB hardened	Works	Hardened; corruption detected at free	Corruption panics kernel	Very low; crashes rather than exploits
Full stack (above + KASAN + lockdown)	Detected at access time	Detected	Detected	Effectively mitigated in test; surface minimal in production

Verifying KASAN detection. On a test kernel with KASAN and CONFIG_FAULT_INJECTION_USERCOPY, trigger a controlled fault and observe dmesg:

dmesg | grep -A 20 'BUG: KASAN'

Illustrative output when a mishandled error path leads to an out-of-bounds read:

==================================================================
BUG: KASAN: slab-out-of-bounds in kernel_obj_process+0x7c/0x180 [my_driver]
Read of size 8 at addr ffff888034ab1040 by task test_harness/2341

CPU: 2 PID: 2341 Comm: test_harness Not tainted 6.8.0-kasan #1
Call Trace:
 kasan_report+0xb2/0xe0
 kernel_obj_process+0x7c/0x180 [my_driver]
 my_ioctl+0x44/0xa0 [my_driver]
 __x64_sys_ioctl+0x8e/0xd0

Verifying SLUB hardened freelist. A corrupted freelist pointer causes:

BUG: KASAN: slab-out-of-bounds ...
or
kernel BUG at mm/slub.c:XXXX!

rather than silent heap corruption and continued execution.

Sysctl hardening baseline (add to /etc/sysctl.d/60-kernel-hardening.conf):

# Restrict access to kernel pointers in /proc
kernel.kptr_restrict = 2

# Restrict dmesg to root (limits KASLR defeat via boot messages)
kernel.dmesg_restrict = 1

# Disable kexec (limits post-exploit persistence)
kernel.kexec_load_disabled = 1

# Restrict perf to root (limits hardware PMU-based KASLR defeat)
# Level 3 relies on a distribution patch (e.g. Debian, Ubuntu); mainline defines levels up to 2
kernel.perf_event_paranoid = 3

Trade-offs

Mitigation	Performance overhead	Memory overhead	Operational impact	Notes
SMAP/SMEP	< 1% on modern CPUs (Skylake+)	None	None in production	CR4 bits are set once at boot; the SMAP cost is the STAC/CLAC pair around each user access; negligible on benchmarks
SLUB freelist randomisation	< 0.5% allocator throughput	None	None	Overhead from PRNG call per slab; imperceptible in most workloads
SLUB freelist hardening	~1–2% allocator throughput	None	Kernel panic on detected corruption	Panic-on-corruption is deliberate; tune `panic_on_oops` accordingly
KASAN (generic)	2–4× slowdown	2–8× memory	Not suitable for production; CI and fuzzing only	Use KFENCE in production
KFENCE	< 1% (sampling)	~2 MB fixed pool	Safe for production; catches subset of bugs	Miss rate depends on sampling interval; tunable
CONFIG_FAULT_INJECTION	Negligible when inactive	None	Requires debugfs access; disable in production	Boot with `debugfs=off` unless actively testing
Kernel lockdown (confidentiality)	None	None	Blocks `kexec`, `/dev/mem`, unsigned modules	Can break hibernation; verify against workload before deploying
Syzkaller fuzzing	N/A (offline)	N/A	Requires dedicated VM fleet	One-time setup cost; run continuously against kernel dev branches

Failure Modes

Failure mode	Root cause	Detection	Remediation
SMAP disabled in bootloader	`nosmap` on kernel command line, or hypervisor masking CPUID.SMAP bit	`grep nosmap /proc/cmdline`; `dmesg | grep -i smap`
SMEP disabled without noticing	Bare-metal upgrade to CPU without SMEP, or KVM guest with stripped CPUID	`grep smep /proc/cpuinfo` returns empty	Enforce SMEP/SMAP in VM template CPUID policy; CI gate on `/proc/cpuinfo` check
KASAN false negatives for partial init	KASAN checks address validity, not initialisation; fields left unset by a partial copy are still valid memory	Object access after partial copy but before `kmem_cache_free` is in a window KASAN cannot see without init tracking	Run a separate `CONFIG_KMSAN` (Kernel Memory Sanitizer) build for uninit tracking; KMSAN cannot be combined with KASAN in one kernel
KMSAN not detecting uninitialised reads	KMSAN tracks kernel-originated uninit but `copy_from_user` is a copy, not an init — uncopied bytes remain uninitialised	KMSAN report absence despite partial copy	Zero-initialise slab objects with `kmem_cache_zalloc`; explicit `memset` before partial-copy structs
`copy_from_user` wrappers that swallow errors	Internal subsystem wrappers that call `copy_from_user` but return void or coerce the return to bool incorrectly	Compiler warns on ignored results once the wrapper is annotated `__must_check`; manual audit of wrapper return types	Fix wrappers to propagate `unsigned long` return; annotate with `__must_check`; run `make C=1`
SLUB hardening disabled in distro config	Distribution ships `CONFIG_SLAB_FREELIST_HARDENED=n` for performance	`zcat /proc/config.gz | grep SLAB_FREELIST_HARDENED`
Lockdown bypassed via module loading	Unsigned module loaded before lockdown LSM activates, or `lockdown=integrity` (not `confidentiality`)	`dmesg | grep lockdown`; check module signing policy
Fault injection left active in production	debugfs left mounted and writable; `fail_usercopy` probability > 0	`cat /sys/kernel/debug/fail_usercopy/probability`	Set probability to 0 after testing; restrict debugfs: `GRUB_CMDLINE_LINUX="debugfs=off"`

io_uring Security and Hardening — io_uring’s shared-ring architecture creates similar copy-path exposure; the two vulnerability classes frequently appear in the same kernel subsystems.
Dirty Pipe and Container Escape via Kernel Pipe Splicing — covers the splice/pipe write primitive, which interacts with the copy path and the page cache in ways that share structural similarities with copy-fault partial-write exploitation.
TOCTOU Vulnerability Defences — race conditions between fault handler return and object use are a subclass of TOCTOU; the defensive patterns overlap significantly.
Detecting Copy-on-Write Exploits with eBPF — eBPF-based observability for CoW and copy-path anomalies; complements the fault-injection testing approach described here.