copy_from_user Failure Exploitation: Kernel Copy Fault Handling Vulnerabilities

Problem

copy_from_user and copy_to_user are the kernel’s sanctioned mechanism for transferring data across the user/kernel boundary. Every syscall that accepts a struct pointer — recvmsg, setsockopt, ioctl, read, write — eventually calls one of these functions to bring userspace data into kernel memory or push kernel results back to the caller. The function signature is deceptively simple:

unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);

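It returns the number of bytes that could not be copied. Zero means success. A non-zero value means the copy failed, either because the source range was rejected by the access_ok() check (nothing is copied) or because a page fault interrupted the transfer partway. In the second case the bytes copied before the fault stay in the destination, and copy_from_user zero-fills the uncopied tail; non-zeroing variants such as __copy_from_user_inatomic leave the tail untouched.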

The critical property that produces exploitable conditions is this: the copy can fault after writing some bytes and before writing others. When a syscall allocates a kernel object, starts copying the userspace-supplied struct into it, faults partway through, and then fails to unwind the partially-written allocation before returning EFAULT, the kernel heap contains a live object in an inconsistent state.

How partial copies arise

A user process controls the layout of its own virtual address space. An attacker can deliberately construct a mapping such that the first portion of a struct sits on a valid, readable page and the remainder maps to an unmapped or guard page. When the kernel calls copy_from_user with that address, it successfully copies the first portion — populating fields such as a size, a type discriminant, or a function pointer field — then faults on the second page access and returns a partial byte count.

On x86, copy_from_user brackets the user access with STAC/CLAC on SMAP-capable CPUs and relies on the exception table mechanism to recover from the page fault and return the residual byte count. Critically, the bytes written before the fault are not reverted by this mechanism; copy_from_user zero-fills only the uncopied tail. Cleaning up the already-written prefix is the caller's job, and when the caller skips it, the partially-written object persists.
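The return-value contract described above can be modelled in ordinary userspace C. The sketch below is not kernel code: model_copy_from_user and its readable parameter are hypothetical names introduced here to simulate a fault after a given number of bytes. It assumes the behaviour of the stock copy_from_user, which keeps the copied prefix and zero-fills the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of copy_from_user() semantics; NOT kernel code.
 * 'readable' (hypothetical parameter) is the number of source bytes that
 * can be read before the simulated page fault.
 * Returns the number of bytes that could not be copied.
 */
size_t model_copy_from_user(void *to, const void *from, size_t n,
                            size_t readable)
{
    size_t copied = readable < n ? readable : n;
    size_t left = n - copied;

    assert(to != NULL && from != NULL);
    memcpy(to, from, copied);                  /* prefix before the fault stays */
    if (left)
        memset((char *)to + copied, 0, left);  /* uncopied tail is zero-filled */
    return left;
}
```

Calling it with n = 8 and readable = 3 returns 5: the first three destination bytes hold caller data and the remaining five are zero. A handler that treats such a buffer as a fully populated struct is the bug class this document describes.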

The failure pattern in syscall handlers

A concrete pattern that recurs across multiple CVE classes:

/* simplified example of a vulnerable pattern */
static int vulnerable_cmd(struct file *f, unsigned long arg)
{
    struct kernel_obj *obj;

    obj = kmem_cache_alloc(obj_cache, GFP_KERNEL);
    if (!obj)
        return -ENOMEM;

    /* a fault here can copy cfg.type but not cfg.len */
    if (copy_from_user(&obj->cfg, (void __user *)arg, sizeof(obj->cfg))) {
        kmem_cache_free(obj_cache, obj);   /* freed, but partially written */
        return -EFAULT;
    }

    list_add(&obj->list, &global_list);
    return 0;
}

The handler above frees the object on error, which is safe in isolation. It becomes exploitable when a concurrent thread or a post-fault race already holds a reference: the freed-but-partially-populated object is then a use-after-free primitive with attacker-influenced content. Variants exist where the error check is missing altogether, so the partially-written object still reaches global_list, or where kmem_cache_free is omitted after the object has already been published. Both forms have appeared in real driver code.

Structural CVE patterns

Three structural patterns appear repeatedly across Linux kernel copy-fault vulnerabilities:

Incomplete error unwind: The handler frees the allocation but does not remove it from a reference-counted data structure, leaving a dangling pointer. Subsequent accesses dereference the freed slab object.

Missing EFAULT check: The handler ignores the return value of copy_from_user entirely, treating the partial copy as success. This is possible because the C type system does not enforce checking unsigned long return values, and __must_check annotations are not universally applied to internal wrappers.

Type-confused partial init: A struct with a discriminant field early in its layout and variable-length payload fields later. A page-boundary fault populates the discriminant (controlling which code path executes) while the payload fields never receive userspace data: they are zeroed by copy_from_user or, where a non-zeroing copy variant or a field-by-field copy is used, still hold stale data from a previous slab occupant. The kernel then dispatches through the discriminant to a code path that trusts the unpopulated payload.
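The missing-check pattern can be made harder to write by annotating wrappers so the compiler complains when a result is dropped. The following is a userspace sketch, not driver code: fetch_cfg and its readable parameter are hypothetical, and must_check is defined locally with the same GCC/Clang attribute that the kernel's __must_check macro uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Local equivalent of the kernel's __must_check annotation. */
#define must_check __attribute__((warn_unused_result))

/*
 * Hypothetical stand-in for a subsystem wrapper around copy_from_user().
 * 'readable' simulates how many source bytes can be read before a fault.
 * Returns 0 on success or -EFAULT if the full length is not available.
 */
must_check int fetch_cfg(void *dst, const void *src, size_t n, size_t readable)
{
    assert(dst != NULL && src != NULL);
    if (readable < n)
        return -EFAULT;   /* propagate the failure; dst is left untouched */
    memcpy(dst, src, n);
    return 0;
}
```

With this annotation, a bare call such as fetch_cfg(dst, src, n, r); produces an unused-result warning under GCC and Clang, which catches wrappers whose callers silently treat a failed copy as success.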

Target systems: Linux kernel 5.x and 6.x. All syscall handlers that copy variable-length or multi-field structs from userspace are potentially affected if they perform object allocation before the copy completes, or fail to check copy return values.


Threat Model

Adversary: An unprivileged local attacker with code execution in a normal user process. This covers container escapes where initial code execution is already inside a container as an unprivileged uid, compromised service processes running without CAP_SYS_ADMIN, or an attacker who has achieved RCE in a non-root web service.

Objective: Local privilege escalation (LPE) to uid=0 or direct kernel code execution, either to escape a container, read /etc/shadow, or install a persistent kernel rootkit.

Attack path 1: page-boundary fault → partial init → type confusion

The attacker allocates a userspace buffer for a target struct. The buffer is laid out so that the first N bytes of the struct (up to the page boundary) are mapped readable, and the bytes immediately following the boundary are either unmapped or mapped as a guard page via mmap(PROT_NONE). The syscall is issued with this address. The kernel copies the first portion, faults, and — if the handler is not correct — leaves a partially-initialised kernel object live on the heap.

If the struct’s early fields control a type discriminant or pointer field, the attacker controls which code path the kernel subsequently executes and with what pointer value. Without SMAP, the attacker can place a fake kernel object at a known userspace address and forge the pointer fields to point into their own mapping.

Attack path 2: copy-fault race window → use-after-free

Some subsystems process copy errors asynchronously or inside RCU read-side critical sections. An attacker triggers the page fault, then races a second thread that either frees or reallocates the same slab object while the first thread is still in the EFAULT return path. The result is a classic UAF with the attacker controlling the object’s content via the partial copy. This is the same primitive as heap spray–based UAF exploits but with an additional attacker-controlled initialisation step for the first N bytes of the object.

Attack path 3: USMA without SMAP

On kernels or CPU configurations where Supervisor Mode Access Prevention is absent (pre-SMAP hardware, SMAP disabled in bootloader, hypervisor CPUID masking), a kernel dereference of an attacker-controlled pointer does not fault even when the pointer points into userspace. This means an attacker who achieves any control over a kernel pointer — including via the partial-copy primitive above — can point the kernel at a fake structure in userspace memory and have that fake structure be treated as authoritative kernel data.

Without SMAP, the combination of a copy-fault partial-init and a missing bounds check becomes a reliable, low-noise LPE primitive because the attacker can refine the fake object across multiple attempts without needing to control kernel heap layout precisely.

Blast radius comparison:

| Condition | Attacker capability |
|---|---|
| No SMAP, no SLUB hardening, no KASAN | Reliable LPE; fake objects in userspace; no detection |
| SMAP enabled, no SLUB hardening | LPE still possible but requires kernel heap control; forged pointers must land in kernel memory |
| SMAP + SLUB freelist randomisation | Heap spray reliability degrades; attacker needs additional info leak to locate target slab |
| SMAP + SLUB hardening + KASAN | Use-after-free and out-of-bounds accesses to the partial object detected at access time; freelist corruption detected at free time; most exploit chains break |

Configuration / Implementation

Verifying SMAP and SMEP

SMAP (Supervisor Mode Access Prevention) and SMEP (Supervisor Mode Execution Prevention) are hardware features. SMAP causes the CPU to fault when kernel-mode code accesses a userspace virtual address without an explicit STAC/CLAC bracket; copy_from_user uses this bracket legitimately, but an attacker-forged kernel pointer dereference outside that bracket faults.

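# Verify SMAP and SMEP are reported by the CPU (prints each flag once)
grep -owE 'smap|smep' /proc/cpuinfo | sort -u

# Verify the kernel was not booted with SMAP/SMEP disabled
# (nosmap/nosmep where the kernel still accepts them; clearcpuid= otherwise)
grep -E 'nosmap|nosmep|clearcpuid=' /proc/cmdline && echo "WARNING: check SMAP/SMEP boot options" || echo "OK"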

SMAP and SMEP are enabled automatically when the CPU reports them and the kernel has not been told to disable them. Current kernels need no explicit CONFIG_ option (older kernels gated SMAP behind CONFIG_X86_SMAP, enabled by default); the kernel detects the CPUID bits at boot and sets CR4.SMAP / CR4.SMEP.
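For a CI gate that should not depend on shell quoting, the flag check can also be done in C. This is a minimal sketch: has_cpu_flag is a hypothetical helper name, and it assumes the caller has already read the flags line from /proc/cpuinfo. It matches whole tokens only, so a flag name embedded in a longer word does not count.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if 'flag' appears as a whole whitespace-separated token in
 * 'flags_line' (the text of a /proc/cpuinfo "flags" line), else 0.
 */
int has_cpu_flag(const char *flags_line, const char *flag)
{
    size_t len;
    const char *p;

    assert(flags_line != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return 0;
    for (p = strstr(flags_line, flag); p != NULL; p = strstr(p + 1, flag)) {
        int starts = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return 1;   /* exact token match */
    }
    return 0;
}
```

A caller would read /proc/cpuinfo line by line with fgets, pick the first line that begins with "flags", and fail the gate unless has_cpu_flag returns 1 for both "smap" and "smep".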

To confirm the features are active in the running kernel, rely on the same /proc/cpuinfo flags: the kernel omits features it has cleared (for example via a boot option), so a missing smap or smep flag means the feature is not in use. CR4 itself (bit 20 = SMEP, bit 21 = SMAP) is not readable from userspace, and mainline kernels do not log a dedicated "SMAP enabled" line at boot, so grepping dmesg is not a reliable check.

SLUB freelist randomisation and hardening

The SLUB allocator’s freelist randomisation and hardening options significantly raise the bar for heap shaping attacks that exploit copy-fault primitives.

# Check kernel config for SLUB hardening options
zcat /proc/config.gz 2>/dev/null | grep -E 'SLAB_FREELIST|SLUB_DEBUG'
# or
grep -E 'SLAB_FREELIST|SLUB_DEBUG' /boot/config-$(uname -r)

The relevant options:

CONFIG_SLAB_FREELIST_RANDOM=y      # Randomise free object order within each slab
CONFIG_SLAB_FREELIST_HARDENED=y    # Encode freelist pointers; detect corruption at free time
CONFIG_SLUB_DEBUG=y                # Enable runtime slab debugging (production: selective)

CONFIG_SLAB_FREELIST_RANDOM means an attacker cannot rely on deterministic slab ordering to place a target object immediately adjacent to the partially-initialised one. CONFIG_SLAB_FREELIST_HARDENED XORs each stored freelist pointer with a per-cache random value and the address at which the pointer is stored. An overwritten pointer therefore decodes to an unpredictable address and typically crashes the kernel on a later allocation instead of handing the attacker a chosen object, which blocks the allocator-metadata corruption that copy-fault exploits depend on for reliable heap control.
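The pointer obfuscation can be illustrated with a small userspace model. This is a sketch under assumptions, not the allocator's code: freelist_xform and freelist_selfcheck are hypothetical names, the values are arbitrary 64-bit integers, and the byte swap of the storage address reflects the author's understanding of recent mainline SLUB.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of hardened freelist pointer obfuscation; NOT kernel code.
 *   stored = ptr ^ cache_random ^ byteswap(address_where_pointer_is_stored)
 * XOR is its own inverse, so the same function encodes and decodes.
 */
uint64_t freelist_xform(uint64_t value, uint64_t cache_random,
                        uint64_t slot_addr)
{
    return value ^ cache_random ^ __builtin_bswap64(slot_addr);
}

/* Self-check: decoding an encoded pointer must yield the original. */
void freelist_selfcheck(uint64_t ptr, uint64_t cache_random,
                        uint64_t slot_addr)
{
    uint64_t stored = freelist_xform(ptr, cache_random, slot_addr);

    assert(freelist_xform(stored, cache_random, slot_addr) == ptr);
}
```

The point of the model is the asymmetry: the allocator, which knows cache_random, recovers the real next pointer, while a raw address written over the slot by a memory-corruption bug decodes to a value the attacker cannot predict without first leaking the secret.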

To enable these in a custom kernel build:

scripts/config --enable SLAB_FREELIST_RANDOM
scripts/config --enable SLAB_FREELIST_HARDENED

KASAN and KFENCE for detecting partial initialisation

KASAN (Kernel Address Sanitizer) instruments every memory access in the kernel. In the copy-fault context, it detects out-of-bounds accesses and accesses to freed objects on the error path; it does not track initialisation, which is the job of KMSAN. KFENCE provides a lighter-weight sampling-based detector suitable for production kernels.

KASAN (test kernels, CI pipelines):

# Kernel config
grep -E 'KASAN|KFENCE' /boot/config-$(uname -r)
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y        # Software instrumentation (slower, comprehensive)
CONFIG_KASAN_INLINE=y         # Inline checks vs out-of-line (faster, larger binary)

Boot a KASAN kernel and exercise copy paths. A use-after-free of a partially-written object will surface as:

==================================================================
BUG: KASAN: slab-use-after-free in my_driver_process+0x48/0x110
Read of size 8 at addr ffff888012345678 by task syzkaller/1234

KFENCE (production-safe sampling):

# Requires CONFIG_KFENCE=y; the sample interval is in milliseconds (default 100)
# /etc/default/grub or kernel command line:
GRUB_CMDLINE_LINUX="kfence.sample_interval=100"

KFENCE allocates a small pool of guard-paged objects and randomly serves allocations from it. A use-after-free or out-of-bounds write on a KFENCE object causes an immediate fault with a kernel warning, visible in dmesg.

Fault injection testing with CONFIG_FAULT_INJECTION

The kernel’s fault injection framework lets you force copy_from_user to fail at arbitrary call sites, enabling systematic testing of error unwind paths in kernel modules and drivers.

# Kernel config required
grep FAULT_INJECTION /boot/config-$(uname -r)
CONFIG_FAULT_INJECTION=y
CONFIG_FAULT_INJECTION_USERCOPY=y   # Inject failures into copy_from/to_user calls (debugfs: fail_usercopy)
CONFIG_FAULT_INJECTION_DEBUG_FS=y

With these enabled, control fault injection via debugfs:

# Mount debugfs if not already mounted
mount -t debugfs none /sys/kernel/debug

# Configure copy_from_user fault injection
echo 1    > /sys/kernel/debug/fail_usercopy/probability   # failure likelihood in percent (0-100)
echo 1    > /sys/kernel/debug/fail_usercopy/interval      # consider every call for injection
echo -1   > /sys/kernel/debug/fail_usercopy/times          # unlimited
echo 1    > /sys/kernel/debug/fail_usercopy/task-filter
echo 1    > /sys/kernel/debug/fail_usercopy/verbose

# Restrict to a specific process by setting its attribute
echo 1 > /proc/$(pgrep target_process)/make-it-fail

Run your syscall test harness against the target driver with fault injection enabled and observe whether:

  1. The kernel panics (missing error unwind, use-after-free of partial object)
  2. dmesg shows KASAN reports
  3. The syscall returns EFAULT cleanly with no memory corruption

Syzkaller for fuzzing copy paths

Syzkaller is Google’s kernel coverage-guided fuzzer. It generates structured syscall sequences and is highly effective at finding copy-fault handling bugs because it exercises unusual argument shapes, including cross-page-boundary allocations.

A minimal syzkaller configuration targeting a custom driver:

{
    "target": "linux/amd64",
    "http": "127.0.0.1:56741",
    "workdir": "/tmp/syzkaller-work",
    "kernel_obj": "/path/to/kernel/build",
    "image": "/path/to/vm-image.img",
    "sshkey": "/path/to/ssh-key",
    "syzkaller": "/path/to/syzkaller",
    "procs": 4,
    "type": "qemu",
    "vm": {
        "count": 4,
        "kernel": "/path/to/bzImage",
        "cpu": 2,
        "mem": 2048
    },
    "enable_syscalls": ["ioctl$MY_DRIVER_CMD"]
}

Run syzkaller against a KASAN-enabled kernel. The combination of coverage guidance and fault injection (CONFIG_FAULT_INJECTION_USERCOPY) systematically exercises copy-failure paths that manual testing misses.

Writing safe copy paths

A safe copy path has three properties: it checks the return value, it unwinds all kernel state on error, and it uses __user annotations so sparse can enforce the discipline statically.

static int safe_cmd(struct file *f, unsigned long arg)
{
    struct my_cfg cfg;
    struct kernel_obj *obj;
    int ret;

    /* zero-initialise before copy: prevents stale data confusion */
    memset(&cfg, 0, sizeof(cfg));

    if (copy_from_user(&cfg, (void __user *)arg, sizeof(cfg)))
        return -EFAULT;

    /* validate fields before allocating */
    if (cfg.type >= MY_TYPE_MAX || cfg.len > MY_LEN_MAX)
        return -EINVAL;

    obj = kmem_cache_zalloc(obj_cache, GFP_KERNEL);  /* zalloc, not alloc */
    if (!obj)
        return -ENOMEM;

    obj->cfg = cfg;

    ret = register_obj(obj);
    if (ret) {
        kmem_cache_free(obj_cache, obj);
        return ret;
    }

    return 0;
}

Key changes from the vulnerable pattern: copy_from_user runs before allocation (no partial-init of a live kernel object), the return value is checked unconditionally, kmem_cache_zalloc ensures the object is fully zero-initialised even if the copy were to occur after allocation, and field validation runs before the object enters any reachable data structure.

Run the sparse static checker against driver code to catch missing __user annotations. Address-space checking is on by default, and C=2 checks every source file rather than only those being recompiled:

make C=2 drivers/my_driver/

Kernel Lockdown LSM

Kernel lockdown restricts access to /dev/mem, /proc/kcore, and kernel module loading in ways that close some post-exploitation access paths. Even after a copy-fault exploitation succeeds, lockdown limits what the attacker can do with kernel write primitives.

# Check lockdown state
cat /sys/kernel/security/lockdown

# Enable in bootloader (grub example)
# GRUB_CMDLINE_LINUX="lockdown=confidentiality"

# Or compile in a default:
# CONFIG_SECURITY_LOCKDOWN_LSM=y
# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY=y

Expected Behaviour

The following table shows what an attacker obtains at each stage of applying the mitigations described above. “Attacker position” assumes an unprivileged local process with the ability to issue arbitrary syscalls.

| Configuration | Partial-copy primitive | Heap control | Pointer forgery | LPE reliability |
|---|---|---|---|---|
| Baseline (no SMAP, no SLUB hardening) | Works; object left on heap | Deterministic slab ordering | Userspace pointers accepted by kernel | High; textbook technique |
| SMAP enabled | Works | Deterministic | Requires kernel-space target; userspace pointer faults | Reduced; needs heap info leak |
| SMAP + SLUB freelist random | Works | Non-deterministic order | Requires kernel-space target | Low; spray unreliable without leak |
| SMAP + SLUB random + SLUB hardened | Works | Hardened; corruption detected at free | Corruption panics kernel | Very low; crashes rather than exploits |
| Full stack (above + KASAN + lockdown) | Detected at access time | Detected | Detected | Effectively mitigated in test; surface minimal in production |

Verifying KASAN detection. On a test kernel with KASAN and CONFIG_FAULT_INJECTION_USERCOPY, trigger a controlled fault and observe dmesg:

dmesg | grep -A 20 'BUG: KASAN'

Illustrative output when the handler reads beyond a partially filled object (symbols, offsets, and addresses will differ):

==================================================================
BUG: KASAN: slab-out-of-bounds in kernel_obj_process+0x7c/0x180 [my_driver]
Read of size 8 at addr ffff888034ab1040 by task test_harness/2341

CPU: 2 PID: 2341 Comm: test_harness Not tainted 6.8.0-kasan #1
Call Trace:
 kasan_report+0xb2/0xe0
 kernel_obj_process+0x7c/0x180 [my_driver]
 my_ioctl+0x44/0xa0 [my_driver]
 __x64_sys_ioctl+0x8e/0xd0

Verifying SLUB hardened freelist. A corrupted freelist pointer causes:

BUG: KASAN: slab-out-of-bounds ...
or
kernel BUG at mm/slub.c:XXXX!

rather than silent heap corruption and continued execution.

Sysctl hardening baseline (add to /etc/sysctl.d/60-kernel-hardening.conf):

# Restrict access to kernel pointers in /proc
kernel.kptr_restrict = 2

# Restrict dmesg to root (limits KASLR defeat via boot messages)
kernel.dmesg_restrict = 1

# Disable kexec (limits post-exploit persistence)
kernel.kexec_load_disabled = 1

# Restrict perf to root (limits hardware PMU-based KASLR defeat)
# Level 3 is defined only by a distro patch (e.g. Debian, Ubuntu); mainline defines levels up to 2
kernel.perf_event_paranoid = 3

Trade-offs

| Mitigation | Performance overhead | Memory overhead | Operational impact | Notes |
|---|---|---|---|---|
| SMAP/SMEP | < 1% on modern CPUs (Skylake+) | None | None in production | CR4 bits are set once at boot; SMAP adds a STAC/CLAC pair around each user access; negligible on benchmarks |
| SLUB freelist randomisation | < 0.5% allocator throughput | None | None | Overhead from PRNG call per slab; imperceptible in most workloads |
| SLUB freelist hardening | ~1–2% allocator throughput | None | Kernel panic on detected corruption | Panic-on-corruption is deliberate; tune panic_on_oops accordingly |
| KASAN (generic) | 2–4× slowdown | 2–8× memory | Not suitable for production; CI and fuzzing only | Use KFENCE in production |
| KFENCE | < 1% (sampling) | ~2 MB fixed pool | Safe for production; catches subset of bugs | Miss rate depends on sampling interval; tunable |
| CONFIG_FAULT_INJECTION | Negligible when inactive | None | Requires debugfs access; disable in production | Boot with debugfs=off unless actively testing |
| Kernel lockdown (confidentiality) | None | None | Blocks kexec, /dev/mem, unsigned modules | Can break hibernation; verify against workload before deploying |
| Syzkaller fuzzing | N/A (offline) | N/A | Requires dedicated VM fleet | One-time setup cost; run continuously against kernel dev branches |

Failure Modes

| Failure mode | Root cause | Detection | Remediation |
|---|---|---|---|
| SMAP disabled in bootloader | nosmap on kernel command line, or hypervisor masking CPUID.SMAP bit | `grep nosmap /proc/cmdline`; `dmesg \| grep -i smap` | |
| SMEP disabled without noticing | Bare-metal upgrade to CPU without SMEP, or KVM guest with stripped CPUID | `grep smep /proc/cpuinfo` returns empty | Enforce SMEP/SMAP in VM template CPUID policy; CI gate on /proc/cpuinfo check |
| KASAN false negatives for partial init | Partial copy fills fields that are then accessed before a subsequent KASAN-instrumented call | Object access after partial copy but before kmem_cache_free is in a window KASAN cannot see without init tracking | Enable CONFIG_KASAN_STACK; run a separate CONFIG_KMSAN (Kernel Memory Sanitizer) build for uninit tracking, since KMSAN and KASAN cannot be enabled together |
| KMSAN not detecting uninitialised reads | KMSAN treats bytes written by copy_from_user, including its zero-filled tail, as initialised, so a logically incomplete object raises no report | KMSAN report absence despite partial copy | Zero-initialise slab objects with kmem_cache_zalloc; explicit memset before partial-copy structs |
| copy_from_user wrappers that swallow errors | Internal subsystem wrappers that call copy_from_user but return void or coerce the return to bool incorrectly | Compiler warns when a __must_check result is ignored; manual audit of wrapper return types | Fix wrappers to propagate unsigned long return; annotate with __must_check; run make C=1 |
| SLUB hardening disabled in distro config | Distribution ships CONFIG_SLAB_FREELIST_HARDENED=n for performance | `zcat /proc/config.gz \| grep SLAB_FREELIST_HARDENED` | |
| Lockdown bypassed via module loading | Unsigned module loaded before lockdown LSM activates, or lockdown=integrity (not confidentiality) | `dmesg \| grep lockdown`; check module signing policy | |
| Fault injection left active in production | debugfs left mounted and writable; fail_usercopy probability > 0 | `cat /sys/kernel/debug/fail_usercopy/probability` | Set probability to 0 after testing; restrict debugfs: GRUB_CMDLINE_LINUX="debugfs=off" |