TOCTOU Vulnerability Defences: Eliminating Time-of-Check to Time-of-Use Races Across the Stack

Problem

A TOCTOU vulnerability has the following shape:

check(X)            ← security decision made on state X
[race window]       ← attacker substitutes X′ for X
use(X)              ← operation executes against X′, not X

The check passes on X. The use operates on X′. The security property assumed at check time does not hold at use time. Everything between those two moments is the race window — and the window can be as short as a single context-switch latency, which is enough.

TOCTOU bugs appear in every layer of the stack because the pattern is architectural: any design that separates a validation step from an execution step, without holding state immutable between them, is structurally vulnerable. The concrete categories:

Filesystem TOCTOU — the most historically exploited class. A privileged process calls access(path, W_OK) to check whether the calling user can write to path, then calls open(path, O_WRONLY) to perform the write. Between the two calls, an attacker replaces path with a symlink pointing to /etc/shadow. The access() check resolved path to a writable file owned by the user. The open() call resolves the same path string again and follows the newly-inserted symlink. The write goes to /etc/shadow under the process’s elevated privileges. This is a symlink race, and it has been the root cause of privilege escalation bugs in sudo, at, crontab, tmpwatch, and dozens of Unix utilities over four decades.

The stat() + open() variant works identically: check metadata with stat(), open the path, but the file has been swapped between the two calls.
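The window can be made visible without threads. The sketch below is an illustration only, not an exploit: `vulnerable_write`, `safer_write`, and the `between` hook are hypothetical names, and the hook stands in for whatever an attacker manages to do between the two syscalls.

```python
import os
import stat
import tempfile


def vulnerable_write(path, data, between=lambda: None):
    """Check with lstat(), then open by name: two separate path resolutions."""
    st = os.lstat(path)                      # check
    if stat.S_ISLNK(st.st_mode):
        raise PermissionError("refusing to write through a symlink")
    between()                                # race window (attacker acts here)
    fd = os.open(path, os.O_WRONLY)          # use: resolves the name again
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def safer_write(path, data):
    """Let open() itself refuse a symlink in the final component."""
    fd = os.open(path, os.O_WRONLY | os.O_NOFOLLOW)
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def demo():
    """Return (bytes that reached the sensitive file, whether safer_write refused)."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "user-file")
        sensitive = os.path.join(d, "sensitive")
        for p in (target, sensitive):
            with open(p, "wb"):
                pass

        def swap():
            # Stand-in for the attacker: replace the file with a symlink.
            os.unlink(target)
            os.symlink(sensitive, target)

        vulnerable_write(target, b"pwned", between=swap)
        with open(sensitive, "rb") as f:
            leaked = f.read()

        try:
            safer_write(target, b"again")    # target is still the planted symlink
            refused = False
        except OSError:
            refused = True
        return leaked, refused
```

Calling `demo()` shows the write landing in the file the check never examined, while the `O_NOFOLLOW` variant fails instead of following the planted link. A real attacker has to win the timing; the hook simply removes that uncertainty for the sake of the illustration.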

Kernel copy TOCTOU affects kernel code that must access memory in userspace. When a syscall handler receives a pointer argument from userspace, the memory at that address is not under the kernel’s control. A correct implementation reads the value once, copies it into a kernel-local variable, validates the copy, and operates on the copy. A buggy implementation fetches through the pointer twice: once to validate, once to use. An attacker running a second thread can modify the memory between the two fetches, causing the kernel to validate one value and operate on a different one. This is the double-fetch vulnerability pattern. CVE-2016-6516 (the file-dedupe ioctl handler in fs/ioctl.c) and CVE-2016-6480 (the aacraid driver’s ioctl_send_fib) are historical examples of this class in the Linux kernel.

Kubernetes admission control TOCTOU is not an exploitable attack path in the conventional sense, but it explains why the admission pipeline is designed the way it is. When an admission webhook validates a Pod specification, it receives the proposed object. If that object could be modified between the webhook’s validation decision and the API server’s write to storage, an attacker able to mutate it in flight could bypass the webhook. The design prevents this: the object that passed admission is the object that is persisted, and any later change is a new write that the API server checks with optimistic concurrency. resourceVersion is an opaque version token that changes on every modification, and the API server rejects writes that present a stale resourceVersion. Understanding this mechanism clarifies why bypasses of admission control typically target other seams, such as privilege escalation that sidesteps RBAC, rather than mutation of the object in the admission window.

Application-level TOCTOU — any check-then-act sequence in application code. The file-creation pattern: check whether a file exists with os.path.exists(), then create it with open(path, 'w') — an attacker creates a symlink at path between the check and the create, and the open() follows the symlink. The permission pattern: load a user’s permissions at the start of an operation, cache them, perform the operation — another process revokes the permission between load and use, but the cached value says the permission still holds.

The unifying property is that the security-relevant state is read more than once, and the reads are not atomic with respect to concurrent modification.

Threat Model

Threat 1: Filesystem symlink race for arbitrary write as root

A privileged daemon (running as root or with CAP_DAC_OVERRIDE) checks whether a user-supplied path is safe before writing to it. The check uses lstat() to confirm the path is a regular file in a writable directory, not a symlink. The write uses open(path, O_WRONLY). An attacker creates a thread that repeatedly unlinks and re-creates path as a symlink to /etc/cron.d/backdoor. The privilege of the daemon, the user-writable directory, and the race window between lstat() and open() combine to produce an arbitrary write as root. The attacker wins the race by running the swap loop and the privileged operation in a tight concurrent loop; on a busy system, the expected number of iterations to win is small.

Threat 2: Kernel double-fetch leading to heap overflow

A syscall handler validates a count argument from userspace: if count > MAX, return -EINVAL. It then fetches count from userspace a second time and passes that value to copy_from_user(kernel_buf, user_ptr, count), rather than using the validated local copy. kernel_buf is MAX bytes long. An attacker passes a safe value for count during validation while a second thread overwrites the userspace count with a large number just before the second fetch. The copy uses the large value, writes past the end of kernel_buf, and produces a kernel heap overflow reachable from an unprivileged process.

Threat 3: Application symlink race on file creation

An application creates per-user configuration files in /tmp/appname-$USER/. It checks whether the directory exists, creates it if not, then writes a configuration file into it. An attacker pre-creates /tmp/appname-victim as a symlink to /home/victim/.ssh/. When the application creates its configuration file, it writes into .ssh/ under the application’s privileges, potentially overwriting authorized_keys.

Threat 4: Permission revocation race

An API server loads a user’s authorization token at the start of a long-running operation, validates it, then proceeds through a multi-step transaction. An administrator revokes the user’s token at step 3. The remaining steps execute using the cached validation result from step 1. The user performs actions they are no longer authorized to perform. This matters most in operations that take more than a few milliseconds and in systems that treat token validation as a one-time gate.
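The difference between a one-time gate and a per-step check can be sketched in a few lines. This is a minimal model under stated assumptions: `TokenStore` is a hypothetical in-memory stand-in for whatever authoritative service holds revocation state, and `run_steps` is an illustrative name.

```python
import threading


class TokenStore:
    """In-memory stand-in for an authoritative token service (hypothetical)."""

    def __init__(self):
        self._revoked = set()
        self._lock = threading.Lock()

    def revoke(self, token):
        with self._lock:
            self._revoked.add(token)

    def is_valid(self, token):
        with self._lock:
            return token not in self._revoked


def run_steps(store, token, steps):
    """Run each step only after re-checking the token against the store."""
    completed = []
    for step in steps:
        if not store.is_valid(token):        # check immediately before each use
            raise PermissionError(
                f"token revoked before step {len(completed) + 1}"
            )
        completed.append(step())
    return completed
```

Re-checking before every step does not remove the window; it shrinks it from the whole operation to a single step. Where even that is too wide, the check and the action have to share one atomic unit, such as a database transaction.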

Configuration / Implementation

Filesystem TOCTOU: openat2() with RESOLVE flags

The openat2() syscall (Linux 5.6+) is the correct answer to filesystem TOCTOU. It accepts a struct open_how that includes resolve flags controlling how path components are resolved. These flags are evaluated atomically during path walk — they cannot be bypassed by races between separate syscalls.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

/* Open path relative to dirfd; mode is used only when flags contains O_CREAT. */
int safe_open_beneath(int dirfd, const char *path, int flags, mode_t mode) {
    struct open_how how;
    memset(&how, 0, sizeof(how));
    how.flags = (__u64)flags | O_CLOEXEC;
    /* openat2() returns EINVAL for a non-zero mode when no file is being created */
    if (flags & O_CREAT)
        how.mode = mode;
    how.resolve = RESOLVE_NO_SYMLINKS    /* reject any symlink component */
               | RESOLVE_NO_MAGICLINKS  /* reject /proc/self/fd, /proc/self/exe, etc. */
               | RESOLVE_BENEATH;       /* reject paths that escape dirfd via .. */

    int fd = (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
    if (fd < 0 && errno == ENOSYS) {
        /* kernel < 5.6: refuse rather than fall back to an unprotected open() */
        fprintf(stderr, "openat2 not available; refusing to fall back\n");
        errno = ENOSYS;                  /* fprintf may have clobbered errno */
    }
    return fd;
}

/* Usage: open a file for writing that must resolve beneath /srv/uploads/
   without traversing any symlink */
int main(void) {
    int dirfd = open("/srv/uploads", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd < 0) { perror("open dirfd"); return 1; }

    int fd = safe_open_beneath(dirfd, "user-data.txt",
                               O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("safe_open_beneath"); close(dirfd); return 1; }

    const char *data = "safe content\n";
    if (write(fd, data, strlen(data)) < 0)
        perror("write");

    close(fd);
    close(dirfd);
    return 0;
}

RESOLVE_NO_SYMLINKS causes the path walk to fail with ELOOP if any component, including the final one, is a symlink. RESOLVE_NO_MAGICLINKS rejects magic links: the procfs entries such as /proc/self/fd/N and /proc/self/exe that look like symlinks but jump directly to their target instead of being resolved as a path, which lets them cross directory and mount boundaries. Per openat2(2), RESOLVE_NO_SYMLINKS already implies RESOLVE_NO_MAGICLINKS, so naming both is redundant but documents intent. RESOLVE_BENEATH ensures the resolved path stays under dirfd; an absolute path or a .. traversal that would escape is rejected with EXDEV.

The critical property: all three flags are applied during the kernel’s path walk, in a single atomic operation. There is no window between check and use.

O_NOFOLLOW for single-component opens

When the path is a single component (no directory traversal), O_NOFOLLOW is sufficient and available on kernels older than 5.6:

int fd = open(path, O_WRONLY | O_NOFOLLOW | O_CLOEXEC);
if (fd < 0 && errno == ELOOP) {
    /* path is a symlink — reject */
}

O_NOFOLLOW only protects the final component. In a path like dir/file, a symlink at dir is still followed, and O_NOFOLLOW does nothing about .. components that climb out of the intended directory. Use openat2() with RESOLVE_BENEATH and RESOLVE_NO_SYMLINKS for any path with directory components.
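Where openat2() is unavailable, one commonly used approximation is to walk the path one component at a time, opening each directory relative to the previous descriptor with O_NOFOLLOW. The sketch below assumes Linux and Python's `dir_fd` support; `open_beneath_nofollow` is a hypothetical helper, and it rejects `..` outright instead of reproducing the kernel's RESOLVE_BENEATH semantics.

```python
import errno
import os


def open_beneath_nofollow(dirfd, relpath, flags, mode=0o600):
    """Open relpath under dirfd, refusing symlinks in every component."""
    parts = [p for p in relpath.split("/") if p not in ("", ".")]
    if relpath.startswith("/") or not parts or ".." in parts:
        raise OSError(errno.EXDEV, "path escapes the base directory", relpath)

    current = os.dup(dirfd)                  # private copy that we may close
    try:
        for name in parts[:-1]:
            # O_NOFOLLOW | O_DIRECTORY: a symlink here fails instead of
            # being followed (ELOOP or ENOTDIR, depending on the kernel).
            nxt = os.open(
                name,
                os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
                dir_fd=current,
            )
            os.close(current)
            current = nxt
        # Final component: the caller's flags plus O_NOFOLLOW.
        return os.open(parts[-1], flags | os.O_NOFOLLOW, mode, dir_fd=current)
    finally:
        os.close(current)
```

Each step resolves exactly one name against an already-open directory, so swapping an intermediate directory for a symlink makes the walk fail. It costs one syscall per component, and it does not address magic links, which is part of why openat2() remains the stronger choice.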

Use file descriptors, not path names, for subsequent operations

Once a file is open, operate on the file descriptor, not on the path:

struct stat st;
fstat(fd, &st);   /* not stat(path, &st) */
fchmod(fd, 0644); /* not chmod(path, 0644) */
fchown(fd, uid, gid); /* not chown(path, uid, gid) */

A path-based operation re-resolves the path name. An fd-based operation operates on the inode already referenced by the descriptor. There is no race window.
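The same discipline carries over to Python, which exposes the descriptor-based calls as `os.fstat` and `os.fchmod`. A minimal sketch, assuming a Unix platform; `tighten_permissions` is a hypothetical helper name:

```python
import os
import stat


def tighten_permissions(path, mode=0o600):
    """Open once, then inspect and modify through the descriptor only."""
    # O_NONBLOCK keeps the open from hanging if path turns out to be a FIFO.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK)
    try:
        st = os.fstat(fd)                    # describes the inode actually opened
        if not stat.S_ISREG(st.st_mode):
            raise PermissionError("not a regular file")
        os.fchmod(fd, mode)                  # unaffected by later renames of path
        return st.st_ino
    finally:
        os.close(fd)
```

The path string is resolved exactly once, in `os.open`. Everything after that refers to the inode held by the descriptor, so replacing or renaming the path afterwards cannot redirect the `fchmod`.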

Atomic file creation with O_CREAT | O_EXCL

To create a file that must not already exist — and must not silently follow a symlink on creation:

int fd = open(path, O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW | O_CLOEXEC, 0600);
if (fd < 0 && errno == EEXIST) {
    /* file already exists — handle the collision */
}

O_EXCL causes open() to fail with EEXIST if the path already exists, including if it is a symlink. The check and the creation are a single atomic syscall. The os.path.exists() + open() two-step has no place in security-sensitive code.

Kernel Copy TOCTOU: copy once, use the local copy

The rule for kernel code reading from userspace is: copy the data into a kernel-local variable first, validate the copy, then use the copy. Never read from a __user pointer more than once.

/* WRONG: double-fetch; attacker can change ureq->count between the two reads */
long bad_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    struct user_req __user *ureq = (struct user_req __user *)arg;
    struct user_req kreq;
    size_t count;

    if (get_user(count, &ureq->count))  /* read 1: this value is validated */
        return -EFAULT;
    if (count > MAX_COUNT)
        return -EINVAL;

    /* BUG: the whole struct is fetched again, including count.
       kreq.count was never validated and may now exceed MAX_COUNT. */
    if (copy_from_user(&kreq, ureq, sizeof(kreq)))  /* read 2: TOCTOU */
        return -EFAULT;
    return do_copy_from_kernel(&kreq);
}

/* CORRECT: copy entire struct once, validate and use the local copy */
long good_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    struct user_req __user *ureq = (struct user_req __user *)arg;
    struct user_req kreq;   /* kernel-local copy */

    if (copy_from_user(&kreq, ureq, sizeof(kreq)))
        return -EFAULT;

    /* All subsequent accesses use kreq, which is in kernel memory.
       The attacker cannot modify it. */
    if (kreq.count > MAX_COUNT)
        return -EINVAL;

    return do_copy_from_kernel(&kreq);  /* operates on kreq, not ureq */
}

For single scalar values, get_user() fetches the value from userspace in one read:

u32 val;
if (get_user(val, (u32 __user *)uptr))
    return -EFAULT;
/* val is now in kernel space; validate and use val, never uptr again */

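The __user annotation (checked by the sparse static analyser) marks pointers into userspace. Any direct dereference of a __user pointer in kernel code is a bug; access must go through get_user(), copy_from_user(), and related helpers. Running make C=1 or make C=2 with the kernel build invokes sparse, which flags direct __user dereferences and address-space mismatches. Sparse does not detect a double fetch made through two legitimate get_user() or copy_from_user() calls, so that pattern still needs code review.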
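The copy-once rule can also be modelled outside the kernel. The following is a userspace analogy, not kernel code: `UserMemory` is a hypothetical class that imitates a word of user memory which a second thread changes after the first fetch, and `MAX_COUNT` is an arbitrary illustrative limit.

```python
MAX_COUNT = 16


class UserMemory:
    """Models a userspace word that changes value after the first fetch."""

    def __init__(self, safe, hostile):
        self._values = [safe, hostile]
        self.fetches = 0

    def fetch(self):
        # First fetch sees the safe value; every later fetch sees the hostile one.
        value = self._values[min(self.fetches, 1)]
        self.fetches += 1
        return value


def double_fetch(mem):
    if mem.fetch() > MAX_COUNT:          # fetch 1: validated
        raise ValueError("count too large")
    return mem.fetch()                   # fetch 2: used, never validated


def single_fetch(mem):
    count = mem.fetch()                  # the only fetch; count is now private
    if count > MAX_COUNT:
        raise ValueError("count too large")
    return count
```

With `UserMemory(8, 4096)`, `double_fetch` returns 4096 even though it validated 8, while `single_fetch` returns the value it validated. The kernel fix has the same shape: the validated value and the used value must be the same local variable.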

Application-Level Patterns

File existence check and creation — use O_EXCL

import os
import errno

def create_config_file(path: str, content: str) -> None:
    try:
        # O_CREAT | O_EXCL is atomic: create only if not exists, fail if it is a symlink
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_NOFOLLOW, 0o600)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise FileExistsError(f"File already exists: {path}")
        if e.errno == errno.ELOOP:
            raise PermissionError(f"Path is a symlink: {path}")
        raise
    with os.fdopen(fd, 'w') as f:
        f.write(content)

Never use os.path.exists(path) followed by open(path, 'w'). The os.open() call with O_CREAT | O_EXCL | O_NOFOLLOW is atomic.
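The per-user directory in Threat 3 needs the same treatment one level up: the directory itself has to be created or verified atomically before anything is written into it. A minimal sketch, assuming Linux and Python's `dir_fd` support; `open_private_dir` and `write_config` are hypothetical helper names, and the ownership and mode checks are one reasonable policy, not the only one.

```python
import os
import stat


def open_private_dir(path):
    """Create (or reuse) a directory and return an fd that is safe to write under."""
    try:
        os.mkdir(path, 0o700)            # atomic: fails if anything, even a symlink, exists
    except FileExistsError:
        pass                             # verified below, through the descriptor
    # O_NOFOLLOW refuses a symlink planted at path; O_DIRECTORY refuses non-directories.
    dirfd = os.open(path, os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW)
    try:
        st = os.fstat(dirfd)             # inspects the opened inode, not the name
        if st.st_uid != os.geteuid():
            raise PermissionError("directory is owned by another user")
        if stat.S_IMODE(st.st_mode) & 0o077:
            raise PermissionError("directory is accessible to other users")
    except Exception:
        os.close(dirfd)
        raise
    return dirfd


def write_config(dirfd, name, content):
    """Create name inside the verified directory, relative to its descriptor."""
    fd = os.open(
        name,
        os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_NOFOLLOW,
        0o600,
        dir_fd=dirfd,
    )
    with os.fdopen(fd, "w") as f:
        f.write(content)
```

Because the file is created relative to the verified descriptor, swapping the directory name for a symlink after the check does not change where the write lands.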

Database: check-and-act atomically with SELECT FOR UPDATE

The permission-check-then-write pattern is the application-level equivalent of the filesystem race:

-- WRONG: two separate statements; permission can be revoked between them
SELECT has_permission FROM user_roles WHERE user_id = $1 AND resource = $2;
-- [administrator revokes permission here]
INSERT INTO audit_log (user_id, action) VALUES ($1, 'sensitive-action');
UPDATE resource SET status = 'processed' WHERE id = $3;

-- CORRECT: lock the permission row for the duration of the transaction
BEGIN;

SELECT has_permission
FROM user_roles
WHERE user_id = $1
  AND resource = $2
FOR UPDATE;          -- row-level lock held until COMMIT; concurrent revocation blocks

-- If has_permission is false here, ROLLBACK
INSERT INTO audit_log (user_id, action) VALUES ($1, 'sensitive-action');
UPDATE resource SET status = 'processed' WHERE id = $3;

COMMIT;

SELECT FOR UPDATE takes a row-level exclusive lock. A concurrent UPDATE or DELETE that revokes the permission blocks until this transaction commits or rolls back. The transaction sees a consistent permission state from check through commit. The lock exists only if the SELECT returns a row, so a missing row must be treated as a denial.
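From application code, the whole sequence has to run on one connection inside one transaction. The sketch below assumes a psycopg2-style DB-API connection (implicit transaction start, `%s` placeholders, cursor usable as a context manager) and reuses the table names from the SQL above; `perform_sensitive_action` is a hypothetical function name.

```python
def perform_sensitive_action(conn, user_id, resource, resource_id):
    """Check the permission and act on it inside a single transaction."""
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT has_permission FROM user_roles "
                "WHERE user_id = %s AND resource = %s FOR UPDATE",
                (user_id, resource),
            )
            row = cur.fetchone()
            if row is None or not row[0]:
                conn.rollback()          # releases the lock; nothing was written
                return False
            cur.execute(
                "INSERT INTO audit_log (user_id, action) VALUES (%s, %s)",
                (user_id, "sensitive-action"),
            )
            cur.execute(
                "UPDATE resource SET status = 'processed' WHERE id = %s",
                (resource_id,),
            )
        conn.commit()                    # the row lock is released only here
        return True
    except Exception:
        conn.rollback()
        raise
```

A missing row is treated the same as a denied permission, since there is then nothing to lock. Running the check on one pooled connection and the writes on another would silently reintroduce the race.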

Idempotency tokens against duplicate-request races

In distributed systems, a client may retry a request if it does not receive a response. Two concurrent executions of the same operation introduce a TOCTOU variant: the first execution checks that a transfer has not yet occurred; the second execution races through the same check before the first commits. Idempotency tokens close this:

import psycopg2
import psycopg2.errors   # UniqueViolation
import psycopg2.extras   # Json

def initiate_transfer(conn, sender_id: int, recipient_id: int,
                      amount_cents: int, idempotency_key: str) -> dict:
    with conn.cursor() as cur:
        # Atomically insert the idempotency record; fail if key already exists
        try:
            cur.execute(
                """
                INSERT INTO idempotency_keys (key, created_at)
                VALUES (%s, NOW())
                """,
                (idempotency_key,)
            )
        except psycopg2.errors.UniqueViolation:
            conn.rollback()
            # Return the result of the original (already-committed) operation
            cur.execute(
                "SELECT result FROM idempotency_keys WHERE key = %s",
                (idempotency_key,)
            )
            return cur.fetchone()[0]

        # The INSERT succeeded — we own this operation
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s AND balance >= %s",
            (amount_cents, sender_id, amount_cents)
        )
        if cur.rowcount == 0:
            conn.rollback()
            raise ValueError("Insufficient funds or account not found")

        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount_cents, recipient_id)
        )
        result = {"status": "ok", "transferred": amount_cents}
        cur.execute(
            "UPDATE idempotency_keys SET result = %s WHERE key = %s",
            (psycopg2.extras.Json(result), idempotency_key)
        )
        conn.commit()
        return result

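The UNIQUE constraint on idempotency_keys.key makes the first insert atomic and exclusive. In PostgreSQL, a concurrent duplicate blocks at the INSERT until the first transaction commits or rolls back; if it commits, the duplicate fails with a unique violation and returns the stored result of the original.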

Kubernetes Admission TOCTOU: Why resourceVersion Exists

An admission webhook receives the proposed object in its webhook request body. The question is: can the object be mutated between the webhook’s response and the API server’s commit? In the Kubernetes API server, the admission phase and the storage write are part of the same request-handling pipeline. The object is not stored until all admission controllers (both mutating and validating) have returned success. A mutation submitted concurrently by a different client that targets the same object must provide the current resourceVersion; the API server performs an optimistic concurrency check and rejects writes with stale resourceVersion values with a 409 Conflict.

This means the relevant TOCTOU seam is not between admission and storage. It lies between the object the webhook validated and any other state the webhook consulted while deciding. A common mistake is for a validating webhook to check a field in the submitted object against state it fetches separately, for example a lookup of another object, which can change before the write commits. The API server already presents the object as it will be stored: the AdmissionReview request carries the incoming object after mutating admission, and that is what the webhook should validate.

The practical defence is to design admission webhooks that treat the object in the AdmissionReview request as the authoritative input, validate it completely in a single pass, and return a binding decision without making external calls that could introduce their own race windows.
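A single-pass validator of this kind reduces to a pure function from the AdmissionReview request to a response. The sketch below is illustrative: the policy (rejecting privileged containers) and the function name `review` are assumptions, the check covers only `containers` and `initContainers`, and the HTTP serving layer is omitted.

```python
def review(admission_review):
    """Validate the object in the request and build an AdmissionReview response."""
    request = admission_review["request"]
    pod = request["object"]                   # the object as it will be stored
    spec = pod.get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])

    # Decide from the request alone: no API lookups, no cached external state.
    offenders = [
        c.get("name", "<unnamed>")
        for c in containers
        if (c.get("securityContext") or {}).get("privileged") is True
    ]

    response = {"uid": request["uid"], "allowed": not offenders}
    if offenders:
        response["status"] = {
            "code": 403,
            "message": "privileged containers are not allowed: "
            + ", ".join(offenders),
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Because the decision depends only on the request body, there is no second read whose result could diverge from the first.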

Testing TOCTOU Races

A TOCTOU race can be reliably reproduced in a test by running the check and the racing mutation in two tight concurrent loops and looking for divergence. The following C sketch outlines the pattern:

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <sys/stat.h>

static volatile int running = 1;
static const char *target = "/tmp/toctou-test-file";
static const char *symlink_target = "/tmp/toctou-sensitive";

void *racer(void *arg) {
    while (running) {
        unlink(target);
        /* create a regular file so the check passes */
        int fd = open(target, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) close(fd);
        /* immediately replace it with a symlink */
        unlink(target);
        symlink(symlink_target, target);
    }
    return NULL;
}

int vulnerable_write(const char *path, const char *data) {
    struct stat st;
    if (lstat(path, &st) < 0) return -1;
    if (S_ISLNK(st.st_mode)) return -1;  /* check: reject symlinks */

    /* race window: racer replaces regular file with symlink here */

    int fd = open(path, O_WRONLY);        /* use: follows symlink */
    if (fd < 0) return -1;
    write(fd, data, strlen(data));
    close(fd);
    return 0;
}

int main(void) {
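    /* create the initial regular file and an empty "sensitive" file */
    int fd = open(target, O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) close(fd);
    fd = open(symlink_target, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd >= 0) close(fd);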

    pthread_t t;
    pthread_create(&t, NULL, racer, NULL);

    int wins = 0;
    for (int i = 0; i < 100000; i++) {
        if (vulnerable_write(target, "test") == 0) {
            /* check whether symlink_target was written */
            struct stat st;
            if (stat(symlink_target, &st) == 0 && st.st_size > 0) {
                wins++;
                printf("Race won on iteration %d\n", i);
                break;
            }
        }
    }

    running = 0;
    pthread_join(t, NULL);
    printf("Race wins: %d\n", wins);
    return 0;
}

This pattern — two threads, one performing the vulnerable check-then-act sequence and one racing the state — is the basis for TOCTOU unit tests in security-focused test suites. The test should demonstrate that the vulnerable implementation loses the race within a bounded number of iterations, and that the fixed implementation (using openat2() with RESOLVE_NO_SYMLINKS) never does.

Expected Behaviour

The following table maps each TOCTOU pattern to its atomic alternative and the observable difference under strace:

| TOCTOU Pattern | Atomic Alternative | Observable (strace unless noted) |
| --- | --- | --- |
| `access(path)` + `open(path)` | Drop to the caller’s credentials, then `open(path, O_WRONLY \| O_NOFOLLOW)` and check `errno` | Single `openat` (plus any credential switch) vs `faccessat` + `openat` |
| `lstat(path)` + `open(path, O_WRONLY)` | `openat2(dirfd, path, {flags, RESOLVE_NO_SYMLINKS})` | Single `openat2` with resolve flags |
| `os.path.exists(path)` + `open(path, 'w')` | `os.open(path, O_CREAT \| O_EXCL \| O_NOFOLLOW)` | Single `openat` with `O_CREAT\|O_EXCL` vs `stat` + `openat` |
| `stat(path)` + `chmod(path)` | `fstat(fd)` + `fchmod(fd)` | `fstat` + `fchmod` on fd vs path-based calls |
| `get_user(count)` validate, then re-read `count` | `copy_from_user(&kreq)` once, use `kreq.count` | Not visible in strace; confirm by source review that the handler fetches each field once |
| `SELECT permission` then `UPDATE` | `BEGIN; SELECT ... FOR UPDATE; UPDATE; COMMIT` | Query log: single transaction vs two separate statements |

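Verification with strace -f -e trace=%file ./your-binary shows whether the implementation issues separate path-resolution syscalls or consolidates them into a single call. The %file class covers every syscall that takes a filename, which matters because the syscall actually issued varies with architecture and libc version: stat() may appear as newfstatat, and access() as faccessat or faccessat2. Any check call such as faccessat, lstat, or newfstatat immediately followed by openat on the same path string is a TOCTOU candidate.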
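Scanning a long trace by eye is error-prone, so the "check immediately followed by open" heuristic can be automated. The sketch below is a rough filter, not a complete strace parser: `toctou_candidates` is a hypothetical helper, the regular expression assumes strace's default one-call-per-line output, and only the first quoted argument is compared.

```python
import re

CHECKS = {"access", "faccessat", "faccessat2", "stat", "lstat", "newfstatat"}
OPENS = {"open", "openat"}
# Optional "[pid N] " prefix, syscall name, then the first quoted string argument.
CALL = re.compile(r'^(?:\[pid\s+\d+\]\s+)?(\w+)\([^"]*"((?:[^"\\]|\\.)*)"')


def toctou_candidates(lines):
    """Yield (check_call, open_call, path) for back-to-back calls on one path string."""
    previous = None
    for line in lines:
        m = CALL.match(line)
        if not m:
            previous = None
            continue
        name, path = m.group(1), m.group(2)
        if name in OPENS and previous and previous[1] == path:
            yield previous[0], name, path
        previous = (name, path) if name in CHECKS else None
```

Feeding it the lines of a file written with `strace -o` lists the pairs worth reviewing by hand. It ignores the directory descriptor argument and interleaving between threads, so it can both miss races and report harmless pairs.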

Trade-offs

| Mitigation | Kernel/Runtime Requirement | Compatibility Impact | Performance Impact |
| --- | --- | --- | --- |
| `openat2()` with `RESOLVE_NO_SYMLINKS` | Linux 5.6+ (released March 2020) | Symlink-heavy workflows (e.g. `/usr` merge symlinks, version-switching tools) break if they pass symlinked paths | Negligible: single syscall replaces two |
| `O_NOFOLLOW` | All Linux kernels | Applications that deliberately open through symlinks (e.g. `~/.vim` → `/opt/vim`) must be restructured to resolve symlinks explicitly | Negligible |
| `RESOLVE_NO_MAGICLINKS` | Linux 5.6+ | Breaks any code that opens files via `/proc/self/fd/N` paths or `/proc/$pid/exe` | Negligible |
| `RESOLVE_BENEATH` | Linux 5.6+ | Code that uses relative paths escaping the base directory fails with `EXDEV`; this catches bugs but may break poorly-written utilities | Negligible |
| `O_CREAT \| O_EXCL` | POSIX | Applications that treat pre-existing files as valid must handle `EEXIST` explicitly | Negligible |
| `fstat`/`fchmod`/`fchown` over path-based equivalents | POSIX | None: fd-based operations are universally supported | Negligible |
| `SELECT ... FOR UPDATE` | All SQL databases supporting row-locking | Increases lock contention on high-write permission tables; may introduce deadlocks if lock order is inconsistent | Moderate: row lock held for transaction duration |
| Idempotency tokens | Application-level design change | Clients must generate and include tokens; existing clients without token support cannot use this guarantee | Low: single `INSERT` overhead per request |
| Kernel: single `copy_from_user` | Kernel development practice | None: simplifies code | Negligible |

Failure Modes

| Incorrect Mitigation | Why It Fails | Correct Alternative |
| --- | --- | --- |
| `O_NOFOLLOW` without `RESOLVE_NO_MAGICLINKS` | `O_NOFOLLOW` checks only the final component. Magic links such as `/proc/self/fd/N` or `/proc/self/root` that appear earlier in the path are still followed and jump straight to their target, outside the directory the caller intended. | `openat2()` with `RESOLVE_NO_SYMLINKS \| RESOLVE_NO_MAGICLINKS` |
| `openat2()` with fallback to `open()` on `ENOSYS` | The fallback silently removes all TOCTOU protection wherever `openat2()` is unavailable: kernels older than 5.6, and environments where a seccomp filter can answer the syscall with `ENOSYS`. The security guarantee disappears without logging. | Treat `ENOSYS` as a hard failure; document the kernel version requirement; use distribution backports if older kernels must be supported |
| `O_CREAT \| O_EXCL` on NFS before NFSv3 | Per open(2), `O_EXCL` is supported on NFS only with NFSv3 or later on kernel 2.6 or later. Where it is not supported, two concurrent creates on different NFS clients can both succeed. | Require NFSv3 or later, use the `link(2)`-based locking technique described in open(2), or operate on local filesystems for security-sensitive file creation |
| `SELECT ... FOR UPDATE` missed by ORM | ORM abstractions (Django ORM, SQLAlchemy without explicit lock hints) do not add `FOR UPDATE` by default. Code that uses `obj = Model.objects.get(pk=id)` followed by `obj.save()` does not lock. | Use ORM-specific lock hints (`select_for_update()` in Django, `with_for_update()` in SQLAlchemy) and review the generated SQL |
| `fstat()` after `open()` without `O_NOFOLLOW` | Without `O_NOFOLLOW`, `open()` follows an attacker-planted symlink, so the fd already refers to the attacker’s chosen file. `fstat()` describes that file accurately, but it cannot tell that a symlink was traversed, and any side effect of the open (`O_TRUNC`, `O_CREAT`) has already happened. | Apply `O_NOFOLLOW` or `RESOLVE_NO_SYMLINKS` at `open()` time, not after |
| Validating `path` argument before `chdir()` + relative open | Calling `chdir(dir)` followed by `open(filename, ...)` in a multi-threaded process changes the CWD for all threads. Another thread may change the CWD between the `chdir()` and the `open()`, invalidating the assumption that `filename` resolves under `dir`. | Use `openat(dirfd, filename, ...)` with a directory fd opened with `O_DIRECTORY`; never use `chdir()` for security boundary enforcement in multi-threaded code |
| Idempotency key stored in a non-UNIQUE column | A race between two identical requests inserts two rows, both passing the duplicate check. The operation executes twice. | Enforce `UNIQUE` constraint at the database level, not in application logic |
| `pthread_mutex_lock` around check-and-act in one process | Correct for single-process races, but does not protect against races from another process or a setuid subprocess that does not hold the same mutex. | Between cooperating processes, use advisory file locks (`flock`, `fcntl F_SETLK`); advisory locks do not constrain a hostile process, so against one, design out the shared-path access |
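For the cross-process case in the last row, an advisory lock might look like the following. This is a sketch assuming Linux and cooperating processes that all take the same lock; `exclusive_lock` and the lock-file path are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT | os.O_NOFOLLOW, 0o600)
    try:
        # LOCK_NB: raise BlockingIOError at once if another holder exists.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        os.close(fd)                         # closing the descriptor releases the lock
```

Wrapping the check-and-act sequence in `with exclusive_lock(path):` serialises it across every process that does the same. A process that never takes the lock is unaffected, which is why this only helps among cooperating processes.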