LLM Copy-Paste Vulnerability Propagation: When AI Reproduces Unsafe Memory Copy Patterns

Problem

Large language models do not invent code — they compress and reproduce patterns from their training corpus. For code generation models, that corpus is predominantly public repositories: GitHub, GitLab, SourceForge, and various code hosting platforms indexed before the model’s knowledge cutoff. Public code is not uniformly safe. A substantial fraction of the C and C++ code in any large open-source corpus contains memory-unsafe patterns that were introduced before the vulnerability was discovered, before the developer knew better, or because the code predates modern safety conventions.

When a developer asks an LLM assistant to implement a buffer copy, file copy, or kernel memory transfer, the model draws on those patterns statistically. It produces the completion most consistent with its training distribution — and that distribution contains a significant proportion of vulnerable idioms.

The vulnerability classes that recur most frequently in AI-generated copy code are:

Unbounded memcpy with user-controlled length. The model generates memcpy(dst, src, user_len) without inserting a bounds check, because the training data frequently has this form. The programmer supplies the bounds check separately in some of the training examples, but not all — and the model does not reliably infer that the check must be present.

Unchecked copy_from_user return value. In Linux kernel code, copy_from_user returns the number of bytes it failed to copy, not a boolean. Correct usage checks for a non-zero return and handles the partial copy. AI-generated kernel code frequently omits this check, because kernel novice code in public repositories does too.

TOCTOU-prone check-then-copy sequences. The model generates a length validation followed by a copy, with both operations reading the user-space value independently. Between the check and the copy, userspace can change the value — a classic time-of-check/time-of-use race. The model has seen this pattern repeatedly in training data and reproduces it without the double-fetch mitigation.

strncpy without null termination. strncpy does not null-terminate the destination buffer when the source is at least as long as the count argument. AI-generated code frequently uses strncpy as a “safe strcpy” without appending the null byte, because this misuse is endemic in the corpus.

Empirical measurement of this problem is limited but consistent in direction. The 2022 study by Pearce et al. (“Asleep at the Keyboard?”) evaluated GitHub Copilot against 89 code generation scenarios drawn from CWE-identified vulnerability classes. Approximately 40% of the generated completions contained at least one security-relevant weakness. The rate varied by domain — systems-level code involving memory operations, authentication, and cryptography produced higher proportions of vulnerable completions than higher-level application logic. Subsequent analyses of Copilot, CodeWhisperer, and open-weight code models like StarCoder confirmed the general finding: models reproduce unsafe patterns at non-trivial rates, particularly for idioms that appear frequently in low-quality or legacy code.

The scale dimension makes this qualitatively different from a single developer writing vulnerable code. If thousands of developers use the same model, accept the same unsafe completion, and ship it, the same vulnerability appears across thousands of codebases at once. This monoculture risk means that a single discovery, or a single exploit, becomes relevant against an extraordinarily large attack surface. A CVE filed against a pattern in one codebase may apply, with minimal modification, to thousands of AI-assisted codebases that independently received the same suggestion.

Target systems: Any codebase where developers use LLM code assistants (GitHub Copilot, Cursor, Codeium, Claude, Amazon CodeWhisperer) for systems-level code in C, C++, Rust (unsafe blocks), or Linux kernel modules. The risk is highest in codebases without mandatory SAST or fuzzing gates, where AI suggestions can reach production without any automated safety check.

Threat Model

Threat 1 — Developer accepts AI-suggested unsafe buffer copy. A developer working on a network daemon asks their code assistant to generate a packet parsing function. The assistant completes the function with memcpy(dest_buf, packet->data, packet->data_len). The developer reviews the logic for correctness — does it parse the right fields? — but not for safety. packet->data_len is attacker-controlled. The function ships with a heap buffer overflow. The vulnerability is functionally identical to CVE-class findings in mature codebases but was generated in seconds rather than written by a developer who might have noticed the pattern.

Threat 2 — AI-generated kernel module code with unchecked copy_from_user. A developer writes a character device driver for a hardware accelerator. They use an LLM assistant to generate the ioctl handler. The model produces code that calls copy_from_user to read a user-supplied request structure, then immediately uses fields from the structure without checking whether the copy succeeded. A partial copy, caused by part of the user-space range being unmapped or becoming invalid mid-copy, leaves the kernel operating on a structure that was only partly filled from user memory (current kernels zero the uncopied tail, but the handler still acts on values the caller never supplied). Depending on how the handler uses those fields, the unchecked return value can open a local privilege escalation path, for example by steering the handler into corrupting adjacent kernel data.

Threat 3 — AI completes a partially-written copy routine without the TOCTOU fix. A developer is midway through writing a system call handler. They have written the bounds check but not the copy. They invoke the assistant to complete the function. The assistant reproduces the common pattern of reading uargs->len twice — once to check, once to pass to copy_from_user — rather than reading it once into a local variable and using the local for both operations. The developer assumes the assistant preserved their intent. The TOCTOU window ships.

Threat 4 — Monoculture propagation. A popular open-source library accepts an AI-assisted pull request containing an unsafe strncpy sequence. The pattern is identical to suggestions the same model produces when asked for equivalent functionality. Other developers, having used the same model to generate similar code, independently submit the same pattern to their own repositories. A researcher discovers the vulnerability in the popular library, notes the pattern, and finds the same CWE across dozens of projects via code search. A single exploit template covers all of them.

Configuration / Implementation

Recognising AI-Generated Unsafe Copy Patterns

The following patterns appear consistently in LLM-generated copy code. Recognising them at review time is the first line of defence.

Pattern 1 — memcpy with user-controlled size, no bounds check.

/* LLM-generated: no upper bound on req.length */
void handle_write(struct user_request __user *user_req)
{
    struct user_request req;
    char local_buf[256];

    copy_from_user(&req, user_req, sizeof(req));   /* ← return value ignored */
    memcpy(local_buf, req.data, req.length);   /* ← req.length unchecked */
    process(local_buf);
}

The safe version reads the length into a local variable, validates it before use, and limits the copy to the destination size:

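int handle_write(struct user_request __user *user_req)
{
    struct user_request req;
    char local_buf[256];
    size_t safe_len;

    /* Non-zero means some bytes were not copied: treat as failure. */
    if (copy_from_user(&req, user_req, sizeof(req)))
        return -EFAULT;

    /* Clamp the user-supplied length to the destination size.
     * Assumes req.data is an inline array of at least sizeof(local_buf) bytes. */
    safe_len = min_t(size_t, req.length, sizeof(local_buf));
    memcpy(local_buf, req.data, safe_len);
    process(local_buf);
    return 0;
}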
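The same validate-then-copy shape applies outside the kernel, for example in the packet parser from Threat 1. The following is a minimal userspace sketch, not code from any real project: struct packet, PAYLOAD_MAX and parse_payload are hypothetical names chosen for illustration. It rejects an oversized length instead of clamping it, which is often the safer choice when silent truncation would hide malformed input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_MAX 256  /* hypothetical destination capacity */

/* Hypothetical wire format: a length field followed by inline data. */
struct packet {
    uint32_t data_len;            /* attacker-controlled */
    uint8_t  data[1024];
};

/*
 * Copies the packet payload into dest_buf.
 * Returns the number of bytes copied, or -1 if data_len exceeds either
 * the destination capacity or the packet's own data array.
 */
int parse_payload(const struct packet *pkt, uint8_t *dest_buf, size_t dest_cap)
{
    size_t len = pkt->data_len;   /* read once into a local */

    if (len > dest_cap || len > sizeof(pkt->data))
        return -1;                /* reject; do not silently truncate */

    memcpy(dest_buf, pkt->data, len);
    return (int)len;
}
```

A caller would pass a buffer of PAYLOAD_MAX bytes together with its size and treat a return of -1 as a malformed packet.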

Pattern 2 — copy_from_user return value ignored.

/* LLM-generated: return value discarded */
static long dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    struct cmd_packet pkt;
    copy_from_user(&pkt, (void __user *)arg, sizeof(pkt));  /* ← unchecked */
    return dispatch_command(&pkt);
}

The correct form:

static long dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    struct cmd_packet pkt;

    if (copy_from_user(&pkt, (void __user *)arg, sizeof(pkt)))
        return -EFAULT;

    return dispatch_command(&pkt);
}
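To see why a plain non-zero test is enough, it can help to model the return convention in userspace. The sketch below is an illustration only: mock_copy_from_user is a hypothetical stand-in, not the kernel function, and its extra readable parameter simply simulates how much of the source range is mapped. EFAULT_ERR and read_command are likewise made-up names.

```c
#include <stddef.h>
#include <string.h>

#define EFAULT_ERR (-14)   /* stand-in for the kernel's -EFAULT */

struct cmd_packet {
    int opcode;
    int arg;
};

/*
 * Userspace stand-in for copy_from_user (NOT the kernel function).
 * 'readable' models how many bytes of the source are mapped. Like the
 * real helper, it returns the number of bytes it could NOT copy and
 * zero-fills that uncopied tail of the destination.
 */
static size_t mock_copy_from_user(void *dst, const void *src, size_t n,
                                  size_t readable)
{
    size_t ok = n < readable ? n : readable;

    memcpy(dst, src, ok);
    memset((char *)dst + ok, 0, n - ok);
    return n - ok;
}

/* Returns the opcode on success, EFAULT_ERR if any byte was not copied. */
int read_command(const void *uptr, size_t readable)
{
    struct cmd_packet pkt;

    if (mock_copy_from_user(&pkt, uptr, sizeof(pkt), readable))
        return EFAULT_ERR;        /* partial copy is treated as failure */

    return pkt.opcode;
}
```

Because the return value counts the bytes left uncopied, zero is the only value that means the whole structure arrived, so any non-zero result is handled as an error.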

Pattern 3 — TOCTOU: length checked then re-read from userspace.

/* LLM-generated: reads uargs->len twice, race window between check and copy */
int syscall_handler(struct user_args __user *uargs)
{
    char kbuf[PAGE_SIZE];

    if (uargs->len > PAGE_SIZE)          /* check reads from userspace */
        return -EINVAL;

    copy_from_user(kbuf, uargs->data, uargs->len);  /* copy re-reads from userspace */
    return process_buffer(kbuf, uargs->len);
}

The fix is to read len once into a kernel-space local:

int syscall_handler(struct user_args __user *uargs)
{
    char kbuf[PAGE_SIZE];
    size_t len;

    if (get_user(len, &uargs->len))      /* single read into kernel variable */
        return -EFAULT;

    if (len > PAGE_SIZE)
        return -EINVAL;

    if (copy_from_user(kbuf, uargs->data, len))
        return -EFAULT;

    return process_buffer(kbuf, len);
}
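A real double-fetch race depends on timing, but the effect can be reproduced deterministically by modelling each read of the user-controlled field as a function call. The sketch below is a userspace simulation under that assumption; fetch_len_fn, flipping_fetch, double_fetch_len, single_fetch_len and BUF_CAP are all hypothetical names. The functions return the length that would reach the copy rather than performing a copy, so the vulnerable variant can be exercised safely.

```c
#include <stddef.h>

#define BUF_CAP 64   /* hypothetical kernel buffer size */

/*
 * Models one read of a length field from memory the attacker can rewrite.
 * Each call is a separate fetch and may observe a different value.
 */
typedef size_t (*fetch_len_fn)(void *ctx);

/* Simulated attacker: reports a small length first, a huge one afterwards. */
struct flipping_len {
    unsigned fetches;
};

size_t flipping_fetch(void *ctx)
{
    struct flipping_len *f = ctx;
    return f->fetches++ == 0 ? 16 : 4096;
}

/* Double fetch: validates one read, then copies with a second read.
 * Returns the length that would reach the copy, or -1 if rejected. */
long double_fetch_len(fetch_len_fn fetch, void *ctx)
{
    if (fetch(ctx) > BUF_CAP)
        return -1;
    return (long)fetch(ctx);       /* second fetch: may exceed BUF_CAP */
}

/* Single fetch: one read into a local, used for both check and copy. */
long single_fetch_len(fetch_len_fn fetch, void *ctx)
{
    size_t len = fetch(ctx);

    if (len > BUF_CAP)
        return -1;
    return (long)len;
}
```

Against the flipping source, the double-fetch variant passes the check with 16 and then hands 4096 to the copy, while the single-fetch variant can only ever use the value it validated.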

Pattern 4 — strncpy without null termination.

/* LLM-generated: strncpy does not null-terminate when src >= n bytes */
void set_name(char *dst, const char *src)
{
    strncpy(dst, src, MAX_NAME_LEN);
    /* dst may not be null-terminated */
    use_name(dst);
}

Use strlcpy (BSD), strscpy (kernel), or manually append the null byte:

void set_name(char *dst, const char *src)
{
    strscpy(dst, src, MAX_NAME_LEN);   /* kernel: always null-terminates, returns -E2BIG on truncation */
    use_name(dst);
}

For userspace C, use strlcpy if available, or the explicit pattern:

void set_name(char *dst, const char *src)
{
    strncpy(dst, src, MAX_NAME_LEN - 1);
    dst[MAX_NAME_LEN - 1] = '\0';
    use_name(dst);
}
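Where strlcpy is not available and truncation should be detectable, a small helper can combine the bounded copy, the terminator and a truncation signal. The following is a sketch only; bounded_copy is a hypothetical name, and its return convention (count on success, -1 on truncation) is loosely modelled on strscpy rather than taken from any standard library.

```c
#include <stddef.h>

/*
 * Copies src into dst (capacity dst_size, including the terminator).
 * Always null-terminates when dst_size > 0.
 * Returns the number of characters copied, or -1 if src was truncated
 * or dst_size is 0.
 */
long bounded_copy(char *dst, size_t dst_size, const char *src)
{
    size_t i;

    if (dst_size == 0)
        return -1;

    /* Leave room for the terminator: stop at dst_size - 1 characters. */
    for (i = 0; i + 1 < dst_size && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';

    /* If src has not ended here, the copy was cut short. */
    return src[i] == '\0' ? (long)i : -1;
}
```

A caller such as set_name could then reject over-long names by checking for a negative return instead of silently storing a truncated value.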

SAST Rules Targeting LLM-Typical Copy Bugs

These Semgrep rules can be added to an existing CI pipeline to catch patterns 1–3 described above; the strncpy pattern (pattern 4) is left to manual review. They are written for C kernel and systems code.

Rule 1 — copy_from_user with unchecked return value.

rules:
  - id: unchecked-copy-from-user
    languages: [c]
    message: >
      copy_from_user return value is not checked. A non-zero return indicates
      a partial or failed copy; using the destination buffer afterwards risks
      operating on uninitialised or attacker-influenced data.
    severity: ERROR
    patterns:
      - pattern: |
          copy_from_user($DST, $SRC, $LEN);
      - pattern-not: |
          if (copy_from_user($DST, $SRC, $LEN)) { ... }
      - pattern-not: |
          $RET = copy_from_user($DST, $SRC, $LEN);
    metadata:
      cwe: CWE-252
      references:
        - https://www.kernel.org/doc/html/latest/kernel-hacking/hacking.html

Rule 2 — memcpy size derived from a user-supplied struct field.

rules:
  - id: memcpy-user-controlled-size
    languages: [c]
    message: >
      memcpy size argument is derived from a user-supplied struct field without
      an intervening bounds check. Verify that the size is bounded before use.
    severity: WARNING
    pattern-either:
      - pattern: |
          copy_from_user(&$STRUCT, $UPTR, sizeof($STRUCT));
          ...
          memcpy($DST, ..., $STRUCT.$LEN_FIELD);
      - pattern: |
          memcpy($DST, $SRC, $REQ->$LEN_FIELD);
    metadata:
      cwe: CWE-120

Rule 3 — Double-fetch TOCTOU: field read from __user pointer inside a conditional, then again in copy_from_user size.

rules:
  - id: double-fetch-toctou
    languages: [c]
    message: >
      Potential double-fetch: a field from a __user pointer is read once to
      validate and again as an argument to a copy function. An attacker can
      change the value between the two reads. Read the value into a kernel-local
      variable once and use the local for both validation and the copy.
    severity: ERROR
    patterns:
      - pattern: |
          if ($UPTR->$LEN > ...) { ... }
          ...
          copy_from_user($DST, $UPTR->$DATA, $UPTR->$LEN);
    metadata:
      cwe: CWE-367

CodeQL query — Dataflow from user input to memcpy size argument.

/**
 * @name memcpy size argument reachable from user-controlled source
 * @description Finds memcpy calls where the size argument can be reached
 *              by data flow from a source tagged as user-controlled input.
 * @kind path-problem
 * @id cpp/memcpy-user-controlled-size
 * @problem.severity error
 */

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import DataFlow::PathGraph

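class UserControlledSource extends DataFlow::Node {
  UserControlledSource() {
    exists(FunctionCall fc |
      // Kernel helpers receive the destination as argument 0.
      fc.getTarget().getName() in ["copy_from_user", "get_user"] and
      this.asExpr() = fc.getArgument(0)
      or
      // POSIX I/O calls take the fd as argument 0; the buffer is argument 1.
      fc.getTarget().getName() in ["recv", "recvfrom", "read"] and
      this.asExpr() = fc.getArgument(1)
    )
  }
}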

class MemcpySizeSink extends DataFlow::Node {
  MemcpySizeSink() {
    exists(FunctionCall fc |
      fc.getTarget().getName() = "memcpy" and
      this.asExpr() = fc.getArgument(2)
    )
  }
}

class UserToMemcpyConfig extends TaintTracking::Configuration {
  UserToMemcpyConfig() { this = "UserToMemcpyConfig" }
  override predicate isSource(DataFlow::Node n) { n instanceof UserControlledSource }
  override predicate isSink(DataFlow::Node n) { n instanceof MemcpySizeSink }
}

from UserToMemcpyConfig cfg, DataFlow::PathNode src, DataFlow::PathNode sink
where cfg.hasFlowPath(src, sink)
select sink.getNode(), src, sink,
  "memcpy size is reachable from user-controlled data at $@.", src.getNode(), "this source"

Adding Copy-Safety Rules to CI

In a pre-commit or CI context, add the Semgrep rules to the project’s existing Semgrep configuration:

# .semgrep/copy-safety.yml — add to semgrep.yml includes list
rules:
  # paste the rules above here, or reference them via registry
  - id: unchecked-copy-from-user
    # ...
  - id: memcpy-user-controlled-size
    # ...
  - id: double-fetch-toctou
    # ...

Run as a blocking CI step on any PR that touches .c or .h files under the kernel or systems path:

# .github/workflows/sast.yml (excerpt)
- name: Run copy-safety Semgrep rules
  uses: semgrep/semgrep-action@v1
  with:
    config: .semgrep/copy-safety.yml
  env:
    SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}

Prompting LLMs for Safer Patterns

The prompt context an LLM receives directly influences which training patterns it activates. Default prompts like “write a function that copies user data into a kernel buffer” activate the full distribution of training examples, including unsafe ones. More specific prompts steer the model toward the safer subset.

Prompt guidance for kernel code:

“Write a Linux kernel ioctl handler that reads a user-supplied struct using copy_from_user. Check the return value of copy_from_user and return -EFAULT on failure. Read any length fields into a local kernel variable before validating them. Do not read any user-space pointer more than once.”

Prompt guidance for userspace C:

“Write a function that copies at most MAX_BUF bytes from src to dst. Use strlcpy or an equivalent bounded copy. Ensure the result is always null-terminated. Do not use strcpy or unbounded memcpy.”

Prompt guidance to avoid TOCTOU:

“Do not read any field from user memory more than once. Read it into a stack variable first, validate the stack variable, and use the stack variable for all subsequent operations.”

These constraints work best when embedded in the system prompt or in a per-project Copilot instruction file (.github/copilot-instructions.md) so they apply to all suggestions in the repository without requiring each developer to remember them.

Code Review Checklist for AI-Assisted PRs

When reviewing a pull request where AI assistance is indicated (or suspected from commit patterns), apply the following checklist specifically to any function that performs buffer copies, file copies, or memory transfers:

  • [ ] Does every copy_from_user / copy_to_user call check its return value?
  • [ ] Is every length or size field from userspace read into a kernel-local variable before the bounds check?
  • [ ] Is the same user-space address or field read more than once? If yes, is there a documented reason why the double-read cannot race?
  • [ ] Does every memcpy have a statically verified or runtime-checked upper bound on its size argument?
  • [ ] Is strncpy used? If yes, is the destination guaranteed to be null-terminated by subsequent code?
  • [ ] Are all destination buffer sizes visible and consistent with the copy size?
  • [ ] Does the function handle partial copy (fewer bytes than requested) as an error?
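For the checklist item on statically verified bounds, one option for fixed-size arrays is to let the compiler enforce the size relationship. The sketch below assumes C11 (_Static_assert); COPY_FIXED, struct header and load_magic are hypothetical names used only for illustration, and the macro is only meaningful when both arguments are true arrays.

```c
#include <string.h>

/*
 * Copies sizeof(src) bytes between two fixed-size ARRAYS.
 * Compilation fails if the destination is smaller than the source.
 * Assumption: both arguments are true arrays; passing a pointer would
 * make sizeof report the pointer size and defeat the check.
 */
#define COPY_FIXED(dst, src)                                          \
    do {                                                              \
        _Static_assert(sizeof(dst) >= sizeof(src),                    \
                       "destination smaller than source");            \
        memcpy((dst), (src), sizeof(src));                            \
    } while (0)

struct header {
    unsigned char magic[4];
};

/* Example use: the size relationship is checked at compile time. */
void load_magic(struct header *h, const unsigned char (*wire)[4])
{
    COPY_FIXED(h->magic, *wire);
}
```

If a later change shrinks magic below four bytes, the build breaks at the copy site, so the reviewer does not have to re-derive the bound by hand.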

Testing AI-Generated Copy Code with Fuzzing

SAST catches known patterns but not novel unsafe compositions. Fuzzing complements static analysis by exercising runtime behaviour.

For userspace code, integrate libFuzzer with a harness that feeds attacker-controlled input directly to the copy function:

/* fuzz_copy_handler.c */
#include <stdint.h>
#include <stddef.h>
#include <string.h>   /* memcpy */
#include "your_module.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    struct user_request req;

    if (size < sizeof(req))
        return 0;

    /* Populate the whole struct from fuzzer input, so every field,
     * including req.length, is attacker-controlled. memcpy avoids the
     * unaligned read that casting data to uint32_t * would perform. */
    memcpy(&req, data, sizeof(req));

    handle_write(&req);   /* function under test */
    return 0;
}

Compile with libFuzzer, AddressSanitizer and UBSan enabled (the fuzzer sanitizer links in the libFuzzer driver that provides main):

clang -fsanitize=fuzzer,address,undefined -g -O1 \
    fuzz_copy_handler.c your_module.c \
    -o fuzz_copy_handler

./fuzz_copy_handler -max_len=65536 -jobs=4 corpus/

For kernel code, use syzkaller to generate structured syscall sequences that reach the ioctl handler under test. Describe the ioctl’s argument types in a syzkaller syscall description (syzlang) file, so that the fuzzer understands which struct fields are user-controlled.

Expected Behaviour

The following table maps each vulnerable pattern to its safe replacement and the SAST rule that detects it:

| Vulnerable Pattern | Root Cause | Safe Replacement | Detecting Rule |
|---|---|---|---|
| memcpy(dst, src, req->len) with no bound check | LLM omits bound validation | min_t(size_t, req->len, sizeof(dst)) before memcpy | memcpy-user-controlled-size |
| copy_from_user(&s, uptr, sz) return value discarded | LLM mirrors novice kernel code | if (copy_from_user(...)) return -EFAULT; | unchecked-copy-from-user |
| Length checked via uptr->len, then passed as uptr->len to copy | Double-fetch from userspace pointer | Read len once with get_user, use kernel local throughout | double-fetch-toctou |
| strncpy(dst, src, N) without dst[N-1] = '\0' | strncpy semantics misunderstood | strscpy(dst, src, N) or explicit null append | Manual review; pattern-match rule |
| memcpy size from struct field; struct populated by copy_from_user | Indirect taint through struct | Validate field after copy, before second memcpy | CodeQL taint query |

Detection rate expectations based on static rule design: the Semgrep rules for patterns 1–3 achieve high precision for direct (non-wrapped) calls in C files. Taint through function calls and indirect field access requires the CodeQL query, which trades speed for depth. Neither approach catches patterns fully obscured by macro wrappers — those require manual review or macro-aware tools.

Trade-offs

| Measure | Benefit | Cost / Friction |
|---|---|---|
| Semgrep copy-safety rules in CI (blocking) | Catches patterns 1–3 before merge; zero developer runtime overhead once configured | False positives on legitimate unchecked-copy patterns (e.g., callers that handle errors via an out-of-band mechanism); developers must suppress findings with // nosemgrep comments and a justification |
| CodeQL taint tracking for memcpy size | Catches indirect taint; covers cases Semgrep misses | Slow: CodeQL analysis adds minutes to CI; requires a CodeQL database build; false positive rate higher than Semgrep for large codebases |
| Mandatory libFuzzer harness for new copy functions | Catches runtime bounds errors that SAST misses; produces reproducible crash PoCs | Developers must write and maintain fuzzing harnesses; corpus management overhead; fuzzing time constrains CI throughput; may require infrastructure for continuous fuzzing |
| Per-repository copilot-instructions.md with safety constraints | Reduces unsafe suggestions at generation time; cheapest intervention | Developers can ignore or override the file; not enforced by the model; instruction quality degrades over time if not maintained; model may only partially comply |
| Code review checklist (mandatory for AI-assisted PRs) | No tooling required; catches patterns tools miss | Review time increases; requires reviewers to identify AI-assisted code; checklist compliance depends on team discipline; does not scale well at high PR volume |
| Prompt engineering guidance in developer onboarding | Shifts responsibility upstream; reduces suggestions to review | Requires training adoption; developers forget or revert to default prompts; effectiveness varies by model and model version; no enforcement mechanism |

Failure Modes

| Failure Mode | Mechanism | Mitigation |
|---|---|---|
| SAST rule misses copy through wrapper function | safe_copy(dst, src, len) wraps memcpy; Semgrep pattern matches only direct memcpy calls; wrapper is not in the pattern | Add wrapper to Semgrep pattern list; use the CodeQL taint query, which follows call chains; document wrapper functions for tool awareness |
| Developer suppresses finding without review | Developer adds a // nosemgrep: memcpy-user-controlled-size comment to silence CI and ships without fixing | Require suppression comments to include a named reviewer; configure Semgrep to log all suppressions to a security team channel; track suppression count per developer over time |
| AI assistant ignores organisation safety rules | copilot-instructions.md is present but the model weights the instruction less than the completion context; suggestion is still unsafe | Treat AI suggestions as untrusted; apply SAST regardless of instruction presence; do not reduce review rigour because instructions exist |
| TOCTOU rule produces excessive false positives | Heuristic pattern matches legitimate code that reads the same field twice for unrelated reasons | Tune the rule to require both reads to flow into copy-function arguments; accept a higher false-negative rate in exchange for lower friction |
| Fuzzing harness does not reach vulnerable path | Harness calls a wrapper that does not pass attacker data directly to the copy call | Audit harness coverage with llvm-cov; ensure the harness calls the function at the deepest instrumented level; review uncovered branches |
| Monoculture vulnerability undiscovered across affected codebases | No central registry correlates AI-generated vulnerable patterns across unrelated repositories | Coordinate with SAST tool vendors to add detection for known LLM-generated vulnerable patterns; monitor NVD for CWE-120/CWE-367 findings and backcheck against the pattern library |
| Model update changes suggestion distribution | A new model version generates fewer of the known unsafe patterns; team relaxes review; new model introduces a different unsafe pattern not yet covered by rules | Treat model updates as requiring a new SAST rule audit; re-run the code generation evaluation suite after each model update |