Detecting Copy-on-Write Exploitation with eBPF: Tracing Dirty Pipe and Overlayfs Attack Patterns

Problem

Copy-on-write (CoW) is a kernel memory-management strategy: when two processes share a page read-only and one of them writes to it, the kernel copies the page and hands the writer a private copy. The invariant is that the writer always gets a private copy and never the shared original, and that invariant is what CoW exploits break. Three exploits in the past decade have each attacked it at a different point.

Dirty pipe (CVE-2022-0847) abuses a pipe buffer whose flags are never reinitialized. The attacker fills a pipe and drains it, which leaves PIPE_BUF_FLAG_CAN_MERGE set on every slot of the pipe ring, then splices data from a file into the pipe. The spliced buffer points directly at the file’s page-cache page and inherits the stale flag, so a subsequent write to the pipe is merged into that page-cache page without allocating a private copy, overwriting a read-only file from a process with no write permission. The syscall sequence is deterministic: open(O_RDONLY), pipe(), a write and read to prime the pipe, splice(fd, pipe), write(pipe, ...). Legitimate programs rarely write to a pipe immediately after splicing a read-only file into it, and none expects the file to change as a result.
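The sequence can be recognised with a small per-process state machine. The following is a minimal sketch, assuming some tracer already delivers decoded syscall events; the event names and argument keys (`ret_fd`, `write_fd`, `fd_in`, `fd_out`) are hypothetical and do not correspond to any tool's schema.

```python
import os
from dataclasses import dataclass, field

# Access-mode mask, built from the portable os constants.
ACCMODE = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


@dataclass
class FdState:
    """Descriptor state for one process, rebuilt from traced syscalls."""
    readonly_files: set = field(default_factory=set)   # fds opened O_RDONLY
    pipe_write_ends: set = field(default_factory=set)  # write ends returned by pipe()
    spliced_pipes: set = field(default_factory=set)    # write ends fed from a read-only file


def observe(state: FdState, syscall: str, **args) -> bool:
    """Apply one event; return True when a write completes the sequence."""
    if syscall == "open":
        if (args["flags"] & ACCMODE) == os.O_RDONLY:
            state.readonly_files.add(args["ret_fd"])
    elif syscall == "pipe":
        state.pipe_write_ends.add(args["write_fd"])
    elif syscall == "splice":
        if args["fd_in"] in state.readonly_files and args["fd_out"] in state.pipe_write_ends:
            state.spliced_pipes.add(args["fd_out"])
    elif syscall == "write":
        return args["fd"] in state.spliced_pipes
    elif syscall == "close":
        # Forget the descriptor so a reused fd number starts clean.
        for fds in (state.readonly_files, state.pipe_write_ends, state.spliced_pipes):
            fds.discard(args["fd"])
    return False


# Hypothetical event stream for one process.
events = [
    ("open", {"flags": os.O_RDONLY, "ret_fd": 3}),
    ("pipe", {"write_fd": 5}),
    ("splice", {"fd_in": 3, "fd_out": 5}),
    ("write", {"fd": 5}),
]
state = FdState()
alerts = [observe(state, name, **args) for name, args in events]
```

Only the final write raises the flag; a write to a pipe that never received a splice from a read-only descriptor passes through silently.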

Dirty COW (CVE-2016-5195) races madvise(MADV_DONTNEED) against a write to a private copy-on-write mapping. MADV_DONTNEED makes the kernel discard the private copy; if that happens between the CoW fault and the retried write, the write lands on the original shared page instead of a fresh private copy. The required sequence is: mmap(PROT_READ, MAP_PRIVATE) on a read-only file, then two threads in tight loops, one calling write() on /proc/self/mem at the mapped address and the other calling madvise(MADV_DONTNEED, ...) on the same range.

Overlayfs capability copy-up is a container-specific variant. When a process modifies a setuid or file-capability-bearing binary that lives in the lower layer (the image), for example by opening it for writing or changing its metadata, overlayfs copies it to the upper layer, and on affected kernels the copy keeps its security.capability xattr. A non-root process inside the container can trigger this copy-up and then execute the file, gaining capabilities the container was not configured to grant. The precondition is an unshare or mount that creates a new overlayfs instance, followed by access to a lower-layer file that carries a security.capability xattr.
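To make "capability-bearing" concrete, the sketch below decodes the raw value of a security.capability xattr using the vfs_cap_data layout from linux/capability.h. It is a minimal illustration; the function names and the example value are our own, and the example bytes are built by hand rather than read from a real file.

```python
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
# Number of 32-bit capability words stored per revision.
WORDS_BY_REVISION = {0x01000000: 1, 0x02000000: 2, 0x03000000: 2}


def parse_file_caps(raw: bytes) -> dict:
    """Decode a security.capability xattr value into capability masks."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    words = WORDS_BY_REVISION.get(revision)
    if words is None:
        raise ValueError(f"unknown capability revision {revision:#x}")
    permitted = inheritable = 0
    for i in range(words):
        # Each word pair is (permitted, inheritable), little-endian.
        p, inh = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= p << (32 * i)
        inheritable |= inh << (32 * i)
    return {
        "permitted": permitted,
        "inheritable": inheritable,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
    }


def grants_privilege(raw: bytes) -> bool:
    """True when executing the file would add permitted capabilities."""
    return parse_file_caps(raw)["permitted"] != 0


# Hand-built example: revision 2, effective bit set, CAP_NET_RAW (bit 13) permitted.
EXAMPLE = struct.pack("<IIIII", 0x02000001, 1 << 13, 0, 0, 0)
caps = parse_file_caps(EXAMPLE)
```

On Linux the raw value for a real file can be fetched with `os.getxattr(path, "security.capability")`, which raises OSError when the attribute is absent.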

All three share kernel-observable preconditions that appear before the privilege escalation completes. eBPF tracing programs running in the kernel can observe these preconditions (specific syscall sequences, VFS operations on unusual combinations of file attributes) and raise an alert or signal the offending process within microseconds of the attempt, in the best case before the escalation lands.

Target systems: Linux with eBPF support (kernel 5.8+ for the BPF ring buffer and BTF-backed kprobes), Tetragon 1.2+ deployed as a DaemonSet, and Falco 0.38+ with modern rule syntax. auditd serves as a complementary layer where eBPF cannot be deployed.

Threat Model

Three concrete attacker scenarios drive the detection design:

1. Container attacker exploiting dirty pipe to overwrite a host file. The attacker has code execution inside a container. The container has /proc/self/fd available (standard), and the host runs a kernel from 5.8 onward that lacks the fix (released in 5.16.11, 5.15.25 and 5.10.102). The attacker splices content from the container’s read-only lower-layer filesystem into a pipe, then writes crafted content that lands in the host’s page cache, overwriting /etc/passwd or a host binary visible through the overlayfs mount. Detection window: the splice from a read-only descriptor into the pipe is visible before any data is modified.

2. Local user exploiting a dirty COW variant to write to a read-only mapping. The attacker has a shell account on a multi-tenant Linux host. They map a setuid binary MAP_PRIVATE | PROT_READ, then race write to /proc/self/mem against madvise(MADV_DONTNEED). Detection window: the combination of madvise(MADV_DONTNEED) on a MAP_PRIVATE mapping of a file with S_ISUID is anomalous and visible to eBPF before any write succeeds.

3. Overlayfs capability copy-up. The attacker has code execution in a Kubernetes pod with no host-path mounts and no SYS_ADMIN. They call unshare(CLONE_NEWUSER | CLONE_NEWNS) (permitted in some kernel configurations), mount a new overlayfs with a lower layer containing a capability-bearing binary, then access the binary to trigger copy-up. Detection window: unshare followed by mount with fstype=overlay followed by open() on a file with security.capability xattr — all observable via kprobe before the process executes the file.

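eBPF’s advantage is that all three exploits depend on syscalls that are individually rare or anomalous in combination. Falco catches the signatures visible in the syscall stream. Tetragon hooks kernel functions directly, which exposes VFS-layer arguments such as the file behind a pipe write or the object being copied up, and it can send SIGKILL to the process from inside the hook. SIGKILL does not abort a kernel function that is already running, so where the hook point supports it, pair Sigkill with an Override action to make sure the operation itself fails.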

Configuration / Implementation

Tetragon TracingPolicy: Dirty Pipe Detection

The core signal is a write on a pipe whose buffer carries PIPE_BUF_FLAG_CAN_MERGE after the pipe was fed via splice from an O_RDONLY file descriptor. The policy below approximates that signal by linking a splice from a read-only file to a later write on the same pipe. Tetragon’s kprobe attaches at the entry of pipe_write, before the kernel’s merge check runs.

# tetragon-dirty-pipe.yaml
# TracingPolicy detecting the dirty pipe exploit precondition.
# Hooks do_splice and pipe_write and links the two through the pipe.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  # TracingPolicy is cluster-scoped, so it takes no namespace.
  name: dirty-pipe-detection
spec:
  kprobes:
    # Hook 1: observe splice() to track pipe population from O_RDONLY fd.
    - call: "do_splice"
      syscall: false
      args:
        - index: 0
          type: "file"      # in_file: the source file descriptor
        - index: 2
          type: "file"      # out_file: the destination (pipe)
      selectors:
        - matchArgs:
            # Source file must be opened read-only (flags & O_ACCMODE == O_RDONLY).
            # NOTE: "FileFlags" is pseudocode for that test; check the matchArgs
            # operators offered by your Tetragon release before deploying.
            - index: 0
              operator: "FileFlags"
              values: ["O_RDONLY"]
          matchActions:
            # NOTE: stock FollowFD takes argument indexes for argFd and argName;
            # the string label here is shorthand for "remember this pipe".
            - action: FollowFD
              argFd: 2
              argName: "splice_pipe_out"

    # Hook 2: observe pipe_write; if the pipe was populated via splice from
    # a read-only source, the combination is the dirty pipe precondition.
    - call: "pipe_write"
      syscall: false
      args:
        - index: 0
          type: "kiocb"     # the kiocb contains the file pointer for the pipe
      selectors:
        - matchArgs:
            - index: 0
              operator: "FilePrivate"
              # Pseudocode: match the pipe that the splice hook tracked above.
              values: ["splice_pipe_out"]
          matchActions:
            - action: Sigkill
            - action: Post
              rateLimit: "1m"            # at most one event per minute
              rateLimitScope: "process"
  # Apply the policy to every pod except those that opt out with the label
  # tetragon.cilium.io/monitoring=disabled.
  podSelector:
    matchExpressions:
      - key: "tetragon.cilium.io/monitoring"
        operator: NotIn
        values: ["disabled"]

# tetragon-overlayfs-cap-copy.yaml
# TracingPolicy detecting overlayfs capability copy-up.
# Hooks ovl_copy_up_inode and checks for non-zero file capabilities.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  # TracingPolicy is cluster-scoped, so it takes no namespace.
  name: overlayfs-cap-copy-up
spec:
  kprobes:
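    # ovl_copy_up_inode is a static function: it can be inlined or renamed
    # between kernel versions, so confirm the symbol in /proc/kallsyms first.
    - call: "ovl_copy_up_inode"
      syscall: false
      args:
        # NOTE: in mainline the first argument is a struct ovl_copy_up_ctx
        # pointer rather than an inode; the "inode" type is shorthand here.
        - index: 0
          type: "inode"    # the lower-layer inode being copied up
      selectors:
        - matchArgs:
            # Match when the inode carries a security.capability xattr.
            # NOTE: "InodeAttr" is pseudocode; check which operators your
            # Tetragon release offers before deploying.
            - index: 0
              operator: "InodeAttr"
              values: ["has_security_capability"]
          matchActions:
            - action: Post
              rateLimit: "1m"            # suppress repeats for one minute
              rateLimitScope: "global"   # valid scopes: thread, process, global
            # Sigkill terminates the calling process. It does not by itself
            # abort a copy-up that is already in progress.
            - action: Sigkill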

Falco Rules: User-Space Observable Patterns

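Falco operates on the syscall stream from its eBPF driver and needs no access to kernel-internal state. Each rule evaluates one event at a time, so the rules below flag the individual calls that make up each sequence and leave cross-event correlation to the SIEM.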

# falco-cow-exploit-rules.yaml
# Falco rules for dirty pipe, dirty COW, and overlayfs capability copy-up.

# List: images whose splice usage is known and accepted. Empty by default.
- list: trusted_image_list
  items: []

# Rule 1: splice() by a process without file-override capabilities.
# Falco evaluates one event at a time, so the rule fires on the splice itself;
# the follow-up write() to the same pipe is correlated at the SIEM layer.
- rule: Dirty Pipe Exploit Attempt
  desc: >
    A process without CAP_DAC_OVERRIDE or CAP_FOWNER has called splice().
    Splicing a read-only file into a pipe and then writing to that pipe is
    the syscall sequence required by the dirty pipe exploit (CVE-2022-0847).
    Legitimate programs that splice from files into pipes do not then write
    to the write end of the same pipe expecting the source file to change.
  condition: >
    evt.type = splice
    and not thread.cap_effective contains CAP_DAC_OVERRIDE
    and not thread.cap_effective contains CAP_FOWNER
    and not proc.name in (logrotate, rsyslogd, systemd-journal, auditd)
    and not container.image.repository in (trusted_image_list)
  output: >
    Dirty pipe exploit pattern: splice by process without write-override capabilities
    (proc=%proc.name pid=%proc.pid user=%user.name uid=%user.uid
     container=%container.name image=%container.image.repository
     args=%evt.args parent=%proc.pname gparent=%proc.aname[2])
  priority: CRITICAL
  tags: [cow, dirty-pipe, exploit, CVE-2022-0847]

# Rule 2: madvise(MADV_DONTNEED) from a process outside the known-noisy set.
# This is one half of the dirty COW precondition. madvise() takes an address
# range, not a file descriptor, so fd.* fields are not available on this event.
# The rule assumes the Falco driver in use reports madvise with its advice
# argument; confirm with `falco --list-events`.
- rule: Dirty COW madvise Precondition
  desc: >
    A process has called madvise(MADV_DONTNEED). Combined with a concurrent
    write through /proc/self/mem or a writable alias of the same mapping,
    this is the precondition for dirty COW variants (CVE-2016-5195 family).
    Detect the madvise call here; correlate with the /proc/self/mem write
    at the SIEM layer for full confidence.
  condition: >
    evt.type = madvise
    and evt.arg.advice = MADV_DONTNEED
    and not proc.name in (postgres, java, mysqld, node)
  output: >
    Possible dirty COW precondition: madvise MADV_DONTNEED
    (proc=%proc.name pid=%proc.pid user=%user.name uid=%user.uid
     args=%evt.args container=%container.name image=%container.image.repository)
  priority: ERROR
  tags: [cow, dirty-cow, madvise, CVE-2016-5195]

# Macro: image list that legitimately uses overlay mounts inside containers
# (e.g., Docker-in-Docker, buildkitd). Maintain this list carefully.
# Defined before Rule 3 because Falco resolves macros in file order.
- macro: trusted_privileged_images
  condition: >
    container.image.repository in (
      docker,
      moby/buildkit,
      gcr.io/kaniko-project/executor,
      registry.k8s.io/build-image/kube-cross
    )

# Rule 3: mount with fstype overlay inside a container.
# Indicates a user-namespace + overlayfs setup used in capability copy-up
# and several container escape techniques. The preceding unshare() is a
# separate event and is joined to this one at the SIEM layer.
- rule: Suspicious Unshare and Overlayfs Mount in Container
  desc: >
    A process inside a container has called mount() with filesystem type
    "overlay". Preceded by an unshare() that creates a new user or mount
    namespace, this sequence is characteristic of overlayfs-based
    container escapes and the overlayfs capability copy-up technique.
    Privileged containers performing legitimate storage operations are
    excluded via the trusted_privileged_images macro.
  # No capability filter: a process that has just unshared a user namespace
  # holds CAP_SYS_ADMIN inside it, and mount() cannot succeed without it.
  condition: >
    evt.type = mount
    and evt.arg.type = overlay
    and container.id != host
    and proc.vpid != 1
    and not trusted_privileged_images
  output: >
    Overlayfs mount in container: possible exploit setup
    (proc=%proc.name pid=%proc.pid user=%user.name uid=%user.uid
     container=%container.name image=%container.image.repository
     fstype=%evt.arg.type mntdir=%evt.arg.dir)
  priority: CRITICAL
  tags: [cow, overlayfs, container-escape, exploit]

auditd Rules as Complementary Layer

Where eBPF cannot be deployed (older kernels, locked-down environments), auditd provides a syscall-level audit trail. auditd cannot enforce (kill), but it feeds SIEM pipelines that can trigger automated response.

# /etc/audit/rules.d/60-cow-exploits.rules
# Audit rules complementing eBPF detection for CoW exploit preconditions.

# Capture all splice and vmsplice calls (dirty pipe vector).
-a always,exit -F arch=b64 -S splice -S vmsplice -k cow_splice

# Capture madvise calls with MADV_DONTNEED (advice value 4 on x86-64 and arm64).
# The a2 filter matches the advice argument.
-a always,exit -F arch=b64 -S madvise -F a2=4 -k cow_madvise

# Capture unshare calls that create new user namespaces (CLONE_NEWUSER, 0x10000000).
# The & operator matches when the masked bit is set.
-a always,exit -F arch=b64 -S unshare -F a0&0x10000000 -k cow_unshare_userns

# Capture all mount calls. auditd cannot filter on the filesystem-type string,
# so select the overlay mounts downstream.
-a always,exit -F arch=b64 -S mount -k cow_overlay_mount

# Ensure rules are loaded and not mutable at runtime.
-e 2

Load with augenrules --load and verify with auditctl -l | grep cow_.
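Each hit on these rules produces a SYSCALL record whose arguments are logged as bare hexadecimal. The sketch below decodes the namespace flags of an unshare record before it reaches the SIEM. The sample line is synthetic and the helper names are illustrative; the flag values come from linux/sched.h.

```python
import re

# Namespace-related clone flags (linux/sched.h).
CLONE_FLAGS = {
    0x00020000: "CLONE_NEWNS",
    0x02000000: "CLONE_NEWCGROUP",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}

# key=value pairs; values are either quoted or run to the next space.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

# Synthetic record, shaped like an audit SYSCALL line for unshare (272 on x86-64).
SAMPLE = (
    'type=SYSCALL msg=audit(1778336581.539:812): arch=c000003e syscall=272 '
    'success=yes exit=0 a0=10020000 a1=0 a2=0 a3=0 pid=42381 uid=1000 '
    'comm="bash" key="cow_unshare_userns"'
)


def parse_record(line: str) -> dict:
    """Split one audit record into a field dictionary."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def unshare_flags(record: dict) -> list:
    """Name the namespace flags set in a0, which auditd logs as bare hex."""
    value = int(record["a0"], 16)
    return [name for bit, name in sorted(CLONE_FLAGS.items()) if value & bit]


record = parse_record(SAMPLE)
flags = unshare_flags(record)
```

For the sample record, a0 combines CLONE_NEWNS and CLONE_NEWUSER, the pair used in the overlayfs scenario.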

SIEM Correlation and Noise Tuning

splice and madvise both appear in legitimate workloads. sendfile is implemented on top of the kernel’s splice machinery, and memory allocators, JVMs and databases such as PostgreSQL call madvise extensively. Correlating several conditions reduces false positives sharply:

Dirty pipe signal: Alert only when splice source fd has O_RDONLY flag AND the destination pipe fd is written within 500 ms by the same thread AND the process has no CAP_DAC_OVERRIDE or CAP_FOWNER. This three-part condition is not triggered by sendfile.
Dirty COW signal: Alert when madvise(MADV_DONTNEED) appears AND the same process has an open fd to /proc/self/mem or /dev/mem AND the madvise target mapping is file-backed. Database processes (postgres, mysqld) are excluded by name and cgroup.
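Overlayfs signal: Alert on any mount(overlay) inside a container from a non-init process whose image is not on the trusted list. mount() requires CAP_SYS_ADMIN in the user namespace that owns the mount namespace, so a successful overlay mount from an otherwise unprivileged pod implies a freshly unshared user namespace. This has near-zero legitimate trigger rate in standard Kubernetes workloads.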
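As an illustration of how such a correlation might look, the sketch below implements the dirty COW signal as a per-process predicate. It assumes the SIEM already normalises events into open, close and madvise callbacks; the class and function names are hypothetical, and a real pipeline would also need to normalise /proc/<pid>/mem paths.

```python
from dataclasses import dataclass, field

MADV_DONTNEED = 4                        # value on x86-64 and arm64
MEM_PATHS = {"/proc/self/mem", "/dev/mem"}
EXCLUDED_PROCS = {"postgres", "mysqld"}  # name-based exclusions named in the text


@dataclass
class ProcState:
    """What the correlator remembers about one process."""
    name: str
    mem_fds: set = field(default_factory=set)  # open fds that refer to MEM_PATHS


def on_open(proc: ProcState, fd: int, path: str) -> None:
    if path in MEM_PATHS:
        proc.mem_fds.add(fd)


def on_close(proc: ProcState, fd: int) -> None:
    proc.mem_fds.discard(fd)


def on_madvise(proc: ProcState, advice: int, file_backed: bool) -> bool:
    """Return True when every part of the dirty COW signal holds."""
    return (
        advice == MADV_DONTNEED
        and file_backed
        and bool(proc.mem_fds)
        and proc.name not in EXCLUDED_PROCS
    )


# Hypothetical processes: one suspicious, one excluded by name.
attacker = ProcState(name="exploit")
on_open(attacker, 3, "/proc/self/mem")
database = ProcState(name="postgres")
on_open(database, 7, "/proc/self/mem")
```

Keeping the predicate free of I/O makes it easy to replay recorded events through it while tuning the exclusion list.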

Ship Tetragon JSON events and Falco alerts to your SIEM with process ancestry (proc.aname[*]) and container metadata. Alert triage is significantly easier when the alert includes the full process tree from PID 1 to the offending process.

Expected Behaviour

The table below shows the observable syscall sequence for each exploit class, the Tetragon or Falco component that fires first, and the expected alert latency from exploit initiation.

Exploit class	Observable precondition sequence	Detector	Alert latency
Dirty pipe	`open(O_RDONLY)` → `pipe()` → `splice(rdonly_fd, pipe)` → `write(pipe)`	Tetragon kprobe on `pipe_write`	< 1 ms (in-kernel)
Dirty pipe (user-space)	`splice` + `write` on fifo with no write caps	Falco syscall rule	10–50 ms (syscall event latency)
Dirty COW	`mmap(PROT_READ, MAP_PRIVATE)` → `write(/proc/self/mem)` + `madvise(MADV_DONTNEED)`	Falco + auditd	10–100 ms
Overlayfs cap copy-up	`unshare(NEWNS)` → `mount(overlay)` → `open(cap_file)` → `ovl_copy_up_inode`	Tetragon kprobe on `ovl_copy_up_inode`	< 1 ms
Overlayfs (user-space)	`unshare` + `mount overlay` inside container	Falco syscall rule	10–50 ms

A sample Tetragon JSON event for the overlayfs copy-up detection:

{
  "process": {
    "exec_id": "a1b2c3d4e5f60001",
    "pid": 42381,
    "uid": 1000,
    "cwd": "/",
    "binary": "/bin/bash",
    "arguments": "",
    "flags": "execve clone",
    "start_time": "2026-05-09T14:23:01.481923Z",
    "auid": 1000,
    "pod": {
      "namespace": "default",
      "name": "attacker-pod-7f9b4",
      "container": {
        "id": "containerd://f3a4b5c6d7e8",
        "name": "shell",
        "image": {
          "id": "sha256:abc123",
          "name": "ubuntu:22.04"
        }
      }
    },
    "docker": "f3a4b5c6d7e8",
    "parent_exec_id": "a1b2c3d4e5f60000"
  },
  "parent": {
    "exec_id": "a1b2c3d4e5f60000",
    "pid": 42350,
    "binary": "/bin/bash"
  },
  "function_name": "ovl_copy_up_inode",
  "args": [
    {
      "inode_arg": {
        "number": 131073,
        "uid": 0,
        "gid": 0,
        "permission": "---s--x--x",
        "size_bytes": 44352,
        "links": 1,
        "xattr_flags": "has_security_capability"
      }
    }
  ],
  "action": "SIGKILL",
  "type": "KPROBE",
  "time": "2026-05-09T14:23:01.539201Z",
  "policy_name": "overlayfs-cap-copy-up",
  "node_name": "node-3"
}

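The action field shows that Tetragon sent SIGKILL to the process from inside the ovl_copy_up_inode hook. The kill takes effect when the task next leaves the kernel, so the field alone does not prove that the upper-layer file was never written with the capability xattr; check the upper directory during triage.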
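For triage, the handful of fields that matter can be flattened into a short summary. The sketch below works on a trimmed copy of the sample event above and assumes that event shape; the `summarize` helper and its output keys are illustrative.

```python
import json

# Trimmed copy of the sample event; only the fields used for triage.
SAMPLE = """
{
  "process": {
    "pid": 42381,
    "uid": 1000,
    "binary": "/bin/bash",
    "pod": {"namespace": "default", "name": "attacker-pod-7f9b4"}
  },
  "function_name": "ovl_copy_up_inode",
  "args": [{"inode_arg": {"number": 131073, "permission": "---s--x--x"}}],
  "action": "SIGKILL",
  "policy_name": "overlayfs-cap-copy-up",
  "node_name": "node-3"
}
"""


def summarize(event: dict) -> dict:
    """Flatten the fields an on-call engineer needs first."""
    proc = event["process"]
    pod = proc.get("pod") or {}        # absent for host processes
    args = event.get("args") or [{}]
    inode = args[0].get("inode_arg", {})
    return {
        "policy": event["policy_name"],
        "node": event["node_name"],
        "workload": f'{pod.get("namespace", "host")}/{pod.get("name", "-")}',
        "process": f'{proc["binary"]} (pid {proc["pid"]}, uid {proc["uid"]})',
        "hook": event["function_name"],
        "inode": inode.get("number"),
        # Substring test, so longer enum-style action names also match.
        "killed": "SIGKILL" in event.get("action", ""),
    }


summary = summarize(json.loads(SAMPLE))
```

The inode number and node name together are enough to locate the lower-layer file on the affected host.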

Trade-offs

Detection mechanism	Advantage	Disadvantage	Risk
Tetragon kprobe + SIGKILL	Sub-millisecond reaction; terminates the offending process from inside the kernel hook	Incorrect policy kills legitimate processes; kprobe hook point may shift across kernel versions, requiring policy updates; SIGKILL alone does not abort the in-flight kernel operation	False kill on legitimate `splice` (e.g., internal sendfile path) causes application crash
Tetragon kprobe + Post (alert only)	No false-kill risk; preserves forensic evidence	Exploit may complete before human response; must pair with automated runbook	Alert fatigue if splice baseline not tuned
Falco syscall rules	Broad coverage; no kernel internal access required; rule updates without reboot	10–50 ms latency; user-space daemon can be killed by attacker with root; cannot enforce	Dirty pipe exploit completes in < 5 ms — Falco detects but cannot prevent
auditd complement	Works on kernels < 5.8; feeds SIEM without eBPF	High event volume (splice is frequent); cannot enforce; no container-level filtering natively	Audit log volume spikes during legitimate batch workloads using sendfile
Combined eBPF + auditd	Defense in depth; auditd covers nodes where Tetragon DaemonSet is not scheduled	Operational complexity of maintaining two detection systems	Gaps when Tetragon pods are evicted during node pressure

Latency vs. correctness trade-off: Tetragon with SIGKILL is the only mechanism here that reacts inside the kernel, early enough to terminate a dirty pipe attempt mid-sequence. Because SIGKILL does not abort a kernel function that is already running, treat it as cutting the attack short rather than as proof that the page cache was never touched. Its false-positive risk is bounded by the specificity of the matching condition: the combination of an O_RDONLY splice source, a pipe write and no write capabilities was not produced by any of the legitimate applications tested. The Falco rules operate after the fact for dirty pipe but remain the primary mechanism for dirty COW, where the race condition is harder to hook at a single deterministic kernel point.

Failure Modes

Failure mode	Impact	Detection of the failure	Mitigation
Tetragon DaemonSet not scheduled on a node (node pressure eviction, taint mismatch)	No kprobe enforcement on that node; attacker on that node is invisible to Tetragon	Prometheus metric `tetragon_policy_loaded_total` drops below node count; alert when `count(tetragon_policy_loaded_total) < count(kube_node_info)`	Set Tetragon DaemonSet `priorityClassName: system-node-critical`; add toleration for all taints
Falco rules not reloaded after kernel upgrade	eBPF driver may fail to load against new kernel; Falco silently falls back to no-probe mode in some configurations	`falco_loaded_rules_total` metric goes to 0; health endpoint returns degraded	Pin Falco to a tested kernel version matrix; use `falco --validate` in node post-upgrade CI; alert on metric drop
`splice` legitimate use triggers dirty pipe false positive	Application killed by Tetragon SIGKILL; service outage	Application crash logs; Tetragon event log shows SIGKILL with `policy_name: dirty-pipe-detection`	Add process name or container image to Tetragon `matchProcess` exclusion list; switch to Post-only action during tuning period
Attacker avoids `splice`, uses alternative dirty pipe trigger path	Tetragon hook on `do_splice` does not fire; Falco rule does not match	No alert generated; only post-compromise forensics	Hook `vmsplice` as additional vector; add `tee` syscall monitoring for pipe-to-pipe splice variant
Overlayfs kernel patch backported by distro, changing `ovl_copy_up_inode` symbol	Tetragon kprobe on `ovl_copy_up_inode` silently fails to attach	`tetragon_kprobe_attach_errors_total` metric increments; Tetragon logs show symbol resolution failure	Monitor kprobe attach error metric; test policies against each distro kernel in CI using a test pod that exercises the hook
High-frequency `madvise(MADV_DONTNEED)` from JVM or database	Falco dirty COW rule generates continuous alerts	Alert volume spikes from known process names	Add JVM and database process names to exclusion macro; tune condition to require concurrent `/proc/self/mem` write