Detecting Copy-on-Write Exploitation with eBPF: Tracing Dirty Pipe and Overlayfs Attack Patterns
Problem
Copy-on-write (CoW) is a kernel memory management strategy: when two processes share a read-only page and one needs to modify it, the kernel copies the page and hands the writer the private copy. That invariant (the writer always gets a private copy, never the shared original) is what CoW exploits break. Three exploits in the past decade have each attacked a different point in that invariant.
Dirty pipe (CVE-2022-0847) abuses a pipe-buffer flags field that the kernel failed to initialize. The exploit first fills and drains a pipe so that every slot in its ring keeps a stale PIPE_BUF_FLAG_CAN_MERGE, then splices data from a file into the pipe, which makes a slot reference the file’s page-cache page while the stale flag is still set. A subsequent write to the pipe is merged directly into that page-cache page without allocating a private copy, overwriting a read-only file from a process with no write permissions. The syscall sequence is deterministic: open(O_RDONLY), pipe(), fill and drain the pipe, splice(fd, pipe), write(pipe, ...). No process with legitimate intent writes to a pipe whose source is an O_RDONLY file descriptor and then expects the file to change.
Dirty COW (CVE-2016-5195) races madvise(MADV_DONTNEED) with a write to a copy-on-write mapped page. The MADV_DONTNEED causes the kernel to release the private copy; the concurrent write then lands on the original shared page before a new private copy can be allocated. The required sequence is: mmap(PROT_READ, MAP_PRIVATE) on a read-only file, then a tight loop of write(mem_fd) interleaved with madvise(MADV_DONTNEED, ...) on the same range.
Overlayfs capability copy-up is a container-specific variant. When a process in a container opens for writing, or changes the metadata of, a setuid or file-capability-bearing binary in the lower layer (the image), overlayfs copies it to the upper layer and may preserve its security.capability xattr. A non-root process inside the container can trigger this copy-up and then execute the file, gaining capabilities the container was not configured to grant. The precondition is an unshare or mount creating a new overlayfs instance, followed by access to a lower-layer file that carries the setuid bit or a security.capability xattr.
All three share kernel-observable preconditions that appear before the privilege escalation completes. eBPF tracing programs resident in the kernel can observe these preconditions — specific syscall sequences, VFS operations on unusual file attribute combinations — and raise an alert or terminate the offending process within microseconds of the attempt, before the escalation lands.
Target systems: Linux with eBPF support (kernel 5.8+ for ring buffer and BTF-backed kprobes), Tetragon 1.2+ deployed as a DaemonSet, Falco 0.38+ with modern rule syntax. auditd as a complementary layer where eBPF cannot be deployed.
Threat Model
Three concrete attacker scenarios drive the detection design:
1. Container attacker exploiting dirty pipe to overwrite a host file. The attacker has code execution inside a container. The container image has /proc/self/fd available (standard) and the host kernel is 5.8 or later but predates the fixes shipped in 5.16.11, 5.15.25, and 5.10.102. The attacker splices content from the container’s read-only lower-layer filesystem into a pipe, then writes crafted content that propagates into the host’s page cache, overwriting /etc/passwd or a host binary visible through the overlayfs mount. Detection window: the splice call that attaches the page-cache page to a pipe buffer still flagged PIPE_BUF_FLAG_CAN_MERGE is visible before any data is modified.
2. Local user exploiting a dirty COW variant to write to a read-only mapping. The attacker has a shell account on a multi-tenant Linux host. They map a setuid binary MAP_PRIVATE | PROT_READ, then race write to /proc/self/mem against madvise(MADV_DONTNEED). Detection window: the combination of madvise(MADV_DONTNEED) on a MAP_PRIVATE mapping of a file with S_ISUID is anomalous and visible to eBPF before any write succeeds.
3. Overlayfs capability copy-up. The attacker has code execution in a Kubernetes pod with no host-path mounts and no SYS_ADMIN. They call unshare(CLONE_NEWUSER | CLONE_NEWNS) (permitted in some kernel configurations), mount a new overlayfs with a lower layer containing a capability-bearing binary, then access the binary to trigger copy-up. Detection window: unshare followed by mount with fstype=overlay followed by open() on a file with security.capability xattr — all observable via kprobe before the process executes the file.
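Scenario 3 depends on a lower layer that contains a setuid or capability-bearing binary, so it can help to know which files in an image qualify. The following is a minimal Python sketch, not part of the detection stack itself: it walks a directory tree and reports regular files that have the setuid bit or a security.capability xattr. The function name find_privileged_files is hypothetical, and os.getxattr is only available on Linux.

```python
import os
import stat


def find_privileged_files(root):
    """Yield (path, reasons) for regular files under root that are setuid
    or carry a security.capability xattr."""
    getxattr = getattr(os, "getxattr", None)  # Linux-only API
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if not stat.S_ISREG(st.st_mode):
                continue
            reasons = []
            if st.st_mode & stat.S_ISUID:
                reasons.append("setuid")
            if getxattr is not None:
                try:
                    getxattr(path, "security.capability", follow_symlinks=False)
                    reasons.append("security.capability")
                except OSError:
                    pass  # no such xattr, or filesystem without xattr support
            if reasons:
                yield path, reasons
```

Running it against an unpacked image root (for example `dict(find_privileged_files("/path/to/rootfs"))`, with the path being a placeholder) gives the list of files whose copy-up would be worth alerting on.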
eBPF’s advantage is that the preconditions for all three exploits require syscalls that are individually rare or combination-anomalous. Falco catches the user-space-observable signatures. Tetragon catches deeper VFS-layer signals, including page flags and xattr checks, and can terminate the process with SIGKILL before the exploit completes.
Configuration / Implementation
Tetragon TracingPolicy: Dirty Pipe Detection
The core signal is a write system call on a pipe file descriptor where the pipe buffer carries PIPE_BUF_FLAG_CAN_MERGE and the pipe was previously fed via splice from an O_RDONLY file descriptor. Tetragon’s kprobe mechanism hooks pipe_write on entry, before the kernel checks the buffer flag and merges the data.
# tetragon-dirty-pipe.yaml
# TracingPolicy detecting the dirty pipe exploit precondition.
# Hooks do_splice and pipe_write and tracks the destination pipe between them.
# NOTE: the matchArgs operators ("FileFlags", "FilePrivate"), the "kiocb"
# argument type, and the rate-limit syntax are illustrative. Validate them
# against the TracingPolicy CRD of your Tetragon release before deploying.
# TracingPolicy is cluster-scoped; use TracingPolicyNamespaced if the policy
# should live in a namespace.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: dirty-pipe-detection
  namespace: kube-system
spec:
  kprobes:
    # Hook 1: observe splice() to track pipe population from O_RDONLY fd.
    - call: "do_splice"
      syscall: false
      args:
        - index: 0
          type: "file"   # in: the source file
        - index: 2
          type: "file"   # out: the destination (pipe)
      selectors:
        - matchArgs:
            # Source file must be opened read-only (flags & O_ACCMODE == O_RDONLY).
            - index: 0
              operator: "FileFlags"
              values: ["O_RDONLY"]
          matchActions:
            - action: FollowFD
              argFd: 2
              argName: "splice_pipe_out"
    # Hook 2: observe pipe_write; if the pipe was populated via splice from
    # a read-only source, the combination is the dirty pipe precondition.
    - call: "pipe_write"
      syscall: false
      args:
        - index: 0
          type: "kiocb"  # the kiocb contains the file pointer for the pipe
      selectors:
        - matchArgs:
            - index: 0
              operator: "FilePrivate"
              # The pipe_inode_info pointer signals a pipe; we match on the fd
              # that was tracked by FollowFD above.
              values: ["splice_pipe_out"]
          matchActions:
            - action: Sigkill
            - action: Post
              rateLimit: "1/minute"
              ratelimitScope: "process"
  # Apply the policy to every pod except those labelled
  # tetragon.cilium.io/monitoring=disabled.
  podSelector:
    matchExpressions:
      - key: "tetragon.cilium.io/monitoring"
        operator: NotIn
        values: ["disabled"]
# tetragon-overlayfs-cap-copy.yaml
# TracingPolicy detecting overlayfs capability copy-up.
# Hooks ovl_copy_up_inode and checks for non-zero file capabilities.
# NOTE: the "InodeAttr" operator, the "inode" argument type, and the
# rate-limit syntax are illustrative. The signature of ovl_copy_up_inode
# differs between kernel versions, so confirm the argument layout and the
# operator names against your kernel and Tetragon release before deploying.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: overlayfs-cap-copy-up
  namespace: kube-system
spec:
  kprobes:
    - call: "ovl_copy_up_inode"
      syscall: false
      args:
        - index: 0
          type: "inode"  # the lower-layer inode being copied up
      selectors:
        - matchArgs:
            # Match when the inode has a security.capability xattr.
            - index: 0
              operator: "InodeAttr"
              values: ["has_security_capability"]
          matchActions:
            - action: Post
              rateLimit: "5/minute"
              ratelimitScope: "namespace"
            # Sigkill terminates the calling process. The intent is to stop it
            # before it can execute the upper-layer file that carries the
            # capability xattr.
            - action: Sigkill
Falco Rules: User-Space Observable Patterns
Falco operates on the syscall stream from its eBPF driver and can detect the behavioural sequence even without kernel internal state access.
# falco-cow-exploit-rules.yaml
# Falco rules for dirty pipe, dirty COW, and overlayfs capability copy-up.
# Falco evaluates each event on its own, so every rule below matches the last
# call of its sequence; correlate with the earlier calls at the SIEM layer.
# Field names such as evt.arg.advice and evt.arg.fstype depend on the Falco
# driver version; check them with `falco --list` before deploying.

# List: images allowed to write to pipes after splice. The entry below is a
# placeholder; replace it with your own repositories.
- list: trusted_image_list
  items: [registry.example.com/trusted/app]

# Macro: image list that legitimately uses overlay mounts inside containers
# (e.g., Docker-in-Docker, buildkitd). Maintain this list carefully.
# Defined before the rule that references it.
- macro: trusted_privileged_images
  condition: >
    container.image.repository in (
      docker,
      moby/buildkit,
      gcr.io/kaniko-project/executor,
      registry.k8s.io/build-image/kube-cross
    )

# Rule 1: Write to a pipe backed by a splice from an O_RDONLY file.
# Detects the dirty pipe exploit sequence at the user-space syscall level.
# Caveat: proc.aname[N] is the name of the Nth ancestor process, so the aname
# clause matches processes with an ancestor named "splice"; it does not prove
# that a splice() call preceded the write.
- rule: Dirty Pipe Exploit Attempt
  desc: >
    A process has called splice() to transfer data from a read-only file
    into a pipe, and then called write() on that pipe. This is the precise
    syscall sequence required by the dirty pipe exploit (CVE-2022-0847).
    Legitimate programs that splice from files into pipes do not then write
    to the write end of the same pipe expecting the source file to change.
  condition: >
    evt.type = write
    and fd.type = fifo
    and not thread.cap_effective contains CAP_DAC_OVERRIDE
    and not thread.cap_effective contains CAP_FOWNER
    and (proc.aname[2] = splice or proc.aname[3] = splice)
    and not proc.name in (logrotate, rsyslogd, journald, auditd)
    and not container.image.repository in (trusted_image_list)
  output: >
    Dirty pipe exploit pattern: write to pipe after splice from read-only fd
    (proc=%proc.name pid=%proc.pid user=%user.name uid=%user.uid
    container=%container.name image=%container.image.repository
    fd=%fd.name parent=%proc.pname gparent=%proc.aname[2])
  priority: CRITICAL
  tags: [cow, dirty-pipe, exploit, CVE-2022-0847]

# Rule 2: madvise(MADV_DONTNEED) on a file-backed MAP_PRIVATE region.
# This is the dirty COW precondition; legitimate programs rarely call
# MADV_DONTNEED on file-backed private mappings in a write loop.
# Caveat: madvise takes an address range, not a file descriptor, so the fd.*
# clauses may not be populated on this event. Treat them as a sketch.
- rule: Dirty COW madvise Precondition
  desc: >
    A process has called madvise(MADV_DONTNEED) on a mapping of a file
    that is also being written via /proc/self/mem or a writable alias.
    This is the precondition for dirty COW variants (CVE-2016-5195 family).
    Detect the madvise call; correlate with concurrent /proc/self/mem write
    at the SIEM layer for full confidence.
  condition: >
    evt.type = madvise
    and evt.arg.advice = MADV_DONTNEED
    and fd.type = file
    and fd.is_readonly = false
    and (fd.filename startswith /proc/self or fd.filename contains mem)
    and not proc.name in (postgres, java, mysqld, node)
    and container.id != host
  output: >
    Possible dirty COW precondition: madvise MADV_DONTNEED on mem-mapped file
    (proc=%proc.name pid=%proc.pid user=%user.name uid=%user.uid
    file=%fd.name container=%container.name image=%container.image.repository)
  priority: HIGH
  tags: [cow, dirty-cow, madvise, CVE-2016-5195]

# Rule 3: mount with fstype overlay inside a container.
# Indicates a user-namespace + overlayfs setup used in capability copy-up
# and several container escape techniques. The condition matches the mount
# call only; the preceding unshare() is not checked here.
- rule: Suspicious Unshare and Overlayfs Mount in Container
  desc: >
    A process inside a container has called unshare() to create a new
    user or mount namespace, and subsequently called mount() with filesystem
    type "overlay". This sequence is characteristic of overlayfs-based
    container escapes and the overlayfs capability copy-up technique.
    Privileged containers performing legitimate storage operations are
    excluded via the trusted_privileged_images macro.
  condition: >
    evt.type = mount
    and evt.arg.fstype = overlay
    and container.id != host
    and proc.vpid != 1
    and not trusted_privileged_images
    and not thread.cap_effective contains CAP_SYS_ADMIN
  output: >
    Overlayfs mount after unshare in container: possible exploit setup
    (proc=%proc.name pid=%proc.pid user=%user.name uid=%user.uid
    container=%container.name image=%container.image.repository
    fstype=%evt.arg.fstype mntdir=%evt.arg.dir)
  priority: CRITICAL
  tags: [cow, overlayfs, container-escape, exploit]
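The capability clauses in these rules test the thread's effective capability set. When triaging an alert by hand, the same check can be made from the CapEff line in /proc/<pid>/status, which is a hexadecimal bitmask. The sketch below decodes only the three capabilities the rules refer to; the bit positions come from linux/capability.h, while the function names are hypothetical.

```python
# Bit positions from linux/capability.h (only the ones used by the rules).
CAP_BITS = {
    "CAP_DAC_OVERRIDE": 1,
    "CAP_FOWNER": 3,
    "CAP_SYS_ADMIN": 21,
}


def decode_caps(cap_hex):
    """Return the subset of CAP_BITS set in a hex mask such as a CapEff value."""
    mask = int(cap_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def lacks_write_caps(cap_hex):
    """True when neither CAP_DAC_OVERRIDE nor CAP_FOWNER is effective."""
    return not ({"CAP_DAC_OVERRIDE", "CAP_FOWNER"} & decode_caps(cap_hex))


def read_cap_eff(pid="self"):
    """Read the CapEff mask of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise ValueError("CapEff not found")
```

For example, `decode_caps("00000000a80425fb")` reports CAP_DAC_OVERRIDE and CAP_FOWNER but not CAP_SYS_ADMIN, so a process with that mask would be skipped by the dirty pipe rule and still matched by the overlay mount rule.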
auditd Rules as Complementary Layer
Where eBPF cannot be deployed (older kernels, locked-down environments), auditd provides a syscall-level audit trail. auditd cannot enforce (kill), but it feeds SIEM pipelines that can trigger automated response.
# /etc/audit/rules.d/60-cow-exploits.rules
# Audit rules complementing eBPF detection for CoW exploit preconditions.
# Capture all splice and vmsplice calls (dirty pipe vector).
-a always,exit -F arch=b64 -S splice -S vmsplice -k cow_splice
# Capture madvise calls with MADV_DONTNEED (advice value 4).
# The a2 filter matches the advice argument (the third syscall argument).
-a always,exit -F arch=b64 -S madvise -F a2=4 -k cow_madvise
# Capture unshare calls that create new user namespaces (flag bit 0x10000000).
-a always,exit -F arch=b64 -S unshare -F a0&0x10000000=0x10000000 -k cow_unshare_userns
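# Capture mount calls. auditd compares raw argument values and cannot match
# the filesystem-type string, so select overlay mounts downstream in the SIEM.
-a always,exit -F arch=b64 -S mount -k cow_overlay_mount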
# Ensure rules are loaded and not mutable at runtime.
-e 2
Load with augenrules --load and verify with auditctl -l | grep cow_.
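Audit SYSCALL records report a0 to a3 as hexadecimal values without a prefix, which makes the flag arguments hard to read during triage. As a rough sketch (the helper names and the sample record layout are assumptions; the constants are the values from the Linux UAPI headers), the arguments can be decoded like this:

```python
import re

MADV_DONTNEED = 4  # from asm-generic/mman-common.h

# Namespace flags accepted by unshare(2), from linux/sched.h.
CLONE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
}

_ARG_RE = re.compile(r"\ba([0-3])=([0-9a-fA-F]+)")


def parse_args(record):
    """Extract a0..a3 from an audit SYSCALL record as integers."""
    return {int(idx): int(val, 16) for idx, val in _ARG_RE.findall(record)}


def unshare_namespaces(record):
    """Names of the namespace flags set in a0 of an unshare record."""
    flags = parse_args(record).get(0, 0)
    return [name for name, bit in CLONE_FLAGS.items() if flags & bit]


def is_madv_dontneed(record):
    """True when a2 of a madvise record equals MADV_DONTNEED."""
    return parse_args(record).get(2) == MADV_DONTNEED
```

A record containing `a0=10020000` under the cow_unshare_userns key decodes to CLONE_NEWNS plus CLONE_NEWUSER, the combination used in threat scenario 3.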
SIEM Correlation and Noise Tuning
splice and madvise appear in legitimate workloads. sendfile is implemented via splice internally; PostgreSQL and other databases use madvise extensively. Correlation reduces false positives dramatically:
- Dirty pipe signal: Alert only when the splice source fd has the O_RDONLY flag AND the destination pipe fd is written within 500 ms by the same thread AND the process has neither CAP_DAC_OVERRIDE nor CAP_FOWNER. This three-part condition is not triggered by sendfile.
- Dirty COW signal: Alert when madvise(MADV_DONTNEED) appears AND the same process has an open fd to /proc/self/mem or /dev/mem AND the madvise target mapping is file-backed. Database processes (postgres, mysqld) are excluded by name and cgroup.
- Overlayfs signal: Alert on any mount(overlay) inside a container namespace from a non-init process without CAP_SYS_ADMIN. This has near-zero legitimate trigger rate in standard Kubernetes workloads.
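The dirty pipe condition above can be sketched as a small stateful correlator. This is an illustration under stated assumptions, not a SIEM rule: the Event fields (including pipe_id as a stand-in for whatever pipe identity your pipeline exposes) and the class name are hypothetical, and events are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass

WINDOW_NS = 500_000_000  # the 500 ms window from the text


@dataclass(frozen=True)
class Event:
    ts_ns: int                    # event timestamp in nanoseconds
    tid: int                      # thread id
    syscall: str                  # "splice" or "write"
    pipe_id: int                  # hypothetical pipe identity, e.g. inode number
    src_rdonly: bool = False      # splice only: source fd opened O_RDONLY
    has_write_caps: bool = False  # CAP_DAC_OVERRIDE or CAP_FOWNER effective


class DirtyPipeCorrelator:
    """Flags a write to a pipe that the same thread fed via splice from a
    read-only fd within the window, when the process lacks write caps."""

    def __init__(self, window_ns=WINDOW_NS):
        self.window_ns = window_ns
        self._pending = {}  # (tid, pipe_id) -> timestamp of qualifying splice

    def feed(self, ev):
        key = (ev.tid, ev.pipe_id)
        if ev.syscall == "splice":
            if ev.src_rdonly and not ev.has_write_caps:
                self._pending[key] = ev.ts_ns
            return False
        if ev.syscall == "write":
            start = self._pending.pop(key, None)
            if start is None:
                return False
            return 0 <= ev.ts_ns - start <= self.window_ns
        return False
```

Each qualifying splice arms the correlator once; the first write on the same pipe by the same thread either raises the alert or clears the state. A production version would also expire entries that are never written, which this sketch omits.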
Ship Tetragon JSON events and Falco alerts to your SIEM with process ancestry (proc.aname[*]) and container metadata. Alert triage is significantly easier when the alert includes the full process tree from PID 1 to the offending process.
Expected Behaviour
The table below shows the observable syscall sequence for each exploit class, the Tetragon or Falco component that fires first, and the expected alert latency from exploit initiation.
| Exploit class | Observable precondition sequence | Detector | Alert latency |
|---|---|---|---|
| Dirty pipe | open(O_RDONLY) → pipe() → splice(rdonly_fd, pipe) → write(pipe) | Tetragon kprobe on pipe_write | < 1 ms (in-kernel) |
| Dirty pipe (user-space) | splice + write on fifo with no write caps | Falco syscall rule | 10–50 ms (syscall event latency) |
| Dirty COW | mmap(PROT_READ, MAP_PRIVATE) → write(/proc/self/mem) + madvise(MADV_DONTNEED) | Falco + auditd | 10–100 ms |
| Overlayfs cap copy-up | unshare(NEWNS) → mount(overlay) → open(cap_file) → ovl_copy_up_inode | Tetragon kprobe on ovl_copy_up_inode | < 1 ms |
| Overlayfs (user-space) | unshare + mount overlay inside container | Falco syscall rule | 10–50 ms |
A sample Tetragon JSON event for the overlayfs copy-up detection:
{
"process": {
"exec_id": "a1b2c3d4e5f60001",
"pid": 42381,
"uid": 1000,
"cwd": "/",
"binary": "/bin/bash",
"arguments": "",
"flags": "execve clone",
"start_time": "2026-05-09T14:23:01.481923Z",
"auid": 1000,
"pod": {
"namespace": "default",
"name": "attacker-pod-7f9b4",
"container": {
"id": "containerd://f3a4b5c6d7e8",
"name": "shell",
"image": {
"id": "sha256:abc123",
"name": "ubuntu:22.04"
}
}
},
"docker": "f3a4b5c6d7e8",
"parent_exec_id": "a1b2c3d4e5f60000"
},
"parent": {
"exec_id": "a1b2c3d4e5f60000",
"pid": 42350,
"binary": "/bin/bash"
},
"function_name": "ovl_copy_up_inode",
"args": [
{
"inode_arg": {
"number": 131073,
"uid": 0,
"gid": 0,
"permission": "---s--x--x",
"size_bytes": 44352,
"links": 1,
"xattr_flags": "has_security_capability"
}
}
],
"action": "SIGKILL",
"type": "KPROBE",
"time": "2026-05-09T14:23:01.539201Z",
"policy_name": "overlayfs-cap-copy-up",
"node_name": "node-3"
}
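The action: SIGKILL field records that Tetragon sent SIGKILL to the process from the ovl_copy_up_inode hook. A signal takes effect when the task next reaches a signal-delivery point, so the field on its own does not prove that the copy-up was interrupted before the upper-layer file was written with the capability xattr; confirm by inspecting the upper layer after the event.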
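For SIEM ingestion it is usually enough to flatten such an event into a handful of fields. The sketch below assumes the flat layout of the sample event shown above; a real Tetragon export may wrap these fields in an envelope key that has to be unwrapped first, and the function name summarize_event is hypothetical.

```python
import json


def summarize_event(raw):
    """Reduce a Tetragon kprobe event (sample layout) to triage fields."""
    ev = json.loads(raw)
    proc = ev.get("process", {})
    pod = proc.get("pod", {})
    xattr_flags = [
        arg["inode_arg"].get("xattr_flags")
        for arg in ev.get("args", [])
        if "inode_arg" in arg
    ]
    return {
        "policy": ev.get("policy_name"),
        "function": ev.get("function_name"),
        "action": ev.get("action"),
        "node": ev.get("node_name"),
        "binary": proc.get("binary"),
        "pid": proc.get("pid"),
        "uid": proc.get("uid"),
        "pod": "{}/{}".format(pod.get("namespace"), pod.get("name")) if pod else None,
        "cap_xattr": "has_security_capability" in xattr_flags,
    }
```

Fields missing from an event come back as None rather than raising, so the same function can be applied to events from other policies that carry no inode argument.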
Trade-offs
| Detection mechanism | Advantage | Disadvantage | Risk |
|---|---|---|---|
| Tetragon kprobe + SIGKILL | Sub-millisecond enforcement; stops exploit before it completes | Incorrect policy kills legitimate processes; kprobe hook point may shift across kernel versions requiring policy updates | False kill on legitimate splice (e.g., internal sendfile path) causes application crash |
| Tetragon kprobe + Post (alert only) | No false-kill risk; preserves forensic evidence | Exploit may complete before human response; must pair with automated runbook | Alert fatigue if splice baseline not tuned |
| Falco syscall rules | Broad coverage; no kernel internal access required; rule updates without reboot | 10–50 ms latency; user-space daemon can be killed by attacker with root; cannot enforce | Dirty pipe exploit completes in < 5 ms — Falco detects but cannot prevent |
| auditd complement | Works on kernels < 5.8; feeds SIEM without eBPF | High event volume (splice is frequent); cannot enforce; no container-level filtering natively | Audit log volume spikes during legitimate batch workloads using sendfile |
| Combined eBPF + auditd | Defense in depth; auditd covers nodes where Tetragon DaemonSet is not scheduled | Operational complexity of maintaining two detection systems | Gaps when Tetragon pods are evicted during node pressure |
Latency vs. correctness trade-off: Tetragon with SIGKILL is the only mechanism that can stop dirty pipe before file modification. Its false-positive risk is bounded by the specificity of the matching condition: splice source O_RDONLY + pipe write + no write capabilities is not a pattern produced by any legitimate application tested. The Falco rules operate post-fact for dirty pipe but remain the primary mechanism for dirty COW, where the race condition is harder to hook at a single deterministic kernel point.
Failure Modes
| Failure mode | Impact | Detection of the failure | Mitigation |
|---|---|---|---|
| Tetragon DaemonSet not scheduled on a node (node pressure eviction, taint mismatch) | No kprobe enforcement on that node; attacker on that node is invisible to Tetragon | Prometheus metric tetragon_policy_loaded_total drops below node count; alert when count(tetragon_policy_loaded_total) < count(kube_node_info) | Set Tetragon DaemonSet priorityClassName: system-node-critical; add toleration for all taints |
| Falco rules not reloaded after kernel upgrade | eBPF driver may fail to load against new kernel; Falco silently falls back to no-probe mode in some configurations | falco_loaded_rules_total metric goes to 0; health endpoint returns degraded | Pin Falco to a tested kernel version matrix; use falco --validate in node post-upgrade CI; alert on metric drop |
| splice legitimate use triggers dirty pipe false positive | Application killed by Tetragon SIGKILL; service outage | Application crash logs; Tetragon event log shows SIGKILL with policy_name: dirty-pipe-detection | Add process name or container image to Tetragon matchProcess exclusion list; switch to Post-only action during tuning period |
| Attacker avoids splice, uses alternative dirty pipe trigger path | Tetragon hook on do_splice does not fire; Falco rule does not match | No alert generated; only post-compromise forensics | Hook vmsplice as additional vector; add tee syscall monitoring for pipe-to-pipe splice variant |
| Overlayfs kernel patch backported by distro, changing ovl_copy_up_inode symbol | Tetragon kprobe on ovl_copy_up_inode silently fails to attach | tetragon_kprobe_attach_errors_total metric increments; Tetragon logs show symbol resolution failure | Monitor kprobe attach error metric; test policies against each distro kernel in CI using a test pod that exercises the hook |
| High-frequency madvise(MADV_DONTNEED) from JVM or database | Falco dirty COW rule generates continuous alerts | Alert volume spikes from known process names | Add JVM and database process names to exclusion macro; tune condition to require concurrent /proc/self/mem write |