Linux fanotify for Real-Time Filesystem Security Monitoring
The Problem
Every anti-malware scanner, DLP agent, and file-integrity monitor on Linux needs the same thing: a reliable stream of “this process just opened / wrote / executed this file” events, delivered fast enough to block the operation if needed.
inotify is the historical answer and it falls short in four concrete ways:
- Per-inode watches. You must explicitly call
inotify_add_watch()for every directory you care about. Monitoring/usrmeans watching hundreds of directories and updating the watch list every time a subdirectory is created. Race conditions between directory creation and watch installation are real. - No path in events.
inotifyreturns a watch descriptor and a filename component. Reconstructing the full path requires mapping watch descriptors back to directories you’re already tracking — a bookkeeping problem, not a kernel feature. - No blocking.
inotifyis notification-only. By the time your daemon processesIN_CLOSE_WRITE, the bytes are already on disk. You cannot intercept and deny the operation. - No process identity.
inotifyevents carry no PID. You can’t tell which process triggered the event without racing to correlate/procstate.
fanotify (Linux 2.6.37, significantly extended through 5.x) was designed to fix all of these. It is the API that RHEL’s fapolicyd, ClamAV’s on-access scanner, and most commercial endpoint protection agents on Linux use under the hood.
inotify vs fanotify: What Actually Changes
| Capability | inotify | fanotify |
|---|---|---|
| Scope | Per-directory watch | Per-mount or per-filesystem |
| Path in event | No (watch descriptor + filename component) | Yes (with FAN_REPORT_DFID_NAME, kernel 5.1+) |
| PID in event | No | Yes |
| Blocking mode | No | Yes (FAN_CLASS_CONTENT, FAN_CLASS_PRE_CONTENT) |
| Permission decisions | No | Yes (FAN_OPEN_PERM, FAN_ACCESS_PERM) |
| Binary execution events | No | Yes (FAN_OPEN_EXEC_PERM) |
| Kernel version | 2.6.13 | 2.6.37 (full features: 5.1+) |
The FAN_CLASS_* hierarchy controls what your daemon can do with events:
FAN_CLASS_NOTIF— notification only, no blocking. Equivalent to an enrichedinotify.FAN_CLASS_CONTENT— can intercept file reads/opens. Used by on-access virus scanners.FAN_CLASS_PRE_CONTENT— can intercept partial-write events. Used by hierarchical storage managers, but also useful for DLP on write interception.
Only one process can hold FAN_CLASS_PRE_CONTENT per mount. Multiple processes can hold FAN_CLASS_CONTENT and FAN_CLASS_NOTIF simultaneously — they queue independently.
Filesystem-Wide Marks: FAN_MARK_FILESYSTEM
The most important operational improvement over inotify is FAN_MARK_FILESYSTEM. Instead of walking a directory tree and adding individual watches, a single fanotify_mark() call covers everything under a mount point:
int fan_fd = fanotify_init(FAN_CLASS_CONTENT | FAN_CLOEXEC | FAN_NONBLOCK, O_RDONLY | O_LARGEFILE);
// Watch the entire filesystem containing /usr
fanotify_mark(fan_fd,
FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | FAN_CLOSE_WRITE,
AT_FDCWD, "/usr");
The FAN_MARK_FILESYSTEM flag requires CAP_SYS_ADMIN. This is intentional: a filesystem-wide mark intercepts every open on the mount, including operations by root. The kernel will not let unprivileged processes install such a broad net.
Marks are additive. You can layer:
// Also watch /etc for config mutations
fanotify_mark(fan_fd,
FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
FAN_CLOSE_WRITE | FAN_MOVED_TO,
AT_FDCWD, "/etc");
To exclude noisy paths from a filesystem-wide mark use FAN_MARK_IGNORED_MASK:
// Suppress events for /var/log (high-volume writes, low-value for this use case)
fanotify_mark(fan_fd,
FAN_MARK_ADD | FAN_MARK_IGNORED_MASK | FAN_MARK_IGNORED_SURV_MODIFY,
FAN_CLOSE_WRITE,
AT_FDCWD, "/var/log");
Permission Events: FAN_OPEN_PERM and FAN_ACCESS_PERM
Permission events are the blocking primitive. When the kernel delivers a FAN_OPEN_PERM event, the opening process is suspended in the kernel waiting for your daemon to write a response. Until you respond, the open() syscall does not return.
This is the mechanism anti-malware scanners use: the file open is held, the scanner reads the file descriptor from the event, scans it, then writes allow or deny back to the fanotify fd.
The response is a struct fanotify_response:
struct fanotify_response response = {
.fd = event->fd,
.response = FAN_ALLOW, // or FAN_DENY
};
write(fan_fd, &response, sizeof(response));
FAN_DENY causes the syscall in the triggering process to return EPERM. The file is never opened from the process’s perspective.
Key events for security monitoring:
| Event flag | Triggered when | Blocking? |
|---|---|---|
FAN_OPEN_PERM |
Any open() / openat() |
Yes |
FAN_ACCESS_PERM |
read() on a watched inode |
Yes |
FAN_OPEN_EXEC_PERM |
execve() of a watched file |
Yes |
FAN_CLOSE_WRITE |
File closed after write | No (notification) |
FAN_MOVED_TO |
File renamed/moved into watched path | No |
FAN_CREATE |
File created in watched directory | No |
FAN_DELETE |
File deleted | No |
FAN_OPEN_EXEC_PERM deserves special attention. It fires when the kernel is about to execute the file — after the execve() is called but before the new process image is loaded. Your daemon can deny execution of files that fail a hash check, aren’t in an allowlist, or were written by an untrusted process. This is the foundation of application allowlisting on Linux without LSM policy.
Getting File Paths: FAN_REPORT_DFID_NAME (Linux 5.1+)
Before kernel 5.1, fanotify events gave you an open file descriptor to the affected file. You could read the file, but mapping it back to a path required readlink(/proc/self/fd/<n>) — fragile if the file was deleted between the event and your read, and wrong if it was hardlinked under multiple paths.
Linux 5.1 introduced FAN_REPORT_DFID_NAME. Pass it to fanotify_init() and events include:
- A
file_handleidentifying the directory containing the file - The filename component as a null-terminated string appended to the event structure
int fan_fd = fanotify_init(
FAN_CLASS_NOTIF | FAN_REPORT_DFID_NAME | FAN_CLOEXEC | FAN_NONBLOCK,
O_RDONLY | O_LARGEFILE
);
Reading events with FAN_REPORT_DFID_NAME requires walking a variable-length structure. Each event is followed by a fanotify_event_info_fid header, then the file_handle, then the filename:
struct fanotify_event_metadata *meta;
struct fanotify_event_info_fid *fid;
char buf[4096];
ssize_t n;
while ((n = read(fan_fd, buf, sizeof(buf))) > 0) {
meta = (struct fanotify_event_metadata *)buf;
while (FAN_EVENT_OK(meta, n)) {
if (meta->mask & FAN_EVENTS_BITS) {
fid = (struct fanotify_event_info_fid *)(meta + 1);
struct file_handle *fh = (struct file_handle *)fid->handle;
// Filename is after the variable-length file_handle
char *fname = (char *)fh->f_handle + fh->handle_bytes;
// Open the directory by handle to get a real fd
int dir_fd = open_by_handle_at(AT_FDCWD, fh, O_RDONLY);
if (dir_fd >= 0) {
// Now we have: dir_fd + fname = the full file
// You can openat(dir_fd, fname, ...) to read it
// or readlink /proc/self/fd/<dir_fd> + "/" + fname for the path
close(dir_fd);
}
}
if (meta->fd != FAN_NOFD)
close(meta->fd); // Always close the event fd
meta = FAN_EVENT_NEXT(meta, n);
}
}
open_by_handle_at() requires CAP_DAC_READ_SEARCH. On kernels before 5.1, fall back to readlink(/proc/self/fd/<event_fd>) but accept that deleted-file races exist.
Linux 5.9 added FAN_REPORT_PIDFD — instead of a PID that may be recycled, events carry a pidfd that remains valid until you close it. This eliminates TOCTOU races when correlating event PIDs to process metadata.
Practical C Skeleton: Init, Mark, Read, Respond
A minimal but production-shaped fanotify daemon has four layers:
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <sys/fanotify.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#define EVENT_BUF_LEN (10 * (sizeof(struct fanotify_event_metadata) + NAME_MAX + 1))
static int fan_fd;
static void send_response(int event_fd, uint32_t response_type) {
struct fanotify_response resp = {
.fd = event_fd,
.response = response_type,
};
if (write(fan_fd, &resp, sizeof(resp)) < 0)
perror("fanotify response write");
close(event_fd);
}
static int is_trusted_write(pid_t pid) {
// Placeholder: read /proc/<pid>/exe, check against allowlist
// In production: use pidfd_open + /proc/self/fdinfo for race-safe check
return 0;
}
int main(void) {
fan_fd = fanotify_init(
FAN_CLASS_CONTENT | FAN_CLOEXEC | FAN_NONBLOCK | FAN_REPORT_PIDFD,
O_RDONLY | O_LARGEFILE | O_CLOEXEC
);
if (fan_fd < 0) { perror("fanotify_init"); return 1; }
// Block execution of anything written to /usr on this filesystem
if (fanotify_mark(fan_fd,
FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
FAN_OPEN_EXEC_PERM | FAN_CLOSE_WRITE,
AT_FDCWD, "/usr") < 0) {
perror("fanotify_mark /usr"); return 1;
}
// Alert on /etc writes (config tampering)
if (fanotify_mark(fan_fd,
FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
FAN_CLOSE_WRITE | FAN_MOVED_TO,
AT_FDCWD, "/etc") < 0) {
perror("fanotify_mark /etc"); return 1;
}
char buf[EVENT_BUF_LEN] __attribute__((aligned(8)));
for (;;) {
ssize_t n = read(fan_fd, buf, sizeof(buf));
if (n < 0) {
if (errno == EAGAIN) continue; // non-blocking, no events
perror("fanotify read"); break;
}
struct fanotify_event_metadata *ev =
(struct fanotify_event_metadata *)buf;
while (FAN_EVENT_OK(ev, n)) {
if (ev->vers != FANOTIFY_METADATA_VERSION) {
fprintf(stderr, "kernel/userspace version mismatch\n");
goto done;
}
pid_t pid = ev->pid;
char exe[PATH_MAX];
snprintf(exe, sizeof(exe), "/proc/%d/exe", pid);
char target[PATH_MAX];
ssize_t l = readlink(exe, target, sizeof(target) - 1);
if (l > 0) target[l] = '\0';
else snprintf(target, sizeof(target), "<unknown>");
if (ev->mask & FAN_OPEN_EXEC_PERM) {
// Check if the binary is in our allowlist
// For demo: deny anything in /usr/local/bin not signed
fprintf(stderr, "[EXEC] pid=%d exe=%s\n", pid, target);
// Real check: verify HMAC or check inode against known-good db
send_response(ev->fd, FAN_ALLOW);
ev->fd = FAN_NOFD; // already closed by send_response
}
if (ev->mask & FAN_CLOSE_WRITE) {
char path[PATH_MAX];
if (ev->fd != FAN_NOFD) {
snprintf(exe, sizeof(exe), "/proc/self/fd/%d", ev->fd);
ssize_t pl = readlink(exe, path, sizeof(path) - 1);
if (pl > 0) path[pl] = '\0';
else snprintf(path, sizeof(path), "<unknown>");
}
fprintf(stderr, "[WRITE] pid=%d path=%s\n", pid, path);
// Emit to SIEM here (syslog, structured JSON to stdout, etc.)
}
if (ev->fd != FAN_NOFD) close(ev->fd);
ev = FAN_EVENT_NEXT(ev, n);
}
}
done:
close(fan_fd);
return 0;
}
Compile with:
gcc -O2 -Wall -o fan_monitor fan_monitor.c
# Requires CAP_SYS_ADMIN to run; on a non-root user:
sudo setcap cap_sys_admin,cap_dac_read_search+eip ./fan_monitor
Security Monitoring Use Cases
Detecting Unauthorized Binary Writes to /usr/bin and /lib
An attacker who achieves RCE will frequently try to persist by dropping a binary into a standard search-path directory. A FAN_CLOSE_WRITE | FAN_MARK_FILESYSTEM mark on /usr with FAN_CLOSE_WRITE catches the write after the file is closed — the full content is committed. Cross-reference the writing PID against a list of blessed package-manager processes (dpkg, rpm, dnf). Any other writer is anomalous.
# Verify your mark is active
cat /proc/<fan_pid>/fdinfo/<fan_fd>
# Output includes: fanotify flags, event queue size, inode table size
Alerting on /etc/passwd Writes
/etc/passwd writes indicate account creation or modification. With FAN_CLOSE_WRITE on /etc:
if (strstr(path, "/etc/passwd") || strstr(path, "/etc/shadow")) {
alert_siem("CREDENTIAL_FILE_MODIFIED", pid, path);
}
Combine with the writing process’s cmdline and parent PID to distinguish useradd (expected) from a web process writing directly to /etc/shadow (critical alert).
Blocking Execution of Files Written by Untrusted Processes
This is the “write-then-exec” attack pattern: download a payload, write it to /tmp, chmod +x, execute. With FAN_OPEN_EXEC_PERM on the filesystem:
- On
FAN_CLOSE_WRITE: record(inode, device, sha256)→writer_pid_exein a hash map. - On
FAN_OPEN_EXEC_PERM: look up the inode in the hash map. If the writer was not a trusted package manager, respondFAN_DENY.
This requires tracking inodes across events, handling hardlinks (same inode, multiple paths), and expiring old entries. The inode + device pair is your stable identity — not the path, which can be renamed under you.
Application Allowlisting Without LSM
fapolicyd (used in RHEL/Fedora) implements exactly this pattern: a fanotify FAN_OPEN_EXEC_PERM daemon that checks executing binaries against a hash database. No SELinux policy required, no kernel module, just a daemon with CAP_SYS_ADMIN. The commercial equivalent is SentinelOne, CrowdStrike Falcon, and Carbon Black’s kernel components — all built on the same fanotify primitives on Linux.
fanotify vs auditd: When to Use Each
These tools are not alternatives — they serve fundamentally different purposes and you should run both.
| Concern | fanotify | auditd |
|---|---|---|
| Real-time blocking | Yes — holds the syscall | No — post-fact notification |
| Full syscall coverage | No — filesystem events only | Yes — all syscalls, netfilter, user space |
| PID/process metadata | Yes | Yes |
| Compliance logging (PCI, HIPAA) | Not designed for this | Yes — tamper-evident audit trail |
| User/UID context | Via /proc correlation | Native in audit record |
| Network events | No | Yes (via audit rules) |
| Kernel version dependency | 5.1+ for full features | All kernels |
| Performance at scale | High (in-kernel filtering) | Moderate (rule ordering critical) |
Rule of thumb: Use fanotify when you need to block. Use auditd when you need a forensic record. For SOC workflows, pipe fanotify alerts to your SIEM alongside auditd logs — fanotify tells you what was stopped, auditd tells you everything that happened around it.
A common architecture layers them:
fanotify daemon → blocks exec of unknown binaries, alerts on /etc writes
auditd → logs all execves, setuid calls, network connects, cron changes
SIEM → correlates both streams
eBPF as Complement: Where fanotify Falls Short
fanotify operates at the VFS layer — it sees file operations, not the full syscall context around them. Two gaps where eBPF fills in:
1. Syscall argument granularity. fanotify tells you a file was opened. An eBPF kprobe on do_sys_open or an LSM hook on security_file_open can capture the full open() flags, the calling process’s capability set, the cgroup, and the network namespace — context fanotify doesn’t surface.
2. Memory-mapped execution. mmap(PROT_EXEC) on a file doesn’t trigger FAN_OPEN_EXEC_PERM. Shellcode loaded via mmap is invisible to fanotify. eBPF LSM hooks on security_mmap_file catch this. This is a genuine blind spot: attackers loading shellcode via mmap from a writable+executable anonymous mapping bypass fanotify entirely, though an eBPF LSM hook on security_mmap_addr can catch anonymous executable mappings.
3. Container namespaces. A FAN_MARK_FILESYSTEM mark placed from the host covers all containers sharing that filesystem. eBPF programs can carry namespace context in their map keys, enabling per-container policy without the host-wide blast radius of a fanotify mark.
For a full treatment of eBPF LSM hooks, see eBPF-LSM (lsm_bpf): Kernel Security Policy as Hot-Loadable BPF Programs.
Performance Considerations
Event Queue Sizing
The kernel fanotify queue defaults to 16,384 events. On a busy system, the queue can fill before your daemon drains it. A full queue causes the kernel to either drop events (notification mode) or block the triggering process (permission mode). Check queue depth:
cat /proc/sys/fs/fanotify/max_queued_events # default 16384
# Raise it for high-throughput systems:
sysctl -w fs.fanotify.max_queued_events=65536
Monitor queue saturation:
# Events dropped due to queue overflow appear in kernel log
dmesg | grep fanotify
Avoiding Deadlocks in Permission Event Handlers
The cardinal rule: your permission event handler must never open a file on a filesystem it is monitoring with a permission event mark. If your handler calls open("/usr/lib/libcrypto.so") to initialize OpenSSL, and /usr/lib is marked with FAN_OPEN_PERM, your daemon will generate an event it needs to respond to, while it’s blocked waiting to respond to a different event. Deadlock.
Practical mitigations:
- Pre-load all libraries before installing marks. Call
dlopen()and resolve all symbols before the firstfanotify_mark(). - Use a separate thread for I/O. The event-reading thread writes events to an internal queue; a pool of worker threads handles the scanning and responds. Workers must use
O_PATHopens or work via the event’s own fd. - Exclude your daemon’s PID. In the event loop, check
ev->pid == getpid()and immediatelyFAN_ALLOWthose events. Kernel 5.0+ supportsFAN_MARK_ONLYDIRand ignore masks, but self-generated events are the most common deadlock vector. - Set response timeouts. If your scanner can stall (network lookup, heavy I/O), use a watchdog thread that sends
FAN_ALLOWfor events that haven’t been responded to within N milliseconds. A stalled scanner should fail open (allow), not hang the system.
Thread Pool Architecture
main thread: read(fan_fd, buf) → dispatch to work queue
worker pool: dequeue event → scan → write response to fan_fd
(all workers share fan_fd for writes — this is safe)
Worker count should be tuned to your scan latency, not your core count. If a scan takes 50ms on average and you want to handle 100 concurrent opens, you need ~5 workers minimum. Profile with perf stat on the scanner process and watch for read saturation on fan_fd.
Operational Runbook
# Verify fanotify support in running kernel
grep -r CONFIG_FANOTIFY /boot/config-$(uname -r)
# CONFIG_FANOTIFY=y
# CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y ← required for FAN_OPEN_PERM
# Check active fanotify file descriptors system-wide
find /proc -name fdinfo -type d 2>/dev/null | xargs grep -l fanotify 2>/dev/null
# List marks on a fanotify fd (replace PID and FD)
cat /proc/<pid>/fdinfo/<fd>
# Adjust system-wide limits
sysctl fs.fanotify.max_user_marks # default 8192 per user
sysctl fs.fanotify.max_user_instances # default 128 per user
sysctl fs.fanotify.max_queued_events # default 16384
What This Doesn’t Cover
- NFS and FUSE mounts. fanotify marks on NFS mounts are local-only: you monitor operations from this host, not remote writes. For distributed filesystem integrity, you need agents on each node or a centralized audit service.
- Overlayfs (container layers). Writes to the container layer may not surface the host path as expected. Test your marks against your container runtime before relying on them.
- Audit trail durability. fanotify is a live stream, not a log. If your daemon crashes, events during the outage are gone. For compliance use cases, fanotify should feed a durable sink (auditd, a structured log file, a SIEM). It is not a replacement for auditd’s tamper-evident ring buffer.
fanotify is the right primitive when you need real-time filesystem visibility with blocking capability, and you understand its scope: it covers VFS operations on local filesystems, but not the full syscall surface, container namespaces, or memory-mapped execution. Used alongside auditd and eBPF, it forms a complete picture of filesystem activity on a production Linux host.