Seccomp-BPF for Non-Container Workloads: Syscall Filtering for System Services

Problem

Every Linux process has access to hundreds of syscalls. A web server needs read, write, accept, send, recv, and a handful of others. It does not need ptrace, mount, kexec_load, create_module, or perf_event_open. Yet without seccomp, every syscall is available to the process — and to any attacker who achieves code execution within it.

Seccomp-BPF (seccomp mode 2) allows each process to install a BPF filter that the kernel evaluates for every syscall. If the filter returns SECCOMP_RET_KILL_PROCESS, the process is terminated (SECCOMP_RET_KILL_THREAD, the older SECCOMP_RET_KILL, terminates only the calling thread); SECCOMP_RET_ERRNO makes the syscall fail with a specified error; SECCOMP_RET_ALLOW lets the syscall proceed. The filter runs in-kernel, before the syscall executes, so an attacker cannot bypass it from userspace.

In the container world, seccomp profiles are well-established (Docker’s default profile, Kubernetes seccompProfile field). For system services running directly on Linux hosts — nginx, PostgreSQL, OpenSSH, Prometheus node-exporter, custom daemons — seccomp is rarely applied.

Specific gaps in unmanaged services:

Services compiled with no seccomp filter; full syscall table available on compromise.
systemd units with no SystemCallFilter= directive, despite systemd’s built-in seccomp support since version 198.
No tooling to audit which syscalls a service actually uses at runtime.
Profile development is manual and error-prone; one wrong exclusion breaks the service.
No alerting when a service makes an unusual syscall (pre-filter, audit mode).

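Target systems: Linux kernel 3.17+ (seccomp-BPF itself dates from 3.5; 3.17 added the seccomp(2) syscall and SECCOMP_FILTER_FLAG_TSYNC), with 4.14+ required for SCMP_ACT_KILL_PROCESS and SCMP_ACT_LOG; systemd 198+ (SystemCallFilter, SystemCallArchitectures); libseccomp 2.5+ (C/Go/Python library for profile generation); strace 6.0+, seccomp-tools for auditing.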

Threat Model

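Adversary 1 — RCE in a service daemon: An attacker achieves code execution in nginx via a memory corruption vulnerability. Without seccomp, they call execve to spawn a shell, mmap + mprotect to make shellcode executable, or ptrace to attach to another process. With a strict seccomp filter, execve and ptrace are denied (EPERM or process termination, depending on the filter action) and the exploit chain fails. mmap and mprotect usually have to stay allowed for normal operation, so the filter limits what injected code can do next rather than preventing it from running.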
Adversary 2 — Syscall-level privilege escalation: A known kernel vulnerability (e.g., a perf_event_open or userfaultfd exploit) requires calling a specific syscall from userspace. If that syscall is blocked in the service’s filter, the exploit is not reachable.
Adversary 3 — Container escape via host service: A containerised workload achieves RCE in a host-level service (e.g., through a shared Unix socket). Without seccomp on the host service, the attacker uses it as a proxy to call privileged syscalls.
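Adversary 4 — Loosening the filter from inside the process: An attacker with code execution tries to install a second, permissive filter to override the first. This fails by design: filters stack, the kernel evaluates every installed filter and applies the most restrictive result, so a new filter can only tighten restrictions. Filters are inherited across fork and execve, and SECCOMP_FILTER_FLAG_TSYNC synchronises a filter across threads but cannot remove one. The no_new_privs bit, which an unprivileged process must set before installing a filter, stops execve from granting privileges through setuid or file-capability binaries.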
Access level: Adversaries 1–3 have process-level code execution in the targeted service. Adversary 4 has process-level execution and attempts syscall-level escalation.
Objective: Escape the service’s privilege context, pivot to the host, execute arbitrary code outside the service’s intended scope.
Blast radius: Without seccomp, RCE in a service = access to the full Linux syscall table = many escalation paths. With a strict profile, the attacker is limited to syscalls the service legitimately uses — typically no execve, no ptrace, no mount.
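
The stacking rule behind Adversary 4 can be sketched in a few lines of Go. This is a simplified model, not kernel code: the `Filter` type and `Effective` function are hypothetical names, and a real filter is a BPF program rather than a map. The precedence order is the one documented in seccomp(2).

```go
package main

import "fmt"

// Action models a seccomp return action. The constants are ordered by the
// precedence documented in seccomp(2): a lower value wins over a higher one.
type Action int

const (
	KillProcess Action = iota
	KillThread
	Trap
	Errno
	UserNotif
	Trace
	Log
	Allow
)

var actionNames = [...]string{
	"KILL_PROCESS", "KILL_THREAD", "TRAP", "ERRNO",
	"USER_NOTIF", "TRACE", "LOG", "ALLOW",
}

func (a Action) String() string { return actionNames[a] }

// Filter is a simplified stand-in for one installed BPF program:
// per-syscall actions plus a default for anything unlisted.
type Filter struct {
	Default Action
	Rules   map[string]Action
}

func (f Filter) eval(syscall string) Action {
	if a, ok := f.Rules[syscall]; ok {
		return a
	}
	return f.Default
}

// Effective evaluates every filter in the stack and returns the
// highest-precedence (most restrictive) action among the results.
func Effective(stack []Filter, syscall string) Action {
	result := Allow
	for _, f := range stack {
		if a := f.eval(syscall); a < result {
			result = a
		}
	}
	return result
}

func main() {
	strict := Filter{Default: KillProcess, Rules: map[string]Action{"read": Allow, "write": Allow}}
	// A filter an attacker adds later in the hope of allowing everything.
	permissive := Filter{Default: Allow}

	stack := []Filter{strict, permissive}
	fmt.Println("read:", Effective(stack, "read"))
	fmt.Println("execve:", Effective(stack, "execve"))
}
```

In this model the permissive filter changes nothing: `execve` still resolves to the strict filter's kill action, which is why installing a second filter cannot loosen the first.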

Configuration

Step 1: Audit Syscalls with strace

Before writing a profile, record which syscalls the service actually makes at runtime:

# Trace a running service by PID. The quoted, comma-separated list lets strace
# attach to every matching process (master and workers).
strace -f -p "$(pgrep -d, -x nginx)" -o /tmp/nginx-syscalls.txt &
# Exercise the service: run typical workloads, startup, reload.
# Stop after a representative window.
kill %1

# Extract the unique syscall names. With -f, each line starts with a PID, so skip it.
sed -nE 's/^([0-9]+ +)?([a-z_][a-z0-9_]*)\(.*/\2/p' /tmp/nginx-syscalls.txt | sort -u

# Or attach to the service's main PID. -f follows only children forked after
# attaching, so reload the service to capture freshly started workers.
strace -f -p "$(systemctl show nginx --property=MainPID --value)" \
  -o /tmp/nginx-strace.txt &
# ... run workloads ...
kill %1
sed -nE 's/^([0-9]+ +)?([a-z_][a-z0-9_]*)\(.*/\2/p' /tmp/nginx-strace.txt \
  | sort -u > /tmp/nginx-allowed-syscalls.txt
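
Where the extraction needs to run inside other tooling, the same parsing can be sketched in Go. This is a minimal sketch that assumes strace's default text output; `UniqueSyscalls` is a hypothetical helper name and the sample trace lines are illustrative, not captured from a real service.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// syscallLine matches the start of an strace line. It tolerates the PID
// prefix added by -f: "1234 read(" when writing with -o, or
// "[pid 1234] read(" when writing to stderr.
var syscallLine = regexp.MustCompile(`^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(`)

// UniqueSyscalls returns the sorted set of syscall names found in strace
// output. "<... resumed>" continuations, signal notices and exit lines do
// not match the pattern and are skipped.
func UniqueSyscalls(straceOutput string) []string {
	seen := map[string]bool{}
	for _, line := range strings.Split(straceOutput, "\n") {
		if m := syscallLine.FindStringSubmatch(line); m != nil {
			seen[m[1]] = true
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Illustrative sample in the shape strace -f -o produces.
	sample := `1234 epoll_wait(8, [], 512, 1000) = 0
1235 accept4(6, {sa_family=AF_INET}, [16], SOCK_NONBLOCK) = 3
1235 recvfrom(3,  <unfinished ...>
1235 <... recvfrom resumed>"GET / HTTP/1.1\r\n", 1024, 0, NULL, NULL) = 16
[pid  1235] writev(3, [{iov_base="HTTP/1.1 200 OK\r\n", iov_len=17}], 1) = 17
1235 --- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++`
	for _, name := range UniqueSyscalls(sample) {
		fmt.Println(name)
	}
}
```

Matching only lines that begin with a syscall name keeps signal notices and resumed-call fragments out of the allowlist.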

Use seccomp-tools for more targeted analysis:

# Install seccomp-tools (Ruby gem).
gem install seccomp-tools

# If the service already has a seccomp filter, dump it (needs root).
# pgrep -o selects the oldest matching process, normally the nginx master.
seccomp-tools dump -p "$(pgrep -o nginx)"

# Disassemble a raw BPF filter.
seccomp-tools disasm /tmp/filter.bpf

Step 2: Generate a Profile with libseccomp

Use libseccomp to build a profile programmatically rather than writing raw BPF:

/* nginx-seccomp.c — generate a seccomp profile for nginx worker processes */
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    scmp_filter_ctx ctx;

    /* Default action: kill the process on any unlisted syscall. */
    ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx) { perror("seccomp_init"); return 1; }

    /* Allow each syscall nginx workers legitimately need. */
    const int allowed[] = {
        SCMP_SYS(read), SCMP_SYS(write), SCMP_SYS(writev),
        SCMP_SYS(open), SCMP_SYS(openat), SCMP_SYS(close),
        SCMP_SYS(stat), SCMP_SYS(fstat), SCMP_SYS(lstat),
        SCMP_SYS(poll), SCMP_SYS(select), SCMP_SYS(epoll_wait),
        SCMP_SYS(epoll_ctl), SCMP_SYS(epoll_create1),
        SCMP_SYS(accept4), SCMP_SYS(accept),
        SCMP_SYS(recv), SCMP_SYS(recvfrom), SCMP_SYS(recvmsg),
        SCMP_SYS(send), SCMP_SYS(sendto), SCMP_SYS(sendmsg),
        SCMP_SYS(socket), SCMP_SYS(connect), SCMP_SYS(bind),
        SCMP_SYS(listen), SCMP_SYS(getsockopt), SCMP_SYS(setsockopt),
        SCMP_SYS(mmap), SCMP_SYS(mprotect), SCMP_SYS(munmap),
        SCMP_SYS(brk), SCMP_SYS(mremap),
        SCMP_SYS(futex), SCMP_SYS(nanosleep),
        SCMP_SYS(getpid), SCMP_SYS(getuid), SCMP_SYS(geteuid),
        SCMP_SYS(getgid), SCMP_SYS(getegid),
        SCMP_SYS(setuid), SCMP_SYS(setgid),   /* for worker process drop */
        SCMP_SYS(prctl),                        /* for no_new_privs */
        SCMP_SYS(exit), SCMP_SYS(exit_group),
        SCMP_SYS(rt_sigaction), SCMP_SYS(rt_sigreturn),
        SCMP_SYS(rt_sigprocmask), SCMP_SYS(sigaltstack),
        SCMP_SYS(getcwd), SCMP_SYS(chdir),
        SCMP_SYS(ioctl),
        /* Explicitly NOT included: execve, ptrace, mount, kexec_load */
    };

    for (size_t i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++) {
        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0) < 0) {
            fprintf(stderr, "Failed to add rule for syscall %d\n", allowed[i]);
            seccomp_release(ctx);
            return 1;
        }
    }

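    /* Export as a raw BPF program for a custom loader. systemd cannot load
       raw BPF files; use SystemCallFilter= (Step 3) for unit-level filtering. */
    FILE *f = fopen("/etc/seccomp/nginx-worker.bpf", "wb");
    if (!f) { perror("fopen"); seccomp_release(ctx); return 1; }
    if (seccomp_export_bpf(ctx, fileno(f)) < 0) {
        fprintf(stderr, "seccomp_export_bpf failed\n");
        fclose(f); seccomp_release(ctx);
        return 1;
    }
    fclose(f);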

    seccomp_release(ctx);
    return 0;
}

gcc -o gen-nginx-seccomp nginx-seccomp.c -lseccomp
./gen-nginx-seccomp
# Produces /etc/seccomp/nginx-worker.bpf
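
Before handing an exported filter to a loader, it can help to confirm which return actions it contains. The sketch below decodes the raw `struct sock_filter` array (8 bytes per instruction) and counts the constant return values. It assumes a little-endian host such as x86_64, `ReturnActions` is a hypothetical helper name, and the built-in sample is hand-assembled for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	bpfRetK = 0x06 // BPF_RET | BPF_K: return the constant in field k
	insnLen = 8    // sizeof(struct sock_filter): u16 code, u8 jt, u8 jf, u32 k
)

// Action values from <linux/seccomp.h>; the low 16 bits of k carry data
// such as the errno, so they are masked off before the lookup.
var actionNames = map[uint32]string{
	0x80000000: "KILL_PROCESS",
	0x00000000: "KILL_THREAD",
	0x00030000: "TRAP",
	0x00050000: "ERRNO",
	0x7fc00000: "USER_NOTIF",
	0x7ff00000: "TRACE",
	0x7ffc0000: "LOG",
	0x7fff0000: "ALLOW",
}

// ReturnActions counts how often each seccomp action is returned by the
// BPF_RET|BPF_K instructions of a raw filter. Assumes little-endian layout.
func ReturnActions(raw []byte) (map[string]int, error) {
	if len(raw) == 0 || len(raw)%insnLen != 0 {
		return nil, fmt.Errorf("length %d is not a positive multiple of %d", len(raw), insnLen)
	}
	counts := map[string]int{}
	for off := 0; off < len(raw); off += insnLen {
		code := binary.LittleEndian.Uint16(raw[off:])
		k := binary.LittleEndian.Uint32(raw[off+4:])
		if code != bpfRetK {
			continue
		}
		name, ok := actionNames[k&0xffff0000]
		if !ok {
			name = fmt.Sprintf("UNKNOWN(0x%08x)", k)
		}
		counts[name]++
	}
	return counts, nil
}

func main() {
	// Hand-assembled sample: "ret ALLOW" followed by "ret KILL_PROCESS".
	raw := []byte{
		0x06, 0, 0, 0, 0x00, 0x00, 0xff, 0x7f,
		0x06, 0, 0, 0, 0x00, 0x00, 0x00, 0x80,
	}
	if len(os.Args) == 2 { // optional: path to an exported filter
		data, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		raw = data
	}
	counts, err := ReturnActions(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for name, n := range counts {
		fmt.Printf("%-14s %d\n", name, n)
	}
}
```

A filter built with a kill default should show at least one KILL_PROCESS return; a filter showing only ALLOW returns would indicate a generation mistake. For a full listing, `seccomp-tools disasm` remains the better tool.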

For a quick starting point, turn the syscall list extracted from strace into a YAML list that can be pasted into a profile. Trace-based generators such as oci-seccomp-bpf-hook automate this step, but they target OCI containers rather than host services:

# Generate a YAML list from the strace-derived syscall names.
sed 's/^/  - /' /tmp/nginx-allowed-syscalls.txt > /etc/seccomp/nginx-syscalls.yaml

Step 3: Apply via systemd SystemCallFilter

systemd’s SystemCallFilter= applies a seccomp filter to the service without modifying the service binary:

# /etc/systemd/system/nginx.service.d/seccomp.conf
[Service]
# Lock to the native syscall ABI (prevents 32-bit syscall bypass on 64-bit kernels).
SystemCallArchitectures=native

# Allow only the listed syscall groups.
# systemd provides named groups (@network-io, @file-system, @basic-io, etc.).
# @system-service already includes @network-io and @file-system; they are
# listed here only for readability.
SystemCallFilter=@system-service
SystemCallFilter=@network-io
SystemCallFilter=@file-system

# Additionally remove specific high-risk groups from the allowlist.
# systemd does not support trailing comments, so each note sits on its own line.
# @debug: ptrace, perf_event_open, etc.
SystemCallFilter=~@debug
# @mount: mount, umount2, pivot_root.
SystemCallFilter=~@mount
# @module: init_module, finit_module, delete_module.
SystemCallFilter=~@module
# @reboot: reboot, kexec_load.
SystemCallFilter=~@reboot
# @privileged: chown, setuid, setgid, capset, etc. Omit this line if the
# service starts as root and drops privileges itself (a typical nginx master does).
SystemCallFilter=~@privileged
# @raw-io: iopl, ioperm, PCI configuration access.
SystemCallFilter=~@raw-io
# @cpu-emulation: modify_ldt, vm86.
SystemCallFilter=~@cpu-emulation
# @obsolete: _sysctl, create_module, etc.
SystemCallFilter=~@obsolete

# Return EPERM instead of killing the process (less disruptive while debugging).
# Without this line a violation kills the process; remove it once the profile is stable.
SystemCallErrorNumber=EPERM

systemctl daemon-reload
systemctl restart nginx

# Test that nginx still works.
curl -s -o /dev/null -w '%{http_code}' http://localhost/
# Expected: 200

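# Check whether any syscalls are being blocked. Denials that return EPERM are
# not logged by the kernel by default, so also look for "Operation not
# permitted" errors in the service's own log.
journalctl -u nginx --since "1 minute ago" | grep -iE 'seccomp|not permitted'
# Kill and log actions produce kernel audit records of type 1326.
dmesg | grep 'type=1326'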

The systemd syscall groups (prefixed @) are maintained and updated with each systemd release, covering well-known categories without requiring manual syscall enumeration. View available groups:

systemd-analyze syscall-filter
# Lists all available @ groups and their member syscalls.
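
When the named groups are too coarse, the strace-derived list from Step 1 can be rendered into an explicit drop-in. The following is a minimal sketch: `RenderDropIn` and its `perLine` parameter are hypothetical, and the output relies on two documented behaviours of `SystemCallFilter=` (an empty assignment resets earlier settings, and repeated assignments are merged).

```go
package main

import (
	"fmt"
	"strings"
)

// RenderDropIn builds a systemd drop-in that allows exactly the given
// syscalls. Names are wrapped perLine to a line to keep the file readable.
func RenderDropIn(syscalls []string, perLine int) string {
	if perLine < 1 {
		perLine = 1
	}
	var b strings.Builder
	b.WriteString("[Service]\n")
	b.WriteString("SystemCallArchitectures=native\n")
	// An empty assignment resets any filter set by the main unit file, so
	// the first real line below defines the list as an allowlist.
	b.WriteString("SystemCallFilter=\n")
	for i := 0; i < len(syscalls); i += perLine {
		end := i + perLine
		if end > len(syscalls) {
			end = len(syscalls)
		}
		b.WriteString("SystemCallFilter=" + strings.Join(syscalls[i:end], " ") + "\n")
	}
	// Fail the call instead of killing the process while validating.
	b.WriteString("SystemCallErrorNumber=EPERM\n")
	return b.String()
}

func main() {
	// Illustrative subset; in practice read /tmp/nginx-allowed-syscalls.txt.
	names := []string{"accept4", "close", "epoll_wait", "read", "recvfrom", "writev"}
	fmt.Print(RenderDropIn(names, 4))
}
```

The output can be written to a file under /etc/systemd/system/nginx.service.d/ and applied with the same daemon-reload and restart sequence shown above.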

Step 4: Apply Directly in Code with libseccomp (Go)

For custom daemons, apply the filter from within the process itself:

package main

import (
    "fmt"
    "log"

    libseccomp "github.com/seccomp/libseccomp-golang"
)

func installSeccompFilter() error {
    // Default action: kill the process.
    filter, err := libseccomp.NewFilter(libseccomp.ActKillProcess)
    if err != nil {
        return err
    }
    defer filter.Release()

    // Add allowed syscalls.
    allowed := []string{
        "read", "write", "close", "fstat", "mmap", "mprotect", "munmap",
        "brk", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "poll", "lseek", "pread64", "pwrite64", "readv", "writev",
        "access", "pipe", "select", "sched_yield", "mremap", "msync",
        "mincore", "madvise", "dup", "dup2", "pause", "nanosleep",
        "getitimer", "alarm", "setitimer", "getpid", "socket",
        "connect", "accept", "sendto", "recvfrom", "sendmsg", "recvmsg",
        "bind", "listen", "getsockname", "getpeername", "getsockopt",
        "setsockopt", "clone", "fork", "vfork", "execve",  // execve needed for startup only
        "exit", "exit_group", "futex", "epoll_create", "epoll_ctl", "epoll_wait",
        "openat", "newfstatat", "getdents64", "fcntl",
        "getuid", "geteuid", "getgid", "getegid",
        "set_robust_list", "get_robust_list", "prctl",
        "arch_prctl", "setrlimit", "getrlimit", "sigaltstack",
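        // The Go runtime and net package commonly need these too on linux/amd64; confirm with strace.
        "epoll_create1", "epoll_pwait", "accept4", "gettid", "tgkill", "getrandom",
        "seccomp", // lets a later, tighter filter be loaded
    }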

    for _, sc := range allowed {
        syscallID, err := libseccomp.GetSyscallFromName(sc)
        if err != nil {
            return fmt.Errorf("unknown syscall %s: %w", sc, err)
        }
        if err := filter.AddRule(syscallID, libseccomp.ActAllow); err != nil {
            return fmt.Errorf("add rule %s: %w", sc, err)
        }
    }

    // Load the filter into the kernel.
    return filter.Load()
}

func main() {
    // Install the filter early — before accepting connections.
    if err := installSeccompFilter(); err != nil {
        log.Fatalf("seccomp filter install failed: %v", err)
    }
    // ... start serving ...
}

For processes that need to drop execve after startup (most daemons):

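// After startup is complete (workers forked, sockets bound), tighten the filter.
// Filters stack and the most restrictive result wins, so this cannot loosen the
// first filter; the first filter must allow seccomp(2) for this load to succeed.
func tightenFilterPostStartup() error {
    filter, err := libseccomp.NewFilter(libseccomp.ActAllow) // first filter still applies
    if err != nil {
        return err
    }
    defer filter.Release()
    execve, _ := libseccomp.GetSyscallFromName("execve") // well-known name; error ignored
    if err := filter.AddRule(execve, libseccomp.ActKillProcess); err != nil {
        return err
    }
    return filter.Load()
}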

Step 5: Audit Mode Before Enforcement

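Start with a log-only action. In libseccomp that is SCMP_ACT_LOG (kernel 4.14+), which records the syscall and lets it proceed. In systemd, SystemCallErrorNumber=EPERM is not an audit mode: the call is still denied, only without killing the process. For log-only behaviour use SystemCallLog= (systemd 247+):

# systemd audit mode: log out-of-profile syscalls but let them proceed.
[Service]
# The leading ~ inverts the list, so every syscall outside these groups is logged.
SystemCallLog=~@system-service @network-io @file-system
# Optional soft enforcement alongside logging. A single leading ~ applies to
# the whole list; do not repeat it per group.
SystemCallFilter=~@debug @mount @module @reboot
# Return an error for denied calls instead of killing the process.
SystemCallErrorNumber=EPERM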

Monitor for blocked syscalls:

# Watch auditd for seccomp events. The kernel emits SECCOMP records (type 1326)
# without any audit rule; a rule with -S all would instead log every syscall on the host.
ausearch -m SECCOMP --start today -i | grep -o 'syscall=[^ ]*' | sort | uniq -c | sort -rn

# Or watch the kernel log directly (records land here when auditd is not running).
dmesg -w | grep 'type=1326'
# Format: audit: type=1326 audit(timestamp:serial): auid=... syscall=X ...

For each blocked syscall, decide: add to the allowlist (it’s legitimate), or confirm it’s blocked correctly (it’s an exploit attempt). After a week with no unexpected blocks, switch to enforcement.
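
The per-syscall review can be supported by a small tally of the audit records. This is a minimal sketch that assumes the record layout shown in the format comment above (space-separated key=value fields); `CountSeccompEvents` is a hypothetical helper, the sample records are illustrative, and a `comm` value containing spaces would need a real audit parser.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// field returns the value of key=value in one audit record, without quotes.
func field(record, key string) string {
	for _, tok := range strings.Fields(record) {
		if strings.HasPrefix(tok, key+"=") {
			return strings.Trim(strings.TrimPrefix(tok, key+"="), `"`)
		}
	}
	return ""
}

// CountSeccompEvents tallies seccomp audit records (type 1326, shown as
// type=SECCOMP by auditd) by command name and raw syscall number.
func CountSeccompEvents(log string) map[string]int {
	counts := map[string]int{}
	for _, line := range strings.Split(log, "\n") {
		if !strings.Contains(line, "type=1326") && !strings.Contains(line, "type=SECCOMP") {
			continue
		}
		nr := field(line, "syscall")
		if nr == "" {
			continue
		}
		counts[field(line, "comm")+" syscall="+nr]++
	}
	return counts
}

func main() {
	// Illustrative records; on x86_64, 59 is execve and 101 is ptrace.
	sample := `audit: type=1326 audit(1700000000.123:456): auid=4294967295 uid=33 gid=33 ses=4294967295 pid=1234 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=59 compat=0 ip=0x7f0000000000 code=0x7ffc0000
audit: type=1326 audit(1700000001.456:457): auid=4294967295 uid=33 gid=33 ses=4294967295 pid=1234 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=59 compat=0 ip=0x7f0000000000 code=0x7ffc0000
audit: type=1326 audit(1700000002.789:458): auid=4294967295 uid=33 gid=33 ses=4294967295 pid=1234 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=101 compat=0 ip=0x7f0000000000 code=0x7ffc0000`
	counts := CountSeccompEvents(sample)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%4d  %s\n", counts[k], k)
	}
}
```

Grouping by command and syscall number gives one row per decision to make: add the syscall to the allowlist, or confirm that it should stay blocked.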

Step 6: Architecture Pinning

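On 64-bit kernels, 32-bit syscalls have different numbers: execve is 59 on x86_64 but 11 on i386, and 11 on x86_64 is munmap. A filter that checks only the syscall number can therefore be bypassed through the 32-bit compatibility ABI. Filters generated by libseccomp check the architecture and reject foreign ABIs by default; hand-written BPF must do this explicitly. Pin to native: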

# systemd.
SystemCallArchitectures=native

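# Or in libseccomp: confirm the filter contains only the native architecture.

/* In libseccomp C: seccomp_init() adds only the native architecture, so these
   removals matter only if earlier code called seccomp_arch_add(). */
if (seccomp_arch_exist(ctx, SCMP_ARCH_X86) == 0)
    seccomp_arch_remove(ctx, SCMP_ARCH_X86);      /* Remove 32-bit x86 */
if (seccomp_arch_exist(ctx, SCMP_ARCH_X32) == 0)
    seccomp_arch_remove(ctx, SCMP_ARCH_X32);      /* Remove x32 ABI */
/* Keep only SCMP_ARCH_X86_64 (the native default on a 64-bit x86 build) */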
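
The bypass that architecture pinning closes can be shown with a toy model of two filters, one that looks only at the syscall number and one that checks the architecture first. The function names are hypothetical. The numbers come from the x86 syscall tables, and the architecture constants are the AUDIT_ARCH values the kernel passes to the filter.

```go
package main

import "fmt"

const (
	archX8664 uint32 = 0xC000003E // AUDIT_ARCH_X86_64
	archI386  uint32 = 0x40000003 // AUDIT_ARCH_I386
)

// Syscall numbers the filter author allowed, written with x86_64 in mind.
var allowedX8664 = map[uint32]string{0: "read", 1: "write", 11: "munmap", 60: "exit"}

// What some of the same numbers mean when entered through the i386 ABI.
var i386Names = map[uint32]string{1: "exit", 3: "read", 4: "write", 11: "execve"}

// naiveAllows ignores the architecture, as a hand-written filter without an
// arch check would.
func naiveAllows(arch, nr uint32) bool {
	_, ok := allowedX8664[nr]
	return ok
}

// pinnedAllows rejects every ABI except x86_64 before consulting the
// allowlist, which is the effect of SystemCallArchitectures=native.
func pinnedAllows(arch, nr uint32) bool {
	if arch != archX8664 {
		return false
	}
	_, ok := allowedX8664[nr]
	return ok
}

func main() {
	// The attacker enters through the 32-bit ABI with syscall number 11.
	const nr = 11
	fmt.Printf("x86_64 nr %d = %s, i386 nr %d = %s\n", nr, allowedX8664[nr], nr, i386Names[nr])
	fmt.Println("naive filter allows i386 call: ", naiveAllows(archI386, nr))
	fmt.Println("pinned filter allows i386 call:", pinnedAllows(archI386, nr))
}
```

The naive filter lets the call through because it believes number 11 is munmap, while the kernel dispatches it as the i386 execve.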

Step 7: Telemetry

seccomp_violation_total{service, syscall}         counter
seccomp_filter_installed_total{service}           counter
seccomp_audit_events_total{service}               counter
service_syscall_count{service, syscall}           counter (from audit mode)

Alert on:

seccomp_violation_total non-zero for a production service in enforcement mode — either a bug in the profile (service regression) or an active exploit attempt. Treat as high priority until root-caused.
Any syscall from the @debug or @module groups appearing in audit logs for a production service — these are never expected from normal operation.
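
A minimal sketch of how the violation counter and the second alert rule might be produced from tallied audit events. The metric name and labels come from the list above; `Key`, `Exposition` and `IsHighRisk` are hypothetical names, the `highRisk` set is only a subset of systemd's @debug and @module groups, and the output uses the Prometheus text exposition format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Key identifies one counter series by the labels used in the metric list.
type Key struct{ Service, Syscall string }

// highRisk holds a subset of the @debug and @module groups; extend it from
// the output of "systemd-analyze syscall-filter".
var highRisk = map[string]bool{
	"ptrace": true, "perf_event_open": true,
	"init_module": true, "finit_module": true, "delete_module": true,
}

// IsHighRisk reports whether a syscall should raise a high-priority alert.
func IsHighRisk(syscall string) bool { return highRisk[syscall] }

// Exposition renders the counters as seccomp_violation_total series in
// Prometheus text format, sorted so the output is stable.
func Exposition(counts map[Key]int) string {
	keys := make([]Key, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Service != keys[j].Service {
			return keys[i].Service < keys[j].Service
		}
		return keys[i].Syscall < keys[j].Syscall
	})
	var b strings.Builder
	b.WriteString("# TYPE seccomp_violation_total counter\n")
	for _, k := range keys {
		fmt.Fprintf(&b, "seccomp_violation_total{service=%q,syscall=%q} %d\n",
			k.Service, k.Syscall, counts[k])
	}
	return b.String()
}

func main() {
	// Illustrative counts, e.g. produced from audit records.
	counts := map[Key]int{
		{Service: "nginx", Syscall: "execve"}: 3,
		{Service: "nginx", Syscall: "ptrace"}: 1,
	}
	fmt.Print(Exposition(counts))
	for k := range counts {
		if IsHighRisk(k.Syscall) {
			fmt.Printf("ALERT high priority: %s called %s\n", k.Service, k.Syscall)
		}
	}
}
```

In a real deployment the alert rule would normally live in the monitoring system; the in-process check here only illustrates the classification.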

Expected Behaviour

Signal	No seccomp	Audit mode	Enforcement mode
`execve` from compromised nginx	Succeeds — shell spawned	Logged; process continues	Process killed immediately
`ptrace` attach to other process	Succeeds	Logged	Process killed
Service functionality	Normal	Normal	Normal (if profile is correct)
Kernel exploit via `perf_event_open`	Reachable	Logged	Blocked; syscall returns `EPERM`
Profile bug: missing required syscall	N/A	Service logs error; continues	Service fails to start or errors at runtime

Trade-offs

Aspect	Benefit	Cost	Mitigation
`SystemCallFilter` in systemd	No code changes required	systemd groups are coarse; may allow more than needed	Combine groups with explicit `~@debug` exclusions for precision.
`SCMP_ACT_KILL_PROCESS`	Immediate termination on violation	Profile bugs crash the service	Start with `EPERM` mode; move to kill after validation.
Architecture pinning	Closes 32-bit ABI bypass	Some services need compatibility mode	Almost no production service needs 32-bit compat; enable unless you’ve confirmed a dependency.
libseccomp in-process	Tightest possible control; no intermediary	Service code must call the API; compile-time dependency	Wrap in a thin init function; add to supervisor process.
Audit-first deployment	Safe discovery of required syscalls	Delay before enforcement; two deployment phases	Worth the delay; rushing to enforcement with an incomplete profile breaks the service.

Failure Modes

Failure	Symptom	Detection	Recovery
Missing syscall in profile	Service fails at startup or on specific operation	`SECCOMP` audit events; service error logs	Add the missing syscall to the allowlist; reload.
Profile applied before process finishes setup	Startup fails (e.g., `execve` needed for fork-exec)	Immediate crash on service start	Allow `execve` in the startup phase; add a second tighter filter post-fork.
32-bit compat syscall blocked unexpectedly	Java or older binary fails	Application errors; audit log shows a non-native `arch` value	Identify which binary needs compat; either add its ABI to `SystemCallArchitectures=` (e.g. `native x86`) or recompile as native 64-bit.
Kernel version mismatch	`SCMP_ACT_KILL_PROCESS` not available on old kernels	Service refuses to start	Fall back to `SCMP_ACT_KILL` (kills thread, not process) on kernels < 4.14.
Profile not applied after package update	Service binary updated; wrapper not re-applied	Audit shows no seccomp events	Tie profile application to the service unit; it reapplies on every restart automatically with systemd.
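
The kernel version fallback in the table above can be sketched as a small version check. `DefaultAction` and `ParseKernelRelease` are hypothetical helpers. The 4.14 threshold is the one stated in the table; distribution kernels sometimes backport features, so a version comparison is only a heuristic.

```go
package main

import "fmt"

// ParseKernelRelease extracts major and minor from a release string such as
// "5.15.0-91-generic" (the format of /proc/sys/kernel/osrelease).
func ParseKernelRelease(release string) (major, minor int, err error) {
	if _, err = fmt.Sscanf(release, "%d.%d", &major, &minor); err != nil {
		return 0, 0, fmt.Errorf("parse kernel release %q: %w", release, err)
	}
	return major, minor, nil
}

// DefaultAction names the libseccomp default action to use: kill the whole
// process on 4.14 and later, fall back to killing the thread on older kernels.
func DefaultAction(release string) (string, error) {
	major, minor, err := ParseKernelRelease(release)
	if err != nil {
		return "", err
	}
	if major > 4 || (major == 4 && minor >= 14) {
		return "SCMP_ACT_KILL_PROCESS", nil
	}
	return "SCMP_ACT_KILL", nil
}

func main() {
	// Illustrative release strings.
	for _, rel := range []string{"5.15.0-91-generic", "4.14.1", "3.17.0"} {
		action, err := DefaultAction(rel)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-20s %s\n", rel, action)
	}
}
```

A service that cannot determine the kernel version should fail closed and refuse to start rather than run without a filter.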