Seccomp as a Shared Kernel Attack Surface Limiter: Building Minimal Syscall Profiles

The Problem

Linux has approximately 400 syscalls on x86_64. The kernel code paths behind each of those syscalls represent attack surface reachable from any container running on the host. This is the fundamental problem with container isolation: all containers on a host share the same kernel. A container process that reaches a vulnerable kernel code path via a syscall can exploit that vulnerability to escape the container, escalate privileges on the host, or affect other tenants. The container runtime’s namespace isolation does not protect against this — CLONE_NEWPID, CLONE_NEWNET, and CLONE_NEWNS all create separate views of kernel resources, but the kernel itself is shared, and every syscall is a direct request to that shared kernel.

Docker’s default seccomp profile denies approximately 50 syscalls, including kexec_load, reboot, mount, and create_module, and permits others (bpf, perf_event_open, unshare) only when the container holds the matching capability. This is a useful baseline. It prevents the most obvious container-to-host vectors like loading kernel modules or rebooting the host. But it does not close every kernel attack surface behind the most exploited CVEs of the past five years: splice and AF_NETLINK sockets stay reachable because many containerized applications legitimately need them, or appear to need them, and the profile protects only where it is actually applied. Kubernetes pods run Unconfined unless RuntimeDefault is requested or enabled as the node default.

The five syscall categories that matter most:

splice() / tee() / vmsplice(): The zero-copy data movement syscalls. CVE-2022-0847 (Dirty Pipe) uses splice() to place a page-cache page into a pipe buffer whose PIPE_BUF_FLAG_CAN_MERGE flag was left set by earlier writes, followed by a write() to that pipe. The write is merged into the page-cache page, so the exploit overwrites data in read-only file-backed pages. Docker’s default profile allows splice. Most web servers and API services never call splice directly. The glibc sendfile wrapper uses sendfile(), not splice. Most application code that processes HTTP requests has no need for splice.

perf_event_open(): The performance profiling syscall that exposes kernel internals for measurement. CVE-2023-2235 exploited a use-after-free in perf_group_detach() reachable via perf_event_open(). CVE-2022-1729 exploited a race condition in sys_perf_event_open(). CVE-2020-25704 was a memory leak in perf_event_parse_addr_filter(), reachable through an ioctl on a perf event file descriptor. Docker’s default profile denies perf_event_open unless the container holds CAP_PERFMON or CAP_SYS_ADMIN, but the syscall is reachable in any container running Unconfined or under a permissive custom profile. Web servers, databases, message queues, and microservices do not normally call perf_event_open in production. For those workloads the syscall is pure attack surface.

socket(AF_NETLINK): Netlink sockets are the kernel’s primary interface for configuring kernel subsystems from userspace: network routing, firewall rules, netfilter tables, network device configuration. CVE-2022-1015 and CVE-2022-1016 both required loading nftables rules via AF_NETLINK sockets (domain=16). CVE-2022-1015 was an out-of-bounds write in the nftables nft_validate_register_store() function. CVE-2022-1016 was a stack data disclosure. Both were exploited together in public PoC chains within two weeks of disclosure. Docker’s default profile allows socket with any address family. Most containerized applications do not configure network interfaces or firewall rules.

bpf(): The eBPF syscall. CVE-2021-3490 exploited the eBPF verifier’s handling of 32-bit register bounds to allow out-of-bounds reads and writes. CVE-2022-23222 exploited a type confusion in the verifier’s pointer arithmetic tracking. CVE-2023-2163 was a verifier logic error that allowed proving false pointer arithmetic facts. All three have public PoC exploits that achieve privilege escalation from unprivileged container to host root. Docker’s default profile denies bpf unless the container holds CAP_BPF or CAP_SYS_ADMIN; an Unconfined container can reach it. Ordinary application workloads do not call bpf() directly.

unshare(CLONE_NEWUSER): Creates user namespaces, which grant the calling process CAP_SYS_ADMIN and other capabilities inside the new namespace. In combination with socket(AF_NETLINK), user namespaces allow unprivileged processes to invoke nftables operations that require CAP_NET_ADMIN, because that capability is held inside the user namespace. CVE-2023-32233 required exactly this: unshare(CLONE_NEWUSER) followed by socket(AF_NETLINK) to create a netlink socket for nftables configuration, followed by sending crafted nftables batch operations that triggered a use-after-free in the anonymous set handling code. Docker’s default profile denies unshare unless the container holds CAP_SYS_ADMIN; an Unconfined container can call it with any flags.

The common thread: all five categories require the attacker to make a specific syscall. Because seccomp is evaluated before the syscall reaches its kernel handler, the vulnerable code path is never executed and the exploit cannot work. For that specific exploit path the mitigation is complete, even on an unpatched kernel. It does not fix the underlying bug, which may still be reachable through other entry points.

How Seccomp Works

Seccomp (Secure Computing Mode) filters are BPF programs attached to a thread and run on the syscall entry path. When a process calls seccomp(SECCOMP_SET_MODE_FILTER, ...), a BPF program is loaded into the kernel and associated with the calling thread (and with the other threads of the same process if SECCOMP_FILTER_FLAG_TSYNC is set). Children created by fork() or clone() inherit the filter, and it persists across execve(). Every subsequent syscall made by a filtered thread runs the BPF program before the kernel dispatches the syscall handler.

The BPF program receives a seccomp_data struct containing the syscall number, architecture, instruction pointer, and first six syscall arguments. It returns one of several actions:

SCMP_ACT_ALLOW (0x7fff0000): permit the syscall, continue to the kernel handler
SCMP_ACT_ERRNO(n) (0x00050000 | n): block the syscall, return errno n to the caller
SCMP_ACT_KILL_PROCESS (0x80000000): terminate the process
SCMP_ACT_KILL_THREAD (0x00000000): terminate the thread
SCMP_ACT_LOG (0x7ffc0000): permit the syscall but log it via the audit subsystem
SCMP_ACT_TRACE (0x7ff00000 | msg): notify an attached ptrace tracer
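
As a minimal sketch of how these return values are laid out: the upper 16 bits select the action and the lower 16 bits carry data (the errno for SCMP_ACT_ERRNO). The helper name decode_seccomp_ret is hypothetical; the constants are the ones listed above, and the same decoding applies to the code= field in seccomp audit records.

```python
# Action values as listed above (upper 16 bits of the filter return value).
ACTIONS = {
    0x7FFF0000: "SCMP_ACT_ALLOW",
    0x00050000: "SCMP_ACT_ERRNO",
    0x80000000: "SCMP_ACT_KILL_PROCESS",
    0x00000000: "SCMP_ACT_KILL_THREAD",
    0x7FFC0000: "SCMP_ACT_LOG",
    0x7FF00000: "SCMP_ACT_TRACE",
}

ACTION_MASK = 0xFFFF0000  # action bits
DATA_MASK = 0x0000FFFF    # data bits (errno, tracer message)


def decode_seccomp_ret(code: int) -> tuple:
    """Split a seccomp return value into (action name, data)."""
    action = ACTIONS.get(code & ACTION_MASK, "unknown")
    return action, code & DATA_MASK


if __name__ == "__main__":
    for code in (0x7FFC0000, 0x00050001):
        print(hex(code), decode_seccomp_ret(code))
```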

Seccomp filters are written in classic BPF (cBPF), not eBPF; the kernel validates the program and translates it to its internal BPF representation, which is JIT-compiled where the architecture supports it. The evaluation cost is roughly 5–15 nanoseconds per syscall for a filter with 50–100 instructions, which is negligible against the syscall overhead itself (100–1000ns depending on hardware, kernel version, and Spectre mitigations). Seccomp filters cannot be removed once applied. Additional filters can be stacked, but every filter runs on each syscall and the most restrictive result wins, so a later filter can only tighten the policy.

There are two profile strategies:

Blocklist (denylist) profile: The default action is SCMP_ACT_ALLOW. Specific syscalls are explicitly blocked. Docker’s default profile is technically an allowlist (its default action is SCMP_ACT_ERRNO), but its allow list is broad enough that in practice it behaves like a blocklist of roughly 50 syscalls. The advantage of a blocklist is low breakage risk: applications continue working unless they happen to call one of the specifically blocked syscalls. The disadvantage is that the surface blocked is limited to what the profile author thought to block, and the profile does not adapt to the application’s actual syscall usage.

Allowlist (safelist) profile: The default action is SCMP_ACT_ERRNO. Specific syscalls are explicitly allowed. This is the most restrictive approach — the application can only make syscalls that appear on the allowlist. The advantage is maximal attack surface reduction; the kernel code paths for all unlisted syscalls are unreachable. The disadvantage is high initial complexity: building an allowlist profile requires profiling the application’s complete syscall usage under representative workloads, and any gap causes application failure.

Docker and Kubernetes use profile JSON files to specify seccomp configurations. Both Docker (via --security-opt seccomp=profile.json) and Kubernetes (via securityContext.seccompProfile) understand the same JSON schema: a defaultAction field and an array of syscalls entries, each specifying names, an action, and optional argument filters.

Threat Model

The relevant scenario: a containerized workload is running on a host with a kernel that has one of the exploitable CVEs above unpatched. The attacker controls code executing inside the container via a compromised application, a deserialization vulnerability, RCE in a dependency, or a malicious container image. The container is not privileged. At best the runtime’s default seccomp profile is applied; on many Kubernetes clusters the pod runs Unconfined. The attacker wants to escalate from container process to host root.

Against this scenario:

Dirty Pipe (CVE-2022-0847): The attacker calls pipe(), fills and drains the pipe so that every pipe buffer keeps PIPE_BUF_FLAG_CAN_MERGE set, calls splice() to place a page of the target file into the pipe, then calls write() to overwrite data in that read-only page-cache page (e.g., /etc/passwd or an SUID binary). If splice is blocked, the target page can never be placed in the pipe and the attack stops there.
CVE-2023-2235: The attacker calls perf_event_open() to create a perf event group, then manipulates the group structure to trigger the use-after-free in perf_group_detach(). If perf_event_open is blocked, the kernel code path is never entered.
CVE-2022-1015 / CVE-2022-1016 (nftables): The attacker calls unshare(CLONE_NEWUSER) to gain CAP_NET_ADMIN in a new user namespace, then opens an AF_NETLINK socket for nftables, then sends crafted NFNL_SUBSYS_NFTABLES netlink messages that trigger the out-of-bounds write. Blocking either unshare(CLONE_NEWUSER) (via argument filter on the flags argument) or socket(AF_NETLINK) (via argument filter on the domain argument) stops the chain.
CVE-2021-3490 / CVE-2022-23222 / CVE-2023-2163 (eBPF verifier): The attacker calls bpf(BPF_PROG_LOAD, ...) with a crafted eBPF program designed to exploit a verifier bug. The verifier runs before the program is JIT-compiled, and the bug allows the crafted program to prove false facts about memory access bounds, enabling out-of-bounds reads and writes once loaded. If bpf is blocked, the program is never submitted to the verifier.
CVE-2023-32233 (nftables batch use-after-free): The attacker calls unshare(CLONE_NEWUSER), opens an AF_NETLINK socket, and sends NFNL_SUBSYS_NFTABLES batch operations that trigger a use-after-free in the anonymous set handling. Blocking socket(AF_NETLINK) stops this.

These are not theoretical attack paths. CVE-2022-0847 had a public PoC exploit within 10 days of disclosure. CVE-2021-3490 had a public exploit within two weeks. CVE-2022-1015 and CVE-2022-1016 were published together with a combined exploit chain. CVE-2023-32233 had an exploitable PoC published alongside the CVE. Most of these CVEs are rated high severity (CVSS 7.8 is typical for a local kernel privilege escalation); CVE-2022-1016 is an information leak that supports the nftables chain rather than an escalation on its own. The public exploits achieve local privilege escalation to root.

Kernel patching is the correct remediation. Seccomp provides defense-in-depth for the window between disclosure and patch deployment, and defense-in-depth when running on kernel versions where backported patches are not available. In a cloud environment running a locked OS image, the gap between “CVE disclosed” and “all nodes patched and restarted” is measured in days to weeks, not hours.

Hardening Configuration

1. Building a Minimal Syscall Profile with strace

Before blocking any syscall, verify that your application does not actually use it. False confidence in a blocklist derived from “this syscall sounds dangerous” is how production incidents happen.

Profile a containerized application to capture its actual syscall usage:

# Run the container without a seccomp profile
docker run --security-opt seccomp=unconfined \
  --name app-profile \
  --rm -d \
  myapp:latest

# Find the container's init PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' app-profile)

# Trace all syscalls in the process tree, writing to file (run as root).
# Attaching with -p misses syscalls made before this point (early startup).
strace -f -p "$PID" -o /tmp/syscalls-raw.txt &
STRACE_PID=$!

# Run a representative workload — cover all code paths:
# HTTP requests, DB queries, background jobs, startup/shutdown
docker exec app-profile /app/run-integration-test.sh

# Stop strace
kill $STRACE_PID

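# Extract unique syscall names. With -f and -o, strace prefixes each line
# with the PID, so skip an optional leading "1234 ".
grep -oP '^(\d+\s+)?\K[a-z_0-9]+(?=\()' /tmp/syscalls-raw.txt | sort -u > /tmp/syscalls-used.txt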
wc -l /tmp/syscalls-used.txt
cat /tmp/syscalls-used.txt

Check whether the dangerous syscalls appear in the profile:

for syscall in splice tee vmsplice perf_event_open bpf unshare; do
  if grep -qx "$syscall" /tmp/syscalls-used.txt; then
    echo "FOUND: $syscall — investigate before blocking"
  else
    echo "NOT FOUND: $syscall — safe to block"
  fi
done
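
The same extraction can be sketched in Python, which also makes it easy to turn the observed syscalls into a starting-point allowlist profile. The function names are hypothetical, and the generated profile is only as complete as the trace: anything the workload did not exercise while strace was attached will be missing.

```python
import json
import re

# Matches "read(3, ..." and, with strace -f -o, "1234 read(3, ...".
SYSCALL_RE = re.compile(r"^(?:\d+\s+)?([a-z_0-9]+)\(")


def syscalls_from_strace(lines):
    """Return the sorted unique syscall names seen in strace output lines."""
    names = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def allowlist_profile(names, arch="SCMP_ARCH_X86_64"):
    """Build a deny-by-default profile dict that allows only `names`."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [{"names": list(names), "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    sample = [
        '4512 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3',
        '4512 read(3, "127.0.0.1", 4096) = 9',
        "4513 <... futex resumed>) = 0",   # resumed lines are skipped
        "4512 +++ exited with 0 +++",      # exit markers are skipped
    ]
    print(json.dumps(allowlist_profile(syscalls_from_strace(sample)), indent=2))
```

Treat the output as a draft to review, not a profile to deploy directly: the container runtime itself needs some syscalls during startup that an attached trace never sees.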

The alternative profiling approach uses seccomp’s audit mode. Deploy with SCMP_ACT_LOG as the action for the syscalls you intend to block. This logs violations to the audit subsystem without blocking them, letting you verify that production traffic does not trigger the filter before you enforce it:

# After deploying with SCMP_ACT_LOG profile:

# Check kernel audit log for seccomp events. Without auditd the kernel log
# shows the numeric record type (1326); with auditd, use ausearch below.
journalctl -k | grep -E "type=(1326|SECCOMP)" | tail -20
# Output: type=SECCOMP msg=audit(1715123456.789:101): auid=1000 uid=0 gid=0 ses=1
#         subj=... pid=4512 comm="app" exe="/app/server" sig=0 arch=c000003e
#         syscall=332 compat=0 ip=0x7f1234abcd code=0x7ffc0000

# code=0x7ffc0000 is SCMP_ACT_LOG — logged but allowed

# Decode syscall numbers to names
ausearch -m SECCOMP --start today | \
  grep -oP 'syscall=\K[0-9]+' | \
  sort | uniq -c | sort -rn | \
  while read count num; do
    name=$(ausyscall --dump 2>/dev/null | awk -v n="$num" '$1==n{print $2}')
    printf "%6d  %s (%s)\n" "$count" "${name:-unknown}" "$num"
  done

For production profiling across a deployment, aggregate audit logs from all nodes before enforcing the filter.
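
One way to do that aggregation, sketched under the assumption that the audit records from all nodes have been collected as plain text lines. The summarize helper is hypothetical, and the syscall table is a deliberately small x86_64 subset covering only the syscalls discussed in this article.

```python
import re
from collections import Counter

# Subset of the x86_64 syscall table for the syscalls discussed here.
X86_64_NAMES = {41: "socket", 272: "unshare", 275: "splice", 276: "tee",
                278: "vmsplice", 298: "perf_event_open", 321: "bpf"}

# key=value pairs, where the value may be a quoted string.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def summarize(lines):
    """Count (comm, syscall) pairs across seccomp audit records."""
    counts = Counter()
    for line in lines:
        # auditd writes type=SECCOMP; the raw kernel log uses type=1326.
        if "type=SECCOMP" not in line and "type=1326" not in line:
            continue
        fields = {k: v.strip('"') for k, v in FIELD_RE.findall(line)}
        if "syscall" not in fields:
            continue
        num = int(fields["syscall"])
        counts[(fields.get("comm", "?"), X86_64_NAMES.get(num, str(num)))] += 1
    return counts


if __name__ == "__main__":
    import sys
    for (comm, name), n in summarize(sys.stdin).most_common():
        print(f"{n:6d}  {comm}  {name}")
```

Grouping by comm as well as syscall shows which process inside the container is responsible for each hit, which is usually the first question when deciding whether a hit is legitimate.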

2. The CVE-Blocking Seccomp Profile

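The following profile is a targeted blocklist. A profile passed via --security-opt seccomp= replaces Docker’s default profile rather than layering on top of it, which is why the final entry re-lists syscalls that the default profile denies. Because the default action here is SCMP_ACT_ALLOW, any syscall not listed is permitted, including some the default profile would deny; compare the two before relying on this one alone. The profile uses SCMP_ACT_ERRNO with EPERM (1) as the block action: this returns a permission error to the calling process rather than killing it, which produces cleaner error messages and lets applications that handle the error keep running.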

The socket(AF_NETLINK) and unshare(CLONE_NEWUSER) entries use argument filters. The BPF filter for these evaluates the first argument against a specific value, applying the block action only when the argument matches — allowing the syscall for other argument values.

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": ["splice", "tee", "vmsplice"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "comment": "CVE-2022-0847 (Dirty Pipe): splice fills pipe buffer pages with PIPE_BUF_FLAG_CAN_MERGE, enabling page-cache overwrites"
    },
    {
      "names": ["perf_event_open"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "comment": "CVE-2023-2235, CVE-2022-1729, CVE-2020-25704: perf subsystem use-after-free and race conditions"
    },
    {
      "names": ["bpf"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "comment": "CVE-2021-3490, CVE-2022-23222, CVE-2023-2163: eBPF verifier logic errors enabling OOB read/write"
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [
        {
          "index": 0,
          "value": 16,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Block AF_NETLINK (domain=16): CVE-2022-1015, CVE-2022-1016, CVE-2023-32233 all require netlink socket for nftables"
    },
    {
      "names": ["unshare"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [
        {
          "index": 0,
          "value": 268435456,
          "op": "SCMP_CMP_MASKED_EQ",
          "valueTwo": 268435456
        }
      ],
      "comment": "Block CLONE_NEWUSER (0x10000000): prevents gaining CAP_NET_ADMIN inside user namespace for nftables attacks"
    },
    {
      "names": [
        "ptrace", "process_vm_readv", "process_vm_writev",
        "kexec_load", "kexec_file_load",
        "reboot", "syslog",
        "create_module", "init_module", "finit_module", "delete_module",
        "iopl", "ioperm",
        "settimeofday", "clock_settime", "clock_adjtime",
        "acct",
        "mount", "umount2", "pivot_root", "chroot",
        "swapon", "swapoff",
        "mknod",
        "open_by_handle_at", "name_to_handle_at",
        "setns",
        "lookup_dcookie",
        "add_key", "request_key", "keyctl",
        "userfaultfd",
        "nfsservctl",
        "get_kernel_syms",
        "query_module",
        "_sysctl",
        "uselib",
        "ustat",
        "sysfs",
        "kcmp",
        "pciconfig_read", "pciconfig_write",
        "io_uring_setup", "io_uring_enter", "io_uring_register"
      ],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "comment": "Syscalls restricted by Docker's default profile, plus io_uring (see CVE-2021-41073, CVE-2022-29582). perf_event_open is covered by its own entry above."
    }
  ]
}

Save this to /etc/seccomp/cve-blocking.json on each host. The argument filter for unshare uses SCMP_CMP_MASKED_EQ with a mask of 0x10000000 (CLONE_NEWUSER) — this blocks unshare only when the CLONE_NEWUSER flag is set in the flags argument, allowing unshare(CLONE_NEWNS) or other flag combinations used by legitimate container runtimes.

The AF_NETLINK argument filter blocks socket(AF_NETLINK, ...) by matching domain argument value 16. Other address families, such as AF_UNIX (1), AF_INET (2), and AF_INET6 (10), are unaffected. Note that AF_NETLINK is also how userspace enumerates network interfaces: glibc’s getifaddrs(), Go’s net.Interfaces(), and some init systems and service managers open NETLINK_ROUTE sockets. Verify this against your application’s strace profile before enforcing.
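
The two argument filters can be modelled in a few lines to check which calls they match. This is a sketch of the comparison semantics only, not of how libseccomp compiles them; the function names are hypothetical and the flag values are the kernel’s CLONE_* constants.

```python
AF_NETLINK = 16
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000   # 268435456, the value used in the profile
CLONE_NEWNET = 0x40000000


def masked_eq(arg: int, mask: int, value: int) -> bool:
    """SCMP_CMP_MASKED_EQ: compare only the bits selected by mask."""
    return (arg & mask) == value


def unshare_blocked(flags: int) -> bool:
    """True when the unshare rule in the profile would match."""
    return masked_eq(flags, CLONE_NEWUSER, CLONE_NEWUSER)


def socket_blocked(domain: int) -> bool:
    """True when the socket rule (SCMP_CMP_EQ on arg 0) would match."""
    return domain == AF_NETLINK


if __name__ == "__main__":
    print(unshare_blocked(CLONE_NEWUSER | CLONE_NEWNET))  # user namespace requested
    print(unshare_blocked(CLONE_NEWNS))                   # mount namespace only
    print(socket_blocked(2))                              # AF_INET
```

A plain SCMP_CMP_EQ on the unshare flags would miss calls that combine CLONE_NEWUSER with other flags, which is why the masked comparison is used there.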

3. Kubernetes RuntimeDefault and Custom Profiles

In Kubernetes 1.19+, the seccomp profile is specified in the pod’s securityContext. Three profile types are available: Unconfined (no filter), RuntimeDefault (the container runtime’s built-in profile, equivalent to Docker’s default), and Localhost (a custom profile file on the node).

Apply the CVE-blocking profile to a pod:

apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
  namespace: production
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/cve-blocking.json
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]

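The localhostProfile path is relative to the kubelet’s seccomp profile root, the seccomp directory under the kubelet root directory, which defaults to /var/lib/kubelet/seccomp/ (older kubelets exposed it as the --seccomp-profile-root flag, since removed). The full path on the node is /var/lib/kubelet/seccomp/profiles/cve-blocking.json.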

Distribute the profile to all nodes via DaemonSet. The DaemonSet mounts the profile from a ConfigMap and writes it to the kubelet seccomp directory:

apiVersion: v1
kind: ConfigMap
metadata:
  name: seccomp-profiles
  namespace: kube-system
data:
  cve-blocking.json: |
    {
      "defaultAction": "SCMP_ACT_ALLOW",
      "architectures": ["SCMP_ARCH_X86_64"],
      "syscalls": [
        { "names": ["splice","tee","vmsplice"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1 },
        { "names": ["perf_event_open"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1 },
        { "names": ["bpf"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1 },
        { "names": ["socket"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1,
          "args": [{"index": 0, "value": 16, "op": "SCMP_CMP_EQ"}] },
        { "names": ["unshare"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1,
          "args": [{"index": 0, "value": 268435456, "op": "SCMP_CMP_MASKED_EQ", "valueTwo": 268435456}] }
      ]
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-profile-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: seccomp-installer
  template:
    metadata:
      labels:
        app: seccomp-installer
    spec:
      hostPID: false
      initContainers:
      - name: install-profiles
        image: busybox:1.36
        command:
        - sh
        - -c
        - |
          mkdir -p /seccomp-target/profiles
          cp /seccomp-source/cve-blocking.json /seccomp-target/profiles/cve-blocking.json
          echo "Profile installed"
        volumeMounts:
        - name: seccomp-source
          mountPath: /seccomp-source
        - name: seccomp-target
          mountPath: /seccomp-target
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1m"
            memory: "4Mi"
      volumes:
      - name: seccomp-source
        configMap:
          name: seccomp-profiles
      - name: seccomp-target
        hostPath:
          path: /var/lib/kubelet/seccomp
          type: DirectoryOrCreate
      tolerations:
      - operator: Exists

For associating a seccomp profile with a RuntimeClass (so that any pod selecting the RuntimeClass automatically uses the profile without specifying it per-pod):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: hardened-runc
handler: runc

RuntimeClass itself does not carry a seccomp profile specification in the core API. The association is implemented via an admission webhook (e.g., the Security Profiles Operator) or by using the SeccompDefault feature (beta in 1.25, stable in 1.27, and enabled per node with the kubelet --seccomp-default flag), which makes RuntimeDefault the profile for all pods that do not specify one. For Localhost profiles to be the default, you need a mutating admission webhook that sets the seccomp profile based on namespace labels or RuntimeClass selection.

The Security Profiles Operator (SPO) provides this functionality and also handles profile distribution, eliminating the need for the DaemonSet above:

apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: cve-blocking
  namespace: default
spec:
  defaultAction: SCMP_ACT_ALLOW
  syscalls:
  - action: SCMP_ACT_ERRNO
    errnoRet: 1
    names:
    - splice
    - tee
    - vmsplice
    - perf_event_open
    - bpf
  - action: SCMP_ACT_ERRNO
    errnoRet: 1
    names:
    - socket
    args:
    - index: 0
      value: 16
      op: SCMP_CMP_EQ

4. OPA/Kyverno Policy Requiring Seccomp

Prevent unprotected pods from reaching the cluster with a Kyverno admission policy. This enforces that every pod specifies either RuntimeDefault or Localhost as its seccomp profile type — both are acceptable, but Unconfined and the absent field (which defaults to Unconfined when SeccompDefault is not enabled) are rejected:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-seccomp-profile
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: check-seccomp
    match:
      any:
      - resources:
          kinds: [Pod]
    exclude:
      any:
      - resources:
          namespaces:
          - kube-system
          - kube-public
    validate:
      message: >-
        Pods must specify a seccomp profile (RuntimeDefault or Localhost).
        Add spec.securityContext.seccompProfile.type: RuntimeDefault or Localhost.
      pattern:
        spec:
          securityContext:
            seccompProfile:
              type: "RuntimeDefault | Localhost"
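
The check this policy performs can be mirrored in a CI step before manifests reach the cluster. The sketch below is a hypothetical pre-check over an already-parsed pod manifest (a plain dict); it is not part of Kyverno and, like the pattern above, it looks only at the pod-level securityContext.

```python
ALLOWED_TYPES = {"RuntimeDefault", "Localhost"}


def pod_has_allowed_seccomp(pod: dict) -> bool:
    """True if spec.securityContext.seccompProfile.type is an allowed value."""
    profile = (
        pod.get("spec", {})
        .get("securityContext", {})
        .get("seccompProfile", {})
    )
    return profile.get("type") in ALLOWED_TYPES


if __name__ == "__main__":
    hardened = {"spec": {"securityContext": {
        "seccompProfile": {"type": "Localhost",
                           "localhostProfile": "profiles/cve-blocking.json"}}}}
    bare = {"spec": {"containers": [{"name": "app", "image": "myapp:latest"}]}}
    print(pod_has_allowed_seccomp(hardened), pod_has_allowed_seccomp(bare))
```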

A separate rule requiring the CVE-blocking profile specifically for sensitive namespaces:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cve-blocking-seccomp
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: check-cve-blocking-profile
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces:
          - production
          - staging
    validate:
      message: >-
        Production pods must use the cve-blocking seccomp profile.
        Set spec.securityContext.seccompProfile.type: Localhost and
        localhostProfile: profiles/cve-blocking.json
      pattern:
        spec:
          securityContext:
            seccompProfile:
              type: Localhost
              localhostProfile: "profiles/cve-blocking.json"

5. Seccomp Performance Impact

Seccomp adds BPF filter evaluation cost to every syscall. Measure the actual overhead for your workload:

/* /tmp/bench_syscall.c */
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <stdio.h>

int main(void) {
    struct timespec start, end;
    long ns;
    int iterations = 1000000;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        syscall(SYS_getpid);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (end.tv_sec - start.tv_sec) * 1000000000L
       + (end.tv_nsec - start.tv_nsec);
    printf("1M getpid syscalls: %ld ns total, %ld ns/syscall\n",
           ns, ns / iterations);
    return 0;
}

# Link statically so the binary does not depend on the image's libc
gcc -O2 -static -o /tmp/bench_syscall /tmp/bench_syscall.c

# Baseline: no seccomp
/tmp/bench_syscall
# Typical output: 1M getpid syscalls: 142000000 ns total, 142 ns/syscall

# With Docker's built-in default seccomp profile (applied automatically
# when no --security-opt seccomp=... is given)
docker run --rm \
  -v /tmp/bench_syscall:/bench:ro busybox /bench
# Typical output: 1M getpid syscalls: 156000000 ns total, 156 ns/syscall (~14ns overhead)

# With CVE-blocking profile (5 additional rules)
docker run --security-opt seccomp=/etc/seccomp/cve-blocking.json \
  -v /tmp/bench_syscall:/bench:ro busybox /bench
# Typical output: 1M getpid syscalls: 159000000 ns total, 159 ns/syscall (~17ns overhead)

The overhead is approximately 5–20ns per syscall for a filter with fewer than 100 rules, which is below measurement noise for most workloads. The overhead matters only for syscall-intensive tight loops — a web server making one read syscall per HTTP request at 10,000 RPS adds roughly 0.17ms of total seccomp overhead per second across all requests, which is not measurable in end-to-end latency.
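
The arithmetic behind that estimate, as a small sketch. The function name is hypothetical, and the second call uses an illustrative figure of 50 syscalls per request rather than a measured one.

```python
def seccomp_overhead_ms_per_second(rps, syscalls_per_request, overhead_ns):
    """Total added filter time per wall-clock second, in milliseconds."""
    return rps * syscalls_per_request * overhead_ns / 1_000_000


if __name__ == "__main__":
    # The case from the text: 10,000 RPS, one syscall each, 17 ns per syscall.
    print(seccomp_overhead_ms_per_second(10_000, 1, 17))
    # Illustrative heavier case: 50 syscalls per request.
    print(seccomp_overhead_ms_per_second(10_000, 50, 17))
```

Even at 50 syscalls per request the total stays in single-digit milliseconds of CPU time per second, spread across all requests.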

Filters with hundreds of rules show higher overhead due to linear scan time in the BPF filter. The blocklist profiles described here have fewer than 60 rules total and are not in the performance-sensitive range.

6. Audit Mode: Test Before Enforcing

Before applying the profile with SCMP_ACT_ERRNO, deploy with SCMP_ACT_LOG. Modify the profile by replacing all SCMP_ACT_ERRNO actions with SCMP_ACT_LOG:

# Create audit-mode version of the profile
sed 's/SCMP_ACT_ERRNO/SCMP_ACT_LOG/g' /etc/seccomp/cve-blocking.json \
  > /etc/seccomp/cve-blocking-audit.json

# Deploy with audit profile
docker run --security-opt seccomp=/etc/seccomp/cve-blocking-audit.json \
  myapp:latest
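
A structured alternative to the sed substitution, sketched in Python. It rewrites only per-rule actions and leaves defaultAction alone, which matters if the same step is later applied to an allowlist profile whose default action is SCMP_ACT_ERRNO. The function name is hypothetical; the paths in the usage example are the ones used above.

```python
import copy
import json


def to_audit_mode(profile: dict) -> dict:
    """Return a copy of `profile` with per-rule ERRNO actions turned into LOG."""
    audit = copy.deepcopy(profile)
    for rule in audit.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ERRNO":
            rule["action"] = "SCMP_ACT_LOG"
            rule.pop("errnoRet", None)  # errno is meaningless for LOG
    return audit


if __name__ == "__main__":
    with open("/etc/seccomp/cve-blocking.json") as src:
        enforced = json.load(src)
    with open("/etc/seccomp/cve-blocking-audit.json", "w") as dst:
        json.dump(to_audit_mode(enforced), dst, indent=2)
```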

Run your full integration test suite and load test, then check the audit log:

# Filter for seccomp events (the kernel log shows the numeric record
# type 1326 when auditd is not running)
journalctl -k --since "1 hour ago" | grep -E "type=(1326|SECCOMP)"

# Example output for a logged splice() call:
# type=SECCOMP msg=audit(1715134567.123:234): auid=0 uid=0 gid=0 ses=4 subj=unconfined
#   pid=7823 comm="rsync" exe="/usr/bin/rsync" sig=0 arch=c000003e
#   syscall=275 compat=0 ip=0x7f3a12bc4d20 code=0x7ffc0000
# syscall=275 is splice on x86_64; code=0x7ffc0000 is SCMP_ACT_LOG

# Decode syscall numbers
ausyscall x86_64 275
# Output: splice

# Summarize which of your target syscalls are being hit
journalctl -k --since "1 hour ago" | grep -E "type=(1326|SECCOMP)" | \
  grep -oP 'syscall=\K[0-9]+' | sort | uniq -c | sort -rn | \
  while read -r count num; do
    name=$(ausyscall x86_64 "$num" 2>/dev/null || echo "unknown")
    printf "%6d  syscall %-4s  %s\n" "$count" "$num" "$name"
  done

Any hit on splice, perf_event_open, bpf, or a socket with domain 16 in the audit log is a signal to investigate before enforcing. An rsync or sendfile-based file transfer tool may call splice legitimately. A monitoring agent may call perf_event_open. Identify the process from the comm= field in the audit log entry and determine whether it should be running in the container.

Expected Behaviour

After applying the profile, verify it is active:

# Check seccomp mode for the container init process
docker exec hardened-container cat /proc/1/status | grep Seccomp
# Output: Seccomp:  2
# Mode 2 is SECCOMP_MODE_FILTER — BPF filter active
# Mode 0 is SECCOMP_MODE_DISABLED — no filter
# Mode 1 is SECCOMP_MODE_STRICT: only read, write, _exit, and sigreturn allowed

# Verify that a blocked syscall returns EPERM
docker exec hardened-container python3 -c "
import ctypes, errno
libc = ctypes.CDLL(None, use_errno=True)
# Try to call perf_event_open (syscall 298 on x86_64) with a NULL attr pointer.
# The filter rejects the call before the kernel inspects the arguments.
result = libc.syscall(298, 0, -1, -1, -1, 0)
err = ctypes.get_errno()
print(f'perf_event_open returned {result}, errno {err} ({errno.errorcode.get(err, \"unknown\")})')
"
# Output: perf_event_open returned -1, errno 1 (EPERM)
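
For checks run from inside the application rather than through docker exec, the Seccomp field can be read directly. A minimal sketch; the helper name is hypothetical and it takes the status text as input so it can be exercised without a container.

```python
MODES = {
    0: "SECCOMP_MODE_DISABLED",
    1: "SECCOMP_MODE_STRICT",
    2: "SECCOMP_MODE_FILTER",
}


def seccomp_mode(status_text: str) -> str:
    """Map the Seccomp: line of /proc/<pid>/status to a mode name."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" on newer kernels does not match this prefix.
        if line.startswith("Seccomp:"):
            return MODES.get(int(line.split()[1]), "unknown")
    return "field not present"


if __name__ == "__main__":
    with open("/proc/self/status") as status:
        print(seccomp_mode(status.read()))
```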

A blocked splice call returns -1 with errno = EPERM. Application code that calls splice and checks the return value will receive an EPERM error. Most applications that call splice fall back to read/write pairs on EPERM, because splice is an optimization rather than a required interface. An application that calls splice without error handling will crash with Permission denied — this is why audit mode testing is mandatory.

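The kernel audit log entry for a blocked syscall looks like the following. SCMP_ACT_ERRNO results are logged only when the filter is loaded with SECCOMP_FILTER_FLAG_LOG and errno is listed in /proc/sys/kernel/seccomp/actions_logged; under the audit profile, the same record appears with code=0x7ffc0000 instead: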

type=SECCOMP msg=audit(1715134567.123:234): auid=0 uid=0 gid=0 ses=4
  subj=unconfined pid=7823 comm="exploit" exe="/tmp/exploit"
  sig=0 arch=c000003e syscall=275 compat=0 ip=0x401234 code=0x00050001
  
# code=0x00050001 is SCMP_ACT_ERRNO | EPERM (errno=1)
# syscall=275 is splice
# arch=c000003e is AUDIT_ARCH_X86_64

Trade-offs

Blocklist versus allowlist: The profile above is a targeted blocklist layered on Docker’s default. It closes specific high-value CVE vectors with minimal breakage risk. A full allowlist profile — blocking every syscall not explicitly used by the application — closes the remaining ~350 syscall attack surface, but requires a complete profiling exercise and will break the application if any syscall is missed. For most teams, the targeted blocklist covering the five CVE categories is the right starting point. An allowlist is appropriate for extremely sensitive workloads where the profiling investment is justified.

splice() blocking: splice is a zero-copy optimization. Applications that call it (rsync for local copies, some HTTP servers for sendfile-like behaviour, cp in some libc implementations) will fall back to read/write pairs on EPERM. The fallback is functionally correct but slower. For workloads that process large files and rely on splice throughput — log forwarders using splice to pipe data between file descriptors, for example — measure the performance impact before enforcing. The Dirty Pipe CVE is closed; the question is whether the application performs acceptably without splice.

bpf() blocking: This breaks any eBPF-based tooling running inside the container. Cilium CNI does not run inside application containers, so container-to-Cilium impact is not a concern. Falco in kernel module mode also does not run inside application containers. However, eBPF-based observability agents that inject into containers — some APM tools, some service mesh sidecars — may call bpf() and will fail with this profile. Verify with your observability stack before enforcing. Falco in eBPF mode uses a privileged DaemonSet that runs outside application container seccomp constraints.

io_uring blocking: The profile above includes io_uring_setup, io_uring_enter, and io_uring_register in the extended blocklist. CVE-2022-29582 (a use-after-free in io_uring timeout handling) and CVE-2021-41073 (a type confusion in the io_uring read/write path) demonstrate that io_uring’s kernel implementation is a productive attack surface. Most containerized application code does not use io_uring directly (it requires explicit opt-in via the liburing library or direct syscalls). Java, Python, and Go runtimes do not use io_uring in their standard configurations; for Node.js and other runtimes built on libuv, check the behaviour of your specific version. Rust async runtimes using tokio-uring are the primary exception.

unshare argument filter limitations: The SCMP_CMP_MASKED_EQ filter on unshare blocks only unshare calls that set the CLONE_NEWUSER flag. It does not block clone(CLONE_NEWUSER, ...) or the clone3 syscall. If your threat model includes attackers calling clone directly, add a matching argument filter on the clone flags argument. clone3 cannot be filtered this way because it passes its flags in a struct in user memory, which a seccomp filter cannot dereference; the usual approach is to fail clone3 entirely with ENOSYS so that libc falls back to clone. In practice, exploit code for the nftables CVEs used unshare rather than clone because unshare is the simpler interface for this specific capability acquisition pattern.

Failure Modes

Assuming the runtime default profile closes the high-value CVE vectors: It closes only some of them. Docker’s default profile gates bpf, perf_event_open, and unshare behind capabilities, but it allows splice and AF_NETLINK sockets because blocking them would break common container workloads. It also protects only containers that actually run under it: pods on clusters without SeccompDefault, and anything started with seccomp=unconfined, get no filter at all. The supplementary profile above addresses this, but requires explicit deployment.

Skipping audit mode before enforcement: Applying SCMP_ACT_ERRNO without first running with SCMP_ACT_LOG under production traffic is how seccomp-induced application crashes occur in production. The most common unexpected hits are monitoring agents calling perf_event_open, container health checks using unshare, and Java or Go runtime initialization code calling socket with unexpected address families. Always verify with SCMP_ACT_LOG against a realistic workload before switching to SCMP_ACT_ERRNO.

Seccomp not applying to all processes: The seccomp filter must be applied to the container init process (PID 1 inside the container, or the process specified as the container entrypoint). Docker and containerd both delegate this to the OCI runtime (runc by default), which installs the filter inside the container’s init process shortly before it executes the entrypoint. The filter is then inherited by all child processes across fork and exec. If you apply seccomp via a wrapper that launches the container without using the runtime’s seccomp mechanism, processes started before the filter is applied will not be filtered. Use the container runtime’s native seccomp integration, not post-start wrappers.

Privileged containers bypass seccomp: Any pod or container running with privileged: true in its security context runs without a seccomp filter. The kernel does not exempt processes holding CAP_SYS_ADMIN from seccomp; the runtime simply does not apply a profile. docker run --privileged defaults to seccomp=unconfined, and Kubernetes always runs privileged containers Unconfined. This means all of the above profiles are completely ineffective for privileged containers. Audit your deployments for privileged containers and treat them as separate, higher-risk entries in your hardening backlog; see the Kubernetes admission control article for policy-based privileged container prevention.

Seccomp and multiarch images: A profile specifying SCMP_ARCH_X86_64 only will not filter correctly on arm64 nodes running the same pod. The architecture list in the profile must match the actual CPU architecture. Syscall numbers differ between architectures — splice is syscall 275 on x86_64 and syscall 76 on arm64. The libseccomp library (which Docker and containerd use internally to compile the JSON profile to BPF) handles this translation, but only for the architectures listed in the architectures array. Include all architectures your cluster runs on.
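
When audit logs from a mixed-architecture cluster are decoded, the arch= field has to select the syscall table. A small sketch covering only the syscalls discussed in this article on x86_64 and arm64; the table is intentionally partial and the lookup function is hypothetical.

```python
# Partial per-architecture syscall tables (numbers differ between ABIs).
SYSCALL_NUMBERS = {
    "x86_64": {"unshare": 272, "splice": 275, "perf_event_open": 298, "bpf": 321},
    "aarch64": {"splice": 76, "unshare": 97, "perf_event_open": 241, "bpf": 280},
}

# Values of the arch= field in seccomp audit records.
AUDIT_ARCH = {0xC000003E: "x86_64", 0xC00000B7: "aarch64"}


def syscall_name(arch_hex: str, number: int) -> str:
    """Resolve a syscall number using the architecture from the audit record."""
    arch = AUDIT_ARCH.get(int(arch_hex, 16))
    for name, nr in SYSCALL_NUMBERS.get(arch, {}).items():
        if nr == number:
            return name
    return "unknown"


if __name__ == "__main__":
    print(syscall_name("c000003e", 275))
    print(syscall_name("c00000b7", 76))
```

Decoding an arm64 record against the x86_64 table silently yields the wrong syscall name, which is easy to miss when aggregating logs across nodes.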

OPA/Kyverno policy without background scanning: The Kyverno policy above validates new pods at admission time. It does not retroactively enforce against pods already running. With background: true in the policy spec (as shown above), Kyverno’s background controller reports existing violators in policy reports but does not evict them, so verify that it has sufficient RBAC permissions to scan existing pods and that someone reviews the reports. A policy deployed with validationFailureAction: Enforce and background: false gives no visibility into non-compliant pods that predate it and creates a false impression of comprehensive enforcement.