AF_PACKET and CAP_NET_RAW: Two Kernel CVEs That Made the Default Docker Capability Set Dangerous

The Problem

Docker grants every container a fixed set of Linux capabilities without requiring --privileged. This capability set is documented in the Docker source and has been stable for years. The set includes capabilities like CAP_CHOWN, CAP_NET_BIND_SERVICE, CAP_SETUID, and — the one that matters here — CAP_NET_RAW. This last capability exists in the default set because a handful of common tools require it: ping needs a raw ICMP socket, various network diagnostic tools need raw socket access, and packet capture tools like tcpdump use it for link-layer access. The capability grant is intentional and there is a real use case behind it.

CAP_NET_RAW also permits opening AF_PACKET sockets. An AF_PACKET socket is a raw socket that receives Ethernet frames directly from a network interface at the link layer, bypassing the TCP/IP stack entirely. The socket returns complete Layer 2 frames — Ethernet header, IP header, transport header, and payload — to the reading process. This is how tcpdump works, how Wireshark captures packets, how DHCP clients see all DHCP traffic on a segment, and how ARP implementations can send and receive ARP frames outside the normal kernel ARP path. It is powerful, low-level, and necessarily privileged because unrestricted link-layer access would let any process sniff all traffic on a shared interface.

The AF_PACKET implementation in the Linux kernel supports two packet capture modes. The first is per-packet copying: each received frame is copied into the buffer passed to recvmsg(). The second is the ring buffer interface, PACKET_RX_RING and PACKET_TX_RING, which shares a circular buffer between kernel and userspace using mmap(). The ring buffer interface is designed for high-performance packet capture (for example, 10 Gbit/s capture without per-packet copies), and applications like Suricata and Zeek use it. The interface is configured through setsockopt() on the AF_PACKET socket, and the ring buffer code behind it is where both CVEs live.

CVE-2020-14386 was discovered by Or Cohen of Palo Alto Networks and published in September 2020. The vulnerability is an integer overflow leading to a heap out-of-bounds write in net/packet/af_packet.c. The flaw sits in tpacket_rcv(), the receive path that copies each incoming frame into a ring buffer configured with setsockopt(PACKET_RX_RING). For every frame the kernel computes netoff, the offset of the network header inside the ring slot, from the header length plus tp_reserve, a value set from userspace via setsockopt(PACKET_RESERVE). netoff was declared as an unsigned short while tp_reserve is an unsigned int, so a large tp_reserve wrapped the 16-bit sum to a small number. The derived MAC header offset, macoff = netoff - maclen, could then be smaller than sizeof(struct virtio_net_hdr). With PACKET_VNET_HDR enabled, the kernel writes a virtio_net_hdr at macoff - sizeof(struct virtio_net_hdr), which can land before the start of the ring buffer, in adjacent kernel heap memory. By controlling what occupies that memory, through careful heap feng shui (allocating and freeing objects to place a target in the right position), the attacker could corrupt kernel data structures. Affected kernels: 4.6 through 5.8.6.
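The wraparound behind this class of bug is easy to reproduce outside the kernel. The following is a minimal sketch that emulates fixed-width unsigned arithmetic in Python. The helper name and the input values are illustrative assumptions; they are not taken from the kernel source or from any exploit.

```python
def wrap_unsigned(value: int, bits: int) -> int:
    """Truncate value to an unsigned integer of the given width, as C does."""
    return value & ((1 << bits) - 1)


# A large attacker-supplied term added to a small header length.
header_len = 84          # illustrative size, not a kernel constant
attacker_term = 65_500   # illustrative attacker-controlled value

true_sum = header_len + attacker_term
stored_16 = wrap_unsigned(true_sum, 16)           # what a 16-bit field would hold
product_32 = wrap_unsigned(65_536 * 65_536, 32)   # a 32-bit product wraps the same way

print(true_sum)    # 65584
print(stored_16)   # 48
print(product_32)  # 0
```

Any later bounds check or allocation that trusts the truncated value works with a number far smaller than the one the attacker actually supplied.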

CVE-2021-22600 was assigned by Google and published in January 2022. The vulnerability is a double-free in packet_set_ring(), caused by stale state that survives a change of ring version. The per-socket ring structure keeps version-specific fields in a union: the TPACKET_V3 block descriptor state shares storage with rx_owner_map, a bitmap used only by TPACKET_V1 and V2 rings. When a V3 ring is torn down, free_pg_vec() releases the page vector, but the V3 state in the union still holds a pointer to it. If the socket is then switched to an older version with setsockopt(PACKET_VERSION) and packet_set_ring() is called again, the function interprets that leftover value as rx_owner_map and frees it a second time. The result is a double-free of the packet ring buffer’s page vector. The upstream fix makes the rx_owner_map handling conditional on a ring (pg_vec) actually being allocated.

A double-free in kernel heap memory follows the same exploit chain as any other heap use-after-free. The first free releases the memory back to the kernel allocator (SLUB in most production kernels). If the attacker can allocate a different, attacker-controlled object at the now-freed address before the second free triggers, the second free releases that object while it is still in use. From there, standard SLUB exploitation techniques (crafting fake freelist pointers, redirecting subsequent kmalloc() calls to chosen addresses) produce a write-what-where primitive. A write-what-where against kernel memory is sufficient for privilege escalation. The standard technique writes a forged cred structure to the heap and redirects the current task’s cred pointer to it: uid=0, all capabilities set. The trigger is a sequence of setsockopt() calls on a single socket, so the hard part of exploitation is grooming the heap between the two frees rather than winning a timing window. A public proof-of-concept was posted to GitHub within weeks of the CVE publication. Affected kernels: 5.0 through 5.15.14 (with 5.15.17 and 5.16.3 containing the fix).
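Why a double-free leads to two owners of the same memory can be shown with a toy allocator. This is a deliberately simplified sketch: the class, the addresses, and the LIFO policy are assumptions made for illustration. Real SLUB uses per-CPU, per-size-class freelists and may include hardening that this model omits.

```python
class ToyFreelist:
    """LIFO freelist over integer 'addresses'; a simplified allocator model."""

    def __init__(self, addresses):
        self.free = list(addresses)

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, addr: int) -> None:
        # No double-free detection, mirroring an allocator without hardening
        self.free.append(addr)


heap = ToyFreelist([0x3000, 0x2000, 0x1000])

ring = heap.alloc()        # victim object, e.g. the ring's page vector
heap.release(ring)         # first free
attacker = heap.alloc()    # attacker-controlled object reclaims the slot
heap.release(ring)         # second free through the stale pointer
other = heap.alloc()       # next allocation aliases the live attacker object

print(hex(ring), hex(attacker), hex(other))
```

After the second free, two live objects share one address, which is the aliasing that the exploitation steps above build on.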

Both vulnerabilities share the same entry point requirement: the ability to open an AF_PACKET socket. The check in the kernel is explicit — socket(AF_PACKET, ...) returns EPERM unless the calling process has CAP_NET_RAW or is running in a user network namespace with the appropriate capabilities mapped. Every container with the default Docker capability set has CAP_NET_RAW. Every such container on an unpatched kernel is therefore one crafted setsockopt() call away from kernel code execution.

The practical scope is not academic. Kubernetes clusters routinely run containers with the full default capability set because no explicit capabilities.drop is configured. Network debugging images (nicolaka/netshoot, Alpine with iputils installed, BusyBox-based images with ping) are common in runbooks. Init containers that check network connectivity with ping before the main application starts are ubiquitous; this pattern appears in Helm charts for databases, message brokers, and service meshes. Every one of those containers carries CAP_NET_RAW. In a multi-tenant cluster, a single node running an unpatched 5.x kernel where any of these containers is co-located is exposed across tenant boundaries.

Threat Model

The baseline attacker is someone who has code execution inside a container, whether through an application vulnerability such as RCE, a supply chain compromise of the container image, or a compromised developer workstation that pushed malicious code into a staging environment. They have a shell. The container is running with the default Docker/containerd capability set, which has not been modified by capabilities.drop or a restrictive PodSecurityContext. The node kernel is unpatched.

The immediate attack surface: call socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)). On a container with CAP_NET_RAW, this succeeds. From here, the attacker calls setsockopt() on the resulting file descriptor with parameters crafted to trigger either the ring buffer integer overflow (CVE-2020-14386) or the packet_set_ring() double-free (CVE-2021-22600). Both exploits run on a single machine: no network access and no external C2 are needed for the privilege escalation step itself.

From kernel code execution, the standard outcome is a container-to-node escape: the exploit modifies the process’s kernel credentials to uid=0 with a full capability set, then invokes chroot or bind-mounts the host filesystem, reads the kubelet’s client certificate from /var/lib/kubelet/pki/, and issues Kubernetes API requests as the node. With the node’s credentials, the attacker can read the secrets the node is authorised to access, modify DaemonSet pod specs, or inject into the kubelet’s pod reconciliation loop.

Any pod running a network monitoring or debugging image retains CAP_NET_RAW by default and is a viable exploit host. This is particularly sharp for patterns like:

  • Network readiness init containers: Helm charts for Kafka, Redis, PostgreSQL, and RabbitMQ commonly include an init container that runs ping <service-host> or nc -z <service-host> <port> to block main container startup until a dependency is reachable. These init containers are frequently based on BusyBox or Alpine images that do not drop any capabilities. They run to completion on every pod restart.
  • Service mesh init containers: Istio’s istio-init container and Linkerd’s linkerd-init both require elevated network privileges to set up iptables rules for traffic interception. Neither historically requires CAP_NET_RAW for its primary function, but clusters that have broadly granted it to the mesh namespace through a permissive PSP or namespace-level default carry the exposure in every meshed pod.
  • DaemonSets for network observability: Suricata, Zeek, Cilium Hubble, network flow exporters — these legitimately need AF_PACKET access. They should be running on patched kernels with the ring buffer vulnerability surface reduced by seccomp, but they often are not.
  • Multi-tenant clusters: A compromised container in namespace A can exploit the vulnerability to escape to the node and immediately access the process environments of all pods in namespace B through /proc/<pid>/environ. Namespace isolation provides no protection once the kernel boundary is breached.

The unpatched kernel window matters. CVE-2020-14386 was patched in September 2020. CVE-2021-22600 was patched in January 2022 for 5.16.3 and 5.15.17, with backports to 5.10.94 in February 2022. Managed Kubernetes node images from all major cloud providers take days to weeks to roll out kernel patches after upstream release. During that window, any cluster running the affected kernel versions is exposed regardless of whether the vendor’s SLA says “patches within N days.”

Hardening Configuration

1. Drop CAP_NET_RAW Explicitly

This is the single highest-leverage action. Dropping CAP_NET_RAW makes socket(AF_PACKET, ...) return EPERM inside the container, removing the ability to reach either vulnerable code path. The minimal secure pod securityContext looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: application
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/app:1.0.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE  # only if the container binds to a port below 1024

drop: [ALL] removes every capability from the bounding, permitted, and effective sets. add: [NET_BIND_SERVICE] adds back the single capability needed to bind to privileged ports; omit it if the application listens on port 1024 or higher. Note that an added capability only reaches the effective set of a process running as UID 0; with runAsNonRoot: true it is present in the bounding set only, so the process cannot use it directly. CAP_NET_RAW is not in the add list, so it is not available. socket(AF_PACKET, ...) will fail with EPERM. Both CVE-2020-14386 and CVE-2021-22600 become unreachable from this container.

For init containers, the same securityContext.capabilities block applies. Init containers that use ping for connectivity testing need a different approach (see section 3 below) rather than retaining CAP_NET_RAW.

2. Kyverno Policy Requiring CAP_NET_RAW Drop

Relying on individual pod authors to set capabilities.drop correctly does not scale. Enforce it at admission time. The following ClusterPolicy denies any pod that does not explicitly drop NET_RAW across all containers:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: drop-cap-net-raw
  annotations:
    policies.kyverno.io/title: Require CAP_NET_RAW to be dropped
    policies.kyverno.io/severity: high
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/description: >-
      CAP_NET_RAW enables AF_PACKET socket creation, the entry point for
      CVE-2020-14386 and CVE-2021-22600. It must be dropped from all
      containers unless explicitly exempted via namespace exclusion.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: require-drop-net-raw
    match:
      any:
      - resources:
          kinds: [Pod]
    exclude:
      any:
      - resources:
          namespaces:
          - kube-system
          - monitoring   # netdata, Prometheus node_exporter with AF_PACKET
          - network-tools
    validate:
      message: >-
        CAP_NET_RAW must be explicitly dropped. Add NET_RAW (or ALL) to
        capabilities.drop in every container's securityContext.
      foreach:
      - list: "request.object.spec.containers"
        deny:
          conditions:
            all:
            - key: "NET_RAW"
              operator: AnyNotIn
              value: "{{ element.securityContext.capabilities.drop || `[]` }}"
            - key: "ALL"
              operator: AnyNotIn
              value: "{{ element.securityContext.capabilities.drop || `[]` }}"
  - name: require-drop-net-raw-init
    match:
      any:
      - resources:
          kinds: [Pod]
    exclude:
      any:
      - resources:
          namespaces:
          - kube-system
          - monitoring
          - network-tools
    preconditions:
      all:
      - key: "{{ request.object.spec.initContainers[] || `[]` | length(@) }}"
        operator: GreaterThan
        value: "0"
    validate:
      message: >-
        CAP_NET_RAW must be dropped in init containers. Init containers that
        use ping for connectivity testing must use TCP alternatives instead.
      foreach:
      - list: "request.object.spec.initContainers"
        deny:
          conditions:
            all:
            - key: "NET_RAW"
              operator: AnyNotIn
              value: "{{ element.securityContext.capabilities.drop || `[]` }}"
            - key: "ALL"
              operator: AnyNotIn
              value: "{{ element.securityContext.capabilities.drop || `[]` }}"

The two rules cover main containers and init containers separately, with one foreach list per rule. The || `[]` default handles containers with no capabilities block defined at all; without the default, the expression would fail on a missing field rather than producing a deny.

The excluded namespaces need their own tighter controls: image digest pinning, restricted service accounts, and separate network-monitoring-specific policies that permit CAP_NET_RAW only to specific workloads from specific registries.
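Before switching a policy like this to Enforce, it helps to know which existing workloads would be rejected. The following is a minimal sketch of that audit over a parsed Pod object, such as one item from kubectl get pods -A -o json. The function name and the example pod are hypothetical, and the check mirrors the intent of the policy above rather than reproducing Kyverno’s evaluation.

```python
def containers_missing_net_raw_drop(pod: dict) -> list[str]:
    """Return 'field/name' for containers that drop neither NET_RAW nor ALL."""
    spec = pod.get("spec", {})
    offenders = []
    for field in ("initContainers", "containers", "ephemeralContainers"):
        for container in spec.get(field) or []:
            security_context = container.get("securityContext") or {}
            capabilities = security_context.get("capabilities") or {}
            dropped = {name.upper() for name in capabilities.get("drop") or []}
            if not dropped & {"NET_RAW", "ALL"}:
                offenders.append(f"{field}/{container['name']}")
    return offenders


# Hypothetical pod: the main container is hardened, the init container is not
example_pod = {
    "metadata": {"name": "demo"},
    "spec": {
        "initContainers": [{"name": "wait-for-db"}],
        "containers": [
            {"name": "app",
             "securityContext": {"capabilities": {"drop": ["ALL"]}}},
        ],
    },
}

print(containers_missing_net_raw_drop(example_pod))  # ['initContainers/wait-for-db']
```

Running the same function over every item in a cluster-wide pod listing gives a worklist of specs to fix before enforcement begins.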

3. Replace ping With Alternatives That Do Not Need CAP_NET_RAW

ping traditionally requires CAP_NET_RAW because it opens a raw ICMP socket (socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)). When CAP_NET_RAW is dropped, that call fails with a permission error unless the runtime has enabled unprivileged ICMP datagram sockets through the net.ipv4.ping_group_range sysctl. Most uses of ping in init containers are connectivity checks (“is this hostname reachable?”), not round-trip-time measurements. There are several alternatives that work without raw sockets:

# TCP connectivity check via /dev/tcp (bash built-in — no external binary required)
timeout 2 bash -c "echo > /dev/tcp/postgres.svc.cluster.local/5432" 2>/dev/null \
  && echo "postgres reachable" || echo "postgres unreachable"

# curl with a brief timeout — uses TCP, works with HTTP and non-HTTP ports
curl --silent --max-time 2 --output /dev/null \
  telnet://redis.svc.cluster.local:6379 \
  && echo "redis reachable"

# netcat if available in the image
nc -z -w 2 kafka.svc.cluster.local 9092 && echo "kafka reachable"

The /dev/tcp approach is the lightest: no external binary, no capability required, works wherever bash is the init container shell.
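Where the image ships Python rather than bash, the same TCP check can be written with the standard library. This is a minimal sketch; the function name and the two-second default timeout are assumptions chosen to match the shell examples above.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection() uses an ordinary stream socket: no raw socket,
        # and therefore no CAP_NET_RAW, is involved
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```

An init container could call tcp_reachable("postgres.svc.cluster.local", 5432) in a retry loop and exit non-zero after a deadline, which gives the same gating behaviour as the ping-based pattern.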

For cases where ICMP ping specifically is needed (latency measurement, ICMP-based liveness probes), an alternative is to set the cap_net_raw file capability on the ping binary inside the container image, so that ordinary processes in the container do not hold CAP_NET_RAW in their effective set. The distinction is important:

  • Container capabilities (set in securityContext.capabilities.add) are granted to every process that runs as root in the container regardless of which binary it is. Any such process, including exploit shellcode, can open an AF_PACKET socket.
  • File capabilities (setcap cap_net_raw+ep /usr/bin/ping) are attached to a specific executable. The capability becomes effective only when that binary is executed. A process that does not exec ping does not gain it.

FROM alpine:3.20
# setcap is provided by libcap (package name assumed; it may differ between Alpine releases)
RUN apk add --no-cache iputils libcap && \
    setcap cap_net_raw+ep /usr/bin/ping
# a non-root process can now run ping without holding CAP_NET_RAW itself

File capabilities are still limited by the container’s bounding set: with drop: [ALL] and no add, exec of this ping binary fails with EPERM, because the kernel cannot grant a capability the bounding set excludes. The workable configuration keeps NET_RAW in the bounding set (drop: [ALL], add: [NET_RAW]), runs the container as a non-root user so that its processes start with empty permitted and effective sets, and leaves allowPrivilegeEscalation unset, since no_new_privs prevents file capabilities from taking effect. Under that configuration an attacker who has code execution inside the container cannot open an AF_PACKET socket directly; they could only exec the ping binary, which does not help with raw setsockopt() exploitation. The attack surface is meaningfully reduced even if not fully eliminated.
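How file capabilities interact with the container’s bounding set follows from the execve() transformation rules in capabilities(7). The following is a sketch of the permitted-set rule for a non-root process, with ambient capabilities and no_new_privs left out for brevity. The function name is hypothetical.

```python
CAP_NET_RAW = 1 << 13  # capability bit 13


def permitted_after_exec(file_permitted: int, file_effective: bool,
                         proc_bounding: int, proc_inheritable: int = 0,
                         file_inheritable: int = 0) -> int:
    """Permitted set after execve() of a non-root process, per capabilities(7).

    Raises PermissionError when the file's effective bit is set but some
    file-permitted capability could not be gained, which is the case in
    which the kernel fails the exec with EPERM.
    """
    gained = (proc_inheritable & file_inheritable) | (file_permitted & proc_bounding)
    if file_effective and (file_permitted & ~gained):
        raise PermissionError("execve: Operation not permitted")
    return gained


# NET_RAW kept in the bounding set: exec of the setcap'd ping gains it
print(hex(permitted_after_exec(CAP_NET_RAW, True, proc_bounding=CAP_NET_RAW)))  # 0x2000

# Empty bounding set (drop: [ALL], no add): the same exec is refused
try:
    permitted_after_exec(CAP_NET_RAW, True, proc_bounding=0)
except PermissionError as exc:
    print(exc)
```

The second case is why a file capability on ping needs NET_RAW to remain in the bounding set even though no ordinary process in the container holds it.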

4. Seccomp Profile Blocking AF_PACKET Socket Creation

Dropping CAP_NET_RAW is the correct first control. Seccomp adds a second, independent layer that catches cases where CAP_NET_RAW is legitimately required by a workload but AF_PACKET specifically should not be accessible. The socket() syscall takes a domain argument: AF_INET is 2, AF_PACKET is 17. A seccomp profile can allow socket() for all other address families while blocking AF_PACKET:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_AARCH64"
  ],
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 17,
          "op": "SCMP_CMP_NE"
        }
      ]
    },
    {
      "names": [
        "accept", "accept4", "bind", "connect", "getpeername", "getsockname",
        "getsockopt", "listen", "recv", "recvfrom",
        "recvmsg", "recvmmsg", "send", "sendmsg", "sendmmsg", "sendto",
        "setsockopt", "shutdown", "socketpair"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

The socket rule uses SCMP_CMP_NE (not-equal) on argument index 0 (the domain argument): allow socket() when the domain is anything other than 17 (AF_PACKET). All other socket operations are allowed unconditionally because they operate on file descriptors already created; the restriction is on the creation of the dangerous socket type.

Note that SCMP_ACT_ERRNO with defaultErrnoRet: 1 returns EPERM for all syscalls not explicitly listed. A production profile would enumerate the full set of syscalls the application actually requires rather than using this abbreviated form. The RuntimeDefault seccomp profile (Docker’s default, also the Kubernetes default when seccompProfile.type: RuntimeDefault is set) does not block AF_PACKET socket creation — this is a common misunderstanding. The runtime default profile is designed to block clearly dangerous syscalls (ptrace, kexec_load, create_module) while preserving broad compatibility. socket() with AF_PACKET is left open.
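The effect of the SCMP_CMP_NE rule can be checked offline before the profile is shipped to nodes. The following is a simplified sketch that evaluates only the socket rule and only comparisons on argument index 0. The function names are hypothetical, and this is not how libseccomp itself compiles or evaluates filters.

```python
AF_INET, AF_PACKET = 2, 17


def _arg_matches(arg: dict, value: int) -> bool:
    """Evaluate a single argument comparison (only NE and EQ are modelled)."""
    if arg["op"] == "SCMP_CMP_NE":
        return value != arg["value"]
    if arg["op"] == "SCMP_CMP_EQ":
        return value == arg["value"]
    raise ValueError(f"operator not modelled: {arg['op']}")


def socket_action(profile: dict, domain: int) -> str:
    """Return the action the profile would take for socket(domain, ...)."""
    for rule in profile["syscalls"]:
        if "socket" not in rule["names"]:
            continue
        checks = [a for a in rule.get("args", []) if a["index"] == 0]
        if all(_arg_matches(a, domain) for a in checks):
            return rule["action"]
    return profile["defaultAction"]


# Trimmed copy of the profile above: only the socket rule is kept
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["socket"], "action": "SCMP_ACT_ALLOW",
         "args": [{"index": 0, "value": 17, "op": "SCMP_CMP_NE"}]},
    ],
}

print(socket_action(profile, AF_INET))    # SCMP_ACT_ALLOW
print(socket_action(profile, AF_PACKET))  # SCMP_ACT_ERRNO
```

A check like this fits in CI next to the profile file, so that an edit which accidentally re-allows domain 17 is caught before rollout.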

Apply the profile via pod spec:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-af-packet.json

The profile file must be present on every node at the path <kubelet-seccomp-root>/profiles/block-af-packet.json. The kubelet seccomp root defaults to /var/lib/kubelet/seccomp/.

5. Kernel Patch Verification

Before relying on the controls above to be the only barrier, verify the running kernel version on each node:

# Check kernel version
uname -r

# CVE-2020-14386: fixed in 5.9.0 and backported to stable branches as of September 2020
# If kernel is < 5.9, check the package changelog for a vendor backport
# (Debian/Ubuntu path first, then RPM; locations vary by distribution):
zgrep -s "CVE-2020-14386" "/usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz" || \
  rpm -q --changelog "kernel-$(uname -r)" 2>/dev/null | grep "CVE-2020-14386"

# CVE-2021-22600 fix versions:
# Upstream:      >= 5.16.3
# Stable 5.15.x: >= 5.15.17
# Stable 5.10.x: >= 5.10.94
# Other series:  check the distribution's security tracker
python3 -c "
import subprocess
# Strip the distro suffix (e.g. '-generic') and pad to three numeric fields
kver = subprocess.check_output(['uname', '-r']).decode().strip().split('-')[0]
parts = [int(x) for x in kver.split('.')[:3] if x.isdigit()]
major, minor, patch = (parts + [0, 0, 0])[:3]
# Upstream version checks only; distro kernels may carry backported fixes
cve_2020 = (major, minor) >= (5, 9) or ((major, minor) == (5, 8) and patch >= 7)
cve_2021 = (major, minor, patch) >= (5, 16, 3) or \
           ((major, minor) == (5, 15) and patch >= 17) or \
           ((major, minor) == (5, 10) and patch >= 94)
print(f'Kernel: {kver}')
print(f'CVE-2020-14386 fixed upstream: {cve_2020}')
print(f'CVE-2021-22600 fixed upstream: {cve_2021}')
"

Verifying kernel versions across a large fleet is best done from the Kubernetes API rather than per-node SSH:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'

Cross-reference the output against the fixed version list. Any node running a kernel below 5.15.17 (for the 5.15 series), below 5.10.94 (for the 5.10 series), or below 5.16.3 (for 5.16+) is vulnerable to CVE-2021-22600. Any node below 5.9 (without a vendor backport) is vulnerable to CVE-2020-14386.
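The cross-reference can be automated. The following is a sketch that classifies kernel release strings against the fix versions quoted in this article. The function names and sample node names are hypothetical, and a flagged node is only a candidate for review because vendor kernels often carry backported fixes under an older version number.

```python
import re

# Fix versions as quoted in this article; vendor kernels may carry backports
CVE_2021_22600_FIXED_PATCH = {(5, 10): 94, (5, 15): 17, (5, 16): 3}


def parse_kernel(release: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match[1]), int(match[2]), int(match[3] or 0)


def cves_to_review(release: str) -> list[str]:
    """Return the CVE IDs for which this version is below the listed fixes."""
    major, minor, patch = parse_kernel(release)
    flagged = []
    if (4, 6, 0) <= (major, minor, patch) < (5, 8, 7):
        flagged.append("CVE-2020-14386")
    fixed_patch = CVE_2021_22600_FIXED_PATCH.get((major, minor))
    if fixed_patch is not None:
        if patch < fixed_patch:
            flagged.append("CVE-2021-22600")
    elif (5, 0) <= (major, minor) < (5, 16):
        flagged.append("CVE-2021-22600")  # series with no fix listed here
    return flagged


# Sample lines in the "name<TAB>kernelVersion" shape printed by the kubectl command
sample = "node-a\t5.15.0-91-generic\nnode-b\t6.1.55\n"
for line in sample.splitlines():
    name, release = line.split("\t")
    print(name, release, cves_to_review(release) or "ok")
```

Piping the real kubectl output through the same loop turns the version list into a per-node worklist.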

6. Falco Detection Rule

Dropping CAP_NET_RAW and patching the kernel closes the direct exploit path. Falco provides runtime detection for cases where the controls are incomplete — a workload in an excluded namespace, a newly deployed pod spec that slipped past the Kyverno policy, or a node that lagged on kernel updates.

- rule: AF_PACKET socket created in container
  desc: >
    Raw packet socket creation from a container process. AF_PACKET sockets
    require CAP_NET_RAW and are the entry point for CVE-2020-14386 and
    CVE-2021-22600. Legitimate uses (tcpdump, Suricata, Zeek) should be
    explicitly allowlisted by container name.
  condition: >
    evt.type = socket and
    evt.dir = < and
    container and
    evt.arg.domain = AF_PACKET and
    not proc.name in (tcpdump, suricata, zeek, tshark, dumpcap)
  output: >
    AF_PACKET socket created in container
    (proc=%proc.name pid=%proc.pid user=%user.name
     container=%container.name image=%container.image.repository
     ns=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, network, cve-2020-14386, cve-2021-22600, cap_net_raw]

- rule: setsockopt PACKET_RX_RING in container
  desc: >
    PACKET_RX_RING configuration on an AF_PACKET socket from a container.
    This setsockopt call sets up the ring buffer that both
    CVE-2020-14386 (integer overflow in the receive path) and
    CVE-2021-22600 (double-free in packet_set_ring) depend on. Extremely
    unusual in production application containers.
  condition: >
    evt.type = setsockopt and
    evt.dir = < and
    container and
    evt.arg.optname = PACKET_RX_RING
  output: >
    PACKET_RX_RING setsockopt in container — potential AF_PACKET exploit
    (proc=%proc.name pid=%proc.pid
     container=%container.name image=%container.image.repository
     ns=%k8s.ns.name pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [container, network, cve-2020-14386, cve-2021-22600, cap_net_raw]

The first rule generates a WARNING on AF_PACKET socket creation from any container process that is not a known legitimate packet capture tool. Tune the proc.name allowlist for your environment — Suricata, Zeek, and tcpdump are the most common legitimate users.

The second rule fires at CRITICAL on setsockopt with PACKET_RX_RING. This option instructs the kernel to allocate and configure the ring buffer, the step that both exploits depend on. There are almost no legitimate reasons for a production application container to call this. The rule has very low false-positive risk compared to the first rule and should alert immediately.

Expected Behaviour

After applying drop: [ALL] to a container’s securityContext, the capability bitmask visible in /proc/self/status changes significantly:

# Inside a container with default capabilities (no drop)
cat /proc/self/status | grep Cap
CapInh:	0000000000000000
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000

Decode the bitmask with capsh --decode=00000000a80425fb — this includes CAP_NET_RAW (bit 13, value 0x2000). After drop: [ALL] and add: [NET_BIND_SERVICE]:

# Inside a container with drop: [ALL], add: [NET_BIND_SERVICE]
cat /proc/self/status | grep Cap
CapInh:	0000000000000000
CapPrm:	0000000000000400
CapEff:	0000000000000400
CapBnd:	0000000000000400
CapAmb:	0000000000000000

0x400 is CAP_NET_BIND_SERVICE (bit 10) only. CAP_NET_RAW is absent from every set. Attempting to open an AF_PACKET socket from within this container:
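Where capsh is not installed in the image, the same decoding takes a few lines of Python. This is a minimal sketch: the table lists only the 14 capabilities present in the default mask shown above, and any other bit is printed by number.

```python
# Bit positions from <linux/capability.h> for the default container set
CAP_NAMES = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER", 4: "CAP_FSETID",
    5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE", 13: "CAP_NET_RAW", 18: "CAP_SYS_CHROOT",
    27: "CAP_MKNOD", 29: "CAP_AUDIT_WRITE", 31: "CAP_SETFCAP",
}


def decode_caps(hex_mask: str) -> list[str]:
    """Translate a CapEff/CapPrm/CapBnd hex string into capability names."""
    mask = int(hex_mask, 16)
    return [CAP_NAMES.get(bit, f"bit{bit}") for bit in range(64) if mask >> bit & 1]


print(decode_caps("00000000a80425fb"))  # the default set, including CAP_NET_RAW
print(decode_caps("0000000000000400"))  # ['CAP_NET_BIND_SERVICE']
```

Checking for "CAP_NET_RAW" in decode_caps(...) of the CapBnd line is a quick way to confirm from inside a container that the drop took effect.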

python3 -c "import socket; socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)"
# PermissionError: [Errno 1] Operation not permitted

The capability check at the top of packet_create() fails (ns_capable() with CAP_NET_RAW returns false), the function returns -EPERM, and the syscall exits before reaching any of the vulnerable ring buffer code.

When the Falco rule fires on a container that somehow has CAP_NET_RAW and creates an AF_PACKET socket, the alert appears in Falco’s output within milliseconds of the syscall:

2026-05-08T14:23:11.847Z WARNING kernel: AF_PACKET socket created in container
  proc=python3 pid=48291 user=nobody
  container=debug-nettools image=nicolaka/netshoot
  ns=production pod=app-debug-7d9f4b-xkv2p

This alert combined with the PACKET_RX_RING rule gives two detection points: one at socket creation (earlier, higher false-positive rate) and one at ring buffer configuration (later, essentially no false positives for application containers).

Trade-offs

Dropping CAP_NET_RAW is the highest-confidence mitigation but has operational consequences:

  • ping stops working inside containers. This breaks any readinessProbe, livenessProbe, or init container that invokes ping to test connectivity. This is the most common breakage in production. The fix is to replace ping with TCP-based connectivity checks (see section 3) rather than to restore CAP_NET_RAW. The operational investment is a one-time refactoring of probe commands.
  • DHCP clients running inside containers (common in some CNI configurations and in containers that need to acquire addresses dynamically) use raw sockets and require CAP_NET_RAW. These workloads belong in the excluded namespaces with tighter per-workload controls.
  • tcpdump and tshark stop working inside containers without CAP_NET_RAW. For containers used in network debugging workflows, this is intentional: packet capture should happen at the node level (through kubectl debug node/<node-name> or node-level monitoring DaemonSets) rather than from within application containers.
  • ARP tooling (arping, custom ARP implementations) uses AF_PACKET or AF_INET with SOCK_RAW. These are rarely needed in application containers.

File capabilities on ping are a reasonable compromise for container images that legitimately need ping but should not expose AF_PACKET to every container process. The limitation is precision: file capabilities protect the specific binary but not the general case. An attacker who has remote code execution in the container and wants to use AF_PACKET directly, not via ping, still cannot, because the file capability only takes effect on exec of that specific binary. The cost is that NET_RAW must stay in the container’s bounding set and allowPrivilegeEscalation cannot be set to false, so any other binary that carries the file capability, or any process that runs as root, regains access. Attaching the capability to a new binary requires CAP_SETFCAP, which drop: [ALL] removes; readOnlyRootFilesystem: true additionally prevents an attacker from planting binaries in the image.

Seccomp AF_PACKET blocking is more surgical than dropping CAP_NET_RAW entirely. It allows workloads that need other raw socket operations (AF_INET SOCK_RAW for ICMP) to function while blocking the specific AF_PACKET entry point. The trade-off is that a custom seccomp profile must be maintained, distributed to every node, and updated when application syscall requirements change. The Kubernetes RuntimeDefault profile is not sufficient — it deliberately does not block socket() calls because the syscall is too broadly used.

Falco alerting on AF_PACKET creation generates noise from legitimate network monitoring workloads. A production cluster typically runs Prometheus node_exporter, Netdata, or a network flow exporter in a monitoring namespace — these create AF_PACKET sockets routinely. Tune the Falco condition to exclude these by container name or namespace before enabling the rule. The PACKET_RX_RING setsockopt rule is safer to alert on without tuning: legitimate monitoring tools use the ring buffer interface, but that is a known, enumerable set. Alert on everything except suricata, zeek, and dumpcap and the false-positive rate will be low.

Failure Modes

Relying on Docker’s default seccomp profile to block AF_PACKET: It does not. Docker’s default seccomp profile (moby/profiles/seccomp/default.json) blocks keyctl, add_key, request_key, mbind, migrate_pages, get_mempolicy, set_mempolicy, and a list of other clearly dangerous syscalls. It does not block socket() calls based on the address family argument. socket(AF_PACKET, SOCK_RAW, 0) returns successfully under the default profile if the process has CAP_NET_RAW. The seccomp profile and the capability check are independent controls. Both must be addressed.

Adding CAP_NET_RAW “because ping needs it”: This is the most common source of unnecessary exposure. Developers and platform engineers encounter a failing readinessProbe, trace it to a ping invocation in an init container, note that ping requires CAP_NET_RAW, and add it to the pod spec. The security implication — that CAP_NET_RAW simultaneously enables raw ICMP for ping and raw link-layer access for AF_PACKET, and that AF_PACKET has been the vector for multiple kernel privilege escalation CVEs — is not surfaced in the error message. The correct response is to replace ping with a TCP connectivity check, not to restore the capability.

Not tracking capability grants in init containers separately from main containers: Kubernetes applies securityContext at both the pod level and the container level, and init containers are distinct from main containers in the pod spec. A pod that drops ALL in its main container securityContext but does not configure any securityContext on its init containers will run those init containers with the container runtime’s default capabilities — which include CAP_NET_RAW. The Kyverno policy in section 2 addresses this by adding a separate rule for initContainers, but policies that only check spec.containers will miss this.

Assuming managed Kubernetes applies restrictive seccomp or capability defaults: EKS, GKE, and AKS do not apply RuntimeDefault seccomp profiles or drop capabilities by default unless explicitly configured. GKE Autopilot is an exception: it applies more restrictive defaults including seccomp. Standard/manual node pools in all three cloud providers run containers with the full default capability set including CAP_NET_RAW, and with no seccomp profile applied unless seccompProfile.type is specified in the pod spec. The kubelet’s --seccomp-default flag, stable since Kubernetes 1.27, makes RuntimeDefault the default profile for all pods, but it is opt-in and is not enabled by default in managed cluster configurations. Even when enabled, RuntimeDefault does not block AF_PACKET socket creation, as noted above.

Assuming that patching eliminates the need for the other controls: CVE-2020-14386 and CVE-2021-22600 are two known AF_PACKET vulnerabilities with public exploits. The AF_PACKET code path, particularly the tpacket_v3 ring buffer implementation, is complex, has been responsible for several other CVEs in the 2016–2021 period (CVE-2016-8655, CVE-2017-7308), and is plausible future vulnerability territory. Dropping CAP_NET_RAW from containers that do not need it is the correct posture regardless of current kernel patch status: it eliminates the entire AF_PACKET attack surface, not just the two named CVEs.