Netfilter CVE-2022-1015 and CVE-2022-1016: Kernel Memory Corruption from Container Network Rules

The Problem

Every Kubernetes pod that runs with CAP_NET_ADMIN can read and corrupt kernel memory through a pair of vulnerabilities in the nftables subsystem. CVE-2022-1015, discovered and reported by David Bouman, is an out-of-bounds access in the nf_tables register validation code. CVE-2022-1016, found during analysis of the same code, is an information leak: the register bank used during rule evaluation was not initialised, so expressions could read stale kernel stack data. Together they provide an address leak and a memory-write primitive against the kernel from a process that can load nftables rules, a precondition met by many Kubernetes workloads.

Understanding why requires understanding how nftables is structured. nftables replaced iptables as the Linux kernel’s primary packet filtering framework and is the backend for the nft userspace tool as well as the iptables-nft compatibility layer. The data model is hierarchical: tables group chains, chains hold rules, and rules contain expressions, the evaluatable units that match packet fields, perform arithmetic, load and store register values, and make verdicts. Expressions read and write to a shared register bank. The bank can be addressed in two register sizes: 128-bit NFT_REG (four 32-bit sub-registers) and 32-bit NFT_REG32. The distinction matters for the vulnerability.

The out-of-bounds access in CVE-2022-1015 lives in the register validation helpers, nft_validate_register_load() and nft_validate_register_store(), which check that an expression reads from or writes to a valid register with a valid length. The register index arrives from userspace as a 32-bit value and is first translated by nft_parse_register() into an offset counted in 32-bit words, because NFT_REG32 registers start at a non-zero offset within the register bank. The flaw was in the arithmetic that followed: the check computed index × 4 + length in 32-bit arithmetic and compared the result with the size of the bank, then stored the index in an 8-bit field. A large enough index makes the sum wrap around and pass the check, while its low byte still addresses memory up to about 1 KiB from the start of the bank. The bank is a struct nft_regs on the kernel stack of nft_do_chain(), so the result is an attacker-controlled read or write past a stack buffer, with the length and part of the content taken from userspace-supplied expressions.
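The wrap-around is easier to see with numbers. The following is a minimal Python model of the check, assuming the publicly documented register constants; the function names and bodies are an illustrative reconstruction, not kernel source, and the model only covers data registers.

```python
# Simplified model of the register bounds check before and after the fix.
# Constants mirror the nf_tables UAPI; function bodies are illustrative only.
U32 = 0xFFFFFFFF
NFT_REG_4 = 4            # last 128-bit register number
NFT_REG32_00 = 8         # first 32-bit register number
NFT_REG32_15 = 23        # last 32-bit register number
NFT_REG_SIZE = 16        # bytes per 128-bit register
NFT_REG32_SIZE = 4       # bytes per 32-bit register
REG_BANK_BYTES = 20 * 4  # verdict register (4 words) + 16 data words


def parse_register(user_value: int) -> int:
    """Map a userspace register number to a 32-bit word offset (u32 math)."""
    if user_value <= NFT_REG_4:
        return (user_value * NFT_REG_SIZE // NFT_REG32_SIZE) & U32
    return (user_value + NFT_REG_SIZE // NFT_REG32_SIZE - NFT_REG32_00) & U32


def prefix_check_passes(user_value: int, length: int) -> bool:
    """Flawed check: the multiplication and addition wrap at 32 bits."""
    reg = parse_register(user_value)
    if reg < NFT_REG_SIZE // NFT_REG32_SIZE or length == 0:
        return False
    return ((reg * NFT_REG32_SIZE + length) & U32) <= REG_BANK_BYTES


def postfix_check_passes(user_value: int, length: int) -> bool:
    """Corrected behaviour: reject register numbers outside the defined ranges."""
    in_range = user_value <= NFT_REG_4 or NFT_REG32_00 <= user_value <= NFT_REG32_15
    return in_range and prefix_check_passes(user_value, length)


def stored_byte_offset(user_value: int) -> int:
    """Byte offset used at evaluation time: the index is kept in 8 bits."""
    return (parse_register(user_value) & 0xFF) * NFT_REG32_SIZE
```

In this model, the register number 0x80000003 with a 16-byte length passes the flawed check, yet its stored offset is 1020 bytes, far beyond the 80-byte bank. The corrected check rejects it because the number falls outside both defined register ranges.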

CVE-2022-1016 is an information leak in nft_do_chain(), the function that evaluates a chain’s rules against a packet. nft_do_chain() placed the register bank (struct nft_regs) on its stack without initialising it. A rule whose expressions read a register that no earlier expression has written therefore operates on whatever previous stack frames left behind, and an attacker can copy that data to a location it can read back. The leaked values include kernel pointers, which is enough to defeat KASLR. Combined with the out-of-bounds access in CVE-2022-1015, this provides the address leak and the write primitive needed to build a kernel ROP exploit.

The two bugs have different ranges. The uninitialised register bank behind CVE-2022-1016 dates back to the introduction of nf_tables in Linux 3.13, while the exploitable form of CVE-2022-1015 was introduced in Linux 5.12. Both were fixed upstream during the 5.18 development cycle and in the stable releases published at disclosure on 28 March 2022: 5.17.1, 5.16.18, 5.15.32, 5.10.109, and 5.4.188. Kernels from these series were running on the nodes of most production Kubernetes clusters at the time. The fix for CVE-2022-1015 makes register parsing reject any index outside the defined NFT_REG and NFT_REG32 ranges before the bounds check runs. The fix for CVE-2022-1016 zero-initialises the register bank at the start of nft_do_chain(), so a rule can no longer read stale stack data.

The key enabler in Kubernetes is CAP_NET_ADMIN. This capability is required to configure interfaces, manage routes, and load nftables rules. It is not part of the default container capability set, but it is granted by explicit configuration to a wide category of workloads: CNI plugin daemonsets, service mesh init containers (Istio’s istio-init, Linkerd’s linkerd-init), network diagnostic pods (tools like netshoot), and any workload that needs to modify firewall rules. Kubernetes’s SecurityContext.capabilities.add makes it trivial to grant and easy to overlook. A pod with CAP_NET_ADMIN can load nftables rules into its own network namespace directly, and a pod that can create a new user and network namespace gains the capability inside that namespace, which is all the exploit needs.

Threat Model

A pod running with CAP_NET_ADMIN is the attacker’s starting point. The pod does not need to be running as root. It does not need CAP_SYS_ADMIN. It does not need access to the host network namespace — user network namespaces are sufficient. The attack path proceeds as follows:

The compromised or malicious container creates a new user and network namespace using unshare(CLONE_NEWUSER | CLONE_NEWNET). This operation succeeds for unprivileged processes in most Linux configurations (on Debian/Ubuntu derivatives it is controlled by /proc/sys/kernel/unprivileged_userns_clone, which is enabled in most container host images) unless a seccomp profile blocks the call. Within the new network namespace the process holds CAP_NET_ADMIN and loads a crafted nftables ruleset. The ruleset first uses CVE-2022-1016 to read uninitialised register contents and recover kernel addresses. It then uses register indices that pass the flawed validation in nft_validate_register_load() and nft_validate_register_store() to read and write memory beyond the register bank on the kernel stack.
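Whether the first step is even possible on a node can be checked from two sysctls. The sketch below is a defensive check, not part of any exploit; it assumes the Debian/Ubuntu-specific kernel.unprivileged_userns_clone sysctl named above and the mainline user.max_user_namespaces sysctl, and it does not account for seccomp or LSM restrictions that may block unshare() independently. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_sysctl(path: str) -> Optional[str]:
    """Return the stripped contents of a sysctl file, or None if it is absent."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def unprivileged_userns_enabled(clone_flag: Optional[str],
                                max_userns: Optional[str]) -> bool:
    """Decide from two sysctl values whether unprivileged user namespaces work.

    clone_flag: kernel.unprivileged_userns_clone (distro patch; None elsewhere)
    max_userns: user.max_user_namespaces (0 disables user namespaces)
    """
    if clone_flag == "0":
        return False
    if max_userns == "0":
        return False
    return True


if __name__ == "__main__":
    enabled = unprivileged_userns_enabled(
        read_sysctl("/proc/sys/kernel/unprivileged_userns_clone"),
        read_sysctl("/proc/sys/user/max_user_namespaces"),
    )
    print("unprivileged user namespaces:", "enabled" if enabled else "disabled")
```

Run on the node (or in a privileged debug pod) rather than in an application container, since the answer describes the host kernel configuration.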

With an address leak and an out-of-bounds write established, the exploit overwrites a saved return address on the kernel stack and pivots into a ROP chain. The usual payload replaces the credentials of the current task with a root credential set (uid=0, gid=0, all capabilities) and moves the task out of the container’s namespaces. The result is a container process with full kernel privileges: root on the physical or virtual host node, not just within the container namespace.

Network namespace isolation does not prevent this. The confusion here is common and dangerous: network namespaces limit which interfaces, routing tables, and firewall rules a process sees. They do not create a separate kernel instance. The vulnerability is in the kernel’s own expression evaluation code, which executes in kernel space regardless of which network namespace the triggering process is in. Creating a user network namespace to load the nftables rules is the trigger, not the protection boundary.

In managed Kubernetes environments (EKS, GKE, AKS), node compromise immediately exposes the instance metadata API. On AWS, http://169.254.169.254/latest/meta-data/iam/security-credentials/ lists the instance’s attached IAM role, and appending the role name returns its temporary credentials. On GKE, http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token returns a short-lived OAuth token for the node’s service account. Neither endpoint requires a secret: GCE only expects the Metadata-Flavor: Google request header, and IMDSv2 on AWS only expects a session token that any process on the instance can request. A node-level IAM role with EC2 or GCE permissions can be used to enumerate and compromise the broader cloud environment, turning one compromised pod into cloud account access.

The multi-tenant blast radius is broader than it might appear. After achieving node root, the attacker can read /proc/<pid>/environ for every process on the node — including all co-located pods, whose secrets Kubernetes injects as environment variables. Every pod on the same node that has POSTGRES_PASSWORD, AWS_SECRET_ACCESS_KEY, or KUBECONFIG in its environment is now compromised. etcd credentials, service account tokens mounted at /var/run/secrets/kubernetes.io/serviceaccount/token, and TLS certificates stored in pod filesystems are all accessible via /proc from the host. The attacker does not need to break out of each pod individually — host root gives access to every pod’s process memory and filesystem simultaneously through /proc and the container runtime’s union filesystem mounts.

Hardening Configuration

1. Drop CAP_NET_ADMIN in Pod Security Context

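The minimal fix is preventing the exploit’s precondition from being met. A pod that never receives CAP_NET_ADMIN cannot load nftables rules into its own network namespace. Dropping the capability is not sufficient on its own: the pod must also be unable to create a new user namespace, which the RuntimeDefault seccomp profile in the manifest below typically blocks for processes without CAP_SYS_ADMIN.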

apiVersion: v1
kind: Pod
metadata:
  name: application
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
          add: []   # add back only what's genuinely needed

The drop: ["ALL"] line removes every capability from the container’s capability sets before the container starts. The add: [] field makes the intent explicit: nothing is being added back. For workloads that currently receive CAP_NET_ADMIN (through an explicit capabilities.add entry or privileged: true), this is a breaking change; see the Trade-offs section.

Enforce this at admission time using Kyverno. The following policy denies any pod that requests CAP_NET_ADMIN outside a designated privileged namespace, and denies any pod that does not explicitly drop ALL:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-cap-net-admin
  annotations:
    policies.kyverno.io/title: Restrict CAP_NET_ADMIN
    policies.kyverno.io/severity: high
    policies.kyverno.io/subject: Pod
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: deny-cap-net-admin
      match:
        any:
          - resources:
              kinds: [Pod]
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                - istio-system
                - network-tools
      validate:
        message: "CAP_NET_ADMIN is not permitted outside privileged namespaces."
        deny:
          conditions:
            any:
              - key: "NET_ADMIN"
                operator: AnyIn
                value: "{{ request.object.spec.containers[].securityContext.capabilities.add[] }}"
              - key: "NET_ADMIN"
                operator: AnyIn
                value: "{{ request.object.spec.initContainers[].securityContext.capabilities.add[] }}"
    - name: require-drop-all
      match:
        any:
          - resources:
              kinds: [Pod]
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                - istio-system
                - network-tools
      validate:
        message: "Pods must drop ALL capabilities in every container's securityContext."
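        foreach:
          # Cover init and ephemeral containers too, matching the message above
          - list: "request.object.spec.[ephemeralContainers, initContainers, containers][]"
            deny:
              conditions:
                any:
                  - key: "ALL"
                    operator: AnyNotIn
                    # Fall back to an empty list when drop is not set at all
                    value: "{{ element.securityContext.capabilities.drop[] || `[]` }}"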

The exclude namespaces (kube-system, istio-system, network-tools) require their own tighter controls — admission policies scoped to those namespaces that enforce image digests and restrict which service accounts can run privileged pods.
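The same two checks can be run offline against manifests in a repository or CI job before the admission policy ever sees them. This is a minimal sketch of the policy's intent, assuming the manifest has already been parsed into a Python dict; the function name is hypothetical and the logic is not a substitute for the admission controller.

```python
from typing import Dict, List

CONTAINER_KINDS = ("initContainers", "containers", "ephemeralContainers")


def capability_violations(pod: Dict) -> List[str]:
    """Return human-readable violations for one pod manifest (as a dict)."""
    violations = []
    spec = pod.get("spec") or {}
    for kind in CONTAINER_KINDS:
        for container in spec.get(kind) or []:
            name = container.get("name", "<unnamed>")
            context = container.get("securityContext") or {}
            caps = context.get("capabilities") or {}
            added = {c.upper() for c in caps.get("add") or []}
            dropped = {c.upper() for c in caps.get("drop") or []}
            # privileged: true grants every capability, NET_ADMIN included
            if context.get("privileged"):
                violations.append(f"{kind}/{name}: privileged")
            if added & {"NET_ADMIN", "CAP_NET_ADMIN", "ALL"}:
                violations.append(f"{kind}/{name}: adds NET_ADMIN or ALL")
            if "ALL" not in dropped:
                violations.append(f"{kind}/{name}: does not drop ALL")
    return violations
```

A pod whose only container sets drop: ["ALL"] and adds nothing yields an empty list; a container with no securityContext at all is reported for not dropping ALL.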

2. AppArmor Profile Blocking nftables Interaction

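AppArmor can deny the specific system interactions that nftables loading requires. The critical restriction is preventing the creation of AF_NETLINK sockets, because a NETLINK_NETFILTER socket is the only channel through which nftables rules are loaded into the kernel. AppArmor network rules match on address family and socket type rather than on netlink protocol, so the profile below denies the netlink family as a whole.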

#include <tunables/global>

profile k8s-restricted-container flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Deny netlink socket creation — blocks nftables, iptables, conntrack from container
  deny network netlink,

  # Deny access to netfilter proc entries
  deny /proc/net/nf_conntrack r,
  deny /proc/net/nf_conntrack_expect r,
  deny /proc/sys/net/netfilter/** rwklx,

  # Deny nft and iptables binaries if present in container image
  deny /usr/sbin/nft x,
  deny /sbin/iptables x,
  deny /sbin/ip6tables x,
  deny /sbin/xtables-legacy-multi x,

  # Standard container permissions (read-only rootfs assumed)
  file,
  network inet tcp,
  network inet udp,
  network inet6 tcp,
  network inet6 udp,
}

Apply the profile to pods via annotation:

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: localhost/k8s-restricted-container

The deny network netlink rule is the load-bearing line. Netlink is the socket family used by nft, ip, and most other kernel network configuration tools. A container that cannot create a SOCK_RAW or SOCK_DGRAM socket in AF_NETLINK cannot send NFT_MSG_NEWRULE messages to the kernel, and therefore cannot trigger the vulnerable code path. Each blocked attempt is logged to the kernel audit subsystem as an AVC record with apparmor="DENIED".

3. Seccomp Profile Blocking Netlink

Seccomp provides a lower-level enforcement point: syscall filtering. The nftables exploit requires socket(AF_NETLINK, ...) and subsequent sendto or sendmsg calls to the netlink socket. A seccomp profile that blocks AF_NETLINK socket creation prevents the exploit entry point before any capability or AppArmor check.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 2,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Allow AF_INET (2) only"
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 10,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Allow AF_INET6 (10) only"
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 1,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Allow AF_UNIX (1)"
    },
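    {
      "names": [
        "setsockopt"
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 1,
          "value": 270,
          "op": "SCMP_CMP_NE"
        }
      ],
      "comment": "Allow setsockopt except at level SOL_NETLINK (270); an ERRNO rule here would be redundant with the default action and would leave setsockopt blocked entirely"
    },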
    {
      "names": [
        "read", "write", "open", "openat", "close", "stat", "fstat", "lstat",
        "poll", "lseek", "mmap", "mprotect", "munmap", "brk", "rt_sigaction",
        "rt_sigprocmask", "rt_sigreturn", "ioctl", "pread64", "pwrite64",
        "readv", "writev", "access", "pipe", "select", "sched_yield", "mremap",
        "msync", "mincore", "madvise", "dup", "dup2", "pause", "nanosleep",
        "getitimer", "alarm", "setitimer", "getpid", "sendfile", "connect",
        "accept", "sendto", "recvfrom", "sendmsg", "recvmsg", "shutdown",
        "bind", "listen", "getsockname", "getpeername", "getsockopt",
        "clone", "fork", "vfork", "execve", "exit", "wait4", "kill",
        "uname", "fcntl", "flock", "fsync", "fdatasync", "truncate",
        "ftruncate", "getdents", "getcwd", "chdir", "fchdir", "rename",
        "mkdir", "rmdir", "creat", "link", "unlink", "symlink", "readlink",
        "chmod", "fchmod", "chown", "fchown", "lchown", "umask", "gettimeofday",
        "getrlimit", "getrusage", "sysinfo", "times", "getuid", "getgid",
        "geteuid", "getegid", "setuid", "setgid", "getgroups",
        "prctl", "arch_prctl", "futex", "set_tid_address",
        "set_robust_list", "get_robust_list", "epoll_create", "epoll_ctl",
        "epoll_wait", "epoll_create1", "epoll_pwait",
        "eventfd", "eventfd2", "signalfd", "signalfd4",
        "inotify_init", "inotify_init1", "inotify_add_watch", "inotify_rm_watch",
        "timerfd_create", "timerfd_settime", "timerfd_gettime",
        "accept4", "recvmmsg", "sendmmsg", "getrandom",
        "fadvise64", "readahead", "getdents64", "pselect6", "ppoll",
        "exit_group", "tgkill", "waitid", "clock_gettime", "clock_getres",
        "clock_nanosleep", "seccomp", "memfd_create", "statx"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

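The critical exclusion is AF_NETLINK (value 16) from the allowed socket() families. The profile uses an allowlist model: only AF_INET (2), AF_INET6 (10), and AF_UNIX (1) socket families are permitted, so any socket(AF_NETLINK, ...) call returns EPERM. The setsockopt rule keeps options at level SOL_NETLINK (270) out of reach as defence-in-depth against future netlink socket manipulation vectors. Treat the general syscall list as a starting point rather than a complete allowlist: it omits calls that current glibc and Go runtimes rely on (newfstatat, rseq, clone3, and sigaltstack among them), so trace the workload before enforcing the profile.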
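The allowlist logic for socket() can be exercised without loading a filter. The sketch below evaluates a cut-down profile in the same JSON shape, assuming a simplified "first matching rule wins, otherwise the default action" reading of the format; real libseccomp filter generation is more involved, and the function names are hypothetical.

```python
from typing import Dict, List

AF_UNIX, AF_INET, AF_INET6, AF_NETLINK = 1, 2, 10, 16

# Cut-down profile: only the socket() family allowlist from the full profile.
PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["socket"],
            "action": "SCMP_ACT_ALLOW",
            "args": [{"index": 0, "value": family, "op": "SCMP_CMP_EQ"}],
        }
        for family in (AF_UNIX, AF_INET, AF_INET6)
    ],
}


def _matches(cond: Dict, args: List[int]) -> bool:
    """Evaluate one argument comparison (only EQ and NE are modelled)."""
    actual = args[cond["index"]]
    if cond["op"] == "SCMP_CMP_EQ":
        return actual == cond["value"]
    if cond["op"] == "SCMP_CMP_NE":
        return actual != cond["value"]
    raise ValueError(f"unsupported op {cond['op']}")


def seccomp_action(profile: Dict, syscall: str, args: List[int]) -> str:
    """Return the action this simplified model assigns to a syscall."""
    for rule in profile.get("syscalls", []):
        if syscall not in rule["names"]:
            continue
        if all(_matches(cond, args) for cond in rule.get("args", [])):
            return rule["action"]
    return profile["defaultAction"]
```

In this model, socket(AF_INET, ...) resolves to SCMP_ACT_ALLOW, while socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER), written as args [16, 3, 12], falls through to the default SCMP_ACT_ERRNO.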

Apply the profile to a pod:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/netfilter-restricted.json

4. Kernel Patch Verification

Verify that nodes are running a patched kernel before trusting any other control. The vulnerability is in kernel code — userspace controls reduce the attack surface but a patched kernel eliminates the vulnerability entirely.

# Check the running kernel version
uname -r

# Patched versions (upstream stable releases of 2022-03-28):
#   >= 5.17.1   (stable)
#   >= 5.16.18  (stable)
#   >= 5.15.32  (longterm)
#   >= 5.10.109 (longterm)
#   >= 5.4.188  (longterm; check your distro's backport)

# On Debian/Ubuntu — verify the installed package version
apt-cache show "linux-image-$(uname -r)" | grep -E '^(Version|Package)'

# On RHEL/CentOS/Rocky — check the rpm changelog
rpm -q kernel --changelog | grep -A2 "CVE-2022-1015\|CVE-2022-1016"

# On Amazon Linux 2
rpm -q kernel | head -1
# Compare against the fixed version in the Amazon Linux Security Center advisory for these CVEs

# For GKE nodes — check the node pool version; the fix is included in
# node image versions released after 2022-03-28
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name) \(.status.nodeInfo.kernelVersion) \(.status.nodeInfo.osImage)"'

For managed Kubernetes, compare node kernel versions against the distro’s security advisory. Most major distributions backported both patches within a week of upstream publication. The risk window for unpatched nodes extends until the next node pool rotation or forced update — which can be weeks in clusters with long node lifetimes.
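The version comparison is easy to get wrong by hand when suffixes such as -generic or .amzn2 are present. The following sketch parses a uname -r style string and compares it against a per-series minimum. The thresholds in EXAMPLE_MINIMUMS are an assumption taken from the upstream stable releases of 2022-03-28 and should be confirmed against your distribution's advisory; distro kernels often carry backported fixes without changing the upstream version number, so a False result here means "check the advisory", not "vulnerable".

```python
import re
from typing import Dict, Optional, Tuple

Version = Tuple[int, int, int]

# Assumed upstream minimums per stable series; verify before relying on them.
EXAMPLE_MINIMUMS: Dict[Tuple[int, int], Version] = {
    (5, 17): (5, 17, 1),
    (5, 16): (5, 16, 18),
    (5, 15): (5, 15, 32),
    (5, 10): (5, 10, 109),
    (5, 4): (5, 4, 188),
}


def parse_kernel_release(release: str) -> Optional[Version]:
    """Extract (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_patched(release: str,
               minimums: Dict[Tuple[int, int], Version]) -> Optional[bool]:
    """True/False when the series is listed, None when it needs manual review."""
    version = parse_kernel_release(release)
    if version is None:
        return None
    floor = minimums.get(version[:2])
    if floor is None:
        return None
    return version >= floor
```

Feeding it the kernelVersion column from the kubectl output above gives a per-node answer; series that are not in the table return None so that they are reviewed rather than silently passed.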

5. Network Policy to Limit Post-Compromise Lateral Movement

A NetworkPolicy cannot prevent the kernel exploit, but it limits what an attacker can do after achieving kernel code execution. If the attacker uses the node compromise to pivot to other pods’ credentials, network policies slow down the subsequent lateral movement.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-pod-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
    - Ingress
  ingress:
    - from:
        - podSelector: {}
  egress:
    - to:
        - podSelector: {}
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0   # except entries must fall inside cidr; 169.254.169.254 is not in 10.0.0.0/8
            except:
              - 169.254.169.254/32   # Block instance metadata API from application pods
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 80

The explicit except: 169.254.169.254/32 block is significant: it prevents application pods from reaching the instance metadata API directly. After a container escape, the attacker runs in the host network namespace, not the pod’s network namespace — so this NetworkPolicy does not block the attacker on the host. But it prevents a less privileged first-stage exploit (that does not achieve full container escape) from fetching IAM credentials through the metadata API.
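The ipBlock semantics can be checked with the standard library before a policy is applied: a destination is allowed when it falls inside cidr and outside every except entry, and each except entry is expected to be a subset of cidr. This is a minimal sketch with hypothetical function names; it models a single ipBlock peer only, not ports or the rest of NetworkPolicy evaluation.

```python
import ipaddress
from typing import Iterable


def ipblock_allows(destination: str, cidr: str,
                   excepts: Iterable[str] = ()) -> bool:
    """True when destination is inside cidr and outside every except range."""
    dest = ipaddress.ip_address(destination)
    if dest not in ipaddress.ip_network(cidr):
        return False
    return not any(dest in ipaddress.ip_network(e) for e in excepts)


def excepts_within_cidr(cidr: str, excepts: Iterable[str]) -> bool:
    """True when every except entry lies inside cidr."""
    outer = ipaddress.ip_network(cidr)
    return all(ipaddress.ip_network(e).subnet_of(outer) for e in excepts)
```

With cidr 0.0.0.0/0 and the metadata address as the only exception, ipblock_allows("169.254.169.254", ...) is False while an ordinary address such as 10.1.2.3 is allowed. excepts_within_cidr("10.0.0.0/8", ["169.254.169.254/32"]) is False, which flags an exception that can never take effect.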

6. Falco Detection Rules

Falco watches kernel syscalls and can detect the specific patterns that nftables rule loading from a container process produces. Two rules are relevant: detecting nftables ruleset modifications from container processes, and detecting unexpected AF_NETLINK socket creation.

- rule: nftables_ruleset_modification_from_container
  desc: >
    A process inside a container is creating or modifying nftables rules via
    a netlink socket. This is abnormal for application workloads and matches
    the precondition for CVE-2022-1015 / CVE-2022-1016 exploitation.
  condition: >
    evt.type = sendto
    and container
    and not container.image.repository in (known_cni_images, known_nettools_images)
    and fd.type = "netlink"
    and (
      evt.rawarg.buf contains "\x00\x0a\x00\x00"
      or evt.rawarg.buf contains "nft"
    )
  output: >
    nftables modification from container
    (user=%user.name container=%container.name image=%container.image.repository
    pod=%k8s.pod.name ns=%k8s.ns.name cmdline=%proc.cmdline pid=%proc.pid)
  priority: CRITICAL
  tags: [network, container-escape, cve-2022-1015, kernel]

- rule: unexpected_af_netlink_socket_from_container
  desc: >
    A container process created an AF_NETLINK socket. Application containers
    have no legitimate use for netlink. This matches the exploitation precondition
    for nftables kernel vulnerabilities including CVE-2022-1015 and CVE-2022-1016.
  condition: >
    evt.type = socket
    and evt.arg.domain = AF_NETLINK
    and container
    and not container.image.repository in (known_cni_images)
    and not proc.name in (ip, nft, iptables, conntrack, ss, tc)
  output: >
    AF_NETLINK socket created in container
    (user=%user.name container=%container.name image=%container.image.repository
    pod=%k8s.pod.name ns=%k8s.ns.name proc=%proc.name pid=%proc.pid
    parent=%proc.pname cmdline=%proc.cmdline)
  priority: WARNING
  tags: [network, netlink, container-escape, cve-2022-1015]

- list: known_cni_images
  items: [
    "calico/node",
    "calico/cni",
    "flannel/flannel",
    "cilium/cilium",
    "amazon-k8s-cni",
    "gcr.io/google-containers/pause"
  ]

- list: known_nettools_images
  items: [
    "nicolaka/netshoot",
    "busybox",
    "curlimages/curl"
  ]

The unexpected_af_netlink_socket_from_container rule fires on socket(AF_NETLINK, ...), which nft and any exploit that calls into the nftables API must invoke first. Falco resolves lists in load order, so in the rules file the two list definitions belong ahead of the rules that reference them. The known CNI images list must be maintained as your CNI plugin changes. Alert on CRITICAL priority rules through your SIEM with immediate incident response escalation; WARNING rules warrant investigation within the hour.

Expected Behaviour

A correctly hardened pod — with capabilities.drop: ["ALL"], the netfilter-restricted seccomp profile, and the AppArmor profile applied — produces the following observable behaviour when an attacker or misconfigured tool attempts to load nftables rules:

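The socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) syscall returns -1 EPERM (Operation not permitted). The seccomp profile’s default deny action blocks it before the capability check is reached. The kernel only logs SCMP_ACT_ERRNO results for filters loaded with the SECCOMP_FILTER_FLAG_LOG flag; with that flag set, the denial appears in the kernel audit log: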

type=SECCOMP msg=audit(1683542400.123:1247): auid=0 uid=1000 gid=1000
  ses=1 pid=12345 comm="nft" exe="/usr/sbin/nft"
  sig=0 arch=c000003e syscall=41 compat=0 ip=0x7f8b3c2a1234 code=0x50000

syscall=41 is socket on x86_64. code=0x50000 is SECCOMP_RET_ERRNO. This appears in /var/log/audit/audit.log on the node and in the pod’s stderr as nft: Failed to initialize Netlink socket: Operation not permitted.

If the seccomp profile is not in place but AppArmor is active, the socket creation is blocked at the AppArmor deny network netlink rule. The audit record looks different:

type=AVC msg=audit(1683542400.456:1248): apparmor="DENIED" operation="create"
  profile="k8s-restricted-container" pid=12346 comm="nft"
  family="netlink" sock_type="raw" protocol=12 requested_mask="create"
  denied_mask="create"

protocol=12 is NETLINK_NETFILTER. This is the AppArmor denial record. It appears in both the node’s audit log and in AppArmor’s own log at /var/log/syslog or /var/log/kern.log.
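Both record types are plain key=value text, so a log pipeline can classify them with a few lines of parsing. The sketch below assumes records in the shape shown above, joined onto one line; the function names are hypothetical. Note that a SECCOMP record carries the syscall number but not its arguments, so it can only show that socket() was denied, not which address family was requested.

```python
import shlex
from typing import Dict

SYS_SOCKET_X86_64 = "41"  # socket() syscall number on x86_64


def parse_audit_record(record: str) -> Dict[str, str]:
    """Split an audit record into key/value pairs, removing quotes."""
    fields = {}
    for token in shlex.split(record):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_apparmor_netlink_denial(fields: Dict[str, str]) -> bool:
    """AppArmor denied an operation on a netlink-family socket."""
    return fields.get("apparmor") == "DENIED" and fields.get("family") == "netlink"


def is_seccomp_socket_denial(fields: Dict[str, str]) -> bool:
    """Seccomp acted on a socket() call (the address family is not recorded)."""
    return fields.get("type") == "SECCOMP" and fields.get("syscall") == SYS_SOCKET_X86_64
```

Applied to the two sample records above, the AppArmor record matches is_apparmor_netlink_denial with protocol "12", and the seccomp record matches is_seccomp_socket_denial with comm "nft".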

To verify that a pod’s capability set is correct post-deployment:

kubectl exec <pod-name> -- cat /proc/self/status | grep -E '^Cap'

A pod that has correctly dropped all capabilities shows:

CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000

All zero. CapEff is the effective capability set — this is what the kernel checks for privileged operations. A non-zero CapEff means the container has live capabilities. CapBnd is the bounding set — if this is non-zero, the process can potentially acquire capabilities through setcap or privileged exec. Both must be zero for a fully restricted workload.

A pod that still has CAP_NET_ADMIN shows a non-zero bitmask. CAP_NET_ADMIN is capability bit 12 (value 0x1000). If CapEff includes this bit, the pod can load nftables rules today on an unpatched kernel.
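Reading a 16-digit hexadecimal mask by eye is error-prone, so the bit test is worth automating. A minimal sketch, assuming the Cap* lines from /proc/self/status as input; the function names are hypothetical.

```python
from typing import Dict

CAP_NET_ADMIN = 12  # capability bit number, i.e. mask value 0x1000


def parse_cap_lines(status_text: str) -> Dict[str, int]:
    """Turn the Cap* lines of /proc/<pid>/status into integer bitmasks."""
    caps = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            name, _, value = line.partition(":")
            caps[name] = int(value.strip(), 16)
    return caps


def has_capability(mask: int, cap_bit: int) -> bool:
    """True when the given capability bit is set in the mask."""
    return bool((mask >> cap_bit) & 1)
```

For example, has_capability(parse_cap_lines(text)["CapEff"], CAP_NET_ADMIN) returns True for a CapEff of 0000000000001000 and False for an all-zero mask.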

Trade-offs

Dropping CAP_NET_ADMIN universally breaks several categories of legitimate workload.

CNI plugins require CAP_NET_ADMIN to create veth pairs, configure interfaces, set up routes, and manage iptables or nftables rules on the host. CNI plugins run as DaemonSets in kube-system with elevated privileges — this is expected and necessary. The Kyverno policy above excludes kube-system for this reason. The risk is that kube-system exemptions become a catch-all: audit what actually runs there and challenge every workload that claims it belongs.

Service mesh init containers (Istio’s istio-init, Linkerd’s linkerd-init) use CAP_NET_ADMIN to insert iptables rules that redirect traffic through the sidecar proxy. Both Istio and Linkerd have documented workarounds for this: Istio’s CNI plugin performs the iptables manipulation as a CNI operation rather than a container init, eliminating the need for CAP_NET_ADMIN in the application namespace. Linkerd has an equivalent CNI mode. Deploying the CNI-based init removes CAP_NET_ADMIN from every application pod in the mesh — this is the correct long-term architecture for service mesh deployments on Kubernetes.

The seccomp netlink blocking has a less obvious side effect: some monitoring agents use AF_NETLINK to subscribe to kernel network events (via NETLINK_ROUTE or NETLINK_SOCK_DIAG). Tools that collect socket state, routing table changes, or interface statistics this way will fail silently or noisily depending on how they handle EPERM. Audit your monitoring stack for netlink usage before deploying the seccomp profile broadly. The ss tool, ip monitor, and agents like Datadog’s network performance monitoring component all use netlink. A staged rollout — development clusters first, with explicit testing of monitoring coverage — is appropriate.

Kata Containers takes the host kernel out of the attack path by running each pod inside a lightweight virtual machine with a separate kernel. The attacker’s nftables exploit would corrupt the guest kernel’s memory, not the host kernel’s, so container escape from a Kata sandbox requires a separate hypervisor vulnerability. The trade-off is real: Kata adds 100–200ms of pod startup latency, uses more memory per pod (the guest kernel itself is approximately 50MB of RAM), and requires hardware virtualisation support on the node (nested virtualisation in cloud VMs, which is not universally available in older instance types). For workloads that genuinely require CAP_NET_ADMIN and cannot be restructured, Kata is the correct isolation boundary.

Failure Modes

The most dangerous failure mode is the assumption that network namespace isolation prevents kernel-level vulnerabilities. This assumption appears in conversations about multi-tenant Kubernetes: “the tenants are isolated in separate namespaces with NetworkPolicies, so one tenant can’t affect another.” NetworkPolicies and Kubernetes namespaces restrict what traffic flows between pods and what API resources one pod can see. They do not prevent a pod from executing kernel code. The kernel is shared. A vulnerability in nftables expression evaluation runs in kernel space and is reachable from any network namespace — the namespace boundary is the trigger mechanism, not the protection boundary. Every security review of a multi-tenant Kubernetes cluster needs to treat kernel vulnerabilities as a cross-tenant risk.

Granting CAP_NET_ADMIN to “just the sidecar” while believing the application container is isolated is another common mistake. In Kubernetes, all containers in a pod share one network namespace, so rules installed by a privileged init container or sidecar apply to every container in the pod, and a compromise of the privileged container exposes the kernel attack surface for the whole pod. More importantly, the vulnerability can be triggered by a process in any container that can unshare a new user namespace, which is a weaker precondition than the container itself having CAP_NET_ADMIN. Verify the full pod-level picture (capabilities, seccomp profile, and user namespace availability), not just per-container capability settings.

Not tracking kernel versions across nodes is a systemic failure that affects both this vulnerability and the broader response to kernel CVEs. Managed Kubernetes node pools do not automatically rotate nodes when a kernel CVE is published; they rotate according to the upgrade schedule, which is often monthly or quarterly. In the window between CVE publication and node rotation, every node in the cluster is vulnerable. Know the kernel version running on every node pool. Any cluster exposes this via kubectl get nodes -o wide and the nodeInfo.kernelVersion field. Alert when nodes fall below the minimum patched version for any active kernel CVE. The maintenance window for kernel patching should be measured in days, not weeks: a local privilege escalation in packet filtering code that is reachable from container processes is not a “next quarter” remediation.

Finally, exemptions accumulate. A Kyverno policy deployed today correctly blocks CAP_NET_ADMIN outside kube-system. Six months later, a new monitoring tool needs netlink access; an exception is added for monitoring. Then a network diagnostic namespace is added. Then a developer namespace for “testing.” Each exemption is individually justifiable. The aggregate is a policy with enough holes that the original protection is largely theoretical. Treat exemption namespaces as high-value targets requiring their own hardening: restrict which images can run there, enforce image digest pinning, audit service account permissions, and review the exemption list quarterly against the actual workloads present.