Hardening Linux Against Netlink Socket Privilege Escalation

Problem

Netlink is the Linux kernel’s primary interface for kernel-to-userspace communication about networking, routing, and system configuration. It is implemented as a socket family (AF_NETLINK) with dozens of protocol families: NETLINK_ROUTE for network interface and routing table management, NETLINK_AUDIT for audit subsystem access, NETLINK_XFRM for IPsec transform management, NETLINK_KOBJECT_UEVENT for device events, and the extensible Generic Netlink family that hosts hundreds of kernel subsystem APIs.

The attack surface is substantial and persistent. Nearly every major Linux privilege escalation class of the past decade has had at least one Netlink variant:

NETLINK_ROUTE integer overflows and heap corruptions. The rtnetlink subsystem processes route, interface, and traffic-control management messages. CVE-2021-3715 (a route4_change use-after-free in the traffic-control route classifier) and several 2024–2025 disclosures exploit message-handling bugs in this family. These are reachable from unprivileged user namespaces: any process that can create a user namespace can open a NETLINK_ROUTE socket and reach the vulnerable code path.

Generic Netlink (genl) and nfnetlink attack surface. The Generic Netlink framework is used by hundreds of kernel subsystems: nl80211/cfg80211 (Wi-Fi management), ipvs, the team driver, and dozens of others. nftables uses the closely related nfnetlink family (NETLINK_NETFILTER) and carries the same kind of exposure. Each registered family is a potential attack surface. CVE-2024-26924 (an nftables set handling bug reachable from a Netlink socket) is an example of this class; CVE-2022-47929 (a NULL pointer dereference in traffic-control qdisc handling) shows the same pattern on the rtnetlink side.

NETLINK_AUDIT information leaks. The audit subsystem exposed audit records to any process that opened a NETLINK_AUDIT socket in some kernel versions, leaking syscall sequences, process names, and file paths that facilitate exploit development.

Container escape via NETLINK_ROUTE in host network namespace. When a container shares the host network namespace (hostNetwork: true) or when kernel Netlink messages cross namespace boundaries incorrectly, a Netlink-based exploit in a container can target the host kernel’s routing and network configuration, enabling denial of service or privilege escalation on the host.

The fundamental problem is that AF_NETLINK socket creation is permitted to unprivileged processes for many protocol families, and the capability checks inside each family are inconsistent: some operations require CAP_NET_ADMIN in the initial user namespace, while others require it only in the user namespace that owns the caller’s network namespace, which any unprivileged user can obtain by creating a new user and network namespace.

On a default Linux installation without user namespace restrictions or Seccomp filtering, every process can open Netlink sockets across most families and exercise a large fraction of the kernel’s network configuration API. This is the broadest kernel API surface available to unprivileged code.
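
To make that exposure concrete, the following minimal sketch tries to create one socket per Netlink protocol and records whether the kernel (or a Seccomp/AppArmor policy) allowed it. The function name `probe_netlink` and the injectable `opener` parameter are illustrative choices, not part of any existing tool; the protocol numbers are taken from `<linux/netlink.h>`.

```python
import errno
import socket

# Protocol numbers from <linux/netlink.h>.
NETLINK_PROTOCOLS = {
    "NETLINK_ROUTE": 0,
    "NETLINK_SOCK_DIAG": 4,
    "NETLINK_XFRM": 6,
    "NETLINK_AUDIT": 9,
    "NETLINK_NETFILTER": 12,
    "NETLINK_KOBJECT_UEVENT": 15,
    "NETLINK_GENERIC": 16,
}

AF_NETLINK = 16  # numeric value on Linux; avoids AttributeError on other platforms


def probe_netlink(protocols=NETLINK_PROTOCOLS, opener=socket.socket):
    """Try to create one socket per protocol and report the outcome.

    `opener` is injectable (hypothetical hook) so the logic can be
    exercised without touching the kernel.
    """
    results = {}
    for name, proto in protocols.items():
        try:
            sock = opener(AF_NETLINK, socket.SOCK_RAW, proto)
        except OSError as exc:
            # Record the symbolic errno (EPERM, EACCES, ...) when known.
            results[name] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            sock.close()
            results[name] = "open"
    return results


if __name__ == "__main__":
    for name, outcome in probe_netlink().items():
        print(f"{name:24} {outcome}")
```

Running it as an unprivileged user before and after hardening gives a quick view of which protocols remain reachable. Note that it only tests socket creation, not whether individual message types are permitted.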

Target systems: Linux 5.10–6.12 on servers, cloud VMs, Kubernetes nodes, and CI runners; any system where unprivileged code execution is possible (containers, multi-user hosts, CI pipelines); distributions with user namespaces enabled by default (Ubuntu, Debian, Arch).


Threat Model

Adversary 1 — Container process targeting host kernel via shared netns. Access level: unprivileged code inside a hostNetwork: true Kubernetes pod or --network=host Docker container. Objective: open NETLINK_ROUTE socket, send crafted messages to trigger a kernel heap corruption, escalate to host root.

Adversary 2 — Unprivileged local user. Access level: shell account on a multi-tenant host. Objective: open a Generic Netlink family socket (nl80211, nftables), trigger a known vulnerability in the subsystem, achieve LPE to root.

Adversary 3 — CI pipeline code execution. Access level: malicious code running in a GitHub Actions self-hosted runner or GitLab Runner. Objective: use Netlink socket access (not restricted by the CI sandbox) to exploit a kernel vulnerability and escape the runner environment.

Adversary 4 — Container escape via netfilter Netlink. Access level: process inside a container with CAP_NET_ADMIN in its user namespace (required for nftables). Objective: exploit nftables/nf_tables Netlink messages to corrupt kernel memory and escape the container namespace.

Without hardening: all four adversaries can reach a large fraction of the Netlink API. With hardening: Seccomp blocks socket(AF_NETLINK, ...) for workloads that don’t need it; user namespace restrictions reduce the capabilities available to Netlink callers; capability dropping further limits message types.


Configuration / Implementation

Step 1 — Audit current Netlink usage

Before restricting, understand what is legitimately using Netlink:

# List open Netlink sockets with their owning processes
# (-f netlink selects the Netlink family; -x would list Unix sockets instead)
ss -a -f netlink -p 2>/dev/null | head -40

# Raw kernel view; columns: sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
cat /proc/net/netlink

# The Eth column is the Netlink protocol number:
#   0  = NETLINK_ROUTE
#   4  = NETLINK_SOCK_DIAG
#   6  = NETLINK_XFRM (IPsec)
#   9  = NETLINK_AUDIT
#   12 = NETLINK_NETFILTER (nftables)
#   15 = NETLINK_KOBJECT_UEVENT
#   16 = NETLINK_GENERIC

# Summarise sockets by protocol and owner. The Pid column is the Netlink port ID:
# usually the owning PID, but 0 for kernel sockets, so the comm lookup can miss.
awk 'NR>1 {print $2, $3, $4}' /proc/net/netlink | while read -r proto pid groups; do
  comm=$(cat "/proc/$pid/comm" 2>/dev/null || echo "pid-$pid")
  echo "proto=$proto groups=$groups pid=$pid ($comm)"
done | sort -t= -k2 -n | head -40
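
For recurring audits, the same table can be summarised programmatically. This is a minimal sketch, assuming the `/proc/net/netlink` column layout `sk Eth Pid Groups ...` in which `Eth` is the protocol number; `summarize_netlink` and `PROTOCOL_NAMES` are names introduced here for illustration.

```python
from collections import Counter

# Protocol numbers from <linux/netlink.h>; unknown numbers are reported as proto-N.
PROTOCOL_NAMES = {
    0: "NETLINK_ROUTE",
    4: "NETLINK_SOCK_DIAG",
    6: "NETLINK_XFRM",
    9: "NETLINK_AUDIT",
    12: "NETLINK_NETFILTER",
    15: "NETLINK_KOBJECT_UEVENT",
    16: "NETLINK_GENERIC",
}


def summarize_netlink(table_text):
    """Count sockets per (protocol name, port ID) from /proc/net/netlink text."""
    counts = Counter()
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        proto = int(fields[1])    # Eth column
        port_id = int(fields[2])  # Pid column (Netlink port ID)
        name = PROTOCOL_NAMES.get(proto, f"proto-{proto}")
        counts[(name, port_id)] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/proc/net/netlink") as fh:
            summary = summarize_netlink(fh.read())
    except OSError:
        print("/proc/net/netlink not available")
    else:
        for (proto, port_id), n in sorted(summary.items()):
            print(f"{proto:24} port_id={port_id:<10} sockets={n}")
```

Comparing the output across a representative set of nodes gives a baseline of which protocols are in legitimate use before any blocking is applied.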

Step 2 — Restrict unprivileged user namespaces (primary attack enabler)

Most Netlink LPE chains require creating a user namespace to gain capabilities. Restricting this is the highest-leverage control:

# /etc/sysctl.d/90-netlink-hardening.conf
# Keep comments on their own lines; a trailing comment is read as part of the value.

# Block unprivileged user namespace creation (removes the capability grant used to reach Netlink).
# This sysctl comes from a Debian kernel patch (also carried by Ubuntu); it does not exist upstream.
kernel.unprivileged_userns_clone = 0

# Upstream kernels: cap the number of user namespaces instead.
# 0 is a hard block. It breaks rootless container tools; test first.
user.max_user_namespaces = 0

# Less aggressive alternative: allow a single user namespace per user
# user.max_user_namespaces = 1

# Apply and verify (run these in a shell; they are not part of the .conf file)
sysctl --system
sysctl kernel.unprivileged_userns_clone 2>/dev/null || \
  sysctl user.max_user_namespaces

Note: user.max_user_namespaces = 0 breaks rootless Docker and Podman, Flatpak, and some Snap packages. Audit before applying. On Kubernetes nodes where a root-owned runtime starts containers without user namespaces (the common case), the setting is safe, because those containers never create a user namespace of their own.
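
A node-level compliance check for this step could look like the sketch below. It reads the two sysctls discussed above from `/proc/sys` and reports whether either one blocks unprivileged user namespaces. The helper names `read_sysctls` and `userns_blocked` are hypothetical, and other restriction mechanisms (for example AppArmor-based ones) are deliberately not covered.

```python
from pathlib import Path

SYSCTL_FILES = {
    "kernel.unprivileged_userns_clone": "/proc/sys/kernel/unprivileged_userns_clone",
    "user.max_user_namespaces": "/proc/sys/user/max_user_namespaces",
}


def read_sysctls(paths=SYSCTL_FILES):
    """Return {name: int} for the sysctls that exist on this kernel."""
    values = {}
    for name, path in paths.items():
        try:
            values[name] = int(Path(path).read_text().strip())
        except (OSError, ValueError):
            continue  # not present on this kernel, or unreadable
    return values


def userns_blocked(values):
    """True if either sysctl prevents unprivileged user namespace creation."""
    if values.get("kernel.unprivileged_userns_clone") == 0:
        return True
    return values.get("user.max_user_namespaces") == 0


if __name__ == "__main__":
    current = read_sysctls()
    state = "blocked" if userns_blocked(current) else "NOT blocked"
    print(f"{current} -> unprivileged user namespaces {state}")
```

Separating the file reads from the decision keeps the decision logic testable on machines that do not expose these sysctls.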

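Step 3 — Block AF_NETLINK socket creation with Seccomp

For workloads that have no legitimate Netlink need (most application containers), block socket(AF_NETLINK, ...) entirely. The fragment below shows only the socket rules of an allowlist profile. It must be merged into a complete allowlist (for example one derived from the container runtime’s default profile), because with defaultAction SCMP_ACT_ERRNO and nothing else allowed, the container could not make any other syscall: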

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 2,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Allow AF_INET (2) sockets only"
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 10,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Allow AF_INET6 (10) sockets"
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 1,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Allow AF_UNIX (1) sockets; AF_NETLINK (16) is absent from the allowlist and therefore blocked"
    }
  ]
}

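For a simpler approach, reference a Localhost profile that denies only AF_NETLINK. A pod gets exactly one seccomp profile, so this replaces RuntimeDefault rather than layering on top of it; to keep the RuntimeDefault restrictions as well, start from a copy of the runtime’s default profile and add the Netlink restriction to it.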

# Kubernetes pod spec — deny Netlink socket creation
apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: deny-netlink.json
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      runAsNonRoot: true

A targeted Seccomp profile that blocks only AF_NETLINK:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [
        {
          "index": 0,
          "value": 16,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Block AF_NETLINK (16) socket creation"
    }
  ]
}

Save as /var/lib/kubelet/seccomp/deny-netlink.json on each node.
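
Before distributing a profile to nodes, it can help to lint it offline. The sketch below is a deliberately simplified model of how the profile treats socket(AF_NETLINK, ...): it understands only SCMP_CMP_EQ and SCMP_CMP_NE filters on argument 0 and treats conflicting rules conservatively as "not blocked". It is not a reimplementation of libseccomp, and `netlink_blocked` is a name introduced here for illustration.

```python
import json
import sys

AF_NETLINK = 16


def _arg_matches(arg, family):
    """Evaluate one seccomp arg filter against the socket() family argument."""
    if arg.get("index") != 0:
        return None  # filter on another argument; cannot decide from the family alone
    op, value = arg.get("op"), arg.get("value")
    if op == "SCMP_CMP_EQ":
        return family == value
    if op == "SCMP_CMP_NE":
        return family != value
    return None  # operator not modelled in this sketch


def netlink_blocked(profile, family=AF_NETLINK):
    """Simplified check: would socket(family, ...) be denied by this profile?"""
    actions = set()
    for rule in profile.get("syscalls", []):
        if "socket" not in rule.get("names", []):
            continue
        verdicts = [_arg_matches(a, family) for a in rule.get("args") or []]
        if None in verdicts:
            raise ValueError("rule uses a filter this sketch does not model")
        if all(verdicts):  # a rule without args matches every call
            actions.add(rule["action"])
    if not actions:
        actions.add(profile["defaultAction"])
    # Conservative: any matching ALLOW means the call is not reliably blocked.
    return "SCMP_ACT_ALLOW" not in actions


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        # json.load also rejects profiles containing // comments, which are not valid JSON.
        print("blocked" if netlink_blocked(json.load(fh)) else "NOT blocked")
```

Invoked with the path to deny-netlink.json as its only argument, the script should print "blocked" for the targeted profile shown above.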

Step 4 — Drop CAP_NET_ADMIN from all workloads that don’t need it

Many Netlink operations require CAP_NET_ADMIN. Dropping it reduces the exploitable surface even when a Netlink socket can be opened:

# All pods should drop CAP_NET_ADMIN unless they explicitly need it
# Network plugin pods (CNI, Cilium, Calico) are legitimate exceptions

securityContext:
  capabilities:
    drop:
    - ALL
    # Add back only if genuinely needed:
    # add:
    # - NET_BIND_SERVICE  # Only for binding port < 1024
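
To confirm from inside a running container that the capability is really gone, the effective capability mask can be decoded from /proc/self/status. A minimal sketch, using the bit positions of CAP_NET_ADMIN (12) and CAP_NET_RAW (13) from `<linux/capability.h>`; the helper names are illustrative.

```python
CAP_NET_ADMIN = 12  # bit positions from <linux/capability.h>
CAP_NET_RAW = 13


def effective_caps(status_text):
    """Extract the CapEff bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, bit):
    """True if capability number `bit` is set in the mask."""
    return bool(mask >> bit & 1)


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as fh:
            mask = effective_caps(fh.read())
    except OSError:
        print("/proc/self/status not available")
    else:
        print(f"CapEff={mask:016x}")
        print("CAP_NET_ADMIN:", "present" if has_cap(mask, CAP_NET_ADMIN) else "absent")
        print("CAP_NET_RAW:  ", "present" if has_cap(mask, CAP_NET_RAW) else "absent")
```

With capabilities.drop: ALL in effect and nothing added back, both lines should report "absent".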

Enforce with Kyverno:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: deny-cap-net-admin
spec:
  validationFailureAction: Enforce
  rules:
  - name: deny-net-admin
    match:
      any:
      - resources:
          kinds: [Pod]
    exclude:
      any:
      - resources:
          namespaces: [kube-system, cilium, calico-system, network-plugins]
    validate:
      message: "Adding NET_ADMIN or NET_RAW requires an explicit security review"
      deny:
        conditions:
          any:
          # Collect capabilities.add from every container type, then match either capability
          - key: "{{ request.object.spec.[ephemeralContainers, initContainers, containers][].securityContext.capabilities.add[] }}"
            operator: AnyIn
            value: [NET_ADMIN, NET_RAW]
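
The same check can be run earlier, for example in CI against rendered manifests, so that violations surface before admission. The sketch below takes a Pod manifest already parsed into a dict and reports containers that add NET_ADMIN or NET_RAW. Inspecting initContainers and ephemeralContainers as well is a choice made here, and `flagged_capabilities` is a hypothetical helper, not part of Kyverno.

```python
FLAGGED = frozenset({"NET_ADMIN", "NET_RAW"})


def flagged_capabilities(pod, flagged=FLAGGED):
    """Return {container name: sorted flagged caps} for a Pod manifest dict."""
    spec = pod.get("spec") or {}
    findings = {}
    for kind in ("containers", "initContainers", "ephemeralContainers"):
        for container in spec.get(kind) or []:
            security = container.get("securityContext") or {}
            caps = security.get("capabilities") or {}
            hits = flagged.intersection(caps.get("add") or [])
            if hits:
                findings[container.get("name", "<unnamed>")] = sorted(hits)
    return findings


if __name__ == "__main__":
    example = {
        "spec": {
            "containers": [
                {"name": "app", "securityContext": {"capabilities": {"drop": ["ALL"]}}},
                {"name": "vpn", "securityContext": {"capabilities": {"add": ["NET_ADMIN"]}}},
            ]
        }
    }
    print(flagged_capabilities(example))
```

An empty result means the manifest would pass the capability check; any entry names the container that needs a documented exception.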

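Step 5 — Restrict Netlink with AppArmor

AppArmor can restrict Netlink socket creation by address family and socket type. It does not distinguish individual Netlink protocols, so NETLINK_ROUTE and NETLINK_AUDIT cannot be allowed or denied separately: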

# /etc/apparmor.d/container-netlink-restrict
profile container-netlink-restrict flags=(attach_disconnected) {
  # Allow standard network operations
  network inet tcp,
  network inet udp,
  network inet6 tcp,
  network inet6 udp,
  network unix,

  # Deny all Netlink sockets. AppArmor matches on socket type, not Netlink protocol,
  # and Netlink accepts both SOCK_RAW and SOCK_DGRAM, so deny the whole family.
  deny network netlink,

  # If a workload genuinely needs Netlink (for example device event monitoring),
  # remove the deny above and allow the family instead:
  # network netlink raw,
}
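
Once a profile like this is loaded and assigned, a process can confirm which profile actually confines it by reading /proc/self/attr/current. The sketch below parses that label, assuming the usual AppArmor forms `unconfined` or `profile-name (mode)`; on systems using a different LSM the file holds a different kind of label, which is returned unchanged. `parse_apparmor_label` is an illustrative name.

```python
def parse_apparmor_label(text):
    """Split an AppArmor confinement label into (profile, mode).

    Returns (label, None) when no '(mode)' suffix is present,
    for example for 'unconfined'.
    """
    label = text.strip()
    if label.endswith(")") and " (" in label:
        profile, _, mode = label[:-1].rpartition(" (")
        return profile, mode
    return label, None


if __name__ == "__main__":
    try:
        with open("/proc/self/attr/current") as fh:
            profile, mode = parse_apparmor_label(fh.read())
    except OSError:
        print("confinement label not available")
    else:
        print(f"profile={profile} mode={mode}")
```

A result of `container-netlink-restrict` with mode `enforce` indicates the Netlink deny is active; `complain` mode only logs and does not block.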

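To verify the profile, run a test container under it. Docker’s default profile (docker-default) allows all network families, so the custom profile has to be loaded and selected explicitly. This assumes the profile also carries the file and capability rules the workload needs; the fragment above shows only the network rules:

# Load the profile, then confirm a container confined by it cannot open Netlink
apparmor_parser -r /etc/apparmor.d/container-netlink-restrict
docker run --rm --security-opt apparmor=container-netlink-restrict \
  python:3.11-slim python3 -c "
import socket
try:
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, 0)
    print('FAIL: Netlink allowed')
except PermissionError as e:
    print(f'PASS: Netlink blocked: {e}')
"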

Step 6 — Detect Netlink socket creation with Falco

# Falco rule: alert on Netlink socket creation outside known processes
- rule: Unexpected Netlink Socket Creation
  desc: A process outside the known set created a Netlink socket
  condition: >
    evt.type = socket and
    evt.arg.domain = AF_NETLINK and
    not proc.name in (networkd, systemd-networkd, ip, ss, tc, nft,
                      ifconfig, route, iptables, nftables, rtnl_helper,
                      cilium-agent, calico-node, flannel, weave)
  output: >
    Unexpected Netlink socket created
    (proc=%proc.name pid=%proc.pid uid=%user.uid container=%container.name
     image=%container.image.repository domain=%evt.arg.domain)
  priority: WARNING
  tags: [netlink, kernel, privilege-escalation]

# Alert on Netlink from containers specifically
- rule: Container Netlink Socket
  desc: Container process created a Netlink socket (potential LPE attempt)
  condition: >
    evt.type = socket and
    evt.arg.domain = AF_NETLINK and
    container.id != host
  output: >
    Netlink socket created inside container
    (proc=%proc.name pid=%proc.pid container=%container.name
     image=%container.image.repository)
  priority: CRITICAL
  tags: [netlink, container, lpe]

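Step 7 — Restrict the nftables Netlink interface

The nftables Netlink interface has been a recurring source of LPE vulnerabilities. After applying the controls above, confirm that containers and unprivileged users cannot reach it: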

# Check if nft is usable from an unprivileged container
# (should fail after applying controls)
kubectl run test --image=alpine --rm -it -- \
  sh -c "apk add nftables && nft list tables 2>&1"
# Expected: permission denied

# On the host, verify nft requires CAP_NET_ADMIN
su -s /bin/bash nobody -c "nft list tables 2>&1"
# Expected: Error: Could not process rule: Operation not permitted

Expected Behaviour

| Signal | Before hardening | After hardening |
|---|---|---|
| socket(AF_NETLINK, ...) from container | Succeeds | Blocked by Seccomp (EPERM) |
| kernel.unprivileged_userns_clone | 1 | 0 (on nodes where containers run as root) |
| CAP_NET_ADMIN in app container | Present only if explicitly added (NET_RAW is in the Docker default set) | Both removed via capabilities.drop: ALL |
| Falco alert on container Netlink socket | No alert | CRITICAL alert fires |
| nft list tables as non-root, non-network pod | Succeeds on unprotected nodes from inside an unprivileged user and network namespace | Fails: no CAP_NET_ADMIN, no Netlink socket |

Verification:

# Confirm Seccomp blocks AF_NETLINK in a test pod
kubectl run netlink-test \
  --image=python:3.11-slim \
  --restart=Never \
  --overrides='{"spec":{"securityContext":{"seccompProfile":{"type":"Localhost","localhostProfile":"deny-netlink.json"}}}}' \
  --rm -it -- python3 -c "
import socket
try:
    s = socket.socket(16, socket.SOCK_RAW, 0)  # AF_NETLINK=16
    print('FAIL: Netlink socket allowed')
except OSError as e:
    print(f'PASS: Netlink blocked — {e}')
"

Trade-offs

| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| user.max_user_namespaces = 0 | Removes capability grant path for most Netlink exploits | Breaks rootless containers, Flatpak, Snap, some desktop tools | Apply only on server/Kubernetes nodes; use user namespace restriction (not full block) on developer workstations |
| Seccomp AF_NETLINK block | Surgical; does not affect other socket families | Some tools (systemd-networkd, network diagnostics) legitimately use Netlink | Apply per-workload; exempt network-plane DaemonSets; use RuntimeDefault for general workloads |
| Drop CAP_NET_ADMIN | Blocks a large fraction of exploitable Netlink message types | Network-intensive apps (VPN, network monitoring) need this capability | Require documented approval + Kyverno annotation for any pod requesting NET_ADMIN |
| AppArmor Netlink denial | Defence-in-depth alongside Seccomp | AppArmor profiles require maintenance; profile generation is complex | Start from Docker’s default AppArmor profile and add the Netlink deny; the default profile itself allows all network families |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Seccomp blocks legitimate monitoring tool | Tool fails to gather network statistics; dashboards show gaps | Tool logs show EPERM on socket creation; Falco shows blocked call | Add exemption in Seccomp profile for specific tool; document the exception |
| user.max_user_namespaces = 0 breaks container runtime | Container creation fails; Docker/containerd errors on startup | Container runtime logs show namespace creation failure | Revert sysctl; apply only on nodes where containers run as root (not rootless mode) |
| Kyverno policy blocks CNI plugin update | Network plugin DaemonSet cannot be updated; pods go unscheduled | kubectl rollout status shows unavailable; Kyverno audit log shows denial | Ensure CNI namespace (kube-system, cilium) is in Kyverno exception list |
| AppArmor profile denies udev device events | Device hotplug events missed; storage mounts fail | System logs show an AppArmor DENIED entry with family="netlink" | Add a network netlink raw, rule for processes that legitimately need device event monitoring |