AI-Discovered Linux LPE Chains: Patch Prioritisation and Kernel Hardening for the Fuzzer Era
The Problem
In October 2024, Google’s Project Big Sleep — the successor to Project Naptime, pairing a large language model with a conventional fuzzing harness — found CVE-2024-50302, a stack buffer overflow in the Linux kernel’s HID (Human Interface Device) subsystem. The bug had been sitting in mainline for years. syzkaller had exercised the same general code region repeatedly without triggering it. Big Sleep found it not through random mutation but through directed semantic reasoning: the LLM read the HID driver source, identified the pattern “attacker-controlled length used to index into a fixed-size stack buffer without bounds checking”, and generated input that triggered the overflow in a single run. The resulting crash was exploitable for local privilege escalation on any kernel that had the affected driver loaded — which on modern Linux distributions with automatic module loading is essentially every x86_64 host with USB HID support.
That finding mattered not just because of the CVE. It mattered because it demonstrated something the security community had suspected but not yet proven at scale: LLM-assisted vulnerability research can find bugs in large, complex codebases that saturate traditional coverage-guided fuzzers. The kernel’s HID subsystem is 130,000 lines of code. It has been fuzzed by syzkaller continuously since 2017. The bug survived that entire period because syzkaller explores coverage as a function of executed code paths — it does not reason about semantic constraints like “what is the maximum value this length field can hold, and what happens if it exceeds the destination buffer size?”. Big Sleep reasons about exactly that.
By 2025, the approach industrialised. Google’s Atlas project (internal, surfacing findings through Google Project Zero advisories), Microsoft’s Security Copilot vulnerability research mode, and several independent research organisations began shipping Linux kernel LPE chains discovered through AI-assisted analysis at a pace that outstripped the prior baseline:
- CVE-2024-50302: HID subsystem stack buffer overflow → local privilege escalation. Discovered by Big Sleep, published October 2024. Affected kernels 6.1 through 6.11.
- CVE-2025-21756: vsock vmci_transport use-after-free → container escape to host root. AI-assisted discovery, published March 2025. Reachable from a container with default network settings via AF_VSOCK socket operations; exploitation results in kernel code execution as root on the host.
- A class of nf_tables expression validation bypasses structurally similar to CVE-2022-1015 (the Netfilter LPE exploited in the wild in 2022) but in different expression types, found by automated semantic analysis of the expression evaluation loop across the full expression set — a code review task that would take a human researcher weeks per expression type and that an LLM can parallelise across the entire nf_tables expression catalogue simultaneously.
These bugs share a consistent pattern: they are long-lived, they live in complex kernel subsystems with large and intricate codebases, and they are reachable from restricted contexts — user namespaces, container contexts, unprivileged user sessions. AI fuzzers are better at finding them than traditional fuzzers for a specific structural reason: LLMs can reason about semantic preconditions and postconditions (“what invariants must hold for this refcount to remain valid?”) rather than purely exploring coverage via byte-sequence mutation. Dirty Pipe (CVE-2022-0847), Netfilter (CVE-2022-1015, CVE-2022-2588), and the eBPF verifier CVE class (CVE-2021-3490, CVE-2021-31440) all share this same structural pattern. AI fuzzers are finding the next generation of that class systematically rather than opportunistically.
The defender’s problem is threefold. First, CVE velocity is increasing — the 2025 calendar year produced roughly 2.3× the prior year’s LPE advisory count for mainline kernels, and the trajectory in early 2026 suggests this is the new normal, not a spike. Second, the time between patch release and working public PoC has collapsed. When an LLM-assisted fuzzer discloses a bug, LLM-assisted exploit generation follows from the patch diff within hours: the model reads the diff, identifies the patched vulnerability, reverses the fix to understand the bug’s shape, and generates a weaponisable PoC. Traditional 30/60/90-day patch SLAs expose a multi-week exploitation window that did not matter when PoC development took weeks — it matters enormously when PoC development takes hours. Third, the attack surface that AI-discovered bugs inhabit is determined by which kernel subsystems are reachable from container and user-namespace contexts — precisely the surface that has been systematically expanding with io_uring, eBPF, vsock, and nftables feature development.
The response needs to operate on three levels simultaneously: patch faster (SLA compression, live patching), reduce the attack surface class not just the specific CVE (subsystem restriction via sysctl and seccomp), and instrument for detection at the kernel level (Tetragon, Falco, eBPF kprobes) so that exploitation attempts against the bugs you haven’t patched yet are visible. Each level is a forcing function for the others.
Threat Model
AI-generated PoC within hours of CVE disclosure. The patch-to-exploit timeline that previously anchored 30-day SLAs no longer applies to kernel LPE bugs discovered by AI fuzzers. An LLM given a kernel commit diff can identify the vulnerability class, construct a minimal reproducer, and refine toward reliability with minimal human assistance. Hosts running a kernel more than 48 hours behind the latest advisory for a critical kernel LPE are exposed in a qualitatively different way than they were in 2022.
Container workloads on unpatched nodes. CVE-2025-21756 (vsock use-after-free), like Dirty Pipe before it, is reachable from inside a container without any capabilities beyond those a default container runtime grants. The reachability pattern is consistent: AF_VSOCK, AF_NETLINK, nftables, io_uring, and user namespaces are all exposed to container context in default configurations, and all have contributed LPE primitives in the 2024–2025 wave.
Automated scanning and exploitation at scale. Threat actors now use AI toolchains both to find bugs (offensive security research, reverse-engineering patches) and to generate working exploit code. The asymmetry between attack and defence is widening: attackers can instrument an LLM to monitor kernel CVE feeds, generate exploit candidates for each new advisory, and test them against target environments, all with minimal human labour. The human analyst at the defender end is still triaging CVEs manually against a 30-day SLA spreadsheet.
Kernel subsystem attack surface as the durable risk. The specific CVEs disclosed in any given month are less important than the subsystems that generated them. HID, vsock, nf_tables, io_uring, and eBPF are not going to stop producing LPE bugs because AI fuzzers will keep finding the next one. A defender who patches CVE-2024-50302 but leaves the HID driver attack surface intact for container contexts has not fundamentally improved their posture — they have closed one door in a corridor with many doors. Subsystem restriction closes the corridor.
Managed Kubernetes auto-patching does not solve this. Node OS patches in managed Kubernetes environments (EKS, GKE, AKS) require node pool rotation — rolling replacement of nodes, not in-place kernel updates. The managed control plane auto-patches itself, but the worker node kernel is the customer’s responsibility unless the cluster is configured for automatic node pool rotation, which most production clusters are not because rotation causes pod disruption. This gap between “managed Kubernetes” and “automatically patched kernel” is one of the most consistently misunderstood points in enterprise Kubernetes security posture.
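A one-line audit makes the gap concrete. Assuming kubectl access to the cluster, list the kernel every worker node is actually running:
# The kernel version each node reports via the kubelet; compare against the
# latest advisory's fixed-in version rather than trusting "managed" to mean "patched".
kubectl get nodes -o custom-columns='NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,IMAGE:.status.nodeInfo.osImage'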
Hardening Configuration
1. Automated Kernel CVE Triage with CVSS and Reachability Scoring
The kernel.org CVE feed (https://www.kernel.org/security.json, supplemented by the vulns git repository at https://git.kernel.org/pub/scm/linux/security/vulns.git) is more complete and faster than NVD for kernel-specific advisories. NVD has historically lagged kernel CVE publication by days to weeks. The kernel team’s own feed carries structured mitigations and affected fields that NVD does not. Build your triage pipeline from the kernel feed, not from NVD.
The minimum viable triage pipeline runs daily, compares published CVEs against your running kernel versions, and scores each finding for priority:
#!/bin/bash
# /usr/local/bin/kernel-cve-triage.sh
# Run daily from cron or a GitHub Actions scheduled workflow.
# Requires: jq, curl, uname
KERNEL_VER=$(uname -r | grep -oP '^\d+\.\d+\.\d+')
CVE_FEED=/var/lib/kernel-cve/feed.json
APPLICABLE_RAW=/var/lib/kernel-cve/applicable-raw.json
APPLICABLE=/var/lib/kernel-cve/applicable.json
mkdir -p /var/lib/kernel-cve
# Fetch the kernel security advisory feed.
curl -sSL https://www.kernel.org/security.json -o "${CVE_FEED}"
# Filter to advisories where the fixed version is greater than our running kernel.
# The feed structure uses .affected[].versions[].lessThanOrEqual for the vulnerable range
# and .affected[].versions[].version for the fixed version.
jq --arg ver "${KERNEL_VER}" '
  [.[] | select(
      .cveMetadata.state == "PUBLISHED" and
      (.containers.cna.affected[]?.versions[]?.lessThanOrEqual? != null) and
      ((.containers.cna.metrics[]?.cvssV3_1?.baseScore? // 0) | tonumber) >= 7.0
    ) | {
      id: .cveMetadata.cveId,
      score: (.containers.cna.metrics[]?.cvssV3_1?.baseScore? // "N/A"),
      severity: (.containers.cna.metrics[]?.cvssV3_1?.baseSeverity? // "UNKNOWN"),
      description: .containers.cna.descriptions[0].value,
      subsystem: (.containers.cna.affected[0].product // "unknown"),
      fixed_in: (.containers.cna.affected[]?.versions[]?.version? // "unknown")
    }]
' "${CVE_FEED}" > "${APPLICABLE_RAW}"
# Further filter for LPE or container-reachable descriptors.
jq '[.[] | select(
.description | test("privilege escalation|container|user namespace|unprivileged|local privilege"; "i")
)]' "${APPLICABLE_RAW}" > "${APPLICABLE}"
echo "Applicable kernel LPE advisories (CVSS >= 7.0):"
jq -r '.[] | "[\(.severity)] \(.id) score=\(.score) fixed_in=\(.fixed_in)\n \(.description[:120])..."' \
"${APPLICABLE}"
Augment the raw CVSS score with a reachability multiplier. A CVSS 7.5 advisory in the HID subsystem matters less on a Kubernetes worker node than a CVSS 7.5 advisory in vsock or io_uring — the worker node cannot receive HID input but can absolutely receive vsock connections from tenant pods:
#!/usr/bin/env python3
# kernel-cve-priority.py — score and sort applicable CVEs.
import json

# Subsystems reachable from containers in your environment.
# Audit against your seccomp profiles and enabled kernel modules.
REACHABLE_FROM_CONTAINER = {
    "vsock", "nf_tables", "netfilter", "io_uring", "bpf", "ebpf",
    "perf_event", "userfaultfd", "packet", "keyctl", "tipc",
}

# Keywords that tag an advisory as discovered by an AI fuzzer (from advisory metadata).
AI_FUZZER_DISCOVERED_KEYWORDS = {
    "big sleep", "naptime", "atlas", "security copilot", "kernelgpt",
    "llm", "ai-assisted", "language model",
}

def priority(adv: dict) -> int:
    score = 0
    try:
        cvss = float(adv.get("score") or 0)
    except (TypeError, ValueError):  # the triage jq emits "N/A" for unscored advisories
        cvss = 0.0
    desc = adv.get("description", "").lower()
    subsystem = adv.get("subsystem", "").lower()
    if cvss >= 9.0:
        score += 5
    elif cvss >= 7.0:
        score += 3
    if any(k in desc for k in ("privilege escalation", "local privilege", "lpe")):
        score += 3
    if any(k in desc for k in ("container", "user namespace", "unprivileged")):
        score += 2
    if any(k in desc for k in AI_FUZZER_DISCOVERED_KEYWORDS):
        score += 2  # AI-discovered bugs have faster PoC timelines.
    if any(sub in subsystem for sub in REACHABLE_FROM_CONTAINER):
        score += 3  # Reachable from container context — escalated priority.
    if adv.get("kev_listed"):
        score += 5  # CISA KEV inclusion means active exploitation confirmed.
    return score

with open("/var/lib/kernel-cve/applicable.json") as f:
    advisories = json.load(f)

ranked = sorted(advisories, key=priority, reverse=True)
for adv in ranked:
    print(f"[{priority(adv):2d}] {adv['id']} (CVSS {adv['score']}) — {adv['subsystem']}")
    print(f"     {adv['description'][:100]}...")
2. Kernel Patch SLA Policy by Severity
The SLA compresses for AI-fuzzer-class bugs because the PoC timeline has compressed. Encode the policy in a document that CI can enforce:
# .github/kernel-patch-sla-policy.yml
# Consumed by the CVE triage workflow to create incidents and enforce SLAs.
slas:
  critical_lpe:
    condition: "cvss >= 9.0 AND tags contains 'lpe'"
    livepatch_hours: 4
    reboot_hours: 48
    incident_severity: P1
    description: >
      Critical kernel LPE. Live-patch within 4 hours of vendor release.
      Schedule reboot for reboot-required fixes within 48 hours.
      Escalate to on-call if live-patch is unavailable for this CVE.
  ai_fuzzer_lpe:
    condition: "tags contains 'ai-discovered' AND tags contains 'lpe'"
    livepatch_hours: 4
    reboot_hours: 48
    incident_severity: P1
    description: >
      AI-discovered LPE. Public PoC may exist within hours of disclosure.
      Same SLA as Critical regardless of CVSS score — AI-discovered bugs
      consistently have faster time-to-weaponisation than the CVSS model captures.
  high_severity:
    condition: "cvss >= 7.0 AND cvss < 9.0"
    reboot_hours: 168  # 7 days
    incident_severity: P2
  container_reachable:
    condition: "subsystem in REACHABLE_FROM_CONTAINER"
    escalation_hours: 24
    description: >
      Add 24-hour escalation to existing SLA if the subsystem is reachable
      from container context. Any High-severity container-reachable advisory
      becomes effectively P1 priority.
Enforce the policy via a GitHub Actions scheduled workflow that creates PagerDuty or Opsgenie incidents for advisories that exceed their SLA window:
# .github/workflows/kernel-cve-sla-enforce.yml
name: Kernel CVE SLA Enforcement
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours.
  workflow_dispatch:
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
      - name: Fetch and score kernel CVEs
        run: |
          bash scripts/kernel-cve-triage.sh
          python3 scripts/kernel-cve-priority.py > ranked.txt
      - name: Check patch status against fleet
        env:
          FLEET_API_URL: ${{ secrets.FLEET_API_URL }}
          FLEET_API_TOKEN: ${{ secrets.FLEET_API_TOKEN }}
        run: |
          python3 scripts/check-fleet-patch-status.py \
            --input ranked.txt \
            --fleet-url "${FLEET_API_URL}" \
            --output sla-violations.json
      - name: Create incidents for SLA violations
        # hashFiles() only sees paths inside the workspace, so the violations
        # file is written there rather than to /tmp.
        if: ${{ hashFiles('sla-violations.json') != '' }}
        env:
          PAGERDUTY_TOKEN: ${{ secrets.PAGERDUTY_TOKEN }}
        run: |
          python3 scripts/create-incidents.py \
            --violations sla-violations.json \
            --policy .github/kernel-patch-sla-policy.yml
3. Attack Surface Reduction: Close Reachable Subsystems
The goal here is not to close one CVE — it is to close the subsystem that generates CVE classes. Apply these sysctls on every host that does not have a documented requirement for the feature:
# /etc/sysctl.d/95-kernel-attack-surface.conf
# Applied at boot and enforced by configuration management.
# Disable unprivileged user namespace creation.
# Closes the user-namespace-as-stepping-stone attack path used by multiple
# AI-discovered LPEs in 2025 (vsock, nf_tables expression bypasses).
# NOTE: this sysctl exists only on Debian/Ubuntu-patched kernels; on other
# distributions, user.max_user_namespaces = 0 achieves a similar effect.
# BREAKS: rootless containers (Podman rootless, Docker rootless, BuildKit rootless),
# Flatpak, Chrome sandbox in non-root configurations.
# Per-host override required for builder/developer nodes.
kernel.unprivileged_userns_clone = 0
# Disable io_uring for all users.
# io_uring contributed disproportionately to 2024-2025 LPE counts.
# 0 = enabled, 1 = disabled for unprivileged users, 2 = disabled system-wide.
# BREAKS: ScyllaDB, some PostgreSQL storage backends, high-performance I/O services.
# Audit before deploying on database or high-throughput I/O hosts.
kernel.io_uring_disabled = 2
# Restrict perf_event_open for unprivileged users.
# perf_event has contributed to multiple LPE chains via side-channel + perf ring access.
# -1 = unrestricted; 0 = disallow raw tracepoint access for unprivileged users;
# 1 = additionally disallow CPU event access; 2 = additionally disallow kernel
# profiling; 3 = disallow perf_event_open entirely for non-root
# (value 3 is a Debian/Ubuntu/Android patch, not mainline).
kernel.perf_event_paranoid = 3
# Disable kernel module loading entirely. This is a one-way switch: once set
# to 1 it cannot be cleared without a reboot, and it applies to root as well.
# Set to 1 only after all required modules are loaded at boot.
kernel.modules_disabled = 0
# Prevent kernel address exposure in /proc and other interfaces.
kernel.kptr_restrict = 2
# Restrict BPF to privileged users only (closes eBPF verifier attack surface).
kernel.unprivileged_bpf_disabled = 1
# Harden BPF JIT against JIT spraying.
net.core.bpf_jit_harden = 2
# Restrict dmesg to root (prevents info-leak that aids KASLR bypass).
kernel.dmesg_restrict = 1
# Restrict ptrace to processes with CAP_SYS_PTRACE. (Value 1 limits ptrace to
# direct descendants; 2 is admin-only; 3 disables ptrace attach entirely.)
kernel.yama.ptrace_scope = 2
# Disable kexec (prevents replacing the running kernel without reboot).
kernel.kexec_load_disabled = 1
# Prevent userfaultfd usage by unprivileged users (used in heap grooming).
vm.unprivileged_userfaultfd = 0
# Filesystem hardening.
fs.protected_symlinks = 1
fs.protected_hardlinks = 1
fs.protected_fifos = 2
fs.protected_regular = 2
fs.suid_dumpable = 0
Apply and verify:
# Apply immediately (also persists via the conf file on next boot).
sysctl --system
# Verify the critical values are set.
sysctl -a 2>/dev/null | grep -E \
'io_uring_disabled|unprivileged_userns_clone|unprivileged_bpf|kptr_restrict|perf_event_paranoid'
# Expected output:
# kernel.io_uring_disabled = 2
# kernel.kptr_restrict = 2
# kernel.perf_event_paranoid = 3
# kernel.unprivileged_bpf_disabled = 1
# kernel.unprivileged_userns_clone = 0
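Drift is the quiet failure mode: a node rebuilt from a stale image loses the hardening without anyone noticing. A small compliance check that exits non-zero on drift, suitable for a monitoring agent or CI; the script name is illustrative and the expected values mirror the conf file above:
#!/bin/bash
# verify-attack-surface.sh — fail if any hardening sysctl has drifted.
declare -A want=(
  [kernel.io_uring_disabled]=2
  [kernel.unprivileged_userns_clone]=0
  [kernel.unprivileged_bpf_disabled]=1
  [kernel.perf_event_paranoid]=3
  [kernel.kptr_restrict]=2
  [kernel.dmesg_restrict]=1
)
rc=0
for key in "${!want[@]}"; do
  have=$(sysctl -n "$key" 2>/dev/null)
  if [ "$have" != "${want[$key]}" ]; then
    echo "DRIFT: $key = ${have:-unset}, expected ${want[$key]}" >&2
    rc=1
  fi
done
exit $rc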
Pair the sysctls with a tightened seccomp profile that removes syscalls in the AI-fuzzer-discovered LPE families from the container baseline. What the default Docker and containerd seccomp profiles permit has shifted over time (io_uring_setup, io_uring_enter, and io_uring_register were dropped from both projects' defaults in 2023; the status of others in the family is version-dependent), so audit the effective profile on your runtime version rather than assuming. A hardened baseline for workloads that do not need these:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "io_uring_setup", "io_uring_enter", "io_uring_register",
        "bpf", "userfaultfd", "keyctl", "perf_event_open",
        "add_key", "request_key"
      ],
      "action": "SCMP_ACT_ERRNO",
      "comment": "AI-discovered LPE subsystem families — blocked by default"
    }
  ]
}
Merge this deny-list into your existing baseline: either delete these names from the default profile's allow-list, or concatenate the two syscalls arrays (note that jq -s '[.[]] | add' replaces the syscalls key outright rather than merging it; a working merge is sketched below). The goal is to flip from a deny-list-exceptions mindset to an allow-list mindset where each broad-surface syscall requires explicit justification in the container manifest.
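A sketch of the merge and of pointing a runtime at the result, assuming the deny-list above is saved as deny-lpe.json and your exported baseline as baseline.json (both file names illustrative):
# Concatenate the deny rules into the baseline's syscalls array, keeping the
# baseline's defaultAction and architecture fields.
jq -s '.[0].syscalls = .[1].syscalls + .[0].syscalls | .[0]' \
  baseline.json deny-lpe.json > hardened.json
# Run a workload under the merged profile to confirm nothing it needs is blocked.
docker run --rm --security-opt seccomp=hardened.json alpine:latest true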
4. Kernel Lockdown Mode
Kernel lockdown is a Linux Security Module that restricts the kernel from modifying itself or exposing internal state, even to root. It blocks several classes of post-exploitation persistence that AI-discovered LPE chains commonly pivot into:
# Enable kernel lockdown mode at runtime.
# 'integrity' mode: blocks unsigned module loading, /dev/mem and /dev/port
# writes, unsigned kexec, and hibernation image loading.
# 'confidentiality' mode: additionally blocks interfaces that can read kernel
# memory — kprobes, /proc/kcore, and kernel reads via perf and BPF.
# Note: lockdown can only be escalated at runtime, never downgraded; clearing
# it requires a reboot.
echo integrity > /sys/kernel/security/lockdown
# Verify the current lockdown state.
cat /sys/kernel/security/lockdown
# Output: none [integrity] confidentiality
# The bracketed value is the active mode.
For persistent lockdown across reboots, enable it via the kernel command line in /etc/default/grub:
# /etc/default/grub (Ubuntu/Debian)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash lockdown=integrity lsm=lockdown,yama,apparmor,bpf"
# Apply.
update-grub
# or on RHEL/CentOS:
grub2-mkconfig -o /boot/grub2/grub.cfg
What lockdown blocks that matters for LPE post-exploitation:
- integrity mode: loading of unsigned kernel modules (an attacker cannot install a rootkit kernel module without a trusted key); writes to /dev/mem, /dev/kmem, and /dev/port (blocks direct kernel memory manipulation); hibernation image loading (prevents attacking the kernel via a crafted hibernate image).
- confidentiality mode, additionally: kprobes (blocks attacker tooling that hooks kernel functions via kprobe for persistence or further exploitation); /proc/kcore access (prevents reading kernel memory through the proc interface).
Lockdown is not a substitute for patching — a pre-execution LPE still succeeds and gives the attacker root. But it substantially raises the cost of what an attacker can do with root post-exploitation, and many AI-discovered exploit chains include a module-loading or kprobe-based persistence stage that lockdown stops cold.
5. Live Kernel Patching
Live kernel patching (kpatch on RHEL, Canonical Livepatch on Ubuntu) applies kernel fixes in memory without a reboot. Coverage is partial — not every fix can be expressed as a live patch, particularly fixes that change kernel data structures — but for the memory-management and pointer-validation fixes that characterise AI-discovered LPEs, live patch coverage rates have been improving. Configure live patching and set it as the first-response mechanism for the 4-hour SLA:
# Ubuntu Pro / Canonical Livepatch.
# Requires Ubuntu Pro subscription (free for personal use, up to 5 machines).
pro enable livepatch
# Check current patch status.
canonical-livepatch status
# Output example:
# Running kernel: 6.8.0-57-generic
# ...
# kernel-state: fully-patched
# patches:
# - cve-2025-21756: applied (vsock vmci_transport use-after-free)
# - cve-2024-50302: applied (HID stack overflow)
canonical-livepatch status --verbose
# Force an immediate check against the Livepatch service.
canonical-livepatch refresh
# RHEL / CentOS / AlmaLinux — kpatch.
sudo dnf install kpatch kpatch-dnf
# Enable the kpatch service to apply patches at boot and on installation.
sudo systemctl enable --now kpatch
# List applied patches.
kpatch list
# Output example:
# Loaded patch modules:
# kpatch-5_14_0-427_43_1-5_14_0_427_43_1-1 [enabled]
# CVE-2025-21756: vsock: fix use-after-free in vmci_transport
# Install a specific patch module (shipped as a kpatch-patch RPM from Red Hat's
# repositories, or built locally with kpatch-build).
kpatch install kernel-patch-CVE-2025-21756.ko
# Verify the patch is active without a reboot.
kpatch list
Automate live-patch application in your fleet management tooling. For Ansible:
# playbooks/kernel-livepatch.yml
- name: Apply kernel live patches
  hosts: all
  become: true
  tasks:
    - name: Refresh Canonical Livepatch (Ubuntu)
      command: canonical-livepatch refresh
      when: ansible_distribution == "Ubuntu"
      changed_when: false

    - name: Verify Livepatch fully patched (Ubuntu)
      command: canonical-livepatch status
      register: livepatch_status
      when: ansible_distribution == "Ubuntu"
      changed_when: false

    - name: Alert if not fully patched
      fail:
        msg: "Host {{ inventory_hostname }} is not fully live-patched: {{ livepatch_status.stdout }}"
      when:
        - ansible_distribution == "Ubuntu"
        - "'fully-patched' not in livepatch_status.stdout"

    - name: Update kpatch patches (RHEL)
      dnf:
        name: "kpatch-patch*"
        state: latest
      when: ansible_os_family == "RedHat"

    - name: Restart kpatch service (RHEL)
      systemd:
        name: kpatch
        state: restarted
      when: ansible_os_family == "RedHat"
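Run the playbook on a schedule tighter than the livepatch SLA; the inventory name here is illustrative:
# A failed 'Alert if not fully patched' task surfaces unpatched hosts in the play recap.
ansible-playbook -i production.ini playbooks/kernel-livepatch.yml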
6. eBPF-Based LPE Detection via Tetragon
commit_creds() is the kernel function through which a process's credentials are replaced, and nearly every working kernel privilege escalation exploit — regardless of the specific vulnerability class — eventually calls commit_creds() with a new credential structure that sets uid=0 (the exceptions forge the same effect by overwriting the cred structure directly). Instrumenting this function catches most exploitation attempts that bypassed your patching and hardening controls.
# tetragon-lpe-detection.yml
# Deploy with: kubectl apply -f tetragon-lpe-detection.yml
# Or on non-Kubernetes hosts: tetra tracingpolicy add -f tetragon-lpe-detection.yml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-privilege-escalation
  namespace: kube-system
spec:
  kprobes:
    - call: "commit_creds"
      syscall: false
      args:
        - index: 0
          type: "cred"
      selectors:
        - matchArgs:
            - index: 0
              operator: "NotEqual"
              values: ["0"]
          matchProcessNs:
            - namespace: "Uid"
              operator: "NotIn"
              values: ["0"]
          matchActions:
            - action: Sigkill
              rateLimit: "1m"
            - action: Post
    # Also instrument the specific syscall families used by AI-discovered LPEs.
    - call: "__x64_sys_io_uring_setup"
      syscall: true
      selectors:
        - matchProcessNs:
            - namespace: "Uid"
              operator: "NotIn"
              values: ["0"]
          matchActions:
            - action: Post
    - call: "security_bpf"
      syscall: false
      selectors:
        - matchProcessNs:
            - namespace: "Uid"
              operator: "NotIn"
              values: ["0"]
          matchActions:
            - action: Post
The Sigkill action on the commit_creds selector terminates the process at the moment it attempts to become UID 0 from a non-root context. This kills the exploit process before it completes privilege acquisition. The Post action generates a structured log event with the full process tree, parent PID, binary path, and argument values — providing the forensic record needed for incident response.
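Before trusting the policy, confirm events are actually flowing; a quick check, assuming Tetragon runs as the standard DaemonSet in kube-system:
# Stream Tetragon events in compact form from any agent pod.
kubectl exec -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact
# On a non-Kubernetes host running the tetragon service:
tetra getevents -o compact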
Tune to reduce false positives from legitimate su, sudo, and PAM-mediated privilege transitions by adding process name filters:
selectors:
  - matchArgs:
      - index: 0
        operator: "NotEqual"
        values: ["0"]
    matchProcessNs:
      - namespace: "Uid"
        operator: "NotIn"
        values: ["0"]
    matchBinaries:
      - operator: "NotIn"
        values:
          - "/usr/bin/sudo"
          - "/usr/bin/su"
          - "/sbin/unix_chkpwd"
          - "/usr/sbin/sshd"
    matchActions:
      - action: Sigkill
      - action: Post
The exclusion list must be maintained — add entries only with documented justification and never add the entry until you have verified the binary path on the specific host.
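Binary paths differ across distributions, and usr-merge symlinks blur them further, so resolve each path on the actual host before it enters the exclusion list:
# Confirm each privilege-transitioning binary exists at the exact path the
# policy will match, and print its canonical location.
for b in /usr/bin/sudo /usr/bin/su /usr/sbin/sshd /sbin/unix_chkpwd; do
  if [ -e "$b" ]; then readlink -f "$b"; else echo "MISSING: $b"; fi
done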
Expected Behaviour
A hardened host shows the following state:
# Kernel lockdown mode active.
cat /sys/kernel/security/lockdown
# none [integrity] confidentiality
# Unsigned module load attempt blocked by lockdown.
insmod /tmp/malicious.ko
# insmod: ERROR: could not insert module /tmp/malicious.ko: Operation not permitted
# (kernel log: Lockdown: insmod: unsigned module loading is restricted; see man kernel_lockdown.7)
# Attack surface sysctls confirmed.
sysctl kernel.io_uring_disabled kernel.unprivileged_userns_clone kernel.unprivileged_bpf_disabled kernel.perf_event_paranoid
# kernel.io_uring_disabled = 2
# kernel.unprivileged_userns_clone = 0
# kernel.unprivileged_bpf_disabled = 1
# kernel.perf_event_paranoid = 3
# io_uring call blocked for unprivileged process.
# (with io_uring_disabled=2, the syscall is blocked system-wide)
strace -f -e io_uring_setup sudo -u nobody -- \
  fio --name=probe --ioengine=io_uring --rw=read --size=1M --filename=/tmp/testfile
# io_uring_setup(32, ...) = -1 EPERM (Operation not permitted)
# Canonical Livepatch fully applied.
canonical-livepatch status
# ...
# kernel-state: fully-patched
# patches:
# - cve-2025-21756: applied
# - cve-2024-50302: applied
# Tetragon policy active.
kubectl get tracingpolicy -n kube-system detect-privilege-escalation
# NAME AGE
# detect-privilege-escalation 4h32m
# Tetragon blocking a simulated commit_creds escalation attempt.
# (from Tetragon event log)
# {"process":{"exec_id":"..","pid":12443,"uid":1000,"binary":"/tmp/exploit","arguments":""}
# "function_name":"commit_creds","action":"SIGKILL",
# "message":"Process killed: attempted privilege escalation to uid=0"}
The combination of sysctls, lockdown, and Tetragon creates three independent layers: reduced reachability (fewer vulnerable syscall paths available), post-exploitation restriction (lockdown prevents rootkit persistence), and detection-plus-kill (Tetragon terminates the exploit process at the privilege-acquisition step).
Trade-offs
48-hour reboot SLA for critical kernel LPE. This is aggressive for many organisations. Achieving it requires pre-approved emergency change procedures, automated deployment pipelines that can target all kernel versions in the fleet within that window, and coordination with application teams for pod rescheduling on Kubernetes worker nodes. The alternative — treating kernel LPE fixes like normal patches on a 30-day cycle — exposes production environments to weaponised exploits for weeks. The 48-hour target is an engineering goal that forces the necessary investment in live patching and automated rolling restarts, not an arbitrary deadline.
kernel.io_uring_disabled = 2. This breaks any service that uses io_uring for asynchronous I/O, including ScyllaDB (which uses io_uring as a core part of its storage engine), recent versions of PostgreSQL compiled with io_uring support, and high-throughput network services that adopted the io_uring path for performance. Audit your workload before deploying: grep -r io_uring /proc/*/maps 2>/dev/null identifies running processes with io_uring memory mappings. On hosts running io_uring-dependent workloads, set kernel.io_uring_disabled = 1 (restricts to root only) rather than 2, and ensure those processes run as root or with explicit capability grants. The tradeoff is real but quantifiable; audit first, disable second.
kernel.unprivileged_userns_clone = 0. This is the most breaking change in the baseline. It prevents any non-root process from creating a new user namespace, which breaks: rootless Podman, rootless Docker, rootless BuildKit, Flatpak application sandboxing, Chrome and Chromium’s renderer sandbox in non-root configurations, and some systemd service isolation configurations. Developer workstations and CI build nodes running rootless container toolchains cannot have this sysctl set. Separate your hardening policy by host role — production server hosts (which should not be running rootless container build tools) can have this sysctl set, while build and developer nodes cannot.
Tetragon commit_creds Sigkill action. The Sigkill fires on any non-root process that calls commit_creds with a non-zero UID target. The matchBinaries exclusion list in the selector must cover every legitimate binary on the host that transitions privileges: sudo, su, sshd, unix_chkpwd, PAM modules that call into the kernel. An incomplete exclusion list will kill sudo invocations, producing sudden loss of administrative access. Test the policy in Post-only mode (log but do not kill) for at least 72 hours in a production-representative environment before enabling Sigkill. Maintain the exclusion list as infrastructure-as-code with mandatory review.
Kernel lockdown and confidentiality mode. Lockdown at confidentiality level blocks kprobes — which means it blocks the Tetragon kprobe-based detection policy described in section 6; integrity level does not restrict kprobes. You cannot have both confidentiality lockdown and Tetragon kprobe instrumentation active simultaneously on the same host. The resolution is to choose: integrity lockdown plus Tetragon for most production hosts, which keeps module-signing enforcement alongside detection-and-kill capability, or confidentiality lockdown for high-security hosts where you can tolerate losing Tetragon visibility in exchange for the strongest post-exploitation resistance. Most production server environments are better served by keeping Tetragon — the detection-and-kill behaviour stops exploitation before the attacker has root, making the additional kernel-secret protections that confidentiality mode provides less critical.
Failure Modes
Treating AI-discovered CVEs with 30-day patch SLAs. This is the most consequential operational mistake. The CVSS model was designed when exploit development took weeks to months. For AI-discovered LPE bugs where the same AI tooling that found the bug can generate a PoC from the patch diff in hours, a 30-day SLA means the system is exposed to a working weaponised exploit for 29+ days of that window. The SLA table must differentiate AI-fuzzer-class advisories as a distinct priority tier with a 48-hour (reboot) and 4-hour (livepatch) target.
Patching the CVE without closing the subsystem class. CVE-2025-21756 is a use-after-free in vmci_transport within the vsock subsystem. Patching that specific bug leaves vsock intact as an attack surface that AI fuzzers will continue mining for the next UAF in the same subsystem. The defensive posture that matches the attack tempo is: patch the CVE (necessary but not sufficient) AND disable the subsystem if the workload does not require it. For vsock specifically, modprobe -r vmw_vsock_vmci_transport removes the VMCI transport module if the host is not running VMware vSphere guest tools. Check with lsmod | grep vsock; if nothing on the host requires vsock, unload it and add it to the module denylist, as sketched below.
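Unloading alone does not survive a reboot, and the kernel's module autoloader will reload vsock the next time a process opens an AF_VSOCK socket. A sketch of pinning the module off on hosts with no vsock requirement:
# Unload now (fails harmlessly if something still holds a reference).
modprobe -r vmw_vsock_vmci_transport vsock 2>/dev/null
# Prevent autoloading: an 'install' override makes any future load attempt fail.
cat > /etc/modprobe.d/disable-vsock.conf <<'EOF'
install vsock /bin/false
install vmw_vsock_vmci_transport /bin/false
install vmw_vsock_virtio_transport /bin/false
EOF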
Using NVD as the primary kernel CVE feed. NVD has historically lagged kernel CVE publication by days to weeks — a gap that is catastrophic when the exploitation window is measured in hours. The kernel team’s own CVE feed at https://www.kernel.org/security.json and the vulns git repository are more complete and faster. Triage pipelines built on NVD alone will miss advisories during the critical early window. Build from the kernel feed; NVD is a secondary validation source, not a primary one.
Assuming managed Kubernetes automatically patches worker node kernels. The managed Kubernetes control plane (the EKS/GKE/AKS API server, etcd, scheduler) is patched by the cloud provider. The worker node operating system and kernel are not — they are the customer’s responsibility. GKE’s “release channel” applies node pool updates, but these are not automatically applied to existing node pools in most configurations; they require node pool rotation (rolling replacement), and most production clusters have PodDisruptionBudgets and stateful workloads that prevent this from being fully automated. Verify your specific managed Kubernetes node patching configuration. Do not infer it from the fact that the control plane is managed.
Deploying the commit_creds Tetragon Sigkill policy without exclusion list validation. The first time a legitimate sudo call fires the Sigkill action, the operator on that host loses the ability to escalate to root via sudo for all running sessions. Recovery requires out-of-band console access (BMC/iDRAC/IPMI on bare metal, instance serial console on cloud VMs). Validate the exclusion list against every binary on the target host that transitions privileges before enabling the Sigkill action. The Post-only mode burn-in period is mandatory, not optional.
Expecting kernel.modules_disabled = 1 to stop in-memory exploitation. Setting modules_disabled = 1 after boot prevents loading new kernel modules, which is useful for stopping rootkit installation. It does not prevent exploitation of vulnerabilities in already-loaded modules (HID, vsock, nf_tables, io_uring are all typically loaded at boot). The sysctl’s value is narrow — it is a post-exploitation persistence control, not an exploitation prevention control. Pair it with the subsystem-disable sysctls and Tetragon detection for a complete picture.
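Because the flip to 1 is one-way, sequence it after everything the host needs is loaded. A systemd oneshot ordered late in boot is one way to do it; the unit name is illustrative:
# Install a oneshot unit that locks module loading once boot is complete.
cat > /etc/systemd/system/lock-modules.service <<'EOF'
[Unit]
Description=Disable further kernel module loading
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/sysctl -w kernel.modules_disabled=1

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable lock-modules.service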