perf_event_open and Kernel Profiling as an Attack Surface: CVE-2023-2235 and Hardening Paranoid Mode

The Problem

perf_event_open() is a Linux syscall (number 298 on x86_64) that opens a performance monitoring context, giving the caller access to hardware performance counters (CPU cycles, cache misses, branch mispredictions), software events (page faults, context switches), and kernel tracepoints. It is the foundation for perf stat, perf record, continuous profiling agents (Datadog, Pyroscope), and native and JVM profilers such as async-profiler. When a tool reports hardware counter data such as “CPU cycles: 2.4 billion,” it got that data through perf_event_open() or an interface built on top of it. Not every profiler depends on it: py-spy reads the target’s memory directly and Go’s built-in pprof CPU profiler is driven by timer signals, so check each tool rather than assuming.

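Access is governed primarily by a single sysctl, kernel.perf_event_paranoid, which restricts callers that lack CAP_PERFMON (Linux 5.8+) or CAP_SYS_ADMIN. The restrictions are cumulative: each level adds to the ones below it.

-1: No restrictions. Any user can open almost any event, including kernel-level ones.
0: Unprivileged users lose raw tracepoint access and the ftrace function tracepoint. System-wide (per-CPU) events and kernel profiling are still allowed.
1: Unprivileged users also lose system-wide (per-CPU) events. They can still profile their own processes, including time those processes spend in the kernel.
2: Unprivileged users also lose kernel profiling. They can still open events on their own processes for user-space measurement. This is the upstream kernel default since Linux 4.6.
3: perf_event_open() is refused for every caller without CAP_PERFMON or CAP_SYS_ADMIN. This level is not part of the upstream kernel; it comes from a patch carried by Debian, Ubuntu, and Android kernels, and Debian and Ubuntu ship 3 or higher as their default.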
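
As a quick aid when auditing a node, the levels can be turned into a small lookup. This is a minimal sketch that follows the upstream kernel documentation for levels -1 to 2 and assumes that 3 and above are only enforced on kernels carrying the distribution patch; describe_paranoid is a hypothetical helper name.

```shell
#!/bin/sh
# describe_paranoid LEVEL: print what a caller without CAP_PERFMON or
# CAP_SYS_ADMIN is still allowed to do at that level.
describe_paranoid() {
    case "$1" in
        -1) echo "unrestricted" ;;
        0)  echo "system-wide and kernel profiling allowed, raw tracepoints blocked" ;;
        1)  echo "own processes only, kernel profiling allowed" ;;
        2)  echo "own processes only, user space only" ;;
        *)  # Non-numeric input makes the test fail quietly and falls to "unknown".
            if [ "$1" -ge 3 ] 2>/dev/null; then
                echo "blocked (only on kernels carrying the level-3 patch)"
            else
                echo "unknown"
            fi ;;
    esac
}

# Report the running value when the sysctl file is readable.
f=/proc/sys/kernel/perf_event_paranoid
if [ -r "$f" ]; then
    level=$(cat "$f")
    echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
fi
```

The output describes the sysctl only. A seccomp filter or a granted capability can make the effective access for a given container stricter or looser than the sysctl suggests.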

The problem is that the kernel code implementing perf_event_open() is one of the most complex syscall handlers in the Linux kernel, and its complexity has produced a consistent stream of exploitable vulnerabilities. The upstream default (paranoid=2) still lets any unprivileged process open events on its own threads, and that has been enough to reach the bugs discussed below.

CVE-2023-2235 was disclosed in May 2023. It is a use-after-free in perf_group_detach(), the function responsible for removing an event from its event group. According to the CVE record, the function did not check the attach_state of an event’s siblings before calling add_event_to_groups(), while the remove_on_exec feature made it possible for list_del_event() to run on a sibling before it was detached from its group. The result is a dangling pointer that is used after the object has been freed. A successful exploit can yield a read/write primitive in kernel memory, which is sufficient to overwrite cred structures or function pointers and achieve ring-0 code execution.

The exploitation requirement is minimal: the ability to call perf_event_open() from userspace on an unpatched kernel, which the upstream default of perf_event_paranoid=2 permits. No setuid binary, no special group membership, no additional capabilities. Any process running under any unprivileged UID meets the criteria unless the kernel enforces level 3 or a seccomp filter blocks the syscall. The same holds in a container: containers do not have their own kernel, so the host kernel version and the host sysctl apply to every container on the node.

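Affected kernel versions: 6.1.x before 6.1.29, 6.2.x before 6.2.16, and 6.3.x before 6.3.2. The CVE record identifies the upstream fix as commit fd0815f632c2 and recommends upgrading past it. Stable backports land at different times, so confirm the fixed release for your kernel line against your distribution’s advisory.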

CVE-2023-2235 is not an isolated incident. It is the most recent entry in a pattern that has repeated since at least 2020:

CVE-2020-25704: Memory leak in perf_event_parse_addr_filter(). Repeated PERF_EVENT_IOC_SET_FILTER calls with crafted address filters leak kernel allocations, letting a local user exhaust memory and cause a denial of service.
CVE-2021-33624: Speculative type confusion in the BPF verifier (kernel/bpf/verifier.c, fixed in 5.12.13) that lets an unprivileged BPF program read arbitrary memory through a side channel. This one sits in the neighbouring BPF subsystem rather than in perf itself.
CVE-2022-1729: Race condition in perf_event_open() around group leader handling, exploitable by an unprivileged user for privilege escalation. The predecessor bug to CVE-2023-2235: same subsystem, same group leader/sibling lifecycle class, a year earlier. Reachable at paranoid <= 2. Publicly exploited.
CVE-2023-2235: Use-after-free in perf_group_detach(). Privilege escalation to root. Reachable at paranoid <= 2.

The pattern is: new kernel version, new rearrangement of the event group lifecycle code, new use-after-free or race condition in the same subsystem. The subsystem complexity does not decrease between kernel versions.

The Kubernetes operational context amplifies this risk. Observability DaemonSets (Datadog agent, Dynatrace OneAgent, New Relic Infrastructure, Pixie) need hardware counter access to deliver some of their profiling features. A common way to grant it is a privileged init container that runs at pod startup and writes kernel.perf_event_paranoid=0 or -1 to the host’s sysctl:

initContainers:
- name: init-sysctl
  image: busybox
  securityContext:
    privileged: true
  command: ["sysctl", "-w", "kernel.perf_event_paranoid=-1"]

This is a node-level change. Once the init container has run, every pod on that node — including pods from every other namespace, every tenant, every workload — can call perf_event_open() at the unrestricted level. A single privileged DaemonSet pod widens the kernel attack surface for every other workload on the same node. If any of those workloads is compromised, the attacker has perf_event_open() access to a potentially unpatched kernel.
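
One way to find out whether anything in a repository of manifests does this is a plain text search. The sketch below assumes the manifests are available as files on disk; find_paranoid_writers is a hypothetical helper name, and the search does not parse YAML, so it will miss values that are assembled at runtime.

```shell
#!/bin/sh
# find_paranoid_writers DIR: list files under DIR that mention the sysctl,
# so that each one can be reviewed by hand.
find_paranoid_writers() {
    grep -rl 'perf_event_paranoid' "$1" 2>/dev/null | sort
}

# Example (path is illustrative):
#   find_paranoid_writers ./manifests
```

Every hit deserves a look at the surrounding securityContext: the write only succeeds from a privileged container, so the two usually appear together.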

Threat Model

Default configuration, unpatched kernel. perf_event_paranoid=2 is the upstream default and is what distributions that do not carry the level-3 patch ship; Debian and Ubuntu ship a stricter value. At level 2, any unprivileged process, including any container without a seccomp restriction on perf_event_open(), can call the syscall to measure its own threads. CVE-2023-2235 requires exactly this: an unprivileged caller with access to the syscall on an affected kernel (6.1.x before 6.1.29, 6.2.x before 6.2.16, 6.3.x before 6.3.2). It requires no other preconditions and produces a kernel read/write primitive.

Observability DaemonSet sysctl mutation. A privileged DaemonSet init container sets perf_event_paranoid=-1 at node startup to enable full hardware counter access for its profiling agent. This sysctl is a host namespace setting — it applies globally to all containers on the node, not just the agent. A compromised application container running on the same node gains unrestricted perf_event_open() access. The agent’s profiling capability and the attack surface expansion are inseparable when implemented via the sysctl.

CAP_SYS_PTRACE added for debugging and not removed. On a restricted node, the privileged perf_event_open() modes normally require CAP_PERFMON (Linux 5.8+) or CAP_SYS_ADMIN (pre-5.8). However, CAP_SYS_PTRACE is treated as sufficient by the kernel’s perf_event_open() access check for some call modes, notably the ptrace-style permission check applied when attaching to another process. A pod spec that adds SYS_PTRACE “for a debugging session” and is never updated back to a minimal capability set keeps that access. CAP_SYS_PTRACE is also frequently added to containers running Python or Java profilers (py-spy, async-profiler), because those tools read or trace the target process directly when perf counters are not available.

Cross-process performance counter side channels. Even without exploiting a CVE, hardware performance counter access enables side-channel attacks against co-located processes. Processes sharing a physical CPU core share hardware counter state. A process with perf_event_open() access can use counter data to reconstruct cache access patterns from adjacent processes, enabling Spectre-class cross-process information disclosure. This is distinct from the use-after-free class of vulnerabilities but equally requires restricting counter access as the mitigation.

Container escape via privileged profiling. A container that achieves perf_event_open() access to kernel-level events (paranoid <= 0) can attach to kernel tracepoints that expose the internal kernel state. Combined with a kernel vulnerability, this provides the primitives for a container escape. The perf subsystem’s privileged-level events expose significantly more kernel internal state than CPU counter events alone.

Hardening Configuration

1. Set perf_event_paranoid to 3

Check the current setting first. Upstream kernels default to 2; Debian and Ubuntu kernels ship with 3 or higher:

sysctl kernel.perf_event_paranoid

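To restrict to privileged callers only (this takes effect only on kernels that carry the level-3 patch; an unpatched upstream kernel accepts the value but enforces it like 2):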

echo "kernel.perf_event_paranoid=3" >> /etc/sysctl.d/99-security.conf
sysctl -p /etc/sysctl.d/99-security.conf

Verify the change was applied:

sysctl kernel.perf_event_paranoid
# kernel.perf_event_paranoid = 3

What each level breaks in practice:

3: Disables perf stat, perf record, perf top for all non-root users. Breaks async-profiler’s perf-based modes and any profiler that calls perf_event_open() directly. Continuous profiling agents (Datadog, Pyroscope) require root or CAP_PERFMON. This is the hardest setting and the only one that closes the attack surface completely for unprivileged callers, but it is enforced only by kernels that carry the level-3 patch (Debian, Ubuntu, Android); elsewhere it behaves like 2.
2 (upstream default): Unprivileged users can still profile their own processes in user space; kernel profiling and system-wide events require CAP_PERFMON (Linux 5.8+) or CAP_SYS_ADMIN (pre-5.8). Does not stop an unprivileged caller from reaching the event group code involved in CVE-2022-1729 and CVE-2023-2235, and does not protect against containers or processes that have been granted CAP_SYS_PTRACE for some call modes.
1: Additionally allows unprivileged callers to profile kernel code running on behalf of their own processes. This is not a safe setting against unpatched kernels.
0 and -1: Should not appear on production systems unless a specific kernel profiling workflow requires it, with compensating controls in place.

On Linux 5.8 and later, CAP_PERFMON provides profiling access without the broader privileges of CAP_SYS_ADMIN. This is the intended path for granting profiler access at paranoid=2 or 3. However, not all profiling tools have been updated to request CAP_PERFMON instead of CAP_SYS_ADMIN. Verify tool compatibility before relying on it.

2. Seccomp Profile Blocking perf_event_open()

A seccomp profile that explicitly blocks syscall 298 prevents any process in the container from calling perf_event_open(), regardless of the host sysctl value. This is the defense-in-depth layer: even if a DaemonSet has set paranoid=-1 on the node, containers with this seccomp profile cannot call the syscall.

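Create block-perf-event.json in the kubelet’s seccomp profile directory (by default /var/lib/kubelet/seccomp/), which is where Localhost profiles are resolved from: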

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["perf_event_open"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1
    }
  ]
}

errnoRet: 1 returns EPERM. Some tools handle EPERM from perf_event_open() gracefully and fall back to software-only profiling. Others fail loudly. Test before deploying.

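Separately, make the container runtime’s default profile the cluster-wide default through the kubelet configuration. Note that this setting applies RuntimeDefault, not the custom profile above: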

# /var/lib/kubelet/config.yaml
seccompDefault: true

With seccompDefault: true (beta from Kubernetes 1.25, GA in 1.27), every pod that does not explicitly set a seccomp profile receives the RuntimeDefault seccomp profile. The default profiles shipped by common container runtimes, containerd’s for example, block many dangerous syscalls, including perf_event_open() for containers that lack the relevant capabilities. This closes the gap for pods that have no seccompProfile set.
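
Whether a given container actually ended up with a filter can be checked from inside it, because the kernel exposes the seccomp mode in /proc/<pid>/status. A minimal sketch; seccomp_mode_name and seccomp_mode_of are hypothetical helper names.

```shell
#!/bin/sh
# seccomp_mode_name N: translate the Seccomp field of /proc/<pid>/status.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# seccomp_mode_of STATUS_FILE: extract the numeric Seccomp field.
# The exact match on "Seccomp:" skips the separate Seccomp_filters line.
seccomp_mode_of() {
    awk '$1 == "Seccomp:" { print $2 }' "$1"
}

status=/proc/self/status
if [ -r "$status" ]; then
    echo "seccomp: $(seccomp_mode_name "$(seccomp_mode_of "$status")")"
fi
```

A result of "filter" only shows that some filter is installed. It does not prove that perf_event_open() is among the blocked syscalls, so it complements rather than replaces a test call from inside the container.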

For workloads that genuinely require profiling access, use a dedicated RuntimeClass to pin them to monitoring nodes, and give the pod an explicit Localhost seccomp profile that allows the syscall:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: profiling-workload
handler: runc
scheduling:
  # nodeSelector is a plain label map; pods using this class only land on matching nodes
  nodeSelector:
    node-role: monitoring
---
apiVersion: v1
kind: Pod
metadata:
  name: datadog-agent
  namespace: monitoring
spec:
  runtimeClassName: profiling-workload
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: allow-perf-event.json

Where allow-perf-event.json is a profile that explicitly permits perf_event_open(). This makes the exception explicit and auditable — it is a named deviation, not an absent default.

3. Restrict CAP_SYS_PTRACE in Pod Security

The default secure container security context:

spec:
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      capabilities:
        drop: ["ALL"]
        # Do NOT add SYS_PTRACE unless the workload genuinely requires it

CAP_SYS_PTRACE bypasses perf_event_paranoid=2 in some kernel call paths. A pod with SYS_PTRACE can call perf_event_open() for hardware counter access even when the sysctl should be restricting it. This means that paranoid=2 on the host does not protect against pods that carry CAP_SYS_PTRACE.
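
To see which of the perf-relevant capabilities a running process actually holds, the CapEff mask in /proc/<pid>/status can be decoded. A minimal sketch; has_cap and perf_relevant_caps are hypothetical helper names, and the bit positions are the kernel’s capability numbers (CAP_SYS_PTRACE 19, CAP_SYS_ADMIN 21, CAP_PERFMON 38, CAP_BPF 39).

```shell
#!/bin/sh
# has_cap HEXMASK BIT: succeed if capability number BIT is set in a mask as
# printed in the CapEff/CapPrm/CapBnd lines of /proc/<pid>/status.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# perf_relevant_caps HEXMASK: print the perf-related capabilities in the mask.
perf_relevant_caps() {
    out=""
    if has_cap "$1" 19; then out="$out SYS_PTRACE"; fi
    if has_cap "$1" 21; then out="$out SYS_ADMIN"; fi
    if has_cap "$1" 38; then out="$out PERFMON"; fi
    if has_cap "$1" 39; then out="$out BPF"; fi
    echo "${out# }"
}

# Decode the current process when the status file is available.
if [ -r /proc/self/status ]; then
    mask=$(awk '$1 == "CapEff:" { print $2 }' /proc/self/status)
    if [ -n "$mask" ]; then
        echo "CapEff=$mask -> $(perf_relevant_caps "$mask")"
    fi
fi
```

An empty result for an application container is the expected outcome of drop: ["ALL"]; anything else is worth tracing back to the pod spec that granted it.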

Enforce this with a Kyverno policy that denies SYS_PTRACE outside the monitoring namespace:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: deny-sys-ptrace-outside-monitoring
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: deny-sys-ptrace
    match:
      any:
      - resources:
          kinds: ["Pod"]
    exclude:
      any:
      - resources:
          namespaces: ["monitoring"]
    validate:
      message: "CAP_SYS_PTRACE is not permitted outside the monitoring namespace. Remove it from securityContext.capabilities.add."
      deny:
        conditions:
          any:
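          - key: "SYS_PTRACE"
            operator: AnyIn
            # "|| `[]`" keeps the expression resolvable when no container sets capabilities.add
            value: "{{ request.object.spec.containers[].securityContext.capabilities.add[] || `[]` }}"
          - key: "SYS_PTRACE"
            operator: AnyIn
            # initContainers is optional, so default to an empty list when it is absent
            value: "{{ request.object.spec.initContainers[].securityContext.capabilities.add[] || `[]` }}"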

This policy rejects pod creation for any namespace other than monitoring if any container or init container requests SYS_PTRACE. Apply an equivalent policy to prevent the ALL capability set from being added (which includes SYS_PTRACE implicitly) in non-monitoring namespaces.

4. Observability Agent Configuration Without Sysctl Changes

The Datadog Agent’s continuous profiler (v7.36+) and eBPF-based system probe can be configured to use eBPF ring buffers and CO-RE (Compile Once - Run Everywhere) programs rather than perf_event_open() for several data sources. For environments where kernel patch status is uncertain, reducing reliance on perf_event_open() in the agent itself limits the risk surface of the monitoring tooling.

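Datadog Helm values that minimise perf_event_open() dependency. Key names vary between chart versions, so confirm each one against the values.yaml of the chart you deploy: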

# datadog-agent values.yaml (Helm)
datadog:
  systemProbe:
    enabled: true
    enableTCPQueueLength: false
    enableOOMKill: false
    # btfPath enables CO-RE: avoids kernel-headers dependency and some perf_event paths
    btfPath: ""
  profiling:
    enabled: true
    # The eBPF-based continuous profiler (requires Linux 4.14+, kernel BTF preferred)
    # does not require perf_event_paranoid changes for CPU profiling on newer agent versions
    ebpfEnabled: true
  securityAgent:
    enabled: true

agents:
  containers:
    agent:
      securityContext:
        capabilities:
          add:
            - SYS_ADMIN       # for eBPF program loading (pre-5.8)
            # CAP_BPF + CAP_PERFMON replace SYS_ADMIN on 5.8+ for eBPF profiling
            # add: ["BPF", "PERFMON"] on 5.8+ kernels
          drop: ["ALL"]

Note that on kernels earlier than 5.8, eBPF-based profiling still requires CAP_SYS_ADMIN. On 5.8+, CAP_BPF and CAP_PERFMON are the correct minimum. Consult your agent vendor’s documentation for the exact minimum capabilities required by their current version — these have changed across agent releases as vendors adopt the new capability split.

The key principle: prefer approaches that place the profiling capability in the agent’s own capability set rather than widening the host sysctl. A sysctl change is a node-level blast radius; a capability on a specific pod is scoped to that pod.
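
The capability choice described above depends only on the kernel version, so it can be scripted when generating per-node-pool values. A minimal sketch, assuming sort -V is available (GNU coreutils or BusyBox); version_ge and profiling_caps are hypothetical helper names, and the check looks at the upstream version number only, so it does not detect distribution backports of the capability split.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A is greater than or equal to B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# profiling_caps KERNEL_RELEASE: print the capabilities to add for eBPF-based
# profiling, following the 5.8 split between CAP_SYS_ADMIN and CAP_BPF/CAP_PERFMON.
profiling_caps() {
    base=${1%%-*}    # strip a distro suffix such as "-91-generic"
    if version_ge "$base" 5.8; then
        echo "BPF PERFMON"
    else
        echo "SYS_ADMIN"
    fi
}

# Example:
#   profiling_caps "$(uname -r)"
```

Treat the result as a starting point. The agent’s own documentation remains the authority on which capabilities its current release requests.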

5. Kernel Patch Tracking for perf_event CVEs

CVE-2023-2235 was patched in the following stable series:

# Patched versions:
# 6.3.2+   (stable 6.3 series)
# 6.2.16+  (stable 6.2 series)
# 6.1.29+  (stable 6.1 series, LTS)

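uname -r
# Example on Debian 12: 6.1.0-9-amd64 (an ABI name, not the upstream version)

# Check the Debian/Ubuntu package version, which tracks the upstream release:
apt-cache show "linux-image-$(uname -r)" | grep -E "^Version:"
# Version: 6.1.27-1 is older than 6.1.29, so treat it as vulnerable unless
# the package changelog lists a backport of the fix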

# On RHEL/CentOS, kernel CVE backports are applied to the distro kernel version:
rpm -q kernel
# kernel-5.14.0-284.11.1.el9_2.x86_64 — check Red Hat CVE tracker
# for whether CVE-2023-2235 backport is included in your build

Distribution kernels backport patches without incrementing to the upstream patched version number. Debian’s linux-image-6.1.0-9-amd64 may contain the CVE-2023-2235 fix despite reporting 6.1.0 in uname -r. The authoritative source for a given distribution is its security advisory:

Ubuntu: ubuntu-cve-tracker and USN advisories
Debian: Debian Security Tracker at security-tracker.debian.org
Red Hat: access.redhat.com/security/cve/CVE-2023-2235
Amazon Linux: alas.aws.amazon.com

For a fleet, query systematically:

# On Ubuntu/Debian nodes:
dpkg -l 'linux-image-*' | grep -E "^ii" | awk '{print $2, $3}'
# Compare against the USN that patches CVE-2023-2235 (USN-6119-1 and related)
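
For nodes that run unmodified upstream or stable kernels, the comparison against the patched point releases can be automated. A minimal sketch, assuming sort -V is available; the thresholds are copied from the patched-versions list in this section, fixed_in and upstream_status are hypothetical helper names, and the result is meaningless for distribution kernels that backport fixes without changing the version number.

```shell
#!/bin/sh
# fixed_in SERIES: first patched point release for a stable series, per the
# patched-versions list in this section. Prints nothing for other series.
fixed_in() {
    case "$1" in
        6.1) echo 6.1.29 ;;
        6.2) echo 6.2.16 ;;
        6.3) echo 6.3.2 ;;
    esac
}

# upstream_status VERSION: VERSION is an upstream x.y.z string.
upstream_status() {
    series=$(echo "$1" | cut -d. -f1,2)
    fix=$(fixed_in "$series")
    if [ -z "$fix" ]; then
        echo "not-listed"
    elif [ "$(printf '%s\n%s\n' "$1" "$fix" | sort -V | head -n 1)" = "$fix" ]; then
        echo "patched"
    else
        echo "vulnerable"
    fi
}

# Example:
#   upstream_status 6.1.27
```

A "not-listed" result means only that the series is outside the three listed here, not that the kernel is safe.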

6. Audit perf_event_open() Syscalls

Before locking down access, baseline who is actually calling perf_event_open() in your environment. Surprises are common — tools call the syscall via indirect paths (language runtimes, JIT compilers, profiling SDKs).

Add an auditd rule:

# Log all perf_event_open calls with process context
auditctl -a always,exit -F arch=b64 -S perf_event_open -k perf_event_audit

# Persist across reboots:
echo '-a always,exit -F arch=b64 -S perf_event_open -k perf_event_audit' \
  >> /etc/audit/rules.d/perf-event.rules
augenrules --load

Query the log:

ausearch -k perf_event_audit | head -100

Parse for container identity. The audit record includes the calling process’s pid, uid, comm, and exe. On a Kubernetes node, cross-reference with /proc/<pid>/cgroup to identify the container:

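ausearch -k perf_event_audit --raw | awk -F'[= ]' '/SYSCALL/{
  pid = uid = comm = ""
  for(i=1;i<=NF;i++) {
    if ($i=="pid") pid=$(i+1)
    if ($i=="uid") uid=$(i+1)
    if ($i=="comm") comm=$(i+1)
  }
  # Reset before getline: it keeps the previous value when nothing is read
  # (process already exited, or not running in a kubepods cgroup)
  cgroup = "(not found)"
  cmd = "cat /proc/" pid "/cgroup 2>/dev/null | grep -o \"kubepods.*\" | head -1"
  cmd | getline cgroup
  close(cmd)
  print "uid=" uid " comm=" comm " cgroup=" cgroup
}'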

This reveals which containers are making the syscall. In practice, you will find:

The Datadog or New Relic agent (expected)
JVM processes with a native profiling agent such as async-profiler attached (often unexpected)
Python services instrumented with a profiling SDK that samples through perf events (expected if debugging, unexpected in production)
Any language runtime with auto-instrumentation hooks that sample CPU counters

Use this baseline to determine which workloads need legitimate perf_event_open() access before applying seccomp restrictions.

Expected Behaviour

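After setting perf_event_paranoid=3 on a kernel that enforces it, running perf stat ls as a non-root user produces an error along these lines (the exact wording varies between perf versions):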

Error:
You may not have permission to collect system-level stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).

The current value is 3.

The tool exits non-zero. No profiling data is collected. This is the correct outcome.

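When the seccomp profile blocks syscall 298 and a process attempts perf_event_open(), the syscall returns EPERM (errno 1). If the runtime loaded the filter with logging enabled (SECCOMP_FILTER_FLAG_LOG), the audit log carries an entry like the one below; without that flag, SCMP_ACT_ERRNO denials are not logged: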

type=SECCOMP msg=audit(1715200000.000:1234): auid=1000 uid=1000 gid=1000 \
  ses=1 pid=12345 comm="perf" exe="/usr/bin/perf" sig=0 arch=c000003e \
  syscall=298 compat=0 ip=0x7f1234567890 code=0x50001

syscall=298 is perf_event_open. code=0x50001 is SECCOMP_RET_ERRNO (0x00050000) combined with return value 1 (EPERM). The calling process receives EPERM, and the audit log records the event when seccomp logging is enabled for the filter.
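
The code= field can be decoded mechanically: the upper 16 bits carry the seccomp action and the lower 16 bits carry its data, which for ERRNO is the errno value. A minimal sketch using the action constants from the kernel’s seccomp UAPI header; decode_seccomp_code is a hypothetical helper name.

```shell
#!/bin/sh
# decode_seccomp_code CODE: split an audit "code=" value into the seccomp
# action name and its 16-bit data field.
decode_seccomp_code() {
    action=$(printf '0x%08x' $(( $1 & 0xffff0000 )))
    data=$(( $1 & 0xffff ))
    case "$action" in
        0x80000000) name=KILL_PROCESS ;;
        0x00000000) name=KILL_THREAD ;;
        0x00030000) name=TRAP ;;
        0x00050000) name=ERRNO ;;
        0x7fc00000) name=USER_NOTIF ;;
        0x7ff00000) name=TRACE ;;
        0x7ffc0000) name=LOG ;;
        0x7fff0000) name=ALLOW ;;
        *)          name=UNKNOWN ;;
    esac
    echo "$name $data"
}

# Example:
#   decode_seccomp_code 0x50001
```

For the record shown above this prints ERRNO 1, matching the errnoRet value in the profile.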

When the Kyverno policy blocks a pod that requests SYS_PTRACE, the admission controller rejects the pod with a message similar to the following (the exact layout depends on the Kyverno version):

Error from server: error when creating "pod.yaml": admission webhook
"validate.kyverno.svc" denied the request:
policy Pod/default/my-debug-pod for resource violation:
deny-sys-ptrace-outside-monitoring/deny-sys-ptrace:
CAP_SYS_PTRACE is not permitted outside the monitoring namespace.
Remove it from securityContext.capabilities.add.

The pod is never created. The cluster audit log records the rejection event with the requesting user’s identity, the namespace, and the policy name.

Trade-offs

perf_event_paranoid=3 eliminates the entire class of unprivileged perf_event_open() privilege escalation CVEs. It also eliminates all profiling capability for non-root processes. Developers lose perf stat, perf record, perf top, flame graph generation, and hardware counter access. On a development workstation or CI build node, this is a significant capability regression. On a production application node, it is the correct tradeoff: production workloads do not benefit from having hardware counter access, and the attack surface reduction is real.

CAP_PERFMON (Linux 5.8+) is the capability-based middle ground. A process holding it passes the perf_event_paranoid checks without requiring root, and it does not include the broader powers of CAP_SYS_ADMIN. On a cluster of 5.8+ nodes, granting CAP_PERFMON to profiling agents rather than setting a permissive host sysctl keeps the access scoped to specific pods. The limitations: not all profiling tools request CAP_PERFMON yet (many still request CAP_SYS_ADMIN), and some older container runtimes do not map it correctly. Verify end-to-end before relying on it.

Seccomp blocking perf_event_open() is the most surgical control: it blocks the syscall regardless of sysctl settings and regardless of capabilities, for workloads that genuinely never need profiling. The failure mode is silent brokenness: a profiling tool that calls perf_event_open() and receives EPERM may silently fall back to software-only profiling without telling the operator. Performance data appears valid but reflects only software counters. Test all profiling tooling against the seccomp profile before deployment.

eBPF-based profiling as an alternative to perf_event_open() shifts the privilege requirement from the sysctl to the pod’s capability set. This is generally preferable — a scoped capability is better than a node-wide sysctl change. The trade-off is that eBPF programs still require CAP_BPF (5.8+) or CAP_SYS_ADMIN (pre-5.8) to load, and the eBPF verifier is itself a complex kernel subsystem with its own CVE history (see CVE-2021-3489, CVE-2022-23222). Replacing perf_event_open() risk with eBPF loading risk is a trade, not an elimination.

Failure Modes

Setting perf_event_paranoid=-1 for a monitoring tool and not reverting it. The DaemonSet init container runs at node startup, sets the sysctl, and the change persists until the node reboots. If the DaemonSet is removed (agent uninstall, node drained), the sysctl value remains at -1 on the running node — it is not reverted. The node continues to expose unrestricted perf_event_open() access to all containers indefinitely. Mitigation: audit kernel.perf_event_paranoid values across nodes regularly. Any node at 0 or -1 should trigger an alert unless the DaemonSet responsible is confirmed running and the node is in the approved monitoring tier.
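
The regular audit suggested above can be as small as a per-node check that exits non-zero when the value is below an agreed minimum. A minimal sketch; check_paranoid is a hypothetical helper name, the minimum of 2 in the example is an assumption to adjust per node tier, and how the result reaches an alerting system is left open.

```shell
#!/bin/sh
# check_paranoid FILE MIN: print a status line; return 1 when the value in
# FILE is below MIN and 2 when FILE cannot be read.
check_paranoid() {
    value=$(cat "$1" 2>/dev/null) || { echo "UNKNOWN cannot read $1"; return 2; }
    if [ "$value" -lt "$2" ]; then
        echo "ALERT perf_event_paranoid=$value (minimum $2)"
        return 1
    fi
    echo "OK perf_event_paranoid=$value"
}

# Example, suitable for a cron job or a node health check:
#   check_paranoid /proc/sys/kernel/perf_event_paranoid 2
```

Taking the file path as a parameter keeps the function testable and lets the same check run against a host /proc mounted at a different location inside a monitoring container.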

CAP_SYS_PTRACE added for a debugging session and not removed. A developer adds SYS_PTRACE to a container spec to run py-spy or strace during an incident. The incident is resolved. The pod spec is never updated. The capability persists in the Deployment manifest and is present in every subsequent replica. paranoid=2 on the host does not protect against this pod — CAP_SYS_PTRACE bypasses the paranoid check for hardware counter access. The Kyverno policy above prevents this by making SYS_PTRACE outside the monitoring namespace a hard rejection.

Confusing paranoid levels and treating a low level as safe. paranoid=1 and the upstream default of 2 are frequently described as “restricts kernel profiling” and treated as sufficient hardening. They are not. CVE-2022-1729 and CVE-2023-2235 need only the ability to open events on the caller’s own threads, which both levels permit, and that access is sufficient to trigger the bugs in the event group lifecycle code. If the kernel is unpatched, only level 3 (where the kernel enforces it) or a seccomp filter keeps an unprivileged caller away from that code.

Trusting that containerised workloads do not do profiling. The assumption “our containers just run web servers, they won’t call perf_event_open()” is incorrect in two ways. First, profiling agents and APM SDKs loaded into an otherwise ordinary service can call perf_event_open() without the service owner being aware of it. Second, an attacker who has compromised a container does not care what the original workload was: they will call perf_event_open() if it is available to them. The absence of a legitimate profiling use case in a workload is not a reason to leave access open; it is a reason to close it via seccomp.

Not distinguishing between host sysctl and container seccomp. A seccomp profile that blocks perf_event_open() on the container does not change the sysctl on the host. A privileged pod on the same node can still read and write the sysctl. Defence-in-depth requires both: the host sysctl at the most restrictive level the monitoring stack allows, and seccomp blocking the syscall for workloads that do not need it. Relying on only one layer leaves the other attack path open.