Linux Capability Hardening: Dropping Privileges from Daemons and Services
Problem
Historically, root was binary: you either had all privileges or none. Linux capabilities split root into 38 discrete permissions — CAP_NET_BIND_SERVICE, CAP_SYS_PTRACE, CAP_KILL, and 35 others — so a process can hold exactly the subset it needs.
In practice, most services still run with the full root capability set. The pattern: a daemon was written to run as root (because it needed to bind port 80, or write to /proc), nobody ever removed the extra capabilities it wasn’t using, and it now holds CAP_SYS_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, and 20 others it will never call. If that daemon is compromised, the attacker inherits all of them.
The specific failure modes in unmanaged deployments:
- Services in systemd units with no
CapabilityBoundingSet=directive, running with a bounding set of all capabilities. - Container workloads inheriting the Docker / containerd default set (which includes
CAP_NET_RAW,CAP_CHOWN,CAP_SETUID,CAP_SETGID,CAP_SYS_CHROOT, and others rarely needed by application code). - Binaries with file capabilities (
setcap) that are world-readable and executable — any local user can exec them. - No audit trail of capability use; a compromised service escalating via
CAP_SYS_ADMINwould go undetected. - Ambient capabilities misunderstood: non-root child processes can inherit capabilities without a setuid binary, but this is rarely scoped correctly.
The hardening goal is the capability analog of least-privilege: enumerate exactly which capabilities each service requires, drop everything else at the bounding set level, and alert when capabilities outside the expected set are exercised.
Target systems: Linux kernel 5.8+ (ambient capabilities stable); systemd 230+ (full CapabilityBoundingSet=, AmbientCapabilities=, SecureBits= support); containerd 1.7+ and Docker 24+.
Threat Model
- Adversary 1 — RCE in an over-privileged daemon: A web application server runs as root with all capabilities. An attacker achieves RCE via a web vulnerability. With
CAP_NET_RAWthey can sniff traffic; withCAP_SYS_PTRACEthey can attach to any process; withCAP_SYS_ADMINthey can mount filesystems and escape namespaces. - Adversary 2 — File capability abuse: A binary has
CAP_NET_BIND_SERVICEset viasetcap. An attacker who achieves local code execution can exec that binary, inherit the capability, and bind a listener on a privileged port to intercept traffic. - Adversary 3 — Ambient capability escalation: An application grants ambient capabilities to child processes during a privileged operation. A compromised child process retains those capabilities across exec boundaries unexpectedly.
- Adversary 4 — Container breakout via
CAP_SYS_ADMIN: A container workload retainsCAP_SYS_ADMIN. The attacker mounts the host filesystem, writes a cron job, escapes. - Access level: Adversary 1 requires application-level RCE. Adversaries 2 and 3 require local shell access. Adversary 4 has container-level execution.
- Objective: Escalate from application/container scope to host or other-user scope.
- Blast radius: With a full capability set, an RCE is nearly equivalent to an unrestricted root shell. With a minimal capability set, RCE is contained to what the service legitimately does — typically no privilege escalation path remains.
Configuration
Step 1: Audit Current Capability Sets
Before dropping anything, establish what each service actually holds and uses.
# List capabilities for a running process by PID.
grep -E '^Cap' /proc/$(pgrep nginx)/status
# Decode hex capability masks.
capsh --decode=00000000a80425fb
# List all processes with non-empty capability sets.
ps -eo pid,comm,cap_eff --no-headers | awk '$3 != "0000000000000000"'
# Find binaries with file capabilities set.
find / -xdev -not -path '/proc/*' -not -path '/sys/*' \
-executable -type f 2>/dev/null \
| xargs -P4 getcap 2>/dev/null
The five capability sets per-process:
| Set | Meaning |
|---|---|
CapInh |
Inheritable — can pass across exec if file also has it |
CapPrm |
Permitted — upper bound of effective |
CapEff |
Effective — what the kernel checks |
CapBnd |
Bounding — hard ceiling; cannot be added back once dropped |
CapAmb |
Ambient — inherited by unprivileged children |
The goal: drive CapBnd to the minimum, CapEff to only what’s in active use, CapAmb to zero unless explicitly needed.
Step 2: Identify Required Capabilities
For each service, determine the minimum required set. Common patterns:
| Service type | Typically needed | Can drop |
|---|---|---|
| Web server (nginx, Apache) | CAP_NET_BIND_SERVICE (if binding <1024), CAP_SETUID, CAP_SETGID (for worker process drop) |
Everything else |
| Container runtime | CAP_SETUID, CAP_SETGID, CAP_MKNOD, CAP_SYS_CHROOT, CAP_CHOWN, CAP_FSETID, CAP_FOWNER, CAP_SETFCAP, CAP_NET_BIND_SERVICE |
CAP_SYS_ADMIN, CAP_NET_RAW, CAP_SYS_PTRACE |
| Database (PostgreSQL) | CAP_NET_BIND_SERVICE if port <1024; usually none (runs on 5432) |
All, especially CAP_SYS_ADMIN |
| DNS resolver (unbound) | CAP_NET_BIND_SERVICE (port 53) |
Everything else |
| Network packet capture | CAP_NET_RAW, CAP_NET_ADMIN |
All others |
| Custom app listening >1024 | None | All |
Use strace to confirm at runtime:
strace -e trace=process,network -f -p $(pgrep nginx) 2>&1 | grep -E 'capset|prctl.*CAP'
Or audit via kernel with auditd:
auditctl -a always,exit -F arch=b64 -S capset -k cap_changes
ausearch -k cap_changes --start today | aureport --summary
Step 3: Harden systemd Service Units
systemd exposes capability controls natively. Add to the [Service] section:
[Service]
# Drop the bounding set to exactly what the service needs.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SETUID CAP_SETGID
# Prevent acquiring new capabilities via setuid/setcap exec within the service.
NoNewPrivileges=yes
# If the service runs as a non-root user but needs specific capabilities,
# grant them as ambient capabilities (kernel 4.3+).
User=www-data
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Secure bits: keep caps locked to this unit; prevent inheritance to children.
# SECBIT_NOROOT | SECBIT_NOROOT_LOCKED | SECBIT_NO_SETUID_FIXUP | SECBIT_NO_SETUID_FIXUP_LOCKED
SecureBits=keep-caps-locked no-setuid-fixup-locked noroot noroot-locked
A minimal hardened nginx unit looks like:
[Unit]
Description=nginx HTTP server
After=network.target
[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t
ExecStart=/usr/sbin/nginx
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
# Capability hardening.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SETUID CAP_SETGID CAP_CHOWN
NoNewPrivileges=yes
# Additional systemd sandboxing (complements capability drops).
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/log/nginx /var/run /var/cache/nginx
[Install]
WantedBy=multi-user.target
Reload and verify:
systemctl daemon-reload
systemctl restart nginx
# Confirm the effective caps.
grep CapEff /proc/$(pgrep -f 'nginx: master')/status | xargs capsh --decode=
Step 4: Drop Capabilities in Container Workloads
Docker / containerd accept per-container capability configuration.
Docker run:
# Drop all capabilities then add only what's needed.
docker run --cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--security-opt no-new-privileges:true \
nginx:alpine
Kubernetes Pod spec:
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx:alpine
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
Enforce via Kyverno policy:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-drop-all-capabilities
spec:
validationFailureAction: Enforce
rules:
- name: drop-all-caps
match:
any:
- resources:
kinds: [Pod]
validate:
message: "Containers must drop ALL capabilities."
pattern:
spec:
containers:
- securityContext:
capabilities:
drop:
- ALL
Step 5: Remove or Scope File Capabilities
File capabilities persist on the binary across all invocations. Audit and minimize:
# Remove all file capabilities from a binary that no longer needs them.
setcap -r /usr/bin/ping
# Add only what's needed (ping needs CAP_NET_RAW for ICMP).
setcap cap_net_raw+ep /usr/bin/ping
# Verify.
getcap /usr/bin/ping
# Output: /usr/bin/ping = cap_net_raw+ep
For system binaries, prefer running as a dedicated low-privilege user with ambient caps in systemd rather than file caps. File caps are sticky to the binary on disk — anyone who can exec the file gains the capability.
Restrict file capability binaries from being world-executable where not needed:
# Change ping to only be executable by the 'netadmin' group.
chgrp netadmin /usr/bin/ping
chmod 750 /usr/bin/ping
Step 6: Audit Capability Use with bpftrace
To confirm which capabilities a service actually calls (not just holds), trace cap_capable kernel calls:
# Trace all capability checks by process name.
bpftrace -e '
kprobe:cap_capable {
printf("%s(%d) cap=%d ret=%d\n", comm, pid, arg2, retval);
}
' | grep nginx
Or use the pre-built BCC tool:
# capsnoop: print every capability check in real time.
/usr/share/bcc/tools/capable -K
This reveals which capabilities the process actually queries at runtime (not just what it holds). Common findings:
- nginx queries
CAP_NET_BIND_SERVICEonce at startup when binding port 80; never queries it again. - PostgreSQL checks
CAP_SYS_NICEduring startup on some distributions even if it doesn’t need it; can be dropped after verification. - Java applications often check
CAP_SYS_ADMINduring JVM startup; usually succeeds without it.
Step 7: Audit Capability Changes with auditd
Detect runtime capability escalation attempts:
# /etc/audit/rules.d/capabilities.rules
-a always,exit -F arch=b64 -S capset -k capability_change
-a always,exit -F arch=b64 -S prctl -F a0=0x16 -k set_secbits
-w /proc/sys/kernel/cap_last_cap -p w -k cap_boundary_change
Alert pattern: a service’s process calling capset to expand its own capability set is an indicator of exploit activity. Legitimate services set capabilities once (at start, via systemd) and never call capset again.
Step 8: Verify with capsh
After hardening, confirm the resulting process has the expected capability set:
# Show capabilities of current shell.
capsh --print
# Show capabilities for a specific process.
PID=$(pgrep -f 'nginx: master')
for cap in CapInh CapPrm CapEff CapBnd CapAmb; do
val=$(grep "^$cap:" /proc/$PID/status | awk '{print $2}')
echo "$cap: $(capsh --decode=$val)"
done
Expected output for a hardened nginx master:
CapInh: =
CapPrm: cap_chown,cap_setgid,cap_setuid,cap_net_bind_service
CapEff: cap_chown,cap_setgid,cap_setuid,cap_net_bind_service
CapBnd: cap_chown,cap_setgid,cap_setuid,cap_net_bind_service
CapAmb: =
Nothing outside the four declared capabilities appears in the bounding set.
Expected Behaviour
| Signal | Before hardening | After hardening |
|---|---|---|
CapBnd for nginx master |
All 38 capabilities | cap_net_bind_service, cap_setuid, cap_setgid, cap_chown only |
| RCE blast radius | Full root capability set available | Limited to the 4 declared capabilities |
CAP_SYS_ADMIN available to nginx |
Yes | No — not in bounding set |
CAP_NET_RAW available to nginx |
Yes | No |
capset syscall at runtime |
Allowed | Blocked by NoNewPrivileges=yes |
| Container default cap set | ~14 Docker-default capabilities | 0 or 1 (explicit add-back only) |
Verify the bounding set shrank:
grep CapBnd /proc/$(pgrep -f 'nginx: master')/status
# Before: CapBnd: 000001ffffffffff (all caps)
# After: CapBnd: 00000000000020e0 (net_bind_service + setuid + setgid + chown)
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Reduced bounding set | Eliminates post-exploit escalation paths via dropped caps | Some services fail to start if a required cap was not identified | Test with strace and bpftrace before production; gradually tighten. |
NoNewPrivileges=yes |
Prevents setuid/setcap binaries from escalating within the service | Breaks services that intentionally exec setuid children (e.g., su, sudo in init scripts) |
Redesign those init scripts; use dedicated systemd units per privilege level. |
| File cap removal from ping/traceroute | Reduces local escalation surface | Breaks tools for non-root users | Wrap in a ping-capable group with ambient caps in a wrapper unit. |
| Ambient capabilities | Non-root process can have caps without setuid binary | Ambient caps persist across exec; a child that execs a setuid binary may unexpectedly combine them | Pair with SecureBits=noroot to prevent cap re-inheritance via setuid. |
| Kyverno enforcement in k8s | Prevents misconfigured pods from launching | Breaks workloads that haven’t been updated | Audit mode first (audit vs enforce); roll out per namespace. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Required cap not in bounding set | Service fails at startup with EPERM on a syscall |
journalctl -u <service> shows permission denied; strace at restart shows which syscall |
Add the cap to CapabilityBoundingSet=; reload. |
NoNewPrivileges breaks init script |
Init script calls sudo or another setuid binary; fails silently or with permission error |
Service fails post-start exec; check ExecStartPost logs |
Replace setuid calls in init scripts with explicit systemd unit ordering or systemd-run. |
| Ambient cap leaks to untrusted child | Child process has unexpected capabilities | bpftrace cap_capable shows child calling cap checks that should be absent |
Audit ambient cap grants; add SecureBits=noroot-locked to the parent unit. |
| Pod fails Kyverno admission | Pod cannot schedule; admission webhook rejects | Kubernetes event: Policy drop-all-caps: validation failed |
Add capabilities.drop: [ALL] and explicit add to the pod spec. |
| File cap on writable binary | Attacker writes malicious binary at file-cap-path and gains caps | Periodic `find / -xdev -executable | xargs getcap` scan finds unexpected entries |
| bpftrace shows cap never used | A capability in the set is never queried at runtime | Trace shows no calls to cap_capable for a specific cap |
Remove it from the bounding set; it’s dead weight. |