Linux tmpfs and POSIX Shared Memory Security Hardening
The Problem
Three filesystem locations are present on nearly every Linux system and are almost always left with their default permissive mount options: /tmp, /dev/shm, and /run. /dev/shm and /run are always tmpfs, a virtual filesystem held in RAM (and swap) with no persistent backing store; /tmp is tmpfs wherever tmp.mount or an /etc/fstab entry makes it one, and is otherwise a directory on the root filesystem. Because /tmp and /dev/shm are writable by all users, and because tmpfs contents live in memory rather than on disk, they are a recurring staging ground for post-exploitation activity.
The concrete problems:
- /dev/shm is executable by default. An attacker who achieves RCE on a web application can write shellcode or a compiled ELF binary to /dev/shm, mmap it with PROT_EXEC, and execute it without touching the disk. The OverlayFS local privilege escalation pair CVE-2023-2640 / CVE-2023-32629, known as GameOver(lay), specifically used /dev/shm as the staging area on Ubuntu systems. The attack writes a crafted executable to /dev/shm, uses OverlayFS to expose it with inflated permissions, and escalates to root.
- /tmp is executable by default and sticky-bit world-writable. Exploits dropped to /tmp run directly. The sticky bit prevents deletion by non-owners but does nothing to prevent execution.
- POSIX shared memory objects created via shm_open(3) reside in /dev/shm as regular files. An attacker who can write to /dev/shm and mmap the file with PROT_EXEC has an in-memory execution primitive that bypasses signature-based detection looking at the filesystem.
- Abstract UNIX domain sockets (those with a null byte at the start of the name) bind to the kernel’s abstract socket namespace, which has no filesystem representation and no inode-level permissions. Any process in the same network namespace can connect to an abstract socket regardless of DAC permissions. Services communicating through abstract sockets are invisible to tools that only walk /var/run.
- A process that can write arbitrarily to a world-writable tmpfs can exhaust the host’s available RAM and swap, causing an out-of-memory event that can be weaponised into a denial-of-service condition.
None of these problems require kernel exploits. They are reliably exploitable by any unprivileged process on a default-configured system.
Target systems: Ubuntu 22.04 LTS, Ubuntu 24.04 LTS, Debian 12, RHEL 9 / Rocky Linux 9.
Threat Model
- Adversary: Attacker with unprivileged code execution on the host — a compromised web application, a malicious container escaping to the host, or a CI job running untrusted code.
- Access level: Unprivileged local user with the ability to write files to world-writable directories and create POSIX shared memory objects.
- Objectives:
  - Stage and execute payloads in memory without writing to persistent storage, evading endpoint agents that monitor disk activity.
  - Abuse shm_open/mmap to create executable mappings from /dev/shm for privilege escalation.
  - Exhaust RAM via unbounded tmpfs writes to trigger OOM and disrupt services.
  - Race another process on a TOCTOU-vulnerable path in /tmp to substitute a file with a symlink, redirecting a privileged write to an attacker-chosen target.
  - Communicate covertly with a malicious process via an abstract UNIX socket that bypasses filesystem-permission inspection.
- Blast radius: Successful privilege escalation from one compromised service affects all workloads on the host. On a Kubernetes node, host root grants access to all pod secrets and kubelet credentials.
Mount Hardening Options
Three mount flags apply to tmpfs and block the most commonly exploited default behaviours:
| Flag | Effect | Notes |
|---|---|---|
| noexec | The kernel refuses execve and mmap(PROT_EXEC) against files on this filesystem | Most important on /dev/shm and /tmp |
| nosuid | setuid and setgid bits on executables are ignored | Prevents a setuid binary dropped to /tmp from escalating |
| nodev | Block and character device files cannot be opened by path | Prevents raw disk access through device nodes in /tmp |
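These flags are enforced by the kernel at the VFS layer, not in userspace. They are checked on every mmap and execve call against files on the affected filesystem, and an unprivileged process cannot switch them off. noexec is not a complete barrier, however: an interpreter can still read and run a script stored there (sh /tmp/script.sh), and a process can copy bytes from a file into an anonymous executable mapping.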
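The flag check itself is simple string matching on the mount's option list. A minimal sketch, assuming a hypothetical helper named missing_flags that takes a comma-separated option string such as the one findmnt prints:

```bash
# missing_flags OPTIONS -> prints each hardening flag that is absent from a
# comma-separated mount option string, one per line.
# (missing_flags is an illustrative name, not an existing tool.)
missing_flags() {
    local flag
    for flag in noexec nosuid nodev; do
        case ",$1," in
            *",$flag,"*) ;;                  # flag present
            *) printf '%s\n' "$flag" ;;      # flag absent
        esac
    done
}

# Example with a fixed string; on a live host pass
# "$(findmnt -n -o OPTIONS /dev/shm)" instead.
missing_flags "rw,nosuid,nodev,relatime"
```

Wrapping the option string in commas before matching avoids false positives from substrings, for example treating a hypothetical option that merely contains "nodev" as the flag itself.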
/etc/fstab Hardening
If /tmp and /dev/shm are not listed in /etc/fstab, add them. If they are already listed, update the options column. The size= limit matters: without it each tmpfs defaults to a cap of half of physical RAM, so a single unprivileged process can dd if=/dev/zero into /tmp and take that much memory away from everything else on the host.
# /etc/fstab
# /tmp: in-memory, 2 GiB cap, no execution, no setuid, no devices
tmpfs /tmp tmpfs defaults,noexec,nosuid,nodev,size=2G 0 0
# /dev/shm: POSIX shared memory — most payloads are staged here
tmpfs /dev/shm tmpfs defaults,noexec,nosuid,nodev,size=512M 0 0
# /run: PID files and sockets — execution not needed here
tmpfs /run tmpfs defaults,noexec,nosuid,nodev,size=256M,mode=755 0 0
# /var/tmp: persists across reboots, same risk profile as /tmp
# If /var/tmp is a separate ext4/xfs partition, use the same options:
/dev/sda5 /var/tmp ext4 defaults,noexec,nosuid,nodev 0 2
Apply without rebooting:
sudo mount -o remount,noexec,nosuid,nodev,size=2G /tmp
sudo mount -o remount,noexec,nosuid,nodev,size=512M /dev/shm
sudo mount -o remount,noexec,nosuid,nodev /run
# Verify — OPTIONS column must contain noexec,nosuid,nodev
findmnt -o TARGET,OPTIONS /tmp /dev/shm /run
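The fstab entries can be checked before they are applied. The following is a sketch only: audit_fstab is a hypothetical helper that reads fstab-format text on standard input and reports tmpfs entries lacking one of the three flags or a size= cap. It ignores non-tmpfs lines such as the /var/tmp example.

```bash
# audit_fstab: reads fstab-format text on stdin and prints one line per
# tmpfs entry that lacks a hardening flag or a size= cap.
# (audit_fstab is an illustrative name, not an existing tool.)
audit_fstab() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }        # skip comments and short lines
        $3 == "tmpfs" {
            opts = "," $4 ","
            miss = ""
            if (opts !~ /,noexec,/)      miss = miss " noexec"
            if (opts !~ /,nosuid,/)      miss = miss " nosuid"
            if (opts !~ /,nodev,/)       miss = miss " nodev"
            if (opts !~ /,size=[^,]+,/)  miss = miss " size="
            if (miss != "") print $2 ":" miss
        }'
}

# Example input; on a live host run: audit_fstab < /etc/fstab
audit_fstab <<'EOF'
tmpfs /tmp     tmpfs defaults,noexec,nosuid,nodev,size=2G 0 0
tmpfs /dev/shm tmpfs defaults,nosuid,nodev               0 0
EOF
```

For the sample input this reports /dev/shm as missing noexec and size=, and stays silent about /tmp.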
Systemd Mount Units
Systemd generates mount units from /etc/fstab, but you can also write drop-in overrides for the units it generates automatically. The unit names are derived from the mount point path by replacing / with - and appending .mount.
# The generated unit name for /dev/shm
systemctl cat dev-shm.mount
# Override: harden options without replacing the full unit
sudo mkdir -p /etc/systemd/system/dev-shm.mount.d/
# /etc/systemd/system/dev-shm.mount.d/hardened.conf
[Mount]
Options=defaults,noexec,nosuid,nodev,size=512M
sudo systemctl daemon-reload
sudo systemctl restart dev-shm.mount
systemctl show dev-shm.mount --property=Options
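The same pattern applies to tmp.mount. systemd ships a unit file only for tmp.mount; /dev/shm and /run are mounted directly by PID 1 early in boot and show up only as runtime units, so for those two the /etc/fstab entries above (applied at boot by systemd-remount-fs.service) are the more dependable route. Where a shipped unit exists, override rather than replace it to survive package upgrades.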
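The unit-naming rule described in this section can be sketched in a few lines of shell. This is a simplification under a stated assumption: the path contains no characters that systemd would escape (a literal hyphen in a path component, for instance, becomes \x2d in the real name). mount_unit_name is a hypothetical helper; systemd-escape -p --suffix=mount handles the general case.

```bash
# mount_unit_name PATH -> systemd mount unit name for simple paths.
# Simplified sketch: no escaping of special characters is performed.
mount_unit_name() {
    local p="${1#/}"            # drop the leading slash
    p="${p%/}"                  # drop a trailing slash, if any
    if [ -z "$p" ]; then
        p="-"                   # the root directory maps to "-.mount"
    else
        p=$(printf '%s' "$p" | tr '/' '-')
    fi
    printf '%s.mount\n' "$p"
}

mount_unit_name /dev/shm
mount_unit_name /var/lib/myapp/tmp
```

The drop-in directory for a given mount point is then /etc/systemd/system/ followed by the printed name and a .d suffix.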
/dev/shm and POSIX Shared Memory Exploitation
POSIX shared memory is created with shm_open(3) (equivalent to open(2) on a file under /dev/shm) and mapped into a process’s address space with mmap(2). On a default system with an executable /dev/shm, this sequence gives an attacker in-memory code execution:
// Attacker primitive: write shellcode to /dev/shm and execute it
int fd = shm_open("/payload", O_RDWR | O_CREAT, 0700);
ftruncate(fd, shellcode_len);
void *p = mmap(NULL, shellcode_len,
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_SHARED, fd, 0);
memcpy(p, shellcode, shellcode_len);
((void(*)())p)(); // jump to shellcode
With noexec on /dev/shm, the mmap call with PROT_EXEC fails with EPERM. The payload can still be written to /dev/shm, but it cannot be made executable by mapping it from the file descriptor: the kernel enforces this in the do_mmap path regardless of the calling process’s capabilities.
The GameOver(lay) CVE pair (CVE-2023-2640 / CVE-2023-32629) relied on this exact primitive: a crafted executable was written to /dev/shm, then OverlayFS tricks were used to expose it with elevated SUID permissions through a user namespace. Systems with noexec on /dev/shm broke a prerequisite of the exploit chain before the OverlayFS layer was even reached.
Check the current state before hardening:
# Confirm whether /dev/shm is executable
# (bad: no output, because the default exec is not listed; good: noexec)
findmnt -n -o OPTIONS /dev/shm | tr ',' '\n' | grep -E '^(no)?exec'
# List POSIX shared memory objects currently open on the system
ls -la /dev/shm/
# Check which processes have executable mappings backed by /dev/shm
# (any permission string containing x, private or shared, writable or not)
grep -E ' [r-][w-]x[sp] .* /dev/shm/' /proc/[0-9]*/maps 2>/dev/null
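Reading mount options shows what is configured; a functional probe shows what the kernel actually enforces. The sketch below assumes a hypothetical helper probe_exec that writes a trivial script into the target directory and tries to execute it directly. It treats any non-zero exit status as "blocked", which is an approximation: a missing /bin/sh would give the same result.

```bash
# probe_exec DIR -> prints "exec allowed" or "exec blocked" for DIR.
# Returns 2 if the probe file cannot be created.
# (probe_exec is an illustrative name, not an existing tool.)
probe_exec() {
    local f status
    f=$(mktemp "$1/exec-probe.XXXXXX") || return 2
    printf '#!/bin/sh\nexit 0\n' > "$f"
    chmod 700 "$f"
    "$f" 2>/dev/null            # direct execve; fails with EACCES under noexec
    status=$?
    rm -f "$f"                  # always clean up the probe file
    if [ "$status" -eq 0 ]; then
        echo "exec allowed"
    else
        echo "exec blocked"
    fi
}
```

Running probe_exec /dev/shm and probe_exec /tmp before and after the remount gives a quick before/after comparison. The probe deliberately executes the file directly rather than through sh, because an interpreter reading the script would succeed even on a noexec mount.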
tmpfs Size Limits
Without a size= option, a tmpfs instance defaults to a cap of half of physical RAM. Each mount gets its own cap, so /tmp, /dev/shm, and /run together can still be filled well past what the host can spare. An unprivileged process can perform a RAM exhaustion attack:
# Unprivileged DoS: fills /tmp until the tmpfs cap is reached or memory runs out
dd if=/dev/zero of=/tmp/bomb bs=1M
The size= mount option caps the total space the tmpfs instance can use. Setting size=2G on /tmp limits the blast radius to 2 GiB regardless of how much RAM the host has. Choose sizes based on your observed peak usage:
# Check current tmpfs usage across all mounts
df -h --type=tmpfs
# Identify largest consumers in /tmp right now
du -sh /tmp/* 2>/dev/null | sort -rh | head -20
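For /dev/shm, the relevant question is how large legitimate POSIX shared memory segments get. PostgreSQL, for example, allocates dynamic shared memory segments for parallel queries under /dev/shm (its main shared_buffers segment is an anonymous mapping and does not appear there). Check running allocations: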
# Show named POSIX shm objects and sizes
ls -lah /dev/shm/
# System V shared memory segments live outside /dev/shm; list them separately
ipcs -m
Set the size= limit to 2–3× the observed peak rather than a hard minimum — applications that legitimately use shared memory will fail with ENOSPC if the limit is too tight, and the failure mode is often opaque.
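The sizing rule is plain arithmetic. As a sketch, a hypothetical helper suggest_size multiplies an observed peak (in MiB) by a headroom factor, defaulting to 3, and prints the result as a mount option:

```bash
# suggest_size PEAK_MIB [FACTOR] -> prints a size= option with headroom.
# (suggest_size is an illustrative name; FACTOR defaults to 3.)
suggest_size() {
    local peak="$1" factor="${2:-3}"
    printf 'size=%dM\n' "$((peak * factor))"
}

# 180 MiB observed peak, 3x headroom
suggest_size 180 3
# On a live host with GNU df, the current usage in MiB can be read with:
#   df -BM --output=used /dev/shm | tail -n 1 | tr -dc '0-9'
```

A single reading of current usage is not a peak; sample it over a representative period before settling on a figure.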
/tmp Sticky Bit and TOCTOU Risks
The /tmp sticky bit (chmod +t /tmp, mode 1777) prevents users from deleting each other’s files. It does not prevent TOCTOU (time-of-check-to-time-of-use) attacks against programs that create predictable filenames in /tmp.
The canonical pattern: a privileged process checks whether /tmp/workfile exists, creates it if not, then writes sensitive data to it. An attacker who wins the race between the check and the create can place a symlink at /tmp/workfile pointing to /etc/passwd. The privileged process then writes to the symlink target.
Verify the sticky bit is set (it should be by default, but confirm after hardening):
stat /tmp | grep -i access
# Expected: Access: (1777/drwxrwxrwt)
For newly written code, use mkstemp(3), which atomically creates a uniquely named file, or open(2) with O_TMPFILE, which creates an unnamed file that no other process can race on:
# Safe temporary file creation (shell)
tmpfile=$(mktemp /tmp/app.XXXXXXXXXX)
# Kernel-side: open(2) with O_TMPFILE | O_RDWR on Linux 3.11+
The sticky bit cannot prevent all TOCTOU races — it only prevents deletion. The correct defence is PrivateTmp=yes in systemd units, which removes the shared namespace entirely.
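Where a script must use a fixed, predictable name, the check-then-create race can be closed by making creation exclusive. The sketch below uses the shell's noclobber option so that redirection refuses to write through an existing file or symlink; create_exclusive is a hypothetical helper name.

```bash
# create_exclusive PATH -> creates PATH only if nothing exists there yet.
# Fails (non-zero) if PATH is an existing regular file or a symlink to one,
# or a dangling symlink. The subshell keeps noclobber from leaking out.
create_exclusive() {
    ( set -o noclobber; : > "$1" ) 2>/dev/null
}

# Demonstration in a scratch directory: an attacker-style symlink is
# planted at the predictable name before the "privileged" create runs.
d=$(mktemp -d)
echo "original" > "$d/target"
ln -s "$d/target" "$d/workfile"
if create_exclusive "$d/workfile"; then
    echo "created (unsafe)"
else
    echo "refused: path already exists"
fi
rm -rf "$d"
```

mktemp remains the better choice when the name does not have to be predictable; exclusive creation is the fallback for fixed names such as lock files.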
Systemd PrivateTmp Isolation
PrivateTmp=yes in a systemd service unit gives the service its own /tmp and /var/tmp inside a private mount namespace. systemd creates per-service directories (named systemd-private-* under the host’s /tmp and /var/tmp, accessible only to root) and bind-mounts them over /tmp and /var/tmp for the service. The service sees an empty, fresh /tmp, and no other unprivileged process on the host, including other services, can read, write, or race on its temporary files.
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
[Service]
ExecStart=/usr/bin/myapp
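# systemd does not support trailing comments, so each comment has its own line
# Isolated /tmp and /var/tmp
PrivateTmp=yes
# No physical device nodes in /dev
PrivateDevices=yes
# No privilege gain through setuid/setgid binaries
NoNewPrivileges=yes
# Entire file system hierarchy read-only except /dev, /proc, /sys
ProtectSystem=strict
# /home, /root, /run/user inaccessible and empty
ProtectHome=yes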
[Install]
WantedBy=multi-user.target
Apply and verify:
sudo systemctl daemon-reload
sudo systemctl restart myapp
# Confirm the private mount is active
systemctl show myapp.service --property=PrivateTmp
# Expected: PrivateTmp=yes
# Check from inside the service namespace (requires root)
sudo nsenter -t "$(systemctl show -p MainPID --value myapp.service)" -m \
findmnt /tmp
# The SOURCE column should show the host filesystem followed by a
# systemd-private-* subdirectory in square brackets
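PrivateTmp=yes does not by itself solve the application-exec exception problem: the private /tmp is a bind mount of a directory on the host’s /tmp, so it inherits the host’s noexec flag. A service that legitimately needs to execute files from its temporary directory (JVM native library extraction, Python native extension loading) needs a dedicated exec-capable directory instead, as described under Application Impact:
[Service]
PrivateTmp=yes
# The private /tmp inherits noexec from the host's /tmp mount.
# Point TMPDIR at a dedicated exec-capable mount if the service must execute temp files.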
Note that PrivateTmp=yes applies only to processes started by the unit. Shell sessions, su invocations, and systemd-run --user processes all use the host’s /tmp. The host-level noexec mount options remain essential.
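Unit files can be linted for these directives before deployment. The following is a rough sketch, assuming a hypothetical helper missing_sandboxing and a fixed three-item checklist chosen for illustration. It reads a single unit file and does not see drop-ins; systemctl show reports the effective values on a live system.

```bash
# missing_sandboxing: reads a unit file on stdin and prints each directive
# from a small checklist that is not set to a true value.
# A value followed by a trailing comment is reported too, because systemd
# would not parse it as a boolean.
missing_sandboxing() {
    awk -F= '
        /^[[:space:]]*[#;]/ { next }                  # whole-line comments
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            seen[key] = val                           # last assignment wins
        }
        END {
            n = split("PrivateTmp NoNewPrivileges PrivateDevices", want, " ")
            for (i = 1; i <= n; i++)
                if (seen[want[i]] !~ /^(yes|true|on|1)$/) print want[i]
        }'
}

# Usage on a real unit:
#   missing_sandboxing < /etc/systemd/system/myapp.service
```

An empty result means all three directives are present and enabled in that file.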
Abstract UNIX Socket Namespace
Abstract UNIX domain sockets are identified by a pathname where the first byte is a null character. They bind to the kernel’s abstract socket namespace rather than the filesystem. Because there is no filesystem inode, there are no permission bits — any process in the same network namespace can connect to an abstract socket by name.
This creates a monitoring blind spot: filesystem-based auditing tools, tools that walk /var/run, and tools that check socket file permissions are all ineffective against abstract sockets. Processes can communicate covertly over abstract sockets without leaving any filesystem artefact.
Enumerate abstract sockets currently in use:
# /proc/net/unix — all UNIX domain sockets; abstract sockets have @ prefix
# in the ss output (the @ is ss's rendering of the leading null byte)
ss -xlp | grep '@'
# Raw /proc view: the kernel also prints abstract names with a leading @
awk 'NR > 1 && $NF ~ /^@/ { print $NF, $(NF-1) }' /proc/net/unix | sort
# With socket state and owning process
ss -xlnp state listening | head -40
A socket name starting with @ in ss output or in the Path column of /proc/net/unix is abstract. Legitimate abstract sockets include D-Bus (@/tmp/dbus-...), X11 (@/tmp/.X11-unix/X0), and some GNOME and systemd IPC paths.
Audit unfamiliar entries:
# Find PID owning an abstract socket (if ss shows the inode)
ss -xlp | grep '@myapp-secret'
# Output includes pid=NNNN; cross-reference with:
ls -la /proc/NNNN/exe
cat /proc/NNNN/cmdline | tr '\0' ' '
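For a stable server, the enumeration can be turned into a baseline comparison. As a sketch, the hypothetical helper abstract_sockets below reads text in /proc/net/unix format and prints the inode and name of each abstract socket; names containing spaces would be truncated by this simple field split.

```bash
# abstract_sockets: reads /proc/net/unix-format text on stdin and prints
# "INODE NAME" for every abstract socket (Path column starting with @).
# (abstract_sockets is an illustrative name, not an existing tool.)
abstract_sockets() {
    # Field 7 is the inode, field 8 the path; unbound sockets have no path.
    awk 'NR > 1 && NF >= 8 && $8 ~ /^@/ { print $7, $8 }'
}

# Usage on a live host:
#   abstract_sockets < /proc/net/unix | sort -k2 > /var/tmp/abstract.now
#   diff /var/tmp/abstract.baseline /var/tmp/abstract.now
```

The baseline file path in the usage comment is an arbitrary example; inode numbers change across restarts, so compare on the name column when diffing across reboots.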
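Add auditd rules to detect processes binding sockets on servers where the set of expected abstract sockets is stable. The rule cannot select AF_UNIX itself, because the address family sits inside the sockaddr structure that bind(2) receives by pointer; filter the resulting SOCKADDR records by family when reviewing the log:
# /etc/audit/rules.d/80-abstract-sockets.rules
# Record successful bind() calls; AF_UNIX and abstract names are identified
# from the SOCKADDR record at review time
# syscall 49 is bind(2) on x86-64
-a always,exit -F arch=b64 -S bind -F success=1 -k unix_socket_bind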
Restrict containers from accessing the host abstract socket namespace by placing them in a separate network namespace (--network=none or a dedicated CNI network in Kubernetes). Abstract sockets are scoped to the network namespace — a container with its own network namespace cannot reach abstract sockets bound in the host namespace.
Application Impact: exec in /tmp Exceptions
Several legitimate workloads require execution from tmpfs paths. The correct response is per-service exception management, not globally relaxing /tmp to exec.
Common cases:
| Application | Behaviour | Solution |
|---|---|---|
| JVM (HotSpot, GraalVM) | Extracts native shared libraries (for example via JNA or Netty) to java.io.tmpdir, then dlopens them | Set -Djava.io.tmpdir=/var/lib/jvm-work on an exec-capable mount |
| Python (ctypes, cffi) | Compiles C extensions to tempfile.gettempdir(), which defaults to /tmp | Set TMPDIR in the systemd unit to an exec-capable path |
| Node.js (node-gyp, some native addons) | Extracts .node shared objects during startup | Set npm_config_cache and TMPDIR to an exec-capable directory |
| Ansible (on the managed node) | Copies and executes Python modules via /tmp or ~/.ansible/tmp | Set remote_tmp in ansible.cfg to a path on a filesystem with exec, or use ansible_remote_tmp=/var/ansible-tmp on a dedicated ext4 mount |
| Chrome/Chromium headless | Uses /dev/shm for shared memory between browser and renderer processes; fails on a small or restricted /dev/shm unless --disable-dev-shm-usage is passed | Pass --disable-dev-shm-usage, or mount a larger /dev/shm without noexec for the browser service only; avoid --no-sandbox, which disables the renderer sandbox |
| PostgreSQL | Since 9.3 the main shared memory segment is an anonymous mapping; dynamic shared memory for parallel queries uses POSIX shm under /dev/shm, mapped without PROT_EXEC | No noexec exception needed; size /dev/shm for parallel-query peaks |
Detection — identify which processes are actively executing from tmpfs mounts:
#!/bin/bash
# find-tmpfs-exec.sh
# Report processes with exec mappings into tmpfs-backed paths
# Any executable mapping (private or shared, writable or not) whose backing
# path starts with /tmp/, /dev/shm/ or /run/
pattern=' [r-][w-]x[sp] .* (/tmp|/dev/shm|/run)/'
for maps in /proc/[0-9]*/maps; do
    [ -r "$maps" ] || continue
    pid=${maps#/proc/}; pid=${pid%/maps}
    if matches=$(grep -E "$pattern" "$maps" 2>/dev/null); then
        comm=$(cat "/proc/$pid/comm" 2>/dev/null)
        echo "PID $pid ($comm): exec mapping in tmpfs path"
        echo "$matches"
    fi
done
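The matching logic is easy to get wrong, so it helps to exercise it on fixed input before trusting it against /proc. A sketch, with exec_tmpfs_maps as a hypothetical filter over text in /proc/PID/maps format and three made-up sample lines:

```bash
# exec_tmpfs_maps: reads /proc/PID/maps-format lines on stdin and prints
# those that are executable and backed by /tmp, /dev/shm, or /run.
# (exec_tmpfs_maps is an illustrative name, not an existing tool.)
exec_tmpfs_maps() {
    grep -E '^[0-9a-f]+-[0-9a-f]+ [r-][w-]x[sp] .* (/tmp|/dev/shm|/run)/'
}

# Sample lines: a shared rwx mapping in /dev/shm (should match), libc
# (executable but not on tmpfs), and a non-executable /dev/shm mapping.
exec_tmpfs_maps <<'EOF'
7f1c2a000000-7f1c2a001000 rwxs 00000000 00:1a 1234      /dev/shm/payload
7f1c2b000000-7f1c2b021000 r-xp 00000000 08:01 5678      /usr/lib/x86_64-linux-gnu/libc.so.6
7f1c2c000000-7f1c2c001000 rw-s 00000000 00:1a 1235      /dev/shm/data
EOF
```

Only the first sample line is printed. The space before the path alternatives keeps paths such as /var/tmp/ or /usr/lib/tmp/ from matching by accident.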
For each identified process, create a dedicated exec-capable directory and redirect that application’s tmpdir there:
# Create an exec-capable tmpdir for a specific service
sudo mkdir -p /var/lib/myapp/tmp
sudo chown myapp:myapp /var/lib/myapp/tmp
# /etc/systemd/system/myapp.service
[Service]
Environment=TMPDIR=/var/lib/myapp/tmp
# PrivateTmp is disabled because the service manages its own tmpdir
PrivateTmp=no
Mount /var/lib/myapp/tmp as a separate tmpfs with only nosuid,nodev (allowing exec) and a tight size limit:
# /etc/fstab — per-service exec-capable tmpfs
tmpfs /var/lib/myapp/tmp tmpfs defaults,nosuid,nodev,size=512M,uid=myapp,gid=myapp,mode=700 0 0
This grants exec capability to exactly one service on exactly one mount point, leaving /tmp and /dev/shm fully hardened for everything else.
Verification
A single script to validate the expected state after hardening:
#!/bin/bash
# verify-tmpfs-hardening.sh
FAIL=0
check() {
local label="$1" mp="$2" flag="$3"
if findmnt -n -o OPTIONS "$mp" 2>/dev/null | grep -qw "$flag"; then
printf "OK %-20s has %s\n" "$mp" "$flag"
else
printf "FAIL %-20s missing %s\n" "$mp" "$flag"
FAIL=1
fi
}
check_size() {
local mp="$1"
local size
size=$(findmnt -n -o OPTIONS "$mp" 2>/dev/null | grep -oP 'size=\K[^,]+')
if [ -n "$size" ]; then
printf "OK %-20s size limit: %s\n" "$mp" "$size"
else
printf "WARN %-20s no size= limit set\n" "$mp"
fi
}
echo "=== Mount flags ==="
for mp in /tmp /dev/shm /run; do
check "$mp" "$mp" noexec
check "$mp" "$mp" nosuid
check "$mp" "$mp" nodev
check_size "$mp"
done
echo ""
echo "=== Sticky bit ==="
stat_out=$(stat -c '%a' /tmp)
if [[ "$stat_out" =~ ^1 ]]; then
echo "OK /tmp has sticky bit (mode: $stat_out)"
else
echo "FAIL /tmp missing sticky bit (mode: $stat_out)"
FAIL=1
fi
echo ""
echo "=== Active exec mappings in tmpfs paths ==="
found=0
for maps in /proc/[0-9]*/maps; do
pid=$(echo "$maps" | grep -oP '(?<=/proc/)\d+')
[ -r "$maps" ] || continue
if grep -qE ' [r-][w-]x[sp] .* (/tmp|/dev/shm|/run)/' "$maps" 2>/dev/null; then
comm=$(cat /proc/"$pid"/comm 2>/dev/null)
echo "WARN PID $pid ($comm) has exec mapping in tmpfs path"
found=1
fi
done
[ "$found" -eq 0 ] && echo "OK No exec mappings found in /tmp, /dev/shm, /run"
echo ""
if [ "$FAIL" -eq 0 ]; then
echo "ALL CHECKS PASSED"
exit 0
else
echo "SOME CHECKS FAILED"
exit 1
fi
Trade-offs
| Change | What breaks | Workaround |
|---|---|---|
| noexec on /tmp | JVM native library extraction, Python native extension compilation, Ansible remote modules | Redirect TMPDIR per service to a dedicated exec-capable mount |
| noexec on /dev/shm | Chrome headless renderer, some older memcached builds that self-modify | Pass --disable-dev-shm-usage to Chrome; upgrade memcached; isolate with a per-service exec-capable tmpfs |
| size= limits on /tmp | Large package extractions (dpkg, rpm unpacking multi-GB installers) | Temporarily mount -o remount,size=8G /tmp before the operation, remount back after |
| size= on /dev/shm | PostgreSQL parallel queries may fail to allocate dynamic shared memory segments | Set size= to 2–3× the observed peak /dev/shm usage |
| PrivateTmp=yes | Services that use /tmp as an IPC channel between separate processes | Move IPC to /run sockets (also tmpfs, scoped per service by RuntimeDirectory=) |
Failure Modes
fstab syntax error causes boot failure. Always run sudo findmnt --verify after editing /etc/fstab to check syntax, and sudo mount -a to confirm that new entries mount, before the next reboot. A failed /dev/shm mount on boot drops systemd into degraded state; /tmp failure can cause early-boot services to crash before the emergency shell is available.
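Remount can silently change options. Whether mount -o remount,exec /tmp keeps options that are not re-stated depends on how mount(8) is invoked: given only the mount point, it merges the options recorded in /etc/fstab (or the current mount table) with those on the command line, while giving both source and mount point skips that merge. tmpfs itself leaves size= unchanged on a remount that does not mention it. Do not rely on either behaviour; verify with findmnt after every remount: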
findmnt -n -o OPTIONS /tmp
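The before/after comparison can be automated. A minimal sketch, assuming a hypothetical helper dropped_options that takes two comma-separated option strings and prints whatever the second one lost:

```bash
# dropped_options BEFORE AFTER -> prints options present in BEFORE but
# absent from AFTER (both comma-separated mount option strings).
# (dropped_options is an illustrative name, not an existing tool.)
dropped_options() {
    local opt
    for opt in $(printf '%s' "$1" | tr ',' ' '); do
        case ",$2," in
            *",$opt,"*) ;;                   # still present
            *) printf '%s\n' "$opt" ;;       # lost in the remount
        esac
    done
}

# Fixed example: noexec disappeared between the two snapshots.
dropped_options "rw,nosuid,nodev,noexec,size=2097152k" "rw,nosuid,nodev,size=2097152k"
# Live usage:
#   before=$(findmnt -n -o OPTIONS /tmp)
#   ... remount ...
#   dropped_options "$before" "$(findmnt -n -o OPTIONS /tmp)"
```

A changed value, such as a different size=, shows up as a dropped option because the comparison is on whole option strings.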
PrivateTmp=yes hides files from tools looking at the host’s /tmp. strace, lsof, and ls /tmp run from a root shell on the host will not find files created by a service with PrivateTmp=yes at their expected paths, because those tools operate in the host mount namespace; the files sit under a root-only systemd-private-* directory in the host’s /tmp. Use nsenter to inspect the service’s namespace:
nsenter -t "$(systemctl show -p MainPID --value myapp.service)" --mount \
ls /tmp
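Abstract socket monitoring creates high-volume audit logs. A bind rule fires on every bind(2) call, including TCP and UDP listeners, and audit field filters cannot inspect the address family inside the sockaddr argument. On busy application servers, audit socket(2) instead, where the first argument is the address family, and scope the rule to non-root login sessions:
-a always,exit -F arch=b64 -S socket -F a0=1 -F auid!=unset -F auid!=0 -k abstract_socket
(a0=1 matches AF_UNIX = 1 in the domain argument of socket(2); in bind(2), a0 is the socket file descriptor.)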