Hardening Linux AF_VSOCK Against VM-to-Host Escape

Problem

AF_VSOCK (Virtual Socket) is a socket address family designed for efficient communication between virtual machines and the host they run on. Unlike network sockets, which require a full IP stack, VSOCK addresses endpoints by CID (Context Identifier) and port: CID 0 is reserved for the hypervisor (VMADDR_CID_HYPERVISOR), CID 1 is the local loopback address (VMADDR_CID_LOCAL), CID 2 always refers to the host (VMADDR_CID_HOST), and each VM is assigned a unique CID of 3 or higher by its VMM when the VM is launched. Services on the host bind VSOCK ports; processes inside VMs connect to them.
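
The reserved CID values can be made concrete with a small sketch. The constants mirror those in the kernel header linux/vm_sockets.h; the helper name describe_cid is hypothetical and exists only for illustration.

```python
# Reserved VSOCK context identifiers, as defined in <linux/vm_sockets.h>.
VMADDR_CID_ANY = 0xFFFFFFFF   # wildcard used when binding
VMADDR_CID_HYPERVISOR = 0     # reserved for the hypervisor itself
VMADDR_CID_LOCAL = 1          # loopback within the same machine
VMADDR_CID_HOST = 2           # well-known address of the host


def describe_cid(cid: int) -> str:
    """Classify a CID (hypothetical helper, for illustration only)."""
    if not 0 <= cid <= 0xFFFFFFFF:
        raise ValueError(f"CID out of 32-bit range: {cid}")
    reserved = {
        VMADDR_CID_ANY: "any (bind wildcard)",
        VMADDR_CID_HYPERVISOR: "hypervisor (reserved)",
        VMADDR_CID_LOCAL: "local loopback",
        VMADDR_CID_HOST: "host",
    }
    # Everything that is not reserved is treated as a guest VM address.
    return reserved.get(cid, "guest VM")


if __name__ == "__main__":
    for cid in (0, 1, 2, 3, 42):
        print(cid, "->", describe_cid(cid))
```

From inside a guest, connecting to CID 2 therefore always means "talk to the host", regardless of which CID the guest itself was given.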

The practical uses are widespread: guest agents (VMware Tools, QEMU guest agent, AWS SSM agent), container runtime shims (firecracker-containerd uses VSOCK to reach its agent inside the micro-VM), nested virtualisation, and some cloud provider services rely on VSOCK. Many cloud VMs therefore run at least one VSOCK-connected process.

The security problem is structural: VSOCK creates a direct, low-level communication channel from untrusted guest code to the hypervisor. Any vulnerability in a VSOCK-listening service on the hypervisor is reachable from inside every VM that machine hosts. And vulnerabilities have appeared regularly:

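CVE-2021-26708 (Linux kernel VSOCK): race conditions caused by incorrect locking in the AF_VSOCK implementation (net/vmw_vsock/af_vsock.c), enabling local privilege escalation. Inside a VM this yields guest kernel privileges, a useful foothold for attacking host-facing VSOCK and virtio interfaces.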

CVE-2022-26525 / CVE-2022-26526: QEMU vhost-vsock backend vulnerabilities allowing a malicious guest to corrupt host memory via crafted VSOCK messages.

CVE-2024-50264: use-after-free in the vsock/virtio transport, caused by a dangling vsk->trans pointer, that can lead to kernel memory corruption and local privilege escalation.

VMware Tools VSOCK exposure: VMware’s guest agent exposes a VSOCK service with a documented protocol. Research in 2024 demonstrated that under-validated message parsing in this agent created a command injection path from guest to host.

Beyond specific CVEs, VSOCK has a structural risk that is often overlooked in hardened VM deployments: the surface is invisible to standard network security tooling. Firewall rules, network ACLs, and packet capture don’t see VSOCK traffic. An attacker who reaches VSOCK-based services bypasses all network-layer controls. The channel is fast, reliable, and by default leaves no audit trail.

Target systems: Linux KVM/QEMU virtual machines, AWS EC2 instances with SSM agent, VMware vSphere guests, Firecracker-based container environments, any Linux host running VSOCK-listening services (virtio-vsock, vhost-vsock).


Threat Model

Adversary 1 — Compromised VM code reaching hypervisor services. Access level: code execution inside a guest VM. Objective: connect to VSOCK ports on the host (CID 2), exploit a vulnerability in a listening service, and achieve host code execution or read host memory.

Adversary 2 — Container escape via VSOCK in Firecracker. Access level: code inside a Firecracker micro-VM (used as a container sandbox). Objective: exploit a VSOCK vulnerability in the containerd-shim VSOCK listener to escape the Firecracker boundary and reach the host.

Adversary 3 — Malicious guest VSOCK packet injection. Access level: root inside a guest VM with VSOCK device access. Objective: send malformed VSOCK messages that trigger kernel bugs in the vhost-vsock backend on the host, corrupting host kernel memory.

Adversary 4 — VSOCK lateral movement between VMs. Access level: code inside one guest VM. Objective: connect to VSOCK ports on sibling VMs (if the hypervisor allows inter-VM VSOCK). Most hypervisors restrict this, but misconfigurations exist.

Without hardening: VSOCK is an unmonitored, unconstrained channel from guest to hypervisor. With hardening: Seccomp blocks AF_VSOCK socket creation in workloads that don’t need it; hypervisor-side service isolation limits blast radius; audit logging captures VSOCK connection patterns.


Configuration / Implementation

Step 1 — Audit current VSOCK usage

# List processes with open VSOCK sockets
ss --vsock --processes
# Or, using the generic socket-table selector (include listening sockets):
ss -A vsock_stream,vsock_dgram -a -p

# Check whether the vhost-vsock device is present
# (Run on the KVM/QEMU host, not inside a VM)
ls /dev/vhost-vsock 2>/dev/null && echo "vhost-vsock device present"

# Check which processes listen on VSOCK ports
# Inside a guest VM:
ss --vsock --listening

# On KVM host: find processes holding /dev/vhost-vsock open (the VMMs backing
# guest VSOCK devices). Socket fds resolve to "socket:[inode]", so VSOCK
# listeners themselves are found with ss above, not by this scan.
for fd in /proc/[0-9]*/fd/*; do
  if [ "$(readlink "$fd" 2>/dev/null)" = "/dev/vhost-vsock" ]; then
    pid=${fd#/proc/}; pid=${pid%%/*}
    echo "PID $pid ($(cat "/proc/$pid/comm" 2>/dev/null)) has /dev/vhost-vsock open"
  fi
done | sort -u
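
The same /proc scan can be written as a small script, which is easier to test and to run from monitoring tooling. This is a minimal sketch: the function name find_fd_holders and its proc_root parameter are hypothetical, and it only reports processes whose file descriptors resolve to the given device path (VSOCK sockets themselves appear as socket:[inode] and are not matched).

```python
import os


def find_fd_holders(target: str, proc_root: str = "/proc") -> dict:
    """Map PID -> command name for processes holding an fd that points at target."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "net"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or access denied
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed while scanning
            if link == target:
                try:
                    with open(os.path.join(proc_root, entry, "comm")) as f:
                        comm = f.read().strip()
                except OSError:
                    comm = "?"
                holders[int(entry)] = comm
                break  # one match per process is enough
    return holders


if __name__ == "__main__":
    # Run as root on the KVM/QEMU host to see every process.
    for pid, comm in sorted(find_fd_holders("/dev/vhost-vsock").items()):
        print(f"PID {pid} ({comm}) has /dev/vhost-vsock open")
```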

Step 2 — Block AF_VSOCK via Seccomp for workloads that don’t need it

Most application workloads inside VMs have no legitimate need to open VSOCK sockets directly. Block the socket family:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1,
      "args": [
        {
          "index": 0,
          "value": 40,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Block AF_VSOCK (40) socket creation"
    }
  ]
}
# Verify AF_VSOCK family number on your kernel
python3 -c "import socket; print(socket.AF_VSOCK)"
# Should print: 40

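# Store the profile in a ConfigMap for distribution in Kubernetes.
# A ConfigMap alone does not install the profile: the JSON file must also be
# written to the kubelet seccomp directory on every node
# (/var/lib/kubelet/seccomp/ by default), for example by a DaemonSet.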
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: seccomp-deny-vsock
  namespace: default
data:
  deny-vsock.json: |
    {
      "defaultAction": "SCMP_ACT_ALLOW",
      "syscalls": [{
        "names": ["socket"],
        "action": "SCMP_ACT_ERRNO",
        "args": [{"index": 0, "value": 40, "op": "SCMP_CMP_EQ"}]
      }]
    }
EOF
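
Writing the profile by hand is error-prone once more than one address family is involved, so it can help to generate it. The sketch below builds the same JSON structure as the profile shown earlier; the function name build_deny_profile is hypothetical. Each family gets its own rule because argument conditions inside a single rule are combined with AND.

```python
import json
import socket

# AF_VSOCK is 40 on Linux; fall back to the literal where the constant is missing.
AF_VSOCK = getattr(socket, "AF_VSOCK", 40)
EPERM = 1


def build_deny_profile(families):
    """Return a seccomp profile dict that denies socket() for the given families."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                "errnoRet": EPERM,
                # args[0] of socket() is the address family
                "args": [{"index": 0, "value": fam, "op": "SCMP_CMP_EQ"}],
            }
            for fam in families
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_deny_profile([AF_VSOCK]), indent=2))
```

Redirecting the output to deny-vsock.json yields a file that can be placed in the node's seccomp profile directory.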

For Kubernetes pods:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: deny-vsock.json
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]

Step 3 — Restrict VSOCK on the hypervisor side

On KVM/QEMU hosts, restrict which processes can access /dev/vhost-vsock:

# Check current permissions on vhost-vsock device
ls -la /dev/vhost-vsock
# Typical: crw------- root root, or crw-rw---- root kvm, depending on the distribution

# If the device is more permissive, tighten it
chown root:kvm /dev/vhost-vsock
chmod 0660 /dev/vhost-vsock

# Restrict via udev rule
cat > /etc/udev/rules.d/90-vsock.rules << 'EOF'
KERNEL=="vhost-vsock", GROUP="kvm", MODE="0660"
EOF
udevadm control --reload-rules && udevadm trigger

# Verify: non-kvm processes cannot access the device
su -s /bin/bash nobody -c "cat /dev/vhost-vsock" 2>&1
# Expected: Permission denied
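
For fleet-wide checks, the permission test can be scripted rather than eyeballed. The following is a minimal sketch; check_device_mode and its parameters are hypothetical names, and the 0660 ceiling and root ownership are the targets described in this step.

```python
import os
import stat


def check_device_mode(path: str, max_mode: int = 0o660, expected_uid: int = 0) -> list:
    """Return a list of problems with the permissions on path (empty list = OK)."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    problems = []
    # Any permission bit outside max_mode means the node is too permissive.
    if mode & ~max_mode:
        problems.append(f"mode {mode:04o} grants more than {max_mode:04o}")
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    return problems


if __name__ == "__main__":
    try:
        issues = check_device_mode("/dev/vhost-vsock")
    except FileNotFoundError:
        print("vhost-vsock device not present")
    else:
        print("OK" if not issues else "\n".join(issues))
```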

Step 4 — Harden VSOCK-listening services on the hypervisor

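Services that legitimately listen on VSOCK should run with minimal privileges. The unit below uses the QEMU guest agent, which runs inside the guest, as the example; the same directives apply to host-side listeners:

# /etc/systemd/system/qemu-guest-agent.service (example hardening)
[Unit]
Description=QEMU Guest Agent

[Service]
# Listen on VSOCK rather than the default virtio-serial channel;
# the <cid>:<port> pair here is a placeholder
ExecStart=/usr/bin/qemu-ga --method=vsock-listen --path=3:1234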
# Run as dedicated user, not root
User=qemu-guest
Group=qemu-guest
# Restrict capabilities
CapabilityBoundingSet=
AmbientCapabilities=
# Restrict filesystem access
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
# Restrict syscalls
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
NoNewPrivileges=true

For custom VSOCK services, validate all messages rigorously:

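// vsock_server.rs: secure VSOCK listener pattern (uses the `vsock` crate)
use std::io::{self, Read};
use std::time::Duration;
use vsock::{VsockListener, VsockStream, VMADDR_CID_HOST};

const ALLOWED_CIDS: [u32; 3] = [3, 4, 5]; // Specific VM CIDs
const MAX_MSG: usize = 1024; // Hard cap on message size

// Placeholder: replace with strict, application-specific parsing
fn handle_message(_peer_cid: u32, _msg: &[u8], _stream: &mut VsockStream) -> io::Result<()> {
    Ok(())
}

fn handle_connection(mut stream: VsockStream) -> io::Result<()> {
    // Get the peer CID and only accept connections from expected guests
    let peer_cid = stream.peer_addr()?.cid();
    if !ALLOWED_CIDS.contains(&peer_cid) {
        eprintln!("Rejected connection from unexpected CID: {}", peer_cid);
        return Ok(());
    }

    // Bound the time a guest can hold this single-threaded server
    stream.set_read_timeout(Some(Duration::from_secs(5)))?;

    // Read one byte more than the cap so oversized messages are detectable.
    // This assumes one message per read; add framing for longer exchanges.
    let mut buf = [0u8; MAX_MSG + 1];
    let n = stream.read(&mut buf)?;
    if n == 0 || n > MAX_MSG {
        eprintln!("Invalid message size: {} bytes from CID {}", n, peer_cid);
        return Ok(());
    }

    // Parse and validate message strictly
    handle_message(peer_cid, &buf[..n], &mut stream)
}

fn secure_vsock_server() -> io::Result<()> {
    let listener = VsockListener::bind_with_cid_port(VMADDR_CID_HOST, 9999)?;
    for stream in listener.incoming() {
        // A failure on one connection must not take the listener down
        if let Err(e) = stream.and_then(handle_connection) {
            eprintln!("Connection error: {}", e);
        }
    }
    Ok(())
}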
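
Rigorous validation usually starts with framing: the receiver should know exactly how many bytes a message claims to contain before parsing any of it. The sketch below assumes a hypothetical wire format (a 4-byte big-endian length followed by the payload) and a hypothetical 1024-byte cap; neither is taken from a real VSOCK protocol.

```python
import struct

MAX_PAYLOAD = 1024             # hypothetical cap on payload size
HEADER = struct.Struct(">I")   # hypothetical 4-byte big-endian length prefix


class FrameError(ValueError):
    """Raised when a received frame violates the framing rules."""


def parse_frame(data: bytes, max_payload: int = MAX_PAYLOAD) -> bytes:
    """Validate one length-prefixed frame and return its payload."""
    if len(data) < HEADER.size:
        raise FrameError("truncated header")
    (length,) = HEADER.unpack_from(data)
    # Reject the declared length before trusting it for anything else.
    if length == 0 or length > max_payload:
        raise FrameError(f"declared length {length} outside 1..{max_payload}")
    if len(data) - HEADER.size != length:
        raise FrameError("payload size does not match declared length")
    return data[HEADER.size:]


if __name__ == "__main__":
    frame = HEADER.pack(5) + b"hello"
    print(parse_frame(frame))
```

Rejecting on the declared length first means an attacker-controlled size field never drives an allocation or a read loop.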

Step 5 — Enable VSOCK audit logging

VSOCK connections don’t appear in iptables logs or standard network logs. Add explicit audit:

# /etc/audit/rules.d/92-vsock.rules
# Audit socket() calls for AF_VSOCK
-a always,exit -F arch=b64 -S socket -F a0=40 -F key=vsock_socket

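# Audit connect() calls. Audit rules cannot filter connect() by address family
# (the family lives in the sockaddr, not in a syscall argument), so this rule
# logs every connect() and is noisy; narrow it with -F uid= or -F exe= where possible
-a always,exit -F arch=b64 -S connect -F key=vsock_connect
augenrules --load   # loads the merged rules into the running kernel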

# Monitor VSOCK socket creation
ausearch -k vsock_socket --start today | \
  grep -v "^----" | head -20
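
Because the connect() rule cannot filter by address family, VSOCK connections have to be picked out afterwards from the saddr= field that auditd writes in SOCKADDR records as a hex dump of the socket address. The sketch below decodes that field; it assumes a little-endian host (x86-64 or arm64) and reads only the leading fields of struct sockaddr_vm. The function name decode_vsock_saddr is hypothetical.

```python
import struct

AF_VSOCK = 40
# Leading fields of struct sockaddr_vm: family, reserved, port, cid.
# "<" assumes a little-endian host.
_SOCKADDR_VM = struct.Struct("<HHII")


def decode_vsock_saddr(saddr_hex: str):
    """Return (cid, port) for an AF_VSOCK saddr hex string, else None."""
    raw = bytes.fromhex(saddr_hex)
    if len(raw) < _SOCKADDR_VM.size:
        return None
    family, _reserved, port, cid = _SOCKADDR_VM.unpack_from(raw)
    if family != AF_VSOCK:
        return None  # some other address family (AF_INET, AF_UNIX, ...)
    return cid, port


if __name__ == "__main__":
    # Illustrative value: family 40, port 9999, CID 2
    print(decode_vsock_saddr("280000000F2700000200000000000000"))
```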

Step 6 — Apply VSOCK firewall rules via eBPF

For Firecracker or vhost-vsock environments where you need fine-grained filtering:

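// vsock_filter.bpf.c: BPF LSM program to filter outbound VSOCK connections.
// The cgroup/connect4 and cgroup/connect6 hooks fire only for AF_INET and
// AF_INET6 sockets, so AF_VSOCK is filtered at the socket_connect LSM hook.
// Requires CONFIG_BPF_LSM=y and "bpf" in the active LSM list (lsm= boot option).
#include "vmlinux.h"   // bpftool btf dump file /sys/kernel/btf/vmlinux format c
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define AF_VSOCK 40
#define EPERM    1

char LICENSE[] SEC("license") = "GPL";

// Leading fields of struct sockaddr_vm (linux/vm_sockets.h)
struct vsock_addr_head {
    __u16 family;
    __u16 reserved1;
    __u32 port;
    __u32 cid;
};

// Allow list of permitted peer CIDs
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 256);
    __type(key, __u32);    // Peer CID
    __type(value, __u8);   // 1 = allowed
} allowed_cids SEC(".maps");

SEC("lsm/socket_connect")
int BPF_PROG(vsock_connect_filter, struct socket *sock,
             struct sockaddr *address, int addrlen, int ret)
{
    struct vsock_addr_head addr = {};

    if (ret)
        return ret; // An earlier LSM program already denied the call
    if (addrlen < (int)sizeof(addr) ||
        bpf_probe_read_kernel(&addr, sizeof(addr), address) ||
        addr.family != AF_VSOCK)
        return 0; // Allow non-VSOCK

    if (!bpf_map_lookup_elem(&allowed_cids, &addr.cid)) {
        bpf_printk("VSOCK blocked: CID %u not in allowlist\n", addr.cid);
        return -EPERM; // Block
    }
    return 0; // Allow
}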
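
The allowed_cids map starts empty, so every VSOCK connection is blocked until CIDs are added from userspace. One way to do that is with bpftool; the sketch below only builds the command line and leaves execution to the operator. It assumes a little-endian host, a bpftool version that supports selecting maps by name, and the map layout above (__u32 key, __u8 value). The function name bpftool_allow_cmd is hypothetical.

```python
import struct


def bpftool_allow_cmd(cid: int, map_name: str = "allowed_cids") -> list:
    """Build the bpftool argv that marks a CID as allowed in the map."""
    if not 0 <= cid <= 0xFFFFFFFF:
        raise ValueError(f"CID out of 32-bit range: {cid}")
    key = struct.pack("<I", cid)  # __u32 key in host byte order (little-endian assumed)
    return (
        ["bpftool", "map", "update", "name", map_name, "key", "hex"]
        + [f"{b:02x}" for b in key]
        + ["value", "hex", "01"]  # __u8 value: 1 = allowed
    )


if __name__ == "__main__":
    # Print the commands for three example guest CIDs; run them as root.
    for cid in (3, 4, 5):
        print(" ".join(bpftool_allow_cmd(cid)))
```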

Expected Behaviour

Signal | Before hardening | After hardening
App container opens VSOCK socket | Succeeds | Blocked by Seccomp (EPERM)
/dev/vhost-vsock permissions | May be world-readable | 0660, group kvm only
auditd logs VSOCK socket creation | Not captured | vsock_socket key fires
Guest connects to hypervisor VSOCK from unexpected CID | No logging, no filtering | Rejected by service-level CID allowlist
VSOCK service runs as root | Common default | Runs as dedicated user with restricted capabilities

Verification:

# Inside a VM — confirm Seccomp blocks VSOCK
python3 -c "
import socket
try:
    s = socket.socket(40, socket.SOCK_STREAM)  # AF_VSOCK = 40
    print('FAIL: VSOCK socket created')
except OSError as e:
    print(f'PASS: VSOCK blocked — {e}')
"

# On host — confirm vhost-vsock permissions
stat -c "%a %U %G" /dev/vhost-vsock
# Expected: 660 root kvm

Trade-offs

Aspect | Benefit | Cost | Mitigation
Seccomp AF_VSOCK block | Eliminates VSOCK exploitation from app containers | Breaks workloads that legitimately need VSOCK (VM agents, container shims) | Apply only to application containers; exempt system agent pods/services
CID allowlist in VSOCK service | Limits which VMs can connect to the service | Requires knowing CIDs at service startup; CIDs can change | For cloud environments, use CID-to-instance metadata mapping; update allowlist via service restart on VM lifecycle events
eBPF VSOCK filtering | Kernel-level enforcement; cannot be bypassed by userspace | Requires Linux 5.10+; adds complexity | Use as belt-and-suspenders with service-level CID checks

Failure Modes

Failure | Symptom | Detection | Recovery
Seccomp blocks legitimate VM agent | VM agent cannot communicate with hypervisor; agent health checks fail | Agent logs show socket creation error; VM management plane loses contact | Add VSOCK socket to agent’s Seccomp exemption; use a targeted profile instead of blocking all AF_VSOCK
CID allowlist too restrictive | New VM cannot connect to hypervisor service; agent fails | New VM connectivity issues; agent logs show connection refused | Add new VM’s CID to the allowlist; automate via VM lifecycle hooks
Kernel update changes VSOCK family number | Unlikely: AF_VSOCK = 40 is stable | Seccomp block stops working | Verify with python3 -c "import socket; print(socket.AF_VSOCK)" after kernel update
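
Keeping the CID allowlist in step with VM lifecycle events can be sketched as a small in-memory registry. Everything here is hypothetical (the class name, the method names, and the idea that the VM manager calls them from start and stop hooks); it illustrates the bookkeeping only, not a particular hypervisor's API.

```python
RESERVED_CIDS = {0, 1, 2, 0xFFFFFFFF}  # hypervisor, local, host, any


class CidAllowlist:
    """Tracks which guest CIDs may connect, keyed by a VM identifier."""

    def __init__(self):
        self._cid_by_vm = {}

    def vm_started(self, vm_id: str, cid: int) -> None:
        # Guests never legitimately use a reserved CID.
        if cid in RESERVED_CIDS or not 0 <= cid <= 0xFFFFFFFF:
            raise ValueError(f"refusing reserved or invalid CID {cid}")
        self._cid_by_vm[vm_id] = cid

    def vm_stopped(self, vm_id: str) -> None:
        # Remove the entry so a recycled CID is not trusted by accident.
        self._cid_by_vm.pop(vm_id, None)

    def is_allowed(self, cid: int) -> bool:
        return cid in self._cid_by_vm.values()


if __name__ == "__main__":
    acl = CidAllowlist()
    acl.vm_started("vm-a", 3)
    print(acl.is_allowed(3), acl.is_allowed(4))
    acl.vm_stopped("vm-a")
    print(acl.is_allowed(3))
```

Removing a CID on VM shutdown matters as much as adding it on start, because hypervisors may reuse CIDs for unrelated guests.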