vLLM and the KV-Cache Isolation Problem: How Shared Memory Leaks Between Inference Requests

The Problem

High-throughput LLM inference at the scale required for production workloads forces a fundamental choice: maximize GPU utilization by sharing memory between concurrent requests, or maintain strong isolation by giving each request private memory. vLLM and Triton Inference Server chose utilization. This choice has security consequences in multi-tenant deployments that most infrastructure teams have not fully accounted for.

vLLM’s PagedAttention

vLLM (from UC Berkeley, now widely deployed as the backend for OpenAI-compatible inference APIs at companies running Llama, Mistral, and Qwen variants) introduced PagedAttention in 2023. The design is inspired directly by OS virtual memory paging and was explicitly motivated by the observation that existing inference engines wasted 60–80% of allocated GPU memory through fragmentation and over-provisioning of KV-cache storage.

To understand the security issue, you need to understand what the KV-cache is and why it is expensive. Transformer attention computes key and value matrices for every token in the context window. During autoregressive decoding (generating tokens one at a time), the keys and values from all prior tokens must be either recomputed or cached at every step. Caching them is not optional at inference scale: recomputing them from scratch at each decoding step multiplies compute cost by sequence length. For a 70B Llama-3-class model serving a 128K-token context, a single request’s KV-cache can reach tens of gigabytes of GPU DRAM (roughly 40 GB in FP16 with 80 layers and 8 KV heads). On a single A100 80GB GPU, you cannot hold even a handful of such requests in memory simultaneously unless you manage the cache carefully.

PagedAttention manages the KV-cache as a pool of fixed-size physical blocks. With the typical block size of 16 tokens, one block across all layers holds 16 tokens × 2 (keys + values) × num_layers × num_kv_heads × head_dim × sizeof(float16) bytes. For Llama-3-8B (32 layers, 128-dimensional heads, and 8 KV heads shared by its 32 query heads through grouped-query attention), that is 16 × 2 × 32 × 8 × 128 × 2 = 2,097,152 bytes, or about 2.1 MB; a model of the same shape with one KV head per query head would need 8.4 MB. The GPU memory pool is pre-partitioned into these blocks at startup, and blocks are allocated and freed as requests arrive and complete. A reference counter tracks how many active requests are using each block. When a request completes, its blocks are returned to the free pool, but the blocks are not zeroed.
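The block-size arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name and its parameters are illustrative and are not part of vLLM. It treats a "block" as the storage for 16 tokens across all layers, and shows how strongly the result depends on the number of KV heads (32 for plain multi-head attention, 8 for a grouped-query model shaped like Llama-3-8B).

```python
def kv_block_bytes(block_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes occupied by one KV-cache block across all layers (keys + values)."""
    return block_tokens * 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3-8B-style shapes: 32 layers, 128-dimensional heads, FP16 storage.
mha_block = kv_block_bytes(16, 32, 32, 128)  # one KV head per query head
gqa_block = kv_block_bytes(16, 32, 8, 128)   # 8 KV heads (grouped-query attention)

print(f"32 KV heads: {mha_block / 1e6:.1f} MB per block")
print(f" 8 KV heads: {gqa_block / 1e6:.1f} MB per block")
```

Dividing the KV-cache budget of a deployment by this per-block figure gives a rough estimate of how many blocks the pool holds, and therefore how often freed blocks are recycled between requests.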

The prefix sharing feature compounds this. When two requests share a common prefix (for example, two API calls using the same system prompt), PagedAttention can map both requests to the same physical blocks for the prefix tokens, incrementing the reference count. The first request to use a given prefix sequence populates the blocks; subsequent requests reuse them without recomputation. vLLM calls this automatic prefix caching, and it is controlled by the enable_prefix_caching parameter introduced in vLLM 0.4.0. When prefix caching is enabled, blocks are not freed when the request using them completes if the prefix might be reused; they are retained in an LRU cache of prefix blocks.

The security problem is the intersection of two properties: blocks are not zero-initialized on allocation, and blocks are shared across requests that may belong to different tenants.

When a block is freed from Tenant A’s request and allocated to Tenant B’s request, the GPU DRAM backing that block still contains the KV-cache data written by Tenant A’s inference. Specifically, it contains the attention key vectors and value vectors computed from Tenant A’s input tokens — the system prompt and the user’s actual query. Tenant B’s model kernel will overwrite those vectors as it processes Tenant B’s tokens, but the overwrite is sequential: as Tenant B’s sequence grows from token 1 to token N, the block is overwritten token by token. At the moment Tenant B’s first kernel invocation begins on that block, the first few rows of the key/value matrices still contain Tenant A’s data.
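The remanence window described above can be illustrated with a toy allocator. This is a sketch under stated assumptions, not vLLM's block manager: ToyBlockPool, its methods, and the string placeholders standing in for key/value vectors are all hypothetical. It shows that when a block is recycled without clearing, the new owner's block still holds the previous owner's entries in every slot it has not yet overwritten.

```python
class ToyBlockPool:
    """Toy stand-in for a paged KV-cache pool; not vLLM's allocator."""

    def __init__(self, num_blocks: int, block_tokens: int, zero_on_free: bool = False):
        self.block_tokens = block_tokens
        self.zero_on_free = zero_on_free
        # One list of slots per block; strings stand in for K/V vectors.
        self.memory = [[None] * block_tokens for _ in range(num_blocks)]
        self.free_list = list(range(num_blocks))

    def allocate(self) -> int:
        # The most recently freed block is handed out first; contents are NOT cleared.
        return self.free_list.pop()

    def write(self, block_id: int, slot: int, value: str) -> None:
        self.memory[block_id][slot] = value

    def free(self, block_id: int) -> None:
        if self.zero_on_free:
            self.memory[block_id] = [None] * self.block_tokens
        self.free_list.append(block_id)


def stale_slots_after_reuse(zero_on_free: bool) -> list:
    pool = ToyBlockPool(num_blocks=4, block_tokens=4, zero_on_free=zero_on_free)
    a = pool.allocate()
    for slot in range(4):
        pool.write(a, slot, f"tenantA-kv-{slot}")
    pool.free(a)                       # Tenant A's request completes
    b = pool.allocate()                # Tenant B receives the same physical block
    pool.write(b, 0, "tenantB-kv-0")   # B has written only its first token so far
    return [v for v in pool.memory[b] if v is not None and v.startswith("tenantA")]


print(stale_slots_after_reuse(zero_on_free=False))  # three of A's entries remain
print(stale_slots_after_reuse(zero_on_free=True))   # nothing of A's remains
```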

Luo et al. (2024, “Stealing LLM System Prompts via KV-Cache Side-Channels”) demonstrated that KV-cache vectors from a target prompt can be used to recover the original text with high accuracy using embedding inversion — the same technique that has been applied to BERT embeddings, but extended to causal LM KV representations. The intuition: the key vector for token position $i$ is $K_i = W_K \cdot h_i$, where $h_i$ is the hidden state derived from the token embedding and all prior context. Given $W_K$ (which is the model’s public weight) and $K_i$, solving for $h_i$ and then inverting $h_i$ to recover the input token is a constrained optimization problem. The constraint is that input tokens come from a discrete vocabulary; with beam search over likely token candidates, the system prompt reconstruction succeeds with >80% token accuracy on system prompts up to several hundred tokens.
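One way to write the constrained optimization, using the symbols from the paragraph above, is the following. The notation is an assumption made here for illustration: $\mathcal{V}$ is the vocabulary, $n$ is the number of recovered positions, and $h_i(t_{1:i})$ is the hidden state at position $i$ produced by the candidate tokens $t_1, \dots, t_i$. It ignores positional rotations applied to keys before caching, which a real attack would have to undo first.

```latex
\hat{t}_{1:n} \;=\; \arg\min_{t_{1:n} \in \mathcal{V}^{n}}
  \sum_{i=1}^{n} \bigl\lVert K_i - W_K \, h_i(t_{1:i}) \bigr\rVert_2^2
```

Because the model is causal, $h_i$ depends only on tokens up to position $i$, so the objective can be searched left to right, which is why beam search over token candidates is a natural fit.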

The key point is that KV-cache vectors are not ciphertext or random noise. They are a deterministic, invertible (with the right inversion technique) function of the input tokens and the model weights — both of which are known to any party that has access to the same model deployment.

Triton Inference Server Shared Memory

NVIDIA’s Triton Inference Server uses a distinct but related shared-memory mechanism: Linux POSIX shared memory (/dev/shm, accessed via shm_open()) for zero-copy tensor transfer between client processes and the inference server, and between stages in an ensemble pipeline.

The standard flow for high-throughput Triton clients: the client allocates a named shared memory region using tritonclient.utils.shared_memory.create_shared_memory_region(), writes input tensors into it, and sends an inference request that references the region name rather than transferring tensor data over the gRPC or HTTP connection. Triton reads the tensors directly from the shared memory region without copying. This avoids serializing large tensors (image batches, audio spectrograms, embedding arrays) over a socket, which would dominate latency for large inputs.

The problem: unless Triton instances are running in separate container PID/IPC namespaces, all clients connecting to the same Triton server share the same /dev/shm namespace. Shared memory region names in Triton’s Python client library follow a predictable pattern: the default is input_region_<N> or an application-specified name passed to create_shared_memory_region(). A malicious tenant who knows (or can enumerate) the region name can call shm_open("input_region_5", O_RDONLY, 0) directly and read the tensor contents of another client’s in-flight inference request — no GPU access required, no root privilege required, just knowledge of the region name and access to the same Linux kernel.
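The "name is the only credential" property of POSIX shared memory can be demonstrated with Python's standard library alone. This is a minimal sketch, not Triton client code: the region name is hypothetical and merely resembles the pattern described above, and both "clients" run in one process for brevity. On Linux the region appears as a file under /dev/shm while it exists.

```python
import os
from multiprocessing import shared_memory

# Hypothetical region name; the PID suffix only avoids clashes between demo runs.
region_name = f"input_region_demo_{os.getpid()}"

# "Victim" client: creates a named region and writes input data into it.
victim = shared_memory.SharedMemory(name=region_name, create=True, size=64)
victim.buf[:11] = b"secret-data"

# "Other" client in the same IPC namespace: attaches using nothing but the name.
other = shared_memory.SharedMemory(name=region_name)
leaked = bytes(other.buf[:11])
print(leaked)

# Clean up so the demo leaves nothing behind in /dev/shm.
other.close()
victim.close()
victim.unlink()
```

The second attach succeeds because access is governed only by the region's file mode and the caller's UID; processes that run as the same user in the same IPC namespace have no further barrier between them.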

The /dev/shm namespace is an IPC namespace feature. Linux containers get separate IPC namespaces by default in Kubernetes pod specs, which means two pods cannot see each other’s shared memory regions. But this protection disappears entirely if: (a) multiple tenants share the same Triton pod (which is common in cost-optimized deployments), (b) the pod is configured with hostIPC: true, or (c) the tenants are implemented as separate threads or co-processes within the same container. In all three cases, the IPC namespace boundary does not exist, and the /dev/shm regions are mutually visible.

The root enabler is the Linux kernel’s IPC namespace model. IPC namespaces provide isolation at the container boundary, not at the request or tenant boundary within a container. vLLM’s KV-cache lives in GPU DRAM managed by the CUDA driver, which has no namespace concept at all — it is a flat physical memory pool shared by all CUDA contexts within the same process.

Threat Model

Cross-tenant KV-cache page reuse: In a multi-tenant vLLM deployment (multiple organizations’ API keys sharing one inference endpoint), Tenant B’s request receives a free KV-cache block that previously held Tenant A’s request. The block contains attention key and value vectors computed from Tenant A’s system prompt and user query. Before Tenant B’s tokens have fully overwritten the block, those vectors encode recoverable information about Tenant A’s input text. An adversary who controls Tenant B can issue a crafted request designed to trigger block allocation from recently-freed blocks (by requesting a sequence length that matches the freed block count) and then exfiltrate the raw activation values via a custom inference harness. The model weights are public; the inversion attack follows.

Triton /dev/shm direct read: An attacker tenant with a process running in the same Linux IPC namespace as a Triton server enumerates /dev/shm to find active shared memory region names. Standard ls /dev/shm lists all regions in the namespace. The attacker opens a region belonging to another tenant’s in-flight request using shm_open() with O_RDONLY — no elevated privilege required — and reads the tensor data: input embeddings, image tensors, tokenized prompts, or output logits depending on where the victim tenant’s request is in its lifecycle.

Shared memory exhaustion DoS: A malicious tenant mounts a resource exhaustion attack against /dev/shm by allocating large shared memory regions in a tight loop without releasing them. /dev/shm is backed by tmpfs, which by default consumes up to 50% of RAM. When /dev/shm is full, shm_open() calls from legitimate clients fail with ENOSPC. Triton’s response is to reject new inference requests. The attack vector requires only a client-side loop with no authentication bypass.
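On the defensive side, the fill level of /dev/shm is cheap to monitor. The sketch below uses os.statvfs from the standard library; the function names and the 80% alert threshold are assumptions chosen for illustration, not values taken from Triton or any monitoring product.

```python
import os


def tmpfs_usage_fraction(path: str = "/dev/shm") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem mounted at `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bfree * st.f_frsize
    return 0.0 if total == 0 else (total - free) / total


def should_alert(fraction: float, threshold: float = 0.8) -> bool:
    """Hypothetical policy: alert once usage reaches `threshold`."""
    return fraction >= threshold


if os.path.isdir("/dev/shm"):
    usage = tmpfs_usage_fraction()
    print(f"/dev/shm usage: {usage:.1%}, alert={should_alert(usage)}")
```

Sampling this value periodically and alerting on a sustained climb distinguishes a slow leak or allocation loop from the sawtooth pattern of regions being created and released per request.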

KV-cache timing side-channel: vLLM’s prefix caching creates a measurable timing difference between a cache hit (the prefix blocks already exist and are reused) and a cache miss (blocks must be allocated and the attention computation must run). An attacker tenant crafts probe requests using candidate system prompts — for example, known confidentiality notice templates or API gateway system prompt patterns — and measures time-to-first-token (TTFT). A cache hit for a candidate prompt confirms that another tenant’s active or recent request used that exact prefix. The attacker does not need to read memory at all; the cache hit rate leaks presence information. This attack is structurally identical to cross-origin timing attacks in browsers and is detectable at the inference-serving layer.
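Why the hit/miss gap leaks information can be seen in a toy latency model. This is a simulation under stated assumptions, not a measurement of vLLM: ToyPrefixCache, the per-token prefill cost, and the fixed overhead are all made up for illustration, and latencies are computed rather than timed. Blocks are keyed by the entire token prefix they complete, which is the property that makes a hit reveal an exact prefix match.

```python
class ToyPrefixCache:
    """Toy latency model of prefix caching; constants are illustrative only."""

    PREFILL_MS_PER_TOKEN = 0.2   # assumed cost to prefill one uncached token
    FIXED_OVERHEAD_MS = 20.0     # assumed scheduling cost plus first decode step

    def __init__(self, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.cached_blocks = set()

    def _block_keys(self, tokens):
        # A block's identity is the whole prefix up to and including that block.
        n_full = len(tokens) // self.block_tokens
        return [tuple(tokens[: (i + 1) * self.block_tokens]) for i in range(n_full)]

    def ttft_ms(self, tokens) -> float:
        """Serve a request: return simulated TTFT and cache its full blocks."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.cached_blocks:
                break
            hits += 1
        uncached_tokens = len(tokens) - hits * self.block_tokens
        self.cached_blocks.update(keys)
        return self.FIXED_OVERHEAD_MS + uncached_tokens * self.PREFILL_MS_PER_TOKEN


cache = ToyPrefixCache()
cache.ttft_ms(list(range(1000, 1960)))   # victim's 960-token system prompt warms the cache

probes = {
    "same prompt": list(range(1000, 1960)),
    "different prompt": list(range(5000, 5960)),
}
for name, tokens in probes.items():
    print(f"{name}: {cache.ttft_ms(tokens):.0f} ms")
```

In this model the matching probe returns in roughly a tenth of the time of the non-matching one. Note that each probe also populates the cache, so a real prober perturbs the state it is observing, and a defender can look for clients whose requests are dominated by long prompts with minimal generation.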

Continuous batching cross-tenant kernel co-execution: vLLM’s continuous batching engine groups multiple in-flight requests into a single prefill or decode kernel invocation. During a batched decode step, token generation for Tenant A’s request and Tenant B’s request occur in the same CUDA kernel on the same GPU. GPU memory is shared during the active kernel, not just during free/alloc cycles. While the attention computation for each sequence in the batch is logically separate (the attention mask prevents cross-sequence attention), the GPU’s shared caches (L2 cache, shared memory per SM) are not partitioned per sequence. Cache timing attacks at the GPU microarchitecture level — demonstrated in principle by Naghibijouybari et al. (2018) for co-located GPU workloads — apply to batched inference.

Hardening Configuration

1. Per-Tenant vLLM Process Isolation

The most complete isolation is process separation: one vLLM process per tenant, with separate CUDA contexts and separate KV-cache memory pools. No cross-tenant block sharing is possible when blocks are allocated from separate GPU memory regions managed by separate processes.

# Per-tenant vLLM serving: run one process per tenant.
# This eliminates cross-tenant KV-cache page reuse entirely.
# --gpu-memory-utilization 0.40 gives each instance 40% of the GPU.

# Tenant A
vllm serve meta-llama/Llama-3-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.40 \
  --port 8001

# Tenant B
vllm serve meta-llama/Llama-3-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.40 \
  --port 8002

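Kubernetes Deployment per tenant. Each pod requests one nvidia.com/gpu; with the stock NVIDIA device plugin that is a whole GPU per pod, so the 0.40 memory fraction only matters when GPU sharing (for example, time-slicing) is configured and both pods land on the same physical GPU: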

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-tenant-a
  namespace: tenant-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      tenant: tenant-a
  template:
    metadata:
      labels:
        app: vllm
        tenant: tenant-a
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.4.3
        args:
        - "--model"
        - "meta-llama/Llama-3-8B-Instruct"
        - "--max-model-len"
        - "4096"
        - "--gpu-memory-utilization"
        - "0.40"
        - "--disable-log-requests"
        resources:
          limits:
            nvidia.com/gpu: "1"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-tenant-b
  namespace: tenant-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      tenant: tenant-b
  template:
    metadata:
      labels:
        app: vllm
        tenant: tenant-b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.4.3
        args:
        - "--model"
        - "meta-llama/Llama-3-8B-Instruct"
        - "--max-model-len"
        - "4096"
        - "--gpu-memory-utilization"
        - "0.40"
        - "--disable-log-requests"
        resources:
          limits:
            nvidia.com/gpu: "1"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"

The two pods must be scheduled to different physical GPUs, or use MIG partitioning (see section 5). If they share a physical GPU without MIG, the CUDA driver still provides separate virtual address spaces per CUDA context, so direct memory access between processes is not possible, but the co-execution timing attacks described in the threat model remain applicable.

2. Disable vLLM Prefix Caching for Multi-Tenant Deployments

If per-tenant process separation is not feasible, disabling prefix caching eliminates cross-request block retention and substantially reduces the window during which freed blocks contain meaningful data from prior requests. It does not eliminate the remanence window entirely — a freed block is still not zeroed — but it ensures that blocks are not deliberately retained across request boundaries.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_prefix_caching=False,   # Disable cross-request KV-cache sharing
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
)

outputs = llm.generate(["Explain zero-copy IPC"], sampling_params)

When using the vLLM OpenAI-compatible server:

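# Recent vLLM releases accept the negated flag below. On older releases,
# prefix caching is off unless --enable-prefix-caching is passed, so
# omitting the flag has the same effect there.
vllm serve meta-llama/Llama-3-8B-Instruct \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192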

The throughput cost is significant and must be quantified before accepting it as a control. Prefix caching provides the largest benefit for requests that share long system prompts. For a 1,000-token system prompt with Llama-3-8B, a cache hit reduces TTFT from ~200ms to ~20ms — a 10x reduction. At scale, if 90% of your requests share a common system prompt, disabling prefix caching means every request pays the full 200ms prefill cost. For latency-sensitive applications, this is often unacceptable. The correct architectural response is to implement per-tenant prefix cache partitioning — vLLM does not currently support this natively, which means it requires either process separation or a custom scheduler patch.
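The effect on average latency follows from a simple weighted mean. The sketch below uses the ~20ms hit and ~200ms miss figures quoted above as defaults; the function name is illustrative, and the model ignores queueing effects, which make the real gap larger under load.

```python
def expected_ttft_ms(hit_rate: float, hit_ms: float = 20.0, miss_ms: float = 200.0) -> float:
    """Mean time-to-first-token when a fraction `hit_rate` of requests hit the prefix cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


with_cache = expected_ttft_ms(0.90)    # 90% of requests share the cached system prompt
without_cache = expected_ttft_ms(0.0)  # caching disabled: every request pays full prefill

print(f"{with_cache:.0f} ms with caching vs {without_cache:.0f} ms without")
```

Under these assumptions the mean TTFT rises from about 38ms to 200ms, a little over 5x, which is the number to weigh against the isolation benefit.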

3. Isolate /dev/shm with IPC Namespaces

The default Kubernetes pod spec already provides IPC namespace isolation between pods — each pod gets its own /dev/shm namespace, so shared memory regions from one pod are invisible to another pod. The critical point is to audit for configurations that break this default.

apiVersion: v1
kind: Pod
metadata:
  name: triton-tenant-a
  namespace: tenant-a
spec:
  # hostIPC: false is the default — do not override it.
  # Setting hostIPC: true shares the host's IPC namespace with all
  # processes using hostIPC: true, including other pods.
  hostIPC: false
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    args:
    - "tritonserver"
    - "--model-repository=/models"
    # Python-backend option: prefixes the backend's internal shared memory
    # region names to reduce collision probability in shared environments.
    - "--backend-config=python,shm-region-prefix-name=tenant-a-"
    resources:
      limits:
        memory: "16Gi"
        nvidia.com/gpu: "1"
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
    volumeMounts:
    # Memory-backed volume mounted over /dev/shm with a hard size cap.
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi

The sizeLimit on the memory-backed emptyDir volume caps this pod’s /dev/shm at 1 GiB, which bounds the impact of a shared memory exhaustion attack. Because the cap applies per pod, one Triton instance cannot exhaust shared memory for the other instances on the same node. Under plain Docker, the equivalent control is the --shm-size option of docker run.

To verify that IPC namespace isolation is in effect between two pods on the same node:

# From pod 1: create a shared memory region
kubectl exec -n tenant-a triton-tenant-a -- \
  python3 -c "
import ctypes, ctypes.util
libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
# shm_open(name, O_CREAT|O_RDWR, 0600); the name must be passed as bytes
fd = libc.shm_open(b'/test-region', 0o102, 0o600)
print(f'Created region, fd={fd}')
"

# From pod 2: attempt to open the same region; should fail with ENOENT
kubectl exec -n tenant-b triton-tenant-b -- \
  python3 -c "
import ctypes, ctypes.util, errno
libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
# shm_open(name, O_RDONLY, 0)
fd = libc.shm_open(b'/test-region', 0o0, 0)
err = errno.errorcode.get(ctypes.get_errno(), '')
print(f'fd={fd} errno={err}')  # Expect fd=-1 errno=ENOENT if isolation is working
"

The second call returning fd=-1 with errno ENOENT confirms that the IPC namespace boundary is enforced. If it returns a valid fd, both pods share an IPC namespace and cross-tenant tensor reads are possible.
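A complementary check that does not create any region is to compare IPC namespace identifiers directly. On Linux, /proc/<pid>/ns/ipc is a symbolic link whose target names the namespace; two processes share an IPC namespace exactly when the targets match. The helper below is a small sketch (the function name is illustrative) that can be run inside each pod with kubectl exec.

```python
import os


def ipc_namespace_id(pid: str = "self"):
    """Return a string such as 'ipc:[4026531839]' for the PID, or None if unavailable."""
    try:
        return os.readlink(f"/proc/{pid}/ns/ipc")
    except OSError:
        return None


print(ipc_namespace_id())
```

If the value printed in tenant-a’s pod equals the value printed in tenant-b’s pod, or equals the value seen on the host, the pods are not isolated from each other at the IPC level.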

4. GPU Memory Zeroing Between Requests

Zeroing freed KV-cache blocks before returning them to the free pool eliminates the remanence window — a newly allocated block contains no data from any prior request. This is not vLLM’s default behavior. Implementing it requires patching vLLM’s block manager.

In vLLM’s architecture, the CacheEngine manages the physical KV-cache tensors (gpu_cache is a list of layer-indexed tensors). The BlockSpaceManager (in vllm/core/block_manager.py) handles logical-to-physical block mapping. When a sequence finishes, the manager’s free() method releases the physical blocks. The zeroing hook can be inserted here (module paths and class names differ between vLLM releases, so treat the patch as a sketch to adapt to your pinned version):

import torch
from vllm.core.block_manager import BlockSpaceManager  # path varies by vLLM version

# Monkey-patch vLLM's block free operation to zero blocks on release.
# Production use: patch the block manager source directly and pin the version.
# Assumption: the integrator attaches the worker's KV-cache tensors to the
# manager as `kv_cache`; stock vLLM keeps them in CacheEngine, not here.

_original_free = BlockSpaceManager.free

def _zeroing_free(self, seq):
    # Collect the blocks this sequence is the last user of (ref_count == 1).
    # Blocks still shared with other sequences must not be zeroed.
    block_table = self.block_tables.get(seq.seq_id, [])
    block_ids = [b.block_number for b in block_table if b.ref_count == 1]

    # Zero BEFORE returning blocks to the allocator, so that no other
    # sequence can be handed a block that still holds stale data.
    # Assumed per-layer layout: [2, num_blocks, ...] (0 = keys, 1 = values).
    kv_cache = getattr(self, "kv_cache", None)
    if kv_cache is not None and block_ids:
        for layer_cache in kv_cache:
            idx = torch.tensor(block_ids, device=layer_cache.device)
            layer_cache.index_fill_(1, idx, 0)  # dimension 1 indexes blocks
        torch.cuda.synchronize()

    # Call the original free to return the blocks to the allocator.
    _original_free(self, seq)

BlockSpaceManager.free = _zeroing_free

The torch.cuda.synchronize() call is required to ensure the zero-fill completes before the blocks are visible to the next allocator call. Without synchronization, a race exists between the zero-fill kernel and the kernel that writes new data to the reallocated block.

The throughput cost is real and must be benchmarked before deploying this in production. For a Llama-3-70B model with a block size of 16 tokens:

  • Each block: 16 tokens × 2 (K+V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes = ~5.2 MB (Llama-3-70B uses grouped-query attention, so its 64 query heads share 8 KV heads)
  • A 4096-token request uses 256 blocks (4096 / 16)
  • Total bytes to zero per request completion: ~1.34 GB
  • At an A100’s HBM bandwidth of ~2 TB/s, zeroing time: ~0.67ms per request

At 100 requests/second, zeroing consumes about 67ms of bandwidth time per second, roughly 7% of the HBM bandwidth budget. This is measurable. The overhead scales linearly with sequence length, so 8K-context requests double it. Platforms with high per-request memory turnover will see a more significant impact. The right framing is: this is the cost of correctness in a multi-tenant environment, analogous to OS page zeroing on allocation.
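The estimate is easy to recompute for other model shapes and context lengths. The helper below is a back-of-the-envelope sketch: its name and parameters are illustrative, it assumes zero-fill runs at the full ~2 TB/s peak bandwidth, and the example values assume a Llama-3-70B-style shape with 80 layers, 8 KV heads, and FP16 storage. Substitute your own model's numbers before drawing conclusions.

```python
def zeroing_cost(seq_tokens: int, num_layers: int, num_kv_heads: int, head_dim: int,
                 block_tokens: int = 16, dtype_bytes: int = 2,
                 hbm_bytes_per_s: float = 2.0e12) -> tuple:
    """Return (bytes_to_zero, seconds) for freeing one finished sequence."""
    num_blocks = -(-seq_tokens // block_tokens)  # ceiling division
    block_bytes = block_tokens * 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    total = num_blocks * block_bytes
    return total, total / hbm_bytes_per_s


# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), 128-dim heads.
nbytes, seconds = zeroing_cost(4096, num_layers=80, num_kv_heads=8, head_dim=128)
requests_per_second = 100

print(f"{nbytes / 1e9:.2f} GB to zero, {seconds * 1e3:.2f} ms per request")
print(f"{seconds * requests_per_second:.1%} of HBM bandwidth at {requests_per_second} req/s")
```

Real zero-fill kernels rarely reach peak bandwidth and compete with attention kernels for it, so treat the output as a lower bound on the cost.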

5. MIG Partitioning for Hard Tenant Separation

NVIDIA Multi-Instance GPU (MIG) partitions an A100 or H100 into hardware-isolated slices with separate DRAM banks, L2 cache partitions, and compute engine partitions. Unlike software CUDA context separation, MIG uses hardware mechanisms that prevent cross-instance timing attacks and provide physically separate memory — the same DRAM cells are never shared between two MIG instances.

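# Configure MIG on an A100 40GB (requires driver 525+ and CUDA 11.4+).
# The 3g.20gb profile used below belongs to the 40GB card; on an A100 80GB
# the corresponding profile is 3g.40gb.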
# Enable MIG mode
nvidia-smi -i 0 -mig 1

# List available GPU instance profiles
nvidia-smi mig -lgip

# Create two 3g.20gb instances (3/7 of the GPU each, ~20GB DRAM each)
# Suitable for two Llama-3-8B instances with comfortable headroom
nvidia-smi mig -cgi 3g.20gb,3g.20gb -C

# Verify instances were created
nvidia-smi mig -lgi
# Expected output:
# +----------------------------------------------------+
# | GPU instances:                                     |
# | GPU   Name          Profile  Instance   Placement  |
# |       (Profile) ID    ID       Start:Size |
# |====================================================|
# |   0  MIG 3g.20gb       9       1          0:3      |
# |   0  MIG 3g.20gb       9       2          4:3      |
# +----------------------------------------------------+

Kubernetes MIG device plugin configuration for per-tenant allocation:

# ConfigMap for NVIDIA device plugin MIG strategy
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: kube-system
data:
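  config.yaml: |
    version: v1
    flags:
      # "mixed" exposes each MIG profile under its own resource name
      # (nvidia.com/mig-3g.20gb); "single" would expose MIG devices as
      # plain nvidia.com/gpu and the pod request below would never match.
      migStrategy: "mixed"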
---
# Tenant A pod: explicitly requests one MIG 3g.20gb instance
apiVersion: v1
kind: Pod
metadata:
  name: vllm-tenant-a
  namespace: tenant-a
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.4.3
    args:
    - "--model"
    - "meta-llama/Llama-3-8B-Instruct"
    - "--gpu-memory-utilization"
    - "0.90"
    resources:
      limits:
        nvidia.com/mig-3g.20gb: "1"

With this configuration, the Kubernetes scheduler ensures that each pod receives a separate physical MIG instance. The NVIDIA driver enforces that DRAM allocated to one MIG instance is not accessible to another — not just logically separated through virtual address spaces, but physically separate memory banks. GPU timing side-channels are also eliminated at the L2 cache level, which is partitioned per MIG instance.

The constraint: MIG is only available on data-center GPUs from the Ampere generation onward, such as the A100, A30, and H100. V100, T4, A10G, and L40 do not support MIG. Fixed partition profiles (1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, and 7g.40gb for the A100 40GB; the A100 80GB doubles each memory size) may not match the specific memory requirements of a given model and context length, which can lead to either wasted capacity or inability to fit the workload in the partition.
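Choosing a profile is a simple lookup once the workload's memory requirement is known. The sketch below hard-codes the nominal A100 40GB profile sizes as an assumption (usable memory is slightly lower than the nominal figure, and other GPUs have different tables); the function and constant names are illustrative.

```python
# (profile name, nominal memory in GB, compute slices) for an A100 40GB.
# Assumed table for illustration; check `nvidia-smi mig -lgip` on real hardware.
A100_40GB_PROFILES = [
    ("1g.5gb", 5, 1),
    ("2g.10gb", 10, 2),
    ("3g.20gb", 20, 3),
    ("4g.20gb", 20, 4),
    ("7g.40gb", 40, 7),
]


def smallest_fitting_profile(required_gb: float, profiles=A100_40GB_PROFILES):
    """Return the first (smallest) profile whose memory covers the requirement, or None."""
    for name, mem_gb, _slices in profiles:
        if mem_gb >= required_gb:
            return name
    return None


for need in (18, 26, 60):
    print(f"{need} GB -> {smallest_fitting_profile(need)}")
```

A requirement of 26 GB skips both 20 GB profiles and lands on the full-GPU profile, which is the capacity-waste case the paragraph above describes.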

6. Audit /dev/shm Access Between Containers

Even with IPC namespace isolation between pods, intra-pod multi-process deployments (sidecar patterns, ensemble models sharing a Triton pod) require detection controls for unexpected cross-process shared memory access.

Audit shm_open calls using the Linux audit subsystem:

# Audit open/openat calls on files under /dev/shm.
# shm_open() is a libc function, not a syscall: it opens /dev/shm/<name>
# through openat (or open on older glibc), so this rule captures it.
# -F dir= watches the directory recursively; -F path= would match only
# the directory entry itself.
auditctl -a always,exit -F arch=b64 -S open -S openat \
  -F dir=/dev/shm -k shm_access

# View audit events (--raw is required when piping into aureport)
ausearch -k shm_access --start today --raw | \
  aureport --file -i

Falco rule for detecting unexpected /dev/shm access from container processes:

# /etc/falco/rules.d/shm_isolation.yaml

# Define the set of processes that legitimately access /dev/shm
# in your Triton deployment.
- list: triton_shm_users
  items: [tritonserver, python3, tritonclient]

- rule: Unexpected shared memory access from container
  desc: >
    A process inside a container is accessing /dev/shm in a way that
    is not expected for the container's declared role. This may indicate
    a cross-tenant tensor read attempt via POSIX shared memory.
  condition: >
    open_read and
    fd.directory = /dev/shm and
    container and
    not proc.name in (triton_shm_users) and
    not proc.cmdline startswith "tritonserver"
  output: >
    Unexpected /dev/shm access (user=%user.name pid=%proc.pid
    command=%proc.cmdline container=%container.name
    image=%container.image.repository file=%fd.name)
  priority: WARNING
  tags: [container, filesystem, ipc, multi-tenant]

- rule: Shared memory region opened by unexpected process
  desc: >
    A process is opening a /dev/shm region whose name matches the
    inference server's region naming pattern, from a process that
    did not create the region. Cross-tenant read attempt.
  condition: >
    open_read and
    fd.name startswith /dev/shm/input_region and
    container and
    not proc.name in (triton_shm_users)
  output: >
    Cross-tenant shm region access attempt (pid=%proc.pid
    proc=%proc.cmdline region=%fd.name container=%container.name)
  priority: CRITICAL
  tags: [container, ipc, multi-tenant, data-exfiltration]

- rule: Shared memory region opened for writing
  desc: >
    A container process is creating or writing a /dev/shm region.
    Falco conditions are evaluated per event and cannot count events,
    so detect exhaustion attacks by alerting on the rate of this rule
    per container in the downstream alerting pipeline.
  condition: >
    open_write and
    fd.directory = /dev/shm and
    container
  output: >
    /dev/shm region opened for writing (pid=%proc.pid
    proc=%proc.cmdline container=%container.name file=%fd.name)
  priority: NOTICE
  tags: [container, ipc, dos]

Expected Behaviour

With per-tenant vLLM process isolation and MIG partitioning, nvidia-smi shows separate compute instances that do not share DRAM or L2:

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC DEC OFA JPG  |
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    1   0  0   |   8192MiB / 20096MiB | 28      0 |  2   0   1   0   0   |
|  0    2   0  1   |   8192MiB / 20096MiB | 28      0 |  2   0   1   0   0   |
+------------------+----------------------+-----------+-----------------------+

Two separate vLLM processes, each owning one MIG partition. GPU memory utilization figures are independent; tenant A exhausting its 20GB partition does not affect tenant B’s partition.

With per-pod IPC namespace isolation, ls -la /dev/shm from tenant-a’s pod shows only regions created by tenant-a’s processes:

kubectl exec -n tenant-a triton-tenant-a -- ls -la /dev/shm
# total 0
# drwxrwxrwt 2 root root  60 May  8 10:00 .
# drwxr-xr-x 5 root root 360 May  8 10:00 ..
# -rw------- 1 1000 1000 1073741824 May  8 10:01 tenant-a-input_region_0

The region name tenant-a-input_region_0 is visible only within the pod’s IPC namespace. An attempt to open it from tenant-b’s pod fails with ENOENT because the region does not exist in tenant-b’s IPC namespace — even if both pods are running on the same Kubernetes node with shared host memory.

After loading the Falco rules, trigger a test read from a process that shares the victim’s IPC namespace (for example, a sidecar container in the same pod). Falco’s open_read macro matches only successful opens, and python3 is on the allow-list above, so the test uses cat:

# From a co-located container attempting to read another tenant's region
cat /dev/shm/input_region_0 > /dev/null

Falco emits:

10:23:14.891: Warning Unexpected /dev/shm access
  (user=attacker pid=7823 command=cat /dev/shm/input_region_0 container=attacker-pod
   image=attacker/tool file=/dev/shm/input_region_0)

Trade-offs

Per-tenant vLLM instances: The memory cost is multiplicative. A single vLLM instance for Llama-3-8B with gpu_memory_utilization=0.90 reserves approximately 72GB of DRAM on an A100 80GB (about 16GB of FP16 weights, with the remainder pre-allocated as KV-cache pool), leaving no room for a second instance on the same GPU. Running two such instances requires two GPUs. At meaningful tenant counts (50–100 organizational tenants), the GPU fleet requirement becomes economically prohibitive for most operators. The alternative is to accept lower gpu_memory_utilization per instance and schedule multiple instances per GPU, which reduces the KV-cache pool size per tenant and lowers the maximum concurrent request capacity per tenant. Per-tenant instances also eliminate any possibility of cross-tenant prefix caching, which may have been the primary throughput optimization for deployments that share system prompts across tenants.

Disabling prefix caching: The throughput impact is workload-dependent and can be measured precisely by running benchmarks/benchmark_throughput.py from the vLLM repository against your specific prompt distribution with and without enable_prefix_caching. For workloads where 80% of requests share a 1,000-token system prompt, the TTFT increase from disabling prefix caching can be 5–10x. For workloads with diverse prompts and short system prompts, the impact is negligible. Operators should benchmark before disabling; accepting a 5x TTFT increase for all users to close a tenant isolation gap that affects 0.1% of users is a bad trade. The targeted alternative, per-tenant cache partitioning, requires vLLM modifications that are not yet in the mainline codebase as of v0.5.x.

GPU memory zeroing on block free: The overhead is proportional to the freed block count per request. For Llama-3-70B with grouped-query attention (8 KV heads, about 5.2 MB per 16-token block), a short request (256 tokens, 16 blocks, ~84 MB) costs approximately 0.04ms to zero at an A100’s peak HBM bandwidth of ~2 TB/s, and a long-context request (32K tokens, 2,048 blocks, ~10.7 GB) costs ~5.4ms. These figures do not account for contention from concurrent requests also running GPU kernels. In practice, synchronous zeroing serializes with other GPU work via cudaDeviceSynchronize() and adds latency to request completion. Asynchronous zeroing (using a CUDA stream for the zero-fills, not synchronized to the request completion path) is safer for throughput but introduces a window where a newly allocated block may briefly contain old data while the zero-fill stream catches up. For a correct security guarantee, synchronous zeroing is required.

MIG partitioning: Only MIG-capable data-center GPUs such as the A100, A30, and H100 support it. If your inference fleet consists of A10G, L40S, or V100 instances (common in cost-optimized deployments), MIG is not available. The fixed partition profiles can waste GPU capacity: a 13B-parameter model that requires ~26GB of DRAM in FP16 cannot run on a 3g.20gb partition (20GB), and stepping up to a 4g.20gb partition (20GB, more compute) still does not fit. Fitting even a quantized Llama-3-70B requires at least a full-GPU 7g partition, which provides no partitioning benefit. MIG is best matched to smaller models (8B–13B parameter range) where 2–4 instances fit on a single A100.

Failure Modes

Assuming Kubernetes namespace isolation protects KV-cache memory: Kubernetes namespaces (namespace: tenant-a vs namespace: tenant-b) are API-layer abstractions. They control which API objects each service account can read or modify via the Kubernetes API server. They have no relationship to Linux kernel IPC namespaces, GPU memory management, or CUDA context isolation. Two pods in different Kubernetes namespaces but scheduled to the same Kubernetes node, sharing a vLLM deployment, share the same GPU memory pool. Kubernetes namespace labeling does not create any kernel-level memory boundary.

Enabling vLLM prefix caching in a multi-tenant API gateway without per-tenant partitioning: A common deployment pattern is a single vLLM instance behind an API gateway that injects per-customer system prompts before forwarding requests. With enable_prefix_caching=True (opt-in on the 0.4.x releases discussed here, but enabled by default in more recent vLLM releases), the KV-cache blocks for each customer’s system prompt are retained and shared across requests that use the same system prompt. If two customers happen to use the same system prompt text (for example, a default template), their cached blocks are the same physical blocks. More critically, the timing side-channel described in the threat model section is always present when prefix caching is enabled: an attacker can enumerate system prompt candidates by measuring TTFT, confirming which prompts are currently cached without any memory access.

Setting hostIPC: true on inference pods: This configuration appears in some vendor Helm charts for Triton, inherited from HPC workload templates where shared memory for MPI was required. hostIPC: true places the pod in the host’s IPC namespace, which is shared with all other pods on the same node that also set hostIPC: true, as well as with any host processes. The entire /dev/shm isolation model collapses: all Triton instances on the node share a single /dev/shm namespace, and any process on the node can enumerate and read any shared memory region. Audit your inference Helm charts for this setting. Run kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostIPC}{"\n"}{end}' | grep true across your cluster.
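For clusters where the one-liner is awkward to maintain, the same audit can be scripted against the JSON form of the pod list. The sketch below is illustrative: the function name is hypothetical, and the sample document only imitates the shape of `kubectl get pods --all-namespaces -o json` output rather than reproducing real cluster data.

```python
import json


def pods_with_host_ipc(pod_list: dict) -> list:
    """Return 'namespace/name' for every pod whose spec sets hostIPC: true."""
    flagged = []
    for pod in pod_list.get("items", []):
        if pod.get("spec", {}).get("hostIPC") is True:
            meta = pod.get("metadata", {})
            flagged.append(f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}")
    return flagged


# Sample input shaped like the JSON pod list returned by kubectl.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "triton-a", "namespace": "tenant-a"}, "spec": {"hostIPC": true}},
  {"metadata": {"name": "vllm-b", "namespace": "tenant-b"}, "spec": {}}
]}
""")

print(pods_with_host_ipc(sample))
```

Including the namespace in the output makes it easier to trace each hit back to the Helm release that set the flag.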

Treating /dev/shm exhaustion as a performance issue: When Triton operators observe shm_open() failed: ENOSPC errors in their inference server logs, the first instinct is to increase the --shm-size flag or expand the tmpfs mount. This addresses the symptom while potentially amplifying the risk: a larger /dev/shm means an attacker can exhaust more system RAM before triggering the OOM condition. The correct response to repeated ENOSPC events is to determine whether they represent legitimate throughput growth (scale the infrastructure) or a pattern of rapid allocation without deallocation (investigate for DoS). Rate-limit client-side shared memory region creation in the Triton client library or at the API gateway layer — legitimate clients should be creating at most one active region per concurrent request.

Equating “separate container” with “separate memory”: A container is a set of Linux namespaces and cgroups. The GPU memory backing a model’s KV-cache is managed by the CUDA driver via cudaMalloc(); it is not subject to cgroup memory limits, not visible to the OOM killer in the way CPU memory is, and not isolated by Linux namespaces. Container boundaries mean nothing to CUDA contexts that share the same GPU. Two containers running on the same GPU without MIG partitioning share the same CUDA device, the same kernel driver, and (in the case of multi-process serving with the same CUDA device visible) the same physical DRAM. The only isolation primitives that actually matter for GPU memory are: separate CUDA contexts (which prevent direct pointer dereference across contexts, but not timing attacks), separate GPU processes with CUDA MPS disabled, and MIG partitioning (physically separate DRAM banks).