vLLM Production Security Hardening
Problem
vLLM has become the dominant open-source framework for serving large language models at scale. Its OpenAI-compatible API surface, continuous batching, and PagedAttention memory management make it the default choice for teams self-hosting models like Llama 3, Mistral, Qwen, and Code Llama. But the convenience that makes vLLM easy to stand up is also what makes it dangerous in production: a single command — python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-70b-instruct — starts a fully functional inference server with no authentication, no rate limiting, and the API bound to 0.0.0.0:8000 by default.
That default posture exposes an extremely powerful compute resource to anyone with network access. The /v1/completions and /v1/chat/completions endpoints accept arbitrary prompts and return model outputs immediately, with no identity check. In cloud environments where security groups or Kubernetes ingress rules are misconfigured — which is common during initial deployment — this means any host on the internet can query the model. Even inside a private network, any workload in the cluster can reach the inference server unless explicit NetworkPolicy rules exist.
The attack surface is larger than simple unauthorised API access. A motivated adversary with repeated access to completion outputs can reconstruct model behavior through model extraction: by systematically sampling the model with crafted inputs and recording outputs, they can train a surrogate model that approximates the target’s behavior, effectively stealing the intellectual property embodied in a fine-tuned or RLHF-trained checkpoint. On the wire, each extraction query looks like an ordinary completion request, so the attack leaves no anomalous signature unless request volume and prompt patterns are monitored.
Cost-exhaustion attacks are an immediate financial threat. vLLM will happily accept requests with max_tokens: 4096 and n: 10 (ten completions per request), consuming substantial GPU compute per call. Without per-user token budgets or global rate limits, a single malicious client can saturate an H100 running at $3–5/hour, running up thousands of dollars in cloud GPU costs before anyone notices. vLLM does not natively enforce per-client token quotas.
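The arithmetic behind that exposure can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the fleet size (8 GPUs) and duration (72 hours, roughly an unattended weekend) are assumed values, not figures from any real incident.

```python
def worst_case_completion_tokens(max_tokens: int, n: int) -> int:
    """Upper bound on generated tokens for one request: n completions of max_tokens each."""
    return max_tokens * n


def gpu_cost_usd(gpus: int, hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of keeping `gpus` GPUs saturated for `hours` hours at a flat hourly rate."""
    return gpus * hours * usd_per_gpu_hour


# One request with max_tokens=4096 and n=10 can ask for this many generated tokens.
tokens_per_request = worst_case_completion_tokens(max_tokens=4096, n=10)

# Assumed scenario: an 8-GPU node kept busy for 72 hours at $3 to $5 per GPU-hour.
low = gpu_cost_usd(gpus=8, hours=72, usd_per_gpu_hour=3.0)
high = gpu_cost_usd(gpus=8, hours=72, usd_per_gpu_hour=5.0)
print(tokens_per_request, low, high)
```

Under these assumptions a single request can demand 40,960 generated tokens, and the weekend costs between $1,728 and $2,880.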
Multi-tenant deployments introduce a subtler threat: KV cache sharing across tenants. vLLM’s automatic prefix caching, built on PagedAttention, reuses cached key-value blocks across requests that begin with the same token prefix, for example a shared system prompt. If two tenants share a vLLM instance with prefix caching enabled, a request from tenant B can be served from blocks that tenant A’s request populated. At a minimum, that cache hit is observable as lower latency, which lets one tenant probe whether another tenant recently sent a given prefix and so leaks content across isolation boundaries. This is not a theoretical concern: any multi-tenant deployment with prefix caching enabled and a shared system prompt prefix is potentially vulnerable.
vLLM’s LoRA adapter loading introduces a path traversal surface. The --lora-modules flag accepts arbitrary filesystem paths, and if dynamic adapter loading is enabled at runtime (the /v1/load_lora_adapter endpoint, which recent releases gate behind the VLLM_ALLOW_RUNTIME_LORA_UPDATING environment variable), a caller who can reach that endpoint can attempt to load adapters from paths outside the intended directory. Model files are not arbitrary code in the traditional sense, but parsers for tensor file formats such as GGUF have had memory-safety vulnerabilities in the past, and PyTorch’s legacy pickle-based format remains a code execution vector.
The tool-call parsing pipeline is an additional injection surface. vLLM parses structured JSON from model outputs to dispatch tool calls. A model prompted to output malformed or adversarially-crafted tool-call JSON can trigger parsing errors or, in sufficiently complex serving stacks, influence downstream tool dispatch logic. This is particularly relevant in agentic deployments where tool-call outputs are automatically executed.
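One way to narrow this surface is to validate every model-emitted tool call against an allowlist before anything is dispatched. The sketch below assumes tool calls arrive as JSON objects with `name` and `arguments` fields; the tool names and the `parse_tool_call` helper are hypothetical and not part of vLLM.

```python
import json

# Hypothetical allowlist: tool name -> argument names that tool may receive.
ALLOWED_TOOLS = {
    "get_weather": {"city", "units"},
    "search_docs": {"query"},
}


def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse one model-emitted tool call and reject anything off the allowlist."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise ValueError(f"tool {name!r} is not allowlisted")

    args = call.get("arguments", {})
    if isinstance(args, str):
        # Some response formats encode the arguments as a JSON string.
        try:
            args = json.loads(args)
        except json.JSONDecodeError as exc:
            raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    unexpected = set(args) - ALLOWED_TOOLS[name]
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return name, args
```

Failing closed with a single exception type keeps the dispatcher simple: any `ValueError` means the call is dropped and logged rather than executed.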
Multi-GPU tensor parallelism expands the network attack surface across nodes. When vLLM uses --tensor-parallel-size > 1, it opens NCCL communication channels between GPU workers. These channels are not authenticated by default, and in Kubernetes environments they may be reachable from other pods if NetworkPolicy is not correctly configured.
Target systems: vLLM 0.4.x–0.6.x, CUDA 12.x, NVIDIA H100/A100, Kubernetes with the NVIDIA GPU Operator.
Threat Model
- External attacker with network access executes model extraction by sending thousands of systematically varied completions requests. The model’s fine-tuned behavior, domain knowledge, and RLHF alignment are valuable IP. Without authentication the attacker needs only a routable path to port 8000.
- Cost-exhaustion attacker sends high-max_tokens, high-n requests continuously. The goal is financial: driving up GPU-hour costs or degrading service availability for legitimate users. Without rate limiting, a single client can consume 100% of GPU capacity indefinitely.
- Multi-tenant context injector crafts prompts with prefixes that match another tenant’s cached prefix, attempting to read or influence that tenant’s session context via shared KV cache state. In shared vLLM deployments with a common system prompt, this can leak information about what other users are discussing or subtly shift model behavior for subsequent requests in the shared cache.
- Insider with LoRA adapter upload access uploads a malicious adapter file to the model directory, either by exploiting the dynamic adapter loading API or via direct filesystem access. The malicious adapter contains weights crafted to cause the model to behave differently (producing harmful outputs, leaking system prompts, or bypassing content filters) or, in pickle-format models, to execute arbitrary Python code during deserialization.
Without hardening, any of these adversaries can operate undetected: no authentication means no identity to audit, no rate limiting means no anomaly signal on request volume, and no structured logging means no forensic trail after an incident. With full hardening applied, each adversary faces authentication gates, request-budget enforcement, cache isolation, filesystem controls, and a complete audit trail that records every request’s identity, prompt length, token count, and outcome.
Configuration / Implementation
API Key Authentication
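vLLM 0.4+ includes a built-in --api-key flag that enables Bearer token validation on the /v1 API routes (the key can also be supplied through the VLLM_API_KEY environment variable, which keeps it out of the process argument list). Routes outside /v1, such as /health and /metrics, are not covered by this check and still need network-level protection. A locked-down launch: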
# Request logging stays on as long as --disable-log-requests is not passed;
# the flag is a switch and takes no value.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --api-key "${VLLM_API_KEY}" \
  --host 127.0.0.1 \
  --port 8000 \
  --max-model-len 8192
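Binding to 127.0.0.1 instead of 0.0.0.0 ensures vLLM is reachable only through a reverse proxy on the same host or a sidecar container in the same pod, not directly from the network. A proxy running in a separate pod cannot reach a loopback-bound server; in that layout, bind to the pod IP and rely on NetworkPolicy to restrict callers. Store the key in a Kubernetes Secret and inject it as an environment variable: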
apiVersion: v1
kind: Secret
metadata:
  name: vllm-api-key
  namespace: inference
type: Opaque
stringData:
  api-key: "sk-prod-replace-with-256bit-random-value"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      serviceAccountName: vllm-sa
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          env:
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key
                  key: api-key
          args:
            - "--model"
            - "meta-llama/Llama-3-70b-instruct"
            - "--api-key"
            - "$(VLLM_API_KEY)"
            - "--host"
            - "127.0.0.1"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "8192"
            # Prefix caching is opt-in on vLLM 0.4.x-0.6.x: leaving out
            # --enable-prefix-caching keeps it disabled.
          resources:
            limits:
              nvidia.com/gpu: "1"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            allowPrivilegeEscalation: false
For key rotation, deploy a new Secret value and perform a rolling restart: kubectl rollout restart deployment/vllm-server -n inference. Automate rotation with External Secrets Operator or Vault Agent Injector.
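Key generation and the overlap window used during rotation can be sketched as follows. This is a hypothetical gateway-side helper, not vLLM code: `generate_api_key` and `is_valid_bearer` are illustrative names, and the `sk-prod-` prefix simply mirrors the placeholder in the Secret above.

```python
import hmac
import secrets


def generate_api_key(prefix: str = "sk-prod-") -> str:
    """Return a key carrying 256 bits of randomness (32 random bytes, URL-safe base64)."""
    return prefix + secrets.token_urlsafe(32)


def is_valid_bearer(header_value: str, active_keys: list[str]) -> bool:
    """Check an Authorization header value against every currently active key.

    During rotation, `active_keys` holds both the old and the new key so that
    clients can switch over without seeing 401 responses.
    """
    scheme, _, presented = header_value.partition(" ")
    if scheme != "Bearer" or not presented:
        return False
    matched = False
    for key in active_keys:
        # compare_digest avoids leaking the position of the first mismatch via timing.
        if hmac.compare_digest(presented.encode(), key.encode()):
            matched = True
    return matched
```

Once every client has moved to the new key, dropping the old entry from `active_keys` completes the rotation.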
Rate Limiting
Place nginx in front of vLLM as a TLS-terminating reverse proxy with rate limiting:
# /etc/nginx/nginx.conf
worker_processes auto;
events {
    worker_connections 4096;
}
http {
    # Rate limit zones. Keying on $http_authorization limits each API key separately;
    # requests without an Authorization header have an empty key and are only
    # counted by the per-IP zone.
    limit_req_zone $http_authorization zone=per_key:10m rate=20r/m;
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=60r/m;
    # Return 429 on rejection (the nginx default is 503)
    limit_req_status 429;
    # Do not log $http_authorization: it carries the bearer token in clear text.
    # $request_id is the same value forwarded upstream as X-Request-ID.
    log_format vllm_access '$time_iso8601 $remote_addr '
                           '$request_method $request_uri $status '
                           '$body_bytes_sent $request_time '
                           '"$request_id"';
    upstream vllm_backend {
        server 127.0.0.1:8000;
        keepalive 64;
    }
    server {
        listen 443 ssl http2;
        server_name inference.internal.example.com;
        ssl_certificate /etc/ssl/certs/inference.crt;
        ssl_certificate_key /etc/ssl/private/inference.key;
        ssl_protocols TLSv1.3;
        # TLS 1.3 suites are set through OpenSSL's Ciphersuites command (nginx 1.19.4+);
        # ssl_ciphers only applies to TLS 1.2 and below.
        ssl_conf_command Ciphersuites TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256;
        access_log /var/log/nginx/vllm_access.log vllm_access;
        # Apply rate limits: burst absorbs short spikes, nodelay rejects the excess immediately
        limit_req zone=per_key burst=5 nodelay;
        limit_req zone=per_ip burst=10 nodelay;
        location /v1/ {
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Request-ID $request_id;
            proxy_read_timeout 300s;
            proxy_send_timeout 60s;
            # Strip sensitive headers from upstream response
            proxy_hide_header X-Powered-By;
            # Enforce request body size (prevent oversized prompt bombs)
            client_max_body_size 1m;
        }
        # Block all non-API paths
        location / {
            return 404;
        }
    }
    # Redirect HTTP to HTTPS
    server {
        listen 80;
        return 301 https://$host$request_uri;
    }
}
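Use --max-model-len to hard-cap total context (prompt plus completion) at the server level. vLLM rejects any request whose prompt tokens plus max_tokens exceed this limit, regardless of what the caller sends in the request body. The flag does not bound n, so the number of completions per request has to be limited separately, for example at the proxy or gateway.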
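A gateway in front of vLLM can additionally bound the per-request fields that drive GPU cost. The sketch below is one possible shape for that check: the caps and the `enforce_request_limits` name are assumptions for illustration, not vLLM settings.

```python
# Hypothetical policy values; tune them to your own deployment.
MAX_COMPLETION_TOKENS = 1024
MAX_N = 1


def _bounded(value, default: int, cap: int, field: str) -> int:
    """Return `value` clamped to `cap`, or `default` when the field is absent."""
    if value is None:
        return default
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError(f"{field} must be a positive integer")
    return min(value, cap)


def enforce_request_limits(body: dict) -> dict:
    """Return a copy of an OpenAI-style request body with max_tokens and n bounded."""
    limited = dict(body)
    limited["max_tokens"] = _bounded(
        body.get("max_tokens"), MAX_COMPLETION_TOKENS, MAX_COMPLETION_TOKENS, "max_tokens"
    )
    limited["n"] = _bounded(body.get("n"), 1, MAX_N, "n")
    return limited
```

Clamping rather than rejecting keeps well-behaved clients working when they over-ask; a stricter gateway could raise an error instead.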
For Kubernetes-native rate limiting, deploy Envoy as a sidecar and configure the local rate limit filter:
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
  namespace: inference
data:
  envoy.yaml: |
    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address:
              address: 0.0.0.0
              port_value: 9000
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    http_filters:
                      - name: envoy.filters.http.local_ratelimit
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                          stat_prefix: local_rate_limiter
                          token_bucket:
                            max_tokens: 100
                            tokens_per_fill: 20
                            fill_interval: 60s
                          filter_enabled:
                            runtime_key: local_rate_limit_enabled
                            default_value:
                              numerator: 100
                              denominator: HUNDRED
                          filter_enforced:
                            runtime_key: local_rate_limit_enforced
                            default_value:
                              numerator: 100
                              denominator: HUNDRED
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: vllm_backend
                          domains: ["*"]
                          routes:
                            - match:
                                prefix: "/v1/"
                              route:
                                cluster: vllm_cluster
                                timeout: 300s
      clusters:
        - name: vllm_cluster
          type: STATIC
          load_assignment:
            cluster_name: vllm_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: 127.0.0.1
                          port_value: 8000
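To build intuition for the three token_bucket parameters, the following sketch models a bucket that starts full and gains tokens_per_fill tokens every fill_interval, capped at max_tokens. It is a simplified model written for this guide, not Envoy's implementation, and the clock is passed in explicitly so the behaviour is easy to check.

```python
class TokenBucket:
    """Simplified model of a bucket with max_tokens, tokens_per_fill and fill_interval."""

    def __init__(self, max_tokens: int, tokens_per_fill: int, fill_interval_s: float,
                 now: float = 0.0) -> None:
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval_s = fill_interval_s
        self.tokens = max_tokens  # the bucket starts full
        self.last_fill = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; refill first for each whole interval elapsed."""
        intervals = int((now - self.last_fill) // self.fill_interval_s)
        if intervals > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + intervals * self.tokens_per_fill)
            self.last_fill += intervals * self.fill_interval_s
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


# Same numbers as the Envoy config above: burst of 100, then 20 more per minute.
bucket = TokenBucket(max_tokens=100, tokens_per_fill=20, fill_interval_s=60.0)
burst_allowed = sum(bucket.allow(now=0.0) for _ in range(150))
next_minute_allowed = sum(bucket.allow(now=60.0) for _ in range(30))
print(burst_allowed, next_minute_allowed)
```

With these settings the first 100 requests pass and the rest of the burst is refused, then 20 more pass after one minute. Note that a single bucket like this is shared by every caller of that Envoy instance; it is not a per-key limit.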
Request Isolation in Multi-Tenant Deployments
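Keep prefix caching off when multiple tenants share a vLLM instance. On the target versions (vLLM 0.4.x–0.6.x) automatic prefix caching is opt-in, so the hardened launch simply never passes --enable-prefix-caching:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --api-key "${VLLM_API_KEY}"
With prefix caching off, vLLM does not reuse KV cache blocks across requests, which removes the cross-tenant cache risk. Later releases that enable prefix caching by default need an explicit opt-out (--no-enable-prefix-caching at the time of writing; confirm against --help for your version). The latency penalty is real (see Trade-offs), but it is the only correct option when tenants must be isolated on a shared instance.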
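Why a shared system prompt leads to shared cache blocks can be seen in a toy model of prefix caching. The sketch below keys each block of tokens by a hash chained over everything before it; it is a conceptual illustration, not vLLM's actual code, and the tiny block size and the per-tenant `salt` parameter are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; deliberately tiny for the example


def block_keys(token_ids: list[int], salt: str = "") -> list[str]:
    """Return one cache key per full block; each key depends on all preceding tokens."""
    keys = []
    parent = salt
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


system_prompt = list(range(1, 9))  # eight tokens shared by both tenants
tenant_a = block_keys(system_prompt + [9, 10, 11, 12])
tenant_b = block_keys(system_prompt + [20, 21, 22, 23])
shared_blocks = sum(a == b for a, b in zip(tenant_a, tenant_b))
print(shared_blocks)
```

In this model the two requests share the keys for both system-prompt blocks and diverge on the third. Passing a different salt per tenant makes every key differ, which is the idea behind per-tenant cache partitioning as an alternative to turning caching off.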
For stronger isolation, deploy a separate vLLM instance per tenant namespace. Use namespace-scoped Secrets and NetworkPolicy so tenant A’s vLLM pod is completely unreachable from tenant B’s namespace:
# One deployment per tenant namespace; repeat per tenant
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: tenant-alpha # <-- per-tenant namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
      tenant: alpha
  template:
    metadata:
      labels:
        app: vllm-server
        tenant: alpha
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          env:
            # Without this, $(VLLM_API_KEY) below is passed to vLLM as a literal string
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key # namespace-scoped Secret, one per tenant
                  key: api-key
          args:
            - "--model"
            - "meta-llama/Llama-3-70b-instruct"
            - "--api-key"
            - "$(VLLM_API_KEY)"
          resources:
            limits:
              nvidia.com/gpu: "1"
LoRA Adapter Security
Restrict adapter loading to a pre-approved directory with tight filesystem permissions. Never enable dynamic adapter loading endpoints in production:
# On the host or in the init container, set permissions on the model directory
chmod 750 /models/adapters
chown vllm-service:vllm-service /models/adapters
# Launch with explicit allowlist — no dynamic loading
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b-instruct \
--enable-lora \
--lora-modules adapter-v1=/models/adapters/adapter-v1 \
adapter-v2=/models/adapters/adapter-v2 \
--max-lora-rank 64
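Do not pass --enable-prefix-caching alongside LoRA when tenants select different adapters: cached KV blocks from adapter-v1 requests must not be served to adapter-v2 requests. Set readOnlyRootFilesystem in the container securityContext and mount the adapter files from a read-only volume so the running container cannot write new adapter files to disk. A PersistentVolumeClaim pre-populated by CI/CD is used here because ConfigMaps are limited to 1 MiB, which is too small for most adapter weights. With a read-only root filesystem, any cache or temporary directories the server writes to need their own writable emptyDir mounts:
# Container-level settings
securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop: ["ALL"]
volumeMounts:
  - name: adapters
    mountPath: /models/adapters
    readOnly: true
# Pod-level settings
volumes:
  - name: adapters
    persistentVolumeClaim:
      claimName: lora-adapters # hypothetical PVC name, pre-populated by CI/CD
      readOnly: true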
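If adapter names ever reach the server from a request or a deployment manifest, a pre-flight check along these lines can reject paths that escape the approved directory and flag pickle-based weight files. This is a hedged sketch: `resolve_adapter_path`, `find_pickle_files`, and the suffix list are illustrative choices, not part of vLLM.

```python
from pathlib import Path

ADAPTER_ROOT = Path("/models/adapters")
# File suffixes commonly used for pickle-based checkpoints (assumed list).
PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".pickle"}


def resolve_adapter_path(requested: str, root: Path = ADAPTER_ROOT) -> Path:
    """Resolve `requested` under `root`, rejecting anything that lands outside it."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (root / requested).resolve()
    if root not in candidate.parents:
        raise ValueError(f"adapter path escapes {root}: {requested!r}")
    return candidate


def find_pickle_files(adapter_dir: Path) -> list[Path]:
    """List files under `adapter_dir` whose suffix suggests a pickle-based format."""
    return sorted(
        p for p in adapter_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in PICKLE_SUFFIXES
    )
```

A CI/CD step could call `find_pickle_files` on each adapter directory and refuse to publish any adapter for which the list is non-empty.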
Network Isolation
Kubernetes NetworkPolicy restricting ingress to vLLM pods:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-ingress-only
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow only from the nginx/Envoy proxy pod
    - from:
        - podSelector:
            matchLabels:
              app: inference-proxy
      ports:
        - protocol: TCP
          port: 8000
  egress:
    # Allow DNS
    - ports:
        - protocol: UDP
          port: 53
    # Allow HuggingFace Hub for model download (restrict in air-gapped environments)
    - ports:
        - protocol: TCP
          port: 443
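For tensor-parallel deployments whose workers span more than one pod (--tensor-parallel-size > 1), NCCL worker-to-worker traffic needs additional rules scoped to the vLLM pods in the same namespace. Because the policy above also restricts ingress, a matching ingress rule for the same peers and ports is required alongside the egress rule shown here. The 40000–50000 range is an example: NCCL sockets take ports from the node’s ephemeral range unless that range is constrained, so confirm the ports actually in use before tightening the rule:
egress:
  - to:
      - podSelector:
          matchLabels:
            app: vllm-server
    ports:
      - protocol: TCP
        port: 40000
        endPort: 50000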
CUDA and GPU Isolation
Use NVIDIA Multi-Instance GPU (MIG) partitioning to give each tenant a hardware-isolated GPU slice. On an H100 80GB, a 1g.10gb MIG instance provides 10 GB of HBM and one compute slice:
# Enable MIG mode on GPU 0 (requires root, GPU reset)
nvidia-smi -i 0 -mig 1
# Create MIG instances — example: 7 x 1g.10gb slices on H100
nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
# List available MIG devices
nvidia-smi -L
In Kubernetes, configure the NVIDIA device plugin to expose MIG instances as discrete resources. This requires the mixed MIG strategy, under which each profile is advertised under its own resource name; with the single strategy, MIG devices are advertised as plain nvidia.com/gpu:
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: mixed
Reference a specific MIG profile in the pod resource request:
resources:
  limits:
    nvidia.com/mig-1g.10gb: "1"
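In non-MIG environments, scope CUDA_VISIBLE_DEVICES so that a process only uses the GPUs in its allocation. This guards against accidental use of other devices; it is not a security boundary, because code running in the container can override the variable. Device-level isolation comes from the container runtime and device plugin exposing only the allocated GPUs:
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0" # or injected by device plugin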
Audit Logging
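Keep vLLM request logging enabled and correlate it with the nginx access logs through the request ID:
# Request logging is on unless --disable-log-requests is passed, so the flag is simply omitted.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct
vLLM logs each received request with its request ID and sampling parameters; because the log can also contain prompt text, use --max-log-len to bound how much of each prompt is written. Some releases additionally offer --enable-request-id-headers to echo the X-Request-Id header on responses; check --help for your version. Augment this with the nginx access logs (configured above), which capture the request ID and HTTP status. Record a hash or short non-secret identifier of the API key rather than the raw Authorization header, since access logs are widely readable. Ship both to a central SIEM (Elasticsearch, Splunk, or Loki) for correlation.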
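A key identifier that is safe to log can be derived with a keyed hash. The sketch below is one way to do it, assuming a gateway or log processor that sees the Authorization header; the function names and the `pepper` secret are hypothetical.

```python
import hashlib
import hmac


def key_fingerprint(api_key: str, pepper: bytes) -> str:
    """Return a stable, non-reversible identifier for an API key.

    `pepper` is a server-side secret, so the fingerprint cannot be
    recomputed from a guessed key by someone who only sees the logs.
    """
    mac = hmac.new(pepper, api_key.encode(), hashlib.sha256)
    return mac.hexdigest()[:16]


def redact_authorization(header_value: str, pepper: bytes) -> str:
    """Replace a Bearer token with its fingerprint; anything else becomes '-'."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Bearer" or not token:
        return "-"
    return "key:" + key_fingerprint(token, pepper)
```

The same fingerprint appears for every request made with a given key, so per-key volume and error rates can still be tracked in the SIEM without the key itself ever being stored.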
A minimal Fluent Bit configuration to ship logs from the inference namespace:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: inference
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info
    [INPUT]
        Name              tail
        # Pod logs live at /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log
        Path              /var/log/pods/inference_*/*/*.log
        # Use the cri parser instead on containerd or CRI-O nodes
        Parser            docker
        Tag               inference.*
        Refresh_Interval  10
    [FILTER]
        Name    grep
        Match   inference.*
        Regex   log (vllm|nginx)
    [OUTPUT]
        Name        es
        Match       inference.*
        Host        elasticsearch.logging.svc.cluster.local
        Port        9200
        Index       vllm-audit
        tls         On
        tls.verify  On
TLS Termination
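vLLM can serve HTTPS directly through its --ssl-keyfile and --ssl-certfile options, but in this design TLS is terminated at the nginx or Envoy layer so that certificates, protocol policy, and rate limiting are managed in one place. Use cert-manager to provision and rotate certificates automatically: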
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: inference-tls
  namespace: inference
spec:
  secretName: inference-tls-secret
  duration: 2160h # 90 days
  renewBefore: 360h # renew 15 days before expiry
  dnsNames:
    - inference.internal.example.com
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer
Reference the TLS secret in the nginx Deployment volume mounts and point ssl_certificate / ssl_certificate_key to the projected paths.
Expected Behaviour
| Signal | Without Hardening | With Hardening |
|---|---|---|
| Unauthenticated request to /v1/chat/completions | 200 OK: full model response returned | 401 Unauthorized: no model output leaked |
| Rate limit breach (> 20 req/min per key) | Request accepted, GPU saturated | 429 Too Many Requests, request dropped at nginx |
| KV cache cross-contamination (shared prefix, two tenants) | Tenant B’s completion influenced by Tenant A’s cached context | No shared cache blocks: with prefix caching off, every request has independent KV state |
| Oversized request (max_tokens > server cap) | Full token budget consumed, GPU time stolen | Request rejected: --max-model-len cap enforced at vLLM layer |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Prefix caching disabled | Eliminates KV cache cross-contamination between tenants | 20–40% latency increase for requests sharing a common prefix (e.g., system prompt) | Accept the trade-off in multi-tenant deployments; re-enable only in single-tenant instances with no isolation requirement |
| Per-tenant vLLM instances | Hard GPU and memory isolation; independent scaling and key rotation per tenant | GPU cost multiplied by tenant count; model weights loaded multiple times into GPU memory | Use MIG partitioning to share physical GPU hardware while maintaining software isolation |
| MIG partitioning | Hardware-level isolation of compute and HBM per tenant; prevents GPU memory side-channel attacks | Reduces maximum throughput per slice vs. full GPU; MIG reconfiguration requires GPU reset (brief outage) | Pre-configure MIG profiles at node provisioning time; treat MIG configuration as immutable infrastructure |
| Strict --max-model-len | Prevents cost-exhaustion via oversized requests; bounds worst-case GPU time per request | Users cannot submit long documents or multi-turn histories exceeding the cap | Set the cap at the 95th percentile of legitimate request lengths; provide a documented limit in the API consumer guide |
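
The 95th-percentile cap suggested in the table can be computed from observed request lengths. The sketch below uses the nearest-rank method; `percentile_cap` is a hypothetical helper, and the input is assumed to be a list of total token counts (prompt plus completion) taken from your own logs.

```python
import math


def percentile_cap(token_counts: list[int], percentile: float = 95.0) -> int:
    """Return the nearest-rank percentile of observed request lengths."""
    if not token_counts:
        raise ValueError("need at least one observation")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(token_counts)
    # Nearest rank: the smallest value with at least `percentile` % of data at or below it.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative data only: request lengths of 1..100 tokens.
example_cap = percentile_cap(list(range(1, 101)))
print(example_cap)
```

For this illustrative data the cap comes out at 95 tokens. In practice the result would be rounded up to a convenient value and passed to --max-model-len.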
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| API key rotation causes service downtime | Clients receive 401 immediately after key rotation; completions drop to zero | Spike in HTTP 401 rate in SIEM; alerts on completion request count dropping > 50% | Pre-rotate: update Secret, roll out new pod with new key, wait for readiness probe, then invalidate old key; use a 5-minute overlap window |
| nginx rate-limiter blocks legitimate traffic | Legitimate users receive 429 with no clear cause; support tickets spike | High 429 rate in nginx access logs for known-good clients; compare against historical request rate | Temporarily increase burst parameter; investigate whether a CI/CD pipeline or batch job is the source of elevated request rate |
| MIG reconfiguration requires GPU reset | All inference pods on the node terminate; in-flight requests fail | Node condition changes in kubectl get nodes; GPU operator logs show reset event | Cordon the node before reconfiguration, drain inference pods, perform MIG change, re-label and uncordon; use PodDisruptionBudget to shift load |
| LoRA adapter path not found at startup | vLLM exits with FileNotFoundError or ValueError on launch; pod enters CrashLoopBackOff | CrashLoopBackOff visible in kubectl get pods; adapter path error in pod logs | Verify that the adapter volume is mounted correctly and files are present; check ReadOnly volume mount vs. dynamic write assumption |
When to Consider a Managed Alternative
Self-hosting vLLM provides maximum control but carries significant operational overhead. Consider a managed inference platform when:
- Compliance requires vendor SLAs and SOC 2 / ISO 27001 attestation — AWS Bedrock, Google Vertex AI Model Garden, and Azure AI Studio provide compliance-ready infrastructure with documented shared-responsibility models, which self-hosted vLLM cannot match without substantial investment.
- Your team lacks GPU infrastructure expertise — MIG configuration, CUDA version compatibility, NCCL tuning, and GPU driver management are deep specialisations. AWS Bedrock and Vertex AI abstract all GPU management.
- Fleet size makes per-node hardening impractical — at tens or hundreds of inference nodes, keeping CUDA drivers patched, rotating API keys, and maintaining NetworkPolicy across namespaces becomes a full-time operation. Managed platforms handle this automatically.
- Model serving cost predictability is required — managed platforms offer on-demand and provisioned throughput pricing with cost caps; self-hosted GPU fleets require active FinOps discipline to avoid runaway costs.
- You need multi-region redundancy without building it yourself — Vertex AI and Azure AI Studio provide regional failover; replicating this with self-hosted vLLM requires additional load balancing, replication, and health-check infrastructure.