vLLM Production Security Hardening

Problem

vLLM has become the dominant open-source framework for serving large language models at scale. Its OpenAI-compatible API surface, continuous batching, and PagedAttention memory management make it the default choice for teams self-hosting models like Llama 3, Mistral, Qwen, and Code Llama. But the convenience that makes vLLM easy to stand up is also what makes it dangerous in production: a single command — python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-70b-instruct — starts a fully functional inference server with no authentication, no rate limiting, and the API bound to 0.0.0.0:8000 by default.

That default posture exposes an extremely powerful compute resource to anyone with network access. The /v1/completions and /v1/chat/completions endpoints accept arbitrary prompts and return model outputs immediately, with no identity check. In cloud environments where security groups or Kubernetes ingress rules are misconfigured — which is common during initial deployment — this means any host on the internet can query the model. Even inside a private network, any workload in the cluster can reach the inference server unless explicit NetworkPolicy rules exist.

The attack surface is larger than simple unauthorised API access. A motivated adversary with repeated access to completion outputs can reconstruct model behavior through model extraction: by systematically sampling the model with crafted inputs and recording outputs, they can train a surrogate model that approximates the target’s behavior, effectively stealing the intellectual property embodied in a fine-tuned or RLHF-trained checkpoint. This attack is entirely passive from the network perspective and leaves no anomalous signature unless request volume and prompt patterns are monitored.

Cost-exhaustion attacks are an immediate financial threat. vLLM will happily accept requests with max_tokens: 4096 and n: 10 (ten completions per request), consuming substantial GPU compute per call. Without per-user token budgets or global rate limits, a single malicious client can saturate an H100 running at $3–5/hour, running up thousands of dollars in cloud GPU costs before anyone notices. vLLM does not natively enforce per-client token quotas.

Multi-tenant deployments introduce a subtler threat: KV cache cross-contamination. PagedAttention’s prefix caching feature reuses cached key-value states across requests that share a common prefix — for example, a shared system prompt. If two tenants share a vLLM instance and their requests have overlapping prefixes, tenant A’s cached context can influence tenant B’s completions, leaking semantic content across isolation boundaries. This is not a theoretical concern: any multi-tenant deployment with prefix caching enabled and a shared system prompt prefix is potentially vulnerable.

vLLM’s LoRA adapter loading introduces a path traversal surface. The --lora-modules flag accepts arbitrary filesystem paths, and if dynamic adapter loading is enabled at runtime (via the /v1/load_adapter endpoint in some configurations), a caller who can reach that endpoint can attempt to load adapters from paths outside the intended directory. Model files are not arbitrary code in the traditional sense, but maliciously crafted safetensors or GGUF files have demonstrated deserialization vulnerabilities in the past, and PyTorch’s legacy pickle-based format remains a code execution vector.

The tool-call parsing pipeline is an additional injection surface. vLLM parses structured JSON from model outputs to dispatch tool calls. A model prompted to output malformed or adversarially-crafted tool-call JSON can trigger parsing errors or, in sufficiently complex serving stacks, influence downstream tool dispatch logic. This is particularly relevant in agentic deployments where tool-call outputs are automatically executed.

Multi-GPU tensor parallelism expands the network attack surface across nodes. When vLLM uses --tensor-parallel-size > 1, it opens NCCL communication channels between GPU workers. These channels are not authenticated by default, and in Kubernetes environments they may be reachable from other pods if NetworkPolicy is not correctly configured.

Target systems: vLLM 0.4.x–0.6.x, CUDA 12.x, NVIDIA H100/A100, Kubernetes with the NVIDIA GPU Operator.

Threat Model

External attacker with network access executes model extraction by sending thousands of systematically varied completions requests. The model’s fine-tuned behavior, domain knowledge, and RLHF alignment are valuable IP. Without authentication the attacker needs only a routable path to port 8000.
Cost-exhaustion attacker sends high-max_tokens, high-n requests continuously. The goal is financial: driving up GPU-hour costs or degrading service availability for legitimate users. Without rate limiting, a single client can consume 100% of GPU capacity indefinitely.
Multi-tenant context injector crafts prompts with prefixes that match another tenant’s cached prefix, attempting to read or influence that tenant’s session context via shared KV cache state. In shared vLLM deployments with a common system prompt, this can leak information about what other users are discussing or subtly shift model behavior for subsequent requests in the shared cache.
Insider with LoRA adapter upload access uploads a malicious adapter file to the model directory, either by exploiting the dynamic adapter loading API or via direct filesystem access. The malicious adapter contains weights crafted to cause the model to behave differently (producing harmful outputs, leaking system prompts, or bypassing content filters) or, in pickle-format models, to execute arbitrary Python code during deserialization.

Without hardening, any of these adversaries can operate undetected: no authentication means no identity to audit, no rate limiting means no anomaly signal on request volume, and no structured logging means no forensic trail after an incident. With full hardening applied, each adversary faces authentication gates, request-budget enforcement, cache isolation, filesystem controls, and a complete audit trail that records every request’s identity, prompt length, token count, and outcome.

Configuration / Implementation

API Key Authentication

vLLM 0.4+ includes a built-in --api-key flag that enables Bearer token validation on all API endpoints:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --api-key "${VLLM_API_KEY}" \
  --host 127.0.0.1 \
  --port 8000 \
  --max-model-len 8192 \
  --disable-log-requests false

Binding to 127.0.0.1 instead of 0.0.0.0 ensures vLLM is only reachable via the reverse proxy or sidecar — not directly from the network. Store the key in a Kubernetes Secret and inject it as an environment variable:

apiVersion: v1
kind: Secret
metadata:
  name: vllm-api-key
  namespace: inference
type: Opaque
stringData:
  api-key: "sk-prod-replace-with-256bit-random-value"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      serviceAccountName: vllm-sa
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          env:
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key
                  key: api-key
          args:
            - "--model"
            - "meta-llama/Llama-3-70b-instruct"
            - "--api-key"
            - "$(VLLM_API_KEY)"
            - "--host"
            - "127.0.0.1"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "8192"
            - "--disable-prefix-caching"
          resources:
            limits:
              nvidia.com/gpu: "1"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            allowPrivilegeEscalation: false

For key rotation, deploy a new Secret value and perform a rolling restart: kubectl rollout restart deployment/vllm-server -n inference. Automate rotation with External Secrets Operator or Vault Agent Injector.

Rate Limiting

Place nginx in front of vLLM as a TLS-terminating reverse proxy with rate limiting:

# /etc/nginx/nginx.conf

worker_processes auto;

events {
    worker_connections 4096;
}

http {
    # Rate limit zones — keyed on $http_authorization to rate-limit per API key
    limit_req_zone $http_authorization zone=per_key:10m rate=20r/m;
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=60r/m;

    log_format vllm_access '$time_iso8601 $remote_addr "$http_authorization" '
                            '$request_method $request_uri $status '
                            '$body_bytes_sent $request_time '
                            '"$http_x_request_id"';

    upstream vllm_backend {
        server 127.0.0.1:8000;
        keepalive 64;
    }

    server {
        listen 443 ssl http2;
        server_name inference.internal.example.com;

        ssl_certificate     /etc/ssl/certs/inference.crt;
        ssl_certificate_key /etc/ssl/private/inference.key;
        ssl_protocols       TLSv1.3;
        ssl_ciphers         TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256;

        access_log /var/log/nginx/vllm_access.log vllm_access;

        # Apply rate limits — burst allows short spikes, nodelay enforces immediately
        limit_req zone=per_key burst=5 nodelay;
        limit_req zone=per_ip burst=10 nodelay;

        location /v1/ {
            proxy_pass         http://vllm_backend;
            proxy_set_header   Host $host;
            proxy_set_header   X-Real-IP $remote_addr;
            proxy_set_header   X-Request-ID $request_id;
            proxy_read_timeout 300s;
            proxy_send_timeout 60s;

            # Strip sensitive headers from upstream response
            proxy_hide_header  X-Powered-By;

            # Enforce request body size (prevent oversized prompt bombs)
            client_max_body_size 1m;
        }

        # Block all non-API paths
        location / {
            return 404;
        }
    }

    # Redirect HTTP to HTTPS
    server {
        listen 80;
        return 301 https://$host$request_uri;
    }
}

Use --max-model-len to hard-cap total context (prompt + completion) at the server level. This prevents a caller from setting max_tokens beyond what the flag allows, regardless of what they send in the request body.

For Kubernetes-native rate limiting, deploy Envoy as a sidecar and configure the local rate limit filter:

apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
  namespace: inference
data:
  envoy.yaml: |
    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address:
              address: 0.0.0.0
              port_value: 9000
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    http_filters:
                      - name: envoy.filters.http.local_ratelimit
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                          stat_prefix: local_rate_limiter
                          token_bucket:
                            max_tokens: 100
                            tokens_per_fill: 20
                            fill_interval: 60s
                          filter_enabled:
                            runtime_key: local_rate_limit_enabled
                            default_value:
                              numerator: 100
                              denominator: HUNDRED
                          filter_enforced:
                            runtime_key: local_rate_limit_enforced
                            default_value:
                              numerator: 100
                              denominator: HUNDRED
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: vllm_backend
                          domains: ["*"]
                          routes:
                            - match:
                                prefix: "/v1/"
                              route:
                                cluster: vllm_cluster
                                timeout: 300s
      clusters:
        - name: vllm_cluster
          type: STATIC
          load_assignment:
            cluster_name: vllm_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: 127.0.0.1
                          port_value: 8000

Request Isolation in Multi-Tenant Deployments

Disable prefix caching when multiple tenants share a vLLM instance:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --disable-prefix-caching \
  --api-key "${VLLM_API_KEY}"

The --disable-prefix-caching flag prevents vLLM from reusing KV cache blocks across requests, eliminating the cross-tenant cache contamination risk. The latency penalty is real (see Trade-offs), but it is the only correct option when tenants must be isolated on a shared instance.

For stronger isolation, deploy a separate vLLM instance per tenant namespace. Use namespace-scoped Secrets and NetworkPolicy so tenant A’s vLLM pod is completely unreachable from tenant B’s namespace:

# One deployment per tenant namespace — repeat per tenant
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: tenant-alpha   # <-- per-tenant namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
      tenant: alpha
  template:
    metadata:
      labels:
        app: vllm-server
        tenant: alpha
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args:
            - "--model"
            - "meta-llama/Llama-3-70b-instruct"
            - "--api-key"
            - "$(VLLM_API_KEY)"
          resources:
            limits:
              nvidia.com/gpu: "1"

LoRA Adapter Security

Restrict adapter loading to a pre-approved directory with tight filesystem permissions. Never enable dynamic adapter loading endpoints in production:

# On the host or in the init container, set permissions on the model directory
chmod 750 /models/adapters
chown vllm-service:vllm-service /models/adapters

# Launch with explicit allowlist — no dynamic loading
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --enable-lora \
  --lora-modules adapter-v1=/models/adapters/adapter-v1 \
                 adapter-v2=/models/adapters/adapter-v2 \
  --max-lora-rank 64

Do not pass --enable-prefix-caching alongside LoRA when tenants select different adapters — cached KV blocks from adapter-v1 requests must not be served to adapter-v2 requests. Use Kubernetes ReadOnlyRootFilesystem and a projected volume for adapter files so the running container cannot write new adapter files to disk:

securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop: ["ALL"]
volumeMounts:
  - name: adapters
    mountPath: /models/adapters
    readOnly: true
volumes:
  - name: adapters
    configMap:
      name: lora-adapter-registry  # or a PVC pre-populated by CI/CD

Network Isolation

Kubernetes NetworkPolicy restricting ingress to vLLM pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-ingress-only
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow only from the nginx/Envoy proxy pod
    - from:
        - podSelector:
            matchLabels:
              app: inference-proxy
      ports:
        - protocol: TCP
          port: 8000
  egress:
    # Allow DNS
    - ports:
        - protocol: UDP
          port: 53
    # Allow HuggingFace Hub for model download (restrict in air-gapped environments)
    - ports:
        - protocol: TCP
          port: 443

For tensor-parallel deployments (--tensor-parallel-size > 1), NCCL worker-to-worker traffic requires an additional egress rule scoped to the vLLM pod’s own namespace, with port ranges matching the NCCL ephemeral port range (typically 40000–50000):

  egress:
    - to:
        - podSelector:
            matchLabels:
              app: vllm-server
      ports:
        - protocol: TCP
          port: 40000
          endPort: 50000

CUDA and GPU Isolation

Use NVIDIA Multi-Instance GPU (MIG) partitioning to give each tenant a hardware-isolated GPU slice. On an H100 80GB, a 1g.10gb MIG instance provides 10 GB of HBM and one compute slice:

# Enable MIG mode on GPU 0 (requires root, GPU reset)
nvidia-smi -i 0 -mig 1

# Create MIG instances — example: 7 x 1g.10gb slices on H100
nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# List available MIG devices
nvidia-smi -L

In Kubernetes, configure the NVIDIA device plugin to expose MIG instances as discrete resources:

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      mig:
        strategy: single

Reference a specific MIG profile in the pod resource request:

resources:
  limits:
    nvidia.com/mig-1g.10gb: "1"

Scope CUDA_VISIBLE_DEVICES in non-MIG environments to prevent a process from accessing GPUs outside its allocation:

env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"   # or injected by device plugin

Audit Logging

Enable request-ID logging in vLLM and correlate with nginx access logs:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-instruct \
  --disable-log-requests false \
  --enable-request-id-logging

vLLM emits structured log lines for each request including prompt token count and completion token count. Augment this with nginx access logs (configured above) that capture the API key hash, request-ID, and HTTP status. Ship both to a central SIEM (Elasticsearch, Splunk, or Loki) for correlation.

A minimal Fluent Bit configuration to ship logs from the inference namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: inference
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name             tail
        Path             /var/log/pods/inference_*/*.log
        Parser           docker
        Tag              inference.*
        Refresh_Interval 10

    [FILTER]
        Name    grep
        Match   inference.*
        Regex   log (vllm|nginx)

    [OUTPUT]
        Name         es
        Match        inference.*
        Host         elasticsearch.logging.svc.cluster.local
        Port         9200
        Index        vllm-audit
        tls          On
        tls.verify   On

TLS Termination

vLLM has no native TLS support. All TLS termination must happen at the nginx or Envoy layer. Use cert-manager to provision and rotate certificates automatically:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: inference-tls
  namespace: inference
spec:
  secretName: inference-tls-secret
  duration: 2160h    # 90 days
  renewBefore: 360h  # renew 15 days before expiry
  dnsNames:
    - inference.internal.example.com
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer

Reference the TLS secret in the nginx Deployment volume mounts and point ssl_certificate / ssl_certificate_key to the projected paths.

Expected Behaviour

Signal	Without Hardening	With Hardening
Unauthenticated request to `/v1/chat/completions`	200 OK — full model response returned	401 Unauthorized — no model output leaked
Rate limit breach (> 20 req/min per key)	Request accepted, GPU saturated	429 Too Many Requests, request dropped at nginx
KV cache cross-contamination (shared prefix, two tenants)	Tenant B’s completion influenced by Tenant A’s cached context	No shared cache blocks — `--disable-prefix-caching` ensures independent KV state per request
Oversized request (`max_tokens` > server cap)	Full token budget consumed, GPU time stolen	Request rejected — `--max-model-len` cap enforced at vLLM layer

Trade-offs

Aspect	Benefit	Cost	Mitigation
Prefix caching disabled	Eliminates KV cache cross-contamination between tenants	20–40% latency increase for requests sharing a common prefix (e.g., system prompt)	Accept the trade-off in multi-tenant deployments; re-enable only in single-tenant instances with no isolation requirement
Per-tenant vLLM instances	Hard GPU and memory isolation; independent scaling and key rotation per tenant	GPU cost multiplied by tenant count; model weights loaded multiple times into GPU memory	Use MIG partitioning to share physical GPU hardware while maintaining software isolation
MIG partitioning	Hardware-level isolation of compute and HBM per tenant; prevents GPU memory side-channel attacks	Reduces maximum throughput per slice vs. full GPU; MIG reconfiguration requires GPU reset (brief outage)	Pre-configure MIG profiles at node provisioning time; treat MIG configuration as immutable infrastructure
Strict `--max-model-len`	Prevents cost-exhaustion via oversized requests; bounds worst-case GPU time per request	Users cannot submit long documents or multi-turn histories exceeding the cap	Set the cap at the 95th percentile of legitimate request lengths; provide a documented limit in the API consumer guide

Failure Modes

Failure	Symptom	Detection	Recovery
API key rotation causes service downtime	Clients receive 401 immediately after key rotation; completions drop to zero	Spike in HTTP 401 rate in SIEM; alerts on completion request count dropping > 50%	Pre-rotate: update Secret, roll out new pod with new key, wait for readiness probe, then invalidate old key; use a 5-minute overlap window
nginx rate-limiter blocks legitimate traffic	Legitimate users receive 429 with no clear cause; support tickets spike	High 429 rate in nginx access logs for known-good clients; compare against historical request rate	Temporarily increase `burst` parameter; investigate whether a CI/CD pipeline or batch job is the source of elevated request rate
MIG reconfiguration requires GPU reset	All inference pods on the node terminate; in-flight requests fail	Node condition changes in `kubectl get nodes`; GPU operator logs show reset event	Cordon the node before reconfiguration, drain inference pods, perform MIG change, re-label and uncordon; use PodDisruptionBudget to shift load
LoRA adapter path not found at startup	vLLM exits with `FileNotFoundError` or `ValueError` on launch; pod enters CrashLoopBackOff	CrashLoopBackOff visible in `kubectl get pods`; adapter path error in pod logs	Verify that the adapter volume is mounted correctly and files are present; check ReadOnly volume mount vs. dynamic write assumption

When to Consider a Managed Alternative

Self-hosting vLLM provides maximum control but carries significant operational overhead. Consider a managed inference platform when:

Compliance requires vendor SLAs and SOC 2 / ISO 27001 attestation — AWS Bedrock, Google Vertex AI Model Garden, and Azure AI Studio provide compliance-ready infrastructure with documented shared-responsibility models, which self-hosted vLLM cannot match without substantial investment.
Your team lacks GPU infrastructure expertise — MIG configuration, CUDA version compatibility, NCCL tuning, and GPU driver management are deep specialisations. AWS Bedrock and Vertex AI abstract all GPU management.
Fleet size makes per-node hardening impractical — at tens or hundreds of inference nodes, keeping CUDA drivers patched, rotating API keys, and maintaining NetworkPolicy across namespaces becomes a full-time operation. Managed platforms handle this automatically.
Model serving cost predictability is required — managed platforms offer on-demand and provisioned throughput pricing with cost caps; self-hosted GPU fleets require active FinOps discipline to avoid runaway costs.
You need multi-region redundancy without building it yourself — Vertex AI and Azure AI Studio provide regional failover; replicating this with self-hosted vLLM requires additional load balancing, replication, and health-check infrastructure.