Hardening NGINX as a Reverse Proxy for AI Inference Endpoints
Problem
NGINX is a common first choice for reverse-proxying AI inference endpoints — vLLM, Ollama, llama.cpp, TGI (Text Generation Inference), and proprietary APIs. It handles authentication offloading, rate limiting, TLS termination, and load balancing across multiple model instances. The configuration is often treated as infrastructure plumbing: set up once, not revisited.
The security implications of this positioning are underappreciated:
An NGINX compromise has a larger blast radius at an inference proxy. A compromised NGINX instance in front of an inference endpoint can strip authentication headers, log and exfiltrate all inference requests and responses (which may contain proprietary prompts, sensitive user data, or model outputs used in production), redirect traffic to rogue model instances, and modify responses to inject malicious content. The business impact is significantly higher than when NGINX serves static content.
Inference endpoints have unique request characteristics. Inference API requests are large (often 100KB–2MB including context), long-running (5–120 seconds for streaming responses), and billed per token. These characteristics make inference proxies uniquely vulnerable to:
- Token-stuffing attacks: sending maximum-context requests to exhaust compute budget
- Streaming timeout abuse: holding connections open with slow streaming to exhaust NGINX worker connections
- Cost amplification: automated requests that generate expensive completions
- Prompt injection via NGINX request manipulation: a compromised NGINX that rewrites the messages field in a proxied JSON body before forwarding to the model
NGINX CVE patching is often delayed in AI infrastructure. AI teams that deploy inference stacks focus on model availability, latency, and GPU utilisation. NGINX is an afterthought: the version in the Docker base image may not be updated even as critical CVEs are published. The ngx_http_mp4_module CVE (CVE-2024-7347) does not affect a proxy that never loads that module, but a QUIC module CVE on an inference endpoint that serves HTTP/3 is directly relevant.
Inference endpoints are high-value targets for data exfiltration. The request/response stream contains proprietary system prompts, user queries, model completions, and potentially PII. An attacker who achieves SSRF via NGINX or compromises an NGINX worker process gains access to this stream.
Target systems: NGINX deployed in front of vLLM, Ollama, TGI, llama.cpp, or OpenAI-compatible inference APIs; organisations self-hosting LLMs with NGINX as the API gateway; any deployment where NGINX proxies to AI inference backends.
Threat Model
Adversary 1 — NGINX CVE exploitation at inference proxy. A memory corruption vulnerability in NGINX (CVE-2024-7347, QUIC module CVEs) is exploited on the inference proxy. The attacker’s RCE in the NGINX worker gives them access to all inference request/response traffic, API keys passed in headers, and the ability to modify proxied requests before they reach the model.
Adversary 2 — Cost amplification via unbounded requests. An attacker discovers the inference endpoint and sends automated requests with maximum context lengths. Without rate limiting and token budget enforcement, the attacker generates GPU compute costs orders of magnitude higher than intended usage patterns.
Adversary 3 — System prompt exfiltration via SSRF. An SSRF vulnerability in NGINX (misconfigured proxy_pass) allows an attacker to reach the inference backend directly, bypassing authentication. The attacker queries the model with a prompt designed to echo back the system prompt, extracting proprietary instructions.
Adversary 4 — Response manipulation for downstream injection. A compromised NGINX rewrites model responses in transit, injecting malicious content into AI-generated outputs before they reach end users or downstream applications. If downstream applications parse and act on model output, injected content can trigger unintended actions.
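To make the "orders of magnitude" claim in Adversary 2 concrete, the sketch below compares hourly token volume for a normal client and an abusive one. All request rates and token counts are hypothetical placeholders chosen for illustration, not measurements.

```python
def hourly_tokens(requests_per_minute: int, tokens_per_request: int) -> int:
    """Tokens processed per hour at a steady request rate."""
    return requests_per_minute * 60 * tokens_per_request


def amplification_factor(attack_rpm: int, attack_tokens: int,
                         normal_rpm: int, normal_tokens: int) -> float:
    """Ratio of attacker token volume to normal token volume."""
    attack = hourly_tokens(attack_rpm, attack_tokens)
    normal = hourly_tokens(normal_rpm, normal_tokens)
    return attack / normal


# Hypothetical traffic profiles: a normal client sends 10 requests/minute of
# about 2,000 tokens; an attacker sends 600 requests/minute at 100,000 tokens.
factor = amplification_factor(600, 100_000, 10, 2_000)
print(f"attacker consumes {factor:.0f}x the normal token volume")
```

With these assumed numbers a single unthrottled client consumes three orders of magnitude more compute than a normal one, which is the gap the rate and size limits in Step 2 are meant to close.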
Configuration / Implementation
Step 1 — Keep NGINX patched and disable unused modules
# Verify current NGINX version against CVE list
nginx -V 2>&1 | head -5
# For Docker-based inference deployments, check the base image NGINX version
docker run --rm your-inference-proxy nginx -v
# Disable modules not needed for inference proxying
# In nginx.conf or a load_module include:
# DO NOT load:
# ngx_http_mp4_module — irrelevant to inference, carries CVE-2024-7347
# ngx_http_image_filter_module — irrelevant to inference
# ngx_http_flv_module — irrelevant to inference
# Build or pull NGINX without these modules for inference-specific deployments
nginx -V 2>&1 | grep -E "mp4|image_filter|flv"
# If these appear and you don't use them, rebuild without them
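The version check can also be automated, for example as a CI gate on the proxy image. The following is a minimal sketch: the function names are hypothetical, and the minimum version passed in is a placeholder that you would replace with the first release containing the fixes you need.

```python
import re
import subprocess


def parse_nginx_version(banner: str) -> tuple:
    """Extract (major, minor, patch) from `nginx -v` style output."""
    match = re.search(r"nginx/(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no NGINX version found in: {banner!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(banner: str, minimum: tuple) -> bool:
    """True when the version in the banner is >= the required minimum."""
    return parse_nginx_version(banner) >= minimum


def read_installed_banner() -> str:
    """Run `nginx -v`; NGINX writes its version banner to stderr."""
    result = subprocess.run(["nginx", "-v"], capture_output=True, text=True)
    return result.stderr.strip()
```

A pipeline step could call `is_at_least(read_installed_banner(), minimum)` inside the image and fail the build when it returns `False`.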
Step 2 — Configure rate limiting for inference endpoints
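# /etc/nginx/nginx.conf (http block shown here). Files under conf.d/ are typically
# included inside http {} already; if you use one, drop the outer http { } wrapper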
http {
# Rate limiting zones
# Per-API-key rate limiting (requires extracting key from header)
# Limit: 60 requests per minute per API key
limit_req_zone $http_authorization zone=inference_per_key:10m rate=1r/s;
# Per-IP rate limiting (catches unauthenticated probing)
limit_req_zone $binary_remote_addr zone=inference_per_ip:10m rate=10r/m;
# Stricter per-IP limit for streaming requests (a coarse stand-in for a tokens-per-second budget)
limit_req_zone $binary_remote_addr zone=inference_streaming:10m rate=5r/m;
upstream vllm_backend {
server 127.0.0.1:8000;
# If multiple model instances:
# server 127.0.0.1:8001 weight=1;
# server 127.0.0.1:8002 weight=1;
keepalive 16;
keepalive_requests 1000;
keepalive_timeout 60s;
}
server {
listen 443 ssl;
server_name inference.example.com;
ssl_certificate /etc/ssl/certs/inference.crt;
ssl_certificate_key /etc/ssl/private/inference.key;
# Enforce minimum TLS version — do not accept TLS 1.0/1.1
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384';
# Apply rate limiting with burst allowance for legitimate users
limit_req zone=inference_per_key burst=10 nodelay;
limit_req zone=inference_per_ip burst=5 nodelay;
limit_req_status 429;
# Hard limit on request size (prevents prompt-stuffing attacks)
# Adjust based on your model's context window — 4MB covers 100K token context
client_max_body_size 4m;
# Timeout for inference (model inference can take 60-120s)
proxy_read_timeout 180s;
proxy_connect_timeout 10s;
proxy_send_timeout 30s;
location /v1/ {
# Require authentication header — deny unauthenticated requests at NGINX
if ($http_authorization = "") {
return 401 '{"error": "Authorization header required"}';
}
proxy_pass http://vllm_backend;
# Pass client identity to backend for logging
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
# DO NOT strip the Authorization header — the backend needs it
# for API key validation
proxy_set_header Authorization $http_authorization;
# Support streaming responses (SSE/chunked transfer)
proxy_buffering off;
proxy_cache off;
# For Server-Sent Events (streaming completions)
proxy_set_header Connection '';
proxy_http_version 1.1;
}
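# Streaming endpoint. Longest-prefix matching routes requests here instead of
# "location /v1/", and sibling locations share nothing, so the auth check and
# proxy headers are repeated
location /v1/chat/completions {
if ($http_authorization = "") {
return 401 '{"error": "Authorization header required"}';
}
# limit_req at this level replaces the server-level limits, so restate them
limit_req zone=inference_per_key burst=10 nodelay;
limit_req zone=inference_per_ip burst=5 nodelay;
limit_req zone=inference_streaming burst=3 nodelay;
client_max_body_size 4m;
proxy_pass http://vllm_backend;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
proxy_buffering off;
proxy_cache off;
proxy_set_header Connection '';
proxy_http_version 1.1;
}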
# Block direct access to model management endpoints
location ~ ^/v1/(models/load|shutdown|reload) {
return 403 '{"error": "Administrative endpoints not available"}';
}
# Health check endpoint: no auth required; the server-level limit_req rules still apply here
location /health {
proxy_pass http://vllm_backend/health;
}
}
}
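The interaction between `rate`, `burst`, and `nodelay` is easier to reason about with a small simulation. The sketch below is a simplified model of the leaky-bucket accounting that `limit_req` is documented to use; it is an approximation for tuning intuition, not NGINX's actual implementation, and the function name is hypothetical.

```python
def simulate_limit_req(arrivals, rate_per_s, burst):
    """Return one accept/reject decision per arrival time (seconds).

    Simplified leaky bucket with nodelay semantics: each accepted request
    adds 1 to the excess counter, the counter drains at rate_per_s, and a
    request is rejected when the counter would exceed burst.
    """
    excess = 0.0
    last = None
    decisions = []
    for t in arrivals:
        if last is None:
            # The first request for a key is always accepted.
            decisions.append(True)
            last = t
            continue
        candidate = max(0.0, excess - rate_per_s * (t - last) + 1.0)
        if candidate > burst:
            decisions.append(False)  # would be answered with 429
        else:
            excess = candidate
            last = t
            decisions.append(True)
    return decisions


# 15 simultaneous requests against rate=1r/s, burst=10 (the per-key zone above)
accepted = sum(simulate_limit_req([0.0] * 15, rate_per_s=1.0, burst=10))
print(f"{accepted} of 15 simultaneous requests accepted")
```

Under this model, a client that stays at or below the configured rate is never rejected, while a sudden spike is admitted only up to the burst allowance plus the initial request.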
Step 3 — Log inference requests for audit and anomaly detection
# NGINX has no substring syntax for variables, so a map extracts the first
# 8 characters of the Bearer token for correlation only
map $http_authorization $api_key_prefix {
default "";
"~^Bearer\s+(?<key_prefix>.{8})" $key_prefix;
}
# Inference-specific log format: captures request metadata without body content
log_format inference_audit escape=json
'{'
'"timestamp":"$time_iso8601",'
'"client_ip":"$remote_addr",'
'"method":"$request_method",'
'"endpoint":"$request_uri",'
'"status":$status,'
'"request_bytes":$request_length,'
'"response_bytes":$body_bytes_sent,'
'"duration_s":$request_time,'
'"api_key_prefix":"$api_key_prefix",'
'"upstream":"$upstream_addr",'
'"upstream_status":"$upstream_status",'
'"upstream_duration_s":"$upstream_response_time"'
'}';
access_log /var/log/nginx/inference-audit.json inference_audit;
Important: Do not log request or response bodies at the NGINX layer. Inference request bodies contain user prompts that may include PII; response bodies contain model outputs. NGINX-level logging should capture metadata (endpoint, size, duration, status, API key prefix) but not content.
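Because the audit log is one JSON object per line, basic anomaly checks need very little code. The sketch below counts requests per `api_key_prefix` and lists the prefixes that exceed a threshold. The function names and the threshold are hypothetical; the field name comes from the log format above.

```python
import json
from collections import Counter


def requests_per_key(log_lines):
    """Count audit-log entries per api_key_prefix, skipping malformed lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupted lines
        if isinstance(entry, dict):
            counts[entry.get("api_key_prefix", "")] += 1
    return counts


def noisy_keys(log_lines, max_requests):
    """Key prefixes with more than max_requests entries, sorted by name."""
    counts = requests_per_key(log_lines)
    return sorted(key for key, n in counts.items() if n > max_requests)
```

In practice this would run over a time window of `/var/log/nginx/inference-audit.json`, for example `noisy_keys(open(path), max_requests=500)`, with the threshold set from observed normal usage.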
Step 4 — Add response header security and strip internal headers
# Add security headers to inference API responses
add_header X-Content-Type-Options nosniff always;
add_header X-Frame-Options DENY always;
# Strip internal NGINX and backend headers from responses
# Prevent information disclosure about backend infrastructure
proxy_hide_header X-Powered-By;
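# The upstream Server header is not passed to clients by default; server_tokens off
# removes the version from NGINX's own Server header
server_tokens off;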
proxy_hide_header X-vLLM-Version;
# Add identifying information about the proxy for debugging
# (without disclosing version)
add_header X-Request-ID $request_id always;
# CORS configuration — restrict to known origins for API endpoints
add_header Access-Control-Allow-Origin "https://app.example.com" always;
add_header Access-Control-Allow-Methods "POST, GET, OPTIONS" always;
add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
# Handle OPTIONS preflight without reaching the backend
if ($request_method = OPTIONS) {
return 204;
}
Step 5 — Apply systemd hardening to the NGINX inference proxy
# /etc/systemd/system/nginx.service.d/inference-hardening.conf
# Additional hardening for NGINX serving as an inference proxy
[Service]
# Prevent NGINX workers from accessing model weights or GPU device files
# directly — restrict to network and serving operations
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
# Workers only need to read config and write logs
ReadOnlyPaths=/etc/nginx /etc/ssl/certs
ReadWritePaths=/var/log/nginx /var/cache/nginx /run/nginx
# Inference proxy doesn't need module loading syscalls
SystemCallFilter=~finit_module init_module delete_module
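# Prevent a compromised worker from tracing or reading the memory of other processes
SystemCallFilter=~ptrace process_vm_readv process_vm_writev
# Drop obsolete syscalls (this alone does not stop a worker from executing a shell)
SystemCallFilter=~@obsolete
# Blocks writable+executable memory; test first if PCRE JIT, njs, or Lua modules are in use
MemoryDenyWriteExecute=yes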
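Drop-in files drift over time, so it can help to assert in CI that the expected directives are still present. This is a minimal sketch with a hypothetical function name; it only checks for literal `Key=Value` lines and does not evaluate how systemd merges multiple unit files.

```python
def missing_directives(unit_text: str, required: dict) -> list:
    """Return required Key=Value pairs that do not appear in the unit text."""
    present = set()
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Service]
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            present.add((key.strip(), value.strip()))
    return sorted(f"{k}={v}" for k, v in required.items()
                  if (k, v) not in present)


required = {
    "NoNewPrivileges": "yes",
    "ProtectSystem": "strict",
    "PrivateDevices": "yes",
}
sample = "[Service]\nNoNewPrivileges=yes\n# comment\nProtectSystem=strict\n"
print(missing_directives(sample, required))
```

On a running host, `systemd-analyze security nginx.service` gives a complementary view of the effective sandboxing after all drop-ins are applied.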
Step 6 — Monitor for cost anomalies and abuse patterns
# Prometheus alerting rules for inference proxy abuse detection
# Metric and label names below depend on the exporter in use; adjust them to match yours
groups:
  - name: nginx_inference_abuse
    rules:
      # Alert on unusual request volume on the completions endpoint
      - alert: InferenceHighRequestRate
        expr: |
          rate(nginx_http_requests_total{
            location="/v1/chat/completions"
          }[5m]) * 300 > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High inference request rate: {{ $value | humanize }}/5m"
          description: "Possible automated abuse of inference endpoint"
      # Alert on large requests (prompt stuffing)
      - alert: InferenceLargeRequestBody
        expr: |
          histogram_quantile(0.99,
            rate(nginx_http_request_size_bytes_bucket{
              location=~"/v1/.*"
            }[10m])
          ) > 2000000
        labels:
          severity: warning
        annotations:
          summary: "99th percentile inference request size > 2MB"
          description: "Possible context-stuffing attack on inference endpoint"
      # Alert on high 429 rate (rate limit being hit; may indicate abuse)
      - alert: InferenceRateLimitHigh
        expr: |
          rate(nginx_http_requests_total{
            status="429",
            location=~"/v1/.*"
          }[5m]) > 5
        labels:
          severity: info
        annotations:
          summary: "Rate limit being hit {{ $value | humanize }}/s on inference endpoint"
Expected Behaviour
| Scenario | Without hardening | With hardening |
|---|---|---|
| Cost amplification via max-context requests | Unbounded compute consumption | client_max_body_size 4m rejects oversized requests with 413 |
| NGINX CVE allows worker RCE | Attacker can read all inference traffic | systemd ProtectSystem, NoNewPrivileges, Seccomp limit post-exploit capability |
| Unauthenticated inference request | Forwarded to vLLM; model responds | NGINX returns 401 before request reaches backend |
| Internal vLLM management endpoint probed | Accessible via proxy | NGINX returns 403 for management paths |
| NGINX version disclosed in response headers | Server: nginx/1.24.0 reveals version | server_tokens off removes the version from NGINX's Server header (proxy_hide_header only affects upstream headers) |
| Streaming request holds connection open indefinitely | Worker connection exhaustion | proxy_read_timeout 180s closes connections whose upstream sends nothing for 180 seconds |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| client_max_body_size 4m | Prevents prompt stuffing | Limits legitimate long-context requests | Set based on your model’s actual context window; increase for models with large context windows |
| proxy_buffering off for streaming | Low latency streaming responses | NGINX cannot aggregate responses for modification | Required for SSE streaming; accept the trade-off; use separate non-streaming endpoint if needed |
| Rate limiting per API key | Prevents key-level abuse | Rejects legitimate bursts that exceed the allowance (with nodelay, excess requests get 429 rather than added latency) | Tune burst parameter; use a shared memory zone large enough to track all active keys |
| Stripping body from logs | Protects user PII in logs | Loses forensic detail if inference is abused | Log request size and endpoint; correlate with backend logs if investigation needed |
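One way to size `client_max_body_size` from a context window is to convert tokens to bytes and add headroom for JSON framing. The sketch below assumes roughly 4 bytes per token, which is a common rule of thumb for English text and not a property of any particular tokenizer; both defaults are assumptions to replace with measurements from your own traffic.

```python
def body_limit_bytes(context_tokens: int,
                     bytes_per_token: int = 4,
                     headroom_pct: int = 50) -> int:
    """Estimate a request body cap for a given context window.

    bytes_per_token and headroom_pct are assumed defaults, not measured values.
    """
    base = context_tokens * bytes_per_token
    return base * (100 + headroom_pct) // 100


limit = body_limit_bytes(100_000)
print(f"suggested cap: {limit} bytes ({limit / 1024:.0f} KiB)")
```

Under these assumptions a 100K-token context needs well under the 4m used in this article, so that value leaves room for non-English text or base64-encoded attachments.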
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Rate limit too aggressive | Legitimate users hit 429 | User reports; 429 spike in metrics | Raise the rate in limit_req_zone; tune burst parameter |
| proxy_read_timeout too short for long completions | Clients receive 504 on long inference requests | 504 errors in NGINX logs; client timeout errors | Increase proxy_read_timeout to match your model’s maximum generation time + 20% margin |
| NGINX CVE patching delayed in Docker image | Running vulnerable NGINX version | Version check script identifies outdated image | Pin base image to patched version; add version check to CI pipeline |
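| proxy_buffering off with many slow streaming clients | Upstream connections and NGINX memory stay occupied while clients drain long responses | Active connection count and NGINX memory usage metrics rise | Per-connection memory is bounded by proxy_buffer_size, so cap concurrency with limit_conn and set send_timeout; proxy_max_temp_file_size has no effect when buffering is off |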
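The "maximum generation time + 20% margin" rule from the table can be computed directly. This sketch uses integer arithmetic and rounds up; the function name is hypothetical. Note that `proxy_read_timeout` applies between successive reads from the upstream, so for streaming responses the relevant figure is the longest gap between chunks rather than the total generation time.

```python
def read_timeout_seconds(max_generation_s: int, margin_pct: int = 20) -> int:
    """Longest expected wait plus a percentage margin, rounded up to a second."""
    return (max_generation_s * (100 + margin_pct) + 99) // 100


# A 150 s worst-case non-streaming completion yields the 180s used in Step 2
print(f"proxy_read_timeout {read_timeout_seconds(150)}s;")
```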
Related Articles
- NGINX Worker Privilege Hardening — OS-level systemd hardening directives used in this article
- vLLM Shared Memory Inference Isolation — securing the vLLM backend that NGINX proxies to
- LLM API Security — broader LLM API security beyond the proxy layer
- NGINX CVE Exploitation Detection — detecting active exploitation of NGINX CVEs at the proxy layer
- NGINX Fleet Patch Management — keeping NGINX patched across the fleet where inference proxies are deployed