Hardening NGINX as a Reverse Proxy for AI Inference Endpoints

Problem

NGINX is a common first choice for reverse-proxying AI inference endpoints — vLLM, Ollama, llama.cpp, TGI (Text Generation Inference), and proprietary APIs. It handles authentication offloading, rate limiting, TLS termination, and load balancing across multiple model instances. The configuration is often treated as infrastructure plumbing: set up once, not revisited.

The security implications of this positioning are underappreciated:

The impact of an NGINX CVE is higher at an inference proxy. A compromised NGINX instance in front of an inference endpoint can strip authentication headers, log and exfiltrate all inference requests and responses (which may contain proprietary prompts, sensitive user data, or model outputs used in production), redirect traffic to rogue model instances, and modify responses to inject malicious content. The business impact of a compromise at this position is significantly higher than for NGINX serving static content.

Inference endpoints have unique request characteristics. Inference API requests are large (often 100KB–2MB including context), long-running (5–120 seconds for streaming responses), and billed per token. These characteristics make inference proxies uniquely vulnerable to:

Token-stuffing attacks: sending maximum-context requests to exhaust compute budget
Streaming timeout abuse: holding connections open with slow streaming to exhaust NGINX worker connections
Cost amplification: automated requests that generate expensive completions
Prompt injection via NGINX request manipulation: a compromised NGINX that rewrites the messages field in a proxied JSON body before forwarding to the model

NGINX CVE patching is often delayed in AI infrastructure. AI teams that deploy inference stacks focus on model availability, latency, and GPU utilisation. NGINX is an afterthought: the version in the Docker base image may not be updated even as critical CVEs are published. Not every CVE matters equally here. The ngx_http_mp4_module CVE (CVE-2024-7347) is irrelevant to inference, but a QUIC module CVE on an inference endpoint that supports HTTP/3 is directly relevant.

Inference endpoints are high-value targets for data exfiltration. The request/response stream contains proprietary system prompts, user queries, model completions, and potentially PII. An attacker who achieves SSRF via NGINX or compromises an NGINX worker process gains access to this stream.

Target systems: NGINX deployed in front of vLLM, Ollama, TGI, llama.cpp, or OpenAI-compatible inference APIs; organisations self-hosting LLMs with NGINX as the API gateway; any deployment where NGINX proxies to AI inference backends.

Threat Model

Adversary 1 — NGINX CVE exploitation at inference proxy. A memory corruption vulnerability in NGINX (CVE-2024-7347, QUIC module CVEs) is exploited on the inference proxy. The attacker’s RCE in the NGINX worker gives them access to all inference request/response traffic, API keys passed in headers, and the ability to modify proxied requests before they reach the model.

Adversary 2 — Cost amplification via unbounded requests. An attacker discovers the inference endpoint and sends automated requests with maximum context lengths. Without rate limiting and token budget enforcement, the attacker generates GPU compute costs orders of magnitude higher than intended usage patterns.
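To make the scale concrete, a rough upper bound on what one client can consume follows from the sustained request rate and the tokens per request. The sketch below is illustrative arithmetic only. The 8,000-token request size and the function name are assumptions for the example, not figures from a real deployment; the 60 r/m and 10 r/m values mirror the per-key and per-IP limits configured in Step 2.

```python
def max_tokens_per_day(requests_per_minute: int, tokens_per_request: int) -> int:
    """Upper bound on tokens one client can submit in 24 hours at a sustained rate."""
    minutes_per_day = 24 * 60
    return requests_per_minute * minutes_per_day * tokens_per_request


# Hypothetical request size: 8,000 tokens (prompt plus completion).
TOKENS_PER_REQUEST = 8_000

unlimited = max_tokens_per_day(600, TOKENS_PER_REQUEST)     # attacker sustaining 10 r/s
per_key_cap = max_tokens_per_day(60, TOKENS_PER_REQUEST)    # 1 r/s per API key
per_ip_cap = max_tokens_per_day(10, TOKENS_PER_REQUEST)     # 10 r/m per IP

print(f"no limit:   {unlimited:,} tokens/day")
print(f"per-key cap: {per_key_cap:,} tokens/day")
print(f"per-IP cap:  {per_ip_cap:,} tokens/day")
```

Even with the limits in place the ceiling is large, which is why rate limiting at the proxy complements, rather than replaces, token budget enforcement at the backend.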

Adversary 3 — System prompt exfiltration via SSRF. An SSRF vulnerability in NGINX (misconfigured proxy_pass) allows an attacker to reach the inference backend directly, bypassing authentication. The attacker queries the model with a prompt designed to echo back the system prompt, extracting proprietary instructions.

Adversary 4 — Response manipulation for downstream injection. A compromised NGINX rewrites model responses in transit, injecting malicious content into AI-generated outputs before they reach end users or downstream applications. If downstream applications parse and act on model output, injected content can trigger unintended actions.

Configuration / Implementation

Step 1 — Keep NGINX patched and disable unused modules

# Verify current NGINX version against CVE list
nginx -V 2>&1 | head -5

# For Docker-based inference deployments, check the base image NGINX version
docker run --rm your-inference-proxy nginx -v

# Avoid modules not needed for inference proxying.
# mp4 and flv are compile-time options (--with-http_mp4_module, --with-http_flv_module);
# image_filter can also be built as a dynamic module and loaded via load_module.
# Modules to leave out:
#   ngx_http_mp4_module     - irrelevant to inference, carries CVE-2024-7347
#   ngx_http_image_filter_module - irrelevant to inference
#   ngx_http_flv_module     - irrelevant to inference

# Build or pull NGINX without these modules for inference-specific deployments
nginx -V 2>&1 | grep -E "mp4|image_filter|flv"
# If these appear and you don't use them, rebuild without them
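The same check can be scripted for a CI pipeline. The following is a minimal sketch that parses `nginx -V` output and reports the version plus any of the three unneeded modules. The function names and the minimum version passed to `version_at_least` are placeholders chosen for illustration; substitute the patched version from the advisory you are tracking.

```python
import re
from typing import List, Optional, Tuple

UNNEEDED_MODULES = ("mp4", "image_filter", "flv")


def parse_nginx_build(output: str) -> Tuple[Optional[str], List[str]]:
    """Return (version, unneeded modules found) from `nginx -V` output."""
    match = re.search(r"nginx version: \S+?/(\d+\.\d+\.\d+)", output)
    version = match.group(1) if match else None
    flagged = [
        name for name in UNNEEDED_MODULES
        if re.search(rf"--with-http_{name}_module\b", output)
    ]
    return version, flagged


def version_at_least(version: str, minimum: str) -> bool:
    """Compare dotted versions numerically, e.g. 1.24.0 < 1.26.2."""
    def as_tuple(value: str) -> Tuple[int, ...]:
        return tuple(int(part) for part in value.split("."))
    return as_tuple(version) >= as_tuple(minimum)


# Sample output for illustration; real output comes from running `nginx -V`.
sample = (
    "nginx version: nginx/1.24.0\n"
    "configure arguments: --with-http_ssl_module --with-http_mp4_module "
    "--with-http_v2_module\n"
)
version, flagged = parse_nginx_build(sample)
print(version, flagged)
print(version_at_least(version, "1.26.2"))  # placeholder minimum version
```

In practice the input would come from `subprocess.run(["nginx", "-V"], capture_output=True, text=True).stderr`, since `nginx -V` writes to standard error.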

Step 2 — Configure rate limiting for inference endpoints

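# /etc/nginx/conf.d/inference-api.conf
# Note: files in conf.d are normally included from inside the http block of
# nginx.conf. If that is the case in your setup, drop the http { } wrapper
# below and keep only its contents.

http {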
    # Rate limiting zones
    
    # Per-API-key rate limiting (requires extracting key from header)
    # Limit: 60 requests per minute per API key
    limit_req_zone $http_authorization zone=inference_per_key:10m rate=1r/s;
    
    # Per-IP rate limiting (catches unauthenticated probing)
    limit_req_zone $binary_remote_addr zone=inference_per_ip:10m rate=10r/m;
    
    # Stricter per-IP limit for the streaming completions endpoint
    # (a rough stand-in for token throughput, since NGINX cannot count tokens)
    limit_req_zone $binary_remote_addr zone=inference_streaming:10m rate=5r/m;

    upstream vllm_backend {
        server 127.0.0.1:8000;
        # If multiple model instances:
        # server 127.0.0.1:8001 weight=1;
        # server 127.0.0.1:8002 weight=1;
        
        keepalive 16;
        keepalive_requests 1000;
        keepalive_timeout 60s;
    }

    server {
        listen 443 ssl;
        server_name inference.example.com;
        
        ssl_certificate /etc/ssl/certs/inference.crt;
        ssl_certificate_key /etc/ssl/private/inference.key;
        
        # Enforce minimum TLS version — do not accept TLS 1.0/1.1
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_prefer_server_ciphers on;
        ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384';
        
        # Apply rate limiting with burst allowance for legitimate users
        limit_req zone=inference_per_key burst=10 nodelay;
        limit_req zone=inference_per_ip burst=5 nodelay;
        limit_req_status 429;
        
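        # Hard limit on request size (prevents prompt-stuffing attacks)
        # Adjust based on your model's context window. As a rule of thumb, at
        # roughly 4 bytes of English text per token, a 100K-token context is about
        # 400 KB before JSON overhead, so 4 MB leaves generous headroom.
        client_max_body_size 4m;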
        
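        # Backend timeouts (model inference can take 60-120s).
        # proxy_read_timeout bounds the gap between two successive reads from the
        # backend, not the total response time, so a stream that keeps sending
        # chunks is not cut off at 180s.
        proxy_read_timeout 180s;
        proxy_connect_timeout 10s;
        proxy_send_timeout 30s;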
        
        location /v1/ {
            # Require authentication header — deny unauthenticated requests at NGINX
            if ($http_authorization = "") {
                return 401 '{"error": "Authorization header required"}';
            }
            
            proxy_pass http://vllm_backend;
            
            # Pass client identity to backend for logging
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header Host $host;
            
            # DO NOT strip the Authorization header — the backend needs it
            # for API key validation
            proxy_set_header Authorization $http_authorization;
            
            # Support streaming responses (SSE/chunked transfer)
            proxy_buffering off;
            proxy_cache off;
            
            # For Server-Sent Events (streaming completions)
            proxy_set_header Connection '';
            proxy_http_version 1.1;
        }
        
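        # Stricter rate limit for the completions endpoint.
        # This location wins over /v1/ for matching requests and does not inherit
        # that block's auth check or proxy_set_header lines, so they are repeated
        # here. Declaring limit_req in a location also replaces the server-level
        # limit_req set, so all three zones are listed.
        location /v1/chat/completions {
            if ($http_authorization = "") {
                return 401 '{"error": "Authorization header required"}';
            }

            limit_req zone=inference_per_key burst=10 nodelay;
            limit_req zone=inference_per_ip burst=5 nodelay;
            limit_req zone=inference_streaming burst=3 nodelay;

            # Additional body size check for completions endpoint
            client_max_body_size 4m;

            proxy_pass http://vllm_backend;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header Host $host;
            proxy_buffering off;
            proxy_cache off;
            proxy_set_header Connection '';
            proxy_http_version 1.1;
        }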
        
        # Block direct access to model management endpoints
        location ~ ^/v1/(models/load|shutdown|reload) {
            return 403 '{"error": "Administrative endpoints not available"}';
        }
        
        # Health check endpoint — no auth required, no rate limiting
        location /health {
            proxy_pass http://vllm_backend/health;
        }
    }
}
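The interaction of `rate`, `burst`, and `nodelay` is easier to reason about with a small simulation. The sketch below is a simplified model of leaky-bucket accounting in the style of `limit_req` with `nodelay`, written for illustration; it is not NGINX's implementation (which tracks state per key in shared memory), and the class name is hypothetical. It uses the per-key settings from the configuration above: 1 r/s with `burst=10`.

```python
class LimitReqModel:
    """Simplified single-key model of limit_req with nodelay."""

    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.excess = None  # None until the first request creates the state
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.excess is None:
            self.excess, self.last = 0.0, now
            return True
        elapsed = now - self.last
        # Drain at the configured rate, then add one request's worth.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the stored state
        self.excess, self.last = excess, now
        return True


model = LimitReqModel(rate_per_second=1.0, burst=10)

# 15 requests arriving at the same instant.
burst_results = [model.allow(0.0) for _ in range(15)]
accepted = sum(burst_results)
print(f"accepted {accepted} of 15 simultaneous requests")

# One second later, one more request fits; a second one at the same instant does not.
print(model.allow(1.0), model.allow(1.0))
```

In this model a client can send the first request plus `burst` more at once, after which it is held to the sustained rate. Anything beyond that receives the status set by `limit_req_status`.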

Step 3 — Log inference requests for audit and anomaly detection

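# Inference-specific log format: captures request metadata without body content.
# Place the map and log_format in the http context.

# NGINX variables have no substring syntax, so a map with a named capture
# extracts the first 8 characters of the Bearer token.
map $http_authorization $api_key_prefix {
    default "";
    "~^Bearer (?<key_prefix>.{8})" $key_prefix;
}

log_format inference_audit escape=json
    '{'
    '"timestamp":"$time_iso8601",'
    '"client_ip":"$remote_addr",'
    '"method":"$request_method",'
    '"endpoint":"$request_uri",'
    '"status":$status,'
    '"request_bytes":$request_length,'
    '"response_bytes":$body_bytes_sent,'
    # $request_time is in seconds with millisecond resolution
    '"duration_s":$request_time,'
    '"api_key_prefix":"$api_key_prefix",'
    '"upstream":"$upstream_addr",'
    '"upstream_status":"$upstream_status",'
    # Quoted: the value can be "-" or a comma-separated list, which is not valid bare JSON
    '"upstream_duration_s":"$upstream_response_time"'
    '}';

access_log /var/log/nginx/inference-audit.json inference_audit;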

Important: Do not log request or response bodies at the NGINX layer. Inference request bodies contain user prompts that may include PII; response bodies contain model outputs. NGINX-level logging should capture metadata (endpoint, size, duration, status, API key prefix) but not content.
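Metadata alone is enough for a first pass at anomaly detection. The sketch below reads audit log lines and aggregates request count, total request bytes, and 429 responses per API key prefix. It assumes each line is a JSON object containing the `status`, `request_bytes`, and `api_key_prefix` fields from the log format above; the function names and the byte threshold are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def summarize_by_key(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Aggregate request count, request bytes, and 429 count per API key prefix."""
    stats = defaultdict(lambda: {"requests": 0, "request_bytes": 0, "rate_limited": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        key = entry.get("api_key_prefix") or "(none)"
        stats[key]["requests"] += 1
        stats[key]["request_bytes"] += int(entry.get("request_bytes", 0))
        if int(entry.get("status", 0)) == 429:
            stats[key]["rate_limited"] += 1
    return dict(stats)


def flag_heavy_keys(stats: Dict[str, Dict[str, int]], max_bytes: int) -> List[str]:
    """Return key prefixes whose total request bytes exceed max_bytes."""
    return sorted(k for k, s in stats.items() if s["request_bytes"] > max_bytes)


# Illustrative log lines; real input would be read from the audit log file.
sample_lines = [
    '{"status":200,"request_bytes":1500000,"api_key_prefix":"sk-abc12"}',
    '{"status":429,"request_bytes":1800000,"api_key_prefix":"sk-abc12"}',
    '{"status":200,"request_bytes":2000,"api_key_prefix":"sk-xyz98"}',
    "not json",
]
stats = summarize_by_key(sample_lines)
print(flag_heavy_keys(stats, max_bytes=3_000_000))
```

Because only sizes, statuses, and key prefixes are read, this kind of analysis stays within the no-body-logging rule.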

Step 4 — Add response header security and strip internal headers

# Add security headers to inference API responses
add_header X-Content-Type-Options nosniff always;
add_header X-Frame-Options DENY always;

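# Strip backend headers from responses to avoid disclosing backend infrastructure
proxy_hide_header X-Powered-By;
proxy_hide_header X-vLLM-Version;

# Remove the NGINX version from the Server header and from error pages.
# proxy_hide_header only affects headers sent by the backend, and NGINX does not
# pass the backend's Server header by default, so it cannot hide NGINX's own.
server_tokens off;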

# Add identifying information about the proxy for debugging
# (without disclosing version)
add_header X-Request-ID $request_id always;

# CORS configuration — restrict to known origins for API endpoints
add_header Access-Control-Allow-Origin "https://app.example.com" always;
add_header Access-Control-Allow-Methods "POST, GET, OPTIONS" always;
add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;

# Handle OPTIONS preflight without reaching the backend
if ($request_method = OPTIONS) {
    return 204;
}
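A quick way to confirm these directives took effect is to inspect the headers of a real response. The sketch below is a hedged example of such a check: it takes a plain dictionary of response headers (obtained however you prefer) and returns a list of findings. The function name and the exact set of checks are assumptions that mirror the configuration above.

```python
import re
from typing import Dict, List


def audit_response_headers(headers: Dict[str, str]) -> List[str]:
    """Return a list of header findings; an empty list means all checks passed."""
    h = {name.lower(): value for name, value in headers.items()}
    findings = []
    if h.get("x-content-type-options", "").lower() != "nosniff":
        findings.append("missing X-Content-Type-Options: nosniff")
    if h.get("x-frame-options", "").upper() != "DENY":
        findings.append("missing X-Frame-Options: DENY")
    if re.search(r"\d", h.get("server", "")):
        findings.append("Server header discloses a version")
    for name in ("x-powered-by", "x-vllm-version"):
        if name in h:
            findings.append(f"backend header leaked: {name}")
    if h.get("access-control-allow-origin") == "*":
        findings.append("CORS allows any origin")
    return findings


# Illustrative header sets.
unhardened = {"Server": "nginx/1.24.0", "X-Powered-By": "example"}
hardened = {
    "Server": "nginx",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Access-Control-Allow-Origin": "https://app.example.com",
}
for finding in audit_response_headers(unhardened):
    print(finding)
print(audit_response_headers(hardened))
```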

Step 5 — Apply systemd hardening to the NGINX inference proxy

# /etc/systemd/system/nginx.service.d/inference-hardening.conf
# Additional hardening for NGINX serving as an inference proxy

[Service]
# Prevent NGINX workers from accessing model weights or GPU device files
# directly — restrict to network and serving operations
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes

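# Workers only need to read config and write logs.
# ProtectSystem=strict makes /run read-only as well, so the pid file location
# must fall under one of the ReadWritePaths (adjust /run/nginx to match the
# pid directive in nginx.conf).
ReadOnlyPaths=/etc/nginx /etc/ssl/certs
ReadWritePaths=/var/log/nginx /var/cache/nginx /run/nginx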

# Inference proxy doesn't need module loading syscalls
SystemCallFilter=~finit_module init_module delete_module

# Prevent ptrace on GPU-side processes
SystemCallFilter=~ptrace process_vm_readv process_vm_writev

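# Block obsolete syscalls. This group does not prevent shell spawning:
# execve is not part of @obsolete, and blocking it would stop the service from starting.
SystemCallFilter=~@obsolete

# Disallow memory mappings that are both writable and executable.
# This can break JIT-based features (pcre_jit, njs, Lua); test before enabling.
MemoryDenyWriteExecute=yes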
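Drop-in files tend to drift between hosts, so it can help to verify that the expected directives are still present. The following is a minimal sketch under stated assumptions: it does a naive line-based parse (ignoring section headers and systemd's override semantics) and checks only the five boolean-style directives listed in `REQUIRED`. Both that list and the function names are choices made for this example.

```python
from typing import Dict, List

# Directives and values taken from the drop-in above.
REQUIRED = {
    "NoNewPrivileges": "yes",
    "ProtectSystem": "strict",
    "ProtectHome": "yes",
    "PrivateTmp": "yes",
    "PrivateDevices": "yes",
}


def parse_directives(text: str) -> Dict[str, List[str]]:
    """Collect Key=Value lines; repeated keys (e.g. SystemCallFilter) keep all values."""
    directives: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            directives.setdefault(key.strip(), []).append(value.strip())
    return directives


def missing_hardening(text: str) -> List[str]:
    """Return required directives that are absent or set to a different value."""
    directives = parse_directives(text)
    return sorted(k for k, v in REQUIRED.items() if v not in directives.get(k, []))


sample = """
[Service]
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
SystemCallFilter=~ptrace process_vm_readv process_vm_writev
SystemCallFilter=~@obsolete
"""
print(missing_hardening(sample))
```

For an assessment of the unit as systemd actually loads it, `systemd-analyze security nginx.service` is the more authoritative tool; the script above only guards against accidental removal of lines from the file.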

Step 6 — Monitor for cost anomalies and abuse patterns

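# Prometheus alerting rules for inference proxy abuse detection.
# The location and status labels and the request-size histogram used below
# depend on your exporter; stub_status-based exporters do not expose them.
# Adjust metric and label names to what you actually scrape.

groups:
- name: nginx_inference_abuse
  rules:
  # Alert when the completions endpoint receives more than 100 requests in 5 minutes.
  # These metrics carry no API key label; use the audit log for per-key attribution.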
  - alert: InferenceHighRequestRate
    expr: |
      rate(nginx_http_requests_total{
        location="/v1/chat/completions"
      }[5m]) * 300 > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High inference request rate: {{ $value | humanize }}/5m"
      description: "Possible automated abuse of inference endpoint"

  # Alert on large requests (prompt stuffing)
  - alert: InferenceLargeRequestBody
    expr: |
      histogram_quantile(0.99,
        rate(nginx_http_request_size_bytes_bucket{
          location=~"/v1/.*"
        }[10m])
      ) > 2000000
    labels:
      severity: warning
    annotations:
      summary: "99th percentile inference request size > 2MB"
      description: "Possible context-stuffing attack on inference endpoint"

  # Alert on high 429 rate (rate limit being hit — may indicate abuse)
  - alert: InferenceRateLimitHigh
    expr: |
      rate(nginx_http_requests_total{
        status="429",
        location=~"/v1/.*"
      }[5m]) > 5
    labels:
      severity: info
    annotations:
      summary: "Rate limit being hit {{ $value | humanize }}/s on inference endpoint"

Expected Behaviour

Scenario	Without hardening	With hardening
Cost amplification via max-context requests	Unbounded compute consumption	`client_max_body_size 4m` rejects oversized requests with 413; `limit_req` caps the request rate per key and per IP
NGINX CVE allows worker RCE	Attacker can read all inference traffic	systemd `ProtectSystem`, `NoNewPrivileges`, Seccomp limit post-exploit capability
Unauthenticated inference request	Forwarded to vLLM; model responds	NGINX returns 401 before request reaches backend
Internal vLLM management endpoint probed	Accessible via proxy	NGINX returns 403 for management paths
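NGINX version disclosed in response headers	`Server: nginx/1.24.0` reveals version	`server_tokens off` removes the version from the `Server` header (`proxy_hide_header` only affects headers sent by the backend)
Streaming request holds connection open indefinitely	Worker connection exhaustion	`proxy_read_timeout 180s` closes connections whose backend stays silent for 180s; slow clients are bounded by `client_body_timeout` and `send_timeout`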

Trade-offs

Aspect	Benefit	Cost	Mitigation
`client_max_body_size 4m`	Prevents prompt stuffing	Limits legitimate long-context requests	Set based on your model’s actual context window; increase for models with large context windows
`proxy_buffering off` for streaming	Low latency streaming responses	NGINX cannot aggregate responses for modification	Required for SSE streaming; accept the trade-off; use separate non-streaming endpoint if needed
Rate limiting per API key	Prevents key-level abuse	With `nodelay`, requests beyond the `burst` allowance are rejected with 429 rather than queued	Tune burst parameter; use a shared memory zone large enough to track all active keys
Stripping body from logs	Protects user PII in logs	Loses forensic detail if inference is abused	Log request size and endpoint; correlate with backend logs if investigation needed

Failure Modes

Failure	Symptom	Detection	Recovery
Rate limit too aggressive	Legitimate users hit 429	User reports; 429 spike in metrics	Raise the `rate` on the relevant `limit_req_zone`; tune burst parameter (zone size only affects how many keys can be tracked)
`proxy_read_timeout` too short for long completions	Clients receive 504 on long inference requests	504 errors in NGINX logs; client timeout errors	Increase `proxy_read_timeout` to match your model’s maximum generation time + 20% margin
NGINX CVE patching delayed in Docker image	Running vulnerable NGINX version	Version check script identifies outdated image	Pin base image to patched version; add version check to CI pipeline
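`proxy_buffering off` lets slow clients hold backend connections	Backend connections and generation slots stay occupied while clients read slowly	Active connections and upstream duration rise while request rate stays flat	Set `send_timeout` to drop stalled clients; `proxy_max_temp_file_size` has no effect here because unbuffered responses are never written to temp files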

Related Articles

NGINX Worker Privilege Hardening: OS-level systemd hardening directives used in this article
vLLM Shared Memory Inference Isolation — securing the vLLM backend that NGINX proxies to
LLM API Security — broader LLM API security beyond the proxy layer
NGINX CVE Exploitation Detection — detecting active exploitation of NGINX CVEs at the proxy layer
NGINX Fleet Patch Management — keeping NGINX patched across the fleet where inference proxies are deployed