LLM Rate Limiting in Kubernetes: Token-Bucket Control for vLLM and TGI at Scale

The Problem

Standard API rate limiting assumes request equivalence. An nginx limit_req_zone counting requests per minute, a Kong plugin blocking users who hit 100 requests per hour, a Redis-backed token bucket admitting 10 requests per second — all of these models treat every HTTP request as an interchangeable unit of cost. For most APIs, that assumption is close enough. For LLM inference, it is wrong by several orders of magnitude.

Consider two requests to the same vLLM endpoint, both arriving as HTTP POST to /v1/chat/completions:

Request A: {"messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 10} uses 9 input tokens and 3 output tokens, roughly $0.00007 at an illustrative $0.003 per 1K input tokens and $0.015 per 1K output tokens.
Request B: {"messages": [{"role": "user", "content": "<400-page document text>"}], "max_tokens": 4096} uses 98,000 input tokens and 4,096 output tokens, roughly $0.36 at the same prices.

Both are “one request.” A standard rate limiter that blocks at 10 requests per minute would allow an attacker to consume about $3.55 in GPU compute every 60 seconds (over $200 per hour) while staying entirely within the request-count limit. Ten of Request A in the same window costs $0.0007. The cost ratio is roughly 5,000×. This is not a theoretical worst case: it is the realistic difference between a one-line query and a document summarisation task, and both are valid use cases for the same endpoint.
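The comparison can be reproduced with a few lines of Python. This is a minimal sketch under a linear per-token cost model; the two price constants are illustrative assumptions (the same reference values this article uses in its metrics code), and request_cost is a hypothetical helper, not part of any library.

```python
# Illustrative prices (assumptions), in USD per 1,000 tokens.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under a linear per-token price."""
    return (
        input_tokens * INPUT_PRICE_PER_1K + output_tokens * OUTPUT_PRICE_PER_1K
    ) / 1000


cost_a = request_cost(9, 3)            # one-line question
cost_b = request_cost(98_000, 4_096)   # document summarisation
ratio = cost_b / cost_a

print(f"A: ${cost_a:.6f}  B: ${cost_b:.4f}  ratio: {ratio:,.0f}x")
```

Under any linear price the ratio is bounded by the token counts themselves, which is why counting tokens rather than requests is the natural unit of control.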

The units that matter for LLM rate limiting are tokens, not requests. Specifically, there are five dimensions that must be controlled independently:

Input tokens per request. A single request with 100,000 input tokens occupies the attention mechanism for the entire forward pass duration, blocking a KV cache slot on the GPU and preventing other sequences from being scheduled. vLLM enforces --max-model-len as an upper bound on context length, but this is a server-wide limit — not a per-user configurable ceiling, and not enforced at the ingress layer where you can return a meaningful error before the request reaches the GPU.

Output tokens per request. The max_tokens parameter in the request body controls maximum output length, but the only built-in ceiling is the server-wide context window, so a client can ask for thousands of output tokens on every call. The model will generate tokens until it hits that limit or produces an end-of-sequence token, occupying a GPU slot the entire time. Users must not control this parameter without an upper bound enforced at the ingress layer.

Total tokens per user per time period. Even with per-request limits, a user making a request every 5 seconds stays under any per-request ceiling while consuming substantial tokens over an hour. Budget accounting across a rolling time window is the only control for cumulative cost.

Concurrent requests per user. vLLM’s --max-num-seqs limits total concurrent sequences across all users. Without a per-user concurrency limit, a single user can occupy all available slots — effectively a GPU-exhaustion DoS against every other user on the system.

Priority queuing. When GPU capacity is constrained and requests queue, first-in-first-out scheduling is the wrong policy for a system with tiered users. Premium users blocked by free-tier traffic is both a revenue problem and a latency SLA violation.

vLLM’s built-in controls cover some of this surface. The --max-num-seqs flag limits concurrent sequences. Individual requests include max_tokens. What vLLM does not provide: per-user token accounting, multi-period budget windows, priority scheduling across users, or usage attribution signals for billing. HuggingFace TGI’s situation is similar — --max-total-tokens, --max-input-length, and --max-batch-total-tokens are server-wide limits, not per-user controls.

These gaps must be filled at the Kubernetes ingress and sidecar layer, before requests reach the inference server.
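One way to keep the five dimensions explicit is a per-tier policy record that the ingress layer consults on every request. The sketch below is illustrative: TierPolicy, POLICIES, and policy_for are hypothetical names, and the numbers simply mirror the example limits used later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    max_input_tokens: int      # per-request input ceiling
    max_output_tokens: int     # per-request ceiling on max_tokens
    hourly_token_budget: int   # cumulative tokens per user per hour
    max_concurrent: int        # in-flight requests per user
    queue_priority: int        # lower value is dispatched first


# Example values only; tune them to your own capacity and pricing.
POLICIES = {
    "premium": TierPolicy(16_000, 4_096, 500_000, 3, 0),
    "standard": TierPolicy(16_000, 4_096, 200_000, 3, 1),
    "free": TierPolicy(16_000, 4_096, 100_000, 3, 2),
}


def policy_for(tier: str) -> TierPolicy:
    """Unknown or missing tiers fall back to the most restrictive policy."""
    return POLICIES.get(tier, POLICIES["free"])
```

Falling back to the free tier for unrecognised values means a malformed or spoofed tier header can never grant more than the lowest entitlement.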

Threat Model

Billing DoS via large inputs. An attacker authenticates as a free-tier user and submits requests containing 95,000-token inputs at the maximum rate allowed by a request-count limiter. At 10 requests per minute, this generates 950,000 input tokens per minute — approximately $2.85 per minute in compute cost at typical input pricing, $171 per hour. The attack stays entirely under a 10-rpm request limit. Without token-count enforcement, there is no mechanism to detect or stop it until the billing invoice arrives.

Budget circumvention through fragmentation. Per-request token limits are bypassed by splitting large inputs into many small requests. A user who cannot submit 50,000 tokens in one request sends 25 requests of 2,000 tokens each instead, each well within the per-request limit. Without per-user rolling window budget accounting, every individual request passes validation while the cumulative cost is identical.
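The fragmentation case is easy to demonstrate with a sliding-window budget. The class below is a minimal in-memory sketch for a single user; RollingTokenBudget is a hypothetical name, and a real deployment would keep this state in a shared store such as Redis rather than in process memory.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class RollingTokenBudget:
    """Sliding-window token budget for one user (in-memory illustration)."""

    def __init__(self, limit: int, window_seconds: float = 3600.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self._used = 0

    def try_consume(self, tokens: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop spend that has aged out of the window.
        while self._events and self._events[0][0] <= now - self.window:
            _, expired = self._events.popleft()
            self._used -= expired
        if self._used + tokens > self.limit:
            return False
        self._events.append((now, tokens))
        self._used += tokens
        return True


budget = RollingTokenBudget(limit=50_000)
# 26 fragments of 2,000 tokens, one per second: each is small on its own.
results = [budget.try_consume(2_000, now=float(i)) for i in range(26)]
```

The first 25 fragments are admitted and the 26th is refused, because the window tracks cumulative spend rather than the size of any single request.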

GPU queue exhaustion. A single user or coordinated group of users issues many concurrent requests, filling all --max-num-seqs slots. Other users receive 503s or wait in queue indefinitely. vLLM’s scheduler has no concept of user identity — it schedules sequences by arrival order within its internal queue. The attacker does not need to overflow the queue; they simply need to maintain enough concurrent sequences to saturate capacity.

Cost attribution failure. Without per-user token tracking, cost allocation to tenants, customers, or departments is impossible. Billing surprises are discovered when the monthly cloud invoice arrives — a 30-day feedback loop is not an operational control.

Hardening Configuration

1. Token Estimation Middleware

Token counting happens at the application layer, before the request reaches vLLM or TGI. This requires reading the request body, something Envoy’s standard rate limit filter cannot do without ext_proc. A FastAPI service, deployed either as a sidecar container or as an independent service, handles token estimation, budget accounting, and input validation.

# token_limiter.py — FastAPI middleware for LLM token rate limiting
import asyncio
import time
from contextlib import asynccontextmanager
from typing import AsyncIterator

import redis.asyncio as redis
import tiktoken
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse

# cl100k_base is the tiktoken encoding used by GPT-4 and GPT-3.5-turbo. Llama models
# ship their own tokenizers, so for Llama 3 these counts are only an approximation;
# this article budgets for an error of up to about 15%.
# Load once at startup; tokenizer initialisation is expensive.
_tokenizer: tiktoken.Encoding | None = None
_redis_client: redis.Redis | None = None


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    global _tokenizer, _redis_client
    _tokenizer = tiktoken.get_encoding("cl100k_base")
    _redis_client = redis.Redis(
        host="redis-rate-limit.ai-inference.svc.cluster.local",
        port=6379,
        decode_responses=True,
        socket_connect_timeout=1,
        socket_timeout=1,
    )
    yield
    await _redis_client.aclose()


app = FastAPI(lifespan=lifespan)

# Hard limits enforced before budget check.
# These are server-side policy — clients cannot override them via request parameters.
MAX_INPUT_TOKENS = 16_000
MAX_OUTPUT_TOKENS = 4_096

# Per-user rolling hourly budget.
HOURLY_TOKEN_BUDGET = 100_000

# Per-user concurrent request limit.
MAX_CONCURRENT_PER_USER = 3


def estimate_input_tokens(messages: list[dict]) -> int:
    """Estimate input token count for a chat completion request.

    Uses tiktoken for speed. Actual billing tokens from the LLM may differ by up
    to 15% due to model-specific BPE differences and special tokens. The buffer of
    10 tokens per message accounts for role tokens (<|im_start|>, role name, etc.)
    that OpenAI-format APIs add around each message.
    """
    total = 0
    for msg in messages:
        total += 10  # Per-message overhead
        content = msg.get("content", "")
        if isinstance(content, str):
            total += len(_tokenizer.encode(content))
        elif isinstance(content, list):
            # Multimodal content — process text parts, approximate image tokens
            for part in content:
                if part.get("type") == "text":
                    total += len(_tokenizer.encode(part.get("text", "")))
                elif part.get("type") == "image_url":
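                    # OpenAI vision accounting: 85 base tokens plus 170 per 512x512 tile in
                    # high-detail mode, so a 1024x1024 image is ~765 tokens. Without
                    # inspecting actual image dimensions, use that high-detail figure.
                    # Self-hosted vision models may count image tokens differently.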
                    total += 765
    return total


@app.middleware("http")
async def token_rate_limit(request: Request, call_next):
    # Pass through non-inference endpoints (health, metrics) without accounting.
    if request.url.path not in {"/v1/chat/completions", "/v1/completions"}:
        return await call_next(request)

    # User identity comes from upstream authentication — API gateway or Istio JWT
    # validation populates this header after verifying the bearer token.
    user_id = request.headers.get("X-User-ID")
    user_tier = request.headers.get("X-User-Tier", "free")  # premium | standard | free

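    if not user_id:
        # An HTTPException raised inside @app.middleware bypasses FastAPI's exception
        # handlers and surfaces as a 500, so return the error response directly.
        return JSONResponse(
            status_code=401,
            content={"error": "Missing X-User-ID header: authentication required"},
        )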

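    # Read and parse the body. Depending on the Starlette version, a body consumed
    # in middleware may not be replayed to the downstream handler, so verify that
    # the proxied request to vLLM still has a body (or re-inject it explicitly).
    import json  # stdlib; better hoisted to the module-level imports

    raw_body = await request.body()
    try:
        body = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return JSONResponse(status_code=400, content={"error": "Request body is not valid JSON"})
    if not isinstance(body, dict):
        return JSONResponse(status_code=400, content={"error": "Request body must be a JSON object"})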

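    # --- 1. Enforce output token ceiling ---
    # If the client omits max_tokens, the default below is used for budget accounting
    # only. It is not written into the forwarded request, so the inference server's
    # own default output limit still applies.
    max_tokens = body.get("max_tokens", 1000)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens < 1:
        return JSONResponse(
            status_code=400,
            content={"error": "max_tokens must be a positive integer"},
        )
    if max_tokens > MAX_OUTPUT_TOKENS:
        # Returned directly: HTTPException raised in middleware would become a 500.
        return JSONResponse(
            status_code=400,
            content={
                "error": f"max_tokens {max_tokens} exceeds server limit of {MAX_OUTPUT_TOKENS}",
                "max_allowed": MAX_OUTPUT_TOKENS,
            },
        )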

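    # --- 2. Estimate and enforce input token ceiling ---
    messages = body.get("messages", [])
    if not messages:
        # /v1/completions uses 'prompt' instead of 'messages'
        prompt = body.get("prompt", "")
        if not isinstance(prompt, str):
            # Only plain string prompts are handled by this estimator.
            return JSONResponse(
                status_code=400,
                content={"error": "prompt must be a string"},
            )
        input_tokens = len(_tokenizer.encode(prompt)) + 10
    else:
        input_tokens = estimate_input_tokens(messages)

    if input_tokens > MAX_INPUT_TOKENS:
        # Returned directly: HTTPException raised in middleware would become a 500.
        return JSONResponse(
            status_code=400,
            content={
                "error": f"Input too long: {input_tokens} estimated tokens (max {MAX_INPUT_TOKENS})",
                "estimated_tokens": input_tokens,
                "max_allowed": MAX_INPUT_TOKENS,
            },
        )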

    # Conservative estimate: assume the model generates max_tokens output.
    # Actual output will usually be less, but we reserve the full budget upfront
    # to prevent budget overruns on streaming responses where we can't rollback.
    estimated_total = input_tokens + max_tokens

    # --- 3. Check concurrent request limit ---
    concurrent_key = f"concurrent:{user_id}"
    slot_acquired = False  # Only decrement later if our increment actually happened
    try:
        concurrent = await _redis_client.incr(concurrent_key)
        slot_acquired = True
        await _redis_client.expire(concurrent_key, 300)  # 5-minute TTL as safety net

        if concurrent > MAX_CONCURRENT_PER_USER:
            await _redis_client.decr(concurrent_key)
            slot_acquired = False
            return JSONResponse(
                status_code=429,
                content={
                    "error": "Concurrent request limit exceeded",
                    "active_requests": concurrent - 1,
                    "limit": MAX_CONCURRENT_PER_USER,
                },
            )
    except redis.RedisError:
        # Redis failure: fail OPEN on concurrency check (prefer availability).
        # Nothing is recorded here; emit a log line or error counter in production
        # so the failure is visible. See Failure Modes section.
        pass

    # --- 4. Check and deduct from per-user hourly token budget ---
    budget_key = f"token_budget:{user_id}"
    try:
        pipe = _redis_client.pipeline()
        pipe.incrby(budget_key, estimated_total)
        # NX sets the TTL only when the key has none (requires Redis 7.0+), so the
        # hour starts at the user's first request and is not extended by later
        # ones. Without NX every request would push the reset time back.
        pipe.expire(budget_key, 3600, nx=True)
        results = await pipe.execute()
        current_usage = int(results[0])

        tier_limits = {
            "premium": 500_000,
            "standard": 200_000,
            "free": HOURLY_TOKEN_BUDGET,
        }
        hourly_limit = tier_limits.get(user_tier, HOURLY_TOKEN_BUDGET)

        if current_usage > hourly_limit:
            # Rollback the optimistic increment to avoid overcounting.
            await _redis_client.decrby(budget_key, estimated_total)
            ttl = await _redis_client.ttl(budget_key)
            # Give back the concurrency slot taken in step 3; this request
            # never reaches the model.
            if slot_acquired:
                await _redis_client.decr(concurrent_key)
                slot_acquired = False
            return JSONResponse(
                status_code=429,
                content={
                    "error": "Hourly token budget exceeded",
                    "used": current_usage - estimated_total,
                    "limit": hourly_limit,
                    "reset_in_seconds": ttl,
                    "tier": user_tier,
                },
            )
    except redis.RedisError:
        # Redis failure: fail OPEN. Metrics will show Redis errors; alert on them.
        pass

    # --- 5. Forward request to vLLM ---
    # Re-inject body for the proxied call (handled by the ASGI scope reconstruction
    # or the reverse proxy in front of this middleware).
    try:
        response = await call_next(request)
        return response
    finally:
        # This runs when call_next returns, which is once the downstream response
        # has started. For streamed responses the body may still be in flight, so
        # the counter can under-report long-running streams.
        if slot_acquired:
            try:
                await _redis_client.decr(concurrent_key)
            except redis.RedisError:
                pass

The middleware sits between the Kubernetes ingress and the vLLM pod. It is stateless itself — all state lives in Redis — so it scales horizontally without coordination overhead.

2. vLLM Deployment with Server-Side Limits

The middleware enforces policy at the edge. vLLM enforces limits at the model server as a second line of defence — so that requests which somehow bypass the middleware still encounter ceilings.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai-inference
  labels:
    app: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
      annotations:
        # Expose vLLM metrics to Prometheus via pod annotation scraping.
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: vllm
        # Pin to a specific digest, not a floating tag. The digest below is a
        # placeholder; substitute the digest of the image you have verified.
        image: vllm/vllm-openai:v0.4.2@sha256:a3b1c2d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
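        - --model=meta-llama/Meta-Llama-3-8B-Instruct
        # Hard ceiling on context window. vLLM rejects values above the model's own
        # context length (8,192 tokens for the original Llama 3 8B), so use a
        # long-context model or lower this value and the middleware limits to match.
        - --max-model-len=16384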
        - --max-num-seqs=32            # Max concurrent KV cache slots (tune to GPU VRAM)
        - --max-logprobs=0             # Disable logprob output — reduces response payload
        - --disable-log-requests       # Do not log request content to container stdout
        - --served-model-name=llama3-8b  # Public alias — do not expose HuggingFace path
        - --tensor-parallel-size=1    # Adjust for multi-GPU nodes
        - --gpu-memory-utilization=0.90
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "8"
            memory: 24Gi
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
          failureThreshold: 2
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: false  # vLLM writes temp files to /tmp
          runAsNonRoot: true
          runAsUser: 1000
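        volumeMounts:
        # With runAsUser: 1000 the container user usually cannot read /root. Point
        # HF_HOME at a readable path and mount the cache there if loading fails.
        - name: model-cache
          mountPath: /root/.cache/huggingface
          readOnly: true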
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-weights-pvc
          readOnly: true
      - name: tmp
        emptyDir:
          medium: Memory
          sizeLimit: 4Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

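The --max-num-seqs=32 flag is the primary GPU concurrency control at the model server level. It caps how many sequences the scheduler runs at once; it does not reserve KV cache for 32 full-length contexts. Llama 3 8B in fp16 stores about 128 KiB of KV cache per token (keys and values across 32 layers, 8 KV heads, and a head dimension of 128, at 2 bytes each), so a single 16K-token sequence needs roughly 2 GiB and 32 of them roughly 64 GiB. On an L4 (24GB VRAM) the fp16 weights alone take about 16GB, which leaves only a few gigabytes for KV cache. vLLM’s paged allocation lets many short sequences share that pool, but long sequences will queue or be preempted once it fills. Set --max-num-seqs from the KV cache capacity vLLM reports at startup and your typical context length, and reduce it for longer contexts or larger models.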
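A back-of-the-envelope KV cache calculation helps when choosing --max-num-seqs. The sketch below uses the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) with fp16 values; the weight size and the assumption that nothing else occupies GPU memory are simplifications, so treat the result as an upper bound rather than a measurement.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


GIB = 1024 ** 3

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence_gib = per_token * 16_384 / GIB      # one full 16K-token context
all_sequences_gib = per_sequence_gib * 32        # 32 full contexts at once

# Rough pool size on a 24 GB card (assumed figures, ignoring activations):
vram_bytes = 24 * 10**9
usable_bytes = vram_bytes * 90 // 100            # --gpu-memory-utilization=0.90
weights_bytes = 16 * 10**9                       # ~8B parameters in fp16
kv_pool_bytes = usable_bytes - weights_bytes
tokens_that_fit = kv_pool_bytes // per_token

print(per_token, per_sequence_gib, all_sequences_gib, tokens_that_fit)
```

The pool holds tens of thousands of tokens in total, shared across every running sequence, which is the number to compare against max-num-seqs multiplied by your typical context length.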

3. Priority Queue via Envoy ext_proc

When all --max-num-seqs slots are occupied, vLLM queues internally in FIFO order regardless of user tier. To impose priority ordering before requests reach vLLM, implement priority queuing in the ext_proc sidecar — Envoy’s external processing filter sends request headers (and optionally the body) to a gRPC service that can delay, modify, or reject the request.

# priority_queue.py — Envoy ext_proc service implementing priority queuing
# Deployed as a sidecar in the token-limiter pod.

import asyncio
import heapq
import time
import uuid
from dataclasses import dataclass, field
from enum import IntEnum

import grpc
from envoy.service.ext_proc.v3 import external_processor_pb2 as ext_proc_pb2
from envoy.service.ext_proc.v3 import (
    external_processor_pb2_grpc as ext_proc_pb2_grpc,
)


class Priority(IntEnum):
    PREMIUM = 0    # Lowest integer = highest priority in min-heap
    STANDARD = 1
    FREE = 2


@dataclass(order=True)
class QueueEntry:
    priority: Priority
    timestamp: float           # Tie-break by arrival time within same tier
    request_id: str = field(compare=False)
    user_id: str = field(compare=False)
    future: asyncio.Future = field(compare=False)


TIER_PRIORITY = {
    "premium": Priority.PREMIUM,
    "standard": Priority.STANDARD,
    "free": Priority.FREE,
}

# Tune these to your vLLM --max-num-seqs value and expected queue depth.
MAX_CONCURRENT = 32
MAX_QUEUE_DEPTH = 96       # 3× concurrent slots — beyond this, reject immediately
QUEUE_TIMEOUT_SECONDS = {  # Max time a request waits in queue per tier
    Priority.PREMIUM: 30,
    Priority.STANDARD: 20,
    Priority.FREE: 10,
}


class LLMPriorityQueue:
    def __init__(self):
        self._heap: list[QueueEntry] = []
        self._active: set[str] = set()
        self._lock = asyncio.Lock()
        self._slot_available = asyncio.Event()
        self._slot_available.set()  # Slots available at start

    async def acquire(self, request_id: str, user_id: str, tier: str) -> bool:
        """Attempt to acquire an inference slot.

        Returns True immediately if a slot is available.
        Queues the request and waits if capacity is constrained.
        Returns False if the queue is full or the wait timeout elapses.
        """
        priority = TIER_PRIORITY.get(tier, Priority.FREE)
        timeout = QUEUE_TIMEOUT_SECONDS[priority]

        async with self._lock:
            if len(self._active) < MAX_CONCURRENT:
                self._active.add(request_id)
                return True

            if len(self._heap) >= MAX_QUEUE_DEPTH:
                return False  # Reject: queue at capacity

            # get_running_loop() is the supported call from inside a coroutine.
            loop = asyncio.get_running_loop()
            future: asyncio.Future = loop.create_future()
            entry = QueueEntry(
                priority=priority,
                timestamp=time.monotonic(),
                request_id=request_id,
                user_id=user_id,
                future=future,
            )
            heapq.heappush(self._heap, entry)

        # Wait outside the lock — the lock is released while the request queues.
        try:
            await asyncio.wait_for(future, timeout=timeout)
            return True
        except asyncio.TimeoutError:
            async with self._lock:
                # Remove from queue if still present (may have been popped but future
                # not yet resolved due to scheduling).
                self._heap = [e for e in self._heap if e.request_id != request_id]
                heapq.heapify(self._heap)
            return False

    async def release(self, request_id: str) -> None:
        """Release a slot and promote the highest-priority queued request."""
        async with self._lock:
            self._active.discard(request_id)
            while self._heap and len(self._active) < MAX_CONCURRENT:
                next_entry = heapq.heappop(self._heap)
                if next_entry.future.done():
                    # The waiter already timed out or was cancelled. Granting it
                    # a slot would leak capacity, so move on to the next entry.
                    continue
                self._active.add(next_entry.request_id)
                next_entry.future.set_result(True)

The priority queue runs as a gRPC service inside the token-limiter pod. Envoy’s ext_proc filter sends request headers to it; the service checks the X-User-Tier header and either allows the request through immediately or holds the connection open until a slot is available (or the timeout elapses).

4. Envoy Rate Limiting Filter via Istio EnvoyFilter

For header-based rate limiting that does not require body inspection — request count limiting as a coarse first gate before the token middleware — configure Envoy’s native rate limit filter via an Istio EnvoyFilter resource:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: llm-rate-limit-headers
  namespace: ai-inference
spec:
  workloadSelector:
    labels:
      app: token-limiter
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
          domain: llm-api
          failure_mode_deny: true   # Fail CLOSED if rate limit service is unavailable
          rate_limit_service:
            grpc_service:
              envoy_grpc:
                cluster_name: outbound|8081||ratelimit-svc.ai-inference.svc.cluster.local
            transport_api_version: V3
          enable_x_ratelimit_headers: DRAFT_VERSION_03
  - applyTo: VIRTUAL_HOST
    match:
      context: SIDECAR_INBOUND
      routeConfiguration:
        vhost:
          name: inbound|http|8080
    patch:
      operation: MERGE
      value:
        rate_limits:
        - actions:
          - request_headers:
              header_name: X-User-ID
              descriptor_key: user_id
          - request_headers:
              header_name: X-User-Tier
              descriptor_key: tier

The rate limit service backing this filter holds coarse per-user request limits (e.g., 60 requests/minute) as a first gate. The token-counting middleware behind it enforces the token budget. This layered approach means that clearly abusive request volumes are rejected at Envoy before the token middleware incurs the cost of body parsing and Redis round-trips.

failure_mode_deny: true is critical here. The request-count rate limiter is a security control, not a convenience feature. If the rate limit service is unreachable, the correct posture is to deny requests until it recovers, not to admit all traffic. This is the inverse of the token budget middleware’s Redis failure posture, which fails open because budget enforcement failure is recoverable via monitoring and post-hoc billing review. Request-count limiting failure is not: when the middleware behind it fails open, the Envoy filter is the only remaining gate against flooding.

5. TGI-Specific Configuration

HuggingFace TGI exposes different configuration surfaces than vLLM. The token middleware logic is identical — TGI speaks the same OpenAI-compatible API format. The server-side limits use TGI’s own flags:

# TGI Deployment (equivalent configuration for HuggingFace TGI)
- name: tgi
  image: ghcr.io/huggingface/text-generation-inference:2.0.4@sha256:b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3
  args:
  - --model-id=meta-llama/Meta-Llama-3-8B-Instruct
  - --max-concurrent-requests=32    # Closest analogue to vLLM --max-num-seqs (caps accepted client requests)
  - --max-input-length=16000        # Hard ceiling on input tokens per request
  - --max-total-tokens=20096        # max-input-length + max output tokens
  - --max-batch-total-tokens=65536  # Total tokens across all batched sequences
  - --waiting-served-ratio=1.2      # Tune continuous batching aggressiveness
  - --max-waiting-tokens=20         # Tokens generated before checking waiting queue
  - --hostname=0.0.0.0
  - --port=8080
  env:
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-credentials
        key: token

TGI’s --max-total-tokens is the sum of input and output limits — set it to MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS from your middleware configuration to keep server-side and middleware limits consistent.
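To keep the two layers from drifting apart, the TGI flags can be derived from the middleware constants rather than typed by hand. A minimal sketch follows; tgi_limit_args is a hypothetical helper and the constants repeat the values from the middleware listing.

```python
from typing import List

MAX_INPUT_TOKENS = 16_000
MAX_OUTPUT_TOKENS = 4_096


def tgi_limit_args(max_input: int, max_output: int) -> List[str]:
    """Build the TGI length flags from the middleware's per-request limits."""
    return [
        f"--max-input-length={max_input}",
        f"--max-total-tokens={max_input + max_output}",
    ]


args = tgi_limit_args(MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS)
print(args)
```

Generating the arguments in a deploy script or Helm template from one source of truth avoids the case where the middleware admits a request that the server then rejects.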

6. Token Usage Metrics and Alerting

Per-user token accounting data in Redis is operationally useful but not observable from outside the rate limiter pod. Prometheus metrics expose the same data for dashboards, SLO tracking, and alerting:

# metrics.py — Prometheus instrumentation for token rate limiter
from prometheus_client import Counter, Gauge, Histogram

# Total tokens consumed, broken down by user tier and model.
# Do NOT label by user_id — high-cardinality labels break Prometheus.
# Use separate time-series in a billing database for per-user accounting.
tokens_consumed = Counter(
    "llm_tokens_consumed_total",
    "Total tokens consumed by inference requests",
    ["user_tier", "model", "token_type"],  # token_type: input | output
)

# Request outcomes by rejection reason.
requests_rejected = Counter(
    "llm_requests_rejected_total",
    "Requests rejected by rate limiter",
    ["reason", "user_tier"],
    # reason values: input_too_long | output_limit_exceeded | budget_exceeded |
    #                concurrent_limit | queue_timeout | queue_full
)

# Estimated cost per request — sized to detect $5+ requests at p99.
request_cost_usd = Histogram(
    "llm_request_estimated_cost_usd",
    "Estimated USD cost of each inference request",
    ["user_tier"],
    buckets=[0.0001, 0.001, 0.01, 0.05, 0.10, 0.50, 1.0, 5.0, 10.0, 50.0],
)

# Active requests waiting in priority queue.
queue_depth = Gauge(
    "llm_priority_queue_depth",
    "Number of requests currently waiting in priority queue",
    ["user_tier"],
)

# Rolling hourly budget utilisation (sampled, not exact — Redis TTL makes exact
# aggregation complex; sample periodically for trend visibility).
budget_utilisation = Gauge(
    "llm_budget_utilisation_ratio",
    "Fraction of hourly token budget consumed (sampled across active users)",
    ["user_tier"],
)


async def record_request_outcome(
    user_tier: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    rejected: bool = False,
    rejection_reason: str | None = None,
) -> None:
    if rejected and rejection_reason:
        requests_rejected.labels(reason=rejection_reason, user_tier=user_tier).inc()
        return

    tokens_consumed.labels(
        user_tier=user_tier, model=model, token_type="input"
    ).inc(input_tokens)
    tokens_consumed.labels(
        user_tier=user_tier, model=model, token_type="output"
    ).inc(output_tokens)

    # Pricing: adjust these constants for your model and provider.
    # Example: Llama 3 8B self-hosted — estimate cost from GPU hourly rate / throughput.
    # Using OpenAI-equivalent pricing as a reference for cost distribution visibility.
    cost = (input_tokens * 0.003 + output_tokens * 0.015) / 1000
    request_cost_usd.labels(user_tier=user_tier).observe(cost)

Alerting on these metrics closes the operational loop:

groups:
- name: llm_rate_limiting
  rules:

  # High rejection rate → likely abuse or misconfigured limits
  - alert: LLMHighRejectionRate
    expr: |
      rate(llm_requests_rejected_total[5m]) > 0.5
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "LLM rate limiter rejecting >0.5 rps — check for abuse or limit misconfiguration"
      runbook: "https://wiki.internal/runbooks/llm-rate-limiting"

  # Budget exhaustion for premium users is an SLA signal
  - alert: LLMPremiumBudgetExhausted
    expr: |
      rate(llm_requests_rejected_total{reason="budget_exceeded", user_tier="premium"}[10m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Premium user hitting hourly token budget — review budget tier or increase limit"

  # p99 request cost above $5 — very large inputs reaching the model
  - alert: LLMHighCostRequestAtP99
    expr: |
      histogram_quantile(0.99, rate(llm_request_estimated_cost_usd_bucket[15m])) > 5.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "p99 LLM request cost >$5 — unusually large requests are reaching vLLM"

  # Queue depth growing → GPU capacity constrained
  - alert: LLMQueueDepthElevated
    expr: |
      sum(llm_priority_queue_depth) > 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "LLM priority queue depth >20 — GPU capacity may need scaling"

  # Redis errors in rate limiter — budget enforcement degraded
  - alert: LLMRateLimiterRedisErrors
    expr: |
      rate(redis_command_duration_seconds_count{status="error"}[5m]) > 0.1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis errors in LLM rate limiter — token budget enforcement is degraded"

Expected Behaviour

After deploying the token middleware and configuring vLLM with the above manifest, the system behaves as follows across the five control dimensions:

A free-tier user submitting a request with 95,000 input tokens receives an immediate HTTP 400 before the request touches a GPU:

{
  "error": "Input too long: 97340 estimated tokens (max 16000)",
  "estimated_tokens": 97340,
  "max_allowed": 16000
}

A free-tier user who has consumed 97,000 of their 100,000-token hourly budget and attempts a request estimated at 5,000 tokens (over budget) receives HTTP 429:

{
  "error": "Hourly token budget exceeded",
  "used": 97000,
  "limit": 100000,
  "reset_in_seconds": 1847,
  "tier": "free"
}

The reset_in_seconds field comes from Redis TTL on the budget key — it tells the client exactly when to retry, allowing well-behaved clients to implement backoff without polling.
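On the client side, a well-behaved caller can turn that field into a wait time. The function below is a sketch with a hypothetical name (retry_delay_seconds); it assumes the 429 body shape shown above and falls back to a default when the field is missing or negative, since Redis reports a negative TTL for keys that have no expiry or no longer exist.

```python
import json
from typing import Optional


def retry_delay_seconds(
    status_code: int, body_text: str, default: float = 5.0, cap: float = 3600.0
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the request should not be retried."""
    if status_code != 429:
        return None
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return default
    reset = payload.get("reset_in_seconds") if isinstance(payload, dict) else None
    if not isinstance(reset, (int, float)) or reset < 0:
        return default
    return min(float(reset), cap)


budget_wait = retry_delay_seconds(429, '{"reset_in_seconds": 1847}')
missing_ttl_wait = retry_delay_seconds(429, '{"reset_in_seconds": -1}')
not_retryable = retry_delay_seconds(400, "{}")
```

The concurrency-limit 429 carries no reset field, so it takes the default delay, which suits a condition that usually clears within seconds.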

When all 32 --max-num-seqs slots on vLLM are occupied, a premium user’s request waits up to 30 seconds in the priority queue before the queue service dispatches it to vLLM. A free-tier user in the same queue waits up to 10 seconds before receiving HTTP 503. Premium users are always ahead of standard and free users in the heap — a premium request arriving after 20 free-tier requests are queued will be dispatched before any of them.
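The ordering claim follows directly from how a min-heap compares tuples: tier first, arrival time second. The snippet below is a stripped-down illustration using plain tuples rather than the queue class from this article.

```python
import heapq

PREMIUM, STANDARD, FREE = 0, 1, 2  # lower value is popped first

heap = []
# Twenty free-tier requests arrive first...
for arrival in range(20):
    heapq.heappush(heap, (FREE, arrival, f"free-{arrival}"))
# ...then a single premium request arrives last.
heapq.heappush(heap, (PREMIUM, 20, "premium-0"))

dispatch_order = [heapq.heappop(heap)[2] for _ in range(3)]
print(dispatch_order)  # ['premium-0', 'free-0', 'free-1']
```

Strict priority also means sustained premium load can starve lower tiers, which is why the per-tier queue timeouts matter.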

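Prometheus metrics at /metrics on the token-limiter pod show per-tier token consumption rates, rejection reasons, and cost distribution. The llm_tokens_consumed_total counter advances on every successful request completion, not on submission. This requires the middleware to capture the actual response from vLLM, read the prompt_tokens and completion_tokens fields of the usage object in the OpenAI-compatible response body, and update metrics post-response. For streamed responses there is no token-count header to read, so the middleware has to stay in the response path and count tokens from the SSE chunks (or read a final usage chunk where the server version provides one).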
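For non-streaming responses, the actual counts can be read from the response body. The helper below is a sketch (extract_usage is a hypothetical name) that assumes an OpenAI-style usage object with prompt_tokens and completion_tokens, and returns None when the body does not carry one.

```python
import json
from typing import Optional, Tuple


def extract_usage(response_body: bytes) -> Optional[Tuple[int, int]]:
    """Return (prompt_tokens, completion_tokens) from a completion response, if present."""
    try:
        payload = json.loads(response_body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return None
    usage = payload.get("usage") if isinstance(payload, dict) else None
    if not isinstance(usage, dict):
        return None
    prompt = usage.get("prompt_tokens")
    completion = usage.get("completion_tokens")
    if isinstance(prompt, int) and isinstance(completion, int):
        return prompt, completion
    return None


sample = b'{"id": "cmpl-1", "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12}}'
counts = extract_usage(sample)
```

The returned pair is what would feed the input and output increments of the token counter once the response has completed.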

Trade-offs

Pre-request token estimation accuracy. tiktoken with cl100k_base is a fast approximation, not the model’s own tokenizer; this article budgets for an error band of up to about 15%. Llama 3 uses its own BPE tokenizer, so for production deployments where billing accuracy matters, load the actual model tokenizer via transformers.AutoTokenizer at middleware startup and use it for estimation. The trade-off is a heavier dependency plus extra startup time and memory, which you should measure for your model rather than assume. For cost control rather than exact billing, the 15% error band is acceptable when combined with a conservative buffer in budget calculations.
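A conservative buffer can be as simple as inflating the raw estimate by the assumed error band before comparing it with a limit. This is a sketch with a hypothetical helper name (conservative_estimate); the 15% default mirrors the error band discussed above and uses integer arithmetic to avoid floating-point rounding surprises.

```python
def conservative_estimate(raw_estimate: int, margin_percent: int = 15) -> int:
    """Inflate a token estimate by margin_percent, rounding up."""
    return (raw_estimate * (100 + margin_percent) + 99) // 100


MAX_INPUT_TOKENS = 16_000

padded = conservative_estimate(14_000)      # 16100
admitted = padded <= MAX_INPUT_TOKENS       # False: too close to the ceiling
```

The cost of this choice is that requests near the ceiling are refused even though the true count might have fit, which is usually the safer direction for an abuse control.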

Optimistic budget reservation vs. streaming uncertainty. The middleware reserves input_tokens + max_tokens upfront. For streaming responses, the actual output token count is known only when the stream ends. Reserving the maximum prevents the budget from being exceeded mid-stream, but it overstates consumption for requests where the model terminates early. A post-response correction (decrement the over-reserved amount) is possible but adds complexity: the middleware must remain in the request path for the duration of streaming and parse the token count from the usage object in the final chunk of vLLM’s SSE stream. The simpler approach (reserve maximum, don’t correct) errs on the side of conservative budget consumption, which is the right default for abuse prevention.
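The reserve-then-correct flow can be sketched with an in-memory stand-in for the Redis budget key. The class and method names are hypothetical; in a real deployment the check and increment in reserve would need to be atomic (for example, a single Redis script), which this sketch does not attempt.

```python
class TokenBudget:
    """In-memory stand-in for the Redis-backed hourly token budget."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def reserve(self, input_tokens: int, max_tokens: int) -> int:
        """Reserve the worst case upfront; return the reserved amount."""
        reserved = input_tokens + max_tokens
        if self.used + reserved > self.limit:
            raise RuntimeError("budget exceeded")
        self.used += reserved
        return reserved

    def settle(self, reserved: int, input_tokens: int, output_tokens: int) -> int:
        """Replace the reservation with actual usage; return the adjustment.

        A negative result is a refund of over-reserved tokens; a positive
        result means the request consumed more than was reserved.
        """
        delta = (input_tokens + output_tokens) - reserved
        self.used += delta
        return delta
```

Skipping settle entirely gives the simpler reserve-maximum behaviour; calling it after the stream closes gives the corrected accounting at the cost of staying in the request path.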

Priority queuing vs. free-tier experience. Premium users preempting free-tier queue positions means that during sustained GPU saturation, free-tier users may wait the full 10-second timeout and receive 503 errors repeatedly. This is intentional from a product and cost perspective but must be communicated clearly in API documentation. Clients that do not handle 503 gracefully will fail silently. Set the Retry-After header on all 503 responses so well-behaved clients know when to retry.

Middleware as a single point of failure. The token limiter sits in the critical path for all inference requests. Deploy it with minReadySeconds, PodDisruptionBudget limiting disruptions during node drains, and at least 2 replicas. Horizontal Pod Autoscaler on CPU/memory ensures it scales with traffic. Redis Sentinel or Redis Cluster eliminates the Redis single point of failure. Without these, a token limiter outage takes down all inference traffic.

Envoy ext_proc latency. The priority queue gRPC call from Envoy ext_proc adds round-trip latency on every request. For requests that pass through immediately (slots available), this is a 1-5ms overhead. For queued requests, the latency is dominated by queue wait time. Profile your ext_proc service response times under load before production deployment — a slow ext_proc service can become the bottleneck before vLLM is saturated.

Failure Modes

Request-count limiting instead of token-count limiting. If the only rate limiting in place is the Envoy request-count filter (60 requests/minute) without the token-budget middleware, an attacker can make one 16,000-token request every second, staying within the 60 rpm limit while consuming approximately 57.6 million input tokens per hour (16,000 × 3,600). No request is rejected. Cost accumulates invisibly. The token-budget middleware is not optional.

Redis fails open during abuse. The middleware’s Redis error handling fails open on budget enforcement to preserve availability. This is the correct trade-off in normal operation. Under an active billing-DoS attack, Redis failure at the same moment means all budget enforcement stops. Monitor Redis health aggressively — the LLMRateLimiterRedisErrors alert must page, not notify. If Redis is unavailable for more than 60 seconds during elevated traffic, consider failing closed temporarily via a feature flag.
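The fail-open default with a fail-closed escape hatch could be expressed as a small decision function called from the middleware's Redis error handler. Everything here is an assumption for illustration: the function name, the way outage duration and traffic level are measured, and the choice to trip automatically after 60 seconds rather than only on a manual flag.

```python
def admit_on_redis_error(outage_seconds: float,
                         traffic_elevated: bool,
                         force_fail_closed: bool = False,
                         max_open_seconds: float = 60.0) -> bool:
    """Decide whether to admit a request when the budget check cannot run.

    Returns True to fail open (admit) and False to fail closed (reject).
    """
    if force_fail_closed:
        return False  # operator flipped the feature flag
    if traffic_elevated and outage_seconds > max_open_seconds:
        return False  # long outage during a spike: stop admitting unmetered work
    return True  # default: preserve availability
```

Rejected requests in the fail-closed branch would normally receive a 503 rather than a 429, since the user has not actually exceeded anything.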

Output token budget accounting gap. The budget reserves input_tokens + max_tokens at request start. If vLLM generates output beyond max_tokens due to a bug or streaming edge case (rare but observed in some vLLM versions with certain samplers), actual consumption exceeds the reserved amount. The post-response correction described above catches this if implemented. Without it, every affected request consumes more than was reserved and the overrun is never recorded against the budget.

No cost monitoring → delayed discovery. Per-request Prometheus metrics are necessary but not sufficient. The llm_request_estimated_cost_usd histogram shows distribution but not cumulative daily cost, which is what the cloud billing invoice reflects. Implement a daily aggregation job that sums llm_tokens_consumed_total by tier and projects monthly cost. Alert when projected monthly cost exceeds budget thresholds. Without this, a sustained increase in usage (legitimate or abusive) is discovered on the billing invoice — a 30-day feedback cycle, not an operational one.
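A sketch of the projection step of such a job. It assumes the daily per-tier token totals have already been fetched, for instance with a PromQL query along the lines of sum by (tier) (increase(llm_tokens_consumed_total[1d])), where the tier label name is an assumption. The flat price per million tokens is a placeholder, not a real rate.

```python
from typing import Dict


def project_monthly_cost(daily_tokens_by_tier: Dict[str, int],
                         usd_per_million_tokens: float,
                         days_in_month: int = 30) -> float:
    """Project monthly spend from one day's token consumption."""
    daily_tokens = sum(daily_tokens_by_tier.values())
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost * days_in_month


def over_budget(projected_usd: float, monthly_budget_usd: float) -> bool:
    """True when the projection should trigger an alert."""
    return projected_usd > monthly_budget_usd
```

A single day is a noisy basis for a projection; averaging the last seven daily totals before calling project_monthly_cost smooths out weekday and weekend differences.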

Stale concurrent counters in Redis. The concurrent request counter (concurrent:{user_id}) is decremented in a finally block when the response stream closes. If the middleware pod restarts mid-request, the finally block does not execute and the counter is not decremented. The 300-second TTL on concurrent keys prevents permanent lock-out, but a user may be blocked for up to 5 minutes after a pod restart. Shorten the TTL to match your p99 request latency plus a safety margin — for most LLM requests, 120 seconds is sufficient.
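The effect of the TTL on a leaked counter can be seen in a small in-memory model with an injectable clock. This is not the middleware's code: the class is hypothetical, and refreshing the expiry on every increment is an assumption about how the concurrent:{user_id} key is maintained.

```python
class ExpiringCounter:
    """In-memory model of a Redis counter whose key carries a TTL."""

    def __init__(self, ttl_seconds: float, clock):
        self.ttl = ttl_seconds
        self.clock = clock        # callable returning the current time in seconds
        self.value = 0
        self.expires_at = 0.0

    def _expire_if_due(self) -> None:
        if self.value and self.clock() >= self.expires_at:
            self.value = 0        # Redis would delete the key at this point

    def incr(self) -> int:
        self._expire_if_due()
        self.value += 1
        self.expires_at = self.clock() + self.ttl  # refresh TTL on each increment
        return self.value

    def decr(self) -> int:
        self._expire_if_due()
        self.value = max(self.value - 1, 0)  # never go negative
        return self.value

    def get(self) -> int:
        self._expire_if_due()
        return self.value
```

If a request increments the counter and the pod dies before decr runs, get keeps returning 1 until the TTL elapses, which is exactly the lock-out window described above. The flip side of a short TTL is that a legitimate request running longer than the TTL stops being counted, so the value should stay above realistic worst-case latency.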