LLM Rate Limiting in Production: Token Budgets, Per-User Quotas, and Abuse Detection

Problem

Traditional API rate limiting counts requests. One request equals one unit. This assumption collapses with large language models. A single LLM request can range from 50 tokens (a quick classification) to 128,000 tokens (a full-context document analysis). Treating both as equal in a rate limiter means a user sending cheap classification requests gets throttled at the same threshold as a user running full-context summarizations that consume more than 2,500 times as many tokens.

The consequences are both financial and operational. A user who discovers they can send 60 requests per minute to a GPT-4 class model, each consuming 100K input tokens and generating 4K output tokens, runs up thousands of dollars in provider charges within an hour. Request-count rate limiting sees 60 requests per minute and considers it normal. Token-based metering sees 6.24 million tokens per minute and flags it immediately.
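
To make those numbers concrete, here is a back-of-envelope sketch using the per-token prices quoted in the threat model below (illustrative rates, not any specific provider's current pricing):

# abuse_cost_estimate.py
# Back-of-envelope cost of the abuse pattern described above.
# Prices are illustrative ($ per 1K tokens); substitute your provider's rates.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

requests_per_minute = 60
input_tokens = 100_000
output_tokens = 4_000

tokens_per_minute = requests_per_minute * (input_tokens + output_tokens)
cost_per_request = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
                 + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
cost_per_hour = cost_per_request * requests_per_minute * 60

print(f"{tokens_per_minute:,} tokens/minute")   # 6,240,000
print(f"${cost_per_hour:,.0f} per hour")        # ~$4,032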

The problem compounds across user tiers. Free-tier users should get different token budgets than enterprise customers. Teams running batch summarization jobs need higher burst capacity than interactive chatbot users. A single rate limiting policy applied uniformly either throttles legitimate enterprise users or leaves the door open for free-tier abuse.

Abuse detection adds another layer. Prompt injection probing follows a distinctive pattern: high input token counts (long, crafted prompts) paired with short output tokens (the model refuses or returns an error). Request-count rate limiting cannot distinguish this pattern from legitimate usage. Token-aware rate limiting can.

Target systems: API gateways proxying LLM inference endpoints. Internal model serving infrastructure (vLLM, TGI, Triton). Applications consuming external AI provider APIs (OpenAI, Anthropic, Cohere). Redis or equivalent state store for distributed token counters.

Threat Model

  • Adversary: External user abusing free or trial-tier access. Compromised internal service account making unchecked model calls. Automated bot scripting bulk inference requests. Attacker probing for prompt injection vulnerabilities via rapid prompt iteration.
  • Objective: Consume disproportionate compute resources without paying. Generate excessive provider API charges billed to the organization. Probe model behavior through high-volume prompt injection attempts. Denial-of-wallet attack that exhausts monthly budgets.
  • Blast radius: Without token-based rate limiting, a single user or compromised service can exhaust an entire team’s monthly AI budget in hours. At $0.01 per 1K input tokens and $0.03 per 1K output tokens for frontier models, 10 million tokens of abuse costs $100-300. At scale, with multiple abusers or a coordinated attack, monthly budgets of $10,000+ can be drained before anyone notices.

Configuration

Token Counting Middleware

Build token counting into the request/response pipeline. The gateway must inspect both the request (to count input tokens) and the response (to count output tokens) before updating rate limit counters.

# token_counter.py
# Middleware for counting tokens in LLM requests and responses
import tiktoken
import redis
from datetime import datetime

class TokenRateLimiter:
    def __init__(self, redis_url="redis://redis.llm-gateway.svc.cluster.local:6379"):
        self.redis = redis.Redis.from_url(redis_url, decode_responses=True)
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def count_input_tokens(self, request_body: dict) -> int:
        """Count tokens in the request payload."""
        messages = request_body.get("messages", [])
        total = 0
        for msg in messages:
            content = msg.get("content", "")
            if isinstance(content, str):
                total += len(self.encoder.encode(content))
            elif isinstance(content, list):
                # Multimodal: estimate image tokens separately
                for part in content:
                    if part.get("type") == "text":
                        total += len(self.encoder.encode(part["text"]))
                    elif part.get("type") == "image_url":
                        # Flat per-image estimate. OpenAI-style vision pricing
                        # charges 85 tokens for a low-detail image and ~765 for
                        # a 1024x1024 high-detail image; 765 is used here as a
                        # conservative estimate.
                        total += 765
        return total

    def check_and_update(self, user_id: str, tier: str,
                         input_tokens: int, model: str) -> dict:
        """Check if user is within token budget. Returns allow/deny decision."""
        limits = self._get_tier_limits(tier, model)
        now = datetime.utcnow()

        # Keys for different time windows
        minute_key = f"tokens:{user_id}:{model}:{now.strftime('%Y%m%d%H%M')}"
        hour_key = f"tokens:{user_id}:{model}:{now.strftime('%Y%m%d%H')}"
        day_key = f"tokens:{user_id}:{model}:{now.strftime('%Y%m%d')}"
        month_key = f"tokens:{user_id}:{model}:{now.strftime('%Y%m')}"

        pipe = self.redis.pipeline()
        pipe.get(minute_key)
        pipe.get(hour_key)
        pipe.get(day_key)
        pipe.get(month_key)
        results = pipe.execute()

        current = {
            "minute": int(results[0] or 0),
            "hour": int(results[1] or 0),
            "day": int(results[2] or 0),
            "month": int(results[3] or 0),
        }

        # Check all windows
        for window, key in [("minute", minute_key), ("hour", hour_key),
                            ("day", day_key), ("month", month_key)]:
            if current[window] + input_tokens > limits[window]:
                return {
                    "allowed": False,
                    "reason": f"{window}_limit_exceeded",
                    "current": current[window],
                    "limit": limits[window],
                    "retry_after": self._retry_after(window),
                }

        # Pre-deduct input tokens (output tokens added after response)
        pipe = self.redis.pipeline()
        for key, ttl in [(minute_key, 120), (hour_key, 7200),
                         (day_key, 172800), (month_key, 2764800)]:
            pipe.incrby(key, input_tokens)
            pipe.expire(key, ttl)
        pipe.execute()

        return {"allowed": True, "input_tokens_reserved": input_tokens}

    def record_output_tokens(self, user_id: str, model: str,
                             output_tokens: int):
        """Record output tokens after response is received."""
        now = datetime.utcnow()
        keys = [
            (f"tokens:{user_id}:{model}:{now.strftime('%Y%m%d%H%M')}", 120),
            (f"tokens:{user_id}:{model}:{now.strftime('%Y%m%d%H')}", 7200),
            (f"tokens:{user_id}:{model}:{now.strftime('%Y%m%d')}", 172800),
            (f"tokens:{user_id}:{model}:{now.strftime('%Y%m')}", 2764800),
        ]
        pipe = self.redis.pipeline()
        for key, ttl in keys:
            pipe.incrby(key, output_tokens)
            pipe.expire(key, ttl)
        pipe.execute()

    def _get_tier_limits(self, tier: str, model: str) -> dict:
        """Token limits per time window by tier and model cost class."""
        cost_multiplier = self._model_cost_multiplier(model)
        base_limits = {
            "free":       {"minute": 10000, "hour": 100000,
                           "day": 500000, "month": 5000000},
            "pro":        {"minute": 50000, "hour": 1000000,
                           "day": 10000000, "month": 100000000},
            "enterprise": {"minute": 200000, "hour": 5000000,
                           "day": 50000000, "month": 500000000},
        }
        limits = base_limits.get(tier, base_limits["free"])
        # Expensive models get tighter limits (inverse of cost)
        return {k: int(v / cost_multiplier) for k, v in limits.items()}

    def _model_cost_multiplier(self, model: str) -> float:
        """Higher multiplier = more expensive model = tighter token limit."""
        model_costs = {
            "gpt-4o": 1.0,
            "gpt-4o-mini": 0.2,
            "claude-sonnet": 0.6,
            "claude-opus": 3.0,
            "llama-3-70b": 0.15,
        }
        return model_costs.get(model, 1.0)

    def _retry_after(self, window: str) -> int:
        """Conservative retry hint: the full window length. The actual reset
        may come sooner, depending on position within the fixed window."""
        return {"minute": 60, "hour": 3600,
                "day": 86400, "month": 2592000}[window]

Kong Plugin for Token-Based Rate Limiting

# kong-token-rate-limiting.yaml
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: llm-token-rate-limit
  namespace: llm-gateway
plugin: pre-function
config:
  access:
    - |
      local redis = require "resty.redis"
      local cjson = require "cjson"

      local function count_tokens_estimate(text)
        -- Rough estimate: 1 token per 4 characters for English text
        -- Production systems should use a proper tokenizer
        if not text then return 0 end
        return math.ceil(#text / 4)
      end

      local body = kong.request.get_raw_body()
      if not body then return end

      local ok, parsed = pcall(cjson.decode, body)
      if not ok then return end

      local input_tokens = 0
      if parsed.messages then
        for _, msg in ipairs(parsed.messages) do
          if type(msg.content) == "string" then
            input_tokens = input_tokens + count_tokens_estimate(msg.content)
          end
        end
      end

      local consumer = kong.client.get_consumer()
      if not consumer then
        return kong.response.exit(401, { error = "authentication required" })
      end

      local red = redis:new()
      red:set_timeout(100)
      local ok, err = red:connect("redis.llm-gateway.svc.cluster.local", 6379)
      if not ok then
        kong.log.err("Redis connection failed: ", err)
        return  -- fail open if Redis is down (configurable)
      end

      local tier = consumer.custom_id or "free"
      local model = parsed.model or "default"
      local window_key = consumer.id .. ":" .. model .. ":"
                         .. os.date("%Y%m%d%H")

      local current = red:get(window_key)
      current = tonumber(current) or 0

      local limits = {
        free = 100000, pro = 1000000, enterprise = 5000000
      }
      local limit = limits[tier] or limits["free"]

      if current + input_tokens > limit then
        red:set_keepalive(10000, 100)
        return kong.response.exit(429, {
          error = "token_limit_exceeded",
          current_tokens = current,
          limit = limit,
          retry_after = 3600
        }, {
          ["Retry-After"] = "3600",
          ["X-RateLimit-Tokens-Remaining"] = tostring(limit - current),
          ["X-RateLimit-Tokens-Limit"] = tostring(limit)
        })
      end

      red:incrby(window_key, input_tokens)
      red:expire(window_key, 7200)
      red:set_keepalive(10000, 100)

      -- Buffer the upstream response so its body (and the usage field) can be
      -- read in a later phase, and store the input token count for that phase
      kong.service.request.enable_buffering()
      kong.ctx.shared.input_tokens = input_tokens
  log:
    - |
      -- Record output tokens from the response. Cosockets are disabled in the
      -- log phase, so the Redis write is deferred to a zero-delay timer.
      -- Reading the response body here assumes buffered proxying (see the
      -- access phase); with streaming responses, parse usage in a
      -- body_filter phase instead.
      local cjson = require "cjson"

      local body = kong.service.response.get_raw_body()
      if not body then return end

      local ok, parsed = pcall(cjson.decode, body)
      if not ok then return end

      local output_tokens = 0
      if parsed.usage then
        output_tokens = parsed.usage.completion_tokens or 0
      end
      if output_tokens == 0 then return end

      local consumer = kong.client.get_consumer()
      if not consumer then return end
      local model = parsed.model or "default"
      local window_key = consumer.id .. ":" .. model .. ":"
                         .. os.date("%Y%m%d%H")

      -- Request-scoped values are captured before the timer fires; the PDK
      -- request context is not available inside timer callbacks.
      ngx.timer.at(0, function(premature, key, tokens)
        if premature then return end
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(100)
        local connected, err = red:connect(
          "redis.llm-gateway.svc.cluster.local", 6379
        )
        if not connected then
          ngx.log(ngx.ERR, "Redis connection failed in log phase: ", err)
          return
        end
        red:incrby(key, tokens)
        red:set_keepalive(10000, 100)
      end, window_key, output_tokens)
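
A quick client-side check of the plugin's behaviour, as a sketch: the gateway URL and API key are placeholders, and the header names are the ones set by the Lua above on the rejection path.

# verify_token_limit.py
# Sketch: confirm the gateway returns 429 with the expected headers once the
# hourly token budget is exhausted. URL and key are placeholders.
import requests

GATEWAY_URL = "https://llm-gateway.example.com/v1/chat/completions"

resp = requests.post(
    GATEWAY_URL,
    headers={"Authorization": "Bearer <api-key>"},
    json={"model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "ping"}]},
    timeout=30,
)

if resp.status_code == 429:
    # The plugin sets these headers only on the rejection path
    print("Blocked:", resp.json().get("error"))
    print("Retry-After:", resp.headers.get("Retry-After"))
    print("Remaining:", resp.headers.get("X-RateLimit-Tokens-Remaining"))
    print("Limit:", resp.headers.get("X-RateLimit-Tokens-Limit"))
else:
    print("Allowed:", resp.status_code)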

Envoy ext_proc Filter for Token Counting

# envoy-token-ratelimit.yaml
# Envoy configuration using ext_proc for token-aware rate limiting
static_resources:
  listeners:
    - name: llm_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: llm_gateway
                route_config:
                  name: llm_routes
                  virtual_hosts:
                    - name: llm_backend
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/v1/chat/completions"
                          route:
                            cluster: llm_backend
                http_filters:
                  - name: envoy.filters.http.ext_proc
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
                      grpc_service:
                        envoy_grpc:
                          cluster_name: token_ratelimit_service
                      processing_mode:
                        request_body_mode: BUFFERED
                        response_body_mode: BUFFERED
                      message_timeout: 2s
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: token_ratelimit_service
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: token_ratelimit_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: token-ratelimit.llm-gateway.svc.cluster.local
                      port_value: 50051
    - name: llm_backend
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: llm_backend
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: inference.ml-serving.svc.cluster.local
                      port_value: 8000

Abuse Detection Rules

Detect prompt injection probing and other abuse patterns through token consumption analysis.

# prometheus-token-abuse-alerts.yaml
groups:
  - name: llm-token-abuse
    rules:
      # High input tokens with very short outputs: prompt injection probing
      - alert: LLMPromptInjectionProbing
        expr: >
          (
            sum by (user_id) (
              rate(llm_input_tokens_total[10m])
            )
            /
            (sum by (user_id) (
              rate(llm_output_tokens_total[10m])
            ) + 1)
          ) > 50
          and
          sum by (user_id) (
            rate(llm_request_total[10m])
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: >
            Possible prompt injection probing by {{ $labels.user_id }}.
            Input/output token ratio exceeds 50:1 over 10 minutes.
          runbook: >
            Review request logs for this user. High input with minimal
            output indicates the user is testing prompt boundaries.
            Consider blocking the user and reviewing submitted prompts.

      # Single user consuming disproportionate share of token budget
      - alert: LLMTokenBudgetHog
        expr: >
          sum by (user_id) (
            increase(llm_tokens_total[1h])
          )
          /
          sum(increase(llm_tokens_total[1h]))
          > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: >
            User {{ $labels.user_id }} consuming over 50 percent of
            total token budget in the last hour.

      # Rapid request pattern: many requests in tight succession
      - alert: LLMAutomatedAbuse
        expr: >
          sum by (user_id) (
            rate(llm_request_total[1m])
          ) * 60 > 10
          and
          stddev_over_time(
            (sum by (user_id) (
              rate(llm_request_total[1m])
            ) * 60)[10m:1m]
          ) < 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: >
            User {{ $labels.user_id }} sending requests at a constant
            automated rate (>10/min with low variance). Likely scripted.

      # Sudden model upgrade: user switches from cheap to expensive model
      - alert: LLMModelUpgradeAbuse
        expr: >
          sum by (user_id) (
            rate(llm_tokens_total{model=~".*opus.*|.*gpt-4o$"}[15m])
          )
          > 5 *
          avg_over_time(
            sum by (user_id) (
              rate(llm_tokens_total{model=~".*opus.*|.*gpt-4o$"}[15m])
            )[7d:1h]
          )
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: >
            User {{ $labels.user_id }} suddenly using 5x more tokens on
            expensive models compared to 7-day average.

Kubernetes Deployment with Redis State Store

# llm-ratelimit-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-rate-limiter
  namespace: llm-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-rate-limiter
  template:
    metadata:
      labels:
        app: llm-rate-limiter
    spec:
      containers:
        - name: rate-limiter
          image: llm-gateway/token-rate-limiter:1.4.0
          ports:
            - containerPort: 50051
              name: grpc
            - containerPort: 9090
              name: metrics
          env:
            - name: REDIS_URL
              value: "redis://redis-master.llm-gateway.svc.cluster.local:6379"
            - name: FAIL_OPEN
              value: "false"
            - name: LOG_LEVEL
              value: "info"
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            grpc:
              port: 50051
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            grpc:
              port: 50051
            initialDelaySeconds: 10
            periodSeconds: 15
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-master
  namespace: llm-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
      role: master
  template:
    metadata:
      labels:
        app: redis
        role: master
    spec:
      containers:
        - name: redis
          image: redis:7.2-alpine
          ports:
            - containerPort: 6379
          args:
            - "--maxmemory"
            - "512mb"
            - "--maxmemory-policy"
            - "volatile-ttl"
            - "--save"
            - ""
            - "--appendonly"
            - "no"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 640Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis-master
  namespace: llm-gateway
spec:
  selector:
    app: redis
    role: master
  ports:
    - port: 6379
      targetPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: llm-rate-limiter
  namespace: llm-gateway
spec:
  selector:
    app: llm-rate-limiter
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
    - name: metrics
      port: 9090
      targetPort: 9090

Prometheus Metrics for Token Usage

# prometheus-token-metrics-scrape.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-rate-limiter
  namespace: llm-gateway
spec:
  selector:
    matchLabels:
      app: llm-rate-limiter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      metricRelabelings:
        - sourceLabels: [__name__]
          # Keep all rate limiter metrics, including histogram series
          # (_bucket, _sum, _count), the decisions counter, and the budget gauge
          regex: "llm_.*"
          action: keep

# metrics.py
# Prometheus metrics exported by the token rate limiter
from prometheus_client import Counter, Histogram, Gauge

# Token counters by user, model, and tier
llm_input_tokens_total = Counter(
    "llm_input_tokens_total",
    "Total input tokens processed",
    ["user_id", "model", "tier", "org_id"]
)

llm_output_tokens_total = Counter(
    "llm_output_tokens_total",
    "Total output tokens generated",
    ["user_id", "model", "tier", "org_id"]
)

llm_tokens_total = Counter(
    "llm_tokens_total",
    "Total tokens (input + output)",
    ["user_id", "model", "tier", "org_id"]
)

# Request counter (referenced by the abuse detection alert rules)
llm_request_total = Counter(
    "llm_request_total",
    "Total LLM requests processed",
    ["user_id", "model", "tier", "org_id"]
)

# Rate limit decisions
llm_ratelimit_decisions_total = Counter(
    "llm_ratelimit_decisions_total",
    "Rate limit decisions",
    ["user_id", "tier", "decision", "reason"]
)

# Token usage per request (for percentile analysis)
llm_request_tokens = Histogram(
    "llm_request_tokens",
    "Tokens per request",
    ["model", "tier", "direction"],
    buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 200000]
)

# Current usage as percentage of limit
llm_budget_utilization = Gauge(
    "llm_budget_utilization_ratio",
    "Current token usage as ratio of limit (0-1)",
    ["user_id", "tier", "window"]
)

Expected Behaviour

  • Every LLM request is metered by input and output token count, not just request count
  • Free-tier users are limited to approximately 5 million tokens per month; pro-tier to 100 million; enterprise to 500 million
  • Expensive models (Claude Opus, GPT-4o) have proportionally tighter token limits than cheaper models (GPT-4o-mini, Llama 3 70B)
  • Users hitting their token limit receive HTTP 429 with a Retry-After header and their current usage in response headers
  • Prompt injection probing (high input, low output ratio) triggers an alert within 5 minutes
  • Automated scripting (constant request rate with low variance) is detected and flagged within 3 minutes
  • Token counters in Redis survive rate limiter pod restarts; because Redis persistence is disabled in this configuration, a Redis restart clears all in-flight counters, including the monthly window
  • Prometheus metrics enable per-user, per-model, per-tier dashboards showing token consumption trends

Trade-offs

  • Token counting at the gateway — Impact: adds 10-50 ms of latency for request body parsing and tokenization. Risk: the tokenizer library may count differently than the model provider. Mitigation: use the same tokenizer the provider uses (tiktoken for OpenAI, provider SDKs where available); accept 5-10% variance; the true count comes from the provider response and is reconciled.
  • Redis for distributed state — Impact: enables consistent rate limiting across multiple gateway replicas. Risk: a Redis failure blocks all rate limiting decisions. Mitigation: configure fail-open or fail-closed based on risk tolerance; run Redis with Sentinel for HA; have rate limiter pods cache last-known limits for 60 seconds on Redis failure.
  • Cost-aware model multipliers — Impact: prevents users from draining budgets on expensive models. Risk: multipliers need manual updates when provider pricing changes. Mitigation: store multipliers in a ConfigMap; alert when provider pricing pages change (external monitoring); review quarterly.
  • Pre-deducting input tokens — Impact: prevents users from exceeding limits by sending parallel requests. Risk: if the request fails upstream, the user loses tokens from their budget. Mitigation: implement a reconciliation job that credits back tokens for failed requests (HTTP 5xx from upstream) and run it every 5 minutes; see the sketch after this list.
  • Per-minute burst windows — Impact: allows short bursts of legitimate high-volume usage. Risk: overly tight minute limits frustrate interactive users. Mitigation: set minute limits high enough for 3-5 concurrent large requests; use the hourly and daily windows as the real enforcement boundaries.
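
A minimal sketch of the credit-back reconciliation job referenced above. It assumes the gateway pushes a JSON record to a Redis list (here named llm:failed_requests) whenever the upstream returns a 5xx; the list name and record shape are assumptions, while the token key layout matches TokenRateLimiter above.

# reconcile_failed_requests.py
# Sketch of the credit-back job for upstream failures (run every 5 minutes,
# e.g. as a Kubernetes CronJob). Assumed record format in "llm:failed_requests":
#   {"user_id": "...", "model": "...", "input_tokens": 1234, "ts": "2024-05-01T12:03:00"}
import json
import redis
from datetime import datetime

r = redis.Redis.from_url(
    "redis://redis-master.llm-gateway.svc.cluster.local:6379",
    decode_responses=True,
)

def credit_back(record: dict):
    ts = datetime.fromisoformat(record["ts"])
    user, model, tokens = record["user_id"], record["model"], record["input_tokens"]
    windows = [
        ts.strftime("%Y%m%d%H%M"),
        ts.strftime("%Y%m%d%H"),
        ts.strftime("%Y%m%d"),
        ts.strftime("%Y%m"),
    ]
    pipe = r.pipeline()
    for window in windows:
        key = f"tokens:{user}:{model}:{window}"
        # Best-effort: only credit counters that still exist and never push
        # them below zero; a small race with concurrent writers is acceptable
        if r.exists(key):
            pipe.decrby(key, min(tokens, int(r.get(key) or 0)))
    pipe.execute()

def run_once(batch_size: int = 1000):
    for _ in range(batch_size):
        raw = r.lpop("llm:failed_requests")
        if raw is None:
            break
        credit_back(json.loads(raw))

if __name__ == "__main__":
    run_once()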

Failure Modes

  • Redis unavailable — Symptom: the rate limiter cannot check or update token counters. Detection: Redis health check fails; the rate limiter logs connection errors. Recovery: if configured fail-open, requests pass without rate limiting (risky but maintains availability); if fail-closed, all LLM requests return 503 until Redis recovers; Sentinel-managed failover should complete within 30 seconds. A sketch of the fail-open/fail-closed wrapper follows this list.
  • Tokenizer mismatch — Symptom: gateway counts differ from provider counts; users get throttled early or late. Detection: compare gateway token counts against provider-reported usage.completion_tokens over time; alert if drift exceeds 15%. Recovery: update the tokenizer library; switch to provider-reported token counts as the source of truth for billing reconciliation.
  • Rate limit too aggressive for tier — Symptom: legitimate enterprise customers hit limits during normal batch processing. Detection: support tickets; HTTP 429 rate increases for enterprise consumers. Recovery: review actual usage patterns for the affected tier; adjust limits in the ConfigMap; consider dedicated rate limit profiles for batch workloads.
  • Abuse detection false positive — Symptom: a legitimate user is flagged as a prompt injection prober (for example, they send long documents for summarization). Detection: user complaint; review of flagged requests shows benign content. Recovery: tune the input/output ratio threshold; add exceptions for known document-processing use cases; require manual review before automated blocking.
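
A minimal sketch of the fail-open/fail-closed choice, keyed off the FAIL_OPEN environment variable from the deployment manifest above; the wrapper function and its return shape are illustrative.

# failure_policy.py
# Sketch: wrap the limiter check so Redis outages honor the FAIL_OPEN setting
# from the deployment manifest. The wrapper and its return shape are illustrative.
import os
import redis

FAIL_OPEN = os.environ.get("FAIL_OPEN", "false").lower() == "true"

def guarded_check(limiter, user_id, tier, input_tokens, model):
    try:
        return limiter.check_and_update(user_id, tier, input_tokens, model)
    except redis.exceptions.RedisError:
        if FAIL_OPEN:
            # Availability over enforcement: let the request through, unmetered
            return {"allowed": True, "degraded": True}
        # Enforcement over availability: surface a retryable error to the caller
        return {"allowed": False, "reason": "rate_limiter_unavailable",
                "retry_after": 30}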

When to Consider a Managed Alternative

Kong Konnect with the AI Gateway plugin provides built-in token-aware rate limiting without custom plugin development. Cloudflare AI Gateway offers per-user token metering and cost tracking as a managed service in front of major AI providers. Portkey provides a unified AI gateway with token budgets, caching, and abuse detection built in.
