Defending Prometheus Against High-Cardinality Label Injection and DoS
Problem
Prometheus keeps every active time series in memory, in the TSDB head block. Each unique combination of metric name and label values constitutes a separate time series. The metrics http_requests_total{method="GET", path="/api/users", status="200"} and http_requests_total{method="GET", path="/api/users/12345", status="200"} are two different time series, and if the path includes arbitrary user IDs, there are as many time series as there are unique user IDs.
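As a rough illustration of how label combinations multiply, the sketch below computes the upper bound on series for a single metric. The per-label value counts are hypothetical, not measurements from a real system.

```python
from math import prod


def series_count(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for http_requests_total
bounded = {"method": 5, "path": 20, "status": 10}
unbounded = {"method": 5, "path": 1_000_000, "status": 10}  # path carries user IDs

print(series_count(bounded))    # 1000
print(series_count(unbounded))  # 50000000
```

A single unbounded label dominates the product, which is why one careless label is enough to cause the problem.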
This is the high-cardinality problem, and it is typically encountered as an accidental configuration mistake. But it is also an intentional attack vector: any attacker who can influence the values that appear in metric labels — or who has write access to a Prometheus remote-write or pushgateway endpoint — can deliberately create unbounded numbers of time series, exhausting Prometheus memory until the process OOMs and the entire monitoring stack goes dark.
The attack surface is broader than commonly understood:
Pushgateway exposure. The Prometheus Pushgateway has no authentication by default. Any process that can reach it can push arbitrary metrics with arbitrary label values. In a Kubernetes cluster where pods have network access to the monitoring namespace (common without NetworkPolicy), any compromised pod can write metrics with 10,000 unique label combinations, each allocating memory in Prometheus.
Remote-write relay injection. Prometheus remote-write ingestion endpoints (Thanos Receiver, Grafana Mimir, Cortex) accept whatever series a client sends unless per-tenant series limits are configured and sized sensibly. An attacker who can send HTTP POST requests to these endpoints can inject arbitrary time series. Many remote-write endpoints are protected only by network ACLs, not authentication.
Instrumentation code injection. In applications that create metric labels from user input (HTTP request paths, usernames, tenant identifiers), a high-cardinality attack can be triggered by sending requests with many unique values — without any direct access to the metrics infrastructure. The application faithfully creates a new time series for each unique label value.
Alertmanager and recording rule cascades. High cardinality doesn’t just affect Prometheus memory — it propagates through the stack. Recording rules that process high-cardinality metrics produce many output series. Alertmanager receives many alerts. The evaluation loop slows down. The entire observability stack degrades simultaneously.
The business impact is significant: the monitoring system goes offline exactly when it is needed most — during or after an incident. If the attack coincides with an active breach, the attacker has blinded the defenders’ primary visibility tool.
Target systems: Prometheus 2.x on any deployment; Thanos, Cortex, Grafana Mimir remote-write endpoints; Prometheus Pushgateway; any application using Prometheus client libraries with high-cardinality label patterns.
Threat Model
Adversary 1 — Pushgateway bombardment from compromised pod. Access level: code execution in a pod with network access to the monitoring namespace. Objective: POST metrics with 100,000 unique label values to the Pushgateway, causing Prometheus to allocate ~1 GB of memory per 100,000 time series, OOM the Prometheus pod, and eliminate monitoring visibility.
Adversary 2 — Application-layer cardinality injection. Access level: ability to send HTTP requests to a monitored application. Objective: send requests with unique path parameters (UUIDs, tokens) that become metric labels, creating unbounded time series without touching the metrics infrastructure directly.
Adversary 3 — Remote-write endpoint injection. Access level: network access to a Thanos Receiver or Mimir ingestion endpoint (common in multi-tenant environments). Objective: inject millions of time series via remote-write, overwhelming the TSDB and causing ingestion backpressure that stops legitimate metrics from being recorded.
Adversary 4 — Metrics scrape endpoint poisoning. Access level: ability to respond to a Prometheus scrape (e.g., via ARP spoofing or a compromised exporter). Objective: return a metrics payload with 50,000 unique label values on each scrape interval, exhausting Prometheus memory over time.
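To put the adversary objectives in proportion, here is a back-of-envelope estimate in Python. It assumes roughly 10 KB per active series, which is simply the "~1 GB per 100,000 time series" figure above divided out. The real cost depends on label sizes, churn, and Prometheus version, and the series counts for Adversaries 3 and 4 are illustrative assumptions.

```python
# Assumption: per-series cost derived from "~1 GB per 100,000 time series".
BYTES_PER_SERIES = 1_000_000_000 / 100_000  # 10,000 bytes


def estimated_memory_gb(series: int, bytes_per_series: float = BYTES_PER_SERIES) -> float:
    """Rough head memory needed for the given number of active series, in GB."""
    return series * bytes_per_series / 1_000_000_000


scenarios = [
    ("Adversary 1: Pushgateway, 100,000 series", 100_000),
    ("Adversary 3: remote-write, 2,000,000 series (assumed)", 2_000_000),
    ("Adversary 4: 50,000 new series x 20 scrapes (assumed)", 50_000 * 20),
]
for name, series in scenarios:
    print(f"{name}: ~{estimated_memory_gb(series):.1f} GB")
```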
Configuration / Implementation
Step 1 — Set per-target and global time series limits
# prometheus.yml: enforce per-scrape limits
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  # Default sample limit for every scrape config that does not set its own
  # (limits in the global section require Prometheus 2.45+).
  # A sample limit bounds each scrape but does not stop series accumulating
  # across scrapes and targets, so use it together with cardinality alerts.
  sample_limit: 5000
scrape_configs:
  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    # Per-target cap on samples per scrape, counted after metric relabelling.
    # A scrape that exceeds it fails entirely; nothing is partially ingested.
    sample_limit: 10000
    # Maximum length of label names and label values
    label_name_length_limit: 256
    label_value_length_limit: 1024
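The all-or-nothing behaviour of sample_limit is easy to misread as truncation. The following Python sketch models it under simplifying assumptions: it counts sample lines in a text exposition payload and accepts or rejects the whole scrape. The helper names and the synthetic payload are hypothetical. Prometheus itself applies the limit after metric relabelling, which this sketch does not model.

```python
def count_samples(exposition: str) -> int:
    """Count sample lines in text exposition format, skipping comments and blank lines."""
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def accept_scrape(exposition: str, sample_limit: int) -> bool:
    """Accept the scrape only if it fits the limit; a limit of 0 means no limit."""
    if sample_limit == 0:
        return True
    return count_samples(exposition) <= sample_limit


def synthetic_payload(n: int) -> str:
    """Build a payload with n series that differ only in a path label."""
    lines = ["# TYPE http_requests_total counter"]
    lines += [f'http_requests_total{{path="/api/users/{i}"}} 1' for i in range(n)]
    return "\n".join(lines) + "\n"


print(accept_scrape(synthetic_payload(100), sample_limit=10_000))     # True
print(accept_scrape(synthetic_payload(50_000), sample_limit=10_000))  # False
```

Because a rejected scrape loses the legitimate samples as well, the limit needs headroom above the normal sample count of the target.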
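Prometheus has no storage-level cap on active series. The storage section of prometheus.yml only tunes TSDB behaviour, such as the out-of-order ingestion window, so the overall series count has to be bounded by the scrape limits above and watched with alerts:
# prometheus.yml
storage:
  tsdb:
    # Accept samples up to 10 minutes older than the newest sample in a series.
    # This setting does not limit cardinality.
    out_of_order_time_window: 10m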
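Start Prometheus with flags that bound query cost and connection count. None of these limits the number of series:
prometheus \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --query.max-samples=50000000 \
  --web.max-connections=512
# --query.max-samples caps the samples a single query may load into memory,
# which limits the damage an expensive query over high-cardinality data can do.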
Step 2 — Authenticate and rate-limit the Pushgateway
# Deploy Pushgateway behind nginx with authentication
cat > /etc/nginx/conf.d/pushgateway.conf << 'EOF'
# limit_req_zone is only valid in the http context. Files in conf.d are
# included there, so the directive must sit outside the server block.
limit_req_zone $binary_remote_addr zone=pushgw:10m rate=10r/m;
upstream pushgateway {
    server localhost:9091;
}
server {
    listen 9092;
    # Basic authentication
    auth_basic "Pushgateway";
    auth_basic_user_file /etc/nginx/.htpasswd;
    # Rate limit metric pushes; answer 429 instead of the default 503
    limit_req zone=pushgw burst=5;
    limit_req_status 429;
    # Limit request body size (prevent huge metric payloads)
    client_max_body_size 1m;
    location / {
        proxy_pass http://pushgateway;
    }
}
EOF
# Generate credentials
htpasswd -c /etc/nginx/.htpasswd monitoring-writer
# Validate the configuration, then reload
nginx -t && systemctl reload nginx
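Once the proxy is in place, clients have to send credentials with every push. Below is a minimal Python sketch of building such a request with the standard library. The function name, job name, metric, and password are hypothetical. The URL path follows the Pushgateway convention of /metrics/job/<job>.

```python
import base64
import urllib.parse
import urllib.request


def build_push_request(
    base_url: str, job: str, body: str, user: str, password: str
) -> urllib.request.Request:
    """Build (but do not send) an authenticated push for the nginx-fronted Pushgateway."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/metrics/job/{urllib.parse.quote(job, safe='')}",
        data=body.encode(),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "text/plain; version=0.0.4",
        },
    )


req = build_push_request(
    "http://pushgateway:9092",
    "nightly_backup",                                      # hypothetical job name
    "backup_last_success_timestamp_seconds 1700000000\n",  # hypothetical metric
    "monitoring-writer",
    "example-password",                                    # placeholder credential
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=5)
```

Basic authentication sends the password in a reversible encoding, so the proxy should also terminate TLS before this is used across untrusted networks.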
Apply NetworkPolicy to restrict who can reach the Pushgateway:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pushgateway-access-control
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: pushgateway
  policyTypes: [Ingress]
  ingress:
    # Only allow from approved namespaces
    - from:
        - namespaceSelector:
            matchLabels:
              monitoring-write-access: "true"
      ports:
        - port: 9091
          protocol: TCP
    # Allow Prometheus to scrape
    - from:
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - port: 9091
Step 3 — Monitor cardinality and alert on explosion
# Prometheus alerting rules for cardinality monitoring
groups:
  - name: prometheus_cardinality
    rules:
      # Alert when total time series count grows rapidly.
      # prometheus_tsdb_head_series is a gauge, so use delta() rather than rate().
      - alert: PrometheusHighCardinalityGrowth
        expr: |
          delta(prometheus_tsdb_head_series[5m]) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus series count growing rapidly: {{ $value | humanize }} new series/5m"
          description: "Check for high-cardinality metrics being scraped or pushed"
      # Alert when a single job creates too many series.
      # scrape_samples_post_metric_relabeling is recorded per target on every scrape.
      - alert: JobHighCardinality
        expr: |
          sum by (job) (scrape_samples_post_metric_relabeling) > 50000
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} has {{ $value | humanize }} active series"
      # Alert when Prometheus memory usage is high (cardinality indicator)
      - alert: PrometheusHighMemoryUsage
        expr: |
          process_resident_memory_bytes{job="prometheus"} /
          (1024 * 1024 * 1024) > 8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus using {{ $value | humanize }}GB of memory; possible cardinality attack"
      # Alert when scrape limit is being hit.
      # This counter is exported by Prometheus about itself and has no per-job
      # breakdown; the failing target shows the cause in its last scrape error.
      - alert: ScrapeSampleLimitHit
        expr: |
          increase(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Prometheus {{ $labels.instance }}: a scrape exceeded sample_limit; review cardinality"
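The growth alert boils down to comparing the head series count at the two ends of a five-minute window against a threshold of 10,000. As a sketch of that arithmetic in Python, with made-up sample values and a hypothetical helper name:

```python
def series_growth(samples: list[tuple[float, float]], window_s: float = 300) -> float:
    """Change in a gauge between the newest sample and the oldest one inside the window.

    samples are (unix_timestamp, value) pairs sorted by timestamp.
    """
    newest_ts, newest_value = samples[-1]
    in_window = [value for ts, value in samples if newest_ts - ts <= window_s]
    return newest_value - in_window[0]


# Made-up head series counts taken over five minutes
observed = [(0, 200_000), (150, 204_000), (300, 215_000)]
growth = series_growth(observed)
print(growth, growth > 10_000)  # 15000 True
```

Because the series count is a gauge that can also fall when series are garbage-collected, the difference can be negative. Only sustained positive growth should page anyone.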
Step 4 — Identify high-cardinality metrics in existing data
# Find the top-cardinality metrics in your Prometheus instance
# Query the TSDB via the Prometheus API
curl -s "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=topk(20, count by (__name__)({__name__!=""}))' | \
jq -r '.data.result[] | "\(.value[1]) \(.metric.__name__)"' | \
sort -rn | head -20
# Find metrics with high label cardinality
curl -s "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=sort_desc(count by (__name__, job)(count by (job, __name__, pod)({__name__!=""})))' | \
jq '.data.result[:10]'
# Use mimirtool for deeper cardinality analysis
# Install: go install github.com/grafana/mimir/cmd/mimirtool@latest
mimirtool analyze prometheus \
  --address=http://prometheus:9090 \
  --output cardinality-report.json
# The field names below assume the report lists per-metric "metric" and "count"
# entries; check them against the output of your mimirtool version.
jq '(.in_use_metric_counts + .additional_metric_counts) | sort_by(-.count) | .[:20] |
  .[] | {metric: .metric, series: .count}' \
  cardinality-report.json
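The first curl and jq pipeline above can also be expressed in Python, which is convenient when the ranking feeds a report or a CI check. This sketch parses the instant-query response format of the Prometheus HTTP API. The sample response is fabricated for illustration, and top_metrics is a hypothetical helper name.

```python
def top_metrics(response: dict, limit: int = 20) -> list[tuple[str, int]]:
    """Rank metric names by series count from a /api/v1/query instant-vector response."""
    rows = [
        (item["metric"].get("__name__", ""), int(item["value"][1]))
        for item in response["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Fabricated response in the shape returned for
# topk(20, count by (__name__)({__name__!=""}))
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up"}, "value": [1700000000, "42"]},
            {"metric": {"__name__": "http_requests_total"}, "value": [1700000000, "91250"]},
            {"metric": {"__name__": "process_cpu_seconds_total"}, "value": [1700000000, "42"]},
        ],
    },
}

for name, count in top_metrics(sample_response, limit=2):
    print(count, name)
```

Sample values arrive as strings in the API response, hence the explicit int conversion before sorting.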
Step 5 — Add cardinality controls to application instrumentation
# Python: implement cardinality-safe metric labels
import re

from prometheus_client import Counter

# BAD: Using full path as label (unbounded cardinality)
# requests_total = Counter('http_requests_total', 'Total requests', ['path'])
# requests_total.labels(path=request.path).inc()  # Unique per URL!

# GOOD: Normalise path to a known pattern set.
# Entries are in normalised form, so routes with an ID segment are listed with :id.
KNOWN_PATHS = frozenset({
    "/api/users",
    "/api/users/:id",
    "/api/products",
    "/api/products/:id",
    "/health",
})

# A dynamic segment is a whole path segment that is numeric or looks like a UUID/hex ID
_NUMERIC_ID = re.compile(r'/\d+(?=/|$)')
_HEX_ID = re.compile(r'/[0-9a-fA-F-]{8,36}(?=/|$)')


def normalise_path(path: str) -> str:
    """Replace dynamic path segments with a placeholder."""
    normalised = _NUMERIC_ID.sub('/:id', path)
    normalised = _HEX_ID.sub('/:id', normalised)
    # Anything outside the known set collapses into a single catch-all value
    return normalised if normalised in KNOWN_PATHS else "/other"


# GOOD: Cardinality-safe instrumentation
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)


def track_request(method: str, path: str, status: int):
    requests_total.labels(
        method=method,
        path=normalise_path(path),  # Bounded cardinality
        status=str(status)
    ).inc()
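Path normalisation covers labels with a known shape. For labels such as tenant identifiers, where no pattern set exists, one possible complement is a guard that caps the number of distinct values a label may take. The class below is a hypothetical sketch using only the standard library. It is not part of prometheus_client and it is not thread-safe.

```python
class LabelValueLimiter:
    """Caps the number of distinct values a label may take; overflow maps to a fallback."""

    def __init__(self, max_values: int, fallback: str = "other") -> None:
        self.max_values = max_values
        self.fallback = fallback
        self._seen: set[str] = set()

    def bound(self, value: str) -> str:
        """Return the value if it is already admitted or a slot is free, else the fallback."""
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.fallback


tenants = LabelValueLimiter(max_values=3)
print([tenants.bound(t) for t in ["a", "b", "a", "c", "d", "e", "b"]])
# ['a', 'b', 'a', 'c', 'other', 'other', 'b']
```

Admission here is first come, first served, so an attacker who arrives early can occupy the slots with junk values. Where the legitimate values are known in advance, a fixed allowlist is the safer choice.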
Expected Behaviour
| Signal | Before hardening | After hardening |
|---|---|---|
| sample_limit per scrape target | Not set (unlimited) | 10,000 samples per scrape; a scrape over the limit fails entirely |
| Pushgateway accepts unauthenticated POSTs | Yes | nginx proxy requires credentials |
| Time series growth rate alert | Not configured | Alert fires when >10,000 new series/5min |
| Compromised pod pushes 100,000 label values | Prometheus OOMs | NetworkPolicy blocks pod from reaching Pushgateway |
| Application creates per-user metrics | Unbounded cardinality | Path normalisation caps label values |
Verification:
# Check current series count
curl -s "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=prometheus_tsdb_head_series' | \
jq '.data.result[0].value[1]'
# Verify sample_limit is enforced: a target over the limit reports health "down"
# and a lastError that mentions the sample limit
curl -s "http://prometheus:9090/api/v1/targets" | \
  jq '.data.activeTargets[] | select(.scrapePool == "application") | {health, lastError}'
# Verify Pushgateway rejects unauthenticated requests (expect 401)
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://pushgateway:9092/metrics/job/test
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Sample limit per scrape | Hard cap on per-target series growth | Scrapes that hit the limit are entirely rejected; missing data | Set limit to 2× expected series count; alert on limit hits so limits can be adjusted |
| Pushgateway authentication | Prevents anonymous cardinality injection | All Pushgateway clients must authenticate | Use service account tokens or an internal CA; automate credential distribution |
| Path normalisation in application | Bounded metric cardinality | May lose some debugging precision (all IDs become :id) | Keep the full path in logs or traces rather than in a metric label; use exemplars to link metrics to them |
| NetworkPolicy restricting Pushgateway access | Only approved namespaces can push | New services need namespace label to push metrics | Document the onboarding process; add the label as part of namespace creation automation |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Sample limit set too low | Legitimate metrics dropped; dashboards show gaps | Alert: ScrapeSampleLimitHit; missing series in dashboards | Increase sample limit for the affected job; investigate why the target has more series than expected |
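| Cardinality attack before limits are set | Prometheus OOMs; all monitoring goes dark | None from Prometheus itself, since monitoring is down; only an external availability check on Prometheus would notice | Block the offending source first (remove the target or drop the metric with metric_relabel_configs) and add limits. WAL replay on restart reloads the injected series, so temporarily raise the memory limit or, accepting the loss of recent uncompacted data, move the WAL directory aside. Shortening --storage.tsdb.retention.time does not shrink the in-memory head |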
| Path normalisation changes a metric that was being alerted on | Alert stops firing because label value changed | Alert test fails; on-call investigates | Update alert queries to use the normalised path format; document metric label conventions |
| Rate limit on Pushgateway breaks legitimate batch job | Batch job metrics not received; the nginx proxy rejects pushes (503 by default, 429 if limit_req_status 429 is set) | Job completion metrics missing; batch job succeeds but no telemetry | Increase rate limit for authenticated clients from batch job service accounts |
Related Articles
- Prometheus Security Metrics — securing the Prometheus instance itself including authentication and TLS
- Prometheus Remote Write Security — securing the remote-write path that is also vulnerable to cardinality injection
- Thanos Prometheus Multitenancy Security — multi-tenant Prometheus deployments where cardinality attacks from one tenant affect others
- Alert Correlation — correlating cardinality alerts with concurrent security incidents
- Security SLOs — defining observability availability as a security SLO so cardinality-based outages are tracked