VictoriaMetrics Security Hardening: Authentication, TLS, Tenant Isolation, and Data Protection
Problem
VictoriaMetrics is a drop-in Prometheus-compatible time-series database designed for high ingestion throughput and low memory consumption. Its cluster mode separates write (vminsert), read (vmselect), and storage (vmstorage) concerns across independently scalable components. Its single-node binary is a common replacement for Prometheus in resource-constrained environments. Both modes share a structural security problem: neither enables authentication by default.
The default VictoriaMetrics deployment accepts all read and write requests without any authentication challenge. An unauthenticated attacker who reaches the vmselect HTTP port (8481) can execute arbitrary MetricsQL queries against all stored data. An attacker who reaches vminsert (8480) can inject arbitrary metrics: poisoning dashboards, triggering false alerts, and introducing cardinality explosions that exhaust storage. vmstorage serves its HTTP API on 8482 and accepts vminsert and vmselect connections on 8400 and 8401 over a native binary protocol without TLS, so intra-cluster traffic between vminsert/vmselect and vmstorage carries metric data in plaintext.
Additional risks compound the authentication gap:
- No tenant enforcement at the query layer. VictoriaMetrics cluster mode uses URL-based tenancy (/insert/0/ and /select/0/), where 0 is the tenant ID. Any client can change the path segment to query a different tenant’s metrics. Without a proxy that enforces tenant routing from authenticated identity, URL-based tenancy is a naming convention, not a security boundary.
- deleteRange API has no access control. The HTTP API at /api/v1/admin/tsdb/delete_series (served by vmselect under /delete/<accountID>/prometheus/ in cluster mode) permanently deletes metric series matching a selector. In a default deployment, any client that can reach vmselect’s HTTP port can delete metric data, including security-relevant telemetry, without authentication.
- vmagent scrape credentials may be stored in plaintext. vmagent, VictoriaMetrics’s scrape agent, stores scrape target credentials in its configuration file. If that file is not protected appropriately, or if the vmagent HTTP API is exposed, credentials for scrape targets leak.
- Metric labels can contain PII. Labels are freeform. Any label with high cardinality and sensitive values — user IDs, email addresses, session tokens embedded in URL path labels — constitutes both a privacy risk and a cardinality bomb that degrades performance.
- Backups to object storage default to unencrypted. vmbackup transfers metric data snapshots to S3 or GCS. Without explicit encryption configuration, backup objects are stored in plaintext.
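To make the tenancy point concrete, the sketch below parses the tenant segment from a cluster-mode path and shows that swapping it is a plain string rewrite with no credential involved. It is an illustration only: the helper names parse_tenant_path and swap_tenant are hypothetical, and the accountID[:projectID] segment format is assumed from VictoriaMetrics cluster URL conventions.

```python
import re
from typing import NamedTuple, Optional


class TenantPath(NamedTuple):
    component: str   # "insert", "select" or "delete"
    account_id: int
    project_id: int  # 0 when the URL omits the project part
    suffix: str      # everything after the tenant segment


# Assumed path shape: /<component>/<accountID>[:<projectID>]/<suffix>
_TENANT_RE = re.compile(r"^/(insert|select|delete)/(\d+)(?::(\d+))?(/.*)?$")


def parse_tenant_path(path: str) -> Optional[TenantPath]:
    """Split a cluster-mode path into component, tenant and suffix."""
    match = _TENANT_RE.match(path)
    if match is None:
        return None
    component, account, project, suffix = match.groups()
    return TenantPath(component, int(account), int(project or 0), suffix or "")


def swap_tenant(path: str, account_id: int) -> str:
    """Rewrite the tenant segment; nothing in the URL ties it to an identity."""
    parsed = parse_tenant_path(path)
    if parsed is None:
        raise ValueError(f"not a tenant-scoped path: {path!r}")
    return f"/{parsed.component}/{account_id}{parsed.suffix}"
```

Calling swap_tenant on a tenant 1 query path with account_id=2 yields a valid tenant 2 path, which is why the tenant must be derived from the authenticated identity rather than from the request.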
Target systems: VictoriaMetrics v1.100+ single-node and cluster mode; vmagent v1.100+; vmauth v1.100+; vmgateway v1.100+; vmbackup v1.100+; Kubernetes deployments via victoria-metrics-k8s-stack Helm chart.
Threat Model
- Adversary 1 (Unauthenticated metric read): An attacker who reaches the vmselect HTTP port executes {__name__=~".+"} and enumerates all stored metric series, including series whose labels contain internal service names, user identifiers, or API endpoint paths. No authentication is required in the default configuration.
- Adversary 2 (Cross-tenant query via URL manipulation): A developer in team A changes the tenant path segment in a MetricsQL query URL from /select/1/ to /select/2/ and reads team B’s metrics. Without vmauth enforcing tenant routing from authenticated identity, this URL substitution succeeds.
- Adversary 3 (Metric injection via vminsert): An attacker sends crafted Prometheus remote-write traffic to vminsert, injecting metrics that match existing series names with modified values. Alert thresholds tied to those series fire or are suppressed based on attacker-controlled data. The attacker can also inject series with high-cardinality label combinations that exhaust vmstorage memory.
- Adversary 4 (deleteRange data destruction): An attacker who reaches vmselect’s admin API sends a request to /api/v1/admin/tsdb/delete_series?match[]={job="security-events"}, permanently destroying security event telemetry without authentication. The deletion covers historical data, removing forensic evidence of prior activity.
- Adversary 5 (Native protocol interception): An attacker with network access to the intra-cluster network captures plaintext vminsert-to-vmstorage or vmselect-to-vmstorage traffic, reconstructing metric series and label sets from the wire. In Kubernetes, a compromised pod with NET_ADMIN or network packet capture capability achieves this.
- Adversary 6 (Backup data exfiltration): An attacker obtains read access to the S3 bucket used by vmbackup and downloads metric snapshot objects containing months of production telemetry, including label-embedded PII.
- Access level: Adversaries 1–4 need network access to HTTP ports. Adversary 5 needs pod-level network access in the cluster. Adversary 6 needs S3 credentials or a misconfigured bucket ACL.
- Objective: Extract sensitive metric data, inject false telemetry, destroy audit trail, deny monitoring service.
- Blast radius: Full metric read exposes infrastructure topology, request rates, error patterns, and any PII embedded in labels across all services and all tenants stored in VictoriaMetrics.
Configuration
Step 1: vmauth as Authentication Proxy
vmauth is VictoriaMetrics’s official authentication and routing proxy. Deploy it as the sole external-facing entry point for all read and write traffic. All vminsert and vmselect ports should be unreachable from outside the cluster directly — only vmauth should be reachable.
# vmauth-config.yaml: authentication and per-tenant routing.
# Credentials use the %{ENV_VAR} placeholder syntax and are read from the environment.
users:
  # Bearer token for vmagent remote write (matches -remoteWrite.bearerTokenFile): insert only.
  - bearer_token: "%{VMAGENT_WRITER_TOKEN}"
    # vmauth appends the request path to url_prefix, so the write path is
    # matched in src_paths rather than embedded in url_prefix. Any other path
    # has no route and is rejected, so this credential cannot read.
    url_map:
      - src_paths:
          - "/insert/0/prometheus/api/v1/write"
        url_prefix: "http://vminsert:8480"
  # Bearer token for Grafana: select only, restricted to tenant 0.
  # The Grafana datasource URL is the vmauth root; it cannot write or access other tenants.
  - bearer_token: "%{GRAFANA_BEARER_TOKEN}"
    url_prefix: "http://vmselect:8481/select/0/prometheus"
  # Per-tenant routing: team A's token routes to tenant 1.
  # Team A can only query their own tenant data.
  - bearer_token: "%{TEAM_A_BEARER_TOKEN}"
    url_prefix: "http://vmselect:8481/select/1/prometheus"
  # Per-tenant routing: team B's token routes to tenant 2.
  - bearer_token: "%{TEAM_B_BEARER_TOKEN}"
    url_prefix: "http://vmselect:8481/select/2/prometheus"
  # Admin user: access to all tenants and admin APIs.
  # Requires a separate, tightly controlled credential.
  - username: "admin"
    password: "%{VM_ADMIN_PASSWORD}"
    # Route by path. Listing both backends under one url_prefix would
    # load-balance requests between vmselect and vminsert.
    url_map:
      - src_paths: ["/select/.+", "/delete/.+"]
        url_prefix: "http://vmselect:8481"
      - src_paths: ["/insert/.+"]
        url_prefix: "http://vminsert:8480"
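The routing behaviour this configuration relies on can be modelled in a few lines. The sketch below is not vmauth code: route and the two user dictionaries are hypothetical stand-ins, and it assumes a single url_prefix string per entry, src_paths regexes that must match the whole request path, and the request path being appended to the chosen url_prefix.

```python
import re
from typing import Optional


def route(user: dict, request_path: str) -> Optional[str]:
    """Return the backend URL for request_path, or None when no route exists."""
    # url_map entries are checked first, in order.
    for rule in user.get("url_map", []):
        if any(re.fullmatch(pattern, request_path) for pattern in rule["src_paths"]):
            return rule["url_prefix"].rstrip("/") + request_path
    # Otherwise fall back to the user's url_prefix, if one is set.
    fallback = user.get("url_prefix")
    if fallback is None:
        return None  # modelled as a rejected request
    return fallback.rstrip("/") + request_path


# Hypothetical users mirroring the read-only and write-only patterns.
GRAFANA_USER = {"url_prefix": "http://vmselect:8481/select/0/prometheus"}
WRITER_USER = {
    "url_map": [
        {
            "src_paths": ["/insert/0/prometheus/api/v1/write"],
            "url_prefix": "http://vminsert:8480",
        }
    ]
}
```

Under these assumptions a Grafana request for /api/v1/query lands on tenant 0 only, and the writer credential has no route for any query path.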
Deploy vmauth as a Kubernetes Deployment with the config mounted from a Secret (not a ConfigMap — the config contains credentials):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmauth
  namespace: monitoring
spec:
  replicas: 2
  # apps/v1 requires a selector and matching pod labels.
  # app: vmauth is the label the NetworkPolicies select on.
  selector:
    matchLabels:
      app: vmauth
  template:
    metadata:
      labels:
        app: vmauth
    spec:
      containers:
        - name: vmauth
          image: victoriametrics/vmauth:v1.100.0
          args:
            - "-auth.config=/etc/vmauth/config.yaml"
            - "-tls"
            - "-tlsCertFile=/etc/vmauth/tls/tls.crt"
            - "-tlsKeyFile=/etc/vmauth/tls/tls.key"
          ports:
            - containerPort: 8427 # HTTPS with -tls flag.
          volumeMounts:
            - name: vmauth-config
              mountPath: /etc/vmauth
              readOnly: true
            - name: tls
              mountPath: /etc/vmauth/tls
              readOnly: true
      volumes:
        - name: vmauth-config
          secret:
            secretName: vmauth-config # Not ConfigMap: contains credentials.
        - name: tls
          secret:
            secretName: vmauth-tls
For Kubernetes Ingress, terminate TLS at the Ingress controller and forward to vmauth over HTTPS internally:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vmauth
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Restrict source IPs to internal networks only.
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,172.16.0.0/12"
spec:
  tls:
    - hosts: ["metrics.internal.example.com"]
      secretName: vmauth-ingress-tls
  rules:
    - host: metrics.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmauth
                port:
                  number: 8427
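Before relying on the source-range annotation, it helps to check which client addresses the listed CIDRs actually admit. The following sketch uses Python's ipaddress module for that check; the function names are hypothetical and the ranges are the two from the annotation above.

```python
import ipaddress
from typing import List, Union

IPNetwork = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# The value of the whitelist-source-range annotation.
ALLOWED_RANGES = "10.0.0.0/8,172.16.0.0/12"


def parse_ranges(csv: str) -> List[IPNetwork]:
    """Parse a comma-separated CIDR list into network objects."""
    return [ipaddress.ip_network(item.strip()) for item in csv.split(",") if item.strip()]


def is_allowed(client_ip: str, csv: str = ALLOWED_RANGES) -> bool:
    """True when client_ip falls inside at least one allowed range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in parse_ranges(csv))
```

Note that 192.168.0.0/16 is not in the list, so clients on that private range would be refused unless it is added deliberately.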
Step 2: TLS for All Component Communication
Enable TLS on every VictoriaMetrics component. In cluster mode, all internal communication (vminsert to vmstorage, vmselect to vmstorage) must be encrypted. Mutual TLS on the native vmstorage protocol is configured with the -cluster.tls* flags, which are provided by VictoriaMetrics Enterprise builds; confirm the exact flag set against the -help output of your release. On builds without these flags, encrypt this traffic at the network layer instead.
vmstorage with TLS:
# vmstorage startup flags: mutual TLS for vminsert and vmselect connections.
# -cluster.tlsCAFile is the CA used to verify the client certificates
# presented by vminsert and vmselect.
/usr/bin/vmstorage \
  -storageDataPath=/var/lib/victoriametrics \
  -vminsertAddr=:8400 \
  -vmselectAddr=:8401 \
  -cluster.tls \
  -cluster.tlsCertFile=/etc/victoriametrics/tls/tls.crt \
  -cluster.tlsKeyFile=/etc/victoriametrics/tls/tls.key \
  -cluster.tlsCAFile=/etc/victoriametrics/tls/ca.crt
vminsert connecting to vmstorage with TLS:
# vminsert presents its client certificate to vmstorage and verifies the
# vmstorage certificate against the CA.
/usr/bin/vminsert \
  -storageNode=vmstorage-0.vmstorage:8400,vmstorage-1.vmstorage:8400 \
  -cluster.tls \
  -cluster.tlsCertFile=/etc/victoriametrics/tls/tls.crt \
  -cluster.tlsKeyFile=/etc/victoriametrics/tls/tls.key \
  -cluster.tlsCAFile=/etc/victoriametrics/tls/ca.crt \
  -cluster.tlsInsecureSkipVerify=false
vmselect connecting to vmstorage with TLS:
/usr/bin/vmselect \
  -storageNode=vmstorage-0.vmstorage:8401,vmstorage-1.vmstorage:8401 \
  -cluster.tls \
  -cluster.tlsCertFile=/etc/victoriametrics/tls/tls.crt \
  -cluster.tlsKeyFile=/etc/victoriametrics/tls/tls.key \
  -cluster.tlsCAFile=/etc/victoriametrics/tls/ca.crt \
  -cluster.tlsInsecureSkipVerify=false
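Use cert-manager to issue certificates from an internal CA, with 90-day validity and automatic renewal at 30 days remaining. VictoriaMetrics re-reads the files given to -tlsCertFile and -tlsKeyFile when they change, so HTTPS listener certificates rotate without a restart. Verify whether the intra-cluster certificates are reloaded the same way on your release, and plan a rolling restart if they are not.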
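The renewal schedule is simple date arithmetic, sketched below so the numbers can be checked before they go into a Certificate resource. The function names are hypothetical; the hour values correspond to what cert-manager's duration and renewBefore fields would carry for 90 and 30 days.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(
    not_before: datetime,
    validity: timedelta = timedelta(days=90),
    renew_before: timedelta = timedelta(days=30),
) -> datetime:
    """Moment at which renewal should start: expiry minus the renewal margin."""
    if renew_before >= validity:
        raise ValueError("renew_before must be shorter than validity")
    return not_before + validity - renew_before


def as_hours(delta: timedelta) -> int:
    """Express a duration in whole hours, e.g. for a '2160h' style value."""
    return int(delta.total_seconds() // 3600)
```

A certificate issued on 1 January is therefore renewed 60 days later, leaving 30 days of overlap in which a failed renewal can be noticed and fixed.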
Step 3: vmgateway for JWT-Based Tenant Access Control
vmgateway, a VictoriaMetrics Enterprise component, provides per-request JWT validation and tenant enforcement for the query path. It validates a JWT on every request, extracts the tenant claim from the token, and enforces that the request only accesses that tenant’s data, regardless of what the client puts in the URL path.
# vmgateway startup flags.
# -enable.auth turns on JWT validation; tokens are verified with the public key file.
# The tenant ID is extracted from the JWT claim "vm_access.tenant_id".
/usr/bin/vmgateway \
  -clusterMode \
  -read.url=http://vmselect:8481 \
  -write.url=http://vminsert:8480 \
  -enable.auth \
  -auth.publicKeyFiles=/etc/vmgateway/jwt-public.pem \
  -tls \
  -tlsCertFile=/etc/vmgateway/tls/tls.crt \
  -tlsKeyFile=/etc/vmgateway/tls/tls.key
The JWT payload structure for a tenant-scoped token:
{
  "sub": "grafana-team-payments",
  "iat": 1746662400,
  "exp": 1746666000,
  "vm_access": {
    "tenant_id": {
      "account_id": 1,
      "project_id": 0
    },
    "mode": 1
  }
}
mode: 1 grants read access only. mode: 2 grants write access only. mode: 0, which is also the default when the field is omitted, grants both read and write. vmgateway enforces these modes: a token with mode: 1 issued to a Grafana service account cannot write metrics regardless of the HTTP endpoint it calls.
Issue short-lived JWTs (1 hour expiry) from your identity provider or from a vmgateway-compatible token issuer. Do not issue long-lived static tokens for production workloads.
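A token issuer needs to produce exactly this claim set with a one-hour lifetime. The sketch below builds the claims and the base64url payload segment using only the standard library; it deliberately stops short of signing, which is assumed to be done by the identity provider with the private key matching vmgateway's public key. The function names are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def build_claims(
    subject: str,
    account_id: int,
    project_id: int = 0,
    mode: int = 1,            # 1 = read-only, as in the payload above
    ttl_seconds: int = 3600,  # one-hour expiry
    now: Optional[int] = None,
) -> dict:
    """Assemble the tenant-scoped claim set for a short-lived token."""
    issued_at = int(time.time()) if now is None else now
    return {
        "sub": subject,
        "iat": issued_at,
        "exp": issued_at + ttl_seconds,
        "vm_access": {
            "tenant_id": {"account_id": account_id, "project_id": project_id},
            "mode": mode,
        },
    }


def encode_segment(claims: dict) -> str:
    """base64url-encode the claims without padding (the JWT payload segment)."""
    raw = json.dumps(claims, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_segment(segment: str) -> dict:
    """Inverse of encode_segment, handy for inspecting a token while debugging."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

decode_segment only inspects a payload; it does not verify a signature and must not be used to make access decisions.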
Step 4: Protecting the deleteRange API
The deleteRange API permanently removes metric series. It must only be callable by authorised operators, not by Grafana, vmagent, or application services.
At the vmauth routing layer, route delete requests only for an explicit admin credential:
# vmauth-config.yaml: admin-only routing for delete operations.
users:
  # Regular users: allow-list the query endpoints. vmauth's url_map has no
  # deny action; a request path that matches no src_paths entry, and has no
  # fallback url_prefix, is rejected, so the delete endpoint is unreachable.
  - bearer_token: "%{GRAFANA_BEARER_TOKEN}"
    url_map:
      - src_paths:
          - "/api/v1/query"
          - "/api/v1/query_range"
          - "/api/v1/series"
          - "/api/v1/labels"
          - "/api/v1/label/[^/]+/values"
        url_prefix: "http://vmselect:8481/select/0/prometheus"
  # Admin-only: access to the delete endpoint requires a separate admin token.
  - bearer_token: "%{VM_ADMIN_DELETE_TOKEN}"
    url_map:
      - src_paths:
          - "/delete/[0-9]+/prometheus/api/v1/admin/tsdb/delete_series"
        url_prefix: "http://vmselect:8481"
At the network layer, enforce this with a Kubernetes NetworkPolicy that blocks direct access to vmselect port 8481 from all pods except vmauth:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vmselect-restrict
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: vmselect
  policyTypes:
    - Ingress
  ingress:
    # Only vmauth can reach vmselect directly.
    - from:
        - podSelector:
            matchLabels:
              app: vmauth
      ports:
        - port: 8481
Log all calls to the delete endpoint. Alert on any delete API call — it should be rare and always deliberate:
# Alerting rule (Prometheus/vmalert rule format): alert on any delete_series call.
# vmselect counts requests per handler path in vm_http_requests_total.
groups:
  - name: victoriametrics-admin
    rules:
      - alert: VictoriaMetricsDeleteSeriesCall
        expr: increase(vm_http_requests_total{path=~".*delete_series.*"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Metric series delete API called; verify this is authorised"
Step 5: vmagent Security
vmagent scrapes Prometheus-format targets and forwards to VictoriaMetrics. Its configuration file contains scrape credentials for targets that require authentication.
# vmagent-config.yaml: credential security.
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    # TLS for scraping HTTPS targets.
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: false
  - job_name: "internal-api-metrics"
    static_configs:
      - targets: ["api.internal:8443"]
    scheme: https
    tls_config:
      ca_file: /etc/vmagent/certs/internal-ca.crt
    # Use credentials_file, not inline credentials.
    # The file is mounted from a Kubernetes Secret.
    authorization:
      type: Bearer
      credentials_file: /var/run/secrets/vmagent/api-bearer-token
    # Relabeling: drop sensitive labels before forwarding to VictoriaMetrics.
    metric_relabel_configs:
      # Drop high-cardinality identifier labels that are likely to contain PII.
      - action: labeldrop
        regex: "user_id|session_id|request_id"
      # Redact path segments that contain tokens or IDs.
      - source_labels: [path]
        regex: '/api/v1/users/([^/]+)(.*)'
        replacement: '/api/v1/users/REDACTED$2'
        target_label: path
        action: replace
# vmagent does not take remote write settings from this file. The target URL
# (https://vmauth.monitoring.svc.cluster.local:8427/insert/0/prometheus/api/v1/write),
# CA file (/etc/vmagent/certs/vmauth-ca.crt) and bearer token file
# (/var/run/secrets/vmagent/vmauth-writer-token) are passed as -remoteWrite.*
# startup flags. The Prometheus queue settings map to
# -remoteWrite.maxRowsPerBlock=10000 and -remoteWrite.queues=8.
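The path-redaction rule is easy to get subtly wrong, so it is worth exercising the regex outside vmagent first. The sketch below approximates the replace action under two assumptions taken from Prometheus-style relabeling: the regex must match the whole label value, and $N in the replacement refers to capture group N. relabel_replace is a hypothetical helper, and Python's re module stands in for the Go regex engine, which behaves the same for this pattern.

```python
import re


def relabel_replace(value: str, regex: str, replacement: str) -> str:
    """Approximate a relabeling 'replace' action on a single label value."""
    match = re.fullmatch(regex, value)  # relabel regexes are anchored
    if match is None:
        return value  # no match leaves the label untouched
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... group syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


USER_PATH_REGEX = r"/api/v1/users/([^/]+)(.*)"
USER_PATH_REPLACEMENT = "/api/v1/users/REDACTED$2"
```

Because the match is anchored, a path with any prefix in front of /api/v1/users/ is left unredacted; such variants need their own rule or a broader regex.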
Restrict vmagent’s own HTTP API. The vmagent HTTP interface (port 8429) exposes the configuration, active scrape targets, and the ability to reload configuration. Never expose it outside the monitoring namespace:
# vmagent startup flags.
# -httpListenAddr binds the HTTP interface to localhost only, so it is not
# accessible from other pods.
# -httpAuth.* enables basic auth on that interface for any local access.
/usr/bin/vmagent \
  -promscrape.config=/etc/vmagent/config.yaml \
  -remoteWrite.url=https://vmauth.monitoring.svc.cluster.local:8427/insert/0/prometheus/api/v1/write \
  -remoteWrite.bearerTokenFile=/var/run/secrets/vmagent/vmauth-writer-token \
  -remoteWrite.tlsCAFile=/etc/vmagent/certs/vmauth-ca.crt \
  -httpListenAddr=127.0.0.1:8429 \
  -httpAuth.username=admin \
  -httpAuth.password="${VMAGENT_HTTP_PASSWORD}"
Step 6: Cluster Mode Network Isolation
In cluster mode, separate each component tier onto distinct network segments with explicit NetworkPolicies. No component should be reachable from application namespaces directly:
# vmstorage-networkpolicy.yaml: only vminsert and vmselect can reach vmstorage.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vmstorage-isolation
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: vmstorage
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # vminsert on port 8400.
    - from:
        - podSelector:
            matchLabels:
              app: vminsert
      ports:
        - port: 8400
    # vmselect on port 8401.
    - from:
        - podSelector:
            matchLabels:
              app: vmselect
      ports:
        - port: 8401
  # vmstorage nodes do not talk to each other (vminsert handles replication),
  # so no egress is allowed. Add rules here if the pod needs DNS or runs
  # vmbackup against object storage.
  egress: []
For vminsert and vmselect, deny all ingress from application namespaces and only allow ingress from vmauth:
# vminsert-networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vminsert-isolation
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: vminsert
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: vmauth
      ports:
        - port: 8480
Step 7: Preventing Label Cardinality and PII Leakage
High cardinality labels are both a performance problem and a PII risk. Each unique label combination creates a new time series. Labels that include user-identifying values — email addresses, usernames, session tokens — grow unboundedly, consume significant vmstorage memory, and constitute personal data stored in the metrics system.
Enforce cardinality limits at the vmstorage and vminsert level to stop cardinality explosions before they exhaust resources:
# vmstorage: cap the number of unique series created per day and per hour.
# The limits apply to the whole vmstorage node; samples for series beyond the
# limit are dropped and logged.
/usr/bin/vmstorage \
  -storage.maxDailySeries=5000000 \
  -storage.maxHourlySeries=1000000
# vminsert: limit the number of labels per series (40) and the length of
# label values (1 KB). Depending on the release, data exceeding these limits
# is trimmed or rejected.
/usr/bin/vminsert \
  -maxLabelsPerTimeseries=40 \
  -maxLabelValueLen=1024
At vmagent, apply relabeling to drop or redact sensitive labels before they reach VictoriaMetrics:
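# Relabeling applied to scraped metrics.
# Add as metric_relabel_configs on each job in vmagent-config.yaml, or apply the
# same rule list once to everything vmagent sends via -remoteWrite.relabelConfig
# (that file holds the bare rule list, without the metric_relabel_configs key).
metric_relabel_configs:
  # Drop any label whose name matches a PII pattern.
  - action: labeldrop
    regex: "(user_id|email|username|session_token|api_key|password|secret)"
  # Redact identifiers in the url label. Multiple source_labels would be joined
  # with ";" before matching, so use one rule per label (repeat this rule for
  # http_url and request_uri). The leading (.*) is greedy, so only the last
  # identifier segment in a path is redacted.
  - source_labels: [url]
    regex: '(.*)/([0-9a-f]{8,}|[0-9]+)(.*)'
    replacement: '$1/REDACTED$3'
    target_label: url
    action: replace
  # Drop high-cardinality trace/span IDs.
  - action: labeldrop
    regex: "(trace_id|span_id|request_id|correlation_id)"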
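One detail of labeldrop that is easy to miss: the regex is matched against the entire label name, so near-miss names survive. The sketch below models that behaviour on a plain dictionary of labels, assuming Prometheus-style anchored matching; labeldrop here is a hypothetical helper, not vmagent code.

```python
import re
from typing import Dict, Pattern

# The PII pattern from the labeldrop rule above.
PII_LABELS = re.compile(r"(user_id|email|username|session_token|api_key|password|secret)")


def labeldrop(labels: Dict[str, str], pattern: Pattern[str] = PII_LABELS) -> Dict[str, str]:
    """Return the label set without labels whose full name matches pattern."""
    return {
        name: value
        for name, value in labels.items()
        if pattern.fullmatch(name) is None
    }
```

With this pattern, email is removed but user_email and client_secret are kept, so the rule should be extended (for example with .*email.* style alternatives) after auditing the label names actually in use.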
Audit cardinality weekly using MetricsQL:
# Top 20 metrics by unique series count — investigate any with unexpectedly
# high counts for potential PII label leakage.
topk(20, count by (__name__)({__name__=~".+"}))
Alert when a single metric exceeds a cardinality threshold:
- alert: HighCardinalityMetric
  # count by (__name__) over the raw selector counts series per metric name.
  expr: count by (__name__)({__name__=~".+"}) > 50000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Metric {{ $labels.__name__ }} has > 50k unique series; investigate for PII labels"
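What the cardinality query and alert compute can be reproduced offline on an exported sample of label sets, which is useful when testing a threshold before deploying the rule. The sketch below is a plain-Python model of count by (__name__); the function names are hypothetical and the input is assumed to be a list of label dictionaries that each include __name__.

```python
from collections import Counter
from typing import Dict, Iterable, List


def series_per_metric(series: Iterable[Dict[str, str]]) -> Counter:
    """Count unique label sets per metric name."""
    # A series is identified by its full label set; duplicates collapse.
    unique = {frozenset(labels.items()) for labels in series}
    return Counter(dict(label_set)["__name__"] for label_set in unique)


def over_threshold(counts: Counter, threshold: int) -> List[str]:
    """Metric names whose unique series count exceeds threshold, sorted."""
    return sorted(name for name, count in counts.items() if count > threshold)
```

A metric whose count grows with the number of users or sessions rather than with the number of instances is the usual sign of an identifier leaking into a label.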
Step 8: vmbackup to S3 with Encryption
vmbackup snapshots vmstorage data to S3 (or GCS, Azure Blob). Configure encryption and least-privilege IAM to protect backup data at rest.
# vmbackup: back up a fresh snapshot to S3.
# Encryption at rest is enforced by the bucket's default encryption, set to
# SSE-KMS with the customer-managed key
# arn:aws:kms:us-east-1:123456789012:key/mrk-abc123, so every object vmbackup
# writes is encrypted without per-request flags.
# Credentials come from an IAM role for service accounts (IRSA): no static keys.
/usr/bin/vmbackup \
  -storageDataPath=/var/lib/victoriametrics \
  -snapshot.createURL=http://localhost:8482/snapshot/create \
  -dst=s3://victoriametrics-backups-prod/daily \
  -s3StorageClass=STANDARD_IA \
  -s3ForcePathStyle=false
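Apply least-privilege IAM for the vmbackup service account. The policy below grants PutObject but not DeleteObject, leaving retention management to S3 lifecycle rules rather than vmbackup itself. Be aware that when vmbackup syncs an incremental backup it removes objects under -dst that are no longer part of the snapshot, so withholding DeleteObject only works when each backup is written to a fresh prefix; otherwise add s3:DeleteObject scoped to the backup prefix: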
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VmbackupWrite",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:AbortMultipartUpload",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::victoriametrics-backups-prod",
        "arn:aws:s3:::victoriametrics-backups-prod/*"
      ]
    },
    {
      "Sid": "KmsDecryptForRestore",
      "Effect": "Allow",
      "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"
    }
  ]
}
Enable S3 Block Public Access at both the bucket and account level. Enable access logging on the backup bucket to detect any unexpected GetObject calls that might indicate backup data exfiltration.
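A small check in CI can catch a policy change that quietly widens the backup role. The sketch below answers one narrow question: does any Allow statement in a policy document name a given action, including through * wildcards? It is not an IAM evaluator (it ignores Deny statements, NotAction, resources, and conditions), and the function names are hypothetical.

```python
import json
from fnmatch import fnmatchcase
from typing import List, Union


def _as_list(value: Union[str, list, dict]) -> List:
    """IAM allows a single value where a list is expected; normalise to a list."""
    return list(value) if isinstance(value, list) else [value]


def allows(policy: dict, action: str) -> bool:
    """True if an Allow statement lists the action, honouring * and ? wildcards."""
    wanted = action.lower()  # IAM action names are case-insensitive
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action", [])):
            if fnmatchcase(wanted, pattern.lower()):
                return True
    return False


# The S3 statement from the policy above, reduced to its actions.
BACKUP_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket",
                 "s3:AbortMultipartUpload", "s3:ListBucketMultipartUploads"]
    }
  ]
}
""")
```

For example, allows(BACKUP_POLICY, "s3:DeleteObject") is False for the policy as written, while a policy containing "s3:*" would fail the same check.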
Step 9: Rate Limiting and Resource Protection
Without per-tenant rate limits, a single tenant can exhaust vmstorage’s query resources — either accidentally through an expensive dashboard or intentionally as a denial-of-service. The -search.maxUniqueTimeseries flag is the primary guard against query-induced resource exhaustion:
# vmselect: resource protection flags.
#   -search.maxUniqueTimeseries: maximum unique time series a single query can
#     access; queries exceeding this are rejected with an error.
#   -search.maxQueryDuration: maximum execution time for a single query.
#   -search.maxConcurrentRequests: maximum number of concurrent queries from
#     all users combined.
#   -search.maxMemoryPerQuery: maximum memory a single query may use (1 GiB here).
#   -search.maxLookback: how far back vmselect searches for the most recent raw
#     sample at each point (168h = 7 days).
/usr/bin/vmselect \
  -storageNode=vmstorage-0.vmstorage:8401,vmstorage-1.vmstorage:8401 \
  -search.maxUniqueTimeseries=500000 \
  -search.maxQueryDuration=60s \
  -search.maxConcurrentRequests=16 \
  -search.maxMemoryPerQuery=1GiB \
  -search.maxLookback=168h
For per-tenant enforcement, vmgateway applies rate limits per JWT-authenticated tenant. Configure limits in the vmgateway startup flags or via its configuration file:
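# -enable.rateLimit turns on per-tenant rate limiting with the limits from
# -ratelimit.config. vmgateway reads tenant usage from -datasource.url; the
# URL below assumes those metrics are queryable through vmselect tenant 0.
/usr/bin/vmgateway \
  -clusterMode \
  -read.url=http://vmselect:8481 \
  -write.url=http://vminsert:8480 \
  -enable.auth \
  -auth.publicKeyFiles=/etc/vmgateway/jwt-public.pem \
  -enable.rateLimit \
  -ratelimit.config=/etc/vmgateway/ratelimit.yaml \
  -datasource.url=http://vmselect:8481/select/0/prometheus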
# vmgateway ratelimit.yaml: per-tenant limits.
# Verify the key names against the vmgateway documentation for your release.
limits:
  # Default limits apply to any tenant not listed explicitly.
  - type: queries
    value: 100
    resolution: minute
  - type: rows_inserted
    value: 100000
    resolution: minute
  # Payments team (account 1) has a higher write budget.
  - type: rows_inserted
    account_id: 1
    project_id: 0
    value: 1000000
    resolution: minute
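The effect of a per-minute limit with a per-tenant override can be reasoned about with a small fixed-window counter. The sketch below is one simple way to model it and is not a description of vmgateway's internal algorithm; the class name and its parameters are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
from typing import Dict, Optional, Tuple


class FixedWindowLimiter:
    """Per-tenant fixed-window limiter with an optional per-tenant override."""

    def __init__(
        self,
        default_limit: int,
        overrides: Optional[Dict[int, int]] = None,
        window_seconds: int = 60,
    ) -> None:
        self.default_limit = default_limit
        self.overrides = overrides or {}
        self.window_seconds = window_seconds
        # tenant -> (window index, units consumed in that window)
        self._state: Dict[int, Tuple[int, int]] = {}

    def allow(self, tenant: int, now: float, cost: int = 1) -> bool:
        """Consume cost units for tenant at time now; False when over the limit."""
        window = int(now // self.window_seconds)
        seen_window, used = self._state.get(tenant, (window, 0))
        if seen_window != window:
            used = 0  # a new window starts with a fresh budget
        limit = self.overrides.get(tenant, self.default_limit)
        if used + cost > limit:
            self._state[tenant] = (window, used)
            return False
        self._state[tenant] = (window, used + cost)
        return True
```

With a default of 100000 rows per minute and an override of 1000000 for account 1, a tenant on the default budget is refused once it has written 100000 rows in the current minute and is admitted again when the next minute begins.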
Step 10: Telemetry
vm_http_requests_total{path, code} counter
vm_rows_inserted_total{type} counter
vm_new_timeseries_created_total{type} counter
vm_cache_size_bytes{type} gauge
vmauth_requests_total{username, path, code} counter
vmauth_request_duration_seconds{path} histogram
vmgateway_access_denied_total{tenant} counter
vmbackup_uptime_seconds gauge
Alert on:
- vm_new_timeseries_created_total rate spike: a sudden increase in new series creation is a cardinality explosion indicator; investigate for PII labels or a misconfigured application.
- vmauth_requests_total{code="401"} rate spike: failed authentication attempts; may indicate credential stuffing or a rotated token that was not updated in all consumers.
- vmgateway_access_denied_total non-zero: a tenant exceeded its rate limit or attempted to access a path its JWT does not permit; investigate for cross-tenant probing.
- vm_http_requests_total{path=~".*delete_series.*"} increase on vmselect: any deleteRange call should be alerted and reviewed.
- vmstorage disk fill rate: vm_data_size_bytes growth rate exceeding expected baseline may indicate a cardinality explosion consuming storage.
Expected Behaviour
| Signal | Default VictoriaMetrics | Hardened VictoriaMetrics |
|---|---|---|
| Anonymous MetricsQL query | All metrics readable without credentials | 401 from vmauth; credentials required |
| Cross-tenant query via URL manipulation | Path change gives access to any tenant | vmgateway enforces tenant from JWT claim; URL tenant segment ignored |
| deleteRange API call without auth | Metrics permanently deleted; no record | vmauth denies; NetworkPolicy blocks direct access; alert fires |
| vminsert-to-vmstorage traffic | Plaintext native protocol | Mutual TLS; certificate required on both ends |
| High-cardinality PII label series | Stored indefinitely; unbounded growth | Relabeling drops PII labels; maxLabelsPerTimeseries limits cardinality |
| Backup data in S3 | Plaintext objects | SSE-KMS encrypted; IAM restricts access; no public ACL |
| Single tenant exhausting query resources | Unlimited concurrent queries | -search.maxUniqueTimeseries and vmgateway per-tenant rate limits |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| vmauth as sole entry point | Single authentication boundary; all access logged | Single point of failure; vmauth outage stops all metric access | Deploy vmauth as Deployment with 2+ replicas; include health checks in Kubernetes readiness probe |
| Mutual TLS for vmstorage | Prevents rogue vminsert/vmselect from connecting | Certificate management for all component pods | cert-manager automates issuance and rotation; mount as volume; VictoriaMetrics reloads without restart |
| JWT tokens via vmgateway | Short-lived tokens; per-tenant enforcement | Token issuance infrastructure required; applications must refresh tokens | Issue tokens from existing OIDC provider; use 1-hour expiry with client-side refresh |
| deleteRange blocked to non-admins | Prevents accidental or malicious metric deletion | Legitimate data deletion (GDPR erasure) requires admin credential | Document admin credential access procedure; use separate delete token stored in secrets manager |
| Label PII relabeling | Prevents PII in metrics; reduces cardinality | May drop labels that dashboards currently depend on | Audit existing dashboards before applying relabeling; coordinate with team owners |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| vmauth certificate expiry | All metric access fails with TLS errors | vmauth_requests_total drops to 0; Grafana datasource errors | cert-manager auto-renewal; alert at 14 days remaining on certificate expiry |
| vmgateway JWT public key mismatch | All authenticated requests return 401 | vmgateway_access_denied_total spike across all tenants | Verify JWT issuer public key matches vmgateway’s configured key; redeploy with corrected key path |
| vmstorage mutual TLS client cert expiry | vminsert/vmselect cannot connect to vmstorage; write and read fail | vm_rows_inserted_total drops to 0; vmselect returns errors | Rotate vmstorage client certificates; rolling restart vminsert and vmselect |
| Rate limit misconfigured too low | Legitimate Grafana dashboards return query errors | vmgateway rate limit denials for production tenant | Increase rate limit in vmgateway ratelimit.yaml; reload config without restart |
| vmbackup IAM permission denied | Backup fails silently; RPO at risk | vmbackup exits non-zero; alert on backup age exceeding SLA | Review IAM role policy; ensure PutObject and KMS GenerateDataKey are permitted |
| PII relabeling drops required label | Dashboard shows “No data” for affected metric | Grafana panel errors; increase in “no data” panel count | Roll back relabeling rule; audit label usage before re-applying with correct regex |