Securing Distributed Tracing Infrastructure: Grafana Tempo and Jaeger
Problem
Distributed tracing infrastructure exists to help engineering teams understand the runtime behaviour of production systems. In doing that job, traces accumulate a cross-cutting view of every request that flows through your architecture: HTTP URLs with query strings, internal service addresses, database queries, user-facing identifiers, authentication headers, and exception payloads. Each of those categories is a potential data class that should never exist in an observability store — but which lands there routinely when tracing infrastructure is deployed without a security-first configuration.
The trace backend is an unusual kind of security liability. Unlike a database with obvious access controls, tracing backends are frequently treated as infrastructure plumbing: deployed with permissive defaults, accessible to any developer with a cluster login, and retained for months. The combination of rich, queryable data and minimal access controls makes the Jaeger UI or Tempo query endpoint a quiet exfiltration target. An analyst with five minutes and a browser can map your internal service graph, extract query parameters that contain user emails, and find SQL spans that reveal your schema — without triggering a single security alert.
This guide covers the infrastructure-level security controls for Grafana Tempo and Jaeger: what data should be stripped before it reaches storage, how each component’s authentication and authorisation is configured, how storage is secured at rest, and how multi-tenant isolation is enforced. SDK-level PII scrubbing and OTel Collector redaction processors are covered in companion articles; this article focuses on the tracing infrastructure itself.
Target systems: Grafana Tempo 2.4+; Jaeger 1.55+ (all-in-one and distributed); OpenTelemetry Collector contrib 0.100+ as the collection pipeline; Kubernetes as the deployment environment; S3-compatible object storage.
Threat Model
- Adversary 1 — Developer exfiltrating PII through the Jaeger UI: A developer with internal network access opens the Jaeger query UI (no authentication configured) and searches for traces on the
/api/usersendpoint. The span attributes includehttp.target: /api/users/alice@example.comanddb.statement: SELECT * FROM users WHERE email = 'alice@example.com'. Within minutes they have a list of user emails and the database schema. No vulnerability was exploited — just default configuration. - Adversary 2 — Compromised developer credential reads Tempo storage: An attacker obtains a developer’s AWS credentials via a phishing or supply-chain compromise. The Tempo S3 bucket has no additional bucket policy — any authenticated IAM identity in the account can read it. The attacker downloads raw trace blocks, which are gzip-compressed but unencrypted. The blocks contain trace data covering months of production traffic, including PII-bearing span attributes.
- Adversary 3 — Cross-tenant trace correlation in a multi-tenant cluster: Two product teams share a Tempo instance. Team A is permitted to query their own traces. The Tempo API endpoint is authenticated but has no tenant-enforcement — callers can specify any
X-Scope-OrgIDvalue. Team A queries Team B’s tenant ID and reads their internal service graph, request rates, and error patterns. - Adversary 4 — Trace injection via unauthenticated OTLP endpoint: The OTel Collector’s OTLP gRPC port is exposed cluster-wide without authentication. A compromised pod sends forged spans that attribute errors to a legitimate service, creating fabricated incident evidence in Jaeger. The forged traces survive for the full retention period and appear during a post-incident review.
- Adversary 5 — Right-to-erasure violation discovered in audit: A GDPR audit finds that user email addresses are embedded in Tempo trace blocks. The legal team requests deletion of all traces containing a specific user’s data. Tempo’s append-only block storage cannot selectively delete individual span attributes — the only option is to delete the entire block, which destroys unrelated production telemetry.
- Access level: Adversaries 1 and 3 have internal read access. Adversary 2 has stolen cloud credentials. Adversary 4 has compromised pod-level access. Adversary 5 is an institutional consequence of missing scrubbing controls.
- Objective: Extract PII and internal architecture intelligence; forge forensic evidence; trigger compliance violations.
Configuration
Step 1: PII Prevention at the OpenTelemetry Collector
The single most important security control for tracing infrastructure is ensuring that sensitive data never reaches the backend. Retroactive deletion is not reliably possible in append-only object stores. The OTel Collector’s transform and redaction processors form the scrubbing layer between applications and Tempo or Jaeger.
# otelcol-contrib-config.yaml — span attribute scrubbing before backend export.
processors:
# Redaction processor: block specific attribute keys and value patterns.
redaction/pii:
allow_all_keys: true
# Block-list: these keys are always deleted from spans.
blocked_keys:
- "http.request.header.authorization"
- "http.request.header.cookie"
- "http.response.header.set-cookie"
- "http.request.header.x-api-key"
- "http.request.header.x-auth-token"
# Value pattern block-list: any attribute whose value matches
# these patterns has the matching text replaced with [REDACTED].
blocked_values:
- "(?i)Bearer\\s+[A-Za-z0-9\\-._~+/]+=*" # Bearer tokens.
- "\\b\\d{3}-\\d{2}-\\d{4}\\b" # US SSNs.
- "\\b(?:\\d[ -]*?){13,16}\\b" # Card numbers.
- "sk-[A-Za-z0-9]{32,}" # OpenAI-style API keys.
summary: silent
# Transform processor: additional pattern-based scrubbing for structured fields.
transform/scrub-urls:
trace_statements:
- context: span
statements:
# Redact email addresses embedded in URL paths and query strings.
- replace_pattern(attributes["http.url"],
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
"[EMAIL-REDACTED]")
- replace_pattern(attributes["http.target"],
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
"[EMAIL-REDACTED]")
# Redact string literal values in SQL statements.
# Keeps query structure for debugging; removes parameter values.
- replace_pattern(attributes["db.statement"],
"'[^']*'", "'?'")
- replace_pattern(attributes["db.statement"],
"= [0-9]{4,}", "= ?")
# Hash user identifiers rather than dropping them entirely.
# A hashed user ID is non-reversible but still allows grouping
# traces by user for debugging without storing the raw ID.
transform/hash-user-ids:
trace_statements:
- context: span
statements:
# SHA-256 hash of user.id attribute if present.
- set(attributes["user.id"],
SHA256(attributes["user.id"]))
where attributes["user.id"] != nil
service:
pipelines:
traces:
receivers: [otlp]
processors:
- redaction/pii
- transform/scrub-urls
- transform/hash-user-ids
- tail_sampling
- batch
exporters: [otlphttp/tempo]
Why hashing beats deletion for user IDs: Hashing with SHA-256 preserves the ability to correlate traces for a given user during an incident investigation — you can compute the hash of a known user ID and query for it — while preventing the raw identifier from being stored. Deletion prevents all cross-trace correlation.
Step 2: OTLP Ingestion Endpoint Authentication
The OTLP gRPC and HTTP endpoints that receive spans from application services should require authentication. An unauthenticated OTLP endpoint accepts spans from any process in the cluster, enabling trace injection attacks.
# otelcol-contrib-config.yaml — authenticated OTLP receiver.
extensions:
# OIDC authenticator: validate JWT bearer tokens.
# Services use their Kubernetes service account OIDC tokens as credentials.
oidc:
issuer_url: https://kubernetes.default.svc.cluster.local
audience: otel-collector
attribute: sub # Extract service identity from subject claim.
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
auth:
authenticator: oidc
http:
endpoint: 0.0.0.0:4318
auth:
authenticator: oidc
# TLS for HTTP OTLP.
tls:
cert_file: /etc/otel/tls/tls.crt
key_file: /etc/otel/tls/tls.key
min_version: "1.2"
service:
extensions: [oidc]
pipelines:
traces:
receivers: [otlp]
processors: [redaction/pii, transform/scrub-urls, batch]
exporters: [otlphttp/tempo]
Services obtain tokens using the Kubernetes service account token API. The OIDC authenticator validates the token signature against the cluster’s JWKS endpoint and rejects any request that does not carry a valid token for the configured audience.
If OIDC is not available in your environment, a static bearer token shared via Kubernetes Secret is a minimal improvement over no authentication, though it provides weaker guarantees:
extensions:
bearertokenauth:
tokens_file: /etc/otel/tokens.txt # One token per line.
Supplement authentication with a NetworkPolicy that restricts which pods can reach the OTLP port:
# NetworkPolicy: pods send traces to the collector; nothing else connects.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: otel-collector-otlp-ingress
namespace: tracing
spec:
podSelector:
matchLabels:
app: otel-collector
policyTypes:
- Ingress
ingress:
# Application pods in the apps namespace may send traces.
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: apps
ports:
- port: 4317
protocol: TCP
- port: 4318
protocol: TCP
# Block all other inbound except Prometheus scraping.
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring
ports:
- port: 8888
protocol: TCP
Step 3: Grafana Tempo Authentication and Multi-Tenancy
Tempo exposes an HTTP API that accepts trace queries and ingestion. By default the API is unauthenticated. In a production deployment Tempo sits behind an authenticating reverse proxy.
OAuth2-proxy for Tempo query access:
# Tempo deployment with OAuth2-proxy sidecar.
apiVersion: apps/v1
kind: Deployment
metadata:
name: tempo
namespace: tracing
spec:
template:
spec:
containers:
- name: tempo
image: grafana/tempo:2.4.0
args:
- -config.file=/etc/tempo/tempo.yaml
ports:
- name: http
containerPort: 3200
- name: grpc
containerPort: 9095
- name: oauth2-proxy
image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
args:
- --upstream=http://localhost:3200
- --http-address=0.0.0.0:4180
- --provider=oidc
- --oidc-issuer-url=https://company.okta.com/oauth2/default
- --client-id=$(OIDC_CLIENT_ID)
- --client-secret=$(OIDC_CLIENT_SECRET)
- --cookie-secret=$(COOKIE_SECRET)
- --email-domain=example.com
# Only users in the observability group may read traces.
- --allowed-group=observability-platform
- --skip-provider-button=true
ports:
- containerPort: 4180
envFrom:
- secretRef:
name: oauth2-proxy-credentials
Expose only the OAuth2-proxy port through the Service and Ingress. Tempo’s native HTTP port must not be directly accessible from outside the pod.
Multi-tenancy with X-Scope-OrgID enforcement:
Tempo’s multi-tenancy model uses the X-Scope-OrgID header to route trace data and queries to per-tenant storage partitions. The tenant ID must be injected and validated by the authentication gateway — callers must not be able to specify arbitrary tenant IDs.
# tempo.yaml — multi-tenancy configuration.
multitenancy_enabled: true
# Tempo enforces that requests carry the X-Scope-OrgID header.
# The authentication gateway is responsible for setting this header
# to the verified tenant ID from the OIDC token claims.
# Example: oauth2-proxy passes the user's group as the org ID.
limits_config:
# Per-tenant trace ingestion rate limit.
ingestion_rate_limit_bytes: 15000000 # 15 MB/s per tenant.
ingestion_burst_size_bytes: 20000000 # 20 MB burst.
# Per-tenant query parallelism limit.
max_search_duration: 168h
# Distributor: multi-tenant aware ingestion.
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
An authentication proxy that correctly injects X-Scope-OrgID prevents Adversary 3 (cross-tenant trace correlation). The proxy reads the tenant identity from the validated OIDC token and sets the header — the caller never controls the tenant scope of their query.
Step 4: Tempo Storage Security (S3 Backend)
Tempo stores trace data as compressed blocks in object storage. Without additional controls, the bucket is a readable archive of all traces for the entire retention period.
# tempo.yaml — S3 backend with encryption and access restrictions.
storage:
trace:
backend: s3
s3:
bucket: tempo-traces-prod
endpoint: s3.amazonaws.com
region: us-east-1
# Server-side encryption with a customer-managed KMS key.
# Separate KMS key per tenant in a multi-tenant deployment.
sse_type: KMS
sse_kms_key_id: arn:aws:kms:us-east-1:123456789012:key/mrk-abc123
wal:
path: /var/tempo/wal
Enforce bucket-level controls separately from application configuration. The bucket policy must enforce encryption in transit and block public access:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyNonTLS",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::tempo-traces-prod",
"arn:aws:s3:::tempo-traces-prod/*"
],
"Condition": {
"Bool": {"aws:SecureTransport": "false"}
}
},
{
"Sid": "DenyNonEncryptedPuts",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:PutObject",
"Resource": "arn:aws:s3:::tempo-traces-prod/*",
"Condition": {
"StringNotEquals": {
"s3:x-amz-server-side-encryption": "aws:kms"
}
}
},
{
"Sid": "AllowTempoServiceRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/tempo-service-role"
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::tempo-traces-prod",
"arn:aws:s3:::tempo-traces-prod/*"
]
}
]
}
S3 Block Public Access settings must be enabled at both the bucket and account level:
# Enforce block-public-access on the bucket.
aws s3api put-public-access-block \
--bucket tempo-traces-prod \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,\
BlockPublicPolicy=true,RestrictPublicBuckets=true
Use a dedicated IAM role for Tempo with minimal permissions. The role should have no access to other buckets and no IAM permissions of its own. In an EKS environment, bind the role to the Tempo service account using IRSA:
# ServiceAccount annotation for IRSA.
apiVersion: v1
kind: ServiceAccount
metadata:
name: tempo
namespace: tracing
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/tempo-service-role
Step 5: Jaeger Security Configuration
Jaeger’s collector, query service, and storage backend each present separate attack surfaces.
Jaeger Collector authentication: The Jaeger Collector accepts spans over gRPC or HTTP. In a pipeline where the OTel Collector forwards to Jaeger (rather than Tempo), the Jaeger Collector should not be directly reachable by application pods. Route all span ingestion through the OTel Collector, which handles authentication as described in Step 2. The Jaeger Collector is then a cluster-internal receiver that only the OTel Collector communicates with:
# NetworkPolicy: only the OTel Collector may send spans to Jaeger Collector.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: jaeger-collector-ingress
namespace: tracing
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: jaeger
app.kubernetes.io/component: collector
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: otel-collector
ports:
- port: 14250 # Jaeger gRPC.
protocol: TCP
- port: 14268 # Jaeger HTTP (Thrift).
protocol: TCP
# No other ingress sources permitted.
Jaeger Query service behind an authenticating proxy: The Jaeger Query HTTP API and UI must not be exposed without authentication:
# Jaeger Query Deployment with OAuth2-proxy sidecar.
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-query
namespace: tracing
spec:
template:
spec:
containers:
- name: jaeger-query
image: jaegertracing/jaeger-query:1.55.0
args:
- --query.base-path=/jaeger
# Disable the in-process HTTP server for the health check;
# health should be served on a non-public port.
- --admin.http.host-port=:14269
ports:
- name: query-http
containerPort: 16686
- name: oauth2-proxy
image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
args:
- --upstream=http://localhost:16686
- --http-address=0.0.0.0:4180
- --provider=oidc
- --oidc-issuer-url=https://company.okta.com/oauth2/default
- --client-id=$(OIDC_CLIENT_ID)
- --client-secret=$(OIDC_CLIENT_SECRET)
- --cookie-secret=$(COOKIE_SECRET)
- --email-domain=example.com
- --allowed-group=observability-platform
- --skip-provider-button=true
ports:
- containerPort: 4180
# Ingress: route to oauth2-proxy port, not Jaeger native port.
# Restrict by source IP to internal networks only.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: jaeger-query
namespace: tracing
annotations:
nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,172.16.0.0/12"
spec:
ingressClassName: nginx
rules:
- host: jaeger.internal.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: jaeger-query
port:
number: 4180 # OAuth2-proxy port.
tls:
- hosts:
- jaeger.internal.example.com
secretName: jaeger-tls
Jaeger’s query API does not natively support tenant isolation. In a multi-tenant deployment, either run one Jaeger instance per tenant or use Tempo (which does support multi-tenancy via the X-Scope-OrgID model) and access Jaeger traces through Grafana’s unified Tempo datasource.
Step 6: Tail-Based Sampling to Preserve Security-Relevant Traces
Tail-based sampling makes sampling decisions after the full trace has been assembled. This ensures that security-relevant traces — authentication failures, slow requests that may indicate timing attacks, requests to sensitive endpoints — are always retained, while high-volume benign traffic is sampled aggressively.
# OTel Collector: tail sampling before export to Tempo/Jaeger.
processors:
tail_sampling:
decision_wait: 15s # Wait 15 seconds for all spans to arrive.
num_traces: 50000 # Maximum in-flight traces before forced eviction.
expected_new_traces_per_sec: 1000
policies:
# Retain all error traces unconditionally.
- name: retain-errors
type: status_code
status_code:
status_codes: [ERROR]
# Retain slow traces (>2s) — may indicate timing attacks or abuse.
- name: retain-slow
type: latency
latency:
threshold_ms: 2000
# Retain traces touching authentication endpoints.
- name: retain-auth
type: string_attribute
string_attribute:
key: http.target
values: ["/auth/", "/login", "/token", "/oauth", "/api/keys"]
enabled_regex_matching: true
invert_match: false
# Retain traces with suspicious attribute patterns.
# A span attribute set by security middleware when anomalies are detected.
- name: retain-security-flagged
type: string_attribute
string_attribute:
key: security.event
values: [".*"]
enabled_regex_matching: true
# Sample 0.5% of remaining healthy fast traces.
- name: sample-baseline
type: probabilistic
probabilistic:
sampling_percentage: 0.5
The sampling configuration above ensures that the traces most likely to be needed during incident response are always retained, while the storage cost of baseline healthy traffic is reduced by 99.5%.
Step 7: Trace Data Retention and the Right-to-Erasure Problem
Append-only object storage creates a compliance challenge for GDPR and similar right-to-erasure obligations. Tempo writes trace data as immutable compressed blocks; individual span attributes cannot be deleted without deleting the entire block.
The practical answer is prevention, not remediation: if PII never enters trace storage, there is no erasure obligation to fulfill. The scrubbing pipeline in Step 1 is therefore not optional from a compliance perspective.
Configure Tempo’s compactor for automatic block deletion after the retention period:
# tempo.yaml — compactor retention.
compactor:
compaction:
# Delete blocks older than this duration.
block_retention: 168h # 7 days for standard traces.
# For environments with stricter retention requirements, reduce to 72h or 48h.
# Compliance traces (if needed for audit) should go to a separate pipeline
# with different retention — not to the same Tempo instance as operational traces.
# Compacted blocks (post-merge) also respect retention.
compacted_block_retention: 1h
For environments with per-user deletion obligations, enforce a data minimisation policy at ingestion time:
- No raw user identifiers in span attributes — hash at the Collector (Step 1).
- No PII in
http.url,http.target, ordb.statement— scrub at the Collector (Step 1). - Short retention (7 days or less) for operational traces containing any user-correlated data.
- If a user makes an erasure request, verify that the scrubbing pipeline was active during their usage period. If it was, there is no raw PII to delete.
For situations where a misconfigured service emitted PII before the scrubbing pipeline was deployed, the only mitigation in Tempo is to delete the affected blocks. Identify the time range of the misconfigured period and use the Tempo API to force-compact and delete those blocks:
# Identify blocks covering the affected time range.
curl -s http://tempo.tracing.svc:3200/api/search/tags \
| jq '.tagNames[]'
# List blocks in the compactor's object storage path.
aws s3 ls s3://tempo-traces-prod/ --recursive \
| grep "^2026-03-" # Affected month.
# Delete specific blocks (irreversible — test first).
# Tempo will re-read the block list from object storage on restart.
aws s3 rm s3://tempo-traces-prod/tenant-id/block-uuid/ --recursive
Step 8: Cross-Tenant Trace Correlation Risks in Service Mesh Environments
In a service mesh (Istio, Linkerd), every pod-to-pod request generates a span. When requests cross tenant namespace boundaries — a shared API gateway calling into tenant-specific services, for example — the trace spans those boundaries. The resulting trace contains spans from services belonging to different tenants.
This creates a cross-tenant information disclosure path: if Tenant A’s service makes a call to the shared gateway, the trace may contain spans from the gateway that reveal information about requests from Tenant B if they share the same trace ID propagation context.
Controls:
# Istio: terminate trace context at namespace boundaries.
# Use a VirtualService to strip trace headers on cross-namespace traffic.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: gateway-ingress-tenant-a
namespace: tenant-a
spec:
hosts:
- shared-gateway.platform.svc.cluster.local
http:
- headers:
request:
remove:
- traceparent # W3C Trace Context.
- tracestate
- x-b3-traceid # B3 propagation.
- x-b3-spanid
- x-b3-sampled
route:
- destination:
host: shared-gateway.platform.svc.cluster.local
Removing trace context headers at the tenant boundary causes the gateway to start a new trace for each cross-boundary request. The gateway’s trace and the tenant’s trace are no longer linked, preventing correlation.
In Tempo’s multi-tenant model, assign the OTel Collector or sidecar to a specific tenant ID so that spans from Tenant A’s services are stored under Tenant A’s partition only. The shared gateway should write its spans to the platform tenant partition, not to any specific tenant.
Expected Behaviour
| Signal | Default deployment | Hardened deployment |
|---|---|---|
Email address in http.target span |
Stored verbatim in Tempo/Jaeger | Replaced with [EMAIL-REDACTED] at Collector |
SQL parameter values in db.statement |
Full values stored | Replaced with ? placeholders at Collector |
| Jaeger UI access | No authentication | SSO login required; group membership enforced |
| Tempo query for another tenant’s traces | Possible with any X-Scope-OrgID |
Tenant ID injected by auth proxy; caller cannot override |
| Tempo S3 blocks | Unencrypted; accessible to IAM principals in account | KMS-encrypted; access limited to Tempo service role |
| OTLP endpoint accepts spans from any pod | Yes — unauthenticated | JWT authentication required; network policy restricts source pods |
| Cross-namespace trace correlation | Full traces cross tenant boundaries | Trace context headers stripped at tenant boundary |
| Traces retained after retention period | Indefinitely (no retention configured) | Compactor deletes blocks after 7 days |
Telemetry
# Scrubbing pipeline effectiveness.
otelcol_processor_redaction_processed_spans_total{processor} counter
otelcol_processor_redaction_masked_attributes_total{processor} counter
# Tempo backend health.
tempo_ingester_traces_created_total{tenant} counter
tempo_compactor_blocks_deleted_total counter
tempo_storage_bytes_used_total{tenant} gauge
# Authentication events.
oauth2_proxy_requests_total{status} counter
oauth2_proxy_authentication_failures_total counter
# Sampling decisions.
otelcol_processor_tail_sampling_sampling_decision_timer_bucket histogram
otelcol_processor_tail_sampling_count_traces_sampled_total counter
Alert on:
tempo_storage_bytes_used_totalgrowing past retention threshold — compaction may not be running, traces accumulating beyond policy.oauth2_proxy_authentication_failures_totalspike — potential brute-force against the trace query UI.otelcol_processor_redaction_masked_attributes_totaldrops to zero — scrubbing pipeline may be misconfigured or bypassed.- Any direct access to Tempo or Jaeger native ports (port 3200 or 16686) from outside the pod — OAuth2-proxy may have been bypassed.
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Hashed user IDs | Supports cross-trace correlation without storing raw PII | Hash is one-way; operators must compute the hash to query by user | Store the hash-to-user mapping in a separate, access-controlled lookup table for incident investigation |
| Short retention (7 days) | Limits PII exposure window; reduces erasure complexity | Historical trace analysis beyond 7 days is not possible | Archive anonymised aggregates (RED metrics, error counts, latency percentiles) to Prometheus long-term storage |
| Tenant-boundary trace context stripping | Prevents cross-tenant correlation via shared gateway | Breaks distributed trace continuity across boundaries | Generate a synthetic correlation ID that is shared without being a W3C trace context header; used for operational correlation only |
| Per-tenant KMS key | Breach of one tenant’s key does not expose other tenants | Higher KMS key management overhead | Use AWS KMS multi-region keys for DR; automate key rotation via Terraform |
| Tail sampling for security-critical traces | Error and auth traces always retained | Requires sufficient Collector memory to buffer in-flight traces | Size Collector memory to num_traces × average_trace_size_bytes; monitor OOM events |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Scrubbing processor misconfigured | PII appears in stored traces | Nightly scan of sampled spans for PII patterns; alert on any match | Fix Collector config; existing affected blocks cannot be retroactively scrubbed — delete if retention policy allows |
| OAuth2-proxy credential rotation failure | Jaeger/Tempo UI inaccessible; 401 errors | oauth2-proxy health check failure; alert on 5xx rate | Rotate the OIDC_CLIENT_SECRET and restart the proxy pod; automate rotation with an External Secrets Operator pattern |
| Tempo compactor not running | Storage grows past retention period | tempo_compactor_blocks_deleted_total at zero for >24h |
Restart compactor pod; verify object storage permissions for the service role include s3:DeleteObject |
| OTLP JWT expiry | Application spans rejected; trace coverage drops to zero | otelcol_receiver_refused_spans_total spike; alert on spike |
Rotate the service account token; verify the Kubernetes token automounting is enabled and not near the 24h expiry |
| Cross-tenant trace pollution | Tenant A can query Tenant B’s spans | Test query using wrong tenant ID from authenticated session; should return 403 | Verify auth proxy correctly extracts and sets X-Scope-OrgID from the validated OIDC claim; do not allow callers to set this header directly |
| KMS key revoked or deleted | Tempo cannot read or write blocks | Tempo errors on all block operations; storage completely unavailable | Restore key from KMS backup; use key aliases to enable transparent rotation without changing Tempo config |