OpenTelemetry Collector Hardening: Pipeline Injection, RBAC, and Securing the Observability Data Path

The Problem

The OpenTelemetry Collector is among the most privileged non-control-plane processes in most Kubernetes clusters. Every application pod sends it traces, metrics, and logs. It runs as a DaemonSet on every node and receives host-level metrics. It holds a Kubernetes service account with API read access to enrich spans with pod metadata. Its OTLP receiver listens on a port that, absent a NetworkPolicy, any pod in the cluster can reach. In many deployments it runs with no memory limits, no authentication on the receiver, no TLS, and debug endpoints rebound to 0.0.0.0.

This is a position an attacker would specifically target — not because they want to read your Prometheus metrics, but because a compromised collector can suppress evidence of an ongoing attack, destroy forensic trails, exfiltrate secrets embedded in traces, and provide a pivot point to backend observability infrastructure like Loki, Grafana, or a cloud-hosted metrics endpoint.

The Collector’s Position in the Observability Stack

In a typical production Kubernetes deployment:

  • Every application pod sends traces (spans), metrics, and logs to the collector agent running on the same node
  • The collector DaemonSet processes telemetry and forwards it to Prometheus remote-write, Loki, Jaeger/Tempo, or external backends (Datadog, New Relic, Honeycomb)
  • The collector runs the Kubernetes attributes processor (k8sattributes), which calls the API server to read pod labels, node annotations, and namespace metadata and enriches every span with it
  • The collector’s service account is cluster-scoped because metadata enrichment requires reading across namespaces
  • Operator-mode collector deployments also have access to cluster-wide Kubernetes events

That service account, those backend credentials in the exporter config, and that API access are all reachable through the collector’s attack surface.

Attack Vector 1: Log Injection via User-Controlled Span Attributes

The most common path for an attacker to pollute the observability pipeline is through an application that embeds user-supplied input into structured logs or span attributes without sanitisation.

# Vulnerable: user input reaches the OTel log body directly
@app.post("/login")
def login(username: str, password: str):
    logger.info(f"Login attempt for user: {username}")
    # Attacker sends: username = "admin\nERROR: Security bypass successful\n[2026-05-09T03:12:44Z] auth=ok user=admin"
    # The injected newlines create fake log entries in the structured output

In a plain logging system, a newline injection is annoying but contained. In an OTel pipeline, the injected content flows through the collector into the SIEM, where SIEM correlation rules may evaluate the fabricated severity or fabricated fields. An attacker who understands the SIEM’s alert suppression rules can craft log bodies that match a “known benign” pattern and suppress alerts during lateral movement. The injected entries also pollute forensic evidence — post-incident log review becomes unreliable if the attacker controlled log content during the incident.

Beyond newline injection, span attribute injection is a subtler vector. An attacker-controlled service can emit spans with http.method = "GET\nAuthorization: Bearer <real-token>" or attribute values containing carriage returns that shift columns in log forwarder output. These do not crash the collector but they can corrupt downstream indexing and SIEM field parsing.
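On the application side, the usual mitigation is to neutralise control characters before user input reaches a log body or span attribute. The following is a minimal sketch rather than a complete defence; the helper name sanitize_for_log and the 256-character cap are illustrative assumptions, not part of any OpenTelemetry API.

```python
import re

# ASCII control characters, including CR, LF, and ESC (0x1b).
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_for_log(value: str, max_len: int = 256) -> str:
    """Escape control characters and cap length before logging user input.

    Hypothetical helper: each control character is rewritten as a visible
    \\xNN escape so it can no longer start a new line in downstream output.
    """
    escaped = _CONTROL_CHARS.sub(lambda m: f"\\x{ord(m.group()):02x}", value)
    return escaped[:max_len]


if __name__ == "__main__":
    username = "admin\nERROR: Security bypass successful"
    # The newline is rendered as the four visible characters \x0a.
    print(sanitize_for_log(username))
```

Passing the sanitised value as a structured attribute, instead of interpolating it into the message string, keeps the log body itself free of user-controlled text.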

Attack Vector 2: Metric Manipulation to Hide Attack Traffic

A compromised application pod can send arbitrary metrics over OTLP. There is no authentication on the default OTLP gRPC receiver — any pod that can reach port 4317 on the collector can emit metrics as any service name.

The specific threat: a compromised service begins exfiltrating data over HTTP to an external endpoint. Error rate metrics for that service will spike as some requests are intercepted and fail. If the same compromised process — or a second pod the attacker controls — can emit false metrics reporting http.server.request.duration.bucket and http.server.requests.total with values that mask the spike, Prometheus-based alerting may not trigger. The attacker has not compromised Prometheus itself; they have made the collector emit misleading data that Prometheus stores and evaluates faithfully.

A more blunt variant: a pod begins emitting metrics with unbounded cardinality — for example, a unique label value per HTTP request, such as request_id=<uuid> or user_id=<numeric>. Each unique label combination creates a new time series in the collector’s metric storage buffer. A single pod sending 10,000 unique label combinations per second will exhaust the collector’s memory within minutes. When the collector OOMs, the entire node’s observability goes dark — alerts stop firing, traces stop reaching Jaeger, logs stop reaching Loki. The attacker now has a window of observability blindness to operate within.
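A back-of-the-envelope calculation shows how quickly this plays out. The sketch below assumes a fixed per-series memory cost; the 512-byte figure and the 512 MiB budget are illustrative assumptions, and the real cost depends on label sizes and collector version.

```python
def seconds_until_exhaustion(
    new_series_per_second: int,
    bytes_per_series: int,
    memory_budget_mib: int,
) -> float:
    """Time until new series alone consume the whole memory budget.

    Ignores garbage collection and every other allocation, so this is an
    upper bound on how long the collector survives.
    """
    budget_bytes = memory_budget_mib * 1024 * 1024
    return budget_bytes / (new_series_per_second * bytes_per_series)


if __name__ == "__main__":
    # 10,000 new series per second, assumed 512 bytes each, 512 MiB budget.
    print(seconds_until_exhaustion(10_000, 512, 512))  # about 105 seconds
```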

Attack Vector 3: pprof and Health Endpoint Exposure

The OTel Collector ships with the pprof extension enabled in its example configurations, and many operator deployments leave it active. By default, pprof binds to localhost:1777 — but many Kubernetes deployments explicitly set the endpoint to 0.0.0.0:1777 to make it reachable for profiling from outside the pod. In a flat pod network, that means any pod in the cluster can reach the collector’s pprof endpoint.

The pprof endpoint exposes:

  • /debug/pprof/heap — a heap profile that captures live allocations in the collector’s memory at the time of the request. The collector is currently processing spans. Those spans contain trace attributes from your applications: database query strings, API request bodies, user session identifiers, authentication tokens if any service incorrectly populates span attributes with them.
  • /debug/pprof/goroutine — goroutine stack traces. These expose collector internals including exporter goroutine state, which can reveal backend endpoint URLs, connection state, and retry buffer contents.
  • /debug/pprof/allocs — allocation profiles. These reveal the structure of the collector’s internal data representations, which can inform attacks against the collector’s parser.

No authentication is required. A curl http://otel-collector.observability.svc:1777/debug/pprof/heap > heap.out from any pod retrieves it.

The health_check extension has a subtler problem. If it is bound to 0.0.0.0 and Kubernetes liveness or readiness probes depend on it, any pod can query the collector’s health state, and an attacker who can degrade or block responses on that port (for example by saturating it with connections) can cause probe failures. A failed liveness probe makes Kubernetes restart the collector container; a failed readiness probe removes the pod from Service endpoints. Either outcome is another mechanism for creating an observability blackout.

Attack Vector 4: OTLP Receiver Exhaustion and Backpressure Attacks

The OTLP gRPC receiver accepts telemetry without authentication. An attacker with a pod in the cluster can:

Span flooding: Send spans faster than the collector can export them to backends. The collector’s internal pipeline has bounded queues. When the export queue fills because the backend (e.g., Tempo) cannot keep up with ingest, the collector begins dropping spans or blocking the OTLP receiver. Legitimate services’ traces are lost.

Attribute value size exhaustion: The OTLP spec does not enforce maximum attribute value size. An attacker-controlled service can send spans with a single attribute containing a 10 MB binary blob, repeated across thousands of spans per second. Without a size limit, the collector allocates memory proportional to the attribute size. This is a targeted OOM attack against the collector process.

gRPC stream holding: A malicious gRPC client opens many HTTP/2 connections to the OTLP receiver and keeps them open while sending little or no data. The collector holds goroutines and buffers for every open connection. Go has no fixed goroutine pool, but thousands of held connections consume memory and file descriptors and can crowd out legitimate clients.

Attack Vector 5: Collector RBAC Abuse for Secret Exfiltration

The Kubernetes attributes processor (k8sattributes) in the OTel Collector requires API access to read pod, node, and namespace metadata. This is a legitimate requirement: it is how the collector adds k8s.pod.name, k8s.namespace.name, and k8s.node.name to every span without the application having to supply them.

The problem is that a ClusterRole granting get/list/watch on pods and nodes is often written by someone thinking “I need the collector to read Kubernetes resources” without carefully enumerating what “resources” means. A commonly seen over-broad ClusterRole:

# Dangerously over-broad — seen in real deployments
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
- apiGroups: [""]
  resources: ["*"]  # Everything in core API group
  verbs: ["get", "list", "watch"]

With this role, the collector service account can read secrets, configmaps, serviceaccounts, persistentvolumes, and endpoints. If an attacker compromises the collector — through any of the attack vectors above, or through a supply chain compromise of the collector container image — they can use the service account token to enumerate and read all secrets in the cluster. This includes database passwords, API keys, TLS private keys, and cloud provider credentials stored in Kubernetes secrets.

The collector’s service account token is available at /var/run/secrets/kubernetes.io/serviceaccount/token inside the collector pod. An attacker who achieves code execution in the collector, even transiently, can exfiltrate this token and use it from outside the cluster to read secrets via the Kubernetes API for as long as the token stays valid: indefinitely for a legacy secret-based token, or until expiry or pod deletion for a bound token.
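Over-broad rules of this kind are easy to catch mechanically. The sketch below checks a ClusterRole that has already been parsed into a dict (for example from YAML in a CI step); the function name and the set of sensitive resources are assumptions chosen for illustration, and the check deliberately ignores apiGroups and verbs.

```python
from typing import Any, Dict, List

# Resources the collector should not need for metadata enrichment.
SENSITIVE_RESOURCES = {"secrets", "configmaps", "serviceaccounts"}


def overbroad_rules(cluster_role: Dict[str, Any]) -> List[str]:
    """Return findings for wildcard or sensitive resources in a ClusterRole."""
    findings = []
    for index, rule in enumerate(cluster_role.get("rules", [])):
        resources = set(rule.get("resources", []))
        if "*" in resources:
            findings.append(f"rule {index}: wildcard resources")
        for name in sorted(resources & SENSITIVE_RESOURCES):
            findings.append(f"rule {index}: grants access to {name}")
    return findings


if __name__ == "__main__":
    role = {"rules": [{"apiGroups": [""], "resources": ["*"],
                       "verbs": ["get", "list", "watch"]}]}
    print(overbroad_rules(role))
```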

Threat Model

  • Log injection via user-controlled span attributes → fabricated forensic evidence in SIEM, alert suppression during active compromise
  • High-cardinality metric flood from a compromised pod → collector OOM → node-wide observability blackout during attack window
  • pprof endpoint accessible from pod network → heap dump contains span attribute data including application secrets embedded in traces
  • Attribute value size exhaustion → targeted collector OOM without affecting overall cluster
  • Collector RBAC too broad → collector service account reads cluster secrets after any collector compromise
  • Unencrypted OTLP receiver → telemetry intercepted by pod-network-level attacker (tcpdump on compromised node)
  • Exporter credentials in plaintext config → secrets readable by any process that can read the collector ConfigMap

Hardening Configuration

1. Restrict Debug Endpoints to Loopback

The most immediate fix. Remove pprof from the production service configuration entirely. If you need profiling in production under controlled conditions, restrict it to loopback and access it via kubectl port-forward.

# otel-collector-config.yaml
extensions:
  health_check:
    endpoint: 127.0.0.1:13133
    # NOT endpoint: 0.0.0.0:13133
    # kubelet httpGet probes originate from the node and target the pod IP,
    # so they cannot reach a loopback-bound port unless the pod runs with
    # hostNetwork: true. See the probe notes that follow before adopting this.

  # pprof: omit entirely in production
  # If debugging requires pprof, add it temporarily and access via:
  # kubectl port-forward -n observability pod/otel-collector-abc123 1777:1777

service:
  extensions: [health_check]
  # pprof removed from extensions list

Binding health_check to loopback has a consequence for probes. The kubelet issues httpGet probes from the node’s network namespace, and the host field defaults to the pod IP, so setting host: 127.0.0.1 points the probe at the node’s loopback rather than the pod’s. That reaches the collector only when the pod runs with hostNetwork: true. Without host networking, the options are an exec probe (which needs a probe binary inside the image; the upstream collector images typically ship without a shell) or binding health_check to the pod’s own IP and keeping port 13133 out of the NetworkPolicy ingress rules in section 6. Binding to the pod IP is not a security boundary on its own, since any pod on the same network can reach pod IPs.

# Kubernetes probe configuration for a loopback-bound health check
# Valid only when the collector pod sets hostNetwork: true
livenessProbe:
  httpGet:
    path: /
    port: 13133
    host: 127.0.0.1  # Node loopback; this is the pod's loopback only with hostNetwork
  initialDelaySeconds: 30
  periodSeconds: 10

2. Enable mTLS on OTLP Receivers

Mutual TLS ensures that only services holding a valid client certificate issued by your CA can send telemetry to the collector. This eliminates unauthenticated span flooding, forged metric injection, and interception of telemetry in transit.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4          # Reject spans with oversized payloads
        keepalive:
          server_parameters:
            max_connection_idle: 60s      # Close idle connections; limits goroutine hold attacks
            max_connection_age: 300s      # Force reconnection; limits persistent attackers
            max_connection_age_grace: 30s
        tls:
          cert_file: /etc/otel/tls/server.crt
          key_file: /etc/otel/tls/server.key
          client_ca_file: /etc/otel/tls/ca.crt
          min_version: "1.2"
          # client_auth is "RequireAndVerifyClientCert" by default when client_ca_file is set

      http:
        endpoint: 0.0.0.0:4318
        max_request_body_size: 4194304    # 4 MiB — reject oversized HTTP payloads
        tls:
          cert_file: /etc/otel/tls/server.crt
          key_file: /etc/otel/tls/server.key
          client_ca_file: /etc/otel/tls/ca.crt
          min_version: "1.2"

Certificate issuance for application pods is the operational challenge. cert-manager with the SPIFFE/X.509 SVID issuer automates this — each pod receives a certificate in its trust domain as part of its lifecycle, with automatic rotation before expiry. Alternatively, use Istio or Linkerd’s mTLS enforcement at the service mesh layer, which handles certificate issuance transparently and can block non-mTLS traffic to the collector’s OTLP ports at the sidecar level before it reaches the collector process at all.

If mTLS is not immediately feasible, a lower-friction interim measure is a shared bearer token on the OTLP HTTP receiver:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        # Note: the bearertokenauth extension ships in collector-contrib
        auth:
          authenticator: bearertokenauth

extensions:
  bearertokenauth:
    token: "${env:OTEL_COLLECTOR_AUTH_TOKEN}"   # Injected from a Kubernetes Secret

service:
  extensions: [bearertokenauth]   # The authenticator must also be enabled here

This is not as strong as mTLS — the token is shared, not per-service, and does not protect against interception without TLS — but it eliminates completely unauthenticated access.

3. Memory Limiter and Attribute Size Enforcement

The memory_limiter processor must appear first in every pipeline. When it triggers, it refuses data before it reaches other processors, preventing a telemetry flood from affecting the entire processing chain.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
    # When memory exceeds the soft limit, limit_mib - spike_limit_mib (384 MiB
    # in this config), the processor refuses incoming data and returns an error
    # to the receiver, which propagates backpressure to the sending client.
    # Well-behaved OTLP clients retry with exponential backoff; data is lost
    # only if the sender gives up. Above the hard limit, limit_mib (512 MiB),
    # the processor additionally forces garbage collection.

  # Enforce maximum attribute value size — reject oversized attributes
  # that could be used for memory exhaustion
  transform/enforce_limits:
    trace_statements:
    - context: span
      statements:
      # Truncate any attribute value over 4096 bytes
      - truncate_all(attributes, 4096)
      # Cap the attribute count per span at 128; excess attributes are dropped, not the span
      - limit(attributes, 128, [])
    log_statements:
    - context: log
      statements:
      - truncate_all(attributes, 4096)
      # Truncate log body if over 64 KiB — larger bodies indicate injection attempts
      - set(body, Substring(body, 0, 65536)) where IsString(body) and Len(body) > 65536

  # Remove high-cardinality labels from metrics before they reach exporters
  transform/sanitize_metrics:
    metric_statements:
    - context: datapoint
      statements:
      # These label names are known sources of unbounded cardinality
      # Adjust to your application's actual label names
      - delete_key(attributes, "user_id")
      - delete_key(attributes, "request_id")
      - delete_key(attributes, "trace_id")
      - delete_key(attributes, "session_id")
      - delete_key(attributes, "customer_id")
      # Labels to retain: method, status_code, route, service_name (bounded sets)

  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, transform/enforce_limits, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, transform/sanitize_metrics, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, transform/enforce_limits, transform/detect_injection, batch]
      exporters: [loki]

The truncate_all OTTL function is available in collector-contrib v0.96+. The limit function caps the number of attributes per span — a span legitimately generated by application code typically has fewer than 30 attributes; a span crafted for an injection attack may have hundreds.
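To make the effect of the two statements concrete, the following Python sketch models them on a plain dict. It is an approximation for reasoning about the policy, not the OTTL implementation: it counts characters rather than bytes, and which non-priority keys survive limit is an assumption here (insertion order).

```python
from typing import Any, Dict, Iterable


def truncate_all(attributes: Dict[str, Any], max_len: int) -> Dict[str, Any]:
    """Cut every string value down to at most max_len characters."""
    return {
        key: value[:max_len] if isinstance(value, str) else value
        for key, value in attributes.items()
    }


def limit(attributes: Dict[str, Any], max_items: int,
          priority_keys: Iterable[str] = ()) -> Dict[str, Any]:
    """Keep at most max_items attributes, retaining priority keys first."""
    kept = {k: attributes[k] for k in priority_keys if k in attributes}
    for key, value in attributes.items():
        if len(kept) >= max_items:
            break
        kept.setdefault(key, value)
    return kept


if __name__ == "__main__":
    # A hostile span: 300 attributes of 5000 characters each.
    hostile = {f"k{i}": "x" * 5000 for i in range(300)}
    cleaned = limit(truncate_all(hostile, 4096), 128)
    print(len(cleaned), max(len(v) for v in cleaned.values()))
```

Under these settings the worst-case attribute payload per span is bounded at roughly 128 × 4096 characters, whatever the sender submits.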

4. Log Injection Detection

Flag suspicious log entries before they reach the SIEM, so that downstream correlation rules can treat them with appropriate skepticism.

processors:
  transform/detect_injection:
    log_statements:
    - context: log
      statements:
      # Newline sequences in the log body suggest log injection
      # Legitimate structured logging frameworks do not embed newlines in message fields
      - set(attributes["security.injection_suspected"], true) where
          IsMatch(body, "\\r|\\n")

      # Log body claiming a different severity than the OTel severity field
      # indicates the body was crafted to look like a different log entry
      - set(attributes["security.severity_mismatch"], true) where
          IsMatch(body, "(?i)(FATAL|CRITICAL|EMERGENCY|ALERT)") and
          severity_number < SEVERITY_NUMBER_ERROR

      # Bodies over 10 KiB are anomalous — flag but do not drop
      # (dropping may itself destroy forensic evidence)
      - set(attributes["security.oversized_body"], true) where
          Len(body) > 10240

      # ANSI escape codes in log body — terminal injection
      - set(attributes["security.ansi_escape"], true) where
          IsMatch(body, "\\x1b\\[")

These flags appear on the log record as it flows into Loki or the SIEM. A detection rule on security.injection_suspected = true can alert the SOC to investigate the originating service’s logs for patterns of log forgery — which is itself an indicator of compromise for the service, not just a data quality issue.
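Because these rules feed security alerts, it helps to exercise the same patterns against sample bodies before deploying them. The sketch below mirrors the four conditions in Python so they can be unit-tested offline; the function name is hypothetical, and Python's re module is only an approximation of the regex engine the collector uses.

```python
import re
from typing import Dict

# In the OpenTelemetry log data model, the ERROR range starts at 17.
SEVERITY_NUMBER_ERROR = 17


def injection_flags(body: str, severity_number: int) -> Dict[str, bool]:
    """Return the security.* flags the processor above would set."""
    flags: Dict[str, bool] = {}
    if re.search(r"\r|\n", body):
        flags["security.injection_suspected"] = True
    if (re.search(r"(?i)(FATAL|CRITICAL|EMERGENCY|ALERT)", body)
            and severity_number < SEVERITY_NUMBER_ERROR):
        flags["security.severity_mismatch"] = True
    # Measured in UTF-8 bytes here; treat the exact unit as an assumption.
    if len(body.encode("utf-8")) > 10240:
        flags["security.oversized_body"] = True
    if re.search(r"\x1b\[", body):
        flags["security.ansi_escape"] = True
    return flags


if __name__ == "__main__":
    sample = "User login: admin\nERROR: Security bypass successful"
    print(injection_flags(sample, severity_number=9))  # 9 is INFO
```

Note that multi-line stack traces will also trip the newline rule, so the flag is best treated as a signal for review, not as proof of forgery.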

5. Minimal RBAC for the Collector Service Account

The Kubernetes attributes processor needs to read pod, node, and namespace metadata. It does not need to read secrets, configmaps, service accounts, or persistent volumes. Enumerate exactly what it needs and nothing more.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
# Opt out of automounting the default token; the pod spec mounts a
# short-lived projected token explicitly, which makes token usage auditable
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8s-receiver
rules:
# Kubernetes Attributes Processor — required for metadata enrichment
- apiGroups: [""]
  resources:
  - pods
  - nodes
  - namespaces
  verbs: ["get", "list", "watch"]

# ReplicaSet owner reference resolution — required to map pods to deployments
- apiGroups: ["apps"]
  resources:
  - replicasets
  verbs: ["get", "list", "watch"]

# Cluster events: required only if using the k8s_events receiver; omit otherwise
- apiGroups: [""]
  resources:
  - events
  verbs: ["get", "list", "watch"]

# Explicitly NOT granted:
# - secrets (any apiGroup)
# - configmaps (read access would expose other services' configuration)
# - serviceaccounts (token access)
# - persistentvolumes, persistentvolumeclaims
# - endpoints, services (not required for metadata enrichment)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8s-receiver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-k8s-receiver
subjects:
- kind: ServiceAccount
  name: otel-collector
  namespace: observability

Additionally, configure the collector pod to use a projected service account token with a short expiry rather than the long-lived token mounted by default:

# Collector DaemonSet pod spec
volumes:
- name: kube-api-access
  projected:
    sources:
    - serviceAccountToken:
        expirationSeconds: 3600   # 1-hour expiry; the kubelet refreshes the file before it lapses
        path: token
    - configMap:
        name: kube-root-ca.crt
        items:
        - key: ca.crt
          path: ca.crt

volumeMounts:
- name: kube-api-access
  mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  readOnly: true

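A stolen 1-hour token expires within the hour. A stolen legacy secret-based token never expires, and the bound token that current Kubernetes versions automount can remain valid for much longer (up to a year with the API server’s default extended-expiration setting) while the pod exists.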
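The remaining lifetime of a mounted token can be checked directly, since bound service account tokens are JWTs carrying an exp claim. The following sketch decodes the payload without verifying the signature, which is sufficient for an audit script but not for authentication; the function name is an illustrative assumption.

```python
import base64
import json
import time
from typing import Optional


def token_seconds_remaining(jwt: str, now: Optional[float] = None) -> Optional[float]:
    """Seconds until the token's exp claim, or None if it has no expiry.

    The signature is NOT verified; this only inspects the claims.
    """
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if "exp" not in claims:
        return None
    current = time.time() if now is None else now
    return claims["exp"] - current


if __name__ == "__main__":
    path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    try:
        with open(path) as handle:
            print(token_seconds_remaining(handle.read().strip()))
    except FileNotFoundError:
        print("no service account token mounted at", path)
```

Run inside the collector pod, a value well above 3600 suggests the projected volume configuration is not in effect.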

6. NetworkPolicy for Collector Pods

NetworkPolicy restricts which pods can reach the collector’s OTLP ports and which backends the collector can reach. This prevents a compromised pod from reaching the collector to inject telemetry if it is not in an allowed namespace, and prevents a compromised collector from reaching arbitrary cluster-internal services.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: otel-collector
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: otel-collector
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # OTLP gRPC and HTTP from application pods (all namespaces)
  # If mTLS is enforced, this ingress rule is defence-in-depth;
  # if mTLS is not yet enforced, this is the primary access control
  - from:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values:
          - kube-system    # kube-system pods should not send app telemetry
          - observability  # Avoid circular routing — observability stack sends via separate pipeline
    ports:
    - protocol: TCP
      port: 4317   # OTLP gRPC
    - protocol: TCP
      port: 4318   # OTLP HTTP

  # Prometheus scraping of collector's own metrics
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: observability
      podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
    ports:
    - protocol: TCP
      port: 8888   # Collector internal metrics

  egress:
  # Kubernetes API server: required for k8s metadata enrichment.
  # The API server usually runs with host networking (or outside the cluster
  # on managed platforms), so a podSelector cannot match it. Use an ipBlock.
  - to:
    - ipBlock:
        cidr: 10.96.0.1/32  # Replace with your cluster's kubernetes.default.svc IP
    ports:
    - protocol: TCP
      port: 443
  # Some CNI plugins evaluate policy after Service DNAT. If so, allow the
  # API server endpoint addresses and port instead
  # (see: kubectl get endpoints kubernetes -n default).

  # Tempo (trace backend)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: observability
      podSelector:
        matchLabels:
          app.kubernetes.io/name: tempo
    ports:
    - protocol: TCP
      port: 4318   # Tempo OTLP HTTP receiver (matches the otlphttp/tempo exporter)

  # Prometheus remote-write endpoint
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: observability
      podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
    ports:
    - protocol: TCP
      port: 9090

  # Loki push API
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: observability
      podSelector:
        matchLabels:
          app.kubernetes.io/name: loki
    ports:
    - protocol: TCP
      port: 3100

  # DNS resolution (required for any DNS-based backend addressing)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

The egress rules lock the collector to its known backends. A compromised collector cannot initiate connections to arbitrary cluster-internal services, enumerate service endpoints, or exfiltrate data to arbitrary external addresses. If your backends are external (Datadog, Honeycomb), add egress rules for those specific CIDRs or use a dedicated HTTPS egress proxy so that all exporter traffic passes through a controlled egress point where you can log and inspect it.

7. Exporter Credential Management

Exporter credentials in the collector’s ConfigMap are readable by anyone with get configmaps access in the observability namespace — which is a broad set of principals in many clusters. Use environment variable injection from Kubernetes Secrets instead:

# collector-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: otel-collector-exporters
  namespace: observability
type: Opaque
stringData:
  datadog_api_key: "your-key-here"    # Replace with actual secret management
  loki_basic_auth: "dXNlcjpwYXNzd29yZA=="   # base64("user:password"), as the Basic scheme requires

# otel-collector-config.yaml: environment variable expansion
exporters:
  datadog:
    api:
      key: "${env:DATADOG_API_KEY}"
      site: datadoghq.com
    tls:
      insecure_skip_verify: false

  loki:
    endpoint: https://loki.observability.svc:3100/loki/api/v1/push
    headers:
      Authorization: "Basic ${env:LOKI_BASIC_AUTH}"
    tls:
      ca_file: /etc/otel/tls/ca.crt

# DaemonSet pod spec environment variable injection
env:
- name: DATADOG_API_KEY
  valueFrom:
    secretKeyRef:
      name: otel-collector-exporters
      key: datadog_api_key
- name: LOKI_BASIC_AUTH
  valueFrom:
    secretKeyRef:
      name: otel-collector-exporters
      key: loki_basic_auth
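
One detail about the Authorization header: the HTTP Basic scheme expects the base64 encoding of user:password, not the raw pair, so the value stored in the Secret needs to be pre-encoded. A minimal sketch of producing that value follows; the helper name is hypothetical.

```python
import base64


def basic_auth_value(user: str, password: str) -> str:
    """Return the credential part of an HTTP Basic Authorization header."""
    raw = f"{user}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Store this output in the Secret; the config prepends "Basic ".
    print(basic_auth_value("user", "password"))
```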

In a more mature setup, populate the Kubernetes Secret from Vault or AWS Secrets Manager with an external secrets operator, so that the source of truth and rotation live outside the cluster. The synced Secret still exists in etcd; to keep the value out of etcd entirely, mount it directly with a mechanism such as the Secrets Store CSI driver.

8. Resource Limits and Pod Security

# DaemonSet container spec
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 640Mi   # Above the memory_limiter limit_mib (512 MiB) to allow limiter to act
                    # before the OOM killer fires. Gap should be >= spike_limit_mib.

securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  runAsGroup: 10001
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop: ["ALL"]

The container memory limit must be set above the memory_limiter’s limit_mib setting. If the two are equal, the OOM killer can terminate the collector before the memory_limiter has a chance to shed load gracefully. A workable rule of thumb is a container limit of at least memory_limiter.limit_mib + memory_limiter.spike_limit_mib, which gives the 640 MiB used above; add further headroom (for example 64 MiB) if the collector’s non-heap memory use is significant.
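The relationship between the limiter settings and the container limit is simple arithmetic, sketched below. The function name and the optional headroom parameter are illustrative assumptions; the formulas follow the sizing described in this section.

```python
from typing import Dict


def limiter_thresholds(limit_mib: int, spike_limit_mib: int,
                       headroom_mib: int = 0) -> Dict[str, int]:
    """Derive the soft limit and a container memory limit from limiter settings."""
    return {
        # Above this, the limiter starts refusing data.
        "soft_limit_mib": limit_mib - spike_limit_mib,
        "hard_limit_mib": limit_mib,
        # Container limit sits above the hard limit by at least one spike.
        "container_limit_mib": limit_mib + spike_limit_mib + headroom_mib,
    }


if __name__ == "__main__":
    print(limiter_thresholds(512, 128))
    print(limiter_thresholds(512, 128, headroom_mib=64))
```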

Expected Behaviour After Hardening

Memory limiter under telemetry flood: When a noisy service begins sending spans at 50,000/second and collector memory climbs past 384 MiB (limit_mib - spike_limit_mib), the memory_limiter processor starts refusing data, and the OTLP gRPC receiver returns a retryable error status to the sender. The sending client’s OTel SDK applies exponential backoff. Legitimate services recover once memory falls back under the soft limit. If memory continues to climb past 512 MiB, the limiter also forces garbage collection. Refusals are counted in the otelcol_processor_refused_spans metric, which Prometheus can alert on. The collector stays below its container limit instead of being OOM-killed, and the rest of the cluster’s observability continues working.

mTLS rejection of unauthenticated sender: A pod without a valid client certificate attempts to open an OTLP gRPC stream to the collector. The TLS handshake fails during the client certificate verification step. The connecting pod receives a TLS handshake error. No telemetry is accepted. The collector logs a TLS handshake failure with the connecting IP. This log entry is itself observable and can alert on repeated handshake failures from a single source — indicating a probing attempt.

Log injection detection attribute: A log record arrives with body "User login: admin\nERROR: Security bypass successful". The transform/detect_injection processor evaluates the OTTL condition IsMatch(body, "\\r|\\n"), which matches. The processor sets attributes["security.injection_suspected"] = true on the log record before it reaches the Loki exporter. In Loki, the record is queryable by that attribute; if the exporter is configured to promote it to a label, the selector is {security_injection_suspected="true"}. A Grafana alert rule on this field fires and the SOC investigates the originating service.

pprof no longer accessible: An attacker in a compromised pod runs curl http://otel-collector.observability.svc:1777/debug/pprof/heap. The connection is refused — the pprof extension is not running. Without pprof, the attacker has no mechanism to extract heap content from the collector. Profiling data from application spans remains confined to the collector’s memory space and the downstream backends.

Minimal RBAC on service account token: An attacker achieves code execution inside the collector container and retrieves the projected service account token from /var/run/secrets/kubernetes.io/serviceaccount/token. They attempt to use this token to list secrets: kubectl --token=<token> get secrets -A. The Kubernetes API returns 403 Forbidden because the token’s ClusterRole does not grant any verb on secrets. The token expires in less than one hour regardless, which limits its value for lateral movement after the session ends.

Trade-offs

mTLS on OTLP receivers requires every telemetry sender — every application pod — to hold a valid client certificate from a trusted CA. In a cluster without cert-manager or a service mesh, this is a significant operational burden: certificates must be issued, rotated, and distributed to every new pod. cert-manager with a ClusterIssuer and automatic annotation-based certificate issuance reduces this to a per-namespace configuration change. Istio or Linkerd enforce mTLS at the mesh layer with no per-application configuration at all. The first deployment of mTLS on the OTLP receiver will break any service that does not yet have a certificate — plan a rollout with a deadline and track which services are not yet onboarded. The alternative of bearer token auth is weaker but has no certificate lifecycle complexity.

Memory limiter refuses telemetry when load exceeds the threshold. This means that during a legitimate traffic spike, such as a load test or a release-day surge, some traces and metrics will be lost if senders exhaust their retries. The memory_limiter is not a fair queue; it refuses whatever arrives while it is over the limit, regardless of sender. The correct response is to right-size the limiter threshold for your normal peak load, use the exporters’ sending queue and retry settings to buffer data during transient backend failures, and alert on otelcol_processor_refused_spans so you know when data is being refused and can investigate whether the cause is attack-driven or load-driven.

High-cardinality metric filtering via transform/sanitize_metrics removes labels that developers often want during debugging. user_id and request_id on metrics enable per-user and per-request analysis in Grafana — capabilities that engineering teams will miss. The trade-off is cardinality explosion vs. debugging capability. The correct resolution is to keep these high-cardinality dimensions in traces (where they belong) and remove them from metrics (where they do not scale). Communicate the policy to developers before enforcing it and provide documentation on querying traces for per-request analysis.

NetworkPolicy egress restrictions will break the collector if a new backend is added without updating the policy. A new Grafana Tempo cluster, a new external logging destination, or a change in backend CIDR requires a policy update before it will work. This is friction, but it is the right kind of friction — it forces explicit review of every new network path the collector opens. Maintain the NetworkPolicy in the same Git repository as the collector Helm values, requiring a PR to change either.

Failure Modes

pprof endpoint left on 0.0.0.0 in production: The most common failure mode is a configuration copy-paste from a development environment where pprof was left accessible for profiling. In production, this means every pod in the cluster can retrieve heap dumps from the collector at any time, silently, with no authentication and no audit log entry. The heap dump contains whatever the collector is currently processing — which includes span attributes from every service. If any service incorrectly populates span attributes with session tokens, database credentials, or PII (a separate problem, but a common one), those values are present in the heap dump. Detection: scan collector configurations in CI for pprof extensions with non-loopback endpoints. Add an OPA policy that rejects collector ConfigMaps where the pprof endpoint does not match 127.0.0.1.
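
A minimal sketch of the CI scan described above, written in Python rather than as an OPA policy. It assumes the collector configuration has already been parsed into a dict by a YAML loader of your choice, and that an omitted endpoint falls back to localhost:1777; verify that default against the collector version you run. The function name is hypothetical.

```python
import ipaddress


def pprof_is_loopback(config: dict) -> bool:
    """True if no pprof extension is configured, or every one binds to loopback.

    Checks every configured pprof extension, even one not enabled under
    service.extensions, which is deliberately stricter than necessary.
    """
    extensions = config.get("extensions") or {}
    for name, settings in extensions.items():
        # Extension IDs take the form "type" or "type/name", e.g. "pprof/debug".
        if name.split("/", 1)[0] != "pprof":
            continue
        # Assumed default when no endpoint is set.
        endpoint = (settings or {}).get("endpoint", "localhost:1777")
        host = endpoint.rsplit(":", 1)[0].strip("[]")
        if host == "localhost":
            continue
        try:
            if not ipaddress.ip_address(host).is_loopback:
                return False
        except ValueError:
            # Empty or unparseable host (":1777" binds all interfaces): fail closed.
            return False
    return True


print(pprof_is_loopback({"extensions": {"pprof": {"endpoint": "127.0.0.1:1777"}}}))  # True
print(pprof_is_loopback({"extensions": {"pprof": {"endpoint": "0.0.0.0:1777"}}}))    # False
```

Failing closed on anything that does not parse as a loopback address keeps the check conservative: a hostname other than localhost is rejected rather than resolved.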

No memory_limiter means a single service OOMs the collector for the entire node: Without a memory limiter, a single application that floods the collector with telemetry can exhaust its memory until the kernel OOM-kills it. The kubelet restarts the container, but the restart backoff doubles after each crash (10, 20, 40, 80 seconds, and so on, capped at five minutes), and the collector is unavailable during each wait. During this time, no telemetry from any service on the node reaches any backend. Alerts that depend on telemetry continuity (e.g., “alert if no data received from service X in 5 minutes”) may fire, but alerts that depend on a positive signal from telemetry (e.g., “alert when error rate exceeds threshold”) will not fire because there is no data to evaluate. This is the attacker’s intended effect: an observability blackout that suppresses positive-signal alerts.
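
The length of the blackout can be estimated from the restart backoff. This sketch assumes the Kubernetes defaults of a 10-second initial delay that doubles per crash up to a 300-second cap, and it ignores the time the collector spends running before each OOM kill. The function name is hypothetical.

```python
def backoff_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Delay in seconds before each of the first `restarts` restart attempts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = backoff_delays(6)
print(delays)       # [10, 20, 40, 80, 160, 300]
print(sum(delays))  # 610
```

Under these assumptions, six consecutive OOM kills leave the node without a collector for roughly ten minutes in total, and every further crash adds another five.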

Collector service account able to read secrets: The consequence of an over-broad ClusterRole is not hypothetical. When an attacker compromises the collector (through a container image supply chain attack, an RCE vulnerability in the OTLP receiver parsing code, or persistent access via a compromised CI/CD pipeline that deploys the collector), the service account token is immediately available for Kubernetes API access. To scan for this misconfiguration, run kubectl auth can-i list secrets --as=system:serviceaccount:observability:otel-collector -A. If this returns “yes”, the service account can list secrets in every namespace and is over-privileged. A “no” only rules out cluster-wide access, so repeat the check with -n <namespace> for any namespace where a RoleBinding might grant it. Remediate immediately: re-scoping the ClusterRole is a non-breaking change, because the collector does not use secrets access for any legitimate function.
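
The kubectl check runs against a live cluster. A complementary static check can run in CI before the ClusterRole is ever applied. The sketch below assumes the role's rules list has already been parsed from the manifest into Python dicts; the function name is hypothetical, and it does not follow aggregated roles or inspect bindings.

```python
def grants_secret_access(rules: list[dict]) -> bool:
    """True if any RBAC rule covers core-group secrets, for any verb."""
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        # Secrets live in the core API group, written as "" in RBAC rules.
        in_core = "" in groups or "*" in groups
        names_secrets = any(r in ("secrets", "*") for r in resources)
        if in_core and names_secrets:
            return True
    return False


# Hypothetical rules for a metadata-enrichment role versus a wildcard role.
scoped = [{"apiGroups": [""], "resources": ["pods", "namespaces", "nodes"],
           "verbs": ["get", "list", "watch"]}]
wildcard = [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["get", "list"]}]

print(grants_secret_access(scoped))    # False
print(grants_secret_access(wildcard))  # True
```

Treating any verb on secrets as a failure is intentional here, since the collector has no legitimate use for them.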

Without a NetworkPolicy, the collector can initiate connections to arbitrary cluster services: A compromised collector with no egress NetworkPolicy can connect to any service in the cluster, including the Kubernetes API server (using the collector’s own service account), internal databases, other observability backends, and any service listening on a non-standard port. This makes the collector a useful pivot point: compromise the collector, then use its network access and service account to enumerate the cluster and reach services that should be isolated from general pod traffic. The collector’s network position, receiving telemetry from every node and every namespace, makes it a particularly high-value pivot. NetworkPolicy egress restriction is not defence-in-depth here; it is the primary control limiting the blast radius of a collector compromise.