Grafana Alloy Security Hardening: Protecting the OTel Collector Distribution

The Problem

Grafana Alloy graduated to GA in April 2024 as the successor to Grafana Agent, consolidating metrics scraping, log collection, trace forwarding, and profiling into a single binary that speaks the OpenTelemetry Collector component model. Alloy is widely deployed as a DaemonSet on Kubernetes clusters, as a system service on bare metal, and as a sidecar in containerised workloads.

The security posture of an Alloy deployment is shaped by two structural facts. First, Alloy has access to everything it collects: logs potentially containing credentials, traces containing request payloads, and metrics that may reveal internal topology. Any component in the Alloy pipeline that is misconfigured or compromised has access to this data stream. Second, Alloy supports remote configuration via Grafana Cloud’s Fleet Management, which means the pipeline configuration itself can be modified remotely — a feature that dramatically increases the blast radius of a compromised management plane.

The concrete risks that go unaddressed in default deployments:

Credentials in log streams. Applications emit logs containing secrets at error time (database connection strings, API keys, JWT tokens). Alloy’s default loki.source.kubernetes_pods scraping forwards these verbatim to Loki. Operators discover this during a security review of Loki contents, often months after deployment.

Unauthenticated Alloy UI. Alloy exposes an HTTP debug UI on port 12345 by default. In Kubernetes, this port is often accessible from within the cluster without any authentication, allowing any pod to read the full pipeline configuration, the current component graph, and in some cases component status that reveals backend endpoint URLs and credentials.

Remote configuration as a lateral movement path. If an attacker gains access to Grafana Cloud Fleet Management credentials, they can push a modified Alloy config that redirects log/metric/trace traffic to an attacker-controlled endpoint, effectively creating a persistent data exfiltration channel that survives Alloy pod restarts.

Overly-broad scraping permissions. On Kubernetes, Alloy typically runs with a ClusterRole that grants get, list, watch on pods, endpoints, services, and nodes cluster-wide. This is often broader than necessary, and a compromised Alloy process can use these permissions for cluster reconnaissance.

Sensitive metric labels. Prometheus-style metrics can carry label values containing user identifiers, IP addresses, or request paths. Alloy pipelines that forward these to Grafana Cloud or a third-party backend can export personal data without operators realising that label values include PII.

Target systems: Alloy 1.x+ deployed as Kubernetes DaemonSet or Linux service; Grafana Cloud Fleet Management users; self-hosted Loki/Tempo/Mimir backends.

Threat Model

1. Internal attacker with pod exec access (authenticated cluster user with exec privileges). Objective: exec into an Alloy pod; read the Alloy UI at localhost:12345 to extract backend credentials from component configuration; use credentials to exfiltrate data from monitoring backends. Impact: monitoring backend credentials compromised; potential access to logs from all applications on the cluster.

2. Fleet Management credentials compromise (attacker with Grafana Cloud service account). Objective: push a modified Alloy configuration via Fleet Management that redirects all log output to loki.write pointing to an attacker-controlled server. Impact: persistent log exfiltration from all pods on all clusters using the compromised Fleet Management account.

3. Log stream injection (attacker with write access to a pod’s stdout). Objective: emit specially crafted log lines containing escape sequences or structured log payloads that trigger Alloy relabeling rules to route the attacker’s log lines to a different Loki tenant or strip security-relevant fields. Impact: log data integrity compromised; security monitoring blind spots introduced.

4. Network-adjacent attacker within the cluster (compromised non-Alloy pod). Objective: query Alloy’s /metrics, /ready, or debug UI endpoints to enumerate the monitoring topology (backend URLs, label matchers, scrape targets). Impact: reconnaissance data for targeting monitoring infrastructure.

Hardening Configuration

Securing the Alloy UI and HTTP Endpoints

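// alloy-config.alloy: restrict the built-in HTTP server
logging {
  level  = "warn"
  format = "json"
}

// The HTTP server's listen address is a flag on `alloy run`, not a
// config-file attribute; the config-file `http` block covers settings such as TLS.
// Bind the UI and API to loopback when starting Alloy:
//   alloy run --server.http.listen-addr=127.0.0.1:12345 alloy-config.alloy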

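For Kubernetes DaemonSet deployments, binding the HTTP server to 127.0.0.1 makes the UI reachable only through kubectl port-forward, not from other pods. A loopback-only listener also cannot be scraped or probed over the pod IP, so if Prometheus must collect Alloy’s own metrics in-cluster, the listener has to stay on the pod IP and the NetworkPolicy becomes the primary control. In either case, configure a NetworkPolicy so that other pods cannot reach Alloy’s port: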

# networkpolicy-alloy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: alloy-ingress-restrict
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alloy
  policyTypes:
    - Ingress
  ingress:
    # Allow only Prometheus to scrape Alloy's own metrics
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - port: 12345
          protocol: TCP
    # No other ingress to Alloy

Scrubbing Credentials from Log Pipelines

// Log collection with regex-based secret scrubbing
loki.source.kubernetes_pods "app_pods" {
  forward_to = [loki.process.scrub_secrets.receiver]
}

loki.process "scrub_secrets" {
  // stage.replace substitutes the text matched by each capture group with
  // `replace`, so capture only the secret value and leave the key uncaptured.
  stage.replace {
    expression = `(?i)(?:password|passwd|secret|api[_-]?key|token|bearer)\s*[:=]\s*['"]?([^'"\s,}{]+)`
    replace    = "REDACTED"
  }

  // Redact AWS access key ID patterns
  stage.replace {
    expression = `((?:AKIA|ASIA|AROA)[A-Z0-9]{16})`
    replace    = "AWS_KEY_REDACTED"
  }

  // Redact connection string passwords (the user name is kept)
  stage.replace {
    expression = `(?:postgresql|mysql|redis|mongodb)://[^:]+:([^@]+)@`
    replace    = "REDACTED"
  }

  forward_to = [loki.write.grafana_cloud.receiver]
}
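Patterns like those in the loki.process block above are easy to get subtly wrong, so it helps to try them against sample lines before they reach a pipeline. The following is a minimal sketch in Python, not part of Alloy: the function name `scrub` and the named group `secret` are hypothetical, and Python's `re` module is not the RE2 engine Alloy uses, so the patterns stick to syntax common to both. It redacts only the captured secret value and leaves the key or URL scheme in place.

```python
import re

# Patterns adapted from the loki.process stages above. Each one captures only
# the secret value in a group named "secret"; the key name or scheme is kept.
PATTERNS = [
    re.compile(
        r"""(?i)(?:password|passwd|secret|api[_-]?key|token|bearer)"""
        r"""\s*[:=]\s*['"]?(?P<secret>[^'"\s,}{]+)"""
    ),
    re.compile(r"(?P<secret>(?:AKIA|ASIA|AROA)[A-Z0-9]{16})"),
    re.compile(r"(?:postgresql|mysql|redis|mongodb)://[^:]+:(?P<secret>[^@]+)@"),
]


def _redact(match, placeholder):
    """Rebuild the matched text with the 'secret' group swapped out."""
    start, end = match.span("secret")
    offset = match.start()
    text = match.group(0)
    return text[: start - offset] + placeholder + text[end - offset:]


def scrub(line, placeholder="REDACTED"):
    """Apply every pattern in order and return the scrubbed line."""
    for pattern in PATTERNS:
        line = pattern.sub(lambda m: _redact(m, placeholder), line)
    return line


if __name__ == "__main__":
    samples = [
        "password=s3cret",
        '{"msg":"connecting to postgresql://admin:s3cr3tpassword@db:5432/prod"}',
        '{"password": "hunter2"}',          # JSON-quoted key
        "Authorization: Bearer abc.def",    # space-separated bearer token
    ]
    for sample in samples:
        print(scrub(sample))
```

The last two samples pass through unchanged: the first pattern expects `:` or `=` directly after the keyword, so a JSON-quoted key (`"password": ...`) and a space-separated bearer token are both missed. Gaps like these are worth finding with sample lines from real applications before relying on the rules.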

Test the scrubbing rules before deploying:

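# alloy fmt parses the file, so it fails on syntax errors
alloy fmt alloy-config.alloy

# alloy run does not read log lines from stdin. To exercise a rule, write a
# sample line to a file and run a throwaway config that tails the file with
# loki.source.file, passes it through the same loki.process block, and
# prints the result with loki.echo.
echo '{"msg":"connecting to postgresql://admin:s3cr3tpassword@db:5432/prod"}' > /tmp/scrub-sample.log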

Restricting Kubernetes RBAC for Alloy

Audit the ClusterRole generated by the Helm chart and reduce it:

# rbac-alloy-minimal.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy-minimal
rules:
  # Pod log collection — only needs pod metadata for labels
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  # Node metrics scraping
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics"]
    verbs: ["get", "list", "watch"]
  # Service discovery for metrics scraping
  - apiGroups: [""]
    resources: ["services", "endpoints"]
    verbs: ["get", "list", "watch"]
  # REMOVE: nodes/proxy (allows arbitrary API proxy), secrets, configmaps
  # unless explicitly needed by your pipeline
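Auditing the chart-generated ClusterRole by eye is error-prone once it has more than a few rules. As a rough sketch, assuming the role has been exported with `kubectl get clusterrole <name> -o json` and loaded into a dict, the check below flags the resources called out above. The function name `risky_grants` and the `RISKY_RESOURCES` list are hypothetical and should be adapted to the pipeline in use.

```python
# Resources the guidance above suggests removing unless a pipeline needs them.
RISKY_RESOURCES = ("nodes/proxy", "secrets", "configmaps")


def risky_grants(cluster_role):
    """Return (resource, verbs) pairs for rules that grant a risky resource.

    `cluster_role` is the parsed JSON of a ClusterRole or Role object.
    A wildcard resource ("*") is always reported.
    """
    findings = []
    for rule in cluster_role.get("rules") or []:
        resources = rule.get("resources") or []
        verbs = rule.get("verbs") or []
        for resource in resources:
            if resource == "*" or resource in RISKY_RESOURCES:
                findings.append((resource, list(verbs)))
    return findings
```

An empty result does not prove the role is minimal; it only shows that none of the listed resources are granted. `kubectl auth can-i` remains the way to confirm what the ServiceAccount can actually do.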

Namespace-scope Alloy when cluster-wide scraping is not needed, and bind the Role to Alloy’s ServiceAccount with a RoleBinding in the same namespace:

# For namespace-scoped deployments, use Role not ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: alloy-ns-scrape
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]

Securing Remote Configuration (Fleet Management)

If using Grafana Cloud Fleet Management, scope the service account and audit usage:

// In alloy-config.alloy: remote configuration with credentials read from the environment
remotecfg {
  url              = "https://fleet.grafana.com/agent/api/v1/config/app"
  id               = constants.hostname
  poll_frequency   = "5m"
  // Use a dedicated Grafana Cloud service account scoped to Fleet Management only
  // Store token in a Kubernetes Secret, not inline
  basic_auth {
    username = env("GRAFANA_FLEET_USERNAME")
    password = env("GRAFANA_FLEET_TOKEN")
  }
}

Restrict the Fleet Management service account to the minimum required permissions in Grafana Cloud:

# In Grafana Cloud: create a service account with ONLY the Fleet Management role
# Do not reuse the same SA that writes metrics/logs

# Monitor for unexpected config pushes via Grafana Cloud audit log
# Alert on: source=fleet_management, action=config_update

Disable remote configuration entirely in environments where it is not needed:

// Do NOT include remotecfg block in static-configuration environments
// Alloy will use only the local config file
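For static-configuration environments, the absence of a remotecfg block can be enforced in CI rather than by convention. A minimal sketch, assuming the config text has already been read from the repository; the function names are hypothetical, and only `//` line comments are handled (block comments are not).

```python
import re

# A remotecfg block opens with the block name followed by "{".
REMOTECFG_BLOCK = re.compile(r"^\s*remotecfg\s*\{", re.MULTILINE)


def strip_line_comments(text):
    """Drop everything after '//' on each line (crude, but enough here)."""
    return "\n".join(line.split("//", 1)[0] for line in text.splitlines())


def has_remotecfg(config_text):
    """Return True if the config declares a remotecfg block outside comments."""
    return bool(REMOTECFG_BLOCK.search(strip_line_comments(config_text)))
```

A CI job could call `has_remotecfg` on each committed config and fail the build when it returns True for an environment that is meant to be static.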

Preventing PII Leakage in Metrics Labels

// prometheus.relabel rules to drop or hash PII-bearing labels before export
prometheus.relabel "drop_pii_labels" {
  forward_to = [prometheus.remote_write.mimir.receiver]

  // Drop labels that may contain user identifiers
  rule {
    action        = "labeldrop"
    regex         = "user_id|email|username|customer_id|session_id"
  }

  // Bucket IP address labels rather than forwarding raw values.
  // hashmod writes hash(source_labels) % modulus, which keeps values
  // consistent for correlation. It is not cryptographic anonymisation:
  // the IPv4 space is small enough to enumerate, so treat it as obfuscation.
  rule {
    source_labels = ["client_ip"]
    target_label  = "client_ip_hashed"
    action        = "hashmod"
    modulus       = 65536
  }

  rule {
    action = "labeldrop"
    regex  = "client_ip"
  }
}
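Relabel rules cannot compute a keyed cryptographic hash, so where pseudonymised IP addresses are genuinely required, the hashing most likely has to happen where the label is produced. The sketch below shows one way that could look in Python, using HMAC-SHA256 with a secret key so that the small IPv4 space cannot simply be enumerated by anyone without the key. The function name `pseudonymise_ip`, the key handling, and the truncation length are all assumptions for illustration.

```python
import hashlib
import hmac
import ipaddress


def pseudonymise_ip(ip, key, length=16):
    """Return a stable, keyed pseudonym for an IP address.

    The address is normalised first so that equivalent spellings (for example
    expanded and compressed IPv6) map to the same value. `key` is a secret
    bytes value; rotating it breaks correlation with older pseudonyms.
    """
    normalised = str(ipaddress.ip_address(ip))  # raises ValueError if invalid
    digest = hmac.new(key, normalised.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]
```

The same address always maps to the same pseudonym under a given key, which preserves correlation across metrics while keeping raw addresses out of the backend. Truncating the digest trades collision resistance for shorter label values.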

TLS for Backend Connections

Alloy by default performs TLS verification for loki.write, prometheus.remote_write, and otelcol.exporter.otlp. Verify it is not disabled:

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.internal:9090/api/v1/push"
    tls_config {
      // Do NOT set insecure_skip_verify = true
      ca_file   = "/etc/alloy/certs/internal-ca.pem"
    }
  }
}

Audit all existing Alloy configs for insecure_skip_verify:

grep -r "insecure_skip_verify" /etc/alloy/
# Any output requires review and removal
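A plain grep also matches comments such as "Do NOT set insecure_skip_verify = true" and lines that set the attribute to false, so every hit needs manual review. As a sketch of a stricter check, the Python below reports only uncommented assignments of true in `*.alloy` files. The function name `find_insecure_tls` is hypothetical, the file extension is an assumption about how configs are named, and block comments (`/* ... */`) are not handled.

```python
import re
from pathlib import Path

# Matches an actual assignment of true, not a mention in prose.
INSECURE = re.compile(r"\binsecure_skip_verify\s*=\s*true\b")


def find_insecure_tls(root):
    """Return (path, line_number) for each uncommented insecure_skip_verify = true."""
    hits = []
    for path in sorted(Path(root).rglob("*.alloy")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            code = line.split("//", 1)[0]  # ignore trailing or full-line comments
            if INSECURE.search(code):
                hits.append((str(path), lineno))
    return hits
```

Calling `find_insecure_tls("/etc/alloy")` and failing a pipeline on a non-empty result would turn the audit into a repeatable gate.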

Expected Behaviour After Hardening

| Check | Before Hardening | After Hardening |
| --- | --- | --- |
| Alloy UI accessible from cluster pods | Any pod can reach alloy:12345 | NetworkPolicy blocks all non-Prometheus ingress; UI binds 127.0.0.1 |
| Log line with password=s3cret forwarded | Forwarded verbatim to Loki | Redacted to password=REDACTED before storage |
| ClusterRole includes nodes/proxy | Full API proxy access granted | Removed from ClusterRole; minimal permissions only |
| Remote config update from Fleet Management | Any SA with fleet access can push | Dedicated Fleet Management SA; alerts on config push events |
| Metrics labels include user_id | User IDs exported to third-party monitoring backend | labeldrop rule removes PII before remote write |

Verification:

# Confirm UI is not reachable from a test pod
kubectl run test --image=busybox -it --rm -- \
  wget -qO- http://alloy.monitoring:12345/graph
# Expected: connection refused or network policy block

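# Confirm a credential-containing log line is scrubbed.
# Scrubbed lines land in Loki, not in Alloy's own logs, so query Loki
# (logcli shown here; the namespace label value is an example)
logcli query --limit 5 '{namespace="production"} |= "REDACTED"'
# Expected: lines showing REDACTED values, not actual secrets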

Trade-offs and Operational Considerations

| Aspect | Benefit | Cost | Mitigation |
| --- | --- | --- | --- |
| Log scrubbing regex | Prevents credential exfiltration | Regex maintenance burden; false positives may redact non-credential strings | Start with conservative patterns; expand after review; log scrub events for audit |
| Fleet Management restriction | Reduces blast radius of compromised SA | Requires manual config updates in environments without Fleet Management | Document the manual update procedure; add config validation to CI |
| UI binding to 127.0.0.1 | Prevents internal network access to debug info | Debug access requires kubectl port-forward | Document port-forward procedure for on-call; acceptable trade-off |
| Minimal ClusterRole | Reduces blast radius if Alloy process is compromised | Some pipeline components need specific permissions; may break custom components | Audit each permission before removing; use kubectl auth can-i to test |
| TLS verification enforcement | Prevents MitM on backend connections | Self-signed certs require explicit CA configuration | Use cert-manager to issue internal TLS certs; distribute CA via ConfigMap |

Failure Modes

| Failure | Symptom | Detection | Recovery |
| --- | --- | --- | --- |
| Regex scrubbing too aggressive | Legitimate log content redacted; log analysis broken | SRE notices missing fields in Loki; application logs unusable | Narrow regex; redeploy Alloy; note that historical logs in Loki retain redacted values |
| Fleet Management token expired | Alloy stops receiving config updates; runs stale config | Alloy logs: remote config fetch failed; Fleet Management shows disconnected agents | Rotate token; update Kubernetes Secret; trigger Alloy reload |
| NetworkPolicy blocks Prometheus scrape | Alloy’s own metrics not collected; gaps in observability-of-observability | Prometheus target shows connection refused for Alloy | Verify NetworkPolicy allows Prometheus pod labels; check label selectors |
| ClusterRole too restrictive | Alloy cannot scrape certain resources; missing metrics | Alloy logs: RBAC permission denied; gaps in collected metrics | Add specific permission back with justification; avoid reverting to broad role |
| TLS CA cert expiry | Backend write failures; data loss | Alloy logs: x509 certificate has expired; remote write errors | Rotate CA cert; update Alloy ca_file reference; restart Alloy |