Detecting Secret Access Anomalies: Vault and AWS Secrets Manager Audit Log Analysis
The Problem
Secret stores are the highest-value targets in most infrastructure. A compromised database password, API key, or TLS private key is often the difference between a contained incident and a full breach. Yet the monitoring applied to secret access is frequently the weakest in the environment: teams spend engineering hours instrumenting application traces and shipping Kubernetes events to a SIEM while Vault and Secrets Manager audit logs sit unanalysed, or worse, unshipped.
The core reason this happens is that normal secret access is boring. Vault agent renews the lease on secret/production/database/password every five minutes. External Secrets Operator reads a batch of secrets every hour from the same service account. A CI pipeline reads a deploy secret once per deploy event. Nothing in these patterns triggers a generic “secret access” alert, and so no alert is configured at all. The implicit assumption is that because access control is enforced, any access must be legitimate. That assumption collapses the moment a token is stolen.
Compromised tokens do not announce themselves. The first observable action a credential thief takes after gaining a valid Vault token is reading everything that token can reach. This looks like normal read operations — the same API calls, the same paths, the same response codes. The only thing that changes is the pattern: the cadence, the breadth of paths accessed, the source IP, the time of day. These are precisely the signals that generic monitoring never captures, because it is built on alerting rules that treat all reads equally.
AWS Secrets Manager presents a parallel problem. CloudTrail records every GetSecretValue, DescribeSecret, and ListSecrets call. A legitimate application might call GetSecretValue on one specific ARN every few minutes. An attacker who has obtained an IAM access key calls GetSecretValue across thirty different ARNs over ten minutes from an IP address that has never appeared in CloudTrail before. CloudTrail dutifully records both. AWS native alerting — GuardDuty, Security Hub — may fire on some categories of anomaly, but their coverage has gaps, their latency is high, and their signals do not integrate natively with the operational dashboards where your team already works.
Normal secret access patterns that establish the baseline:
- app-backend Vault agent renews secret/production/database/password every 300 seconds, always from the same service account, always from pod IPs in the 10.0.1.0/24 subnet
- External Secrets Operator reads a fixed list of 12 secrets every 3600 seconds using the k8s-eso accessor
- The release CI pipeline reads secret/ci/deploy-token exactly once per deployment, correlated with a git push event to the release branch
- Secret rotation happens quarterly: a write event to secret/production/database/password is immediately followed by a pod restart event in Kubernetes
Anomaly 1 — Bulk read (harvesting):
10:00:01 READ secret/production/database/password accessor=tok_A1b2c3
10:00:02 READ secret/production/api-keys/stripe accessor=tok_A1b2c3
10:00:02 READ secret/production/api-keys/sendgrid accessor=tok_A1b2c3
10:00:03 READ secret/staging/database/password accessor=tok_A1b2c3
10:00:03 READ secret/staging/api-keys/stripe accessor=tok_A1b2c3
10:00:04 READ secret/production/certificates/ca-key accessor=tok_A1b2c3
...
# 500 reads across 47 distinct paths over 5 minutes
The application that owns tok_A1b2c3 normally reads two paths at startup and zero during steady-state operation. Five hundred reads across staging, production, and certificate paths in five minutes is credential harvesting. The token is legitimate — the behaviour is not.
Anomaly 2 — Access from unexpected source IP:
09:00:00 READ secret/production/database/password remote_addr=10.0.1.47 accessor=tok_B9x2
10:00:00 READ secret/production/database/password remote_addr=185.234.219.12 accessor=tok_B9x2
The first read is normal — 10.0.1.47 is within the expected pod CIDR. The second is the same token, same path, same response code, but from 185.234.219.12 — an external IP address that appears in no prior audit event. The token was exfiltrated from the pod. The legitimate workload and the attacker are both using it simultaneously.
Anomaly 3 — Secret rotation without subsequent deployment:
At 14:23:07, a write event appears for secret/production/stripe/api-key. Rotation is complete. The application reads the new value on the next pod restart. But no pod restart follows in the next thirty minutes. Two explanations: rotation propagation failed and the application is running on stale credentials, or the path was written by an attacker attempting to inject a value they control into the production application. Either way, the silence after a rotation event is a detection signal.
Anomaly 4 — Admin token used outside maintenance windows:
Admin-level tokens or IAM roles with secretsmanager:* permissions should be used only during planned maintenance operations. Any use at 03:14 on a Sunday that does not correspond to a scheduled maintenance window is suspicious. Admin token misuse at off-hours is a consistent indicator across insider threat and attacker post-compromise scenarios.
Anomaly 5 — First-ever access to a new secret path:
The service account for payment-service has accessed secret/production/database/payment-db every five minutes for the last eighteen months. Today it accessed secret/admin/certificates/signing-key for the first time. This is lateral movement: either the pod’s token was stolen and the attacker is exploring what it can reach, or the application code was modified to reach beyond its expected scope. First-ever path access per accessor is a zero-cost signal — it requires only a persistent set of known paths per accessor.
Threat Model
Compromised application token: A Vault token exposed through a misconfigured environment variable, a leaked container image, or an exfiltrated Kubernetes secret. The attacker uses it immediately to enumerate all readable secrets. The first observable signal is a burst of read events across many paths from a token that normally reads two or three paths. Detection window: the burst typically occurs within the first five minutes of token acquisition.
Token exfiltration without rotation: A valid token is copied to an attacker-controlled system. The legitimate workload continues using the token for its normal operations. Simultaneously, the attacker uses the same token from an external IP. The token is not revoked because nothing appears broken — applications continue to function normally. The dual-IP pattern is the only detection signal until credentials obtained through the token are used externally.
Former employee or contractor token not revoked: A Vault token or IAM access key associated with a former employee continues to function after offboarding. Periodic access — daily or weekly — from that accessor to production secrets is invisible unless accessor-level baseline monitoring is running. The access pattern is low-frequency and therefore does not trigger bulk-read detectors.
Secret rotation without propagation: Automated rotation writes a new secret value to Vault or Secrets Manager. The consuming application fails to pick up the new value due to a caching bug or lease misconfiguration. It continues authenticating with the old credential, which succeeds until the upstream system (database, payment processor) rotates on its side. The rotation event in the audit log is unmatched by any subsequent application read of the new value, revealing the propagation failure before it becomes a production outage.
Insider threat with overly broad policy: A developer or operations engineer with a Vault policy that grants access to more paths than their role requires uses that token to read secrets outside their operational scope. Path-diversity anomaly detection catches this: the token reads paths it has never historically accessed. Without per-accessor baseline tracking, this access is indistinguishable from normal operations.
Hardening Configuration
1. Vault Audit Log: What It Contains and Where to Ship It
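Vault’s audit device writes one NDJSON line per request and one per response to each configured sink. Enable the file audit device and point it at a path that your log shipper can read. Audit devices are commonly enabled at runtime with vault audit enable file file_path=/var/log/vault/audit.log rather than in the server configuration file, so treat the stanza below as a declarative statement of the same options and check that your Vault version accepts it: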
# vault.hcl
audit "file" {
  type = "file"
  options = {
    file_path = "/var/log/vault/audit.log"
    log_raw   = false  # HMAC-hashes secret values in the log
    mode      = "0600"
  }
}
A response event for a read operation contains:
{
  "time": "2026-05-09T10:00:01.234567Z",
  "type": "response",
  "auth": {
    "client_token": "hmac-sha256:abc...",
    "accessor": "tok_A1b2c3",
    "display_name": "k8s-backend-prod",
    "policies": ["default", "production-read"],
    "token_type": "service"
  },
  "request": {
    "id": "req-uuid",
    "operation": "read",
    "path": "secret/production/database/password",
    "remote_address": "10.0.1.47",
    "namespace": {"id": "root"}
  },
  "response": {
    "data": {"password": "hmac-sha256:def..."}
  }
}
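The auth.accessor is stable across token renewals for the same token, which makes it the correct field for per-identity tracking. auth.client_token is HMAC-hashed before it is written (this is the behaviour with log_raw = false, the default) and cannot be recovered from the log. request.remote_address is the network source of the request as Vault sees it; behind a load balancer or proxy this is the proxy address unless Vault is configured to trust forwarded-for headers.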
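The detectors later in this article all need the same handful of fields from each response event. A minimal sketch of that extraction is shown here; the function names flatten_audit_event and parse_vault_time are illustrative, not part of Vault or any library. It assumes timestamps end in "Z" and may carry more than six fractional digits, which datetime.fromisoformat rejects on older Python versions, so the fraction is normalised to microseconds first.

```python
import re
from datetime import datetime
from typing import Optional

# Splits "2026-05-09T10:00:01.234567891Z" into base and fractional digits.
_TIME_RE = re.compile(r"(.*?)(?:\.(\d+))?Z")


def parse_vault_time(raw: str) -> datetime:
    """Parse a UTC timestamp, trimming or padding the fraction to 6 digits."""
    match = _TIME_RE.fullmatch(raw)
    if not match:
        raise ValueError(f"unsupported timestamp: {raw!r}")
    base, fraction = match.group(1), match.group(2) or ""
    micros = (fraction + "000000")[:6]
    return datetime.fromisoformat(f"{base}.{micros}+00:00")


def flatten_audit_event(event: dict) -> Optional[dict]:
    """Return the fields used for per-identity tracking, or None if unusable."""
    if event.get("type") != "response":
        return None
    auth = event.get("auth") or {}
    request = event.get("request") or {}
    try:
        ts = parse_vault_time(event["time"])
    except (KeyError, ValueError, TypeError):
        return None
    return {
        "time": ts,
        "accessor": auth.get("accessor", "unknown"),
        "display_name": auth.get("display_name", ""),
        "operation": request.get("operation", ""),
        "path": request.get("path", ""),
        "remote_address": request.get("remote_address", ""),
    }


sample_event = {
    "time": "2026-05-09T10:00:01.234567891Z",
    "type": "response",
    "auth": {"accessor": "tok_A1b2c3", "display_name": "k8s-backend-prod"},
    "request": {
        "operation": "read",
        "path": "secret/production/database/password",
        "remote_address": "10.0.1.47",
    },
}
flat_event = flatten_audit_event(sample_event)
```

Request-type events and events with unparseable timestamps return None, so callers can filter them out in one place.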
Ship this to Loki using Fluent Bit:
# /etc/fluent-bit/vault-audit.conf
[INPUT]
    Name              tail
    Path              /var/log/vault/audit.log
    Tag               vault.audit
    Parser            json
    DB                /var/lib/fluent-bit/vault-audit.db
    Refresh_Interval  5
    Skip_Empty_Lines  On

[FILTER]
    Name    record_modifier
    Match   vault.audit
    Record  log_source vault
    Record  environment production

[FILTER]
    Name          nest
    Match         vault.audit
    Operation     lift
    Nested_under  auth
    Add_prefix    auth_

[FILTER]
    Name          nest
    Match         vault.audit
    Operation     lift
    Nested_under  request
    Add_prefix    request_

[OUTPUT]
    Name                    loki
    Match                   vault.audit
    Host                    loki.monitoring.svc.cluster.local
    Port                    3100
    Labels                  job=vault-audit,env=production
    Label_keys              $type,$auth_accessor
    Line_format             json
    Auto_kubernetes_labels  Off
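The nest/lift filters promote nested fields to the top level, so the shipped record carries flat keys such as auth_accessor and request_path, and Label_keys can turn type and auth_accessor into stream labels. Accessor values are high-cardinality; in a large fleet, consider leaving auth_accessor out of Label_keys and extracting it with | json at query time instead. The DB setting persists the file offset so restarts do not re-ship old events.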
Store Vault audit logs externally from day one. An attacker with Vault admin access can disable the audit device — but only if you have not already replicated the audit stream to an append-only external system. If the audit log lives only on the Vault server, the attacker silences it before harvesting. If it is already streaming to Loki with a retention policy the Vault process cannot touch, disabling the audit device at T creates a gap in the Loki stream that is itself a detection signal.
2. Bulk Read Detection with LogQL
# Alert: any accessor making more than 50 read/list operations in 5 minutes
# This catches harvesting while allowing high-frequency single-path renewals
sum by (auth_accessor) (
  count_over_time(
    {job="vault-audit"}
      | json
      | type = "response"
      | request_operation =~ "read|list"
    [5m]
  )
) > 50
This fires for ESO if it syncs more than 50 secrets at once. Handle that by excluding known high-volume accessors. The patterns below are placeholders: real accessors are opaque strings, so either list the literal accessor values or match on auth_display_name instead.
# Exclude known high-volume legitimate accessors
sum by (auth_accessor) (
  count_over_time(
    {job="vault-audit"}
      | json
      | type = "response"
      | request_operation =~ "read|list"
      | auth_accessor !~ "eso-accessor-.*|vault-agent-renewal-.*"
    [5m]
  )
) > 50
Path diversity is a stronger signal than raw read count. Vault agent renewing a single lease generates high read count but zero path diversity. An attacker harvesting generates high read count AND high path diversity:
# Distinct paths per accessor over 5m.
# The inner sum yields one series per (accessor, path); the outer count
# counts those series per accessor. count_over_time itself takes no by clause.
count by (auth_accessor) (
  sum by (auth_accessor, request_path) (
    count_over_time(
      {job="vault-audit"}
        | json
        | type = "response"
        | request_operation = "read"
      [5m]
    )
  )
) > 10
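For Prometheus-based alerting, export per-accessor counters from a Python side-car that tails the audit log, or use Grafana alerting on the LogQL queries directly. LogQL has no dedicated distinct-count function, and the nested aggregation above becomes expensive as the number of accessor and path combinations grows. The rule below assumes the side-car exports a vault_audit_reads_total counter labelled by auth_accessor, and it alerts on read volume: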
# prometheus/rules/vault-anomaly.yaml
groups:
  - name: vault_anomaly_detection
    interval: 60s
    rules:
      - alert: VaultBulkReadDetected
        expr: |
          sum by (auth_accessor) (
            increase(vault_audit_reads_total[5m])
          ) > 50
        for: 0m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Vault bulk read: {{ $labels.auth_accessor }}"
          description: >
            Accessor {{ $labels.auth_accessor }} made {{ $value }} reads
            in the last 5 minutes. Expected < 50 for any non-ESO accessor.
          runbook: https://runbooks.internal/vault/bulk-read
3. Python: Bulk Read and Path Diversity Detection
"""
vault_anomaly_detector.py
Reads Vault audit log NDJSON and detects:
1. Bulk reads: high read count in a rolling window
2. Path diversity: reads across many distinct paths (harvesting signal)
3. New path access: accessor reading a path for the first time
4. Dual-source IP: same accessor reading from multiple IPs simultaneously
"""
import json
import redis
import logging
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Iterator
log = logging.getLogger(__name__)


def parse_vault_audit(path: str) -> Iterator[dict]:
    """Yield parsed audit events, skipping malformed lines."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                log.warning("Malformed audit line: %s (%s)", line[:80], e)


def detect_bulk_reads(
    audit_log_path: str,
    window_minutes: int = 5,
    min_reads: int = 30,
    min_distinct_paths: int = 10,
    accessor_allowlist: set[str] | None = None,
) -> list[dict]:
    """
    Flag accessors reading many distinct secrets in a short window.

    Rationale: legitimate Vault agent lease renewal is high-frequency but
    low-path-diversity. Credential harvesting is high-frequency AND
    high-path-diversity. The combined threshold avoids false positives
    from renewal agents while catching attackers.

    Returns a list of findings, each with accessor, read count,
    distinct path count, window start, and a sample of accessed paths.
    """
    allowlist = accessor_allowlist or set()
    # Accumulate a (timestamp, path) timeline per accessor
    by_accessor: dict[str, list] = defaultdict(list)
    for event in parse_vault_audit(audit_log_path):
        if event.get("type") != "response":
            continue
        op = event.get("request", {}).get("operation", "")
        if op not in ("read", "list"):
            continue
        accessor = event.get("auth", {}).get("accessor", "unknown")
        if accessor in allowlist:
            continue
        try:
            ts = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
        except (KeyError, ValueError):
            continue
        path = event.get("request", {}).get("path", "")
        by_accessor[accessor].append((ts, path))

    findings = []
    window = timedelta(minutes=window_minutes)
    for accessor, reads in by_accessor.items():
        if len(reads) < min_reads:
            continue
        reads.sort(key=lambda r: r[0])
        # Sliding window: report the first burst that crosses both thresholds
        end = 0
        for start, (ts, _) in enumerate(reads):
            # Advance the right edge while events fall inside [ts, ts + window]
            while end < len(reads) and reads[end][0] <= ts + window:
                end += 1
            burst = reads[start:end]
            if len(burst) < min_reads:
                continue
            # Count distinct paths inside the burst only, not over the whole log
            paths = {p for _, p in burst}
            if len(paths) >= min_distinct_paths:
                findings.append(
                    {
                        "type": "bulk_read_harvesting",
                        "accessor": accessor,
                        "reads_in_window": len(burst),
                        "distinct_paths": len(paths),
                        "window_start": ts.isoformat(),
                        "sample_paths": sorted(paths)[:5],
                        "severity": "critical",
                    }
                )
                break  # One finding per accessor per analysis run
    return findings


def detect_dual_source_ip(
    audit_log_path: str,
    lookback_minutes: int = 60,
    known_cidr_prefixes: list[str] | None = None,
) -> list[dict]:
    """
    Detect the same accessor reading secrets from multiple source IPs
    within a lookback window, where at least one IP is outside known
    infrastructure CIDRs.

    Legitimate cause: rolling pod replacement (two pods briefly active).
    Malicious cause: exfiltrated token used from attacker infrastructure
    simultaneously with the legitimate workload.

    Distinguishing heuristic: if one IP is in known_cidr_prefixes and
    another is not, that is higher confidence than two unknown IPs.
    """
    # String prefixes for RFC 1918 space. 172.16.0.0/12 spans 172.16. to
    # 172.31., so a single "172.16." prefix would miss most of it.
    default_prefixes = ["10.", "192.168."] + [f"172.{n}." for n in range(16, 32)]
    prefixes = tuple(known_cidr_prefixes or default_prefixes)
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=lookback_minutes)
    by_accessor: dict[str, set] = defaultdict(set)
    for event in parse_vault_audit(audit_log_path):
        if event.get("type") != "response":
            continue
        try:
            ts = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
        except (KeyError, ValueError):
            continue
        if ts < cutoff:
            continue
        accessor = event.get("auth", {}).get("accessor", "unknown")
        ip = event.get("request", {}).get("remote_address", "")
        if ip:
            by_accessor[accessor].add(ip)

    findings = []
    for accessor, ips in by_accessor.items():
        internal = {ip for ip in ips if ip.startswith(prefixes)}
        external = ips - internal
        if internal and external:
            findings.append(
                {
                    "type": "dual_source_ip",
                    "accessor": accessor,
                    "internal_ips": sorted(internal),
                    "external_ips": sorted(external),
                    "severity": "critical",
                }
            )
        elif len(external) > 1:
            # Multiple external IPs: possible token sharing or rotation
            # of attacker infrastructure
            findings.append(
                {
                    "type": "multiple_external_ips",
                    "accessor": accessor,
                    "external_ips": sorted(external),
                    "severity": "high",
                }
            )
    return findings


async def alert_new_path_access(
    accessor: str,
    path: str,
    redis_client,  # redis.asyncio.Redis (formerly aioredis)
    min_set_size_to_alert: int = 1,
) -> bool:
    """
    Track the set of paths each accessor has historically accessed.
    Alert on first-ever access to a new path, after the accessor has
    an established baseline (min_set_size_to_alert paths already known).

    Uses Redis SADD return value: 1 = element was new, 0 = already existed.
    SCARD gives current set size for baseline check.

    Returns True if an alert was fired.
    """
    key = f"vault:known_paths:{accessor}"
    # SADD is atomic; returns 1 if path is new, 0 if already in set
    is_new = await redis_client.sadd(key, path)
    if is_new:
        set_size = await redis_client.scard(key)
        # Only alert once baseline is established; the first N paths
        # for a brand-new accessor are not anomalous
        if set_size > min_set_size_to_alert:
            log.warning(
                "ALERT new_path_access accessor=%s path=%s known_paths=%d",
                accessor,
                path,
                set_size - 1,  # Exclude the new path from "known" count
            )
            return True
    return False
4. AWS Secrets Manager: CloudTrail Anomaly Detection
CloudTrail records Secrets Manager API calls under the event source secretsmanager.amazonaws.com. The relevant API calls for access anomaly detection are GetSecretValue, ListSecrets, and DescribeSecret. PutSecretValue and RotateSecret are the write-side rotation signals.
"""
secretsmanager_anomaly.py
Analyses CloudTrail events for Secrets Manager access anomalies.
For real-time detection, replace the lookup_events polling approach
with a CloudTrail → EventBridge → Lambda pipeline.
"""
import boto3
import json
import logging
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Any
log = logging.getLogger(__name__)
READ_APIS = {"GetSecretValue", "DescribeSecret", "ListSecrets"}
WRITE_APIS = {"PutSecretValue", "RotateSecret", "CreateSecret", "DeleteSecret"}


def fetch_cloudtrail_events(
    region: str,
    lookback_hours: int = 24,
) -> list[dict]:
    """
    Fetch Secrets Manager CloudTrail events via lookup_events.

    Note: lookup_events covers only the last 90 days, returns at most
    50 events per page, and is rate-limited to two requests per second
    per account per Region.
    For production, query CloudTrail Lake or Athena on S3.
    """
    ct = boto3.client("cloudtrail", region_name=region)
    start_time = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
    events = []
    paginator = ct.get_paginator("lookup_events")
    for page in paginator.paginate(
        LookupAttributes=[
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "secretsmanager.amazonaws.com",
            }
        ],
        StartTime=start_time,
        EndTime=datetime.now(timezone.utc),
    ):
        for raw in page["Events"]:
            # CloudTrailEvent is the full record serialised as a JSON string
            detail = json.loads(raw["CloudTrailEvent"])
            events.append(detail)
    return events


def detect_sm_anomalies(
    region: str = "us-east-1",
    lookback_hours: int = 24,
    bulk_threshold: int = 20,
    max_source_ips: int = 3,
    known_automation_arns: set[str] | None = None,
) -> list[dict]:
    """
    Detect anomalous Secrets Manager access in CloudTrail:
    1. Bulk access: identity reads >bulk_threshold distinct secrets
    2. Multiple source IPs: same identity accesses from >max_source_ips
    3. ListSecrets calls: applications that read known ARNs rarely need
       ListSecrets; enumeration usually comes from admin tools, sync
       operators, or attackers (allowlist the first two)

    Comparison against a historical IP or region baseline is not
    implemented here; it needs state that outlives a single run.
    """
    allowlist = known_automation_arns or set()
    events = fetch_cloudtrail_events(region, lookback_hours)
    by_identity: dict[str, dict[str, Any]] = defaultdict(
        lambda: {
            "read_events": [],
            "write_events": [],
            "secrets_read": set(),
            "source_ips": set(),
            "list_calls": 0,
        }
    )
    for event in events:
        event_name = event.get("eventName", "")
        identity = (
            event.get("userIdentity", {}).get("arn")
            or event.get("userIdentity", {}).get("principalId")
            or "unknown"
        )
        if identity in allowlist:
            continue
        source_ip = event.get("sourceIPAddress", "")
        request_params = event.get("requestParameters") or {}
        secret_id = request_params.get("secretId", "")
        data = by_identity[identity]
        if source_ip:
            # Skip empty values so they do not inflate the distinct-IP count
            data["source_ips"].add(source_ip)
        if event_name in READ_APIS:
            data["read_events"].append(event)
            if secret_id:
                data["secrets_read"].add(secret_id)
            if event_name == "ListSecrets":
                data["list_calls"] += 1
        elif event_name in WRITE_APIS:
            data["write_events"].append(event)

    anomalies = []
    for identity, data in by_identity.items():
        # Bulk secret access
        if len(data["secrets_read"]) > bulk_threshold:
            anomalies.append(
                {
                    "type": "bulk_secret_access",
                    "identity": identity,
                    "distinct_secrets_read": len(data["secrets_read"]),
                    "sample_secrets": sorted(data["secrets_read"])[:5],
                    "severity": "critical",
                }
            )
        # Multiple source IPs: token/key sharing or theft
        if len(data["source_ips"]) > max_source_ips:
            anomalies.append(
                {
                    "type": "multiple_source_ips",
                    "identity": identity,
                    "source_ip_count": len(data["source_ips"]),
                    "source_ips": sorted(data["source_ips"]),
                    "severity": "high",
                }
            )
        # ListSecrets: applications outside the allowlist should not need
        # to enumerate the namespace
        if data["list_calls"] > 0:
            anomalies.append(
                {
                    "type": "secret_namespace_enumeration",
                    "identity": identity,
                    "list_calls": data["list_calls"],
                    "severity": "high",
                    "note": "GetSecretValue on known ARNs is normal; "
                    "ListSecrets is reconnaissance",
                }
            )
    return anomalies


def detect_rotation_without_read(
    region: str = "us-east-1",
    lookback_hours: int = 2,
    propagation_window_minutes: int = 30,
) -> list[dict]:
    """
    Detect secret rotation events (PutSecretValue / RotateSecret)
    that are not followed by a GetSecretValue on the same secret
    within the propagation window.

    This indicates either:
    - Rotation succeeded but application failed to pick up new value
      (caching bug, ESO sync interval too long, Vault agent down)
    - Unauthorised write to a secret path (attacker injection)
    """
    events = fetch_cloudtrail_events(region, lookback_hours)
    rotation_events: list[dict] = []
    read_events: list[dict] = []
    for event in events:
        event_name = event.get("eventName", "")
        params = event.get("requestParameters") or {}
        secret_id = params.get("secretId", "")
        try:
            event_time = datetime.fromisoformat(
                event["eventTime"].replace("Z", "+00:00")
            )
        except (KeyError, ValueError):
            continue
        if event_name in ("PutSecretValue", "RotateSecret"):
            rotation_events.append(
                {"secret_id": secret_id, "time": event_time, "raw": event}
            )
        elif event_name == "GetSecretValue":
            read_events.append({"secret_id": secret_id, "time": event_time})

    findings = []
    window = timedelta(minutes=propagation_window_minutes)
    now = datetime.now(timezone.utc)
    for rotation in rotation_events:
        secret = rotation["secret_id"]
        rot_time = rotation["time"]
        if rot_time + window > now:
            # The propagation window is still open; judging now would
            # flag rotations that simply have not been read yet
            continue
        subsequent_read = any(
            r["secret_id"] == secret and rot_time <= r["time"] <= rot_time + window
            for r in read_events
        )
        if not subsequent_read:
            findings.append(
                {
                    "type": "rotation_without_subsequent_read",
                    "secret_id": secret,
                    "rotation_time": rot_time.isoformat(),
                    "propagation_window_minutes": propagation_window_minutes,
                    "severity": "warning",
                    "note": "Secret rotated but not read within propagation window. "
                    "Check ESO sync, Vault agent lease status, and application startup.",
                }
            )
    return findings
5. Correlation: Secret Rotation Without Kubernetes Deployment
Vault rotation events and Kubernetes deployment events exist in separate log streams. Correlating them reveals the full propagation chain — or its absence:
"""
rotation_deployment_correlation.py
Correlates Vault write events (rotation) with Kubernetes deployment
restart events to verify secret propagation completed.
Requires:
- Vault audit log events (write operations)
- Kubernetes event stream or deployment history (from k8s API or Loki)
"""
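from datetime import datetime, timedelta, timezone
import json
import logging

from kubernetes import client, config  # pip install kubernetes
from kubernetes.client.rest import ApiException

log = logging.getLogger(__name__)


def get_recent_deployments(
    namespace: str,
    lookback_minutes: int = 60,
) -> list[dict]:
    """
    Fetch Kubernetes deployment rollout events from the past window.
    Uses the k8s events API to find ScalingReplicaSet events on
    Deployments (which indicate a rollout was triggered).
    """
    config.load_incluster_config()  # or load_kube_config() for local
    core = client.CoreV1Api()
    apps = client.AppsV1Api()
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=lookback_minutes)
    secret_refs_cache: dict[str, list[str]] = {}
    deployments = []
    for event in core.list_namespaced_event(namespace).items:
        if (
            event.reason != "ScalingReplicaSet"
            or event.involved_object.kind != "Deployment"
        ):
            continue
        # Some events populate event_time instead of the legacy timestamps
        event_time = event.last_timestamp or event.first_timestamp or event.event_time
        if event_time is None or event_time < cutoff:
            continue
        name = event.involved_object.name
        if name not in secret_refs_cache:
            # The event's object reference carries no annotations, so read
            # them from the Deployment itself. Applications annotate their
            # Deployments with the secret paths they consume.
            try:
                meta = apps.read_namespaced_deployment(name, namespace).metadata
                raw = (meta.annotations or {}).get(
                    "secrets.systemshardening.com/paths", ""
                )
            except ApiException as exc:
                log.warning("Could not read deployment %s: %s", name, exc)
                raw = ""
            secret_refs_cache[name] = [p.strip() for p in raw.split(",") if p.strip()]
        deployments.append(
            {
                "deployment": name,
                "namespace": namespace,
                # Stored as naive UTC to match the Vault-side timestamps
                "timestamp": event_time.replace(tzinfo=None),
                "secret_refs": secret_refs_cache[name],
            }
        )
    return deployments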


def detect_rotation_without_deployment(
    vault_audit_path: str,
    k8s_namespace: str,
    grace_window_minutes: int = 30,
) -> list[dict]:
    """
    For each secret write event in Vault audit log, check whether a
    Kubernetes deployment restart occurred within the grace window.

    Assumes applications annotate Deployments with the secret paths
    they consume: secrets.systemshardening.com/paths: "secret/production/stripe/key"

    Grace window should be set to max(ESO sync interval, Vault agent
    renew-increment) + 5 minutes buffer.
    """
    lookback_minutes = 120
    deployments = get_recent_deployments(k8s_namespace, lookback_minutes=lookback_minutes)
    issues = []
    window = timedelta(minutes=grace_window_minutes)
    # Naive UTC, matching the naive timestamps used for deployments
    now = datetime.utcnow()
    oldest = now - timedelta(minutes=lookback_minutes)
    with open(vault_audit_path) as fh:
        for line in fh:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if event.get("type") != "response":
                continue
            if event.get("request", {}).get("operation") not in ("create", "update"):
                continue
            path = event.get("request", {}).get("path", "")
            try:
                ts = datetime.fromisoformat(
                    event["time"].replace("Z", "+00:00")
                ).replace(tzinfo=None)
            except (KeyError, ValueError):
                continue
            # Skip writes older than the deployment lookback (no data to
            # compare against) and writes whose grace window is still open
            if ts < oldest or ts + window > now:
                continue
            # Find deployments within grace window that reference this path
            related = [
                d
                for d in deployments
                if ts <= d["timestamp"] <= ts + window
                and any(path in ref for ref in d["secret_refs"])
            ]
            if not related:
                issues.append(
                    {
                        "type": "rotation_without_deployment",
                        "secret_path": path,
                        "rotation_time": ts.isoformat(),
                        "grace_window_minutes": grace_window_minutes,
                        "severity": "warning",
                    }
                )
    return issues
6. LogQL: Off-Hours Admin Token Access
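# Alert: any Vault operation using a token with an admin or root policy.
# Gating to hours outside the maintenance window (06:00–22:00 UTC
# Monday–Friday) happens in alert routing, not in this query.
# Assumes the Fluent Bit lift above, which flattens auth.policies to
# auth_policies. The bare json parser skips arrays, so the policy list is
# extracted explicitly as a string.
{job="vault-audit"}
  | json
  | json auth_policies="auth_policies"
  | type = "response"
  | auth_policies =~ ".*(admin|root).*"
  | line_format "{{.time}} {{.auth_accessor}} {{.request_path}}"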
Pair this with alert routing that gates on time of day. Grafana alerting offers mute timings for this purpose. Alternatively, use Alertmanager’s time_intervals: define the hours during which admin activity is expected, mute the alert during those hours, and let it notify only outside them:
# alertmanager.yml
time_intervals:
  - name: maintenance_window
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "06:00"
            end_time: "22:00"
route:
  receiver: security-pagerduty
  routes:
    - match:
        alertname: VaultAdminTokenAccess
      mute_time_intervals:
        - maintenance_window
Any VaultAdminTokenAccess alert that fires and is not muted by maintenance_window fires to PagerDuty immediately — it represents admin token use at a time when no maintenance is planned.
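If the gating is done in a Python side-car rather than in Alertmanager, the same window can be expressed directly. This is a minimal sketch: the function name and constants are illustrative, it mirrors the Monday to Friday 06:00 to 22:00 UTC window above, and it assumes the caller passes a timezone-aware timestamp.

```python
from datetime import datetime, time, timezone

# Mirrors the maintenance_window interval: Monday (0) to Friday (4),
# 06:00 inclusive to 22:00 exclusive, evaluated in UTC.
MAINTENANCE_DAYS = {0, 1, 2, 3, 4}
MAINTENANCE_START = time(6, 0)
MAINTENANCE_END = time(22, 0)


def is_outside_maintenance_window(ts: datetime) -> bool:
    """Return True when an admin-token event should raise an alert."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    in_window = (
        utc.weekday() in MAINTENANCE_DAYS
        and MAINTENANCE_START <= utc.time() < MAINTENANCE_END
    )
    return not in_window


sunday_night = datetime(2026, 5, 10, 3, 14, tzinfo=timezone.utc)
monday_morning = datetime(2026, 5, 11, 10, 0, tzinfo=timezone.utc)
```

Rejecting naive timestamps is deliberate: astimezone would otherwise interpret them in the host's local zone and silently shift the window.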
7. Grafana Dashboard
# grafana/dashboards/vault-anomaly.json (abbreviated panel definitions, shown as YAML for readability)
panels:
  - title: "Vault Reads per Accessor (5m buckets)"
    type: timeseries
    datasource: Loki
    targets:
      - expr: |
          sum by (auth_accessor) (
            count_over_time(
              {job="vault-audit"} | json | type="response"
              | request_operation="read" [5m]
            )
          )
        legendFormat: "{{auth_accessor}}"
  - title: "Secrets Manager API Calls by Event (1h)"
    type: barchart
    datasource: CloudWatch
    targets:
      - expression: |
          SEARCH('{AWS/CloudTrail} EventName', 'Sum', 3600)
        filters:
          EventSource: secretsmanager.amazonaws.com
  - title: "First-Ever Path Accesses (Redis set delta)"
    type: stat
    description: "Count of new accessor:path pairs seen this hour. Spikes indicate lateral movement."
    targets:
      - expr: vault_new_path_accesses_total
  - title: "Vault Audit Gap Detection"
    type: stat
    datasource: Loki
    description: "Returns 1 when no audit event arrived in the last 60s: log shipping is broken or the audit device is disabled."
    targets:
      - expr: absent_over_time({job="vault-audit"}[1m])
    thresholds:
      - color: red
        value: 1
  - title: "Rotation Without Subsequent Read (last 6h)"
    type: logs
    datasource: Loki
    targets:
      - expr: |
          {job="vault-anomaly-detector"}
            | json
            | type="rotation_without_subsequent_read"
The audit gap panel is critical. A 60-second gap in the Vault audit log stream is either a log shipping failure (investigate Fluent Bit) or an attacker disabling the audit device before harvesting (escalate immediately). In either case, the gap is the signal.
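The same check can be run offline against a batch of audit timestamps, for example when reviewing an archived log after an incident. A minimal sketch follows; find_audit_gaps is an illustrative name and the 60-second limit matches the panel description rather than any Vault default.

```python
from datetime import datetime, timedelta, timezone


def find_audit_gaps(
    timestamps: list[datetime],
    max_gap_seconds: int = 60,
) -> list[tuple[datetime, datetime]]:
    """Return (last_event_before_gap, first_event_after_gap) pairs."""
    ordered = sorted(timestamps)
    limit = timedelta(seconds=max_gap_seconds)
    gaps = []
    # Compare each event with its successor; a silence longer than the
    # limit is either a shipping failure or a disabled audit device.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps


t0 = datetime(2026, 5, 9, 10, 0, 0, tzinfo=timezone.utc)
stamps = [t0 + timedelta(seconds=s) for s in (0, 30, 200, 210)]
gaps = find_audit_gaps(stamps)
```

On a quiet Vault cluster the limit may need to be longer than 60 seconds; the agent renewal cadence in your baseline is a reasonable lower bound.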
Expected Behaviour
During normal operations: The Grafana “Reads per Accessor” panel shows flat, predictable lines. k8s-backend-prod sits at approximately 4 reads per 5m window (two secrets, each renewed every 150 seconds, gives 2 × 300 / 150 = 4). k8s-eso shows a spike every 3600 seconds as it syncs all secrets, then drops back to zero. No bulk-read alerts fire. The Redis first-ever-path counter is zero across each full day unless a new service is deployed and reads a new secret for the first time.
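A rough steady-state estimate for any renewing accessor is the number of secrets multiplied by the window length and divided by the renewal interval. The helper below is a small sketch of that arithmetic; its name is illustrative, and it ignores startup reads and jitter.

```python
def expected_reads_per_window(
    num_secrets: int,
    renewal_interval_s: float,
    window_s: float = 300,
) -> float:
    """Estimate steady-state reads in one window for a renewing accessor."""
    if renewal_interval_s <= 0:
        raise ValueError("renewal_interval_s must be positive")
    return num_secrets * (window_s / renewal_interval_s)


# Two secrets renewed every 150 seconds, observed over a 5-minute window
backend_estimate = expected_reads_per_window(2, 150)
# One secret renewed every 300 seconds
single_estimate = expected_reads_per_window(1, 300)
```

Comparing this estimate with the observed panel value is a quick sanity check before choosing an alert threshold for that accessor.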
During a harvesting attempt: At T+0, a bulk read alert fires with reads_in_window: 487, distinct_paths: 43, accessor: tok_A1b2c3. Simultaneously, the dual-source IP detector outputs an alert with internal_ips: ["10.0.1.47"] and external_ips: ["185.234.219.12"]. The Vault operator immediately runs:
vault token revoke -accessor tok_A1b2c3
The stolen token is invalid as soon as the revocation completes. Subsequent requests from 185.234.219.12 are rejected with HTTP 403 (permission denied), which is visible in the audit log. The legitimate workload, if it was still using the same token, cannot renew it and needs a new one; Vault agent handles this automatically by re-authenticating through the Kubernetes auth method.
After a rotation event with no subsequent read: The Python detector fires a rotation_without_subsequent_read finding for secret/production/stripe/api-key. An operator checks the ESO sync logs:
kubectl logs -n external-secrets deployment/external-secrets -c manager \
| grep "stripe/api-key" | tail -20
The output shows ERROR: rate limit exceeded; retry in 30s — ESO hit the Secrets Manager GetSecretValue rate limit and backed off. The finding correctly identified a propagation failure before the application began returning payment processing errors. The fix is a Secrets Manager request rate increase or an ESO batch-read configuration change.
Trade-offs
Bulk read threshold tuning: The 30-read, 10-path threshold works for applications with two to five secrets. ESO syncing 80 secrets per cycle produces a finding on every sync cycle. The fix is not to raise the threshold globally — it is to maintain an allowlist of accessors that are expected to perform bulk reads, and apply higher thresholds to those accessors only. Keeping the default threshold low catches attackers; the allowlist prevents operational noise.
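One way to express the allowlist is a per-accessor override table consulted before the default. This is a sketch under assumptions: the override values for k8s-eso are invented for illustration, and the names Threshold, OVERRIDES, and exceeds_threshold are not from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    min_reads: int
    min_distinct_paths: int


# Default matches the 30-read, 10-path threshold discussed above
DEFAULT_THRESHOLD = Threshold(min_reads=30, min_distinct_paths=10)

# Hypothetical overrides for accessors expected to perform bulk reads
OVERRIDES = {
    "k8s-eso": Threshold(min_reads=200, min_distinct_paths=100),
}


def threshold_for(accessor: str, overrides: dict = OVERRIDES) -> Threshold:
    """Return the override for an accessor, or the low default."""
    return overrides.get(accessor, DEFAULT_THRESHOLD)


def exceeds_threshold(
    accessor: str,
    reads: int,
    distinct_paths: int,
    overrides: dict = OVERRIDES,
) -> bool:
    """True when both the read count and path diversity cross the limit."""
    limit = threshold_for(accessor, overrides)
    return reads >= limit.min_reads and distinct_paths >= limit.min_distinct_paths
```

Raising the limit for a known accessor, rather than excluding it outright, keeps some coverage if that accessor's token is the one that gets stolen.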
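CloudTrail polling vs. streaming: Events typically become visible through the lookup_events API 5–15 minutes after the underlying call, and the API itself is rate-limited. For real-time detection, replace polling with CloudTrail → EventBridge → Lambda: CloudTrail delivers events to EventBridge in near-real-time, Lambda invokes the anomaly detector, and the finding is typically available within about 30 seconds of the API call. One caveat: EventBridge does not deliver read-only management events by default, so confirm that the rule is configured to receive GetSecretValue and ListSecrets calls. The polling approach is appropriate for batch analysis and daily reports; the streaming approach is required for incident response timescales.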
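The Lambda end of the streaming path might look like the sketch below. It assumes the EventBridge rule forwards CloudTrail records with the detail-type "AWS API Call via CloudTrail"; the handler name, the WATCHED_EVENTS set, and the shape of the returned dictionary are illustrative choices, and the hand-off to the anomaly detector is left as a comment because its interface is not specified here.

```python
from typing import Any, Optional

# Secrets Manager calls this sketch cares about (an illustrative selection).
WATCHED_EVENTS = {"GetSecretValue", "ListSecrets"}


def extract_secret_access(event: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Normalise an EventBridge-delivered CloudTrail record, or return None if irrelevant."""
    if event.get("detail-type") != "AWS API Call via CloudTrail":
        return None
    detail = event.get("detail") or {}
    if detail.get("eventSource") != "secretsmanager.amazonaws.com":
        return None
    if detail.get("eventName") not in WATCHED_EVENTS:
        return None
    # requestParameters can be null, for example on ListSecrets.
    params = detail.get("requestParameters") or {}
    identity = detail.get("userIdentity") or {}
    return {
        "event_name": detail["eventName"],
        "event_time": detail.get("eventTime"),
        "principal": identity.get("arn"),
        "source_ip": detail.get("sourceIPAddress"),
        "secret_id": params.get("secretId"),
    }


def handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Lambda entry point: filter the event and pass it on."""
    access = extract_secret_access(event)
    if access is None:
        return {"status": "ignored"}
    # Hypothetical hook: forward `access` to the anomaly detector here.
    # Sliding-window state must live outside the function (for example in
    # Redis), because separate invocations do not reliably share memory.
    return {"status": "processed", "access": access}
```

Keeping the extraction step a pure function means the same normalisation can be reused by the polling path, which receives the same CloudTrail record fields.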
First-ever path detection in Redis: Redis key expiry is a footgun here. Suppose you set a TTL on vault:known_paths:{accessor}, for example 30 days, and refresh it on every write. Any accessor that has not been seen in 30 days will have its entire path history expire, and when it reappears, every path it accesses will look like a first-ever access. If the TTL is set once and never refreshed, the situation is worse: adding members to a set does not extend its TTL, so the history expires 30 days after the TTL was set even for an accessor that is active every day. Either set no TTL (accept unbounded Redis memory growth) or use a refreshed TTL long enough to span your longest inactive period and accept occasional false positives when legitimate long-dormant services wake up.
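A minimal sketch of the refreshed-TTL variant follows. It relies only on the sadd and expire calls as they exist in redis-py; the 180-day TTL is an assumed value for the longest dormant period, and InMemoryStore is a hypothetical stand-in so the example runs without a Redis server.

```python
from typing import Protocol

# Assumption: no legitimate accessor stays dormant longer than 180 days.
KNOWN_PATHS_TTL_SECONDS = 180 * 24 * 3600


class SetStore(Protocol):
    """The two redis-py methods this sketch needs."""

    def sadd(self, name: str, *values: str) -> int: ...
    def expire(self, name: str, time: int) -> bool: ...


def is_first_ever_path(store: SetStore, accessor: str, path: str) -> bool:
    """Record the path and report whether this accessor had never read it before."""
    key = f"vault:known_paths:{accessor}"
    # SADD returns the number of members that were actually new.
    added = store.sadd(key, path)
    # Refresh the TTL on every access so active accessors never expire.
    store.expire(key, KNOWN_PATHS_TTL_SECONDS)
    return added == 1


class InMemoryStore:
    """Hypothetical stand-in for redis.Redis, for local experimentation only."""

    def __init__(self) -> None:
        self.sets: dict[str, set[str]] = {}
        self.ttls: dict[str, int] = {}

    def sadd(self, name: str, *values: str) -> int:
        members = self.sets.setdefault(name, set())
        new = set(values) - members
        members.update(new)
        return len(new)

    def expire(self, name: str, time: int) -> bool:
        if name not in self.sets:
            return False
        self.ttls[name] = time
        return True
```

Because SADD is atomic, two detector instances processing the same first read concurrently will not both report it as new.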
Rotation-without-deployment correlation accuracy: The correlation depends on Kubernetes Deployments being annotated with the secret paths they consume. This annotation is not automatic — it requires a convention enforced by admission policy (OPA Gatekeeper or Kyverno). Without consistent annotation, the correlation produces false positives for every rotation of a secret whose consumers are not annotated.
Admin token off-hours alerting and time zones: Maintenance windows defined in UTC are accurate for infrastructure; they do not account for on-call engineers in non-UTC time zones performing legitimate emergency operations. The alert should fire to an on-call channel with a one-click acknowledgment — not directly to PagerDuty as a page — unless a second corroborating signal (bulk read, external IP) is present simultaneously.
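The routing rule described above can be expressed as a small decision function. This is a sketch under stated assumptions: the 02:00 to 04:00 UTC maintenance window, the route names, and the idea that admin token use inside the window raises no alert at all are all hypothetical.

```python
from datetime import datetime, time, timezone

# Hypothetical maintenance window, defined in UTC as in the text.
MAINTENANCE_START = time(2, 0)
MAINTENANCE_END = time(4, 0)


def in_maintenance_window(ts: datetime) -> bool:
    """Check a timestamp against the UTC window.

    Pass timezone-aware datetimes: astimezone() treats naive values as
    local time, which reintroduces the time-zone ambiguity.
    """
    utc_time = ts.astimezone(timezone.utc).time()
    return MAINTENANCE_START <= utc_time < MAINTENANCE_END


def route_admin_token_use(ts: datetime, bulk_read: bool, external_ip: bool) -> str:
    """Choose where an admin-token event goes: 'none', 'oncall-channel', or 'page'."""
    if in_maintenance_window(ts):
        return "none"
    # Page only when a second, corroborating signal is present.
    if bulk_read or external_ip:
        return "page"
    return "oncall-channel"
```

An engineer acting at 23:00 UTC with no other signal lands in the on-call channel for acknowledgment; the same event combined with an external source IP becomes a page.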
Failure Modes
No baseline of normal access patterns: All reads look identical. A bulk read from an attacker is indistinguishable from ESO’s sync cycle. The first-ever path detector has no base set to compare against — everything is new. Without baseline data, anomaly detection is pattern matching with no patterns. Resolution: instrument for at least two weeks before enabling alerts, use the recording rules to build per-accessor read profiles, and do not skip the allowlist configuration for known high-frequency accessors.
Vault audit log not shipped externally from day one: If audit logs are written only to the Vault server’s local filesystem, an attacker with server access can truncate or delete the file, and one with sufficient Vault privileges can disable the audit device before harvesting. Whatever was logged before that point may still exist on disk, but it is not searchable and can no longer be trusted. Resolution: configure Fluent Bit or Vector to tail the audit log from the moment Vault starts. The external Loki stream is append-only from the attacker’s perspective: they cannot retroactively delete events already shipped.
Alertmanager routing misconfigured: The bulk read alert fires in Prometheus. Alertmanager routes it to a channel that no one monitors at night. The alert resolves after the window passes (the attacker finished harvesting). The next morning, an operator sees the resolved alert and assumes it was a transient glitch. Resolution: bulk read and dual-source IP alerts should route to PagerDuty through a dedicated route that has no mute_time_intervals. Because inhibit_rules are global rather than per-route, also check that no inhibit rule can match these alerts as a target. Secret access anomalies are always actionable.
Not correlating rotation events with deployments: Rotation runs successfully. ESO fails silently (rate limit, misconfigured refresh interval, IAM permission missing on the new secret version). The application continues running with the old credential, which the upstream system (RDS, Stripe, SendGrid) invalidates on its own rotation schedule. The application starts returning authentication errors days after the rotation event. Without rotation-propagation monitoring, the root cause is non-obvious. Resolution: the detect_rotation_without_subsequent_read detector gives a finding within 30 minutes of a rotation that is not propagated. This is a five-minute fix (check ESO logs) rather than a two-hour production incident investigation.
Secrets Manager ListSecrets not monitored: Attackers enumerate the secret namespace before reading secrets. ListSecrets returns up to 100 secret names and ARNs per call, so a few paginated calls map the entire namespace. If ListSecrets is not explicitly alerted on, an attacker can build a complete inventory of secrets before making any GetSecretValue calls that would trigger a bulk-read alert. Resolution: any ListSecrets call from a non-admin identity should be an immediate alert. Applications rarely have a legitimate reason to enumerate the namespace, because they already know the ARN of the secret they need; any workload that genuinely must list secrets should be an explicit, documented exception.
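A ListSecrets check over CloudTrail records might be sketched as follows. The function names, the finding format, and the admin_arns set are hypothetical; the record fields (eventSource, eventName, userIdentity, sessionContext.sessionIssuer) are standard CloudTrail fields. For assumed roles, the sketch compares the role ARN from sessionIssuer rather than the per-session ARN, which changes with every session.

```python
from typing import Any, Optional


def principal_arn(record: dict[str, Any]) -> Optional[str]:
    """Return the role ARN for assumed-role sessions, else the caller's own ARN."""
    identity = record.get("userIdentity") or {}
    session = identity.get("sessionContext") or {}
    issuer = session.get("sessionIssuer") or {}
    return issuer.get("arn") or identity.get("arn")


def list_secrets_finding(
    record: dict[str, Any], admin_arns: set[str]
) -> Optional[dict[str, Any]]:
    """Return a finding for ListSecrets calls made by identities not in admin_arns."""
    if record.get("eventSource") != "secretsmanager.amazonaws.com":
        return None
    if record.get("eventName") != "ListSecrets":
        return None
    arn = principal_arn(record)
    if arn in admin_arns:
        return None
    return {
        "finding": "list_secrets_by_non_admin",
        "principal": arn,
        "source_ip": record.get("sourceIPAddress"),
        "event_time": record.get("eventTime"),
    }
```

A record with no identifiable principal is treated as non-admin and produces a finding, which errs on the side of alerting.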