Secrets in Transit: mTLS and Certificate Pinning for Secret Store Communication
The Problem
Most organizations treat secrets management as a storage problem: rotate credentials, encrypt at rest, use Vault or AWS Secrets Manager instead of environment variables hardcoded into images. That framing misses an entire attack surface. The network path between your application and the secret store is where every secret travels, protected only by the TLS session that carries it. That session terminates inside your application and inside the secret store, so the secret is plaintext at both ends and stays confidential on the wire only as long as the peer really is the secret store. Anything that can position itself between your application and the secret store and terminate TLS convincingly reads every secret that application ever retrieves.
To understand the concrete exposure, trace what happens when an application reads a database password at startup. The application calls boto3.client('secretsmanager').get_secret_value(SecretId='prod/db/password'). A TCP connection opens to secretsmanager.us-east-1.amazonaws.com or to a VPC endpoint for that service. A TLS handshake validates the server’s certificate against trusted CAs. The application sends an HTTPS request with an AWS SigV4 signature in the Authorization header — this is the authentication credential that proves the application has access to the secret. AWS returns an HTTP 200 response whose body contains the secret value in plaintext JSON: {"SecretString": "hunter2_prod_not_really"}. The application extracts the string and connects to the database.
The same flow applies to HashiCorp Vault: vault kv get secret/db-password opens a TLS session to https://vault.internal:8200, sends the Vault token in an X-Vault-Token header, and gets back a JSON response with the plaintext secret in data.data.password. The decrypted value is never stored in Vault at rest — the Vault server holds encrypted data and decrypts it on read, returning plaintext to authorized callers. That means the only place the plaintext secret value ever exists in transit is in this HTTP response body, which traverses whatever network path lies between your application and the Vault or Secrets Manager endpoint.
If anything can intercept that path, it reads decrypted secrets without needing to compromise the secret store itself. No Vault audit log entry for an unauthorized read. No AWS CloudTrail event from an unexpected principal. The attacker sees exactly what the application sees.
Four Attack Scenarios on the Secrets API Path
DNS hijacking to a fake Vault endpoint. A foothold in a sidecar container or adjacent service, for example through an SSRF vulnerability, lets the attacker redirect requests intended for vault.service.consul to an attacker-controlled server. In many Vault client configurations, the Vault address is read from an environment variable: VAULT_ADDR=https://vault.service.consul:8200. If the attacker can modify the environment, or if a DNS poisoning attack redirects that hostname, the Vault client TLS-connects to the attacker’s server. The attacker presents a certificate signed by any CA the client trusts, or, if the application is using VAULT_SKIP_VERIFY=true for “convenience,” no valid certificate at all. The Vault client then sends its authentication credential: an X-Vault-Token header, an AppRole role_id and secret_id in a login request body, or a JWT submitted to a login endpoint. The attacker harvests that credential, which can then be replayed against the real Vault endpoint to read any secrets the application was authorized to access. The attacker returns plausible-looking fake secrets to avoid detection.
Compromised internal CA. Enterprise environments typically run an internal CA for services that don’t face the public internet. Vault is one of those services: vault.internal:8200 has a TLS certificate signed by corp-internal-ca.example.com. If an attacker compromises the internal CA — a realistic scenario given that internal CAs frequently run on under-monitored Windows hosts with PKI misconfigured to allow certificate issuance without approval — they can issue a certificate for vault.internal and perform MITM at a network boundary. This bypasses standard TLS validation entirely because the forged certificate chains to a CA the application trusts. The standard advice of “check the certificate is valid” provides zero protection against this scenario.
Bulk secret harvesting from a valid token. An attacker who has obtained a Vault token through any means — a leaked CI pipeline variable, a stolen service account JWT, an over-permissioned Kubernetes serviceaccount that maps to a permissive Vault role — makes authenticated requests to list and read secrets. vault kv list secret/ returns all secret paths. Iterating through those paths with vault kv get is a normal API operation that produces normal audit log entries. Each individual read is authorized. The anomaly is the rate: 10,000 API calls in five minutes from a single token is not normal application behavior. Without traffic anomaly detection or rate limiting, the attacker harvests the entire accessible namespace of the secret store, exfiltrates the results, and exits. By the time the stolen credentials are revoked — if they’re detected at all — every secret accessible to that token is compromised.
Pod-to-secrets lateral movement via missing NetworkPolicy. In a Kubernetes cluster without NetworkPolicy, every pod can initiate TCP connections to every other pod and service. An attacker who has compromised any application pod — via a known vulnerability in the application, via a supply chain attack that ships malicious code in a container image, via a misconfigured debug endpoint — can reach the Vault service. If the compromised application has a serviceaccount mapped to a Vault role with broad read permissions (common when teams grant roles by namespace rather than by specific pod label), the attacker reads secrets from Vault using the existing Vault agent socket or by direct HTTP calls to the Vault service IP. The NetworkPolicy you thought existed for the secret store namespace turns out to apply only to ingress, not egress from the production namespace.
Threat Model
The relevant threats in priority order:
- MITM via DNS hijacking or SSRF: application sends its authentication credential to the attacker; attacker harvests all secrets the token can access; fake secrets returned to application may trigger cascading failures or pass undetected
- Compromised internal CA: MITM on the Vault TLS session; every secret retrieved by any application using that CA trust anchor is readable in transit; the attack is invisible to Vault audit logging
- Bulk secret reads from a compromised token: complete exfiltration of the accessible secret namespace via the legitimate API; individual requests are authorized; only rate or volume anomalies reveal the attack
- Unrestricted pod-to-Vault network access: any compromised pod can query the secret store; blast radius of any container compromise expands to include all secrets accessible via Vault role mappings
Hardening Configuration
1. Certificate Pinning for Vault TLS
Standard TLS validation checks that the server certificate is signed by a trusted CA and that the hostname matches the CN or SAN. Pinning checks that the specific certificate — or its public key hash — matches a value you embedded in the application at deployment time. A compromised CA that issues a fraudulent certificate for vault.internal passes standard TLS validation but fails the pin comparison immediately.
The most robust pinning approach pins to the public key hash (SPKI pin) rather than the full certificate, because public keys persist across certificate renewals when the same key pair is reused. For Vault specifically, pinning to the leaf certificate SHA-256 is simpler and appropriate when you have tight control over certificate rotation:
import hashlib
import ssl

import hvac
import requests
from requests.adapters import HTTPAdapter

# Generate this value from your current Vault certificate:
#   openssl s_client -connect vault.internal:8200 </dev/null 2>/dev/null \
#     | openssl x509 -outform DER \
#     | sha256sum
# Store the result as "sha256:<hex>"
VAULT_CERT_PIN = "sha256:a3f2c1d9e8b4567890abcdef1234567890abcdef1234567890abcdef12345678"


class PinnedSSLAdapter(HTTPAdapter):
    """
    HTTPAdapter that validates the server's leaf certificate against a
    pre-computed SHA-256 pin after the TLS handshake completes. Raises
    ssl.SSLError if the pin does not match before any application data
    is read from the response. The request itself (headers included)
    has already been sent by the time the check runs.
    """

    def __init__(self, cert_pin: str, *args, **kwargs):
        self.cert_pin = cert_pin
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Force explicit CA bundle; never set verify=False
        kwargs['verify'] = '/etc/vault/tls/vault-ca.crt'
        response = super().send(request, **kwargs)

        # Extract the leaf certificate in DER form from the underlying
        # socket. This is the certificate the server presented, not the
        # CA cert. binary_form=True returns the raw DER bytes; False would
        # return a decoded dict that omits the information needed to hash.
        # _connection is a private urllib3 attribute, so re-test this
        # adapter whenever urllib3 is upgraded.
        raw_conn = getattr(response.raw, '_connection', None)
        if raw_conn is None or getattr(raw_conn, 'sock', None) is None:
            raise ssl.SSLError("Cannot inspect TLS connection: no socket available")

        cert_der = raw_conn.sock.getpeercert(binary_form=True)
        cert_hash = "sha256:" + hashlib.sha256(cert_der).hexdigest()

        if cert_hash != self.cert_pin:
            raise ssl.SSLError(
                f"Certificate pin mismatch for {request.url}. "
                f"Expected {self.cert_pin}, got {cert_hash}. "
                f"Possible MITM or certificate rotation without pin update."
            )
        return response


def build_vault_client(vault_addr: str, vault_token: str) -> hvac.Client:
    session = requests.Session()
    session.mount('https://', PinnedSSLAdapter(cert_pin=VAULT_CERT_PIN))
    client = hvac.Client(
        url=vault_addr,
        token=vault_token,
        session=session,
    )
    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed after establishing pinned session")
    return client
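When the pin check fails, ssl.SSLError is raised inside send(), before the response body is read. The application never processes the response from the potentially attacker-controlled server, so it fails with an error rather than silently operating with an attacker’s fake secret. Note the limit of this design: the check runs after the request has been sent, so the request headers, including X-Vault-Token, have already reached the server that failed the check. Treat a pin mismatch as a credential exposure and revoke the token that was in use.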
For public key pinning (SPKI), which survives certificate renewals with the same key pair:
import base64
import hashlib

from cryptography import x509
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization

# Generate SPKI pin from existing certificate:
#   openssl s_client -connect vault.internal:8200 </dev/null 2>/dev/null \
#     | openssl x509 -pubkey -noout \
#     | openssl pkey -pubin -outform DER \
#     | sha256sum | awk '{print $1}' \
#     | xxd -r -p | base64
VAULT_SPKI_PIN = "base64encodedSHA256ofSubjectPublicKeyInfo=="


def verify_spki_pin(cert_der: bytes, expected_pin: str) -> bool:
    # Parse the DER certificate and re-encode only its public key
    # as a SubjectPublicKeyInfo structure, which is what the pin hashes.
    cert = x509.load_der_x509_certificate(cert_der, default_backend())
    pub_key_der = cert.public_key().public_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    pin = base64.b64encode(hashlib.sha256(pub_key_der).digest()).decode()
    return pin == expected_pin
2. mTLS for Vault Agent Authentication
Standard TLS is one-directional: the client verifies the server’s identity. mTLS requires the client to also present a certificate, which the server validates. For Vault, this means the application presents a TLS client certificate during the handshake, and Vault’s cert auth method maps that certificate’s CN or organization to a specific Vault policy. This replaces token-based authentication with certificate-based authentication at the TLS layer. An attacker who impersonates the Vault endpoint gains nothing replayable: the client proves possession of its private key during the handshake, the key itself never crosses the wire, and without it the attacker cannot complete a client-authenticated handshake with the real Vault.
Configure Vault Agent to use the cert auth method with a client certificate:
# /etc/vault-agent/vault-agent.hcl
# Application presents a TLS client certificate; Vault validates it
# and maps the certificate to a named policy at the mTLS layer.
vault {
  address = "https://vault.internal:8200"
  ca_cert = "/etc/vault/tls/vault-ca.crt"

  # The application's own certificate and private key
  client_cert = "/etc/app/tls/client.crt"
  client_key  = "/etc/app/tls/client.key"

  # Name the server certificate must match during verification
  tls_server_name = "vault.internal"

  # Never set tls_skip_verify = true; this negates the entire model
}

auto_auth {
  method "cert" {
    # "name" must match the cert auth role configured in Vault
    config = {
      name = "payment-service"
    }
  }

  # Sink for the token Vault Agent produces after mTLS auth.
  # Sinks are declared inside the auto_auth block.
  sink "file" {
    config = {
      path = "/run/vault-agent/token"
      mode = 0400
    }
  }
}

template {
  source      = "/etc/vault-agent/templates/db-password.ctmpl"
  destination = "/run/secrets/db-password"
  perms       = 0400
}
Configure the Vault server to accept this certificate and map it to the correct policy:
# Enable the cert auth method (one-time cluster setup)
vault auth enable cert

# Register the application's client CA certificate.
# Vault will accept any client certificate signed by this CA
# with the CN "payment-service" and apply the payment-service-secrets policy.
# token_policies and token_ttl are the current parameter names;
# older Vault releases call them policies and ttl.
vault write auth/cert/certs/payment-service \
    display_name=payment-service \
    token_policies=payment-service-secrets \
    certificate=@/etc/vault/client-cas/payment-service-ca.crt \
    allowed_common_names="payment-service" \
    token_ttl=1h

# The policy itself: restrict to specific secret paths
vault policy write payment-service-secrets - <<'EOF'
path "secret/data/production/payment/*" {
  capabilities = ["read"]
}
path "secret/metadata/production/payment/*" {
  capabilities = ["list", "read"]
}
EOF
In Kubernetes, cert-manager manages the client certificate lifecycle automatically. The application deployment requests a Certificate resource; cert-manager issues it and rotates it before expiry; Vault Agent picks up the renewed certificate and re-authenticates:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payment-service-vault-client
  namespace: production
spec:
  secretName: payment-service-vault-client-tls
  duration: 24h
  renewBefore: 4h
  subject:
    organizations:
      - production
  commonName: payment-service
  isCA: false
  usages:
    - client auth
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer
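The 24-hour certificate lifetime with a 4-hour renewal window means cert-manager reissues the client certificate roughly 20 hours into its life, and no certificate is ever valid for more than a day. A certificate and private key extracted from a running pod’s filesystem are worthless after at most 24 hours. Because renewal does not revoke the previous certificate, the stolen pair stays usable for somewhere between 4 and 24 hours, depending on where in the rotation cycle it was copied.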
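The exposure window for a stolen client certificate follows directly from the two cert-manager fields. The sketch below works through that arithmetic; the function name is hypothetical, and it assumes the pod always holds the most recently issued certificate and that renewal does not revoke the previous one.

```python
from datetime import timedelta


def stolen_cert_exposure(duration: timedelta, renew_before: timedelta):
    """Return (shortest, longest) remaining validity of a stolen certificate.

    cert-manager reissues at (duration - renew_before) into the lifetime, so
    the certificate sitting in a pod is between 0 and that many hours old.
    """
    if renew_before >= duration:
        raise ValueError("renewBefore must be shorter than duration")
    shortest = renew_before   # copied just before the renewal fired
    longest = duration        # copied immediately after issuance
    return shortest, longest


# Values taken from the Certificate resource above: duration 24h, renewBefore 4h
shortest, longest = stolen_cert_exposure(timedelta(hours=24), timedelta(hours=4))
print(f"usable for {shortest.total_seconds() / 3600:.0f} "
      f"to {longest.total_seconds() / 3600:.0f} hours")
```

Shortening duration lowers the upper bound; raising renewBefore only raises the lower bound, so it does not reduce exposure on its own.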
3. NetworkPolicy Restricting Vault Access to Labeled Pods
The goal is a default-deny model where only pods that explicitly declare they need secret store access can initiate connections to the Vault service. Everything else gets a connection timeout.
# Default deny: blocks all egress from the production namespace
# that is not explicitly permitted by another NetworkPolicy.
# Apply this first; subsequent policies are additive.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    # Allow intra-namespace traffic (pods talking to each other)
    - to:
        - podSelector: {}
    # Allow DNS resolution (kube-dns runs in kube-system)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
---
# Explicit opt-in: pods labeled needs-secrets=true can reach the Vault service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-vault-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      needs-secrets: "true"
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector and podSelector in the same list item are ANDed
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: vault
          podSelector:
            matchLabels:
              app.kubernetes.io/name: vault
      ports:
        - port: 8200
          protocol: TCP
---
# Ingress side: Vault only accepts connections from labeled pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vault-allow-labeled-clients
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
          podSelector:
            matchLabels:
              needs-secrets: "true"
      ports:
        - port: 8200
          protocol: TCP
A compromised pod without the needs-secrets: "true" label attempts to connect to vault.vault.svc.cluster.local:8200 and receives no TCP SYN-ACK. The connection times out rather than being refused — this is typical for NetworkPolicy enforcement via the CNI (Cilium, Calico, or similar). The absence of a TCP RST means the application cannot easily distinguish “wrong address” from “NetworkPolicy blocked” without inspecting the CNI’s flow logs.
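The difference between a timeout and a refusal can be observed from inside a pod with a small probe. The following is a minimal sketch using only the standard library; probe_tcp is a hypothetical helper, and the demonstration runs against a local listener rather than a real Vault service, so it can only show the "open" and "refused" outcomes. A dropped SYN would surface as "timeout".

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with RST: something is reachable but not listening.
        return "refused"
    except socket.timeout:
        # No answer at all, which is what a silently dropped SYN looks like.
        return "timeout"


# Local demonstration: a listening socket, then the same port after closing it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

open_result = probe_tcp("127.0.0.1", port)
listener.close()
closed_result = probe_tcp("127.0.0.1", port)
print(open_result, closed_result)
```

A "timeout" from a pod that should be allowed usually means a missing needs-secrets label or a policy typo; the CNI flow logs remain the authoritative record.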
4. AWS Secrets Manager — VPC Endpoint to Eliminate the Internet Path
By default, secretsmanager.us-east-1.amazonaws.com resolves to public AWS infrastructure. Traffic from a VPC instance to that public endpoint leaves the VPC through an internet gateway or NAT gateway if one exists, or fails if the VPC has no internet connectivity. A VPC interface endpoint creates a private endpoint inside the VPC that routes traffic through AWS PrivateLink. Applications reach Secrets Manager over private IP addresses and no route to the internet is needed. Once the internet route is removed from the application subnets, the public endpoint is no longer reachable as an alternative.
# Create the Secrets Manager VPC interface endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0a1b2c3d4e5f67890 \
--service-name com.amazonaws.us-east-1.secretsmanager \
--vpc-endpoint-type Interface \
--subnet-ids subnet-0a1b2c3d4e5f67891 subnet-0a1b2c3d4e5f67892 \
--security-group-ids sg-0a1b2c3d4e5f67893 \
--private-dns-enabled \
--tag-specifications 'ResourceType=vpc-endpoint,Tags=[{Key=Name,Value=secretsmanager-private}]'
# The endpoint security group: only allow HTTPS from your application SG
aws ec2 authorize-security-group-ingress \
--group-id sg-0a1b2c3d4e5f67893 \
--protocol tcp \
--port 443 \
--source-group sg-application-sg-id
# Attach a restrictive endpoint policy — only allow reads on production secrets,
# and only from the application's IAM role
aws ec2 modify-vpc-endpoint \
--vpc-endpoint-id vpce-0a1b2c3d4e5f67894 \
--policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/payment-service-role"
},
"Action": [
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret"
],
"Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:production/*"
},
{
"Effect": "Deny",
"Principal": "*",
"Action": "secretsmanager:*",
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:PrincipalArn": "arn:aws:iam::123456789012:role/payment-service-role"
}
}
}
]
}'
With --private-dns-enabled, the existing public endpoint hostname secretsmanager.us-east-1.amazonaws.com resolves to the private endpoint IP from within the VPC. Application code requires no changes — the boto3 client calls the same hostname, but the DNS resolution returns the private endpoint address. A VPC with no internet gateway route and a Secrets Manager endpoint is unreachable from the public internet regardless of security group configuration, because there is no network path for return traffic.
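One way to confirm that private DNS is in effect is to resolve the hostname from inside the VPC and check that every answer is a private address. The sketch below assumes nothing beyond the standard library; resolves_privately and resolve are hypothetical helper names, and the sample addresses are illustrative stand-ins rather than real endpoint IPs.

```python
import ipaddress
import socket


def resolve(hostname: str) -> list[str]:
    """Return the distinct IP addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(addresses: list[str]) -> bool:
    """True only if there is at least one address and all are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


# Illustrative answers: what a VPC endpoint might return vs. a public address.
print(resolves_privately(["10.0.12.34", "10.0.45.67"]))
print(resolves_privately(["10.0.12.34", "8.8.8.8"]))

# From an instance inside the VPC you would run:
#   resolves_privately(resolve("secretsmanager.us-east-1.amazonaws.com"))
```

Note that is_private also accepts loopback and link-local ranges, so this is a coarse check; tighten it to your VPC CIDR if you need certainty.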
To detect whether your environment has an accidental internet path for Secrets Manager traffic, check the route table associated with your application subnets:
# List route tables for your application subnet
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0a1b2c3d4e5f67891" \
--query 'RouteTables[*].Routes[?GatewayId!=`local`].{Dest:DestinationCidrBlock,GW:GatewayId}'
# A route with GatewayId starting "igw-" for 0.0.0.0/0 means traffic
# to public IPs (including the public Secrets Manager endpoint) exits
# through the internet gateway. Add the VPC endpoint and this route
# becomes irrelevant for Secrets Manager traffic, since the endpoint
# DNS overrides the public resolution within the VPC.
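The same check can be scripted against the JSON that describe-route-tables returns. This is a sketch under the assumption that you have saved the unfiltered command output (without the --query projection) and loaded it into a dict; internet_default_routes is a hypothetical helper, and the IDs in the sample are made up.

```python
def internet_default_routes(route_tables_output: dict) -> list[dict]:
    """Find 0.0.0.0/0 routes that point at an internet gateway."""
    findings = []
    for table in route_tables_output.get("RouteTables", []):
        for route in table.get("Routes", []):
            gateway = route.get("GatewayId", "")
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and gateway.startswith("igw-")):
                findings.append({
                    "RouteTableId": table.get("RouteTableId"),
                    "GatewayId": gateway,
                })
    return findings


# Made-up sample in the shape of `aws ec2 describe-route-tables` output.
sample = {
    "RouteTables": [
        {
            "RouteTableId": "rtb-public-example",
            "Routes": [
                {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
                {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-example"},
            ],
        },
        {
            "RouteTableId": "rtb-private-example",
            "Routes": [
                {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            ],
        },
    ]
}

print(internet_default_routes(sample))
```

Routes through a NAT gateway carry a NatGatewayId instead of a GatewayId, so they are not flagged here; extend the check if NAT egress also counts as an internet path in your environment.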
5. Detect Bulk Secret Reads via Vault Audit Log Analysis
Vault’s audit log records every API request and response, including the token accessor (a non-sensitive reference to the token), the requested path, the operation type, and a timestamp. Enable file or syslog audit devices on all Vault nodes in production:
vault audit enable file file_path=/var/log/vault/audit.log log_raw=false
# log_raw=false means secrets are hashed in audit output by default
# This is intentional: the audit log should not itself become a secret dump
A bulk read attack has a distinguishable signature: many read operations from a single token accessor in a short time window, often preceded by list operations on parent paths to enumerate available secrets. The following script detects this pattern from the NDJSON audit log:
import json
import re
import sys
from collections import defaultdict
from datetime import datetime, timedelta


def parse_vault_audit_log(path: str):
    """Yield parsed audit log events from a Vault NDJSON audit log file."""
    with open(path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                print(f"Warning: malformed JSON at line {line_no}: {exc}", file=sys.stderr)


def parse_timestamp(time_str: str) -> datetime:
    """
    Parse an RFC 3339 timestamp that may carry nanosecond precision.
    datetime.fromisoformat() before Python 3.11 rejects a trailing 'Z' and
    fractional seconds that are not exactly 3 or 6 digits, so normalize both.
    """
    normalized = time_str.replace('Z', '+00:00')
    normalized = re.sub(
        r'\.(\d+)',
        lambda m: '.' + m.group(1)[:6].ljust(6, '0'),
        normalized,
    )
    return datetime.fromisoformat(normalized)


def detect_bulk_reads(
    audit_log_path: str,
    window_minutes: int = 5,
    read_threshold: int = 50,
) -> list[dict]:
    """
    Identify token accessors that performed an anomalously high number of
    secret read operations within a sliding time window.

    Returns a list of anomaly dicts, each containing the accessor, the
    operation count, the window start time, and a sample of accessed paths.
    """
    # Build per-accessor timeline: [(timestamp, path, operation), ...]
    reads_by_accessor = defaultdict(list)

    for event in parse_vault_audit_log(audit_log_path):
        # Only examine response events (each request has a paired response)
        if event.get('type') != 'response':
            continue

        operation = event.get('request', {}).get('operation')
        if operation not in ('read', 'list'):
            continue

        # Errors (e.g., permission denied) are still logged; skip them
        # to avoid false positives from failed enumeration attempts
        # that the ACL already blocked. Check the top-level 'error' field
        # as well as response.errors, since the location can differ.
        if event.get('error') or event.get('response', {}).get('errors'):
            continue

        accessor = event.get('auth', {}).get('accessor', 'unauthenticated')
        path = event.get('request', {}).get('path', '')
        time_str = event.get('time', '')

        try:
            ts = parse_timestamp(time_str)
        except ValueError:
            print(f"Warning: unparseable timestamp {time_str!r}", file=sys.stderr)
            continue

        reads_by_accessor[accessor].append((ts, path, operation))

    window = timedelta(minutes=window_minutes)
    anomalies = []

    for accessor, events in reads_by_accessor.items():
        events.sort(key=lambda e: e[0])
        already_flagged_at = None

        for i, (ts, path, op) in enumerate(events):
            # Skip positions already covered by a previously flagged window
            if already_flagged_at and ts < already_flagged_at + window:
                continue

            window_events = [
                e for e in events[i:]
                if e[0] <= ts + window
            ]

            if len(window_events) >= read_threshold:
                already_flagged_at = ts
                anomalies.append({
                    'accessor': accessor,
                    'read_count': len(window_events),
                    'window_start': ts.isoformat(),
                    'window_end': (ts + window).isoformat(),
                    'read_threshold': read_threshold,
                    'window_minutes': window_minutes,
                    'sample_paths': [e[1] for e in window_events[:15]],
                    'operations': list({e[2] for e in window_events}),
                })

    return anomalies


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='Detect bulk Vault secret reads')
    parser.add_argument('audit_log', help='Path to Vault audit log (NDJSON)')
    parser.add_argument('--window', type=int, default=5, help='Time window in minutes')
    parser.add_argument('--threshold', type=int, default=50, help='Read count threshold')
    args = parser.parse_args()

    anomalies = detect_bulk_reads(args.audit_log, args.window, args.threshold)

    if not anomalies:
        print("No bulk read anomalies detected.")
        sys.exit(0)

    print(f"ALERT: {len(anomalies)} bulk read anomaly(ies) detected:\n")
    for a in anomalies:
        print(f"  Accessor : {a['accessor']}")
        print(f"  Reads    : {a['read_count']} in {a['window_minutes']}min starting {a['window_start']}")
        print(f"  Ops      : {', '.join(a['operations'])}")
        print(f"  Paths    : {a['sample_paths'][:5]}")
        print()
Run this script against the audit log on a 1-minute cron or feed the audit log into a log aggregation pipeline (Loki, Elasticsearch, Splunk) with equivalent alert logic. The alert fires before the attacker finishes iterating through the secret store — 50 reads in 5 minutes is a conservative threshold for most applications; adjust downward for high-security environments where any bulk read is anomalous.
6. Alert on Vault Tokens Used from Unexpected Source IPs
Beyond volume anomalies, a token used from a new IP address is a strong signal of credential theft. Vault’s audit log includes the remote address for each request. A LogQL query for Grafana/Loki detects new-IP usage:
# Grafana Loki: alert on Vault token accessor used from a previously unseen IP.
# This requires a baseline period to establish normal accessor->IP mappings.
# LogQL to extract accessor and remote_address from Vault audit logs.
# The json parser flattens nested keys with underscores, so auth.accessor
# becomes auth_accessor and request.remote_address becomes request_remote_address.
{job="vault-audit"}
  | json
  | type = "request"
  | line_format "accessor={{.auth_accessor}} ip={{.request_remote_address}}"
# Feed this into a Loki alerting rule that tracks accessor->IP history
# over a 7-day window and fires when a new combination appears.
# Alternatively, ship to Elasticsearch and use a terms aggregation:
{
"aggs": {
"by_accessor": {
"terms": { "field": "auth.accessor.keyword" },
"aggs": {
"ips": {
"terms": { "field": "request.remote_address.keyword" }
}
}
}
},
"query": {
"bool": {
"filter": [
{ "term": { "type": "request" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
}
}
Compare the per-accessor IP list from the past hour against the rolling 7-day baseline. Any accessor that appears in the current window from an IP not present in the baseline gets an alert. This catches token theft from compromised CI pipelines, stolen Vault agent tokens from pods that have been evicted, and tokens extracted from application memory via a memory dump.
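The comparison itself is a per-accessor set difference. A minimal sketch, assuming both the baseline and the current window have already been reduced to a mapping from accessor to the set of source IPs seen; new_ip_usage is a hypothetical helper and the accessors and addresses are invented.

```python
def new_ip_usage(
    baseline: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return, per accessor, the IPs seen now that the baseline never saw."""
    alerts = {}
    for accessor, ips in current.items():
        unseen = ips - baseline.get(accessor, set())
        if unseen:
            alerts[accessor] = unseen
    return alerts


# Invented data: a 7-day baseline and the last hour of activity.
baseline = {
    "accessor-payment": {"10.0.1.15", "10.0.1.16"},
    "accessor-ci": {"10.0.9.4"},
}
current = {
    "accessor-payment": {"10.0.1.15"},            # known IP, no alert
    "accessor-ci": {"10.0.9.4", "10.0.77.200"},   # one unseen IP
    "accessor-new": {"10.0.3.3"},                 # accessor with no baseline
}

print(new_ip_usage(baseline, current))
```

An accessor with no baseline at all is reported in full, which is noisy for short-lived tokens; in practice you would likely key the baseline on the auth role or entity rather than on the individual accessor.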
Expected Behaviour Under Attack
When certificate pinning is active and an attacker intercepts the TLS session, the pin check fails inside PinnedSSLAdapter.send() before the response body is returned. The application raises ssl.SSLError: Certificate pin mismatch for https://vault.internal:8200. The secret is never retrieved. The application fails to start or fails the specific operation that required the secret. This is the correct failure mode — a loud, visible error is preferable to silently returning a secret from an attacker’s server.
When a pod without the needs-secrets: "true" label attempts to connect to vault.vault.svc.cluster.local:8200, the CNI drops the SYN packet at the kernel level. The connecting pod sees the TCP connection hang until the application-layer timeout fires — typically 30 to 120 seconds depending on the client configuration. No RST is sent. The Vault service logs nothing, because the connection never reaches the Vault process. The only record of the blocked attempt is in the CNI’s flow log (Cilium cilium monitor, Calico flow logs, or VPC Flow Logs for EKS with the VPC CNI).
When the bulk read detector fires, it outputs the accessor ID and a sample of accessed paths. The accessor maps back to a specific token in Vault’s token store via vault token lookup -accessor <accessor>. This reveals the token’s creation time, its associated entity or auth method role, and its remaining TTL. One caveat: Vault audit devices hash the token accessor by default (hmac_accessor=true), so the value in the log can be passed to these commands only if the audit device was enabled with hmac_accessor=false. The correct immediate response is to revoke the accessor: vault token revoke -accessor <accessor>. This invalidates the token immediately across all Vault nodes in the cluster.
Trade-offs and Operational Costs
Certificate pinning — The pin value must be updated before Vault’s certificate expires. If the certificate rotates and the pin is not updated, the application breaks immediately on the next connection attempt. The operational model is: automate certificate rotation through cert-manager or Vault’s PKI secrets engine, automate pin value extraction as part of the rotation pipeline, and deploy the updated pin before the old certificate expires. In practice, this means Vault certificate rotation and application deployment must be coordinated. For organizations that haven’t previously coordinated these, pinning surfaces an operational gap that was already a latent risk.
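One common way to soften the rotation coupling is to accept a small set of pins, the current certificate plus the next one, so the new pin can ship before the certificate changes. The sketch below illustrates the idea with placeholder byte strings standing in for DER certificates; cert_pin and matches_any_pin are hypothetical helpers, not part of the adapter shown earlier.

```python
import hashlib
import hmac


def cert_pin(cert_der: bytes) -> str:
    """Return a 'sha256:<hex>' pin for a DER-encoded certificate."""
    return "sha256:" + hashlib.sha256(cert_der).hexdigest()


def matches_any_pin(cert_der: bytes, allowed_pins: set[str]) -> bool:
    """Accept the certificate if its pin equals any pin in the allowed set."""
    presented = cert_pin(cert_der)
    return any(hmac.compare_digest(presented, pin) for pin in allowed_pins)


# Placeholder bytes, NOT real certificates (assumption for illustration).
current_cert = b"placeholder: current Vault leaf certificate"
next_cert = b"placeholder: next Vault leaf certificate"
rogue_cert = b"placeholder: certificate from a compromised CA"

# Deploy both pins ahead of rotation; drop the old one afterwards.
allowed_pins = {cert_pin(current_cert), cert_pin(next_cert)}

print(matches_any_pin(current_cert, allowed_pins))
print(matches_any_pin(next_cert, allowed_pins))
print(matches_any_pin(rogue_cert, allowed_pins))
```

The trade-off is that every extra pin is an extra certificate an attacker could present, so the set should hold only certificates whose private keys are already under your control.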
mTLS with cert auth method — Certificate lifecycle management for client certificates at scale in Kubernetes requires cert-manager or equivalent tooling. Without automation, client certificates expire and the Vault agent fails to authenticate — the application stops retrieving secrets. With automation, the operational overhead is low but the initial setup takes time: cluster issuer configuration, certificate resources per workload, Vault role registration per certificate CN. The benefit is that authentication credentials are cryptographic private keys stored in Kubernetes secrets (or CSI secret store volumes), not bearer tokens that can be trivially exfiltrated.
VPC endpoints — AWS charges $0.01/hour per availability zone per endpoint, plus $0.01/GB of data processed. For most secrets workloads (small payloads, moderate request rates), the data transfer cost is negligible. The per-AZ charge for a three-AZ deployment is roughly $22/month — less than the cost of a single engineer-hour. In regulated environments (PCI DSS, HIPAA), a VPC endpoint for Secrets Manager is typically required to satisfy the “in transit over private networks” requirement.
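The monthly figure comes from simple arithmetic on the two rates quoted above. The sketch assumes 730 hours per month and uses the rates from the text as defaults; endpoint_monthly_cost is a hypothetical helper, and current AWS pricing for your region should be checked before budgeting.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for estimates


def endpoint_monthly_cost(
    az_count: int,
    gb_processed: float,
    hourly_rate: float = 0.01,   # USD per AZ per hour, as quoted in the text
    per_gb_rate: float = 0.01,   # USD per GB processed, as quoted in the text
) -> float:
    """Estimate the monthly cost of one interface endpoint in USD."""
    fixed = az_count * hourly_rate * HOURS_PER_MONTH
    variable = gb_processed * per_gb_rate
    return fixed + variable


# Three AZs, with and without 5 GB of secret traffic per month.
print(round(endpoint_monthly_cost(3, 0), 2))
print(round(endpoint_monthly_cost(3, 5), 2))
```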
Bulk read detection — The read_threshold parameter requires calibration against your specific workload. ESO (External Secrets Operator), Vault Agent’s template rendering at startup, and legitimate batch jobs that retrieve configuration for many services can all exceed a naive threshold of 50 reads in 5 minutes. The resolution is to exclude known-legitimate accessors — those belonging to dedicated ESO or Vault Agent service accounts — from the detection window, or to raise the threshold for those specific accessors while keeping a lower threshold for application accessors. The detector is not useful if it is perpetually noisy.
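Per-accessor thresholds can be layered on top of the detector's output without changing the detector. A minimal sketch, assuming anomaly dicts with the accessor and read_count keys that detect_bulk_reads returns, and that the detector was run with the lowest threshold in use; filter_anomalies and the accessor names are hypothetical.

```python
def filter_anomalies(
    anomalies: list[dict],
    default_threshold: int,
    overrides: dict[str, int],
) -> list[dict]:
    """Keep anomalies whose read count meets that accessor's own threshold."""
    kept = []
    for anomaly in anomalies:
        threshold = overrides.get(anomaly["accessor"], default_threshold)
        if anomaly["read_count"] >= threshold:
            kept.append(anomaly)
    return kept


# Invented detector output: an ESO accessor and an application accessor.
anomalies = [
    {"accessor": "accessor-eso", "read_count": 300},
    {"accessor": "accessor-payment", "read_count": 80},
    {"accessor": "accessor-eso", "read_count": 5000},
]

# ESO legitimately reads in bulk, so it gets a much higher threshold.
overrides = {"accessor-eso": 1000}

for a in filter_anomalies(anomalies, default_threshold=50, overrides=overrides):
    print(a["accessor"], a["read_count"])
```

Raising a threshold rather than excluding an accessor outright keeps some coverage for the case where the ESO or Vault Agent token itself is stolen.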
Failure Modes
VAULT_SKIP_VERIFY=true in production. This environment variable disables all TLS certificate validation for the Vault client. It is set to work around certificate configuration problems in development environments and then carried into production. With it set, a DNS hijack or compromised CA is completely effective — the application connects to any server presenting any certificate. Certificate pinning is also bypassed when TLS verification is skipped, because there is no TLS verification to attach pinning logic to. Grep your deployment manifests, Helm values, and ConfigMaps for VAULT_SKIP_VERIFY and VAULT_TLS_SKIP_VERIFY. Either value should block promotion to production.
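The grep can be turned into a promotion gate. The sketch below scans text for the two variable names mentioned above and reports line numbers; find_skip_verify and scan_tree are hypothetical helpers, and the file suffixes scanned are an assumption about where your manifests live.

```python
import re
from pathlib import Path

# The two names called out in the text; add others your tooling honors.
SKIP_VERIFY_NAMES = re.compile(r"\b(VAULT_SKIP_VERIFY|VAULT_TLS_SKIP_VERIFY)\b")


def find_skip_verify(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every line naming the variable."""
    return [
        (line_no, line.strip())
        for line_no, line in enumerate(text.splitlines(), 1)
        if SKIP_VERIFY_NAMES.search(line)
    ]


def scan_tree(root: str, suffixes=(".yaml", ".yml", ".env")) -> dict[str, list]:
    """Scan files under root with the given suffixes; map path -> hits."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_skip_verify(path.read_text(encoding="utf-8", errors="replace"))
            if hits:
                results[str(path)] = hits
    return results


# Illustrative manifest fragment.
manifest = """\
env:
  - name: VAULT_ADDR
    value: https://vault.internal:8200
  - name: VAULT_SKIP_VERIFY
    value: "true"
"""

print(find_skip_verify(manifest))
```

In a CI pipeline, a non-empty result from scan_tree would fail the job before the manifest reaches production.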
NetworkPolicy applied to ingress only. The common mental model for NetworkPolicy is “protect the Vault service from unauthorized access,” which leads to writing a NetworkPolicy that restricts ingress to Vault. This does nothing to prevent a compromised pod from initiating egress to Vault. Effective network isolation requires both: an ingress policy on the Vault service restricting source pods, and a default-deny egress policy on the production namespace that blocks all egress except explicitly labeled pods. If either half is missing, the model is incomplete.
No VPC endpoint in a VPC with an internet gateway. A VPC with an internet gateway has a public route for 0.0.0.0/0. The AWS SDK resolves secretsmanager.us-east-1.amazonaws.com to a public IP address, and traffic leaves the VPC through the internet gateway toward that address. AWS documents that traffic between a VPC and public AWS service endpoints stays on the AWS network, but the path still depends on your subnets having a route to the internet, it cannot be constrained by an endpoint policy, and the same route lets a compromised workload send harvested secrets anywhere. Assuming that “AWS manages the TLS” provides sufficient protection ignores all three. A VPC endpoint costs less than a single engineer-hour per month and lets you remove the public path entirely.
Vault audit log not shipped to a SIEM. Vault’s audit log is only useful if something reads it. A bulk read attack that takes 20 minutes produces 10,000 audit log entries on the Vault node’s filesystem. If those entries are not shipped to a centralized log aggregation system with alerting, the attack is invisible until the stolen credentials appear in threat intelligence feeds or until the attacker uses them externally. Enable the syslog audit device in addition to file to ensure log entries are forwarded to your log aggregation pipeline in real time rather than requiring a file-based shipping agent to pick them up.
Token used as static long-lived credential. Vault tokens with no TTL (ttl=0) or very long TTLs (weeks to months) make the bulk read scenario worse: a stolen token remains valid indefinitely. Combine short TTLs (1-4 hours) with automatic renewal through Vault Agent to ensure that compromised tokens have a bounded validity window. A token that expires within 4 hours of compromise, because renewal stopped once the pod was isolated, limits the attack window even when the bulk read detector did not fire in time.