Secret Rotation Automation: Zero-Downtime Database Password Rotation at Scale
The Problem
Database credential rotation fails in production for one of three reasons: teams don’t do it at all because the procedure causes downtime, they do it during a maintenance window and call that “zero-downtime”, or they automate the wrong thing — updating the secret store before the database, or the database before the applications are ready. Each path leads to the same place: a production outage, or credentials that were leaked three months ago and are still valid.
The core mechanical problem is that a database password is stored in two places simultaneously: in the secret store (Vault, AWS Secrets Manager) and in the database itself. Any rotation procedure must coordinate both stores plus every running application process that holds the credential in memory. The naive procedure updates them sequentially and creates a window of guaranteed failure:
# Naive rotation — causes outage:
NEW_PASS=$(openssl rand -base64 32)
# Step 1: Update the database. Old password now invalid.
psql -h postgres.internal -U admin -c "ALTER USER app_user PASSWORD '$NEW_PASS';"
# Step 2: Update the secret store.
vault kv put secret/prod/db password="$NEW_PASS"
# Step 3: Restart all application instances.
kubectl rollout restart deployment/api-server
# Gap between Step 1 and the last instance completing Step 3:
# Every running application has the old (now invalid) password.
# Every new database connection attempt fails with:
# FATAL: password authentication failed for user "app_user"
With 50 application instances across multiple availability zones, the restart wave from Step 3 alone takes several minutes. Established sessions keep working, but any request that needs a fresh database connection during that window fails authentication and returns an error.
Why Most Teams Don’t Rotate
Beyond the mechanical failure, there are structural reasons rotation doesn’t happen. Security mandates say “rotate every 90 days.” The platform team has no automated rotation. An engineer runs the rotation manually, causes a 3-minute outage at 2pm on a Tuesday, gets paged, and rolls back. From then on, the secret is rotated once a year during a planned maintenance window at 2am. The security policy is technically met on paper; the credential is effectively static.
The second structural problem is visibility. Without automated rotation, the only signal that a credential was compromised is when an attacker uses it. Rotation is the bound on how long a compromised credential is useful. A 90-day rotation schedule means a credential stolen on day 1 is valid for 89 more days. That is the window. Making rotation automated and zero-downtime is what makes a short rotation schedule operationally viable.
The Zero-Downtime Rotation Pattern
Zero-downtime rotation requires that both the old and new credentials are valid simultaneously during the transition. The full state machine has five explicit steps:
- Generate new credential: Create a new database password (or a new database user) in a pending state. The old credential remains unchanged and fully valid.
- Update the secret store: Write the new credential to Vault or AWS Secrets Manager as a pending version. The current version is still served to applications.
- Verify the new credential: Test that the new credential actually works against the database before proceeding. This is the step most implementations skip.
- Promote to current: Mark the new credential as current in the secret store. Vault agents, External Secrets Operator, and application-side refresh logic pick this up on their next refresh cycle.
- Revoke the old credential: Only after confirming that applications have stopped using the old credential. Monitor pg_stat_activity or the application’s connection pool metrics to confirm.
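The five steps can be sketched as a provider-agnostic orchestration loop. Everything behind the RotationBackend callables is a placeholder for a real secret-store or database operation, not a specific provider API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RotationBackend:
    # Placeholders for real provider operations (Vault, Secrets Manager, psql).
    generate: Callable[[], str]                # Step 1: mint a new credential
    store_pending: Callable[[str], None]       # Step 2: write as pending version
    verify: Callable[[str], bool]              # Step 3: test against the database
    promote: Callable[[str], None]             # Step 4: mark as current
    old_connection_count: Callable[[], int]    # Step 5 gate: sessions on old cred
    revoke_old: Callable[[], None]             # Step 5: revoke when drained

def rotate(backend: RotationBackend) -> str:
    new_cred = backend.generate()              # old credential remains valid
    backend.store_pending(new_cred)            # current version still served
    if not backend.verify(new_cred):
        # Halt before promotion: applications are unaffected.
        raise RuntimeError("new credential failed verification")
    backend.promote(new_cred)
    if backend.old_connection_count() == 0:
        backend.revoke_old()
        return "rotated-and-revoked"
    # Defer revocation until the old credential's sessions drain.
    return "rotated-revocation-deferred"
```

The ordering is the whole pattern: verification happens while the old credential is still current, and revocation is gated on observed drain, not on a timer.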
The dual-credential pattern at the PostgreSQL level looks like this:
-- Option A: create a parallel role (works for both v1 and v2 simultaneously)
-- Useful when the application authenticates with a specific username that
-- external systems track (audit logs, firewall rules).
-- Before rotation: app connects as app_user_v1
-- Step 1: create the new identity with all necessary grants
CREATE ROLE app_user_v2 WITH LOGIN PASSWORD 'new_generated_password_v2';
GRANT CONNECT ON DATABASE mydb TO app_user_v2;
GRANT USAGE ON SCHEMA public TO app_user_v2;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user_v2;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_user_v2;
-- Both app_user_v1 and app_user_v2 are valid simultaneously.
-- After all applications have switched to v2 (verify via pg_stat_activity):
SELECT usename, count(*) FROM pg_stat_activity
WHERE datname = 'mydb' GROUP BY usename;
-- app_user_v1 | 0 ← safe to revoke
-- app_user_v2 | 43
REVOKE ALL PRIVILEGES ON DATABASE mydb FROM app_user_v1;
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM app_user_v1;
DROP ROLE app_user_v1;
-- Option B: ALTER USER (same username, in-place password change)
-- Existing connections authenticated with the old password remain valid
-- until they reconnect. PostgreSQL validates the password at connection
-- time, not during the session. Open sessions survive the ALTER USER.
-- New connections must use the new password.
ALTER USER app_user PASSWORD 'new_generated_password_v2';
-- Old connections: still alive, still working.
-- New connections: must use new_generated_password_v2.
-- This window lasts until all old connections close or are killed.
Option B (ALTER USER) is simpler and works when you can tolerate the transition being tied to connection churn. Option A (parallel role) is required when the rotation schedule is shorter than your connection pool’s lifetime, or when you need to explicitly control the revocation moment.
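The Option A revocation gate can be automated rather than checked by hand. A minimal sketch; the decision logic is separated from the query so it is testable without a database, and the psycopg2 usage in the comment is an assumption about the environment:

```python
def safe_to_revoke(rows, old_user="app_user_v1"):
    """rows: (usename, connection_count) pairs from the pg_stat_activity
    query shown above. Revocation is safe only when the old role has no
    live sessions."""
    counts = dict(rows)
    return counts.get(old_user, 0) == 0

# With a live connection (psycopg2 assumed), rows would come from:
#   cur.execute(
#       "SELECT usename, count(*) FROM pg_stat_activity"
#       " WHERE datname = %s GROUP BY usename", ("mydb",))
#   if safe_to_revoke(cur.fetchall()):
#       ...run the REVOKE / DROP ROLE statements...
```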
Threat Model
Rotation failure leaves a potentially compromised credential valid indefinitely. The most common rotation failure mode is a successful rotation in the secret store that silently fails to reach the database, or vice versa. The credential mismatch is often not detected until an application restart — which might be days later for a stable service. During that interval, the old credential is either invalid (causing intermittent connection failures as pools cycle) or the new credential was never applied (making rotation theatre).
The rotation mechanism itself is an attack surface. AWS Secrets Manager rotation Lambdas typically have IAM permissions to read all secrets in the account and connect to production databases. Vault agents with database engine access can generate credentials for any role the admin configured. These are high-value targets: compromising the rotation mechanism gives access to all credentials that mechanism manages.
Applications not refreshing credentials after rotation. An application that caches its database password at startup and never refreshes will continue using the old credential until it’s restarted. If the rotation schedule is 30 days and deployments are monthly, the application is effectively never using a rotated credential.
The testSecret step not testing actual data access. A rotation Lambda that only verifies SELECT 1 in the test step has not confirmed the credential works for the operations the application actually performs. A user with LOGIN privilege but missing table grants will pass the connectivity test and fail in production.
Credential mismatch after partial rotation. If the Lambda’s setSecret step updates the database but the Lambda fails before finishSecret, the AWSPENDING credential works against the database, but AWSCURRENT in the secret store still has the old (now invalid) password. Applications reading from AWSCURRENT get a credential that fails. Recovery requires manual stage manipulation.
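Detecting the partial-rotation state reduces to two probes: can a test connection authenticate with the AWSCURRENT credential, and with AWSPENDING? A sketch of the classification; the probe callable (which would read the staged secret and attempt a short-timeout connection) is a hypothetical hook:

```python
def classify_rotation_state(current_ok: bool, pending_ok: bool) -> str:
    """Classify a secret by which staged credential the database accepts."""
    if current_ok:
        return "healthy"            # AWSCURRENT matches the database
    if pending_ok:
        return "promote-pending"    # setSecret ran but finishSecret did not
    return "manual-recovery"        # neither stage authenticates

def check_stage_consistency(probe) -> str:
    # probe(stage) -> bool: fetch the secret at that stage and attempt a
    # database connection with a short connect_timeout.
    return classify_rotation_state(probe("AWSCURRENT"), probe("AWSPENDING"))
```

Run this on a schedule and the mismatch is caught in minutes instead of at the next application restart.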
Hardening Configuration
1. AWS Secrets Manager Rotation Lambda
AWS Secrets Manager calls a Lambda function in four sequential steps: createSecret, setSecret, testSecret, finishSecret. Each step must be idempotent — the rotation can be retried from any point. The Lambda below implements zero-downtime rotation using ALTER USER (Option B), with explicit error handling at each step:
# rotation_lambda.py
# Runtime: Python 3.12, Lambda environment
# Required packages: psycopg2-binary (bundled in layer)
# Required IAM permissions: secretsmanager:GetSecretValue,
# secretsmanager:PutSecretValue, secretsmanager:UpdateSecretVersionStage,
# secretsmanager:DescribeSecret — scoped to this specific secret ARN only.
import boto3
import json
import psycopg2
import logging
import string
import secrets as _secrets
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
arn = event['SecretId']
token = event['ClientRequestToken']
step = event['Step']
sm = boto3.client('secretsmanager')
metadata = sm.describe_secret(SecretId=arn)
if not metadata.get('RotationEnabled'):
raise ValueError(f"Rotation not enabled for {arn}")
versions = metadata.get('VersionIdsToStages', {})
if token not in versions:
raise ValueError(f"Version {token} not found in secret {arn}")
if 'AWSCURRENT' in versions[token]:
# Already rotated; finishSecret already ran.
logger.info("Version %s is already AWSCURRENT — rotation complete", token)
return
if 'AWSPENDING' not in versions[token]:
raise ValueError(f"Version {token} is not in AWSPENDING stage")
dispatch = {
'createSecret': create_secret,
'setSecret': set_secret,
'testSecret': test_secret,
'finishSecret': finish_secret,
}
if step not in dispatch:
raise ValueError(f"Unknown step: {step}")
dispatch[step](sm, arn, token)
def create_secret(sm, arn, token):
"""Generate new password and store in AWSPENDING. Idempotent."""
try:
sm.get_secret_value(SecretId=arn, VersionStage='AWSPENDING',
VersionId=token)
logger.info("AWSPENDING already set for %s — skipping createSecret", token)
return
except sm.exceptions.ResourceNotFoundException:
pass
current = _get_secret(sm, arn, 'AWSCURRENT')
# 32-character password drawn from a restricted alphabet.
# Excludes characters that require quoting in psql and shell: ' " ` \ ;
alphabet = string.ascii_letters + string.digits + '!@#%^&*()-_=+'
new_password = ''.join(_secrets.choice(alphabet) for _ in range(32))
pending = {
'host': current['host'],
'port': current['port'],
'dbname': current['dbname'],
'username': current['username'],
'password': new_password,
'engine': current.get('engine', 'postgres'),
}
sm.put_secret_value(
SecretId=arn,
ClientRequestToken=token,
SecretString=json.dumps(pending),
VersionStages=['AWSPENDING'],
)
logger.info("createSecret: AWSPENDING version %s written", token)
def set_secret(sm, arn, token):
"""Apply new password to the database. Old connections remain alive.
PostgreSQL validates credentials at connection time. Existing connections
authenticated with the old password are not interrupted. New connections
must use the new password immediately after this step completes.
"""
pending = _get_secret(sm, arn, 'AWSPENDING', version_id=token)
current = _get_secret(sm, arn, 'AWSCURRENT')
# Connect using the admin/superuser credentials stored in AWSCURRENT.
# The rotation user must have CREATEROLE or be a superuser to ALTER USER.
# In RDS: use the master user or a user with rds_superuser.
conn = _connect(current)
try:
with conn.cursor() as cur:
# Use parameterized query for the password value.
# The username is not a parameter in psycopg2 ALTER USER;
# validate it matches a known safe identifier before interpolation.
username = pending['username']
if not username.replace('_', '').isalnum():
raise ValueError(f"Unsafe username in secret: {username!r}")
cur.execute(
f'ALTER USER "{username}" PASSWORD %s',
(pending['password'],)
)
conn.commit()
logger.info("set_secret: password changed for user %s", username)
except Exception:
conn.rollback()
raise
finally:
conn.close()
def test_secret(sm, arn, token):
"""Verify the new credentials work for the operations the app performs.
SELECT 1 is not sufficient. Test that the user can perform the actual
operations (SELECT, INSERT, UPDATE, DELETE) on the tables the application
uses. This catches credential mismatches between the Lambda admin user
and the app user's actual grant set.
"""
pending = _get_secret(sm, arn, 'AWSPENDING', version_id=token)
conn = _connect(pending)
try:
with conn.cursor() as cur:
# Basic connectivity
cur.execute('SELECT 1')
assert cur.fetchone()[0] == 1
# Verify table access — using a known sentinel table.
# Adjust this to match the application's actual access pattern.
cur.execute("""
SELECT has_table_privilege(%s, 'public.schema_migrations', 'SELECT')
""", (pending['username'],))
can_select = cur.fetchone()[0]
if not can_select:
raise PermissionError(
f"User {pending['username']} cannot SELECT on "
"public.schema_migrations — grant missing"
)
logger.info("test_secret: new credentials verified for user %s",
pending['username'])
finally:
conn.close()
def finish_secret(sm, arn, token):
"""Promote AWSPENDING to AWSCURRENT. Atomic from the secret store's perspective."""
metadata = sm.describe_secret(SecretId=arn)
current_version = None
for version_id, stages in metadata['VersionIdsToStages'].items():
if 'AWSCURRENT' in stages:
if version_id == token:
logger.info("finish_secret: %s already AWSCURRENT", token)
return
current_version = version_id
break
sm.update_secret_version_stage(
SecretId=arn,
VersionStage='AWSCURRENT',
MoveToVersionId=token,
RemoveFromVersionId=current_version,
)
logger.info("finish_secret: %s promoted to AWSCURRENT (was %s)",
token, current_version)
def _get_secret(sm, arn, stage, version_id=None):
kwargs = {'SecretId': arn, 'VersionStage': stage}
if version_id:
kwargs['VersionId'] = version_id
return json.loads(sm.get_secret_value(**kwargs)['SecretString'])
def _connect(creds):
return psycopg2.connect(
host=creds['host'],
port=int(creds['port']),
dbname=creds['dbname'],
user=creds['username'],
password=creds['password'],
connect_timeout=10,
sslmode='require', # Never connect to RDS without TLS
)
The Lambda’s execution role needs only the permissions for this specific secret ARN. A policy that grants secretsmanager:* on * turns the rotation mechanism into an account-wide credential dump:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowSecretAccess",
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue",
"secretsmanager:PutSecretValue",
"secretsmanager:UpdateSecretVersionStage",
"secretsmanager:DescribeSecret"
],
"Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:production/postgres/*"
},
{
"Sid": "AllowVPCNetworking",
"Effect": "Allow",
"Action": [
"ec2:CreateNetworkInterface",
"ec2:DescribeNetworkInterfaces",
"ec2:DeleteNetworkInterface"
],
"Resource": "*"
}
]
}
Deploy the Lambda into the same VPC as the RDS instance. Without VPC placement, the Lambda either cannot reach the database at all or the RDS instance must be made reachable from outside the VPC — neither is acceptable for a rotation function that handles production credentials.
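VPC attachment is easy to audit. A sketch that checks the function configuration; the function name in the comment is an assumption:

```python
def has_vpc_config(function_config: dict) -> bool:
    """function_config: the dict returned by Lambda's
    get_function_configuration API. True when the function is attached to
    at least one subnet, i.e. it reaches the database over private
    networking."""
    vpc = function_config.get("VpcConfig") or {}
    return bool(vpc.get("SubnetIds"))

# With boto3:
#   cfg = boto3.client("lambda").get_function_configuration(
#       FunctionName="db-rotation")  # hypothetical name
#   assert has_vpc_config(cfg), "rotation Lambda is not VPC-attached"
```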
Configure rotation schedule via Terraform:
resource "aws_secretsmanager_secret_rotation" "db_rotation" {
secret_id = aws_secretsmanager_secret.db_credentials.id
rotation_lambda_arn = aws_lambda_function.db_rotation.arn
rotation_rules {
# Rotate every 30 days.
# AWS Secrets Manager will also rotate immediately on first enable.
automatically_after_days = 30
}
}
# The Lambda must grant Secrets Manager permission to invoke it.
resource "aws_lambda_permission" "allow_secrets_manager" {
statement_id = "AllowSecretsManagerInvocation"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.db_rotation.function_name
principal = "secretsmanager.amazonaws.com"
source_arn = aws_secretsmanager_secret.db_credentials.arn
}
2. Vault Database Secrets Engine
Vault’s database secrets engine eliminates the rotation problem entirely for applications that can adopt it: credentials are generated on demand with a short TTL, and Vault revokes them automatically when the lease expires. There is no rotation event to coordinate, no dual-credential window to manage, and no stale-credential problem — a credential that lives for one hour is effectively revoked by the clock.
# Enable and configure the database secrets engine.
vault secrets enable database
# Configure the PostgreSQL connection using a long-lived admin user.
# The {{username}} and {{password}} placeholders are filled by Vault
# at connection time using the configured admin credentials.
vault write database/config/production-postgres \
plugin_name=postgresql-database-plugin \
connection_url="postgresql://{{username}}:{{password}}@postgres.internal:5432/mydb?sslmode=require" \
allowed_roles="api-readwrite,api-readonly,analytics-readonly" \
username="vault_admin" \
password="$VAULT_ADMIN_PASSWORD" \
password_authentication="scram-sha-256" \
max_open_connections=5 \
max_idle_connections=2 \
max_connection_lifetime="5m"
# Rotate the static vault_admin password immediately so it's no longer
# the value that was typed on the command line or stored in shell history.
vault write -force database/rotate-root/production-postgres
# Create a role for the API server (read-write, 1-hour TTL).
vault write database/roles/api-readwrite \
db_name=production-postgres \
creation_statements="
CREATE ROLE \"{{name}}\"
WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';
GRANT SELECT, INSERT, UPDATE, DELETE
ON ALL TABLES IN SCHEMA public TO \"{{name}}\";
GRANT USAGE, SELECT
ON ALL SEQUENCES IN SCHEMA public TO \"{{name}}\";
" \
revocation_statements="
REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA public FROM \"{{name}}\";
REVOKE ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public FROM \"{{name}}\";
DROP ROLE IF EXISTS \"{{name}}\";
" \
renew_statements="
ALTER ROLE \"{{name}}\" VALID UNTIL '{{expiration}}';
" \
default_ttl="1h" \
max_ttl="24h"
# Request a credential. This is what the application or Vault agent does.
vault read database/creds/api-readwrite
# Key Value
# --- -----
# lease_id database/creds/api-readwrite/7Ah2kXmN...
# lease_duration 1h
# lease_renewable true
# password A1b-Cd2-Ef3-Gh4-Ij5k (unique, 1h TTL)
# username v-k8s-api-readwrit-AbCdEfGhIj-1234567890
Each credential is a distinct PostgreSQL role with a generated username. At any point, pg_stat_activity shows which Vault-issued credentials are active — the username encodes the role and a timestamp, making it trivial to correlate a running connection with its Vault lease.
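The same property makes orphaned credentials visible: a Vault-issued role whose VALID UNTIL has passed but that still exists in pg_roles means the revocation_statements never ran. A sketch of that check; the rolname prefix matches the Vault-generated usernames above:

```python
from datetime import datetime, timezone

def expired_vault_roles(rows, now=None):
    """rows: (rolname, rolvaliduntil) pairs from pg_roles, restricted to
    Vault-issued names (rolname LIKE 'v-%'). Returns roles that are past
    their expiry yet still present, i.e. revocation never ran."""
    now = now or datetime.now(timezone.utc)
    return [name for name, valid_until in rows
            if valid_until is not None and valid_until < now]

# Rows would come from (psycopg2 assumed):
#   cur.execute("SELECT rolname, rolvaliduntil FROM pg_roles"
#               " WHERE rolname LIKE 'v-%'")
#   orphans = expired_vault_roles(cur.fetchall())
```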
For Kubernetes, use the Vault Agent Injector or the Vault Secrets Operator rather than making direct Vault API calls from application code. The injector sidecar handles lease renewal transparently and writes the credential to a shared volume:
# Deployment annotation for Vault Agent Injector
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
spec:
template:
metadata:
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/agent-inject-secret-db: "database/creds/api-readwrite"
vault.hashicorp.com/agent-inject-template-db: |
{{- with secret "database/creds/api-readwrite" -}}
PGUSER={{ .Data.username }}
PGPASSWORD={{ .Data.password }}
{{- end }}
vault.hashicorp.com/role: "api-server"
# Reload the app when credentials change (the Vault agent rewrites the
# file; note this command runs inside the agent sidecar container)
vault.hashicorp.com/agent-inject-command-db: "kill -HUP 1"
spec:
serviceAccountName: api-server
containers:
- name: api
image: api-server:latest
env:
- name: DB_CREDENTIALS_FILE
value: /vault/secrets/db
The injected file at /vault/secrets/db is rewritten by the Vault agent before the current lease expires (by default at roughly 2/3 of the TTL). The agent-inject-command-db annotation runs its command in the agent sidecar after each rewrite; for the SIGHUP to reach the application process, the pod needs a shared process namespace (shareProcessNamespace: true) and the signal must target the application’s actual PID. The application handles the signal by reloading its database connection pool without restarting the process.
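On the application side, the reload handler amounts to re-reading the injected file and rebuilding the pool. A Python sketch of that pattern (the Go code in the next section does the equivalent with Vault API calls instead of a file); the file format matches the template above, and rebuild_pool is a hypothetical application hook:

```python
import signal

def parse_vault_env(text: str) -> dict:
    """Parse the KEY=VALUE file the Vault agent template writes
    (PGUSER / PGPASSWORD in the example above)."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            creds[key] = value
    return creds

def install_reload_handler(path: str, rebuild_pool) -> None:
    """rebuild_pool(creds) is the application's hook: drain the old
    connection pool and open a new one with the fresh credentials."""
    def on_hup(signum, frame):
        with open(path) as f:
            rebuild_pool(parse_vault_env(f.read()))
    signal.signal(signal.SIGHUP, on_hup)
```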
3. Application-Side Credential Refresh
Neither AWS Secrets Manager nor Vault can force-reload credentials held in application memory. The application must implement credential refresh logic. The common failure pattern: the application reads the database password once at startup, builds a connection pool, and never re-reads it. The pool holds connections alive for hours or days. Even if the secret store has the new credential, the running application uses the old one until restart.
The correct pattern for a Go application using Vault dynamic credentials:
// db/managed_pool.go
package db
import (
"context"
"database/sql"
"fmt"
"sync"
"time"
vault "github.com/hashicorp/vault/api"
_ "github.com/lib/pq"
)
// ManagedPool wraps sql.DB with automatic credential refresh.
// It maintains a single connection pool and replaces it when credentials
// are close to expiry. Concurrent callers share the pool; the refresh
// is serialised by the write lock.
type ManagedPool struct {
mu sync.RWMutex
db *sql.DB
vaultClient *vault.Client
rolePath string // e.g. "database/creds/api-readwrite"
host string
dbName string
expiry time.Time
leaseID string
}
func NewManagedPool(ctx context.Context, vc *vault.Client, rolePath, host, dbName string) (*ManagedPool, error) {
p := &ManagedPool{
vaultClient: vc,
rolePath: rolePath,
host: host,
dbName: dbName,
}
if _, err := p.refresh(ctx); err != nil {
return nil, fmt.Errorf("initial credential fetch failed: %w", err)
}
return p, nil
}
// DB returns a live *sql.DB, refreshing credentials if they are within
// 5 minutes of expiry. This is the only method callers should use.
func (p *ManagedPool) DB(ctx context.Context) (*sql.DB, error) {
p.mu.RLock()
// Check without upgrade: still more than 5 minutes remaining.
if time.Now().Before(p.expiry.Add(-5 * time.Minute)) {
db := p.db
p.mu.RUnlock()
return db, nil
}
p.mu.RUnlock()
return p.refresh(ctx)
}
func (p *ManagedPool) refresh(ctx context.Context) (*sql.DB, error) {
p.mu.Lock()
defer p.mu.Unlock()
// Double-check: another goroutine may have refreshed while we waited.
if time.Now().Before(p.expiry.Add(-5 * time.Minute)) {
return p.db, nil
}
secret, err := p.vaultClient.Logical().ReadWithContext(ctx, p.rolePath)
if err != nil {
return nil, fmt.Errorf("vault read %s: %w", p.rolePath, err)
}
if secret == nil || secret.Data == nil {
return nil, fmt.Errorf("vault returned nil secret for %s", p.rolePath)
}
username, ok1 := secret.Data["username"].(string)
password, ok2 := secret.Data["password"].(string)
if !ok1 || !ok2 {
return nil, fmt.Errorf("vault secret missing username or password fields")
}
dsn := fmt.Sprintf(
"postgres://%s:%s@%s/%s?sslmode=require",
username, password, p.host, p.dbName,
)
newDB, err := sql.Open("postgres", dsn)
if err != nil {
return nil, fmt.Errorf("sql.Open: %w", err)
}
// Verify the connection works before replacing the live pool.
if err := newDB.PingContext(ctx); err != nil {
_ = newDB.Close()
return nil, fmt.Errorf("ping with new credentials failed: %w", err)
}
newDB.SetMaxOpenConns(25)
newDB.SetMaxIdleConns(5)
// Max connection lifetime slightly shorter than the credential TTL.
// This ensures no connection outlives its credential.
leaseDuration := time.Duration(secret.LeaseDuration) * time.Second
newDB.SetConnMaxLifetime(leaseDuration - 2*time.Minute)
// Close the old pool. Connections drain; in-flight queries complete.
// sql.DB.Close does not interrupt in-flight queries.
if p.db != nil {
_ = p.db.Close()
}
p.db = newDB
p.leaseID = secret.LeaseID
p.expiry = time.Now().Add(leaseDuration)
return newDB, nil
}
The critical detail is SetConnMaxLifetime set to slightly less than the credential TTL. PostgreSQL validates credentials at connection time, not during the session. A connection established with the old credential continues to work until the session ends. By capping the connection lifetime, the pool naturally cycles its connections before the credential expires, so no in-flight queries are interrupted.
For AWS Secrets Manager instead of Vault, the refresh logic uses the SDK with caching:
// For AWS Secrets Manager, use the official caching client rather than
// implementing your own. It handles version tracking and cache invalidation.
import (
"github.com/aws/aws-secretsmanager-caching-go/secretcache"
)
cache, err := secretcache.New()
secret, err := cache.GetSecretString("production/postgres/api-server")
// The cache refreshes secrets that are older than the TTL (default: 1 hour).
// After a rotation, the cache can therefore serve stale credentials for up
// to the TTL. Lower CacheConfig.CacheItemTTL if the application must pick
// up rotations faster than the default window.
4. Rotation Health Monitoring
A rotation that fails silently is worse than no rotation at all: you believe the credential was rotated, but the old credential remains valid and the compliance clock keeps ticking as if rotation succeeded.
# rotation_monitor.py — run as a scheduled job (daily or hourly)
import boto3
from datetime import datetime, timedelta, timezone
from dataclasses import dataclass
from typing import Optional
@dataclass
class RotationStatus:
secret_arn: str
rotation_enabled: bool
last_rotated: Optional[datetime]
rotation_interval_days: Optional[int]
days_since_rotation: Optional[int]
overdue_by_days: Optional[int]
alert: Optional[str]
def check_rotation_health(secret_arn: str) -> RotationStatus:
sm = boto3.client('secretsmanager')
meta = sm.describe_secret(SecretId=secret_arn)
rotation_enabled = meta.get('RotationEnabled', False)
last_rotated = meta.get('LastRotatedDate')
rotation_rules = meta.get('RotationRules', {})
interval_days = rotation_rules.get('AutomaticallyAfterDays')
days_since = None
overdue_by = None
alert = None
if not rotation_enabled:
alert = "Rotation is disabled on this secret"
elif last_rotated is None:
alert = "Secret has never been rotated"
elif interval_days is not None:
now = datetime.now(timezone.utc)
last_rotated_utc = last_rotated.replace(tzinfo=timezone.utc) \
if last_rotated.tzinfo is None else last_rotated
days_since = (now - last_rotated_utc).days
if days_since > interval_days:
overdue_by = days_since - interval_days
alert = (
f"Rotation overdue by {overdue_by} days "
f"(last rotated {days_since} days ago, "
f"schedule: every {interval_days} days)"
)
return RotationStatus(
secret_arn=secret_arn,
rotation_enabled=rotation_enabled,
last_rotated=last_rotated,
rotation_interval_days=interval_days,
days_since_rotation=days_since,
overdue_by_days=overdue_by,
alert=alert,
)
def check_all_database_secrets(prefix: str = 'production/postgres/') -> list[RotationStatus]:
"""Check all secrets matching a prefix."""
sm = boto3.client('secretsmanager')
paginator = sm.get_paginator('list_secrets')
results = []
for page in paginator.paginate(Filters=[{'Key': 'name', 'Values': [prefix]}]):
for secret in page['SecretList']:
status = check_rotation_health(secret['ARN'])
if status.alert:
# Emit CloudWatch metric for alerting
cw = boto3.client('cloudwatch')
cw.put_metric_data(
Namespace='Security/SecretRotation',
MetricData=[{
'MetricName': 'RotationOverdueDays',
'Dimensions': [{'Name': 'SecretArn', 'Value': secret['ARN']}],
'Value': status.overdue_by_days or 0,
'Unit': 'Count',
}]
)
results.append(status)
return results
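The RotationOverdueDays metric needs an alarm on it to be actionable. A sketch of the alarm definition to pass to CloudWatch’s put_metric_alarm; the threshold and naming conventions are assumptions:

```python
def overdue_alarm_config(secret_arn: str, threshold_days: int = 1) -> dict:
    """Arguments for cloudwatch.put_metric_alarm: fire as soon as a secret
    reports any overdue days in a daily evaluation window."""
    suffix = secret_arn.rsplit(":", 1)[-1].replace("/", "-")
    return {
        "AlarmName": f"secret-rotation-overdue-{suffix}",
        "Namespace": "Security/SecretRotation",
        "MetricName": "RotationOverdueDays",
        "Dimensions": [{"Name": "SecretArn", "Value": secret_arn}],
        "Statistic": "Maximum",
        "Period": 86400,               # one datapoint per day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_days),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # The monitor only emits the metric when a secret is overdue,
        # so missing data means healthy.
        "TreatMissingData": "notBreaching",
    }

# boto3.client("cloudwatch").put_metric_alarm(**overdue_alarm_config(arn))
```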
For Vault, equivalent monitoring uses the sys/leases/lookup API to verify that leases are being issued and renewed:
# Check that dynamic credentials are being actively issued for a role.
# A role with zero active leases may indicate the application is not
# connecting to Vault, or is using stale cached credentials.
vault list sys/leases/lookup/database/creds/api-readwrite
# Keys
# ----
# 7Ah2kXmN...
# 8Bj3lYnO...
# Count: 12 active leases for this role
# Issuance events can also be counted from the audit log itself (it must be
# enabled and forwarded to a queryable store): each credential read appears
# as a request entry for database/creds/<role>. Sensitive values in the
# audit log are HMAC'd; sys/audit-hash computes the same HMAC so a known
# value (for example a suspected leaked password) can be searched for:
vault write sys/audit-hash/file input="A1b-Cd2-Ef3-Gh4-Ij5k"
5. Rollback When Rotation Fails
The rollback procedure assumes rotation failed after setSecret (the database password was changed) but before finishSecret completed (AWSCURRENT was not updated). The AWSPENDING version has the new password; AWSCURRENT has the old (now invalid) password.
#!/usr/bin/env bash
# rollback-rotation.sh
# Usage: ./rollback-rotation.sh <secret-arn>
# Run this when a rotation fails and applications cannot connect.
# Requires: awscli v2, jq
set -euo pipefail
SECRET_ARN="${1:?Usage: $0 <secret-arn>}"
REGION="${AWS_DEFAULT_REGION:-us-east-1}"
describe=$(aws secretsmanager describe-secret \
--secret-id "$SECRET_ARN" \
--region "$REGION")
CURRENT_VERSION=$(echo "$describe" | jq -r '
.VersionIdsToStages | to_entries[]
| select(.value | index("AWSCURRENT"))
| .key')
PREVIOUS_VERSION=$(echo "$describe" | jq -r '
.VersionIdsToStages | to_entries[]
| select(.value | index("AWSPREVIOUS"))
| .key')
PENDING_VERSION=$(echo "$describe" | jq -r '
.VersionIdsToStages | to_entries[]
| select(.value | index("AWSPENDING"))
| .key // empty')
echo "Current version (secret store): $CURRENT_VERSION"
echo "Previous version (last known-good): $PREVIOUS_VERSION"
echo "Pending version (failed rotation): ${PENDING_VERSION:-none}"
if [[ -z "$PREVIOUS_VERSION" ]]; then
echo "ERROR: No AWSPREVIOUS version found. Cannot automatically roll back."
echo "Manual recovery required: determine the last working credentials and"
echo "set them directly via: aws secretsmanager update-secret"
exit 1
fi
echo ""
echo "Rolling back: promoting AWSPREVIOUS ($PREVIOUS_VERSION) to AWSCURRENT."
echo "WARNING: This sets the secret store back to the previous credential."
echo "The database may have already accepted the new password (if setSecret ran)."
echo "Verify database connectivity after rollback."
echo ""
read -rp "Proceed? [y/N] " confirm
[[ "${confirm,,}" == "y" ]] || { echo "Aborted."; exit 0; }
aws secretsmanager update-secret-version-stage \
--secret-id "$SECRET_ARN" \
--version-stage AWSCURRENT \
--move-to-version-id "$PREVIOUS_VERSION" \
--remove-from-version-id "$CURRENT_VERSION" \
--region "$REGION"
echo "Rollback complete. AWSPREVIOUS ($PREVIOUS_VERSION) is now AWSCURRENT."
echo ""
echo "Next steps:"
echo "1. Verify that applications can connect to the database."
echo "2. If setSecret ran before the failure, the database has the PENDING"
echo " password, not the PREVIOUS password. In that case:"
echo " a. Get the PENDING credential: aws secretsmanager get-secret-value \\"
echo " --secret-id $SECRET_ARN --version-stage AWSPENDING"
echo " b. Run ALTER USER manually to reset to the PREVIOUS password, or"
echo " c. Promote the PENDING version to AWSCURRENT instead."
echo "3. Investigate why the rotation Lambda failed before re-enabling rotation."
The rollback script handles the clean case (finishSecret failed, database unchanged). The messy case is when setSecret applied the new password to the database but the Lambda failed before finishSecret. In that state, the AWSPENDING credential works against the database but AWSCURRENT has the old (invalid) password. The correct action in that case is to promote AWSPENDING to AWSCURRENT, not restore AWSPREVIOUS:
# When setSecret succeeded but finishSecret failed:
# The PENDING credential works. Promote it to CURRENT.
PENDING=$(aws secretsmanager describe-secret --secret-id "$SECRET_ARN" | \
jq -r '.VersionIdsToStages | to_entries[] |
select(.value | index("AWSPENDING")) | .key')
aws secretsmanager update-secret-version-stage \
--secret-id "$SECRET_ARN" \
--version-stage AWSCURRENT \
--move-to-version-id "$PENDING" \
--remove-from-version-id "$CURRENT_VERSION"
# Now AWSCURRENT matches what the database accepts.
Expected Behaviour
AWS Secrets Manager rotation (four Lambda invocations):
2026-05-09T14:00:00Z [INFO] createSecret: AWSPENDING version abc123 written
2026-05-09T14:00:01Z [INFO] createSecret: completed in 0.8s
2026-05-09T14:00:05Z [INFO] set_secret: password changed for user app_user
2026-05-09T14:00:05Z [INFO] set_secret: completed in 0.4s
# At this point: database accepts new password. Old connections still alive.
# Applications continue to work. AWSCURRENT still has old password.
2026-05-09T14:00:10Z [INFO] test_secret: new credentials verified for user app_user
2026-05-09T14:00:10Z [INFO] test_secret: completed in 0.3s
# If this step fails, rotation halts here. AWSCURRENT unchanged.
# Applications unaffected. Old password still valid in database.
# Manual investigation required.
2026-05-09T14:00:15Z [INFO] finishSecret: abc123 promoted to AWSCURRENT (was def456)
# From this moment: applications reading AWSCURRENT get the new password.
# Old connections: still alive with old password (PostgreSQL session-level).
# New connections: must use new password.
# Old connections drain as pool cycles connections per SetConnMaxLifetime.
Vault dynamic credential issuance:
$ vault read database/creds/api-readwrite
Key                Value
---                -----
lease_id           database/creds/api-readwrite/7Ah2kXmN-AbCd-EfGh-IjKl-1234567890ab
lease_duration     1h
lease_renewable    true
password           A1b-Cd2-Ef3-Gh4-Ij5k (32 characters, random)
username           v-k8s-api-readwrit-AbCdEfGhIj-1234567890
# In PostgreSQL, this role is visible:
# SELECT rolname, rolvaliduntil FROM pg_roles WHERE rolname LIKE 'v-k8s-%';
# v-k8s-api-readwrit-AbCdEfGhIj-1234567890 | 2026-05-09 15:00:00+00
#
# After 1 hour, Vault calls the revocation_statements:
# DROP ROLE IF EXISTS "v-k8s-api-readwrit-AbCdEfGhIj-1234567890";
# The role disappears from pg_roles. Any surviving connection drops at next
# query attempt with: FATAL: role "..." does not exist
Go ManagedPool refresh:
The pool’s DB() method checks credential expiry on every call but only acquires the write lock when credentials are within 5 minutes of expiry. In normal operation, the read lock path returns in microseconds. When refresh is needed, the first goroutine to acquire the write lock fetches new credentials and pings the database before swapping the pool. All other concurrent callers block on the write lock and receive the new pool once it’s ready. No goroutine receives a pool backed by expired credentials.
Trade-offs
Vault dynamic credentials vs. AWS Secrets Manager rotation. Vault dynamic credentials are architecturally cleaner: there is no rotation event, no dual-credential window, and a leaked credential expires within the TTL. The cost is that Vault becomes a runtime dependency. If Vault is unavailable when an application starts or needs to refresh credentials, the application cannot connect to the database. This requires a high-availability Vault deployment (Raft cluster, not a single instance) and a circuit-breaker in the credential fetch path that can extend the lease on existing credentials rather than failing immediately. AWS Secrets Manager rotation has no runtime dependency on the rotation mechanism after credentials are cached — the Lambda only runs during rotation, not during normal operation.
Application-side refresh without restart. The Go ManagedPool pattern works because sql.DB.Close() is graceful: in-flight queries complete, and the pool is replaced. Not all ORMs support this pattern. SQLAlchemy’s Engine does not provide a clean pool-swap primitive; the typical pattern is to dispose the engine and create a new one, which risks dropping connections mid-query under high concurrency. Rails connection pools do not refresh credentials without restart. The engineering cost of implementing credential refresh in the application layer is real — factor it into the decision between Vault dynamic credentials (requires application adoption) and Secrets Manager rotation with longer cache TTLs.
The testSecret step scope. Testing only SELECT 1 confirms the credential is syntactically valid and that the database accepted the password change. It does not confirm that the role has the grants the application needs. A rotation that passes testSecret but loses table grants (e.g. because the new role was created without running the GRANT statements) will fail silently in production. Write testSecret checks that mirror actual application queries, including any specific schemas or tables that are not publicly accessible.
Rotation interval vs. connection pool lifetime. A 30-day rotation interval with a connection pool that holds connections for up to 24 hours means the pool cycles naturally before the next rotation. A 1-hour rotation interval (or a 1-hour Vault TTL) with a pool that holds connections for 8 hours means the pool will regularly hit expired credentials. The rotation interval must exceed the pool's maximum connection lifetime plus the rotation propagation window.
Failure Modes
The Lambda has overly broad IAM permissions. A rotation Lambda with secretsmanager:GetSecretValue on * and access to the VPC where production databases live is a high-privilege execution context. Any code injection in the Lambda (via a malicious dependency, a compromised base image, or a misconfigured event source) has read access to all secrets in the account and network access to all databases the Lambda’s security group allows. Scope the Lambda’s IAM policy to the specific secret ARN prefix it manages. Use a separate Lambda function (and separate IAM role) for each secret group.
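A sketch of a scoped policy for one secret group (region, account ID, and the prod/db- prefix are placeholders; the action list reflects what a rotation function typically needs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:UpdateSecretVersionStage"
      ],
      "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:prod/db-*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetRandomPassword",
      "Resource": "*"
    }
  ]
}
```

GetRandomPassword is not resource-scoped because it does not touch any secret; everything that reads or writes secret material is pinned to the one ARN prefix this Lambda manages.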
testSecret passes but production access fails. A test that only verifies SELECT 1 will succeed for a user with CONNECT privilege but no table grants. The application then gets a credential that authenticates successfully but fails on every actual query with ERROR: permission denied for table orders. This is more disruptive than a failed rotation: the rotation succeeded (AWSCURRENT is updated, AWSPREVIOUS is the old working credential), but the application is broken. The testSecret step should execute a representative sample of the queries the application performs, against the specific tables and schemas it accesses.
No rollback procedure in the runbook. Teams document the rotation process but not the recovery process. When a rotation fails at 2am, the on-call engineer reads the runbook, sees no rollback steps, and manually guesses how to restore the previous credential — often incorrectly. The rollback procedure should be tested during rotation dry-runs. A rotation dry-run deliberately aborts the process at each step and verifies that rollback restores the working state.
Application credential cache outlives the credential TTL. For Vault dynamic credentials with a 1-hour TTL: if the application holds the credential for 65 minutes without refreshing, the PostgreSQL role is dropped mid-connection-pool-cycle. New connection attempts fail with FATAL: role does not exist. Existing open connections are not affected (they are already authenticated). The pool refresh logic must initiate the credential fetch at TTL - refresh_ahead (5 minutes in the Go example), not at TTL. If Vault is unreachable during that 5-minute window, the lease should be renewed rather than a new credential fetched — vault lease renew <lease_id> extends the TTL without generating a new role.
Rotation succeeded in the secret store but not in the database. If setSecret connects to the database but the ALTER USER statement fails (role does not exist, insufficient privileges, database connection timeout), the Lambda should raise an exception so Secrets Manager marks the rotation as failed. Silently catching that exception and returning success leads to AWSCURRENT being updated to a credential the database does not recognise. Every subsequent application connection will fail until manual intervention. Log and raise; never swallow exceptions in setSecret or testSecret.