External Secrets Operator: Syncing Cloud Secrets Without Storing Them in Kubernetes

The Problem

Kubernetes Secrets are a convenience layer built on a dangerous default. A vanilla cluster stores secrets as base64-encoded values in etcd, which means anyone with etcd read access — a compromised etcd backup, a misconfigured snapshot endpoint, or a node-level etcdctl call — gets every secret in the cluster in plaintext. Encryption at rest (EncryptionConfiguration) helps, but it only encrypts the etcd storage file; the API server decrypts on every read, so any principal with secrets:get in a namespace sees the value regardless. Secret rotation means patching the Kubernetes Secret, then restarting every pod that mounts it as an environment variable, because environment variables are injected at container creation time. There is no native integration with enterprise secret stores and no automatic propagation of rotations.

The External Secrets Operator (ESO) shifts the trust model. Instead of the Kubernetes Secret being the authoritative store, it becomes a read cache. ESO watches ExternalSecret custom resources, polls the configured backend (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, HashiCorp Vault, 1Password, Doppler) on a refreshInterval, and writes the result into a Kubernetes Secret. When the value in the cloud store rotates, the next ESO reconciliation cycle picks it up and overwrites the Kubernetes Secret. Workloads consuming the secret via volume mount see the new value once the kubelet next syncs the volume, without a restart (subPath mounts are the exception and never update); workloads using environment variable injection need a pod restart (the stakater/Reloader controller handles this).

What this buys on the security side is concrete. An etcd compromise does not give an attacker persistent access to secret values: it gives them the cached copy, which stops being valid as soon as the value is rotated in the cloud store and which ESO replaces at the next refresh. The authoritative values remain in the cloud store, gated behind IAM controls that etcd access does nothing to bypass. Rotation propagates within the refresh window without any manual Kubernetes intervention. The cloud store’s audit log records every ESO read; the Kubernetes audit log records which principals read the resulting Secret.

The attack surface that ESO introduces is equally concrete, and this is where most deployments have gaps.

ESO’s Architecture

Three Kubernetes resources form the ESO model:

ClusterSecretStore or SecretStore: defines the connection to a backend — credentials, endpoint, authentication method. A ClusterSecretStore is cluster-scoped; any namespace can reference it in an ExternalSecret. A SecretStore is namespace-scoped; only ExternalSecrets in the same namespace can reference it.

ExternalSecret: namespace-scoped. Specifies which paths in the cloud store to read, maps them to keys in a Kubernetes Secret, and sets the refresh interval. One ExternalSecret produces exactly one Kubernetes Secret.

ESO controller: runs in its own namespace (typically external-secrets). Watches all ExternalSecrets in the cluster, reconciles on the schedule specified in each, reads from the referenced SecretStore, creates or updates the target Kubernetes Secret. The controller needs secrets:create and secrets:update in every namespace that has ExternalSecrets, plus the read verbs (get, list, watch) it uses to compare the existing Secret against the backend. This is the widest RBAC grant in the system, and it is the primary reason a controller compromise is high-impact.

Threat Model

ESO service account compromised. The ESO controller ServiceAccount holds RBAC permission to create and update Kubernetes Secrets across all namespaces where ExternalSecrets exist. An attacker who can execute code as this ServiceAccount — through a compromised controller pod, a stolen projected token, or RBAC misconfiguration — can read or overwrite secrets in those namespaces. They don’t need to touch the cloud store at all.

SecretStore cloud credentials compromised. Each SecretStore or ClusterSecretStore holds credentials (or references credentials via IRSA/Workload Identity) to authenticate to the cloud secret backend. If those credentials are stolen, the attacker can call the cloud API directly — with the same permissions ESO has. If ESO’s IAM role covers secretsmanager:GetSecretValue on arn:aws:secretsmanager:*:*:secret:*, the attacker can read every secret in the account, not just what ESO was reading. This is the blast-radius problem: ESO’s IAM scope is the maximum damage surface of a credential compromise.

ClusterSecretStore with broad IAM. A ClusterSecretStore is accessible by any namespace. If a developer in a low-trust namespace (say, dev-sandbox) creates an ExternalSecret that references the cluster-scoped store and points at production/database/admin-password, ESO fetches it and writes it into the dev-sandbox namespace. The IAM policy on the cloud side governs what ESO can fetch — not what any individual ExternalSecret requests. If the IAM policy is overly broad, any ExternalSecret in any namespace can extract any secret ESO has access to.

Silent sync failure. ESO fails to refresh because the IAM role’s session expired, the backend is rate-limiting, or a network policy dropped the egress connection. The Kubernetes Secret retains its last successfully synced value. If the cloud store’s value rotated (scheduled rotation, breach response, key compromise), the Kubernetes Secret is now stale. Applications continue using the old credential. This isn’t an immediate security failure — it’s a ticking operational clock. The application will start failing authentication when the cloud service rejects the rotated-but-stale credential. The failure arrives silently, at an unpredictable time after the rotation, because nothing surfaced the sync failure when it started.

etcd compromise recovers stale cache only. Under ESO, an attacker who dumps etcd recovers whatever the last synced value was. That value stays usable until the secret is rotated in the cloud store; once it is, the copy in the dump is dead and the live Kubernetes Secret is replaced at the next refresh. If the refresh interval is one hour and a rotation lands 55 minutes into a cycle, the cluster converges five minutes later. For highly sensitive secrets, pairing frequent rotation with a short refresh interval reduces this window, but the cloud store IAM controls remain the primary defence.

Hardening Configuration

1. Use Namespace-Scoped SecretStore Instead of ClusterSecretStore

The default recommendation in ESO documentation is to use ClusterSecretStore for simplicity. The security recommendation is the opposite: use namespace-scoped SecretStore unless you have a specific operational reason to share a store credential across namespaces. A namespace-scoped store means a compromised ExternalSecret in one namespace cannot access paths that namespace’s store credential does not cover.

# Namespace-scoped SecretStore — only ExternalSecrets in 'production' can use this
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: eso-production-sa   # ServiceAccount in this namespace only
---
# If you must use ClusterSecretStore (cross-namespace orchestration),
# the ServiceAccount stays in ESO's own namespace
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager-cluster
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: eso-cluster-sa
            namespace: external-secrets   # Pinned to ESO's namespace

For the namespace-scoped case, each namespace gets its own IRSA-annotated ServiceAccount:

# Create a separate IAM role per namespace with the minimum required paths
eksctl create iamserviceaccount \
  --name eso-production-sa \
  --namespace production \
  --cluster prod-cluster \
  --attach-policy-arn arn:aws:iam::123456789012:policy/eso-production-policy \
  --approve

The per-namespace IAM role only covers the paths that namespace legitimately needs. A developer who creates a rogue ExternalSecret in production pointing at staging/database/credentials will get an access denied from IAM — because the production namespace’s IAM role doesn’t cover staging paths.
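The same prefix rule can be checked before a manifest ever reaches the cluster, for example in CI. The following is a minimal sketch, assuming a hypothetical convention in which each namespace may only read keys under a prefix matching its own name. The `ALLOWED_PREFIXES` map and the `out_of_scope_keys` function are illustrative names, not part of ESO, and the input is an ExternalSecret manifest already parsed into a dict.

```python
# Hypothetical CI lint: flag ExternalSecret remoteRef keys outside the
# path prefixes a namespace is expected to read. Not an ESO feature.
ALLOWED_PREFIXES = {
    "production": ("production/",),
    "staging": ("staging/",),
}


def out_of_scope_keys(external_secret, allowed=ALLOWED_PREFIXES):
    """Return remoteRef keys that fall outside the namespace's prefixes."""
    namespace = external_secret["metadata"]["namespace"]
    prefixes = tuple(allowed.get(namespace, ()))  # unknown namespace: nothing allowed
    spec = external_secret.get("spec", {})
    keys = [d["remoteRef"]["key"] for d in spec.get("data", [])]
    # dataFrom entries may carry extract.key; find-based entries have no single key
    for entry in spec.get("dataFrom", []):
        key = entry.get("extract", {}).get("key")
        if key:
            keys.append(key)
    return [k for k in keys if not k.startswith(prefixes)]


rogue = {
    "metadata": {"name": "grab", "namespace": "production"},
    "spec": {"data": [
        {"secretKey": "pw", "remoteRef": {"key": "staging/database/credentials"}},
        {"secretKey": "host", "remoteRef": {"key": "production/database/credentials"}},
    ]},
}
print(out_of_scope_keys(rogue))  # ['staging/database/credentials']
```

A lint like this is only a convenience for catching mistakes early. The IAM policy remains the actual control, because nothing stops someone applying a manifest that skipped the check.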

2. IAM Role Scoping (AWS)

This is the most frequently misconfigured aspect of ESO deployments. The IAM policy attached to ESO’s role must specify exact resource ARNs, not wildcards.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ESOProductionRead",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:production/*"
      ]
    },
    {
      "Sid": "DenyHighPrivilegePaths",
      "Effect": "Deny",
      "Action": "secretsmanager:*",
      "Resource": [
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:admin/*",
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:vault-unseal/*",
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:eso-credentials/*"
      ]
    }
  ]
}

The explicit Deny for high-privilege paths matters because it overrides any future Allow you might accidentally add. ESO cannot read its own credentials out of the store (preventing the bootstrap privilege escalation where ESO reads the credentials that grant it broader access), cannot read admin secrets, and cannot read vault unseal keys even if a misconfigured policy statement tries to grant it.
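To make the precedence rule concrete, here is a deliberately simplified model of policy evaluation: an explicit Deny wins, then any Allow, otherwise the request is implicitly denied. It is a sketch for reasoning about the policy above, not a reimplementation of IAM. It ignores conditions, principals, permission boundaries and SCPs, and it uses `fnmatchcase` as a stand-in for IAM wildcard matching (real IAM action matching is case-insensitive).

```python
from fnmatch import fnmatchcase

ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:"

POLICY = {
    "Statement": [
        {"Effect": "Allow",
         "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
         "Resource": [ARN + "production/*"]},
        {"Effect": "Deny",
         "Action": "secretsmanager:*",
         "Resource": [ARN + "admin/*", ARN + "vault-unseal/*", ARN + "eso-credentials/*"]},
    ]
}


def _as_list(value):
    """IAM accepts either a single string or a list for Action/Resource."""
    return value if isinstance(value, list) else [value]


def _matches(statement, action, resource):
    return (any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
            and any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"])))


def decision(policy, action, resource):
    """Simplified evaluation order: explicit Deny, then Allow, else implicit deny."""
    hits = [s["Effect"] for s in policy["Statement"] if _matches(s, action, resource)]
    if "Deny" in hits:
        return "ExplicitDeny"
    return "Allow" if "Allow" in hits else "ImplicitDeny"


# Simulate someone later adding an over-broad Allow by mistake.
broad = {"Statement": POLICY["Statement"] + [
    {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue", "Resource": ARN + "*"}]}

GET = "secretsmanager:GetSecretValue"
print(decision(POLICY, GET, ARN + "production/database/credentials"))  # Allow
print(decision(POLICY, GET, ARN + "staging/api-key"))                  # ImplicitDeny
print(decision(broad, GET, ARN + "admin/root-password"))               # ExplicitDeny
print(decision(broad, GET, ARN + "staging/api-key"))                   # Allow
```

The last two lines show both halves of the argument: the Deny still protects admin/ after the accidental wildcard Allow, but everything the Deny does not name (here, staging/) becomes readable.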

For GCP Secret Manager, the equivalent is a custom IAM role with secretmanager.versions.access scoped to specific secret resource IDs, not projects/*/secrets/*. For Azure Key Vault, use an Azure RBAC role assignment of the built-in Key Vault Secrets User role scoped to the specific vault (or, under the older access-policy model, an access policy granting only the secret Get permission), not Contributor.

3. ExternalSecret with Explicit Refresh and Field Extraction

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager     # References the namespace-scoped SecretStore
    kind: SecretStore
  target:
    name: database-credentials    # Name of the Kubernetes Secret to create
    creationPolicy: Owner         # ESO sets an ownerReference; deleting the ExternalSecret deletes the Secret
    deletionPolicy: Retain        # Keep the K8s Secret if the secret is deleted in the provider
                                  # (Retain does not protect against ExternalSecret deletion)
    template:
      engineVersion: v2
      metadata:
        labels:
          # Label matched by the selector in the Kyverno policy below
          reconcile.external-secrets.io/managed: "true"
      data:
        DB_HOST: "{{ .host }}"
        DB_PASSWORD: "{{ .password }}"
        DB_USER: "{{ .username }}"
  data:
  - secretKey: host
    remoteRef:
      key: production/database/credentials    # Secret path in AWS Secrets Manager
      property: host                           # JSON field within the secret value
  - secretKey: password
    remoteRef:
      key: production/database/credentials
      property: password
  - secretKey: username
    remoteRef:
      key: production/database/credentials
      property: username

creationPolicy: Owner means ESO sets an ownerReference on the Kubernetes Secret, so deleting the ExternalSecret garbage-collects the Secret. This is correct for most cases: the K8s Secret shouldn’t exist independently of the ESO resource that manages it. deletionPolicy: Retain covers a different event. If the secret is deleted in the provider, ESO leaves the existing Kubernetes Secret in place and reports a sync error rather than deleting it. It does not keep the Secret alive when the ExternalSecret itself goes away (for example because the ESO CRDs were removed during an uninstall, or the resource was deleted by mistake). If the Secret must outlive its ExternalSecret, use creationPolicy: Orphan instead.

The refreshInterval: 1h is a starting point, not a universal answer. For database passwords on a 24-hour rotation schedule, one hour is fine. For Vault dynamic secrets with a 15-minute TTL, the interval must be shorter than the TTL — set it to 10m to ensure the Kubernetes Secret is refreshed before the dynamic credential expires.
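The interval-versus-TTL rule is easy to get wrong when the two values live in different systems. A small check like the following can sit next to the manifests. It is a sketch that assumes durations are written in the Go style ESO uses for refreshInterval, and it handles only whole hours, minutes and seconds; `parse_duration` and `refresh_is_safe` are illustrative names.

```python
import re

_UNITS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Parse a Go-style duration such as '1h', '10m' or '1h30m' into seconds.

    Only integer h/m/s components are handled in this sketch.
    """
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the pattern did not fully consume (e.g. '1.5h', '10x').
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def refresh_is_safe(refresh_interval, credential_ttl):
    """True when the Secret is refreshed before the credential expires."""
    return parse_duration(refresh_interval) < parse_duration(credential_ttl)


print(refresh_is_safe("10m", "15m"))  # True
print(refresh_is_safe("1h", "15m"))   # False
print(refresh_is_safe("1h", "24h"))   # True
```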

4. Prometheus Alerts for Sync Failures

ESO exposes Prometheus metrics. The three alerts below prevent the silent-stale-secret failure mode:

groups:
- name: external-secrets
  rules:
  # Metric and label names follow the ESO metrics documentation (externalsecret_*,
  # singular). Confirm them against your controller's /metrics endpoint.
  - alert: ExternalSecretSyncFailed
    # Gauge that stays at 1 for as long as the Ready condition is False
    expr: |
      externalsecret_status_condition{condition="Ready", status="False"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "ExternalSecret sync failing: {{ $labels.name }} in {{ $labels.namespace }}"
      description: >
        The ExternalSecret {{ $labels.name }} in {{ $labels.namespace }} has been
        failing to sync for 5 minutes. The Kubernetes Secret is now stale.
        If the source value has rotated, applications will begin failing authentication.
        Check: kubectl describe externalsecret {{ $labels.name }} -n {{ $labels.namespace }}

  - alert: ExternalSecretStale
    # Successful syncs = all sync calls minus failed ones, over the last 2h.
    # The counters are call counts, not timestamps, so compare increases.
    expr: |
      (increase(externalsecret_sync_calls_total[2h])
        - increase(externalsecret_sync_calls_error[2h])) <= 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "ExternalSecret not refreshed for >2h: {{ $labels.name }} in {{ $labels.namespace }}"
      description: >
        The ExternalSecret {{ $labels.name }} has a 1h refresh interval but has not
        successfully synced in 2 hours. Either ESO is failing silently or the backend
        is returning errors that are not being surfaced.

  - alert: ExternalSecretStoreAuthFailure
    # The provider metric carries a status label (success/error), not an HTTP status code
    expr: |
      increase(externalsecret_provider_api_calls_count{status="error"}[10m]) > 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "ESO cloud provider API errors: {{ $labels.provider }}"
      description: >
        ESO is receiving error responses from {{ $labels.provider }}. This may indicate
        that the IAM role or cloud credential has been revoked, rotated, or is experiencing
        permission drift. Investigate before secrets begin expiring.

The ExternalSecretStale alert uses twice the refresh interval as its threshold. For a one-hour refresh cycle, alerting at two hours gives ESO one missed cycle before paging. This avoids alert noise from transient backend hiccups while still catching genuine sync failures before they cause application errors.

Check ExternalSecret status directly:

# Overview: READY=True means last sync succeeded, READY=False means failing
kubectl get externalsecrets -A

# NAMESPACE    NAME                   STORE                 REFRESH INTERVAL   STATUS              READY
# production   database-credentials   aws-secrets-manager   1h                 SecretSynced        True
# staging      api-keys               aws-secrets-manager   1h                 SecretSyncedError   False

# Detailed failure reason:
kubectl describe externalsecret api-keys -n staging
# Events:
#   Type     Reason        Age   From              Message
#   ----     ------        ----  ----              -------
#   Warning  UpdateFailed  3m    external-secrets  could not fetch secret: AccessDenied:
#            User: arn:aws:sts::123456789012:assumed-role/eso-staging-role/... is not
#            authorized to perform: secretsmanager:GetSecretValue on resource:
#            staging/database/credentials

An AccessDenied error in the Events section tells you immediately: ESO’s IAM role lost permission to read this secret. This is either a policy change that removed the grant, a resource that was moved to a new path, or — in a breach response — the IAM policy was deliberately restricted and the ExternalSecret needs updating to point at the new location.
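The same overview can be scripted for a periodic report or a pre-deploy gate. The sketch below works on the `items` list from `kubectl get externalsecrets -A -o json`, and assumes the ExternalSecret status carries a Ready condition and a `refreshTime` that reflects the last successful sync. The two-hour `max_age` mirrors the alert threshold above; the function name and sample data are illustrative.

```python
from datetime import datetime, timedelta, timezone


def _parse_time(stamp):
    """Parse a Kubernetes timestamp such as 2024-05-01T10:00:00Z (UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def unhealthy(items, now, max_age=timedelta(hours=2)):
    """Return (namespace, name, reason) for ExternalSecrets that are not Ready
    or whose last refresh is older than max_age."""
    findings = []
    for item in items:
        meta, status = item["metadata"], item.get("status", {})
        ready = next((c for c in status.get("conditions", []) if c["type"] == "Ready"), None)
        if ready is None or ready["status"] != "True":
            reason = ready.get("reason", "NotReady") if ready else "NoStatus"
            findings.append((meta["namespace"], meta["name"], reason))
            continue
        refreshed = status.get("refreshTime")
        if refreshed is None or now - _parse_time(refreshed) > max_age:
            findings.append((meta["namespace"], meta["name"], "Stale"))
    return findings


# Sample shaped like the "items" list of `kubectl get externalsecrets -A -o json`.
SAMPLE = [
    {"metadata": {"namespace": "production", "name": "database-credentials"},
     "status": {"refreshTime": "2024-05-01T11:30:00Z",
                "conditions": [{"type": "Ready", "status": "True", "reason": "SecretSynced"}]}},
    {"metadata": {"namespace": "production", "name": "api-gateway-token"},
     "status": {"refreshTime": "2024-05-01T08:00:00Z",
                "conditions": [{"type": "Ready", "status": "True", "reason": "SecretSynced"}]}},
    {"metadata": {"namespace": "staging", "name": "api-keys"},
     "status": {"conditions": [{"type": "Ready", "status": "False",
                                "reason": "SecretSyncedError"}]}},
]

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
for finding in unhealthy(SAMPLE, now):
    print(finding)
```

With the sample data, the four-hour-old api-gateway-token is reported as Stale and api-keys is reported with its condition reason, while database-credentials passes. Against a live cluster, feed the function `json.load(sys.stdin)["items"]` and the current UTC time.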

5. Incident Response: Cloud Credentials Compromised

If you have reason to believe ESO’s cloud credentials are compromised — an anomalous CloudTrail spike, a secret read from an unexpected IP, an ESO pod that was running a suspicious process — the response sequence for AWS:

# Step 1: Immediately attach an emergency deny-all inline policy to ESO's IAM role.
# This takes effect within seconds and overrides all allow statements.
aws iam put-role-policy \
  --role-name eso-production-role \
  --policy-name EMERGENCY-DENY-ALL \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": "secretsmanager:*",
      "Resource": "*"
    }]
  }'

# ESO will now fail to sync (you'll see SecretSyncedError in ExternalSecret status).
# That's acceptable — you've stopped the bleeding.

# Step 2: Determine what was accessed over the last 24 hours (widen the window if the
# compromise may be older). Calls made through an assumed role are recorded under the
# session name, so filter on the role recorded in sessionIssuer rather than on Username.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=secretsmanager.amazonaws.com \
  --start-time "$(date -u -d '24 hours ago' '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  | jq '.Events[]
      | (.CloudTrailEvent | fromjson) as $e
      | select($e.userIdentity.sessionContext.sessionIssuer.userName == "eso-production-role")
      | {
          time: $e.eventTime,
          action: $e.eventName,
          secret: $e.requestParameters.secretId,
          sourceIP: $e.sourceIPAddress
        }'

# Step 3: Identify requests from unexpected source IPs.
# ESO's IRSA calls originate from the cluster's egress addresses (NAT gateway IPs, or
# private node IPs when a Secrets Manager VPC endpoint is in use). Treat anything else
# as attacker activity.

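# Step 4: Rotate all secrets that ESO's role could read.
# rotate-secret only works where rotation is already configured on the secret;
# anything else needs a new value set at the source system and in Secrets Manager.
aws secretsmanager list-secrets \
  --filters Key=name,Values="production/" \
  | jq -r '.SecretList[].Name' \
  | while IFS= read -r secret; do
      echo "Rotating: $secret"
      aws secretsmanager rotate-secret --secret-id "$secret"
    done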

# Step 5: Create a new IAM role with the same policy and update the IRSA annotation.
# Do not re-enable the compromised role — create fresh credentials.
kubectl annotate serviceaccount eso-production-sa \
  -n production \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/eso-production-role-v2 \
  --overwrite

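# Step 6: Retire the OLD role instead of re-enabling it. Detach its allow policy,
# then delete the inline deny and the role itself.
aws iam detach-role-policy --role-name eso-production-role \
  --policy-arn arn:aws:iam::123456789012:policy/eso-production-policy
aws iam delete-role-policy --role-name eso-production-role \
  --policy-name EMERGENCY-DENY-ALL
aws iam delete-role --role-name eso-production-role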

# Step 7: Force-trigger ESO reconciliation after rotation is complete.
kubectl annotate externalsecret database-credentials \
  -n production \
  force-sync="$(date +%s)" \
  --overwrite

The force-sync annotation on an ExternalSecret causes ESO to reconcile immediately rather than waiting for the next refreshInterval. After rotating secrets, trigger this on all ExternalSecrets in affected namespaces — otherwise, workloads continue running with stale credentials until the next scheduled refresh.
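Step 3 of the sequence above is usually the slowest part to do by eye. A sketch of how the source-IP triage could be automated follows. It assumes the events have already been reduced to dicts with a `sourceIP` field, as the jq filter in Step 2 produces, and the CIDR ranges in `EXPECTED` are placeholders for your own NAT gateway addresses and VPC range.

```python
import ipaddress

# Hypothetical egress ranges for the cluster (VPC CIDR, NAT gateway address).
EXPECTED = [ipaddress.ip_network(c) for c in ("10.0.0.0/16", "203.0.113.10/32")]


def suspicious(events, expected=EXPECTED):
    """Return events whose sourceIP is not inside any expected network.

    Values that are not IP addresses (CloudTrail records AWS service names
    for some calls) are returned as well, so they get a manual look.
    """
    flagged = []
    for event in events:
        try:
            addr = ipaddress.ip_address(event["sourceIP"])
        except ValueError:
            flagged.append(event)
            continue
        if not any(addr in net for net in expected):
            flagged.append(event)
    return flagged


events = [
    {"action": "GetSecretValue", "secret": "production/database/credentials",
     "sourceIP": "10.0.3.7"},
    {"action": "GetSecretValue", "secret": "production/database/credentials",
     "sourceIP": "198.51.100.23"},
    {"action": "GetSecretValue", "secret": "production/api/token",
     "sourceIP": "secretsmanager.amazonaws.com"},
]
for event in suspicious(events):
    print(event["sourceIP"], event["secret"])
```

In this sample the first event falls inside the VPC range and is dropped; the other two are flagged. The secrets named in flagged events are the ones to rotate first in Step 4.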

6. Kyverno Policy: Require ESO-Managed Secrets in Production

This policy audits the production and staging namespaces for Kubernetes Secrets that were not created by ESO. Any Secret without the ESO management label is either a legacy resource that should be migrated or a Secret that someone created directly, bypassing the cloud store audit trail.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-eso-managed-secrets
spec:
  validationFailureAction: Audit   # Move to Enforce once all legacy secrets are migrated
  background: true                  # Also scan existing secrets, not just admission events
  rules:
  - name: require-eso-annotation
    match:
      any:
      - resources:
          kinds: [Secret]
          namespaces: [production, staging]
    exclude:
      any:
      - resources:
          # Secrets that carry the ESO management label
          kinds: [Secret]
          selector:
            matchExpressions:
            - key: reconcile.external-secrets.io/managed
              operator: In
              values: ["true"]
      - resources:
          # Exclude secrets created by Kubernetes itself
          kinds: [Secret]
          names:
          - "default-token-*"
          - "*.dockerconfigjson"
    validate:
      message: >
        Secrets in production and staging must be managed by External Secrets Operator.
        Create an ExternalSecret resource instead of a plain Kubernetes Secret.
        If this is a system secret (TLS cert, service account token), add the label
        'reconcile.external-secrets.io/managed: "true"' to suppress this audit finding.
      deny: {}

Start with Audit mode, run it for two to four weeks, and work through the findings before switching to Enforce. Enforcing immediately in an existing cluster will break any workload that creates Kubernetes Secrets through non-ESO means — including cert-manager, Helm charts that create secrets in post-install hooks, and operators that manage their own credentials.
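During the audit period it helps to know how large the backlog is. The following sketch lists Secrets that lack the management label, working on the `items` list from `kubectl get secrets -n production -o json`. The label key is the one the policy selects on; skipping service-account token Secrets by type is an assumption of this sketch, and `unmanaged` is an illustrative name.

```python
MANAGED_LABEL = "reconcile.external-secrets.io/managed"
# Secret types created by Kubernetes itself; skipping them is an assumption here.
SYSTEM_TYPES = {"kubernetes.io/service-account-token"}


def unmanaged(secrets):
    """Return the names of Secrets that do not carry the ESO management label."""
    names = []
    for secret in secrets:
        if secret.get("type") in SYSTEM_TYPES:
            continue
        labels = secret["metadata"].get("labels") or {}
        if labels.get(MANAGED_LABEL) != "true":
            names.append(secret["metadata"]["name"])
    return names


secrets = [
    {"metadata": {"name": "database-credentials",
                  "labels": {MANAGED_LABEL: "true"}}, "type": "Opaque"},
    {"metadata": {"name": "legacy-smtp-password"}, "type": "Opaque"},
    {"metadata": {"name": "default-token-abcde"},
     "type": "kubernetes.io/service-account-token"},
    {"metadata": {"name": "ingress-tls", "labels": {"app": "web"}},
     "type": "kubernetes.io/tls"},
]
print(unmanaged(secrets))  # ['legacy-smtp-password', 'ingress-tls']
```

Each name in the result is either a migration candidate or a Secret that needs an explicit exclusion in the policy before Enforce is switched on.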

7. ValidatingAdmissionPolicy: Block Direct Secret Modification

ESO reconciles drift — if someone edits a K8s Secret directly, ESO overwrites it within a reconciliation cycle. But between the edit and the next reconciliation, the modified value is live. For high-sensitivity namespaces, use a ValidatingAdmissionPolicy to block unauthorized Secret modifications entirely:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: block-direct-secret-edits
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      resources: ["secrets"]
      operations: ["UPDATE", "DELETE"]
      apiVersions: ["v1"]
  matchConditions:
  - name: in-production-namespace
    expression: "request.namespace == 'production'"
  validations:
  - expression: >
      request.userInfo.username.startsWith(
        "system:serviceaccount:external-secrets:"
      )
    message: >
      Direct modification of Kubernetes Secrets in production is not permitted.
      Secrets are managed by External Secrets Operator. Update the value in
      AWS Secrets Manager — ESO will propagate the change within the refresh interval.
    reason: Forbidden
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: block-direct-secret-edits-binding
spec:
  policyName: block-direct-secret-edits
  validationActions: [Deny]

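This policy allows only ServiceAccounts in the external-secrets namespace (the ESO controller) to modify Secrets in production. Any other principal, including cluster-admins, gets a Forbidden response and a message explaining the correct procedure. That also includes other controllers that legitimately write or clean up Secrets, such as cert-manager or the Kubernetes garbage collector that removes Secrets owned by a deleted ExternalSecret, so extend the expression for any you depend on. This forces compliance rather than relying on process.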

Expected Behaviour

A healthy ESO deployment under kubectl get externalsecrets -A looks like:

NAMESPACE    NAME                      STORE                 REFRESH INTERVAL   STATUS         READY
production   database-credentials      aws-secrets-manager   1h                 SecretSynced   True
production   api-gateway-token         aws-secrets-manager   1h                 SecretSynced   True
staging      database-credentials      aws-secrets-manager   1h                 SecretSynced   True

READY=True and STATUS=SecretSynced means the last reconciliation succeeded. The timestamp in kubectl describe shows when.

A failing ExternalSecret looks like:

NAMESPACE    NAME                      STORE                 REFRESH INTERVAL   STATUS              READY
production   database-credentials      aws-secrets-manager   1h                 SecretSyncedError   False

kubectl describe externalsecret database-credentials -n production shows the Events section with the specific error — AccessDenied, ResourceNotFoundException (the secret was deleted in the cloud store), InvalidRequestException (malformed request, usually a wrong region), or a network error if the backend is unreachable.

The Prometheus ExternalSecretSyncFailed alert fires after five minutes of continuous error. In your Slack channel, the alert reads: “ExternalSecret sync failing: database-credentials in production — The Kubernetes Secret is now stale. If the source value has rotated, applications will begin failing authentication.” This gives you a concrete action: check the ExternalSecret Events, fix the underlying cause (IAM permission restored, new secret path updated in the ExternalSecret spec), and the next reconciliation cycle will update the Kubernetes Secret automatically.

After the IAM incident response, kubectl get externalsecrets -A initially shows SecretSyncedError across all production ExternalSecrets (expected, because you blocked ESO’s access during the emergency). After creating the new role and annotating the ServiceAccount, the next reconciliation cycle succeeds and status returns to SecretSynced. Any ExternalSecret where you triggered force-sync recovers immediately.

Trade-offs

Namespace-scoped SecretStore vs. ClusterSecretStore. The ClusterSecretStore is operationally simpler — one store definition, all namespaces use it. But it means a single IAM role covers all namespaces, and an ExternalSecret in any namespace can request any path that role can read. Per-namespace SecretStore requires maintaining one IAM role and one ServiceAccount per namespace. In a cluster with many namespaces, this is significant operational overhead. Mitigate with Terraform or Pulumi automation that provisions the IAM role and IRSA ServiceAccount as part of namespace creation. The per-namespace IAM role still has a blast radius — it covers all secrets in that namespace’s path — but it contains the blast radius to one namespace rather than the entire account.

Refresh interval and rotation window. A one-hour refresh interval means there is up to a one-hour window between a secret rotating in the cloud store and the Kubernetes Secret updating. For breach response, this matters: if you rotate a compromised database password, applications using the old value continue to succeed for up to one hour. Applications using the new value immediately (through direct cloud SDK calls) and applications using the Kubernetes Secret (through ESO’s cache) are out of sync for that window. If you need immediate propagation, trigger force-sync via annotation immediately after rotation — but this only helps if ESO is healthy and can reach the backend.

creationPolicy: Owner and ExternalSecret deletion. If someone deletes the ExternalSecret resource, ESO deletes the Kubernetes Secret. If a workload is running and the Secret is mounted as a volume, the pod continues running — volumes survive Secret deletion until the pod restarts. But any new pod that starts will fail to mount the now-absent Secret. In a namespace with a HPA or a deployment rolling update in progress, this creates an outage. For high-availability workloads, add a PodDisruptionBudget and ensure ExternalSecret deletions are protected by RBAC — only ESO administrators should be able to delete ExternalSecrets in production.

ESO controller RBAC scope. ESO needs secrets:create, secrets:update, and secrets:delete (for cleanup on ExternalSecret deletion) across all managed namespaces. If you install ESO with cluster-wide permissions (the default Helm chart configuration), the controller ServiceAccount can create and modify Secrets in every namespace, including kube-system. Consider installing ESO with namespace-restricted RBAC — the Helm chart supports scopedNamespace values that limit the controller to specific namespaces. This reduces the blast radius if the controller is compromised, at the cost of needing to update the RBAC if you add new namespaces.

Failure Modes

ClusterSecretStore with wildcard IAM. The ESO documentation quickstart uses Resource: "*" in the IAM policy for simplicity. Many production deployments copy this without tightening it. An attacker who compromises any ExternalSecret in any namespace — even a low-trust dev namespace — and can modify it to point at production/database/admin receives the admin password in the ESO-managed Kubernetes Secret in their namespace. The IAM policy governs what ESO can read, not what ExternalSecrets can request. Fix: scope the IAM policy to specific path prefixes, and use per-namespace SecretStores so path scope tracks namespace boundaries.

Sync failure not detected for days. Without the ExternalSecretStale Prometheus alert, a sync failure that starts at 3am on a Friday goes unnoticed until Monday’s incident, when the database password rotates on its 72-hour schedule and the stale value is rejected. The K8s Secret is still serving the pre-rotation credential, so applications begin failing. The application error is “authentication failed” with no reference to ESO or Secrets Manager. Diagnosis time: variable. Fix: deploy the stale-secret alert as a prerequisite to using ESO in production, not as a nice-to-have.

ESO IRSA role too broad. A role with GetSecretValue on arn:aws:secretsmanager:us-east-1:123456789012:secret:* covers every secret in the account in that region — including secrets for other environments, other services, and administrative credentials. ESO reads only what ExternalSecrets request, but the IAM policy determines what it can read. CloudTrail does not distinguish “ESO was legitimately reading this” from “an attacker called GetSecretValue after stealing ESO’s session token.” Fix: path-prefix ARNs per IAM role, explicit Deny for high-privilege paths.

No chaos test for ESO downtime. ESO is a controller — if the controller pod crashes or the external-secrets namespace develops a resource constraint, reconciliation stops. Kubernetes Secrets retain their last synced values. If secrets are not rotating, this is invisible — everything continues working. If secrets are rotating (24-hour database password rotation, short-lived Vault dynamic credentials), the Kubernetes Secrets start aging. After one rotation cycle, applications fail. Most teams discover this during the first unplanned ESO outage, not before. Fix: include ESO availability in your operational runbooks; run quarterly drills where you scale ESO to zero for 15 minutes and verify that the ExternalSecretStale alert fires within the configured window. Add ESO’s controller pod to your uptime monitoring as a first-class target, not just an incidental Kubernetes workload.

Accidental ExternalSecret deletion cascades. A developer runs kubectl delete externalsecret --all -n production intending to clean up a staging environment, but has their kubeconfig context pointed at the wrong cluster. With creationPolicy: Owner, the Kubernetes Secrets are garbage-collected along with their ExternalSecrets. Pods with running volumes survive until their next restart. Deployments that scale up or restart for any reason fail to start. Fix: protect ExternalSecret deletion with RBAC (delete verb removed from developer roles in production namespaces), add a finalizer-based confirmation step, and for Secrets that must survive the loss of their ExternalSecret use creationPolicy: Orphan, since deletionPolicy: Retain only covers deletion of the secret in the provider.