API Key Lifecycle in CI/CD: Rotation, Scoping, and Detecting Long-Lived Credential Sprawl

The Problem

A Stripe secret key created in 2021 for a now-decommissioned staging pipeline still works today. The developer who created it left the company in 2023. The key was never rotated because no one knew it existed. It lived in a .env file that was committed once, then removed from HEAD in the next commit, but it remains readable in git log -p forever. In 2024, a contractor cloning the repository for an audit ran a TruffleHog scan that wrote its results, key included, to a publicly accessible S3 bucket. The key was then harvested by an automated scanner and used to issue fraudulent refunds for three weeks before Stripe’s fraud detection flagged the pattern.

This scenario is not a worst-case construction. It follows the mechanics behind many of the API key compromises that surface through a billing or fraud alert. The key was never rotated. Its scope was never limited. Its usage was never monitored. It existed in five locations simultaneously (git history, a Docker image layer, a Jenkins credential store, a developer’s local .env file, and a build artifact log), and none of those locations had a time-to-live.

API keys in CI/CD accumulate across a predictable set of locations. Understanding each one is necessary before you can audit or remediate them.

Git history is the most dangerous because it is permanent by default and often overlooked. A key committed in commit a3f9c1d and removed in commit b2e8f4a is fully retrievable by anyone with repository read access. git log -p -- .env shows the diff containing the key. git log --all -p extends this to every branch and tag in the clone, and on GitHub anyone with read access can also fetch pull-request refs (refs/pull/*/head), including those opened from forks. The key is as accessible as it was the day it was committed, and if it was never rotated after the removal commit, it remains valid.

Docker image layers preserve secrets that were ever present during a build, even if the final image does not contain them. A Dockerfile that runs COPY .env /app/.env in an early instruction and then removes it with RUN rm /app/.env in a later one has still written the key into the layer created by the COPY instruction; the rm only adds a whiteout entry in a newer layer. Each layer is a separate addressable object in the image manifest. Anyone who pulls the image can extract the layer: run docker save myimage:latest -o image.tar, then tar xf image.tar and unpack the layer archives listed in manifest.json. The key is there. The same applies to ARG values used in RUN commands: ARG STRIPE_KEY followed by RUN python setup.py --key $STRIPE_KEY records the key in the image’s build history, and docker history --no-trunc exposes it.
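The layer behaviour is easy to confirm directly. The sketch below assumes an archive produced by docker save (a tar containing a manifest.json that lists each layer’s path). It opens every layer and searches every regular file for secret-shaped strings, so it also reports files that a later layer deleted. The three regular expressions are illustrative stand-ins, not a substitute for the rule sets in TruffleHog or Trivy.

```python
import io
import json
import re
import tarfile

# Illustrative patterns only; real scanners ship hundreds of rules.
SECRET_PATTERNS = {
    "stripe_live_key": re.compile(rb"(?:sk|rk)_live_[0-9A-Za-z]{10,}"),
    "github_classic_pat": re.compile(rb"ghp_[0-9A-Za-z]{36}"),
    "aws_access_key_id": re.compile(rb"AKIA[0-9A-Z]{16}"),
}


def scan_saved_image(path: str) -> list[tuple[str, str, str]]:
    """Return (layer path, file path, rule) for each match in a docker save archive.

    Every layer is scanned independently, so a file deleted by a later layer
    (via a whiteout entry) is still reported from the layer that added it.
    """
    findings = []
    with tarfile.open(path) as outer:
        manifest = json.load(outer.extractfile("manifest.json"))
        for image in manifest:
            for layer_path in image["Layers"]:
                # Layers are nested tar archives inside the outer archive.
                layer_bytes = outer.extractfile(layer_path).read()
                with tarfile.open(fileobj=io.BytesIO(layer_bytes)) as layer:
                    for member in layer.getmembers():
                        if not member.isfile():
                            continue
                        data = layer.extractfile(member).read()
                        for rule, pattern in SECRET_PATTERNS.items():
                            if pattern.search(data):
                                findings.append((layer_path, member.name, rule))
    return findings
```

Typical use: docker save myimage:latest -o image.tar, then scan_saved_image("image.tar"). Each layer is read fully into memory, which is fine for a spot check but not for very large images.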

CI environment variables — GitHub Actions secrets, GitLab CI variables, Jenkins credentials — accumulate over the lifetime of an organisation. A secret added for a project in 2020 persists indefinitely unless explicitly deleted. Former employees who configured secrets may retain copies. Offboarding processes rarely include an audit of which CI secrets the departing engineer created or had access to. The key is valid until rotated; the rotation is never triggered because no one knows the key exists or who owns it.

Artifact stores and build logs are frequently overlooked. When a pipeline fails with DEBUG=True or --log-level debug, many HTTP client libraries log request headers, including Authorization: Bearer <api-key>. pytest output for a failing request can include the key, and HTML test reports that embed environment metadata can capture it too. These artifacts are uploaded to S3, stored in GitHub Actions artifact storage, or pushed to Artifactory, and they persist for anywhere from 90 days (the GitHub Actions default) to multiple years depending on retention policy.
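To check whether logs and artifacts you already have contain credentials, searching them for Authorization headers is a cheap first pass. This is a minimal sketch: the regular expression is an assumption that covers the plain "Authorization: Bearer ..." form and the Python dict-repr form, and real debug output varies by library.

```python
import re

# Authorization header followed by a Bearer or Basic credential, in either
# "Authorization: Bearer x" or dict-repr "'Authorization': 'Bearer x'" form.
AUTH_HEADER = re.compile(
    r"authorization['\"]?\s*[:=]\s*['\"]?(bearer|basic)\s+([A-Za-z0-9._~+/=-]{8,})",
    re.IGNORECASE,
)


def find_auth_headers(log_text: str) -> list[tuple[int, str]]:
    """Return (line number, masked credential) for every Authorization header found."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for match in AUTH_HEADER.finditer(line):
            scheme, credential = match.group(1).title(), match.group(2)
            # Keep only a short prefix so the report itself does not leak the key.
            hits.append((lineno, f"{scheme} {credential[:4]}****"))
    return hits
```

Logs can be fetched with gh run view RUN_ID --log and artifacts with gh run download RUN_ID before scanning; the report keeps only the first four characters of each credential.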

Container environment variables baked in at build time present a separate class of problem. ENV API_KEY=sk_live_xxx in a Dockerfile writes the key into the image configuration (the config JSON that travels with the layers), which is readable via docker inspect myimage:latest --format '{{json .Config.Env}}'. Every container started from the image exposes the key. Every copy of the image in every registry (internal, external, cached, or archived) contains the key. Rebuilding the image does not remove the old image; it creates a new one alongside the old one.
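Because docker inspect returns .Config.Env as a JSON array of NAME=value strings, a few lines are enough to flag suspicious variables before an image is pushed. A sketch, assuming a name-based heuristic (KEY, TOKEN, SECRET, and similar) that will miss oddly named variables and occasionally flag harmless ones:

```python
import json
import re

# Heuristic: variable names that usually hold credentials.
SENSITIVE_NAME = re.compile(r"KEY|TOKEN|SECRET|PASSWORD|PASSWD|CREDENTIAL", re.IGNORECASE)


def suspicious_env_vars(config_env_json: str) -> list[str]:
    """Names of non-empty env vars baked into an image config that look like secrets.

    Input is the output of: docker inspect IMAGE --format '{{json .Config.Env}}'
    """
    names = []
    for entry in json.loads(config_env_json) or []:  # "null" when no Env is set
        name, _, value = entry.partition("=")
        if value and SENSITIVE_NAME.search(name):
            names.append(name)
    return names
```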

The blast radius problem compounds this: API keys are almost never created with minimum scope in CI/CD contexts. Broad permissions are chosen at creation time for convenience (“I’ll scope it down later”), and the later never arrives. A Stripe secret key with no restrictions can charge any customer, issue full refunds, create payouts to arbitrary bank accounts, and read all transaction history. A Stripe restricted key scoped to balance.read can do none of those things. Both leak through a build log in exactly the same way; only the sk_live_ or rk_live_ prefix tells you which one the attacker now holds. The difference in impact is the difference between a nuisance and a regulatory incident.
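Because the prefixes differ, the first step of triage after a leak can be mechanical: identify what kind of key escaped. A sketch, assuming the publicly documented prefixes below for Stripe, GitHub, and AWS; the triage levels are an illustrative choice, and an rk_live_ key still needs its actual permissions checked.

```python
# Documented key prefixes for a few providers. The triage levels are an
# editorial aid, not something the providers define.
KEY_PREFIXES = [
    ("sk_live_", "Stripe secret key, live mode, unrestricted", "critical"),
    ("rk_live_", "Stripe restricted key, live mode", "check scope"),
    ("sk_test_", "Stripe secret key, test mode", "low"),
    ("github_pat_", "GitHub fine-grained personal access token", "check scope"),
    ("ghp_", "GitHub classic personal access token", "high"),
    ("ghs_", "GitHub App installation token (expires after one hour)", "low"),
    ("AKIA", "AWS long-term access key ID", "high"),
    ("ASIA", "AWS temporary (STS) access key ID", "low"),
]


def triage_leaked_key(value: str) -> tuple[str, str]:
    """Return (description, triage level) for a leaked credential string."""
    for prefix, description, level in KEY_PREFIXES:
        if value.startswith(prefix):
            return description, level
    return "unrecognised credential", "investigate"
```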

The detection gap is the final structural problem. API key compromise is typically discovered by one of three mechanisms: automated secret scanning by GitHub, GitLab, or the key issuer (Stripe, AWS, Twilio), which may revoke or quarantine the key on detection; an unexpected billing spike that someone investigates; or an external security researcher who reports it. In all three cases, the key has been compromised for an indeterminate period before discovery. The usage between compromise and detection is rarely recoverable in logs. The attacker had full access for however long they chose to be quiet.

Threat Model

Stale key in git history exploited after employee departure. A developer commits a Datadog API key to a monitoring script in 2022, notices the error, removes it in the next commit, and moves on. In 2025, a disgruntled former colleague clones the repository and runs git log -p | grep -A2 "DD_API_KEY". The key is there. It was never rotated because the removal commit felt sufficient. The former colleague uses the key to delete monitoring data and silence alerts for a production incident.

Docker image pushed to public registry, key harvested within hours. A Dockerfile for a development image includes ENV GITHUB_TOKEN=ghp_xxxxxxxxxxxx added by a developer trying to pull private packages during the build. The image is pushed to Docker Hub as part of a CI pipeline that runs on every pull request, including those from external contributors. Within two hours of the push, automated scanners crawling Docker Hub for API key patterns find the key. The token is used to push a malicious commit to an internal repository the token has write access to.

CI environment variable leaked through test log. A Jest test suite makes a real HTTP request to a third-party API (Twilio SMS gateway) in integration tests. The test fails with a 401 and the HTTP client logs the full request including the Authorization header. GitHub Actions uploads the test log as an artifact. A team member downloads the artifact to debug the failure and stores it in a shared Google Drive folder with broad access. Six months later, the Twilio key in the log is used to send thousands of SMS messages, creating a $40,000 billing liability.

Former contractor’s key copy used after access revocation. A contractor is granted a SendGrid API key to configure email delivery for a project. The key is shared over Slack. The contractor finishes the engagement, their accounts are deprovisioned, and the key is left unchanged because “we’ll rotate everything during the next security review.” The contractor uses the key eighteen months later to exfiltrate a mailing list. The review never happened.

Hardening Configuration

1. Audit Existing API Key Exposure

Start with what already exists before hardening anything new. An audit of an existing codebase and image registry will almost always find keys that no one knew were present.

Scan git history across all branches for secrets. TruffleHog tries to verify each candidate against the issuing service’s API, and --only-verified limits the output to keys that verification confirmed are currently valid. This dramatically reduces noise in repositories with years of history:

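# TruffleHog v3 is a Go binary; the PyPI package named trufflehog is the older v2.
brew install trufflehog
# Alternatively, run the container image: docker run --rm -v "$PWD:/pwd" trufflesecurity/trufflehog:latest git file:///pwd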

# Scan every commit on every branch in the current repository:
trufflehog git file://. --only-verified --json | \
  jq '{
    file: .SourceMetadata.Data.Git.file,
    commit: .SourceMetadata.Data.Git.commit,
    line: .SourceMetadata.Data.Git.line,
    detector: .DetectorName,
    verified: .Verified
  }'

A finding in this output means the key is both present in git history and verified as currently valid by the issuing service. This key must be rotated immediately — the historical commit containing it cannot be safely removed without a full git history rewrite and force-push to all clones, which is operationally complex. The safe remediation is rotation: invalidate the old key, issue a new one.
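With years of history, the same key usually appears in several commits, so the raw findings are easier to act on when collapsed into one entry per distinct secret. A sketch, assuming TruffleHog’s --json output is one JSON object per line with the Verified, DetectorName, Raw, and SourceMetadata.Data.Git fields seen in TruffleHog v3 output; the fingerprint is a truncated SHA-256 so the worklist never contains the secret itself.

```python
import hashlib
import json


def rotation_worklist(ndjson: str) -> list[dict]:
    """Collapse TruffleHog JSON lines into one entry per distinct verified secret."""
    worklist: dict[str, dict] = {}
    for line in ndjson.splitlines():
        if not line.strip():
            continue
        finding = json.loads(line)
        if not finding.get("Verified"):
            continue  # skip unverified candidates; review them separately
        # Fingerprint the secret so the worklist can be shared without leaking it.
        fingerprint = hashlib.sha256(finding.get("Raw", "").encode()).hexdigest()[:12]
        git = finding.get("SourceMetadata", {}).get("Data", {}).get("Git", {})
        entry = worklist.setdefault(fingerprint, {
            "fingerprint": fingerprint,
            "detector": finding.get("DetectorName"),
            "locations": [],
        })
        entry["locations"].append(f"{git.get('commit', '?')[:8]}:{git.get('file', '?')}")
    return list(worklist.values())
```

For example, redirect the scan to findings.jsonl and pass the file’s contents to rotation_worklist; each entry is one key to rotate, with every commit and file where it appears.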

For Docker images, scan each image in your registry. Trivy’s --scanners secret mode inspects all layers in the image manifest, not just the final filesystem state:

# Install Trivy:
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin

# Scan a specific image including all layers:
trivy image --scanners secret myorg/myimage:latest

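# Scan every locally pulled tag of an image (docker image ls only sees the local
# cache; enumerate remote tags with your registry's API or skopeo list-tags):
docker image ls myorg/myimage --format '{{.Tag}}' | grep -v '^<none>$' | \
  while read -r tag; do
    echo "=== Scanning myorg/myimage:${tag} ==="
    trivy image --scanners secret --quiet myorg/myimage:"${tag}"
  done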

When Trivy finds a secret in a Docker layer, the output identifies the layer digest, the file within the layer, and the secret type. A finding in an intermediate layer that was supposed to have been cleaned up confirms the cleanup did not work. The remediation is to rebuild the image correctly (see section 6), retag, and delete old image versions from the registry, then run the registry’s garbage collection (or confirm its retention policy does so) so that the now-unreferenced layer blobs are actually removed.

Enumerate CI secrets by name to build an inventory. GitHub Actions does not expose secret values via the API, but names reveal accumulation:

# List all repository secrets (names only — values are never accessible via API):
gh api repos/ORG/REPO/actions/secrets | \
  jq '.secrets[] | {name: .name, created_at: .created_at, updated_at: .updated_at}'

# List organisation-level secrets:
gh api orgs/ORG/actions/secrets | \
  jq '.secrets[] | {name: .name, created_at: .created_at, updated_at: .updated_at, visibility: .visibility}'

The updated_at field is the last time the secret value was changed. A secret with updated_at from 2021 in a repository where the CI pipeline runs daily is a rotation SLA violation waiting to be enforced. Secrets named DATADOG_API_KEY_LEGACY, OLD_STRIPE_KEY, or TEMP_GITHUB_TOKEN are candidates for immediate audit and deletion.
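The same inventory can be checked automatically. A sketch that consumes the JSON returned by gh api repos/ORG/REPO/actions/secrets; the 90-day threshold and the LEGACY/OLD/TEMP/TMP name markers are assumptions drawn from the examples above, not GitHub conventions.

```python
import json
from datetime import datetime, timezone

# Name fragments that usually mark a forgotten credential (heuristic).
STALE_NAME_MARKERS = {"LEGACY", "OLD", "TEMP", "TMP"}


def audit_ci_secrets(api_json: str, now: datetime, max_age_days: int = 90) -> list[str]:
    """Flag secrets from a GitHub actions/secrets API response.

    A secret is flagged when its value has not changed in max_age_days days
    (based on updated_at) or when its name contains a stale marker.
    """
    findings = []
    for secret in json.loads(api_json)["secrets"]:
        name = secret["name"]
        # GitHub returns timestamps like "2021-03-01T00:00:00Z".
        updated = datetime.fromisoformat(secret["updated_at"].replace("Z", "+00:00"))
        age_days = (now - updated).days
        if age_days > max_age_days:
            findings.append(f"{name}: value unchanged for {age_days} days")
        if STALE_NAME_MARKERS & set(name.upper().split("_")):
            findings.append(f"{name}: name suggests a stale credential")
    return findings
```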

2. Enforce Key Rotation via CI

Rotation policies only work if they are enforced. A document that says “rotate keys every 90 days” and a CI check that fails if they have not been rotated are not equivalent: only the second one forces anyone to act.

For keys where the provider exposes key metadata through their API, use that metadata to check key age. This GitHub Actions workflow runs weekly and fails if any tracked key exceeds the rotation SLA:

name: API Key Rotation Check
on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 09:00 UTC
  workflow_dispatch:

jobs:
  check-rotation:
    runs-on: ubuntu-latest
    steps:
      - name: Check AWS access key age
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: us-east-1
        run: |
          # Get the username associated with the current credentials:
          USERNAME=$(aws sts get-caller-identity --query 'Arn' --output text \
            | sed 's|.*/||')

          # List access keys for this user and check each creation date.
          # Process substitution keeps the loop in the main shell, so FAILED
          # survives the loop and every stale key is reported, not just the first.
          MAX_AGE=90
          NOW=$(date +%s)
          FAILED=0
          while read -r KEY_ID CREATED STATUS; do
            CREATED_TS=$(date -d "$CREATED" +%s)
            AGE_DAYS=$(( (NOW - CREATED_TS) / 86400 ))

            echo "Key: ${KEY_ID} | Age: ${AGE_DAYS} days | Status: ${STATUS}"

            if [ "$STATUS" = "Active" ] && [ "$AGE_DAYS" -gt "$MAX_AGE" ]; then
              echo "::error::Access key ${KEY_ID} is ${AGE_DAYS} days old (SLA: ${MAX_AGE} days). Rotate immediately."
              FAILED=1
            fi
          done < <(aws iam list-access-keys --user-name "$USERNAME" \
            --query 'AccessKeyMetadata[*].[AccessKeyId,CreateDate,Status]' \
            --output text)
          exit "$FAILED"

      - name: Check GitHub token expiry
        env:
          GH_TOKEN: ${{ secrets.DEPLOY_GITHUB_TOKEN }}
        run: |
          # Tokens created with an expiry date report it in the
          # GitHub-Authentication-Token-Expiration response header,
          # e.g. "2026-08-01 00:00:00 UTC". /rate_limit accepts any token.
          EXPIRES=$(curl -sf -o /dev/null -D - \
              -H "Authorization: Bearer $GH_TOKEN" \
              https://api.github.com/rate_limit \
            | grep -i '^github-authentication-token-expiration:' \
            | cut -d' ' -f2- | tr -d '\r' || true)

          if [ -n "$EXPIRES" ]; then
            EXPIRES_TS=$(date -d "$EXPIRES" +%s)
            NOW=$(date +%s)
            DAYS_REMAINING=$(( (EXPIRES_TS - NOW) / 86400 ))

            if [ "$DAYS_REMAINING" -lt 14 ]; then
              echo "::warning::GitHub token expires in ${DAYS_REMAINING} days. Rotate before expiry."
            fi
            echo "GitHub token expires in ${DAYS_REMAINING} days."
          else
            echo "::warning::GitHub token reports no expiry. Consider switching to a fine-grained token with an expiry date."
          fi

For services where key age is not queryable through the API, track rotation dates in the CI secret name itself. A naming convention of STRIPE_KEY_20260101 encodes the rotation date; a CI check can parse the name and fail if the suffix is older than the SLA. This is imperfect but effective as a forcing function when better metadata is unavailable.
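A check for that naming convention might look like the sketch below. The NAME_YYYYMMDD format is the hypothetical convention described above; names without a parseable date are returned separately, because an untracked secret is itself a finding.

```python
import re
from datetime import date, datetime

# Hypothetical convention: NAME_YYYYMMDD records the date the value was rotated.
ROTATION_SUFFIX = re.compile(r"_(\d{8})$")


def check_rotation_suffixes(names, today: date, max_age_days: int = 90):
    """Return ({name: age_days} for overdue secrets, [names without a valid date suffix])."""
    overdue: dict[str, int] = {}
    untracked: list[str] = []
    for name in names:
        match = ROTATION_SUFFIX.search(name)
        try:
            rotated = datetime.strptime(match.group(1), "%Y%m%d").date() if match else None
        except ValueError:  # e.g. _20260230 is not a real date
            rotated = None
        if rotated is None:
            untracked.append(name)
        elif (today - rotated).days > max_age_days:
            overdue[name] = (today - rotated).days
    return overdue, untracked
```

Feeding it the names from gh api repos/ORG/REPO/actions/secrets --jq '.secrets[].name' and failing the job when either result is non-empty turns the convention into a gate.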

AWS Secrets Manager can enforce automated rotation without human intervention for supported services. For custom API keys, use a Lambda rotation function:

# Create the secret with a rotation schedule:
aws secretsmanager create-secret \
  --name prod/stripe/restricted-key \
  --secret-string '{"key": "rk_live_xxxx"}' \
  --tags Key=team,Value=platform Key=rotates,Value=90d

# Enable rotation (requires a Lambda function ARN for the rotation logic):
aws secretsmanager rotate-secret \
  --secret-id prod/stripe/restricted-key \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:123456789012:function:stripe-key-rotator \
  --rotation-rules AutomaticallyAfterDays=90

# Retrieve the current value in CI (no static key stored anywhere):
STRIPE_KEY=$(aws secretsmanager get-secret-value \
  --secret-id prod/stripe/restricted-key \
  --query SecretString --output text | jq -r '.key')

The rotation Lambda is responsible for obtaining a new restricted key from the provider, storing it in Secrets Manager as the pending version, validating it, and only then promoting it so the old key can be revoked. If the provider offers no API for issuing keys, that issuing step stays manual and Secrets Manager only enforces the schedule. Either way, the CI pipeline retrieves the current key at runtime rather than storing a static value.
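Secrets Manager invokes a rotation function once for each of four steps (createSecret, setSecret, testSecret, finishSecret), passing SecretId, ClientRequestToken, and Step in the event. The sketch below shows that structure for a secret stored as {"key": ...}. The issue_key and test_key callables are hypothetical hooks for the provider-specific parts, revoking the previous key at the provider is left out, and the client can be a boto3 secretsmanager client or any object with the same methods.

```python
import json
from typing import Any, Callable


def rotate_api_key(event: dict, client: Any,
                   issue_key: Callable[[], str],
                   test_key: Callable[[str], bool]) -> None:
    """Handle one step of a Secrets Manager rotation for a {"key": ...} secret."""
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"]  # version ID of the pending secret
    step = event["Step"]
    stages = client.describe_secret(SecretId=secret_id)["VersionIdsToStages"]

    if step == "createSecret":
        # Idempotent: Secrets Manager may retry a step.
        if "AWSPENDING" in stages.get(token, []):
            return
        client.put_secret_value(
            SecretId=secret_id,
            ClientRequestToken=token,
            SecretString=json.dumps({"key": issue_key()}),  # hypothetical provider call
            VersionStages=["AWSPENDING"],
        )
    elif step == "setSecret":
        # The provider already issued the key in createSecret; nothing to push.
        return
    elif step == "testSecret":
        pending = client.get_secret_value(
            SecretId=secret_id, VersionId=token, VersionStage="AWSPENDING")
        if not test_key(json.loads(pending["SecretString"])["key"]):
            raise RuntimeError("pending key failed validation; AWSCURRENT unchanged")
    elif step == "finishSecret":
        current = next(v for v, s in stages.items() if "AWSCURRENT" in s)
        if current == token:
            return
        # Promote the pending version; Secrets Manager marks the old one AWSPREVIOUS.
        client.update_secret_version_stage(
            SecretId=secret_id,
            VersionStage="AWSCURRENT",
            MoveToVersionId=token,
            RemoveFromVersionId=current,
        )
    else:
        raise ValueError(f"unknown rotation step: {step}")

# Wiring into Lambda (issue_provider_key / provider_key_works are hypothetical):
# def lambda_handler(event, context):
#     rotate_api_key(event, boto3.client("secretsmanager"),
#                    issue_provider_key, provider_key_works)
```

Keeping the provider calls behind two callables means the step logic can be exercised against a stand-in client before it ever touches a live key.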

3. Scope API Keys to Minimum Required Permissions

Every API key in CI/CD should be created with exactly the permissions required for the specific job it performs, and no more. This is not a nice-to-have — it determines whether a key compromise is a minor incident or a full account takeover.

For AWS, create a separate IAM user for each distinct CI workload and attach a narrowly scoped policy:

# Create a dedicated user for the artifact upload job:
aws iam create-user --user-name ci-artifact-uploader \
  --tags Key=purpose,Value=ci-upload Key=created,Value=2026-05-09

# Define minimum permissions. Object writes are scoped by resource ARN,
# because the s3:prefix condition key only applies to ListBucket:
cat > /tmp/artifact-upload-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowArtifactUpload",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::my-build-artifacts/builds/*",
        "arn:aws:s3:::my-build-artifacts/releases/*"
      ]
    },
    {
      "Sid": "AllowBucketList",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-build-artifacts",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["builds/*", "releases/*"]
        }
      }
    }
  ]
}
EOF

aws iam put-user-policy \
  --user-name ci-artifact-uploader \
  --policy-name artifact-upload-only \
  --policy-document file:///tmp/artifact-upload-policy.json

# Create the access key and note the rotation deadline:
KEY_OUTPUT=$(aws iam create-access-key --user-name ci-artifact-uploader)
echo "$KEY_OUTPUT" | jq '{
  AccessKeyId: .AccessKey.AccessKeyId,
  SecretAccessKey: .AccessKey.SecretAccessKey,
  RotateBy: "'$(date -d '+90 days' +%Y-%m-%d)'"
}'
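Before attaching a policy, a small check can catch the most common ways a “scoped” policy quietly stops being scoped. A minimal sketch covering only wildcard actions, Resource "*", and NotAction/NotResource; IAM Access Analyzer’s policy validation (aws accessanalyzer validate-policy) performs a far more complete review.

```python
def overbroad_statements(policy: dict) -> list[str]:
    """Flag Allow statements in an IAM policy document that defeat least privilege."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given bare
        statements = [statements]
    problems = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", f"statement {index}")
        if "NotAction" in stmt or "NotResource" in stmt:
            problems.append(f"{sid}: NotAction/NotResource grants everything not listed")
        actions = stmt.get("Action", [])
        for action in [actions] if isinstance(actions, str) else actions:
            if "*" in action:
                problems.append(f"{sid}: wildcard action {action}")
        resources = stmt.get("Resource", [])
        if "*" in ([resources] if isinstance(resources, str) else resources):
            problems.append(f"{sid}: applies to every resource")
    return problems
```

Loading /tmp/artifact-upload-policy.json with json.load and passing it in should return an empty list.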

For Stripe, use restricted keys rather than the full secret key in any CI context. Restricted keys are created in the Stripe Dashboard (Developers > API keys > Create restricted key), where each resource is set to None, Read, or Write. A key for a job that only reads the balance grants Balance: Read and leaves every other resource at None.

The resulting restricted key cannot charge customers, create payouts, or read card data. If it leaks from a build log, the attacker can read the account balance: a disclosure, but not a financial control failure.

For GitHub tokens used in CI pipelines, use fine-grained personal access tokens or GitHub Apps rather than classic tokens. Fine-grained tokens are scoped to specific repositories and specific permission sets:

# For a GitHub App installation token, list the repositories it can reach:
curl -sf -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/installation/repositories | \
  jq '.repositories[].full_name'

# For a classic token, print the OAuth scopes it was granted:
curl -sf -H "Authorization: Bearer $GITHUB_TOKEN" \
  -I https://api.github.com/user | \
  grep -i "x-oauth-scopes"

Only classic tokens report scopes in the X-OAuth-Scopes header; a value of repo means full read and write access to every repository the token owner can reach. Fine-grained tokens and GitHub App tokens do not use OAuth scopes, so review their repository and permission settings in the token or app configuration instead.
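In a pipeline, the header value can be turned into a pass/fail decision. A sketch, assuming the raw X-OAuth-Scopes value has already been read from the response; the set of write-capable scopes is a hand-picked subset of GitHub’s classic scope list, not the complete list.

```python
from typing import Optional

# Classic OAuth scopes that grant write or admin capabilities (illustrative subset).
WRITE_SCOPES = {
    "repo", "public_repo", "workflow", "write:packages", "delete_repo",
    "admin:org", "write:org", "admin:repo_hook", "admin:public_key",
}


def write_scopes(x_oauth_scopes: Optional[str]) -> Optional[set]:
    """Return the write-capable scopes in an X-OAuth-Scopes header value.

    None means the header was absent, which is expected for fine-grained and
    GitHub App tokens; their permissions must be checked in settings instead.
    """
    if x_oauth_scopes is None:
        return None
    granted = {scope.strip() for scope in x_oauth_scopes.split(",") if scope.strip()}
    return granted & WRITE_SCOPES
```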

4. Replace Static Keys with Short-Lived Credentials

The most effective mitigation for static API key sprawl is eliminating static keys where the provider supports OIDC or identity federation. A credential that expires in 15 minutes cannot be used from a build log from last year.

GitHub Actions supports OIDC federation with AWS, GCP, Azure, Vault, and others. The workflow receives a signed JWT from GitHub’s OIDC provider and exchanges it for a short-lived cloud credential, which expires after the configured session duration (one hour by default with aws-actions/configure-aws-credentials):

name: Deploy to Production
on:
  push:
    branches: [main]

permissions:
  id-token: write   # Required to request the OIDC JWT
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deployer
          aws-region: us-east-1
          # No AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY defined anywhere.
          # A temporary STS credential (lifetime set by role-duration-seconds,
          # one hour by default) is exchanged via OIDC.

      - name: Upload build artifact
        run: |
          aws s3 cp dist/ s3://my-build-artifacts/builds/${{ github.sha }}/ --recursive
          # The temporary credential expires on its own; there is no long-lived key to leak.

The IAM role trust policy on the AWS side restricts which GitHub repositories and workflows can assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:myorg/myrepo:ref:refs/heads/main"
        }
      }
    }
  ]
}

The sub condition locks the role assumption to the specific repository and branch. A forked repository or a feature branch cannot assume this role. A stolen OIDC JWT from a different repository cannot be used here. There is no static credential to exfiltrate, rotate, or inventory.
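To make the effect of the sub condition concrete, the sketch below approximates how IAM evaluates this Condition block against the claims in a GitHub OIDC token. It is a simplified model (StringEquals and StringLike only, with a regex translation of IAM’s * and ? wildcards), not IAM’s own evaluation engine, and the claim values in the example are illustrative.

```python
import re

CLAIM_PREFIX = "token.actions.githubusercontent.com:"


def _string_like(pattern: str, value: str) -> bool:
    # IAM StringLike: "*" matches any run of characters, "?" exactly one.
    regex = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def trust_conditions_allow(condition: dict, claims: dict) -> bool:
    """Approximate IAM's evaluation of a trust policy Condition block.

    Every condition key must match (AND); any listed value may match (OR).
    """
    for operator, tests in condition.items():
        for key, expected in tests.items():
            claim = claims.get(key.removeprefix(CLAIM_PREFIX))
            values = [expected] if isinstance(expected, str) else expected
            if claim is None:
                return False
            if operator == "StringEquals":
                matched = claim in values
            elif operator == "StringLike":
                matched = any(_string_like(v, claim) for v in values)
            else:
                raise ValueError(f"operator not modelled in this sketch: {operator}")
            if not matched:
                return False
    return True
```

Evaluating a pull_request token against a loosened pattern such as repo:myorg/myrepo:* shows why the pinned sub matters: the wildcard also admits tokens from pull_request runs and every branch.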

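For services that do not support OIDC, HashiCorp Vault can broker access instead. The job authenticates to Vault with the same GitHub OIDC token (Vault’s JWT auth method), receives a short-lived Vault token, and reads the key from Vault; for services Vault has a dynamic secrets engine for, the credential itself is short-lived too: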

      - name: Fetch Stripe key from Vault (short-lived lease)
        uses: hashicorp/vault-action@d1720f055e0635fd932a1d2a48f87a666a57906c
        with:
          url: https://vault.internal.example.com
          method: jwt
          jwtGithubAudience: https://vault.internal.example.com
          role: github-actions-stripe
          secrets: |
            secret/data/stripe/restricted-key key | STRIPE_KEY ;
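        # Requires permissions: id-token: write so the job can request the OIDC JWT.
        # STRIPE_KEY is now available as an env var for this job only. The Vault
        # token is short-lived; a KV value like this one is static and still needs
        # rotation, but it is no longer stored in GitHub at all.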

5. Detect Anomalous API Key Usage

Rotation and scoping reduce the impact of a compromise. Detection reduces the dwell time. The goal is to detect compromise before the key is reported — either by a provider’s automated scanner or a billing alert.

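For AWS credentials, CloudTrail records the access key ID behind every management event it logs (S3 object-level operations are data events and appear only if a trail is configured to record them). Query CloudTrail for unusual patterns against a specific key: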

import json
import sys
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

import boto3

KNOWN_CI_REGIONS = {'us-east-1', 'us-west-2'}
DESTRUCTIVE_EVENTS = {
    'DeleteBucket', 'TerminateInstances', 'DeleteUser', 'DeleteSecret',
    'PutBucketPolicy', 'CreateUser', 'AttachUserPolicy', 'CreateAccessKey',
    'DeleteTrail', 'StopLogging',  # CloudTrail tampering
}
# Some of these (e.g. S3 PutObject/GetObject) are data events, which
# LookupEvents never returns; they matter only for trails that record them.
NORMAL_CI_EVENTS = {
    'PutObject', 'GetObject', 'ListBucket', 'GetCallerIdentity',
    'DescribeInstances', 'CreateLogGroup', 'PutLogEvents',
}
# GitHub-hosted runners use many egress IPs; tune this to your runner fleet.
MAX_SOURCE_IPS = 3


class Anomaly(NamedTuple):
    kind: str
    event: str
    region: str
    source_ip: str
    timestamp: str


def enabled_regions() -> list[str]:
    """LookupEvents is regional, so list every region enabled for the account."""
    ec2 = boto3.client('ec2', region_name='us-east-1')
    return [r['RegionName'] for r in ec2.describe_regions()['Regions']]


def detect_key_usage_anomaly(access_key_id: str, hours: int = 24) -> list[Anomaly]:
    """
    Query CloudTrail in every enabled region for the last `hours` hours of
    activity by `access_key_id`. Returns anomalies: unexpected regions,
    destructive operations, operations outside the expected set, and use
    from more than MAX_SOURCE_IPS distinct source IPs.
    Requires cloudtrail:LookupEvents and ec2:DescribeRegions.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    anomalies: list[Anomaly] = []
    seen_ips: set[str] = set()

    for region_name in enabled_regions():
        cloudtrail = boto3.client('cloudtrail', region_name=region_name)
        pages = cloudtrail.get_paginator('lookup_events').paginate(
            LookupAttributes=[{
                'AttributeKey': 'AccessKeyId',
                'AttributeValue': access_key_id,
            }],
            StartTime=start,
            EndTime=end,
        )

        for page in pages:
            for event in page['Events']:
                detail = json.loads(event.get('CloudTrailEvent', '{}'))
                region = detail.get('awsRegion', region_name)
                source_ip = detail.get('sourceIPAddress', 'unknown')
                event_name = event['EventName']
                timestamp = event['EventTime'].isoformat()
                seen_ips.add(source_ip)

                kinds = []
                # Calls from regions this CI workload never uses:
                if region not in KNOWN_CI_REGIONS:
                    kinds.append('unexpected_region')
                # Destructive or privilege-escalation operations:
                if event_name in DESTRUCTIVE_EVENTS:
                    kinds.append('destructive_operation')
                # Operations outside the expected event set entirely:
                elif event_name not in NORMAL_CI_EVENTS:
                    kinds.append('unexpected_operation')

                for kind in kinds:
                    anomalies.append(Anomaly(kind, event_name, region, source_ip, timestamp))

    # More distinct source IPs than the runner fleet should produce in one window:
    if len(seen_ips) > MAX_SOURCE_IPS:
        anomalies.append(Anomaly(
            kind='multiple_source_ips',
            event='(multiple)',
            region='(multiple)',
            source_ip=', '.join(sorted(seen_ips)),
            timestamp=end.isoformat(),
        ))

    return anomalies


if __name__ == '__main__':
    key_id = sys.argv[1]
    found = detect_key_usage_anomaly(key_id)
    if found:
        for a in found:
            print(f"ANOMALY [{a.kind}] event={a.event} region={a.region} ip={a.source_ip} at={a.timestamp}")
        sys.exit(1)
    print(f"No anomalies detected for {key_id} in the last 24 hours.")

Run this as a scheduled GitHub Actions workflow, passing each monitored key ID:

name: API Key Usage Anomaly Detection
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  detect-anomalies:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11

      - uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502
        with:
          role-to-assume: arn:aws:iam::123456789012:role/cloudtrail-reader
          aws-region: us-east-1

      - name: Run anomaly detection
        env:
          MONITORED_KEY_ID: ${{ secrets.CI_ARTIFACT_KEY_ID }}
        run: |
          pip install -q boto3
          python scripts/detect_key_anomaly.py "$MONITORED_KEY_ID"

Note that MONITORED_KEY_ID is the access key ID, not the secret access key. Key IDs appear in every CloudTrail record and are not secrets on their own, and the key ID is all that is needed to query CloudTrail for the key’s usage history.

For third-party APIs that do not expose usage logs, enable webhook notifications where available. Stripe sends webhook events when objects change state (charges, refunds, payouts, and so on); read-only calls generate none. A sudden spike in charge.succeeded events at 03:00 UTC from an integration that only runs during business hours is a detectable anomaly if the webhook receiver is watching for it.
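A receiver watching for that pattern could apply a check like the sketch below to each parsed webhook payload (Stripe event objects carry id, type, and a Unix-seconds created timestamp). The watched event types and the 08:00 to 18:00 UTC window are assumptions to replace with the integration’s real schedule, and a production receiver must verify the Stripe-Signature header before trusting any payload.

```python
from datetime import datetime, timezone

# Event types worth watching for an integration that should be quiet at night.
WATCHED_TYPES = {"charge.succeeded", "charge.refunded", "payout.created"}


def off_hours_events(events, start_hour: int = 8, end_hour: int = 18) -> list[str]:
    """Return IDs of watched events created outside [start_hour, end_hour) UTC."""
    flagged = []
    for event in events:
        if event.get("type") not in WATCHED_TYPES:
            continue
        hour = datetime.fromtimestamp(event["created"], tz=timezone.utc).hour
        if not start_hour <= hour < end_hour:
            flagged.append(event["id"])
    return flagged
```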

6. Never Bake API Keys into Docker Build Context

The correct approach is to inject credentials at container runtime, not at build time. This is not a style preference — it is the difference between a key that exists in a recoverable layer and a key that exists only in memory during container execution.

# WRONG: key is in the image layer permanently
ENV STRIPE_KEY=rk_live_xxxxxxxxxxxxxxxx

# WRONG: ARG values are visible in docker history --no-trunc
ARG NPM_TOKEN
RUN echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > ~/.npmrc && \
    npm install && \
    rm ~/.npmrc
# The RUN layer contains the token in its exec metadata even after rm.

# CORRECT: install dependencies without embedding the token in any layer.
# Use BuildKit's --secret mount (Docker 18.09 and later, with BuildKit enabled).
# In a real Dockerfile the next line is a parser directive and must be the
# very first line of the file:
# syntax=docker/dockerfile:1
FROM node:20-slim AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN --mount=type=secret,id=npm_token \
    echo "//registry.npmjs.org/:_authToken=$(cat /run/secrets/npm_token)" > ~/.npmrc && \
    npm ci && \
    rm ~/.npmrc
# The secret is mounted for this RUN command only and is NOT stored in the layer.

FROM node:20-slim
WORKDIR /app
COPY --from=build /app/node_modules ./node_modules
COPY src/ ./src/
# No token present in any layer of the final image.

Build with:

DOCKER_BUILDKIT=1 docker build \
  --secret id=npm_token,env=NPM_TOKEN \
  -t myorg/myapp:latest .

The --mount=type=secret directive mounts the secret at /run/secrets/<id> on a temporary filesystem for the duration of the RUN instruction only, so it is never written into the layer. docker history --no-trunc on the resulting image shows no token in the command metadata, and trivy image --scanners secret should find nothing.

Verify an existing image does not contain secrets before pushing:

# Check image history for any environment variables or ARG values containing key patterns:
docker history --no-trunc myimage:latest | \
  grep -iE "(api_key|secret_key|token|password|auth)[[:space:]]*[=:][[:space:]]*[A-Za-z0-9+/_-]{16,}"

# Trivy: deep scan including intermediate layers:
trivy image --scanners secret --severity HIGH,CRITICAL myimage:latest

Expected Behaviour

TruffleHog finding a verified key in git history produces JSON output like:

{
  "SourceMetadata": {
    "Data": {
      "Git": {
        "file": ".env",
        "commit": "a3f9c1d8b2e4f6a1c3d5e7f9b2a4c6d8e0f2a4b6",
        "line": 1,
        "repository": "https://github.com/myorg/myrepo.git"
      }
    }
  },
  "DetectorName": "Stripe",
  "Verified": true
}

Verified: true means TruffleHog successfully authenticated with the Stripe API using the found key. This key is live. Immediate rotation is the only correct response — not deletion of the commit, not a .gitignore update.

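Trivy finding a secret in a Docker layer produces output along these lines (abridged; the exact layout depends on the Trivy version and output format):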

SECRET  HIGH  STRIPE_SECRET_KEY  secret/stripe.key  [REDACTED]
Results for layer sha256:4a2b8c...

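The output identifies the layer digest, the file path within the layer, and the secret type. To map a finding to the RUN or COPY instruction that created the layer, use Trivy’s JSON output (--format json), which records the creating instruction in the finding’s Layer.CreatedBy field, or compare the layer’s position in docker inspect’s RootFS.Layers with the non-empty entries in docker history.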

The GitHub Actions rotation check failing produces a workflow annotation visible in the Actions UI:

Error: Access key AKIAIOSFODNN7EXAMPLE is 127 days old (SLA: 90 days). Rotate immediately.

The workflow fails with exit code 1, and any job that declares needs: on it does not run. A scheduled run cannot gate pull requests by itself; running the same check on pull_request events and marking it as a required status check makes it block merges until remediated. This is the correct behaviour: an unrotated key should block pipeline progress just as a failing test does.

Trade-offs

OIDC over static keys eliminates the rotation problem entirely for supported services. The trade-off is infrastructure: an OIDC trust relationship requires IAM role configuration, and not all CI providers or self-hosted runner environments support OIDC. Vault-based short-lived credentials address the gap for self-hosted or on-premises pipelines, but add operational dependency on Vault availability.

A 90-day rotation SLA is a common convention and a reasonable starting point, but it is not derived from NIST SP 800-63B: that guideline covers user authenticators and advises against forcing arbitrary periodic changes of memorized secrets. The real number depends on key scope: a key with AdministratorAccess warrants a 30-day SLA; a read-only key for a non-sensitive service might tolerate 180 days. Resistance to rotation SLAs usually comes from rotation being manual; automated rotation via Secrets Manager or Vault makes the SLA operationally invisible.

Scanning git history with TruffleHog on a large repository with five years of commits can take tens of minutes, which makes a full scan on every pull request impractical. The correct approach is to run the full historical scan once (as an audit), then scan only the commits each PR introduces:

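# PR-scoped scan: only the commits this PR adds on top of main (requires a
# checkout with full history, e.g. actions/checkout with fetch-depth: 0).
# Scanning commits rather than files at HEAD also catches a key added and then
# removed within the PR; --fail exits non-zero when verified secrets are found.
trufflehog git file://. --since-commit origin/main --branch HEAD --only-verified --fail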

Docker layer scanning is fast for single images but scales poorly across hundreds of image tags in a large registry. Prioritise scanning images built from Dockerfiles that handle secrets (private package installs, API-authenticated build steps) and images deployed to production environments.

Failure Modes

Rotating a key but not updating all consumers. A key exists in five places: a GitHub Actions secret, a GitLab CI variable for a mirror pipeline, a Jenkins credential, a developer’s local .env, and an Ansible vault file. The rotation updates GitHub Actions. The next day, the Jenkins pipeline fails with 401. The developer’s local .env still contains the old key. The Ansible vault file with the old key is committed in a branch that gets merged three weeks later, overwriting the new key in the infrastructure configuration. The rotation created an inconsistency rather than resolving a risk. Resolution: build a secret inventory before rotating, so that every consumer of every key is known. Searching all related repositories for the key ID (for example, grep -r "AKIAIOSFODNN7EXAMPLE" .) finds consumers that reference it in plain text; encrypted stores such as Ansible Vault files and CI secret stores must be checked with their own tooling.

Creating restricted keys, then silently expanding them. A Stripe restricted key is created with balance:read scope. Two weeks later, a developer needs to trigger a refund from the CI pipeline. Rather than creating a separate restricted key with refund permissions, they edit the existing key to add refunds:write, “just temporarily.” The documentation says balance:read and the CI pipeline uses a key with balance:read,refunds:write. The discrepancy is invisible until the key leaks. Resolution: treat scope expansion as requiring the same approval as creating a new key. Audit restricted-key permissions quarterly on the API keys page of the Stripe Dashboard and compare each key against its documented scope.

Treating git history scanning as a one-time activity. An organisation runs TruffleHog on the existing git history, finds and rotates twelve keys, marks the audit complete, and moves on. Nine months later, a developer commits a Twilio auth token while debugging a staging issue and removes it in the next commit — confident that “we scan for this.” The scan only runs as a one-time audit. No PR check exists. The token persists in history undetected for months. Resolution: run targeted scanning on every pull request as a required status check. The historical scan is an audit; the PR check is the control.

Not monitoring API key usage — discovering compromise from billing. A SendGrid API key used for transactional email in CI tests leaks through a debug log in a build artifact. The key is used by the attacker for eight weeks to send phishing emails at low volume. The first signal is a SendGrid billing spike in the ninth week, followed immediately by a domain reputation alert. By then, 400,000 phishing emails have been sent from the organisation’s SendGrid account and the domain is on blocklists. Resolution: monitor SendGrid usage metrics (emails per day, unique recipient domains, bounce rate) and alert on deviations. Most API providers expose usage metrics — CloudWatch for AWS, Stripe’s usage dashboard, SendGrid activity feeds via webhook. An anomaly threshold should generate an alert hours after compromise, not weeks.
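The resolution above can start as simply as comparing each day’s count with a trailing baseline. A minimal sketch, assuming daily send counts have already been pulled from the provider’s statistics or event feed; the three-sigma rule and the 50-message minimum margin are illustrative thresholds, not provider recommendations.

```python
from statistics import mean, pstdev


def usage_alert(history: list[int], today: int, sigmas: float = 3.0, min_margin: int = 50) -> bool:
    """Alert when today's count exceeds the trailing baseline by a clear margin.

    The margin is `sigmas` standard deviations of the history, but never less
    than `min_margin`, so a very flat history does not alert on ordinary noise.
    """
    if not history:
        raise ValueError("need at least one day of history")
    baseline = mean(history)
    threshold = baseline + max(sigmas * pstdev(history), min_margin)
    return today > threshold
```

A low-and-slow attacker can stay under any volume threshold, so this belongs alongside the recipient-domain and bounce-rate signals, not in place of them.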