Secrets in AI Pipelines: Training Data Credentials, Model Registry Access, and MLOps Secret Sprawl

The Problem

ML pipelines are credential-dense by nature. A single training run touches more external services than most production web applications: a storage bucket for raw data, a feature store for preprocessed inputs, an experiment tracker for metrics, a model registry for artifact storage, a container registry for the training image, a GPU cluster scheduler, and frequently several inference APIs used during evaluation. Every one of these connections requires credentials. The ecosystem has not converged on a unified secret management approach, and the research culture from which MLOps emerged actively resisted operational discipline — notebooks were meant to be exploratory, scripts were meant to run once, credentials were meant to just work.

The result is a credential surface that is both broad and poorly controlled. The surface is not theoretical. It is visible in any git log --all on a shared ML repository.

Training data access — S3, GCS, Azure Blob Storage

Training datasets live in cloud object storage. Accessing them requires credentials. The AWS SDK looks for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables, in ~/.aws/credentials, and in instance/task metadata. The first time a researcher configured their local environment, they ran aws configure, populated ~/.aws/credentials, and moved on. When that researcher’s code ran on a shared GPU server — a common pattern before Kubernetes-based MLOps matured — the credentials followed: ~/.aws/credentials on the server, owned by the researcher’s user account, readable by any process running as that user. If the server runs multiple users’ training jobs (a standard academic cluster configuration), the credential isolation depends entirely on Unix file permissions, which are frequently relaxed to let jobs run without permission errors.

The pattern that produces the most direct key exposure is the notebook path: a researcher creates a notebook locally, calls boto3.client('s3') which reads from ~/.aws/credentials, the notebook works. They upload the notebook to a shared JupyterHub. Other team members can now access their S3 bucket — not because they deliberately shared the credential, but because the running JupyterHub process inherits the user’s home directory.
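The dependence on Unix file permissions described above can be checked mechanically. The following is a minimal sketch, not a complete audit: the function name and the default path list are assumptions chosen for illustration, and it only reports files that group or other users can access.

```python
import stat
from pathlib import Path

# Hypothetical default list; extend it for the tools your team uses.
CREDENTIAL_FILES = ["~/.aws/credentials", "~/.netrc", "~/.kube/config"]


def loose_credential_files(paths=CREDENTIAL_FILES):
    """Return (path, mode) pairs for credential files accessible to group or others."""
    findings = []
    for raw in paths:
        path = Path(raw).expanduser()
        if not path.is_file():
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        # Any bit set in 0o077 means someone other than the owner has access.
        if mode & 0o077:
            findings.append((str(path), oct(mode)))
    return findings


if __name__ == "__main__":
    for path, mode in loose_credential_files():
        print(f"{path} has mode {mode}; expected 0o600")
```

Running this from a login script on a shared server surfaces relaxed permissions early, but it does not help when several people share one Unix account.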

Experiment tracking — MLflow, Weights & Biases, Neptune

wandb login writes the API key to ~/.netrc. The ~/.netrc file has permissions 600 by default, but on shared servers, “by default” is optimistic. W&B’s documentation suggests passing the key directly to wandb.login(key="...") for automation — a pattern that immediately leads to hardcoded keys in training scripts, which are committed to version control. A search for wandb.login(key= across GitHub returns thousands of results, many with keys still present.

MLflow’s tracking server can be configured with HTTP basic auth (MLFLOW_TRACKING_USERNAME, MLFLOW_TRACKING_PASSWORD) or with token auth (MLFLOW_TRACKING_TOKEN). Both appear regularly as hardcoded strings in experiment configuration YAML files. YAML configs are attractive targets because they are committed alongside code and rarely treated as credential carriers.
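Hardcoded values in experiment configs can be flagged with a small line-based check. The sketch below assumes flat "key: value" lines and uses an illustrative list of sensitive key names; it is not a YAML parser and the function name is hypothetical. Values that defer to the environment, such as ${MLFLOW_TRACKING_TOKEN}, are treated as safe.

```python
import re

# Key names that usually carry a credential. Illustrative, not exhaustive.
SENSITIVE_KEY = re.compile(r"(password|passwd|secret|token|api[_-]?key)", re.IGNORECASE)
# Values that point at the environment or a template instead of a literal.
REFERENCE_VALUE = re.compile(r"^(\$\{[^}]+\}|\$[A-Za-z_][A-Za-z0-9_]*|\{\{.*\}\})$")
KEY_VALUE_LINE = re.compile(r"^\s*-?\s*([A-Za-z0-9_.-]+)\s*:\s*(.+?)\s*$")


def hardcoded_credentials(yaml_text):
    """Return (line_number, key) for sensitive keys assigned a literal value."""
    findings = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue
        match = KEY_VALUE_LINE.match(line)
        if not match:
            continue
        key = match.group(1)
        value = match.group(2).strip("'\"")
        if SENSITIVE_KEY.search(key) and value and not REFERENCE_VALUE.match(value):
            findings.append((number, key))
    return findings


sample = """\
tracking:
  uri: https://mlflow.example.internal
  MLFLOW_TRACKING_PASSWORD: hunter2
  MLFLOW_TRACKING_TOKEN: ${MLFLOW_TRACKING_TOKEN}
"""
print(hardcoded_credentials(sample))
```

Only the literal password on line 3 of the sample is reported. Multi-line YAML values and nested structures are out of scope for this sketch.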

Model registry access — Hugging Face, MLflow Registry, Vertex AI

Hugging Face tokens (HF_TOKEN or the older HUGGINGFACE_HUB_TOKEN) are stored by huggingface-hub in ~/.cache/huggingface/token (older releases used ~/.huggingface/token) after huggingface-cli login. Tokens with write permissions to an organisation’s model namespace allow pushing arbitrary model weights: an attacker who obtains such a token can replace a private model’s weights with a backdoored version. The token is also commonly found in Jupyter notebooks. Researchers call notebook_login() inside a cell, which prompts for the token and stores it, but the token value is sometimes visible in a later cell’s output if they accidentally call print(os.environ['HF_TOKEN']) to debug.

GPU compute access — Kubernetes, SLURM

Kubernetes kubeconfig files include cluster API server credentials and often include client certificates or bearer tokens with broad cluster permissions. ML engineers frequently copy their personal kubeconfig to a training script directory to allow the script to submit jobs programmatically. The kubeconfig then gets committed to the repository because it is sitting in the project directory and git add . is faster than git add train.py.

SLURM job scripts often reference SSH keypairs used for multi-node communication. The private key is referenced by path in the job script, and the job script is in the repository. The private key itself may or may not be committed; often it is not, but the path reference makes it easy to infer where to look.

Inference API keys — OpenAI, Anthropic, and the cost dimension

Inference API keys are expensive credentials in a literal sense: a stolen OPENAI_API_KEY runs up charges immediately and irreversibly. The legacy key format (sk- prefix, 51 characters) is widely known, and newer project-scoped keys with an sk-proj- prefix are just as recognisable; both are actively scanned for by threat actors using automated GitHub search. These keys appear in evaluation scripts, benchmark harnesses, and model comparison notebooks. The evaluation context is particularly risky: an evaluation script might call a model API thousands of times, meaning a stolen key that reproduces the evaluation logic produces thousands of billable API calls before the theft is noticed.

The Jupyter notebook output problem

This is the mechanism most specific to ML workflows and the one that most completely bypasses standard CI secret detection. Jupyter notebooks store execution state as JSON inside the .ipynb file. Every cell’s output — including print() statements, tracebacks, and the REPL-style display of values — is serialised into the notebook file. A notebook cell that contains:

import os
print(os.environ.get('AWS_ACCESS_KEY_ID'))

and was executed with the key set will contain the key value in the outputs field of that cell in the JSON. git diff shows the change as a long JSON diff in which outputs sit among escaped source lines and metadata, which is difficult to read; spotting the exposure in git log -p on an .ipynb file requires understanding the JSON structure. Most developers do not read notebook diffs at this level. The output persists in git log --all indefinitely unless the history is rewritten, and rewriting history on a shared repository is operationally expensive.
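To make the storage mechanism concrete, the sketch below builds a minimal notebook shaped like the nbformat v4 JSON that git stores, then searches its outputs for a known secret value. The helper name is hypothetical, and the key is the placeholder value from AWS documentation. Note that the cell source contains no secret; only the saved output does.

```python
def _as_text(value):
    # nbformat stores text either as one string or as a list of strings.
    if isinstance(value, list):
        return "".join(value)
    return value or ""


def outputs_containing(notebook, secrets):
    """Return indices of cells whose stored outputs contain any given secret value."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        for output in cell.get("outputs", []):
            text = "\n".join([
                _as_text(output.get("text")),
                _as_text(output.get("data", {}).get("text/plain")),
                _as_text(output.get("evalue")),
                _as_text(output.get("traceback")),
            ])
            if any(secret and secret in text for secret in secrets):
                hits.append(index)
                break
    return hits


notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["import os\n", "print(os.environ.get('AWS_ACCESS_KEY_ID'))"],
            "outputs": [
                {"output_type": "stream", "name": "stdout",
                 "text": ["AKIAIOSFODNN7EXAMPLE\n"]}
            ],
        }
    ],
}

print(outputs_containing(notebook, ["AKIAIOSFODNN7EXAMPLE"]))  # [0]
```

Passing the current values of sensitive environment variables as the secrets list turns this into a local check that does not depend on knowing each provider's key format.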

The aggregate picture: a typical ML team’s codebase contains credentials in .env files that are not in .gitignore, Jupyter notebook output cells, YAML experiment configurations committed alongside code, shell scripts that export keys before calling training commands, and requirements.txt files with git+https://token@github.com/... URLs. None of these are caught by standard secret scanners that focus on CI configuration files.
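The requirements.txt case in the list above is easy to detect because the credential sits in the URL's user-info part. A minimal sketch using only the standard library follows; the function name is an assumption, and it only inspects whitespace-separated http(s) URLs, so it will miss forms such as name@git+https://... written without spaces.

```python
from urllib.parse import urlsplit

VCS_PREFIXES = ("git+", "hg+", "svn+", "bzr+")


def urls_with_credentials(requirements_text):
    """Return (line_number, host) for http(s) URLs that embed a username or password."""
    findings = []
    for number, line in enumerate(requirements_text.splitlines(), start=1):
        for word in line.split():
            # pip VCS requirements look like git+https://...; drop the prefix before parsing.
            candidate = word.split("+", 1)[1] if word.startswith(VCS_PREFIXES) else word
            if "://" not in candidate:
                continue
            try:
                parts = urlsplit(candidate)
            except ValueError:
                continue
            if parts.scheme in ("http", "https") and (parts.username or parts.password):
                # Report the host only, never the credential itself.
                findings.append((number, parts.hostname))
    return findings


sample = """\
numpy==1.26.4
git+https://EXAMPLETOKEN@github.com/acme/private-lib.git@v1.0
--index-url https://pypi.org/simple
"""
print(urls_with_credentials(sample))
```

The sample reports line 2 with host github.com. SSH-style URLs are deliberately ignored because their user part (typically git) is not a secret.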

Threat Model

Notebook output leaks W&B key to all repository contributors. A researcher commits a notebook with WANDB_API_KEY printed in an output cell. The team’s repository has 15 contributors. All 15 now have the key. The key has full account permissions — it can read all team runs, delete experiments, modify alerts, and depending on the W&B plan, access billing. The researcher who originally set up the account leaves the company; their access is revoked, but the key in the repository is not the same credential. It persists.

AWS credentials in training environment variables leak into CloudWatch or Kubernetes event logs. A training job fails with an uncaught exception. The exception handler logs the full environment (a common debugging pattern: logging.error("Environment: %s", os.environ)). CloudWatch Logs now contains AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in a log group that has 90-day retention and is readable by anyone with logs:FilterLogEvents on the account. If that IAM permission is broadly granted — common in research environments where “everyone needs to see their logs” — any team member can retrieve the key.
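The environment-dump pattern can be kept for debugging if sensitive values are masked before they reach the logger. The sketch below is one possible approach, assuming that variable names are a usable signal; the name list and function names are illustrative. A secret stored under an innocuous name, for example inside a connection URL, would not be caught.

```python
import logging
import os
import re

# Illustrative name fragments; extend for your environment.
SENSITIVE_NAME = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL)", re.IGNORECASE)

logger = logging.getLogger(__name__)


def redacted_environ(environ=None):
    """Return a copy of the environment with sensitive-looking values masked."""
    source = os.environ if environ is None else environ
    return {
        name: "***REDACTED***" if SENSITIVE_NAME.search(name) else value
        for name, value in source.items()
    }


def log_failure_context():
    # Replaces logging.error("Environment: %s", os.environ)
    logger.error("Environment: %s", redacted_environ())
```

Calling log_failure_context() from the exception handler preserves the debugging value of the dump (paths, CUDA settings, bucket names) without writing AWS_SECRET_ACCESS_KEY to a 90-day log group.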

Hugging Face token with write permissions committed to notebook. A token with write access to an organisation’s model namespace is committed in a notebook cell. An attacker with read access to the repository (a disgruntled contractor, a compromised team member account, or a public leak) uses the token to push a modified version of a widely-used internal model. The modified weights behave identically on standard benchmarks but produce subtly biased outputs for specific input distributions. Detection requires model diff tooling that most teams do not have.

Inference API key stolen from evaluation script. An OPENAI_API_KEY is hardcoded in eval/run_benchmarks.py. A threat actor scanning GitHub finds it via code search. Before the team notices, the actor runs 500,000 token API calls, generating a $1,500 invoice. The key is revoked, but the evaluation script is now broken and the team does not know when the key was first captured — meaning all evaluation results produced since the key was committed must be treated as potentially contaminated (an attacker who had the key could have injected responses into logged evaluation runs, though this is a high-sophistication scenario).

Kubeconfig with cluster-admin committed to training directory. An ML engineer adds their kubeconfig to a project directory to enable automated job submission. They add it to .gitignore locally but the ignore rule uses a relative path that does not match the file’s location in a subdirectory. git add . includes it. The kubeconfig carries a service account token with cluster-admin binding. An attacker retrieves it from the repository and has full access to the Kubernetes cluster running the GPU workloads — compute that is expensive to provision and useful for cryptomining or as a launching point for lateral movement.

Hardening Configuration

1. nbstripout: Strip Notebook Outputs Before Commit

nbstripout operates as a git filter: when git stages a notebook, the filter strips the output cells before hashing and storing the blob. The notebook on disk retains outputs (the researcher’s interactive experience is unchanged); the version stored in git never contains them.

pip install nbstripout
cd your-ml-repo
nbstripout --install  # Installs as git filter for this repository

# Verify the installation:
grep -A3 'filter "nbstripout"' .git/config
# Should show something like (the clean command may include a full Python path):
# [filter "nbstripout"]
#     clean = nbstripout
#     smudge = cat

If the team uses pre-commit, the hook approach is more portable — it does not require each developer to run nbstripout --install in every clone:

# .pre-commit-config.yaml
repos:
- repo: https://github.com/kynan/nbstripout
  rev: 0.7.1
  hooks:
  - id: nbstripout
    files: \.ipynb$
    args:
      - '--extra-keys'
      - 'metadata.colab metadata.accelerator cell.metadata.id'

pre-commit install

The --extra-keys argument strips Colab-specific metadata that otherwise diffs on every open/close cycle. Run pre-commit run --all-files once to strip outputs from all existing notebooks already in the repository — this does not rewrite history, but it prevents future commits from adding new outputs.

For notebooks that already have committed outputs containing credentials, the history must be rewritten:

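# Remove a specific notebook from history using git-filter-repo
pip install git-filter-repo
# Keep a stripped copy outside the repository; the rewrite deletes the file
nbstripout notebooks/train_experiment.ipynb
cp notebooks/train_experiment.ipynb /tmp/train_experiment.ipynb
# --invert-paths drops the named path; without it filter-repo keeps ONLY that path
git filter-repo --invert-paths --path notebooks/train_experiment.ipynb --force
# Then re-add the stripped version:
mkdir -p notebooks
cp /tmp/train_experiment.ipynb notebooks/train_experiment.ipynb
git add notebooks/train_experiment.ipynb
git commit -m "chore: re-add notebook with outputs stripped"
# filter-repo removes the origin remote; restore it before pushing
git remote add origin <repository-url>
git push --force origin main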

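All collaborators must re-clone after a history rewrite. The old history remains in any existing clones until those are discarded, and possibly in forks and hosted caches of the old commits, so treat the credential as compromised and revoke it regardless of the rewrite.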

2. Use IRSA Instead of Static AWS Keys for Training Jobs

IAM Roles for Service Accounts (IRSA) on EKS allows a Kubernetes pod to assume an IAM role without any static credentials. The pod receives a projected service account token, which the AWS SDK exchanges for short-lived STS credentials automatically. No AWS_ACCESS_KEY_ID appears in the pod spec, environment variables, or logs.

First, associate a service account with an IAM role. The IAM role’s trust policy must allow the cluster’s OIDC provider to issue credentials for the specific service account:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub":
          "system:serviceaccount:ml-training:ml-training-sa",
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:aud":
          "sts.amazonaws.com"
      }
    }
  }]
}

The IAM role carries a policy granting read-only access to the training bucket:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-training-data-bucket",
      "arn:aws:s3:::my-training-data-bucket/*"
    ]
  }]
}

The Kubernetes manifests reference the service account by name; no credentials appear anywhere in the spec:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-training-sa
  namespace: ml-training
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ml-training-s3-reader
---
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-run-42
  namespace: ml-training
spec:
  template:
    spec:
      serviceAccountName: ml-training-sa
      restartPolicy: Never
      containers:
      - name: trainer
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-training:v1.4.2
        env:
        - name: S3_TRAINING_BUCKET
          value: my-training-data-bucket
        - name: S3_CHECKPOINT_PREFIX
          value: checkpoints/run-42/
        # No AWS credentials here. boto3 finds the projected token at
        # /var/run/secrets/eks.amazonaws.com/serviceaccount/token automatically.
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "32Gi"

What appears in CloudTrail under this configuration: the userIdentity field shows AssumedRole with roleArn: arn:aws:iam::123456789012:role/ml-training-s3-reader and a sessionName like botocore-session-1746748800. There is no IAM user identity, no access key ID. If the job is compromised, you can see exactly which role was used and revoke it without rotating a key.

Local development does not benefit from IRSA. Developers running training scripts locally must use aws sso login or a short-lived profile from AWS Identity Center. The separation is intentional: local credentials should have limited scope and expire. The CI and Kubernetes training path should never use static keys.
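Because boto3 consults environment variables before web identity credentials, a stray AWS_ACCESS_KEY_ID in the pod would silently take precedence over IRSA. A startup guard can make that failure loud. The sketch below is an assumption about how such a guard might look; the function name is hypothetical, while AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are the variables EKS injects into IRSA-enabled pods.

```python
import os

STATIC_KEY_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
# Variables injected by EKS for pods whose service account has a role annotation.
IRSA_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def check_aws_credential_source(environ=None):
    """Raise if static keys are present or IRSA is not wired up; return the role ARN."""
    env = os.environ if environ is None else environ
    static = [name for name in STATIC_KEY_VARS if env.get(name)]
    if static:
        raise RuntimeError(
            "Static AWS keys present in environment: " + ", ".join(static)
        )
    missing = [name for name in IRSA_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("IRSA variables missing: " + ", ".join(missing))
    return env["AWS_ROLE_ARN"]
```

Calling check_aws_credential_source() at the top of the training entry point, before any boto3 client is created, keeps a misconfigured job from running for hours on the wrong identity. It is intended for the cluster path only; local runs with SSO profiles would not pass it.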

3. Experiment Tracking — Scoped Tokens and No Hardcoding

W&B API keys have a single scope: full account access. There is no concept of a read-only key or a key restricted to a specific project. This means a leaked W&B key is always a high-impact event. Until W&B introduces fine-grained tokens, the mitigation is rotation frequency and environment-only injection — never hardcode, never write to ~/.netrc in automated contexts.

# WRONG — common in training scripts
import wandb
wandb.login(key="abc123xxxxxxxxxxxxxx")

# WRONG — reads from ~/.netrc on shared servers
wandb.login()

# CORRECT — key injected from environment, which is sourced from a secret manager
import os
import wandb

api_key = os.environ.get("WANDB_API_KEY")
if not api_key:
    raise RuntimeError(
        "WANDB_API_KEY not set. Set it from your secret manager before running."
    )
wandb.login(key=api_key)

In a Kubernetes training job, inject the key from a secret, not from a ConfigMap or directly in the spec:

# Create the secret from Vault or AWS Secrets Manager via External Secrets Operator,
# not by manually running kubectl create secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ml-api-keys
  namespace: ml-training
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: ml-api-keys
    creationPolicy: Owner
  data:
  - secretKey: WANDB_API_KEY
    remoteRef:
      key: secret/mlops/experiment-tracking
      property: wandb_api_key
  - secretKey: HF_TOKEN
    remoteRef:
      key: secret/mlops/model-registry
      property: hf_token
---
# Reference in job spec:
containers:
- name: trainer
  image: ...
  envFrom:
  - secretRef:
      name: ml-api-keys

MLflow tracking credentials follow the same pattern. If the tracking server uses HTTP basic auth, MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD come from a secret reference, never from a hardcoded YAML config value.

4. Hugging Face Token Scoping and Vault Storage

Hugging Face’s fine-grained token system (released in 2024) supports restricting a token to specific repositories and to read or write operations. Create tokens through the Hugging Face API or the settings UI with the minimum required scope:

# Read-only token for pulling private model weights during training
# Scope: read access to your-org/private-base-model only

# Write token for pushing fine-tuned models to the registry
# Scope: write access to your-org/fine-tuned-* only

# Do NOT create tokens with full account write access for automated pipelines.
# A full-write token compromised in a training script = attacker can push to any org repo.

Store both tokens in Vault and retrieve them at runtime, keeping them out of source files and out of the git credential store:

import os
import hvac
from huggingface_hub import login

def get_hf_token(vault_path: str) -> str:
    client = hvac.Client(url=os.environ["VAULT_ADDR"])
    # Kubernetes auth: pod's service account token authenticates to Vault
    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
        jwt = f.read()
    client.auth.kubernetes.login(
        role="ml-training",
        jwt=jwt
    )
    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")
    secret = client.secrets.kv.v2.read_secret_version(
        path=vault_path,
        mount_point="secret"
    )
    return secret["data"]["data"]["token"]

hf_token = get_hf_token("mlops/model-registry")
login(
    token=hf_token,
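    add_to_git_credential=False  # Do not copy the token into the git credential helper
)

add_to_git_credential=False keeps huggingface-hub from copying the token into the git credential helper configured on the training node, where it would persist across jobs. False is the default in current huggingface_hub releases; passing it explicitly protects against a wrapper or a later edit turning it on. Note that login() still saves the token to the local Hugging Face token file, so on shared nodes either run jobs with an ephemeral home directory or skip login() and pass token=hf_token directly to calls such as snapshot_download().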

5. Pre-commit Hook: ML-Specific Credential Pattern Detection

Standard secret detection tools (truffleHog, detect-secrets) scan for well-known credential patterns in text files. They do not typically scan Jupyter notebook output cells, which are stored as JSON values within a larger structure. This hook adds both text-level scanning and notebook-output-specific scanning:

#!/usr/bin/env bash
# .git/hooks/pre-commit (or managed via pre-commit framework)
set -euo pipefail

# Patterns covering ML-specific credentials
PATTERNS=(
  '\bsk-(proj-)?[a-zA-Z0-9_-]{40,}'             # OpenAI API key (legacy and project-scoped)
  'hf_[a-zA-Z0-9]{34,}'                          # Hugging Face token
  'AKIA[0-9A-Z]{16}'                             # AWS access key ID
  'wandb\.login\s*\(\s*key\s*=\s*["\x27][^"\x27]{10}' # Hardcoded W&B login
  '[Aa]nthrop[^=]*[=:]\s*["\x27]?sk-ant-[a-zA-Z0-9]'  # Anthropic key
  'neptune\.init[^)]*api_token\s*=\s*["\x27][^"\x27]{20}' # Neptune token
)

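STAGED_FILES=$(git diff --cached --name-only)
# Capture the staged diff once. Piping git diff straight into grep -q under
# pipefail can report failure (SIGPIPE) even when grep matched.
STAGED_DIFF=$(git diff --cached)

# 1. Scan all staged text files for credential patterns (grep -P needs GNU grep)
for pattern in "${PATTERNS[@]}"; do
  if grep -qP -- "$pattern" <<<"$STAGED_DIFF"; then
    echo "BLOCKED: credential pattern detected in staged diff"
    echo "  Pattern: $pattern"
    echo "  Run: git diff --cached | grep -P '$pattern' to locate"
    exit 1
  fi
done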

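# 2. Scan Jupyter notebook output cells specifically
# The scanner is passed to python3 with -c so that stdin stays free for the
# notebook JSON; with "python3 - <<EOF" the heredoc would replace the pipe.
NB_SCANNER=$(cat <<'PYEOF'
import json
import re
import sys

CREDENTIAL_PATTERNS = [
    r'\bsk-(proj-)?[a-zA-Z0-9_-]{40,}',  # OpenAI (legacy and project-scoped)
    r'hf_[a-zA-Z0-9]{34,}',              # Hugging Face
    r'AKIA[0-9A-Z]{16}',                 # AWS access key
    r'sk-ant-[a-zA-Z0-9\-]{80,}',        # Anthropic
]


def as_text(value):
    # nbformat stores text either as one string or as a list of strings
    if isinstance(value, list):
        return ''.join(value)
    return value or ''


try:
    nb = json.load(sys.stdin)
except json.JSONDecodeError:
    sys.exit(0)

for cell in nb.get('cells', []):
    for output in cell.get('outputs', []):
        # Collect all text content from output, including tracebacks
        text = '\n'.join([
            as_text(output.get('text')),
            as_text(output.get('data', {}).get('text/plain')),
            as_text(output.get('ename')),
            as_text(output.get('evalue')),
            as_text(output.get('traceback')),
        ])
        for pattern in CREDENTIAL_PATTERNS:
            if re.search(pattern, text):
                print(f"Credential pattern '{pattern}' found in notebook output cell")
                sys.exit(1)

sys.exit(0)
PYEOF
)

while IFS= read -r nb; do
  [ -n "$nb" ] || continue
  # Skip notebooks that are staged for deletion
  if ! git ls-files --cached --error-unmatch "$nb" >/dev/null 2>&1; then
    continue
  fi

  # Testing the substitution inside "if" keeps set -e from aborting before the message prints
  if ! FOUND=$(git show ":$nb" | python3 -c "$NB_SCANNER"); then
    echo "BLOCKED: credential found in notebook output in $nb"
    echo "  $FOUND"
    echo "  Run: nbstripout $nb && git add $nb"
    exit 1
  fi
done < <(grep '\.ipynb$' <<<"$STAGED_FILES" || true)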

exit 0

Make the hook executable and install it:

chmod +x .git/hooks/pre-commit

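For team-wide enforcement via pre-commit, commit the same script to the repository as .githooks/ml-credential-scan.sh and reference it as a local hook: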

# .pre-commit-config.yaml — add alongside nbstripout
- repo: local
  hooks:
  - id: ml-credential-scan
    name: ML credential pattern scan
    language: script
    entry: .githooks/ml-credential-scan.sh
    stages: [commit]
    pass_filenames: false

6. Unified Secret Access Pattern for Training Scripts

Rather than having each training script implement its own secret retrieval, centralise the pattern in a module that handles both local development (AWS Secrets Manager) and cluster execution (Vault via Kubernetes auth):

# ml_secrets.py
"""
Centralised secret retrieval for ML pipeline components.
Supports two backends: AWS Secrets Manager (for local dev and CI)
and HashiCorp Vault with Kubernetes auth (for cluster training jobs).
"""
import json
import logging
import os
from functools import lru_cache
from pathlib import Path
from typing import Any

logger = logging.getLogger(__name__)


def _is_kubernetes_environment() -> bool:
    """Detect whether we're running inside a Kubernetes pod."""
    return Path("/var/run/secrets/kubernetes.io/serviceaccount/token").exists()


@lru_cache(maxsize=None)
def _get_vault_client():
    import hvac
    client = hvac.Client(url=os.environ["VAULT_ADDR"])
    token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    with open(token_path) as f:
        jwt = f.read()
    client.auth.kubernetes.login(
        role=os.environ.get("VAULT_ROLE", "ml-training"),
        jwt=jwt
    )
    if not client.is_authenticated():
        raise RuntimeError("Vault Kubernetes authentication failed")
    return client


@lru_cache(maxsize=None)
def _get_aws_secrets_client():
    import boto3
    return boto3.client(
        "secretsmanager",
        region_name=os.environ.get("AWS_REGION", "us-east-1")
    )


@lru_cache(maxsize=32)
def get_secret(name: str) -> dict[str, Any]:
    """
    Retrieve a named secret from the appropriate backend.

    In Kubernetes: reads from Vault at secret/mlops/{name}
    Elsewhere: reads from AWS Secrets Manager at mlops/{name}

    Results are cached for the process lifetime — safe for training jobs
    (single run, short-lived). Do not use this in long-lived servers
    without a cache TTL.
    """
    if _is_kubernetes_environment():
        client = _get_vault_client()
        response = client.secrets.kv.v2.read_secret_version(
            path=f"mlops/{name}",
            mount_point="secret"
        )
        return response["data"]["data"]
    else:
        client = _get_aws_secrets_client()
        response = client.get_secret_value(SecretId=f"mlops/{name}")
        return json.loads(response["SecretString"])


def configure_experiment_tracking() -> None:
    """Set up W&B and MLflow from secrets. Call once at training script entry."""
    creds = get_secret("experiment-tracking")

    import wandb
    wandb.login(key=creds["wandb_api_key"])
    logger.info("W&B configured (key: ...%s)", creds["wandb_api_key"][-4:])

    import mlflow
    mlflow.set_tracking_uri(creds["mlflow_tracking_uri"])
    os.environ["MLFLOW_TRACKING_USERNAME"] = creds["mlflow_username"]
    os.environ["MLFLOW_TRACKING_PASSWORD"] = creds["mlflow_password"]


def configure_model_registry() -> None:
    """Configure Hugging Face hub access from secrets."""
    creds = get_secret("model-registry")
    from huggingface_hub import login
    login(token=creds["hf_token"], add_to_git_credential=False)
    logger.info("Hugging Face hub configured")


def get_inference_api_key(provider: str) -> str:
    """
    Retrieve an inference API key by provider name.
    provider: 'openai', 'anthropic', 'cohere', etc.
    """
    creds = get_secret("inference-apis")
    key = creds.get(f"{provider}_api_key")
    if not key:
        raise KeyError(f"No API key configured for provider: {provider}")
    return key

Usage in a training script is then a single import and a setup call at the entry point — no credential handling in the training logic itself:

# train.py
from ml_secrets import configure_experiment_tracking, configure_model_registry

def main():
    configure_experiment_tracking()
    configure_model_registry()
    # ... training logic, no credentials anywhere below this point

Expected Behaviour

When nbstripout is correctly installed as a git filter, committing a notebook that has output cells produces a clean diff: only the cell source changes appear, not the output JSON. Running git show HEAD:notebooks/train.ipynb | python3 -c "import json,sys; nb=json.load(sys.stdin); print(sum(len(c.get('outputs',[])) for c in nb['cells']))" returns 0, meaning no outputs are stored for that notebook in the HEAD commit. Earlier commits are unaffected and still need the history rewrite if they contain credentials.

When the pre-commit credential hook blocks a commit, the output is explicit about which pattern matched and, for notebooks, which file to fix. The remediation instruction (nbstripout <file> && git add <file>) appears inline. The hook exits non-zero, which prevents the commit from completing unless hooks are skipped with git commit --no-verify.

When a training job runs with IRSA, CloudTrail entries for S3 operations show "userIdentity": {"type": "AssumedRole", "principalId": "AROAEXAMPLEID:botocore-session-...", "arn": "arn:aws:sts::123456789012:assumed-role/ml-training-s3-reader/botocore-session-..."}. There is no accessKeyId identifying a long-lived IAM user credential. If the role is over-permissioned or the training job is compromised, revocation is a single aws iam delete-role-policy or detach operation — no key rotation, no audit of where the key was stored.

When ml_secrets.py initialises Vault authentication in a Kubernetes pod, a successful login produces a short-lived Vault token valid for the TTL configured on the Kubernetes auth role (typically one hour for training jobs). The @lru_cache on get_secret means subsequent calls within the same process hit the in-memory cache, not the network. If Vault is unreachable at startup, the training job fails immediately with a clear error — the alternative, proceeding without credentials and failing mid-run after expensive GPU time, is worse.

Trade-offs

IRSA for training jobs eliminates static keys at the cost of EKS dependency. Teams running training on bare-metal GPU servers, GCP, or non-Kubernetes schedulers (SLURM, Ray cluster without Kubernetes) cannot use IRSA directly. GCP has an equivalent in Workload Identity Federation; SLURM and bare-metal environments have no equivalent without a separate secret injection mechanism. Local developers still need credentials: the correct local pattern is AWS IAM Identity Center with short-lived SSO credentials, not a permanent ~/.aws/credentials entry. IRSA covers only the automated training path.

nbstripout strips all output, including useful visualisations. Teams that commit notebooks specifically to share training curves, confusion matrices, or other outputs with reviewers will find their workflow disrupted. The correct resolution is to export those outputs to a separate artefact store (W&B, MLflow, or even as committed PNG files) rather than relying on notebook output cells for communication. Teams that resist nbstripout on these grounds are accepting a permanent credential leak risk to preserve a convenience that has better alternatives.

Vault as a training script dependency adds operational complexity. Researchers who run training scripts outside of Kubernetes — on their laptops, on university HPC clusters, on cloud instances they ssh into — cannot use the Kubernetes auth path. The ml_secrets.py module above handles this by falling back to AWS Secrets Manager, but that fallback requires local AWS credentials to be properly configured. In practice, mixed environments mean the secret management path is different depending on where the script runs, which increases the chance of a developer reverting to hardcoded credentials because “nothing works on my machine.”

Pre-commit hooks can be bypassed. git commit --no-verify skips all pre-commit hooks. In a research environment where fast iteration is culturally prioritised, --no-verify will be used. The pre-commit hook layer is a friction tool, not an enforcement boundary. The enforcement boundary is a CI check — running the same credential scan in CI against every pushed commit, where --no-verify has no effect.
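A CI-side scan can be as small as a script that receives changed file paths and exits non-zero on a match. The sketch below reuses two of the patterns from the hook in section 5; the script layout, pattern names, and the way CI supplies the file list are assumptions, not a prescribed setup.

```python
import re
import sys
from pathlib import Path

# Two patterns taken from the pre-commit hook; add the rest as needed.
PATTERNS = {
    "aws-access-key-id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "wandb-hardcoded-login": re.compile(
        r"""wandb\.login\s*\(\s*key\s*=\s*["'][^"']{10}"""
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) for every matching line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


def main(paths):
    """Scan each file and return 1 if any credential pattern was found."""
    failed = False
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # deleted or unreadable files are skipped
        for number, name in scan_text(text):
            print(f"{path}:{number}: {name}")
            failed = True
    return 1 if failed else 0


if __name__ == "__main__":
    exit_code = main(sys.argv[1:])
    if exit_code:
        sys.exit(exit_code)
```

In CI the script would typically be given the files changed between the target branch and the pushed commit. The output names the pattern and location but never prints the matched text, so the CI log does not become a second leak.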

Failure Modes

nbstripout installed globally but not as a git filter for the repository. pip install nbstripout makes the command available. nbstripout --install in the repository root installs the git filter. If the second step is not performed, or if the repository is re-cloned without running the install step, the filter is absent and notebook outputs commit normally. Per-repository filter installation is fragile at scale. The pre-commit approach, with the configuration committed to the repository and enforced via CI, is more reliable because it requires no per-clone setup beyond pre-commit install.

IRSA configured for training jobs but development machines use long-lived keys in ~/.aws. The training path is secured; the development path is not. Credentials used to develop training scripts, test data loading, and debug jobs are still long-lived keys. When a developer’s laptop is lost or their account is compromised, the attacker has credentials with permissions scoped to whatever the developer was testing — which is often close to production training data access. Development credentials need separate IAM policies with narrower scope and shorter maximum session durations.

W&B API key with full account permissions stored in the secret manager. The secret manager is correctly configured and the key is retrieved properly at runtime. But the key itself grants full account access — billing, user management, all projects, all teams. When the key is rotated (automated monthly rotation), the old key is invalidated. But if an attacker exfiltrated the key before rotation, they had full account access for up to 30 days. W&B’s audit log should be reviewed for anomalous API usage — runs created outside of known training windows, unusual geographic access patterns, API calls to admin endpoints from training job IPs.

Hugging Face token with write access to all organisation repositories used in a training job. A compromised training environment — a malicious Python package in the training image’s requirements.txt, a code injection via a poisoned training data file — can exfiltrate the token and use it to push modified model weights to any repository in the organisation. The token should be scoped to write access to the specific output repository for the training run, not the entire organisation namespace. Creating a token per-pipeline-stage (read token for pulling base model, write token for pushing fine-tuned output) limits the blast radius to what that stage legitimately needs.

ml_secrets.py caching secrets with @lru_cache in a long-running inference server. The module documentation says the cache is safe for training jobs, which are short-lived. If the same module is used in an inference server that runs for weeks, a secret rotated in Vault will not be reflected until the server restarts. The inference server continues to use the old, potentially revoked credential and will start failing authentication with cryptic errors. Long-lived services require cache TTLs on secret retrieval — either drop @lru_cache and accept the network overhead, or implement a time-based cache with explicit expiry tied to the Vault lease duration.
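A time-based cache of the kind described above can be sketched in a few lines. The class name and the injectable clock are assumptions made for illustration and testability; the TTL should be chosen from the Vault lease duration rather than hardcoded.

```python
import time
from typing import Callable


class TTLSecretCache:
    """Cache secret lookups for a bounded time instead of the process lifetime."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # uncached retrieval function
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # name -> (expiry_time, value)

    def get(self, name: str) -> dict:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]
        # Missing or expired: fetch again and record a new expiry.
        value = self._fetch(name)
        self._entries[name] = (now + self._ttl, value)
        return value

    def invalidate(self, name: str) -> None:
        """Drop a cached entry, for example after an authentication failure."""
        self._entries.pop(name, None)
```

One way to wire this into the module above is to pass the undecorated retrieval function as fetch; functools.lru_cache exposes it as get_secret.__wrapped__. Calling invalidate() when a downstream API returns an authentication error lets the server pick up a rotated secret before the TTL elapses.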