Isolating AI Training Batch Jobs in Kubernetes

Problem

AI training workloads on Kubernetes present a distinct security isolation problem. A single training job typically combines several high-value components: large GPU nodes with substantial compute cost, model weight checkpoints stored on shared persistent volumes, proprietary training datasets containing sensitive business data or personal information, and service account tokens used to fetch data from object stores and model registries. When these components share infrastructure with production services, security boundaries erode in predictable ways.

The specific failure modes that emerge in practice:

Training data oversharing. Training datasets are often stored on shared NFS volumes or object storage buckets with overly broad access policies — either because the data pipeline team provisioned access quickly and never revisited it, or because multiple training projects share a data lake without per-job access scoping. A compromised training job or a debugging engineer with kubectl exec access can read training data from other projects.

Model weight exfiltration via shared storage. Checkpoint directories are written to PersistentVolumeClaims that are often bound to the same StorageClass as production volumes. A pod with access to the storage class’s underlying NFS or Ceph cluster can potentially traverse to other PVCs, particularly when volume names are guessable.

Cross-namespace lateral movement via DNS. By default, every pod in a Kubernetes cluster can resolve service names in every namespace via DNS (service.namespace.svc.cluster.local), and pod-to-pod traffic is unrestricted. A training job running in an ml-training namespace can make HTTP calls to vault.production.svc.cluster.local unless a NetworkPolicy prevents it.

GPU node exposure. Training jobs frequently run with elevated permissions to access GPU devices: privileged: true, host path mounts for GPU drivers, or hostPID: true for GPU debugging tools. These permissions make the GPU node a stepping stone for container escape.

Service account token with broad permissions. Training pipelines often use a single service account that has access to the model registry, the data lake bucket, and the experiment tracking service. A vulnerability in the training framework (PyTorch, JAX, Hugging Face Trainer) or a malicious training script could use this token to exfiltrate data or access production model serving infrastructure.

Shared node pool with production. When training jobs and production services share the same node pool, a container escape from a training job lands on a node that also runs production workloads. The attacker inherits access to all other pods’ volumes, network connections, and secrets mounted on that node.

The scale of AI training workloads amplifies each of these risks. A single training run may consume 32 A100 GPUs for 72 hours, processing hundreds of gigabytes of proprietary data. The value of what’s in scope during that window is substantial, and the security controls applied are often whatever was expedient to get the job running.

Target systems: Kubernetes 1.26+ clusters running AI/ML training workloads (PyTorch, JAX, TensorFlow, Hugging Face); clusters using GPU operator or NVIDIA device plugin; any environment where ML training and production services share the same cluster.


Threat Model

Adversary 1 — Compromised training script. Access level: code running inside a training job pod (malicious dependency, poisoned dataset, supply chain compromise). Objective: use the training job’s service account token to access the model registry, exfiltrate the proprietary training dataset, or pivot to production services via internal DNS.

Adversary 2 — ML engineer with excessive kubectl access. Access level: kubectl exec into training pods across multiple projects. Objective: read training data or model checkpoints from another team’s training run — competitive intelligence theft or data exfiltration.

Adversary 3 — Container escape via GPU driver vulnerability. Access level: training job running with privileged: true for GPU access. Objective: use a vulnerability in the NVIDIA driver or the GPU device plugin to escape to the host node, then access other pods’ secrets and network traffic.

Adversary 4 — Training infrastructure as lateral movement hub. Access level: initial compromise of any pod in the cluster via a separate vulnerability. Objective: use the lax network policy in the training namespace (which is often wide-open to allow data fetching) as a pivot point to reach production services.

Without isolation: training job compromise translates directly to training data exfiltration, model theft, and production service access. With isolation: dedicated node pools contain escapes to training nodes; NetworkPolicy blocks lateral movement; scoped service accounts limit token blast radius.


Configuration / Implementation

Step 1 — Dedicate a node pool exclusively to training workloads

Training jobs must not share nodes with production workloads:

# Label and taint training nodes
# (Run on each training node or set in node group config)
kubectl label node gpu-node-01 workload-type=ai-training
kubectl taint node gpu-node-01 workload-type=ai-training:NoSchedule

# For AWS EKS — node group configuration
# eksctl create nodegroup config:
nodeGroups:
- name: ai-training-gpu
  instanceType: p4d.24xlarge
  minSize: 0
  maxSize: 16
  labels:
    workload-type: ai-training
  taints:
  - key: workload-type
    value: ai-training
    effect: NoSchedule
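  tags:
    # Let the cluster autoscaler infer the label and taint when scaling up from zero nodes
    k8s.io/cluster-autoscaler/node-template/label/workload-type: ai-training
    k8s.io/cluster-autoscaler/node-template/taint/workload-type: ai-training:NoSchedule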

Require training jobs to explicitly tolerate the taint:

# training-job-template.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-run-001
  namespace: ml-training
spec:
  template:
    spec:
      # Must tolerate the training-node taint
      tolerations:
      - key: workload-type
        value: ai-training
        effect: NoSchedule
      # Must target training nodes specifically
      nodeSelector:
        workload-type: ai-training
      # Ensure pod does not schedule on production nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload-type
                operator: In
                values: [ai-training]

Step 2 — Namespace isolation with network policy

# Deny all ingress/egress from ml-training namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ml-training
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# Allow training pods to reach only approved external endpoints
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-egress-allowlist
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      role: training-job
  policyTypes: [Egress]
  egress:
  # Allow DNS resolution
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
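  # Allow access to data lake (specific namespace only)
  # kubernetes.io/metadata.name is set automatically on every namespace (Kubernetes 1.22+),
  # so the selector does not depend on someone having added a custom "name" label
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ml-data
    ports:
    - port: 443
  # Allow access to model registry
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ml-registry
    ports:
    - port: 443
  # Allow experiment tracking (MLflow/W&B internal service)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ml-platform
    ports:
    - port: 5000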
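  # Everything else is dropped by default-deny-all, including traffic to the
  # production, kube-system, and default namespaces.
  # Caveat: the DNS rule above has no "to" clause, so port 53 is open to any destination.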
---
# Block all ingress to training pods (training jobs are clients, not servers)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-ingress-deny
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      role: training-job
  policyTypes: [Ingress]
  ingress: []  # No ingress rules = deny all ingress
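
To reason about what the allowlist admits, it can help to model it outside the cluster. The sketch below is a simplified Python model of the egress rules above, not a reimplementation of any CNI: it matches on destination namespace, port, and protocol only, and the names EgressRule, ALLOWLIST, and egress_allowed are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EgressRule:
    namespace: Optional[str]  # None = any destination, like the DNS rule
    ports: frozenset          # set of (port, protocol) pairs


# Mirrors the training-egress-allowlist policy; protocol defaults to TCP
# when a NetworkPolicy port entry omits it.
ALLOWLIST = [
    EgressRule(None, frozenset({(53, "UDP"), (53, "TCP")})),
    EgressRule("ml-data", frozenset({(443, "TCP")})),
    EgressRule("ml-registry", frozenset({(443, "TCP")})),
    EgressRule("ml-platform", frozenset({(5000, "TCP")})),
]


def egress_allowed(dest_namespace: str, port: int, protocol: str = "TCP") -> bool:
    """True if any rule matches; under default-deny everything else is dropped."""
    return any(
        (rule.namespace is None or rule.namespace == dest_namespace)
        and (port, protocol) in rule.ports
        for rule in ALLOWLIST
    )
```

In this model, egress_allowed("ml-data", 443) is True and egress_allowed("production", 443) is False. egress_allowed("production", 53) is also True, because the DNS rule names no destination; restricting it to the cluster DNS pods would close that gap.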

Step 3 — Scoped service accounts per project

# One service account per training project — not shared across projects
apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-project-alpha
  namespace: ml-training
  annotations:
    # AWS IRSA — scope to only project-alpha's data bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/training-project-alpha
---
# IAM role for project-alpha (attached to service account via IRSA)
# IAM policy: read-only access to project-alpha data only
# s3:prefix exists only on ListBucket requests, so the condition sits in its own
# statement; attached to GetObject it would never match and every read would be denied
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::training-data-bucket/project-alpha/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::training-data-bucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["project-alpha/*"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::model-checkpoints/project-alpha/*"
    }
  ]
}
---
# RBAC for the service account — minimal
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-project-alpha
  namespace: ml-training
rules:
# Allow reading its own job status (for distributed training coordination)
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]
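  resourceNames: []  # Empty list = every job in the namespace; list specific job names in production
# No pods/exec, no secrets access, no cross-namespace access
---
# Bind the Role to the project's service account; a Role without a binding grants nothing
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-project-alpha
  namespace: ml-training
subjects:
- kind: ServiceAccount
  name: training-project-alpha
  namespace: ml-training
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: training-project-alpha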
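
Because one policy is needed per project, the policy document is a natural candidate for templating. The following Python sketch shows one way to generate it, assuming the bucket names used above and a key layout of one top-level prefix per project; the function name project_policy is hypothetical. It keeps s3:ListBucket in its own statement because the s3:prefix condition key is present only on list requests.

```python
import json


def project_policy(project: str,
                   data_bucket: str = "training-data-bucket",
                   checkpoint_bucket: str = "model-checkpoints") -> dict:
    """Build a least-privilege S3 policy document for one training project."""
    prefix = f"{project}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Object reads, limited to the project's prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{data_bucket}/{prefix}",
            },
            {   # Listing, limited through the s3:prefix condition key.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{data_bucket}",
                "Condition": {"StringLike": {"s3:prefix": [prefix]}},
            },
            {   # Checkpoint writes only, under the project's own prefix.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{checkpoint_bucket}/{prefix}",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(project_policy("project-alpha"), indent=2))
```

The resulting dict can be serialized and handed to whatever provisioning tool manages the IAM role, so that adding a project never involves hand-editing a policy.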

Step 4 — Restrict GPU access without privileged containers

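Running GPU containers with privileged: true is a common shortcut, but it is unnecessary when the NVIDIA device plugin exposes GPUs as a schedulable resource, and it is dangerous because it removes the container's isolation from the host: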

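# Use device plugin approach instead of privileged containers
apiVersion: v1
kind: Pod
metadata:
  name: training-gpu-example  # Illustrative name
  namespace: ml-training
  labels:
    role: training-job  # Matched by the NetworkPolicy podSelectors in Step 2
spec: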
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: training
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4  # Request GPUs via device plugin
        memory: "128Gi"
        cpu: "32"
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
      # NOT privileged: true
      # NOT hostPID: true
      # NOT hostNetwork: true
    volumeMounts:
    - name: training-data
      mountPath: /data
      readOnly: true  # Training data is read-only
    - name: checkpoint-dir
      mountPath: /checkpoints
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: project-alpha-data
      readOnly: true
  - name: checkpoint-dir
    persistentVolumeClaim:
      claimName: project-alpha-checkpoints
  - name: tmp
    emptyDir:
      sizeLimit: 10Gi
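
The same hardening rules can be checked mechanically in a CI step before a manifest is applied. This is a minimal Python sketch, assuming the pod spec has been parsed into a dict; host_access_findings is a hypothetical helper, and the checks cover only the settings discussed in this step.

```python
from typing import Any


def host_access_findings(pod_spec: dict[str, Any]) -> list[str]:
    """List settings in a pod spec that widen access to the host node."""
    findings = []

    # Pod-level switches that share host namespaces with the container.
    for field in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(field):
            findings.append(f"{field} is enabled")

    # hostPath volumes expose the node filesystem (often used for GPU drivers).
    for volume in pod_spec.get("volumes") or []:
        if "hostPath" in volume:
            findings.append(f"volume {volume.get('name')} uses hostPath")

    containers = (pod_spec.get("initContainers") or []) + (pod_spec.get("containers") or [])
    for container in containers:
        sc = container.get("securityContext") or {}
        name = container.get("name", "<unnamed>")
        if sc.get("privileged"):
            findings.append(f"container {name} is privileged")
        # Must be set to false explicitly; an absent field is treated as a finding.
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"container {name} allows privilege escalation")
        if "ALL" not in ((sc.get("capabilities") or {}).get("drop") or []):
            findings.append(f"container {name} does not drop ALL capabilities")
    return findings
```

The pod spec above produces no findings, while a spec that sets privileged: true or mounts a hostPath volume is reported by name.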

Step 5 — Enforce isolation with Kyverno policies

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: training-namespace-isolation
spec:
  validationFailureAction: Enforce
  rules:
  # Training namespace jobs must use training node selector
  - name: require-training-node-selector
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [ml-training]
    validate:
      message: "Training pods must target training nodes"
      pattern:
        spec:
          nodeSelector:
            workload-type: ai-training

  # Training pods must not use privileged mode
  - name: deny-privileged-training
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [ml-training]
    validate:
      message: "Privileged containers not permitted in training namespace; use device plugin for GPU access"
      deny:
        conditions:
          any:
          # JMESPath literals need backticks; a bare true is parsed as a field name and never matches
          - key: "{{ request.object.spec.containers[].securityContext.privileged | contains(@, `true`) }}"
            operator: Equals
            value: true

  # Training pods must declare resource limits (prevent resource starvation of other workloads)
  - name: require-resource-limits
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [ml-training]
    validate:
      message: "Training pods must declare CPU and memory resource limits"
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
                cpu: "?*"

Step 6 — Audit training job data access

Log all S3/GCS data access by training jobs for anomaly detection:

# Add structured logging to training jobs for data access auditing
# Environment variable consumed by training framework logging hooks
env:
- name: TRAINING_AUDIT_LOG
  value: "true"
- name: TRAINING_PROJECT_ID
  value: "project-alpha"
- name: TRAINING_RUN_ID
  valueFrom:
    fieldRef:
      fieldPath: metadata.name

# training_audit.py: hook into dataset loading
import json
import logging
import os

import boto3

# Records are emitted at INFO level, so the application must configure
# logging at INFO (or lower) or they are silently dropped.
audit_logger = logging.getLogger('training.audit')

class AuditedS3Dataset:
    """Wraps S3 reads so that every object fetch emits one JSON audit record."""

    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        self.prefix = prefix
        self.project_id = os.environ.get('TRAINING_PROJECT_ID', 'unknown')
        self.run_id = os.environ.get('TRAINING_RUN_ID', 'unknown')
        self._s3 = boto3.client('s3')

    def load_file(self, key: str) -> bytes:
        # Serialize explicitly: passing a dict to the logger would emit its Python repr, not JSON
        audit_logger.info(json.dumps({
            "event": "training_data_access",
            "project": self.project_id,
            "run_id": self.run_id,
            "bucket": self.bucket,
            "key": key,
        }))
        response = self._s3.get_object(Bucket=self.bucket, Key=key)
        return response['Body'].read()
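
On the consuming side, the simplest anomaly check is to flag any access whose key falls outside the project's own prefix. The sketch below assumes the audit records arrive as one JSON object per line with the fields shown above, and that object keys begin with the project ID (matching the project-alpha/ prefix used in the IAM policy); out_of_scope_accesses is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def out_of_scope_accesses(lines: Iterable[str]) -> Iterator[dict]:
    """Yield audit records whose key is outside the project's own prefix."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip log lines that are not JSON.
        if not isinstance(record, dict) or record.get("event") != "training_data_access":
            continue  # Skip JSON that is not a data-access audit record.
        expected_prefix = f"{record.get('project')}/"
        if not str(record.get("key", "")).startswith(expected_prefix):
            yield record
```

With IAM scoping in place such accesses should fail at S3 anyway, so a flagged record points either to a policy gap or to a job probing beyond its scope.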

Expected Behaviour

| Signal | Before isolation | After isolation |
|---|---|---|
| Training pod on production node | Possible | Blocked: the required nodeSelector (enforced by Kyverno) pins pods to training nodes |
| Training job accessing production vault | Succeeds via DNS | Blocked by NetworkPolicy egress deny |
| kubectl exec into another project’s training pod | Permitted with broad RBAC | Denied when user RBAC withholds pods/exec for other projects (service account scoping alone does not control exec) |
| GPU pod runs as privileged: true | Common | Blocked by Kyverno policy |
| Training data PVC readable cross-namespace | Possible via shared StorageClass | PVCs in separate namespaces; cross-namespace mount blocked |
| Training service account has S3 bucket-wide access | Common (over-provisioned) | Scoped to project-specific prefix via the IRSA role policy |

Verification:

# Confirm training pods cannot reach production namespace
kubectl exec -n ml-training training-job-pod -- \
  curl -s --max-time 5 http://api.production.svc.cluster.local/health
# Expected: connection timeout or refused

# Confirm training nodes are correctly tainted
kubectl get nodes -l workload-type=ai-training \
  -o jsonpath='{.items[*].spec.taints}' | jq .
# Expected: [{key: workload-type, value: ai-training, effect: NoSchedule}]

# Confirm no privileged training pods
kubectl get pods -n ml-training -o json | \
  jq '.items[] | select(.spec.containers[].securityContext.privileged == true) | .metadata.name'
# Expected: empty output

Trade-offs

| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Dedicated GPU node pool | Complete node isolation; escape stays on training node | Higher cost: idle GPU capacity cannot be shared with production | Use cluster autoscaler to scale to zero when no training jobs run; use spot/preemptible instances for training nodes |
| NetworkPolicy deny-all with allowlist | Precise egress control | Initial configuration requires mapping all legitimate data sources | Standard NetworkPolicy has no audit mode; if the CNI provides one (for example Cilium's policy audit mode), log without blocking, capture actual traffic, and switch to enforcement after about a week |
| Per-project service accounts | Token compromise limited to one project | Management overhead scales with project count | Automate service account provisioning via a Helm chart or Terraform module in your ML platform |
| Non-privileged GPU access | Removes container escape via privileged mode | Some older GPU workloads expect privileged; GPUs must be exposed through the NVIDIA device plugin (standalone or via the GPU Operator) | Deploy the device plugin or GPU Operator; test GPU workloads without privileged mode before enforcing policy |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| NetworkPolicy blocks training data access | Training job fails with connection timeout to data lake | Job logs show connection errors; CNI policy logs (where available) show blocked egress | Add the missing destination to the egress allowlist; verify with kubectl exec -- curl before re-running job |
| Training node autoscaler does not drain before scale-down | In-flight training job killed mid-epoch | Job fails; checkpoint is incomplete; training metrics show abrupt stop | Annotate training pods with cluster-autoscaler.kubernetes.io/safe-to-evict: "false"; set the cluster autoscaler flag --scale-down-delay-after-add=30m; use Volcano or Kubeflow for job preemption handling |
| Kyverno policy blocks legitimate GPU workload | Pod fails admission with “Privileged containers not permitted” | kubectl describe job shows the Kyverno policy violation in its events (the rejected pod is never created) | Verify GPU device plugin is configured on the node; update pod spec to use nvidia.com/gpu: N resource limit instead of privileged: true |
| Service account token rotation breaks long-running jobs | 24h+ training job fails mid-run with 401 to data store | Job logs show authentication failure; S3 access denied | Use projected service account tokens with extended TTL for long-running jobs; configure expirationSeconds: 86400 in the projected volume, and make sure the client re-reads the token file, which the kubelet refreshes before expiry |
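
For the token rotation failure, the kubelet rewrites a projected token file before it expires, so a client that caches the token read at startup is what actually breaks. The sketch below shows one way to pick up the refreshed token, assuming custom client code that reads the file itself; the class name RefreshingToken is hypothetical. Many client libraries, typically including the AWS SDKs used with IRSA, already reload the file on their own.

```python
import os


class RefreshingToken:
    """Re-read a projected token file whenever it changes on disk."""

    def __init__(self, path: str):
        self._path = path
        self._mtime = None  # Modification time of the last version read.
        self._token = ""

    def get(self) -> str:
        # os.stat follows symlinks, so it sees the new target when the
        # kubelet swaps the projected volume's contents.
        mtime = os.stat(self._path).st_mtime_ns
        if mtime != self._mtime:
            with open(self._path, encoding="utf-8") as f:
                self._token = f.read().strip()
            self._mtime = mtime
        return self._token
```

A caller would construct it once with the mounted token path (by default /var/run/secrets/kubernetes.io/serviceaccount/token) and call get() before each request rather than storing the returned string.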