Isolating AI Training Batch Jobs in Kubernetes
Problem
AI training workloads on Kubernetes present a distinct security isolation problem. A single training job typically combines several high-value components: large GPU nodes with substantial compute cost, model weight checkpoints stored on shared persistent volumes, proprietary training datasets containing sensitive business data or personal information, and service account tokens used to fetch data from object stores and model registries. When these components share infrastructure with production services, security boundaries erode in predictable ways.
The specific failure modes that emerge in practice:
Training data oversharing. Training datasets are often stored on shared NFS volumes or object storage buckets with overly broad access policies — either because the data pipeline team provisioned access quickly and never revisited it, or because multiple training projects share a data lake without per-job access scoping. A compromised training job or a debugging engineer with kubectl exec access can read training data from other projects.
Model weight exfiltration via shared storage. Checkpoint directories are written to PersistentVolumeClaims that are often bound to the same StorageClass as production volumes. A pod with access to the storage class’s underlying NFS or Ceph cluster can potentially traverse to other PVCs, particularly when volume names are guessable.
Cross-namespace lateral movement via DNS. By default, every pod in a Kubernetes cluster can resolve service names in all namespaces via DNS (service.namespace.svc.cluster.local), and pod-to-pod traffic is unrestricted until a NetworkPolicy selects the pod. A training job running in an ml-training namespace can make HTTP calls to vault.production.svc.cluster.local unless NetworkPolicy prevents it.
GPU node exposure. Training jobs frequently run with elevated permissions to access GPU devices: privileged: true, host path mounts for GPU drivers, or hostPID: true for GPU debugging tools. These permissions make the GPU node a stepping stone for container escape.
Service account token with broad permissions. Training pipelines often use a single service account that has access to the model registry, the data lake bucket, and the experiment tracking service. A vulnerability in the training framework (PyTorch, JAX, Hugging Face Trainer) or a malicious training script could use this token to exfiltrate data or access production model serving infrastructure.
Shared node pool with production. When training jobs and production services share the same node pool, a container escape from a training job lands on a node that also runs production workloads. The attacker inherits access to all other pods’ volumes, network connections, and secrets mounted on that node.
The scale of AI training workloads amplifies each of these risks. A single training run may consume 32 A100 GPUs for 72 hours, processing hundreds of gigabytes of proprietary data. The value of what’s in scope during that window is substantial, and the security controls applied are often whatever was expedient to get the job running.
Target systems: Kubernetes 1.26+ clusters running AI/ML training workloads (PyTorch, JAX, TensorFlow, Hugging Face); clusters using GPU operator or NVIDIA device plugin; any environment where ML training and production services share the same cluster.
Threat Model
Adversary 1 — Compromised training script. Access level: code running inside a training job pod (malicious dependency, poisoned dataset, supply chain compromise). Objective: use the training job’s service account token to access the model registry, exfiltrate the proprietary training dataset, or pivot to production services via internal DNS.
Adversary 2 — ML engineer with excessive kubectl access. Access level: kubectl exec into training pods across multiple projects. Objective: read training data or model checkpoints from another team’s training run — competitive intelligence theft or data exfiltration.
Adversary 3 — Container escape via GPU driver vulnerability. Access level: training job running with privileged: true for GPU access. Objective: use a vulnerability in the NVIDIA driver or the GPU device plugin to escape to the host node, then access other pods’ secrets and network traffic.
Adversary 4 — Training infrastructure as lateral movement hub. Access level: initial compromise of any pod in the cluster via a separate vulnerability. Objective: use the lax network policy in the training namespace (which is often wide-open to allow data fetching) as a pivot point to reach production services.
Without isolation: training job compromise translates directly to training data exfiltration, model theft, and production service access. With isolation: dedicated node pools contain escapes to training nodes; NetworkPolicy blocks lateral movement; scoped service accounts limit token blast radius.
Configuration / Implementation
Step 1 — Dedicate a node pool exclusively to training workloads
Training jobs must not share nodes with production workloads:
# Label and taint training nodes
# (Run on each training node or set in node group config)
kubectl label node gpu-node-01 workload-type=ai-training
kubectl taint node gpu-node-01 workload-type=ai-training:NoSchedule
# For AWS EKS — node group configuration
# eksctl create nodegroup config:
nodeGroups:
  - name: ai-training-gpu
    instanceType: p4d.24xlarge
    minSize: 0
    maxSize: 16
    labels:
      workload-type: ai-training
    taints:
      - key: workload-type
        value: ai-training
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/workload-type: ai-training:NoSchedule
Require training jobs to explicitly tolerate the taint:
# training-job-template.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-run-001
  namespace: ml-training
spec:
  template:
    metadata:
      labels:
        role: training-job  # Matched by the NetworkPolicies in Step 2
    spec:
      restartPolicy: Never  # Job pods must use Never or OnFailure
      # Must tolerate the training-node taint
      tolerations:
        - key: workload-type
          operator: Equal
          value: ai-training
          effect: NoSchedule
      # Must target training nodes specifically
      nodeSelector:
        workload-type: ai-training
      # Ensure pod does not schedule on production nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values: [ai-training]
      # containers: omitted here; see the pod spec in Step 4
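The interaction between the taint, the toleration, and the nodeSelector is easy to misread: the taint keeps other workloads off training nodes, while the nodeSelector keeps training pods off every other node. The following is a minimal Python sketch of that matching logic, not the scheduler's actual implementation. The node and pod dictionaries, and the `workload-type: production` label, are hypothetical illustrations.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # An empty effect on the toleration matches any effect
    if toleration.get("operator", "Equal") == "Exists":
        key = toleration.get("key")
        return not key or key == taint["key"]  # Empty key + Exists matches all
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(pod_spec, node):
    """Simplified check: nodeSelector must match and hard taints must be tolerated."""
    labels = node.get("labels", {})
    for key, value in pod_spec.get("nodeSelector", {}).items():
        if labels.get(key) != value:
            return False
    for taint in node.get("taints", []):
        if taint["effect"] == "PreferNoSchedule":
            continue  # Soft preference, never blocks scheduling
        if not any(tolerates(t, taint) for t in pod_spec.get("tolerations", [])):
            return False
    return True


TRAINING_TAINT = {"key": "workload-type", "value": "ai-training", "effect": "NoSchedule"}
TRAINING_NODE = {"labels": {"workload-type": "ai-training"}, "taints": [TRAINING_TAINT]}
PRODUCTION_NODE = {"labels": {"workload-type": "production"}, "taints": []}  # hypothetical label

TOLERATION = {"key": "workload-type", "operator": "Equal",
              "value": "ai-training", "effect": "NoSchedule"}
TRAINING_POD = {"nodeSelector": {"workload-type": "ai-training"}, "tolerations": [TOLERATION]}
TOLERATION_ONLY_POD = {"tolerations": [TOLERATION]}
PRODUCTION_POD = {}

if __name__ == "__main__":
    print("training pod -> training node:", can_schedule(TRAINING_POD, TRAINING_NODE))
    print("training pod -> production node:", can_schedule(TRAINING_POD, PRODUCTION_NODE))
    print("production pod -> training node:", can_schedule(PRODUCTION_POD, TRAINING_NODE))
    print("toleration-only pod -> production node:",
          can_schedule(TOLERATION_ONLY_POD, PRODUCTION_NODE))
```

The last case is the one to watch: a pod that only tolerates the taint can still land on a production node, which is why the template sets the nodeSelector as well.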
Step 2 — Namespace isolation with network policy
# Deny all ingress/egress from ml-training namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ml-training
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# Allow training pods to reach only approved external endpoints
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-egress-allowlist
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      role: training-job
  policyTypes: [Egress]
  egress:
    # Allow DNS resolution (no "to" clause, so port 53 is open to any destination)
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow access to data lake (specific namespace only)
    # The namespaceSelectors below assume each namespace carries a name=<namespace> label
    - to:
        - namespaceSelector:
            matchLabels:
              name: ml-data
      ports:
        - port: 443
    # Allow access to model registry
    - to:
        - namespaceSelector:
            matchLabels:
              name: ml-registry
      ports:
        - port: 443
    # Allow experiment tracking (MLflow/W&B internal service)
    - to:
        - namespaceSelector:
            matchLabels:
              name: ml-platform
      ports:
        - port: 5000
    # Explicitly: NO egress to production namespace
    # NO egress to kube-system (other than DNS on port 53)
    # NO egress to default namespace
---
# Block all ingress to training pods (training jobs are clients, not servers)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-ingress-deny
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      role: training-job
  policyTypes: [Ingress]
  ingress: []  # No ingress rules = deny all ingress
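Before applying the allowlist, it can help to reason about which destinations it actually permits. The sketch below models the egress rules above as a plain Python table and evaluates a destination against them. It is a simplified illustration, not how a CNI evaluates policy: it considers only the destination namespace, port, and protocol, and the `ALLOWLIST` structure and `egress_allowed` helper are hypothetical names.

```python
# Each entry mirrors one egress rule from training-egress-allowlist.
# namespaces=None means the rule has no "to" clause and matches any destination.
ALLOWLIST = [
    {"namespaces": None, "ports": {(53, "UDP"), (53, "TCP")}},  # DNS
    {"namespaces": {"ml-data"}, "ports": {(443, "TCP")}},       # data lake
    {"namespaces": {"ml-registry"}, "ports": {(443, "TCP")}},   # model registry
    {"namespaces": {"ml-platform"}, "ports": {(5000, "TCP")}},  # experiment tracking
]


def egress_allowed(dest_namespace, port, protocol="TCP"):
    """Return True if any allowlist rule permits this destination.

    NetworkPolicy rules are additive: traffic is allowed when at least one
    rule matches, and denied otherwise once the pod is selected by a policy.
    """
    for rule in ALLOWLIST:
        namespace_ok = rule["namespaces"] is None or dest_namespace in rule["namespaces"]
        if namespace_ok and (port, protocol) in rule["ports"]:
            return True
    return False


if __name__ == "__main__":
    checks = [
        ("ml-data", 443, "TCP"),
        ("production", 443, "TCP"),
        ("kube-system", 53, "UDP"),
        ("production", 53, "UDP"),
        ("ml-platform", 443, "TCP"),
    ]
    for namespace, port, protocol in checks:
        verdict = "allow" if egress_allowed(namespace, port, protocol) else "deny"
        print(f"{namespace}:{port}/{protocol} -> {verdict}")
```

Note that the DNS rule permits port 53 to every namespace, including production. If that is too broad, the rule can be narrowed with a `to` clause selecting the cluster DNS pods.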
Step 3 — Scoped service accounts per project
# One service account per training project — not shared across projects
apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-project-alpha
  namespace: ml-training
  annotations:
    # AWS IRSA: scope to only project-alpha's data bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/training-project-alpha
---
# IAM role for project-alpha (attached to service account via IRSA)
# IAM policy: read-only access to project-alpha data only
# The s3:prefix condition key is only present on ListBucket requests, so
# GetObject sits in its own statement and is scoped by its resource ARN.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::training-data-bucket/project-alpha/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::training-data-bucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["project-alpha/*"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::model-checkpoints/project-alpha/*"
    }
  ]
}
---
# RBAC for the service account — minimal
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-project-alpha
  namespace: ml-training
rules:
  # Allow reading its own job status (for distributed training coordination)
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
    resourceNames: []  # Scope to specific job names in production
# No pods/exec, no secrets access, no cross-namespace access
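Per-project scoping only holds up if every project gets the same policy shape, which is easier to guarantee when the policy is generated rather than hand-edited. The following is a minimal sketch of such a generator, assuming the bucket names used above. The `build_project_policy` function and its name validation rule are hypothetical, and the output would still need to be attached to the IAM role by whatever provisioning tool is in use.

```python
import json
import re

# Hypothetical naming rule: lowercase letters, digits, and hyphens only.
# Rejecting anything else keeps wildcards and path separators out of the ARNs.
_PROJECT_NAME = re.compile(r"[a-z0-9][a-z0-9-]*")


def build_project_policy(project,
                         data_bucket="training-data-bucket",
                         checkpoint_bucket="model-checkpoints"):
    """Build an IAM policy document scoped to a single training project."""
    if not _PROJECT_NAME.fullmatch(project):
        raise ValueError(f"invalid project name: {project!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read objects under the project's prefix only
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{data_bucket}/{project}/*",
            },
            {   # List the bucket, restricted to the project's prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{data_bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{project}/*"]}},
            },
            {   # Write checkpoints under the project's prefix only
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{checkpoint_bucket}/{project}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_project_policy("project-alpha"), indent=2))
```

Keeping GetObject and ListBucket in separate statements matters here: the `s3:prefix` condition key exists only on list requests, so attaching it to GetObject would deny every object read.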
Step 4 — Restrict GPU access without privileged containers
Running GPU containers as privileged: true is a common shortcut, but it is unnecessary when GPUs are requested through the NVIDIA device plugin, and it removes most container isolation:
# Use device plugin approach instead of privileged containers
apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: training
      image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: 4  # Request GPUs via device plugin
          memory: "128Gi"
          cpu: "32"
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
        # NOT privileged: true
        # NOT hostPID: true
        # NOT hostNetwork: true
      volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true  # Training data is read-only
        - name: checkpoint-dir
          mountPath: /checkpoints
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: project-alpha-data
        readOnly: true
    - name: checkpoint-dir
      persistentVolumeClaim:
        claimName: project-alpha-checkpoints
    - name: tmp
      emptyDir:
        sizeLimit: 10Gi
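The settings in this pod spec can be checked before a manifest ever reaches the cluster, for example in CI. The sketch below is one possible pre-submit lint over a pod manifest that has already been loaded into a Python dictionary. The `find_isolation_violations` helper and its messages are hypothetical, and the set of checks is an illustration rather than a complete hardening baseline.

```python
def find_isolation_violations(pod):
    """Return a list of human-readable problems found in a pod manifest dict."""
    spec = pod.get("spec", {})
    problems = []

    # Host namespaces are pod-level fields, not container-level ones
    for field in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(field):
            problems.append(f"spec.{field} is enabled")

    # hostPath mounts expose the node filesystem to the container
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name', '<unnamed>')} uses hostPath")

    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        if security.get("privileged"):
            problems.append(f"container {name} is privileged")
        # allowPrivilegeEscalation must be set to false explicitly
        if security.get("allowPrivilegeEscalation", True):
            problems.append(f"container {name} allows privilege escalation")
        if "ALL" not in security.get("capabilities", {}).get("drop", []):
            problems.append(f"container {name} does not drop ALL capabilities")
    return problems


if __name__ == "__main__":
    risky_pod = {
        "spec": {
            "hostPID": True,
            "containers": [{"name": "training", "securityContext": {"privileged": True}}],
        }
    }
    for problem in find_isolation_violations(risky_pod):
        print(problem)
```

A check like this complements admission control rather than replacing it: it gives engineers feedback earlier, while the cluster-side policy remains the enforcement point.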
Step 5 — Enforce isolation with Kyverno policies
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: training-namespace-isolation
spec:
  validationFailureAction: Enforce
  rules:
    # Training namespace jobs must use training node selector
    - name: require-training-node-selector
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [ml-training]
      validate:
        message: "Training pods must target training nodes"
        pattern:
          spec:
            nodeSelector:
              workload-type: ai-training
    # Training pods must not use privileged mode
    - name: deny-privileged-training
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [ml-training]
      validate:
        message: "Privileged containers not permitted in training namespace; use device plugin for GPU access"
        deny:
          conditions:
            any:
              # JMESPath literals need backticks; a bare true is read as a field name
              - key: "{{ request.object.spec.containers[].securityContext.privileged | contains(@, `true`) }}"
                operator: Equals
                value: true
    # Training pods must declare resource limits (prevent resource starvation of other workloads)
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [ml-training]
      validate:
        message: "Training pods must declare CPU and memory resource limits"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
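The `"?*"` value in the resource-limits rule is a wildcard pattern: `?` matches exactly one character and `*` matches zero or more, so together they require a non-empty value. The sketch below reproduces that check in Python to show which pods the rule would reject. It uses `fnmatchcase` as a stand-in for Kyverno's wildcard matching, which is an assumption (fnmatch also supports character sets that Kyverno patterns do not), and the `missing_limits` helper is hypothetical.

```python
from fnmatch import fnmatchcase


def missing_limits(pod, required=("cpu", "memory")):
    """Return (container name, resource) pairs that lack a non-empty limit."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        for resource in required:
            value = str(limits.get(resource, ""))
            # "?*" = one character followed by anything: any non-empty string
            if not fnmatchcase(value, "?*"):
                missing.append((container.get("name", "<unnamed>"), resource))
    return missing


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {"name": "training",
                 "resources": {"limits": {"cpu": "32", "memory": "128Gi"}}},
                {"name": "sidecar",
                 "resources": {"limits": {"cpu": "1"}}},
            ]
        }
    }
    print(missing_limits(pod))
```

Because the pattern applies to every container in the pod, a sidecar without limits is enough to fail admission, as the `sidecar` entry in this example illustrates.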
Step 6 — Audit training job data access
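Log all S3/GCS data access by training jobs for anomaly detection. In-process logging like the hook below can be bypassed by a compromised training script, so treat it as a complement to server-side records such as S3 server access logs or CloudTrail data events: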
# Add structured logging to training jobs for data access auditing
# Environment variable consumed by training framework logging hooks
env:
  - name: TRAINING_AUDIT_LOG
    value: "true"
  - name: TRAINING_PROJECT_ID
    value: "project-alpha"
  - name: TRAINING_RUN_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
# training_audit.py: hook into dataset loading
import json
import logging
import os

import boto3

audit_logger = logging.getLogger('training.audit')


class AuditedS3Dataset:
    """Wraps S3 reads so that every object fetch emits one audit record."""

    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        self.prefix = prefix
        self.project_id = os.environ.get('TRAINING_PROJECT_ID', 'unknown')
        self.run_id = os.environ.get('TRAINING_RUN_ID', 'unknown')
        self._s3 = boto3.client('s3')

    def load_file(self, key: str) -> bytes:
        # Serialize the record as JSON so log pipelines can parse it
        audit_logger.info(json.dumps({
            "event": "training_data_access",
            "project": self.project_id,
            "run_id": self.run_id,
            "bucket": self.bucket,
            "key": key,
        }))
        response = self._s3.get_object(Bucket=self.bucket, Key=key)
        return response['Body'].read()
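Once the audit records are collected, the simplest useful anomaly check is whether a job read anything outside its own project prefix. The sketch below scans log lines for that case, assuming the records arrive as one JSON object per line with the fields emitted by the hook above. The `out_of_scope_accesses` helper and the `allowed_prefixes` mapping are hypothetical names introduced for illustration.

```python
import json


def out_of_scope_accesses(log_lines, allowed_prefixes):
    """Return audit records whose key falls outside the project's allowed prefix.

    allowed_prefixes maps a project ID to the key prefix it may read,
    for example {"project-alpha": "project-alpha/"}.
    """
    flagged = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not JSON audit records
        if not isinstance(record, dict):
            continue
        if record.get("event") != "training_data_access":
            continue
        prefix = allowed_prefixes.get(record.get("project"))
        key = record.get("key", "")
        # Unknown projects are flagged as well as out-of-prefix reads
        if prefix is None or not key.startswith(prefix):
            flagged.append(record)
    return flagged


if __name__ == "__main__":
    lines = [
        '{"event": "training_data_access", "project": "project-alpha", '
        '"run_id": "run-1", "bucket": "training-data-bucket", '
        '"key": "project-alpha/shard-0001.parquet"}',
        '{"event": "training_data_access", "project": "project-alpha", '
        '"run_id": "run-1", "bucket": "training-data-bucket", '
        '"key": "project-beta/shard-0001.parquet"}',
        "plain text line from the training framework",
    ]
    for record in out_of_scope_accesses(lines, {"project-alpha": "project-alpha/"}):
        print("out-of-scope read:", record["key"])
```

With the IAM policy from Step 3 in place, such reads should already fail at the storage layer, so a hit here usually points at a policy gap or a misconfigured job.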
Expected Behaviour
| Signal | Before isolation | After isolation |
|---|---|---|
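| Training pod on production node | Possible | Blocked: nodeSelector and node affinity pin training pods to training nodes, enforced by Kyverno; the taint keeps other workloads off those nodes |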
| Training job accessing production vault | Succeeds via DNS | Blocked by NetworkPolicy egress deny |
| kubectl exec into another project’s training pod | Permitted with broad RBAC | Denied: RBAC grants no pods/exec outside the caller’s own project |
| GPU pod runs as privileged: true | Common | Blocked by Kyverno policy |
| Training data PVC readable cross-namespace | Possible via shared StorageClass | PVCs in separate namespaces; cross-namespace mount blocked |
| Training service account has S3 bucket-wide access | Common (over-provisioned) | Scoped to project-specific prefix via IRSA condition |
Verification:
# Confirm training pods cannot reach production namespace
kubectl exec -n ml-training training-job-pod -- \
curl -s --max-time 5 http://api.production.svc.cluster.local/health
# Expected: connection timeout or refused
# Confirm training nodes are correctly tainted
kubectl get nodes -l workload-type=ai-training \
-o jsonpath='{.items[*].spec.taints}' | jq .
# Expected: [{key: workload-type, value: ai-training, effect: NoSchedule}]
# Confirm no privileged training pods
kubectl get pods -n ml-training -o json | \
jq '.items[] | select(.spec.containers[].securityContext.privileged == true) | .metadata.name'
# Expected: empty output
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Dedicated GPU node pool | Complete node isolation; escape stays on training node | Higher cost — cannot share idle GPU capacity with production | Use cluster autoscaler to scale to zero when no training jobs run; spot/preemptible instances for training nodes |
| NetworkPolicy deny-all with allowlist | Precise egress control | Initial configuration requires mapping all legitimate data sources | Kubernetes NetworkPolicy has no native audit mode; where the CNI offers one (for example, Cilium's policy audit mode), log without blocking, capture actual traffic, and convert to enforcement after 1 week |
| Per-project service accounts | Token compromise limited to one project | Management overhead scales with project count | Automate service account provisioning via a Helm chart or Terraform module in your ML platform |
| Non-privileged GPU access | Removes container escape via privileged mode | Some older GPU workloads expect privileged; nodes need the NVIDIA device plugin, installed standalone or via the GPU Operator | Deploy the device plugin or GPU Operator; test GPU workloads with device plugin before enforcing policy |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| NetworkPolicy blocks training data access | Training job fails with connection timeout to data lake | Job logs show connection errors; CNI flow or policy-verdict logs, where the CNI provides them, show blocked egress | Add the missing destination to the egress allowlist; verify with kubectl exec -- curl before re-running job |
| Cluster autoscaler removes a node that is still running a job | In-flight training job killed mid-epoch | Job fails; checkpoint is incomplete; training metrics show abrupt stop | Annotate training pods with cluster-autoscaler.kubernetes.io/safe-to-evict: "false" so their nodes are not scaled down; tune the autoscaler flag --scale-down-delay-after-add (for example 30m); use Volcano or Kubeflow for job preemption handling |
| Kyverno policy blocks legitimate GPU workload | Pod fails admission with “Privileged containers not permitted” | kubectl describe pod shows Kyverno policy violation | Verify GPU device plugin is configured on the node; update pod spec to use nvidia.com/gpu: N resource limit instead of privileged: true |
| Service account token rotation breaks long-running jobs | 24h+ training job fails mid-run with 401 to data store | Job logs show authentication failure; S3 access denied | Projected service account tokens are rotated in place by the kubelet, so clients must re-read the token file instead of caching it at startup; set expirationSeconds in the projected volume (for example 86400) to suit long-running jobs |
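The token-rotation failure above usually comes from client code that reads the token once at startup. A minimal sketch of the alternative follows: a small helper that re-reads the token file after a short interval, so a rotated token is picked up without restarting the job. The `FileTokenSource` class and its `max_age_seconds` parameter are hypothetical; the default path is the standard in-pod service account token location, and many SDKs already handle this refresh themselves.

```python
import time

DEFAULT_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class FileTokenSource:
    """Serves a token from a file, re-reading it once the cached copy is stale."""

    def __init__(self, path=DEFAULT_TOKEN_PATH, max_age_seconds=60.0):
        self._path = path
        self._max_age = max_age_seconds
        self._cached = None
        self._loaded_at = 0.0

    def token(self):
        now = time.monotonic()
        # Reload on first use and whenever the cached copy is older than max_age
        if self._cached is None or now - self._loaded_at >= self._max_age:
            with open(self._path, encoding="utf-8") as handle:
                self._cached = handle.read().strip()
            self._loaded_at = now
        return self._cached


def authorization_header(source):
    """Build the header for each request from the current token, never a stale copy."""
    return {"Authorization": f"Bearer {source.token()}"}
```

Calling `authorization_header(source)` per request, rather than storing the header once, is what lets a multi-day training run survive token rotation.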
Related Articles
- GPU Isolation — isolating GPU access at the Kubernetes layer including device plugin configuration
- Fine-Tuning Pipeline Security — security controls specific to LLM fine-tuning pipelines running on Kubernetes
- Kubernetes Network Policies — writing network policies that contain lateral movement between namespaces
- RBAC Design Patterns — designing service account RBAC to scope training job permissions to their project
- AI Training Network Segmentation — network-level segmentation for AI training infrastructure including GPU cluster isolation