Kubernetes CronJob Security: Least Privilege, Concurrency Controls, and Credential Isolation
Problem
Kubernetes CronJobs are frequently treated as second-class workloads from a security perspective. They run infrequently, they are not continuously monitored, and their permissions tend to accumulate over time as requirements change. A CronJob that once needed to read a single ConfigMap now has cluster-wide secret access because “it was easier to give it more permission than figure out exactly what it needs.”
Specific failure modes:
- Overpermissive service accounts. A database backup CronJob uses a service account with read access to all Secrets in all namespaces — far more than the backup key it actually needs. Any compromise of this CronJob’s pod provides full secret exfiltration.
- Credentials left in completed pod environments. The CronJob completes; the Job pod enters
Completedstate but remains in the cluster. Environment variables containing database passwords, API keys, and service account tokens are readable viakubectl describe podby anyone with pod-read access. - Unbounded concurrency creating race conditions. A CronJob with
concurrencyPolicy: Allowruns every minute. If the job takes two minutes, two instances run simultaneously. Both read the same state, make conflicting writes, and produce corrupted output. This can be exploited by timing attacks. - No job TTL, accumulating forensic evidence. Completed Job pods accumulate indefinitely. Each one has logs, environment variables, and mounted secret data that persists long after the job finished. An attacker with pod-read access can forensically extract credentials from completed jobs run weeks ago.
- Container images not pinned. The CronJob container image uses
:latest. An update to the base image introduces a vulnerability between CronJob runs; the vulnerability is present for the duration of the image cache lifetime but appears “fixed” when the cache is refreshed. - No deadline or active deadline. A CronJob without
activeDeadlineSecondsruns indefinitely if it hangs. A hung job using a database connection, holding a lock, or consuming quota blocks subsequent scheduled runs.
Target systems: Kubernetes 1.28+ CronJob API (batch/v1); Job TTL controller; concurrencyPolicy; RBAC for CronJob service accounts.
Threat Model
- Adversary 1 — Pod environment variable credential extraction: A developer with namespace access runs
kubectl get podsand finds a completed backup CronJob pod. They runkubectl describe pod backup-job-1234and read the database password from the pod’s environment variables. The pod has been sitting in Completed state for three weeks. - Adversary 2 — Overpermissive SA used by compromised CronJob: An attacker discovers a vulnerability in the backup script (e.g., shell injection via a filename). The CronJob’s service account has
secrets: getacross all namespaces. The attacker extracts all cluster secrets through the vulnerability. - Adversary 3 — Concurrent execution race condition: A secret rotation CronJob reads the current secret value, generates a new one, and writes it. Two instances run simultaneously; both read the same old value; both generate different new values; one overwrites the other’s write. The rotation process fails and one system gets a new credential that the other doesn’t know about.
- Adversary 4 — Image substitution between runs: The CronJob uses
:latest. An attacker compromises the container registry between Monday and Tuesday. Tuesday’s run pulls the compromised image. - Adversary 5 — Stale job accumulation for forensic extraction: Completed jobs accumulate in the cluster. Each completed pod retains environment variables and mounted volume data. An attacker who gains read access to pods can reconstruct weeks of credential history from completed job pods.
- Access level: Adversaries 1 and 5 need namespace pod-read access. Adversary 2 exploits a pre-existing vulnerability in the CronJob’s workload. Adversary 4 needs registry access. Adversary 3 exploits scheduling configuration.
- Objective: Extract credentials, escalate via over-permissive service account, corrupt scheduled operations.
- Blast radius: A CronJob service account with cluster-wide secret access provides complete credential exfiltration capability to anyone who can exploit the CronJob’s workload.
Configuration
Step 1: Dedicated Least-Privilege Service Account
# Service account scoped to exactly what the CronJob needs.
apiVersion: v1
kind: ServiceAccount
metadata:
name: backup-cronjob-sa
namespace: database
---
# Role: only access the specific secret this backup job needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: backup-cronjob-role
namespace: database
rules:
# Read the backup credentials secret — by name only.
- apiGroups: [""]
resources: ["secrets"]
resourceNames: ["backup-db-credentials"]
verbs: ["get"]
# Write backup status to a ConfigMap.
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["backup-status"]
verbs: ["get", "update", "patch"]
# NOT: list secrets, access other namespaces, create/delete resources.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: backup-cronjob-binding
namespace: database
subjects:
- kind: ServiceAccount
name: backup-cronjob-sa
namespace: database
roleRef:
kind: Role
name: backup-cronjob-role
apiGroup: rbac.authorization.k8s.io
Step 2: CronJob Configuration with Security Controls
apiVersion: batch/v1
kind: CronJob
metadata:
name: database-backup
namespace: database
spec:
schedule: "0 2 * * *" # Daily at 2am.
# Prevent concurrent execution — critical for operations with shared state.
concurrencyPolicy: Forbid # Wait for previous job; do not run overlapping.
# Options: Allow (default), Forbid, Replace.
# Keep minimal history — don't accumulate completed jobs.
successfulJobsHistoryLimit: 2 # Keep last 2 successful jobs.
failedJobsHistoryLimit: 3 # Keep last 3 failed jobs for debugging.
# Start deadline: if the job can't start within 300s of scheduled time, skip.
startingDeadlineSeconds: 300
jobTemplate:
spec:
# TTL: delete the Job (and its pods) 1 hour after completion.
ttlSecondsAfterFinished: 3600
# Active deadline: kill the job if it runs longer than 2 hours.
activeDeadlineSeconds: 7200
# Retry only once on failure.
backoffLimit: 1
template:
spec:
serviceAccountName: backup-cronjob-sa
restartPolicy: OnFailure
securityContext:
runAsNonRoot: true
runAsUser: 65534
seccompProfile:
type: RuntimeDefault
containers:
- name: backup
# Pin by digest — no surprise image changes between runs.
image: gcr.io/example/db-backup@sha256:abc123def456...
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
limits:
cpu: "500m"
memory: "512Mi"
requests:
cpu: "100m"
memory: "128Mi"
# Do NOT pass credentials as environment variables.
# Mount them as files from a Secret volume instead.
volumeMounts:
- name: db-credentials
mountPath: /etc/backup/credentials
readOnly: true
- name: tmp
mountPath: /tmp
# No env vars with credentials.
# Backup script reads from mounted files:
# DB_PASSWORD=$(cat /etc/backup/credentials/password)
volumes:
- name: db-credentials
secret:
secretName: backup-db-credentials
defaultMode: 0400 # Owner read-only.
- name: tmp
emptyDir: {}
# Disable auto-mount of service account token if not needed.
automountServiceAccountToken: false
Step 3: Avoid Credentials in Environment Variables
# BAD: credentials in environment variables — visible in kubectl describe pod.
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: backup-db-credentials
key: password
# This appears in `kubectl describe pod backup-job-1234` output for all time.
# GOOD: credentials mounted as files — not visible in pod description.
volumeMounts:
- name: db-credentials
mountPath: /etc/backup/credentials
readOnly: true
volumes:
- name: db-credentials
secret:
secretName: backup-db-credentials
# Mount specific keys only.
items:
- key: password
path: password
mode: 0400
# In the CronJob script: read credentials from files, not env.
#!/bin/bash
set -euo pipefail
# Read credentials from mounted secret volume.
DB_HOST="$(cat /etc/backup/credentials/host)"
DB_USER="$(cat /etc/backup/credentials/username)"
DB_PASSWORD="$(cat /etc/backup/credentials/password)"
# Use process substitution to avoid password in command arguments.
PGPASSWORD="$DB_PASSWORD" pg_dump \
-h "$DB_HOST" \
-U "$DB_USER" \
--format=custom \
--file=/tmp/backup-"$(date +%Y%m%d-%H%M%S)".dump \
mydb
# Zero the credential variables after use.
unset DB_PASSWORD
Step 4: TTL and Cleanup Enforcement
# Ensure TTL controller is enabled (it is by default in k8s 1.21+).
# Verify:
kubectl get deployment ttl-controller -n kube-system 2>/dev/null || \
echo "TTL controller: check kube-controller-manager --feature-gates"
# Enforce cleanup via Kyverno if teams forget to set TTL.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-job-ttl
spec:
validationFailureAction: Enforce
rules:
- name: require-ttlSecondsAfterFinished
match:
any:
- resources:
kinds: ["CronJob"]
validate:
message: "CronJobs must set ttlSecondsAfterFinished (max 86400)."
pattern:
spec:
jobTemplate:
spec:
ttlSecondsAfterFinished: "1-86400" # Between 1s and 24h.
# Emergency cleanup: delete all completed job pods older than 24 hours.
kubectl get pods --all-namespaces \
-o json | \
jq -r '
.items[] |
select(
.status.phase == "Succeeded" and
(now - (.status.completionTime // .metadata.creationTimestamp | fromdateiso8601)) > 86400
) |
"\(.metadata.namespace) \(.metadata.name)"
' | \
while read ns name; do
kubectl delete pod "$name" -n "$ns" --grace-period=0
done
Step 5: NetworkPolicy for CronJob Pods
# Restrict CronJob pod network access to only required destinations.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backup-cronjob-netpol
namespace: database
spec:
podSelector:
matchLabels:
app: database-backup
policyTypes:
- Egress
- Ingress
ingress: [] # No inbound connections to CronJob pods.
egress:
# Only to the database.
- to:
- podSelector:
matchLabels:
app: postgresql
ports:
- port: 5432
# Only to the backup storage (S3 endpoint).
- to:
- ipBlock:
cidr: 10.0.50.0/24 # Internal S3/object storage.
ports:
- port: 443
# DNS.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- port: 53
protocol: UDP
Step 6: Monitoring and Alerting for CronJob Health
# Prometheus rules for CronJob monitoring.
groups:
- name: cronjob-security
rules:
# Alert if a CronJob hasn't run in twice its expected interval.
- alert: CronJobMissedSchedule
expr: |
time() - kube_cronjob_status_last_schedule_time{cronjob="database-backup"} > 90000
for: 5m
labels:
severity: warning
annotations:
summary: "CronJob {{ $labels.cronjob }} has not run in 25 hours"
# Alert if a Job is running longer than expected.
- alert: CronJobRunningTooLong
expr: |
kube_job_status_active{job_name=~"database-backup-.*"} > 0
and
(time() - kube_job_status_start_time{job_name=~"database-backup-.*"}) > 7200
for: 5m
labels:
severity: warning
annotations:
summary: "Backup job running for over 2 hours"
# Alert if failed jobs are accumulating.
- alert: CronJobFailureAccumulating
expr: |
kube_cronjob_status_failed{} > 2
for: 10m
labels:
severity: warning
annotations:
summary: "CronJob {{ $labels.cronjob }} has {{ $value }} recent failures"
Step 7: Audit Logging for CronJob Activity
# Kubernetes audit policy: log CronJob execution.
# /etc/kubernetes/audit-policy.yaml
- level: RequestResponse
resources:
- group: "batch"
resources: ["jobs", "cronjobs"]
verbs: ["create", "delete", "patch", "update"]
namespaces: ["production", "database"]
# Alert on unexpected CronJob modifications.
# Any change to a production CronJob outside CI/CD should trigger review.
Step 8: Telemetry
kube_cronjob_status_last_schedule_time{cronjob, namespace} gauge
kube_cronjob_status_failed{cronjob, namespace} gauge
kube_job_status_active{job_name, namespace} gauge
kube_job_duration_seconds{job_name} histogram
kube_pod_status_phase{pod, namespace, phase} gauge
cronjob_credential_in_env_var_total{cronjob, namespace} gauge # Should be 0.
cronjob_ttl_not_set_total{namespace} gauge # Should be 0.
Alert on:
kube_cronjob_status_failed> 0 — a scheduled job is failing; investigate before next run.kube_job_status_activenon-zero for longer thanactiveDeadlineSeconds— job has hung; investigate.- CronJob pod in
Succeededphase older thanttlSecondsAfterFinished— TTL controller may not be functioning. cronjob_credential_in_env_var_totalnon-zero — a CronJob is exposing credentials in environment variables.
Expected Behaviour
| Signal | Default CronJob | Hardened CronJob |
|---|---|---|
| Completed pod with credentials | Env vars readable via kubectl describe | Credentials in volume files; not in pod description |
| Service account scope | Often namespace-wide or cluster-wide | Scoped to specific named resources |
| Concurrent runs | Allow (default) — may corrupt shared state | Forbid — only one instance runs at a time |
| Completed pod accumulation | Infinite; forensic evidence persists | TTL deletes after 1 hour |
| Image between runs | :latest — may change silently | Digest-pinned — same binary every run |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
concurrencyPolicy: Forbid |
Prevents race conditions | Next run skipped if current hasn’t finished | Increase activeDeadlineSeconds; alert on missed runs |
Short ttlSecondsAfterFinished |
Limits credential exposure window | Logs lost before debugging | Ship logs to centralised logging before TTL expires |
automountServiceAccountToken: false |
Prevents SA token misuse | CronJob cannot call Kubernetes API | Only disable if the CronJob doesn’t need API access |
| Credentials in volume files | Not visible in pod description | Slightly more complex scripting | Standardise on file-based credential reading |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
concurrencyPolicy: Forbid skips run |
Backup not taken for 24h if previous run hung | Missed schedule alert | Investigate hung run; manually delete stuck job; next run proceeds |
| TTL deletes job before debugging | Developer cannot inspect failed pod logs | Logs already shipped to Loki/Elasticsearch | Increase TTL for failed jobs (failedJobsHistoryLimit); ship logs |
| Service account too restrictive | Job fails with 403 | Job failure alert; audit log shows permission denied | Add specific required permission; avoid broadening to secrets: * |
startingDeadlineSeconds skips job |
Node pressure delays pod start; job never runs | Missed schedule alert | Investigate node pressure; increase startingDeadlineSeconds |