Kubernetes Forensics After Compromise: Reconstructing the Attack Timeline
The Problem
Kubernetes is optimised for ephemerality. Pods are scheduled, run, evicted, and rescheduled continuously. Containers restart on failure. Nodes are drained and replaced during upgrades. The same properties that make Kubernetes operationally resilient make it a forensic nightmare: the primary evidence artifacts — container filesystems, in-memory state, process trees, and network connections — vanish the moment a pod is deleted or a container restarts. An attacker who understands this can execute a complete intrusion, exfiltrate data, and establish persistence, then have most of the evidence automatically cleaned up by normal cluster operations before any investigation begins.
But Kubernetes is also an API-first system. Every operation that matters (pod creation, secret reads, exec sessions, RBAC binding creation) passes through the API server and can be logged. The container runtime captures each container's stdout and stderr and writes them to log files on the node filesystem. etcd holds the full current state of the cluster. Cloud providers log control plane API calls independently of the cluster. The evidence exists; it just requires knowing exactly where to look and moving faster than automated rotation and deletion processes.
This article is an operational guide to Kubernetes forensics after a suspected compromise. It covers what disappears when a pod is deleted, what survives and where to find it, the specific commands to extract each evidence source, and how to reconstruct a complete attack timeline from multiple correlated sources. It also covers the hardening and detection configuration that makes future investigations viable — because an investigation into a cluster with no audit policy enabled is an investigation into almost nothing.
What Disappears and What Survives
Understanding the evidence lifecycle is the first step in any Kubernetes forensic investigation. Knowing what is gone is as important as knowing what remains.
What disappears when a pod is deleted:
- Container filesystem layers. Unless data was written to a mounted PersistentVolume, every file written by the container during its lifetime is lost when the pod is deleted and containerd garbage-collects the overlay filesystem layers.
- Container memory. Process memory, decrypted secrets held in memory, open file handles — all gone with the container process. There is no equivalent of a memory dump you can take post-facto from a deleted pod.
- Pod logs at the Kubernetes API level. kubectl logs <pod> hits the kubelet, which reads from the node filesystem. Once the pod is gone and the node has rotated its log files, that path is closed.
- Network connections and conntrack state. The ephemeral port connections the container made are gone. Unless you were running a network flow exporter (Cilium Hubble, Calico flow logs, AWS VPC Flow Logs) they are unrecoverable.
- Temporary files in container /tmp. Dropped payloads, intermediate scripts, and downloaded tools are all gone with the overlay layer.
What survives pod deletion:
- Kubernetes API server audit log, if enabled and shipped to an external destination. This is the most forensically valuable artifact in the cluster.
- Kubelet pod logs at /var/log/pods/ on the node. These survive pod deletion until the kubelet’s log rotation policy removes them, typically hours to days depending on configuration.
- Container runtime state: containerd keeps image layers in its content store even after containers using them are gone. A malicious image that was pulled and run leaves its layers on the node until containerd GC runs.
- Kubernetes Events, for one hour by default (configurable via --event-ttl on the API server). Events capture pod lifecycle, OOMKills, failed image pulls, and probe failures. Interactive exec sessions do not generate Events; they appear only in the audit log.
- PersistentVolume data. PVs survive the pod lifecycle by design. Any data the attacker wrote to a mounted volume is recoverable.
- etcd state: the current state of all Kubernetes objects. etcd keeps older revisions only until the next compaction, which the API server requests every few minutes by default, so historical writes are in practice recoverable only from the audit log.
- Cloud provider audit logs: EKS API calls appear in CloudTrail, GKE in Cloud Audit Logs, AKS in Azure Monitor. These are independent of the cluster and cannot be tampered with from within the cluster.
- Node filesystem artifacts outside /var/log/pods/: /var/lib/containerd/, /var/lib/kubelet/pods/, and any files written by a container escape to the host filesystem.
The investigation order follows this hierarchy: API server audit log first, then kubelet pod logs, then container runtime artifacts, then cloud provider logs, then etcd snapshot.
Evidence Source 1: API Server Audit Log
The API server audit log is the single most important forensic artifact in a Kubernetes cluster. Every API call that reaches the API server is recorded: timestamp, user identity (username, UID, groups), source IP, impersonated identity (if any), resource, subresource, verb, HTTP status code, and — depending on audit level — the full request and response body.
On self-hosted clusters, the audit log is typically written to /var/log/kubernetes/audit.log on the control plane node. The exact path is configured with --audit-log-path on the kube-apiserver process. On managed clusters, the location varies: EKS streams audit logs to CloudWatch Logs under the /aws/eks/<cluster-name>/cluster log group; GKE sends them to Cloud Logging under the k8s_cluster resource; AKS sends them to a Log Analytics workspace.
The audit log is newline-delimited JSON. Each line is a complete audit.k8s.io/v1 Event object. jq is the primary tool for querying it.
Find all API actions by a specific service account during the incident window:
grep "system:serviceaccount:production:app" /var/log/kubernetes/audit.log | \
jq '{time: .requestReceivedTimestamp,
verb: .verb,
resource: .objectRef.resource,
subresource: .objectRef.subresource,
name: .objectRef.name,
namespace: .objectRef.namespace,
sourceIP: .sourceIPs[0],
status: .responseStatus.code}' | head -50
Find all secret reads (list and get) — the most common post-initial-access action:
grep '"resource":"secrets"' /var/log/kubernetes/audit.log | \
grep '"verb":"get"\|"verb":"list"' | \
jq '{time: .requestReceivedTimestamp,
user: .user.username,
secret: .objectRef.name,
ns: .objectRef.namespace,
sourceIP: .sourceIPs[0]}'
Find ClusterRoleBinding creation — the primary backdoor persistence mechanism in Kubernetes:
grep '"resource":"clusterrolebindings"' /var/log/kubernetes/audit.log | \
grep '"verb":"create"' | \
jq '{time: .requestReceivedTimestamp,
user: .user.username,
binding: .requestObject.metadata.name,
subject: .requestObject.subjects,
role: .requestObject.roleRef.name,
sourceIP: .sourceIPs[0]}'
Find all pods/exec and pods/attach operations (interactive access to running pods). The command travels as command= query parameters on the request URI rather than in a request body, so extract it from requestURI:
grep '"subresource":"exec"\|"subresource":"attach"' /var/log/kubernetes/audit.log | \
jq '{time: .requestReceivedTimestamp,
user: .user.username,
pod: .objectRef.name,
ns: .objectRef.namespace,
command: .requestURI,
sourceIP: .sourceIPs[0]}'
Find operations from an unexpected source IP (replace with the IP range you consider internal):
grep '"verb":"create"\|"verb":"delete"\|"verb":"patch"' /var/log/kubernetes/audit.log | \
jq 'select(.sourceIPs[0] | startswith("10.") | not) |
{time: .requestReceivedTimestamp,
user: .user.username,
verb: .verb,
resource: .objectRef.resource,
name: .objectRef.name,
sourceIP: .sourceIPs[0]}'
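The startswith("10.") test above covers only one private range. As a minimal sketch of the same filter with proper CIDR matching, the following Python uses the standard ipaddress module. The INTERNAL_NETS list and the function names are illustrative assumptions, not part of any Kubernetes tooling; adjust the ranges to whatever your environment treats as internal.

```python
import ipaddress
import json

# Assumption: these CIDRs define "internal" for this cluster.
INTERNAL_NETS = [ipaddress.ip_network(c) for c in
                 ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")]

WRITE_VERBS = {"create", "delete", "patch", "update"}


def is_internal(ip: str) -> bool:
    """Return True if ip falls inside any configured internal network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # unparseable values are treated as external so they get reviewed
    return any(addr in net for net in INTERNAL_NETS)


def external_writes(lines):
    """Yield (time, user, verb, resource, ip) for write verbs from external IPs."""
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if ev.get("verb") not in WRITE_VERBS:
            continue
        ip = (ev.get("sourceIPs") or [""])[0]
        if not is_internal(ip):
            yield (ev.get("requestReceivedTimestamp"),
                   (ev.get("user") or {}).get("username"),
                   ev["verb"],
                   (ev.get("objectRef") or {}).get("resource"),
                   ip)
```

Pass it an open file handle on the audit log; it streams line by line, so it works on logs too large to load into memory.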
Find impersonation — an attacker with impersonation rights (or a compromised service account that has them) can act as any user in the cluster:
grep '"impersonatedUser"' /var/log/kubernetes/audit.log | \
jq '{time: .requestReceivedTimestamp,
realUser: .user.username,
impersonating: .impersonatedUser.username,
verb: .verb,
resource: .objectRef.resource,
sourceIP: .sourceIPs[0]}'
The requestReceivedTimestamp field in audit events uses RFC 3339 with microsecond precision. All timestamps are UTC. When correlating with node system logs or cloud provider logs, convert to the same timezone before building the timeline.
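The normalisation step can be sketched in Python as follows. This is a minimal example, assuming newline-delimited audit events as described above; events_in_window and parse_ts are hypothetical helper names. It converts every timestamp to an aware UTC datetime before comparing, so sources that record a numeric offset line up with the audit log.

```python
import json
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp (Z or numeric offset) into aware UTC.

    Assumes at most microsecond precision, as in audit events.
    """
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def events_in_window(lines, start, end, username=None):
    """Return [(when, event)] inside [start, end], sorted chronologically."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        ts = ev.get("requestReceivedTimestamp")
        if not ts:
            continue
        when = parse_ts(ts)
        if not (start <= when <= end):
            continue
        if username and (ev.get("user") or {}).get("username") != username:
            continue
        hits.append((when, ev))
    hits.sort(key=lambda pair: pair[0])
    return hits
```

Both start and end must be timezone-aware datetimes; comparing naive and aware values raises a TypeError, which is a useful guard against silently mixing local time into the timeline.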
Evidence Source 2: Kubelet Pod Logs at /var/log/pods/
The kubelet writes container stdout and stderr to the node filesystem under /var/log/pods/. The path structure is:
/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<rotation-index>.log
These files survive pod deletion until kubelet log rotation removes them. The rotation policy is controlled by containerLogMaxSize and containerLogMaxFiles in the kubelet configuration, defaulting to 10 MiB and 5 files respectively. On a busy node with small containers, this can mean log files are rotated and overwritten within hours of pod deletion. On a quiet node, they can persist for days.
To find log files for a specific pod (even a deleted one, as long as you know the pod UID from the audit log):
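# Get pod UID from audit log. The server assigns the UID, so it appears in
# responseObject (RequestResponse level) or in objectRef on later requests:
grep '"name":"compromised-pod"' /var/log/kubernetes/audit.log | \
jq -r '.objectRef.namespace + "/" + .objectRef.name + " uid: " + (.responseObject.metadata.uid // .objectRef.uid // "unknown")' | \
head -5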
# Then find its logs on the node:
ls /var/log/pods/production_compromised-pod_<uid>/
If you do not know the pod UID, search by namespace and pod name prefix:
ls /var/log/pods/ | grep "^production_"
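When many directories match, it can help to split the names into their parts programmatically. The sketch below relies on the <namespace>_<pod-name>_<pod-uid> layout shown above and on the fact that namespace and pod names cannot contain underscores. PodLogDir and parse_pod_log_dir are hypothetical names used for illustration only.

```python
from typing import NamedTuple, Optional


class PodLogDir(NamedTuple):
    namespace: str
    pod: str
    uid: str


def parse_pod_log_dir(dirname: str) -> Optional[PodLogDir]:
    """Split '<namespace>_<pod-name>_<pod-uid>' into its three parts.

    Returns None for names that do not follow the layout.
    """
    parts = dirname.split("_")
    if len(parts) != 3 or not all(parts):
        return None
    return PodLogDir(*parts)


def dirs_for_namespace(dirnames, namespace):
    """Filter a directory listing down to one namespace."""
    parsed = (parse_pod_log_dir(d) for d in dirnames)
    return [p for p in parsed if p is not None and p.namespace == namespace]
```

For example, feeding it os.listdir("/var/log/pods") on the node yields the pod names and UIDs for a namespace, which can then be matched against UIDs recovered from the audit log.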
Search all production pod logs for indicators of command execution:
grep -rE "\b(curl|wget|nc|ncat)\b|python3 -c|bash -i|/bin/sh -c|chmod \+x|base64 -d" \
/var/log/pods/production_*/
Search for outbound connection attempts (exfiltration):
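# Extract IP:port pairs first so the private-range filter is anchored to the
# address itself rather than to digits elsewhere in the line (e.g. timestamps):
grep -rhoE "([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{2,5}" \
/var/log/pods/production_*/ | \
grep -vE "^(127\.|10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)" | \
sort | uniq -c | sort -rn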
The kubelet pod log format includes a timestamp prefix and stream indicator. Each line is structured as:
2026-05-08T14:23:01.123456789Z stdout F <actual log line>
The F indicates a full line; P indicates a partial line that continues on the next record. Strip the prefix when grepping for content patterns, or use jq if the container was using JSON logging:
grep -h "" /var/log/pods/production_web-*/web/0.log | \
awk '{$1=$2=$3=""; print $0}' | \
grep -E "exec|eval|shell|reverse"
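Stripping the prefix with awk loses the timestamp and does not rejoin partial records. As a minimal sketch, assuming the line layout shown above (timestamp, stream, tag, message), the following Python keeps the timestamp and reassembles P records into full lines. parse_cri_lines is a hypothetical helper name.

```python
from typing import Iterable, Iterator, Tuple


def parse_cri_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (timestamp, stream, message), joining P (partial) records.

    The timestamp reported for a joined line is that of its first fragment.
    """
    pending = {}  # stream -> (first_timestamp, [fragments])
    for raw in lines:
        parts = raw.rstrip("\n").split(" ", 3)
        if len(parts) < 3:
            continue  # not a CRI-formatted record
        ts, stream = parts[0], parts[1]
        tag = parts[2].split(":")[0]  # first tag is P or F
        msg = parts[3] if len(parts) == 4 else ""
        first_ts, frags = pending.pop(stream, (ts, []))
        frags.append(msg)
        if tag == "P":
            pending[stream] = (first_ts, frags)
        else:
            yield first_ts, stream, "".join(frags)
```

Fragments are tracked per stream, so interleaved stdout and stderr records are not merged into each other.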
Evidence Source 3: Container Runtime Artifacts
containerd maintains an image content store and a snapshot store on the node filesystem, under /var/lib/containerd/. Even after a container is deleted, its image layers remain in the content store until containerd’s garbage collector runs. This gives you a forensic window to inspect what image was running.
List all images currently in the containerd cache, including recently pulled ones:
ctr --namespace k8s.io images ls
Look for images that were not part of your known workload inventory, such as unusual registries or unexpected image names. ctr does not print pull times, so correlate anything suspicious with the image pull records in the audit log and Events:
ctr --namespace k8s.io images ls | \
grep -v "registry.k8s.io\|docker.io/library\|your-internal-registry" | \
sort
List all containers including stopped ones (containerd retains metadata even for non-running containers briefly):
ctr --namespace k8s.io containers ls
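Export a container’s snapshot filesystem for offline forensic analysis:
# Get the snapshot key for a container:
ctr --namespace k8s.io containers info <container-id> | jq -r '.SnapshotKey'
# Print the mount command for that snapshot (ctr prints it, it does not mount):
mkdir -p /mnt/forensic-snapshot
ctr --namespace k8s.io snapshots mounts /mnt/forensic-snapshot <snapshot-key>
# Run the printed mount command with the ro option added, then archive the tree:
tar -C /mnt/forensic-snapshot -cf /tmp/forensic-snapshot-$(date +%s).tar .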
Alternatively, use crictl (CRI-compatible tool that works with both containerd and CRI-O):
# List all pod sandboxes (every state is listed by default):
crictl pods
# List all containers including stopped:
crictl ps --all
# Inspect a specific container (on containerd, the arguments are in the runtime spec):
crictl inspect <container-id> | jq '{image: .status.image, command: .info.runtimeSpec.process.args, state: .status.state}'
For forensic image analysis without running the image, export the image layers:
# Save the image to a tar file:
ctr --namespace k8s.io images export \
/tmp/suspicious-image-$(date +%s).tar \
<image-reference>
# Extract and analyse the layers:
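mkdir -p /tmp/image-analysis
tar xf /tmp/suspicious-image-*.tar -C /tmp/image-analysis
# ctr exports an OCI layout: layers are blobs under blobs/sha256/ with no .tar
# extension. List every blob that is a tar archive and look for embedded tools,
# scripts, or modified binaries (JSON manifests and configs fail to list and are skipped):
for blob in /tmp/image-analysis/blobs/sha256/*; do
tar tf "$blob" 2>/dev/null
done | grep -E "usr/bin/|usr/local/bin/|tmp/|\.sh$|\.py$"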
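The same layer walk can be sketched in Python with the standard tarfile module, which detects compression automatically. This assumes the export was unpacked into an OCI layout with layer blobs under blobs/sha256/; scan_oci_layout and the SUSPICIOUS pattern are illustrative names and heuristics, not an established tool.

```python
import re
import tarfile
from pathlib import Path

# Assumption: paths worth a closer look; tune for your own images.
SUSPICIOUS = re.compile(r"(^|/)(tmp|usr/local/bin)/|\.(sh|py)$")


def scan_oci_layout(layout_dir: str):
    """Yield (blob_name, member_path) for suspicious paths inside layer blobs."""
    blobs = Path(layout_dir) / "blobs" / "sha256"
    for blob in sorted(blobs.iterdir()):
        try:
            # "r:*" lets tarfile pick gzip, bzip2, xz, or plain tar.
            with tarfile.open(blob, mode="r:*") as layer:
                for member in layer:
                    if SUSPICIOUS.search(member.name):
                        yield blob.name, member.name
        except tarfile.TarError:
            continue  # manifests and image configs are JSON, not tar
```

Because the blob name is the layer digest, each hit can be tied back to a specific layer in the image manifest.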
Evidence Source 4: Kubernetes Events
Kubernetes Events are objects in the API server that record noteworthy occurrences: pod scheduling decisions, image pull results, container kills, OOMKills, and probe failures. Interactive exec sessions are not recorded as Events; those appear only in the audit log. Events are stored in etcd and exposed via the API. The default TTL is one hour after the last occurrence (--event-ttl on the API server), but this is configurable and should be extended to at least 24 hours for any cluster where you care about forensic capability.
Events survive pod deletion for their full TTL. A Killing event recorded at 14:23 will still be visible at 15:00 even if the pod was deleted at 14:30.
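The arithmetic behind that window is simple but worth making explicit during triage. A small sketch, assuming the one-hour default TTL and an Event that carries a lastTimestamp field; event_expiry and minutes_remaining are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(hours=1)  # assumption: --event-ttl left at its default


def event_expiry(last_timestamp: str, ttl: timedelta = DEFAULT_TTL) -> datetime:
    """Return the time at which an Event last seen at last_timestamp expires."""
    last = datetime.fromisoformat(last_timestamp.replace("Z", "+00:00"))
    return last + ttl


def minutes_remaining(last_timestamp: str, now: datetime,
                      ttl: timedelta = DEFAULT_TTL) -> float:
    """Minutes left to collect the Event; negative means it has expired."""
    return (event_expiry(last_timestamp, ttl) - now).total_seconds() / 60
```

For the example above, an Event last seen at 14:23 and inspected at 15:00 leaves 23 minutes under the default TTL, which is the budget for exporting Events before they are gone.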
Get all events across all namespaces, sorted chronologically, filtered to warnings and anomalies:
kubectl get events --all-namespaces \
--sort-by='.metadata.creationTimestamp' \
-o json | \
jq '.items[] | select(.type == "Warning" or .reason == "Killing" or .reason == "OOMKilling") |
{time: .firstTimestamp,
last: .lastTimestamp,
count: .count,
ns: .metadata.namespace,
reason: .reason,
object: .involvedObject.name,
message: .message}'
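Find exec-related events. Interactive kubectl exec sessions do not generate Events, but failed exec probes and exec lifecycle hooks do, and their messages can include the command or its output: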
kubectl get events --all-namespaces \
--sort-by='.metadata.creationTimestamp' \
-o json | \
jq '.items[] | select(.reason == "Exec" or (.message | test("exec";"i"))) |
{time: .firstTimestamp,
ns: .metadata.namespace,
pod: .involvedObject.name,
message: .message}'
Find image pull events for unexpected images:
kubectl get events --all-namespaces \
--sort-by='.metadata.creationTimestamp' \
-o json | \
jq '.items[] | select(.reason == "Pulled" or .reason == "Pulling") |
{time: .firstTimestamp,
ns: .metadata.namespace,
pod: .involvedObject.name,
message: .message}' | \
grep -v "your-internal-registry"
Evidence Source 5: etcd Snapshot Preservation
etcd holds the complete current state of the cluster: all pods, deployments, RBAC bindings, secrets (base64-encoded but present), configmaps, and custom resources. Taking a snapshot immediately when an incident is suspected preserves the state before the attacker, automated systems, or remediation actions clean up evidence.
Take a forensic snapshot of current etcd state:
ETCDCTL_API=3 etcdctl snapshot save \
/tmp/etcd-forensic-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verify the snapshot integrity:
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-forensic-*.db \
--write-out=table
To query the snapshot content without restoring a full cluster, use etcdhelper (from the openshift/origin toolbox) or restore the snapshot to a temporary etcd instance:
# Restore to a temporary data directory:
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-forensic-*.db \
--data-dir /tmp/etcd-forensic-restored \
--name forensic-node \
--initial-cluster forensic-node=https://127.0.0.1:2380 \
--initial-advertise-peer-urls https://127.0.0.1:2380
# Start a temporary etcd against the restored data:
etcd --data-dir /tmp/etcd-forensic-restored \
--listen-client-urls http://127.0.0.1:2399 \
--advertise-client-urls http://127.0.0.1:2399 &
# Query all keys from the snapshot:
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2399 \
get "" --prefix --keys-only | grep "clusterrolebinding\|secret\|pod"
A particularly useful query is enumerating all ClusterRoleBindings in the snapshot to identify backdoors:
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2399 \
get /registry/clusterrolebindings/ --prefix | \
strings | grep -A2 "cluster-admin"
etcd stores values as protobuf, not raw JSON, so direct inspection requires decoding. Use etcdhelper for clean JSON output if available, or restore to a Kubernetes API server and use kubectl against the restored cluster.
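Key names, unlike values, are plain text, so the --keys-only output can be triaged without any decoding. The sketch below assumes the default /registry prefix and handles only a handful of core resource types whose key layout is /registry/<resource>/<namespace>/<name> or /registry/<resource>/<name>. Keys for other API groups and custom resources use additional segments and are deliberately ignored. The resource sets and function name are illustrative.

```python
# Assumption: a small, hand-picked set of forensically interesting resources.
CLUSTER_SCOPED = {"clusterrolebindings", "clusterroles", "namespaces"}
NAMESPACED = {"secrets", "pods", "rolebindings", "roles",
              "serviceaccounts", "configmaps"}


def parse_registry_key(key: str):
    """Return (resource, namespace, name), or None for keys not handled here."""
    parts = key.strip().strip("/").split("/")
    if len(parts) < 3 or parts[0] != "registry":
        return None
    resource = parts[1]
    if resource in CLUSTER_SCOPED and len(parts) == 3:
        return resource, None, parts[2]
    if resource in NAMESPACED and len(parts) == 4:
        return resource, parts[2], parts[3]
    return None
```

Running it over the key listing and grouping by resource gives a quick inventory of bindings, secrets, and service accounts present at snapshot time.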
Threat Model
A sophisticated attacker with cluster access has two goals in the evidence domain: exploit the ephemerality of Kubernetes to have evidence automatically destroyed, and avoid generating high-signal artifacts in the sources they cannot control.
What attackers can clean up:
- They can delete created pods, deployments, and jobs — removing them from the current API state.
- They can delete RBAC bindings they created — but only if they still have sufficient permissions.
- They can delete Kubernetes Secrets they created for C2 communication.
- They can overwrite or delete files within a running container.
What attackers cannot clean up:
- API server audit log entries that have already been written. An attacker who creates a ClusterRoleBinding generates an audit event the moment the API server processes the request. Deleting the ClusterRoleBinding generates a second audit event. Both are permanent if the log has been shipped externally.
- Kubelet pod logs that have already been shipped to an external system. Copies that exist only on the node filesystem remain exposed to rotation and to any attacker with node access.
- Cloud provider audit logs. An attacker inside a GKE pod cannot modify GKE Cloud Audit Logs. An attacker who exploited a node cannot retroactively remove EKS CloudTrail entries for API calls made before the compromise.
- Container image layers in the node’s content store — until containerd GC runs.
Evidence-conscious attacker behaviour:
Sophisticated attackers adapt to the logging environment. Common patterns seen in Kubernetes intrusions:
- Using existing service accounts with excess permissions rather than creating new ones — avoids ClusterRoleBinding creation events.
- Using kubectl exec into existing pods rather than deploying new workloads, which avoids pod creation events and image pull events. The exec is still logged if audit policy captures pods/exec.
- Exfiltrating data through existing application paths (writing to a mounted PVC, using the application’s outbound HTTP connections) rather than opening new network connections.
- Reading secrets that are already injected into a compromised pod as environment variables or mounted volumes, rather than requesting them from the API, which avoids generating API server audit entries for secret reads.
- Timing operations to occur during high-traffic periods when log volume is highest — not to avoid logging, but to make it harder to filter their events from legitimate traffic in an investigation.
- Deleting created resources immediately after use, relying on the 1-hour Event TTL and log rotation to eventually clean up evidence.
The counter to all of these: ship audit logs externally immediately, extend Event TTL, and use Falco or eBPF-based runtime detection for real-time alerting on exec operations and suspicious network connections, regardless of the audit logging configuration.
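The create-then-delete pattern in particular is easy to surface once audit events are in hand. A minimal sketch, assuming events have already been normalised into dicts with time, verb, resource, namespace, and name keys (a hypothetical shape chosen for this example), and using an arbitrary ten-minute threshold:

```python
from datetime import datetime, timedelta


def short_lived_objects(events, max_lifetime=timedelta(minutes=10)):
    """Return [(key, created_at, deleted_at, lifetime)] for quick create/delete pairs.

    events: iterable of dicts with keys time (datetime), verb, resource,
    namespace, name. The key is (resource, namespace, name).
    """
    created = {}
    findings = []
    for ev in sorted(events, key=lambda e: e["time"]):
        key = (ev["resource"], ev.get("namespace", ""), ev["name"])
        if ev["verb"] == "create":
            created[key] = ev["time"]
        elif ev["verb"] == "delete" and key in created:
            born = created.pop(key)
            lifetime = ev["time"] - born
            if lifetime <= max_lifetime:
                findings.append((key, born, ev["time"], lifetime))
    return findings
```

Legitimate controllers also create and delete objects quickly (Jobs, rollout ReplicaSets), so the output is a list of leads to check against the acting user, not a verdict.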
Hardening Configuration
1. Kubernetes API Server Audit Policy
The audit policy controls what gets logged and at what detail level. There are four levels: None (no logging), Metadata (request metadata only — no request or response body), Request (request body included), and RequestResponse (both request and response body). The policy is a list of ordered rules; the first matching rule applies.
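Because rule order decides the outcome, it can help to reason about a policy with a small model before deploying it. The following Python is a deliberately simplified sketch of first-match evaluation: it considers only verbs, API group, and resource names, and ignores users, namespaces, non-resource URLs, and wildcards that the real policy format supports. match_level is a hypothetical name, and the rules are plain dicts mirroring the YAML structure.

```python
def match_level(rules, verb, group, resource):
    """Return the level of the first matching rule, or "None" if nothing matches."""
    for rule in rules:
        verbs = rule.get("verbs")
        if verbs and verb not in verbs:
            continue
        res_rules = rule.get("resources")
        if res_rules and not any(
            r.get("group", "") == group
            and (not r.get("resources") or resource in r["resources"])
            for r in res_rules
        ):
            continue
        return rule["level"]  # first match wins; later rules are never consulted
    return "None"


rules = [
    {"level": "RequestResponse", "verbs": ["create"],
     "resources": [{"group": "", "resources": ["pods/exec"]}]},
    {"level": "Metadata",
     "resources": [{"group": "", "resources": ["secrets"]}]},
    {"level": "Metadata"},  # catch-all
]
```

With these rules, an exec is logged at RequestResponse. Moving the catch-all to the top would downgrade it to Metadata, which is the ordering mistake this model makes visible.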
A policy that captures high-value forensic events without generating unmanageable log volume:
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage for every rule. Each request is still
# logged when it completes, which avoids a duplicate entry per request.
omitStages:
  - RequestReceived
rules:
  # Log exec, attach, and port-forward at RequestResponse. The command is
  # recorded in requestURI. Both verbs are listed because, depending on the
  # client, sessions arrive as POST (create) or as a WebSocket GET (get).
  - level: RequestResponse
    verbs: ["create", "get"]
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  # Log all RBAC writes at RequestResponse: captures the full binding/role being created
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete", "deletecollection"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources:
          - "roles"
          - "rolebindings"
          - "clusterroles"
          - "clusterrolebindings"
  # Log secret access at Metadata: captures who, not what (avoids logging secret values)
  - level: Metadata
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["secrets"]
  # Log configmap reads at Metadata (configmaps frequently hold sensitive config)
  - level: Metadata
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["configmaps"]
  # Log all pod writes at Request level: captures pod spec including image, env vars, volumes
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["pods", "replicationcontrollers"]
      - group: "apps"
        resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
  # Log service account token creation (used for credential theft)
  - level: Request
    verbs: ["create"]
    resources:
      - group: ""
        resources: ["serviceaccounts/token"]
  # Log node and namespace operations
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["namespaces", "nodes"]
  # Log requests to non-resource API paths, including the RequestReceived stage.
  # Policy rules cannot match on response code, so 401 and 403 responses are
  # captured by whichever rule matches the request; filter on
  # responseStatus.code at query time to find authentication failures.
  - level: Metadata
    omitStages: []
    nonResourceURLs:
      - "/api*"
      - "/apis*"
  # Default: log everything else at Metadata level
  - level: Metadata
Apply this policy to kube-apiserver with:
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10
--audit-log-maxsize=100
For managed clusters: EKS enables control plane logging per log type, so enable the audit type in the cluster's logging configuration. GKE writes Kubernetes audit logs to Cloud Audit Logs without extra cluster configuration; Admin Activity entries are always recorded, while Data Access entries must be enabled separately. Both stream to their respective external log services automatically, solving the local storage problem.
2. Ship Logs to Immutable External Store
Audit logs stored only on the control plane node are vulnerable to destruction if the node is compromised or if the attacker deletes the log files. Ship audit logs to an external immutable store as the primary forensic record.
Fluent Bit DaemonSet configuration for shipping pod logs and audit logs to S3 with Object Lock:
# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf
    [INPUT]
        Name              tail
        # One path segment per directory level: <ns>_<pod>_<uid>/<container>/<n>.log
        Path              /var/log/pods/*/*/*.log
        Tag               pod.*
        Parser            cri
        DB                /var/log/flb_pod.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10
    [INPUT]
        Name              tail
        Path              /var/log/kubernetes/audit.log
        Tag               k8s.audit
        Parser            json
        DB                /var/log/flb_audit.db
        Mem_Buf_Limit     100MB
        Buffer_Max_Size   10MB
    [FILTER]
        Name    record_modifier
        Match   *
        Record  node ${NODE_NAME}
        Record  cluster ${CLUSTER_NAME}
    [OUTPUT]
        Name             s3
        Match            k8s.audit
        bucket           forensic-logs-immutable
        region           us-east-1
        s3_key_format    /audit/%Y/%m/%d/%H/${NODE_NAME}-%M-%S-$UUID.json.gz
        compression      gzip
        total_file_size  100M
        upload_timeout   10m
        # S3 Object Lock: governance mode, 90-day retention
        # Configured on the bucket, not the agent
        use_put_object   On
    [OUTPUT]
        Name             s3
        Match            pod.*
        bucket           forensic-logs-immutable
        region           us-east-1
        s3_key_format    /pods/%Y/%m/%d/%H/${NODE_NAME}-%M-%S-$UUID.json.gz
        compression      gzip
        total_file_size  100M
        upload_timeout   10m
The S3 bucket must have Object Lock enabled with governance mode and a minimum 90-day retention period. This prevents the objects from being deleted even if an attacker compromises AWS credentials used by the log shipper — governance mode requires a specific s3:BypassGovernanceRetention permission that should not be granted to the log shipper role.
3. Forensic Snapshot Script
When a compromise is suspected, run this script immediately on the control plane node (or from a workstation with cluster admin access). It captures point-in-time evidence before automated cleanup or attacker remediation destroys it.
#!/bin/bash
# k8s-forensic-snapshot.sh
# Run immediately when a Kubernetes compromise is suspected.
# Requires: cluster-admin kubeconfig, etcdctl, kubectl, ss, tar
set -euo pipefail
INCIDENT_ID="${1:-$(date +%Y%m%d-%H%M%S)}"
INCIDENT_DIR="/tmp/k8s-incident-${INCIDENT_ID}"
mkdir -p "${INCIDENT_DIR}"
echo "[*] Collecting Kubernetes forensic evidence to ${INCIDENT_DIR}"
echo "[*] Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# --- API server state (current) ---
echo "[*] Collecting current API server state..."
kubectl get all --all-namespaces -o json \
> "${INCIDENT_DIR}/all-resources.json" 2>&1 || true
kubectl get events --all-namespaces -o json \
> "${INCIDENT_DIR}/events.json" 2>&1 || true
kubectl get clusterrolebindings -o json \
> "${INCIDENT_DIR}/clusterrolebindings.json" 2>&1 || true
kubectl get clusterroles -o json \
> "${INCIDENT_DIR}/clusterroles.json" 2>&1 || true
kubectl get rolebindings --all-namespaces -o json \
> "${INCIDENT_DIR}/rolebindings.json" 2>&1 || true
kubectl get serviceaccounts --all-namespaces -o json \
> "${INCIDENT_DIR}/serviceaccounts.json" 2>&1 || true
# Collect RBAC subjects for cluster-admin — the most critical backdoor indicator
kubectl get clusterrolebindings -o json | \
jq '[.items[] | select(.roleRef.name == "cluster-admin") |
{name: .metadata.name,
created: .metadata.creationTimestamp,
subjects: .subjects}]' \
> "${INCIDENT_DIR}/cluster-admin-bindings.json" 2>&1 || true
# --- Audit log (recent) ---
echo "[*] Collecting recent audit log entries..."
if [[ -f /var/log/kubernetes/audit.log ]]; then
tail -50000 /var/log/kubernetes/audit.log \
> "${INCIDENT_DIR}/audit-recent-50k.log"
# Also grab any rotated audit logs
ls /var/log/kubernetes/audit.log.* 2>/dev/null | \
xargs -I{} cp {} "${INCIDENT_DIR}/" || true
else
echo "WARNING: /var/log/kubernetes/audit.log not found" \
> "${INCIDENT_DIR}/audit-not-found.txt"
fi
# --- etcd snapshot ---
echo "[*] Taking etcd snapshot..."
if command -v etcdctl &>/dev/null; then
ETCDCTL_API=3 etcdctl snapshot save \
"${INCIDENT_DIR}/etcd-snapshot.db" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key 2>&1 || \
echo "WARNING: etcd snapshot failed" > "${INCIDENT_DIR}/etcd-error.txt"
else
echo "WARNING: etcdctl not found — etcd snapshot skipped" \
> "${INCIDENT_DIR}/etcd-not-found.txt"
fi
# --- Node state ---
echo "[*] Collecting node state..."
ps auxww > "${INCIDENT_DIR}/processes.txt" 2>&1 || true
ss -tulnp > "${INCIDENT_DIR}/network-listening.txt" 2>&1 || true
ss -tnp > "${INCIDENT_DIR}/network-established.txt" 2>&1 || true
ip route > "${INCIDENT_DIR}/routes.txt" 2>&1 || true
iptables-save > "${INCIDENT_DIR}/iptables.txt" 2>&1 || true
ls -la /var/log/pods/ > "${INCIDENT_DIR}/pod-log-listing.txt" 2>&1 || true
# Capture currently running container list from containerd
if command -v ctr &>/dev/null; then
ctr --namespace k8s.io containers ls \
> "${INCIDENT_DIR}/containerd-containers.txt" 2>&1 || true
ctr --namespace k8s.io images ls \
> "${INCIDENT_DIR}/containerd-images.txt" 2>&1 || true
fi
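# Capture pod specs from kubelet data directory
find /var/lib/kubelet/pods -maxdepth 4 -name "*.json" 2>/dev/null | \
while IFS= read -r f; do
  pod_uid="$(basename "$(dirname "$f")")"
  # Include the file name so several JSON files from one pod do not overwrite each other
  cp "$f" "${INCIDENT_DIR}/kubelet-pod-${pod_uid}-$(basename "$f")" 2>/dev/null || true
done || true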
# --- Collect pod logs for target namespace (edit as needed) ---
TARGET_NS="${TARGET_NS:-production}"
echo "[*] Collecting pod logs for namespace: ${TARGET_NS}"
if [[ -d /var/log/pods ]]; then
  # Pod log directories are named <namespace>_<pod-name>_<pod-uid>, so the
  # namespace is the leading component of the directory name.
  find /var/log/pods -path "/var/log/pods/${TARGET_NS}_*" -name "*.log" 2>/dev/null | \
  head -100 | \
  while IFS= read -r logfile; do
    relpath="${logfile#/var/log/pods/}"
    destfile="${INCIDENT_DIR}/podlog-${relpath//\//-}"
    cp "$logfile" "$destfile" 2>/dev/null || true
  done || true
fi
# --- Archive and hash ---
echo "[*] Creating archive..."
ARCHIVE="${INCIDENT_DIR}.tar.gz"
tar czf "${ARCHIVE}" -C "$(dirname ${INCIDENT_DIR})" "$(basename ${INCIDENT_DIR})"
echo "[*] Computing SHA256 hash for evidence integrity..."
sha256sum "${ARCHIVE}" > "${ARCHIVE}.sha256"
echo ""
echo "[+] Forensic evidence collected:"
echo " Archive: ${ARCHIVE}"
echo " Hash: ${ARCHIVE}.sha256"
echo " Contents:"
ls -lh "${INCIDENT_DIR}/"
echo ""
echo "[!] Upload ${ARCHIVE} to your secure evidence store immediately."
echo "[!] Do NOT remediate or modify the cluster until evidence is secured."
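Once the archive has been copied to the evidence store, the recorded hash should be re-checked at the destination. A minimal Python sketch of that verification, assuming the sidecar file is in the sha256sum output format written by the script above (hex digest first, then the file name); the function names are illustrative.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sum(sidecar_path: str, archive_path: str) -> bool:
    """Compare the archive against the first field of a sha256sum-format file."""
    with open(sidecar_path) as f:
        expected = f.read().split()[0].lower()
    return sha256_file(archive_path) == expected
```

A False result means the archive changed after collection, and that copy should not be relied on as evidence without explaining the difference.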
4. Falco Real-Time Detection for Forensic-Value Events
Falco captures events at the kernel level, independent of the Kubernetes API server. This provides a separate evidence channel that an attacker cannot suppress through API manipulation.
Rules targeting the highest-forensic-value events:
- rule: kubectl exec into production pod
  desc: kubectl exec session initiated into a production namespace pod
  condition: >
    ka.target.namespace = "production" and
    ka.verb in (create, get) and
    ka.target.subresource = "exec"
  output: >
    FORENSIC kubectl-exec (user=%ka.user.name
    pod=%ka.target.name ns=%ka.target.namespace
    container=%ka.uri.param[container]
    command=%ka.uri.param[command]
    sourceIPs=%ka.sourceips)
  priority: WARNING
  source: k8saudit
- rule: ClusterRoleBinding created outside CI/CD
  desc: A ClusterRoleBinding was created by a non-CI service account
  condition: >
    ka.verb = "create" and
    ka.target.resource = "clusterrolebindings" and
    not ka.user.name in (
      "system:serviceaccount:ci:deploy-bot",
      "system:serviceaccount:flux-system:flux"
    )
  output: >
    FORENSIC crb-created (user=%ka.user.name
    binding=%ka.target.name
    subjects=%ka.req.binding.subjects
    role=%ka.req.binding.role
    sourceIPs=%ka.sourceips)
  priority: CRITICAL
  source: k8saudit
- rule: Secret read by unexpected service account
  desc: A production service account read a secret outside its own namespace
  # Falco conditions have no string-split function, so the service account
  # namespace is fixed here; duplicate the rule for each namespace you monitor.
  condition: >
    ka.verb in (get, list) and
    ka.target.resource = "secrets" and
    ka.user.name startswith "system:serviceaccount:production:" and
    ka.target.namespace != "production"
  output: >
    FORENSIC cross-ns-secret-read (user=%ka.user.name
    secret=%ka.target.name secret_ns=%ka.target.namespace
    sourceIPs=%ka.sourceips)
  priority: WARNING
  source: k8saudit
- rule: Container running interactive shell
  desc: A container spawned an interactive shell, possibly manual access
  condition: >
    spawned_process and
    proc.name in (bash, sh, zsh, dash, fish) and
    proc.tty != 0 and
    container and
    not proc.pname in (containerd, dockerd, runc)
  output: >
    FORENSIC interactive-shell (container=%container.name
    image=%container.image.repository:%container.image.tag
    pid=%proc.pid user=%user.name
    cmdline=%proc.cmdline)
  priority: WARNING
  source: syscall
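A general cross-namespace check, comparing the namespace embedded in the service account name against the namespace of the secret, is straightforward to run offline against audit events. The sketch below assumes raw audit.k8s.io/v1 events parsed into dicts; both function names are hypothetical.

```python
def service_account_namespace(username: str):
    """Return the namespace from 'system:serviceaccount:<ns>:<name>', else None."""
    parts = username.split(":")
    if len(parts) == 4 and parts[:2] == ["system", "serviceaccount"]:
        return parts[2]
    return None


def is_cross_namespace_secret_read(ev: dict) -> bool:
    """True when a service account reads secrets outside its own namespace.

    Cluster-wide list or watch requests carry no target namespace and are
    flagged too, since they expose every namespace's secrets.
    """
    ref = ev.get("objectRef") or {}
    if ref.get("resource") != "secrets":
        return False
    if ev.get("verb") not in ("get", "list", "watch"):
        return False
    sa_ns = service_account_namespace((ev.get("user") or {}).get("username", ""))
    return sa_ns is not None and ref.get("namespace") != sa_ns
```

Human users and node identities are ignored by design, since they have no home namespace to compare against; review those reads separately.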
5. Timeline Reconstruction Script
The following Python script reads multiple Kubernetes evidence sources and produces a unified chronological timeline:
#!/usr/bin/env python3
"""
k8s-timeline.py: reconstruct an attack timeline from Kubernetes evidence sources.
Usage: python3 k8s-timeline.py --audit /path/to/audit.log [--events events.json] [--user PATTERN]
"""
import argparse
import json
import sys
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def parse_rfc3339(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp into a timezone-aware UTC datetime."""
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so spell out the offset.
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    # Fractional seconds are kept so that events within the same second sort correctly.
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def parse_audit_log(path: str) -> list[dict]:
    """Read a JSON-lines audit log and keep only forensically relevant events."""
    events = []
    high_value_verbs = {"create", "delete", "deletecollection", "patch", "update"}
    high_value_resources = {
        "secrets", "clusterrolebindings", "rolebindings",
        "clusterroles", "roles", "pods/exec", "pods/attach",
        "pods/portforward", "serviceaccounts/token", "namespaces",
    }
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue  # Skip truncated or corrupted lines
            verb = ev.get("verb", "")
            obj_ref = ev.get("objectRef") or {}
            resource = obj_ref.get("resource", "")
            subresource = obj_ref.get("subresource", "")
            full_resource = f"{resource}/{subresource}" if subresource else resource
            status_code = (ev.get("responseStatus") or {}).get("code", 0)
            # Filter to forensically relevant events
            is_high_value = (
                verb in high_value_verbs
                or full_resource in high_value_resources
                or status_code in (401, 403, 422)
            )
            if not is_high_value:
                continue
            ts_str = ev.get("requestReceivedTimestamp", "")
            if not ts_str:
                continue
            try:
                when = parse_rfc3339(ts_str)
            except ValueError:
                continue  # An unparseable timestamp cannot be placed on the timeline
            user = ev.get("user") or {}
            events.append({
                "time": when,
                "time_str": ts_str[:19],
                "source": "audit",
                "user": user.get("username", "unknown"),
                "groups": ",".join(user.get("groups", [])),
                # sourceIPs can be present but empty, so guard the index
                "source_ip": (ev.get("sourceIPs") or ["unknown"])[0],
                "verb": verb,
                "resource": full_resource,
                "namespace": obj_ref.get("namespace", ""),
                "name": obj_ref.get("name", ""),
                "status": status_code,
                "detail": _extract_detail(ev),
            })
    return events


def _extract_detail(ev: dict) -> str:
    """Extract the most useful contextual detail from an audit event."""
    req = ev.get("requestObject") or {}
    obj_ref = ev.get("objectRef") or {}
    resource = obj_ref.get("resource", "")
    subresource = obj_ref.get("subresource", "")
    if subresource == "exec":
        # exec arguments arrive as repeated ?command= query parameters on the
        # request URI; fall back to the request body if one was logged.
        query = parse_qs(urlsplit(ev.get("requestURI", "")).query)
        cmd = query.get("command") or req.get("command", [])
        return f"exec: {' '.join(cmd)}"
    if resource == "clusterrolebindings":
        subjects = req.get("subjects", [])
        role = req.get("roleRef", {}).get("name", "")
        subj_str = ",".join(s.get("name", "") for s in subjects)
        return f"role={role} subjects={subj_str}"
    if resource == "secrets":
        return "secret access"
    if resource == "pods":
        spec = req.get("spec", {})
        containers = spec.get("containers", [])
        images = ",".join(c.get("image", "") for c in containers)
        return f"images={images}" if images else ""
    return ""


def parse_k8s_events(path: str) -> list[dict]:
    """Read `kubectl get events -o json` output into timeline entries."""
    events = []
    try:
        with open(path) as f:
            data = json.load(f)
    except (json.JSONDecodeError, OSError) as e:
        print(f"Warning: could not load events file {path}: {e}", file=sys.stderr)
        return events
    for item in data.get("items", []):
        ts_str = (
            item.get("firstTimestamp")
            or item.get("eventTime")
            or item.get("metadata", {}).get("creationTimestamp")
        )
        if not ts_str:
            continue
        try:
            when = parse_rfc3339(ts_str)
        except ValueError:
            continue
        involved = item.get("involvedObject") or {}
        # Events are emitted by the kubelet, the scheduler, and controllers alike,
        # so record the reporting component rather than assuming the kubelet.
        reporter = (
            (item.get("source") or {}).get("component")
            or item.get("reportingComponent")
            or "unknown"
        )
        events.append({
            "time": when,
            "time_str": ts_str[:19],
            "source": "k8s-event",
            "user": reporter,
            "groups": "",
            "source_ip": "",
            "verb": item.get("reason") or "",
            "resource": involved.get("kind", ""),
            "namespace": item.get("metadata", {}).get("namespace", ""),
            "name": involved.get("name", ""),
            "status": 0,
            "detail": (item.get("message") or "")[:120],
        })
    return events


def print_timeline(events: list[dict], filter_user: Optional[str] = None) -> None:
    """Print all events in chronological order, optionally filtered by username."""
    sorted_events = sorted(events, key=lambda e: e["time"])
    if filter_user:
        sorted_events = [e for e in sorted_events if filter_user in e["user"]]
    print("=" * 120)
    print("KUBERNETES INCIDENT TIMELINE")
    print("=" * 120)
    print(f"{'TIMESTAMP':20} {'SOURCE':10} {'USER':35} {'VERB':10} "
          f"{'RESOURCE':30} {'NAME':30} {'IP':15} DETAIL")
    print("-" * 120)
    for ev in sorted_events:
        status_flag = " [DENIED]" if ev["status"] in (401, 403) else ""
        print(
            f"{ev['time_str']:20} "
            f"{ev['source']:10} "
            f"{ev['user'][:34]:35} "
            f"{ev['verb']:10} "
            f"{ev['resource'][:29]:30} "
            f"{ev['name'][:29]:30} "
            f"{ev['source_ip'][:14]:15} "
            f"{ev['detail'][:60]}{status_flag}"
        )
    print("-" * 120)
    print(f"Total forensic events: {len(sorted_events)}")


def main():
    parser = argparse.ArgumentParser(description="Kubernetes incident timeline reconstruction")
    parser.add_argument("--audit", required=True, help="Path to API server audit log")
    parser.add_argument("--events", help="Path to kubectl get events JSON output")
    parser.add_argument("--user", help="Filter timeline to specific username or pattern")
    args = parser.parse_args()
    all_events = []
    all_events.extend(parse_audit_log(args.audit))
    if args.events:
        all_events.extend(parse_k8s_events(args.events))
    print_timeline(all_events, filter_user=args.user)


if __name__ == "__main__":
    main()
Run during an investigation with:
# Full timeline from audit log:
python3 k8s-timeline.py --audit /var/log/kubernetes/audit.log \
--events /tmp/k8s-incident-20260508/events.json
# Filter to a specific compromised service account:
python3 k8s-timeline.py --audit /var/log/kubernetes/audit.log \
--user "system:serviceaccount:production:app"
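Runtime alerts can be folded into the same timeline. The following is a minimal sketch, not part of the script above: it assumes Falco is running with JSON output enabled and writing one alert per line, and that each alert carries the usual time, rule, priority, and output_fields keys. The function names parse_falco_time and parse_falco_alerts are hypothetical, and the returned dictionaries simply mirror the keys the timeline script already uses.

```python
import json
import re
from datetime import datetime, timezone

# Falco timestamps carry nanoseconds; datetime only holds microseconds.
_FALCO_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_falco_time(ts: str) -> datetime:
    """Parse a Falco timestamp, truncating the fraction to microseconds."""
    match = _FALCO_TS.match(ts)
    if not match:
        raise ValueError(f"unexpected timestamp: {ts!r}")
    base, frac = match.group(1), match.group(2) or ""
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(
        microsecond=micros, tzinfo=timezone.utc
    )


def parse_falco_alerts(lines) -> list[dict]:
    """Convert JSON-lines Falco alerts into timeline event dictionaries."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            alert = json.loads(line)
            when = parse_falco_time(alert["time"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # Skip lines that are not well-formed alerts
        fields = alert.get("output_fields") or {}
        rule = alert.get("rule", "")
        cmdline = fields.get("proc.cmdline") or ""
        events.append({
            "time": when,
            "time_str": alert["time"][:19],
            "source": "falco",
            "user": fields.get("user.name") or "unknown",
            "groups": "",
            "source_ip": "",
            "verb": alert.get("priority", ""),
            "resource": "container",
            "namespace": "",
            "name": fields.get("container.name") or "",
            "status": 0,
            "detail": f"{rule}: {cmdline}" if cmdline else rule,
        })
    return events
```

The resulting list can be appended to the audit and Events entries before sorting, so that an interactive-shell alert appears next to the pods/exec audit record that caused it.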
Expected Behaviour
A reconstructed timeline from the audit log for a typical Kubernetes credential theft and persistence incident looks like this:
TIMESTAMP            SOURCE     USER                                VERB       RESOURCE                       NAME                           IP              DETAIL
------------------------------------------------------------------------------------------------------------------------
2026-05-08T14:23:01  audit      system:serviceaccount:prod:web-app  list       secrets                                                       10.0.1.45       secret access
2026-05-08T14:23:04  audit      system:serviceaccount:prod:web-app  get        secrets                        aws-credentials                10.0.1.45       secret access
2026-05-08T14:23:07  audit      system:serviceaccount:prod:web-app  get        secrets                        database-password              10.0.1.45       secret access
2026-05-08T14:31:15  audit      attacker@example.com                create     pods/exec                      web-app-7d4f9b-xkqz2           198.51.100.42   exec: bash -i
2026-05-08T14:33:22  audit      attacker@example.com                create     clusterrolebindings            system:anonymous-admin         198.51.100.42   role=cluster-admin subjects=system:anonymous
2026-05-08T14:33:45  audit      system:anonymous                    list       secrets                                                       198.51.100.42   secret access
2026-05-08T14:34:12  k8s-event  kubelet                             Exec       Pod                            web-app-7d4f9b-xkqz2                           prod exec command executed
2026-05-08T14:41:00  audit      attacker@example.com                delete     clusterrolebindings            system:anonymous-admin         198.51.100.42
The timeline shows the complete sequence: the stolen web-app service account credentials being used to list and read secrets (14:23), the kubectl exec session that established interactive access (14:31), the ClusterRoleBinding creation granting cluster-admin to system:anonymous (14:33), secret enumeration through the new binding (14:33), and the attacker deleting the binding they had created (14:41). All of this is reconstructable from the audit log even if every pod involved was deleted and the cluster was remediated before the investigation started, provided the audit log was shipped externally before the attacker could destroy it.
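The create-then-delete pattern on the ClusterRoleBinding is worth searching for explicitly, because the binding no longer exists in cluster state by the time anyone looks. The sketch below is one possible way to flag it, assuming event dictionaries shaped like those produced by the timeline script (time, verb, resource, namespace, name, user, status). The function name find_short_lived_bindings and the one-hour default threshold are illustrative choices, not established conventions.

```python
from datetime import datetime, timedelta, timezone


def find_short_lived_bindings(events, max_lifetime=timedelta(hours=1)):
    """Return bindings that were created and then deleted within max_lifetime."""
    created = {}
    findings = []
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["resource"] not in ("clusterrolebindings", "rolebindings"):
            continue
        if not 200 <= ev["status"] < 300:
            continue  # Ignore denied or failed requests
        key = (ev["resource"], ev["namespace"], ev["name"])
        if ev["verb"] == "create":
            created[key] = ev
        elif ev["verb"] == "delete" and key in created:
            birth = created.pop(key)
            lifetime = ev["time"] - birth["time"]
            if lifetime <= max_lifetime:
                findings.append({
                    "resource": ev["resource"],
                    "name": ev["name"],
                    "created_by": birth["user"],
                    "deleted_by": ev["user"],
                    "lifetime": lifetime,
                })
    return findings


if __name__ == "__main__":
    def at(hms):
        # Illustrative helper: a UTC time on the incident date used in this article
        h, m, s = (int(part) for part in hms.split(":"))
        return datetime(2026, 5, 8, h, m, s, tzinfo=timezone.utc)

    sample = [
        {"time": at("14:33:22"), "verb": "create", "resource": "clusterrolebindings",
         "namespace": "", "name": "system:anonymous-admin",
         "user": "attacker@example.com", "status": 201},
        {"time": at("14:41:00"), "verb": "delete", "resource": "clusterrolebindings",
         "namespace": "", "name": "system:anonymous-admin",
         "user": "attacker@example.com", "status": 200},
    ]
    for finding in find_short_lived_bindings(sample):
        print(finding["name"], finding["created_by"], finding["lifetime"])
```

Legitimate automation also creates and removes bindings, so hits from a check like this are leads to review rather than proof of compromise.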
Trade-offs
Audit log verbosity. RequestResponse level for all resources generates 3-10x the log volume of Metadata level. A production cluster with 1,000 API calls per second at RequestResponse can generate 10+ GB of audit log per hour. A selective policy (RequestResponse for exec and RBAC writes, Metadata for secret reads, None for high-frequency, low-value calls such as watches and status updates) cuts the logged events to roughly 1-5% of total API call volume while preserving forensic value.
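The volume figures can be sanity-checked with simple arithmetic. The sketch below assumes an average RequestResponse event size of about 3 KB, which is an illustrative number rather than a measurement, and treats the selective policy as logging a flat fraction of calls at that same size. The function name audit_log_gb_per_hour is hypothetical.

```python
def audit_log_gb_per_hour(calls_per_second: float,
                          avg_event_bytes: float,
                          logged_fraction: float = 1.0) -> float:
    """Estimate hourly audit log volume in decimal gigabytes."""
    events_per_hour = calls_per_second * 3600 * logged_fraction
    return events_per_hour * avg_event_bytes / 1e9


# Assumed average event size of 3,000 bytes at RequestResponse level.
full = audit_log_gb_per_hour(1000, 3000)
selective = audit_log_gb_per_hour(1000, 3000, logged_fraction=0.05)
print(f"everything at RequestResponse: {full:.1f} GB/hour")
print(f"selective policy (5% logged):  {selective:.2f} GB/hour")
```

Measuring the real average event size from a sample of the cluster's own audit log gives a much better estimate than any assumed constant.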
etcd request logging. etcd does not provide a dedicated audit log comparable to the Kubernetes API server's; the --audit-log-path flag belongs to kube-apiserver, not to etcd. The closest substitute is raising etcd's own log verbosity with --log-level=debug, which records individual client requests, including direct access that bypasses the API server. The performance overhead is significant: 15-30% reduction in etcd throughput at high log verbosity. Permanently enabling it is rarely warranted; enabling it during an active investigation to capture ongoing attacker access is sometimes appropriate.
Log retention cost. Ninety days of compressed audit logs for a mid-sized cluster (100 nodes, 500 pods) typically requires 1-3 TB of storage. At AWS S3 standard pricing, that is $23-69/month for storage alone, before considering ingestion and query costs. Treat ninety days as the minimum acceptable retention for forensic investigations: most post-incident investigations begin more than 7 days after initial compromise, and they need to look back well before the point of detection.
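The storage figure follows directly from the per-gigabyte price. The sketch below assumes an S3 Standard rate of $0.023 per GB-month, which should be checked against current pricing for the region in use. The constant and function names are illustrative.

```python
# Assumed S3 Standard rate in USD per GB-month; verify against current regional pricing.
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost_usd(retained_tb: float,
                             usd_per_gb_month: float = S3_STANDARD_USD_PER_GB_MONTH) -> float:
    """Monthly storage cost for a given retained volume, using 1 TB = 1,000 GB."""
    return retained_tb * 1000 * usd_per_gb_month


for tb in (1, 3):
    print(f"{tb} TB retained: ${monthly_storage_cost_usd(tb):.2f}/month")
```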
Kubernetes Events TTL. Extending --event-ttl from 1 hour to 24 hours provides a meaningful improvement in forensic coverage at negligible cost: events are small objects in etcd, and even a busy cluster generates at most a few thousand events per hour.
Failure Modes
No API server audit policy enabled. The default Kubernetes installation does not enable audit logging. Without it, there is no record of what API calls were made, what secrets were read, or what resources were created or deleted. An investigation into a cluster with no audit policy is reduced to asking “what does the current state look like” — with no visibility into how it got there. This is the single most common and most consequential forensic failure in Kubernetes incident response.
Audit logs stored only on the control plane node. An attacker with control plane access, or a scenario where the control plane node is destroyed as part of remediation, can destroy the local audit log. If the audit log is not shipped to an external system in near-real-time, the forensic record can be lost before the investigation begins. Ship audit logs externally as the primary record; treat the local file as a buffer.
Kubernetes Events not extended beyond the 1-hour default. If an incident is not detected within one hour of the triggering activity, the Kubernetes Events that recorded image pulls, container starts, and other pod lifecycle changes will have expired. Most incidents are detected well after that window, so with the default TTL the Events are usually gone before anyone thinks to collect them.
Not taking a forensic snapshot immediately after incident discovery. Automated cluster systems (Kubernetes garbage collection, kubelet log rotation, containerd GC) continue running during an investigation. Every minute of delay means more evidence is destroyed by normal operations. The forensic snapshot script should be the first action after an incident is suspected — before remediation, before rebooting nodes, before deleting anything.
Insufficient RBAC on the audit log. If the service accounts used by the log shipper have broad permissions, an attacker who compromises the log shipper can modify or delete log data before it reaches the immutable external store. The log shipper should have only read access to the node filesystem path containing the audit log, and the destination S3 bucket should have Object Lock enabled with a retention policy that prevents deletion even by the log shipper’s credentials.