ContainerSSH for CI/CD Pipeline Debugging: Ephemeral, Audited Shell Access to Build Environments

Problem

When a GitHub Actions job fails after forty minutes of compilation, or a Tekton Task crashes midway through an integration test suite, the developer’s first instinct is to get a shell. They want to inspect the filesystem, reproduce the failing command interactively, and look at what was left behind in /tmp. This is a legitimate and efficient debugging workflow — and a persistent security headache for every platform team.

The common solutions are all problematic in different ways:

Persistent debug SSH keys on runners. Adding an SSH key pair to the runner fleet and distributing the private key to every developer on the team. It works immediately, but the key is never revoked, every developer who has ever been on the team retains access, and there is no audit trail of what was done during a session. The runner has access to CI secrets — registry credentials, signing keys, cloud IAM roles — and so does anyone who can SSH into it.

kubectl exec with broad RBAC. Granting developers pods/exec access in the CI namespace so they can exec into a running or recently-terminated pod. The problem is that Kubernetes pod exec is difficult to scope: a ClusterRole with pods/exec is easy to misconfigure into namespace-wide or cluster-wide access, and kubectl exec sessions are not recorded by default. The Kubernetes audit log captures the exec initiation event but not the commands typed inside the session.

Always-on debug sidecars. Adding a debug container to every CI pod that runs an SSH daemon and stays alive for a grace period after the main container exits. This creates a persistent listening service on every build pod, a fixed attack surface that is up regardless of whether anyone needs debugging, and often ends up with the debug container sharing the same environment — including mounted secrets — as the build container.

Ad-hoc kubectl debug ephemeral containers. Better than a sidecar, but still requires the developer to have pods/ephemeralcontainers RBAC, is not recorded, and gives access to the live process namespace and filesystem of the target pod.

ContainerSSH provides a different model entirely. Rather than giving developers access to the runner or build pod directly, ContainerSSH sits in front of Kubernetes as an SSH server that, on each incoming connection, provisions a fresh container via a backend API, streams the SSH session into that container, and tears it down on disconnect. Combined with a webhook-based auth system that validates time-limited tokens and strips secrets from the container environment, this gives developers a debug shell that matches the build environment without touching the actual build infrastructure.

The specific problem this article solves: a GitHub Actions job or Tekton PipelineRun fails, a developer needs to investigate interactively, and the platform team needs a mechanism that is (1) time-bounded so the access path closes automatically, (2) audited so every command is recorded, (3) scoped to the failed build’s image and environment but without CI secrets, and (4) provisioned on-demand rather than always running.

Target systems: ContainerSSH 0.5+, GitHub Actions with self-hosted or ARC-managed runners, Tekton Pipelines v1 on Kubernetes 1.28+, S3-compatible object storage for audit log shipping.

Threat Model

Adversary 1 — Malicious insider using debug access for artifact injection. A developer with standing SSH access to CI runners uses a debug session to modify a build artifact after it has passed security scanning. Because the session is on the runner rather than a fresh container, the developer can overwrite files in the workspace or inject code into a build cache that persists to subsequent builds. With ContainerSSH, the debug container is freshly provisioned from the build image — it does not share the runner’s workspace, and the build artifact has already been pushed. The session recording creates an audit trail that makes the action attributable.

Adversary 2 — Credentials never revoked after incident. A developer is granted emergency SSH access to diagnose a production build failure. The incident is resolved and everyone moves on. Three months later, the developer’s account is phished. The attacker uses the stale debug SSH key to access the CI runner, which still has access to registry push credentials and a cloud role with artifact storage write permissions. With ContainerSSH, the access mechanism is a time-limited JWT. The token expires automatically. There is no credential to revoke because there was never a standing credential.

Adversary 3 — Secret exfiltration via debug session. A CI build has access to a container registry push credential, a code signing key, and an AWS role for artifact upload. A developer (or attacker with developer access) connects to a debug container that replicates the build environment, reads the environment variables, and exfiltrates the secrets. The environment that reaches the debug container is filtered twice: the provisioning service and the auth webhook each drop any variable whose name contains SECRET, TOKEN, KEY, PASSWORD, CREDENTIAL, CERT, or PRIVATE (matched case-insensitively, anywhere in the name). The debug container therefore keeps a functional build environment for reproducing compile and test failures without the credentials used during the actual build. The filter inspects names only, so a secret stored under an innocuous name or embedded in a value (for example, a connection URL) is not caught; the workflow should pass an explicit list of non-sensitive variables rather than the full build environment.

Access level for all adversaries: Developer-level access to the CI system — the ability to trigger workflows or observe build failures. No cluster-admin, no direct pod access.

Blast radius without ContainerSSH: Compromise of a standing debug SSH key gives persistent access to the runner, its attached secrets, its build cache, and potentially adjacent runners in the same fleet.

Blast radius with ContainerSSH: Compromise of a session token gives access to a debug container for the remaining duration of the token’s validity (30 minutes from provisioning by default; the provisioning service caps the TTL at 60 minutes), with secrets stripped from the environment. The container is isolated from the runner and from other builds.

Configuration

Architecture Overview

The end-to-end flow has five components:

GitHub Actions workflow — on job failure, a subsequent step calls an internal provisioning API to register a time-limited debug session token keyed to the failed job’s image and commit SHA.
Session provisioning API — a small service (or a serverless function) that mints a signed JWT, stores session metadata in Redis or DynamoDB, and returns the token to the workflow step.
ContainerSSH server — listens on port 2222, calls the auth webhook on every connection attempt, and launches containers in Kubernetes via the backend API.
Auth webhook — validates the JWT, checks expiry and usage limits, and returns a ContainerSSH backend response specifying the container image and sanitized environment.
Audit log shipper — ContainerSSH session recordings shipped to S3, tagged with PR number, commit SHA, and developer identity.

GitHub Actions job fails
         │
         ▼
  if: failure() step calls /provision-debug-session
         │
         └── returns: { token: "eyJ...", ssh_host: "containerssh.ci.example.com", expires_at: 1746900000 }
                  │
                  ▼
  Developer: ssh -p 2222 <token>@containerssh.ci.example.com
                  │
                  ▼
  ContainerSSH auth webhook
         │── validates JWT signature and expiry
         │── looks up session metadata (image, env vars)
         │── strips secrets from env
         └── returns ContainerSSH AuthResponse with pod spec
                  │
                  ▼
  ContainerSSH launches Kubernetes pod
  (same image as failed build, no secrets in env)
                  │
                  ▼
  Developer gets shell ─── session recording ──► S3
                  │
                  └── after 30 min: JWT exp reached, auth webhook rejects new connections; the pod is removed when the session ends
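
The time-limited token at the centre of this flow can be illustrated without any third-party library. The following is a minimal sketch of HS256 signing and expiry checking using only the Python standard library; the listings later in this article use PyJWT instead, and the function names here (mint_token, verify_token) are illustrative assumptions, not part of ContainerSSH.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(secret: bytes, session_id: str, ttl_seconds: int, now: int) -> str:
    """Build an HS256-signed token whose exp claim is now + ttl_seconds."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(
        json.dumps({"sub": session_id, "iat": now, "exp": now + ttl_seconds}).encode()
    )
    signing_input = f"{header}.{claims}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{claims}.{_b64url(signature)}"


def verify_token(secret: bytes, token: str, now: int) -> Optional[dict]:
    """Return the claims if the signature matches and exp is in the future."""
    try:
        header, claims, signature = token.split(".")
        signing_input = f"{header}.{claims}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            return None
        payload = json.loads(_b64url_decode(claims))
    except ValueError:
        # Covers malformed structure, bad base64, and bad JSON.
        return None
    if not isinstance(payload, dict):
        return None
    exp = payload.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return None
    return payload
```

The verifier always recomputes an HMAC-SHA256 signature regardless of what the token header claims, so a token that declares a different algorithm simply fails verification. Passing now explicitly keeps the expiry logic easy to test.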

Step 1: GitHub Actions Workflow Integration

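Add a debug session provisioning step that runs only on failure. The step calls the internal provisioning API and writes the SSH connection string to the job summary. A Slack or Teams notification can be sent from the same step; it is omitted from the example below. The token is a bearer credential, so anyone who can read the job summary can open the session until the token expires.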

# .github/workflows/build.yml
name: Build and Test
on:
  pull_request:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11  # v4.1.1

      - name: Build
        id: build
        run: ./scripts/build.sh

      - name: Test
        id: test
        run: ./scripts/test.sh

      # Provision a debug session only when a prior step has failed.
      # CONTAINERSSH_PROVISION_TOKEN is a long-lived API key for the
      # provisioning service, scoped to "create debug sessions" only.
      - name: Provision ContainerSSH debug session
        if: failure()
        id: debug_session
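        env:
          PROVISION_API: ${{ vars.CONTAINERSSH_PROVISION_API_URL }}
          PROVISION_TOKEN: ${{ secrets.CONTAINERSSH_PROVISION_TOKEN }}
          # Pass expression values through env instead of interpolating them
          # into the script, so a crafted branch name cannot inject shell.
          # PR_NUMBER is empty on push events.
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
          # BUILD_IMAGE is expected in the job's env; GITHUB_SHA, GITHUB_REF,
          # and GITHUB_ACTOR are default runner variables.
          PAYLOAD=$(jq -n \
            --arg image "$BUILD_IMAGE" \
            --arg sha "$GITHUB_SHA" \
            --arg ref "$GITHUB_REF" \
            --arg pr "$PR_NUMBER" \
            --arg actor "$GITHUB_ACTOR" \
            '{image: $image, commit_sha: $sha, pr_number: $pr, actor: $actor,
              ttl_seconds: 1800,
              env: {GOPATH: "/go", GOCACHE: "/go/cache", CI: "true",
                    GITHUB_SHA: $sha, GITHUB_REF: $ref}}')

          RESPONSE=$(curl -sf -X POST "${PROVISION_API}/sessions" \
            -H "Authorization: Bearer ${PROVISION_TOKEN}" \
            -H "Content-Type: application/json" \
            -d "$PAYLOAD")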

          TOKEN=$(echo "$RESPONSE" | jq -r '.token')
          HOST=$(echo "$RESPONSE" | jq -r '.ssh_host')
          EXPIRES=$(echo "$RESPONSE" | jq -r '.expires_at_human')

          echo "## Debug Session Available" >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"
          echo "A ContainerSSH debug session has been provisioned for this failed build." >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "ssh -p 2222 ${TOKEN}@${HOST}" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"
          echo "**Expires:** ${EXPIRES}" >> "$GITHUB_STEP_SUMMARY"
          echo "**Note:** CI secrets are not available in this environment." >> "$GITHUB_STEP_SUMMARY"

The BUILD_IMAGE environment variable should be set earlier in the workflow to the fully-qualified image reference used in the build (e.g., registry.example.com/build-env/go-builder:1.22.3@sha256:abc123). Using a digest-pinned image reference ensures the debug container runs exactly the same image as the failed build.
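
Because the provisioning API launches whatever image it is given, it can be worth rejecting references that are not digest-pinned. The following is a minimal sketch of such a check; the helper name is_acceptable_image and the ALLOWED_REGISTRIES allow-list are hypothetical and not part of the service shown in Step 2.

```python
import re

# Matches "<name>[:tag]@sha256:<64 hex chars>". The name part is checked
# loosely; only the digest suffix is validated strictly.
_DIGEST_PINNED = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")

# Hypothetical allow-list of registries the debug namespace may pull from.
ALLOWED_REGISTRIES = ("registry.example.com/",)


def is_acceptable_image(image: str) -> bool:
    """True if the reference is digest-pinned and hosted on an allowed registry."""
    pinned = _DIGEST_PINNED.fullmatch(image) is not None
    return pinned and image.startswith(ALLOWED_REGISTRIES)
```

The abbreviated digest in the example reference above (sha256:abc123) would be rejected by this check, since a real SHA-256 digest is 64 hexadecimal characters.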

Step 2: Session Provisioning Webhook (Python)

The provisioning service mints a signed JWT, stores metadata about the session, and returns the connection details. It runs as a Kubernetes Deployment in the CI namespace.

# provision_service.py
import os
import time
import json
import re
import hmac
import hashlib
import secrets

import jwt
import redis
from fastapi import FastAPI, HTTPException, Header
from pydantic import BaseModel

app = FastAPI()

JWT_SECRET = os.environ["JWT_SECRET"]        # 256-bit random key
PROVISION_API_KEY = os.environ["PROVISION_API_KEY"]
CONTAINERSSH_HOST = os.environ["CONTAINERSSH_HOST"]
redis_client = redis.Redis.from_url(os.environ["REDIS_URL"])

# Patterns whose matching env var names will be stripped before
# the session metadata is stored. The debug container never sees these.
SECRET_PATTERNS = re.compile(
    r"(SECRET|TOKEN|KEY|PASSWORD|CREDENTIAL|CERT|PRIVATE|API_KEY)",
    re.IGNORECASE,
)


class SessionRequest(BaseModel):
    image: str
    commit_sha: str
    pr_number: str | None = None
    actor: str
    ttl_seconds: int = 1800   # 30 minutes, hard cap at 3600
    env: dict[str, str] = {}


def strip_secrets(env: dict[str, str]) -> dict[str, str]:
    """Return env with any key matching a secret pattern removed."""
    return {k: v for k, v in env.items() if not SECRET_PATTERNS.search(k)}


@app.post("/sessions")
async def provision_session(
    request: SessionRequest,
    authorization: str = Header(...),
):
    # Validate the provisioning API key supplied by the workflow step.
    expected = f"Bearer {PROVISION_API_KEY}"
    if not hmac.compare_digest(authorization, expected):
        raise HTTPException(status_code=401, detail="Invalid provisioning token")

    ttl = min(request.ttl_seconds, 3600)
    now = int(time.time())
    exp = now + ttl
    session_id = secrets.token_urlsafe(16)

    # JWT payload. The SSH username supplied by the developer will be
    # validated against this token by the ContainerSSH auth webhook.
    payload = {
        "sub": session_id,
        "actor": request.actor,
        "image": request.image,
        "commit_sha": request.commit_sha,
        "pr_number": request.pr_number,
        "iat": now,
        "exp": exp,
    }
    token = jwt.encode(payload, JWT_SECRET, algorithm="HS256")

    # Store session metadata in Redis with TTL matching the JWT.
    session_data = {
        "image": request.image,
        "actor": request.actor,
        "commit_sha": request.commit_sha,
        # Redis hash values cannot be None; builds without a PR store "none".
        "pr_number": request.pr_number or "none",
        "env": json.dumps(strip_secrets(request.env)),
        "used": "0",
    }
    redis_client.hset(f"session:{session_id}", mapping=session_data)
    redis_client.expire(f"session:{session_id}", ttl + 60)  # small grace period

    return {
        "token": token,
        "ssh_host": CONTAINERSSH_HOST,
        "expires_at": exp,
        "expires_at_human": time.strftime("%Y-%m-%d %H:%M UTC", time.gmtime(exp)),
        "session_id": session_id,
    }
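
The name-based filter in strip_secrets can be exercised on its own to see what it does and does not catch. The sketch below copies the pattern and function from the listing above; the sample variable names and values are made up for illustration.

```python
import re

SECRET_PATTERNS = re.compile(
    r"(SECRET|TOKEN|KEY|PASSWORD|CREDENTIAL|CERT|PRIVATE|API_KEY)",
    re.IGNORECASE,
)


def strip_secrets(env: dict) -> dict:
    """Return env with any key matching a secret pattern removed."""
    return {k: v for k, v in env.items() if not SECRET_PATTERNS.search(k)}


sample = {
    "GOPATH": "/go",
    "CI": "true",
    "REGISTRY_TOKEN": "t0ken",                  # removed: name contains TOKEN
    "aws_secret_access_key": "abc",             # removed: matching ignores case
    "KEYBOARD_LAYOUT": "us",                    # removed: false positive on KEY
    "DATABASE_URL": "postgres://u:pw@db/app",   # kept: the secret is in the value
}

kept = strip_secrets(sample)
print(sorted(kept))  # ['CI', 'DATABASE_URL', 'GOPATH']
```

The last entry shows the limit of filtering by name: a credential embedded in a value passes through, which is why the workflow should send only variables it knows to be non-sensitive.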

Step 3: ContainerSSH Auth Webhook

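ContainerSSH calls the auth webhook with the SSH username and public key (or password). The webhook decodes the username as a JWT, validates it, retrieves session metadata, and returns a backend configuration telling ContainerSSH what container to launch. The JWT in the username is the only credential checked; the key or password presented by the SSH client is ignored. ContainerSSH appends a method-specific path (such as /password or /pubkey) to the configured webhook URL, so expose the handler on whichever paths your ContainerSSH version calls.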

# auth_webhook.py
import json
import os
import re
import time

import jwt
import redis
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

JWT_SECRET = os.environ["JWT_SECRET"]
redis_client = redis.Redis.from_url(os.environ["REDIS_URL"])
SECRET_PATTERNS = re.compile(
    r"(SECRET|TOKEN|KEY|PASSWORD|CREDENTIAL|CERT|PRIVATE|API_KEY)",
    re.IGNORECASE,
)

# Command that would install debug tools at container start. It is not wired
# into the pod spec below: apt-get needs root, which conflicts with
# runAsNonRoot. Prefer baking these tools into a dedicated debug image.
DEBUG_PACKAGES_INIT = [
    "sh", "-c",
    "apt-get update -qq && apt-get install -y --no-install-recommends "
    "strace curl jq procps lsof file binutils 2>/dev/null || true"
]


class AuthRequest(BaseModel):
    username: str        # The JWT token, used as the SSH username
    remoteAddress: str
    connectionId: str


@app.post("/auth")
async def authenticate(request: AuthRequest):
    # Decode and validate the JWT.
    try:
        payload = jwt.decode(
            request.username,
            JWT_SECRET,
            algorithms=["HS256"],
        )
    except jwt.ExpiredSignatureError:
        # Token has expired. Return a ContainerSSH auth failure response.
        return {"success": False, "error": "Session token has expired"}
    except jwt.InvalidTokenError as exc:
        return {"success": False, "error": f"Invalid token: {exc}"}

    session_id = payload["sub"]

    # Retrieve session metadata from Redis.
    session_key = f"session:{session_id}"
    session_data = redis_client.hgetall(session_key)
    if not session_data:
        return {"success": False, "error": "Session not found or already expired"}

    # Decode bytes from Redis.
    image = session_data[b"image"].decode()
    env_raw = session_data[b"env"].decode()
    actor = session_data[b"actor"].decode()
    commit_sha = session_data[b"commit_sha"].decode()
    pr_number = session_data.get(b"pr_number", b"none").decode()
    env = json.loads(env_raw)

    # Final secret strip — defence in depth in case the provisioner missed something.
    env = {k: v for k, v in env.items() if not SECRET_PATTERNS.search(k)}

    # Annotate the environment with debug session metadata for the audit trail.
    env["DEBUG_SESSION_ID"] = session_id
    env["DEBUG_ACTOR"] = actor
    env["DEBUG_COMMIT_SHA"] = commit_sha
    env["DEBUG_PR_NUMBER"] = pr_number

    # Mark session as used (allows revocation if "single-use" policy is enforced).
    redis_client.hset(session_key, "used", "1")

    # Return a ContainerSSH backend response.
    # The kubernetes backend launches a pod matching the build image.
    return {
        "success": True,
        "username": actor,
        "metadata": {
            "session_id": session_id,
            "commit_sha": commit_sha,
            "pr_number": pr_number,
        },
        "backend": {
            "type": "kubernetes",
            "kubernetes": {
                "pod": {
                    "metadata": {
                        "labels": {
                            "app": "containerssh-debug",
                            "commit-sha": commit_sha[:8],
                            "pr-number": pr_number,
                            "actor": actor,
                        },
                        "annotations": {
                            "containerssh.io/session-id": session_id,
                            "containerssh.io/actor": actor,
                        },
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "debug",
                                "image": image,
                                "env": [
                                    {"name": k, "value": v}
                                    for k, v in env.items()
                                ],
                                "resources": {
                                    "limits": {
                                        "memory": "2Gi",
                                        "cpu": "1000m",
                                    },
                                    "requests": {
                                        "memory": "256Mi",
                                        "cpu": "100m",
                                    },
                                },
                                "securityContext": {
                                    "allowPrivilegeEscalation": False,
                                    "readOnlyRootFilesystem": False,
                                    "runAsNonRoot": True,
                                    "capabilities": {
                                        "drop": ["ALL"],
                                    },
                                },
                            }
                        ],
                        "serviceAccountName": "containerssh-debug-sa",
                        "automountServiceAccountToken": False,
                    },
                }
            },
        },
    }
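
The webhook above marks a session as used but does not enforce a limit on how many times a token can open a connection. One way to add that check is sketched below, with an in-memory counter standing in for Redis; the class, the admit function, and the MAX_CONNECTIONS value are hypothetical. With Redis, an atomic hash increment (HINCRBY) on the session key would play the role of increment.

```python
class InMemorySessionStore:
    """Stand-in for Redis: a dict of per-session connection counters."""

    def __init__(self) -> None:
        self._counts = {}

    def increment(self, session_id: str) -> int:
        """Add one to the session's counter and return the new value."""
        self._counts[session_id] = self._counts.get(session_id, 0) + 1
        return self._counts[session_id]


MAX_CONNECTIONS = 3  # hypothetical policy value; 1 gives single-use tokens


def admit(store: InMemorySessionStore, session_id: str,
          limit: int = MAX_CONNECTIONS) -> bool:
    """Count this attempt, then allow it only if the total stays within limit."""
    return store.increment(session_id) <= limit
```

Incrementing before comparing means two simultaneous connection attempts cannot both slip under the limit, provided the increment itself is atomic in the real store.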

Step 4: ContainerSSH Server Configuration

Deploy ContainerSSH with the Kubernetes backend, the auth webhook URL pointing at the service from Step 3, and session recording configured to ship to S3.

# containerssh-config.yaml
log:
  level: info
  format: json

ssh:
  listenAddress: "0.0.0.0:2222"
  hostkeys:
    - /etc/containerssh/hostkey

auth:
  # ContainerSSH calls this webhook on every SSH connection attempt.
  # The webhook validates the JWT and returns the backend configuration.
  webhook:
    url: http://containerssh-auth-webhook.ci-system.svc.cluster.local:8080/auth
    timeout: 10s
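    # authTimeout bounds the whole authentication attempt, including retries
    # of failed webhook calls; it is not a cache TTL. Every new connection
    # calls the webhook, so an expired JWT is rejected on reconnect.
    authTimeout: 30s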

backend: kubernetes

kubernetes:
  client:
    # Use in-cluster credentials.
    inCluster: true
  pod:
    namespace: containerssh-debug
    # Default spec; overridden by the auth webhook response.
    spec:
      containers: []

audit:
  enable: true
  format: binary
  storage: s3
  s3:
    local: /tmp/containerssh-audit
    region: us-east-1
    bucket: ci-audit-logs-example
    prefix: containerssh/
    # Tag each audit log object with build metadata.
    # ContainerSSH reads metadata from the auth webhook response.
    metadata:
      - name: session-id
        value: "{{ .Metadata.session_id }}"
      - name: commit-sha
        value: "{{ .Metadata.commit_sha }}"
      - name: pr-number
        value: "{{ .Metadata.pr_number }}"

Step 5: Tekton PipelineRun Integration

For Tekton-based CI, add a finally Task to every Pipeline that calls the provisioning API on failure. Tekton’s finally Tasks run unconditionally at the end of a PipelineRun, with access to the PipelineRun’s result and status.

# tekton-debug-session-task.yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: provision-debug-session
  namespace: tekton-pipelines
spec:
  params:
    - name: image
      type: string
      description: "The build container image (digest-pinned)"
    - name: commit-sha
      type: string
    - name: pipeline-run-name
      type: string
    - name: pipeline-status
      type: string
  stepTemplate:
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
  steps:
    - name: provision-session
      image: curlimages/curl:8.7.1@sha256:c3b8bee303c6c6beed656cfc921218c529d65aa61114eb9e27c62047a1271b9b
      script: |
        #!/bin/sh
        set -eu

        # Only provision a debug session if the pipeline has failed.
        if [ "$(params.pipeline-status)" != "Failed" ]; then
          echo "Pipeline succeeded; no debug session needed."
          exit 0
        fi

        RESPONSE=$(curl -sf -X POST \
          "${PROVISION_API}/sessions" \
          -H "Authorization: Bearer ${PROVISION_API_KEY}" \
          -H "Content-Type: application/json" \
          -d '{
            "image": "$(params.image)",
            "commit_sha": "$(params.commit-sha)",
            "actor": "tekton-pipeline",
            "pipeline_run": "$(params.pipeline-run-name)",
            "ttl_seconds": 1800
          }')

        echo "Debug session provisioned:"
        echo "$RESPONSE" | grep -o '"token":"[^"]*"' | cut -d'"' -f4
      env:
        - name: PROVISION_API
          valueFrom:
            secretKeyRef:
              name: containerssh-provision
              key: api-url
        - name: PROVISION_API_KEY
          valueFrom:
            secretKeyRef:
              name: containerssh-provision
              key: api-key
---
# Add the debug session Task to your Pipeline's finally block.
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-pipeline
  namespace: tekton-pipelines
spec:
  params:
    - name: image
      type: string
    - name: commit-sha
      type: string
  tasks:
    - name: build
      taskRef:
        name: build-task
      params:
        - name: image
          value: "$(params.image)"
    - name: test
      taskRef:
        name: test-task
      runAfter: [build]
  finally:
    - name: debug-session
      taskRef:
        name: provision-debug-session
      params:
        - name: image
          value: "$(params.image)"
        - name: commit-sha
          value: "$(params.commit-sha)"
        - name: pipeline-run-name
          value: "$(context.pipelineRun.name)"
        - name: pipeline-status
          value: "$(tasks.status)"

Step 6: Audit Log Shipping and Tagging

ContainerSSH session recordings in binary format are shipped to S3 via the built-in audit backend. To make the recordings queryable by PR number and commit SHA, configure the S3 prefix to include those values:

# In containerssh-config.yaml — detailed S3 key structure
audit:
  s3:
    bucket: ci-audit-logs-example
    prefix: "containerssh/{{ .Metadata.commit_sha }}/{{ .Metadata.session_id }}/"
    parallelUploads: 5
    queueSize: 100

Each session produces a .bin recording that can be replayed with containerssh-auditlog-tool, a standalone binary that renders the session as a terminal replay for use during incident review.

To integrate audit log availability into your incident response process, add an S3 lifecycle policy to transition recordings to Glacier after 90 days and expire them after 1 year. Tag each recording with the associated Jira ticket or GitHub issue number if the session was provisioned via an escalation workflow.
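
The retention policy described above can be written down as a lifecycle configuration. The following sketch builds the rule as a plain dictionary in the shape that boto3's S3 client accepts as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration; the rule ID and the function name are illustrative assumptions, and the call to AWS itself is left out.

```python
import json


def audit_lifecycle_rules(prefix: str = "containerssh/") -> dict:
    """Lifecycle rule: move recordings to Glacier at 90 days, expire at 365."""
    return {
        "Rules": [
            {
                "ID": "containerssh-audit-retention",  # illustrative name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    }


# Print the rule for review before applying it to the audit bucket.
print(json.dumps(audit_lifecycle_rules(), indent=2))
```

Keeping the rule in code alongside the ContainerSSH configuration makes the retention period reviewable in the same pull request as any change to the audit prefix.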

Expected Behaviour

The following table describes system behaviour across the scenarios most likely to be encountered in production operation:

Scenario	System Behaviour
Build fails; developer connects within the 30-minute window	Auth webhook validates JWT, returns pod spec with build image and sanitized env. ContainerSSH launches a debug pod in the `containerssh-debug` namespace. Developer gets an interactive shell. Session is recorded.
Build fails; developer connects after the 30-minute window	JWT `exp` has passed. Auth webhook returns `{"success": false, "error": "Session token has expired"}`. ContainerSSH returns SSH `Permission denied`. No container is launched.
Developer attempts to read CI secrets (registry token, signing key)	Environment variables whose names contain `SECRET`, `TOKEN`, `KEY`, `PASSWORD`, `CREDENTIAL`, `CERT`, or `PRIVATE` are stripped by the provisioning service and again by the auth webhook before the container spec is returned. The env vars do not exist in the debug container. The developer can confirm this by running `env` in the debug shell.
Developer tries to mount the runner’s workspace or access other builds	The debug container runs in an isolated pod in `containerssh-debug` namespace with no PVC mounts. The pod’s service account has no RBAC permissions. The runner workspace is not accessible.
Developer session exceeds 30 minutes (long-running investigation)	The JWT has expired. If the developer disconnects and tries to reconnect, the auth webhook rejects the connection. The existing session continues until the developer disconnects or ContainerSSH terminates the container via the session timeout configured in the backend. For complex investigations, a new token must be provisioned via the workflow re-run or an escalation API endpoint.
Security team replays a debug session for audit purposes	`containerssh-auditlog-tool replay s3://ci-audit-logs-example/containerssh/<commit-sha>/<session-id>/session.bin` renders the session as a terminal replay. S3 object tags identify the PR, commit SHA, and actor.

Trade-offs

Trade-off	Detail	Mitigation
Debug container launched without CI secrets	The developer cannot reproduce failures caused by missing or invalid secrets. If the build fails because a registry credential is stale, the debug container cannot replicate that exact failure.	Provide a read-only view of secret names (not values) in the debug environment so the developer knows which secrets are expected. For secret-dependent failures, use a separate escalation workflow with stricter approval requirements.
30-minute session expiry may be insufficient for complex investigations	A developer debugging a non-deterministic race condition or a large test suite may need more than 30 minutes. Forcing a reconnect interrupts the session and loses terminal state.	Increase the TTL cap for specific developer roles via the provisioning API policy. Log an alert when a session exceeds 25 minutes to prompt the developer to request an extension before expiry.
Session recording overhead	ContainerSSH records every byte written to the terminal. For sessions with high-throughput output (e.g., running a full test suite in the debug container), this creates significant recording data volume and minor CPU overhead on the ContainerSSH server.	Apply recording sampling for output-heavy commands. Configure the S3 upload queue size and parallel upload count to avoid blocking the SSH session on slow S3 writes.
Debug container uses the same base image as the build	The build image may contain vulnerabilities or misconfigured tooling that the developer can exploit during the debug session. Running the full build image as the debug environment gives the developer access to any capability included in that image.	Use a separate debug image that shares the build image as a base but adds only approved debug tools. Apply the same image scanning policy to the debug image as to production images.
ContainerSSH is a new component to operate	The platform team must operate ContainerSSH, the auth webhook, and the provisioning service. These are additional failure points and maintenance surface.	Run ContainerSSH as a Kubernetes Deployment with two replicas. The provisioning service is stateless (state is in Redis). The auth webhook is stateless. All three components can be managed as standard Kubernetes workloads with existing GitOps tooling.
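
The 25-minute warning mentioned in the table above reduces to a comparison against the token's exp claim. A minimal sketch follows, assuming a 30-minute TTL and a five-minute warning window; the names should_warn and WARN_BEFORE_SECONDS are hypothetical.

```python
WARN_BEFORE_SECONDS = 300  # five minutes before expiry: minute 25 of a 30-minute TTL


def should_warn(exp: int, now: int, warn_before: int = WARN_BEFORE_SECONDS) -> bool:
    """True while the token is still valid but inside the warning window."""
    remaining = exp - now
    return 0 < remaining <= warn_before
```

A periodic job could evaluate this for each active session and post a reminder to the developer while there is still time to request an extension.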

Failure Modes

Failure Mode	Impact	Detection and Recovery
Auth webhook is down or returning 5xx	No debug session can be opened, although tokens can still be provisioned. Developers get `Permission denied` on SSH. The timing is the worst possible: the webhook sees its heaviest load precisely when builds are failing in bulk.	Alert on auth webhook error rate >1% over 5 minutes. Run two replicas behind a Kubernetes Service. The ContainerSSH server logs the webhook response for each connection attempt; check logs for `context deadline exceeded` to distinguish timeout from rejection.
Provisioning step in the GitHub Actions workflow fails (network error, API key expired)	No session token is written to the job summary. The developer does not know a debug session could have been provisioned. Because the job is already red, a failure of the `if: failure()` step is easy to miss.	Have the provisioning step emit a distinct workflow annotation (for example an `::error::` message) when the API call fails; `continue-on-error` already defaults to `false`, so setting it adds nothing. Monitor the provisioning API’s key expiry and rotate with a 30-day lead time using a secrets management workflow.
Debug container hits memory or CPU limits and is OOMKilled	The debug session is terminated unexpectedly. The developer loses terminal state.	Set resource limits in the auth webhook response to match the failed build’s limits. Alert the developer via the session summary if the debug container is terminated by the OOM killer. The session recording up to the point of OOMKill is preserved in S3 and available for review.
Audit log not shipped to S3 (S3 outage, IAM permission error)	Session recordings are buffered locally on the ContainerSSH pod. If the pod is restarted before the buffer is flushed, recordings are lost. This creates a compliance gap.	Configure a local buffer directory on a PVC (not ephemeral storage). Alert on S3 upload failures. The ContainerSSH S3 backend retries failed uploads; configure an SQS dead-letter queue or a CloudWatch alarm on the upload failure metric.
Redis session store is unavailable	The auth webhook cannot retrieve session metadata for valid JWTs. All debug sessions are rejected even with valid, unexpired tokens.	Run Redis with persistence enabled and a standby replica. The provisioning service stores session metadata with a TTL matching the JWT; Redis data loss within the TTL window is the primary risk. Implement a fallback where session metadata is also stored in the JWT claims (with secrets excluded), allowing the auth webhook to operate without Redis if a degraded-mode flag is set.
Debug container is left running after session disconnect	If ContainerSSH fails to clean up the pod on disconnect, the container continues running, consuming resources and remaining as a potential lateral movement point.	Set `activeDeadlineSeconds` on the pod spec generated by the auth webhook so that Kubernetes terminates the pod’s containers once the deadline passes (`ttlSecondsAfterFinished` applies to Jobs, not to bare Pods). A separate cleanup CronJob deletes pods in `containerssh-debug` older than 40 minutes as a backstop. Alert on any debug pod surviving past the token TTL.
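
The selection logic for the cleanup backstop in the last row can be sketched as a pure function over pod metadata. The input is assumed to be a list of pod objects as dictionaries, in the shape the Kubernetes API returns them; listing and deleting pods is left out, and the function name stale_pods is illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=40)


def stale_pods(pods: list, now: datetime, max_age: timedelta = MAX_AGE) -> list:
    """Return the names of pods whose creationTimestamp is older than max_age."""
    stale = []
    for pod in pods:
        meta = pod["metadata"]
        # The API serialises timestamps as RFC 3339 in UTC, e.g. 2024-05-01T12:00:00Z.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(meta["name"])
    return stale
```

Keeping the age check separate from the API calls makes it straightforward to unit-test the 40-minute threshold before the job is given delete permissions in the namespace.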
