Kubernetes Operator Security: RBAC Scoping, Webhook Hardening, and Privilege Minimisation

Kubernetes Operator Security: RBAC Scoping, Webhook Hardening, and Privilege Minimisation

Problem

Kubernetes operators automate the management of stateful applications by watching custom resources and reconciling cluster state. They run continuously as pods with service account permissions, and many operators request — or are granted — far more access than they need:

  • cluster-admin ClusterRoleBinding for operator service accounts. The operator SDK’s scaffolding defaults generate broad ClusterRoles. Operators deployed with cluster-admin have full read/write access to every resource in every namespace. A vulnerability in the operator (dependency confusion, deserialization flaw, RCE in a watched library) immediately escalates to full cluster control.
  • Admission webhooks without proper TLS or timeout configuration. Operators frequently install mutating and validating admission webhooks to enforce policy on their custom resources. A webhook that times out, returns invalid TLS, or fails open (returns allowed: true on error) undermines admission control for the entire cluster.
  • CRD validation gaps. Custom Resource Definitions allow users to submit arbitrary YAML. Without server-side validation (OpenAPI schema on the CRD), users can pass malformed or oversized inputs that trigger operator logic bugs.
  • Operator with cross-namespace secret access. An operator managing a distributed database cluster needs to read connection secrets. It gets secrets: get on all namespaces — including namespaces it has no business accessing, containing credentials for unrelated services.
  • Operator image not pinned by digest. The operator deployment uses :latest or a mutable tag. A supply chain attack on the operator image repository pushes a backdoored image that gets pulled on the next pod restart.
  • No resource limits on the operator pod. A bug in the operator’s reconciliation loop causes it to consume all available CPU, starving other workloads including the cluster control plane.

Target systems: Operator SDK 1.33+ (controller-runtime 0.17+); kubebuilder 3.14+; Helm-based operators; Crossplane providers; Prometheus Operator, cert-manager, external-secrets-operator.

Threat Model

  • Adversary 1 — Operator RCE via dependency vulnerability: An attacker exploits a known CVE in a library used by the operator (e.g., a YAML parser, HTTP client, or template engine). Code execution inside the operator pod, which has cluster-admin, becomes full cluster compromise.
  • Adversary 2 — Malicious CRD input: A user submits a custom resource with a crafted field value (extremely long string, YAML bomb, template injection) that triggers a bug in the operator’s reconciliation logic — crashing the operator, causing it to execute arbitrary code, or exhausting memory.
  • Adversary 3 — Admission webhook bypass: An attacker exploits a timeout or TLS configuration weakness in the operator’s admission webhook. With failurePolicy: Ignore, a timed-out webhook causes all resources to be admitted without validation. The attacker submits resources that should be rejected.
  • Adversary 4 — Secret exfiltration via operator permissions: The operator has secrets: list, get across all namespaces. An attacker who compromises the operator pod reads all cluster secrets — including TLS certificates, database passwords, and API tokens for every application in the cluster.
  • Adversary 5 — Image supply chain attack: An attacker compromises the container registry or CI pipeline for an operator image. The next restart of the operator deployment pulls a backdoored image with an exfiltration payload, while the operator continues to function normally.
  • Access level: Adversaries 1 and 4 exploit existing operator compromise. Adversary 2 needs permission to create custom resources. Adversary 3 needs API server access. Adversary 5 needs registry or CI access.
  • Objective: Full cluster compromise via operator’s elevated permissions; secret exfiltration; admission control bypass.
  • Blast radius: An operator running with cluster-admin provides a single point of failure — its compromise is equivalent to control plane compromise.

Configuration

Step 1: Minimum Viable RBAC for Operators

Replace broad ClusterRoles with namespace-scoped Roles where possible, and remove any permissions not proven necessary:

# BAD: common operator scaffold default.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin   # Never use this for operators.
subjects:
  - kind: ServiceAccount
    name: my-operator
    namespace: my-operator-system
# GOOD: scoped ClusterRole for a database operator managing its CRDs cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-db-operator
rules:
  # Manage own CRDs only.
  - apiGroups: ["mydb.example.com"]
    resources: ["mydatabases", "mydatabases/status", "mydatabases/finalizers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

  # Core resources needed to manage the database.
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

  # StatefulSets for the database cluster.
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

  # Read ConfigMaps for operator configuration.
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]

  # Events for operator status reporting.
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]

  # NOT included: secrets (use External Secrets pattern instead).
  # NOT included: nodes (operator does not need node info).
  # NOT included: namespaces (operator is namespace-scoped in practice).
  # NOT included: clusterroles/clusterrolebindings (privilege escalation path).

For operators that only manage resources in specific namespaces, use namespace-scoped Roles:

# Role scoped to the operator's managed namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-operator
  namespace: my-app
rules:
  - apiGroups: ["myapp.example.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Step 2: CRD Schema Validation

Every CRD must have a complete OpenAPI v3 schema. Without it, users can submit arbitrary input to the operator:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: mydatabases.mydb.example.com
spec:
  group: mydb.example.com
  names:
    kind: MyDatabase
    plural: mydatabases
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required: ["spec"]
          properties:
            spec:
              type: object
              required: ["replicas", "version"]
              properties:
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 10        # Prevent runaway replica counts.
                version:
                  type: string
                  pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"   # Semver only.
                  maxLength: 20
                storage:
                  type: string
                  pattern: "^[0-9]+(Gi|Mi)$"
                  maxLength: 10
                # Prevent injection via free-text fields.
                displayName:
                  type: string
                  maxLength: 64
                  pattern: "^[a-zA-Z0-9 _-]+$"
              additionalProperties: false   # Reject unknown fields.
      # Prevent status subresource from being written by users.
      subresources:
        status: {}
# x-kubernetes-validations: CEL validation for cross-field constraints (k8s 1.25+).
spec:
  schema:
    openAPIV3Schema:
      properties:
        spec:
          x-kubernetes-validations:
            - rule: "self.replicas <= 3 || self.tier == 'enterprise'"
              message: "More than 3 replicas requires enterprise tier."
            - rule: "self.version.startsWith('5.') || self.version.startsWith('8.')"
              message: "Only versions 5.x and 8.x are supported."

Step 3: Admission Webhook Hardening

Operators commonly install webhooks for defaulting and validation. Secure them explicitly:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: my-operator-webhook
webhooks:
  - name: validate.mydatabase.mydb.example.com
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: my-operator-webhook
        namespace: my-operator-system
        port: 9443
        path: /validate-mydb-example-com-v1alpha1-mydatabase
      # caBundle must be set to the webhook server's CA.
      # Use cert-manager to provision and rotate the TLS certificate.
      caBundle: <BASE64_CA>

    rules:
      - apiGroups: ["mydb.example.com"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["mydatabases"]

    # CRITICAL: failurePolicy must be Fail for security-relevant webhooks.
    # Ignore means: if the webhook times out or errors, the request is admitted.
    failurePolicy: Fail

    # Timeout: 10 seconds maximum. Webhooks that take too long block API operations.
    timeoutSeconds: 10

    # Only intercept requests in namespaces with this label.
    # Reduces blast radius and avoids intercepting system namespaces.
    namespaceSelector:
      matchLabels:
        mydb.example.com/inject: "enabled"

    # Side effects: must be None for validating webhooks.
    sideEffects: None
    matchPolicy: Equivalent

TLS certificate management for the webhook server via cert-manager:

# cert-manager Certificate for the webhook server.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-operator-webhook-cert
  namespace: my-operator-system
spec:
  secretName: my-operator-webhook-cert
  dnsNames:
    - my-operator-webhook.my-operator-system.svc
    - my-operator-webhook.my-operator-system.svc.cluster.local
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
  duration: 8760h    # 1 year.
  renewBefore: 720h  # Renew 30 days before expiry.

Step 4: Operator Pod Security

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-operator
  namespace: my-operator-system
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      serviceAccountName: my-operator
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault

      containers:
        - name: manager
          image: ghcr.io/example/my-operator@sha256:abc123...  # Pin by digest.
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            capabilities:
              drop: ["ALL"]

          # Resource limits — prevent runaway reconciliation from starving the node.
          resources:
            limits:
              cpu: "500m"
              memory: "256Mi"
            requests:
              cpu: "100m"
              memory: "128Mi"

          # Liveness and readiness probes.
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20

          # Restrict leader election to the operator's namespace.
          env:
            - name: LEADER_ELECTION_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

Step 5: Operator Image Supply Chain

# Pin operator image by digest in Helm values.
# values.yaml
image:
  repository: ghcr.io/example/my-operator
  # Tag is informational; digest is authoritative.
  tag: v1.2.3
  digest: sha256:abc123def456...
  pullPolicy: IfNotPresent

# Verify digest in deployment.
# template/deployment.yaml
image: "{{ .Values.image.repository }}@{{ .Values.image.digest }}"
# Kyverno policy: enforce digest-pinned images for operators.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-operator-image-digest
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-digest
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["*-operator-system", "*-operator"]
      validate:
        message: "Operator pods must use image digest pinning."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"

Step 6: Leader Election and HA Security

Many operators use leader election to prevent split-brain. Leader election uses Lease resources — restrict access:

# Role for leader election — scoped to specific lease name.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-operator-leader-election
  namespace: my-operator-system
rules:
  # Leader election via Lease (preferred) or ConfigMap (legacy).
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["my-operator-leader-election"]   # Specific lease name only.
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # NOT: all leases — this prevents interference with other operators' elections.

Step 7: CRD Finalizer Security

Finalizers are used by operators to clean up resources. A finalizer that never completes blocks namespace deletion indefinitely:

// controller/mydatabase_controller.go
func (r *MyDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    db := &mydbv1alpha1.MyDatabase{}
    if err := r.Get(ctx, req.NamespacedName, db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    finalizerName := "mydb.example.com/cleanup"

    if db.DeletionTimestamp.IsZero() {
        // Add finalizer if not present.
        if !controllerutil.ContainsFinalizer(db, finalizerName) {
            controllerutil.AddFinalizer(db, finalizerName)
            return ctrl.Result{}, r.Update(ctx, db)
        }
    } else {
        // Resource is being deleted. Run cleanup.
        if controllerutil.ContainsFinalizer(db, finalizerName) {
            // CRITICAL: enforce a timeout on cleanup. Never block indefinitely.
            cleanupCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
            defer cancel()

            if err := r.cleanup(cleanupCtx, db); err != nil {
                // Log but do not return error indefinitely — after N retries, remove finalizer.
                r.Recorder.Event(db, "Warning", "CleanupFailed", err.Error())
                // Check if we've been trying too long.
                if time.Since(db.DeletionTimestamp.Time) > 5*time.Minute {
                    // Force-remove finalizer after 5 minutes to prevent infinite block.
                    log.Error(err, "Cleanup timed out; force-removing finalizer")
                }
            }
            controllerutil.RemoveFinalizer(db, finalizerName)
            return ctrl.Result{}, r.Update(ctx, db)
        }
    }
    return ctrl.Result{}, nil
}

Step 8: Telemetry

operator_reconcile_total{controller, result}              counter
operator_reconcile_duration_seconds{controller}           histogram
operator_reconcile_errors_total{controller, error_type}   counter
operator_webhook_requests_total{webhook, allowed}         counter
operator_webhook_duration_seconds{webhook}                histogram
operator_crd_validation_failures_total{crd, field}        counter
operator_managed_resources_total{controller, namespace}   gauge

Alert on:

  • operator_reconcile_errors_total sustained — the operator is repeatedly failing reconciliation; may indicate a CRD schema bypass or resource issue.
  • operator_webhook_duration_seconds P99 > 8s — webhook approaching timeout threshold; risk of fail-open if failurePolicy: Ignore.
  • operator_crd_validation_failures_total — a user is submitting CRs that fail schema validation; may indicate probing for injection vulnerabilities.
  • Operator pod restart — an operator crash may indicate exploitation attempt or memory exhaustion from malicious CR input.
  • Leader election lease acquired by unexpected pod — possible operator pod impersonation.

Expected Behaviour

Signal Default operator deploy Hardened operator
Operator RCE Full cluster compromise (cluster-admin) Limited to managed CRDs and owned namespaces
Malicious CRD input Arbitrary YAML accepted; logic bugs triggered Schema validation rejects invalid input before reconciliation
Webhook timeout failurePolicy: Ignore admits all resources failurePolicy: Fail blocks admission until webhook recovers
Secret access from operator All namespace secrets readable Only specific secrets in managed namespaces
Mutable image tag pulled Backdoored image loaded on restart Digest pinning detects image substitution

Trade-offs

Aspect Benefit Cost Mitigation
Namespace-scoped Role vs ClusterRole Strong blast radius isolation Operators managing cluster-wide resources need ClusterRole Scope ClusterRole to only cluster-level resources (CRDs, nodes); use Roles for namespaced resources
failurePolicy: Fail on webhook No bypass via timeout Webhook downtime blocks resource creation Run webhook with 2+ replicas; set timeoutSeconds: 10; have runbook for webhook failures
Digest-pinned images Supply chain integrity Must update digest on every release Automate via Renovate/Dependabot PR for digest updates
CRD additionalProperties: false Rejects unknown fields Blocks forward-compatible CR additions Update schema before adding new fields; use versioning

Failure Modes

Failure Symptom Detection Recovery
Webhook cert expired All CR creation/update blocked; TLS error cert-manager expiry alert; API 503 on CR operations cert-manager auto-renews if configured; manual renewal if cert-manager unavailable
Operator RBAC too restrictive Reconciliation fails for legitimate resources operator_reconcile_errors_total with permissions error Review audit log for denied API calls; add specific permissions
Finalizer deadlock Namespace stuck in Terminating state Namespace stuck > 5 minutes kubectl patch to remove finalizer manually; investigate operator state
Leader election lock not released Only one replica ever runs; manual restart required to recover from crash Operator not reconciling after previous leader pod deletion Delete stale Lease resource; operator will re-elect