Policy as Code at Scale: OPA, Rego Testing, and Enterprise Policy Libraries

Problem

Security rules written in runbooks rot. A policy that says “container images must be pulled from the internal registry” is enforced by whoever happens to be reviewing the deployment manifest at the time, and ignored everywhere else. When your organisation runs 40 clusters, 600 services, and a fleet of Terraform workspaces that span three cloud accounts, manual enforcement is not a strategy.

Open Policy Agent (OPA) solves the distribution problem by decoupling the policy decision from the enforcement point. OPA evaluates arbitrary queries against structured input and a policy corpus written in Rego. The enforcement point — Kubernetes admission webhook, Terraform plan gate, API gateway, CI pipeline — asks OPA a question and acts on the answer. The policy itself lives in one place, tested and version-controlled like application code.

The gap most teams hit is not OPA adoption. It is the second-order problem: as policy libraries grow from five rules to five hundred, they become unmaintainable without deliberate structure. Naming collisions between team-specific rules surface as silent overrides. Test coverage drops to zero because there is no convention for where test files live. Exceptions accumulate as inline if conditions inside rules, making the actual security posture impossible to audit.

This article covers the operational discipline required to run policy as code at enterprise scale: library structure, Rego unit testing, bundle distribution, CI/CD integration, exception management, and Kubernetes-specific patterns with Gatekeeper.

Policy Library Structure

OPA loads policies from a filesystem or a bundle. The directory structure determines package namespaces, which determines whether rules from different teams can coexist without collision.

A workable structure for a multi-team organisation:

policies/
  lib/
    kubernetes/
      helpers.rego
      image_helpers.rego
    terraform/
      helpers.rego
    common/
      exceptions.rego
  kubernetes/
    admission/
      deny_privileged.rego
      deny_privileged_test.rego
      require_requests_limits.rego
      require_requests_limits_test.rego
    audit/
      no_default_namespace.rego
      no_default_namespace_test.rego
  terraform/
    aws/
      deny_public_s3.rego
      deny_public_s3_test.rego
      require_encryption.rego
      require_encryption_test.rego
  data/
    exceptions.json
    registries.json

Package names mirror directory paths: a rule in policies/kubernetes/admission/deny_privileged.rego uses package kubernetes.admission.deny_privileged. The lib/ subtree holds helper functions that are imported by domain packages, never evaluated directly by enforcement points.
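
The path-to-package convention is easy to break by accident, so it can be worth linting. The following is a minimal Python sketch, assuming paths are given relative to policies/ and that each file declares exactly one package; the function names are illustrative and not part of OPA.

```python
import re
from pathlib import PurePosixPath

# Matches the first "package a.b.c" declaration; comment lines start with "#"
# and therefore never match.
PACKAGE_RE = re.compile(r"^\s*package\s+([A-Za-z_][\w.]*)", re.MULTILINE)


def expected_package(rel_path: str) -> str:
    """Derive the package implied by a path relative to policies/."""
    p = PurePosixPath(rel_path)
    # Directory components plus the file stem (".rego" dropped, "_test" kept).
    return ".".join(list(p.parts[:-1]) + [p.stem])


def declared_package(source: str):
    """Return the package declared in Rego source text, or None."""
    m = PACKAGE_RE.search(source)
    return m.group(1) if m else None


def check(rel_path: str, source: str):
    """Return an error string, or None when path and package agree."""
    want = expected_package(rel_path)
    got = declared_package(source)
    if got != want:
        return f"{rel_path}: expected package {want}, found {got}"
    return None
```

Run over every .rego file in CI, a non-empty list of errors would fail the build before a mismatched package silently shadows or escapes another team's rules.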

Input schema contracts are enforced via OPA’s schema annotations. Annotate the package with the input schema so that opa check, run with the --schema flag, type-checks every input reference in the policy against it:

# METADATA
# schemas:
#   - input: schema.kubernetes_admission_review
package kubernetes.admission.deny_privileged

import future.keywords.if
import future.keywords.contains

deny contains msg if {
    # Bind the container once so the message names the container that matched
    container := input.request.object.spec.containers[_]
    container.securityContext.privileged == true
    msg := sprintf("container %v runs as privileged", [container.name])
}

The schema file at schema/kubernetes_admission_review.json describes the AdmissionReview structure. Every rule that processes Kubernetes admission requests references the same schema. A rule that references a field the schema does not define fails opa check before the PR merges.

Naming conventions: rule names that produce violation messages use deny, violation, or warn as the base name depending on severity. The enforcement point keys on these names. Use a consistent convention — mixing deny in one package and violation in another forces the calling code to know which convention applies.

Rego Unit Testing

Every .rego file has a corresponding _test.rego file in the same directory. OPA discovers test functions by the test_ prefix.

package kubernetes.admission.deny_privileged_test

import future.keywords.if

test_deny_privileged_container if {
    result := data.kubernetes.admission.deny_privileged.deny with input as {
        "request": {
            "object": {
                "spec": {
                    "containers": [
                        {
                            "name": "app",
                            "securityContext": {"privileged": true}
                        }
                    ]
                }
            }
        }
    }
    count(result) == 1
}

test_allow_non_privileged_container if {
    result := data.kubernetes.admission.deny_privileged.deny with input as {
        "request": {
            "object": {
                "spec": {
                    "containers": [
                        {
                            "name": "app",
                            "securityContext": {"privileged": false}
                        }
                    ]
                }
            }
        }
    }
    count(result) == 0
}

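# Exercises the exception-aware version of the rule shown under
# "Managing Exceptions Without Exceptions-by-Default"
test_allow_exception_container if {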
    result := data.kubernetes.admission.deny_privileged.deny with input as {
        "request": {
            "namespace": "kube-system",
            "object": {
                "metadata": {"labels": {"exception-id": "EX-2024-001"}},
                "spec": {
                    "containers": [
                        {
                            "name": "node-agent",
                            "securityContext": {"privileged": true}
                        }
                    ]
                }
            }
        }
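    } with data.exceptions as {
        "EX-2024-001": {
            "policy": "deny_privileged",
            "expires": "2026-12-31",
            "reason": "eBPF node agent requires host network access"
        }
    } with time.now_ns as 1767225600000000000 # clock pinned to 2026-01-01T00:00:00Z so the test keeps passing after the expiry date
    count(result) == 0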
}

Run the full suite:

opa test policies/ -v --coverage

The --coverage flag emits a coverage report identifying which policy lines were not evaluated by any test. Set a coverage gate in CI:

opa test policies/ --coverage --format json | \
  jq -e '.coverage >= 90' || \
  { echo "Policy coverage below 90%"; exit 1; }
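
A single aggregate percentage can hide an untested file behind well-tested neighbours. The sketch below, in Python, applies the threshold per file instead. It assumes the JSON report has a top-level "files" map whose entries carry a "coverage" percentage and a "not_covered" list of row ranges; check that shape against the OPA version in use. The function name is illustrative.

```python
def coverage_failures(report: dict, threshold: float = 90.0,
                      skip_suffix: str = "_test.rego") -> list:
    """Return (path, coverage, uncovered_row_ranges) for files under threshold.

    `report` is the parsed output of `opa test --coverage --format json`
    (assumed shape; see the lead-in).
    """
    failures = []
    for path, info in sorted(report.get("files", {}).items()):
        if path.endswith(skip_suffix):
            continue  # test files themselves are not policy
        pct = info.get("coverage", 0.0)
        if pct < threshold:
            rows = [(r["start"]["row"], r["end"]["row"])
                    for r in info.get("not_covered", [])]
            failures.append((path, pct, rows))
    return failures
```

In a pipeline, the report would be loaded with json.load and the job failed whenever the returned list is non-empty, printing the uncovered row ranges so the author knows which lines need a test.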

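Test fixtures for complex inputs live in testdata/ directories as JSON files and are referenced from the data document, for example with input as data.testdata.fixture_name (the exact path depends on how the fixture files are loaded alongside the policies). This keeps test functions short and makes each fixture reusable across multiple test cases.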

Policy Bundling and Distribution

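OPA’s bundle format is a .tar.gz archive containing policies, data files, and an optional manifest. Bundles are the unit of distribution: build once, serve everywhere. Note that in a bundle OPA loads only data files named data.json or data.yaml, rooted at their directory path. For data.exceptions to resolve from a bundle, the exception data must be laid out as exceptions/data.json (or merged into a root data.json); a file named data/exceptions.json is not picked up.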

opa build policies/ \
  --bundle \
  --output bundle.tar.gz \
  --revision "$(git rev-parse HEAD)"

Sign the bundle with a key pair so that OPA agents reject bundles that have been tampered with in transit:

# Generate signing key (do this once; store private key in Vault)
openssl genrsa -out signing.key 2048
openssl rsa -in signing.key -pubout -out signing-pub.pem

# Build and sign in one step: opa build embeds .signatures.json in the output bundle
opa build policies/ \
  --bundle \
  --revision "$(git rev-parse HEAD)" \
  --signing-key signing.key \
  --signing-alg RS256 \
  --output bundle-signed.tar.gz

OPA agents verify the signature at load time using the configured public key. A bundle without a valid signature is rejected.

Upload the signed bundle to object storage:

aws s3 cp bundle-signed.tar.gz \
  s3://org-opa-bundles/kubernetes/admission/latest/bundle.tar.gz \
  --sse aws:kms \
  --metadata "git-revision=$(git rev-parse HEAD),built-at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Use versioned paths (v1.2.3/bundle.tar.gz) in production and only update the latest/ pointer after smoke tests pass.

Bundle Server Configuration

Each OPA agent is configured to poll the bundle server. On Kubernetes, the OPA agent runs as a sidecar or as a standalone deployment behind an admission webhook.

services:
  bundle-server:
    url: https://opa-bundles.example.internal
    credentials:
      s3_signing:
        environment_credentials: {}

bundles:
  kubernetes-admission:
    service: bundle-server
    resource: /kubernetes/admission/latest/bundle.tar.gz
    polling:
      min_delay_seconds: 60
      max_delay_seconds: 120
    signing:
      keyid: policy-signing-key-2024

# Verification keys are declared at the top level and referenced by keyid
keys:
  policy-signing-key-2024:
    algorithm: RS256
    key: |
      -----BEGIN PUBLIC KEY-----
      MIIBIjANBgkqhkiG9w0BAQ...
      -----END PUBLIC KEY-----

status:
  prometheus: true

OPA exposes bundle status at /v1/status and, with Prometheus status reporting enabled, as metrics on its /metrics endpoint. Alert when bundles.kubernetes-admission.active_revision stops advancing after a release or when bundles.kubernetes-admission.code is non-empty:

# Metric names and nanosecond units follow OPA's status-plugin metrics;
# verify them against the OPA version you run.
- alert: OPABundleStale
  expr: time() - last_success_bundle_request{name="kubernetes-admission"} / 1e9 > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OPA bundle has not been refreshed in 5 minutes"

- alert: OPABundleError
  expr: increase(bundle_failed_load_counter{name="kubernetes-admission"}[10m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "OPA bundle load failed; policy may be stale or compromised"
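
Where Prometheus is not available, the same two conditions can be checked from the status document directly. This is a hedged Python sketch, assuming the caller passes the per-agent status object (the content under "result" in the /v1/status response) and that each bundle entry carries last_successful_request as an RFC 3339 UTC timestamp and an optional code field. The function names are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts: str) -> datetime:
    """Parse a UTC RFC 3339 timestamp, truncating digits beyond microseconds."""
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", ts)
    if not m:
        raise ValueError(f"unsupported timestamp: {ts}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    frac = (m.group(2) or "")[:6].ljust(6, "0")
    return base.replace(tzinfo=timezone.utc) + timedelta(microseconds=int(frac))


def bundle_problems(status: dict, name: str, now: datetime,
                    max_age: timedelta = timedelta(minutes=5)) -> list:
    """Return human-readable problems for one bundle; empty list means healthy."""
    bundle = status.get("bundles", {}).get(name)
    if bundle is None:
        return [f"bundle {name}: not reported"]
    problems = []
    if bundle.get("code"):
        problems.append(f"bundle {name}: error code {bundle['code']}")
    last = bundle.get("last_successful_request")
    if last is None:
        problems.append(f"bundle {name}: no successful request recorded")
    elif now - parse_rfc3339(last) > max_age:
        problems.append(f"bundle {name}: no successful request since {last}")
    return problems
```

The check keys on the last successful request rather than the last activation, because an unchanged bundle is requested on every poll but only activated when its content changes.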

CI/CD Integration

Policy code changes go through the same gate as application code. The CI pipeline runs on every PR targeting the policies/ directory.

# .github/workflows/policy-validation.yml
name: Policy Validation
on:
  pull_request:
    paths:
      - 'policies/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install OPA
        run: |
          curl -L -o opa https://openpolicyagent.org/downloads/v0.65.0/opa_linux_amd64_static
          chmod +x opa
          sudo mv opa /usr/local/bin/

      - name: Check Rego syntax and schemas
        run: opa check policies/ --bundle --schema schema/

      - name: Run policy tests with coverage
        run: |
          opa test policies/ \
            --coverage \
            --format json > test-results.json
          jq -e '.coverage >= 90' test-results.json

      - name: Build bundle
        run: |
          opa build policies/ --bundle --output bundle.tar.gz

      - name: Run conftest against IaC fixtures
        run: |
          conftest test terraform/examples/ \
            --policy policies/terraform/ \
            --all-namespaces \
            --output github

opa check catches parse errors, undefined references, and schema violations. It fails the build before any tests run. Tests gate on coverage. The conftest step validates that real Terraform configurations in the examples/ directory pass the current policy set — this prevents policy changes from being merged without verifying they do not break existing legitimate infrastructure.

For Terraform plans, conftest evaluates the plan JSON output:

terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json \
  --policy policies/terraform/ \
  --namespace terraform

The Rego rule reads the Terraform plan structure:

package terraform

import future.keywords.if
import future.keywords.contains

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.after.acl == "public-read"
    msg := sprintf("S3 bucket %v has public-read ACL", [resource.address])
}
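
To make the plan structure the rule walks concrete, here is a small Python sketch that applies the same check to a hand-written plan fragment. The sample plan is a hypothetical, heavily trimmed stand-in for real terraform show -json output, and the Python function only mirrors the Rego rule for illustration; it is not a replacement for it.

```python
# Hypothetical, trimmed plan fragment: real plans carry many more fields.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "private"}}},
        {"address": "aws_s3_bucket.site", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        # A deleted resource has no "after" state.
        {"address": "aws_s3_bucket.old", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"], "after": None}},
    ]
}


def public_bucket_violations(plan: dict) -> list:
    """Mirror of the Rego deny rule: one message per public-read bucket."""
    messages = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") == "public-read":
            messages.append(f"S3 bucket {rc['address']} has public-read ACL")
    return messages
```

A fragment like SAMPLE_PLAN, saved as JSON, also works as a compact fixture for the Rego tests: one compliant resource, one violating resource, and one deletion to confirm the rule tolerates a missing after state.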

This CI/CD integration is a natural extension of the patterns covered in the Internal Developer Platform security architecture.

Managing Exceptions Without Exceptions-by-Default

The anti-pattern is encoding exceptions directly in rules:

# Anti-pattern: exception baked into rule logic
deny contains msg if {
    input.request.object.spec.containers[_].securityContext.privileged == true
    not input.request.namespace == "kube-system"
    not input.request.object.metadata.name == "falco"
    msg := "privileged container denied"
}

Every exception is a code change. Code changes require review. Over time the rule becomes a list of special cases that is impossible to audit as a policy. The reviewer approving the rule change is evaluating code, not policy intent.

The correct model stores exceptions in the data plane (here, the exceptions data file), not the policy plane:

{
  "EX-2025-001": {
    "policy": "deny_privileged",
    "namespace": "monitoring",
    "workload": "node-exporter",
    "expires": "2025-12-31",
    "approved_by": "security-team",
    "ticket": "SEC-4412",
    "reason": "Node exporter requires host PID namespace, tracked for replacement with eBPF collector"
  }
}

The rule reads the exception data:

package kubernetes.admission.deny_privileged

import future.keywords.if
import future.keywords.contains

deny contains msg if {
    container := input.request.object.spec.containers[_]
    container.securityContext.privileged == true
    not has_valid_exception
    msg := sprintf("privileged container %v denied; request an exception via SEC ticket", [container.name])
}

has_valid_exception if {
    exception_id := input.request.object.metadata.labels["exception-id"]
    ex := data.exceptions[exception_id]
    ex.policy == "deny_privileged"
    # The exception lapses at 00:00 UTC on its expiry date
    time.now_ns() < time.parse_rfc3339_ns(sprintf("%sT00:00:00Z", [ex.expires]))
}

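The exception record is a data change, reviewable by a security team workflow that is separate from the engineering code review process. Expired exceptions stop working automatically, with no manual cleanup required. An audit query over data.exceptions returns all current exceptions with their expiry and approval trail. Note that the rule as shown trusts the exception-id label alone, so anyone who can set labels can claim an existing exception. In production, also compare the record’s namespace and workload fields against the request so that an exception applies only to the workload it was approved for.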
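
Because exception records are reviewed as data, it helps to validate them before they are published. The following Python sketch checks each record for an approval trail and a parseable expiry date, and lists the exceptions still in force. The set of required fields is an assumption drawn from the example record above, and the function names are illustrative.

```python
from datetime import date

# Assumed minimum approval trail, taken from the example record.
REQUIRED_FIELDS = ("policy", "expires", "approved_by", "ticket", "reason")


def record_errors(ex_id: str, record: dict) -> list:
    """Return validation errors for one exception record."""
    errors = [f"{ex_id}: missing field {f}"
              for f in REQUIRED_FIELDS if not record.get(f)]
    try:
        date.fromisoformat(str(record.get("expires", "")))
    except ValueError:
        errors.append(f"{ex_id}: expires is not a YYYY-MM-DD date")
    return errors


def active_exceptions(exceptions: dict, today: date) -> dict:
    """Return valid records whose expiry date is still in the future.

    Matches the rule's semantics: an exception lapses at 00:00 UTC on
    its expiry date, so it is no longer active on that day.
    """
    return {
        ex_id: rec for ex_id, rec in exceptions.items()
        if not record_errors(ex_id, rec)
        and date.fromisoformat(rec["expires"]) > today
    }
```

Running record_errors over the whole file in the security team's review pipeline would reject a record with no ticket or a malformed date before it reaches any OPA agent.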

Alert on exceptions approaching expiry:

package exceptions.audit

import future.keywords.if
import future.keywords.contains

# Exceptions that are still valid but lapse within 30 days
expiring_soon contains ex if {
    ex := data.exceptions[_]
    expire_ns := time.parse_rfc3339_ns(sprintf("%sT00:00:00Z", [ex.expires]))
    days_remaining := (expire_ns - time.now_ns()) / 1000000000 / 86400
    days_remaining < 30
    days_remaining > 0
}

OPA in Kubernetes: Gatekeeper

Gatekeeper implements the OPA evaluation loop as a native Kubernetes controller. Policy authors work with two CRDs: ConstraintTemplate defines the Rego logic and the schema of the constraint, and Constraint instantiates that template with specific parameters.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredregistries
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredregistries

        import future.keywords.if
        import future.keywords.contains

        violation contains {"msg": msg} if {
            container := input.review.object.spec.containers[_]
            not image_allowed(container.image)
            msg := sprintf("image %v is not from an approved registry", [container.image])
        }

        image_allowed(image) if {
            registry := input.parameters.registries[_]
            startswith(image, registry)
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredRegistries
metadata:
  name: require-approved-registries
spec:
  enforcementAction: deny
  match:
    kinds:
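      # The template reads spec.containers, which is where containers live on a Pod;
      # Deployments, StatefulSets and DaemonSets nest them under spec.template.spec.
      - apiGroups: [""]
        kinds: ["Pod"]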
    excludedNamespaces:
      - kube-system
  parameters:
    registries:
      - "registry.internal.example.com/"
      - "gcr.io/distroless/"
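
The trailing slash on each registry entry matters because the template matches by string prefix. The short Python sketch below reproduces the startswith check to show the failure mode; the attacker hostname is a made-up example, and the helper names are illustrative.

```python
APPROVED = ["registry.internal.example.com/", "gcr.io/distroless/"]


def image_allowed(image: str, registries: list) -> bool:
    """Same prefix test the ConstraintTemplate performs with startswith."""
    return any(image.startswith(prefix) for prefix in registries)


def unanchored_prefixes(registries: list) -> list:
    """Return entries lacking a trailing slash, which match look-alike hosts."""
    return [r for r in registries if not r.endswith("/")]


# Without the trailing slash, a look-alike hostname passes the check.
LOOKALIKE = "registry.internal.example.com.attacker.example/app:1.0"
```

With the entry written as "registry.internal.example.com", LOOKALIKE is accepted; with the trailing slash it is rejected. A check like unanchored_prefixes could run against constraint parameters in CI to catch an entry added without the slash.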

Set enforcementAction: dryrun when rolling out a new constraint across existing clusters. Gatekeeper’s audit controller evaluates all existing resources against the constraint and reports violations in the constraint’s .status.violations field without blocking anything. This surfaces your existing compliance gap before you enforce it.

For a comparison with an alternative Kubernetes policy engine that uses a different paradigm, see Kyverno controller security.

Deploy Gatekeeper with the mutation webhook disabled initially and enable it only after the validation webhook is running stably. Separate the admission webhook timeout configuration from the failure policy: setting failurePolicy: Ignore on the admission webhook means OPA downtime does not block deployments, but it also means your policy is unenforced during that window. Alert on webhook latency and treat OPA availability as a security-relevant SLO.

Styra DAS

Styra Declarative Authorization Service is the commercial management plane for OPA. It adds a policy authoring UI, impact analysis before deployment, a GitOps sync mechanism, decision logging at scale, and a centralised dashboard across all OPA agents in the fleet.

The architectural difference from self-managed OPA is that Styra DAS manages the bundle server, the signing key infrastructure, and the status collection that you would otherwise build yourself. Each OPA agent registers with the DAS control plane over mTLS and receives its bundle configuration from the plane rather than from a local config file. Policy changes go through DAS’s approval workflow, and DAS provides per-decision audit logs with policy provenance — which rule in which bundle revision produced the Allow or Deny.

For organisations with more than a handful of OPA deployments, the operational cost of self-managing bundle distribution, signing, and status monitoring typically exceeds the Styra DAS licensing cost at some fleet size. The break-even depends on available platform engineering time. The self-managed path described in this article is viable and production-proven; the gap is operational tooling, not security properties.

Operational Discipline

A policy library that is not tested is a liability. A policy library with no ownership model drifts into the same state as any other undocumented runbook. Three practices sustain policy as code over time:

CODEOWNERS over policy directories. Each domain directory in the policy library is owned by the team responsible for that domain. The security team owns policies/lib/ and the exception data. Platform engineering owns policies/kubernetes/. Each change to a directory requires approval from its owner. This enforces review accountability without centralising all policy authorship in the security team.

Decision logging. With decision logging enabled, OPA emits a log entry for every policy query it evaluates. Ship these to your SIEM. Each entry includes the query path, the input, the result, a decision ID, and the revision of the bundle that produced it. Querying decision logs answers “how many times was this policy violated in the last 24 hours and by which workloads”, which is the data your security metrics programme needs.

Policy deprecation. Rules that are no longer enforced accumulate in the library unless there is a removal process. Mark rules that are being phased out with a custom metadata annotation carrying a deprecated flag and a removal date. Run an OPA query in CI that fails if any deprecated rule has passed its removal date. This keeps the library from becoming an archaeological site.
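
As one way to implement the deprecation gate, the Python sketch below scans Rego source text for a removal date and reports rules that are overdue. It is a plain text scan rather than an OPA query, and the remove_after key (written as a comment line inside a METADATA custom block, unquoted, in YYYY-MM-DD form) is a hypothetical convention, not an OPA built-in.

```python
import re
from datetime import date

# Hypothetical convention: "#   remove_after: 2025-06-30" inside a
# "# METADATA" / "# custom:" comment block.
REMOVE_AFTER_RE = re.compile(
    r"^#\s*remove_after:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
)


def overdue_removals(sources: dict, today: date) -> list:
    """Return (path, removal_date) for every annotation past its date.

    `sources` maps a file path to that file's Rego source text.
    """
    overdue = []
    for path, text in sorted(sources.items()):
        for match in REMOVE_AFTER_RE.finditer(text):
            if today > date.fromisoformat(match.group(1)):
                overdue.append((path, match.group(1)))
    return overdue
```

A CI step would read every .rego file under policies/ into the sources mapping and fail when the returned list is non-empty, naming the files whose removal date has passed.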

The same version-controlled, testable, reviewable workflow that governs application code must govern policy code. The alternative — ad-hoc enforcement by humans at review time — does not scale past the first hundred services.