Terraform Security Guardrails: Preventing Cloud Misconfigurations at the Infrastructure Layer

The Misconfiguration Problem

IaC misconfigurations are not subtle. A public S3 bucket is a public S3 bucket. An open security group rule is a one-line resource block that any scanner can flag in milliseconds. The reason they reach production is not that they are hard to detect — it is that detection happens after deployment.

The three misconfiguration classes that generate the most incident reports:

Public exposure. aws_s3_bucket_acl with acl = "public-read". Security group ingress rules with cidr_blocks = ["0.0.0.0/0"] on ports 22, 3389, or 5432. RDS instances with publicly_accessible = true. These create directly reachable attack surface with no compensating controls.

Missing encryption. RDS instances without storage_encrypted = true. S3 buckets without server-side encryption configured via aws_s3_bucket_server_side_encryption_configuration. EBS volumes with encrypted = false. When these resources contain PII or credentials, unencrypted storage is a compliance violation in addition to a security failure.

Overpermissive IAM. Policy statements with actions = ["*"] and resources = ["*"] in an aws_iam_policy_document, or the equivalent "Action": "*" and "Resource": "*" in the JSON of an aws_iam_policy. Instance profiles attached to EC2 instances that can call iam:CreateUser or iam:AttachUserPolicy. Lambda execution roles with AdministratorAccess. Any of these can turn a compromised compute resource into a full account takeover.
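To show how mechanical these checks are, here is a minimal Python sketch that tests a resource's attributes against a few of the patterns named above. The function name classify and the flat attribute dictionaries are assumptions made for this illustration; real scanners work on parsed HCL or plan output and cover far more cases.

```python
# Hypothetical helper: map one resource's attributes to the misconfiguration
# classes described above. Attribute names mirror the Terraform arguments.
SENSITIVE_PORTS = {22, 3389, 5432}


def classify(resource_type: str, attrs: dict) -> list:
    findings = []
    if resource_type == "aws_s3_bucket_acl" and attrs.get("acl") == "public-read":
        findings.append("public exposure: public-read bucket ACL")
    if resource_type == "aws_security_group_rule" and attrs.get("type") == "ingress":
        open_cidr = "0.0.0.0/0" in attrs.get("cidr_blocks", [])
        low, high = attrs.get("from_port", 0), attrs.get("to_port", 0)
        # The rule covers a port range, so test each sensitive port against it.
        if open_cidr and any(low <= port <= high for port in SENSITIVE_PORTS):
            findings.append("public exposure: sensitive port open to the internet")
    if resource_type == "aws_db_instance":
        if attrs.get("publicly_accessible") is True:
            findings.append("public exposure: publicly accessible RDS instance")
        if attrs.get("storage_encrypted") is not True:
            findings.append("missing encryption: RDS storage not encrypted")
    if resource_type == "aws_ebs_volume" and attrs.get("encrypted") is not True:
        findings.append("missing encryption: EBS volume not encrypted")
    return findings


rule = {"type": "ingress", "cidr_blocks": ["0.0.0.0/0"], "from_port": 22, "to_port": 22}
print(classify("aws_security_group_rule", rule))
```

Every condition is a comparison against a literal, which is why detection itself is cheap. What varies between teams is where in the workflow the comparison runs.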

The fix is to put gates at every point where code is written, reviewed, and deployed: the developer’s machine, the PR, and the deployment pipeline.

Checkov: Scanning Terraform Directories

Checkov ships with over 1,000 built-in checks across Terraform, CloudFormation, Kubernetes manifests, and Helm charts. For Terraform specifically, it parses HCL natively and resolves variable references and module outputs before evaluating checks, which reduces false positives on parameterised configurations.

Basic directory scan with CLI output:

checkov -d ./terraform --framework terraform

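Fail only on HIGH and CRITICAL severity and write SARIF for GitHub PR annotations. Severity-based thresholds rely on severity metadata that Checkov only has when it runs with a platform API key; without one, filter by check ID instead. --output-file-path normally names a directory, so the comma-separated form below keeps CLI output on the console and writes the SARIF report to results.sarif:

checkov -d ./terraform \
  --framework terraform \
  --output cli \
  --output sarif \
  --output-file-path console,results.sarif \
  --hard-fail-on HIGH,CRITICAL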
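A SARIF report is plain JSON, so it can be summarised before or after upload. The sketch below counts findings per rule and level. The function name summarize_sarif and the sample data are illustrative assumptions; only the runs, results, ruleId, and level fields from the SARIF 2.1.0 format are relied on.

```python
from collections import Counter


def summarize_sarif(sarif: dict) -> Counter:
    """Count findings per (ruleId, level) across all runs in a SARIF document."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # "level" is optional in SARIF; this sketch treats a missing
            # level as "warning".
            key = (result.get("ruleId", "unknown"), result.get("level", "warning"))
            counts[key] += 1
    return counts


# Hand-written sample shaped like a SARIF report (not real scanner output).
sample = {
    "runs": [
        {
            "results": [
                {"ruleId": "CKV_AWS_18", "level": "error"},
                {"ruleId": "CKV_AWS_18", "level": "error"},
                {"ruleId": "CKV_AWS_144"},
            ]
        }
    ]
}

for (rule_id, level), count in sorted(summarize_sarif(sample).items()):
    print(f"{rule_id:<14} {level:<8} {count}")
```

In practice the dictionary would come from json.load on the results.sarif file.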

Checkov check IDs follow the format CKV_AWS_*, CKV_GCP_*, CKV_AZURE_*. Suppress a false positive inline with a comment directly in the Terraform resource:

resource "aws_s3_bucket" "access_logs" {
  bucket = "my-access-logs-bucket"

  # checkov:skip=CKV_AWS_18:This bucket IS the access log destination; logging to itself is circular
  # checkov:skip=CKV_AWS_144:Cross-region replication not required for ephemeral log data
}

Suppression is tracked in source control alongside the resource, so reviewers can see both the suppression and its justification.
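Because suppressions live in source, they can also be audited. The following sketch lists every checkov:skip annotation in a Terraform file together with its justification, so suppressions with no reason can be flagged in review. The function name find_suppressions and the regular expression are assumptions for this example, not part of Checkov.

```python
import re

# Matches "checkov:skip=<CHECK_ID>" with an optional ":<justification>" suffix.
SKIP_RE = re.compile(r"checkov:skip=([A-Za-z0-9_]+)(?::(.*))?")


def find_suppressions(terraform_source: str) -> list:
    """Return every checkov:skip annotation with its line number and reason."""
    found = []
    for lineno, line in enumerate(terraform_source.splitlines(), start=1):
        match = SKIP_RE.search(line)
        if match:
            reason = (match.group(2) or "").strip()
            found.append({"line": lineno, "check": match.group(1), "reason": reason})
    return found


source = '''resource "aws_s3_bucket" "access_logs" {
  bucket = "my-access-logs-bucket"

  # checkov:skip=CKV_AWS_18:This bucket IS the access log destination
  # checkov:skip=CKV_AWS_144
}
'''

for item in find_suppressions(source):
    status = item["reason"] or "NO JUSTIFICATION"
    print(f"line {item['line']}: {item['check']} ({status})")
```

A check like this could run in CI and fail when a suppression has an empty reason.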

When the built-in checks do not cover an internal standard, write a custom check in Python. For example, to require a specific set of tags on all S3 buckets:

# checkov/custom_checks/check_s3_required_tags.py
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class S3RequiredTagsCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 buckets have required cost-allocation tags"
        id = "CKV_CUSTOM_S3_001"
        supported_resources = ["aws_s3_bucket"]
        categories = [CheckCategories.GENERAL_SECURITY]
        super().__init__(name=name, id=id, categories=categories,
                         supported_resources=supported_resources)

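    def scan_resource_conf(self, conf):
        # Checkov passes attribute values wrapped in a list, hence the [0].
        tags = conf.get("tags", [{}])[0]
        required_tags = {"Environment", "Team", "CostCenter"}
        # tags may be an unresolved expression (a string) rather than a dict.
        if isinstance(tags, dict) and required_tags.issubset(tags.keys()):
            return CheckResult.PASSED
        return CheckResult.FAILED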

check = S3RequiredTagsCheck()

Run with the custom check directory:

checkov -d ./terraform \
  --external-checks-dir ./checkov/custom_checks \
  --check CKV_CUSTOM_S3_001
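The tag logic can be unit-tested without installing Checkov by lifting it into a plain function. This is a sketch under one assumption: that each attribute arrives wrapped in a single-element list, which is what the [0] index in the check above relies on. The name has_required_tags is hypothetical.

```python
REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}


def has_required_tags(conf: dict) -> bool:
    """Mirror of the tag logic in S3RequiredTagsCheck, without Checkov imports.

    Assumes attribute values are wrapped in a list, e.g. {"tags": [{...}]}.
    """
    wrapped = conf.get("tags") or [{}]
    tags = wrapped[0]
    return isinstance(tags, dict) and REQUIRED_TAGS.issubset(tags.keys())


compliant = {"tags": [{"Environment": "prod", "Team": "data", "CostCenter": "42"}]}
partial = {"tags": [{"Environment": "prod"}]}

print(has_required_tags(compliant))
print(has_required_tags(partial))
print(has_required_tags({}))
```

Unlike the original one-liner, this version also tolerates an empty tags list instead of raising IndexError.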

Trivy Config Scanning for Terraform

Trivy’s config subcommand scans Terraform directories using its built-in misconfiguration checks. It ships in the same binary as the container image scanner, so if Trivy is already in the pipeline for image scanning, enabling IaC scanning requires no additional tooling.

trivy config ./terraform \
  --severity HIGH,CRITICAL \
  --exit-code 1

The --exit-code 1 flag causes Trivy to exit non-zero when findings at or above the severity threshold exist, which fails the CI step. Without it, Trivy always exits 0 regardless of findings.

In GitHub Actions, adding Terraform scanning alongside existing image scanning:

- name: Trivy - Terraform misconfig scan
  uses: aquasecurity/trivy-action@0.28.0
  with:
    scan-type: config
    scan-ref: ./terraform
    severity: HIGH,CRITICAL
    exit-code: '1'
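    format: sarif
    output: trivy-terraform.sarif
    # With SARIF output the action reports all severities unless this is set
    limit-severities-for-sarif: true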

- name: Upload Trivy results
  uses: github/codeql-action/upload-sarif@v3
  if: always()
  with:
    sarif_file: trivy-terraform.sarif

Trivy’s check IDs are AVD-prefixed (e.g., AVD-AWS-0086 for S3 public access block). Inline suppression uses trivy:ignore annotations in the Terraform file:

#trivy:ignore:AVD-AWS-0086
resource "aws_s3_bucket" "public_assets" {
  bucket = "my-public-static-assets"
}

Pre-Commit Hooks for IaC

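Pre-commit hooks catch issues before a commit reaches the remote, before CI runs, and before a reviewer sees the code. For Terraform, three hooks are worth enforcing on every repository: terraform validate, tflint, and checkov. The configuration below also runs terraform fmt and two generic hooks that detect private keys and leftover merge-conflict markers.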

.pre-commit-config.yaml:

repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
        args:
          - --init-args=-backend=false
      - id: terraform_tflint
        args:
          - --args=--config=__GIT_WORKING_DIR__/.tflint.hcl
      - id: terraform_checkov
        args:
          - --args=--framework terraform
          - --args=--hard-fail-on HIGH,CRITICAL
          - --args=--skip-check CKV2_AWS_6,CKV_AWS_144

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: detect-private-key
      - id: check-merge-conflict

terraform validate with -backend=false skips backend initialisation, which is necessary for hooks that run without a configured state backend. tflint with a config file enables provider-specific rules — the .tflint.hcl file configures which ruleset to load:

# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.36.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "aws_instance_invalid_type" {
  enabled = true
}

rule "aws_iam_policy_document_gov_friendly_arns" {
  enabled = false
}

Install the hooks and run against all files for initial adoption:

pre-commit install
pre-commit run --all-files

Conftest with OPA: Policy Gates on Terraform Plan JSON

Checkov and Trivy check Terraform source files. Conftest with OPA checks the terraform plan JSON output — what Terraform will actually do, after resolving all variables, module calls, data sources, and remote state references. This distinction matters: a resource configuration might look acceptable in source but reference a variable that evaluates to 0.0.0.0/0 at plan time.

Generate a plan JSON:

terraform init
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
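The plan JSON is ordinary JSON with a top-level resource_changes list, so any language can inspect it before a policy engine does. A minimal sketch follows; the function name summarize_plan and the sample plan are assumptions for this illustration.

```python
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count planned actions per resource type from `terraform show -json` output."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        # A replacement is reported as two actions, e.g. ["delete", "create"].
        counts[(rc["type"], "+".join(actions))] += 1
    return counts


# Hand-written sample with the same shape as a real plan export.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"]}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc.main", "type": "aws_vpc",
         "change": {"actions": ["no-op"]}},
    ]
}

for (rtype, action), count in sorted(summarize_plan(sample_plan).items()):
    print(f"{rtype}: {action} x{count}")
```

Policies written against this structure see resolved values in change.after, which is the property the rest of this section depends on.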

Write a Rego policy that enforces tagging standards and blocks open security group rules. Store policies in ./policy/:

# policy/security_groups.rego
package terraform.security_groups

import future.keywords.contains
import future.keywords.if

# deny is a set of messages; "contains" states that explicitly. Recent OPA
# versions read "deny[msg] if" as a single-value rule, not a set member.
deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.type == "ingress"
  cidr := resource.change.after.cidr_blocks[_]
  cidr == "0.0.0.0/0"
  msg := sprintf(
    "Security group rule '%s' allows ingress from 0.0.0.0/0; restrict to specific CIDRs",
    [resource.address]
  )
}

deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.type == "ingress"
  cidr := resource.change.after.ipv6_cidr_blocks[_]
  cidr == "::/0"
  msg := sprintf(
    "Security group rule '%s' allows ingress from ::/0; restrict to specific IPv6 CIDRs",
    [resource.address]
  )
}

# policy/tagging.rego
package terraform.tagging

import future.keywords.contains
import future.keywords.if
import future.keywords.in

required_tags := {"Environment", "Team", "CostCenter"}

taggable_types := {
  "aws_instance", "aws_s3_bucket", "aws_rds_cluster",
  "aws_elasticache_cluster", "aws_eks_cluster", "aws_lambda_function"
}

deny contains msg if {
  resource := input.resource_changes[_]
  resource.type in taggable_types
  resource.change.actions[_] in {"create", "update"}
  # tags is null or absent when none are set; treat both as "no tags"
  tags := object.get(resource.change.after, "tags", {})
  missing := required_tags - {k | tags[k]}
  count(missing) > 0
  msg := sprintf(
    "Resource '%s' is missing required tags: %v",
    [resource.address, missing]
  )
}

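Run Conftest against the plan JSON. By default Conftest only evaluates the main namespace, so either name a namespace explicitly or evaluate every namespace under ./policy: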

conftest test tfplan.json \
  --policy ./policy \
  --namespace terraform.security_groups

conftest test tfplan.json \
  --policy ./policy \
  --all-namespaces
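For readers less familiar with Rego, the tagging rule reduces to a set difference per resource. The sketch below is a Python rendering of the same logic, useful for reasoning about edge cases such as a resource with no tags at all. It is an illustration rather than a replacement for the policy, and the function name missing_tag_violations is hypothetical.

```python
REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}
TAGGABLE_TYPES = {
    "aws_instance", "aws_s3_bucket", "aws_rds_cluster",
    "aws_elasticache_cluster", "aws_eks_cluster", "aws_lambda_function",
}


def missing_tag_violations(plan: dict) -> list:
    """Return one message per created or updated resource that lacks required tags."""
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc["type"] not in TAGGABLE_TYPES:
            continue
        if not set(rc["change"]["actions"]) & {"create", "update"}:
            continue
        # "after" is null for deletes, and "tags" is null when no tags are set.
        after = rc["change"].get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            messages.append(
                f"Resource '{rc['address']}' is missing required tags: {sorted(missing)}"
            )
    return messages


plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"tags": None}}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["update"],
                    "after": {"tags": {"Environment": "prod", "Team": "web",
                                       "CostCenter": "7"}}}},
    ]
}

for message in missing_tag_violations(plan):
    print(message)
```

The untagged bucket is reported with all three tags missing, while the fully tagged instance passes.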

In GitHub Actions, the full sequence — plan, export, test:

- name: Terraform plan
  run: |
    terraform init -backend-config=backend.hcl
    terraform plan -out=tfplan.binary
    terraform show -json tfplan.binary > tfplan.json

- name: Conftest policy check
  run: |
    conftest test tfplan.json \
      --policy ./policy \
      --all-namespaces \
      --output github

The --output github flag emits findings as GitHub Actions annotations. See the policy-as-code at scale guide for structuring OPA policies across multiple repositories.

Terraform Cloud Sentinel: Policy as Code in the Apply Workflow

Sentinel is HashiCorp’s policy framework built into Terraform Cloud and Terraform Enterprise. Unlike Conftest, which runs in CI as a separate step, Sentinel policies run inside the Terraform Cloud plan/apply workflow, between plan completion and apply approval. They therefore have access to the full plan, the Terraform Cloud workspace metadata, and cost estimation results.

A Sentinel policy that blocks creation of unencrypted RDS instances:

# sentinel/require-rds-encryption.sentinel
import "tfplan/v2" as tfplan

rds_instances = filter tfplan.resource_changes as _, resource_change {
  resource_change.type is "aws_db_instance" and
  (resource_change.change.actions contains "create" or
   resource_change.change.actions contains "update")
}

violations = filter rds_instances as address, instance {
  instance.change.after.storage_encrypted is not true
}

main = rule {
  length(violations) is 0
}

A Sentinel policy that enforces cost limits using cost estimation data:

# sentinel/restrict-monthly-cost.sentinel
import "tfrun"

max_monthly_cost = 500.00

main = rule {
  float(tfrun.cost_estimate.proposed_monthly_cost) < max_monthly_cost
}

Policy sets in Terraform Cloud are configured in sentinel.hcl and attached to workspaces:

# sentinel.hcl
policy "require-rds-encryption" {
  source            = "./require-rds-encryption.sentinel"
  enforcement_level = "hard-mandatory"
}

policy "restrict-monthly-cost" {
  source            = "./restrict-monthly-cost.sentinel"
  enforcement_level = "soft-mandatory"
}

hard-mandatory blocks apply with no override option. soft-mandatory allows an authorized user to override the policy with a written justification. advisory logs a warning but never blocks.
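The three enforcement levels amount to a small decision table. The sketch below models it in Python as a simplification of the behaviour described above; the function name apply_allowed and the overridden flag are assumptions for this illustration and not part of any Sentinel API.

```python
LEVELS = {"advisory", "soft-mandatory", "hard-mandatory"}


def apply_allowed(level: str, policy_passed: bool, overridden: bool = False) -> bool:
    """Decide whether an apply may proceed for one policy result."""
    if level not in LEVELS:
        raise ValueError(f"unknown enforcement level: {level}")
    if policy_passed or level == "advisory":
        # Advisory failures are logged but never block.
        return True
    # Only soft-mandatory failures can be overridden; hard-mandatory cannot.
    return level == "soft-mandatory" and overridden


for level in sorted(LEVELS):
    print(level, apply_allowed(level, policy_passed=False, overridden=True))
```

A failing hard-mandatory policy returns False even with an override, which is the reason to reserve that level for rules with no legitimate exceptions.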

Policy sets are linked to a VCS repository and assigned to specific workspaces, to projects, or to the whole organization, in the Terraform Cloud UI or via the API. The same policy set can therefore enforce consistent standards across all workspaces in an organization without duplicating policy code. This connects directly to the principles in the golden path security guide.

Atlantis Security Hardening

Atlantis is a self-hosted tool that runs terraform plan and terraform apply triggered by pull request comments. Misconfigured Atlantis deployments are a significant attack surface: anyone who can comment on a PR can run arbitrary Terraform with whatever credentials Atlantis holds.

The essential hardening configuration in atlantis.yaml and the server flags:

# atlantis.yaml (repository-level configuration)
version: 3
projects:
  - name: production
    dir: ./terraform/environments/production
    workspace: production
    terraform_version: v1.9.0
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.tfvars"
      enabled: true
    apply_requirements:
      - approved
      - mergeable

apply_requirements: approved means Atlantis will refuse to run terraform apply unless the PR has at least one approving review. mergeable means the PR must meet all branch protection requirements — status checks passing, no merge conflicts — before apply is permitted.

Server-level configuration controlling which repositories Atlantis will respond to:

# atlantis server config
repos:
  # A regex id must be wrapped in slashes; otherwise it is matched as an exact string
  - id: /github.com/myorg/.*/
    branch: /.*/
    apply_requirements:
      - approved
      - mergeable
    workflow: default
    allowed_overrides:
      - apply_requirements
    allow_custom_workflows: false

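allow_custom_workflows: false prevents repository owners from defining custom workflow steps that could execute arbitrary commands with Atlantis’s cloud credentials. Listing apply_requirements under allowed_overrides is what permits the repository-level atlantis.yaml above to set its own requirements, but it also lets a repository weaken them, so leave that list empty wherever the server-side requirements must always hold.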

Webhook secret validation is configured at the server level:

atlantis server \
  --gh-webhook-secret="$GITHUB_WEBHOOK_SECRET" \
  --gh-user="$GITHUB_USER" \
  --gh-token="$GITHUB_TOKEN" \
  --repo-allowlist="github.com/myorg/*" \
  --atlantis-url="https://atlantis.internal.example.com" \
  --ssl-cert-file=/etc/atlantis/tls.crt \
  --ssl-key-file=/etc/atlantis/tls.key \
  --write-git-creds

--repo-allowlist restricts Atlantis to repositories under github.com/myorg/, so webhooks from any other repository are ignored. --gh-webhook-secret is equally critical: without it, anyone who can reach the Atlantis endpoint and send a valid-looking webhook can trigger plan and apply, because Atlantis has no way to verify that the request came from GitHub.
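GitHub signs each webhook delivery with HMAC-SHA256 over the raw request body and sends the result in the X-Hub-Signature-256 header. The sketch below shows that verification scheme in general terms. It is not Atlantis's implementation, and the function name signature_is_valid is hypothetical.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub-style X-Hub-Signature-256 header against the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so an attacker cannot learn
    # how many leading characters of a forged signature were correct.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


secret = b"example-webhook-secret"  # placeholder value for the demonstration
body = b'{"action": "created"}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

print(signature_is_valid(secret, body, good))
print(signature_is_valid(secret, body, "sha256=" + "0" * 64))
```

A request whose signature does not match is discarded before any Terraform command runs, which is why the secret belongs on every Atlantis deployment reachable from outside the network.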

To keep repository-level configuration from weakening these controls, leave allowed_overrides empty so that the server-side apply_requirements always apply:

# server-side repos config
repos:
  - id: /github.com/myorg/.*/
    apply_requirements:
      - approved
    allowed_overrides: []
    allow_custom_workflows: false

Atlantis should run with a dedicated IAM role or service account with the minimum permissions needed for the Terraform workspaces it manages. A single Atlantis instance managing both staging and production should use separate AWS roles per workspace, switching via assume_role in the backend configuration rather than holding a single set of production credentials.

Drift Detection

Terraform configuration describes intended infrastructure, and state records what Terraform last applied. Reality drifts from both. An engineer manually modifies a security group to allow traffic during an incident. A cloud console click adds an IAM policy. The state file no longer matches what exists.

Terraform Cloud’s native drift detection runs a plan against each workspace on a schedule and surfaces the diff in the UI. Enable it per-workspace:

# Terraform Cloud workspace settings (via API or UI)
assessments_enabled = true

When drift is detected, Terraform Cloud creates an assessment result showing the diff between state and reality. Teams can configure notifications via webhooks.

driftctl covers resources that Terraform Cloud drift detection misses — resources that exist in the cloud account but have no corresponding Terraform state entry (unmanaged resources):

# Scan AWS account and compare against all Terraform state files in S3
driftctl scan \
  --from tfstate+s3://my-tfstate-bucket/terraform.tfstate \
  --from tfstate+s3://my-tfstate-bucket/production/terraform.tfstate \
  --output console://

# Output JSON for processing
driftctl scan \
  --from tfstate+s3://my-tfstate-bucket/terraform.tfstate \
  --output json://drift-report.json

driftctl reports three categories: resources managed by Terraform that have drifted, resources managed by Terraform that no longer exist (deleted outside Terraform), and resources that exist in the account but are not managed by any Terraform state.

The third category — unmanaged resources — is frequently where manual changes that create security risk accumulate.
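The JSON report can be post-processed to surface that third category. The sketch below groups unmanaged resources by type. It assumes the report has a top-level "unmanaged" list whose entries carry "id" and "type" keys, which should be verified against the output of the driftctl version in use. The function name unmanaged_by_type and the sample data are hypothetical.

```python
from collections import Counter


def unmanaged_by_type(report: dict) -> Counter:
    """Count unmanaged resources per type in a parsed drift report.

    Assumption: report["unmanaged"] is a list of {"id": ..., "type": ...}.
    """
    return Counter(item["type"] for item in report.get("unmanaged") or [])


# Hand-written sample, not real driftctl output.
sample_report = {
    "unmanaged": [
        {"id": "sg-0a1b2c", "type": "aws_security_group"},
        {"id": "sg-3d4e5f", "type": "aws_security_group"},
        {"id": "incident-debug-role", "type": "aws_iam_role"},
    ]
}

for rtype, count in unmanaged_by_type(sample_report).most_common():
    print(f"{rtype}: {count}")
```

Sorting by count puts the resource types with the most unmanaged instances first, which is usually where manual changes cluster.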

Secrets in Terraform State

Terraform state is a security boundary problem. Resources that generate secrets — aws_db_instance with password, aws_iam_access_key, tls_private_key — write their sensitive values to state in plaintext by default. Anyone with read access to state reads all secrets.

State backend encryption with S3:

terraform {
  backend "s3" {
    bucket         = "my-tfstate-bucket"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"
    dynamodb_table = "terraform-state-lock"
  }
}

encrypt = true with kms_key_id uses a customer-managed KMS key rather than S3’s default SSE key. This means state can only be decrypted by identities that have kms:Decrypt permission on that specific key — tightening access control beyond S3 bucket policies alone.

Avoid generating secrets in Terraform where possible. When Terraform does have to create a database password, also store it in AWS Secrets Manager so that applications read it from there and it can be rotated:

resource "random_password" "db_password" {
  length           = 32
  special          = true
  override_special = "!#$%&*()-_=+[]{}<>:?"
}

resource "aws_secretsmanager_secret" "db_password" {
  name       = "production/rds/master-password"
  kms_key_id = aws_kms_key.secrets.arn
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_password.result
}

resource "aws_db_instance" "main" {
  password = random_password.db_password.result
  # ...
}

The password still appears in state (in the random_password, aws_secretsmanager_secret_version, and aws_db_instance resources), but it is also in Secrets Manager, where it can be rotated and accessed by applications without additional Terraform runs.

For sensitive outputs, mark them explicitly:

output "db_connection_string" {
  value     = "postgresql://${aws_db_instance.main.username}:${random_password.db_password.result}@${aws_db_instance.main.endpoint}/${aws_db_instance.main.db_name}"
  sensitive = true
}

sensitive = true prevents the value from appearing in plan and apply output. It does not prevent it from being stored in state — that is why state encryption is non-negotiable for any workspace managing secrets.
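To see what state actually holds, a state file can be scanned for attributes that look like secrets. The sketch below walks the resources, instances, and attributes of a version 4 state document. The attribute names in SENSITIVE_KEYS are a heuristic chosen for this example, and the function name plaintext_secrets is hypothetical.

```python
# Heuristic list of attribute names that commonly hold secret values.
SENSITIVE_KEYS = {"password", "secret_string", "private_key_pem", "secret", "result"}


def plaintext_secrets(state: dict) -> list:
    """List resource attributes in a Terraform state that look like secrets."""
    hits = []
    for resource in state.get("resources", []):
        for instance in resource.get("instances", []):
            attributes = instance.get("attributes") or {}
            for key, value in attributes.items():
                if key in SENSITIVE_KEYS and value:
                    hits.append(f"{resource['type']}.{resource['name']}.{key}")
    return hits


# Hand-written sample shaped like a state file (values are placeholders).
sample_state = {
    "version": 4,
    "resources": [
        {"type": "random_password", "name": "db_password",
         "instances": [{"attributes": {"length": 32, "result": "placeholder"}}]},
        {"type": "aws_db_instance", "name": "main",
         "instances": [{"attributes": {"password": "placeholder",
                                       "storage_encrypted": True}}]},
    ],
}

for hit in plaintext_secrets(sample_state):
    print(hit)
```

Marking a value sensitive changes none of this: the attributes are still present in the state document, only hidden from CLI output.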

Putting the Layers Together

A complete Terraform security guardrail stack has five distinct layers, each catching different classes of issues:

  1. Pre-commit: terraform validate, tflint, checkov — catches syntax errors and known-bad patterns before a commit exists.
  2. PR scanning: Checkov and Trivy with SARIF output uploaded to GitHub — catches misconfigurations with inline PR annotations before merge.
  3. Plan-time policy: Conftest with OPA policies against terraform show -json output — catches issues that only manifest when variables are resolved.
  4. Workflow policy: Sentinel (Terraform Cloud/Enterprise) or Conftest in CI — enforces organizational standards as a hard gate before apply.
  5. Drift detection: Terraform Cloud assessments or driftctl — catches manual changes that bypass all previous layers.

Each layer has different latency (pre-commit is seconds, drift detection runs daily) and different coverage (pre-commit sees source, plan-time policies see resolved values). Running all five layers means misconfigurations either do not get committed, do not get merged, do not get applied, or get detected within 24 hours of a manual change.

The cost of this stack is modest: pre-commit hooks add 10–30 seconds to a commit, CI scanning adds 1–3 minutes to a PR, and plan-time policy checks add less than a minute to the apply workflow. The cost of a public S3 bucket holding customer data, or an open security group that exposes an RDS instance to the internet, is measured in incident response hours and regulatory filings.