Production Access Management with Teleport and Boundary: Brokered, Recorded, Auditable Access

Production Access Management with Teleport and Boundary: Brokered, Recorded, Auditable Access

Problem

Operator access to production hosts has long been a structural weakness:

  • Static SSH keys distributed by config management; key rotation rarely happens; departed engineers’ keys often persist.
  • Bastion hosts with shared accounts; “who logged in” requires correlating multiple logs.
  • VPN + direct SSH model gives broad network access on top of host access.
  • Database access via shared passwords in 1Password / Vault that everyone copies into their .psqlrc.
  • Kubernetes access via long-lived kubeconfigs distributed manually.

The pattern: operators need access; access becomes static; access drifts; access leaks. Each compromise of an operator’s laptop / credentials grants the attacker the same broad, persistent reach.

By 2026, brokered access management is the default. Teleport (gravitational) and HashiCorp Boundary are the two leading open-source options; commercial offerings like StrongDM, Tailscale SSH, and Cloudflare Access provide similar capabilities.

The architecture: a centralized broker sits between operators and production. Operators authenticate to the broker via SSO (OIDC); the broker issues a short-lived certificate or session; the operator uses it to access production. The broker records the session, enforces RBAC, and revokes access at the end of the certificate’s lifetime.

The properties that matter:

  • Just-in-time access with short TTLs (default 1-8 hours).
  • Session recording for SSH, kubectl exec, database queries.
  • Identity-bound to the SSO user, not a shared account.
  • RBAC at the broker, not per-host.
  • Audit centralized and structured.
  • No-VPN model — broker handles network access.

The specific gaps in pre-broker setups:

  • Operator SSH keys persist after employees leave.
  • Database access uses shared credentials; no per-user audit.
  • Kubernetes context switching is manual; everyone has every cluster’s kubeconfig.
  • “Quick fix” production access becomes permanent.
  • Compliance audits manually correlate “who did what” across many sources.

This article covers Teleport’s architecture, RBAC and approval workflow, session recording for SSH / k8s / databases, the migration from static SSH, and the operational integration with on-call / break-glass scenarios.

Target systems: Teleport 16+, HashiCorp Boundary 0.18+, StrongDM (commercial), Cloudflare Access; integrates with Okta, Azure AD, Google Workspace, GitHub for SSO.

Threat Model

  • Adversary 1 — Stolen operator credential: an attacker has the operator’s laptop, SSH keys, or VPN cert. Wants to reach production.
  • Adversary 2 — Departed employee: still has access via legacy SSH keys not yet rotated.
  • Adversary 3 — Insider abuse: legitimate operator using their access for unauthorized actions, expecting no accountability.
  • Adversary 4 — Lateral movement: attacker with one host’s access tries to reach others on the production network.
  • Adversary 5 — Credential exfil from operator endpoint: malware on operator laptop reads SSH keys, browser cookies, Vault tokens.
  • Access level: Adversary 1 has operator endpoint compromise. Adversary 2 has historical credentials. Adversary 3 has legitimate access. Adversary 4 has one host. Adversary 5 has malware on endpoint.
  • Objective: Read or modify production data; pivot through production; act without leaving traceable footprint.
  • Blast radius: With static SSH: a stolen key reaches every host the user had access to, indefinitely. With brokered access: stolen credentials grant only what’s currently active (often nothing — sessions are short-lived); the broker enforces fresh authentication for each access.

Configuration

Step 1: Teleport Architecture

Teleport has three roles:

  • Auth Service: issues certificates; manages roles and users; integrates with SSO.
  • Proxy Service: the public-facing entry point; handles user-facing connections.
  • Agents: installed on each managed resource (server, k8s cluster, database). Connect outbound to the Proxy.

Operators connect to the Proxy via tsh (CLI) or a web UI; the Proxy routes to the Agent on the requested resource.

# Install Teleport on the auth + proxy server.
curl -fsSL https://get.gravitational.com/teleport.repo | sudo bash
sudo apt install teleport
# /etc/teleport.yaml on the central server.
version: v3
teleport:
  nodename: teleport-prod
  data_dir: /var/lib/teleport
  log:
    output: stdout
auth_service:
  enabled: yes
  cluster_name: prod.internal.example.com
  authentication:
    type: oidc
    oidc:
      issuer_url: https://login.example.com/
      client_id: teleport-prod
      client_secret_file: /etc/teleport/oidc-client-secret
      redirect_url: https://teleport.example.com/v1/webapi/oidc/callback
      claims_to_roles:
        - claim: groups
          value: sre-team
          roles: [sre]
        - claim: groups
          value: payments-team
          roles: [payments-developer]
proxy_service:
  enabled: yes
  public_addr: teleport.example.com:443
  https_keypairs:
    - cert_file: /etc/teleport/tls.crt
      key_file: /etc/teleport/tls.key

Connect to SSO; user authenticates via existing identity provider; Teleport issues a certificate scoped to the user’s roles.

Step 2: Role Definitions

# roles/sre.yaml
kind: role
version: v7
metadata:
  name: sre
spec:
  options:
    max_session_ttl: 8h
    forward_agent: false
    require_session_mfa: true   # MFA on every session; replays don't help
  allow:
    logins: [ec2-user, ubuntu]
    node_labels:
      'env': ['production', 'staging']
    kubernetes_labels:
      'cluster': ['*']
    db_labels:
      'env': ['production']
    db_users: ['readonly', 'breakglass']
    db_names: ['*']
    rules:
      - resources: [session]
        verbs: [list, read]
  deny:
    logins: [root]
    node_labels:
      'tag': ['hardened-prod']   # certain tagged hosts are off-limits even to SRE

The role is a least-privilege shape: SREs can SSH as ec2-user to production hosts, exec into any K8s namespace, query databases as readonly or breakglass user. They cannot become root directly; cannot reach hardened-prod-tagged hosts.

Step 3: Session Recording

Every session is recorded by default. SSH sessions to disk as keystroke replay; kubectl exec sessions as command + output; database queries as audit log.

auth_service:
  session_recording: node-sync     # record at the node
  proxy_listener_mode: multiplex

Recording modes:

  • node-sync — recording to the node’s local disk, synced to S3 / GCS in real time. Tamper-resistant: the operator on the node can’t easily delete the recording.
  • proxy — recording at the Proxy. Less reliable if the connection terminates abnormally.
  • off — explicitly disabled; not recommended for production.

Replays:

tsh ssh sessions ls
# 2026-04-29 10:00  alice@teleport-prod  node prod-web-01  duration 12m
# 2026-04-29 11:30  bob@teleport-prod    node prod-db-01   duration 5m

tsh play <session-id>
# Replays the session at original speed in the terminal.

For database access, queries are logged in structured JSON:

tsh db logs query <session-id>
# {"timestamp": "2026-04-29T10:01:23Z", "user": "alice", "query": "SELECT * FROM orders WHERE customer_id = 5"}

Step 4: Access Requests / JIT

For elevated permissions beyond a user’s standing role, use access requests:

# roles/sre.yaml — extends the role.
spec:
  allow:
    request:
      roles: [prod-write, prod-admin]
      thresholds:
        - approve: 1
          deny: 1
      annotations:
        purpose: ['*']
# Operator requests elevated access.
tsh request create --roles=prod-write --reason="Investigating SEV2 incident #1234, fixing payment-api memory leak"

# Approver receives notification (Slack, email).
tsh request review --approve <request-id> --reason="Approved per incident-1234"

# Operator now has prod-write role for the request's TTL.
tsh login --request-id=<id>

Standing access stays minimal; elevation is recorded with explicit business reason.

Step 5: Database Access

Database connections route through Teleport, with per-query logging.

db_service:
  enabled: yes
  resources:
    - labels:
        env: production
  databases:
    - name: payments-db
      protocol: postgres
      uri: payments-db.internal:5432
      ad: {}

Operator connects via tsh:

tsh db login payments-db --db-user=readonly --db-name=payments
tsh db connect payments-db
# psql session opens; queries logged via Teleport's audit pipeline.

The actual database password isn’t shared; Teleport authenticates to the database on the operator’s behalf using a service account (or a Vault-issued dynamic credential, depending on configuration).

Step 6: Kubernetes Access

Teleport can serve as the Kubernetes API entrypoint:

kubernetes_service:
  enabled: yes
  kube_cluster_name: prod-east
  resources:
    - labels:
        env: production

Operators get a kubeconfig automatically:

tsh kube login prod-east
kubectl get pods   # routed through Teleport

The kubeconfig is short-lived; refreshes via tsh login. Compromised laptop = access expires within the session TTL.

Kubernetes RBAC is layered with Teleport’s role:

kind: role
version: v7
metadata:
  name: payments-developer
spec:
  allow:
    kubernetes_labels:
      env: ['production']
    kubernetes_groups: ['payments-developer']
    kubernetes_users: ['payments-developer']
    kubernetes_resources:
      - kind: pod
        namespace: payments
        verbs: ['get', 'list', 'exec']
      - kind: deployment
        namespace: payments
        verbs: ['get', 'list', 'patch']

Per-resource access at the K8s layer; Teleport issues a kubeconfig with the appropriate restrictions.

Step 7: Boundary as an Alternative

Boundary’s model is similar but with a different decomposition:

# Boundary controller config.
controller {
  name = "controller-prod"
  description = "Production controller"

  database {
    url = "postgresql://boundary@db.internal:5432/boundary"
  }
}

listener "tcp" {
  address = "0.0.0.0:9200"
  purpose = "api"
  tls_disable = false
  tls_cert_file = "/etc/boundary/tls.crt"
  tls_key_file = "/etc/boundary/tls.key"
}
# Define a target.
boundary targets create tcp -name "payments-db" \
  -default-port 5432 \
  -session-connection-limit 10 \
  -session-max-seconds 14400 \
  -host-source ${HOST_SET_ID}

# Operator connects.
boundary connect postgres -target-id <target-id>
# Boundary establishes a tunnel to payments-db; operator's psql connects to localhost.

Boundary is lighter on session recording but excellent on TCP-level brokering. Often paired with Vault for dynamic database credentials.

Step 8: Integration With On-Call

For emergency access during incidents:

# roles/oncall-emergency.yaml
kind: role
metadata:
  name: oncall-emergency
spec:
  options:
    max_session_ttl: 4h
    require_session_mfa: true
  allow:
    request:
      roles: [prod-admin]
      thresholds:
        - approve: 1                     # only need 1 approver for emergencies
          deny: 1
      annotations:
        incident: ['SEV1', 'SEV2']       # require an incident reason
        pagerduty_active_incident: ['true']

A custom plugin verifies the user is currently on-call (PagerDuty integration); approves automatically if so. Audit log captures every emergency elevation with the linked incident.

Step 9: Telemetry

teleport_sessions_started_total{cluster, type}
teleport_sessions_recorded_total{cluster}
teleport_access_requests_total{role, result}
teleport_session_duration_seconds
teleport_failed_auth_total{user, reason}
teleport_audit_events_total{type}

Alert on:

  • failed_auth_total rising for a specific user — possible compromised credential or stale config.
  • access_requests_total{result="denied"} rising — possible attempted privilege escalation.
  • Sessions exceeding expected duration — possible long-running unauthorized activity.

Expected Behaviour

Signal Static SSH + bastions Teleport / Boundary
Departed employee SSH access Until rotation Expires at TTL (8h) automatically
Per-user audit Manual log correlation Centralized; structured by user/session
Session replay Manual / impossible Built-in; standard kubectl/SSH/DB
Per-resource access control Per-host config Centralized RBAC
Database query audit Database-side audit (often disabled) Per-query log with user attribution
Kubernetes access Kubeconfig per cluster, distributed manually Routed through broker; identity-bound
Compromise of one machine Broad reach Bounded to that one TTL window

Trade-offs

Aspect Benefit Cost Mitigation
Centralized broker Single audit pane Single point of failure Run in HA; for short outages, break-glass procedure documented.
Session recording Forensic clarity Storage cost; privacy implications Encrypted at rest; short retention for routine sessions, longer for elevated.
Identity-bound certificates No shared credentials SSO outage = no access Plan break-glass for SSO unavailability.
Just-in-time elevation Minimal standing access Friction during incidents Auto-approve for on-call during active SEV1; manual approval otherwise.
Database brokerage Per-query audit Latency overhead Negligible for interactive queries; matters less for query-heavy applications.
Migration from static Long-term security improvement Engineering effort Phased rollout; per-team migration; coexist briefly.

Failure Modes

Failure Symptom Detection Recovery
Teleport / Boundary controller down All operator access blocked Service health check fails HA deploy; for outage longer than break-glass window, use the documented emergency procedure.
SSO outage All authentication blocked Auth provider error Local fallback users with high audit; avoid using unless emergency.
Stale role assignments Departed user retains access Periodic SSO sync drift Continuous sync with SSO; alert on stale role assignments.
Session recording storage full Sessions stop recording teleport_sessions_recorded_total rate stalls Alert on storage utilization; migrate to S3 / GCS at >70% local capacity.
Approval flow misuse Auto-approval for non-emergency Audit shows elevations without active incidents Tighten auto-approval criteria; require manual approval for non-incident elevations.
Privilege drift Role accumulates over time Periodic role audit Review roles quarterly; remove unused permissions.
Latency-sensitive workload broken DB queries slow due to broker App-level latency monitors Some workloads (high-throughput batch) bypass broker; document the exemption with compensating controls.

When to Consider a Managed Alternative

Self-hosted Teleport / Boundary requires HA infrastructure, session-recording storage, integration with SSO, and ongoing operational care (8-15 hours/month for a multi-environment fleet).

  • Teleport Cloud: managed Teleport; SSO integration; session storage included.
  • StrongDM: commercial broker; multi-protocol, audit pipeline integrated.
  • Cloudflare Access: identity-bound zero-trust gateway; integrates with existing IdP.

For organizations with strict regulatory constraints prohibiting third-party brokers, self-hosted Teleport with on-prem session recording is the right choice.