Certificate Lifecycle Management: From Issuance to Renewal and Revocation

Certificate Lifecycle Management: From Issuance to Renewal and Revocation

Problem

On 8 February 2023, Spotify’s iOS app stopped working for millions of users. The root cause: an expired TLS certificate that no automated check had caught. Spotify is not exceptional. The pattern recurs across organisations of every size — certificate expiry outages hit Goldman Sachs, LinkedIn, Equifax, and hundreds of smaller companies every year. The Microsoft Teams outage of October 2020 was a wildcard certificate that expired without any monitoring alert firing.

The certificate problem has three distinct failure modes:

Expiry without warning. A certificate was issued, put into production, and then forgotten. No owner, no monitoring, no renewal process. Engineers discover it by a frantic page at midnight.

Unmanaged self-signed certificates. Developers run openssl req -new -x509 to unblock a project and wire it into production. Years later the cert is still there, rotated by nobody, trusted only by a hardcoded exception in an internal tool written by someone who has since left.

Revocation nobody enforces. A private key is compromised or an engineer’s client certificate persists after departure. The certificate is marked revoked in the CA, but nothing in the chain checks the CRL or OCSP response. The revoked cert remains operationally valid.

Fixing these requires treating certificates as managed objects with explicit owners, tracked validity windows, automated renewal, enforced revocation, and monitored expiry across every type: public TLS, internal PKI, code signing, client auth, and S/MIME.

Threat Model

  • Adversary 1 — Service disruption: Certificate expiry causes an outage. No hostile actor required — failure of internal process is the threat. Impact: availability, SLA breach, revenue loss.
  • Adversary 2 — Private key theft: An attacker exfiltrates the private key corresponding to a production TLS certificate. Without revocation infrastructure, the stolen key remains valid until the cert expires, enabling impersonation or traffic decryption.
  • Adversary 3 — Rogue certificate: An attacker obtains a fraudulent certificate from a misissuing CA for a domain you own. Without Certificate Transparency (CT) log monitoring you will not detect it.
  • Adversary 4 — Stale client certificate: A departed employee’s or contractor’s client certificate is not revoked. They retain access to internal services that trust the issuing CA without checking revocation status.
  • Adversary 5 — Weak internal PKI: Self-signed or internally issued certificates with SHA-1, 4096-day validity, or no CRL/OCSP. Internal services treat the internal CA as a trust anchor, but any compromise of the internal CA’s private key invalidates the entire chain with no revocation path.
  • Asset: TLS private keys, internal CA keys, code signing keys, client certificates.
  • Blast radius: Ranges from service unavailability (expiry) to full impersonation (key theft), to full internal PKI compromise (root CA key theft with no revocation).

Certificate Categories and Lifecycle Requirements

Different certificate types have different validity windows, different issuance paths, and different consequences on failure. Treating them uniformly means the policy fits none of them well.

Public TLS certificates cover externally reachable domains. As of 2026, the CA/Browser Forum maximum validity is 398 days; Google’s Chrome Root Policy proposes a 90-day cap being phased in. Let’s Encrypt and ZeroSSL issue 90-day certificates by design. Renewal should be automated via ACME. Human-driven renewal for public TLS is an operational antipattern.

Internal PKI certificates cover services that communicate within your network. Validity windows are set by your internal CA policy — 1 year for server certificates is common, shorter (90 days) is better. The key difference from public TLS: you control the CA, so you can issue, revoke, and re-issue without waiting for an external provider.

Code signing certificates have high-value, long-validity keys and are often issued from a separate CA hierarchy. Compromise means an attacker can sign malicious artifacts that your toolchain trusts. Validity is typically 1–3 years; the signing key should live in an HSM.

Client authentication certificates (mTLS) cover service-to-service auth and, where used, human-to-service auth (e.g. internal VPNs that use certificate-based authentication instead of passwords). Validity should be short — 90 days for services, 30 days for humans where possible — because revocation checking for client certs is frequently absent.

S/MIME certificates for email signing and encryption have their own CA hierarchy (your organisation’s CA or a public S/MIME CA). Lifecycle management is frequently neglected entirely. As with client certs: short validity, automated renewal where tooling allows.

Building a Certificate Inventory

You cannot manage what you cannot see. The inventory is the foundation; without it, renewal automation and monitoring are incomplete by definition.

Discovery

External scanning — testssl.sh. Run testssl.sh in batch mode across your external IP ranges and DNS inventory:

testssl.sh --file hosts.txt --logfile testssl-results.json --jsonfile-pretty testssl-output.json

testssl.sh reports the certificate subject, SANs, issuer, validity window, and any weak cipher or protocol configurations alongside. Pipe the JSON output to a parser that extracts the expiry timestamps and adds them to your inventory store.

Certificate Transparency log monitoring — certspotter. Certspotter by SSLMate watches CT logs for certificates issued for your domains. Any certificate for *.example.com or example.com — whether issued by your expected CA or a misissuing CA — triggers an alert:

# Self-hosted certspotter
certspotter -watchlist domains.txt -script /usr/local/bin/cert-alert.sh

The hosted version at sslmate.com/certspotter provides the same function without self-hosting. CT monitoring covers the rogue certificate threat: any certificate the CT ecosystem sees becomes visible to you within minutes of issuance.

Internal network scanning. External tools only see what is publicly routable. Internal certificates on port 443, 8443, mutual-TLS service meshes, and LDAPS endpoints require internal scanning. A lightweight Go scanner that connects to each host:port, reads the certificate chain, and emits JSON to a collector works well. Run it from a host that can reach all network segments, and schedule it daily via cron.

Kubernetes Secrets. In Kubernetes environments, TLS secrets live in Secret objects of type kubernetes.io/tls. Enumerate them and extract expiry dates:

kubectl get secrets --all-namespaces -o json \
  | jq -r '.items[] | select(.type=="kubernetes.io/tls") | 
    [.metadata.namespace, .metadata.name, 
     (.data["tls.crt"] | @base64d)] | @tsv' \
  | while IFS=$'\t' read ns name cert; do
      expiry=$(echo "$cert" | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
      echo "$ns/$name: $expiry"
    done

cert-manager’s Certificate custom resources also expose .status.notAfter in their status fields, which is queryable via the Kubernetes API.

Recording in a System of Record

The output of discovery should populate a structured store — a CMDB table, a Postgres database, a spreadsheet with a defined schema, or a purpose-built tool like Smallstep Certificates or Keyfactor. The minimum fields per certificate record:

Field Description
cn / san Subject common name and all Subject Alternative Names
issuer Issuing CA name and type (public/internal/self-signed)
serial Certificate serial number for revocation tracking
not_before / not_after Validity window
owner_team Responsible team
system System or service where the certificate is deployed
renewal_method Manual / ACME / cert-manager / other
last_seen Date last observed by scanner

Cross-reference the inventory against your DNS records monthly. Certificates on hostnames that no longer resolve in DNS are candidates for decommission. Certificates with no owner_team entry get escalated — ownerless certificates are unmanaged by definition.

ACME Automation for Public Certificates

The ACME protocol (RFC 8555) is the correct answer for public TLS certificates. It eliminates human-driven renewal for the vast majority of your external attack surface.

cert-manager in Kubernetes

cert-manager is the standard certificate lifecycle controller for Kubernetes. Install it via Helm:

helm repo add jetstack https://charts.jetstack.io
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.17.0 \
  --set crds.enabled=true

Configure a ClusterIssuer for Let’s Encrypt production (using DNS-01 challenge for wildcard support; HTTP-01 for standard domain validation):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: security@example.com
    privateKeySecretRef:
      name: letsencrypt-production-account-key
    solvers:
    - dns01:
        route53:
          region: eu-west-1
          hostedZoneID: Z1234ABCDEF

Request a certificate by creating a Certificate object (or by annotating an Ingress with cert-manager.io/cluster-issuer: letsencrypt-production):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-example-com
  namespace: production
spec:
  secretName: api-example-com-tls
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  dnsNames:
  - api.example.com
  renewBefore: 720h   # renew 30 days before expiry

cert-manager will renew the certificate automatically when renewBefore is reached, updating the Kubernetes Secret in place. Pods that mount the Secret need to reload the certificate — configure your application server to watch for inotify events on the certificate file, or use cert-manager’s CSI driver or the Trust Manager for workload certificate injection.

certbot for VMs

On virtual machines and bare-metal hosts, certbot handles ACME renewal:

certbot certonly \
  --dns-route53 \
  --dns-route53-propagation-seconds 30 \
  --email security@example.com \
  --agree-tos \
  --no-eff-email \
  -d api.example.com

certbot’s systemd timer runs renewal twice daily, which is the recommended cadence. The certificate renews when fewer than 30 days remain. Post-renewal hooks reload the application server:

# /etc/letsencrypt/renewal-hooks/deploy/reload-nginx.sh
#!/bin/bash
systemctl reload nginx

For ZeroSSL (the alternative ACME CA useful for rate-limit headroom or Let’s Encrypt redundancy), configure certbot’s --server flag to the ZeroSSL ACME endpoint and supply your EAB credentials from the ZeroSSL dashboard.

Internal PKI Lifecycle

Internal PKI gives you full control over the CA hierarchy. That control is also the risk — poor internal PKI design can be worse than using a public CA because there is no external audit.

Intermediate CA Rotation

Never issue end-entity certificates directly from the root CA. The hierarchy should be: offline root CA → online intermediate CA → end-entity certificates. The intermediate CA is what cert-manager and Vault present to requestors. The offline root CA should be air-gapped, with its private key in an HSM or an encrypted offline backup.

Rotate intermediate CAs on a defined schedule — annually is common, 90 days is better for high-assurance environments. Rotation procedure:

  1. Generate a new intermediate CA keypair (in the HSM where possible).
  2. Request a certificate for the new intermediate from the root CA (ceremony, logged, witnessed for root CA operations).
  3. Install the new intermediate in cert-manager or Vault.
  4. Configure a transition period where both old and new intermediates are trusted.
  5. Re-issue end-entity certificates from the new intermediate over the transition window.
  6. Revoke the old intermediate and remove it from the trust store after all end-entity certificates beneath it have been re-issued.

Certificate Templates and Validity Windows

Define certificate profiles (templates) that enforce key usage, extended key usage, and maximum validity:

Profile Key Usage EKU Max Validity
tls-server digitalSignature, keyEncipherment serverAuth 90 days
tls-client digitalSignature clientAuth 30 days
code-signing digitalSignature codeSigning 365 days
internal-service digitalSignature, keyEncipherment serverAuth, clientAuth 90 days

In HashiCorp Vault’s PKI secrets engine, these are pki/roles. A role named tls-server enforces that applications cannot request a client-auth certificate through the server endpoint, and cannot request a 2-year validity:

vault write pki_int/roles/tls-server \
  allowed_domains="internal.example.com" \
  allow_subdomains=true \
  max_ttl="2160h" \
  key_type="ec" \
  key_bits=256 \
  require_cn=false \
  server_flag=true \
  client_flag=false

cert-manager with Vault Issuer

cert-manager’s Vault issuer connects Kubernetes workloads to your internal Vault PKI, providing unified lifecycle management regardless of whether the certificate comes from Let’s Encrypt or your internal CA:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-internal-pki
spec:
  vault:
    server: https://vault.internal.example.com
    path: pki_int/sign/tls-server
    auth:
      kubernetes:
        mountPath: /v1/auth/kubernetes
        role: cert-manager
        secretRef:
          name: vault-token
          key: token

Certificate objects referencing vault-internal-pki get their certificates issued by Vault, renewed automatically before expiry, and stored in Kubernetes Secrets — the same mechanism as public certificates via Let’s Encrypt. Application teams do not need to know which CA issued their certificate; the lifecycle is identical from their perspective.

Monitoring and Alerting

Inventory and automation reduce the chance of expiry, but monitoring is the safety net. Certificates expire for unexpected reasons — ACME renewal fails because a DNS challenge cannot be completed, cert-manager has a bug, the Vault issuer loses its token.

Prometheus Blackbox Exporter

The Prometheus blackbox exporter probes TLS endpoints and exposes probe_ssl_earliest_cert_expiry, which returns the expiry timestamp as a Unix timestamp. The metric is for the soonest-expiring certificate in the chain (not just the leaf), catching intermediate expiry as well.

Configure the blackbox exporter to scrape your external and internal endpoints:

# blackbox.yml
modules:
  https_2xx:
    prober: http
    http:
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false

In Prometheus scrape config, enumerate targets:

- job_name: blackbox_tls
  metrics_path: /probe
  params:
    module: [https_2xx]
  static_configs:
  - targets:
    - https://api.example.com
    - https://app.example.com
    - https://internal-service.internal.example.com:8443
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:9115

Alerting rules with a 28-day warning, 14-day critical, and 7-day page:

groups:
- name: tls_certificate_expiry
  rules:
  - alert: TLSCertExpiryWarning
    expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 28
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Certificate expiring in {{ $value | printf \"%.0f\" }} days"
      description: "{{ $labels.instance }} certificate expires in {{ $value | printf \"%.0f\" }} days."

  - alert: TLSCertExpiryCritical
    expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: "Certificate expiring in {{ $value | printf \"%.0f\" }} days — CRITICAL"
      description: "{{ $labels.instance }} certificate expires in {{ $value | printf \"%.0f\" }} days. Renewal should have already occurred."

  - alert: TLSCertExpiryPage
    expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Certificate expiry imminent — {{ $value | printf \"%.0f\" }} days remaining"
      description: "{{ $labels.instance }} has {{ $value | printf \"%.0f\" }} days before certificate expiry. Immediate action required."

The 28-day warning gives two full renewal cycles for a 90-day certificate before the critical threshold is hit. If your monitoring is firing warning alerts for ACME-managed certificates, the automation has failed; investigate before it escalates to critical.

cert-manager Certificate Status

cert-manager exposes certificate readiness via Kubernetes conditions. Use kube-state-metrics to export cert_manager_certificate_expiration_timestamp_seconds and cert_manager_certificate_ready_status, then alert on certificates that are not Ready or that are within 14 days of expiry even through cert-manager — this catches cases where cert-manager sees the certificate but cannot renew it.

Revocation Infrastructure

Revocation is the most frequently neglected part of certificate lifecycle management. The reason: it is hard to make work correctly, and the common failure mode (client does not check revocation) is invisible until a key-compromise incident makes it suddenly very visible.

OCSP Stapling

Online Certificate Status Protocol (OCSP) allows a client to check whether a certificate has been revoked by querying the CA’s OCSP responder. The problem with client-side OCSP checking: it adds latency, it leaks browsing patterns to the CA, and it fails open in most implementations (if the OCSP responder is unreachable, the browser accepts the certificate anyway — “soft fail”).

OCSP stapling moves the OCSP response to the server. The server periodically fetches a signed OCSP response from the CA and includes it in the TLS handshake. The client receives the freshness-stamped response without contacting the CA, and without the soft-fail problem.

Enable OCSP stapling in nginx:

ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/ssl/certs/ca-chain.pem;
resolver 1.1.1.1 8.8.8.8 valid=300s;
resolver_timeout 5s;

For internal PKI certificates, your Vault or internal CA must expose an OCSP responder endpoint. Vault’s PKI secrets engine includes a built-in OCSP responder at <pki_mount>/ocsp. Configure the OCSP URL in the certificate’s Authority Information Access extension when issuing:

vault write pki_int/config/urls \
  issuing_certificates="https://vault.internal.example.com/v1/pki_int/ca" \
  crl_distribution_points="https://vault.internal.example.com/v1/pki_int/crl" \
  ocsp_servers="https://vault.internal.example.com/v1/pki_int/ocsp"

CRL Distribution Points

Certificate Revocation Lists are signed lists of revoked certificate serial numbers, published by the CA and downloadable by clients. CRLs are coarse-grained (the entire list is downloaded) and can be stale (published on a schedule, typically every 24 hours). They are, however, a reliable baseline when OCSP stapling is not available.

Ensure your internal CA CRL is published to an accessible HTTP endpoint (not HTTPS — clients need to fetch the CRL before they can validate any HTTPS connection) and refreshed at least every 24 hours. Vault publishes its CRL at <pki_mount>/crl; serve it behind an nginx with a public-facing HTTP endpoint if clients outside the cluster need it.

Monitor CRL publication: alert if the CRL’s nextUpdate field is within 2 hours of expiry without a fresh CRL having been published. An expired CRL causes client-side validation failures that can look identical to a revoked certificate from the application’s perspective.

Revoking a Compromised Private Key

When a private key is known or suspected to be compromised, the revocation procedure is:

  1. Assess scope. Which certificate(s) use this key? Check SANs — a wildcard cert may cover dozens of services.
  2. Issue replacement. Generate a new key pair and issue a new certificate immediately. Deploy the replacement before proceeding with revocation to avoid a gap.
  3. Revoke the old certificate. For ACME certificates, use certbot revoke --cert-name <name> or the cert-manager cmctl revoke command. For Vault-issued certs: vault write pki_int/revoke serial_number=<serial>.
  4. Verify revocation. Use openssl ocsp to confirm the CA’s OCSP responder returns REVOKED for the serial:
openssl ocsp \
  -issuer issuer.pem \
  -cert compromised.pem \
  -url https://ocsp.example.com \
  -resp_text
  1. Rotate dependent secrets. If the key was co-located with other secrets (a Kubernetes Secret containing a TLS key pair alongside a database password), treat all co-located secrets as compromised.
  2. Incident record. Create an incident record documenting what was revoked, when, and what systems were affected. Include this in your next certificate status review.

Handling Mass Revocation Events

Let’s Encrypt has triggered two significant mass revocation events: one in March 2020 (3 million certificates revoked due to a CAA checking bug) and one in January 2022 (two million certificates revoked due to a TLS ALPN-01 challenge bug). In both cases, Let’s Encrypt gave 24–72 hours notice before revocation.

An automated response capability makes the difference between a brief renewal surge and an outage:

Detection. Subscribe to the Let’s Encrypt status page and the CA/Browser Forum public list. certspotter’s CT monitoring will show a spike in re-issuances for your domains if Let’s Encrypt begins proactive renewal on affected certificates.

Automated forced renewal. cert-manager supports annotation-triggered renewal via cmctl renew. Write a script that queries the Let’s Encrypt API for your account’s affected certificates (Let’s Encrypt provides an endpoint identifying affected serials during a mass revocation event) and triggers renewal for each matching cert-manager Certificate object:

# Force renewal for all certificates issued before a specific date
kubectl get certificate --all-namespaces -o json \
  | jq -r '.items[] | select(.status.notBefore < "2025-01-15T00:00:00Z") 
    | [.metadata.namespace, .metadata.name] | @tsv' \
  | while IFS=$'\t' read ns name; do
      cmctl renew -n "$ns" "$name"
    done

certbot mass renewal. On VM hosts, certbot renew --force-renewal renews all managed certificates regardless of expiry date. This can hit Let’s Encrypt rate limits (50 certificates per registered domain per week) if you have many certificates for the same domain. Use the --preferred-chain flag to select the affected or unaffected chain where applicable.

Certificate Pinning and Rotation Procedures

Certificate pinning — hardcoding an expected certificate or public key hash in a client — trades revocation flexibility for a stronger binding. The trade-off is significant: a pinned certificate cannot be replaced without a client update. Pinning is appropriate in two narrow cases: mobile applications distributed through app stores (where the update process is controlled) and internal service-to-service communication where both sides are under your control and deployable together.

For mobile applications using HPKP or platform-level pinning, maintain a minimum of two pins: the currently deployed certificate’s public key, and a backup public key (generated and stored, not yet used in a certificate). Rotation procedure:

  1. Issue a new certificate from the backup public key.
  2. Release an application update that pins the new certificate’s key and a new backup key.
  3. Wait for the old application version to age out of the install base (typically 90–120 days for a mandatory update).
  4. Retire the old pinned key.

For internal mTLS with pinned certificates, use SPIFFE/SPIRE instead of static pinning. SPIFFE workload identity certificates are short-lived (typically 1 hour) and renewed automatically; the SVID format and trust bundle replace per-certificate pinning with workload identity verification. See the SPIFFE/SPIRE article for implementation details.

Operational Integration

Integrate certificate lifecycle into existing operational processes:

Change management. Certificate renewals that change SANs or issuers should go through change management. Pure renewals (same key, same SANs, same issuer) are pre-approved change types if the renewal is automated.

Employee offboarding. Client authentication certificates for VPN, internal services, or code signing must be revoked as part of the offboarding checklist. Automate this: when an identity is deprovisioned in your IdP (Okta, Entra), trigger a webhook that revokes any client certificates associated with that identity.

Quarterly certificate review. Run the inventory scanner, compare against the CMDB, and review:

  • Certificates with no owner_team.
  • Certificates whose last_seen date is more than 30 days ago (may have been decommissioned without cleanup).
  • Self-signed certificates in production.
  • Certificates with SHA-1 or RSA keys below 2048 bits.
  • Validity windows longer than your policy allows.

The review output is a set of remediation tickets, not a one-time fix. Certificate hygiene degrades continuously; the quarterly review is the control that catches degradation before it becomes an incident.

Summary

Certificate lifecycle management is an infrastructure discipline, not a one-time configuration task. The key controls:

  • Inventory. Run testssl.sh and an internal scanner on a schedule; monitor CT logs with certspotter; record everything in a system of record with explicit owners.
  • ACME automation. Use cert-manager for Kubernetes and certbot for VMs; set renewBefore: 720h (30 days) so renewal happens well before expiry.
  • Internal PKI. Use Vault PKI with certificate templates that enforce short validity and correct key usage; rotate intermediate CAs annually.
  • Monitoring. Blackbox exporter with 28-day warning / 14-day critical / 7-day page alerts; cert-manager status monitoring via kube-state-metrics.
  • Revocation. OCSP stapling on all TLS endpoints; CRL distribution points in all internally issued certificates; a documented revocation procedure that includes deployment of the replacement before revocation of the old certificate.
  • Mass revocation readiness. Scripts to force-renew all affected certificates on a known-revocation event; rate-limit awareness for Let’s Encrypt domains.

The goal is that certificate expiry becomes a non-event: an automated renewal processes it, the monitoring confirms success, and no human action is required. When that is true for 95% of your certificate estate, your capacity is free to focus on the edge cases — code signing keys, root CA ceremonies, and the occasional compromised key — where human judgement is genuinely needed.