Hardware Security Module Integration: Key Management for Production Systems

Hardware Security Module Integration: Key Management for Production Systems

Problem

Private keys stored in filesystem files, environment variables, Kubernetes Secrets, or application memory have a fundamental vulnerability: they can be extracted by anyone with the right access level. A root compromise on an application server, a Kubernetes etcd dump, a memory forensics attack, or a cloud metadata service exploit can yield every private key the application uses.

A Hardware Security Module (HSM) removes this possibility. The private key is generated inside the HSM and is physically non-exportable. All cryptographic operations (signing, decryption, key derivation) are performed inside the HSM. The application requests the operation via PKCS#11, KMIP, or a vendor API; the key material never leaves the HSM boundary. A root-compromised server can exfiltrate a signing request but not the key itself.

The specific gaps in systems without HSM integration:

  • Root CA private keys stored as PEM files on an application server, accessible to any process with file system access.
  • TLS private keys mounted into containers via Kubernetes Secrets, visible in etcd and to any pod that mounts the same Secret.
  • Code signing keys in CI/CD environment variables, visible in build logs and accessible to all pipeline steps.
  • Application encryption keys (for database column encryption, file encryption) stored in .env files alongside application code.
  • No key usage audit log — applications use private keys without any per-operation record.

HSMs provide: key non-extractability, per-operation audit logging, hardware random number generation, tamper evidence, and FIPS 140-2/140-3 certification for compliance requirements.

Target systems: AWS CloudHSM (FIPS 140-2 Level 3); Azure Dedicated HSM (Thales Luna); Google Cloud HSM (via Cloud KMS); on-premise: Thales Luna Network HSM, nCipher nShield, Nitrokey HSM2; software: SoftHSM2 (for testing); HashiCorp Vault with HSM auto-unseal (PKCS#11 seal).

Threat Model

  • Adversary 1 — Root-level key extraction: An attacker achieves root access on an application server and copies the TLS private key PEM file. They use the key to decrypt captured TLS traffic or impersonate the server.
  • Adversary 2 — Kubernetes etcd dump: An attacker with etcd access dumps all Kubernetes Secrets, extracting TLS private keys, API keys, and signing certificates stored as Secrets.
  • Adversary 3 — Memory dump on key in use: An attacker with root or ptrace access dumps the process memory of a running application while it holds a private key in memory. They extract the key from the memory dump.
  • Adversary 4 — CI/CD environment variable leak: A CI pipeline exposes environment variables in logs or to malicious steps. A signing key stored as a CI environment variable is exfiltrated.
  • Adversary 5 — Insider threat with file system access: A privileged insider exports the root CA private key from the filesystem, creating a shadow CA.
  • Access level: Adversaries 1–3 have OS-level root or ptrace access. Adversary 4 has CI pipeline access. Adversary 5 has legitimate file system access.
  • Objective: Extract private keys to impersonate services, decrypt historical traffic, forge signed artifacts, or establish persistent access.
  • Blast radius: Without HSM: key extraction = long-term compromise of everything signed or encrypted with that key. With HSM: the key is non-exportable; the attacker can at most abuse the HSM API if they have the authentication credentials, but each operation is logged and auditable.

Configuration

Step 1: Choose an HSM Integration Pattern

Three primary patterns, by use case:

Pattern Best for Implementation
Cloud KMS with HSM backing Managed, lower ops burden AWS KMS + CloudHSM, GCP Cloud HSM, Azure Key Vault with Managed HSM
Vault with HSM auto-unseal Centralised secret management with HSM protection of the Vault master key Vault PKCS#11 seal + any FIPS HSM
Direct PKCS#11 Applications that do their own crypto (TLS termination, signing) Application links to OpenSSL PKCS#11 engine or Go PKCS#11 library

This article covers all three patterns.

Step 2: Cloud KMS with HSM-Backed Keys (AWS)

AWS CloudHSM creates a dedicated HSM cluster in your VPC. AWS KMS can use CloudHSM as a custom key store, providing KMS API convenience with HSM-level key protection.

# Create a CloudHSM cluster.
aws cloudhsmv2 create-cluster \
  --hsm-type hsm2m.medium \
  --subnet-ids subnet-abc123 subnet-def456

# Initialize the cluster.
CLUSTER_ID=$(aws cloudhsmv2 describe-clusters --query 'Clusters[0].ClusterId' --output text)
aws cloudhsmv2 initialize-cluster --cluster-id $CLUSTER_ID \
  --signed-cert file://cluster-csr-signed.crt \
  --trust-anchor file://customerCA.crt

# Create HSMs in each AZ (minimum 2 for HA).
aws cloudhsmv2 create-hsm --cluster-id $CLUSTER_ID --availability-zone us-east-1a
aws cloudhsmv2 create-hsm --cluster-id $CLUSTER_ID --availability-zone us-east-1b

# Create a KMS custom key store backed by CloudHSM.
aws kms create-custom-key-store \
  --custom-key-store-name "production-hsm-keystore" \
  --cloud-hsm-cluster-id $CLUSTER_ID \
  --key-store-password "hsm-crypto-officer-password"

# Connect the key store.
KEY_STORE_ID=$(aws kms describe-custom-key-stores \
  --custom-key-store-name "production-hsm-keystore" \
  --query 'CustomKeyStores[0].CustomKeyStoreId' --output text)
aws kms connect-custom-key-store --custom-key-store-id $KEY_STORE_ID

# Create an HSM-backed asymmetric key for signing.
aws kms create-key \
  --origin AWS_CLOUDHSM \
  --custom-key-store-id $KEY_STORE_ID \
  --key-usage SIGN_VERIFY \
  --key-spec RSA_4096 \
  --description "Code signing key — HSM-backed"

Use via SDK:

import boto3

kms = boto3.client("kms", region_name="us-east-1")
KEY_ID = "arn:aws:kms:us-east-1:123456789:key/abc123"

def sign_with_hsm(data: bytes) -> bytes:
    # The signing operation is performed inside CloudHSM.
    # The private key never leaves the HSM.
    response = kms.sign(
        KeyId=KEY_ID,
        Message=hashlib.sha256(data).digest(),
        MessageType="DIGEST",
        SigningAlgorithm="RSASSA_PKCS1_V1_5_SHA_256",
    )
    return response["Signature"]

def verify_signature(data: bytes, signature: bytes) -> bool:
    response = kms.verify(
        KeyId=KEY_ID,
        Message=hashlib.sha256(data).digest(),
        MessageType="DIGEST",
        Signature=signature,
        SigningAlgorithm="RSASSA_PKCS1_V1_5_SHA_256",
    )
    return response["SignatureValid"]

Step 3: HashiCorp Vault with PKCS#11 HSM Auto-Unseal

Vault’s master key (used to encrypt the barrier keyring) can be sealed/unsealed using an HSM. The Vault master key is generated and stored inside the HSM — even Vault operators cannot extract it.

# vault.hcl
seal "pkcs11" {
  lib            = "/usr/lib/softhsm/libsofthsm2.so"   # Or path to hardware HSM library.
  slot           = "0"
  pin            = "hsm-crypto-officer-pin"
  key_label      = "vault-hsm-key"
  hmac_key_label = "vault-hmac-key"
  generate_key   = "true"   # Generate key in HSM if not present.
}

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-1"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/vault.crt"
  tls_key_file  = "/etc/vault/tls/vault.key"
}

With HSM auto-unseal:

  • Vault starts automatically on node restart (the HSM provides the unseal key).
  • The Vault master key is physically protected — a compromised Vault server cannot export the root key.
  • Vault operators cannot perform a cold extraction of the master key.

Initialize Vault with HSM:

vault operator init \
  -recovery-shares=5 \
  -recovery-threshold=3
# Note: with HSM seal, these are "recovery keys" (for emergency), not unseal keys.
# Store recovery key shares in a physically secure location (safe, bank vault).

Step 4: Direct PKCS#11 for TLS Private Keys

For applications that terminate TLS and hold the private key (nginx, HAProxy, custom Go services), use PKCS#11 to keep the key in the HSM.

nginx with PKCS#11 (via the nginx-pkcs11 module or OpenSSL engine):

# Install OpenSSL PKCS#11 engine.
apt install libengine-pkcs11-openssl

# Initialize SoftHSM2 (for testing; replace with hardware HSM library in production).
softhsm2-util --init-token --slot 0 --label "tls-keys" \
  --pin 1234 --so-pin 5678

# Generate TLS key inside the HSM (key never exported as plaintext).
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
  --login --pin 1234 \
  --keypairgen --key-type rsa:4096 \
  --label "nginx-tls-key" --id 01

# Generate a CSR using the HSM-resident key.
openssl req -engine pkcs11 -keyform engine \
  -key "pkcs11:object=nginx-tls-key;type=private" \
  -new -out nginx.csr \
  -subj "/CN=api.example.com"

# Submit CSR to CA; get signed certificate.
# Install the certificate (not the key — the key stays in the HSM).

nginx configuration using PKCS#11:

ssl_certificate /etc/nginx/certs/api.example.com.crt;
# Reference the HSM key via PKCS#11 URI instead of a PEM file.
ssl_certificate_key "engine:pkcs11:pkcs11:object=nginx-tls-key;type=private;pin-value=1234";

Go application using PKCS#11:

import (
    "crypto"
    "crypto/tls"
    p11 "github.com/miekg/pkcs11"
    "github.com/ThalesGroup/crypto11"
)

func loadHSMKey(hsmLib, pin, label string) (crypto.Signer, error) {
    config := &crypto11.Config{
        Path:        hsmLib,    // "/usr/lib/softhsm/libsofthsm2.so" or HSM vendor path.
        Pin:         pin,
        TokenLabel:  "tls-keys",
    }
    ctx, err := crypto11.Configure(config)
    if err != nil {
        return nil, err
    }
    // Find the key by label; returns a crypto.Signer backed by the HSM.
    // All signing operations go through PKCS#11; the key never leaves the HSM.
    return ctx.FindKeyPair(nil, []byte(label))
}

func newTLSConfigWithHSM(hsmLib, pin, label, certPath string) (*tls.Config, error) {
    signer, err := loadHSMKey(hsmLib, pin, label)
    if err != nil {
        return nil, err
    }
    cert, err := tls.LoadX509KeyPair(certPath, "")  // Load cert; key comes from HSM.
    if err != nil {
        return nil, err
    }
    cert.PrivateKey = signer   // Replace the key loaded from cert with the HSM signer.
    return &tls.Config{
        Certificates: []tls.Certificate{cert},
    }, nil
}

Step 5: Key Usage Audit Logging

Every HSM operation creates an audit log entry. Configure your HSM and application to route these to your SIEM:

# For AWS CloudHSM: audit logs automatically sent to CloudWatch Logs.
# Filter for key usage events.
import boto3

logs = boto3.client("logs")
logs.put_metric_filter(
    logGroupName="/aws/cloudhsm/cluster-abc123",
    filterName="KeyUsageMetric",
    filterPattern='{ $.eventName = "CKM_RSA_PKCS_SIGN" || $.eventName = "CKM_AES_GCM" }',
    metricTransformations=[{
        "metricName": "HSMKeyOperationsTotal",
        "metricNamespace": "Security/HSM",
        "metricValue": "1",
    }]
)

For on-premise HSMs, audit logs typically go to syslog:

# Thales Luna HSM syslog format example:
# Apr 29 10:23:45 hsm1 lunacm[1234]: AUDIT CKO_PRIVATE_KEY op=C_Sign slot=0 key=nginx-tls-key user=nginx-app result=OK

# Forward to your SIEM with structured parsing.

Alert on:

  • Unexpected key usage times (signing operations at 3am from a production server).
  • Key usage from unexpected source IPs.
  • Failed authentication attempts to the HSM.
  • Key deletion operations — should be extremely rare and pre-approved.

Step 6: Key Rotation with HSM

Keys in HSMs support rotation without extraction. The old key is kept for decryption of legacy data while the new key is used for all new operations:

# AWS KMS: rotate an HSM-backed key automatically (annual rotation).
aws kms enable-key-rotation --key-id $KEY_ID

# Or manual rotation: create a new key, update references, retire old key.
NEW_KEY_ID=$(aws kms create-key \
  --origin AWS_CLOUDHSM \
  --custom-key-store-id $KEY_STORE_ID \
  --key-usage SIGN_VERIFY \
  --key-spec RSA_4096 \
  --query 'KeyMetadata.KeyId' --output text)

# Update your application to use NEW_KEY_ID for signing.
# Keep OLD_KEY_ID enabled for verification of previously signed artifacts.
# Schedule OLD_KEY_ID deletion after the signature validity window expires.
aws kms schedule-key-deletion --key-id $OLD_KEY_ID --pending-window-in-days 30

For Vault PKI with HSM backing, key rotation is via Vault’s PKI engine:

# Rotate the intermediate CA key (new key generated in HSM).
vault write pki_int/root/rotate/exported \
  common_name="new-intermediate-ca" \
  key_type=ec key_bits=256

# The old key is retained for CRL signing; new certs use the new key.

Step 7: SoftHSM2 for Testing

Use SoftHSM2 in development and CI to test PKCS#11 integration without hardware:

# Install SoftHSM2.
apt install softhsm2

# Initialize a token.
softhsm2-util --init-token --free --label "dev-test" --pin 1234 --so-pin 5678

# Generate a key pair in SoftHSM2.
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
  --login --pin 1234 \
  --keypairgen --key-type rsa:2048 \
  --label "test-signing-key" --id 01

# List objects in the token.
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
  --login --pin 1234 --list-objects

In CI, run SoftHSM2 in a Docker container alongside your application tests:

# .github/workflows/test.yml
services:
  softhsm:
    image: ghcr.io/letsencrypt/softhsm:latest
    env:
      SOFTHSM2_CONF: /etc/softhsm2.conf
    options: --health-cmd "pkcs11-tool --list-slots" --health-interval 10s

Step 8: Telemetry

hsm_operations_total{operation, key_label, result}         counter
hsm_authentication_failure_total{hsm_id}                   counter
hsm_key_operations_per_second{key_label}                   gauge
hsm_session_count{hsm_id}                                  gauge
hsm_free_session_count{hsm_id}                             gauge
vault_hsm_seal_status{status}                              gauge (1=sealed)

Alert on:

  • hsm_authentication_failure_total non-zero — incorrect PIN or credentials; possible brute-force or misconfiguration.
  • hsm_free_session_count == 0 — HSM session pool exhausted; new cryptographic operations will fail; scaling issue.
  • vault_hsm_seal_status == 1 — Vault sealed unexpectedly; may indicate HSM connectivity loss.
  • Key operation rate anomaly — signing rate 10× normal for a key may indicate unauthorized use.

Expected Behaviour

Signal Without HSM With HSM
Root server compromise Attacker copies private key from filesystem Key non-exportable; attacker can only abuse API while they have access
Kubernetes etcd dump TLS/signing keys in Secret data extracted Keys not in etcd; only references to KMS key IDs
Process memory dump Private key visible in memory during use PKCS#11 operations performed inside HSM; key never in application memory
Key usage audit None or application-level logging Per-operation HSM audit log; tamper-evident
Compliance (FIPS 140-2/3) Not certified (software key store) FIPS 140-2 Level 3 certified (hardware HSM)

Trade-offs

Aspect Benefit Cost Mitigation
HSM for all keys Key non-extractability Cost (CloudHSM: ~$1.50/hr per HSM); operational complexity Use cloud KMS with HSM backing for lower cost; direct HSM for highest-value keys only (root CA, code signing).
PKCS#11 for TLS Private key never leaves HSM TLS handshakes go through HSM; may add 1–5ms latency Session resumption reduces per-handshake cost; acceptable for most applications.
Vault HSM auto-unseal Vault restarts without operator intervention HSM becomes a dependency of Vault availability HSM HA (two HSMs in cluster); fallback recovery keys for emergency.
SoftHSM for testing No hardware required for dev/CI SoftHSM2 is software; no real security properties Clearly document that SoftHSM is not production-grade; use in dev/CI only.
Per-operation audit logging Complete key usage history Log volume; SIEM storage cost Compress and aggregate; alert on anomalies rather than storing all events forever.

Failure Modes

Failure Symptom Detection Recovery
HSM unreachable (network loss) PKCS#11 operations fail; TLS handshakes fail; Vault seals Application errors; hsm_operations_total{result="error"} spikes Restore HSM connectivity; for Vault: use recovery keys; for TLS: fail to fallback cert path.
HSM PIN locked after failed attempts All operations fail with CKR_PIN_LOCKED hsm_authentication_failure_total spikes then operations stop Use Security Officer (SO) PIN to reset the User PIN; investigate why PIN attempts failed.
HSM session pool exhausted Cryptographic operations queue and timeout hsm_free_session_count == 0; latency spikes Increase session limit in HSM configuration; reduce sessions per application instance.
Key rotation breaks old signature verification Signatures signed with old key fail to verify Application errors verifying old signatures Keep old key enabled for verification; only delete after signature validity period expires.
PKCS#11 library version mismatch Application crashes at startup dlopen errors; PKCS#11 init failure Pin PKCS#11 library version in deployment; test upgrades in staging.
CloudHSM cluster in single AZ AZ failure takes down HSM Operations fail for duration of AZ outage Deploy HSMs in at least 2 AZs; CloudHSM replicates keys across cluster members.