DNSSEC Key Rollover Operational Security: Lessons from the .de TLD Three-Hour Outage

Problem

On May 5, 2026, at approximately 09:14 UTC, DENIC — the registry operator for the .de country-code TLD — began an automated Zone Signing Key (ZSK) rollover for the de. zone. By 09:17 UTC, every validating resolver on the internet was returning SERVFAIL for all .de domains. Cloudflare 1.1.1.1, Google 8.8.8.8, ISP resolvers, and enterprise recursive resolvers with DNSSEC validation enabled all rejected every .de query for the next three hours. Non-validating resolvers continued to work, which made the incident particularly confusing to diagnose: roughly 30% of end users saw no problem while the rest experienced total failure.

The proximate cause was a bug in DENIC’s automated rollover script: it generated RRSIGs using the previous ZSK private key while publishing a DNSKEY RRset in which the new ZSK was the only zone signing key. The signatures did not match the published key material, so every resolver that attempted DNSSEC validation rejected the zone immediately.

This is the defining characteristic of DNSSEC operational failure: the blast radius is not limited to misconfigured operators or unlucky users. When a TLD or well-trafficked zone breaks DNSSEC validation, it is a complete and instantaneous outage for all domains beneath it, affecting all validating resolvers simultaneously, worldwide.

The .de incident is not unique. The same class of failure has occurred at other registries and at large authoritative operators. The root cause is invariably the same: DNSSEC key rollovers are complex multi-step operations with tight timing dependencies, and the tooling — automated or manual — is easy to use incorrectly. Rollover procedures that have been correct for years can fail silently when a dependency changes, when a script’s execution order assumption breaks, or when key metadata falls out of sync with the live zone.

This article covers:

  • The mechanics of ZSK and KSK rollovers and where each can fail
  • What “non-validatable signatures” means at the wire level
  • Pre-rollover validation tools that would have caught the DENIC bug before publication
  • Continuous DNSSEC health monitoring with alerting on RRSIG expiry
  • Staging environment design for rollover testing
  • Canary resolver checks as a deployment gate
  • Automated rollback triggering
  • Key TTL strategy for fast recovery
  • BIND9 dnssec-policy and PowerDNS AutoDNSSEC with staged rollover

Threat Model

DNSSEC operational failures create attack surface in three distinct ways that are worth making explicit.

Outage as deliberate DoS. An adversary who can trigger a DNSSEC validation failure on a target zone causes SERVFAIL-based denial of service without touching the authoritative server infrastructure at all. Supply chain attacks against DNS management APIs, injection into automated rollover pipelines, and social engineering of registry support staff are all realistic vectors for inducing a validation failure. The attacker does not need to compromise the authoritative server — they only need to cause the zone to publish inconsistent key material.

Rollover window exploitation. During a KSK rollover, there is a trust chain gap window between when the new KSK is published in the zone and when the parent zone publishes the new DS record. If the old KSK is removed before the parent DS update propagates, resolvers that have cached the old DS record will reject the new KSK signatures. An attacker monitoring for this window can time a cache poisoning attempt against the zone, knowing that validating resolvers will fall back to insecure mode or SERVFAIL during the gap, depending on resolver configuration.

Rollback failure extends outage indefinitely. The most dangerous scenario is a rollover that fails and a rollback that also fails. If the rollback procedure shares the same bug as the rollover procedure — or if the previous signed zone has been overwritten — the zone remains broken until a human fully reconstructs it from key backups. In practice this has extended outages from minutes to hours to, in one documented case, more than a day.

Every rollover procedure must be designed with the assumption that both the rollover and the rollback can fail independently.

Configuration and Implementation

DNSSEC Rollover Mechanics

A DNSSEC-signed zone uses two types of keys. The Zone Signing Key (ZSK) signs all RRsets in the zone. The Key Signing Key (KSK) signs only the DNSKEY RRset. The KSK’s public key is registered with the parent zone as a DS (Delegation Signer) record, creating the chain of trust from the root downward.

ZSK rollover can use two methods:

Pre-publish method: Publish the new ZSK in the DNSKEY RRset without signing anything with it yet, and wait until the old DNSKEY RRset has expired from resolver caches (DNSKEY TTL plus propagation delay). Then switch signing to the new ZSK while keeping the old ZSK published. Once every signature made with the old ZSK has expired from caches, remove the old ZSK from the DNSKEY RRset. This method is safe because a resolver never holds an RRSIG whose signing key is missing from the DNSKEY RRset it has cached.

Double-signature method: Sign the zone with both the old and new ZSK simultaneously, publish both keys and both full sets of signatures, then remove the old ZSK and its signatures once the old records have expired from resolver caches. This roughly doubles the size of the signed zone and of signed responses for the duration of the rollover, but it avoids the pre-publish method's separate wait for DNSKEY TTL propagation.
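
The waits in the pre-publish method can be turned into concrete timestamps. The following is a minimal sketch, not a registry procedure: the function name zsk_prepublish_schedule and its four parameters are hypothetical, and it assumes the zone is fully re-signed at the moment the new ZSK is activated.

```shell
# zsk_prepublish_schedule PUBLISH_EPOCH DNSKEY_TTL PROPAGATION_DELAY MAX_ZONE_TTL
# (hypothetical helper; all values in seconds)
# Prints the earliest safe time to start signing with the new ZSK and the
# earliest safe time to drop the old ZSK from the DNSKEY RRset.
zsk_prepublish_schedule() (
    publish=$1 dnskey_ttl=$2 propagation=$3 max_zone_ttl=$4

    # Resolvers may hold the old DNSKEY RRset (without the new key) for one
    # DNSKEY TTL after the new RRset has reached every authoritative server.
    activate=$(( publish + propagation + dnskey_ttl ))

    # Signatures made by the old ZSK can sit in caches for up to the largest
    # TTL in the zone after the re-signed zone has propagated.
    remove_old=$(( activate + propagation + max_zone_ttl ))

    echo "activate_new_zsk=${activate} remove_old_zsk=${remove_old}"
)

# Example: publish at t=0, DNSKEY TTL 300s, propagation 300s, max zone TTL 3600s
zsk_prepublish_schedule 0 300 300 3600   # prints activate_new_zsk=600 remove_old_zsk=4500
```

The two results are lower bounds. Adding a safety margin on top of each is cheap compared with removing a key too early.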

KSK rollover requires an additional step that ZSK rollover does not: the parent zone must be updated. The sequence is: publish new KSK in DNSKEY RRset → wait for DNSKEY TTL to expire → notify parent registry to add new DS record → wait for DS TTL to expire everywhere → switch signing to new KSK → remove old KSK → notify parent to remove old DS record.

The critical constraint: do not remove the old KSK until the new DS record has been confirmed in the parent zone and has propagated past the parent’s DS TTL. Violating this ordering breaks validation for any resolver that cached the old DS record.
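
The ordering constraint above can be enforced as an explicit gate instead of relying on an operator's memory. This is a minimal sketch under stated assumptions: both function names are hypothetical, DS_SEEN_EPOCH is the time at which the new DS was first confirmed on all parent servers, and SAFETY_MARGIN is an operator-chosen buffer.

```shell
# earliest_old_ksk_removal DS_SEEN_EPOCH DS_TTL SAFETY_MARGIN
# (hypothetical helper; all values in seconds)
# Resolvers may keep the old DS RRset cached for one full DS TTL after the
# parent started serving the new one.
earliest_old_ksk_removal() (
    ds_seen=$1 ds_ttl=$2 safety=$3
    echo $(( ds_seen + ds_ttl + safety ))
)

# old_ksk_removal_allowed NOW_EPOCH DS_SEEN_EPOCH DS_TTL SAFETY_MARGIN
# Succeeds (exit 0) only once the waiting period is over.
old_ksk_removal_allowed() (
    now=$1
    earliest=$(earliest_old_ksk_removal "$2" "$3" "$4")
    [ "$now" -ge "$earliest" ]
)

# Example: new DS confirmed at t=1000, DS TTL 86400s, safety margin 3600s
if old_ksk_removal_allowed 90000 1000 86400 3600; then
    echo "safe to remove old KSK"
else
    echo "keep old KSK published"   # this branch runs: 90000 < 91000
fi
```

A rollover script can call old_ksk_removal_allowed with the current time immediately before the removal step and abort if it fails.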

What Non-Validatable Signatures Look Like at the Wire Level

The DENIC failure mode was a mismatch between the signing key and the published DNSKEY RRset. In RFC 4034 terms, RRSIG validation requires that:

  1. The RRSIG Signer's Name field matches the zone apex
  2. The RRSIG Key Tag field matches the Key Tag computed from the DNSKEY RR
  3. The cryptographic signature in the RRSIG verifies against the public key in the matching DNSKEY RR

When DENIC’s script published the new ZSK as the only zone signing key in the DNSKEY RRset but kept signing with the old ZSK private key, the Key Tag field of the resulting RRSIGs referenced a key that was no longer present in the DNSKEY RRset. Every resolver that attempted validation found no DNSKEY record matching the RRSIG’s Key Tag, treated the data as bogus, and returned SERVFAIL.
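
The Key Tag is a 16-bit checksum over the DNSKEY RDATA, defined in RFC 4034 Appendix B. The sketch below reimplements that checksum to show why a changed key, or even a changed flags field, produces a different tag. The helper names emit_byte and dnskey_keytag are hypothetical, the key "AQID" is a three-byte toy value and not real key material, and the special-case formula for algorithm 1 (RSA/MD5) is not covered.

```shell
# emit_byte N: write the single byte with value N (0-255) to stdout
emit_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# dnskey_keytag FLAGS PROTOCOL ALGORITHM BASE64_PUBLIC_KEY
# Rebuilds the DNSKEY RDATA (2-byte flags, protocol, algorithm, key) and
# applies the RFC 4034 Appendix B checksum.
dnskey_keytag() (
    flags=$1 protocol=$2 algorithm=$3 key_b64=$4
    {
        emit_byte $(( (flags >> 8) & 255 ))
        emit_byte $(( flags & 255 ))
        emit_byte "$protocol"
        emit_byte "$algorithm"
        printf '%s' "$key_b64" | base64 -d
    } | od -An -v -tu1 | awk '
        # Even-indexed bytes are the high byte of a 16-bit word, odd ones the low byte
        { for (i = 1; i <= NF; i++) { sum += (n % 2 == 0) ? $i * 256 : $i; n++ } }
        # Fold the carry back in and keep the low 16 bits
        END { sum += int(sum / 65536) % 65536; print sum % 65536 }'
)

dnskey_keytag 256 3 13 AQID   # prints 2063
dnskey_keytag 257 3 13 AQID   # prints 2064: same key, SEP bit set, different tag
```

Because the tag is only a checksum, two different keys can share one. Resolvers therefore use it to select candidate keys and still verify the signature cryptographically.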

You can observe this failure mode directly with dig:

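# Query the DNSKEY RRset; +cd (checking disabled) makes a validating
# resolver return the records even when it considers them bogus
dig +cd +dnssec +multiline DNSKEY de. @1.1.1.1

# In multiline mode dig annotates each DNSKEY with "key id = N" (its Key Tag).
# Every RRSIG names the Key Tag of the key that produced it. If that Key Tag
# does not appear on any DNSKEY record in the answer, the zone is broken in
# exactly the DENIC failure mode.

# List the Key Tag (field 11) and signer (field 12) of the RRSIGs on SOA and DNSKEY
for type in SOA DNSKEY; do
    dig +cd +dnssec +noall +answer "${type}" de. @1.1.1.1 |
        awk '$4 == "RRSIG" {print $5, "key tag:", $11, "signer:", $12}'
done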
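
The comparison between RRSIG Key Tags and published keys can be automated. The following sketch reads dig's one-record-per-line answer format on standard input and prints every RRSIG whose Key Tag is not among the published tags passed as arguments. The function name orphan_rrsig_keytags is hypothetical, and the sample record is made up for illustration.

```shell
# orphan_rrsig_keytags PUBLISHED_TAG...
# stdin: dig answer lines (not +multiline). Field 5 of an RRSIG record is the
# covered type, field 11 the Key Tag of the signing key.
# stdout: "<covered type> <key tag>" for each RRSIG made by an unpublished key.
orphan_rrsig_keytags() (
    published=" $* "
    awk -v published="$published" '
        $4 == "RRSIG" && index(published, " " $11 " ") == 0 {
            print $5, $11
        }'
)

# Example with a made-up record: the SOA RRSIG was made by key 11111,
# but only keys 22222 and 33333 are published.
printf '%s\n' \
    'example.com. 3600 IN RRSIG SOA 13 2 3600 20260519091400 20260505091400 11111 example.com. c2ln' |
    orphan_rrsig_keytags 22222 33333   # prints: SOA 11111
```

Any output at all means at least one signature references a key that resolvers cannot find, which is grounds to abort a rollover.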

Other non-validatable signature conditions:

  • Expired RRSIG: the RRSIG Signature Expiration field is in the past. Resolvers reject with BOGUS/SERVFAIL.
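  • Wrong DNSKEY flags: a ZSK published with the SEP (Secure Entry Point) bit set (flags value 257) when it was generated and used with flags value 256. The Key Tag is computed over the entire DNSKEY RDATA, flags included, so the published key no longer matches the Key Tag in the RRSIGs.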
  • Algorithm mismatch: signing with ECDSA P-256 (algorithm 13) but publishing an RSA key (algorithm 8) in the DNSKEY RRset.
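
The first condition in the list, an RRSIG outside its validity window, is easy to check mechanically because dig prints both timestamps as YYYYMMDDHHMMSS. A minimal sketch, assuming dig's one-record-per-line answer format on standard input; the function name rrsigs_outside_validity is hypothetical and the sample records are made up.

```shell
# rrsigs_outside_validity NOW        (NOW as YYYYMMDDHHMMSS, UTC)
# stdin: dig answer lines. Field 9 of an RRSIG record is the expiration time,
# field 10 the inception time.
rrsigs_outside_validity() (
    awk -v now="$1" '
        $4 == "RRSIG" && $9 + 0 < now + 0  { print $5, "expired", $9 }
        $4 == "RRSIG" && $10 + 0 > now + 0 { print $5, "not-yet-valid", $10 }'
)

# Example with made-up records, evaluated at 2026-05-05 09:14:00 UTC
printf '%s\n' \
    'example.com. 300 IN RRSIG A 13 2 300 20260501000000 20260417000000 11111 example.com. c2ln' \
    'example.com. 300 IN RRSIG MX 13 2 300 20260519000000 20260505000000 11111 example.com. c2ln' |
    rrsigs_outside_validity 20260505091400   # prints: A expired 20260501000000

# Against a live zone the current time can be supplied with:
#   dig +cd +dnssec +noall +answer SOA example.com | rrsigs_outside_validity "$(date -u +%Y%m%d%H%M%S)"
```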

Pre-Rollover Validation

The .de outage would have been prevented by a single validation step run against the new signed zone before publication. Both major toolchains provide this.

ldns-verify-zone performs offline validation of a signed zone file:

# Verify zone file signatures against the keys in the zone
ldns-verify-zone /etc/bind/signed/db.example.com.signed

# Expected output for a valid zone:
# Zone is verified and complete

# For a broken zone (DENIC scenario) the tool exits non-zero and reports the
# RRsets whose signatures cannot be verified; exact wording varies by ldns version

# Include checking DS records against parent
ldns-verify-zone -k /etc/bind/keys/ds-parent.key \
    /etc/bind/signed/db.example.com.signed

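dnssec-signzone (BIND9 tooling) verifies the zone it has just signed unless -P is given, and dnssec-verify re-checks an existing signed zone file:

# Sign the zone (post-sign verification runs by default; -P would disable it)
dnssec-signzone -A -3 $(head -c 16 /dev/urandom | xxd -p) \
    -N INCREMENT \
    -o example.com \
    -t \
    /etc/bind/zones/db.example.com

# Structural check of the signed output
named-checkzone -i full -k fail \
    example.com /etc/bind/zones/db.example.com.signed

# Independent DNSSEC verification of the signed output
dnssec-verify -o example.com /etc/bind/zones/db.example.com.signed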

# PowerDNS equivalent
pdnsutil check-zone example.com
pdnsutil rectify-zone example.com

Automated pre-publication gate: Wrap these checks in a script that must exit 0 before any zone publication can proceed:

#!/usr/bin/env bash
# pre-publish-validate.sh — exit non-zero to abort rollover

set -euo pipefail

ZONE_FILE="${1:?Usage: $0 <signed-zone-file> <zone-name>}"
ZONE_NAME="${2:?}"

echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Validating ${ZONE_FILE} for ${ZONE_NAME}"

# 1. Structural check
if ! named-checkzone "${ZONE_NAME}" "${ZONE_FILE}"; then
    echo "ABORT: named-checkzone failed" >&2
    exit 1
fi

# 2. DNSSEC signature validation
if ! ldns-verify-zone "${ZONE_FILE}"; then
    echo "ABORT: ldns-verify-zone failed — signatures do not validate" >&2
    exit 1
fi

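# 3. Check RRSIG expiry: reject zones with signatures expiring within 7 days
# Match on the type field so NSEC/NSEC3 type bitmaps that mention RRSIG are skipped,
# and take the minimum in awk so an early-exiting "head" cannot trip pipefail
EXPIRY=$(ldns-read-zone "${ZONE_FILE}" |
    awk '$4 == "RRSIG" && (min == "" || $9 < min) {min = $9} END {print min}')
if [[ -z "${EXPIRY}" ]]; then
    echo "ABORT: no RRSIG records found in ${ZONE_FILE}" >&2
    exit 1
fi
NOW_EPOCH=$(date +%s)
# GNU date syntax; a parse failure yields 0, which aborts in the check below
EXPIRY_EPOCH=$(date -d "${EXPIRY:0:4}-${EXPIRY:4:2}-${EXPIRY:6:2}T${EXPIRY:8:2}:${EXPIRY:10:2}:${EXPIRY:12:2}Z" +%s 2>/dev/null || echo 0)
DAYS_REMAINING=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))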

if [[ ${DAYS_REMAINING} -lt 7 ]]; then
    echo "ABORT: Earliest RRSIG expires in ${DAYS_REMAINING} days — too short" >&2
    exit 1
fi

echo "PASS: Zone ${ZONE_NAME} validates cleanly. Earliest RRSIG expiry: ${DAYS_REMAINING} days"
exit 0

Continuous DNSSEC Health Monitoring

Static pre-publication checks are necessary but not sufficient. You need continuous monitoring that fires before human operators notice something is wrong.

Nagios/Icinga — check_dns_dnssec plugin:

# Install check_dns_dnssec (from monitoring-plugins-dnssec or similar)
# Basic check for zone validation status
# -w 7: warn if RRSIG expires within 7 days; -c 2: critical within 2 days
# (a comment cannot follow a line-continuation backslash, so it goes up here)
/usr/lib/nagios/plugins/check_dns_dnssec \
    -H example.com \
    -s 8.8.8.8 \
    -w 7 \
    -c 2

# Icinga2 service object
object Service "dnssec-validity" {
  host_name = "dns-monitoring"
  check_command = "check_dns_dnssec"
  vars.dnssec_domain = "example.com"
  vars.dnssec_resolver = "8.8.8.8"
  vars.dnssec_warn = 7
  vars.dnssec_crit = 2
  check_interval = 5m
  max_check_attempts = 2
}

Zabbix DNSSEC template — key items:

# zabbix-dnssec-template.yaml (excerpt)
items:
  - name: "DNSSEC validation status"
    key: "external.dnssec.validate[{$ZONE},{$RESOLVER}]"
    type: EXTERNAL
    delay: 5m
    value_type: INTEGER
    # 0 = validates, 1 = BOGUS, 2 = SERVFAIL, 3 = timeout

  - name: "DNSSEC RRSIG minimum expiry days"
    key: "external.dnssec.rrsig_expiry_days[{$ZONE}]"
    type: EXTERNAL
    delay: 1h
    value_type: FLOAT

triggers:
  - name: "DNSSEC validation BOGUS for {$ZONE}"
    expression: "last(/template/external.dnssec.validate[{$ZONE},{$RESOLVER}])=1"
    severity: DISASTER
    manual_close: false

  - name: "DNSSEC RRSIG expiring within 3 days for {$ZONE}"
    expression: "last(/template/external.dnssec.rrsig_expiry_days[{$ZONE}])<3"
    severity: HIGH
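
The status item above needs an external check that can tell BOGUS apart from an ordinary SERVFAIL, which dig reports identically. A common technique is to repeat the query with +cd (checking disabled): if the resolver answers NOERROR with checking disabled but SERVFAIL without it, the failure is a validation failure. The sketch below shows only that decision logic. The function name classify_dnssec_status is hypothetical, and code 4 is an addition for illustration that is not part of the template's 0 to 3 range.

```shell
# classify_dnssec_status STATUS AD STATUS_CD
#   STATUS     rcode of the normal query ("" if the resolver did not answer)
#   AD         1 if the AD flag was set in the normal answer, else 0
#   STATUS_CD  rcode of the same query repeated with +cd
# Prints: 0 = validates, 1 = BOGUS, 2 = SERVFAIL, 3 = timeout,
#         4 = answered but not validated (assumed extra code)
classify_dnssec_status() (
    status=$1 ad=$2 status_cd=$3
    if [ -z "$status" ]; then
        echo 3
    elif [ "$status" = "NOERROR" ] && [ "$ad" = "1" ]; then
        echo 0
    elif [ "$status" = "SERVFAIL" ] && [ "$status_cd" = "NOERROR" ]; then
        # Data is retrievable when validation is skipped, so validation is what failed
        echo 1
    elif [ "$status" = "SERVFAIL" ]; then
        echo 2
    else
        echo 4
    fi
)

classify_dnssec_status SERVFAIL 0 NOERROR   # prints 1
```

The three inputs can be extracted from dig output with the same header and flags parsing used by the canary script later in this article.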

Prometheus + Alertmanager:

# prometheus-dnssec-rules.yaml
groups:
  - name: dnssec
    interval: 2m
    rules:
      - alert: DNSSECValidationBogus
        expr: dns_dnssec_validation_status{zone=~".+"} == 1
        for: 1m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "DNSSEC validation BOGUS for {{ $labels.zone }}"
          description: >
            Zone {{ $labels.zone }} is returning BOGUS from resolver
            {{ $labels.resolver }}. Check for key rollover issues.
            Rollback procedure: https://runbooks.internal/dnssec-rollback

      - alert: DNSSECRRSIGExpiringSoon
        expr: dns_rrsig_expiry_days{zone=~".+"} < 7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DNSSEC RRSIG expiring in {{ $value }} days for {{ $labels.zone }}"

Staging Environment Design

Every DNSSEC rollover procedure should be tested in a staging environment that mirrors production timing and resolver behaviour before being run in production.

# Staging environment: BIND9 authoritative + Unbound validating resolver
# in separate network namespaces or containers

# 1. Start an isolated validating resolver for testing
cat > /tmp/unbound-test.conf << 'EOF'
server:
    interface: 127.0.0.2
    port: 5353
    do-ip4: yes
    do-udp: yes
    verbosity: 2
    val-log-level: 2
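    auto-trust-anchor-file: "/var/lib/unbound/root.key"
    # Unbound refuses to send queries to localhost by default;
    # allow it so the stub zone below can reach the local authoritative
    do-not-query-localhost: no

# Override example.com to use local authoritative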
stub-zone:
    name: "example.com."
    stub-addr: 127.0.0.1@5354
EOF

unbound -c /tmp/unbound-test.conf

# 2. Run rollover against local BIND9 authoritative
# 3. Verify from the test resolver
dig +dnssec @127.0.0.2 -p 5353 example.com DNSKEY
dig +dnssec @127.0.0.2 -p 5353 www.example.com A

# Expected: AD (Authenticated Data) bit set in all responses
# Failure: SERVFAIL — stops the staging test before production is touched

BIND9 dnssec-policy for Automated Safe Rollover

BIND9 9.16+ includes dnssec-policy, which manages key lifecycle automatically with built-in rollover safety. It derives the timing of every rollover step from the configured TTLs, propagation delays, and safety margins, so no step depends on a hand-written schedule.

# /etc/bind/named.conf.options (excerpt)

dnssec-policy "production" {
    keys {
        # KSK: 2048-bit RSA, rollover every 2 years
        ksk key-directory lifetime P2Y algorithm rsasha256 2048;
        # ZSK: 2048-bit RSA, rollover every 90 days
        zsk key-directory lifetime P90D algorithm rsasha256 2048;
    };

    # DNSKEY RRset TTL — keep short for fast recovery
    dnskey-ttl 300;

    # Maximum zone TTL affects how long old signatures must be retained
    max-zone-ttl 3600;

    # Signature validity period and refresh threshold
    signatures-validity 14d;
    signatures-refresh 5d;

    # Require parent DS before completing KSK rollover
    # (named will wait and re-check, not proceed automatically)
    parent-ds-ttl 86400;
    parent-propagation-delay 1h;
    zone-propagation-delay 5m;
    publish-safety 1h;
    retire-safety 1h;
};

zone "example.com" {
    type primary;
    file "/etc/bind/zones/db.example.com";
    dnssec-policy "production";
    key-directory "/etc/bind/keys/example.com";
    inline-signing yes;
};

With inline-signing yes, BIND9 leaves the unsigned zone file untouched and maintains a separate signed version of the zone. named keeps the signed version up to date itself, without a separate dnssec-signzone invocation, which eliminates the class of bugs where the wrong tool version or wrong key path is used.

PowerDNS AutoDNSSEC with Staged Rollover

# Enable DNSSEC for a zone (pdnsutil has no enable-dnssec command; secure-zone creates a default key set)
pdnsutil secure-zone example.com

# Add keys — PowerDNS creates them in the database
pdnsutil add-zone-key example.com zsk active 2048 rsasha256
pdnsutil add-zone-key example.com ksk active 2048 rsasha256

# Check current key state before any rollover
pdnsutil show-zone example.com

# Staged ZSK rollover (pre-publish method)
# Step 1: Generate and publish new ZSK (inactive — not yet signing)
pdnsutil add-zone-key example.com zsk inactive 2048 rsasha256
# Note the new key ID from output, e.g. 7

# Step 2: Wait for DNSKEY TTL (300s) to expire across resolvers
sleep 300

# Step 3: Activate new ZSK, deactivate old ZSK
pdnsutil activate-zone-key example.com 7
pdnsutil deactivate-zone-key example.com 3   # old key ID

# Step 4: Validate the live-signed zone before increasing the serial
# (assumes the server answers on 127.0.0.1 and permits AXFR from localhost)
dig +onesoa AXFR example.com @127.0.0.1 > /tmp/example.com.axfr
ldns-verify-zone /tmp/example.com.axfr || {
    echo "ABORT: Validation failed, rolling back"
    pdnsutil activate-zone-key example.com 3
    pdnsutil deactivate-zone-key example.com 7
    exit 1
}

# Step 5: Increase serial to trigger zone transfer to secondaries
pdnsutil increase-serial example.com

# Step 6: Wait for the zone's maximum TTL (3600s here), so that no cached
# signature made by the old ZSK remains, then remove the old inactive ZSK
sleep 3600
pdnsutil remove-zone-key example.com 3

Canary Resolver Checks as a Deployment Gate

After publishing new key material but before decommissioning old keys, query several independent public validating resolvers, ideally from more than one network location, to confirm that validation succeeds outside your own infrastructure.

#!/usr/bin/env bash
# canary-check.sh — run after key publication, before old key removal

ZONE="${1:?provide zone name}"
RESOLVERS=(
    "1.1.1.1"          # Cloudflare
    "8.8.8.8"          # Google
    "9.9.9.9"          # Quad9
    "208.67.222.222"   # OpenDNS
)

FAIL=0

for resolver in "${RESOLVERS[@]}"; do
    echo -n "Checking ${ZONE} DNSKEY from ${resolver}... "

    # dig returns exit code 0 even on SERVFAIL; check the status line
    result=$(dig +dnssec +time=5 +tries=2 \
        @"${resolver}" "${ZONE}" DNSKEY 2>&1)

    status=$(echo "${result}" | awk '/^;; ->>HEADER<<-/ {print $6}' | tr -d ',')
    ad_bit=$(echo "${result}" | grep -c 'flags:.*\bad\b' || true)

    if [[ "${status}" != "NOERROR" ]]; then
        echo "FAIL (status=${status})"
        FAIL=1
    elif [[ "${ad_bit}" -eq 0 ]]; then
        echo "FAIL (NOERROR but no AD bit — not validating)"
        FAIL=1
    else
        echo "OK (NOERROR + AD bit)"
    fi
done

if [[ ${FAIL} -ne 0 ]]; then
    echo ""
    echo "CANARY FAILED: Do not remove old keys. Initiate rollback."
    exit 1
fi

echo ""
echo "All canary resolvers validate. Safe to proceed."
exit 0

Automated Rollback Triggering

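Validating resolvers cache validation failures as soon as they see them, for a period the resolver controls rather than the zone: RFC 2308 caps caching of server failures at five minutes, and Unbound, for example, keeps bogus data for 60 seconds by default (val-bogus-ttl). Every minute a broken zone stays published, more resolvers pick up and cache the failure, and each of them keeps answering SERVFAIL until its own failure cache expires, even after a correct rollback. The rollback trigger should therefore fire within a few minutes of publication, not after a human has noticed.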

#!/usr/bin/env bash
# rollover-with-watchdog.sh

ZONE="${1:?}"
OLD_ZONE_SIGNED="${2:?}"  # path to pre-rollover signed zone backup
NEW_ZONE_SIGNED="${3:?}"  # path to new signed zone file

VALIDATION_TIMEOUT=120    # seconds to wait before checking
CHECK_RESOLVER="1.1.1.1"
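PUBLISH_CMD="rndc reload ${ZONE}"   # or equivalent for your stack
# Fail now, not in the middle of a failed rollback, if the paging hook is unset
: "${PAGERDUTY_WEBHOOK:?set PAGERDUTY_WEBHOOK to the paging endpoint}"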

validate_zone() {
    local zone="${1}"
    local resolver="${2}"
    local result

    result=$(dig +dnssec +time=5 +tries=1 \
        @"${resolver}" "${zone}" SOA 2>&1)
    local status
    status=$(echo "${result}" | awk '/^;; ->>HEADER<<-/ {print $6}' | tr -d ',')
    local ad
    ad=$(echo "${result}" | grep -c 'flags:.*\bad\b' || true)

    [[ "${status}" == "NOERROR" && "${ad}" -gt 0 ]]
}

# Back up current signed zone
cp /etc/bind/zones/db."${ZONE}".signed "${OLD_ZONE_SIGNED}"

# Publish new signed zone
cp "${NEW_ZONE_SIGNED}" /etc/bind/zones/db."${ZONE}".signed
${PUBLISH_CMD}

echo "New zone published at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Waiting ${VALIDATION_TIMEOUT}s before canary check..."
sleep "${VALIDATION_TIMEOUT}"

if validate_zone "${ZONE}" "${CHECK_RESOLVER}"; then
    echo "Validation OK — rollover successful"
else
    echo "VALIDATION FAILED — initiating automatic rollback at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    cp "${OLD_ZONE_SIGNED}" /etc/bind/zones/db."${ZONE}".signed
    ${PUBLISH_CMD}

    sleep 30
    if validate_zone "${ZONE}" "${CHECK_RESOLVER}"; then
        echo "Rollback successful — zone validating again"
        exit 1   # rollover failed, rollback succeeded
    else
        echo "CRITICAL: Rollback also failed — manual intervention required"
        # Page on-call immediately
        curl -s -X POST "${PAGERDUTY_WEBHOOK}" \
            -H 'Content-Type: application/json' \
            -d "{\"event_action\":\"trigger\",\"payload\":{\"summary\":\"DNSSEC rollback failed for ${ZONE} — manual action required\",\"severity\":\"critical\"}}"
        exit 2
    fi
fi

Key TTL Strategy

| TTL Parameter | Recommended Value | Rationale |
|---|---|---|
| DNSKEY TTL | 300–900 seconds | Determines how long resolvers cache key material; shorter = faster rollback recovery. BIND9 dnssec-policy default is 1 hour; reduce this for zones where rollover risk is high. |
| RRSIG signature validity | 14 days | Provides a 14-day window to catch expiry before it becomes a live issue. |
| RRSIG re-sign threshold | 5 days before expiry | Re-signing with 5 days remaining gives time for monitoring to catch failures before expiry. |
| DS TTL at parent | Registry-controlled; typically 86400s | You cannot control this. Budget 24–48 hours for DS propagation during KSK rollover. |
| Negative cache TTL (SOA MINIMUM) | 300 seconds | Limits how long NXDOMAIN and NODATA answers are cached. It does not bound SERVFAIL caching, which resolvers control themselves (RFC 2308 caps it at five minutes). Set this in the unsigned zone before signing. |

The matching SOA record in the unsigned zone file:

; /etc/bind/zones/db.example.com
$TTL 3600
@ IN SOA ns1.example.com. admin.example.com. (
    2026050901  ; serial
    3600        ; refresh
    900         ; retry
    604800      ; expire
    300         ; minimum (negative cache TTL)
)
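
One consequence of the re-sign threshold is worth checking mechanically. With signatures refreshed 5 days before expiry, the remaining validity of the oldest signature in a healthy zone bottoms out at about 5 days, so an expiry alert set at 7 days would fire during every normal signing cycle. The sketch below checks that alert thresholds sit below the re-sign threshold; the function name alert_thresholds_consistent is hypothetical.

```shell
# alert_thresholds_consistent REFRESH_DAYS WARN_DAYS CRIT_DAYS
# Succeeds (exit 0) only if critical < warning < re-sign threshold, i.e. the
# alerts can fire only when re-signing has actually fallen behind.
alert_thresholds_consistent() (
    refresh=$1 warn=$2 crit=$3
    [ "$crit" -lt "$warn" ] && [ "$warn" -lt "$refresh" ]
)

# Re-sign at 5 days remaining, warn at 3, critical at 1: consistent
alert_thresholds_consistent 5 3 1 && echo "thresholds OK"

# Re-sign at 5 days remaining, warn at 7: the warning fires on every cycle
alert_thresholds_consistent 5 7 2 || echo "warn threshold is not below the re-sign threshold"
```

Zones signed offline with a longer validity period and re-signed well before expiry can use larger thresholds. The point is only that the alert values and the signing policy have to be chosen together.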

Expected Behaviour

The following table describes what resolvers should see at each step of a correctly executed ZSK pre-publish rollover, and what monitoring should observe:

| Rollover Step | ZSKs in DNSKEY RRset | Zone Data RRSIGs Made With | Resolver Validation Status | Correct Monitoring State | Alert if Wrong |
|---|---|---|---|---|---|
| Pre-rollover (baseline) | Old ZSK only | Old ZSK | NOERROR + AD bit | RRSIG expiry: N days, validation OK | Alert if AD bit absent or RRSIG < 7 days |
| New ZSK published (inactive) | Old ZSK + New ZSK | Old ZSK | NOERROR + AD bit | RRSIG expiry: N days, 2 ZSKs published | Alert if any resolver returns SERVFAIL |
| DNSKEY TTL wait period | Old ZSK + New ZSK | Old ZSK | NOERROR + AD bit | Same as above | Alert if canary resolvers drop AD bit |
| Switch to new ZSK for signing | Old ZSK + New ZSK | New ZSK (double-sig method: both) | NOERROR + AD bit | Both keys published, new key now signing | Alert if SERVFAIL from any canary resolver |
| Old ZSK removed | New ZSK only | New ZSK | NOERROR + AD bit | Single ZSK published, new key | Alert on SERVFAIL; rollback required immediately |
| DENIC failure mode | New ZSK only (old removed too early) | Old ZSK private key | SERVFAIL (bogus) | Key Tag mismatch: BOGUS | Disaster alert: immediate rollback |
| Successful rollback | Old ZSK only | Old ZSK | NOERROR + AD bit | Baseline restored | Verify AD bit restored within DNSKEY TTL + propagation delay |

Trade-offs

| Approach | Advantage | Disadvantage | When to Choose |
|---|---|---|---|
| Fully automated rollover (dnssec-policy / AutoDNSSEC) | Consistent execution; no human error on individual steps; runs during off-hours | Automation bug affects all zones simultaneously; harder to add out-of-band validation gates | Zones with well-tested tooling, staging env identical to prod, automated canary checks |
| Manual rollover with checklist | Full human review of each step; explicit sign-off before proceeding | Procedural error likely under time pressure; slower; night/weekend on-call required | High-value zones (KSK rollover), first rollover of a new zone, after any tooling change |
| Short DNSKEY TTL (300s) | Fast recovery: resolvers fetch new keys within 5 minutes of rollback | Higher query volume to authoritative servers; amplification risk if zone is large | Production zones where availability SLA is strict and rollback speed matters |
| Long DNSKEY TTL (3600s+) | Fewer queries; lower authoritative load | Rollback takes up to TTL + propagation to take effect everywhere | Stable zones with infrequent rollovers and relaxed recovery time objectives |
| Canary resolver gate before old key removal | Catches validation failures before they become unrecoverable | Adds delay to rollover procedure; canary resolvers may not represent all resolver behaviour | All production rollovers: the delay cost is always worth the safety |
| Skip canary gate for speed | Rollover completes faster | Validation fails on all resolvers before detection; recovery requires waiting for resolver failure caches to expire | Never: there is no legitimate speed requirement that outweighs this risk |
| Separate staging environment | Catches tooling bugs before production | Requires maintaining a parallel DNS environment; staging drift can give false confidence | Any operator running DNSSEC on more than one zone: the overhead is low relative to the risk |

Failure Modes

| Failure Mode | Trigger | Resolver Behaviour | Detection Time | Recovery Action | Recovery Time |
|---|---|---|---|---|---|
| RRSIG signed with wrong key (DENIC scenario) | Rollover script uses stale key reference; inline signing misconfiguration | Immediate SERVFAIL for all validating resolvers | Within one polling interval (2–5 minutes) of publication | Revert to previous signed zone; reload authoritative | DNSKEY TTL + propagation (300–900s if TTL is short) |
| New DS not propagated to parent before KSK removal | KSK rollover proceeds without confirming parent DS update | Resolvers that have cached old DS record return SERVFAIL; non-validating resolvers unaffected | Within parent DS TTL (typically 24 hours); depends on resolver cache state | Re-add old KSK; wait for parent to publish new DS before retrying | 24–48 hours for DS propagation |
| Rollback slow due to long DNSKEY TTL | DNSKEY TTL set to 3600s or higher | Even after correct zone is published, resolvers keep failing validation against stale cached keys for up to TTL | Immediate: monitoring sees rollback published but validation still failing | No short-term fix; resolver failure caches also apply; wait for TTL expiry | Up to DNSKEY TTL (1–24 hours depending on configuration) |
| Monitoring alert fires after resolvers are already caching failures | Polling interval too long (15+ minutes); alert threshold too conservative | SERVFAIL responses already cached by resolvers; alert arrives too late to prevent propagation | Alert fires 15+ minutes after outage starts; caching has occurred | Roll back immediately; wait for resolver failure caches to expire; consider contacting major resolver operators | Resolver failure-cache lifetime (up to 300s) per resolver; uncontrollable once cached |
| Algorithm rollover misconfiguration | Zone signed with new algorithm but old algorithm DS still at parent | Validation fails for resolvers that implement strict algorithm policy | Seconds after publication | Ensure zone is signed with old algorithm during transition; do not remove old algorithm DNSKEY until new DS propagates | DS propagation time (24–48 hours) |
| Zone file rollback overwrites good zone with bad backup | Rollback script uses wrong backup path or backup itself was corrupted | SERVFAIL continues or worsens after rollback attempt | Seconds after rollback attempt (monitoring still shows failure) | Reconstruct zone from key backups; re-sign with ldns or dnssec-signzone from unsigned zone | Manual reconstruction: 30–120 minutes |
| Inline signing daemon crash during rollover | named or pdns crashes mid-rollover leaving partially signed zone | May serve mixed valid/invalid signatures; resolver behaviour unpredictable | Seconds to minutes | Restart signing daemon; force zone re-sign; validate with ldns-verify-zone before reload | Service restart + re-sign: 5–15 minutes |