DNSSEC Key Rollover Operational Security: Lessons from the .de TLD Three-Hour Outage
Problem
On May 5, 2026, at approximately 09:14 UTC, DENIC — the registry operator for the .de country-code TLD — began an automated Zone Signing Key (ZSK) rollover for the de. zone. By 09:17 UTC, every validating resolver on the internet was returning SERVFAIL for all .de domains. Cloudflare 1.1.1.1, Google 8.8.8.8, ISP resolvers, and enterprise recursive resolvers with DNSSEC validation enabled all rejected every .de query for the next three hours. Non-validating resolvers continued to work, which made the incident particularly confusing to diagnose: roughly 30% of end users saw no problem while the rest experienced total failure.
The proximate cause was a bug in DENIC’s automated rollover script that generated RRSIGs over the DNSKEY RRset using the previous ZSK private key while simultaneously publishing the new DNSKEY RRset as the only active key. The signatures did not match the key material, so every resolver that attempted DNSSEC validation rejected the zone immediately.
This is the defining characteristic of DNSSEC operational failure: the blast radius is not limited to misconfigured operators or unlucky users. When a TLD or well-trafficked zone breaks DNSSEC validation, it is a complete and instantaneous outage for all domains beneath it, affecting all validating resolvers simultaneously, worldwide.
The .de incident is not unique. The same class of failure has occurred at other registries and at large authoritative operators. The root cause is invariably the same: DNSSEC key rollovers are complex multi-step operations with tight timing dependencies, and the tooling — automated or manual — is easy to use incorrectly. Rollover procedures that have been correct for years can fail silently when a dependency changes, when a script’s execution order assumption breaks, or when key metadata falls out of sync with the live zone.
This article covers:
- The mechanics of ZSK and KSK rollovers and where each can fail
- What “non-validatable signatures” means at the wire level
- Pre-rollover validation tools that would have caught the DENIC bug before publication
- Continuous DNSSEC health monitoring with alerting on RRSIG expiry
- Staging environment design for rollover testing
- Canary resolver checks as a deployment gate
- Automated rollback triggering
- Key TTL strategy for fast recovery
- BIND9 dnssec-policy and PowerDNS AutoDNSSEC with staged rollover
Threat Model
DNSSEC operational failures create attack surface in three distinct ways that are worth making explicit.
Outage as deliberate DoS. An adversary who can trigger a DNSSEC validation failure on a target zone causes SERVFAIL-based denial of service without touching the authoritative server infrastructure at all. Supply chain attacks against DNS management APIs, injection into automated rollover pipelines, and social engineering of registry support staff are all realistic vectors for inducing a validation failure. The attacker does not need to compromise the authoritative server — they only need to cause the zone to publish inconsistent key material.
Rollover window exploitation. During a KSK rollover, there is a trust chain gap window between when the new KSK is published in the zone and when the parent zone publishes the new DS record. If the old KSK is removed before the parent DS update propagates, resolvers that have cached the old DS record will reject the new KSK signatures. An attacker monitoring for this window can time a cache poisoning attempt against the zone, knowing that validating resolvers will fall back to insecure mode or SERVFAIL during the gap, depending on resolver configuration.
Rollback failure extends outage indefinitely. The most dangerous scenario is a rollover that fails and a rollback that also fails. If the rollback procedure shares the same bug as the rollover procedure — or if the previous signed zone has been overwritten — the zone remains broken until a human fully reconstructs it from key backups. In practice this has extended outages from minutes to hours to, in one documented case, more than a day.
Every rollover procedure must be designed with the assumption that both the rollover and the rollback can fail independently.
Configuration and Implementation
DNSSEC Rollover Mechanics
A DNSSEC-signed zone uses two types of keys. The Zone Signing Key (ZSK) signs all RRsets in the zone. The Key Signing Key (KSK) signs only the DNSKEY RRset. The KSK’s public key is registered with the parent zone as a DS (Delegation Signer) record, creating the chain of trust from the root downward.
ZSK rollover can use two methods:
Pre-publish method: Publish the new ZSK in the DNSKEY RRset without signing anything with it yet, and wait for the DNSKEY TTL to expire everywhere. Then switch signing to the new ZSK while keeping the old ZSK published, so that signatures still held in caches remain verifiable. Once the old signatures have expired from caches (at least the maximum zone TTL), remove the old ZSK from the DNSKEY RRset. This method is safe because resolvers never hold an RRSIG whose signing key is absent from the DNSKEY RRset they have cached.
Double-signature method: Sign the zone with both old and new ZSK simultaneously, publish both keys and both full sets of signatures, then remove the old ZSK and its signatures after the maximum zone TTL has passed. This avoids the wait between publishing a key and signing with it, but roughly doubles the signature data in the zone for the duration of the rollover. BIND9's dnssec-policy uses the pre-publish method for ZSKs and reserves double publication for KSKs.
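The waits in the pre-publish method reduce to simple arithmetic. The sketch below is illustrative only: the function name zsk_prepublish_schedule and its parameters are hypothetical, all values are in seconds, and it assumes the whole zone is re-signed at the moment signing switches to the new key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: earliest safe times for the steps of a pre-publish ZSK rollover.
# Arguments (seconds): publish time (epoch), DNSKEY TTL, maximum zone TTL,
# zone propagation delay, safety margin.
# Assumption: the whole zone is re-signed at the moment signing switches keys.
zsk_prepublish_schedule() {
    local publish_at="$1" dnskey_ttl="$2" max_zone_ttl="$3" propagation="$4" safety="$5"
    # The new key may start signing only once every cached DNSKEY RRset contains it.
    local switch_at=$(( publish_at + propagation + dnskey_ttl + safety ))
    # The old key may be withdrawn only once every cached old signature has expired.
    local remove_at=$(( switch_at + propagation + max_zone_ttl + safety ))
    echo "switch_signing_at=${switch_at}"
    echo "remove_old_zsk_at=${remove_at}"
}
```

With a 300-second DNSKEY TTL, a 3600-second maximum zone TTL, a 300-second propagation delay, and a one-hour safety margin, calling zsk_prepublish_schedule 0 300 3600 300 3600 puts the signing switch 4200 seconds after publication and the old key removal at 11700 seconds.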
KSK rollover requires an additional step that ZSK rollover does not: the parent zone must be updated. The sequence is: publish new KSK in DNSKEY RRset → wait for DNSKEY TTL to expire → notify parent registry to add new DS record → wait for DS TTL to expire everywhere → switch signing to new KSK → remove old KSK → notify parent to remove old DS record.
The critical constraint: do not remove the old KSK until the new DS record has been confirmed in the parent zone and has propagated past the parent’s DS TTL. Violating this ordering breaks validation for any resolver that cached the old DS record.
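This ordering constraint can be enforced as an explicit gate. The following is a minimal sketch, not a feature of any DNSSEC tool: the function name old_ksk_removable and its arguments are hypothetical, and the caller is assumed to have recorded when the new DS was first confirmed at the parent.

```shell
#!/usr/bin/env bash
# Hypothetical gate: succeeds only when the old KSK can be removed.
# Arguments: time the new DS was first confirmed at the parent (epoch seconds),
# the parent's DS TTL (seconds), a safety margin (seconds), current time (epoch seconds).
old_ksk_removable() {
    local ds_confirmed_at="$1" ds_ttl="$2" safety="$3" now="$4"
    # Any resolver that cached the old DS set has discarded it by this time.
    local earliest=$(( ds_confirmed_at + ds_ttl + safety ))
    if (( now >= earliest )); then
        echo "OK: old KSK removable since ${earliest}"
        return 0
    fi
    echo "WAIT: $(( earliest - now ))s until the old DS has expired from caches"
    return 1
}
```

A rollover script could call old_ksk_removable "${DS_SEEN}" 86400 3600 "$(date +%s)" and refuse to continue on a non-zero exit status, where DS_SEEN is whatever timestamp the operator stored when the parent first served the new DS.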
What Non-Validatable Signatures Look Like at the Wire Level
The DENIC failure mode was a mismatch between the signing key and the published DNSKEY RRset. In RFC 4034 terms, RRSIG validation requires that:
- The RRSIG Signer's Name field matches the zone apex
- The RRSIG Key Tag field matches the Key Tag computed from the DNSKEY RR
- The cryptographic signature in the RRSIG verifies against the public key in the matching DNSKEY RR
When DENIC’s script published the new ZSK as the only key in the DNSKEY RRset but signed the DNSKEY RRset with the old ZSK private key, the RRSIG Key Tag field referenced a key that was no longer present in the DNSKEY RRset. Every resolver that attempted to validate the DNSKEY RRset found no matching DNSKEY record for the RRSIG’s Key Tag and returned SERVFAIL.
You can observe this failure mode directly with dig:
# Query for DNSKEY RRset with DNSSEC
dig +dnssec +multiline DNSKEY de. @1.1.1.1
# The RRSIG record will show the Key Tag of the signing key
# If that Key Tag does not appear in any DNSKEY record in the answer,
# the zone is broken in exactly the DENIC failure mode
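# Key Tag referenced by each RRSIG over the DNSKEY RRset (field 11 in one-line output)
dig +dnssec +noall +answer DNSKEY de. | awk '$4 == "RRSIG" {print "RRSIG key tag:", $11}'
# Key Tags of the published DNSKEY records (dig +multiline prints them as "key id" comments)
dig +noall +answer +multiline DNSKEY de. | grep -o 'key id = [0-9]*'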
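The Key Tag comparison can also be done without trusting dig's annotations. The sketch below computes the Key Tag from DNSKEY fields using the checksum in RFC 4034 Appendix B and checks whether an RRSIG's Key Tag is among the published ones. The function names are hypothetical, the checksum does not apply to algorithm 1 (RSA/MD5), and the code assumes base64 and od from a GNU-style userland.

```shell
#!/usr/bin/env bash
# Key Tag per RFC 4034 Appendix B (not valid for algorithm 1, RSA/MD5).
# Arguments: DNSKEY flags, protocol, algorithm, base64 public key.
dnskey_key_tag() {
    local flags="$1" protocol="$2" algorithm="$3" pubkey_b64="$4"
    local ac=0 i=0 byte
    # DNSKEY RDATA: 2-byte flags, 1-byte protocol, 1-byte algorithm, key bytes.
    local rdata="$(( (flags >> 8) & 0xFF )) $(( flags & 0xFF )) ${protocol} ${algorithm}"
    rdata+=" $(printf '%s' "${pubkey_b64}" | tr -d ' \n' | base64 -d | od -An -v -tu1)"
    for byte in ${rdata}; do
        # Even-indexed bytes are the high octet of each 16-bit word.
        if (( i % 2 == 0 )); then
            ac=$(( ac + (byte << 8) ))
        else
            ac=$(( ac + byte ))
        fi
        i=$(( i + 1 ))
    done
    ac=$(( ac + ((ac >> 16) & 0xFFFF) ))
    echo $(( ac & 0xFFFF ))
}

# Succeeds if the RRSIG Key Tag ($1) equals one of the DNSKEY Key Tags ($2...).
rrsig_tag_published() {
    local wanted="$1" tag
    shift
    for tag in "$@"; do
        [[ "${tag}" == "${wanted}" ]] && return 0
    done
    return 1
}
```

Because the flags field is part of the checksum input, the same public key yields a different Key Tag when published with flags 257 instead of 256. A pre-publication gate could compute the tag of every DNSKEY in the new zone and fail if rrsig_tag_published returns non-zero for any RRSIG.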
Other non-validatable signature conditions:
- Expired RRSIG: the RRSIG Signature Expiration field is in the past. Resolvers treat the data as bogus and return SERVFAIL.
- Wrong DNSKEY flags: a ZSK published with the SEP (Secure Entry Point) flag set (flag value 257) when it was generated with flag value 256. The Key Tag is computed over the full DNSKEY RDATA, including the flags field, so the published key no longer matches the Key Tag in the RRSIGs.
- Algorithm mismatch: signing with ECDSA P-256 (algorithm 13) but publishing an RSA key (algorithm 8) in the DNSKEY RRset.
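The expiry condition is easy to test mechanically, because RRSIG timestamps in presentation format are fixed-width UTC values of the form YYYYMMDDHHmmSS. The sketch below uses a hypothetical function name and also rejects signatures whose inception time lies in the future, which validators likewise treat as invalid.

```shell
#!/usr/bin/env bash
# Succeeds if an RRSIG is outside its validity window at the given time.
# All three arguments use the RRSIG presentation format YYYYMMDDHHmmSS (UTC).
rrsig_outside_validity() {
    local inception="$1" expiration="$2" now="$3"
    # Fixed-width timestamps compare correctly as base-10 integers.
    (( 10#${now} < 10#${inception} || 10#${now} > 10#${expiration} ))
}
```

In a live check the third argument would come from date -u +%Y%m%d%H%M%S, and the first two from fields 10 and 9 of a one-line RRSIG record as printed by dig.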
Pre-Rollover Validation
The .de outage would have been prevented by a single validation step run against the new signed zone before publication. Both major toolchains provide this.
ldns-verify-zone performs offline validation of a signed zone file:
# Verify zone file signatures against the keys in the zone
ldns-verify-zone /etc/bind/signed/db.example.com.signed
# Expected output for a valid zone:
# Zone is verified and complete
# For a broken zone (DENIC scenario):
# Zone verification failed: No matching DNSKEY found for RRSIG
# Include checking DS records against parent
ldns-verify-zone -k /etc/bind/keys/ds-parent.key \
/etc/bind/signed/db.example.com.signed
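dnssec-signzone (BIND9 tooling) verifies the signed zone after signing unless -P is given, and dnssec-verify checks an existing signed zone file:
# Sign the zone
dnssec-signzone -A -3 $(head -c 16 /dev/urandom | xxd -p) \
-N INCREMENT \
-o example.com \
-t \
/etc/bind/zones/db.example.com
# Structural check of the signed output (this does not validate signatures)
named-checkzone -i full -k fail \
example.com /etc/bind/zones/db.example.com.signed
# Cryptographic verification of the signed zone against its DNSKEY RRset
dnssec-verify -o example.com /etc/bind/zones/db.example.com.signed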
# PowerDNS equivalent
pdnsutil check-zone example.com
pdnsutil rectify-zone example.com
Automated pre-publication gate: Wrap these checks in a script that must exit 0 before any zone publication can proceed:
#!/usr/bin/env bash
# pre-publish-validate.sh — exit non-zero to abort rollover
set -euo pipefail
ZONE_FILE="${1:?Usage: $0 <signed-zone-file> <zone-name>}"
ZONE_NAME="${2:?}"
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Validating ${ZONE_FILE} for ${ZONE_NAME}"
# 1. Structural check
if ! named-checkzone "${ZONE_NAME}" "${ZONE_FILE}"; then
echo "ABORT: named-checkzone failed" >&2
exit 1
fi
# 2. DNSSEC signature validation
if ! ldns-verify-zone "${ZONE_FILE}"; then
echo "ABORT: ldns-verify-zone failed — signatures do not validate" >&2
exit 1
fi
# 3. Check RRSIG expiry — reject zones with signatures expiring within 7 days
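# Match on the record type field so NSEC type bitmaps that mention RRSIG are ignored;
# computing the minimum in awk avoids a SIGPIPE failure from head under pipefail
EXPIRY=$(ldns-read-zone "${ZONE_FILE}" | awk '$4 == "RRSIG" && (min == "" || $9 < min) {min = $9} END {print min}')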
NOW_EPOCH=$(date +%s)
EXPIRY_EPOCH=$(date -d "${EXPIRY:0:4}-${EXPIRY:4:2}-${EXPIRY:6:2}T${EXPIRY:8:2}:${EXPIRY:10:2}:${EXPIRY:12:2}Z" +%s 2>/dev/null || echo 0)
DAYS_REMAINING=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
if [[ ${DAYS_REMAINING} -lt 7 ]]; then
echo "ABORT: Earliest RRSIG expires in ${DAYS_REMAINING} days — too short" >&2
exit 1
fi
echo "PASS: Zone ${ZONE_NAME} validates cleanly. Earliest RRSIG expiry: ${DAYS_REMAINING} days"
exit 0
Continuous DNSSEC Health Monitoring
Static pre-publication checks are necessary but not sufficient. You need continuous monitoring that fires before human operators notice something is wrong.
Nagios/Icinga — check_dns_dnssec plugin:
# Install check_dns_dnssec (from monitoring-plugins-dnssec or similar)
# Basic check for zone validation status
# -w: warn if RRSIG expires within 7 days; -c: critical if within 2 days
# (comments cannot follow a line-continuation backslash, so they go here)
/usr/lib/nagios/plugins/check_dns_dnssec \
-H example.com \
-s 8.8.8.8 \
-w 7 \
-c 2
# Icinga2 service object
object Service "dnssec-validity" {
host_name = "dns-monitoring"
check_command = "check_dns_dnssec"
vars.dnssec_domain = "example.com"
vars.dnssec_resolver = "8.8.8.8"
vars.dnssec_warn = 7
vars.dnssec_crit = 2
check_interval = 5m
max_check_attempts = 2
}
Zabbix DNSSEC template — key items:
# zabbix-dnssec-template.yaml (excerpt)
items:
  - name: "DNSSEC validation status"
    key: "external.dnssec.validate[{$ZONE},{$RESOLVER}]"
    type: EXTERNAL
    delay: 5m
    value_type: INTEGER
    # 0 = validates, 1 = BOGUS, 2 = SERVFAIL, 3 = timeout
  - name: "DNSSEC RRSIG minimum expiry days"
    key: "external.dnssec.rrsig_expiry_days[{$ZONE}]"
    type: EXTERNAL
    delay: 1h
    value_type: FLOAT
triggers:
  - name: "DNSSEC validation BOGUS for {$ZONE}"
    expression: "last(/template/external.dnssec.validate[{$ZONE},{$RESOLVER}])=1"
    severity: DISASTER
    manual_close: false
  - name: "DNSSEC RRSIG expiring within 3 days for {$ZONE}"
    expression: "last(/template/external.dnssec.rrsig_expiry_days[{$ZONE}])<3"
    severity: HIGH
Prometheus + Alertmanager:
# prometheus-dnssec-rules.yaml
groups:
  - name: dnssec
    interval: 2m
    rules:
      - alert: DNSSECValidationBogus
        expr: dns_dnssec_validation_status{zone=~".+"} == 1
        for: 1m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "DNSSEC validation BOGUS for {{ $labels.zone }}"
          description: >
            Zone {{ $labels.zone }} is returning BOGUS from resolver
            {{ $labels.resolver }}. Check for key rollover issues.
            Rollback procedure: https://runbooks.internal/dnssec-rollback
      - alert: DNSSECRRSIGExpiringSoon
        expr: dns_rrsig_expiry_days{zone=~".+"} < 7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DNSSEC RRSIG expiring in {{ $value }} days for {{ $labels.zone }}"
Staging Environment Design
Every DNSSEC rollover procedure should be tested in a staging environment that mirrors production timing and resolver behaviour before being run in production.
# Staging environment: BIND9 authoritative + Unbound validating resolver
# in separate network namespaces or containers
# 1. Start an isolated validating resolver for testing
cat > /tmp/unbound-test.conf << 'EOF'
server:
interface: 127.0.0.2
port: 5353
do-ip4: yes
do-udp: yes
verbosity: 2
val-log-level: 2
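auto-trust-anchor-file: "/var/lib/unbound/root.key"
# Unbound does not send queries to localhost by default; the stub below needs this
do-not-query-localhost: no
# If the staging zone's DS is not published in the real parent, also configure
# a trust anchor for the staging KSK, otherwise validation cannot succeed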
# Override .example.com to use local authoritative
stub-zone:
name: "example.com."
stub-addr: 127.0.0.1@5354
EOF
unbound -c /tmp/unbound-test.conf
# 2. Run rollover against local BIND9 authoritative
# 3. Verify from the test resolver
dig +dnssec @127.0.0.2 -p 5353 example.com DNSKEY
dig +dnssec @127.0.0.2 -p 5353 www.example.com A
# Expected: AD (Authenticated Data) bit set in all responses
# Failure: SERVFAIL — stops the staging test before production is touched
BIND9 dnssec-policy for Automated Safe Rollover
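BIND9 (9.16 and later) includes dnssec-policy, which manages the key lifecycle automatically and enforces rollover timing. It rolls ZSKs with the pre-publish method and KSKs with the double-KSK method, deriving each wait from the TTL and delay parameters configured in the policy.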
# /etc/bind/named.conf.options (excerpt)
dnssec-policy "production" {
keys {
# KSK: 2048-bit RSA, rollover every 2 years
ksk key-directory lifetime P2Y algorithm rsasha256 2048;
# ZSK: 2048-bit RSA, rollover every 90 days
zsk key-directory lifetime P90D algorithm rsasha256 2048;
};
# DNSKEY RRset TTL — keep short for fast recovery
dnskey-ttl 300;
# Maximum zone TTL affects how long old signatures must be retained
max-zone-ttl 3600;
# Signature validity period and refresh threshold
signatures-validity 14d;
signatures-refresh 5d;
# Require parent DS before completing KSK rollover
# (named will wait and re-check, not proceed automatically)
parent-ds-ttl 86400;
parent-propagation-delay 1h;
zone-propagation-delay 5m;
publish-safety 1h;
retire-safety 1h;
};
zone "example.com" {
type primary;
file "/etc/bind/zones/db.example.com";
dnssec-policy "production";
key-directory "/etc/bind/keys/example.com";
inline-signing yes;
};
With inline-signing yes, BIND9 maintains an unsigned zone file and a separate signed zone in memory. The signed zone is updated automatically by named without a separate dnssec-signzone invocation, eliminating the class of bugs where the wrong tool version or wrong key path is used.
PowerDNS AutoDNSSEC with Staged Rollover
# Enable DNSSEC for a zone (the pdnsutil command for this is secure-zone)
pdnsutil secure-zone example.com
# Add keys — PowerDNS creates them in the database
pdnsutil add-zone-key example.com zsk active 2048 rsasha256
pdnsutil add-zone-key example.com ksk active 2048 rsasha256
# Check current key state before any rollover
pdnsutil show-zone example.com
# Staged ZSK rollover (pre-publish method)
# Step 1: Generate and publish new ZSK (inactive — not yet signing)
pdnsutil add-zone-key example.com zsk inactive 2048 rsasha256
# Note the new key ID from output, e.g. 7
# Step 2: Wait for DNSKEY TTL (300s) to expire across resolvers
sleep 300
# Step 3: Activate new ZSK, deactivate old ZSK
pdnsutil activate-zone-key example.com 7
pdnsutil deactivate-zone-key example.com 3 # old key ID
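# Step 4: Validate the zone as the primary now serves it, before notifying secondaries
# (export-zone-dnskey emits a single DNSKEY, which is not enough to verify a zone;
# this assumes AXFR from localhost is permitted, so adjust the address for your setup)
dig @127.0.0.1 example.com AXFR +onesoa | ldns-verify-zone /dev/stdin || {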
echo "ABORT: Validation failed, rolling back"
pdnsutil activate-zone-key example.com 3
pdnsutil deactivate-zone-key example.com 7
exit 1
}
# Step 5: Increase serial to trigger zone transfer to secondaries
pdnsutil increase-serial example.com
# Step 6: After another DNSKEY TTL, remove old inactive ZSK
sleep 300
pdnsutil remove-zone-key example.com 3
Canary Resolver Checks as a Deployment Gate
After publishing new key material but before decommissioning old keys, query from multiple geographic vantage points to confirm that validation succeeds from external resolvers.
#!/usr/bin/env bash
# canary-check.sh — run after key publication, before old key removal
ZONE="${1:?provide zone name}"
RESOLVERS=(
"1.1.1.1" # Cloudflare
"8.8.8.8" # Google
"9.9.9.9" # Quad9
"208.67.222.222" # OpenDNS
)
FAIL=0
for resolver in "${RESOLVERS[@]}"; do
echo -n "Checking ${ZONE} DNSKEY from ${resolver}... "
# dig returns exit code 0 even on SERVFAIL; check the status line
result=$(dig +dnssec +time=5 +tries=2 \
@"${resolver}" "${ZONE}" DNSKEY 2>&1)
status=$(echo "${result}" | awk '/^;; ->>HEADER<<-/ {print $6}' | tr -d ',')
ad_bit=$(echo "${result}" | grep -c 'flags:.*\bad\b' || true)
if [[ "${status}" != "NOERROR" ]]; then
echo "FAIL (status=${status})"
FAIL=1
elif [[ "${ad_bit}" -eq 0 ]]; then
echo "FAIL (NOERROR but no AD bit — not validating)"
FAIL=1
else
echo "OK (NOERROR + AD bit)"
fi
done
if [[ ${FAIL} -ne 0 ]]; then
echo ""
echo "CANARY FAILED: Do not remove old keys. Initiate rollback."
exit 1
fi
echo ""
echo "All canary resolvers validate. Safe to proceed."
exit 0
Automated Rollback Triggering
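The rollback trigger must fire within minutes of publication. Validating resolvers cache failure state on their own timers: RFC 2308 caps SERVFAIL caching at five minutes, and implementations keep bogus results for a configurable period. Every minute the broken zone stays published adds resolvers that will keep failing after the fix, and even a correct rollback does not restore service for a given resolver until its cached failure expires.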
#!/usr/bin/env bash
# rollover-with-watchdog.sh
ZONE="${1:?}"
OLD_ZONE_SIGNED="${2:?}" # path to pre-rollover signed zone backup
NEW_ZONE_SIGNED="${3:?}" # path to new signed zone file
VALIDATION_TIMEOUT=120 # seconds to wait before checking
CHECK_RESOLVER="1.1.1.1"
PUBLISH_CMD="rndc reload ${ZONE}" # or equivalent for your stack
validate_zone() {
local zone="${1}"
local resolver="${2}"
local result
result=$(dig +dnssec +time=5 +tries=1 \
@"${resolver}" "${zone}" SOA 2>&1)
local status
status=$(echo "${result}" | awk '/^;; ->>HEADER<<-/ {print $6}' | tr -d ',')
local ad
ad=$(echo "${result}" | grep -c 'flags:.*\bad\b' || true)
[[ "${status}" == "NOERROR" && "${ad}" -gt 0 ]]
}
# Back up current signed zone
cp /etc/bind/zones/db."${ZONE}".signed "${OLD_ZONE_SIGNED}"
# Publish new signed zone
cp "${NEW_ZONE_SIGNED}" /etc/bind/zones/db."${ZONE}".signed
${PUBLISH_CMD}
echo "New zone published at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Waiting ${VALIDATION_TIMEOUT}s before canary check..."
sleep "${VALIDATION_TIMEOUT}"
if validate_zone "${ZONE}" "${CHECK_RESOLVER}"; then
echo "Validation OK — rollover successful"
else
echo "VALIDATION FAILED — initiating automatic rollback at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
cp "${OLD_ZONE_SIGNED}" /etc/bind/zones/db."${ZONE}".signed
${PUBLISH_CMD}
sleep 30
if validate_zone "${ZONE}" "${CHECK_RESOLVER}"; then
echo "Rollback successful — zone validating again"
exit 1 # rollover failed, rollback succeeded
else
echo "CRITICAL: Rollback also failed — manual intervention required"
# Page on-call immediately
curl -s -X POST "${PAGERDUTY_WEBHOOK}" \
-H 'Content-Type: application/json' \
-d "{\"event_action\":\"trigger\",\"payload\":{\"summary\":\"DNSSEC rollback failed for ${ZONE} — manual action required\",\"severity\":\"critical\"}}"
exit 2
fi
fi
Key TTL Strategy
| TTL Parameter | Recommended Value | Rationale |
|---|---|---|
| DNSKEY TTL | 300–900 seconds | Determines how long resolvers cache key material; shorter = faster rollback recovery. BIND9 dnssec-policy default is 1 hour — reduce this for zones where rollover risk is high. |
| RRSIG signature validity | 14 days | Provides a 14-day window to catch expiry before it becomes a live issue. |
| RRSIG re-sign threshold | 5 days before expiry | Re-signing with 5 days remaining gives time for monitoring to catch failures before expiry. |
| DS TTL at parent | Registry-controlled; typically 86400s | You cannot control this. Budget 24–48 hours for DS propagation during KSK rollover. |
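| Negative cache TTL (SOA MINIMUM) | 300 seconds | Governs how long NXDOMAIN and NODATA answers are cached. SERVFAIL caching is controlled by the resolver (capped at 5 minutes by RFC 2308), not by this value. Set this in the unsigned zone before signing. |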
; /etc/bind/zones/db.example.com
$TTL 3600
@ IN SOA ns1.example.com. admin.example.com. (
2026050901 ; serial
3600 ; refresh
900 ; retry
604800 ; expire
300 ; minimum (negative cache TTL)
)
Expected Behaviour
The following table describes what resolvers should see at each step of a correctly executed ZSK pre-publish rollover, and what monitoring should observe:
| Rollover Step | DNSKEY RRset Contents | RRSIG Signed With | Resolver Validation Status | Correct Monitoring State | Alert if Wrong |
|---|---|---|---|---|---|
| Pre-rollover (baseline) | Old ZSK only | Old ZSK | NOERROR + AD bit | RRSIG expiry: N days, validation OK | Alert if AD bit absent or RRSIG < 7 days |
| New ZSK published (inactive) | Old ZSK + New ZSK | Old ZSK (signing both keys) | NOERROR + AD bit | RRSIG expiry: N days, 2 DNSKEY RRs | Alert if any resolver returns SERVFAIL |
| DNSKEY TTL wait period | Old ZSK + New ZSK | Old ZSK | NOERROR + AD bit | Same as above | Alert if canary resolvers drop AD bit |
| Switch to new ZSK for signing | Old ZSK + New ZSK | New ZSK (double-sig method: both) | NOERROR + AD bit | Both keys published, new key now signing | Alert if SERVFAIL from any canary resolver |
| Old ZSK removed | New ZSK only | New ZSK | NOERROR + AD bit | Single DNSKEY RR, new key | Alert on SERVFAIL — rollback required immediately |
| DENIC failure mode | New ZSK only (old removed too early) | Old ZSK private key | SERVFAIL (bogus) | Key Tag mismatch — BOGUS | Disaster alert: immediate rollback |
| Successful rollback | Old ZSK only | Old ZSK | NOERROR + AD bit | Baseline restored | Verify AD bit restored within DNSKEY TTL + propagation delay |
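The "DNSKEY RRset Contents" column can be turned into a monitoring assertion. The sketch below is one possible shape for it: the step names are hypothetical labels for the table rows, and count_zsks expects one-line answer text such as dig +noall +answer produces.

```shell
#!/usr/bin/env bash
# Counts ZSKs (DNSKEY records with flags 256) in one-line answer text on stdin.
count_zsks() {
    awk '$4 == "DNSKEY" && $5 == 256 { n++ } END { print n + 0 }'
}

# Hypothetical step names mapped to the ZSK count the table expects.
expected_zsk_count() {
    case "$1" in
        baseline|old-removed|rolled-back) echo 1 ;;
        new-published|ttl-wait|signing-switched) echo 2 ;;
        *) echo "unknown step: $1" >&2; return 1 ;;
    esac
}
```

A rollover script might compare "$(dig +noall +answer DNSKEY example.com @1.1.1.1 | count_zsks)" with "$(expected_zsk_count ttl-wait)" and stop if they differ. The count alone does not prove the signatures validate, so it complements the AD-bit check rather than replacing it.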
Trade-offs
| Approach | Advantage | Disadvantage | When to Choose |
|---|---|---|---|
| Fully automated rollover (dnssec-policy / AutoDNSSEC) | Consistent execution; no human error on individual steps; runs during off-hours | Automation bug affects all zones simultaneously; harder to add out-of-band validation gates | Zones with well-tested tooling, staging env identical to prod, automated canary checks |
| Manual rollover with checklist | Full human review of each step; explicit sign-off before proceeding | Procedural error likely under time pressure; slower; night/weekend oncall required | High-value zones (KSK rollover), first rollover of a new zone, after any tooling change |
| Short DNSKEY TTL (300s) | Fast recovery: resolvers fetch new keys within 5 minutes of rollback | Higher query volume to authoritative servers; amplification risk if zone is large | Production zones where availability SLA is strict and rollback speed matters |
| Long DNSKEY TTL (3600s+) | Fewer queries; lower authoritative load | Rollback takes up to TTL + propagation to take effect everywhere | Stable zones with infrequent rollovers and relaxed recovery time objectives |
| Canary resolver gate before old key removal | Catches validation failures before they become unrecoverable | Adds delay to rollover procedure; canary resolvers may not represent all resolver behaviour | All production rollovers — the delay cost is always worth the safety |
| Skip canary gate for speed | Rollover completes faster | Validation failure from all resolvers before detection; recovery requires waiting for negative TTL to expire | Never — there is no legitimate speed requirement that outweighs this risk |
| Separate staging environment | Catches tooling bugs before production | Requires maintaining a parallel DNS environment; staging drift can give false confidence | Any operator running DNSSEC on more than one zone — the overhead is low relative to the risk |
Failure Modes
| Failure Mode | Trigger | Resolver Behaviour | Detection Time | Recovery Action | Recovery Time |
|---|---|---|---|---|---|
| RRSIG signed with wrong key (DENIC scenario) | Rollover script uses stale key reference; inline signing misconfiguration | Immediate SERVFAIL for all validating resolvers | Within one monitoring polling interval (2–5 minutes) | Revert to previous signed zone; reload authoritative | DNSKEY TTL + propagation (300–900s if TTL is short) |
| New DS not propagated to parent before KSK removal | KSK rollover proceeds without confirming parent DS update | Resolvers that have cached old DS record return SERVFAIL; non-validating resolvers unaffected | Within parent DS TTL (typically 24 hours) — depends on resolver cache state | Re-add old KSK; wait for parent to publish new DS before retrying | 24–48 hours for DS propagation |
| Rollback slow due to long DNSKEY TTL | DNSKEY TTL set to 3600s or higher | Even after correct zone is published, resolvers serve stale BOGUS validation for up to TTL | Immediate — monitoring sees rollback published but validation still failing | No short-term fix; negative TTL also applies; wait for TTL expiry | Up to DNSKEY TTL (1–24 hours depending on configuration) |
| Monitoring alert fires after resolvers already caching negative responses | Polling interval too long (15+ minutes); alert threshold too conservative | SERVFAIL responses already cached for negative TTL; alert arrives too late to prevent propagation | Alert fires 15+ minutes after outage starts; caching has occurred | Roll back immediately; wait for negative cache TTL; consider contacting major resolver operators | SERVFAIL TTL (typically 300s) per resolver — uncontrollable once cached |
| Algorithm rollover misconfiguration | Zone signed with new algorithm but old algorithm DS still at parent | Validation fails for resolvers that implement strict algorithm policy | Seconds after publication | Ensure zone is signed with old algorithm during transition; do not remove old algorithm DNSKEY until new DS propagates | DS propagation time (24–48 hours) |
| Zone file rollback overwrites good zone with bad backup | Rollback script uses wrong backup path or backup itself was corrupted | SERVFAIL continues or worsens after rollback attempt | Seconds after rollback attempt (monitoring still shows failure) | Reconstruct zone from key backups; re-sign with ldns or dnssec-signzone from unsigned zone | Manual reconstruction: 30–120 minutes |
| Inline signing daemon crash during rollover | named or pdns crashes mid-rollover leaving partially signed zone | May serve mixed valid/invalid signatures; resolver behaviour unpredictable | Seconds to minutes | Restart signing daemon; force zone re-sign; validate with ldns-verify-zone before reload | Service restart + re-sign: 5–15 minutes |