Security SLOs and Error Budgets: SRE Discipline Applied to Detection and Response
Problem
Engineering organizations adopted SRE-style SLOs (service-level objectives) and error budgets a decade ago. Reliability became measurable; teams justified investment based on burn rate; on-call rotations had clear targets.
Security has lagged. The standard “metrics” — vulnerability counts, patch SLA, audit findings — are activity, not outcomes. They tell you nothing about whether the security program is working. The questions that matter — “are we detecting attacks fast enough?”, “are we responding fast enough?”, “is the program improving?” — get qualitative, post-hoc answers.
By 2026 the practice is maturing: security teams are adopting SLO discipline. The framework: pick a small set of measurable Service Level Indicators (SLIs), set Service Level Objectives (SLOs) defining acceptable thresholds, and track burn rate against each SLO. When the burn rate threatens to exhaust the error budget, the team prioritizes fixing the underlying gap.
The specific gaps in non-SLO security programs:
- “We detect attacks” — how fast? what fraction?
- “We respond to incidents” — within what time?
- “Coverage is comprehensive” — measured how?
- “We’re getting better” — compared to what?
This article covers SLI selection for security, defining realistic SLO targets, computing burn rate over rolling windows, operational integration with engineering management (where SLO violations drive prioritization), and the failure modes — chasing the wrong metric, tying compensation to easily gamed numbers.
Target systems: Prometheus / Grafana for metrics; SLI sources: SIEM, alerting platform, ticketing system, drill-results database. Alerting via Pyrra, Sloth, or hand-rolled error-budget alerting.
Threat Model
Unlike a typical threat model, the “adversary” here is the security gap that emerges when the program lacks measurable accountability:
- Adversary 1 — Steady decay: the program ships features, accumulates debt, declines in effectiveness. Without measurement, decline is invisible until breach.
- Adversary 2 — Whack-a-mole prioritization: every quarter is a new shiny project; baseline operations (rule maintenance, drill repetition) atrophy.
- Adversary 3 — Activity-vs-outcome confusion: the program produces lots of activity (rules written, dashboards built, vulnerabilities patched) without measurable outcome improvement.
- Access level: none required; these failure modes are internal, rooted in how the program holds itself accountable.
- Objective: the bad outcome is silent degradation, visible only when an incident reveals what the program failed to prevent.
- Blast radius: unbounded; an under-performing security program faces the same external threats with less effective defense.
Configuration
Step 1: Pick SLIs (Service Level Indicators)
Three categories produce SLIs that are both measurable and meaningful:
Detection SLIs:
- Detection coverage: fraction of MITRE ATT&CK techniques relevant to your environment that have at least one detection rule with a true-positive rate (TPR) above a defined threshold.
- Detection latency: time from event-generation to alert-fired.
- MTTD (mean time to detect): time from compromise event to first detection — measured per incident, retrospectively.
Response SLIs:
- Time to acknowledge (TTA): alert fired → analyst acknowledges.
- Time to first action (TTFA): alert fired → first investigative action.
- Time to containment: alert fired → blast radius limited.
- Time to resolution: alert fired → closed.
Program-health SLIs:
- Detection-rule decay rate: percent of rules failing weekly fixture tests.
- Drill cadence compliance: drills run vs. scheduled.
- Threat-model freshness: percent of services with TM reviewed within 365 days.
Pick 4-7 SLIs total. More than that becomes ceremony rather than instrument.
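One lightweight way to keep the chosen set explicit and reviewable is a small checked-in registry. This sketch and its field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLI:
    name: str
    category: str  # "detection", "response", or "program-health"
    unit: str      # e.g. "seconds", "ratio"
    source: str    # system the raw measurements come from

# One entry per SLI the team commits to tracking.
REGISTRY = [
    SLI("detection_latency_p95", "detection", "seconds", "SIEM pipeline metrics"),
    SLI("time_to_acknowledge_p99", "response", "seconds", "alerting platform"),
    SLI("rule_test_pass_ratio", "program-health", "ratio", "CI fixture tests"),
    SLI("threat_model_freshness", "program-health", "ratio", "service catalog"),
]

assert 4 <= len(REGISTRY) <= 7, "keep the SLI set small enough to act on"
```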
Step 2: Define SLOs
For each SLI, pick a threshold the team commits to. Realistic, not aspirational.
```yaml
# slos.yaml — checked into the security-team repo.
slos:
  - name: "P1 Detection Latency"
    description: "Critical detection rules fire within 60 seconds of event generation."
    sli:
      type: histogram_p95
      query: 'histogram_quantile(0.95, sum by (le) (rate(detection_pipeline_latency_seconds_bucket{severity="critical"}[5m])))'
    objective: 60  # seconds
    period: 30d
    error_budget_minutes: 21.6  # 0.05% of 30d, computed below

  - name: "Alert Acknowledgement Time"
    description: "Critical alerts acknowledged by on-call within 5 minutes p99."
    sli:
      type: histogram_p99
      query: 'histogram_quantile(0.99, sum by (le) (rate(alert_acknowledge_seconds_bucket{severity="critical"}[5m])))'
    objective: 300  # seconds
    period: 30d

  - name: "Detection-Rule Test Coverage"
    description: "All active detection rules pass weekly fixture tests."
    sli:
      type: ratio
      good: 'detection_rule_tests_total{result="pass"}'
      total: 'detection_rule_tests_total'
    objective: 0.99
    period: 7d

  - name: "Threat-Model Freshness"
    description: "Tier-1 services have threat models reviewed within 365 days."
    sli:
      type: ratio
      good: 'count(service_threat_model_age_days{tier="1"} < 365)'
      total: 'count(service_threat_model_age_days{tier="1"})'
    objective: 0.95
    period: 30d
```
The error budget for “P1 Detection Latency” works out as follows: with a 99.95% target, 0.05% of measurements in a 30-day window may exceed 60s. At one measurement per minute (43,200 measurements per 30 days), the budget is 21.6 violating measurements, i.e. 21.6 minutes of breach.
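That computation generalizes to any window, target, and sampling rate; a minimal sketch (the function name is mine):

```python
def error_budget(window_days: float, target: float,
                 measurements_per_minute: float = 1.0) -> float:
    """Measurements allowed to violate the objective within the window."""
    total = window_days * 24 * 60 * measurements_per_minute
    return total * (1.0 - target)

# 30-day window, 99.95% target, one measurement per minute:
# 43,200 total measurements, 0.05% of which may breach -> ~21.6
budget = error_budget(30, 0.9995)
```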
Step 3: Burn Rate Alerting
A 30-day SLO with rapid burn means the team is on track to violate it. Burn rate = (SLO violation rate observed in window) / (acceptable steady-state rate).
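Expressed directly (function names are illustrative): a burn rate of 1.0 exhausts the budget exactly at the end of the window; anything higher exhausts it early.

```python
def burn_rate(observed_bad_fraction: float, slo_target: float) -> float:
    """Budget consumption relative to the steady-state allowance."""
    allowed_bad_fraction = 1.0 - slo_target
    return observed_bad_fraction / allowed_bad_fraction

def hours_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, hours until the window's budget is gone."""
    return (window_days * 24) / rate

# 1% of P1 detections breaching a 99.95% SLO burns at ~20x:
rate = burn_rate(0.01, 0.9995)
remaining = hours_to_exhaustion(rate)  # ~36 of the window's 720 hours
```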
```yaml
# alerting-rules.yaml
groups:
  - name: security-slo-burnrate
    rules:
      - alert: HighBurnRateP1Detection
        expr: |
          (
            sum(rate(detection_pipeline_latency_seconds_bucket{severity="critical",le="60"}[1h]))
            /
            sum(rate(detection_pipeline_latency_seconds_count{severity="critical"}[1h]))
          ) < 0.99  # less than 99% within the latency objective over 1h
          and
          (
            sum(rate(detection_pipeline_latency_seconds_bucket{severity="critical",le="60"}[5m]))
            /
            sum(rate(detection_pipeline_latency_seconds_count{severity="critical"}[5m]))
          ) < 0.99
        for: 5m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "P1 detection latency burning budget at high rate"
          description: |
            In the last 1h, more than 1% of P1 detections breached the 60s latency SLO.
            At this rate, the 30-day budget will be exhausted in {{ ... }}.
```
The dual-window approach (1h + 5m) catches both sudden bursts and sustained drift. Pyrra and Sloth automate this rule generation from the SLO config.
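The same dual-window logic can be unit-tested outside Prometheus before the rule ships; a sketch mirroring the 0.99 threshold above:

```python
def should_alert(good_ratio_1h: float, good_ratio_5m: float,
                 threshold: float = 0.99) -> bool:
    """Page only when the long window shows sustained burn AND the
    short window confirms the problem is still happening now."""
    return good_ratio_1h < threshold and good_ratio_5m < threshold

# Sustained breach: both windows below threshold -> page.
assert should_alert(0.985, 0.970) is True
# Recovered burst: long window still bad, short window healthy -> stay quiet.
assert should_alert(0.985, 0.999) is False
```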
Step 4: Incident-Driven SLI Updates
After every incident, compute its impact on SLIs:
```python
# update-slo-from-incident.py
# Assumes two environment-specific helpers: parse_iso() for timestamp
# parsing, and a `metrics` client exposing histogram/counter emitters.

def update_mttd_for_incident(incident):
    """Compute MTTD from an incident's retrospective and update the SLI dataset."""
    compromise_time = parse_iso(incident["compromise_evidence_earliest_timestamp"])
    first_alert = parse_iso(incident["first_alert_fired_at"])
    mttd_seconds = (first_alert - compromise_time).total_seconds()
    metrics.histogram("incident_mttd_seconds",
                      labels={"severity": incident["severity"]},
                      value=mttd_seconds)
    # Also: did any pre-incident detection rules fire that, in retrospect,
    # should have escalated?
    if incident.get("detection_rules_fired_in_window"):
        metrics.counter("incident_rule_fired_underranked_total",
                        labels={"severity": incident["severity"]}).inc()
```
The SLI is the histogram of incident-MTTDs over time. SLO violations (MTTD > target for too many incidents) trigger backlog work.
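Evaluating that SLO over a window of retrospectives is then a percentile over the recorded values; a sketch using the nearest-rank method:

```python
import math

def mttd_p95_hours(mttd_seconds: list[float]) -> float:
    """p95 of per-incident MTTD in hours (nearest-rank percentile)."""
    if not mttd_seconds:
        raise ValueError("no incidents in window")
    ordered = sorted(mttd_seconds)
    rank = math.ceil(0.95 * len(ordered)) - 1  # 0-based index
    return ordered[rank] / 3600

# Ten incidents, one slow outlier: the p95 lands on the outlier (24h),
# so an "MTTD p95 under 1 hour" SLO is violated.
samples = [600, 900, 1200, 1500, 1800, 2100, 2400, 2700, 3000, 86400]
violated = mttd_p95_hours(samples) > 1.0
```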
Step 5: Dashboards That Drive Action
Per-SLO dashboard with three things: current burn rate, time-to-budget-exhaustion projection, top contributing factors.
```
P1 Detection Latency SLO (objective: <60s p95)
  Current 30d window: 98.7% (target 99.95%)   [BURNING]
  Time-to-exhaustion: 14 days
  Top contributors:
    - rules/aws/iam-priv-esc:   p95=320s  (slow CloudTrail ingest)
    - rules/k8s/secret-access:  p95=180s  (audit-log batching)
    - all other rules combined: p95=42s
  Action: tickets PROD-1234 (CloudTrail ingest tuning), PROD-1235 (audit batching)
```
The dashboard is one surface; the tickets are the work. SLO violation = ticket priority elevation.
Step 6: SLO Review Cadence
Weekly SLO review with engineering management:
- Which SLOs are burning?
- Are the underlying tickets prioritized?
- Are objectives still right? (After 6+ months, recalibrate.)
Quarterly: re-evaluate the SLI selection. Some SLIs become uninteresting (always green, no signal); others become important (new threat surface).
Step 7: Connecting to Business
For executive reporting, translate SLOs to business outcome:
```
Detection Coverage: 87% (target 90%)
  → 13% of relevant attacker techniques have no reliable detection.
  → Estimated mean dwell time for those attacks: 14 days
  → Estimated incident scope multiplier vs. covered detections: 3.5x

MTTD p95 (across all incidents in 90 days): 4.3 hours (target 1 hour)
  → For incidents near the p95, the attacker had ~4x more time to act than the target allows.
  → Expected loss given an incident: 2x baseline.
```
The translation requires assumptions; document them. Business leadership consumes statements of the form “an X% improvement in MTTD reduces expected loss by Y%.”
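One way to keep those assumptions honest is to encode the loss model itself. Everything below (the function, the 0.4 growth factor) is a placeholder to calibrate against your own incident data, not a measured constant:

```python
import math

def expected_loss_multiplier(mttd_hours: float, target_hours: float,
                             loss_growth_per_doubling: float = 0.4) -> float:
    """Toy model: each doubling of dwell time beyond the target grows
    expected loss by a fixed fraction of baseline. Calibrate per org."""
    if mttd_hours <= target_hours:
        return 1.0
    doublings = math.log2(mttd_hours / target_hours)
    return 1.0 + loss_growth_per_doubling * doublings

# MTTD p95 of 4.3h against a 1h target: ~2.1 doublings, so roughly
# 1.8x baseline expected loss under this (placeholder) model.
multiplier = expected_loss_multiplier(4.3, 1.0)
```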
Step 8: Avoid Common Anti-Patterns
- Optimizing for the SLI rather than the outcome. If the SLI is “alerts acknowledged within 5 min” and incentives reward it, analysts will ack-and-investigate-later. Pair speed SLIs with quality SLIs (e.g., “incidents resolved correctly”).
- Tying SLOs to compensation. Easily gamed; people optimize for the metric, not the goal.
- Setting unrealistic SLOs. A 99.99% target sounds impressive, but if it is perpetually violated it loses meaning. Set SLOs you can hit ~95% of the time, with budget for the rest.
- Drowning in SLIs. 4-7 is the sweet spot; more becomes noise.
- Ignoring a burning budget. Every burn is a signal; if you ignore it, the SLO program is theatre.
Expected Behaviour
| Signal | Without SLOs | With SLOs |
|---|---|---|
| Investment justification | Anecdotal | Quantified — “MTTD violated; need detection-engineering cycles” |
| Engineering-management visibility | Project status | SLO compliance trend |
| Drift detection | Discovered post-incident | Caught by burn-rate alerts |
| Cross-team coordination | Ad-hoc | Tied to specific SLO violations |
| Calibration over time | Static | Reviewed quarterly |
| Connection to business outcome | Implicit | Explicit translation |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Defined SLIs | Measurable program health | Some SLIs are noise; require maintenance | Quarterly review; prune SLIs that don’t drive action. |
| Realistic SLOs | Honest signal | Less impressive on paper | Frame for engineering audience; for executive reporting, translate. |
| Burn-rate alerting | Proactive, not reactive | Alert fatigue if too many SLOs | Limit to 4-7 SLOs; reserve fast-burn pages for the most critical ones. |
| Incident-derived SLIs | Connects metrics to actual outcomes | Each incident requires retrospective discipline | Use post-incident ticket templates; populate SLI fields automatically where possible. |
| Cross-team SLO ownership | Shared accountability | Negotiation between teams | Make ownership explicit per SLO; engineering management arbitrates. |
| Quarterly SLO review | Calibration over time | Meeting overhead | 1 hour per quarter. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| SLO drift over time | SLOs always green or always red | Quarterly review finds no actionable signal | Recalibrate to a realistic threshold; avoid both extremes. |
| SLI gaming | Numbers improve, real outcomes don’t | Incident retrospectives don’t show improvement | Pair leading SLIs (latency, ack time) with lagging SLIs (incident MTTD computed from retrospectives). |
| Alert fatigue | Burn-rate alerts daily | Engineering ignores | Tune; not every SLO needs a fast-burn-rate alert. |
| Compensation tied to SLI | Behavior shifts to gaming | Hard to detect from numbers | Don’t tie security SLOs to individual compensation. Use them for prioritization, not performance review. |
| Budget exhaustion ignored | SLO violations don’t drive ticket priority | Burn alerts fire repeatedly without resolution | Engineering management mandates: SLO red = ticket priority, period. |
| Wrong objects measured | SLIs don’t correlate with security outcome | Quarterly review reveals lack of correlation | Replace; don’t keep an irrelevant metric for tradition. |
| Budget consumed by single event | One bad week wipes a quarter | Trend analysis | Two-tier alerting: page on imminent violation, daily-digest for slow burn. |