SIEM Cost Optimization: Cardinality, Retention, Sampling, and Index-Tier Strategy
Problem
SIEM bills follow a predictable trajectory: a vendor-pitched price quote at signing; a 2x increase the following year because “log volume grew”; a 4x increase the year after when retention requirements shift. By year three, the security organization is fighting for its budget against the SIEM line item.
The drivers are mostly self-inflicted:
- Indiscriminate forwarding. A team adds new instrumentation; the logs flow to the SIEM at full volume even though only specific signals are needed.
- High-cardinality fields. A log field like request_id or user_session creates one index entry per unique value. Index storage scales with cardinality, not volume.
- Hot retention used for everything. Hot-tier (search-instantly) is typically 10-20x more expensive per GB than cold-tier (search-with-delay). Many SIEMs default everything to hot.
- Detection rules over the noisiest sources. Rules that scan every event in a high-volume index are expensive even when they fire rarely.
- Duplicate enrichment. The same enrichment runs at ingest, at search, and in dashboards.
- Long retention by default. “Just keep everything for 365 days” multiplies cost by 5-10x compared to differentiated retention.
By 2026 every major SIEM (Splunk, Elastic, Sentinel, Chronicle, Sumo Logic, Logscale, Panther) supports cost-control primitives: tiered retention, sampling, summary indexes, ingest-time filtering, dataset routing. Few teams use them well.
This article covers cardinality reduction, retention tiering, ingest-time sampling that preserves detection, summary indexes for high-volume metrics, and the operational discipline of treating SIEM spend as a measured engineering output.
Target systems: Splunk Enterprise / Cloud, Elastic Stack, Microsoft Sentinel, Google Chronicle, Sumo Logic, CrowdStrike Logscale, Panther; vendor-neutral patterns with vendor-specific implementations.
Threat Model
The “adversary” here is the structural failure mode of SIEM cost growth, with security implications:
- Adversary 1 — Cost overrun forces dropped sources: budget pressure makes the SOC drop or sample sources, creating detection gaps.
- Adversary 2 — Retention budget cliff: an incident requires 90-day forensic data; retention was 14 days because cost was too high.
- Adversary 3 — Cardinality explosion: a misbehaving service emits one new index entry per request; daily cost spikes 10x; emergency response prioritizes cost over detection.
- Adversary 4 — Slow searches under load: high index size makes detection-rule searches time out; rules run late or skip events.
- Access level: the failure mode is internal — over-spend leads to over-correction.
- Objective: an adversary observing a thrifty SIEM can move into the gaps the cost optimization created.
- Blast radius: detection gaps are the same as if the rule never existed. A SIEM cost program done poorly leaves the same coverage holes a real attacker exploits.
Configuration
Step 1: Measure Per-Source Spend
You can't optimize what you don't measure. Compute daily volume and cost per source; the examples below assume an illustrative license rate of $4.50 per GB per day.
For Splunk:
index=_internal source=*license_usage.log* type=Usage earliest=-30d@d
| stats sum(b) as bytes by idx, h
| rename idx as index, h as host
| eval gb_total = bytes / 1024 / 1024 / 1024
| eval daily_gb = gb_total / 30
| eval daily_cost = daily_gb * 4.5
| sort - daily_cost
| head 50
For Elastic:
# Per-index size and document count, largest first.
curl -s 'localhost:9200/_cat/indices?format=json&bytes=b&h=index,docs.count,store.size' | \
  jq 'sort_by(."store.size" | tonumber) | reverse | .[0:30]'
Build a dashboard ranking sources by cost. The top 20 sources are typically 80% of spend. Optimize them first.
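To check the 80/20 claim against your own data, a short script can compute the cumulative share from an exported cost report. A minimal sketch in Python, assuming a hypothetical siem_cost_by_source.csv export with source and daily_cost columns:

# cumulative_share.py - how many sources account for 80% of SIEM spend?
# Assumes a hypothetical export siem_cost_by_source.csv with columns: source, daily_cost.
import csv

THRESHOLD = 0.80  # cumulative share of spend to cover

with open("siem_cost_by_source.csv") as f:
    rows = sorted(csv.DictReader(f), key=lambda r: float(r["daily_cost"]), reverse=True)

total = sum(float(r["daily_cost"]) for r in rows)
running = 0.0
for n, row in enumerate(rows, start=1):
    running += float(row["daily_cost"])
    if running / total >= THRESHOLD:
        print(f"{n} of {len(rows)} sources account for {running / total:.0%} "
              f"of daily spend (${running:,.2f}/day)")
        break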
Step 2: Cardinality Audit
Identify high-cardinality fields. In Splunk:
index=high_volume_index earliest=-24h@h latest=@h
| bin span=1h _time
| stats dc(request_id) as rid_card,
        dc(user_id) as user_card,
        dc(trace_id) as trace_card,
        count
  by _time
A field whose cardinality approaches the row count is a unique-per-event identifier. Indexing it inflates the inverted index dramatically. Don't index it by default; if you still need population-level statistics, project it to a lower-cardinality derived field.
For Elastic:
# List keyword-mapped fields (candidates for a cardinality check).
curl -s 'localhost:9200/logs-*/_field_caps?fields=*' | \
  jq '.fields | to_entries | map(select(.value.keyword) | .key)'

# Check the cardinality of a specific field in an index pattern.
curl -s 'localhost:9200/logs-app-*/_search?size=0' \
  -H 'Content-Type: application/json' \
  -d '{
        "aggs": {
          "field_card": {
            "cardinality": { "field": "request_id.keyword" }
          }
        }
      }'
For each high-cardinality field, decide:
- Drop entirely if you never search by it.
- Keep but don't index — store the value in _source only; not searchable, but visible on demand.
- Hash to bucket — replace request_id with request_id_bucket = hash(request_id) % 1024 (see the sketch below). Loses one-to-one searchability; preserves population statistics.
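A minimal sketch of the hash-to-bucket option, in Python as it might run in an ingest pre-processor. The field names and bucket count are illustrative; a stable hash is used so the same ID always lands in the same bucket:

# hash_to_bucket.py - replace a unique-per-event ID with a bounded bucket value.
# Field names and the bucket count are illustrative.
import hashlib

NUM_BUCKETS = 1024  # caps index cardinality for this field at 1024 values

def bucket_field(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Use a stable hash (not Python's built-in hash(), which is salted per process)
    # so the same ID always maps to the same bucket across restarts and hosts.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

event = {"request_id": "9f3a6c2e-7d41-4b1c-8e55-0c2f9a7d1b20", "status": 200}
event["request_id_bucket"] = bucket_field(event["request_id"])
del event["request_id"]  # or leave it in the raw payload, un-indexed
print(event)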
Step 3: Tiered Retention
Differentiate retention by source. Detection-relevant: short hot, then warm. Compliance: long cold. Forensic-only: archive.
# Example retention tiers.
audit_logs_critical:
hot: 30d # search-immediate
warm: 90d # search with seconds-to-minutes delay
cold: 365d # search with minutes-to-hours delay
archive: 7y # restore-required
application_logs:
hot: 7d
warm: 30d
cold: 90d
archive: 1y
debug_logs:
hot: 1d
warm: 0d
cold: 0d
archive: 0d
Splunk’s index lifecycle:
# indexes.conf
[audit_logs]
homePath = $SPLUNK_DB/audit/db
coldPath = $SPLUNK_DB/audit/colddb
thawedPath = $SPLUNK_DB/audit/thaweddb
maxHotBuckets = 5
maxDataSize = auto_high_volume
# 100 GB hot/warm, 1 TB cold
homePath.maxDataSizeMB = 100000
coldPath.maxDataSizeMB = 1000000
# roll buckets to frozen after 1 year
frozenTimePeriodInSecs = 31536000
coldToFrozenScript = /opt/splunk/bin/move-to-archive.sh
The coldToFrozenScript moves the bucket to S3 / GCS / Azure blob storage at $0.001/GB/month — orders of magnitude cheaper than active SIEM storage.
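What the archive script does is mechanical: Splunk invokes it with the frozen bucket's directory path as its argument, and a zero exit means the copy succeeded, so Splunk removes the local bucket. The indexes.conf above points at a shell script; a hedged Python sketch of the same job follows, where the S3 destination and the AWS CLI dependency are assumptions, not Splunk defaults:

#!/usr/bin/env python3
# Sketch of a coldToFrozenScript target: Splunk calls it with the bucket directory as argv[1].
# The archive bucket name and the AWS CLI dependency are illustrative assumptions.
import subprocess
import sys
from pathlib import Path

ARCHIVE_URI = "s3://example-siem-archive/splunk-frozen"  # hypothetical destination

def main() -> int:
    bucket_path = Path(sys.argv[1])  # e.g. .../audit/colddb/db_1700000000_1690000000_42
    index_name = bucket_path.parent.parent.name
    dest = f"{ARCHIVE_URI}/{index_name}/{bucket_path.name}"
    result = subprocess.run(
        ["aws", "s3", "cp", "--recursive", str(bucket_path), dest],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Non-zero exit tells Splunk the archive failed; the bucket is not deleted.
        print(result.stderr, file=sys.stderr)
        return 1
    return 0  # success: Splunk removes the local bucket

if __name__ == "__main__":
    sys.exit(main())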
Step 4: Ingest-Time Sampling
Some sources have signal density too low to justify full ingestion. Sample at ingest time, with structure that preserves detection.
# Illustrative pipeline pseudocode (Splunk Edge Processor / Cribl style).
- type: filter
description: "Drop 90% of HTTP 200 access logs; keep all errors and 5% of 200s"
filter: |
if event.status >= 400 || random() < 0.05 {
keep
} else {
drop
}
- type: aggregate
description: "Roll up dropped 200s into per-minute summary metrics"
by: [host, path]
every: 60s
emit: ["count", "sum(bytes)", "p99(latency_ms)"]
Detection on errors and outliers gets full fidelity. Aggregate metrics on the routine 200s replace per-event detail.
For Elastic:
# Logstash filter for sampling.
filter {
if [status] >= 400 {
# Keep all errors; tag for retention.
mutate { add_field => { "retention_tier" => "hot" } }
  } else {
    # Drop ~95% of routine events; the surviving ~5% are tagged for the warm tier.
    drop { percentage => 95 }
    mutate { add_field => { "retention_tier" => "warm" } }
  }
}
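Whichever pipeline implements it, the keep/drop decision itself is small enough to sketch directly. A Python illustration of the same logic, using deterministic sampling keyed on a stable field so related events survive together, and recording a sample_rate field so downstream counts can be scaled back up (field names are illustrative):

# sample_decision.py - the keep/drop logic from the pipelines above, in plain Python.
# Field names are illustrative; KEEP_RATE matches the 5% used above.
import hashlib

KEEP_RATE = 0.05  # fraction of routine (non-error) events to keep

def keep_event(event: dict) -> bool:
    if event.get("status", 0) >= 400:
        return True  # never sample errors
    # Deterministic sampling: hash a stable key so all events for the same
    # session are kept or dropped together, instead of rolling a die per event.
    key = str(event.get("session_id") or event.get("source_ip") or "")
    fraction = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big") / 2**64
    return fraction < KEEP_RATE

def annotate(event: dict) -> dict:
    # Record the effective sample rate so dashboards can scale counts back up
    # (estimated true count = observed count / sample_rate).
    event["sample_rate"] = 1.0 if event.get("status", 0) >= 400 else KEEP_RATE
    return event

events = [
    {"status": 200, "session_id": "a1", "path": "/health"},
    {"status": 500, "session_id": "b2", "path": "/checkout"},
]
print([annotate(e) for e in events if keep_event(e)])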
Step 5: Summary Indexes for Pre-Aggregated Detections
Detections that scan high-volume indexes are expensive. Pre-aggregate at ingest time into a summary index that detection rules query instead.
# Daily summary of per-user activity.
index=auth_events earliest=@d-1d latest=@d
| bin span=1h _time as hour
| stats count as event_count,
        dc(source_ip) as unique_ips,
        dc(user_agent) as unique_uas,
        min(_time) as first_event,
        max(_time) as last_event
  by user_id, hour
| collect index=auth_summary
Detection rule queries auth_summary (small index, fast searches) rather than the full auth_events. Storage cost for the summary is a fraction of the source.
The trade-off: rules can no longer correlate at sub-hour granularity from the summary alone. For most behavioural rules (“user X accessed N distinct IPs in the past day”), hour-bucketed aggregates suffice.
Step 6: Cardinality Constraints in Detection Rules
Some detection rules naturally produce high-cardinality output (one row per source IP, one row per user). When these rules run continuously, the result indexes themselves become expensive.
# Bad: one row per source IP, every minute, indefinitely.
index=auth_logs status=failure
| stats count by source_ip, _time
| outputlookup auth_failures.csv
# Better: aggregated, with cardinality control.
index=auth_logs status=failure
| stats count as failures by source_ip
| where failures > 10
| outputlookup auth_failures_high.csv
Cap the cardinality of detection-rule outputs to alerts-only or top-N.
Step 7: Cost Allocation Per Team
Make spend visible. Per-source-team SIEM cost dashboard:
index=_internal source=*license_usage.log* type=Usage earliest=-30d@d
| eval team = case(
    match(idx, "^app-payments"), "payments-team",
    match(idx, "^app-auth"), "auth-team",
    match(idx, "^infra-"), "platform-team",
    1=1, "shared")
| stats sum(b) as bytes by team
| eval gb_total = bytes / 1024 / 1024 / 1024
| eval daily_cost = gb_total / 30 * 4.5
| sort - daily_cost
Send the report monthly to engineering managers. Teams that own their cost figure out their own optimization.
Step 8: Quarterly Audit
Schedule quarterly:
- Top sources by spend. Anyone in the top 10 needs an explanation.
- Sources with no detection rules attached. Why are we paying to index data nobody searches?
- Stale rules. Rules that haven’t fired in 180 days but query expensive indexes.
- Cardinality drift. Fields that were OK at design time and are now unique-per-event (see the sketch below).
Each line item is an action: keep, drop, sample, route to cheaper tier, summarize.
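Cardinality drift in particular is cheap to automate between audits. A hedged sketch of a daily check against Elasticsearch's cardinality aggregation, posting to a Slack webhook when a watched field exceeds its threshold; the index pattern, fields, thresholds, and webhook URL are all placeholders:

# cardinality_watch.py - daily cardinality-drift check against Elasticsearch.
# Index pattern, fields, thresholds, and the Slack webhook URL are placeholders.
import json
import urllib.request

ES_URL = "http://localhost:9200"
SLACK_WEBHOOK = "https://hooks.slack.com/services/EXAMPLE"  # hypothetical webhook
WATCHED = {  # index pattern -> {field: max expected distinct values per day}
    "logs-app-*": {"request_id.keyword": 1_000_000, "user_id.keyword": 50_000},
}

def post_json(url: str, payload: dict) -> bytes:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

for index, fields in WATCHED.items():
    body = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-1d"}}},
        # Aggregation names are the field names with dots replaced, to keep them unambiguous.
        "aggs": {f.replace(".", "_"): {"cardinality": {"field": f}} for f in fields},
    }
    result = json.loads(post_json(f"{ES_URL}/{index}/_search", body))
    for field, threshold in fields.items():
        observed = result["aggregations"][field.replace(".", "_")]["value"]
        if observed > threshold:
            post_json(SLACK_WEBHOOK, {
                "text": f"Cardinality drift: {index} {field} = {observed:,} "
                        f"distinct values in 24h (threshold {threshold:,})",
            })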
Expected Behaviour
| Signal | Unmanaged | Managed |
|---|---|---|
| Annual SIEM spend growth | 50-150% | <30% (matches log volume growth) |
| Hot-tier proportion | 80-100% | 20-40% |
| Detection coverage | Same | Same (or better — fewer slow searches) |
| Search latency p99 | Seconds-to-minutes | Sub-second on summary indexes |
| Cardinality of largest source | Often unbounded | Bounded by ingest-time hashing / dropping |
| Per-team cost visibility | None | Monthly report |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Tiered retention | Massive cost reduction | Search on cold data is slower | Detections run on the hot tier; forensic queries on the cold tier can tolerate the delay. |
| Ingest sampling | Cost reduction with low signal-loss | Some detection nuance lost on under-sampled sources | Keep all errors / outliers; sample only the routine baseline. |
| Summary indexes | Detection rules cheap | Detail lost at summary granularity | Keep raw data in cold tier; summary in hot. |
| Cardinality constraints | Predictable index cost | Some search patterns no longer work directly | Educate teams; provide derived-field libraries. |
| Per-team allocation | Engineering accountability | Allocation discussions can be political | Use simple, transparent rules; show per-team trend over time. |
| Quarterly audit | Continuous improvement | Engineering effort | Automate the reports; the audit is reading 5 dashboards. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Sampling drops the very events that matter | Detection rule fires less often | Rule TPR drops; correlations lost | Always keep severity:high and outliers unsampled. Tune the sampling logic to preserve the tail. |
| Cold-tier search timeout | User can’t query for forensic data | Cold-tier search returns timeout | Cold tier should be slow but functional. If timing out, allocate more cold-tier capacity (still cheap) or move to a faster tier. |
| Cardinality explosion not caught | Daily cost spikes | Cost metric jumps; quarterly audit catches | Add cardinality alerts: per-index, per-day field-cardinality > threshold triggers Slack ping. |
| Summary index drift from raw | Detection on summary differs from detection on raw | Cross-check sample queries | Periodic backfill comparison: query raw and summary for same period; alert on divergence. |
| Retention shortened too aggressively | Compliance audit can’t find required data | Audit failure | Confirm legal/compliance requirements before reducing retention; some regulatory regimes (PCI DSS, HIPAA) mandate specific minimums. |
| Per-team allocation fights | Teams game the system to reduce their numbers | Cost grows in untracked sources | Lock per-source ownership; every new source requires a designated owning team. |