SIEM Cost Optimization: Cardinality, Retention, Sampling, and Index-Tier Strategy

Problem

SIEM bills follow a predictable trajectory: a vendor-pitched price quote at signing; a 2x increase the following year because “log volume grew”; a 4x increase the year after when retention requirements shift. By year three, the security organization is fighting for its budget against the SIEM line item.

The drivers are mostly self-inflicted:

  • Indiscriminate forwarding. A team adds new instrumentation; the logs flow to the SIEM at full volume even though only specific signals are needed.
  • High-cardinality fields. A log field like request_id or user_session creates one index entry per unique value; index size scales with the number of distinct values, not just event volume.
  • Hot retention used for everything. Hot-tier (search-instantly) is typically 10-20x more expensive per GB than cold-tier (search-with-delay). Many SIEMs default everything to hot.
  • Detection rules over the noisiest sources. Rules that scan every event in a high-volume index are expensive even when they fire rarely.
  • Duplicate enrichment. The same enrichment runs at ingest, at search, and in dashboards.
  • Long retention by default. “Just keep everything for 365 days” multiplies cost by 5-10x compared to differentiated retention.

By 2026 every major SIEM (Splunk, Elastic, Sentinel, Chronicle, Sumo Logic, Logscale, Panther) supports cost-control primitives: tiered retention, sampling, summary indexes, ingest-time filtering, dataset routing. Few teams use them well.

This article covers cardinality reduction, retention tiering, ingest-time sampling that preserves detection, summary indexes for high-volume sources, and the operational discipline of treating SIEM spend as a measured engineering metric.

Target systems: Splunk Enterprise / Cloud, Elastic Stack, Microsoft Sentinel, Google Chronicle, Sumo Logic, CrowdStrike Logscale, Panther; vendor-neutral patterns with vendor-specific implementations.

Threat Model

The “adversary” here is the structural failure mode of SIEM cost growth, with security implications:

  • Adversary 1 — Cost overrun forces dropped sources: budget pressure makes the SOC drop or sample sources, creating detection gaps.
  • Adversary 2 — Retention budget cliff: an incident requires 90-day forensic data; retention was 14 days because cost was too high.
  • Adversary 3 — Cardinality explosion: a misbehaving service emits one new index entry per request; daily cost spikes 10x; emergency response prioritizes cost over detection.
  • Adversary 4 — Slow searches under load: large indexes make detection-rule searches time out; rules run late or skip events.
  • Access level: the failure mode is internal — over-spend leads to over-correction.
  • Objective: an adversary observing a thrifty SIEM can move into the gaps the cost optimization created.
  • Blast radius: detection gaps are the same as if the rule never existed. A SIEM cost program done poorly leaves the same coverage holes a real attacker exploits.

Configuration

Step 1: Measure Per-Source Spend

You can’t optimize what you don’t measure. Compute per-source bytes, events, and cost.

For Splunk, the license usage log gives ingested bytes per index and host over the trailing 30 days (4.5 stands in for the license cost in $/GB/day):

index=_internal source=*license_usage.log type=Usage earliest=-30d@d
| stats sum(b)/1024/1024/1024 as gb_total by idx, h
| eval daily_gb = gb_total / 30
| eval daily_cost = daily_gb * 4.5
| sort - daily_cost
| head 50

For Elastic:

# Per-index size (bytes) and document count, largest first.
curl -s 'localhost:9200/_cat/indices?format=json&bytes=b' | \
  jq 'sort_by(."store.size" | tonumber) | reverse | .[0:30]'

Build a dashboard ranking sources by cost. The top 20 sources are typically 80% of spend. Optimize them first.

Step 2: Cardinality Audit

Identify high-cardinality fields. In Splunk:

index=high_volume_index earliest=-24h
| bin _time span=1h
| stats dc(request_id) as rid_card,
        dc(user_id) as user_card,
        dc(trace_id) as trace_card,
        count
  by _time

A field whose cardinality approaches the row count is a unique-per-event identifier. Indexing it inflates the inverted index dramatically. Don’t index it by default; derive a lower-cardinality field if you need partial searchability.

For Elastic:

# List keyword-mapped fields (candidates for a cardinality check).
curl -s 'localhost:9200/logs-*/_field_caps?fields=*' | \
  jq '.fields | to_entries | map(select(.value.keyword)) | map(.key)'

# For a specific index pattern, check the cardinality of one field.
curl -s 'localhost:9200/logs-app-*/_search?size=0' \
  -H 'Content-Type: application/json' \
  -d '{
    "aggs": {
      "field_card": {
        "cardinality": {"field": "request_id.keyword"}
      }
    }
  }'

For each high-cardinality field, decide:

  • Drop entirely if you never search by it.
  • Keep but don’t index — store the value in _source only; not searchable but visible on demand.
  • Hash to bucket — replace request_id with request_id_bucket = hash(request_id) % 1024. Loses one-to-one searchability; preserves population statistics.
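
The middle option (“keep but don’t index”) maps directly onto Elastic’s per-field index controls. A minimal sketch, assuming a logs-app index template and a request_id field (names are placeholders); existing field mappings can’t be changed in place, so this belongs in the template for newly created indices:

# Keep request_id in _source only: retrievable with the document, but not
# searchable or aggregatable, so it adds nothing to the inverted index.
curl -s -X PUT 'localhost:9200/_index_template/logs-app' \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["logs-app-*"],
    "template": {
      "mappings": {
        "properties": {
          "request_id": {
            "type": "keyword",
            "index": false,
            "doc_values": false
          }
        }
      }
    }
  }'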

Step 3: Tiered Retention

Differentiate retention by source. Detection-relevant: short hot, then warm. Compliance: long cold. Forensic-only: archive.

# Example retention tiers.
audit_logs_critical:
  hot: 30d         # search-immediate
  warm: 90d        # search with seconds-to-minutes delay
  cold: 365d       # search with minutes-to-hours delay
  archive: 7y      # restore-required

application_logs:
  hot: 7d
  warm: 30d
  cold: 90d
  archive: 1y

debug_logs:
  hot: 1d
  warm: 0d
  cold: 0d
  archive: 0d
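
On Elastic, the same tiers are expressed as an ILM policy attached to the index template or data stream. A minimal sketch matching the application_logs tier above (policy name and thresholds are placeholders; the one-year archive step would be a snapshot policy and is omitted):

# Roll over daily; move to warm at 7d, cold at 30d, delete at 90d.
curl -s -X PUT 'localhost:9200/_ilm/policy/application-logs' \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot":    { "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" } } },
        "warm":   { "min_age": "7d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
        "cold":   { "min_age": "30d", "actions": { "readonly": {} } },
        "delete": { "min_age": "90d", "actions": { "delete": {} } }
      }
    }
  }'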

Splunk’s index lifecycle:

# indexes.conf
[audit_logs]
homePath = $SPLUNK_DB/audit/db
coldPath = $SPLUNK_DB/audit/colddb
thawedPath = $SPLUNK_DB/audit/thaweddb
maxHotBuckets = 5
maxDataSize = auto_high_volume
# 100 GB hot/warm, 1 TB cold, 1 year before buckets roll to frozen.
homePath.maxDataSizeMB = 100000
coldPath.maxDataSizeMB = 1000000
frozenTimePeriodInSecs = 31536000
coldToFrozenScript = /opt/splunk/bin/move-to-archive.sh

The coldToFrozenScript moves the bucket to S3 / GCS / Azure blob storage at $0.001/GB/month — orders of magnitude cheaper than active SIEM storage.
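
The script receives the bucket directory as its only argument and must copy the data somewhere durable before exiting 0, because Splunk deletes the local bucket on success. A minimal sketch, assuming an S3 archive bucket named siem-frozen-archive:

#!/bin/bash
# move-to-archive.sh: invoked by Splunk as `move-to-archive.sh <bucket_dir>`.
set -euo pipefail

BUCKET_DIR="$1"
# Bucket paths look like $SPLUNK_DB/<index>/colddb/db_<newest>_<oldest>_<id>,
# so the index name is two directories up.
INDEX_NAME=$(basename "$(dirname "$(dirname "$BUCKET_DIR")")")

aws s3 cp --recursive "$BUCKET_DIR" \
  "s3://siem-frozen-archive/${INDEX_NAME}/$(basename "$BUCKET_DIR")/"

# Exit 0 so Splunk removes the local bucket after a successful upload.
exit 0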

Step 4: Ingest-Time Sampling

Some sources have signal density too low to justify full ingestion. Sample at ingest time, with structure that preserves detection.

# Splunk Edge Processor / Cribl-style pipeline (illustrative pseudocode).
- type: filter
  description: "Keep all errors (status >= 400); keep a 5% sample of the rest, drop the remainder"
  filter: |
    if event.status >= 400 || random() < 0.05 {
      keep
    } else {
      drop
    }

- type: aggregate
  description: "Roll up dropped 200s into per-minute summary metrics"
  by: [host, path]
  every: 60s
  emit: ["count", "sum(bytes)", "p99(latency_ms)"]

Detection on errors and outliers gets full fidelity. Aggregate metrics on the routine 200s replace per-event detail.

For Elastic:

# Logstash filter for sampling.
filter {
  if [status] >= 400 {
    # Keep all errors; tag for hot-tier retention.
    mutate { add_field => { "retention_tier" => "hot" } }
  } else {
    # Drop ~95% of routine events; the surviving 5% are tagged for the warm tier.
    drop { percentage => 95 }
    mutate { add_field => { "retention_tier" => "warm" } }
  }
}

Step 5: Summary Indexes for Pre-Aggregated Detection

Detections that scan high-volume indexes are expensive. Pre-aggregate the raw events into a summary index that detection rules query instead.

# Hourly roll-up of per-user auth activity for the previous day.
index=auth_events earliest=-1d@d latest=@d
| bin _time span=1h
| stats count as event_count,
        dc(source_ip) as unique_ips,
        dc(user_agent) as unique_uas,
        earliest(_time) as first_event,
        latest(_time) as last_event
  by user_id, _time
| collect index=auth_summary

Detection rule queries auth_summary (small index, fast searches) rather than the full auth_events. Storage cost for the summary is a fraction of the source.

The trade-off: rules can no longer correlate at sub-hour granularity from the summary alone. For most behavioural rules (“user X accessed N distinct IPs in the past day”), hour-bucketed aggregates suffice.
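
To keep the summary current, run the roll-up as a scheduled saved search over each completed hour rather than once per day. A sketch of the savedsearches.conf stanza, with the stanza and index names as assumptions:

# savedsearches.conf
[auth_summary_hourly_rollup]
enableSched = 1
# Five minutes past each hour, summarizing the hour that just closed.
cron_schedule = 5 * * * *
dispatch.earliest_time = -1h@h
dispatch.latest_time = @h
search = index=auth_events \
| bin _time span=1h \
| stats count as event_count, dc(source_ip) as unique_ips, dc(user_agent) as unique_uas by user_id, _time \
| collect index=auth_summary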

Step 6: Cardinality Constraints in Detection Rules

Some detection rules naturally produce high-cardinality output (one row per source IP, one row per user). When these rules run continuously, the result indexes themselves become expensive.

# Bad: one row per source IP, every minute, indefinitely.
index=auth_logs status=failure
| stats count by source_ip, _time
| outputlookup auth_failures.csv

# Better: aggregated, with cardinality control.
index=auth_logs status=failure
| stats count as failures by source_ip
| where failures > 10
| outputlookup auth_failures_high.csv

Cap the cardinality of detection-rule outputs to alerts-only or top-N.

Step 7: Cost Allocation Per Team

Make spend visible. Per-source-team SIEM cost dashboard:

index=_internal source=*license_usage.log type=Usage earliest=-30d@d
| eval team = case(
    match(idx, "^app-payments"), "payments-team",
    match(idx, "^app-auth"), "auth-team",
    match(idx, "^infra-"), "platform-team",
    1=1, "shared")
| stats sum(b)/1024/1024/1024 as gb_total
  by team
| eval daily_cost = gb_total / 30 * 4.5
| sort - daily_cost

Send the report monthly to engineering managers. Teams that own their cost figure out their own optimization.

Step 8: Quarterly Audit

Schedule quarterly:

  • Top sources by spend. Anyone in the top 10 needs an explanation.
  • Sources with no detection rules attached. Why are we paying to index data nobody searches?
  • Stale rules. Rules that haven’t fired in 180 days but query expensive indexes.
  • Cardinality drift. Fields that had bounded cardinality at design time and are now unique per event (a sample alert for this check follows below).

Each line item is an action: keep, drop, sample, route to cheaper tier, summarize.
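
The cardinality-drift check is the easiest one to automate as a scheduled alert instead of waiting for the quarterly review. A sketch in SPL, with the index, field, and thresholds as assumptions:

# Daily alert: fire when a field's distinct-value count explodes or
# approaches the event count (one unique value per event).
index=high_volume_index earliest=-1d@d latest=@d
| stats dc(request_id) as rid_card, count as events
| where rid_card > 1000000 OR rid_card > 0.5 * events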

Expected Behaviour

Signal | Unmanaged | Managed
Annual SIEM spend growth | 50-150% | <30% (matches log volume growth)
Hot-tier proportion | 80-100% | 20-40%
Detection coverage | Same | Same, or better (fewer slow searches)
Search latency p99 | Seconds to minutes | Sub-second on summary indexes
Cardinality of largest source | Often unbounded | Bounded by ingest-time hashing / dropping
Per-team cost visibility | None | Quarterly report

Trade-offs

Aspect | Benefit | Cost | Mitigation
Tiered retention | Massive cost reduction | Search on cold data is slower | Detections run on the hot tier; slower forensic queries on cold are acceptable.
Ingest sampling | Cost reduction with little signal loss | Some detection nuance lost on under-sampled sources | Keep all errors / outliers; sample only the routine baseline.
Summary indexes | Detection rules become cheap | Detail lost at summary granularity | Keep raw data in the cold tier; summaries in hot.
Cardinality constraints | Predictable index cost | Some search patterns no longer work directly | Educate teams; provide derived-field libraries.
Per-team allocation | Engineering accountability | Allocation discussions can be political | Use simple, transparent rules; show per-team trends over time.
Quarterly audit | Continuous improvement | Engineering effort | Automate the reports; the audit is reading five dashboards.

Failure Modes

Failure | Symptom | Detection | Recovery
Sampling drops the very events that matter | Detection rules fire less often | Rule true-positive rate drops; correlations are lost | Always keep severity:high events and outliers unsampled; tune the sampling logic to preserve the tail.
Cold-tier search timeout | Users can’t query forensic data | Cold-tier searches return timeouts | Cold tier should be slow but functional; if it times out, add cold-tier capacity (still cheap) or move the data to a faster tier.
Cardinality explosion not caught | Daily cost spikes | Cost metric jumps; quarterly audit catches it | Add cardinality alerts: per-index, per-day field cardinality above a threshold triggers a Slack ping.
Summary index drifts from raw | Detection on the summary differs from detection on raw | Cross-check sample queries | Periodic backfill comparison: query raw and summary for the same period; alert on divergence.
Retention shortened too aggressively | Compliance audit can’t find required data | Audit failure | Confirm legal/compliance requirements before reducing retention; some regimes (PCI DSS, HIPAA) mandate minimums.
Per-team allocation fights | Teams game the system to reduce their numbers | Cost grows in untracked sources | Lock per-source ownership; new sources require a designated owner.