Prompt Cache Security: Side-Channels, Poisoning, and Tenant Isolation in LLM Provider Caches
Problem
Major LLM providers introduced prompt caching in 2024-2025. Anthropic’s prompt caching (GA in 2024), OpenAI’s prompt caching (released 2024), and Google Gemini context caching (2024) all share the same model: long, repeated prefixes (system prompts, retrieval-augmented context, large code files) get cached server-side. Subsequent requests with the same prefix skip recomputation, paying a fraction of the normal input price (as little as ~10%, depending on provider) and returning 30-90% faster.
The security story is barely documented. The mechanism is a server-side cache keyed on a hash of the prefix tokens, scoped to the customer account. The cache is shared across all requests from that account; cache hits and misses produce measurable timing differences. This creates several attack surfaces that almost no application security review accounts for:
- Cross-request timing side-channel. An attacker who can submit prompts and observe response timing can probe the cache. If a specific text is in the cache (because another user submitted it), the attacker’s request hits and returns faster. This leaks information about other users on the same account (a minimal probe sketch follows this list).
- Cache-key collision poisoning (theoretical). If the cache key isn’t derived with a collision-resistant hash, an attacker could craft a prompt whose key collides with a victim prompt’s, so the victim is served attacker-controlled cached output.
- Tenant boundary unclear. When a SaaS application runs on a single API key on behalf of many end-users, the cache is shared across users. User A’s cached prefix may be exposed to User B’s queries via timing.
- System-prompt leak via prefix probing. Caching is prefix-based, so an attacker who can guess the opening of a cached, confidential system prompt may be able to extend the guess incrementally, observe which extensions still hit the cache, and reconstruct portions of the prompt.
- Billing-side leakage. Cache writes and cache reads are invoiced separately from ordinary input tokens. Usage reports therefore expose how heavily your application relies on caching and when cached prefixes are created and reused; anyone with billing access gets a coarse view of cache state.
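To make the first bullet concrete, here is a minimal sketch of a timing probe against the Anthropic API. It is a sketch, not a tested exploit: the model name is a placeholder matching the examples later in this article, and the latency threshold is illustrative and would need calibration per model and prefix length. Note that a probe carrying cache_control also writes a cache entry on a miss, so each candidate prefix can only be tested cleanly once per TTL.

import time
import anthropic

client = anthropic.Anthropic()

def probe_prefix(candidate_prefix: str, threshold_ms: float = 800.0) -> bool:
    # Returns True if candidate_prefix appears to already be in the provider
    # cache: a cache hit skips prefill of the prefix and returns measurably
    # faster than a miss. The threshold must be calibrated empirically.
    start = time.monotonic()
    client.messages.create(
        model="claude-opus-4-7",  # placeholder: same model the target app uses
        max_tokens=1,
        system=[
            {
                "type": "text",
                "text": candidate_prefix,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "."}],
    )
    elapsed_ms = (time.monotonic() - start) * 1000
    return elapsed_ms < threshold_ms  # fast response => likely cache hit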
The provider documentation typically says the cache is “isolated per organization” and “doesn’t impact other organizations.” This is true, but it does not address per-user isolation within an organization. Building a multi-user product on a single API key means accepting cross-user cache exposure unless you mitigate it at the application layer.
This article covers the threat model for prompt caching, application-level mitigations (cache-key salting, request shaping, deterministic latency), and the configuration choices that bound the side-channel surface.
Target systems: Anthropic API with prompt caching (cache_control parameter), OpenAI prompt caching (automatic on supported models), Google Gemini context caching (CachedContent API), Azure OpenAI prompt caching.
Threat Model
- Adversary 1 — Cross-user observer in shared application: an end-user of a multi-tenant SaaS app that uses one provider API key for all users. Wants to determine what other users have queried.
- Adversary 2 — External timing-side-channel attacker: anyone with the ability to submit prompts, observing response latency to infer cache state.
- Adversary 3 — System-prompt extractor: an attacker who wants to reconstruct the application’s confidential system prompt by observing which prefixes hit cache.
- Adversary 4 — Cache-key collider (theoretical): crafts inputs whose internal cache key matches a victim’s, hoping to influence the victim’s response.
- Access level: Adversary 1 has user-level access to the application. Adversaries 2-3 have only API submission capability. Adversary 4 has API submission + knowledge of the cache-key derivation.
- Objective: Read or infer cache contents from other users / sessions; reconstruct confidential prompts; influence other users’ outputs.
- Blast radius: With a shared cache and no application-level isolation, the presence of any guessable cached prompt is observable via timing by anyone who can submit requests on the same API key. With proper salting and per-tenant scoping, observation is bounded to the requester’s own cached content.
Configuration
Pattern 1: Per-Tenant Cache-Key Salting
The simplest mitigation: include a per-tenant secret in the cached prefix. Different tenants produce different cache keys for what would otherwise be identical content.
import anthropic

def per_tenant_system_prompt(tenant_id: str, base_prompt: str) -> str:
    # Per-tenant salt: secret, not derivable from tenant_id alone.
    # TENANT_SALTS is a server-side lookup of per-tenant secrets
    # (a derivation sketch follows below).
    salt = TENANT_SALTS[tenant_id]
    # The salt is deliberately invisible to the model (an HTML comment).
    salt_marker = f"<!--cache-salt-{salt}-->"
    return f"{salt_marker}\n{base_prompt}"

# Anthropic API call.
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": per_tenant_system_prompt(tenant_id, BASE_SYSTEM_PROMPT),
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
Now tenant A’s cache and tenant B’s cache are entirely separate inside the provider’s cache namespace. A cross-tenant timing side-channel now requires knowing the other tenant’s salt, because identical base prompts no longer produce identical cache keys.
The salt format must be invisible to the model: an HTML comment, a zero-width-space pattern, or a token sequence the model treats as a no-op. Don’t put the salt in a position where it changes model behavior.
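The TENANT_SALTS lookup above is assumed rather than shown. One minimal way to derive it is an HMAC over the tenant ID with a server-side secret, so that knowing a tenant_id alone is not enough to reproduce that tenant’s cache keys; the environment-variable name here is illustrative.

import hmac
import hashlib
import os

# Server-side secret; never sent to the provider or to clients.
CACHE_SALT_KEY = os.environ["CACHE_SALT_KEY"].encode()

def tenant_salt(tenant_id: str) -> str:
    # HMAC-SHA256 keeps the salt unguessable without the server secret.
    digest = hmac.new(CACHE_SALT_KEY, f"tenant:{tenant_id}".encode(), hashlib.sha256)
    return digest.hexdigest()[:16]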
Pattern 2: Per-User Salting Within a Tenant
For multi-user applications, take the per-tenant pattern down one level:
def per_user_cache_prefix(user_id: str, system_prompt: str) -> str:
    # User-specific salt; rotate periodically (see the sketch below).
    salt = derive_user_salt(user_id)
    return f"<!-- u:{salt} -->\n{system_prompt}"
The trade-off: cache hit rate drops. Each user’s cache is independent; first request per user always misses. Calculate whether the privacy benefit outweighs the cost — for sensitive content, almost always yes.
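derive_user_salt above is assumed; here is a minimal sketch using the same HMAC approach as the tenant salt, with a coarse time bucket folded in so that salts (and therefore cache namespaces) rotate on a schedule without any stored state.

import hmac
import hashlib
import os
import time

CACHE_SALT_KEY = os.environ["CACHE_SALT_KEY"].encode()  # same server secret as above

def derive_user_salt(user_id: str, rotation_days: int = 7) -> str:
    # Including a time bucket rotates every user's salt on a fixed schedule;
    # expect a cache-miss spike at each rotation boundary (see Failure Modes).
    bucket = int(time.time() // (rotation_days * 86400))
    digest = hmac.new(CACHE_SALT_KEY, f"user:{user_id}:{bucket}".encode(), hashlib.sha256)
    return digest.hexdigest()[:16]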
Pattern 3: Constant-Time Response Normalization
For applications that genuinely need a shared cache (cost matters more than per-user privacy), eliminate the timing channel by normalizing latency.
import asyncio
import time

async def constant_time_response(prompt, target_latency_ms=2000):
    start = time.monotonic()
    result = await llm.generate(prompt)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms < target_latency_ms:
        # Pad fast (cache-hit) responses up to the target latency.
        await asyncio.sleep((target_latency_ms - elapsed_ms) / 1000)
    return result
Pad every response to a fixed latency. The target should be at least the worst-case cache-miss latency; cache hits are artificially slowed to match.
The trade-off: cache-hit responses no longer deliver their latency benefit to end-users; what remains is the cost saving (cache reads are still billed at the reduced rate). Use this for endpoints where the application is highly latency-tolerant (back-office, batch, async) and never for interactive chat.
Pattern 4: Cache-Key Hardening
Where the provider lets you place cache breakpoints explicitly (Anthropic’s cache_control) rather than implicitly hashing the whole prefix, place them deliberately and only around public, stable content:
# Anthropic API: cache_control points are explicit. Place them only at
# stable, public-knowledge boundaries (system prompt, public document).
# Never cache user-specific or sensitive content.
system = [
    {
        "type": "text",
        "text": PUBLIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this
    },
]
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_PUBLIC_DOCUMENT,
                "cache_control": {"type": "ephemeral"},  # cache this
            },
            {
                "type": "text",
                "text": user_specific_query,  # do NOT cache this
            },
        ],
    },
]
User-specific content lands after the cache point, so it isn’t included in the cached prefix. The cache covers only the public preamble.
Pattern 5: Probing Detection
For applications with a constrained input surface, detect cache-probing attempts:
import time
from collections import defaultdict

class CacheProbeDetector:
    def __init__(self, window_seconds=60, max_unique_prefixes=10):
        self.requests_per_user = defaultdict(list)
        self.window = window_seconds
        self.max_unique = max_unique_prefixes

    def check(self, user_id: str, prefix_hash: bytes) -> bool:
        now = time.time()
        # Drop entries that have aged out of the sliding window.
        self.requests_per_user[user_id] = [
            (h, t) for (h, t) in self.requests_per_user[user_id]
            if now - t < self.window
        ]
        # Count distinct prefixes, including the one being checked.
        unique = {h for (h, _) in self.requests_per_user[user_id]} | {prefix_hash}
        if len(unique) > self.max_unique:
            return False  # too many distinct prefixes; block
        self.requests_per_user[user_id].append((prefix_hash, now))
        return True
A legitimate user generally submits a small number of distinct prefix shapes per minute. An attacker probing the cache submits many distinct, deliberately-different prefixes. Rate-limit on prefix-cardinality, not request count.
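A sketch of wiring the detector into the request path, hashing the cacheable prefix before the provider call; handle_request, call_llm, and the rejection behavior are illustrative names, not part of any provider API.

import hashlib

detector = CacheProbeDetector(window_seconds=60, max_unique_prefixes=10)

def handle_request(user_id: str, system_prefix: str, user_query: str):
    prefix_hash = hashlib.sha256(system_prefix.encode()).digest()
    if not detector.check(user_id, prefix_hash):
        # Likely cache probing: alert first, block only once the threshold is
        # tuned against real traffic (see Failure Modes).
        raise PermissionError("too many distinct prompt prefixes")
    return call_llm(system_prefix, user_query)  # provider call defined elsewhere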
Pattern 6: Cache-Hit Telemetry Hygiene
Cache hits are billed differently from cache misses; Anthropic, for example, reports cache_creation_input_tokens and cache_read_input_tokens separately in the response usage block. Logs and metrics that include these values can leak cache state.
# Bad: per-request metric exposes cache state.
metrics.observe(
    "llm_cache_read_tokens",
    response.usage.cache_read_input_tokens,
    tags=["user_id:" + user_id],
)

# Good: aggregate across users; keep cardinality coarse.
metrics.observe("llm_cache_read_tokens_total", response.usage.cache_read_input_tokens)
Don’t expose per-user cache statistics in operational dashboards visible across teams. Aggregates are fine; per-user breakdowns recreate the side-channel on an internal surface.
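One way to enforce this at the logging layer, assuming log records are carried as dicts before shipping; the field names match Anthropic’s usage block, and scrub_llm_log_record is an illustrative helper, not a library function.

SENSITIVE_USAGE_FIELDS = {"cache_creation_input_tokens", "cache_read_input_tokens"}

def scrub_llm_log_record(record: dict) -> dict:
    # Drop per-request cache-hit fields before the record leaves the service;
    # the aggregate counter from Pattern 6 is emitted separately.
    usage = record.get("usage", {})
    scrubbed = {k: v for k, v in usage.items() if k not in SENSITIVE_USAGE_FIELDS}
    return {**record, "usage": scrubbed}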
Pattern 7: Choose Cache Lifetime Carefully
Anthropic’s prompt caching defaults to a 5-minute TTL, refreshed on use, with an optional 1-hour extension. OpenAI typically evicts cached prefixes after 5-10 minutes of inactivity. Longer TTLs increase hit rates and cost savings but extend the window during which side-channel observation is possible.
For sensitive content, use the shortest TTL the provider offers. A cache entry that expires after five minutes bounds the attack window far more tightly than one that lives an hour.
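As a hedged example of what this looks like on the Anthropic API at the time of writing (the ttl field and any required beta header should be verified against current docs; the variable names are placeholders), the lifetime is selected per cache breakpoint:

# Default 5-minute TTL: the shorter window, preferred for sensitive prefixes.
sensitive_block = {
    "type": "text",
    "text": CONFIDENTIAL_SYSTEM_PROMPT,
    "cache_control": {"type": "ephemeral"},
}

# Opt-in 1-hour TTL: a larger attack window, acceptable only for public content.
public_block = {
    "type": "text",
    "text": LARGE_PUBLIC_DOCUMENT,
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}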
Expected Behaviour
| Signal | No mitigation | With per-tenant salting + telemetry hygiene |
|---|---|---|
| Cross-tenant timing observable | Yes (probing detects other tenants’ prefixes) | No (each tenant has independent cache namespace) |
| Cross-user (within tenant) timing observable | Yes | Depends on per-user salting (Pattern 2) |
| System-prompt content reconstructable | Possible via cumulative probing | Probe prefixes can’t match without knowing the salt |
| Cache hit rate | High (no salting) | Lower per user; zero cross-tenant hits by design |
| Cost saving from caching | Maximum | Reduced; trade-off vs. privacy |
| Probe detection | None | Block on prefix-cardinality threshold |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Per-tenant salt | Hard tenant isolation on a shared API key | Cache hits limited to within each tenant | High-volume tenants still benefit from the cache; add per-user salting where one tenant has many end-users. |
| Per-user salt within tenant | Strong cross-user isolation | First-request latency penalty per user | Acceptable for sensitive applications (healthcare, finance, legal). |
| Constant-time response padding | Eliminates timing channel completely | Loses interactive latency benefit | Use for non-interactive workloads only. |
| Selective cache_control placement | Cache only public-knowledge content | Cache hit rate lower than caching everything | The right default; treat user-specific content as never-cache. |
| Probe-rate detection | Catches active attacks | False positives on legitimate variability | Tune threshold per application; alert before block. |
| Telemetry hygiene | Eliminates the internal side-channel | Less granular observability | Aggregates are sufficient for operational purposes. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Salt accidentally visible to model | Cache-control marker influences output | Output contains salt-related artifacts | Use truly invisible markers (zero-width-space, HTML comments stripped by markdown). Test that model output is unchanged with vs. without salt. |
| Salt rotation breaks cache | Sudden cost spike on rotation event | Provider invoice shows cache-miss-only billing | Rotate salts during low-traffic windows; warm cache deliberately afterwards. |
| Provider changes cache-key derivation | Mitigation degrades silently | Cache hit rate changes shape | Subscribe to provider changelog; periodically validate isolation experimentally. |
| User submits salt as part of input | Cache key leaks via user input | Salt visible in logs | Sanitize user input — strip cache-control markers, comments, etc., before incorporation into prompts. |
| Constant-time padding too short | Some real responses exceed pad time | Padding ineffective for those requests | Set pad to slightly above worst-case observed latency; monitor and adjust. |
| Probe detector false positive | Legitimate user blocked | Support tickets | Lower threshold sensitivity; add appeal mechanism; alert before automatic block. |
| Logging accidentally captures cache-hit fields | Internal telemetry leak | Audit of dashboards reveals per-user cache stats | Strip cache_*_tokens fields at the logging layer; aggregate only. |