AI Agent Session Isolation in Multi-Tenant Platforms
The Problem
As LLM-based agents move from single-user deployments to shared platform infrastructure — think hosted coding assistants, customer-service bots, or internal enterprise AI portals where dozens of teams share one deployment — the isolation boundary between user sessions becomes a critical security property. Unlike a stateless API where each request is independent, agent platforms maintain state: conversation history, tool call results, memory stores, retrieved document chunks, and sometimes background task queues.
The failure modes are not hypothetical. Several early-generation hosted agent platforms shipped with bugs that allowed one user’s conversation history to leak into a subsequent user’s context — typically due to connection pool reuse or shared LRU caches. The attack surface is larger than most teams expect because isolation must hold across multiple layers:
Context window isolation. The most obvious layer: one user’s prompt-response pairs must not appear in another user’s context. Failures here are usually implementation bugs in session management — shared request objects, incorrect session ID scoping in async handlers, or race conditions in multi-request pipelines.
Memory store isolation. Agent frameworks that implement persistent memory (conversation summaries, user preferences, retrieved facts) store these in vector databases, key-value stores, or relational tables. If memory namespace partitioning uses a user identifier that is client-supplied and unvalidated, an attacker can craft a memory key that reads another user’s stored memories.
Tool call result isolation. When an agent invokes a tool (code execution, web search, file read), the result is placed into the context for the response. If tool invocation infrastructure is shared — a single code-execution sandbox serves multiple users — a side-channel in the sandbox state (temporary files, shared memory, environment variables) can leak information across requests.
Prompt cache side-channels. LLM inference infrastructure caches prompt prefixes for efficiency. When a user's request hits a cached prefix that another user submitted earlier, the response arrives measurably faster, which confirms that the prefix exists and reveals its content up to the point of overlap. This is the same class of vulnerability as timing side-channels in traditional caches.
Background task queue bleed. Agent platforms that run background tasks (summarisation, proactive retrieval, scheduled tool calls) must ensure that a background task initiated by user A does not write into user B’s context even if user B starts a session while the background task is still running.
Target systems: Hosted LLM agent platforms (LangChain/LangGraph hosted, custom FastAPI/Django agent services); platforms using vector databases (Pinecone, Weaviate, Chroma) for agent memory; platforms with shared code execution sandboxes (Modal, E2B, custom Docker pools).
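The background task queue bleed described above is usually avoided by stamping each task with its owning user and session at enqueue time, then re-checking that binding when the result is written. The following is a minimal in-memory sketch of that idea; `BackgroundTask`, `ContextWriter`, and the dictionary-backed stores are hypothetical names for illustration, not part of any framework named here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackgroundTask:
    """A unit of deferred work, bound to its owner when it is enqueued."""
    owner_id: str
    session_id: str
    payload: str


class ContextWriter:
    """Writes task results into session context, but only for the owning session."""

    def __init__(self, session_owners: dict[str, str]):
        # session_id -> owner_id, as recorded by the (assumed) session store
        self._session_owners = session_owners
        self.contexts: dict[str, list[str]] = {}

    def write_result(self, task: BackgroundTask, result: str) -> bool:
        # Re-check the binding at write time: the session may have been closed
        # or reassigned while the task was running.
        if self._session_owners.get(task.session_id) != task.owner_id:
            return False  # drop the result rather than write it anywhere else
        self.contexts.setdefault(task.session_id, []).append(result)
        return True


owners = {"sess-a": "alice", "sess-b": "bob"}
writer = ContextWriter(owners)

ok = writer.write_result(BackgroundTask("alice", "sess-a", "summarise"), "summary for alice")
# A task enqueued by alice that targets bob's session is rejected.
leaked = writer.write_result(BackgroundTask("alice", "sess-b", "summarise"), "summary for alice")
```

Dropping a mismatched result, rather than redirecting it, keeps the failure closed: a lost summary is recoverable, a misdelivered one is not.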
Threat Model
1. Authenticated user exploiting memory namespace bypass (regular platform user). Objective: craft a memory retrieval query or directly call the memory API with a forged namespace key to read another user’s conversation history or preferences. Impact: disclosure of other users’ private conversations; competitive intelligence in enterprise deployments; PII exposure.
2. Timing-based prompt cache inference (authenticated user with controlled timing). Objective: submit a prompt that partially overlaps with content a target user is known to have submitted; measure response latency to infer whether the partial prompt hit the cache, confirming content. Impact: confirm or deny that a specific user submitted a specific prompt; partial content reconstruction.
3. Shared sandbox side-channel (authenticated user with code execution tool access). Objective: execute code in a shared sandbox that reads /proc, /tmp, or environment variables left behind by a previous user’s execution. Impact: session tokens, API keys, or data from the previous user’s sandbox environment exposed.
4. Race condition in async session handler (automated high-frequency attacker). Objective: send concurrent requests timed to hit a race condition in session ID assignment or context buffer management; cause own context to bleed into target user’s response or vice versa. Impact: arbitrary cross-session context injection; responses containing another user’s private data.
The blast radius depends on what agents do: a customer service agent leaking conversation history is a GDPR incident; a coding assistant leaking one developer’s proprietary code to another developer is a material IP breach.
Hardening Configuration
Session ID Generation and Scoping
# Session ID must be cryptographically unpredictable and bound to authentication
import hashlib
import secrets

def create_session_id(authenticated_user_id: str) -> str:
    # Combine a server-side secret with the user ID and a random nonce.
    # This prevents a user from guessing or iterating to another user's session ID.
    nonce = secrets.token_bytes(32)
    server_secret = get_server_secret()  # bytes, from vault/KMS (helper not shown)
    return hashlib.sha256(
        server_secret + authenticated_user_id.encode() + nonce
    ).hexdigest()
# Session storage: always key by session_id, never by a user-supplied value
import structlog

log = structlog.get_logger()

class SessionStore:
    def __init__(self, redis_client):
        # Assumes a client created with decode_responses=True so hash fields come
        # back as str; with bytes keys the owner check below would never match.
        self._redis = redis_client
        self._prefix = "session:"

    def get(self, session_id: str, authenticated_user_id: str) -> dict | None:
        # Re-verify ownership before returning any session data
        data = self._redis.hgetall(f"{self._prefix}{session_id}")
        if not data:
            return None
        if data.get("owner_id") != authenticated_user_id:
            # Log this as a potential cross-session access attempt
            log.warning(
                "cross_session_access_attempt",
                session_id=session_id,
                claimed_user=authenticated_user_id,
                actual_owner=data.get("owner_id"),
            )
            return None
        return data
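Because `create_session_id` above returns a SHA-256 hex digest, a well-formed session ID is exactly 64 lowercase hexadecimal characters. Rejecting anything else before it reaches the store is a cheap extra guard. The sketch below assumes that format; `is_well_formed_session_id` and `session_key` are hypothetical helper names.

```python
import re

# A SHA-256 hex digest: exactly 64 lowercase hexadecimal characters.
_SESSION_ID_RE = re.compile(r"[0-9a-f]{64}")


def is_well_formed_session_id(session_id: object) -> bool:
    """Return True only for strings that look like a SHA-256 hex digest."""
    return isinstance(session_id, str) and _SESSION_ID_RE.fullmatch(session_id) is not None


def session_key(session_id: str, prefix: str = "session:") -> str:
    """Build the storage key, refusing malformed IDs instead of passing them on."""
    if not is_well_formed_session_id(session_id):
        raise ValueError("malformed session id")
    return f"{prefix}{session_id}"


valid_id = "ab" * 32
key = session_key(valid_id)
```

A format check of this kind does not replace the ownership check in `SessionStore.get`; it only narrows what can reach the lookup.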
Memory Store Namespace Isolation
# Vector database (example: Chroma): enforce per-user collection namespacing
import hashlib
import secrets

class IsolatedMemoryStore:
    def __init__(self, chroma_client):
        self._client = chroma_client

    @staticmethod
    def _user_hash(user_id: str) -> str:
        return hashlib.sha256(user_id.encode()).hexdigest()

    def _collection_name(self, user_id: str) -> str:
        # Never use user_id directly as a collection name: sanitise and prefix
        safe_id = self._user_hash(user_id)[:16]
        return f"user_{safe_id}_memory"

    def store(self, user_id: str, content: str, metadata: dict) -> None:
        collection = self._client.get_or_create_collection(
            name=self._collection_name(user_id)
        )
        collection.add(
            documents=[content],
            # The owner hash goes last so caller-supplied metadata cannot override it
            metadatas=[{**metadata, "user_id_hash": self._user_hash(user_id)}],
            ids=[secrets.token_hex(16)],
        )

    def retrieve(self, user_id: str, query: str, n: int = 5) -> list[str]:
        try:
            collection = self._client.get_collection(name=self._collection_name(user_id))
        except Exception:
            return []  # User has no stored memories (fails closed on any lookup error)
        results = collection.query(query_texts=[query], n_results=n)
        # Verify metadata matches user before returning
        expected = self._user_hash(user_id)
        return [
            doc
            for doc, meta in zip(results["documents"][0], results["metadatas"][0])
            if meta.get("user_id_hash") == expected
        ]
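The `_collection_name` helper above keeps 16 hex characters, which is 64 bits of the SHA-256 digest. Under the usual birthday approximation, the probability that any two of n users collide in a b-bit namespace is about 1 - exp(-n(n-1) / 2^(b+1)). The sketch below evaluates that approximation; the user counts are illustrative assumptions, not figures from any deployment.

```python
import math


def collision_probability(n_users: int, bits: int) -> float:
    """Birthday approximation: chance that at least two of n_users share a hash."""
    exponent = -n_users * (n_users - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)


p_million = collision_probability(1_000_000, 64)      # 16 hex chars, as above
p_billion = collision_probability(1_000_000_000, 64)
p_full = collision_probability(1_000_000_000, 256)    # full SHA-256 digest
```

At a million users the 64-bit namespace gives a collision probability on the order of 10^-8. At a billion users it reaches a few percent, which is the scale at which a longer namespace hash starts to matter.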
Code Execution Sandbox Isolation
Each user’s code execution must run in an isolated environment with no shared filesystem state:
# Using E2B (or an equivalent sandbox-as-a-service): create per-session sandboxes
from e2b_code_interpreter import AsyncSandbox  # async client; plain Sandbox is synchronous

class IsolatedCodeExecutor:
    def __init__(self):
        self._sandboxes: dict[str, AsyncSandbox] = {}

    async def execute(self, session_id: str, code: str) -> str:
        if session_id not in self._sandboxes:
            # One sandbox per session; never reuse across sessions
            self._sandboxes[session_id] = await AsyncSandbox.create(
                timeout=120,  # sandbox lifetime in seconds
                metadata={"session_id": session_id},
            )
        sandbox = self._sandboxes[session_id]
        result = await sandbox.run_code(code)
        return result.text or ""  # text is None when the cell produced no result

    async def close_session(self, session_id: str) -> None:
        # Remove the entry first so a failed kill() cannot leave a stale reference
        sandbox = self._sandboxes.pop(session_id, None)
        if sandbox is not None:
            await sandbox.kill()
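Per-session sandboxes accumulate unless idle sessions are closed. A minimal sketch of idle tracking follows; it is independent of any sandbox SDK, and `IdleTracker`, the 120-second default, and the injectable clock are assumptions for illustration. The caller would pass each expired session ID to `close_session`.

```python
import time
from typing import Callable


class IdleTracker:
    """Tracks last-use times so idle sessions can be closed and their sandboxes killed."""

    def __init__(self, idle_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._last_used: dict[str, float] = {}

    def touch(self, session_id: str) -> None:
        # Call on every execute() so active sessions are never evicted
        self._last_used[session_id] = self._clock()

    def pop_expired(self) -> list[str]:
        # Return and forget every session that has been idle for too long
        now = self._clock()
        expired = [sid for sid, t in self._last_used.items()
                   if now - t >= self._idle_seconds]
        for sid in expired:
            del self._last_used[sid]
        return expired


fake_now = [0.0]
tracker = IdleTracker(idle_seconds=120.0, clock=lambda: fake_now[0])
tracker.touch("sess-a")
fake_now[0] = 100.0
tracker.touch("sess-b")
fake_now[0] = 130.0
expired = tracker.pop_expired()
```

Injecting the clock keeps the eviction logic testable without sleeping; production code would rely on the default `time.monotonic`.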
For self-hosted Docker-based sandboxes:
# Each execution runs in a fresh container with no shared volumes:
#   --network=none        no network access
#   --read-only           read-only root filesystem
#   --tmpfs /tmp:...      private temp filesystem, no execute
#   --user=65534:65534    nobody user
docker run --rm \
  --network=none \
  --read-only \
  --tmpfs /tmp:size=64m,noexec \
  --memory=512m \
  --cpus=0.5 \
  --security-opt=no-new-privileges \
  --user=65534:65534 \
  python:3.12-slim \
  python -c "${USER_CODE}"
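When the container is launched from Python rather than a shell, passing the command as an argument list avoids shell interpolation of user code entirely. The sketch below only builds the argument vector, mirroring the flags above; `build_docker_argv` is a hypothetical helper, and the caller would hand the list to `subprocess.run` with a timeout.

```python
def build_docker_argv(user_code: str, image: str = "python:3.12-slim") -> list[str]:
    """Build a `docker run` argument list for one isolated execution."""
    return [
        "docker", "run", "--rm",
        "--network=none",                    # no network access
        "--read-only",                       # read-only root filesystem
        "--tmpfs", "/tmp:size=64m,noexec",   # private temp filesystem, no execute
        "--memory=512m",
        "--cpus=0.5",
        "--security-opt=no-new-privileges",
        "--user=65534:65534",                # nobody user
        image,
        # user_code is a single argv element, so no shell ever parses it
        "python", "-c", user_code,
    ]


argv = build_docker_argv('print("hi"); import os')
```

Because the user code travels as one list element, quotes and semicolons inside it reach the container's Python interpreter unchanged and are never evaluated by a host shell.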
Preventing Prompt Cache Side-Channels
For platforms using KV-cache on inference servers (vLLM, TensorRT-LLM), ensure per-user cache partitioning:
# When submitting to an OpenAI-compatible vLLM endpoint: include a per-session
# prefix that prevents cross-session cache hits
import hashlib

async def call_inference(session_id: str, messages: list[dict]) -> str:
    # Prepend a session-specific system message that breaks shared prefix caching.
    # Use a deterministic but session-unique value (not random, to preserve
    # within-session caching).
    session_prefix = hashlib.sha256(
        f"session:{session_id}".encode()
    ).hexdigest()[:8]
    system_message = {
        "role": "system",
        "content": f"[Session {session_prefix}] You are a helpful assistant.",
    }
    full_messages = [system_message] + messages
    # openai_client: an openai.AsyncOpenAI instance pointed at the inference server
    response = await openai_client.chat.completions.create(
        model="hosted-model",
        messages=full_messages,
    )
    return response.choices[0].message.content
For vLLM with prefix caching enabled, either partition the cache per user or disable prefix caching in multi-tenant contexts where session isolation is paramount:
# vllm serve: disable prefix caching for the highest isolation guarantee.
# The flag name depends on the vLLM release (recent releases use
# --no-enable-prefix-caching); check --help for the installed version.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --no-enable-prefix-caching \
  --max-model-len 4096
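To see why a session-specific prefix prevents cross-session hits, it helps to model how a block-based prefix cache identifies reusable work. The sketch below is a simplified model, not vLLM's actual implementation: it assumes whitespace-separated words stand in for tokens, a block size of 4, and a block hash chained to the hash of the preceding block, so a block is reusable only if everything before it matches.

```python
import hashlib


def block_hashes(tokens: list[str], block_size: int = 4) -> list[str]:
    """Hash each full block together with its parent hash (simplified model)."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = " ".join(tokens[start:start + block_size])
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_blocks(a: str, b: str) -> int:
    """Count the leading blocks two prompts have in common (potential cache hits)."""
    count = 0
    for ha, hb in zip(block_hashes(a.split()), block_hashes(b.split())):
        if ha != hb:
            break
        count += 1
    return count


system = "You are a helpful assistant for the finance team"
secret = "summarise the draft merger agreement with Acme"

# Without a session prefix, an attacker who guesses the same text shares every block.
unsalted = shared_blocks(f"{system} {secret}", f"{system} {secret}")
# With distinct per-session prefixes, the very first block already differs.
salted = shared_blocks(f"[Session 1a2b3c4d] {system} {secret}",
                       f"[Session 9f8e7d6c] {system} {secret}")
```

In this model the unsalted prompts share all four blocks while the salted ones share none, which is the property the per-session system message relies on.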
Context Buffer Isolation in Async Handlers
# FastAPI async handler: use contextvars to prevent context leakage
# across async boundaries
from contextvars import ContextVar
from fastapi import Depends, FastAPI, HTTPException, Request

# oauth2_scheme, verify_token, validate_session_ownership, load_context,
# run_agent and save_context are application helpers defined elsewhere.
current_session_id: ContextVar[str] = ContextVar("current_session_id")
current_user_id: ContextVar[str] = ContextVar("current_user_id")
app = FastAPI()

async def get_session(request: Request, token: str = Depends(oauth2_scheme)):
    user = verify_token(token)
    session_id = request.headers.get("X-Session-ID")
    # Reject a missing header as well as a session owned by someone else
    if not session_id or not validate_session_ownership(session_id, user.id):
        raise HTTPException(status_code=403)
    # Set in contextvar, not in a shared mutable dict
    current_session_id.set(session_id)
    current_user_id.set(user.id)
    return session_id

@app.post("/chat")
async def chat(message: str, session_id: str = Depends(get_session)):
    # All downstream functions use current_session_id.get()
    # Each request runs in its own asyncio task, and each task sees
    # its own copy of the context
    context = await load_context(current_session_id.get())
    response = await run_agent(message, context, current_user_id.get())
    await save_context(current_session_id.get(), context)
    return {"response": response}
Audit Logging for Cross-Session Access Attempts
from datetime import datetime, timezone

import structlog

log = structlog.get_logger()

def audit_session_access(
    requesting_user: str,
    session_id: str,
    access_granted: bool,
    reason: str,
) -> None:
    log.info(
        "session_access_audit",
        requesting_user=requesting_user,
        session_id_prefix=session_id[:8],  # Don't log full session ID
        access_granted=access_granted,
        reason=reason,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    if not access_granted:
        # Alert on repeated failed attempts. increment_counter, get_counter and
        # alert_security_team are application helpers, not shown here.
        increment_counter(f"cross_session_attempt:{requesting_user}")
        if get_counter(f"cross_session_attempt:{requesting_user}") > 5:
            alert_security_team(requesting_user, "repeated_cross_session_attempts")
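The `increment_counter` and `get_counter` helpers above are left undefined. One way to back them is a sliding-window counter, so that old failures age out instead of accumulating forever. The sketch below is in-memory and single-process; `FailedAttemptWindow`, the 300-second window, and the injectable clock are assumptions, and a shared store such as Redis would be needed across workers.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class FailedAttemptWindow:
    """Counts failed access attempts per user within a sliding time window."""

    def __init__(self, threshold: int = 5, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def record_failure(self, user_id: str) -> bool:
        """Record one failure; return True when it brings the in-window count to threshold + 1."""
        now = self._clock()
        events = self._events[user_id]
        # Drop failures that have aged out of the window
        while events and now - events[0] > self._window:
            events.popleft()
        events.append(now)
        return len(events) == self._threshold + 1


now = [0.0]
window = FailedAttemptWindow(threshold=5, window_seconds=300.0, clock=lambda: now[0])
alerts = [window.record_failure("mallory") for _ in range(7)]
now[0] = 1000.0  # all earlier failures have aged out
after_quiet_period = window.record_failure("mallory")
```

Returning True only on the crossing, rather than on every failure above the threshold, also limits alert volume during a sustained burst.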
Expected Behaviour After Hardening
| Scenario | Before Hardening | After Hardening |
|---|---|---|
| User A queries memory store with User B’s namespace key | Returns User B’s memories if namespace not validated | Namespace validated against authenticated user; returns empty |
| Code execution in shared Docker container | /tmp from previous execution visible | Fresh container per session; no shared filesystem |
| Prompt cache hit on another user’s prefix | Response latency reveals prefix overlap | Per-session system prefix prevents cross-user cache hits |
| Race condition in async handler | Session context from concurrent request leaks | ContextVar isolation; each request's asyncio task has its own copy of the context |
| Repeated cross-session access attempts | No detection | Counter alert fires once a user exceeds 5 failed attempts; security team notified |
Trade-offs and Operational Considerations
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Per-session sandboxes | Complete execution isolation | Higher latency (sandbox creation overhead); resource cost | Pre-warm sandbox pools; impose session timeout |
| Disabled prefix caching | Eliminates cache side-channel | Increased inference latency; higher compute cost per request | Re-enable for non-sensitive internal applications only; benchmark impact |
| Per-user vector DB collections | Strong memory isolation | Increased vector DB collection count at scale | Use Chroma/Weaviate tenant features for higher-scale multi-tenancy |
| ContextVar for async isolation | Prevents context bleed in Python async | Requires care with threading (ContextVars don’t propagate to threads) | Use copy_context().run() when spawning threads from async code |
| Owner re-verification on every session read | Prevents IDOR on session objects | Additional Redis/DB read per request | Cache verification result in JWT with short TTL |
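The ContextVar row above notes that values set in async code are not automatically visible in manually spawned threads. A minimal sketch of the `copy_context().run()` mitigation follows; the variable and function names are illustrative.

```python
import threading
from contextvars import ContextVar, copy_context

current_session_id: ContextVar[str] = ContextVar("current_session_id")


def read_session(out: list[str]) -> None:
    # Runs inside the worker thread; records what the thread can see
    out.append(current_session_id.get("<unset>"))


current_session_id.set("sess-a")

# Snapshot the caller's context and run the thread body inside that snapshot.
ctx = copy_context()
seen: list[str] = []
worker = threading.Thread(target=ctx.run, args=(read_session, seen))
worker.start()
worker.join()
```

Inside asyncio code, `asyncio.to_thread` copies the current context automatically, so the explicit snapshot is only needed for threads created by hand.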
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| ContextVar not set in middleware | Downstream code gets LookupError or uses default value | Unit tests with concurrent requests; error logs | Add default value to ContextVar; validate in integration tests |
| Sandbox pool exhaustion | Code execution fails for new sessions | Sandbox creation timeout in logs; user-facing “service unavailable” | Increase pool size; implement session eviction for idle sessions |
| Memory namespace collision (hash collision) | Extremely rare: two users share a namespace | Metadata user_id_hash check catches it; log the mismatch | Increase the namespace hash from the truncated 64 bits to the full 256-bit digest, or accept that the collision probability is negligible |
| Cross-session audit alert storm | Many false positives during legitimate load testing | Alert volume spike | Implement rate limiting on alerts; suppress during known load test windows |
| Session ownership validation bypass via SQL injection in session lookup | Attacker accesses arbitrary sessions | Anomalous data access in audit log | Use parameterised queries; input validation on session_id format |
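The "unit tests with concurrent requests" detection step can be sketched without a web framework: run many simulated handlers concurrently, force interleaving with `await asyncio.sleep(0)`, and assert that each one still reads its own session ID. `handle_request` below is a hypothetical stand-in for a real handler.

```python
import asyncio
from contextvars import ContextVar

current_session_id: ContextVar[str] = ContextVar("current_session_id")


async def handle_request(session_id: str) -> tuple[str, str]:
    """Simulated handler: set the session, yield to other tasks, then read it back."""
    current_session_id.set(session_id)
    for _ in range(3):
        # Yield so other tasks run in between and would overwrite any shared state
        await asyncio.sleep(0)
    return session_id, current_session_id.get()


async def run_concurrent(n: int) -> list[tuple[str, str]]:
    # gather() wraps each coroutine in its own task, and each task gets a context copy
    return await asyncio.gather(*(handle_request(f"sess-{i}") for i in range(n)))


results = asyncio.run(run_concurrent(50))
mismatches = [(want, got) for want, got in results if want != got]
```

Replacing the ContextVar with a module-level variable makes the same test fail, which is a quick way to confirm that the test detects the bug it is meant to catch.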