AI Agent Session Isolation in Multi-Tenant Platforms

The Problem

As LLM-based agents move from single-user deployments to shared platform infrastructure — think hosted coding assistants, customer-service bots, or internal enterprise AI portals where dozens of teams share one deployment — the isolation boundary between user sessions becomes a critical security property. Unlike a stateless API where each request is independent, agent platforms maintain state: conversation history, tool call results, memory stores, retrieved document chunks, and sometimes background task queues.

The failure modes are not hypothetical. Several early-generation hosted agent platforms shipped with bugs that allowed one user’s conversation history to leak into a subsequent user’s context — typically due to connection pool reuse or shared LRU caches. The attack surface is larger than most teams expect because isolation must hold across multiple layers:

Context window isolation. The most obvious layer: one user’s prompt-response pairs must not appear in another user’s context. Failures here are usually implementation bugs in session management — shared request objects, incorrect session ID scoping in async handlers, or race conditions in multi-request pipelines.

Memory store isolation. Agent frameworks that implement persistent memory (conversation summaries, user preferences, retrieved facts) store these in vector databases, key-value stores, or relational tables. If memory namespace partitioning uses a user identifier that is client-supplied and unvalidated, an attacker can craft a memory key that reads another user’s stored memories.

Tool call result isolation. When an agent invokes a tool (code execution, web search, file read), the result is placed into the context for the response. If tool invocation infrastructure is shared — a single code-execution sandbox serves multiple users — a side-channel in the sandbox state (temporary files, shared memory, environment variables) can leak information across requests.

Prompt cache side-channels. LLM inference infrastructure often caches prompt prefixes for efficiency. If that cache is shared across users, a request that hits a prefix cached from another user’s prompt returns measurably faster, so the requester can infer that the prefix exists and, by probing candidate prefixes, what it contains. This is the same class of vulnerability as timing side-channels in traditional caches.

Background task queue bleed. Agent platforms that run background tasks (summarisation, proactive retrieval, scheduled tool calls) must ensure that a background task initiated by user A does not write into user B’s context even if user B starts a session while the background task is still running.
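Background task queue bleed can be guarded against by binding each queued task to the user and session that created it, then re-checking that binding at write time. The following is a minimal sketch under stated assumptions: `BackgroundTask`, `ContextStore`, and `write_task_result` are hypothetical names, and the in-memory dictionaries stand in for whatever context storage the platform actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BackgroundTask:
    """A queued task that remembers which user and session created it."""
    task_id: str
    owner_user_id: str
    session_id: str


@dataclass
class ContextStore:
    """Hypothetical in-memory stand-in for the platform's context storage."""
    owners: dict[str, str] = field(default_factory=dict)         # session_id -> user_id
    entries: dict[str, list[str]] = field(default_factory=dict)  # session_id -> context items

    def open_session(self, session_id: str, user_id: str) -> None:
        self.owners[session_id] = user_id
        self.entries[session_id] = []


def write_task_result(store: ContextStore, task: BackgroundTask, result: str) -> bool:
    """Append a task result only to the session the task was created for.

    Returns False (and writes nothing) if the session is gone or is owned by
    a different user, so a late-finishing task cannot bleed into it.
    """
    if store.owners.get(task.session_id) != task.owner_user_id:
        return False
    store.entries[task.session_id].append(result)
    return True


store = ContextStore()
store.open_session("sess-a", "user-a")
store.open_session("sess-b", "user-b")

task = BackgroundTask(task_id="t1", owner_user_id="user-a", session_id="sess-a")
ok = write_task_result(store, task, "summary for A")

# A task whose recorded owner does not match the target session is rejected
mismatched = BackgroundTask(task_id="t2", owner_user_id="user-a", session_id="sess-b")
rejected = write_task_result(store, mismatched, "should not land in B")
```

The check happens at write time rather than at enqueue time because the ownership of a session can change, or the session can be closed, while the task is still running.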

Target systems: Hosted LLM agent platforms (hosted LangChain/LangGraph deployments, custom FastAPI/Django agent services); platforms using vector databases (Pinecone, Weaviate, Chroma) for agent memory; platforms with shared code execution sandboxes (Modal, E2B, custom Docker pools).

Threat Model

1. Authenticated user exploiting memory namespace bypass (regular platform user). Objective: craft a memory retrieval query or directly call the memory API with a forged namespace key to read another user’s conversation history or preferences. Impact: disclosure of other users’ private conversations; competitive intelligence in enterprise deployments; PII exposure.

2. Timing-based prompt cache inference (authenticated user with controlled timing). Objective: submit a prompt that partially overlaps with content a target user is known to have submitted; measure response latency to infer whether the partial prompt hit the cache, confirming content. Impact: confirm or deny that a specific user submitted a specific prompt; partial content reconstruction.

3. Shared sandbox side-channel (authenticated user with code execution tool access). Objective: execute code in a shared sandbox that reads /proc, /tmp, or environment variables left behind by a previous user’s execution. Impact: session tokens, API keys, or data from the previous user’s sandbox environment exposed.

4. Race condition in async session handler (automated high-frequency attacker). Objective: send concurrent requests timed to hit a race condition in session ID assignment or context buffer management; cause own context to bleed into target user’s response or vice versa. Impact: arbitrary cross-session context injection; responses containing another user’s private data.

The blast radius depends on what agents do: a customer service agent leaking conversation history is a GDPR incident; a coding assistant leaking one developer’s proprietary code to another developer is a material IP breach.

Hardening Configuration

Session ID Generation and Scoping

# Session ID must be cryptographically unpredictable and bound to authentication
import secrets
import hashlib

def create_session_id(authenticated_user_id: str) -> str:
    # Combine a server-side secret with user ID and random nonce
    # This prevents a user from guessing or iterating to another user's session ID
    nonce = secrets.token_bytes(32)
    server_secret = get_server_secret()  # from vault/KMS
    return hashlib.sha256(
        server_secret + authenticated_user_id.encode() + nonce
    ).hexdigest()

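# Session storage: key by the server-generated session_id, and never trust a
# session_id received from the client without re-checking ownership
class SessionStore:
    def __init__(self, redis_client):
        # Assumes the client was created with decode_responses=True; otherwise
        # hgetall returns bytes keys/values and the owner check below always fails
        self._redis = redis_client
        self._prefix = "session:"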

    def get(self, session_id: str, authenticated_user_id: str) -> dict | None:
        # Re-verify ownership before returning any session data
        data = self._redis.hgetall(f"{self._prefix}{session_id}")
        if not data:
            return None
        if data.get("owner_id") != authenticated_user_id:
            # Log this as a potential cross-session access attempt
            # (`log` is a structlog logger; log only a prefix of the session ID)
            log.warning(
                "cross_session_access_attempt",
                session_id_prefix=session_id[:8],
                claimed_user=authenticated_user_id,
                actual_owner=data.get("owner_id"),
            )
            return None
        return data
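
`SessionStore.get` assumes every session hash carries an `owner_id` field. A sketch of the matching write path follows, assuming a redis-py style client (`hset` with `mapping=`, plus `expire`). The `create_session` helper, the TTL value, and the `FakeRedis` stand-in used to exercise it are illustrative and not part of any library; the sketch also uses `secrets.token_hex` directly in place of `create_session_id` so that it stays self-contained.

```python
import secrets


class FakeRedis:
    """Tiny in-memory stand-in exposing the three redis-py calls used here."""

    def __init__(self):
        self._hashes: dict[str, dict[str, str]] = {}
        self.ttls: dict[str, int] = {}

    def hset(self, name: str, mapping: dict[str, str]) -> int:
        self._hashes.setdefault(name, {}).update(mapping)
        return len(mapping)

    def expire(self, name: str, seconds: int) -> bool:
        self.ttls[name] = seconds
        return True

    def hgetall(self, name: str) -> dict[str, str]:
        return dict(self._hashes.get(name, {}))


SESSION_TTL_SECONDS = 3600  # assumed idle lifetime; tune per deployment


def create_session(redis_client, authenticated_user_id: str) -> str:
    """Create a session whose owner is fixed server-side at creation time."""
    session_id = secrets.token_hex(32)  # 256 bits of randomness, 64 hex chars
    key = f"session:{session_id}"
    # owner_id comes from the verified identity, never from request input
    redis_client.hset(key, mapping={"owner_id": authenticated_user_id})
    redis_client.expire(key, SESSION_TTL_SECONDS)
    return session_id


r = FakeRedis()
sid = create_session(r, "user-a")
record = r.hgetall(f"session:{sid}")
```

Because the owner is written once by the server, the later ownership check compares two values the client never controlled.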

Memory Store Namespace Isolation

# Vector database (example: Chroma): enforce per-user collection namespacing
import hashlib
import secrets

class IsolatedMemoryStore:
    def __init__(self, chroma_client):
        self._client = chroma_client

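    def _collection_name(self, user_id: str) -> str:
        # user_id must come from the verified auth token, never from request input.
        # Hash it so the collection name has a fixed, safe character set;
        # 16 hex characters keep 64 bits of the digest.
        safe_id = hashlib.sha256(user_id.encode()).hexdigest()[:16]
        return f"user_{safe_id}_memory"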

    def store(self, user_id: str, content: str, metadata: dict) -> None:
        collection = self._client.get_or_create_collection(
            name=self._collection_name(user_id)
        )
        collection.add(
            documents=[content],
            metadatas=[{"user_id_hash": hashlib.sha256(user_id.encode()).hexdigest(),
                       **metadata}],
            ids=[secrets.token_hex(16)],
        )

    def retrieve(self, user_id: str, query: str, n: int = 5) -> list[str]:
        collection_name = self._collection_name(user_id)
        try:
            collection = self._client.get_collection(name=collection_name)
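        except Exception:
            # A missing collection raises (the exception type varies by Chroma
            # version); treated here as "user has no stored memories". Note that
            # this also hides connection errors, so narrow it where possible.
            return []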
        results = collection.query(query_texts=[query], n_results=n)
        # Verify metadata matches user before returning
        return [
            doc for doc, meta in zip(
                results["documents"][0], results["metadatas"][0]
            )
            if meta.get("user_id_hash") == hashlib.sha256(user_id.encode()).hexdigest()
        ]
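
`_collection_name` keeps 16 hex characters (64 bits) of the SHA-256 digest. As a rough check on how likely two users are to land in the same namespace, the birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) can be evaluated directly. This is a small sketch; the user counts are illustrative and `collision_probability` is a hypothetical helper.

```python
import math


def collision_probability(n_users: int, hash_bits: int) -> float:
    """Birthday-bound approximation of at least one collision among n_users."""
    exponent = n_users * (n_users - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x
    return -math.expm1(-exponent)


p_million = collision_probability(1_000_000, 64)
p_billion = collision_probability(1_000_000_000, 64)
```

At a million users the chance of any collision is on the order of 10^-8, and it grows to a few percent at a billion users, which is why the metadata check in `retrieve` is worth keeping as a backstop.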

Code Execution Sandbox Isolation

Each user’s code execution must run in an isolated environment with no shared filesystem state:

# Using E2B (or equivalent sandbox-as-a-service): create per-session sandboxes.
# The awaitable API lives on AsyncSandbox; the plain Sandbox class is synchronous.
from e2b_code_interpreter import AsyncSandbox

class IsolatedCodeExecutor:
    def __init__(self):
        self._sandboxes: dict[str, AsyncSandbox] = {}

    async def execute(self, session_id: str, code: str) -> str:
        if session_id not in self._sandboxes:
            # One sandbox per session; never reuse across sessions
            self._sandboxes[session_id] = await AsyncSandbox.create(
                timeout=120,      # sandbox lifetime in seconds
                metadata={"session_id": session_id}
            )
        sandbox = self._sandboxes[session_id]
        result = await sandbox.run_code(code)
        # result.text is None when the code produced no final result
        return result.text or ""

    async def close_session(self, session_id: str) -> None:
        if session_id in self._sandboxes:
            await self._sandboxes[session_id].kill()
            del self._sandboxes[session_id]
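
Two concurrent `execute` calls for the same new session can both pass the `not in` check and each create a sandbox, leaving one orphaned. A hedged sketch of one way to close that gap with a per-session `asyncio.Lock` follows; `PerSessionPool` and the `factory` callable are hypothetical and stand in for the sandbox creation call.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Generic, TypeVar

T = TypeVar("T")


class PerSessionPool(Generic[T]):
    """Creates at most one resource per session, even under concurrent calls."""

    def __init__(self, factory: Callable[[str], Awaitable[T]]):
        self._factory = factory
        self._items: dict[str, T] = {}
        self._locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def get(self, session_id: str) -> T:
        # Serialise creation per session; other sessions are not blocked
        async with self._locks[session_id]:
            if session_id not in self._items:
                self._items[session_id] = await self._factory(session_id)
            return self._items[session_id]


async def _demo() -> tuple[int, bool]:
    created = 0

    async def fake_factory(session_id: str) -> str:
        nonlocal created
        created += 1
        await asyncio.sleep(0.01)  # simulate slow sandbox start-up
        return f"sandbox-for-{session_id}"

    pool = PerSessionPool(fake_factory)
    results = await asyncio.gather(*(pool.get("s1") for _ in range(5)))
    return created, len(set(results)) == 1


created_count, all_same = asyncio.run(_demo())
```

Session teardown would also need to remove the entry and its lock; that part is omitted here to keep the sketch short.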

For self-hosted Docker-based sandboxes:

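# Each execution runs in a fresh container with no shared volumes.
# Flags, in order: no network access; read-only root filesystem; /tmp as a
# 64 MB tmpfs mounted noexec; memory and CPU caps; no privilege escalation;
# run as the unprivileged "nobody" user (65534).
# A comment cannot follow a line-continuation backslash, so they are kept here.
docker run --rm \
  --network=none \
  --read-only \
  --tmpfs /tmp:size=64m,noexec \
  --memory=512m \
  --cpus=0.5 \
  --security-opt=no-new-privileges \
  --user=65534:65534 \
  python:3.12-slim \
  python -c "${USER_CODE}"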
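
When the container is launched from Python rather than from a shell, passing the flags as an argument list and the user code over stdin avoids shell quoting entirely. This is a sketch assuming the Docker CLI is on `PATH`; `build_sandbox_argv` and `run_user_code` are hypothetical helper names, and the flags mirror the `docker run` command in this section.

```python
import subprocess


def build_sandbox_argv(image: str = "python:3.12-slim") -> list[str]:
    """Return the docker argv for one throwaway, locked-down execution."""
    return [
        "docker", "run", "--rm",
        "--interactive",                  # keep stdin open so code can be piped in
        "--network=none",
        "--read-only",
        "--tmpfs", "/tmp:size=64m,noexec",
        "--memory=512m",
        "--cpus=0.5",
        "--security-opt=no-new-privileges",
        "--user=65534:65534",
        image,
        "python", "-",                    # "-" makes Python read the program from stdin
    ]


def run_user_code(code: str, timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run untrusted code in a fresh container; no shell is involved."""
    return subprocess.run(
        build_sandbox_argv(),
        input=code,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


argv = build_sandbox_argv()
```

The `timeout` here only stops the local `docker` client process; a production version would also need to make sure the container itself is stopped and removed when the limit is hit.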

Preventing Prompt Cache Side-Channels

For platforms whose inference servers reuse KV-cache entries across requests (vLLM, TensorRT-LLM), make sure cached prefixes cannot be shared between users:

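# When submitting to vLLM through its OpenAI-compatible API: make the first
# tokens of every prompt session-specific so prefixes cannot match across sessions
import hashlib
from openai import AsyncOpenAI

# Assumed address of the local vLLM server; adjust for your deployment
openai_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")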

async def call_inference(session_id: str, messages: list[dict]) -> str:
    # Prepend a session-specific system message that breaks shared prefix caching
    # Use a deterministic but session-unique value (not random, to preserve
    # within-session caching)
    session_prefix = hashlib.sha256(
        f"session:{session_id}".encode()
    ).hexdigest()[:8]

    system_message = {
        "role": "system",
        "content": f"[Session {session_prefix}] You are a helpful assistant."
    }
    full_messages = [system_message] + messages

    response = await openai_client.chat.completions.create(
        model="hosted-model",
        messages=full_messages,
    )
    return response.choices[0].message.content

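For vLLM with prefix caching enabled, either partition the cache per user (recent vLLM releases document a per-request `cache_salt` for this purpose; verify that your version supports it) or disable prefix caching entirely in multi-tenant contexts where session isolation is paramount: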

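# vllm serve: disable prefix caching for the highest isolation guarantee.
# The flag spelling depends on the vLLM version; recent releases use
# --no-enable-prefix-caching (check --help on your installed version).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --no-enable-prefix-caching \
  --max-model-len 4096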
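
To check whether a deployment still exposes the timing signal described in the threat model, one can compare latencies for a prompt whose prefix was recently submitted from a different account against latencies for a fresh, random prefix. The sketch below covers only the comparison step; `suggests_shared_cache`, the sample values, and the 0.8 ratio threshold are illustrative assumptions, and real measurements need many samples under controlled load.

```python
from statistics import median


def suggests_shared_cache(
    fresh_latencies_ms: list[float],
    probe_latencies_ms: list[float],
    ratio_threshold: float = 0.8,
) -> bool:
    """Return True if probe requests are markedly faster than fresh ones.

    A probe median below ratio_threshold * fresh median is treated as a sign
    that the probe prefix was served from a cache populated by someone else.
    """
    if not fresh_latencies_ms or not probe_latencies_ms:
        raise ValueError("both sample lists must be non-empty")
    return median(probe_latencies_ms) < ratio_threshold * median(fresh_latencies_ms)


# Illustrative numbers only, not measurements
isolated = suggests_shared_cache([410.0, 395.0, 402.0], [405.0, 399.0, 412.0])
leaky = suggests_shared_cache([410.0, 395.0, 402.0], [180.0, 175.0, 190.0])
```

Medians are used rather than means so that a single slow request does not mask or fake the signal.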

Context Buffer Isolation in Async Handlers

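# FastAPI async handler: use contextvars to prevent context leakage
# across async boundaries
from contextvars import ContextVar
from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.security import OAuth2PasswordBearer

current_session_id: ContextVar[str] = ContextVar("current_session_id")
current_user_id: ContextVar[str] = ContextVar("current_user_id")

app = FastAPI()
# Assumed token endpoint; verify_token and validate_session_ownership are
# application-specific helpers not shown here
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")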

async def get_session(request: Request, token: str = Depends(oauth2_scheme)):
    user = verify_token(token)
    session_id = request.headers.get("X-Session-ID")
    # Reject a missing header as well as a session owned by someone else
    if session_id is None or not validate_session_ownership(session_id, user.id):
        raise HTTPException(status_code=403)

    # Set in contextvar, not in a shared mutable dict
    current_session_id.set(session_id)
    current_user_id.set(user.id)
    return session_id

@app.post("/chat")
async def chat(message: str, session_id: str = Depends(get_session)):
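    # All downstream functions use current_session_id.get()
    # Each request is handled in its own asyncio task, and each task runs in
    # its own copy of the context, so concurrent requests cannot see each
    # other's values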
    context = await load_context(current_session_id.get())
    response = await run_agent(message, context, current_user_id.get())
    await save_context(current_session_id.get(), context)
    return {"response": response}
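
The isolation property relied on here is that every asyncio task runs in its own copy of the context. The following self-contained demonstration is independent of FastAPI; the `handle` coroutine and the session names are illustrative.

```python
import asyncio
from contextvars import ContextVar

session_var: ContextVar[str] = ContextVar("session_var")


async def handle(session_id: str, delay: float) -> tuple[str, str]:
    """Set the variable, yield to other tasks, then read it back."""
    session_var.set(session_id)
    await asyncio.sleep(delay)  # other tasks run and set their own value here
    return session_id, session_var.get()


async def main() -> list[tuple[str, str]]:
    # gather() wraps each coroutine in its own task, so each gets a context copy
    return await asyncio.gather(
        handle("session-a", 0.02),
        handle("session-b", 0.01),
        handle("session-c", 0.0),
    )


results = asyncio.run(main())
```

Each task reads back the value it set, even though the three tasks interleave. Coroutines that are awaited directly inside one task share that task's context, which is why the guarantee is per task rather than per coroutine.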

Audit Logging for Cross-Session Access Attempts

from datetime import datetime, timezone

import structlog

log = structlog.get_logger()

# increment_counter, get_counter and alert_security_team are placeholders for
# your metrics and paging integrations
def audit_session_access(
    requesting_user: str,
    session_id: str,
    access_granted: bool,
    reason: str
) -> None:
    log.info(
        "session_access_audit",
        requesting_user=requesting_user,
        session_id_prefix=session_id[:8],   # Don't log full session ID
        access_granted=access_granted,
        reason=reason,
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    if not access_granted:
        # Alert once a user exceeds 5 failed attempts
        increment_counter(f"cross_session_attempt:{requesting_user}")
        if get_counter(f"cross_session_attempt:{requesting_user}") > 5:
            alert_security_team(requesting_user, "repeated_cross_session_attempts")
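
The `increment_counter` and `get_counter` helpers are not defined in this section. One possible shape is sketched here as an in-memory sliding window, so that old attempts age out instead of accumulating forever. The `SlidingWindowCounter` class and the 10-minute window are assumptions, and a multi-process deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the last window_seconds."""

    def __init__(self, window_seconds: float = 600.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._events: defaultdict[str, deque[float]] = defaultdict(deque)

    def _prune(self, key: str) -> None:
        # Drop timestamps that have fallen out of the window
        cutoff = self._clock() - self._window
        events = self._events[key]
        while events and events[0] <= cutoff:
            events.popleft()

    def increment(self, key: str) -> int:
        self._prune(key)
        self._events[key].append(self._clock())
        return len(self._events[key])

    def get(self, key: str) -> int:
        self._prune(key)
        return len(self._events[key])


# Deterministic fake clock so the example does not depend on real time
now = [0.0]
counter = SlidingWindowCounter(window_seconds=600.0, clock=lambda: now[0])

for _ in range(6):
    counter.increment("cross_session_attempt:user-a")
count_in_window = counter.get("cross_session_attempt:user-a")

now[0] = 601.0  # move past the window; earlier attempts expire
count_after_window = counter.get("cross_session_attempt:user-a")
```

With a window in place, the threshold of 5 means "more than 5 failures within 10 minutes" rather than "more than 5 failures ever", which reduces alerts on users who mistype a session reference occasionally.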

Expected Behaviour After Hardening

Scenario	Before Hardening	After Hardening
User A queries memory store with User B’s namespace key	Returns User B’s memories if namespace not validated	Namespace validated against authenticated user; returns empty
Code execution in shared Docker container	`/tmp` from previous execution visible	Fresh container per session; no shared filesystem
Prompt cache hit on another user’s prefix	Response latency reveals prefix overlap	Per-session system prefix prevents cross-user cache hits
Race condition in async handler	Session context from concurrent request leaks	ContextVar isolation; each request task has its own context copy
Repeated cross-session access attempts	No detection	Alert fires once a user exceeds 5 failed attempts; security team notified

Trade-offs and Operational Considerations

Aspect	Benefit	Cost	Mitigation
Per-session sandboxes	Complete execution isolation	Higher latency (sandbox creation overhead); resource cost	Pre-warm sandbox pools; impose session timeout
Disabled prefix caching	Eliminates cache side-channel	Increased inference latency; higher compute cost per request	Re-enable for non-sensitive internal applications only; benchmark impact
Per-user vector DB collections	Strong memory isolation	Increased vector DB collection count at scale	Use Chroma/Weaviate tenant features for higher-scale multi-tenancy
ContextVar for async isolation	Prevents context bleed in Python async	Requires care with threading (ContextVars don’t propagate to manually started threads)	Use `copy_context().run()` when spawning threads from async code, or `asyncio.to_thread()`, which propagates the current context
Owner re-verification on every session read	Prevents IDOR on session objects	Additional Redis/DB read per request	Cache verification result in JWT with short TTL
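
The threading caveat in the table can be seen directly: on standard CPython builds a thread started with `threading.Thread` begins with an empty context, while running the target through `copy_context().run` carries the caller's values across. This is a minimal demonstration; the variable names are illustrative.

```python
import threading
from contextvars import ContextVar, copy_context

session_var: ContextVar[str] = ContextVar("session_var", default="<unset>")
seen: dict[str, str] = {}


def record(label: str) -> None:
    """Store whatever value of session_var this thread can see."""
    seen[label] = session_var.get()


session_var.set("session-a")

# Plain thread: starts with a fresh context, so the value is not visible
plain = threading.Thread(target=record, args=("plain",))
plain.start()
plain.join()

# Thread whose target runs inside a copy of the caller's context
ctx = copy_context()
wrapped = threading.Thread(target=ctx.run, args=(record, "copied"))
wrapped.start()
wrapped.join()
```

The default value is used here only to make the difference observable; in request-handling code a missing value should be treated as an error rather than silently replaced.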

Failure Modes

Failure	Symptom	Detection	Recovery
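ContextVar not set in middleware	Downstream code gets `LookupError` or uses default value	Unit tests with concurrent requests; error logs	Set the ContextVar in a dependency applied to every route; if a default is added, make it a sentinel that fails closed rather than a usable session; validate in integration tests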
Sandbox pool exhaustion	Code execution fails for new sessions	Sandbox creation timeout in logs; user-facing “service unavailable”	Increase pool size; implement session eviction for idle sessions
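Memory namespace collision (hash collision)	Extremely rare: two users share a namespace	Metadata `user_id_hash` check in `retrieve` filters out the other user’s documents; log any mismatch	Lengthen the namespace hash beyond the truncated 64 bits (up to the full 256-bit digest, subject to the store’s collection-name length limit); otherwise accept that collision probability is negligible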
Cross-session audit alert storm	Many false positives during legitimate load testing	Alert volume spike	Implement rate limiting on alerts; suppress during known load test windows
Session ownership validation bypass via SQL injection in session lookup	Attacker accesses arbitrary sessions	Anomalous data access in audit log	Use parameterised queries; input validation on session_id format
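
Input validation on the session_id format can be as simple as an anchored pattern match before the value reaches any lookup. This sketch assumes session IDs are the 64-character lowercase hex digests produced by `create_session_id`; `is_valid_session_id` is a hypothetical helper.

```python
import re

# Exactly 64 lowercase hex characters (a SHA-256 hexdigest); an assumption
# based on create_session_id returning hashlib.sha256(...).hexdigest()
_SESSION_ID_RE = re.compile(r"[0-9a-f]{64}")


def is_valid_session_id(value: object) -> bool:
    """Reject anything that is not a well-formed session ID string."""
    return isinstance(value, str) and _SESSION_ID_RE.fullmatch(value) is not None


good = is_valid_session_id("a" * 64)
too_short = is_valid_session_id("abc123")
injection = is_valid_session_id("' OR '1'='1")
trailing_newline = is_valid_session_id("a" * 64 + "\n")
```

`fullmatch` is used instead of `match` with `$` so that a trailing newline is rejected as well. Format validation complements parameterised queries; it does not replace them.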