AI Agent Session Isolation in Multi-Tenant Platforms

The Problem

As LLM-based agents move from single-user deployments to shared platform infrastructure — think hosted coding assistants, customer-service bots, or internal enterprise AI portals where dozens of teams share one deployment — the isolation boundary between user sessions becomes a critical security property. Unlike a stateless API where each request is independent, agent platforms maintain state: conversation history, tool call results, memory stores, retrieved document chunks, and sometimes background task queues.

The failure modes are not hypothetical. Several early-generation hosted agent platforms shipped with bugs that allowed one user’s conversation history to leak into a subsequent user’s context — typically due to connection pool reuse or shared LRU caches. The attack surface is larger than most teams expect because isolation must hold across multiple layers:

Context window isolation. The most obvious layer: one user’s prompt-response pairs must not appear in another user’s context. Failures here are usually implementation bugs in session management — shared request objects, incorrect session ID scoping in async handlers, or race conditions in multi-request pipelines.

Memory store isolation. Agent frameworks that implement persistent memory (conversation summaries, user preferences, retrieved facts) store these in vector databases, key-value stores, or relational tables. If memory namespace partitioning uses a user identifier that is client-supplied and unvalidated, an attacker can craft a memory key that reads another user’s stored memories.

Tool call result isolation. When an agent invokes a tool (code execution, web search, file read), the result is placed into the context for the response. If tool invocation infrastructure is shared — a single code-execution sandbox serves multiple users — a side-channel in the sandbox state (temporary files, shared memory, environment variables) can leak information across requests.

Prompt cache side-channels. LLM inference infrastructure often caches prompt prefixes for efficiency. If a request is served faster because its prefix matches one cached from another user’s request, the latency difference confirms that the other user submitted that prefix, and repeated probing can reconstruct part of its content. This is the same class of vulnerability as timing side-channels in traditional caches.

Background task queue bleed. Agent platforms that run background tasks (summarisation, proactive retrieval, scheduled tool calls) must ensure that a background task initiated by user A does not write into user B’s context even if user B starts a session while the background task is still running.
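
One way to guard against background task bleed is to bind each task to the session that created it and to re-check that binding when the result is written. The sketch below is illustrative only: BackgroundTask, ContextStore, and write_result are hypothetical names, and an in-memory dict stands in for whatever store the platform actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BackgroundTask:
    """A unit of deferred work, permanently bound to its originating session."""
    task_id: str
    owner_user_id: str
    session_id: str


@dataclass
class ContextStore:
    """In-memory stand-in for a per-session context store (hypothetical)."""
    _owners: dict = field(default_factory=dict)    # session_id -> user_id
    _contexts: dict = field(default_factory=dict)  # session_id -> list of entries

    def open_session(self, session_id: str, user_id: str) -> None:
        self._owners[session_id] = user_id
        self._contexts[session_id] = []

    def close_session(self, session_id: str) -> None:
        self._owners.pop(session_id, None)
        self._contexts.pop(session_id, None)

    def write_result(self, task: BackgroundTask, result: str) -> bool:
        # Write only to the session recorded on the task, and only if that
        # session still exists and still belongs to the task's owner.
        if self._owners.get(task.session_id) != task.owner_user_id:
            return False  # session closed or owned by someone else: drop the result
        self._contexts[task.session_id].append(result)
        return True

    def read(self, session_id: str) -> list:
        return list(self._contexts.get(session_id, []))


store = ContextStore()
store.open_session("s-a", "alice")
store.open_session("s-b", "bob")
task = BackgroundTask(task_id="t1", owner_user_id="alice", session_id="s-a")
wrote = store.write_result(task, "summary of alice's conversation")
```

The task never carries a "current session" lookup; the destination is fixed at creation time, so a session that user B opens while the task is running cannot become its write target.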

Target systems: Hosted LLM agent platforms (LangChain/LangGraph hosted, custom FastAPI/Django agent services); platforms using vector databases (Pinecone, Weaviate, Chroma) for agent memory; platforms with shared code execution sandboxes (Modal, E2B, custom Docker pools).

Threat Model

1. Authenticated user exploiting memory namespace bypass (regular platform user). Objective: craft a memory retrieval query or directly call the memory API with a forged namespace key to read another user’s conversation history or preferences. Impact: disclosure of other users’ private conversations; competitive intelligence in enterprise deployments; PII exposure.

2. Timing-based prompt cache inference (authenticated user with controlled timing). Objective: submit a prompt that partially overlaps with content a target user is known to have submitted; measure response latency to infer whether the partial prompt hit the cache, confirming content. Impact: confirm or deny that a specific user submitted a specific prompt; partial content reconstruction.

3. Shared sandbox side-channel (authenticated user with code execution tool access). Objective: execute code in a shared sandbox that reads /proc, /tmp, or environment variables left behind by a previous user’s execution. Impact: session tokens, API keys, or data from the previous user’s sandbox environment exposed.

4. Race condition in async session handler (automated high-frequency attacker). Objective: send concurrent requests timed to hit a race condition in session ID assignment or context buffer management; cause own context to bleed into target user’s response or vice versa. Impact: arbitrary cross-session context injection; responses containing another user’s private data.

The blast radius depends on what the agents do: a customer service agent leaking conversation history can be a reportable personal-data breach under GDPR; a coding assistant leaking one developer’s proprietary code to another developer is a material IP breach.

Hardening Configuration

Session ID Generation and Scoping

# Session ID must be cryptographically unpredictable and bound to authentication
import hashlib
import secrets

import structlog

log = structlog.get_logger()

def create_session_id(authenticated_user_id: str) -> str:
    # Combine a server-side secret with user ID and random nonce
    # This prevents a user from guessing or iterating to another user's session ID
    # The result is a 64-character lowercase hex string
    nonce = secrets.token_bytes(32)
    server_secret = get_server_secret()  # bytes, from vault/KMS (helper defined elsewhere)
    return hashlib.sha256(
        server_secret + authenticated_user_id.encode() + nonce
    ).hexdigest()

# Session storage: always key by the server-generated session_id and record
# the owner at creation time; never trust a client-supplied owner
class SessionStore:
    def __init__(self, redis_client):
        # Assumes the client was created with decode_responses=True, so hash
        # fields come back as str; with the default (bytes) the owner
        # comparison below would never match
        self._redis = redis_client
        self._prefix = "session:"

    def get(self, session_id: str, authenticated_user_id: str) -> dict | None:
        # Re-verify ownership before returning any session data
        data = self._redis.hgetall(f"{self._prefix}{session_id}")
        if not data:
            return None
        if data.get("owner_id") != authenticated_user_id:
            # Log this as a potential cross-session access attempt
            log.warning(
                "cross_session_access_attempt",
                session_id=session_id,
                claimed_user=authenticated_user_id,
                actual_owner=data.get("owner_id"),
            )
            return None
        return data
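
Before a session ID reaches the store at all, it can be checked against the format that create_session_id produces (a SHA-256 hex digest, so 64 lowercase hex characters). The following is a minimal sketch under that assumption; is_well_formed_session_id and owner_matches are hypothetical helper names, and the constant-time comparison is an optional extra rather than something the original code requires.

```python
import hmac
import re

# create_session_id() returns a SHA-256 hex digest: 64 lowercase hex characters
_SESSION_ID_RE = re.compile(r"[0-9a-f]{64}")


def is_well_formed_session_id(session_id: object) -> bool:
    """Reject anything that could not have come from create_session_id()."""
    return (
        isinstance(session_id, str)
        and _SESSION_ID_RE.fullmatch(session_id) is not None
    )


def owner_matches(stored_owner_id: str, authenticated_user_id: str) -> bool:
    """Compare owner IDs without leaking the mismatch position through timing."""
    return hmac.compare_digest(
        stored_owner_id.encode(), authenticated_user_id.encode()
    )
```

Rejecting malformed IDs early also narrows the injection surface if the session lookup is ever backed by a SQL or key-pattern query.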

Memory Store Namespace Isolation

# Vector database (example: Chroma): enforce per-user collection namespacing
import hashlib
import secrets

class IsolatedMemoryStore:
    def __init__(self, chroma_client):
        self._client = chroma_client

    @staticmethod
    def _user_hash(user_id: str) -> str:
        return hashlib.sha256(user_id.encode()).hexdigest()

    def _collection_name(self, user_id: str) -> str:
        # Never use user_id directly as a collection name; hash and prefix it.
        # user_id must be the authenticated identity, never a request field.
        safe_id = self._user_hash(user_id)[:16]
        return f"user_{safe_id}_memory"

    def store(self, user_id: str, content: str, metadata: dict) -> None:
        collection = self._client.get_or_create_collection(
            name=self._collection_name(user_id)
        )
        collection.add(
            documents=[content],
            # user_id_hash is written last so caller metadata cannot override it
            metadatas=[{**metadata, "user_id_hash": self._user_hash(user_id)}],
            ids=[secrets.token_hex(16)],
        )

    def retrieve(self, user_id: str, query: str, n: int = 5) -> list[str]:
        collection_name = self._collection_name(user_id)
        try:
            collection = self._client.get_collection(name=collection_name)
        except Exception:
            # The exception type for a missing collection varies by Chroma version
            return []  # User has no stored memories
        results = collection.query(query_texts=[query], n_results=n)
        # Defence in depth: verify metadata matches user before returning
        expected = self._user_hash(user_id)
        return [
            doc for doc, meta in zip(
                results["documents"][0], results["metadatas"][0]
            )
            if meta.get("user_id_hash") == expected
        ]
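
A plain SHA-256 of the user ID is predictable to anyone who knows or can guess that ID. One possible refinement, sketched here as an assumption rather than as part of the design above, is to derive the namespace with a keyed hash so that the collection name cannot be computed without a server-side secret. The names namespace_for, resolve_namespace, and namespace_key are hypothetical.

```python
import hashlib
import hmac


def namespace_for(user_id: str, namespace_key: bytes) -> str:
    """Derive a collection name from the authenticated user ID with a keyed hash."""
    digest = hmac.new(namespace_key, user_id.encode(), hashlib.sha256).hexdigest()
    # 32 hex characters (128 bits) keeps the name short while making
    # accidental collisions negligible
    return f"user_{digest[:32]}_memory"


def resolve_namespace(
    authenticated_user_id: str, request_body: dict, namespace_key: bytes
) -> str:
    # Any "namespace" or "user_id" field in the request body is deliberately
    # ignored: the namespace comes only from the authenticated identity.
    return namespace_for(authenticated_user_id, namespace_key)


key = b"k" * 32  # placeholder; in practice loaded from a secret manager
forged_request = {"namespace": namespace_for("bob", key), "query": "salary"}
resolved = resolve_namespace("alice", forged_request, key)
```

Here the forged request still resolves to Alice's own namespace, which is the behaviour the threat model's first scenario calls for.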

Code Execution Sandbox Isolation

Each user’s code execution must run in an isolated environment with no shared filesystem state:

# Using E2B (or equivalent sandbox-as-a-service): create per-session sandboxes
# The awaitable API lives on AsyncSandbox; the plain Sandbox class is synchronous
from e2b_code_interpreter import AsyncSandbox

class IsolatedCodeExecutor:
    def __init__(self):
        self._sandboxes: dict[str, AsyncSandbox] = {}

    async def execute(self, session_id: str, code: str) -> str:
        if session_id not in self._sandboxes:
            # One sandbox per session; never reuse across sessions
            self._sandboxes[session_id] = await AsyncSandbox.create(
                timeout=120,      # sandbox lifetime in seconds
                metadata={"session_id": session_id}
            )
        sandbox = self._sandboxes[session_id]
        result = await sandbox.run_code(code)
        return result.text

    async def close_session(self, session_id: str) -> None:
        # pop() removes the entry even if kill() raises
        sandbox = self._sandboxes.pop(session_id, None)
        if sandbox is not None:
            await sandbox.kill()

For self-hosted Docker-based sandboxes:

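# Each execution runs in a fresh container with no shared volumes.
# Flags, in order: no network access; read-only root filesystem; small
# non-executable tmpfs for /tmp; memory and CPU caps; no privilege
# escalation; run as the unprivileged "nobody" user.
# Comments stay outside the command: text after a trailing backslash
# breaks shell line continuation.
docker run --rm \
  --network=none \
  --read-only \
  --tmpfs /tmp:size=64m,noexec \
  --memory=512m \
  --cpus=0.5 \
  --security-opt=no-new-privileges \
  --user=65534:65534 \
  python:3.12-slim \
  python -c "${USER_CODE}"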
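
When the container is launched from the agent service rather than from a shell, the same flags can be passed as an argument list, which avoids shell interpolation of user code altogether. This is a sketch under assumptions: build_sandbox_argv and run_untrusted are hypothetical names, the program is fed on stdin (python - reads its source from standard input), and a local Docker daemon is assumed to be available.

```python
import subprocess


def build_sandbox_argv(image: str = "python:3.12-slim") -> list:
    """Argument list for a locked-down, single-use container."""
    return [
        "docker", "run", "--rm", "-i",       # -i keeps stdin open for the program
        "--network=none",                    # no network access
        "--read-only",                       # read-only root filesystem
        "--tmpfs", "/tmp:size=64m,noexec",   # private temp filesystem, no execute
        "--memory=512m",
        "--cpus=0.5",
        "--security-opt=no-new-privileges",
        "--user=65534:65534",                # nobody user
        image,
        "python", "-",                       # read the program from stdin
    ]


def run_untrusted(code: str, timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    # No shell is involved, so quotes or $(...) in the user's code are never
    # interpreted by the host.
    return subprocess.run(
        build_sandbox_argv(),
        input=code,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


argv = build_sandbox_argv()
```

Note that the subprocess timeout stops the Docker client, not necessarily the container, so a production version would also need to kill the container by name or ID.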

Preventing Prompt Cache Side-Channels

For platforms using KV-cache on inference servers (vLLM, TensorRT-LLM), ensure per-user cache partitioning:

# When submitting to vLLM: include a per-session prefix
# that prevents cross-session cache hits
import hashlib

from openai import AsyncOpenAI

# Placeholder endpoint and key; point this at your OpenAI-compatible server
openai_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def call_inference(session_id: str, messages: list[dict]) -> str:
    # Prepend a session-specific system message that breaks shared prefix caching
    # Use a deterministic but session-unique value (not random, to preserve
    # within-session caching)
    # 16 hex characters (64 bits) keeps collisions between sessions negligible
    session_prefix = hashlib.sha256(
        f"session:{session_id}".encode()
    ).hexdigest()[:16]

    system_message = {
        "role": "system",
        "content": f"[Session {session_prefix}] You are a helpful assistant."
    }
    full_messages = [system_message] + messages

    response = await openai_client.chat.completions.create(
        model="hosted-model",
        messages=full_messages,
    )
    return response.choices[0].message.content
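
To see why a session-specific first message defeats cross-session cache hits, it helps to model a prefix cache that hashes the prompt in fixed-size blocks, with each block hash chained to the one before it. The code below is a toy model under that assumption, not the implementation of vLLM or any other server; block_hashes and shared_blocks are hypothetical names, and whitespace-separated words stand in for tokens.

```python
import hashlib


def block_hashes(tokens: list, block_size: int = 4) -> list:
    """Hash each full block of tokens, chaining every hash to its predecessor."""
    hashes = []
    parent = ""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + "\x1f".join(block)
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_blocks(a: list, b: list) -> int:
    """Number of leading blocks two prompts could share in the cache."""
    count = 0
    for hash_a, hash_b in zip(block_hashes(a), block_hashes(b)):
        if hash_a != hash_b:
            break
        count += 1
    return count


prompt = "the quick brown fox jumps over the lazy dog again".split()
session_a = ["[Session", "aaaa]"] + prompt
session_b = ["[Session", "bbbb]"] + prompt

without_prefix = shared_blocks(prompt, prompt)      # identical prompts, no prefix
across_sessions = shared_blocks(session_a, session_b)
within_session = shared_blocks(session_a, session_a)
```

Because the first block differs between sessions and every later hash depends on it, no block is shared across sessions, while repeated requests inside one session still reuse the full prefix.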

For vLLM with prefix caching enabled, either partition the cache per user (for example with a session-specific prefix as in call_inference) or disable prefix caching entirely in multi-tenant contexts where session isolation is paramount:

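# vllm serve: disable prefix caching for highest isolation guarantee
# The flag spelling depends on the vLLM version; recent releases expose
# --enable-prefix-caching / --no-enable-prefix-caching (check --help)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --no-enable-prefix-caching \
  --max-model-len 4096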

Context Buffer Isolation in Async Handlers

# FastAPI async handler: use contextvars to prevent context leakage
# across async boundaries
from contextvars import ContextVar

from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.security import OAuth2PasswordBearer

# verify_token, validate_session_ownership, load_context, run_agent and
# save_context are application helpers defined elsewhere
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

current_session_id: ContextVar[str] = ContextVar("current_session_id")
current_user_id: ContextVar[str] = ContextVar("current_user_id")

app = FastAPI()

async def get_session(request: Request, token: str = Depends(oauth2_scheme)):
    user = verify_token(token)
    session_id = request.headers.get("X-Session-ID")
    # Reject a missing header as well as a session owned by someone else
    if not session_id or not validate_session_ownership(session_id, user.id):
        raise HTTPException(status_code=403)

    # Set in contextvar, not in a shared mutable dict
    current_session_id.set(session_id)
    current_user_id.set(user.id)
    return session_id

@app.post("/chat")
async def chat(message: str, session_id: str = Depends(get_session)):
    # All downstream functions use current_session_id.get()
    # Each request is handled in its own asyncio task with its own context,
    # so concurrent requests never see each other's values
    context = await load_context(current_session_id.get())
    response = await run_agent(message, context, current_user_id.get())
    await save_context(current_session_id.get(), context)
    return {"response": response}
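
The isolation property can be checked without a web framework. In this standalone sketch, two coroutines set the same ContextVar and then yield to each other; request_session and handle_request are hypothetical names used only for the demonstration.

```python
import asyncio
from contextvars import ContextVar

request_session: ContextVar[str] = ContextVar("request_session")


async def handle_request(session_id: str, delay: float) -> tuple:
    request_session.set(session_id)
    # Yield control so the other task runs and sets its own value
    await asyncio.sleep(delay)
    # A shared global would now hold whichever value was written last;
    # the ContextVar still holds this task's value
    return session_id, request_session.get()


async def main() -> list:
    # gather() wraps each coroutine in its own Task, and each Task runs in
    # a copy of the context that was current when it was created
    return await asyncio.gather(
        handle_request("session-a", 0.02),
        handle_request("session-b", 0.01),
    )


results = asyncio.run(main())
```

Each entry in results pairs the session ID a task was given with the value it read back after the other task had run, and the two always agree.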

Audit Logging for Cross-Session Access Attempts

from datetime import datetime, timezone

import structlog

log = structlog.get_logger()

# increment_counter, get_counter and alert_security_team are application
# helpers defined elsewhere; counters should expire so that old attempts
# do not accumulate forever

def audit_session_access(
    requesting_user: str,
    session_id: str,
    access_granted: bool,
    reason: str
) -> None:
    log.info(
        "session_access_audit",
        requesting_user=requesting_user,
        session_id_prefix=session_id[:8],   # Don't log full session ID
        access_granted=access_granted,
        reason=reason,
        # datetime.utcnow() is deprecated since Python 3.12
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    if not access_granted:
        # Alert on repeated failed attempts
        increment_counter(f"cross_session_attempt:{requesting_user}")
        if get_counter(f"cross_session_attempt:{requesting_user}") > 5:
            alert_security_team(requesting_user, "repeated_cross_session_attempts")
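
The counter helpers are left undefined above. One way they might be backed, shown purely as an assumption, is a sliding-window counter so that only recent failures count toward the alert threshold. SlidingWindowCounter is a hypothetical in-memory class; a multi-process deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the last window_seconds."""

    def __init__(self, window_seconds: float = 300.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._events = defaultdict(deque)

    def _prune(self, key: str, now: float) -> None:
        events = self._events[key]
        # Drop timestamps that have fallen out of the window
        while events and now - events[0] > self._window:
            events.popleft()

    def increment(self, key: str) -> int:
        now = self._clock()
        self._prune(key, now)
        self._events[key].append(now)
        return len(self._events[key])

    def get(self, key: str) -> int:
        self._prune(key, self._clock())
        return len(self._events[key])


fake_now = [0.0]
counter = SlidingWindowCounter(window_seconds=10.0, clock=lambda: fake_now[0])
for _ in range(6):
    latest = counter.increment("cross_session_attempt:user-1")
should_alert = latest > 5
```

With a window in place, a user who triggers one stray denial per week never reaches the threshold, while a burst of attempts does.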

Expected Behaviour After Hardening

| Scenario | Before Hardening | After Hardening |
| --- | --- | --- |
| User A queries memory store with User B’s namespace key | Returns User B’s memories if namespace not validated | Namespace validated against authenticated user; returns empty |
| Code execution in shared Docker container | /tmp from previous execution visible | Fresh container per session; no shared filesystem |
| Prompt cache hit on another user’s prefix | Response latency reveals prefix overlap | Per-session system prefix prevents cross-user cache hits |
| Race condition in async handler | Session context from concurrent request leaks | ContextVar isolation; each request task has its own copy |
| Repeated cross-session access attempts | No detection | Counter alert fires after more than 5 attempts; security team notified |

Trade-offs and Operational Considerations

| Aspect | Benefit | Cost | Mitigation |
| --- | --- | --- | --- |
| Per-session sandboxes | Complete execution isolation | Higher latency (sandbox creation overhead); resource cost | Pre-warm sandbox pools; impose session timeout |
| Disabled prefix caching | Eliminates cache side-channel | Increased inference latency; higher compute cost per request | Re-enable for non-sensitive internal applications only; benchmark impact |
| Per-user vector DB collections | Strong memory isolation | Increased vector DB collection count at scale | Use Chroma/Weaviate tenant features for higher-scale multi-tenancy |
| ContextVar for async isolation | Prevents context bleed in Python async | Requires care with threading (ContextVars don’t propagate to threads) | Use copy_context().run() when spawning threads from async code |
| Owner re-verification on every session read | Prevents IDOR on session objects | Additional Redis/DB read per request | Cache verification result in JWT with short TTL |
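
The threading caveat for ContextVar can be illustrated with a short sketch. It assumes a standard CPython build, where a manually started thread begins with an empty context; request_session, read_session, and run_in_thread are hypothetical names.

```python
import threading
from contextvars import ContextVar, copy_context

request_session: ContextVar[str] = ContextVar("request_session", default="<unset>")


def read_session(out: list) -> None:
    out.append(request_session.get())


def run_in_thread(target, *args) -> None:
    thread = threading.Thread(target=target, args=args)
    thread.start()
    thread.join()


request_session.set("session-a")

# Started directly: on a standard build the thread does not inherit the
# caller's context, so it typically sees the default value
plain: list = []
run_in_thread(read_session, plain)

# Started through a snapshot of the caller's context: the thread sees
# "session-a" because ctx.run() executes the function inside that snapshot
ctx = copy_context()
propagated: list = []
run_in_thread(ctx.run, read_session, propagated)
```

asyncio.to_thread() performs this context copy automatically, so the explicit copy_context().run() is only needed for threads or executors started by hand.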

Failure Modes

| Failure | Symptom | Detection | Recovery |
| --- | --- | --- | --- |
| ContextVar not set in middleware | Downstream code gets LookupError or uses default value | Unit tests with concurrent requests; error logs | Add default value to ContextVar; validate in integration tests |
| Sandbox pool exhaustion | Code execution fails for new sessions | Sandbox creation timeout in logs; user-facing “service unavailable” | Increase pool size; implement session eviction for idle sessions |
| Memory namespace collision (hash collision) | Extremely rare: two users share a namespace | Metadata user_id_hash check catches it; log shows mismatch | Use the full 256-bit hash for the namespace instead of a truncated one; the collision probability is then negligible |
| Cross-session audit alert storm | Many false positives during legitimate load testing | Alert volume spike | Implement rate limiting on alerts; suppress during known load test windows |
| Session ownership validation bypass via SQL injection in session lookup | Attacker accesses arbitrary sessions | Anomalous data access in audit log | Use parameterised queries; input validation on session_id format |