Securing Reasoning Model Scratchpad Output in Production AI Applications

Problem

Reasoning models — Claude’s extended thinking, OpenAI o1/o3, DeepSeek R1, and Google Gemini Flash Thinking — generate an intermediate chain-of-thought (CoT) or scratchpad before producing their final response. This scratchpad is where the model works through the problem: it reasons about the user’s request, considers the contents of its context window, examines retrieved documents, and evaluates possible responses.

The security issue is that this scratchpad is often returned to the caller alongside the final answer. Unlike the final response, which the model curates for the user, the scratchpad contains raw intermediate reasoning that was never intended to be customer-facing. In many applications, the scratchpad is being returned verbatim to browser clients, mobile apps, or downstream services — exposing content that the system was never designed to display.

What the scratchpad may contain that should not reach end users:

System prompt content. Reasoning models frequently reference or paraphrase their system prompt while working through how to respond. A scratchpad entry like “The system prompt instructs me not to discuss competitor pricing, but the user is asking about X…” directly discloses system prompt content, including instructions that developers intended to keep confidential.

Retrieved context and internal data. In RAG applications, the scratchpad will reason about retrieved documents by content — “Document 3 mentions the internal project codename Bluebird and a budget of $2.3M…” — disclosing information from internal knowledge bases that was only intended to inform the final answer.

Internal API responses and tool output. When reasoning models are used as agents with tool access, the scratchpad reasons through tool call results: “The user database query returned 847 records including several with PII flags. I need to summarize without including the raw data…” The raw content of the tool output may be present in the scratchpad even when the final response appropriately omits it.

Credential and secret reconstruction. If a secret or credential exists anywhere in the model’s context — environment variables mentioned in a document, API keys logged in retrieved data, passwords visible in a schema — the model may reference it during reasoning. “The connection string in the retrieved config file contains what appears to be a database password: postgres://user:XXXX@host/db. I should not include this in my response but can use the host to…” The password appears in the scratchpad.

Security decision rationale. The scratchpad reveals the model’s reasoning about content policy, access control, and security decisions. “The user seems to be asking about X for legitimate purposes even though the request pattern matches a jailbreak attempt. I’ll respond because…” Disclosing this reasoning enables adversaries to craft inputs that are more likely to pass the model’s reasoning filters.

The deployment patterns that expose this content are widespread. Many teams building with reasoning models:

  1. Stream the full API response including thinking blocks to the frontend without filtering.
  2. Log raw API responses (including thinking content) to application log infrastructure.
  3. Forward full model output through a chain of services where thinking content accumulates.
  4. Use reasoning model output as input to a second model call without stripping the thinking block.

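Claude’s API returns extended thinking in distinct thinking content blocks (plus redacted_thinking blocks when reasoning is withheld), making it possible to filter programmatically. OpenAI’s o1/o3 do not return raw CoT content to the caller (it is processed server-side only). DeepSeek R1 emits <think>...</think> tags inline in the completion text when self-hosted or served through many third-party providers; DeepSeek’s own hosted API is understood to return the reasoning in a separate reasoning_content field instead, so check which shape your endpoint uses. The filtering approach differs by model, but the principle is consistent: thinking content must never reach end users or be included in application logs without explicit review.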

Target systems: any production application using Claude with extended thinking enabled (the thinking request parameter set to type "enabled"), DeepSeek R1 via API, Gemini Flash Thinking, or any reasoning model that returns intermediate thinking to the caller; RAG pipelines using reasoning models for synthesis; agentic systems using reasoning models for planning.


Threat Model

Adversary 1 — End user exploiting scratchpad disclosure. Access level: standard user of a customer-facing AI application. Objective: inspect the API response or streaming chunks to read the full scratchpad, extract the system prompt, and use that information to craft inputs that bypass the application’s intended guardrails.

Adversary 2 — Log-scraping attacker. Access level: read access to application logs (common in shared logging infrastructure, misconfigured SIEM, or compromised log aggregation). Objective: extract API keys, database credentials, or internal data that appear in logged scratchpad content.

Adversary 3 — Prompt injection amplified by scratchpad. Access level: ability to influence content that enters the model’s context (via retrieved documents, tool output, or user input). Objective: inject a prompt that causes the model to reason about and reference sensitive context in its scratchpad, making that data appear in the response stream even if the final answer omits it.

Adversary 4 — Developer accidentally exposing thinking in prototype. Access level: developer building a feature using a reasoning model. Objective: not malicious — but the scratchpad disclosure happens unintentionally when developer code is promoted to production without stripping thinking blocks from the response.

Without controls: scratchpad content reaches users, logs, and downstream services. With controls: thinking blocks are stripped before delivery; logs contain only sanitized output; streaming is gated to final content only.


Configuration / Implementation

Step 1 — Identify which model responses include thinking content

import anthropic

client = anthropic.Anthropic()

# Detect thinking blocks in a response
def has_thinking_content(response) -> bool:
    """Check if a Claude response contains extended thinking blocks."""
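    # redacted_thinking blocks also carry reasoning (in encrypted form) and count as thinking
    return any(block.type in ("thinking", "redacted_thinking") for block in response.content)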

def extract_thinking_and_response(response) -> tuple[str, str]:
    """Separate thinking content from visible response content."""
    thinking_parts = []
    response_parts = []
    
    for block in response.content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            response_parts.append(block.text)
    
    return "\n".join(thinking_parts), "\n".join(response_parts)

# Example: make a call with extended thinking
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{"role": "user", "content": "Analyze this security report..."}]
)

thinking_content, visible_content = extract_thinking_and_response(response)

# NEVER send thinking_content to the user
# Log thinking_content only if required for debugging, in a restricted log sink
user_response = visible_content

For DeepSeek R1, strip <think> tags:

import re

def strip_deepseek_thinking(completion_text: str) -> tuple[str, str]:
    """Extract and remove <think>...</think> blocks from DeepSeek R1 output."""
    thinking_pattern = re.compile(r'<think>(.*?)</think>', re.DOTALL)
    
    thinking_content = '\n'.join(thinking_pattern.findall(completion_text))
    visible_content = thinking_pattern.sub('', completion_text).strip()
    
    return thinking_content, visible_content

# Usage
raw_output = deepseek_client.chat.completions.create(...)
thinking, response = strip_deepseek_thinking(raw_output.choices[0].message.content)
# Send only 'response' to the user
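
The regex above assumes every reasoning span arrives as a complete <think>...</think> pair. Two cases break that assumption: output truncated mid-reasoning (no closing tag), and serving setups where the opening tag is part of the prompt template, so the completion contains only a closing tag. The following is a minimal sketch of a fail-closed variant that treats any ambiguous text as reasoning; the function name strip_think_fail_closed is hypothetical and not part of any library.

```python
import re

# Matches well-formed <think>...</think> pairs, including multi-line content.
_CLOSED_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_fail_closed(text: str) -> str:
    """Return only text that is certainly outside any <think> span."""
    # Remove complete pairs first.
    visible = _CLOSED_THINK.sub("", text)
    # A dangling closing tag means the opening tag was never emitted:
    # everything before the last </think> is treated as reasoning.
    if "</think>" in visible:
        visible = visible.rsplit("</think>", 1)[1]
    # A dangling opening tag means the output was cut off mid-reasoning:
    # everything after it is treated as reasoning.
    if "<think>" in visible:
        visible = visible.split("<think>", 1)[0]
    return visible.strip()
```

The trade-off is that a legitimate answer that happens to mention a literal think tag will be truncated. For a user-facing path, losing part of an answer is usually preferable to leaking reasoning.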

Step 2 — Implement a thinking-aware response wrapper

Create a consistent wrapper that enforces thinking-block stripping across your application:

from dataclasses import dataclass
from typing import Optional
import anthropic

@dataclass
class SafeModelResponse:
    """A model response with thinking content safely separated."""
    visible_content: str
    thinking_content: Optional[str]  # None if thinking was disabled
    model: str
    usage: dict
    
    def to_user_dict(self) -> dict:
        """Serialization safe for user-facing APIs — never includes thinking."""
        return {
            "content": self.visible_content,
            "model": self.model,
        }
    
    def to_internal_dict(self) -> dict:
        """Serialization for internal logging — includes thinking if present."""
        return {
            "content": self.visible_content,
            "thinking": self.thinking_content,  # Log to restricted sink only
            "model": self.model,
            "usage": self.usage,
        }


class SafeReasoningClient:
    """Wrapper around Claude client that enforces thinking content separation."""
    
    def __init__(self, client: anthropic.Anthropic, enable_thinking: bool = False):
        self._client = client
        self._enable_thinking = enable_thinking
    
    def create(self, messages: list, system: str = "", **kwargs) -> SafeModelResponse:
        """Create a message, always returning a SafeModelResponse."""
        params = {
            "model": kwargs.get("model", "claude-sonnet-4-6"),
            "max_tokens": kwargs.get("max_tokens", 4096),
            "messages": messages,
        }
        
        if system:
            params["system"] = system
        
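        if self._enable_thinking:
            budget = kwargs.get("thinking_budget", 8000)
            params["thinking"] = {
                "type": "enabled",
                "budget_tokens": budget
            }
            # max_tokens must exceed the thinking budget, so leave headroom for the answer
            params["max_tokens"] = max(params["max_tokens"], budget + 4096, 16000)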
        
        response = self._client.messages.create(**params)
        
        thinking_parts = []
        text_parts = []
        for block in response.content:
            if block.type == "thinking":
                thinking_parts.append(block.thinking)
            elif block.type == "text":
                text_parts.append(block.text)
        
        return SafeModelResponse(
            visible_content="\n".join(text_parts),
            thinking_content="\n".join(thinking_parts) if thinking_parts else None,
            model=response.model,
            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            }
        )


# Usage in application code (inside a Flask view; jsonify comes from flask,
# and handle_query is an illustrative name):
client = SafeReasoningClient(anthropic.Anthropic(), enable_thinking=True)

def handle_query(user_query: str):
    result = client.create(
        system="You are a security analyst...",
        messages=[{"role": "user", "content": user_query}]
    )

    # Log to restricted sink (not application logs) before returning:
    internal_logger.info(result.to_internal_dict())

    # Safe to send to user:
    return jsonify(result.to_user_dict())

Step 3 — Secure streaming to prevent thinking leakage

When streaming responses, ensure thinking blocks are not forwarded to the client stream:

import json

import anthropic
from flask import Response, stream_with_context

client = anthropic.Anthropic()

# Block types that carry reasoning and must never be forwarded
THINKING_BLOCK_TYPES = ("thinking", "redacted_thinking")

def stream_safe_response(messages: list, system: str) -> Response:
    """Stream only visible text content; suppress thinking blocks."""

    def generate():
        in_thinking_block = False

        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=16000,
            thinking={"type": "enabled", "budget_tokens": 8000},
            system=system,
            messages=messages,
        ) as stream:
            for event in stream:
                event_type = getattr(event, "type", None)

                if event_type == "content_block_start":
                    in_thinking_block = event.content_block.type in THINKING_BLOCK_TYPES

                elif event_type == "content_block_stop":
                    in_thinking_block = False

                elif event_type == "content_block_delta":
                    # Forward text deltas only; thinking and signature deltas are dropped
                    if not in_thinking_block and event.delta.type == "text_delta":
                        # JSON-encode so newlines in the text cannot break SSE framing;
                        # the client must JSON-decode each data payload
                        yield f"data: {json.dumps(event.delta.text)}\n\n"

        yield "data: [DONE]\n\n"
    
    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )

Step 4 — Sanitize thinking content before logging

If thinking content needs to be logged for debugging, route it to a restricted log sink with access controls:

import logging
import hashlib
from typing import Optional

# Configure separate logger for thinking content with restricted access
thinking_logger = logging.getLogger('ai.thinking')
thinking_handler = logging.FileHandler('/var/log/ai/thinking-restricted.log')
thinking_handler.setLevel(logging.DEBUG)
thinking_logger.addHandler(thinking_handler)
thinking_logger.setLevel(logging.DEBUG)

# Main application logger — never receives thinking content
app_logger = logging.getLogger('ai.application')

def log_request_safely(
    request_id: str,
    thinking_content: Optional[str],
    visible_content: str,
    context_sources: list[str]
) -> None:
    """Log request with thinking content separated to restricted sink."""
    
    # Application log: no thinking content, no raw context
    app_logger.info({
        "request_id": request_id,
        "response_length": len(visible_content),
        "context_source_count": len(context_sources),
        "has_thinking": thinking_content is not None,
    })
    
    # Restricted log: thinking content with access controls on the file
    if thinking_content:
        thinking_logger.debug({
            "request_id": request_id,
            "thinking_hash": hashlib.sha256(thinking_content.encode()).hexdigest()[:16],
            # Log a truncated version — full content may contain sensitive data
            "thinking_preview": thinking_content[:500] + "..." if len(thinking_content) > 500 else thinking_content,
        })

Restrict the log file's permissions and configure rotation at the host level:

# Restrict the thinking log file to application service account only
chmod 0600 /var/log/ai/thinking-restricted.log
chown aiservice:aiservice /var/log/ai/thinking-restricted.log

# Set up log rotation without compression (easier to audit access)
cat > /etc/logrotate.d/ai-thinking <<'EOF'
/var/log/ai/thinking-restricted.log {
    daily
    rotate 7
    create 0600 aiservice aiservice
    notifempty
    nocompress
}
EOF
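
File permissions limit who can read the restricted log, but the logged preview can still contain credentials the model quoted while reasoning. A redaction pass before writing reduces that exposure. The sketch below is illustrative only: the function name redact_secrets and both patterns are assumptions, they will miss secret formats not listed, and they will occasionally redact harmless text.

```python
import re

# Illustrative patterns only; extend for the secret formats in your environment.
_SECRET_PATTERNS = [
    # user:password@ inside a URL-style connection string
    (
        re.compile(r"(?P<pre>\b[a-z][a-z0-9+.-]*://[^\s:/@]+:)[^\s@]+(?=@)", re.IGNORECASE),
        r"\g<pre>[REDACTED]",
    ),
    # key=value or key: value pairs for common secret names
    (
        re.compile(r"(?P<pre>\b(?:api[_-]?key|secret|token|password)\b\s*[=:]\s*)\S+", re.IGNORECASE),
        r"\g<pre>[REDACTED]",
    ),
]

def redact_secrets(text: str) -> str:
    """Replace likely credential values with a placeholder before logging."""
    for pattern, replacement in _SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Applied to the thinking preview before it is handed to the restricted logger, this keeps the surrounding reasoning readable for debugging while removing the most obvious secret values. It complements the access controls on the log file and does not replace them.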

Step 5 — Audit existing code for thinking content exposure

Scan your codebase for patterns that may be forwarding thinking content to users:

# Find places where Claude response content is returned directly to users
# without going through a thinking-aware wrapper

grep -rn "response\.content" --include="*.py" src/ | \
  grep -v "SafeModelResponse\|thinking_content\|extract_thinking"

# Find direct json serialization of Anthropic response objects
# (in grep's basic regex syntax a literal parenthesis is written unescaped)
grep -rn "response\.model_dump\|response\.dict()\|json\.dumps.*response" \
  --include="*.py" src/

# Find streaming that may forward all event types
grep -rn "stream\|EventStream" --include="*.py" src/ | \
  grep -v "thinking_block\|in_thinking"

Step 6 — Disable extended thinking when not needed

If a task does not require complex multi-step reasoning, use standard mode to eliminate the scratchpad risk entirely:

# Decision framework for enabling extended thinking:
TASKS_REQUIRING_THINKING = {
    "complex_analysis",      # Multi-step reasoning tasks
    "math_proofs",           # Formal reasoning
    "code_debugging",        # Requires internal state tracking
    "security_assessment",   # Multi-factor evaluation
}

TASKS_NOT_REQUIRING_THINKING = {
    "summarization",         # Single-pass
    "classification",        # Categorical output
    "translation",           # Direct transformation
    "simple_qa",             # Factual lookup
    "formatting",            # Structural transformation
}

def create_with_appropriate_thinking(task_type: str, **kwargs):
    enable_thinking = task_type in TASKS_REQUIRING_THINKING
    return SafeReasoningClient(
        anthropic.Anthropic(),
        enable_thinking=enable_thinking
    ).create(**kwargs)

Expected Behaviour

| Signal | Before hardening | After hardening |
|---|---|---|
| API response to user includes thinking block | Yes: full scratchpad in JSON response | No: to_user_dict() omits thinking content |
| Streaming response includes thinking deltas | Yes: thinking_delta events forwarded to client | No: in_thinking_block gate suppresses thinking deltas |
| Application logs contain raw scratchpad | Yes: in default response logging | No: thinking content in restricted log only |
| System prompt content visible in scratchpad | Visible if scratchpad returned to user | Suppressed: thinking never reaches user |
| DeepSeek R1 <think> tags in user response | Included in raw completion | Stripped by strip_deepseek_thinking() before return |

Verification:

# Test: confirm thinking is not in user-facing response
client = SafeReasoningClient(anthropic.Anthropic(), enable_thinking=True)
result = client.create(
    system="Keep this confidential: PROJECT_CODENAME=BLUEBIRD",
    messages=[{"role": "user", "content": "What can you help me with?"}]
)

user_dict = result.to_user_dict()
assert "thinking" not in user_dict, "Thinking block found in user response!"
assert "BLUEBIRD" not in user_dict["content"], "System prompt content leaked!"
print("PASS: thinking content not exposed to user")
print(f"User sees: {user_dict['content'][:100]}...")
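
The check above only inspects top-level keys of one wrapper's output. A broader guard can walk any outgoing payload and flag thinking content wherever it is nested. The following is a sketch under stated assumptions: the function name find_thinking_leaks is hypothetical, and the forbidden key names (including reasoning_content, assumed here as the field some DeepSeek endpoints use) should be adjusted to the providers you call.

```python
from typing import Any

# Keys and block types that indicate reasoning content (assumed list; extend as needed).
FORBIDDEN_KEYS = {"thinking", "redacted_thinking", "reasoning_content"}
# Inline markers used by models that emit reasoning inside the completion text.
FORBIDDEN_MARKERS = ("<think>", "</think>")

def find_thinking_leaks(payload: Any, path: str = "$") -> list[str]:
    """Return the paths inside `payload` where thinking content appears."""
    leaks: list[str] = []
    if isinstance(payload, dict):
        # A typed block such as {"type": "thinking", ...} is itself a leak.
        block_type = payload.get("type")
        if isinstance(block_type, str) and block_type in FORBIDDEN_KEYS:
            leaks.append(f"{path} (type={block_type})")
        for key, value in payload.items():
            child = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                leaks.append(child)
            leaks.extend(find_thinking_leaks(value, child))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            leaks.extend(find_thinking_leaks(item, f"{path}[{index}]"))
    elif isinstance(payload, str):
        if any(marker in payload for marker in FORBIDDEN_MARKERS):
            leaks.append(path)
    return leaks
```

In an integration test, asserting that find_thinking_leaks(response_json) returns an empty list for every user-facing endpoint catches leaks that bypass the wrapper, such as a raw response object serialized by a different code path.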

Trade-offs

| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Stripping thinking from all user responses | Eliminates information disclosure | Debugging becomes harder because reasoning is hidden | Maintain thinking in restricted logs with request_id linkage; developers can query restricted logs during debugging |
| Separate restricted log sink | Auditable thinking access; prevents broad log scraping | Additional logging infrastructure to maintain | Use structured logging with log-level separation; same infrastructure, different file path and permissions |
| Disabling thinking for simple tasks | Eliminates scratchpad risk for most calls | Slightly reduced quality on complex tasks that would benefit from thinking | Benchmark quality impact; most summarization/classification tasks show no degradation without thinking |
| Streaming gate for thinking blocks | Prevents streaming-based scratchpad extraction | Small latency overhead per event to check block type | Overhead is microseconds; negligible in practice |

Failure Modes

| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Framework update returns thinking in new block type | New block type not recognized by filter; thinking leaks | Integration test checking thinking key absence in user response | Add content block type assertion to integration tests; run on every dependency update |
| Thinking content in multi-turn conversation history | Prior turn’s thinking included in next turn’s messages | Next response references previous thinking content explicitly | Strip thinking blocks from any conversation history that is stored, displayed, or sent to clients. Claude’s API ignores thinking blocks from earlier turns, but during a tool-use loop the current turn’s thinking blocks must be passed back unmodified, so keep that exchange server-side |
| Restricted log file becomes world-readable after rotation | Thinking content accessible to all processes | File permission audit; logrotate runs as wrong user | Verify logrotate create directive includes correct permissions; audit after each rotation |
| Developer bypasses wrapper in test code that reaches production | Direct response.content serialization in response path | Code review; grep scan for unguarded response serialization | Add pre-commit hook scanning for direct Claude response serialization patterns outside approved wrappers |
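
The first failure mode is easier to contain when the filter is an allowlist rather than a denylist: only block types known to be user-visible are forwarded, and anything unrecognized is dropped and reported. The sketch below illustrates the idea with stand-in objects instead of real SDK responses; visible_text, the two type sets, and the reasoning_v2 block type are all hypothetical.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Optional

# Only these block types are ever forwarded to users.
VISIBLE_BLOCK_TYPES = {"text"}
# Known non-visible types; anything outside both sets is treated as unknown.
KNOWN_HIDDEN_BLOCK_TYPES = {"thinking", "redacted_thinking", "tool_use"}

def visible_text(
    blocks: Iterable,
    on_unknown: Optional[Callable[[str], None]] = None,
) -> str:
    """Join allowlisted text blocks; drop and report unrecognized block types."""
    parts = []
    for block in blocks:
        block_type = getattr(block, "type", None)
        if block_type in VISIBLE_BLOCK_TYPES:
            parts.append(block.text)
        elif block_type not in KNOWN_HIDDEN_BLOCK_TYPES and on_unknown is not None:
            # Unknown type: never forwarded, but surfaced so it can be triaged.
            on_unknown(str(block_type))
    return "\n".join(parts)

# Example with stand-in objects (a made-up future block type is included).
unknown_seen: list[str] = []
sample_blocks = [
    SimpleNamespace(type="thinking", thinking="internal reasoning"),
    SimpleNamespace(type="text", text="Final answer."),
    SimpleNamespace(type="reasoning_v2", reasoning="new, unrecognized block"),
]
safe_text = visible_text(sample_blocks, on_unknown=unknown_seen.append)
```

Wiring on_unknown to a metric or alert turns a silent leak risk into a visible signal after a dependency update, while the user-facing output stays limited to text blocks.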