Securing Reasoning Model Scratchpad Output in Production AI Applications
Problem
Reasoning models — Claude’s extended thinking, OpenAI o1/o3, DeepSeek R1, and Google Gemini Flash Thinking — generate an intermediate chain-of-thought (CoT) or scratchpad before producing their final response. This scratchpad is where the model works through the problem: it reasons about the user’s request, considers the contents of its context window, examines retrieved documents, and evaluates possible responses.
The security issue is that this scratchpad is often returned to the caller alongside the final answer. Unlike the final response, which the model curates for the user, the scratchpad contains raw intermediate reasoning that was never intended to be customer-facing. In many applications, the scratchpad is being returned verbatim to browser clients, mobile apps, or downstream services — exposing content that the system was never designed to display.
What the scratchpad may contain that should not reach end users:
System prompt content. Reasoning models frequently reference or paraphrase their system prompt while working through how to respond. A scratchpad entry like “The system prompt instructs me not to discuss competitor pricing, but the user is asking about X…” directly discloses system prompt content, including instructions that developers intended to keep confidential.
Retrieved context and internal data. In RAG applications, the scratchpad will reason about retrieved documents by content — “Document 3 mentions the internal project codename Bluebird and a budget of $2.3M…” — disclosing information from internal knowledge bases that was only intended to inform the final answer.
Internal API responses and tool output. When reasoning models are used as agents with tool access, the scratchpad reasons through tool call results: “The user database query returned 847 records including several with PII flags. I need to summarize without including the raw data…” The raw content of the tool output may be present in the scratchpad even when the final response appropriately omits it.
Credential and secret reconstruction. If a secret or credential exists anywhere in the model’s context — environment variables mentioned in a document, API keys logged in retrieved data, passwords visible in a schema — the model may reference it during reasoning. “The connection string in the retrieved config file contains what appears to be a database password: postgres://user:XXXX@host/db. I should not include this in my response but can use the host to…” The password appears in the scratchpad.
Security decision rationale. The scratchpad reveals the model’s reasoning about content policy, access control, and security decisions. “The user seems to be asking about X for legitimate purposes even though the request pattern matches a jailbreak attempt. I’ll respond because…” Disclosing this reasoning enables adversaries to craft inputs that are more likely to pass the model’s reasoning filters.
The deployment patterns that expose this content are widespread. Many teams building with reasoning models:
- Stream the full API response including thinking blocks to the frontend without filtering.
- Log raw API responses (including thinking content) to application log infrastructure.
- Forward full model output through a chain of services where thinking content accumulates.
- Use reasoning model output as input to a second model call without stripping the thinking block.
Claude’s API returns extended thinking in a distinct thinking content block, making it possible to filter programmatically. OpenAI’s o1/o3 models do not return raw CoT content to the caller (it is processed server-side only). DeepSeek R1 returns <think>...</think> tags inline in the completion text when self-hosted or served through many third-party endpoints; DeepSeek’s hosted API instead returns the reasoning in a separate reasoning_content field. The filtering approach differs by model, but the principle is consistent: thinking content must never reach end users or be included in application logs without explicit review.
Target systems: any production application using Claude with extended thinking enabled (the thinking parameter set to type "enabled"), DeepSeek R1 via API, Gemini Flash Thinking, or any reasoning model that returns intermediate thinking to the caller; RAG pipelines using reasoning models for synthesis; agentic systems using reasoning models for planning.
Threat Model
Adversary 1 — End user exploiting scratchpad disclosure. Access level: standard user of a customer-facing AI application. Objective: inspect the API response or streaming chunks to read the full scratchpad, extract the system prompt, and use that information to craft inputs that bypass the application’s intended guardrails.
Adversary 2 — Log-scraping attacker. Access level: read access to application logs (common in shared logging infrastructure, misconfigured SIEM, or compromised log aggregation). Objective: extract API keys, database credentials, or internal data that appear in logged scratchpad content.
Adversary 3 — Prompt injection amplified by scratchpad. Access level: ability to influence content that enters the model’s context (via retrieved documents, tool output, or user input). Objective: inject a prompt that causes the model to reason about and reference sensitive context in its scratchpad, making that data appear in the response stream even if the final answer omits it.
Adversary 4 — Developer accidentally exposing thinking in prototype. Access level: developer building a feature using a reasoning model. Objective: not malicious — but the scratchpad disclosure happens unintentionally when developer code is promoted to production without stripping thinking blocks from the response.
Without controls: scratchpad content reaches users, logs, and downstream services. With controls: thinking blocks are stripped before delivery; logs contain only sanitized output; streaming is gated to final content only.
Configuration / Implementation
Step 1 — Identify which model responses include thinking content
import anthropic
client = anthropic.Anthropic()
# Detect thinking blocks in a response
def has_thinking_content(response) -> bool:
    """Check if a Claude response contains extended thinking blocks."""
    return any(block.type == "thinking" for block in response.content)

def extract_thinking_and_response(response) -> tuple[str, str]:
    """Separate thinking content from visible response content."""
    thinking_parts = []
    response_parts = []
    for block in response.content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            response_parts.append(block.text)
    return "\n".join(thinking_parts), "\n".join(response_parts)

# Example: make a call with extended thinking
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[{"role": "user", "content": "Analyze this security report..."}]
)
thinking_content, visible_content = extract_thinking_and_response(response)
# NEVER send thinking_content to the user
# Log thinking_content only if required for debugging, in a restricted log sink
user_response = visible_content
For DeepSeek R1, strip <think> tags:
import re
def strip_deepseek_thinking(completion_text: str) -> tuple[str, str]:
    """Extract and remove <think>...</think> blocks from DeepSeek R1 output."""
    thinking_pattern = re.compile(r'<think>(.*?)</think>', re.DOTALL)
    thinking_content = '\n'.join(thinking_pattern.findall(completion_text))
    visible_content = thinking_pattern.sub('', completion_text).strip()
    return thinking_content, visible_content

# Usage
raw_output = deepseek_client.chat.completions.create(...)
thinking, response = strip_deepseek_thinking(raw_output.choices[0].message.content)
# Send only 'response' to the user
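The regex in strip_deepseek_thinking only matches balanced tag pairs. If generation stops mid-reasoning (for example at a token limit), the closing tag never arrives and the partial scratchpad would pass through as visible text. The sketch below is one possible fail-closed variant, not a definitive implementation: the function name is hypothetical, and the handling of a stray closing tag assumes that some serving stacks may emit the reasoning without its opening tag.

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def strip_thinking_fail_closed(text: str) -> tuple[str, str]:
    """Return (thinking, visible), treating unbalanced tags as reasoning."""
    thinking_parts = _THINK_BLOCK.findall(text)
    visible = _THINK_BLOCK.sub("", text)

    # Stray closing tag (opening tag missing): everything before it is reasoning.
    if "</think>" in visible:
        head, _, visible = visible.rpartition("</think>")
        thinking_parts.append(head)

    # Unclosed opening tag (truncated output): everything after it is reasoning.
    if "<think>" in visible:
        visible, _, tail = visible.partition("<think>")
        thinking_parts.append(tail)

    return "\n".join(p.strip() for p in thinking_parts), visible.strip()
```

With this variant a truncated completion yields an empty visible string rather than a leaked scratchpad, so the caller can show a generic retry message instead.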
Step 2 — Implement a thinking-aware response wrapper
Create a consistent wrapper that enforces thinking-block stripping across your application:
from dataclasses import dataclass
from typing import Optional
import anthropic

@dataclass
class SafeModelResponse:
    """A model response with thinking content safely separated."""
    visible_content: str
    thinking_content: Optional[str]  # None if thinking was disabled
    model: str
    usage: dict

    def to_user_dict(self) -> dict:
        """Serialization safe for user-facing APIs; never includes thinking."""
        return {
            "content": self.visible_content,
            "model": self.model,
        }

    def to_internal_dict(self) -> dict:
        """Serialization for internal logging; includes thinking if present."""
        return {
            "content": self.visible_content,
            "thinking": self.thinking_content,  # Log to restricted sink only
            "model": self.model,
            "usage": self.usage,
        }

class SafeReasoningClient:
    """Wrapper around Claude client that enforces thinking content separation."""

    def __init__(self, client: anthropic.Anthropic, enable_thinking: bool = False):
        self._client = client
        self._enable_thinking = enable_thinking

    def create(self, messages: list, system: str = "", **kwargs) -> SafeModelResponse:
        """Create a message, always returning a SafeModelResponse."""
        params = {
            "model": kwargs.get("model", "claude-sonnet-4-6"),
            "max_tokens": kwargs.get("max_tokens", 4096),
            "messages": messages,
        }
        if system:
            params["system"] = system
        if self._enable_thinking:
            budget = kwargs.get("thinking_budget", 8000)
            params["thinking"] = {"type": "enabled", "budget_tokens": budget}
            # max_tokens must exceed the thinking budget, so leave room for the answer
            params["max_tokens"] = max(params["max_tokens"], budget + 8000)
        response = self._client.messages.create(**params)

        thinking_parts = []
        text_parts = []
        for block in response.content:
            if block.type == "thinking":
                thinking_parts.append(block.thinking)
            elif block.type == "text":
                text_parts.append(block.text)
            # Any other block type is deliberately dropped from both outputs

        return SafeModelResponse(
            visible_content="\n".join(text_parts),
            thinking_content="\n".join(thinking_parts) if thinking_parts else None,
            model=response.model,
            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            }
        )

# Usage in application code (inside a request handler such as a Flask view):
client = SafeReasoningClient(anthropic.Anthropic(), enable_thinking=True)
result = client.create(
    system="You are a security analyst...",
    messages=[{"role": "user", "content": user_query}]
)
# Log to restricted sink (not application logs) before returning:
internal_logger.info(result.to_internal_dict())
# Safe to send to user:
return jsonify(result.to_user_dict())
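SafeReasoningClient sorts blocks into thinking and text and silently ignores everything else. The following sketch makes that policy explicit as an allowlist and also reports block types it does not recognize, so a new type added by a later SDK release can raise an alert instead of going unnoticed. It is an illustration under assumptions: the function and constant names are hypothetical, the blocks are stand-in objects rather than real SDK types, and redacted_thinking is assumed to carry opaque data instead of readable text.

```python
from types import SimpleNamespace

USER_VISIBLE_TYPES = {"text"}                        # allowlist: may reach users
REASONING_TYPES = {"thinking", "redacted_thinking"}  # restricted sink only

def split_blocks(blocks) -> tuple[list[str], list[str], list[str]]:
    """Return (visible_texts, reasoning_texts, unknown_types), failing closed."""
    visible, reasoning, unknown = [], [], []
    for block in blocks:
        if block.type in USER_VISIBLE_TYPES:
            visible.append(block.text)
        elif block.type in REASONING_TYPES:
            # Assumed: redacted blocks have no readable 'thinking' attribute
            reasoning.append(getattr(block, "thinking", "[redacted]"))
        else:
            unknown.append(block.type)  # never forwarded; surfaced for alerting
    return visible, reasoning, unknown

# Stand-in objects shaped like SDK content blocks (hypothetical test data)
blocks = [
    SimpleNamespace(type="thinking", thinking="The system prompt says..."),
    SimpleNamespace(type="redacted_thinking", data="opaque"),
    SimpleNamespace(type="text", text="Here is the summary."),
    SimpleNamespace(type="future_block", payload="?"),
]
visible, reasoning, unknown = split_blocks(blocks)
```

In an agentic application, tool-use blocks would land in the unknown list here; they would need their own explicit handling rather than being added to the user-visible allowlist.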
Step 3 — Secure streaming to prevent thinking leakage
When streaming responses, ensure thinking blocks are not forwarded to the client stream:
import json
import anthropic
from flask import Response, stream_with_context

client = anthropic.Anthropic()

def stream_safe_response(messages: list, system: str) -> Response:
    """Stream only visible text content; suppress thinking blocks."""
    def generate():
        in_thinking_block = False
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=16000,
            thinking={"type": "enabled", "budget_tokens": 8000},
            system=system,
            messages=messages,
        ) as stream:
            for event in stream:
                event_type = getattr(event, "type", None)
                if event_type == "content_block_start":
                    in_thinking_block = event.content_block.type in ("thinking", "redacted_thinking")
                elif event_type == "content_block_stop":
                    in_thinking_block = False
                elif event_type == "content_block_delta" and not in_thinking_block:
                    # Forward only visible text deltas, never thinking or signature deltas
                    if getattr(event.delta, "type", None) == "text_delta":
                        # JSON-encode so newlines in the text cannot break SSE framing;
                        # the client must parse each data line as JSON
                        payload = json.dumps({"text": event.delta.text})
                        yield f"data: {payload}\n\n"
        yield "data: [DONE]\n\n"

    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )
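The gating logic in stream_safe_response is hard to test while it is embedded in a Flask generator that calls the live API. One option, sketched here under assumptions, is to pull the filter into a pure generator and feed it fake events. The helper names are hypothetical and the events are stand-in objects that only mimic the fields used above; this variant tracks whether the current block is a text block, so anything else is suppressed by default.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

def visible_text_deltas(events: Iterable) -> Iterator[str]:
    """Yield text only from text deltas that arrive inside a 'text' block."""
    in_text_block = False
    for event in events:
        etype = getattr(event, "type", None)
        if etype == "content_block_start":
            in_text_block = getattr(event.content_block, "type", None) == "text"
        elif etype == "content_block_stop":
            in_text_block = False
        elif etype == "content_block_delta" and in_text_block:
            if getattr(event.delta, "type", None) == "text_delta":
                yield event.delta.text

def _ev(type_, **fields):
    """Build a stand-in stream event (hypothetical test helper)."""
    return SimpleNamespace(type=type_, **fields)

fake_stream = [
    _ev("content_block_start", content_block=SimpleNamespace(type="thinking")),
    _ev("content_block_delta", delta=SimpleNamespace(type="thinking_delta", thinking="secret")),
    _ev("content_block_stop"),
    _ev("content_block_start", content_block=SimpleNamespace(type="text")),
    _ev("content_block_delta", delta=SimpleNamespace(type="text_delta", text="Hello")),
    _ev("content_block_delta", delta=SimpleNamespace(type="text_delta", text=" world")),
    _ev("content_block_stop"),
]
forwarded = list(visible_text_deltas(fake_stream))
```

The streaming endpoint could then wrap this generator, which keeps the security-relevant decision in one small function that runs in unit tests without network access.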
Step 4 — Sanitize thinking content before logging
If thinking content needs to be logged for debugging, route it to a restricted log sink with access controls:
import logging
import hashlib
from typing import Optional

# Configure separate logger for thinking content with restricted access
thinking_logger = logging.getLogger('ai.thinking')
thinking_handler = logging.FileHandler('/var/log/ai/thinking-restricted.log')
thinking_handler.setLevel(logging.DEBUG)
thinking_logger.addHandler(thinking_handler)
thinking_logger.setLevel(logging.DEBUG)
# Stop records from propagating to parent and root handlers, which would
# otherwise copy thinking content into the general application logs
thinking_logger.propagate = False

# Main application logger: never receives thinking content
app_logger = logging.getLogger('ai.application')

def log_request_safely(
    request_id: str,
    thinking_content: Optional[str],
    visible_content: str,
    context_sources: list[str]
) -> None:
    """Log request with thinking content separated to restricted sink."""
    # Application log: no thinking content, no raw context
    app_logger.info({
        "request_id": request_id,
        "response_length": len(visible_content),
        "context_source_count": len(context_sources),
        "has_thinking": thinking_content is not None,
    })
    # Restricted log: thinking content with access controls on the file
    if thinking_content:
        thinking_logger.debug({
            "request_id": request_id,
            "thinking_hash": hashlib.sha256(thinking_content.encode()).hexdigest()[:16],
            # Log a truncated version; full content may contain sensitive data
            "thinking_preview": thinking_content[:500] + "..." if len(thinking_content) > 500 else thinking_content,
        })
# Restrict the thinking log file to application service account only
chmod 0600 /var/log/ai/thinking-restricted.log
chown aiservice:aiservice /var/log/ai/thinking-restricted.log
# Set up log rotation without compression (easier to audit access)
cat > /etc/logrotate.d/ai-thinking <<'EOF'
/var/log/ai/thinking-restricted.log {
    daily
    rotate 7
    create 0600 aiservice aiservice
    notifempty
    nocompress
}
EOF
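Even a truncated preview in the restricted log can contain a credential, as the connection-string example in the Problem section shows. A redaction pass before writing is one way to reduce that exposure. The sketch below is illustrative only: the function name and both patterns are assumptions, they cover just two common shapes (passwords embedded in URLs and key=value style secrets), and they will both miss some secrets and over-redact some harmless text.

```python
import re

# Illustrative patterns only; extend for the credential formats in your environment
_REDACTIONS = [
    # scheme://user:password@host  ->  scheme://user:[REDACTED]@host
    (re.compile(r"(\w+://[^\s:/@]+:)[^\s@]+(@)"), r"\1[REDACTED]\2"),
    # password=..., api_key: ..., token=...  ->  value replaced
    (re.compile(r"(?i)\b(password|secret|api[_-]?key|token)(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
]

def redact_secrets(text: str) -> str:
    """Apply each redaction pattern in turn and return the scrubbed text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

example = redact_secrets(
    "The config has postgres://user:hunter2@host/db and API_KEY=abc123 set."
)
```

Calling redact_secrets on the thinking preview inside log_request_safely, before it is passed to thinking_logger, would keep the request_id linkage while lowering the value of the log to an attacker who gains read access.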
Step 5 — Audit existing code for thinking content exposure
Scan your codebase for patterns that may be forwarding thinking content to users:
# Find places where Claude response content is returned directly to users
# without going through a thinking-aware wrapper
grep -rn "response\.content" --include="*.py" src/ | \
grep -v "SafeModelResponse\|thinking_content\|extract_thinking"
# Find direct json serialization of Anthropic response objects
# (in grep's basic regex syntax, literal parentheses are written unescaped)
grep -rn "response\.model_dump\|response\.dict()\|json\.dumps.*response" \
    --include="*.py" src/
# Find streaming that may forward all event types
grep -rn "stream\|EventStream" --include="*.py" src/ | \
grep -v "thinking_block\|in_thinking"
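The same checks can be expressed as a small Python function, which is easier to unit-test and to call from a pre-commit hook than a chain of grep commands. This is a sketch under assumptions: the function name is hypothetical, the patterns simply mirror the grep expressions above, and a line-based scan like this produces false positives and misses multi-line expressions.

```python
import re

# Patterns mirroring the grep scans above
RISKY = re.compile(
    r"response\.content|response\.model_dump|response\.dict\(|json\.dumps\(.*response"
)
# Names that indicate the line already goes through thinking-aware code
APPROVED = re.compile(r"SafeModelResponse|thinking_content|extract_thinking|in_thinking")

def find_unguarded_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that match a risky pattern unguarded."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if RISKY.search(line) and not APPROVED.search(line):
            hits.append((lineno, line.strip()))
    return hits

sample = '''\
for block in response.content:  # inside extract_thinking_and_response
    pass
return jsonify(response.model_dump())
payload = json.dumps({"r": response})
'''
hits = find_unguarded_lines(sample)
```

A hook could run this over each staged Python file and exit non-zero when the list is non-empty, leaving the final judgment to code review.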
Step 6 — Disable extended thinking when not needed
If a task does not require complex multi-step reasoning, use standard mode to eliminate the scratchpad risk entirely:
# Decision framework for enabling extended thinking:
TASKS_REQUIRING_THINKING = {
    "complex_analysis",     # Multi-step reasoning tasks
    "math_proofs",          # Formal reasoning
    "code_debugging",       # Requires internal state tracking
    "security_assessment",  # Multi-factor evaluation
}

TASKS_NOT_REQUIRING_THINKING = {
    "summarization",   # Single-pass
    "classification",  # Categorical output
    "translation",     # Direct transformation
    "simple_qa",       # Factual lookup
    "formatting",      # Structural transformation
}

def create_with_appropriate_thinking(task_type: str, **kwargs):
    # Task types not listed as requiring thinking run in standard mode
    enable_thinking = task_type in TASKS_REQUIRING_THINKING
    return SafeReasoningClient(
        anthropic.Anthropic(),
        enable_thinking=enable_thinking
    ).create(**kwargs)
Expected Behaviour
| Signal | Before hardening | After hardening |
|---|---|---|
| API response to user includes thinking block | Yes: full scratchpad in JSON response | No: to_user_dict() omits thinking content |
| Streaming response includes thinking deltas | Yes: thinking_delta events forwarded to client | No: in_thinking_block gate suppresses thinking deltas |
| Application logs contain raw scratchpad | Yes: in default response logging | No: thinking content in restricted log only |
| System prompt content visible in scratchpad | Visible if scratchpad returned to user | Suppressed: thinking never reaches user |
| DeepSeek R1 <think> tags in user response | Included in raw completion | Stripped by strip_deepseek_thinking() before return |
Verification:
# Test: confirm thinking is not in user-facing response
client = SafeReasoningClient(anthropic.Anthropic(), enable_thinking=True)
result = client.create(
    system="Keep this confidential: PROJECT_CODENAME=BLUEBIRD",
    messages=[{"role": "user", "content": "What can you help me with?"}]
)
user_dict = result.to_user_dict()
assert "thinking" not in user_dict, "Thinking block found in user response!"
assert "BLUEBIRD" not in user_dict["content"], "System prompt content leaked!"
print("PASS: thinking content not exposed to user")
print(f"User sees: {user_dict['content'][:100]}...")
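The check above depends on a live model call, so it can pass or fail with the model's behaviour on a given run. A complementary, deterministic check plants a unique canary string in the confidential context and then scans whatever is about to leave the backend for that string. The sketch below assumes a hypothetical helper name and canary value; it works on any JSON-serializable payload, including the output of to_user_dict().

```python
import json

def find_canaries(payload, canaries: list[str]) -> list[str]:
    """Return the canary strings that appear anywhere in a serialized payload."""
    serialized = json.dumps(payload, ensure_ascii=False)
    return [c for c in canaries if c in serialized]

# Hypothetical marker planted in the system prompt or a retrieved document
CANARIES = ["CANARY-7f3a9c"]

safe_payload = {"content": "I can help with security reviews.", "model": "example-model"}
leaky_payload = {
    "content": "I can help with security reviews.",
    "thinking": "The system prompt contains CANARY-7f3a9c, which I must not reveal.",
}
safe_hits = find_canaries(safe_payload, CANARIES)
leaky_hits = find_canaries(leaky_payload, CANARIES)
```

Because the scan runs on the serialized payload, it also catches thinking content that reaches the response under an unexpected key or nested inside another structure.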
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Stripping thinking from all user responses | Eliminates information disclosure | Debugging becomes harder — reasoning is hidden | Maintain thinking in restricted logs with request_id linkage; developers can query restricted logs during debugging |
| Separate restricted log sink | Auditable thinking access; prevents broad log scraping | Additional logging infrastructure to maintain | Use structured logging with log-level separation; same infrastructure, different file path and permissions |
| Disabling thinking for simple tasks | Eliminates scratchpad risk for most calls | Reduced quality when a task that would benefit from thinking is routed to standard mode | Benchmark the quality impact per task type before disabling; summarization and classification tasks often show little or no degradation without thinking |
| Streaming gate for thinking blocks | Prevents streaming-based scratchpad extraction | Small latency overhead per event to check block type | Overhead is microseconds; negligible in practice |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Framework update returns thinking in new block type | New block type not recognized by filter; thinking leaks | Integration test checking thinking key absence in user response | Add content block type assertion to integration tests; run on every dependency update |
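| Thinking content in multi-turn conversation history | Prior turn’s thinking included in next turn’s messages | Next response references previous thinking content explicitly | Strip thinking blocks from any copy of the history that is shown to users or written to general logs. For the history sent back to the model, Anthropic’s documentation describes the API as dropping thinking blocks from earlier turns automatically, except during tool use, where the latest assistant turn’s thinking blocks must be returned unmodified; do not hand-edit those blocks |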
| Restricted log file becomes world-readable after rotation | Thinking content accessible to all processes | File permission audit; logrotate runs as wrong user | Verify logrotate create directive includes correct permissions; audit after each rotation |
| Developer bypasses wrapper in test code that reaches production | Direct response.content serialization in response path | Code review; grep scan for unguarded response serialization | Add pre-commit hook scanning for direct Claude response serialization patterns outside approved wrappers |
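For the multi-turn failure mode in the table above, one approach is to keep the full message list server-side and derive a separate, stripped copy for anything that leaves the backend, such as a conversation history endpoint. The sketch below is an illustration under assumptions: the function and constant names are hypothetical, messages are plain dicts in the role/content shape used earlier, and only text blocks are treated as client-visible.

```python
CLIENT_VISIBLE_TYPES = {"text"}  # allowlist of block types a client may see

def history_for_client(messages: list[dict]) -> list[dict]:
    """Return a copy of the history that keeps only client-visible blocks."""
    cleaned = []
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            # Block-list content: keep only allowlisted block types
            content = [b for b in content if b.get("type") in CLIENT_VISIBLE_TYPES]
        # Plain-string content passes through unchanged
        cleaned.append({**message, "content": content})
    return cleaned

history = [
    {"role": "user", "content": "Summarize the report."},
    {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "Document 3 mentions Bluebird...", "signature": "sig"},
        {"type": "text", "text": "Here is the summary."},
    ]},
]
client_view = history_for_client(history)
```

The original list is left untouched, so the server can still send it back to the model as required while the client only ever receives the stripped copy.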
Related Articles
- LLM System Prompt Protection — preventing system prompt disclosure through model output, complementary to scratchpad controls
- LLM Prompt Security Patterns — defensive prompt design that reduces the chance of sensitive context appearing in reasoning
- AI Context Window Data Exfiltration — indirect prompt injection that may cause scratchpad-level disclosure of sensitive context
- Prompt Cache Security — security considerations for the prompt caching feature that interacts with extended thinking
- AI Agent Output Verification — verifying that agent outputs conform to expected schemas, applicable to thinking-aware response validation