AI Agent Observability and Tracing: OpenTelemetry for Agent Runs and Tool Calls

Problem

A production AI agent’s single run involves:

  • Multiple model calls (planner, executor, summarizer).
  • Tool invocations (database queries, API calls, MCP tools).
  • Memory reads and writes.
  • Internal control-flow decisions.
  • Possibly recursive sub-agent calls.

When something goes wrong — wrong tool invoked, leaked data, hallucinated output, runaway tool-call loop — the question is “what happened” and the answer is in the trace. By 2026 the OpenTelemetry semantic conventions for GenAI (gen_ai.* namespace) are stable; agent frameworks (LangGraph, AutoGen, Anthropic SDK, OpenAI Agents) emit OTel traces by default or via thin shims.

The structural observability shape:

  • Each run is a trace. A run has a root span; child spans for each step.
  • Each model call is a span. Tagged with model, prompt token count, completion token count, finish reason.
  • Each tool call is a span. Tagged with tool name, arguments (often hashed), duration, return-value type.
  • Each memory operation is a span. Reads, writes, retrieval scores.
  • Each decision branch is a span event. Annotates why a particular path was taken.

The observability data has security uses:

  • Detect anomalous tool-call patterns. A burst of delete_* calls or an unusual call sequence can indicate a prompt-injection attack.
  • Track per-tenant agent cost for billing and abuse detection.
  • Forensics for incidents — when an agent did something wrong, the trace shows the chain of model decisions.
  • Compliance evidence — auditable record of what an autonomous system did and why.

The specific gaps in default agent deployments:

  • Tracing is often left to the agent framework’s own pipeline (logs to stdout, not aggregated).
  • Tool-call arguments may be logged with sensitive content (PII, customer data).
  • Per-run costs aren’t attributed to tenants.
  • Anomaly-detection on tool-use patterns isn’t run.
  • Long-running agent runs blow out trace ingest budgets.

This article covers OTel semantic conventions for agents, span hierarchy patterns, redaction at the SDK boundary, anomaly detection on tool-use, and per-tenant cost attribution.

Target systems: OpenTelemetry SDK 1.30+ with gen_ai semantic conventions; Anthropic SDK with native OTel support; LangSmith / Langfuse / Helicone as commercial agent observability tools; Tempo / Jaeger / Honeycomb as backends.

Threat Model

  • Adversary 1 — Prompt-injection-driven anomalous tool calls: an attacker has gotten content into the agent’s input that causes unusual tool calls; defender wants to detect and shut down.
  • Adversary 2 — Cost-exhaustion abuse: a misbehaving (or malicious) agent loops on tool calls; defender wants to detect and bound.
  • Adversary 3 — Data exfil via tool outputs: an agent run reads sensitive data; defender wants to know what data the agent saw.
  • Adversary 4 — PII leakage in logged spans: observability captures request bodies that contain customer data; insider with trace-store access reads them.
  • Adversary 5 — Audit gap exploitation: attacker takes advantage of trace-data deletion or non-aggregation to act untraced.
  • Access level: all adversaries have only normal user-agent interaction.
  • Objective: Read sensitive data; cause the agent to act outside intended scope; act without leaving forensic traces.
  • Blast radius: without good observability, attacks via the agent are slow to detect and hard to investigate. With observability + redaction, agent actions are visible while sensitive data is not exposed in logs.

Configuration

Step 1: OTel Agent Semantic Conventions

The gen_ai namespace defines standard span attributes:

gen_ai.system           # vendor (anthropic, openai, ollama)
gen_ai.request.model    # model name
gen_ai.response.model   # model that actually responded
gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
gen_ai.response.finish_reason   # stop, length, tool_calls
gen_ai.request.temperature
gen_ai.request.max_tokens
gen_ai.tool.name
gen_ai.tool.call.id
gen_ai.operation.name   # chat, embed, tool_call

Use these consistently; backends understand them.

Step 2: Span Hierarchy for an Agent Run

# agent_traced.py
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import anthropic
import json, hashlib

tracer = trace.get_tracer("my-agent")
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

class AgentRun:
    def __init__(self, user_id, task):
        self.user_id = user_id
        self.task = task

    def execute(self):
        # Root span for the agent run.
        with tracer.start_as_current_span("agent.run") as run_span:
            run_span.set_attribute("agent.user_id", self.user_id)
            run_span.set_attribute("agent.task_hash",
                hashlib.sha256(self.task.encode()).hexdigest()[:16])
            run_span.set_attribute("gen_ai.operation.name", "agent.execute")

            try:
                plan = self._plan()
                results = self._execute_plan(plan)
                summary = self._summarize(results)  # elided: mirrors _plan with a summarizer model call
                return summary
            except Exception as e:
                run_span.set_status(Status(StatusCode.ERROR, str(e)))
                raise

    def _plan(self):
        with tracer.start_as_current_span("agent.plan") as span:
            span.set_attribute("gen_ai.operation.name", "chat")
            span.set_attribute("gen_ai.system", "anthropic")
            span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
            response = anthropic_client.messages.create(...)
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
            span.set_attribute("gen_ai.response.finish_reason", response.stop_reason)
            return response.content

    def _execute_plan(self, plan):
        with tracer.start_as_current_span("agent.execute_plan") as span:
            results = []
            for step in plan:
                results.append(self._execute_step(step))
            return results

    def _execute_step(self, step):
        with tracer.start_as_current_span(f"agent.tool_call.{step.tool}") as span:
            span.set_attribute("gen_ai.operation.name", "tool_call")
            span.set_attribute("gen_ai.tool.name", step.tool)
            # Hash arguments rather than logging them — see Step 3.
            args_json = json.dumps(step.args, sort_keys=True)
            span.set_attribute("gen_ai.tool.args_hash",
                hashlib.sha256(args_json.encode()).hexdigest()[:16])
            span.set_attribute("gen_ai.tool.args_size_bytes", len(args_json))
            try:
                return step.tool_fn(step.args)
            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                raise

The trace tree:

agent.run
├── agent.plan         (model call to planner)
├── agent.execute_plan
│   ├── agent.tool_call.search_docs
│   ├── agent.tool_call.fetch_records
│   └── agent.tool_call.summarize_for_user
└── agent.summarize    (model call to summarizer)

A backend like Tempo, Honeycomb, or Langfuse renders this as a flame graph.
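
None of these spans leave the process until a tracer provider and exporter are wired up. A minimal setup sketch, assuming an OTLP-speaking Collector; the endpoint and service name are placeholders:

# otel_setup.py — provider wiring; run once at process start.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "my-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))  # placeholder endpoint
trace.set_tracer_provider(provider)  # get_tracer("my-agent") now emits through this pipeline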

Step 3: Redaction at the SDK Boundary

Tool arguments and model prompts often contain user PII or business-sensitive content. Don’t log them verbatim.

# safe_attributes.py
import re, json, hashlib

# (pattern, replacement) pairs. The key/value pattern preserves the key via
# group 1 so the redacted string stays readable. (Variable-width lookbehinds
# are not valid in Python's re module, so match the key explicitly instead.)
SENSITIVE_PATTERNS = [
    (re.compile(r'(?i)\b(password|api_key)(["\']?\s*[:=]\s*["\']?)[^\s"\',}]+'),
     r'\1\2[REDACTED]'),
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
     '[REDACTED-EMAIL]'),                                      # email
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[REDACTED-SSN]'),  # SSN
    (re.compile(r'sk-[A-Za-z0-9]{32,}'), '[REDACTED-KEY]'),    # API key prefix
]

def safe_log_value(s: str, max_len: int = 200) -> dict:
    """Return a dict suitable for span.set_attribute that doesn't leak."""
    original_len = len(s)
    for pat, repl in SENSITIVE_PATTERNS:
        s = pat.sub(repl, s)
    return {
        "value_redacted": s[:max_len] + ("..." if len(s) > max_len else ""),
        "value_size": original_len,
        # Hash the redacted string: identical inputs still correlate, and the
        # hash can't be brute-forced back to low-entropy PII.
        "value_hash": hashlib.sha256(s.encode()).hexdigest()[:16],
    }

# Use throughout agent code.
def log_user_input(span, user_input):
    safe = safe_log_value(user_input)
    span.set_attribute("agent.user_input.size", safe["value_size"])
    span.set_attribute("agent.user_input.hash", safe["value_hash"])
    span.set_attribute("agent.user_input.preview", safe["value_redacted"])

The hash is useful for cross-correlation (“did the same input appear in another run?”) without exposing content.

For tool arguments, prefer schemas — record types and shapes rather than values:

def log_tool_args(span, tool_name, args):
    span.set_attribute("agent.tool.name", tool_name)
    # Record schema, not values.
    schema = {k: type(v).__name__ for k, v in args.items()}
    span.set_attribute("agent.tool.arg_schema", json.dumps(schema))
    # Hash for correlation.
    args_str = json.dumps(args, sort_keys=True)
    span.set_attribute("agent.tool.args_hash",
        hashlib.sha256(args_str.encode()).hexdigest()[:16])
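
A quick check that the redactor behaves as intended (hash value elided):

# redact_demo.py — sanity check for safe_log_value above.
print(safe_log_value('login as alice@example.com password: hunter2'))
# -> {'value_redacted': 'login as [REDACTED-EMAIL] password: [REDACTED]',
#     'value_size': 44, 'value_hash': '...'}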

Step 4: Anomaly Detection on Tool-Use Patterns

With per-run trace data, build detection on tool-call patterns:

# tool_anomaly.py
SUSPICIOUS_PATTERNS = [
    {
        "name": "rapid_destructive_calls",
        "match": lambda calls: sum(1 for c in calls if c.startswith("delete_")) > 3,
        "severity": "high",
    },
    {
        "name": "unusual_tool_sequence",
        "match": lambda calls: "fetch_secrets" in calls and "outbound_http" in calls,
        "severity": "critical",
    },
    {
        "name": "tool_call_loop",
        "match": lambda calls: any(calls.count(c) > 10 for c in set(calls)),
        "severity": "medium",
    },
]

def analyze_run(run_id, tool_calls):
    findings = []
    call_names = [c.tool_name for c in tool_calls]
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern["match"](call_names):
            findings.append({
                "run_id": run_id,
                "pattern": pattern["name"],
                "severity": pattern["severity"],
            })
    return findings

Run this analyzer over completed traces; alert on findings. For real-time response, consume traces from a streaming pipeline (Kafka with OTel exporter) and react before the run completes.
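
A sketch of such a consumer, reusing the patterns above; the topic name, message shape, and the alert() hook are assumptions about how spans are forwarded:

# stream_anomaly.py — flag suspicious tool-call sequences before a run completes.
# Assumes the Collector forwards spans as JSON to a Kafka topic "agent-spans".
import json
from collections import defaultdict

from kafka import KafkaConsumer  # pip install kafka-python

from tool_anomaly import SUSPICIOUS_PATTERNS

calls_by_run = defaultdict(list)

consumer = KafkaConsumer(
    "agent-spans",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b),
)
for msg in consumer:
    span = msg.value
    attrs = span.get("attributes", {})
    if attrs.get("gen_ai.operation.name") != "tool_call":
        continue
    calls = calls_by_run[span["trace_id"]]
    calls.append(attrs["gen_ai.tool.name"])
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern["match"](calls):
            alert(span["trace_id"], pattern["name"], pattern["severity"])  # hypothetical hook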

Step 5: Per-Tenant Cost Attribution

Tag every span with the tenant ID. Roll up costs:

-- Example: per-tenant token consumption from trace data.
SELECT
    span.attributes['agent.tenant_id'] AS tenant,
    SUM(span.attributes['gen_ai.usage.input_tokens']) AS input_tokens,
    SUM(span.attributes['gen_ai.usage.output_tokens']) AS output_tokens,
    SUM(span.attributes['gen_ai.usage.input_tokens'] * 0.000003)    -- $3 per 1M input tokens (example rate)
        + SUM(span.attributes['gen_ai.usage.output_tokens'] * 0.000015)  -- $15 per 1M output tokens
        AS estimated_cost_usd
FROM spans
WHERE span.name LIKE 'agent.%'
  AND span.start_time > now() - interval '1 day'
GROUP BY tenant
ORDER BY estimated_cost_usd DESC;

A tenant whose cost spikes 10x in a day is either legitimately using the agent far more, or being abused. Both warrant investigation.
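
The rollup assumes every span carries agent.tenant_id. One way to enforce that at SDK init (per the failure modes listed later) is a span processor that stamps the attribute on every span and refuses to proceed without tenant context; a sketch, with TENANT_CONTEXT as an illustrative context variable:

# tenant_processor.py — stamp every span with the tenant at SDK init.
# TENANT_CONTEXT and the strict-raise policy are illustrative choices.
from contextvars import ContextVar
from opentelemetry.sdk.trace import SpanProcessor

TENANT_CONTEXT: ContextVar[str] = ContextVar("tenant_id", default="")

class TenantTaggingProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        tenant = TENANT_CONTEXT.get()
        if not tenant:
            # Strict policy: no tenant context, no traced run.
            raise RuntimeError("agent span started without tenant context")
        span.set_attribute("agent.tenant_id", tenant)

# provider.add_span_processor(TenantTaggingProcessor())  # alongside the exporter
# Per request: TENANT_CONTEXT.set(tenant_id) before starting the agent run.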

Step 6: Trace Sampling for Volume Management

Per-run traces are detailed; volume scales with agent invocations. Sample:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based decision at the root: keep 10% of routine runs; child spans
# follow their parent. Errors, anomalies, and high-cost runs are rescued by
# the tail-sampling policies below, which see the completed trace.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)

For tail-based sampling (decide post-completion based on trace contents), use the OTel Collector’s tail_sampling processor:

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: long-runs
        type: latency
        latency: {threshold_ms: 60000}
      - name: anomaly
        type: numeric_attribute
        numeric_attribute: {key: agent.anomaly_score, min_value: 1}
      - name: routine-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

100% of error or anomalous traces; 10% of routine runs. Forensic detail when needed; manageable volume otherwise.

Step 7: Forensic Replay

For incident investigation, you need enough trace detail to reconstruct what happened.

def replay_trace(run_id):
    # trace_store is a placeholder for whatever query API the backend exposes
    # (Tempo, Jaeger, and Honeycomb all support fetch-by-trace-id).
    spans = trace_store.fetch_by_root(run_id)
    print(f"Run {run_id} for tenant {spans[0].attributes['agent.tenant_id']}")
    for span in sorted(spans, key=lambda s: s.start_time):
        indent = "  " * span.depth
        print(f"{indent}{span.name} ({span.duration_ms}ms)")
        if span.attributes.get("gen_ai.tool.name"):
            print(f"{indent}  tool={span.attributes['gen_ai.tool.name']}")
            print(f"{indent}  args_hash={span.attributes['agent.tool.args_hash']}")
        if span.status == "ERROR" and span.events:
            print(f"{indent}  ERROR: {span.events[0].attributes['exception.message']}")

The output is a step-by-step record of the run. For a real incident response, the trace combined with model-input hashes (for correlating with other agent runs that saw the same input) is the forensic core.
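
That cross-run correlation is a small query against the same placeholder store:

def correlate_input(input_hash):
    # Other runs that saw the same user input, matched on the Step 3 hash.
    # trace_store.query is a placeholder API, like fetch_by_root above.
    return trace_store.query(attribute="agent.user_input.hash", value=input_hash)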

Step 8: Telemetry SLOs

agent_runs_total{tenant, outcome}                      counter
agent_run_duration_seconds                             histogram
agent_tool_calls_total{tenant, tool, outcome}          counter
agent_anomaly_detections_total{pattern, severity}      counter
agent_token_usage_total{tenant, model}                 counter
agent_redaction_replacements_total                     counter
agent_trace_pii_findings_total{pattern}                counter (post-trace audit)

Alert on:

  • agent_anomaly_detections_total{severity="critical"} non-zero — likely active prompt-injection attack.
  • agent_trace_pii_findings_total non-zero — redaction failed; tighten patterns.
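
Counters like these can be emitted from the agent process through the OTel metrics API; a minimal sketch (meter name and label values are illustrative):

# metrics_setup.py — the counters above via the OTel metrics API.
from opentelemetry import metrics

meter = metrics.get_meter("my-agent")
agent_runs = meter.create_counter("agent_runs_total")
tool_calls = meter.create_counter("agent_tool_calls_total")
anomalies = meter.create_counter("agent_anomaly_detections_total")

# At the end of a run:
agent_runs.add(1, {"tenant": tenant_id, "outcome": "success"})
# Per tool call:
tool_calls.add(1, {"tenant": tenant_id, "tool": step.tool, "outcome": "ok"})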

Expected Behaviour

Signal                                 Without observability            With OTel agent observability
Per-run cost attribution               None / aggregate only            Per-tenant per-run
Tool-call anomaly detection            After-the-fact via SIEM          Real-time on trace stream
Forensic reconstruction of incidents   Logs piecewise                   Single trace shows full chain
PII in trace data                      Often present                    Hashed / redacted at SDK
Volume management                      Per-event keep-all (expensive)   Tail-sampled by anomaly + error
Cross-run correlation                  Hard                             Via input hashes

Trade-offs

  • Detailed per-call spans. Benefit: fine forensics. Cost: trace volume. Mitigation: tail sampling keeps what matters.
  • SDK-side redaction. Benefit: sensitive data not exfiltrated to observability. Cost: some debug context lost. Mitigation: pair with a secured, audited debug log for known-narrow cases.
  • Hash-based correlation. Benefit: track patterns without content. Cost: can't see actual content during routine ops. Mitigation: acceptable; for an incident, retrieve content from a separate, more-secured log.
  • Real-time anomaly detection. Benefit: stop attacks in progress. Cost: streaming-pipeline complexity. Mitigation: use existing OTel + Kafka; a small custom processor for patterns.
  • Per-tenant attribution. Benefit: billing and abuse signals. Cost: per-tenant index in the trace store. Mitigation: standard for multi-tenant deployments.
  • Tail sampling. Benefit: volume bounded. Cost: some traces lost (acceptable). Mitigation: decision policies cover errors and anomalies; the routine sample is small.

Failure Modes

  • Redaction misses a sensitive pattern. Symptom: PII in the trace store. Detection: periodic scan of stored traces. Recovery: tighten patterns; consider an LLM-based classifier for high-stakes audits.
  • Sampler discards a relevant trace. Symptom: forensic gap during an incident. Detection: the specific trace is not in the store post-event. Recovery: tail sampling captures errors and anomalies; routine sampling discards by design. Increase sampling for high-risk tenants.
  • Trace ingest latency. Symptom: anomaly detection lags. Detection: detection-rule firing time vs. event time. Recovery: use streaming ingest (Kafka); reduce decision_wait if needed.
  • Per-tenant tagging missed. Symptom: cost attribution incomplete. Detection: agent runs missing the tenant_id attribute. Recovery: enforce at SDK init; reject runs without tenant context.
  • Anomaly false positive. Symptom: legitimate operations flagged. Detection: operator review of detected anomalies. Recovery: tune patterns; allow operator-triggered "expected" annotations.
  • Trace export down. Symptom: brief observability gap. Detection: standard health check on the exporter. Recovery: buffer locally for short outages; escalate if extended.
  • Span attribute size limit. Symptom: long arguments truncated. Detection: backend rejection or silent truncation. Recovery: hash + size; don't try to fit full content into span attributes.