Detecting LLM-Driven Bots Through Observability: Signals That Survive AI Mimicry

The Problem

Between 2024 and 2026, the two foundations of client-side bot detection collapsed in quick succession. TLS fingerprinting died when uTLS became trivially accessible and browser-impersonation presets for every major client were published in open-source form. Behavioural biometrics died — or at least severely degraded — when LLM-driven browser agents (Anthropic Computer Use, OpenAI Operator, Google Mariner, and the cluster of open-source equivalents built on Playwright and Puppeteer) started running real Chromium instances and generating mouse movement curves, keystroke timing, and scroll patterns that are indistinguishable from human input when scored by the heuristic models that Cloudflare Bot Management, DataDome, and PerimeterX deploy.

These systems are not simulating a browser. They are controlling a real browser. The canvas fingerprint is authentic because it comes from a real GPU-accelerated Chromium render. The TLS handshake is authentic because it is a real Chromium TLS stack. The mouse trajectory passes entropy checks because the LLM planning loop introduces genuine variance in timing and path. The typing speed falls within the human distribution because the agent frameworks pace input events at human-like latencies. Every client-side probe returns a genuine result.

What the client-side model cannot see is the shape of the server-side interaction: what the session requested, in what order, at what pace relative to server response time, against what background of parallel resource fetching. These are structural properties of the session, not of any individual event. An agent fetching product prices cannot fake the incidental image and CSS requests that a real browser would make as it rendered the page. An agent maintaining consistent inter-request timing cannot fake the slowdown that real users exhibit when the server is under load, because the agent is not waiting on perception — it is parsing a DOM it already has. An agent that never visited the homepage cannot fake the DNS pre-resolution that real browsers perform for links in the page they just loaded.

Seven server-side signals derive from standard infrastructure — nginx structured logs, Envoy telemetry, OpenTelemetry traces — and survive every current effort at AI-driven mimicry, because they depend on structural properties of the session that an agent would have to deliberately and expensively fake, and faking them is counterproductive to the task the agent is trying to accomplish.

API call graph topology. Human sessions produce high-branching, poorly-directed call graphs: back-navigation, abandoned flows, repeated requests, accidental double-clicks. Agent sessions produce directed acyclic graphs converging on a small set of target endpoints. The graph diameter is short. The branching factor is low. This is directly observable in structured access logs correlated by session ID.

Resource fetch completeness. A real browser rendering /product/123 generates 30–80 parallel subrequests in the first 500ms: CSS, JavaScript bundles, fonts, the favicon, product images, analytics beacons. An agent visiting the same URL to extract price and availability data does not need those resources and does not fetch them. Nginx logs for the session contain GET requests for the API endpoints and nothing else. The ratio of API requests to static resource requests diverges sharply from real browser sessions.

Semantic request coherence. A session that visits the product page for item X at /product/123 and immediately requests /api/product/123/price and /api/product/123/availability is exhibiting narrow goal-directedness. A real human session for the same path would include requests to adjacent items the user considered, compare pages, brand filter API calls, and random navigation that reflects browsing as exploration rather than extraction. Sequential request analysis using sliding window correlation identifies the tightly-scoped semantic field of agent sessions.

Timing variance under load. Real users slow down when pages load slowly. Their inter-request intervals are correlated with server response time because they are waiting for pages to load before clicking. Agents make API calls programmatically; the timing of the next request depends on when the LLM finishes parsing the previous response, not on perceived page load speed. When server response time increases, human session inter-request intervals lengthen; agent session inter-request intervals remain flat. This decorrelation is detectable in Prometheus time-series analysis across sessions grouped by the server-side upstream_response_time.

DNS pre-resolution gap. A real browser that loads a page containing links to images.cdn.example.com immediately pre-resolves that hostname, generating a DNS query that reaches the site's authoritative DNS logs (unless a recursive resolver answers it from cache) before any request to that host arrives. An agent navigating directly to a product URL without having loaded the referring page shows no prior DNS pre-resolution for the linked hostnames. The absence of pre-resolution in the timing window before first contact is a negative signal that survives fingerprint mimicry.

WebSocket heartbeat regularity. Applications using WebSocket connections for real-time updates show human-correlated heartbeat patterns: messages are sent irregularly, at intervals that vary with user activity. Agent frameworks maintain WebSocket connections at their own tick rate, typically a fixed interval driven by the event loop. The coefficient of variation of heartbeat intervals is significantly lower for agent-managed connections than for human-driven ones. This is directly observable in WebSocket access logs that capture per-frame timing.

Server-push utilisation. HTTP/2 server push offers resources to the client before they are explicitly requested. Browsers that support push accept the pushed resources and cache them; the push is utilised. An agent that navigates directly to a known URL without needing the pushed resources will RST_STREAM them immediately. The RST_STREAM frames are observable in HTTP/2 connection logs and indicate that the client did not consume resources that a push-capable browser would have.

Each signal is extractable from infrastructure that most production deployments already run. The following sections cover the log schema required, the extraction logic, how to feed a weighted composite score, and where each detection breaks.
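As a concrete illustration of the WebSocket heartbeat signal described above, the following is a minimal sketch of the coefficient-of-variation calculation. It assumes per-frame timestamps (in seconds) have already been extracted from the WebSocket access logs; the function name `interval_cv` is hypothetical and not part of any library.

```python
from statistics import mean, pstdev
from typing import Optional


def interval_cv(frame_times: list[float]) -> Optional[float]:
    """Coefficient of variation (stddev / mean) of gaps between frames.

    Returns None when there are too few frames to say anything useful.
    """
    if len(frame_times) < 3:
        return None
    ordered = sorted(frame_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average = mean(gaps)
    if average == 0:
        return None  # all frames share one timestamp; CV is undefined
    return pstdev(gaps) / average


# A fixed 5-second tick has no variation at all.
fixed_tick = interval_cv([0, 5, 10, 15, 20])
# Activity-driven traffic is bursty, so the CV is much larger.
bursty = interval_cv([0, 1, 9, 10, 30])
```

A lower value means a more regular connection. Any threshold applied to it should be calibrated against labelled traffic rather than taken from this sketch.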

Threat Model

Mass-scrape by agentic bot pool. An operator running 500 Playwright+LLM agent sessions against an e-commerce site to extract product pricing is detectable primarily through resource fetch incompleteness (no CSS, no images, no fonts) and API call graph linearity (each session hits /product/{id}, /api/product/{id}/price, /api/product/{id}/availability, then terminates). Real sessions browsing the same product pages generate 10–15× more requests per page view across a full resource set. The ratio is stable and not easily faked without downloading and discarding the resources, which is expensive and self-defeating.

Credential stuffing via agentic browser. An agent cycling through username/password combinations using a real browser to bypass fingerprinting-based rate limits is detectable through timing consistency under load (all sessions maintain the same inter-request interval regardless of server response time variance) and DNS pre-resolution gap (no pre-resolution of linked domains the login page contains, because the agent navigated directly without loading the referring page). The absence of idle gaps in sessions longer than 60 seconds is also diagnostic.

Automated checkout fraud. An agentic session that navigates directly from session start to the checkout flow exhibits two independent signals: semantic coherence (the API call sequence goes directly to cart and checkout APIs without any product browsing) and resource fetch incompleteness (checkout flows load significantly fewer product image assets than browsing sessions, but even this reduced set is not fetched by agents focused on completing the transaction). The direct-to-checkout path without prior category browsing has a near-zero base rate in real human sessions.
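The DNS pre-resolution gap used in the credential-stuffing scenario above reduces to a window lookup. The sketch below assumes that DNS query timestamps for a linked hostname have already been attributed to the client (for example by resolver address, which is itself an approximation) and that the first HTTP contact time is known. The function name and the 30-second default window are illustrative assumptions, not values from the source.

```python
from bisect import bisect_left


def has_preresolution(
    dns_query_times: list[float],
    first_contact_time: float,
    window_s: float = 30.0,
) -> bool:
    """True if a DNS query arrived within window_s seconds before first contact."""
    times = sorted(dns_query_times)
    # Index of the first query at or after the start of the window.
    start = bisect_left(times, first_contact_time - window_s)
    # That query only counts if it also precedes the first contact.
    return start < len(times) and times[start] < first_contact_time


# Query five seconds before the first request: pre-resolution observed.
seen = has_preresolution([100.0], first_contact_time=105.0)
# No query at all: the negative signal described in the text.
missing = has_preresolution([], first_contact_time=105.0)
```

Because recursive resolvers cache answers, a missing query is weak evidence on its own and fits best as one input to the composite score.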

Hardening Configuration

1. Structured Access Log Schema for Bot-Detection Signals

The default nginx combined log format does not capture session identifiers, HTTP/2 stream identifiers, upstream response times, or SSL cipher information. Extend it:

# /etc/nginx/nginx.conf — extend the http block with a structured log format.
# Captures all fields needed for the seven detection signals.
log_format bot_detection escape=json
  '{'
    '"timestamp":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"request":"$request",'
    '"uri":"$uri",'
    '"args":"$args",'
    '"status":"$status",'
    '"bytes_sent":"$bytes_sent",'
    '"request_length":"$request_length",'
    '"http_referer":"$http_referer",'
    '"http_user_agent":"$http_user_agent",'
    '"request_time":"$request_time",'
    '"upstream_response_time":"$upstream_response_time",'
    '"upstream_connect_time":"$upstream_connect_time",'
    '"ssl_protocol":"$ssl_protocol",'
    '"ssl_cipher":"$ssl_cipher",'
    '"http2":"$http2",'
    '"session_id":"$cookie_session_id",'
    '"x_forwarded_for":"$http_x_forwarded_for",'
    '"request_id":"$request_id"'
  '}';

access_log /var/log/nginx/bot_detection.log bot_detection buffer=16k flush=5s;

The $http2 variable is h2 for HTTP/2 over TLS, h2c for cleartext HTTP/2, and empty otherwise. Stock nginx (ngx_http_v2_module) exposes no per-stream identifier variable, so the format above does not log one; referencing an undefined variable such as $http2_stream_id makes nginx refuse to load the configuration. Correlating RST_STREAM events with individual resource requests therefore needs a source outside the access log. The $upstream_response_time is the raw upstream latency, required for timing-variance-under-load analysis.
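To show how these fields feed the later scoring code, here is a minimal sketch that reads lines in the JSON format above and groups request URIs by session. The function name `group_requests_by_session` is hypothetical. It assumes one JSON object per line and timestamps that share a single UTC offset, so that string ordering matches time ordering.

```python
import json
from collections import defaultdict


def group_requests_by_session(log_lines: list[str]) -> dict[str, list[str]]:
    """Map session_id -> request URIs in timestamp order."""
    sessions: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        session_id = entry.get("session_id") or ""
        if not session_id:
            continue  # no session cookie; cannot be attributed
        sessions[session_id].append(
            (entry.get("timestamp", ""), entry.get("uri", ""))
        )
    # Stable sort on the timestamp only, so same-second requests keep log order.
    return {
        sid: [uri for _, uri in sorted(reqs, key=lambda r: r[0])]
        for sid, reqs in sessions.items()
    }


sample = [
    '{"timestamp":"2025-01-01T00:00:02+00:00","uri":"/api/product/1/price","session_id":"a"}',
    '{"timestamp":"2025-01-01T00:00:01+00:00","uri":"/product/1","session_id":"a"}',
    "not json",
]
by_session = group_requests_by_session(sample)
```

The per-session URI list is the shape the resource completeness scorer in the next section expects as its session_requests argument.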

For Envoy, the equivalent access log configuration using the JSON formatter:

# envoy.yaml — access_log configuration under http_connection_manager
access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /dev/stdout
      log_format:
        json_format:
          timestamp: "%START_TIME%"
          method: "%REQ(:METHOD)%"
          path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
          response_code: "%RESPONSE_CODE%"
          duration: "%DURATION%"
          upstream_host: "%UPSTREAM_HOST%"
          upstream_service_time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
          response_flags: "%RESPONSE_FLAGS%"
          bytes_received: "%BYTES_RECEIVED%"
          bytes_sent: "%BYTES_SENT%"
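          # %REQ(COOKIE)% logs the entire Cookie header, not just the session
          # cookie; extract session_id downstream and treat this field as sensitive.
          session_id: "%REQ(COOKIE)%"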
          request_id: "%REQ(X-REQUEST-ID)%"
          protocol: "%PROTOCOL%"

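The response_flags field in Envoy includes DC (downstream connection termination). DC records that the client closed the connection before the response completed, which makes it a coarse proxy for the server-push signal rather than a direct record of RST_STREAM frames on pushed streams.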

2. Resource Fetch Completeness Scoring

Resource completeness detection requires a manifest of expected resources per page. This manifest is derived by running a headless browser crawl of each landing page and recording the full resource waterfall. Build and maintain this as a checked-in JSON file, updated as part of the frontend release pipeline.

# bot_detection/resource_completeness.py
import fnmatch
from dataclasses import dataclass
from typing import Optional

# Per-page resource manifests: map landing page prefix to expected resource
# patterns. These are maintained by the frontend release pipeline — not
# hand-coded — via a headless crawl that records waterfall resources.
RESOURCE_MANIFEST: dict[str, list[str]] = {
    "/product/": [
        "favicon.ico",
        "/static/css/main.*.css",
        "/static/js/app.*.js",
        "/static/js/vendor.*.js",
        "/api/product/*/price",
        "/api/product/*/availability",
        "/static/images/product/*",
        "/static/fonts/*.woff2",
        "/api/recommendations*",
    ],
    "/login": [
        "favicon.ico",
        "/static/css/auth.*.css",
        "/static/js/login.*.js",
        "/api/auth/challenge",
    ],
    "/checkout": [
        "favicon.ico",
        "/static/css/checkout.*.css",
        "/static/js/checkout.*.js",
        "/api/cart/summary",
        "/api/user/addresses",
        "/api/payment/methods",
    ],
}

# Resources that are signals of browsing intent rather than task completion.
# Their absence is diagnostic for extraction-focused agents.
BROWSING_SIGNAL_PATTERNS = [
    "/static/images/product/*",
    "/static/fonts/*.woff2",
    "/api/recommendations*",
]


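def _matches_pattern(path: str, pattern: str) -> bool:
    """Case-sensitive glob match against the whole path.

    fnmatch matches the entire string, so a bare pattern such as
    "favicon.ico" would never match the logged "/favicon.ico". Patterns
    without a leading slash are therefore compared against the path with
    its leading slash stripped.
    """
    if not pattern.startswith("/"):
        path = path.lstrip("/")
    return fnmatch.fnmatchcase(path, pattern)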


def resource_completeness_score(
    session_requests: list[str],
    landing_page: str,
) -> float:
    """
    Returns a suspicion score [0.0, 1.0]. Higher = more suspicious.

    Logic:
    - Find the manifest for the landing page prefix.
    - Count how many expected resource patterns were matched.
    - Weight browsing-signal patterns more heavily.
    - Completeness = weighted matched / weighted total.
    - Score = 1.0 - completeness.
    """
    manifest: Optional[list[str]] = None
    for prefix, patterns in RESOURCE_MANIFEST.items():
        if landing_page.startswith(prefix):
            manifest = patterns
            break

    if manifest is None:
        return 0.5  # Unknown page — neutral score; do not bias.

    total_weight = 0.0
    matched_weight = 0.0

    for pattern in manifest:
        # Browsing-signal patterns carry 2× weight.
        weight = 2.0 if pattern in BROWSING_SIGNAL_PATTERNS else 1.0
        total_weight += weight

        if any(_matches_pattern(req, pattern) for req in session_requests):
            matched_weight += weight

    if total_weight == 0:
        return 0.5

    completeness = matched_weight / total_weight
    return round(1.0 - completeness, 4)


@dataclass
class SessionResourceProfile:
    session_id: str
    landing_page: str
    api_request_count: int
    static_request_count: int
    image_request_count: int
    font_request_count: int
    all_requests: list[str]

    @property
    def api_to_static_ratio(self) -> float:
        """Real browser sessions fetch far more static resources than API
        endpoints, so the ratio sits well below 1.
        Agent sessions are often 3:1 or higher (many API, few static)."""
        if self.static_request_count == 0:
            return float("inf")
        return self.api_request_count / self.static_request_count

    @property
    def completeness_score(self) -> float:
        return resource_completeness_score(self.all_requests, self.landing_page)

The api_to_static_ratio is a simple pre-filter: a session that generates more API requests than static resource requests is strong evidence that no real browser is rendering pages. A real Chromium session loading a product page typically requests 1–3 API endpoints and 20–60 static resources. An agent session often inverts this.

3. Timing Consistency Under Load

The signal here is decorrelation, not absolute timing. Collect per-session aggregates of upstream_response_time and inter-request interval. For human sessions, these are positively correlated: slow server → longer user wait → wider inter-request gap. For agents, they are uncorrelated.

# prometheus/recording_rules/bot_detection.yml
groups:
  - name: bot_detection_timing
    interval: 30s
    rules:
      # Track per-session median inter-request interval over a 5-minute window.
      - record: session:inter_request_interval_p50:5m
        expr: |
          histogram_quantile(0.5,
            rate(http_session_inter_request_interval_bucket[5m])
          )

      # Track per-session median upstream response time.
      - record: session:upstream_response_time_p50:5m
        expr: |
          histogram_quantile(0.5,
            rate(http_upstream_response_time_bucket[5m])
          )

      # Coefficient of variation of inter-request intervals per session.
      # Low CV = suspiciously regular = likely agent.
      # CV = stddev / mean. Human sessions: CV typically 0.8–2.5.
      # Agent sessions: CV typically 0.1–0.4.
      - record: session:inter_request_interval_cv:5m
        expr: |
          (
            stddev_over_time(session:inter_request_interval_p50:5m[5m])
            /
            avg_over_time(session:inter_request_interval_p50:5m[5m])
          )

      # Timing decorrelation: flag sessions whose timing stays regular while
      # the server as a whole is slow.
      # `and on()` is required: the left side carries a session_id label and
      # the right side does not, so a plain `and` would never match.
      - alert: AgenticTimingDecorrelation
        expr: |
          session:inter_request_interval_cv:5m < 0.35
          and on()
          (avg(avg_over_time(nginx_upstream_response_time_seconds[5m])) > 0.5)
        for: 2m
        labels:
          severity: high
          signal: timing_decorrelation
        annotations:
          summary: >
            Session {{ $labels.session_id }} maintaining regular timing
            while server load is elevated (timing CV={{ $value | humanize }}).
          description: >
            Human sessions slow down under load. This session shows CV < 0.35
            during a period of elevated upstream response time (5-minute mean > 500 ms).
            Likely agent maintaining programmatic request cadence.

      # Secondary alert: low CV in isolation (possible on fast servers).
      # With a 30s evaluation interval, a 5-minute window holds at most about
      # 10 samples, so require 8 rather than "more than 10".
      - alert: SuspiciouslyRegularSessionTiming
        expr: |
          session:inter_request_interval_cv:5m < 0.25
          and
          count_over_time(session:inter_request_interval_p50:5m[5m]) >= 8
        for: 3m
        labels:
          severity: medium
          signal: low_timing_cv

The AgenticTimingDecorrelation rule is the higher-fidelity version: it fires only when the session is regular while the server is slow, which significantly reduces false positives from power users who happen to navigate quickly on a fast server.

A Grafana panel to visualise the signal:

{
  "title": "Session Timing CV vs Server Load",
  "type": "timeseries",
  "targets": [
    {
      "expr": "session:inter_request_interval_cv:5m",
      "legendFormat": "Session CV - {{ session_id }}"
    },
    {
      "expr": "avg_over_time(nginx_upstream_response_time_seconds[5m])",
      "legendFormat": "Server response time (5m mean)"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": null, "color": "red" },
          { "value": 0.25, "color": "yellow" },
          { "value": 0.35, "color": "green" }
        ]
      }
    },
    "overrides": [
      {
        "matcher": { "id": "byName", "options": "Server response time (5m mean)" },
        "properties": [{ "id": "custom.axisPlacement", "value": "right" }]
      }
    ]
  }
}

4. API Call Graph Analysis with OpenTelemetry Traces

Session-scoped distributed traces capture the sequence of backend API calls. OpenTelemetry spans include parent-child relationships that reconstruct the call graph. Build the graph, compute metrics, and classify:

# bot_detection/call_graph.py
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional
import networkx as nx


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str          # e.g., "GET /api/product/123/price"
    endpoint: str      # normalised path, e.g., "GET /api/product/{id}/price"
    duration_ms: float
    start_time_ms: float


@dataclass
class SessionGraphMetrics:
    session_id: str
    node_count: int
    edge_count: int
    unique_endpoints: int
    graph_diameter: int
    branching_factor: float
    endpoint_revisit_rate: float   # repeated requests to same endpoint / total
    path_entropy: float            # Shannon entropy of endpoint visit sequence

    @property
    def linearity_score(self) -> float:
        """
        Score [0.0, 1.0] for how linear the call graph is.
        0.0 = fully branching (human-like exploration)
        1.0 = fully linear (agent-like directed task)

        Low branching_factor + low path_entropy + low diameter
        relative to node count → high linearity.
        """
        # Normalise branching factor: 1.0 is linear, 0.0 would be maximally branching.
        bf_norm = max(0.0, 1.0 - (self.branching_factor / 3.0))

        # Normalise entropy: lower entropy = more linear.
        # Max entropy for n endpoints is log2(n); normalise against that.
        import math
        max_entropy = math.log2(max(2, self.unique_endpoints))
        entropy_norm = 1.0 - min(1.0, self.path_entropy / max_entropy)

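        # Diameter ratio: a short diameter relative to node count scores as
        # agent-like. Caveat: a pure chain of n nodes has diameter n - 1, so
        # this term rewards compact graphs rather than long chains.
        diameter_norm = 1.0 - min(
            1.0,
            self.graph_diameter / max(1, self.node_count)
        )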

        return round((bf_norm * 0.4 + entropy_norm * 0.35 + diameter_norm * 0.25), 4)


def analyse_session_graph(
    session_id: str,
    spans: list[Span],
) -> SessionGraphMetrics:
    """
    Build a directed call graph from OTel spans and compute linearity metrics.

    Human session example: 45 nodes, diameter 8, branching ~2.1, entropy 3.8
    Agent session example:  12 nodes, diameter 4, branching 1.1, entropy 1.2
    """
    import math
    from collections import Counter

    G = nx.DiGraph()

    # Index spans by ID once so each parent lookup is O(1).
    spans_by_id = {s.span_id: s for s in spans}

    for span in spans:
        G.add_node(span.endpoint)
        parent = spans_by_id.get(span.parent_span_id) if span.parent_span_id else None
        if parent is not None:
            G.add_edge(parent.endpoint, span.endpoint)

    endpoint_sequence = [s.endpoint for s in sorted(spans, key=lambda x: x.start_time_ms)]
    endpoint_counts = Counter(endpoint_sequence)
    total = len(endpoint_sequence)

    # Shannon entropy of endpoint visitation frequency.
    entropy = -sum(
        (c / total) * math.log2(c / total)
        for c in endpoint_counts.values()
        if c > 0
    )

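    # Graph diameter. Session graphs are typically disconnected (a request
    # without incoming trace context starts a new trace), and nx.diameter
    # raises on disconnected input, so take the largest diameter over the
    # connected components instead of collapsing to 0.
    undirected = G.to_undirected()
    diameter = max(
        (
            nx.diameter(undirected.subgraph(component))
            for component in nx.connected_components(undirected)
        ),
        default=0,
    )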

    revisit_count = sum(c - 1 for c in endpoint_counts.values() if c > 1)

    return SessionGraphMetrics(
        session_id=session_id,
        node_count=G.number_of_nodes(),
        edge_count=G.number_of_edges(),
        unique_endpoints=len(endpoint_counts),
        graph_diameter=diameter,
        branching_factor=G.number_of_edges() / max(1, G.number_of_nodes()),
        endpoint_revisit_rate=revisit_count / max(1, total),
        path_entropy=entropy,
    )

Export linearity_score as a Prometheus gauge labelled by session_id. A threshold of > 0.7 on linearity score alone has approximately 12% false positives from power users completing focused tasks. Combine with resource completeness to reduce that rate significantly.
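The Span.endpoint field above is described as a normalised path, which matters because unnormalised paths make every product ID look like a distinct graph node. A minimal normaliser might look like the following; the function name and the choice to treat only pure-numeric and UUID segments as identifiers are assumptions for illustration.

```python
import re

# Path segments treated as identifiers: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(?:\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalise_endpoint(method: str, path: str) -> str:
    """Collapse identifier segments to {id} and drop the query string."""
    path = path.split("?", 1)[0]
    segments = [
        "{id}" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    ]
    return f"{method.upper()} {'/'.join(segments)}"


price_endpoint = normalise_endpoint("get", "/api/product/123/price?currency=EUR")
```

Slug-style identifiers such as /product/blue-widget are not collapsed by this rule and would need a route table from the application.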

5. Loki LogQL: Resource Diversity Detection

The Loki query identifies sessions that make many API calls but almost no static resource requests — the resource fetch incompleteness signal expressed as a streaming query:

# LogQL: sessions with high API call volume but low static resource diversity.
# This query runs as a Grafana alerting rule on 10-minute windows.
#
# Left side: sessions with >50 API requests in 10 minutes.
# Right side: same sessions with <5 static resource requests.
# LogQL supports `and`/`unless` between vectors, but a session with zero
# static requests produces no series on the right side, so a `< 5` filter
# cannot see it. The metric recording approach below handles that case.

# API request rate per session:
sum by (session_id) (
  count_over_time(
    {job="nginx-access"}
    | json
    | line_format "{{.session_id}} {{.uri}}"
    | uri =~ "^/api/.*"
    [10m]
  )
) > 50

# Static resource request count per session (run separately, correlate in Grafana):
sum by (session_id) (
  count_over_time(
    {job="nginx-access"}
    | json
    | line_format "{{.session_id}} {{.uri}}"
    | uri =~ "^/static/.*"
    [10m]
  )
) < 5

For the composite query that joins both conditions via a recording rule, convert the LogQL metrics to Prometheus via Loki’s ruler and use PromQL for the join:

# loki/ruler.yml — recording rules to enable PromQL join
groups:
  - name: bot_detection_resource
    interval: 1m
    rules:
      - record: session:api_requests_10m
        expr: |
          sum by (session_id) (
            count_over_time(
              {job="nginx-access"}
              | json
              | uri =~ "^/api/.*"
              [10m]
            )
          )

      - record: session:static_requests_10m
        expr: |
          sum by (session_id) (
            count_over_time(
              {job="nginx-access"}
              | json
              | uri =~ "^/static/.*"
              [10m]
            )
          )

Then, as a Prometheus alerting rule:

- alert: HighApiLowStaticSession
  expr: |
    session:api_requests_10m > 50
    unless
    session:static_requests_10m > 5
  for: 5m
  labels:
    severity: high
    signal: resource_incompleteness
  annotations:
    summary: >
      Session {{ $labels.session_id }} making high API volume
      with no static resource fetching — likely agent session.

The unless operator in PromQL returns the left-hand series where no matching right-hand series exists. Sessions with session:api_requests_10m > 50 but no session:static_requests_10m > 5 are exactly the high-API, no-static sessions.
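The set semantics of unless can be mimicked with plain dictionaries, which makes it easier to see why sessions with no static series at all are still flagged. This is only an illustration of the operator's behaviour on label-matched series, with made-up session IDs and counts; it is not how Prometheus evaluates the rule internally.

```python
def unless(left: dict[str, float], right: dict[str, float]) -> dict[str, float]:
    """Keep left-hand entries whose key has no entry on the right."""
    return {key: value for key, value in left.items() if key not in right}


# Hypothetical per-session counts over the 10-minute window.
api_requests = {"s1": 120, "s2": 80, "s3": 12}
static_requests = {"s2": 40, "s3": 30}  # s1 fetched no static resources

# Apply the comparison filters first, as PromQL does, then the set operator.
high_api = {k: v for k, v in api_requests.items() if v > 50}
enough_static = {k: v for k, v in static_requests.items() if v > 5}

flagged = unless(high_api, enough_static)
```

Session s1 has no static series, so nothing on the right-hand side can match it and it survives; s2 is removed because it fetched enough static resources; s3 never passes the API threshold.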

6. Composite Bot Score with SIEM Integration

Weight the individual signals into a composite score that drives enforcement decisions. Each signal contributes proportionally based on its empirical precision against a labelled session dataset:

# bot_detection/scorer.py
from __future__ import annotations
from dataclasses import dataclass
from typing import ClassVar
from prometheus_client import Gauge, Counter

# Prometheus metrics for export.
BOT_SCORE = Gauge(
    "session_bot_score",
    "Composite bot suspicion score [0.0, 1.0]",
    ["session_id"],
)
BOT_ACTION = Counter(
    "session_bot_action_total",
    "Actions taken on sessions by bot scorer",
    ["action"],
)


@dataclass
class BotScore:
    session_id: str

    # Each signal is normalised to [0.0, 1.0] where 1.0 = most suspicious.
    resource_completeness: float   # weight 0.30
    timing_regularity: float       # weight 0.25
    graph_linearity: float         # weight 0.25
    semantic_coherence: float      # weight 0.15
    server_push_rejection: float   # weight 0.05

    # Signal-specific weights derived from labelled session analysis.
    # Adjust these weights using a validation dataset for your traffic profile.
    # ClassVar keeps WEIGHTS out of the generated __init__ signature.
    WEIGHTS: ClassVar[tuple[float, ...]] = (0.30, 0.25, 0.25, 0.15, 0.05)

    @property
    def composite(self) -> float:
        scores = (
            self.resource_completeness,
            self.timing_regularity,
            self.graph_linearity,
            self.semantic_coherence,
            self.server_push_rejection,
        )
        return round(
            sum(w * s for w, s in zip(self.WEIGHTS, scores)),
            4,
        )

    def action(self) -> str:
        """
        Block:     composite > 0.80  (high confidence agent session)
        Challenge: composite > 0.55  (suspicious; CAPTCHA or JS challenge)
        Flag:      composite > 0.35  (log and monitor; do not enforce)
        Allow:     composite <= 0.35 (insufficient signal)
        """
        score = self.composite
        if score > 0.80:
            return "block"
        elif score > 0.55:
            return "challenge"
        elif score > 0.35:
            return "flag"
        else:
            return "allow"

    def export_to_prometheus(self) -> None:
        BOT_SCORE.labels(session_id=self.session_id).set(self.composite)
        action = self.action()
        BOT_ACTION.labels(action=action).inc()


def build_score_from_signals(
    session_id: str,
    resource_profile: "SessionResourceProfile",
    graph_metrics: "SessionGraphMetrics",
    timing_cv: float,
    semantic_score: float,
    rst_stream_count: int,
    pushed_resource_count: int,
) -> BotScore:
    """
    Normalise raw signal values to [0.0, 1.0] suspicion scores.
    """
    # Timing regularity: CV < 0.25 → 1.0 (very regular = suspicious)
    # CV > 1.5 → 0.0 (highly irregular = human-like)
    timing_regularity = max(0.0, min(1.0, 1.0 - (timing_cv - 0.25) / 1.25))

    # Server-push rejection: high RST_STREAM count relative to pushes = suspicious
    push_rejection = (
        min(1.0, rst_stream_count / pushed_resource_count)
        if pushed_resource_count > 0
        else 0.5  # No pushes → neutral
    )

    return BotScore(
        session_id=session_id,
        resource_completeness=resource_profile.completeness_score,
        timing_regularity=timing_regularity,
        graph_linearity=graph_metrics.linearity_score,
        semantic_coherence=semantic_score,
        server_push_rejection=push_rejection,
    )

Route the resulting alerts through Alertmanager to the enforcement service:

# alertmanager/routes.yml — bot score routing
route:
  receiver: "default"
  routes:
    - match:
        signal: resource_incompleteness
      receiver: "bot-enforcement"
    - match:
        signal: timing_decorrelation
      receiver: "bot-enforcement"
    - match_re:
        severity: "high"
        signal: ".*"
      receiver: "bot-enforcement"

receivers:
  - name: "bot-enforcement"
    webhook_configs:
      - url: "https://bot-enforcement.internal/api/score"
        send_resolved: false
        http_config:
          bearer_token_file: /var/run/secrets/bot-enforcement-token
        max_alerts: 0

The enforcement endpoint receives the alert payload including session_id and signal labels and updates its session blocklist accordingly. Use a short TTL (15–60 minutes) on enforcement decisions derived from statistical signals to reduce the impact of false positives.
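A short-TTL blocklist of the kind described can be sketched in a few lines. The class name `TTLBlocklist` and the in-memory dictionary are assumptions for illustration; a production enforcement service would more likely use a shared store with native key expiry. The clock is injectable so the expiry logic can be tested without sleeping.

```python
import time
from typing import Callable


class TTLBlocklist:
    """In-memory blocklist whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 900.0,  # 15 minutes, the low end of the range
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry: dict[str, float] = {}

    def block(self, session_id: str) -> None:
        """Block the session; blocking again restarts the TTL."""
        self._expiry[session_id] = self._clock() + self._ttl

    def is_blocked(self, session_id: str) -> bool:
        expires_at = self._expiry.get(session_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expiry[session_id]  # lazily drop expired entries
            return False
        return True


# Usage with a fake clock to show the expiry.
now = [0.0]
blocklist = TTLBlocklist(ttl_seconds=900.0, clock=lambda: now[0])
blocklist.block("session-abc")
blocked_immediately = blocklist.is_blocked("session-abc")
now[0] = 900.0
blocked_after_ttl = blocklist.is_blocked("session-abc")
```

Expiring decisions automatically means a false positive locks a real user out for at most one TTL, without requiring an operator to intervene.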

Expected Behaviour After Instrumentation

After deploying the extended nginx log format: log analysis pipelines gain session_id, upstream_response_time, http2, and ssl_cipher fields. These enable session correlation directly in Loki without a separate session-attribution layer.

After deploying the Loki recording rules: session:api_requests_10m and session:static_requests_10m metrics are available in Prometheus. The HighApiLowStaticSession alert fires within 5–10 minutes for agent sessions generating high API volume. Expect this to fire on legitimate API clients using bearer-auth rather than cookie sessions; add a filter for sessions that include an Authorization header to exclude known API consumers.

After deploying the Prometheus timing rules: the AgenticTimingDecorrelation alert fires during load events when real users slow down but certain sessions maintain constant cadence. Expect 2–5 false positives per load event from automated monitoring tools and health checks with fixed polling intervals; exclude /health, /metrics, and synthetic monitoring probe IPs by label.

After deploying the composite scorer: sessions that trigger the block action (composite > 0.80) should be reviewed manually for the first two weeks to confirm precision before connecting to hard enforcement. In a production traffic profile with 5% automated traffic, expect precision around 0.85–0.92 for the block tier when all five signals are active. The challenge tier (0.55–0.80) will include legitimate power users completing focused tasks; challenge-based enforcement (CAPTCHA, proof-of-work) is appropriate there rather than hard blocking.

Trade-offs

Resource fetch completeness is the highest-precision signal individually, but it requires maintaining per-page resource manifests in sync with the frontend release pipeline. Hash changes alone are absorbed by glob patterns such as /static/css/main.*.css, but a refactor that renames or splits bundles makes the manifest stale within a single deploy. If the manifest is not updated, real browsers fail completeness checks on the new resource names. Automate manifest generation as a post-build step using a Playwright crawl that visits each page and records the resource waterfall into the manifest JSON file. Without this automation, the signal degrades quickly.
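The crawl itself is deployment-specific, but the step that turns recorded resource paths into manifest patterns is small. The sketch below assumes the crawl has already produced a list of URL paths and that content hashes appear as a dot-delimited run of six or more lowercase hex characters; both the function names and that hash convention are assumptions.

```python
import re

# A content hash: ".<6+ hex chars>" that is followed by another dot.
_HASH_SEGMENT = re.compile(r"\.[0-9a-f]{6,}(?=\.)")


def to_manifest_pattern(path: str) -> str:
    """Replace content-hash segments with a glob wildcard."""
    return _HASH_SEGMENT.sub(".*", path)


def build_manifest(recorded_paths: list[str]) -> list[str]:
    """Deduplicated, sorted glob patterns for one page's waterfall."""
    return sorted({to_manifest_pattern(path) for path in recorded_paths})


recorded = [
    "/static/css/main.3f9a1c2b.css",
    "/static/css/main.77aa00ff.css",  # same bundle from an earlier build
    "/favicon.ico",
]
manifest = build_manifest(recorded)
```

Paths that vary by entity rather than by build, such as per-product images, are not generalised by this rule and still need an explicit wildcard pattern.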

Timing variance under load requires stateful session correlation across requests. Each request log entry must be attributed to a session, and the distribution of inter-request intervals must be tracked in a time-series store. This adds state management complexity and latency to the detection pipeline. The signal is most precise during load events; on a consistently fast server, the decorrelation between session timing and server response time is less visible. Consider pairing this with artificial load amplification (synthetic slow requests injected for specific sessions suspected by other signals) to make the decorrelation visible.
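The per-session computation can be sketched as below, assuming each request has already been attributed to a session and carries a timestamp and an upstream response time. The function names are hypothetical, and pairing each gap with the response time of the request that preceded it is one reasonable choice rather than the only one.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    return statistics.pstdev(values) / mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def timing_decorrelation(requests):
    """Summarise one session's pacing.

    requests: list of (timestamp_seconds, upstream_response_time_seconds).
    Returns (cv, correlation): the CV of inter-request gaps, and the correlation
    between each gap and the upstream response time of the preceding request.
    """
    if len(requests) < 3:
        raise ValueError("need at least three requests per session")
    ordered = sorted(requests)
    times = [t for t, _ in ordered]
    gaps = [b - a for a, b in zip(times, times[1:])]
    upstream = [u for _, u in ordered[:-1]]
    return coefficient_of_variation(gaps), pearson(upstream, gaps)
```

A session that keeps a fixed cadence while upstream latency climbs yields a CV and a correlation near zero; a session whose gaps stretch with server latency yields a correlation near one.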

API call graph analysis depends on OpenTelemetry instrumentation propagating trace context through all API calls in a session, and on session IDs being stable across requests. Sessions that rotate identifiers to avoid tracking (privacy-focused users, Tor users) will have fragmented graphs that look like bot graphs simply because the correlation is broken. Use server-side session attribution from auth tokens rather than client-set cookies for the graph analysis to reduce this failure mode.

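Server-push utilisation as a signal requires HTTP/2 server push to be actively configured and the RST_STREAM detection to be enabled in the log format. Most nginx and Envoy deployments do not push resources by default, and support is shrinking: Chrome has disabled HTTP/2 server push by default since version 106, and nginx removed its push directives in 1.25.1, so confirm that both your edge and the clients you care about still accept pushed streams before relying on this signal. Where push is available, adding it solely for detection purposes is a reasonable approach: push 2–3 key resources on every product page load. The overhead is small, the signal is clean, and it adds defensive depth.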

Semantic request coherence scoring requires either a hand-maintained endpoint classification map or an ML inference step to categorise request intent. The former is maintainable for well-defined APIs; the latter adds operational complexity. For most deployments, a rule-based classifier that maps URL patterns to intent categories (browse, add-to-cart, checkout, auth) is sufficient and interpretable.
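A rule-based classifier of this kind can be very small. The sketch below uses the four intent categories named above; the URL patterns are hypothetical and would need to be replaced with the site's real routes.

```python
import re

# Hypothetical URL patterns for illustration; the first matching rule wins.
INTENT_RULES = [
    ("auth", re.compile(r"^/(login|logout|account|api/auth)(/|$)")),
    ("checkout", re.compile(r"^/(checkout|api/orders|api/payment)(/|$)")),
    ("add-to-cart", re.compile(r"^/(cart|api/cart)(/|$)")),
    ("browse", re.compile(r"^/$|^/(products|category|search|api/products)(/|$)")),
]


def classify(path: str) -> str:
    """Map a request path to an intent category, ignoring the query string."""
    path = path.split("?", 1)[0]
    for intent, pattern in INTENT_RULES:
        if pattern.search(path):
            return intent
    return "other"


def intent_sequence(paths):
    """Classify every request in a session, preserving order."""
    return [classify(p) for p in paths]
```

Because the rules are an ordered list, reviewers can read exactly why a request was assigned to a category, which is the interpretability advantage over an ML inference step.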

Failure Modes

Resource completeness false positives from privacy tools. uBlock Origin, Brave’s ad blocker, and Safari’s content blockers prevent requests to analytics endpoints and some image CDNs. A real user behind an aggressive content blocker will miss 20–40% of expected resources, enough to push the completeness signal into suspicious territory. Distinguish content-blocked sessions from agent sessions by which resources are missing: a content-blocked real session fetches the static assets (CSS, fonts) but not the analytics beacon, whereas an extraction agent session is missing both. Add an explicit check for CSS and font presence: content blockers rarely block first-party CSS, and extraction agents rarely fetch it. Weight fonts more lightly, since some blockers (for example uBlock Origin’s remote-font option) can suppress them.
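The distinction can be sketched as a small decision function. The category names, the thresholds, and the returned labels are all hypothetical; `core_categories` is a parameter so that fonts can be dropped from the core check on sites whose users commonly block remote fonts.

```python
def classify_incomplete_session(expected, fetched, core_categories=("css", "font"),
                                core_min=0.9, complete_min=0.95):
    """Label a page view by which expected resources were actually fetched.

    expected, fetched: dicts mapping a category ("css", "font", "image",
    "analytics") to a set of URLs. Thresholds are illustrative, not tuned.
    """
    def ratio(category):
        want = expected.get(category, set())
        if not want:
            return 1.0  # nothing expected in this category
        return len(want & fetched.get(category, set())) / len(want)

    total_expected = sum(len(urls) for urls in expected.values())
    total_fetched = sum(
        len(urls & fetched.get(cat, set())) for cat, urls in expected.items()
    )
    overall = total_fetched / total_expected if total_expected else 1.0

    if overall >= complete_min:
        return "complete"
    core_present = all(ratio(c) >= core_min for c in core_categories)
    if not core_present:
        return "likely_agent"            # core static assets missing
    if ratio("analytics") < 0.5:
        return "likely_content_blocker"  # core assets present, beacon missing
    return "ambiguous"
```

Only the `likely_agent` label should feed the completeness signal at full weight; `likely_content_blocker` sessions can be scored as if they were complete.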

Timing analysis false positives for mobile users on poor connections. A mobile user on an intermittent 4G connection can exhibit low inter-request CV during periods of dropped connectivity, because the variability in their timing is not driven by page load perception but by connectivity windows. Segment timing analysis by network type if available in request headers (Save-Data, ECT hints) or by RTT estimates from the nginx $request_time to $upstream_response_time delta (high delta = high latency client, likely mobile). Apply more lenient CV thresholds for high-latency sessions.
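A minimal sketch of the latency-based segmentation follows. It assumes low CV is the suspicious direction, so a "more lenient" threshold is a lower one; the numeric thresholds and function names are hypothetical placeholders.

```python
import statistics


def client_latency_estimate(request_time, upstream_response_time):
    """Rough client-side latency proxy: nginx $request_time minus $upstream_response_time."""
    return max(0.0, request_time - upstream_response_time)


def cv_threshold(deltas, base=0.25, lenient=0.10, high_latency_s=0.5):
    """Pick the CV threshold for a session from its per-request latency deltas.

    base, lenient and high_latency_s are illustrative values, not tuned ones.
    """
    if statistics.median(deltas) >= high_latency_s:
        return lenient  # high-latency client, likely mobile: require a lower CV to flag
    return base


def is_timing_suspicious(cv, deltas):
    """Flag a session whose inter-request CV falls below its segment's threshold."""
    return cv < cv_threshold(deltas)
```

Using the median rather than the mean keeps a single stalled request from moving a fast desktop session into the high-latency segment.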

Graph linearity false positives for legitimate power users. A returning user who knows the site and navigates directly to the checkout with a product they already researched will produce a linear graph. A B2B buyer using the site’s quote API programmatically (with a real browser, authenticated session) will produce high API-to-static ratios. Both will generate elevated scores on some signals. Mitigate with session history: sessions with authenticated history of 30+ prior sessions with normal graphs should have their current session’s linearity signal down-weighted. This requires session-level history state, which is an additional storage dependency.
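The history-based mitigation amounts to a conditional weight on one input to the composite. In this sketch the 30-session cut-off comes from the text, while the function name and the down-weighting factor of 0.5 are hypothetical.

```python
def adjusted_linearity(linearity, prior_normal_sessions, min_history=30, weight=0.5):
    """Down-weight the linearity signal for sessions with established history.

    linearity: this session's linearity score in [0, 1].
    prior_normal_sessions: count of earlier authenticated sessions with normal graphs.
    weight: illustrative down-weighting factor applied once min_history is reached.
    """
    if prior_normal_sessions >= min_history:
        return linearity * weight
    return linearity
```

The lookup of `prior_normal_sessions` is where the additional storage dependency appears; the adjustment itself is stateless.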

Agent frameworks that deliberately fetch every resource. A well-engineered evasion tool could fetch the full resource set on every page visit, intentionally downloading all CSS, images, and fonts before beginning its task. This defeats resource completeness scoring entirely but increases the agent’s operational cost by 5–10× (downloading megabytes of images it does not need). The signal degrades but the cost to evade is non-trivial. When completeness is defeated, the timing and graph signals must carry the detection. Monitor for sessions whose API patterns look suspicious but whose resource completeness signal is entirely clean: this asymmetry may itself be a signal.

Loki LogQL recording rule lag. The 1-minute recording interval means the HighApiLowStaticSession alert can lag up to 1 minute behind the condition being true, and the 10-minute count window means a scraping session that operates for less than 10 minutes may not accumulate enough signal before moving on. For short burst scraping sessions, rely on single-request signals (resource completeness per page view) rather than aggregate session signals. The single-request check requires session correlation at the time of the page view rather than post-hoc, so it has to sit close to the request path rather than in the log pipeline.

HTTP/2 RST_STREAM signal absent on HTTP/1.1 connections. The server-push signal only applies to HTTP/2 connections. Agent frameworks that explicitly connect via HTTP/1.1 (for compatibility or simplicity) will not trigger RST_STREAM events. Treat a browser-identified user agent that does not negotiate HTTP/2 as a mild independent signal: most real Chromium instances negotiate HTTP/2 automatically via ALPN during the TLS handshake on servers that support it.