AI-Generated Monitoring vs. Open Source Observability Standards: The Ecosystem Argument

The Problem

Observability infrastructure is different from application code in one critical way: it has to survive the team that wrote it. Your authentication service can be replaced. Your CI pipeline can be migrated. Your observability layer, by contrast, is the thing that is supposed to tell you when something breaks, so when it breaks itself, somebody notices only if there is still someone around who understands what it does, who maintains its dependencies, and how it fits into the surrounding tooling. That is not a soft requirement. It is the condition under which observability provides any value at all.

An LLM can produce a working Prometheus exporter in under two minutes. It can write a Fluent Bit Lua filter, an OpenTelemetry instrumentation shim, or a custom log parser for any format you describe. The output passes a code review by someone who doesn’t know Prometheus internals. Tests pass. The exporter ships. Metrics appear in Grafana. The initial velocity is genuinely impressive, and the temptation to repeat the pattern is strong because the feedback loop is so short.

The feedback loop for the maintenance cost is 18 months long. By then, the engineer who prompted the exporter into existence has moved teams. The exporter is running in production with no owner listed in the service catalogue. Prometheus 3.0 made scraping stricter about the Content-Type header a target returns, and the custom exporter’s /metrics endpoint returns a header the new scrape manager doesn’t accept by default. The organisation adopted Datadog six months ago, and the platform team is now trying to route all metrics through the OpenTelemetry Collector, except that the custom exporter uses a non-standard label schema that doesn’t map cleanly to Datadog conventions. A CVE was published against the net/http wrapper the exporter uses; the security team raised a ticket, it sat in an unowned backlog, and the service is still running the vulnerable version.

This is not a failure mode that requires bad intent or careless engineering. It is the normal lifecycle of custom observability code when the initial delivery cost is driven close to zero by AI assistance. The lower the perceived cost of creation, the more of it gets created, and the higher the aggregate maintenance cost becomes once the initial author is no longer available. The problem is structural, not individual.

What open source observability standards — Prometheus, OpenTelemetry, Fluent Bit, the OpenTelemetry Collector — provide is not better code than an LLM can write today. In many cases, the LLM produces syntactically equivalent code. What open source standards provide is an ecosystem contract that extends years past the initial commit: semantic conventions that vendors and tools rely on, a security disclosure process with a documented response timeline, stability guarantees that allow tooling to evolve without breaking consumers, and an update pipeline that patches CVEs without requiring a human to remember that a custom exporter exists.

Five things open source observability provides that AI-generated code cannot replicate:

1. Semantic conventions. OpenTelemetry’s semantic conventions define standard attribute names for HTTP, database, RPC, messaging, and system metrics and traces. http.request.method, db.system, messaging.system, net.peer.name: these are not suggestions. Vendors build dashboards assuming them. Correlation queries across traces, metrics, and logs work because every service uses the same field names. When you use the OTel instrumentation libraries, you get these names automatically. When an LLM writes custom instrumentation, it guesses: http.verb, database.type, queue.name. Those guesses are incompatible with every standard dashboard, every vendor’s out-of-the-box detection, and every cross-service query you will ever write.

2. Vendor interoperability. The OpenTelemetry Collector ships exporters for Datadog, Jaeger, Zipkin, Grafana Tempo, AWS X-Ray, Google Cloud Trace, Azure Monitor, and a dozen others. Those exporters are maintained by the respective vendors. When Datadog’s intake API version changes, Datadog’s engineering team updates the exporter. When AWS X-Ray changes its trace format, the AWS Distro for OpenTelemetry team updates their exporter. Your AI-generated custom exporter has no vendor relationship. When the target API changes, the exporter breaks silently, and nobody knows until alerts stop firing.

3. Security patch pipeline. go.opentelemetry.io/otel, github.com/prometheus/client_golang, fluent-bit — all have security disclosure processes, CVE tracking, and patch release timelines. The Prometheus project’s security contacts are listed at prometheus.io/community. CNCF maintains a security disclosure process for all its hosted projects. Your AI-generated exporter has whatever vulnerability management process the team remembers to apply to it, which in practice means none.

4. Stability guarantees. Prometheus’s data model — metric name, label set, sample value, timestamp — has been stable for years. PromQL is a documented, stable API. OpenTelemetry SDK APIs are explicitly versioned with stability levels: stable, experimental, deprecated. Changes to stable APIs require a deprecation cycle documented in the changelog. Custom code has whatever stability the original author intended, documented nowhere, honoured by nobody who wasn’t in the room when it was written.

5. Integration surface area. The prometheus/client_golang library integrates with go-kit metrics, Go’s standard expvar, the default process and Go runtime collectors, and standard alerting configurations. The OTel SDK integrates with context propagation across gRPC, HTTP, AWS Lambda invocations, Kafka message headers, and cloud provider tracing. Building that integration surface from scratch, or expecting an LLM to generate it correctly, is not a realistic ask.
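
The naming gap in item 1 can be made concrete with a small sketch. Everything here is illustrative: the mapping covers only the guessed names quoted above whose standard equivalent is also quoted above, and normalise_attributes is a hypothetical helper, not part of any OTel package.

```python
# Hypothetical mapping from guessed attribute names to the semantic
# convention names quoted in the text. Illustrative, not exhaustive.
GUESSED_TO_STANDARD = {
    "http.verb": "http.request.method",
    "database.type": "db.system",
}


def normalise_attributes(attrs):
    """Rename known guesses; return (renamed, unknown_keys).

    Keys that are neither a known guess nor a standard name are reported
    rather than passed through, because a human has to decide what they mean.
    """
    standard = set(GUESSED_TO_STANDARD.values())
    renamed, unknown = {}, []
    for key, value in attrs.items():
        if key in GUESSED_TO_STANDARD:
            renamed[GUESSED_TO_STANDARD[key]] = value
        elif key in standard:
            renamed[key] = value
        else:
            unknown.append(key)
    return renamed, sorted(unknown)


span_attrs = {"http.verb": "GET", "database.type": "postgresql", "queue.name": "orders"}
renamed, unknown = normalise_attributes(span_attrs)
print(renamed)
print(unknown)
```

The unknown list is the interesting part: every entry in it is a name that no dashboard or vendor integration will recognise until someone maps it by hand.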

The Comparison in Code

Consider an AI-generated Prometheus exporter for a custom internal API. The LLM produces something like this:

package main

import (
    "fmt"
    "net/http"
    "sync/atomic"
    "time"
)

var requestCount int64
var errorCount int64

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    atomic.AddInt64(&requestCount, 1)
    
    // ... actual handler logic ...
    
    duration := time.Since(start).Seconds()
    fmt.Fprintf(w, "Request took %f seconds\n", duration)
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "# HELP request_total Total HTTP requests\n")
    fmt.Fprintf(w, "# TYPE request_total counter\n")
    fmt.Fprintf(w, "request_total %d\n", atomic.LoadInt64(&requestCount))
    fmt.Fprintf(w, "# HELP error_total Total HTTP errors\n")
    fmt.Fprintf(w, "# TYPE error_total counter\n")
    fmt.Fprintf(w, "error_total %d\n", atomic.LoadInt64(&errorCount))
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/metrics", metricsHandler)
    http.ListenAndServe(":8080", nil)
}

This code has four distinct problems visible without running it: there is no histogram, so the duration computed in handler is never recorded as a metric and latency data is lost; the metric names carry no application or subsystem prefix, so request_total and error_total collide with any other service that picks the same names; there are no labels at all, and nothing to control cardinality once someone adds them; and metricsHandler never sets Content-Type, so Go’s content sniffing sends text/plain; charset=utf-8 rather than text/plain; version=0.0.4; charset=utf-8, which Prometheus 2.x accepts but which scrapers configured for strict content negotiation can reject. A fifth is easy to miss: errorCount is never incremented, so error_total always reports 0.

The equivalent implementation using prometheus/client_golang:

package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            // Counters take the _total suffix per Prometheus naming conventions
            Name: "http_server_requests_total",
            Help: "Total number of HTTP requests received.",
        },
        []string{"method", "route", "status_code"},
    )

    requestDurationSeconds = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            // Aligns with OpenTelemetry semantic conventions for HTTP server duration
            Name: "http_server_request_duration_seconds",
            Help: "Duration of HTTP server requests in seconds.",
            // Prometheus default buckets cover 5ms-10s; adjust for your SLOs
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "route", "status_code"},
    )
)

// instrumentedHandler records a request counter and a latency histogram for next.
// route is the registered pattern, not r.URL.Path: raw paths can contain IDs
// and would create one time series per distinct URL.
func instrumentedHandler(route string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rw := &statusRecorder{ResponseWriter: w, statusCode: http.StatusOK}
        next(rw, r)
        duration := time.Since(start).Seconds()
        // Numeric code such as "200", matching the status_code label name
        status := strconv.Itoa(rw.statusCode)

        requestsTotal.WithLabelValues(r.Method, route, status).Inc()
        requestDurationSeconds.WithLabelValues(r.Method, route, status).Observe(duration)
    }
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    statusCode int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.statusCode = code
    r.ResponseWriter.WriteHeader(code)
}

// myHandler stands in for the application's real handler.
func myHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok\n"))
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/", instrumentedHandler("/", myHandler))
    log.Fatal(http.ListenAndServe(":8080", nil))
}

The library version handles content negotiation automatically, registers with the default registry correctly, includes Go runtime and process metrics by default, and uses thread-safe counter and histogram implementations. The metric names chosen above match the Prometheus rendering of OTel’s HTTP semantic convention http.server.request.duration. When the Prometheus project releases a new version of the client library, you update go.mod. When a CVE is found in a dependency, the maintainers publish a patched release and Dependabot opens a PR.

Threat Model

AI-generated exporter with an unpatched CVE and no owner. The exporter uses a third-party HTTP client library. A CVE is published. The security team scans the running binary with govulncheck and finds the vulnerability. There is no listed maintainer. The ticket sits in the platform backlog because nobody has context on the exporter. The vulnerable service continues running. In a Prometheus context, the /metrics endpoint is typically accessible on internal networks without authentication — a compromised exporter can be used as a pivot point for internal HTTP requests or as a source of leaked environment variables if the exporter serialises os.Getenv values as metrics labels (a pattern AI-generated code sometimes produces).

Custom metric names break multi-vendor adoption. The engineering organisation instruments 40 services over 12 months using AI-generated OTel instrumentation with non-standard names: http.verb instead of http.request.method, user.id copied onto every span (an unbounded identifier that is often PII and becomes a cardinality problem as soon as span attributes are promoted to metric labels), and duration_ms as a metric name that uses milliseconds where Prometheus and OTel conventions expect seconds. When the organisation adopts Datadog, the vendor’s out-of-the-box dashboards, APM service maps, and anomaly detection all expect OTel semantic convention names. The migration cost (rewriting 40 services’ instrumentation, updating dashboards, rewriting alert rules) exceeds the time saved across all 40 services by AI-assisted generation. The net result is a negative return on the AI productivity claim.

AI-generated log parser breaks trace-log correlation. A custom Fluent Bit Lua filter parses application logs and emits structured JSON. The parser extracts a trace_id field. The OTel log data model carries the trace ID in a dedicated field, serialised as traceId in OTLP/JSON, and the Grafana configuration that links Loki log lines to traces is commonly set up to match traceID. The result is that logs arrive in the pipeline with a field name nothing downstream is looking for, trace correlation breaks entirely, and the incident response team cannot link a specific log line to the trace that produced it during an outage. An incident is the wrong time to discover the mismatch.
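
One way to picture the fix is a single normalisation step applied before logs leave the parser. This is a minimal sketch, assuming the three spellings named above are the only ones in play; normalise_trace_id and the choice of canonical key are hypothetical, and the canonical key should be whichever spelling your backend is actually configured to read.

```python
import json

# Spellings mentioned in the text. CANONICAL_KEY is an assumption made for
# this sketch, not a statement about what any particular backend requires.
TRACE_ID_ALIASES = ("trace_id", "traceId", "traceID")
CANONICAL_KEY = "trace_id"


def normalise_trace_id(record):
    """Return a copy of a parsed log record with the trace ID under one key."""
    out = dict(record)
    for alias in TRACE_ID_ALIASES:
        if alias in out:
            # Move the first spelling found to the canonical key.
            out[CANONICAL_KEY] = out.pop(alias)
            break
    return out


line = '{"msg": "payment failed", "traceID": "4bf92f3577b34da6a3ce929d0e0e4736"}'
print(normalise_trace_id(json.loads(line)))
```

Records without any trace ID pass through unchanged, so the step is safe to apply to every log line.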

Cardinality explosion from high-cardinality AI-generated labels. AI-generated exporters commonly include identifiers as metric labels: user_id, request_id, session_id, job_id. These identifiers have unbounded cardinality. A Prometheus server tracking a counter labelled by user_id for a service with 500,000 users creates 500,000 time series for a single metric. Prometheus memory consumption scales roughly linearly with the number of active time series. An instance sized for the previous load can OOM or start failing scrapes when a single exporter adds 500,000 series. prometheus/client_golang does not prevent this, but the official documentation, community guidance, and code review checklists for teams using the library consistently flag high-cardinality labels. AI-generated code has no such review layer.

Hardening Configuration

1. Use Official Client Libraries with Dependency Monitoring

Never expose metrics using a custom HTTP handler. Always use the maintained client library for your language.

// go.mod — pin to a specific version and monitor with govulncheck
require (
    github.com/prometheus/client_golang v1.20.5
    go.opentelemetry.io/otel v1.33.0
    go.opentelemetry.io/otel/sdk v1.33.0
    go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.33.0
)

Run govulncheck as part of your CI pipeline and against the binaries you deploy:

# Install govulncheck
go install golang.org/x/vuln/cmd/govulncheck@latest

# Scan the module for known vulnerabilities in dependencies
govulncheck ./...

# Scan a compiled binary (the path is an example)
govulncheck -mode=binary ./bin/exporter

# Illustrative output for a vulnerability in an observability dependency.
# The ID, description, and versions below are placeholders, not a real advisory:
# Vulnerability #1: GO-YYYY-NNNN
# An attacker can cause a Prometheus client_golang server to crash by sending
# a specially crafted request to the /metrics endpoint.
#   More info: https://pkg.go.dev/vuln/GO-YYYY-NNNN
#   Module: github.com/prometheus/client_golang
#     Found in: github.com/prometheus/client_golang@<affected version>
#     Fixed in: github.com/prometheus/client_golang@<fixed version>
#     Example traces found:
#       #1: main.go:14:2: myapp calls promhttp.Handler

# A clean scan prints "No vulnerabilities found." and exits 0.
# govulncheck exits non-zero when it reports a vulnerability,
# so a CI gate on the exit code enforces this.

For Python services using the OpenTelemetry SDK:

# requirements.txt — pin exact versions for reproducible builds
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-exporter-otlp-proto-grpc==1.29.0
opentelemetry-semantic-conventions==0.50b0
opentelemetry-instrumentation-fastapi==0.50b0
opentelemetry-instrumentation-sqlalchemy==0.50b0

# Scan Python dependencies with pip-audit
pip install pip-audit
pip-audit -r requirements.txt

# Illustrative output for a vulnerable dependency (the advisory ID is a placeholder):
# Name                        Version  ID                  Fix Versions
# --------------------------- -------- ------------------- ------------
# opentelemetry-api           1.24.0   GHSA-xxxx-yyyy-zzzz 1.25.0

2. OpenTelemetry Semantic Conventions Compliance in Python

Always import from opentelemetry.semconv rather than using string literals for attribute names. The semconv package is versioned alongside the OTel SDK and tracks the specification.

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes
from opentelemetry.semconv.resource import ResourceAttributes
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Configure resource attributes using semantic convention constants
resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "payment-processor",
    ResourceAttributes.SERVICE_VERSION: "2.4.1",
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})

# Attach the resource to a tracer provider; a real service would also
# register span processors and exporters on this provider
provider = TracerProvider(resource=resource)
tracer = trace.get_tracer(__name__, tracer_provider=provider)

def process_payment(request):
    with tracer.start_as_current_span("process_payment") as span:
        # Use semantic convention constants — not string literals
        span.set_attribute(SpanAttributes.HTTP_REQUEST_METHOD, request.method)
        span.set_attribute(SpanAttributes.URL_FULL, str(request.url))
        span.set_attribute(SpanAttributes.HTTP_RESPONSE_STATUS_CODE, 200)

        # Database spans: use db.* semantic conventions
        with tracer.start_as_current_span("db.query") as db_span:
            db_span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
            db_span.set_attribute(SpanAttributes.DB_NAME, "payments")
            db_span.set_attribute(SpanAttributes.DB_OPERATION, "INSERT")
            # DO NOT include db.statement in production — it may contain PII or credentials
            # db_span.set_attribute(SpanAttributes.DB_STATEMENT, sql)  # WRONG in prod

        # INCORRECT patterns an LLM commonly generates — do not use:
        # span.set_attribute("http.verb", request.method)       # non-standard name
        # span.set_attribute("url", str(request.url))           # ambiguous
        # span.set_attribute("response_code", 200)              # wrong name
        # span.set_attribute("user_id", user.id)                # HIGH CARDINALITY — never on spans
        # span.set_attribute("request_id", req_id)              # HIGH CARDINALITY — use baggage

        return process(request)

3. Validate Metric and Attribute Names in CI

Add a static analysis step that rejects non-standard attribute names before they reach production:

#!/usr/bin/env python3
# scripts/validate_otel_attributes.py
# Run in CI to catch non-standard OTel attribute names before they ship.

import ast
import sys
import glob
from opentelemetry.semconv.trace import SpanAttributes

# Build set of all valid semantic convention attribute values
STANDARD_ATTRS = {
    v for k, v in vars(SpanAttributes).items()
    if not k.startswith('_') and isinstance(v, str)
}

# Patterns that indicate high-cardinality labels — reject in any context
HIGH_CARDINALITY_PATTERNS = [
    "user_id", "user.id", "request_id", "request.id",
    "session_id", "session.id", "job_id", "job.id",
    "transaction_id", "correlation_id",
]

violations = []

for filepath in glob.glob("**/*.py", recursive=True):
    if "test_" in filepath or "_test.py" in filepath:
        continue
    try:
        with open(filepath) as f:
            tree = ast.parse(f.read(), filename=filepath)
    except SyntaxError:
        continue

    for node in ast.walk(tree):
        # Find span.set_attribute("literal_string", ...) calls
        if not (isinstance(node, ast.Call) and
                isinstance(node.func, ast.Attribute) and
                node.func.attr == "set_attribute" and
                node.args and isinstance(node.args[0], ast.Constant)):
            continue
        attr_name = node.args[0].value
        if not isinstance(attr_name, str):
            continue
        # High-cardinality identifiers are rejected outright
        if any(p in attr_name.lower() for p in HIGH_CARDINALITY_PATTERNS):
            violations.append(
                f"{filepath}:{node.lineno}: HIGH CARDINALITY attribute "
                f"'{attr_name}' should not appear on spans"
            )
        # Any other string literal must be a known semantic convention name
        elif attr_name not in STANDARD_ATTRS:
            violations.append(
                f"{filepath}:{node.lineno}: non-standard attribute "
                f"'{attr_name}' is not defined in SpanAttributes"
            )

if violations:
    print("OTel attribute violations found:")
    for v in violations:
        print(f"  {v}")
    sys.exit(1)

print("OTel attribute validation passed.")

# .github/workflows/observability-lint.yml
name: Observability lint
on: [push, pull_request]

jobs:
  otel-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
      - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b  # v5.3.0
        with:
          python-version: "3.12"
      - run: pip install opentelemetry-semantic-conventions
      - run: python scripts/validate_otel_attributes.py

4. Dependabot for Observability Library Updates

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: gomod
    directory: /
    schedule:
      interval: weekly
    groups:
      observability:
        patterns:
          - "github.com/prometheus/*"
          - "go.opentelemetry.io/*"
          - "github.com/fluent/*"
    ignore:
      # Suppress major-version update PRs; plan those upgrades manually
      - dependency-name: "*"
        update-types: ["version-update:semver-major"]

  - package-ecosystem: pip
    directory: /
    schedule:
      interval: weekly
    groups:
      opentelemetry:
        patterns:
          - "opentelemetry-*"

The groups configuration causes Dependabot to open a single PR for all observability library updates, rather than one PR per package. This keeps the review burden manageable and makes it easy to see the complete picture of what changed in the observability stack in a given week.

5. Use the OpenTelemetry Collector as the Single Pipeline

Replace any custom log forwarders, metric proxies, or trace routers with the OpenTelemetry Collector. The Collector is a CNCF-maintained binary with a documented security disclosure process, a stable configuration API, and vendor-maintained exporter plugins.

# otel-collector-config.yaml
# Replaces: custom log forwarders, metric proxies, and trace routers

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        # TLS required — never run the Collector without mTLS on grpc in production
        tls:
          cert_file: /etc/otel/tls/server.crt
          key_file: /etc/otel/tls/server.key
          client_ca_file: /etc/otel/tls/ca.crt
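      http:
        endpoint: 0.0.0.0:4318
        # Same TLS settings as gRPC, so the HTTP receiver is not a plaintext bypass
        tls:
          cert_file: /etc/otel/tls/server.crt
          key_file: /etc/otel/tls/server.key
          client_ca_file: /etc/otel/tls/ca.crt
  prometheus:
    config:
      scrape_configs:
        - job_name: "app"
          scrape_interval: 15s
          static_configs:
            - targets: ["app:8080"]
          # Relabelling: drop high-cardinality labels before storage.
          # labeldrop matches label names; check that the remaining labels
          # still identify each series uniquely.
          metric_relabel_configs:
            - regex: "request_id"
              action: labeldrop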

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Redact PII from span attributes before export
  redaction:
    # Allowlist mode: any attribute key not listed below is removed, which
    # covers db.statement and the authorization and cookie request headers
    allow_all_keys: false
    allowed_keys:
      - http.request.method
      - http.response.status_code
      - db.system
      - db.name
      - db.operation
      - net.peer.name
      - service.name
      - service.version
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  # Datadog exporter (contrib distribution), maintained by Datadog engineering
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com
  # Prometheus remote write for long-term storage in Thanos or Cortex
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write
    tls:
      ca_file: /etc/otel/tls/ca.crt
  # Grafana Tempo for trace storage
  otlp/tempo:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: false
      ca_file: /etc/otel/tls/ca.crt

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, redaction, batch]
      exporters: [datadog, otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [datadog, prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, redaction, batch]
      exporters: [datadog]
  telemetry:
    logs:
      level: warn
    metrics:
      address: 0.0.0.0:8888

The redaction processor is security-critical. AI-generated instrumentation commonly includes db.statement (which contains SQL with user-supplied values) and raw HTTP headers (which contain auth tokens). The Collector’s redaction processor enforces an allowlist at the pipeline level, regardless of what individual services emit.
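
The allowlist idea itself is small enough to show in a few lines. This is a conceptual sketch only, not the Collector’s implementation: redact and ALLOWED_KEYS are hypothetical names, and the sample attribute values are invented.

```python
# Conceptual model of allowlist redaction: keep what is explicitly allowed,
# drop everything else. Names and values here are illustrative.
ALLOWED_KEYS = {
    "http.request.method",
    "http.response.status_code",
    "db.system",
    "service.name",
}


def redact(attributes, allowed=ALLOWED_KEYS):
    """Return (kept, dropped): allowlisted attributes and the removed keys."""
    kept = {k: v for k, v in attributes.items() if k in allowed}
    dropped = sorted(k for k in attributes if k not in allowed)
    return kept, dropped


span = {
    "http.request.method": "POST",
    "db.system": "postgresql",
    "db.statement": "SELECT * FROM cards WHERE holder = 'alice'",
    "http.request.header.authorization": "Bearer abc123",
}
kept, dropped = redact(span)
print(kept)
print(dropped)
```

The design point is that the default is removal: an attribute nobody anticipated is dropped rather than exported, which is the property a blocklist cannot give you.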

6. Decision Framework: Where AI Assistance is Safe

SAFE — AI-generated, low maintenance cost:
  - Grafana dashboard JSON (declarative, not a running service)
  - Prometheus alerting rules (YAML, reviewed before deployment)
  - Prometheus recording rules (PromQL expressions, statically validated)
  - Fluent Bit parser configuration for known log formats
  - OTel Collector pipeline configuration (declarative YAML)

UNSAFE — use maintained libraries, not AI-generated code:
  - Any code that exposes a /metrics endpoint
  - Any OTel SDK instrumentation that runs in production
  - Any log pipeline component that handles PII or auth data
  - Any exporter that sends data to an external vendor
  - Any collector sidecar or agent binary

The distinction maps to runtime risk and maintenance obligations. A Grafana dashboard JSON file does not run with network access, has no dependencies that can contain CVEs, and cannot leak credentials. An AI-generated Go binary listening on port 8080 in production runs with network access, carries dependencies that can contain CVEs, and can leak credentials.

Expected Behaviour

A correctly configured observability stack using official libraries produces observable, predictable output at every layer.

The OTel Collector’s own metrics endpoint (0.0.0.0:8888/metrics) shows receiver and exporter health:

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline
otelcol_receiver_accepted_spans{receiver="otlp",service_instance_id="...",transport="grpc"} 142847

# HELP otelcol_exporter_sent_spans Number of spans successfully sent to destination
otelcol_exporter_sent_spans{exporter="otlp/tempo",...} 142839

# HEALTH SIGNAL: if sent < accepted by more than a small delta, spans are being dropped
# otelcol_exporter_send_failed_spans > 0 requires immediate investigation
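
The health signal above reduces to one ratio. A minimal sketch using the two sample counter values shown; dropped_fraction and the 0.1% threshold are assumptions for illustration, and a real alert would compare rates over a window rather than raw cumulative counters, since spans still in a batch count as accepted but not yet sent.

```python
def dropped_fraction(accepted, sent):
    """Fraction of accepted spans not confirmed as sent (0.0 if none accepted)."""
    if accepted <= 0:
        return 0.0
    return max(accepted - sent, 0) / accepted


# Counter values from the sample output above.
accepted, sent = 142847, 142839
THRESHOLD = 0.001  # hypothetical tolerance of 0.1%

frac = dropped_fraction(accepted, sent)
print(f"{accepted - sent} spans unaccounted for ({frac:.6%})")
print("investigate" if frac > THRESHOLD else "within tolerance")
```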

The govulncheck output for a clean observability stack:

$ govulncheck ./...
Scanning your code and 87 packages across 12 dependent modules for known vulnerabilities...

No vulnerabilities found.

A Dependabot PR for an observability library update looks like:

Bump go.opentelemetry.io/otel from 1.32.0 to 1.33.0

Release notes for v1.33.0:
- Fix: metric SDK memory leak when using delta temporality
- Security: update golang.org/x/net dependency (GO-2025-xxxx)
- Stability: HTTP semantic conventions updated to stable status

Files changed: go.mod, go.sum

The PR diff shows exactly two files changing with a version bump. Review it, confirm the release notes match the tag, merge. The entire maintenance action takes under five minutes for a change that rolls a security fix out across your entire observability stack.

Trade-offs

OTel Collector operational complexity vs. direct exporter-to-vendor. Running the Collector as a sidecar or DaemonSet adds a process to manage. It has its own resource consumption (typically 50-200MB RAM at moderate throughput), its own configuration management, and its own upgrade cycle. The trade-off is multi-vendor portability: when your organisation moves from Datadog to Grafana Cloud, you change two lines in the Collector configuration. Without the Collector, you change instrumentation in every service. The Collector also centralises the security controls — PII redaction, credential filtering, TLS termination — that would otherwise need to be implemented consistently in every service’s instrumentation code.

Enforcing semantic conventions on existing codebases. If your services already emit non-standard metric and attribute names, adopting OTel semantic conventions breaks existing dashboards and alert rules. The migration path is to run old and new instrumentation in parallel using the OTel metric bridges (go.opentelemetry.io/otel/bridge/opencensus for OpenCensus, go.opentelemetry.io/contrib/bridges/prometheus for Prometheus), emit both old and new names during a transition period, migrate dashboards, then remove the old names. This takes weeks, not minutes. That cost is real, and it is the cost of not adopting conventions from the start.
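
The dual-emit step can be sketched independently of any SDK. This is a rough illustration under stated assumptions: MIGRATIONS and dual_emit are hypothetical, the old names are the non-standard ones used as examples earlier in this article, and the divisor handles the unit change from milliseconds to seconds.

```python
# old name -> (new name, divisor converting the old unit to the new unit).
# Hypothetical table for the transition period; remove entries once
# dashboards and alerts have moved to the new names.
MIGRATIONS = {
    "duration_ms": ("http_server_request_duration_seconds", 1000),
    "request_total": ("http_server_requests_total", 1),
}


def dual_emit(name, value):
    """Return every (name, value) sample to record for one observation."""
    samples = [(name, value)]
    if name in MIGRATIONS:
        new_name, divisor = MIGRATIONS[name]
        samples.append((new_name, value / divisor))
    return samples


print(dual_emit("duration_ms", 250))
print(dual_emit("queue_depth", 7))
```

Keeping the table in one place makes the final step, deleting the old names, a one-line change per metric rather than a search across services.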

AI assistance for dashboard and alert configuration is genuinely useful. Generating a Grafana dashboard JSON for a specific service, writing a PromQL expression for a P99 latency alert, or drafting a Prometheus recording rule for a complex query are all tasks where AI assistance provides real value with low risk. The output is declarative configuration that humans review before deployment, has no runtime dependencies, cannot contain executable CVEs, and is validated by the tools that consume it (Grafana’s import, promtool check rules). This is the correct scope for AI assistance in observability work.

Failure Modes

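Gauge used for monotonically increasing values. AI-generated exporters frequently use gauge types for values that should be counters: request counts, error counts, bytes processed. PromQL’s rate() and irate() functions are semantically defined for counters only: they treat any decrease as a counter reset and compensate for it. The Prometheus server does not check the declared type at query time, so the mistake is easy to miss, but everything that trusts the TYPE metadata is now wrong. Dashboards and alert rules written for a gauge use delta() or deriv(), which have no reset handling, so calculations appear correct in steady-state operation but produce large negative spikes whenever the exporting service restarts. Tools that convert by declared type, such as the OTel Collector’s Prometheus receiver, forward the series as a gauge rather than a monotonic sum. This is invisible in testing because tests don’t typically restart the service mid-test.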
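
The restart behaviour is easier to see with numbers. A minimal sketch, assuming a simplified model of reset handling in which any decrease is treated as a restart from zero, with none of PromQL’s window extrapolation; the function names and sample values are made up for illustration.

```python
def increase_with_resets(samples):
    """Total increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted at 0, so cur is the whole increase.
        total += (cur - prev) if cur >= prev else cur
    return total


def naive_delta(samples):
    """Last value minus first value, with no reset handling."""
    return samples[-1] - samples[0]


# A request count scraped five times; the service restarts before the 4th scrape.
samples = [100, 130, 160, 5, 35]
print(increase_with_resets(samples))  # 30 + 30 + 5 + 30
print(naive_delta(samples))           # 35 - 100
```

The reset-aware calculation recovers the 95 requests actually served, while the plain difference reports a negative number for the same data.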

Cardinality explosion at production scale. A custom exporter includes a customer_tier label with values free, pro, enterprise: three values, fine. Six months later, someone adds customer_id as a label. A service with 100,000 customers and 5 metrics creates 500,000 time series. Prometheus uses roughly 3KB of memory per active time series (a rule of thumb that varies with label sizes and churn), so this single change adds about 1.5GB of memory pressure to the Prometheus instance. The prometheus/client_golang library does not prevent this, but per-job scrape limits such as sample_limit and label_limit in the scrape configuration can limit the blast radius by failing the offending scrape instead of the whole server. The correct prevention is a cardinality review gate in CI: promtool check metrics catches format issues; a label cardinality policy enforced in pull request review catches semantic issues.
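
The arithmetic is worth writing down, because it is the check a reviewer can run before approving a new label. A minimal sketch using the figures from the paragraph above; the 3KB-per-series constant is the same rough estimate, and the product of label cardinalities is an upper bound, since correlated labels (each customer has exactly one tier) produce fewer combinations.

```python
from math import prod

BYTES_PER_SERIES = 3_000  # rough per-series estimate from the text


def series_count(label_cardinalities, metrics):
    """Upper bound on time series: product of label cardinalities x metric count."""
    return prod(label_cardinalities.values()) * metrics


def memory_gb(series):
    """Approximate memory for the given number of active series, in GB."""
    return series * BYTES_PER_SERIES / 1e9


before = series_count({"customer_tier": 3}, metrics=5)
after = series_count({"customer_id": 100_000}, metrics=5)
print(before, after, memory_gb(after))
```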

AI-generated OTel instrumentation that loses trace context. Correct distributed tracing requires propagating the W3C TraceContext headers (traceparent, tracestate) across every service call. An AI-generated HTTP client wrapper that doesn’t call otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header)) before each outbound request produces orphaned spans: traces appear in Grafana Tempo but are disconnected from their parents. The downstream call shows up as a root span with no parent, making it impossible to reconstruct the call chain during an incident. This is undetectable in unit tests and only becomes visible in integration tests that verify trace context propagation — tests that most teams don’t write.
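
What the propagator does on the wire is simple, which is why omitting it is so easy. The sketch below is not the OTel API: it is a standard-library illustration of the version-00 traceparent header format, with inject and extract as hypothetical helper names and a plain dict with lowercase keys standing in for HTTP headers.

```python
import re

# version "00" - 32 hex trace-id - 16 hex parent span-id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Write a traceparent header into an outbound header dict."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Return (trace_id, parent_span_id), or None if missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, span_id, _flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id


outbound = inject({}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(outbound["traceparent"])
print(extract(outbound))
print(extract({}))  # no header: the downstream service starts a new root span
```

The last line is the failure described above: without the header the receiving service has nothing to attach its span to.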

Untracked AI-generated exporters in the service catalogue. The security team runs a vulnerability scan. govulncheck reports a CVE in a library used by three services. Two services have owners. The third is the AI-generated exporter that someone deployed 14 months ago. The CMDB has no record of it. The Kubernetes deployment has no owner label. Running go version -m against the binary shows it was built with golang.org/x/net v0.18.0, which has three known CVEs. Patching it requires finding the source code (it’s in a personal GitHub repository that the original author hasn’t touched in a year), understanding what it does, rebuilding it, and deploying it, assuming anyone can find the Dockerfile. The correct prevention is enforcing service catalogue registration as a deployment prerequisite and requiring owner labels on all Kubernetes deployments, checked by a CI gate or admission policy that rejects any Deployment manifest without an owner label. kubectl label deployment <name> owner=<team> --overwrite fixes an existing deployment, but it is a remediation step, not a gate.
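
An owner-label gate does not need much machinery. This is a minimal sketch, assuming manifests have already been parsed into dictionaries (for example from YAML or from kubectl output in JSON); missing_owner and the label key owner are illustrative choices, not a standard.

```python
def missing_owner(manifests, label="owner"):
    """Return names of Deployment manifests lacking a non-empty owner label."""
    offenders = []
    for manifest in manifests:
        if manifest.get("kind") != "Deployment":
            continue
        meta = manifest.get("metadata") or {}
        labels = meta.get("labels") or {}
        if not labels.get(label):
            offenders.append(meta.get("name", "<unnamed>"))
    return offenders


manifests = [
    {"kind": "Deployment", "metadata": {"name": "api", "labels": {"owner": "payments"}}},
    {"kind": "Deployment", "metadata": {"name": "custom-exporter", "labels": {}}},
    {"kind": "Service", "metadata": {"name": "api"}},
]
offenders = missing_owner(manifests)
print(offenders)
# In CI, exit non-zero when offenders is non-empty so the deploy is blocked.
```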

The ecosystem argument for open source observability is not sentimental attachment to established tools. It is a risk calculation. The time saved by AI-generated observability code accrues immediately and is visible on a sprint board. The maintenance cost, CVE exposure, semantic incompatibility, and vendor migration friction accrue over 12-24 months and are invisible until they cause an incident or a failed migration. Prometheus, OpenTelemetry, and the Collector represent a multi-year investment by CNCF, Google, Microsoft, Datadog, and dozens of other organisations in exactly the kind of stable, interoperable, secure foundation that custom code cannot replicate — regardless of how quickly that custom code was generated.