Container Patch Compliance Observability: Tracking CVE-to-Patch SLAs Across a Fleet

The Problem

Running Copa against 200 images on a nightly schedule does not produce compliance. It produces patch activity. Compliance requires answering four different questions at any moment:

  1. Which images currently have critical CVEs in the registry?
  2. Which of those images have been running with a critical CVE for longer than 24 hours, breaching the SLA?
  3. Which Copa runs failed, and which images are therefore not receiving patches?
  4. Which running Pods in the cluster are executing an unpatched image digest — even if the registry version was patched?

The gap between questions one and four is where most teams lose visibility. A Copa pipeline that patches the registry image but does not verify that Pods have restarted to pick up the new digest produces a compliance dashboard that shows green while production containers remain vulnerable. Similarly, a Copa CronJob that silently fails for a single image keeps the SLA clock running with no alert. Security teams reporting based on last-scan results rather than running-container state create drift between registry state and cluster state that an auditor — or an attacker — will find before they do.

This article builds the observability layer to close those gaps: the metrics that matter, the exporters that collect them, the recording rules that compute SLA status, and the Grafana panels and Alertmanager rules that make fleet patch compliance visible in real time.

Target environment: Kubernetes clusters running Copa as a CronJob, Trivy for scanning, Prometheus and Grafana for metrics storage and visualisation, and optionally OWASP Dependency-Track for remediation history.


Threat Model

Threat 1: Registry Patched, Pods Still Running Old Digest

This is the most common compliance gap. Copa patches the image in the registry. The registry tag myapp:latest now resolves to a new digest. But the running Pod was launched with the old digest and will continue running it until it is restarted. In Kubernetes, Pods are not automatically recycled when the underlying image changes. A compliance system that checks registry state but not cluster state will report the CVE as remediated while the workload remains exploitable.

Detection requires matching the image digest from kubectl get pods -o json against the digest of the most recently patched image in the registry.

Threat 2: Copa CronJob Silent Failure

Copa runs as a Kubernetes CronJob. If the job fails — due to a registry push error, a Trivy scan timeout, an OOM kill, or a misconfigured RBAC token — the patch does not happen. The CronJob may record a Failed status in Kubernetes events, but if no one is watching those events and no Prometheus metric is scraped from the job result, the failure is invisible. The SLA clock for every unpatched CVE on that image keeps running.

Detection requires the Copa job to emit structured result metrics after every run, plus an alert that fires when an expected run never reports at all.

Threat 3: Reporting on Scan Time Rather Than Running-Container State

Security teams under audit pressure often pull reports from the most recent Trivy scan of the registry image. This misses two things: pods running pre-patch digests (Threat 1), and images that were never scanned because the Trivy operator does not have visibility into a namespace or node. A fleet with 200 images across 15 namespaces has numerous gaps if namespace-level Trivy operator configuration is inconsistent.

Detection requires explicitly tracking which images have been scanned and when, and alerting when the scan age exceeds the scanning interval.
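
The scan-coverage check described above can be sketched as a small pure function. This is a minimal illustration, not part of any Trivy tooling: find_scan_gaps and its inputs are hypothetical names, and the 26-hour default simply mirrors the TrivyScanStale threshold used later in this article.

```python
from datetime import datetime, timedelta, timezone


def find_scan_gaps(running_images, last_scanned, now, max_age=timedelta(hours=26)):
    """Return (never_scanned, stale) lists of image references.

    running_images: iterable of image references currently running in the cluster.
    last_scanned:   dict mapping image reference -> datetime of its latest scan.
    now:            the reference time for the age comparison.
    """
    never_scanned, stale = [], []
    for image in sorted(set(running_images)):
        scanned_at = last_scanned.get(image)
        if scanned_at is None:
            # Running, but no scan record exists at all (for example, an unwatched namespace).
            never_scanned.append(image)
        elif now - scanned_at > max_age:
            # Scanned, but longer ago than the scanning interval allows.
            stale.append(image)
    return never_scanned, stale
```

Keeping "never scanned" separate from "stale" matters for triage: the first usually points at operator scope or RBAC, the second at scan-job failures or timeouts.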


Configuration and Implementation

1. Core Metric Schema

Define five core metrics. These become the canonical vocabulary for the entire observability system — dashboards, alerts, and SLO tracking all derive from these.

# HELP image_cve_age_hours Hours since the CVE was first seen on this image
# TYPE image_cve_age_hours gauge
image_cve_age_hours{image="registry.example.com/myapp", digest="sha256:abc123", cve_id="CVE-2024-1234", severity="CRITICAL"} 31.4

# HELP copa_patch_success_total Total successful Copa patch runs per image
# TYPE copa_patch_success_total counter
copa_patch_success_total{image="registry.example.com/myapp", copa_version="0.7.0"} 14

# HELP copa_patch_failure_total Total failed Copa patch runs per image
# TYPE copa_patch_failure_total counter
copa_patch_failure_total{image="registry.example.com/myapp", failure_reason="push_error", copa_version="0.7.0"} 2

# HELP image_patch_sla_breach_total Total SLA breaches (critical CVE age > threshold)
# TYPE image_patch_sla_breach_total counter
image_patch_sla_breach_total{image="registry.example.com/myapp", cve_id="CVE-2024-1234", threshold_hours="24"} 1

# HELP running_containers_with_critical_cve Number of running containers with at least one critical CVE
# TYPE running_containers_with_critical_cve gauge
running_containers_with_critical_cve{namespace="production", pod="myapp-7d9f8b-xkq2p", image="registry.example.com/myapp", digest="sha256:old999"} 1

All labels use consistent image reference format. Establish a normalisation convention up front: always use the full registry hostname and strip tags in favour of digest references. A registry-image mismatch between how Copa logs a reference and how Kubernetes records it in the Pod spec is the most common cause of false negatives in digest-matching logic.
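
A minimal sketch of such a normalisation function follows. The function name and the exact rules (Docker Hub as the implicit registry, stripping port 443, dropping runtime prefixes) are assumptions for illustration; adapt them to the registries actually in use.

```python
DEFAULT_REGISTRY = "docker.io"  # assumption: Docker Hub is the implicit registry


def normalise_image_ref(ref: str) -> str:
    """Reduce an image reference to 'registry/repository' with no tag or digest."""
    # Drop runtime prefixes such as 'docker-pullable://' that may precede an imageID.
    if "://" in ref:
        ref = ref.split("://", 1)[1]
    # Drop the digest, then the tag. A ':' after the last '/' is a tag; before it, a port.
    ref = ref.split("@", 1)[0]
    if ref.rfind(":") > ref.rfind("/"):
        ref = ref[: ref.rfind(":")]
    # Treat the first path component as a registry only if it looks like a hostname.
    first, _, rest = ref.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        registry, repository = first, rest
    else:
        registry, repository = DEFAULT_REGISTRY, ref
    # Strip the default HTTPS port so 'host:443/app' and 'host/app' compare equal.
    if registry.endswith(":443"):
        registry = registry[: -len(":443")]
    # Docker Hub official images live under 'library/'.
    if registry == DEFAULT_REGISTRY and "/" not in repository:
        repository = f"library/{repository}"
    return f"{registry}/{repository}"
```

Calling the same function in the Copa pipeline and in the cluster-side collector means both sides produce identical keys, so any remaining mismatch reflects a real difference rather than a formatting one.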

2. Copa Run Metrics Exporter

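Copa does not expose Prometheus metrics itself, so the pipeline has to record the outcome of each run. The exporter below assumes that a JSON result file with image, status, duration_seconds, and error.code fields is written after each run (for example by a wrapper around the copa command). It parses that file and pushes metrics to a Prometheus Pushgateway, which is appropriate for batch jobs that do not run continuously.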

#!/usr/bin/env python3
"""copa_metrics_exporter.py — parse Copa JSON output, push to Pushgateway"""

import json
import os
import sys

from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

PUSHGATEWAY_URL = os.environ.get("PUSHGATEWAY_URL", "http://prometheus-pushgateway:9091")
COPA_OUTPUT_FILE = os.environ.get("COPA_OUTPUT_FILE", "/tmp/copa-result.json")
COPA_VERSION = os.environ.get("COPA_VERSION", "unknown")
JOB_NAME = "copa_patch_job"


def parse_copa_result(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def push_metrics(result: dict) -> None:
    registry = CollectorRegistry()

    success_counter = Counter(
        "copa_patch_success_total",
        "Successful Copa patch runs",
        ["image", "copa_version"],
        registry=registry,
    )
    failure_counter = Counter(
        "copa_patch_failure_total",
        "Failed Copa patch runs",
        ["image", "failure_reason", "copa_version"],
        registry=registry,
    )
    patch_duration = Gauge(
        "copa_patch_duration_seconds",
        "Duration of the Copa patch run",
        ["image"],
        registry=registry,
    )

    image = result.get("image", "unknown")
    status = result.get("status", "unknown")
    duration = result.get("duration_seconds", 0)

    patch_duration.labels(image=image).set(duration)

    if status == "success":
        success_counter.labels(image=image, copa_version=COPA_VERSION).inc()
    else:
        # Tolerate a missing or null "error" field in the result file.
        failure_reason = (result.get("error") or {}).get("code", "unknown")
        failure_counter.labels(
            image=image, failure_reason=failure_reason, copa_version=COPA_VERSION
        ).inc()

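    # Group by image so that pushes for different images do not overwrite each other.
    # Caveat: every run starts from a fresh registry, so each pushed counter value is 1;
    # persist and re-push a running total if increase() must see every run.
    push_to_gateway(
        PUSHGATEWAY_URL, job=JOB_NAME, grouping_key={"image": image}, registry=registry
    )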
    print(f"Pushed metrics for {image}: status={status}")


def main():
    try:
        result = parse_copa_result(COPA_OUTPUT_FILE)
        push_metrics(result)
    except FileNotFoundError:
        print(f"Copa output file not found: {COPA_OUTPUT_FILE}", file=sys.stderr)
        sys.exit(1)
    except json.JSONDecodeError as exc:
        print(f"Invalid Copa output JSON: {exc}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()

Package the exporter into the Copa CronJob so that it runs once Copa has finished, for example as a second container in the same Pod:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: copa-patch-myapp
  namespace: security
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: copa
            image: ghcr.io/project-copacetic/copacetic:v0.7.0
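            args:
            - patch
            - --image=registry.example.com/myapp:latest
            # --report is an input: the Trivy JSON report for this image, which
            # must exist before Copa runs.
            - --report=/tmp/trivy-report.json
            # --tag sets the tag of the patched image.
            - --tag=patched
            # /tmp/copa-result.json (read by the exporter) is assumed to be written
            # by a wrapper around this command.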
            volumeMounts:
            - name: copa-output
              mountPath: /tmp
          - name: metrics-exporter
            image: registry.example.com/copa-metrics-exporter:1.0.0
            env:
            - name: PUSHGATEWAY_URL
              value: "http://prometheus-pushgateway.monitoring:9091"
            - name: COPA_VERSION
              value: "0.7.0"
            volumeMounts:
            - name: copa-output
              mountPath: /tmp
          volumes:
          - name: copa-output
            emptyDir: {}

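The emptyDir volume gives both containers access to /tmp/copa-result.json. As written, however, the two containers start concurrently: Kubernetes does not sequence the containers of a Job's Pod, so the exporter may look for the result file before Copa has written it. If ordering must be guaranteed, either run Copa as an init container with the exporter as the main container, or use a single container whose wrapper script runs Copa and then the exporter. A failed init container prevents the main container from starting, so only the wrapper approach still reports failed runs.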
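
A wrapper of this kind can be sketched in a few lines. This is an illustration under stated assumptions rather than Copa functionality: run_and_record is a hypothetical helper, the result-file fields mirror what the exporter above reads, and the exit_<code> value is an invented convention for the failure_reason label.

```python
import json
import subprocess
import time


def run_and_record(image, command, result_path):
    """Run `command`, then write a result file in the shape the exporter reads."""
    started = time.monotonic()
    proc = subprocess.run(command, capture_output=True, text=True)
    result = {
        "image": image,
        "status": "success" if proc.returncode == 0 else "failure",
        "duration_seconds": round(time.monotonic() - started, 3),
    }
    if proc.returncode != 0:
        # Hypothetical convention: a low-cardinality code derived from the exit status.
        result["error"] = {"code": f"exit_{proc.returncode}"}
    with open(result_path, "w") as f:
        json.dump(result, f)
    return result
```

In the CronJob, the container entrypoint would call run_and_record with the copa command line and then invoke the exporter, so the result file is written whether or not the patch succeeded.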

3. Trivy Operator: Continuous Cluster Scanning

The Trivy Kubernetes operator continuously scans running container images and exposes results as VulnerabilityReport CRDs. It also exposes a Prometheus metrics endpoint.

Install the operator:

helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm upgrade --install trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  --create-namespace \
  --set="trivy.ignoreUnfixed=false" \
  --set="operator.scannerReportTTL=24h" \
  --set="operator.configAuditScannerEnabled=false" \
  --set="serviceMonitor.enabled=true"

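The operator emits trivy_image_vulnerabilities as a count of vulnerabilities per severity, with labels for namespace, owning workload, container, and image. Per-CVE detail lives in the VulnerabilityReport resources rather than in this metric: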

trivy_image_vulnerabilities{
  container_name="myapp",
  image_digest="sha256:old999abc",
  image_registry="registry.example.com",
  image_repository="myapp",
  image_tag="latest",
  namespace="production",
  resource_kind="ReplicaSet",
  resource_name="myapp-7d9f8b",
  severity="CRITICAL"
} 3

This metric reflects the digest of the running container, not the registry tag. That distinction is what makes it useful for detecting Threat 1: a Pod still running the old image digest continues to report critical CVEs even after the registry image is patched.

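Tune the scanner through the operator's ConfigMap. Namespace scope is set separately: leave the chart's targetNamespaces value empty so that the operator watches all namespaces, and keep scannerReportTTL aligned with the Copa schedule so that reports are refreshed at least once per patch cycle: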

apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-operator
  namespace: trivy-system
data:
  scanJob.timeout: "5m"
  vulnerabilityReports.scanner: "Trivy"
  node.collector.imageRef: "ghcr.io/aquasecurity/node-collector:0.3.1"
  trivy.severity: "CRITICAL,HIGH,MEDIUM,LOW"
  trivy.slow: "true"

4. Connecting Registry State to Cluster State

The critical join is between the digest Copa just pushed to the registry and the digest currently running in each Pod. A small reconciliation collector performs that join. It runs as a long-lived Deployment that Prometheus scrapes, and its service account needs permission to list Pods in all namespaces.

#!/usr/bin/env python3
"""digest_drift_collector.py — detect pods running pre-patch image digests"""

import json
import subprocess
import os
from prometheus_client import start_http_server, Gauge
import time

LISTEN_PORT = int(os.environ.get("METRICS_PORT", "9090"))
PATCHED_DIGEST_FILE = os.environ.get("PATCHED_DIGEST_FILE", "/etc/copa/patched-digests.json")

running_unpatched = Gauge(
    "running_containers_with_critical_cve",
    "Containers running an image digest that predates the latest Copa patch",
    ["namespace", "pod", "image", "digest"],
)


def get_running_pods() -> list[dict]:
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    pods = json.loads(result.stdout)
    return pods.get("items", [])


def load_patched_digests(path: str) -> dict[str, str]:
    """Map image name -> latest patched digest from Copa pipeline."""
    with open(path) as f:
        return json.load(f)


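def collect() -> None:
    try:
        patched = load_patched_digests(PATCHED_DIGEST_FILE)
        pods = get_running_pods()
    except Exception as exc:
        # Keep the previous values on failure; clearing here would report zero drift.
        print(f"Collection error: {exc}")
        return
    # Reset only after a successful read so that restarted pods drop out of the gauge.
    running_unpatched.clear()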

    for pod in pods:
        namespace = pod["metadata"]["namespace"]
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
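            image_id = status.get("imageID", "")
            # imageID format: registry.example.com/myapp@sha256:abc123, optionally
            # prefixed with a runtime scheme such as docker-pullable://
            image_id = image_id.split("://", 1)[-1]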
            image_name = image_id.split("@")[0] if "@" in image_id else image_id
            running_digest = image_id.split("@sha256:")[-1] if "@sha256:" in image_id else ""

            latest_patched_digest = patched.get(image_name, "")
            if latest_patched_digest and running_digest and running_digest != latest_patched_digest:
                running_unpatched.labels(
                    namespace=namespace,
                    pod=pod_name,
                    image=image_name,
                    digest=f"sha256:{running_digest}",
                ).set(1)


def main():
    start_http_server(LISTEN_PORT)
    while True:
        collect()
        time.sleep(300)  # re-collect every 5 minutes


if __name__ == "__main__":
    main()

Copa’s pipeline should write patched-digests.json as a post-patch step:

{
  "registry.example.com/myapp": "abc123def456...",
  "registry.example.com/api-gateway": "789xyz000..."
}

Store this file in a ConfigMap or an object storage bucket that the digest drift collector can read. Mount it as a volume in the collector Deployment.
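
A sketch of that post-patch step follows. record_patched_digest is a hypothetical helper, and the sketch assumes the pipeline writes to a regular file that is then published to the ConfigMap or bucket; a mounted ConfigMap is read-only and cannot be updated this way.

```python
import json
import os
import tempfile


def record_patched_digest(path, image, digest):
    """Set image -> digest in the JSON map at `path`, replacing the file atomically."""
    try:
        with open(path) as f:
            digests = json.load(f)
    except FileNotFoundError:
        digests = {}
    # Stored without the 'sha256:' prefix, matching what the collector compares against.
    digests[image] = digest.removeprefix("sha256:")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(digests, f, indent=2, sort_keys=True)
    # Rename within the same directory so readers never observe a half-written file.
    os.replace(tmp_path, path)
    return digests
```

Writing to a temporary file and renaming it avoids the case where the collector reads the map while a patch run is halfway through rewriting it.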

5. Prometheus Recording Rules

Recording rules pre-compute expensive queries and produce the time series that dashboards and alerts consume. Separate SLA state from raw CVE counts.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: container-patch-compliance
  namespace: monitoring
spec:
  groups:
  - name: copa.sla
    interval: 5m
    rules:
    # Workload containers reporting at least one critical CVE, per namespace and image repository
    - record: fleet:running_critical_cve_images:count
      expr: |
        count(
          trivy_image_vulnerabilities{severity="CRITICAL"} > 0
        ) by (namespace, image_repository)

    # Containers breaching 24-hour SLA (critical CVE age > 24h)
    - record: fleet:sla_breach_24h:count
      expr: |
        count(
          image_cve_age_hours{severity="CRITICAL"} > 24
        ) by (image)

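    # Copa success rate over last 7 days per image. Aggregating by image aligns the
    # label sets of the two counters; the "or" clause supplies 0 when no failures exist.
    - record: copa:patch_success_rate_7d:ratio
      expr: |
        sum by (image) (increase(copa_patch_success_total[7d]))
        /
        (
          sum by (image) (increase(copa_patch_success_total[7d]))
          + (
            sum by (image) (increase(copa_patch_failure_total[7d]))
            or sum by (image) (increase(copa_patch_success_total[7d])) * 0
          )
        )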

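    # Fleet-wide fraction (0 to 1) of scanned images with no container on a pre-patch digest.
    # Both sides count distinct images so that numerator and denominator share a unit.
    - record: fleet:patch_compliance_ratio:gauge
      expr: |
        1 - (
          (count(count by (image) (running_containers_with_critical_cve > 0)) or vector(0))
          /
          count(count by (image_registry, image_repository) (trivy_image_vulnerabilities))
        )

    # Running containers whose digest does not match latest Copa output.
    # "or vector(0)" keeps the series present (value 0) when there is no drift.
    - record: fleet:digest_drift_containers:count
      expr: |
        count(running_containers_with_critical_cve > 0) or vector(0)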
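
The SLA rules above read image_cve_age_hours, but the article does not show an exporter for it. One way to derive it, sketched here under assumptions, is to persist a first-seen timestamp per image and CVE and compute the age on each scan. update_cve_ages and the "image|cve_id" key format are hypothetical.

```python
import time


def update_cve_ages(first_seen, current_findings, now=None):
    """Return ({key: age_hours}, updated_first_seen) for the latest scan.

    first_seen:       dict mapping 'image|cve_id' -> Unix timestamp of first detection.
    current_findings: iterable of (image, cve_id) pairs from the latest scan.
    """
    now = time.time() if now is None else now
    current = {f"{image}|{cve_id}" for image, cve_id in current_findings}
    # Keep timestamps only for findings still present, so a recurrence restarts the clock.
    updated = {key: first_seen.get(key, now) for key in current}
    ages = {key: (now - ts) / 3600.0 for key, ts in updated.items()}
    return ages, updated
```

Keying on the image name rather than the digest keeps the clock running across rebuilds that do not fix the CVE. Whether a recurrence should restart the clock, as it does here, is a policy decision to settle with whoever owns the SLA.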

6. Grafana Dashboard Panels

Organise the dashboard into three rows: fleet overview, SLA tracking, and Copa pipeline health.

Fleet Heatmap — Image × Severity × Age

Panel type: Heatmap
Query:
  sum by (image_repository, severity) (
    trivy_image_vulnerabilities
  )
X axis: image_repository
Y axis: severity (CRITICAL, HIGH, MEDIUM)
Color: count of CVEs, red for > 5 critical, yellow for > 10 high

SLA Breach Table

Panel type: Table
Query:
  topk(20,
    image_cve_age_hours{severity="CRITICAL"} > 24
  )
Columns:
  - image (from label)
  - cve_id (from label)
  - Value → "Age (hours)"
  - Threshold annotation: red if > 48h, orange if 24–48h
Sort: descending by age

Copa Success Rate Over Time

Panel type: Time series
Query A (success rate):
  rate(copa_patch_success_total[1h])
Query B (failure rate):
  rate(copa_patch_failure_total[1h])
Legend: "Success" and "Failure" per image
Threshold line: 0 failures — any non-zero failure rate triggers annotation

Top-N Most Vulnerable Running Images

Panel type: Bar gauge
Query:
  topk(10,
    sum by (image_repository) (
      trivy_image_vulnerabilities{severity="CRITICAL"}
    )
  )
Orientation: Horizontal
Thresholds: green=0, yellow=1, red=5

Digest Drift Status

Panel type: Stat
Query:
  fleet:digest_drift_containers:count
Title: "Pods Running Pre-Patch Digest"
Color: green if 0, red if > 0

7. Alertmanager Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: copa-alerts
  namespace: monitoring
spec:
  groups:
  - name: copa.alerts
    rules:
    # PagerDuty: critical CVE with CVSS >= 9 breaching 4-hour SLA
    - alert: CriticalCVESLABreach
      expr: |
        image_cve_age_hours{severity="CRITICAL"} > 4
      for: 0m
      labels:
        severity: critical
        team: security
        pagerduty: "true"
      annotations:
        summary: "Critical CVE SLA breach on {{ $labels.image }}"
        description: |
          CVE {{ $labels.cve_id }} has been present on image {{ $labels.image }}
          for {{ printf "%.1f" $value }} hours. SLA threshold is 4 hours for CVSS >= 9.
        runbook_url: "https://wiki.example.com/runbooks/cve-sla-breach"

    # Slack: Copa patch job failed for an image
    - alert: CopaPatchJobFailed
      expr: |
        increase(copa_patch_failure_total[1h]) > 0
      for: 0m
      labels:
        severity: warning
        team: platform
        slack_channel: "#security-alerts"
      annotations:
        summary: "Copa patch failed for {{ $labels.image }}"
        description: |
          Copa failed to patch {{ $labels.image }}. Reason: {{ $labels.failure_reason }}.
          The SLA clock continues running until the image is successfully patched.

    # Warning: pods running pre-patch image digest
    - alert: PodsRunningUnpatchedDigest
      expr: |
        fleet:digest_drift_containers:count > 0
      for: 30m
      labels:
        severity: warning
        team: security
      annotations:
        summary: "{{ $value }} pods running unpatched image digest"
        description: |
          Pods are running image digests that predate the latest Copa patch.
          These pods remain vulnerable even though the registry image is patched.
          Trigger a rollout restart for affected deployments.

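    # Warning: Trivy scan stale (no scan in > 26 hours for an image).
    # Assumes a per-image last-scan timestamp metric with this name is being exported;
    # verify that it exists in your Trivy operator version before relying on this alert.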
    - alert: TrivyScanStale
      expr: |
        (time() - trivy_image_vulnerabilities_last_scan_timestamp) > 93600
      for: 0m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "Trivy scan stale for {{ $labels.image_repository }}"
        description: "No Trivy scan result in 26 hours. CVE data may be outdated."

Alertmanager routing:

route:
  group_by: ["alertname", "image"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default-slack"
  routes:
  - match:
      pagerduty: "true"
    receiver: "pagerduty-critical"
    continue: false
  - match:
      slack_channel: "#security-alerts"
    receiver: "security-slack"

receivers:
- name: "pagerduty-critical"
  pagerduty_configs:
  - routing_key: "${PAGERDUTY_KEY}"
    severity: critical

- name: "security-slack"
  slack_configs:
  - api_url: "${SLACK_WEBHOOK_URL}"
    channel: "#security-alerts"
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

8. OWASP Dependency-Track Integration

Dependency-Track provides remediation history and project-level CVE tracking that Prometheus cannot replicate. Push Copa-verified scan results after each successful patch run.

#!/usr/bin/env bash
# upload_bom_to_dtrack.sh — push post-patch SBOM to Dependency-Track

set -euo pipefail

DTRACK_URL="${DTRACK_URL:-https://dtrack.example.com}"
DTRACK_API_KEY="${DTRACK_API_KEY:?DTRACK_API_KEY required}"
IMAGE="${1:?image required}"
BOM_FILE="${2:?bom file required}"
PROJECT_NAME="${IMAGE//\//-}"
PROJECT_VERSION="patched-$(date +%Y%m%d)"

# Upload BOM (CycloneDX format from Trivy). autoCreate=true creates the project
# on first upload, so no separate project-creation call is needed.
# --fail makes curl exit non-zero on HTTP errors instead of failing silently.
curl -sS --fail -X POST "${DTRACK_URL}/api/v1/bom" \
  -H "X-Api-Key: ${DTRACK_API_KEY}" \
  -F "autoCreate=true" \
  -F "projectName=${PROJECT_NAME}" \
  -F "projectVersion=${PROJECT_VERSION}" \
  -F "bom=@${BOM_FILE}"

echo "Uploaded BOM for ${IMAGE} to project ${PROJECT_NAME} (${PROJECT_VERSION})"

Trivy can produce the CycloneDX BOM as part of the Copa pipeline:

trivy image --format cyclonedx --output /tmp/bom.json registry.example.com/myapp:patched
bash upload_bom_to_dtrack.sh registry.example.com/myapp /tmp/bom.json

Dependency-Track stores the full remediation history: when a CVE was first seen, when it was resolved (verified by the post-patch scan), and whether it recurred. This audit trail satisfies compliance requirements that point-in-time Prometheus metrics cannot.


Expected Behaviour

| Event | Metric Emitted | Alert Fired | Dashboard Updated |
|---|---|---|---|
| Trivy detects new critical CVE on running container | trivy_image_vulnerabilities{severity="CRITICAL"} increments; image_cve_age_hours starts at 0 | No alert (within SLA window) | Fleet heatmap turns yellow; SLA breach table adds a row only once the age exceeds 24h |
| Copa patch succeeds for image | copa_patch_success_total increments; copa_patch_duration_seconds recorded | No alert | Copa success rate panel updates; digest drift panel shows 0 after pods restart |
| Copa patch fails for image | copa_patch_failure_total increments with failure_reason label | CopaPatchJobFailed fires to Slack | Copa failure rate panel shows non-zero; affected image highlighted in heatmap |
| Pod still running unpatched digest after registry patch | running_containers_with_critical_cve remains 1; fleet:digest_drift_containers:count > 0 | PodsRunningUnpatchedDigest fires after 30m | Digest drift stat panel turns red |
| Critical CVE age exceeds 4 hours (CVSS ≥ 9) | image_cve_age_hours crosses threshold | CriticalCVESLABreach fires to PagerDuty | SLA breach table row turns red; breach count increments |
| Trivy operator does not scan image for > 26 hours | trivy_image_vulnerabilities_last_scan_timestamp staleness threshold crossed | TrivyScanStale fires to Slack | Heatmap shows stale indicator for affected image |

Trade-offs

| Approach A | Approach B | Consideration |
|---|---|---|
| Trivy operator continuous scanning (always-on operator scanning the images of running workloads) | Scheduled Copa scan-only (Trivy runs only as part of the nightly Copa CronJob) | Continuous scanning catches drift between Copa runs: a CVE disclosed at noon becomes visible at the operator's next rescan (bounded by scannerReportTTL) rather than at the next nightly run. Cost: operator consumes ~100m CPU and ~256Mi RAM per cluster. Scheduled scanning costs nothing at runtime but leaves an 18–23 hour blind spot for newly disclosed CVEs. |
| Dependency-Track integration (full SBOM, project history, policy gates, audit export) | Prometheus-only (metrics, dashboards, alerts) | Dependency-Track provides audit-grade remediation history, SBOM storage, and policy-gate APIs that Prometheus cannot replicate. Operational overhead: a Dependency-Track deployment requires a persistent database, regular updates, and API key management. Prometheus-only is significantly simpler but cannot produce the historical "CVE first seen / resolved on date" records that auditors require. |
| Alert on SLA breach (fire when CVE age > threshold) | Alert on risk score (fire when CVSS × exploit-likelihood exceeds threshold) | Breach-based alerting is deterministic and auditable: a 24-hour SLA either held or it did not. Risk-score alerting is more nuanced but requires a CVSS enrichment pipeline and a decision about which threat intelligence source to trust. Start with breach-based alerting; add risk-score routing once the base system is stable. |
| Pushgateway for Copa job metrics | Prometheus scrape of a long-running exporter | Copa runs for seconds to minutes per image and then exits, so it cannot host a scrape endpoint. Pushgateway is the correct pattern. Downside: Pushgateway metrics persist after the job exits, which can cause stale metrics if Copa stops running entirely. Pushgateway has no built-in TTL, so either delete groups explicitly when an image is retired or use a heartbeat metric to detect the absence of Copa runs. |

Failure Modes

| Failure | Symptom | Detection | Mitigation |
|---|---|---|---|
| Trivy operator does not watch a namespace | Running containers in that namespace have no CVE metrics; fleet compliance ratio appears better than reality | TrivyScanStale alert on missing trivy_image_vulnerabilities_last_scan_timestamp; cross-reference kubectl get namespaces against namespaces present in Trivy metrics | Ensure operator namespace configuration includes all production namespaces; use a namespace selector that defaults to all rather than an explicit list |
| Copa metrics exporter container fails (OOM, crash) | copa_patch_success_total and copa_patch_failure_total do not increment; Copa runs silently with no Prometheus record | Absence-of-metric alert: absent(increase(copa_patch_success_total[25h])) fires if no Copa metrics update within expected window | Use a Pushgateway heartbeat metric that Copa CronJob emits regardless of patch outcome; alert on absent(copa_job_last_run_timestamp) |
| Image reference format mismatch | Digest drift collector compares registry.example.com/myapp against registry.example.com:443/myapp; the lookup finds no match and the drift goes unreported (false negative) | Test with kubectl get pods -o jsonpath='{..imageID}' and compare manually against Copa output format | Normalise all image references in both the Copa pipeline and the digest drift collector using a shared normalisation function; strip port 443 and handle docker.io short references |
| Alert silenced or muted | CriticalCVESLABreach has been silenced for a maintenance window; silence is not removed; alert never fires again | Poll the Alertmanager silences API (/api/v2/silences): alert if any silence has endsAt more than 48 hours in the future without a matching change-request ticket | Integrate silence management with the change management system; auto-expire silences; alert on silences that exceed maximum duration policy |
| Pushgateway stale metrics after Copa decommission | Old copa_patch_success_total metrics persist from an image that no longer exists; dashboard shows phantom compliance | Pushgateway's /metrics endpoint shows push_time_seconds for each job/instance group; alert when time() - push_time_seconds > 25h | Pushgateway has no built-in metric TTL; run a cleanup job that deletes stale groups through its HTTP API (DELETE /metrics/job/{job}/...), or migrate to a custom exporter with a built-in metric expiry |
| Dependency-Track API unavailable | BOM uploads fail silently; remediation history gaps | Wrap the upload script in a check: curl --fail returns non-zero on 5xx; log and alert via Slack on repeated failures | Implement a retry queue for BOM uploads (dead-letter to a configmap or object storage); retry on next Copa run with the post-patch BOM |
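
The stale-group check for Pushgateway can be sketched as a pure function over push_time_seconds values. find_stale_groups and the tuple-based group key are hypothetical; scraping the Pushgateway /metrics endpoint into this dict is left out, and the 25-hour default mirrors the detection window in the table.

```python
def find_stale_groups(push_times, now, max_age_seconds=25 * 3600):
    """Return grouping keys whose last push is older than max_age_seconds.

    push_times: dict mapping a grouping key (tuple of sorted label pairs) to the
                push_time_seconds value reported for that group.
    now:        current Unix timestamp.
    """
    return sorted(
        group
        for group, pushed_at in push_times.items()
        if now - pushed_at > max_age_seconds
    )
```

A cleanup job could pass each returned group to the Python client's delete_from_gateway function, which issues the corresponding DELETE request against the Pushgateway.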