AI Triage of Open-Source Scanner Findings to Cut False-Positive Noise
The Problem
Production security programmes that run Trivy against container images, Grype against SBOMs, and Semgrep against source code on every CI run generate volumes that defeat manual triage. A mid-size engineering organisation with forty services, each with a container image rebuilt daily, will see 200–500 scanner findings per sprint. The realistic exploitability rate — findings that represent a genuine, reachable vulnerability in the specific deployment context — is typically below 5%.
That gap between output volume and actionable signal is not primarily a tooling problem. Trivy and Grype are both doing their jobs correctly: they report every CVE affecting every package in the image, regardless of whether the vulnerable function is called, whether the package is present only in a build-time layer that is discarded in the final image, or whether the vulnerability requires an attack vector (local privilege, authenticated access, specific network position) that the application’s threat model simply does not expose. Semgrep reports code patterns that match a rule, regardless of whether the matching code is reachable from a public entry point or protected by access controls that make exploitation implausible.
EPSS scores help. A finding with an EPSS score of 0.00023 is unlikely to be exploited even if theoretically vulnerable. But EPSS is a population-level statistical estimate of exploitation likelihood across all deployments; it cannot account for whether the specific package is in your application’s runtime dependency tree or only in test fixtures, whether the vulnerable code path is reachable from your API surface, or whether compensating controls already present in your environment reduce the real-world risk further.
The result is a well-documented failure mode: security teams spend their time processing the same categories of low-priority findings repeatedly, genuine critical findings get buried in the volume, JIRA backlogs accumulate hundreds of open vulnerability tickets that no one has reviewed in months, and the team stops trusting scanner output as a useful signal. Engineers learn that scanner alerts are noise, and the security programme quietly degrades.
LLMs are now capable of performing the contextual assessment step that scanners cannot. Given a finding (CVE, affected package, version, EPSS score) combined with context about the application’s dependency tree, the package’s purpose, and the application’s deployment model, an LLM can produce a structured assessment of exploitability, a priority tier, and a recommended action. Applied at scale across scanner output, this turns a volume problem into a manageable queue of genuinely prioritised work.
This article covers the full pipeline: enriching scanner output with context, structuring the LLM triage prompt, deduplicating across scanners, handling Semgrep code-level findings, tracking triage accuracy over time, and routing output to appropriate downstream systems.
Threat Model
Finding buried in noise causes missed incident. A P1-severity finding — a CVE with a public exploit in a runtime package that is both reachable and unpatched — is present in the scanner output but surrounded by 300 other findings, most of them test dependencies or build tools. Without effective triage, the P1 is addressed at the same priority as the surrounding noise. The window between the CVE’s public disclosure and your patch application remains open longer than it should, and during that window active exploitation is possible.
Alert fatigue causes silent security posture degradation. Teams that receive consistently high volumes of low-quality findings develop systematic avoidance behaviours. Findings go unreviewed. JIRA tickets accumulate without triage. Monthly review cadences slip to quarterly. When a genuine incident occurs, the forensic trail shows that the exploited vulnerability appeared in scanner output weeks before exploitation, but the ticket was never opened. This is the most common real-world failure pattern in programmes that deploy scanners without effective triage.
Tooling gap causes missed vulnerability. Trivy and Grype use different vulnerability databases and scanning techniques. A CVE present in the NVD may appear in one scanner’s output but not the other’s, due to differences in package detection, database update timing, or matching logic. A programme that relies on a single scanner will miss vulnerabilities that the other scanner would have caught. Conversely, processing both scanners’ output independently generates duplicate tickets for the same finding, consuming triage capacity without additional coverage benefit.
Same CVE triaged as separate tickets. In environments without cross-scanner deduplication, the same CVE affecting the same package in the same container image appears once in Trivy output and once in Grype output. Both generate separate tickets. A security engineer triages one, patches the package, and marks the ticket resolved. The second ticket remains open, generating unnecessary follow-up and confusion about what was actually remediated.
Hardening Configuration
Trivy Output Enrichment and LLM Triage
The first step is collecting the context that scanners omit. For each finding, the triage script needs to know: is the affected package a runtime dependency, a build-time dependency, or a test-only dependency? What is the package’s primary purpose? What is the EPSS score?
Install dependencies and run Trivy in JSON mode:
pip install anthropic requests prometheus-client PyGithub
# Run Trivy against a container image, outputting JSON
trivy image --format json --output trivy-output.json \
  --severity CRITICAL,HIGH,MEDIUM \
  myregistry/myapp:latest
# Run Grype against an SBOM
syft myregistry/myapp:latest -o spdx-json > sbom.spdx.json
grype sbom:sbom.spdx.json -o json > grype-output.json
The enrichment and triage script reads Trivy JSON, fetches EPSS scores, classifies dependencies against the runtime dependency tree, then submits each finding to an LLM for structured assessment:
#!/usr/bin/env python3
"""
triage_scanner_findings.py — AI-assisted vulnerability triage pipeline.
Reads Trivy JSON output, enriches findings with EPSS scores and dependency
classification, submits each finding to Claude for structured triage, and
emits a prioritised finding report.
"""
import json
import sys
import time
import hashlib
import requests
import anthropic
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Literal
EPSS_API = "https://api.first.org/data/v1/epss"
ANTHROPIC_MODEL = "claude-opus-4-5"
client = anthropic.Anthropic()
@dataclass
class EnrichedFinding:
    cve_id: str
    package_name: str
    installed_version: str
    fixed_version: str | None
    severity: str
    epss_score: float
    epss_percentile: float
    dep_type: Literal["runtime", "build", "test", "unknown"]
    scanner: str
    title: str
    description: str
    # Populated after LLM triage
    exploitability: str = ""
    priority: str = ""
    reasoning: str = ""
    recommended_action: str = ""
    fingerprint: str = ""

    def __post_init__(self):
        # Stable fingerprint for deduplication across scanners
        raw = f"{self.cve_id}:{self.package_name}:{self.installed_version}"
        self.fingerprint = hashlib.sha256(raw.encode()).hexdigest()[:16]
def fetch_epss_scores(cve_ids: list[str]) -> dict[str, dict]:
    """Fetch EPSS scores for a batch of CVE IDs from FIRST API."""
    scores = {}
    # API supports batching up to 100 CVEs per request
    for i in range(0, len(cve_ids), 100):
        batch = cve_ids[i:i + 100]
        params = {"cve": ",".join(batch)}
        try:
            resp = requests.get(EPSS_API, params=params, timeout=10)
            resp.raise_for_status()
            for item in resp.json().get("data", []):
                scores[item["cve"]] = {
                    "epss": float(item.get("epss", 0)),
                    "percentile": float(item.get("percentile", 0)),
                }
        except requests.RequestException as e:
            print(f"EPSS fetch error for batch: {e}", file=sys.stderr)
        time.sleep(0.2)  # Rate limit courtesy
    return scores
def classify_dependency(package_name: str, runtime_deps: set[str],
                        build_deps: set[str], test_deps: set[str]) -> str:
    """Classify a package into dependency tier based on known sets."""
    pkg = package_name.lower()
    if pkg in runtime_deps:
        return "runtime"
    if pkg in test_deps:
        return "test"
    if pkg in build_deps:
        return "build"
    return "unknown"
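def load_runtime_deps(requirements_file: str = "requirements.txt") -> set[str]:
    """
    Parse a requirements.txt-style file and return the lower-cased package names.
    Also used for the test and build requirement files in main().
    For non-Python projects, replace with package.json parsing, go.mod, etc.
    """
    deps = set()
    try:
        for line in Path(requirements_file).read_text().splitlines():
            line = line.strip()
            # Skip blanks, comments, and pip options such as "-r other.txt"
            if not line or line.startswith("#") or line.startswith("-"):
                continue
            # Keep only the name: cut at extras, version specifiers
            # (==, >=, <=, ~=, !=), environment markers, and inline comments
            for sep in ("[", "=", "<", ">", "!", "~", ";", "#", "@", " "):
                line = line.split(sep, 1)[0]
            if line:
                deps.add(line.lower())
    except FileNotFoundError:
        pass
    return deps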
TRIAGE_PROMPT = """You are a security vulnerability triage analyst. Assess the following scanner finding for a production web application.
Finding details:
- CVE ID: {cve_id}
- Package: {package_name} version {installed_version}
- Fixed in version: {fixed_version}
- Severity (CVSS-based): {severity}
- EPSS score: {epss_score:.4f} (exploitation probability in next 30 days, population-level)
- EPSS percentile: {epss_percentile:.1f}th percentile
- Dependency classification: {dep_type}
- CVE title: {title}
- CVE description: {description}
Application context:
- This is a production Python web API (FastAPI) serving external traffic
- Runtime dependencies are directly invoked; test dependencies are only present in CI
- Build dependencies are not present in the production container image
Assess the following:
1. Given the dependency classification, is the vulnerable code likely present and reachable at runtime?
2. Based on the CVE description and the package's typical use case, is the vulnerable function likely to be invoked in a standard web API deployment?
3. Considering the EPSS score alongside context, what is the realistic exploitability?
Output valid JSON only, no prose before or after:
{{
"exploitability": "low|medium|high",
"reasoning": "Two to four sentences explaining your assessment, citing the dep_type, EPSS, and CVE specifics.",
"priority": "P1|P2|P3|P4",
"recommended_action": "Specific action: patch, monitor, accept, or investigate with rationale"
}}
Priority guide:
- P1: exploitability=high, public exploit known or likely, runtime dep — patch within 24h
- P2: exploitability=medium or high without confirmed public exploit, runtime dep — patch within sprint
- P3: exploitability=low or non-runtime dep — batch review weekly
- P4: test/build dep or EPSS < 0.001 with no public exploit evidence — accept or monitor quarterly"""
def triage_finding(finding: EnrichedFinding) -> dict:
    """Submit a single finding to the LLM and parse the structured response."""
    prompt = TRIAGE_PROMPT.format(
        cve_id=finding.cve_id,
        package_name=finding.package_name,
        installed_version=finding.installed_version,
        fixed_version=finding.fixed_version or "no fix available",
        severity=finding.severity,
        epss_score=finding.epss_score,
        epss_percentile=finding.epss_percentile,
        dep_type=finding.dep_type,
        title=finding.title,
        description=finding.description[:800],  # Truncate long descriptions
    )
    message = client.messages.create(
        model=ANTHROPIC_MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
        # cache_control marks this block as cacheable. Caching only takes effect
        # once the cached prefix reaches the model's minimum cacheable length, so
        # this one-line instruction is too short to be cached on its own.
        system=[{
            "type": "text",
            "text": "You are a precise security analyst. Output only valid JSON.",
            "cache_control": {"type": "ephemeral"},
        }],
    )
    try:
        return json.loads(message.content[0].text)
    except (json.JSONDecodeError, IndexError) as e:
        print(f"LLM parse error for {finding.cve_id}: {e}", file=sys.stderr)
        return {
            "exploitability": "unknown",
            "reasoning": "LLM response parsing failed",
            "priority": "P2",  # Conservative fallback
            "recommended_action": "Manual review required",
        }
def parse_trivy_output(trivy_json_path: str) -> list[dict]:
    """Extract vulnerability records from Trivy JSON output."""
    raw = json.loads(Path(trivy_json_path).read_text())
    findings = []
    for result in raw.get("Results", []):
        for vuln in result.get("Vulnerabilities", []):
            findings.append({
                "cve_id": vuln.get("VulnerabilityID", ""),
                "package_name": vuln.get("PkgName", ""),
                "installed_version": vuln.get("InstalledVersion", ""),
                "fixed_version": vuln.get("FixedVersion"),
                "severity": vuln.get("Severity", "UNKNOWN"),
                "title": vuln.get("Title", ""),
                "description": vuln.get("Description", ""),
                "scanner": "trivy",
            })
    return findings
def parse_grype_output(grype_json_path: str) -> list[dict]:
    """Extract vulnerability records from Grype JSON output."""
    raw = json.loads(Path(grype_json_path).read_text())
    findings = []
    for match in raw.get("matches", []):
        vuln = match.get("vulnerability", {})
        artifact = match.get("artifact", {})
        # "versions" can be missing, null, or an empty list when no fix exists;
        # fall back to [None] so indexing never raises IndexError
        fix_versions = (vuln.get("fix") or {}).get("versions") or [None]
        findings.append({
            "cve_id": vuln.get("id", ""),
            "package_name": artifact.get("name", ""),
            "installed_version": artifact.get("version", ""),
            "fixed_version": fix_versions[0],
            "severity": vuln.get("severity", "UNKNOWN"),
            "title": vuln.get("description", "")[:120],
            "description": vuln.get("description", ""),
            "scanner": "grype",
        })
    return findings
def deduplicate_findings(findings: list[EnrichedFinding]) -> list[EnrichedFinding]:
    """
    Deduplicate across scanners by CVE + package + version fingerprint.
    When both scanners report the same finding, prefer the one with more
    metadata (non-null fixed_version, longer description).
    """
    seen: dict[str, EnrichedFinding] = {}
    for f in findings:
        existing = seen.get(f.fingerprint)
        if existing is None:
            seen[f.fingerprint] = f
            continue
        # Record every scanner that reported this finding, e.g. "trivy+grype"
        scanners = existing.scanner.split("+")
        if f.scanner not in scanners:
            scanners.append(f.scanner)
        # Prefer the richer record: a known fix version first, then the
        # longer description when both are equal on fix information
        winner = existing
        if f.fixed_version and not existing.fixed_version:
            winner = f
        elif (bool(f.fixed_version) == bool(existing.fixed_version)
              and len(f.description) > len(existing.description)):
            winner = f
        # Whichever record wins keeps the full scanner list
        winner.scanner = "+".join(scanners)
        seen[f.fingerprint] = winner
    return list(seen.values())
def main():
    runtime_deps = load_runtime_deps("requirements.txt")
    test_deps = load_runtime_deps("requirements-test.txt")
    build_deps = load_runtime_deps("requirements-build.txt")

    # Load and merge findings from both scanners
    raw_findings = []
    raw_findings.extend(parse_trivy_output("trivy-output.json"))
    raw_findings.extend(parse_grype_output("grype-output.json"))

    # Fetch EPSS scores in batch
    cve_ids = list({f["cve_id"] for f in raw_findings if f["cve_id"].startswith("CVE-")})
    epss_data = fetch_epss_scores(cve_ids)

    # Build enriched findings
    enriched = []
    for f in raw_findings:
        epss = epss_data.get(f["cve_id"], {"epss": 0.0, "percentile": 0.0})
        dep_type = classify_dependency(f["package_name"], runtime_deps, build_deps, test_deps)
        enriched.append(EnrichedFinding(
            cve_id=f["cve_id"],
            package_name=f["package_name"],
            installed_version=f["installed_version"],
            fixed_version=f["fixed_version"],
            severity=f["severity"],
            epss_score=epss["epss"],
            epss_percentile=epss["percentile"],
            dep_type=dep_type,
            scanner=f["scanner"],
            title=f["title"],
            description=f["description"],
        ))

    # Deduplicate across scanners
    unique_findings = deduplicate_findings(enriched)
    print(f"Findings: {len(raw_findings)} raw → {len(unique_findings)} after dedup")

    # LLM triage each unique finding
    results = []
    for finding in unique_findings:
        triage = triage_finding(finding)
        finding.exploitability = triage.get("exploitability", "unknown")
        finding.priority = triage.get("priority", "P3")
        finding.reasoning = triage.get("reasoning", "")
        finding.recommended_action = triage.get("recommended_action", "")
        results.append(asdict(finding))
        time.sleep(0.5)  # Rate limit

    # Write output
    output_path = Path("triage-results.json")
    output_path.write_text(json.dumps(results, indent=2))
    print(f"Triage complete. Results written to {output_path}")

    # Summary
    by_priority = {}
    for r in results:
        p = r["priority"]
        by_priority[p] = by_priority.get(p, 0) + 1
    for p in sorted(by_priority):
        print(f"  {p}: {by_priority[p]} findings")


if __name__ == "__main__":
    main()
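The script above passes whatever JSON the model returns straight into the finding record. A minimal sketch of a validation step follows, assuming the reply may arrive wrapped in a Markdown code fence and that any value outside the prompt's enumerations should fall back to manual review. `normalise_triage_response` and `FALLBACK` are hypothetical names, not part of the original script.

```python
import json
import re

ALLOWED_EXPLOITABILITY = {"low", "medium", "high"}
ALLOWED_PRIORITY = {"P1", "P2", "P3", "P4"}

# Conservative record used whenever the model output cannot be trusted (assumption:
# same P2 / manual-review fallback the script uses for parse errors).
FALLBACK = {
    "exploitability": "unknown",
    "reasoning": "LLM response failed validation",
    "priority": "P2",
    "recommended_action": "Manual review required",
}


def normalise_triage_response(text: str) -> dict:
    """Parse model output and reject anything outside the expected schema."""
    # Take the outermost {...} span so a surrounding code fence or stray prose is ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return dict(FALLBACK)
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(FALLBACK)
    exploitability = str(data.get("exploitability", "")).lower()
    priority = str(data.get("priority", "")).upper()
    if exploitability not in ALLOWED_EXPLOITABILITY or priority not in ALLOWED_PRIORITY:
        return dict(FALLBACK)
    return {
        "exploitability": exploitability,
        "priority": priority,
        "reasoning": str(data.get("reasoning", "")),
        "recommended_action": str(data.get("recommended_action", "")),
    }
```

In `triage_finding`, this could replace the bare `json.loads` call, so that a reply such as `"priority": "P0"` is routed to manual review instead of flowing into the routing logic unchecked.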
Semgrep Code-Level Finding Enrichment
Semgrep findings include the matched code snippet and rule ID, which gives the LLM enough context to assess reachability. The enrichment prompt is different from the CVE triage prompt because the question is about code flow, not package exploitability:
SEMGREP_TRIAGE_PROMPT = """You are a security code reviewer assessing whether a Semgrep finding is a true positive.
Rule ID: {rule_id}
Rule message: {rule_message}
Severity: {severity}
File: {file_path} (line {line})
Matched code:
{code_snippet}
Surrounding context (10 lines):
{context_snippet}
Assess:
1. Is this pattern actually exploitable as written, or does it require additional attacker-controlled input that isn't present?
2. Is there a sanitiser, validator, or access control in the surrounding context that mitigates the risk?
3. What is the realistic exploitability?
Output valid JSON only:
{{
"exploitability": "low|medium|high",
"reasoning": "Two to three sentences citing specific code evidence.",
"priority": "P1|P2|P3|P4",
"recommended_action": "Specific fix or dismissal with rationale",
"false_positive_likely": true|false
}}"""
def parse_semgrep_output(semgrep_json_path: str) -> list[dict]:
    """Extract findings from Semgrep JSON output with surrounding context."""
    raw = json.loads(Path(semgrep_json_path).read_text())
    findings = []
    for result in raw.get("results", []):
        file_path = result.get("path", "")
        start_line = result.get("start", {}).get("line", 0)
        # Read surrounding context from the source file
        context = ""
        try:
            lines = Path(file_path).read_text().splitlines()
            ctx_start = max(0, start_line - 6)
            ctx_end = min(len(lines), start_line + 5)
            # Number each line by its real position in the file, not from 1
            context = "\n".join(
                f"{n}: {text}"
                for n, text in enumerate(lines[ctx_start:ctx_end], start=ctx_start + 1)
            )
        except (FileNotFoundError, PermissionError):
            pass
        findings.append({
            "rule_id": result.get("check_id", ""),
            "rule_message": result.get("extra", {}).get("message", ""),
            "severity": result.get("extra", {}).get("severity", "WARNING"),
            "file_path": file_path,
            "line": start_line,
            "code_snippet": result.get("extra", {}).get("lines", ""),
            "context_snippet": context,
        })
    return findings
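The section shows the prompt and the parser but not the step that joins them. The sketch below is one way to do it, under stated assumptions: `complete` is any callable that sends a prompt to the model and returns its text reply (for example a thin wrapper around `client.messages.create`), `PROMPT_TEMPLATE` is a shortened stand-in for `SEMGREP_TRIAGE_PROMPT`, and the `queue` field and its values are hypothetical names for the two review channels.

```python
import json
from typing import Callable

# Shortened stand-in for SEMGREP_TRIAGE_PROMPT; placeholders match the keys
# produced by parse_semgrep_output().
PROMPT_TEMPLATE = (
    "Rule ID: {rule_id}\nRule message: {rule_message}\nSeverity: {severity}\n"
    "File: {file_path} (line {line})\nMatched code:\n{code_snippet}\n"
    "Surrounding context:\n{context_snippet}\n"
    "Output valid JSON only."
)


def triage_semgrep_finding(finding: dict, complete: Callable[[str], str]) -> dict:
    """Format the prompt, call the model via `complete`, and parse the reply."""
    prompt = PROMPT_TEMPLATE.format(**finding)
    try:
        verdict = json.loads(complete(prompt))
        if not isinstance(verdict, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Fail closed: an unparseable reply is never treated as a false positive.
        verdict = {
            "priority": "P2",
            "false_positive_likely": False,
            "reasoning": "LLM response parsing failed",
        }
    # Only an explicit JSON true moves a finding out of the security queue.
    is_fp = verdict.get("false_positive_likely") is True
    queue = "developer-review" if is_fp else "security"
    return {**finding, **verdict, "queue": queue}
```

Injecting `complete` keeps the function testable without network access; a parse failure lands in the security queue rather than being silently dismissed.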
Cross-Scanner Deduplication with Confidence Scoring
When Trivy and Grype both report CVE-2024-XXXXX for requests==2.28.0, the deduplication must handle minor differences in package name normalisation (e.g., python-requests vs requests). The fingerprint function accounts for common normalisation variations:
import re
def normalise_package_name(name: str) -> str:
    """Normalise package names for cross-scanner comparison."""
    name = name.lower().strip()
    # Strip common distro prefixes (python-, python3-, py-, lib) and suffixes
    # (-dev, -python, -lib). The bare "lib" prefix also matches unrelated names
    # such as "librosa"; tighten it if that causes collisions in your package set.
    name = re.sub(r'^(python3?-|py-|lib)', '', name)
    name = re.sub(r'(-dev|-python|-lib)$', '', name)
    # PyPI normalisation: collapse runs of hyphens, underscores, and dots to one hyphen
    name = re.sub(r'[-_.]+', '-', name)
    return name


def stable_finding_fingerprint(cve_id: str, package: str, version: str) -> str:
    """Stable fingerprint that survives minor scanner naming differences."""
    normalised_pkg = normalise_package_name(package)
    # Keep only major.minor so patch-level differences collapse to one key
    # (CVE-2024-X in requests 2.28.x)
    version_parts = version.split(".")
    version_key = ".".join(version_parts[:2]) if len(version_parts) >= 2 else version
    raw = f"{cve_id}:{normalised_pkg}:{version_key}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
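The code above produces a fingerprint but no confidence value for a match. One possible scoring scheme is sketched here, assuming the two records already share a CVE ID. The function name `match_confidence` and the numeric levels (1.0, 0.9, 0.6) are illustrative assumptions, not values from the original pipeline.

```python
from typing import Callable


def match_confidence(
    pkg_a: str,
    ver_a: str,
    pkg_b: str,
    ver_b: str,
    normalise: Callable[[str], str] = str.lower,
) -> float:
    """Heuristic 0.0-1.0 confidence that two scanner records describe the same package.

    Assumes the caller only compares records with the same CVE ID.
    """
    same_version = ver_a == ver_b
    same_minor = ver_a.split(".")[:2] == ver_b.split(".")[:2]
    same_name = normalise(pkg_a) == normalise(pkg_b)
    if pkg_a == pkg_b and same_version:
        return 1.0  # identical name and version
    if same_name and same_version:
        return 0.9  # names differ only by scanner-specific naming
    if same_name and same_minor:
        return 0.6  # same release line, different patch level
    return 0.0
```

Passing `normalise_package_name` as the `normalise` argument ties the score to the normalisation rules above. A pipeline might merge records only at 0.9 or higher and keep lower-scoring pairs as separate findings flagged for a human to confirm.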
Prometheus Metrics for Triage Accuracy Tracking
The AI triage pipeline’s accuracy must be tracked over time. The key metric is the false-negative rate: the share of findings the LLM assessed as low priority (P3/P4) that were subsequently confirmed as exploitable. Tracking this requires a feedback loop from your patching process back to the triage records:
from prometheus_client import REGISTRY, Counter, Histogram, push_to_gateway

# Metrics
triage_total = Counter(
    'vulnerability_triage_total',
    'Total findings processed by AI triage',
    ['scanner', 'priority', 'dep_type']
)
triage_false_negatives = Counter(
    'vulnerability_triage_false_negatives_total',
    'Findings assessed as P3/P4 that were later confirmed exploitable',
    ['scanner']
)
triage_false_positives = Counter(
    'vulnerability_triage_false_positives_total',
    'Findings assessed as P1/P2 that were confirmed non-exploitable after investigation'
)
dedup_savings = Counter(
    'vulnerability_dedup_cross_scanner_savings_total',
    'Number of duplicate cross-scanner findings eliminated'
)
triage_latency = Histogram(
    'vulnerability_triage_llm_latency_seconds',
    'Time taken for LLM to triage a single finding',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0]
)


def record_triage_metrics(finding: EnrichedFinding):
    triage_total.labels(
        scanner=finding.scanner,
        priority=finding.priority,
        dep_type=finding.dep_type,
    ).inc()


def record_confirmed_false_negative(scanner: str):
    """Call when a P3/P4 finding is later confirmed as genuinely exploitable."""
    triage_false_negatives.labels(scanner=scanner).inc()


def push_metrics(pushgateway_url: str, job: str = "vuln-triage"):
    # push_to_gateway needs a registry object; REGISTRY is the default
    # registry that the metrics above are registered with
    push_to_gateway(pushgateway_url, job=job, registry=REGISTRY)
Monitor the false negative rate with a Prometheus alerting rule:
# prometheus-rules/vuln-triage.yaml
groups:
  - name: vulnerability_triage
    rules:
      - alert: TriageFalseNegativeRateHigh
        # Aggregate both sides with sum(): the two counters carry different
        # label sets, so an unaggregated division would match no series
        expr: |
          sum(increase(vulnerability_triage_false_negatives_total[30d]))
            /
          sum(increase(vulnerability_triage_total{priority=~"P3|P4"}[30d])) > 0.02
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "AI triage false negative rate exceeds 2% over 30 days"
          description: "Review triage prompt and dep_type classification logic"
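The ratio the alert evaluates can also be computed offline from stored triage records, which is useful for the monthly audit or for checking the alert threshold against history. This is a minimal sketch, assuming each record is a dict with a `priority` key and that the feedback loop sets a `confirmed_exploitable` flag. That flag name is hypothetical.

```python
from typing import Optional


def false_negative_rate(records: list[dict]) -> Optional[float]:
    """Share of P3/P4 decisions later confirmed exploitable.

    Returns None when there are no P3/P4 decisions, so callers can tell
    "no data" apart from a genuine rate of zero.
    """
    low_priority = [r for r in records if r["priority"] in ("P3", "P4")]
    if not low_priority:
        return None
    misses = sum(1 for r in low_priority if r.get("confirmed_exploitable", False))
    return misses / len(low_priority)
```

With one confirmed miss among fifty P3/P4 decisions the function returns 0.02, which sits exactly at the alert threshold.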
Output Pipeline: Issue Creation and Batch Review
Route findings to different downstream channels based on priority tier:
import os
from github import Github # pip install PyGithub
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
GITHUB_REPO = os.environ["GITHUB_REPO"] # e.g. "org/repo"
def create_github_issue(finding: EnrichedFinding, repo):
    """Open a GitHub issue for P1/P2 findings."""
    title = f"[{finding.priority}] {finding.cve_id} in {finding.package_name} {finding.installed_version}"
    # Body lines stay flush-left so GitHub does not render them as a code block
    body = f"""## Vulnerability Finding
**CVE:** {finding.cve_id}
**Package:** `{finding.package_name}` version `{finding.installed_version}`
**Fixed in:** `{finding.fixed_version or 'No fix available'}`
**Severity:** {finding.severity}
**EPSS Score:** {finding.epss_score:.4f} ({finding.epss_percentile:.1f}th percentile)
**Detected by:** {finding.scanner}
**Dependency type:** {finding.dep_type}
## AI Triage Assessment
**Exploitability:** {finding.exploitability}
**Priority:** {finding.priority}
**Reasoning:** {finding.reasoning}
**Recommended action:** {finding.recommended_action}
---
*Triaged by AI pipeline — verify before closing*
"""
    return repo.create_issue(
        title=title,
        body=body,
        labels=[f"security:{finding.priority.lower()}", "vuln-scan"],
    )


def route_findings(results: list[EnrichedFinding]):
    gh = Github(GITHUB_TOKEN)
    repo = gh.get_repo(GITHUB_REPO)
    p1_p2 = [f for f in results if f.priority in ("P1", "P2")]
    p3_p4 = [f for f in results if f.priority in ("P3", "P4")]

    # Immediate issues for actionable findings
    for finding in p1_p2:
        create_github_issue(finding, repo)

    # Batch report for low-priority findings (weekly digest)
    if p3_p4:
        batch_report = "\n".join(
            f"- {f.cve_id} | {f.package_name} {f.installed_version} "
            f"| {f.dep_type} | {f.recommended_action}"
            for f in p3_p4
        )
        repo.create_issue(
            title=f"Weekly low-priority vulnerability batch ({len(p3_p4)} findings)",
            body=f"## Low-Priority Findings for Weekly Review\n\n{batch_report}",
            labels=["security:p3-p4", "weekly-batch"],
        )
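Because images are rebuilt and rescanned daily, the routing above would open a fresh issue for the same unpatched finding on every run. One way to avoid that is sketched below: embed the finding fingerprint in each issue body as a hidden HTML comment and skip findings already tracked. The marker format and the function names are hypothetical, and the sketch assumes the bodies of open issues have already been fetched (for instance with PyGithub's `repo.get_issues(state="open")`).

```python
import re

# Hidden marker appended to each issue body; invisible in rendered Markdown.
MARKER = "<!-- vuln-fingerprint: {fp} -->"
MARKER_RE = re.compile(r"<!-- vuln-fingerprint: ([0-9a-f]{16}) -->")


def fingerprint_marker(fingerprint: str) -> str:
    """Return the HTML comment to append to an issue body."""
    return MARKER.format(fp=fingerprint)


def filter_new_findings(findings: list[dict], open_issue_bodies: list) -> list[dict]:
    """Drop findings whose fingerprint already appears in an open issue."""
    already_tracked = {
        fp
        for body in open_issue_bodies
        for fp in MARKER_RE.findall(body or "")  # issue bodies may be None
    }
    return [f for f in findings if f["fingerprint"] not in already_tracked]
```

The 16-character hex pattern matches the truncated SHA-256 fingerprint that `EnrichedFinding` computes, so the same value serves for deduplication and for issue tracking.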
Expected Behaviour After Hardening
After the pipeline is operational, the per-sprint scanner output of 200–500 raw findings is reduced to a manageable prioritised queue. Typical distributions in environments where this has been deployed:
- P1 (patch within 24h): 0–3 findings per sprint — these are runtime dependencies with public exploits and high EPSS scores
- P2 (patch within sprint): 5–15 findings — runtime deps with confirmed CVEs but lower exploitation probability or no public PoC
- P3 (weekly batch review): 30–80 findings — test or build dependencies, or runtime deps with very low EPSS
- P4 (quarterly accept/monitor): remainder — non-runtime deps, informational findings, or findings with EPSS below 0.001
Cross-scanner deduplication eliminates 15–40% of the raw finding count before triage, depending on how much overlap exists between Trivy and Grype databases. Semgrep findings with the false-positive flag set are batched separately for a weekly developer review rather than generating security tickets.
P1 findings generate GitHub issues within minutes of the scanner run completing, rather than hours or days after manual triage. The security team’s daily triage time drops from 2–4 hours to 15–30 minutes spent reviewing P1/P2 assessments and confirming the LLM’s reasoning.
Trade-offs and Operational Considerations
| Consideration | Detail |
|---|---|
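| LLM API cost | At roughly 1,500 tokens per finding (prompt + response), triaging 400 findings per sprint consumes approximately 600K tokens. At a flat rate of $15 per million tokens this is ~$9/sprint, cheaper than 20 minutes of analyst time. Output tokens are usually billed at a higher rate than input tokens and prices vary by model, so recompute against current pricing. |
| Prompt caching | Static triage instructions are cacheable, but only once the cached prefix reaches the model's minimum cacheable length; the one-line system prompt in the script is too short on its own. Moving the static parts of the triage prompt (application context, priority guide) into the cached system block lets every finding after the first read that prefix at the discounted cache-read rate. |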
| Dep classification accuracy | The dep-type classification is the most impactful input to the triage. Incorrect classification of a test package as runtime (or vice versa) significantly affects priority. Validate classification logic against your actual dependency tree before relying on it. |
| EPSS update lag | EPSS scores update daily. Scanner findings from early in the day may carry yesterday’s EPSS. For critical-severity findings, consider re-fetching EPSS after 24 hours. |
| LLM knowledge cutoff | The LLM’s training data may not include information about CVEs disclosed after its training cutoff. For very recent CVEs, the model will assess based on the CVE description rather than exploit knowledge. This is conservative but may understate risk for newly weaponised vulnerabilities. |
| Human oversight | The pipeline routes P1/P2 to issue trackers automatically. A human security engineer should confirm the LLM’s reasoning before marking any P1 finding as resolved. The pipeline reduces triage time; it does not replace human judgement on remediation decisions. |
| Semgrep rule quality variance | Semgrep rules vary considerably in precision. The LLM triage is more useful for community rules (which have higher false-positive rates) than for Semgrep Pro rules (which are typically higher precision). Consider applying more conservative prompting for community rules. |
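The cost row above is simple arithmetic, and it is worth recomputing whenever the model or the prompt changes. This is a small sketch with prices passed in as parameters; the figures in the usage note are illustrative placeholders, not quoted prices.

```python
def estimate_sprint_cost(
    findings: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated USD cost of triaging `findings` findings in one sprint.

    Token counts are per finding; prices are USD per million tokens.
    """
    input_cost = findings * input_tokens * input_price_per_mtok / 1_000_000
    output_cost = findings * output_tokens * output_price_per_mtok / 1_000_000
    return round(input_cost + output_cost, 2)
```

Treating all 1,500 tokens per finding as input at a placeholder $15 per million, `estimate_sprint_cost(400, 1500, 0, 15.0, 0.0)` gives 9.0, matching the table. Splitting the same budget into 1,200 input and 300 output tokens with a placeholder output price of $75 per million gives 16.2, which shows how much the input/output split matters.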
Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Mitigation |
|---|---|---|---|---|
| LLM misclassifies P1 as P3 (false negative) | Low-Medium | High — exploited window extends | False negative rate metric in Prometheus; post-incident review | Prometheus alert on false negative rate > 2%; monthly manual audit of a sample of P3 decisions |
| Dep-type classification error promotes test dep to runtime | Medium | Medium — unnecessary P1/P2 noise | P1/P2 tickets flagged as “test dep” in post-triage review | Validate classification logic against package-lock.json / Poetry lockfile rather than requirements.txt alone |
| EPSS API unavailable — all scores default to 0 | Low | Medium — EPSS weighting is lost, triage reverts to severity-only | Script should log EPSS fetch failure explicitly | Implement EPSS fallback: if API unavailable, default to 0.05 for CRITICAL, 0.01 for HIGH (conservative estimates) |
| LLM API rate limit during large batch | Medium | Low — triage delayed, not failed | Timeout exceptions logged | Implement exponential backoff with jitter; consider processing large batches during off-peak hours |
| CVE description too brief for meaningful LLM assessment | Medium | Medium — LLM defaults to conservative P2 unnecessarily | “insufficient information” in reasoning field | Pre-enrich short CVE descriptions from NVD/OSV before submission |
| Deduplication fingerprint collision (different vulns match same FP) | Very Low | High — one finding’s triage silently covers another | Audit dedup logic with synthetic test cases | Include CVSS vector in fingerprint for additional discrimination |
| GitHub issue creation fails silently | Low | High — P1 findings not escalated | Missing issue count vs expected count | Validate issue creation; send finding summary to Slack as fallback |
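The rate-limit row recommends exponential backoff with jitter. A minimal sketch of the full-jitter variant follows; `call_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retryable: tuple,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on `retryable` exceptions with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time in [0, min(max_delay, base * 2^attempt)]
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

In the triage loop this could wrap each call, for example `call_with_backoff(lambda: triage_finding(f), retryable=(anthropic.RateLimitError,))`. The random delay spreads retries out so that parallel workers do not all hit the API again at the same moment.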