AI-Assisted Vulnerability Discovery in CI/CD Pipelines

The Problem

Traditional SAST tools — Semgrep, CodeQL, Coverity — operate on pattern matching and predefined rules. They excel at finding known bug classes in known patterns: SQL injection, XSS, strcpy without bounds checking, hardcoded credentials matching known formats. What they cannot do is reason about code.

Reasoning is what makes the difference between detecting a strcpy and detecting that a specific call site is safe because a caller three frames up has already validated length. It is also what makes the difference between missing and finding a use-after-free where the free and the dereference are separated by six function calls across three files, or an integer overflow that only triggers when a 32-bit intermediate result wraps under specific combinations of caller-supplied values.

Anthropic’s Red Team, operating with systematic LLM-based code analysis across real codebases, produced more than 500 validated high-severity vulnerabilities — classes of findings that traditional SAST either does not model or cannot trace across the semantic distance required. Independently, Nicholas Carlini demonstrated AI-assisted discovery of a buffer overflow in the Linux kernel’s NFS driver that had been present and undetected for 23 years. The bug was not obscure: it was in a code path handling network-attacker-controlled input. Traditional SAST tools had been run against the Linux kernel for years. None flagged it.

The implication for CI/CD security is direct: the gap between “passed all SAST checks” and “contains no exploitable vulnerabilities” has never been wider. Engineers who treat a green Semgrep run as a security assurance signal are not wrong that the tool ran correctly. They are wrong about what that signal means.

What LLMs Find That Rule-Based Tools Miss

The vulnerability classes where AI analysis outperforms rule-based SAST are consistent across published research:

Logic bugs. A function that is semantically incorrect — returning success when it should fail, skipping a validation step under a specific branch — has no syntactic signature. It cannot be expressed as a pattern over tokens. LLMs read the function’s intent from variable names, comments, surrounding context, and caller conventions, then compare that intent against what the code actually does.

Integer overflows in complex arithmetic. The pattern a * b where both operands are 32-bit values is not inherently dangerous; it depends on whether the inputs are caller-controlled, whether the product can exceed 32 bits, and how the result is used. In C the multiplication is carried out at the operands' width (after the usual arithmetic conversions), so assigning the product to a 64-bit variable does not prevent the wrap unless an operand is widened first. Determining this requires tracing data provenance across call chains, a task that requires understanding the semantics of each intermediate function.
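A small simulation makes the wrap concrete. The sketch below models 32-bit unsigned multiplication in Python by masking, since Python integers do not overflow on their own. The function names and the count and size values are hypothetical, chosen only to illustrate a size computation that wraps.

```python
MASK32 = 0xFFFFFFFF          # keeps the low 32 bits, as uint32_t arithmetic would
MASK64 = 0xFFFFFFFFFFFFFFFF  # keeps the low 64 bits


def mul_u32(a: int, b: int) -> int:
    """Multiply as C would for two uint32_t operands: the product wraps at 32 bits."""
    return (a * b) & MASK32


def mul_u64(a: int, b: int) -> int:
    """Multiply after widening both operands to 64 bits first."""
    return (a * b) & MASK64


# Hypothetical caller-controlled values: an element count and an element size.
count, size = 0x10000001, 16

wrapped = mul_u32(count, size)  # what a 32-bit multiplication yields
widened = mul_u64(count, size)  # what the programmer intended

print(hex(wrapped))  # 0x10
print(hex(widened))  # 0x100000010
```

An allocation sized with the wrapped value would reserve 16 bytes while later code iterates over roughly 268 million elements. Whether that mismatch is reachable depends on who controls count, which is the cross-function question a rule cannot answer.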

Use-after-free in complex data flows. When a pointer is freed in one function and dereferenced in another, with callbacks, event loops, or concurrency primitives between them, the bug is a semantic relationship across the codebase. Pattern-matching on free() followed by pointer use cannot model this distance. An LLM reading the code as prose can.

Authentication and authorisation bypass. Authorisation logic bugs — checking the wrong identity, skipping a permission check on a specific code path, returning authorised for unauthenticated requests under an error condition — are logic bugs. They exist in the relationship between the caller’s intent and the callee’s behaviour.
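The "authorised under an error condition" case can be sketched in a few lines. Everything here is hypothetical (the exception, the session store, the function names); it is meant only to show why such a bug has no syntactic signature: the buggy and fixed versions differ by a single return value.

```python
from typing import Optional


class TokenStoreUnavailable(Exception):
    """Hypothetical error raised when the session backend cannot be reached."""


def lookup_user(token: str, store: dict, store_up: bool = True) -> Optional[str]:
    """Return the user bound to a token, or None if the token is unknown."""
    if not store_up:
        raise TokenStoreUnavailable("session backend unreachable")
    return store.get(token)


def is_authorised_buggy(token: str, store: dict, store_up: bool = True) -> bool:
    try:
        user = lookup_user(token, store, store_up)
    except TokenStoreUnavailable:
        # BUG: fails open. A backend outage authorises every request.
        return True
    return user is not None


def is_authorised_fixed(token: str, store: dict, store_up: bool = True) -> bool:
    try:
        user = lookup_user(token, store, store_up)
    except TokenStoreUnavailable:
        # Fail closed: no identity could be established, so deny.
        return False
    return user is not None


sessions = {"tok-123": "alice"}
print(is_authorised_buggy("forged", sessions, store_up=False))  # True
print(is_authorised_fixed("forged", sessions, store_up=False))  # False
```

Both functions are well-formed and free of dangerous calls. Spotting the difference requires knowing that the function's intent is to deny by default.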

The Assurance Gap in Production CI Pipelines

In most CI pipelines, SAST is the only automated security gate applied to first-party source code. Dependency scanning covers known CVEs in third-party libraries. Container scanning covers OS-level packages. SAST covers the first-party code. That gate has a model: it finds patterns. Code that does not match patterns passes. Until recently, that was approximately the best available. It no longer is, and attackers know it: the same LLM tooling that finds these vulnerabilities is accessible to offensive researchers.

Threat Model

The threat model here is a gap in security assurance, not a single adversary class.

Code that passes traditional SAST but contains AI-discoverable vulnerabilities. Attackers using LLM-based analysis can find vulnerabilities in production codebases that CI pipelines currently mark as clean. The attack surface includes any codebase with logic bugs, integer promotion issues, use-after-free patterns in complex call chains, or authentication bypass conditions — categories well-represented in security-relevant code.

Long-lived codebases with accumulated technical debt. Traditional SAST normalises the baseline over time: engineers suppress findings, grandfather unsafe patterns, and accept that the tool generates noise. A codebase that has run Semgrep for three years may have hundreds of suppressed rules. AI analysis re-evaluates code without inheriting this suppression history, surfacing risks that have been invisible for years.

New code introducing subtle logic bugs. Integer promotion bugs, where correct-looking arithmetic is semantically wrong under specific input conditions, and off-by-one errors at data structure boundaries are introduced continuously by developers whose code passes both review and SAST. They accumulate in the delta between what SAST checks and what is actually safe.

Third-party dependencies. Known CVEs in dependencies are tracked by SCA tools. Unknown vulnerabilities in dependency source code — particularly in libraries that handle untrusted input, parse binary formats, or manage memory explicitly — are not. Deep semantic analysis of dependency source is currently impractical for most teams but is the direction the field is moving.

Hardening Configuration

Step 1: Select the Right AI SAST Tool for Your Pipeline

Three distinct categories of AI SAST tooling have different integration profiles, false-positive rates, and cost structures:

(a) LLM-as-reviewer tools — Claude Code, GitHub Copilot code review, CodeRabbit. These review diffs at PR creation time and post review comments. Integration is minimal (a GitHub App or a CI step that calls an API). They operate on the changed diff rather than the full repository. False-positive rates depend heavily on the system prompt and model configuration. Latency is typically 30–90 seconds per PR. This is the category to start with for immediate CI integration.

(b) Purpose-built AI security scanners — Semgrep AI (semantic layer over Semgrep’s rule engine), Snyk DeepCode AI, JetBrains Qodana. These run on full repositories, maintain context across files, and are tuned specifically for vulnerability patterns. Higher setup cost, better coverage of cross-file bugs, and more predictable false-positive rates (typically published in vendor documentation).

(c) Custom LLM pipelines: calling the Anthropic or OpenAI API directly with changed files and a security-focused system prompt. Maximum control over the prompt, the scope, and the cost per analysis. Higher operational overhead. The correct choice for teams with specific vulnerability classes to target or proprietary code patterns that off-the-shelf tools do not model.

For most teams, start with category (a). The integration overhead is near-zero, the signal is immediate, and the baseline false-positive rate can be measured before committing to a more expensive tool.

Step 2: Integrate as a Non-Blocking Advisory Step First

Adding AI SAST directly as a blocking gate produces false-positive fatigue within days. Engineers learn to dismiss findings without reading them. The correct integration sequence is advisory first, blocking later.

Add a GitHub Actions step that calls an AI reviewer on changed files and posts findings as PR comments, but does not set a failing check status:

name: AI Security Review

on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write

jobs:
  ai-sast:
    name: AI-assisted security review
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
        with:
          fetch-depth: 0

      - name: Get changed files
        id: changed
        run: |
          # grep exits 1 when nothing matches, which would fail the step; "|| true" keeps it green
          git diff --name-only origin/${{ github.base_ref }}...HEAD \
            | grep -E '\.(c|cpp|go|py|js|ts|rs)$' \
            > changed_files.txt || true
          echo "count=$(wc -l < changed_files.txt)" >> "$GITHUB_OUTPUT"

      - name: Run AI security review
        if: steps.changed.outputs.count != '0'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python .github/scripts/ai_sast_review.py \
            --files changed_files.txt \
            --pr ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }} \
            --advisory-only

The --advisory-only flag posts findings as review comments without failing the check. After a 2–4 week baseline period, measure:

Total findings generated
Findings reviewed and marked false positive
Findings reviewed and confirmed as genuine

Only promote to blocking after the false-positive rate is measured and a triage process is in place.
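The three baseline counts reduce to two numbers worth tracking: the false-positive rate among reviewed findings, and how much of the output was reviewed at all. A minimal sketch follows, assuming the counts are collected by hand or from PR labels; the class name and the example figures are hypothetical, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageCounts:
    total_findings: int   # everything the reviewer posted during the baseline
    false_positives: int  # reviewed and dismissed
    confirmed: int        # reviewed and confirmed as genuine

    @property
    def reviewed(self) -> int:
        return self.false_positives + self.confirmed

    def false_positive_rate(self) -> Optional[float]:
        """Share of reviewed findings that were false positives (None if none reviewed)."""
        return self.false_positives / self.reviewed if self.reviewed else None

    def review_coverage(self) -> Optional[float]:
        """Share of all findings that received a human verdict."""
        return self.reviewed / self.total_findings if self.total_findings else None


# Made-up baseline numbers for illustration only.
baseline = TriageCounts(total_findings=40, false_positives=9, confirmed=21)
print(f"false-positive rate: {baseline.false_positive_rate():.0%}")  # 30%
print(f"review coverage: {baseline.review_coverage():.0%}")          # 75%
```

Low review coverage is itself a signal: a false-positive rate computed over a small reviewed fraction says little about the unreviewed remainder.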

The system prompt is the single most important configuration parameter. A generic code-review prompt generates style and performance suggestions that dilute security signal. Use a security-focused prompt:

SECURITY_REVIEW_PROMPT = """You are a security-focused code reviewer specialising in vulnerability discovery.

Review the provided code for security vulnerabilities only. Do not comment on style, performance, or non-security correctness issues.

Focus specifically on:
- Integer overflow and underflow conditions, including implicit type promotion
- Use-after-free and use-after-return patterns
- Buffer overflow and out-of-bounds access
- Authentication and authorisation bypass conditions
- Logic bugs where the function's behaviour diverges from its apparent intent
- Race conditions in concurrent code paths
- Injection vulnerabilities from untrusted input reaching sensitive sinks

For each finding:
1. State the vulnerability class
2. Identify the exact line(s) affected
3. Describe the conditions under which it is exploitable
4. Explain why it is a security issue, not a correctness issue

If no security vulnerabilities are present, respond with: NO_FINDINGS

Do not produce findings for suppressed or clearly intentional patterns."""
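Two pieces of glue around the prompt are easy to get wrong: giving the model line numbers it can cite, and treating the NO_FINDINGS sentinel as "nothing to post". The sketch below shows one way to do both. The helper names are hypothetical, and the actual API call is deliberately left out so that nothing here depends on a particular client library.

```python
from pathlib import Path

NO_FINDINGS = "NO_FINDINGS"  # sentinel requested by the system prompt


def build_user_message(paths: list[str]) -> str:
    """Concatenate changed files into one message, prefixing each line with its
    number so the model can identify exact lines in its findings."""
    parts = []
    for p in paths:
        text = Path(p).read_text(encoding="utf-8", errors="replace")
        numbered = "\n".join(
            f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1)
        )
        parts.append(f"=== FILE: {p} ===\n{numbered}")
    return "\n\n".join(parts)


def has_findings(response_text: str) -> bool:
    """True unless the model replied with the bare sentinel (whitespace ignored)."""
    return response_text.strip() != NO_FINDINGS
```

The string returned by build_user_message would be sent as the user message alongside the system prompt, and a comment would be posted only when has_findings returns True for the reply.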

Step 3: Focus AI Review on High-Risk Code Paths

AI SAST API costs scale with tokens processed. Running full-repository analysis on every PR is expensive and generates findings in low-risk code (configuration files, documentation generators, test utilities) that dilute signal from high-risk paths.

Configure the CI step to apply AI review only to files matching high-risk path patterns:

- name: Filter to high-risk files
  id: filter
  run: |
    # grep treats newlines inside a pattern as separators between patterns,
    # so the alternation is kept on a single line.
    HIGH_RISK='auth|authn|authz|crypto|cipher|sign|verify|parse|decode|unmarshal|deserialise|malloc|free|alloc|realloc|unsafe|cgo|ffi|ipc|socket|rpc|grpc'
    git diff --name-only origin/${{ github.base_ref }}...HEAD \
      | grep -E "($HIGH_RISK)" \
      | grep -E '\.(c|cpp|go|py|js|ts|rs)$' \
      > high_risk_files.txt || true
    echo "count=$(wc -l < high_risk_files.txt)" >> "$GITHUB_OUTPUT"

The highest-value targets for AI SAST are:

Untrusted input parsing (network protocol handlers, file format parsers, API request deserialisers)
Memory management in C, C++, or Rust unsafe blocks
Authentication and authorisation logic
Cryptographic key handling and signature verification
Inter-process communication (pipes, shared memory, Unix sockets, RPC handlers)

Code outside these paths has a much lower expected finding rate. Running AI analysis only on changed files in these categories reduces cost by 60–80% on typical codebases without materially reducing finding rate.
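Teams that prefer to do the filtering inside the review script rather than in shell could use something like the sketch below. Note that it is a variant, not a port of the grep filter: it matches keywords as prefixes of path tokens instead of as substrings, so a path containing "design" no longer matches "sign". The keyword list is copied from the filter above; the function name is hypothetical.

```python
import re

HIGH_RISK_PREFIXES = (
    "auth", "authn", "authz", "crypto", "cipher", "sign", "verify",
    "parse", "decode", "unmarshal", "deserialise",
    "malloc", "free", "alloc", "realloc",
    "unsafe", "cgo", "ffi", "ipc", "socket", "rpc", "grpc",
)
SOURCE_EXTENSIONS = (".c", ".cpp", ".go", ".py", ".js", ".ts", ".rs")


def is_high_risk(path: str) -> bool:
    """True if the file is a source file and some path token starts with a keyword."""
    if not path.endswith(SOURCE_EXTENSIONS):
        return False
    # Split on anything that is not a letter or digit: "/", "_", "-", ".".
    tokens = re.split(r"[^a-z0-9]+", path.lower())
    return any(token.startswith(HIGH_RISK_PREFIXES) for token in tokens)


print(is_high_risk("src/auth/session.py"))   # True
print(is_high_risk("pkg/parser/json.go"))    # True
print(is_high_risk("docs/design_notes.py"))  # False
```

The trade-off is that prefix matching misses keywords buried inside a token (for example a camelCase name such as userAuth.ts), which the substring filter would catch. Which error is cheaper depends on the repository's naming conventions.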

Step 4: Manage False Positives with a Findings Database

Without suppression infrastructure, the same false positive is flagged on every PR that touches the same file. Engineers add the file to an ignore list. Signal degrades.

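Store findings in a YAML suppression file committed to the repository. Key each entry by a hash of the file path, line range, and finding class, which is specific enough not to suppress unrelated findings in the same file. Because the line range is part of the key, an edit that shifts the flagged lines changes the hash, and the finding resurfaces for re-review on code that has genuinely changed: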

suppressions:
  - id: "sha256:a3f8c2..."
    file: "src/net/nfs/xdr.c"
    line_range: [142, 158]
    finding_class: "integer-overflow"
    reviewed_by: "security-team"
    reviewed_at: "2026-04-15"
    disposition: "false-positive"
    rationale: "Both operands are widened to uint64_t before the multiplication; no 32-bit wrap is possible at this call site."

  - id: "sha256:9d1e44..."
    file: "src/auth/session.py"
    line_range: [89, 102]
    finding_class: "logic-bug"
    reviewed_by: "alice@example.com"
    reviewed_at: "2026-04-20"
    disposition: "accepted-risk"
    rationale: "Pattern is intentional; reviewed against threat model in SEC-2291."

The CI script loads this file before posting findings. Any finding whose hash matches a suppressed entry is silently dropped. Only new or changed findings surface as PR comments. This eliminates the primary driver of tool abandonment: engineers being repeatedly shown findings they have already reviewed.

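import hashlib
import yaml  # PyYAML

def finding_hash(file_path: str, line_start: int, line_end: int, finding_class: str) -> str:
    # Key on path, line range, and finding class; must match the "id" stored in the suppression file.
    content = f"{file_path}:{line_start}-{line_end}:{finding_class}"
    # Use the full digest: a short truncated prefix risks collisions between unrelated findings.
    return f"sha256:{hashlib.sha256(content.encode()).hexdigest()}"

def load_suppressions(suppression_file: str) -> set[str]:
    with open(suppression_file) as f:
        # safe_load returns None for an empty file, so fall back to an empty mapping.
        data = yaml.safe_load(f) or {}
    return {entry["id"] for entry in data.get("suppressions", [])}

def filter_findings(findings: list[dict], suppressions: set[str]) -> list[dict]:
    # Keep only findings whose hash has not already been triaged.
    return [
        f for f in findings
        if finding_hash(f["file"], f["line_start"], f["line_end"], f["class"])
           not in suppressions
    ]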

Step 5: Validate AI Findings Before Blocking Merges

Published false-positive rates for AI SAST tools range from 20% to 40% depending on codebase complexity, language, and system prompt quality. Blocking merges on unreviewed AI findings introduces a veto that a false positive can trigger — which trains engineers to dismiss the findings or route around the check.

The correct gate structure: AI findings are advisory until reviewed by a security team member. A finding that a human has reviewed and confirmed as genuine becomes a blocking status check. Define a triage SLA — 48 hours is workable for most security teams — and track it:

name: Check AI SAST review status

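on:
  pull_request:
    # "unlabeled" re-runs the check when a blocking label is removed after triage
    types: [opened, synchronize, labeled, unlabeled]

jobs:
  check-sast-status:
    runs-on: ubuntu-latest
    steps:
      - name: Check for unreviewed blocking findings
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # No checkout step runs in this job, so gh needs the repository named explicitly
          GH_REPO: ${{ github.repository }}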
        run: |
          BLOCKING=$(gh pr view ${{ github.event.pull_request.number }} \
            --json labels \
            --jq '.labels[].name' \
            | grep -c "ai-sast:blocking-finding" || true)

          REVIEWED=$(gh pr view ${{ github.event.pull_request.number }} \
            --json labels \
            --jq '.labels[].name' \
            | grep -c "ai-sast:security-reviewed" || true)

          if [ "$BLOCKING" -gt 0 ] && [ "$REVIEWED" -eq 0 ]; then
            echo "Blocking AI SAST finding awaiting security team review."
            echo "SLA: 48 hours from finding posted."
            exit 1
          fi

The workflow: AI review posts a finding as a PR comment. A bot applies the ai-sast:blocking-finding label. A security engineer reviews the finding within 48 hours. If genuine, the engineer applies ai-sast:security-reviewed and opens a separate remediation issue. If false positive, the engineer adds the finding to the suppression file and removes the blocking label.
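Tracking the SLA needs only two timestamps per finding: when it was posted and when it was reviewed. A minimal sketch follows, assuming both are available as timezone-aware datetimes (for example from PR comment and label events). The status names and the function are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TRIAGE_SLA = timedelta(hours=48)


def sla_status(posted_at: datetime, now: datetime,
               reviewed_at: Optional[datetime] = None) -> str:
    """Classify a finding against the triage SLA.

    Reviewed findings are "met" or "breached"; unreviewed ones are
    "pending" while inside the window and "overdue" once past it.
    """
    deadline = posted_at + TRIAGE_SLA
    if reviewed_at is not None:
        return "met" if reviewed_at <= deadline else "breached"
    return "pending" if now <= deadline else "overdue"


posted = datetime(2026, 4, 15, 9, 0, tzinfo=timezone.utc)
print(sla_status(posted, now=posted + timedelta(hours=24)))  # pending
print(sla_status(posted, now=posted + timedelta(hours=72)))  # overdue
print(sla_status(posted, now=posted + timedelta(hours=72),
                 reviewed_at=posted + timedelta(hours=30)))   # met
```

Counting "overdue" and "breached" findings per week gives an early warning that findings are piling up faster than they are being triaged.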

Expected Behaviour After Hardening

After advisory integration, AI SAST runs on every PR and posts findings as review comments within 3 minutes for typical PR sizes (under 500 changed lines). The check is visible in the PR status list but does not gate merge.

After the 2–4 week baseline period, a functioning deployment typically shows:

False-positive rate of 20–30% (tool-dependent; CodeRabbit and Semgrep AI run toward the lower end; custom prompts against general models run higher without prompt tuning)
Genuine finding rate of 2–5 per 100 PRs on active codebases with mixed security-sensitive and business-logic code
Finding rate concentrated in auth, parsing, and memory management files — consistent with the expected distribution

After the findings database is in place, the noise reduction is significant: repeat false positives that previously appeared on every PR touching a given file are suppressed automatically. Engineers see only new findings or findings on genuinely changed code paths.

Trade-offs and Operational Considerations

LLM API costs for AI SAST scale with the volume of code sent for analysis. Current pricing (2026) puts typical PR analysis at $0.01–0.05 per PR using Claude Sonnet or GPT-4o: roughly $1–5 per month at 100 PRs, $100–500 at 10,000. To contain spend, restrict the CI step to PRs targeting protected branches (main, release/*), or implement a cost cap that pauses AI review for the remainder of the billing period once a spending threshold is exceeded.
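A cost cap can be as simple as a running total checked before each review. The sketch below works in integer cents to avoid floating-point drift. The per-PR range comes from the figures above; the function names and the idea of persisting the running total somewhere the CI job can read it are assumptions.

```python
def monthly_cost_range(prs_per_month: int,
                       low_cents: int = 1,
                       high_cents: int = 5) -> tuple[float, float]:
    """Estimated monthly spend in dollars at the quoted per-PR range."""
    return prs_per_month * low_cents / 100, prs_per_month * high_cents / 100


def review_allowed(spent_cents: int, cap_cents: int) -> bool:
    """Budget guard: skip AI review once the period's cap has been reached."""
    return spent_cents < cap_cents


print(monthly_cost_range(100))     # (1.0, 5.0)
print(monthly_cost_range(10_000))  # (100.0, 500.0)
print(review_allowed(spent_cents=19_950, cap_cents=20_000))  # True
print(review_allowed(spent_cents=20_000, cap_cents=20_000))  # False
```

When the guard returns False, the step should exit successfully with a notice rather than fail, so that an exhausted budget does not block merges.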

AI-generated findings are explanatory. They state the vulnerability class, describe the conditions under which it is exploitable, and explain the semantic reason it is a bug — not a rule ID that requires consulting a database. This improves developer trust in findings compared with cryptic SAST output like CWE-190 (integer-overflow) with a line number. It also increases the review burden for security teams, because each finding is a paragraph rather than a flag. Budget accordingly when setting the review SLA.

AI SAST does not replace traditional SAST. Both should run. Traditional tools find known patterns reliably, quickly, and cheaply — strcpy without bounds checking, SQL string concatenation, eval() on user input. AI tools find the semantic bugs that traditional tools cannot model. The combination provides coverage that neither provides alone.

Failure Modes

AI SAST integrated but findings routed to a low-priority queue. The integration looks functional — findings are generated, the CI step completes — but the findings land in a queue that no one owns. They accumulate. After a month, an engineer notices the queue has 200 unreviewed items and archives it. The tool runs but has no effect on security posture.

System prompt not security-focused. A generic code review prompt — “review this code for issues” — produces suggestions about variable naming, missing error handling, and performance characteristics. Security findings are present but diluted. The finding rate appears high; the genuine security finding rate is low. Engineers dismiss findings as noise.

False-positive rate not measured. The tool runs for six weeks. A developer complains that AI SAST flagged an obvious false positive for the third time on the same function. The security team has no data on false-positive rate because they never tracked it. Without measurement, there is no basis for tuning the prompt, adjusting the scope, or defending the tool’s value. Teams that do not measure tend to disable.

AI SAST step given access to CI secrets to call the LLM API. The CI step that calls the Anthropic or OpenAI API must have the API key available. If this step runs on a self-hosted runner that also handles production secrets, the AI SAST step becomes a supply chain risk: a compromised LLM client library or a malicious PR that modifies the AI SAST script could exfiltrate the API key and any other secrets available in the environment. Scope the AI SAST step to a dedicated runner with no production access. Store the LLM API key in a dedicated CI secret used only by this step.