AI-Generated Code and Open Source License Compliance: The Copilot Copyright Problem

The Problem

Traditional licence compliance tooling has one job: scan declared dependencies, identify their licences, and block the build if anything copyleft slips through. This model assumes that every piece of code in your product either came from a package manager with a lockfile or was written directly by a human engineer. AI coding assistants have broken that assumption in two distinct ways — a legal problem and a supply-chain visibility problem — and neither is addressed by existing SBOM tooling.

GitHub Copilot is trained on public GitHub repositories. Those repositories include code licensed under GPL-2.0, GPL-3.0, LGPL-2.1, LGPL-3.0, AGPL-3.0, MPL-2.0, EUPL, and a long tail of other copyleft and non-permissive licences. When Copilot generates code, it is statistically sampling from a distribution shaped by that training corpus. The legal question (does output from a model trained on copyleft code inherit the licence of the training data?) is open. The practical risks are already concrete.

The lawsuit. In November 2022, a class action (Doe v. GitHub, Inc., N.D. Cal. 3:22-cv-06823) was filed alleging that Copilot reproduces GPL-licensed code without attribution and in violation of the terms under which it was licensed. The case was substantially narrowed in 2023 and 2024: most claims, including the DMCA §1202 claims, were dismissed, while the breach-of-licence claims survived, and the appeals process is ongoing as of mid-2026. GitHub’s position is that training on public code and generating transformative output constitutes fair use. Copyright holders in the affected repositories disagree, and no court has yet ruled definitively on the merits. The legal risk is unresolved, not resolved in Microsoft’s favour.

The memorisation problem. Independent of the lawsuit, LLMs demonstrably memorise portions of their training data. Carlini et al. (2021, “Extracting Training Data from Large Language Models”) showed that GPT-2 could be prompted to reproduce verbatim text from its training corpus, including code. Their 2023 follow-up extended this to larger models and showed that memorisation scales with model size and training data repetition. Copilot has been publicly observed reproducing recognisable fragments from well-known GPL repositories — including unique identifier names, variable conventions, and comment strings that appear in a single upstream source. GitHub’s “public code filter” (block suggestions matching public repository code) is a mitigation, not a guarantee; it operates on exact or near-exact substring matching and does not detect structural reproduction of an algorithm’s logic.

The copyleft infection mechanism. GPL-2.0 §2(b) requires that any distributed work that “in whole or in part contains or is derived from the Program” be licensed as a whole under the GPL; GPL-3.0 §5(c) imposes the equivalent requirement on any conveyed “work based on the Program”. If a Copilot-generated function in your commercial product reproduces a GPL-licensed function closely enough to constitute a derivative work, the product ships a GPL obligation. The entire work may then be subject to GPL’s disclosure requirements: you must offer the complete corresponding source code to anyone who receives the binary. AGPL-3.0 §13 extends the trigger condition to network use: if you run a modified version of AGPL code as a service that users interact with remotely, you must offer those users the corresponding source, with no binary distribution required. LGPL is more nuanced: static linking creates derivative-work obligations; dynamic linking with an unmodified library generally does not. An AI-generated function that inlines logic from an LGPL library collapses this distinction: there is no separate library to dynamically link, because the code is now part of your codebase.

The practical exposure scenario: a senior engineer uses Copilot to implement a base64url-encoded JWT parser for a proprietary SaaS product. Copilot generates a function that closely resembles the core decoding logic in PyJWT (MIT) or in a JWT library with a more restrictive licence. The code passes review (the reviewer is checking logic, not copyright provenance) and ships. Six months later, during an acquisition, the acquirer’s IP counsel identifies the similarity. The dispute is now between two companies with significant legal fees and deal risk attached.

The SBOM Gap Problem

Traditional SBOM tooling (syft, trivy SBOM mode, cdxgen, spdx-sbom-generator) works by parsing package manifests — package-lock.json, go.sum, requirements.txt, Cargo.lock, pom.xml — and attributing licence information to each named package. The dependency must be declared somewhere for the tool to find it. This is the gap AI-generated code creates.

When an LLM generates code that inlines functionality previously provided by a declared dependency, the result is invisible to manifest-based SBOM generation:

# ── Before: SBOM-visible ──────────────────────────────────────────────────────
# requirements.txt declares: PyJWT==2.8.0 (MIT licence)
# syft finds it, declares it, CVE feed covers CVE-2022-29217, etc.
import jwt
payload = jwt.decode(token, public_key, algorithms=["RS256"])

# ── After: SBOM-invisible ─────────────────────────────────────────────────────
# requirements.txt: PyJWT removed (saves a dependency, "cleaner")
# syft sees nothing. No licence. No CVE feed.
# May reproduce logic from PyJWT, python-jose, or authlib.
import base64, hmac, hashlib, json
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def decode_jwt(token: str, public_key) -> dict:
    """AI-generated JWT decoder — no declared dependency, unknown provenance."""
    try:
        header_b64, payload_b64, signature_b64 = token.split('.')
        # pad base64url segments
        header_b64 += '=' * (-len(header_b64) % 4)
        payload_b64 += '=' * (-len(payload_b64) % 4)
        signing_input = f"{token.split('.')[0]}.{token.split('.')[1]}".encode()
        sig = base64.urlsafe_b64decode(signature_b64 + '==')
        public_key.verify(sig, signing_input, padding.PKCS1v15(), hashes.SHA256())
        return json.loads(base64.urlsafe_b64decode(payload_b64))
    except Exception:
        raise ValueError("Invalid token")

The problems with the AI-generated version stack:

  1. No licence declaration. The code has no SPDX identifier, no copyright notice, no attribution. Its licence status is unknown.
  2. No CVE feed. If this implementation has a timing side-channel or signature-bypass bug, no CVE will ever cover it — it is not a named package with a version. The equivalent bug in PyJWT (CVE-2022-29217: algorithm confusion) triggered alerts in every SBOM-aware pipeline. This implementation would not.
  3. Unknown provenance. The implementation may structurally reproduce logic from one or more upstream sources. No static analysis tool will surface this without explicit similarity checking.
  4. No upstream maintainer. When a novel attack against JWT RS256 implementations is published, your security team must discover and remediate this manually — no Dependabot PR will arrive.

The scale of this problem grows with AI adoption. Teams that use Copilot or Claude Code heavily often reduce declared dependency counts as AI inlines utility functions. The SBOM gets smaller and cleaner. The actual risk surface grows and becomes invisible.
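
One way to make that shrinkage visible is to compare the declared dependencies before and after a change and route any removal to review. The following is a minimal sketch, assuming requirements.txt-style manifests; `parse_requirements` and `removed_dependencies` are hypothetical helper names, not part of any tool mentioned here.

```python
import re


def parse_requirements(text: str) -> set[str]:
    """Return normalised package names from requirements.txt-style text."""
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r / -e
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # PEP 503 style normalisation: lowercase, runs of - _ . become "-"
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def removed_dependencies(base_text: str, head_text: str) -> set[str]:
    """Packages declared on the base branch but absent on the head branch."""
    return parse_requirements(base_text) - parse_requirements(head_text)


if __name__ == "__main__":
    base = "PyJWT==2.8.0\nrequests>=2.31\n"
    head = "requests>=2.31\n"
    print(sorted(removed_dependencies(base, head)))
```

A removed package is not proof of inlined code, only a prompt to ask where the functionality went.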

Threat Model

  • GPL-adjacent AI output in a commercial product. Copyright holder of a GPL library whose logic was reproduced identifies the similarity. Demands source disclosure or licence fee. The claim is brought against you, not Microsoft or Anthropic: responsibility for how generated output is used sits with the customer, and any indemnity a vendor offers is conditional (see Failure Modes).
  • AI-inlined vulnerable logic. CVE published against the library whose logic was reproduced. Your SBOM does not reference the library; your vulnerability management programme does not flag it. The vulnerability ships.
  • SBOM audit failure. Enterprise customer contract requires a complete CycloneDX SBOM as a condition of procurement. Auditor identifies AI-generated components not represented in the SBOM. Contract compliance failure.
  • Acquisition due diligence. Acquirer’s IP counsel runs FOSSA or Black Duck against the codebase. Identifies high-similarity matches between product code and GPL repositories. Escrow dispute; deal price reduction; representations-and-warranties exposure.
  • Proprietary algorithm reproduction. AI trained on code that leaked to GitHub (from private repositories, through code intelligence tools, through employee device compromise) reproduces proprietary algorithms. Trade secret misappropriation claim.

Hardening Configuration

1. Enforce Copilot’s Public Code Filter at Organisation Level

The public code filter is a necessary baseline — it blocks suggestions that exactly or near-exactly match public repository code. It does not block structural reproduction of an algorithm, but it reduces verbatim copying.

# For GitHub Copilot Business/Enterprise, set "Suggestions matching public code"
# to "Blocked" in the organisation's Copilot policy settings.
# Members whose seat is assigned by the organisation cannot override this policy.
# The REST endpoint below is read-only; use it to verify the setting.
gh api orgs/{org}/copilot/billing \
  -H "Accept: application/vnd.github+json" \
  --jq '{public_code_suggestions: .public_code_suggestions, seat_management_setting: .seat_management_setting}'
# Expected output (seat_management_setting depends on how seats are assigned):
# {
#   "public_code_suggestions": "block",
#   "seat_management_setting": "assign_selected"
# }

For Cursor, Claude Code, and other tools that do not provide a built-in filter, enforce the constraint through a system prompt or project-level instruction file:

<!-- .cursorrules or .claude/CLAUDE.md -->
## Code Generation Policy

- Do not reproduce code verbatim or near-verbatim from any identifiable open source repository.
- Generate original implementations using standard library primitives where possible.
- Do not inline the functionality of named open source packages to avoid declaring them as dependencies.
- If implementing cryptographic, authentication, or authorisation logic, use only declared, well-known libraries.
- When uncertain whether a pattern comes from a specific upstream source, state the uncertainty in a comment.

This is process, not enforcement — but it creates an audit trail when violations occur.

2. Licence Scanning with FOSSA and Scancode in CI

FOSSA performs dependency-based licence scanning and also runs snippet matching against its corpus of known open source code. Scancode performs similar analysis locally. Neither will catch all AI-generated reproduction, but both catch large classes of verbatim and near-verbatim matches.

# .github/workflows/license-compliance.yml
name: License Compliance
on:
  push:
    branches: [main]
  pull_request:

permissions:
  contents: read

jobs:
  fossa-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1

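      - name: FOSSA analyse
        uses: fossas/fossa-action@09da0cc88e39e6d4f5bc99f6b70d8b2b6f40e272 # v1.4.0
        with:
          api-key: ${{ secrets.FOSSA_API_KEY }}

      - name: FOSSA test
        uses: fossas/fossa-action@09da0cc88e39e6d4f5bc99f6b70d8b2b6f40e272 # v1.4.0
        with:
          api-key: ${{ secrets.FOSSA_API_KEY }}
          # run-tests makes the action run `fossa test` instead of `fossa analyze`;
          # the step fails if FOSSA reports policy violations for this revision
          run-tests: true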

  scancode-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1

      - name: Install ScanCode
        run: pip install scancode-toolkit==32.3.0

      - name: Run ScanCode
        run: |
          scancode \
            --license \
            --copyright \
            --info \
            --processes 4 \
            --json-pp scancode-results.json \
            src/

      - name: Check for copyleft licences
        run: |
          # Extract any GPL/LGPL/AGPL/MPL/EUPL/CDDL licence detections in non-test source files.
          # The \b word boundary avoids false positives such as "simpl-2.0" matching "mpl".
          VIOLATIONS=$(jq -r '
            .files[] |
            select(.type == "file") |
            select(.path | test("test_|_test\\.py$|spec\\.") | not) |
            select(.license_detections | length > 0) |
            .license_detections[].license_expression |
            select(test("\\b(gpl|lgpl|agpl|mpl|eupl|cddl)"; "i"))
          ' scancode-results.json)

          if [ -n "$VIOLATIONS" ]; then
            echo "::error::Copyleft licence detected in non-test source files:"
            echo "$VIOLATIONS"
            exit 1
          fi

      - name: Upload ScanCode results
        uses: actions/upload-artifact@5d5d22a31266ced268874388b861e4b58bb5c2f3 # v4.3.1
        with:
          name: scancode-results
          path: scancode-results.json
          retention-days: 90

FOSSA’s snippet analysis runs against its own index of open source code, so it can match fragments that do not appear in any declared dependency. A FOSSA alert that reports a GPL-2.0-only snippet match in src/auth/jwt.py, in a repository whose requirements.txt declares no corresponding package, is the signal that AI-generated reproduction may have occurred.
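
The copyleft check in the ScanCode job filters licence expressions with a regular expression. The same idea can be sketched as a small, testable function that splits an expression into licence keys and checks each key's prefix. This is an illustration only: `copyleft_keys` and the prefix list are assumptions chosen to mirror the licence families named in this guide, not a complete copyleft catalogue.

```python
import re

# Licence-key prefixes treated as copyleft in this sketch (an assumption).
COPYLEFT_PREFIXES = ("gpl", "lgpl", "agpl", "mpl", "eupl", "cddl")
OPERATORS = {"and", "or", "with"}


def copyleft_keys(expression: str) -> list[str]:
    """Return the licence keys in an expression that start with a copyleft prefix."""
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    return [
        token for token in tokens
        if token.lower() not in OPERATORS
        and token.lower().startswith(COPYLEFT_PREFIXES)
    ]


if __name__ == "__main__":
    print(copyleft_keys("mit AND (gpl-2.0-plus OR apache-2.0)"))
```

Matching on whole keys rather than substrings keeps unrelated keys that merely contain a copyleft abbreviation out of the result.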

3. Code Similarity Check Against Known Copyleft Repositories

For teams with a high volume of AI-generated code, adding a structural similarity check directly catches algorithm reproduction that FOSSA’s snippet index may not cover for newer or less common upstream sources.

# Install ast-grep for structural code pattern matching
# https://ast-grep.github.io/
cargo install ast-grep --locked

# Create a catalogue of code patterns associated with commonly reproduced libraries.
# This example targets JWT/crypto patterns.
cat > gpl-patterns.yml << 'EOF'
id: base64url-decode-pattern
language: python
rule:
  any:
    - pattern: |
        $VAR += '=' * (-len($VAR) % 4)
    - pattern: |
        base64.urlsafe_b64decode($_ + '==')
message: "Possible reproduction of base64url decode pattern from known JWT libraries. Review for licence provenance."
severity: warning

---
id: hmac-sha256-jwt-verify
language: python
rule:
  all:
    - pattern: hmac.new($_, $_, hashlib.sha256)
    # Relational rule: the call sits in a function that also decodes base64url data
    - inside:
        kind: function_definition
        stopBy: end
        has:
          pattern: base64.urlsafe_b64decode($_)
          stopBy: end
message: "JWT HMAC verification pattern. Check for reproduction of PyJWT or python-jose logic."
severity: warning
EOF

# --rule scans with a single rule file; --config expects a project-level sgconfig.yml
ast-grep scan --rule gpl-patterns.yml --json src/ | \
  jq '.[] | {file: .file, line: .range.start.line, message: .message}'

This is complementary to FOSSA, not a replacement. FOSSA matches on textual content; ast-grep matches on abstract syntax structure. The two together cover a wider surface.
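
The difference between textual and structural matching can be illustrated with Python's standard `ast` module. The sketch below is not how FOSSA or ast-grep work internally; it only shows that two functions with different identifiers can share one syntax-tree shape. `structural_fingerprint` and the sample snippets are hypothetical.

```python
import ast


class _Normalise(ast.NodeTransformer):
    """Replace identifiers with a placeholder so only the tree shape remains."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # normalise arguments and body first
        node.name = "_"
        node.returns = None
        return node


def structural_fingerprint(source: str) -> str:
    """Dump the identifier-normalised AST of a source snippet as a string."""
    tree = _Normalise().visit(ast.parse(source))
    return ast.dump(tree, annotate_fields=False)


UPSTREAM = "def pad(segment):\n    return segment + '=' * (-len(segment) % 4)\n"
RENAMED = "def fix_padding(s):\n    return s + '=' * (-len(s) % 4)\n"
DIFFERENT = "def pad(segment):\n    return segment.rstrip('=')\n"

if __name__ == "__main__":
    print(UPSTREAM == RENAMED)
    print(structural_fingerprint(UPSTREAM) == structural_fingerprint(RENAMED))
```

A text comparison treats `UPSTREAM` and `RENAMED` as different, while their fingerprints are identical; `DIFFERENT` produces a distinct fingerprint. Structural equality is evidence for review, not proof of copying, since short idioms like this one are written independently all the time.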

4. AI Attribution Policy and Commit Metadata

Without attribution, there is no way to identify AI-generated code in a future audit. The attribution policy must be enforced at commit time, not as a guideline engineers may or may not follow.

# .github/CONTRIBUTING.md — AI code attribution policy excerpt
# (Also enforce via pre-commit hook below)

## AI-Generated Code Policy

All AI-generated code must be:
1. Reviewed by a human engineer before merging. AI output is not reviewed code.
2. Attributed in the commit message with a Co-authored-by trailer:
   Co-authored-by: GitHub Copilot <copilot@github.com>
   Co-authored-by: Claude Code <claude-code@anthropic.com>
3. Checked for verbatim reproduction via the licence-compliance CI job before merging.
4. Represented in the SBOM with an AI-origin annotation if it inlines functionality
   previously provided by a named dependency.

AI-generated code MUST NOT:
- Replace a declared dependency without security team approval and SBOM update.
- Implement cryptographic primitives, signature verification, or token parsing.
  Use audited libraries (PyJWT, python-jose, authlib) with pinned versions instead.
- Be committed to main without passing the license-compliance CI job.

Enforce attribution with a commit-msg hook so engineers cannot accidentally omit it:

#!/bin/bash
# .git/hooks/commit-msg (or via the pre-commit framework at the commit-msg stage)
COMMIT_MSG_FILE="$1"

# Detect if session was AI-assisted via environment variable set by IDE plugin
if [ "${AI_ASSISTED_SESSION:-0}" = "1" ]; then
  # The alternation is grouped so that every tool name must appear on a
  # Co-authored-by line; an ungrouped pattern would accept "claude" anywhere.
  if ! grep -qiE '^Co-authored-by:.*(copilot|claude|cursor)' "$COMMIT_MSG_FILE"; then
    echo "ERROR: AI_ASSISTED_SESSION is set but commit message lacks Co-authored-by trailer."
    echo "Add one of:"
    echo "  Co-authored-by: GitHub Copilot <copilot@github.com>"
    echo "  Co-authored-by: Claude Code <claude-code@anthropic.com>"
    exit 1
  fi
fi
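
The trailer check is easy to get subtly wrong, so it helps to have a version that can be unit-tested outside the hook. A minimal sketch, assuming the same three tool names; `has_ai_trailer` is a hypothetical helper, not part of git or any assistant's tooling.

```python
import re

# A tool name must appear on a line that starts with "Co-authored-by:".
AI_TRAILER = re.compile(
    r"^Co-authored-by:.*\b(copilot|claude|cursor)\b",
    re.IGNORECASE | re.MULTILINE,
)


def has_ai_trailer(message: str) -> bool:
    """True if the commit message carries an AI co-author trailer."""
    return AI_TRAILER.search(message) is not None


if __name__ == "__main__":
    attributed = (
        "Add JWT parser\n\n"
        "Co-authored-by: GitHub Copilot <copilot@github.com>\n"
    )
    unattributed = "Fix cursor position bug\n\nDiscussed with Claude from the QA team.\n"
    print(has_ai_trailer(attributed), has_ai_trailer(unattributed))
```

The second message mentions both "cursor" and "Claude" in ordinary prose and is correctly not treated as attributed.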

5. SBOM Generation That Captures AI-Generated Components

Syft and cdxgen generate SBOMs from package manifests. AI-generated inline code will not appear in those SBOMs. Annotate the SBOM with AI-origin components as a post-processing step, driven by the presence of AI attribution in commit metadata.

#!/usr/bin/env python3
# scripts/annotate-sbom-ai-components.py
# Run after syft generates the base SBOM.
# Reads git log for AI-attributed commits, identifies modified files,
# and adds AI-origin component entries to the SBOM.

import json
import re
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

SBOM_PATH = Path("sbom.spdx.json")
SOURCE_SUFFIXES = (".py", ".go", ".ts", ".js", ".rs", ".java", ".rb")

def get_ai_attributed_files() -> list[dict]:
    """Return list of files modified in commits with AI co-author trailers."""
    log = subprocess.run(
        [
            "git", "log", "--format=%H", "--diff-filter=AM",
            # Multiple --grep patterns are ORed; matching is case-insensitive
            "--regexp-ignore-case",
            "--grep=Co-authored-by.*copilot",
            "--grep=Co-authored-by.*claude",
            "--grep=Co-authored-by.*cursor",
        ],
        capture_output=True, text=True, check=True
    ).stdout.strip()

    results = []
    for sha in log.splitlines():
        if not sha:
            continue
        files = subprocess.run(
            # --root makes the initial commit list its files as well
            ["git", "diff-tree", "--root", "--no-commit-id", "-r", "--name-only", sha],
            capture_output=True, text=True, check=True
        ).stdout.strip().splitlines()
        for f in files:
            if f.endswith(SOURCE_SUFFIXES):
                results.append({"commit": sha, "file": f})
    return results

def main():
    if not SBOM_PATH.exists():
        print(f"SBOM not found at {SBOM_PATH}. Run syft first.", file=sys.stderr)
        sys.exit(1)

    with open(SBOM_PATH) as f:
        sbom = json.load(f)

    ai_files = get_ai_attributed_files()
    if not ai_files:
        print("No AI-attributed files found in git history.")
        return

    existing_ids = {p["SPDXID"] for p in sbom.get("packages", [])}
    # SPDX 2.x timestamps are UTC in YYYY-MM-DDThh:mm:ssZ form
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    added = 0

    for entry in ai_files:
        # SPDX identifiers may contain only letters, digits, "." and "-"
        safe_name = re.sub(r"[^A-Za-z0-9.-]", "-", entry["file"])
        spdx_id = f"SPDXRef-ai-generated-{safe_name}-{entry['commit'][:8]}"
        if spdx_id in existing_ids:
            continue

        sbom.setdefault("packages", []).append({
            "SPDXID": spdx_id,
            "name": f"ai-generated:{entry['file']}",
            "versionInfo": entry["commit"][:12],
            # SPDX 2.x supplier accepts only "Person: ", "Organization: " or NOASSERTION
            "supplier": "NOASSERTION",
            "downloadLocation": "NOASSERTION",
            "filesAnalyzed": False,
            "licenseConcluded": "NOASSERTION",
            "licenseDeclared": "NOASSERTION",
            "copyrightText": "NOASSERTION",
            "comment": (
                f"AI-generated code in {entry['file']} "
                f"(commit {entry['commit'][:12]}). "
                "Licence provenance unknown. No upstream CVE feed. "
                "Requires manual licence review before distribution."
            ),
            "annotations": [
                {
                    "annotationDate": now,
                    "annotationType": "REVIEW",
                    "annotator": "Tool: ai-sbom-annotator",
                    "comment": "Pending human licence review",
                }
            ],
        })
        existing_ids.add(spdx_id)
        added += 1

    with open(SBOM_PATH, "w") as f:
        json.dump(sbom, f, indent=2)

    print(f"Annotated SBOM with {added} AI-generated file entries.")

if __name__ == "__main__":
    main()

Integrate this into CI after the SBOM generation step:

  sbom-generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          fetch-depth: 0  # full history required for git log AI attribution scan

      - name: Install syft
        run: |
          curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | \
            sh -s -- -b /usr/local/bin v1.4.1

      - name: Generate base SBOM
        run: syft dir:. -o spdx-json=sbom.spdx.json

      - name: Annotate AI-generated components
        run: python3 scripts/annotate-sbom-ai-components.py

      - name: Validate SBOM
        run: |
          # Check that every package with NOASSERTION licence is flagged for review
          UNREVIEWED=$(jq '[.packages[] |
            select(.licenseConcluded == "NOASSERTION") |
            select(.annotations[]?.comment == "Pending human licence review")
          ] | length' sbom.spdx.json)
          echo "AI-generated components pending licence review: $UNREVIEWED"
          # Fail if there are unreviewed AI components and this is a release build
          if [ "${GITHUB_REF}" = "refs/heads/main" ] && [ "$UNREVIEWED" -gt 0 ]; then
            echo "::error::Release build contains AI-generated components without licence review."
            exit 1
          fi

      - name: Upload SBOM
        uses: actions/upload-artifact@5d5d22a31266ced268874388b861e4b58bb5c2f3 # v4.3.1
        with:
          name: sbom
          path: sbom.spdx.json
          retention-days: 365

6. Legal Review Gate for AI-Heavy Pull Requests

When AI-attributed commits exceed a threshold share of a PR’s commits, flag the PR for legal review before merge. On its own the workflow below is not a block: it is a routing gate that ensures human judgement is applied before AI-heavy changes reach main.

# .github/workflows/ai-legal-gate.yml
name: AI Code Legal Review Gate
on:
  pull_request:
    types: [opened, synchronize, reopened]

permissions:
  contents: read
  pull-requests: write

jobs:
  ai-attribution-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          fetch-depth: 0

      - name: Measure AI attribution ratio
        id: ai-ratio
        run: |
          BASE="${{ github.event.pull_request.base.sha }}"
          HEAD="${{ github.event.pull_request.head.sha }}"

          TOTAL=$(git rev-list --count "${BASE}..${HEAD}" 2>/dev/null || echo 0)
          if [ "$TOTAL" -eq 0 ]; then
            echo "ai_ratio=0" >> "$GITHUB_OUTPUT"
            exit 0
          fi

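          # Count commits (not trailer lines) that carry an AI co-author trailer,
          # so a commit with two AI trailers cannot push the ratio above 1.
          AI=$(git log "${BASE}..${HEAD}" --format="%H" \
            --regexp-ignore-case --extended-regexp \
            --grep="^Co-authored-by:.*(copilot|claude|cursor|codeium|tabnine)" | wc -l)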

          # Use awk for ratio calculation (bc may not be available)
          RATIO=$(awk "BEGIN { printf \"%.2f\", $AI / $TOTAL }")
          echo "ai_ratio=${RATIO}" >> "$GITHUB_OUTPUT"
          echo "ai_commits=${AI}" >> "$GITHUB_OUTPUT"
          echo "total_commits=${TOTAL}" >> "$GITHUB_OUTPUT"

      - name: Post legal review notice for high AI-ratio PRs
        if: ${{ steps.ai-ratio.outputs.ai_ratio > 0.4 }}
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh pr comment ${{ github.event.pull_request.number }} \
            --body "**Automated notice: AI-generated code threshold exceeded.**

          This PR has ${{ steps.ai-ratio.outputs.ai_commits }} of ${{ steps.ai-ratio.outputs.total_commits }} commits with AI co-author attribution (${{ steps.ai-ratio.outputs.ai_ratio }} ratio).

          **Required before merge:**
          - [ ] Licence compliance CI job has passed (FOSSA + ScanCode results reviewed)
          - [ ] SBOM updated to reflect any AI-generated inline functionality
          - [ ] No cryptographic, authentication, or authorisation logic implemented inline without a declared library
          - [ ] Reviewer has checked for verbatim reproduction of named upstream functions

          This comment is informational. Contact the security team with questions about licence provenance."

      - name: Require legal label for high-ratio PRs on main
        if: >
          steps.ai-ratio.outputs.ai_ratio > 0.6 &&
          github.event.pull_request.base.ref == 'main'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh pr edit ${{ github.event.pull_request.number }} \
            --add-label "legal-review-required"
          echo "::warning::AI attribution ratio ${{ steps.ai-ratio.outputs.ai_ratio }} on PR targeting main — legal-review-required label applied."

Expected Behaviour

After the full configuration is in place, the pipeline produces the following observable outcomes:

FOSSA scan on a PR introducing AI-generated JWT logic that reproduces a fragment from python-jose (MIT-licensed, so permissive, but the reproduction is still detectable by snippet matching): FOSSA reports a snippet match finding in the FOSSA web UI against the matched upstream source, with licence MIT. Under a policy that allows permissive licences the CI step exits 0, but the finding is recorded and visible. If the snippet matched GPL-licensed code instead, FOSSA exits non-zero and the job fails.

SBOM annotation after a merge with AI co-author trailers: sbom.spdx.json contains a packages entry with "name": "ai-generated:src/auth/jwt.py", "licenseConcluded": "NOASSERTION", and a comment documenting the commit SHA and pending review status. The SBOM grows to reflect the actual code surface, not just the manifest surface.

Legal review gate on a PR targeting main where 5 of 7 commits have Co-authored-by: GitHub Copilot trailers: the ratio is 5/7 ≈ 0.71, above both the 0.4 and 0.6 thresholds, so the gate posts an automated comment listing the checklist items and applies the legal-review-required label. The workflow itself does not block the merge; if branch protection requires a status check keyed to that label, the PR cannot be merged until a security team member removes it.

ScanCode run on a file that contains the AI-generated base64url decode pattern flagged by ast-grep: if ScanCode also detects a licence header match (because the file includes a comment block reproducing a docstring from an MIT-licensed library), it surfaces it as an MIT licence detection attributed to a file in src/ rather than to a declared package. This triggers manual review of where that comment block came from.

Trade-offs

Copilot public code filter reduces suggestion completeness. Engineers report fewer suggestions for uncommon patterns, more generic suggestions for common ones. The filter operates on textual similarity against a corpus snapshot — it will miss reproduction of code that was not in GitHub’s corpus at training time, and it will miss structural reproduction where variable names have been changed. It is a necessary but not sufficient control.

SBOM annotation via git history depends on consistent AI attribution in commit messages. Engineers who do not add co-author trailers produce AI-generated code that is invisible to the annotation script. The pre-commit hook reduces the failure rate but does not eliminate it for commits made outside the normal IDE workflow. The SBOM therefore represents a lower bound on AI-generated code, not an exhaustive record.

FOSSA snippet matching adds approximately 3–6 minutes to CI runtime and requires an API key. The snippet corpus is proprietary and does not cover every open source repository. FOSSA’s snippet detection is most effective for heavily reproduced, well-known upstream code; it is less effective for niche libraries or recent additions to the open source ecosystem. ScanCode covers the gap for licence header detection but not for structural algorithm reproduction.

Legal review gate adds process friction proportional to AI adoption. Teams that use AI for every commit will see the gate trigger constantly and will be incentivised to omit co-author trailers to avoid it. The threshold (40% ratio in the example) must be calibrated against actual team AI usage patterns. Set it too low and it becomes noise; too high and it catches nothing. Review the trigger rate monthly for the first three months after deployment.
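
When calibrating, it can help to replay historical PRs through the routing rule offline. The sketch below restates the gate's decision as a pure function, assuming the example thresholds of 0.4 and 0.6 and rounding to two decimals as the workflow does; `route_pr` is a hypothetical name, and rounding at exact half-way values may differ slightly from awk's.

```python
def route_pr(ai_commits: int, total_commits: int, targets_main: bool = True,
             comment_at: float = 0.4, label_at: float = 0.6) -> list[str]:
    """Return the actions the gate would take for a PR."""
    if total_commits <= 0:
        return []
    # Round to two decimals before comparing, as the workflow formats the ratio
    ratio = round(ai_commits / total_commits, 2)
    actions = []
    if ratio > comment_at:
        actions.append("comment")
    if ratio > label_at and targets_main:
        actions.append("label")
    return actions


if __name__ == "__main__":
    # (ai_commits, total_commits) pairs from hypothetical past PRs
    for ai, total in [(5, 7), (3, 7), (2, 5), (0, 4)]:
        print(ai, total, route_pr(ai, total))
```

Running a month of merged PRs through this function with candidate thresholds gives the trigger rate before the thresholds are changed in CI.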

Failure Modes

“Copilot’s output is Microsoft’s problem.” This assumption holds only under conditions that are easy to break. Microsoft’s Copilot Copyright Commitment (announced 2023) indemnifies enterprise customers for copyright claims arising from Copilot output, subject to conditions: the customer must have the public code filter enabled, must not have deliberately tried to generate infringing code, and must use Copilot through the official API. If any of those conditions are not met (for example, a developer disabled the filter locally), the indemnity does not apply. Even where it does apply, the indemnity covers legal defence costs, not contractual obligations arising from GPL licence violations in customer contracts. These are different exposures.

SBOM from manifest scan only. The most common failure mode in organisations that do have SBOM tooling. syft dir:. on a repository with heavy AI-generated code produces a clean, complete-looking SBOM that covers every declared dependency and none of the AI-generated inline code. Compliance teams review the SBOM, find no issues, and sign off. The actual licence risk is in the code, not the manifest.

Treating AI attribution as optional. Without consistent attribution, the AI code review gate and SBOM annotation are ineffective. The gate only fires when co-author trailers are present. If engineers omit trailers, the PR looks like normal human-written code. The only backstop is FOSSA snippet matching, which is not comprehensive. Attribution must be enforced, not encouraged.

Scanning only declared dependencies for CVEs. A known-vulnerable algorithm inlined by an AI assistant will not appear in Trivy, Grype, or Snyk results. These tools query CVE databases indexed by package name and version. src/auth/jwt.py has no package name, no version, no CVE mapping. The vulnerability is present; the tooling is blind to it. The mitigation is to prohibit AI-generated implementations of security-sensitive functionality and require named, pinned, audited libraries for all cryptographic and authentication code — enforced at code review and via linting rules that flag implementation of known-sensitive patterns (base64url decode combined with HMAC or asymmetric signature verification outside of declared auth libraries).
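
A linting rule of the kind described can be sketched with Python's `ast` module: flag any function that both decodes base64url data and performs HMAC or signature verification. This is an assumption-laden illustration, not an existing linter; `find_inline_token_verifiers` and the list of call names are hypothetical and would need tuning for a real codebase.

```python
import ast


def _call_name(node: ast.Call) -> str:
    """Dotted name of a call target, e.g. 'base64.urlsafe_b64decode'."""
    parts = []
    target = node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
    return ".".join(reversed(parts))


def find_inline_token_verifiers(source: str) -> list[str]:
    """Names of functions that combine base64url decoding with verification."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        calls = {_call_name(c) for c in ast.walk(node) if isinstance(c, ast.Call)}
        decodes = any(name.endswith("urlsafe_b64decode") for name in calls)
        verifies = any(
            name in {"hmac.new", "hmac.compare_digest"} or name.endswith(".verify")
            for name in calls
        )
        if decodes and verifies:
            flagged.append(node.name)
    return flagged


SAMPLE = '''
import base64, json

def decode_jwt(token, public_key):
    header, payload, signature = token.split(".")
    sig = base64.urlsafe_b64decode(signature + "==")
    public_key.verify(sig, f"{header}.{payload}".encode())
    return json.loads(base64.urlsafe_b64decode(payload + "=="))

def uses_library(token, key):
    import jwt
    return jwt.decode(token, key, algorithms=["RS256"])
'''

if __name__ == "__main__":
    print(find_inline_token_verifiers(SAMPLE))
```

The source is only parsed, never executed, so the check can run on untrusted changes. It flags the hand-rolled decoder and leaves the call into a declared library alone.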

No process for retrospective review. The legal risk from AI-generated code is not limited to future PRs. Organisations that have used Copilot or other AI assistants for 12–18 months already have an unknown quantity of AI-generated code in production with no attribution metadata. A retrospective audit requires running FOSSA snippet matching and ScanCode against the full codebase (not just new PRs), reviewing findings against a legal risk matrix, and making a documented business decision about each significant finding. This is not a CI job — it is a project. It needs to happen once, and then the CI controls described above maintain the property going forward.