AI Model Output Watermarking: Provenance for Generated Text and Code

Problem

C2PA signs media files at creation. It works for images and video — the manifest sits in metadata, the signature is verifiable. For text and code, there is no metadata to sign; output is plain UTF-8 that any subsequent system handles as opaque content.

Watermarking is the textual equivalent: bias the model’s token sampling slightly so that generated text carries a statistical signature detectable by anyone holding the watermark key. Detection is probabilistic: it never works per character, but it works per passage with high confidence.

Three production approaches:

  • Kirchenbauer-style green-list watermarking. At each generation step, partition the vocabulary into a “green list” and “red list” using a hash of the previous N tokens and a secret key, then bias sampling toward the green list. A passage of generated text ends up with more green tokens than chance would predict; a detector holding the key computes the green fraction and concludes “this was likely generated by our model.” (A minimal sketch follows this list.)
  • SynthID (Google DeepMind, 2023). A more robust variant deployed in Gemini outputs. Maintains quality through a tournament-sampling procedure rather than additive logit biasing.
  • Aaronson’s 2022 unbiased scheme. Provably preserves the output distribution while embedding a detectable signal, which matters when watermarking must introduce no measurable quality loss.
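
A minimal sketch of the green-list mechanics, assuming a hash context of one previous token and simplified helpers (illustrative only; this is not the lm-watermarking implementation):

# greenlist_sketch.py: illustrative core of the green-list scheme.
import hashlib
import math
import torch

def green_mask(prev_token: int, key: int, vocab_size: int, gamma: float = 0.25) -> torch.Tensor:
    # Seed a PRNG from (secret key, previous token): same inputs, same partition.
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = torch.Generator().manual_seed(int.from_bytes(digest[:8], "big") % (2**63))
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[torch.randperm(vocab_size, generator=rng)[: int(gamma * vocab_size)]] = True
    return mask

def bias_logits(logits: torch.Tensor, prev_token: int, key: int, delta: float = 2.0) -> torch.Tensor:
    # Generation side: add delta to green-list logits before sampling.
    return logits + delta * green_mask(prev_token, key, logits.shape[-1]).float()

def z_score(token_ids: list, key: int, vocab_size: int, gamma: float = 0.25) -> float:
    # Detection side: count green tokens and compare against the chance rate gamma.
    hits = sum(green_mask(prev, key, vocab_size)[tok].item()
               for prev, tok in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)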

By 2026, watermarking has shipped:

  • Google Gemini outputs carry SynthID by default; their detector is the only verifier (closed-source detector keys).
  • OpenAI implemented and publicly discussed (but didn’t widely deploy) a similar scheme.
  • Anthropic’s research on watermarking continues; deployment status varies by model.
  • Open-source watermarkers (lm-watermarking, watermark-evaluation) for self-hosted models.

The specific gaps in production deployments:

  • Most applications using LLM APIs don’t request watermarking, even when the provider supports it.
  • Detection requires the secret key — only the original generator can detect.
  • Watermarks survive direct copying but degrade on editing (paraphrasing, translation, partial use).
  • Adversaries running their own model can produce text with no watermark, which is then indistinguishable from “not-our-model output.”
  • Watermark detection is statistical; small text snippets are hard to classify confidently.

This article covers the green-list scheme in concept, deployment patterns at the application layer, the detection workflow for trust-and-safety teams, evasion considerations, and the realistic expectations for what watermarking does and doesn’t solve.

Target systems: SynthID via Vertex AI / Gemini API, lm-watermarking research codebase, custom watermarking on self-hosted models (vLLM, Text Generation Inference). Detection: provider-specific APIs or lm-watermarking toolkit.

Threat Model

  • Adversary 1 — Synthetic-content abuse: an attacker uses LLM-generated text for spam, disinformation, fake reviews, academic dishonesty. Wants the output indistinguishable from human-authored content.
  • Adversary 2 — AI-content evader: knows watermarking exists; paraphrases, mixes languages, uses round-trip translation, or runs text through a different model to strip the signal.
  • Adversary 3 — Provenance forger: wants to claim an LLM-generated document is human-authored to evade content-policy enforcement.
  • Adversary 4 — False positive abuser: weaponizes detection by submitting borderline content that triggers watermark detection on innocent humans (false positives on naturally-green-biased text).
  • Access level: all adversaries have only the LLM API or output access. No detection-key access.
  • Objective: Use generated content without it being identifiable; cause false-positive provenance attributions.
  • Blast radius: without watermarking, every LLM-generated document is unverifiable. With watermarking + correct detection workflow, a meaningful fraction of generated content can be identified — but never 100%, and never small snippets reliably.

Configuration

Pattern 1: Generation With SynthID (Gemini)

Vertex AI / Gemini outputs carry SynthID by default. Configure to ensure the watermark is included:

import vertexai
from vertexai.preview.generative_models import GenerativeModel

# Initialize the SDK for the project hosting the model.
vertexai.init(project="myproj", location="us-central1")

model = GenerativeModel("gemini-2.0-pro")
response = model.generate_content(
    "Write a 500-word essay on Roman history.",
    generation_config={
        "temperature": 0.7,
        "max_output_tokens": 800,
    },
    # SynthID is on by default in current Gemini deployments.
)
print(response.text)
# The text contains an embedded SynthID watermark.

For Vertex AI’s text-detection API:

from google.cloud import aiplatform

# Submit suspect text for watermark detection. The endpoint path is
# deployment-specific; suspect_text is the passage under review.
endpoint = aiplatform.Endpoint(
    "projects/myproj/locations/us-central1/endpoints/synthid-text-detector"
)
result = endpoint.predict(instances=[{"text": suspect_text}])
# Result includes:
#   - watermark_detected: bool
#   - confidence: float (0.0-1.0)
#   - models: list of models the watermark might be from

Treat the result as one signal among many; a confidence below 0.95 should not be read as a definitive determination.

Pattern 2: Self-Hosted Watermarking (Green-List Scheme)

For self-hosted vLLM or Text Generation Inference deployments, integrate the open-source lm-watermarking package:

# generate_watermarked.py
from transformers import LogitsProcessorList
from lm_watermarking import WatermarkLogitsProcessor

processor = WatermarkLogitsProcessor(
    vocab=list(tokenizer.get_vocab().values()),  # token IDs, not the str->id dict
    gamma=0.25,        # fraction of the vocab on the green list
    delta=2.0,         # additive bias toward green-list tokens
    seeding_scheme="simple_1",
    hash_key=KEY,      # secret key
)

output = model.generate(
    input_ids=tokens,
    logits_processor=LogitsProcessorList([processor]),
    max_new_tokens=400,
)
text = tokenizer.decode(output[0], skip_special_tokens=True)

gamma=0.25 and delta=2.0 are the standard starting points. Higher delta means a stronger signal but a larger quality impact; lower gamma means more detection power per token but more vulnerability to evasion. These parameters also bound how short a passage can be and still be classified confidently, as the arithmetic below shows.
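
How long a passage must be before the detector can fire follows from binomial statistics. A back-of-envelope helper (not part of any toolkit; assumes the bias lifts the observed green fraction g above gamma):

import math

def min_tokens_for_z(z: float, gamma: float, g: float) -> int:
    # Solve z = (g - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) for T.
    return math.ceil(z**2 * gamma * (1 - gamma) / (g - gamma) ** 2)

# With gamma=0.25 and an observed green fraction of 0.42 (the example
# detection result below), clearing z >= 4.0 takes roughly:
print(min_tokens_for_z(4.0, 0.25, 0.42))   # 104 tokens

Anything much shorter than about 100 tokens cannot clear the threshold at these settings, which is why small snippets resist confident classification.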

Detection on the suspect text:

from lm_watermarking import WatermarkDetector

detector = WatermarkDetector(
    vocab=tokenizer.get_vocab(),
    gamma=0.25,
    seeding_scheme="simple_1",
    hash_key=KEY,
    z_threshold=4.0,   # z-score threshold for "watermarked"
    tokenizer=tokenizer,
)

result = detector.detect(suspect_text)
# {"is_watermarked": True, "z_score": 6.7, "p_value": 0.000001, "fraction_of_green_tokens": 0.42}

A z-score of 4.0+ corresponds to roughly 1-in-30,000 chance of false positive on natural text. For high-stakes decisions, require z-score 5.0+.
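
Both figures are one-sided normal tail probabilities and can be sanity-checked with the standard library alone:

import math

def one_sided_p(z: float) -> float:
    # P(Z >= z) for a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(1 / one_sided_p(4.0)))   # ~31,600: the 1-in-30,000 at z = 4.0
print(round(1 / one_sided_p(5.0)))   # ~3.5 million at z = 5.0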

Pattern 3: Multi-Model Watermarking

Different models have different watermark schemes. To detect across the fleet, run multiple detectors:

def detect_any_watermark(text):
    results = {}
    # Try our internal watermark.
    results["internal"] = our_detector.detect(text)
    # Try other schemes we hold detectors for (scheme names illustrative).
    for scheme in ["llama-watermark", "synthid-open-text", "binoculars"]:
        results[scheme] = scheme_detectors[scheme].detect(text)
    return {
        "any_watermark": any(r["is_watermarked"] for r in results.values()),
        "scores": results,
    }

A passage flagged by any detector is suspicious; one flagged by multiple detectors is high-confidence AI-generated.

Pattern 4: Application-Layer Generation Logging

Even without watermarking, log every generation event server-side. The hash of generated text + the generation context (model, tokens, generation timestamp, user ID) is your authoritative record.

import hashlib, json
from datetime import datetime, timezone

def log_generation(user_id, prompt, response, model_version):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model_version,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        "response_length": len(response),
    }
    audit_log.write(json.dumps(record) + "\n")  # audit_log: append-only JSONL sink

Later, given a suspect document, compute its SHA-256 and search the log. Exact matches confirm “this text was generated by our system at time T for user X.” Watermarking is one signal; immutable provenance logs are another.
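
The lookup side is a hash comparison against that log. A minimal sketch, assuming the JSON-lines records written above (AUDIT_LOG_PATH is a placeholder):

import hashlib, json

AUDIT_LOG_PATH = "generation_audit.jsonl"   # placeholder path

def find_generation(suspect_text):
    # Return every log record whose response hash matches the suspect
    # document exactly; any match is authoritative provenance.
    digest = hashlib.sha256(suspect_text.encode()).hexdigest()
    with open(AUDIT_LOG_PATH) as f:
        return [r for r in map(json.loads, f) if r["response_hash"] == digest]

Hashing is exact-match only, so a single edited character defeats the lookup; normalized or chunked hashing extends coverage at the cost of more false matches.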

Pattern 5: Detection Workflow Integration

For trust-and-safety review:

def review_submission(submission):
    signals = {}

    # Watermark detection.
    wm_result = detect_any_watermark(submission.text)
    signals["watermark"] = wm_result

    # Heuristic AI-detection (AI Classifier, GPTZero, etc.).
    signals["ai_classifier"] = ai_classifier_score(submission.text)

    # Stylometric analysis vs. user's prior submissions.
    signals["stylometry"] = stylometry_compare(submission.text, user_baseline(submission.user_id))

    # Provenance log lookup.
    signals["provenance_log_match"] = provenance_log_search(submission.text)

    # Combine signals — no single signal is decisive.
    score = combine_signals(signals)
    if score >= HIGH_CONFIDENCE_THRESHOLD:
        return Action.ESCALATE_TO_HUMAN
    elif score >= MEDIUM_CONFIDENCE_THRESHOLD:
        return Action.LABEL_AS_AI
    return Action.ALLOW

Multiple signals reduce false-positive rate; no single signal is decisive enough to action against a user.
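
combine_signals above is deliberately unspecified. A minimal weighted-sum sketch; the weights are invented placeholders to be calibrated against a labeled corpus, and it assumes the classifier and stylometry scores are normalized to [0, 1]:

# Placeholder weights: calibrate against ground-truth labeled data.
SIGNAL_WEIGHTS = {
    "watermark": 0.4,
    "provenance_log_match": 0.4,   # an exact hash match is near-decisive
    "ai_classifier": 0.1,
    "stylometry": 0.1,
}

def combine_signals(signals):
    score = 0.0
    if signals["watermark"]["any_watermark"]:
        score += SIGNAL_WEIGHTS["watermark"]
    if signals["provenance_log_match"]:
        score += SIGNAL_WEIGHTS["provenance_log_match"]
    score += SIGNAL_WEIGHTS["ai_classifier"] * signals["ai_classifier"]
    score += SIGNAL_WEIGHTS["stylometry"] * signals["stylometry"]
    return score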

Pattern 6: User Disclosure and Calibration

Where the application itself generates content (a writing assistant, a code generator), be explicit:

# Generated-content disclosure.
generated_content_metadata:
  generator: "MyCorp AI Assistant"
  model_version: "claude-opus-4-7"
  watermark_scheme: "synthid-text-v2"
  generation_timestamp: "2026-04-29T10:00:00Z"
  user_attribution: "user-12345"

Embed this in HTTP response headers, document metadata, JSON response — wherever convention allows. Watermarks are insurance against tampering; disclosure is the primary signal.
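
For the HTTP case, a sketch of how the disclosure could ride along on the response; the header names are invented conventions, not an established standard:

def disclosure_headers(metadata):
    # Map the disclosure metadata onto response headers.
    return {
        "X-Generated-By": metadata["generator"],
        "X-Generator-Model": metadata["model_version"],
        "X-Watermark-Scheme": metadata["watermark_scheme"],
        "X-Generation-Timestamp": metadata["generation_timestamp"],
    }

# e.g. with Flask: resp.headers.update(disclosure_headers(generated_content_metadata))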

Pattern 7: Evasion-Resistance Testing

Periodically test how well your watermark survives common transformations:

def evasion_test(text):
    transformations = {
        "paraphrase_via_other_llm": paraphrase_with(text, model="competitor-model"),
        "sentence_shuffle": shuffle_sentences(text),
        "back_translation": translate(translate(text, "fr"), "en"),
        "synonym_substitution": synonym_substitute(text, rate=0.15),
        "delete_random_5pct": delete_random_words(text, 0.05),
    }
    results = {}
    for name, transformed in transformations.items():
        results[name] = our_detector.detect(transformed)
    return results

A robust scheme survives some transformations (synonym substitution at low rate, partial deletion); none survives heavy paraphrasing or translation cycles. Knowing the survival rate informs how much weight to give the signal.
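
To turn per-transformation results into that survival rate, aggregate over a corpus of known-watermarked samples (assumes evasion_test from the sketch above):

def survival_rates(corpus):
    # Fraction of known-watermarked samples still detected after each
    # transformation; this is how much weight the signal deserves.
    counts = {}
    for text in corpus:
        for name, result in evasion_test(text).items():
            counts[name] = counts.get(name, 0) + int(result["is_watermarked"])
    return {name: hits / len(corpus) for name, hits in counts.items()}

# e.g. {"synonym_substitution": 0.9, "back_translation": 0.1, ...}
# Low survival means a non-detection is weak evidence of human origin.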

Pattern 8: Telemetry

ai_watermark_generations_total{model, scheme}
ai_watermark_detections_total{outcome, scheme}
ai_watermark_detection_score_histogram
ai_watermark_false_positive_total       (when corrected by human review)
ai_watermark_evasion_attempts_total     (heuristic; rising signal of attack)

Alert on:

  • false_positive_total rising — detector calibration off; review threshold.
  • Pattern of evasion attempts — adversarial campaign in progress.
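
A sketch of emitting these with the prometheus_client library; the scheme label value is illustrative:

from prometheus_client import Counter, Histogram

WM_GENERATIONS = Counter("ai_watermark_generations_total",
                         "Watermarked generation events", ["model", "scheme"])
WM_DETECTIONS = Counter("ai_watermark_detections_total",
                        "Watermark detection runs", ["outcome", "scheme"])
WM_SCORE = Histogram("ai_watermark_detection_score_histogram",
                     "Detector z-scores",
                     buckets=[0, 1, 2, 3, 4, 5, 6, 8, 10])
# The remaining counters follow the same pattern.

def record_generation(model_name, scheme="greenlist-v1"):
    WM_GENERATIONS.labels(model=model_name, scheme=scheme).inc()

def record_detection(result, scheme="greenlist-v1"):
    outcome = "detected" if result["is_watermarked"] else "not_detected"
    WM_DETECTIONS.labels(outcome=outcome, scheme=scheme).inc()
    WM_SCORE.observe(result["z_score"])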

Expected Behaviour

Signal | Without watermarking | With watermarking
------ | -------------------- | ------------------
Direct-copy detection | Stylometric guess only | High-confidence signal
Paraphrased copy | Hard | Reduced; still possible at moderate paraphrase
Heavy translation cycle | Impossible | Watermark erased
Adversary uses a different model | No detection | No detection (this watermark is model-specific)
Application-layer provenance log | Possible: exact-match hashes | Same
Combined-signal accuracy | Single-signal decisions | Multi-signal; lower false-positive rate

Trade-offs

Aspect | Benefit | Cost | Mitigation
------ | ------- | ---- | ----------
Watermarking by default | Provenance for generated content | Tiny quality impact (delta ≤ 2.0 is imperceptible) | Acceptable; production schemes are quality-preserving
Detector-key secrecy | Only the generator can detect | Closed verification makes external audit hard | Publish the methodology; share keys with trust-and-safety partners under NDA
Multi-signal review | Lower false-positive rate | More tooling to maintain | Use existing heuristic tools (GPTZero, AI Classifier); combine with watermark and provenance log
Provenance log | Indisputable on exact match | Privacy implications of logging all generated text | Hash-only logging; don’t store the actual text
Evasion resistance | Bounds the impact of standard transformations | No defense against original-model substitution | Realistic framing: watermarking catches lazy abuse, not motivated adversaries
Transparency | Calibrates public expectations | Sophisticated adversaries adjust | Acceptable; don’t oversell what watermarking does

Failure Modes

Failure | Symptom | Detection | Recovery
------- | ------- | --------- | --------
Detector key leaked | Adversary can craft non-watermarked text or evade detection | Detection rates drop while submissions grow | Rotate the key; old generations become undetectable (accepted); future generations use the new key
False positive on natural text | Innocent user accused | Human review catches it | Raise the z-threshold; recheck it against a known-human corpus periodically
Watermark erased by editing | Detection fails on genuinely AI-generated content | Score below threshold despite real AI origin | Acceptable failure mode; combine with other signals
Adversary uses a non-watermarked competitor model | No detection at all | Comparable-quality output that never trips the detector | Beyond watermarking’s scope; rely on stylometry and behavior analysis
Evasion tool widely deployed | Detection rates collapse | Population statistics shift | Update the watermark scheme; deploy multi-scheme detection
Confidence threshold miscalibrated | Too many escalations or too many misses | Operator feedback | Calibrate against a ground-truth labeled corpus; review quarterly
Watermark on quoted AI text | A human document quotes an AI passage; the whole document is flagged | Stylometric analysis helps disambiguate | Watermarks apply to passages; don’t action a whole document when only a passage is flagged

Realistic Expectations

Watermarking is one tool among many. Realistic expectations:

  • High confidence on direct, unedited AI-generated text. Real production utility for spam detection, content-policy enforcement.
  • Reduced confidence on paraphrased text. Useful as one signal in a multi-signal review.
  • Near-zero utility on heavily edited or translated text. Don’t claim coverage where none exists.
  • No protection against adversaries who control their own model. A determined attacker generates from a non-watermarking model.
  • Privacy: watermarking does not deanonymize users. It identifies that the text came from a watermarked model, not who generated it.

The use cases watermarking does serve:

  • Publishing platforms tagging AI-generated content for users.
  • Search engines reducing the rank of mass-produced AI content.
  • Academic / education contexts as one signal for academic-integrity review.
  • Trust-and-safety teams correlating with other signals to identify abuse campaigns.

The use cases watermarking does not serve:

  • Definitive “is this AI” judgment for individual borderline cases.
  • Legal evidence in disputes about authorship.
  • Replacement for human review in high-stakes decisions.