AI Model Output Watermarking: Provenance for Generated Text and Code

Problem

C2PA signs media files at creation. It works for images and video — the manifest sits in metadata, the signature is verifiable. For text and code, there is no metadata to sign; output is plain UTF-8 that any subsequent system handles as opaque content.

Watermarking is the textual equivalent: bias the model’s token sampling slightly so that generated text carries a statistical signature detectable by anyone holding the watermark key. Detection is probabilistic: it never works per character, but it works per passage with high confidence.

Three production approaches:

  • Kirchenbauer-style green-list watermarking. At each generation step, partition the vocabulary into a “green list” and “red list” using a hash of the previous N tokens and a secret key, then bias sampling toward the green list. A passage of generated text ends up with more green tokens than chance would predict; a detector holding the key computes the green fraction and concludes “this was likely generated by our model.” (A minimal sketch follows this list.)
  • SynthID (Google DeepMind, 2023). A more robust variant deployed in Gemini outputs. Maintains quality through a tournament-sampling procedure rather than additive logit biasing.
  • Aaronson’s 2022 unbiased scheme. Provably preserves the output distribution while embedding a detectable signal, which matters when watermarking must introduce no measurable quality loss.
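
A minimal sketch of the green-list mechanics, assuming a hash context of one previous token and simplified helpers (illustrative only; this is not the lm-watermarking implementation):

# greenlist_sketch.py: illustrative core of the green-list scheme.
import hashlib
import math
import torch

def green_mask(prev_token: int, key: int, vocab_size: int, gamma: float = 0.25) -> torch.Tensor:
    # Seed a PRNG from (secret key, previous token): same inputs, same partition.
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = torch.Generator().manual_seed(int.from_bytes(digest[:8], "big") % (2**63))
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[torch.randperm(vocab_size, generator=rng)[: int(gamma * vocab_size)]] = True
    return mask

def bias_logits(logits: torch.Tensor, prev_token: int, key: int, delta: float = 2.0) -> torch.Tensor:
    # Generation side: add delta to green-list logits before sampling.
    return logits + delta * green_mask(prev_token, key, logits.shape[-1]).float()

def z_score(token_ids: list, key: int, vocab_size: int, gamma: float = 0.25) -> float:
    # Detection side: count green tokens and compare against the chance rate gamma.
    hits = sum(green_mask(prev, key, vocab_size)[tok].item()
               for prev, tok in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)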

By 2026, watermarking has shipped:

  • Google Gemini outputs carry SynthID by default; their detector is the only verifier (closed-source detector keys).
  • OpenAI implemented and publicly discussed (but didn’t widely deploy) a similar scheme.
  • Anthropic’s research on watermarking continues; deployment status varies by model.
  • Open-source watermarkers (lm-watermarking, watermark-evaluation) for self-hosted models.

The specific gaps in production deployments:

  • Most applications using LLM APIs don’t request watermarking, even when the provider supports it.
  • Detection requires the secret key — only the original generator can detect.
  • Watermarks survive direct copying but degrade on editing (paraphrasing, translation, partial use).
  • Adversaries running their own model can produce text with no watermark, which is then indistinguishable from “not-our-model output.”
  • Watermark detection is statistical; small text snippets are hard to classify confidently.

This article covers the green-list scheme in concept, deployment patterns at the application layer, the detection workflow for trust-and-safety teams, evasion considerations, and the realistic expectations for what watermarking does and doesn’t solve.

Target systems: SynthID via Vertex AI / Gemini API, lm-watermarking research codebase, custom watermarking on self-hosted models (vLLM, Text Generation Inference). Detection: provider-specific APIs or lm-watermarking toolkit.

Threat Model

  • Adversary 1 — Synthetic-content abuse: an attacker uses LLM-generated text for spam, disinformation, fake reviews, academic dishonesty. Wants the output indistinguishable from human-authored content.
  • Adversary 2 — AI-content evader: knows watermarking exists; paraphrases, mixes languages, uses round-trip translation, or runs text through a different model to strip the signal.
  • Adversary 3 — Provenance forger: wants to claim an LLM-generated document is human-authored to evade content-policy enforcement.
  • Adversary 4 — False positive abuser: weaponizes detection by submitting borderline content that triggers watermark detection on innocent humans (false positives on naturally-green-biased text).
  • Access level: all adversaries have only the LLM API or output access. No detection-key access.
  • Objective: Use generated content without it being identifiable; cause false-positive provenance attributions.
  • Blast radius: without watermarking, every LLM-generated document is unverifiable. With watermarking + correct detection workflow, a meaningful fraction of generated content can be identified — but never 100%, and never small snippets reliably.

Configuration

Pattern 1: Generation With SynthID (Gemini)

Vertex AI / Gemini outputs carry SynthID by default. Configure to ensure the watermark is included:

import vertexai
from vertexai.preview.generative_models import GenerativeModel

# Initialize the SDK for the project hosting the model.
vertexai.init(project="myproj", location="us-central1")

model = GenerativeModel("gemini-2.0-pro")
response = model.generate_content(
    "Write a 500-word essay on Roman history.",
    generation_config={
        "temperature": 0.7,
        "max_output_tokens": 800,
    },
    # SynthID is on by default in current Gemini deployments.
)
print(response.text)
# The text contains an embedded SynthID watermark.

For Vertex AI’s text-detection API:

from google.cloud import aiplatform

# Submit suspect text for watermark detection. The endpoint path is
# deployment-specific; suspect_text is the passage under review.
endpoint = aiplatform.Endpoint(
    "projects/myproj/locations/us-central1/endpoints/synthid-text-detector"
)
result = endpoint.predict(instances=[{"text": suspect_text}])
# Result includes:
#   - watermark_detected: bool
#   - confidence: float (0.0-1.0)
#   - models: list of models the watermark might be from

Treat the result as one signal among many; a confidence below 0.95 should not be read as a definitive determination.

Pattern 2: Self-Hosted Watermarking (Green-List Scheme)

For self-hosted vLLM or Text Generation Inference deployments, integrate the open-source lm-watermarking package:

# generate_watermarked.py
from transformers import LogitsProcessorList
from lm_watermarking import WatermarkLogitsProcessor

processor = WatermarkLogitsProcessor(
    vocab=list(tokenizer.get_vocab().values()),  # token IDs, not the str->id dict
    gamma=0.25,        # fraction of the vocab on the green list
    delta=2.0,         # additive bias toward green-list tokens
    seeding_scheme="simple_1",
    hash_key=KEY,      # secret key
)

output = model.generate(
    input_ids=tokens,
    logits_processor=LogitsProcessorList([processor]),
    max_new_tokens=400,
)
text = tokenizer.decode(output[0], skip_special_tokens=True)

gamma=0.25 and delta=2.0 are the standard starting points. Higher delta means a stronger signal but a larger quality impact; lower gamma means more detection power per token but more vulnerability to evasion. These parameters also bound how short a passage can be and still be classified confidently, as the arithmetic below shows.
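
How long a passage must be before the detector can fire follows from binomial statistics. A back-of-envelope helper (not part of any toolkit; assumes the bias lifts the observed green fraction g above gamma):

import math

def min_tokens_for_z(z: float, gamma: float, g: float) -> int:
    # Solve z = (g - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) for T.
    return math.ceil(z**2 * gamma * (1 - gamma) / (g - gamma) ** 2)

# With gamma=0.25 and an observed green fraction of 0.42 (the example
# detection result below), clearing z >= 4.0 takes roughly:
print(min_tokens_for_z(4.0, 0.25, 0.42))   # 104 tokens

Anything much shorter than about 100 tokens cannot clear the threshold at these settings, which is why small snippets resist confident classification.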

Detection on the suspect text:

from lm_watermarking import WatermarkDetector

detector = WatermarkDetector(
    vocab=tokenizer.get_vocab(),
    gamma=0.25,
    seeding_scheme="simple_1",
    hash_key=KEY,
    z_threshold=4.0,   # z-score threshold for "watermarked"
    tokenizer=tokenizer,
)

result = detector.detect(suspect_text)
# {"is_watermarked": True, "z_score": 6.7, "p_value": 0.000001, "fraction_of_green_tokens": 0.42}

A z-score of 4.0+ corresponds to roughly 1-in-30,000 chance of false positive on natural text. For high-stakes decisions, require z-score 5.0+.
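
Both figures are one-sided normal tail probabilities and can be sanity-checked with the standard library alone:

import math

def one_sided_p(z: float) -> float:
    # P(Z >= z) for a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(1 / one_sided_p(4.0)))   # ~31,600: the 1-in-30,000 at z = 4.0
print(round(1 / one_sided_p(5.0)))   # ~3.5 million at z = 5.0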

Pattern 3: Multi-Model Watermarking

Different models have different watermark schemes. To detect across the fleet, run multiple detectors:

def detect_any_watermark(text):
    results = {}
    # Try our internal watermark.
    results["internal"] = our_detector.detect(text)
    # Try other schemes we hold detectors for (scheme names illustrative).
    for scheme in ["llama-watermark", "synthid-open-text", "binoculars"]:
        results[scheme] = scheme_detectors[scheme].detect(text)
    return {
        "any_watermark": any(r["is_watermarked"] for r in results.values()),
        "scores": results,
    }

A passage flagged by any detector is suspicious; one flagged by multiple detectors is high-confidence AI-generated.

Pattern 4: Application-Layer Generation Logging

Even without watermarking, log every generation event server-side. The hash of generated text + the generation context (model, tokens, generation timestamp, user ID) is your authoritative record.

import hashlib, json
from datetime import datetime, timezone

def log_generation(user_id, prompt, response, model_version):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model_version,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        "response_length": len(response),
    }
    audit_log.write(json.dumps(record) + "\n")  # audit_log: append-only JSONL sink

Later, given a suspect document, compute its SHA-256 and search the log. Exact matches confirm “this text was generated by our system at time T for user X.” Watermarking is one signal; immutable provenance logs are another.
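
The lookup side is a hash comparison against that log. A minimal sketch, assuming the JSON-lines records written above (AUDIT_LOG_PATH is a placeholder):

import hashlib, json

AUDIT_LOG_PATH = "generation_audit.jsonl"   # placeholder path

def find_generation(suspect_text):
    # Return every log record whose response hash matches the suspect
    # document exactly; any match is authoritative provenance.
    digest = hashlib.sha256(suspect_text.encode()).hexdigest()
    with open(AUDIT_LOG_PATH) as f:
        return [r for r in map(json.loads, f) if r["response_hash"] == digest]

Hashing is exact-match only, so a single edited character defeats the lookup; normalized or chunked hashing extends coverage at the cost of more false matches.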

Pattern 5: Detection Workflow Integration

For trust-and-safety review:

def review_submission(submission):
    signals = {}

    # Watermark detection.
    wm_result = detect_any_watermark(submission.text)
    signals["watermark"] = wm_result

    # Heuristic AI-detection (AI Classifier, GPTZero, etc.).
    signals["ai_classifier"] = ai_classifier_score(submission.text)

    # Stylometric analysis vs. user's prior submissions.
    signals["stylometry"] = stylometry_compare(submission.text, user_baseline(submission.user_id))

    # Provenance log lookup.
    signals["provenance_log_match"] = provenance_log_search(submission.text)

    # Combine signals — no single signal is decisive.
    score = combine_signals(signals)
    if score >= HIGH_CONFIDENCE_THRESHOLD:
        return Action.ESCALATE_TO_HUMAN
    elif score >= MEDIUM_CONFIDENCE_THRESHOLD:
        return Action.LABEL_AS_AI
    return Action.ALLOW

Multiple signals reduce false-positive rate; no single signal is decisive enough to action against a user.
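
combine_signals above is deliberately unspecified. A minimal weighted-sum sketch; the weights are invented placeholders to be calibrated against a labeled corpus, and it assumes the classifier and stylometry scores are normalized to [0, 1]:

# Placeholder weights: calibrate against ground-truth labeled data.
SIGNAL_WEIGHTS = {
    "watermark": 0.4,
    "provenance_log_match": 0.4,   # an exact hash match is near-decisive
    "ai_classifier": 0.1,
    "stylometry": 0.1,
}

def combine_signals(signals):
    score = 0.0
    if signals["watermark"]["any_watermark"]:
        score += SIGNAL_WEIGHTS["watermark"]
    if signals["provenance_log_match"]:
        score += SIGNAL_WEIGHTS["provenance_log_match"]
    score += SIGNAL_WEIGHTS["ai_classifier"] * signals["ai_classifier"]
    score += SIGNAL_WEIGHTS["stylometry"] * signals["stylometry"]
    return score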

Pattern 6: User Disclosure and Calibration

Where the application itself generates content (a writing assistant, a code generator), be explicit:

# Generated-content disclosure.
generated_content_metadata:
  generator: "MyCorp AI Assistant"
  model_version: "claude-opus-4-7"
  watermark_scheme: "synthid-text-v2"
  generation_timestamp: "2026-04-29T10:00:00Z"
  user_attribution: "user-12345"

Embed this in HTTP response headers, document metadata, JSON response — wherever convention allows. Watermarks are insurance against tampering; disclosure is the primary signal.
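
For the HTTP case, a sketch of how the disclosure could ride along on the response; the header names are invented conventions, not an established standard:

def disclosure_headers(metadata):
    # Map the disclosure metadata onto response headers.
    return {
        "X-Generated-By": metadata["generator"],
        "X-Generator-Model": metadata["model_version"],
        "X-Watermark-Scheme": metadata["watermark_scheme"],
        "X-Generation-Timestamp": metadata["generation_timestamp"],
    }

# e.g. with Flask: resp.headers.update(disclosure_headers(generated_content_metadata))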

Pattern 7: Evasion-Resistance Testing

Periodically test how well your watermark survives common transformations:

def evasion_test(text):
    transformations = {
        "paraphrase_via_other_llm": paraphrase_with(text, model="competitor-model"),
        "sentence_shuffle": shuffle_sentences(text),
        "back_translation": translate(translate(text, "fr"), "en"),
        "synonym_substitution": synonym_substitute(text, rate=0.15),
        "delete_random_5pct": delete_random_words(text, 0.05),
    }
    results = {}
    for name, transformed in transformations.items():
        results[name] = our_detector.detect(transformed)
    return results

A robust scheme survives some transformations (synonym substitution at low rate, partial deletion); none survives heavy paraphrasing or translation cycles. Knowing the survival rate informs how much weight to give the signal.
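
To turn per-transformation results into that survival rate, aggregate over a corpus of known-watermarked samples (assumes evasion_test from the sketch above):

def survival_rates(corpus):
    # Fraction of known-watermarked samples still detected after each
    # transformation; this is how much weight the signal deserves.
    counts = {}
    for text in corpus:
        for name, result in evasion_test(text).items():
            counts[name] = counts.get(name, 0) + int(result["is_watermarked"])
    return {name: hits / len(corpus) for name, hits in counts.items()}

# e.g. {"synonym_substitution": 0.9, "back_translation": 0.1, ...}
# Low survival means a non-detection is weak evidence of human origin.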

Pattern 8: Telemetry

ai_watermark_generations_total{model, scheme}
ai_watermark_detections_total{outcome, scheme}
ai_watermark_detection_score_histogram
ai_watermark_false_positive_total       (when corrected by human review)
ai_watermark_evasion_attempts_total     (heuristic; rising signal of attack)

Alert on:

  • false_positive_total rising — detector calibration off; review threshold.
  • Pattern of evasion attempts — adversarial campaign in progress.
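
A sketch of emitting these with the prometheus_client library; the scheme label value is illustrative:

from prometheus_client import Counter, Histogram

WM_GENERATIONS = Counter("ai_watermark_generations_total",
                         "Watermarked generation events", ["model", "scheme"])
WM_DETECTIONS = Counter("ai_watermark_detections_total",
                        "Watermark detection runs", ["outcome", "scheme"])
WM_SCORE = Histogram("ai_watermark_detection_score_histogram",
                     "Detector z-scores",
                     buckets=[0, 1, 2, 3, 4, 5, 6, 8, 10])
# The remaining counters follow the same pattern.

def record_generation(model_name, scheme="greenlist-v1"):
    WM_GENERATIONS.labels(model=model_name, scheme=scheme).inc()

def record_detection(result, scheme="greenlist-v1"):
    outcome = "detected" if result["is_watermarked"] else "not_detected"
    WM_DETECTIONS.labels(outcome=outcome, scheme=scheme).inc()
    WM_SCORE.observe(result["z_score"])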

Expected Behaviour

Signal | Without watermarking | With watermarking
------ | -------------------- | ------------------
Direct-copy detection | Stylometric guess only | High-confidence signal
Paraphrased copy | Hard | Reduced; still possible at moderate paraphrase
Heavy translation cycle | Impossible | Watermark erased
Adversary uses a different model | No detection | No detection (this watermark is model-specific)
Application-layer provenance log | Possible: exact-match hashes | Same
Combined-signal accuracy | Single-signal decisions | Multi-signal; lower false-positive rate

Trade-offs

Aspect | Benefit | Cost | Mitigation
------ | ------- | ---- | ----------
Watermarking by default | Provenance for generated content | Tiny quality impact (delta ≤ 2.0 is imperceptible) | Acceptable; production schemes are quality-preserving
Detector-key secrecy | Only the generator can detect | Closed verification makes external audit hard | Publish the methodology; share keys with trust-and-safety partners under NDA
Multi-signal review | Lower false-positive rate | More tooling to maintain | Use existing heuristic tools (GPTZero, AI Classifier); combine with watermark and provenance log
Provenance log | Indisputable on exact match | Privacy implications of logging all generated text | Hash-only logging; don’t store the actual text
Evasion resistance | Bounds the impact of standard transformations | No defense against original-model substitution | Realistic framing: watermarking catches lazy abuse, not motivated adversaries
Transparency | Calibrates public expectations | Sophisticated adversaries adjust | Acceptable; don’t oversell what watermarking does

Failure Modes

Failure | Symptom | Detection | Recovery
------- | ------- | --------- | --------
Detector key leaked | Adversary can craft non-watermarked text or evade detection | Detection rates drop while submissions grow | Rotate the key; old generations become undetectable (accepted); future generations use the new key
False positive on natural text | Innocent user accused | Human review catches it | Raise the z-threshold; recheck it against a known-human corpus periodically
Watermark erased by editing | Detection fails on genuinely AI-generated content | Score below threshold despite real AI origin | Acceptable failure mode; combine with other signals
Adversary uses a non-watermarked competitor model | No detection at all | Comparable-quality output that never trips the detector | Beyond watermarking’s scope; rely on stylometry and behavior analysis
Evasion tool widely deployed | Detection rates collapse | Population statistics shift | Update the watermark scheme; deploy multi-scheme detection
Confidence threshold miscalibrated | Too many escalations or too many misses | Operator feedback | Calibrate against a ground-truth labeled corpus; review quarterly
Watermark on quoted AI text | A human document quotes an AI passage; the whole document is flagged | Stylometric analysis helps disambiguate | Watermarks apply to passages; don’t action a whole document when only a passage is flagged

Realistic Expectations

Watermarking is one tool among many. Realistic expectations:

  • High confidence on direct, unedited AI-generated text. Real production utility for spam detection, content-policy enforcement.
  • Reduced confidence on paraphrased text. Useful as one signal in a multi-signal review.
  • Near-zero utility on heavily edited or translated text. Don’t claim coverage where none exists.
  • No protection against adversaries who control their own model. A determined attacker generates from a non-watermarking model.
  • Privacy: watermarking does not deanonymize users. It identifies that the text came from a watermarked model, not who generated it.

The use cases watermarking does serve:

  • Publishing platforms tagging AI-generated content for users.
  • Search engines reducing the rank of mass-produced AI content.
  • Academic / education contexts as one signal for academic-integrity review.
  • Trust-and-safety teams correlating with other signals to identify abuse campaigns.

The use cases watermarking does not serve:

  • Definitive “is this AI” judgment for individual borderline cases.
  • Legal evidence in disputes about authorship.
  • Replacement for human review in high-stakes decisions.