AI Model Output Watermarking: Provenance for Generated Text and Code
Problem
C2PA signs media files at creation. It works for images and video — the manifest sits in metadata, the signature is verifiable. For text and code, there is no metadata to sign; output is plain UTF-8 that any subsequent system handles as opaque content.
Watermarking is the textual equivalent: bias the model’s token sampling slightly so that generated text carries a statistical signature detectable by anyone with the watermark key. Detection is probabilistic: it works per passage, not per character, and confidence grows with length.
Three production approaches:
- Kirchenbauer-style green-list watermarking (Kirchenbauer et al., 2023). At each generation step, partition the vocabulary into a “green list” and “red list” using a hash of the previous N tokens and a secret key. Bias sampling toward the green list. A passage of generated text contains more green tokens than chance would predict; a detector with the key computes the green fraction and concludes “this was likely generated by our model.” (A minimal sketch of the mechanism follows this list.)
- SynthID (Google DeepMind, 2023). A more robust variant deployed in Gemini outputs. Preserves quality by modifying the sampling step (tournament sampling) rather than additively biasing logits.
- Aaronson’s 2022 unbiased scheme. Provably preserves the output distribution while embedding a detectable signal, which matters when watermarking must introduce no measurable quality loss.
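For concreteness, here is a minimal Python sketch of the green-list mechanism. Everything in it (the single-token context, the hashing choices, the function names) is illustrative, not the lm-watermarking implementation:

import hashlib
import math
import random

GAMMA, DELTA = 0.25, 2.0
KEY = b"secret-watermark-key"  # illustrative; in production this is a managed secret

def green_list(prev_token_id: int, vocab_size: int) -> set[int]:
    # Seed a PRNG from the previous token and the secret key, then take the
    # first GAMMA fraction of a keyed permutation of the vocabulary.
    digest = hashlib.sha256(KEY + prev_token_id.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * vocab_size)])

def bias_logits(logits: list[float], prev_token_id: int) -> list[float]:
    # Generation side: add DELTA to every green-list logit before sampling.
    for t in green_list(prev_token_id, len(logits)):
        logits[t] += DELTA
    return logits

def z_score(green_count: int, total: int) -> float:
    # Detection side: on human text each token is green with probability GAMMA,
    # so an excess of green tokens shows up as a large z-score.
    expected = GAMMA * total
    return (green_count - expected) / math.sqrt(total * GAMMA * (1 - GAMMA))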
By 2026, watermarking has shipped in production:
- Google Gemini outputs carry SynthID by default; Google’s detector is the only verifier, since the detection keys are not public.
- OpenAI implemented and publicly discussed (but didn’t widely deploy) a similar scheme.
- Anthropic’s research on watermarking continues; deployment status varies by model.
- Open-source watermarkers (lm-watermarking, watermark-evaluation) for self-hosted models.
The specific gaps in production deployments:
- Most applications using LLM APIs don’t request watermarking, even when the provider supports it.
- Detection requires the secret key — only the original generator can detect.
- Watermarks survive direct copying but degrade on editing (paraphrasing, translation, partial use).
- Adversaries running their own model can produce text with no watermark, which is then indistinguishable from “not-our-model output.”
- Watermark detection is statistical; small text snippets are hard to classify confidently.
This article covers the green-list scheme in concept, deployment patterns at the application layer, the detection workflow for trust-and-safety teams, evasion considerations, and the realistic expectations for what watermarking does and doesn’t solve.
Target systems: SynthID via Vertex AI / Gemini API, lm-watermarking research codebase, custom watermarking on self-hosted models (vLLM, Text Generation Inference). Detection: provider-specific APIs or lm-watermarking toolkit.
Threat Model
- Adversary 1 — Synthetic-content abuse: an attacker uses LLM-generated text for spam, disinformation, fake reviews, academic dishonesty. Wants the output indistinguishable from human-authored content.
- Adversary 2 — AI-content evader: knows watermarking exists; paraphrases, mixes languages, round-trips through machine translation, or runs the text through a different model to strip the signal.
- Adversary 3 — Provenance forger: wants to claim an LLM-generated document is human-authored to evade content-policy enforcement.
- Adversary 4 — False-positive abuser: weaponizes detection against innocent humans, e.g. by submitting borderline human-written content that happens to trip the detector (natural text with a chance excess of green tokens).
- Access level: all adversaries have only the LLM API or output access. No detection-key access.
- Objective: Use generated content without it being identifiable; cause false-positive provenance attributions.
- Blast radius: without watermarking, every LLM-generated document is unverifiable. With watermarking + correct detection workflow, a meaningful fraction of generated content can be identified — but never 100%, and never small snippets reliably.
Configuration
Pattern 1: Generation With SynthID (Gemini)
Vertex AI / Gemini outputs carry SynthID by default; no application-side configuration is required:
from vertexai.preview.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.0-pro")
response = model.generate_content(
    "Write a 500-word essay on Roman history.",
    generation_config={
        "temperature": 0.7,
        "max_output_tokens": 800,
    },
    # SynthID is on by default in current Gemini deployments;
    # there is no flag to set at the application layer.
)
print(response.text)
# The text carries an embedded SynthID watermark.
For Vertex AI’s text-detection API:
from google.cloud import aiplatform

# suspect_text: the text under review (str).
endpoint = aiplatform.Endpoint(
    "projects/myproj/locations/us-central1/endpoints/synthid-text-detector"
)
result = endpoint.predict(instances=[{"text": suspect_text}])
# The prediction result includes:
# - watermark_detected: bool
# - confidence: float (0.0-1.0)
# - models: list of models the watermark might be from
Treat the result as one signal among many; a confidence below 0.95 is not a definitive determination.
Pattern 2: Self-Hosted Watermarking (Green-List Scheme)
For self-hosted vLLM or Text Generation Inference deployments, integrate the open-source lm-watermarking package:
# generate_watermarked.py
from transformers import LogitsProcessorList
from lm_watermarking import WatermarkLogitsProcessor

processor = WatermarkLogitsProcessor(
    vocab=list(tokenizer.get_vocab().values()),  # list of token IDs, not the str->id dict
    gamma=0.25,  # fraction of vocab in the green list
    delta=2.0,   # additive bias toward green-list tokens
    seeding_scheme="simple_1",
    hash_key=KEY,  # secret key
)
output = model.generate(
    input_ids=tokens,
    logits_processor=LogitsProcessorList([processor]),
    max_new_tokens=400,
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
gamma=0.25 and delta=2.0 are standard recommendations. Higher delta = stronger signal but higher quality impact; lower gamma = more sensitive detection but more vulnerable to evasion.
Detection on the suspect text:
from lm_watermarking import WatermarkDetector

detector = WatermarkDetector(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,  # must match the generation-time gamma
    seeding_scheme="simple_1",
    hash_key=KEY,
    z_threshold=4.0,  # z-score threshold for "watermarked"
    tokenizer=tokenizer,
)
result = detector.detect(suspect_text)
# e.g. {"is_watermarked": True, "z_score": 6.7, "p_value": 0.000001,
#       "fraction_of_green_tokens": 0.42}  (exact field names vary by version)
A z-score of 4.0+ corresponds to roughly 1-in-30,000 chance of false positive on natural text. For high-stakes decisions, require z-score 5.0+.
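The mapping from z-score to false-positive probability is the one-sided tail of the standard normal, which you can check directly:

from math import erfc, sqrt

def fp_probability(z: float) -> float:
    # P(Z >= z) on non-watermarked text, under the normal approximation.
    return 0.5 * erfc(z / sqrt(2))

print(fp_probability(4.0))  # ~3.2e-05, roughly 1 in 31,600
print(fp_probability(5.0))  # ~2.9e-07, roughly 1 in 3.5 million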
Pattern 3: Multi-Model Watermarking
Different models have different watermark schemes. To detect across the fleet, run multiple detectors:
def detect_any_watermark(text):
    # our_detector and scheme_detectors are assumed to be configured
    # detector instances, keyed by scheme name.
    results = {}
    # Try our internal watermark first.
    results["internal"] = our_detector.detect(text)
    # Then common open-source schemes.
    for scheme in ["llama-watermark", "synthid-open-text", "binoculars"]:
        results[scheme] = scheme_detectors[scheme].detect(text)
    return {
        "any_watermark": any(r["is_watermarked"] for r in results.values()),
        "scores": results,
    }
A passage flagged by any detector is suspicious; one flagged by multiple detectors is AI-generated with high confidence.
Pattern 4: Application-Layer Generation Logging
Even without watermarking, log every generation event server-side. The hash of generated text + the generation context (model, tokens, generation timestamp, user ID) is your authoritative record.
import hashlib, json
from datetime import datetime, timezone

def log_generation(user_id, prompt, response, model_version):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model_version,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        "response_length": len(response),
    }
    audit_log.write(json.dumps(record))  # audit_log: your append-only log sink
Later, given a suspect document, compute its SHA-256 and search the log. Exact matches confirm “this text was generated by our system at time T for user X.” Watermarking is one signal; immutable provenance logs are another.
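A hedged sketch of the lookup side, assuming the records land in a store you can query by hash (find_by_response_hash is a hypothetical query API):

import hashlib

def provenance_log_search(suspect_text: str):
    # Exact match only: any edit to the text changes the hash.
    digest = hashlib.sha256(suspect_text.encode()).hexdigest()
    return audit_log.find_by_response_hash(digest)  # hypothetical query API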
Pattern 5: Detection Workflow Integration
For trust-and-safety review:
def review_submission(submission):
    signals = {}
    # Watermark detection.
    wm_result = detect_any_watermark(submission.text)
    signals["watermark"] = wm_result
    # Heuristic AI-detection (AI Classifier, GPTZero, etc.).
    signals["ai_classifier"] = ai_classifier_score(submission.text)
    # Stylometric analysis vs. the user's prior submissions.
    signals["stylometry"] = stylometry_compare(
        submission.text, user_baseline(submission.user_id)
    )
    # Provenance log lookup.
    signals["provenance_log_match"] = provenance_log_search(submission.text)
    # Combine signals; no single signal is decisive.
    score = combine_signals(signals)
    if score >= HIGH_CONFIDENCE_THRESHOLD:
        return Action.ESCALATE_TO_HUMAN
    elif score >= MEDIUM_CONFIDENCE:
        return Action.LABEL_AS_AI
    return Action.ALLOW
Multiple signals reduce false-positive rate; no single signal is decisive enough to action against a user.
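One possible shape for combine_signals, purely illustrative: the weights are assumptions, and the sketch assumes ai_classifier returns a probability in [0, 1] and stylometry a similarity in [0, 1]. Calibrate against a labeled corpus before using anything like this:

def combine_signals(signals) -> float:
    # Illustrative weights only; calibrate against ground truth (see Failure Modes).
    if signals["provenance_log_match"]:
        return 1.0  # exact hash match: the text came from our system verbatim
    score = 0.0
    if signals["watermark"]["any_watermark"]:
        score += 0.5
    score += 0.3 * signals["ai_classifier"]       # heuristic classifier probability
    score += 0.2 * (1.0 - signals["stylometry"])  # low similarity to the user's baseline
    return score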
Pattern 6: User Disclosure and Calibration
Where the application itself generates content (a writing assistant, a code generator), be explicit:
# Generated-content disclosure.
generated_content_metadata:
  generator: "MyCorp AI Assistant"
  model_version: "claude-opus-4-7"
  watermark_scheme: "synthid-text-v2"
  generation_timestamp: "2026-04-29T10:00:00Z"
  user_attribution: "user-12345"
Embed this in HTTP response headers, document metadata, JSON response — wherever convention allows. Watermarks are insurance against tampering; disclosure is the primary signal.
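For the HTTP case, a minimal sketch; the endpoint, the run_model helper, and the header names are all illustrative, not a standard:

from fastapi import FastAPI, Response

app = FastAPI()

@app.post("/generate")
def generate(prompt: str, response: Response):
    text = run_model(prompt)  # hypothetical generation call
    # Attach the disclosure alongside the content, not buried inside it.
    response.headers["X-Generated-By"] = "MyCorp AI Assistant"
    response.headers["X-Generation-Model"] = "claude-opus-4-7"
    response.headers["X-Watermark-Scheme"] = "synthid-text-v2"
    return {"text": text}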
Pattern 7: Evasion-Resistance Testing
Periodically test how well your watermark survives common transformations:
def evasion_test(text):
    # paraphrase_with, shuffle_sentences, translate, synonym_substitute, and
    # delete_random_words are your transformation helpers.
    transformations = {
        "paraphrase_via_other_llm": paraphrase_with(text, model="competitor-model"),
        "sentence_shuffle": shuffle_sentences(text),
        "back_translation": translate(translate(text, "fr"), "en"),
        "synonym_substitution": synonym_substitute(text, rate=0.15),
        "delete_random_5pct": delete_random_words(text, 0.05),
    }
    results = {}
    for name, transformed in transformations.items():
        results[name] = our_detector.detect(transformed)
    return results
A robust scheme survives some transformations (synonym substitution at low rate, partial deletion); none survives heavy paraphrasing or translation cycles. Knowing the survival rate informs how much weight to give the signal.
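A small helper to turn those per-transformation results into a survival rate you can track over time (assumes each result carries a z_score field, as in Pattern 2):

def watermark_survival(results, z_threshold=4.0):
    # Fraction of transformations after which the watermark is still detectable.
    survived = [name for name, r in results.items() if r["z_score"] >= z_threshold]
    return len(survived) / len(results), survived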
Pattern 8: Telemetry
ai_watermark_generations_total{model, scheme}
ai_watermark_detections_total{outcome, scheme}
ai_watermark_detection_score_histogram
ai_watermark_false_positive_total (when corrected by human review)
ai_watermark_evasion_attempts_total (heuristic; rising signal of attack)
Alert on:
- false_positive_total rising: detector calibration is off; review the threshold.
- A pattern of evasion attempts: an adversarial campaign is in progress.
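A minimal instrumentation sketch using prometheus_client; the help strings, label values, and bucket boundaries are illustrative:

from prometheus_client import Counter, Histogram

GENERATIONS = Counter(
    "ai_watermark_generations_total",
    "Watermarked generation events", ["model", "scheme"],
)
DETECTIONS = Counter(
    "ai_watermark_detections_total",
    "Watermark detection runs", ["outcome", "scheme"],
)
SCORES = Histogram(
    "ai_watermark_detection_score_histogram",
    "Detection z-scores", buckets=[0, 1, 2, 3, 4, 5, 6, 8, 10],
)

# result: a detector output as in Pattern 2.
GENERATIONS.labels(model="internal-llm", scheme="green-list").inc()
SCORES.observe(result["z_score"])
DETECTIONS.labels(
    outcome="watermarked" if result["is_watermarked"] else "clean",
    scheme="green-list",
).inc()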
Expected Behaviour
| Signal | Without watermarking | With watermarking |
|---|---|---|
| Direct-copy detection | Stylometric guess only | High-confidence signal |
| Paraphrased copy | Hard | Reduced; still possible at moderate paraphrase |
| Heavy translation cycle | Impossible | Watermark erased |
| Adversary uses different model | No detection | No detection (this watermark is model-specific) |
| Application-layer provenance log | Possible exact-match hashes | Same |
| Combined signal accuracy | Single-signal decisions | Multi-signal; lower false-positive rate |
Trade-offs
| Aspect | Benefit | Cost | Mitigation |
|---|---|---|---|
| Watermarking by default | Provenance for generated content | Tiny quality impact (delta ≤ 2.0 imperceptible) | Acceptable; production schemes are quality-preserving. |
| Detector key secrecy | Only the generator can detect | Closed verification means external audit is hard | Publish methodology; share keys with trust-and-safety partners under NDA. |
| Multi-signal review | Lower false-positive rate | More tooling to maintain | Use existing tools (GPTZero, AI Classifier) for heuristic; combine with watermark + provenance log. |
| Provenance log | Indisputable on exact match | Privacy implications (logging all generated text) | Hash-only logging; don’t store the actual text. |
| Evasion-resistance | Standard transformations bounded | No defense against original-model substitution | The realistic frame: watermarking catches lazy abuse, not motivated adversaries. |
| Transparency | Calibration of public expectations | Sophisticated adversaries adjust | Acceptable; don’t oversell what watermarking does. |
Failure Modes
| Failure | Symptom | Detection | Recovery |
|---|---|---|---|
| Detector key leaked | Adversary can craft non-watermarked text or evade | Detection rates drop while submissions grow | Rotate the key; old generations become undetectable, an accepted loss; future generations use the new key. |
| False-positive on natural text | Innocent user accused | Human-review identifies | Raise z-threshold; review threshold against known-human corpus periodically. |
| Watermark erased by editing | Detection fails on legitimately-AI-generated content | Score below threshold despite real AI origin | Acceptable failure mode; combine with other signals. |
| Adversary uses non-watermarked competitor model | No detection at all | Comparable quality output that doesn’t trip detector | Beyond watermarking’s scope; rely on stylometry + behavior analysis. |
| Evasion-tool widely deployed | Detection rates collapse | Population statistics shift | Update watermark scheme; deploy multi-scheme detection. |
| Confidence threshold mis-calibrated | Too many escalations or too many missed | Operator feedback | Calibrate against ground-truth labeled corpus; review quarterly. |
| Watermark on quoted-from-AI text | A human document quotes an AI passage; entire doc flagged | Stylometric analysis would help | Watermark applies to passages; don’t action on whole document if only a passage is flagged. |
Realistic Expectations
Watermarking is one tool among many. Realistic expectations:
- High confidence on direct, unedited AI-generated text. Real production utility for spam detection, content-policy enforcement.
- Reduced confidence on paraphrased text. Useful as one signal in a multi-signal review.
- Near-zero utility on heavily-edited or translated text. Don’t claim coverage where none exists.
- No protection against adversaries who control their own model. A determined attacker generates from a non-watermarking model.
- Privacy: watermarking does not deanonymize users. It identifies that the text came from a watermarked model, not who generated it.
The use cases watermarking does serve:
- Publishing platforms tagging AI-generated content for users.
- Search engines reducing the rank of mass-produced AI content.
- Academic / education contexts as one signal for academic-integrity review.
- Trust-and-safety teams correlating with other signals to identify abuse campaigns.
The use cases watermarking does not serve:
- Definitive “is this AI” judgment for individual borderline cases.
- Legal evidence in disputes about authorship.
- Replacement for human review in high-stakes decisions.