On-Device WASM Model Extraction: Defending Transformers.js and ONNX-WASM Against Weight Stealing

The Problem

On-device AI inference via WebAssembly has moved from research curiosity to production reality. Three frameworks dominate the space:

Transformers.js (Hugging Face, @huggingface/transformers ≥ 3.0; earlier releases shipped as @xenova/transformers): runs ONNX-format models in the browser using the ONNX Runtime Web WASM backend, with optional WebGPU acceleration. End users run Whisper (speech recognition), BERT (embeddings), GPT-2, Phi-3-mini, and Llama 3.2-1B entirely in the browser tab. Models are fetched as .onnx files and stored through the browser’s Cache Storage API. A typical Transformers.js pipeline call resolves to a sequence of fetch() requests for model files that land in the browser cache with no special protection.

MLC-LLM (WebGPU/WASM backend, mlc-ai/web-llm ≥ 0.2.46): compiles LLMs — Llama 3, Phi-3, Mistral 7B, Gemma 2 — to WASM + WebGPU compute shaders. Model weights are downloaded in MLC’s quantized raw tensor format (.bin shards), then loaded into WebGPU buffers and WASM linear memory for inference. The weight files are structurally simple: quantized floating-point tensors with a flat layout.

ONNX Runtime Web (Microsoft, onnxruntime-web ≥ 1.20): Microsoft’s cross-platform ONNX runtime with a WASM backend. Used widely for embedding models, vision classifiers, and small instruction-tuned models in enterprise web applications. Shares the same .onnx file format as Transformers.js.

The value proposition for all three is real: no server-side inference cost, privacy (user data never leaves the device), sub-100ms latency for small models. The security problem is equally real: the model weights are now on the client device, in a format any JavaScript context can read. For a fine-tuned proprietary model — domain-specific legal reasoning, medical triage, financial analysis, security vulnerability detection — those weights represent millions of dollars of training cost and competitive advantage.

Attack Surface 1: Direct Weight File Download

Transformers.js downloads model files via fetch() into the browser cache before execution:

// This is what Transformers.js does internally when you call pipeline()
const pipe = await pipeline('text-generation', 'your-org/proprietary-model');
// Internally executes approximately:
// fetch('https://your-cdn.example.com/models/proprietary-model/onnx/decoder_model.onnx')
// → stored in Cache Storage under the fetch URL as the cache key
// → also stored in IndexedDB for metadata

Open DevTools → Application → Cache Storage → expand your origin → the .onnx file is listed with its full URL. Right-click → copy URL, then from any terminal:

wget -O stolen_model.onnx "https://your-cdn.example.com/models/proprietary-model/onnx/decoder_model.onnx"

The ONNX format is a Protocol Buffer file (onnx.proto). The computation graph and all weight tensors are stored in a single binary. There is no encryption, no access control enforced by the ONNX container itself. Anyone who can observe the fetch() URL — in DevTools, via a network proxy, via a browser extension with webRequest permission — can download the full model file without executing it at all. The download does not require running the web app, authenticating, or triggering any server-side logic. For .onnx files served from a CDN as static assets, the URL is often guessable from the Transformers.js model ID pattern.

To read the stolen weights in Python:

import onnx
from onnx import numpy_helper

model = onnx.load("stolen_model.onnx")
for initializer in model.graph.initializer:
    # to_array() already returns a NumPy array; no extra conversion is needed
    weights = numpy_helper.to_array(initializer)
    print(f"{initializer.name}: shape={weights.shape}, dtype={weights.dtype}")
# Every weight tensor stored in the file is readable this way.

Attack Surface 2: WASM Linear Memory Dump

WASM’s memory model is a flat, contiguous ArrayBuffer exposed via WebAssembly.Memory. The specification design makes this safe between origins — cross-origin WASM memory is not accessible — but it provides no isolation within the same origin. For MLC-LLM and custom WASM inference engines, model weights are loaded into this flat buffer for inference. From JavaScript running in the same page:

// If you have a reference to the WebAssembly.Instance (sometimes reachable
// from the page's window object or through the framework's API):
const wasmInstance = mlcEngine._wasm_instance;  // illustrative property name; the real path depends on the framework build
const wasmMemory = wasmInstance.exports.memory;
const buffer = new Uint8Array(wasmMemory.buffer);

// Snapshot the entire WASM linear memory to a file:
const blob = new Blob([buffer], { type: 'application/octet-stream' });
const a = document.createElement('a');
a.href = URL.createObjectURL(blob);
a.download = 'wasm_memory_dump.bin';
a.click();
// Done. The dump contains all model weights currently loaded.

This attack executes in a browser console with the web app open, in a browser extension with <all_urls> permissions and the activeTab grant, or via any JavaScript injected into the page — including XSS payloads and compromised CDN scripts. The WASM specification offers no JavaScript-visible memory protection mechanism. For MLC-LLM’s Llama 3.2-1B model, the weight dump is approximately 800MB of quantized tensors in a recoverable format. Larger models shard across multiple loads, but the attacker can dump after each shard loads.

The MLC-LLM model format stores a mlc-chat-config.json alongside the weight shards that specifies the architecture (number of layers, attention heads, hidden dimension, quantization scheme). The WASM dump plus this JSON is sufficient to reconstruct a fully operational model.

Attack Surface 3: Query-Based Model Extraction

Even without direct weight access, an attacker who can invoke a locally running WASM model can extract a functionally equivalent surrogate. For server-side models, query-based extraction is slowed by API rate limits, per-query costs, and telemetry that flags abnormal query patterns. For an on-device WASM model:

No API rate limit: the model runs locally; queries cost only compute time on the attacker’s hardware
Full output logits accessible: most ONNX/Transformers.js deployments return top-k tokens or full logit vectors, not just top-1 outputs; this is necessary for beam search but provides rich signal for extraction
No authentication per query: once the model is loaded, the inference API (session.run() for ONNX Runtime, pipeline() for Transformers.js) requires no token validation
Deterministic outputs: WASM models running without temperature or with fixed seed are fully deterministic, making membership inference and gradient estimation more reliable

The Tramèr et al. model extraction attack (2016) demonstrated that a model’s decision boundary can be reconstructed from input/output pairs by solving for weight matrices that reproduce the observed output behaviour. For transformer-based models, more recent work shows that systematic probing with sequences derived from token frequency analysis can recover attention weight patterns with high fidelity. For a classification model with vocabulary size 50,000 and 12 attention layers, on the order of 500,000 to 2,000,000 queries can be sufficient for meaningful weight reconstruction, which is achievable in hours on a modern laptop running the WASM model locally, with no per-query cost.

The ONNX model format makes the architecture fully visible (number of layers, dimensions, weight shapes are encoded in the graph topology), so the attacker knows exactly what they are reconstructing. They are not guessing the architecture — they are solving for the specific weight tensors in a known architecture.
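The equation-solving idea can be illustrated with a minimal sketch, assuming a toy softmax-linear classifier rather than a transformer. The secret weights, the victim() function, and the dimensions are all illustrative assumptions; the point is only that full probability outputs plus a known architecture turn extraction into a least-squares problem. Real transformer extraction is non-linear and needs far more queries.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
d, k = 8, 4                                # input features, output classes
W_secret = rng.normal(size=(d, k))         # stands in for proprietary weights
b_secret = rng.normal(size=k)


def victim(x: np.ndarray) -> np.ndarray:
    """Stand-in for a local inference call that returns full class probabilities."""
    return softmax(x @ W_secret + b_secret)


def extract(query_fn, n_features: int, n_queries: int, rng):
    """Fit a surrogate (W, b) from query/response pairs by least squares."""
    X = rng.normal(size=(n_queries, n_features))
    P = query_fn(X)
    # Log-odds relative to class 0 are linear in x, so each column of Y is a
    # linear regression target. Softmax is invariant to a shared shift, so
    # fixing class 0 at zero loses nothing functionally.
    Y = np.log(P) - np.log(P[:, [0]])
    X_aug = np.hstack([X, np.ones((n_queries, 1))])
    theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return theta[:-1], theta[-1]


W_hat, b_hat = extract(victim, d, n_queries=4 * (d + 1), rng=rng)

# Compare the surrogate with the victim on fresh inputs.
X_test = rng.normal(size=(1000, d))
max_gap = float(np.abs(victim(X_test) - softmax(X_test @ W_hat + b_hat)).max())
print(f"largest probability difference on held-out inputs: {max_gap:.2e}")
```

The surrogate is recovered from a few dozen queries here because the toy model has only a few dozen free parameters; the query budget grows with parameter count and with how much of the output distribution the deployment exposes.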

Why On-Device Changes the Threat Model Fundamentally

Server-side model inference keeps weights on hardware the vendor controls, behind network access controls, with authentication on every query, with egress monitoring, and with the ability to revoke access. The attacker must exfiltrate information through the inference API output — a narrow channel.

On-device deployment inverts every dimension of that control:

Control	Server-Side	On-Device WASM
Weight file access	No network access to weight storage	Weights downloadable as static files
Memory access	Isolated process with OS-level memory protection	WASM linear memory readable by same-origin JS
Query rate limit	Enforced at API layer	No enforcement; unlimited local queries
Authentication per inference	Required	None after initial load
Exfiltration detection	Network egress monitoring	Attacker-controlled local execution
Key rotation	Rotate server credentials	Attacker retains extracted weights permanently

This is not a misconfiguration — it is the architecture. Shipping the model to the device means the device has the model. Defences are about raising extraction cost, limiting the value of what is extracted, and detecting extraction after the fact.

Threat Model

T1: Direct static file download by any visitor. The .onnx or MLC weight shard files are served as unauthenticated static assets from a CDN. Any visitor observing the network requests in DevTools has the URL. wget or curl retrieves the full model file with no authentication. The attacker does not need to interact with the web app at all. This is the lowest-effort, highest-yield attack. A fine-tuned proprietary model worth $10M in training cost is available to any visitor to the web app.

T2: XSS dumps WASM memory. A stored or reflected XSS vulnerability in the web app hosting the WASM model allows the attacker to inject a JavaScript payload into the page. The injected script reads WebAssembly.Memory.buffer from the running inference engine, assembles the weight tensors, and exfiltrates them via a fetch() to attacker-controlled infrastructure. The attack chain: find XSS → inject memory dump script → model weights exfiltrated to external server → attacker reconstructs model offline.

T3: Browser extension reads WASM memory. A malicious browser extension with the tabs permission and content_scripts matching the target origin can inject JavaScript into any page on that origin, identical in capability to an XSS payload. Extensions installed by enterprise users on managed devices, extensions that appear legitimate but are compromised, and extensions distributed via Chrome Web Store account compromise all fall into this category. The extension developer model explicitly grants same-origin JavaScript execution; it cannot be blocked without restricting extension permissions entirely.

T4: Query-based extraction with zero API cost. An authorised user of the web app runs automated extraction queries against the locally loaded model. No rate limiting, no per-query authentication, no usage monitoring exists at the WASM layer. The attacker queries the model millions of times using crafted inputs (adversarial examples, vocabulary boundary probes, gradient estimation prompts) to train a surrogate model. The surrogate replicates the behaviour of the original model with sufficient fidelity to compete commercially. This attack does not require the attacker to ever exfiltrate the weight file — they exfiltrate through the model’s outputs.

T5: Competitor extracts fine-tuned confidential model via authorised access. The attacker is a paying customer of the web application. They have legitimate access to the on-device model to perform their normal workload. Using that legitimate access, they download the model file, dump WASM memory, or run query-based extraction over time. The fine-tuned model captures proprietary training data, domain adaptation, curated datasets, RLHF labelling from human experts — none of which is disclosed in the model weights directly, but all of which is reconstructible through systematic probing of the extracted model using membership inference attacks.

Hardening Configuration

1. Serve Models via Authenticated Time-Limited Signed URLs

The lowest-effort defence: stop serving model files as unauthenticated static assets. Every request for a model weight file should require authentication and produce an auditable access record.

# Flask backend: generate time-limited signed URLs for model downloads
# boto3 >= 1.26.0
# user_has_model_access() and audit_model_access() are application-specific
# helpers assumed to be defined elsewhere in the backend.
import boto3
from datetime import datetime, timezone
from typing import Optional

s3 = boto3.client('s3', region_name='us-east-1')

def generate_model_url(
    model_id: str,
    user_id: str,
    shard_index: Optional[int] = None,
    ttl_seconds: int = 3600
) -> str:
    """
    Issue a time-limited signed URL for a single model file or shard.
    URL expires after ttl_seconds. Audit log records who requested what and when.
    """
    if not user_has_model_access(user_id, model_id):
        raise PermissionError(
            f"User {user_id} does not have access to model tier {model_id}"
        )

    # Construct the S3 object key
    if shard_index is not None:
        object_key = f"models/{model_id}/weights/shard_{shard_index:04d}.bin"
    else:
        object_key = f"models/{model_id}/model.onnx"

    url = s3.generate_presigned_url(
        'get_object',
        Params={
            'Bucket': 'proprietary-models-prod',
            'Key': object_key,
        },
        ExpiresIn=ttl_seconds
    )

    # Audit log: user, model, shard, object key, timezone-aware UTC timestamp
    audit_model_access(
        user_id=user_id,
        model_id=model_id,
        shard_index=shard_index,
        object_key=object_key,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

    return url


def detect_bulk_download(user_id: str, model_id: str) -> bool:
    """
    Flag users who request more URLs than a normal usage pattern requires.
    A user legitimately loading a model requests each shard URL once per session.
    Requesting the same shard URL 5+ times in 24h indicates automated extraction.
    """
    recent_requests = get_audit_records(user_id, model_id, hours=24)
    shard_request_counts = {}
    for record in recent_requests:
        key = record['shard_index']
        shard_request_counts[key] = shard_request_counts.get(key, 0) + 1

    return any(count >= 5 for count in shard_request_counts.values())

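This does not prevent an authorised user from downloading the model, but it adds authentication enforcement, an audit trail, and a detection layer. Note that a presigned URL is a bearer credential: anyone who obtains it can use it until it expires, so it is not bound to the user who requested it. Keep ttl_seconds short (minutes rather than the one-hour default shown) so that a leaked URL has little value, and require re-authentication for every new URL.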
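To make the mechanics of a time-limited signed URL concrete, here is a minimal sketch using only the standard library. It is not S3's SigV4 scheme; the secret, the query parameter names, and the sign_url/verify_url helpers are hypothetical and exist only to show what the server checks: an expiry timestamp plus an HMAC over the path, user, and expiry.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret-change-me"  # hypothetical signing key, kept server-side


def _signature(path: str, user_id: str, expires: int) -> str:
    payload = f"{path}|{user_id}|{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_url(path: str, user_id: str, ttl_seconds: int,
             now: Optional[float] = None) -> str:
    """Return path plus query parameters carrying user, expiry and signature."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({
        "user": user_id,
        "expires": expires,
        "sig": _signature(path, user_id, expires),
    })
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        user_id = params["user"][0]
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    expected = _signature(parsed.path, user_id, expires)
    return hmac.compare_digest(sig, expected)


demo_url = sign_url("/models/m1/model.onnx", "u_123", ttl_seconds=300)
print(demo_url, verify_url(demo_url))
```

Including the user ID in the signed payload gives attribution in the access log, but the URL still works for whoever holds it until the expiry passes, which is why a short TTL matters.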

On the client side, Transformers.js builds model URLs from env.remoteHost and env.remotePathTemplate and downloads them with the global fetch(). Wrapping globalThis.fetch therefore lets you intercept model requests and exchange each one for a signed URL issued by your backend:

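import { env, pipeline } from '@huggingface/transformers';  // v2 releases: '@xenova/transformers'

// getSessionToken() is an application-specific helper assumed to exist.
// Wrap the global fetch so that model requests are exchanged for signed URLs.
const originalFetch = globalThis.fetch.bind(globalThis);
globalThis.fetch = async (input: RequestInfo | URL, options: RequestInit = {}) => {
    // fetch() accepts a string, a URL, or a Request; normalise to a string
    const url = typeof input === 'string' ? input
        : input instanceof URL ? input.href
        : input.url;
    if (url.includes('your-cdn.example.com/models/')) {
        // Obtain a signed URL from your backend before downloading
        const signedUrl = await getSignedUrl(url, await getSessionToken());
        return originalFetch(signedUrl, options);
    }
    return originalFetch(input, options);
};

async function getSignedUrl(originalUrl: string, sessionToken: string): Promise<string> {
    const response = await originalFetch('/api/model-url', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'Authorization': `Bearer ${sessionToken}`,
        },
        body: JSON.stringify({ url: originalUrl }),
    });
    if (!response.ok) {
        throw new Error(`Signed URL request failed: ${response.status}`);
    }
    const { signedUrl } = await response.json();
    return signedUrl;
}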

2. Model Watermarking for Weight Theft Detection and Attribution

Signed URLs prevent casual download but not extraction by authorised users. Watermarking shifts the defence posture from prevention to detection and attribution: embed a unique, user-specific or session-specific mark in the model weights before distribution. When a stolen copy surfaces — uploaded to Hugging Face, open-sourced by a competitor, found in a commercial product — the watermark identifies which authorised download the weights came from.

The watermark must survive: weight quantisation, fine-tuning on top of the extracted model, ONNX format round-tripping, and minor post-processing. Existing techniques such as those from Uchida et al. embed the watermark in specific weight layer statistics that survive typical downstream operations.

import hashlib
from typing import Optional

import numpy as np
import onnx
from onnx import numpy_helper


def _rng_for_key(watermark_key: str) -> np.random.RandomState:
    """Derive a deterministic RNG from the watermark key (32-bit seed)."""
    key_hash = hashlib.sha256(watermark_key.encode()).digest()
    return np.random.RandomState(int.from_bytes(key_hash[:4], 'big'))


def embed_watermark(
    model_path: str,
    output_path: str,
    watermark_key: str,
    perturbation_std: float = 0.00008
) -> None:
    """
    Embed a per-user watermark into model weights as a pseudo-random perturbation.
    perturbation_std is an absolute standard deviation (it is not scaled by the
    weight magnitude). Choose it small enough that users notice no quality
    degradation; the pattern is statistically unique per watermark_key.
    """
    model = onnx.load(model_path)
    rng = _rng_for_key(watermark_key)

    for initializer in model.graph.initializer:
        weights = numpy_helper.to_array(initializer)
        if weights.dtype not in (np.float32, np.float16):
            continue
        noise = rng.normal(0, perturbation_std, weights.shape).astype(weights.dtype)
        watermarked = weights + noise
        # Overwrite the tensor in place rather than removing and re-inserting,
        # which would mutate the initializer list while iterating over it.
        initializer.CopyFrom(numpy_helper.from_array(watermarked, name=initializer.name))

    onnx.save(model, output_path)


def verify_watermark(
    suspect_model_path: str,
    watermark_key: str,
    perturbation_std: float = 0.00008,
    correlation_threshold: float = 0.15,
    reference_model_path: Optional[str] = None,
) -> dict:
    """
    Test whether a suspect model contains the watermark for watermark_key.
    Returns a dict with the mean per-layer correlation and a verdict.
    If reference_model_path (the un-watermarked original) is given, the pattern
    is correlated with (suspect - reference), which isolates the perturbation.
    Without it, the pattern is correlated with the raw weights, a much weaker
    test because the weights themselves dominate the signal.
    """
    model = onnx.load(suspect_model_path)
    reference = {}
    if reference_model_path is not None:
        ref_model = onnx.load(reference_model_path)
        reference = {t.name: numpy_helper.to_array(t) for t in ref_model.graph.initializer}
    rng = _rng_for_key(watermark_key)

    correlations = []
    for initializer in model.graph.initializer:
        weights = numpy_helper.to_array(initializer)
        if weights.dtype not in (np.float32, np.float16):
            continue
        # Draw the pattern for every float tensor so the RNG stream stays aligned
        # with embed_watermark(), even for tensors that are skipped below.
        expected_pattern = rng.normal(0, perturbation_std, weights.shape).astype(np.float32)
        w_flat = weights.flatten().astype(np.float32)
        ref = reference.get(initializer.name)
        if ref is not None and ref.shape == weights.shape:
            w_flat = w_flat - ref.flatten().astype(np.float32)
        p_flat = expected_pattern.flatten()
        if len(w_flat) < 100 or np.std(w_flat) == 0:
            continue
        corr = float(np.corrcoef(w_flat, p_flat)[0, 1])
        correlations.append(abs(corr))

    if not correlations:
        return {'verdict': 'inconclusive', 'mean_correlation': 0.0}

    mean_corr = float(np.mean(correlations))
    return {
        'verdict': 'watermark_present' if mean_corr > correlation_threshold else 'watermark_absent',
        'mean_correlation': mean_corr,
        'layers_checked': len(correlations),
    }

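Generate per-user watermarked variants at download time and store the (user_id, watermark_key) mapping. When a suspected stolen model is found, run verify_watermark() against all issued keys. Treat the 0.15 threshold as a starting point rather than a guarantee: calibrate it by verifying against keys that were never issued and measuring how often they exceed the threshold. Detection is considerably stronger when the suspect weights are first compared with the un-watermarked original, because correlating the pattern against raw weights lets the weights themselves swamp a perturbation this small. The additive scheme is also a baseline: it survives ONNX round-tripping, but aggressive quantisation or fine-tuning can wash it out, which is the case the more robust embedding techniques are designed for.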

The limitation: watermarking works for attribution after the fact. It does not prevent the theft. It is useful for legal action, for communicating to the user community that a specific leak occurred, and for understanding which access control failure was exploited.
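The difference between correlating against raw weights and correlating against the residual (suspect minus original) can be seen in a small NumPy-only sketch. Everything here is an illustrative assumption: a single synthetic 256×256 weight matrix with standard deviation 0.02, the keys "user:alice" and "user:bob", and added noise standing in for light post-processing of the leaked copy.

```python
import hashlib

import numpy as np


def pattern_for(key: str, shape, std: float) -> np.ndarray:
    """Deterministic pseudo-random pattern derived from a watermark key."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    return np.random.RandomState(seed).normal(0.0, std, shape).astype(np.float32)


def residual_score(suspect: np.ndarray, original: np.ndarray, key: str, std: float) -> float:
    """Correlate (suspect - original) with the pattern for `key`."""
    residual = (suspect - original).ravel()
    pattern = pattern_for(key, suspect.shape, std).ravel()
    return float(np.corrcoef(residual, pattern)[0, 1])


def blind_score(suspect: np.ndarray, key: str, std: float) -> float:
    """Correlate the raw suspect weights with the pattern (no reference model)."""
    pattern = pattern_for(key, suspect.shape, std).ravel()
    return float(np.corrcoef(suspect.ravel(), pattern)[0, 1])


rng = np.random.default_rng(1)
std = 8e-5
original = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

# Copy issued to alice, then perturbed again to mimic light post-processing.
leaked = original + pattern_for("user:alice", original.shape, std)
leaked = leaked + rng.normal(0.0, std, original.shape).astype(np.float32)

score_alice = residual_score(leaked, original, "user:alice", std)
score_bob = residual_score(leaked, original, "user:bob", std)
blind_alice = blind_score(leaked, "user:alice", std)

print(f"residual, correct key: {score_alice:.3f}")
print(f"residual, wrong key:   {score_bob:.3f}")
print(f"blind, correct key:    {blind_alice:.3f}")
```

With the reference available, the correct key stands out clearly from an unrelated key, while the blind score for the correct key is hard to distinguish from noise at this perturbation size. Keeping the un-watermarked original on hand for verification is therefore part of the scheme, not an optional extra.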

3. Cross-Origin Isolation Headers to Restrict Memory Access Surface

The COEP and COOP headers enable cross-origin isolation, which restricts which execution contexts can share a process with your web app. This does not prevent same-origin JavaScript from reading WASM memory, but it removes the cross-origin iframe and SharedArrayBuffer attack surfaces, and modern browsers require it before they expose SharedArrayBuffer, which multi-threaded WASM backends (for example ONNX Runtime Web running with more than one thread) depend on.

# nginx: apply to the directory serving the WASM-based AI application
location /ai-app/ {
    # Cross-origin isolation: prevents cross-origin resources from accessing
    # this page's memory via SharedArrayBuffer timing attacks.
    add_header Cross-Origin-Embedder-Policy "require-corp" always;
    add_header Cross-Origin-Opener-Policy "same-origin" always;

    # Content Security Policy: restrict which scripts can execute in this page.
    # Keep the value on one line: a multi-line string puts raw newlines into the header.
    # $request_id (random hex, unique per request) serves as the nonce here; the same
    # value must be written into each <script nonce="..."> tag, for example with sub_filter.
    # 'strict-dynamic' extends trust from nonced scripts to the scripts they load.
    # 'wasm-unsafe-eval' permits WebAssembly compilation without allowing eval();
    # there is no wasm-src directive.
    add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'nonce-$request_id' 'strict-dynamic' 'wasm-unsafe-eval'; connect-src 'self' https://your-cdn.example.com; worker-src 'self' blob:" always;

    # Subresource Integrity for the WASM runtime itself.
    # If the WASM bundle is served from a CDN, SRI prevents CDN compromise
    # from substituting a memory-dumping WASM module.
    # (Set the actual hash in your HTML: <script src="..." integrity="sha384-...">)
}

The Content-Security-Policy with script-src 'nonce-{nonce}' is the most important control here. XSS that injects a <script src="..."> tag from an external host is blocked unless the injected script carries the page-specific nonce — which an attacker cannot predict. This closes the primary path from XSS to WASM memory dump. Inline XSS payloads (<img onerror="...">) are blocked by the absence of 'unsafe-inline' in script-src.
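For deployments where the application server rather than the proxy issues the nonce, the per-response flow can be sketched as follows. This is a minimal illustration, assuming hypothetical helpers new_nonce(), build_csp() and render_script_tag(); the directive list mirrors the policy discussed here, with 'wasm-unsafe-eval' added on the assumption that the page compiles WebAssembly under CSP.

```python
import base64
import secrets


def new_nonce() -> str:
    """Return a fresh, unpredictable nonce (128 bits, base64-encoded)."""
    return base64.b64encode(secrets.token_bytes(16)).decode("ascii")


def build_csp(nonce: str) -> str:
    """Build a single-line Content-Security-Policy value for one response."""
    directives = [
        "default-src 'self'",
        f"script-src 'nonce-{nonce}' 'strict-dynamic' 'wasm-unsafe-eval'",
        "connect-src 'self' https://your-cdn.example.com",
        "worker-src 'self' blob:",
    ]
    return "; ".join(directives)


def render_script_tag(nonce: str, src: str) -> str:
    """Script tags must carry the same nonce that the response header announces."""
    return f'<script nonce="{nonce}" src="{src}"></script>'


# One nonce per response: generate it, put it in the header, and stamp it on
# every script tag rendered into that response.
nonce = new_nonce()
headers = {"Content-Security-Policy": build_csp(nonce)}
body = render_script_tag(nonce, "/ai-app/main.js")
print(headers["Content-Security-Policy"])
print(body)
```

The nonce must never be reused across responses or cached with the page; a nonce that an attacker can observe once and replay gives no protection.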

Verify the headers are set correctly:

curl -I https://your-app.example.com/ai-app/ | grep -E 'Cross-Origin|Content-Security'
# Expected output:
# cross-origin-embedder-policy: require-corp
# cross-origin-opener-policy: same-origin
# content-security-policy: default-src 'self'; ...

4. Model Sharding with Per-Session Encrypted Delivery

Serving the entire model as a single file means a single download gives the attacker everything. Model sharding with per-session encryption means each session receives different encrypted shards, and a WASM memory dump captures only the fraction of weights currently resident in memory.

// Client-side sharded model loader with per-session AES-GCM decryption
class ShardedModelLoader {
    private shardCount: number;
    private sessionToken: string;
    private cryptoKey: CryptoKey | null = null;

    constructor(shardCount: number, sessionToken: string) {
        this.shardCount = shardCount;
        this.sessionToken = sessionToken;
    }

    async initialize(): Promise<void> {
        // Fetch the session-specific decryption key from the server.
        // The server generates a unique key per session and encrypts the shards with it.
        const response = await fetch('/api/model-session-key', {
            method: 'POST',
            headers: { 'Authorization': `Bearer ${this.sessionToken}` },
        });
        if (!response.ok) {
            throw new Error(`Session key request failed: ${response.status}`);
        }
        const { keyBytes } = await response.json();
        const rawKey = Uint8Array.from(atob(keyBytes), c => c.charCodeAt(0));

        // Import as non-extractable — the raw key bytes cannot be read back from
        // this CryptoKey handle by any JavaScript in the page.
        this.cryptoKey = await crypto.subtle.importKey(
            'raw',
            rawKey,
            { name: 'AES-GCM' },
            false,  // extractable = false: key cannot be exported
            ['decrypt']
        );
        // Zero the raw key bytes so they are not recoverable from memory.
        rawKey.fill(0);
    }

    async loadShard(index: number): Promise<ArrayBuffer> {
        if (!this.cryptoKey) throw new Error('Loader not initialized');

        const response = await fetch(`/api/model-shard/${index}`, {
            headers: { 'Authorization': `Bearer ${this.sessionToken}` },
        });
        if (!response.ok) {
            throw new Error(`Shard ${index} request failed: ${response.status}`);
        }
        const encryptedShard = await response.arrayBuffer();

        // First 12 bytes are the AES-GCM IV, remainder is ciphertext + tag.
        const iv = encryptedShard.slice(0, 12);
        const ciphertext = encryptedShard.slice(12);

        const decrypted = await crypto.subtle.decrypt(
            { name: 'AES-GCM', iv },
            this.cryptoKey,
            ciphertext
        );
        return decrypted;
    }

    async loadAndProcess(index: number, processFn: (data: ArrayBuffer) => void): Promise<void> {
        const shardData = await this.loadShard(index);
        processFn(shardData);
        // After processFn completes, shardData goes out of scope.
        // The GC will eventually clear it; for higher assurance, zero before release:
        new Uint8Array(shardData).fill(0);
    }
}

The server-side shard encryption:

# Generate per-session encrypted shards at request time
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os
import base64

def encrypt_shard(shard_bytes: bytes, key: bytes) -> bytes:
    """Encrypt a model weight shard with AES-256-GCM. Returns IV + ciphertext."""
    aesgcm = AESGCM(key)
    iv = os.urandom(12)  # 96-bit IV, standard for AES-GCM
    ciphertext = aesgcm.encrypt(iv, shard_bytes, None)
    return iv + ciphertext  # prepend IV for client-side decryption


def create_session_shards(model_path: str, session_id: str) -> tuple[str, list[bytes]]:
    """
    Read model, split into shards, encrypt each with a session-unique key.
    Returns (base64-encoded key, list of encrypted shard bytes).
    """
    with open(model_path, 'rb') as f:
        model_bytes = f.read()

    shard_size = 50 * 1024 * 1024  # 50MB per shard
    shards = [model_bytes[i:i+shard_size] for i in range(0, len(model_bytes), shard_size)]

    key = os.urandom(32)  # 256-bit AES key
    encrypted_shards = [encrypt_shard(shard, key) for shard in shards]

    # Store key → session_id mapping server-side for audit
    store_session_key(session_id, base64.b64encode(key).decode())
    return base64.b64encode(key).decode(), encrypted_shards

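If the inference engine streams shards (decrypting one, consuming it, and releasing it before loading the next), a WASM memory dump at any point in time captures at most one decrypted shard. For a 1.5GB model split into 50MB shards, that is approximately 3% of the model per dump, and reassembling the full model requires dumping memory at the right moment for each of the 30 shards. Encrypted shards captured from the network are useless without the session key. This benefit does not apply to runtimes that keep every weight resident for the duration of inference, which is how ONNX Runtime Web and MLC-LLM normally operate: there, a single dump taken after loading contains the whole model, and sharding protects only the download path.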
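The shard count and the per-dump fraction are simple arithmetic, sketched below under the assumption that shards are streamed one at a time. The shard_plan() helper is hypothetical. The decimal reading of "1.5GB" and "50MB" gives the 30 shards quoted here, while the binary sizes used by the encryption code (50 * 1024 * 1024 bytes per shard) give one extra, partially filled shard.

```python
from typing import Tuple


def shard_plan(model_bytes: int, shard_bytes: int) -> Tuple[int, float]:
    """Return (number of shards, largest fraction of the model in one shard)."""
    count = (model_bytes + shard_bytes - 1) // shard_bytes  # ceiling division
    largest_fraction = min(shard_bytes, model_bytes) / model_bytes
    return count, largest_fraction


# Decimal units: 1.5 GB model, 50 MB shards.
decimal_count, decimal_fraction = shard_plan(1_500_000_000, 50_000_000)

# Binary units: 1.5 GiB model, 50 MiB shards.
binary_count, binary_fraction = shard_plan(3 * 1024**3 // 2, 50 * 1024**2)

print(f"decimal: {decimal_count} shards, {decimal_fraction:.1%} of the model per shard")
print(f"binary:  {binary_count} shards, {binary_fraction:.1%} of the model per shard")
```

Smaller shards lower the per-dump fraction but raise the number of requests and decrypt calls per model load, so the shard size is a trade-off rather than a value to minimise.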

The fundamental limitation applies: the key is delivered to the browser and used in the browser. An attacker who intercepts the key delivery (via MITM, via XSS that fires before rawKey.fill(0) executes, or via direct DevTools inspection of the key response) has the decryption key. The extractable: false flag prevents JavaScript from calling crypto.subtle.exportKey() on the CryptoKey handle, but it does not prevent capturing the raw key bytes before they are zeroed, and the base64 string keyBytes cannot be zeroed at all because JavaScript strings are immutable.

5. Application-Layer Query Rate Limiting

Client-side rate limiting is bypassable by any attacker who modifies the JavaScript — which, in a browser, all attackers can do via DevTools or browser extensions. The control is nonetheless worth implementing: it raises the cost of automated extraction for the common case (an attacker who has not bothered to patch the rate limiter) and provides a logging hook for server-side detection.

For higher assurance, gate model inference on a server-side heartbeat that both enforces usage limits and instruments query patterns:

// Rate limit backed by a server-side heartbeat
class RateLimitedInferenceEngine {
    private model: any;
    private sessionToken: string;
    private queryCount: number = 0;
    private readonly batchSize: number = 25; // heartbeat every 25 queries
    private readonly maxQueriesPerMinute: number = 60;
    private queryTimestamps: number[] = [];

    constructor(model: any, sessionToken: string) {
        this.model = model;
        this.sessionToken = sessionToken;
    }

    async infer(input: string): Promise<any> {
        const now = Date.now();

        // Local rate limit check (bypassable, but cheap and deters automation)
        this.queryTimestamps = this.queryTimestamps.filter(t => now - t < 60_000);
        if (this.queryTimestamps.length >= this.maxQueriesPerMinute) {
            throw new Error('Local rate limit exceeded — slow down');
        }
        this.queryTimestamps.push(now);

        this.queryCount++;

        // Server heartbeat every batchSize queries.
        // Server can revoke the session if it detects extraction patterns.
        if (this.queryCount % this.batchSize === 0) {
            const permitted = await this.heartbeat();
            if (!permitted) {
                throw new Error('Session revoked — contact support');
            }
        }

        return this.model.infer(input);
    }

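    // Consecutive heartbeat failures tolerated before inference is blocked
    private readonly maxHeartbeatFailures: number = 3;
    private heartbeatFailures: number = 0;

    private async heartbeat(): Promise<boolean> {
        try {
            const response = await fetch('/api/inference-heartbeat', {
                method: 'POST',
                headers: {
                    'Authorization': `Bearer ${this.sessionToken}`,
                    'Content-Type': 'application/json',
                },
                body: JSON.stringify({
                    query_count: this.queryCount,
                    timestamp: Date.now(),
                }),
            });
            this.heartbeatFailures = 0;
            return response.ok;
        } catch {
            // Network failure: tolerate a few misses, then fail closed so that
            // blocking the heartbeat endpoint does not disable the control.
            this.heartbeatFailures++;
            return this.heartbeatFailures <= this.maxHeartbeatFailures;
        }
    }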
}

On the server, the heartbeat endpoint detects extraction-indicative patterns:

import time
from flask import request

# `app` (the Flask application), `require_auth` (a decorator that injects
# user_id and session_id), get_session() and flag_session_for_review() are
# assumed to be defined elsewhere in the backend.
@app.post("/api/inference-heartbeat")
@require_auth
def inference_heartbeat(user_id: str, session_id: str):
    body = request.get_json()
    # query_count is reported by the client and can be forged; treat it as a hint
    query_count = int(body['query_count'])

    # Flag sessions with anomalously high query rates
    session_data = get_session(session_id)
    elapsed_minutes = (time.time() - session_data['start_time']) / 60
    qpm = query_count / max(elapsed_minutes, 1)

    if qpm > 300:  # 300 QPM far exceeds any legitimate interactive use case
        flag_session_for_review(user_id, session_id, reason='high_qpm', qpm=qpm)
        # Optionally revoke immediately for very high rates:
        if qpm > 1000:
            return {'permitted': False}, 403

    return {'permitted': True}, 200
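Because the client-reported query_count can be forged, one option is for the server to estimate the rate from the heartbeats it actually receives. The following is a minimal sketch under that assumption: the HeartbeatMonitor class is hypothetical, each heartbeat is taken to represent one batch of 25 queries (the client's batchSize), and the 300 and 1000 QPM thresholds are reused from the endpoint above.

```python
from collections import deque
from typing import Deque, Dict


class HeartbeatMonitor:
    """Estimate queries per minute from heartbeat arrival times seen by the server."""

    def __init__(self, batch_size: int = 25, window_seconds: float = 60.0,
                 flag_qpm: float = 300.0, revoke_qpm: float = 1000.0) -> None:
        self.batch_size = batch_size
        self.window_seconds = window_seconds
        self.flag_qpm = flag_qpm
        self.revoke_qpm = revoke_qpm
        self._beats: Dict[str, Deque[float]] = {}

    def record(self, session_id: str, now: float) -> str:
        """Register one heartbeat and return 'ok', 'flag' or 'revoke'."""
        beats = self._beats.setdefault(session_id, deque())
        beats.append(now)
        # Drop heartbeats that have fallen out of the sliding window.
        while now - beats[0] > self.window_seconds:
            beats.popleft()
        qpm = len(beats) * self.batch_size * 60.0 / self.window_seconds
        if qpm > self.revoke_qpm:
            return "revoke"
        if qpm > self.flag_qpm:
            return "flag"
        return "ok"


# One heartbeat per second for 41 seconds: the estimate climbs past both thresholds.
monitor = HeartbeatMonitor()
decisions = [monitor.record("session-a", float(t)) for t in range(41)]
print(decisions[0], decisions[12], decisions[-1])
```

A client that suppresses heartbeats entirely still evades this estimate, so it only helps when the client also stops serving inference after repeated heartbeat failures.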

6. Encrypted Model Distribution with Non-Extractable WebCrypto Key Handles

The extractable: false parameter in crypto.subtle.importKey() deserves a specific note because it is the only browser-native mechanism that narrows the JavaScript key exposure window. The non-extractable key handle cannot be passed to exportKey() — the call throws. The key is usable for cryptographic operations (decrypt, sign, verify) but its raw bytes cannot be recovered by any script in the page.

// Demonstrate the non-extractable property explicitly
const keyBytes = new Uint8Array(32);
crypto.getRandomValues(keyBytes);    // placeholder for key bytes delivered by the server

const nonExtractableKey = await crypto.subtle.importKey(
    'raw',
    keyBytes,
    { name: 'AES-GCM' },
    false,       // extractable = false
    // 'encrypt' is included only so this demo can produce a ciphertext;
    // production code needs just 'decrypt'
    ['encrypt', 'decrypt']
);

// exportKey() rejects with InvalidAccessError for a non-extractable key
try {
    const exported = await crypto.subtle.exportKey('raw', nonExtractableKey);
    console.log('This line never executes');
} catch (e) {
    console.log('Export correctly rejected:', e.name);
    // Output: "Export correctly rejected: InvalidAccessError"
}

// Produce a sample ciphertext for the demo, using a fresh random IV
const iv = crypto.getRandomValues(new Uint8Array(12));
const encryptedData = await crypto.subtle.encrypt(
    { name: 'AES-GCM', iv },
    nonExtractableKey,
    new TextEncoder().encode('sample weights')
);

// Decryption still works with the non-extractable key:
const decrypted = await crypto.subtle.decrypt(
    { name: 'AES-GCM', iv },
    nonExtractableKey,
    encryptedData  // the ciphertext
);
// decrypted contains the plaintext — accessible, but the key is not

The precise limitation: extractable: false prevents exportKey(). It does not prevent an attacker from:

Capturing keyBytes before zeroing (patching the code via DevTools)
Reading the decrypted plaintext immediately after decrypt() returns
Intercepting the network response that delivers the key to the browser

This is not a bug in WebCrypto. It accurately reflects the constraint that browser JavaScript has no general-purpose hardware-backed key storage equivalent to a TPM or Secure Enclave. The exception would be a deployment that targets hardware supporting the Web Authentication API’s device-bound key storage, which WASM inference engines do not currently integrate.

Expected Behaviour

After configuring authenticated signed URLs, fetching the model with wget from a new, unauthenticated session fails in both of the following cases:

wget -O test.onnx "https://proprietary-models-prod.s3.amazonaws.com/models/v3/model.onnx"
# Output: ERROR 403: Forbidden
# No signature: the bucket is private, so unsigned requests are rejected.

wget -O test.onnx "https://proprietary-models-prod.s3.amazonaws.com/models/v3/model.onnx?X-Amz-Signature=..."
# Output: ERROR 403: Forbidden
# A previously issued presigned URL replayed after its TTL: S3 rejects the expired request,
# and a fresh URL requires re-authenticating to the application backend.

After deploying COEP/COOP headers, DevTools Network tab for the /ai-app/ page shows:

Response Headers:
  cross-origin-embedder-policy: require-corp
  cross-origin-opener-policy: same-origin
  content-security-policy: default-src 'self'; script-src 'self' 'nonce-abc123' 'strict-dynamic'; ...

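With require-corp in force, any cross-origin subresource served without CORS or a Cross-Origin-Resource-Policy: cross-origin header is blocked, and a cross-origin <iframe> loads only if its document also sends its own Cross-Origin-Embedder-Policy header together with Cross-Origin-Resource-Policy: cross-origin. Such load failures are expected behaviour and indicate that the isolation policy is active.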
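The header check described above can be automated in a deployment test. The following is a minimal sketch, assuming the response headers have already been fetched into a plain dict; check_isolation_headers is a hypothetical helper name, not part of any library.

```python
def check_isolation_headers(headers: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means both headers are set as expected."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value.strip().lower() for name, value in headers.items()}
    expected = {
        "cross-origin-embedder-policy": "require-corp",
        "cross-origin-opener-policy": "same-origin",
    }
    problems = []
    for name, want in expected.items():
        got = normalised.get(name)
        if got != want:
            problems.append(f"{name} is {got!r}, expected {want!r}")
    return problems


# Example: headers as they might be captured from the /ai-app/ response.
print(check_isolation_headers({
    "Cross-Origin-Embedder-Policy": "require-corp",
    "Cross-Origin-Opener-Policy": "same-origin",
}))
```

The sketch only confirms that the two headers are present with the values shown above; it does not prove that the page is actually cross-origin isolated, which also depends on how every embedded resource is served.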

Watermark verification on an identified stolen model copy:

result = verify_watermark(
    suspect_model_path='suspected_stolen_model.onnx',
    watermark_key='user_id:u_7a3f9b2c:session_id:s_2026_03_15_082200',
)
print(result)
# {'verdict': 'watermark_present', 'mean_correlation': 0.31, 'layers_checked': 144}
# A correlation of 0.31 is statistically decisive: p < 0.0001 under the null hypothesis
# that the correlation arose by chance. Attribution to that user's download session confirmed.
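Whether a correlation of this size is decisive depends on how many independent values it was computed over. The sketch below is one way to sanity-check a reported figure. It assumes the correlation can be treated as a single Pearson correlation over n independent samples and uses the Fisher z-transform; this is an illustrative assumption, not a description of how verify_watermark() computes its statistic. The function name correlation_p_value and the sample sizes are hypothetical.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """One-sided p-value for seeing a correlation >= r over n independent
    samples when the true correlation is zero (Fisher z approximation)."""
    if n <= 3:
        raise ValueError("the Fisher approximation needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical sample sizes: the real n depends on how many weights the
# verifier compares, which is not assumed here.
for n in (20, 144, 10_000):
    print(n, correlation_p_value(0.31, n))
```

The same correlation is weak evidence over a few dozen samples and overwhelming evidence over thousands, so an attribution report should state the sample size alongside the correlation.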

Trade-offs

Authenticated signed URLs add an access control layer and audit trail but do not prevent extraction by authorised users who legitimately receive the model. The session-bound URL prevents URL sharing; it does not prevent the authorised user from reading the downloaded bytes from Cache Storage.

Model watermarking provides post-hoc attribution and is the only control that survives weight theft. It does not prevent the theft. The correlation-based verification can produce false negatives at low perturbation levels when the suspect model has been heavily fine-tuned on top of the extracted base, because fine-tuning overwrites the perturbation the verifier looks for. The perturbation must be tuned carefully: too small and the watermark does not survive fine-tuning; too large and it degrades model quality.

COEP/COOP and CSP are the highest-value mitigations against XSS-to-WASM-memory-dump attack chains. A nonce-based script-src stops injected scripts that lack the current nonce from executing. The operational cost is nonce injection: every response must carry a fresh nonce, which requires server-side templating and prevents naive static HTML deployment. CSP is also brittle: a single unsafe-eval entry, or an unsafe-inline entry in a policy that has lost its nonce, defeats the protection. Regular CSP auditing via violation reports (report-uri, or its successor report-to) is required to catch regressions.
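A regression check of this kind can run in CI against the policy string the server emits. The following is a minimal sketch under stated assumptions: audit_csp is a hypothetical helper, it inspects only the script-related directives, and it is far less thorough than a real CSP evaluator.

```python
def audit_csp(policy: str) -> list[str]:
    """Return findings for the script-related part of a CSP header value."""
    directives: dict[str, list[str]] = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            # Directive names and keyword sources are case-insensitive;
            # when a directive is repeated, the first occurrence wins.
            directives.setdefault(tokens[0].lower(), [t.lower() for t in tokens[1:]])

    # script-src falls back to default-src when it is absent.
    sources = directives.get("script-src", directives.get("default-src"))
    if sources is None:
        return ["no script-src or default-src: script loading is unrestricted"]

    has_nonce_or_hash = any(
        s.startswith(("'nonce-", "'sha256-", "'sha384-", "'sha512-")) for s in sources
    )
    findings = []
    if "'unsafe-eval'" in sources:
        findings.append("'unsafe-eval' allows eval() and similar string-to-code sinks")
    # Browsers ignore 'unsafe-inline' when a nonce or hash source is present.
    if "'unsafe-inline'" in sources and not has_nonce_or_hash:
        findings.append("'unsafe-inline' allows injected inline scripts")
    # 'strict-dynamic' makes browsers ignore host-based sources, including "*".
    if "*" in sources and "'strict-dynamic'" not in sources:
        findings.append("wildcard source allows scripts from any host")
    return findings


print(audit_csp("default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'"))
```

Failing the build on a non-empty result catches a weakened policy before it ships, which complements violation reports collected from production traffic.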

Model sharding with AES-GCM meaningfully increases the cost of memory-dump-based extraction but adds inference latency (sequential shard fetch-decrypt-load cycles) and implementation complexity. For MLC-LLM’s quantized models, sharding at 50 MB boundaries is compatible with the model’s native shard format; for ONNX models, sharding requires post-processing after export. An attacker who dumps memory once per shard, over 30 shards and 30 separate sessions, can still reconstruct the full model, but this requires significantly more effort and generates audit log entries that should trigger detection.

Non-extractable WebCrypto keys prevent exportKey() but do not prevent capturing the plaintext output of decrypt(). The control is meaningful specifically against passive memory inspection (WASM memory dump) rather than against an attacker who patches the JavaScript to intercept decryption outputs. It costs nothing to implement and should always be used when importing model decryption keys.

Server-side inference heartbeats are the only mechanism that allows server-side detection and session revocation. They require the user to be online — which defeats the offline-use value proposition of on-device AI — and they are bypassable by an attacker who patches the heartbeat call out of the JavaScript. Treat them as a logging and detection mechanism, not as an enforcement mechanism.

Failure Modes

Deploying fine-tuned proprietary models via Transformers.js with env.useBrowserCache = true and no authentication on model files. This is the default behaviour. The model is downloaded once, cached in the browser’s Cache Storage, and thereafter loaded from cache without any network request, and therefore without any opportunity to enforce authentication on subsequent loads. A visitor who loads the model once, then opens DevTools and copies the request URL from the Cache Storage entry, can wget the model indefinitely from the CDN. The fix: serve all model files through authenticated presigned URLs, and set env.useBrowserCache = false so the library does not write the model into Cache Storage under a guessable key. Setting Cache-Control: no-store on the presigned URL response keeps the file out of the browser’s HTTP cache, but the Cache Storage API does not honour HTTP caching headers, so the header alone is not sufficient.

Trusting that WebAssembly.Memory is inaccessible to JavaScript. The WebAssembly JavaScript API is explicit: linear memory is exposed to JavaScript through the buffer property on WebAssembly.Memory instances, and there is no access control mechanism. Any script that runs in the page’s own JavaScript context and can reach the Memory object (a script loaded by the page, a script injected via XSS, or a browser extension script injected into the page’s main world) can read the full memory buffer. Models loaded into WASM linear memory are not protected by the WASM boundary.

Using extractable: false and concluding the model decryption key is secure. An attacker with DevTools access to the page can set a breakpoint on the crypto.subtle.importKey call and inspect the keyBytes argument before fill(0) executes, or patch the code to console.log the key on import. Non-extractable keys prevent exportKey() calls; they do not prevent key observation in a debug-capable environment. The browser is fundamentally a transparent execution environment.

Not monitoring Hugging Face and GitHub for watermark matches. The most common path to discovering stolen on-device models is not your audit log. It is a customer reporting that a competitor’s product behaves identically to yours, or a security researcher finding a model published on Hugging Face that matches your architecture exactly. Automated watermark verification against newly published models requires integrating with the Hugging Face Hub API and running verify_watermark() periodically against any published model that matches your architecture signature. Without this monitoring, stolen models may be in commercial deployment for months before you discover them.
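The periodic check can be structured as a small scan loop. The sketch below is hedged: it assumes candidate models have already been downloaded to local paths by some separate discovery step (not shown), and it takes the verifier as a parameter so that the verify_watermark() call shape used earlier (suspect_model_path, watermark_key, returning a dict with a verdict field) is the only thing it relies on. scan_for_watermarks is a hypothetical name.

```python
from typing import Callable, Iterable


def scan_for_watermarks(
    candidate_paths: Iterable[str],
    watermark_keys: Iterable[str],
    verify: Callable[..., dict],
) -> list[dict]:
    """Run the verifier for every (candidate model, watermark key) pair
    and collect the pairs where a watermark is reported as present."""
    keys = list(watermark_keys)  # reused for every candidate
    hits = []
    for path in candidate_paths:
        for key in keys:
            result = verify(suspect_model_path=path, watermark_key=key)
            if result.get("verdict") == "watermark_present":
                hits.append({"model": path, "watermark_key": key, **result})
    return hits


# Usage with a stand-in verifier; in production this would be verify_watermark.
def fake_verify(suspect_model_path: str, watermark_key: str) -> dict:
    present = suspect_model_path == "b.onnx" and watermark_key == "k2"
    return {"verdict": "watermark_present" if present else "no_watermark"}


print(scan_for_watermarks(["a.onnx", "b.onnx"], ["k1", "k2"], fake_verify))
```

The cost grows with the number of candidates multiplied by the number of issued watermark keys, so filtering candidates by architecture signature before verification keeps the scan tractable.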

Assuming that architectural choices are protected by WASM opacity. Even without the weight file, the ONNX graph topology — layer count, attention heads, hidden dimension, normalization choices, custom operators — is visible in the .onnx format and reconstructible from a WASM binary using wasm-decompile plus LLM-assisted analysis. If your competitive advantage lies partially in architectural innovation rather than solely in trained weights, on-device deployment exposes both. Server-side inference is the only deployment model that keeps architecture private.

Choosing on-device deployment for high-value models because it “feels” private. The privacy benefit (user data stays on device) and the security risk (model weights go to the device) are both real. For a general-purpose embedding model built on a public base, on-device deployment is fine and the extraction risk is low. For a domain-specific fine-tune representing years of expert labelling and millions of dollars of compute — especially in regulated industries where the training data itself is sensitive — on-device deployment leaks the model to every authorised user and provides no effective way to prevent redistribution. The correct architecture for high-value models is server-side inference with on-device preprocessing; only the low-value portion of the pipeline belongs on the client.