AI Model Evaluation Pipeline Security

Problem

Evaluation pipelines are how teams decide whether a model is safe enough to deploy. Inspect AI (UK AISI), lm-evaluation-harness (EleutherAI), HELM (Stanford), MT-Bench, the OpenAI evals framework, and a long tail of internal harnesses run candidate models against benchmark datasets, agentic tasks, and red-team prompts. Their output drives release decisions, regulatory filings (EU AI Act conformity), and capability claims in model cards.

Three properties make eval pipelines a non-obvious threat surface that security teams routinely overlook:

They run untrusted code by design. Agentic evals (SWE-bench, Cybench, GAIA, AISI’s autonomy suite) require the model to execute shell commands, write to disk, browse the web, and call APIs. The eval harness is expected to give the model a sandbox in which to run arbitrary code. If that sandbox is the host machine, every eval run is a code-execution opportunity for whoever supplies the dataset or the candidate model.
They ingest untrusted datasets. Benchmark datasets are pulled from HuggingFace, GitHub, S3 buckets, or model providers. A poisoned dataset can include adversarial examples that target the harness itself (parser bugs in the JSONL loader, deserialization in pickle-encoded prompts) or that manipulate the eval scorer into producing favourable scores.
Their outputs feed regulatory and commercial decisions. A pipeline that misreports a 75% safety score as 95% has compliance and reputational consequences far beyond a normal CI test failure. That makes eval results a high-value target for tampering, whether by an insider or by an attacker who has gained pipeline access.

Several specific incidents in 2024–2025 showed the pattern: HuggingFace datasets weaponised with pickle deserialization that ran during datasets.load_dataset(), a Cybench-style harness used as a launchpad to attack the host because the agent’s bash tool was a thin wrapper over subprocess.run, and an internal eval pipeline at one frontier lab where a contractor manipulated scoring scripts to inflate a model’s harmlessness numbers prior to release sign-off.

This article focuses on the three pillars of a hardened eval pipeline: dataset trust, sandboxed model execution, and attestation of results. Target systems: Inspect AI ≥ 0.3, lm-evaluation-harness ≥ 0.5, HuggingFace datasets ≥ 3.0, Docker / Firecracker / gVisor / WASI sandboxes for agent execution, and SLSA-style provenance for eval artifacts.

Threat Model

Poisoned benchmark dataset. Goal: trigger code execution during dataset load, or craft examples that exploit the scorer’s grading logic. Surface: HuggingFace trust_remote_code=True, pickle/jsonl parsing, regex-based scorers.
Hostile candidate model in agentic eval. Goal: escape the agent sandbox during a permitted code-execution task. Surface: subprocess wrappers, host-network access, mounted credentials.
Insider tampering with eval outputs. Goal: alter pass/fail rates, flip safety verdicts before sign-off. Surface: writeable result store, no signing, no audit log.
Compromised eval-runner credential. Goal: pivot from eval pipeline to model registry, training data, or deployment systems. Surface: broad IAM scopes on the runner, persistent secrets.

Blast radius without hardening: a poisoned eval can give an attacker code execution on infrastructure that holds model weights and training data. With hardening, eval execution is contained to a single ephemeral sandbox with attested results.

Configuration / Implementation

Step 1 — Pin and verify datasets

Use HuggingFace dataset revisions and forbid trust_remote_code:

from datasets import load_dataset, DownloadConfig

ds = load_dataset(
    "lmsys/mt_bench_human_judgments",
    revision="<full-40-char-commit-sha>",             # placeholder: pin to a commit hash, never a branch or tag
    trust_remote_code=False,                          # never True in production
    download_config=DownloadConfig(
        num_proc=1,
        max_retries=3,
        force_download=False,
        token=None,                                   # public datasets only on this path
    ),
)

Maintain a vetted-dataset registry — a YAML file with name, revision, sha256, license, vetted_by, vetted_at. Pre-flight check before any eval run:

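class SecurityError(Exception):
    """Raised when a dataset fails a pre-flight trust check."""

def verify_dataset(name: str, revision: str) -> None:
    expected = REGISTRY.get(name)  # REGISTRY: dict parsed from the vetted-dataset YAML
    if expected is None:
        raise SecurityError(f"dataset {name} is not in the vetted registry")
    if revision != expected["revision"]:
        raise SecurityError(f"unpinned dataset {name}@{revision}")
    actual_sha = compute_dataset_sha256(name, revision)  # project helper: hashes the downloaded files
    if actual_sha != expected["sha256"]:
        raise SecurityError(f"dataset hash mismatch for {name}: {actual_sha}")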
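
The compute_dataset_sha256 helper is not defined in this article. One way to build it, sketched here under the assumption that the pinned revision has already been downloaded to a local directory, is a deterministic hash over every file in that directory. The function name hash_dataset_dir is hypothetical.

```python
import hashlib
from pathlib import Path


def hash_dataset_dir(root: Path) -> str:
    """Return one SHA-256 covering every file under root, independent of walk order."""
    # Sort by POSIX-style relative path so the result is stable across runs.
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    outer = hashlib.sha256()
    for rel, path in files:
        file_hash = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                file_hash.update(chunk)
        rel_bytes = rel.encode("utf-8")
        # Length-prefix the path so differently split names cannot collide.
        outer.update(len(rel_bytes).to_bytes(4, "big"))
        outer.update(rel_bytes)
        outer.update(file_hash.digest())
    return outer.hexdigest()
```

Hashing the relative path together with the content means a renamed or added file changes the digest, not only an edited one.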

For datasets that legitimately ship loader scripts (which require trust_remote_code=True), fork them, audit the script, vendor the reviewed copy, and load from the local path only.
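
Loading from the local path only can be enforced in code as well as by convention. The sketch below assumes vendored datasets are plain JSONL under a single reviewed directory; VENDORED_ROOT, MAX_LINE_BYTES, and the "input"/"target" field names are illustrative assumptions, not part of any harness API.

```python
import json
from pathlib import Path

VENDORED_ROOT = Path("vetted")   # hypothetical directory of reviewed datasets
MAX_LINE_BYTES = 1_000_000       # hypothetical per-record size cap


def load_vetted_jsonl(path: str, required: tuple = ("input", "target"),
                      root: Path = VENDORED_ROOT) -> list:
    """Load a JSONL file only if it sits under root and every record is well-formed."""
    resolved = Path(path).resolve()
    # Resolving first defeats ../ traversal and symlinks that point outside the root.
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"{path} is outside the vendored dataset root")
    records = []
    with resolved.open("rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            if len(raw) > MAX_LINE_BYTES:
                raise ValueError(f"line {lineno} exceeds {MAX_LINE_BYTES} bytes")
            if not raw.strip():
                continue  # tolerate blank lines
            record = json.loads(raw)  # JSON only: no pickle, no code execution
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno} is not a JSON object")
            missing = [k for k in required if not isinstance(record.get(k), str)]
            if missing:
                raise ValueError(f"line {lineno} lacks string fields: {missing}")
            records.append(record)
    return records
```

Rejecting malformed records up front keeps parser edge cases out of the harness and the scorer.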

Step 2 — Sandbox the candidate model’s tool calls

Inspect AI ships a sandbox abstraction; use it strictly:

from inspect_ai import Task, task
from inspect_ai.solver import use_tools, generate
from inspect_ai.tool import bash, python
from inspect_ai.dataset import json_dataset

@task
def cybench_subset() -> Task:
    return Task(
        dataset=json_dataset("vetted/cybench-2025-Q2.jsonl"),
        solver=[
            use_tools([bash(timeout=30), python(timeout=30)]),
            generate(),
        ],
        sandbox=("docker", "infra/cybench-sandbox.yaml"),
        message_limit=20,
        time_limit=300,
    )

Sandbox config (infra/cybench-sandbox.yaml):

services:
  default:
    image: cybench-sandbox:1.4.0@sha256:abc...
    cap_drop: [ALL]
    cap_add: []
    security_opt:
      - no-new-privileges:true
      - seccomp=infra/seccomp-strict.json
      - apparmor=cybench-sandbox
    read_only: true
    tmpfs:
      - /tmp:size=64m,mode=1777
      - /workspace:size=128m
    networks:
      - sandbox-isolated
    user: "65534:65534"
    pids_limit: 100
    mem_limit: 512m
    cpu_quota: 50000
networks:
  sandbox-isolated:
    driver: bridge
    internal: true   # no external network

Key elements:

internal: true — the sandbox cannot reach the internet at all by default. Tasks that legitimately need web access run on a separate proxied network with an FQDN allowlist.
No mounted host paths beyond a fresh tmpfs.
user: 65534 — non-root.
mem_limit, pids_limit, cpu_quota — resource bounds prevent agent CPU/RAM exhaustion of the runner.
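
These properties are easy to lose when someone edits the compose file for a new task. A pre-flight lint can catch that; the sketch below assumes the YAML has already been parsed into plain dicts, and the function name and messages are hypothetical.

```python
def lint_sandbox_service(service: dict, networks: dict) -> list:
    """Return a list of policy violations for one compose service definition."""
    problems = []
    if "ALL" not in service.get("cap_drop", []):
        problems.append("cap_drop must include ALL")
    if service.get("cap_add"):
        problems.append("cap_add must be empty")
    if service.get("read_only") is not True:
        problems.append("root filesystem must be read_only")
    if service.get("volumes"):
        problems.append("host volumes are not allowed; use tmpfs")
    # A missing user means the image default, which is usually root.
    user = str(service.get("user", "0"))
    if user.split(":")[0] in ("", "0", "root"):
        problems.append("container must run as a non-root user")
    for key in ("mem_limit", "pids_limit", "cpu_quota"):
        if key not in service:
            problems.append(f"{key} must be set")
    # No networks key means the default bridge, which is not internal.
    for net in service.get("networks") or ["default"]:
        if not (networks.get(net) or {}).get("internal", False):
            problems.append(f"network {net} must be internal")
    return problems
```

An empty list means the service matches the policy above; tasks on the allowlisted web network would need a separate rule set.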

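For higher-isolation needs, swap Docker for Firecracker, gVisor, or Kata. lm-evaluation-harness has no equivalent per-task sandbox setting to rely on, so run the whole harness process inside the same kind of container or microVM whenever a task executes model-generated code.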

Step 3 — Network-allowlist tasks that need it

Web-browsing tasks (GAIA, BrowseComp) need network. Constrain it:

networks:
  sandbox-web-allowlist:
    driver: bridge
    driver_opts:
      com.docker.network.bridge.enable_ip_masquerade: "true"
    ipam:
      config:
      - subnet: 10.42.0.0/16

Plus an egress proxy:

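# eval-egress.example.net
# Stock nginx does not implement CONNECT, so it cannot tunnel HTTPS for clients
# that use it as a forward proxy; use a CONNECT-capable proxy for that traffic.
server {
  listen 443 ssl;                 # ssl_certificate / ssl_certificate_key omitted here
  server_name eval-egress.example.net;
  resolver 127.0.0.11;            # placeholder (Docker embedded DNS); required because proxy_pass uses variables
  location / {
    # $host is the normalised request host: lower-cased, port stripped
    if ($host !~ ^(en\.wikipedia\.org|www\.google\.com|arxiv\.org)$) {
      return 403;
    }
    proxy_pass https://$host$request_uri;
    proxy_ssl_server_name on;     # send SNI to the upstream
  }
  access_log /var/log/nginx/eval-egress.log json_combined;  # json_combined must be declared with log_format
}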

The sandbox points HTTP_PROXY (and HTTPS_PROXY) at this proxy; iptables rules in the sandbox's network namespace drop traffic to any other destination.
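
The allowlist is worth unit-testing before it is deployed, because lookalike hosts are the usual bypass. The following sketch mirrors the three hosts above in Python; ALLOWED_HOSTS and is_allowed are hypothetical names, and exact-match plus HTTPS-only is an assumed policy.

```python
from urllib.parse import urlsplit

# Mirrors the hosts in the proxy rule; kept as exact names, not suffixes.
ALLOWED_HOSTS = frozenset({"en.wikipedia.org", "www.google.com", "arxiv.org"})


def is_allowed(url: str) -> bool:
    """True only for https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    # .hostname is lower-cased and excludes userinfo and port.
    host = (parts.hostname or "").rstrip(".")
    return parts.scheme == "https" and host in ALLOWED_HOSTS
```

Exact matching rejects hosts such as arxiv.org.evil.example and URLs that smuggle an allowed name into the userinfo or query string.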

Step 4 — Sign and attest eval results

Every eval run produces a JSON or .eval log. Sign it at completion with a runner-bound key:

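from pathlib import Path

from sigstore.oidc import IdentityToken
from sigstore.sign import SigningContext

# Written against the sigstore-python 3.x API; check the release you pin.
# runner_oidc_token is the raw OIDC JWT issued to the runner.
ctx = SigningContext.production()
with ctx.signer(IdentityToken(runner_oidc_token)) as signer:
    bundle = signer.sign_artifact(eval_log_bytes)   # returns a Sigstore bundle
Path("results/eval.sig").write_text(bundle.to_json())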

Provenance via SLSA-style attestation:

# eval-provenance.json (shown as YAML for readability)
predicateType: https://slsa.dev/provenance/v1
predicate:
  buildDefinition:
    buildType: https://example.net/eval/v1
    externalParameters:
      task: cybench_subset
      model: claude-opus-4-7@2026-04-15
      dataset: cybench-2025-Q2@sha256:abc...
      harness: inspect-ai@0.3.42
    internalParameters:
      sandbox_image: cybench-sandbox:1.4.0@sha256:def...
      runner: gh-actions/ubuntu-22.04
  runDetails:
    builder:
      id: https://example.net/runners/eval-runner-7
    metadata:
      invocationId: eval-2026-05-08-001
      startedOn: 2026-05-08T10:00:00Z
      finishedOn: 2026-05-08T10:42:13Z
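
The fragment above is only the predicate. To bind it to a specific log, it is normally wrapped in an in-toto Statement whose subject carries the log's digest. A minimal sketch follows; build_statement is a hypothetical helper name, and the predicate dict is assumed to hold the fields shown above.

```python
import hashlib
import json


def build_statement(eval_log: bytes, log_name: str, predicate: dict) -> str:
    """Wrap a SLSA provenance predicate in an in-toto v1 Statement for one eval log."""
    statement = {
        "_type": "https://in-toto.io/Statement/v1",
        # The subject digest ties this provenance to exactly these log bytes.
        "subject": [
            {"name": log_name, "digest": {"sha256": hashlib.sha256(eval_log).hexdigest()}}
        ],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": predicate,
    }
    # Sorted keys and fixed separators give byte-stable output for signing.
    return json.dumps(statement, sort_keys=True, separators=(",", ":"))
```

A verifier recomputes the SHA-256 of the stored log and compares it with the subject digest before trusting any field in the predicate.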

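Store the signed log and provenance in a write-once bucket (s3://example-eval-results with S3 Object Lock). Governance mode can still be bypassed by any principal holding s3:BypassGovernanceRetention, so prefer compliance mode when insiders are in the threat model. The 7-year retention used here is an example; set it from your own AI Act record-keeping obligations.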

Step 5 — Scope eval-runner IAM

The runner needs to: pull the candidate model, write results, post status. It should not be able to: read arbitrary models, write to deployment paths, or read training data.

resource "aws_iam_role_policy" "eval_runner" {
  name = "eval-runner"
  role = aws_iam_role.eval_runner.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:GetObject"]
        Resource = "arn:aws:s3:::eval-models/${var.candidate_model}/*"
      },
      {
        Effect = "Allow"
        Action = ["s3:PutObject", "s3:PutObjectRetention"]
        Resource = "arn:aws:s3:::eval-results/${var.run_id}/*"
      },
      {
        Effect = "Deny"
        Action = ["s3:*"]
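        # Bucket ARNs as well as object ARNs, so bucket-level actions such as s3:ListBucket are denied too.
        Resource = [
          "arn:aws:s3:::training-data", "arn:aws:s3:::training-data/*",
          "arn:aws:s3:::prod-models", "arn:aws:s3:::prod-models/*",
        ]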
      },
    ]
  })
}

Use OIDC federation, not long-lived keys. Token lifetime ≤ run duration + 5 minutes.
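
On AWS the requested session length has hard bounds: STS accepts DurationSeconds from 900 seconds up to the role's maximum session duration, which itself cannot exceed 12 hours. A small sketch of deriving the value from a task's time limit follows; the function and constant names are hypothetical.

```python
STS_MIN_SECONDS = 900       # AWS lower bound for DurationSeconds
STS_MAX_SECONDS = 43_200    # AWS upper bound; the role's own maximum may be lower
GRACE_SECONDS = 300         # the 5-minute margin from the rule above


def session_seconds(run_time_limit_s: int) -> int:
    """Shortest STS session that still covers the run plus the grace period."""
    wanted = run_time_limit_s + GRACE_SECONDS
    if wanted > STS_MAX_SECONDS:
        raise ValueError("run exceeds one STS session; split the run or refresh the token")
    # Runs shorter than 10 minutes still get a 15-minute token: STS will not go lower.
    return max(STS_MIN_SECONDS, wanted)
```

For the 300-second task limit used earlier this yields 900 seconds, so very short runs leave a credential valid slightly longer than run duration plus 5 minutes.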

Step 6 — Lock down the scorer

For LLM-as-judge scoring, the judge model itself is part of the trusted base. Pin it explicitly and run it in its own isolated path:

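# Sketch: judge_client, read_signed_file, format_q_a and parse_grade are project helpers.
@scorer
def judge_with_pinned_model(question, answer):
    response = judge_client.complete(
        model="claude-opus-4-7@2026-04-15",                       # pinned judge version, never a floating alias
        system=read_signed_file("scorer-prompts/v3-signed.txt"),  # integrity-checked on every load
        messages=[{"role": "user", "content": format_q_a(question, answer)}],
        max_tokens=512,
    )
    return parse_grade(response)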

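Sign the scorer prompt files and verify the signature on every load. A subtle insider attack is to tweak the scorer prompt to be more lenient. With signing, an edited prompt fails verification until an authorised identity re-signs it, so the change shows up in both the git history and the signing log.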
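
The read_signed_file helper is not defined in this article. As a simpler stand-in, the sketch below uses digest pinning instead of a signature check: it proves the file matches a pinned value but not who approved it, so the pinned digest is assumed to live in a reviewed, signed manifest. The name read_pinned_file is hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def read_pinned_file(path: str, expected_sha256: str) -> str:
    """Return the file's text only if its SHA-256 matches the pinned digest."""
    data = Path(path).read_bytes()
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"{path}: digest {actual} does not match the pinned value")
    return data.decode("utf-8")
```

Because the check runs on every load, a prompt edited on disk after review stops the scorer instead of silently changing grades.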

Step 7 — Audit and detection

audit_signals:
  - sandbox_egress_policy_violations  # alert on >0
  - eval_runs_without_signed_log      # alert on >0
  - dataset_loaded_without_revision_pin
  - candidate_model_attempting_subprocess_outside_sandbox
  - judge_prompt_unsigned
  - eval_result_modified_post_signing

The last two are the insider-tampering tripwires; the first four catch external compromise.
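
A sketch of how these signals might be triaged follows. It assumes every signal alerts on any non-zero count, which the list above states only for the first two, and that counts arrive as a plain dict from whatever log pipeline is in use; all names are hypothetical.

```python
AUDIT_SIGNALS = (
    "sandbox_egress_policy_violations",
    "eval_runs_without_signed_log",
    "dataset_loaded_without_revision_pin",
    "candidate_model_attempting_subprocess_outside_sandbox",
    "judge_prompt_unsigned",
    "eval_result_modified_post_signing",
)
# The last two signals are the insider-tampering tripwires.
INSIDER_SIGNALS = frozenset(AUDIT_SIGNALS[-2:])


def triage(counts: dict) -> list:
    """Return (signal, category) pairs for every signal with a non-zero count."""
    alerts = []
    for signal in AUDIT_SIGNALS:
        if counts.get(signal, 0) > 0:
            category = "insider" if signal in INSIDER_SIGNALS else "external"
            alerts.append((signal, category))
    return alerts
```

Routing the two categories to different responders keeps an insider investigation from being handled by the team under suspicion.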

Expected Behaviour

Signal	Before hardening	After hardening
`trust_remote_code=True` in eval	Allowed; arbitrary loader code runs	Refused at preflight
Agent task escapes via `os.system`	Runs on host	Contained to ephemeral sandbox
Eval log altered post-run	Silent	Signature verification fails
Runner credentials usable post-run	Yes	No; OIDC token expires ≤ 5 min after run end
Web access to non-allowlisted FQDN	Allowed	403 at proxy
Scorer-prompt edit for leniency	No trace	Git diff + signature failure

# Verify a signed eval log:
sigstore verify identity \
  --bundle results/eval.sig \
  --cert-identity 'https://github.com/example/eval-runner/.github/workflows/eval.yml@refs/heads/main' \
  --cert-oidc-issuer https://token.actions.githubusercontent.com \
  results/eval.json

Trade-offs

Aspect	Benefit	Cost	Mitigation
Strict sandbox network policy	Eliminates exfil	Some web-using tasks need explicit allowlist	Per-task allowlist registry; review quarterly
Pinned dataset revisions	Reproducibility + integrity	Dataset updates need re-vet	Monthly review cadence; auto-PR on upstream changes
Signed scorer prompts	Tamper-evident	More cumbersome iteration	Use ephemeral signing keys for dev branch
OIDC short-lived runner creds	No long-lived secrets to leak	Slight runner setup complexity	Reuse standard CI federation patterns
Sandbox CPU/mem limits	Predictable runner cost	Some genuinely heavy tasks need overrides	Per-task budget profile; on-call review for new asks

Failure Modes

Failure	Symptom	Detection	Recovery
Sandbox image bitrot	Container start fails after upstream tag move	Daily preflight `docker run --rm sandbox:tag /bin/true`	Pin by digest; auto-PR on rebuild
Egress proxy blocks legit task	Task fails partway, partial scores	Eval log shows 403 on URL	Add to allowlist; rerun; mark previous result invalidated
Signing identity compromise	Trust boundary collapsed	Rekor transparency log shows signatures from an unexpected identity	Revoke the identity's access, re-sign all in-flight results, audit Rekor entries
Insider tampering with raw `.eval`	Stored log no longer matches its signature	CI post-run verify step	Replay run; investigate writer; rotate runner
Judge prompt drift	Subtle metric shifts	Diff alert against signed baseline	Roll back to prior prompt; investigate change author

When to Consider a Managed Alternative

AISI-supported managed Inspect runs and HuggingFace’s evaluation services bake in many sandboxing defaults.
Internal red-team / capability evals at frontier scale (GAIA, METR uplift suites) typically require self-hosted custom infra; the patterns here apply directly.