API Gateway Hardening at the OS Layer: Rate Limiting with nftables and eBPF

The Problem

Application-layer rate limiting has a fundamental architectural flaw: it evaluates requests inside the process it is trying to protect. Every nginx limit_req, every Kong rate-limiting plugin evaluation, every application middleware check consumes CPU and memory inside the userspace process. Under volumetric attack, the attacker does not need to bypass the rate limiter — they just need to send enough requests that the rate limiter itself becomes the bottleneck.

At 100,000 requests per second, a modern API gateway running nginx or Kong will spend most of its CPU budget on TLS context allocation, HTTP/2 frame parsing, and connection state management — before any rate limit logic executes. The rate limiter code runs, technically, but the process is context-switching under I/O pressure, allocating connection state for packets it will ultimately drop, and returning 429s that require fully processed HTTP responses to deliver. Each rejected request still completes TLS handshake, HTTP request parsing, and response serialisation. A 429 is not free. Under a sustained 100k req/s flood, the gateway process exhausts its CPU budget generating rate-limit rejections.

This is not hypothetical. In documented API abuse campaigns — credential stuffing botnets hitting authentication endpoints, scrapers using residential proxy pools to distribute load, DDoS-as-a-service operations targeting fintech APIs — the attack pattern is specifically engineered to exploit this. Operators observe: rate limits are configured, rate limits are triggering (logs show 429s), but the application is still degraded. What is happening is that the 429 response path is itself overloaded. The attacker does not care about the 429. They want the CPU exhaustion.

The solution is to evaluate rate limits before the application process is involved at all. The Linux kernel provides three enforcement points, in order of how early in the packet path they execute:

XDP (eXpress Data Path) runs in the NIC driver layer, before the kernel’s networking stack allocates a socket buffer. An XDP program attached to a network interface processes packets as raw frames directly from the ring buffer. At this layer, dropping a packet costs roughly 50 nanoseconds and does not involve the kernel’s TCP stack, conntrack, or any socket allocation. XDP programs written in eBPF can sustain 10–100 million packets per second on commodity hardware. An attacker generating 100k requests per second is generating at minimum 200k–400k packets per second (SYN, data, ACK). An XDP program can absorb that entirely and return zero load to the application.

nftables runs in the netfilter framework, which is deeper in the kernel stack than XDP but still before any userspace process is involved. nftables evaluates after the kernel’s IP routing and conntrack modules, which means it has access to connection state, but it drops connections before TLS handshake begins. This is the correct enforcement point for connection-level limits: reject a TCP connection before the application allocates a TLS context for it. A dropped SYN at the nftables layer costs roughly one netfilter evaluation — no TLS context allocation, no memory for the SSL session, no HTTP parser state.

tc (traffic control) with eBPF attaches at the network interface’s traffic control hook, after the kernel has allocated an sk_buff for the frame but before IP-layer processing and socket delivery. This layer supports traffic shaping per flow, which makes it suitable for bandwidth limiting alongside packet-rate limiting. It is less commonly used for API gateway protection than XDP or nftables, but it is relevant for controlling downstream traffic to clients.

The attack patterns that kernel-level rate limiting addresses directly:

Volumetric API abuse: A botnet with 10,000 infected nodes sends 10 credential stuffing requests per second from each node. That is 100,000 requests per second total to your /api/v1/auth/login endpoint. nginx’s limit_req zone catches this only if you set the zone limit at the endpoint level, not the per-IP level, because each individual IP is only sending 10 requests per second, which is under most per-IP limits. nftables per-/24-subnet limits and XDP counters on the listening port catch this where per-IP limits do not. Note that at the kernel layer the URL path is hidden inside TLS, so these limits apply to port 443 as a whole rather than to /api/v1/auth/login specifically.

SYN flood against HTTPS: An attacker sends tens of thousands of TCP SYN packets per second to port 443. Each SYN that completes handshake causes the application to allocate a TLS context — 10–50 KB of memory for the SSL session structure in OpenSSL or BoringSSL. At 10,000 connections per second, this is 100–500 MB of TLS context allocation per second before a single HTTP request is processed. nftables conntrack limits terminate new TCP connections before TLS handshake begins when the per-IP new connection rate exceeds the configured threshold.

Slowloris connection exhaustion: Connections that send HTTP headers one byte every 10 seconds hold a connection slot open without consuming bandwidth. nginx has a client_header_timeout directive, but many deployments have it set too high. nftables conntrack-based connection counting per source IP limits the number of simultaneous connections regardless of their transfer rate.

JWT validation DoS: Requests sent with well-formed but intentionally expensive or malformed JWTs (RS256 with a 16KB key, for instance) exhaust the JWT validation library’s CPU budget. These requests look legitimate at the TCP and IP layers; they complete TLS and begin HTTP request processing before the application discovers they are malformed. The kernel cannot see the URL path inside TLS, so nftables cannot limit a single endpoint. What it can do is cap the rate at which any single IP reaches port 443, which throttles the volume of JWT validation attempts regardless of their content.
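
The SYN flood figure above is simple multiplication, and it is worth re-running with your own numbers. A minimal sketch, assuming the 10–50 KB per-connection context size quoted in the text and decimal megabytes; the function name is illustrative only:

```python
def tls_alloc_mb_per_sec(conn_per_sec: int, ctx_kb: float) -> float:
    """TLS context memory allocated per second, in MB (1 MB = 1000 KB)."""
    return conn_per_sec * ctx_kb / 1000


CONN_PER_SEC = 10_000      # new connections per second, from the example above
CTX_KB_RANGE = (10, 50)    # per-connection TLS context size range, from the text

low, high = (tls_alloc_mb_per_sec(CONN_PER_SEC, kb) for kb in CTX_KB_RANGE)
print(f"{low:.0f}-{high:.0f} MB of TLS context allocated per second")
```

Substituting a measured context size for your TLS library gives a rough upper bound on how long the gateway can absorb a flood before memory pressure sets in.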

Threat Model

  • Botnet generating 100,000 requests per second: application-layer rate limiting saturated at approximately 10,000 evaluations per second on a typical 8-core API gateway node. Kernel-level XDP drop handles the remaining 90,000 pps with no userspace involvement.
  • Distributed credential stuffing from 10,000 IPs at 10 req/s each: per-IP limits of 100/s per IP do not trigger. Per-/24-subnet limits of 500/s catch this because many botnets use IP ranges from shared hosting and cloud providers where multiple botnet nodes occupy the same /24.
  • SYN flood at 50,000 SYN/s: TLS handshake pool exhausted in seconds on a gateway with the default OpenSSL session limit. nftables conntrack new-connection rate limit drops SYN packets before any TLS context is allocated.
  • Slowloris at 5,000 simultaneous connections held open: with worker_connections set to 1024 (a common value in shipped nginx.conf files), 5,000 simultaneous Slowloris connections saturate roughly 5 nginx workers. An nftables per-IP connection count limit of 20 simultaneous connections prevents any single IP from holding more than 20 Slowloris connections, reducing the total impact from 5,000 to number_of_attacking_IPs * 20.
  • JWT DoS with 1,000 malformed RSA JWTs per second: 1,000 RSA-2048 verification failures per second consumes approximately 2 seconds of CPU on a single core. Rate limiting at 100 requests per second per IP throttles the crypto load per source, making this attack require 10x more attacking IPs to achieve the same CPU exhaustion.
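
The distributed credential stuffing case can be checked with a few lines of arithmetic. The sketch below is a userspace model, not kernel code: it groups per-IP request rates by /24 and applies the 100/s per-IP and 500/s per-subnet thresholds from the list above. The node counts and addresses are made-up illustration data.

```python
import ipaddress
from collections import Counter


def subnet_key(ip: str) -> str:
    """Mask an IPv4 address down to its /24 network."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def over_limit(rates: dict, per_ip_limit: int, per_subnet_limit: int):
    """Return (IPs over the per-IP limit, /24s over the per-subnet limit)."""
    flagged_ips = sorted(ip for ip, rate in rates.items() if rate > per_ip_limit)
    subnet_totals = Counter()
    for ip, rate in rates.items():
        subnet_totals[subnet_key(ip)] += rate
    flagged_subnets = sorted(
        net for net, total in subnet_totals.items() if total > per_subnet_limit
    )
    return flagged_ips, flagged_subnets


# Hypothetical botnet: 60 nodes in one /24 and 5 in another, 10 req/s each.
rates = {f"198.51.100.{i}": 10 for i in range(1, 61)}
rates.update({f"203.0.113.{i}": 10 for i in range(1, 6)})

flagged_ips, flagged_subnets = over_limit(rates, per_ip_limit=100, per_subnet_limit=500)
print("IPs over limit:", flagged_ips)
print("Subnets over limit:", flagged_subnets)
```

No single address crosses the per-IP threshold, but the densely populated /24 does cross the subnet threshold. The sparsely populated /24 stays under both, which is the limitation discussed later for widely distributed botnets.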

Hardening Configuration

1. nftables Per-IP Rate Limiting for API Endpoints

The nftables meter construct implements per-key rate limiting in kernel space. It fills the role that the recent and hashlimit modules fill in iptables, but it is expressed directly in the ruleset: a meter is a dynamic set backed by a kernel hash table with O(1) average lookup. Each meter entry tracks packet count and timing for a key (source IP, source /24, etc.) without userspace involvement.
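
To make the per-key behaviour concrete, here is a small userspace model of a meter: one token bucket per key, refilled at the configured rate and capped at the burst size. It is a sketch of the general technique only. The kernel's limit accounting differs in detail, and the class names are invented for this example.

```python
class TokenBucket:
    """One bucket: refills at `rate` tokens/second, holds at most `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = None

    def allow(self, now: float) -> bool:
        if self.last is not None:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.burst, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Meter:
    """Per-key rate limiter: each key (e.g. a source IP) gets its own bucket."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, key: str, now: float) -> bool:
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.burst)
        return bucket.allow(now)


meter = Meter(rate=100, burst=200)
flood = [meter.allow("198.51.100.23", now=0.0) for _ in range(300)]
print("flooding IP allowed:", flood.count(True), "dropped:", flood.count(False))
print("other IP allowed:", meter.allow("203.0.113.147", now=0.0))
```

The flooding source exhausts only its own bucket; a second source arriving at the same instant is unaffected, which is the property that distinguishes a keyed meter from a single global limit.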

#!/bin/bash
# /etc/nftables/api-protection.sh
# Complete nftables API protection ruleset
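# Run as root. Idempotent: flushes and rebuilds on each execution.
# WARNING: "nft flush ruleset" removes every table on the host, including
# unrelated firewall rules. Run this only where the script owns the whole ruleset.

set -euo pipefail

nft flush ruleset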

nft add table inet api_protection

nft add chain inet api_protection input \
  '{ type filter hook input priority 0; policy accept; }'

# --- Per-source-IP rate limit on HTTPS (443) ---
# Drop packets from any source IP that exceeds 100 packets/second (burst 200).
# Only the "over" rule is needed: packets under the limit fall through to the
# rules below and are finally accepted by the chain policy. An explicit accept
# here would end evaluation and bypass the subnet and SYN rules entirely.
# Each meter name is bound to a single rule; state lives in the kernel for the
# lifetime of that rule. The "ip saddr" key covers IPv4 only.

nft add rule inet api_protection input \
  ip protocol tcp tcp dport 443 \
  meter ip_rate_limit size 65536 \
    '{ ip saddr timeout 60s limit rate over 100/second burst 200 packets }' \
  counter drop

# --- Per-/24 subnet limit (catches distributed botnet across a single provider range) ---
# Many botnets and credential stuffing operations use cloud hosting ranges.
# A /24 containing 50 botnet nodes at 10 req/s each = 500 req/s from the subnet.

nft add rule inet api_protection input \
  ip protocol tcp tcp dport 443 \
  meter subnet_rate size 16384 \
    '{ ip saddr and 255.255.255.0 timeout 60s limit rate over 500/second burst 1000 packets }' \
  counter drop

# --- SYN flood protection ---
# Evaluates only TCP SYN packets (new connection initiation).
# Drops before kernel allocates TCP socket state and long before TLS context.

nft add rule inet api_protection input \
  tcp flags '& (fin|syn|rst|ack) == syn' \
  meter syn_flood size 65536 \
    '{ ip saddr timeout 30s limit rate over 20/second burst 40 packets }' \
  counter drop

echo "nftables API protection ruleset loaded"
nft list ruleset

The size parameter on meters sets the maximum number of tracked keys. A meter with size 65536 tracking source IPs uses approximately 4 MB of kernel memory (64 bytes per entry). Entries leave a meter when their timeout expires (60s or 30s in the rules above); a full meter does not evict older entries to make room. Under an attack from more than 65,536 distinct source IPs within one timeout window, new keys cannot be inserted, and packets from those untracked sources fall through the rule unmetered until existing entries expire. Size each meter for the largest source population you expect and keep timeouts short. The alternative, an unbounded meter, risks kernel memory exhaustion.
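
The sizing estimate is easy to reproduce for a whole ruleset. A back-of-the-envelope sketch, assuming the 64 bytes per entry quoted above (an approximation, since real per-element overhead depends on the kernel version and set backend) and the three meter sizes from the script:

```python
BYTES_PER_ENTRY = 64  # approximation taken from the text above


def meter_memory_mib(size: int, bytes_per_entry: int = BYTES_PER_ENTRY) -> float:
    """Worst-case memory for a completely full meter, in MiB."""
    return size * bytes_per_entry / (1024 * 1024)


# Meter names and sizes as declared in the ruleset script.
METERS = {"ip_rate_limit": 65536, "subnet_rate": 16384, "syn_flood": 65536}

for name, size in METERS.items():
    print(f"{name}: {meter_memory_mib(size):.1f} MiB")

total = sum(meter_memory_mib(size) for size in METERS.values())
print(f"total: {total:.1f} MiB")
```

Doubling a meter's size doubles its worst-case footprint, so the trade-off between tracking capacity and memory is linear and cheap to reason about.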

2. Connection Limit with nftables conntrack

The conntrack module tracks active connection state. Using it for per-IP connection counting blocks Slowloris and similar hold-open attacks at the TCP layer, before nginx allocates a connection slot.

# Append to the api-protection ruleset

# --- Simultaneous connection limit per source IP ---
# ct count tracks the number of entries in the conntrack table
# matching the specified key. ct state new evaluates only on
# the first packet of a new connection (the SYN).

nft add rule inet api_protection input \
  ip protocol tcp tcp dport 443 \
  ct state new \
  meter conn_limit size 65536 \
    '{ ip saddr ct count over 20 }' \
  reject with tcp reset

# --- Total conntrack table cap ---
# Guards against conntrack table exhaustion. When the table is full, the
# kernel drops packets that would create a new entry ("nf_conntrack: table
# full, dropping packet"), which hits legitimate clients as well as attackers.
# sysctl net.netfilter.nf_conntrack_max shows the configured maximum.
# 80000 below corresponds to 80% of a 100000-entry table; adjust it to 80%
# of your own maximum.

nft add rule inet api_protection input \
  ip protocol tcp tcp dport 443 \
  ct state new \
  ct count over 80000 \
  drop

The per-IP count kept by the conn_limit meter falls as connections close and their conntrack entries are removed. A legitimate user browsing an API with keep-alive connections will hold 2–5 simultaneous connections. Setting the limit at 20 leaves headroom for legitimate high-concurrency API clients while blocking a Slowloris attacker from holding hundreds of connections from one IP.

Setting ct count over 80000 as an absolute table cap requires knowing your configured nf_conntrack_max. Check and tune:

sysctl net.netfilter.nf_conntrack_max
# Default on most systems: 65536 or 131072
# Set 80% of your configured maximum in the ct count rule above
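
The 80% figure can be derived rather than hard-coded. The helper below is a minimal sketch: it reads nf_conntrack_max from procfs (the file behind the sysctl shown above) when it is available and computes the threshold with integer arithmetic. The function names are illustrative, not part of any tool.

```python
from pathlib import Path
from typing import Optional

CONNTRACK_MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_threshold(conntrack_max: int, percent: int = 80) -> int:
    """Connection count at which the table cap rule should start dropping."""
    return conntrack_max * percent // 100


def read_conntrack_max(path: Path = CONNTRACK_MAX_PATH) -> Optional[int]:
    """Return the configured maximum, or None if conntrack is not loaded."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


configured = read_conntrack_max()
if configured is None:
    print("nf_conntrack_max not readable; is the conntrack module loaded?")
else:
    print(f"ct count over {conntrack_threshold(configured)}")
```

For the two example maximums mentioned above, 65536 and 131072, this yields thresholds of 52428 and 104857 respectively.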

3. XDP Rate Limiter (eBPF) for High-Volume Protection

XDP programs run at the NIC driver layer, before the kernel’s networking stack allocates a socket buffer (sk_buff). Dropping a packet in XDP costs one eBPF program execution and a map lookup — roughly 50–200 nanoseconds. This is three orders of magnitude cheaper than processing the same packet to the nginx connection handler.

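The program below implements a fixed-window packet counter keyed by source IP: each source gets a one-second window and a packet count, and packets beyond RATE_LIMIT_PPS within that window are dropped. The eBPF LRU hash map provides O(1) key lookup and automatic eviction of the least-recently-used entries when the map reaches capacity.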

// xdp_rate_limit.c: XDP packet rate limiter for API gateway protection
// Compile with: clang -O2 -g -target bpf -c xdp_rate_limit.c -o xdp_rate_limit.o
//   (-g emits the BTF type information that SEC(".maps") definitions require)
// Load with:    ip link set dev eth0 xdp obj xdp_rate_limit.o sec xdp

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>      // IPPROTO_TCP
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Map capacity: 1M entries. Each entry is 16 bytes (rate_entry struct).
// Total memory: ~16 MB for the map itself, plus kernel overhead.
#define MAX_ENTRIES     (1 << 20)

// Per-IP packet rate limit: packets per second.
// For HTTPS traffic, 1 HTTP/2 request = roughly 4-8 packets (TLS, HEADERS, DATA).
// 1000 pps ≈ 125-250 requests/second. Tune to your legitimate peak per-IP load.
#define RATE_LIMIT_PPS  1000

// Fixed window: 1 second in nanoseconds. The window restarts on the first
// packet seen after the previous window has elapsed.
#define WINDOW_NS       1000000000ULL

struct rate_entry {
    __u64 window_start;   // Timestamp of window start (bpf_ktime_get_ns())
    __u32 count;          // Packets seen in current window
    __u32 _pad;           // Alignment padding
};

// BPF_MAP_TYPE_LRU_HASH: bounded hash map with LRU eviction.
// When the map reaches MAX_ENTRIES, the least-recently-accessed entry is
// evicted to make room. This prevents memory exhaustion under large-IP attacks.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, MAX_ENTRIES);
    __type(key, __u32);              // Source IPv4 address
    __type(value, struct rate_entry);
} rate_map SEC(".maps");

// Separate map for per-source-IP drop counters — readable from userspace
// via bpftool for monitoring without affecting the hot path.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, MAX_ENTRIES);
    __type(key, __u32);
    __type(value, __u64);
} drop_counters SEC(".maps");

SEC("xdp")
int xdp_rate_limit(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data     = (void *)(long)ctx->data;

    // Parse Ethernet header. Bounds-check is mandatory: the eBPF verifier
    // rejects programs that dereference pointers without bounds checks.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only process IPv4. Pass IPv6 through to nftables for separate handling.
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

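    // Only process TCP (port 443). Pass UDP, ICMP, etc. through.
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    // The IPv4 header is variable-length: ihl counts 32-bit words (5 to 15).
    // Use it to locate the TCP header so that packets carrying IP options
    // are parsed at the correct offset.
    __u32 ip_hdr_len = ip->ihl * 4;
    if (ip_hdr_len < sizeof(*ip))
        return XDP_PASS;

    struct tcphdr *tcp = (void *)ip + ip_hdr_len;
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;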

    // Only rate-limit packets to destination port 443 (HTTPS).
    if (bpf_ntohs(tcp->dest) != 443)
        return XDP_PASS;

    __u32 src_ip = ip->saddr;
    __u64 now    = bpf_ktime_get_ns();

    struct rate_entry *entry = bpf_map_lookup_elem(&rate_map, &src_ip);

    if (!entry) {
        // First packet from this IP: create a new rate entry.
        struct rate_entry new_entry = {
            .window_start = now,
            .count        = 1,
        };
        bpf_map_update_elem(&rate_map, &src_ip, &new_entry, BPF_ANY);
        return XDP_PASS;
    }

    // Check if we have moved past the rate window (1 second).
    if (now - entry->window_start > WINDOW_NS) {
        // New window: reset counter.
        entry->window_start = now;
        entry->count        = 1;
        return XDP_PASS;
    }

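    // Within the current window: increment and check limit.
    // The map entry is shared across CPUs, so use an atomic add; a plain ++
    // can lose updates when several cores handle packets from one source.
    __sync_fetch_and_add(&entry->count, 1);
    if (entry->count > RATE_LIMIT_PPS) {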
        // Increment drop counter for monitoring (best-effort; ignore errors).
        __u64 *drops = bpf_map_lookup_elem(&drop_counters, &src_ip);
        if (drops) {
            __sync_fetch_and_add(drops, 1);
        } else {
            __u64 initial = 1;
            bpf_map_update_elem(&drop_counters, &src_ip, &initial, BPF_ANY);
        }
        return XDP_DROP;
    }

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
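
The per-packet decision in the XDP program is a fixed-window counter, and its edge behaviour is easier to see in a userspace model. The Python sketch below mirrors the lookup, window-reset, and increment steps of the C code with a plain dictionary standing in for the LRU map. It is a model for reasoning about the logic, not a replacement for testing the eBPF program itself.

```python
WINDOW_NS = 1_000_000_000   # same 1-second window as the C program
RATE_LIMIT_PPS = 1000       # same per-IP limit as the C program


class FixedWindowLimiter:
    """Userspace model of the XDP program's per-source decision."""

    def __init__(self, limit: int = RATE_LIMIT_PPS, window_ns: int = WINDOW_NS):
        self.limit = limit
        self.window_ns = window_ns
        self.entries = {}   # src_ip -> [window_start_ns, count]

    def decide(self, src_ip: str, now_ns: int) -> str:
        entry = self.entries.get(src_ip)
        if entry is None:                       # first packet from this source
            self.entries[src_ip] = [now_ns, 1]
            return "PASS"
        if now_ns - entry[0] > self.window_ns:  # window elapsed: reset
            entry[0], entry[1] = now_ns, 1
            return "PASS"
        entry[1] += 1
        return "DROP" if entry[1] > self.limit else "PASS"


limiter = FixedWindowLimiter()
ip = "198.51.100.23"
limiter.decide(ip, 0)                                              # opens window 1
late = [limiter.decide(ip, 999_000_000) for _ in range(999)]       # end of window 1
early = [limiter.decide(ip, 1_000_000_001) for _ in range(1000)]   # start of window 2
passed = late.count("PASS") + early.count("PASS")
print(f"{passed} packets passed within about 1 ms around the window boundary")
```

The model shows a known property of fixed windows: a sender that times its burst around a window boundary can briefly pass close to twice the configured limit. If that matters for your traffic, set RATE_LIMIT_PPS with that factor in mind.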

Compile and load the XDP program:

# Install dependencies (Debian/Ubuntu)
apt-get install -y clang llvm libelf-dev linux-headers-$(uname -r) \
  libbpf-dev iproute2 bpftool

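# Compile to eBPF bytecode (-g emits BTF, which the SEC(".maps") definitions require)
clang -O2 -g -target bpf \
  -I/usr/include/$(uname -m)-linux-gnu \
  -c xdp_rate_limit.c \
  -o xdp_rate_limit.o

# Attach to the ingress interface.
# The plain "xdp" keyword attaches in native (driver) mode when the NIC driver
# supports it and falls back to generic mode otherwise. Use "xdpdrv" instead to
# require native mode and fail if it is unavailable, or "xdpgeneric" to force
# the generic path.
ip link set dev eth0 xdp obj xdp_rate_limit.o sec xdp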

# Verify the program is attached
ip link show eth0 | grep xdp

# List all loaded eBPF programs
bpftool prog list | grep xdp_rate_limit

# Inspect drop counters from userspace (read-only, no performance impact)
bpftool map dump name drop_counters

# To detach:
ip link set dev eth0 xdp off

The xdpdrv (driver/native) mode is critical for performance. In native mode, the XDP program runs before the kernel allocates a socket buffer — the packet is dropped directly in the ring buffer. In xdpgeneric mode (the fallback when the NIC driver does not support native XDP), the packet has already been converted to an sk_buff before the XDP program runs, and XDP_DROP frees the already-allocated buffer. Generic mode is still substantially faster than userspace processing but does not provide the full performance benefit of native XDP.

Check whether your NIC driver supports native XDP:

# Drivers with native XDP support include: mlx4, mlx5, i40e, ixgbe, nfp, virtio_net, veth
ethtool -i eth0 | grep driver
bpftool feature probe dev eth0 | grep xdp
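
For fleet-wide checks, the driver lookup can be scripted. The sketch below parses the text that ethtool -i prints and compares the driver name against the list given above. Prefix matching is an assumption made here because real module names carry suffixes (mlx5_core, for example), and the list itself is not exhaustive, so treat a negative result as a prompt to test an xdpdrv attach rather than as a verdict.

```python
from typing import Optional

# Driver name prefixes taken from the list in the text; not exhaustive.
NATIVE_XDP_PREFIXES = ("mlx4", "mlx5", "i40e", "ixgbe", "nfp", "virtio_net", "veth")


def driver_from_ethtool(output: str) -> Optional[str]:
    """Extract the driver name from `ethtool -i <iface>` output."""
    for line in output.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "driver":
            return value.strip()
    return None


def likely_native_xdp(driver: Optional[str]) -> bool:
    """True if the driver name starts with one of the listed prefixes."""
    return driver is not None and driver.startswith(NATIVE_XDP_PREFIXES)


# Hypothetical captured output; on a live host use
# subprocess.run(["ethtool", "-i", "eth0"], capture_output=True, text=True).stdout
sample = "driver: mlx5_core\nversion: 5.15.0\nbus-info: 0000:3b:00.0\n"
driver = driver_from_ethtool(sample)
print(driver, "-> native XDP likely:", likely_native_xdp(driver))
```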

4. nginx Integration with Kernel-Level Rate Limiting

With XDP and nftables handling volumetric traffic, nginx receives only the requests that passed kernel-level filtering. The nginx rate limit zones now operate on a substantially reduced request volume, which means they can enforce much tighter per-endpoint limits without the processing overhead of evaluating 100k req/s.

The key architectural point: nftables and XDP act as a coarse filter protecting the application from volume. nginx applies fine-grained semantic limits — per-endpoint, per-authenticated-user, per-route — on the legitimate traffic that makes it through.

# /etc/nginx/nginx.conf (http block)

http {
    # Rate limit zones apply after kernel filtering.
    # These limits are tight because they operate on filtered traffic only.

    # Authentication endpoint: 5 attempts per minute per IP.
    # Under normal operation, a user logging in rarely needs more than 1-2/minute.
    limit_req_zone $binary_remote_addr zone=auth_login:10m rate=5r/m;

    # General API: 60 requests per minute per IP.
    limit_req_zone $binary_remote_addr zone=api_general:10m rate=60r/m;

    # Sensitive data endpoints: 10 requests per minute per IP.
    limit_req_zone $binary_remote_addr zone=api_sensitive:10m rate=10r/m;

    # Per-authenticated-user limit using JWT subject claim (if using nginx-plus
    # or openresty with lua-resty-jwt). This catches distributed attacks that
    # rotate source IPs but reuse the same credential.
    # limit_req_zone $jwt_claim_sub zone=auth_user:10m rate=30r/m;

    server {
        listen 443 ssl http2;

        # Authentication endpoint: tight limit, minimal burst.
        # A burst of 3 allows a user who double-clicks login to not see a 429.
        location /api/v1/auth/login {
            limit_req zone=auth_login burst=3 nodelay;
            limit_req_status 429;

            # Return Retry-After header so legitimate clients back off correctly.
            # As written, add_header attaches it to every response from this
            # location, not only to 429s.
            add_header Retry-After 60 always;

            proxy_pass http://backend_auth;
        }

        # Token refresh: tighter than general API, credential stuffing often
        # targets refresh endpoints to maintain session without re-authenticating.
        location /api/v1/auth/token/refresh {
            limit_req zone=auth_login burst=2 nodelay;
            limit_req_status 429;
            proxy_pass http://backend_auth;
        }

        # Sensitive endpoints: account details, payment methods, PII.
        location ~ ^/api/v1/(account|payment|profile) {
            limit_req zone=api_sensitive burst=5 nodelay;
            limit_req_status 429;
            proxy_pass http://backend_api;
        }

        # General API traffic.
        location /api/v1/ {
            limit_req zone=api_general burst=20 nodelay;
            limit_req_status 429;
            proxy_pass http://backend_api;
        }
    }
}
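
The interaction of rate, burst, and nodelay is a common source of surprise, so a small model helps when choosing values. The sketch below follows the accounting described in the nginx limit_req documentation (a per-key "excess" that drains at the configured rate and is rejected once it exceeds burst), using floating point instead of nginx's integer millirequest units. The class is a simplified illustration, not nginx's implementation.

```python
class LimitReqZone:
    """Simplified model of limit_req with nodelay for a single zone."""

    def __init__(self, rate_per_min: float, burst: int):
        self.rate = rate_per_min / 60.0   # requests per second
        self.burst = burst
        self.state = {}                   # key -> (excess, last_seen_seconds)

    def allow(self, key: str, now: float) -> bool:
        if key not in self.state:         # first request for this key
            self.state[key] = (0.0, now)
            return True
        excess, last = self.state[key]
        excess = max(excess - self.rate * (now - last), 0.0) + 1.0
        if excess > self.burst:
            return False                  # rejected; state is left unchanged
        self.state[key] = (excess, now)
        return True


# auth_login zone from the config above: rate=5r/m, burst=3 nodelay
zone = LimitReqZone(rate_per_min=5, burst=3)
instant = [zone.allow("198.51.100.23", 0.0) for _ in range(5)]
print("five requests at t=0:", instant)
print("one request at t=13s:", zone.allow("198.51.100.23", 13.0))
```

With burst=3, four back-to-back requests are admitted (the first plus three of burst) and the fifth is rejected. Capacity then returns at the configured rate, one request roughly every 12 seconds for 5r/m.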

5. Monitoring nftables Rate Limit Counters

The nftables counter keyword on drop rules accumulates packet and byte counts per rule. These are the primary signal for verifying that kernel-level rate limiting is working and distinguishing attack traffic from legitimate traffic.

# Inspect all counters in the protection table
nft list table inet api_protection

# Watch counter values update in real time (1-second refresh)
watch -n1 'nft list table inet api_protection | grep -E "(packets|bytes|counter)"'

# Extract raw counter values for a specific rule
# nft outputs format: "counter packets 12345 bytes 678900"
nft list table inet api_protection | \
  awk '/counter/ { for(i=1;i<=NF;i++) if ($i=="packets") print $(i+1) }'

Export to Prometheus for dashboarding and alerting. Run this as a systemd service or scrape target:

#!/bin/bash
# /usr/local/bin/nftables-exporter.sh
# Prometheus text format exporter for nftables API protection counters.
# Invoke from a metrics endpoint (e.g., node_exporter textfile collector).

set -euo pipefail

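OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/nftables.prom"
# Create the temp file next to the target so the final mv is an atomic rename,
# and make it readable: mktemp creates files with mode 0600, which a
# node_exporter running as another user cannot read.
TMPFILE="$(mktemp "${OUTPUT_FILE}.XXXXXX")"
chmod 644 "$TMPFILE"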

cat > "$TMPFILE" << 'PROM_EOF'
# HELP nftables_api_dropped_packets_total Packets dropped by nftables API protection rules
# TYPE nftables_api_dropped_packets_total counter
PROM_EOF

# Parse nft output: each drop rule with a counter outputs
# "counter packets N bytes M" on its own line after the rule match.
nft -j list table inet api_protection 2>/dev/null | \
  python3 -c "
import json, sys

data = json.load(sys.stdin)
rules = [
    item['rule']
    for item in data.get('nftables', [])
    if 'rule' in item
]

for rule in rules:
    chain = rule.get('chain', 'unknown')
    for stmt in rule.get('expr', []):
        if 'counter' in stmt:
            c = stmt['counter']
            packets = c.get('packets', 0)
            bytes_  = c.get('bytes', 0)
            comment = rule.get('comment', f'rule_{rule.get(\"handle\", 0)}')
            print(f'nftables_api_dropped_packets_total{{chain=\"{chain}\",rule=\"{comment}\"}} {packets}')
            print(f'nftables_api_dropped_bytes_total{{chain=\"{chain}\",rule=\"{comment}\"}} {bytes_}')
" >> "$TMPFILE" 2>/dev/null || true

# XDP drop counters from the eBPF map
bpftool map dump name drop_counters 2>/dev/null | \
  awk '
    /key:/ { gsub(/key: /, ""); key=$0 }
    /value:/ { gsub(/value: /, ""); val=$0; printf "nftables_xdp_dropped_packets_total{src_ip=\"%s\"} %s\n", key, val }
  ' >> "$TMPFILE" 2>/dev/null || true

mv "$TMPFILE" "$OUTPUT_FILE"
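
Parsing bpftool's plain-text dump with awk is fragile because the layout varies with bpftool version and with whether BTF is present. A more robust route is bpftool's JSON mode (bpftool -j map dump name drop_counters). The sketch below assumes that each JSON entry carries "key" and "value" as arrays of hex byte strings and that the host is little-endian. Both are assumptions to verify against your bpftool build before relying on this.

```python
import ipaddress
import json


def decode_entry(entry: dict) -> tuple:
    """Turn one raw map entry into (dotted IPv4 string, drop count)."""
    key = bytes(int(b, 16) for b in entry["key"])
    value = bytes(int(b, 16) for b in entry["value"])
    # The key is ip->saddr, stored in network byte order, so the bytes are
    # already in dotted-quad order. The __u64 counter is host-endian.
    return str(ipaddress.IPv4Address(key)), int.from_bytes(value, "little")


def to_prom_lines(dump_json: str) -> list:
    """Render a bpftool JSON dump as Prometheus text-format lines."""
    lines = []
    for entry in json.loads(dump_json):
        ip, drops = decode_entry(entry)
        lines.append(f'nftables_xdp_dropped_packets_total{{src_ip="{ip}"}} {drops}')
    return lines


# Hand-written sample in the assumed shape, not captured output.
SAMPLE = """[
  {"key": ["0xc6", "0x33", "0x64", "0x17"],
   "value": ["0x4f", "0x2a", "0x01", "0x00", "0x00", "0x00", "0x00", "0x00"]}
]"""

for line in to_prom_lines(SAMPLE):
    print(line)
```

A per-source-IP label can create very high series cardinality during a large attack. Exporting only the top N sources, or a single aggregate, is usually kinder to the Prometheus server.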

6. Automatic Blocklist Feed from Rate Limit Violations

Rate limiting per second is appropriate for traffic shaping, but an IP that persistently exceeds limits across multiple windows is an attacker, not a legitimate client with bursty traffic. Move persistent violators into a blocklist with a timeout so they face an outright drop rather than per-packet rate evaluation.

# Add a blocklist set to the nftables table.
# timeout 1h: entries expire automatically after one hour.
# flags timeout: enables per-element timeouts.

nft add set inet api_protection blocklist \
  '{ type ipv4_addr; flags timeout; timeout 1h; size 65536; }'

# Drop rule for blocklisted IPs. "nft insert" places the rule at the start of
# the chain, so it is evaluated before the rate limit rules.
# It must be inserted, not appended.
nft insert rule inet api_protection input \
  ip saddr @blocklist \
  counter drop

#!/bin/bash
# /usr/local/bin/nftables-blocklist-updater.sh
# Run via systemd timer every 60 seconds.
# Reads kernel log entries for nftables drops and promotes persistent
# violators to the blocklist.

set -euo pipefail

BLOCKLIST_SET="inet api_protection blocklist"
VIOLATION_LOG="/var/log/nftables-violations.log"
BLOCK_THRESHOLD=5  # Block after N logged drops within the check window

# nftables does not expose per-meter-key violation counts directly via CLI.
# Instead, correlate with kernel audit events or use the JSON ruleset dump
# to identify IPs that are consistently in the "over rate" state.

# Method: parse kernel log for nftables drop events (requires LOG target on drop rules).
# Add LOG target to drop rules:
#   nft add rule inet api_protection input \
#     meter ip_rate_limit '{ ip saddr limit rate over 100/second }' \
#     log prefix "nft-api-drop: " level warn counter drop

# grep exits non-zero when nothing matches; with "set -o pipefail" that would
# fail the whole script on every quiet minute, so tolerate empty results.
journalctl -k --since "1 minute ago" --no-pager 2>/dev/null | \
  { grep "nft-api-drop:" || true; } | \
  { grep -oP 'SRC=\K[0-9.]+' || true; } | \
  sort | uniq -c | sort -rn | \
  while read -r count ip; do
    if [ "$count" -ge "$BLOCK_THRESHOLD" ]; then
      # Add to blocklist with 1-hour timeout.
      # Re-adding an IP that is already listed is harmless; whether it also
      # refreshes the timeout depends on the kernel version.
      nft add element inet api_protection blocklist "{ ${ip} timeout 1h }" 2>/dev/null && {
        echo "$(date -Iseconds) BLOCKED ${ip} (${count} drops in last 60s)" >> "$VIOLATION_LOG"
        logger -t nftables-blocklist "Blocked ${ip}: ${count} rate limit violations"
      }
    fi
  done
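
The counting step of the updater can also be expressed in Python, which makes it easier to unit-test and to validate addresses before they are interpolated into an nft command. A minimal sketch, assuming the nft-api-drop: log prefix suggested above and the usual SRC=<address> field of kernel netfilter log lines; the sample lines are fabricated for illustration:

```python
import ipaddress
import re
from collections import Counter

SRC_RE = re.compile(r"\bSRC=(\S+)")


def violators(lines, prefix: str = "nft-api-drop:", threshold: int = 5) -> dict:
    """Map each valid IPv4 source to its drop count if it meets the threshold."""
    counts = Counter()
    for line in lines:
        if prefix not in line:
            continue
        match = SRC_RE.search(line)
        if not match:
            continue
        try:
            ip = str(ipaddress.IPv4Address(match.group(1)))
        except ValueError:
            continue            # skip anything that is not a clean IPv4 address
        counts[ip] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Fabricated log lines in the usual kernel LOG layout.
attacker = "kernel: nft-api-drop: IN=eth0 OUT= SRC=198.51.100.23 DST=192.0.2.10 PROTO=TCP DPT=443"
bursty = "kernel: nft-api-drop: IN=eth0 OUT= SRC=203.0.113.147 DST=192.0.2.10 PROTO=TCP DPT=443"
noise = "kernel: eth0: link up"
log = [attacker] * 7 + [bursty] * 2 + [noise]

print(violators(log))
```

Only the source that crossed the threshold is returned; the occasional offender and unrelated log lines are ignored.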

Set up the blocklist updater as a systemd timer rather than a cron job to get accurate 60-second intervals and automatic restart on failure:

# /etc/systemd/system/nftables-blocklist.service
[Unit]
Description=nftables API protection blocklist updater
After=nftables.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/nftables-blocklist-updater.sh

# /etc/systemd/system/nftables-blocklist.timer
[Unit]
Description=Run nftables blocklist updater every 60 seconds

[Timer]
OnBootSec=60
OnUnitActiveSec=60
AccuracySec=5

[Install]
WantedBy=timers.target

systemctl daemon-reload
systemctl enable --now nftables-blocklist.timer

Expected Behaviour

After loading the nftables ruleset and XDP program, the system exhibits the following observable behaviour under attack.

nft list table inet api_protection shows meter entries accumulating for source IPs that are being rate-limited:

meter ip_rate_limit size 65536 {
    198.51.100.23 limit rate 100/second burst 200 packets (used)
    203.0.113.147 limit rate 100/second burst 200 packets (used)
}

The counter keywords on drop rules show increasing packet and byte counts — the signal that rate limiting is actively blocking traffic:

tcp dport 443 meter ip_rate_limit { ip saddr timeout 60s limit rate over 100/second burst 200 packets } counter packets 4521093 bytes 226054650 drop

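bpftool map dump name drop_counters shows source IPs alongside their cumulative drop counts (decoded here for readability; bpftool itself prints keys and values as raw bytes or BTF-formatted integers):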

key: 198.51.100.23    value: 0x0000000000012a4f
key: 203.0.113.147    value: 0x0000000000003c2e

nginx access logs show a substantially reduced request rate compared to what the network interface is receiving. The kernel has already dropped the attack traffic; nginx logs only requests that made it through kernel filtering. A 100k req/s attack against a gateway with 1k req/s legitimate load should show nginx logs at roughly 1k req/s — the attack traffic does not appear in the access log at all, because it was dropped before TCP handshake.

The blocklist set grows during an active attack and shrinks as timeouts expire:

nft list set inet api_protection blocklist
# table inet api_protection {
#     set blocklist {
#         type ipv4_addr
#         flags timeout
#         elements = { 198.51.100.23 timeout 1h expires 45m23s,
#                      203.0.113.147 timeout 1h expires 52m11s }
#     }
# }

Trade-offs

XDP and nftables ordering: XDP runs before the kernel’s networking stack, which means XDP drops happen before nftables evaluates any packet. If you attach an XDP program that drops packets unconditionally, nftables never sees them. The two enforcement points are not redundant — they operate in series. XDP handles the highest-volume volumetric filtering (raw packet rate by source IP); nftables handles connection-state-aware filtering (conntrack, new connection rates, simultaneous connection counts). Do not configure overlapping rules that conflict: an IP dropped by XDP will not appear in nftables conntrack and will not be counted by nftables meters.

IPv6: The XDP program above handles only IPv4. IPv6 source addresses require a 128-bit key in the eBPF map, which changes the map structure and the packet parsing logic. The nftables rules using ip saddr apply only to IPv4. If your API gateway accepts IPv6 connections — increasingly common — you need parallel rulesets: ip6 saddr for nftables and extended XDP parsing for IPv6. Omitting IPv6 coverage means an attacker with an IPv6 botnet bypasses all kernel-level filtering.

Per-IP limits versus distributed attacks: Kernel-level per-IP rate limiting is effective against attacks concentrated on small numbers of source IPs. It is less effective against attacks distributing load across large numbers of IPs, each sending modest request rates. The per-/24-subnet limit in the nftables configuration catches some distributed attacks by grouping IPs in the same hosting range. For attacks using genuinely diverse IP space — residential proxy botnets that rotate IPs across millions of home routers — kernel-level filtering must be supplemented with application-layer behavioural analysis: anomaly detection on User-Agent patterns, TLS fingerprinting via JA3, and API-key-level rate limiting that is not IP-dependent.

nftables persistence: nft flush ruleset and nft add commands configure the running kernel state only. They do not persist across reboots. Use /etc/nftables.conf with nft -f for persistence, or convert the ruleset to nftables native syntax and write it to the configuration file:

nft list ruleset > /etc/nftables/api-protection.conf

# In /etc/nftables.conf:
# include "/etc/nftables/api-protection.conf"

# Enable the nftables service to load on boot:
systemctl enable nftables

Meter entries do not persist — they are rebuilt from zero after reboot. This is acceptable: an attacking IP that was in a rate-limit state before the reboot will re-enter the rate-limit state within one window (1 second) of restarting the attack.

XDP on virtual interfaces: In cloud environments (AWS, GCP, Azure) with virtualised NICs, native XDP mode may not be available. Generic XDP mode works but loses the performance advantage of bypassing sk_buff allocation. Check driver support before relying on XDP for volumetric protection in virtualised environments. In AWS, the ENA driver supports native XDP on Nitro instances; in GCP, the gVNIC driver supports it on C3 instances.

Failure Modes

Rate limiting configured only in nginx, no kernel layer: Under a 50k req/s volumetric attack, nginx’s limit_req_zone is evaluated inside the nginx worker process. Each evaluation consumes worker CPU. At 50k req/s, all 8 nginx workers are fully occupied evaluating rate limits and returning 429 responses. Legitimate traffic at 1k req/s experiences multi-second latency waiting for a worker. The application appears down. Adding kernel-level rate limiting drops 49k req/s before nginx is involved; nginx handles the remaining 1k req/s without degradation.

XDP program attached but nftables rules not loaded: XDP handles raw packet-rate limits by source IP, but connection-count limits (Slowloris prevention) and subnet-level limits require conntrack state that XDP does not have access to. Running only XDP leaves the application vulnerable to connection exhaustion attacks. Both layers must be active.

XDP conflicts with existing iptables/nftables rules: An XDP program that drops a packet before the networking stack prevents that packet from reaching any iptables or nftables rule. If your existing firewall allows traffic based on conntrack state (ct state established accept), XDP drops will prevent the first packet (SYN) of new connections from reaching iptables — but established connections may still work because they use existing conntrack entries. Test XDP attachment carefully in staging before deploying to production. The command ip link set dev eth0 xdp off immediately detaches the program if a conflict causes unexpected drops.

No monitoring of rate limit counters: nftables meters and XDP drop counters are kernel state. Without monitoring, you cannot distinguish “rate limiting is working and stopping an attack” from “rate limiting is misconfigured and dropping legitimate traffic.” The export script and Prometheus metrics are not optional — without them, you are operating the system blind. Set an alert on nftables_api_dropped_packets_total that fires when the drop rate exceeds 1% of total traffic, triggering investigation of whether the drops represent attack traffic or false positives.
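
The 1% alert condition is a ratio of counter deltas between two scrapes. The sketch below shows that calculation in isolation. The "total packets" counter is an assumption (for example, the interface's received-packet count), since the exporter above only publishes drop counters.

```python
def drop_ratio(prev_dropped: int, cur_dropped: int,
               prev_total: int, cur_total: int) -> float:
    """Fraction of packets dropped between two scrapes of cumulative counters."""
    delta_total = cur_total - prev_total
    if delta_total <= 0:        # no traffic, or a counter reset after reload
        return 0.0
    delta_dropped = max(cur_dropped - prev_dropped, 0)
    return delta_dropped / delta_total


def should_alert(ratio: float, threshold: float = 0.01) -> bool:
    """Fire when more than `threshold` of traffic was dropped in the interval."""
    return ratio > threshold


# Two hypothetical scrapes: drops went 100 -> 350 while total went 10000 -> 30000.
ratio = drop_ratio(100, 350, 10_000, 30_000)
print(f"drop ratio {ratio:.4f}, alert: {should_alert(ratio)}")
```

Guarding against non-positive deltas matters in practice: nftables counters restart from zero whenever the ruleset is reloaded, and an unguarded subtraction would produce a spurious alert or a negative ratio.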

Blocklist feed not implemented: Without the blocklist updater, an attacking IP that exceeds the rate limit is throttled per-second but never blocked outright. The attacker can sustain a metered attack indefinitely — sending at just over the rate limit, accepting the drops, and maintaining constant pressure on the system. The per-second rate limit reduces the impact but does not eliminate it. The blocklist converts persistent violators from throttled to blocked, which is the correct escalation for IPs that have demonstrated sustained malicious behaviour.

nftables ruleset not loaded at boot: The nftables service is installed but not enabled, or /etc/nftables.conf does not include the API protection rules. After a reboot, the system runs without kernel-level rate limiting. This is indistinguishable from normal operation until an attack occurs. Verify ruleset persistence with a test after each change:

systemctl is-enabled nftables
nft list table inet api_protection 2>&1 | grep -c "chain input" || \
  echo "WARNING: api_protection table not loaded"