API Design January 13, 2026

Rate Limiting Patterns: Token Bucket, Sliding Window, and When to Use Each

Most API teams set a rate limit once — usually in the gateway config, usually a global number, usually "1000 req/min" because that's the default — and never revisit it until a consumer saturates a write endpoint or an internal service discovers it shares limits with external traffic. This post covers the three main algorithms, what each one costs in accuracy versus implementation overhead, and when you should mix them across different endpoint groups.

Rate limiting pattern comparison diagrams

Rate limiting is one of those API platform concerns that gets implemented quickly and revisited slowly. A team shipping their first public API often reaches for a simple request-per-minute counter, ships it, and forgets about it until a downstream consumer files a support ticket because their batch job is getting throttled erratically. At that point, the limitations of the initial approach become a production constraint rather than a design choice.

This post covers the three algorithms that account for the vast majority of production rate limiting implementations — token bucket, sliding window counter, and leaky bucket — explains the behavioral differences that matter for API consumers, and then addresses a topic that is often skipped in algorithm comparisons: how to express rate limit policy as a first-class artifact that lives in your OpenAPI spec rather than buried in gateway configuration.

Fixed Window: The Baseline and Its Edge Cases

Fixed window counting is the simplest approach: count requests per consumer key in a time window (typically per minute or per hour), reject requests that exceed the limit, reset the counter at the window boundary. It is trivially implementable with an atomic increment in Redis and a key that encodes the consumer identifier and the current time window.

The well-known edge case is the "boundary burst" problem. A consumer with a 1000 req/min limit can make 1000 requests at 00:59, the counter resets at 01:00, and they immediately make another 1000 requests — effectively 2000 requests in a two-second span. For APIs serving computational workloads, this burst can saturate a backend in ways the limit was designed to prevent.

Fixed window is appropriate when the boundary burst problem does not apply to your traffic profile — for example, when consumers are integrating asynchronous jobs that naturally space out their requests, or when your backends can absorb short bursts without degradation. It is the wrong choice for APIs serving latency-sensitive workloads or APIs that front finite computational resources.

Token Bucket: Burst Tolerance with a Rate Ceiling

The token bucket algorithm models a bucket with a maximum capacity of B tokens that refills at a rate of r tokens per second. Each request consumes one token. If the bucket is empty, the request is rejected with a 429. If the bucket has tokens, the request proceeds and one token is consumed.

The two parameters — bucket capacity and refill rate — give you separate control over burst tolerance and sustained throughput. A bucket with capacity 200 and a refill rate of 20/second allows a consumer to send 200 requests instantly (consuming all tokens), then sustain 20 requests per second thereafter. The initial burst is legitimate because the consumer may have accumulated tokens during an idle period. This is why token bucket is the preferred algorithm for APIs where consumers have bursty but infrequent traffic patterns — webhook retry storms, batch jobs that wake up periodically, report generation triggers.

# Simplified token bucket state check (Redis Lua script concept)
# Key: rate_limit:{consumer_id}:{endpoint}
# Fields: tokens (float), last_refill (unix timestamp)

local now = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local capacity = tonumber(ARGV[3])
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or capacity)
local last = tonumber(redis.call('HGET', KEYS[1], 'last_refill') or now)

-- Refill tokens based on elapsed time
local elapsed = now - last
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('HMSET', KEYS[1], 'tokens', tokens, 'last_refill', now)
  return 1  -- allowed
else
  return 0  -- rejected
end

The response headers you return to the consumer matter as much as the algorithm itself. Consumers need to know how many tokens remain (X-RateLimit-Remaining), when their bucket will be full enough to make another request (Retry-After), and what their total capacity is (X-RateLimit-Limit). Without these headers, a consumer experiencing throttling has no information to build a correct backoff strategy.

Sliding Window Counter: Smoothing the Boundary Burst

The sliding window counter addresses the fixed window boundary burst by maintaining a rolling count of requests over the past N seconds rather than requests in the current calendar window. The typical implementation uses two counters — the previous window count and the current window count — and weights them by the overlap between the sliding window and each fixed window:

# Sliding window approximation
# current_window_requests: requests in current 1-min window
# prev_window_requests: requests in previous 1-min window
# elapsed: seconds elapsed in current window (0-60)

weight_prev = (60 - elapsed) / 60
estimated_total = prev_window_requests * weight_prev + current_window_requests

This is an approximation — it assumes requests were uniformly distributed in the previous window, which is not always true — but it is accurate enough to eliminate the boundary burst problem for the vast majority of traffic profiles. The implementation cost is low (two Redis keys per consumer per endpoint), and the consumer experience is smoother than fixed window: the limit effectively moves with time rather than resetting abruptly.

Sliding window is a good default for public REST APIs where you want fair throttling without the implementation complexity of a full token bucket. It handles the boundary burst problem and gives consumers a predictable, smooth limit experience.

Leaky Bucket: Queue-Based Rate Shaping

The leaky bucket algorithm processes requests from a queue at a fixed rate, regardless of input rate. Requests that arrive faster than the processing rate join the queue; requests that arrive when the queue is full are dropped. Unlike token bucket, which allows bursts up to the bucket capacity, leaky bucket enforces a strict output rate — effectively shaping traffic into a smooth stream.

Leaky bucket is less commonly used at the API gateway level because it introduces queuing latency for bursty consumers. A consumer sending 100 requests in a burst will get all 100 processed, but with added queuing delay on the last N requests rather than getting immediate 429 responses. For most public APIs, this is worse consumer experience than a clean 429 with a Retry-After header — the consumer's timeout budget is consuming silently rather than failing fast.

The appropriate use case for leaky bucket at the API layer is when you need to protect a backend service that is sensitive to request rate variance rather than total throughput — a database with connection pool limits, a third-party service with a strict requests-per-second SLA, or a batch processing service that degrades non-linearly under burst load. In those cases, shaping the traffic to a consistent rate is worth the added latency.

Per-Endpoint vs. Global Policies

A single global rate limit per consumer key misses most of the interesting policy decisions. Endpoints vary dramatically in computational cost: a GET /status health check has near-zero backend cost; a POST /reports/generate that queries a data warehouse might take 30 seconds and occupy a worker thread. A global 1000 req/min limit that applies identically to both is either too permissive for the expensive endpoint or too restrictive for the cheap one.

Per-endpoint rate limit policies — expressed as tags or extensions in the OpenAPI spec — allow the platform team to set appropriate limits based on endpoint cost rather than applying uniform blanket limits. An endpoint tagged with high computational cost gets a lower rate limit and a larger token bucket depth (allowing occasional bursts while keeping sustained throughput low). A cheap read endpoint gets a higher limit and a smaller burst allowance.

paths:
  /reports/generate:
    post:
      operationId: generateReport
      x-rate-limit:
        algorithm: token_bucket
        capacity: 5
        refill_rate: 0.1          # 1 request per 10 seconds sustained
        consumer_key: api_key
        tier_overrides:
          platform: { capacity: 20, refill_rate: 0.5 }
  /status:
    get:
      operationId: getStatus
      x-rate-limit:
        algorithm: sliding_window
        limit: 3000
        window_seconds: 60
        consumer_key: api_key

Expressing these policies in the spec rather than gateway configuration files has a concrete operational benefit: the policy is colocated with the operation definition it governs, visible to anyone reading the spec, and subject to the same review process as any other spec change. A platform engineer reviewing a PR that increases the capacity on a high-cost endpoint can evaluate whether that change is justified alongside the endpoint's implementation changes.

Algorithm Selection Reference

Algorithm	Burst Behavior	Implementation Cost	Best For
Fixed window	Boundary burst possible	Very low (single counter)	Internal APIs, low-cost endpoints
Token bucket	Controlled burst up to capacity	Medium (float + timestamp per key)	Bursty clients, expensive endpoints
Sliding window	Smooth, no boundary burst	Low (two counters per key)	Public REST APIs, default choice
Leaky bucket	None — strict output rate	Medium (queue per consumer)	Backend protection, rate shaping

The algorithm decision is not permanent. Starting with sliding window for all public endpoints and token bucket for expensive operations is a defensible baseline for most platform teams. As you observe actual consumer traffic patterns — particularly the distribution of burst-to-sustained ratios — you can tune the parameters or switch algorithms for specific endpoints without changing the policy-as-code structure that governs them.

What matters more than the algorithm choice is having rate limit policy as a legible, reviewable, version-controlled artifact rather than a set of gateway configuration values that a single engineer set two years ago and no one has touched since. The algorithm you pick is a second-order concern. The operational discipline around who can change the policy and how is a first-order concern.