
Token Rate Limit & Quota

The Token Rate Limit & Quota middleware enables token-based rate limiting and quota enforcement for LLM requests. Unlike traditional request-based rate limiting, this middleware tracks actual token consumption reported by LLM providers, giving you precise control over API costs and usage.

There are two variants of the middleware:

| Variant | Algorithm | Best for | Behavior |
|---|---|---|---|
| ai-rate-limit | Token bucket | Handling traffic spikes while allowing bursts | Tokens refill over time; allows temporary bursts |
| ai-quota | Sliding window | Budget control and hard spending caps | Fixed token allocation per time period; no bursts |

Key Features and Benefits

  • Cost Control: Limit token consumption to manage LLM API costs effectively.
  • Granular Limits: Set separate limits for input tokens, output tokens, or total tokens.
  • Token Estimation: Optionally estimate input tokens before sending requests to block oversized prompts.
  • Flexible Extraction: Use JSON queries to extract token counts from any LLM response format.
  • Distributed State: Redis-backed storage ensures consistent limits across multiple gateway replicas.
  • Source Grouping: Apply limits per IP, per header value, or per custom criterion.

Requirements

  • AI Gateway must be enabled:

    helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  • The middleware requires a Redis instance in your cluster for storing rate limit state:

    # Install Redis using Helm
    helm install redis oci://registry-1.docker.io/cloudpirates/redis

    # Create the namespace apps and create a Secret to store the Redis password
    kubectl create ns apps
    kubectl create secret generic redis --from-literal=password=$(kubectl get secret --namespace default redis -o jsonpath="{.data.redis-password}" | base64 -d) -n apps

    Connection parameters to your Redis server are attached to your Middleware deployment.

    The following Redis modes are supported: single instance, Redis Cluster, and Redis Sentinel.

    For more information about Redis, we recommend the official Redis documentation.

    Info: If you use Redis in single instance mode or Redis Sentinel, you can configure the database field. This value is not taken into account if you use Redis Cluster (only database 0 is available); in that case, a warning is displayed and the value is ignored.

How It Works

The middleware intercepts requests and responses to track token consumption:

  1. Pre-Request Check (optional): If token estimation is enabled, the middleware estimates input tokens from the request body. Requests exceeding the limit are rejected immediately with 429 Too Many Requests.

  2. Forward Request: If the pre-check passes (or estimation is not enabled), the request proceeds to the LLM provider.

  3. Response Processing: The middleware extracts actual token counts from the LLM response using the configured JSON queries.

  4. Update Buckets: Token counts are recorded in Redis. The middleware updates the appropriate buckets (input, output, total) based on your configuration.

  5. Response Headers: The middleware adds headers indicating remaining token allowance for each configured limit.
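The five steps above can be sketched as a small simulation. This is an illustrative model only: the bucket names, helper signatures, and in-memory store are assumptions standing in for the middleware's Redis-backed internals.

```python
import json

# In-memory stand-in for the Redis-backed buckets; names and structure
# here are illustrative assumptions, not the middleware's real internals.
buckets = {"input": 5000, "output": 10000, "total": 12000}

def handle_request(body, llm_call, estimate=True):
    # 1. Pre-request check: rough input-token estimate (default: length / 4).
    if estimate and len(body) // 4 > buckets["input"]:
        return 429, None  # rejected before reaching the LLM
    # 2. Forward the request to the LLM provider.
    response = llm_call(body)
    # 3. Extract actual token counts from the response (cf. jsonQuery).
    usage = json.loads(response)["usage"]
    # 4. Record real consumption against each configured bucket.
    buckets["input"] -= usage["prompt_tokens"]
    buckets["output"] -= usage["completion_tokens"]
    buckets["total"] -= usage["total_tokens"]
    # 5. Report the remaining allowance, as the response headers do.
    headers = {f"X-Ratelimit-Remaining-Tokens-{k.capitalize()}": v
               for k, v in buckets.items()}
    return 200, headers

# A fake provider that reports OpenAI-style usage numbers.
fake_llm = lambda body: json.dumps(
    {"usage": {"prompt_tokens": 100, "completion_tokens": 200, "total_tokens": 300}})

status, headers = handle_request("What is a token bucket?", fake_llm)
```

Note that the pre-check uses an estimate while the bucket update uses the provider-reported counts, which is why remaining values can later turn out lower than the pre-check assumed.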

Rate Limit vs Quota

Choose the variant based on your use case:

ai-rate-limit (Token Bucket Algorithm)

Tokens refill continuously at a steady rate calculated from your limit and period. The bucket has a maximum capacity equal to your limit, and tokens are added back at a constant rate.

How it works:

  • With a limit of 1000 tokens per hour, tokens refill at ~16.7 tokens per minute (~0.28 tokens per second)
  • If you consume 500 tokens, you can immediately burst another 500 tokens (up to the bucket capacity)
  • New tokens continuously accumulate, allowing for smooth traffic patterns
  • Good for handling traffic spikes while maintaining an average rate

Example scenario:

Limit: 1000 tokens/hour (refill rate: ~16.7 tokens/min)
10:00 AM - Start with 1000 tokens available
10:05 AM - Consume 600 tokens → 400 tokens remaining
10:06 AM - ~17 tokens refilled → 417 tokens available
10:10 AM - ~83 tokens refilled since 10:05 → ~483 tokens available
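The refill arithmetic can be checked with a few lines of Python. This is an illustrative model of the algorithm, not the middleware's actual implementation.

```python
class TokenBucket:
    """Illustrative token bucket: `limit` tokens refill evenly over `period` seconds."""

    def __init__(self, limit, period):
        self.capacity = limit
        self.rate = limit / period   # tokens refilled per second
        self.tokens = limit          # the bucket starts full
        self.last = 0.0

    def consume(self, n, now):
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if n > self.tokens:
            return False             # not enough tokens: reject the request
        self.tokens -= n
        return True

# Replay the scenario above (1000 tokens/hour; `now` in seconds after 10:00 AM).
bucket = TokenBucket(limit=1000, period=3600)
bucket.consume(600, now=5 * 60)      # 10:05 AM: 400 tokens remain
bucket.consume(0, now=6 * 60)        # 10:06 AM: ~417 tokens available
bucket.consume(0, now=10 * 60)       # 10:10 AM: ~483 tokens available
```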

Use when:

  • You want to smooth traffic while allowing occasional bursts
  • Users may have legitimate spikes in token usage
  • You're optimizing for service availability and user experience

ai-quota (Sliding Window Algorithm)

A fixed allocation of tokens within each time window. The window "slides" continuously, and tokens consumed at time T become available again at time T + period.

How it works:

  • With a limit of 1000 tokens per hour, you have exactly 1000 tokens available in any 60-minute window
  • Each token consumption is tracked with a timestamp
  • Tokens consumed 61 minutes ago automatically expire and become available again
  • No refill rate—either you have quota left or you don't

Example scenario:

Quota: 1000 tokens/hour (sliding window)
10:00 AM - Consume 600 tokens → 400 tokens remaining
10:30 AM - Consume 400 tokens → 0 tokens remaining
10:59 AM - Still 0 tokens remaining (nothing expired yet)
11:00 AM - 600 tokens become available (10:00 AM consumption expired)
11:30 AM - 400 more tokens available (10:30 AM consumption expired)
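The expiry behaviour can be modelled the same way; again, this is a sketch of the algorithm, not the middleware's Redis-backed implementation.

```python
from collections import deque

class SlidingWindowQuota:
    """Illustrative sliding window: each consumption expires `period` time units later."""

    def __init__(self, limit, period):
        self.limit = limit
        self.period = period
        self.events = deque()        # (timestamp, tokens) pairs

    def remaining(self, now):
        # Drop consumptions that have slid out of the window.
        while self.events and now - self.events[0][0] >= self.period:
            self.events.popleft()
        return self.limit - sum(tokens for _, tokens in self.events)

    def consume(self, n, now):
        if n > self.remaining(now):
            return False             # over quota: reject the request
        self.events.append((now, n))
        return True

# Replay the scenario above (1000 tokens/hour; `now` in minutes after 10:00 AM).
quota = SlidingWindowQuota(limit=1000, period=60)
quota.consume(600, now=0)            # 10:00 AM
quota.consume(400, now=30)           # 10:30 AM
at_1059 = quota.remaining(59)        # 0: nothing has expired yet
at_1100 = quota.remaining(60)        # 600: the 10:00 AM consumption expired
at_1130 = quota.remaining(90)        # 1000: the 10:30 AM consumption expired too
```

Unlike the token bucket, there is no gradual refill between 10:30 and 11:00: the quota stays at zero until the oldest consumption falls out of the window.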

Use when:

  • You need strict budget enforcement (cost control, provider limits)
  • You want predictable, hard caps on consumption
  • Preventing overspending is more important than availability
  • You're implementing tiered service plans with token allocations

Configuration Examples

Basic Rate Limit (Chat Completion API)

Configure token rate limiting for routes using the Chat Completion middleware:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chat-completion
  namespace: apps
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048

Multiple Token Limits

Configure separate limits for input, output, and total tokens. The request is rejected when any limit is reached.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-multi
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.completion_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".usage.total_tokens"

With Token Estimation

Enable token estimation to block requests before they reach the LLM. This prevents sending oversized prompts that would consume excessive tokens or overwhelm the provider.

The default estimation uses request_body_length / 4 to approximate token count:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-with-estimation
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 1000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      estimateStrategy:
        simple: {}

OpenAI Responses API Format

The Responses API uses different field names for token counts. Adjust the jsonQuery accordingly:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-responses-api
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".usage.input_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.output_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".usage.total_tokens"

Per-User Limits with Source Criterion

Apply limits per user or group by extracting the bucket name from request attributes:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-ip
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        ipStrategy:
          depth: 0 # Position in X-Forwarded-For header (0 = rightmost). See Source Criterion Configuration below.

IngressRoute Configuration

Apply the middleware to your AI routes (here, token-rate-limit refers to an ai-rate-limit Middleware such as those defined above). When used with the Chat Completion middleware or Responses API middleware, place the token rate limit middleware after it:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: openai-proxy
  namespace: apps
spec:
  routes:
    - kind: Rule
      match: Host(`ai.example.com`) && PathPrefix(`/v1/chat/completions`)
      middlewares:
        - name: chat-completion
        - name: token-rate-limit
      services:
        - name: openai
          port: 443
          scheme: https
          passHostHeader: false

Configuration Options

Main Configuration

| Field | Description | Required | Default |
|---|---|---|---|
| store.redis.endpoints | List of Redis endpoints (e.g., redis:6379). | Yes | |
| store.redis.username | Redis username for authentication. | No | |
| store.redis.password | Redis password. Supports URN references (e.g., urn:k8s:secret:redis:password). | No | |
| store.redis.database | Redis database number. | No | 0 |
| store.redis.tls | TLS configuration for secure Redis connections. See TLS Configuration. | No | |
| store.redis.timeout | Connection timeout (e.g., 5s). | No | |
| store.redis.cluster | Enable Redis Cluster mode. Set to {} to enable. | No | |
| store.redis.sentinel | Redis Sentinel configuration. See Sentinel Configuration. | No | |
| inputTokenLimit | Configuration for input/prompt token limits. | No* | |
| outputTokenLimit | Configuration for output/completion token limits. | No* | |
| totalTokenLimit | Configuration for total token limits. | No* | |
| estimateStrategy | Token estimation configuration for pre-request checks. | No | |
| sourceCriterion | Defines how to extract the bucket identifier from requests. | No | Remote address |

*At least one of inputTokenLimit, outputTokenLimit, or totalTokenLimit must be configured.

Token Limit Configuration

Each token limit (inputTokenLimit, outputTokenLimit, totalTokenLimit) accepts:

| Field | Description | Required | Default |
|---|---|---|---|
| limit | Maximum number of tokens allowed in the period. | Yes | |
| period | Time window for the limit (e.g., 1h, 24h, 7d). | Yes | |
| jsonQuery | gojq query to extract token count from the LLM response. | Yes | |

Estimate Strategy Configuration

| Field | Description | Required | Default |
|---|---|---|---|
| estimateStrategy.simple | Enable simple token estimation. | No | |
| estimateStrategy.simple.jsonQuery | gojq query to extract text for estimation. If not set, uses the entire request body. | No | body_length / 4 |

Source Criterion Configuration

| Field | Description | Required | Default |
|---|---|---|---|
| sourceCriterion.ipStrategy.depth | The IP's position in the X-Forwarded-For header to use, counting from the right (0 = rightmost/last entry). | No | 0 |
| sourceCriterion.ipStrategy.excludedIPs | List of trusted proxy IPs to skip when scanning the X-Forwarded-For header. | No | |
| sourceCriterion.ipStrategy.ipv6Subnet | IPv6 subnet size. All IPv6 addresses from the same subnet are treated as originating from the same IP. | No | |
| sourceCriterion.requestHeaderName | Header name to use as bucket identifier. | No | |
| sourceCriterion.requestHost | Use the request host as bucket identifier. | No | false |

TLS Configuration

| Field | Description | Required | Default |
|---|---|---|---|
| store.redis.tls.ca | Path to the CA certificate file for verifying the Redis server. | No | System CA |
| store.redis.tls.cert | Path to the client certificate file for mutual TLS. | No | |
| store.redis.tls.key | Path to the client private key file for mutual TLS. | No | |
| store.redis.tls.insecureSkipVerify | Skip TLS certificate verification. | No | false |

Sentinel Configuration

| Field | Description | Required | Default |
|---|---|---|---|
| store.redis.sentinel.masterSet | Name of the Redis Sentinel master set. | Yes | |
| store.redis.sentinel.username | Username for Sentinel authentication. | No | |
| store.redis.sentinel.password | Password for Sentinel authentication. Supports URN references. | No | |

Response Headers

The middleware adds headers to inform clients of their remaining token allowance:

For ai-rate-limit:

| Header | Description |
|---|---|
| X-Ratelimit-Remaining-Tokens-Input | Remaining input tokens (if inputTokenLimit is configured) |
| X-Ratelimit-Remaining-Tokens-Output | Remaining output tokens (if outputTokenLimit is configured) |
| X-Ratelimit-Remaining-Tokens-Total | Remaining total tokens (if totalTokenLimit is configured) |

For ai-quota:

| Header | Description |
|---|---|
| X-Quota-Remaining-Tokens-Input | Remaining input tokens (if inputTokenLimit is configured) |
| X-Quota-Remaining-Tokens-Output | Remaining output tokens (if outputTokenLimit is configured) |
| X-Quota-Remaining-Tokens-Total | Remaining total tokens (if totalTokenLimit is configured) |

Note: Remaining values can be negative when actual token consumption exceeds estimates. This occurs when the LLM response contains more tokens than initially estimated.

Observability

When a request is rate limited, the middleware adds attributes to the OpenTelemetry span:

| Attribute | Description |
|---|---|
| middleware.type | ai-rate-limit or ai-quota |
| middleware.limit.reached | Which limit was reached: input-token, output-token, or total-token |

The span status is set to Error with a message indicating the limit type.

Behavior Notes

Fail-Open on JSON Query Mismatch

When the configured jsonQuery does not match the LLM response (e.g., the response format differs from expected), the token value is treated as 0. This fail-open approach prioritizes availability over strict enforcement.

Tip: Always verify your jsonQuery matches your LLM provider's response format. Test with actual responses before deploying to production.

Streaming Responses

The middleware buffers streaming responses (SSE) until it detects the stream termination marker (data: [DONE] for chat completion or event: response.completed for Responses API). Once the final chunk with token usage is received, the middleware extracts the token counts and forwards the complete buffered response to the client. This means that real-time streaming is not preserved — the client receives the entire response as a single chunk after the stream completes.

Token Estimation Accuracy

The default estimation (body_length / 4) provides a rough and fast approximation. For a more accurate estimation:

  1. Use a custom jsonQuery in estimateStrategy.simple to extract only the text content.
  2. Consider that actual token counts depend on the specific tokenizer used by your LLM provider.
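The default length-based estimate is easy to reproduce; comparing a whole-body estimate with a content-only one (as a custom jsonQuery such as ".messages[].content" would extract) shows why narrowing the input tightens the estimate. The sample prompt and body below are illustrative.

```python
import json

def estimate_tokens(text):
    # Default simple strategy: roughly 4 characters per token.
    return len(text) // 4

prompt = "Summarize this document for me."
request_body = json.dumps({
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": prompt}],
})

whole_body = estimate_tokens(request_body)   # default: estimate over the full body
content_only = estimate_tokens(prompt)       # what ".messages[].content" would yield
```

The whole-body estimate counts JSON keys, quotes, and braces as if they were prompt text, so it always overshoots the content-only figure.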

Best Practices

Choosing Your Source Criterion

The sourceCriterion determines how token limits are grouped and applied. Choose based on your use case:

By IP Address (Default)

sourceCriterion:
  ipStrategy:
    depth: 0

  • Best for: Public APIs, anonymous access, simple deployments
  • Pros: No authentication required, works immediately
  • Cons: Shared IPs (NAT, corporate networks) share the same limit; can't differentiate legitimate users
  • Use when: You don't have user authentication or want basic protection
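How a right-counted depth selects an entry from X-Forwarded-For can be sketched as follows (illustrative; the middleware's internal parsing may differ, and the IPs are documentation addresses):

```python
def bucket_key(xff_header, depth=0):
    # X-Forwarded-For lists IPs from the original client (left) to the
    # most recent proxy (right); depth counts from the right, so
    # depth=0 picks the rightmost entry.
    ips = [ip.strip() for ip in xff_header.split(",")]
    return ips[-(depth + 1)]

header = "203.0.113.7, 198.51.100.2, 10.0.0.1"
rightmost = bucket_key(header, depth=0)   # "10.0.0.1" (closest hop)
client = bucket_key(header, depth=2)      # "203.0.113.7" (original client)
```

This is why depth only trusts entries appended by proxies you control: anything further left can be spoofed by the client.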

By Header Value

sourceCriterion:
  requestHeaderName: "X-User-ID"

  • Best for: Authenticated APIs with user identifiers
  • Pros: True per-user limits, fair resource allocation
  • Cons: Requires authentication middleware, header can be spoofed without proper auth
  • Use when: You have JWT, API key, or other auth middleware that sets user identifiers

By JWT Claim

# Combine with JWT middleware
sourceCriterion:
  requestHeaderName: "Group" # Header set by JWT middleware

  • Best for: Multi-tenant applications, team-based limits
  • Pros: Secure group-based limits, supports hierarchical access
  • Cons: More complex setup, requires JWT middleware
  • Use when: You want to limit tokens per team/organization rather than per individual user

By Request Host

sourceCriterion:
  requestHost: true

  • Best for: Multi-tenant platforms where each tenant has their own domain
  • Pros: Natural isolation per tenant
  • Cons: Requires domain-based routing
  • Use when: Each customer/tenant accesses via their own domain (e.g., customer-a.example.com, customer-b.example.com)

Setting Appropriate Limits

Consider these factors when setting token limits:

Provider Limits

Check your LLM provider's rate limits and quotas:

  • Consult your provider's documentation for current TPM (tokens per minute) limits
  • Limits vary significantly by provider, tier, and model
  • Set your limits below provider limits to avoid provider-side rejections

Tip: Provider limits change frequently. Always verify current limits in your provider's documentation before setting production values.

Cost Control

Calculate costs based on provider pricing:

Example calculation (using hypothetical pricing):
- Input: $X per 1M tokens
- Output: $Y per 1M tokens

Daily budget: $100
Estimated split: 60% input, 40% output

Input tokens: ($100 × 0.6) / ($X / 1M) = calculated tokens/day
Output tokens: ($100 × 0.4) / ($Y / 1M) = calculated tokens/day

Note: Pricing varies by provider, model, and tier. Check your provider's current pricing page for accurate calculations.
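Plugging concrete numbers into the formula above makes the arithmetic checkable. The prices here are made-up placeholders; substitute your provider's current rates.

```python
# Hypothetical prices -- placeholders, not real provider rates.
input_price_per_1m = 2.50    # $ per 1M input tokens (assumed)
output_price_per_1m = 10.00  # $ per 1M output tokens (assumed)
daily_budget = 100.0         # $ per day
input_share, output_share = 0.6, 0.4

input_tokens_per_day = (daily_budget * input_share) / (input_price_per_1m / 1_000_000)
output_tokens_per_day = (daily_budget * output_share) / (output_price_per_1m / 1_000_000)
# 24M input tokens/day and 4M output tokens/day under these assumed prices
```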

User Experience

Balance cost control with usability:

  • Too restrictive: Users get rate limited frequently, poor experience
  • Too permissive: Risk of unexpected costs
  • Recommended approach: Set quota at a conservative percentage of budget (e.g., 70-80%), use rate limits for spike protection

Tip: Start with conservative limits and increase based on actual usage patterns. Monitor your spending and adjust limits as needed.

Concurrent Users

Adjust limits based on expected concurrent usage:

Example: 100 concurrent users, budget of 10M tokens/day

Per-user quota: 10M / 100 = 100K tokens/day
Per-user rate limit: 100K / 24 hours = ~4,166 tokens/hour

This allows spikes but prevents any user from consuming the entire budget.
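The same division, in code form, for checking your own numbers:

```python
daily_budget_tokens = 10_000_000     # 10M tokens/day for the whole deployment
concurrent_users = 100

per_user_quota = daily_budget_tokens // concurrent_users   # 100,000 tokens/day
per_user_rate_limit = per_user_quota / 24                  # ~4,166 tokens/hour
```
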

Token Estimation Strategy

When to Enable Estimation

  • You want to block oversized prompts before sending to the LLM
  • Your input token limit is for cost control or provider protection
  • You're using inputTokenLimit or totalTokenLimit
  • Don't use for outputTokenLimit (can't estimate output)

Estimation Accuracy

  • Default (body_length / 4) is a rough approximation
  • Accuracy varies significantly based on content type and tokenizer
  • Less accurate for code, non-English languages, or complex JSON structures
  • Consider using custom jsonQuery to extract only the prompt text for better accuracy

Tip: Estimation accuracy depends on the specific tokenizer used by your LLM provider. The default algorithm provides a conservative estimate but may not reflect actual token counts precisely.

Example

# More accurate: extract only message content for estimation
estimateStrategy:
  simple:
    jsonQuery: ".messages[].content"

Troubleshooting

Requests not being rate limited

  • Verify the jsonQuery matches your LLM response format. Use a tool like jq play to test your query against actual responses.
  • Check Redis connectivity. The middleware fails open if Redis is unavailable.
  • Ensure the response contains a valid JSON body with token usage information.
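A query can be sanity-checked offline against a captured response. For the simple paths used in this guide, Python's json module is enough (for full gojq syntax, test with jq itself); the response below is a trimmed, made-up example.

```python
import json

# A captured OpenAI-style chat completion response, trimmed to the usage block.
response = json.loads(
    '{"id": "chatcmpl-123", '
    '"usage": {"prompt_tokens": 56, "completion_tokens": 31, "total_tokens": 87}}'
)

# Equivalent of the jsonQuery ".usage.prompt_tokens":
prompt_tokens = response["usage"]["prompt_tokens"]   # 56

# A non-matching path is counted as 0 (fail-open); the Responses API
# field name ".usage.input_tokens" would miss on this response:
input_tokens = response.get("usage", {}).get("input_tokens", 0)   # 0
```
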

Requests rejected immediately (429)

  • If using estimation, the estimated tokens may exceed your limit. Check if the request body is unusually large.
  • Previous requests may have exhausted the token budget. Check the X-*-Remaining-Tokens-* headers.

Negative remaining tokens in headers

This is expected behavior when:

  • Estimated tokens were lower than actual consumption
  • Multiple concurrent requests consumed tokens simultaneously

The next request will be rejected until tokens are available again.

Different users sharing the same limit

Configure sourceCriterion to isolate limits per user. By default, limits are applied per remote address. Use requestHeaderName to limit by a user identifier header.