Token Rate Limit & Quota
The Token Rate Limit & Quota middleware enables token-based rate limiting and quota enforcement for LLM requests. Unlike traditional request-based rate limiting, this middleware tracks actual token consumption reported by LLM providers, giving you precise control over API costs and usage.
There are two variants of the middleware:
| Variant | Algorithm | Best for | Behavior |
|---|---|---|---|
| `ai-rate-limit` | Token bucket | Handling traffic spikes while allowing bursts | Tokens refill over time; allows temporary bursts |
| `ai-quota` | Sliding window | Budget control and hard spending caps | Fixed token allocation per time period; no bursts |
Key Features and Benefits
- Cost Control: Limit token consumption to manage LLM API costs effectively.
- Granular Limits: Set separate limits for input tokens, output tokens, or total tokens.
- Token Estimation: Optionally estimate input tokens before sending requests to block oversized prompts.
- Flexible Extraction: Use JSON queries to extract token counts from any LLM response format.
- Distributed State: Redis-backed storage ensures consistent limits across multiple gateway replicas.
- Source Grouping: Apply limits per IP, per header value, or per custom criterion.
Requirements
- AI Gateway must be enabled:

  ```bash
  helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  ```

- The middleware requires a Redis instance in your cluster for storing rate limit state. Install Redis using Helm:

  ```bash
  helm install redis oci://registry-1.docker.io/cloudpirates/redis
  ```

  Create the namespace `apps` and create a Secret to store the Redis password:

  ```bash
  kubectl create ns apps
  kubectl create secret generic redis --from-literal=password=$(kubectl get secret --namespace default redis -o jsonpath="{.data.redis-password}" | base64 -d) -n apps
  ```

  Connection parameters to your Redis server are attached to your Middleware deployment.
The following Redis modes are supported:
- Single instance mode
- Redis Cluster
- Redis Sentinel
For more information about Redis, we recommend the official Redis documentation.
> **Info:** If you use Redis in single instance mode or Redis Sentinel, you can configure the `database` field. This value won't be taken into account if you use Redis Cluster (only database `0` is available). In this case, a warning is displayed, and the value is ignored.
How It Works
The middleware intercepts requests and responses to track token consumption:
1. **Pre-Request Check (optional):** If token estimation is enabled, the middleware estimates input tokens from the request body. Requests exceeding the limit are rejected immediately with `429 Too Many Requests`.
2. **Forward Request:** If the pre-check passes (or estimation is not enabled), the request proceeds to the LLM provider.
3. **Response Processing:** The middleware extracts actual token counts from the LLM response using the configured JSON queries.
4. **Update Buckets:** Token counts are recorded in Redis. The middleware updates the appropriate buckets (input, output, total) based on your configuration.
5. **Response Headers:** The middleware adds headers indicating the remaining token allowance for each configured limit.
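As a rough illustration, this flow can be sketched in a few lines of Python. This is a simplified model, not the actual middleware code: the single `total` bucket, the fixed `/ 4` estimate, and the `process` helper are assumptions for the example.

```python
def process(request_body: str, provider_response: dict, bucket: dict,
            estimate: bool = True) -> int:
    """Sketch of the middleware flow; `bucket` maps a limit name to remaining tokens."""
    # 1. Pre-request check: the default estimate is body length / 4.
    if estimate and len(request_body) // 4 > bucket["total"]:
        return 429  # rejected before reaching the LLM provider

    # 2./3. The request is forwarded; afterwards the reported usage is extracted,
    #       equivalent to jsonQuery ".usage.total_tokens" (missing field -> 0, fail-open).
    used = provider_response.get("usage", {}).get("total_tokens", 0)

    # 4. Record consumption; the balance may go negative when usage exceeds estimates.
    bucket["total"] -= used

    # 5. The remaining allowance is then exposed via a response header.
    return 200
```

The real middleware keeps this state in Redis so that all gateway replicas see the same balance.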
Rate Limit vs Quota
Choose the variant based on your use case:
ai-rate-limit (Token Bucket Algorithm)
Tokens refill continuously at a steady rate calculated from your limit and period. The bucket has a maximum capacity equal to your limit, and tokens are added back at a constant rate.
How it works:
- With a limit of 1000 tokens per hour, tokens refill at ~16.7 tokens per minute (~0.28 tokens per second)
- If you consume 500 tokens, you can immediately burst another 500 tokens (up to the bucket capacity)
- New tokens continuously accumulate, allowing for smooth traffic patterns
- Good for handling traffic spikes while maintaining an average rate
Example scenario:
```text
Limit: 1000 tokens/hour (refill rate: ~16.7 tokens/min)

10:00 AM - Start with 1000 tokens available
10:05 AM - Consume 600 tokens → 400 tokens remaining
10:06 AM - ~17 tokens refilled → ~417 tokens available
10:10 AM - ~67 more tokens refilled → ~483 tokens available
```
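The refill arithmetic can be modeled in a few lines. This is an illustrative sketch, not the middleware's actual implementation; timestamps are passed in explicitly (seconds since 10:00 AM) to keep the arithmetic visible.

```python
class TokenBucket:
    """Token bucket: capacity = limit, refilled at limit/period tokens per second."""

    def __init__(self, limit: int, period_seconds: float):
        self.capacity = limit
        self.tokens = float(limit)          # start full
        self.rate = limit / period_seconds  # refill rate in tokens/second
        self.updated = 0.0

    def _refill(self, now: float) -> None:
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def consume(self, amount: int, now: float) -> bool:
        self._refill(now)
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False  # not enough tokens: request would be rejected with 429

    def remaining(self, now: float) -> int:
        self._refill(now)
        return int(self.tokens)

# Replaying the scenario: 1000 tokens/hour, t = seconds after 10:00 AM.
bucket = TokenBucket(1000, 3600)
bucket.consume(600, now=300)  # 10:05 AM -> 400 tokens left
bucket.remaining(now=360)     # 10:06 AM -> 416 (one minute of refill, ~16.7 tokens)
bucket.remaining(now=600)     # 10:10 AM -> 483 (four more minutes of refill)
```

Note that a large burst is allowed at any time, as long as the bucket currently holds enough tokens.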
Use when:
- You want to smooth traffic while allowing occasional bursts
- Users may have legitimate spikes in token usage
- You're optimizing for service availability and user experience
ai-quota (Sliding Window Algorithm)
A fixed allocation of tokens within each time window. The window "slides" continuously, and tokens consumed at time T become available again at time T + period.
How it works:
- With a limit of 1000 tokens per hour, you have exactly 1000 tokens available in any 60-minute window
- Each token consumption is tracked with a timestamp
- Tokens consumed more than 60 minutes ago automatically expire and become available again
- No refill rate—either you have quota left or you don't
Example scenario:
```text
Quota: 1000 tokens/hour (sliding window)

10:00 AM - Consume 600 tokens → 400 tokens remaining
10:30 AM - Consume 400 tokens → 0 tokens remaining
10:59 AM - Still 0 tokens remaining (nothing expired yet)
11:00 AM - 600 tokens become available (10:00 AM consumption expired)
11:30 AM - 400 more tokens available (10:30 AM consumption expired)
```
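The expiry behavior can be sketched the same way. Again this is an illustrative model (timestamps here are minutes after 10:00 AM); the real middleware tracks this state in Redis.

```python
from collections import deque

class SlidingWindowQuota:
    """Sliding window: each consumption expires exactly `period` after it happened."""

    def __init__(self, limit: int, period: float):
        self.limit = limit
        self.period = period
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def remaining(self, now: float) -> int:
        # Drop consumptions older than one full period.
        while self.events and self.events[0][0] <= now - self.period:
            self.events.popleft()
        return self.limit - sum(tokens for _, tokens in self.events)

    def consume(self, tokens: int, now: float) -> bool:
        if tokens > self.remaining(now):
            return False  # hard cap: no refill rate, only expiry
        self.events.append((now, tokens))
        return True

# Replaying the scenario: 1000 tokens/hour, t = minutes after 10:00 AM.
quota = SlidingWindowQuota(1000, period=60)
quota.consume(600, now=0)   # 10:00 AM
quota.consume(400, now=30)  # 10:30 AM
quota.remaining(now=59)     # 0    (nothing expired yet)
quota.remaining(now=60)     # 600  (the 10:00 AM consumption expired)
quota.remaining(now=90)     # 1000 (the 10:30 AM consumption expired too)
```

Unlike the token bucket, there is no gradual recovery: capacity returns only when old consumptions fall out of the window.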
Use when:
- You need strict budget enforcement (cost control, provider limits)
- You want predictable, hard caps on consumption
- Preventing overspending is more important than availability
- You're implementing tiered service plans with token allocations
Configuration Examples
Basic Rate Limit (Chat Completion API)
Configure token rate limiting for routes using the Chat Completion middleware:
**Chat Completion**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chat-completion
  namespace: apps
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048
```
**ai-rate-limit**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          password: urn:k8s:secret:redis:password
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```
**ai-quota**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          password: urn:k8s:secret:redis:password
      totalTokenLimit:
        limit: 100000
        period: 24h
        jsonQuery: ".usage.total_tokens"
```
Multiple Token Limits
Configure separate limits for input, output, and total tokens. The request is rejected when any limit is reached.
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-multi
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.completion_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```
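To sanity-check your queries, you can mimic what the middleware extracts from a trimmed-down Chat Completions response. The helper below only resolves plain dotted paths, unlike full gojq, and the sample response is abbreviated; the field names follow the OpenAI `usage` object shown in the configs above.

```python
def resolve(query: str, doc: dict):
    """Resolve a dotted gojq-style query such as '.usage.total_tokens'."""
    value = doc
    for key in query.lstrip(".").split("."):
        if not isinstance(value, dict) or key not in value:
            return 0  # mirrors the middleware's fail-open behavior: no match -> 0
        value = value[key]
    return value

# Trimmed-down OpenAI Chat Completions response carrying the usage block.
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17},
}

resolve(".usage.prompt_tokens", response)      # 12 -> counted against inputTokenLimit
resolve(".usage.completion_tokens", response)  # 5  -> counted against outputTokenLimit
resolve(".usage.input_tokens", response)       # 0  -> query mismatch, fail-open
```

The last line shows why query/format mismatches matter: a Responses API query run against a Chat Completions body silently yields 0, so no tokens are counted.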
With Token Estimation
Enable token estimation to block requests before they reach the LLM. This prevents sending oversized prompts that would consume excessive tokens or overwhelm the provider.
**Default estimation**

The default estimation uses `request_body_length / 4` to approximate the token count:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-with-estimation
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 1000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      estimateStrategy:
        simple: {}
```
**Custom JSON query**

Extract specific fields for more accurate estimation:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-with-estimation
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 1000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      estimateStrategy:
        simple:
          jsonQuery: ".messages[].content"
```
OpenAI Responses API Format
The Responses API uses different field names for token counts. Adjust the `jsonQuery` values accordingly:
**Non-streaming**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-responses-api
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".usage.input_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.output_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```
**Streaming**

For streaming responses, token usage appears in the final event under `.response.usage`:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-responses-api-stream
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".response.usage.input_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".response.usage.output_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".response.usage.total_tokens"
```
Per-User Limits with Source Criterion
Apply limits per user or group by extracting the bucket name from request attributes:
**By IP address**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-ip
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        ipStrategy:
          depth: 0 # Position in X-Forwarded-For header (0 = rightmost). See Source Criterion Configuration below.
```
**By header**

Use a header value (e.g., a user ID set by the JWT middleware):

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-user
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      totalTokenLimit:
        limit: 50000
        period: 24h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        requestHeaderName: "X-User-ID"
```
**By JWT claim**

Combine with the JWT middleware to limit by a claim value (e.g., user group):

```yaml
# First, configure the JWT middleware to extract claims to headers
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: jwt-auth
  namespace: apps
spec:
  plugin:
    jwt:
      jwksUrl: https://auth.example.com/.well-known/jwks.json
      forwardHeaders:
        Group: group
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-group
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          password: urn:k8s:secret:redis:password
      totalTokenLimit:
        limit: 100000
        period: 24h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        requestHeaderName: "Group"
```
IngressRoute Configuration
Apply the middleware to your AI routes. When combining it with the Chat Completion middleware or the Responses API middleware, place the token rate limit middleware after it:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: openai-proxy
  namespace: apps
spec:
  routes:
    - kind: Rule
      match: Host(`ai.example.com`) && PathPrefix(`/v1/chat/completions`)
      middlewares:
        - name: chat-completion
        - name: token-rate-limit
      services:
        - name: openai
          port: 443
          scheme: https
          passHostHeader: false
```
Configuration Options
Main Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `store.redis.endpoints` | List of Redis endpoints (e.g., `redis:6379`). | Yes | |
| `store.redis.username` | Redis username for authentication. | No | |
| `store.redis.password` | Redis password. Supports URN references (e.g., `urn:k8s:secret:redis:password`). | No | |
| `store.redis.database` | Redis database number. | No | 0 |
| `store.redis.tls` | TLS configuration for secure Redis connections. See TLS Configuration. | No | |
| `store.redis.timeout` | Connection timeout (e.g., `5s`). | No | |
| `store.redis.cluster` | Enable Redis Cluster mode. Set to `{}` to enable. | No | |
| `store.redis.sentinel` | Redis Sentinel configuration. See Sentinel Configuration. | No | |
| `inputTokenLimit` | Configuration for input/prompt token limits. | No* | |
| `outputTokenLimit` | Configuration for output/completion token limits. | No* | |
| `totalTokenLimit` | Configuration for total token limits. | No* | |
| `estimateStrategy` | Token estimation configuration for pre-request checks. | No | |
| `sourceCriterion` | Defines how to extract the bucket identifier from requests. | No | Remote address |
\*At least one of `inputTokenLimit`, `outputTokenLimit`, or `totalTokenLimit` must be configured.
Token Limit Configuration
Each token limit (`inputTokenLimit`, `outputTokenLimit`, `totalTokenLimit`) accepts:
| Field | Description | Required | Default |
|---|---|---|---|
| `limit` | Maximum number of tokens allowed in the period. | Yes | |
| `period` | Time window for the limit (e.g., `1h`, `24h`, `7d`). | Yes | |
| `jsonQuery` | gojq query to extract the token count from the LLM response. | Yes | |
Estimate Strategy Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `estimateStrategy.simple` | Enable simple token estimation. | No | |
| `estimateStrategy.simple.jsonQuery` | gojq query to extract the text used for estimation. If not set, uses the entire request body. | No | Uses `body_length / 4` |
Source Criterion Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `sourceCriterion.ipStrategy.depth` | The IP's position in the `X-Forwarded-For` header to use, counting from the right (0 = rightmost/last entry). | No | 0 |
| `sourceCriterion.ipStrategy.excludedIPs` | List of trusted proxy IPs to skip when scanning the `X-Forwarded-For` header. | No | |
| `sourceCriterion.ipStrategy.ipv6Subnet` | IPv6 subnet size. All IPv6 addresses from the same subnet are treated as originating from the same IP. | No | |
| `sourceCriterion.requestHeaderName` | Header name to use as the bucket identifier. | No | |
| `sourceCriterion.requestHost` | Use the request host as the bucket identifier. | No | false |
TLS Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `store.redis.tls.ca` | Path to the CA certificate file for verifying the Redis server. | No | System CA |
| `store.redis.tls.cert` | Path to the client certificate file for mutual TLS. | No | |
| `store.redis.tls.key` | Path to the client private key file for mutual TLS. | No | |
| `store.redis.tls.insecureSkipVerify` | Skip TLS certificate verification. | No | false |
Sentinel Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `store.redis.sentinel.masterSet` | Name of the Redis Sentinel master set. | Yes | |
| `store.redis.sentinel.username` | Username for Sentinel authentication. | No | |
| `store.redis.sentinel.password` | Password for Sentinel authentication. Supports URN references. | No | |
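Putting the Sentinel fields together, a Sentinel-backed store might look like the sketch below. The master set name `mymaster`, the Sentinel endpoints, and the Secret keys are placeholders for your own deployment:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-sentinel
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis-sentinel-0.default.svc.cluster.local:26379
            - redis-sentinel-1.default.svc.cluster.local:26379
          password: urn:k8s:secret:redis:password
          sentinel:
            masterSet: mymaster
            password: urn:k8s:secret:redis:sentinel-password
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```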
Response Headers
The middleware adds headers to inform clients of their remaining token allowance:
For `ai-rate-limit`:
| Header | Description |
|---|---|
| `X-Ratelimit-Remaining-Tokens-Input` | Remaining input tokens (if `inputTokenLimit` is configured) |
| `X-Ratelimit-Remaining-Tokens-Output` | Remaining output tokens (if `outputTokenLimit` is configured) |
| `X-Ratelimit-Remaining-Tokens-Total` | Remaining total tokens (if `totalTokenLimit` is configured) |
For `ai-quota`:
| Header | Description |
|---|---|
| `X-Quota-Remaining-Tokens-Input` | Remaining input tokens (if `inputTokenLimit` is configured) |
| `X-Quota-Remaining-Tokens-Output` | Remaining output tokens (if `outputTokenLimit` is configured) |
| `X-Quota-Remaining-Tokens-Total` | Remaining total tokens (if `totalTokenLimit` is configured) |
Remaining values can be negative when actual token consumption exceeds estimates. This occurs when the LLM response contains more tokens than initially estimated.
Observability
When a request is rate limited, the middleware adds attributes to the OpenTelemetry span:
| Attribute | Description |
|---|---|
| `middleware.type` | `ai-rate-limit` or `ai-quota` |
| `middleware.limit.reached` | The limit that was reached: `input-token`, `output-token`, or `total-token` |
The span status is set to Error with a message indicating the limit type.
Behavior Notes
Fail-Open on JSON Query Mismatch
When the configured `jsonQuery` does not match the LLM response (e.g., the response format differs from what you expected), the token value is treated as 0. This fail-open approach prioritizes availability over strict enforcement.
Always verify that your `jsonQuery` matches your LLM provider's response format, and test with actual responses before deploying to production.
Streaming Responses
The middleware buffers streaming responses (SSE) until it detects the stream termination marker (`data: [DONE]` for Chat Completion, or `event: response.completed` for the Responses API).
Once the final chunk with token usage is received, the middleware extracts the token counts and forwards the complete buffered response to the client.
This means that real-time streaming is not preserved — the client receives the entire response as a single chunk after the stream completes.
Token Estimation Accuracy
The default estimation (`body_length / 4`) provides a rough and fast approximation. For a more accurate estimation:

- Use a custom `jsonQuery` in `estimateStrategy.simple` to extract only the text content.
- Consider that actual token counts depend on the specific tokenizer used by your LLM provider.
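The difference between the two strategies can be seen with a quick sketch, assuming the rough 4-characters-per-token heuristic (real tokenizers differ, and the sample request body is an illustration):

```python
import json

def estimate_simple(body: bytes) -> int:
    """Default strategy: approximate tokens as body length / 4."""
    return len(body) // 4

def estimate_content_only(body: bytes) -> int:
    """Closer to jsonQuery '.messages[].content': count only the message text."""
    payload = json.loads(body)
    text = "".join(m.get("content", "") for m in payload.get("messages", []))
    return len(text) // 4

body = json.dumps({
    "model": "gpt-4o",
    "temperature": 1,
    "messages": [{"role": "user", "content": "Summarize the attached report."}],
}).encode()

# The whole-body estimate also counts JSON keys and parameters as prompt text,
# so it overshoots the content-only estimate.
estimate_simple(body)        # larger than the content-only figure
estimate_content_only(body)  # 7 (30 characters of message text)
```

For short prompts wrapped in large JSON envelopes, the whole-body estimate can be several times the content-only one, which makes pre-request rejections stricter than intended.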
Best Practices
Choosing Your Source Criterion
The sourceCriterion determines how token limits are grouped and applied. Choose based on your use case:
By IP Address (Default)
```yaml
sourceCriterion:
  ipStrategy:
    depth: 0
```
- Best for: Public APIs, anonymous access, simple deployments
- Pros: No authentication required, works immediately
- Cons: Shared IPs (NAT, corporate networks) share the same limit; can't differentiate legitimate users
- Use when: You don't have user authentication or want basic protection
By Header Value
```yaml
sourceCriterion:
  requestHeaderName: "X-User-ID"
```
- Best for: Authenticated APIs with user identifiers
- Pros: True per-user limits, fair resource allocation
- Cons: Requires authentication middleware, header can be spoofed without proper auth
- Use when: You have JWT, API key, or other auth middleware that sets user identifiers
By JWT Claim
```yaml
# Combine with the JWT middleware
sourceCriterion:
  requestHeaderName: "Group" # Header set by the JWT middleware
```
- Best for: Multi-tenant applications, team-based limits
- Pros: Secure group-based limits, supports hierarchical access
- Cons: More complex setup, requires JWT middleware
- Use when: You want to limit tokens per team/organization rather than per individual user
By Request Host
```yaml
sourceCriterion:
  requestHost: true
```
- Best for: Multi-tenant platforms where each tenant has their own domain
- Pros: Natural isolation per tenant
- Cons: Requires domain-based routing
- Use when: Each customer/tenant accesses via their own domain (e.g., `customer-a.example.com`, `customer-b.example.com`)
Setting Appropriate Limits
Consider these factors when setting token limits:
Provider Limits
Check your LLM provider's rate limits and quotas:
- Consult your provider's documentation for current TPM (tokens per minute) limits
- Limits vary significantly by provider, tier, and model
- Set your limits below provider limits to avoid provider-side rejections
Provider limits change frequently. Always verify current limits in your provider's documentation before setting production values.
Cost Control
Calculate costs based on provider pricing:
```text
Example calculation (using hypothetical pricing):
  Input:  $X per 1M tokens
  Output: $Y per 1M tokens

Daily budget: $100
Estimated split: 60% input, 40% output

Input tokens:  ($100 × 0.6) / ($X / 1M) = calculated tokens/day
Output tokens: ($100 × 0.4) / ($Y / 1M) = calculated tokens/day
```
Pricing varies by provider, model, and tier. Check your provider's current pricing page for accurate calculations.
User Experience
Balance cost control with usability:
- Too restrictive: Users get rate limited frequently, poor experience
- Too permissive: Risk of unexpected costs
- Recommended approach: Set quota at a conservative percentage of budget (e.g., 70-80%), use rate limits for spike protection
Start with conservative limits and increase based on actual usage patterns. Monitor your spending and adjust limits as needed.
Concurrent Users
Adjust limits based on expected concurrent usage:
```text
Example: 100 concurrent users, budget of 10M tokens/day

Per-user quota: 10M / 100 = 100K tokens/day
Per-user rate limit: 100K / 24 hours ≈ 4,166 tokens/hour
```
This allows spikes but prevents any user from consuming the entire budget.
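This sizing arithmetic is easy to script when planning tiers (the budget and user count below are illustrative, and `per_user_limits` is a hypothetical helper, not part of the middleware):

```python
def per_user_limits(daily_budget_tokens: int, concurrent_users: int) -> tuple[int, int]:
    """Split a daily token budget into a per-user daily quota and an hourly rate limit."""
    quota_per_day = daily_budget_tokens // concurrent_users
    hourly_rate = quota_per_day // 24  # spread the daily quota evenly across hours
    return quota_per_day, hourly_rate

quota, rate = per_user_limits(10_000_000, 100)
# quota = 100_000 tokens/day per user, rate = 4_166 tokens/hour
```

The daily figure maps naturally onto an `ai-quota` limit with `period: 24h`, and the hourly figure onto an `ai-rate-limit` limit with `period: 1h`.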
Token Estimation Strategy
When to Enable Estimation
- You want to block oversized prompts before sending to the LLM
- Your input token limit is for cost control or provider protection
- You're using `inputTokenLimit` or `totalTokenLimit`
- Don't use it for `outputTokenLimit` (output can't be estimated in advance)
Estimation Accuracy
- Default (`body_length / 4`) is a rough approximation
- Accuracy varies significantly based on content type and tokenizer
- Less accurate for code, non-English languages, or complex JSON structures
- Consider using a custom `jsonQuery` to extract only the prompt text for better accuracy
Estimation accuracy depends on the specific tokenizer used by your LLM provider. The default algorithm provides a conservative estimate but may not reflect actual token counts precisely.
Example
```yaml
# More accurate: extract only message content for estimation
estimateStrategy:
  simple:
    jsonQuery: ".messages[].content"
```
Troubleshooting
Requests not being rate limited
- Verify the `jsonQuery` matches your LLM response format. Use a tool like jq play to test your query against actual responses.
- Check Redis connectivity. The middleware fails open if Redis is unavailable.
- Ensure the response contains a valid JSON body with token usage information.
Requests rejected immediately (429)
- If using estimation, the estimated tokens may exceed your limit. Check if the request body is unusually large.
- Previous requests may have exhausted the token budget. Check the `X-*-Remaining-Tokens-*` headers.
Negative remaining tokens in headers
This is expected behavior when:
- Estimated tokens were lower than actual consumption
- Multiple concurrent requests consumed tokens simultaneously
The next request will be rejected until tokens are available again.
Different users sharing the same limit
Configure `sourceCriterion` to isolate limits per user. By default, limits are applied per remote address. Use `requestHeaderName` to limit by a user identifier header.
Related Content
- Read the Chat Completion documentation.
- Read the Responses API documentation.
- Read the Distributed Rate Limit documentation for request-based rate limiting.
