Token Rate Limit & Quota
The Token Rate Limit & Quota middleware enables token-based rate limiting and quota enforcement for LLM requests. Unlike traditional request-based rate limiting, this middleware tracks actual token consumption reported by LLM providers, giving you precise control over API costs and usage.
There are two variants of the middleware:
| Variant | Algorithm | Best for | Behavior |
|---|---|---|---|
| `ai-rate-limit` | Token bucket | Handling traffic spikes while allowing bursts | Tokens refill over time; allows temporary bursts |
| `ai-quota` | Sliding window | Budget control and hard spending caps | Fixed token allocation per time period; no bursts |
Key Features and Benefits
- Cost Control: Limit token consumption to manage LLM API costs effectively.
- Granular Limits: Set separate limits for input tokens, output tokens, or total tokens.
- Token Estimation: Optionally estimate input tokens before sending requests to block oversized prompts.
- Flexible Extraction: Use JSON queries to extract token counts from any LLM response format.
- Distributed State: Redis-backed storage ensures consistent limits across multiple gateway replicas.
- Source Grouping: Apply limits per IP, per header value, or per custom criterion.
Requirements
- AI Gateway must be enabled:

  ```bash
  helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  ```

- The middleware requires a Redis instance in your cluster for storing rate limit state. Install Redis using Helm:

  ```bash
  helm install redis oci://registry-1.docker.io/cloudpirates/redis
  ```

  Create the namespace `apps` and create a Secret to store the Redis password:

  ```bash
  kubectl create ns apps
  kubectl create secret generic redis --from-literal=password=$(kubectl get secret --namespace default redis -o jsonpath="{.data.redis-password}" | base64 -d) -n apps
  ```

  Connection parameters to your Redis server are attached to your Middleware deployment.
The following Redis modes are supported:
- Single instance mode
- Redis Cluster
- Redis Sentinel
For more information about Redis, we recommend the official Redis documentation.
> **Info:** If you use Redis in single instance mode or Redis Sentinel, you can configure the `database` field. This value won't be taken into account if you use Redis Cluster (only database `0` is available). In this case, a warning is displayed, and the value is ignored.
How It Works
The middleware intercepts requests and responses to track token consumption:
1. **Pre-Request Check (optional):** If token estimation is enabled, the middleware estimates input tokens from the request body. Requests exceeding the limit are rejected immediately with `429 Too Many Requests`.
2. **Forward Request:** If the pre-check passes (or estimation is not enabled), the request proceeds to the LLM provider.
3. **Response Processing:** The middleware extracts actual token counts from the LLM response using the configured JSON queries.
4. **Update Buckets:** Token counts are recorded in Redis. The middleware updates the appropriate buckets (input, output, total) based on your configuration.
5. **Response Headers:** The middleware adds headers indicating the remaining token allowance for each configured limit.
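As a rough illustration, this flow can be sketched in a few lines of Python. This is a simplified model, not the actual middleware code: the single `total` bucket, the fixed `/ 4` estimate, and the `process` helper are assumptions for the example.

```python
def process(request_body: str, provider_response: dict, bucket: dict,
            estimate: bool = True) -> int:
    """Sketch of the middleware flow; `bucket` maps a limit name to remaining tokens."""
    # 1. Pre-request check: the default estimate is body length / 4.
    if estimate and len(request_body) // 4 > bucket["total"]:
        return 429  # rejected before reaching the LLM provider

    # 2./3. The request is forwarded; afterwards the reported usage is extracted,
    #       equivalent to jsonQuery ".usage.total_tokens" (missing field -> 0, fail-open).
    used = provider_response.get("usage", {}).get("total_tokens", 0)

    # 4. Record consumption; the balance may go negative when usage exceeds estimates.
    bucket["total"] -= used

    # 5. The remaining allowance is then exposed via a response header.
    return 200
```

The real middleware keeps this state in Redis so that all gateway replicas see the same balance.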
Rate Limit vs Quota
Choose the variant based on your use case:
ai-rate-limit (Token Bucket Algorithm)
Tokens refill continuously at a steady rate calculated from your limit and period. The bucket has a maximum capacity equal to your limit, and tokens are added back at a constant rate.
How it works:
- With a limit of 1000 tokens per hour, tokens refill at ~16.7 tokens per minute (~0.28 tokens per second)
- If you consume 500 tokens, you can immediately burst another 500 tokens (up to the bucket capacity)
- New tokens continuously accumulate, allowing for smooth traffic patterns
- Good for handling traffic spikes while maintaining an average rate
Example scenario:
```text
Limit: 1000 tokens/hour (refill rate: ~16.7 tokens/min)

10:00 AM - Start with 1000 tokens available
10:05 AM - Consume 600 tokens → 400 tokens remaining
10:06 AM - ~17 tokens refilled → ~417 tokens available
10:10 AM - ~67 more tokens refilled → ~483 tokens available
```
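The refill arithmetic can be modeled in a few lines. This is an illustrative sketch, not the middleware's actual implementation; timestamps are passed in explicitly (seconds since 10:00 AM) to keep the arithmetic visible.

```python
class TokenBucket:
    """Token bucket: capacity = limit, refilled at limit/period tokens per second."""

    def __init__(self, limit: int, period_seconds: float):
        self.capacity = limit
        self.tokens = float(limit)          # start full
        self.rate = limit / period_seconds  # refill rate in tokens/second
        self.updated = 0.0

    def _refill(self, now: float) -> None:
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def consume(self, amount: int, now: float) -> bool:
        self._refill(now)
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False  # not enough tokens: request would be rejected with 429

    def remaining(self, now: float) -> int:
        self._refill(now)
        return int(self.tokens)

# Replaying the scenario: 1000 tokens/hour, t = seconds after 10:00 AM.
bucket = TokenBucket(1000, 3600)
bucket.consume(600, now=300)  # 10:05 AM -> 400 tokens left
bucket.remaining(now=360)     # 10:06 AM -> 416 (one minute of refill, ~16.7 tokens)
bucket.remaining(now=600)     # 10:10 AM -> 483 (four more minutes of refill)
```

Note that a large burst is allowed at any time, as long as the bucket currently holds enough tokens.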
Use when:
- You want to smooth traffic while allowing occasional bursts
- Users may have legitimate spikes in token usage
- You're optimizing for service availability and user experience
ai-quota (Sliding Window Algorithm)
A fixed allocation of tokens within each time window. The window "slides" continuously, and tokens consumed at time T become available again at time T + period.
How it works:
- With a limit of 1000 tokens per hour, you have exactly 1000 tokens available in any 60-minute window
- Each token consumption is tracked with a timestamp
- Tokens consumed more than 60 minutes ago automatically expire and become available again
- No refill rate—either you have quota left or you don't
Example scenario:
```text
Quota: 1000 tokens/hour (sliding window)

10:00 AM - Consume 600 tokens → 400 tokens remaining
10:30 AM - Consume 400 tokens → 0 tokens remaining
10:59 AM - Still 0 tokens remaining (nothing expired yet)
11:00 AM - 600 tokens become available (10:00 AM consumption expired)
11:30 AM - 400 more tokens available (10:30 AM consumption expired)
```
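The expiry behavior can be sketched the same way. Again this is an illustrative model (timestamps here are minutes after 10:00 AM); the real middleware tracks this state in Redis.

```python
from collections import deque

class SlidingWindowQuota:
    """Sliding window: each consumption expires exactly `period` after it happened."""

    def __init__(self, limit: int, period: float):
        self.limit = limit
        self.period = period
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def remaining(self, now: float) -> int:
        # Drop consumptions older than one full period.
        while self.events and self.events[0][0] <= now - self.period:
            self.events.popleft()
        return self.limit - sum(tokens for _, tokens in self.events)

    def consume(self, tokens: int, now: float) -> bool:
        if tokens > self.remaining(now):
            return False  # hard cap: no refill rate, only expiry
        self.events.append((now, tokens))
        return True

# Replaying the scenario: 1000 tokens/hour, t = minutes after 10:00 AM.
quota = SlidingWindowQuota(1000, period=60)
quota.consume(600, now=0)   # 10:00 AM
quota.consume(400, now=30)  # 10:30 AM
quota.remaining(now=59)     # 0    (nothing expired yet)
quota.remaining(now=60)     # 600  (the 10:00 AM consumption expired)
quota.remaining(now=90)     # 1000 (the 10:30 AM consumption expired too)
```

Unlike the token bucket, there is no gradual recovery: capacity returns only when old consumptions fall out of the window.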
Use when:
- You need strict budget enforcement (cost control, provider limits)
- You want predictable, hard caps on consumption
- Preventing overspending is more important than availability
- You're implementing tiered service plans with token allocations
Configuration Examples
Basic Rate Limit (Chat Completion API)
Configure token rate limiting for routes using the Chat Completion middleware:
**Chat Completion**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chat-completion
  namespace: apps
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048
```
**ai-rate-limit**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          password: urn:k8s:secret:redis:password
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```
**ai-quota**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          password: urn:k8s:secret:redis:password
      totalTokenLimit:
        limit: 100000
        period: 24h
        jsonQuery: ".usage.total_tokens"
```
Multiple Token Limits
Configure separate limits for input, output, and total tokens. The request is rejected when any limit is reached.
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-multi
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.completion_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```
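To sanity-check your queries, you can mimic what the middleware extracts from a trimmed-down Chat Completions response. The helper below only resolves plain dotted paths, unlike full gojq, and the sample response is abbreviated; the field names follow the OpenAI `usage` object shown in the configs above.

```python
def resolve(query: str, doc: dict):
    """Resolve a dotted gojq-style query such as '.usage.total_tokens'."""
    value = doc
    for key in query.lstrip(".").split("."):
        if not isinstance(value, dict) or key not in value:
            return 0  # mirrors the middleware's fail-open behavior: no match -> 0
        value = value[key]
    return value

# Trimmed-down OpenAI Chat Completions response carrying the usage block.
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17},
}

resolve(".usage.prompt_tokens", response)      # 12 -> counted against inputTokenLimit
resolve(".usage.completion_tokens", response)  # 5  -> counted against outputTokenLimit
resolve(".usage.input_tokens", response)       # 0  -> query mismatch, fail-open
```

The last line shows why query/format mismatches matter: a Responses API query run against a Chat Completions body silently yields 0, so no tokens are counted.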
With Token Estimation
Enable token estimation to block requests before they reach the LLM. This prevents sending oversized prompts that would consume excessive tokens or overwhelm the provider.
**Default estimation**

The default estimation uses `request_body_length / 4` to approximate the token count:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-with-estimation
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 1000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      estimateStrategy:
        simple: {}
```
**Custom JSON query**

Extract specific fields for more accurate estimation:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-with-estimation
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 1000
        period: 1h
        jsonQuery: ".usage.prompt_tokens"
      estimateStrategy:
        simple:
          jsonQuery: ".messages[].content"
```
OpenAI Responses API Format
The Responses API uses different field names for token counts. Adjust the `jsonQuery` values accordingly:
**Non-streaming**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-responses-api
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".usage.input_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.output_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```
**Streaming**

For streaming responses, token usage appears in the final event under `.response.usage`:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-rate-limit-responses-api-stream
  namespace: apps
spec:
  plugin:
    ai-rate-limit:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      inputTokenLimit:
        limit: 5000
        period: 1h
        jsonQuery: ".response.usage.input_tokens"
      outputTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".response.usage.output_tokens"
      totalTokenLimit:
        limit: 12000
        period: 1h
        jsonQuery: ".response.usage.total_tokens"
```
Per-User Limits with Source Criterion
Apply limits per user or group by extracting the bucket name from request attributes:
**By IP address**

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-ip
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        ipStrategy:
          depth: 0 # Position in X-Forwarded-For header (0 = rightmost). See Source Criterion Configuration below.
```
**By header**

Use a header value (e.g., a user ID set by the JWT middleware):

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-user
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
      totalTokenLimit:
        limit: 50000
        period: 24h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        requestHeaderName: "X-User-ID"
```
**By JWT claim**

Combine with the JWT middleware to limit by a claim value (e.g., user group):

```yaml
# First, configure the JWT middleware to extract claims to headers
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: jwt-auth
  namespace: apps
spec:
  plugin:
    jwt:
      jwksUrl: https://auth.example.com/.well-known/jwks.json
      forwardHeaders:
        Group: group
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-per-group
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          password: urn:k8s:secret:redis:password
      totalTokenLimit:
        limit: 100000
        period: 24h
        jsonQuery: ".usage.total_tokens"
      sourceCriterion:
        requestHeaderName: "Group"
```
IngressRoute Configuration
Apply the middleware to your AI routes. When combining it with the Chat Completion middleware or the Responses API middleware, place the token rate limit middleware after it:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: openai-proxy
  namespace: apps
spec:
  routes:
    - kind: Rule
      match: Host(`ai.example.com`) && PathPrefix(`/v1/chat/completions`)
      middlewares:
        - name: chat-completion
        - name: token-rate-limit
      services:
        - name: openai
          port: 443
          scheme: https
          passHostHeader: false
```
Configuration Options
Main Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `store.redis.endpoints` | List of Redis endpoints (e.g., `redis:6379`). | Yes | |
| `store.redis.username` | Redis username for authentication. | No | |
| `store.redis.password` | Redis password. Supports URN references (e.g., `urn:k8s:secret:redis:password`). | No | |
| `store.redis.database` | Redis database number. | No | 0 |
| `store.redis.tls` | TLS configuration for secure Redis connections. See TLS Configuration. | No | |
| `store.redis.timeout` | Connection timeout (e.g., `5s`). | No | |
| `store.redis.cluster` | Enable Redis Cluster mode. Set to `{}` to enable. | No | |
| `store.redis.sentinel` | Redis Sentinel configuration. See Sentinel Configuration. | No | |
| `inputTokenLimit` | Configuration for input/prompt token limits. | No* | |
| `outputTokenLimit` | Configuration for output/completion token limits. | No* | |
| `totalTokenLimit` | Configuration for total token limits. | No* | |
| `estimateStrategy` | Token estimation configuration for pre-request checks. | No | |
| `sourceCriterion` | Defines how to extract the bucket identifier from requests. | No | Remote address |
\*At least one of `inputTokenLimit`, `outputTokenLimit`, or `totalTokenLimit` must be configured.
Token Limit Configuration
Each token limit (`inputTokenLimit`, `outputTokenLimit`, `totalTokenLimit`) accepts:
| Field | Description | Required | Default |
|---|---|---|---|
| `limit` | Maximum number of tokens allowed in the period. | Yes | |
| `period` | Time window for the limit (e.g., `1h`, `24h`, `7d`). | Yes | |
| `jsonQuery` | gojq query to extract the token count from the LLM response. | Yes | |
Estimate Strategy Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `estimateStrategy.simple` | Enable simple token estimation. | No | |
| `estimateStrategy.simple.jsonQuery` | gojq query to extract the text used for estimation. If not set, uses the entire request body. | No | Uses `body_length / 4` |
Source Criterion Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `sourceCriterion.ipStrategy.depth` | The IP's position in the `X-Forwarded-For` header to use, counting from the right (0 = rightmost/last entry). | No | 0 |
| `sourceCriterion.ipStrategy.excludedIPs` | List of trusted proxy IPs to skip when scanning the `X-Forwarded-For` header. | No | |
| `sourceCriterion.ipStrategy.ipv6Subnet` | IPv6 subnet size. All IPv6 addresses from the same subnet are treated as originating from the same IP. | No | |
| `sourceCriterion.requestHeaderName` | Header name to use as the bucket identifier. | No | |
| `sourceCriterion.requestHost` | Use the request host as the bucket identifier. | No | false |
TLS Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `store.redis.tls.ca` | Path to the CA certificate file for verifying the Redis server. | No | System CA |
| `store.redis.tls.cert` | Path to the client certificate file for mutual TLS. | No | |
| `store.redis.tls.key` | Path to the client private key file for mutual TLS. | No | |
| `store.redis.tls.insecureSkipVerify` | Skip TLS certificate verification. | No | false |
Sentinel Configuration
| Field | Description | Required | Default |
|---|---|---|---|
| `store.redis.sentinel.masterSet` | Name of the Redis Sentinel master set. | Yes | |
| `store.redis.sentinel.username` | Username for Sentinel authentication. | No | |
| `store.redis.sentinel.password` | Password for Sentinel authentication. Supports URN references. | No | |
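Putting the Sentinel fields together, a Sentinel-backed store might look like the sketch below. The master set name `mymaster`, the Sentinel endpoints, and the Secret keys are placeholders for your own deployment:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: token-quota-sentinel
  namespace: apps
spec:
  plugin:
    ai-quota:
      store:
        redis:
          endpoints:
            - redis-sentinel-0.default.svc.cluster.local:26379
            - redis-sentinel-1.default.svc.cluster.local:26379
          password: urn:k8s:secret:redis:password
          sentinel:
            masterSet: mymaster
            password: urn:k8s:secret:redis:sentinel-password
      totalTokenLimit:
        limit: 10000
        period: 1h
        jsonQuery: ".usage.total_tokens"
```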
Response Headers
The middleware adds headers to inform clients of their remaining token allowance:
For `ai-rate-limit`:
| Header | Description |
|---|---|
| `X-Ratelimit-Remaining-Tokens-Input` | Remaining input tokens (if `inputTokenLimit` is configured) |
| `X-Ratelimit-Remaining-Tokens-Output` | Remaining output tokens (if `outputTokenLimit` is configured) |
| `X-Ratelimit-Remaining-Tokens-Total` | Remaining total tokens (if `totalTokenLimit` is configured) |
For `ai-quota`:
| Header | Description |
|---|---|
| `X-Quota-Remaining-Tokens-Input` | Remaining input tokens (if `inputTokenLimit` is configured) |
| `X-Quota-Remaining-Tokens-Output` | Remaining output tokens (if `outputTokenLimit` is configured) |
| `X-Quota-Remaining-Tokens-Total` | Remaining total tokens (if `totalTokenLimit` is configured) |
Remaining values can be negative when actual token consumption exceeds estimates. This occurs when the LLM response contains more tokens than initially estimated.
Observability
When a request is rate limited, the middleware adds attributes to the OpenTelemetry span:
| Attribute | Description |
|---|---|
| `middleware.type` | `ai-rate-limit` or `ai-quota` |
| `middleware.limit.reached` | The limit that was reached: `input-token`, `output-token`, or `total-token` |
The span status is set to Error with a message indicating the limit type.
Behavior Notes
Fail-Open on JSON Query Mismatch
When the configured `jsonQuery` does not match the LLM response (e.g., the response format differs from what you expected), the token value is treated as 0. This fail-open approach prioritizes availability over strict enforcement.
Always verify that your `jsonQuery` matches your LLM provider's response format, and test with actual responses before deploying to production.
Streaming Responses
The middleware buffers streaming responses (SSE) until it detects the stream termination marker (`data: [DONE]` for Chat Completion, or `event: response.completed` for the Responses API).
Once the final chunk with token usage is received, the middleware extracts the token counts and forwards the complete buffered response to the client.
This means that real-time streaming is not preserved — the client receives the entire response as a single chunk after the stream completes.
Token Estimation Accuracy
The default estimation (`body_length / 4`) provides a rough and fast approximation. For a more accurate estimation:

- Use a custom `jsonQuery` in `estimateStrategy.simple` to extract only the text content.
- Consider that actual token counts depend on the specific tokenizer used by your LLM provider.
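The difference between the two strategies can be seen with a quick sketch, assuming the rough 4-characters-per-token heuristic (real tokenizers differ, and the sample request body is an illustration):

```python
import json

def estimate_simple(body: bytes) -> int:
    """Default strategy: approximate tokens as body length / 4."""
    return len(body) // 4

def estimate_content_only(body: bytes) -> int:
    """Closer to jsonQuery '.messages[].content': count only the message text."""
    payload = json.loads(body)
    text = "".join(m.get("content", "") for m in payload.get("messages", []))
    return len(text) // 4

body = json.dumps({
    "model": "gpt-4o",
    "temperature": 1,
    "messages": [{"role": "user", "content": "Summarize the attached report."}],
}).encode()

# The whole-body estimate also counts JSON keys and parameters as prompt text,
# so it overshoots the content-only estimate.
estimate_simple(body)        # larger than the content-only figure
estimate_content_only(body)  # 7 (30 characters of message text)
```

For short prompts wrapped in large JSON envelopes, the whole-body estimate can be several times the content-only one, which makes pre-request rejections stricter than intended.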
Best Practices
Choosing Your Source Criterion
The sourceCriterion determines how token limits are grouped and applied. Choose based on your use case:
By IP Address (Default)
```yaml
sourceCriterion:
  ipStrategy:
    depth: 0
```
- Best for: Public APIs, anonymous access, simple deployments
- Pros: No authentication required, works immediately
- Cons: Shared IPs (NAT, corporate networks) share the same limit; can't differentiate legitimate users
- Use when: You don't have user authentication or want basic protection
By Header Value
```yaml
sourceCriterion:
  requestHeaderName: "X-User-ID"
```
- Best for: Authenticated APIs with user identifiers
- Pros: True per-user limits, fair resource allocation
- Cons: Requires authentication middleware, header can be spoofed without proper auth
- Use when: You have JWT, API key, or other auth middleware that sets user identifiers
By JWT Claim
```yaml
# Combine with the JWT middleware
sourceCriterion:
  requestHeaderName: "Group" # Header set by the JWT middleware
```
- Best for: Multi-tenant applications, team-based limits
- Pros: Secure group-based limits, supports hierarchical access
- Cons: More complex setup, requires JWT middleware
- Use when: You want to limit tokens per team/organization rather than per individual user
By Request Host
```yaml
sourceCriterion:
  requestHost: true
```
- Best for: Multi-tenant platforms where each tenant has their own domain
- Pros: Natural isolation per tenant
- Cons: Requires domain-based routing
- Use when: Each customer/tenant accesses via their own domain (e.g., `customer-a.example.com`, `customer-b.example.com`)
Setting Appropriate Limits
Consider these factors when setting token limits:
Provider Limits
Check your LLM provider's rate limits and quotas:
- Consult your provider's documentation for current TPM (tokens per minute) limits
- Limits vary significantly by provider, tier, and model
- Set your limits below provider limits to avoid provider-side rejections
Provider limits change frequently. Always verify current limits in your provider's documentation before setting production values.
Cost Control
Calculate costs based on provider pricing:
```text
Example calculation (using hypothetical pricing):
  Input:  $X per 1M tokens
  Output: $Y per 1M tokens

Daily budget: $100
Estimated split: 60% input, 40% output

Input tokens:  ($100 × 0.6) / ($X / 1M) = calculated tokens/day
Output tokens: ($100 × 0.4) / ($Y / 1M) = calculated tokens/day
```
Pricing varies by provider, model, and tier. Check your provider's current pricing page for accurate calculations.
User Experience
Balance cost control with usability:
- Too restrictive: Users get rate limited frequently, poor experience
- Too permissive: Risk of unexpected costs
- Recommended approach: Set quota at a conservative percentage of budget (e.g., 70-80%), use rate limits for spike protection
Start with conservative limits and increase based on actual usage patterns. Monitor your spending and adjust limits as needed.
Concurrent Users
Adjust limits based on expected concurrent usage:
```text
Example: 100 concurrent users, budget of 10M tokens/day

Per-user quota: 10M / 100 = 100K tokens/day
Per-user rate limit: 100K / 24 hours ≈ 4,166 tokens/hour
```
This allows spikes but prevents any user from consuming the entire budget.
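This sizing arithmetic is easy to script when planning tiers (the budget and user count below are illustrative, and `per_user_limits` is a hypothetical helper, not part of the middleware):

```python
def per_user_limits(daily_budget_tokens: int, concurrent_users: int) -> tuple[int, int]:
    """Split a daily token budget into a per-user daily quota and an hourly rate limit."""
    quota_per_day = daily_budget_tokens // concurrent_users
    hourly_rate = quota_per_day // 24  # spread the daily quota evenly across hours
    return quota_per_day, hourly_rate

quota, rate = per_user_limits(10_000_000, 100)
# quota = 100_000 tokens/day per user, rate = 4_166 tokens/hour
```

The daily figure maps naturally onto an `ai-quota` limit with `period: 24h`, and the hourly figure onto an `ai-rate-limit` limit with `period: 1h`.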
Token Estimation Strategy
When to Enable Estimation
- You want to block oversized prompts before sending to the LLM
- Your input token limit is for cost control or provider protection
- You're using `inputTokenLimit` or `totalTokenLimit`
- Don't use it for `outputTokenLimit` (output can't be estimated in advance)
Estimation Accuracy
- Default (`body_length / 4`) is a rough approximation
- Accuracy varies significantly based on content type and tokenizer
- Less accurate for code, non-English languages, or complex JSON structures
- Consider using a custom `jsonQuery` to extract only the prompt text for better accuracy
Estimation accuracy depends on the specific tokenizer used by your LLM provider. The default algorithm provides a conservative estimate but may not reflect actual token counts precisely.
Example
```yaml
# More accurate: extract only message content for estimation
estimateStrategy:
  simple:
    jsonQuery: ".messages[].content"
```
Troubleshooting
Requests not being rate limited
- Verify the `jsonQuery` matches your LLM response format. Use a tool like jq play to test your query against actual responses.
- Check Redis connectivity. The middleware fails open if Redis is unavailable.
- Ensure the response contains a valid JSON body with token usage information.
Requests rejected immediately (429)
- If using estimation, the estimated tokens may exceed your limit. Check if the request body is unusually large.
- Previous requests may have exhausted the token budget. Check the `X-*-Remaining-Tokens-*` headers.
Negative remaining tokens in headers
This is expected behavior when:
- Estimated tokens were lower than actual consumption
- Multiple concurrent requests consumed tokens simultaneously
The next request will be rejected until tokens are available again.
Different users sharing the same limit
Configure `sourceCriterion` to isolate limits per user. By default, limits are applied per remote address. Use `requestHeaderName` to limit by a user identifier header.
Related Content
- Read the Chat Completion documentation.
- Read the Responses API documentation.
- Read the Distributed Rate Limit documentation for request-based rate limiting.
