AI Gateway Failover

The AI Gateway failover service automatically switches between LLM providers when the primary returns an error. Use it to handle provider outages, rate limits (HTTP 429), and model unavailability without interrupting consumers.

Failover builds on two capabilities:

  • Failover service with error detection -- triggers a fallback when the primary responds with a configured HTTP status code.
  • Service-level middlewares -- each backend gets its own Chat Completion middleware so every provider independently handles model selection, token injection, and GenAI metrics.

How It Works

The failover chain is built from nested TraefikService resources. Each backend references an ExternalName service with its own chat-completion middleware attached at the service level (the middlewares field on the TraefikService, not on the route). This means every provider independently handles model selection, token injection, and GenAI metrics -- so switching from OpenAI to Mistral transparently rewrites the model name, swaps the API key, and records the correct provider in traces.

Failover Triggers

  • Status code -- fires when the primary responds with a status matching the configured failover.errors.status codes.
  • Health check -- fires when the primary becomes unreachable based on periodic health probes (failover.healthCheck + loadBalancer.healthCheck).

This guide focuses on status-code-based failover.
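For reference, the health-check variant can be sketched as below. This is a hedged sketch, not a verified configuration: it assumes an empty failover.healthCheck object enables probe-based failover and that each backend's load balancer carries its own healthCheck; verify the exact schema against your Traefik Hub version.

```yaml
# Hedged sketch of probe-based failover (verify field names against
# your Traefik Hub version before use).
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: failover-healthcheck
  namespace: ai-failover
spec:
  failover:
    healthCheck: {} # assumed toggle: fail over on health-probe failures
    service:
      name: openai-external
      port: 443
    fallback:
      name: mistral-external
      port: 443
```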

Prerequisites

Before configuring failover, ensure you have:

  • Traefik Hub with AI Gateway enabled:

    helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  • API keys for each LLM provider (OpenAI, Mistral, etc.)

  • A dedicated namespace for the AI Gateway resources:

    kubectl create namespace ai-failover
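The chat-completion middlewares used later in this guide need the provider API keys listed above. One common pattern is to store each key in a Kubernetes Secret in the same namespace. The Secret and key names below (openai-credentials, mistral-credentials, api-key) are illustrative assumptions; match them to whatever your middleware configuration references.

```yaml
# Illustrative Secrets for provider API keys; the names here are
# assumptions, not values required by the gateway.
apiVersion: v1
kind: Secret
metadata:
  name: openai-credentials
  namespace: ai-failover
stringData:
  api-key: <your-openai-api-key>
---
apiVersion: v1
kind: Secret
metadata:
  name: mistral-credentials
  namespace: ai-failover
stringData:
  api-key: <your-mistral-api-key>
```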

Configuration Example

This example configures a three-provider failover chain: OpenAI GPT-4o → Mistral Large → Mistral Small.

Step 1: Define Kubernetes Resources

external-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: openai-external
  namespace: ai-failover
spec:
  type: ExternalName
  externalName: api.openai.com
  ports:
    - port: 443
      targetPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-external
  namespace: ai-failover
spec:
  type: ExternalName
  externalName: api.mistral.ai
  ports:
    - port: 443
      targetPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-small-external
  namespace: ai-failover
spec:
  type: ExternalName
  externalName: api.mistral.ai
  ports:
    - port: 443
      targetPort: 443

Step 2: Configure the Failover Chain

Create nested TraefikService resources that define the failover order. Each backend attaches its own chat-completion middleware so model selection, token injection, and metrics work independently per provider.

failover-traefik-services.yaml
# Top-level failover: OpenAI GPT-4o → secondary failover
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: failover-primary
  namespace: ai-failover
spec:
  failover:
    service:
      name: openai-external
      port: 443
      middlewares:
        - name: chatcompletion-gpt4o
    fallback:
      name: failover-secondary
      kind: TraefikService
    errors:
      status:
        - "429"
        - "500-504"
      maxRequestBodyBytes: 2097152
---
# Secondary failover: Mistral Large → Mistral Small
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: failover-secondary
  namespace: ai-failover
spec:
  failover:
    service:
      name: mistral-external
      port: 443
      middlewares:
        - name: chatcompletion-mistral-large
    fallback:
      name: mistral-small-external
      port: 443
      middlewares:
        - name: chatcompletion-mistral-small
    errors:
      status:
        - "429"
        - "500-504"
      maxRequestBodyBytes: 2097152

Step 3: Configure the IngressRoute

Route traffic to the failover chain by referencing the top-level TraefikService:

ingressroute.yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai-failover
  namespace: ai-failover
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`ai.example.com`) && Path(`/v1/chat/completions`)
      services:
        - name: failover-primary
          kind: TraefikService
  tls: {}

Path Rewriting for Custom Routes

This example routes requests to /v1/chat/completions, which is the standard path for OpenAI and Mistral chat completion endpoints.

If you want to expose the failover service on a different path (for example, PathPrefix(/api/ai)), you should attach a path rewrite middleware to each service in the failover chain. This ensures the backend providers receive requests at their expected paths.

For example, add a replacePathRegex middleware to rewrite /api/ai/* to /v1/chat/completions and attach it to the service-level middlewares field alongside the chat-completion middleware. See the NVIDIA NIMs Integration guide for a complete path rewriting example.
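A minimal sketch of such a rewrite middleware (the rewrite-ai-path name and the /api/ai prefix are illustrative):

```yaml
# Rewrites any /api/ai/* request path to the providers' expected
# chat completions path before the request leaves the gateway.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: rewrite-ai-path
  namespace: ai-failover
spec:
  replacePathRegex:
    regex: ^/api/ai.*
    replacement: /v1/chat/completions
```

Attach it ahead of the chat-completion middleware in each service-level middlewares list so every provider in the chain sees the rewritten path.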

Step 4: Test the Failover Chain

Send a request to the failover endpoint:

curl -s -X POST "https://ai.example.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello, which model are you?"
      }
    ]
  }' | jq .

The response comes from OpenAI GPT-4o (the primary). Check the model field in the response to confirm which provider handled the request. When the primary returns a status matching errors.status (for example, 429 or 500-504), Traefik replays the request to the next provider in the chain automatically -- no client-side retry logic required.

Circuit Breaker Integration

Integrate circuit breakers with failover to implement cooldown periods that prevent the gateway from repeatedly trying unhealthy services. When a circuit breaker trips (opens), it stops sending requests to that service for a configured duration (fallbackDuration), allowing the service to recover before attempting requests again.

This is particularly useful for AI Gateway workloads where:

  • A provider experiences an outage or rate limit
  • You want to prevent cascading failures from overwhelming degraded services
  • You need automatic recovery without manual intervention

How It Works

  1. Normal operation (circuit closed): Requests flow through the failover chain normally
  2. Service degrades: Circuit breaker detects errors/latency issues and trips (opens)
  3. Cooldown period: Circuit breaker immediately returns 503 for fallbackDuration (e.g., 30s), triggering failover to the next provider
  4. Recovery attempt: After cooldown, circuit breaker enters recovering state and gradually sends traffic back to the service
  5. Service healthy: Circuit breaker closes, resuming normal operation

Configuration Example

Attach circuit breaker middlewares to each service in the failover chain to protect against cascading failures:

circuit-breaker-middlewares.yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: circuit-breaker-primary
  namespace: ai-failover
spec:
  circuitBreaker:
    expression: ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10
    checkPeriod: 5s
    fallbackDuration: 30s # Cooldown period: do not retry for 30s
    recoveryDuration: 20s # Gradually restore traffic over 20s
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: circuit-breaker-secondary
  namespace: ai-failover
spec:
  circuitBreaker:
    expression: ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10
    checkPeriod: 5s
    fallbackDuration: 30s
    recoveryDuration: 20s

Then attach these to your failover services before the chat-completion middleware:

failover-with-circuit-breaker.yaml
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: failover-primary
  namespace: ai-failover
spec:
  failover:
    service:
      name: openai-external
      port: 443
      middlewares:
        - name: circuit-breaker-primary # Circuit breaker first
        - name: chatcompletion-gpt4o
    fallback:
      name: failover-secondary
      kind: TraefikService
    errors:
      status:
        - "429"
        - "500-504"
      maxRequestBodyBytes: 2097152
---
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: failover-secondary
  namespace: ai-failover
spec:
  failover:
    service:
      name: mistral-external
      port: 443
      middlewares:
        - name: circuit-breaker-secondary # Circuit breaker first
        - name: chatcompletion-mistral-large
    fallback:
      name: mistral-small-external
      port: 443
      middlewares:
        - name: chatcompletion-mistral-small # No circuit breaker on the last fallback
    errors:
      status:
        - "429"
        - "500-504"
      maxRequestBodyBytes: 2097152

Testing Circuit Breaker Behavior

Simulate a service outage to observe the circuit breaker tripping and recovery:

  1. Trigger circuit breaker on primary (e.g., by blocking the OpenAI endpoint or simulating 500 errors)
  2. Observe immediate failover to Mistral Large (secondary) - no retries to OpenAI during cooldown
  3. Trigger circuit breaker on secondary (same technique)
  4. Observe failover to Mistral Small (tertiary)
  5. Wait for recovery - after 30s cooldown + 20s recovery, services gradually come back online in reverse order

This ensures smooth degradation and recovery without overwhelming unhealthy services with retry traffic.

Circuit Breaker Expression Tuning

The example uses ResponseCodeRatio(500, 600, 0, 600) > 0.30 (trip when 30% of requests return 5xx). For AI Gateway workloads, consider also checking latency:

expression: ResponseCodeRatio(500, 600, 0, 600) > 0.30 || LatencyAtQuantileMS(95.0) > 10000

This trips the circuit breaker when either 30% of requests fail OR 95th percentile latency exceeds 10 seconds.

See the CircuitBreaker middleware reference for complete configuration options.

Configuration Reference

Failover Errors

  • errors.status -- List of HTTP status codes or ranges that trigger failover. Supports single codes ("500") and ranges ("500-504"). Default: none (failover then triggers only on health-check failures).
  • errors.maxRequestBodyBytes -- Maximum request body size (in bytes) to buffer for replay to the fallback service. Set to -1 for no limit. Default: -1.

Request Body Buffering and DoS Protection

When errors is configured, failover buffers the entire request body in memory so it can replay the request to the fallback service. If the body exceeds maxRequestBodyBytes, the gateway returns 413 Request Entity Too Large instead of attempting failover.

Security consideration: Always set maxRequestBodyBytes to a reasonable limit for your use case. The default value of -1 (unlimited) can expose your gateway to denial-of-service attacks where malicious clients send extremely large payloads to exhaust memory. For AI Gateway workloads, set this based on your maximum expected prompt size (for example, 2 MB for typical chat completions, higher for document processing with vision models).

Service Middlewares

  • middlewares -- List of middleware references to apply to this service. Each entry uses name (and optionally namespace) to reference a CRD Middleware.
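A minimal fragment showing both reference forms. The shared-middleware name and traefik-middlewares namespace are illustrative, and cross-namespace references may additionally need to be enabled on the Kubernetes CRD provider (allowCrossNamespace):

```yaml
middlewares:
  - name: chatcompletion-gpt4o # resolved in the TraefikService's own namespace
  - name: shared-middleware # hypothetical Middleware living in another namespace
    namespace: traefik-middlewares
```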