Chat Completion
The Chat Completion middleware "promotes" any route to a chat-completion endpoint. It adds GenAI metrics, central governance of model/parameters, and (optionally) lets clients override them.
Key Features and Benefits
- One-line enablement: attach the middleware and your route becomes an AI endpoint.
- Governance: lock or allow overrides for `model`, `temperature`, `topP`, etc.
- Metrics: emits OpenTelemetry GenAI spans and counters.
- Works for local or cloud models: all you need is a Kubernetes `Service` pointing at the upstream host.
Requirements
- You must have AI Gateway enabled:

  ```bash
  # Assumes the Traefik Helm chart repository is already added as "traefik".
  helm install traefik traefik/traefik -n traefik --wait \
    --set hub.aigateway.enabled=true
  ```

- If routing to a cloud LLM provider, define a Kubernetes `ExternalName` Service.
How It Works
- Intercepts the request and validates it against the OpenAI chat-completion schema.
- Applies governance by rewriting the `model` or parameter fields if overrides are denied.
- Starts a GenAI span and records the prompt tokens.
- Forwards the (possibly rewritten) request to the upstream LLM.
- Records usage metrics from the response (`model`, prompt/completion tokens, latency).
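For example, with model overrides denied, a client request like the one below is validated against the schema, its `model` value is rewritten to the configured default, and the result is forwarded upstream. The hostname and prompt are illustrative; the matching configuration is shown in the next section.

```bash
# Illustrative request. The provider API key is expected to be attached by the
# gateway via the middleware's "token" secret reference, not by the client.
# With allowModelOverride: false, the "model" sent here is rewritten to the
# middleware's configured model before the request reaches the upstream LLM.
curl https://ai.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```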
Configuration Example
The example below combines three resources: the `Middleware`, the `Secret` holding the API key, and the `IngressRoute` that ties them to a route.
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048
        frequencyPenalty: 0
        presencePenalty: 0
---
apiVersion: v1
kind: Secret
metadata:
  name: ai-keys
type: Opaque
stringData:
  openai-token: sk-proj-XXXXX
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: openai
spec:
  routes:
    - kind: Rule
      match: Host(`ai.example.com`)
      middlewares:
        - name: chatcompletion
      services:
        - name: chatgpt-external # ExternalName → api.openai.com
          port: 443
          scheme: https
          passHostHeader: false
```
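If you prefer not to keep the API key in a manifest, the same Secret can be created imperatively. The name and key below match the `token` URN in the example above; the value is a placeholder.

```bash
# Creates the Secret referenced by urn:k8s:secret:ai-keys:openai-token.
# Run it in the same namespace as the Middleware; replace the placeholder key.
kubectl create secret generic ai-keys \
  --from-literal=openai-token=sk-proj-XXXXX
```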
Configuration Options
| Field | Description | Required | Default |
|---|---|---|---|
| `token` | URN of a Kubernetes Secret holding the API key (for example, `urn:k8s:secret:<secretname>:<key>`) | No | |
| `model` | Default (fallback) model name enforced when overrides are denied | No | |
| `allowModelOverride` | `true` = clients may set the `model` field; `false` = the middleware rewrites it to `model` | No | auto (`true` if `model` is empty, else `false`) |
| `params` | Block containing default generation parameters applied when the client omits them | No | |
| `params.temperature` | Default temperature value | No | |
| `params.topP` | Default top-p value | No | |
| `params.maxTokens` | Default maximum token count | No | |
| `params.frequencyPenalty` | Default frequency-penalty value | No | |
| `params.presencePenalty` | Default presence-penalty value | No | |
| `allowParamsOverride` | `true` = clients may override `params`; `false` = the middleware enforces `params` | No | `false` |
Combine Chat Completion with Semantic Cache or Content Guard by listing multiple middlewares in the same IngressRoute.
Chat middlewares strip `Accept-Encoding` from client requests to keep the body readable by governance filters, but always request compressed responses from the backend for efficiency.
Standard flow: Client (uncompressed) → Traefik Hub → Backend (compressed) → Traefik Hub (decompresses) → Client (uncompressed)
If you need compressed responses for your clients (for example, to reduce bandwidth on mobile apps or slow networks), add Traefik's standard Compress middleware before the AI middlewares:
With Compress middleware: Client (uncompressed) → Traefik Hub → Backend (compressed) → Traefik Hub (decompresses + re-compresses) → Client (compressed)
However, this creates double-compression overhead because Traefik Hub must decompress the backend response to apply governance filters, then the Compress middleware re-compresses it for the client. For best performance, avoid the Compress middleware on AI routes unless client compression is essential.
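To check which behavior a client is getting, inspect the response headers (hostname and body are illustrative): without the Compress middleware no `Content-Encoding` header is returned even when the client advertises gzip support; with it, you should see one.

```bash
# -D - prints the response headers, -o /dev/null discards the body.
# Look for a "content-encoding" header to confirm whether the response
# reaching the client is compressed.
curl -sS -D - -o /dev/null https://ai.example.com/v1/chat/completions \
  -H "Accept-Encoding: gzip" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}]}'
```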
Common Deployment Patterns
Local Inference (In-Cluster Model Server)
Deploy an LLM server such as Ollama inside your cluster, expose it with a ClusterIP `Service`, and attach the `chat-completion` middleware directly to the route:
The example below defines the `Service`, the `chat-completion` Middleware, and the `IngressRoute`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-svc
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      model: llama4:maverick
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048
        frequencyPenalty: 0
        presencePenalty: 0
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: local-llm
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`) && Path(`/v1/chat/completions`)
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-svc
          port: 11434
```
No API token is needed because the model runs locally, but the middleware still records metrics and enforces any parameter rules you set.
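A quick way to exercise the route (hostname comes from the IngressRoute above; the prompt is illustrative). Because `allowModelOverride` is `false`, whatever `model` the client sends is rewritten to `llama4:maverick`, and no Authorization header is required for the local backend.

```bash
# The "model" value sent here is rewritten to llama4:maverick by the
# chat-completion middleware before the request reaches Ollama.
curl http://ai.localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anything",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```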
Model-Based Routing
This pattern lets you expose many local models behind one hostname, with routing driven by the `model` field in the JSON payload.
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: multi-model-ai
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`) && Model(`qwen2.5:0.5b`)
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-external
          port: 11434
          passHostHeader: false
```
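Requests are matched on the `model` value in the body, so a single hostname can fan out to different backends. For example (prompts illustrative, and assuming you define additional `Model(...)` routes for other model names):

```bash
# Matches the route above because of Model(`qwen2.5:0.5b`).
curl http://ai.localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:0.5b", "messages": [{"role": "user", "content": "Hi"}]}'

# A request naming a different model is only routed if a matching
# Model(...) rule (and backend service) is defined for it.
curl http://ai.localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama4:maverick", "messages": [{"role": "user", "content": "Hi"}]}'
```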
Cloud LLM With a Friendly Custom Path
When you proxy to a provider like Gemini or Cohere, you may want a shorter public path (for example, `/api/gemini/chat`). Use a `replacePathRegex` middleware before `chat-completion`:
The example below defines the `chat-completion` Middleware, the `ExternalName` Service, the path-rewrite Middleware, and the `IngressRoute`:
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:gemini-token
      model: gemini-2.0-flash
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 0.8
        maxTokens: 4096
---
apiVersion: v1
kind: Service
metadata:
  name: gemini-external
spec:
  type: ExternalName
  externalName: generativelanguage.googleapis.com
  ports:
    - port: 443
      targetPort: 443
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: gemini-path-rewrite
spec:
  replacePathRegex:
    regex: ^/api/gemini/(.*)
    replacement: /v1beta/openai/chat/completions
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: gemini
spec:
  routes:
    - kind: Rule
      match: Host(`ai.example.com`) && PathPrefix(`/api/gemini/`)
      middlewares:
        - name: gemini-path-rewrite # rewrite first
        - name: chatcompletion # then apply chat governance/metrics
      services:
        - name: gemini-external
          port: 443
          scheme: https
          passHostHeader: false
```
With this pattern you get a clean public URL while still benefiting from governance, metrics, and model-based routing.
You can find a full list of compatibility paths for the different providers on the AI Gateway Overview page.
For example, Google exposes two URLs for Gemini:

- `/v1beta/openai/chat/completions`: a drop-in replacement for OpenAI SDKs. Use this if your client already talks to `/v1/chat/completions`.
- `/v1beta/models/gemini-2.0-flash:generateContent` (or `…:streamGenerateContent`): the native REST shape. Use this if you control the client request format.

Pick the one that matches your client, then set `replacePathRegex.replacement` accordingly; otherwise Gemini rejects your request even though the gateway added all the right headers.
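With the OpenAI-compatible replacement shown above, a client can call the friendly path directly. The hostname and prompt are illustrative, and the Gemini API key is expected to be injected via the middleware's `token` reference rather than sent by the client.

```bash
# The gateway rewrites /api/gemini/chat to the provider path configured in
# gemini-path-rewrite before applying chat governance and forwarding upstream.
curl https://ai.example.com/api/gemini/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [{"role": "user", "content": "Summarize this doc in one line."}]
  }'
```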
Related Content
- Read the Semantic Cache documentation.
- Read the Content Guard documentation.