
Chat Completion

The Chat Completion middleware "promotes" any route to a chat-completion endpoint. It adds GenAI metrics, central governance of model/parameters, and (optionally) lets clients override them.

Key Features and Benefits

  • One-line enablement: attach the middleware and your route becomes an AI endpoint.
  • Governance: lock or allow overrides for model, temperature, topP, etc.
  • Metrics: emits OpenTelemetry GenAI spans and counters.
  • Works for local or cloud models: all you need is a Kubernetes Service pointing at the upstream host.

Requirements

  • You must have AI Gateway enabled:

    helm install traefik traefik/traefik -n traefik --wait \
      --set hub.aigateway.enabled=true
  • If routing to a cloud LLM provider, define a Kubernetes ExternalName service.
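
For example, routing to a cloud provider could use an ExternalName Service like the sketch below; the service name and upstream host are illustrative (here OpenAI's public API host) and should match your provider.

apiVersion: v1
kind: Service
metadata:
  name: openai-external    # illustrative name
spec:
  type: ExternalName
  externalName: api.openai.com
  ports:
    - port: 443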

How It Works

  1. Intercepts the request and validates it against the OpenAI chat-completion schema.
  2. Applies governance by rewriting model or param fields if overrides are denied.
  3. Starts a GenAI span and records the prompt tokens.
  4. Forwards the (possibly rewritten) request to the upstream LLM.
  5. Records usage metrics from the response (model, prompt/completion tokens, latency).
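
For illustration, a request to a route that uses this middleware follows the standard OpenAI chat-completion shape. The hostname, path, and model value below are assumptions drawn from the examples on this page and the OpenAI SDK convention; they are not fixed by the middleware itself.

# Illustrative request; ai.localhost, the path, and the model value are placeholders.
curl -s http://ai.localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "some-other-model",
        "temperature": 0.2,
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
# With allowModelOverride: false the middleware rewrites "model" to the configured default;
# with allowParamsOverride: true the client-supplied temperature is forwarded as-is.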

Configuration Example

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048
        frequencyPenalty: 0
        presencePenalty: 0

Configuration Options

| Field | Description | Required | Default |
|---|---|---|---|
| token | URN of a Kubernetes Secret holding the API key (for example, urn:k8s:secret:<secretname>:<key>) | No | |
| model | Default (fallback) model name enforced when overrides are denied | No | |
| allowModelOverride | true = clients may set the model field; false = middleware rewrites to model | No | auto (true if model empty, else false) |
| params | Block containing default generation parameters applied when the client omits them | No | |
| params.temperature | Default temperature value | No | |
| params.topP | Default top-p value | No | |
| params.maxTokens | Default max token count | No | |
| params.frequencyPenalty | Default frequency-penalty value | No | |
| params.presencePenalty | Default presence-penalty value | No | |
| allowParamsOverride | true = clients may override params; false = middleware enforces params | No | false |
Tip

Combine Chat Completion with Semantic Cache or Content Guard by listing multiple middlewares in the same IngressRoute.

Compression

Chat middlewares strip Accept-Encoding from client requests so that response bodies stay readable by the governance filters, but they still request compressed responses from the backend for efficiency.

Standard flow: Client (uncompressed) → Traefik Hub → Backend (compressed) → Traefik Hub (decompresses) → Client (uncompressed)

If you need compressed responses for your clients (for example, to reduce bandwidth on mobile apps or slow networks), add Traefik's standard Compress middleware before the AI middlewares:

With Compress middleware: Client (uncompressed) → Traefik Hub → Backend (compressed) → Traefik Hub (decompresses + re-compresses) → Client (compressed)

However, this creates double-compression overhead because Traefik Hub must decompress the backend response to apply governance filters, then the Compress middleware re-compresses it for the client. For best performance, avoid the Compress middleware on AI routes unless client compression is essential.
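
If you do need client compression, a minimal sketch of that setup, assuming the chatcompletion middleware defined above (the compress-response name is illustrative):

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: compress-response    # illustrative name
spec:
  compress: {}

In the IngressRoute, list compress-response before the AI middlewares (for example, before chatcompletion) so responses are re-compressed on the way back to the client.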

Common Deployment Patterns

Local Inference (In-Cluster Model Server)

Deploy an LLM runtime such as Ollama inside your cluster and expose it with a ClusterIP Service, then attach chat-completion directly to the route:

Service that points at the local model runtime
apiVersion: v1
kind: Service
metadata:
  name: ollama-svc
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
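
A route that attaches the middleware to this Service could look like the sketch below; the route name and hostname are illustrative.

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: local-ai    # illustrative name
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)    # illustrative hostname
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-svc
          port: 11434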

No API token is needed because the model runs locally, but the middleware still records metrics and enforces any parameter rules you set.

Model-Based Routing

This pattern lets you expose many local models behind one hostname, with routing driven by the model field in the JSON payload.

Model-based routing
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: multi-model-ai
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`) && Model(`qwen2.5:0.5b`)
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-external
          port: 11434
          passHostHeader: false
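
To expose additional models behind the same hostname, add one route per model under spec.routes; a hypothetical second route for another locally served model could look like this (the model tag is a placeholder):

    - kind: Rule
      match: Host(`ai.localhost`) && Model(`llama3.2:1b`)    # placeholder model tag
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-external
          port: 11434
          passHostHeader: false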

Cloud LLM With a Friendly Custom Path

When you proxy to a provider like Gemini or Cohere, you may want a shorter public path (for example, /api/gemini/chat). Use a replace-path-regex middleware before chat-completion (a sketch of it follows the middleware below):

Chat-completion middleware (governance + metrics)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:gemini-token
      model: gemini-2.0-flash
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 0.8
        maxTokens: 4096
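
The replace-path-regex middleware referenced above is not shown in this example; a minimal sketch, assuming the public path /api/gemini/chat and Gemini's OpenAI-compatible endpoint (the gemini-path name is illustrative), could be:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: gemini-path    # illustrative name
spec:
  replacePathRegex:
    regex: "^/api/gemini/chat$"
    replacement: /v1beta/openai/chat/completions

In the IngressRoute, list gemini-path before chatcompletion so the path is rewritten before the request is forwarded upstream.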

With this pattern you get a clean public URL while still benefiting from governance, metrics, and model-based routing.

Provider Compatibility Information

You can find the full list of compatibility paths for the different providers on the AI Gateway Overview page.

For example, Google exposes two URLs for Gemini:

  • /v1beta/openai/chat/completions: drop-in replacement for OpenAI SDKs. Use this if your client already talks to /v1/chat/completions.
  • /v1beta/models/gemini-2.0-flash:chat (or …:streamGenerateContent): native REST shape. Use this if you control the client request format.

Pick the one that matches your client, then set replacePathRegex.replacement accordingly; otherwise Gemini may reject the request even though the gateway added all the right headers.
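
For instance, if you control the client request format and choose the native REST shape, the hypothetical gemini-path middleware sketched earlier would simply swap its replacement value:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: gemini-path    # illustrative name
spec:
  replacePathRegex:
    regex: "^/api/gemini/chat$"
    replacement: /v1beta/models/gemini-2.0-flash:chat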