
Chat Completion

The Chat Completion middleware "promotes" any route to a chat-completion endpoint. It adds GenAI metrics, central governance of model/parameters, and (optionally) lets clients override them.

Key Features and Benefits

  • One-line enablement: attach the middleware and your route becomes an AI endpoint.
  • Governance: lock or allow overrides for model, temperature, topP, etc.
  • Metrics: emits OpenTelemetry GenAI spans and counters.
  • Works for local or cloud models: all you need is a Kubernetes Service pointing at the upstream host.

Requirements

  • You must have AI Gateway enabled:

    helm install traefik traefik/traefik -n traefik --wait \
      --set hub.aigateway.enabled=true
  • If routing to a cloud LLM provider, define a Kubernetes ExternalName service.
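
For example, routing to a cloud provider could use an ExternalName Service like the sketch below; the service name and upstream host are illustrative (here OpenAI's public API host) and should match your provider.

apiVersion: v1
kind: Service
metadata:
  name: openai-external    # illustrative name
spec:
  type: ExternalName
  externalName: api.openai.com
  ports:
    - port: 443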

How It Works

  1. Intercepts the request and validates it against the OpenAI chat-completion schema.
  2. Applies governance by rewriting model or param fields if overrides are denied.
  3. Starts a GenAI span and records the prompt tokens.
  4. Forwards the (possibly rewritten) request to the upstream LLM.
  5. Records usage metrics from the response (model, prompt/completion tokens, latency).
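
For illustration, a request to a route that uses this middleware follows the standard OpenAI chat-completion shape. The hostname, path, and model value below are assumptions drawn from the examples on this page and the OpenAI SDK convention; they are not fixed by the middleware itself.

# Illustrative request; ai.localhost, the path, and the model value are placeholders.
curl -s http://ai.localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "some-other-model",
        "temperature": 0.2,
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
# With allowModelOverride: false the middleware rewrites "model" to the configured default;
# with allowParamsOverride: true the client-supplied temperature is forwarded as-is.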

Configuration Example

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 1
        maxTokens: 2048
        frequencyPenalty: 0
        presencePenalty: 0

Configuration Options

| Field | Description | Required | Default |
|---|---|---|---|
| token | URN of a Kubernetes Secret holding the API key (for example, urn:k8s:secret:<secretname>:<key>) | No | |
| model | Default (fallback) model name enforced when overrides are denied | No | |
| allowModelOverride | true = clients may set the model field; false = middleware rewrites to model | No | auto (true if model empty, else false) |
| params | Block containing default generation parameters applied when the client omits them | No | |
| params.temperature | Default temperature value | No | |
| params.topP | Default top-p value | No | |
| params.maxTokens | Default max token count | No | |
| params.frequencyPenalty | Default frequency-penalty value | No | |
| params.presencePenalty | Default presence-penalty value | No | |
| allowParamsOverride | true = clients may override params; false = middleware enforces params | No | false |
Tip

Combine Chat Completion with Semantic Cache or Content Guard by listing multiple middlewares in the same IngressRoute.

Compression

Chat middlewares strip Accept-Encoding from client requests so that response bodies stay readable by the governance filters, but they still request compressed responses from the backend for efficiency.

Standard flow: Client (uncompressed) → Traefik Hub → Backend (compressed) → Traefik Hub (decompresses) → Client (uncompressed)

If you need compressed responses for your clients (for example, to reduce bandwidth on mobile apps or slow networks), add Traefik's standard Compress middleware before the AI middlewares:

With Compress middleware: Client (uncompressed) → Traefik Hub → Backend (compressed) → Traefik Hub (decompresses + re-compresses) → Client (compressed)

However, this creates double-compression overhead because Traefik Hub must decompress the backend response to apply governance filters, then the Compress middleware re-compresses it for the client. For best performance, avoid the Compress middleware on AI routes unless client compression is essential.
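
If you do need client compression, a minimal sketch of that setup, assuming the chatcompletion middleware defined above (the compress-response name is illustrative):

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: compress-response    # illustrative name
spec:
  compress: {}

In the IngressRoute, list compress-response before the AI middlewares (for example, before chatcompletion) so responses are re-compressed on the way back to the client.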

Common Deployment Patterns

Local Inference (In-Cluster Model Server)

Deploy an LLM runtime such as Ollama inside your cluster and expose it with a ClusterIP Service, then attach chat-completion directly to the route:

Service that points at the local model runtime
apiVersion: v1
kind: Service
metadata:
  name: ollama-svc
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
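
A route that attaches the middleware to this Service could look like the sketch below; the route name and hostname are illustrative.

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: local-ai    # illustrative name
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)    # illustrative hostname
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-svc
          port: 11434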

No API token is needed because the model runs locally, but the middleware still records metrics and enforces any parameter rules you set.

Model-Based Routing

This pattern lets you expose many local models behind one hostname, with routing driven by the model field in the JSON payload.

Model-based routing
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: multi-model-ai
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`) && Model(`qwen2.5:0.5b`)
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-external
          port: 11434
          passHostHeader: false
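
To expose additional models behind the same hostname, add one route per model under spec.routes; a hypothetical second route for another locally served model could look like this (the model tag is a placeholder):

    - kind: Rule
      match: Host(`ai.localhost`) && Model(`llama3.2:1b`)    # placeholder model tag
      middlewares:
        - name: chatcompletion
      services:
        - name: ollama-external
          port: 11434
          passHostHeader: false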

Cloud LLM With a Friendly Custom Path

When you proxy to a provider like Gemini or Cohere, you may want a shorter public path (for example, /api/gemini/chat). Use a replace-path-regex middleware before chat-completion (a sketch of it follows the middleware below):

Chat-completion middleware (governance + metrics)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: chatcompletion
spec:
  plugin:
    chat-completion:
      token: urn:k8s:secret:ai-keys:gemini-token
      model: gemini-2.0-flash
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 0.8
        maxTokens: 4096
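
The replace-path-regex middleware referenced above is not shown in this example; a minimal sketch, assuming the public path /api/gemini/chat and Gemini's OpenAI-compatible endpoint (the gemini-path name is illustrative), could be:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: gemini-path    # illustrative name
spec:
  replacePathRegex:
    regex: "^/api/gemini/chat$"
    replacement: /v1beta/openai/chat/completions

In the IngressRoute, list gemini-path before chatcompletion so the path is rewritten before the request is forwarded upstream.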

With this pattern you get a clean public URL while still benefiting from governance, metrics, and model-based routing.

Provider Compatibility Information

You can find the full list of compatibility paths for the different providers on the AI Gateway Overview page.

For example, Google exposes two URLs for Gemini:

  • /v1beta/openai/chat/completions: drop-in replacement for OpenAI SDKs. Use this if your client already talks to /v1/chat/completions.
  • /v1beta/models/gemini-2.0-flash:chat (or …:streamGenerateContent): native REST shape. Use this if you control the client request format.

Pick the one that matches your client, then set replacePathRegex.replacement accordingly; otherwise Gemini may reject the request even though the gateway added all the right headers.
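
For instance, if you control the client request format and choose the native REST shape, the hypothetical gemini-path middleware sketched earlier would simply swap its replacement value:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: gemini-path    # illustrative name
spec:
  replacePathRegex:
    regex: "^/api/gemini/chat$"
    replacement: /v1beta/models/gemini-2.0-flash:chat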