Responses API

The Responses API middleware promotes any route to an OpenAI-compatible Responses API endpoint. It adds GenAI metrics, central governance of model/parameters, and (optionally) lets clients override them.
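
Attaching the middleware is a single reference on the route. A minimal sketch follows; the route name, namespace, match rule, and service name are illustrative, and the Middleware itself is defined in the configuration example below:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai-route # illustrative name
  namespace: traefik
spec:
  entryPoints:
    - websecure
  routes:
    - match: Path(`/ai/responses`) # illustrative path
      kind: Rule
      middlewares:
        - name: responsesapi # the Middleware from the configuration example below
      services:
        - name: openai # Kubernetes Service pointing at the upstream host
          port: 443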

Key Features and Benefits

  • One-line enablement: attach the middleware and your route becomes an AI Responses API endpoint.
  • Governance: lock or allow overrides for model, temperature, topP, tools, and more.
  • Metrics: emits OpenTelemetry GenAI spans and counters.
  • Tool control: configure default tools and cap the number of tools a request may use.
  • Works with local or cloud models: all you need is a Kubernetes Service pointing at the upstream host.

Requirements

  • You must have AI Gateway enabled:

    helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  • If routing to a cloud LLM provider, define a Kubernetes ExternalName service.
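
For example, a sketch of an ExternalName Service fronting OpenAI (the name and namespace are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: openai
  namespace: traefik
spec:
  type: ExternalName
  externalName: api.openai.com
  ports:
    - port: 443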

Model Compatibility

The middleware is designed for the OpenAI Responses API format. When routing to other providers:

  • Parameter names may differ (e.g., maxOutputTokens vs max_tokens)
  • Parameter limits vary by model and provider
  • Tool support is provider-specific

For non-OpenAI providers, you may need to use a proxy service that translates between the Responses API format and your target provider's format.

How It Works

  1. Intercepts the request and validates it against the OpenAI Responses API schema.
  2. Applies governance by rewriting model, param fields, or instructions if overrides are denied.
  3. Starts a GenAI span and records the input tokens.
  4. Forwards the (possibly rewritten) request to the upstream LLM.
  5. Records usage metrics from the response (model, input/output tokens, latency).
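
For example, with model: gpt-4o and allowModelOverride: false, a client's model choice is rewritten before the request is forwarded (the endpoint path below is illustrative):

curl http://localhost/ai/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4-turbo", "input": "Hello"}'
# The upstream receives "model": "gpt-4o"; the client's value is discarded.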

Configuration Example

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: responsesapi
spec:
  plugin:
    responses-api:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 0.9
        maxOutputTokens: 1024
        maxToolCall: 20
        store: true
        tools:
          - type: web_search

Configuration Options

| Field | Description | Required | Default |
|-------|-------------|----------|---------|
| token | URN of a Kubernetes Secret holding the API key (for example, urn:k8s:secret:<secretname>:<key>) | No | |
| model | Default model to use (for example, gpt-4o, gpt-4-turbo) | Yes | |
| allowModelOverride | true = clients may set the model field; false = middleware rewrites to model | No | auto (true if model is empty, else false) |
| allowParamsOverride | true = clients may override params; false = middleware enforces params | No | true |
| instructions | System instructions to include in every request | No | |
| params | Block containing default generation parameters | No | |
| params.temperature | Sampling temperature between 0 and 2. Higher values make output more random | No | |
| params.topP | Nucleus sampling parameter. An alternative to temperature sampling | No | |
| params.maxOutputTokens | Maximum number of tokens to generate in the response (OpenAI Responses API format) | No | |
| params.maxToolCall | Maximum number of tools that can be configured in a request. Requests exceeding this limit are rejected | No | |
| params.store | Whether to store the conversation for future reference (OpenAI feature) | No | |
| params.tools | Array of tool configurations. Each tool must have a type field (for example, web_search, file_search, function) | No | |
| params.tools[].type | Type of tool: web_search, file_search, code_interpreter, image_generation, function, mcp, etc. | Yes | |
| params.tools[].name | Name of the tool (required for function type) | No | |
| params.tools[].description | Description of what the tool does (for function type) | No | |
| params.tools[].parameters | JSON Schema object describing the tool's parameters (for function type) | No | |

Parameter Override Behavior

The middleware supports two modes for handling parameters:

Mode 1: Allow Parameters Override (allowParamsOverride: true)

When enabled (default), the middleware acts as a default value provider:

  • If a client provides a value for a parameter, the client's value is used.
  • If a client doesn't provide a value, the configured default is applied.
  • Tools follow the same pattern: client-provided tools take precedence, configured tools are used as fallback.

spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowParamsOverride: true # Clients can override
      params:
        temperature: 0.7
        maxOutputTokens: 1000
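
With this configuration, a client value wins and the configured defaults fill in whatever the client omits. For example (illustrative endpoint path):

curl http://localhost/ai/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "input": "Hello", "temperature": 0.2}'
# Effective values: temperature 0.2 (client value kept), maxOutputTokens 1000 (default applied)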

Mode 2: Force Parameters (allowParamsOverride: false)

When turned off, the middleware enforces the configured values:

  • All configured parameters override client values.
  • Clients cannot change these settings.
  • Useful for strict governance and cost control.

spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowParamsOverride: false # Enforce configured values
      params:
        temperature: 0.5
        maxOutputTokens: 500
        tools:
          - type: web_search

Model Override Behavior

The allowModelOverride setting controls whether clients can specify their own model:

  • allowModelOverride: true: Clients can use any model they specify. The configured model acts as a fallback when clients omit the model field entirely.
  • allowModelOverride: false (default when model is set): Clients must use the configured model. If they omit the model field or specify a different model, the configured model is used.

spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowModelOverride: true # Client can request gpt-4-turbo instead

Tool Control

The middleware provides governance over tool usage:

Configure Default Tools

Provide a list of tools that will be available by default:

params:
  tools:
    - type: web_search
    - type: file_search
    - type: function
      name: get_weather
      description: Get current weather for a location
      parameters:
        type: object
        properties:
          location:
            type: string
            description: City name
        required:
          - location

Tool-Specific Options

Different tool types support additional configuration options:

  • function tools: Support name, description, and parameters fields
  • mcp tools: Support extensive configuration options including server connections, authentication, and tool-specific parameters
  • Built-in tools (web_search, file_search, code_interpreter, image_generation): Have their own specific configuration options

Refer to the respective tool documentation for complete configuration options.
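
As a rough illustration only, an mcp tool entry might look like the sketch below; the connection field names here are hypothetical and must be checked against the MCP tool documentation:

params:
  tools:
    - type: mcp
      # Hypothetical field names for the MCP server connection:
      serverLabel: internal-tools
      serverUrl: https://mcp.example.com/sse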

Limit Tool Count

Use maxToolCall to prevent clients from requesting too many tools:

params:
  maxToolCall: 5 # Maximum 5 tools per request
  tools:
    - type: web_search

When a client request exceeds this limit, it will be rejected with a 400 Bad Request response.
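
For instance, against a route configured with maxToolCall: 5, this six-tool request (illustrative endpoint path) is rejected before it reaches the upstream:

curl http://localhost/ai/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "input": "Hello", "tools": [
    {"type": "web_search"}, {"type": "file_search"},
    {"type": "code_interpreter"}, {"type": "image_generation"},
    {"type": "function", "name": "f1"}, {"type": "function", "name": "f2"}]}'
# Expected: 400 Bad Request (maximum 5 tools allowed, got 6)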

Request Body Size Limits

The middleware enforces a maximum request body size based on the AI Gateway configuration:

helm upgrade traefik traefik/traefik -n traefik --wait \
--set hub.aigateway.enabled=true \
--set hub.aigateway.maxRequestBodySize=10485760 # 10MB

Requests exceeding this size will receive a 413 Request Entity Too Large response.

Streaming Support

The middleware fully supports streaming responses. When a client sets "stream": true in the request, the response will be streamed back as server-sent events (SSE).
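
For example (illustrative endpoint path; curl's -N flag disables output buffering so the SSE chunks print as they arrive):

curl -N http://localhost/ai/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "input": "Write a haiku", "stream": true}'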

Metrics and Streaming

When streaming is enabled, the middleware will not record detailed usage metrics (token counts) since the full response is not buffered. Duration metrics will still be recorded.

Metrics

The middleware emits OpenTelemetry GenAI metrics when metrics are enabled.
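
One way to turn metrics on is via the Traefik Helm chart's OpenTelemetry values; the exact keys depend on your chart version, so treat this as a sketch:

helm upgrade traefik traefik/traefik -n traefik --wait \
  --reset-then-reuse-values \
  --set metrics.otlp.enabled=true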

Working with Other AI Middlewares

The Responses API middleware can be combined with other AI Gateway middlewares for enhanced functionality. See the Adapting AI Middlewares for Responses API guide for more details.

OpenAI Responses API vs Chat Completions

The Responses API is OpenAI's successor to the Chat Completions API, designed for agent-like applications:

| Feature | Chat Completions | Responses API |
|---------|------------------|---------------|
| Request format | messages[] array | input string + optional instructions |
| Response format | choices[].message.content | output[] array |
| Built-in tools | Function calling only | Web search, file search, code interpreter, image generation |
| State management | Client-managed | Server-managed (via previous_response_id) |
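
In practice the request shapes differ as follows (these are the upstream OpenAI endpoints, shown for comparison):

# Chat Completions: conversation passed as a messages[] array
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'

# Responses API: a single input string plus optional instructions
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "input": "Hello", "instructions": "Be concise"}'
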
Important

  • Streaming and Generic Middlewares: Content Guard, LLM Guard, and Semantic Cache do not support true streaming mode. When streaming is enabled, these middlewares wait for the complete response to arrive, process it, and then send the entire response as a single chunk to the client.
  • State Management: The middleware does not currently manage conversation state (previous_response_id). Each request is treated as stateless.
  • Metrics with Streaming: Token usage metrics are not recorded for streaming requests since the full response is not buffered.

Troubleshooting

Request Entity Too Large

Problem: Receiving 413 Request Entity Too Large errors.

Solution: Increase the AI Gateway's max request body size:

helm upgrade traefik traefik/traefik -n traefik --wait \
--reset-then-reuse-values \
--set hub.aigateway.maxRequestBodySize=20971520 # 20MB

Tool Count Exceeded

Problem: Receiving 400 Bad Request: Maximum X tools allowed, got Y.

Solution: Either reduce the number of tools in your request or increase maxToolCall:

params:
  maxToolCall: 50 # Increase limit

Model Override Denied

Problem: Client's model selection is being overridden.

Solution: Enable model override in the middleware:

spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowModelOverride: true # Allow client to choose model

Metrics Not Being Recorded

Problem: No metrics are being recorded.

Solution: Ensure:

  1. AI Gateway is enabled with metrics
  2. The request includes the required headers (Hub-App-Name, Hub-App-Id)
  3. You're not using streaming mode (which doesn't record token metrics)
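
For example (illustrative endpoint path and application values):

curl http://localhost/ai/responses \
  -H "Content-Type: application/json" \
  -H "Hub-App-Name: my-app" \
  -H "Hub-App-Id: my-app-id" \
  -d '{"model": "gpt-4o", "input": "Hello"}'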

Next Steps