Responses API
The Responses API middleware promotes any route to an OpenAI-compatible Responses API endpoint. It adds GenAI metrics, centrally governs the model and generation parameters, and can optionally let clients override them.
Key Features and Benefits
- One-line enablement: attach the middleware and your route becomes an AI Responses API endpoint.
- Governance: lock or allow overrides for `model`, `temperature`, `topP`, tools, and more.
- Metrics: emits OpenTelemetry GenAI spans and counters.
- Tool control: configure and limit the number of tools clients can use.
- Works for local or cloud models: all you need is a Kubernetes `Service` pointing at the upstream host.
Requirements
- You must have AI Gateway enabled:

  ```shell
  helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  ```

- If routing to a cloud LLM provider, define a Kubernetes `ExternalName` service.
The middleware is designed for the OpenAI Responses API format. When routing to other providers:
- Parameter names may differ (e.g., `maxOutputTokens` vs `max_tokens`)
- Parameter limits vary by model and provider
- Tool support is provider-specific
For non-OpenAI providers, you may need to use a proxy service that translates between the Responses API format and your target provider's format.
How It Works
- Intercepts the request and validates it against the OpenAI Responses API schema.
- Applies governance by rewriting `model`, param fields, or `instructions` if overrides are denied.
- Starts a GenAI span and records the input tokens.
- Forwards the (possibly rewritten) request to the upstream LLM.
- Records usage metrics from the response (`model`, input/output tokens, latency).
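The governance rewrite described above can be sketched as follows. This is a simplified illustration of the documented behavior, not the middleware's actual source; the `govern` function, its argument names, and the snake_case request keys are assumptions made for the sketch:

```python
# Simplified sketch of the governance step: configured values either
# fill gaps in the client request or overwrite it, depending on the
# override flags (assumed behavior, not the middleware source).
def govern(request: dict, model: str, params: dict,
           allow_model_override: bool, allow_params_override: bool) -> dict:
    governed = dict(request)
    # Model: rewrite unless clients may choose their own; also fall
    # back to the configured model when the field is omitted.
    if not allow_model_override or "model" not in governed:
        governed["model"] = model
    # Params: configured values act as defaults (client wins) or are
    # enforced (middleware wins), depending on allowParamsOverride.
    for key, value in params.items():
        if not allow_params_override or key not in governed:
            governed[key] = value
    return governed

out = govern(
    {"model": "gpt-4-turbo", "input": "Hi", "temperature": 0.2},
    model="gpt-4o",
    params={"temperature": 1, "top_p": 0.9},
    allow_model_override=False,
    allow_params_override=True,
)
# The model is rewritten to gpt-4o, the client's temperature survives,
# and top_p is filled in from the configured defaults.
```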
Configuration Example
Middleware:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: responsesapi
spec:
  plugin:
    responses-api:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      allowParamsOverride: true
      params:
        temperature: 1
        topP: 0.9
        maxOutputTokens: 1024
        maxToolCall: 20
        store: true
        tools:
          - type: web_search
```

Secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ai-keys
type: Opaque
# Option 1: Plain text
stringData:
  openai-token: sk-proj-XXXXX
# Option 2: Pre-base64 encoded data
# data:
#   openai-token: c2stcHJvai1YWFhYWA==
```

IngressRoute:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: openai-responses
spec:
  routes:
    - kind: Rule
      match: Host(`ai.example.com`)
      middlewares:
        - name: responsesapi
      services:
        - name: openai
          port: 443
          passHostHeader: false
```

Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: openai
spec:
  type: ExternalName
  externalName: api.openai.com
  ports:
    - port: 443
```
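With the example deployed, a client could call the endpoint as below. The host mirrors the IngressRoute above; the `/v1/responses` path follows OpenAI's convention and is an assumption here (the IngressRoute matches on host only), and the request is constructed but deliberately not sent, since `ai.example.com` is a documentation placeholder:

```python
import json
import urllib.request

# Build a Responses API request against the example route above.
body = json.dumps({"input": "Summarize today's AI news"}).encode()
req = urllib.request.Request(
    "https://ai.example.com/v1/responses",  # placeholder host from the example
    data=body,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would return the governed model's response;
# it is not executed here because the host is a placeholder.
```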
Configuration Options
| Field | Description | Required | Default |
|---|---|---|---|
| `token` | URN of a Kubernetes Secret holding the API key (for example, `urn:k8s:secret:<secretname>:<key>`) | No | |
| `model` | Default model to use (for example, `gpt-4o`, `gpt-4-turbo`) | Yes | |
| `allowModelOverride` | `true` = clients may set the model field; `false` = middleware rewrites to `model` | No | auto (`true` if `model` empty, else `false`) |
| `allowParamsOverride` | `true` = clients may override params; `false` = middleware enforces `params` | No | `true` |
| `instructions` | System instructions to include in every request | No | |
| `params` | Block containing default generation parameters | No | |
| `params.temperature` | Sampling temperature between 0 and 2. Higher values make output more random | No | |
| `params.topP` | Nucleus sampling parameter. An alternative to temperature sampling | No | |
| `params.maxOutputTokens` | Maximum number of tokens to generate in the response (OpenAI Responses API format) | No | |
| `params.maxToolCall` | Maximum number of tools that can be configured in a request. Requests exceeding this limit are rejected | No | |
| `params.store` | Whether to store the conversation for future reference (OpenAI feature) | No | |
| `params.tools` | Array of tool configurations. Each tool must have a `type` field (for example, `web_search`, `file_search`, `function`) | No | |
| `params.tools[].type` | Type of tool: `web_search`, `file_search`, `code_interpreter`, `image_generation`, `function`, `mcp`, etc. | Yes | |
| `params.tools[].name` | Name of the tool (required for `function` type) | No | |
| `params.tools[].description` | Description of what the tool does (for `function` type) | No | |
| `params.tools[].parameters` | JSON Schema object describing the tool's parameters (for `function` type) | No | |
Parameter Override Behavior
The middleware supports two modes for handling parameters:
Mode 1: Allow Parameters Override (`allowParamsOverride: true`)
When enabled (default), the middleware acts as a default value provider:
- If a client provides a value for a parameter, the client's value is used.
- If a client doesn't provide a value, the configured default is applied.
- Tools follow the same pattern: client-provided tools take precedence, configured tools are used as fallback.
```yaml
spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowParamsOverride: true # Clients can override
      params:
        temperature: 0.7
        maxOutputTokens: 1000
```
Mode 2: Force Parameters (`allowParamsOverride: false`)
When turned off, the middleware enforces the configured values:
- All configured parameters override client values.
- Clients cannot change these settings.
- Useful for strict governance and cost control.
```yaml
spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowParamsOverride: false # Enforce configured values
      params:
        temperature: 0.5
        maxOutputTokens: 500
        tools:
          - type: web_search
```
Model Override Behavior
The `allowModelOverride` setting controls whether clients can specify their own model:
- `allowModelOverride: true`: clients can use any model they specify. The configured `model` acts as a fallback when clients omit the model field entirely.
- `allowModelOverride: false` (default when `model` is set): clients must use the configured model. If they omit the model field or specify a different model, the configured model is used.
```yaml
spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowModelOverride: true # Client can request gpt-4-turbo instead
```
Tool Control
The middleware provides governance over tool usage:
Configure Default Tools
Provide a list of tools that will be available by default:
```yaml
params:
  tools:
    - type: web_search
    - type: file_search
    - type: function
      name: get_weather
      description: Get current weather for a location
      parameters:
        type: object
        properties:
          location:
            type: string
            description: City name
        required:
          - location
```
Different tool types support additional configuration options:
- `function` tools: support `name`, `description`, and `parameters` fields
- `mcp` tools: support extensive configuration options including server connections, authentication, and tool-specific parameters
- Built-in tools (`web_search`, `file_search`, `code_interpreter`, `image_generation`): have their own specific configuration options
Refer to the respective tool documentation for complete configuration options.
Limit Tool Count
Use `maxToolCall` to prevent clients from requesting too many tools:
```yaml
params:
  maxToolCall: 5 # Maximum 5 tools per request
  tools:
    - type: web_search
```
When a client request exceeds this limit, it will be rejected with a 400 Bad Request response.
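The check can be pictured like this; a minimal sketch of the assumed behavior, not the middleware's implementation (the error-message wording mirrors the troubleshooting section below):

```python
# Sketch of the tool-count gate: requests carrying more tools than
# maxToolCall are rejected before being forwarded upstream.
MAX_TOOL_CALL = 5  # mirrors params.maxToolCall in the example above

def check_tool_count(request: dict, limit: int = MAX_TOOL_CALL):
    tools = request.get("tools", [])
    if len(tools) > limit:
        # Rejected with 400 Bad Request.
        return 400, f"Maximum {limit} tools allowed, got {len(tools)}"
    return 200, "ok"

status, msg = check_tool_count({"tools": [{"type": "web_search"}] * 6})
```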
Request Body Size Limits
The middleware enforces a maximum request body size based on the AI Gateway configuration:
```shell
helm upgrade traefik traefik/traefik -n traefik --wait \
  --set hub.aigateway.enabled=true \
  --set hub.aigateway.maxRequestBodySize=10485760 # 10MB
```
Requests exceeding this size will receive a 413 Request Entity Too Large response.
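As a sanity check on the flag's units, the limit is expressed in bytes. This sketch (an assumption about the enforcement, not the gateway's source) shows the arithmetic and the rejection:

```python
# 10 MiB expressed in bytes, matching the Helm value above.
MAX_REQUEST_BODY_SIZE = 10 * 1024 * 1024  # 10485760

def check_body_size(body: bytes, limit: int = MAX_REQUEST_BODY_SIZE) -> int:
    # Oversized bodies are rejected with 413 before any processing.
    return 413 if len(body) > limit else 200
```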
Streaming Support
The middleware fully supports streaming responses. When a client sets `"stream": true` in the request, the response is streamed back as server-sent events (SSE).
When streaming is enabled, the middleware will not record detailed usage metrics (token counts) since the full response is not buffered. Duration metrics will still be recorded.
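On the client side, the stream arrives as standard SSE `data:` lines. A minimal parser sketch follows; the event payloads are made up for illustration, as real Responses API events carry typed JSON objects:

```python
def iter_sse_data(raw: str):
    # Yield the payload of each "data:" line in an SSE stream.
    for line in raw.splitlines():
        if line.startswith("data: "):
            yield line[len("data: "):]

# Illustrative stream: two delta events followed by a terminator.
sample = 'data: {"delta": "Hel"}\n\ndata: {"delta": "lo"}\n\ndata: [DONE]\n'
chunks = list(iter_sse_data(sample))
```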
Metrics
The middleware emits OpenTelemetry GenAI metrics when metrics are enabled.
Working with Other AI Middlewares
The Responses API middleware can be combined with other AI Gateway middlewares for enhanced functionality. See the Adapting AI Middlewares for Responses API guide for more details.
OpenAI Responses API vs Chat Completions
The Responses API is OpenAI's successor to the Chat Completions API, designed for agent-like applications:
| Feature | Chat Completions | Responses API |
|---|---|---|
| Request format | `messages[]` array | `input` string + optional `instructions` |
| Response format | `choices[].message.content` | `output[]` array |
| Built-in tools | Function calling only | Web search, file search, code interpreter, image generation |
| State management | Client-managed | Server-managed (via `previous_response_id`) |
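The request-format row can be made concrete with two minimal bodies (field names follow OpenAI's published shapes; the values are illustrative):

```python
# Chat Completions: conversation passed as a messages[] array.
chat_completions_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Responses API: a plain input string plus optional instructions.
responses_body = {
    "model": "gpt-4o",
    "instructions": "You are a helpful assistant.",
    "input": "Hello",
}
```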
Limitations
- Streaming and Generic Middlewares: Content Guard, LLM Guard, and Semantic Cache do not support true streaming mode. When streaming is enabled, these middlewares wait for the complete response to arrive, process it, and then send the entire response as a single chunk to the client.
- State Management: The middleware does not currently manage conversation state (`previous_response_id`). Each request is treated as stateless.
- Metrics with Streaming: Token usage metrics are not recorded for streaming requests since the full response is not buffered.
Troubleshooting
Request Entity Too Large
Problem: Receiving 413 Request Entity Too Large errors.
Solution: Increase the AI Gateway's max request body size:
```shell
helm upgrade traefik traefik/traefik -n traefik --wait \
  --reset-then-reuse-values \
  --set hub.aigateway.maxRequestBodySize=20971520 # 20MB
```
Tool Count Exceeded
Problem: Receiving `400 Bad Request: Maximum X tools allowed, got Y`.
Solution: Either reduce the number of tools in your request or increase `maxToolCall`:

```yaml
params:
  maxToolCall: 50 # Increase limit
```
Model Override Denied
Problem: Client's model selection is being overridden.
Solution: Enable model override in the middleware:
```yaml
spec:
  plugin:
    responses-api:
      model: gpt-4o
      allowModelOverride: true # Allow client to choose model
```
Metrics Not Being Recorded
Problem: No metrics are being recorded.
Solution: Ensure that:
- AI Gateway is enabled with metrics
- The request includes the required headers (`Hub-App-Name`, `Hub-App-Id`)
- You're not using streaming mode (which doesn't record token metrics)
Next Steps
- Learn how to use the Responses API with other middlewares in our Responses API Guide
- Configure Content Guard to protect sensitive data
- Set up Semantic Cache to reduce costs
- Use LLM Guard for custom content policies
