Adapting AI Middlewares for Responses API
This guide demonstrates how to configure Traefik Hub's AI Gateway middlewares to work with the OpenAI Responses API format. While these middlewares were originally designed for generic HTTP/JSON payloads, they can be adapted to handle the Responses API's specific request and response structure.
Overview
The OpenAI Responses API uses a different request/response format compared to the Chat Completions API:
| Aspect | Chat Completions | Responses API |
|---|---|---|
| Request input | messages[] array | input string + optional instructions |
| Response output | choices[].message.content | output[] array with structured items |
| Request path | /v1/chat/completions | /v1/responses |
To work with this format, you need to configure your AI middlewares to target the correct JSON paths in requests and responses.
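For reference, here is the same prompt expressed in both request formats (abridged to the fields the middlewares in this guide act on):

# Chat Completions request (POST /v1/chat/completions)
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ]
}

# Responses API request (POST /v1/responses)
{
  "model": "gpt-4o",
  "instructions": "You are a helpful assistant.",
  "input": "What is the capital of France?"
}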
Prerequisites
Before starting, ensure you have:
- AI Gateway enabled in your Traefik Hub installation:
helm upgrade traefik traefik/traefik -n traefik --wait \
--reset-then-reuse-values \
--set hub.aigateway.enabled=true
- An OpenAI API key stored in a Kubernetes Secret:
apiVersion: v1
kind: Secret
metadata:
name: ai-keys
namespace: apps
type: Opaque
stringData:
openai-token: sk-proj-XXXXX
- An ExternalName service pointing to OpenAI:
apiVersion: v1
kind: Service
metadata:
name: openai
namespace: apps
spec:
type: ExternalName
externalName: api.openai.com
ports:
- port: 443
Adapting Content Guard
The Content Guard middleware detects and masks PII in requests and responses using JSON path queries. For the Responses API, configure it to inspect the input and instructions fields, since the Responses API uses a flat structure instead of the nested messages[].content array used by Chat Completions.
Configuration
- Content Guard Middleware
- Responses API Middleware
- IngressRoute
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: content-guard-responses
namespace: apps
spec:
plugin:
content-guard:
engine:
presidio:
host: http://presidio-analyzer.presidio.svc.cluster.local:5002
language: en
timeout: 30s
request:
rules:
# Rule 1: Detect and mask PII in all request fields
- entities:
- PERSON
- EMAIL_ADDRESS
- PHONE_NUMBER
- SSN
- CREDIT_CARD
block: false
mask:
char: "*"
unmaskFromLeft: 2
unmaskFromRight: 2
# Rule 2: Specifically target input and instructions fields
- jsonPaths:
- .input
- .instructions
entities:
- EMAIL_ADDRESS
- PHONE_NUMBER
block: false
mask:
char: "X"
response:
rules:
# Mask PII in response output
- entities:
- PERSON
- EMAIL_ADDRESS
- PHONE_NUMBER
block: false
mask:
char: "*"
unmaskFromLeft: 1
unmaskFromRight: 1
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: responsesapi
namespace: apps
spec:
plugin:
responses-api:
token: urn:k8s:secret:ai-keys:openai-token
model: gpt-4o-2024-05-13
allowModelOverride: false
allowParamsOverride: true
params:
temperature: 0.7
maxOutputTokens: 1024
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: responses-with-content-guard
namespace: apps
spec:
entryPoints:
- websecure
routes:
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses`)
services:
- name: openai
port: 443
passHostHeader: false
middlewares:
- name: content-guard-responses
- name: responsesapi
If your clients call a different path (for example, /v1/chat/completions instead of /v1/responses), add a path-rewriting middleware to rewrite requests to the upstream path:
# First, create the middleware
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: replacepath-responses
spec:
replacePath:
path: "/v1/responses" # Rewrite to the correct upstream path
# Then reference it in your IngressRoute
middlewares:
- name: replacepath-responses
Key Configuration Points
- jsonPaths: Set to [".input", ".instructions"] to target Responses API request fields.
- Multiple Rules: You can have a global rule for all fields and specific rules for certain JSON paths.
- Engine Setup: Presidio must be deployed. See the Presidio documentation for setup instructions.
Testing
Create a test request with PII:
curl -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "say the following back to me: My email is [email protected] and phone is 555-123-4567"
}'
The middleware masks the PII before forwarding the request to the LLM, and a response similar to the following (abridged) is returned:
{
  // ...
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "text": "My email is XXXXXXXXXXXXXXXXXXXX and phone is XXXXXXXXXXXX"
    }
  ]
  // ...
}
You can verify the PII is masked by checking the response body.
Adapting LLM Guard
The LLM Guard middleware performs custom content analysis using external services or LLMs. For the Responses API, adjust the template to extract the input field.
Configuration
This example uses a sentiment analysis service to block negative content:
- LLM Guard Middleware
- Responses API Middleware
- IngressRoute
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: sentiment-guard-responses
namespace: apps
spec:
plugin:
llm-guard-custom:
endpoint: http://sentiment-analyzer.apps.svc.cluster.local:5000/predict
clientConfig:
timeout: 30s
headers:
Content-Type: application/json
request:
# Extract the input field for sentiment analysis
template: '{"text":"{{ .input }}"}'
# Block if negative sentiment exceeds 60%
blockConditions:
- condition: 'JSONGt(".predictions[0].NEGATIVE", 0.6)'
reason: 'Negative sentiment detected'
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: responsesapi
namespace: apps
spec:
plugin:
responses-api:
token: urn:k8s:secret:ai-keys:openai-token
model: gpt-4o-2024-05-13
allowModelOverride: false
allowParamsOverride: true
params:
temperature: 0.7
maxOutputTokens: 1024
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: responses-with-sentiment-guard
namespace: apps
spec:
entryPoints:
- websecure
routes:
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses`)
services:
- name: openai
port: 443
passHostHeader: false
middlewares:
- name: sentiment-guard-responses
- name: responsesapi
Key Configuration Points
- template: Use {{ .input }} to extract the input field from the Responses API request.
- blockConditions: Define conditions using JSON query syntax to determine when to block requests.
- External Service: You need to deploy your own sentiment analysis or content filtering service.
Example Sentiment Analyzer
Here's a Kubernetes Deployment for a containerized Python sentiment analyzer:
apiVersion: apps/v1
kind: Deployment
metadata:
name: sentiment-analyzer
namespace: apps
spec:
replicas: 1
selector:
matchLabels:
app: sentiment-analyzer
template:
metadata:
labels:
app: sentiment-analyzer
spec:
containers:
- name: sentiment-analyzer
image: newa/sentiment-analyzer:v1.0.0
ports:
- containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
name: sentiment-analyzer
namespace: apps
spec:
selector:
app: sentiment-analyzer
ports:
- port: 5000
targetPort: 5000
The sentiment analyzer service uses the multilingual DistilBERT sentiment analysis model lxyuan/distilbert-base-multilingual-cased-sentiments-student.
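To make the LLM Guard contract concrete, here is roughly what the middleware sends to the analyzer and the response shape the blockConditions rely on. The field names are inferred from the .predictions[0].NEGATIVE condition in this guide; your own service may return a different structure, and the scores shown are illustrative.

# Body produced by template: '{"text":"{{ .input }}"}'
{"text": "I hate everything and everyone"}

# Expected /predict response shape (assumed from the blockCondition)
{
  "predictions": [
    {"NEGATIVE": 0.93, "POSITIVE": 0.02, "NEUTRAL": 0.05}
  ]
}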
Testing
Test with negative content:
curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "I hate everything and everyone"
}'
The middleware will block this request with a Forbidden response.
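To confirm the block from the command line, you can print just the HTTP status code (assuming only that blocked requests are answered with HTTP 403 Forbidden):

curl -k -s -o /dev/null -w "%{http_code}\n" -X POST https://ai.localhost/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","input":"I hate everything and everyone"}'
# Expected output: 403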
Test with positive content:
curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "I love learning new things"
}'
This request will be allowed through.
Adapting Semantic Cache
The Semantic Cache middleware caches responses based on semantic similarity. For the Responses API, use the generic semantic-cache plugin (not the chat-specific variant) and configure a contentTemplate.
Configuration
- Semantic Cache Middleware
- Responses API Middleware
- IngressRoute
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: semantic-cache-responses
namespace: apps
spec:
plugin:
semantic-cache:
# Use generic semantic-cache plugin
vectorDB:
redis:
endpoints:
- redis-stack.apps.svc.cluster.local:6379
database: 0
collectionName: ai_responses_cache
maxDistance: 0.6
ttl: 3600 # 1 hour
vectorizer:
openai:
model: text-embedding-3-small
token: urn:k8s:secret:ai-keys:openai-token
dimensions: 1536
readOnly: false
allowBypass: true
# Extract input and instructions for cache key
contentTemplate: '{{ .input }} {{ .instructions }}'
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: responsesapi
namespace: apps
spec:
plugin:
responses-api:
token: urn:k8s:secret:ai-keys:openai-token
model: gpt-4o
allowModelOverride: false
allowParamsOverride: true
params:
temperature: 0.7
maxOutputTokens: 1024
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: responses-with-cache
namespace: apps
spec:
entryPoints:
- websecure
routes:
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses`)
services:
- name: openai
port: 443
passHostHeader: false
middlewares:
- name: responsesapi
- name: semantic-cache-responses
Key Configuration Points
- Plugin Name: Use semantic-cache (generic), not chat-completion-semantic-cache.
- contentTemplate: Extract text from the input and instructions fields using Go template syntax.
- Separate Database: Use a different database number or collection name to isolate the Responses API cache from other caches.
- Vector Database: Deploy Redis Stack, Milvus, or Weaviate for vector storage.
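To make the contentTemplate concrete, here is what the cache key text would look like for a sample request (an illustration of the Go template rendering, not literal gateway output):

# Incoming Responses API request
{"model": "gpt-4o", "instructions": "Answer briefly.", "input": "What is the capital of France?"}

# Text vectorized and stored by the cache, rendered from '{{ .input }} {{ .instructions }}'
What is the capital of France? Answer briefly.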
Setting Up Redis Stack
Deploy Redis Stack to Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-stack
namespace: apps
spec:
replicas: 1
selector:
matchLabels:
app: redis-stack
template:
metadata:
labels:
app: redis-stack
spec:
containers:
- name: redis-stack
image: redis/redis-stack:latest
ports:
- containerPort: 6379
volumeMounts:
- name: redis-data
mountPath: /data
volumes:
- name: redis-data
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: redis-stack
namespace: apps
spec:
selector:
app: redis-stack
ports:
- port: 6379
targetPort: 6379
Testing
Make the same request twice:
First request (cache miss):
curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "What is the capital of France?"
}' -i
In the response headers you will see:
X-Cache-Status: Miss
Second request (cache hit):
curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "What is the capital of France?"
}' -i
In the response headers you will see:
X-Cache-Distance: 0.000000
X-Cache-Status: Hit
The second request is served from the cache, so it returns significantly faster and without consuming any LLM tokens.
Complete Integration Example
Here's a complete example integrating all three middlewares with the Responses API:
- All Middlewares
- IngressRoute
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: content-guard-responses
namespace: apps
spec:
plugin:
content-guard:
engine:
presidio:
host: http://presidio-analyzer.presidio.svc.cluster.local:5002
request:
rules:
- jsonPaths:
- .input
- .instructions
entities:
- EMAIL_ADDRESS
- PHONE_NUMBER
- SSN
mask:
char: "*"
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: sentiment-guard-responses
namespace: apps
spec:
plugin:
llm-guard-custom:
endpoint: http://sentiment-analyzer.apps.svc.cluster.local:5000/predict
clientConfig:
timeout: 30s
headers:
Content-Type: application/json
request:
template: '{"text":"{{ .input }}"}'
blockConditions:
- condition: 'JSONGt(".predictions[0].NEGATIVE", 0.8)'
reason: 'Negative sentiment detected'
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: responsesapi
namespace: apps
spec:
plugin:
responses-api:
token: urn:k8s:secret:ai-keys:openai-token
model: gpt-4o
allowModelOverride: false
params:
temperature: 0.7
maxOutputTokens: 1024
tools:
- type: web_search
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: semantic-cache-responses
namespace: apps
spec:
plugin:
semantic-cache:
vectorDB:
redis:
endpoints:
- redis-stack.apps.svc.cluster.local:6379
database: 0
collectionName: ai_responses_cache
vectorizer:
openai:
model: text-embedding-3-small
token: urn:k8s:secret:ai-keys:openai-token
contentTemplate: '{{ .input }} {{ .instructions }}'
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: responses-complete-stack
namespace: apps
spec:
entryPoints:
- websecure
routes:
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses`)
services:
- name: openai
port: 443
passHostHeader: false
middlewares:
# Order matters!
- name: content-guard-responses # 1. Mask PII first
- name: sentiment-guard-responses # 2. Block negative content
- name: responsesapi # 3. Apply governance + metrics
- name: semantic-cache-responses # 4. Cache responses
Middleware Order
The order of middlewares is critical:
- Content Guard: Masks PII before any other processing
- LLM Guard: Analyzes and potentially blocks content
- Responses API: Applies governance and records metrics
- Semantic Cache: Caches the final response
In most cases, you want to place Content Guard and LLM Guard before the Responses API middleware to inspect the original request. Place Semantic Cache after to cache the final response.
However, middleware order depends on your use case:
- Request-only checks: Place guards before the Semantic Cache middleware
- Response-only checks: Place guards after the Semantic Cache middleware
- Comprehensive protection: Use multiple guards around the cache (request checks → cache → response checks)
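For example, a guards-around-the-cache layout could reuse the middlewares defined earlier in this guide. This is a sketch of the ordering only; the responsesapi middleware from the complete example is omitted here for brevity:

middlewares:
  - name: sentiment-guard-responses   # request-side check, runs before the cache
  - name: semantic-cache-responses    # cache lookup and storage
  - name: content-guard-responses     # response-side masking, runs after the cache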
Testing the Complete Stack
Make a request with PII and sentiment:
curl -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "Hi, my email is [email protected]. Can you help me with something?"
}' -v
Expected behavior:
- Email address is masked: j***@***.***
- Sentiment is checked (positive, so allowed)
- Model is set to gpt-4o (governance applied)
- Metrics are recorded
- Response is cached for future similar requests
Comparison: Chat Completions vs Responses API
Here's a side-by-side comparison of middleware configurations:
- Chat Completions
- Responses API
# Content Guard for Chat Completions
spec:
plugin:
chat-completion-content-guard: # Chat-specific variant
engine:
presidio:
host: http://presidio-analyzer.apps.svc.cluster.local:5002
request:
rules:
- entities: [EMAIL_ADDRESS, PHONE_NUMBER]
mask:
char: "*"
# Semantic Cache for Chat Completions
spec:
plugin:
chat-completion-semantic-cache: # Chat-specific variant
ignoreSystem: false
ignoreAssistant: true
messageHistory: 5
vectorDB:
redis:
endpoints: [redis-stack.apps.svc.cluster.local:6379]
collectionName: ai_chat_cache
vectorizer:
openai:
model: text-embedding-3-small
token: urn:k8s:secret:ai-keys:openai-token
# LLM Guard for Chat Completions
spec:
plugin:
llm-guard-custom:
endpoint: http://analyzer.apps.svc.cluster.local:5000/predict
request:
template: '{"text":"{{ (index .messages 0).content }}"}'
# Content Guard for Responses API
spec:
plugin:
content-guard: # Generic variant
engine:
presidio:
host: http://presidio-analyzer.apps.svc.cluster.local:5002
request:
rules:
- jsonPaths: [.input, .instructions] # Explicit paths
entities: [EMAIL_ADDRESS, PHONE_NUMBER]
mask:
char: "*"
# Semantic Cache for Responses API
spec:
plugin:
semantic-cache: # Generic variant
contentTemplate: '{{ .input }} {{ .instructions }}' # Explicit template
vectorDB:
redis:
endpoints:
- redis-stack.apps.svc.cluster.local:6379
database: 0
collectionName: ai_responses_cache
vectorizer:
openai:
model: text-embedding-3-small
token: urn:k8s:secret:ai-keys:openai-token
# LLM Guard for Responses API
spec:
plugin:
llm-guard-custom:
endpoint: http://analyzer.apps.svc.cluster.local:5000/predict
request:
template: '{"text":"{{ .input }}"}' # Different extraction
Streaming Support
Content Guard, LLM Guard, and Semantic Cache do not support true streaming mode. When using "stream": true in Responses API requests:
- These middlewares wait for the complete response to arrive, process it, and then send the entire response as a single chunk to the client
- The client expects a stream but receives the processed response as one chunk after all processing is complete
- Token usage metrics will not be recorded for streaming requests
Workaround: Use separate routes for streaming and non-streaming requests:
# Non-streaming with full middleware stack
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses`)
middlewares:
- name: content-guard-responses
- name: responsesapi
- name: semantic-cache-responses
# Streaming with minimal middlewares
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses/stream`)
middlewares:
- name: responsesapi
Troubleshooting
Content Guard Not Masking PII
PII is not being masked in requests.
Solutions:
- Verify Presidio is running and accessible:
kubectl get pods -n presidio
- Check the JSON paths are correct:
jsonPaths:
- .input # Not .messages
- .instructions
- Enable debug logging in Traefik to see middleware execution:
helm upgrade traefik traefik/traefik -n traefik --wait \
--reset-then-reuse-values \
--set "additionalArguments={--log.level=DEBUG}" -
Test Presidio directly to verify it's working:
curl -X POST http://presidio-analyzer.presidio.svc.cluster.local:5002/analyze \
-H "Content-Type: application/json" \
-d '{"text":"My email is [email protected]","language":"en"}'
LLM Guard Template Errors
Template execution errors occur, or the guard service is not being called.
Solutions:
- Verify the template syntax uses the correct field for Responses API:
# ✅ Correct - Use .input for Responses API
template: '{"text":"{{ .input }}"}'
# ❌ Incorrect - Don't use .messages (that's for Chat Completions)
template: '{"text":"{{ (index .messages 0).content }}"}' -
Test the external service directly to ensure it's accessible:
curl -X POST http://sentiment-analyzer.apps.svc.cluster.local:5000/predict \
-H "Content-Type: application/json" \
-d '{"text":"test message"}' -
Check service name resolution in your cluster:
kubectl get svc -n apps
kubectl get endpoints -n apps
- Verify the template output matches what your service expects by enabling debug logging
Semantic Cache Not Working
All requests show X-Cache-Status: Miss or cache is not being populated.
Solutions:
- Verify your vector database is running and has vector search support. For example, check whether Redis Stack has the search module loaded:
kubectl exec -it -n apps deploy/redis-stack -- redis-cli
> MODULE LIST
# Should show "search" module loaded
- Check the contentTemplate extracts content correctly:
# ✅ Correct - Extract both input and instructions
contentTemplate: '{{ .input }} {{ .instructions }}'
# ❌ Incorrect - Chat Completions syntax
contentTemplate: '{{ .messages }}'
- Verify the vectorizer is accessible and has valid credentials:
# Check the OpenAI token used by the vectorizer exists and decodes correctly
kubectl get secret ai-keys -n apps -o jsonpath='{.data.openai-token}' | base64 -d
- Check Redis logs for any errors:
kubectl logs -n apps deploy/redis-stack
- Verify the vector database configuration:
vectorDB:
redis:
endpoints:
- redis-stack.apps.svc.cluster.local:6379 # Full FQDN
database: 0
collectionName: ai_responses_cache # Unique collection name
- Test cache manually by making the same request twice and checking response headers
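For example, reusing the question from the Testing section and keeping only the cache header:

for i in 1 2; do
  curl -k -s -i -X POST https://ai.localhost/v1/responses \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o","input":"What is the capital of France?"}' \
    | grep -i x-cache-status
done
# First iteration:  X-Cache-Status: Miss
# Second iteration: X-Cache-Status: Hit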
Wrong Middleware Order
Middlewares not behaving as expected or PII is being sent to LLM/cache.
Solution:
Ensure correct middleware order in IngressRoute. Order matters because:
- Content Guard must run first to mask PII before other middlewares see it
- LLM Guard should run after PII masking but before the API middleware
- Responses API middleware applies governance
- Semantic Cache should run last to cache the final response
middlewares:
- name: content-guard-responses # 1. Mask PII first
- name: sentiment-guard-responses # 2. Block negative content
- name: responsesapi # 3. Apply governance + metrics
- name: semantic-cache-responses # 4. Cache final response
Test the order:
- Make a request with PII:
curl -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o","input":"My email is [email protected]"}'
- Check Traefik logs to see middleware execution order:
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100
Streaming Requests Not Working
Streaming responses are not being returned or middlewares are blocking streaming.
Issue:
Content Guard, LLM Guard, and Semantic Cache do not support true streaming mode. When "stream": true is set, these middlewares wait for the complete response, process it, and send it as a single chunk to the client.
Solution:
Create separate routes for streaming and non-streaming requests:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: responses-routes
namespace: apps
spec:
entryPoints:
- websecure
routes:
# Non-streaming with full middleware stack
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses`)
services:
- name: openai
port: 443
middlewares:
- name: content-guard-responses
- name: sentiment-guard-responses
- name: responsesapi
- name: semantic-cache-responses
# Streaming with minimal middlewares (only governance)
- kind: Rule
match: Host(`ai.localhost`) && Path(`/v1/responses/stream`)
services:
- name: openai
port: 443
middlewares:
- name: responsesapi # Only governance, no content processing
Clients should use different endpoints based on their needs:
- Standard requests: POST /v1/responses
- Streaming requests: POST /v1/responses/stream
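A streaming client call against the minimal route could then look like this (curl -N disables output buffering so chunks print as they arrive; the stream flag follows the Responses API request format):

curl -k -N -X POST https://ai.localhost/v1/responses/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "input": "Tell me a short story",
    "stream": true
  }'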
Responses API Middleware Configuration Errors
Configuration validation errors or unexpected behavior from the Responses API middleware.
Common issues:
- Model override not working:
# ✅ Correct - Allow clients to override model
spec:
plugin:
responses-api:
model: gpt-4o
allowModelOverride: true # Clients can specify their own model
# ❌ Incorrect - Model is enforced
spec:
plugin:
responses-api:
model: gpt-4o
allowModelOverride: false # Client's model is ignored
- Too many tools error:
If receiving "Maximum X tools allowed, got Y" errors:
params:
maxToolCall: 20 # Increase limit or remove tools from request
tools:
- type: web_search
- Missing token:
Ensure the token secret exists and is referenced correctly:
# Check secret exists
kubectl get secret ai-keys -n apps
# Check secret content
kubectl get secret ai-keys -n apps -o yaml
Reference it in the middleware:
token: urn:k8s:secret:ai-keys:openai-token
Next Steps
- Learn more about the Responses API middleware configuration options
- Explore Content Guard for PII protection
- Set up Semantic Cache to reduce costs
- Configure LLM Guard for custom policies
- Review OpenTelemetry metrics for monitoring
