
Adapting AI Middlewares for Responses API

This guide demonstrates how to configure Traefik Hub's AI Gateway middlewares to work with the OpenAI Responses API format. While these middlewares were originally designed for generic HTTP/JSON payloads, they can be adapted to handle the Responses API's specific request and response structure.

Overview

The OpenAI Responses API uses a different request/response format compared to the Chat Completions API:

| Aspect          | Chat Completions           | Responses API                          |
|-----------------|----------------------------|----------------------------------------|
| Request input   | messages[] array           | input string + optional instructions   |
| Response output | choices[].message.content  | output[] array with structured items   |
| Request path    | /v1/chat/completions       | /v1/responses                          |

To work with this format, you need to configure your AI middlewares to target the correct JSON paths in requests and responses.

Prerequisites

Before starting, ensure you have:

  1. AI Gateway enabled in your Traefik Hub installation:

    helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set hub.aigateway.enabled=true
  2. An OpenAI API key stored in a Kubernetes Secret:

    apiVersion: v1
    kind: Secret
    metadata:
      name: ai-keys
      namespace: apps
    type: Opaque
    stringData:
      openai-token: sk-proj-XXXXX
  3. An ExternalName service pointing to OpenAI:

    apiVersion: v1
    kind: Service
    metadata:
      name: openai
      namespace: apps
    spec:
      type: ExternalName
      externalName: api.openai.com
      ports:
        - port: 443

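You also need an IngressRoute that exposes the Responses API path and forwards traffic to the openai service. Below is a minimal sketch; the ai.localhost host and websecure entry point match the testing examples and the Troubleshooting IngressRoute later in this guide, and the middlewares list is filled in as you work through the following sections.

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: responses-routes
  namespace: apps
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`ai.localhost`) && Path(`/v1/responses`)
      services:
        - name: openai
          port: 443
      middlewares: [] # filled in with the middlewares defined in the sections below
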
Adapting Content Guard

The Content Guard middleware detects and masks PII in requests and responses using JSON path queries. For Responses API, configure it to inspect the input and instructions fields, since the Responses API uses a flat structure instead of the nested messages[].content array used in Chat Completions.

Configuration

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: content-guard-responses
  namespace: apps
spec:
  plugin:
    content-guard:
      engine:
        presidio:
          host: http://presidio-analyzer.presidio.svc.cluster.local:5002
          language: en
          timeout: 30s
      request:
        rules:
          # Rule 1: Detect and mask PII in all request fields
          - entities:
              - PERSON
              - EMAIL_ADDRESS
              - PHONE_NUMBER
              - SSN
              - CREDIT_CARD
            block: false
            mask:
              char: "*"
              unmaskFromLeft: 2
              unmaskFromRight: 2

          # Rule 2: Specifically target input and instructions fields
          - jsonPaths:
              - .input
              - .instructions
            entities:
              - EMAIL_ADDRESS
              - PHONE_NUMBER
            block: false
            mask:
              char: "X"
      response:
        rules:
          # Mask PII in response output
          - entities:
              - PERSON
              - EMAIL_ADDRESS
              - PHONE_NUMBER
            block: false
            mask:
              char: "*"
              unmaskFromLeft: 1
              unmaskFromRight: 1

Path Rewriting

If the path exposed on your router differs from the path the upstream expects, add a path rewriting middleware. For example, to rewrite incoming requests to /v1/responses:

# First, create the middleware
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: replacepath-responses
spec:
  replacePath:
    path: "/v1/responses" # Rewrite to the correct upstream path

# Then reference it in your IngressRoute
middlewares:
  - name: replacepath-responses

Key Configuration Points

  • jsonPaths: Set to [".input", ".instructions"] to target Responses API request fields.
  • Multiple Rules: You can have a global rule for all fields and specific rules for certain JSON paths.
  • Engine Setup: Presidio must be deployed; a minimal deployment sketch follows this list. See the Presidio documentation for full setup instructions.
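
If Presidio is not yet running in your cluster, a minimal analyzer deployment might look like the following. The mcr.microsoft.com/presidio-analyzer image and its default container port 3000 are assumptions based on the upstream Presidio project; check the Presidio documentation for the authoritative setup.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: presidio-analyzer
  namespace: presidio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: presidio-analyzer
  template:
    metadata:
      labels:
        app: presidio-analyzer
    spec:
      containers:
        - name: presidio-analyzer
          image: mcr.microsoft.com/presidio-analyzer:latest # assumed upstream image
          ports:
            - containerPort: 3000 # assumed default analyzer port
---
apiVersion: v1
kind: Service
metadata:
  name: presidio-analyzer
  namespace: presidio
spec:
  selector:
    app: presidio-analyzer
  ports:
    - port: 5002       # matches the host URL used in the middleware above
      targetPort: 3000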

Testing

Create a test request with PII:

curl -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "say the following back to me: My email is [email protected] and phone is 555-123-4567"
}'

The middleware masks the PII before the request is forwarded to the LLM, and a response similar to the following is returned:

{
  // ...
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "text": "My email is XXXXXXXXXXXXXXXXXXXX and phone is XXXXXXXXXXXX"
    }
  ]
  // ...
}

You can verify the PII is masked by checking the response body.

Adapting LLM Guard

The LLM Guard middleware performs custom content analysis using external services or LLMs. For Responses API, adjust the template to extract the input field.

Configuration

This example uses a sentiment analysis service to block negative content:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: sentiment-guard-responses
  namespace: apps
spec:
  plugin:
    llm-guard-custom:
      endpoint: http://sentiment-analyzer.sentiment.svc.cluster.local:5000/predict
      clientConfig:
        timeout: 30s
        headers:
          Content-Type: application/json
      request:
        # Extract the input field for sentiment analysis
        template: '{"text":"{{ .input }}"}'
        # Block if negative sentiment exceeds 60%
        blockConditions:
          - condition: 'JSONGt(".predictions[0].NEGATIVE", 0.6)'
            reason: 'Negative sentiment detected'

Key Configuration Points

  • template: Use {{ .input }} to extract the input field from the Responses API request.
  • blockConditions: Define conditions using JSON query syntax to determine when to block requests.
  • External Service: You need to deploy your own sentiment analysis or content filtering service.

Example Sentiment Analyzer

Here's a Kubernetes Deployment and Service for a Python-based sentiment analyzer that you can deploy in your cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-analyzer
  namespace: sentiment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sentiment-analyzer
  template:
    metadata:
      labels:
        app: sentiment-analyzer
    spec:
      containers:
        - name: sentiment-analyzer
          image: newa/sentiment-analyzer:v1.0.0
          ports:
            - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: sentiment-analyzer
  namespace: sentiment
spec:
  selector:
    app: sentiment-analyzer
  ports:
    - port: 5000
      targetPort: 5000

The sentiment analyzer service uses the multilingual DistilBERT sentiment model lxyuan/distilbert-base-multilingual-cased-sentiments-student. It accepts a JSON body with a text field and returns prediction scores, which the blockConditions in the middleware evaluate.

Testing

Test with negative content:

curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "I hate everything and everyone"
}'

The middleware will block this request with a Forbidden response.

Test with positive content:

curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "I love learning new things"
}'

This request will be allowed through.

Adapting Semantic Cache

The Semantic Cache middleware caches responses based on semantic similarity. For Responses API, use the generic semantic-cache plugin (not the chat-specific variant) and configure a contentTemplate.

Configuration

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-responses
  namespace: apps
spec:
  plugin:
    semantic-cache:
      # Use generic semantic-cache plugin
      vectorDB:
        redis:
          endpoints:
            - redis-stack.apps.svc.cluster.local:6379
          database: 0
        collectionName: ai_responses_cache
        maxDistance: 0.6
      ttl: 3600 # 1 hour
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
          dimensions: 1536
      readOnly: false
      allowBypass: true
      # Extract input and instructions for cache key
      contentTemplate: '{{ .input }} {{ .instructions }}'

Key Configuration Points

  • Plugin Name: Use semantic-cache (generic) not chat-completion-semantic-cache.
  • contentTemplate: Extract text from input and instructions fields using Go template syntax.
  • Separate Database: Use a different database number or collection name to isolate Responses API cache from other caches.
  • Vector Database: Deploy Redis Stack, Milvus, or Weaviate for vector storage.

Setting Up Redis Stack

Deploy Redis Stack to Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-stack
  namespace: apps
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-stack
  template:
    metadata:
      labels:
        app: redis-stack
    spec:
      containers:
        - name: redis-stack
          image: redis/redis-stack:latest
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-data
              mountPath: /data
      volumes:
        - name: redis-data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: redis-stack
  namespace: apps
spec:
  selector:
    app: redis-stack
  ports:
    - port: 6379
      targetPort: 6379

Testing

Make the same request twice:

First request (cache miss):

curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "What is the capital of France?"
}' -i

In the response headers you will see:

X-Cache-Status: Miss

Second request (cache hit):

curl -k -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "What is the capital of France?"
}' -i

In the response headers you will see:

X-Cache-Distance: 0.000000
X-Cache-Status: Hit

The second request is served from the cache, so it returns significantly faster and consumes no LLM tokens.

Complete Integration Example

Here's a complete example combining the three adapted middlewares with the Responses API middleware:

---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: content-guard-responses
  namespace: apps
spec:
  plugin:
    content-guard:
      engine:
        presidio:
          host: http://presidio-analyzer.presidio.svc.cluster.local:5002
      request:
        rules:
          - jsonPaths:
              - .input
              - .instructions
            entities:
              - EMAIL_ADDRESS
              - PHONE_NUMBER
              - SSN
            mask:
              char: "*"
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: sentiment-guard-responses
  namespace: apps
spec:
  plugin:
    llm-guard-custom:
      endpoint: http://sentiment-analyzer.sentiment.svc.cluster.local:5000/predict
      clientConfig:
        timeout: 30s
        headers:
          Content-Type: application/json
      request:
        template: '{"text":"{{ .input }}"}'
        blockConditions:
          - condition: 'JSONGt(".predictions[0].NEGATIVE", 0.8)'
            reason: 'Negative sentiment detected'
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: responsesapi
  namespace: apps
spec:
  plugin:
    responses-api:
      token: urn:k8s:secret:ai-keys:openai-token
      model: gpt-4o
      allowModelOverride: false
      params:
        temperature: 0.7
        maxOutputTokens: 1024
        tools:
          - type: web_search
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-responses
  namespace: apps
spec:
  plugin:
    semantic-cache:
      vectorDB:
        redis:
          endpoints:
            - redis-stack.apps.svc.cluster.local:6379
          database: 0
        collectionName: ai_responses_cache
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
      contentTemplate: '{{ .input }} {{ .instructions }}'

Middleware Order

The order of middlewares is critical:

  1. Content Guard: Masks PII before any other processing
  2. LLM Guard: Analyzes and potentially blocks content
  3. Responses API: Applies governance and records metrics
  4. Semantic Cache: Caches the final response

Middleware Order Matters

In most cases, you want to place Content Guard and LLM Guard before the Responses API middleware to inspect the original request. Place Semantic Cache after to cache the final response.

However, middleware order depends on your use case:

  • Request-only checks: Place guards before the Semantic Cache middleware
  • Response-only checks: Place guards after the Semantic Cache middleware
  • Comprehensive protection: Use multiple guards around the cache (request checks → cache → response checks)
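
For the common case, the middlewares list on your IngressRoute (see the sketch in the prerequisites and the full example in the Troubleshooting section) follows the recommended order:

middlewares:
  - name: content-guard-responses   # 1. Mask PII
  - name: sentiment-guard-responses # 2. Block negative content
  - name: responsesapi              # 3. Apply governance and record metrics
  - name: semantic-cache-responses  # 4. Cache the final response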

Testing the Complete Stack

Make a request with PII and sentiment:

curl -X POST https://ai.localhost/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": "Hi, my email is [email protected]. Can you help me with something?"
}' -v

Expected behavior:

  1. Email address is masked: j***@***.***
  2. Sentiment is checked (positive, so allowed)
  3. Model is set to gpt-4o (governance applied)
  4. Metrics are recorded
  5. Response is cached for future similar requests

Comparison: Chat Completions vs Responses API

Here's a side-by-side comparison of middleware configurations:

# Content Guard for Chat Completions
spec:
  plugin:
    chat-completion-content-guard: # Chat-specific variant
      engine:
        presidio:
          host: http://presidio-analyzer.apps.svc.cluster.local:5002
      request:
        rules:
          - entities: [EMAIL_ADDRESS, PHONE_NUMBER]
            mask:
              char: "*"

# Semantic Cache for Chat Completions
spec:
  plugin:
    chat-completion-semantic-cache: # Chat-specific variant
      ignoreSystem: false
      ignoreAssistant: true
      messageHistory: 5
      vectorDB:
        redis:
          endpoints: [redis-stack.apps.svc.cluster.local:6379]
        collectionName: ai_chat_cache
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token

# LLM Guard for Chat Completions
spec:
  plugin:
    llm-guard-custom:
      endpoint: http://analyzer.apps.svc.cluster.local:5000/predict
      request:
        template: '{"text":"{{ (index .messages 0).content }}"}'

Streaming Support

Content Guard, LLM Guard, and Semantic Cache do not support true streaming mode. When using "stream": true in Responses API requests:

  • These middlewares wait for the complete response to arrive, process it, and then return it to the client as a single chunk rather than a true stream
  • Token usage metrics will not be recorded for streaming requests

Workaround: Use separate routes for streaming and non-streaming requests:

# Non-streaming with full middleware stack
- kind: Rule
  match: Host(`ai.localhost`) && Path(`/v1/responses/standard`)
  middlewares:
    - name: content-guard-responses
    - name: responsesapi
    - name: semantic-cache-responses

# Streaming with minimal middlewares
- kind: Rule
  match: Host(`ai.localhost`) && Path(`/v1/responses/stream`)
  middlewares:
    - name: responsesapi
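
Because /v1/responses/standard and /v1/responses/stream differ from the /v1/responses path the upstream serves, you may also need to rewrite the path on these routes. A sketch for the streaming route, reusing the replacepath-responses middleware defined earlier (whether this is required depends on how your Responses API middleware and upstream are set up):

# Streaming route with the client-facing path rewritten to /v1/responses
- kind: Rule
  match: Host(`ai.localhost`) && Path(`/v1/responses/stream`)
  middlewares:
    - name: replacepath-responses # rewrites /v1/responses/stream to /v1/responses
    - name: responsesapi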

Troubleshooting

Content Guard Not Masking PII

PII is not being masked in requests.

Solutions:

  1. Verify Presidio is running and accessible:

    kubectl get pods -n presidio
  2. Check the JSON paths are correct:

    jsonPaths:
    - .input # Not .messages
    - .instructions
  3. Enable debug logging in Traefik to see middleware execution:

    helm upgrade traefik traefik/traefik -n traefik --wait \
    --reset-then-reuse-values \
    --set "additionalArguments={--log.level=DEBUG}"
  4. Test Presidio directly to verify it's working:

    curl -X POST http://presidio-analyzer.presidio.svc.cluster.local:5002/analyze \
    -H "Content-Type: application/json" \
    -d '{"text":"My email is [email protected]","language":"en"}'

LLM Guard Template Errors

You receive errors about template execution, or the guard service is not being called.

Solutions:

  1. Verify the template syntax uses the correct field for Responses API:

    # ✅ Correct - Use .input for Responses API
    template: '{"text":"{{ .input }}"}'

    # ❌ Incorrect - Don't use .messages (that's for Chat Completions)
    template: '{"text":"{{ (index .messages 0).content }}"}'
  2. Test the external service directly to ensure it's accessible:

    curl -X POST http://sentiment-analyzer.sentiment.svc.cluster.local:5000/predict \
    -H "Content-Type: application/json" \
    -d '{"text":"test message"}'
  3. Check service name resolution in your cluster:

    kubectl get svc -n sentiment
    kubectl get endpoints -n sentiment
  4. Verify the template output matches what your service expects by enabling debug logging

Semantic Cache Not Working

All requests show X-Cache-Status: Miss or cache is not being populated.

Solutions:

  1. Verify your vector database is running and has vector search support. For example, check that Redis Stack has the search module loaded:

    kubectl exec -it deploy/redis-stack -n apps -- redis-cli
    > MODULE LIST
    # Should show the "search" module loaded
  2. Check the contentTemplate extracts content correctly:

    # ✅ Correct - Extract both input and instructions
    contentTemplate: '{{ .input }} {{ .instructions }}'

    # ❌ Incorrect - Chat Completions syntax
    contentTemplate: '{{ .messages }}'
  3. Verify the vectorizer is accessible and has valid credentials:

    # Test OpenAI vectorizer connectivity
    kubectl get secret ai-keys -n apps -o jsonpath='{.data.openai-token}' | base64 -d
  4. Check Redis logs for any errors:

    kubectl logs -n apps deploy/redis-stack
  5. Verify the vector database configuration:

    vectorDB:
      redis:
        endpoints:
          - redis-stack.apps.svc.cluster.local:6379 # Full FQDN
        database: 0
      collectionName: ai_responses_cache # Unique collection name
  6. Test cache manually by making the same request twice and checking response headers

Wrong Middleware Order

Middlewares are not behaving as expected, or PII is reaching the LLM or the cache.

Solution:

Ensure correct middleware order in IngressRoute. Order matters because:

  • Content Guard must run first to mask PII before other middlewares see it
  • LLM Guard should run after PII masking but before the API middleware
  • Responses API middleware applies governance
  • Semantic Cache should run last to cache the final response

middlewares:
  - name: content-guard-responses   # 1. Mask PII first
  - name: sentiment-guard-responses # 2. Block negative content
  - name: responsesapi              # 3. Apply governance + metrics
  - name: semantic-cache-responses  # 4. Cache final response

Test the order:

  1. Make a request with PII:

    curl -X POST https://ai.localhost/v1/responses \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o","input":"My email is [email protected]"}'
  2. Check Traefik logs to see middleware execution order:

    kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100

Streaming Requests Not Working

Streaming responses are not being returned or middlewares are blocking streaming.

Issue:

Content Guard, LLM Guard, and Semantic Cache do not support true streaming mode. When "stream": true is set, these middlewares wait for the complete response, process it, and send it as a single chunk to the client.

Solution:

Create separate routes for streaming and non-streaming requests:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: responses-routes
  namespace: apps
spec:
  entryPoints:
    - websecure
  routes:
    # Non-streaming with full middleware stack
    - kind: Rule
      match: Host(`ai.localhost`) && Path(`/v1/responses`)
      services:
        - name: openai
          port: 443
      middlewares:
        - name: content-guard-responses
        - name: sentiment-guard-responses
        - name: responsesapi
        - name: semantic-cache-responses

    # Streaming with minimal middlewares (only governance)
    - kind: Rule
      match: Host(`ai.localhost`) && Path(`/v1/responses/stream`)
      services:
        - name: openai
          port: 443
      middlewares:
        - name: responsesapi # Only governance, no content processing

Clients should use different endpoints based on their needs:

  • Standard requests: POST /v1/responses
  • Streaming requests: POST /v1/responses/stream

Responses API Middleware Configuration Errors

Configuration validation errors or unexpected behavior from the Responses API middleware.

Common issues:

  1. Model override not working:

    # ✅ Correct - Allow clients to override model
    spec:
      plugin:
        responses-api:
          model: gpt-4o
          allowModelOverride: true # Clients can specify their own model

    # ❌ Incorrect - Model is enforced
    spec:
      plugin:
        responses-api:
          model: gpt-4o
          allowModelOverride: false # Client's model is ignored
  2. Too many tools error:

    If receiving "Maximum X tools allowed, got Y" errors:

    params:
      maxToolCall: 20 # Increase limit or remove tools from request
      tools:
        - type: web_search
  3. Missing token:

    Ensure the token secret exists and is referenced correctly:

    # Check secret exists
    kubectl get secret ai-keys -n apps

    # Check secret content
    kubectl get secret ai-keys -n apps -o yaml

    Reference in middleware:

    token: urn:k8s:secret:ai-keys:openai-token

Next Steps