Semantic Cache
The Semantic Cache middleware reduces LLM response times and API costs by avoiding redundant computations. It uses semantic similarity (not only text matching) to determine whether a request has been previously answered—and reuses the cached result when appropriate.
There are two variants of the middleware:
Variant | Best for | Key options
---|---|---
`semantic-cache` | Any JSON payload / REST request | `contentTemplate`
`chat-completion-semantic-cache` | OpenAI-compatible chat completions | `ignoreSystem`, `ignoreAssistant`, `ignoreTool`, `messageHistory`
Key Features and Benefits
- Faster Responses: Resolve repeated requests in milliseconds instead of waiting for LLM inference.
- Lower API Costs: Avoid paying for redundant token usage across identical or similar prompts.
- Semantic Matching: Works even when input phrasing changes, thanks to vector-based similarity.
- Safe Caching: readOnly mode allows staging and production separation to prevent cache pollution.
- Multiple Vectorizers & DB Support: Support for OpenAI, Ollama, Mistral, and other vectorizers, with redis-stack, Milvus, and Weaviate as vector databases.
Requirements

- You must have AI Gateway enabled:

  ```bash
  helm install traefik -n traefik --wait \
    --set hub.aigateway.enabled=true
  ```

- You need a vectorizer that can produce text embeddings. We currently support:

  - OpenAI
  - Gemini
  - Ollama
  - Mistral
  - Azure OpenAI
  - Bedrock
  - Cohere

  If your chosen vectorizer requires a token, you'll need to store the chosen vectorizer credentials in a Kubernetes Secret and reference it in the middleware configuration with a `urn` reference.

- You need a vector database that stores and retrieves embeddings. We currently support:

  - redis-stack
  - Weaviate
  - Milvus
How It Works
When an AI request arrives, the semantic cache middleware processes it through the following steps:
Prerequisites for Caching:

- Only requests with a request body are cached. `GET` requests cannot be cached since they typically don't have bodies to extract content from.
- Only `200 OK` responses are cached. Other status codes like `201 Created`, `400 Bad Request`, etc. are not cached.
- The `contentTemplate` must successfully extract text from the request body for caching to occur.

1. Extract & Prepare Text: The middleware extracts text from the request body and formats it according to a content template. By default, it takes the last user message in a typical chat completion request.
2. Compute Embeddings: The text is converted into a vector using a vectorizer. The vector represents the semantic meaning of the request text.
3. Similarity Search in the Vector Database: The middleware queries a vector database (redis-stack, Weaviate, or Milvus) to see if there is a cached response with a sufficiently close vector, based on a similarity threshold (`maxDistance`).
   - Cache Hit: If a similar vector is found, the cached answer is returned.
   - Cache Miss: The request proceeds to the AI service, and the resulting response is stored as a new entry in the cache (unless `readOnly` is `true`).
4. Response Headers: The response includes the following headers: `X-Cache-Status` (`Hit` or `Miss`) and `X-Cache-Distance` (useful for tuning your similarity threshold).
5. Separate Caches: Streamed (`"stream": true`) and non-streamed answers live in different buckets; a hit in one does not populate the other.
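The hit/miss decision in step 3 can be sketched in Go. The in-memory store, the toy vectors, and the cosine-distance metric below are illustrative stand-ins for the real vectorizer and vector database, not the gateway's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a stored embedding with its cached response.
type entry struct {
	vec      []float64
	response string
}

// cosineDistance returns 1 - cosine similarity; 0 means identical direction.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// lookup mimics the cache decision: return the stored response when the
// closest entry is within maxDistance, otherwise report a miss.
func lookup(cache []entry, query []float64, maxDistance float64) (string, bool) {
	best, bestDist := "", math.Inf(1)
	for _, e := range cache {
		if d := cosineDistance(e.vec, query); d < bestDist {
			best, bestDist = e.response, d
		}
	}
	if bestDist <= maxDistance {
		return best, true // cache hit: serve from the vector database
	}
	return "", false // cache miss: forward to the LLM, then store the answer
}

func main() {
	cache := []entry{{vec: []float64{1, 0}, response: "Paris"}}
	if resp, hit := lookup(cache, []float64{0.9, 0.1}, 0.5); hit {
		fmt.Println("X-Cache-Status: Hit,", resp)
	} else {
		fmt.Println("X-Cache-Status: Miss")
	}
}
```

A lower `maxDistance` demands closer vectors before a hit is declared, which is why tightening it trades hit rate for precision.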
The semantic cache only stores responses with `200 OK` status codes. If your API returns `201 Created`, `202 Accepted`, or other success codes, consider configuring your service to return `200 OK` for cached endpoints.
Configuration Examples
Choose the plugin variant that matches your use case, then configure it with your preferred vector database.
Semantic Cache Plugin
Use the `semantic-cache` plugin for general REST APIs and custom JSON payloads. You must define a `contentTemplate` to extract text from your specific JSON structure.
- Middleware with redis-stack (API Gateway)
- Middleware with Milvus
- Middleware with Weaviate
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434 # Protocol HTTP/HTTPS must be declared
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
          ttl: 3600
      readOnly: false
      allowBypass: true # Allow clients to bypass the cache with Cache-Control headers
      contentTemplate: '{{ .messages }}' # Extract the 'messages' field from a JSON body like {"messages": "text"}
```
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434 # Protocol HTTP/HTTPS must be declared
          model: nomic-embed-text
      vectorDB:
        milvus:
          clientConfig:
            address: http://milvus.default.svc.cluster.local:19530
          collectionName: milvusv1
          maxDistance: 0.5
          ttl: 86400
      readOnly: false
      allowBypass: false
      contentTemplate: "{{ .query }}" # Extract the 'query' field from the JSON body
```
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
      vectorDB:
        weaviate:
          host: weaviate.default.svc.cluster.local:80
          scheme: http
          collectionName: rest_cache
          maxDistance: 0.5
          apiKey: urn:k8s:secret:ai-keys:weaviate-key
          ttl: 3600
      readOnly: false
      allowBypass: true
      contentTemplate: '{{ .prompt }}' # Extract the 'prompt' field from the JSON body
```
Chat Completion + Semantic Cache
Use the `chat-completion-semantic-cache` plugin specifically for OpenAI-compatible chat completion endpoints. It has built-in understanding of the `messages` array and offers chat-specific filtering options.
- Middleware with redis-stack (Chat variant)
- Middleware with Milvus (Chat variant)
- Middleware with Weaviate (Chat variant)
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-chat
spec:
  plugin:
    chat-completion-semantic-cache:
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: chat_cache
          maxDistance: 0.4
          ttl: 1800
      # Chat-specific options
      ignoreAssistant: true
      messageHistory: 4
      readOnly: false
      allowBypass: true
```
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-chat
spec:
  plugin:
    chat-completion-semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434
          model: nomic-embed-text
      vectorDB:
        milvus:
          clientConfig:
            address: http://milvus.default.svc.cluster.local:19530
          collectionName: chat_completions
          maxDistance: 0.3
          ttl: 86400
      # Chat-specific options
      ignoreSystem: false
      ignoreAssistant: true
      messageHistory: 8
      readOnly: false
```
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-chat
spec:
  plugin:
    chat-completion-semantic-cache:
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
      vectorDB:
        weaviate:
          host: weaviate.default.svc.cluster.local:80
          scheme: http
          collectionName: chat_cache
          maxDistance: 0.5
          apiKey: urn:k8s:secret:ai-keys:weaviate-key # Optional: for authenticated Weaviate instances
          ttl: 600
          clientConfig:
            timeout: 5s
            insecureSkipVerify: false
      # Chat-specific options
      ignoreAssistant: true
      messageHistory: 4
      readOnly: false
      allowBypass: true
```
IngressRoute Configuration
Both plugins work with the same IngressRoute configuration:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai
  namespace: traefik
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)
      middlewares:
        - name: semantic-cache # or semantic-cache-chat
      services:
        - name: chatgpt-external # ExternalName to api.openai.com
          port: 443
          scheme: https
          passHostHeader: false
```
`vectorizer.*` tells Semantic Cache where to fetch embeddings for similarity checks. `services` in the `IngressRoute` tells Traefik Hub where to send the actual request for inference. They can point to the same provider (all-OpenAI, all-Ollama) or to different ones (local embeddings + cloud LLM). Pick whatever best balances cost, performance, and availability for your use case.
The AI Gateway upper-cases the first letter of the collection name internally (`chat_cache` → `Chat_cache`). Submit names in any case; they are normalised automatically to Weaviate's expected format.
Configuration Options
Field | Description | Required | Default |
---|---|---|---|
vectorizer | Configures which embedding provider to use (OpenAI, Azure OpenAI, Cohere, Gemini, Mistral, Ollama, or Bedrock). | Yes | |
vectorizer.<embedding-provider>.baseUrl | Configures the base URL of the embedding provider. | No | |
vectorizer.<embedding-provider>.model | Configures the embedding model. | Yes | |
vectorizer.<embedding-provider>.token | URN of a Kubernetes Secret that holds the embedding provider's API key (for example, urn:k8s:secret:<secretname>:<key> ). | No | |
vectorizer.<embedding-provider>.scheme | Configures the scheme for the embedding provider. | No | |
vectorizer.clientConfig | Configures the HTTP client settings for the embedding provider. | No | |
vectorizer.clientConfig.timeout | Configures the timeout for the HTTP client. | No | 3s |
vectorizer.clientConfig.proxyURL | Configures the proxy URL for the HTTP client. | No | |
vectorizer.clientConfig.insecureSkipVerify | Configures the insecure skip verify for the HTTP client. | No | false |
vectorDB | Configures which vector database to use (redis-stack, weaviate, or milvus). | Yes | |
vectorDB.redis.endpoints | Configures the Redis host and port (for example, redis.default.svc.cluster.local:6379 ). | Yes | |
vectorDB.redis.collectionName | Configures the collection name in Redis. | Yes | |
vectorDB.redis.maxDistance | Threshold for semantic similarity in Redis. The lower the value, the more exact the match must be. | No | |
vectorDB.redis.ttl | Configures the time to live for cached entries in Redis (0 = forever). | No | 0 |
vectorDB.redis.* | Redis configuration inherits from the Redis client configuration and supports additional Redis-specific options. | No | |
vectorDB.weaviate.host | Configures the Weaviate host and port (for example, weaviate.default.svc.cluster.local:80 ). | Yes | |
vectorDB.weaviate.collectionName | Configures the collection name in Weaviate. | Yes | |
vectorDB.weaviate.maxDistance | Threshold for semantic similarity in Weaviate. The lower the value, the more exact the match must be. | No | |
vectorDB.weaviate.ttl | Configures the time to live for cached entries in Weaviate (0 = forever). | No | 0 |
vectorDB.weaviate.scheme | Configures the scheme for Weaviate (http or https). | Yes | |
vectorDB.weaviate.apiKey | URN of a Kubernetes Secret that holds the Weaviate API key (for example, urn:k8s:secret:<secretname>:<key> ). | No | |
vectorDB.weaviate.clientConfig | Configures the HTTP client settings for Weaviate. | No | |
vectorDB.milvus.clientConfig | Configures the Milvus client settings including address, authentication, and connection options. | Yes | |
vectorDB.milvus.clientConfig.address | The Milvus server address (for example, http://milvus.default.svc.cluster.local:19530 ). | Yes | |
vectorDB.milvus.clientConfig.username | Username for Milvus authentication. | No | |
vectorDB.milvus.clientConfig.password | Password for Milvus authentication. (for example, urn:k8s:secret:<secretname>:<key> ) | No | |
vectorDB.milvus.clientConfig.dbName | Database name for Milvus connection. | No | |
vectorDB.milvus.clientConfig.identifier | Client identifier for Milvus connection. | No | |
vectorDB.milvus.clientConfig.enableTLSAuth | Enable TLS authentication for Milvus connection. | No | false |
vectorDB.milvus.clientConfig.apiKey | API key for Milvus authentication. | No | |
vectorDB.milvus.clientConfig.serverVersion | Milvus server version for compatibility. | No | |
vectorDB.milvus.clientConfig.disableConn | Disable connection for testing purposes. | No | false |
vectorDB.milvus.clientConfig.retryRateLimit | Configure retry rate limiting options for Milvus client. | No | |
vectorDB.milvus.clientConfig.retryRateLimit.maxRetry | Maximum number of retries for rate limited requests. | No | |
vectorDB.milvus.clientConfig.retryRateLimit.maxBackoff | Maximum backoff duration for rate limited retries. | No | |
vectorDB.milvus.collectionName | Configures the collection name in Milvus. | Yes | |
vectorDB.milvus.maxDistance | Threshold for semantic similarity in Milvus. The lower the value, the more exact the match must be. | No | |
vectorDB.milvus.ttl | Configures the time to live for cached entries in Milvus (0 = forever). | No | 0 |
allowBypass | When true, the middleware looks at the client's Cache-Control header. If the request includes no-cache or no-store, Semantic Cache skips both read and write operations, forwards the call directly to the model, and returns X-Cache-Status: Bypass. Use this when you want callers to opt-out of caching on demand. | No | false |
contentTemplate | A Go template that determines how to extract text from the request. The default template is for chat completion. For REST APIs, you must specify a template matching your JSON structure. | No | {{ $last := "" }}{{ range .messages}}{{ $last = .content }}{{ end }}{{ $last }} |
readOnly | When true, the cache is not updated after a miss. Existing entries can still be retrieved. | No | false |
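The `allowBypass` behaviour from the table can be pictured as a small predicate on the client's `Cache-Control` header. This is a sketch of the semantics, not the gateway's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldBypass reports whether the middleware should skip both cache reads
// and writes for this request, per the allowBypass option: the client must
// send a no-cache or no-store directive, and bypassing must be enabled.
func shouldBypass(allowBypass bool, cacheControl string) bool {
	if !allowBypass {
		return false
	}
	for _, directive := range strings.Split(cacheControl, ",") {
		switch strings.TrimSpace(strings.ToLower(directive)) {
		case "no-cache", "no-store":
			return true // request is forwarded directly; X-Cache-Status: Bypass
		}
	}
	return false
}

func main() {
	fmt.Println(shouldBypass(true, "no-cache"))   // bypass honoured
	fmt.Println(shouldBypass(false, "no-cache"))  // bypass disabled server-side
	fmt.Println(shouldBypass(true, "max-age=60")) // no opt-out directive sent
}
```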
`chat-completion-semantic-cache`-specific configuration options
Field | Description | Required | Default |
---|---|---|---|
ignoreSystem | When true, the system messages are not considered for cache lookup. | No | false |
ignoreAssistant | When true, the assistant messages are not considered for cache lookup. | No | false |
ignoreTool | When true, the tool messages are not considered for cache lookup. | No | false |
messageHistory | Configures the number of messages to consider for cache lookup. | No | 0 |
readOnly
By default, the `readOnly` option is set to `false`, meaning that on a cache miss the middleware actively adds a new entry to the cache.

Setting `readOnly` to `true` is useful when you want to test or freeze a pre-populated vector database, preventing new requests from modifying its contents.

For example, in a production deployment, you can configure one route with `readOnly: false` to serve as an internal endpoint that actively warms up the cache: new entries are added whenever there is a cache miss. A second route with `readOnly: true` can then serve as the production endpoint, ensuring that only pre-validated entries are returned and protecting against cache poisoning. This separation of responsibilities helps maintain a robust and reliable caching layer.
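That warm-up/serve split can be sketched as two middlewares that differ only in the `readOnly` flag. The names below are hypothetical, and the shared vectorizer and vectorDB settings are elided:

```yaml
# Internal warm-up route: misses populate the cache.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-warm
spec:
  plugin:
    semantic-cache:
      readOnly: false
      # ... same vectorizer, vectorDB, and contentTemplate as below
---
# Production route: serves pre-validated entries only, never writes.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-prod
spec:
  plugin:
    semantic-cache:
      readOnly: true
      # ... same vectorizer, vectorDB, and contentTemplate as above
```

Both middlewares must point at the same collection so that entries written by the warm-up route are visible to the read-only route.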
contentTemplate
This field is a Go text template that receives the JSON body as input. By default, the middleware picks the last message's content from an array of messages. You can customize it to combine roles or multiple messages, for instance:

```yaml
contentTemplate: "{{ range .messages }}Role: {{ .role }} - Content: {{ .content }}\n{{ end }}"
```
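To see what the default template produces, you can run it through Go's `text/template` engine, which is what evaluates `contentTemplate`. The sample payload here is illustrative:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// extractContent applies the middleware's default contentTemplate to a JSON
// body: it walks .messages and keeps only the last message's content.
func extractContent(body []byte) string {
	const tmpl = `{{ $last := "" }}{{ range .messages }}{{ $last = .content }}{{ end }}{{ $last }}`
	var payload map[string]any
	if err := json.Unmarshal(body, &payload); err != nil {
		return ""
	}
	t := template.Must(template.New("content").Parse(tmpl))
	var out bytes.Buffer
	if err := t.Execute(&out, payload); err != nil {
		return ""
	}
	return out.String()
}

func main() {
	body := []byte(`{"messages":[
		{"role":"system","content":"You are helpful."},
		{"role":"user","content":"What is the capital of France?"}]}`)
	// Only the last message's content is extracted and embedded.
	fmt.Println(extractContent(body))
}
```

Here only `"What is the capital of France?"` is embedded and compared against the cache; the system prompt does not influence the lookup.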
vectorizer.clientConfig
All vectorizers accept an optional `clientConfig` block for custom HTTP settings:
```yaml
vectorizer:
  openai:
    model: text-embedding-3-large
    token: urn:k8s:secret:ai-keys:openai
  clientConfig:
    timeout: 3s
    proxyURL: http://squid.default.svc:3128
    insecureSkipVerify: true
```
Use this when you need a proxy, custom CA bundle, or tighter time-outs.
Vector Database Configuration
Each vector database has specific configuration options:
- Redis Configuration
- Weaviate Configuration
- Milvus Configuration
Redis Configuration: The Redis vector database configuration inherits from the standard Redis client configuration, supporting additional options like authentication, database selection, and connection pooling. Common Redis-specific fields include:
```yaml
vectorDB:
  redis:
    endpoints:
      - redis.default.svc.cluster.local:6379
    password: urn:k8s:secret:redis-auth:password
    collectionName: my_cache
    maxDistance: 0.6
    ttl: 3600
```
Weaviate Configuration: Supports authentication via API keys and custom HTTP client settings for enterprise deployments:
```yaml
vectorDB:
  weaviate:
    host: weaviate.default.svc.cluster.local:80
    scheme: https
    apiKey: urn:k8s:secret:weaviate-auth:api-key
    clientConfig:
      timeout: 10s
      insecureSkipVerify: false
    collectionName: cache_collection
    maxDistance: 0.4
```
Milvus Configuration: Provides client configuration with address, authentication, and connection options:
```yaml
vectorDB:
  milvus:
    clientConfig:
      address: http://milvus.cluster.local:19530
      username: admin
      password: urn:k8s:secret:milvus-auth:password
      dbName: semantic_cache_db
      enableTLSAuth: true
      apiKey: urn:k8s:secret:milvus-auth:api-key
      serverVersion: 2.3.0
      retryRateLimit:
        maxRetry: 3
        maxBackoff: 30s
    collectionName: semantic_cache
    maxDistance: 0.5
    ttl: 86400
```
Troubleshooting
Cache Status: Miss (Always)
- Verify your `contentTemplate` matches your JSON structure exactly; for REST APIs you must define one, since the default template targets chat completion payloads.
- Use `POST` or `PUT` methods with a request body; `GET` requests are not supported for caching.
- Confirm your service responds with `200 OK`, as only successful responses are cached.
Common Template Issues
- Missing quotes: use `contentTemplate: "{{ .field }}"`, not `contentTemplate: {{ .field }}`.
- Wrong field names: ensure the field name in your template matches your JSON exactly (case-sensitive).
- Nested fields: use dot notation for nested objects: `{{ .data.text }}`.
- Array access: for arrays, you may need `{{ index .items 0 }}` or to range over them.
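The nested-field and array-access patterns behave as follows in Go's `text/template` engine, which evaluates `contentTemplate`. The sample JSON and the `render` helper are illustrative:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// render parses a contentTemplate-style template and applies it to a JSON body.
func render(tmpl string, body []byte) string {
	var payload map[string]any
	if err := json.Unmarshal(body, &payload); err != nil {
		return ""
	}
	var out bytes.Buffer
	_ = template.Must(template.New("t").Parse(tmpl)).Execute(&out, payload)
	return out.String()
}

func main() {
	body := []byte(`{"data":{"text":"nested value"},"items":["first","second"]}`)

	// Nested fields use dot notation.
	fmt.Println(render("{{ .data.text }}", body))

	// Array elements need the index function.
	fmt.Println(render("{{ index .items 0 }}", body))
}
```

Field names are matched case-sensitively against the JSON keys, so `{{ .Data.text }}` would produce nothing for the body above.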
Cache Distance Too High
If `X-Cache-Distance` is always above your `maxDistance` threshold:

- Your prompts may be too different semantically.
- Try raising the `maxDistance` value (for example, from 0.3 to 0.5) so that looser matches qualify as hits.
- Consider whether your use case actually benefits from semantic similarity versus exact matching.
Related Content
- Read the Chat Completion documentation.
- Read the Content Guard documentation.