Semantic Cache

The Semantic Cache middleware reduces LLM response times and API costs by avoiding redundant computation. It uses semantic similarity, not just exact text matching, to determine whether a request has already been answered, and reuses the cached result when appropriate.

There are two variants of the middleware:

| Variant | Best for | Key options |
| --- | --- | --- |
| semantic-cache | Any JSON payload / REST request | contentTemplate |
| chat-completion-semantic-cache | OpenAI-compatible chat completions | ignoreSystem, ignoreAssistant, ignoreTool, messageHistory |

Key Features and Benefits

  • Faster Responses: Resolve repeated requests in milliseconds instead of waiting for LLM inference.
  • Lower API Costs: Avoid paying for redundant token usage across identical or similar prompts.
  • Semantic Matching: Works even when input phrasing changes, thanks to vector-based similarity.
  • Safe Caching: readOnly mode enables staging/production separation to prevent cache pollution.
  • Multiple Vectorizers & Databases: Supports OpenAI, Ollama, Mistral, and other embedding providers, plus redis-stack, Milvus, and Weaviate vector databases.

Requirements

  • You must have AI Gateway enabled:

    helm install traefik traefik/traefik -n traefik --wait \
      --set hub.aigateway.enabled=true
  • You need a vectorizer that can produce text embeddings. We currently support:

    • OpenAI
    • Gemini
    • Ollama
    • Mistral
    • Azure OpenAI
    • Bedrock
    • Cohere
info

If your chosen vectorizer requires a token, you'll need to store the vectorizer's credentials in a Kubernetes Secret and reference them in the middleware configuration with a URN reference.
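For example, the OpenAI token referenced later on this page (urn:k8s:secret:ai-keys:openai-token) could be created like this; the Secret name, key, and traefik namespace are assumptions chosen to match the examples below:

kubectl create secret generic ai-keys \
  --namespace traefik \
  --from-literal=openai-token=sk-...   # replace with your real API key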

How It Works

When an AI request arrives, the semantic cache middleware processes it through the following steps:

Prerequisites for Caching:

  • Only requests with a request body are cached. GET requests cannot be cached since they typically don't have bodies to extract content from.
  • Only 200 OK responses are cached. Other status codes like 201 Created, 400 Bad Request, etc. are not cached.
  • The contentTemplate must successfully extract text from the request body for caching to occur.
  1. Extract & Prepare Text: The middleware extracts text from the request body and formats it according to a content template. By default, it takes the last user message of a typical chat completion request.

  2. Compute Embeddings: The text is converted into a vector using a vectorizer. The vector represents the semantic meaning of the request text.

  3. Similarity Search in Vector Database: The middleware queries a vector database (redis-stack, Weaviate, or Milvus) to see if there is a cached response with a sufficiently close vector (based on the maxDistance similarity threshold).

     • Cache Hit: If a similar vector is found, the cached answer is returned.
     • Cache Miss: The request proceeds to the AI service, and the resulting response is stored as a new entry in the cache (unless readOnly is true).

  4. Response Headers: The client sees the following headers in the response (see the example after this list):

     • X-Cache-Status: Hit or Miss.
     • X-Cache-Distance: the vector distance of the closest match, useful for tuning your similarity threshold.

  5. Separate Caches: Streaming ("stream": true) and non-streaming answers live in different buckets; a hit in one does not populate the other.
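For example, inspecting the headers of a repeated request (illustrative values; the host and path assume the ai.localhost IngressRoute configured later on this page):

curl -sS -D - -o /dev/null http://ai.localhost/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"What is Traefik?"}]}'

HTTP/1.1 200 OK
X-Cache-Status: Hit
X-Cache-Distance: 0.12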

Service Response Requirements

The semantic cache only stores responses with 200 OK status codes. If your API returns 201 Created, 202 Accepted, or other success codes, consider configuring your service to return 200 OK for cached endpoints.

Configuration Examples

Choose the plugin variant that matches your use case, then configure it with your preferred vector database.

Semantic Cache Plugin

Use the semantic-cache plugin for general REST APIs and custom JSON payloads. You must define a contentTemplate to extract text from your specific JSON structure.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434 # the protocol (http/https) must be declared
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
          ttl: 3600
      readOnly: false
      allowBypass: true # allow clients to bypass the cache with Cache-Control headers
      contentTemplate: '{{ .messages }}' # extract the 'messages' field from a JSON body like {"messages": "text"}
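With this contentTemplate, a request like the following (illustrative host and path) is embedded and cached; a semantically similar rephrasing can later hit the same entry:

curl http://ai.localhost/ask \
  -H 'Content-Type: application/json' \
  -d '{"messages": "What is the capital of France?"}'
# First call: X-Cache-Status: Miss, and the response is stored.
# A rephrased repeat ("France's capital city?") can return X-Cache-Status: Hit.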

Chat Completion + Semantic Cache

Use the chat-completion-semantic-cache plugin specifically for OpenAI-compatible chat completion endpoints. It has built-in understanding of the messages array and offers chat-specific filtering options.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-chat
  namespace: traefik
spec:
  plugin:
    chat-completion-semantic-cache:
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: chat_cache
          maxDistance: 0.4
          ttl: 1800
      # Chat-specific options
      ignoreAssistant: true
      messageHistory: 4
      readOnly: false
      allowBypass: true
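To illustrate the chat-specific options, consider a payload like the following (a hypothetical conversation). With ignoreAssistant: true, the assistant turn is excluded from the cache-lookup text, and messageHistory: 4 caps the number of messages considered:

{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a support assistant."},
    {"role": "user", "content": "Summarize the refund policy."},
    {"role": "assistant", "content": "Refunds are available within 30 days..."},
    {"role": "user", "content": "And what about exchanges?"}
  ]
}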

IngressRoute Configuration

Both plugins work with the same IngressRoute configuration:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai
  namespace: traefik
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)
      middlewares:
        - name: semantic-cache # or semantic-cache-chat
      services:
        - name: chatgpt-external # ExternalName service pointing to api.openai.com
          port: 443
          scheme: https
          passHostHeader: false

Vectorizer ≠ Service

vectorizer.* tells Semantic Cache where to fetch embeddings for similarity checks. services: in the IngressRoute tells Traefik Hub where to send the actual request for inference.

They can point to the same provider (all-OpenAI, all-Ollama) or to different ones (local embeddings + cloud LLM). Pick whatever best balances cost, performance, and availability for your use case.
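For example, you can compute embeddings locally while forwarding inference to a cloud LLM. A sketch combining fragments of the examples above (the service and host names are assumptions):

# Embeddings come from a local Ollama instance...
vectorizer:
  ollama:
    baseUrl: http://ollama.default.svc.cluster.local:11434
    model: nomic-embed-text
# ...while the IngressRoute's services entry (chatgpt-external above)
# still sends cache misses to api.openai.com for inference.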

Weaviate collection names

The AI gateway upper-cases the first letter of the collection name internally (chat_cache → Chat_cache). Submit names in any case; they are normalized automatically to Weaviate's expected format.

Configuration Options

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| vectorizer | Configures which embedding provider to use (OpenAI, Azure OpenAI, Cohere, Gemini, Mistral, Ollama, Bedrock). | Yes | |
| vectorizer.<embedding-provider>.baseURL | Configures the base URL of the embedding provider. | No | |
| vectorizer.<embedding-provider>.model | Configures the embedding model. | Yes | |
| vectorizer.<embedding-provider>.token | URN of a Kubernetes Secret that holds the embedding provider's API key (for example, urn:k8s:secret:<secretname>:<key>). | No | |
| vectorizer.<embedding-provider>.scheme | Configures the scheme for the embedding provider. | No | |
| vectorizer.clientConfig | Configures the HTTP client settings for the embedding provider. | No | |
| vectorizer.clientConfig.timeout | Configures the timeout for the HTTP client. | No | 3s |
| vectorizer.clientConfig.proxyURL | Configures the proxy URL for the HTTP client. | No | |
| vectorizer.clientConfig.insecureSkipVerify | Disables TLS certificate verification for the HTTP client. | No | false |
| vectorDB | Configures which vector database to use (redis-stack, weaviate, or milvus). | Yes | |
| vectorDB.redis.endpoints | Configures the Redis host and port (for example, redis.default.svc.cluster.local:6379). | Yes | |
| vectorDB.redis.collectionName | Configures the collection name in Redis. | Yes | |
| vectorDB.redis.maxDistance | Threshold for semantic similarity in Redis. The lower the value, the more exact the match must be. | No | |
| vectorDB.redis.ttl | Configures the time to live for cached entries in Redis (0 = forever). | No | 0 |
| vectorDB.redis.* | Redis configuration inherits from the Redis client configuration and supports additional Redis-specific options. | No | |
| vectorDB.weaviate.host | Configures the Weaviate host and port (for example, weaviate.default.svc.cluster.local:80). | Yes | |
| vectorDB.weaviate.collectionName | Configures the collection name in Weaviate. | Yes | |
| vectorDB.weaviate.maxDistance | Threshold for semantic similarity in Weaviate. The lower the value, the more exact the match must be. | No | |
| vectorDB.weaviate.ttl | Configures the time to live for cached entries in Weaviate (0 = forever). | No | 0 |
| vectorDB.weaviate.scheme | Configures the scheme for Weaviate (http or https). | Yes | |
| vectorDB.weaviate.apiKey | URN of a Kubernetes Secret that holds the Weaviate API key (for example, urn:k8s:secret:<secretname>:<key>). | No | |
| vectorDB.weaviate.clientConfig | Configures the HTTP client settings for Weaviate. | No | |
| vectorDB.milvus.clientConfig | Configures the Milvus client settings, including address, authentication, and connection options. | Yes | |
| vectorDB.milvus.clientConfig.address | The Milvus server address (for example, http://milvus.default.svc.cluster.local:19530). | Yes | |
| vectorDB.milvus.clientConfig.username | Username for Milvus authentication. | No | |
| vectorDB.milvus.clientConfig.password | Password for Milvus authentication (for example, urn:k8s:secret:<secretname>:<key>). | No | |
| vectorDB.milvus.clientConfig.dbName | Database name for the Milvus connection. | No | |
| vectorDB.milvus.clientConfig.identifier | Client identifier for the Milvus connection. | No | |
| vectorDB.milvus.clientConfig.enableTLSAuth | Enables TLS authentication for the Milvus connection. | No | false |
| vectorDB.milvus.clientConfig.apiKey | API key for Milvus authentication. | No | |
| vectorDB.milvus.clientConfig.serverVersion | Milvus server version, for compatibility. | No | |
| vectorDB.milvus.clientConfig.disableConn | Disables the connection, for testing purposes. | No | false |
| vectorDB.milvus.clientConfig.retryRateLimit | Configures retry rate-limiting options for the Milvus client. | No | |
| vectorDB.milvus.clientConfig.retryRateLimit.maxRetry | Maximum number of retries for rate-limited requests. | No | |
| vectorDB.milvus.clientConfig.retryRateLimit.maxBackoff | Maximum backoff duration for rate-limited retries. | No | |
| vectorDB.milvus.collectionName | Configures the collection name in Milvus. | Yes | |
| vectorDB.milvus.maxDistance | Threshold for semantic similarity in Milvus. The lower the value, the more exact the match must be. | No | |
| vectorDB.milvus.ttl | Configures the time to live for cached entries in Milvus (0 = forever). | No | 0 |
| allowBypass | When true, the middleware honors the client's Cache-Control header. If the request includes no-cache or no-store, Semantic Cache skips both read and write operations, forwards the call directly to the model, and returns X-Cache-Status: Bypass. Use this when you want callers to opt out of caching on demand. | No | false |
| contentTemplate | A Go template that determines how to extract text from the request. The default template targets chat completions; for REST APIs, you must specify a template matching your JSON structure. | No | {{ $last := "" }}{{ range .messages }}{{ $last = .content }}{{ end }}{{ $last }} |
| readOnly | When true, the cache is not updated after a miss. Existing entries can still be retrieved. | No | false |
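For example, with allowBypass: true a client can opt out of caching per request (illustrative host and path):

curl http://ai.localhost/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Cache-Control: no-cache' \
  -d '{"messages":[{"role":"user","content":"Fresh answer, please"}]}'
# The middleware returns X-Cache-Status: Bypass and neither reads from
# nor writes to the cache for this request.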

chat-completion-semantic-cache specific configuration options

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| ignoreSystem | When true, system messages are not considered for cache lookup. | No | false |
| ignoreAssistant | When true, assistant messages are not considered for cache lookup. | No | false |
| ignoreTool | When true, tool messages are not considered for cache lookup. | No | false |
| messageHistory | Configures the number of messages to consider for cache lookup. | No | 0 |

readOnly

By default, the readOnly option is set to false, meaning that on a cache miss, the middleware actively adds a new entry to the cache. Setting readOnly to true is useful when you want to test or freeze a pre-populated vector database, preventing new requests from modifying its contents.

For example, in a production deployment, you can configure one route with readOnly: false to serve as an internal endpoint that actively warms up the cache—new entries are added when there is a cache miss. In contrast, a second route with readOnly: true can serve as the production endpoint, ensuring that only pre-validated entries are returned and protecting against cache poisoning. This separation of responsibilities helps maintain a robust and reliable caching layer.
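A minimal sketch of that two-route pattern, assuming the middleware names (cache-warmer, cache-serve) and reusing the Ollama/Redis settings from the example above:

# Internal route middleware: actively warms the cache on every miss.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: cache-warmer
  namespace: traefik
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
      readOnly: false
---
# Production route middleware: reads the same collection but never writes to it.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: cache-serve
  namespace: traefik
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
      readOnly: true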

contentTemplate

This field is a Go text template that receives the JSON body as input. By default, the middleware picks the last message's content from an array of messages. You can customize it to combine roles or multiple messages, for instance:

"{{ range .messages }}Role: {{ .role }} - Content: {{ .content }}\n{{ end }}"

vectorizer.clientConfig

All vectorizers accept an optional clientConfig block for custom HTTP settings:

vectorizer:
  openai:
    model: text-embedding-3-large
    token: urn:k8s:secret:ai-keys:openai
    clientConfig:
      timeout: 3s                               # fail fast if the provider is slow
      proxyURL: http://squid.default.svc:3128   # route embedding calls through a proxy
      insecureSkipVerify: true                  # skip TLS verification (use with care)

Use this when you need a proxy, a custom CA bundle, or tighter timeouts.

Vector Database Configuration

Each vector database has specific configuration options:

Redis Configuration: The Redis vector database configuration inherits from the standard Redis client configuration, supporting additional options like authentication, database selection, and connection pooling. Common Redis-specific fields include:

vectorDB:
  redis:
    endpoints:
      - redis.default.svc.cluster.local:6379
    password: urn:k8s:secret:redis-auth:password
    collectionName: my_cache
    maxDistance: 0.6
    ttl: 3600
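A Milvus configuration follows the same shape; here is a sketch built from the clientConfig fields documented in the table above (the host name and Secret are assumptions):

vectorDB:
  milvus:
    clientConfig:
      address: http://milvus.default.svc.cluster.local:19530
      username: milvus-user
      password: urn:k8s:secret:milvus-auth:password
    collectionName: my_cache
    maxDistance: 0.6
    ttl: 3600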

Troubleshooting

Cache Status: Miss (Always)

  • Verify your contentTemplate matches your JSON structure exactly; field names are case-sensitive.
  • Use POST or PUT methods with a request body; GET requests are not supported for caching.
  • Confirm your service responds with 200 OK, as only successful responses are cached.

Common Template Issues

  • Missing quotes: Use contentTemplate: "{{ .field }}" not contentTemplate: {{ .field }}
  • Wrong field names: Ensure the field name in your template matches your JSON exactly (case-sensitive)
  • Nested fields: Use dot notation for nested objects: {{ .data.text }}
  • Array access: For arrays, you may need {{ index .items 0 }} or range over them
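For instance, given a hypothetical nested body {"data": {"text": "hello"}, "items": [{"content": "first"}]}, the matching templates would be:

contentTemplate: "{{ .data.text }}"                 # nested field via dot notation
contentTemplate: "{{ (index .items 0).content }}"   # first element of an array
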
Cache Distance Too High

If X-Cache-Distance is always above your maxDistance threshold:

  • Your prompts may simply be too different semantically.
  • Try raising the maxDistance value (for example, from 0.3 to 0.5) so that looser matches qualify; lower it again if you start seeing bad hits.
  • Consider whether your use case actually benefits from semantic similarity, or whether exact matching would serve it better.