# Semantic Cache
The Semantic Cache middleware reduces LLM response times and API costs by avoiding redundant computations. It uses semantic similarity (not only text matching) to determine whether a request has been previously answered—and reuses the cached result when appropriate.
## Key Features and Benefits
- Faster Responses: Resolve repeated requests in milliseconds instead of waiting for LLM inference.
- Lower API Costs: Avoid paying for redundant token usage across identical or similar prompts.
- Semantic Matching: Works even when input phrasing changes, thanks to vector-based similarity.
- Safe Caching: `readOnly` mode lets you keep staging and production caches separate, preventing cache pollution.
## Requirements
- You must have AI Gateway enabled:

  ```bash
  helm upgrade traefik -n traefik --wait \
    --reuse-values \
    --set hub.experimental.aigateway=true \
    traefik/traefik
  ```

- You need a vectorizer that can produce text embeddings. We currently support:

  - OpenAI
  - Mistral
  - Ollama

  If your chosen vectorizer requires a token, you'll need to store its credentials in a Kubernetes Secret, in a similar manner as the LLM providers in the AIService.

- You need a vector database that stores and retrieves embeddings. We currently support:

  - Redis
  - Milvus
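For example, if you use OpenAI as the vectorizer, the API key can be stored in a Kubernetes Secret and referenced from the middleware through the `token.secretName` option described below. This is only a sketch: the Secret name (`openai-token`) and the data key (`token`) are illustrative placeholders, so check the AIService documentation for the exact key your installation expects.

```yaml
# Hypothetical Secret holding the vectorizer API key.
# Both the Secret name ("openai-token") and the data key ("token") are
# placeholders for this example.
apiVersion: v1
kind: Secret
metadata:
  name: openai-token
  namespace: traefik
type: Opaque
stringData:
  token: "<your-openai-api-key>"
```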
## How It Works
When an AI request arrives:
- Extract & Prepare Text: The middleware extracts text from the request body and formats it according to a content template. By default, it takes the last user message in a typical chat completion request.
- Compute Embeddings: The text is converted into a vector using a vectorizer (OpenAI, Ollama, or Mistral). The resulting vector represents the semantic meaning of the request text.
- Similarity Search in Vector Database: The middleware queries a vector database (Redis or Milvus) to see if there is a cached response with a sufficiently close vector (based on a similarity threshold, e.g., `maxDistance`).
- Cache Hit: If a similar vector is found, the cached answer is returned.
- Cache Miss: The request proceeds to the AI service, and the resulting response is stored as a new entry in the cache (unless `readOnly` is `true`).
- Response Headers: The user sees the following headers in the response:
  - `X-Cache-Status`: `Hit` or `Miss`.
  - `X-Cache-Score` (or "distance"): useful for tuning your similarity threshold.
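For illustration, two consecutive requests with semantically similar prompts might come back with headers like the following (the values are hypothetical, and the score scale depends on the vector database and distance metric in use):

```yaml
# First request - nothing similar is cached yet:
#   X-Cache-Status: Miss
#
# Second, semantically similar request - answered from the cache:
#   X-Cache-Status: Hit
#   X-Cache-Score: 0.12  # distance to the cached entry; compare against maxDistance when tuning
```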
## Configuration Example
Below is an example demonstrating how to set up the semantic-cache middleware alongside an AIService referencing Ollama, using Redis or Milvus as the vector database:
Middleware with Redis:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://host.docker.internal:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          host: redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
      readOnly: true
      contentTemplate: "{{ range .messages }}{{ .content }} {{ end }}"
```

Middleware with Milvus:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://host.docker.internal:11434
          model: nomic-embed-text
      vectorDB:
        milvus:
          host: host.docker.internal:19530
          collectionName: milvusv1
          maxDistance: 0.5
      readOnly: true
      contentTemplate: "{{ range .}}ROLE: {{.Role}} / Content: {{ .Content }}\n{{ end }}"
```

AIService:

```yaml
apiVersion: hub.traefik.io/v1alpha1
kind: AIService
metadata:
  name: ai-ollama
  namespace: traefik
spec:
  ollama:
    baseUrl: http://host.docker.internal:11434
    model: llama3.2
```

IngressRoute:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai
  namespace: traefik
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)
      middlewares:
        - name: semantic-cache
      services:
        - kind: TraefikService
          name: traefik-semantic-cache@ai-gateway-service
```
## Configuration Options
| Field | Description | Required | Default |
|---|---|---|---|
| `vectorizer` | Configures which embedding provider to use (`openai`, `ollama`, or `mistral`). | Yes | |
| `vectorizer.<embedding-provider>.baseUrl` | Configures the base URL of the embedding provider. | Yes | |
| `vectorizer.<embedding-provider>.model` | Configures the embedding model. | Yes | |
| `vectorizer.<embedding-provider>.token` | Configures the API token/key of the embedding provider. | No | |
| `vectorizer.<embedding-provider>.token.secretName` | Defines the name of the Kubernetes Secret used to store the embedding provider's API token/key. | No | |
| `vectorDB` | Configures which vector database to use (`redis` or `milvus`). | Yes | |
| `vectorDB.host` | Configures the host and port where the vector database is running. | Yes | |
| `vectorDB.collectionName` | Configures the collection name in the vector database. | Yes | |
| `vectorDB.maxDistance` | Threshold for semantic similarity. The lower the value, the more exact the match must be. | No | |
| `readOnly` | When `true`, the cache is not updated after a miss. Existing entries can still be retrieved. | No | `false` |
| `dim` | Configures the dimensionality of the embeddings; if zero, the vectorizer's default is used. | No | `0` |
| `contentTemplate` | A Go template that determines how to extract text from the request. | No | `{{ $last := "" }}{{ range .messages}}{{ $last = .content }}{{ end }}{{ $last }}` |
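As an illustration of the vectorizer token options, a middleware using OpenAI as the embedding provider might look like the sketch below. The base URL, embedding model, and Secret name are assumptions made for this example rather than values mandated by the gateway:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache-openai
spec:
  plugin:
    semantic-cache:
      vectorizer:
        openai:
          baseUrl: https://api.openai.com/v1   # assumed OpenAI API base URL
          model: text-embedding-3-small        # assumed embedding model
          token:
            secretName: openai-token           # Kubernetes Secret holding the API key
      vectorDB:
        redis:
          host: redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
      readOnly: false
```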
### readOnly

By default, the `readOnly` option is set to `false`, meaning that on a cache miss, the middleware actively adds a new entry to the cache.

Setting `readOnly` to `true` is useful when you want to test or freeze a pre-populated vector database, preventing new requests from modifying its contents.

For example, in a production deployment, you can configure one route with `readOnly: false` to serve as an internal endpoint that actively warms up the cache: new entries are added when there is a cache miss. In contrast, a second route with `readOnly: true` can serve as the production endpoint, ensuring that only pre-validated entries are returned and protecting against cache poisoning. This separation of responsibilities helps maintain a robust and reliable caching layer.
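A minimal sketch of that two-route setup is shown below, reusing the Redis configuration from the example above. The resource names and the internal hostname are illustrative only, and the TraefikService name is simply copied from the earlier example:

```yaml
# Warm-up middleware: on the internal route, cache misses add new entries.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache-warmup
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://host.docker.internal:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          host: redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
      readOnly: false
---
# Read-only middleware: the public route only serves pre-validated entries.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache-readonly
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://host.docker.internal:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          host: redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
      readOnly: true
---
# Internal warm-up route (for example, only reachable from inside the cluster).
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai-warmup
  namespace: traefik
spec:
  routes:
    - kind: Rule
      match: Host(`ai-internal.localhost`)
      middlewares:
        - name: semantic-cache-warmup
      services:
        - kind: TraefikService
          name: traefik-semantic-cache@ai-gateway-service
---
# Production route serving end users from the read-only cache.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai
  namespace: traefik
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)
      middlewares:
        - name: semantic-cache-readonly
      services:
        - kind: TraefikService
          name: traefik-semantic-cache@ai-gateway-service
```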
### contentTemplate

This field is a Go text template that receives the JSON body as input. By default, the middleware picks the last message's content from an array of messages. You can customize it to combine roles or multiple messages, for instance:

`"{{ range .messages }}Role: {{ .role }} - Content: {{ .content }}\n{{ end }}"`