Semantic Cache

The Semantic Cache middleware reduces LLM response times and API costs by avoiding redundant computations. It uses semantic similarity (not only text matching) to determine whether a request has been previously answered—and reuses the cached result when appropriate.

Key Features and Benefits

  • Faster Responses: Resolve repeated requests in milliseconds instead of waiting for LLM inference.
  • Lower API Costs: Avoid paying for redundant token usage across identical or similar prompts.
  • Semantic Matching: Works even when input phrasing changes, thanks to vector-based similarity.
  • Safe Caching: readOnly mode lets you separate staging and production routes, preventing cache pollution.

Requirements

  • You must have AI Gateway enabled:

    helm upgrade traefik -n traefik --wait \
    --reuse-values \
    --set hub.experimental.aigateway=true \
    traefik/traefik
  • You need a vectorizer that can produce text embeddings. We currently support:

    • OpenAI
    • Mistral
    • Ollama
info

If your chosen vectorizer requires a token, you'll need to store the vectorizer credentials in a Kubernetes Secret, in the same way as the LLM provider credentials in the AIService (see the example Secret after this list).

  • You need a vector database to store and retrieve embeddings. We currently support:

    • Redis
    • Milvus
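
As an illustration, such a Secret might look like the following minimal sketch (the Secret name, key, and value are hypothetical; match them to what your middleware configuration references):

apiVersion: v1
kind: Secret
metadata:
  name: openai-token   # hypothetical name, referenced by vectorizer.<embedding-provider>.token.secretName
  namespace: traefik
type: Opaque
stringData:
  token: "sk-..."      # the embedding provider's API key; the key name here is an assumption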

How It Works

When an AI request arrives:

  1. Extract & Prepare Text: The middleware extracts text from the request body and formats it according to a content template. By default, it takes the last user message in a typical chat completion request.

  2. Compute Embeddings: The text is converted into a vector using the configured vectorizer (OpenAI, Ollama, or Mistral). This vector captures the semantic meaning of the request text.

  3. Similarity Search in Vector Database: The middleware queries a vector database (Redis or Milvus) to see if there is a cached response with a sufficiently close vector (based on a similarity threshold, e.g., maxDistance).

  • Cache Hit: If a similar vector is found, the cached answer is returned.
  • Cache Miss: The request proceeds to the AI service, and the resulting response is stored as a new entry in the cache (unless readOnly is true).
  4. Response Headers: The user sees the following headers in the response:
    • X-Cache-Status: Hit or Miss.
    • X-Cache-Score (or "distance"): useful for tuning your similarity threshold.
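
For example, a response served from the cache might carry headers like these (values are illustrative; the exact score format depends on the configured vector database):

X-Cache-Status: Hit
X-Cache-Score: 0.12   # distance to the matched entry; at most maxDistance on a hit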

Configuration Example

Below is an example that sets up the semantic-cache middleware with Ollama as the vectorizer and Redis as the vector database; OpenAI/Mistral and Milvus are configured analogously:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://host.docker.internal:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          host: redis.default.svc.cluster.local:6379
        collectionName: demo_doc
        maxDistance: 0.6
      readOnly: true
      contentTemplate: "{{ range .messages }}{{ .content }} {{ end }}"
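
For the middleware to take effect, it must be attached to the route serving your AI traffic. A minimal sketch, assuming a hypothetical IngressRoute fronting an AIService named ai-service (the match rule and names are placeholders; adjust to your setup):

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai-gateway               # hypothetical
  namespace: traefik
spec:
  routes:
    - match: PathPrefix(`/ai`)   # hypothetical match rule
      kind: Rule
      middlewares:
        - name: semantic-cache   # the Middleware defined above
      services:
        - name: ai-service       # hypothetical AIService, referenced as a TraefikService
          kind: TraefikService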

Configuration Options

| Field | Description | Required | Default |
|-------|-------------|----------|---------|
| vectorizer | Configures which embedding provider to use (openai, ollama, or mistral). | Yes | |
| vectorizer.<embedding-provider>.baseURL | Configures the base URL of the embedding provider. | Yes | |
| vectorizer.<embedding-provider>.model | Configures the embedding model. | Yes | |
| vectorizer.<embedding-provider>.token | Configures the API token/key of the embedding provider. | No | |
| vectorizer.<embedding-provider>.token.secretName | Defines the name of the Kubernetes Secret used to store the embedding provider's API token/key. | No | |
| vectorDB | Configures which vector database to use (redis or milvus). | Yes | |
| vectorDB.host | Configures the host and port where the vector database is running. | Yes | |
| vectorDB.collectionName | Configures the collection name in the vector database. | Yes | |
| vectorDB.maxDistance | Threshold for semantic similarity. The lower the value, the more exact the match must be. | No | |
| readOnly | When true, the cache is not updated after a miss. Existing entries can still be retrieved. | No | false |
| dim | Configures the dimensionality of the embeddings; if zero, uses the vectorizer's default. | No | 0 |
| contentTemplate | A Go template that determines how to extract text from the request. | No | {{ $last := "" }}{{ range .messages }}{{ $last = .content }}{{ end }}{{ $last }} |
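
For example, a vectorizer that requires credentials can reference the Kubernetes Secret from the Requirements section (a sketch; the endpoint, model, and Secret name are hypothetical):

vectorizer:
  openai:
    baseUrl: https://api.openai.com/v1   # assumption: an OpenAI-compatible endpoint
    model: text-embedding-3-small        # hypothetical embedding model
    token:
      secretName: openai-token           # Secret that stores the API key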

readOnly

By default, the readOnly option is set to false, meaning that on a cache miss, the middleware actively adds a new entry to the cache. Setting readOnly to true is useful when you want to test or freeze a pre-populated vector database, preventing new requests from modifying its contents.

For example, in a production deployment, you can configure one route with readOnly: false to serve as an internal endpoint that actively warms up the cache—new entries are added when there is a cache miss. In contrast, a second route with readOnly: true can serve as the production endpoint, ensuring that only pre-validated entries are returned and protecting against cache poisoning. This separation of responsibilities helps maintain a robust and reliable caching layer.
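
A sketch of that split, using two Middleware resources that share the same vectorizer and vector database settings (names are hypothetical):

# Internal warm-up route: cache misses create new entries.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-writer   # hypothetical
  namespace: traefik
spec:
  plugin:
    semantic-cache:
      # ...same vectorizer/vectorDB settings as in the example above...
      readOnly: false
---
# Production route: serves only pre-validated entries.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-reader   # hypothetical
  namespace: traefik
spec:
  plugin:
    semantic-cache:
      # ...same vectorizer/vectorDB settings as in the example above...
      readOnly: true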

contentTemplate

This field is a Go text template that receives the JSON body as input. By default, the middleware picks the last message’s content from an array of messages. You can customize it to combine roles or multiple messages, for instance:

"{{ range .messages }}Role: {{ .role }} - Content: {{ .content }}\n{{ end }}"