Semantic Cache

The Semantic Cache middleware reduces LLM response times and API costs by avoiding redundant computation. It uses semantic similarity, not just exact text matching, to determine whether a request has already been answered, and reuses the cached result when appropriate.

There are two variants of the middleware:

| Variant | Best for | Key options |
| --- | --- | --- |
| semantic-cache | Any JSON payload / REST request | contentTemplate |
| chat-completion-semantic-cache | OpenAI-compatible chat completions | ignoreSystem, ignoreAssistant, ignoreTool, messageHistory |

Key Features and Benefits

  • Faster Responses: Resolve repeated requests in milliseconds instead of waiting for LLM inference.
  • Lower API Costs: Avoid paying for redundant token usage across identical or similar prompts.
  • Semantic Matching: Works even when input phrasing changes, thanks to vector-based similarity.
  • Safe Caching: readOnly mode enables staging/production separation to prevent cache pollution.
  • Multiple Vectorizers & Databases: Supports OpenAI, Ollama, Mistral, and other embedding providers, plus redis-stack, Milvus, and Weaviate vector databases.

Requirements

  • You must have AI Gateway enabled:

    helm install traefik traefik/traefik -n traefik --wait \
      --set hub.aigateway.enabled=true
  • You need a vectorizer that can produce text embeddings. We currently support:

    • OpenAI
    • Gemini
    • Ollama
    • Mistral
    • Azure OpenAI
    • Bedrock
    • Cohere
info

If your chosen vectorizer requires a token, you'll need to store the vectorizer's credentials in a Kubernetes Secret and reference them in the middleware configuration with a URN reference.
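For example, the OpenAI token referenced later on this page (urn:k8s:secret:ai-keys:openai-token) could be created like this; the Secret name, key, and traefik namespace are assumptions chosen to match the examples below:

kubectl create secret generic ai-keys \
  --namespace traefik \
  --from-literal=openai-token=sk-...   # replace with your real API key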

How It Works

When an AI request arrives, the semantic cache middleware processes it through the following steps:

Prerequisites for Caching:

  • Only requests with a request body are cached. GET requests cannot be cached since they typically don't have bodies to extract content from.
  • Only 200 OK responses are cached. Other status codes like 201 Created, 400 Bad Request, etc. are not cached.
  • The contentTemplate must successfully extract text from the request body for caching to occur.
  1. Extract & Prepare Text: The middleware extracts text from the request body and formats it according to a content template. By default, it takes the last user message of a typical chat completion request.

  2. Compute Embeddings: The text is converted into a vector using a vectorizer. The vector represents the semantic meaning of the request text.

  3. Similarity Search in Vector Database: The middleware queries a vector database (redis-stack, Weaviate, or Milvus) to see if there is a cached response with a sufficiently close vector (based on the maxDistance similarity threshold).

     • Cache Hit: If a similar vector is found, the cached answer is returned.
     • Cache Miss: The request proceeds to the AI service, and the resulting response is stored as a new entry in the cache (unless readOnly is true).

  4. Response Headers: The client sees the following headers in the response (see the example after this list):

     • X-Cache-Status: Hit or Miss.
     • X-Cache-Distance: the vector distance of the closest match, useful for tuning your similarity threshold.

  5. Separate Caches: Streaming ("stream": true) and non-streaming answers live in different buckets; a hit in one does not populate the other.
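For example, inspecting the headers of a repeated request (illustrative values; the host and path assume the ai.localhost IngressRoute configured later on this page):

curl -sS -D - -o /dev/null http://ai.localhost/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"What is Traefik?"}]}'

HTTP/1.1 200 OK
X-Cache-Status: Hit
X-Cache-Distance: 0.12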

Service Response Requirements

The semantic cache only stores responses with 200 OK status codes. If your API returns 201 Created, 202 Accepted, or other success codes, consider configuring your service to return 200 OK for cached endpoints.

Configuration Examples

Choose the plugin variant that matches your use case, then configure it with your preferred vector database.

Semantic Cache Plugin

Use the semantic-cache plugin for general REST APIs and custom JSON payloads. You must define a contentTemplate to extract text from your specific JSON structure.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  namespace: traefik
  name: semantic-cache
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434 # the protocol (http/https) must be declared
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
          maxDistance: 0.6
          ttl: 3600
      readOnly: false
      allowBypass: true # allow clients to bypass the cache with Cache-Control headers
      contentTemplate: '{{ .messages }}' # extract the 'messages' field from a JSON body like {"messages": "text"}
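With this contentTemplate, a request like the following (illustrative host and path) is embedded and cached; a semantically similar rephrasing can later hit the same entry:

curl http://ai.localhost/ask \
  -H 'Content-Type: application/json' \
  -d '{"messages": "What is the capital of France?"}'
# First call: X-Cache-Status: Miss, and the response is stored.
# A rephrased repeat ("France's capital city?") can return X-Cache-Status: Hit.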

Chat Completion + Semantic Cache

Use the chat-completion-semantic-cache plugin specifically for OpenAI-compatible chat completion endpoints. It has built-in understanding of the messages array and offers chat-specific filtering options.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: semantic-cache-chat
  namespace: traefik
spec:
  plugin:
    chat-completion-semantic-cache:
      vectorizer:
        openai:
          model: text-embedding-3-small
          token: urn:k8s:secret:ai-keys:openai-token
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: chat_cache
          maxDistance: 0.4
          ttl: 1800
      # Chat-specific options
      ignoreAssistant: true
      messageHistory: 4
      readOnly: false
      allowBypass: true
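To illustrate the chat-specific options, consider a payload like the following (a hypothetical conversation). With ignoreAssistant: true, the assistant turn is excluded from the cache-lookup text, and messageHistory: 4 caps the number of messages considered:

{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a support assistant."},
    {"role": "user", "content": "Summarize the refund policy."},
    {"role": "assistant", "content": "Refunds are available within 30 days..."},
    {"role": "user", "content": "And what about exchanges?"}
  ]
}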

IngressRoute Configuration

Both plugins work with the same IngressRoute configuration:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ai
  namespace: traefik
spec:
  routes:
    - kind: Rule
      match: Host(`ai.localhost`)
      middlewares:
        - name: semantic-cache # or semantic-cache-chat
      services:
        - name: chatgpt-external # ExternalName service pointing to api.openai.com
          port: 443
          scheme: https
          passHostHeader: false

Vectorizer ≠ Service

vectorizer.* tells Semantic Cache where to fetch embeddings for similarity checks. services: in the IngressRoute tells Traefik Hub where to send the actual request for inference.

They can point to the same provider (all-OpenAI, all-Ollama) or to different ones (local embeddings + cloud LLM). Pick whatever best balances cost, performance, and availability for your use case.
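For example, you can compute embeddings locally while forwarding inference to a cloud LLM. A sketch combining fragments of the examples above (the service and host names are assumptions):

# Embeddings come from a local Ollama instance...
vectorizer:
  ollama:
    baseUrl: http://ollama.default.svc.cluster.local:11434
    model: nomic-embed-text
# ...while the IngressRoute's services entry (chatgpt-external above)
# still sends cache misses to api.openai.com for inference.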

Weaviate collection names

The AI gateway upper-cases the first letter of the collection name internally (chat_cache → Chat_cache). Submit names in any case; they are normalized automatically to Weaviate's expected format.

Configuration Options

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| vectorizer | Configures which embedding provider to use (OpenAI, Azure OpenAI, Cohere, Gemini, Mistral, Ollama, Bedrock). | Yes | |
| vectorizer.<embedding-provider>.baseURL | Configures the base URL of the embedding provider. | No | |
| vectorizer.<embedding-provider>.model | Configures the embedding model. | Yes | |
| vectorizer.<embedding-provider>.token | URN of a Kubernetes Secret that holds the embedding provider's API key (for example, urn:k8s:secret:<secretname>:<key>). | No | |
| vectorizer.<embedding-provider>.scheme | Configures the scheme for the embedding provider. | No | |
| vectorizer.clientConfig | Configures the HTTP client settings for the embedding provider. | No | |
| vectorizer.clientConfig.timeout | Configures the timeout for the HTTP client. | No | 3s |
| vectorizer.clientConfig.proxyURL | Configures the proxy URL for the HTTP client. | No | |
| vectorizer.clientConfig.insecureSkipVerify | Disables TLS certificate verification for the HTTP client. | No | false |
| vectorDB | Configures which vector database to use (redis-stack, weaviate, or milvus). | Yes | |
| vectorDB.redis.endpoints | Configures the Redis host and port (for example, redis.default.svc.cluster.local:6379). | Yes | |
| vectorDB.redis.collectionName | Configures the collection name in Redis. | Yes | |
| vectorDB.redis.maxDistance | Threshold for semantic similarity in Redis. The lower the value, the more exact the match must be. | No | |
| vectorDB.redis.ttl | Configures the time to live for cached entries in Redis (0 = forever). | No | 0 |
| vectorDB.redis.* | Redis configuration inherits from the Redis client configuration and supports additional Redis-specific options. | No | |
| vectorDB.weaviate.host | Configures the Weaviate host and port (for example, weaviate.default.svc.cluster.local:80). | Yes | |
| vectorDB.weaviate.collectionName | Configures the collection name in Weaviate. | Yes | |
| vectorDB.weaviate.maxDistance | Threshold for semantic similarity in Weaviate. The lower the value, the more exact the match must be. | No | |
| vectorDB.weaviate.ttl | Configures the time to live for cached entries in Weaviate (0 = forever). | No | 0 |
| vectorDB.weaviate.scheme | Configures the scheme for Weaviate (http or https). | Yes | |
| vectorDB.weaviate.apiKey | URN of a Kubernetes Secret that holds the Weaviate API key (for example, urn:k8s:secret:<secretname>:<key>). | No | |
| vectorDB.weaviate.clientConfig | Configures the HTTP client settings for Weaviate. | No | |
| vectorDB.milvus.clientConfig | Configures the Milvus client settings, including address, authentication, and connection options. | Yes | |
| vectorDB.milvus.clientConfig.address | The Milvus server address (for example, http://milvus.default.svc.cluster.local:19530). | Yes | |
| vectorDB.milvus.clientConfig.username | Username for Milvus authentication. | No | |
| vectorDB.milvus.clientConfig.password | Password for Milvus authentication (for example, urn:k8s:secret:<secretname>:<key>). | No | |
| vectorDB.milvus.clientConfig.dbName | Database name for the Milvus connection. | No | |
| vectorDB.milvus.clientConfig.identifier | Client identifier for the Milvus connection. | No | |
| vectorDB.milvus.clientConfig.enableTLSAuth | Enables TLS authentication for the Milvus connection. | No | false |
| vectorDB.milvus.clientConfig.apiKey | API key for Milvus authentication. | No | |
| vectorDB.milvus.clientConfig.serverVersion | Milvus server version, for compatibility. | No | |
| vectorDB.milvus.clientConfig.disableConn | Disables the connection, for testing purposes. | No | false |
| vectorDB.milvus.clientConfig.retryRateLimit | Configures retry rate-limiting options for the Milvus client. | No | |
| vectorDB.milvus.clientConfig.retryRateLimit.maxRetry | Maximum number of retries for rate-limited requests. | No | |
| vectorDB.milvus.clientConfig.retryRateLimit.maxBackoff | Maximum backoff duration for rate-limited retries. | No | |
| vectorDB.milvus.collectionName | Configures the collection name in Milvus. | Yes | |
| vectorDB.milvus.maxDistance | Threshold for semantic similarity in Milvus. The lower the value, the more exact the match must be. | No | |
| vectorDB.milvus.ttl | Configures the time to live for cached entries in Milvus (0 = forever). | No | 0 |
| allowBypass | When true, the middleware honors the client's Cache-Control header. If the request includes no-cache or no-store, Semantic Cache skips both read and write operations, forwards the call directly to the model, and returns X-Cache-Status: Bypass. Use this when you want callers to opt out of caching on demand. | No | false |
| contentTemplate | A Go template that determines how to extract text from the request. The default template targets chat completions; for REST APIs, you must specify a template matching your JSON structure. | No | {{ $last := "" }}{{ range .messages }}{{ $last = .content }}{{ end }}{{ $last }} |
| readOnly | When true, the cache is not updated after a miss. Existing entries can still be retrieved. | No | false |
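For example, with allowBypass: true a client can opt out of caching per request (illustrative host and path):

curl http://ai.localhost/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Cache-Control: no-cache' \
  -d '{"messages":[{"role":"user","content":"Fresh answer, please"}]}'
# The middleware returns X-Cache-Status: Bypass and neither reads from
# nor writes to the cache for this request.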

chat-completion-semantic-cache specific configuration options

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| ignoreSystem | When true, system messages are not considered for cache lookup. | No | false |
| ignoreAssistant | When true, assistant messages are not considered for cache lookup. | No | false |
| ignoreTool | When true, tool messages are not considered for cache lookup. | No | false |
| messageHistory | Configures the number of messages to consider for cache lookup. | No | 0 |

readOnly

By default, the readOnly option is set to false, meaning that on a cache miss, the middleware actively adds a new entry to the cache. Setting readOnly to true is useful when you want to test or freeze a pre-populated vector database, preventing new requests from modifying its contents.

For example, in a production deployment, you can configure one route with readOnly: false to serve as an internal endpoint that actively warms up the cache—new entries are added when there is a cache miss. In contrast, a second route with readOnly: true can serve as the production endpoint, ensuring that only pre-validated entries are returned and protecting against cache poisoning. This separation of responsibilities helps maintain a robust and reliable caching layer.
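A minimal sketch of that two-route pattern, assuming the middleware names (cache-warmer, cache-serve) and reusing the Ollama/Redis settings from the example above:

# Internal route middleware: actively warms the cache on every miss.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: cache-warmer
  namespace: traefik
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
      readOnly: false
---
# Production route middleware: reads the same collection but never writes to it.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: cache-serve
  namespace: traefik
spec:
  plugin:
    semantic-cache:
      vectorizer:
        ollama:
          baseUrl: http://ollama.default.svc.cluster.local:11434
          model: nomic-embed-text
      vectorDB:
        redis:
          endpoints:
            - redis.default.svc.cluster.local:6379
          collectionName: demo_doc
      readOnly: true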

contentTemplate

This field is a Go text template that receives the JSON body as input. By default, the middleware picks the last message's content from an array of messages. You can customize it to combine roles or multiple messages, for instance:

"{{ range .messages }}Role: {{ .role }} - Content: {{ .content }}\n{{ end }}"

vectorizer.clientConfig

All vectorizers accept an optional clientConfig block for custom HTTP settings:

vectorizer:
  openai:
    model: text-embedding-3-large
    token: urn:k8s:secret:ai-keys:openai
    clientConfig:
      timeout: 3s                               # fail fast if the provider is slow
      proxyURL: http://squid.default.svc:3128   # route embedding calls through a proxy
      insecureSkipVerify: true                  # skip TLS verification (use with care)

Use this when you need a proxy, a custom CA bundle, or tighter timeouts.

Vector Database Configuration

Each vector database has specific configuration options:

Redis Configuration: The Redis vector database configuration inherits from the standard Redis client configuration, supporting additional options like authentication, database selection, and connection pooling. Common Redis-specific fields include:

vectorDB:
  redis:
    endpoints:
      - redis.default.svc.cluster.local:6379
    password: urn:k8s:secret:redis-auth:password
    collectionName: my_cache
    maxDistance: 0.6
    ttl: 3600
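A Milvus configuration follows the same shape; here is a sketch built from the clientConfig fields documented in the table above (the host name and Secret are assumptions):

vectorDB:
  milvus:
    clientConfig:
      address: http://milvus.default.svc.cluster.local:19530
      username: milvus-user
      password: urn:k8s:secret:milvus-auth:password
    collectionName: my_cache
    maxDistance: 0.6
    ttl: 3600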

Troubleshooting

Cache Status: Miss (Always)

  • Verify your contentTemplate matches your JSON structure exactly; field names are case-sensitive.
  • Use POST or PUT methods with a request body; GET requests are not supported for caching.
  • Confirm your service responds with 200 OK, as only successful responses are cached.

Common Template Issues

  • Missing quotes: Use contentTemplate: "{{ .field }}" not contentTemplate: {{ .field }}
  • Wrong field names: Ensure the field name in your template matches your JSON exactly (case-sensitive)
  • Nested fields: Use dot notation for nested objects: {{ .data.text }}
  • Array access: For arrays, you may need {{ index .items 0 }} or range over them
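For instance, given a hypothetical nested body {"data": {"text": "hello"}, "items": [{"content": "first"}]}, the matching templates would be:

contentTemplate: "{{ .data.text }}"                 # nested field via dot notation
contentTemplate: "{{ (index .items 0).content }}"   # first element of an array
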
Cache Distance Too High

If X-Cache-Distance is always above your maxDistance threshold:

  • Your prompts may simply be too different semantically.
  • Try raising the maxDistance value (for example, from 0.3 to 0.5) so that looser matches qualify; lower it again if you start seeing bad hits.
  • Consider whether your use case actually benefits from semantic similarity, or whether exact matching would serve it better.