Parallel LLM Guard

The Parallel LLM Guard middleware runs multiple LLM Guard checks concurrently instead of sequentially. LLM Guards that depend on external services (such as NVIDIA NIMs, OpenAI, or Ollama) add latency to every request. When multiple guards are chained sequentially, the total latency is the sum of all guard response times. The Parallel LLM Guard executes all configured guards at the same time, reducing total latency to the duration of the slowest guard.

Key Features and Benefits

  • Parallel Execution: All guards run concurrently, reducing total latency from the sum to the maximum of individual guard response times.
  • Early Cancellation: When any guard blocks a request or encounters an error, the middleware cancels all remaining in-flight guards immediately.
  • Defense in Depth: Combine multiple guard types (safety classification, content moderation, topic control) in a single middleware.
  • Mixed Guard Types: Each guard in the parallel set can use any supported LLM Guard configuration, allowing you to mix LLM-based and custom service guards.
  • Request and Response Protection: Guards can inspect incoming requests, outgoing responses, or both.

How It Works

The middleware processes requests and responses through these steps:

  1. Request Phase: The middleware reads the request body, then sends it to all configured request guards in parallel.
  2. Evaluation: Each guard calls its configured external service (LLM or custom API) and evaluates the response against its block and trace conditions.
  3. Early Exit: If any guard returns a block result or an error, the middleware cancels the context for all other guards and returns immediately (see Error Handling).
    • Any block results in a 403 Forbidden reply to the client
    • Any error results in a 500 Internal Server Error reply to the client
  4. Forwarding: If all request guards pass, the original request is forwarded to the backend service.
  5. Response Phase (if response guards are configured): The backend response is buffered and sent to all response guards in parallel, following the same evaluation and early exit logic.

Requirements

  • You must have AI Gateway enabled:

    helm upgrade traefik traefik/traefik -n traefik --wait \
      --reset-then-reuse-values \
      --set hub.aigateway.enabled=true
  • One or more external content analysis services or LLMs to use as guards.

Configuration Examples

Multiple LLM Guards with Parallel Execution

Run a safety classifier and a topic control model in parallel against chat completion traffic:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-safety-check
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: llama-guard3:8b
            request:
              systemPrompt: |
                Task: Check if there is unsafe content in the user messages according to our safety policy.

                <BEGIN UNSAFE CONTENT CATEGORIES>
                S1: Violent Crimes, S2: Non-Violent Crimes, S3: Sex Crimes
                S4: Child Exploitation, S5: Defamation, S6: Specialized Advice
                S7: Privacy, S8: Intellectual Property, S9: Indiscriminate Weapons
                S10: Hate, S11: Self-Harm, S12: Sexual Content
                S13: Elections, S14: Code Interpreter Abuse
                <END UNSAFE CONTENT CATEGORIES>

                Provide your safety assessment for the conversation.
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
        - guard:
            endpoint: https://api.openai.com/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: gpt-4o
            clientConfig:
              headers:
                Authorization: "urn:secret:openai-api-key:authorization"
            request:
              systemPrompt: |
                Task: Check if the message contains off-topic content.
                Answer with only true or false as a string.
              traceConditions:
                - reason: off_topic_content
                  condition: Contains("True") # case insensitive match

Mixed Guard Types

Combine an LLM-based guard with a custom moderation service:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-mixed-guards
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: llama-guard3:8b
            request:
              systemPrompt: |
                Task: Check if there is unsafe content in the user messages.
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
        - guard:
            endpoint: http://moderation-service.apps.svc.cluster.local/v1/moderations
            clientRequestFormat: ccr
            format:
              custom: {}
            request:
              template: '{"input": ["{{ (index .messages 0).content }}"]}'
              blockConditions:
                - reason: moderation_flagged
                  condition: Contains("true")

Request and Response Guards

Guard both incoming requests and outgoing responses in parallel:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-bidirectional
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: https://api.openai.com/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: gpt-4o
            clientConfig:
              headers:
                Authorization: "urn:secret:openai-api-key:authorization"
            request:
              systemPrompt: |
                Task: Check if the message contains the word hello in it and answer with only true or false as a string.
              traceConditions:
                - reason: behavior_content
                  condition: Contains("False")
        - guard:
            endpoint: https://api.openai.com/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: gpt-4o
            clientConfig:
              headers:
                Authorization: "urn:secret:openai-api-key:authorization"
            response:
              systemPrompt: |
                Task: Check if the message contains the word salmon in it and answer with only true or false as a string.
              traceConditions:
                - reason: fish_content
                  condition: Contains("True")
        - guard:
            endpoint: http://moderation-service.apps.svc.cluster.local/v1/moderations
            clientRequestFormat: ccr
            format:
              custom: {}
            request:
              template: '{"input": ["{{ (index .messages 0).content }}"]}'
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("true")

Generic API Guards

Use guard with the appropriate format settings to run checks against non-chat APIs. The format field describes what each guard endpoint expects, not what the client sends to Traefik. Set clientRequestFormat separately whenever the incoming client request uses the Chat Completions or Responses API.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-api-guard
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            format:
              ccr:
                model: llama-guard3:8b
            request:
              systemPrompt: |
                Task: Check if there is unsafe content in the user messages.
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
            clientConfig:
              timeoutSeconds: 30
              maxRetries: 2
        - guard:
            endpoint: http://content-filter.apps.svc.cluster.local/analyze
            format:
              custom: {}
            request:
              template: '{"text": "{{.query}}", "user_id": "{{.user_id}}"}'
              blockConditions:
                - reason: policy_violation
                  condition: JSONEquals(".status", "blocked")

Customizing Deny Responses

Each guard in the parallel set supports onDenyResponse per block condition. See LLM Guard - Customizing Deny Responses.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-guard-deny-response
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: llama-guard3:8b
            request:
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
                  onDenyResponse:
                    statusCode: 200
                    message: "Your request was flagged for safety."

Configuration Options

| Field  | Description                                                                                        | Required | Default |
|--------|----------------------------------------------------------------------------------------------------|----------|---------|
| guards | Array of guard definitions to execute in parallel. Each entry must contain exactly one guard type. | Yes      |         |

Guard Types

Each entry in the guards array must contain exactly one of the following guard type keys:

| Guard Type | Description |
|------------|-------------|
| guards.guard | Recommended. Unified guard configuration that accepts any format via the format and clientRequestFormat fields. See the LLM Guard Documentation. |
| guards.chat-completion-llm-guard | Deprecated. Same as guard with implicit clientRequestFormat: ccr and format.ccr. Still works for backward compatibility; prefer explicit guard config. |
| guards.chat-completion-llm-guard-custom | Deprecated. Same as guard with implicit clientRequestFormat: ccr and format.custom. Still works for backward compatibility; prefer explicit guard config. |
| guards.llm-guard | Deprecated. Same as guard with implicit clientRequestFormat: custom and format.ccr. Still works for backward compatibility; prefer explicit guard config. |
| guards.llm-guard-custom | Deprecated. Same as guard with implicit clientRequestFormat: custom and format.custom. Still works for backward compatibility; prefer explicit guard config. |

The configuration options for each guard type are identical to the standalone LLM Guard middleware variants. See the LLM Guard Configuration Options for the full reference, including endpoint, request/response rules, block and trace conditions, client configuration, and LLM-specific parameters.

Behavior Details

Cancellation Semantics

When a guard blocks or errors, all other in-flight guards are cancelled through Go context cancellation. Guards that have already completed are not affected. The middleware returns the result of the first guard that blocks or errors.

Error Handling

| Scenario | HTTP Status | Description |
|----------|-------------|-------------|
| Any request guard blocks | 403 Forbidden (or custom via onDenyResponse) | Request is rejected, remaining guards are cancelled |
| Any response guard blocks | 403 Forbidden (or custom via onDenyResponse) | Response is rejected, remaining guards are cancelled |
| Any guard returns an error | 500 Internal Server Error | Remaining guards are cancelled |
| Request body exceeds maxRequestBodySize | 413 Request Entity Too Large | No guards are executed |
| Request body is empty or has negative Content-Length | 400 Bad Request | No guards are executed |
| Client rejects uncompressed responses (Accept-Encoding: identity;q=0) | 406 Not Acceptable | Response guards must read the response body in clear text; the client's Accept-Encoding must allow identity encoding |

Streaming Responses

Streaming Limitation

When you configure response guards, the middleware must buffer the entire backend response to analyze it. This means real-time streaming is lost and the client receives the complete response after all response guards have passed. For real-time streaming, only configure request guards or use non-streaming endpoints.

Comparison with Sequential Guards

| Aspect | Sequential (multiple standalone middlewares) | Parallel (parallel-llm-guard) |
|--------|----------------------------------------------|-------------------------------|
| Latency | Sum of all guard response times | Maximum of individual guard response times |
| Failure behavior | Stops at first blocking middleware | Cancels all remaining guards on first block |
| Configuration | Separate Middleware resources per guard | Single Middleware resource |
| Guard types | One type per middleware | Mix any guard types in one middleware |

Troubleshooting

Guards Are Not Executing in Parallel

Verify that all guards are defined inside the guards array of a single parallel-llm-guard middleware. If guards are configured as separate standalone middleware resources on the same route, they execute sequentially as part of the standard middleware chain.

Configuration Validation Errors

Each entry in the guards array must contain exactly one guard type. Common errors:

  • exactly one guard type must be configured: An entry has zero or more than one guard type key.
  • at least one guard must be configured: The guards array is empty.
  • Endpoint validation errors: Each guard requires a valid endpoint URL with an http or https scheme.
  • Model validation errors: format.ccr and format.responsesAPI require a non-empty model field inside the format block.

Performance Considerations

  • Deploy guard services close to the gateway (same cluster or region) to minimize network latency.
  • Use lightweight models for faster response times.
  • Set appropriate clientConfig.timeoutSeconds values for each guard to avoid slow guards holding up the entire request.
  • Monitor guard latency through tracing to identify bottlenecks.

Related Content

  • Read the LLM Guard documentation for detailed per-guard configuration, block condition expressions, and deployment patterns.
  • Read the Content Guard documentation for pre-built PII detection.
  • Read the Chat Completion documentation for AI endpoint setup.
  • Read the NVIDIA NIMs Integration guide for using NVIDIA models as guards.