Parallel LLM Guard

The Parallel LLM Guard middleware runs multiple LLM Guard checks concurrently instead of sequentially. LLM Guards that depend on external services (such as NVIDIA NIMs, OpenAI, or Ollama) add latency to every request. When multiple guards are chained sequentially, the total latency is the sum of all guard response times. The Parallel LLM Guard executes all configured guards at the same time, reducing total latency to the duration of the slowest guard.

Key Features and Benefits

  • Parallel Execution: All guards run concurrently, reducing total latency from the sum to the maximum of individual guard response times.
  • Early Cancellation: When any guard blocks a request or encounters an error, the middleware cancels all remaining in-flight guards immediately.
  • Defense in Depth: Combine multiple guard types (safety classification, content moderation, topic control) in a single middleware.
  • Mixed Guard Types: Each guard in the parallel set can use any supported LLM Guard configuration, allowing you to mix LLM-based and custom service guards.
  • Request and Response Protection: Guards can inspect incoming requests, outgoing responses, or both.

How It Works

The middleware processes requests and responses through these steps:

  1. Request Phase: The middleware reads the request body, then sends it to all configured request guards in parallel.
  2. Evaluation: Each guard calls its configured external service (LLM or custom API) and evaluates the response against its block and trace conditions.
  3. Early Exit: If any guard returns a block result or an error, the middleware cancels the context for all other guards and returns immediately (see Error Handling).
    • Any block results in a 403 Forbidden reply to the client
    • Any error results in a 500 Internal Server Error reply to the client
  4. Forwarding: If all request guards pass, the original request is forwarded to the backend service.
  5. Response Phase (if response guards are configured): The backend response is buffered and sent to all response guards in parallel, following the same evaluation and early exit logic.

Requirements

  • You must have AI Gateway enabled:

    helm upgrade traefik traefik/traefik -n traefik --wait \
      --reset-then-reuse-values \
      --set hub.aigateway.enabled=true
  • One or more external content analysis services or LLMs to use as guards.

Configuration Examples

Multiple LLM Guards with Parallel Execution

Run a safety classifier and a topic control model in parallel against chat completion traffic:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-safety-check
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: llama-guard3:8b
            request:
              systemPrompt: |
                Task: Check if there is unsafe content in the user messages according to our safety policy.

                <BEGIN UNSAFE CONTENT CATEGORIES>
                S1: Violent Crimes, S2: Non-Violent Crimes, S3: Sex Crimes
                S4: Child Exploitation, S5: Defamation, S6: Specialized Advice
                S7: Privacy, S8: Intellectual Property, S9: Indiscriminate Weapons
                S10: Hate, S11: Self-Harm, S12: Sexual Content
                S13: Elections, S14: Code Interpreter Abuse
                <END UNSAFE CONTENT CATEGORIES>

                Provide your safety assessment for the conversation.
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
        - guard:
            endpoint: https://api.openai.com/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: gpt-4o
            clientConfig:
              headers:
                Authorization: "urn:secret:openai-api-key:authorization"
            request:
              systemPrompt: |
                Task: Check if the message contains off-topic content.
                Answer with only true or false as a string.
              traceConditions:
                - reason: off_topic_content
                  condition: Contains("True") # case insensitive match

Mixed Guard Types

Combine an LLM-based guard with a custom moderation service:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-mixed-guards
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: llama-guard3:8b
            request:
              systemPrompt: |
                Task: Check if there is unsafe content in the user messages.
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
        - guard:
            endpoint: http://moderation-service.apps.svc.cluster.local/v1/moderations
            clientRequestFormat: ccr
            format:
              custom: {}
            request:
              template: '{"input": ["{{ (index .messages 0).content }}"]}'
              blockConditions:
                - reason: moderation_flagged
                  condition: Contains("true")

Request and Response Guards

Guard both incoming requests and outgoing responses in parallel:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-bidirectional
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: https://api.openai.com/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: gpt-4o
            clientConfig:
              headers:
                Authorization: "urn:secret:openai-api-key:authorization"
            request:
              systemPrompt: |
                Task: Check if the message contains the word hello in it and answer with only true or false as a string.
              traceConditions:
                - reason: behavior_content
                  condition: Contains("False")
        - guard:
            endpoint: https://api.openai.com/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: gpt-4o
            clientConfig:
              headers:
                Authorization: "urn:secret:openai-api-key:authorization"
            response:
              systemPrompt: |
                Task: Check if the message contains the word salmon in it and answer with only true or false as a string.
              traceConditions:
                - reason: fish_content
                  condition: Contains("True")
        - guard:
            endpoint: http://moderation-service.apps.svc.cluster.local/v1/moderations
            clientRequestFormat: ccr
            format:
              custom: {}
            request:
              template: '{"input": ["{{ (index .messages 0).content }}"]}'
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("true")

Generic API Guards

Use guard with the appropriate format settings to run checks against non-chat APIs. The format field describes what each guard endpoint expects, not what the client sends to Traefik. Set clientRequestFormat separately whenever the incoming client request uses the Chat Completions or Responses API.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-api-guard
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            format:
              ccr:
                model: llama-guard3:8b
            request:
              systemPrompt: |
                Task: Check if there is unsafe content in the user messages.
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
            clientConfig:
              timeoutSeconds: 30
              maxRetries: 2
        - guard:
            endpoint: http://content-filter.apps.svc.cluster.local/analyze
            format:
              custom: {}
            request:
              template: '{"text": "{{.query}}", "user_id": "{{.user_id}}"}'
              blockConditions:
                - reason: policy_violation
                  condition: JSONEquals(".status", "blocked")

Customizing Deny Responses

Each guard in the parallel set supports onDenyResponse per block condition. See LLM Guard - Customizing Deny Responses.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: parallel-guard-deny-response
  namespace: apps
spec:
  plugin:
    parallel-llm-guard:
      guards:
        - guard:
            endpoint: http://ollama.apps.svc.cluster.local:11434/v1/chat/completions
            clientRequestFormat: ccr
            format:
              ccr:
                model: llama-guard3:8b
            request:
              blockConditions:
                - reason: unsafe_content
                  condition: Contains("unsafe")
                  onDenyResponse:
                    statusCode: 200
                    message: "Your request was flagged for safety."

Configuration Options

| Field  | Description                                                                                        | Required | Default |
|--------|----------------------------------------------------------------------------------------------------|----------|---------|
| guards | Array of guard definitions to execute in parallel. Each entry must contain exactly one guard type. | Yes      |         |

Guard Types

Each entry in the guards array must contain exactly one of the following guard type keys:

| Guard Type | Description |
|------------|-------------|
| guards.guard | Recommended. Unified guard configuration that accepts any format via the format and clientRequestFormat fields. See the LLM Guard Documentation. |
| guards.chat-completion-llm-guard | Deprecated. Same as guard with implicit clientRequestFormat: ccr and format.ccr. Still works for backward compatibility; prefer explicit guard config. |
| guards.chat-completion-llm-guard-custom | Deprecated. Same as guard with implicit clientRequestFormat: ccr and format.custom. Still works for backward compatibility; prefer explicit guard config. |
| guards.llm-guard | Deprecated. Same as guard with implicit clientRequestFormat: custom and format.ccr. Still works for backward compatibility; prefer explicit guard config. |
| guards.llm-guard-custom | Deprecated. Same as guard with implicit clientRequestFormat: custom and format.custom. Still works for backward compatibility; prefer explicit guard config. |

The configuration options for each guard type are identical to the standalone LLM Guard middleware variants. See the LLM Guard Configuration Options for the full reference, including endpoint, request/response rules, block and trace conditions, client configuration, and LLM-specific parameters.

Behavior Details

Cancellation Semantics

When a guard blocks or errors, all other in-flight guards are cancelled through Go context cancellation. Guards that have already completed are not affected. The middleware returns the result of the first guard that blocks or errors.

Error Handling

| Scenario | HTTP Status | Description |
|----------|-------------|-------------|
| Any request guard blocks | 403 Forbidden (or custom via onDenyResponse) | Request is rejected, remaining guards are cancelled |
| Any response guard blocks | 403 Forbidden (or custom via onDenyResponse) | Response is rejected, remaining guards are cancelled |
| Any guard returns an error | 500 Internal Server Error | Remaining guards are cancelled |
| Request body exceeds maxRequestBodySize | 413 Request Entity Too Large | No guards are executed |
| Request body is empty or has negative Content-Length | 400 Bad Request | No guards are executed |
| Client rejects uncompressed responses (Accept-Encoding: identity;q=0) | 406 Not Acceptable | Response guards must read the response body in clear text; the client's Accept-Encoding must allow identity encoding |

Streaming Responses

Streaming Limitation

When you configure response guards, the middleware must buffer the entire backend response to analyze it. This means real-time streaming is lost and the client receives the complete response after all response guards have passed. For real-time streaming, only configure request guards or use non-streaming endpoints.

Comparison with Sequential Guards

| Aspect | Sequential (multiple standalone middlewares) | Parallel (parallel-llm-guard) |
|--------|----------------------------------------------|-------------------------------|
| Latency | Sum of all guard response times | Maximum of individual guard response times |
| Failure behavior | Stops at first blocking middleware | Cancels all remaining guards on first block |
| Configuration | Separate Middleware resources per guard | Single Middleware resource |
| Guard types | One type per middleware | Mix any guard types in one middleware |

Troubleshooting

Guards Are Not Executing in Parallel

Verify that all guards are defined inside the guards array of a single parallel-llm-guard middleware. If guards are configured as separate standalone middleware resources on the same route, they execute sequentially as part of the standard middleware chain.

Configuration Validation Errors

Each entry in the guards array must contain exactly one guard type. Common errors:

  • exactly one guard type must be configured: An entry has zero or more than one guard type key.
  • at least one guard must be configured: The guards array is empty.
  • Endpoint validation errors: Each guard requires a valid endpoint URL with an http or https scheme.
  • Model validation errors: format.ccr and format.responsesAPI require a non-empty model field inside the format block.

Performance Considerations

  • Deploy guard services close to the gateway (same cluster or region) to minimize network latency.
  • Use lightweight models for faster response times.
  • Set appropriate clientConfig.timeoutSeconds values for each guard to avoid slow guards holding up the entire request.
  • Monitor guard latency through tracing to identify bottlenecks.

Related Content

  • Read the LLM Guard documentation for detailed per-guard configuration, block condition expressions, and deployment patterns.
  • Read the Content Guard documentation for pre-built PII detection.
  • Read the Chat Completion documentation for AI endpoint setup.
  • Read the NVIDIA NIMs Integration guide for using NVIDIA models as guards.