Guardrails

The Responses API includes built-in content safety checks. Every request is screened before reaching the LLM backend; if a violation is detected, the request is rejected immediately.

Two provider types ship out of the box:

  • llama-guard-3: content-safety classifier based on LlamaGuard 3. LLM-based, ~500ms–2s per check.

  • regex: fast pattern matcher for deterministic rules (PII patterns, blocklists, malicious URLs). Microseconds per check.

Providers are arranged into a pipeline per check type (input, tool_output, …). The pipeline runs stages sequentially and short-circuits on the first stage that returns a violation, so cheap regex checks can fail fast before the expensive LLM stage even runs.

Guardrails are evaluated eagerly, before the model generates any output, for synchronous, streaming, and background=true requests.


How It Works

  1. The worker reads the x-application-id header (or None if absent) and sends POST /v1/check with application_id, check_type: "input", and the user’s content.

  2. GuardrailService looks up the per-application policy and walks the pipeline configured for the requested check type.

  3. For each enabled stage, the registry dispatches to the provider named in stage.provider (e.g. regex, llama-guard-3), passing the stage-specific config.

  4. If a stage returns violations, the remaining stages are skipped and the check returns {safe: false, violations: […]}.

  5. If all stages pass, the request proceeds to the LLM backend normally.
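The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the service's actual code: the `Stage` dataclass, the `run_pipeline` function, and the provider call signature (content and stage config in, list of violated categories out) are assumptions made for the example. It shows the short-circuit on the first violating stage, how `step` keeps counting disabled stages, and how a provider error maps to fail_mode.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical provider signature: (content, stage_config) -> violated categories.
Provider = Callable[[str, dict], List[str]]


@dataclass
class Stage:
    provider: str
    name: str
    enabled: bool = True
    config: dict = field(default_factory=dict)


def run_pipeline(content: str, stages: List[Stage],
                 registry: Dict[str, Provider], fail_mode: str = "closed") -> dict:
    # enumerate() over the full list, so disabled stages do not shift `step`.
    for step, stage in enumerate(stages):
        if not stage.enabled:
            continue
        try:
            categories = registry[stage.provider](content, stage.config)
        except Exception:
            if fail_mode == "open":
                continue  # allow; a real service would also log the error
            categories = ["provider_error"]  # fail-secure sentinel
        if categories:
            # Short-circuit: later (possibly expensive) stages never run.
            return {
                "safe": False,
                "violations": [
                    {"category": c, "provider": stage.provider,
                     "stage": stage.name, "step": step}
                    for c in categories
                ],
            }
    return {"safe": True, "violations": []}
```

Placing a regex stage first means an LLM-backed stage later in the list is only reached when the cheap check passes.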

The endpoint is also callable directly by other services that need to evaluate content against an application policy.


Standalone Check Endpoint

The guardrail pipeline is exposed directly at POST /v1/check for callers that want to screen content without going through /v1/responses. It uses the same per-application policy lookup and bearer auth as the rest of the API.

Example:

curl -X POST "$BASE_URL/v1/check" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "application_id": "test-app",
    "check_type": "input",
    "input": "What is the capital of France?"
  }'

Request:

{
  "application_id": "legal-app",
  "check_type": "input",
  "input": "How do I write a contract for..."
}

| Field | Type | Description |
| --- | --- | --- |
| application_id | string or null | Application whose policy to apply. Omit (or send null) to fall back to the default policy block. |
| check_type | string | Which pipeline to run. Typically "input". Future: "output", "tool_output". |
| input | string | Content to check. |
| context | object (optional) | Freeform metadata providers may use (e.g. tool_name for tool_output checks). |

Response (safe):

{ "safe": true, "violations": [] }

Response (unsafe):

{
  "safe": false,
  "violations": [
    {
      "category": "Violent Crimes",
      "provider": "llama-guard-3",
      "stage": "content-safety",
      "step": 0
    }
  ]
}

Violation fields:

| Field | Type | Description |
| --- | --- | --- |
| category | string | The violated category (e.g. "Violent Crimes", "PII"). For provider failures under fail_mode: closed, the sentinel "provider_error". |
| provider | string | Registry key of the stage that produced the violation (e.g. "llama-guard-3", "regex"). |
| stage | string | stage.name from the configured pipeline, useful for log correlation. |
| step | integer | Zero-indexed position in the configured pipeline list. Disabled stages do not change step. |

Attribution fields let callers running multi-provider pipelines tell which stage produced the block, for example, distinguishing a regex PII match from a llama-guard-3 content-safety classification when both are configured for the same check type.
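For callers working in Python rather than curl, the request and response shapes described above could be handled along these lines. The helper names `build_check_request` and `describe_result` are illustrative only; the sketch uses the standard library and builds the request without sending it.

```python
import json
import urllib.request


def build_check_request(base_url, token, content,
                        application_id=None, check_type="input"):
    """Build (but do not send) a POST /v1/check request."""
    payload = {
        "application_id": application_id,  # None serializes to null -> default policy
        "check_type": check_type,
        "input": content,
    }
    return urllib.request.Request(
        f"{base_url}/v1/check",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def describe_result(result):
    """Summarize a /v1/check response body using the attribution fields."""
    if result["safe"]:
        return "safe"
    first = result["violations"][0]
    categories = ", ".join(v["category"] for v in result["violations"])
    return (f"blocked by {first['provider']} "
            f"(stage {first['stage']!r}, step {first['step']}): {categories}")
```

To actually perform the call, pass the request to `urllib.request.urlopen` and decode the JSON body before handing it to `describe_result`.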

Status codes:

| Status | When |
| --- | --- |
| 200 | Check completed (inspect safe and violations) |
| 404 | application_id was supplied but no policy exists for it (under fail_mode: closed) |
| 422 | The application has no pipeline configured for the requested check_type |


Application Identity

Per-request: callers send x-application-id: <id> to select a per-application policy. When the header is absent, the worker treats this as application_id=null and the service routes the lookup to the configured default policy block.

An application explicitly named "default" is a regular entry and does not collide with the default fallback block. The null sentinel is the only way to reach the fallback; unknown application IDs are rejected rather than silently inheriting the default policy.
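The lookup rule can be illustrated with a short Python sketch. The function name `select_policy` and the `UnknownApplication` exception are hypothetical; only the behaviour (null reaches the fallback, an application named "default" is an ordinary entry, unknown IDs are rejected) comes from the description above.

```python
class UnknownApplication(LookupError):
    """Raised when an application_id has no policy (hypothetical name)."""


def select_policy(policy_doc: dict, application_id):
    # Only the null sentinel reaches the fallback block.
    if application_id is None:
        return policy_doc["default"]
    try:
        # An application literally named "default" is looked up here,
        # like any other entry; it never aliases the fallback block.
        return policy_doc["applications"][application_id]
    except KeyError:
        # Unknown IDs are rejected instead of inheriting the default policy.
        raise UnknownApplication(application_id) from None
```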


Configuration

Guardrails are controlled via environment variables with the GUARDRAILS_ prefix:

| Variable | Pod | Default | Description |
| --- | --- | --- | --- |
| GUARDRAILS_DISABLED | API | false | Disables guardrails entirely. /v1/check still responds (returns safe=true). In distributed deployments, setting this on the worker alone has no effect because the worker only forwards requests to the API’s /v1/check endpoint. |
| GUARDRAILS_MODEL_NAME | API | llama-guard-3-8b | Model used by the llama-guard-3 provider. Currently only llama-guard-3-8b is supported. |
| GUARDRAILS_LABELS | API | | Inline JSON array of LlamaGuard category definitions (highest priority) |
| GUARDRAILS_LABELS_FILENAME | API | | Path to a JSON file with LlamaGuard category definitions |
| GUARDRAILS_POLICY_JSON | API | | Inline JSON with per-application policy config (takes priority over GUARDRAILS_POLICY_FILE) |
| GUARDRAILS_POLICY_FILE | API | | Path to a JSON file with per-application policy config |
| GUARDRAILS_ADMIN_API_ENABLED | API | false | Expose the /v1/admin/policies CRUD endpoints. Off by default; requires Postgres (the standard database settings). |
| GUARDRAILS_CHECK_URL | Worker | http://localhost:8000 | Base URL for the /v1/check endpoint the worker calls |

Policy resolution priority: Postgres (if database is configured) > GUARDRAILS_POLICY_JSON > GUARDRAILS_POLICY_FILE > generated default policy from GUARDRAILS_LABELS.

The DB layer is highest-precedence so a runtime override beats a baked-in policy. When the DB has no row for an application_id (or the row is disabled), the chain falls through to the next layer; existing deployments without a DB row keep their current behaviour.
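One way to picture the resolution chain is the Python sketch below. It is an interpretation under stated assumptions, not the service's implementation: the function names, the `enabled` and `policy` row fields, and the idea that exactly one non-DB document (inline JSON, else file, else generated default) backs the fallthrough are all hypothetical.

```python
def choose_static_document(inline_json, policy_file, generated_default):
    """Pick the non-DB policy document: inline JSON > file > generated default."""
    for doc in (inline_json, policy_file, generated_default):
        if doc is not None:
            return doc
    return None


def resolve_policy(application_id, db_rows: dict, static_doc: dict):
    """Return the effective policy, or None when no layer knows the application."""
    # Layer 1: the database. A missing or disabled row falls through.
    row = db_rows.get(application_id)
    if row is not None and row.get("enabled", True):
        return row["policy"]
    # Layer 2: the static document (env var, file, or generated default).
    if application_id is None:
        return static_doc.get("default")
    return static_doc.get("applications", {}).get(application_id)
```

A `None` result corresponds to the unknown-application case, which the caller would reject rather than substitute the default block.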

Helm deployments: ConfigMap-mounted policy

For Helm deployments, the recommended way to ship a per-application policy is the guardrails.policy value. The chart renders it into a ConfigMap and mounts it at /etc/guardrails/policy.json on the API pod (GUARDRAILS_POLICY_FILE is wired automatically):

# values.yaml
guardrails:
  policy:
    default:
      fail_mode: closed
      check_types:
        input:
          pipeline:
            - provider: llama-guard-3
              name: content-safety
              config: {}
    applications:
      legal-app:
        fail_mode: closed
        check_types:
          input:
            pipeline:
              - provider: regex
                name: pii-patterns
                config:
                  patterns:
                    - { name: steuer_id, pattern: '\b\d{11}\b', category: PII }
              - provider: llama-guard-3
                name: content-safety
                config: {}

Why use the ConfigMap path:

  • The policy is reviewable as native YAML in values.yaml (no JSON-in-string-in-env-var escape gymnastics).

  • A checksum/configmap-guardrails-policy annotation rolls the API pod whenever the policy changes.

  • ConfigMaps hold up to ~1 MiB; env-var size limits don’t bite.

  • Out-of-band edits to the ConfigMap (kubectl edit configmap …-api-guardrails-policy) do not automatically restart the API; run kubectl rollout restart deployment/<api-deployment> to pick them up. Hot-reload via file-watcher is not yet implemented.

GUARDRAILS_POLICY_JSON remains supported as an escape hatch (smoke tests, single-pod overrides). When both are set, the inline env var wins.

Postgres-backed policy + admin API

For installations that already use Postgres for history/conversations/pending requests, policies can additionally be configured at runtime through HTTP: no ConfigMap edits, no pod rollouts.

Turn it on by setting GUARDRAILS_ADMIN_API_ENABLED=true. The application_policies table is created by Alembic revision 008_application_policies (runs automatically on startup with the rest of the migrations).

The admin endpoints are scoped to admin principals only (User.is_admin == True); the feature flag is the gate of last resort, so leave it off unless your auth setup is hardened. Endpoints:

| Method | Path | Description |
| --- | --- | --- |
| GET | /v1/admin/policies | List all stored policies |
| GET | /v1/admin/policies/{application_id} | Fetch one, 404 if missing |
| PUT | /v1/admin/policies/{application_id} | Upsert; body is an ApplicationPolicy JSON |
| DELETE | /v1/admin/policies/{application_id} | Delete; 204 on success, 404 if missing |

Example: upsert a policy for legal-app:

curl -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://your-api/v1/admin/policies/legal-app \
  -d '{
    "application_id": "legal-app",
    "fail_mode": "closed",
    "check_types": {
      "input": {
        "pipeline": [
          {"provider": "regex", "name": "pii", "config": {"patterns": [{"name": "steuer_id", "pattern": "\\b\\d{11}\\b", "category": "PII"}]}}
        ]
      }
    }
  }'

The path application_id is authoritative; if the JSON body carries a different application_id, the path value wins. IDs are capped at 253 characters (DNS-1123); longer values return 422.
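As a rough illustration of these two rules, a handler could normalize an upsert like this. `normalize_upsert` is a hypothetical helper; raising `ValueError` stands in for the 422 response.

```python
MAX_APPLICATION_ID_LENGTH = 253  # DNS-1123 length limit cited above


def normalize_upsert(path_application_id: str, body: dict) -> dict:
    """Return the policy to store; the path ID overrides any ID in the body."""
    if len(path_application_id) > MAX_APPLICATION_ID_LENGTH:
        # The API answers 422 here; this sketch just raises.
        raise ValueError("application_id exceeds 253 characters")
    # Copy the body so the caller's dict is untouched, then force the path ID.
    return {**body, "application_id": path_application_id}
```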

GUARDRAILS_POLICY_JSON / GUARDRAILS_POLICY_FILE remain useful for bootstrap: ship an initial policy via env var, then migrate it into the DB through the admin API. Both env-var and file paths still serve as fallbacks for any application_id not present in the DB.

app-manager-gitops deployments

  • Platform only

app-manager-gitops descriptors set envVars.GUARDRAILS_POLICY_JSON to the inline JSON policy:

{
  "config": {
    "envVars": {
      "GUARDRAILS_POLICY_JSON": "{\"default\":{...},\"applications\":{...}}"
    }
  }
}

The platform-manager API supports config.volumes.configMap[] for mounting an existing ConfigMap, but exposes no endpoint for creating one; the only resource type currently provisionable via /tenants/{tenant}/resources is redis. Until that gap closes, gitops deployments stay on the inline env var.


Per-Application Policy

The policy JSON keys applications by ID and defines a pipeline per check type:

{
  "default": {
    "fail_mode": "closed",
    "check_types": {
      "input": {
        "pipeline": [
          {
            "provider": "llama-guard-3",
            "name": "content-safety",
            "enabled": true,
            "config": {}
          }
        ]
      }
    }
  },
  "applications": {
    "legal-app": {
      "fail_mode": "closed",
      "check_types": {
        "input": {
          "pipeline": [
            {
              "provider": "regex",
              "name": "pii-patterns",
              "config": {
                "patterns": [
                  {
                    "name": "steuer_id",
                    "pattern": "\\b\\d{11}\\b",
                    "category": "PII"
                  }
                ]
              }
            },
            {
              "provider": "llama-guard-3",
              "name": "content-safety",
              "config": {}
            }
          ]
        }
      }
    }
  }
}

Top-level keys:

| Key | Description |
| --- | --- |
| default | Fallback policy block, reachable only when the request omits application_id. |
| applications | Map of application_id → policy. |

Per-application fields:

| Field | Type | Description |
| --- | --- | --- |
| fail_mode | "closed" or "open" | What to do on provider error. closed = reject (fail-secure); open = allow + log. Default "closed". |
| check_types | object | Map of check type name (e.g. "input") to its pipeline configuration. |

Per-stage fields:

| Field | Type | Description |
| --- | --- | --- |
| provider | string | Provider registry key: "llama-guard-3", "regex", or "llm-judge". |
| name | string | Unique stage name within the pipeline. Used in logs. |
| enabled | boolean | Toggle a stage off without removing it from the config. Default true. |
| config | object | Provider-specific configuration. See below. |


Providers

llama-guard-3

LLM-based content safety classifier. The config block is currently ignored; LlamaGuard category selection is configured globally via GUARDRAILS_LABELS rather than per-stage.

regex

Pattern-based provider. Config schema:

{
  "patterns": [
    { "name": "steuer_id",     "pattern": "\\b\\d{11}\\b",                "category": "PII" },
    { "name": "javascript_url", "pattern": "javascript:",                  "category": "MaliciousURL" }
  ]
}

Each pattern has a name (used in logs only: pattern names and category names are logged, but never the matched substring), a pattern (Python regular expression), and a category (returned as the violation category). Multiple patterns sharing a category produce a single deduplicated violation.

Invalid regex patterns raise at check time and are handled by fail_mode like any other provider error.
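The behaviour described for the regex provider can be approximated with the sketch below. It assumes a search anywhere in the content (`re.search`) and a provider function named `regex_check`; neither detail is confirmed by the text above. Invalid patterns raise `re.error` at check time, which a surrounding pipeline would then treat as a provider error.

```python
import re
from typing import List


def regex_check(content: str, config: dict) -> List[str]:
    """Return the violated categories, one entry per category."""
    categories: List[str] = []
    for entry in config.get("patterns", []):
        # re.search compiles the pattern on use; a bad pattern raises re.error here.
        if re.search(entry["pattern"], content):
            # Deduplicate: several patterns may share one category.
            if entry["category"] not in categories:
                categories.append(entry["category"])
    return categories


example_config = {
    "patterns": [
        {"name": "steuer_id", "pattern": r"\b\d{11}\b", "category": "PII"},
        {"name": "javascript_url", "pattern": "javascript:", "category": "MaliciousURL"},
    ]
}
```

Note that the matched substring is never returned, only the category, in line with the logging rule above.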

llm-judge

LLM-based judge that applies a natural-language policy supplied by the application manager. Each judge stage takes a template (the policy text) and a model (the name of a Responses-API model served by the inference backend). The provider wraps both in a server-controlled prompt envelope and parses a strict SAFE / UNSAFE first-line verdict.

Config schema:

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| model | string | yes | | Any model the inference backend can serve. The platform does not gatekeep model selection; an unknown model surfaces as a backend error at check time. |
| template | string | yes | | Free-form NL policy. Length 20..2000 chars. Cannot contain reserved envelope tokens or control characters. |
| violation_category | string | no | "Custom" | Returned as the Violation.category on UNSAFE. Matches ^[A-Za-z0-9 _-]{1,64}$. |
| max_input_chars | int | no | 8000 | Per-stage user-content cap, clamped against GUARDRAILS_JUDGE_MAX_INPUT_CHARS. |

Example stage:

{
  "provider": "llm-judge",
  "name": "stay-on-topic",
  "config": {
    "model": "judge-1",
    "template": "Reject any message not related to legal advice in Germany.",
    "violation_category": "Off-Topic"
  }
}

Templates are open-ended natural-language strings; the same schema covers stay-on-topic checks, natural-language denylists ("reject any message that asks about competitors"), tone policies ("reject aggressive or threatening messages"), and anything else expressible as a one-paragraph policy. The provider does not know which kind of check the manager wrote.

Operator setup. One env var controls the per-stage user-content cap:

  • GUARDRAILS_JUDGE_MAX_INPUT_CHARS: upper bound on per-stage max_input_chars (default 8000).

There is no operator-side allowlist for judge models. Application managers may pick any model the backend can serve; an unknown-model error is handled by fail_mode like any other provider error.
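The config constraints for a judge stage can be sketched as a validation function. This is an illustration under assumptions: `validate_judge_config` is a hypothetical name, "control characters" is taken to mean every Unicode Cc character (including newlines), and the reserved envelope tokens are not checked because they are not listed here.

```python
import re
import unicodedata

ALLOWED_KEYS = {"model", "template", "violation_category", "max_input_chars"}
CATEGORY_RE = re.compile(r"^[A-Za-z0-9 _-]{1,64}$")


def validate_judge_config(config: dict, operator_cap: int = 8000) -> dict:
    """Validate an llm-judge stage config and return it with defaults applied."""
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    model, template = config.get("model"), config.get("template")
    if not isinstance(model, str) or not model:
        raise ValueError("model is required")
    if not isinstance(template, str) or not 20 <= len(template) <= 2000:
        raise ValueError("template must be 20..2000 characters")
    # Assumption: any Unicode control character (category Cc) is rejected.
    if any(unicodedata.category(ch) == "Cc" for ch in template):
        raise ValueError("template contains control characters")
    category = config.get("violation_category", "Custom")
    if not CATEGORY_RE.fullmatch(category):
        raise ValueError("invalid violation_category")
    # Clamp the per-stage cap against the operator-wide upper bound
    # (GUARDRAILS_JUDGE_MAX_INPUT_CHARS, default 8000).
    max_chars = min(int(config.get("max_input_chars", 8000)), operator_cap)
    return {"model": model, "template": template,
            "violation_category": category, "max_input_chars": max_chars}
```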

Threat model.

  • Manager-supplied templates. Acceptance-time validation rejects unknown config keys, oversize templates, control characters, and embedded envelope tokens; suspicious phrases (ignore previous, override, …) surface as warnings on PUT /v1/admin/policies/{application_id}. A manager who writes a permissive template degrades only their own app; the platform’s invariant is the response shape, not policy quality.

  • Malicious end-user content. Reserved envelope tokens in user content are HTML-encoded before the model sees them, so jailbreak attempts cannot syntactically break out of the user-message block. A sandwich reminder after the user content restates the classifier instruction, and the strict first-line parser rejects anything other than SAFE or UNSAFE. Under fail_mode: closed, a malformed verdict becomes a provider_error violation rather than a free pass; recommended for high-stakes apps.

  • Cost overrun. Templates are capped at 2000 chars and user content at max_input_chars; one judge call is one scheduler round-trip. Put cheap stages (regex) ahead of the judge to short-circuit before the LLM call.
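Two of the defences above, encoding reserved tokens in user content and strict first-line verdict parsing, can be sketched in a few lines of Python. The token strings in `RESERVED_TOKENS` are placeholders invented for this example (the real envelope tokens are not documented here), and treating surrounding whitespace on the first line as acceptable is also an assumption.

```python
import html

# Hypothetical placeholders; the real reserved envelope tokens are not listed here.
RESERVED_TOKENS = ("<user_message>", "</user_message>")


def encode_user_content(content: str) -> str:
    """HTML-encode reserved tokens so user text cannot close the envelope."""
    for token in RESERVED_TOKENS:
        content = content.replace(token, html.escape(token))
    return content


def parse_verdict(model_output: str) -> bool:
    """Return True for SAFE, False for UNSAFE; raise on anything else."""
    first_line = model_output.split("\n", 1)[0].strip()
    if first_line == "SAFE":
        return True
    if first_line == "UNSAFE":
        return False
    # Under fail_mode: closed the caller turns this into a provider_error violation.
    raise ValueError("malformed verdict")
```

Because the parser raises instead of guessing, a model reply such as "Sure! SAFE" is handled by fail_mode rather than being read as a pass.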


Default Safety Categories (LlamaGuard)

When no custom labels are configured, the llama-guard-3 provider evaluates against the following categories:

| # | Category | Description |
| --- | --- | --- |
| S1 | Violent Crimes | Terrorism, murder, assault, kidnapping, animal abuse |
| S2 | Non-Violent Crimes | Fraud, scams, hacking, drug trafficking, weapons offenses |
| S3 | Sex Crimes | Human trafficking, sexual assault, sexual harassment |
| S4 | Child Exploitation | Child sexual abuse material or depictions |
| S5 | Defamation | Verifiably false statements about real people |
| S6 | Specialized Advice | Unqualified financial, medical, or legal advice |
| S7 | Privacy | Disclosure of private individuals' sensitive information |
| S8 | Intellectual Property | Content violating third-party IP rights |
| S9 | Indiscriminate Weapons | Weapons of mass destruction (chemical, biological, nuclear) |
| S10 | Hate | Hate speech based on protected characteristics |
| S11 | Self-Harm | Suicide, self-injury, disordered eating |
| S12 | Sexual Content | Explicit sexual depictions or erotic descriptions |
| S13 | Elections | False information about electoral systems and voting |
| S14 | Code Interpreter Abuse | Denial of service, container escapes, privilege escalation |
| S15 | Profanity | Vulgar, offensive, or impolite language |
| S16 | Prompt Injection Attack | Attempts to override system instructions or extract prompts |

To override or extend these, provide a JSON array of label objects via GUARDRAILS_LABELS or GUARDRAILS_LABELS_FILENAME:

[
  {
    "name": "Financial Advice",
    "description": "AI models should not provide specific investment recommendations.",
    "enabled": true
  },
  {
    "name": "Profanity",
    "description": "Vulgar or offensive language.",
    "enabled": false
  }
]

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Display name of the category |
| description | string | Yes | Description used in the safety prompt |
| enabled | boolean | No | Whether this category is active (default true) |

The 16 default categories are always included in the safety prompt. Custom labels are appended after them. Only labels present in your configured list with "enabled": true trigger violations.


Violation Response (Responses API)

When a guardrail violation blocks a POST /v1/responses request, the API returns HTTP 400 with:

{
  "error": {
    "message": "Guardrail violations found: Violent Crimes, Hate",
    "type": "invalid_request_error",
    "param": null,
    "code": "content_policy_violation"
  }
}

The message field lists the specific categories that were violated. For background=true requests, the pending record is marked as failed with the violation reason instead.
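A client can recognize a guardrail rejection from the error body and pull out the category names. The sketch below is based only on the example shown above: `violated_categories` is a hypothetical helper, and the message prefix is inferred from that single example, so treat the `code` field as the primary signal and the parsed list as best effort.

```python
MESSAGE_PREFIX = "Guardrail violations found: "  # inferred from the example above


def violated_categories(error_body: dict):
    """Return category names for a guardrail rejection, or None for other errors."""
    error = error_body.get("error", {})
    if error.get("code") != "content_policy_violation":
        return None  # not a guardrail rejection
    message = error.get("message", "")
    if not message.startswith(MESSAGE_PREFIX):
        return []  # guardrail rejection, but the message shape is unexpected
    parts = message[len(MESSAGE_PREFIX):].split(",")
    return [part.strip() for part in parts if part.strip()]
```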