Guardrails

In this article:

How It Works
Standalone Check Endpoint
Application Identity
Configuration
Per-Application Policy
Providers
Default Safety Categories (LlamaGuard)
Violation Response (Responses API)

The Responses API includes built-in content safety checks. Every request is screened before reaching the LLM backend; if a violation is detected, the request is rejected immediately.

Two provider types ship out of the box:

llama-guard-3: content-safety classifier based on LlamaGuard 3. LLM-based, ~500ms–2s per check.
regex: fast pattern matcher for deterministic rules (PII patterns, blocklists, malicious URLs). Microseconds per check.

Providers are arranged into a pipeline per check type (input, tool_output, …). The pipeline runs stages sequentially and short-circuits on the first stage that returns a violation, so cheap regex checks can fail fast before the expensive LLM stage even runs.

Guardrails are evaluated eagerly, before the model generates any output, for synchronous, streaming, and background=true requests.

How It Works

The worker reads the x-application-id header (using null if the header is absent) and sends a POST /v1/check request with application_id, check_type: "input", and the user’s content.
GuardrailService looks up the per-application policy and walks the pipeline configured for the requested check type.
For each enabled stage, the registry dispatches to the provider named in stage.provider (e.g. regex, llama-guard-3), passing the stage-specific config.
If a stage returns violations, the remaining stages are skipped and the check returns {safe: false, violations: […]}.
If all stages pass, the request proceeds to the LLM backend normally.
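
The steps above can be sketched as a sequential, short-circuiting loop. This is a minimal illustration, not the service's actual code: the names `Stage`, `run_pipeline`, and the shape of the provider callables are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumed provider shape: (content, stage_config) -> list of violated categories.
Provider = Callable[[str, dict], list[str]]


@dataclass
class Stage:
    provider: str
    name: str
    config: dict = field(default_factory=dict)
    enabled: bool = True


def run_pipeline(stages, registry, content, fail_mode="closed"):
    """Run stages in order and stop at the first one that reports a violation."""
    for step, stage in enumerate(stages):  # step counts disabled stages too
        if not stage.enabled:
            continue
        try:
            categories = registry[stage.provider](content, stage.config)
        except Exception:
            if fail_mode == "open":
                continue  # allow the request; a real service would also log
            categories = ["provider_error"]  # fail-secure sentinel
        if categories:
            violations = [
                {"category": c, "provider": stage.provider,
                 "stage": stage.name, "step": step}
                for c in categories
            ]
            return {"safe": False, "violations": violations}
    return {"safe": True, "violations": []}
```

Placing a regex stage ahead of an LLM stage in `stages` means the LLM provider is never called once the regex stage has reported a violation.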

The endpoint is also callable directly by other services that need to evaluate content against an application policy.

Standalone Check Endpoint

The guardrail pipeline is exposed directly at POST /v1/check for callers that want to screen content without going through /v1/responses. It uses the same per-application policy lookup and bearer auth as the rest of the API.

Example:

curl -X POST "$BASE_URL/v1/check" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "application_id": "test-app",
    "check_type": "input",
    "input": "What is the capital of France?"
  }'

Request:

{
  "application_id": "legal-app",
  "check_type": "input",
  "input": "How do I write a contract for..."
}

Field	Type	Description
`application_id`	`string \| null`	Application whose policy to apply. Omit (or send `null`) to fall back to the `default` policy block.
`check_type`	`string`	Which pipeline to run. Typically `"input"`. Future: `"output"`, `"tool_output"`.
`input`	`string`	Content to check.
`context`	`object` (optional)	Freeform metadata providers may use (e.g. `tool_name` for `tool_output` checks).
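
For callers that prefer Python over curl, the request can be assembled with the standard library. This is a sketch under assumptions: `build_check_request` is a hypothetical helper, and the base URL and token are placeholders supplied by the caller.

```python
import json
import urllib.request


def build_check_request(base_url, token, content,
                        application_id=None, check_type="input", context=None):
    """Build (but do not send) a POST /v1/check request."""
    payload = {
        "application_id": application_id,  # None serialises to null -> default policy
        "check_type": check_type,
        "input": content,
    }
    if context is not None:
        payload["context"] = context  # optional freeform metadata
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/check",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the response body is the JSON document described below.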

Response (safe):

{ "safe": true, "violations": [] }

Response (unsafe):

{
  "safe": false,
  "violations": [
    {
      "category": "Violent Crimes",
      "provider": "llama-guard-3",
      "stage": "content-safety",
      "step": 0
    }
  ]
}

Violation fields:

Field	Type	Description
`category`	string	The violated category (e.g. `"Violent Crimes"`, `"PII"`). For provider failures under `fail_mode: closed`, the sentinel `"provider_error"`.
`provider`	string	Registry key of the stage that produced the violation (e.g. `"llama-guard-3"`, `"regex"`).
`stage`	string	`stage.name` from the configured pipeline, useful for log correlation.
`step`	integer	Zero-indexed position in the configured pipeline list. Disabled stages do not change `step`.

Attribution fields let callers running multi-provider pipelines tell which stage produced the block, for example, distinguishing a regex PII match from a llama-guard-3 content-safety classification when both are configured for the same check type.
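
As an illustration of how a caller might use the attribution fields, the sketch below groups violated categories by provider and flags provider failures. The helper names are hypothetical; only the response shape comes from the examples above.

```python
from collections import defaultdict


def categories_by_provider(check_response):
    """Group violated categories by the provider that reported them."""
    grouped = defaultdict(list)
    for violation in check_response.get("violations", []):
        grouped[violation["provider"]].append(violation["category"])
    return dict(grouped)


def is_provider_failure(check_response):
    """True if any violation is the provider_error sentinel rather than a content match."""
    return any(
        violation["category"] == "provider_error"
        for violation in check_response.get("violations", [])
    )
```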

Status codes:

Status	When
`200`	Check completed (inspect `safe` and `violations`)
`404`	`application_id` was supplied but no policy exists for it (under `fail_mode: closed`)
`422`	The application has no pipeline configured for the requested `check_type`
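
A client could map these status codes to outcomes roughly as follows. The exception classes and the `interpret_check` function are assumptions for illustration; the API defines only the status codes.

```python
class UnknownApplication(Exception):
    """404: application_id was supplied but no policy exists for it."""


class NoPipelineConfigured(Exception):
    """422: the application has no pipeline for the requested check_type."""


def interpret_check(status, body):
    """Return True if content is safe, False if blocked; raise on other outcomes."""
    if status == 200:
        return bool(body["safe"])
    if status == 404:
        raise UnknownApplication("no policy for the supplied application_id")
    if status == 422:
        raise NoPipelineConfigured("no pipeline for the requested check_type")
    raise RuntimeError(f"unexpected status {status}")
```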

Application Identity

Per-request: callers send x-application-id: <id> to select a per-application policy. When the header is absent, the worker treats this as application_id=null and the service routes the lookup to the configured default policy block.

An application explicitly named "default" is a regular entry and does not collide with the default fallback block. The null sentinel is the only way to reach the fallback; unknown application IDs are rejected rather than silently inheriting the default policy.
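
The lookup rule can be made concrete with a short sketch. `select_policy` and `UnknownApplicationError` are hypothetical names; the policy document shape follows the `default` / `applications` layout shown under Per-Application Policy.

```python
class UnknownApplicationError(LookupError):
    """Raised when an application_id has no policy entry."""


def select_policy(policy_doc, application_id):
    """Pick the policy block for a request.

    None (header absent) is the only route to the fallback block;
    any other ID must exist under "applications".
    """
    if application_id is None:
        return policy_doc["default"]
    try:
        return policy_doc["applications"][application_id]
    except KeyError:
        raise UnknownApplicationError(application_id) from None
```

With this rule, an application literally named "default" resolves to its own entry under `applications`, never to the fallback block.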

Configuration

Guardrails are controlled via environment variables with the GUARDRAILS_ prefix:

Variable	Pod	Default	Description
`GUARDRAILS_DISABLED`	API	`false`	Disables guardrails entirely. `/v1/check` still responds (returns `safe=true`). In distributed deployments, setting this on the worker alone has no effect because the worker only forwards requests to the API’s `/v1/check` endpoint.
`GUARDRAILS_MODEL_NAME`	API	`llama-guard-3-8b`	Model used by the `llama-guard-3` provider. Currently only `llama-guard-3-8b` is supported.
`GUARDRAILS_LABELS`	API		Inline JSON array of LlamaGuard category definitions (takes priority over `GUARDRAILS_LABELS_FILENAME`)
`GUARDRAILS_LABELS_FILENAME`	API		Path to a JSON file with LlamaGuard category definitions
`GUARDRAILS_POLICY_JSON`	API		Inline JSON with per-application policy config (takes priority over `GUARDRAILS_POLICY_FILE`)
`GUARDRAILS_POLICY_FILE`	API		Path to a JSON file with per-application policy config
`GUARDRAILS_ADMIN_API_ENABLED`	API	`false`	Expose the `/v1/admin/policies` CRUD endpoints. Off by default; requires Postgres (the standard database settings).
`GUARDRAILS_CHECK_URL`	Worker	`http://localhost:8000`	Base URL for the `/v1/check` endpoint the worker calls

Policy resolution priority: Postgres (if database is configured) > GUARDRAILS_POLICY_JSON > GUARDRAILS_POLICY_FILE > generated default policy from GUARDRAILS_LABELS.

The DB layer is highest-precedence so a runtime override beats a baked-in policy. When the DB has no row for an application_id (or the row is disabled), the chain falls through to the next layer; existing deployments without a DB row keep their current behaviour.
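
The fall-through can be pictured as an ordered list of lookups. This sketch is a simplification under stated assumptions: every layer is modelled as a per-application lookup returning a policy dict or None, and a disabled row is modelled as `"enabled": False`; neither detail is confirmed by this page.

```python
def resolve_policy(application_id, layers):
    """Return (layer_name, policy) from the first layer with an enabled policy.

    layers is an ordered list of (name, lookup) pairs, highest precedence first.
    """
    for name, lookup in layers:
        policy = lookup(application_id)
        if policy is not None and policy.get("enabled", True):
            return name, policy
    return None, None
```

A caller would order the layers as Postgres, inline JSON, file, and generated default to mirror the priority chain above.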

Helm deployments: ConfigMap-mounted policy

For Helm deployments, the recommended way to ship a per-application policy is the guardrails.policy value. The chart renders it into a ConfigMap and mounts it at /etc/guardrails/policy.json on the API pod (GUARDRAILS_POLICY_FILE is wired automatically):

# values.yaml
guardrails:
  policy:
    default:
      fail_mode: closed
      check_types:
        input:
          pipeline:
            - provider: llama-guard-3
              name: content-safety
              config: {}
    applications:
      legal-app:
        fail_mode: closed
        check_types:
          input:
            pipeline:
              - provider: regex
                name: pii-patterns
                config:
                  patterns:
                    - { name: steuer_id, pattern: '\b\d{11}\b', category: PII }
              - provider: llama-guard-3
                name: content-safety
                config: {}

Why use the ConfigMap path:

The policy is reviewable as native YAML in values.yaml (no JSON-in-string-in-env-var escape gymnastics).
A checksum/configmap-guardrails-policy annotation rolls the API pod whenever the policy changes.
ConfigMaps hold up to ~1 MiB; env-var size limits don’t bite.
Out-of-band edits to the ConfigMap (kubectl edit configmap …-api-guardrails-policy) do not automatically restart the API; run kubectl rollout restart deployment/<api-deployment> to pick them up. Hot-reload via file-watcher is not yet implemented.

GUARDRAILS_POLICY_JSON remains supported as an escape hatch (smoke tests, single-pod overrides). When both are set, the inline env var wins.

Postgres-backed policy + admin API

For installations that already use Postgres for history/conversations/pending requests, policies can additionally be configured at runtime through HTTP: no ConfigMap edits, no pod rollouts.

Turn it on by setting GUARDRAILS_ADMIN_API_ENABLED=true. The application_policies table is created by Alembic revision 008_application_policies (runs automatically on startup with the rest of the migrations).

The admin endpoints are scoped to admin principals only (User.is_admin == True); the feature flag is the gate of last resort, so leave it off unless your auth setup is hardened. Endpoints:

Method	Path	Description
`GET`	`/v1/admin/policies`	List all stored policies
`GET`	`/v1/admin/policies/{application_id}`	Fetch one, 404 if missing
`PUT`	`/v1/admin/policies/{application_id}`	Upsert; body is an `ApplicationPolicy` JSON
`DELETE`	`/v1/admin/policies/{application_id}`	Delete; 204 on success, 404 if missing

Example: upsert a policy for legal-app:

curl -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://your-api/v1/admin/policies/legal-app \
  -d '{
    "application_id": "legal-app",
    "fail_mode": "closed",
    "check_types": {
      "input": {
        "pipeline": [
          {"provider": "regex", "name": "pii", "config": {"patterns": [{"name": "steuer_id", "pattern": "\\b\\d{11}\\b", "category": "PII"}]}}
        ]
      }
    }
  }'

The path application_id is authoritative; if the JSON body carries a different application_id, the path value wins. IDs are capped at 253 characters (DNS-1123); longer values return 422.
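
The two rules in this paragraph (the path value wins, and IDs are capped at 253 characters) can be sketched as a small normalisation step. `normalize_upsert` is a hypothetical helper; the ValueError stands in for the 422 response.

```python
MAX_APPLICATION_ID_LENGTH = 253  # DNS-1123 cap stated above


def normalize_upsert(path_application_id, body):
    """Return the policy to store, with the path ID overriding any body ID."""
    if len(path_application_id) > MAX_APPLICATION_ID_LENGTH:
        raise ValueError("application_id too long")  # surfaced as 422 by the API
    return {**body, "application_id": path_application_id}
```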

GUARDRAILS_POLICY_JSON / GUARDRAILS_POLICY_FILE remain useful for bootstrap: ship an initial policy via env var, then migrate it into the DB through the admin API. Both env-var and file paths still serve as fallbacks for any application_id not present in the DB.

app-manager-gitops deployments

Platform only

app-manager-gitops descriptors set envVars.GUARDRAILS_POLICY_JSON to the inline JSON policy:

{
  "config": {
    "envVars": {
      "GUARDRAILS_POLICY_JSON": "{\"default\":{...},\"applications\":{...}}"
    }
  }
}

The platform-manager API supports config.volumes.configMap[] for mounting an existing ConfigMap, but exposes no endpoint for creating one; the only resource type currently provisionable via /tenants/{tenant}/resources is redis. Until that gap closes, gitops deployments stay on the inline env var.

Per-Application Policy

The policy JSON keys applications by ID and defines a pipeline per check type:

{
  "default": {
    "fail_mode": "closed",
    "check_types": {
      "input": {
        "pipeline": [
          {
            "provider": "llama-guard-3",
            "name": "content-safety",
            "enabled": true,
            "config": {}
          }
        ]
      }
    }
  },
  "applications": {
    "legal-app": {
      "fail_mode": "closed",
      "check_types": {
        "input": {
          "pipeline": [
            {
              "provider": "regex",
              "name": "pii-patterns",
              "config": {
                "patterns": [
                  {
                    "name": "steuer_id",
                    "pattern": "\\b\\d{11}\\b",
                    "category": "PII"
                  }
                ]
              }
            },
            {
              "provider": "llama-guard-3",
              "name": "content-safety",
              "config": {}
            }
          ]
        }
      }
    }
  }
}

Top-level keys:

Key	Description
`default`	Fallback policy block, reachable only when the request omits `application_id`.
`applications`	Map of `application_id` → policy.

Per-application fields:

Field	Type	Description
`fail_mode`	`"closed"` \| `"open"`	What to do on provider error. `closed` = reject (fail-secure); `open` = allow + log. Default `"closed"`.
`check_types`	object	Map of check type name (e.g. `"input"`) to its pipeline configuration.

Per-stage fields:

Field	Type	Description
`provider`	string	Provider registry key: `"llama-guard-3"`, `"regex"`, or `"llm-judge"`.
`name`	string	Unique stage name within the pipeline. Used in logs.
`enabled`	boolean	Toggle a stage off without removing it from the config. Default `true`.
`config`	object	Provider-specific configuration. See below.

Providers

`llama-guard-3`

LLM-based content safety classifier. The config block is currently ignored; LlamaGuard category selection is configured globally via GUARDRAILS_LABELS rather than per-stage.

`regex`

Pattern-based provider. Config schema:

{
  "patterns": [
    { "name": "steuer_id",     "pattern": "\\b\\d{11}\\b",                "category": "PII" },
    { "name": "javascript_url", "pattern": "javascript:",                  "category": "MaliciousURL" }
  ]
}

Each pattern has a name (used in logs only: pattern names and category names are logged, but never the matched substring), a pattern (Python regular expression), and a category (returned as the violation category). Multiple patterns sharing a category produce a single deduplicated violation.

Invalid regex patterns raise at check time and are handled by fail_mode like any other provider error.
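
The behaviour described above can be approximated in a few lines of Python. This is a sketch, not the provider's source: `regex_check` is a hypothetical name, and the use of `re.search` (match anywhere in the content) is an assumption.

```python
import re


def regex_check(content, config):
    """Return the deduplicated categories whose patterns match the content."""
    categories = []
    for entry in config["patterns"]:
        # re.compile raises re.error for an invalid pattern; the pipeline
        # would treat that as a provider error subject to fail_mode.
        compiled = re.compile(entry["pattern"])
        if compiled.search(content) and entry["category"] not in categories:
            categories.append(entry["category"])
    return categories


config = {
    "patterns": [
        {"name": "steuer_id", "pattern": r"\b\d{11}\b", "category": "PII"},
        {"name": "javascript_url", "pattern": "javascript:", "category": "MaliciousURL"},
    ]
}
```

Note that the function returns only category names; neither the pattern name nor the matched substring leaves the provider.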

`llm-judge`

LLM-based judge that applies a natural-language policy supplied by the application manager. Each judge stage takes a template (the policy text) and a model (a Responses-API model name; see the config schema below). The provider wraps both in a server-controlled prompt envelope and parses a strict SAFE / UNSAFE first-line verdict.

Config schema:

Field	Type	Required	Default	Notes
`model`	string	yes		Any model the inference backend can serve. The platform does not gatekeep model selection; an unknown model surfaces as a backend error at check time.
`template`	string	yes		Free-form NL policy. Length 20..2000 chars. Cannot contain reserved envelope tokens or control characters.
`violation_category`	string	no	`"Custom"`	Returned as the `Violation.category` on UNSAFE. Matches `^[A-Za-z0-9 _-]{1,64}$`.
`max_input_chars`	int	no	`8000`	Per-stage user-content cap, clamped against `GUARDRAILS_JUDGE_MAX_INPUT_CHARS`.

Example stage:

{
  "provider": "llm-judge",
  "name": "stay-on-topic",
  "config": {
    "model": "judge-1",
    "template": "Reject any message not related to legal advice in Germany.",
    "violation_category": "Off-Topic"
  }
}

Templates are open-ended natural-language strings; the same schema covers stay-on-topic checks, natural-language denylists ("reject any message that asks about competitors"), tone policies ("reject aggressive or threatening messages"), and anything else expressible as a one-paragraph policy. The provider does not know which kind of check the manager wrote.

Operator setup. One env var controls the per-stage user-content cap:

GUARDRAILS_JUDGE_MAX_INPUT_CHARS: upper bound on per-stage max_input_chars (default 8000).

There is no operator-side allowlist for judge models. Application managers may pick any model the backend can serve; an unknown-model error is handled by fail_mode like any other provider error.

Threat model.

Manager-supplied templates. Acceptance-time validation rejects unknown config keys, oversize templates, control characters, and embedded envelope tokens; suspicious phrases (ignore previous, override, …) surface as warnings on PUT /v1/admin/policies/{application_id}. A manager who writes a permissive template degrades only their own app; the platform’s invariant is the response shape, not policy quality.
Malicious end-user content. Reserved envelope tokens in user content are HTML-encoded before the model sees them, so jailbreak attempts cannot syntactically break out of the user-message block. A sandwich reminder after the user content restates the classifier instruction, and the strict first-line parser rejects anything other than SAFE or UNSAFE. Under fail_mode: closed, a malformed verdict becomes a provider_error violation rather than a free pass; recommended for high-stakes apps.
Cost overrun. Templates are capped at 2000 chars and user content at max_input_chars; one judge call is one scheduler round-trip. Put cheap stages (regex) ahead of the judge to short-circuit before the LLM call.
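
Three of the mechanisms above (the input cap, HTML-encoding of user content, and the strict first-line parser) can be sketched as follows. All function names are hypothetical. The reserved envelope tokens are not listed on this page, so the sketch simply HTML-encodes the whole input; truncating to the cap, rather than rejecting oversize input, is likewise an assumption.

```python
import html


def effective_input_cap(stage_max_input_chars, operator_max=8000):
    """Clamp the per-stage cap against the operator-wide upper bound."""
    return min(stage_max_input_chars, operator_max)


def prepare_user_content(content, cap):
    """Truncate to the cap, then HTML-encode so tag-like tokens become inert text."""
    return html.escape(content[:cap])


def parse_verdict(model_output):
    """Return True for SAFE, False for UNSAFE; raise on anything else.

    Only the first line is inspected, so trailing explanations are ignored
    and a verdict buried in later text does not count.
    """
    first_line = model_output.split("\n", 1)[0].strip()
    if first_line == "SAFE":
        return True
    if first_line == "UNSAFE":
        return False
    raise ValueError("malformed verdict")
```

Under fail_mode: closed, the ValueError would be reported as a provider_error violation; under fail_mode: open, the request would be allowed and logged.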

Default Safety Categories (LlamaGuard)

When no custom labels are configured, the llama-guard-3 provider evaluates against the following categories:

#	Category	Description
S1	Violent Crimes	Terrorism, murder, assault, kidnapping, animal abuse
S2	Non-Violent Crimes	Fraud, scams, hacking, drug trafficking, weapons offenses
S3	Sex Crimes	Human trafficking, sexual assault, sexual harassment
S4	Child Exploitation	Child sexual abuse material or depictions
S5	Defamation	Verifiably false statements about real people
S6	Specialized Advice	Unqualified financial, medical, or legal advice
S7	Privacy	Disclosure of private individuals' sensitive information
S8	Intellectual Property	Content violating third-party IP rights
S9	Indiscriminate Weapons	Weapons of mass destruction (chemical, biological, nuclear)
S10	Hate	Hate speech based on protected characteristics
S11	Self-Harm	Suicide, self-injury, disordered eating
S12	Sexual Content	Explicit sexual depictions or erotic descriptions
S13	Elections	False information about electoral systems and voting
S14	Code Interpreter Abuse	Denial of service, container escapes, privilege escalation
S15	Profanity	Vulgar, offensive, or impolite language
S16	Prompt Injection Attack	Attempts to override system instructions or extract prompts

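Custom category definitions (supplied via GUARDRAILS_LABELS or GUARDRAILS_LABELS_FILENAME) use the following fields:

Field	Type	Required	Description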
`name`	string	Yes	Display name of the category
`description`	string	Yes	Description used in the safety prompt
`enabled`	boolean	No	Whether this category is active (default `true`)
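
As a sketch of how a value for GUARDRAILS_LABELS might be produced, the helper below validates the required fields and serialises the list to compact JSON. It assumes the variable holds an array of objects with exactly the fields in the table above; `render_labels_env` and the example category are hypothetical.

```python
import json


def render_labels_env(labels):
    """Validate category definitions and return them as a compact JSON array."""
    for label in labels:
        for required in ("name", "description"):
            if not label.get(required):
                raise ValueError(f"category is missing required field: {required}")
    return json.dumps(labels, separators=(",", ":"))


labels = [
    {"name": "Competitor Mentions",
     "description": "Requests for comparisons with competing products",
     "enabled": True},
]
env_value = render_labels_env(labels)
```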

Violation Response (Responses API)

When a guardrail violation blocks a POST /v1/responses request, the API returns HTTP 400 with:

{
  "error": {
    "message": "Guardrail violations found: Violent Crimes, Hate",
    "type": "invalid_request_error",
    "param": null,
    "code": "content_policy_violation"
  }
}

The message field lists the specific categories that were violated. For background=true requests, the pending record is marked as failed with the violation reason instead.
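
A client might detect a guardrail block and recover the category names as sketched below. The function name is hypothetical. Checking `code` is the more robust signal; splitting `message` relies on the exact wording shown in the example above and should be treated as best-effort.

```python
def violated_categories(error_body):
    """Return the violated categories, or None if this is not a guardrail error."""
    error = error_body.get("error", {})
    if error.get("code") != "content_policy_violation":
        return None
    prefix = "Guardrail violations found: "
    message = error.get("message", "")
    if not message.startswith(prefix):
        return []  # guardrail block, but the message format was not recognised
    return [part.strip() for part in message[len(prefix):].split(",") if part.strip()]
```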