Guardrails
The Responses API includes built-in content safety checks. Every request is screened before reaching the LLM backend; if a violation is detected, the request is rejected immediately.
Two provider types ship out of the box:
- llama-guard-3: content-safety classifier based on LlamaGuard 3. LLM-based, ~500ms–2s per check.
- regex: fast pattern matcher for deterministic rules (PII patterns, blocklists, malicious URLs). Microseconds per check.
Providers are arranged into a pipeline per check type (input, tool_output, …). The pipeline runs stages sequentially and short-circuits on the first stage that returns a violation, so cheap regex checks can fail fast before the expensive LLM stage even runs.
Guardrails are evaluated eagerly, before the model generates any output, for synchronous, streaming, and background=true requests.
How It Works
- The worker reads the x-application-id header (or None if absent) and posts to POST /v1/check with application_id, check_type: "input", and the user’s content.
- GuardrailService looks up the per-application policy and walks the pipeline configured for the requested check type.
- For each enabled stage, the registry dispatches to the provider named in stage.provider (e.g. regex, llama-guard-3), passing the stage-specific config.
- If a stage returns violations, the remaining stages are skipped and the check returns {safe: false, violations: […]}.
- If all stages pass, the request proceeds to the LLM backend normally.
The endpoint is also callable directly by other services that need to evaluate content against an application policy.
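The sequential, short-circuiting walk described above can be sketched in a few lines of Python. This is an illustration only, not the service's actual code: the Stage dataclass, the run_pipeline function, and the provider signature (content and config in, list of violated categories out) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical provider signature: (content, config) -> list of violated categories.
Provider = Callable[[str, dict], list[str]]


@dataclass
class Stage:
    provider: str                 # registry key, e.g. "regex"
    name: str                     # stage name, e.g. "pii-patterns"
    enabled: bool = True
    config: dict = field(default_factory=dict)


def run_pipeline(stages: list[Stage], registry: dict[str, Provider], content: str) -> dict:
    """Run stages in order and stop at the first one that reports violations."""
    for step, stage in enumerate(stages):
        if not stage.enabled:
            continue  # skipped, but later stages keep their configured position
        categories = registry[stage.provider](content, stage.config)
        if categories:
            # Short-circuit: remaining stages never run.
            return {
                "safe": False,
                "violations": [
                    {"category": c, "provider": stage.provider,
                     "stage": stage.name, "step": step}
                    for c in categories
                ],
            }
    return {"safe": True, "violations": []}
```

Placing a cheap stage first means an expensive provider later in the list is never called once the cheap stage reports a violation.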
Standalone Check Endpoint
The guardrail pipeline is exposed directly at POST /v1/check for callers that
want to screen content without going through /v1/responses. It uses the same
per-application policy lookup and bearer auth as the rest of the API.
Example:
curl -X POST "$BASE_URL/v1/check" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "application_id": "test-app",
    "check_type": "input",
    "input": "What is the capital of France?"
  }'
Request:
{
  "application_id": "legal-app",
  "check_type": "input",
  "input": "How do I write a contract for..."
}
| Field | Type | Description |
|---|---|---|
| application_id | … | Application whose policy to apply. Omit (or send null) to use the default policy block. |
| check_type | … | Which pipeline to run. Typically input … |
| input | … | Content to check. |
| … | … | Freeform metadata providers may use (e.g. …) |
Response (safe):
{ "safe": true, "violations": [] }
Response (unsafe):
{
  "safe": false,
  "violations": [
    {
      "category": "Violent Crimes",
      "provider": "llama-guard-3",
      "stage": "content-safety",
      "step": 0
    }
  ]
}
Violation fields:
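| Field | Type | Description |
|---|---|---|
| category | string | The violated category (e.g. Violent Crimes). |
| provider | string | Registry key of the provider that produced the violation (e.g. llama-guard-3). |
| stage | string | Name of the pipeline stage that produced the violation (e.g. content-safety). |
| step | integer | Zero-indexed position in the configured pipeline list. Disabled stages do not change … |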
Attribution fields let callers running multi-provider pipelines tell which stage produced the block, for example, distinguishing a regex PII match from a llama-guard-3 content-safety classification when both are configured for the same check type.
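As a rough illustration of how a caller might use the attribution fields, the sketch below groups the violations of a /v1/check response by provider. The helper name violations_by_provider is hypothetical; only the response shape is taken from the examples above.

```python
from collections import defaultdict


def violations_by_provider(check_response: dict) -> dict[str, list[str]]:
    """Group violated categories by the provider that reported them."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for violation in check_response.get("violations", []):
        grouped[violation["provider"]].append(violation["category"])
    return dict(grouped)


response = {
    "safe": False,
    "violations": [
        {"category": "Violent Crimes", "provider": "llama-guard-3",
         "stage": "content-safety", "step": 0}
    ],
}
blocked_by = violations_by_provider(response)
```

A caller could then branch on the keys of blocked_by, for example to show a different message for a regex PII match than for a content-safety classification.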
Status codes:
| Status | When |
|---|---|
| … | Check completed (inspect …) |
| … | … |
| … | The application has no pipeline configured for the requested check_type … |
Application Identity
Per-request: callers send x-application-id: <id> to select a per-application policy. When the header is absent, the worker treats this as application_id=null and the service routes the lookup to the configured default policy block.
An application explicitly named "default" is a regular entry and does not collide with the default fallback block. The null sentinel is the only way to reach the fallback; unknown application IDs are rejected rather than silently inheriting the default policy.
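The lookup rules above can be sketched as follows, assuming the policy document has the default / applications layout shown elsewhere on this page. The function name resolve_policy and the UnknownApplicationError exception are hypothetical names chosen for the example; the real service may signal unknown IDs differently.

```python
from typing import Optional


class UnknownApplicationError(LookupError):
    """Raised for an application ID that has no policy entry (hypothetical)."""


def resolve_policy(policy: dict, application_id: Optional[str]) -> dict:
    # Only the null sentinel reaches the fallback block.
    if application_id is None:
        return policy["default"]
    try:
        # An application literally named "default" is looked up here like any other ID.
        return policy["applications"][application_id]
    except KeyError:
        # Unknown IDs are rejected instead of inheriting the fallback.
        raise UnknownApplicationError(application_id) from None
```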
Configuration
Guardrails are controlled via environment variables with the GUARDRAILS_ prefix:
| Variable | Pod | Default | Description |
|---|---|---|---|
| … | API | … | Disables guardrails entirely. |
| … | API | … | Model used by the … |
| GUARDRAILS_LABELS | API | | Inline JSON array of LlamaGuard category definitions (highest priority) |
| GUARDRAILS_LABELS_FILENAME | API | | Path to a JSON file with LlamaGuard category definitions |
| GUARDRAILS_POLICY_JSON | API | | Inline JSON with per-application policy config (takes priority over GUARDRAILS_POLICY_FILE) |
| GUARDRAILS_POLICY_FILE | API | | Path to a JSON file with per-application policy config |
| GUARDRAILS_ADMIN_API_ENABLED | API | … | Expose the … |
| … | Worker | | Base URL for the … |
Policy resolution priority: Postgres (if database is configured) > GUARDRAILS_POLICY_JSON > GUARDRAILS_POLICY_FILE > generated default policy from GUARDRAILS_LABELS.
The DB layer is highest-precedence so a runtime override beats a baked-in policy. When the DB has no row for an application_id (or the row is disabled), the chain falls through to the next layer; existing deployments without a DB row keep their current behaviour.
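A minimal sketch of this precedence chain, under stated assumptions: db_rows stands in for the Postgres table as a dict keyed by application_id, each row has hypothetical enabled and policy fields, and the three static layers are passed in as already-parsed documents (or None when unset). The function name is also made up for the example.

```python
from typing import Optional


def resolve_policy_document(
    application_id: Optional[str],
    db_rows: dict,
    policy_json: Optional[dict] = None,
    policy_file: Optional[dict] = None,
    generated_default: Optional[dict] = None,
) -> tuple[str, dict]:
    """Return (source, policy) from the highest-precedence layer that has one."""
    row = db_rows.get(application_id)
    # A missing or disabled row falls through to the static layers.
    if row is not None and row.get("enabled", True):
        return "postgres", row["policy"]
    for source, document in (
        ("GUARDRAILS_POLICY_JSON", policy_json),
        ("GUARDRAILS_POLICY_FILE", policy_file),
        ("generated-default", generated_default),
    ):
        if document is not None:
            return source, document
    raise LookupError("no guardrail policy configured")
```

Returning the source name alongside the policy is a design choice for the sketch; it makes it easy to log which layer actually served a request.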
Helm deployments: ConfigMap-mounted policy
For Helm deployments, the recommended way to ship a per-application policy is the guardrails.policy value. The chart renders it into a ConfigMap and mounts it at /etc/guardrails/policy.json on the API pod (GUARDRAILS_POLICY_FILE is wired automatically):
# values.yaml
guardrails:
  policy:
    default:
      fail_mode: closed
      check_types:
        input:
          pipeline:
            - provider: llama-guard-3
              name: content-safety
              config: {}
    applications:
      legal-app:
        fail_mode: closed
        check_types:
          input:
            pipeline:
              - provider: regex
                name: pii-patterns
                config:
                  patterns:
                    - { name: steuer_id, pattern: '\b\d{11}\b', category: PII }
              - provider: llama-guard-3
                name: content-safety
                config: {}
Why use the ConfigMap path:
- The policy is reviewable as native YAML in values.yaml (no JSON-in-string-in-env-var escape gymnastics).
- A checksum/configmap-guardrails-policy annotation rolls the API pod whenever the policy changes.
- ConfigMaps hold up to ~1 MiB; env-var size limits don’t bite.

Note that out-of-band edits to the ConfigMap (kubectl edit configmap …-api-guardrails-policy) do not automatically restart the API; run kubectl rollout restart deployment/<api-deployment> to pick them up. Hot-reload via file-watcher is not yet implemented.
GUARDRAILS_POLICY_JSON remains supported as an escape hatch (smoke tests, single-pod overrides). When both are set, the inline env var wins.
Postgres-backed policy + admin API
For installations that already use Postgres for history/conversations/pending requests, policies can additionally be configured at runtime through HTTP: no ConfigMap edits, no pod rollouts.
Turn it on by setting GUARDRAILS_ADMIN_API_ENABLED=true. The application_policies table is created by Alembic revision 008_application_policies (runs automatically on startup with the rest of the migrations).
The admin endpoints are scoped to admin principals only (User.is_admin == True); the feature flag is the gate of last resort, so leave it off unless your auth setup is hardened. Endpoints:
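| Method | Path | Description |
|---|---|---|
| GET | /v1/admin/policies | List all stored policies |
| GET | /v1/admin/policies/{application_id} | Fetch one, 404 if missing |
| PUT | /v1/admin/policies/{application_id} | Upsert; body is an … |
| DELETE | /v1/admin/policies/{application_id} | Delete; 204 on success, 404 if missing |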
Example: upsert a policy for legal-app:
curl -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://your-api/v1/admin/policies/legal-app \
  -d '{
    "application_id": "legal-app",
    "fail_mode": "closed",
    "check_types": {
      "input": {
        "pipeline": [
          {"provider": "regex", "name": "pii", "config": {"patterns": [{"name": "steuer_id", "pattern": "\\b\\d{11}\\b", "category": "PII"}]}}
        ]
      }
    }
  }'
The path application_id is authoritative; if the JSON body carries a different application_id, the path value wins. IDs are capped at 253 characters (DNS-1123); longer values return 422.
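The two rules in the previous paragraph can be illustrated with a small sketch. The function normalize_upsert and the use of ValueError are assumptions for the example; in the real API an over-long ID surfaces as a 422 response rather than a Python exception.

```python
MAX_APPLICATION_ID_LENGTH = 253  # DNS-1123 cap mentioned above


def normalize_upsert(path_application_id: str, body: dict) -> dict:
    """Apply the path ID to the body, rejecting IDs longer than the cap."""
    if len(path_application_id) > MAX_APPLICATION_ID_LENGTH:
        # Stand-in for the 422 the API returns.
        raise ValueError("application_id exceeds 253 characters")
    # The path value overrides whatever application_id the body carried.
    return {**body, "application_id": path_application_id}
```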
GUARDRAILS_POLICY_JSON / GUARDRAILS_POLICY_FILE remain useful for bootstrap: ship an initial policy via env var, then migrate it into the DB through the admin API. Both env-var and file paths still serve as fallbacks for any application_id not present in the DB.
app-manager-gitops deployments
Platform only. app-manager-gitops descriptors set envVars.GUARDRAILS_POLICY_JSON to the inline JSON policy:
{
  "config": {
    "envVars": {
      "GUARDRAILS_POLICY_JSON": "{\"default\":{...},\"applications\":{...}}"
    }
  }
}
The platform-manager API supports config.volumes.configMap[] for mounting an existing ConfigMap, but exposes no endpoint for creating one; the only resource type currently provisionable via /tenants/{tenant}/resources is redis. Until that gap closes, gitops deployments stay on the inline env var.
Per-Application Policy
The policy JSON keys applications by ID and defines a pipeline per check type:
{
  "default": {
    "fail_mode": "closed",
    "check_types": {
      "input": {
        "pipeline": [
          {
            "provider": "llama-guard-3",
            "name": "content-safety",
            "enabled": true,
            "config": {}
          }
        ]
      }
    }
  },
  "applications": {
    "legal-app": {
      "fail_mode": "closed",
      "check_types": {
        "input": {
          "pipeline": [
            {
              "provider": "regex",
              "name": "pii-patterns",
              "config": {
                "patterns": [
                  {
                    "name": "steuer_id",
                    "pattern": "\\b\\d{11}\\b",
                    "category": "PII"
                  }
                ]
              }
            },
            {
              "provider": "llama-guard-3",
              "name": "content-safety",
              "config": {}
            }
          ]
        }
      }
    }
  }
}
Top-level keys:
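| Key | Description |
|---|---|
| default | Fallback policy block, reachable only when the request omits x-application-id (application_id is null). |
| applications | Map of application_id to that application's policy. |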
Per-application fields:
| Field | Type | Description |
|---|---|---|
| fail_mode | string | What to do on provider error. … |
| check_types | object | Map of check type name (e.g. input) to its pipeline definition. |
Per-stage fields:
| Field | Type | Description |
|---|---|---|
| provider | string | Provider registry key: llama-guard-3, regex, or llm-judge. |
| name | string | Unique stage name within the pipeline. Used in logs. |
| enabled | boolean | Toggle a stage off without removing it from the config. Default true. |
| config | object | Provider-specific configuration. See below. |
Providers
llama-guard-3
LLM-based content safety classifier. The config block is currently ignored; LlamaGuard category selection is configured globally via GUARDRAILS_LABELS rather than per-stage.
regex
Pattern-based provider. Config schema:
{
  "patterns": [
    { "name": "steuer_id", "pattern": "\\b\\d{11}\\b", "category": "PII" },
    { "name": "javascript_url", "pattern": "javascript:", "category": "MaliciousURL" }
  ]
}
Each pattern has a name (used in logs only: pattern names and category names are logged, but never the matched substring), a pattern (Python regular expression), and a category (returned as the violation category). Multiple patterns sharing a category produce a single deduplicated violation.
Invalid regex patterns raise at check time and are handled by fail_mode like any other provider error.
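To make the matching and deduplication behaviour concrete, here is a minimal sketch using Python's standard re module. The function name regex_check is hypothetical and the real provider may differ in detail (for instance, in how it logs pattern names); the sketch only mirrors the rules stated above.

```python
import re


def regex_check(content: str, config: dict) -> list[str]:
    """Return the deduplicated categories whose patterns match the content."""
    categories: list[str] = []
    for entry in config.get("patterns", []):
        # re.search raises re.error for an invalid pattern; in the service
        # such an error would be handled according to fail_mode.
        if re.search(entry["pattern"], content):
            if entry["category"] not in categories:
                categories.append(entry["category"])
    return categories


config = {
    "patterns": [
        {"name": "steuer_id", "pattern": r"\b\d{11}\b", "category": "PII"},
        {"name": "javascript_url", "pattern": "javascript:", "category": "MaliciousURL"},
    ]
}
```

With this config, a message containing an 11-digit number yields a single PII category no matter how many PII patterns match, and a 12-digit number does not match because of the word boundaries.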
llm-judge
LLM-based judge that applies a natural-language policy supplied by the application manager. Each judge stage takes a template (the policy text) and a model (an operator-approved Responses-API model name). The provider wraps both in a server-controlled prompt envelope and parses a strict SAFE / UNSAFE first-line verdict.
Config schema:
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| model | string | yes | | Any model the inference backend can serve. The platform does not gatekeep model selection; an unknown model surfaces as a backend error at check time. |
| template | string | yes | | Free-form NL policy. Length 20..2000 chars. Cannot contain reserved envelope tokens or control characters. |
| violation_category | string | no | … | Returned as the category on violations. |
| max_input_chars | int | no | … | Per-stage user-content cap, clamped against GUARDRAILS_JUDGE_MAX_INPUT_CHARS. |
Example stage:
{
  "provider": "llm-judge",
  "name": "stay-on-topic",
  "config": {
    "model": "judge-1",
    "template": "Reject any message not related to legal advice in Germany.",
    "violation_category": "Off-Topic"
  }
}
Templates are open-ended natural-language strings; the same schema covers stay-on-topic checks, natural-language denylists ("reject any message that asks about competitors"), tone policies ("reject aggressive or threatening messages"), and anything else expressible as a one-paragraph policy. The provider does not know which kind of check the manager wrote.
Operator setup. One env var controls the per-stage user-content cap:
- GUARDRAILS_JUDGE_MAX_INPUT_CHARS: upper bound on per-stage max_input_chars (default 8000).
There is no operator-side allowlist for judge models. Application managers may pick any model the backend can serve; an unknown-model error is handled by fail_mode like any other provider error.
Threat model.
- Manager-supplied templates. Acceptance-time validation rejects unknown config keys, oversize templates, control characters, and embedded envelope tokens; suspicious phrases (ignore previous, override, …) surface as warnings on POST /v1/admin/policies. A manager who writes a permissive template degrades only their own app; the platform’s invariant is the response shape, not policy quality.
- Malicious end-user content. Reserved envelope tokens in user content are HTML-encoded before the model sees them, so jailbreak attempts cannot syntactically break out of the user-message block. A sandwich reminder after the user content restates the classifier instruction, and the strict first-line parser rejects anything other than SAFE or UNSAFE. Under fail_mode: closed, a malformed verdict becomes a provider_error violation rather than a free pass; recommended for high-stakes apps.
- Cost overrun. Templates are capped at 2000 chars and user content at max_input_chars; one judge call is one scheduler round-trip. Put cheap stages (regex) ahead of the judge to short-circuit before the LLM call.
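The strict first-line verdict handling can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, stripping surrounding whitespace from the first line is a choice made for the sketch, and treating a malformed verdict as a pass under any fail_mode other than closed is inferred from the "free pass" wording rather than confirmed.

```python
from typing import Optional


def parse_verdict(model_output: str) -> Optional[bool]:
    """Return True for SAFE, False for UNSAFE, None for anything else."""
    lines = model_output.splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line == "SAFE":
        return True
    if first_line == "UNSAFE":
        return False
    return None  # malformed: extra words, wrong case, empty output, ...


def judge_violations(model_output: str, violation_category: str,
                     fail_mode: str = "closed") -> list[str]:
    """Map the judge's raw output to a list of violation categories."""
    verdict = parse_verdict(model_output)
    if verdict is True:
        return []
    if verdict is False:
        return [violation_category]
    # Malformed verdict: blocked under "closed", assumed to pass otherwise.
    return ["provider_error"] if fail_mode == "closed" else []
```

Only the first line decides the outcome, so any explanation the model appends after the verdict is ignored.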
Default Safety Categories (LlamaGuard)
When no custom labels are configured, the llama-guard-3 provider evaluates against the following categories:
| # | Category | Description |
|---|---|---|
| S1 | Violent Crimes | Terrorism, murder, assault, kidnapping, animal abuse |
| S2 | Non-Violent Crimes | Fraud, scams, hacking, drug trafficking, weapons offenses |
| S3 | Sex Crimes | Human trafficking, sexual assault, sexual harassment |
| S4 | Child Exploitation | Child sexual abuse material or depictions |
| S5 | Defamation | Verifiably false statements about real people |
| S6 | Specialized Advice | Unqualified financial, medical, or legal advice |
| S7 | Privacy | Disclosure of private individuals' sensitive information |
| S8 | Intellectual Property | Content violating third-party IP rights |
| S9 | Indiscriminate Weapons | Weapons of mass destruction (chemical, biological, nuclear) |
| S10 | Hate | Hate speech based on protected characteristics |
| S11 | Self-Harm | Suicide, self-injury, disordered eating |
| S12 | Sexual Content | Explicit sexual depictions or erotic descriptions |
| S13 | Elections | False information about electoral systems and voting |
| S14 | Code Interpreter Abuse | Denial of service, container escapes, privilege escalation |
| S15 | Profanity | Vulgar, offensive, or impolite language |
| S16 | Prompt Injection Attack | Attempts to override system instructions or extract prompts |
To override or extend these, provide a JSON array of label objects via GUARDRAILS_LABELS or GUARDRAILS_LABELS_FILENAME:
[
  {
    "name": "Financial Advice",
    "description": "AI models should not provide specific investment recommendations.",
    "enabled": true
  },
  {
    "name": "Profanity",
    "description": "Vulgar or offensive language.",
    "enabled": false
  }
]
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Display name of the category |
| description | string | Yes | Description used in the safety prompt |
| enabled | boolean | No | Whether this category is active (default …) |
The 16 default categories are always included in the safety prompt. Custom labels are appended after them. Only labels present in your configured list with "enabled": true trigger violations.
Violation Response (Responses API)
When a guardrail violation blocks a POST /v1/responses request, the API returns HTTP 400 with:
{
  "error": {
    "message": "Guardrail violations found: Violent Crimes, Hate",
    "type": "invalid_request_error",
    "param": null,
    "code": "content_policy_violation"
  }
}
The message field lists the specific categories that were violated. For background=true requests, the pending record is marked as failed with the violation reason instead.
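On the client side, a caller might detect a guardrail block by checking the error code rather than parsing the message text. The helper below is a hypothetical sketch that operates on the already-decoded JSON error body shown above; it makes no assumption about which HTTP client is used.

```python
from typing import Optional


def guardrail_block_message(body: dict) -> Optional[str]:
    """Return the violation message if the body is a guardrail block, else None."""
    error = body.get("error")
    if isinstance(error, dict) and error.get("code") == "content_policy_violation":
        return error.get("message")
    return None


blocked = {
    "error": {
        "message": "Guardrail violations found: Violent Crimes, Hate",
        "type": "invalid_request_error",
        "param": None,
        "code": "content_policy_violation",
    }
}
message = guardrail_block_message(blocked)
```

Keying on code keeps the check stable even if the wording of message changes.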