PhariaAI v1.251200.0
| Release Version | v1.251200.0 |
|---|---|
| Release Date | December 08, 2025 |
| Availability | On-premise and hosted |
- PhariaAssistant
- PhariaStudio
- PhariaOS
  - Distributing On-Premise Load to External Inference APIs: Route excess traffic to external backends while keeping a single model name
  - Limiting Parallel Requests and Throttling for External API Connectors: Maintain stability when external backends enforce rate limits
  - Prometheus Metrics for External API Connectors: Gain visibility into token usage and throughput
- Other updates
PhariaAssistant
This release introduces the Answer Formatting Mode in PhariaAssistant Chat, a configuration that lowers inference costs and improves response times for simple conversational queries.
Answer Formatting Mode: Optimize response generation
| Availability | On-premise and hosted |
|---|---|
This release introduces a new configuration option that reduces inference costs and latency for simple chat queries. When enabled, the model can respond directly to simple questions instead of using the two-step Final Answer Tool process, cutting inference calls in half for basic conversational exchanges.
Here is how to enable it:
| Setting | `PHARIA_CHAT_ANSWER_FORMATTING_MODE` |
|---|---|
| Default | `always` |
| Options | `always`, `on_tool_availability` |
In the Helm Chart:
```yaml
env:
  PHARIA_CHAT_ANSWER_FORMATTING_MODE: "on_tool_availability"
```
In the `POST /conversations/{conversation_id}/messages/v2` API request payload:

```json
{
  "answer_formatting_mode": "on_tool_availability"
}
```
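As a usage sketch, the setting can also be sent on a single request. Only the endpoint path pattern and the `answer_formatting_mode` field come from this release; the host, token, conversation ID, and the `content` field name are assumptions and may differ in your deployment:

```bash
# Sketch only: host, token, conversation ID, and the "content" field name
# are placeholders/assumptions; answer_formatting_mode is the setting
# introduced in this release.
curl -X POST "https://pharia.example.com/conversations/<conversation_id>/messages/v2" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
        "content": "Hello!",
        "answer_formatting_mode": "on_tool_availability"
      }'
```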
Key Benefits
- Reduced latency: Simple queries (for example, "Hello!") complete faster with a single inference call instead of two.
- Lower compute costs: Fewer inference calls for conversational exchanges without tool usage.
- Backward compatible: The default `always` mode preserves existing behavior.
The following table summarizes the impact of each mode:

| Query Type | `always` | `on_tool_availability` |
|---|---|---|
| Simple chat | 2 inference calls | 1 inference call |
| Tool-assisted | 2+ inference calls | 2+ inference calls |
| Citations | Always available | Only when tools are called |

Note: Inline citations are only generated when the Final Answer Tool is used. With `on_tool_availability`, simple queries without tools will not include citation markup, which is expected since there are no sources to cite.
The PhariaAssistant team recommends the following setup:
- Use `always` for enterprise deployments requiring consistent response formatting and citations in all tool-assisted responses.
- Use `on_tool_availability` for cost and latency optimization when handling mixed simple chat and tool-assisted workloads.
PhariaStudio
With this update, the data platform component of PhariaStudio has received a significant improvement to filter index creation in Qdrant that reduces both query latency and CPU usage. These enhancements improve the overall stability and performance of Qdrant deployments, especially under high load or complex filtering conditions.
Action Required: If you previously created filter indexes for custom metadata fields, you need to recreate those indexes to take advantage of the performance improvements. Indexes created prior to this update do not automatically inherit the enhancements. For guidance on recreating filter indexes, see the updated PhariaSearch API documentation, under Filter Index.
PhariaOS
This release introduces several advancements in PhariaInference API that strengthen performance, stability, and operational transparency for teams running hybrid on-premise and external model workloads.
Distributing On-Premise Load to External Inference APIs: Route excess traffic to external backends while keeping a single model name
This feature enables you to handle load spikes by forwarding queued on-premise inference tasks to an external inference API. Users continue interacting with a single unified model name, while the platform distributes work across local and external resources.
Key Benefits:
- Unified model for both local and external inference.
- Automatic overflow of requests when local throughput is saturated.
- Optional control of forwarding behavior via `min_task_age_millis` and `max_tasks_in_flight` (see the configuration sketch below).
For extensive technical details, see the related documentation page.
Note: External connectors must point to an identical model variant; mixed variants result in inconsistent responses. The connector cannot adopt a model if an external connector is already registered under the same name.
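As an illustration of the tuning knobs, a connector configuration might look roughly like the sketch below. Only `min_task_age_millis` and `max_tasks_in_flight` are taken from this release; all other keys, names, and the overall layout are assumptions, so refer to the linked documentation for the actual schema.

```yaml
# Illustrative sketch only: every key except min_task_age_millis and
# max_tasks_in_flight is an assumption; see the documentation for the
# actual connector configuration schema.
external_connectors:
  - model_name: "my-unified-model"      # same name users already request locally
    endpoint: "https://external-inference.example.com/v1"
    # Only forward tasks that have already waited this long in the local queue.
    min_task_age_millis: 500
    # Cap the number of requests concurrently in flight against the external API.
    max_tasks_in_flight: 8
```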
Limiting Parallel Requests and Throttling for External API Connectors: Maintain stability when external backends enforce rate limits
This feature ensures that when an external inference API becomes rate-limited, the scheduler coordinates request flow safely. It pauses, resumes, and redistributes tasks without interrupting users, while still leveraging the scheduler’s queueing mechanisms.
Key Benefits:
- Safe handling of external rate limits.
- Automatic pause/resume based on the rate limit headers of the external API.
- On-premise workers continue processing while external capacity is restricted.
For extensive technical details, see the related documentation page.
Prometheus Metrics for External API Connectors: Gain visibility into token usage and throughput
This feature introduces dedicated Prometheus metrics that allow you to track throughput, input tokens, and output tokens for external API connectors. These metrics give operators clearer insight into usage patterns and load characteristics.
Key Benefits:
- Improved observability for external API connector performance.
- Transparent measurement of input/output token volume.
- Throughput metric aligned with the industry-standard tokens-per-second interpretation.
The following metrics are exposed:

- `as_combined_throughput_hist`
  - Value: (input_tokens + 5 × output_tokens) / 6 / duration_seconds
  - Labels: `model_name`, `request_type`, `stream`
- `as_output_tokens_counter`
  - Total output tokens
  - Label: `model_name`
- `as_input_tokens_counter`
  - Total input tokens
  - Label: `model_name`
Note: All metrics report per-model values to enable fine-grained monitoring across connectors.
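As a usage illustration, these metrics can be queried from Prometheus. A minimal PromQL sketch, assuming the counters follow standard Prometheus counter semantics and the histogram exposes the usual `_bucket` series:

```promql
# Output and input tokens per second over the last five minutes, per model.
sum by (model_name) (rate(as_output_tokens_counter[5m]))
sum by (model_name) (rate(as_input_tokens_counter[5m]))

# 95th percentile of combined throughput, assuming the standard Prometheus
# _bucket series is exposed for as_combined_throughput_hist.
histogram_quantile(0.95,
  sum by (le, model_name) (rate(as_combined_throughput_hist_bucket[5m])))
```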
Other updates
- PhariaInference API now supports a new vLLM worker image based on vLLM v0.12.0.
- PhariaInference API has deprecated the legacy model registration workflow that used a `models.json` file and the `/v1/models` API endpoint. Earlier deployments depended on manually submitting model definitions to this endpoint, while current deployments embed model definitions directly in worker configuration files ("model packages"). The OpenAI-compatible `/v2/models` endpoint is not affected.
- PhariaAssistant has received a new document drag-and-drop capability and a refreshed agent entry display.