PhariaAI v1.251200.0
| Release Version | v1.251200.0 |
|---|---|
| Release Date | December 08, 2025 |
| Availability | On-premise and hosted |
- PhariaAssistant
- PhariaStudio
- PhariaOS
  - Distributing On-Premise Load to External Inference APIs: Route excess traffic to external backends while keeping a single model name
  - Limiting Parallel Requests and Throttling for External API Connectors: Maintain stability when external backends enforce rate limits
  - Prometheus Metrics for External API Connectors: Gain visibility into token usage and throughput
- Other updates
PhariaAssistant
This release introduces the Answer Formatting Mode in PhariaAssistant Chat, a configuration that lowers inference costs and improves response times for simple conversational queries.
Answer Formatting Mode: Optimize response generation
| Availability | On-premise and hosted |
|---|---|
This release introduces a new configuration option that reduces inference costs and latency for simple chat queries. When enabled, the model can respond directly to simple questions instead of using the two-step Final Answer Tool process, cutting inference calls in half for basic conversational exchanges.
Here is how to enable it:
| Setting | `PHARIA_CHAT_ANSWER_FORMATTING_MODE` |
|---|---|
| Default | `always` |
| Options | `always`, `on_tool_availability` |
In the Helm Chart:
```yaml
env:
  PHARIA_CHAT_ANSWER_FORMATTING_MODE: "on_tool_availability"
```
In the `POST /conversations/{conversation_id}/messages/v2` API request payload:

```json
{
  "answer_formatting_mode": "on_tool_availability"
}
```
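As a usage sketch, the setting can also be sent on a single request. Only the endpoint path pattern and the `answer_formatting_mode` field come from this release; the host, token, conversation ID, and the `content` field name are assumptions and may differ in your deployment:

```bash
# Sketch only: host, token, conversation ID, and the "content" field name
# are placeholders/assumptions; answer_formatting_mode is the setting
# introduced in this release.
curl -X POST "https://pharia.example.com/conversations/<conversation_id>/messages/v2" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
        "content": "Hello!",
        "answer_formatting_mode": "on_tool_availability"
      }'
```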
Key Benefits
- Reduced latency: Simple queries (for example, "Hello!") complete faster with a single inference call instead of two.
- Lower compute costs: Fewer inference calls for conversational exchanges without tool usage.
- Backward compatible: The default `always` mode preserves existing behavior.
The following table summarizes the impact of each mode:

| Query Type | `always` | `on_tool_availability` |
|---|---|---|
| Simple chat | 2 inference calls | 1 inference call |
| Tool-assisted | 2+ inference calls | 2+ inference calls |
| Citations | Always available | Only when tools are called |

Note: Inline citations are only generated when the Final Answer Tool is used. With `on_tool_availability`, simple queries without tools will not include citation markup, which is expected since there are no sources to cite.
The PhariaAssistant team recommends the following setup:
- Use `always` for enterprise deployments requiring consistent response formatting and citations in all tool-assisted responses.
- Use `on_tool_availability` for cost and latency optimization when handling mixed simple chat and tool-assisted workloads.
PhariaStudio
With this update, the data platform component of PhariaStudio has received a significant improvement to filter index creation in Qdrant that reduces both query latency and CPU usage. These enhancements improve the overall stability and performance of Qdrant deployments, especially under high load or complex filtering conditions.
Action Required: If you previously created filter indexes for custom metadata fields, you need to recreate those indexes to take advantage of the performance improvements. Indexes created prior to this update do not automatically inherit the enhancements. For guidance on recreating filter indexes, see the updated PhariaSearch API documentation, under Filter Index.
PhariaOS
This release introduces several advancements in PhariaInference API that strengthen performance, stability, and operational transparency for teams running hybrid on-premise and external model workloads.
Distributing On-Premise Load to External Inference APIs: Route excess traffic to external backends while keeping a single model name
This feature enables you to handle load spikes by forwarding queued on-premise inference tasks to an external inference API. Users continue interacting with a single unified model name, while the platform distributes work across local and external resources.
Key Benefits:
- Unified model for both local and external inference.
- Automatic overflow of requests when local throughput is saturated.
- Optional control of forwarding behavior via `min_task_age_millis` and `max_tasks_in_flight` (see the configuration sketch below).
For extensive technical details, see the related documentation page.
Note: External connectors must point to an identical model variant; mixed variants result in inconsistent responses. The connector cannot adopt a model if an external connector is already registered under the same name.
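As an illustration of the tuning knobs, a connector configuration might look roughly like the sketch below. Only `min_task_age_millis` and `max_tasks_in_flight` are taken from this release; all other keys, names, and the overall layout are assumptions, so refer to the linked documentation for the actual schema.

```yaml
# Illustrative sketch only: every key except min_task_age_millis and
# max_tasks_in_flight is an assumption; see the documentation for the
# actual connector configuration schema.
external_connectors:
  - model_name: "my-unified-model"      # same name users already request locally
    endpoint: "https://external-inference.example.com/v1"
    # Only forward tasks that have already waited this long in the local queue.
    min_task_age_millis: 500
    # Cap the number of requests concurrently in flight against the external API.
    max_tasks_in_flight: 8
```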
Limiting Parallel Requests and Throttling for External API Connectors: Maintain stability when external backends enforce rate limits
This feature ensures that when an external inference API becomes rate-limited, the scheduler coordinates request flow safely. It pauses, resumes, and redistributes tasks without interrupting users, while still leveraging the scheduler’s queueing mechanisms.
Key Benefits:
- Safe handling of external rate limits.
- Automatic pause/resume based on the rate limit headers of the external API.
- On-premise workers continue processing while external capacity is restricted.
For extensive technical details, see the related documentation page.
Prometheus Metrics for External API Connectors: Gain visibility into token usage and throughput
This feature introduces dedicated Prometheus metrics that allow you to track throughput, input tokens, and output tokens for external API connectors. These metrics give operators clearer insight into usage patterns and load characteristics.
Key Benefits:
- Improved observability for external API connector performance.
- Transparent measurement of input/output token volume.
- Throughput metric aligned with the industry-standard tokens-per-second interpretation.
The following metrics are exposed:

- `as_combined_throughput_hist`
  - Value: (input_tokens + 5 × output_tokens) / 6 / duration_seconds
  - Labels: `model_name`, `request_type`, `stream`
- `as_output_tokens_counter`
  - Total output tokens
  - Label: `model_name`
- `as_input_tokens_counter`
  - Total input tokens
  - Label: `model_name`
Note: All metrics report per-model values to enable fine-grained monitoring across connectors.
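As a usage illustration, these metrics can be queried from Prometheus. A minimal PromQL sketch, assuming the counters follow standard Prometheus counter semantics and the histogram exposes the usual `_bucket` series:

```promql
# Output and input tokens per second over the last five minutes, per model.
sum by (model_name) (rate(as_output_tokens_counter[5m]))
sum by (model_name) (rate(as_input_tokens_counter[5m]))

# 95th percentile of combined throughput, assuming the standard Prometheus
# _bucket series is exposed for as_combined_throughput_hist.
histogram_quantile(0.95,
  sum by (le, model_name) (rate(as_combined_throughput_hist_bucket[5m])))
```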
Other updates
- PhariaInference API now supports a new vLLM worker image based on vLLM v0.12.0.
- PhariaInference API has deprecated the legacy model registration workflow that used a `models.json` file and the `/v1/models` API endpoint. Earlier deployments depended on manually submitting model definitions to this endpoint, while current deployments embed model definitions directly in worker configuration files ("model packages"). The OpenAI-compatible `/v2/models` endpoint is not affected.
- PhariaAssistant has received a new document drag-and-drop capability and a refreshed agent entry display.