Tool calling
Tool calling (also known as function calling) enables AI models to interact with external systems and APIs in a structured way. Instead of generating free-form text responses, models can decide when and how to call functions that your application defines and sends along with the user query, making AI applications more capable and deterministic.
Overview: Why use tool calling?
When a model determines that a tool should be used to answer a query, it generates a structured function call with the appropriate parameters in its response. Your client application then executes the actual function and provides the result back to the model, which can then formulate a natural language response incorporating the tool’s output.
Tool calling is particularly useful in the following contexts:
- API integration: Calling external services such as weather APIs, databases, or web services.
- Dynamic data retrieval: Fetching real-time information not present in the model’s training data.
- Action execution: Performing operations such as sending emails, creating calendar events, or file operations.
- Structured workflows: Building complex multistep processes with deterministic outcomes.
The tool calling process
The tool calling process follows this pattern:
1. Define the available tools with their schema definitions.
2. Send a user query along with the tool definitions to the model.
3. The model decides whether to use a tool and, if so, generates a function call.
4. Execute the tool function in your application.
5. Send the tool result back to the model.
6. The model incorporates the result into its final response or decides to make further tool calls. The message sequence exchanged in these steps is sketched below.
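To make the exchange concrete, the following sketch shows a single round trip as a sequence of messages in the OpenAI-compatible format used throughout this page. The tool name, arguments, and identifiers are invented for illustration and mirror the full examples later in this section:
# Illustrative message sequence for one tool calling round trip.
# Tool name, arguments, and IDs are invented for illustration.
conversation = [
    # Steps 1-2: the user query is sent together with the tool definitions (not shown here)
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Heidelberg?"},
    # Step 3: the model replies with a structured function call instead of text
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Heidelberg"}'},
            }
        ],
    },
    # Steps 4-5: your application executes the tool and sends back the result
    {
        "role": "tool",
        "tool_call_id": "call_0",
        "content": '{"temperature": 20, "unit": "Celsius", "condition": "cloudy"}',
    },
    # Step 6: the next model response turns the tool result into a natural language answer
]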
Implementing tool calling with PhariaInference
The PhariaInference API supports three approaches to implementing tool calling:
- Aleph Alpha client: Use the native Python client with full feature support.
- OpenAI client compatibility: Use the familiar OpenAI client interface for easy migration.
- Direct API calls: Make raw HTTP requests with tool definitions in the payload.
All approaches provide the same functionality and follow the OpenAI tool calling specification for compatibility.
Configuring the worker
Tool calling requires specific worker configuration. Currently, tool calling is supported only for worker type vllm.
Because tool calling operates through the /chat/completions endpoint, chat capabilities must be enabled in the worker configuration.
Add the following to the config.toml of the worker:
[queue.models."your-model".chat_task]
supported = true
The following additional settings are recommended, though not strictly necessary, as they return reasoning and tool information in dedicated response fields:
[generator]
# Optional: Enable dedicated reasoning parsing for models like DeepSeek R1
reasoning_parser = "deepseek_r1"
# Optional: Enable dedicated tool parsing with Hermes parser
tool_parser = "hermes"
[generator.structured_output]
supported_types = ["json_schema"]
See the respective model card for recommended values in combination with vLLM.
The following is an example of a complete worker configuration (see below for an explanation of optional settings):
edition = 1
[generator]
type = "vllm"
model_path = "/path/to/your-model/"
max_model_len = 8192
max_num_seqs = 64
# Optional: Enable dedicated reasoning parsing for models like DeepSeek R1
reasoning_parser = "deepseek_r1"
# Optional: Enable dedicated tool parsing with Hermes parser
tool_parser = "hermes"
[generator.structured_output]
supported_types = ["json_schema"]
[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "your-model"
version = 2
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"
[queue.models."your-model"]
worker_type = "vllm"
checkpoint = "your-model"
description = "Model with tool calling capabilities"
maximum_completion_tokens = 8192
multimodal_enabled = false
[queue.models."your-model".chat_task]
supported = true
[monitoring]
metrics_port = 4000
tcp_probes = []
Explanation of optional settings
- reasoning_parser = "deepseek_r1": This setting specifies the use of the DeepSeek R1 reasoning parser, which is designed to extract reasoning content from models that generate outputs containing both reasoning steps and final conclusions. The reasoning content is typically wrapped in <think>…</think> tags; the parser identifies and processes these sections to separate the reasoning content from the final answer into dedicated response fields.
- tool_parser = "hermes": This setting designates the Hermes tool parser for handling tool-related outputs. The Hermes parser extracts and manages tool calls within the model’s output, ensuring that tool-related content is processed appropriately and separated into dedicated fields. Note that there have been instances where the Hermes parser encountered issues with specific token handling during streaming outputs, particularly with models like Qwen3, so thorough testing is recommended for your specific use case.
- supported_types = ["json_schema"]: This configuration enables structured output support with JSON schema format, facilitating the generation of outputs that adhere to predefined JSON schemas.
By incorporating these optional settings, you can achieve more organized and structured outputs with clear delineation between reasoning processes, tool-related content, and final responses.
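As a rough client-side sketch of what these settings change: with a reasoning parser and a tool parser enabled, reasoning text and tool calls arrive in dedicated fields of the assistant message rather than embedded in the plain content. The exact field names depend on the deployment; vLLM's OpenAI-compatible output typically exposes parsed reasoning as reasoning_content, so the snippet below (which assumes a response object from the OpenAI client example further down) reads it defensively:
# Minimal sketch: inspecting the dedicated fields produced by the parsers.
# Assumes an OpenAI-compatible response as in the OpenAI client example below;
# the reasoning_content field name follows vLLM's convention and may differ per deployment.
message = response.choices[0].message
# Final answer text (reasoning stripped out by the reasoning parser)
print("content:", message.content)
# Parsed reasoning, if the server returns it as a dedicated field
print("reasoning:", getattr(message, "reasoning_content", None))
# Tool calls extracted by the tool parser instead of appearing as raw text
if message.tool_calls:
    for call in message.tool_calls:
        print("tool call:", call.function.name, call.function.arguments)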
Deploying with the Aleph Alpha client
#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "aleph-alpha-client",
# ]
# ///
import os
from aleph_alpha_client import Client
from aleph_alpha_client.chat import ChatRequest, Message, Role
model = "your-model"
url = "https://inference-api.pharia.example.com"
token = os.getenv("API_TOKEN")
client = Client(token=token, host=url)
# Define tool
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "Name of the city"},
},
"required": ["city"],
},
"strict": True,
},
}
]
# Send a chat request
# Provide the available tool definitions to the model
system_message = "You are a helpful assistant."
user_message = "Please provide information on the weather in Heidelberg, Germany"
messages = [
Message(role=Role.System, content=system_message),
Message(role=Role.User, content=user_message),
]
request = ChatRequest(
messages=messages,
model=model,
tools=tools,
)
response = client.chat(request=request, model=model)
tool_call = response.message.tool_calls[0]
# Send the tool result back to the model
# The tool must be executed externally; we invent the result here.
tool_response = '{"temperature": 20, "unit": "Celsius", "condition": "cloudy"}'
tool_message = Message(role=Role.Tool, content=tool_response, tool_call_id=tool_call.id)
messages.append(tool_message)
follow_up_request = ChatRequest(
messages=messages,
model=model,
tools=tools,
)
# Receive output and inspect response
follow_up_response = client.chat(request=follow_up_request, model=model)
print(follow_up_response.message.content)
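In a real application, the invented tool_response above would come from actually executing the requested function. The following is a minimal sketch of that step; it assumes the tool call object follows the OpenAI tool calling schema (a function attribute with name and JSON-encoded arguments), and get_weather and the dispatch table are hypothetical stand-ins for your own tool implementations. Note also that some deployments expect the assistant message containing the tool call to appear in the conversation history before the tool result message; the cURL example further below includes it explicitly:
import json
# Hypothetical local implementation of the declared tool
def get_weather(city: str) -> str:
    # In practice this would call a real weather service
    return json.dumps({"temperature": 20, "unit": "Celsius", "condition": "cloudy"})
# Dispatch table mapping tool names to local functions
available_tools = {"get_weather": get_weather}
# Parse the JSON-encoded arguments produced by the model and execute the matching tool
arguments = json.loads(tool_call.function.arguments)
tool_response = available_tools[tool_call.function.name](**arguments)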
Deploying with the OpenAI client
#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "openai",
# ]
# ///
import os
from openai import OpenAI
model = "your-model"
url = "https://inference-api.pharia.example.com"
token = os.getenv("API_TOKEN")
client = OpenAI(base_url=url, api_key=token)
# Define tool
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "Name of the city"},
},
"required": ["city"],
},
"strict": True,
},
}
]
# Send a chat request
# Provide the available tool definitions to the model
system_message = "You are a helpful assistant."
user_message = "Please provide information on the weather in Heidelberg, Germany"
messages = [
{"role": "system", "content": system_message},
{"role": "user", "content": user_message},
]
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
)
# Receive output and inspect tool call
assistant_message = response.choices[0].message
print(assistant_message.tool_calls)
# Send the tool result back to the model
# The tool must be executed externally; we invent the result here.
tool_response = '{"temperature": 20, "unit": "Celsius", "condition": "cloudy"}'
tool_message = {
    "role": "tool",
    "content": tool_response,
    "tool_call_id": assistant_message.tool_calls[0].id,
}
# The assistant message containing the tool call must precede the tool result
messages.append(assistant_message)
messages.append(tool_message)
# Receive output and inspect response
follow_up_response = client.chat.completions.create(
model=model, messages=messages, tools=tools
)
print(follow_up_response.choices[0].message.content)
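The example above performs a single round trip. Because the model may decide to make further tool calls, production code typically loops until it returns a plain text answer. The sketch below reuses client, model, tools, and the initial messages from the example above and replaces the single round trip with such a loop; execute_tool is a hypothetical dispatcher standing in for your actual tool implementations:
import json
def execute_tool(name: str, arguments: dict) -> str:
    # Hypothetical dispatcher: run the matching local function and return its result as a JSON string
    if name == "get_weather":
        return json.dumps({"temperature": 20, "unit": "Celsius", "condition": "cloudy"})
    raise ValueError(f"Unknown tool: {name}")
# Keep querying the model until it answers without requesting a tool
while True:
    response = client.chat.completions.create(model=model, messages=messages, tools=tools)
    message = response.choices[0].message
    if not message.tool_calls:
        print(message.content)  # plain text answer: done
        break
    # The assistant message containing the tool calls must be part of the history
    messages.append(message)
    for tool_call in message.tool_calls:
        result = execute_tool(tool_call.function.name, json.loads(tool_call.function.arguments))
        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})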
Deploying with direct API calls (cURL)
You can make direct HTTP requests to the /chat/completions endpoint with tool definitions:
#!/bin/bash
# First request - send user query with tool definitions
RESPONSE=$(curl -s -L -X POST "https://inference-api.pharia.example.com/chat/completions" \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H "Authorization: Bearer $API_TOKEN" \
-d '{
"model": "your-model",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Please provide information on the weather in Heidelberg, Germany"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given city",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "Name of the city"
}
},
"required": ["city"],
"strict": true
}
}
}
]
}')
echo "First response with tool call:"
echo "$RESPONSE" | jq .
# Extract tool call ID and arguments from response
TOOL_CALL_ID=$(echo "$RESPONSE" | jq -r '.choices[0].message.tool_calls[0].id')
TOOL_ARGUMENTS=$(echo "$RESPONSE" | jq -r '.choices[0].message.tool_calls[0].function.arguments')
echo "Tool call ID: $TOOL_CALL_ID"
echo "Tool arguments: $TOOL_ARGUMENTS"
# Execute tool function externally (simulated here)
TOOL_RESPONSE='{"temperature": 20, "unit": "Celsius", "condition": "cloudy"}'
# Extract the assistant message from the first response
ASSISTANT_MESSAGE=$(echo "$RESPONSE" | jq -c '.choices[0].message')
# Create the second request payload using jq to avoid JSON formatting issues
PAYLOAD=$(jq -n \
--arg model "your-model" \
--arg tool_response "$TOOL_RESPONSE" \
--arg tool_call_id "$TOOL_CALL_ID" \
--argjson assistant_msg "$ASSISTANT_MESSAGE" \
'{
model: $model,
messages: [
{
role: "system",
content: "You are a helpful assistant."
},
{
role: "user",
content: "Please provide information on the weather in Heidelberg, Germany"
},
$assistant_msg,
{
role: "tool",
content: $tool_response,
tool_call_id: $tool_call_id
}
]
}')
# Second request - send tool response back to model
FINAL_RESPONSE=$(curl -s -L -X POST "https://inference-api.pharia.example.com/chat/completions" \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H "Authorization: Bearer $API_TOKEN" \
-d "$PAYLOAD")
echo "Final response:"
echo "$FINAL_RESPONSE" | jq -r '.choices[0].message.content'