Tool Calling

Introduction

Tool calling (also known as function calling) enables AI models to interact with external systems and APIs in a structured way. Instead of generating only free-form text responses, models can decide when and how to call functions that your application defines and passes along with the user query, making AI applications more capable and deterministic.

When a model determines that a tool should be used to answer a query, it generates a structured function call with the appropriate parameters in its response. Your client application then executes the actual function and provides the result back to the model, which can then formulate a natural language response incorporating the tool's output.

Tool calling is particularly useful for:

  • API Integration: Calling external services such as weather APIs, databases, or web services
  • Dynamic Data Retrieval: Fetching real-time information not present in the model's training data
  • Action Execution: Performing operations like sending emails, creating calendar events, or file operations
  • Structured Workflows: Building complex multi-step processes with deterministic outcomes

The tool calling process follows this pattern:

  1. Define available tools with their schema definitions
  2. Send a user query along with tool definitions to the model
  3. Model decides whether to use a tool and generates a function call
  4. Execute the tool function in your application
  5. Send the tool result back to the model
  6. Model incorporates the result into its final response, or decides to make further tool calls
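
The sketch below illustrates this loop against an OpenAI-compatible /chat/completions endpoint. It is a minimal example, not a definitive implementation: the model name, URL, and the execute_tool dispatcher are placeholders for your own deployment and tool implementations. Full, client-specific examples follow in the Usage section.

import json
import os

from openai import OpenAI

# Placeholders for your own deployment: model name, API URL, and token.
client = OpenAI(
    base_url="https://inference-api.pharia.example.com",
    api_key=os.getenv("API_TOKEN"),
)
model = "your-model"

# Step 1: define the available tools with their schemas.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def execute_tool(name: str, arguments: dict) -> str:
    # Placeholder dispatcher: call your real implementation and return a JSON string.
    return json.dumps({"temperature": 20, "unit": "Celsius", "condition": "cloudy"})


messages = [{"role": "user", "content": "What is the weather in Heidelberg?"}]

while True:
    # Steps 2-3: send the query plus tool definitions; the model may answer or call a tool.
    response = client.chat.completions.create(model=model, messages=messages, tools=tools)
    message = response.choices[0].message
    if not message.tool_calls:
        # Step 6: the model produced a final answer instead of another tool call.
        print(message.content)
        break
    # Steps 4-5: execute each requested tool and send the results back.
    messages.append(message)
    for tool_call in message.tool_calls:
        result = execute_tool(tool_call.function.name, json.loads(tool_call.function.arguments))
        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})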

Overview

There are three main approaches to implement tool calling with the Inference API:

  1. Aleph Alpha Client - Use the native Python client with full feature support
  2. OpenAI Client Compatibility - Use the familiar OpenAI client interface for easy migration
  3. Direct API Calls - Make raw HTTP requests with tool definitions in the payload

All approaches provide the same functionality and follow the OpenAI tool calling specification for compatibility.

Deployment

Tool calling requires specific worker configuration. Currently, it is only supported for the worker type vllm.

Tool calling needs to be enabled via the chat capabilities in the worker config, as it operates through the /chat/completions endpoint.

Add the following to the config.toml of the worker:

[queue.models."your-model".chat_task]
supported = true

The following additional settings are recommended (though not strictly necessary): they return reasoning and tool information in dedicated response fields:

[generator]
# Optional: Enable dedicated reasoning parsing for models like DeepSeek R1
reasoning_parser = "deepseek_r1"
# Optional: Enable dedicated tool parsing with Hermes parser
tool_parser = "hermes"

[generator.structured_output]
supported_types = ["json_schema"]

Refer to the respective model card for recommended values in combination with vLLM. This is an example of a complete worker configuration:

edition = 1

[generator]
type = "vllm"
model_path = "/path/to/your-model/"
max_model_len = 8192
max_num_seqs = 64
# Optional: Enable dedicated reasoning parsing for models like DeepSeek R1
reasoning_parser = "deepseek_r1"
# Optional: Enable dedicated tool parsing with Hermes parser
tool_parser = "hermes"

[generator.structured_output]
supported_types = ["json_schema"]

[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "your-model"
version = 2
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"

[queue.models."your-model"]
worker_type = "vllm"
checkpoint = "your-model"
description = "Model with tool calling capabilities"
maximum_completion_tokens = 8192
multimodal_enabled = false

[queue.models."your-model".chat_task]
supported = true

[monitoring]
metrics_port = 4000
tcp_probes = []

Explanation of Optional Settings:

  • reasoning_parser = "deepseek_r1": This setting specifies the use of the DeepSeek R1 reasoning parser, which is designed to extract reasoning content from models that generate outputs containing both reasoning steps and final conclusions. The reasoning content is typically wrapped in <think>...</think> tags, and the parser identifies and processes these sections to separate the reasoning content from the final answer into dedicated response fields.

  • tool_parser = "hermes": This setting designates the Hermes tool parser for handling tool-related outputs. The Hermes parser extracts and manages tool calls within the model's output, ensuring that tool-related content is processed appropriately and separated into dedicated fields. Note that there have been instances where the Hermes parser encountered issues with specific token handling during streaming outputs, particularly with models like Qwen3, so thorough testing is recommended for your specific use case.

  • supported_types = ["json_schema"]: This configuration enables structured output support with JSON schema format, facilitating the generation of outputs that adhere to predefined JSON schemas.

By incorporating these optional settings, you can achieve more organized and structured outputs with clear delineation between reasoning processes, tool-related content, and final responses.
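
The snippet below sketches how these dedicated fields show up in the OpenAI-compatible response when the optional parsers are enabled. It assumes the worker configuration above; the reasoning_content field is specific to vLLM's reasoning parser support and may vary between vLLM versions, so treat the field names as an assumption to verify against your deployment.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com",
    api_key=os.getenv("API_TOKEN"),
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

message = response.choices[0].message
# With reasoning_parser enabled, the <think>...</think> section is stripped from
# content and returned separately (if the model produced any reasoning).
print(getattr(message, "reasoning_content", None))
print(message.content)
# With tool_parser enabled, tool calls are returned as structured objects in
# message.tool_calls rather than as raw text in message.content.
print(message.tool_calls)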

Usage

Aleph Alpha Client

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "aleph-alpha-client",
# ]
# ///

import os
from aleph_alpha_client import Client
from aleph_alpha_client.chat import ChatRequest, Message, Role

model = "your-model"
url = "https://inference-api.pharia.example.com"
token = os.getenv("API_TOKEN")

client = Client(token=token, host=url)

# Define the available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city"},
                },
                "required": ["city"],
            },
            "strict": True,
        },
    }
]

# Send a chat request and provide the available tool definitions to the model
system_message = "You are a helpful assistant."
user_message = "Please provide information on the weather in Heidelberg, Germany"

messages = [
    Message(role=Role.System, content=system_message),
    Message(role=Role.User, content=user_message),
]

request = ChatRequest(
    messages=messages,
    model=model,
    tools=tools,
)

# The model responds with a tool call instead of a final answer
response = client.chat(request=request, model=model)
tool_call = response.message.tool_calls[0]

# Send the tool result back to the model.
# The tool must be executed by your application; a mock result is used here.
tool_response = '{"temperature": 20, "unit": "Celsius", "condition": "cloudy"}'

# Depending on the model's chat template, you may also need to append the
# assistant message containing the tool call before the tool result
# (as done in the cURL example below).
tool_message = Message(role=Role.Tool, content=tool_response, tool_call_id=tool_call.id)
messages.append(tool_message)

follow_up_request = ChatRequest(
    messages=messages,
    model=model,
    tools=tools,
)

# Receive the final output and inspect the response
follow_up_response = client.chat(request=follow_up_request, model=model)
print(follow_up_response.message.content)
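
In the example above the tool result is mocked. In a real application you would dispatch the tool call to your own code. The sketch below continues from the tool_call object obtained above and assumes it exposes the function name and JSON-encoded arguments as in the underlying API response (see the cURL section below); the get_weather function is a hypothetical stand-in for your implementation.

import json


def get_weather(city: str) -> dict:
    # Placeholder for your real implementation (external API call, database lookup, ...).
    return {"temperature": 20, "unit": "Celsius", "condition": "cloudy"}


# The model supplies the function name and the arguments as a JSON string.
arguments = json.loads(tool_call.function.arguments)
if tool_call.function.name == "get_weather":
    tool_response = json.dumps(get_weather(**arguments))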

OpenAI Client

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "openai",
# ]
# ///

import json
import os
from openai import OpenAI

model = "your-model"
url = "https://inference-api.pharia.example.com"
token = os.getenv("API_TOKEN")

client = OpenAI(base_url=url, api_key=token)

# Define the available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city"},
                },
                "required": ["city"],
            },
            "strict": True,
        },
    }
]

# Send a chat request and provide the available tool definitions to the model
system_message = "You are a helpful assistant."
user_message = "Please provide information on the weather in Heidelberg, Germany"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message},
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
)

# Receive the output and inspect the tool call
assistant_message = response.choices[0].message
print(assistant_message.tool_calls)

# Append the assistant message containing the tool call, then send the tool
# result back to the model.
# The tool must be executed by your application; a mock result is used here.
messages.append(assistant_message)

tool_response = {"temperature": 20, "unit": "Celsius", "condition": "cloudy"}

tool_message = {
    "role": "tool",
    "content": json.dumps(tool_response),
    "tool_call_id": assistant_message.tool_calls[0].id,
}
messages.append(tool_message)

# Receive the final output and inspect the response
follow_up_response = client.chat.completions.create(
    model=model, messages=messages, tools=tools
)
print(follow_up_response.choices[0].message.content)
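
Two practical notes, sketched below with the client, model, and tools defined in the script above: the model is free to answer directly without calling a tool, so check tool_calls before indexing into it; and the tool_choice parameter from the OpenAI specification can steer tool usage, though whether a specific named function can be forced depends on the model and deployment (treat that as an assumption to verify).

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather in Heidelberg?"}],
    tools=tools,
    tool_choice="auto",  # or, if supported: {"type": "function", "function": {"name": "get_weather"}}
)

assistant_message = response.choices[0].message
if assistant_message.tool_calls:
    for tool_call in assistant_message.tool_calls:
        print(tool_call.function.name, tool_call.function.arguments)
else:
    # No tool call: the model answered directly.
    print(assistant_message.content)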

Direct API Calls with cURL

You can also make direct HTTP requests to the /chat/completions endpoint with tool definitions:

#!/bin/bash
# First request - send user query with tool definitions
RESPONSE=$(curl -s -L -X POST "https://inference-api.pharia.example.com/chat/completions" \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{
    "model": "your-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Please provide information on the weather in Heidelberg, Germany"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather in a given city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {
                "type": "string",
                "description": "Name of the city"
              }
            },
            "required": ["city"]
          },
          "strict": true
        }
      }
    ]
  }')

echo "First response with tool call:"
echo "$RESPONSE" | jq .

# Extract tool call ID and arguments from the response
TOOL_CALL_ID=$(echo "$RESPONSE" | jq -r '.choices[0].message.tool_calls[0].id')
TOOL_ARGUMENTS=$(echo "$RESPONSE" | jq -r '.choices[0].message.tool_calls[0].function.arguments')

echo "Tool call ID: $TOOL_CALL_ID"
echo "Tool arguments: $TOOL_ARGUMENTS"

# Execute the tool function externally (simulated here)
TOOL_RESPONSE='{"temperature": 20, "unit": "Celsius", "condition": "cloudy"}'

# Extract the assistant message from the first response
ASSISTANT_MESSAGE=$(echo "$RESPONSE" | jq -c '.choices[0].message')

# Create the second request payload using jq to avoid JSON formatting issues
PAYLOAD=$(jq -n \
  --arg model "your-model" \
  --arg tool_response "$TOOL_RESPONSE" \
  --arg tool_call_id "$TOOL_CALL_ID" \
  --argjson assistant_msg "$ASSISTANT_MESSAGE" \
  '{
    model: $model,
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant."
      },
      {
        role: "user",
        content: "Please provide information on the weather in Heidelberg, Germany"
      },
      $assistant_msg,
      {
        role: "tool",
        content: $tool_response,
        tool_call_id: $tool_call_id
      }
    ]
  }')

# Second request - send the tool result back to the model
FINAL_RESPONSE=$(curl -s -L -X POST "https://inference-api.pharia.example.com/chat/completions" \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $API_TOKEN" \
  -d "$PAYLOAD")

echo "Final response:"
echo "$FINAL_RESPONSE" | jq -r '.choices[0].message.content'