Chat evaluation framework
This article describes an evaluation framework for chat. You can evaluate the effectiveness of a chat application using different models and datasets.
The framework uses the agentic-eval package, which includes a command-line interface and can be downloaded from the Aleph Alpha account on JFrog. Note that you need credentials from Aleph Alpha to access this artifact.
Prerequisites
- UV package manager (recommended) or pip
- Access to the OpenAI API (for GPT models)
- Access to PhariaStudio
Setting up a working environment
To run chat evaluations, prepare your environment with the following steps.
1. Create a virtual environment
Create a new project with a Python 3.12 virtual environment (for example, with uv), and then activate it:
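A typical sequence with uv looks like this (the project name is an example):
# Create a project directory with a Python 3.12 virtual environment
uv init chat-eval && cd chat-eval
uv venv --python 3.12
# Activate the environment
source .venv/bin/activate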
2. Download and install the agentic_eval package
Note: You need access to the Aleph Alpha account on JFrog to download packages.
Download the latest package release file (the "wheel", a .whl file) from JFrog:
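For example, with curl (the repository URL, credentials, and version below are placeholders; use the artifact path shown in your JFrog account):
# Placeholder URL and credentials - replace with your JFrog details
curl -L -u <username>:<jfrog-token> -O \
  "https://<your-jfrog-instance>/artifactory/<repo-path>/agentic_eval-<version>-py3-none-any.whl"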
Install the package with pip or uv:
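For example, using the wheel file downloaded in the previous step:
# With uv
uv pip install agentic_eval-<version>-py3-none-any.whl
# Or with pip
pip install agentic_eval-<version>-py3-none-any.whl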
3. Define environment variables
Create a .env file and add the following environment variables to it:
PHARIA_AI_TOKEN= # Your bearer token from PhariaStudio
OPENAI_API_TOKEN= # Your OpenAI API token if you want OpenAI models as agents
SKILL_PUBLISH_REPOSITORY="v1/skills/playground"
PHARIA_KERNEL_ADDRESS="https://kernel.pharia.example.com/"
PHARIA_STUDIO_ADDRESS="https://studio.pharia.example.com/"
PHARIA_INFERENCE_API_ADDRESS="https://inference.pharia.example.com/"
DATA_PLATFORM_BASE_URL="https://pharia-data-api.example.pharia.com/api/v1"
Example: Running a quick evaluation from the CLI
Using the agentic_eval command-line interface (CLI), you can run a benchmark in a single line without writing Python code:
# List available datasets
agentic_eval list-datasets
# See the models available
agentic_eval run --help
# Run a quick test on the test_small dataset with the default classic chat client
agentic_eval run --dataset test_small --client classic-chat --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14
# Add flags to publish results
agentic_eval run --dataset test_small --push-to-studio --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14
The agentic_eval CLI supports many more actions, such as dataset creation and deletion. See the next section.
Evaluating with the agentic_eval CLI
You can get help on the commands available in the agentic_eval CLI as follows:
agentic_eval --help
The most common commands are described below. All commands accept the global flag -v / --verbose to enable information-level logging.
List all datasets
This command lists all datasets that have a mapping in PhariaData. Append --json to the command to show the full mapping instead of just the dataset names:
# Short listing
agentic_eval list-datasets
# Full JSON mapping
agentic_eval list-datasets --json
Create a dataset
This command registers a new dataset from a local JSON file. By default, the dataset is uploaded to PhariaData; append --local-only to keep the file on disk only.
# Upload dataset to remote platform and keep a local copy
agentic_eval create-dataset my_new_dataset data/my_dataset.json
# Create only locally (skip remote upload)
agentic_eval create-dataset my_new_dataset data/my_dataset.json --local-only
Dataset file format
Your dataset file must be a .json file containing a JSON array of objects, as shown below. Each object represents a conversational search task.
[
{
"objective": "Who are the contributors for the Transport sector?",
"max_turns": 3,
"expected_facts": [
"The contributors for the Transport sector are Anna Białas-Motyl, Miriam Blumers, Evangelia Ford-Alexandraki, Alain Gallais, Annabelle Jansen, Dorothea Jung, Marius Ludwikowski, Boryana Milusheva and Joanna Raczkowska."
],
"attachments": {
"collection": "Assistant-File-Upload-Collection-QAXYZ",
"namespace": "Assistant"
}
},
{
"objective": "Which unit is responsible for Environmental statistics and accounts?",
"max_turns": 3,
"expected_facts": [
"The unit responsible for Environmental statistics and accounts is Eurostat, Unit E.2."
],
"attachments": {
"collection": "Assistant-File-Upload-Collection-QA-XYZ",
"namespace": "Assistant"
}
}
]
The fields above can be described as follows:
- objective – the user's goal or question to solve
- max_turns – the maximum number of conversation turns to generate
- expected_facts – a list of ground-truth facts used for automated grading
- attachments.collection and attachments.namespace – identifiers for the documents available to the system
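Before registering a dataset, you can catch malformed files with a quick check like the following (a simple sketch; the file path is an example, and the required keys follow the format above):
import json

# Keys every task object must carry, per the dataset format above
REQUIRED_KEYS = {"objective", "max_turns", "expected_facts", "attachments"}

with open("data/my_dataset.json") as f:  # example path
    tasks = json.load(f)

assert isinstance(tasks, list), "Dataset must be a JSON array"
for i, task in enumerate(tasks):
    missing = REQUIRED_KEYS - task.keys()
    assert not missing, f"Task {i} is missing keys: {missing}"
print(f"OK: {len(tasks)} tasks validated")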
Delete a dataset
This command deletes a dataset from PhariaData, from your local machine, or both:
# Delete dataset everywhere (remote + local)
agentic_eval delete-dataset my_new_dataset
# Delete only locally, keep remote repository
agentic_eval delete-dataset my_new_dataset --local --skip-remote
Run a benchmark
This command runs a benchmark against a chat client implementation:
agentic_eval run \
--dataset test_small \
--client classic-chat \
--chat-base-url https://chat.internal.company \
--run-name my_first_run \
--evaluation-model gpt-4.1-2025-04-14 \
--facilitator-model gpt-4.1-2025-04-14 \
--push-to-huggingface \
--push-to-studio \
--studio-project-name Chat-Evaluations
Supported clients: classic-chat (default) and agent-chat.
You can optionally override the default chat service endpoint (http://localhost:8000) using the --chat-base-url option.
The command above generates a benchmark run, evaluates the chat with GPT-4.1, aggregates the results, and publishes them to Hugging Face and/or PhariaStudio.
You can select less capable models that are available on PhariaInference. To list the available models, run:
agentic_eval run --help
Evaluating with the Python SDK
Example evaluation script
The following script shows a simple way to run conversation evaluations using the agentic_eval framework in Python:
import datetime
from agentic_eval.benchmarks.run_benchmark import run_benchmark
from agentic_eval.kernel_skills.conv_facilitator.models import (
FacilitatorModel,
)
from agentic_eval.external_chat_clients.pharia_classic_chat_service_client import (
PhariaClassicChatServiceClient,
)
from agentic_eval.kernel_skills.conversational_search_evaluator.models import (
CriticModel,
)
DATASET_NAME = "assistant_evals_docQnA_small"
PUBLISH_TO_HF = False
PUBLISH_TO_STUDIO = True
client = PhariaClassicChatServiceClient(api_base_url="http://assistant.<your.pharia-ai.domain.com>/")
if __name__ == "__main__":
run_benchmark(
client,
dataset_name=DATASET_NAME,
run_name=datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
evaluation_model=CriticModel.OPENAI_GPT_4_1,  # DeepSeek R1 Ephemeral does not generate valid JSON
user_agent_model=FacilitatorModel.OPENAI_GPT_4_1,
publish_to_huggingface=PUBLISH_TO_HF,
publish_to_studio=PUBLISH_TO_STUDIO,
)
What the script is doing
- Imports the required components: The script imports the benchmark runner, model definitions, and chat client.
- Configures the evaluation parameters: It sets the dataset, models, and publishing options.
- Initialises the chat client: The script creates a client to interact with the chat service being evaluated.
- Runs the benchmark: It executes the evaluation using the configured parameters.
- Publishes the results: The script publishes the results to PhariaStudio for tracking and visualisation, but does not publish to Hugging Face.
Customising your evaluations
You can customise the following variables to suit your evaluation needs:
Benchmark selection
BENCHMARK = Benchmarks.TEST_LARGE # Choose your benchmark
The following benchmark options are available:
- Benchmarks.TEST_LARGE – Large test dataset for comprehensive evaluation.
- Benchmarks.TEST_SMALL – Small test dataset for quick testing.
- Benchmarks.ASSISTANT_EVALS_DOCQNA_SMALL – Evaluation dataset focused on document Q&A.
Publishing configuration
PUBLISH_TO_HF = True # Publish results to Hugging Face leaderboard
PUBLISH_TO_STUDIO = True # Publish results to PhariaStudio for visualisation
Models configuration
# Evaluation model - Role: Judges conversation quality
evaluation_model = CriticModel.OPENAI_GPT_4_1
# User agent model - Role: Simulates user behaviour
user_agent_model = FacilitatorModel.OPENAI_GPT_4_1
The following models are available:
- OPENAI_GPT_4_1 – GPT-4.1 (recommended for high-quality evaluation)
- LLAMA_3_3_70B_INSTRUCT – Llama 3.3 70B
- LLAMA_3_1_8B_INSTRUCT – Llama 3.1 8B
Chat client
client = PhariaClassicChatServiceClient() # Replace with your custom client
You can replace PhariaClassicChatServiceClient() with your own chat client implementation. Your client must inherit from ConversationalSearchClient and implement the get_response method.
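The following is a minimal sketch of such a client. The import path for ConversationalSearchClient, the get_response signature, and the /chat endpoint and response schema are assumptions; check the installed package and your chat service for the exact interfaces.
import requests

# Assumption: ConversationalSearchClient lives at this path; adjust to your installed version.
from agentic_eval.external_chat_clients.base import ConversationalSearchClient


class MyCustomChatClient(ConversationalSearchClient):
    def __init__(self, api_base_url: str = "http://localhost:8000"):
        self.api_base_url = api_base_url.rstrip("/")

    def get_response(self, messages):
        # Assumption: get_response receives the conversation history and
        # returns the assistant's reply as plain text.
        response = requests.post(
            f"{self.api_base_url}/chat",  # placeholder endpoint
            json={"messages": messages},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["reply"]  # placeholder response schema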
Additional configuration options
The run_benchmark function supports additional parameters for advanced use cases:
- huggingface_repo_id: Specifies a custom Hugging Face repository (default: Aleph-Alpha/Chat-Assistant-Evaluations)
- huggingface_api_key: Provides a custom Hugging Face API key
- huggingface_private: Sets the repository visibility (default: True)
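For instance, a run that uploads results to a private repository of your own might look like this (the repository ID is a placeholder, and the call otherwise mirrors the example script above):
run_benchmark(
    client,
    dataset_name="test_small",
    run_name="hf-upload-run",
    evaluation_model=CriticModel.OPENAI_GPT_4_1,
    user_agent_model=FacilitatorModel.OPENAI_GPT_4_1,
    publish_to_huggingface=True,
    huggingface_repo_id="my-org/my-eval-results",  # placeholder: your own repository
    huggingface_private=True,  # keep the repository private
)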
Example customisation
The following is an example of how you can customise the script for your specific needs:
BENCHMARK = Benchmarks.TEST_SMALL # Use smaller dataset for faster testing
PUBLISH_TO_HF = True # Enable Hugging Face publishing
PUBLISH_TO_STUDIO = True # Enable PhariaStudio publishing
client = MyCustomChatClient() # Use your custom client
if __name__ == "__main__":
run_benchmark(
client,
run_name="my-custom-evaluation-run",
benchmark_name=BENCHMARK,
evaluation_model=CriticModel.LLAMA_3_3_70B_INSTRUCT, # Use Llama for evaluation
user_agent_model=FacilitatorModel.LLAMA_3_3_70B_INSTRUCT, # Use Llama for user simulation
publish_to_huggingface=PUBLISH_TO_HF,
publish_to_studio=PUBLISH_TO_STUDIO,
studio_project_name="my-evaluation-project",
)
Troubleshooting
Import error: Module not found
- Ensure the virtual environment is activated: source .venv/bin/activate
- Verify the package installation: pip list | grep agentic_eval
- Try reinstalling: uv sync or pip install -e .
Authentication error: Invalid tokens
- Ensure that the .env file has the correct token values.
- Verify that the tokens haven't expired.
- Ensure that no extra spaces are present in the token values.
Connection error: Cannot reach PhariaAI services
- Check your VPN connection, if required.
- Verify the URLs in the environment variables.
- Test connectivity: curl -I https://kernel.pharia.example.com/