Chat evaluation framework

This article describes an evaluation framework for chat applications. You can use the framework to evaluate the effectiveness of a chat application across different models and datasets.

The framework uses the agentic_eval package, which includes a command-line interface. You can download the package from the Aleph Alpha account on JFrog; note that you need credentials from Aleph Alpha to access this artifact.


Prerequisites

  • Python 3.12

  • UV package manager (recommended) or pip

  • Access to the OpenAI API (for GPT models)

  • Access to PhariaStudio

Setting up a working environment

To run chat evaluations, prepare your environment with the following steps.

1. Create a virtual environment

Create a new project with a virtual environment based on Python 3.12, using either the built-in venv module or uv, and then activate it:

Using Python

python -m venv .venv  # Ensure Python 3.12 is your default version

Using uv

uv venv --python 3.12

Activate the environment

source .venv/bin/activate  # .venv/Scripts/activate on Windows

2. Download and install the agentic_eval package

You need access to the Aleph Alpha account on JFrog to download packages.

Download the latest package release file (a "wheel", that is, a .whl file) from JFrog.

Install the package with pip or uv:

Using pip

pip install agentic_eval-VERSION-py3-none-any.whl

Using uv

uv add agentic_eval-VERSION-py3-none-any.whl

3. Define environment variables

Create a .env file and add the following environment variables to it:

PHARIA_AI_TOKEN=  # Your bearer token from PhariaStudio
OPENAI_API_TOKEN=  # Your OpenAI API token if you want OpenAI models as agents
SKILL_PUBLISH_REPOSITORY="v1/skills/playground"
PHARIA_KERNEL_ADDRESS="https://kernel.pharia.example.com/"
PHARIA_STUDIO_ADDRESS="https://studio.pharia.example.com/"
PHARIA_INFERENCE_API_ADDRESS="https://inference.pharia.example.com/"
DATA_PLATFORM_BASE_URL="https://pharia-data-api.example.pharia.com/api/v1"
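
If you call the framework from your own Python scripts, you may want to load these variables explicitly before importing the package. The following is a minimal sketch; it assumes the python-dotenv package is installed and that the variables checked below are the ones your setup requires:

# Minimal sketch: load the .env file and check a few required variables.
# Assumes python-dotenv is installed (pip install python-dotenv); the list of
# required variables below is an example, not an exhaustive requirement.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

required = ["PHARIA_AI_TOKEN", "PHARIA_KERNEL_ADDRESS", "PHARIA_STUDIO_ADDRESS"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")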

Example: Running a quick evaluation from the CLI

Using the agentic_eval command-line interface (CLI), you can run a benchmark in a single line without writing Python code:

# List available datasets
agentic_eval list-datasets

# See the models available
agentic_eval run --help

# Run a quick test on the *test_small* dataset with the default classic chat client
agentic_eval run --dataset test_small --client classic-chat --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14

# Add flags to publish results
agentic_eval run --dataset test_small --push-to-studio --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14

The agentic_eval CLI supports many more actions, such as dataset creation and deletion. See the next section.

Evaluating with the agentic_eval CLI

You can get help on the commands available in the agentic_eval CLI as follows:

agentic_eval --help

The most common commands are described below. All commands accept the global flag -v / --verbose to enable information-level logging.

List all datasets

This command lists all datasets that have a mapping in PhariaData. Append --json to the command to show the full mapping instead of just the dataset names:

# Short listing
agentic_eval list-datasets

# Full JSON mapping
agentic_eval list-datasets --json

Create a dataset

This command registers a new dataset from a local JSON file. By default, the dataset is uploaded to PhariaData; append --local-only to keep the file on disk only.

# Upload dataset to remote platform and keep a local copy
agentic_eval create-dataset my_new_dataset data/my_dataset.json

# Create only locally (skip remote upload)
agentic_eval create-dataset my_new_dataset data/my_dataset.json --local-only

Dataset file format

Your dataset file must be a .json file containing a JSON array of objects like those shown below. Each object represents a conversational search task.

[
  {
    "objective": "Who are the contributors for the Transport sector?",
    "max_turns": 3,
    "expected_facts": [
      "The contributors for the Transport sector are Anna Białas-Motyl, Miriam Blumers, Evangelia Ford-Alexandraki, Alain Gallais, Annabelle Jansen, Dorothea Jung, Marius Ludwikowski, Boryana Milusheva and Joanna Raczkowska."
    ],
    "attachments": {
      "collection": "Assistant-File-Upload-Collection-QAXYZ",
      "namespace": "Assistant"
    }
  },
  {
    "objective": "Which unit is responsible for Environmental statistics and accounts?",
    "max_turns": 3,
    "expected_facts": [
      "The unit responsible for Environmental statistics and accounts is Eurostat, Unit E.2."
    ],
    "attachments": {
      "collection": "Assistant-File-Upload-Collection-QA-XYZ",
      "namespace": "Assistant"
    }
  }
]

The fields above can be described as follows (a sketch for generating such a file programmatically follows the list):

  • objective – the user’s goal/question to solve

  • max_turns – the maximum number of conversation turns to generate

  • expected_facts – list of ground-truth facts used for automated grading

  • attachments.collection and attachments.namespace – identifiers for documents available to the system
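
If you assemble dataset files programmatically, a short script can keep them consistent with this format. The sketch below is illustrative only; the objective, expected fact, file path, collection, and namespace values are placeholders to replace with your own:

# Illustrative sketch: write a dataset file in the format described above.
# All values below are placeholders.
import json

tasks = [
    {
        "objective": "Replace this with the user's goal or question",
        "max_turns": 3,
        "expected_facts": [
            "Replace this with a ground-truth fact used for grading",
        ],
        "attachments": {
            "collection": "My-Document-Collection",  # placeholder
            "namespace": "Assistant",
        },
    },
]

with open("data/my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(tasks, f, ensure_ascii=False, indent=2)

You can then register the resulting file with the agentic_eval create-dataset command shown above.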

Delete a dataset

This command deletes a dataset from PhariaData, from your local machine, or both:

# Delete dataset everywhere (remote + local)
agentic_eval delete-dataset my_new_dataset

# Delete only locally, keep remote repository
agentic_eval delete-dataset my_new_dataset --local --skip-remote

Run a benchmark

This command runs a benchmark against a chat client implementation:

agentic_eval run \
  --dataset test_small \
  --client classic-chat \
  --chat-base-url https://chat.internal.company \
  --run-name my_first_run \
  --evaluation-model gpt-4.1-2025-04-14 \
  --facilitator-model gpt-4.1-2025-04-14 \
  --push-to-huggingface \
  --push-to-studio \
  --studio-project-name Chat-Evaluations

Supported clients: classic-chat (default) and agent-chat.

You can optionally override the default chat service endpoint (http://localhost:8000) using the --chat-base-url option.

The command above generates a benchmark run, evaluates the chat with GPT-4.1, aggregates the results, and publishes them to Hugging Face and/or PhariaStudio.

You can select less capable models that are available on PhariaInference. To list the available models, run:

agentic_eval run --help

Evaluating with the Python SDK

Example evaluation script

The following script shows a simple way to run conversation evaluations using the agentic_eval framework in Python:

import datetime
from agentic_eval.benchmarks.run_benchmark import run_benchmark
from agentic_eval.kernel_skills.conv_facilitator.models import (
    FacilitatorModel,
)
from agentic_eval.external_chat_clients.pharia_classic_chat_service_client import (
    PhariaClassicChatServiceClient,
)
from agentic_eval.kernel_skills.conversational_search_evaluator.models import (
    CriticModel,
)

DATASET_NAME = "assistant_evals_docQnA_small"
PUBLISH_TO_HF = False
PUBLISH_TO_STUDIO = True
client = PhariaClassicChatServiceClient(api_base_url="http://assistant.<your.pharia-ai.domain.com>/")

if __name__ == "__main__":
    run_benchmark(
        client,
        dataset_name=DATASET_NAME,
        run_name=datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
        evaluation_model=CriticModel.OPENAI_GPT_4_1,  # DeepSeek R1 Ephemeral does not produce good JSON output
        user_agent_model=FacilitatorModel.OPENAI_GPT_4_1,
        publish_to_huggingface=PUBLISH_TO_HF,
        publish_to_studio=PUBLISH_TO_STUDIO,
    )

What the script is doing

  • Imports the required components: The script imports the benchmark runner, model definitions, and chat client.

  • Configures the evaluation parameters: It then sets up benchmark type, models, and publishing options.

  • Initialises the chat client: The script creates a client to interact with the chat service being evaluated.

  • Runs benchmark: It executes the evaluation using the configured parameters.

  • Publishes results: The script publishes the results to PhariaStudio for tracking and visualisation, but does not publish to Hugging Face.

Customising your evaluations

You can customise the following variables to suit your evaluation needs:

Benchmark selection

BENCHMARK = Benchmarks.TEST_LARGE  # Choose your benchmark

The following benchmark options are available:

  • Benchmarks.TEST_LARGE - Large test dataset for comprehensive evaluation.

  • Benchmarks.TEST_SMALL - Small test dataset for quick testing.

  • Benchmarks.ASSISTANT_EVALS_DOCQNA_SMALL - Evaluation dataset focused on document Q&A.

Publishing configuration

PUBLISH_TO_HF = True        # Publish results to Hugging Face leaderboard
PUBLISH_TO_STUDIO = True     # Publish results to PhariaStudio for visualisation

Models configuration

# Evaluation model - Role: Judges conversation quality
evaluation_model = CriticModel.OPENAI_GPT_4_1

# User agent model - Role: Simulates user behaviour
user_agent_model = FacilitatorModel.OPENAI_GPT_4_1

The following models are available:

  • OPENAI_GPT_4_1 - GPT-4.1 (recommended for high-quality evaluation)

  • LLAMA_3_3_70B_INSTRUCT - Llama 3.3 70B

  • LLAMA_3_1_8B_INSTRUCT - Llama 3.1 8B

Chat client

client = PhariaClassicChatServiceClient()  # Replace with your custom client

You can replace PhariaClassicChatServiceClient() with your own chat client implementation. Your client must inherit from ConversationalSearchClient and implement the get_response method.
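
The following is a minimal sketch of such a client. The import path and the exact signature of get_response are assumptions based on the description above; check the agentic_eval package for the actual ConversationalSearchClient interface:

# Minimal sketch of a custom chat client. The module path and the signature of
# get_response are assumptions; consult the agentic_eval package for the real
# ConversationalSearchClient interface.
from agentic_eval.external_chat_clients.conversational_search_client import (  # assumed path
    ConversationalSearchClient,
)


class MyCustomChatClient(ConversationalSearchClient):
    """Adapter around your own chat backend (hypothetical)."""

    def __init__(self, api_base_url: str = "http://localhost:8000") -> None:
        self.api_base_url = api_base_url

    def get_response(self, message: str, **kwargs) -> str:
        # Call your chat backend here and return its answer as plain text.
        # This placeholder simply echoes the incoming message.
        return f"Echo: {message}"

You can then pass MyCustomChatClient() to run_benchmark in place of PhariaClassicChatServiceClient().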

Optional parameters

client = PhariaClassicChatServiceClient(
    api_base_url="http://localhost:8000",  # URL and port of your Pharia Chat Service
    user_id="system",                      # User identifier for the chat service
    client_name="PhariaClassicChatServiceClient"  # Custom name for the client instance
)

PhariaStudio project name

studio_project_name="pharia-chat-service"  # Customise the project name in PhariaStudio

Run name

The script automatically generates a timestamp-based run name:

run_name=datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

Additional configuration options

The run_benchmark function supports additional parameters for advanced use cases, as illustrated in the sketch after this list:

  • huggingface_repo_id: Specifies a custom Hugging Face repository (default: Aleph-Alpha/Chat-Assistant-Evaluations)

  • huggingface_api_key: Provides a custom Hugging Face API key

  • huggingface_private: Sets the repository visibility (default: True)
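
For example, a run that publishes results to a custom private Hugging Face repository could pass these parameters directly to run_benchmark. This is a sketch only; the repository ID and API key are placeholders, and the other arguments mirror the example script above:

# Sketch: passing the Hugging Face options to run_benchmark.
# The repository ID and API key are placeholders.
run_benchmark(
    client,
    dataset_name="test_small",
    run_name="hf-publishing-example",
    evaluation_model=CriticModel.OPENAI_GPT_4_1,
    user_agent_model=FacilitatorModel.OPENAI_GPT_4_1,
    publish_to_huggingface=True,
    huggingface_repo_id="my-org/my-chat-evaluations",  # placeholder
    huggingface_api_key="<your-hugging-face-token>",   # placeholder; prefer an environment variable
    huggingface_private=True,
)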

Example customisation

The following is an example of how you can customise the script for your specific needs:

BENCHMARK = Benchmarks.TEST_SMALL  # Use smaller dataset for faster testing
PUBLISH_TO_HF = True               # Enable Hugging Face publishing
PUBLISH_TO_STUDIO = True           # Enable PhariaStudio publishing
client = MyCustomChatClient()      # Use your custom client

if __name__ == "__main__":
    run_benchmark(
        client,
        run_name="my-custom-evaluation-run",
        benchmark_name=BENCHMARK,
        evaluation_model=CriticModel.LLAMA_3_3_70B_INSTRUCT,  # Use Llama for evaluation
        user_agent_model=FacilitatorModel.LLAMA_3_3_70B_INSTRUCT,  # Use Llama for user simulation
        publish_to_huggingface=PUBLISH_TO_HF,
        publish_to_studio=PUBLISH_TO_STUDIO,
        studio_project_name="my-evaluation-project",
    )

Troubleshooting

Import error: Module not found

  • Ensure the virtual environment is activated:
    source .venv/bin/activate

  • Verify the package installation:
    pip list | grep agentic_eval

  • Try reinstalling:
    uv sync or pip install -e .

Authentication error: Invalid tokens

  • Ensure that the .env file has the correct token values.

  • Verify that the tokens haven’t expired.

  • Ensure that no extra spaces are present in the token values.

Connection error: Cannot reach PhariaAI services

  • Check your VPN connection, if required.

  • Verify the URLs in the environment variables.

  • Test connectivity:
    curl -I https://kernel.pharia.example.com/

Model not available: Specific model unavailable

  • Check that the model is available in your region.

  • Try alternative models from the supported list.

  • Verify the OpenAI API quota, if using GPT models.

Evaluation timeout: Long-running evaluations

  • Start with the TEST_SMALL benchmark for testing.

  • Check your system resources, particularly memory and CPU.

  • Consider using lighter models for initial testing.