
Chat Evaluation

Prerequisites

Setup

To run evaluations, you will need to go through the following setup:

Create Virtual Environment

Create a new project with a virtual environment using Python 3.12 and activate it.

# using python
python -m venv .venv # make sure python 3.12 is your default version

# using uv
uv venv --python 3.12

# activate it
source .venv/bin/activate # .venv/Scripts/activate on Windows

Install Package

Download the latest package release file (.whl) from JFrog (https://<jfrog>/artifactory/python/.pypi/agentic-eval/) and install it:

# Get the latest version from releases page, then:
pip install agentic_eval-VERSION-py3-none-any.whl

# or with uv
uv add agentic_eval-VERSION-py3-none-any.whl

Define Environment Variables

Create a .env file and add the following environment variables to it. Fill in the missing values.

PHARIA_AI_TOKEN=  # your bearer token from Studio
OPENAI_API_TOKEN= # your OpenAI API token, needed if you want to use OpenAI models as agents
SKILL_PUBLISH_REPOSITORY="v1/skills/playground"
PHARIA_KERNEL_ADDRESS="https://kernel.pharia.example.com/"
PHARIA_STUDIO_ADDRESS="https://studio.pharia.example.com/"
PHARIA_INFERENCE_API_ADDRESS="https://inference.pharia.example.com/"
DATA_PLATFORM_BASE_URL="https://pharia-data-api.example.pharia.com/api/v1"
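
Before running anything, you can sanity-check that these variables are actually picked up. Below is a minimal sketch, assuming you load the file with python-dotenv (python-dotenv is not a requirement of the package itself; the variable names are the ones listed above):

# check_env.py - quick sanity check that the required variables are set (sketch; assumes python-dotenv)
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

REQUIRED = [
    "PHARIA_AI_TOKEN",
    "SKILL_PUBLISH_REPOSITORY",
    "PHARIA_KERNEL_ADDRESS",
    "PHARIA_STUDIO_ADDRESS",
    "PHARIA_INFERENCE_API_ADDRESS",
    "DATA_PLATFORM_BASE_URL",
]

missing = [name for name in REQUIRED if not os.getenv(name)]
print("All required variables are set." if not missing else f"Missing: {', '.join(missing)}")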

Quick evaluation from CLI

Run a benchmark in a single line without writing Python code:

# List available datasets
agentic_eval list-datasets

# See the models available
agentic_eval run --help

# Run a quick test on the *test_small* dataset with the default classic chat client
agentic_eval run --dataset test_small --client classic-chat --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14

# Add flags to publish results
agentic_eval run --dataset test_small --push-to-studio --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14

The agentic_eval command-line interface ships with the wheel and supports many more actions (dataset creation, deletion, etc.). See CLI Usage for the full reference.

CLI Usage

After installing the wheel you can access a fully featured command-line interface via the agentic_eval command:

agentic_eval --help

The most common commands are explained below. All commands accept the global flag -v/--verbose to enable INFO level logging.

list-datasets

List all datasets that have a mapping in the Pharia Data Platform. Use --json to print the full mapping instead of just the dataset names.

# Short listing
agentic_eval list-datasets

# Full JSON mapping
agentic_eval list-datasets --json

create-dataset

Register a new dataset from a local .json or .jsonl file. By default the dataset is uploaded to the Data Platform. Pass --local-only if you want to keep it only on disk.

# Upload dataset to remote platform and keep a local copy
agentic_eval create-dataset my_new_dataset data/my_dataset.jsonl

# Create only locally (skip remote upload)
agentic_eval create-dataset my_new_dataset data/my_dataset.jsonl --local-only

Dataset file format

Your input file must contain either

  • a JSON array (.json) with the objects shown below, or
  • a JSON Lines file (.jsonl) with one such object per line.

Each object represents a conversational search task:

[
  {
    "objective": "Who are the contributors for the Transport sector?",
    "max_turns": 3,
    "expected_facts": [
      "The contributors for the Transport sector are Anna Białas-Motyl, Miriam Blumers, Evangelia Ford-Alexandraki, Alain Gallais, Annabelle Jansen, Dorothea Jung, Marius Ludwikowski, Boryana Milusheva and Joanna Raczkowska."
    ],
    "attachments": {
      "collection": "Assistant-File-Upload-Collection-QAXYZ",
      "namespace": "Assistant"
    }
  },
  {
    "objective": "Which unit is responsible for Environmental statistics and accounts?",
    "max_turns": 3,
    "expected_facts": [
      "The unit responsible for Environmental statistics and accounts is Eurostat, Unit E.2."
    ],
    "attachments": {
      "collection": "Assistant-File-Upload-Collection-QA-XYZ",
      "namespace": "Assistant"
    }
  }
]

Field descriptions:

  • objective – the user's goal/question to solve
  • max_turns – maximum conversation turns to generate
  • expected_facts – list of ground-truth facts used for automated grading
  • attachments.collection and attachments.namespace – identifiers for documents available to the system

Ensure your dataset file follows this structure before running create-dataset.
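
If you want to check a file before registering it, a lightweight validation sketch using only the standard library (the field names are the ones documented above; this script is not part of the package) could look like this:

# validate_dataset.py - lightweight structural check for a dataset file (sketch)
import json
import sys

REQUIRED_KEYS = {"objective", "max_turns", "expected_facts", "attachments"}

def load_records(path):
    # .jsonl: one JSON object per line; .json: a single JSON array
    if path.endswith(".jsonl"):
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    with open(path) as f:
        return json.load(f)

def validate(path):
    for i, record in enumerate(load_records(path)):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")
        if not {"collection", "namespace"} <= record["attachments"].keys():
            raise ValueError(f"Record {i} has incomplete attachments")
    print(f"{path}: OK")

if __name__ == "__main__":
    validate(sys.argv[1])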

delete-dataset

Delete a dataset from the Data Platform, from your local machine, or both.

# Delete dataset everywhere (remote + local)
agentic_eval delete-dataset my_new_dataset

# Delete only locally, keep remote repository
agentic_eval delete-dataset my_new_dataset --local --skip-remote

run

Run a benchmark against a chat client implementation.

agentic_eval run \
  --dataset test_small \
  --client classic-chat \
  --chat-base-url https://chat.internal.company \
  --run-name my_first_run \
  --evaluation-model gpt-4.1-2025-04-14 \
  --facilitator-model gpt-4.1-2025-04-14 \
  --push-to-huggingface \
  --push-to-studio \
  --studio-project-name Chat-Evaluations

Supported clients: classic-chat (default) and agent-chat.
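
To benchmark the agent chat service instead, switch the client flag. As a sketch (the remaining flags are assumed to behave the same way as for classic-chat):

# Run the same quick test against the agent chat client
agentic_eval run --dataset test_small --client agent-chat --chat-base-url http://localhost:8000 --evaluation-model gpt-4.1-2025-04-14 --facilitator-model gpt-4.1-2025-04-14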

You can optionally override the default chat service endpoint (http://localhost:8000) via the --chat-base-url flag.

The command generates a run, evaluates it with GPT-4.1, aggregates the results, and optionally publishes them to Hugging Face and/or Pharia Studio. You can also select less capable models that are available on the inference API. To list the available models, run

agentic_eval run --help

Advanced: Python SDK Usage

The following script provides a simple way to run conversation evaluations using the agentic-eval framework.

Usage

import datetime

from agentic_eval.benchmarks.run_benchmark import run_benchmark
from agentic_eval.kernel_skills.conv_facilitator.models import (
    FacilitatorModel,
)
from agentic_eval.external_chat_clients.pharia_classic_chat_service_client import (
    PhariaClassicChatServiceClient,
)
from agentic_eval.kernel_skills.conversational_search_evaluator.models import (
    CriticModel,
)

DATASET_NAME = "assistant_evals_docQnA_small"
PUBLISH_TO_HF = False
PUBLISH_TO_STUDIO = True
client = PhariaClassicChatServiceClient(api_base_url="http://assistant.<your.pharia-ai.domain.com>/")

if __name__ == "__main__":
    run_benchmark(
        client,
        dataset_name=DATASET_NAME,
        run_name=datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
        evaluation_model=CriticModel.OPENAI_GPT_4_1,  # DeepSeek R1 Ephemeral doesn't generate good JSON
        user_agent_model=FacilitatorModel.OPENAI_GPT_4_1,
        publish_to_huggingface=PUBLISH_TO_HF,
        publish_to_studio=PUBLISH_TO_STUDIO,
    )

Description

  1. Imports Required Components: The script imports the benchmark runner, model definitions, and chat client
  2. Configures Evaluation Parameters: Sets up benchmark type, models, and publishing options
  3. Initializes Chat Client: Creates a client to interact with the chat service being evaluated
  4. Runs Benchmark: Executes the evaluation using the configured parameters
  5. Publishes Results: Optionally publishes results to HuggingFace and/or Pharia Studio for tracking and visualization

You can customize the following variables to suit your evaluation needs.

Benchmark Selection

BENCHMARK = Benchmarks.TEST_LARGE  # Choose your benchmark

Available benchmark options:

  • Benchmarks.TEST_LARGE - Large test dataset for comprehensive evaluation
  • Benchmarks.TEST_SMALL - Small test dataset for quick testing
  • Benchmarks.ASSISTANT_EVALS_DOCQNA_SMALL - Document Q&A focused evaluation dataset

Publishing Configuration

PUBLISH_TO_HF = True      # Publish results to HuggingFace leaderboard
PUBLISH_TO_STUDIO = True  # Publish results to Pharia Studio for visualization

Models Configuration

# Evaluation Model - Role: Judges conversation quality
evaluation_model = CriticModel.OPENAI_GPT_4_1

# User Agent Model - Role: Simulates user behavior
user_agent_model = FacilitatorModel.OPENAI_GPT_4_1

Available models:

  • OPENAI_GPT_4_1 - GPT-4.1 (recommended for high-quality evaluation)
  • LLAMA_3_3_70B_INSTRUCT - Llama 3.3 70B
  • LLAMA_3_1_8B_INSTRUCT - Llama 3.1 8B

Chat Client

client = PhariaClassicChatServiceClient()  # Replace with your custom client

You can replace PhariaClassicChatServiceClient() with your own chat client implementation. Your client must inherit from ConversationalSearchClient and implement the get_response method.

Optional Parameters:

client = PhariaClassicChatServiceClient(
    api_base_url="http://localhost:8000",          # URL and port of your Pharia Chat Service
    user_id="system",                              # User identifier for the chat service
    client_name="PhariaClassicChatServiceClient",  # Custom name for the client instance
)
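
If you implement your own client in place of PhariaClassicChatServiceClient, it might look like the sketch below. The import path for ConversationalSearchClient, the exact get_response signature, and the /chat endpoint are assumptions made for illustration; check the installed package and your own service for the real interface.

# Sketch of a custom chat client (assumed interface; verify against the installed package)
import requests

from agentic_eval.external_chat_clients.base import ConversationalSearchClient  # assumed module path

class MyCustomChatClient(ConversationalSearchClient):
    def __init__(self, api_base_url="http://localhost:8000"):
        self.api_base_url = api_base_url

    def get_response(self, message, conversation_history=None):
        # Forward the user message to your own chat backend and return its answer.
        # The endpoint path and payload shape below are placeholders for your service.
        response = requests.post(
            f"{self.api_base_url}/chat",
            json={"message": message, "history": conversation_history or []},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["answer"]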

Studio Project Configuration

studio_project_name="pharia-chat-service"  # Customize project name in Studio

Run Name

The script automatically generates a timestamp-based run name:

run_name=datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

Example Customization

Here's an example of how you might customize the script for your specific needs:

BENCHMARK = Benchmarks.TEST_SMALL  # Use smaller dataset for faster testing
PUBLISH_TO_HF = True               # Enable HuggingFace publishing
PUBLISH_TO_STUDIO = True           # Enable Studio publishing
client = MyCustomChatClient()      # Use your custom client

if __name__ == "__main__":
    run_benchmark(
        client,
        run_name="my-custom-evaluation-run",
        benchmark_name=BENCHMARK,
        evaluation_model=CriticModel.LLAMA_3_3_70B_INSTRUCT,       # Use Llama for evaluation
        user_agent_model=FacilitatorModel.LLAMA_3_3_70B_INSTRUCT,  # Use Llama for user simulation
        publish_to_huggingface=PUBLISH_TO_HF,
        publish_to_studio=PUBLISH_TO_STUDIO,
        studio_project_name="my-evaluation-project",
    )

Additional Configuration Options

The run_benchmark function supports additional parameters for advanced use cases:

  • huggingface_repo_id: Specify custom HuggingFace repository (default: Aleph-Alpha/Chat-Assistant-Evaluations)
  • huggingface_api_key: Provide custom HuggingFace API key
  • huggingface_private: Set repository visibility (default: True)
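
For example, publishing to a custom private repository might look like the following sketch (the parameter names are the ones listed above; the repository id and token value are placeholders):

# Sketch: passing the Hugging Face options through run_benchmark
run_benchmark(
    client,
    dataset_name="test_small",
    run_name="hf-publish-example",
    evaluation_model=CriticModel.OPENAI_GPT_4_1,
    user_agent_model=FacilitatorModel.OPENAI_GPT_4_1,
    publish_to_huggingface=True,
    huggingface_repo_id="my-org/my-eval-results",  # placeholder repository
    huggingface_api_key="hf_...",                  # placeholder token
    huggingface_private=True,
)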

Troubleshooting

Common Issues

ImportError: Module not found

  • Ensure virtual environment is activated: source .venv/bin/activate
  • Verify package installation: pip list | grep agentic_eval
  • Try reinstalling the wheel (see Install Package), or uv sync / pip install -e . if you work from a source checkout

Authentication Error: Invalid tokens

  • Check .env file has correct token values
  • Verify tokens haven't expired
  • Ensure no extra spaces in token values

Connection Error: Cannot reach Pharia services

  • Check VPN connection if required
  • Verify URLs in environment variables
  • Test connectivity: curl -I https://kernel.pharia.example.com/

Model Not Available: Specific model unavailable

  • Check model availability in your region
  • Try alternative models from the supported list
  • Verify OpenAI API quota if using GPT models

Evaluation Timeout: Long-running evaluations

  • Start with TEST_SMALL benchmark for testing
  • Check system resources (memory/CPU)
  • Consider using lighter models for initial testing