
Create and submit evaluations

Prerequisites to use evaluation

From Implement a simple task we have our input and output data types, and from Create examples for an evaluation dataset we have a list of examples with their expected output. The dataset is stored in PhariaStudio, and its ID is available from Store an evaluation dataset in PhariaStudio.
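
For reference, the evaluation logic below only relies on the task output exposing a joke field. A minimal sketch of what the types from Implement a simple task might look like is shown here; the input field is a hypothetical placeholder, the exact definitions live in that tutorial.

from pydantic import BaseModel


# Hypothetical shapes of the types from "Implement a simple task".
# The only assumption used below is that the output carries a `joke` string.
class TellAJokeTaskInput(BaseModel):
    topic: str  # assumed input field, for illustration only


class TellAJokeTaskOutput(BaseModel):
    joke: str  # the generated joke, compared against the expected output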

Add the necessary dependencies

from collections.abc import Iterable

import numpy as np
from dotenv import load_dotenv
from pydantic import BaseModel

from intelligence_layer.core import (
    CompleteInput,
    Llama3InstructModel,
    NoOpTracer,
    Pharia1ChatModel,  # assumed to live in intelligence_layer.core; used when executing the benchmark below
)

from intelligence_layer.evaluation import (
    AggregationLogic,
    Example,
    SingleOutputEvaluationLogic,
)

from intelligence_layer.evaluation.benchmark.studio_benchmark import StudioBenchmarkRepository

load_dotenv()

Write the evaluation logic

We first want to ensure that the evaluation output contains the domain-specific evaluation result, so we create a data type for it.

tip

In this simplified example, we expect the expected output to have the same data type as the task output.

class TellAJokeEvaluation(BaseModel):
    is_it_funnier: bool

We expect the task to return a single output, so we can use SingleOutputEvaluationLogic as follows.

class TellAJokeEvaluationLogic(
    SingleOutputEvaluationLogic[
        TellAJokeTaskInput, TellAJokeTaskOutput, TellAJokeTaskOutput, TellAJokeEvaluation
    ]  # We pass TellAJokeTaskOutput also as ExpectedOutput
):
    def do_evaluate_single_output(
        self, example: Example[TellAJokeTaskInput, TellAJokeTaskOutput], output: TellAJokeTaskOutput
    ) -> TellAJokeEvaluation:
        model = Llama3InstructModel("llama-3.1-8b-instruct")
        prompt_template = """Which is the funniest joke? Respond only with 1 or 2.

1)
{joke1}

2)
{joke2}
"""

        prompt = model.to_instruct_prompt(
            prompt_template.format(joke1=output.joke, joke2=example.expected_output.joke)
        )
        completion_input = CompleteInput(prompt=prompt)
        completion = model.complete(completion_input, NoOpTracer())
        is_it_funnier = "1" in completion.completions[0].completion

        return TellAJokeEvaluation(is_it_funnier=is_it_funnier)

Now that we have the domain-specific evaluation logic, let's define how the individual results should be aggregated into a single result that reflects the quality of the whole experiment.

class TellAJokeAggregation(BaseModel):
    ratio: float


class TellAJokeAggregationLogic(AggregationLogic[TellAJokeEvaluation, TellAJokeAggregation]):
    def aggregate(self, evaluations: Iterable[TellAJokeEvaluation]) -> TellAJokeAggregation:
        funny_arr = np.array(
            [evaluation.is_it_funnier for evaluation in evaluations]
        )
        ratio = np.count_nonzero(funny_arr) / funny_arr.size
        return TellAJokeAggregation(ratio=ratio)
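
As a quick sanity check, the aggregation logic can be exercised locally with hand-written evaluations; with one funnier result and one not, the ratio comes out to 0.5.

sanity_check = TellAJokeAggregationLogic().aggregate(
    [
        TellAJokeEvaluation(is_it_funnier=True),
        TellAJokeEvaluation(is_it_funnier=False),
    ]
)
print(sanity_check.ratio)  # 0.5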

Define the Benchmark

Now we have all we need to start creating a benchmark! In this step, we are exploring the Evaluation > Benchmarks section of PhariaStudio.

[Image: studio-benchmark.png]

If the project contains no benchmarks, the UI presents a code snippet for adding a benchmark from your code. As you can see, the three necessary components of a benchmark are the dataset, the evaluation logic, and the aggregation logic. We are ready to define our benchmark! Let's stitch everything together.

You can copy the dataset ID by navigating to the Dataset section of PhariaStudio.
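
The snippet below also assumes a studio_client, i.e. the client created when the dataset was stored (see Store an evaluation dataset in PhariaStudio). If you are starting from a fresh script, it can be instantiated roughly as in this sketch, assuming the StudioClient connector from intelligence_layer.connectors and that your credentials are provided via the environment loaded by load_dotenv(); the project name is a placeholder.

from intelligence_layer.connectors import StudioClient

# Sketch: connect to the same PhariaStudio project that holds the dataset.
# Replace the placeholder with your actual project name.
studio_client = StudioClient(project="<your-project-name>")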

studio_benchmark_repository = StudioBenchmarkRepository(studio_client)

evaluation_logic = TellAJokeEvaluationLogic()
aggregation_logic = TellAJokeAggregationLogic()

benchmark = studio_benchmark_repository.create_benchmark(
    "<your-dataset-id>", evaluation_logic, aggregation_logic, "task eval short dataset"
)

After executing the above, you can check the outcome in PhariaStudio, where you will now have at least one benchmark defined.

[Image: studio-new-benchmark.png]

Why Do We Need to Submit Evaluation and Aggregation Logic?

We want all results within a benchmark to be comparable; if the evaluation or aggregation logic changed from one execution to the next, the results of the entire benchmark would be invalidated. To achieve this, we hash the two logics and compare the hashes against what is currently being executed. As you can imagine, this is very restrictive: even adding a space or a new line generates a different hash for the function.
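
The exact hashing scheme is internal to the Intelligence Layer, but conceptually it is close to hashing the source code of the logic classes, roughly as in this illustrative sketch (not the actual implementation).

import hashlib
import inspect


def logic_hash(logic_class: type) -> str:
    # Hash the verbatim source of the class: any change, even whitespace,
    # yields a different digest.
    source = inspect.getsource(logic_class)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


print(logic_hash(TellAJokeEvaluationLogic))
print(logic_hash(TellAJokeAggregationLogic))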

How to Execute a Benchmark

Executing a benchmark is as easy as it can get! By now, we have defined our task logic and our benchmark, so let’s combine the two as follows:

model = Pharia1ChatModel("pharia-1-llm-7b-control")

# TellAJokeTask is the task implemented in "Implement a simple task"
task = TellAJokeTask(model)
benchmark.execute(
    task=task,
    name="First attempt",
    description="First implementation of the task logic",
    labels=["v1"],
    metadata={"some": "metadata"},
)

After running the code, you will see the Intelligence Layer execute your tasks and submit the evaluations together with all related traces.

Opening PhariaStudio, you will find that the benchmark now has a Last Run attached; by clicking on it, you can access that run and check what happened.

[Image: studio-list-benchmarks.png]

Check the Traces for Each Example

Now for the interesting part: why didn't my task perform as expected? Was it the AI logic? The answers to these questions can be discovered with this new integration!

By clicking on the run, you can access the detailed view, where each line corresponds to the execution of one example.

Product &gt; Use Evaluation &gt; studio-benchmark-lineages.png

But this is where it gets interesting: by clicking on each example, it is possible to check the lineage, which is linked to the execution trace!

Product &gt; Use Evaluation &gt; studio-benchmark-lineage-details.png

This closes the loop with the rest of the functionality in PhariaStudio, as it enables you to check the trace content and try it out in the Playground for faster debugging of the AI logic.