Create and submit evaluations
Prerequisites to use evaluation
From Implement a simple task we have our input and output data types, and from Create Examples for an evaluation dataset we have a list of examples with their expected output. The dataset is stored in PhariaStudio, and its ID is available from Store an evaluation dataset in PhariaStudio.
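For reference, the snippets below assume the task types from the previous sections. A minimal sketch of what they could look like follows; the joke field is the one the evaluation logic below relies on, while the input field name is only illustrative and should match your own task implementation.
from pydantic import BaseModel

class TellAJokeTaskInput(BaseModel):
    # Illustrative field name; use whatever your task from "Implement a simple task" expects
    topic: str

class TellAJokeTaskOutput(BaseModel):
    # The evaluation logic below compares jokes via this field
    joke: str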
Add the necessary dependencies
from collections.abc import Iterable
import numpy as np
from dotenv import load_dotenv
from pydantic import BaseModel
from intelligence_layer.core import (
    CompleteInput,
    Llama3InstructModel,
    NoOpTracer,
    Pharia1ChatModel,
)
from intelligence_layer.evaluation import (
AggregationLogic,
Example,
SingleOutputEvaluationLogic,
)
from intelligence_layer.evaluation.benchmark.studio_benchmark import StudioBenchmarkRepository
load_dotenv()
Write the evaluation logic
We first want to ensure that the evaluation output contains the domain-specific evaluation result, so we create a data type for it.
In this simplified example, we expect the expected output to have the same data type as the task output.
class TellAJokeEvaluation(BaseModel):
is_it_funnier: bool
We expect the task to return a single output, so we can use SingleOutputEvaluationLogic as follows.
class TellAJokeEvaluationLogic(
    SingleOutputEvaluationLogic[
        TellAJokeTaskInput, TellAJokeTaskOutput, TellAJokeTaskOutput, TellAJokeEvaluation
    ]  # We pass TellAJokeTaskOutput also as ExpectedOutput
):
    def do_evaluate_single_output(
        self, example: Example[TellAJokeTaskInput, TellAJokeTaskOutput], output: TellAJokeTaskOutput
    ) -> TellAJokeEvaluation:
        # Use an LLM as a judge: ask it to compare the generated joke with the expected one.
        model = Llama3InstructModel("llama-3.1-8b-instruct")
        prompt_template = """Which is the funniest joke? Respond only with 1 or 2.
1)
{joke1}
2)
{joke2}
"""
        prompt = model.to_instruct_prompt(
            prompt_template.format(joke1=output.joke, joke2=example.expected_output.joke)
        )
        completion_input = CompleteInput(prompt=prompt)
        completion = model.complete(completion_input, NoOpTracer())
        # The generated joke is listed first, so a "1" in the answer means it was judged funnier.
        is_it_funnier = "1" in completion.completions[0].completion
        return TellAJokeEvaluation(is_it_funnier=is_it_funnier)
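Before wiring the logic into a benchmark, you can sanity-check it locally on a single hand-crafted example. This is only a quick smoke test; the jokes and the input field are placeholders.
example = Example(
    input=TellAJokeTaskInput(topic="penguins"),  # placeholder input
    expected_output=TellAJokeTaskOutput(joke="Why don't penguins fly? They can't afford plane tickets."),
)
generated_output = TellAJokeTaskOutput(joke="A penguin walks into a bar... to cool off.")
evaluation = TellAJokeEvaluationLogic().do_evaluate_single_output(example, generated_output)
print(evaluation.is_it_funnier)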
Now that we have the domain-specific evaluation logic, let's define how the individual results should be aggregated into a single result that reflects the quality of the whole experiment.
class TellAJokeAggregation(BaseModel):
ratio: float
class TellAJokeAggregationLogic(AggregationLogic[TellAJokeEvaluation, TellAJokeAggregation]):
    def aggregate(self, evaluations: Iterable[TellAJokeEvaluation]) -> TellAJokeAggregation:
        # Ratio of examples where the generated joke was judged funnier than the expected one.
        funny_arr = np.array(
            [evaluation.is_it_funnier for evaluation in evaluations]
        )
        ratio = np.count_nonzero(funny_arr) / funny_arr.size
        return TellAJokeAggregation(
            ratio=ratio
        )
Define the Benchmark
Now we have all we need to start creating a benchmark! In this step, we are exploring the Evaluation > Benchmarks section of PhariaStudio.
If the project contains no benchmarks yet, the UI presents a code snippet for creating one from code. As you can see, the three necessary components of a benchmark are the dataset, the evaluation logic, and the aggregation logic. We are ready to define our benchmark! Let’s stitch everything together.
You can copy the dataset ID by navigating to the Dataset section of PhariaStudio.
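The StudioBenchmarkRepository also needs a StudioClient. A minimal sketch is shown below, assuming the StudioClient from intelligence_layer.connectors and that your PhariaStudio URL and token are configured in the environment; the project name is illustrative, and you should double-check the constructor arguments against your SDK version.
from intelligence_layer.connectors import StudioClient

# Project name is illustrative; the URL and auth token are expected to come from your environment
studio_client = StudioClient(project="tell-a-joke-evaluation")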
studio_benchmark_repository = StudioBenchmarkRepository(studio_client)

evaluation_logic = TellAJokeEvaluationLogic()
aggregation_logic = TellAJokeAggregationLogic()

benchmark = studio_benchmark_repository.create_benchmark(
    "<your-dataset-id>", evaluation_logic, aggregation_logic, "task eval short dataset",
)
After executing the above, you can check the outcome in PhariaStudio, where you will now have at least one benchmark defined.
Why Do We Need to Submit Evaluation and Aggregation Logic?
We want to ensure that all results are consistent and avoid invalidating an entire benchmark because the logic varied from one execution to the next. To achieve this, we hash both the evaluation and the aggregation logic and compare the hashes with the logic that is currently being executed. As you can imagine, this is very restrictive: even adding a space or a new line will produce a different hash for the function.
How to Execute a Benchmark
Executing a benchmark is as easy as it can get! By now, we have defined our task logic and our benchmark, so let’s combine the two as follows:
# TellAJokeTask is the task implemented in "Implement a simple task"
model = Pharia1ChatModel("pharia-1-llm-7b-control")
task = TellAJokeTask(model)
benchmark.execute(
task=task,
name="First attempt",
description="First implementation of the task logic",
labels=["v1"],
metadata={"some": "metadata"}
)
After running the code, you will see the Intelligence Layer executing your task over each example and submitting the evaluation results along with all related traces.
Opening PhariaStudio, you will find that the benchmark now has a Last Run attached; clicking on it lets you access that run and check what happened.
Check the Traces for Each Example
Now for the interesting part: why didn't my task perform as expected? Was it the AI logic? These questions can be answered with this integration.
By clicking on the run, you can access a detailed view, where each line is the execution of one example.
Clicking on an individual example opens its lineage, which is linked to the execution trace.
This closes the loop with the rest of PhariaStudio's functionality: you can inspect the trace content and try it in the Playground for faster debugging of the AI logic.