Concepts

This section introduces the main concepts used in PhariaCatch. Terms that appear within a definition, such as annotator or workforce, are themselves defined elsewhere in this article.

Data point

A data point is a unique, individual unit of information that represents a specific value within its containing dataset. Each data point is the smallest single value that annotators evaluate. Data points represent the input or source information in PhariaCatch.

Dataset

A dataset is the collection of all data points and their associated annotations, as provided by the annotators (domain experts) in the workforce assigned to the project. Datasets are the result of a completed project in PhariaCatch.

Task

A task is a grouping of a single data point, two or more labels, and instructions for the annotators to classify the data point with one or more labels. A task can be assigned to one or more annotators.

Project

A project is a collection of tasks. To ensure the generation of consistent datasets, project managers assign a workforce of annotators to a project and provide instructions that apply to all tasks in the project.

Use case

A use case is a collection of one or more projects. Project managers create use cases to organise projects that are connected logically in some way.
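The containment relationships described so far (a use case holds projects, a project holds tasks, a task wraps one data point) can be sketched as a small data model. This is an illustration only: the class and field names below are hypothetical and do not reflect the PhariaCatch API or storage format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataPoint:
    """Smallest unit that annotators evaluate (hypothetical fields)."""
    id: str
    content: str


@dataclass
class Task:
    """One data point, two or more candidate labels, and instructions."""
    data_point: DataPoint
    labels: list[str]
    instructions: str
    annotators: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The task definition requires at least two labels to choose from.
        if len(self.labels) < 2:
            raise ValueError("a task needs two or more labels")


@dataclass
class Project:
    """A collection of tasks with instructions that apply to all of them."""
    name: str
    instructions: str
    tasks: list[Task] = field(default_factory=list)


@dataclass
class UseCase:
    """A collection of logically related projects."""
    name: str
    projects: list[Project] = field(default_factory=list)


# Example data, invented for illustration.
task = Task(
    data_point=DataPoint("dp-1", "The invoice is overdue."),
    labels=["complaint", "reminder"],
    instructions="Pick the label that best fits the message.",
    annotators=["alice", "bob"],
)
project = Project("Triage", "Read each message fully.", [task])
use_case = UseCase("Customer mail", [project])
```

Modelling the data point as immutable (`frozen=True`) mirrors the idea that it is source information, while tasks and projects change as annotators are assigned.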

Annotator

An annotator is a domain expert who evaluates or judges data points and classifies ("annotates") them according to the instructions included in the containing task. Annotators are assigned tasks by a project manager in the context of a project.

Project manager

A project manager is responsible for delivering high-quality datasets for the purpose of evaluating or optimising an AI system. Project managers create and assign tasks to annotators and oversee the resulting data quality in the PhariaCatch portal.

Instruction

An instruction is a set of detailed annotation guidelines provided to annotators to ensure consistent and accurate labeling of data points. Instructions should outline the task objectives, define annotation categories, provide examples, and specify any rules or edge cases to follow during the annotation process.

Label

A label is a description that informs a machine learning model what a specific data point represents, so that it can learn from the example. Labels are assigned to data points by domain experts (annotators) in PhariaCatch.

Aggregation

An aggregation is the process of combining the annotations of all annotators who have completed a specific task and associating the combined result with the task's data point. It is the end result of the annotation process for each task.

Aggregation method

An aggregation method is the algorithm used to assign a final label to a data point that was annotated by more than one annotator. Currently, PhariaCatch supports one aggregation method: majority decision.

Majority decision (aggregation method)

The majority decision aggregation method associates a data point with the label that the majority of annotators agree on. With this method, if a majority decision is not reached (that is, no single label achieves 50% or more of all annotations), the data point takes the status “blocked”. In this case, the project manager can clarify the task or decide the final label for the data point.
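The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the PhariaCatch implementation: the names `majority_decision`, `BLOCKED`, and `threshold` are hypothetical. The sketch uses the 50% threshold stated above and additionally assumes that a tie for first place counts as "no single label", so it also yields the blocked status.

```python
from collections import Counter

BLOCKED = "blocked"  # hypothetical status marker


def majority_decision(annotations, threshold=0.5):
    """Return the winning label, or BLOCKED if no single label wins.

    `annotations` is the list of labels chosen by the annotators
    for one data point.
    """
    if not annotations:
        return BLOCKED

    counts = Counter(annotations)
    ranked = counts.most_common(2)
    label, votes = ranked[0]

    # Assumption: a tie for first place means no *single* label wins.
    if len(ranked) > 1 and ranked[1][1] == votes:
        return BLOCKED

    # The top label must reach the threshold share of all annotations.
    if votes / len(annotations) >= threshold:
        return label
    return BLOCKED


print(majority_decision(["cat", "cat", "dog"]))          # cat
print(majority_decision(["cat", "dog"]))                 # blocked
print(majority_decision(["cat", "cat", "dog", "bird"]))  # cat
```

In the last call the top label holds exactly 50% of the annotations and no other label matches it, so it is accepted under the stated threshold.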

Cross-annotation

A cross-annotation is the annotation of a single data point by multiple annotators, which allows the comparison of different annotators' opinions. Cross-annotation supports the calculation of meaningful quality metrics for the dataset.

Conflict

A conflict is a difference in annotation decisions that cannot be resolved by the aggregation methods used in a project.

Quality metric

A quality metric is a calculated value that serves as an indicator of the quality of a dataset. For example, "completeness" and "consistency" can be quality metrics.

Inter-annotator agreement

An inter-annotator agreement value indicates the degree of consensus between all annotators assessing the same data points. It is calculated as the number of data points on which all annotators agree on the same label, expressed as a percentage of the data points considered.
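One way to read this calculation is as the share of data points with a unanimous label. The sketch below makes that assumption explicit: the percentage is taken over all cross-annotated data points passed in. The function name and the input shape (a mapping from data point ID to the list of labels chosen by its annotators) are hypothetical.

```python
def inter_annotator_agreement(annotations_by_data_point):
    """Percentage of data points on which every annotator chose the same label.

    Assumption: the denominator is the number of data points supplied.
    """
    if not annotations_by_data_point:
        raise ValueError("at least one data point is required")

    unanimous = sum(
        1
        for labels in annotations_by_data_point.values()
        if len(set(labels)) == 1  # exactly one distinct label was chosen
    )
    return 100.0 * unanimous / len(annotations_by_data_point)


# Example data, invented for illustration.
sample = {
    "dp-1": ["spam", "spam", "spam"],
    "dp-2": ["spam", "ham", "spam"],
    "dp-3": ["ham", "ham"],
    "dp-4": ["spam", "ham"],
}
print(inter_annotator_agreement(sample))  # 50.0
```

Two of the four data points (dp-1 and dp-3) are unanimous, which gives 50%.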

Ground truth

A ground truth dataset is a verified, true dataset used for training, validating, and testing AI models. Ground truth data represents the gold standard of accurate data. It enables evaluation of model performance by comparing outputs to the human-labeled data. The part of the dataset which is free from conflicts is considered the ground truth. PhariaCatch exports only ground truth datasets.
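Selecting the conflict-free part of a dataset can be pictured as a simple filter. In this sketch, the aggregated result of each data point is either a final label or `None` when a conflict is unresolved. That representation and the function name are assumptions made for illustration and say nothing about how PhariaCatch stores or exports data.

```python
def ground_truth(aggregated):
    """Keep only data points whose aggregation produced a final label.

    `aggregated` maps a data point ID to its final label, or to None
    if the data point is still in conflict (hypothetical convention).
    """
    return {
        data_point: label
        for data_point, label in aggregated.items()
        if label is not None
    }


# Example data, invented for illustration.
aggregated = {"dp-1": "spam", "dp-2": None, "dp-3": "ham"}
print(ground_truth(aggregated))  # {'dp-1': 'spam', 'dp-3': 'ham'}
```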

Workforce

A workforce is a defined group of annotators and project managers that is assigned to a project by a project manager. Tasks within a project can only be assigned to the annotators who are members of the project’s workforce.
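The membership rule for task assignment can be expressed as a guard that runs before an assignment is recorded. The sketch below is hypothetical in every name it uses (`assign_task`, the set-based workforce, the list of assigned annotators) and only illustrates the constraint, not how PhariaCatch enforces it.

```python
def assign_task(assigned_annotators, annotator, workforce):
    """Add `annotator` to a task only if they belong to the project's workforce."""
    if annotator not in workforce:
        raise ValueError(f"{annotator!r} is not a member of the project's workforce")
    if annotator not in assigned_annotators:
        assigned_annotators.append(annotator)
    return assigned_annotators


# Example data, invented for illustration.
workforce = {"alice", "bob"}
assigned = assign_task([], "alice", workforce)
print(assigned)  # ['alice']
```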