
Evaluating Evaluators with LabelledEvaluatorDatasets#

The purpose of llama-datasets is to provide builders with the means to quickly benchmark LLM systems or tasks. In that spirit, the LabelledEvaluatorDataset exists to facilitate the evaluation of evaluators in a seamless and effortless manner.

This dataset consists of examples that mainly carry the following attributes: query, answer, ground_truth_answer, reference_score, and reference_feedback, along with some other supplementary attributes. The user flow for producing evaluations with this dataset consists of making predictions over the dataset with a provided LLM evaluator, and then computing metrics that measure the quality of those evaluations by comparing them to the corresponding references.
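
As a quick illustration of these attributes, the minimal sketch below downloads the same dataset used in the benchmarking snippet further down and inspects a single example; the attribute names follow those listed above, and the to_pandas() call assumes the standard llama-dataset interface.

from llama_index.core.llama_dataset import download_llama_dataset

# download the dataset (the same one used in the benchmarking snippet below)
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)

# each example holds the query, the answer to be judged, and the
# reference evaluation (score and feedback) to compare predictions against
example = evaluator_dataset.examples[0]
print(example.query)
print(example.answer)
print(example.reference_score)
print(example.reference_feedback)

# the full dataset can also be viewed as a pandas DataFrame
print(evaluator_dataset.to_pandas().head())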

Below is a snippet of code that makes use of the EvaluatorBenchmarkerPack to conveniently handle the process flow described above.

from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.gemini import Gemini

# download dataset
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)

# define evaluator
gemini_pro_llm = Gemini(model="models/gemini-pro", temperature=0)
evaluator = CorrectnessEvaluator(llm=gemini_pro_llm)

# download EvaluatorBenchmarkerPack and define the benchmarker
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=evaluator_dataset,
    show_progress=True,
)

# produce the benchmark result
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
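
Note that arun is a coroutine, so the await above assumes an already-running event loop (e.g., a Jupyter notebook). In a plain Python script, one way to drive it, sketched below as a continuation of the snippet above, is to hand the coroutine to asyncio; this is only illustrative.

import asyncio

# in a regular script, run the async benchmark inside an event loop
benchmark_df = asyncio.run(
    evaluator_benchmarker.arun(batch_size=5, sleep_time_in_seconds=0.5)
)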

A related llama-dataset is the LabelledPairwiseEvaluatorDataset, which again is meant to evaluate an evaluator, but this time the evaluator is tasked with comparing a pair of LLM responses to a given query and determining the better of the two. The usage flow described above is exactly the same as it is for the LabelledEvaluatorDataset, with the exception that the LLM evaluator must be equipped to perform the pairwise evaluation task; that is, it should be a PairwiseComparisonEvaluator. A rough sketch of this flow follows below.
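
The sketch mirrors the snippet above but swaps in a PairwiseComparisonEvaluator. The dataset name used here (MtBenchHumanJudgementDataset) is an assumption and should be checked against the available llama-datasets.

from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.gemini import Gemini

# download a pairwise evaluator dataset (dataset name assumed, see note above)
pairwise_dataset, _ = download_llama_dataset(
    "MtBenchHumanJudgementDataset", "./mt_bench_data"
)

# the evaluator must be capable of pairwise comparison
gemini_pro_llm = Gemini(model="models/gemini-pro", temperature=0)
pairwise_evaluator = PairwiseComparisonEvaluator(llm=gemini_pro_llm)

# benchmark it exactly as before
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=pairwise_evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)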

More learning materials#

To see these datasets in action, be sure to check out the notebooks listed below that benchmark LLM evaluators on slightly adapted versions of the MT-Bench dataset.