Usage Pattern (Retrieval)#
Using RetrieverEvaluator
#
This runs evaluation over a single query + ground-truth document set given a retriever.
The standard practice is to specify a set of valid metrics with from_metrics
.
from llama_index.evaluation import RetrieverEvaluator
# define retriever somewhere (e.g. from index)
# retriever = index.as_retriever(similarity_top_k=2)
retriever = ...
retriever_evaluator = RetrieverEvaluator.from_metric_names(
["mrr", "hit_rate"], retriever=retriever
)
retriever_evaluator.evaluate(
query="query", expected_ids=["node_id1", "node_id2"]
)
Building an Evaluation Dataset#
You can manually curate a retrieval evaluation dataset of questions + node id’s. We also offer synthetic dataset generation over an existing text corpus with our generate_question_context_pairs
function:
from llama_index.evaluation import generate_question_context_pairs
qa_dataset = generate_question_context_pairs(
nodes, llm=llm, num_questions_per_chunk=2
)
The returned result is a EmbeddingQAFinetuneDataset
object (containing queries
, relevant_docs
, and corpus
).
Plugging it into RetrieverEvaluator
#
We offer a convenience function to run a RetrieverEvaluator
over a dataset in batch mode.
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
This should run much faster than you trying to call .evaluate
on each query separately.