Retrieval Evaluation

This notebook uses our RetrieverEvaluator to evaluate the quality of any Retriever module defined in LlamaIndex.

We specify a set of different evaluation metrics: this includes hit-rate and MRR. For any given question, these will compare the quality of retrieved results from the ground-truth context.

To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation.


Here we load in data (PG essay), parse into Nodes. We then index this data using our simple vector index and get a retriever.

import nest_asyncio

from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI
documents = SimpleDirectoryReader("../../data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# by default, the node ids are set to random uuids. To ensure same id's per run, we manually set them.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(llm=llm)
vector_index = VectorStoreIndex(nodes, service_context=service_context)
retriever = vector_index.as_retriever(similarity_top_k=2)

Try out Retrieval

We’ll try out retrieval over a simple dataset.

retrieved_nodes = retriever.retrieve("What did the author do growing up?")
from llama_index.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

Node ID: 749c5544-13ae-4632-b8dd-c6367b718a73
Similarity: 0.8203777233851344
Text: What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn’t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called “data processing.” This was in 9th grade, so I was 13 or 14. The school district’s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain’s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in …

Node ID: 6e5d20a0-0c93-4465-9496-5e8318640067
Similarity: 0.8143566621554992
Text: [10]

Wow, I thought, there’s an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anything.

This had been possible in principle since 1993, but not many people had realized it yet. I had been intimately involved with building the infrastructure of the web for most of that time, and a writer as well, and it had taken me 8 years to realize it. Even then it took me several years to understand the implications. It meant there would be a whole new generation of essays. [11]

In the print era, the channel for publishing essays had been vanishingly small. Except for a few officially anointed thinkers who went to the right parties in New York, the only people allowed t…

Build an Evaluation dataset of (query, context) pairs

Here we build a simple evaluation dataset over the existing text corpus.

We use our generate_question_context_pairs to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.

We get back a EmbeddingQAFinetuneDataset object. At a high-level this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself.

from llama_index.evaluation import (
qa_dataset = generate_question_context_pairs(nodes, llm=llm, num_questions_per_chunk=2)
queries = qa_dataset.queries.values()
In the context, the author mentions his first experience with programming on a TRS-80. Describe the limitations he faced with this early computer and how he used it to write programs, including a word processor.
# [optional] save
# [optional] load
qa_dataset = EmbeddingQAFinetuneDataset.from_json("pg_eval_dataset.json")

Use RetrieverEvaluator for Retrieval Evaluation

We’re now ready to run our retrieval evals. We’ll run our RetrieverEvaluator over the eval dataset that we generated.

We define two functions: get_eval_results and also display_results that run our retriever over the dataset.

from llama_index.evaluation import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
Query: In the context, the author mentions his early experiences with programming on an IBM 1401. Describe the process he used to run a program on this machine and explain why he found it challenging to create meaningful programs on it.
Metrics: {'mrr': 1.0, 'hit_rate': 1.0}
# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
import pandas as pd

def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}

    return metric_df
display_results("top-2 eval", eval_results)
retrievers hit_rate mrr
0 top-2 eval 0.833333 0.784722