
Retrieval Evaluation

This notebook uses our RetrieverEvaluator to evaluate the quality of any Retriever module defined in LlamaIndex.

We specify a set of different evaluation metrics: this includes hit-rate and MRR. For any given question, these compare the quality of the retrieved results against the ground-truth context.
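To make these metrics concrete, here is a minimal, illustrative sketch of how hit rate and MRR can be computed for a single query. This is not the LlamaIndex implementation; the function name and inputs are assumptions for illustration.

def hit_rate_and_mrr(retrieved_ids, expected_ids):
    # retrieved_ids: node ids returned by the retriever, in rank order
    # expected_ids: ground-truth node ids for the query
    # hit rate: 1.0 if any relevant id appears in the retrieved list
    hit = float(any(rid in expected_ids for rid in retrieved_ids))
    # MRR: reciprocal rank of the first relevant result (0.0 if none is found)
    mrr = 0.0
    for rank, rid in enumerate(retrieved_ids, start=1):
        if rid in expected_ids:
            mrr = 1.0 / rank
            break
    return {"hit_rate": hit, "mrr": mrr}

# example: the relevant chunk is retrieved at rank 2
print(hit_rate_and_mrr(["node_0", "node_7"], {"node_7"}))  # {'hit_rate': 1.0, 'mrr': 0.5}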

To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation.

Setup

Here we load in the data (the Paul Graham essay) and parse it into Nodes. We then index this data using our simple vector index and get a retriever.

import nest_asyncio

# apply nest_asyncio so async evaluation calls can run inside the notebook's event loop
nest_asyncio.apply()
from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SentenceSplitter
from llama_index.llms import OpenAI

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# by default, the node ids are set to random UUIDs. To ensure the same ids per run, we manually set them.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(llm=llm)
vector_index = VectorStoreIndex(nodes, service_context=service_context)
retriever = vector_index.as_retriever(similarity_top_k=2)

Try out Retrieval

We’ll try out retrieval over a simple dataset.

from llama_index.response.notebook_utils import display_source_node

retrieved_nodes = retriever.retrieve("What did the author do growing up?")

for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

Node ID: 749c5544-13ae-4632-b8dd-c6367b718a73
Similarity: 0.8203777233851344
Text: What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn’t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called “data processing.” This was in 9th grade, so I was 13 or 14. The school district’s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain’s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in …

Node ID: 6e5d20a0-0c93-4465-9496-5e8318640067
Similarity: 0.8143566621554992
Text: [10]

Wow, I thought, there’s an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anything.

This had been possible in principle since 1993, but not many people had realized it yet. I had been intimately involved with building the infrastructure of the web for most of that time, and a writer as well, and it had taken me 8 years to realize it. Even then it took me several years to understand the implications. It meant there would be a whole new generation of essays. [11]

In the print era, the channel for publishing essays had been vanishingly small. Except for a few officially anointed thinkers who went to the right parties in New York, the only people allowed t…

Build an Evaluation dataset of (query, context) pairs

Here we build a simple evaluation dataset over the existing text corpus.

We use our generate_question_context_pairs to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.

We get back an EmbeddingQAFinetuneDataset object. At a high level, this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself (we take a quick look at these fields below).

from llama_index.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
queries = qa_dataset.queries.values()
print(list(queries)[2])
In the context, the author mentions his first experience with programming on a TRS-80. Describe the limitations he faced with this early computer and how he used it to write programs, including a word processor.
# [optional] save
qa_dataset.save_json("pg_eval_dataset.json")
# [optional] load
qa_dataset = EmbeddingQAFinetuneDataset.from_json("pg_eval_dataset.json")
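At this point we can sanity-check the structure of the dataset. The fields below (queries, corpus, relevant_docs) are the dictionaries stored on EmbeddingQAFinetuneDataset; the exact prints are just an illustrative peek.

# illustrative: peek at the dataset structure
print(len(qa_dataset.queries))  # question id -> question text
print(len(qa_dataset.corpus))  # node id -> chunk text
print(list(qa_dataset.relevant_docs.items())[0])  # question id -> relevant node ids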

Use RetrieverEvaluator for Retrieval Evaluation

We’re now ready to run our retrieval evals. We’ll run our RetrieverEvaluator over the eval dataset that we generated.

We define a helper function, display_results, that aggregates and displays the evaluation results from running our retriever over the dataset.

from llama_index.evaluation import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)
Query: In the context, the author mentions his early experiences with programming on an IBM 1401. Describe the process he used to run a program on this machine and explain why he found it challenging to create meaningful programs on it.
Metrics: {'mrr': 1.0, 'hit_rate': 1.0}
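Each evaluation result also exposes its metric values as a plain dict via metric_vals_dict, which is what we aggregate over when evaluating the full dataset below.

# the metric values are also available programmatically (used for aggregation below)
print(eval_result.metric_vals_dict)  # e.g. {'mrr': 1.0, 'hit_rate': 1.0}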
# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
import pandas as pd


def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}
    )

    return metric_df
display_results("top-2 eval", eval_results)
retrievers hit_rate mrr
0 top-2 eval 0.833333 0.784722
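As a follow-up, the same helpers can be reused to compare different retriever settings, for example varying similarity_top_k. This is a sketch under that assumption; the variable names are illustrative and the resulting numbers will depend on your run.

# illustrative: compare different top-k settings on the same eval dataset
result_dfs = []
for top_k in [2, 5]:
    retriever_k = vector_index.as_retriever(similarity_top_k=top_k)
    evaluator_k = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever_k
    )
    results_k = await evaluator_k.aevaluate_dataset(qa_dataset)
    result_dfs.append(display_results(f"top-{top_k} eval", results_k))

pd.concat(result_dfs, ignore_index=True)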