Retrieval Evaluation#

This notebook uses our RetrieverEvaluator to evaluate the quality of any Retriever module defined in LlamaIndex.

We specify a set of different evaluation metrics: this includes hit-rate and MRR. For any given question, these will compare the quality of retrieved results from the ground-truth context.

To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation.

Setup#

Here we load in data (PG essay), parse into Nodes. We then index this data using our simple vector index and get a retriever.

import nest_asyncio

nest_asyncio.apply()

from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SentenceSplitter
from llama_index.llms import OpenAI

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# by default, the node ids are set to random uuids. To ensure same id's per run, we manually set them.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(llm=llm)

vector_index = VectorStoreIndex(nodes, service_context=service_context)
retriever = vector_index.as_retriever(similarity_top_k=2)

Try out Retrieval#

We’ll try out retrieval over a simple dataset.

retrieved_nodes = retriever.retrieve("What did the author do growing up?")

from llama_index.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

Node ID: node_0
Similarity: 0.8181379514114543
Text: What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn’t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called “data processing.” This was in 9th grade, so I was 13 or 14. The school district’s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain’s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in …

Node ID: node_52
Similarity: 0.8143530600618721
Text: It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.

In the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.

In the fall of 2019, Bel was finally finished. Like McCarthy’s original Lisp, it’s a spec rather than an implementation, although like McCarthy’s Lisp it’s a spec expressed as code.

Now that I could write essays again, I wrote a bunch about topics I’d had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that ques…

Build an Evaluation dataset of (query, context) pairs#

Here we build a simple evaluation dataset over the existing text corpus.

We use our generate_question_context_pairs to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.

We get back a EmbeddingQAFinetuneDataset object. At a high-level this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself.

from llama_index.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

queries = qa_dataset.queries.values()
print(list(queries)[2])

"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. What were the key differences in terms of user interaction and programming capabilities?"

# [optional] save
qa_dataset.save_json("pg_eval_dataset.json")

# [optional] load
qa_dataset = EmbeddingQAFinetuneDataset.from_json("pg_eval_dataset.json")

Use `RetrieverEvaluator` for Retrieval Evaluation#

We’re now ready to run our retrieval evals. We’ll run our RetrieverEvaluator over the eval dataset that we generated.

We define two functions: get_eval_results and also display_results that run our retriever over the dataset.

include_cohere_rerank = True

if include_cohere_rerank:
    !pip install cohere -q

from llama_index.evaluation import RetrieverEvaluator

metrics = ["mrr", "hit_rate"]

if include_cohere_rerank:
    metrics.append(
        "cohere_rerank_relevancy"  # requires COHERE_API_KEY environment variable to be set
    )

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

Query: In the context provided, the author describes his early experiences with programming on an IBM 1401. Based on his description, what were some of the limitations and challenges he faced while trying to write programs on this machine?
Metrics: {'mrr': 1.0, 'hit_rate': 1.0, 'cohere_rerank_relevancy': 0.99620515}

# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

import pandas as pd


def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    columns = {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}

    if include_cohere_rerank:
        crr_relevancy = full_df["cohere_rerank_relevancy"].mean()
        columns.update({"cohere_rerank_relevancy": [crr_relevancy]})

    metric_df = pd.DataFrame(columns)

    return metric_df

display_results("top-2 eval", eval_results)

	retrievers	hit_rate	mrr	cohere_rerank_relevancy
0	top-2 eval	0.801724	0.685345	0.946009

Retrieval Evaluation#

Setup#

Try out Retrieval#

Build an Evaluation dataset of (query, context) pairs#

Use RetrieverEvaluator for Retrieval Evaluation#

Use `RetrieverEvaluator` for Retrieval Evaluation#