Reciprocal Rerank Fusion Retriever#
In this example, we walk through how you can combine retrieval results from multiple queries and multiple indexes.
The retrieved nodes will be reranked according to the Reciprocal Rerank Fusion
algorithm demonstrated in this paper. It provides an effecient method for rerranking retrieval results without excessive computation or reliance on external models.
Full credits go to @Raduaschl on github for their example implementation here.
import os
import openai
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
Setup#
If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
Next, we will setup a vector index over the documentation.
from llama_index import VectorStoreIndex, ServiceContext
service_context = ServiceContext.from_defaults(chunk_size=256)
index = VectorStoreIndex.from_documents(
documents, service_context=service_context
)
Create a Hybrid Fusion Retriever#
In this step, we fuse our index with a BM25 based retriever. This will enable us to capture both semantic relations and keywords in our input queries.
Since both of these retrievers calculate a score, we can use the reciprocal rerank algorithm to re-sort our nodes without using an additional models or excessive computation.
This setup will also query 4 times, once with your original query, and generate 3 more queries.
By default, it uses the following prompt to generate extra queries:
QUERY_GEN_PROMPT = (
"You are a helpful assistant that generates multiple search queries based on a "
"single input query. Generate {num_queries} search queries, one on each line, "
"related to the following input query:\n"
"Query: {query}\n"
"Queries:\n"
)
First, we create our retrievers. Each will retrieve the top-2 most similar nodes:
from llama_index.retrievers import BM25Retriever
vector_retriever = index.as_retriever(similarity_top_k=2)
bm25_retriever = BM25Retriever.from_defaults(
docstore=index.docstore, similarity_top_k=2
)
Next, we can create our fusion retriever, which well return the top-2 most similar nodes from the 4 returned nodes from the retrievers:
from llama_index.retrievers import QueryFusionRetriever
retriever = QueryFusionRetriever(
[vector_retriever, bm25_retriever],
similarity_top_k=2,
num_queries=4, # set this to 1 to disable query generation
mode="reciprocal_rerank",
use_async=True,
verbose=True,
# query_gen_prompt="...", # we could override the query generation prompt here
)
# apply nested async to run in a notebook
import nest_asyncio
nest_asyncio.apply()
nodes_with_scores = retriever.retrieve(
"What happened at Interleafe and Viaweb?"
)
Generated queries:
1. What were the major events or milestones in the history of Interleafe and Viaweb?
2. Can you provide a timeline of the key developments and achievements of Interleafe and Viaweb?
3. What were the successes and failures of Interleafe and Viaweb as companies?
for node in nodes_with_scores:
print(f"Score: {node.score:.2f} - {node.text}...\n-----\n")
Score: 0.05 - Now you could just update the software right on the server.
We started a new company we called Viaweb, after the fact that our software worked via the web, and we got $10,000 in seed funding from Idelle's husband Julian. In return for that and doing the initial legal work and giving us business advice, we gave him 10% of the company. Ten years later this deal became the model for Y Combinator's. We knew founders needed something like this, because we'd needed it ourselves.
At this stage I had a negative net worth, because the thousand dollars or so I had in the bank was more than counterbalanced by what I owed the government in taxes. (Had I diligently set aside the proper proportion of the money I'd made consulting for Interleaf? No, I had not.) So although Robert had his graduate student stipend, I needed that seed funding to live on.
We originally hoped to launch in September, but we got more ambitious about the software as we worked on it....
-----
Score: 0.03 - [8]
There were three main parts to the software: the editor, which people used to build sites and which I wrote, the shopping cart, which Robert wrote, and the manager, which kept track of orders and statistics, and which Trevor wrote. In its time, the editor was one of the best general-purpose site builders. I kept the code tight and didn't have to integrate with any other software except Robert's and Trevor's, so it was quite fun to work on. If all I'd had to do was work on this software, the next 3 years would have been the easiest of my life. Unfortunately I had to do a lot more, all of it stuff I was worse at than programming, and the next 3 years were instead the most stressful.
There were a lot of startups making ecommerce software in the second half of the 90s. We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and inexpensive. It was lucky for us that we were poor, because that caused us to make Viaweb even more inexpensive than we realized. We charged $100 a month for a small store and $300 a month for a big one....
-----
As we can see, both retruned nodes correctly mention Viaweb and Interleaf!
Use in a Query Engine!#
Now, we can plug our retriever into a query engine to synthesize natural language responses.
from llama_index.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What happened at Interleafe and Viaweb?")
Generated queries:
1. What were the major events or milestones in the history of Interleafe and Viaweb?
2. Can you provide a timeline of the key developments and achievements of Interleafe and Viaweb?
3. What were the successes and failures of Interleafe and Viaweb as companies?
from llama_index.response.notebook_utils import display_response
display_response(response)
Final Response:
At Interleaf, the author had worked as a consultant and had made some money. However, they did not set aside the proper proportion of the money to pay taxes, resulting in a negative net worth.
At Viaweb, the author and their team started a new company that developed software for building websites and managing online stores. They received $10,000 in seed funding from Julian, who was the husband of Idelle. In return for the funding and business advice, Julian received a 10% stake in the company. The software developed by Viaweb was designed to be easy to use and inexpensive, with prices ranging from $100 to $300 per month.