Open In Colab

Recursive Retriever + Document Agents#

This guide shows how to combine recursive retrieval and “document agents” for advanced decision making over heterogeneous documents.

There are two motivating factors that lead to solutions for better retrieval:

  • Decoupling retrieval embeddings from chunk-based synthesis. Oftentimes fetching documents by their summaries will return more relevant context to queries rather than raw chunks. This is something that recursive retrieval directly allows.

  • Within a document, users may need to dynamically perform tasks beyond fact-based question-answering. We introduce the concept of “document agents” - agents that have access to both vector search and summary tools for a given document.

Setup and Download Data#

In this section, we’ll define imports and then download Wikipedia articles about different cities. Each article is stored separately.

If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index
from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.schema import IndexNode
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    with open(data_path / f"{title}.txt", "w") as fp:
        fp.write(wiki_text)
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()

Define LLM + Service Context + Callback Manager

import os

os.environ["OPENAI_API_KEY"] = "sk-..."
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

Build Document Agent for each Document#

In this section we define “document agents” for each document.

First we define both a vector index (for semantic search) and summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function calling agent.

This document agent can dynamically choose to perform semantic search or summarization within a given document.

We create a separate document agent for each city.

from llama_index.agent import OpenAIAgent

# Build agents dictionary
agents = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title], service_context=service_context
    )
    # build summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title], service_context=service_context
    )
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    f"Useful for retrieving specific context from {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for summarization questions related to"
                    f" {wiki_title}"
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[wiki_title] = agent

Build Composable Retriever over these Agents#

Now we define a set of summary nodes, where each node links to the corresponding Wikipedia city article. We then define a composable retriever + query engine on top of these Nodes to route queries down to a given node, which will in turn route it to the relevant document agent.

# define top-level nodes
objects = []
for wiki_title in wiki_titles:
    # define index node that links to these agents
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to lookup specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(
        text=wiki_summary, index_id=wiki_title, obj=agents[wiki_title]
    )
    objects.append(node)
# define top-level retriever
vector_index = VectorStoreIndex(
    objects=objects, service_context=service_context
)
query_engine = vector_index.as_query_engine(similarity_top_k=1, verbose=True)

Running Example Queries#

# should use Boston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Boston")
Retrieval entering Boston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Boston
Added user message to memory: Tell me about the sports teams in Boston
print(response)
Boston is home to several professional sports teams across different leagues. These teams include the Boston Red Sox in Major League Baseball, the New England Patriots in the National Football League, the Boston Celtics in the NBA, the Boston Bruins in the NHL, and the New England Revolution in Major League Soccer. These teams have a rich history and are widely supported by fans in Boston and across the country.
# should use Houston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Houston")
Retrieval entering Houston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Houston
Added user message to memory: Tell me about the sports teams in Houston
print(response)
Houston is home to several professional sports teams across different leagues. The city has a professional football team called the Houston Texans, a professional basketball team called the Houston Rockets, a professional baseball team called the Houston Astros, a professional soccer team called the Houston Dynamo, and a professional women's soccer team called the Houston Dash. These teams compete in the National Football League (NFL), National Basketball Association (NBA), Major League Baseball (MLB), Major League Soccer (MLS), and National Women's Soccer League (NWSL) respectively. Houston also has minor league baseball, hockey, and other sports teams, making it a city with a rich sports culture.
# should use Seattle agent -> summary tool
response = query_engine.query(
    "Give me a summary on all the positive aspects of Chicago"
)
Retrieval entering Chicago: OpenAIAgent
Retrieving from object OpenAIAgent with query Give me a summary on all the positive aspects of Chicago
Added user message to memory: Give me a summary on all the positive aspects of Chicago
=== Calling Function ===
Calling function: summary_tool with args: {
  "input": "positive aspects of Chicago"
}
Got output: Chicago is a vibrant city with a diverse economy and a wide range of industries. It serves as a major hub for finance, culture, commerce, industry, education, technology, telecommunications, and transportation. The city has a thriving arts and music scene, making significant contributions to visual arts, literature, film, theater, comedy, food, dance, and various music genres. Chicago is also known for its prestigious universities, including the University of Chicago, Northwestern University, and the University of Illinois Chicago. Furthermore, it is home to professional sports teams in all major leagues.
========================
print(response)
Chicago is a vibrant city with a diverse economy and a wide range of industries. It serves as a major hub for finance, culture, commerce, industry, education, technology, telecommunications, and transportation. The city has a thriving arts and music scene, making significant contributions to visual arts, literature, film, theater, comedy, food, dance, and various music genres. Chicago is also known for its prestigious universities, including the University of Chicago, Northwestern University, and the University of Illinois Chicago. Furthermore, it is home to professional sports teams in all major leagues.