
OpenAI Assistant Advanced Retrieval Cookbook#

In this notebook, we try out the OpenAI Assistant API for advanced retrieval tasks by plugging in a variety of query engine tools and datasets. The wrapper abstraction we use is our OpenAIAssistantAgent class, which lets us plug in custom tools. We explore how the OpenAI Assistant can complement or replace existing workflows handled by our retrievers/query engines through its agent execution + function-calling loop:

  • Joint QA + Summarization

  • Auto retrieval

  • Joint SQL and vector search

%pip install llama-index-agent-openai
%pip install llama-index-vector-stores-pinecone
%pip install llama-index-readers-wikipedia
%pip install llama-index-llms-openai
!pip install llama-index
import nest_asyncio

nest_asyncio.apply()

Joint QA and Summarization#

In this section, we show how the Assistant agent can answer both fact-based questions and summarization questions over the same data. This is something the in-house retrieval tool struggles to accomplish.

Load Data#

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2023-11-11 09:40:13--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.009s  

2023-11-11 09:40:14 (8.24 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

Setup Vector + Summary Indexes/Query Engines/Tools#

from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core import SummaryIndex

# initialize settings (set LLM and chunk size)
Settings.llm = OpenAI()
Settings.chunk_size = 1024
nodes = Settings.node_parser.get_nodes_from_documents(documents)

# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Define Summary Index and Vector Index over Same Data
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

# define query engines
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    name="summary_tool",
    description=(
        "Useful for summarization questions related to the author's life"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context to answer specific questions about the author's life"
    ),
)

Define Assistant Agent#

from llama_index.agent.openai import OpenAIAssistantAgent

agent = OpenAIAssistantAgent.from_new(
    name="QA bot",
    instructions="You are a bot designed to answer questions about the author",
    openai_tools=[],
    tools=[summary_tool, vector_tool],
    verbose=True,
    run_retrieve_sleep_time=1.0,
)
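
Note that an assistant created with from_new persists on OpenAI's side. As a minimal sketch (assuming you have the assistant's ID from the creation step or the OpenAI dashboard; "<assistant_id>" below is a placeholder), you can reattach to it with from_existing instead of creating a new assistant on every run:

# Hedged sketch: reconnect to a previously created assistant by ID rather than
# creating a new one each run; "<assistant_id>" is a placeholder, not a real ID.
existing_agent = OpenAIAssistantAgent.from_existing(
    assistant_id="<assistant_id>",
    tools=[summary_tool, vector_tool],
    verbose=True,
)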

Results: A bit flaky#

response = agent.chat("Can you give me a summary about the author's life?")
print(str(response))
=== Calling Function ===
Calling function: summary_tool with args: {"input":"Can you give me a summary about the author's life?"}
Got output: The author, Paul Graham, had a strong interest in writing and programming from a young age. They started writing short stories and experimenting with programming in high school. In college, they initially studied philosophy but switched to studying artificial intelligence. However, they realized that the AI being practiced at the time was not going to lead to true understanding of natural language. This led them to focus on Lisp programming and eventually write a book about Lisp hacking. Despite being in a PhD program in computer science, the author also developed a passion for art and decided to pursue it further. They attended the Accademia di Belli Arti in Florence but found that it did not teach them much. They then returned to the US and got a job at a software company. Afterward, they attended the Rhode Island School of Design but dropped out due to the focus on developing a signature style rather than teaching the fundamentals of art. They then moved to New York City and became interested in the World Wide Web, eventually starting a company called Viaweb. They later founded Y Combinator, an investment firm, and created Hacker News.
========================
Paul Graham is an author with eclectic interests and a varied career path. He began with interests in writing and programming, engaged in philosophy and artificial intelligence during college, and authored a book on Lisp programming. With an equally strong passion for art, he studied at the Accademia di Belli Arti in Florence and briefly at the Rhode Island School of Design before immersing himself in the tech industry by starting Viaweb and later founding the influential startup accelerator Y Combinator. He also created Hacker News, a social news website focused on computer science and entrepreneurship. Graham's life reflects a blend of technology, entrepreneurship, and the arts.
response = agent.query("What did the author do after RICS?")
print(str(response))
=== Calling Function ===
Calling function: vector_tool with args: {"input":"After RICS"}
Got output: After RICS, the author moved back to Providence to continue at RISD. However, it became clear that art school, specifically the painting department, did not have the same relationship to art as medical school had to medicine. Painting students were expected to express themselves and develop a distinctive signature style.
========================
After the author's time at the Royal Institution of Chartered Surveyors (RICS), they moved back to Providence to continue their studies at the Rhode Island School of Design (RISD). There, the author noted a significant difference in the educational approaches between RISD and medical school, specifically in the painting department. At RISD, students were encouraged to express themselves and to develop a unique and distinctive signature style in their artwork.

AutoRetrieval from a Vector Database#

Our existing “auto-retrieval” capabilities (in VectorIndexAutoRetriever) allow an LLM to infer the right query parameters for a vector database, including both the query string and metadata filters.

Since the Assistant API can call functions + infer function parameters, we explore its capabilities in performing auto-retrieval here.


import pinecone
import os

api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west1-gcp")
/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm
# dimensions are for text-embedding-ada-002
try:
    pinecone.create_index(
        "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
    )
except Exception:
    # most likely index already exists
    pass
pinecone_index = pinecone.Index("quickstart")
# Optional: delete data in your pinecone index
pinecone_index.delete(deleteAll=True, namespace="test")
{}
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Angelina Jolie is an American actress, filmmaker, and"
            " humanitarian. She has received numerous awards for her acting"
            " and is known for her philanthropic work."
        ),
        metadata={
            "category": "Entertainment",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Elon Musk is a business magnate, industrial designer, and"
            " engineer. He is the founder, CEO, and lead designer of SpaceX,"
            " Tesla, Inc., Neuralink, and The Boring Company."
        ),
        metadata={
            "category": "Business",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Rihanna is a Barbadian singer, actress, and businesswoman. She"
            " has achieved significant success in the music industry and is"
            " known for her versatile musical style."
        ),
        metadata={
            "category": "Music",
            "country": "Barbados",
        },
    ),
    TextNode(
        text=(
            "Cristiano Ronaldo is a Portuguese professional footballer who is"
            " considered one of the greatest football players of all time. He"
            " has won numerous awards and set multiple records during his"
            " career."
        ),
        metadata={
            "category": "Sports",
            "country": "Portugal",
        },
    ),
]
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="test"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Define Function Tool#

Here we define the function interface, which is passed to OpenAI to perform auto-retrieval.

We were not able to get OpenAI to work with nested pydantic objects or tuples as arguments, so we converted the metadata filter keys and values into lists for the function API to work with.

# define function tool
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import (
    VectorStoreInfo,
    MetadataInfo,
    ExactMatchFilter,
    MetadataFilters,
)
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

from typing import List, Tuple, Any
from pydantic import BaseModel, Field

# hardcode top k for now
top_k = 3

# define vector store info describing schema of vector store
vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
    ],
)


# define pydantic model for auto-retrieval function
class AutoRetrieveModel(BaseModel):
    query: str = Field(..., description="natural language query string")
    filter_key_list: List[str] = Field(
        ..., description="List of metadata filter field names"
    )
    filter_value_list: List[str] = Field(
        ...,
        description=(
            "List of metadata filter field values (corresponding to names"
            " specified in filter_key_list)"
        ),
    )


def auto_retrieve_fn(
    query: str, filter_key_list: List[str], filter_value_list: List[str]
):
    """Auto retrieval function.

    Performs auto-retrieval from a vector database, and then applies a set of filters.

    """
    # fall back to a generic query string if the caller passes an empty query
    query = query or "Query"

    exact_match_filters = [
        ExactMatchFilter(key=k, value=v)
        for k, v in zip(filter_key_list, filter_value_list)
    ]
    retriever = VectorIndexRetriever(
        index,
        filters=MetadataFilters(filters=exact_match_filters),
        similarity_top_k=top_k,
    )
    results = retriever.retrieve(query)
    return [r.get_content() for r in results]


description = f"""\
Use this tool to look up biographical information about celebrities.
The vector database schema is given below:
{vector_store_info.json()}
"""

auto_retrieve_tool = FunctionTool.from_defaults(
    fn=auto_retrieve_fn,
    name="celebrity_bios",
    description=description,
    fn_schema=AutoRetrieveModel,
)
auto_retrieve_fn(
    "celebrity from the United States",
    filter_key_list=["country"],
    filter_value_list=["United States"],
)
['Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.',
 'Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.']
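
For comparison with the built-in auto-retrieval capability mentioned at the start of this section, here is a minimal sketch that runs the same lookup through VectorIndexAutoRetriever, reusing the index, vector_store_info, and top_k defined above (the exact nodes returned depend on the query string and filters the LLM infers):

from llama_index.core.retrievers import VectorIndexAutoRetriever

# Built-in auto-retrieval: the LLM infers the query string and metadata filters
# from the natural-language question and the schema in vector_store_info.
auto_retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    similarity_top_k=top_k,
)
auto_nodes = auto_retriever.retrieve("celebrities from the United States")
print([n.get_content() for n in auto_nodes])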

Initialize Agent#

from llama_index.agent.openai import OpenAIAssistantAgent

agent = OpenAIAssistantAgent.from_new(
    name="Celebrity bot",
    instructions="You are a bot designed to answer questions about celebrities.",
    tools=[auto_retrieve_tool],
    verbose=True,
)
response = agent.chat("Tell me about two celebrities from the United States. ")
print(str(response))
=== Calling Function ===
Calling function: celebrity_bios with args: {"query": "celebrity from United States", "filter_key_list": ["country"], "filter_value_list": ["United States"]}
Got output: ['Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', 'Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.']
========================
=== Calling Function ===
Calling function: celebrity_bios with args: {"query": "celebrity from United States", "filter_key_list": ["country"], "filter_value_list": ["United States"]}
Got output: ['Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', 'Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.']
========================
Here is some information about two celebrities from the United States:

1. Angelina Jolie - Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work. Over the years, Jolie has starred in several critically acclaimed and commercially successful films, and she has also been involved in various humanitarian causes, advocating for refugees and children's education, among other things.

2. Michael Jordan - Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time. During his career, Jordan dominated the NBA with his scoring ability, athleticism, and competitiveness. He won six NBA championships with the Chicago Bulls and earned the NBA Most Valuable Player Award five times. Jordan has also been a successful businessman and the principal owner of the Charlotte Hornets basketball team.

Both figures have made significant impacts in their respective fields and continue to be influential even after reaching the peaks of their careers.