
OpenAI Agent + Query Engine Experimental Cookbook#

In this notebook, we try out the OpenAIAgent across a variety of query engine tools and datasets. We explore how OpenAIAgent compares with, and can potentially replace, existing workflows handled by our retrievers and query engines.

  • Auto retrieval

  • Joint SQL and vector search

AutoRetrieval from a Vector Database#

Our existing “auto-retrieval” capabilities (in VectorIndexAutoRetriever) allow an LLM to infer the right query parameters for a vector database - including both the query string and metadata filter.

Since the OpenAI Function API can infer function parameters, we explore its capabilities in performing auto-retrieval here.
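As a rough illustration of what auto-retrieval infers, the sketch below represents the two inferred pieces (a query string and metadata filters) as plain Python data and applies the filters to in-memory records. `InferredQuery` and `matches` are hypothetical names used only for illustration and are not LlamaIndex APIs; a real vector store would apply such filters alongside the similarity search.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class InferredQuery:
    """Hypothetical container for what the LLM infers from a user question."""

    query: str  # text that would be embedded for similarity search
    filters: Dict[str, Any] = field(default_factory=dict)  # exact-match metadata


def matches(metadata: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True when every filter key equals the record's metadata value."""
    return all(metadata.get(key) == value for key, value in filters.items())


# For a question like "Tell me about celebrities from Barbados",
# an LLM might infer something along these lines (assumed, not logged output):
inferred = InferredQuery(query="celebrities", filters={"country": "Barbados"})

records: List[Dict[str, Any]] = [
    {"name": "Rihanna", "country": "Barbados"},
    {"name": "Elon Musk", "country": "United States"},
]
selected = [r["name"] for r in records if matches(r, inferred.filters)]
print(selected)
```

The point of the sketch is only the split of responsibilities: the query string drives semantic similarity, while the metadata filters narrow the candidate set.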

If you’re opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index
import pinecone
import os

api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west4-gcp-free")
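import os
import getpass

# Read the key from the environment; prompt for it only if it is missing.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]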
# dimensions are for text-embedding-ada-002
try:
    pinecone.create_index(
        "quickstart-index", dimension=1536, metric="euclidean", pod_type="p1"
    )
except Exception:
    # most likely index already exists
    pass
pinecone_index = pinecone.Index("quickstart-index")
# Optional: delete data in your pinecone index
pinecone_index.delete(deleteAll=True, namespace="test")
{}
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import PineconeVectorStore
from llama_index.schema import TextNode

nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
            "gender": "male",
            "born": 1963,
        },
    ),
    TextNode(
        text=(
            "Angelina Jolie is an American actress, filmmaker, and"
            " humanitarian. She has received numerous awards for her acting"
            " and is known for her philanthropic work."
        ),
        metadata={
            "category": "Entertainment",
            "country": "United States",
            "gender": "female",
            "born": 1975,
        },
    ),
    TextNode(
        text=(
            "Elon Musk is a business magnate, industrial designer, and"
            " engineer. He is the founder, CEO, and lead designer of SpaceX,"
            " Tesla, Inc., Neuralink, and The Boring Company."
        ),
        metadata={
            "category": "Business",
            "country": "United States",
            "gender": "male",
            "born": 1971,
        },
    ),
    TextNode(
        text=(
            "Rihanna is a Barbadian singer, actress, and businesswoman. She"
            " has achieved significant success in the music industry and is"
            " known for her versatile musical style."
        ),
        metadata={
            "category": "Music",
            "country": "Barbados",
            "gender": "female",
            "born": 1988,
        },
    ),
    TextNode(
        text=(
            "Cristiano Ronaldo is a Portuguese professional footballer who is"
            " considered one of the greatest football players of all time. He"
            " has won numerous awards and set multiple records during his"
            " career."
        ),
        metadata={
            "category": "Sports",
            "country": "Portugal",
            "gender": "male",
            "born": 1985,
        },
    ),
]
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="test"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
Upserted vectors: 100%|██████████| 5/5 [00:00<00:00,  5.79it/s]

Define Function Tool#

Here we define the function interface, which is passed to OpenAI to perform auto-retrieval.

We were not able to get the OpenAI function API to accept nested pydantic objects or tuples as arguments, so we flatten the metadata filters into parallel lists of keys, values, and operators that the function API can work with.
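To make the flattening concrete, here is a minimal sketch of converting nested filter triples into parallel lists and back. The helper names `to_parallel_lists` and `from_parallel_lists` are hypothetical and exist only for this illustration; the notebook itself simply declares the flat lists as function arguments.

```python
from typing import Any, List, Tuple

# (key, operator, value) triples: the nested shape we would ideally pass.
FilterTriple = Tuple[str, str, Any]


def to_parallel_lists(
    triples: List[FilterTriple],
) -> Tuple[List[str], List[str], List[Any]]:
    """Split filter triples into three flat lists with aligned positions."""
    keys = [key for key, _, _ in triples]
    operators = [op for _, op, _ in triples]
    values = [value for _, _, value in triples]
    return keys, operators, values


def from_parallel_lists(
    keys: List[str], operators: List[str], values: List[Any]
) -> List[FilterTriple]:
    """Rebuild the triples; position i in each list describes one filter."""
    return list(zip(keys, operators, values))


triples = [("category", "==", "Business"), ("born", ">", 1950)]
keys, operators, values = to_parallel_lists(triples)
print(keys, operators, values)
```

The trade-off is that the association between a key, its operator, and its value is now implicit in list position, so the three lists must stay the same length and in the same order.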

# define function tool
from llama_index.tools import FunctionTool
from llama_index.vector_stores.types import (
    VectorStoreInfo,
    MetadataInfo,
    MetadataFilter,
    MetadataFilters,
    FilterCondition,
    FilterOperator,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

from typing import List, Tuple, Any
from pydantic import BaseModel, Field

# hardcode top k for now
top_k = 3

# define vector store info describing schema of vector store
vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
        MetadataInfo(
            name="gender",
            type="str",
            description=("Gender of the celebrity, one of [male, female]"),
        ),
        MetadataInfo(
            name="born",
            type="int",
            description=("Born year of the celebrity, could be any integer"),
        ),
    ],
)
# define pydantic model for auto-retrieval function
class AutoRetrieveModel(BaseModel):
    query: str = Field(..., description="natural language query string")
    filter_key_list: List[str] = Field(
        ..., description="List of metadata filter field names"
    )
    filter_value_list: List[Any] = Field(
        ...,
        description=(
            "List of metadata filter field values (corresponding to names"
            " specified in filter_key_list)"
        ),
    )
    filter_operator_list: List[str] = Field(
        ...,
        description=(
            "Metadata filter operators, one per filter key (each could be one of <, <=, >, >=, ==, !=)"
        ),
    )
    filter_condition: str = Field(
        ...,
        description=("Metadata filters condition values (could be AND or OR)"),
    )


description = f"""\
Use this tool to look up biographical information about celebrities.
The vector database schema is given below:
{vector_store_info.json()}
"""

Define AutoRetrieve Functions#

def auto_retrieve_fn(
    query: str,
    filter_key_list: List[str],
    filter_value_list: List[Any],
    filter_operator_list: List[str],
    filter_condition: str,
):
    """Auto retrieval function.

    Performs auto-retrieval from a vector database, and then applies a set of filters.

    """
    query = query or "Query"

    metadata_filters = [
        MetadataFilter(key=k, value=v, operator=op)
        for k, v, op in zip(
            filter_key_list, filter_value_list, filter_operator_list
        )
    ]
    retriever = VectorIndexRetriever(
        index,
        filters=MetadataFilters(
            filters=metadata_filters, condition=filter_condition
        ),
        similarity_top_k=top_k,
    )
    query_engine = RetrieverQueryEngine.from_args(retriever)

    response = query_engine.query(query)
    return str(response)


auto_retrieve_tool = FunctionTool.from_defaults(
    fn=auto_retrieve_fn,
    name="celebrity_bios",
    description=description,
    fn_schema=AutoRetrieveModel,
)
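Because the flat filter lists are produced by the LLM, nothing guarantees that they line up, and `zip` silently truncates to the shortest list. One option is to validate the arguments before building the filters. The sketch below is a hypothetical guard, not part of LlamaIndex: `validate_filter_args` is an illustrative name, the operator set is copied from the field description above, and it assumes the lowercase forms "and" and "or" are the accepted condition values.

```python
from typing import Any, List

# Operators listed in the AutoRetrieveModel field description.
ALLOWED_OPERATORS = {"<", "<=", ">", ">=", "==", "!="}
# Assumption: conditions are accepted in lowercase form.
ALLOWED_CONDITIONS = {"and", "or"}


def validate_filter_args(
    filter_key_list: List[str],
    filter_value_list: List[Any],
    filter_operator_list: List[str],
    filter_condition: str,
) -> str:
    """Raise ValueError on malformed arguments; return the normalized condition."""
    lengths = {
        len(filter_key_list),
        len(filter_value_list),
        len(filter_operator_list),
    }
    if len(lengths) != 1:
        raise ValueError(
            "filter key, value, and operator lists must have equal length"
        )
    unknown = set(filter_operator_list) - ALLOWED_OPERATORS
    if unknown:
        raise ValueError(f"unsupported operators: {sorted(unknown)}")
    condition = filter_condition.strip().lower()
    if condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unsupported condition: {filter_condition!r}")
    return condition


condition = validate_filter_args(
    ["category", "born"], ["Business", 1950], ["==", ">"], "AND"
)
print(condition)
```

A guard like this could be called at the top of `auto_retrieve_fn`, so that a malformed function call surfaces as a clear error instead of a silently dropped filter.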

Initialize Agent#

from llama_index.agent import OpenAIAgent
from llama_index.llms import OpenAI

agent = OpenAIAgent.from_tools(
    [auto_retrieve_tool],
    llm=OpenAI(temperature=0, model="gpt-4-0613"),
    verbose=True,
)
response = agent.chat("Tell me about two celebrities from the United States. ")
print(str(response))
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: celebrity_bios with args: {
"query": "celebrities from the United States",
"filter_key_list": ["country"],
"filter_value_list": ["United States"],
"filter_operator_list": ["=="],
"filter_condition": "and"
}
Got output: Angelina Jolie and Michael Jordan are both celebrities from the United States.
========================

STARTING TURN 2
---------------

Here are two celebrities from the United States:

1. **Angelina Jolie**: She is an American actress, filmmaker, and humanitarian. The recipient of numerous accolades, including an Academy Award and three Golden Globe Awards, she has been named Hollywood's highest-paid actress multiple times.

2. **Michael Jordan**: He is a former professional basketball player and the principal owner of the Charlotte Hornets of the National Basketball Association (NBA). He played 15 seasons in the NBA, winning six championships with the Chicago Bulls. He is considered one of the greatest players in the history of the NBA.
response = agent.chat("Tell me about two celebrities born after 1980. ")
print(str(response))
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: celebrity_bios with args: {
"query": "celebrities born after 1980",
"filter_key_list": ["born"],
"filter_value_list": [1980],
"filter_operator_list": [">"],
"filter_condition": "and"
}
Got output: Rihanna and Cristiano Ronaldo are both celebrities who were born after 1980.
========================

STARTING TURN 2
---------------

Here are two celebrities who were born after 1980:

1. **Rihanna**: She is a Barbadian singer, actress, and businesswoman. Born in Saint Michael and raised in Bridgetown, Barbados, Rihanna was discovered by American record producer Evan Rogers who invited her to the United States to record demo tapes. She rose to fame with her debut album "Music of the Sun" and its follow-up "A Girl like Me".

2. **Cristiano Ronaldo**: He is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal
response = agent.chat(
    "Tell me about few celebrities under category business and born after 1950. "
)
print(str(response))
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: celebrity_bios with args: {
"query": "business celebrities born after 1950",
"filter_key_list": ["category", "born"],
"filter_value_list": ["Business", 1950],
"filter_operator_list": ["==", ">"],
"filter_condition": "and"
}
Got output: Elon Musk is a notable business celebrity who was born in 1971.
========================

STARTING TURN 2
---------------

Elon Musk is a business celebrity who was born after 1950. He is a business magnate and investor. He is the founder, CEO, CTO, and chief designer of SpaceX; early investor, CEO and product architect of Tesla, Inc.; founder of The