Open In Colab

💬🤖 How to Build a Chatbot#

LlamaIndex serves as a bridge between your data and Language Learning Models (LLMs), providing a toolkit that enables you to establish a query interface around your data for a variety of tasks, such as question-answering and summarization.

In this tutorial, we’ll walk you through building a context-augmented chatbot using a Data Agent. This agent, powered by LLMs, is capable of intelligently executing tasks over your data. The end result is a chatbot agent equipped with a robust set of data interface tools provided by LlamaIndex to answer queries about your data.

Note: This tutorial builds upon initial work on creating a query interface over SEC 10-K filings - check it out here.

Context#

In this guide, we’ll build a “10-K Chatbot” that uses raw UBER 10-K HTML filings from Dropbox. Users can interact with the chatbot to ask questions related to the 10-K filings.

Preparation#

%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

import nest_asyncio

nest_asyncio.apply()
# set text wrapping
from IPython.display import HTML, display


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

Ingest Data#

Let’s first download the raw 10-k files, from 2019-2022.

# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data

To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can directly integrate with Unstructured, allowing conversion of any text into a Document format that LlamaIndex can ingest.

First we install the necessary packages:

Then we can use the UnstructuredReader to parse the HTML files into a list of Document objects.

from llama_index.readers.file import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

Setting up Vector Indices for each year#

We first setup a vector index for each year. Each vector index allows us to ask questions about the 10-K filing of a given year.

We build each index and save it to disk.

# initialize simple vector indices
# NOTE: don't run this cell if the indices are already loaded!
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.chunk_size = 512
Settings.chunk_overlap = 64
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")

To load an index from disk, do the following

# Load indices from disk
from llama_index.core import load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index

Setting up a Sub Question Query Engine to Synthesize Answers Across 10-K Filings#

Since we have access to documents of 4 years, we may not only want to ask questions regarding the 10-K document of a given year, but ask questions that require analysis over all 10-K filings.

To address this, we can use a Sub Question Query Engine. It decomposes a query into subqueries, each answered by an individual vector index, and synthesizes the results to answer the overall query.

LlamaIndex provides some wrappers around indices (and query engines) so that they can be used by query engines and agents. First we define a QueryEngineTool for each vector index. Each tool has a name and a description; these are what the LLM agent sees to decide which tool to choose.

from llama_index.core.tools import QueryEngineTool, ToolMetadata

individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=(
                "useful for when you want to answer queries about the"
                f" {year} SEC 10-K for Uber"
            ),
        ),
    )
    for year in years
]

Now we can create the Sub Question Query Engine, which will allow us to synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools we defined above.

from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)

Setting up the Chatbot Agent#

We use a LlamaIndex Data Agent to setup the outer chatbot agent, which has access to a set of Tools. Specifically, we will use an OpenAIAgent, that takes advantage of OpenAI API function calling. We want to use the separate Tools we defined previously for each index (corresponding to a given year), as well as a tool for the sub question query engine we defined above.

First we define a QueryEngineTool for the sub question query engine:

query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description=(
            "useful for when you want to answer queries that require analyzing"
            " multiple SEC 10-K documents for Uber"
        ),
    ),
)

Then, we combine the Tools we defined above into a single list of tools for the agent:

tools = individual_query_engine_tools + [query_engine_tool]

Finally, we call OpenAIAgent.from_tools to create the agent, passing in the list of tools we defined above.

from llama_index.agent.openai import OpenAIAgent

agent = OpenAIAgent.from_tools(tools, verbose=True)

Testing the Agent#

We can now test the agent with various queries.

If we test it with a simple “hello” query, the agent does not use any Tools.

response = agent.chat("hi, i am bob")
print(str(response))
Added user message to memory: hi, i am bob
Hello Bob! How can I assist you today?

If we test it with a query regarding the 10-k of a given year, the agent will use the relevant vector index Tool.

response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))
Added user message to memory: What were some of the biggest risk factors in 2020 for Uber?
=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "biggest risk factors"
}
Got output: The biggest risk factors mentioned in the context are:

1. The adverse impact of the COVID-19 pandemic and actions taken to mitigate it on the business.
2. The potential reclassification of drivers as employees, workers, or quasi-employees instead of independent contractors.
3. Intense competition in the mobility, delivery, and logistics industries.
4. The need to lower fares or service fees and offer driver incentives and consumer discounts to remain competitive.
5. Significant losses incurred and the uncertainty of achieving profitability.
6. Difficulty in attracting and maintaining a critical mass of platform users.
7. Operational, compliance, and cultural challenges.
8. Negative media coverage and reputation issues.
9. Inability to optimize organizational structure or manage growth effectively.
10. Safety incidents that harm the ability to attract and retain platform users.
11. Risks associated with substantial investments in new offerings and technologies.
12. Potential fines or enforcement measures due to challenges faced.
13. Uncertainty and potential long-term financial impact of the COVID-19 pandemic, including changes in user behavior and demand for mobility services.
14. Potential adverse impact from business partners and third-party vendors affected by the pandemic.
15. Volatility in financial markets and its effect on stock price and access to capital markets.

These are the biggest risk factors mentioned in the given context.
========================

The biggest risk factors for Uber in 2020 were:

1. The adverse impact of the COVID-19 pandemic and actions taken to mitigate it on the business.
2. The potential reclassification of drivers as employees, workers, or quasi-employees instead of independent contractors.
3. Intense competition in the mobility, delivery, and logistics industries.
4. The need to lower fares or service fees and offer driver incentives and consumer discounts to remain competitive.
5. Significant losses incurred and the uncertainty of achieving profitability.
6. Difficulty in attracting and maintaining a critical mass of platform users.
7. Operational, compliance, and cultural challenges.
8. Negative media coverage and reputation issues.
9. Inability to optimize organizational structure or manage growth effectively.
10. Safety incidents that harm the ability to attract and retain platform users.
11. Risks associated with substantial investments in new offerings and technologies.
12. Potential fines or enforcement measures due to challenges faced.
13. Uncertainty and potential long-term financial impact of the COVID-19 pandemic, including changes in user behavior and demand for mobility services.
14. Potential adverse impact from business partners and third-party vendors affected by the pandemic.
15. Volatility in financial markets and its effect on stock price and access to capital markets.

These risk factors highlight the challenges and uncertainties faced by Uber in 2020.

Finally, if we test it with a query to compare/contrast risk factors across years, the agent will use the Sub Question Query Engine Tool.

cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across"
    " years. Give answer in bullet points."
)

response = agent.chat(cross_query_str)
print(str(response))
Added user message to memory: Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points.
=== Calling Function ===
Calling function: sub_question_query_engine with args: {
  "input": "Compare/contrast the risk factors described in the Uber 10-K across years"
}
Generated 4 sub questions.
[vector_index_2022] Q: What are the risk factors described in the 2022 SEC 10-K for Uber?
[vector_index_2021] Q: What are the risk factors described in the 2021 SEC 10-K for Uber?
[vector_index_2020] Q: What are the risk factors described in the 2020 SEC 10-K for Uber?
[vector_index_2019] Q: What are the risk factors described in the 2019 SEC 10-K for Uber?
[vector_index_2022] A: The risk factors described in the 2022 SEC 10-K for Uber are not provided in the given context information.
[vector_index_2021] A: The risk factors described in the 2021 SEC 10-K for Uber are not provided in the given context information.
[vector_index_2019] A: The risk factors described in the 2019 SEC 10-K for Uber include potential infringement of intellectual property, the need to protect proprietary information, dependence on rapid technological advances, seasonality in revenue generation, fluctuations in usage of the platform, seasonal increases in revenue for certain quarters, and the potential impact of employee actions.
[vector_index_2020] A: The risk factors described in the 2020 SEC 10-K for Uber include the potential adverse effects on their business, financial condition, and results of operations. These risks could cause a decline in the trading price of their common stock and harm their business prospects. Additionally, there may be risks and uncertainties not currently known to Uber or that they do not believe are material. For a more detailed discussion of these risk factors, please refer to the "Risk Factors" section in Uber's Annual Report on Form 10-K.
Got output: The risk factors described in the Uber 10-K vary across different years. In the 2020 SEC 10-K, the risk factors include potential adverse effects on their business, financial condition, and results of operations. However, the 2019 SEC 10-K includes additional risk factors such as potential infringement of intellectual property, the need to protect proprietary information, dependence on rapid technological advances, seasonality in revenue generation, fluctuations in usage of the platform, seasonal increases in revenue for certain quarters, and the potential impact of employee actions. It is important to note that the specific risk factors may change from year to year based on the evolving business environment and circumstances.
========================

=== Calling Function ===
Calling function: vector_index_2022 with args: {
  "input": "risk factors"
}
Got output: Some of the risk factors mentioned in the context include the potential failure to meet regulatory requirements related to climate change, the impact of contagious diseases and pandemics on the business, the occurrence of catastrophic events, the uncertainty surrounding future pandemics or disease outbreaks, and the competitive nature of the mobility, delivery, and logistics industries. Additionally, the classification of drivers as employees instead of independent contractors, the need to lower fares or service fees to remain competitive, and the company's history of significant losses and anticipated increase in operating expenses are also mentioned as risk factors.
========================

=== Calling Function ===
Calling function: vector_index_2021 with args: {
  "input": "risk factors"
}
Got output: The COVID-19 pandemic and the impact of actions to mitigate the pandemic have adversely affected and may continue to adversely affect parts of our business. Our business would be adversely affected if Drivers were classified as employees, workers or quasi-employees instead of independent contractors. The mobility, delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region. To remain competitive in certain markets, we have in the past lowered, and may continue to lower, fares or service fees, and we have in the past offered, and may continue to offer, significant Driver incentives and consumer discounts and promotions. We have incurred significant losses since inception, including in the United States and other major markets. We expect our operating expenses to increase significantly in the foreseeable future, and we may not achieve or maintain profitability. If we are unable to attract or maintain a critical mass of Drivers, consumers, merchants, shippers, and carriers, whether as a result of competition or other factors, our platform will become less appealing to platform users. Maintaining and enhancing our brand and reputation is critical to our business prospects. We have previously received significant media coverage and negative publicity regarding our brand and reputation, and while we have taken significant steps to rehabilitate our brand and reputation, failure to maintain and enhance our brand and reputation will cause our business to suffer. Our historical workplace culture and forward-leaning approach created operational, compliance, and cultural challenges and our efforts to address these challenges may not be successful. If we are unable to optimize our organizational structure or effectively manage our growth, our financial performance and future prospects will be adversely affected. Platform users may engage in, or be subject to, criminal, violent, inappropriate, or dangerous activity that results in major safety incidents, which may harm our ability to attract and retain Drivers, consumers, merchants, shippers, and carriers. We are making substantial investments in new offerings and technologies, and may increase such investments in the future. These new ventures are inherently risky, and we may never realize any expected benefits from them.
========================

=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "risk factors"
}
Got output: The risk factors mentioned in the context include the adverse impact of the COVID-19 pandemic, potential reclassification of drivers as employees, intense competition in the mobility, delivery, and logistics industries, the need to lower fares and offer incentives to remain competitive, significant losses and increased operating expenses, the importance of attracting and maintaining platform users, operational and cultural challenges, negative media coverage affecting brand reputation, difficulties in managing growth and organizational structure, safety incidents, risks associated with new ventures and investments, legal uncertainties, challenges in international operations, currency fluctuations, tax consequences, financial reporting burdens, political and economic instability, public health concerns, and limited influence over minority-owned affiliates. These risk factors could have an adverse effect on the business, financial condition, operating results, and prospects of the company.
========================

=== Calling Function ===
Calling function: vector_index_2019 with args: {
  "input": "risk factors"
}
Got output: The personal mobility, meal delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region. If we are unable to compete effectively in these industries, our business and financial prospects would be adversely impacted.
========================

Here is a comparison of the risk factors described in the Uber 10-K across years:

2022:
- Potential failure to meet regulatory requirements related to climate change
- Impact of contagious diseases and pandemics on the business
- Occurrence of catastrophic events
- Uncertainty surrounding future pandemics or disease outbreaks
- Competitive nature of the mobility, delivery, and logistics industries
- Classification of drivers as employees instead of independent contractors
- Need to lower fares or service fees to remain competitive
- History of significant losses and anticipated increase in operating expenses

2021:
- Adverse impact of the COVID-19 pandemic and actions to mitigate it
- Potential reclassification of drivers as employees instead of independent contractors
- Intense competition in the mobility, delivery, and logistics industries
- Need to lower fares or service fees and offer driver incentives and consumer discounts
- Significant losses incurred and uncertainty of achieving profitability
- Difficulty in attracting and maintaining a critical mass of platform users
- Operational, compliance, and cultural challenges
- Negative media coverage and reputation issues
- Inability to optimize organizational structure or manage growth effectively
- Safety incidents that harm the ability to attract and retain platform users
- Risks associated with substantial investments in new offerings and technologies

2020:
- Adverse impact of the COVID-19 pandemic and actions taken to mitigate it
- Potential reclassification of drivers as employees, workers, or quasi-employees instead of independent contractors
- Intense competition in the mobility, delivery, and logistics industries
- Need to lower fares or service fees and offer driver incentives and consumer discounts
- Significant losses incurred and uncertainty of achieving profitability
- Difficulty in attracting and maintaining a critical mass of platform users
- Operational, compliance, and cultural challenges
- Negative media coverage and reputation issues
- Inability to optimize organizational structure or manage growth effectively
- Safety incidents that harm the ability to attract and retain platform users
- Risks associated with substantial investments in new offerings and technologies
- Potential fines or enforcement measures due to challenges faced
- Uncertainty and potential long-term financial impact of the COVID-19 pandemic
- Potential adverse impact from business partners and third-party vendors affected by the pandemic
- Volatility in financial markets and its effect on stock price and access to capital markets

2019:
- Highly competitive personal mobility, meal delivery, and logistics industries
- Potential inability to compete effectively in these industries

These bullet points highlight the similarities and differences in the risk factors described in the Uber 10-K across years.

Setting up the Chatbot Loop#

Now that we have the chatbot setup, it only takes a few more steps to setup a basic interactive loop to chat with our SEC-augmented chatbot!

agent = OpenAIAgent.from_tools(tools)  # verbose=False by default

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = agent.chat(text_input)
    print(f"Agent: {response}")

# User: What were some of the legal proceedings against Uber in 2022?
Agent: In 2022, Uber is facing several legal proceedings. Here are some of them:

1. California: The state Attorney General and city attorneys filed a complaint against Uber and Lyft, alleging that drivers are misclassified as independent contractors. A preliminary injunction was issued but stayed pending appeal. The Court of Appeal affirmed the lower court's ruling, and Uber filed a petition for review with the California Supreme Court. However, the Supreme Court declined the petition for review. The lawsuit is ongoing, focusing on claims by the California Attorney General for periods prior to the enactment of Proposition 22.

2. Massachusetts: The Attorney General of Massachusetts filed a complaint against Uber, alleging that drivers are employees entitled to wage and labor law protections. Uber's motion to dismiss the complaint was denied, and a summary judgment motion is pending.

3. New York: Uber is facing allegations of misclassification and employment violations by the state Attorney General. The resolution of this matter is uncertain.

4. Switzerland: Several administrative bodies in Switzerland have issued rulings classifying Uber drivers as employees for social security or labor purposes. Uber is challenging these rulings before the Social Security and Administrative Tribunals.

These are some of the legal proceedings against Uber in 2022. The outcomes and potential losses in these cases are uncertain.