Get References from PDFs
This guide shows how to use LlamaIndex to get inline page-number citations in a streamed response.
It combines the page number metadata that our PDF loader attaches to each document with our indexing and query abstractions, which make use of that metadata.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-openai
!pip install llama-index
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    download_loader,
    RAKEKeywordTableIndex,
)
from llama_index.llms.openai import OpenAI
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
Download Data
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
Load document and build index
reader = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"])
data = reader.load_data()
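The page citations depend on each loaded document carrying a `page_label` in its metadata, which ends up next to the page text when the prompt is built. The sketch below imitates that idea in plain Python, without LlamaIndex. `render_for_llm` is a hypothetical helper, and the `key: value` layout is an assumption about the default formatting rather than the library's own code.

```python
# Illustrative stand-in, not LlamaIndex code: shows why the LLM can cite pages.
def render_for_llm(text: str, metadata: dict) -> str:
    """Prepend metadata as 'key: value' lines, then a blank line, then the text."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{metadata_str}\n\n{text}" if metadata_str else text


# Hypothetical page chunk, using the metadata keys that appear in the
# source-node output later in this guide.
chunk = render_for_llm(
    "Our business continues to be impacted by the COVID-19 pandemic.",
    {"page_label": "6", "file_name": "lyft_2021.pdf"},
)
print(chunk)
```

With the real data loaded above, `data[0].metadata` should show the same kind of fields for the first page.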
index = VectorStoreIndex.from_documents(data)
# Pass the LLM configured above so its settings (temperature=0) are actually used
query_engine = index.as_query_engine(
    llm=llm, streaming=True, similarity_top_k=3
)
Stream response with page citation
response = query_engine.query(
    "What was the impact of COVID? Show statements in bullet form and show"
    " page reference after each statement."
)
response.print_response_stream()
• The ongoing COVID-19 pandemic continues to impact communities in the United States, Canada and globally (page 6).
• The pandemic and related responses caused decreased demand for our platform leading to decreased revenues as well as decreased earning opportunities for drivers on our platform (page 6).
• Our business continues to be impacted by the COVID-19 pandemic (page 6).
• The exact timing and pace of the recovery remain uncertain (page 6).
• The extent to which our operations will continue to be impacted by the pandemic will depend largely on future developments, which are highly uncertain and cannot be accurately predicted (page 6).
• An increase in cases due to variants of the virus has caused many businesses to delay employees returning to the office (page 6).
• We anticipate that continued social distancing, altered consumer behavior, reduced travel and commuting, and expected corporate cost cutting will be significant challenges for us (page 6).
• We have adopted multiple measures, including, but not limited, to establishing new health and safety requirements for ridesharing and updating workplace policies (page 6).
• We have had to take certain cost-cutting measures, including lay-offs, furloughs and salary reductions, which may have adversely affect employee morale, our culture and our ability to attract and retain employees (page 18).
• The ultimate impact of the COVID-19 pandemic on our users, customers, employees, business, operations and financial performance depends on many factors that are not within our control (page 18).
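Because the citations arrive as plain text inside the answer, it can be handy to pull them out for later use. The following is a minimal sketch that assumes the answer has already been captured as a string and that citations look like `(page 6)`, as in the sample above. `cited_pages` and `CITATION_RE` are hypothetical names, and the sample statements are abbreviated from the answer.

```python
import re

# Matches citations written like "(page 6)". The pattern is an assumption
# based on the sample answer and may need adjusting for other prompts.
CITATION_RE = re.compile(r"\(page\s+(\w+)\)")


def cited_pages(answer: str) -> list[str]:
    """Return the distinct page labels cited in the answer, in first-seen order."""
    seen: dict[str, None] = {}
    for label in CITATION_RE.findall(answer):
        seen.setdefault(label, None)
    return list(seen)


# Abbreviated statements in the style of the streamed answer.
sample = (
    "• Our business continues to be impacted by the COVID-19 pandemic (page 6). "
    "• We have had to take certain cost-cutting measures (page 18). "
    "• The exact timing and pace of the recovery remain uncertain (page 6)."
)
print(cited_pages(sample))
```

Page labels are kept as strings because a PDF's page labels are not guaranteed to be plain integers.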
Inspect source nodes
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
-----
Text: Impact of COVID-19 to our BusinessThe ongoing COVID-19 pandemic continues to impact communities in the United States, Canada and globally. Since the pandemic began in March 2020,governments and private businesses - at the recommendation of public health officials - have enacted precautions to mitigate the spread of the virus, including travelrestrictions and social distancing measures in many regions of the United States and Canada, and many enterprises have instituted and maintained work from homeprograms and limited the number of employees on site. Beginning in the middle of March 2020, the pandemic and these related responses caused decreased demand for ourplatform leading to decreased revenues as well as decreased earning opportunities for drivers on our platform. Our business continues to be impacted by the COVID-19pandemic. Although we have seen some signs of demand improving, particularly compared to the dema ...
Metadata: {'page_label': '6', 'file_name': 'lyft_2021.pdf'}
Score: 0.821
-----
Text: will continue to be impacted by the pandemic will depend largely on future developments, which are highly uncertain and cannot beaccurately predicted, including new information which may emerge concerning COVID-19 variants and the severity of the pandemic and actions by government authoritiesand private businesses to contain the pandemic or recover from its impact, among other things. For example, an increase in cases due to variants of the virus has causedmany businesses to delay employees returning to the office. Even as travel restrictions and shelter-in-place orders are modified or lifted, we anticipate that continued socialdistancing, altered consu mer behavior, reduced travel and commuting, and expected corporate cost cutting will be significant challenges for us. The strength and duration ofthese challenges cannot b e presently estimated.In response to the COVID-19 pandemic, we have adopted multiple measures, including, but not limited, to establishing ne ...
Metadata: {'page_label': '56', 'file_name': 'lyft_2021.pdf'}
Score: 0.808
-----
Text: storing unrented and returned vehicles. These impacts to the demand for and operations of the different rental programs have and may continue to adversely affectour business, financial condi tion and results of operation.• The COVID-19 pandemic may delay or prevent us, or our current or prospective partners and suppliers, from being able to test, develop or deploy autonomousvehicle-related technology, including through direct impacts of the COVID-19 virus on employee and contractor health; reduced consumer demand forautonomous vehicle travel resulting from an overall reduced demand for travel; shelter-in-place orders by local, state or federal governments negatively impactingoperations, including our ability to test autonomous vehicle-related technology; impacts to the supply chains of our current or prospective partners and suppliers;or economic impacts limiting our or our current or prospective partners’ or suppliers’ ability to expend resources o ...
Metadata: {'page_label': '18', 'file_name': 'lyft_2021.pdf'}
Score: 0.805
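The source nodes give a way to sanity-check the citations: every page the answer cites should be the page of some retrieved node. Here is a minimal sketch of that check in plain Python. `unsupported_citations` is a hypothetical helper, the `(page N)` pattern is an assumption based on the sample answer, and the short `answer` string is made up for illustration.

```python
import re


def unsupported_citations(answer: str, source_pages: set[str]) -> set[str]:
    """Return page labels cited in the answer that no retrieved node came from."""
    cited = set(re.findall(r"\(page\s+(\w+)\)", answer))
    return cited - source_pages


# Page labels copied from the metadata printed above.
source_pages = {"6", "56", "18"}
# Made-up, shortened answer in the same citation style.
answer = "• Demand decreased (page 6). • Cost-cutting measures were taken (page 18)."
print(unsupported_citations(answer, source_pages))
```

With the live objects from this guide, `source_pages` could be built as `{n.node.metadata["page_label"] for n in response.source_nodes}`. The check only confirms that a cited page was retrieved, not that the statement actually appears on that page.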