Benchmarking OpenAI Retrieval API (through Assistant Agent)#
This guide benchmarks the Retrieval Tool from the OpenAI Assistant API by using our OpenAIAssistantAgent. We run it over the Llama 2 paper and compare generation quality against a naive RAG pipeline.
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
Setup Data#
Here we load the Llama 2 paper and chunk it.
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2023-11-08 21:53:52-- https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: 'data/llama2.pdf'
data/llama2.pdf 100%[===================>] 13.03M 141KB/s in 1m 48s
2023-11-08 21:55:42 (123 KB/s) - 'data/llama2.pdf' saved [13661300/13661300]
from pathlib import Path
from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_hub.file.pymu_pdf.base import PyMuPDFReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)
len(nodes)
89
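As an optional sanity check, we can peek at the first chunk to confirm the PDF parsed cleanly. This is a minimal sketch; it assumes nodes expose the standard get_content() accessor.
# optional: inspect the beginning of the first chunk
print(nodes[0].get_content()[:300])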
Define Eval Modules#
We set up the evaluation modules, including the dataset and the evaluators.
Setup "Golden Dataset"#
Here we load in a "golden" dataset.
Option 1: Pull Existing Dataset#
NOTE: We pull this in from Dropbox. For details on how to generate a dataset, please see our DatasetGenerator module.
!wget "https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1" -O data/llama2_eval_qr_dataset.json
--2023-11-08 22:20:10-- https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6057:18::a27d:d12, 162.125.13.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6057:18::a27d:d12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1# [following]
--2023-11-08 22:20:11-- https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1
Resolving uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)... 2620:100:6057:15::a27d:d0f, 162.125.13.15
Connecting to uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)|2620:100:6057:15::a27d:d0f|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60656 (59K) [application/binary]
Saving to: 'data/llama2_eval_qr_dataset.json'
data/llama2_eval_qr 100%[===================>] 59.23K --.-KB/s in 0.02s
2023-11-08 22:20:12 (2.87 MB/s) - 'data/llama2_eval_qr_dataset.json' saved [60656/60656]
from llama_index.evaluation import QueryResponseDataset
# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
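To get a feel for the golden dataset, we can print one question/answer pair. This sketch assumes qr_pairs is a list of (question, reference_answer) tuples, which is also how we consume the dataset later in the benchmark.
# optional: inspect a sample (question, reference answer) pair
sample_question, sample_answer = eval_dataset.qr_pairs[0]
print(sample_question)
print(sample_answer)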
Option 2: Generate New Dataset#
With this option, you generate a new dataset from scratch. This lets you play around with our DatasetGenerator settings to make sure it suits your needs.
from llama_index.evaluation import (
    DatasetGenerator,
    QueryResponseDataset,
)
from llama_index import ServiceContext
from llama_index.llms import OpenAI
# NOTE: run this if the dataset isn't already saved
# Note: we only generate from the first 20 nodes, since the rest are references
eval_service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4-1106-preview")
)
dataset_generator = DatasetGenerator(
    nodes[:20],
    service_context=eval_service_context,
    show_progress=True,
    num_questions_per_chunk=3,
)
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")
# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
Eval Modules#
We define two evaluation modules, correctness and semantic similarity, both of which compare the quality of the predicted response against the reference response.
from llama_index.evaluation.eval_utils import get_responses, get_results_df
from llama_index.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    BatchEvalRunner,
)
from llama_index.llms import OpenAI
eval_llm = OpenAI(model="gpt-4-1106-preview")
eval_service_context = ServiceContext.from_defaults(llm=eval_llm)
evaluator_c = CorrectnessEvaluator(service_context=eval_service_context)
evaluator_s = SemanticSimilarityEvaluator(service_context=eval_service_context)
evaluator_dict = {
    "correctness": evaluator_c,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
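Before running the full batch, it can be useful to spot-check a single evaluator. The sketch below assumes the standard evaluate(query=..., response=..., reference=...) interface and an EvaluationResult with score and feedback fields; the strings are made-up examples, not from the dataset.
# optional: spot-check the correctness evaluator on a toy example (hypothetical strings)
sample_result = evaluator_c.evaluate(
    query="What is Llama 2?",
    response="Llama 2 is a family of pretrained and fine-tuned LLMs released by Meta.",
    reference="Llama 2 is a collection of pretrained and fine-tuned large language models from Meta.",
)
print(sample_result.score, sample_result.feedback)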
import numpy as np
import time
import os
import pickle
from tqdm import tqdm
def get_responses_sync(
    eval_qs, query_engine, show_progress=True, save_path=None
):
    if show_progress:
        eval_qs_iter = tqdm(eval_qs)
    else:
        eval_qs_iter = eval_qs
    pred_responses = []
    start_time = time.time()
    for eval_q in eval_qs_iter:
        print(f"eval q: {eval_q}")
        pred_response = query_engine.query(eval_q)
        print(f"predicted response: {pred_response}")
        pred_responses.append(pred_response)
        if save_path is not None:
            # save intermediate responses (to cache in case something breaks)
            avg_time = (time.time() - start_time) / len(pred_responses)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
    return pred_responses
async def run_evals(
    query_engine,
    eval_qa_pairs,
    batch_runner,
    disable_async_for_preds=False,
    save_path=None,
):
    # then evaluate
    # TODO: evaluate a sample of generated results
    eval_qs = [q for q, _ in eval_qa_pairs]
    eval_answers = [a for _, a in eval_qa_pairs]
    if save_path is not None:
        if not os.path.exists(save_path):
            start_time = time.time()
            if disable_async_for_preds:
                pred_responses = get_responses_sync(
                    eval_qs,
                    query_engine,
                    show_progress=True,
                    save_path=save_path,
                )
            else:
                pred_responses = get_responses(
                    eval_qs, query_engine, show_progress=True
                )
            avg_time = (time.time() - start_time) / len(eval_qs)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
        else:
            # [optional] load
            pickled_dict = pickle.load(open(save_path, "rb"))
            pred_responses = pickled_dict["pred_responses"]
            avg_time = pickled_dict["avg_time"]
    else:
        start_time = time.time()
        pred_responses = get_responses(
            eval_qs, query_engine, show_progress=True
        )
        avg_time = (time.time() - start_time) / len(eval_qs)
    eval_results = await batch_runner.aevaluate_responses(
        eval_qs, responses=pred_responses, reference=eval_answers
    )
    return eval_results, {"avg_time": avg_time}
Construct Assistant with Built-In Retrieval#
Let's construct the assistant and pass it the built-in OpenAI Retrieval tool. Here, we upload the file and attach it at assistant-creation time.
from llama_index.agent import OpenAIAssistantAgent
agent = OpenAIAssistantAgent.from_new(
    name="Llama 2 Analyst",
    instructions="You are a QA assistant designed to analyze the Llama 2 paper.",
    openai_tools=[{"type": "retrieval"}],
    instructions_prefix="Please address the user as Jerry.",
    files=["data/llama2.pdf"],
    verbose=True,
)
response = agent.query(
    "What are the key differences between Llama 2 and Llama 2-Chat?"
)
print(str(response))
The key differences between Llama 2 and Llama 2-Chat, as indicated by the document, focus on their performance in safety evaluations, particularly when tested with adversarial prompts. Here are some of the differences highlighted within the safety evaluation section of Llama 2-Chat:
1. Safety Human Evaluation: Llama 2-Chat was assessed with roughly 2000 adversarial prompts, among which 1351 were single-turn and 623 were multi-turn. The responses were judged for safety violations on a five-point Likert scale, where a rating of 1 or 2 indicated a violation. The evaluation aimed to gauge the model's safety by its rate of generating responses with safety violations and its helpfulness to users.
2. Violation Percentage and Mean Rating: Llama 2-Chat exhibited a low overall violation percentage across different model sizes and a high mean rating for safety and helpfulness, which suggests a strong performance in safety evaluations.
3. Inter-Rater Reliability: The reliability of the safety assessments was measured using Gwet's AC1/2 statistic, showing a high degree of agreement among annotators with an average inter-rater reliability score of 0.92 for Llama 2-Chat annotations.
4. Single-turn and Multi-turn Conversations: The evaluation revealed that multi-turn conversations generally lead to more safety violations across models, but Llama 2-Chat performed well compared to baselines, particularly in multi-turn scenarios.
5. Violation Percentage Per Risk Category: Llama 2-Chat had a relatively higher number of violations in the unqualified advice category, possibly due to a lack of appropriate disclaimers in its responses.
6. Improvements in Fine-Tuned Llama 2-Chat: The document also mentions that the fine-tuned Llama 2-Chat showed significant improvement over the pre-trained Llama 2 in terms of truthfulness and toxicity. The percentage of toxic generations dropped to effectively 0% for Llama 2-Chat of all sizes, which was the lowest among all compared models, indicating a notable enhancement in safety.
These points detail the evaluations and improvements emphasizing safety that distinguish Llama 2-Chat from Llama 2【9†source】.
Benchmark#
We run the agent over our evaluation dataset. We benchmark against a standard top-k RAG pipeline (k=2) with gpt-4-turbo.
NOTE: At our time of testing (November 2023), the Assistant API was heavily rate-limited and could take ~1-2 hours to generate responses over 60 datapoints.
Define Baseline Index + RAG Pipeline#
base_sc = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4-1106-preview"))
base_index = VectorStoreIndex(nodes, service_context=base_sc)
base_query_engine = base_index.as_query_engine(similarity_top_k=2)
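Optionally, we can smoke-test the baseline pipeline on a single question before kicking off the full evaluation (the question below is just an example).
# optional: single-question smoke test of the baseline RAG pipeline
print(base_query_engine.query("How was Llama 2 pretrained?"))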
Run Evals over Baseline#
base_eval_results, base_extra_info = await run_evals(
    base_query_engine,
    eval_dataset.qr_pairs,
    batch_runner,
    save_path="data/llama2_preds_base.pkl",
)
results_df = get_results_df(
    [base_eval_results],
    ["Base Query Engine"],
    ["correctness", "semantic_similarity"],
)
display(results_df)
|   | names | correctness | semantic_similarity |
| --- | --- | --- | --- |
| 0 | Base Query Engine | 4.05 | 0.964245 |
Run Evals over Assistant API#
assistant_eval_results, assistant_extra_info = await run_evals(
    agent,
    eval_dataset.qr_pairs[:55],
    batch_runner,
    save_path="data/llama2_preds_assistant.pkl",
    disable_async_for_preds=True,
)
Get Results#
Here we see…that our basic RAG pipeline does better.
Take these numbers with a grain of salt. The goal here is to give you a script so you can run this on your own data.
That said, it's surprising that the Retrieval API doesn't give immediately better out-of-the-box performance.
results_df = get_results_df(
    [assistant_eval_results, base_eval_results],
    ["Retrieval API", "Base Query Engine"],
    ["correctness", "semantic_similarity"],
)
display(results_df)
print(f"Base Avg Time: {base_extra_info['avg_time']}")
print(f"Assistant Avg Time: {assistant_extra_info['avg_time']}")
|   | names | correctness | semantic_similarity |
| --- | --- | --- | --- |
| 0 | Retrieval API | 3.536364 | 0.952647 |
| 1 | Base Query Engine | 4.050000 | 0.964245 |
Base Avg Time: 0.25683316787083943
Assistant Avg Time: 75.43605598536405