Benchmarking OpenAI Retrieval API (through Assistant Agent)#

This guide benchmarks the Retrieval Tool from the OpenAI Assistant API, by using our OpenAIAssistantAgent. We run over the Llama 2 paper, and compare generation quality against a naive RAG pipeline.

%pip install llama-index-readers-file
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai

!pip install llama-index

import nest_asyncio

nest_asyncio.apply()

Setup Data#

Here we load the Llama 2 paper and chunk it.

!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

--2023-11-08 21:53:52--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M   141KB/s    in 1m 48s  

2023-11-08 21:55:42 (123 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]

from pathlib import Path
from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI

loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)

len(nodes)

Define Eval Modules#

We setup evaluation modules, including the dataset and evaluators.

Setup “Golden Dataset”#

Here we load in a “golden” dataset.

Option 1: Pull Existing Dataset#

NOTE: We pull this in from Dropbox. For details on how to generate a dataset please see our DatasetGenerator module.

!wget "https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1" -O data/llama2_eval_qr_dataset.json

--2023-11-08 22:20:10--  https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6057:18::a27d:d12, 162.125.13.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6057:18::a27d:d12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1# [following]
--2023-11-08 22:20:11--  https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1
Resolving uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)... 2620:100:6057:15::a27d:d0f, 162.125.13.15
Connecting to uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)|2620:100:6057:15::a27d:d0f|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60656 (59K) [application/binary]
Saving to: ‘data/llama2_eval_qr_dataset.json’

data/llama2_eval_qr 100%[===================>]  59.23K  --.-KB/s    in 0.02s   

2023-11-08 22:20:12 (2.87 MB/s) - ‘data/llama2_eval_qr_dataset.json’ saved [60656/60656]

from llama_index.core.evaluation import QueryResponseDataset

# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)

Option 2: Generate New Dataset#

If you choose this option, you can choose to generate a new dataset from scratch. This allows you to play around with our DatasetGenerator settings to make sure it suits your needs.

from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI

# NOTE: run this if the dataset isn't already saved
# Note: we only generate from the first 20 nodes, since the rest are references
llm = OpenAI(model="gpt-4-1106-preview")
dataset_generator = DatasetGenerator(
    nodes[:20],
    llm=llm,
    show_progress=True,
    num_questions_per_chunk=3,
)
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")

# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)

Eval Modules#

We define two evaluation modules: correctness and semantic similarity - both comparing quality of predicted response with actual response.

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI

eval_llm = OpenAI(model="gpt-4-1106-preview")
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator(llm=eval_llm)
evaluator_dict = {
    "correctness": evaluator_c,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

import numpy as np
import time
import os
import pickle
from tqdm import tqdm


def get_responses_sync(
    eval_qs, query_engine, show_progress=True, save_path=None
):
    if show_progress:
        eval_qs_iter = tqdm(eval_qs)
    else:
        eval_qs_iter = eval_qs
    pred_responses = []
    start_time = time.time()
    for eval_q in eval_qs_iter:
        print(f"eval q: {eval_q}")
        pred_response = agent.query(eval_q)
        print(f"predicted response: {pred_response}")
        pred_responses.append(pred_response)
        if save_path is not None:
            # save intermediate responses (to cache in case something breaks)
            avg_time = (time.time() - start_time) / len(pred_responses)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
    return pred_responses


async def run_evals(
    query_engine,
    eval_qa_pairs,
    batch_runner,
    disable_async_for_preds=False,
    save_path=None,
):
    # then evaluate
    # TODO: evaluate a sample of generated results
    eval_qs = [q for q, _ in eval_qa_pairs]
    eval_answers = [a for _, a in eval_qa_pairs]

    if save_path is not None:
        if not os.path.exists(save_path):
            start_time = time.time()
            if disable_async_for_preds:
                pred_responses = get_responses_sync(
                    eval_qs,
                    query_engine,
                    show_progress=True,
                    save_path=save_path,
                )
            else:
                pred_responses = get_responses(
                    eval_qs, query_engine, show_progress=True
                )
            avg_time = (time.time() - start_time) / len(eval_qs)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
        else:
            # [optional] load
            pickled_dict = pickle.load(open(save_path, "rb"))
            pred_responses = pickled_dict["pred_responses"]
            avg_time = pickled_dict["avg_time"]
    else:
        start_time = time.time()
        pred_responses = get_responses(
            eval_qs, query_engine, show_progress=True
        )
        avg_time = (time.time() - start_time) / len(eval_qs)

    eval_results = await batch_runner.aevaluate_responses(
        eval_qs, responses=pred_responses, reference=eval_answers
    )
    return eval_results, {"avg_time": avg_time}

Construct Assistant with Built-In Retrieval#

Let’s construct the assistant by also passing it the built-in OpenAI Retrieval tool.

Here, we upload and pass in the file during assistant-creation time.

from llama_index.agent.openai import OpenAIAssistantAgent

agent = OpenAIAssistantAgent.from_new(
    name="SEC Analyst",
    instructions="You are a QA assistant designed to analyze sec filings.",
    openai_tools=[{"type": "retrieval"}],
    instructions_prefix="Please address the user as Jerry.",
    files=["data/llama2.pdf"],
    verbose=True,
)

response = agent.query(
    "What are the key differences between Llama 2 and Llama 2-Chat?"
)

print(str(response))

The key differences between Llama 2 and Llama 2-Chat, as indicated by the document, focus on their performance in safety evaluations, particularly when tested with adversarial prompts. Here are some of the differences highlighted within the safety evaluation section of Llama 2-Chat:

Safety Human Evaluation: Llama 2-Chat was assessed with roughly 2000 adversarial prompts, among which 1351 were single-turn and 623 were multi-turn. The responses were judged for safety violations on a five-point Likert scale, where a rating of 1 or 2 indicated a violation. The evaluation aimed to gauge the model’s safety by its rate of generating responses with safety violations and its helpfulness to users.

Violation Percentage and Mean Rating: Llama 2-Chat exhibited a low overall violation percentage across different model sizes and a high mean rating for safety and helpfulness, which suggests a strong performance in safety evaluations.

Inter-Rater Reliability: The reliability of the safety assessments was measured using Gwet’s AC1/2 statistic, showing a high degree of agreement among annotators with an average inter-rater reliability score of 0.92 for Llama 2-Chat annotations.

Single-turn and Multi-turn Conversations: The evaluation revealed that multi-turn conversations generally lead to more safety violations across models, but Llama 2-Chat performed well compared to baselines, particularly in multi-turn scenarios.

Violation Percentage Per Risk Category: Llama 2-Chat had a relatively higher number of violations in the unqualified advice category, possibly due to a lack of appropriate disclaimers in its responses.

Improvements in Fine-Tuned Llama 2-Chat: The document also mentions that the fine-tuned Llama 2-Chat showed significant improvement over the pre-trained Llama 2 in terms of truthfulness and toxicity. The percentage of toxic generations dropped to effectively 0% for Llama 2-Chat of all sizes, which was the lowest among all compared models, indicating a notable enhancement in safety.

These points detail the evaluations and improvements emphasizing safety that distinguish Llama 2-Chat from Llama 2【9†source】.

Benchmark#

We run the agent over our evaluation dataset. We benchmark against a standard top-k RAG pipeline (k=2) with gpt-4-turbo.

NOTE: During our time of testing (November 2023), the Assistant API is heavily rate-limited, and can take ~1-2 hours to generate responses over 60 datapoints.

Define Baseline Index + RAG Pipeline#

llm = OpenAI(model="gpt-4-1106-preview")
base_index = VectorStoreIndex(nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=2, llm=llm)

Run Evals over Baseline#

base_eval_results, base_extra_info = await run_evals(
    base_query_engine,
    eval_dataset.qr_pairs,
    batch_runner,
    save_path="data/llama2_preds_base.pkl",
)

results_df = get_results_df(
    [base_eval_results],
    ["Base Query Engine"],
    ["correctness", "semantic_similarity"],
)
display(results_df)

	names	correctness	semantic_similarity
0	Base Query Engine	4.05	0.964245

Run Evals over Assistant API#

assistant_eval_results, assistant_extra_info = await run_evals(
    agent,
    eval_dataset.qr_pairs[:55],
    batch_runner,
    save_path="data/llama2_preds_assistant.pkl",
    disable_async_for_preds=True,
)

Get Results#

Here we see…that our basic RAG pipeline does better.

Take these numbers with a grain of salt. The goal here is to give you a script so you can run this on your own data.

That said it’s surprising the Retrieval API doesn’t give immediately better out of the box performance.

results_df = get_results_df(
    [assistant_eval_results, base_eval_results],
    ["Retrieval API", "Base Query Engine"],
    ["correctness", "semantic_similarity"],
)
display(results_df)
print(f"Base Avg Time: {base_extra_info['avg_time']}")
print(f"Assistant Avg Time: {assistant_extra_info['avg_time']}")

	names	correctness	semantic_similarity
0	Retrieval API	3.536364	0.952647
1	Base Query Engine	4.050000	0.964245

Base Avg Time: 0.25683316787083943
Assistant Avg Time: 75.43605598536405