Benchmarking OpenAI Retrieval API (through Assistant Agent)#

Open In Colab

This guide benchmarks the Retrieval Tool from the OpenAI Assistant API, by using our OpenAIAssistantAgent. We run over the Llama 2 paper, and compare generation quality against a naive RAG pipeline.

%pip install llama-index-readers-file
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
!pip install llama-index
import nest_asyncio


Setup Data#

Here we load the Llama 2 paper and chunk it.

!mkdir -p 'data/'
!wget --user-agent "Mozilla" "" -O "data/llama2.pdf"
--2023-11-08 21:53:52--
Resolving (
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M   141KB/s    in 1m 48s  

2023-11-08 21:55:42 (123 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
from pathlib import Path
from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)

Define Eval Modules#

We setup evaluation modules, including the dataset and evaluators.

Setup “Golden Dataset”#

Here we load in a “golden” dataset.

Option 1: Pull Existing Dataset#

NOTE: We pull this in from Dropbox. For details on how to generate a dataset please see our DatasetGenerator module.

!wget "" -O data/llama2_eval_qr_dataset.json
--2023-11-08 22:20:10--
Resolving ( 2620:100:6057:18::a27d:d12,
Connecting to (|2620:100:6057:18::a27d:d12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: [following]
--2023-11-08 22:20:11--
Resolving ( 2620:100:6057:15::a27d:d0f,
Connecting to (|2620:100:6057:15::a27d:d0f|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60656 (59K) [application/binary]
Saving to: ‘data/llama2_eval_qr_dataset.json’

data/llama2_eval_qr 100%[===================>]  59.23K  --.-KB/s    in 0.02s   

2023-11-08 22:20:12 (2.87 MB/s) - ‘data/llama2_eval_qr_dataset.json’ saved [60656/60656]
from llama_index.core.evaluation import QueryResponseDataset

# optional
eval_dataset = QueryResponseDataset.from_json(

Option 2: Generate New Dataset#

If you choose this option, you can choose to generate a new dataset from scratch. This allows you to play around with our DatasetGenerator settings to make sure it suits your needs.

from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI
# NOTE: run this if the dataset isn't already saved
# Note: we only generate from the first 20 nodes, since the rest are references
llm = OpenAI(model="gpt-4-1106-preview")
dataset_generator = DatasetGenerator(
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
# optional
eval_dataset = QueryResponseDataset.from_json(

Eval Modules#

We define two evaluation modules: correctness and semantic similarity - both comparing quality of predicted response with actual response.

from llama_index.core.evaluation.eval_utils import (
from llama_index.core.evaluation import (
from llama_index.llms.openai import OpenAI
eval_llm = OpenAI(model="gpt-4-1106-preview")
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator(llm=eval_llm)
evaluator_dict = {
    "correctness": evaluator_c,
    "semantic_similarity": evaluator_s,
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
import numpy as np
import time
import os
import pickle
from tqdm import tqdm

def get_responses_sync(
    eval_qs, query_engine, show_progress=True, save_path=None
    if show_progress:
        eval_qs_iter = tqdm(eval_qs)
        eval_qs_iter = eval_qs
    pred_responses = []
    start_time = time.time()
    for eval_q in eval_qs_iter:
        print(f"eval q: {eval_q}")
        pred_response = agent.query(eval_q)
        print(f"predicted response: {pred_response}")
        if save_path is not None:
            # save intermediate responses (to cache in case something breaks)
            avg_time = (time.time() - start_time) / len(pred_responses)
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
    return pred_responses

async def run_evals(
    # then evaluate
    # TODO: evaluate a sample of generated results
    eval_qs = [q for q, _ in eval_qa_pairs]
    eval_answers = [a for _, a in eval_qa_pairs]

    if save_path is not None:
        if not os.path.exists(save_path):
            start_time = time.time()
            if disable_async_for_preds:
                pred_responses = get_responses_sync(
                pred_responses = get_responses(
                    eval_qs, query_engine, show_progress=True
            avg_time = (time.time() - start_time) / len(eval_qs)
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            # [optional] load
            pickled_dict = pickle.load(open(save_path, "rb"))
            pred_responses = pickled_dict["pred_responses"]
            avg_time = pickled_dict["avg_time"]
        start_time = time.time()
        pred_responses = get_responses(
            eval_qs, query_engine, show_progress=True
        avg_time = (time.time() - start_time) / len(eval_qs)

    eval_results = await batch_runner.aevaluate_responses(
        eval_qs, responses=pred_responses, reference=eval_answers
    return eval_results, {"avg_time": avg_time}

Construct Assistant with Built-In Retrieval#

Let’s construct the assistant by also passing it the built-in OpenAI Retrieval tool.

Here, we upload and pass in the file during assistant-creation time.

from llama_index.agent.openai import OpenAIAssistantAgent
agent = OpenAIAssistantAgent.from_new(
    name="SEC Analyst",
    instructions="You are a QA assistant designed to analyze sec filings.",
    openai_tools=[{"type": "retrieval"}],
    instructions_prefix="Please address the user as Jerry.",
response = agent.query(
    "What are the key differences between Llama 2 and Llama 2-Chat?"
The key differences between Llama 2 and Llama 2-Chat, as indicated by the document, focus on their performance in safety evaluations, particularly when tested with adversarial prompts. Here are some of the differences highlighted within the safety evaluation section of Llama 2-Chat:

1. Safety Human Evaluation: Llama 2-Chat was assessed with roughly 2000 adversarial prompts, among which 1351 were single-turn and 623 were multi-turn. The responses were judged for safety violations on a five-point Likert scale, where a rating of 1 or 2 indicated a violation. The evaluation aimed to gauge the model’s safety by its rate of generating responses with safety violations and its helpfulness to users.

2. Violation Percentage and Mean Rating: Llama 2-Chat exhibited a low overall violation percentage across different model sizes and a high mean rating for safety and helpfulness, which suggests a strong performance in safety evaluations.

3. Inter-Rater Reliability: The reliability of the safety assessments was measured using Gwet’s AC1/2 statistic, showing a high degree of agreement among annotators with an average inter-rater reliability score of 0.92 for Llama 2-Chat annotations.

4. Single-turn and Multi-turn Conversations: The evaluation revealed that multi-turn conversations generally lead to more safety violations across models, but Llama 2-Chat performed well compared to baselines, particularly in multi-turn scenarios.

5. Violation Percentage Per Risk Category: Llama 2-Chat had a relatively higher number of violations in the unqualified advice category, possibly due to a lack of appropriate disclaimers in its responses.

6. Improvements in Fine-Tuned Llama 2-Chat: The document also mentions that the fine-tuned Llama 2-Chat showed significant improvement over the pre-trained Llama 2 in terms of truthfulness and toxicity. The percentage of toxic generations dropped to effectively 0% for Llama 2-Chat of all sizes, which was the lowest among all compared models, indicating a notable enhancement in safety.

These points detail the evaluations and improvements emphasizing safety that distinguish Llama 2-Chat from Llama 2【9†source】.


We run the agent over our evaluation dataset. We benchmark against a standard top-k RAG pipeline (k=2) with gpt-4-turbo.

NOTE: During our time of testing (November 2023), the Assistant API is heavily rate-limited, and can take ~1-2 hours to generate responses over 60 datapoints.

Define Baseline Index + RAG Pipeline#

llm = OpenAI(model="gpt-4-1106-preview")
base_index = VectorStoreIndex(nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=2, llm=llm)

Run Evals over Baseline#

base_eval_results, base_extra_info = await run_evals(
results_df = get_results_df(
    ["Base Query Engine"],
    ["correctness", "semantic_similarity"],
names correctness semantic_similarity
0 Base Query Engine 4.05 0.964245

Run Evals over Assistant API#

assistant_eval_results, assistant_extra_info = await run_evals(

Get Results#

Here we see…that our basic RAG pipeline does better.

Take these numbers with a grain of salt. The goal here is to give you a script so you can run this on your own data.

That said it’s surprising the Retrieval API doesn’t give immediately better out of the box performance.

results_df = get_results_df(
    [assistant_eval_results, base_eval_results],
    ["Retrieval API", "Base Query Engine"],
    ["correctness", "semantic_similarity"],
print(f"Base Avg Time: {base_extra_info['avg_time']}")
print(f"Assistant Avg Time: {assistant_extra_info['avg_time']}")
names correctness semantic_similarity
0 Retrieval API 3.536364 0.952647
1 Base Query Engine 4.050000 0.964245
Base Avg Time: 0.25683316787083943
Assistant Avg Time: 75.43605598536405