EmotionPrompt in RAG¶
Inspired by the "Large Language Models Understand and Can Be Enhanced by Emotional Stimuli" by Li et al., in this guide we show you how to evaluate the effects of emotional stimuli on your RAG pipeline:
- Setup the RAG pipeline with a basic vector index with the core QA template.
- Create some candidate stimuli (inspired by Fig. 2 of the paper)
- For each candidate stimulit, prepend to QA prompt and evaluate.
In [ ]:
Copied!
%pip install llama-index-llms-openai
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-readers-file pymupdf
In [ ]:
Copied!
import nest_asyncio
nest_asyncio.apply()
import nest_asyncio
nest_asyncio.apply()
Setup Data¶
We use the Llama 2 paper as the input data source for our RAG pipeline.
In [ ]:
Copied!
!mkdir data && wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
!mkdir data && wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
mkdir: data: File exists
In [ ]:
Copied!
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode
In [ ]:
Copied!
docs0 = PyMuPDFReader().load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SentenceSplitter(chunk_size=1024)
base_nodes = node_parser.get_nodes_from_documents(docs)
docs0 = PyMuPDFReader().load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SentenceSplitter(chunk_size=1024)
base_nodes = node_parser.get_nodes_from_documents(docs)
Setup Vector Index over this Data¶
We load this data into an in-memory vector store (embedded with OpenAI embeddings).
We'll be aggressively optimizing the QA prompt for this RAG pipeline.
In [ ]:
Copied!
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo")
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo")
In [ ]:
Copied!
index = VectorStoreIndex(base_nodes)
query_engine = index.as_query_engine(similarity_top_k=2)
index = VectorStoreIndex(base_nodes)
query_engine = index.as_query_engine(similarity_top_k=2)
Evaluation Setup¶
Golden Dataset¶
Here we load in a "golden" dataset.
NOTE: We pull this in from Dropbox. For details on how to generate a dataset please see our DatasetGenerator
module.
In [ ]:
Copied!
!wget "https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1" -O data/llama2_eval_qr_dataset.json
!wget "https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1" -O data/llama2_eval_qr_dataset.json
--2023-11-04 00:34:09-- https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1 Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6017:18::a27d:212, 162.125.2.18 Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6017:18::a27d:212|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://uc68b925272ee59de768b72ea323.dl.dropboxusercontent.com/cd/0/inline/CG4XGYSusXrgPle6I3vucuwf-NIN10QWldJ7wlc3wdzYWbv9OQey0tvB4qGxJ5W0BxL7cX-zn7Kxj5QReEbi1RNYOx1XMT9qwgMm2xWjW5a9seqV4AI8V7C0M2plvH5U1Yw/file?dl=1# [following] --2023-11-04 00:34:09-- https://uc68b925272ee59de768b72ea323.dl.dropboxusercontent.com/cd/0/inline/CG4XGYSusXrgPle6I3vucuwf-NIN10QWldJ7wlc3wdzYWbv9OQey0tvB4qGxJ5W0BxL7cX-zn7Kxj5QReEbi1RNYOx1XMT9qwgMm2xWjW5a9seqV4AI8V7C0M2plvH5U1Yw/file?dl=1 Resolving uc68b925272ee59de768b72ea323.dl.dropboxusercontent.com (uc68b925272ee59de768b72ea323.dl.dropboxusercontent.com)... 2620:100:6017:15::a27d:20f, 162.125.2.15 Connecting to uc68b925272ee59de768b72ea323.dl.dropboxusercontent.com (uc68b925272ee59de768b72ea323.dl.dropboxusercontent.com)|2620:100:6017:15::a27d:20f|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 60656 (59K) [application/binary] Saving to: ‘data/llama2_eval_qr_dataset.json’ data/llama2_eval_qr 100%[===================>] 59.23K --.-KB/s in 0.04s 2023-11-04 00:34:10 (1.48 MB/s) - ‘data/llama2_eval_qr_dataset.json’ saved [60656/60656]
In [ ]:
Copied!
from llama_index.core.evaluation import QueryResponseDataset
from llama_index.core.evaluation import QueryResponseDataset
In [ ]:
Copied!
# optional
eval_dataset = QueryResponseDataset.from_json(
"data/llama2_eval_qr_dataset.json"
)
# optional
eval_dataset = QueryResponseDataset.from_json(
"data/llama2_eval_qr_dataset.json"
)
Get Evaluator¶
In [ ]:
Copied!
from llama_index.core.evaluation.eval_utils import get_responses
from llama_index.core.evaluation.eval_utils import get_responses
In [ ]:
Copied!
from llama_index.core.evaluation import CorrectnessEvaluator, BatchEvalRunner
evaluator_c = CorrectnessEvaluator()
evaluator_dict = {"correctness": evaluator_c}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
from llama_index.core.evaluation import CorrectnessEvaluator, BatchEvalRunner
evaluator_c = CorrectnessEvaluator()
evaluator_dict = {"correctness": evaluator_c}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
Define Correctness Eval Function¶
In [ ]:
Copied!
import numpy as np
async def get_correctness(query_engine, eval_qa_pairs, batch_runner):
# then evaluate
# TODO: evaluate a sample of generated results
eval_qs = [q for q, _ in eval_qa_pairs]
eval_answers = [a for _, a in eval_qa_pairs]
pred_responses = get_responses(eval_qs, query_engine, show_progress=True)
eval_results = await batch_runner.aevaluate_responses(
eval_qs, responses=pred_responses, reference=eval_answers
)
avg_correctness = np.array(
[r.score for r in eval_results["correctness"]]
).mean()
return avg_correctness
import numpy as np
async def get_correctness(query_engine, eval_qa_pairs, batch_runner):
# then evaluate
# TODO: evaluate a sample of generated results
eval_qs = [q for q, _ in eval_qa_pairs]
eval_answers = [a for _, a in eval_qa_pairs]
pred_responses = get_responses(eval_qs, query_engine, show_progress=True)
eval_results = await batch_runner.aevaluate_responses(
eval_qs, responses=pred_responses, reference=eval_answers
)
avg_correctness = np.array(
[r.score for r in eval_results["correctness"]]
).mean()
return avg_correctness
Try Out Emotion Prompts¶
We pul some emotion stimuli from the paper to try out.
In [ ]:
Copied!
emotion_stimuli_dict = {
"ep01": "Write your answer and give me a confidence score between 0-1 for your answer. ",
"ep02": "This is very important to my career. ",
"ep03": "You'd better be sure.",
# add more from the paper here!!
}
# NOTE: ep06 is the combination of ep01, ep02, ep03
emotion_stimuli_dict["ep06"] = (
emotion_stimuli_dict["ep01"]
+ emotion_stimuli_dict["ep02"]
+ emotion_stimuli_dict["ep03"]
)
emotion_stimuli_dict = {
"ep01": "Write your answer and give me a confidence score between 0-1 for your answer. ",
"ep02": "This is very important to my career. ",
"ep03": "You'd better be sure.",
# add more from the paper here!!
}
# NOTE: ep06 is the combination of ep01, ep02, ep03
emotion_stimuli_dict["ep06"] = (
emotion_stimuli_dict["ep01"]
+ emotion_stimuli_dict["ep02"]
+ emotion_stimuli_dict["ep03"]
)
Initialize base QA Prompt¶
In [ ]:
Copied!
QA_PROMPT_KEY = "response_synthesizer:text_qa_template"
QA_PROMPT_KEY = "response_synthesizer:text_qa_template"
In [ ]:
Copied!
from llama_index.core import PromptTemplate
from llama_index.core import PromptTemplate
In [ ]:
Copied!
qa_tmpl_str = """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, \
answer the query.
{emotion_str}
Query: {query_str}
Answer: \
"""
qa_tmpl = PromptTemplate(qa_tmpl_str)
qa_tmpl_str = """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, \
answer the query.
{emotion_str}
Query: {query_str}
Answer: \
"""
qa_tmpl = PromptTemplate(qa_tmpl_str)
Prepend emotions¶
In [ ]:
Copied!
async def run_and_evaluate(
query_engine, eval_qa_pairs, batch_runner, emotion_stimuli_str, qa_tmpl
):
"""Run and evaluate."""
new_qa_tmpl = qa_tmpl.partial_format(emotion_str=emotion_stimuli_str)
old_qa_tmpl = query_engine.get_prompts()[QA_PROMPT_KEY]
query_engine.update_prompts({QA_PROMPT_KEY: new_qa_tmpl})
avg_correctness = await get_correctness(
query_engine, eval_qa_pairs, batch_runner
)
query_engine.update_prompts({QA_PROMPT_KEY: old_qa_tmpl})
return avg_correctness
async def run_and_evaluate(
query_engine, eval_qa_pairs, batch_runner, emotion_stimuli_str, qa_tmpl
):
"""Run and evaluate."""
new_qa_tmpl = qa_tmpl.partial_format(emotion_str=emotion_stimuli_str)
old_qa_tmpl = query_engine.get_prompts()[QA_PROMPT_KEY]
query_engine.update_prompts({QA_PROMPT_KEY: new_qa_tmpl})
avg_correctness = await get_correctness(
query_engine, eval_qa_pairs, batch_runner
)
query_engine.update_prompts({QA_PROMPT_KEY: old_qa_tmpl})
return avg_correctness
In [ ]:
Copied!
# try out ep01
correctness_ep01 = await run_and_evaluate(
query_engine,
eval_dataset.qr_pairs,
batch_runner,
emotion_stimuli_dict["ep01"],
qa_tmpl,
)
# try out ep01
correctness_ep01 = await run_and_evaluate(
query_engine,
eval_dataset.qr_pairs,
batch_runner,
emotion_stimuli_dict["ep01"],
qa_tmpl,
)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:10<00:00, 5.48it/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:23<00:00, 1.39s/it]
In [ ]:
Copied!
print(correctness_ep01)
print(correctness_ep01)
3.7916666666666665
In [ ]:
Copied!
# try out ep02
correctness_ep02 = await run_and_evaluate(
query_engine,
eval_dataset.qr_pairs,
batch_runner,
emotion_stimuli_dict["ep02"],
qa_tmpl,
)
# try out ep02
correctness_ep02 = await run_and_evaluate(
query_engine,
eval_dataset.qr_pairs,
batch_runner,
emotion_stimuli_dict["ep02"],
qa_tmpl,
)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:10<00:00, 5.62it/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:21<00:00, 1.36s/it] /var/folders/1r/c3h91d9s49xblwfvz79s78_c0000gn/T/ipykernel_80474/3350915737.py:2: RuntimeWarning: coroutine 'run_and_evaluate' was never awaited correctness_ep02 = await run_and_evaluate( RuntimeWarning: Enable tracemalloc to get the object allocation traceback
In [ ]:
Copied!
print(correctness_ep02)
print(correctness_ep02)
3.941666666666667
In [ ]:
Copied!
# try none
correctness_base = await run_and_evaluate(
query_engine, eval_dataset.qr_pairs, batch_runner, "", qa_tmpl
)
# try none
correctness_base = await run_and_evaluate(
query_engine, eval_dataset.qr_pairs, batch_runner, "", qa_tmpl
)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:12<00:00, 4.92it/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:59<00:00, 2.00s/it] /var/folders/1r/c3h91d9s49xblwfvz79s78_c0000gn/T/ipykernel_80474/997505056.py:2: RuntimeWarning: coroutine 'run_and_evaluate' was never awaited correctness_base = await run_and_evaluate( RuntimeWarning: Enable tracemalloc to get the object allocation traceback
In [ ]:
Copied!
print(correctness_base)
print(correctness_base)
3.8916666666666666