Trustworthy RAG with the Trustworthy Language Model¶
This tutorial demonstrates how to use Cleanlab's Trustworthy Language Model (TLM) in any RAG system to score the trustworthiness of answers and improve the system's overall reliability. We recommend first completing the TLM example tutorial.
Retrieval-Augmented Generation (RAG) has become popular for building LLM-based Question-Answer systems in domains where LLMs alone suffer from hallucination, knowledge gaps, and factual inaccuracies. However, RAG systems often still produce unreliable responses, because they depend on LLMs that are fundamentally unreliable. Cleanlab's Trustworthy Language Model (TLM) offers a solution by providing trustworthiness scores to assess and improve response quality, independent of your RAG architecture or retrieval and indexing processes.
To diagnose when RAG answers cannot be trusted, simply swap the existing LLM that generates answers from the retrieved context with TLM. This notebook demonstrates the workflow for a standard RAG system, based on a tutorial from the popular LlamaIndex framework. Here we merely replace the LLM used in the LlamaIndex tutorial with TLM and showcase some of the benefits. TLM can be similarly inserted into any other RAG framework.
Setup¶
RAG is all about connecting LLMs to data, to better inform their answers. This tutorial uses Nvidia's Q1 FY2024 earnings report as an example dataset.
Use the following commands to download the data (earnings report) and store it in a directory named data/.
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p ./data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/
Now let's install the required dependencies.
%pip install llama-index-llms-cleanlab llama-index llama-index-embeddings-huggingface
Next, we initialize Cleanlab's TLM. Here we create a CleanlabTLM object with default settings.
You can get your Cleanlab API key here: https://app.cleanlab.ai/account after creating an account. For detailed instructions, refer to this guide.
from llama_index.llms.cleanlab import CleanlabTLM

# Set your Cleanlab API key either as an environment variable or directly in the constructor:
# import os
# os.environ["CLEANLAB_API_KEY"] = "your_api_key"
llm = CleanlabTLM(api_key="your_api_key")
Note: If you encounter a ValidationError during the above import, please upgrade your Python version to >= 3.11.
You can achieve better results by playing with the TLM configurations outlined in this advanced TLM tutorial.
For example, if your application requires OpenAI's GPT-4 model and you want to restrict output to 128 tokens, you can configure this using the options argument:
options = {
    "model": "gpt-4",
    "max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
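TLM also supports quality presets that trade off response quality against latency and cost. Below is a minimal sketch of passing one; the `quality_preset` argument and the preset name "best" are assumptions based on the advanced TLM tutorial, so check the CleanlabTLM documentation for the values your installed version supports. The rest of this tutorial keeps the `llm` configured above.

# Sketch (optional): a higher quality preset may improve responses at the
# cost of latency; the preset name is assumed from the advanced TLM tutorial.
# llm = CleanlabTLM(api_key="your_api_key", quality_preset="best", options=options)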
Let's start by asking the LLM a simple question.
response = llm.complete("What is NVIDIA's ticker symbol?")
print(response)
NVIDIA's ticker symbol is NVDA.
TLM not only provides a response but also includes a trustworthiness score indicating the confidence that this response is good/accurate. You can access this score from the response itself.
response.additional_kwargs
{'trustworthiness_score': 0.9884869430083446}
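For example, you can read the score out of this dictionary directly:

# Access the trustworthiness score from the response metadata
score = response.additional_kwargs["trustworthiness_score"]
print(f"Trustworthiness score: {round(score, 2)}")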
Build a RAG pipeline with TLM¶
Now let's integrate TLM into a RAG pipeline.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
Settings.llm = llm
Specify Embedding Model¶
RAG uses an embedding model to match queries against document chunks to retrieve the most relevant data. Here we opt for a no-cost, local embedding model from Hugging Face. You can use any other embedding model by referring to this LlamaIndex guide.
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
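If you'd rather use a hosted embedding model, a minimal sketch swapping in OpenAI embeddings might look like the following (this assumes you've installed llama-index-embeddings-openai and set an OpenAI API key; this tutorial sticks with the local Hugging Face model above):

# Sketch: a hosted alternative to the local Hugging Face embedding model.
# Requires: pip install llama-index-embeddings-openai
# from llama_index.embeddings.openai import OpenAIEmbedding
# Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")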
Load Data and Create Index + Query Engine¶
Let's create an index from the documents stored in the data directory. The system can index multiple files within the same folder, although for this tutorial, we'll use just one document. We stick with the default index from LlamaIndex for this tutorial.
documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    doc.excluded_llm_metadata_keys.append(
        "file_path"
    )  # file_path wouldn't be useful metadata to add to the LLM's context since our datasource contains just 1 file
index = VectorStoreIndex.from_documents(documents)
The generated index is used to power a query engine over the data.
query_engine = index.as_query_engine()
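If you want to customize retrieval, you can pass standard LlamaIndex parameters when creating the engine. A minimal sketch follows; the name custom_query_engine and the similarity_top_k value are illustrative choices, and the rest of this tutorial keeps the default engine above.

# Sketch: retrieve the 5 most similar chunks per query instead of the default
custom_query_engine = index.as_query_engine(similarity_top_k=5)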
Note that TLM is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.
Extract Trustworthiness Score from LLM response¶
As we saw above, Cleanlab's TLM provides the trustworthiness_score in addition to the text in its response to the prompt.
To get this score out when TLM is used in a RAG pipeline, LlamaIndex provides an instrumentation tool that allows us to observe the events running behind the scenes in RAG. We can use this tooling to extract the trustworthiness_score from the LLM's response.
Let's define a simple event handler that stores this score for every request sent to the LLM. You can refer to LlamaIndex's documentation for more details on instrumentation.
from typing import List

from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent


class GetTrustworthinessScore(BaseEventHandler):
    events: List[BaseEvent] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        """Class name."""
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent) -> None:
        # Record the trustworthiness score from every completed LLM call
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs[
                "trustworthiness_score"
            ]
            self.events.append(event)
# Root dispatcher
root_dispatcher = get_dispatcher()
# Register event handler
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)
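As an optional sanity check, you can confirm the handler is wired up before running any queries; it records completion events as they occur:

# Before any query runs, the handler has recorded no completion events
print(len(event_handler.events))  # 0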
For each query, we can fetch this score from event_handler.trustworthiness_score. Let's see it in action.
Answering queries with our RAG system¶
Let's try out our RAG pipeline based on TLM. Here we pose questions with differing levels of complexity.
# Optional: Define `display_response` helper function
# This method presents formatted responses from our TLM-based RAG pipeline. It parses the output to display both the text response itself and the corresponding trustworthiness score.
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")
Easy Questions¶
We first pose straightforward questions that can be directly answered by the provided data and can be easily located within a few lines of text.
response = query_engine.query(
    "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
)
display_response(response)
Response: NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.
Trustworthiness score: 1.0
response = query_engine.query(
    "What was the GAAP earnings per diluted share for the quarter?"
)
display_response(response)
Response: The GAAP earnings per diluted share for the quarter (Q1 FY24) was $0.82.
Trustworthiness score: 1.0
response = query_engine.query(
    "What significant transitions did Jensen Huang, NVIDIA's CEO, comment on?"
)
display_response(response)
Response: Jensen Huang, NVIDIA's CEO, commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.
Trustworthiness score: 0.99
TLM returns high trustworthiness scores for these responses, indicating high confidence they are accurate. After doing a quick fact-check (reviewing the original earnings report), we can confirm that TLM indeed accurately answered these questions. In case you're curious, here are relevant excerpts from the data context for these questions:
NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, ...
GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.
Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, ...
Questions without Available Context¶
Now let's see how TLM responds to queries that cannot be answered using the provided data.
response = query_engine.query(
    "What factors, as per the report, were responsible for the decline in NVIDIA's proviz revenue?"
)
display_response(response)
Response: The report indicates that NVIDIA's professional visualization revenue declined by 53% year-over-year. While the specific factors contributing to this decline are not detailed in the provided information, several potential reasons can be inferred:

1. **Market Conditions**: The overall market for professional visualization may have faced challenges, leading to reduced demand for NVIDIA's products in this segment.
2. **Increased Competition**: The presence of competitors in the professional visualization space could have impacted NVIDIA's market share and revenue.
3. **Economic Factors**: Broader economic conditions, such as inflation or reduced spending in industries that utilize professional visualization tools, may have contributed to the decline.
4. **Transition to New Technologies**: The introduction of new technologies, such as the NVIDIA Omniverse™ Cloud, may have shifted focus away from traditional professional visualization products, affecting revenue.
5. **Product Lifecycle**: If certain products were nearing the end of their lifecycle or if there were delays in new product launches, this could have impacted sales.

Overall, while the report does not specify the exact reasons for the decline, these factors could be contributing elements based on industry trends and market dynamics.

Trustworthiness score: 0.76
The lower TLM trustworthiness score indicates a bit more uncertainty about the response, which aligns with the lack of information available. Let's try some more questions.
response = query_engine.query(
    "How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
display_response(response)
Response: The report indicates that NVIDIA's Gaming revenue decreased year over year by 38%, which is attributed to a combination of factors, including a challenging market environment and possibly reduced demand for gaming hardware. While specific reasons for the decline are not detailed in the provided information, the overall decrease in revenue suggests that the gaming sector is facing headwinds compared to the previous year.
Trustworthiness score: 0.92
response = query_engine.query(
    "How does NVIDIA's dividend payout for this quarter compare to the industry average?",
)
display_response(response)
Response: The provided context information does not include any details about NVIDIA's dividend payout or the industry average for dividends. Therefore, I cannot provide a comparison of NVIDIA's dividend payout for this quarter to the industry average. Additional information regarding dividends would be needed to answer the query.
Trustworthiness score: 0.87
We observe that TLM demonstrates the ability to recognize the limitations of the available information. It refrains from generating speculative responses or hallucinations, thereby maintaining the reliability of the question-answering system. This behavior showcases an understanding of the boundaries of the context and prioritizes accuracy over conjecture.
Challenging Questions¶
Let's see how our RAG system responds to harder questions, some of which may be misleading.
response = query_engine.query(
    "How much did Nvidia's revenue decrease this quarter vs last quarter, in terms of $?"
)
display_response(response)
Response: NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and it was reported that this revenue was up 19% from the previous quarter. To find the revenue for the previous quarter, we can use the following calculation:

Let \( x \) be the revenue for the previous quarter. The equation based on the 19% increase is:

\[ x + 0.19x = 7.19 \text{ billion} \]
\[ 1.19x = 7.19 \text{ billion} \]
\[ x = \frac{7.19 \text{ billion}}{1.19} \approx 6.04 \text{ billion} \]

Now, to find the decrease in revenue from the previous quarter to this quarter:

\[ \text{Decrease} = 7.19 \text{ billion} - 6.04 \text{ billion} \approx 1.15 \text{ billion} \]

Thus, NVIDIA's revenue decreased by approximately $1.15 billion this quarter compared to the last quarter.

Trustworthiness score: 0.6
response = query_engine.query(
    "This report focuses on Nvidia's Q1FY2024 financial results. There are mentions of other companies in the report like Microsoft, Dell, ServiceNow, etc. Can you name them all here?",
)
display_response(response)
Response: The report mentions the following companies: Microsoft and Dell. ServiceNow is also mentioned in the context, but it is not specified in the provided highlights. Therefore, the companies explicitly mentioned in the report are Microsoft and Dell.
Trustworthiness score: 0.6
TLM automatically alerts us that these answers are unreliable via their low trustworthiness scores. A RAG system with TLM helps you exercise appropriate caution whenever you see a low trustworthiness score. Here are the correct answers to the questions above:
NVIDIA's revenue increased by $1.14 billion this quarter compared to last quarter ($7.19 billion in Q1 FY24 vs. $6.05 billion in the prior quarter).
Google, Amazon Web Services, Microsoft, Oracle, ServiceNow, Medtronic, Dell Technologies
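Beyond manual review, you can act on these scores programmatically. Below is a minimal sketch of gating answers on the trustworthiness score; the answer_with_caution helper and the 0.7 threshold are illustrative choices, not values prescribed by Cleanlab.

TRUST_THRESHOLD = 0.7  # illustrative cutoff; tune for your application

def answer_with_caution(query: str) -> str:
    """Answer a query, flagging responses whose trustworthiness score is low."""
    response = query_engine.query(query)
    score = event_handler.trustworthiness_score
    if score < TRUST_THRESHOLD:
        return (
            f"[Warning: low trustworthiness score ({round(score, 2)})] "
            f"{response.response}\n"
            "Please verify this answer against the source documents."
        )
    return response.response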
With TLM, you can easily increase trust in any RAG system!
Feel free to check TLM's performance benchmarks for more details.