Cleanlab Trustworthy Language Model¶
This notebook shows how to use Cleanlab's Trustworthy Language Model (TLM) and Trustworthiness score.
TLM is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.
Trustworthiness score quantifies how confident you can be that the response is good (higher values indicate greater trustworthiness). These scores combine estimates of both aleatoric and epistemic uncertainty to provide an overall gauge of trustworthiness.
Read more about the TLM API in Cleanlab Studio's docs. For more advanced usage, feel free to refer to the quickstart tutorial.
Visit https://cleanlab.ai and sign up to get a free API key.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-cleanlab
%pip install llama-index
from llama_index.llms.cleanlab import CleanlabTLM
# set api key in env or in llm
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"
llm = CleanlabTLM(api_key="your_api_key")
resp = llm.complete("Who is Paul Graham?")
print(resp)
Paul Graham is an American computer scientist, entrepreneur, and venture capitalist. He is best known as the co-founder of the startup accelerator Y Combinator, which has helped launch numerous successful companies including Dropbox, Airbnb, and Reddit. Graham is also a prolific writer and essayist, known for his insightful and thought-provoking essays on topics ranging from startups and entrepreneurship to technology and society. He has been influential in the tech industry and is highly regarded for his expertise and contributions to the startup ecosystem.
You also get the trustworthiness score of the above response in additional_kwargs. TLM automatically computes this score for every <prompt, response> pair.
print(resp.additional_kwargs)
{'trustworthiness_score': 0.8659043183923533}
A high score indicates that the LLM's response can be trusted. Let's take another example.
resp = llm.complete(
"What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
print(resp)
The first automobile engine used in a commercial truck in the United States was the 1899 Winton Motor Carriage Company Model 10, which had a 2-cylinder engine with 20 horsepower.
print(resp.additional_kwargs)
{'trustworthiness_score': 0.5820799504369166}
A low score indicates that the LLM's response shouldn't be trusted.
From these two straightforward examples, we can observe that responses with high trustworthiness scores are direct, accurate, and appropriately detailed.
On the other hand, responses with low trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations.
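As a minimal sketch of how this score might be used in practice, you can gate responses on a trustworthiness threshold before surfacing them. The answer_with_fallback helper and the 0.7 threshold below are illustrative assumptions, not part of the Cleanlab API:
# Sketch: gate TLM responses on their trustworthiness score.
# The 0.7 threshold is an arbitrary illustrative value; tune it for your application.
TRUST_THRESHOLD = 0.7


def answer_with_fallback(llm: CleanlabTLM, prompt: str) -> str:
    resp = llm.complete(prompt)
    score = resp.additional_kwargs["trustworthiness_score"]
    if score < TRUST_THRESHOLD:
        # Low trustworthiness: flag the answer for review instead of returning it as-is
        return f"[Low trustworthiness ({score:.2f})] Please verify: {resp.text}"
    return resp.text


print(answer_with_fallback(llm, "Who is Paul Graham?"))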
Streaming¶
Cleanlab’s TLM does not natively support streaming both the response and the trustworthiness score. However, an alternative approach is available to achieve low-latency, streaming responses for your application.
Detailed information about the approach, along with example code, is available here.
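One possible pattern, sketched below under assumptions (the cleanlab_studio SDK is installed, an OpenAI API key is set, and the gpt-4o-mini model name is available; exact method names and return types may differ across SDK versions), is to stream the answer from a fast LLM and then score the completed <prompt, response> pair with TLM:
# Sketch: stream the response from a separate low-latency LLM, then score the
# finished answer with Cleanlab TLM. Assumes `pip install cleanlab-studio` and
# an OPENAI_API_KEY environment variable; adapt to the SDK version you use.
from llama_index.llms.openai import OpenAI
from cleanlab_studio import Studio

prompt = "Who is Paul Graham?"

# 1. Stream tokens to the user with any low-latency LLM
stream_llm = OpenAI(model="gpt-4o-mini")
chunks = []
for delta in stream_llm.stream_complete(prompt):
    chunks.append(delta.delta)
    print(delta.delta, end="", flush=True)
response_text = "".join(chunks)

# 2. Score the finished <prompt, response> pair with TLM
studio = Studio("your_cleanlab_api_key")
tlm = studio.TLM()
score = tlm.get_trustworthiness_score(prompt, response_text)
print("\nTrustworthiness:", score)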
Advanced usage of TLM¶
TLM can be configured with the following options:
- model: underlying LLM to use
- max_tokens: maximum number of tokens to generate in the response
- num_candidate_responses: number of alternative candidate responses internally generated by TLM
- num_consistency_samples: amount of internal sampling to evaluate LLM-response-consistency
- use_self_reflection: whether the LLM is asked to self-reflect upon the response it generated and self-evaluate this response
- log: specify additional metadata to return. include “explanation” here to get explanations of why a response is scored with low trustworthiness
These configurations are passed as a dictionary to the CleanlabTLM
object during initialization.
More details about these options can be found in Cleanlab's API documentation, and a few use cases of these options are explored in this notebook.
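For reference, a single options dictionary combining all of the settings listed above might look like the following. The specific values are illustrative only; consult Cleanlab's API documentation for supported models and ranges:
# Illustrative options dictionary combining the settings listed above.
# Values are examples only; check Cleanlab's API documentation for supported ranges.
options = {
    "model": "gpt-4",  # underlying LLM
    "max_tokens": 256,  # cap on response length
    "num_candidate_responses": 4,  # alternative responses generated internally
    "num_consistency_samples": 8,  # internal samples for consistency scoring
    "use_self_reflection": True,  # have the LLM self-evaluate its response
    "log": ["explanation"],  # return explanations for low trustworthiness scores
}
llm = CleanlabTLM(api_key="your_api_key", options=options)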
Let's consider an example where the application requires the gpt-4 model and at most 128 output tokens.
options = {
"model": "gpt-4",
"max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
resp = llm.complete("Who is Paul Graham?")
print(resp)
Paul Graham is a British-born American computer scientist, entrepreneur, venture capitalist, author, and essayist. He is best known for co-founding Viaweb, which was sold to Yahoo in 1998 for over $49 million and became Yahoo Store. He also co-founded the influential startup accelerator and seed capital firm Y Combinator, which has launched over 2,000 companies including Dropbox, Airbnb, Stripe, and Reddit. Graham is also known for his essays on startup companies and programming languages.
To understand why the TLM estimated low trustworthiness for the previous horsepower-related question, specify the "explanation" flag when initializing the TLM.
options = {
"log": ["explanation"],
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
resp = llm.complete(
"What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
print(resp)
The first automobile engine used in a commercial truck in the United States was in the 1899 "Motor Truck" built by the American company, the "GMC Truck Company." This early truck was equipped with a 2-horsepower engine. However, it's important to note that the development of commercial trucks evolved rapidly, and later models featured significantly more powerful engines.
print(resp.additional_kwargs["explanation"])
The proposed answer incorrectly attributes the first commercial truck in the United States to the GMC Truck Company and states that it was built in 1899 with a 2-horsepower engine. In reality, the first commercial truck is generally recognized as the "Motor Truck" built by the American company, the "GMC Truck Company," but it was actually produced by the "GMC" brand, which was established later. The first commercial truck is often credited to the "Benz Velo" or similar early models, which had varying horsepower ratings. The specific claim of a 2-horsepower engine is also misleading, as early trucks typically had more powerful engines. Therefore, the answer contains inaccuracies regarding both the manufacturer and the specifications of the engine. This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): The horsepower of the first automobile engine used in a commercial truck in the United States was 6 horsepower.