Cleanlab Trustworthy Language Model¶
This notebook shows how to use Cleanlab's Trustworthy Language Model (TLM) and Trustworthiness score.
TLM is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.
Trustworthiness score quantifies how confident you can be that the response is good (higher values indicate greater trustworthiness). These scores combine estimates of both aleatoric and epistemic uncertainty to provide an overall gauge of trustworthiness.
Read more about the TLM API in Cleanlab Studio's docs. For more advanced usage, refer to the quickstart tutorial.
Visit https://cleanlab.ai and sign up to get a free API key.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-cleanlab
%pip install llama-index
from llama_index.llms.cleanlab import CleanlabTLM
# Set the API key in your environment, or pass it to the LLM directly below
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"
llm = CleanlabTLM(api_key="your_api_key")
resp = llm.complete("Who is Paul Graham?")
print(resp)
Paul Graham is an American computer scientist, entrepreneur, and venture capitalist. He is best known as the co-founder of the startup accelerator Y Combinator, which has helped launch numerous successful companies including Dropbox, Airbnb, and Reddit. Graham is also a prolific writer and essayist, known for his insightful and thought-provoking essays on topics ranging from startups and entrepreneurship to technology and society. He has been influential in the tech industry and is highly regarded for his expertise and contributions to the startup ecosystem.
You also get the trustworthiness score of the above response in additional_kwargs. TLM automatically computes this score for every <prompt, response> pair.
print(resp.additional_kwargs)
{'trustworthiness_score': 0.8659043183923533}
A score of ~0.86 indicates that the LLM's response can be trusted. Let's look at another example.
resp = llm.complete(
"What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
print(resp)
The first automobile engine used in a commercial truck in the United States was the 1899 Winton Motor Carriage Company Model 10, which had a 2-cylinder engine with 20 horsepower.
print(resp.additional_kwargs)
{'trustworthiness_score': 0.5820799504369166}
A low score of ~0.58 indicates that the LLM's response shouldn't be trusted.
From these two straightforward examples, we can observe that the LLM's responses with high scores are direct, accurate, and appropriately detailed, while responses with low trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations.
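In an application, you can act on these scores programmatically, for example by gating responses on a minimum trustworthiness. Below is a minimal sketch; the `route_response` helper and the 0.7 threshold are illustrative assumptions, not part of the TLM API:

```python
def route_response(text: str, score: float, threshold: float = 0.7) -> str:
    """Hypothetical helper: pass trusted answers through, flag the rest.

    The 0.7 threshold is an illustrative choice; tune it for your application.
    """
    if score >= threshold:
        return text
    return f"[trustworthiness {score:.2f} below {threshold}] needs review: {text}"


# Using the two scores observed above:
print(route_response("Paul Graham is an American computer scientist...", 0.8659))
print(route_response("The first automobile engine used in a commercial truck...", 0.5820))
```

The second call returns the answer prefixed with a review flag, which you could surface to a human reviewer or use to trigger a fallback response.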
Streaming¶
Cleanlab's TLM integration also supports streaming the response. Here's a simple example using the stream_complete endpoint.
resp = llm.stream_complete("Who is Paul Graham?")
for r in resp:
print(r.delta, end="")
{"response": "Paul Graham is an American computer scientist, entrepreneur, and venture capitalist. He is best known as the co-founder of the startup accelerator Y Combinator, which has helped launch numerous successful companies including Dropbox, Airbnb, and Reddit. Graham is also a prolific writer and essayist, known for his insightful and thought-provoking essays on topics ranging from startups and entrepreneurship to technology and society. He has been influential in the tech industry and is highly regarded for his expertise and contributions to the startup ecosystem.", "trustworthiness_score": 0.8659043183923533}
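As the output above shows, the streamed deltas spell out a JSON object containing both the response and its trustworthiness score. One way to recover the score is to accumulate the deltas and parse the result. The sketch below simulates the chunks rather than calling the API, so it runs without an API key:

```python
import json

# In practice these chunks would come from the streaming call, e.g.:
#   chunks = [r.delta for r in llm.stream_complete("Who is Paul Graham?")]
# Simulated here with two fragments of the JSON payload shown above:
chunks = [
    '{"response": "Paul Graham is an American computer scientist...",',
    ' "trustworthiness_score": 0.8659043183923533}',
]

payload = json.loads("".join(chunks))
print(payload["trustworthiness_score"])
```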
Advanced use of TLM¶
TLM can be configured with the following options:
- model: underlying LLM to use
- max_tokens: maximum number of tokens to generate in the response
- num_candidate_responses: number of alternative candidate responses internally generated by TLM
- num_consistency_samples: number of internal samples used to evaluate the consistency of the LLM response
- use_self_reflection: whether the LLM is asked to self-reflect on the response it generated and evaluate it
These configurations are passed as a dictionary to the CleanlabTLM object during initialization. More details about these options can be found in Cleanlab's API documentation, and a few use cases for them are explored in this notebook.
Let's consider an example where the application requires the gpt-4 model and at most 128 output tokens.
options = {
"model": "gpt-4",
"max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
resp = llm.complete("Who is Paul Graham?")
print(resp)
Paul Graham is a British-born American computer scientist, entrepreneur, venture capitalist, author, and essayist. He is best known for co-founding Viaweb, which was sold to Yahoo in 1998 for over $49 million and became Yahoo Store. He also co-founded the influential startup accelerator and seed capital firm Y Combinator, which has launched over 2,000 companies including Dropbox, Airbnb, Stripe, and Reddit. Graham is also known for his essays on startup companies and programming languages.
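The reliability-oriented options listed earlier can be combined in the same way. The specific values below are illustrative assumptions (consult Cleanlab's API documentation for the supported ranges), sketching a configuration that trades latency for more thorough trustworthiness estimation:

```python
# Illustrative settings only; check Cleanlab's docs for supported values.
quality_options = {
    "model": "gpt-4",
    "max_tokens": 128,
    "num_candidate_responses": 4,  # generate several candidate responses internally
    "num_consistency_samples": 8,  # more internal samples for consistency checks
    "use_self_reflection": True,   # ask the model to evaluate its own answer
}

# llm = CleanlabTLM(api_key="your_api_key", options=quality_options)
# resp = llm.complete("Who is Paul Graham?")
```

Raising these values generally increases both the cost and latency of each call, so reserve such a configuration for queries where reliability matters most.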