Anthropic Prompt Caching¶
In this Notebook, we will demonstrate the usage of Anthropic Prompt Caching with LlamaIndex abstractions.
Prompt Caching is enabled by marking `cache_control` in the messages request.
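If you are running this notebook in a fresh environment, you may first need to install the LlamaIndex Anthropic integration (the package name below is assumed from the import used later in this notebook):

%pip install llama-index-llms-anthropic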
How Prompt Caching works¶
When you send a request with Prompt Caching enabled:
- The system checks if the prompt prefix is already cached from a recent query.
- If found, it uses the cached version, reducing processing time and costs.
- Otherwise, it processes the full prompt and caches the prefix for future use.
Note:

A. Prompt caching works with the Claude 3.5 Sonnet, Claude 3 Haiku, and Claude 3 Opus models.

B. The minimum cacheable prompt length is:

1. 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
2. 2048 tokens for Claude 3 Haiku

C. Shorter prompts cannot be cached, even if marked with `cache_control`.
Setup API Keys¶
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-..."  # replace with your Anthropic API key
Setup LLM¶
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-3-5-sonnet-20240620")
Download Data¶
In this demonstration, we will use the text from the Paul Graham Essay. We will cache the text and run some queries based on it.
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham_essay.txt'
--2024-09-28 01:22:14--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘./paul_graham_essay.txt’

./paul_graham_essay 100%[===================>]  73.28K  --.-KB/s    in 0.01s

2024-09-28 01:22:14 (5.73 MB/s) - ‘./paul_graham_essay.txt’ saved [75042/75042]
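If wget is not available in your environment, a minimal sketch using only the Python standard library downloads the same file:

import urllib.request

# Download the essay to the current working directory (same file as the wget command above)
url = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/"
    "docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)
urllib.request.urlretrieve(url, "./paul_graham_essay.txt")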
Load Data¶
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["./paul_graham_essay.txt"],
).load_data()
document_text = documents[0].text
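As noted above, prompts shorter than 1024 tokens (for Claude 3.5 Sonnet) cannot be cached, so it is worth confirming that the essay is long enough. The sketch below uses a rough heuristic of about four characters per token; the exact count depends on Anthropic's tokenizer:

# Rough token estimate (~4 characters per token); the exact number depends on
# Anthropic's tokenizer, but this confirms we are well above the 1024-token minimum.
estimated_tokens = len(document_text) // 4
print(f"Estimated tokens in the essay: {estimated_tokens}")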
Prompt Caching¶
Enabling Prompt Cache:

- Include `"cache_control": {"type": "ephemeral"}` for the text prompt you want to cache.
- Add `extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}` in the request.
We can verify whether the text was cached by checking the following parameters:

- `cache_creation_input_tokens`: Number of tokens written to the cache when creating a new entry.
- `cache_read_input_tokens`: Number of tokens retrieved from the cache for this request.
- `input_tokens`: Number of input tokens that were neither read from the cache nor used to create a cache entry.
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(role="system", content="You are helpful AI Assitant."),
ChatMessage(
role="user",
content=[
{
"text": f"{document_text}",
"type": "text",
"cache_control": {"type": "ephemeral"},
},
{"text": "Why did Paul Graham start YC?", "type": "text"},
],
),
]
resp = llm.chat(
messages, extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)
Let's examine the raw response.
resp.raw
{'id': 'msg_01KCcFZnbAGjxSKJm7LnXajp', 'content': [TextBlock(text="Based on the essay, it seems Paul Graham started Y Combinator for a few key reasons:\n\n1. He had been thinking about ways to improve venture capital and startup funding, like making smaller investments in younger, more technical founders.\n\n2. He wanted to try angel investing but hadn't gotten around to it yet, despite intending to for years after Yahoo acquired his company Viaweb.\n\n3. He missed working with his former Viaweb co-founders Robert Morris and Trevor Blackwell and wanted to find a project they could collaborate on.\n\n4. His girlfriend (later wife) Jessica Livingston was looking for a new job after interviewing at a VC firm, and Graham had been telling her ideas for how to improve VC.\n\n5. When giving a talk to Harvard students about startups, he realized there was demand for seed funding and advice from experienced founders.\n\n6. They wanted to create an investment firm that would actually implement Graham's ideas about how to better fund and support early-stage startups.\n\n7. They were somewhat naïve about how to be angel investors, which allowed them to take novel approaches like the batch model of funding multiple startups at once.\n\nSo it was a convergence of Graham's ideas about improving startup funding, his desire to angel invest and work with his former co-founders again, and the opportunity presented by Jessica looking for a new job. Their lack of experience in traditional VC allowed them to take an innovative approach.", type='text')], 'model': 'claude-3-5-sonnet-20240620', 'role': 'assistant', 'stop_reason': 'end_turn', 'stop_sequence': None, 'type': 'message', 'usage': Usage(input_tokens=12, output_tokens=313, cache_creation_input_tokens=17470, cache_read_input_tokens=0)}
As you can see, 17470 tokens have been cached, as indicated by `cache_creation_input_tokens`.
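The same counters can also be read programmatically instead of inspecting the raw dictionary; a small sketch, assuming the raw response exposes the usage object shown above:

# Cache-related usage counters from the raw response
usage = resp.raw["usage"]
print("cache_creation_input_tokens:", usage.cache_creation_input_tokens)
print("cache_read_input_tokens:", usage.cache_read_input_tokens)
print("input_tokens:", usage.input_tokens)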
Now, let’s run another query on the same document. It should retrieve the document text from the cache, which will be reflected in `cache_read_input_tokens`.
messages = [
ChatMessage(role="system", content="You are helpful AI Assitant."),
ChatMessage(
role="user",
content=[
{
"text": f"{document_text}",
"type": "text",
"cache_control": {"type": "ephemeral"},
},
{"text": "What did Paul Graham do growing up?", "type": "text"},
],
),
]
resp = llm.chat(
messages, extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)
resp.raw
{'id': 'msg_01CpwhtuvJ8UR64xSbpxoutZ', 'content': [TextBlock(text='Based on the essay, here are some key things Paul Graham did growing up:\n\n1. As a teenager, he focused mainly on writing and programming outside of school. He tried writing short stories but says they were "awful".\n\n2. In 9th grade (age 13-14), he started programming on an IBM 1401 computer at his school district\'s data processing center. He used an early version of Fortran.\n\n3. He convinced his father to buy a TRS-80 microcomputer around 1980 when he was in high school. He wrote simple games, a program to predict model rocket flight, and a word processor his father used.\n\n4. He planned to study philosophy in college, thinking it was more powerful than other fields. \n\n5. In college, he got interested in artificial intelligence after reading a novel featuring an intelligent computer and seeing a documentary about an AI program called SHRDLU.\n\n6. He taught himself Lisp programming language in college since there were no AI classes offered.\n\n7. For his undergraduate thesis, he reverse-engineered the SHRDLU AI program.\n\n8. He graduated college with a degree in "Artificial Intelligence" (in quotes on the diploma).\n\n9. He applied to grad schools for AI and ended up going to Harvard for graduate studies.\n\nSo in summary, his main interests and activities growing up centered around writing, programming, and eventually artificial intelligence as he entered college and graduate school.', type='text')], 'model': 'claude-3-5-sonnet-20240620', 'role': 'assistant', 'stop_reason': 'end_turn', 'stop_sequence': None, 'type': 'message', 'usage': Usage(input_tokens=12, output_tokens=313, cache_creation_input_tokens=0, cache_read_input_tokens=17470)}
As you can see, the response was generated using the cached text, as indicated by `cache_read_input_tokens`.