Using LLMs#
Concept#
Choosing the right Large Language Model (LLM) is one of the first decisions to make when building any LLM application over your data.
LLMs are a core component of LlamaIndex. They can be used as standalone modules or plugged into other core LlamaIndex modules (indices, retrievers, query engines). They are always used during the response synthesis step (e.g. after retrieval). Depending on the type of index being used, LLMs may also be used during index construction, insertion, and query traversal.
LlamaIndex provides a unified interface for defining LLM modules, whether it's from OpenAI, Hugging Face, or LangChain, so that you don't have to write the boilerplate code of defining the LLM interface yourself. This interface consists of the following (more details below):
- Support for text completion and chat endpoints (details below)
- Support for streaming and non-streaming endpoints
- Support for synchronous and asynchronous endpoints
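To make the three axes of this interface concrete, here is a minimal sketch using a toy stand-in model. `EchoLLM` and `Message` are hypothetical names invented for illustration and are not LlamaIndex classes. The method names (`complete`, `chat`, `stream_complete`, `acomplete`) mirror those on LlamaIndex LLMs, but the return types are simplified to plain strings here, whereas the real methods return response objects.

```python
import asyncio
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Message:
    """Hypothetical stand-in for a chat message (role plus content)."""

    role: str
    content: str


class EchoLLM:
    """Toy model that echoes its input; NOT a LlamaIndex class."""

    def complete(self, prompt: str) -> str:
        # Text completion, non-streaming: the whole response at once.
        return f"echo: {prompt}"

    def chat(self, messages: List[Message]) -> Message:
        # Chat endpoint: takes a message history, returns one assistant message.
        return Message(role="assistant", content=f"echo: {messages[-1].content}")

    def stream_complete(self, prompt: str) -> Iterator[str]:
        # Streaming: yield the response piece by piece (here, word by word).
        for word in self.complete(prompt).split():
            yield word

    async def acomplete(self, prompt: str) -> str:
        # Asynchronous variant of complete().
        await asyncio.sleep(0)  # placeholder for a network call
        return self.complete(prompt)


llm = EchoLLM()
print(llm.complete("Paul Graham is"))
print(llm.chat([Message("user", "Hello")]).content)
print(list(llm.stream_complete("Paul Graham is")))
print(asyncio.run(llm.acomplete("Paul Graham is")))
```

Because every integration exposes the same method names, code written against one model should need little or no change when another is swapped in.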
Usage Pattern#
The following code snippet shows how you can get started using LLMs.
from llama_index.llms import OpenAI
# non-streaming
resp = OpenAI().complete("Paul Graham is ")
print(resp)
A Note on Tokenization#
By default, LlamaIndex uses a global tokenizer for all token counting. This defaults to cl100k from tiktoken, which is the tokenizer that matches the default LLM, gpt-3.5-turbo.
If you change the LLM, you may need to update this tokenizer to ensure accurate token counts, chunking, and prompting.
The only requirement for a tokenizer is that it is a callable that takes a string and returns a list.
You can set a global tokenizer like so:
from llama_index import set_global_tokenizer
# tiktoken
import tiktoken
set_global_tokenizer(tiktoken.encoding_for_model("gpt-3.5-turbo").encode)
# huggingface
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta").encode
)
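Since the only requirement is a callable from a string to a list, any function of that shape qualifies. The sketch below uses two hypothetical tokenizers (`whitespace_tokenizer` and `char_tokenizer`, invented for illustration) to show how much token counts can differ between tokenizers, which is why a mismatched tokenizer can skew chunking and prompt sizing. The `count_tokens` helper is also an assumption, not a LlamaIndex function.

```python
from typing import Callable, List

# Any callable that takes a string and returns a list fits this shape.
Tokenizer = Callable[[str], List]


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: one token per whitespace-separated word."""
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: one token per character."""
    return list(text)


def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    """Count tokens as the length of the list the tokenizer returns."""
    return len(tokenizer(text))


text = "Paul Graham is an essayist"
print(count_tokens(text, whitespace_tokenizer))
print(count_tokens(text, char_tokenizer))

# With LlamaIndex installed, either callable could be passed to
# set_global_tokenizer in the same way as the tiktoken and Hugging Face
# encoders shown earlier.
```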
LLM Compatibility Tracking#
While LLMs are powerful, not every LLM is easy to set up. Furthermore, even with proper setup, some LLMs have trouble performing tasks that require strict instruction following.
LlamaIndex offers integrations with nearly every LLM, but it is often unclear whether a given LLM will work well out of the box or whether further customization is needed.
The tables below attempt to validate the initial experience with various LlamaIndex features for various LLMs. The notebooks behind them serve as a best attempt to gauge performance, as well as how much effort and tweaking is needed to get things to function properly.
Generally, paid APIs such as OpenAI or Anthropic are viewed as more reliable. However, local open-source models have been gaining popularity due to their customizability and approach to transparency.
Contributing: Anyone is welcome to contribute new LLMs to the documentation. Simply copy an existing notebook, set up and test your LLM, and open a PR with your results.
If you have ways to improve the setup for existing notebooks, contributions to change this are welcome!
Legend
✅ = should work fine
⚠️ = sometimes unreliable, may need prompt engineering to improve
🛑 = usually unreliable, would need prompt engineering/fine-tuning to improve
Paid LLM APIs#
Model Name | Basic Query Engines | Router Query Engine | Sub Question Query Engine | Text2SQL | Pydantic Programs | Data Agents | Notes |
---|---|---|---|---|---|---|---|
gpt-3.5-turbo (openai) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
gpt-3.5-turbo-instruct (openai) | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | Tool usage in data-agents seems flakey. |
gpt-4 (openai) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
claude-2 (anthropic) | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | Prone to hallucinating tool inputs. |
claude-instant-1.2 (anthropic) | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | Prone to hallucinating tool inputs. |
Open Source LLMs#
Since open-source LLMs require large amounts of resources, the quantization level is reported for each model. Quantization is a method for reducing the size of an LLM by lowering the numerical precision of the weights and calculations within the model. Research has shown that quantization down to 4 bits can be achieved for large LLMs without impacting performance too severely.
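To illustrate the idea of trading precision for size, here is a minimal sketch of symmetric uniform quantization to 4-bit integers. It is an assumption-laden toy: the weight values are made up, the `quantize` and `dequantize` helpers are hypothetical, and the quantization schemes actually used for the models listed here are not specified in this document and are typically more elaborate.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # One shared scale factor maps the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [v * scale for v in quantized]


weights = [0.82, -0.35, 0.07, -1.4, 0.56]  # made-up values for illustration
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"max rounding error: {max_error:.3f}")
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale factor per weight.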
Model Name | Basic Query Engines | Router Query Engine | Sub Question Query Engine | Text2SQL | Pydantic Programs | Data Agents | Notes |
---|---|---|---|---|---|---|---|
llama2-chat-7b 4bit (huggingface) | ✅ | 🛑 | 🛑 | 🛑 | 🛑 | ⚠️ | Llama2 seems to be quite chatty, which makes parsing structured outputs difficult. Fine-tuning and prompt engineering likely required for better performance on structured outputs. |
llama2-13b-chat (replicate) | ✅ | ✅ | 🛑 | ✅ | 🛑 | 🛑 | Our ReAct prompt expects structured outputs, which llama-13b struggles at |
llama2-70b-chat (replicate) | ✅ | ✅ | ✅ | ✅ | 🛑 | ⚠️ | There are still some issues with parsing structured outputs, especially with pydantic programs. |
Mistral-7B-instruct-v0.1 4bit (huggingface) | ✅ | 🛑 | 🛑 | ⚠️ | ⚠️ | ⚠️ | Mistral seems slightly more reliable for structured outputs compared to Llama2. Likely with some prompt engineering, it may do better. |
zephyr-7b-alpha (huggingface) | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | Overall, |
zephyr-7b-beta (huggingface) | ✅ | ✅ | ✅ | ✅ | 🛑 | ✅ | Compared to |
stablelm-zephyr-3b (huggingface) | ✅ | ⚠️ | ✅ | 🛑 | ✅ | 🛑 | stablelm-zephyr-3b does surprisingly well, especially for structured outputs (surpassing much larger models). It struggles a bit with text-to-SQL and tool use. |
starling-lm-7b-alpha (huggingface) | ✅ | 🛑 | ✅ | ⚠️ | ✅ | ✅ | starling-lm-7b-alpha does surprisingly well on agent tasks. It struggles a bit with routing, and is inconsistent with text-to-SQL. |
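Several of the notes above attribute failures to parsing structured outputs. The sketch below shows one way a "chatty" response can break strict parsing: valid JSON wrapped in conversational text no longer parses as a whole. The sample strings and the `parse_strict` and `parse_lenient` helpers are invented for illustration; this is not how LlamaIndex parses model output, and the lenient fallback is only one possible workaround.

```python
import json
from typing import Optional


def parse_strict(output: str) -> Optional[dict]:
    """Parse the whole output as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return None


def parse_lenient(output: str) -> Optional[dict]:
    """Fallback: try only the span from the first '{' to the last '}'."""
    start, end = output.find("{"), output.rfind("}")
    if start == -1 or end <= start:
        return None
    return parse_strict(output[start : end + 1])


# Invented examples, not real model output.
clean = '{"tool": "search", "input": "Paul Graham"}'
chatty = "Sure! Here is the JSON you asked for:\n" + clean + "\nLet me know if you need more."

print(parse_strict(clean))    # parses: the output is pure JSON
print(parse_strict(chatty))   # None: surrounding chatter breaks strict parsing
print(parse_lenient(chatty))  # recovers the embedded object
```

A fallback like this helps only when the embedded JSON is itself well-formed, which is why the notes point to prompt engineering or fine-tuning for the less reliable models.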
Modules#
We support integrations with OpenAI, Hugging Face, PaLM, and more.
Available LLM integrations:
- AI21
- Anthropic
- AnyScale
- Bedrock
- Clarifai
- Cohere
- EverlyAI
- Gradient
- Hugging Face
- Konko
- LangChain
- LiteLLM
- Llama API
- Llama CPP
- LocalAI
- MistralAI
- MonsterAPI
- NeutrinoAI
- Nvidia TensorRT-LLM
- Nvidia Triton
- Ollama
- OpenAI
- OpenLLM
- OpenRouter
- PaLM
- Perplexity
- Portkey
- Predibase
- Replicate
- RunGPT
- SageMaker
- Together.ai
- Vertex
- vLLM
- Xorbits Inference