Hugging Face LLMs
There are many ways to interface with LLMs from Hugging Face. Hugging Face itself provides several Python packages to enable access, which LlamaIndex wraps into LLM entities:

- The `transformers` package: use `llama_index.llms.HuggingFaceLLM`
- The Hugging Face Inference API, wrapped by `huggingface_hub[inference]`: use `llama_index.llms.HuggingFaceInferenceAPI`
There are many possible permutations of these two, so this notebook only details a few. Let's use Hugging Face's Text Generation task as our example.
In the lines below, we install the packages necessary for this demo:

- `transformers[torch]` is needed for `HuggingFaceLLM`
- `huggingface_hub[inference]` is needed for `HuggingFaceInferenceAPI`
- The quotes are needed for Z shell (`zsh`)
%pip install llama-index-llms-huggingface
!pip install "transformers[torch]" "huggingface_hub[inference]"
Now that we're set up, let's play around:
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
import os
from typing import List, Optional
from llama_index.llms.huggingface import (
    HuggingFaceInferenceAPI,
    HuggingFaceLLM,
)
# SEE: https://huggingface.co/docs/hub/security-tokens
# We just need a token with read permissions for this demo
HF_TOKEN: Optional[str] = os.getenv("HUGGING_FACE_TOKEN")
# NOTE: None default will fall back on Hugging Face's token storage
# when this token gets used within HuggingFaceInferenceAPI
# This uses https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha
# downloaded (if first invocation) to the local Hugging Face model cache,
# and actually runs the model on your local machine's hardware
locally_run = HuggingFaceLLM(model_name="HuggingFaceH4/zephyr-7b-alpha")
# This will use the same model, but run remotely on Hugging Face's servers,
# accessed via the Hugging Face Inference API
# Note that using your token will not charge you money;
# the Inference API is free, it just has rate limits
remotely_run = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-alpha", token=HF_TOKEN
)
# Or you can skip providing a token, using Hugging Face Inference API anonymously
remotely_run_anon = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-alpha"
)
# If you don't provide a model_name to the HuggingFaceInferenceAPI,
# Hugging Face's recommended model gets used (thanks to huggingface_hub)
remotely_run_recommended = HuggingFaceInferenceAPI(token=HF_TOKEN)
Underlying a completion with `HuggingFaceInferenceAPI` is Hugging Face's Text Generation task.
completion_response = remotely_run_recommended.complete("To infinity, and")
print(completion_response)
beyond! The Infinity Wall Clock is a unique and stylish way to keep track of time. The clock is made of a durable, high-quality plastic and features a bright LED display. The Infinity Wall Clock is powered by batteries and can be mounted on any wall. It is a great addition to any home or office.
If you are modifying the LLM, you should also change the global tokenizer to match!
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha").encode
)
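Similarly, if you want one of these LLMs to be used by default everywhere in LlamaIndex (for example by any query engine you build later), a minimal sketch using the core `Settings` object is below; it assumes the `remotely_run` instance from above is the one you want as the global default:
from llama_index.core import Settings

# Assumed global configuration: make the remote Zephyr LLM the default
# for any LlamaIndex component that isn't given an explicit llm
Settings.llm = remotely_run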
If you're curious, other Hugging Face Inference API tasks wrapped are:

- `llama_index.llms.HuggingFaceInferenceAPI.chat`: Conversational task (a short sketch follows below)
- `llama_index.embeddings.HuggingFaceInferenceAPIEmbedding`: Feature Extraction task
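As a hedged sketch of the Conversational task, you could reuse the `remotely_run` instance from earlier with LlamaIndex's standard `ChatMessage` type (the exact reply text will vary):
from llama_index.core.llms import ChatMessage

# Send a single user message through the chat (Conversational) interface
chat_response = remotely_run.chat(
    messages=[ChatMessage(role="user", content="Who is Buzz Lightyear?")]
)
print(chat_response.message.content)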
And yes, Hugging Face embedding models are supported with:

- `transformers[torch]`: wrapped by `HuggingFaceEmbedding`
- `huggingface_hub[inference]`: wrapped by `HuggingFaceInferenceAPIEmbedding`

Both of the above two subclass `llama_index.embeddings.base.BaseEmbedding`.
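For instance, a minimal local-embedding sketch (assuming the `llama-index-embeddings-huggingface` package is installed; `BAAI/bge-small-en-v1.5` is just an illustrative model choice) could look like:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Runs the embedding model locally via transformers[torch]
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embedding = embed_model.get_text_embedding("To infinity, and beyond!")
print(len(embedding))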