Hugging Face LLMs¶
There are many ways to interface with LLMs from Hugging Face. Hugging Face itself provides several Python packages to enable access, which LlamaIndex wraps into `LLM` entities:
- The `transformers` package: use `llama_index.llms.HuggingFaceLLM`
- The Hugging Face Inference API, wrapped by `huggingface_hub[inference]`: use `llama_index.llms.HuggingFaceInferenceAPI`
There are many possible permutations of these two, so this notebook only details a few. Let's use Hugging Face's Text Generation task as our example.
In the below lines, we install the packages necessary for this demo:
- `transformers[torch]` is needed for `HuggingFaceLLM`
- `huggingface_hub[inference]` is needed for `HuggingFaceInferenceAPI`
- The quotes are needed for Z shell (`zsh`)
%pip install llama-index-llms-huggingface
%pip install llama-index-llms-huggingface-api
!pip install "transformers[torch]" "huggingface_hub[inference]"
Now that we're set up, let's play around:
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
import os
from typing import List, Optional
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
# SEE: https://huggingface.co/docs/hub/security-tokens
# We just need a token with read permissions for this demo
HF_TOKEN: Optional[str] = os.getenv("HUGGING_FACE_TOKEN")
# NOTE: None default will fall back on Hugging Face's token storage
# when this token gets used within HuggingFaceInferenceAPI
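If you haven't exported a token, one alternative is to store one in Hugging Face's local token storage via `huggingface_hub`; a minimal sketch (the token string below is a placeholder for your own read-only token):
# Optional: save a token to Hugging Face's local token storage, which
# HuggingFaceInferenceAPI falls back on when no token is passed explicitly
from huggingface_hub import login

login(token="<your-read-only-token>")  # placeholder, not a real token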
# This uses https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha
# downloaded (if first invocation) to the local Hugging Face model cache,
# and actually runs the model on your local machine's hardware
locally_run = HuggingFaceLLM(model_name="HuggingFaceH4/zephyr-7b-alpha")
# This will use the same model, but run remotely on Hugging Face's servers,
# accessed via the Hugging Face Inference API
# Note that using your token will not charge you money;
# the Inference API is free, it just has rate limits
remotely_run = HuggingFaceInferenceAPI(
model_name="HuggingFaceH4/zephyr-7b-alpha", token=HF_TOKEN
)
# Or you can skip providing a token, using Hugging Face Inference API anonymously
remotely_run_anon = HuggingFaceInferenceAPI(
model_name="HuggingFaceH4/zephyr-7b-alpha"
)
# If you don't provide a model_name to the HuggingFaceInferenceAPI,
# Hugging Face's recommended model gets used (thanks to huggingface_hub)
remotely_run_recommended = HuggingFaceInferenceAPI(token=HF_TOKEN)
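Any of these can now be called. For example, a completion on the locally running model (a minimal sketch; note that the first call downloads the model weights and runs inference on your own hardware, which can be slow without a GPU):
# Runs entirely on local hardware, so no token is needed
local_completion = locally_run.complete("To infinity, and")
print(local_completion)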
Underlying a completion with `HuggingFaceInferenceAPI` is Hugging Face's Text Generation task.
completion_response = remotely_run_recommended.complete("To infinity, and")
print(completion_response)
beyond! The Infinity Wall Clock is a unique and stylish way to keep track of time. The clock is made of a durable, high-quality plastic and features a bright LED display. The Infinity Wall Clock is powered by batteries and can be mounted on any wall. It is a great addition to any home or office.
If you are modifying the LLM, you should also change the global tokenizer to match!
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha").encode
)
If you're curious, other Hugging Face Inference API tasks wrapped are:
- `llama_index.llms.HuggingFaceInferenceAPI.chat`: Conversational task (sketched below)
- `llama_index.embeddings.HuggingFaceInferenceAPIEmbedding`: Feature Extraction task
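As a minimal sketch of the Conversational task, reusing the `remotely_run` instance from above (the prompt is only illustrative):
from llama_index.core.base.llms.types import ChatMessage, MessageRole

# Send a single-turn conversation to the Inference API
chat_response = remotely_run.chat(
    [ChatMessage(role=MessageRole.USER, content="Who is Buzz Lightyear?")]
)
print(chat_response.message.content)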
And yes, Hugging Face embedding models are supported with:
- `transformers[torch]`: wrapped by `HuggingFaceEmbedding`
- `huggingface_hub[inference]`: wrapped by `HuggingFaceInferenceAPIEmbedding`
Both of the above subclass `llama_index.embeddings.base.BaseEmbedding`.
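For instance, a minimal local-embedding sketch, assuming the `llama-index-embeddings-huggingface` package is installed and using `BAAI/bge-small-en-v1.5` purely as an illustrative model name:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Runs the embedding model locally via transformers[torch]
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("To infinity, and beyond!")
print(len(vector))  # dimensionality of the embedding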
Using Hugging Face text-generation-inference¶
The new `TextGenerationInference` class allows you to interface with endpoints running text-generation-inference (TGI). In addition to blazingly fast inference, it supports tool usage starting from version 2.0.1.
%pip install llama-index-llms-text-generation-inference
To initialize an instance of `TextGenerationInference`, you need to provide the endpoint URL (a self-hosted instance of TGI, or a public Inference Endpoint on Hugging Face created with TGI). In the case of a private Inference Endpoint, you must provide your HF token (either as an initialization argument or an environment variable).
import os
from typing import List, Optional
from llama_index.llms.text_generation_inference import (
TextGenerationInference,
)
URL = "your_tgi_endpoint"
model = TextGenerationInference(
model_url=URL, token=False
) # set token to False in case of public endpoint
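For a private Inference Endpoint, the token can be passed at initialization instead; a minimal sketch (the endpoint URL below is a placeholder):
private_model = TextGenerationInference(
    model_url="your_private_tgi_endpoint",  # placeholder URL
    token=os.getenv("HUGGING_FACE_TOKEN"),  # or pass the token string directly
)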
completion_response = model.complete("To infinity, and")
print(completion_response)
beyond! This phrase is a reference to the famous line from the movie "Toy Story" when Buzz Lightyear, a toy astronaut, exclaims "To infinity and beyond!" as he soars through space. It has since become a catchphrase for reaching for the stars and striving for greatness. However, if you meant to ask a mathematical question, "To infinity" refers to a very large, infinite number, and "and beyond" could be interpreted as continuing infinitely in a certain direction. For example, "2 to the power of infinity" would represent a very large, infinite number.
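Streaming works the same way as with other LlamaIndex LLMs; a minimal sketch, assuming your TGI endpoint supports streamed generation:
# Stream the completion token by token
for chunk in model.stream_complete("To infinity, and"):
    print(chunk.delta, end="")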
To use tools with `TextGenerationInference`, you may use an already existing tool or define your own:
from typing import List, Literal
from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.core.tools import FunctionTool
from llama_index.core.base.llms.types import (
ChatMessage,
MessageRole,
)
def get_current_weather(location: str, format: str):
"""Get the current weather
Args:
location (str): The city and state, e.g. San Francisco, CA
format (str): The temperature unit to use ('celsius' or 'fahrenheit'). Infer this from the users location.
"""
...
class WeatherArgs(BaseModel):
location: str = Field(
description="The city and region, e.g. Paris, Ile-de-France"
)
format: Literal["fahrenheit", "celsius"] = Field(
description="The temperature unit to use ('fahrenheit' or 'celsius'). Infer this from the location.",
)
weather_tool = FunctionTool.from_defaults(
fn=get_current_weather,
name="get_current_weather",
description="Get the current weather",
fn_schema=WeatherArgs,
)
def get_current_weather_n_days(location: str, format: str, num_days: int):
"""Get the weather forecast for the next N days
Args:
location (str): The city and state, e.g. San Francisco, CA
format (str): The temperature unit to use ('celsius' or 'fahrenheit'). Infer this from the users location.
num_days (int): The number of days for the weather forecast.
"""
...
class ForecastArgs(BaseModel):
location: str = Field(
description="The city and region, e.g. Paris, Ile-de-France"
)
format: Literal["fahrenheit", "celsius"] = Field(
description="The temperature unit to use ('fahrenheit' or 'celsius'). Infer this from the location.",
)
num_days: int = Field(
description="The duration for the weather forecast in days.",
)
forecast_tool = FunctionTool.from_defaults(
fn=get_current_weather_n_days,
name="get_current_weather_n_days",
description="Get the current weather for n days",
fn_schema=ForecastArgs,
)
usr_msg = ChatMessage(
role=MessageRole.USER,
content="What's the weather like in Paris over next week?",
)
response = model.chat_with_tools(
user_msg=usr_msg,
tools=[
weather_tool,
forecast_tool,
],
tool_choice="get_current_weather_n_days",
)
print(response.message.additional_kwargs)
{'tool_calls': [{'id': 0, 'type': 'function', 'function': {'description': None, 'name': 'get_current_weather_n_days', 'arguments': {'format': 'celsius', 'location': 'Paris, Ile-de-France', 'num_days': 7}}}]}
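The model only selects a tool and returns structured arguments; dispatching to the matching function is left to you. A minimal sketch, assuming the `tool_calls` layout shown above:
# Look up the selected tool by name and invoke it with the model's arguments
tool_call = response.message.additional_kwargs["tool_calls"][0]
tool_name = tool_call["function"]["name"]
tool_args = tool_call["function"]["arguments"]

tools_by_name = {t.metadata.name: t for t in [weather_tool, forecast_tool]}
tool_output = tools_by_name[tool_name](**tool_args)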