Local Embeddings with HuggingFace#
LlamaIndex has support for HuggingFace embedding models, including BGE, Instructor, and more.
Furthermore, we provide utilities to create and use ONNX models using the Optimum library from HuggingFace.
HuggingFaceEmbedding#
The base HuggingFaceEmbedding class is a generic wrapper around any HuggingFace model for embeddings. You can set either pooling="cls" or pooling="mean"; in most cases, you’ll want cls pooling, but the model card for your particular model may have other recommendations.
You can refer to the embeddings leaderboard for more recommendations on embedding models.
This class depends on the transformers package, which you can install with pip install transformers.
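For example, if a model’s card recommends mean pooling, you can pass that in when constructing the wrapper. This is a minimal sketch; the model name below is only an illustrative choice of a mean-pooled model, not a recommendation.
from llama_index.embeddings import HuggingFaceEmbedding
# hypothetical example: a sentence-transformers model whose card uses mean pooling
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    pooling="mean",
)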
NOTE: if you were previously using HuggingFaceEmbeddings from LangChain, this should give equivalent results.
If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
from llama_index.embeddings import HuggingFaceEmbedding
# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()
# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
Hello World!
384
[-0.030880315229296684, -0.11021008342504501, 0.3917851448059082, -0.35962796211242676, 0.22797748446464539]
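Query-side embeddings work the same way through get_query_embedding; a quick sketch, where the query text is just an example.
query_embedding = embed_model.get_query_embedding("Where was the phrase Hello World first used?")
print(len(query_embedding))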
InstructorEmbedding#
Instructor Embeddings are a class of embeddings specifically trained to augment their embeddings according to an instruction. By default, queries are given query_instruction="Represent the question for retrieving supporting documents: " and text is given text_instruction="Represent the document for retrieval: ".
They rely on the InstructorEmbedding and sentence-transformers pip packages, which you can install with pip install InstructorEmbedding and pip install -U sentence-transformers.
from llama_index.embeddings import InstructorEmbedding
embed_model = InstructorEmbedding(model_name="hkunlp/instructor-base")
load INSTRUCTOR_Transformer
max_seq_length 512
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
768
[ 0.02155361 -0.06098218 0.01796207 0.05490903 0.01526906]
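If your retrieval task calls for different instructions, you can override the defaults when constructing the model. The sketch below uses hypothetical domain-specific instruction strings; they are examples, not tuned recommendations.
embed_model = InstructorEmbedding(
    model_name="hkunlp/instructor-base",
    query_instruction="Represent the scientific question for retrieving supporting passages: ",
    text_instruction="Represent the scientific passage for retrieval: ",
)
embeddings = embed_model.get_text_embedding("Hello World!")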
OptimumEmbedding#
Optimum is a HuggingFace library for exporting and running HuggingFace models in the ONNX format.
You can install the dependencies with pip install transformers optimum[exporters].
First, we need to create the ONNX model. ONNX models provide improved inference speeds and can be used across platforms (e.g. in TransformersJS).
from llama_index.embeddings import OptimumEmbedding
OptimumEmbedding.create_and_save_optimum_model(
"BAAI/bge-small-en-v1.5", "./bge_onnx"
)
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
- use_cache -> False
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
Saved optimum model to ./bge_onnx. Use it with `embed_model = OptimumEmbedding(folder_name='./bge_onnx')`.
embed_model = OptimumEmbedding(folder_name="./bge_onnx")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
384
[-0.10364960134029388, -0.20998482406139374, -0.01883639395236969, -0.5241696834564209, 0.0335749015212059]
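Note that these values differ from the base HuggingFaceEmbedding output for the same text, which can happen when the two wrappers apply different pooling or normalization settings. If you want to spot-check how close the two backends are, one approach (a sketch, reusing the models loaded above) is to compare their embeddings with cosine similarity.
import numpy as np
from llama_index.embeddings import HuggingFaceEmbedding

pt_emb = np.array(
    HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5").get_text_embedding("Hello World!")
)
onnx_emb = np.array(embed_model.get_text_embedding("Hello World!"))

# cosine similarity between the PyTorch and ONNX embeddings for the same input
print(np.dot(pt_emb, onnx_emb) / (np.linalg.norm(pt_emb) * np.linalg.norm(onnx_emb)))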
Benchmarking#
Let’s compare the two embedding backends on a classic large document: the IPCC climate report, chapter 3.
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 20.7M 100 20.7M 0 0 16.5M 0 0:00:01 0:00:01 --:--:-- 16.5M
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
documents = SimpleDirectoryReader(
input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
Base HuggingFace Embeddings#
import os
import openai
# needed to synthesize responses later
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
from llama_index.embeddings import HuggingFaceEmbedding
# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
test_embeds = embed_model.get_text_embedding("Hello World!")
service_context = ServiceContext.from_defaults(embed_model=embed_model)
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(
documents, service_context=service_context, show_progress=True
)
1min 27s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Optimum Embeddings#
We can use the ONNX embeddings we created earlier.
from llama_index.embeddings import OptimumEmbedding
embed_model = OptimumEmbedding(folder_name="./bge_onnx")
test_embeds = embed_model.get_text_embedding("Hello World!")
service_context = ServiceContext.from_defaults(embed_model=embed_model)
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(
documents, service_context=service_context, show_progress=True
)
1min 9s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
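Finally, the OpenAI key set earlier can be used to synthesize a response over the indexed report. A minimal sketch: the index is rebuilt outside of %%timeit (which does not keep the variables it defines), and the question text is just an example.
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What are the projected risks to coral reefs under climate change?")
print(response)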