Local Embeddings with HuggingFace

LlamaIndex has support for HuggingFace embedding models, including BGE, Instructor, and more.

Furthermore, we provide utilities to create and use ONNX models using the Optimum library from HuggingFace.

HuggingFaceEmbedding

The base HuggingFaceEmbedding class is a generic wrapper around any HuggingFace model for embeddings. You can set either pooling="cls" or pooling="mean"; in most cases, cls pooling is what you want, but the model card for your particular model may recommend otherwise.

You can refer to the embeddings leaderboard for more recommendations on embedding models.

This class depends on the transformers package, which you can install with pip install transformers.

NOTE: if you were previously using the HuggingFaceEmbeddings class from LangChain, this should give equivalent results.

from llama_index.embeddings import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
384
[-0.030880315229296684, -0.11021008342504501, 0.3917851448059082, -0.35962796211242676, 0.22797748446464539]

InstructorEmbedding

Instructor Embeddings are a class of embeddings specifically trained to augment their embeddings according to an instruction. By default, queries are given query_instruction="Represent the question for retrieving supporting documents: " and text is given text_instruction="Represent the document for retrieval: ".

They rely on the Instructor pip package, which you can install with pip install InstructorEmbedding.

from llama_index.embeddings import InstructorEmbedding

embed_model = InstructorEmbedding(model_name="hkunlp/instructor-base")
load INSTRUCTOR_Transformer
max_seq_length  512
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
768
[ 0.02155361 -0.06098218  0.01796207  0.05490903  0.01526906]

OptimumEmbedding

Optimum is a HuggingFace library for exporting and running HuggingFace models in the ONNX format.

You can install the dependencies with pip install transformers optimum[exporters].

First, we need to create the ONNX model. ONNX models provide improved inference speeds and can be used across platforms (e.g., in Transformers.js).

from llama_index.embeddings import OptimumEmbedding

OptimumEmbedding.create_and_save_optimum_model("BAAI/bge-small-en-v1.5", "./bge_onnx")
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> False
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Saved optimum model to ./bge_onnx. Use it with `embed_model = OptimumEmbedding(folder_name='./bge_onnx')`.
embed_model = OptimumEmbedding(folder_name="./bge_onnx")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
384
[-0.10364960134029388, -0.20998482406139374, -0.01883639395236969, -0.5241696834564209, 0.0335749015212059]

Benchmarking

Let’s compare the two approaches on a classic large document: chapter 3 of the IPCC climate report.

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0  16.5M      0  0:00:01  0:00:01 --:--:-- 16.5M
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

Base HuggingFace Embeddings

import os
import openai

# needed to synthesize responses later
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
from llama_index.embeddings import HuggingFaceEmbedding

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
test_embeds = embed_model.get_text_embedding("Hello World!")

service_context = ServiceContext.from_defaults(embed_model=embed_model)
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)
1min 27s Β± 0 ns per loop (mean Β± std. dev. of 1 run, 1 loop each)

Optimum Embeddings

We can use the ONNX embeddings we created earlier.

from llama_index.embeddings import OptimumEmbedding

embed_model = OptimumEmbedding(folder_name="./bge_onnx")
test_embeds = embed_model.get_text_embedding("Hello World!")

service_context = ServiceContext.from_defaults(embed_model=embed_model)
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)
1min 9s Β± 0 ns per loop (mean Β± std. dev. of 1 run, 1 loop each)