Local Embeddings with HuggingFace¶
LlamaIndex has support for HuggingFace embedding models, including Sentence Transformer models such as BGE, Mixedbread, Nomic, Jina, and E5. We can use these models to create embeddings for our documents and queries for retrieval.
Furthermore, we provide utilities to create and use ONNX and OpenVINO models using the Optimum library from HuggingFace.
HuggingFaceEmbedding¶
The base HuggingFaceEmbedding class is a generic wrapper around any HuggingFace model for embeddings. All embedding models on Hugging Face should work. You can refer to the embeddings leaderboard for more recommendations.
This class depends on the sentence-transformers package, which you can install with pip install sentence-transformers.
NOTE: if you were previously using a HuggingFaceEmbeddings from LangChain, this should give equivalent results.
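For reference, here is a minimal side-by-side sketch. The LangChain import path below assumes the langchain-huggingface package (older versions import from langchain_community.embeddings instead); it is shown only for comparison.

# Previously, with LangChain (assumes the langchain-huggingface package)
from langchain_huggingface import HuggingFaceEmbeddings

lc_embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
lc_embedding = lc_embed_model.embed_documents(["Hello World!"])[0]

# Now, with LlamaIndex -- the underlying sentence-transformers model is the
# same, so the resulting vectors should match
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

li_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
li_embedding = li_embed_model.get_text_embedding("Hello World!")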
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-embeddings-huggingface
!pip install llama-index
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# loads https://huggingface.co/BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
384
[-0.003275700844824314, -0.011690810322761536, 0.041559211909770966, -0.03814814239740372, 0.024183044210076332]
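The same model can also embed queries via get_query_embedding, and you can compare query and text vectors directly. A minimal sketch, with the cosine similarity written out by hand rather than using any particular helper:

import math

query_embedding = embed_model.get_query_embedding("What color is the sky?")
text_embedding = embed_model.get_text_embedding("The sky is blue.")

# Cosine similarity between the query and text vectors
dot = sum(q * t for q, t in zip(query_embedding, text_embedding))
query_norm = math.sqrt(sum(q * q for q in query_embedding))
text_norm = math.sqrt(sum(t * t for t in text_embedding))
print(dot / (query_norm * text_norm))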
Benchmarking¶
Let's compare the backends using a classic large document: chapter 3 of the IPCC climate report.
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
documents = SimpleDirectoryReader(
input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
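As a quick sanity check, you can inspect what was loaded; for PDFs, SimpleDirectoryReader typically produces one Document per page:

# One Document per PDF page, each carrying file/page metadata
print(len(documents))
print(documents[0].metadata)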
Base HuggingFace Embeddings¶
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# loads BAAI/bge-small-en-v1.5 with the default torch backend
embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
device="cpu",
embed_batch_size=8,
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 428.44it/s]
Generating embeddings: 100%|██████████| 459/459 [00:19<00:00, 23.32it/s]
20.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
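The same wrapper also runs on GPUs. A hedged sketch, assuming a CUDA device is available; device and embed_batch_size are the same parameters used above, and larger batches generally trade memory for throughput:

# Same model on a GPU (assumes CUDA is available on this machine)
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cuda",
    embed_batch_size=64,  # larger batches usually help on GPU
)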
ONNX Embeddings¶
# pip install sentence-transformers[onnx]
# loads BAAI/bge-small-en-v1.5 with the onnx backend
embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
device="cpu",
backend="onnx",
model_kwargs={
"provider": "CPUExecutionProvider"
}, # For ONNX, you can specify the provider, see https://sbert.net/docs/sentence_transformer/usage/efficiency.html
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 421.63it/s]
Generating embeddings: 100%|██████████| 459/459 [00:31<00:00, 14.53it/s]
32.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
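Backend performance is hardware-dependent: on this CPU, ONNX with the CPUExecutionProvider came out slower than the default torch backend (32.1 s vs. 20.2 s), while the quantized OpenVINO model below is considerably faster than both. See the Sentence Transformers efficiency documentation linked above for provider options and optimized/quantized ONNX variants.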
OpenVINO Embeddings¶
# pip install sentence-transformers[openvino]
# loads BAAI/bge-small-en-v1.5 with the openvino backend
embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
device="cpu",
backend="openvino", # OpenVINO is very strong on CPUs
revision="refs/pr/16", # BAAI/bge-small-en-v1.5 itself doesn't have an OpenVINO model currently, but there's a PR with it that we can load: https://huggingface.co/BAAI/bge-small-en-v1.5/discussions/16
model_kwargs={
"file_name": "openvino_model_qint8_quantized.xml"
}, # If we're using an optimized/quantized model, we need to specify the file name like this
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 403.15it/s]
Generating embeddings: 100%|██████████| 459/459 [00:08<00:00, 53.83it/s]
9.03 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
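On this machine, OpenVINO gives the best throughput of the three backends (9.03 s vs. 20.2 s for torch and 32.1 s for ONNX). Whichever backend you choose, the resulting index is used the same way. A minimal retrieval sketch follows; note that variables assigned inside a %%timeit cell don't persist, so the index is rebuilt first, and the query string is made up for illustration:

# Rebuild the index once, outside %%timeit, so the variable persists
index = VectorStoreIndex.from_documents(documents)

# Embedding-based retrieval over the report (no LLM required)
retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("How do marine heatwaves affect ecosystems?")
for node in nodes:
    print(node.score, node.node.get_content()[:100])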
References¶
- Local Embedding Models explains more about using local models like these.
- Sentence Transformers > Speeding up Inference contains extensive documentation on how to use the backend options effectively, including optimization and quantization for ONNX and OpenVINO.