Local Embeddings with HuggingFace¶
LlamaIndex has support for HuggingFace embedding models, including Sentence Transformer models such as BGE, Mixedbread, Nomic, Jina, and E5. We can use these models to create embeddings for our documents and queries for retrieval.
Furthermore, we provide utilities to create and use ONNX and OpenVINO models using the Optimum library from HuggingFace.
HuggingFaceEmbedding¶
The base HuggingFaceEmbedding class is a generic wrapper around any HuggingFace model for embeddings. All embedding models on Hugging Face should work. You can refer to the embeddings leaderboard for more recommendations.
This class depends on the sentence-transformers package, which you can install with pip install sentence-transformers.
NOTE: if you were previously using a HuggingFaceEmbeddings from LangChain, this should give equivalent results.
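For reference, here is a minimal side-by-side sketch. The LangChain import path below assumes the langchain-huggingface package (older versions import from langchain_community.embeddings instead); it is shown only for comparison.

# Previously, with LangChain (assumes the langchain-huggingface package)
from langchain_huggingface import HuggingFaceEmbeddings

lc_embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
lc_embedding = lc_embed_model.embed_documents(["Hello World!"])[0]

# Now, with LlamaIndex -- the underlying sentence-transformers model is the
# same, so the resulting vectors should match
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

li_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
li_embedding = li_embed_model.get_text_embedding("Hello World!")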
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-embeddings-huggingface
!pip install llama-index
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# loads https://huggingface.co/BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
384
[-0.003275700844824314, -0.011690810322761536, 0.041559211909770966, -0.03814814239740372, 0.024183044210076332]
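The same model can also embed queries via get_query_embedding, and you can compare query and text vectors directly. A minimal sketch, with the cosine similarity written out by hand rather than using any particular helper:

import math

query_embedding = embed_model.get_query_embedding("What color is the sky?")
text_embedding = embed_model.get_text_embedding("The sky is blue.")

# Cosine similarity between the query and text vectors
dot = sum(q * t for q, t in zip(query_embedding, text_embedding))
query_norm = math.sqrt(sum(q * q for q in query_embedding))
text_norm = math.sqrt(sum(t * t for t in text_embedding))
print(dot / (query_norm * text_norm))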
Benchmarking¶
Let's compare the backends using a classic large document: chapter 3 of the IPCC climate report.
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
documents = SimpleDirectoryReader(
input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
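As a quick sanity check, you can inspect what was loaded; for PDFs, SimpleDirectoryReader typically produces one Document per page:

# One Document per PDF page, each carrying file/page metadata
print(len(documents))
print(documents[0].metadata)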
Base HuggingFace Embeddings¶
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# loads BAAI/bge-small-en-v1.5 with the default torch backend
embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
device="cpu",
embed_batch_size=8,
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 428.44it/s]
Generating embeddings: 100%|██████████| 459/459 [00:19<00:00, 23.32it/s]
20.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
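The same wrapper also runs on GPUs. A hedged sketch, assuming a CUDA device is available; device and embed_batch_size are the same parameters used above, and larger batches generally trade memory for throughput:

# Same model on a GPU (assumes CUDA is available on this machine)
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cuda",
    embed_batch_size=64,  # larger batches usually help on GPU
)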
ONNX Embeddings¶
# pip install sentence-transformers[onnx]
# loads BAAI/bge-small-en-v1.5 with the onnx backend
embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
device="cpu",
backend="onnx",
model_kwargs={
"provider": "CPUExecutionProvider"
}, # For ONNX, you can specify the provider, see https://sbert.net/docs/sentence_transformer/usage/efficiency.html
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 421.63it/s]
Generating embeddings: 100%|██████████| 459/459 [00:31<00:00, 14.53it/s]
32.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
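Backend performance is hardware-dependent: on this CPU, ONNX with the CPUExecutionProvider came out slower than the default torch backend (32.1 s vs. 20.2 s), while the quantized OpenVINO model below is considerably faster than both. See the Sentence Transformers efficiency documentation linked above for provider options and optimized/quantized ONNX variants.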
OpenVINO Embeddings¶
# pip install sentence-transformers[openvino]
# loads BAAI/bge-small-en-v1.5 with the openvino backend
embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
device="cpu",
backend="openvino", # OpenVINO is very strong on CPUs
revision="refs/pr/16", # BAAI/bge-small-en-v1.5 itself doesn't have an OpenVINO model currently, but there's a PR with it that we can load: https://huggingface.co/BAAI/bge-small-en-v1.5/discussions/16
model_kwargs={
"file_name": "openvino_model_qint8_quantized.xml"
}, # If we're using an optimized/quantized model, we need to specify the file name like this
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 403.15it/s]
Generating embeddings: 100%|██████████| 459/459 [00:08<00:00, 53.83it/s]
9.03 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
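On this machine, OpenVINO gives the best throughput of the three backends (9.03 s vs. 20.2 s for torch and 32.1 s for ONNX). Whichever backend you choose, the resulting index is used the same way. A minimal retrieval sketch follows; note that variables assigned inside a %%timeit cell don't persist, so the index is rebuilt first, and the query string is made up for illustration:

# Rebuild the index once, outside %%timeit, so the variable persists
index = VectorStoreIndex.from_documents(documents)

# Embedding-based retrieval over the report (no LLM required)
retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("How do marine heatwaves affect ecosystems?")
for node in nodes:
    print(node.score, node.node.get_content()[:100])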
References¶
- Local Embedding Models explains more about using local models like these.
- Sentence Transformers > Speeding up Inference contains extensive documentation on how to use the backend options effectively, including optimization and quantization for ONNX and OpenVINO.