Qdrant Hybrid Search#
Qdrant supports hybrid search by combining search results from sparse and dense vectors.

Dense vectors are the ones you have probably already been using – embedding models from OpenAI, BGE, SentenceTransformers, etc. are typically dense embedding models. They create a numerical representation of a piece of text, represented as a long list of numbers. These dense vectors can capture rich semantics across the entire piece of text.

Sparse vectors are slightly different. They use a specialized approach or model (TF-IDF, BM25, SPLADE, etc.) for generating vectors. These vectors are typically mostly zeros, making them sparse. Sparse vectors are great at capturing specific keywords and similar small details.
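As a rough illustration (the numbers below are made up rather than produced by any particular model), a dense vector stores a value for every dimension, while a sparse vector stores only the indices and values of its non-zero entries:

# dense: a fixed-length list of floats, most entries non-zero
# (real embedding models produce hundreds or thousands of dimensions)
dense_vector = [0.12, -0.03, 0.88, 0.05]

# sparse: only the non-zero entries, stored as parallel lists of indices and values
# (the indices refer to positions in a large, vocabulary-sized space)
sparse_indices = [17, 1042, 20345]
sparse_values = [0.45, 1.2, 0.73]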
This notebook walks through setting up and customizing hybrid search with Qdrant and the naver/efficient-splade-VI-BT-large variants from Huggingface.
Setup#
First, we set up our environment and load our data.
!pip install llama-index qdrant-client pypdf "transformers[torch]"
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2023-12-21 09:33:57-- https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.131.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’
data/llama2.pdf 100%[===================>] 13.03M 26.0MB/s in 0.5s
2023-12-21 09:33:58 (26.0 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/").load_data()
Indexing Data#
Now, we can index our data.
Hybrid search with Qdrant must be enabled from the beginning – we can simply set enable_hybrid=True.

This will run sparse vector generation locally using the "naver/efficient-splade-VI-BT-large-doc" model from Huggingface, in addition to generating dense vectors with OpenAI.
from llama_index import VectorStoreIndex, StorageContext, ServiceContext
from llama_index.vector_stores import QdrantVectorStore
from qdrant_client import QdrantClient
# creates a persistent index on disk
client = QdrantClient(path="./qdrant_data")
# create our vector store with hybrid indexing enabled
# batch_size controls how many nodes are encoded with sparse vectors at once
vector_store = QdrantVectorStore(
"llama2_paper", client=client, enable_hybrid=True, batch_size=20
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
service_context=service_context,
)
Hybrid Queries#
When querying with hybrid mode, we can set similarity_top_k and sparse_top_k separately.

sparse_top_k represents how many nodes will be retrieved from each of the sparse and dense queries. For example, if sparse_top_k=5 is set, we will retrieve 5 nodes using sparse vectors and 5 nodes using dense vectors.

similarity_top_k controls the final number of returned nodes. In the above setting, we end up with 10 candidate nodes. A fusion algorithm (relative score fusion in this case) is then applied to rank and order the nodes from the two vector spaces, and similarity_top_k=2 means the top two nodes after fusion are returned.
query_engine = index.as_query_engine(
similarity_top_k=2, sparse_top_k=12, vector_store_query_mode="hybrid"
)
from IPython.display import display, Markdown
response = query_engine.query(
"How was Llama2 specifically trained differently from Llama1?"
)
display(Markdown(str(response)))
Llama2 was specifically trained differently from Llama1 by making several changes to improve performance. These changes included performing more robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models.
print(len(response.source_nodes))
2
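To see how the retrieved nodes were scored after fusion, you can inspect the source nodes directly (the exact scores will depend on your data and models):

for node_with_score in response.source_nodes:
    print(node_with_score.node.node_id, node_with_score.score)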
Let's compare this to not using hybrid search at all!
from IPython.display import display, Markdown
query_engine = index.as_query_engine(
similarity_top_k=2,
# sparse_top_k=10,
# vector_store_query_mode="hybrid"
)
response = query_engine.query(
"How was Llama2 specifically trained differently from Llama1?"
)
display(Markdown(str(response)))
Llama 2 was specifically trained differently from Llama 1 by making several changes to improve performance. These changes included performing more robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. These modifications were made to enhance the training process and optimize the performance of Llama 2 compared to Llama 1.
Async Support#
And of course, async queries are also supported (note that in-memory Qdrant data is not shared between async and sync clients!)
import nest_asyncio
nest_asyncio.apply()
from llama_index import VectorStoreIndex, StorageContext, ServiceContext
from llama_index.vector_stores import QdrantVectorStore
from qdrant_client import AsyncQdrantClient
# creates a persistent index on disk
aclient = AsyncQdrantClient(path="./qdrant_data_async")
# create our vector store with hybrid indexing enabled
vector_store = QdrantVectorStore(
collection_name="llama2_paper",
aclient=aclient,
enable_hybrid=True,
batch_size=20,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
service_context=service_context,
use_async=True,
)
query_engine = index.as_query_engine(similarity_top_k=2, sparse_top_k=10)
response = await query_engine.aquery(
"What baseline models are measured against in the paper?"
)
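As with the synchronous engine, the response can be rendered once the query completes (output omitted here):

display(Markdown(str(response)))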
[Advanced] Customizing Hybrid Search with Qdrant#
In this section, we walk through various settings that can be used to fully customize the hybrid search experience.
Customizing Sparse Vector Generation#
By default, sparse vector generation is done using separate models for queries and documents – "naver/efficient-splade-VI-BT-large-doc" and "naver/efficient-splade-VI-BT-large-query".

Below is the default code for generating the sparse vectors, and how you can set this functionality in the constructor. You can use this and customize it as needed.
from typing import Any, List, Tuple
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
doc_tokenizer = AutoTokenizer.from_pretrained(
"naver/efficient-splade-VI-BT-large-doc"
)
doc_model = AutoModelForMaskedLM.from_pretrained(
"naver/efficient-splade-VI-BT-large-doc"
)
query_tokenizer = AutoTokenizer.from_pretrained(
"naver/efficient-splade-VI-BT-large-query"
)
query_model = AutoModelForMaskedLM.from_pretrained(
"naver/efficient-splade-VI-BT-large-query"
)
def sparse_doc_vectors(
texts: List[str],
) -> Tuple[List[List[int]], List[List[float]]]:
"""
Computes vectors from logits and attention mask using ReLU, log, and max operations.
"""
tokens = doc_tokenizer(
texts, truncation=True, padding=True, return_tensors="pt"
)
    if torch.cuda.is_available():
        tokens = tokens.to("cuda")
        # nn.Module.to() moves the model parameters in place, so inputs and model share a device
        doc_model.to("cuda")
output = doc_model(**tokens)
logits, attention_mask = output.logits, tokens.attention_mask
relu_log = torch.log(1 + torch.relu(logits))
weighted_log = relu_log * attention_mask.unsqueeze(-1)
tvecs, _ = torch.max(weighted_log, dim=1)
# extract the vectors that are non-zero and their indices
indices = []
vecs = []
for batch in tvecs:
indices.append(batch.nonzero(as_tuple=True)[0].tolist())
vecs.append(batch[indices[-1]].tolist())
return indices, vecs
def sparse_query_vectors(
texts: List[str],
) -> Tuple[List[List[int]], List[List[float]]]:
"""
Computes vectors from logits and attention mask using ReLU, log, and max operations.
"""
# TODO: compute sparse vectors in batches if max length is exceeded
tokens = query_tokenizer(
texts, truncation=True, padding=True, return_tensors="pt"
)
    if torch.cuda.is_available():
        tokens = tokens.to("cuda")
        # nn.Module.to() moves the model parameters in place, so inputs and model share a device
        query_model.to("cuda")
output = query_model(**tokens)
logits, attention_mask = output.logits, tokens.attention_mask
relu_log = torch.log(1 + torch.relu(logits))
weighted_log = relu_log * attention_mask.unsqueeze(-1)
tvecs, _ = torch.max(weighted_log, dim=1)
# extract the vectors that are non-zero and their indices
indices = []
vecs = []
for batch in tvecs:
indices.append(batch.nonzero(as_tuple=True)[0].tolist())
vecs.append(batch[indices[-1]].tolist())
return indices, vecs
vector_store = QdrantVectorStore(
"llama2_paper",
client=client,
enable_hybrid=True,
sparse_doc_fn=sparse_doc_vectors,
sparse_query_fn=sparse_query_vectors,
)
Customizing hybrid_fusion_fn()#
By default, when running hybrid queries with Qdrant, Relative Score Fusion is used to combine the nodes retrieved from both sparse and dense queries.
You can customize this function to be any other method (plain deduplication, Reciprocal Rank Fusion, etc.).
Below is the default code for our relative score fusion approach and how you can pass it into the constructor.
from llama_index.vector_stores.types import VectorStoreQueryResult
def relative_score_fusion(
dense_result: VectorStoreQueryResult,
sparse_result: VectorStoreQueryResult,
alpha: float = 0.5, # passed in from the query engine
top_k: int = 2, # passed in from the query engine i.e. similarity_top_k
) -> VectorStoreQueryResult:
"""
Fuse dense and sparse results using relative score fusion.
"""
# sanity check
assert dense_result.nodes is not None
assert dense_result.similarities is not None
assert sparse_result.nodes is not None
assert sparse_result.similarities is not None
# deconstruct results
sparse_result_tuples = list(
zip(sparse_result.similarities, sparse_result.nodes)
)
sparse_result_tuples.sort(key=lambda x: x[0], reverse=True)
dense_result_tuples = list(
zip(dense_result.similarities, dense_result.nodes)
)
dense_result_tuples.sort(key=lambda x: x[0], reverse=True)
# track nodes in both results
all_nodes_dict = {x.node_id: x for x in dense_result.nodes}
for node in sparse_result.nodes:
if node.node_id not in all_nodes_dict:
all_nodes_dict[node.node_id] = node
# normalize sparse similarities from 0 to 1
sparse_similarities = [x[0] for x in sparse_result_tuples]
max_sparse_sim = max(sparse_similarities)
min_sparse_sim = min(sparse_similarities)
sparse_similarities = [
(x - min_sparse_sim) / (max_sparse_sim - min_sparse_sim)
for x in sparse_similarities
]
sparse_per_node = {
sparse_result_tuples[i][1].node_id: x
for i, x in enumerate(sparse_similarities)
}
# normalize dense similarities from 0 to 1
dense_similarities = [x[0] for x in dense_result_tuples]
max_dense_sim = max(dense_similarities)
min_dense_sim = min(dense_similarities)
dense_similarities = [
(x - min_dense_sim) / (max_dense_sim - min_dense_sim)
for x in dense_similarities
]
dense_per_node = {
dense_result_tuples[i][1].node_id: x
for i, x in enumerate(dense_similarities)
}
# fuse the scores
fused_similarities = []
for node_id in all_nodes_dict:
sparse_sim = sparse_per_node.get(node_id, 0)
dense_sim = dense_per_node.get(node_id, 0)
fused_sim = alpha * (sparse_sim + dense_sim)
fused_similarities.append((fused_sim, all_nodes_dict[node_id]))
fused_similarities.sort(key=lambda x: x[0], reverse=True)
fused_similarities = fused_similarities[:top_k]
# create final response object
return VectorStoreQueryResult(
nodes=[x[1] for x in fused_similarities],
similarities=[x[0] for x in fused_similarities],
ids=[x[1].node_id for x in fused_similarities],
)
vector_store = QdrantVectorStore(
"llama2_paper",
client=client,
enable_hybrid=True,
hybrid_fusion_fn=relative_score_fusion,
)
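As an illustration of swapping in one of the other methods mentioned above, below is a minimal sketch of a Reciprocal Rank Fusion variant with the same signature. The smoothing constant k=60 is a commonly used default chosen for this sketch, not something required by Qdrant or llama-index.

def reciprocal_rank_fusion(
    dense_result: VectorStoreQueryResult,
    sparse_result: VectorStoreQueryResult,
    alpha: float = 0.5,  # unused here, kept so the signature matches the default
    top_k: int = 2,  # i.e. similarity_top_k
) -> VectorStoreQueryResult:
    """
    Fuse dense and sparse results using reciprocal rank fusion.
    """
    k = 60  # smoothing constant, a commonly used default

    # accumulate reciprocal-rank scores per node across both result sets
    all_nodes_dict = {}
    rrf_scores = {}
    for result in (dense_result, sparse_result):
        # order each result best-first by its own similarity scores
        pairs = sorted(
            zip(result.similarities or [], result.nodes or []),
            key=lambda x: x[0],
            reverse=True,
        )
        for rank, (_, node) in enumerate(pairs):
            all_nodes_dict[node.node_id] = node
            rrf_scores[node.node_id] = rrf_scores.get(node.node_id, 0.0) + 1.0 / (
                k + rank + 1
            )

    # sort by fused score and keep the top_k nodes
    fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return VectorStoreQueryResult(
        nodes=[all_nodes_dict[node_id] for node_id, _ in fused],
        similarities=[score for _, score in fused],
        ids=[node_id for node_id, _ in fused],
    )

This function can then be passed to QdrantVectorStore via hybrid_fusion_fn exactly as shown above.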
You may have noticed the alpha parameter in the relative score fusion function above. This can be set directly in the as_query_engine() call, which will set it in the vector index retriever.
index.as_query_engine(alpha=0.5, similarity_top_k=2)
Customizing Hybrid Qdrant Collections#
Instead of letting llama-index create the collection for you, you can also configure your Qdrant hybrid collections ahead of time.
NOTE: The names of the vector configs must be text-dense and text-sparse if creating a hybrid index.
from qdrant_client import models
client.recreate_collection(
collection_name="llama2_paper",
vectors_config={
"text-dense": models.VectorParams(
size=1536, # openai vector size
distance=models.Distance.COSINE,
)
},
sparse_vectors_config={
"text-sparse": models.SparseVectorParams(
index=models.SparseIndexParams()
)
},
)
# enable hybrid since we created a sparse collection
vector_store = QdrantVectorStore(
collection_name="llama2_paper", client=client, enable_hybrid=True
)
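From here, building the index on top of the pre-created collection works the same as before:

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)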