Hybrid RAG with Qdrant: multi-tenancy, custom sharding, distributed setup¶
What you'll build¶
This notebook implements a production-style Hybrid RAG on Qdrant using LlamaIndex, designed for multitenancy and scale-out via custom sharding.
- Hybrid search: dense embeddings + sparse BM25 for higher recall and precision.
- Multitenancy: isolate tenants using payload filters and shard routing.
- Custom sharding: keep each tenant local for performance and cost efficiency.
- Distributed Qdrant: multi-node setup with replication for high availability and throughput.
The notebook walks through an end-to-end Retrieval-Augmented Generation (RAG) workflow that uses Qdrant as a distributed hybrid search backend and LlamaIndex as the orchestration layer. You will build a tenant-aware RAG pipeline that combines dense vectors with sparse signals, isolates data per tenant with payload filters, and routes writes and queries with a custom shard key so the setup can scale out.
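To make "combining dense vectors with sparse signals" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge two ranked lists. It is an illustration only: the document ids are hypothetical, and the fusion that QdrantVectorStore applies in hybrid mode may differ from this.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ids: the dense ranking and the sparse ranking disagree on order
dense_ranking = ["doc_mesh", "doc_k8s", "doc_helm"]
sparse_ranking = ["doc_mesh", "doc_helm"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists rise to the top, which is the intuition behind hybrid retrieval.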
Install dependencies¶
About the dependencies¶
- llama-index: orchestration layer for ingestion, indexing, and retrieval.
- llama-index-vector-stores-qdrant: Qdrant integration with hybrid support.
- fastembed: lightweight, CPU-friendly dense and sparse embedding models.
%pip install -U llama-index llama-index-vector-stores-qdrant fastembed
Make sure you have a distributed Qdrant cluster up and running. Here is a compose.yaml file:
services:
  qdrant_primary:
    image: "qdrant/qdrant:latest"
    ports:
      - "6333:6333"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--uri", "http://qdrant_primary:6335"]
    restart: always
  qdrant_secondary:
    image: "qdrant/qdrant:latest"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--bootstrap", "http://qdrant_primary:6335"]
    restart: always
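Before continuing, it can help to confirm that both nodes have joined the cluster. The sketch below assumes the REST endpoint GET /cluster returns a JSON body with result.status and result.peers; the sample payload is hand-written in that assumed shape, and the helper names are hypothetical. Check the response of your Qdrant version before relying on it.

```python
import json
from urllib.request import urlopen


def peer_count(cluster_info: dict) -> int:
    """Return the number of peers reported, or 0 if cluster mode is off."""
    result = cluster_info.get("result", {})
    if result.get("status") != "enabled":
        return 0
    return len(result.get("peers", {}))


def fetch_cluster_info(base_url: str = "http://localhost:6333") -> dict:
    """Fetch cluster info from a running node (not called in this sketch)."""
    with urlopen(f"{base_url}/cluster", timeout=5) as response:
        return json.load(response)


# Hand-written sample in the assumed response shape, not real output
sample_response = {
    "result": {
        "status": "enabled",
        "peers": {
            "111": {"uri": "http://qdrant_primary:6335/"},
            "222": {"uri": "http://qdrant_secondary:6335/"},
        },
    },
    "status": "ok",
}

print(peer_count(sample_response))
```

With the compose file above you would expect two peers; against a live cluster, pass fetch_cluster_info() to peer_count instead of the sample.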
Imports and global settings¶
Settings and connectivity¶
- Embeddings: FastEmbedEmbedding('BAAI/bge-base-en-v1.5') is a compact, high-quality baseline.
- Connection: QDRANT_URL defaults to a local HTTP endpoint; set QDRANT_API_KEY for secured/cloud setups.
import os
from qdrant_client import AsyncQdrantClient, QdrantClient
from qdrant_client import models
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    Document,
    StorageContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding
# Embeddings, small and fast
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")
# Qdrant connection, local by default, set QDRANT_URL and QDRANT_API_KEY for cloud
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
client: QdrantClient = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
aclient: AsyncQdrantClient = AsyncQdrantClient(
    url=QDRANT_URL, api_key=QDRANT_API_KEY
)
COLLECTION = "hybrid_rag_multitenant_sharding_demo"
Create distributed-ready collection¶
Configure dual-vector schema (dense + sparse)¶
- Define vector field names: dense for embeddings and sparse for BM25-style signals.
- Dense config:
  - Determine embedding dimensionality at runtime by probing Settings.embed_model (avoids hardcoding).
  - Use cosine distance for semantic similarity.
- Sparse config:
  - Enable an in-memory sparse index (on_disk=False) to support hybrid scoring.
- These settings establish the collection's dual-index layout used later by QdrantVectorStore for hybrid retrieval.
dense_vector_name = "dense"
dense_config = models.VectorParams(
    # Probe the embedding model once so the size is never hardcoded
    size=len(Settings.embed_model.get_text_embedding("probe")),
    distance=models.Distance.COSINE,
)
sparse_vector_name = "sparse"
sparse_config = models.SparseVectorParams(
    index=models.SparseIndexParams(on_disk=False)
)
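As a rough illustration of what the two vector fields measure, the sketch below computes cosine similarity for dense vectors and a dot product for sparse vectors stored as index-to-weight maps. The vectors are made up for the example, and this is only a conceptual model, not Qdrant's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors given as {token_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    return sum(weight * b.get(index, 0.0) for index, weight in a.items())


# Made-up vectors: only token index 7 is shared by the two sparse vectors
dense_score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
sparse_score = sparse_dot({3: 1.0, 7: 2.0}, {7: 0.5, 9: 4.0})
print(round(dense_score, 4), sparse_score)
```

Dense similarity rewards semantic closeness even without shared words, while the sparse score is non-zero only when the query and document share tokens.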
Shard keys and selector contract¶
- shard_keys: ['tenant_a', 'tenant_b'], the predefined partitions used with custom sharding to keep each tenant local.
- payload_indexes: keyword index on tenant_id to accelerate filter-based queries.
- shard_key_selector_fn(tenant_id) -> tenant_id: returns the shard key used for both writes and reads.
shard_keys = ["tenant_a", "tenant_b"]
payload_indexes = [
    {
        "field_name": "tenant_id",
        "field_schema": models.PayloadSchemaType.KEYWORD,
    }
]


def shard_key_selector_fn(tenant_id: str) -> models.ShardKeySelector:
    # One shard key per tenant: the tenant id itself is the shard key
    return tenant_id
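The identity selector above works when every tenant has its own shard key. As a sketch of two variations, the hypothetical helpers below (a) reject unknown tenants early and (b) map many tenants onto a fixed set of bucket keys with a stable hash. Neither is part of the notebook; the bucket naming and count are assumptions.

```python
import hashlib

SHARD_KEYS = ["tenant_a", "tenant_b"]  # mirrors shard_keys in the notebook


def strict_shard_key(tenant_id: str) -> str:
    """Return the tenant's own shard key, failing fast on unknown tenants."""
    if tenant_id not in SHARD_KEYS:
        raise ValueError(f"Unknown tenant_id: {tenant_id}")
    return tenant_id


def bucketed_shard_key(tenant_id: str, buckets: int = 4) -> str:
    """Map any tenant id to one of `buckets` keys using a stable hash.

    sha256 is used instead of hash() so the mapping survives restarts.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return f"bucket_{int.from_bytes(digest[:8], 'big') % buckets}"


print(strict_shard_key("tenant_a"))
print(bucketed_shard_key("tenant_123"))
```

With bucketed keys several tenants share a shard key, so the tenant_id payload filter becomes mandatory for isolation rather than optional.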
Initialize hybrid Qdrant store with custom sharding¶
This step creates or attaches to the collection named in COLLECTION and configures a dual-vector hybrid store:
- Hybrid search: enable_hybrid=True with dense_vector_name='dense' and sparse_vector_name='sparse'.
- Dense config: dense_config uses cosine distance and derives size from Settings.embed_model.
- Sparse config: sparse_config enables an in-memory sparse index; fastembed_sparse_model='Qdrant/bm25' supplies BM25-style signals.
- Distributed topology:
  - sharding_method=Custom with shard_keys=['tenant_a','tenant_b'].
  - shard_key_selector_fn(tenant_id) -> tenant_id routes both writes and reads.
  - shard_number=6, replication_factor=2 for scale and high availability.
- Payload index: payload_indexes accelerates filtering on tenant_id.
Idempotent behavior: the vector store will create the collection if missing and reuse it on subsequent runs.
vector_store = QdrantVectorStore(
    collection_name=COLLECTION,
    client=client,
    aclient=aclient,
    dense_vector_name=dense_vector_name,
    sparse_vector_name=sparse_vector_name,
    enable_hybrid=True,
    dense_config=dense_config,
    sparse_config=sparse_config,
    fastembed_sparse_model="Qdrant/bm25",
    shard_number=6,
    sharding_method=models.ShardingMethod.CUSTOM,
    shard_key_selector_fn=shard_key_selector_fn,
    shard_keys=shard_keys,
    replication_factor=2,
    payload_indexes=payload_indexes,
)
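A quick back-of-the-envelope check of what these settings imply. The sketch assumes that, under custom sharding, shard_number counts shards per shard key (as Qdrant's documentation describes for user-defined sharding) and that the cluster is the two-node compose setup from earlier. The helper is hypothetical and only does the arithmetic.

```python
def shard_layout(
    shard_keys: list[str],
    shards_per_key: int,
    replication_factor: int,
    nodes: int,
) -> dict[str, float]:
    """Estimate shard counts, assuming shard_number applies per shard key."""
    logical_shards = len(shard_keys) * shards_per_key
    shard_replicas = logical_shards * replication_factor
    return {
        "logical_shards": logical_shards,
        "shard_replicas": shard_replicas,
        "replicas_per_node": shard_replicas / nodes,
    }


layout = shard_layout(
    ["tenant_a", "tenant_b"], shards_per_key=6, replication_factor=2, nodes=2
)
print(layout)
```

Under those assumptions the demo ends up with 12 logical shards and 24 shard replicas across two nodes, which is generous for six documents; smaller values are likely enough for experiments.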
Prepare multi-tenant dataset¶
We create two tenants with small document sets. Each Document carries tenant_id and tags in its metadata.
Dataset design and extensibility¶
We simulate two tenants with a few short documents each. Every Document carries:
- tenant_id for isolation and shard routing,
- tags for quick filtering and debugging,
- text content used for dense/sparse indexing.
TENANT_DOCS: dict[str, list[Document]] = {
    "tenant_a": [
        Document(
            text="Solar panels reduce electricity bills and carbon footprint",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "solar"]},
        ),
        Document(
            text="Inverters convert DC power to AC for home appliances",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "hardware"]},
        ),
        Document(
            text="Net metering policies vary by region and utility provider",
            metadata={
                "tenant_id": "tenant_a",
                "tags": ["policy", "regulation"],
            },
        ),
    ],
    "tenant_b": [
        Document(
            text="Kubernetes orchestrates containers across a cluster",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "k8s"]},
        ),
        Document(
            text="Service meshes add observability and traffic management",
            metadata={
                "tenant_id": "tenant_b",
                "tags": ["cloud", "networking"],
            },
        ),
        Document(
            text="Helm charts package and deploy Kubernetes applications",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "devops"]},
        ),
    ],
}
Ingest with shard key for locality¶
Here we embed text with the active Settings.embed_model, then upsert each point with payload and a shard key. This keeps each tenant local to a shard group in a cluster.
Embedding strategy¶
- FastEmbed keeps this demo CPU-friendly. For production, consider a service (e.g., text-embedding-3-large or an in-house model) and cache embeddings.
- If you change the model, update dense_config.size to match and consider reindexing.
- Avoid embedding on every run in notebooks; persist or cache to speed up iterations.
def create_dense_embeddings(docs: list[Document]) -> list[Document]:
    # Attach a dense embedding to each document in place
    for doc in docs:
        doc.embedding = Settings.embed_model.get_text_embedding(doc.text)
    return docs
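The caching advice above can be sketched with a small in-memory cache keyed by model name and text hash. The class and the stand-in embedding function are hypothetical; in the notebook you would pass Settings.embed_model.get_text_embedding as embed_fn, and a persistent store would replace the dict.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache so identical text is embedded only once per model."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model_name: str):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name so a model change never reuses old vectors
        raw = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model, used only for this sketch."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, model_name="BAAI/bge-base-en-v1.5")
first = cache.get("Solar panels reduce electricity bills")
second = cache.get("Solar panels reduce electricity bills")
print(cache.misses)
```

Keying on the model name as well as the text means switching models naturally invalidates old entries, which matches the reindexing advice above.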
Ingestion flow and locality guarantees¶
- We embed each doc with the configured embedding model (dense) and rely on the vector store to build the sparse representation.
- Writes use shard_identifier=tenant_id, ensuring documents live on the intended shard group.
Tip: For large batches, prefer the async ingestion APIs and chunk documents for backpressure control.
# Top-level await works in notebooks; wrap in an async function elsewhere
for tenant_id, docs in TENANT_DOCS.items():
    docs = create_dense_embeddings(docs)
    await vector_store.async_add(docs, shard_identifier=tenant_id)
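For larger batches, the tip about async ingestion and backpressure could look roughly like the sketch below. It is a generic asyncio pattern, not a LlamaIndex API: chunked, ingest_in_batches, and fake_write are hypothetical, and fake_write stands in for a call such as vector_store.async_add(batch, shard_identifier=tenant_id).

```python
import asyncio
from typing import Awaitable, Callable, Sequence


def chunked(items: Sequence[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i : i + size]) for i in range(0, len(items), size)]


async def ingest_in_batches(
    items: Sequence[str],
    write_batch: Callable[[list[str]], Awaitable[None]],
    batch_size: int = 64,
    max_in_flight: int = 2,
) -> int:
    """Write batches concurrently, never more than max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(batch: list[str]) -> int:
        async with semaphore:
            await write_batch(batch)
            return len(batch)

    counts = await asyncio.gather(
        *(guarded(batch) for batch in chunked(items, batch_size))
    )
    return sum(counts)


written: list[list[str]] = []


async def fake_write(batch: list[str]) -> None:
    """Stand-in for the real write call; records what it received."""
    await asyncio.sleep(0)
    written.append(batch)


# In a notebook, use `await ingest_in_batches(...)` instead of asyncio.run
total = asyncio.run(
    ingest_in_batches([f"doc_{i}" for i in range(10)], fake_write, batch_size=4)
)
print(total, [len(batch) for batch in written])
```

The semaphore caps concurrent writes so a large ingest does not overwhelm the cluster; tune batch_size and max_in_flight against your own latency and memory limits.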
Index wrapping and reusability¶
StorageContext.from_defaults(vector_store=vector_store) binds the Qdrant collection to LlamaIndex's VectorStoreIndex without re-ingesting data.
Benefits:
- Reuse the same physical collection for multiple retrievers or query pipelines.
- Swap retrieval modes (dense-only, sparse-only, hybrid) via retriever config, not data layout.
- Keep ingestion concerns (sharding, replication) decoupled from application query logic.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
Multi-tenant retrieval¶
Use a tenant-scoped hybrid retriever and keep queries shard-local. You can also use metadata filters if you want to filter within the tenant's data.
Retrieval tips for hybrid mode¶
- Set vector_store_query_mode=HYBRID to combine dense and sparse. Tune similarity_top_k, sparse_top_k, and hybrid_top_k.
- Pass vector_store_kwargs={"shard_identifier": tenant_id} to keep queries within the tenant's shard.
- Add metadata filters (e.g., on tenant_id or tags) to further narrow candidates when needed.
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode
def create_retriever_for_tenant(tenant_id: str) -> VectorIndexRetriever:
    # Fail fast on tenants that have no shard key
    if tenant_id not in shard_keys:
        raise ValueError(
            f"Unknown tenant_id: {tenant_id}. Expected one of {shard_keys}"
        )
    return VectorIndexRetriever(
        index=index,
        vector_store_query_mode=VectorStoreQueryMode.HYBRID,
        similarity_top_k=5,
        sparse_top_k=5,
        hybrid_top_k=5,
        # Route the query to this tenant's shard only
        vector_store_kwargs={"shard_identifier": tenant_id},
    )
tenant_id = "tenant_b"
retriever = create_retriever_for_tenant(tenant_id)
query = "manage microservices traffic and observability"
results = retriever.retrieve(query)
print(f"Tenant: {tenant_id} | Query: {query}")
for i, r in enumerate(results, 1):
    meta = r.node.metadata
    print(
        f"{i}. score={r.score:.4f} | tags={meta.get('tags')} | text={r.node.get_content()}"
    )
Tenant: tenant_b | Query: manage microservices traffic and observability
1. score=4.6271 | tags=['cloud', 'networking'] | text=Service meshes add observability and traffic management
2. score=0.1213 | tags=['cloud', 'k8s'] | text=Kubernetes orchestrates containers across a cluster
3. score=0.0000 | tags=['cloud', 'devops'] | text=Helm charts package and deploy Kubernetes applications
Interpreting results¶
- The printout shows the hybrid score, tags (metadata), and a snippet of the matched text.
- Verify tenant isolation by switching tenant_id and observing that results come only from that tenant's documents.
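The isolation check can also be automated. The sketch below is a hypothetical helper that works on plain metadata dicts; in the notebook you could feed it [r.node.metadata for r in results]. The sample metadata mirrors the tenant_b documents defined earlier.

```python
def foreign_results(results_metadata: list[dict], tenant_id: str) -> list[dict]:
    """Return metadata entries that do not belong to the given tenant."""
    return [
        meta for meta in results_metadata if meta.get("tenant_id") != tenant_id
    ]


# Sample metadata shaped like the tenant_b documents in this notebook
sample_metadata = [
    {"tenant_id": "tenant_b", "tags": ["cloud", "networking"]},
    {"tenant_id": "tenant_b", "tags": ["cloud", "k8s"]},
]

leaks = foreign_results(sample_metadata, "tenant_b")
print(f"foreign results: {len(leaks)}")
```

An empty list means every hit belongs to the requested tenant; anything else points to a routing or filtering mistake worth investigating.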