Building RAG from Scratch (Open-source only!)

In this tutorial, we show you how to build a data ingestion pipeline into a vector database, and then build a retrieval pipeline from that vector database, from scratch.

Notably, we use a fully open-source stack:

  • Sentence Transformers as the embedding model

  • Postgres as the vector store (we support many other vector stores too!)

  • Llama 2 as the LLM (through llama.cpp)


We setup our open-source components.

  1. Sentence Transformers

  2. Llama 2

  3. We initialize postgres and wrap it with our wrappers/abstractions.

Sentence Transformers

# sentence transformers
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

Llama CPP

In this notebook, we use the llama-2-chat-13b-ggml model, along with the proper prompt formatting.

Check out our Llama CPP guide for full setup instructions/details.

!pip install llama-cpp-python
Requirement already satisfied: llama-cpp-python in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (0.2.7)
Requirement already satisfied: numpy>=1.20.0 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from llama-cpp-python) (1.23.5)
Requirement already satisfied: typing-extensions>=4.5.0 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from llama-cpp-python) (4.7.1)
Requirement already satisfied: diskcache>=5.6.1 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from llama-cpp-python) (5.6.3)

[notice] A new release of pip available: 22.3.1 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
from llama_index.llms import LlamaCPP

# model_url = ""
model_url = ""

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    # kwargs to pass to __call__()
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},

Define Service Context

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

Initialize Postgres

Using an existing postgres running at localhost, create the database we’ll be using.

NOTE: Of course there are plenty of other open-source/self-hosted databases you can use! e.g. Chroma, Qdrant, Weaviate, and many more. Take a look at our vector store guide.

NOTE: You will need to setup postgres on your local system. Here’s an example of how to set it up on OSX:

NOTE: You will also need to install pgvector (

You can add a role like the following:

!pip install psycopg2-binary pgvector asyncpg "sqlalchemy[asyncio]" greenlet
import psycopg2

db_name = "vector_db"
host = "localhost"
password = "password"
port = "5432"
user = "jerry"
# conn = psycopg2.connect(connection_string)
conn = psycopg2.connect(
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
from sqlalchemy import make_url
from llama_index.vector_stores import PGVectorStore

vector_store = PGVectorStore.from_params(
    embed_dim=384,  # openai embedding dimension

Build an Ingestion Pipeline from Scratch

We show how to build an ingestion pipeline as mentioned in the introduction.

We fast-track the steps here (can skip metadata extraction). More details can be found in our dedicated ingestion guide.

1. Load Data

!mkdir data
!wget --user-agent "Mozilla" "" -O "data/llama2.pdf"
from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

2. Use a Text Splitter to Split Documents

from llama_index.text_splitter import SentenceSplitter
text_splitter = SentenceSplitter(
    # separator=" ",
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_splitter.split_text(doc.text)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

3. Manually Construct Nodes from Text Chunks

from llama_index.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata

4. Generate Embeddings for each Node

Here we generate embeddings for each Node using a sentence_transformers model.

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
    node.embedding = node_embedding

5. Load Nodes into a Vector Store

We now insert these nodes into our PostgresVectorStore.


Build Retrieval Pipeline from Scratch

We show how to build a retrieval pipeline. Similar to ingestion, we fast-track the steps. Take a look at our retrieval guide for more details!

query_str = "Can you tell me about the key concepts for safety finetuning"

1. Generate a Query Embedding

query_embedding = embed_model.get_query_embedding(query_str)

2. Query the Vector Database

# construct vector store query
from llama_index.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)

3. Parse Result into a Set of Nodes

from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))

4. Put into a Retriever

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List

class VectorDBRetriever(BaseRetriever):
    """Retriever over a postgres vector store."""

    def __init__(
        vector_store: PGVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        query_embedding = embed_model.get_query_embedding(query_str)
        vector_store_query = VectorStoreQuery(
        query_result = vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2

Plug this into our RetrieverQueryEngine to synthesize a response

from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
query_str = "How does Llama 2 perform compared to other open-source models?"

response = query_engine.query(query_str)
Llama.generate: prefix-match hit

llama_print_timings:        load time = 15473.66 ms
llama_print_timings:      sample time =    35.20 ms /    53 runs   (    0.66 ms per token,  1505.85 tokens per second)
llama_print_timings: prompt eval time = 16132.70 ms /  1816 tokens (    8.88 ms per token,   112.57 tokens per second)
llama_print_timings:        eval time =  3149.79 ms /    52 runs   (   60.57 ms per token,    16.51 tokens per second)
llama_print_timings:       total time = 19380.78 ms
 Based on the results shown in Table 3, Llama 2 outperforms all open-source models on most of the benchmarks, with an average improvement of around 5 points over the next best model (GPT-3.5).