Open In Colab

Llama2 + VectorStoreIndex

This notebook walks through the proper setup to use llama-2 with LlamaIndex. Specifically, we look at using a vector store index.


If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index


import os


# currently needed for notebooks
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

Load documents, build the VectorStoreIndex

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

from llama_index import (

from IPython.display import Markdown, display
INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
NumExpr defaulting to 8 threads.
from llama_index.llms import Replicate
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms.llama_utils import (

# The replicate endpoint
LLAMA_13B_V2_CHAT = "a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5"

# inject custom system prompt into llama-2
def custom_completion_to_prompt(completion: str) -> str:
    return completion_to_prompt(
            "You are a Q&A assistant. Your goal is to answer questions as "
            "accurately as possible is the instructions and context provided."

llm = Replicate(
    # override max tokens since it's interpreted
    # as context window instead of max tokens
    # override completion representation for llama 2
    # if using llama 2 for data agents, also override the message representation

# set a global service context
ctx = ServiceContext.from_defaults(llm=llm)

Download Data

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
index = VectorStoreIndex.from_documents(documents)


# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

Based on the context information provided, the author’s activities growing up were:

  1. Writing short stories, which were “awful” and lacked a strong plot.

  2. Programming on an IBM 1401 computer in 9th grade, using an early version of Fortran.

  3. Building a microcomputer with a friend, and writing simple games, a program to predict the height of model rockets, and a word processor.

  4. Studying philosophy in college, but finding it boring and switching to AI.

  5. Writing essays online, which became a turning point in their career.

Streaming Support

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("What happened at interleaf?")
for token in response.response_gen:
    print(token, end="")
 Based on the context information provided, it appears that the author worked at Interleaf, a company that made software for creating and managing documents. The author mentions that Interleaf was "on the way down" and that the company's Release Engineering group was large compared to the group that actually wrote the software. It is inferred that Interleaf was experiencing financial difficulties and that the author was nervous about money. However, there is no explicit mention of what specifically happened at Interleaf.