OpenSearch Vector Store
Elasticsearch only supports Lucene indices, so only OpenSearch is supported.
Note on setup: we set up a local OpenSearch instance by following this doc: https://opensearch.org/docs/1.0/. If you run into SSL issues, try the following docker run command instead:
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearchproject/opensearch:1.0.1
Reference: https://github.com/opensearch-project/OpenSearch/issues/1598
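Before moving on, you can optionally sanity-check that the local cluster is reachable. This is a minimal sketch, assuming the requests package is installed and that security is disabled as in the docker command above (so plain HTTP works):
# optional: confirm the local OpenSearch cluster is reachable
# (assumes `requests` is installed and security is disabled, per the docker command above)
import requests

info = requests.get("http://localhost:9200").json()
print(info["version"]["number"])  # should print the OpenSearch version, e.g. 1.0.1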
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from os import getenv
from llama_index import SimpleDirectoryReader
from llama_index.vector_stores import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)
from llama_index import VectorStoreIndex, StorageContext
# http endpoint for your cluster (opensearch required for vector index usage)
endpoint = getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
# index to demonstrate the VectorStore impl
idx = getenv("OPENSEARCH_INDEX", "gpt-index-demo")
# load some sample data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"
# OpensearchVectorClient encapsulates logic for a
# single opensearch index with vector search enabled
client = OpensearchVectorClient(
    endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field
)
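For illustration only: the client targets an OpenSearch index with a k-NN vector field, so the mapping it works against looks roughly like the dictionary below. The field names and the 1536 dimension mirror the arguments above; the exact body OpensearchVectorClient creates may differ.
# illustrative sketch of a k-NN enabled mapping
# (not necessarily the exact body the client creates)
example_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            embedding_field: {"type": "knn_vector", "dimension": 1536},
            text_field: {"type": "text"},
        }
    },
}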
# initialize vector store
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# initialize an index using our sample data and the client we just created
index = VectorStoreIndex.from_documents(
    documents=documents, storage_context=storage_context
)
# run query
query_engine = index.as_query_engine()
res = query_engine.query("What did the author do growing up?")
res.response
INFO:root:> [query] Total LLM token usage: 29628 tokens
INFO:root:> [query] Total embedding token usage: 8 tokens
'\n\nThe author grew up writing short stories, programming on an IBM 1401, and building a computer kit from Heathkit. They also wrote programs for a TRS-80, such as games, a program to predict model rocket flight, and a word processor. After years of nagging, they convinced their father to buy a TRS-80, and they wrote simple games, a program to predict how high their model rockets would fly, and a word processor that their father used to write at least one book. In college, they studied philosophy and AI, and wrote a book about Lisp hacking. They also took art classes and applied to art schools, and experimented with computer graphics and animation, exploring the use of algorithms to create art. Additionally, they experimented with machine learning algorithms, such as using neural networks to generate art, and exploring the use of numerical values to create art. They also took classes in fundamental subjects like drawing, color, and design, and applied to two art schools, RISD in the US, and the Accademia di Belli Arti in Florence. They were accepted to RISD, and while waiting to hear back from the Accademia, they learned Italian and took the entrance exam in Florence. They eventually graduated from RISD'
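You can also inspect which retrieved chunks the answer was synthesized from; a minimal sketch using the response's source_nodes:
# inspect the retrieved chunks behind the answer
for node_with_score in res.source_nodes:
    print(node_with_score.score, node_with_score.node.get_text()[:100])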
The OpenSearch vector store supports filter-context queries.
from llama_index import Document
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter
import regex as re
# Split the text into paragraphs.
text_chunks = documents[0].text.split("\n\n")
# Create a document for each footnote
footnotes = [
    Document(
        text=chunk,
        id=documents[0].doc_id,
        metadata={"is_footnote": bool(re.search(r"^\s*\[\d+\]\s*", chunk))},
    )
    for chunk in text_chunks
    if bool(re.search(r"^\s*\[\d+\]\s*", chunk))
]
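Before inserting, it can help to confirm what the regex actually matched; a quick check:
# quick sanity check on the extracted footnote chunks
print("footnote chunks found:", len(footnotes))
if footnotes:
    print(footnotes[0].text[:80])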
# Insert the footnotes into the index
for f in footnotes:
    index.insert(f)
# Create a query engine that only searches certain footnotes.
footnote_query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(
                key="term", value='{"metadata.is_footnote": "true"}'
            ),
            ExactMatchFilter(
                key="query_string",
                value='{"query": "content: space AND content: lisp"}',
            ),
        ]
    )
)
res = footnote_query_engine.query(
    "What did the author say about space aliens and lisp?"
)
res.response
"The author believes that any sufficiently advanced alien civilization would know about the Pythagorean theorem and possibly also about Lisp in McCarthy's 1960 paper."
Use the reader to check out what VectorStoreIndex just created in our index.
The reader works with Elasticsearch too, since it only uses basic search features.
# create a reader to check out the index used in previous section.
from llama_index.readers import ElasticsearchReader
rdr = ElasticsearchReader(endpoint, idx)
# set embedding_field optionally to read embedding data from the elasticsearch index
docs = rdr.load_data(text_field, embedding_field=embedding_field)
# docs have embeddings in them
print("embedding dimension:", len(docs[0].embedding))
# full document is stored in metadata
print("all fields in index:", docs[0].metadata.keys())
embedding dimension: 1536
all fields in index: dict_keys(['content', 'embedding'])
# we can check out how the text was chunked by the VectorStoreIndex above
print("total number of chunks created:", len(docs))
total number of chunks created: 10
# search index using standard elasticsearch query DSL
docs = rdr.load_data(text_field, {"query": {"match": {text_field: "Lisp"}}})
print("chunks that mention Lisp:", len(docs))
docs = rdr.load_data(text_field, {"query": {"match": {text_field: "Yahoo"}}})
print("chunks that mention Yahoo:", len(docs))
chunks that mention Lisp: 10
chunks that mention Yahoo: 8
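Any valid query DSL body can be passed the same way; for example, a bool query that requires both terms (a sketch following the same pattern as above):
# chunks that mention both Lisp and Yahoo, using a bool query
bool_query = {
    "query": {
        "bool": {
            "must": [
                {"match": {text_field: "Lisp"}},
                {"match": {text_field: "Yahoo"}},
            ]
        }
    }
}
docs = rdr.load_data(text_field, bool_query)
print("chunks that mention both Lisp and Yahoo:", len(docs))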