Property graph kuzu
# %pip install llama-index llama-index-embeddings-openai llama-index-graph-stores-kuzu
Kùzu is an open source, embedded graph database that's designed for query speed and scalability. It implements the Cypher query language, and utilizes a structured property graph model (a variant of the labelled property graph model) with support for ACID transactions. Because Kùzu is embedded, no server needs to be set up to use the database.
If you already have an existing graph, please skip to the end of this notebook. Otherwise, let's begin by creating a graph from unstructured text to demonstrate how to use Kùzu as a graph store.
import nest_asyncio
nest_asyncio.apply()
Environment Setup¶
import os
os.environ["OPENAI_API_KEY"] = "enter your key here"
We will be using OpenAI models for this example, so we'll specify the OpenAI API key.
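Hard-coding the key in a notebook makes it easy to leak by accident. As a minimal sketch of one alternative, assuming the key may already be exported in your shell, the helper below reads it from the environment and only prompts when it is missing. The name `ensure_openai_key` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def ensure_openai_key(prompt=getpass):
    """Return the OpenAI API key, prompting only if it is not already set.

    `ensure_openai_key` is a hypothetical helper for this notebook. `prompt`
    is injectable so the function can be exercised without a terminal.
    """
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Ask interactively without echoing the key to the screen.
        key = prompt("OpenAI API key: ").strip()
        if not key:
            raise ValueError("An OpenAI API key is required for this example.")
        os.environ["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_key()` once at the top of the notebook would then replace the hard-coded assignment.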
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-08-27 16:12:46-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8000::154, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘data/paul_graham/paul_graham_essay.txt’ data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.04s 2024-08-27 16:12:47 (1.61 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
Graph Construction¶
We first need to create an empty Kùzu database by calling the kuzu.Database constructor. This step instantiates the database and creates the necessary directories and files within a local directory that stores the graph. This Database object is then passed to the KuzuPropertyGraphStore constructor.
import shutil
import kuzu
shutil.rmtree("test_db", ignore_errors=True)
db = kuzu.Database("test_db")
from llama_index.graph_stores.kuzu import KuzuPropertyGraphStore
graph_store = KuzuPropertyGraphStore(db)
Because Kùzu implements the structured property graph model, it imposes some level of structure on the schema of the graph. In the above case, because we did not specify a relationship schema, it uses a generic schema in which relationship types are not constrained, so the triples extracted by the LLM can be stored as relationships in the graph.
Define models¶
Below, we define the embedding model and the LLMs used to extract triples from the text and to generate responses. We use the same model with different temperature settings: the extraction model has a temperature of 0.0, and the generation model has a temperature of 0.3.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")
extract_llm = OpenAI(model="gpt-4o-mini", temperature=0.0)
generate_llm = OpenAI(model="gpt-4o-mini", temperature=0.3)
1. Create property graph index without imposing structure¶
Because we didn't specify the relationship schema above, we can simply invoke the SchemaLLMPathExtractor to extract the triples from the text and store them in the graph. We can define the property graph index using Kùzu as the graph store, as shown below:
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
index = PropertyGraphIndex.from_documents(
documents,
embed_model=embed_model,
kg_extractors=[SchemaLLMPathExtractor(extract_llm)],
property_graph_store=graph_store,
show_progress=True,
)
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 17.81it/s] Extracting paths from text with schema: 100%|██████████| 22/22 [00:31<00:00, 1.43s/it] Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 1.34it/s] Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 3.06it/s]
Now that the graph is created, we can explore it in Kùzu Explorer, a web-based UI, by running a Docker container that pulls the latest image of Kùzu Explorer as follows:
docker run -p 8000:8000 \
-v ./test_db:/database \
--rm kuzudb/explorer:latest
Then, launch the UI by visiting http://localhost:8000/.
The easiest way to see the entire graph is to use a Cypher query like "match (a)-[b]->(c) return * limit 200".
To delete the entire graph, you can either delete the ./test_db directory that contains the database files, or run the Cypher query "match (n) detach delete n" in the Kùzu Explorer shell.
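The same Cypher queries can also be run from Python rather than from Kùzu Explorer. The sketch below assumes the `db` object created earlier and uses the `Connection` class from Kùzu's Python API. The helper names `explore_query` and `run_cypher` are hypothetical.

```python
def explore_query(limit=200):
    """Build the Cypher query that returns up to `limit` relationships."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"MATCH (a)-[b]->(c) RETURN * LIMIT {limit}"


def run_cypher(db, query):
    """Run `query` against an open kuzu.Database and return all rows as lists."""
    import kuzu  # imported lazily so the query builder works without Kùzu installed

    conn = kuzu.Connection(db)
    result = conn.execute(query)
    rows = []
    # Iterate over the result set one row at a time.
    while result.has_next():
        rows.append(result.get_next())
    return rows


# Example usage (assumes `db = kuzu.Database("test_db")` from earlier):
# rows = run_cypher(db, explore_query(limit=5))
```

Passing the existing `db` object avoids opening the same database a second time from the same process.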
Querying and Retrieval¶
# Switch to the generate LLM during retrieval
Settings.llm = generate_llm
query_engine = index.as_query_engine(include_text=False)
response = query_engine.query("Tell me more about Interleaf and Viaweb")
print(str(response))
Interleaf and Viaweb are both products associated with the development of software solutions. Interleaf is linked to Lisp, indicating a relationship where Interleaf may utilize or be built upon Lisp programming language capabilities. Viaweb, on the other hand, is identified as an ecommerce software product and also has a connection to Lisp, suggesting that it may incorporate Lisp in its architecture or functionality. Both products are documented in a text file, which includes details about their creation and modification dates, file size, and type.
2. Create property graph index with structure¶
The recommended way to use Kùzu is to apply a structured schema to the graph. The schema is defined by specifying the relationship types (including direction) that we want in the graph. The imposition of structure helps with generating triples that are more meaningful for the types of questions we may want to answer from the graph.
By specifying the below validation schema, we can enforce that the graph only contains relationships of the specified types.
from typing import Literal
entities = Literal["PERSON", "PLACE", "ORGANIZATION"]
relations = Literal["HAS", "PART_OF", "WORKED_ON", "WORKED_WITH", "WORKED_AT"]
# Define the relationship schema that we will pass to our graph store
# This must be a list of valid triples in the form (head_entity, relation, tail_entity)
validation_schema = [
("ORGANIZATION", "HAS", "PERSON"),
("PERSON", "WORKED_AT", "ORGANIZATION"),
("PERSON", "WORKED_WITH", "PERSON"),
("PERSON", "WORKED_ON", "ORGANIZATION"),
("PERSON", "PART_OF", "ORGANIZATION"),
("ORGANIZATION", "PART_OF", "ORGANIZATION"),
("PERSON", "WORKED_AT", "PLACE"),
]
# Create a new empty database
shutil.rmtree("test_db", ignore_errors=True)
db = kuzu.Database("test_db")
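Since the entity labels, relation labels, and validation schema are defined separately, a typo in one of them is easy to miss. As a small sanity check before running extraction, the sketch below uses `typing.get_args` to confirm that every triple in the schema only uses labels declared in the two `Literal` types. The function name `find_schema_problems` is hypothetical.

```python
from typing import Literal, get_args

entities = Literal["PERSON", "PLACE", "ORGANIZATION"]
relations = Literal["HAS", "PART_OF", "WORKED_ON", "WORKED_WITH", "WORKED_AT"]
validation_schema = [
    ("ORGANIZATION", "HAS", "PERSON"),
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PERSON", "WORKED_WITH", "PERSON"),
    ("PERSON", "WORKED_ON", "ORGANIZATION"),
    ("PERSON", "PART_OF", "ORGANIZATION"),
    ("ORGANIZATION", "PART_OF", "ORGANIZATION"),
    ("PERSON", "WORKED_AT", "PLACE"),
]


def find_schema_problems(schema, entity_type, relation_type):
    """Return the triples that use a label missing from the Literal definitions."""
    allowed_entities = set(get_args(entity_type))
    allowed_relations = set(get_args(relation_type))
    return [
        (head, rel, tail)
        for head, rel, tail in schema
        if head not in allowed_entities
        or rel not in allowed_relations
        or tail not in allowed_entities
    ]


problems = find_schema_problems(validation_schema, entities, relations)
print(problems)
```

For the schema defined in this notebook, the returned list is empty; any triple that appears in it points to a label that needs to be added to `entities` or `relations`.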
Along with the Database object, we also pass two additional arguments to the property graph store: has_structured_schema=True and relationship_schema=validation_schema, which give Kùzu additional information as it instantiates a new graph.
graph_store = KuzuPropertyGraphStore(
db,
has_structured_schema=True,
relationship_schema=validation_schema,
)
To construct a property graph with the desired schema, we pass a few additional arguments to the SchemaLLMPathExtractor: possible_entities, possible_relations, kg_validation_schema, and strict.
index = PropertyGraphIndex.from_documents(
documents,
embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
kg_extractors=[
SchemaLLMPathExtractor(
llm=OpenAI(model="gpt-4o-mini", temperature=0.0),
possible_entities=entities,
possible_relations=relations,
kg_validation_schema=validation_schema,
strict=True, # if false, will allow triples outside of the schema
)
],
property_graph_store=graph_store,
show_progress=True,
)
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 16.23it/s] Extracting paths from text with schema: 100%|██████████| 22/22 [00:29<00:00, 1.34s/it] Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 1.17it/s] Generating embeddings: 100%|██████████| 4/4 [00:01<00:00, 3.69it/s]
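To make the effect of strict validation more concrete, here is a plain-Python sketch of the filtering idea: a candidate triple is kept only if its (head label, relation, tail label) combination appears in the schema, which also means direction matters. This is an illustration only and not LlamaIndex's implementation; the function `keep_valid_triples` and the sample triples are hypothetical.

```python
def keep_valid_triples(triples, schema):
    """Keep only triples whose (head label, relation, tail label) is in `schema`.

    Each triple is ((head_name, head_label), relation, (tail_name, tail_label)).
    """
    allowed = set(schema)
    return [
        triple
        for triple in triples
        if (triple[0][1], triple[1], triple[2][1]) in allowed
    ]


schema = [
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PERSON", "WORKED_WITH", "PERSON"),
]
# Hypothetical candidates, standing in for what an LLM might propose.
candidates = [
    (("Paul Graham", "PERSON"), "WORKED_AT", ("Interleaf", "ORGANIZATION")),
    # Rejected: the direction does not match any schema entry.
    (("Interleaf", "ORGANIZATION"), "WORKED_AT", ("Paul Graham", "PERSON")),
    # Rejected: the relation is not in the schema at all.
    (("Paul Graham", "PERSON"), "FOUNDED", ("Viaweb", "ORGANIZATION")),
]
valid = keep_valid_triples(candidates, schema)
```

Only the first candidate survives, which is the kind of constraint that keeps the stored graph aligned with the questions the schema was designed to answer.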
We can now apply the query engine on the index as before.
# Switch to the generate LLM during retrieval
Settings.llm = generate_llm
query_engine = index.as_query_engine(include_text=False)
response2 = query_engine.query("Tell me more about Interleaf and Viaweb")
print(str(response2))
Interleaf and Viaweb are both organizations mentioned in the provided information. Interleaf is associated with Emacs, indicating a connection to text editing or software development environments. Viaweb, on the other hand, has several associations, including individuals like Julian and Idelle, as well as the programming language Lisp. This suggests that Viaweb may have a broader scope, potentially involving web development or e-commerce, given its historical context as an early web application platform. Both organizations appear to have been referenced in a document related to Paul Graham, indicating their relevance in discussions around technology or entrepreneurship.
Use existing graph¶
You can reuse an existing Database object to connect to its underlying PropertyGraphIndex. This is useful when you want to query the graph without having to re-extract the triples from the text.
graph_store = KuzuPropertyGraphStore(db)
# Set up the property graph index
index = PropertyGraphIndex.from_existing(
embed_model=embed_model,
llm=generate_llm,
property_graph_store=graph_store,
)
query_engine = index.as_query_engine(include_text=False)
response3 = query_engine.query("When was Viaweb founded, and by whom?")
print(str(response3))
Viaweb was founded by Paul Graham. The specific founding date is not provided in the information available.
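One way to choose between building a new graph and reusing an existing one, assuming the database lives at `test_db` as in this notebook, is to check whether anything exists at that path before re-running extraction. The helper `graph_exists` is hypothetical, and the check only confirms that the path is present and non-empty, not that it holds a valid Kùzu database.

```python
from pathlib import Path


def graph_exists(db_path="test_db"):
    """Return True if something non-empty already exists at the database path."""
    path = Path(db_path)
    if path.is_dir():
        # A directory-based database should contain at least one file.
        return any(path.iterdir())
    # Also handle a database stored as a single file.
    return path.is_file() and path.stat().st_size > 0
```

The check should run before the Database object is created, because opening the database creates the path if it is missing. If it returns True, PropertyGraphIndex.from_existing can be used; otherwise the graph is built with PropertyGraphIndex.from_documents.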
For full details on construction, retrieval, and querying of a property graph, see the full docs page.