Property graph kuzu

In [ ]:

# pip install llama-index llama-index-embeddings-openai llama-index-graph-stores-kuzu

Kùzu is an open source, embedded graph database designed for query speed and scalability. It implements the Cypher query language and uses a structured property graph model (a variant of the labelled property graph model) with support for ACID transactions. Because Kùzu is embedded, you don't need to set up a server to use the database.

Let's begin by creating a graph from unstructured text to demonstrate how to use Kùzu as a graph and vector store to answer questions.

In [ ]:

import nest_asyncio

nest_asyncio.apply()

Environment Setup

In [ ]:

import os

os.environ["OPENAI_API_KEY"] = "your-api-key-here"

We will be using OpenAI models for this example, so we'll specify the OpenAI API key.

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2025-08-06 13:30:12--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.004s  

2025-08-06 13:30:12 (19.3 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]

In [ ]:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

Graph Construction

We first need to create an empty Kùzu database by calling the kuzu.Database constructor. This step instantiates the database and creates the on-disk storage for the graph at the given local path. This Database object is then passed to the KuzuPropertyGraphStore constructor.

In [ ]:

from pathlib import Path
import kuzu

DB_NAME = "ex.kuzu"
Path(DB_NAME).unlink(missing_ok=True)
db = kuzu.Database(DB_NAME)

Because Kùzu implements the structured property graph model, it imposes some level of structure on the schema of the graph. If we do not specify a relationship schema when creating the graph store, it uses a generic schema in which the relationship types are not constrained, allowing the triples extracted by the LLM to be stored as relationships in the graph.

Define LLMs

Below, we'll define the model used for embedding the text and the LLMs used to extract triples from the text and generate the response. In this case, we specify different temperature settings for the same model: the extraction model uses a temperature of 0, and the generation model uses 0.3.

In [ ]:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")
extract_llm = OpenAI(model="gpt-4.1-mini", temperature=0.0)
generate_llm = OpenAI(model="gpt-4.1-mini", temperature=0.3)

Create property graph index with structure

The recommended way to use Kùzu is to apply a structured schema to the graph. The schema is defined by specifying the relationship types (including direction) that we want in the graph. The imposition of structure helps with generating triples that are more meaningful for the types of questions we may want to answer from the graph.

By specifying the validation schema below, we can enforce that the graph only contains relationships of the specified types.

In [ ]:

from typing import Literal

entities = Literal["PERSON", "PLACE", "ORGANIZATION"]
relations = Literal["HAS", "PART_OF", "WORKED_ON", "WORKED_WITH", "WORKED_AT"]
# Define the relationship schema that we will pass to our graph store
# This must be a list of valid triples in the form (head_entity, relation, tail_entity)
validation_schema = [
    ("ORGANIZATION", "HAS", "PERSON"),
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PERSON", "WORKED_WITH", "PERSON"),
    ("PERSON", "WORKED_ON", "ORGANIZATION"),
    ("PERSON", "PART_OF", "ORGANIZATION"),
    ("ORGANIZATION", "PART_OF", "ORGANIZATION"),
    ("PERSON", "WORKED_AT", "PLACE"),
]
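
To make the role of the validation schema concrete, here is a minimal sketch of the kind of check a strict schema implies: a candidate triple is kept only if its (head, relation, tail) labels appear in the list. This is an illustration only, not how SchemaLLMPathExtractor or the Kùzu graph store is actually implemented. The helper name filter_triples, the shortened example_schema, and the sample candidates are all hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# A shortened copy of the schema shape used above, repeated here so the
# sketch runs on its own (hypothetical subset, for illustration only).
example_schema: List[Triple] = [
    ("ORGANIZATION", "HAS", "PERSON"),
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PERSON", "WORKED_WITH", "PERSON"),
]


def filter_triples(candidates: List[Triple], schema: List[Triple]) -> List[Triple]:
    """Keep only candidates whose (head, relation, tail) labels are in the schema."""
    allowed = set(schema)
    return [triple for triple in candidates if triple in allowed]


candidates = [
    ("PERSON", "WORKED_AT", "ORGANIZATION"),  # listed in the schema, kept
    ("PLACE", "HAS", "PERSON"),  # direction/labels not listed, dropped
]
kept = filter_triples(candidates, example_schema)
print(kept)
```

Note that direction matters in this sketch: ("ORGANIZATION", "HAS", "PERSON") being allowed does not make ("PERSON", "HAS", "ORGANIZATION") allowed.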

Create property graph store with a vector index

To create a KuzuPropertyGraphStore with a vector index, we set the use_vector_index parameter to True. This creates a vector index on the property graph, allowing us to perform vector-based queries on the graph.

In [ ]:

from llama_index.graph_stores.kuzu import KuzuPropertyGraphStore
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

graph_store = KuzuPropertyGraphStore(
    db,
    has_structured_schema=True,
    relationship_schema=validation_schema,
    use_vector_index=True,  # Enable vector index for similarity search
    embed_model=embed_model,  # Auto-detects embedding dimension from model
)

Auto-detected embedding dimension: 1536

To construct a property graph with the desired schema, we'll use SchemaLLMPathExtractor with the following parameters.

In [ ]:

index = PropertyGraphIndex.from_documents(
    documents,
    embed_model=embed_model,
    kg_extractors=[
        SchemaLLMPathExtractor(
            llm=extract_llm,
            possible_entities=entities,
            possible_relations=relations,
            kg_validation_schema=validation_schema,
            strict=True,  # if false, will allow triples outside of the schema
        )
    ],
    property_graph_store=graph_store,
    show_progress=True,
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00,  9.39it/s]
Extracting paths from text with schema: 100%|██████████| 22/22 [00:28<00:00,  1.28s/it]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.25it/s]
Generating embeddings: 100%|██████████| 4/4 [00:01<00:00,  3.59it/s]

We can now create a query engine on the index and use it to answer questions.

In [ ]:

# Switch to the generate LLM during retrieval
Settings.llm = generate_llm

query_text = "Tell me more about Interleaf and Viaweb?"
query_engine = index.as_query_engine(include_text=False)

response = query_engine.query(query_text)
print(str(response))

The information provided mentions Viaweb and associates it with individuals named Trevor, Trevor Blackwell, Robert, and Paul Graham. However, there is no information given about Interleaf or further details about Viaweb beyond these associations.

In [ ]:

retriever = index.as_retriever(include_text=False)
nodes = retriever.retrieve(query_text)
nodes[0].text

Out[ ]:

'Viaweb -> HAS -> Trevor'
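
With include_text=False, the retrieved node text is a flat string in the form shown above. If you want the head, relation, and tail separately, a small helper can split it. This is a hypothetical convenience function (parse_triple is not part of LlamaIndex), and it assumes the " -> " separator seen in the output above.

```python
from typing import Tuple


def parse_triple(text: str) -> Tuple[str, str, str]:
    """Split a 'head -> relation -> tail' string into its three parts."""
    parts = [part.strip() for part in text.split(" -> ")]
    if len(parts) != 3:
        raise ValueError(f"Expected 'head -> relation -> tail', got: {text!r}")
    head, relation, tail = parts
    return head, relation, tail


print(parse_triple("Viaweb -> HAS -> Trevor"))
# ('Viaweb', 'HAS', 'Trevor')
```

The helper raises a ValueError rather than guessing when the string does not have exactly three parts, for example if an entity name itself contains the separator.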

Query the vector index

As an embedded graph database, Kùzu provides a fast, graph-based HNSW vector index (see the docs). This allows you to also use Kùzu for similarity (vector-based) retrieval on chunk nodes. The vector index is created after the embeddings are ingested into the chunk nodes, so you should be able to query it directly.
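
As a rough mental model of what similarity retrieval over chunk embeddings computes, the sketch below ranks a few toy vectors against a query by cosine similarity and returns the top k. It is an exhaustive scan over made-up two-dimensional vectors, whereas an HNSW index finds approximate nearest neighbours without scanning everything. The use of cosine similarity is an assumption here and may differ from the metric the Kùzu index is configured with. The names cosine_similarity, top_k, and toy_chunks are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and return the k best matches."""
    scored = [(text, cosine_similarity(query, emb)) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: (chunk text, embedding). Real embeddings have many more dimensions.
toy_chunks = [
    ("chunk about funding", [1.0, 0.0]),
    ("chunk about painting", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]
results = top_k([1.0, 0.0], toy_chunks, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```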

In [ ]:

from llama_index.core.vector_stores.types import VectorStoreQuery

query_text = "How much funding did Idelle Weber provide to Viaweb?"
query_embedding = embed_model.get_text_embedding(query_text)
# Perform direct vector search on the graph store
vector_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5
)

nodes, similarities = graph_store.vector_query(vector_query)

for i, (node, similarity) in enumerate(zip(nodes, similarities)):
    print(f"  {i + 1}. Similarity: {similarity:.3f}")
    print(f"     Text: {node.text}...")
    print()

1. Similarity: 0.421
Text: Large numbers of former students kept in touch with her, including me. After I moved to New York I became her de facto studio assistant.

She liked to paint on big, square canvases, 4 to 5 feet on a side. One day in late 1994 as I was stretching one of these monsters there was something on the radio about a famous fund manager. He wasn't that much older than me, and was super rich. The thought suddenly occurred to me: why don't I become rich? Then I'll be able to work on whatever I want.

Meanwhile I'd been hearing more and more about this new thing called the World Wide Web. Robert Morris showed it to me when I visited him in Cambridge, where he was now in grad school at Harvard. It seemed to me that the web would be a big deal. I'd seen what graphical user interfaces had done for the popularity of microcomputers. It seemed like the web would do the same for the internet.

If I wanted to get rich, here was the next train leaving the station. I was right about that part. What I got wrong was the idea. I decided we should start a company to put art galleries online. I can't honestly say, after reading so many Y Combinator applications, that this was the worst startup idea ever, but it was up there. Art galleries didn't want to be online, and still don't, not the fancy ones. That's not how they sell. I wrote some software to generate web sites for galleries, and Robert wrote some to resize images and set up an http server to serve the pages. Then we tried to sign up galleries. To call this a difficult sale would be an understatement. It was difficult to give away. A few galleries let us make sites for them for free, but none paid us.

Then some online stores started to appear, and I realized that except for the order buttons they were identical to the sites we'd been generating for galleries. This impressive-sounding thing called an "internet storefront" was something we already knew how to build.

So in the summer of 1995, after I submitted the camera-ready copy of ANSI Common Lisp to the publishers, we started trying to write software to build online stores. At first this was going to be normal desktop software, which in those days meant Windows software. That was an alarming prospect, because neither of us knew how to write Windows software or wanted to learn. We lived in the Unix world. But we decided we'd at least try writing a prototype store builder on Unix. Robert wrote a shopping cart, and I wrote a new site generator for stores — in Lisp, of course.

We were working out of Robert's apartment in Cambridge. His roommate was away for big chunks of time, during which I got to sleep in his room. For some reason there was no bed frame or sheets, just a mattress on the floor. One morning as I was lying on this mattress I had an idea that made me sit up like a capital L. What if we ran the software on the server, and let users control it by clicking on links? Then we'd never have to write anything to run on users' computers. We could generate the sites on the same server we'd serve them from. Users wouldn't need anything more than a browser.

This kind of software, known as a web app, is common now, but at the time it wasn't clear that it was even possible. To find out, we decided to try making a version of our store builder that you could control through the browser. A couple days later, on August 12, we had one that worked. The UI was horrible, but it proved you could build a whole store through the browser, without any client software or typing anything into the command line on the server.

Now we felt like we were really onto something. I had visions of a whole new generation of software working this way. You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group called Release Engineering that seemed to be at least as big as the group that actually wrote the software. Now you could just update the software right on the server.

We started a new company we called Viaweb, after the fact that our software worked via the web, and we got $10,000 in seed funding from Idelle's husband Julian. In return for that and doing the initial legal work and giving us business advice, we gave him 10% of the company. Ten years later this deal became the model for Y Combinator's....

2. Similarity: 0.390
Text: I'd compounded this problem by buying a house up in the Santa Cruz Mountains, with a beautiful view but miles from anywhere. I stuck it out for a few more months, then in desperation I went back to New York, where unless you understand about rent control you'll be surprised to hear I still had my apartment, sealed up like a tomb of my old life. Idelle was in New York at least, and there were other people trying to paint there, even though I didn't know any of them.

When I got back to New York I resumed my old life, except now I was rich. It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now when I was tired of walking, all I had to do was raise my hand, and (unless it was raining) a taxi would stop to pick me up. Now when I walked past charming little restaurants I could go in and order lunch. It was exciting for a while. Painting started to go better. I experimented with a new kind of still life where I'd paint one painting in the old way, then photograph it and print it, blown up, on canvas, and then use that as the underpainting for a second still life, painted from the same objects (which hopefully hadn't rotted yet).

Meanwhile I looked for an apartment to buy. Now I could actually choose what neighborhood to live in. Where, I asked myself and various real estate agents, is the Cambridge of New York? Aided by occasional visits to actual Cambridge, I gradually realized there wasn't one. Huh.

Around this time, in the spring of 2000, I had an idea. It was clear from our experience with Viaweb that web apps were the future. Why not build a web app for making web apps? Why not let people edit code on our server through the browser, and then host the resulting applications for them? [9] You could run all sorts of services on the servers that these applications could use just by making an API call: making and receiving phone calls, manipulating images, taking credit card payments, etc.

I got so excited about this idea that I couldn't think about anything else. It seemed obvious that this was the future. I didn't particularly want to start another company, but it was clear that this idea would have to be embodied as one, so I decided to move to Cambridge and start it. I hoped to lure Robert into working on it with me, but there I ran into a hitch. Robert was now a postdoc at MIT, and though he'd made a lot of money the last time I'd lured him into working on one of my schemes, it had also been a huge time sink. So while he agreed that it sounded like a plausible idea, he firmly refused to work on it.

Hmph. Well, I'd do it myself then. I recruited Dan Giffin, who had worked for Viaweb, and two undergrads who wanted summer jobs, and we got to work trying to build what it's now clear is about twenty companies and several open source projects worth of software. The language for defining applications would of course be a dialect of Lisp. But I wasn't so naive as to assume I could spring an overt Lisp on a general audience; we'd hide the parentheses, like Dylan did.

By then there was a name for the kind of company Viaweb was, an "application service provider," or ASP. This name didn't last long before it was replaced by "software as a service," but it was current for long enough that I named this new company after it: it was going to be called Aspra.

I started working on the application builder, Dan worked on network infrastructure, and the two undergrads worked on the first two services (images and phone calls). But about halfway through the summer I realized I really didn't want to run a company — especially not a big one, which it was looking like this would have to be. I'd only started Viaweb because I needed the money. Now that I didn't need money anymore, why was I doing this? If this vision had to be realized as a company, then screw the vision. I'd build a subset that could be done as an open source project.

Much to my surprise, the time I spent working on this stuff was not wasted after all. After we started Y Combinator, I would often encounter startups working on parts of this new architecture, and it was very useful to have spent so much time thinking about it and even trying to write some of it....

3. Similarity: 0.385
Text: In its time, the editor was one of the best general-purpose site builders. I kept the code tight and didn't have to integrate with any other software except Robert's and Trevor's, so it was quite fun to work on. If all I'd had to do was work on this software, the next 3 years would have been the easiest of my life. Unfortunately I had to do a lot more, all of it stuff I was worse at than programming, and the next 3 years were instead the most stressful.

There were a lot of startups making ecommerce software in the second half of the 90s. We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and inexpensive. It was lucky for us that we were poor, because that caused us to make Viaweb even more inexpensive than we realized. We charged $100 a month for a small store and $300 a month for a big one. This low price was a big attraction, and a constant thorn in the sides of competitors, but it wasn't because of some clever insight that we set the price low. We had no idea what businesses paid for things. $300 a month seemed like a lot of money to us.

We did a lot of things right by accident like that. For example, we did what's now called "doing things that don't scale," although at the time we would have described it as "being so lame that we're driven to the most desperate measures to get users." The most common of which was building stores for them. This seemed particularly humiliating, since the whole raison d'etre of our software was that people could use it to make their own stores. But anything to get users.

We learned a lot more about retail than we wanted to know. For example, that if you could only have a small image of a man's shirt (and all images were small then by present standards), it was better to have a closeup of the collar than a picture of the whole shirt. The reason I remember learning this was that it meant I had to rescan about 30 images of men's shirts. My first set of scans were so beautiful too.

Though this felt wrong, it was exactly the right thing to be doing. Building stores for users taught us about retail, and about how it felt to use our software. I was initially both mystified and repelled by "business" and thought we needed a "business person" to be in charge of it, but once we started to get users, I was converted, in much the same way I was converted to fatherhood once I had kids. Whatever users wanted, I was all theirs. Maybe one day we'd have so many users that I couldn't scan their images for them, but in the meantime there was nothing more important to do.

Another thing I didn't get at the time is that growth rate is the ultimate test of a startup. Our growth rate was fine. We had about 70 stores at the end of 1996 and about 500 at the end of 1997. I mistakenly thought the thing that mattered was the absolute number of users. And that is the thing that matters in the sense that that's how much money you're making, and if you're not making enough, you might go out of business. But in the long term the growth rate takes care of the absolute number. If we'd been a startup I was advising at Y Combinator, I would have said: Stop being so stressed out, because you're doing fine. You're growing 7x a year. Just don't hire too many more people and you'll soon be profitable, and then you'll control your own destiny.

Alas I hired lots more people, partly because our investors wanted me to, and partly because that's what startups did during the Internet Bubble. A company with just a handful of employees would have seemed amateurish. So we didn't reach breakeven until about when Yahoo bought us in the summer of 1998. Which in turn meant we were at the mercy of investors for the entire life of the company. And since both we and our investors were noobs at startups, the result was a mess even by startup standards.

It was a huge relief when Yahoo bought us. In principle our Viaweb stock was valuable. It was a share in a business that was profitable and growing rapidly. But it didn't feel very valuable to me; I had no idea how to value a business, but I was all too keenly aware of the near-death experiences we seemed to have every few months. Nor had I changed my grad student lifestyle significantly since we started....

4. Similarity: 0.318
Text: The UI was horrible, but it proved you could build a whole store through the browser, without any client software or typing anything into the command line on the server.

Now we felt like we were really onto something. I had visions of a whole new generation of software working this way. You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group called Release Engineering that seemed to be at least as big as the group that actually wrote the software. Now you could just update the software right on the server.

We started a new company we called Viaweb, after the fact that our software worked via the web, and we got $10,000 in seed funding from Idelle's husband Julian. In return for that and doing the initial legal work and giving us business advice, we gave him 10% of the company. Ten years later this deal became the model for Y Combinator's. We knew founders needed something like this, because we'd needed it ourselves.

At this stage I had a negative net worth, because the thousand dollars or so I had in the bank was more than counterbalanced by what I owed the government in taxes. (Had I diligently set aside the proper proportion of the money I'd made consulting for Interleaf? No, I had not.) So although Robert had his graduate student stipend, I needed that seed funding to live on.

We originally hoped to launch in September, but we got more ambitious about the software as we worked on it. Eventually we managed to build a WYSIWYG site builder, in the sense that as you were creating pages, they looked exactly like the static ones that would be generated later, except that instead of leading to static pages, the links all referred to closures stored in a hash table on the server.

It helped to have studied art, because the main goal of an online store builder is to make users look legit, and the key to looking legit is high production values. If you get page layouts and fonts and colors right, you can make a guy running a store out of his bedroom look more legit than a big company.

(If you're curious why my site looks so old-fashioned, it's because it's still made with this software. It may look clunky today, but in 1996 it was the last word in slick.)

In September, Robert rebelled. "We've been working on this for a month," he said, "and it's still not done." This is funny in retrospect, because he would still be working on it almost 3 years later. But I decided it might be prudent to recruit more programmers, and I asked Robert who else in grad school with him was really good. He recommended Trevor Blackwell, which surprised me at first, because at that point I knew Trevor mainly for his plan to reduce everything in his life to a stack of notecards, which he carried around with him. But Rtm was right, as usual. Trevor turned out to be a frighteningly effective hacker.

It was a lot of fun working with Robert and Trevor. They're the two most independent-minded people I know, and in completely different ways. If you could see inside Rtm's brain it would look like a colonial New England church, and if you could see inside Trevor's it would look like the worst excesses of Austrian Rococo.

We opened for business, with 6 stores, in January 1996. It was just as well we waited a few months, because although we worried we were late, we were actually almost fatally early. There was a lot of talk in the press then about ecommerce, but not many people actually wanted online stores. [8]

There were three main parts to the software: the editor, which people used to build sites and which I wrote, the shopping cart, which Robert wrote, and the manager, which kept track of orders and statistics, and which Trevor wrote. In its time, the editor was one of the best general-purpose site builders. I kept the code tight and didn't have to integrate with any other software except Robert's and Trevor's, so it was quite fun to work on. If all I'd had to do was work on this software, the next 3 years would have been the easiest of my life. Unfortunately I had to do a lot more, all of it stuff I was worse at than programming, and the next 3 years were instead the most stressful.

There were a lot of startups making ecommerce software in the second half of the 90s. We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and inexpensive....

5. Similarity: 0.297
Text: In the art world, money and coolness are tightly coupled. Anything expensive comes to be seen as cool, and anything seen as cool will soon become equally expensive.

[7] Technically the apartment wasn't rent-controlled but rent-stabilized, but this is a refinement only New Yorkers would know or care about. The point is that it was really cheap, less than half market price.

[8] Most software you can launch as soon as it's done. But when the software is an online store builder and you're hosting the stores, if you don't have any users yet, that fact will be painfully obvious. So before we could launch publicly we had to launch privately, in the sense of recruiting an initial set of users and making sure they had decent-looking stores.

[9] We'd had a code editor in Viaweb for users to define their own page styles. They didn't know it, but they were editing Lisp expressions underneath. But this wasn't an app editor, because the code ran when the merchants' sites were generated, not when shoppers visited them.

[10] This was the first instance of what is now a familiar experience, and so was what happened next, when I read the comments and found they were full of angry people. How could I claim that Lisp was better than other languages? Weren't they all Turing complete? People who see the responses to essays I write sometimes tell me how sorry they feel for me, but I'm not exaggerating when I reply that it has always been like this, since the very beginning. It comes with the territory. An essay must tell readers things they don't already know, and some people dislike being told such things.

[11] People put plenty of stuff on the internet in the 90s of course, but putting something online is not the same as publishing it online. Publishing online means you treat the online version as the (or at least a) primary version.

[12] There is a general lesson here that our experience with Y Combinator also teaches: Customs continue to constrain you long after the restrictions that caused them have disappeared. Customary VC practice had once, like the customs about publishing essays, been based on real constraints. Startups had once been much more expensive to start, and proportionally rare. Now they could be cheap and common, but the VCs' customs still reflected the old world, just as customs about writing essays still reflected the constraints of the print era.

Which in turn implies that people who are independent-minded (i.e. less influenced by custom) will have an advantage in fields affected by rapid change (where customs are more likely to be obsolete).

Here's an interesting point, though: you can't always predict which fields will be affected by rapid change. Obviously software and venture capital will be, but who would have predicted that essay writing would be?

[13] Y Combinator was not the original name. At first we were called Cambridge Seed. But we didn't want a regional name, in case someone copied us in Silicon Valley, so we renamed ourselves after one of the coolest tricks in the lambda calculus, the Y combinator.

I picked orange as our color partly because it's the warmest, and partly because no VC used it. In 2005 all the VCs used staid colors like maroon, navy blue, and forest green, because they were trying to appeal to LPs, not founders. The YC logo itself is an inside joke: the Viaweb logo had been a white V on a red circle, so I made the YC logo a white Y on an orange square.

[14] YC did become a fund for a couple years starting in 2009, because it was getting so big I could no longer afford to fund it personally. But after Heroku got bought we had enough money to go back to being self-funded.

[15] I've never liked the term "deal flow," because it implies that the number of new startups at any given time is fixed. This is not only false, but it's the purpose of YC to falsify it, by causing startups to be founded that would not otherwise have existed.

[16] She reports that they were all different shapes and sizes, because there was a run on air conditioners and she had to get whatever she could, but that they were all heavier than she could carry now.

[17] Another problem with HN was a bizarre edge case that occurs when you both write essays and run a forum. When you run a forum, you're assumed to see if not every conversation, at least every conversation involving you. And when you write essays, people post highly imaginative misinterpretations of them on forums....

In [ ]:

from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import QueryBundle, NodeWithScore, TextNode
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.core.indices.property_graph import LLMSynonymRetriever
from typing import List


class GraphVectorRetriever(BaseRetriever):
    """
    A retriever that performs vector search on a property graph store.
    """

    def __init__(self, graph_store, embed_model, similarity_top_k: int = 5):
        self.graph_store = graph_store
        self.embed_model = embed_model
        self.similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # Get query embedding
        query_embedding = self.embed_model.get_text_embedding(
            query_bundle.query_str
        )

        # Perform vector search
        vector_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self.similarity_top_k,
        )
        nodes, similarities = self.graph_store.vector_query(vector_query)

        # Convert ChunkNodes to TextNodes
        nodes_with_scores = []
        for node, similarity in zip(nodes, similarities):
            # Skip any node that has no text attribute
            if hasattr(node, "text"):
                text_node = TextNode(
                    text=node.text,
                    id_=node.id,
                    metadata=getattr(node, "properties", {}),
                )
                nodes_with_scores.append(
                    NodeWithScore(node=text_node, score=similarity)
                )

        return nodes_with_scores


class CombinedGraphRetriever(BaseRetriever):
    """
    A retriever that combines graph and vector search on a property graph store.
    """

    def __init__(
        self, graph_store, embed_model, llm, similarity_top_k: int = 5
    ):
        self.graph_store = graph_store
        self.embed_model = embed_model
        self.llm = llm
        self.similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # 1. Vector retrieval
        query_embedding = self.embed_model.get_text_embedding(
            query_bundle.query_str
        )
        vector_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self.similarity_top_k,
        )
        vector_nodes, similarities = self.graph_store.vector_query(
            vector_query
        )

        # Convert ChunkNodes to TextNodes for vector results
        vector_results = []
        for node, similarity in zip(vector_nodes, similarities):
            if hasattr(node, "text"):
                text_node = TextNode(
                    text=node.text,
                    id_=node.id,
                    metadata=getattr(node, "properties", {}),
                )
                vector_results.append(
                    NodeWithScore(node=text_node, score=similarity)
                )

        # 2. Graph traversal retrieval
        graph_retriever = LLMSynonymRetriever(
            self.graph_store, llm=self.llm, include_text=True
        )
        graph_results = graph_retriever.retrieve(query_bundle)

        # 3. Combine and deduplicate
        all_results = vector_results + graph_results
        seen_nodes = set()
        combined_results = []

        for node_with_score in all_results:
            node_id = node_with_score.node.node_id
            if node_id not in seen_nodes:
                seen_nodes.add(node_id)
                combined_results.append(node_with_score)

        return combined_results


# Use the combined retriever
combined_retriever = CombinedGraphRetriever(
    graph_store=graph_store,
    llm=generate_llm,
    embed_model=embed_model,
    similarity_top_k=5,
)

# Test the combined retriever
query_text = "What was the role of Idelle Weber in Viaweb?"
query_bundle = QueryBundle(query_str=query_text)
results = combined_retriever.retrieve(query_bundle)
for i, node_with_score in enumerate(results):
    print(f"{i + 1}. Score: {node_with_score.score:.3f}")
    print(
        f"   Text: {node_with_score.node.text[:100]}..."
    )  # Print first 100 chars
    print(f"   Node ID: {node_with_score.node.node_id}")
    print()

1. Score: 0.371
   Text: Large numbers of former students kept in touch with her, including me. After I moved to New York I b...
   Node ID: 48bda642-e94d-4b79-96fc-4f92ab8813c3

2. Score: 0.353
   Text: I'd compounded this problem by buying a house up in the Santa Cruz Mountains, with a beautiful view ...
   Node ID: 17e8d852-4c00-4eab-8258-641b074c8abb

3. Score: 0.314
   Text: In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
   Node ID: 1c666a5d-9058-42cf-8369-b21749b401b4

4. Score: 0.298
   Text: I started working on the application builder, Dan worked on network infrastructure, and the two unde...
   Node ID: 9e8a378c-015a-4184-a66b-b00939e11a31

5. Score: 0.284
   Text: The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
   Node ID: 9f4b9b54-7be7-4d18-8689-493729dd9078

6. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> McCarthy...
   Node ID: b9d67e62-994c-416d-835b-3fa2c519612f

7. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> John McCarthy...
   Node ID: 672336bd-f8b6-4902-b0e0-d5918c233efe

8. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Jessica...
   Node ID: 5c8d9d09-d018-44da-adc4-a3a1c0b40ce8

9. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Dan...
   Node ID: 5a2cd2a5-96f0-489b-a69e-531d9996c422

10. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> two undergrads...
   Node ID: 7ffa6e08-72b5-46a3-9898-5c28c8ba10fe

11. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Dan Giffin...
   Node ID: 95bd8aaf-29a7-424e-ba83-0a90aeb87edf

12. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Idelle...
   Node ID: e9576435-ef91-4a2e-834c-790906cde85f

13. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Trevor...
   Node ID: 8c63d5e2-8abf-4b37-8f8f-5dbb03638ff3

14. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Trevor Blackwell...
   Node ID: 5fe159e0-010f-4dcd-9c1a-c660ccd98e6f

15. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Idelle Weber...
   Node ID: 8f722d33-f5d6-4fca-bc4a-03a9d60d7893

16. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Robert...
   Node ID: 7b82511d-b3ab-410d-b101-18cbe3d521f2

17. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Robert Morris...
   Node ID: d3d3aed7-c6d8-4a28-8c35-46377bee59fc

18. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Idelle's husband Julian...
   Node ID: 930a5bbb-bc2b-4f15-b3e5-af272a769660

19. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Tom Cheatham...
   Node ID: 157a81e5-f455-4831-825b-caa9a496e303

20. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Sam Altman...
   Node ID: ace9d483-7870-4fda-801c-ad6232b6a0a1

21. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> professor Cheatham...
   Node ID: cf43c8dc-690e-4189-8547-8fe86701eea5

22. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Maria Daniels...
   Node ID: 4df8d654-22b4-492b-a089-7730444fc604

23. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Bill Woods...
   Node ID: a0353041-57f8-4252-9cd3-7600d8044bef

24. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Roy Lichtenstein...
   Node ID: 6cf4eb5b-3755-4d74-9a9d-d15e56a54051

25. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Jessica Livingston...
   Node ID: 1f29de49-92d5-4350-9e82-2101b72ed500

26. Score: 1.000
   Text: Paul Graham -> WORKED_WITH -> Rich Draves...
   Node ID: 3aa36c90-6e87-48f9-937a-d8a71f4721bb

27. Score: 1.000
   Text: Paul Graham -> PART_OF -> Release Engineering...
   Node ID: bc30a50d-7e13-4d59-b79f-327a19105ea3

28. Score: 1.000
   Text: Paul Graham -> PART_OF -> PhD program in computer science...
   Node ID: 26ead48c-9ae6-42f0-9f20-0d241bb9202f

29. Score: 1.000
   Text: Paul Graham -> PART_OF -> Cornell University AI program...
   Node ID: 80f3a0cc-0621-4d11-b73d-1854c132d3f1

30. Score: 1.000
   Text: Paul Graham -> WORKED_ON -> YC...
   Node ID: fadc24a5-a3f1-4e6f-9de5-b4c6960c7bfa

31. Score: 1.000
   Text: Paul Graham -> WORKED_ON -> Hacker News...
   Node ID: 394d6965-6751-4d47-bd64-7518ada93577

32. Score: 1.000
   Text: Paul Graham -> PART_OF -> Summer Founders Program...
   Node ID: 57617b03-80a2-4297-8c36-7c88a3664c18

33. Score: 1.000
   Text: Paul Graham -> WORKED_ON -> Summer Founders Program...
   Node ID: 21c16992-c734-4357-925a-72ca92467d16

34. Score: 1.000
   Text: Paul Graham -> WORKED_ON -> Arc...
   Node ID: 07b07c4a-2e4d-4a4d-ba0d-e984820fb50a

35. Score: 1.000
   Text: Paul Graham -> WORKED_ON -> application builder...
   Node ID: f75a93ac-42d3-4a74-b6bd-f202f7c780d4
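
The output above lists the five vector hits first (scores around 0.3), followed by the graph triplets (all scored 1.000), because the retriever concatenates the two lists and only drops repeated node IDs. If a single ranked list is preferred, one option is to keep the best-scoring copy of each node and sort by score. The following is a minimal sketch of that idea in plain Python; `ScoredItem` and `merge_by_best_score` are hypothetical names standing in for `NodeWithScore` and a merge step, and the sample scores are made up for illustration. Whether a triplet score of 1.0 is comparable to a cosine similarity is a separate question that this sketch does not address.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ScoredItem:
    """Hypothetical stand-in for NodeWithScore: an id, some text, a score."""

    node_id: str
    text: str
    score: float


def merge_by_best_score(*result_lists: Iterable[ScoredItem]) -> List[ScoredItem]:
    """Deduplicate by node_id, keeping the highest-scoring copy, then sort."""
    best: Dict[str, ScoredItem] = {}
    for results in result_lists:
        for item in results:
            current = best.get(item.node_id)
            if current is None or item.score > current.score:
                best[item.node_id] = item
    # Highest score first; ties keep their first-seen order (sorted is stable)
    return sorted(best.values(), key=lambda item: item.score, reverse=True)


# Illustrative data only: ids, texts and scores are invented for this sketch
vector_hits = [
    ScoredItem("a", "chunk about Viaweb", 0.37),
    ScoredItem("b", "chunk about the editor", 0.31),
]
graph_hits = [
    ScoredItem("c", "Paul Graham -> WORKED_WITH -> Idelle Weber", 1.0),
    ScoredItem("a", "chunk about Viaweb", 0.90),
]

merged = merge_by_best_score(vector_hits, graph_hits)
for item in merged:
    print(f"{item.score:.3f}  {item.node_id}  {item.text}")
```

In the combined retriever, the same logic could replace the "Combine and deduplicate" step, reading `node.node_id` and `score` from each `NodeWithScore` instead of the stand-in class.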

In [ ]:

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer

# Create a response synthesizer that uses the generation LLM
response_synthesizer = get_response_synthesizer(
    llm=generate_llm, use_async=False
)

# Create the query engine from the combined retriever and the synthesizer
query_engine = RetrieverQueryEngine(
    retriever=combined_retriever, response_synthesizer=response_synthesizer
)

# Query and get answer
query_text = "What was the role of Idelle Weber in Viaweb?"
response = query_engine.query(query_text)
print(response)

Idelle Weber was connected to Viaweb through her husband Julian, who provided $10,000 in seed funding for the company. In return for this funding, legal work, and business advice, they gave him 10% of the company.