Document Management
Most LlamaIndex index structures allow for insertion, deletion, update, and refresh operations.
Insertion
You can “insert” a new Document into any index data structure, after building the index initially. This document will be broken down into nodes and ingested into the index.
The underlying mechanism behind insertion depends on the index structure. For instance, for the summary index, a new Document is inserted as additional node(s) in the list. For the vector store index, a new Document (and embeddings) is inserted into the underlying document/embedding store.
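The list-style case can be illustrated with a small stand-alone sketch. This is a toy model, not LlamaIndex code: ToyNode, ToySummaryIndex, and the fixed-size character chunking are all hypothetical names and simplifications chosen for illustration. It only shows the idea that an inserted document is split into nodes, which are then appended to the existing list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    node_id: str
    ref_doc_id: str  # id of the document this node was split from
    text: str


@dataclass
class ToySummaryIndex:
    """Hypothetical stand-in for a list-style index: nodes are kept in order."""

    nodes: list = field(default_factory=list)

    def insert(self, doc_id: str, text: str, chunk_size: int = 20) -> None:
        # Break the document into fixed-size chunks ("nodes") ...
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        # ... and append them to the end of the existing node list.
        for n, chunk in enumerate(chunks):
            self.nodes.append(ToyNode(f"{doc_id}-node-{n}", doc_id, chunk))


toy_index = ToySummaryIndex()
toy_index.insert("doc_id_0", "a" * 45)  # 45 characters -> chunks of 20, 20, 5
toy_index.insert("doc_id_1", "short")   # fits in a single chunk
print([node.node_id for node in toy_index.nodes])
```

A real index uses a proper text splitter and, for vector stores, also computes and stores an embedding per node; the sketch leaves both out.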
An example notebook showcasing our insert capabilities is given here. In this notebook we showcase how to construct an empty index, manually create Document objects, and add those to our index data structures.
An example code snippet is given below:
from llama_index import SummaryIndex, Document

# start from an empty index
index = SummaryIndex([])
text_chunks = ["text_chunk_1", "text_chunk_2", "text_chunk_3"]

# manually create one Document per chunk, each with an explicit id
doc_chunks = []
for i, text in enumerate(text_chunks):
    doc = Document(text=text, id_=f"doc_id_{i}")
    doc_chunks.append(doc)

# insert
for doc_chunk in doc_chunks:
    index.insert(doc_chunk)
Deletion
You can “delete” a Document from most index data structures by specifying a document_id. (NOTE: the tree index currently does not support deletion). All nodes corresponding to the document will be deleted.
index.delete_ref_doc("doc_id_0", delete_from_docstore=True)
delete_from_docstore defaults to False, in case you are sharing nodes between indexes that use the same docstore. Even when it is False, however, the deleted document's nodes are no longer used for querying, because they are removed from the index's index_struct, which keeps track of which nodes can be used for querying.
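The difference between the two settings can be sketched with a toy model. Everything here is an assumption made for illustration: ToyIndex is hypothetical, its docstore is a plain dict, and its index_struct is a plain list of node ids. It is not how LlamaIndex implements deletion internally.

```python
class ToyIndex:
    """Hypothetical index with a (possibly shared) docstore and its own index_struct."""

    def __init__(self):
        self.docstore = {}      # node_id -> (ref_doc_id, text)
        self.index_struct = []  # node_ids this index may use when querying

    def insert(self, doc_id, text):
        node_id = f"{doc_id}-node-0"
        self.docstore[node_id] = (doc_id, text)
        self.index_struct.append(node_id)

    def delete_ref_doc(self, doc_id, delete_from_docstore=False):
        # Find every node that came from this document.
        doomed = [nid for nid in self.index_struct if self.docstore[nid][0] == doc_id]
        # Always drop those nodes from index_struct, so queries no longer see them.
        self.index_struct = [nid for nid in self.index_struct if nid not in doomed]
        # Only remove them from the docstore when explicitly asked to.
        if delete_from_docstore:
            for nid in doomed:
                del self.docstore[nid]

    def queryable_texts(self):
        return [self.docstore[nid][1] for nid in self.index_struct]


toy = ToyIndex()
toy.insert("doc_id_0", "first")
toy.insert("doc_id_1", "second")

toy.delete_ref_doc("doc_id_0")  # default: the node stays in the docstore
print(toy.queryable_texts())    # the deleted document is no longer queryable
print(sorted(toy.docstore))     # but its node is still stored

toy.delete_ref_doc("doc_id_1", delete_from_docstore=True)
print(sorted(toy.docstore))     # this node is gone from the docstore as well
```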
Update
If a Document is already present within an index, you can “update” a Document with the same doc id_ (for instance, if the information in the Document has changed).
# NOTE: the document has a `doc_id` specified
doc_chunks[0].text = "Brand new document text"
index.update_ref_doc(
    doc_chunks[0],
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)
Here, we passed some extra kwargs to ensure the document is deleted from the docstore. This is of course optional.
Refresh
If you set the doc id_ of each document when loading your data, you can also automatically refresh the index.
The refresh() function will only update documents that have the same doc id_ but different text contents. Any documents not present in the index at all will also be inserted.
refresh() also returns a boolean list, indicating which documents in the input have been refreshed in the index.
# modify first document, with the same doc_id
doc_chunks[0] = Document(text="Super new document text", id_="doc_id_0")

# add a new document
doc_chunks.append(
    Document(
        text="This isn't in the index yet, but it will be soon!",
        id_="doc_id_3",
    )
)

# refresh the index
refreshed_docs = index.refresh_ref_docs(
    doc_chunks, update_kwargs={"delete_kwargs": {"delete_from_docstore": True}}
)
# refreshed_docs[0] and refreshed_docs[-1] should be True
Again, we passed some extra kwargs to ensure the document is deleted from the docstore. This is of course optional.
If you print() the output of refresh(), you would see which input documents were refreshed:
print(refreshed_docs)
# > [True, False, False, True]
This is most useful when you are reading from a directory that is constantly updating with new information.
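The refresh rule described above (update when the id matches but the text differs, insert when the id is unknown, skip otherwise) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: toy_refresh and text_hash are hypothetical helpers, documents are reduced to (doc_id, text) pairs, and comparing a SHA-256 content hash is an assumed way of detecting changed text.

```python
import hashlib


def text_hash(text: str) -> str:
    """Content fingerprint used to detect changed text (an assumption of this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_refresh(stored_hashes: dict, docs: list) -> list:
    """Return one boolean per input (doc_id, text) pair: True if it was refreshed.

    stored_hashes maps doc_id -> hash of the text currently indexed,
    and is updated in place.
    """
    refreshed = []
    for doc_id, text in docs:
        new_hash = text_hash(text)
        # Unknown id (get() returns None) or changed text both count as a refresh.
        changed = stored_hashes.get(doc_id) != new_hash
        if changed:
            stored_hashes[doc_id] = new_hash
        refreshed.append(changed)
    return refreshed


# State of the toy "index" after the three original chunks were inserted.
stored = {
    f"doc_id_{i}": text_hash(t)
    for i, t in enumerate(["text_chunk_1", "text_chunk_2", "text_chunk_3"])
}
incoming = [
    ("doc_id_0", "Super new document text"),  # same id, new text
    ("doc_id_1", "text_chunk_2"),             # unchanged
    ("doc_id_2", "text_chunk_3"),             # unchanged
    ("doc_id_3", "This isn't in the index yet, but it will be soon!"),  # new id
]
result = toy_refresh(stored, incoming)
print(result)  # [True, False, False, True]
```

Running toy_refresh a second time with the same input returns all False, since nothing has changed in between.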
To automatically set the doc id_ when using the SimpleDirectoryReader, you can set the filename_as_id flag. You can learn more in Customizing Documents.
Document Tracking
For any index that uses the docstore (i.e. all indexes except for most vector store integrations), you can also see which documents you have inserted into the docstore.
print(index.ref_doc_info)
"""
> {'doc_id_1': RefDocInfo(node_ids=['071a66a8-3c47-49ad-84fa-7010c6277479'], metadata={}),
'doc_id_2': RefDocInfo(node_ids=['9563e84b-f934-41c3-acfd-22e88492c869'], metadata={}),
'doc_id_0': RefDocInfo(node_ids=['b53e6c2f-16f7-4024-af4c-42890e945f36'], metadata={}),
'doc_id_3': RefDocInfo(node_ids=['6bedb29f-15db-4c7c-9885-7490e10aa33f'], metadata={})}
"""
Each entry in the output has an ingested doc id_ as its key, and its value lists the node_ids of the nodes that document was split into.
Lastly, the original metadata dictionary of each input document is also tracked. You can read more about the metadata attribute in Customizing Documents.
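To make the shape of this mapping concrete, here is a small sketch that builds a similar structure by hand. The node ids, the doc_metadata dictionary, and the plain-dict layout are hypothetical; the real output uses RefDocInfo objects, as shown above.

```python
from collections import defaultdict

# Hypothetical nodes as (node_id, ref_doc_id) pairs, where ref_doc_id is the
# id_ of the document the node was split from.
nodes = [
    ("node-a", "doc_id_0"),
    ("node-b", "doc_id_0"),
    ("node-c", "doc_id_1"),
]
# Hypothetical metadata supplied on each input document.
doc_metadata = {"doc_id_0": {"source": "a.txt"}, "doc_id_1": {}}

# Group node ids by the document they came from.
node_ids_by_doc = defaultdict(list)
for node_id, ref_doc_id in nodes:
    node_ids_by_doc[ref_doc_id].append(node_id)

# One entry per document: its node ids plus its original metadata.
toy_ref_doc_info = {
    doc_id: {"node_ids": node_ids, "metadata": doc_metadata.get(doc_id, {})}
    for doc_id, node_ids in node_ids_by_doc.items()
}
print(toy_ref_doc_info)
```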