Pydantic Extractor#
Here we demonstrate the capabilities of our PydanticProgramExtractor, which can extract an entire Pydantic object using an LLM (either a standard text-completion LLM or a function-calling LLM).
The advantage of this over using a "single" metadata extractor is that we can extract multiple metadata fields with a single LLM call.
Setup#
import nest_asyncio
nest_asyncio.apply()
import os
import openai
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
openai.api_key = os.getenv("OPENAI_API_KEY")
Set Up the Pydantic Model#
Here we define a basic structured schema that we want to extract. It contains:
- entities: unique entities in a text chunk
- summary: a concise summary of the text chunk
- contains_number: whether the chunk contains numbers
This is obviously a toy schema. We’d encourage you to be creative about the type of metadata you’d want to extract!
from pydantic import BaseModel, Field
from typing import List


class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(
        ..., description="Unique entities in this text chunk."
    )
    summary: str = Field(
        ..., description="A concise summary of this text chunk."
    )
    contains_number: bool = Field(
        ...,
        description=(
            "Whether the text chunk contains any numbers (ints, floats, etc.)"
        ),
    )
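Since this is a plain Pydantic model, you can instantiate it by hand to see the shape of the objects the LLM will be asked to produce (a quick sanity check, not part of the extraction flow; the values below are made up):
# toy example values, purely for illustration
NodeMetadata(
    entities=["Paris"],
    summary="A short note about Paris.",
    contains_number=False,
)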
Set Up the Extractor#
Here we set up the metadata extractor. Note that we provide the prompt template for visibility into what's going on.
from llama_index.program.openai_program import OpenAIPydanticProgram
from llama_index.extractors import PydanticProgramExtractor

EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""

openai_program = OpenAIPydanticProgram.from_defaults(
    output_cls=NodeMetadata,
    prompt_template_str="{input}",
)

# for each node, {context_str} is filled with the node's text and {class_name}
# with "NodeMetadata"; the formatted string is passed to the program as "input"
program_extractor = PydanticProgramExtractor(
    program=openai_program,
    input_key="input",
    extract_template_str=EXTRACT_TEMPLATE_STR,
    show_progress=True,
)
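To see what a single structured extraction looks like before wiring anything into a pipeline, you can also call the program directly on a toy string (a minimal sketch; this makes a real LLM call, and the input text is made up for illustration):
# hypothetical toy input; returns a NodeMetadata instance
sample = openai_program(input="Berlin had a population of 3.7 million in 2023.")
print(sample)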
Load in Data#
We load in Eugene Yan's essay (https://eugeneyan.com/writing/llm-patterns/) using our LlamaHub SimpleWebPageReader.
# load in blog
from llama_hub.web.simple_web.base import SimpleWebPageReader
from llama_index.node_parser import SentenceSplitter
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
from llama_index.ingestion import IngestionPipeline
node_parser = SentenceSplitter(chunk_size=1024)
pipeline = IngestionPipeline(transformations=[node_parser, program_extractor])
orig_nodes = pipeline.run(documents=docs)
orig_nodes
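For a quick sanity check on what the pipeline produced (the exact chunk count depends on the page content at fetch time), you can inspect the parsed nodes directly:
# number of chunks and a preview of the first chunk's text
print(len(orig_nodes))
print(orig_nodes[0].get_content()[:200])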
Extract Metadata#
Now that we’ve setup the metadata extractor and the data, we’re ready to extract some metadata!
We see that the pydantic feature extractor is able to extract all metadata from a given chunk in a single LLM call.
sample_entry = program_extractor.extract(orig_nodes[0:1])[0]
display(sample_entry)
{'entities': ['eugeneyan', 'HackerNews', 'Karpathy'],
'summary': 'This section discusses practical patterns for integrating large language models (LLMs) into systems & products. It introduces seven key patterns and provides information on evaluations and benchmarks in the field of language modeling.',
'contains_number': True}
new_nodes = program_extractor.process_nodes(orig_nodes)
display(new_nodes[5:7])
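From here, the enriched nodes behave like any other nodes. For example, they can be fed straight into a vector index so the extracted metadata travels with each chunk (a minimal sketch; the top-level VectorStoreIndex import is assumed to match the import style used above):
from llama_index import VectorStoreIndex

# the extracted metadata is attached to each node going into the index
index = VectorStoreIndex(new_nodes)
query_engine = index.as_query_engine()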