Skip to content

Using Structured LLMs#

The highest-level way to extract structured data in LlamaIndex is to instantiate a Structured LLM. First, let’s instantiate our Pydantic class as previously:

from datetime import datetime


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )

If this is your first time using LlamaIndex, let’s get our dependencies:

  • pip install llama-index-core llama-index-llms-openai to get the LLM (we’ll be using OpenAI for simplicity, but you can always use another one)
  • Get an OpenAI API key and set it an an environment variable called OPENAI_API_KEY
  • pip install llama-index-readers-file to get the PDFReader
    • Note: for better parsing of PDFs, we recommend LlamaParse

Now let’s load in the text of an actual invoice:

from llama_index.readers.file import PDFReader
from pathlib import Path

pdf_reader = PDFReader()
documents = pdf_reader.load_data(file=Path("./uber_receipt.pdf"))
text = documents[0].text

And let’s instantiate an LLM, give it our Pydantic class, and then ask it to complete using the plain text of the invoice:

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
sllm = llm.as_structured_llm(Invoice)

response = sllm.complete(text)

response is a LlamaIndex CompletionResponse with two properties: text and raw. text contains the JSON-serialized form of the Pydantic-ingested response:

json_response = json.loads(response.text)
print(json.dumps(json_response, indent=2))
{
    "invoice_id": "Visa \u2022\u2022\u2022\u20224469",
    "date": "2024-10-10T19:49:00",
    "line_items": [
        {"item_name": "Trip fare", "price": 12.18},
        {"item_name": "Access for All Fee", "price": 0.1},
        {"item_name": "CA Driver Benefits", "price": 0.32},
        {"item_name": "Booking Fee", "price": 2.0},
        {"item_name": "San Francisco City Tax", "price": 0.21},
    ],
}

Note that this invoice didn’t have an ID so the LLM has tried its best and used the credit card number. Pydantic validation is not a guarantee!

The raw property of response (somewhat confusingly) contains the Pydantic object itself:

from pprint import pprint

pprint(response.raw)
Invoice(
    invoice_id="Visa ••••4469",
    date=datetime.datetime(2024, 10, 10, 19, 49),
    line_items=[
        LineItem(item_name="Trip fare", price=12.18),
        LineItem(item_name="Access for All Fee", price=0.1),
        LineItem(item_name="CA Driver Benefits", price=0.32),
        LineItem(item_name="Booking Fee", price=2.0),
        LineItem(item_name="San Francisco City Tax", price=0.21),
    ],
)

Note that Pydantic is creating a full datetime object and not just translating a string.

A structured LLM works exactly like a regular LLM class: you can call chat, stream, achat, astream etc. and it will respond with Pydantic objects in all cases. You can also pass in your Structured LLM as a parameter to VectorStoreIndex.as_query_engine(llm=sllm) and it will automatically respond to your RAG queries with structured objects.

The Structured LLM takes care of all the prompting for you. If you want more control over the prompt, move on to Structured Prediction.