Hide navigation sidebar

Hide table of contents sidebar

Toggle site navigation sidebar

LlamaIndex 🦙 0.9.48

Toggle table of contents sidebar

LlamaIndex 🦙 0.9.48

Getting Started

Installation and Setup
How to read these docs
Starter Tutorial
High-Level Concepts
Customization Tutorial
Discover LlamaIndex Video Series

Use Cases

Q&A
Toggle child pages in navigation
- RAG CLI
Chatbots
Agents
Toggle child pages in navigation
- Agents (Putting your RAG Pipeline Together)
  Toggle child pages in navigation
- Agentic Strategies (Optimizing your RAG Pipeline)
  Toggle child pages in navigation
  - Routers
    Toggle child pages in navigation
  - Query Transformations
    Toggle child pages in navigation
    - HyDE Query Transform
    - Multi-Step Query Engine
  - Sub Question Query Engine (Intro)
  - Build your own OpenAI Agent
  - OpenAI Agent with Query Engine Tools
  - Retrieval-Augmented OpenAI Agent
  - OpenAI Agent + Query Engine Experimental Cookbook
  - OpenAI Agent Query Planning
  - Context-Augmented OpenAI Agent
- Agents
  Toggle child pages in navigation
- Tools
  Toggle child pages in navigation
  - Usage Pattern
  - LlamaHub Tools Guide
Structured Data Extraction
Toggle child pages in navigation
- Structured Outputs
  Toggle child pages in navigation
  - Pydantic Program
    Toggle child pages in navigation
  - Query Engines + Pydantic Outputs
    Toggle child pages in navigation
  - Output Parsing Modules
    Toggle child pages in navigation
- Output Parsing Modules
  Toggle child pages in navigation
- Extracting names and locations from descriptions of people
- Extracting album data from music reviews
- Extracting information from emails
Multi-modal
Toggle child pages in navigation

Understanding

Building an LLM application
Using LLMs
Toggle child pages in navigation
- Privacy and Security
Loading Data (Ingestion)
Toggle child pages in navigation
- LlamaHub
- Documents / Nodes
  Toggle child pages in navigation
  - Defining and Customizing Documents
    Toggle child pages in navigation
    - Metadata Extraction Usage Pattern
      Toggle child pages in navigation
  - Defining and Customizing Nodes
  - Transformations
- Node Parser Usage Pattern
  Toggle child pages in navigation
  - Node Parser Modules
- Ingestion Pipeline
  Toggle child pages in navigation
Indexing
Storing
Querying
Putting It All Together
Toggle child pages in navigation
- Q&A patterns
  Toggle child pages in navigation
- Full-Stack Web Application
  Toggle child pages in navigation
  - A Guide to Building a Full-Stack Web App with LLamaIndex
  - A Guide to Building a Full-Stack LlamaIndex Web App with Delphic
- How to Build a Chatbot
- Agents
  Toggle child pages in navigation
- Full-Stack Projects
  Toggle child pages in navigation
Tracing and Debugging
Evaluating
Toggle child pages in navigation
- Cost Analysis
  Toggle child pages in navigation
  - Usage Pattern

Optimizing

Basic Strategies
Toggle child pages in navigation
Advanced Retrieval Strategies
Toggle child pages in navigation
- Query Transform Cookbook
- Query Transformations
  Toggle child pages in navigation
  - HyDE Query Transform
  - Multi-Step Query Engine
- Composable Objects
- DeepMemory (Activeloop)
- Weaviate Vector Store - Hybrid Search
- Pinecone Vector Store - Hybrid Search
Agentic strategies
Toggle child pages in navigation
- Routers
  Toggle child pages in navigation
- Query Transformations
  Toggle child pages in navigation
  - HyDE Query Transform
  - Multi-Step Query Engine
- Sub Question Query Engine (Intro)
- Build your own OpenAI Agent
- OpenAI Agent with Query Engine Tools
- Retrieval-Augmented OpenAI Agent
- OpenAI Agent + Query Engine Experimental Cookbook
- OpenAI Agent Query Planning
- Context-Augmented OpenAI Agent
Evaluation
Toggle child pages in navigation
- End-to-End Evaluation
  Toggle child pages in navigation
- Component Wise Evaluation
  Toggle child pages in navigation
  - BEIR Out of Domain Benchmark
  - HotpotQADistractor Demo
- Evaluating
  Toggle child pages in navigation
  - Usage Pattern (Response Evaluation)
  - Usage Pattern (Retrieval)
  - Modules
    Toggle child pages in navigation
  - Evaluating With LabelledRagDataset’s
    Toggle child pages in navigation
    - Benchmarking RAG Pipelines With A LabelledRagDatatset
    - Downloading a LlamaDataset from LlamaHub
  - Contributing A LabelledRagDataset
    Toggle child pages in navigation
    - LlamaDataset Submission Template Notebook
- Component Wise Evaluation
  Toggle child pages in navigation
  - BEIR Out of Domain Benchmark
  - HotpotQADistractor Demo
- End-to-End Evaluation
  Toggle child pages in navigation
Fine-tuning
Toggle child pages in navigation
Building Performant RAG Applications for Production
Toggle child pages in navigation
Writing Custom Modules
Building RAG from Scratch (Lower-Level)
Toggle child pages in navigation

Module Guides

Models
Toggle child pages in navigation
Prompts
Toggle child pages in navigation
Loading Data
Toggle child pages in navigation
- Data Connectors (LlamaHub)
  Toggle child pages in navigation
  - Usage Pattern
  - Module Guides
    Toggle child pages in navigation
- Documents / Nodes
  Toggle child pages in navigation
  - Defining and Customizing Documents
    Toggle child pages in navigation
    - Metadata Extraction Usage Pattern
      Toggle child pages in navigation
  - Defining and Customizing Nodes
  - Transformations
- Node Parser Usage Pattern
  Toggle child pages in navigation
  - Node Parser Modules
- Ingestion Pipeline
  Toggle child pages in navigation
Indexing
Toggle child pages in navigation
- Using VectorStoreIndex
  Toggle child pages in navigation
- How Each Index Works
- Module Guides
  Toggle child pages in navigation
- Composability
  Toggle child pages in navigation
Storing
Toggle child pages in navigation
Querying
Toggle child pages in navigation
- Query Pipeline
  Toggle child pages in navigation
  - Usage Pattern
  - Module Usage
  - Module Guides
    Toggle child pages in navigation
- Query Engine
  Toggle child pages in navigation
  - Usage Pattern
    Toggle child pages in navigation
    - Response Modes
    - Streaming
  - Module Guides
    Toggle child pages in navigation
  - Supporting Modules
    Toggle child pages in navigation
    - Query Transformations
      Toggle child pages in navigation
      - HyDE Query Transform
      - Multi-Step Query Engine
- Chat Engine
  Toggle child pages in navigation
  - Usage Pattern
  - Module Guides
    Toggle child pages in navigation
- Agents
  Toggle child pages in navigation
- Retriever
  Toggle child pages in navigation
  - Retriever Modes
  - Retriever Modules
    Toggle child pages in navigation
- Response Synthesizer
  Toggle child pages in navigation
  - Response Synthesis Modules
    Toggle child pages in navigation
- Routers
  Toggle child pages in navigation
- Node Postprocessor
  Toggle child pages in navigation
  - Node Postprocessor Modules
    Toggle child pages in navigation
- Structured Outputs
  Toggle child pages in navigation
  - Pydantic Program
    Toggle child pages in navigation
  - Query Engines + Pydantic Outputs
    Toggle child pages in navigation
  - Output Parsing Modules
    Toggle child pages in navigation
Agents
Toggle child pages in navigation
Observability
Toggle child pages in navigation
Evaluating
Toggle child pages in navigation
- Usage Pattern (Response Evaluation)
- Usage Pattern (Retrieval)
- Modules
  Toggle child pages in navigation
- Evaluating With LabelledRagDataset’s
  Toggle child pages in navigation
  - Benchmarking RAG Pipelines With A LabelledRagDatatset
  - Downloading a LlamaDataset from LlamaHub
- Contributing A LabelledRagDataset
  Toggle child pages in navigation
  - LlamaDataset Submission Template Notebook
Supporting Modules
Toggle child pages in navigation
- ServiceContext

API Reference

API Reference
Toggle child pages in navigation
- Agents
- Callbacks
- Composability
- Evaluation
- Example Notebooks
- Finetuning
- Indices
  Toggle child pages in navigation
- LLM Predictors
- LLMs
  Toggle child pages in navigation
- Memory
- Multi-Modal LLMs, Vector Stores, Embeddings, Retriever, and Query Engine
  Toggle child pages in navigation
  - OpenAI
  - Replicate
- Node Postprocessor
- Node
  Toggle child pages in navigation
- Playground
- Prompt Templates
- Querying an Index
  Toggle child pages in navigation
  - Retrievers
    Toggle child pages in navigation
  - Response Synthesizer
  - Query Engines
    Toggle child pages in navigation
  - Chat Engines
    Toggle child pages in navigation
  - Query Bundle
  - Query Transform
- Data Connectors
  Toggle child pages in navigation
- Response
- Service Context
  Toggle child pages in navigation
  - Embeddings
  - OpenAIEmbedding
  - HuggingFaceEmbedding
  - OptimumEmbedding
  - InstructorEmbedding
  - LangchainEmbedding
  - GoogleUnivSentEncoderEmbedding
  - Node Parser
    Toggle child pages in navigation
  - PromptHelper
  - LLMs
    Toggle child pages in navigation
- Embeddings
- OpenAIEmbedding
- HuggingFaceEmbedding
- OptimumEmbedding
- InstructorEmbedding
- LangchainEmbedding
- GoogleUnivSentEncoderEmbedding
- Storage Context
  Toggle child pages in navigation
- Structured Index Configuration

Community

Integrations
Toggle child pages in navigation
Frequently Asked Questions (FAQ)
Toggle child pages in navigation
Full-Stack Projects
Toggle child pages in navigation

Contributing

Contributing to LlamaIndex
Documentation Guide

Changes

ChangeLog
Deprecated Terms

Toggle table of contents sidebar

Multi-Modal LLM using OpenAI GPT-4V model for image reasoning#

In this notebook, we show how to use OpenAI GPT4V MultiModal LLM class/abstraction for image understanding/reasoning.

We also show several functions we are now supporting for OpenAI GPT4V LLM:

complete (both sync and async): for a single prompt and list of images
chat (both sync and async): for multiple chat messages
stream complete (both sync and async): for steaming output of complete
stream chat (both sync and async): for steaming output of chat

!pip install openai matplotlib

Use GPT4V to understand Images from URLs#

import os

OPENAI_API_TOKEN = "sk-"  # Your OpenAI API token here
os.environ["OPENAI_API_TOKEN"] = OPENAI_API_TOKEN

Initialize `OpenAIMultiModal` and Load Images from URLs#

#

from llama_index.multi_modal_llms.openai import OpenAIMultiModal

from llama_index.multi_modal_llms.generic_utils import (
    load_image_urls,
)


image_urls = [
    # "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    # "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    # "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

image_documents = load_image_urls(image_urls)

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=300
)

from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)

https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg

<matplotlib.image.AxesImage at 0x17ef8c7d0>

../../_images/e60e626643c7ddca79139263675d76dd40e491d21e7abadd97e2527f54458d79.png

Complete a prompt with a bunch of images#

complete_response = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

print(complete_response)

The image shows the Colosseum in Rome illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater's multiple arches are vividly lit, contrasting with the dark blue sky in the background. Some construction or restoration work appears to be in progress at the base of the structure, and a few people can be seen walking near the site.

Steam Complete a prompt with a bunch of images#

stream_complete_response = openai_mm_llm.stream_complete(
    prompt="give me more context for this image",
    image_documents=image_documents,
)

for r in stream_complete_response:
    print(r.delta, end="")

This image shows the Colosseum, also known as the Flavian Amphitheatre, which is an iconic symbol of Imperial Rome and is located in the center of Rome, Italy. It is one of the world's most famous landmarks and is considered one of the greatest works of Roman architecture and engineering.

The Colosseum is illuminated at night with the colors of the Italian flag: green, white, and red. This lighting could be for a special occasion or event, such as a national holiday, a cultural celebration, or in solidarity with a cause. The use of lighting to display the national colors is a way to highlight the structure's significance to Italy and its people.

The Colosseum was built in the first century AD under the emperors of the Flavian dynasty and was used for gladiatorial contests and public spectacles such as mock sea battles, animal hunts, executions, re-enactments of famous battles, and dramas based on Classical mythology. It could hold between 50,000 and 80,000 spectators and was used for entertainment in the Roman Empire for over 400 years.

Today, the Colosseum is a major tourist attraction, drawing millions of visitors each year. It also serves as a powerful reminder of the Roman Empire's history and its lasting influence on the world.

Chat through a list of chat messages#

from llama_index.multi_modal_llms.openai_utils import (
    generate_openai_multi_modal_chat_message,
)

chat_msg_1 = generate_openai_multi_modal_chat_message(
    prompt="Describe the images as an alternative text",
    role="user",
    image_documents=image_documents,
)

chat_msg_2 = generate_openai_multi_modal_chat_message(
    prompt="The image is a graph showing the surge in US mortgage rates. It is a visual representation of data, with a title at the top and labels for the x and y-axes. Unfortunately, without seeing the image, I cannot provide specific details about the data or the exact design of the graph.",
    role="assistant",
)

chat_msg_3 = generate_openai_multi_modal_chat_message(
    prompt="can I know more?",
    role="user",
)

chat_messages = [chat_msg_1, chat_msg_2, chat_msg_3]
chat_response = openai_mm_llm.chat(
    # prompt="Describe the images as an alternative text",
    messages=chat_messages,
)

for msg in chat_messages:
    print(msg.role, msg.content)

MessageRole.USER [{'type': 'text', 'text': 'Describe the images as an alternative text'}, {'type': 'image_url', 'image_url': 'https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg'}]
MessageRole.ASSISTANT The image is a graph showing the surge in US mortgage rates. It is a visual representation of data, with a title at the top and labels for the x and y-axes. Unfortunately, without seeing the image, I cannot provide specific details about the data or the exact design of the graph.
MessageRole.USER can I know more?

print(chat_response)

assistant: I apologize for the confusion earlier. The image actually shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is captured in a twilight setting, with the sky transitioning from blue to black. The lighting accentuates the arches and the texture of the stone, creating a dramatic and colorful display. There are some people and a street visible in the foreground, with construction barriers indicating some ongoing work or preservation efforts.

Stream Chat through a list of chat messages#

stream_chat_response = openai_mm_llm.stream_chat(
    # prompt="Describe the images as an alternative text",
    messages=chat_messages,
)

for r in stream_chat_response:
    print(r.delta, end="")

I apologize for the confusion earlier. The image actually shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is captured in a twilight setting, with the sky transitioning from blue to black. The lighting accentuates the arches and the texture of the stone, creating a dramatic and patriotic display. There are a few people visible at the base of the Colosseum, and some construction barriers suggest maintenance or archaeological work may be taking place.

Async Complete#

response_acomplete = await openai_mm_llm.acomplete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

print(response_acomplete)

The image shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater's iconic arches are vividly lit, and the structure stands out against the dark blue evening sky. A few people can be seen near the base of the Colosseum, and there is some construction fencing visible in the foreground.

Async Steam Complete#

response_astream_complete = await openai_mm_llm.astream_complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

async for delta in response_astream_complete:
    print(delta.delta, end="")

The image shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater's iconic arches are vividly lit, and the structure stands out against the dark blue evening sky. Some construction or restoration work appears to be in progress at the base of the Colosseum, indicated by scaffolding and barriers. A few individuals can be seen near the structure, giving a sense of scale to the massive edifice.

Async Chat#

achat_response = await openai_mm_llm.achat(
    messages=chat_messages,
)

print(achat_response)

assistant: I apologize for the confusion in my previous response. Let me provide you with an accurate description of the image you've provided.

The image shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is captured in a moment of twilight, with the sky transitioning from blue to black, highlighting the structure's iconic arches and the illuminated colors. There are some people and a street visible in the foreground, with construction barriers indicating some ongoing work or preservation efforts. The Colosseum's grandeur and historical significance are emphasized by the lighting and the dusk setting.

Async stream Chat#

astream_chat_response = await openai_mm_llm.astream_chat(
    messages=chat_messages,
)

async for delta in astream_chat_response:
    print(delta.delta, end="")

I apologize for the confusion in my previous response. The image actually depicts the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is shown with its iconic arched openings, and the lighting accentuates its grandeur against the evening sky. There are a few people and some construction barriers visible at the base, indicating ongoing preservation efforts or public works.

Complete with Two images#

image_urls = [
    "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    # "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    # "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

image_documents_1 = load_image_urls(image_urls)

response_multi = openai_mm_llm.complete(
    prompt="is there any relationship between those images?",
    image_documents=image_documents_1,
)
print(response_multi)

No, there is no direct relationship between these two images. The first image is an infographic showing the surge in U.S. mortgage rates and its comparison with existing home sales, indicating economic data. The second image is of a person holding a trophy, which seems to be related to a sports achievement or recognition. The content of the two images pertains to entirely different subjects—one is focused on economic information, while the other is related to an individual's achievement in a likely sporting context.

Use GPT4V to understand images from local files#

from llama_index import SimpleDirectoryReader

# put your local directore here
image_documents = SimpleDirectoryReader("./images_wiki").load_data()

response = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("./images_wiki/3.jpg")
plt.imshow(img)

<matplotlib.image.AxesImage at 0x297eec110>

../../_images/7e30f1faca9c19a8e8a56f9d977532a1575d42fad28d1d87a73ccdb46390ff88.png

print(response)

You are looking at a close-up image of a glass Coca-Cola bottle. The label on the bottle features the iconic Coca-Cola logo with additional text underneath it commemorating the 2002 FIFA World Cup hosted by Korea/Japan. The label also indicates that the bottle contains 250 ml of the product. In the background with a shallow depth of field, you can see the blurred image of another Coca-Cola bottle, emphasizing the focus on the one in the foreground. The overall lighting and detail provide a clear view of the bottle and its labeling.

Evaluating Multi-Modal RAG

Multi-Modal LLM using Replicate LlaVa, Fuyu 8B, MiniGPT4 models for image reasoning

Copyright © 2023, Jerry Liu

Made with Sphinx and @pradyunsg's Furo

On this page

Multi-Modal LLM using OpenAI GPT-4V model for image reasoning