[Beta] Multi-modal models#

Concept#

Large language models (LLMs) are text-in, text-out. Large Multi-modal Models (LMMs) generalize this beyond text. For instance, models such as GPT-4V allow you to jointly input both images and text and output text.

We've included a base MultiModalLLM abstraction to allow for text+image models. NOTE: This naming is subject to change!

Usage Pattern#

  1. The following code snippet shows how you can get started using LMMs, e.g. with GPT-4V (a streaming variant is sketched after the snippet).
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader

# load image documents from URLs (`image_urls` is a list of image URL strings)
image_documents = load_image_urls(image_urls)

# or load image documents from a local directory
image_documents = SimpleDirectoryReader(local_directory).load_data()

# non-streaming; OPENAI_API_TOKEN holds your OpenAI API key
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=300
)
response = openai_mm_llm.complete(
    prompt="what is in the image?", image_documents=image_documents
)
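
The same model also exposes a streaming interface if you want tokens as they are generated. The snippet below is a minimal sketch, reusing the openai_mm_llm and image_documents objects from above and assuming the streaming call mirrors complete.

# streaming: print response deltas as they arrive
stream_response = openai_mm_llm.stream_complete(
    prompt="what is in the image?", image_documents=image_documents
)
for partial in stream_response:
    print(partial.delta, end="", flush=True)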
  2. The following code snippet shows how you can build a Multi-Modal Vector Store/Index (a simpler in-memory variant is sketched after the snippet).
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext

import qdrant_client

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

# if you only need the image_store for image retrieval,
# you can remove the text_store
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)

storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load text and image documents from local folder
documents = SimpleDirectoryReader("./data_folder/").load_data()
# Create the MultiModal index
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
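
For quick experiments you can skip the explicit vector stores. The snippet below is a minimal sketch assuming that, when no storage_context is supplied, MultiModalVectorStoreIndex falls back to simple in-memory vector stores for both text and images.

# minimal sketch: in-memory index with the default text and image embeddings
documents = SimpleDirectoryReader("./data_folder/").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents)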
  3. The following code snippet shows how you can use the Multi-Modal Retriever and Query Engine (a sketch for inspecting the retrieval results follows the snippet).
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate
from llama_index.core.query_engine import SimpleMultiModalQueryEngine

retriever_engine = index.as_retriever(
    similarity_top_k=3, image_similarity_top_k=3
)

# use the GPT-4V response from step 1 as the retrieval query
retrieval_results = retriever_engine.retrieve(response)

# if you only need image retrieval without text retrieval
# you can use `text_to_image_retrieve`
# retrieval_results = retriever_engine.text_to_image_retrieve(response)

qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

query_engine = index.as_query_engine(
    multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
)

query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)
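
To inspect what the retriever returned, you can separate image hits from text hits. The loop below is a minimal sketch, assuming each retrieval result is a NodeWithScore wrapping either an ImageNode or a text node.

from llama_index.core.schema import ImageNode

retrieved_images = []
retrieved_texts = []
for res in retrieval_results:
    if isinstance(res.node, ImageNode):
        # local path of the retrieved image (may be empty for URL-only images)
        retrieved_images.append(res.node.image_path)
    else:
        retrieved_texts.append(res.node.get_content())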

Legend

  • ✅ = should work fine
  • ⚠️ = sometimes unreliable, may need more tuning to improve
  • 🛑 = not available at the moment

End-to-End Multi-Modal Workflow#

The table below shows the modalities currently supported across LlamaIndex features for building your own Multi-Modal RAG (Retrieval Augmented Generation) pipelines. You can combine different modules/steps to compose your own Multi-Modal RAG orchestration.

| Query Type | Data Sources for Multi-Modal Vector Store/Index | Multi-Modal Embedding | Retriever | Query Engine | Output Data Type |
| ---------- | ----------------------------------------------- | --------------------- | --------- | ------------ | ---------------- |
| Text ✅ | Text ✅ | Text ✅ | Top-k retrieval ✅<br>Simple Fusion retrieval ✅ | Simple Query Engine ✅ | Retrieved Text ✅<br>Generated Text ✅ |
| Image ✅ | Image ✅ | Image ✅<br>Image to Text Embedding ✅ | Top-k retrieval ✅<br>Simple Fusion retrieval ✅ | Simple Query Engine ✅ | Retrieved Image ✅<br>Generated Image 🛑 |
| Audio 🛑 | Audio 🛑 | Audio 🛑 | 🛑 | 🛑 | Audio 🛑 |
| Video 🛑 | Video 🛑 | Video 🛑 | 🛑 | 🛑 | Video 🛑 |

Multi-Modal LLM Models#

These notebooks serve as examples of how to leverage and integrate Multi-Modal LLM models, Multi-Modal embeddings, Multi-Modal vector stores, retrievers, and query engines to compose Multi-Modal Retrieval Augmented Generation (RAG) orchestrations.

| Multi-Modal Vision Models | Single Image Reasoning | Multiple Images Reasoning | Image Embeddings | Simple Query Engine | Pydantic Structured Output |
| ------------------------- | ---------------------- | ------------------------- | ---------------- | ------------------- | -------------------------- |
| GPT4V (OpenAI API) | ✅ | ✅ | 🛑 | ✅ | ✅ |
| GPT4V-Azure (Azure API) | ✅ | ✅ | 🛑 | ✅ | ✅ |
| Gemini (Google) | ✅ | ✅ | 🛑 | ✅ | ✅ |
| CLIP (Local host) | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
| LLaVa (Replicate) | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| Fuyu-8B (Replicate) | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| ImageBind [To integrate] | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
| MiniGPT-4 | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| CogVLM | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| Qwen-VL [To integrate] | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
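
For the models marked as supporting Pydantic structured output, you can ask the multi-modal LLM to fill a Pydantic schema from an image. The snippet below is a minimal sketch using MultiModalLLMCompletionProgram, reusing openai_mm_llm and image_documents from the usage pattern; the ImageDescription schema is a hypothetical example, and the exact from_defaults arguments may differ across versions.

from typing import List

from pydantic import BaseModel

from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.program import MultiModalLLMCompletionProgram


# hypothetical output schema for illustration
class ImageDescription(BaseModel):
    title: str
    objects: List[str]
    summary: str


program = MultiModalLLMCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(output_cls=ImageDescription),
    prompt_template_str="Describe the image and return it as JSON.",
    image_documents=image_documents,
    multi_modal_llm=openai_mm_llm,
)
structured = program()  # an ImageDescription instance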

Multi-Modal Vector Stores#

The table below lists some vector stores that support Multi-Modal use cases. The LlamaIndex built-in MultiModalVectorStoreIndex supports building separate vector stores for image and text embeddings. MultiModalRetriever and SimpleMultiModalQueryEngine support text-to-text/image and image-to-image retrieval, along with simple ranking fusion functions for combining text and image retrieval results.

| Multi-Modal Vector Stores | Single Vector Store | Multiple Vector Stores | Text Embedding | Image Embedding |
| ------------------------- | ------------------- | ---------------------- | -------------- | --------------- |
| LlamaIndex self-built MultiModal Index | 🛑 | ✅ | Can be arbitrary text embedding (default is GPT3.5) | Can be arbitrary image embedding (default is CLIP) |
| Chroma | ✅ | 🛑 | CLIP ✅ | CLIP ✅ |
| Weaviate [To integrate] | ✅ | 🛑 | CLIP ✅<br>ImageBind ✅ | CLIP ✅<br>ImageBind ✅ |
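
Because the built-in index accepts arbitrary text and image embedding models, you can swap in your own. The snippet below is a minimal sketch, assuming the llama-index-embeddings-clip package is installed and that MultiModalVectorStoreIndex.from_documents accepts an image_embed_model argument; it reuses the documents and storage_context objects from the usage pattern above.

from llama_index.embeddings.clip import ClipEmbedding

# sketch: explicitly choose the image embedding model (CLIP is the default)
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=ClipEmbedding(model_name="ViT-B/32"),
)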

Multi-Modal LLM Modules#

We support integrations with GPT-4V, Anthropic (Opus, Sonnet), Gemini (Google), CLIP (OpenAI), BLIP (Salesforce), Replicate (LLaVA, Fuyu-8B, MiniGPT-4, CogVLM), and more.
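
These integrations generally expose the same MultiModalLLM interface, so they can be swapped into the snippets above. The sketch below uses the Gemini integration as an example, assuming the llama-index-multi-modal-llms-gemini package is installed and a Google API key is configured.

from llama_index.multi_modal_llms.gemini import GeminiMultiModal

# sketch: same complete() call as the OpenAI example above
gemini_mm_llm = GeminiMultiModal(model_name="models/gemini-pro-vision")
response = gemini_mm_llm.complete(
    prompt="what is in the image?", image_documents=image_documents
)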

Multi-Modal Retrieval Augmented Generation#

We support Multi-Modal Retrieval Augmented Generation with different Multi-Modal LLMs and Multi-Modal vector stores.

Evaluation#

We support basic evaluation for Multi-Modal LLMs and Multi-Modal Retrieval Augmented Generation.