[Beta] Multi-modal models

Concept

Large language models (LLMs) are text-in, text-out. Large Multi-modal Models (LMMs) generalize this beyond the text modality. For instance, models such as GPT-4V allow you to jointly input images and text, and output text.

We’ve included a base MultiModalLLM abstraction to allow for text+image models. NOTE: This naming is subject to change!

Usage Pattern

  1. The following code snippet shows how you can get started using LMMs e.g. with GPT-4V.

from llama_index.multi_modal_llms import OpenAIMultiModal
from llama_index.multi_modal_llms.generic_utils import load_image_urls
from llama_index import SimpleDirectoryReader

# load image documents from a list of URL strings
image_documents = load_image_urls(image_urls)

# or: load image documents from a local directory
image_documents = SimpleDirectoryReader(local_directory).load_data()

# non-streaming; OPENAI_API_TOKEN is your OpenAI API key
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=300
)
response = openai_mm_llm.complete(
    prompt="what is in the image?", image_documents=image_documents
)
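Under the hood, image inputs are typically sent to the OpenAI vision API as base64-encoded data URLs inside the chat message content. The stdlib-only sketch below is illustrative (not LlamaIndex internals); the helper name `build_vision_message` and the sample bytes are hypothetical.

```python
import base64


def build_vision_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Assemble one chat message mixing text and a base64-encoded image,
    in the content-list format the OpenAI vision API expects."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


# illustrative call with fake JPEG bytes
message = build_vision_message("what is in the image?", b"\xff\xd8\xff\xe0fake-jpeg-bytes")
```

Abstractions like OpenAIMultiModal handle this encoding for you; the sketch only shows the payload shape.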
  2. The following code snippet shows how you can build a Multi-Modal Vector Store/Index.

from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.vector_stores import QdrantVectorStore
from llama_index import SimpleDirectoryReader, StorageContext

import qdrant_client

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(vector_store=text_store)

# Create the MultiModal index
documents = SimpleDirectoryReader("./data_folder/").load_data()

index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context, image_vector_store=image_store
)
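Note that the index keeps text and image embeddings in separate stores. The routing decision itself is simple: documents loaded from the directory are bucketed by modality. The sketch below is illustrative (not LlamaIndex internals) and splits file paths by extension, roughly the way SimpleDirectoryReader distinguishes image documents from text documents:

```python
from pathlib import Path

# common image extensions; illustrative, not an exhaustive list
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def split_by_modality(paths):
    """Route file paths to the image or text bucket by extension."""
    buckets = {"image": [], "text": []}
    for p in paths:
        kind = "image" if Path(p).suffix.lower() in IMAGE_EXTS else "text"
        buckets[kind].append(p)
    return buckets


buckets = split_by_modality(["a.txt", "car.png", "notes.md", "photo.JPG"])
```

Each bucket is then embedded with its own model (text embedding vs. image embedding) and written to its own vector store.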
  3. The following code snippet shows how you can use the Multi-Modal Retriever and Query Engine.

from llama_index.multi_modal_llms import OpenAIMultiModal
from llama_index.prompts import PromptTemplate
from llama_index.query_engine import SimpleMultiModalQueryEngine

retriever_engine = index.as_retriever(
    similarity_top_k=3, image_similarity_top_k=3
)

# use the GPT-4V response from step 1 as the retrieval query
retrieval_results = retriever_engine.retrieve(response)

qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

query_engine = index.as_query_engine(
    multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
)

query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)
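For clarity, the text_qa_template substitution above is plain {}-style string formatting: the retrieved node texts are joined into context_str and the user question becomes query_str. A minimal sketch with stand-in retrieved texts (real values come from the retriever):

```python
qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# stand-ins for retrieved node texts
retrieved_texts = [
    "The Porsche 911 is a rear-engined sports car.",
    "It debuted in 1964.",
]

# fill the template the same way a PromptTemplate would
prompt = qa_tmpl_str.format(
    context_str="\n\n".join(retrieved_texts),
    query_str="Tell me more about the Porsche",
)
```

The filled prompt is what the multi-modal LLM finally receives, alongside any retrieved images.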

Legend

  • ✅ = should work fine

  • ⚠️ = sometimes unreliable, may need more tuning to improve

  • 🛑 = not available at the moment

End-to-End Multi-Modal Workflow

The tables below show which LlamaIndex features are currently available for each step of building your own Multi-Modal RAG (Retrieval Augmented Generation) pipeline. You can combine different modules/steps to compose your own Multi-Modal RAG orchestration.

| Query Type | Data Sources for MultiModal Vector Store/Index | MultiModal Embedding | Retriever | Query Engine | Output Data Type |
| --- | --- | --- | --- | --- | --- |
| Text ✅ | Text ✅ | Text ✅ | Top-k retrieval ✅, Simple Fusion retrieval ✅ | Simple Query Engine ✅ | Retrieved Text ✅, Generated Text ✅ |
| Image ✅ | Image ✅ | Image ✅, Image to Text Embedding ✅ | Top-k retrieval ✅, Simple Fusion retrieval ✅ | Simple Query Engine ✅ | Retrieved Image ✅, Generated Image 🛑 |
| Audio 🛑 | Audio 🛑 | Audio 🛑 | 🛑 | 🛑 | Audio 🛑 |
| Video 🛑 | Video 🛑 | Video 🛑 | 🛑 | 🛑 | Video 🛑 |
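The "Top-k retrieval" entries above boil down to ranking stored embeddings by cosine similarity to the query embedding and keeping the k best. A stdlib-only sketch with toy 3-dimensional vectors (real embeddings come from the text/image embedding models, and production stores use approximate nearest-neighbor search rather than a full sort):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query_emb, store, k=2):
    """Return the ids of the k stored embeddings most similar to the query."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query_emb, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# toy embedding store: doc id -> embedding vector
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.05, 0.0], store, k=2)
```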

Multi-Modal LLM Models

These notebooks serve as examples of how to leverage and integrate a Multi-Modal LLM, Multi-Modal embeddings, Multi-Modal vector stores, a Retriever, and a Query Engine to compose a Multi-Modal Retrieval Augmented Generation (RAG) orchestration.

| Multi-Modal Vision Models | Single Image Reasoning | Multiple Images Reasoning | Image Embeddings | Simple Query Engine | Pydantic Structured Output |
| --- | --- | --- | --- | --- | --- |
| GPT4V (OpenAI API) | ✅ | ✅ | 🛑 | ✅ | ✅ |
| CLIP (Local host) | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
| LLaVa (replicate) | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| Fuyu-8B (replicate) | ✅ | 🛑 | 🛑 | ✅ | ✅ |
| ImageBind [To integrate] | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
| MiniGPT-4 | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| CogVLM | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| Qwen-VL [To integrate] | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |

Multi-Modal Vector Stores

The table below lists some vector stores that support Multi-Modal use cases. LlamaIndex's built-in MultiModalVectorStoreIndex supports building separate vector stores for image and text embeddings. MultiModalRetriever and SimpleMultiModalQueryEngine support text-to-text/image and image-to-image retrieval, along with simple ranking fusion functions for combining text and image retrieval results.
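A "simple ranking fusion" of the kind described above can be as basic as merging the two scored result lists and re-sorting by score. The sketch below is illustrative, not the library's implementation; the (id, score) pairs are stand-ins for real retriever output:

```python
def fuse_results(text_results, image_results, top_k=3):
    """Merge (id, score) lists from a text and an image retriever,
    then keep the top_k results overall by score."""
    merged = list(text_results) + list(image_results)
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:top_k]


# stand-in scores; real values come from the text and image retrievers
fused = fuse_results(
    text_results=[("porsche.txt", 0.92), ("history.txt", 0.75)],
    image_results=[("porsche.png", 0.88), ("logo.png", 0.40)],
)
```

One caveat in practice: text and image similarity scores come from different embedding spaces, so a real fusion step may need to normalize scores before ranking them jointly.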

| Multi-Modal Vector Stores | Single Vector Store | Multiple Vector Stores | Text Embedding | Image Embedding |
| --- | --- | --- | --- | --- |
| LlamaIndex self-built MultiModal Index | 🛑 | ✅ | Can be arbitrary text embedding (default is GPT3.5) | Can be arbitrary image embedding (default is CLIP) |
| Chroma | ✅ | 🛑 | CLIP ✅ | CLIP ✅ |
| Weaviate [To integrate] | ✅ | 🛑 | CLIP ✅, ImageBind ✅ | CLIP ✅, ImageBind ✅ |

Evaluation

We support basic evaluation for Multi-Modal LLMs and Retrieval Augmented Generation.