[Beta] Multi-modal models#
Concept#
Large language models (LLMs) are text-in, text-out. Large Multi-modal Models (LMMs) generalize this beyond the text modalities. For instance, models such as GPT-4V allow you to jointly input both images and text, and output text.
We've included a base MultiModalLLM
abstraction to allow for text+image models. NOTE: This naming is subject to change!
Usage Pattern#
- The following code snippet shows how you can get started using LMMs e.g. with GPT-4V.
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader
# load image documents from urls
image_documents = load_image_urls(image_urls)
# load image documents from local directory
image_documents = SimpleDirectoryReader(local_directory).load_data()
# non-streaming
openai_mm_llm = OpenAIMultiModal(
model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=300
)
response = openai_mm_llm.complete(
prompt="what is in the image?", image_documents=image_documents
)
- The following code snippet shows how you can build MultiModal Vector Stores/Index.
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext
import qdrant_client
from llama_index.core import SimpleDirectoryReader
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
# if you only need image_store for image retrieval,
# you can remove text_sotre
text_store = QdrantVectorStore(
client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
vector_store=text_store, image_store=image_store
)
# Load text and image documents from local folder
documents = SimpleDirectoryReader("./data_folder/").load_data()
# Create the MultiModal index
index = MultiModalVectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
)
- The following code snippet shows how you can use MultiModal Retriever and Query Engine.
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate
from llama_index.core.query_engine import SimpleMultiModalQueryEngine
retriever_engine = index.as_retriever(
similarity_top_k=3, image_similarity_top_k=3
)
# retrieve more information from the GPT4V response
retrieval_results = retriever_engine.retrieve(response)
# if you only need image retrieval without text retrieval
# you can use `text_to_image_retrieve`
# retrieval_results = retriever_engine.text_to_image_retrieve(response)
qa_tmpl_str = (
"Context information is below.\n"
"---------------------\n"
"{context_str}\n"
"---------------------\n"
"Given the context information and not prior knowledge, "
"answer the query.\n"
"Query: {query_str}\n"
"Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)
query_engine = index.as_query_engine(
multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
)
query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)
Legend
- ✅ = should work fine
- ⚠️ = sometimes unreliable, may need more tuning to improve
- 🛑 = not available at the moment.
End to End Multi-Modal Work Flow#
The tables below attempt to show the initial steps with various LlamaIndex features for building your own Multi-Modal RAGs (Retrieval Augmented Generation). You can combine different modules/steps together for composing your own Multi-Modal RAG orchestration.
Query Type | Data Sources for MultiModal Vector Store/Index |
MultiModal Embedding |
Retriever | Query Engine |
Output Data Type |
---|---|---|---|---|---|
Text ✅ | Text ✅ | Text ✅ | Top-k retrieval ✅ Simple Fusion retrieval ✅ |
Simple Query Engine ✅ | Retrieved Text ✅ Generated Text ✅ |
Image ✅ | Image ✅ | Image ✅ Image to Text Embedding ✅ |
Top-k retrieval ✅ Simple Fusion retrieval ✅ |
Simple Query Engine ✅ | Retrieved Image ✅ Generated Image 🛑 |
Audio 🛑 | Audio 🛑 | Audio 🛑 | 🛑 | 🛑 | Audio 🛑 |
Video 🛑 | Video 🛑 | Video 🛑 | 🛑 | 🛑 | Video 🛑 |
Multi-Modal LLM Models#
These notebooks serve as examples how to leverage and integrate Multi-Modal LLM model, Multi-Modal embeddings, Multi-Modal vector stores, Retriever, Query engine for composing Multi-Modal Retrieval Augmented Generation (RAG) orchestration.
Multi-Modal Vision Models |
Single Image Reasoning |
Multiple Images Reasoning |
Image Embeddings |
Simple Query Engine |
Pydantic Structured Output |
---|---|---|---|---|---|
GPT4V (OpenAI API) |
✅ | ✅ | 🛑 | ✅ | ✅ |
GPT4V-Azure (Azure API) |
✅ | ✅ | 🛑 | ✅ | ✅ |
Gemini (Google) |
✅ | ✅ | 🛑 | ✅ | ✅ |
CLIP (Local host) |
🛑 | 🛑 | ✅ | 🛑 | 🛑 |
LLaVa (replicate) |
✅ | 🛑 | 🛑 | ✅ | ⚠️ |
Fuyu-8B (replicate) |
✅ | 🛑 | 🛑 | ✅ | ⚠️ |
ImageBind [To integrate] |
🛑 | 🛑 | ✅ | 🛑 | 🛑 |
MiniGPT-4 |
✅ | 🛑 | 🛑 | ✅ | ⚠️ |
CogVLM |
✅ | 🛑 | 🛑 | ✅ | ⚠️ |
Qwen-VL [To integrate] |
✅ | 🛑 | 🛑 | ✅ | ⚠️ |
Multi Modal Vector Stores#
Below table lists some vector stores supporting Multi-Modal use cases. Our LlamaIndex built-in MultiModalVectorStoreIndex
supports building separate vector stores for image and text embedding vector stores. MultiModalRetriever
, and SimpleMultiModalQueryEngine
support text to text/image and image to image retrieval and simple ranking fusion functions for combining text and image retrieval results.
| Multi-Modal
Vector Stores | Single
Vector
Store | Multiple
Vector
Stores | Text
Embedding | Image
Embedding |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------- | --------------------------- | --------------------------------------------------------- | ------------------------------------------------------- |
| LLamaIndex self-built
MultiModal Index | 🛑 | ✅ | Can be arbitrary
text embedding
(Default is GPT3.5) | Can be arbitrary
Image embedding
(Default is CLIP) |
| Chroma | ✅ | 🛑 | CLIP ✅ | CLIP ✅ |
| Weaviate
[To integrate] | ✅ | 🛑 | CLIP ✅
ImageBind ✅ | CLIP ✅
ImageBind ✅ |
Multi-Modal LLM Modules#
We support integrations with GPT4-V, Anthropic (Opus, Sonnet), Gemini (Google), CLIP (OpenAI), BLIP (Salesforce), and Replicate (LLaVA, Fuyu-8B, MiniGPT-4, CogVLM), and more.
Multi-Modal Retrieval Augmented Generation#
We support Multi-Modal Retrieval Augmented Generation with different Multi-Modal LLMs with Multi-Modal vector stores.
Evaluation#
We support basic evaluation for Multi-Modal LLM and Retrieval Augmented Generation.