[Beta] Multi-modal models#
Concept#
Large language models (LLMs) are text-in, text-out. Large Multi-modal Models (LMMs) generalize this beyond text to other modalities. For instance, models such as GPT-4V let you input images and text together and get text as output.
We've included a base `MultiModalLLM` abstraction to allow for text+image models. NOTE: This naming is subject to change!
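To make the idea of a text+image abstraction concrete, here is a minimal sketch in plain Python. It is not the LlamaIndex class: `ImageRef`, `TextImageModel`, `EchoModel`, and `describe` are hypothetical names invented for illustration. Only the shape of the `complete(prompt=..., image_documents=...)` call is taken from the usage pattern on this page.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class ImageRef:
    """Hypothetical stand-in for an image document (not a LlamaIndex class)."""

    url: str


class TextImageModel(Protocol):
    """Hypothetical interface: text plus images in, text out."""

    def complete(self, prompt: str, image_documents: List[ImageRef]) -> str:
        ...


class EchoModel:
    """Dummy model that reports what it received instead of calling an API."""

    def complete(self, prompt: str, image_documents: List[ImageRef]) -> str:
        return f"{prompt} [{len(image_documents)} image(s)]"


def describe(model: TextImageModel, images: List[ImageRef]) -> str:
    # Any object with a matching `complete` method can be passed in here.
    return model.complete(prompt="what is in the image?", image_documents=images)


print(describe(EchoModel(), [ImageRef("https://example.com/a.png")]))
```

Because calling code depends only on the interface, one model can be swapped for another without changing the code around it, which is the usual motivation for a base abstraction like this.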
Usage Pattern#
The following code snippet shows how to get started with LMMs, using GPT-4V as an example.
import os

from llama_index import SimpleDirectoryReader
from llama_index.multi_modal_llms import OpenAIMultiModal
from llama_index.multi_modal_llms.generic_utils import load_image_urls

# Placeholder inputs: replace with your own image URLs, folder, and API key
image_urls = ["https://example.com/image.jpg"]
local_directory = "./images/"
OPENAI_API_TOKEN = os.environ["OPENAI_API_KEY"]

# Option 1: load image documents from urls
image_documents = load_image_urls(image_urls)

# Option 2: load image documents from local directory
# (as written, this overwrites the documents loaded in option 1)
image_documents = SimpleDirectoryReader(local_directory).load_data()

# non-streaming
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=300
)
response = openai_mm_llm.complete(
    prompt="what is in the image?", image_documents=image_documents
)
The following code snippet shows how to build a multi-modal vector store and index.
import qdrant_client
from llama_index import SimpleDirectoryReader, StorageContext
from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.vector_stores import QdrantVectorStore

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

# if you only need image_store for image retrieval,
# you can remove text_store
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load text and image documents from local folder
documents = SimpleDirectoryReader("./data_folder/").load_data()

# Create the MultiModal index
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
The following code snippet shows how to use the multi-modal retriever and query engine.
from llama_index.multi_modal_llms import OpenAIMultiModal
from llama_index.prompts import PromptTemplate
from llama_index.query_engine import SimpleMultiModalQueryEngine

# `index`, `openai_mm_llm`, and `response` come from the snippets above
retriever_engine = index.as_retriever(
    similarity_top_k=3, image_similarity_top_k=3
)

# retrieve more information using the text of the GPT4V response
retrieval_results = retriever_engine.retrieve(response.text)

# if you only need image retrieval without text retrieval
# you can use `text_to_image_retrieve`
# retrieval_results = retriever_engine.text_to_image_retrieve(response.text)

qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

query_engine = index.as_query_engine(
    multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
)

query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)
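To see what the QA template looks like once its placeholders are filled, the sketch below renders the same template text with plain `str.format`. This only mimics the substitution step. How the query engine assembles `context_str` internally is not described on this page, so joining chunks with blank lines is an assumption, and `render_qa_prompt` and the sample chunks are hypothetical.

```python
# Same template text as in the snippet above.
QA_TMPL_STR = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def render_qa_prompt(retrieved_texts, query_str):
    """Join retrieved text chunks and substitute both placeholders."""
    # Assumption: chunks are separated by a blank line.
    context_str = "\n\n".join(retrieved_texts)
    return QA_TMPL_STR.format(context_str=context_str, query_str=query_str)


# Hypothetical retrieved chunks, made up for illustration.
prompt = render_qa_prompt(
    ["First retrieved passage about the Porsche.", "Second retrieved passage."],
    "Tell me more about the Porsche",
)
print(prompt)
```

Printing the rendered prompt is a quick way to check that a custom template reads well before passing it as `text_qa_template`.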
Legend
✅ = should work fine
⚠️ = sometimes unreliable, may need more tuning to improve
🛑 = not available at the moment.
End to End Multi-Modal Work Flow#
The tables below outline the initial steps for building your own Multi-Modal RAG (Retrieval Augmented Generation) pipeline with various LlamaIndex features. You can combine the different modules and steps to compose your own Multi-Modal RAG orchestration.
| Query Type | Data Sources | MultiModal | Retriever | Query | Output |
|---|---|---|---|---|---|
| Text ✅ | Text ✅ | Text ✅ | Top-k retrieval ✅ | Simple Query Engine ✅ | Retrieved Text ✅ |
| Image ✅ | Image ✅ | Image ✅ | Top-k retrieval ✅ | Simple Query Engine ✅ | Retrieved Image ✅ |
| Audio 🛑 | Audio 🛑 | Audio 🛑 | 🛑 | 🛑 | Audio 🛑 |
| Video 🛑 | Video 🛑 | Video 🛑 | 🛑 | 🛑 | Video 🛑 |
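The "Top-k retrieval" step in the table can be illustrated with a small, library-free sketch: score every stored embedding against the query embedding and keep the k best, separately for text and images. The toy vectors, ids, and helper names are hypothetical. The sketch also assumes the text query and the images are embedded in a shared space (as with a CLIP-style model), and it uses cosine similarity, which may differ from what a given vector store does internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, embeddings, k):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_embedding, emb))
        for node_id, emb in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
text_embeddings = {
    "text_1": [1.0, 0.0, 0.0],
    "text_2": [0.7, 0.7, 0.0],
    "text_3": [0.0, 0.0, 1.0],
}
image_embeddings = {
    "img_1": [0.0, 1.0, 0.0],
    "img_2": [0.9, 0.1, 0.0],
}

query = [1.0, 0.0, 0.0]
text_results = top_k(query, text_embeddings, k=2)
image_results = top_k(query, image_embeddings, k=1)
print(text_results)
print(image_results)
```

Keeping separate k values for text and images mirrors the `similarity_top_k` and `image_similarity_top_k` arguments used in the retriever snippet earlier on this page.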
Multi-Modal LLM Models#
These notebooks show how to combine multi-modal LLMs, multi-modal embeddings, multi-modal vector stores, retrievers, and query engines into a Multi-Modal Retrieval Augmented Generation (RAG) orchestration.
| Multi-Modal | Single | Multiple | Image | Simple | Pydantic |
|---|---|---|---|---|---|
| GPT4V | ✅ | ✅ | 🛑 | ✅ | ✅ |
| GPT4V-Azure | ✅ | ✅ | 🛑 | ✅ | ✅ |
| Gemini | ✅ | ✅ | 🛑 | ✅ | ✅ |
| CLIP | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
| LLaVa | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| Fuyu-8B | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| ImageBind | 🛑 | 🛑 | ✅ | 🛑 | 🛑 |
|  | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
|  | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
| Qwen-VL | ✅ | 🛑 | 🛑 | ✅ | ⚠️ |
Multi Modal Vector Stores#
The table below lists some vector stores that support multi-modal use cases. LlamaIndex's built-in `MultiModalVectorStoreIndex` supports building separate vector stores for image and text embeddings. `MultiModalRetriever` and `SimpleMultiModalQueryEngine` support text-to-text/image and image-to-image retrieval, as well as simple ranking fusion functions for combining text and image retrieval results.
| Multi-Modal | Single | Multiple | Text | Image |
|---|---|---|---|---|
|  | 🛑 | ✅ | Can be arbitrary | Can be arbitrary |
|  | ✅ | 🛑 | CLIP ✅ | CLIP ✅ |
| Weaviate | ✅ | 🛑 | CLIP ✅ | CLIP ✅ |
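As an illustration of what a simple ranking fusion function for text and image retrieval results might look like, here is a sketch of reciprocal rank fusion (RRF), a common general-purpose technique. This page does not say which fusion functions LlamaIndex actually implements, so treat RRF as a swapped-in example. The function name, the ids, and the constant `k=60` (a conventional default for RRF) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, node_id in enumerate(ranked, start=1):
            scores[node_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))


# Hypothetical source-document ids, best match first, from two retrievers.
text_hits = ["doc_a", "doc_b", "doc_c"]
image_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([text_hits, image_hits])
print(fused)
```

RRF uses only rank positions, so it does not need text and image similarity scores to be on the same scale, which is convenient when the two stores use different embedding models.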
Multi-Modal LLM Modules#
We support integrations with GPT4-V, CLIP (OpenAI), BLIP (Salesforce), Replicate (LLaVA, Fuyu-8B, MiniGPT-4, CogVLM), and more.
- Multi-Modal LLM using OpenAI GPT-4V model for image reasoning
- Multi-Modal LLM using Google's Gemini model for image understanding and building Retrieval Augmented Generation with LlamaIndex
- Multi-Modal LLM using Replicate LLaVa, Fuyu 8B, MiniGPT4 models for image reasoning
- Multi-Modal GPT4V Pydantic Program
- GPT4-V Experiments with General, Specific questions and Chain Of Thought (COT) Prompting Technique.
- Retrieval-Augmented Image Captioning
Multi-Modal Retrieval Augmented Generation#
We support Multi-Modal Retrieval Augmented Generation with different Multi-Modal LLMs and Multi-Modal vector stores.
- Advanced Multi-Modal Retrieval using GPT4V and Multi-Modal Index/Retriever
- Multi-Modal on PDFs with tables.
- Multi-Modal Retrieval using GPT text embedding and CLIP image embedding for Wikipedia Articles
- Image to Image Retrieval using CLIP embedding and image correlation reasoning using GPT4V
- Chroma Multi-Modal Demo with LlamaIndex
Evaluation#
We support basic evaluation for Multi-Modal LLM and Retrieval Augmented Generation.