Multimodal Ollama Cookbook¶
This cookbook shows how to build different multimodal RAG use cases with LLaVA on Ollama.
- Structured Data Extraction from Images
- Retrieval-Augmented Image Captioning
- Multi-modal RAG
Setup Model¶
!pip install llama-index-multi-modal-llms-ollama
!pip install llama-index-readers-file
!pip install unstructured
!pip install llama-index-embeddings-huggingface
!pip install llama-index-vector-stores-qdrant
!pip install llama-index-embeddings-clip
from llama_index.multi_modal_llms.ollama import OllamaMultiModal
mm_model = OllamaMultiModal(model="llava:13b")
Structured Data Extraction from Images¶
Here we show how to use LLaVa to extract information from an image into a structured Pydantic object.
We can do this via our MultiModalLLMCompletionProgram. It is instantiated with a prompt template, the set of images you want to ask questions over, and the desired output Pydantic class.
Load Data¶
Let's first load an image ad for fried chicken.
from pathlib import Path
from llama_index.core import SimpleDirectoryReader
from PIL import Image
import matplotlib.pyplot as plt
input_image_path = Path("restaurant_images")
if not input_image_path.exists():
    input_image_path.mkdir()
!wget "https://docs.google.com/uc?export=download&id=1GlqcNJhGGbwLKjJK1QJ_nyswCTQ2K2Fq" -O ./restaurant_images/fried_chicken.png
# load as image documents
image_documents = SimpleDirectoryReader("./restaurant_images").load_data()
# display image
imageUrl = "./restaurant_images/fried_chicken.png"
image = Image.open(imageUrl).convert("RGB")
plt.figure(figsize=(16, 5))
plt.imshow(image)
<matplotlib.image.AxesImage at 0x3efdae470>
from pydantic import BaseModel
class Restaurant(BaseModel):
    """Data model for a restaurant."""

    restaurant: str
    food: str
    discount: str
    price: str
    rating: str
    review: str
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
prompt_template_str = """\
{query_str}
Return the answer as a Pydantic object. The Pydantic schema is given below:
"""
mm_program = MultiModalLLMCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(Restaurant),
    image_documents=image_documents,
    prompt_template_str=prompt_template_str,
    multi_modal_llm=mm_model,
    verbose=True,
)
response = mm_program(query_str="Can you summarize what is in the image?")
for res in response:
    print(res)
> Raw output: ```
{
"restaurant": "Buffalo Wild Wings",
"food": "8 wings or chicken poppers",
"discount": "20% discount on orders over $25",
"price": "$8.73 each",
"rating": "",
"review": ""
}
```
('restaurant', 'Buffalo Wild Wings')
('food', '8 wings or chicken poppers')
('discount', '20% discount on orders over $25')
('price', '$8.73 each')
('rating', '')
('review', '')
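Under the hood, the output parser pulls the JSON payload out of the raw (often fenced) model response and validates it against the schema. A minimal stdlib sketch of just the extraction step, with a hypothetical helper name — not the library's actual implementation:

```python
import json
import re


def extract_json_block(raw_output: str) -> dict:
    """Pull the first JSON object out of a possibly fenced LLM response."""
    # Skip any ``` fences by grabbing the outermost {...} span directly.
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


raw = """```
{
    "restaurant": "Buffalo Wild Wings",
    "food": "8 wings or chicken poppers"
}
```"""
parsed = extract_json_block(raw)
print(parsed["restaurant"])  # Buffalo Wild Wings
```

The real parser additionally instantiates the Pydantic class from the parsed dict, which is where field validation happens.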
Retrieval-Augmented Image Captioning¶
Here we show a simple example of a retrieval-augmented image captioning pipeline, expressed via our query pipeline syntax.
!wget "https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1" -O tesla_2021_10k.htm
!wget "https://docs.google.com/uc?export=download&id=1THe1qqM61lretr9N3BmINc_NWDvuthYf" -O shanghai.jpg
from pathlib import Path
from llama_index.readers.file import UnstructuredReader
from llama_index.core.schema import ImageDocument
loader = UnstructuredReader()
documents = loader.load_data(file=Path("tesla_2021_10k.htm"))
image_doc = ImageDocument(image_path="./shanghai.jpg")
from llama_index.core import VectorStoreIndex
from llama_index.core.embeddings import resolve_embed_model
embed_model = resolve_embed_model("local:BAAI/bge-m3")
vector_index = VectorStoreIndex.from_documents(
    documents, embed_model=embed_model
)
query_engine = vector_index.as_query_engine()
from llama_index.core.prompts import PromptTemplate
from llama_index.core.query_pipeline import QueryPipeline, FnComponent
query_prompt_str = """\
Please expand the initial statement using the provided context from the Tesla 10K report.
{initial_statement}
"""
query_prompt_tmpl = PromptTemplate(query_prompt_str)
# MM model --> query prompt --> query engine
qp = QueryPipeline(
    modules={
        "mm_model": mm_model.as_query_component(
            partial={"image_documents": [image_doc]}
        ),
        "query_prompt": query_prompt_tmpl,
        "query_engine": query_engine,
    },
    verbose=True,
)
qp.add_chain(["mm_model", "query_prompt", "query_engine"])
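Conceptually, add_chain wires the modules so each module's output becomes the next module's input. A toy stdlib sketch of that sequential execution (hypothetical run_chain helper, not the QueryPipeline internals):

```python
def run_chain(modules, user_input):
    """Run a list of callables in order, feeding each output into the next."""
    result = user_input
    for module in modules:
        result = module(result)
    return result


# Toy stand-ins for mm_model -> query_prompt -> query_engine.
caption = lambda q: f"caption for: {q}"
prompt = lambda c: f"Expand using the 10K report.\n{c}"
answer = lambda p: f"answer({p!r})"

out = run_chain([caption, prompt, answer], "Which Tesla Factory is shown?")
print(out)
```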
rag_response = qp.run("Which Tesla Factory is shown in the image?")
> Running module mm_model with input:
prompt: Which Tesla Factory is shown in the image?

> Running module query_prompt with input:
initial_statement: The image you've provided is a photograph of the Tesla Gigafactory, which is located in Shanghai, China. This facility is one of Tesla's large-scale production plants and is used for manufacturing el...

> Running module query_engine with input:
input: Please expand the initial statement using the provided context from the Tesla 10K report. The image you've provided is a photograph of the Tesla Gigafactory, which is located in Shanghai, China. Thi...
print(f"> Retrieval Augmented Response: {rag_response}")
> Retrieval Augmented Response: The Gigafactory Shanghai in China is an important manufacturing facility for Tesla. It was established to increase the affordability of Tesla vehicles for customers in local markets by reducing transportation and manufacturing costs and eliminating the impact of unfavorable tariffs. The factory allows Tesla to access high volumes of lithium-ion battery cells manufactured by their partner Panasonic, while achieving a significant reduction in the cost of their battery packs. Tesla continues to invest in Gigafactory Shanghai to achieve additional output. This factory is representative of Tesla's plan to improve their manufacturing operations as they establish new factories, incorporating the learnings from their previous ramp-ups.
rag_response.source_nodes[1].get_content()
'For example, we are currently constructing Gigafactory Berlin under conditional permits in anticipation of being granted final permits. Moreover, we will have to establish and ramp production of our proprietary battery cells and packs at our new factories, and we additionally intend to incorporate sequential design and manufacturing changes into vehicles manufactured at each new factory. We have limited experience to date with developing and implementing manufacturing innovations outside of the Fremont Factory and Gigafactory Shanghai. In particular, the majority of our design and engineering resources are currently located in California. In order to meet our expectations for our new factories, we must expand and manage localized design and engineering talent and resources. If we experience any issues or delays in meeting our projected timelines, costs, capital efficiency and production capacity for our new factories, expanding and managing teams to implement iterative design and production changes there, maintaining and complying with the terms of any debt financing that we obtain to fund them or generating and maintaining demand for the vehicles we manufacture there, our business, prospects, operating results and financial condition may be harmed.\n\nWe will need to maintain and significantly grow our access to battery cells, including through the development and manufacture of our own cells, and control our related costs.\n\nWe are dependent on the continued supply of lithium-ion battery cells for our vehicles and energy storage products, and we will require substantially more cells to grow our business according to our plans. Currently, we rely on suppliers such as Panasonic and Contemporary Amperex Technology Co. Limited (CATL) for these cells. We have to date fully qualified only a very limited number\n\n16\n\nof such suppliers and have limited flexibility in changing suppliers. 
Any disruption in the supply of battery cells from our suppliers could limit production of our vehicles and energy storage products. In the long term, we intend to supplement cells from our suppliers with cells manufactured by us, which we believe will be more efficient, manufacturable at greater volumes and more cost-effective than currently available cells. However, our efforts to develop and manufacture such battery cells have required, and may continue to require, significant investments, and there can be no assurance that we will be able to achieve these targets in the timeframes that we have planned or at all. If we are unable to do so, we may have to curtail our planned vehicle and energy storage product production or procure additional cells from suppliers at potentially greater costs, either of which may harm our business and operating results.\n\nIn addition, the cost of battery cells, whether manufactured by our suppliers or by us, depends in part upon the prices and availability of raw materials such as lithium, nickel, cobalt and/or other metals. The prices for these materials fluctuate and their available supply may be unstable, depending on market conditions and global demand for these materials, including as a result of increased global production of electric vehicles and energy storage products. Any reduced availability of these materials may impact our access to cells and any increases in their prices may reduce our profitability if we cannot recoup the increased costs through increased vehicle prices. Moreover, any such attempts to increase product prices may harm our brand, prospects and operating results.\n\nWe face strong competition for our products and services from a growing list of established and new competitors.\n\nThe worldwide automotive market is highly competitive today and we expect it will become even more so in the future. 
For example, Model 3 and Model Y face competition from existing and future automobile manufacturers in the extremely competitive entry-level premium sedan and compact SUV markets. A significant and growing number of established and new automobile manufacturers, as well as other companies, have entered, or are reported to have plans to enter, the market for electric and other alternative fuel vehicles, including hybrid, plug-in hybrid and fully electric vehicles, as well as the market for self-driving technology and other vehicle applications and software platforms. In some cases, our competitors offer or will offer electric vehicles in important markets such as China and Europe, and/or have announced an intention to produce electric vehicles exclusively at some point in the future. Many of our competitors have significantly greater or better-established resources than we do to devote to the design, development, manufacturing, distribution, promotion, sale and support of their products. Increased competition could result in our lower vehicle unit sales, price reductions, revenue shortfalls, loss of customers and loss of market share, which may harm our business, financial condition and operating results.\n\nWe also face competition in our energy generation and storage business from other manufacturers, developers, installers and service providers of competing energy technologies, as well as from large utilities. Decreases in the retail or wholesale prices of electricity from utilities or other renewable energy sources could make our products less attractive to customers and lead to an increased rate of residential customer defaults under our existing long-term leases and PPAs.'
Multi-Modal RAG¶
We index a set of images and text using a local CLIP embedding model, indexing them jointly via our MultiModalVectorStoreIndex.
NOTE: The current implementation blends both images and text. You can (and maybe should) define separate indexes/retrievers for images and text, letting you use separate embedding/retrieval strategies for each modality.
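To make the blended-vs-separate distinction concrete, here is a toy sketch using made-up score lists in place of real retrievers (all names and scores are hypothetical):

```python
def top_k(scored_results, k=2):
    """Return the k highest-scoring (doc, score) pairs."""
    return sorted(scored_results, key=lambda pair: pair[1], reverse=True)[:k]


# Toy scores from two hypothetical single-modality retrievers.
text_hits = [("tesla_10k_chunk", 0.82), ("wiki_paragraph", 0.55)]
image_hits = [("gigafactory.jpg", 0.91), ("porsche.jpg", 0.40)]

# Blended: one ranked list across modalities (what a joint index gives you).
blended = top_k(text_hits + image_hits, k=2)

# Separate: per-modality top-k, so each modality keeps its own retrieval budget.
separate = {"text": top_k(text_hits, k=1), "image": top_k(image_hits, k=1)}
print(blended, separate)
```

With separate retrievers, a strong image match cannot crowd text results out of the context window, and vice versa.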
!wget "https://drive.usercontent.google.com/download?id=1qQDcaKuzgRGuEC1kxgYL_4mx7vG-v4gC&export=download&authuser=1&confirm=t&uuid=f944e95f-a31f-4b55-b68f-8ea67a6e90e5&at=APZUnTVZ6n1aOg7rtkcjBjw7Pt1D:1707010667927" -O mixed_wiki.zip
!unzip mixed_wiki.zip
!wget "https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1" -O ./mixed_wiki/tesla_2021_10k.htm
Build Multi-Modal Index¶
This is a special index that jointly indexes both text documents and image documents.
We use a local CLIP model to embed images/text.
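CLIP places images and text in a shared embedding space, so cross-modal retrieval reduces to nearest-neighbor search, typically by cosine similarity. A toy sketch with made-up 3-d vectors (real CLIP embeddings are 512-d or larger):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: the caption should land nearer the matching image.
text_vec = [0.9, 0.1, 0.0]       # "a photo of a sports car"
car_image_vec = [0.8, 0.2, 0.1]  # image of a car
dog_image_vec = [0.1, 0.2, 0.9]  # image of a dog

assert cosine_similarity(text_vec, car_image_vec) > cosine_similarity(
    text_vec, dog_image_vec
)
```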
from llama_index.core.indices.multi_modal.base import (
MultiModalVectorStoreIndex,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.embeddings.clip import ClipEmbedding
import qdrant_client
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)
image_embed_model = ClipEmbedding()
# Create the MultiModal index
documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=image_embed_model,
)
# Save it
# index.storage_context.persist(persist_dir="./storage")

# # Load it
# from llama_index.core import load_index_from_storage

# storage_context = StorageContext.from_defaults(
#     vector_store=text_store, persist_dir="./storage"
# )
# index = load_index_from_storage(storage_context, image_store=image_store)
from llama_index.core.prompts import PromptTemplate
from llama_index.core.query_engine import SimpleMultiModalQueryEngine
qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)
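At query time the engine fills {context_str} with the retrieved node text and {query_str} with the user question; the slot-filling itself is equivalent to plain str.format (toy fill below, with made-up context):

```python
qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Toy fill, mirroring what the query engine does with retrieved nodes.
prompt = qa_tmpl_str.format(
    context_str="The Porsche Taycan is a battery electric saloon.",
    query_str="Tell me more about the Porsche",
)
print(prompt)
```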
query_engine = index.as_query_engine(llm=mm_model, text_qa_template=qa_tmpl)
query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)
print(str(response))
The image shows a Porsche sports car displayed at an auto show. It appears to be the latest model, possibly the Taycan Cross Turismo or a similar variant, which is designed for off-road use and has raised suspension. This type of vehicle combines the performance of a sports car with the utility of an SUV, allowing it to handle rougher terrain and provide more cargo space than a traditional two-door sports car. The design incorporates sleek lines and aerodynamic elements typical of modern electric vehicles, which are often associated with luxury and high performance.
from PIL import Image
import matplotlib.pyplot as plt
import os
def plot_images(image_paths):
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            # 3x3 grid, so up to 9 images fit before we stop.
            plt.subplot(3, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            if images_shown >= 9:
                break
# show sources
from llama_index.core.response.notebook_utils import display_source_node
for text_node in response.metadata["text_nodes"]:
    display_source_node(text_node, source_length=200)
plot_images(
    [n.metadata["file_path"] for n in response.metadata["image_nodes"]]
)
Node ID: 3face2c9-3b86-4445-b21e-5b7fc9683adb
Similarity: 0.8281288080117539
Text: === Porsche Mission E Cross Turismo ===
The Porsche Mission E Cross Turismo previewed the Taycan Cross Turismo, and was presented at the 2018 Geneva Motor Show. The design language of the Mission E...
Node ID: ef43aa15-30b6-4f0f-bade-fd91f90bfd0b
Similarity: 0.8281039313464207
Text: The Porsche Taycan is a battery electric saloon and shooting brake produced by German automobile manufacturer Porsche. The concept version of the Taycan, named the Porsche Mission E, debuted at the...