Exploring Opportunities for Retailers with LLaVa: The Next Frontier in AI Assistance — Multi-Modality

Sathish Gangichetty
7 min read · Mar 18, 2024


In the ever-evolving landscape of retail, staying ahead means embracing innovation, especially in the way we communicate and engage with customers. Enter multi-modal chat apps — a game-changing technology that is revolutionizing the retail experience. In this blog post, we’re going to delve into the dynamic world of multi-modal chat applications and explore their transformative impact on the retail sector.

From small businesses to global retail giants, the integration of chat apps is not just a trend, but a strategic move towards more personalized, efficient, and immersive shopping experiences. We will uncover how the models behind these tools can be extended beyond text into vision. The point of this blog post is to quickly explore a few of these opportunities using Llava, a multi-modal model.

Before we start, here's a quick disclaimer: the rest of this blog post is intentionally written to keep the entry barrier low and to focus on the value to be had, through a semi-technical lens. With that, let's get started with the first potential application.

Embracing multi-modality unlocks both deeper insight and new operational execution abilities.

1. Simply Explaining an Image or a Video

To get started with Llava quickly, I recommend using the existing integration in the transformers library.

Here’s a quick snippet showing how you can use the integration for our first simple application — explaining the content of an image.

from PIL import Image
import requests
from rich import print
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the model onto the available GPU(s); the processor handles text + image preprocessing
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt = "<image>\nUSER: Explain the content of the image?\nASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device="cuda")
# Generate (max_new_tokens caps only the reply, not the prompt plus image tokens)
generate_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
Result that Llava-v-1.5 generated
----
ASSISTANT: The image depicts a busy city street with a stop sign
prominently displayed at an intersection. A black car is driving down the street,
and there are several people walking around, some of them carrying handbags.
The scene also features a lion statue, adding a unique element to the urban environment.
The street is lined with buildings, and there are a few potted plants placed along the sidewalk.
The overall atmosphere of the scene is bustling and vibrant.
----

You can see for yourself that this is indeed what the image shows by opening the URL above. Obviously, this could be super useful in scenarios where a customer uploads an image without much context, which is especially true on communication mediums that allow it (chat, email, social media, etc.).
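Every example in this post follows the same pattern: build a prompt, fetch an image, run generate, decode the reply. In your own experiments you may find it handy to wrap those steps in a small helper. This is just a convenience sketch of my own that reuses the model, processor, and imports loaded in the snippet above; the function name and defaults are assumptions, not part of the transformers API.

def ask_llava(image_url: str, question: str, max_new_tokens: int = 200) -> str:
    """Fetch an image, ask Llava a question about it, and return only the assistant's reply."""
    prompt = f"<image>\nUSER: {question}\nASSISTANT:"
    image = Image.open(requests.get(image_url, stream=True).raw)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    reply = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
    # Keep only the assistant's answer, dropping the echoed prompt
    return reply.split("ASSISTANT:")[-1].strip()

print(ask_llava("https://www.ilankelman.org/stopsigns/australia.jpg", "Explain the content of the image."))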

2. Use in Personalized Multi-Modal Q&A Bots

Just like other foundation model demos or fancy know-it-all chatbots (aka ChatGPT), Llava by itself may not be that compelling to the average businessperson. However, as we've seen with RAG, coupling it with other data that is relevant to the customer makes it super-duper valuable. For example, if we know that a customer is very quality conscious and doesn't normally mind spending a few additional dollars for a good product, the chatbot can be made to accelerate the customer's decision by producing persuasive content, especially when the customer reveals their likes and dislikes on other channels such as reviews, social, or chat. Look at the example below of how a simple Q&A session with an image can be augmented with customer-specific data and a recommendation via the prompt that gets fed to Llava.

Sample prompt for the personalization use case, where the additional customer context could come from a retriever and/or another recommender model:
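# Illustrative prompt only -- the exact wording, the customer profile, and the
# recommended item name are assumptions pieced together from the reply shown below.
# In practice, the customer context and the candidate product would come from a
# retriever and/or a recommender model rather than being hard-coded like this.
prompt = """<image>\nUSER: You are a helpful shopping assistant.
What we know about this customer: they mostly buy sweatshirts, research products
carefully, and are quality conscious rather than price sensitive.
Recommended item from our recommender: Rovis Jeans Slim Fit.
The customer is looking at the product in the image. Answer their question and,
where it helps, nudge them towards the recommended item.\nASSISTANT:"""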
url = "https://richmedia.ca-richimage.com/ImageDelivery/imageService?profileId=12026540&id=1859027&recipeId=728"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# Generate
generate_ids = model.generate(**inputs, max_length=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
result = output[0]
print(result.split("\nASSISTANT:")[1].strip())
Result that Llava-v-1.5 generated
----
Based on your preference for sweatshirts and the quality-conscious nature of your research,
I would recommend the Rovis Jeans Slim Fit. These jeans are known for their slim fit,
which can complement a sweatshirt well. They are made of high-quality denim,
ensuring durability and a long-lasting appearance. The slim fit design also offers a
modern and stylish look, which can be paired with various sweatshirts and other casual clothing items.
Additionally, the Rovis Jeans are reasonably priced, making them an excellent choice for someone who values both
style and affordability.
----

Doing this allows retailers to quickly close the research gap while answering the question through the lens of what matters to the customer, selling by aligning with the customer's values instead of looking ONLY at their transactional behavior. In short, this opens up opportunities for incredible hyper-personalization. This use case also highlights the need to double down on data and AI platforms that let you bring a rich mix of traditional ML and gen AI models together and have them work in concert, in real time, at scale.

3. JSON-Forming for Data Pipelines

This is all great, until you look at it through the data engineering lens. A typical data engineer might not really care about the chatbot-ish use cases of Llava or models like it. Instead, they could be looking at curating data from alternate modalities, in this case images. Say, for example, we continued along the lines of the previous use case, but instead of responding to the user, we cared about what they were looking at so we can tease out what interested them about that specific image. We could extract JSON from the image like so, using the sample prompt shown below.

prompt = """<image>\nUSER: Explain the clothing, specifically the jeans? 
What is it? The color and the sex of the person wearing it? and the material?
Only return JSON for the above items and nothing else\nASSISTANT:"""

url = "https://richmedia.ca-richimage.com/ImageDelivery/imageService?profileId=12026540&id=1859027&recipeId=728"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# Generate
generate_ids = model.generate(**inputs, max_length=100)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
result = output[0]
print(result.split("\nASSISTANT:")[1].strip())
Result that Llava-v-1.5 generated
----
{
"color": "blue",
"sex": "male",
"material": "denim"
}
----

This is great, because now you can simply extract these properties as structured columns and, for example, run a text2sql agent on top of them.
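As a minimal sketch of that last step, here is how the model's reply could be parsed into a structured record before it ever reaches a table. The helper name, the fallback behavior, and the row layout are my own assumptions, not part of Llava or transformers.

import json
from typing import Optional

def parse_attributes(raw_reply: str) -> Optional[dict]:
    """Keep only the text after 'ASSISTANT:' and parse it as JSON; return None if malformed."""
    payload = raw_reply.split("ASSISTANT:")[-1].strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # e.g. route to a retry/repair queue instead of the pipeline

# Using the `result` string produced by the snippet above:
row = {"image_url": url, **(parse_attributes(result) or {})}
print(row)  # -> {'image_url': '...', 'color': 'blue', 'sex': 'male', 'material': 'denim'}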

4. Augmented Anomaly Detection

When a customer submits a complaint with an image of what’s wrong with a product, a multi-modal model can be used to interpret the issue described by analyzing the product image, thereby identifying potential defects or damages like tears in clothing items. This automated analysis can significantly improve response times to customer complaints, streamline the return or exchange process, and contribute to quality control by flagging products that frequently result in complaints. Consequently, this can lead to enhanced customer satisfaction, improved product quality, and potentially increased brand loyalty, which are critical components of maintaining a competitive edge in the retail market. This automation ultimately aids in reducing the workload on customer service teams, allowing them to focus on complex issues that require human intervention, thus optimizing operational efficiency. Here’s a sample demonstrating exactly this!

prompt = """<image>\nUSER: Explain the problem in this picture with the product. 
A customer complained about this product\nASSISTANT:"""

url = "https://m.media-amazon.com/images/I/51a94AxNRPL.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# Generate
generate_ids = model.generate(**inputs, max_length=200)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
result = output[0]
print(result.split("\nASSISTANT:")[1].strip())
Result that Llava-v-1.5 generated
----
In the image, there is a pair of jeans with a noticeable tear or rip in the back pocket. This damage to the jeans
is likely the result of wear and tear or an accidental tear. The customer might have complained about the product
due to the unsightly appearance of the damaged jeans, as it can be perceived as a sign of poor quality or a
manufacturing defect. The rip in the jeans may also affect the functionality and durability of the product, which
could be a concern for the customer.
----

Of course, this output can be paired with JSON forming and used to quickly execute other use cases downstream. In fact, stuff like this is a gold mine for any operational excellence initiative.
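As a rough sketch of what that pairing could look like, you could send the same complaint image through a second, JSON-only prompt, reusing the ask_llava helper and parse_attributes function from earlier. The prompt wording and the field names below are my own assumptions.

complaint_image = "https://m.media-amazon.com/images/I/51a94AxNRPL.jpg"
defect_reply = ask_llava(
    complaint_image,
    'A customer complained about this product. Describe the defect. Return only JSON '
    'with the keys "product_type", "defect_type" and "location", nothing else.',
    max_new_tokens=100,
)
# parse_attributes tolerates replies with or without the "ASSISTANT:" prefix
complaint_record = {"image_url": complaint_image, **(parse_attributes(defect_reply) or {})}
print(complaint_record)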

5. Vision-Augmented Stock-Out Prediction

These multi-modal models provide a way to automatically monitor inventory levels through image analysis. By processing images of store shelves, the system can quickly determine if there is an out-of-stock situation, allowing retailers to address inventory shortages promptly. This capability is critical for maintaining optimal stock levels, ensuring that customer demands are met, and avoiding lost sales opportunities due to stockouts. Additionally, it can inform inventory management systems, trigger restocking procedures, and even guide dynamic pricing strategies. For businesses, the immediate benefit is the potential increase in operational efficiency and customer satisfaction, while the long-term value lies in the data collected on stock levels which can drive strategic decision-making and improve supply chain management. Like other use cases we explored previously, see a quick sample of how this is done below.

prompt = """<image>\nUSER: Is there an out of stock situation? 
What type of product seems to be out of stock? (food/clothing/appliances/durables)\nASSISTANT:"""

url = "https://assets.eposnow.com/public/content-images/pexels-roy-broo-empty-shelves-grocery-items.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# Generate
generate_ids = model.generate(**inputs, max_length=200)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
result = output[0]
print(result.split("\nASSISTANT:")[1].strip())
Result that Llava-v-1.5 generated
----
Yes, there is an out of stock situation in the image.
The shelves are empty, and there are no items available for purchase.
The type of product that seems to be out of stock is food,
as the shelves are empty and there are no visible food items..
----
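To give a feel for how this could feed an operational workflow, here is a minimal sketch that loops over a list of shelf-photo URLs and queues up the ones the model flags as empty. It reuses the ask_llava helper from earlier; the list of URLs, the function name, and the naive yes/no check are illustrative assumptions rather than a production design.

def shelf_is_empty(image_url: str) -> bool:
    """Naive check: ask Llava for a yes/no answer about the shelf in the photo."""
    reply = ask_llava(image_url, "Is there an out of stock situation? Answer yes or no.", max_new_tokens=10)
    return reply.lower().startswith("yes")

# Hypothetical feed of shelf photos, e.g. from store cameras or associate uploads
shelf_photos = [
    "https://assets.eposnow.com/public/content-images/pexels-roy-broo-empty-shelves-grocery-items.jpg",
]
restock_queue = [u for u in shelf_photos if shelf_is_empty(u)]
print(restock_queue)  # shelves that likely need a restocking task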

There you have it! We quickly explored five different multi-modal use cases using a state-of-the-art model (Llava) and the value it offers retailers. Of course, this is just a drop in the bucket, and a lot more can be done with these types of models. That said, this is all for this blog post. If you're interested in seeing how this can be done at scale on a Data Intelligence Platform like Databricks, that's coming up next!

If you thought this was valuable or want to connect, please reach out to me on LinkedIn.


Sathish Gangichetty

I’m someone with a deep passion for human-centered AI. A lifelong student. Currently work @ Databricks.