GPT-4 Vision, OpenAI’s advanced multi-modal AI, has the potential to revolutionize user interactions by integrating text and image processing. This synergy can unlock new applications and significantly enhance the user experience. Yet the cost of using GPT-4 Vision is steep, and privacy concerns might prevent organizations from sending their sensitive data to a third-party API.
Fortunately, there are open-source multi-modal large language models (LLMs) and visual language models (VLMs). These models can be deployed on private servers, reducing costs and ensuring data privacy.
But it’s important to recognize the limitations of these open-source models when choosing the right solution for your needs.
Large Language and Vision Assistant (LLaVA)
LLaVA 1.5 stands out as the leading open-source multi-modal LLM, acclaimed for its performance on various multi-modal benchmarks and visual question-answering tasks. It combines a CLIP vision encoder with a LLaMA-based language model to process image and text data. Not only is LLaVA 1.5 highly capable, but it is also remarkably efficient and runs on a single GPU.
The model’s training is impressively fast. Its developers report that it can be fully trained in approximately one day on a single 8-A100 node, incurring just a few hundred dollars in costs. The pre-trained LLaVA models are available on the Hugging Face platform.
You can also test the model online. Since LLaVA 1.5’s training data includes examples generated with GPT-4, its use is limited to non-commercial purposes.
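If you want to run LLaVA 1.5 yourself, the Hugging Face transformers library has built-in support for it. Below is a minimal sketch; the llava-hf/llava-1.5-7b-hf checkpoint and the local image path are my own example choices, not something prescribed by the LLaVA team:
# Minimal sketch: caption an image with LLaVA 1.5 through Hugging Face transformers.
# Assumes the llava-hf/llava-1.5-7b-hf checkpoint and a GPU with enough memory;
# the local image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda:0")

image = Image.open("macarons.jpg")  # hypothetical local image file
prompt = "USER: <image>\nWrite a caption for this image ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0", torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))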
Fuyu
Fuyu is an 8-billion-parameter multi-modal LLM developed by Adept. Fuyu’s uniqueness lies in its architecture: it has no separate components for image and text processing, just a single decoder-only transformer. This design allows Fuyu to split the input image into patches and feed them to the transformer much like text tokens.
This architecture brings two significant advantages. First, it makes Fuyu very fast, enabling it to deliver responses in under 100 milliseconds without compromising on quality. Second, Fuyu is flexible with image resolution. Unlike models that require downsampling, Fuyu can handle images at their original resolution, provided the resulting patches fit within its context window.
Fuyu is available for download on Hugging Face. Adept’s website offers additional insights into Fuyu, detailing its capabilities and potential applications.
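Loading Fuyu follows the same pattern, since transformers ships a dedicated FuyuProcessor and FuyuForCausalLM. Here is a rough sketch that loosely follows the model card; the image path and prompt are my own placeholders:
# Rough sketch: run adept/fuyu-8b with Hugging Face transformers.
# The image path and prompt are placeholder examples.
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

image = Image.open("bus.png")  # hypothetical local image
prompt = "Generate a coco-style caption.\n"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=30)
# Decode only the newly generated tokens, not the prompt and image placeholder tokens.
print(processor.batch_decode(output[:, -30:], skip_special_tokens=True)[0])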
CogVLM
Tsinghua University researchers have developed CogVLM, a 17-billion-parameter visual language model. CogVLM’s architecture is similar to LLaVA’s: a vision transformer encodes the image, and the resulting embeddings are concatenated with the text token embeddings.
CogVLM performs well on a range of vision-language tasks as detailed in its research paper. It works well for question-answering, multi-step reasoning, and even programming tasks that require visual input.
The researchers have also developed CogAgent, an iteration of CogVLM designed for creating autonomous agents. These agents can interact with GUIs, opening up new avenues for automation and user interaction. If you want to try CogVLM, a web demo is available on the project’s GitHub page.
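Because the CogVLM checkpoints on Hugging Face ship their own modeling code, loading them locally goes through trust_remote_code. The sketch below follows the loading pattern published in the project’s documentation, but the checkpoint name, tokenizer, and preprocessing helper are assumptions on my part and may not match the repo’s current instructions:
# Rough sketch of loading CogVLM with custom modeling code (trust_remote_code).
# The checkpoint, tokenizer, and build_conversation_input_ids helper are assumptions
# based on the project's published example and may differ from the current repo.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

image = Image.open("screenshot.png").convert("RGB")  # hypothetical input image
features = model.build_conversation_input_ids(
    tokenizer, query="Describe this image", history=[], images=[image]
)
inputs = {
    "input_ids": features["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": features["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": features["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[features["images"][0].to("cuda").to(torch.bfloat16)]],
}
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))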
Having fun with GPT-4V and LLaVA 1.5
While benchmark results are impressive and worth following, it is important to experiment with models before integrating them into your applications. GPT-4 Vision is clearly the most capable model, but there are a few things you can do to probe the limits of the others. Ideally, you should create a test dataset for your specific application that represents the scope of examples the model will face.
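Here is a minimal sketch of what such a side-by-side test harness could look like; the example images, prompts, and the per-model generate functions are placeholders of my own, not code from any of the projects above:
# Minimal sketch of a side-by-side evaluation harness for multi-modal models.
# The test examples and per-model generate functions are placeholders.
def run_eval(models, test_set):
    results = []
    for example in test_set:
        row = {"image": example["image"], "prompt": example["prompt"]}
        for name, generate in models.items():
            # generate(image_path, prompt) -> model response (your own inference code)
            row[name] = generate(example["image"], example["prompt"])
        results.append(row)
    return results

test_set = [
    {"image": "macarons.jpg", "prompt": "Write a simple caption for this image"},
    {"image": "product_page.png", "prompt": "Extract the product name, brand, and price in JSON"},
    {"image": "receipt.jpg", "prompt": "Can you provide the data contained in this receipt in json format?"},
]
# models = {"llava-1.5": llava_generate, "gpt-4v": gpt4v_generate}
# for row in run_eval(models, test_set):
#     print(row)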
Following are some of my own experiments. Note that with GPT-4V, it’s very difficult to tell whether the data you’re testing the model on has been included in the training set or not.
I did a little comparison between GPT-4V and LLaVA 1.5 on a handful of examples, ranging from simple to complicated. First, I started with a simple task: writing a caption for an image of macarons in a wooden bowl. I tried both models with the temperature set to 0.1 and 1.0. Captioning is an important capability if you want to generate captions for a large image dataset.
Prompt: Write a caption for this image
Temperature: 0.1
LLaVA: A wooden bowl filled with various flavored macaroon cookies.
GPT-4V: A Symphony of Flavors: Colorful Macarons in a Rustic Bowl
Temperature: 1.0
LLaVA: A bowl full of four different colors of macaroons.
GPT-4V: Earthy Elegance: A Bouquet of Macarons
Interestingly, I think LLaVA’s responses were better. But it is worth noting that this is not a thorough test. As I said, you must choose the right kind of images and prompts. For example, when I changed my prompt to “Write a simple caption for this image,” both models provided good results. LLaVA is much more cost-efficient, but GPT-4V tends to perform better on complex images. So you’ll have to run a few tests to see which model is better suited to the kind of images you want to caption.
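For reference, this is roughly how the GPT-4V side of these tests looks with the OpenAI Python SDK (v1.x); the model name and image URL are example values on my part and may need to be adjusted for your account:
# Sketch: caption an image with GPT-4 Vision through the OpenAI Python SDK (v1.x).
# The model name and image URL are example values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    temperature=0.1,
    max_tokens=100,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write a simple caption for this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/macarons.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)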
Next, I ran a more complicated test, prompting the models to extract specific data points from the screenshot of a product page on Amazon. This time, I used the lowest temperature to reduce randomness.
Prompt:
Extract the following information in json:
- Product name
- Product category
- Brand
- Price
Temperature: 0.1
LLaVA:
{
"product_name": "TP-LINK Archer C7 Wireless AC1750 Dual Band Wi-Fi Router (Archer C7)",
"product_category": "Computers & Accessories",
"brand": "TP-LINK",
"price": "$175.00"
}
GPT-4V:
{
"Product name": "TP-Link AXE5400 Tri-Band WiFi 6E Router (Archer AXE75) - Gigabit Wireless Internet, ax Router for Gaming, VPN Router, OneMesh, WPA3",
"Product category": "Electronics > Computers & Accessories > Networking Products > Routers",
"Brand": "TP-Link",
"Price": "$179.99"
}
LLaVA got the JSON structure right but hallucinated much of the information except for the brand. GPT-4V got it right. My next experiment was to provide the models with a data-rich image and ask them to extract and structure the information in JSON format. I grabbed a random receipt online and gave it to the models.
Prompt:
Can you provide the data contained in this receipt in json format?
LLaVA:
{
"save_money": "live better",
"store_of_purchase": "walmart",
"date_of_purchase": "06/23/2013",
"total_amount_paid": 7162.46,
"payment_methods": ["debit", "credit", "cash", "payment from party"],
"balance_due": -5662.55,
"shipping_charge": 326.0,
"sales_tax": -1672.46,
"cash_back": 1672.46,
"total_savings_with_cash_back": -6796.58,
"total_savings_only_cash_back": 0.52,
"employee_signature": "Shawna Yoho",
"return_price":
{
"price": 0.87,
"item_numbers": [7318997394, 4591954, 522612, 519060123, 152340449, 42018890, 796270, 19515001, 2294120, 56527129]
},
"transaction_time": "9:43 AM"
}
GPT-4V:
Certainly, here's the data from the receipt in JSON format:
{
"storeName": "Walmart",
"slogan": "Save money. Live better.",
"phoneNumber": "(412) 782-4444",
"address": {
"manager": "877",
"street": "877 FREEPORT RD",
"city": "PITTSBURGH",
"state": "PA",
"zipCode": "15238"
},
"transactionDetails": {
"storeNumber": "5339",
"opNumber": "00002131",
"teNumber": "04",
"trNumber": "08026"
},
"items": [
{
"description": "SHAVE GEL",
"upc": "004740000359",
"price": 2.97,
"quantity": "X"
},
{
"description": "COUPON",
"upc": "054700003590",
"price": -2.97,
"quantity": "O"
}
],
"summary": {
"subtotal": 2.97,
"tax1": 0.42,
"total": 3.39
},
"payment": {
"method": "DEBIT TEND",
"amount": 3.39,
"debitCashBack": 40.00,
"totalDebitPurchase": 43.39,
"changeDue": -40.00
},
"eftDetails": {
"eftDebit": 3.39,
"cashBack": 40.00,
"totalPurchase": 43.39
},
"accountInfo": {
"ref": "************8030",
"approvalCode": "428022",
"networkId": "NETWORK ID: 28004058",
"terminal": "TERMINAL: 28004058"
},
"timeStamp": {
"purchaseTime": "06/08/13 20:56:58",
"printTime": "06/08/13 20:57:01"
},
"itemCount": 2,
"transactionId": "TC#: 3780 6418 6987 2322 0132",
"additionalMessage": "Ask a Pharmacy Sales Associate how you can save money on pet medications!"
}
Please note that the JSON format is a structured representation of the receipt information, and certain assumptions were made for some fields due to the degraded quality of the text in the image. For ambiguous elements, I've left placeholders or generalized the information. Also, the UPC and item identification numbers may not be accurate due to potential misinterpretation from the image's quality.
LLaVA obviously struggled with extracting the information. GPT-4V did a better job, getting most of the details right, though it also returned some non-JSON text, a problem that can be mitigated with a better prompt. Interestingly, in one of my experiments, GPT-4V responded with the results plus an explanation of how the cash-back policy works.
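If you can’t fully suppress the extra text through prompting, a small post-processing step can salvage the JSON. This is a generic heuristic of my own, not something specific to either model:
# Sketch: pull the JSON object out of a response that wraps it in extra prose.
# Simple heuristic: take everything between the first "{" and the last "}".
import json

def extract_json(response_text: str) -> dict:
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the response")
    return json.loads(response_text[start:end + 1])

reply = 'Certainly, here is the data in JSON format: {"storeName": "Walmart", "total": 3.39}'
print(extract_json(reply)["total"])  # 3.39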
Finally, I tried a very complicated example. I grabbed a screenshot from a YouTube video from Jordan Webster’s excellent linear algebra course and asked the models to solve the problem shown in it. GPT-4V absolutely nailed it, though again, I can’t confirm whether this problem was part of its training data. The screenshot comes from a YouTube channel that is not super popular, but the problem might have been borrowed from a textbook. LLaVA, on the other hand, started solving some other problem that had nothing to do with the image.
Prompt:
can you solve this linear algebra problem. Go through it step by step
LLaVA:
To solve this linear algebra problem, we need to find the values of the variables in the given equation. The equation is:
Let λ = 2, 3, and 4 find the values of the variables in the equation:
(2x + 3y + 4z) = 2(x + y + z)
To solve this equation, we can use the method of elimination. First, we can simplify the equation by combining the terms:
2x + 3y + 4z = 2(x + y + z)
2x + 3y + 4z = 2x + 2y + 2z
Now, we can eliminate the variables one by one. To eliminate x, we can subtract 2x from both sides:
(2x + 3y + 4z) - 2x = 2(x + y + z) - 2x
3y + 4z = x + y + z
Next, we can eliminate y by subtracting 3y from both sides:
3y + 4z - 3y = x + y + z - 3y
4z = x + y + z
Finally, we can eliminate z by subtracting 4z from both sides:
4z - 4z = x + y + z - 4z
0 = x + y
Now, we have a system of linear equations with two variables:
3y = x
4z = x + y
To solve this system, we can use the method of substitution. First, we can solve equation 1 for y:
y = x/3
Next, we can substitute the expression for y into equation 2:
4z = x + x/3
4z = 3x/3 + x
4z = x + x/3
4z = 3x/3 + 2x/3
4z = x + x/3
Now, we can solve for z:
4z = x + x/3
4z = 3x/3 + x
4z = 2x + x/3
4z = 3x/3 + x/3
4z = x/3 + x/3
4z = x/3
Now, we can find the value of z:
z = x/12
Finally, we can substitute the expression for z into the expression for y:
y = x/3
y = x/12 * 3
y = x/4
Now, we have the solution for the variables in the given equation:
x = 4z = 4(x/12) = x/3
y = x/4
z = x/12
GPT-4V:
This problem involves changing the coordinates of a vector from one basis to another. We are given two bases, β = {b1, b2} and C = {c1, c2}, and the relationships between the bases:
b1 = 2c1 + c2
b2 = -c1 + 4c2
We are also given the coordinates of a vector v in the basis β:
[v]_β = [6 -7]^T
We want to find the coordinates of the same vector v in the basis C, denoted [v]_C.
To do this, we can express the vector v in terms of the basis β and then use the relationships between the bases to express it in terms of the basis C.
Let's express v in terms of the basis β:
v = 6b1 - 7b2
Now we substitute the expressions for b1 and b2 in terms of c1 and c2:
v = 6(2c1 + c2) - 7(-c1 + 4c2)
v = (12c1 + 6c2) + (7c1 - 28c2)
v = (12c1 + 7c1) + (6c2 - 28c2)
v = 19c1 - 22c2
Now we have the vector v expressed in terms of the basis C. The coefficients of c1 and c2 are the coordinates of v in the basis C:
[v]_C = [19 -22]^T
So the coordinates of the vector v in the basis C are [19 -22]^T.
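GPT-4V’s arithmetic checks out, and you can verify it numerically: the change-of-coordinates matrix from β to C has [b1]_C and [b2]_C as its columns.
# Verify GPT-4V's change-of-basis answer with numpy.
import numpy as np

# Columns are [b1]_C and [b2]_C, from b1 = 2c1 + c2 and b2 = -c1 + 4c2.
P = np.array([[2, -1],
              [1, 4]])
v_beta = np.array([6, -7])
print(P @ v_beta)  # [ 19 -22]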
Final thoughts
None of these experiments mean that LLaVA or other open-source models are useless. They can solve many interesting problems. However, it is important to look past benchmarks and do some hands-on experiments with your own data and examples before integrating a model into your applications. Fortunately, the cost and complexity of fine-tuning models are quickly dropping. This will make it easier for organizations to run their own customized models.