
Understanding LLaVA: Large Language and Vision Assistant

All Facets of Building a LLM, All Open Source

NeurIPS 2023 is just around the corner as we prepare for the week-long activities. To kick off the event, I will be taking a look at some of the top papers from this year, helping others understand what makes them awesome, and celebrating their accomplishments. 
One of the best places to start is a project that is making waves across all AI/ML communities: LLaVA. LLaVA, or Large Language and Vision Assistant, is a joint effort from researchers at the University of Wisconsin–Madison, Microsoft Research, and Columbia University. They aimed to create a novel, end-to-end trained large multimodal model that could compete with even giants like GPT-4, all while being open source!

Good Models Start With Good Data

To start this Herculean task, the researchers knew they needed to create a state-of-the-art dataset if they were to have any chance of competing with the other models. Without top-of-the-line data, a model can get stuck in the mud during training runs, never able to rise past mediocrity. The result was the LLaVA-Instruct-150K dataset, a collection of images paired with instructions, available to be viewed in FiftyOne here! The dataset was born from the observation that high-quality instruction-following datasets are scarce. These instructions, or “conversations” as they are denoted in the dataset, come in three main forms. Let’s take a look at the picture from the dataset below and the examples associated with it:

  • Conversation
    • Human: Is the child in the image a boy or a girl? 
    • GPT: The child in the image is a little girl. 
    • Human: What is the little girl doing in the photo? 
    • GPT: The little girl is sitting at a table and is eating from a plate of food. 
    • Human: What kind of food is the little girl preparing to eat? 
    • GPT: The little girl is preparing to eat some pizza. 
    • Human: What is the little girl drinking during her meal? 
    • GPT: The little girl is drinking juice, which is served in a wine glass. 
    • Human: What is the age range of the child in the image? 
    • GPT: The child in the image can be classified as a toddler, which usually refers to children aged between 1 and 3 years old.
  • Details
    • Human: What are the key elements in this picture? 
    • GPT: The image features a little girl sitting at a dining table, enjoying a meal. She is preparing to eat a slice of pizza placed on the table, with a bowl nearby. There are several wine glasses on the table, one of which contains juice for the little girl to drink. A chair is positioned by the table where the little girl is seated. The scene is set in a well-stocked kitchen area, where an oven can be found in the background. There are also a few bottles visible in the scene, scattered around the perimeter. 
  • Complex Reasoning
    • Human: What aspect of the girl’s meal might be considered unusual for a child her age?
    • GPT: An unusual aspect of the girl’s meal is that she has some juice in a wine glass. Wine glasses are typically associated with adult beverages and are not commonly used by young children. While it is not dangerous for her to drink from a wine glass, it may appear unconventional and might raise questions about the appropriateness of using such a glass for a child. The rest of her meal, including the pizza and potentially other food items like the ones in the bowl, seem more typical for a child’s meal. 

The LLaVA team was able to create 150k image-instruction pairs using images from the COCO Train2017 dataset by leveraging GPT-4. For each image, the team passed the captions and bounding box coordinates to the language-only GPT-4 model; since the text annotations stand in for an understanding of the image, conversations about it can be generated in a cheap and efficient manner. They even created a visualization of the root noun-verb pairs for the instructions and responses, shown below.
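To make the data-generation idea concrete, here is a minimal sketch of how caption and bounding-box annotations can be rendered as text for a language-only model. The prompt format and helper below are hypothetical illustrations, not the exact template used by the LLaVA authors.

```python
# Render COCO-style annotations as plain text that a language-only
# model can reason over (hypothetical format, for illustration only).

def build_context(captions, boxes):
    """Turn image annotations into a textual context block."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (label, normalized x1 y1 x2 y2):")
    lines += [
        f"- {label}: {x1:.2f} {y1:.2f} {x2:.2f} {y2:.2f}"
        for label, (x1, y1, x2, y2) in boxes
    ]
    return "\n".join(lines)

captions = ["A little girl sits at a table eating pizza."]
boxes = [
    ("person", (0.20, 0.15, 0.75, 0.95)),
    ("pizza", (0.40, 0.60, 0.70, 0.80)),
]
prompt = build_context(captions, boxes)
print(prompt)
```

A block like this, followed by a request such as “generate a conversation about this image,” is enough for a text-only model to produce the conversation, detail, and reasoning examples shown above.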

A Model With Humble Beginnings

What is so interesting to me about LLaVA is the team’s ability to keep things simple. Not overcomplicating a solution keeps an engineer goal-oriented, and often leads to great results. The LLaVA researchers did not aim to reinvent the wheel, opting for the widely popular CLIP ViT-L/14 visual encoder and Vicuna, an instruction-tuned LLM built on LLaMA. It is the clever fusion and joint training of these two models that makes LLaVA powerful.

They break the model training down into two stages. Stage 1 is a pre-training stage that updates only the projection matrix, the bridge between CLIP and Vicuna, using a subset of the CC3M dataset. This maps input images and input text into the same embedding space, so the LLM can draw context from both the image and the prompt. The projection was deliberately simple and lightweight, allowing for faster iterations during experimentation.
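A minimal sketch of what that projection does, assuming CLIP ViT-L/14's 1024-dimensional patch features and Vicuna's 4096-dimensional token embeddings (the random weights and NumPy matmul here are stand-ins for the authors' actual trained weights and training code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: CLIP ViT-L/14 emits one 1024-d feature per image
# patch; Vicuna's token embeddings are 4096-d.
num_patches, clip_dim, llm_dim = 256, 1024, 4096

# The projection matrix W is the only set of weights updated in Stage 1.
W = rng.standard_normal((clip_dim, llm_dim)) * 0.02

# Fake visual features standing in for CLIP's output on one image.
visual_features = rng.standard_normal((num_patches, clip_dim))

# Project into the LLM's embedding space; these "visual tokens" are
# then fed to the LLM alongside the embedded text prompt.
visual_tokens = visual_features @ W
print(visual_tokens.shape)  # (256, 4096)
```

Because the projection is a single linear map, Stage 1 training is cheap relative to updating the LLM itself, which is exactly what enables the fast experimental iterations mentioned above.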

In Stage 2, both the projection matrix and the LLM are updated, for one of two scenarios. In the first, LLaVA is fine-tuned on the multimodal instruction-following dataset for daily, user-oriented applications. In the second, LLaVA is fine-tuned on multimodal reasoning data specifically for the science domain. This approach lets the researchers address one of the common critiques of LLMs: the lack of a solid science foundation. Adding this segment of data allows the model to score much higher on complex science questions than its peers. Let’s take a look at how, after all this training, LLaVA stacks up against GPT-4.


The team created thirty images with three instruction pairs each, none of which had been seen before by GPT-4 or LLaVA. These image-instruction pairs were used to grade the models on a scale of 1 to 10. The maximum score in any category is 300, with a perfect score being 900 across all tasks. The results can be seen below:

LLaVA captured an overall 85% relative score compared to GPT-4, meaning that LLaVA earned 85% of the points that GPT-4 received. It is important to note that the resources, training time, and engineering behind GPT-4 are expected to be vastly greater than LLaVA’s, and the dataset was even created with data generated by GPT-4. Yet the researchers came within a stone’s throw with a completely open-source project. On ScienceQA specifically, LLaVA does even better!
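The relative-score arithmetic is simple to spell out. The per-model totals below are invented for illustration; only the 85% ratio matches the reported result.

```python
# Hypothetical totals out of the 900 possible points (30 images x
# 3 categories x 10 points); the exact tallies are not reproduced here.
llava_total = 765
gpt4_total = 900

# Relative score: LLaVA's points as a fraction of GPT-4's points.
relative = llava_total / gpt4_total
print(f"{relative:.0%}")  # 85%
```

Note that the denominator is GPT-4's score, not the 900-point maximum, so a relative score above 100% is possible in principle.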

LLaVA was able to achieve new state-of-the-art performance when using LLaVA + GPT-4 (judge). In this setup, whenever GPT-4’s and LLaVA’s answers to a question differ, GPT-4 is brought in as a “judge,” looking at the two answers and determining the better one. This yielded a SoTA accuracy of 92.53%, a number to be extremely proud of and a large step in the right direction for LLMs and multimodal models.
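The judge ensemble can be sketched as a small piece of control flow. The `judge` callable below is a stand-in for an actual GPT-4 API call, and the tie-breaking rule it uses here is invented purely for the demo.

```python
# Sketch of the LLaVA + GPT-4 (judge) ensemble: when the two models
# agree, take the shared answer; when they differ, a judge picks one.

def ensemble_answer(llava_answer, gpt4_answer, judge):
    """Return the ensemble's final answer for one question."""
    if llava_answer == gpt4_answer:
        return llava_answer
    # Disagreement: the judge sees both candidate answers and
    # returns the one it deems better (GPT-4 in the paper's setup).
    return judge(llava_answer, gpt4_answer)

# Trivial stand-in judge that simply prefers the longer answer.
pick_longer = lambda a, b: a if len(a) >= len(b) else b

print(ensemble_answer("B", "B", pick_longer))  # agreement -> B
print(ensemble_answer("A", "C", pick_longer))  # judge breaks the tie
```

Because the judge is consulted only on disagreements, the ensemble can outscore either model alone whenever the judge tends to pick the correct answer.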


LLaVA truly is an incredible project from top to bottom. Since the initial release, the team has updated the model and dataset for LLaVA-1.5, improving results across the board and adding many new evaluation benchmarks. The dataset has also grown beyond COCO to incorporate additional source datasets, now totaling over 665K conversations. The models and dataset are available to try, all for free!

LLaVA has made incredible strides in closing the gap between open-source models and GPT-4. It will be incredibly interesting to see how the model develops, especially on the dataset side. With the current methods used to generate the LLaVA datasets, surpassing GPT-4 is difficult, because the ground-truth conversations are answers from GPT-4 rather than human-curated conversations. Perhaps an upgraded dataset could be all it takes for the final push.
With all the outstanding accomplishments the team has made with the LLaVA project, the fact that it is all open source is a huge contribution to the field as a whole. You can even hop over to the LLaVA website and try the model yourself. I was personally blown away by how much better and more responsive it felt compared to LLMs of the past. The ability to converse with the model and work together toward complex answers or solutions is incredible. The model can also be used for more traditional computer vision problems such as detection or OCR!

Most importantly, I think it is a great contribution to show the broader AI/ML community how LLMs are made, what thought goes into creating them, and each step along the way. It brings clarity and explainability to a science that desperately needs it. Doing so makes it much easier for more people to understand, embrace, and adopt AI instead of combating it. LLaVA is also an incredible role model for what responsible AI should look like, and I can only hope it inspires more to follow suit!

Be sure to check back soon for more paper digests, updates from the NeurIPS conference, and more FiftyOne content! Until next time!