
The NeurIPS 2024 Preshow: What matters when building vision-language models?



Developing Vision Language Models (VLMs) is a field full of challenges and potential.

Recently, we had the pleasure of discussing these topics with Hugo Laurençon about his research on the creation of VLMs, highlighted in his paper “What Matters When Building Vision Language Models”, which will be presented at NeurIPS 2024. Our dialogue drew attention to the complexities in crafting and refining these models.


NeurIPS 2024 Paper: What matters when building vision-language models?

Author: Hugo Laurençon holds a Ph.D. from Sorbonne Université, specializing in Machine Learning, and is an AI Research Engineer at Hugging Face.

Introduction to Vision Language Models

Vision Language Models (VLMs) combine the versatility of language models with image-processing capabilities. These models, such as GPT-4, are built to process visual inputs and generate text from them. During the discussion, Hugo outlined practical applications, like generating website code from screenshots or answering questions from text-dense PDFs. Aiming to build a generalist model, the team focused on efficiency during both training and inference.

Key Design Considerations

One critical theme in designing VLMs is the architecture.

There are two main architectures used in VLMs: cross-attention and fully autoregressive.

  • Cross-attention models insert image information at different layers within the language model through cross-attention blocks where the text attends to the image features.
  • Fully autoregressive models directly connect the image features to the text input and feed the combined sequence to the language model.

Hugo highlighted the comparison between the cross-attention architecture and the simpler fully autoregressive one, which directly connects pre-trained language and vision backbones. Their research found that while the cross-attention architecture tends to perform better when the backbones are kept frozen, the fully autoregressive architecture excels when they are unfrozen. This suggests that the fully autoregressive design preserves the original strengths of the language model while remaining simpler and more efficient, likely because cross-attention blocks disrupt the pre-trained language model more than the fully autoregressive design does.
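To make the distinction concrete, below is a minimal PyTorch sketch of the two fusion styles; it is not the paper's implementation, and the dimensions, module choices, and projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden = 512          # language-model hidden size (illustrative)
vision_dim = 768      # vision-encoder output size (illustrative)

text_tokens = torch.randn(1, 32, hidden)      # embedded text sequence
image_feats = torch.randn(1, 64, vision_dim)  # patch features from the vision encoder
img_proj = nn.Linear(vision_dim, hidden)      # project image features to the LM width

# 1) Cross-attention style: text and image stay separate; cross-attention blocks
#    inserted into the language model let the text attend to the image features.
cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
image_kv = img_proj(image_feats)
fused, _ = cross_attn(query=text_tokens, key=image_kv, value=image_kv)

# 2) Fully autoregressive style: image features are mapped into the text embedding
#    space, concatenated with the text tokens, and the single combined sequence is
#    fed to the unchanged language model.
image_tokens = img_proj(image_feats)                      # (1, 64, hidden)
combined = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 96, hidden)
```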

Addressing Architectural and Efficiency Challenges

The discussion on training efficiency led Hugo to strategies such as learned pooling with a Perceiver Resampler, which compresses the number of visual tokens passed to the language model. These methods show promise in maintaining performance while reducing computational load. He also described image splitting, which divides high-resolution images into sub-images and improves performance on text recognition and document-understanding tasks.
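As a rough illustration of both ideas, here is a short PyTorch sketch: a Perceiver-Resampler-style pooling module in which a small set of learned latent queries cross-attends to the patch features, plus a naive image-splitting helper. The class, sizes, and splitting scheme are assumptions for illustration, not the Idefics2 code.

```python
import torch
import torch.nn as nn

class LearnedPooler(nn.Module):
    """Compress many visual patch tokens into a fixed, small number of tokens."""
    def __init__(self, dim=768, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learned queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                     # (batch, n_patches, dim)
        queries = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(queries, patch_feats, patch_feats)
        return pooled                                   # (batch, num_latents, dim)

pooler = LearnedPooler()
patch_feats = torch.randn(2, 729, 768)   # e.g. a 27x27 patch grid per image
visual_tokens = pooler(patch_feats)      # (2, 64, 768): far fewer tokens for the LM

def split_into_crops(image):
    """Naive image splitting: cut a high-resolution (C, H, W) image into quadrants
    so each crop can be encoded at a resolution where small text stays legible."""
    _, h, w = image.shape
    return [image[:, :h // 2, :w // 2], image[:, :h // 2, w // 2:],
            image[:, h // 2:, :w // 2], image[:, h // 2:, w // 2:]]
```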

Data Quality and Curation Strategies

Data is the backbone of any model’s success. 

Idefics2 is trained on a massive data mixture, which includes the instruction-tuning collection called The Cauldron. The training data covers several types:

  • Interleaved image-text documents from OBELICS.
  • Image-text pairs from datasets like LAION COCO (with synthetic captions instead of original alt text, as synthetic captions were found to perform better).
  • PDF documents for improving OCR capabilities.
  • Text-only instruction datasets to teach the model to follow instructions and perform calculations.
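A hypothetical sketch of how such a mixture might be declared and sampled is shown below; the component types mirror the list above, but the names, weights, and structure are illustrative rather than the actual Idefics2 recipe.

```python
import random

# Illustrative mixture declaration; the weights are made up for the example.
data_mixture = [
    {"name": "obelics_interleaved",    "kind": "interleaved_image_text", "weight": 0.45},
    {"name": "laion_coco_synthetic",   "kind": "image_text_pairs",       "weight": 0.30},
    {"name": "pdf_ocr_documents",      "kind": "rendered_pdf_pages",     "weight": 0.15},
    {"name": "text_only_instructions", "kind": "instruction_text",       "weight": 0.10},
]

def sample_source(mixture, rng=random):
    """Pick the next data source in proportion to its sampling weight."""
    return rng.choices(mixture, weights=[d["weight"] for d in mixture], k=1)[0]
```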

Hugo described the team's reliance on synthetic captions in the training data. These captions are crafted to improve quality and diversity over typical alt-text, bolstering model performance. However, he acknowledged that synthetic data inevitably introduces biases, stressing the need to continually reevaluate and improve dataset quality through iterative model refinement.

Ethical considerations also took center stage. The models are trained with content-filtering precautions: the Spawning API is used to honor opt-out requests and exclude that content from the datasets, and NSFW content is filtered with a classifier. However, the team acknowledges the need for further research to address the biases that synthetic data introduces.

The red-teaming exercise also revealed potential for misuse, highlighting the need for safeguards and responsible use.

Evaluating Model Performance Beyond Benchmarks

Testing and evaluating VLMs extend beyond standard benchmarks. Hugo emphasized the importance of hands-on model interaction through demonstrations to uncover failures and biases. Observing model outputs can reveal unintended training data artifacts, highlighting the need for further filtering or additional training data.

Closing Thoughts

The conversation concluded with a piece of wisdom shared by VLM developers: building a model is not a one-off achievement but an ongoing journey. Interactive testing through demos can illuminate areas for improvement, and model training and data curation require a continuous loop of optimization, with each iteration refining the model's core capabilities.

Hugo’s insights painted a vibrant picture of the state of VLM development. By sharing these reflections, we hope to inspire others to venture into this rapidly evolving field. For those attending NeurIPS 2024 in Vancouver, it’s a chance to interact with visionaries like Hugo and explore these advancements firsthand. Join us at the Voxel51 booth for more engaging discussions and plenty of tech swag!