Welcome to Voxel51’s bi-weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.
📰 The Industry Pulse
AI Wings of Fury: AI Agent-Powered Fighter Jet Takes to the Skies
Ok, that was kind of a clickbaity headline, but it’s pretty much what happened!
The US Air Force Test Pilot School and the Defense Advanced Research Projects Agency (DARPA) have successfully installed AI agents in the X-62A VISTA aircraft as part of the Air Combat Evolution (ACE) program.
The teams implemented more than 100,000 lines of flight-critical software changes across 21 test flights, culminating in the first-ever AI vs. human within-visual-range dogfights. The breakthrough demonstrates that AI can be used safely in aerospace applications, paving the way for future advances. The X-62A VISTA will continue to serve as a research platform to advance autonomous AI systems in aerospace.

Looks like AI has finally earned its ‘flying’ colors in aerospace! I highly recommend checking out this YouTube video that DARPA put out to learn more.
Phi-3: where less is more, and ‘mini’ means maximum impact!
Microsoft has overshadowed the Llama-3 launch with their latest line of small language models (SLMs) – Phi-3!
The Phi-3 family includes three models: phi-3-mini with 3.8 billion parameters, phi-3-small with 7 billion, and phi-3-medium with 14 billion. The phi-3-small and phi-3-medium compete with or outperform GPT-3.5 across all benchmarks, including MT-Bench, the multi-turn benchmark (and by a decent amount). It’s not so good on TriviaQA due to its limited capacity to store “factual knowledge,” but honestly, that’s not even an interesting benchmark to care about.
What is interesting, though, is how they curated their dataset. They built training data around simple words a 4-year-old could understand, and created synthetic datasets called “TinyStories” and “CodeTextbook” using high-quality data generated by larger language models. This supposedly makes the models less likely to give wrong or inappropriate answers.
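Microsoft hasn’t released the curation pipeline itself, but the core idea, prompting a larger “teacher” LLM to write simple, child-readable text and using it as training data, is easy to sketch. Here’s a minimal example assuming the OpenAI Python client; the model name, prompt wording, and `generate_tiny_story` helper are illustrative, not Microsoft’s actual recipe.

```python
# Hypothetical sketch of TinyStories-style synthetic data generation.
# The model name, prompt, and vocabulary constraint are assumptions;
# Microsoft's actual Phi-3 curation pipeline is not public.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a three-paragraph story using only words a 4-year-old would "
    "understand. The story should feature {subject} and illustrate {concept}."
)

def generate_tiny_story(subject: str, concept: str, model: str = "gpt-4o-mini") -> str:
    """Ask a larger 'teacher' LLM for one simple, high-quality training sample."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(subject=subject, concept=concept)}],
    )
    return response.choices[0].message.content

# Varying the (subject, concept) pairs is what gives the synthetic corpus its diversity.
print(generate_tiny_story(subject="a curious robot", concept="sharing"))
```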
Microsoft’s Phi-3 SLMs are proof that sometimes, smaller is smarter.
Another one from Microsoft: VASA-1!
VASA-1 is a Microsoft Research project that generates realistic talking faces in real-time based on audio input.
VASA-1 brings talking faces to life with its cutting-edge technology. The system generates facial movements and expressions that perfectly sync with audio input, creating a seamless and realistic experience. Moreover, it does this in real-time, crafting animations on the fly as the audio is spoken. The result is a lifelike appearance that’s uncannily similar to real human faces, complete with intricate skin texture, facial features, and nuanced expressions.
Seriously, this thing is a trip. Go check out the website. None of the images are of real people, but the lip-audio synchronization, expressive facial movements, and natural head motions fooled me.
Use of generative AI tools
🤖 While generative AI has generated a lot of hype and excitement, Gartner’s recent survey of over 500 business leaders found that only 24% are currently using or piloting generative AI tools. The majority, 59%, are still exploring or evaluating the technology. The top use cases are software/application development, creative/design work, and research and development. Barriers to adoption include governance policies, ROI uncertainty, and a lack of skills. Gartner predicts 30% of organizations will use generative AI by 2025.
👨🏽💻 GitHub Gems
COCONut is a modernized large-scale segmentation dataset that improves upon COCO in annotation quality, consistency across segmentation tasks, and scale, and introduces a challenging new validation set.
- It contains approximately 383K images with over 5.18M human-verified panoptic segmentation masks.
- COCONut harmonizes annotations across semantic, instance, and panoptic segmentation tasks to address inconsistencies in the original COCO dataset.
Dataset Highlights
- High-quality, human-verified panoptic segmentation masks
- Consistent annotations across semantic, instance, and panoptic segmentation
- Meticulously curated validation set (COCONut-val) with 25K images and 437K masks
- Refined class definitions compared to COCO to improve annotation consistency
Construction
- Uses an assisted-manual annotation pipeline leveraging modern neural networks to generate high-quality mask proposals efficiently
- Human raters inspect and refine the machine-generated proposals
- Achieves a significant increase in both scale and annotation quality compared to existing datasets
Check out the GitHub repo here. The dataset is also available on Kaggle here.
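If you want to poke at the masks yourself, here’s a minimal decoding sketch, assuming COCONut follows the standard COCO panoptic format (a JSON file with per-image `segments_info` records plus a PNG whose RGB values encode segment IDs). The file paths are placeholders.

```python
# Minimal sketch for decoding a COCO-panoptic-style annotation.
# Assumes COCONut uses the standard COCO panoptic format; paths are placeholders.
import json
import numpy as np
from PIL import Image

with open("panoptic_annotations.json") as f:
    panoptic = json.load(f)

ann = panoptic["annotations"][0]  # annotation record for the first image
png = np.array(Image.open(ann["file_name"]).convert("RGB"), dtype=np.uint32)

# COCO panoptic PNGs encode each segment as id = R + 256*G + 256^2*B
segment_ids = png[..., 0] + 256 * png[..., 1] + (256 ** 2) * png[..., 2]

for seg in ann["segments_info"]:
    mask = segment_ids == seg["id"]
    print(f"category {seg['category_id']}: {mask.sum()} pixels")
```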
📙 Good Reads
This week’s good read is Nathan Lambert’s slides from his guest lecture session for Stanford CS25, titled Aligning Open Language Models.
Honestly, I wouldn’t normally recommend slides as a good read (because that’s weird)…but this is an exception for three reasons:
- I’m a huge fan of Nathan Lambert.
- The slides break all the design principles people say good slides should follow, so they’re packed with text.
- He links out to a lot of really good material, so lots of alpha from that perspective.
In the slides, he showcases the rapid progress in aligning open language models, driven by innovative techniques, community efforts, and the availability of open-source resources. He briefly traces the evolution of LMs from Claude Shannon’s early work in 1948 to the emergence of transformers in 2017 and the subsequent release of influential models like GPT-1, BERT, and GPT-3. He observes that GPT-3’s rise in 2020, with its surprising capabilities and potential harms, highlighted the need for aligning LMs with human values and intentions.
Nathan also provides a brief history of the following:
- Instruction Fine-tuning (IFT): This technique involves training base LMs on specific instructions to improve their ability to follow user commands and engage in dialogue.
- Self-Instruct/Synthetic Data: This approach utilizes existing high-quality prompts to generate additional training data, significantly expanding the IFT dataset.
- Early Open-Source Models: He highlights several early open-source instruct models like Alpaca, Vicuna, Koala, and Dolly, which were based on LLaMA and fine-tuned using various datasets like ShareGPT and OpenAssistant.
- Evaluation Challenges: The slides discuss the emergence of evaluation methods like ChatBotArena, AlpacaEval, MT Bench, and Open LLM Leaderboard, each with strengths and limitations.
- RLHF and LoRA: He explores the use of Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA) techniques to further improve the alignment and efficiency of open-source models (a minimal LoRA sketch follows this list).
- OpenAssistant: The release of OpenAssistant, a large human-generated instruction dataset, was crucial in advancing open-source aligned models.
- StableVicuna: This model marked a significant step by being the first open-source model trained with RLHF using PPO.
- QLoRA & Guanaco: The development of QLoRA, a memory-efficient fine-tuning method, enabled the training of larger models like Guanaco, achieving state-of-the-art performance at the time.
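Since LoRA and QLoRA come up repeatedly in the slides, here’s a minimal Hugging Face peft sketch of what wrapping a base model with low-rank adapters looks like. The base checkpoint, rank, and target modules are illustrative choices, not settings from the lecture.

```python
# Minimal LoRA fine-tuning setup with Hugging Face peft.
# The base checkpoint, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint works here
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

QLoRA is the same idea with the frozen base weights quantized to 4-bit, which is what made fine-tuning Guanaco-scale models practical on a single GPU.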
🎙️ Good Listens
The latest episode of the Practical AI podcast discusses AI’s rapid advancement and the shift toward multimodal AI models in 2024. Multimodal AI has developed through advances in model architectures, training datasets, and a handful of key insights, and unified models can now translate between data types: generating images from text, answering questions about images, and translating between languages. Larger and more capable models are expected to emerge in the coming years.
The hosts dive into the history and key developments that enabled today’s multimodal AI systems:
- In the early days of deep learning, research focused primarily on single modalities like natural language processing (NLP) for text and speech recognition for audio. Over time, deep learning models gradually expanded to handle multiple modalities.
- Different modalities, such as computer vision, speech, and natural language, were initially handled by separate specialized models. Combining the outputs of these unimodal models allowed for basic multimodal functionality, but this approach had limitations.
- As model architectures advanced, training a single model on multiple modalities became possible. This allowed the same model to process and relate information across different data types, such as text, images, audio, and video.
- The development of transformer architectures and attention mechanisms was a key turning point. Transformers provided a flexible, unified architecture that could scale to large datasets across multiple modalities.
- Large language models trained on massive amounts of web-scraped text data, while focused on language, likely picked up some primary multimodal associations from text that described or linked to images, audio, and video. This may have helped bootstrap the development of more multimodal models.
- The emergence of massive foundational models that span multiple modalities is a culmination of these trends. These models learn a shared representation space that captures information and relationships across modalities.
- Techniques like contrastive learning and cross-modal attention have further improved the ability of models to align and translate between different modalities (see the sketch after this list).
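To make the contrastive-learning point concrete, here’s a minimal CLIP-style loss in PyTorch: matched image/text embedding pairs are pulled together while mismatched pairs are pushed apart. The embedding dimension, batch size, and temperature are arbitrary choices for illustration.

```python
# Minimal CLIP-style contrastive loss: the diagonal of the similarity matrix
# holds the true image/text pairs. Dimensions and temperature are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))

    # symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

image_batch = torch.randn(8, 512)  # stand-ins for image-encoder outputs
text_batch = torch.randn(8, 512)   # stand-ins for text-encoder outputs
print(clip_contrastive_loss(image_batch, text_batch))
```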
The hosts also discuss Udio, a new AI tool that generates complete songs from text prompts, including music, lyrics and vocals. This raises questions about AI and creativity:
- Does AI-generated music involve enough human input to be copyrightable? Laws may evolve on this, similar to photography in the past.
- The music industry will be disrupted, but artists who embrace these tools may find them useful for rapid experimentation and iteration.
- In the future, emotionally intelligent AI music generators could create hyper-personalized music tailored to an individual’s changing moods.
The ability of AI to operate across multiple modalities is rapidly advancing and will likely continue to accelerate in the coming years. Those who align themselves with these new creative tools and capabilities may be best positioned in this fast-changing landscape.
👨🏽🔬 Good Research
Evaluating retrieval-augmented generation (RAG) systems, which combine information retrieval and language generation, has been a challenging task due to the reliance on extensive human annotations.
A recent research paper introduces ARES, a novel framework that aims to address this issue by providing an automated, data-efficient, and robust evaluation approach.
We’ll summarize the paper using the PACES method.
Problem
Traditional methods for evaluating the quality of a RAG system’s generated responses rely heavily on expensive and time-consuming human annotations, which can introduce subjectivity and inconsistency into the evaluation process. To address this issue, the authors propose ARES, an automated evaluation framework that leverages synthetic data generation and machine learning techniques to provide reliable and data-efficient assessments of RAG system performance.
Approach
The ARES approach has four components:
- Synthetic Data Generation: ARES generates a synthetic dataset of query-passage-answer triples using large language models.
- LLM Judges: ARES trains lightweight “judge” models to predict the quality of RAG system outputs.
- Prediction-Powered Inference (PPI): ARES employs PPI to combine a small set of human annotations with the judges’ predictions and estimate confidence intervals for quality scores (a minimal sketch follows this list).
- RAG System Ranking: ARES applies trained judges to evaluate outputs of various RAG systems and rank them based on scores and confidence intervals.
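To give a feel for the PPI step, here’s a minimal numpy sketch of a prediction-powered mean estimate: the judge’s average score on unlabeled outputs is corrected by the judge-vs-human gap measured on a small labeled set. The scores are synthetic, and the normal-approximation interval is a simplification of the procedure in the paper.

```python
# Minimal sketch of a prediction-powered mean estimate (PPI).
# Scores are synthetic; ARES's exact procedure differs in its details.
import numpy as np

rng = np.random.default_rng(0)

judge_unlabeled = rng.uniform(0, 1, size=2000)  # judge scores on unlabeled RAG outputs
judge_labeled = rng.uniform(0, 1, size=150)     # judge scores on a small labeled set
human_labeled = np.clip(judge_labeled + rng.normal(0, 0.1, size=150), 0, 1)

# PPI point estimate: judge mean on unlabeled data + rectifier from the labeled set
rectifier = human_labeled - judge_labeled
estimate = judge_unlabeled.mean() + rectifier.mean()

# Rough normal-approximation confidence interval (variances of the two terms add)
se = np.sqrt(judge_unlabeled.var(ddof=1) / judge_unlabeled.size +
             rectifier.var(ddof=1) / rectifier.size)
print(f"quality score ~= {estimate:.3f} +/- {1.96 * se:.3f}")
```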
Claim
The paper’s main claim is that ARES provides an effective and efficient framework for evaluating RAG systems without relying heavily on human annotations.
The authors argue that ARES can accurately assess the performance of RAG systems in terms of context relevance, answer faithfulness, and answer relevance, while significantly reducing the need for time-consuming and expensive human evaluations.
They claim that ARES has the potential to become a standard evaluation framework for RAG systems, enabling researchers and practitioners to assess and compare the performance of different RAG architectures more effectively.
Evaluation
ARES is evaluated on a range of datasets and shown to rank RAG systems accurately using only a limited number of human annotations. That said, testing ARES on a broader range of datasets and RAG system architectures would be valuable.
Substantiation
The evaluation results substantiate the paper’s main claim that ARES is an effective and efficient framework for evaluating RAG systems.
The high correlation between ARES rankings and human judgments across different datasets and evaluation dimensions supports the claim that ARES can provide reliable assessments of RAG system performance while requiring significantly fewer human annotations compared to traditional approaches.
The ablation studies and robustness experiments further strengthen the validity of the proposed framework.
🗓️ Upcoming Events
Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.