Welcome to Voxel51's bi-weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.

📰 The Industry Pulse

Stability AI's new Stable Audio 2.0

🎧 Stability AI has introduced Stable Audio 2.0, a new AI model that generates high-quality, full-length audio tracks up to three minutes long with coherent musical structure from natural language prompts. Stable Audio 2.0 is a new audio generation tool that allows users to upload and transform audio samples. It has expanded sound effect generation and style transfer capabilities. The model uses a highly compressed autoencoder and a diffusion transformer to generate tracks with large-scale coherent structures. The model is trained on a licensed dataset from AudioSparx and is available for free on the Stable Audio website. Stable Radio, a 24/7 live stream featuring Stable Audio's generated tracks, is now available on YouTube.

SyntheMol, a generative AI model

💊 McMaster and Stanford University researchers have created SyntheMol, a generative AI model that rapidly designs small-molecule antibiotics to fight antibiotic-resistant bacteria. SyntheMol uses Monte Carlo tree search to create compounds from 132,000 building blocks and 13 chemical reactions. The AI model was trained on a database containing over 9,000 bioactive compounds and 5,300 synthetic molecules screened for growth inhibition against Acinetobacter baumannii. After generating 150 high-scoring compounds and synthesizing 58, six non-toxic drug candidates with antibacterial properties were identified. This proof-of-concept demonstrates the potential of generative AI to speed up drug discovery, address antimicrobial resistance, and improve healthcare worldwide, with the entire pipeline completed in just three months.

Transformer-based models

📈 Time series forecasting uses historical data patterns to make informed decisions across various industries, but integrating advanced AI techniques has been slow. Recently, transformer-based models are developed with large datasets, offering sophisticated capabilities for accurate forecasting and analysis. Notable examples include TimesFM, Lag-Llama, Moirai, Chronos, and Moment. These models help businesses and researchers navigate complex time series data landscapes, offering a glimpse into the future of time series analysis.

Use of generative AI tools

🤖 While generative AI has generated a lot of hype and excitement, Gartner's recent survey of over 500 business leaders found that only 24% are currently using or piloting generative AI tools. The majority, 59%, are still exploring or evaluating the technology. The top use cases are software/application development, creative/design work, and research and development. Barriers to adoption include governance policies, ROI uncertainty, and a lack of skills. Gartner predicts 30% of organizations will use generative AI by 2025.

👨🏽‍💻 GitHub Gems

This week's GitHub gem is a shameless promo from me! I spent a few days reviewing the approved papers, workshops, challenges, and tutorials for the 2024 CVPR conference and compiled what stood out most in this GitHub repo. Some of the highlights include Pixel-level Video Understanding in the Wild, Vlogger: Make Your Dream A Vlog, and Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes, which I found particularly exciting due to their potential impact on vision-language and multimodal models. Admittedly, the content I've included is skewed towards my interests. If you have a different focus and think I've missed something important, by all means, contribute to the repo and help make it a more comprehensive resource for the community. Are you going to be at CVPR? I'll be with the crew from Voxel 51 at Booth 1519. Come by to say hello and get some sweet swag. You can find us right next to the Meta and Amazon Science booths! Check out the repo, ⭐smash a Star, share it with your network, add your CVPR highlights, and let me know your thoughts. I’m looking forward to seeing you at the conference!

📙 Good Reads

tl;dr

Classification: Recall, Precision, ROC-AUC, Separation of Distributions
Summarization: Factual consistency via NLI, Relevance via reward modeling
Translation: Measure quality via chrF, BLEURT, COMET, COMETKiwi
Toxicity: Test with adversarial prompts from RealToxicityPrompts and BOLD
Copyright: Test with text from popular books and code

This week, I enjoyed reading Eugene Yan's practical piece, "A Builder's Guide to Evals for LLM-based Applications." Most off-the-shelf evaluation metrics like ROUGE, BERTScore, and G-Eval are unreliable and impractical for evaluating real-world LLM applications. They don't correlate well with application-specific performance. The blog is a comprehensive guide for evaluating LLMs across various applications, from summarization and translation to copyright and toxicity assessments. If you're working on creating products that use LM technology, I highly recommend bookmarking Eugene's post as a useful reference guide. It contains practical tips and considerations to help you build your evaluation suite. Here's what I learned:

Summarization

When it comes to summarization, factual consistency and relevance are key measurements. Eugene discusses framing these as binary classification problems and using natural language inference (NLI) models fine-tuned on open datasets as learned metrics. With sufficient task-specific data, these can outperform n-gram overlap and embedding similarity metrics. This method improves with more data and is a robust tool against factual inconsistencies or "hallucinations" in summaries.

Translation

Yan discusses both statistical and learned evaluation metrics. He highlights chrF, a character n-gram F-score, for its language-independent capabilities and ease of use across various languages. He also talks about metrics like BLEURT and COMET, which use models like BERT to assess translation accuracy, and COMETKiwi, a reference-free variant. Reference-free models like COMETKiwi show promise in assessing translations even without gold-standard human references.

Copyright

Eugene also discusses measuring undesirable behaviours like copyright regurgitation and toxicity. He presents methods to evaluate the extent to which models reproduce copyrighted content, a concern for both legal risks and ethical considerations. He outlines how to quantify exact and near-exact reproductions through specific datasets and metrics, providing a framework for assessing and mitigating these risks. Running LLMs on carefully curated prompt sets drawn from books, code, and web text can help quantify the extent to which copyrighted content sneaks into the outputs.

Toxicity

Yan doesn't shy away from the topic of toxicity in LLM outputs. He discusses using datasets like RealToxicityPrompts and BOLD to generate prompts that could lead to harmful language, employing tools like the Perspective API to measure and manage toxicity levels. This section underscores the ongoing need for vigilance and human evaluation to ensure LLMs produce safe and respectful content.

The Human Element

Despite the advancements in automated metrics, human judgment remains the gold standard for complex tasks and nuanced assessments. Eugene emphasizes the enduring importance of human evaluation, especially for complex reasoning tasks. Strategically sampling instances to label can help boost our evals' precision, recall, and overall confidence. He advocates a balanced approach, combining automated tools with human insights to refine and improve LLM evaluations continuously. Eugene Yan's guide is a valuable source of knowledge for anyone involved in developing, assessing, or just interested in the capabilities and challenges of LLM-based applications. Let me know if you found it helpful, too!

🎙️ Good Listens

This week’s good listen is the Dwarkesh Podcast’s most recent episode with Sholto Douglas (from DeepMind) and Trenton Bricken (from Anthropic). This episode, a whopping 3 hours and 12 minutes, which took me a week to listen to (and re-listen to). It’s full of great insights, nuggets of wisdom, and alpha. Below is what stuck with me the most:

Long context windows and long-horizon tasks

The primary bottleneck for AI agents taking on long-horizon tasks is not necessarily their ability to handle long context windows but rather their reliability and the ability to chain tasks successively with high enough probability.
Accurately measuring progress and success rates over long-horizon tasks is difficult. Current evaluations might not fully capture the complexity of chaining multiple tasks over longer periods. More complex evaluations, like SWE-bench, are mentioned, which attempt to address this by examining success rates over extended periods.
On long-horizon tasks, the cumulative probability of success decreases significantly with each task added to the chain if the base task solve rate isn't high enough. Achieving a high enough base task solve rate is key for improving performance on long-horizon tasks.
What will the design of future models look like, given the potential for models with dynamic compute bundles and infinite context windows? Such a model would adjust its computing resources and strategies according to the task, possibly reducing the need for specialized models or fine-tuning.

The residual stream of transformer networks

Transformers are described as sequentially processing tokens through layers, with each layer adding to or modifying the information in the residual stream. This process is likened to human cognitive processes, where information is progressively refined and used for decision-making.
Side note: The role of the residual stream is encoding and processing information. The residual stream can be thought of as the intermediate state of a transformer network's computation. It is the output of each layer before being fed into the next layer. Each token in the input prompt has its own associated residual stream vector at each layer. It enables "memory management," where later computational units can pick up, continue, copy, modify, delete or merge outputs from previous units.
In the podcast the residual stream is analogized to a boat moving down a river, collecting extra passengers or information (via attention heads and MLPs), effectively acting as the model's working memory or RAM. This analogy extends to how transformers operate on high-dimensional vectors, packing vast information into the residual stream.
Analogies can be made between transformer models and human cognitive processes, particularly the similarity to the brain's attention mechanisms and how the cerebellum operates. It's suggested that transformer models might mimic certain aspects of human cognition, mainly how they handle and process information over time.
The conversation discusses the complexity of transformer architectures, especially focusing on the quadratic cost of attention and its comparison to the computational cost of the MLP block within the model. This complexity is somewhat dominated by the MLP block, suggesting that attention's cost becomes significant only in very long contexts.
The conversation also covers how models like transformers can be fine-tuned to improve their ability to chain thoughts or operations, enhancing their performance on complex tasks. This process involves adjusting key and value weights within the model, potentially allowing for more sophisticated information encoding and retrieval.
There's speculation about the benefits of models sharing residual streams or more dense representations of information, potentially leading to more effective collaboration and problem-solving.

Features and superposition

The discussion suggests that viewing neural network operation in terms of features and feature spaces is a powerful and perhaps essential perspective. This view aligns with the idea that artificial neural networks process information by identifying, modifying, and combining input features to perform tasks or make decisions.
In neural networks, a feature is challenging to define precisely. However, it can be seen as a direction in activation space or a latent variable influencing the system's output. It represents an element or aspect of the input data the model recognizes and uses to make predictions or decisions.
Features can range from simple (e.g., the presence of an object in an image) to complex (e.g., abstract concepts like love or deception) and are fundamental units through which models process and interpret data.
The concept of superposition in models, mainly those trained on high-dimensional and sparse data, refers to the model's ability to compress multiple features into fewer parameters than there are features.
This allows models to effectively encode and represent vast information with limited resources, addressing the underparameterization relative to the complexity and diversity of real-world data.
Interpreting AI models, particularly deep neural networks, is difficult because they employ superposition to encode information. A single neuron in the model might activate for a diverse set of inputs, making it hard to understand its specific role or what it represents.
Efforts to enhance interpretability include techniques to project model activations into higher-dimensional, sparse spaces, where the compression is undone, and features become more distinguishable and semantically coherent.

👨🏽‍🔬 Good Research

Anthropic released a paper titled "Many-shot Jailbreaking," which unveils a startling vulnerability for long context LLMs. This paper highlights the problem of adversarial attacks in LLMs, the methodology of exploiting long-context capabilities, the significant findings about the scalability and effectiveness of such attacks, and the challenges faced in mitigating these threats effectively. The research shows how even the most advanced large language models can be manipulated to exhibit undesirable behaviours by abusing their extended context windows. This is done by applying hundreds of carefully crafted prompts, specifically many-shot jailbreak prompts (MSJ). MSJ extends the concept of few-shot jailbreaking, where the attacker prompts the model with a fictitious dialogue containing a series of queries that the model would typically refuse to answer. These queries could include instructions for picking locks, tips for home invasions, or other illicit activities. Jason Corso published a blog post about how to read research papers a while back. His PACES (problem, approach, claim, evaluation, substantiation) method has been my go-to for understanding papers. Using the PACES method, here's a summary of the key aspects of the paper titled "Many-shot Jailbreaking".

Problem

The paper addresses LLMs' susceptibilities to adversarial attacks using long-context prompts demonstrating undesirable behaviour. This vulnerability emerges due to expanded context windows in models by Anthropic, OpenAI, and Google DeepMind, posing a new security risk.

Approach

The authors investigate "Many-shot Jailbreaking (MSJ)" attacks, where the model is prompted with hundreds of examples of harmful behavior, exploiting the large context windows of recent LLMs. This method scales attacks from few-shot to many-shot, testing across various models, including Claude 2.0, GPT-3.5, GPT-4, Llama 2, and Mistral 7B. The study covers the effectiveness of these attacks across different models, tasks, and context formatting to understand and characterize scaling trends and evaluate potential mitigation strategies.

Claim

The paper claims that the effectiveness of Many-shot Jailbreaking attacks on LLMs follows a power law with increasing demonstrations, significantly in longer contexts. This effectiveness challenges current mitigation strategies and suggests that long contexts offer a new attack surface for LLMs.

Evaluation

The authors evaluate the approach using datasets focused on malicious use cases, personality assessments, and opportunities for the model to insult. They measure success by the frequency of model compliance with harmful prompts and observe the scaling effectiveness of the attack across model sizes and formatting variations. Notably, the paper finds that current alignment techniques like supervised learning (SL) and reinforcement learning (RL) increase the difficulty of successful attacks but don't eliminate the threat.

Substantiation

The evaluation convincingly supports the paper's claim that MSJ is a potent attack vector against LLMs, with effectiveness that scales predictably with the number of shots. Despite efforts at mitigation, the findings suggest that long contexts allow for an attack that can circumvent current defenses, posing risks for model security and future mitigation strategies.

🗓️. Upcoming Events

Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.

Talk to a computer vision expert