Can VLMs Hear What They See?

February 20, 2025 – Written by Harpreet Sahota

Tutorials Computer Vision

Exploring the Intersection of Vision Language Models and Audio Data

ESC-10 dataset parsed as spectrograms into a FiftyOne Dataset

 

I recently came across a paper that made me wonder: could we actually use vision language models to understand audio?

The paper, Vision Language Models Are Few-Shot Audio Spectrogram Classifiers, introduces and explores Visual Spectrogram Classification (VSC). In this task, vision language models (VLMs) classify audio by analyzing spectrograms (visual representations of sound).

The key points are:

  1. Core concept: Converting audio classification into a visual task by having VLMs analyze spectrogram images
  2. Findings:
    • VLMs can effectively bridge visual-audio domains.
    • Few-shot learning significantly improves performance.
    • VLMs sometimes outperform both human experts and audio language models.
  3. Significance:
    • Establishes a new benchmark for testing VLMs’ audio understanding
    • Demonstrates potential for improving audio captioning
    • Shows promise in compensating for current limitations of audio language models

This curiosity sent me down a day-long experimentation rabbit hole, leading to fun hacking. While I won’t be reproducing their few-shot learning approach, I wanted to explore whether Janus-Pro, a recently released VLM from DeepSeek AI, could tackle this task using zero-shot classification.

ESC-10 dataset

In the paper, the authors tested their hypothesis on the ESC-10 dataset.

ESC-10 is an audio dataset for environmental sound classification, built from a selection of 10 classes in the larger ESC-50 dataset. It consists of 400 labeled environmental recordings: 10 classes with 40 clips per class, each clip lasting 5 seconds at a 44.1 kHz sampling rate. The dataset includes both transient/percussive sounds and sounds with temporal patterns. The ten categories are:

  • 🐕 Dog
  • 🐓 Rooster
  • 🌧️ Rain
  • 🌊 Sea waves
  • 🔥 Crackling
  • 👶 Crying baby
  • 🤧 Sneezing
  • ⏰ Clock tick
  • 🚁 Helicopter
  • 🪚 Chainsaw

The code below handles the organization and preprocessing of the ESC-10 dataset. Here’s what’s happening:

  1. We load the full ESC-50 dataset from Hugging Face using the load_dataset function
  2. The organize_esc10_dataset function then:
    • Filters out only the ESC-10 samples (a subset of ESC-50)
    • Creates a directory structure where each sound category has its own folder
    • Processes each audio sample by:
      • Normalizing the audio to the range [-1, 1]
      • Converting from float32 to 16-bit PCM format (standard for WAV files)
      • Saving each processed audio file to its respective category folder
  3. This organization:
    • Makes the dataset easier to work with
    • Ensures consistent audio format across all samples
    • Creates a clean directory structure for further processing
    • Prepares the audio files for spectrogram generation in later steps

Note that the repo for this blog can be found here. Be sure to install all the requirements before proceeding.

Let’s begin by downloading and organizing the ESC-10 dataset:

import os 
import numpy as np
from scipy.io import wavfile
from datasets import load_dataset

esc_fifty = load_dataset(
    "ashraq/esc50", 
    split="train",
    cache_dir='.')

def organize_esc10_dataset(dataset, base_output_dir="esc10_organized"):
    # Create base output directory
    os.makedirs(base_output_dir, exist_ok=True)
    
    # Filter for ESC-10 samples
    esc10_samples = dataset.filter(lambda x: x['esc10'] == True)
    
    # Process each sample
    for sample in esc10_samples:
        category_dir = os.path.join(base_output_dir, sample['category'])
        os.makedirs(category_dir, exist_ok=True)
        
        wav_path = os.path.join(category_dir, sample['filename'])
        
        # Convert float32 audio to int16 PCM
        audio_array = sample['audio']['array']
        # Normalize to [-1, 1] if not already
        audio_array = audio_array / np.max(np.abs(audio_array))
        # Convert to int16
        audio_array = (audio_array * 32767).astype(np.int16)
        
        # Save audio array as wav file
        wavfile.write(
            wav_path, 
            sample['audio']['sampling_rate'],
            audio_array
        )
    
    print(f"Dataset organized in {base_output_dir}")

organize_esc10_dataset(esc_fifty)

Now, let’s download a plugin to create spectrograms from the audio files.

By converting audio into spectrograms, we can tap into VLMs’ sophisticated visual pattern recognition and semantic understanding capabilities, even though they weren’t specifically trained on audio data.
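The Load Audio plugin below handles this conversion for us, but for intuition, here is a minimal sketch of how a single WAV clip could be rendered as a mel spectrogram image with librosa and matplotlib. The file path, output name, and parameters are illustrative and are not what the plugin uses internally:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder path to one of the organized ESC-10 clips
wav_path = "esc10_organized/dog/1-100032-A-0.wav"

# Load the clip at its native sampling rate
y, sr = librosa.load(wav_path, sr=None)

# Compute a mel spectrogram and convert power to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Render and save the spectrogram as an image a VLM can consume
fig, ax = plt.subplots(figsize=(4, 4))
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("dog_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)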

FiftyOne’s plugin framework lets you extend and customize the functionality of FiftyOne to suit your needs. If you’re interested in learning more about plugins, you might be interested in attending one of our monthly workshops. You can see the full schedule here and look for the Advanced Computer Vision Data Curation and Model Evaluation workshop.

!fiftyone plugins download https://github.com/danielgural/audio_loader

Once the plugin is downloaded, there are two ways you can use it:

  1. You can launch the FiftyOne App in your local browser by opening a terminal and running: fiftyone app launch. Once the App has launched, hit the backtick (`) key on your keyboard to open the Operator browser. Type in “Load Audio” and click on the operator. This opens the form for the Load Audio plugin, which you can fill out (each element of the form appears once you populate the previous one). You can choose to kick off a delegated service if you’d like.

Below is an example of the form:

The plugin will take some moments to run, depending on the size of your dataset. In this case, it should take no more than 1 minute.

  2. Alternatively, instead of launching the App from the terminal, you can launch it in a cell of a Jupyter Notebook. To do that, create a dummy dataset and then launch the App in the cell. The pattern is as follows:
import fiftyone as fo

dummy_dataset = fo.Dataset()
fo.launch_app(dummy_dataset)

Once the App has launched, hit backtick (`) to open the Operator browser, then follow the instructions outlined above. In both cases, you can load the dataset once it has been created.

Depending on what you named your dataset, you can load it as follows:

import fiftyone as fo

audio_dataset = fo.load_dataset("esc10")

Now, let’s install a plugin that allows us to create custom dashboards and glean more insight into our dataset:

!fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard

fo.launch_app(audio_dataset)

Before diving deep into analysis, doing a quick “vibe check” of your dataset using FiftyOne’s visualization capabilities is always good practice. The app provides an intuitive interface to browse your samples, inspect their metadata, and get a feel for the data distribution.

You can:

  • Browse through spectrograms to check their quality and consistency.
  • Filter and sort samples by different fields.
  • Verify that labels are correctly assigned.
  • Spot any obvious outliers or data quality issues.
  • Get a sense of the class balance.

This visual inspection often reveals insights that might not be immediately apparent from the raw data or statistics alone.
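If you prefer to do part of this vibe check programmatically, a couple of quick queries go a long way. Here is a minimal sketch using standard FiftyOne dataset methods; the ground_truth.label field is the same one we query for class names below:

from fiftyone import ViewField as F

# Quick look at the class balance across the ten categories
print(audio_dataset.count_values("ground_truth.label"))

# Isolate a single category to eyeball its spectrograms in the App
dog_view = audio_dataset.match(F("ground_truth.label") == "dog")
session = fo.launch_app(dog_view)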

Exploring the spectrogram data in the FiftyOne App

We’ll also need the labels later on; we can grab them like so:

audio_classes = audio_dataset.distinct("ground_truth.label")

Let’s supplement our visual inspection by exploring how our audio samples relate in high-dimensional space. By visualizing embeddings, we can discover deeper patterns and relationships in our data:

  • Natural groupings and similarities between different sounds
  • Hidden structures that might not be obvious from spectrograms alone
  • Potential outliers or unusual samples in our dataset
  • Subtle acoustic patterns that connect different sound categories

Computing embeddings

I’ll compute embeddings using three models: Music2Latent, CLAP, and AIMv2.

This, dare I say, “multimodal” approach to analyzing embeddings provides different ways of exploring and understanding audio content, ultimately leading to an experiment with vision-language models (VLMs). Models like Music2Latent and CLAP operate directly on the raw audio waveforms, capturing temporal patterns, frequency relationships, and acoustic features in their native form.

Music2Latent

Music2Latent is an audio autoencoder that efficiently compresses audio into a smaller “latent space”. It is designed for tasks like music generation and audio information retrieval.

To extract audio features, the encoder compresses the audio, and the features from a specific layer are extracted and averaged. These features can then be used for various downstream tasks. The model operates on spectrograms (visual representations of audio frequencies) and consists of three main parts:

  • Encoder: Compresses the audio into a latent vector.
  • Decoder: Upsamples the latent vectors.
  • Consistency Model: Reconstructs the audio from the latent vector.

The extracted features can be used for tasks like auto-tagging, key estimation, and instrument/pitch classification, often outperforming similar models.

Let’s install the necessary dependencies and compute embeddings:

!pip install music2latent librosa

import librosa

from torch.nn.functional import normalize

from music2latent import EncoderDecoder

music_to_latent_model = EncoderDecoder()

for sample in audio_dataset.iter_samples(autosave=True):
    wav_path = sample["wav_path"]
    sample_rate = sample["frame_rate"]
    loaded_wave, _ = librosa.load(wav_path, sr=sample_rate)
    latents = music_to_latent_model.encode(loaded_wave, extract_features=True)
    embedding = latents.mean(dim=-1).squeeze(0) 
    normalized_embedding = normalize(embedding, p=2, dim=0)
    sample["music2latent_embedding"] = normalized_embedding.detach().cpu().numpy() #shape (8192,)

CLAP

CLAP (Contrastive Language-Audio Pretraining) is a model designed for audio representation learning that pairs audio data with natural language descriptions.

The model can be used for:

  • Extracting Audio and Text Embeddings: The model uses audio and text encoders to project audio and text data into a shared latent space, creating audio embeddings Ea and text embeddings Et. These embeddings can be used for various downstream tasks.
  • Zero-Shot Audio Classification: The model can perform zero-shot audio classification by converting the classification task into a text-to-audio retrieval task. It matches a given audio Xa against a set of text prompts Xt (e.g., “the sound of class-name”) and determines the best match based on cosine similarity between their embeddings. The categories of audio are unrestricted (i.e., zero-shot).
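To make the retrieval formulation concrete, here is a rough sketch of that matching step using the Hugging Face CLAP API: embed a handful of text prompts and a single audio clip, then score them by cosine similarity. The prompt wording, file path, and class subset are illustrative; the zero-shot pipeline we use later wraps all of this for us:

import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# One text prompt per candidate class (small illustrative subset)
prompts = [f"the sound of {name}" for name in ["dog", "rain", "sneezing"]]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True)

# A single audio clip, resampled to the 48 kHz rate CLAP expects
waveform, _ = librosa.load("esc10_organized/dog/1-100032-A-0.wav", sr=48000)
audio_inputs = processor(audios=waveform, sampling_rate=48000, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)    # shape (3, 512)
    audio_embed = model.get_audio_features(**audio_inputs)  # shape (1, 512)

# Cosine similarity between the audio embedding and each text embedding
scores = torch.nn.functional.cosine_similarity(audio_embed, text_embeds)
print(prompts[scores.argmax().item()], scores.tolist())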

We’ll use this model later for zero-shot audio classification, but for now, let’s compute embeddings:

import torch
from torch.nn.functional import normalize

import librosa

from transformers import ClapModel, ClapProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

clap_model = ClapModel.from_pretrained("laion/clap-htsat-unfused").to(device)

clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

for sample in audio_dataset.iter_samples(autosave=True):
    wav_path = sample["wav_path"]
    loaded_wave, _ = librosa.load(wav_path, sr=48000)
    clap_inputs = clap_processor(audios=loaded_wave, return_tensors="pt", sampling_rate=48000).to(device)
    audio_embed = clap_model.get_audio_features(**clap_inputs).squeeze(0)  
    normalized_embedding = normalize(audio_embed, p=2, dim=0)
    sample["clap_embeddings"] = normalized_embedding.detach().cpu().numpy() #shape (512,)

AIMv2

In parallel, we can compute embeddings using AIMv2 on the spectrograms — visual representations that encode time-frequency relationships in a 2D format.

This sets up (at least what I think is) a fascinating comparison: while the audio-specific models represent our ‘traditional’ approach to audio understanding, the spectrogram-based analysis might hint at the suitability of a vision-language model to perform audio classification. Read this blog for a deep dive into the AIMv2 family of models.

Start by downloading the plugin:

!fiftyone plugins download https://github.com/harpreetsahota204/aim-embeddings-plugin

And compute embeddings:

import fiftyone.operators as foo

embedding_operator = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

embedding_operator(
    audio_dataset,
    model_name="apple/aimv2-large-patch14-native",
    embedding_types="cls",  # Either "cls" or "mean"
    emb_field="aimv2_embeddings",
)

Let’s visualize our embeddings to better understand how our different models are grouping similar audio classes.

Since our embeddings are high-dimensional, we’ll use UMAP to reduce them to 2D for visualization. This will help us see if the models are clustering similar sound categories together.

import fiftyone.brain as fob

embedding_fields = [ "aimv2_embeddings", "music2latent_embedding", "clap_embeddings"]

for field in embedding_fields:
    _fname = field.split("_embedding")[0]
    results = fob.compute_visualization(
        audio_dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{_fname}_viz",
        num_dims=2,
    )

fo.launch_app(audio_dataset)

Embedding Analysis

Looking at the UMAP visualizations of the three embedding spaces reveals interesting patterns about how each model represents audio:

  1. CLAP embeddings show a remarkably clear separation between sound categories, with distinct clusters for each class. This is expected, given CLAP was specifically trained for audio-understanding tasks.
  2. Music2Latent shows moderate clustering with some overlap between categories. The model appears to group similar acoustic properties while maintaining some distinction between different sound types.
  3. AIMv2 embeddings, interestingly, show significant mixing between categories with no clear clustering pattern. Despite working with spectrograms, the vision model’s embeddings don’t appear to separate different sound categories naturally.

Exploring and visualizing embeddings in FiftyOne

Hypothesis for VLM Performance

Given the significant overlap in AIMv2’s embedding space, I might expect VLMs to face challenges when classifying spectrograms. The lack of natural clustering in the visual embedding space suggests that the spectrogram patterns might not map cleanly to sound categories from a pure vision perspective.

I hypothesize that:

  1. The specialized audio model (CLAP) will significantly outperform the VLM approach.
  2. VLMs might struggle with consistent classification across all categories.
  3. The performance gap between CLAP and VLMs could highlight the limitations of treating audio classification as a pure visual task.

Let’s test this hypothesis by implementing both approaches.

Zero-shot audio classification

I’ll use LAION’s CLAP model (discussed above) with a zero-shot audio classification pipeline. This will give us a reference point for how well a dedicated audio model performs on our sound classification task, which we can later compare against our VLM-based approach using spectrograms.

The CLAP model was trained on several datasets, including:

  • AudioCaps+Clotho (AC+CL): This smaller dataset contains approximately 55,000 audio-text pairs.
  • LAION-Audio-630K (LA.): The LAION-Audio-630K dataset was newly created for this model and is the largest public audio caption dataset with 633,526 audio-text pairs.
  • AudioSet: This dataset includes 1.9 million audio samples, originally with only labels, which were extended into captions using either a template or a keyword-to-caption model.

The datasets were combined to increase the total number of audio samples with text captions to 2.5 million.

from transformers import pipeline

zsc_audio_classifier = pipeline(
    task="zero-shot-audio-classification", 
    model="laion/clap-htsat-unfused"
    )

for sample in audio_dataset.iter_samples(autosave=True):
    wav_path = sample["wav_path"]
    zsc_audio_preds = zsc_audio_classifier(wav_path, candidate_labels= audio_classes)
    sample["zsc_audio_preds"] = fo.Classification(
        label=zsc_audio_preds[0]["label"], 
        confidence=zsc_audio_preds[0]["score"]
    )

Model evaluation in FiftyOne

You can use the evaluate_classifications method to evaluate the predictions of the zero-shot classifiers. This will return a ClassificationResults instance that provides various methods for generating aggregate evaluation reports about your model.

By default, the classifications are treated as a generic multiclass classification task. For illustration purposes, I explicitly request simple evaluation by setting the method argument to simple, but you can specify other evaluation strategies, such as top-k accuracy or binary evaluation, via the same parameter (see the sketch below).
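As an aside, a top-k evaluation would look roughly like the following. This is only a sketch: FiftyOne’s top-k method expects each prediction’s logits attribute to be populated over the full class list, which the pipeline above doesn’t store, so don’t expect it to run on this exact dataset:

# Sketch only: requires predictions with logits over `audio_classes`
topk_results = audio_dataset.evaluate_classifications(
    pred_field="zsc_audio_preds",
    gt_field="ground_truth",
    method="top-k",
    classes=audio_classes,
    k=3,
    eval_key="clap_top3_eval",
)

With that noted, let’s run the simple evaluation and launch the App: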

zsc_results = audio_dataset.evaluate_classifications(
    pred_field="zsc_audio_preds",
    gt_field="ground_truth",
    method="simple",
    eval_key="clap_simple_eval",
)

fo.launch_app(audio_dataset)

You can evaluate model performance using the Model Evaluation panel in the app:

Evaluating zero-shot model performance in the model evaluation panel

Quite an impressive performance, which I think will be hard to beat! You can also access the results programmatically:

zsc_results.print_report()
                precision    recall  f1-score   support

      chainsaw       1.00      1.00      1.00        40
    clock_tick       1.00      1.00      1.00        40
crackling_fire       0.95      1.00      0.98        40
   crying_baby       1.00      1.00      1.00        40
           dog       1.00      1.00      1.00        40
    helicopter       0.98      1.00      0.99        40
          rain       1.00      0.95      0.97        40
       rooster       1.00      1.00      1.00        40
     sea_waves       1.00      0.97      0.99        40
      sneezing       1.00      1.00      1.00        40

      accuracy                           0.99       400
     macro avg       0.99      0.99      0.99       400
  weighted avg       0.99      0.99      0.99       400

zsc_results.print_metrics(average="macro", digits=4)
accuracy   0.9925
precision  0.9928
recall     0.9925
fscore     0.9925
support    400

Zero-shot spectrogram classification with Janus Pro

Janus-Pro is an advanced multimodal model designed for both multimodal understanding and visual generation, emphasizing improvements in understanding tasks. The model’s architecture is built upon decoupled visual encoding, which allows it to handle the differing representation needs of these two types of tasks more effectively.

I’ve developed a plugin for Janus Pro that allows you to easily run the model on your FiftyOne dataset.

Start by downloading the plugin, installing the requirements, and instantiating the operator.

!fiftyone plugins download https://github.com/harpreetsahota204/janus-vqa-fiftyone


!fiftyone plugins requirements @harpreetsahota/janus_vqa --install

import fiftyone.operators as foo

janus_vqa = foo.get_operator("@harpreetsahota/janus_vqa/janus_vqa")

In the Vision Language Models Are Few-Shot Audio Spectrogram Classifiers paper, the authors use the following prompt in the zero-shot setting:

Figure 5 from Vision Language Models Are Few-Shot Audio Spectrogram Classifiers

However, in the paper, the authors experimented with large models such as GPT-4o, Claude-3.5 Sonnet, and Gemini-1.5. Since we’re working with a smaller model, I’ll construct a more concise prompt:

string_audio_classes = ', '.join(audio_classes)

vlm_query_prompt = f"""This is an image of a spectrogram. Which of the following classes does this spectrogram best represent: [{string_audio_classes}]
Your response should be one word, the name of the class and nothing else.
"""

Before running this operator, we’ll need to kick off a delegated service. You can do this by running fiftyone delegated launch in your terminal.

await janus_vqa(
    audio_dataset,
    model_path="deepseek-ai/Janus-Pro-7B", #or you could pass deepseek-ai/Janus-Pro-1B
    question=vlm_query_prompt,
    question_field="query",
    answer_field="janus_classification",
    delegate=True
)

By default, the plugin outputs its result as a FiftyOne StringField, but we’ll need to parse the results into FiftyOne Classifications so that we can use evaluate_classifications:

classifications = [fo.Classification(label=cls) for cls in audio_dataset.values("janus_classification")]

audio_dataset.set_values("janus_as_classification", classifications)

janus_results = audio_dataset.evaluate_classifications(
    pred_field="janus_as_classification",
    gt_field="ground_truth",
    method="simple",
    eval_key="janus_simple_eval",
)

We can get an idea of model performance right off the bat:

audio_dataset.count_values("janus_classification")
{'rain': 28, 'clock_tick': 1, 'dog': 355, 'crying_baby': 16}

Well, at first glance, the results don’t look promising at all! Let’s dig a bit deeper:

janus_results.print_report()
                precision    recall  f1-score   support

      chainsaw       0.00      0.00      0.00        40
    clock_tick       1.00      0.03      0.05        40
crackling_fire       0.00      0.00      0.00        40
   crying_baby       0.00      0.00      0.00        40
           dog       0.11      0.95      0.19        40
    helicopter       0.00      0.00      0.00        40
          rain       0.11      0.07      0.09        40
       rooster       0.00      0.00      0.00        40
     sea_waves       0.00      0.00      0.00        40
      sneezing       0.00      0.00      0.00        40

      accuracy                           0.10       400
     macro avg       0.12      0.10      0.03       400
  weighted avg       0.12      0.10      0.03       400

Looking at the results, my suspicions are confirmed. But I’ll still use the model evaluation panel to compare model performance.

fo.launch_app(audio_dataset)

Model comparison panel in FiftyOne

The lacklustre performance of Janus-Pro on this zero-shot classification task isn’t entirely surprising for several reasons:

  1. Model Size and Training: Unlike the larger models used in the original paper (GPT-4o, Claude-3.5 Sonnet, and Gemini-1.5), Janus-Pro is a significantly smaller model. This likely limits its ability to make nuanced distinctions between spectrogram patterns.
  2. Prompt Engineering: While necessary for the smaller model, the simplified prompt we used might not provide enough context about how to interpret spectrograms. A more detailed prompt explaining spectrograms’ time-frequency relationships could improve performance (see the sketch after this list).
  3. Zero-Shot vs Few-Shot: The original paper demonstrated that few-shot learning significantly improved performance. By showing the model examples of each class, it can better learn the visual patterns associated with different sounds. Our zero-shot approach, while simpler, leaves the model to figure out these patterns from scratch.
  4. Visual Embedding Analysis: Looking back at our AIMv2 embedding visualization, the significant overlap between categories in the visual space suggested that pure vision-based approaches might struggle with this task. The CLAP embeddings, which showed clear clustering, reinforce that audio-specific architectures might be better suited for this task.
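As an example of the prompt-engineering point, a more spectrogram-aware prompt might look something like the following. This is untested and purely illustrative of the direction:

detailed_vlm_prompt = f"""This image is a spectrogram of an audio clip: time runs left to right,
frequency runs bottom to top, and brighter regions mean more energy at that time and frequency.
Short, repeated vertical streaks usually indicate percussive or transient sounds (e.g., clock ticks or barks),
while sustained horizontal bands suggest continuous sounds (e.g., rain or helicopter rotors).
Which of the following classes does this spectrogram best represent: [{string_audio_classes}]?
Your response should be one word, the name of the class and nothing else.
"""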

Conclusion

While Janus-Pro’s performance on zero-shot spectrogram classification wasn’t impressive, this exploration yielded valuable insights into the intersection of vision and audio understanding:

  1. Embedding Analysis Revelations: Our comparison of different embedding spaces (CLAP, Music2Latent, and AIMv2) provided fascinating insights into how different models interpret audio data. The clear clustering in CLAP’s embeddings versus the mixed representations in AIMv2’s space highlighted the importance of domain-specific architectures.
  2. Baseline Performance: CLAP’s strong zero-shot performance established a compelling baseline, demonstrating the current capabilities of dedicated audio understanding models. This gives us a clear reference point for evaluating future multimodal approaches.
  3. VLM Limitations and Potential: While Janus-Pro struggled with zero-shot classification, this experiment helps us understand the current limitations of treating audio classification as a purely visual task. It also suggests that with few-shot learning, larger models, and better prompt engineering, VLMs might still have untapped potential in audio understanding.

Even when experiments don’t yield the results we hoped for, they give us insights into model capabilities, limitations, and the exciting challenges in bridging different modalities of perception.