CVPR 2024 Datasets and Benchmarks – Part 1: Datasets

CVPR, the IEEE/CVF Conference on Computer Vision and Pattern Recognition, is the Coachella Festival of computer vision. Just like any music festival needs its headliners, deep learning needs its rockstars: datasets and benchmarks. 

Over the years, these have played a massive role as the driving force in advancing computer vision and deep learning in general, and CVPR has consistently been a platform for introducing new ones. Without a platform like CVPR for researchers to present new datasets, progress in deep learning wouldn’t be as fast as it has been over the last decade.

But what sets a dataset apart from a benchmark?

A dataset is a collection of data samples, such as images, videos, or annotations, used to train and evaluate deep learning models. It provides the raw data that models learn from during the training process. Datasets are often labeled or annotated to provide ground truth information for supervised learning tasks. In computer vision, famous datasets from previous CVPRs include 2009’s ImageNet, 2016’s Cityscapes Dataset, 2017’s Kinetics Dataset, and 2020’s nuScenes.  

These large-scale datasets have enabled the development of state-of-the-art deep models for tasks like image classification, object detection, and semantic segmentation.

On the other hand, benchmarks are standardized tasks or challenges used to evaluate and compare the performance of different models or algorithms.

Benchmarks typically consist of a dataset, a well-defined evaluation metric, and a leaderboard ranking the performance of different models. They are the yardstick against which models are measured, allowing researchers to gauge performance against state-of-the-art approaches and track progress in a specific task or domain. Examples include the 2009 Pedestrian Detection Benchmark and 2012’s KITTI Vision Benchmark Suite.

The importance of datasets and benchmarks in deep learning can’t be overstated.

Datasets serve as the foundation for training deep learning models, while benchmarks allow for the objective assessment and comparison of model performance. That’s why new datasets and benchmarks push research boundaries and address the limitations of current approaches.

New datasets are essential to exploring novel tasks, accommodating emerging model architectures, and capturing diverse real-world scenarios. As models achieve higher – and eventually human-level – performance on established benchmarks, researchers create new, more challenging ones to push the boundaries of what’s possible.  

Several new datasets and benchmarks at CVPR 2024 have caught my eye, each with the potential to advance progress in computer vision and deep learning. 

Do a Ctrl-F for “datasets” on the list of accepted papers, and you’ll see a whopping 72 results. These datasets cover a spectrum of computer vision tasks, including multimodal, multi-view, room layout, 3D/4D/6D, and robotic perception datasets, among others. Admittedly, the topic I’m most interested in is vision-language and multimodality, so my picks are biased toward that.

I’ll explore both in this two-part blog series, with this blog focusing on three datasets that I found interesting (presented in no particular order):

  • Panda-70M
  • 360+x
  • TSP6K

Similarly, in part two, I’ll discuss three benchmarks that captured my interest: 

  • ImageNet-D
  • Polaris
  • VBench

In the following sections, I’ll focus on the following aspects of each dataset:

  • Task and Domain: Pinpoint each dataset’s specific problem and area within the computer vision landscape.
  • Dataset Curation, Size, and Composition: Explore the scale of each dataset, its creation process, and the diversity of data it encompasses.
  • Unique Features and Innovations: Uncover what sets each dataset apart and any novel approaches employed in its development.
  • Potential Applications and Impact: Discuss how each dataset can move the field forward and its potential real-world applications.
  • Comparison to Existing Datasets: Provide context by comparing each dataset to similar ones, showcasing how it fills existing gaps or offers unique advantages.
  • Accessibility and Usability: Share information about each dataset’s availability, licensing, and where you can find it.
  • Experiments and Results: When applicable, I’ll summarize key experimental findings for each dataset.

Panda-70M

tl;dr

  • Task: Video captioning
  • Size: The full training set has 70.7M samples at 720p resolution, with an average clip duration of 8.5s (167K hours in total) and an average caption length of 13.2 words.
  • License: Non-commercial, research only
  • Paper

Task and Domain

Panda-70M is for multimodal learning tasks that involve video and text, specifically targeting video captioning, video and text retrieval, and text-driven video generation. The domain is open, meaning the dataset includes a wide range of subjects and is not limited to any specific field or type of content.

Dataset Curation, Size, and Composition

Panda-70M contains 70.8 million video clips with automatically generated captions. 

It was created by curating 3.8 million high-resolution videos from the publicly available HD-VILA-100M dataset. The dataset is created in three steps (a minimal code sketch follows the list):

  1. Split videos into semantically coherent clips (see section 3.1 of the paper for details).
  2. Caption each clip using several cross-modality teacher models, such as image/video visual question-answering models, with additional inputs/metadata about the video (discussed in the next section).
  3. Take a 100K subset of this data and have human annotators write ground-truth captions, then:
    1. Fine-tune a retrieval model on this subset (see section 3.3 of the paper for more details).
    2. Use it to select the most precise annotation describing each scene of the videos.
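
To make the captioning-and-selection flow concrete, here is a minimal Python sketch of steps 2 and 3. The types and function names are illustrative stand-ins I’ve chosen, not the authors’ code: any set of cross-modality teachers and any fine-tuned retrieval scorer could be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Clip:
    """Illustrative container for one video clip; the real pipeline works on
    video files plus metadata such as the title, description, and subtitles."""
    frames: list
    metadata: dict = field(default_factory=dict)

CaptionTeacher = Callable[[Clip], str]        # a cross-modality captioning model
CaptionScorer = Callable[[Clip, str], float]  # the fine-tuned retrieval model's score

def caption_clip(clip: Clip, teachers: list[CaptionTeacher], score: CaptionScorer) -> str:
    """Collect one candidate caption per teacher, then keep the best-matching one."""
    candidates = [teacher(clip) for teacher in teachers]
    return max(candidates, key=lambda caption: score(clip, caption))
```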

The result is a diverse dataset, with videos that are semantically coherent, high-resolution, and free of watermarks. The captions describe each video’s main object and action, with an average of 13.2 words per caption.

The authors point out a limitation that is an artifact of subsetting HD-VILA-100M: most of the samples are speech-heavy videos. As a result, the dataset’s main categories are news, television shows, documentary films, egocentric videos, and instructional and narrative videos.

Unique Features and Innovations

The authors’ core insight is that a video typically contains information from several modalities that can assist an automatic captioning model: the title, description, subtitles, static frames, and the video itself. Leveraging these modalities makes it possible to build a large-scale video-language dataset with rich captions far more efficiently than manual annotation.

  • Automatic Annotation: Utilizes multiple cross-modality teacher models to generate captions, eliminating the need for expensive manual annotation.
  • Semantics-aware Splitting: Ensures video clips are semantically consistent while maintaining sufficient length for capturing motion content.
  • Fine-grained Retrieval: Employs a retrieval model specifically fine-tuned to select the most accurate and detailed caption from multiple candidates (see the sketch after this list).
  • Multimodal Input: Takes advantage of various modalities and metadata like video frames, titles, descriptions, and subtitles to generate captions.
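
As a rough stand-in for that fine-grained retrieval step, the sketch below scores each candidate caption against a few sampled frames with an off-the-shelf CLIP model and keeps the highest-scoring one. The paper fine-tunes a dedicated video-text retrieval model for this, so treat frame-level CLIP scoring purely as an approximation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_best_caption(frames: list[Image.Image], candidates: list[str]) -> str:
    """Return the candidate caption with the highest mean frame-text similarity."""
    inputs = processor(text=candidates, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, num_candidates)
    return candidates[int(logits.mean(dim=0).argmax())]
```

Swapping in a retrieval model fine-tuned on the 100K human-annotated subset, as the authors do, is what makes the selected captions reliable at scale.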

Potential Applications and Impact

Panda-70M can enable more effective pretraining of video-language models, driving progress on video understanding tasks. By pretraining on Panda-70M, the authors demonstrate substantial improvements in video captioning, retrieval, and text-to-video generation. Potential applications:

  • Video Captioning: Training and evaluating video captioning models with improved accuracy and detail.
  • Video and Text Retrieval: Facilitating accurate retrieval of videos based on text queries and vice versa.
  • Text-driven Video Generation: This dataset will enable the generation of videos from textual descriptions, potentially leading to advancements in video editing and content creation.
  • Multimodal Understanding: Advancing research in multimodal learning and understanding the relationships between vision and language.

Comparison to Existing Datasets

The Panda-70M video captioning dataset is much larger than other datasets that rely on manual labelling. It overcomes the limitations of ASR annotations, which often fail to capture the main content and actions, and addresses the cost and scalability issues of manual annotation.

  • Larger Scale: Compared to existing manually annotated video captioning datasets, Panda-70M offers a significantly larger scale.
  • Richer Captions: Provides more detailed and informative captions than datasets annotated with ASR-generated subtitles.
  • Open Domain: This dataset covers a broader range of video content than datasets focused on specific domains, like cooking or actions.

Accessibility and Usability

The authors make the Panda-70M dataset and code publicly available. This enables the research community to build upon this resource and utilize it for various video-language tasks. However, according to the license, the dataset cannot be used commercially.

Experiments and Results

The paper provides benchmark results of models pretrained on Panda-70M on three downstream tasks: video captioning, video and text retrieval, and text-to-video generation.

Across these tasks, the pretrained models post impressive results, even beating SOTA models like Video-LLaMA and BLIP-2.

360+x

tl;dr

  • Task: Panoptic multimodal scene understanding (video classification, temporal action localization, cross-modality retrieval, and more)
  • Size: 2,152 videos covering 232 data examples, spanning four viewpoints plus multi-channel audio, directional binaural delay, location data, and textual scene descriptions
  • License: Non-commercial, research only (CC BY-NC-SA 4.0)

Task and Domain

360+x is a panoptic multimodal scene understanding dataset that provides a holistic view of scenes from multiple viewpoints and modalities.  The dataset focuses on real-world, everyday scenes with diverse activities and environments. It’s especially suited for tasks like:

  • Video classification: Identifying the scene category (e.g., park, restaurant)
  • Temporal action localization: Detecting and classifying actions within a video (e.g., eating, walking)
  • Self-supervised representation learning: Learning meaningful representations from unlabeled data
  • Cross-modality retrieval: Retrieving relevant information across different modalities (e.g., finding video segments based on audio cues)
  • Dataset adaptation: Adapting pre-trained models to new datasets

Dataset Size and Composition

360+x was curated using two main devices, the Insta360 One X2 and Snapchat Spectacles 3 cameras, ensuring high resolution and frame rate for both video and audio modalities.

The dataset contains 360° panoramic videos, third-person front view videos, egocentric monocular/binocular videos, multi-channel audio, directional binaural delay information, GPS location data, and textual scene descriptions.

  • Size: 2,152 videos representing 232 data examples (464 from the 360° camera, 1,688 from the Spectacles)
  • Composition:
    • Viewpoints: 360° panoramic, third-person front view, egocentric monocular, egocentric binocular
    • Modalities: Video, multi-channel audio, directional binaural delay, location data, textual scene descriptions (see the sketch after this list for what a single sample might look like)
  • Curation:
    • Scene selection: Based on comprehensiveness, real-life relevance, diverse weather/lighting, and rich sound sources
    • Annotation: Scene categories (28) based on Places Database and large language models; temporal action segmentation (38 action instances) with consensus among annotators
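
To make the composition above concrete, here is a hypothetical sketch of what a single 360+x data example could look like as a Python record. The field names are my own, not the dataset’s actual schema or loading API.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Sample360x:
    """Hypothetical record for one 360+x data example (field names are illustrative)."""
    panoramic_video: Path                 # 360° panoramic view
    front_view_video: Path                # third-person front view
    egocentric_videos: list[Path]         # monocular and/or binocular egocentric views
    audio: Path                           # multi-channel audio
    binaural_delay: Path                  # directional binaural delay track
    gps_location: tuple[float, float]     # latitude, longitude
    scene_description: str                # textual scene description
    scene_category: str                   # one of the 28 scene categories
    action_segments: list[tuple[float, float, str]]  # (start s, end s, action label)
```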

Unique Features and Innovations

360+x is the first dataset to cover multiple viewpoints (panoramic, third-person, egocentric) with multiple modalities (video, audio, location, text) to mimic real-world perception. 

Using novel data collection techniques, including 360-degree camera technology for panoramic video, it captures the following:

  • Multiple viewpoints and modalities: Offers a more holistic understanding of scenes than unimodal datasets.
  • Real-world complexity: Captures diverse activities and interactions in various environments, providing a more realistic challenge.
  • Directional binaural delay: Enables sound source localization and richer audio analysis (a minimal sketch follows this list).
  • Egocentric and third-person views: Allows for studying both participant and observer perspectives.
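
To illustrate what the directional binaural delay enables, here is a minimal, generic sketch of estimating the interaural time difference between two audio channels via cross-correlation. This is a textbook technique rather than the dataset’s own tooling, and the toy signal is purely illustrative.

```python
import numpy as np

def interaural_time_delay(left: np.ndarray, right: np.ndarray, sample_rate: int) -> float:
    """Estimate how many seconds the right channel lags the left, assuming
    right[n] ~ left[n - d]; a positive result means the sound hit the left ear first."""
    correlation = np.correlate(left, right, mode="full")
    delay_samples = (len(right) - 1) - int(np.argmax(correlation))
    return delay_samples / sample_rate

# Toy usage: random noise standing in for audio, right channel 0.5 ms late.
sample_rate = 48_000
rng = np.random.default_rng(0)
left = rng.standard_normal(sample_rate)
right = np.roll(left, int(0.0005 * sample_rate))
print(f"estimated ITD: {interaural_time_delay(left, right, sample_rate) * 1e3:.2f} ms")
```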

Potential Applications and Impact

This dataset’s rich multimodal data opens up new avenues for research in multimodal deep learning, particularly in multi-view, multimodal data for video understanding tasks. 

This could have a significant impact on real-world applications, such as enhancing robot navigation systems and enriching user interaction within smart environments.

Here are some ways I think it could impact research:

  • Multiple Viewpoints and Modalities: By offering a more comprehensive understanding of scenes compared to unimodal datasets, this feature allows for a holistic interpretation of complex environments.
  • Real-World Complexity: The dataset captures a wide array of activities and interactions across diverse environments, providing a more realistic and challenging context for research.
  • Directional Binaural Delay: This feature enables precise sound source localization and facilitates more in-depth audio analysis.
  • Egocentric and Third-Person Views: This unique combination allows researchers to study participant and observer perspectives, offering a complete understanding of human-environment interaction.

Comparison to Existing Datasets

Unlike datasets that focus on specific viewpoints, such as egocentric or third-person views alone, or those that lack audio-visual correlation, 360+x combines multiple viewpoints with multimodal data. 

Below are some of the limitations of existing datasets:

  • UCF101, Kinetics, HMDB51, ActivityNet: Primarily focus on video classification with limited scene complexity and single viewpoints.
  • EPIC-Kitchens, Ego4D: Focus on egocentric videos lacking other viewpoints and modalities.
  • AVA, AudioSet, VGGSound: Focus on audio-visual analysis but lack multiple viewpoints and directional audio information.

360+x addresses these limitations by providing a more comprehensive and realistic dataset with multiple viewpoints, modalities, and rich annotations. It also has longer 6-minute videos capturing more complex co-occurring activities than short single-action clips in most datasets. 

Accessibility and Usability

The dataset, code, and associated tools are publicly available for research under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which means they cannot be used commercially.

Experiments and Results

The paper establishes several baselines on 360+x for tasks such as video classification, temporal action localization, and cross-modal retrieval. 

Some interesting findings include:

Video Scene Classification

  • A 360° panoramic view outperforms an egocentric binocular or third-person front view alone.
  • Utilizing all three views (360°, egocentric binocular, third-person front) leads to the best performance.
  • Including additional modalities (audio, directional binaural information) alongside video improves average precision.

Temporal Action Localization (TAL)

  • Adding audio and directional binaural delay modalities improves performance for both the baseline extractors and the extractors pre-trained on the 360+x dataset.
  • Using custom extractors pre-trained on the 360+x dataset provides additional improvements over the baseline extractors.

Cross-modality Retrieval

  • The inter-modality retrieval results (Query-to-Video, Query-to-Audio, Query-to-Directional binaural delay) demonstrate how well the 360+x dataset’s modalities align with one another.
  • Collaborative training of audio and directional binaural features as query features leads to better performance than treating them independently.

Self-supervised Representation Learning

  • Using self-supervised learning (SSL) pre-trained models consistently improves precision for video classification, with a combination of video pace and clip order SSL techniques resulting in ~7% improvement.
  • For the TAL task, pre-training with video pace or clip order individually improves average performance by ~1.2% and ~0.9%, respectively, compared to the supervised baseline. Combining both SSL methods yields the highest performance gain of ~1.9%. (A generic sketch of a pace pretext task follows this list.)
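
For readers unfamiliar with these pretext tasks, below is a generic PyTorch sketch of video-pace self-supervision: sample a clip at a random playback speed and train the backbone to predict which speed was used. This is a common formulation I’m assuming here, not necessarily the paper’s exact setup; clip-order prediction is analogous, with shuffled clip permutations as the labels.

```python
import torch
import torch.nn as nn

PACES = [1, 2, 4]  # frame-sampling strides the model must tell apart

def make_pace_example(video: torch.Tensor, clip_len: int = 8):
    """video: (T, C, H, W) with T large enough. Returns a subsampled clip and its pace label."""
    label = int(torch.randint(len(PACES), ()))
    stride = PACES[label]
    start = int(torch.randint(video.shape[0] - clip_len * stride, ()))
    clip = video[start : start + clip_len * stride : stride]  # (clip_len, C, H, W)
    return clip, label

class PacePredictor(nn.Module):
    """Any clip backbone that outputs a feature vector, plus a pace classification head."""
    def __init__(self, backbone: nn.Module, feature_dim: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feature_dim, len(PACES))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(clip))

# Pretraining minimizes nn.CrossEntropyLoss() on the predicted pace; the backbone
# is then fine-tuned on the downstream classification or TAL task.
```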

TSP6K

tl;dr

  • Task: Traffic scene parsing (semantic and instance segmentation) from traffic-monitoring viewpoints
  • Size: 6,000 finely annotated images (2,999 train / 1,207 validation / 1,794 test) with 21 semantic classes and instance-level masks for traffic participants
  • License: Code released under Apache 2.0; check the repository’s licensing notes before any commercial use of the dataset

Task and Domain

TSP6K is a dataset for traffic scene parsing and understanding, focusing on semantic and instance segmentation.

Scene parsing involves segmenting and understanding various components within an image, identifying semantic objects and the surrounding environment. This dataset is tailored for traffic monitoring rather than general driving scenarios, providing detailed pixel-level and instance-level annotations for urban traffic scenes. 

Dataset Size and Composition

The TSP6K dataset includes 6,000 finely annotated images collected from urban traffic-monitoring platforms at different locations and times of day.

It’s split into 2,999 training images, 1,207 validation images, and 1,794 test images, which include:

  • High-resolution images captured from traffic monitoring cameras positioned across 10 Chinese provinces.
  • Careful image selection ensures diversity in scene type (e.g., crossings, pedestrian crossings), weather conditions (sunny, cloudy, rain, fog, snow), and time of day.
  • Pixel-level annotations for 21 semantic classes (e.g., road, vehicles, pedestrians, traffic signs, lanes) and instance-level annotations for all traffic participants.

This approach ensures a rich diversity in the dataset, capturing various traffic conditions and lighting environments. Each image in the dataset comes with high-quality semantic and instance-level annotations, which are necessary for detailed scene analysis in traffic monitoring applications.

Unique Features and Innovations

Some interesting features of this dataset are:

  • Crowded traffic scenes with significantly more traffic participants than existing driving scene datasets present a more challenging and realistic scenario for model training and evaluation.
  • An average of 42 traffic participants per image compared to 5-19 in driving datasets. 
  • 30% of images contain over 50 instances.
  • It captures a wide range of object scales from the perspective of the monitoring cameras, forcing models to handle significant size variations.

Potential Applications and Impact

By providing a dataset specifically for traffic monitoring, the TSP6K dataset opens up new research directions, like:

  • Traffic Flow Analysis: Improved scene understanding can lead to better traffic flow monitoring, congestion prediction, and anomaly detection.
  • Model Training and Evaluation: Enables the training and evaluation of scene parsing, instance segmentation, and unsupervised domain adaptation methods for traffic monitoring applications.
  • Smart City Development: The dataset can contribute to developing intelligent traffic management systems and infrastructure planning.
  • Domain Adaptation Research: The domain gap between TSP6K and existing datasets provides a valuable benchmark for evaluating and advancing domain adaptation techniques.

Comparison to Existing Datasets

Datasets like Cityscapes, KITTI, and BDD100K have advanced traffic scene understanding. 

However, they primarily focused on autonomous driving use cases and scenarios. This means their contents are mainly collected from drivers’ perspectives inside vehicles. Deep learning models trained on driver-centric datasets underperform when applied to traffic monitoring scenes. 

TSP6K, however, is curated from a traffic monitoring perspective. 

This involves higher vantage points in denser and more diverse traffic scenarios from various urban locations captured at different times. This addresses a gap in existing datasets by focusing on the unique challenges presented by traffic monitoring scenes, such as the varied sizes of instances and the complex interactions in crowded urban settings.

Accessibility and Usability

The code and dataset are publicly available to researchers at the GitHub repository. The repository is under an Apache 2.0 license; however, it also references the SegFormer license, so users should be careful about using the dataset for commercial purposes.

Experiments and Results

The paper details experiments using the TSP6K dataset to evaluate various scene parsing, instance segmentation, and unsupervised domain adaptation methods. 

  • Evaluated the performance of existing scene parsing, instance segmentation, and domain adaptation methods on TSP6K.
  • Proposed a detailed refining decoder architecture that achieves 75.8% mIoU and 58.4% iIoU on the validation set.
  • Used mIoU and iIoU (for traffic instances) as evaluation metrics (a minimal mIoU sketch follows this list).
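
For context on those headline numbers, here is a minimal NumPy sketch of how mIoU is typically computed from per-pixel predictions and ground-truth labels. It is not the benchmark’s official evaluation code, it omits iIoU (which additionally normalizes for instance size), and the ignore_index of 255 is a common convention rather than something TSP6K necessarily uses.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 21,
             ignore_index: int = 255) -> float:
    """Mean Intersection-over-Union over classes present in the prediction or ground truth."""
    valid = gt != ignore_index
    ious = []
    for cls in range(num_classes):
        pred_c = (pred == cls) & valid
        gt_c = (gt == cls) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:  # class absent from both masks; skip it
            continue
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious))
```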

These results set a new benchmark for traffic scene parsing performance, particularly in monitoring scenarios, and demonstrate the dataset’s utility for developing robust parsing algorithms across different urban traffic environments.

Conclusion

CVPR 2024 has once again showcased the vital role that datasets and benchmarks play in advancing computer vision and deep learning. 

The three datasets explored in this blog post—Panda-70M, 360+x, and TSP6K—each bring unique innovations and potential for impact in their respective domains.

  • Panda-70M’s massive scale and automatic annotation approach pave the way for more effective pretraining of video-language models, enabling advancements in video captioning, retrieval, and generation. 
  • 360+x’s multi-view, multimodal data captures real-world complexity and opens up new avenues for research in holistic scene understanding. 
  • TSP6K fills a critical gap by focusing on traffic monitoring scenarios. These scenarios present challenges distinct from driver-centric datasets and will help advance traffic scene parsing and domain adaptation.

As we’ve seen, these datasets address limitations in existing resources, provide rich and diverse data, and establish new performance benchmarks. They are poised to drive research forward and contribute to real-world applications like content creation, smart environments, and intelligent traffic management.

However, it’s important to recognize that these datasets represent just a snapshot of the many groundbreaking contributions presented at CVPR 2024. At least 72 papers introduced new datasets spanning a wide range of computer vision tasks, and the conference continues to be a hub for innovation and collaboration.

Stay tuned for part two of this series, where we’ll dive into three exciting benchmarks from CVPR 2024: ImageNet-D, Polaris, and VBench. In the meantime, be sure to check out the Awesome CVPR 2024 Challenges, Datasets, Papers, and Workshops repository on GitHub, and feel free to open an issue if there’s a specific dataset or benchmark you’d like to see covered.

Visit Voxel51 at CVPR 2024!

And if you’re attending CVPR 2024, don’t forget to stop by the Voxel51 booth #1519 to connect with the team, discuss the latest in computer vision and NLP, and grab some of the most sought-after swag at the event. See you there!