Best of NeurIPS – Feb 4, 2025


Feb 4, 2025 at 9 AM Pacific


Welcome to the Best of NeurIPS virtual series, which highlights some of the groundbreaking research, insights, and innovations that defined this year’s conference, live-streamed from the authors to you.

No "Zero-Shot" Without Exponential Data

Vishaal Udandarao
University of Tuebingen

Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during “zero-shot” evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?

Through thorough experiments, we consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample-inefficient log-linear scaling trend. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. Taken together, our study reveals an exponential need for training data, which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
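To make the reported trend concrete, here is a minimal illustrative sketch (not code from the paper) of what a log-linear relationship between pretraining concept frequency and downstream “zero-shot” accuracy looks like; all numbers below are hypothetical.

```python
import numpy as np

# Hypothetical measurements: how often a concept appears in the pretraining
# set vs. the model's downstream "zero-shot" accuracy on that concept.
concept_frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
zero_shot_accuracy = np.array([0.22, 0.31, 0.40, 0.49, 0.58])  # illustrative only

# Log-linear trend: accuracy ~ a * log10(frequency) + b
a, b = np.polyfit(np.log10(concept_frequency), zero_shot_accuracy, deg=1)
print(f"accuracy gain per 10x more pretraining examples: {a:.2f}")

# The flip side: each fixed accuracy gain needs a multiplicative
# (i.e. exponential) increase in pretraining examples of the concept.
target_accuracy = 0.70
needed_frequency = 10 ** ((target_accuracy - b) / a)
print(f"examples needed to reach {target_accuracy:.0%} accuracy: ~{needed_frequency:,.0f}")
```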

Read the paper, “No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance”

About the Speaker

Vishaal Udandarao is a third-year ELLIS PhD student, jointly advised by Matthias Bethge at the University of Tuebingen and Samuel Albanie at Google DeepMind. He completed his undergraduate degree in computer science at IIIT Delhi from 2016 to 2020, and his master’s in machine learning at the University of Cambridge in 2021.

He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His key research interests are: Data-centric Machine Learning, Robustness/Generalisation to Distribution Shifts, and Foundation Models. He has also been awarded the Google PhD Fellowship in Machine Intelligence for 2024.

Understanding Bias in Large-Scale Visual Datasets

Boya Zeng
University of Pennsylvania

Truly general-purpose vision systems require pre-training on diverse and representative visual datasets. The “dataset classification” experiment reveals that modern large-scale visual datasets are still very biased: neural networks can achieve excellent accuracy in classifying which dataset an image is from. However, the concrete forms of bias among these datasets remain unclear. In this talk, I will present a framework to identify the unique visual attributes distinguishing these large-scale datasets.
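As a rough illustration of the “dataset classification” experiment referenced above, here is a minimal PyTorch sketch (not the authors’ code): a standard classifier is trained to predict which source dataset an image came from, and high accuracy signals dataset-specific bias. The dataset count and dummy tensors are placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_DATASETS = 3  # placeholder: number of large-scale datasets being compared

# A standard image classifier whose "classes" are the source datasets.
model = resnet18(num_classes=NUM_DATASETS)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def training_step(images: torch.Tensor, dataset_ids: torch.Tensor) -> float:
    """One optimization step: predict which dataset each image came from."""
    logits = model(images)                 # (B, NUM_DATASETS)
    loss = criterion(logits, dataset_ids)  # dataset_ids: (B,) integer labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Smoke test with random tensors standing in for real images and labels.
dummy_images = torch.randn(8, 3, 224, 224)
dummy_labels = torch.randint(0, NUM_DATASETS, (8,))
print(training_step(dummy_images, dummy_labels))
```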

Read the paper, “Understanding Bias in Large-Scale Visual Datasets”

About the Speaker

Boya Zeng is an undergraduate student at the University of Pennsylvania. He is currently working with Prof. Zhuang Liu at Princeton University on visual datasets and generative models.

Map It Anywhere: Empowering BEV Map Prediction using Large-scale Public Datasets

Cherie Ho
Carnegie Mellon University

Omar Alama
Carnegie Mellon University

Jiaye Zou
Carnegie Mellon University

Top-down Bird’s Eye View (BEV) maps are a popular representation for ground robot navigation due to their richness and flexibility for downstream tasks. While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to small regions captured by current autonomous vehicle-based datasets. In this context, we show that a more scalable approach towards generalizable map prediction can be enabled by using two large-scale crowd-sourced mapping platforms, Mapillary for FPV images and OpenStreetMap for BEV semantic maps.

We introduce Map It Anywhere (MIA), a data engine that enables seamless curation and modeling of labeled map prediction data from existing open-source map platforms. Using our MIA data engine, we demonstrate the ease of automatically collecting a dataset of 1.2 million FPV & BEV pairs encompassing diverse geographies, landscapes, environmental factors, camera models & capture scenarios. We further train a simple, camera model-agnostic model on this data for BEV map prediction. Extensive evaluations using established benchmarks and our dataset show that the data curated by MIA enables effective pretraining for generalizable BEV map prediction, with zero-shot performance exceeding baselines trained on existing datasets by 35%. Our analysis highlights the promise of using large-scale public maps for developing & testing generalizable BEV perception, paving the way for more robust autonomous navigation.
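For readers unfamiliar with the task setup, the following is a schematic PyTorch sketch (not the MIA codebase) of FPV-to-BEV map prediction: a model maps a first-person-view image to a top-down semantic grid, supervised by rasterized map labels. The architecture, shapes, and class count are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 6   # placeholder semantic classes (e.g. road, sidewalk, building, ...)
BEV_SIZE = 112    # placeholder side length of the top-down grid, in cells

class TinyFPVToBEV(nn.Module):
    """A deliberately tiny stand-in for a real FPV -> BEV network."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, NUM_CLASSES, kernel_size=1)

    def forward(self, fpv: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(fpv)  # image-plane features
        # Crude resize standing in for a proper view transform to the BEV grid.
        feats = F.interpolate(feats, size=(BEV_SIZE, BEV_SIZE),
                              mode="bilinear", align_corners=False)
        return self.head(feats)    # (B, NUM_CLASSES, BEV_SIZE, BEV_SIZE)

model = TinyFPVToBEV()
fpv_batch = torch.randn(2, 3, 256, 256)                              # placeholder FPV images
bev_labels = torch.randint(0, NUM_CLASSES, (2, BEV_SIZE, BEV_SIZE))  # placeholder BEV labels
loss = F.cross_entropy(model(fpv_batch), bev_labels)
print(loss.item())
```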

Read the paper, “Map It Anywhere (MIA): Empowering Bird’s Eye View Mapping using Large-scale Public Data”

About the Speakers

Cherie Ho is a final-year robotics PhD student at Carnegie Mellon University working with Prof. Sebastian Scherer. Her research interest lies at the intersection of field robotics, computer vision, and machine learning, developing robots that can continuously learn in new scenarios. She has developed generalizable, adaptive, and uncertainty-aware robot algorithms for dynamic real-world applications, including high-speed off-road driving, outdoor multi-drone systems, and outdoor wheelchairs. She is a recipient of the Croucher Scholarship for Doctoral Study.

Jiaye (Tony) Zou is a senior CS undergraduate at Carnegie Mellon University. He is interested in multi-modal perception in dynamic real-world environments. He developed Map It Anywhere, a large-scale data engine and baseline model for generalizable Bird’s Eye View mapping.

Omar Alama is starting his PhD in ECE at Carnegie Mellon University in Fall 2024, advised by Prof. Sebastian Scherer and working in the AirLab at the CMU Robotics Institute. His research interests revolve around classical and modern deep-learning-based computer vision, which he uses to build generalizable and efficient perception systems.