Video is the hardest modality to work with. You're dealing with more data, temporal complexity, and annotation workflows that don't scale. This hands-on workshop tackles a practical question: given a large video dataset, how do you understand what's in it without manually watching thousands of clips?
Using Facebook's Action100M dataset and FiftyOne, we'll build an end-to-end workflow for exploring video datasets. You'll learn to:
- Navigate and explore video data in the FiftyOne App, filter samples, and understand dataset structure
- Compute embeddings with Qwen3-VL to enable semantic search, zero-shot classification, and clustering
- Generate descriptions and localize events using vision-language models like Qwen3-VL and Molmo2
- Visualize patterns in your data through embedding projections and the FiftyOne App
- Evaluate model outputs against Action100M's hierarchical annotations to validate what the models actually capture
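The embedding-based steps above (semantic search, zero-shot classification) boil down to nearest-neighbor lookups in embedding space. Here is a minimal sketch of that idea using NumPy with random placeholder vectors standing in for the Qwen3-VL embeddings you would compute in the workshop; the dimensions, label set, and query are all hypothetical:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder embeddings standing in for Qwen3-VL outputs (hypothetical values).
rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(100, 512))   # one vector per video clip
query = rng.normal(size=(1, 512))               # an embedded text query, e.g. "person cooking"

# Semantic search: rank clips by similarity to the text query.
scores = cosine_sim(query, clip_embeddings)[0]
top_k = np.argsort(scores)[::-1][:5]            # indices of the 5 closest clips

# Zero-shot classification: assign each clip its nearest label embedding
# (labels here are hypothetical, e.g. "cooking", "sports", "music").
label_embeddings = rng.normal(size=(3, 512))
predictions = np.argmax(cosine_sim(clip_embeddings, label_embeddings), axis=1)
```

In the session itself, FiftyOne handles this bookkeeping for you (indexing embeddings, running similarity queries, and visualizing projections in the App), but the underlying math is exactly this.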
By the end of the session, you'll have a reusable toolkit for understanding any video dataset at scale, whether you're curating training data, debugging model performance, or exploring a new domain.
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI, and a deep interest in RAG, agents, and multimodal AI.