Exploring Video Datasets with FiftyOne and Vision-Language Models - February 26, 2026
Feb 26, 2026
9am - 10am PST
Online. Register for the Zoom!
About this event
Join our virtual workshop to explore video datasets with FiftyOne and vision-language models.
Video is the hardest modality to work with. You're dealing with more data, temporal complexity, and annotation workflows that don't scale. This hands-on workshop tackles a practical question: given a large video dataset, how do you understand what's in it without manually watching thousands of clips?
Using Facebook's Action100M dataset and FiftyOne, we'll build an end-to-end workflow for exploring video datasets; short code sketches of the workflow follow the list below. You'll learn to:
Navigate and explore video data in the FiftyOne App, filter samples, and understand dataset structure
Compute embeddings with Qwen3-VL to enable semantic search, zero-shot classification, and clustering
Generate descriptions and localize events using vision-language models like Qwen3-VL and Molmo2
Visualize patterns in your data through embedding projections and the FiftyOne App
Evaluate model outputs against Action100M's hierarchical annotations to validate what the models actually capture
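To give a flavor of the first part of the workflow, here is a minimal sketch of loading a folder of clips into FiftyOne, filtering by video metadata, and launching the App. The dataset path, dataset name, and 30-second duration threshold are placeholders, not the workshop's actual setup.

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Load a local directory of video clips into FiftyOne
# ("/path/to/videos" is a placeholder, not the workshop's data path)
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/videos",
    dataset_type=fo.types.VideoDirectory,
    name="video-exploration-demo",
)

# Populate per-video metadata (duration, frame rate, resolution, ...)
dataset.compute_metadata()

# Example filter: keep only clips longer than 30 seconds
long_clips = dataset.match(F("metadata.duration") > 30)
print(long_clips)

# Open the FiftyOne App to browse samples and build filters interactively
session = fo.launch_app(dataset)
```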
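Building on that, a second sketch shows how precomputed per-clip embeddings plug into FiftyOne Brain for similarity search and 2D projections. The embed_video helper is a hypothetical stand-in for a real Qwen3-VL (or other VLM) embedding pipeline, and the field and brain-key names are illustrative only.

```python
import numpy as np
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("video-exploration-demo")

def embed_video(filepath):
    # Placeholder: stands in for a real Qwen3-VL video embedding
    # pipeline; it returns a random vector so the FiftyOne plumbing
    # below runs end to end.
    return np.random.rand(512)

# Compute one embedding per clip and store it on each sample
embeddings = np.stack([embed_video(fp) for fp in dataset.values("filepath")])
dataset.set_values("clip_embedding", [e.tolist() for e in embeddings])

# Build a similarity index for "find clips like this one" queries
fob.compute_similarity(dataset, embeddings=embeddings, brain_key="clip_sim")

# Project the embeddings to 2D for the App's embeddings panel
# (method="umap" requires the umap-learn package)
fob.compute_visualization(
    dataset, embeddings=embeddings, brain_key="clip_viz", method="umap"
)

# Sort the dataset by similarity to one query clip
query_id = dataset.first().id
similar_view = dataset.sort_by_similarity(query_id, k=25, brain_key="clip_sim")

session = fo.launch_app(similar_view)
```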
By the end of the session, you'll have a reusable toolkit for understanding any video dataset at scale, whether you're curating training data, debugging model performance, or exploring a new domain.