Video is the hardest modality to work with. You're dealing with more data, temporal complexity, and annotation workflows that don't scale. This hands-on workshop tackles a practical question: given a large video dataset, how do you understand what's in it without manually watching thousands of clips?
Using Facebook's Action100M dataset and FiftyOne, we'll build an end-to-end workflow for exploring video datasets. You'll learn to:
- Navigate and explore video data in the FiftyOne App, filter samples, and understand dataset structure
- Compute embeddings with Qwen3-VL to enable semantic search, zero-shot classification, and clustering
- Generate descriptions and localize events using vision-language models like Qwen3-VL and Molmo2
- Visualize patterns in your data through embedding projections and the FiftyOne App
- Evaluate model outputs against Action100M's hierarchical annotations to validate what the models actually capture
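The embedding-based steps above (semantic search, zero-shot classification) boil down to nearest-neighbor lookups in embedding space. Here is a minimal sketch of that idea using NumPy with random placeholder vectors standing in for the Qwen3-VL embeddings you would compute in the workshop; the dimensions, label set, and query are all hypothetical:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder embeddings standing in for Qwen3-VL outputs (hypothetical values).
rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(100, 512))   # one vector per video clip
query = rng.normal(size=(1, 512))               # an embedded text query, e.g. "person cooking"

# Semantic search: rank clips by similarity to the text query.
scores = cosine_sim(query, clip_embeddings)[0]
top_k = np.argsort(scores)[::-1][:5]            # indices of the 5 closest clips

# Zero-shot classification: assign each clip its nearest label embedding
# (labels here are hypothetical, e.g. "cooking", "sports", "music").
label_embeddings = rng.normal(size=(3, 512))
predictions = np.argmax(cosine_sim(clip_embeddings, label_embeddings), axis=1)
```

In the session itself, FiftyOne handles this bookkeeping for you (indexing embeddings, running similarity queries, and visualizing projections in the App), but the underlying math is exactly this.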
By the end of the session, you'll have a reusable toolkit for understanding any video dataset at scale, whether you're curating training data, debugging model performance, or exploring a new domain.
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI, and a deep interest in RAG, agents, and multimodal AI.