Modern computer vision workflows increasingly rely on scalable video annotation pipelines to organize and understand large amounts of video data. Unlike image datasets, a video dataset can contain information that changes over time, including object tracks, temporal events, frame-level detections, and multiple forms of metadata. Managing these annotations efficiently is critical for tasks such as action recognition, tracking, and video classification.
In this tutorial, you will work through a complete example of loading and labeling a Kaggle video dataset using FiftyOne’s open source software. Using the WLASL American Sign Language dataset, you will explore several common approaches to structuring video labels, including sample detections, frame detections, temporal detections, and video-level classification labels. Along the way, you will learn how to represent annotations at different levels of a video sample and visualize them directly inside an interactive dataset workflow.
By the end of this guide, you will understand how to build a flexible video annotation pipeline for machine learning and computer vision applications while working with real-world video data in Python.
Video annotation: Creating a Kaggle video dataset
For today’s example, we will take a look at the World Level American Sign Language (WLASL) dataset found on Kaggle. The dataset features actors signing 200 different words, with multiple actors for each word. It also contains a labeled bounding box showing where the signer is located in each video. You will look at the different ways this ground truth can be represented in a video dataset. Let’s start by loading the data. The dataset contains two parts of interest: a directory full of videos and a JSON file that contains the label and bounding box for each video. Not every label in the JSON file has a matching video in the directory, though, so you will need to account for that later. The videos are all short clips that show a sign from start to finish.
To begin, download the dataset from Kaggle. Feel free to download it manually or with the Kaggle API. After the dataset is unzipped, you can boot up a notebook to load it into FiftyOne.
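If you prefer the API route, here is a minimal sketch. It assumes the kaggle Python package is installed and an API token is configured. The dataset slug and the target folder name are assumptions, so check them against the Kaggle dataset page before running.

```python
from pathlib import Path

# Assumption: the slug of the WLASL dataset on Kaggle; confirm it on the dataset page
DATASET_SLUG = "risangbaskoro/wlasl-processed"


def download_wlasl(target_dir="wlasl", slug=DATASET_SLUG):
    """Download and unzip the dataset into target_dir with the Kaggle API."""
    # Imported inside the function so this module loads even when the
    # kaggle package is not installed
    from kaggle.api.kaggle_api_extended import KaggleApi

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    api = KaggleApi()
    api.authenticate()  # uses your configured API token, e.g. ~/.kaggle/kaggle.json
    api.dataset_download_files(slug, path=str(target), unzip=True)
    return target
```

Calling download_wlasl() once from your notebook should leave the unzipped videos and the label file inside the target folder.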
Next, load your labels from the JSON file in your notebook.
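One way to do this is sketched here with Python’s standard json module so that it runs without extra dependencies. The file name in the usage comment is an assumption. If you would rather work with a dataframe, pandas.read_json on the same file should give you an equivalent table with gloss and instances columns.

```python
import json


def load_labels(json_path):
    """Return the label file as a list of {"gloss": ..., "instances": [...]} entries."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_instances(entries):
    """Map each gloss (the signed word) to its number of video instances."""
    return {entry["gloss"]: len(entry["instances"]) for entry in entries}


# Usage (the file name is an assumption about the unzipped dataset):
# labels = load_labels("wlasl/WLASL_v0.3.json")
# print(len(labels), "glosses loaded")
```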
Well done. You can now see your different signs under gloss and the associated videos under instances. Keep your labels here for now; you will come back to them later. Next, you will load your videos into FiftyOne.
Loading the videos takes a few steps, which you can go through one by one. First, load your video dataset with fo.Dataset.from_dir and set the dataset type to fo.types.VideoDirectory. This populates the dataset with all of your videos. Next, use ensure_frames() to populate each sample with frame instances, which lets you store data at the frame level and not just the video level. Afterwards, use compute_metadata() to calculate statistics for each video, such as frame width and height, frame rate, and duration. Finally, name your dataset and make it persistent so that any changes you make are saved to disk.
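Put together, those steps might look like the sketch below. The function name, the video directory, and the dataset name are placeholders chosen for illustration, and the code assumes FiftyOne is installed.

```python
def load_video_dataset(video_dir, name="wlasl"):
    """Load a directory of videos into a persistent FiftyOne dataset."""
    # Imported here so the function can be defined without FiftyOne installed
    import fiftyone as fo

    dataset = fo.Dataset.from_dir(
        dataset_dir=video_dir,
        dataset_type=fo.types.VideoDirectory,
    )
    dataset.ensure_frames()     # add frame instances to every sample
    dataset.compute_metadata()  # frame size, frame rate, duration, and so on
    dataset.name = name         # placeholder name; must not already be in use
    dataset.persistent = True   # keep the dataset after the session ends
    return dataset
```

For example, dataset = load_video_dataset("wlasl/videos") would load the clips, assuming that is where the unzipped videos live.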
With your videos loaded, next comes adding the labels. Since not every video in the JSON file is present in the dataset, you will look up the dataframe row by the video_id taken from each sample’s filepath.
Use the find_row_by_video_id helper each time you need to look up information in your dataframe.
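A possible version of that lookup is sketched below. Only the name find_row_by_video_id comes from this walkthrough; the body and the video_id_from_path helper are assumptions about how it could work. The sketch accepts either a pandas dataframe or a plain list of entries, and it returns (None, None) when a video has no label.

```python
import os


def video_id_from_path(filepath):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(filepath))[0]


def find_row_by_video_id(labels, video_id):
    """Return (row, instance) for video_id, or (None, None) if it has no label."""
    # Accept a pandas DataFrame as well as a plain list of dicts
    rows = labels.to_dict("records") if hasattr(labels, "to_dict") else labels
    for row in rows:
        for inst in row["instances"]:
            if inst["video_id"] == video_id:
                return row, inst
    return None, None
```

This is a linear scan, which is fine for a small demo. For many lookups, building a dictionary keyed by video_id once would be faster.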
Video annotation: Sample detections
The first label type you will look at is the sample-level detection. This is a single bounding box that stays static throughout the entire video. It is also how the bounding boxes in the WLASL data are represented. You can load these bounding boxes and take a look. You will work with a portion of the dataset to lower load times.
To add a detection to your dataset, first look up the labels associated with the video_id using find_row_by_video_id. The function returns the row and the instance dictionary, which hold the sign and the bounding box. Next, convert the YOLOv3 box, given in pixels as (x1, y1, x2, y2), to the normalized (x, y, w, h) format that FiftyOne expects, where x and y mark the top-left corner and all values are fractions of the frame width and height. With both the box and the label in hand, you can add a detection the same way you do for image datasets: pass the bounding box and label to fo.Detection, then wrap the detection in a list and add it to your sample with sample['Sample Label'] = fo.Detections(detections=[det]).
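The conversion and the label assignment could be sketched as follows. The sketch assumes each box in the JSON file is stored in pixels in the order x1, y1, x2, y2, and that compute_metadata() has already been run so the frame size is available. The function names are illustrative.

```python
def to_fiftyone_box(bbox, frame_width, frame_height):
    """Convert a pixel box (x1, y1, x2, y2) to a relative [x, y, w, h] box."""
    x1, y1, x2, y2 = bbox
    return [
        x1 / frame_width,
        y1 / frame_height,
        (x2 - x1) / frame_width,
        (y2 - y1) / frame_height,
    ]


def add_sample_detection(sample, gloss, bbox):
    """Attach one static, sample-level detection to a video sample."""
    # Imported here so the pure helper above works without FiftyOne installed
    import fiftyone as fo

    box = to_fiftyone_box(
        bbox, sample.metadata.frame_width, sample.metadata.frame_height
    )
    det = fo.Detection(label=gloss, bounding_box=box)
    sample["Sample Label"] = fo.Detections(detections=[det])
    sample.save()
```

In a loop over a view such as dataset.take(100), you would pass in the gloss and bbox that the lookup returned for each sample. The sample size of 100 is only an example.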
Open the dataset in the FiftyOne App to see the results.
Frame detections
To take sample-level detections one step further, you can also create labels for each frame. This means bounding boxes can move with, or track, objects in the video. For tracking datasets, you can also provide an index to create boxes such as “person 4”. In this case, the boxes do not move by default, so to demonstrate frame-level detections, you will grow your bounding box a little bit each frame.
💡 Note that frame-level detections are stored on the frame instances, which you access through `sample.frames`. Frame numbers start at 1, not 0.
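Here is one way the growing box could be written. The growth rate, the "Frame Label" field name, and the function names are assumptions made for illustration. The sketch also assumes ensure_frames() has been called so that every frame exists.

```python
def grow_box(box, frame_number, rate=0.002):
    """Expand a relative [x, y, w, h] box by `rate` per side for each frame after the first."""
    x, y, w, h = box
    pad = rate * (frame_number - 1)  # frame numbers are 1-based
    left, top = max(0.0, x - pad), max(0.0, y - pad)
    right, bottom = min(1.0, x + w + pad), min(1.0, y + h + pad)
    return [left, top, right - left, bottom - top]


def add_frame_detections(sample, gloss, box, rate=0.002):
    """Store a slightly larger copy of `box` on each frame of the sample."""
    # Imported here so the pure helper above works without FiftyOne installed
    import fiftyone as fo

    for frame_number, frame in sample.frames.items():
        det = fo.Detection(
            label=gloss, bounding_box=grow_box(box, frame_number, rate)
        )
        frame["Frame Label"] = fo.Detections(detections=[det])
    sample.save()
```

The box is clamped to the frame so it never extends past the edges of the video.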
Temporal detections
Temporal detections classify events in your video by defining a start and an end. They can also capture the semantic context of a scene, such as “pedestrian crossing the street”. A temporal detection contains a label and can be defined either with a first and last frame or with start and end timestamps. In your dataset, you can add a temporal detection with the sign’s label for the first half of the video and then “ASL is awesome!” for the second half. You will then be able to watch the displayed label change over time in the video.
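The two-halves example might be sketched like this, using frame ranges. The "Temporal Label" field name and the function names are assumptions, and the code relies on compute_metadata() having filled in the total frame count.

```python
def split_in_half(total_frames):
    """Return two inclusive [first, last] frame ranges, one per half of the video."""
    if total_frames < 2:
        raise ValueError("need at least two frames to split a video in half")
    mid = total_frames // 2
    return [1, mid], [mid + 1, total_frames]


def add_temporal_detections(sample, gloss, second_label="ASL is awesome!"):
    """Label the first half of the video with the sign and the second half with a message."""
    # Imported here so the pure helper above works without FiftyOne installed
    import fiftyone as fo

    first_half, second_half = split_in_half(sample.metadata.total_frame_count)
    sample["Temporal Label"] = fo.TemporalDetections(
        detections=[
            fo.TemporalDetection(label=gloss, support=first_half),
            fo.TemporalDetection(label=second_label, support=second_half),
        ]
    )
    sample.save()
```

If you would rather work in seconds, fo.TemporalDetection.from_timestamps should let you define the same ranges with start and end timestamps.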
Video classification
Last but certainly not least is video classification, the traditional label type that assigns a single class to the entire video. Unlike temporal detections, it is static for the entire sample. To add a classification label to a video in FiftyOne, use fo.Classification, just as you would for an image.
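A minimal sketch of that assignment follows. The "Class Label" field name and the function name are assumptions chosen for illustration.

```python
def add_video_classification(sample, gloss):
    """Assign a single class label to the whole video sample."""
    # Imported here so the function can be defined without FiftyOne installed
    import fiftyone as fo

    sample["Class Label"] = fo.Classification(label=gloss)
    sample.save()
```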
Video annotation: Video labels in action
If a video walkthrough is more your thing, one of our devs will take you through the same step-by-step process detailed above.
Building scalable video annotation workflows
Building effective machine learning workflows starts with well-structured video annotation strategies. In this tutorial, you explored how a modern video dataset can contain multiple layers of information, from static detections and frame-level labels to temporal events and full video classification outputs. Using the WLASL Kaggle video dataset, you implemented several common annotation patterns directly in FiftyOne and visualized how different label formats can coexist within the same dataset.
You also saw how flexible classification labels and temporal annotations can help represent dynamic information that evolves throughout a video sample. These techniques provide a strong foundation for developing computer vision systems involving action recognition, tracking, gesture analysis, and multimodal video understanding.
As video-based AI applications continue to grow, scalable video annotation workflows become increasingly important for organizing data, improving model quality, and accelerating experimentation. The approaches covered here can be extended further with FiftyOne through segmentation masks, embeddings, tracking IDs, and custom metadata to support more advanced video analytics pipelines.