Welcome to our weekly FiftyOne tips and tricks blog where we cover interesting workflows and features of FiftyOne! This week we are taking a look at video labels! We will walk you through an example of working with a video dataset and the different options you can use when creating labels.
Video datasets can have many different types of label data. Some video datasets contain detections, classifications, tracks, timed events, or even more. It can prove a challenge to organize all of these labels in one dataset. Today we will cover how to add the following types of labels to your video dataset to help you turn your next video dataset into a blockbuster:
- 🔍 Sample Detections
- 🖼️ Frame Detections
- ⌛ Temporal Detections
- 🏷️ Classification Labels
Wait, what’s FiftyOne?
FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.
- If you like what you see on GitHub, give the project a star.
- Get started! We’ve made it easy to get up and running in a few minutes.
- Join the FiftyOne Slack community, we’re always happy to help.
Creating a Video Dataset
For today’s example, we will be taking a look at the World Level American Sign Language (WLASL) Dataset found on Kaggle. The dataset features actors signing 200 different words, with multiple actors for each word. It also contains labeled bounding boxes for where the person signing is located! We will take a look at the different ways this ground truth could be represented in a video dataset.
Let’s start by loading our data. The dataset contains two parts of interest for us: a directory full of videos, and a JSON file that contains the label and bounding box for each video. Not every video referenced in the JSON is present in our video directory, though, so we will need to account for that later. The videos are all short clips like the one below containing the start to finish of the sign.
We kick things off by downloading the dataset from Kaggle. Feel free to download it manually or use the Kaggle API below. After the dataset is unzipped, we can boot up a notebook to load it into FiftyOne.
```shell
pip install kaggle
kaggle datasets download -d risangbaskoro/wlasl-processed
mkdir wlasl-processed
unzip wlasl-processed.zip -d wlasl-processed
```
Next, we load our labels from the JSON file in our notebook and take a look at how they are structured:
```python
import pandas as pd
import json
import os

main_path = './wlasl-processed/'
wlasl_df = pd.read_json(main_path + 'WLASL_v0.3.json')
wlasl_df.head()
```
Awesome, we can see our different signs under `gloss` and the associated videos under `instances`! We will keep our labels here to come back to later; next, we load our videos into FiftyOne.
```python
import fiftyone as fo

mp4_dir = main_path + "videos"

dataset = fo.Dataset.from_dir(
    dataset_dir=mp4_dir,
    dataset_type=fo.types.VideoDirectory,
)
dataset.ensure_frames()
dataset.compute_metadata()
dataset.name = 'wlasl-dataset'
dataset.persistent = True

session = fo.launch_app(dataset)
```
To break down what happened above, let's take a look at each line. First, we load our video dataset with `fo.Dataset.from_dir` and specify that it is a `VideoDirectory`. This populates the dataset with all of our videos. Next, we use `ensure_frames()` to populate each sample with frame instances, which allows us to store data at the frame level and not just the video level. Afterwards, we use `compute_metadata()` to calculate statistics on each video, such as frame width and height, frame rate, and video duration. In the last two lines, we name our dataset and make it persistent so that any changes we make are saved to disk.
With our videos loaded, next comes adding the labels! Since not every video in the JSON file is present in the dataset, we will look up the dataset row by the `video_id` given in the sample filepath.
```python
def find_row_by_video_id(dataframe, video_id):
    for index, row in dataframe.iterrows():
        for instance in row['instances']:
            if instance['video_id'] == video_id:
                return row, instance
    return None, None
```
We will use the above each time we make a lookup to our dataframe for information.
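To sanity-check the lookup, here is a quick self-contained run against a toy DataFrame that mimics the WLASL structure (the glosses, video IDs, and boxes below are made up purely for illustration):

```python
import pandas as pd

def find_row_by_video_id(dataframe, video_id):
    # Scan each gloss row's list of instances for a matching video_id
    for index, row in dataframe.iterrows():
        for instance in row['instances']:
            if instance['video_id'] == video_id:
                return row, instance
    return None, None

# Toy stand-in for wlasl_df: one row per sign, each with a list of instances
toy_df = pd.DataFrame({
    "gloss": ["book", "drink"],
    "instances": [
        [{"video_id": "00001", "bbox": [10, 100, 20, 120]}],
        [{"video_id": "00002", "bbox": [15, 90, 25, 110]}],
    ],
})

row, inst = find_row_by_video_id(toy_df, "00002")
print(row["gloss"])  # drink
```

Note that videos missing from the JSON simply come back as `(None, None)`, which is what lets us skip over them gracefully.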
The first label type we will look at is sample-level detections: a single bounding box that is static throughout the entire video. This is also how the data in our dataset is represented! We can load in these bounding boxes and take a look. We will take a portion of the dataset to reduce load times.
```python
view = dataset.take(100)

for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)

    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]
    bbox = inst["bbox"]

    imw = sample.metadata.frame_width
    imh = sample.metadata.frame_height

    # Normalize the (x1, x2, y1, y2) pixel box to FiftyOne's relative format
    x1 = bbox[0] / imw
    x2 = bbox[1] / imw
    y1 = bbox[2] / imh
    y2 = bbox[3] / imh
    bbox = [x1, y1, x2 - x1, y2 - y1]

    det = fo.Detection(bounding_box=bbox, label=gloss)
    sample['Sample Label'] = fo.Detections(detections=[det])
    sample.save()

session.view = view
```
In order to add our detection to our dataset, we first look up the detection associated with our `video_id` using `find_row_by_video_id`. The function returns the row and instance dictionary containing the sign and the bounding box. Afterwards, we convert the YOLOv3-style (x1, x2, y1, y2) pixel box to a normalized FiftyOne box (x, y, w, h). With both our box and label in hand, we can add a detection the same way we do for image datasets! Using `fo.Detection`, we provide our bounding box and label to create the detection, then add it to our sample with `sample['Sample Label'] = fo.Detections(detections=[det])`.
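The box conversion is worth isolating. Below is a small self-contained version of the same math (the helper name and the example pixel values are our own, for illustration only):

```python
def corners_to_relative(bbox, imw, imh):
    """Convert an absolute (x1, x2, y1, y2) pixel box into FiftyOne's
    relative [x, y, width, height] format, where all values lie in [0, 1]."""
    x1, x2 = bbox[0] / imw, bbox[1] / imw
    y1, y2 = bbox[2] / imh, bbox[3] / imh
    return [x1, y1, x2 - x1, y2 - y1]

# A 256x256 frame with a box spanning x pixels 32..160 and y pixels 64..192
print(corners_to_relative([32, 160, 64, 192], 256, 256))
# [0.125, 0.25, 0.5, 0.5]
```

FiftyOne expects the top-left corner plus width and height, all relative to the frame size, which is why we divide by the frame dimensions before subtracting the corners.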
Below we can see the results!
To take video-level detections one step further, we can also create labels for each frame! This means bounding boxes can move to track objects in the video. For tracking datasets, you can also provide an index to create boxes such as “person 4”. In our case, our boxes do not move by default, so to demonstrate frame-level detections, we will grow our bounding box a little bit each frame.
```python
def bigger_bbox(x, y, width, height, index):
    offset = 0.001
    x_offset = index * offset

    # Apply the offsets to the parameters
    n_x = x - x_offset
    n_width = width + x_offset * 2

    return [n_x, y, n_width, height]


for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)

    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]
    bbox = inst["bbox"]

    imw = sample.metadata.frame_width
    imh = sample.metadata.frame_height

    x1 = bbox[0] / imw
    x2 = bbox[1] / imw
    y1 = bbox[2] / imh
    y2 = bbox[3] / imh
    bbox = [x1, y1, x2 - x1, y2 - y1]

    for frame_no, frame in sample.frames.items():
        new_bbox = bigger_bbox(bbox[0], bbox[1], bbox[2], bbox[3], frame_no)
        det = fo.Detection(bounding_box=new_bbox, label=gloss)
        frame['Frame Label'] = fo.Detections(detections=[det])

    sample.save()
```
💡 Notice that frame-level detections are stored on the frame instances, can be accessed using `sample.frames`, and are indexed starting at 1, not 0.
Temporal detections are a way to classify events in your video by defining a start and end time. They can also be used to capture the semantic context of a scene, such as “pedestrian crossing the street”. Temporal detections contain a label and can be defined with either a first and last frame or a start and end timestamp. In our dataset, we will add a temporal detection with our sign's label for the first half of the video and “ASL is awesome!” for the second half. We will then be able to watch the displayed label change over time in the video.
```python
for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)

    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]

    sample["TD Word"] = fo.TemporalDetection.from_timestamps(
        [0, sample.metadata.duration / 2],
        label=gloss,
        sample=sample,
    )
    sample["TD Word2"] = fo.TemporalDetection.from_timestamps(
        [sample.metadata.duration / 2, sample.metadata.duration],
        label="ASL is awesome!",
        sample=sample,
    )
    sample.save()

session.view = view
```
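Behind the scenes, `from_timestamps` turns each timestamp into a 1-based frame number using the sample's metadata. Here is a rough sketch of that mapping, assuming uniformly spaced frames (this helper is ours and is illustrative only; FiftyOne's exact rounding may differ):

```python
def timestamp_to_frame(timestamp, duration, total_frames):
    """Map a timestamp in [0, duration] seconds to a 1-based frame number,
    assuming uniformly spaced frames."""
    fraction = min(max(timestamp / duration, 0.0), 1.0)
    return max(1, int(round(fraction * total_frames)))

# A 2-second clip with 60 frames
print(timestamp_to_frame(0.0, 2.0, 60))  # 1
print(timestamp_to_frame(1.0, 2.0, 60))  # 30
print(timestamp_to_frame(2.0, 2.0, 60))  # 60
```

This is why passing `sample=sample` matters: without the duration and frame count from the sample's metadata, timestamps cannot be resolved to a frame support.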
Last, but certainly not least, is video classification, the traditional label type that assigns a single class to the entire video. Unlike temporal detections, it is static for the entire sample. To add a classification label to a video in FiftyOne, you add it just like you would for an image:
```python
for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)

    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]

    sample["class"] = fo.Classification(label=gloss)
    sample.save()

session.view = view
```
Video Labels in Action!
Video datasets can come in many sophisticated forms where the labels can be expressed in several ways. In our quick example, we covered how to add the following labels:
- 🔍 Sample Detections
- 🖼️ Frame Detections
- ⌛ Temporal Detections
- 🏷️ Classification Labels
I hope you are able to use these labels in your next video dataset! They are just a small fraction of the available tools: you can also add masks, group your videos, or compute embeddings. To learn more cool tips and tricks with FiftyOne, check out our previous Tips and Tricks posts! Stay tuned for next week!
Join the FiftyOne Community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!
- 2,000+ FiftyOne Slack members
- 4,000+ stars on GitHub
- 5,000+ Meetup members
- Used by 370+ repositories
- 60+ contributors