Video Labels – FiftyOne Tips and Tricks – October 14th, 2023

Welcome to our weekly FiftyOne tips and tricks blog where we cover interesting workflows and features of FiftyOne! This week we are taking a look at video labels! We will walk you through an example of working with a video dataset and the different options you can use when creating labels.

Video datasets can contain many different types of label data: detections, classifications, object tracks, timed events, and more. Organizing all of these labels in one dataset can be a challenge. Today we will cover how to add the following types of labels to your video dataset to help you turn your next video dataset into a blockbuster:

  • 🔍 Sample Detections
  • 🖼️ Frame Detections
  • ⌛ Temporal Detections
  • 🏷️ Classification Labels

Wait, what’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

Ok, let’s dive into this week’s tips and tricks! Also feel free to follow along in our notebook or on YouTube!

Creating a Video Dataset

For today’s example, we will be taking a look at the Word-Level American Sign Language (WLASL) dataset found on Kaggle. The dataset features actors signing 200 different words, with multiple actors for each word. It also contains labeled bounding boxes locating the person signing in each video! We will take a look at the different ways this ground truth could be represented in a video dataset!

Let’s start by loading our data. The dataset contains two parts of interest for us: a directory full of videos, and a JSON file that contains the label and bounding box for each video. Not every video labeled in the JSON is present in our video directory, though, so we will need to account for that later. The videos are all short clips, like the one below, containing the sign from start to finish.

We kick things off by downloading the dataset from Kaggle. Feel free to download it manually or with the Kaggle CLI below. After the dataset is unzipped, we can boot up a notebook to load it into FiftyOne.

pip install kaggle
kaggle datasets download -d risangbaskoro/wlasl-processed
mkdir wlasl-processed
unzip wlasl-processed.zip -d wlasl-processed

Next, we load our labels from the JSON file in our notebook and take a look at how it is structured:

import pandas as pd
import json
import os

import fiftyone as fo

main_path = './wlasl-processed/'
wlasl_df = pd.read_json(main_path + 'WLASL_v0.3.json')

wlasl_df.head()

Awesome, we can see our different signs under gloss and the associated videos under instances! We will keep these labels handy to come back to; next, we load our videos into FiftyOne.
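
For reference, each row of the dataframe corresponds to a JSON entry shaped roughly like this (a hand-written sketch; the values are made up, and per-instance fields we don't use are omitted):

[
    {
        "gloss": "book",
        "instances": [
            {"video_id": "69241", "bbox": [385, 37, 885, 720], ...},
            ...
        ]
    },
    ...
]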

mp4_dir = main_path + "videos"

dataset = fo.Dataset.from_dir(
    dataset_dir=mp4_dir,
    dataset_type=fo.types.VideoDirectory,
)
dataset.ensure_frames()
dataset.compute_metadata()
dataset.name = 'wlasl-dataset'
dataset.persistent = True

session = fo.launch_app(dataset)

To break down what happened above, we can take a look at each line one by one. First, we load our video dataset with fo.Dataset.from_dir and specify that it is a VideoDirectory. This populates the dataset with all of our videos. Next, we use ensure_frames() to populate each sample with frame instances, which allows us to store data at the frame level and not just the video level. Afterwards, we use compute_metadata() to calculate statistics about each video, such as frame width and height, frame rate, and duration. In the last two lines, we name our dataset and make it persistent, so it is saved to disk and survives between sessions.
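
A quick way to sanity-check all of the above is to inspect a single sample (a minimal sketch; the attributes are FiftyOne's standard VideoMetadata fields):

sample = dataset.first()
print(sample.metadata.frame_width, sample.metadata.frame_height)
print(sample.metadata.frame_rate, sample.metadata.duration)
print(len(sample.frames))  # frame instances created by ensure_frames()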

With our videos loaded, next comes adding the labels! Since not every video in the JSON file is present in the dataset, we will look up each video's row by the video_id found in the sample filepath.

def find_row_by_video_id(dataframe, video_id):
    for index, row in dataframe.iterrows():
        for instance in row['instances']:
            if instance['video_id'] == video_id:
                return row, instance
    return None

We will use this helper each time we need to look up information in our dataframe.
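
For example, with a hypothetical video_id pulled from a filename:

row, inst = find_row_by_video_id(wlasl_df, "69241")  # "69241" is a made-up id
print(row["gloss"])  # the signed word
print(inst["bbox"])  # [x1, y1, x2, y2] in pixels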

Sample Detections

The first label type we will take a look at is sample-level detections: a single bounding box that is static throughout the entire video. This is also how the data in our dataset is represented! We can load in these bounding boxes and take a look. We will take a random slice of the dataset to lower load times.

view = dataset.take(100)

for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)
    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]
    bbox = inst["bbox"]
    imw = sample.metadata.frame_width
    imh = sample.metadata.frame_height
    x1 = bbox[0] / imw
    x2 = bbox[2] / imw
    y1 = bbox[1] / imh
    y2 = bbox[3] / imh
    bbox = [x1, y1, x2 - x1, y2 - y1]
    det = fo.Detection(bounding_box=bbox, label=gloss)
    sample['Sample Label'] = fo.Detections(detections=[det])

    sample.save()

session.view = view

In order to add our detection to our dataset, we first look up the detection associated with our video_id using find_row_by_video_id. The function returns the row and the instance dictionary containing the sign and the bounding box. Afterwards, we convert the (x1, y1, x2, y2) pixel-corner box to FiftyOne's normalized (x, y, w, h) format. With both our box and label in hand, we can add a detection the same way we do for image datasets! Using fo.Detection, we provide our bounding box and our label to create the detection, then wrap it in a list and add it to our sample with sample['Sample Label'] = fo.Detections(detections=[det]).
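
If you find yourself doing this conversion often, the arithmetic above can be factored into a small helper (a sketch, not part of the original snippet):

def corners_to_fiftyone(bbox, imw, imh):
    # [x1, y1, x2, y2] pixel corners -> normalized [x, y, width, height]
    x1, y1, x2, y2 = bbox
    return [x1 / imw, y1 / imh, (x2 - x1) / imw, (y2 - y1) / imh]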

Below we can see the results! 

Frame Detections

To take video-level detections one step further, we can also create labels for each frame! This means bounding boxes can move to track objects through the video. For tracking datasets, you can also provide an index to tie boxes together as, say, “person 4”. In our case, our boxes do not move by default, so to show off frame-level detections, we will grow our bounding box a little bit each frame.
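
As a quick aside, a tracked detection just needs an index attribute; FiftyOne uses it to link the same object across frames (the label and box below are made up for illustration):

det = fo.Detection(
    label="person",
    bounding_box=[0.1, 0.1, 0.2, 0.4],  # normalized [x, y, w, h]
    index=4,  # renders as "person 4" in the App
)

Now, back to growing our boxes: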

def bigger_bbox(x, y, width, height, index):
    offset = 0.001
    x_offset = index * offset

    # Apply the offsets to the parameters
    n_x = x - x_offset
    n_width = width + x_offset * 2

    return [n_x, y, n_width, height]

for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)
    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]
    bbox = inst["bbox"]
    imw = sample.metadata.frame_width
    imh = sample.metadata.frame_height
    x1 = bbox[0] / imw
    x2 = bbox[2] / imw
    y1 = bbox[1] / imh
    y2 = bbox[3] / imh
    bbox = [x1, y1, x2 - x1, y2 - y1]
    for frame_no, frame in sample.frames.items():
        new_bbox = bigger_bbox(bbox[0], bbox[1], bbox[2], bbox[3], frame_no)
        det = fo.Detection(bounding_box=new_bbox, label=gloss)
        frame['Frame Label'] = fo.Detections(detections=[det])

    sample.save()

💡 Notice that frame-level detections are stored on the frame instances, which can be accessed via `sample.frames` and are indexed starting at 1, not 0.
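
For example, after the loop above has run:

frame = sample.frames[1]  # the first frame; sample.frames[0] does not exist
print(frame["Frame Label"].detections[0].bounding_box)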

Temporal Detections

Temporal detections are a way to classify events in your video by defining a start and end time. They can also be used to capture the semantic context of a scene, such as “pedestrian crossing the street”. A temporal detection contains a label and can be defined with either a first and last frame or a start and end timestamp; we will use timestamps here and sketch the frame-based form after the snippet below. In our dataset, we can add a temporal detection of our label for the first half of the video and “ASL is awesome!” for the second half. We will then be able to watch the displayed label change over the course of the video.

for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)
    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]
    sample["TD Word"] = fo.TemporalDetection.from_timestamps(
        [0, sample.metadata.duration / 2], label=gloss, sample=sample
    )
    sample["TD Word2"] = fo.TemporalDetection.from_timestamps(
        [sample.metadata.duration / 2, sample.metadata.duration],
        label="ASL is awesome!",
        sample=sample,
    )

    sample.save()

session.view = view
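
As mentioned earlier, the frame-based alternative uses the support field, a [first_frame, last_frame] range, instead of timestamps. A minimal sketch for one sample (reusing gloss from the loop above; the "TD Frames" field name is our own):

total = sample.metadata.total_frame_count
sample["TD Frames"] = fo.TemporalDetection(
    label=gloss, support=[1, total // 2]  # first half of the video, by frame
)
sample.save()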

Video Classification

Last, but certainly not least, is video classification: the traditional label type of assigning a single class to the entire video. Unlike temporal detections, it is static for the entire sample. To add a classification label to a video in FiftyOne, you add it just like you would for an image, with fo.Classification:

for sample in view:
    base_file_name = os.path.basename(sample.filepath)
    video_id, extension = os.path.splitext(base_file_name)
    row, inst = find_row_by_video_id(wlasl_df, video_id)
    gloss = row["gloss"]
    sample["class"] = fo.Classification(label=gloss)

    sample.save()

session.view = view
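
As a quick sanity check, you can aggregate the new field with FiftyOne's count_values:

print(view.count_values("class.label"))  # e.g. {"book": 3, "drink": 2, ...} (illustrative)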

Video Labels in Action!

Conclusion

Video datasets can come in many sophisticated forms where the labels can be expressed in several ways. In our quick example, we covered how to add the following labels:

  • 🔍 Sample Detections
  • 🖼️ Frame Detections
  • ⌛ Temporal Detections
  • 🏷️ Classification Labels

I hope you are able to use these labels in your next video dataset! They are just a small fraction of the available tools: you can also add masks, group your videos, or compute embeddings. To learn more cool tips and tricks with FiftyOne, check out our previous Tips and Tricks posts! Stay tuned for next week!

Join the FiftyOne Community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!