A guide to using the open source tool FiftyOne to download the Kinetics dataset and evaluate video understanding models

After the success of image classification dataset challenges and the rise of deep learning, tackling video was an obvious next step. Just like the classification of images, the task of video classification is the most straightforward start on the path to general video understanding models. As for the specific labels that are being classified, the computer vision research community has gravitated toward classifying human actions in videos.

One of the earliest human action recognition video datasets, even before deep learning took off, was the KTH dataset from 2004. Action recognition datasets have come a long way since then, with some focusing on clips from Hollywood movies and others on sports.

In 2017, DeepMind released one of the largest and most impactful human action recognition datasets yet, Kinetics. As of the writing of this post, four versions of the Kinetics dataset have been released: 400, 600, 700, and 700-2020. The number in each version name indicates the number of action classes. Additionally, each version adds new videos to replace those that have been deleted from YouTube over time.

This post walks through the integration of Kinetics into the open-source dataset curation and model analysis tool, FiftyOne. This integration includes a sophisticated way to download the dataset, as well as examples of how to evaluate and improve models trained on the dataset. Downloading Kinetics is now as easy as:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("kinetics-600")

Setup

To run the examples in this post, you need to install FiftyOne:

pip install fiftyone

You will also need to install pytube, which FiftyOne uses to download videos from YouTube:

pip install pytube

Downloading Kinetics

Until recently, the only way to access the Kinetics dataset was to download each video directly from its source on YouTube. This caused numerous issues, including deleted videos, YouTube throttling downloads, and inefficiencies in clipping videos.

The Common Visual Data Foundation (CVDF) has collaborated with the Kinetics dataset maintainers to host all versions of the dataset on AWS for the general public to download. It should be noted that the CVDF-hosted version does not include all samples present in the original dataset, only those that were available on YouTube at the time that the CVDF version was created.

The CVDF has made it much easier to gain access to the full dataset. However, you still need to handle the challenges of visualizing, wrangling, and subsetting the dataset to meet your needs. In some cases, you may not want to download the entire dataset in the first place.

This is where the integration of Kinetics into the FiftyOne Dataset Zoo comes in. With a single Python call, you can specify the version, the split, and the classes that you want, and then visualize the result in the FiftyOne App with one more line of code.

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "kinetics-700-2020",
    split="validation",
    classes=["grooming cat", "grooming dog"],
    max_samples=10,
)

session = fo.launch_app(dataset)

Training and Evaluating a Model

Once you have downloaded Kinetics, you can start using it to train action recognition models. Since the dataset is already in FiftyOne, it is easy to use libraries like PyTorch or PyTorch Lightning Flash to train a model directly on the dataset.

pip install lightning-flash 'lightning-flash[video]' torchvision pytorchvideo

import torch

from flash import Trainer
from flash.video import VideoClassificationData, VideoClassifier

import fiftyone as fo
import fiftyone.zoo as foz

classes = [
    "swimming backstroke",
    "swimming breast stroke",
    "swimming butterfly stroke",
    "swimming front crawl",
]

# Load Kinetics
dataset = foz.load_zoo_dataset(
    "kinetics-700-2020",
    splits=["train", "validation"],
    classes=classes,
    max_samples=50,
    shuffle=True,
)

# Replace spaces in class names with underscore
labels = dataset.distinct("ground_truth.label")
labels_map = {l: l.replace(" ", "_") for l in labels}
dataset = dataset.map_labels("ground_truth", labels_map).clone()

# Create views for dataset splits
train_view = dataset.match_tags("train")
val_view = dataset.match_tags("validation")

# Create the Flash Datamodule
datamodule = VideoClassificationData.from_fiftyone(
    train_dataset=train_view,
    val_dataset=val_view,
    predict_dataset=val_view,
    label_field="ground_truth",
    batch_size=1,
    clip_sampler="uniform",
    clip_duration=1,
    decode_audio=False,
)

# Build the model
model = VideoClassifier(
    backbone="x3d_xs",
    labels=datamodule.labels,
    pretrained=True,
)

trainer = Trainer(
    max_epochs=10,
    limit_train_batches=5,
    gpus=torch.cuda.device_count(),
)

# Finetune the model
trainer.finetune(model, datamodule=datamodule, strategy="freeze")
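The datamodule above is configured with clip_sampler="uniform" and clip_duration=1. For intuition only, the following is a simplified sketch of what uniform clip sampling can look like: a video is cut into consecutive, non-overlapping one-second windows. The function name uniform_clips and the 10-second duration are hypothetical, and the real sampler used by Flash and PyTorchVideo may handle details such as a trailing partial clip differently.

```python
def uniform_clips(video_duration, clip_duration):
    """Return (start, end) times, in seconds, of consecutive fixed-length clips.

    Simplified illustration: any trailing partial clip is dropped.
    """
    if clip_duration <= 0:
        raise ValueError("clip_duration must be positive")
    num_clips = int(video_duration // clip_duration)
    return [(i * clip_duration, (i + 1) * clip_duration) for i in range(num_clips)]


# Hypothetical 10-second video split into 1-second clips
clips = uniform_clips(10, 1)
print(len(clips), clips[0], clips[-1])
```

Under this simplification, a shorter clip_duration yields more, shorter clips per video, which trades temporal context per clip for a larger number of training examples.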

After your model is trained, you can generate predictions on held-out data (the validation split here) and use FiftyOne to evaluate the model's performance.

from itertools import chain

from flash.core.classification import FiftyOneLabelsOutput

def get_fo_label_preds(samples, datamodule, trainer):
    # Return a flat list of predictions as FiftyOne classification labels.
    # Note: `model` is the module-level model trained above; `samples` is unused.
    predictions = trainer.predict(
        model,
        datamodule=datamodule,
        output=FiftyOneLabelsOutput(return_filepath=False, labels=datamodule.labels),
    )
    predictions = list(chain.from_iterable(predictions))  # flatten batches
    return predictions

predictions = get_fo_label_preds(val_view, datamodule, trainer)

# Add predictions to FiftyOne dataset
val_view.set_values("predictions", predictions)

session = fo.launch_app(val_view)

results = val_view.evaluate_classifications(
    "ground_truth",
    "predictions",
    eval_key="eval",
)
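For intuition about what a classification evaluation summarizes, here is a minimal plain-Python sketch that computes overall accuracy and per-class precision and recall from two lists of labels. It is not FiftyOne's implementation; the helper per_class_metrics and the toy labels are hypothetical and exist only for illustration.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Return overall accuracy and a {class: (precision, recall)} dict."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    metrics = {}
    for c in classes:
        true_pos = sum(t == c and p == c for t, p in pairs)
        num_predicted = sum(p == c for p in y_pred)  # times the model said c
        num_actual = sum(t == c for t in y_true)     # times c was correct
        precision = true_pos / num_predicted if num_predicted else 0.0
        recall = true_pos / num_actual if num_actual else 0.0
        metrics[c] = (precision, recall)
    return accuracy, metrics


# Toy labels, made up for illustration
toy_classes = ["backstroke", "butterfly"]
toy_true = ["backstroke", "backstroke", "butterfly", "butterfly"]
toy_pred = ["backstroke", "butterfly", "butterfly", "butterfly"]

accuracy, metrics = per_class_metrics(toy_true, toy_pred, toy_classes)
print(accuracy, metrics)
```

In this toy case the model over-predicts one class, which shows up as perfect recall but lower precision for that class, and the reverse for the other.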

The results of the evaluation can be used for things like plotting confusion matrices and precision-recall curves.

pip install ipywidgets

underscore_classes = [c.replace(" ", "_") for c in classes]

plot = results.plot_confusion_matrix(classes=underscore_classes)
plot.show()
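A confusion matrix is just a tally of (true label, predicted label) pairs. The sketch below builds one in plain Python under that definition, with rows as true classes and columns as predicted classes. The helper name and the toy labels are hypothetical and are not part of FiftyOne.

```python
from collections import Counter


def build_confusion_matrix(y_true, y_pred, classes):
    """Return a nested list where rows are true classes and columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


# Toy labels, made up for illustration
cm_classes = ["swimming_backstroke", "swimming_butterfly_stroke"]
cm_true = ["swimming_backstroke", "swimming_backstroke", "swimming_butterfly_stroke"]
cm_pred = ["swimming_backstroke", "swimming_butterfly_stroke", "swimming_butterfly_stroke"]

matrix = build_confusion_matrix(cm_true, cm_pred, cm_classes)
for name, row in zip(cm_classes, matrix):
    print(name, row)
```

Large off-diagonal counts in a column indicate a class that the model predicts too often.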

The confusion matrix shows that, since we only finetuned the model on a few dozen samples, it is overfitting to the backstroke and butterfly stroke classes. This implies that we should download additional samples of the other two classes and continue training.

Analyzing the model to find the best and worst-performing samples can shed light on the best ways to improve your model’s performance.

from fiftyone import ViewField as F

eval_view = val_view.filter_labels(
    "predictions", (F("confidence") > 0.6) & (F("eval") == False)
)

session.view = eval_view
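The logic of the view above (keep predictions that are confident but wrong) can be sketched in plain Python as follows. The record layout with "confidence" and "correct" keys is an assumption made for this illustration and is not how FiftyOne stores samples.

```python
def confident_mistakes(records, threshold=0.6):
    """Return incorrect predictions above the confidence threshold, most confident first."""
    mistakes = [
        r for r in records if r["confidence"] > threshold and not r["correct"]
    ]
    return sorted(mistakes, key=lambda r: r["confidence"], reverse=True)


# Hypothetical per-sample results, made up for illustration
records = [
    {"id": "a", "confidence": 0.70, "correct": False},
    {"id": "b", "confidence": 0.95, "correct": True},
    {"id": "c", "confidence": 0.40, "correct": False},
    {"id": "d", "confidence": 0.90, "correct": False},
]

worst = confident_mistakes(records)
print([r["id"] for r in worst])
```

Sorting by confidence puts the most surprising errors first, which are often the ones that reveal labeling problems or gaps in the training data.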

The following shows one of the top examples in this evaluation view of highly confident but incorrectly predicted samples.

There are multiple issues that we can see with this sample. First, the footage is first-person, which is rare in this dataset. If we want to predict on first-person videos, then more of them should be added to the training set. Second, the video contains both breaststroke and backstroke, so it would be difficult to assign a single label. Third, the ground truth label is front crawl, which does not appear at all in the video.

Using FiftyOne to get hands-on and analyze specific samples can surface findings like these, which highlight ways to improve the dataset itself. Since Kinetics is a very large dataset, we could easily download additional videos to replace problematic samples that we may want to exclude from training. Improvements to your dataset can lead to easier gains in model performance than working on the model architecture itself.

Summary

The integration of the Kinetics dataset into FiftyOne makes it easy to download exactly the subset of Kinetics that you want, or even the dataset in its entirety. Additionally, FiftyOne allows for in-depth evaluation and analysis of video models, leading to better datasets and higher-performing models.
