A guide to downloading, visualizing, and evaluating models on the ActivityNet dataset using FiftyOne
Year over year, the ActivityNet challenge has pushed the boundaries of what video understanding models are capable of. Training a machine learning model to classify video clips and detect actions performed in videos is no easy task. Thankfully, the teams behind datasets like Kinetics and ActivityNet have provided massive amounts of video data to the computer vision community to assist in building high-quality video models.
Setup
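The examples below only assume that FiftyOne is installed, which can be done with pip, along with the imports we will use throughout:

```python
# Install FiftyOne from PyPI (run in your shell):
#   pip install fiftyone

import fiftyone as fo
import fiftyone.zoo as foz
```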
The ActivityNet Dataset
There are two popular versions of ActivityNet: one with 100 activity classes and a newer version with 200 activity classes. These versions contain 9,682 and 19,994 videos, respectively, across their training, testing, and validation splits, with over 8k and 23k labeled activity instances. These labels are temporal activity detections, each represented by a start and stop time for a segment along with an activity class.
Downloading the Dataset
Downloading the entirety of ActivityNet requires filling out this form to gain access to the dataset. However, many use cases involve specific subsets of ActivityNet. For example, you may be training a model specifically to detect the class “Bathing dog”.
The integration between FiftyOne and ActivityNet means that the dataset can now be accessed through the FiftyOne Dataset Zoo. Additionally, it makes it easier than ever to download specific subsets of ActivityNet directly from YouTube. You can download all samples of the class “Bathing dog” from the above example like so:
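The following is a minimal sketch using the `activitynet-200` version of the zoo dataset and its validation split; both choices are just for illustration and can be swapped for whichever version and split you need:

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Download only samples containing the "Bathing dog" class from the
# validation split of ActivityNet-200
dataset = foz.load_zoo_dataset(
    "activitynet-200",
    split="validation",
    classes=["Bathing dog"],
)

# Visualize the samples in the FiftyOne App
session = fo.launch_app(dataset)
```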
Other useful parameters include split to specify which dataset split to download, max_samples if you are only interested in a subset of the dataset, max_duration to set the maximum length of the videos you want to download, and more.
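For instance, a sketch combining these parameters, with purely illustrative values:

```python
# Cap the download at 10 videos, each no longer than 20 seconds
dataset = foz.load_zoo_dataset(
    "activitynet-200",
    split="validation",
    classes=["Bathing dog"],
    max_samples=10,
    max_duration=20,  # seconds
)
```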
If you are working with the full ActivityNet dataset, you can use the source_dir parameter to point to the location of the dataset on disk and easily load it into FiftyOne to visualize it and analyze your models.
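A sketch of what that might look like, with a placeholder path standing in for your local copy of the dataset:

```python
# Load the full ActivityNet-200 dataset from an existing copy on disk
# rather than downloading it (the path below is a placeholder)
dataset = foz.load_zoo_dataset(
    "activitynet-200",
    source_dir="/path/to/activitynet",
)
```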
ActivityNet Model Evaluation
In this section, we cover how to add your custom model predictions to a FiftyOne dataset and evaluate temporal detections following the ActivityNet evaluation protocol.
Adding Model Predictions to Dataset
Any custom labels and metadata can easily be added to a FiftyOne dataset. In this case, we need to populate a TemporalDetection field named predictions with the model predictions for each sample. We can then visualize the predictions in the FiftyOne App.
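Here is a minimal sketch of that workflow. The run_model() helper below is hypothetical and stands in for whatever inference code you use; it is assumed to return (label, start, stop, confidence) tuples with start and stop times in seconds:

```python
import fiftyone as fo

# Ensure video metadata (frame rate, etc.) is populated so that timestamps
# can be converted to frame supports
dataset.compute_metadata()

for sample in dataset:
    detections = []

    # run_model() is a hypothetical placeholder for your own inference code
    for label, start, stop, confidence in run_model(sample.filepath):
        # Convert the [start, stop] timestamps (seconds) into a temporal
        # detection over this video's frames
        detection = fo.TemporalDetection.from_timestamps(
            [start, stop], sample=sample
        )
        detection.label = label
        detection.confidence = confidence
        detections.append(detection)

    sample["predictions"] = fo.TemporalDetections(detections=detections)
    sample.save()

# Visualize the predictions in the FiftyOne App
session = fo.launch_app(dataset)
```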
Computing mAP
The ActivityNet evaluation protocol used to evaluate temporal activity detection models trained on the dataset is similar to the object detection evaluation protocol for the COCO dataset. Both involve computing the mean average precision (mAP) of detections, either temporal or spatial, across the dataset by matching predictions with ground truth annotations.
The specifics of the ActivityNet mAP computation, which has been reimplemented in FiftyOne, can be found here. Once you have added your predictions to the FiftyOne dataset, you can call the evaluate_detections() method.
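For example, assuming the ground truth annotations live in a ground_truth field (as they do when the dataset is loaded from the zoo) and the predictions live in the predictions field populated above:

```python
# Evaluate the temporal detections using the ActivityNet-style protocol
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    compute_mAP=True,
)

print(results.mAP())
```

Passing compute_mAP=True is what triggers the mAP computation; results.mAP() then returns the dataset-wide score.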
The evaluation results object can also be used to compute PR curves and confusion matrices of your model’s performance.
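For example, plotting results for the most common classes in the dataset (limiting to 10 classes here is an arbitrary choice):

```python
# Find the 10 most common ground truth classes
counts = dataset.count_values("ground_truth.detections.label")
classes = sorted(counts, key=counts.get, reverse=True)[:10]

# Plot per-class PR curves
plot = results.plot_pr_curves(classes=classes)
plot.show()

# Plot a confusion matrix for the same classes
plot = results.plot_confusion_matrix(classes=classes)
plot.show()
```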
Additional metrics can be computed using the DETAD tool, which was introduced to diagnose the performance of action detection models on ActivityNet and THUMOS14. A follow-up post will explore how to incorporate DETAD analysis into our FiftyOne workflow to analyze models trained on ActivityNet.
Analyzing Results
One of the primary benefits of the FiftyOne implementation of the ActivityNet evaluation protocol is that it stores not only dataset-wide metrics like mAP, but also individual sample- and label-level results. Specifying the eval_key parameter when calling evaluate_detections() will populate fields on our dataset indicating whether each predicted or ground truth temporal detection was a true positive, false positive, or false negative.
Using the powerful querying capabilities of FiftyOne, we can dig into the model results and explore specific cases where the model performed well or poorly. This analysis quickly reveals the types of data the model struggles with and that we should incorporate more of into training. For example, in one line we can find all predictions that had high confidence but were evaluated as false positives.
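A sketch of that one-liner (plus the ViewField import), filtering for predictions marked as false positives under the eval key used above; the 0.8 confidence threshold is just an example:

```python
from fiftyone import ViewField as F

# High-confidence predictions that were evaluated as false positives
high_conf_fp_view = dataset.filter_labels(
    "predictions",
    (F("confidence") > 0.8) & (F("eval") == "fp"),
)

session.view = high_conf_fp_view
```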
From here, you can use the to_clips() method to convert the dataset of full videos into a view containing only the clips of those videos defined by the segments in a field like our TemporalDetection predictions.
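Continuing the sketch from above, converting the false positive view into clips and loading it in the App:

```python
# Create a view with one clip per false positive prediction segment
clips_view = high_conf_fp_view.to_clips("predictions")

session.view = clips_view
```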
The samples in these evaluation views are videos that the model was not able to correctly understand. One way to resolve this is to train on more samples similar to those that the model failed on. If you are interested in using this workflow in your own projects, check out the compute_similarity() method of the FiftyOne Brain.
Summary
ActivityNet is one of the leading video datasets in the computer vision community and is featured in the popular yearly ActivityNet challenge. The integration between ActivityNet and FiftyOne makes it easier than ever to access the dataset, visualize it in the FiftyOne App, and analyze your model performance to figure out how to build better activity classification and detection models.