Visualize and evaluate Hugging Face models on your dataset with Weights & Biases and FiftyOne

The development of a machine learning solution is no longer a problem of simply designing the right model for the job because the performance of your model is also bounded by the quality of your dataset. In this era of huge data, it is necessary to co-develop a high quality dataset alongside a capable model, iteratively improving your model and dataset together.

Two challenges that you will encounter are 1) keeping track of how your models perform and evolve over time as your dataset and training scheme change and 2) managing your datasets and being able to visualize and explore them as they grow ever larger. Luckily, there are plenty of tools at your disposal that can help you co-develop high-quality datasets and models. Two widely used solutions to these problems are Weights & Biases (W&B) for model and experiment tracking and FiftyOne for dataset visualization and management. We will also add Hugging Face into the mix to quickly access a model to finetune.

In this post, we will show you how to visualize and evaluate Hugging Face models on your dataset by integrating both Weights & Biases and FiftyOne. Specifically, we’ll cover how to:

Curate a custom dataset with FiftyOne
Pull in a popular model from Hugging Face
Integrate FiftyOne and Weights & Biases in your model training loop
Visualize model predictions in FiftyOne
Sweep over hyperparameters with Weights & Biases
Find the best model with Weights & Biases and FiftyOne

As a running example, we'll curate a dataset for food detection from the COCO and Open Images datasets and track the finetuning of a DETR model.

This integration between W&B and FiftyOne means that you can browse W&B to view your high-level model evaluation results and dig into the corresponding model predictions and evaluations on your actual samples in FiftyOne. This is critical for finding interesting success and failure modes of your model, which tell you how to improve the quality of your dataset and, in turn, the performance of your model.

What’s on the menu?

Weights & Biases: Weights & Biases is a developer-first MLOps platform that integrates into your model training loop to track your experiments and your model results. It also makes it easy to scale your experimentation, run large hyperparameter sweeps to optimize your model, and manage the lifecycle of your model.

FiftyOne: FiftyOne is the open source toolkit for building high-quality datasets and computer vision models. It's designed to allow fast and effective analysis during dataset and model co-development, enabling you to iterate through experiments rapidly. Its powerful but easy-to-use App and Python SDK let you curate your custom datasets, visualize and explore them, and integrate them into other computer vision workflows. (Disclaimer: I work at Voxel51 and built portions of FiftyOne.)

Hugging Face: Hugging Face is focused on letting the AI community build, train, and deploy state-of-the-art models from across the open source machine learning community. Specifically, their model hub and transformers Python library make it easy to discover and extend open source machine learning models across a variety of tasks.

Setup

If you haven’t already, install FiftyOne, the Weights & Biases Python client, and Hugging Face's transformers package, for example with pip install fiftyone wandb transformers.

Next, you will need to sign up for a free Weights & Biases account and then find your API key here. Calling wandb.login() from Python will prompt you for this API key to connect the Python client to your account.

Preparing the dataset

In this walkthrough, let's pretend that we're working on the next viral recipe app that lets you take a picture of food and returns to you a recipe for how to cook it. We’ll focus on the food detection aspect.

FiftyOne makes it easy to curate a training and evaluation dataset for our task. For dataset curation, we can use FiftyOne to query for subsets of interest, find maximally unique data samples, find annotation mistakes, and more.

To build the food detector, we're going to need to train an object detection model to detect food in images. We'll be making use of the COCO-2017 and Open Images v7 datasets, both of which exist in the FiftyOne Dataset Zoo and include annotations for various types of food items like carrots, hot dogs, and cakes.

To start, we need to put together lists of the classes of food objects we want to download from each dataset.

See the full list of classes here.
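As a minimal sketch, the two class lists might start out like this. Only the three food items mentioned above are included, so these are illustrative subsets rather than the full lists used in this post, and the shared_classes() helper is a hypothetical name added for illustration. Note that COCO uses lowercase class names while Open Images capitalizes the first word.

```python
# Illustrative subsets only; the full lists used in this post are longer.
coco_food_classes = ["carrot", "hot dog", "cake"]
open_images_food_classes = ["Carrot", "Hot dog", "Cake"]


def shared_classes(coco_classes, open_images_classes):
    """Classes present in both lists once case differences are ignored."""
    lowered = {name.lower() for name in open_images_classes}
    return sorted(set(coco_classes) & lowered)


print(shared_classes(coco_food_classes, open_images_food_classes))
```

Checking the overlap up front makes it obvious which classes will line up after the case normalization described later.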

Now we can use the FiftyOne Dataset Zoo to download all samples with at least one instance of the listed classes from the validation splits of both the COCO and Open Images datasets.
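A download step along these lines could look like the sketch below. It assumes the zoo dataset names coco-2017 and open-images-v7 and uses the short illustrative class lists rather than the full ones; ZOO_SOURCES and load_food_subset() are hypothetical names. FiftyOne is imported inside the function so that the snippet can be read and loaded without triggering a download.

```python
# Hypothetical mapping from zoo dataset name to the classes requested from it
ZOO_SOURCES = {
    "coco-2017": ["carrot", "hot dog", "cake"],
    "open-images-v7": ["Carrot", "Hot dog", "Cake"],
}


def load_food_subset(zoo_name, classes, max_samples=None):
    """Load validation samples that contain at least one of `classes`."""
    import fiftyone.zoo as foz  # lazy import: only needed when actually loading

    return foz.load_zoo_dataset(
        zoo_name,
        split="validation",
        label_types=["detections"],  # only the bounding box annotations
        classes=classes,
        max_samples=max_samples,  # None loads every matching sample
    )


# Usage (downloads data, so it is left commented out):
# coco = load_food_subset("coco-2017", ZOO_SOURCES["coco-2017"])
# open_images = load_food_subset("open-images-v7", ZOO_SOURCES["open-images-v7"])
```

Passing a small max_samples value is a convenient way to try the rest of the workflow before committing to the full download.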

Next, we want to merge these two datasets together with the merge_samples() method. But first, there are a couple of processing steps to take, including making all of the classes the same case and filtering out all non-food classes.

The good news is that preprocessing data is easy with FiftyOne because of its handy builtin methods like map_labels(), filter_labels(), merge_samples(), and more. See this gist for these preprocessing and merging steps that make use of FiftyOne's SDK to write views that query, filter, and mutate the dataset and metadata.
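The sketch below shows one way these steps might fit together. It is not the code from the gist. It assumes both datasets store their boxes in a field named ground_truth, and build_case_map() and merge_food_datasets() are hypothetical helper names.

```python
def build_case_map(labels):
    """Lowercase mapping for every label that is not already lowercase."""
    return {label: label.lower() for label in labels if label != label.lower()}


def merge_food_datasets(coco, open_images, food_classes, field="ground_truth"):
    """Return a new dataset with lowercase, food-only labels from both sources."""
    from fiftyone import ViewField as F  # lazy import

    keep = F("label").is_in(food_classes)

    # Normalize Open Images label case, then drop non-food labels
    oi_labels = open_images.distinct(f"{field}.detections.label")
    oi_view = open_images.map_labels(field, build_case_map(oi_labels))
    oi_view = oi_view.filter_labels(field, keep)

    # Clone the filtered COCO view so the zoo dataset itself is not modified
    merged = coco.filter_labels(field, keep).clone()
    merged.merge_samples(oi_view)
    return merged


print(build_case_map(["Hot dog", "cake"]))
```

Here food_classes is expected to be the lowercase list of classes to keep. Because the two sources have different file paths, merging adds the Open Images samples alongside the COCO ones.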

We also want to set the dataset to be persistent so that we can easily load it with fo.load_dataset(dataset_name) in the future.

Before moving on, let's visualize and explore the dataset in the FiftyOne App.

The last processing step that needs to be done on the dataset is to generate training and validation splits. We can use FiftyOne to generate random 80/20 splits of the dataset, tagging samples as either train or val.
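One possible way to generate the splits is with random_split() from fiftyone.utils.random, as sketched here. The split_fractions() and make_splits() helpers are hypothetical names, and the tag names train and val follow the text above.

```python
def split_fractions(train_fraction=0.8):
    """Tag-to-fraction mapping for a two-way train/val split."""
    return {"train": train_fraction, "val": round(1.0 - train_fraction, 6)}


def make_splits(dataset, train_fraction=0.8, seed=None):
    """Tag every sample as train or val and return the two views."""
    import fiftyone.utils.random as four  # lazy import

    four.random_split(dataset, split_fractions(train_fraction), seed=seed)
    return dataset.match_tags("train"), dataset.match_tags("val")


print(split_fractions())
```

Because the splits are stored as sample tags, they stay with the dataset and can be retrieved later with match_tags(). Passing a fixed seed keeps the split reproducible across runs.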

Model preparation

The object detection model we'll be using is Facebook Research's DETR from the Hugging Face transformers Python package. Some of the code in this section has been extended from this tutorial on finetuning DETR.

For the model preparation and training, we'll be sticking fairly closely to native PyTorch code so you can see the bare bones of how to train on a FiftyOne dataset and track your experiment with Weights & Biases.

Much of the image and label preprocessing can be handled by the model-specific processor class made available by Hugging Face.

We then need to create the PyTorch dataset class and data loaders that return images and labels in the format expected by DETR. See this gist for the implementation of these classes in a way that makes use of the FiftyOne dataset we’ve curated.
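The core of such a dataset class is converting FiftyOne labels into the COCO-style annotations that the DETR processor accepts. FiftyOne stores each box as relative [x, y, width, height] values in [0, 1], while COCO annotations use absolute pixels. The sketch below shows only that conversion and is not the gist's implementation. to_coco_target() is a hypothetical helper that takes plain (label, box) pairs.

```python
def to_coco_target(image_id, detections, width, height, label2id):
    """Build a COCO-style target from (label, [x, y, w, h]) pairs.

    Boxes are given in FiftyOne's relative coordinates and converted to
    absolute pixel [x, y, width, height] boxes.
    """
    annotations = []
    for label, (x, y, w, h) in detections:
        box = [x * width, y * height, w * width, h * height]
        annotations.append(
            {
                "bbox": box,
                "category_id": label2id[label],
                "area": box[2] * box[3],
                "iscrowd": 0,
            }
        )
    return {"image_id": image_id, "annotations": annotations}


target = to_coco_target(0, [("cake", [0.25, 0.5, 0.5, 0.25])], 640, 480, {"cake": 0})
print(target["annotations"][0]["bbox"])
```

Inside a PyTorch dataset's __getitem__, a target like this can be passed along with the image to the processor, for example processor(images=image, annotations=target, return_tensors="pt").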

Finally, let's load the pretrained DETR model (with a ResNet-50 backbone) from Hugging Face. Since we're training on a custom dataset, we need to specify the number of labels on which we will be finetuning. We can use a FiftyOne aggregation to quickly find the number of distinct label classes across our ground truth labels in our dataset.
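A sketch of this step follows, assuming the ground truth lives in a field named ground_truth and using the facebook/detr-resnet-50 checkpoint. label_maps() and load_detr() are hypothetical names. The ignore_mismatched_sizes flag lets the pretrained classification head be replaced by one sized for our classes.

```python
def label_maps(classes):
    """Build the id-to-label and label-to-id mappings for a list of classes."""
    id2label = dict(enumerate(sorted(classes)))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_detr(dataset, field="ground_truth", checkpoint="facebook/detr-resnet-50"):
    """Load pretrained DETR with a classification head sized for `dataset`."""
    from transformers import AutoModelForObjectDetection  # lazy import

    # FiftyOne aggregation: distinct label classes across all detections
    classes = dataset.distinct(f"{field}.detections.label")
    id2label, label2id = label_maps(classes)

    return AutoModelForObjectDetection.from_pretrained(
        checkpoint,
        num_labels=len(classes),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,
    )


print(label_maps(["carrot", "cake"]))
```

Storing the label mappings on the model config also means predictions can be translated back to class names later without keeping a separate lookup table.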

Note: If you have sufficient training data, you can also attempt to train this model from scratch as shown here.

Integrate Weights & Biases with FiftyOne

Now to bring Weights & Biases into the mix! We'll be using W&B to track hyperparameters of our experiment and monitor the training and validation losses throughout the training process. Later, we'll also show how to train and track multiple models with a W&B sweep.

Integrating a W&B run with FiftyOne involves creating a link between a given W&B training run and both the FiftyOne dataset and the field containing that run's model predictions. There are numerous ways that this could be done. Here we show one approach that adds direct links to and from FiftyOne and W&B.

W&B allows you to organize your experiments under projects, with each experiment constituting a run. In the run configuration, we will track model and dataset metadata like training hyperparameters and information about the FiftyOne dataset we're training on.

Now to actually create the integration between a W&B run and a FiftyOne dataset.

For the W&B to FiftyOne direction, we'll store the W&B run and project URLs on the field of our FiftyOne dataset that will contain the predictions from that run's model. These links are then clickable in the sidebar of the FiftyOne App when hovering over that field name.
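As a rough sketch of the W&B to FiftyOne direction, the run links can be written to the info dictionary of the predictions field. The key names wandb_run_url and wandb_project_url and both helper names are assumptions made for illustration. The run argument is assumed to be the object returned by wandb.init(). Recent W&B releases may also expose the project URL as a property instead of get_project_url().

```python
def run_links(run_url, project_url):
    """Info dictionary stored on the predictions field (key names are hypothetical)."""
    return {"wandb_run_url": run_url, "wandb_project_url": project_url}


def link_run_to_field(dataset, run, field_name):
    """Attach W&B run and project URLs to a FiftyOne detections field."""
    import fiftyone as fo  # lazy import

    # Declare the predictions field up front so that it can carry metadata
    if not dataset.has_sample_field(field_name):
        dataset.add_sample_field(
            field_name,
            fo.EmbeddedDocumentField,
            embedded_doc_type=fo.Detections,
        )

    field = dataset.get_field(field_name)
    field.info = run_links(run.url, run.get_project_url())
    field.save()


print(run_links("https://example.com/run", "https://example.com/project"))
```

Creating the field before training means the links are available in the App as soon as the run starts, even before any predictions have been added.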

For the FiftyOne to W&B direction, we'll store a link on our W&B run to the URL of the specific FiftyOne view into our dataset that contains only the ground truth and prediction label fields for that run. (This can all be customized to link to whatever views you want.)

Note that since the FiftyOne App runs in a local Python session on your machine, accessible at localhost:5151, you'll need to have an instance of the FiftyOne App running whenever you want to click these links and view your dataset. You can run the fiftyone app launch command in a terminal window to start up a FiftyOne App process.

(With FiftyOne Teams you can just link directly to your deployed FiftyOne Teams App URL.)

Below is a sneak peek of the links this method creates between W&B and FiftyOne later in this post.

Train

Now we're ready to put everything together and get this model trained. In this example, we're just using native PyTorch code to write our training loop, but this can be extended to use your preferred model training framework by following the same principles.

The meat of the training code is in a train() method. Here, we set up an optimizer and iteratively train and validate one epoch at a time on our FiftyOne-backed dataloaders. After each epoch, we log whatever values are of interest to us. In this example, we'll track the epoch, learning rate, and train/validation losses. Additionally, we save the model weights every 25 epochs throughout the training process, but this can be configured to your needs.
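A stripped-down version of such a loop might look like the following. This is a sketch rather than the code from this post: it assumes each batch is a dict with pixel_values, pixel_mask, and labels already on the model's device, and the helper names and checkpoint directory names are hypothetical.

```python
def checkpoint_epochs(num_epochs, save_every=25):
    """Epochs (1-indexed) after which the model weights are saved."""
    return [e for e in range(1, num_epochs + 1) if e % save_every == 0]


def run_epoch(model, loader, optimizer=None):
    """One pass over `loader`; weights are updated only if an optimizer is given."""
    import torch  # lazy import so checkpoint_epochs() works without PyTorch

    training = optimizer is not None
    model.train(training)
    total_loss, num_batches = 0.0, 0
    with torch.set_grad_enabled(training):
        for batch in loader:
            outputs = model(
                pixel_values=batch["pixel_values"],
                pixel_mask=batch["pixel_mask"],
                labels=batch["labels"],
            )
            if training:
                optimizer.zero_grad()
                outputs.loss.backward()
                optimizer.step()
            total_loss += outputs.loss.item()
            num_batches += 1
    return total_loss / max(num_batches, 1)


def train(model, train_loader, val_loader, optimizer, num_epochs,
          log=print, save_every=25):
    """Train and validate one epoch at a time, logging metrics after each."""
    saves = set(checkpoint_epochs(num_epochs, save_every))
    for epoch in range(1, num_epochs + 1):
        log(
            {
                "epoch": epoch,
                "lr": optimizer.param_groups[0]["lr"],
                "train_loss": run_epoch(model, train_loader, optimizer),
                "val_loss": run_epoch(model, val_loader),
            }
        )
        if epoch in saves:
            model.save_pretrained(f"checkpoint-epoch-{epoch}")


print(checkpoint_epochs(60))
```

Passing log=wandb.log inside an active W&B run sends the same metrics dictionary to the run's dashboard instead of printing it.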

This gist provides some fairly boilerplate code to implement the methods used above. The batch processing performed here is specific to DETR and follows this finetuning tutorial. You can find the finetuning steps for other Hugging Face models here.

Time to start training! For reference, 20 epochs with this setup took roughly 4 hours on an NVIDIA TITAN V GPU.

We can also log the resulting model weights as a W&B artifact. Between this and the FiftyOne dataset, we can ensure that it will be easy to reproduce our results in the future.

Evaluate results in FiftyOne

FiftyOne not only makes it easy to curate your training dataset, but also to evaluate the results of your model predictions. The flexibility of FiftyOne's data model means that you can add as many custom fields as you want to your dataset, in this case predictions from all of our runs.

To start, we need to run inference on the samples of the validation set, then convert the model outputs from the bounding box format of DETR to the bounding box format expected by FiftyOne in the form of fo.Detection objects. This is a fairly straightforward conversion of restructuring the bounding box coordinates and putting the labels and confidences in the right spots. At the end, all of the model's detections are then stored in a new field on our FiftyOne dataset.

See this gist for an implementation of this DETR to FiftyOne conversion function, add_detections().
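The heart of that conversion can be sketched as follows. This is an illustrative stand-in rather than the gist's add_detections(). It assumes result is one entry of the list returned by the processor's post_process_object_detection() method, which holds scores, labels, and absolute [x1, y1, x2, y2] boxes. FiftyOne expects relative [x, y, width, height] boxes instead.

```python
def xyxy_to_relative_xywh(box, width, height):
    """Convert absolute [x1, y1, x2, y2] to relative [x, y, w, h] in [0, 1]."""
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, (x2 - x1) / width, (y2 - y1) / height]


def to_fiftyone_detections(result, id2label, width, height):
    """Wrap one image's post-processed DETR output in a fo.Detections label."""
    import fiftyone as fo  # lazy import

    detections = []
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        detections.append(
            fo.Detection(
                label=id2label[int(label)],
                bounding_box=xyxy_to_relative_xywh(
                    [float(v) for v in box], width, height
                ),
                confidence=float(score),
            )
        )
    return fo.Detections(detections=detections)


print(xyxy_to_relative_xywh([50, 100, 150, 300], 200, 400))
```

The returned fo.Detections object can then be assigned to a per-run predictions field on each sample, followed by sample.save().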

With the inference results added to our dataset, we can now make use of FiftyOne's model evaluation capabilities. For common label types, there are standard evaluation practices available in the FiftyOne SDK. For example, the evaluate_detections() method on a FiftyOne dataset will perform either COCO-style or Open Images-style object detection evaluation, comparing your ground truth detections with your model's predicted detections.

You can use this to compute the same mAP as with pycocotools, but the primary benefit is that this will also keep the individual label-level results around. For example, we’ll know if a prediction was a false positive or a false negative. This becomes invaluable to dig into the dataset and query for specific edge cases of interest, like figuring out “which vegetable is my model worst at detecting”, or “how often does my model miss detecting pizza slices”.
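An evaluation call for a single run could be sketched like this. Since the evaluation key becomes part of field names on the dataset, the hypothetical eval_key_for() helper replaces characters such as the hyphens in W&B run names. The eval_ prefix and the ground_truth field name are assumptions.

```python
import re


def eval_key_for(run_name):
    """Turn a run name like 'fresh-sweep-1' into a field-name-safe eval key."""
    return "eval_" + re.sub(r"\W", "_", run_name)


def evaluate_run(dataset, pred_field, run_name, gt_field="ground_truth"):
    """Run COCO-style evaluation for one run's predictions and return the mAP."""
    results = dataset.evaluate_detections(
        pred_field,
        gt_field=gt_field,
        eval_key=eval_key_for(run_name),
        compute_mAP=True,
    )
    results.print_report()  # per-class precision, recall, and F1
    return results.mAP()


print(eval_key_for("fresh-sweep-1"))
```

After evaluation, every predicted and ground truth object carries its true positive, false positive, or false negative status under the evaluation key, which is what makes the label-level queries possible.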

Using the evaluation results, we can write a query with FiftyOne's view expression language to find all false positive predictions with a high confidence. These will generally be interesting examples since this is where the model was confident in its prediction, but got it wrong. This often lets us find hard samples, annotation mistakes, or discrepancies between the training and validation splits.
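Such a query might be written as in the sketch below. The 0.8 confidence threshold and the default evaluation key eval are assumptions, and the function name is hypothetical.

```python
def high_confidence_false_positives(dataset, pred_field, eval_key="eval",
                                    min_confidence=0.8):
    """View containing only confident predictions that were marked false positive."""
    from fiftyone import ViewField as F  # lazy import

    is_confident_fp = (F(eval_key) == "fp") & (F("confidence") > min_confidence)
    return dataset.filter_labels(pred_field, is_confident_fp)
```

The returned view can be shown in the App by assigning it to session.view. By default, samples with no matching predictions are left out of the view.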

From these results, we can see that even though the mAP is low, the model still performs fairly well and the low score is more due to the evaluation protocol and ground truth labels. Specifically, there are numerous cases where the model produces technically correct predictions which happen to not match the ground truth annotations.

In this example, you can see that there are annotation mistakes in the oranges that are missing from the ground truth, as well as a “Baked goods” annotation which may be correct, but may also be missing one of the “Snack”/“Fast food” ground truth labels for these food items. In practice, we may want to use this knowledge to take a pass over our label ontology here to ensure that all of the potential classes for an object are represented in the ground truth annotations.

This is precisely why you should not select the best model based on aggregate scores like mAP alone. You must visualize and explore the predictions of your model on your dataset itself to build intuition about, and trust in, the performance of your model and its shortcomings.

Perform a hyperparameter sweep

Combining everything we've learned so far, we can now consolidate it into a Weights & Biases sweep. When training a model, there are often so many hyperparameters and options to choose from that you need to iterate over them to find the combinations that result in the best model for your task. W&B makes it really easy to define your training function and configure a sweep over the hyperparameters of interest to you.

The following code snippet uses methods defined in this gist and follows the exact same workflow outlined in the previous sections.

Now let’s set up and run our sweep configuration to cover five different combinations of learning rate, backbone learning rate, and weight decay parameters to find which result in the best performance.
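A sweep along these lines could be configured as in the sketch below. The candidate values, the random search method, and the val_loss metric name are assumptions for illustration, not the settings used in this post. run_sweep() is a hypothetical wrapper, and train_fn stands for a training function that reads its hyperparameters from wandb.config.

```python
# Hypothetical search space; the values are placeholders
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"values": [1e-4, 5e-5, 1e-5]},
        "lr_backbone": {"values": [1e-5, 1e-6]},
        "weight_decay": {"values": [1e-4, 1e-3]},
    },
}


def run_sweep(train_fn, project, count=5):
    """Register the sweep with W&B and run `count` trials of `train_fn`."""
    import wandb  # lazy import

    sweep_id = wandb.sweep(sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)
    return sweep_id
```

Setting count=5 caps the agent at five trials, matching the five combinations described above. The metric name must match a key that the training function actually logs.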

Explore results

Once our sweep is complete, we can view the results in W&B and FiftyOne.

From the W&B dashboard, we can see that there is one run where the validation loss dropped significantly more and faster than the other runs. We can click on that run to see the metrics in more detail.

From that run's page, we can also click the "View Dataset in FiftyOne" link to open a new browser window to see exactly those model predictions and their evaluation results in FiftyOne.

(This assumes the FiftyOne App is running as described in the "Integrate Weights & Biases with FiftyOne" section above.)

From the FiftyOne App, we can now dig in and explore the results of this model. For example, we can perform the same high-confidence false-positive query from before but directly through buttons in the App rather than code, giving you even more flexibility to work how you’d like to.

We can then click on the link in the field's info tooltip to go back to the W&B run or project to explore our other experiments in W&B.

Summary

Weights & Biases is a leading solution for experiment tracking and model management for good reason. Its feature-rich toolset integrates easily into your model training pipelines and lets you quickly start building dashboards of your experimental results. Integrating Weights & Biases with FiftyOne adds easy dataset management and model evaluation to the equation to round out your MLOps stack and let you start co-developing better datasets and models, faster!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!

1,500+ FiftyOne Slack members
2,800+ stars on GitHub
3,700+ Meetup members
Used by 274+ repositories
59+ contributors
