Harness the Power of Diffusion Models with Higher-Quality Data

ControlNet has been one of the biggest success stories in ML in 2023. The project, which has racked up 21,000+ stars on GitHub, was all the rage at CVPR - and for good reason: it’s an easy, interpretable way to exert influence over the outputs of diffusion models. Rather than running the same diffusion model on the same prompt over and over again, hoping for a reasonable result, you can guide the model via an input map. Hence ControlNet’s cheeky tagline: “Let us control diffusion models!” There are distinct ControlNet models to ‘control’ the output via Canny edge maps, segmentation masks, pose keypoints, and even scribbles.

One of the features that makes ControlNet so popular is its accessibility. In an era of hundred-billion parameter foundation models, ControlNet models are just 1.45GB (the same size as the underlying diffusion model). At a time when models like GPT-3.5 are being trained on tens of thousands of GPUs at a cost of hundreds of thousands, or even millions, of USD, a ControlNet model can be trained at home on a single GPU in just 600 GPU hours! In other words, ControlNet training is easy and quick enough that you can train your own ControlNet model.

Despite ControlNet 1.0’s remarkable success, the model suffered from a few rather unfortunate bugs. For most inputs, the model produced stunning, realistic images, but in some cases its output was significantly oversaturated.

When ControlNet’s creator Lvmin Zhang published ControlNet 1.1, which resolved these issues, the changes were so substantial that he created an entirely new GitHub repository! The craziest part: there were NO CHANGES to the model architecture. What changed? Data quality! That shows just how critical your ControlNet training dataset is. It turns out that the data used to train ControlNet 1.0 had a few insidious flaws, including a small group of grayscale images of people that was somehow duplicated thousands of times. The ControlNet 1.1 repo explicitly mentions this and other problems.

The lesson: Data reigns supreme. State-of-the-art performance requires high-quality data. In this blog post, I’ll show you how to clean and curate a high-quality ControlNet training dataset so you can train your own state-of-the-art ControlNet model. All of the code required to follow along and curate your own image-caption ControlNet training dataset can be found here. If you’re eager, you can jump straight to the highlights:

Download the ControlNet training dataset
Deduplicate the data
Validate image-caption alignment

Setup | ControlNet Training

The only libraries we will need to clean and curate this data are pandas (for tabular data) and FiftyOne (for unstructured image data), both of which can be installed with `pip install pandas fiftyone`.

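Additionally, you will need hashlib (part of the Python standard library) for helper functions, and you will probably want tqdm to track progress while downloading images. The walkthrough imports `pandas`, `hashlib`, `tqdm`, and `fiftyone`, including FiftyOne’s `ViewField` for filtering.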

Select Your ControlNet Training Dataset

According to the paper that introduced ControlNet, Adding Conditional Control to Text-to-Image Diffusion Models (CVPR 2023), the original ControlNet models were trained on “3M image-caption pairs from the internet”. Unfortunately, Zhang et al. stop short of revealing precisely what data they used:

“Given the current complicated situation outside research community, we refrain from disclosing more details about data. Nevertheless, researchers may take a look at that dataset project everyone know.” Lvmin Zhang

That being said, the information they do reveal lines up closely with Google’s Conceptual Captions Dataset: a dataset “consisting of ~3.3M images annotated with captions”. Regardless of whether this is the dataset ControlNet actually used, Conceptual Captions will provide us with an illustrative example, and the dataset, when properly cleaned, should allow for training ControlNet models from scratch.

Download the Google Dataset

Google's proposed dataset download process is too cumbersome for my taste: first, you need to download a tab-separated values (`.tsv`) file containing the captions and the URLs where the corresponding images can be found, and then you need to download the images from their URLs. Lucky for you, I’ve written this code so you don’t have to. Download the tsv file by clicking the “Download” button at the bottom of Google’s Conceptual Captions webpage, or by clicking on this link. We can load the tsv file as a pandas DataFrame in similar fashion to a csv, by passing in `sep="\t"` to specify that the separator is a tab.

Give the columns of the DataFrame descriptive names:
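Here is a minimal sketch of the loading and naming steps. It assumes the file has no header row and two columns, caption first and URL second; the column names `caption` and `url` and the helper name `load_captions` are my own illustrative choices. The column names are assigned in the same `read_csv` call via `names`.

```python
import io

import pandas as pd


def load_captions(tsv_source):
    """Load a headerless, tab-separated caption/URL file into a DataFrame."""
    # sep="\t" marks the fields as tab-separated, header=None keeps the
    # first row as data, and names assigns descriptive column names.
    return pd.read_csv(tsv_source, sep="\t", header=None, names=["caption", "url"])


# Tiny in-memory stand-in for the real tsv file, for illustration only.
sample_tsv = io.StringIO(
    "a dog on a beach\thttp://example.com/dog.jpg\n"
    "a red bicycle\thttp://example.com/bike.jpg\n"
)
df = load_captions(sample_tsv)
```

For the real data, pass the path of the downloaded tsv file instead of the in-memory sample.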

And then hash the url for each entry to generate a unique ID:
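One way to sketch the hashing step with hashlib is shown below. The choice of SHA-256, the truncation to 16 hex characters, and the helper name `url_to_id` are assumptions made for illustration, not necessarily what the original code used.

```python
import hashlib


def url_to_id(url, length=16):
    """Return a short, deterministic hex ID derived from the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Truncate for shorter filenames; 16 hex characters is 64 bits.
    return digest[:length]


urls = ["http://example.com/dog.jpg", "http://example.com/bike.jpg"]
ids = [url_to_id(u) for u in urls]
```

With a DataFrame that has a `url` column, the same helper could be applied as `df["url"].apply(url_to_id)`. Because the ID depends only on the URL, rerunning the script yields the same filenames.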

The DataFrame looks like this:

We will use these IDs to specify the download locations (filepaths) of images, so that we can associate captions to the corresponding images. If we want to download the images in batches, we can do so as follows:

Here we download batch_size images starting from start_index into the folder images, with filename specified by the url hash we generated above. We use curl to execute the download operation, and set limits for the time spent attempting to download each image, because some of the links are no longer valid. To download a total of num_images images, run the following:
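The batched download described above can be sketched roughly as follows. The helper names, the `.jpg` extension, and the specific timeout values are assumptions; `--connect-timeout` and `--max-time` are the curl options that cap how long each attempt may take.

```python
import os
import subprocess


def build_curl_command(url, filepath, max_time=10, connect_timeout=5):
    """Build a curl invocation that gives up quickly on dead links."""
    return [
        "curl",
        "--silent",
        "--location",  # follow redirects
        "--connect-timeout", str(connect_timeout),
        "--max-time", str(max_time),
        "--output", filepath,
        url,
    ]


def download_batch(urls, ids, start_index, batch_size, image_dir="images"):
    """Download urls[start_index:start_index + batch_size] into image_dir."""
    os.makedirs(image_dir, exist_ok=True)
    end_index = min(start_index + batch_size, len(urls))
    for url, image_id in zip(urls[start_index:end_index], ids[start_index:end_index]):
        # Assumed extension; the filename is the URL hash.
        filepath = os.path.join(image_dir, f"{image_id}.jpg")
        # check=False: failures are expected for links that no longer work.
        subprocess.run(build_curl_command(url, filepath), check=False)


def batch_starts(num_images, batch_size):
    """Start indices of the batches needed to cover num_images images."""
    return list(range(0, num_images, batch_size))
```

To download `num_images` images in total, loop over `batch_starts(num_images, batch_size)` and call `download_batch` with each start index. Wrapping that loop in tqdm gives a progress bar.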

Load and Visualize Your ControlNet Training Data

Once we have the images downloaded into an images folder, we can load the images and their captions as a Dataset in FiftyOne:
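A minimal sketch of this loading step is below. It assumes the images were saved as `images/<id>.jpg`; the helper names `sample_records` and `build_dataset` are hypothetical. FiftyOne is imported inside `build_dataset` only so that the path-building helper can be run without FiftyOne installed.

```python
import os


def sample_records(ids, captions, image_dir="images", ext=".jpg"):
    """Pair each caption with the filepath its image was downloaded to."""
    return [
        {"filepath": os.path.join(image_dir, image_id + ext), "caption": caption}
        for image_id, caption in zip(ids, captions)
    ]


def build_dataset(records, name="gcc"):
    """Create a persistent FiftyOne dataset with one sample per record."""
    # Imported here so the helper above can be used without FiftyOne installed.
    import fiftyone as fo

    dataset = fo.Dataset(name=name, persistent=True)
    dataset.add_samples(
        [fo.Sample(filepath=r["filepath"], caption=r["caption"]) for r in records]
    )
    return dataset


records = sample_records(["abc123"], ["a dog on a beach"])
```

Passing only the first `num_images` IDs and captions to `sample_records` restricts the dataset to the images that were actually downloaded.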

This code creates a Dataset named “gcc”, which is persisted to the underlying database, and then iterates through the first num_images rows of the pandas DataFrame, creating a Sample with the appropriate filepath and caption for each. For this walkthrough, I downloaded roughly the first 310,000 images.

The first step we should take when inspecting a new computer vision dataset is to visualize it! We can do this by launching the FiftyOne App, for example with `session = fo.launch_app(dataset)`.

Remove Corrupted Samples

When we look at the data, we can immediately see that some of the images are not valid. This may be due to links which are no longer working, interruptions during downloading, or some other issue entirely. Fortunately, we can filter out these invalid images easily.

In FiftyOne, the compute_metadata() method computes media-type-specific metadata for each sample. For image-based samples, this includes image width, height, and size in bytes. When the media file is nonexistent or corrupted, the metadata will be left as null. We can thus filter out the corrupted images by running `dataset.compute_metadata()` and then matching for samples where the metadata exists, for example with `dataset.exists("metadata")`.

Filter by Aspect Ratio

A next step we may want to take is filtering out samples with unusual aspect ratios. If our goal is to control the outputs of a diffusion model, we will likely only be working with images within a certain range of reasonable aspect ratios. We can do this using FiftyOne’s ViewField, which allows us to apply arbitrary expressions to attributes of our samples, and then filter based on these. For instance, we can discard all images that are more than twice as long in one dimension as in the other by matching on `metadata.width` and `metadata.height`.
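As a rough sketch of the aspect ratio filter just described: the plain-Python predicate spells out the rule, and `filter_by_aspect_ratio` expresses the same rule as a ViewField expression. Both function names are hypothetical, and the sketch assumes metadata has already been computed.

```python
def has_reasonable_aspect_ratio(width, height, max_ratio=2.0):
    """True if neither side is more than max_ratio times the other."""
    return width <= max_ratio * height and height <= max_ratio * width


def filter_by_aspect_ratio(dataset, max_ratio=2.0):
    """Return a view containing only samples within the aspect ratio bound."""
    # Imported here so the predicate above can be used without FiftyOne installed.
    from fiftyone import ViewField as F

    width, height = F("metadata.width"), F("metadata.height")
    return dataset.match(
        (width <= height * max_ratio) & (height <= width * max_ratio)
    )
```

Note that an image exactly twice as wide as it is tall is kept; only images beyond that ratio are discarded.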

For the sake of clarity, this is what the discarded samples look like:

If you so choose, you can use a more or less stringent aspect ratio filter!

Filter by Resolution

In a similar vein, we might want to remove the low-resolution images. We want to generate stunning, photorealistic images, so there is no sense in including low-resolution images in your ControlNet training dataset. This filter is similar to the aspect ratio filter. If we select 300 pixels as our lowest allowed width and height, the filter keeps only the samples for which both `F("metadata.width") >= 300` and `F("metadata.height") >= 300` hold, where `F` is FiftyOne’s ViewField.

Once again, you can choose whatever thresholds you like. For clarity, here is a representative view of the discarded images:

Ensure Color Palette

Looking at the low-resolution images, we also might be reminded that some of the images in our dataset are grayscale. We likely want to generate images that are as vibrant as possible, so we should discard the black-and-white images. In FiftyOne, one of the attributes logged in image metadata is the number of channels: color images have three channels (RGB), whereas grayscale images only have one channel. Removing grayscale images is as simple as matching for images with three channels, for example with `F("metadata.num_channels") == 3`!

Deduplicate the Dataset

Our next task in our data curation quest is to remove duplicate images. When an image is exactly or approximately duplicated in a training dataset, the resulting model may be biased by this small set of overrepresented samples - not to mention the added training costs. We can find approximate duplicates in our dataset by using a model to generate embeddings for our images (we will use a CLIP model for illustration):

Then we create a similarity index based on these embeddings:

Finally, we can set a numerical threshold at which point we will consider images approximate duplicates (here we choose 0.3), and only retain one representative from each group of approximate duplicates:
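To make the thresholding idea concrete, here is a library-free sketch of greedy near-duplicate removal. It is not FiftyOne’s implementation: it assumes the threshold is a cosine distance, compares every image against all retained ones (quadratic cost, fine only for small sets), and uses toy 2-D vectors in place of real CLIP embeddings.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two nonzero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_representatives(embeddings, thresh=0.3):
    """Greedily keep indices farther than thresh from every kept embedding."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_distance(emb, embeddings[j]) > thresh for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second is a near-duplicate of the first.
embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
]
kept = select_representatives(embeddings)
```

Here the second vector lies within the threshold of the first, so only one representative of that pair is retained. A lower threshold keeps more images; a higher one removes looser look-alikes as well.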

Validate Image-Caption Alignment

Okay, now you’re in luck, because we saved the coolest step for last! Google’s Conceptual Captions Dataset consists of image-caption pairs from the internet. More precisely, “the raw descriptions are harvested from the Alt-text HTML attribute associated with web images”. This is great as an initial pass, but there are bound to be some low-quality captions in there. We may not be able to ensure that all of our captions perfectly describe their images, but we can certainly filter out some poorly aligned image-caption pairs!

We will do so using CLIPScore, which is a “reference-free evaluation metric for image captioning”. In other words, you just need the image and the caption. CLIPScore is easy to implement. First, we use SciPy’s cosine distance method to define a cosine similarity function:

Then we define a function which takes in a Sample, and computes the CLIPScore between image embedding and caption embedding, stored on the samples:
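The two functions can be sketched as follows. To keep the example dependency-free, the cosine similarity is written out with the standard library instead of SciPy (with SciPy it would be `1 - scipy.spatial.distance.cosine(a, b)`). The field names `image_clip_embedding` and `text_clip_embedding` are assumptions about where the embeddings are stored.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two nonzero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_embedding, text_embedding, scale=100.0):
    """Scaled cosine similarity, lower bounded at zero."""
    return max(scale * cosine_similarity(image_embedding, text_embedding), 0.0)


def sample_clip_score(sample):
    """CLIPScore for a sample that stores both embeddings as fields."""
    # Field names are assumed; works on any mapping-like sample.
    return clip_score(sample["image_clip_embedding"], sample["text_clip_embedding"])


example = {"image_clip_embedding": [3.0, 4.0], "text_clip_embedding": [3.0, 4.0]}
score = sample_clip_score(example)
```

Because `sample_clip_score` only indexes by field name, it can be tried on a plain dictionary, as above, before running it over a full dataset.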

Essentially, the score is the cosine similarity between the two embeddings, scaled by 100 and lower bounded at zero. The scaling factor of 100 is the same as used by PyTorch. We can then compute the CLIPScore, our measure of alignment between images and captions, by adding the fields to our dataset and iterating over our samples, storing the result in a “clip_score” field on each one.

If we want to see the “least aligned” samples, we can sort by “clip_score”, for example with `dataset.sort_by("clip_score")`.

To see the most aligned samples, we can do the same, but passing in `reverse=True`.

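We can then set a CLIPScore threshold depending on how aligned we demand the image-caption pairs be. To my taste, a threshold of 21.8 seemed good enough, which amounts to matching for samples whose “clip_score” is above 21.8.

Cloning the resulting view, for example with `view.clone(name="gcc_clean", persistent=True)`, then creates a new persistent Dataset named “gcc_clean”.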

The Results: An Improved ControlNet Training Dataset

After our ControlNet training dataset cleaning and curation, we have turned a relatively mediocre initial dataset of more than 310,000 samples into a high quality dataset with 83,181 samples. The fruits of our labor look like this:

We surely haven’t created a perfect dataset; a perfect dataset does not exist. What we have done is address all of the data quality issues that plagued ControlNet 1.0, plus a few more, just for good measure. Now you are ready to train your own state-of-the-art ControlNet model!

Note: this post is adapted from a flash session that I presented at CVPR last week!
