
Towards Controllable Diffusion Models with GLIGEN

Editor’s note: this is a guest post by Yuheng Li, a computer science Ph.D. student at the University of Wisconsin-Madison.

Caption: Images generated by GLIGEN with grounding instruction – living room; Michael Jordan silhouette

AI-generated imagery is an extremely exciting area of computer vision, with a variety of impressive innovations and technologies available to help bring new images to life. But what if you want more control over the generation process? That’s the question that inspired recent research and ultimately led to the creation of GLIGEN (Grounded Language to Image Generation). 

In this blog post, I’ll share how GLIGEN came to be, how it works, and show you how you can gain control over the outputs of diffusion models by adding new trainable parameters! 

Background and Motivation 

Image generation has seen remarkable progress in recent years, with diffusion models driving advances in AI image generation, 3D modeling, and more. Large-scale text-to-image (T2I) models like DALL-E 2 and Stable Diffusion can create complex images from text, but they can only be conditioned on text input, not on input from other modalities.

This can present challenges for users who want more control over the generation process. If you’re an interior designer looking to visualize how a furniture arrangement will look in a living room, for example, existing T2I diffusion models likely won’t bring your vision to life. Similarly, if you’re hoping to create an image of yourself in the same pose as the Michael Jordan silhouette, these models won’t easily do the trick.

The community has been actively working on bringing more control to the diffusion generation process. These efforts can be broadly categorized into four groups, based on which parameters are trained.

1. Train a new model from scratch

One example is Composer, which defines representation elements of an image (caption, sketch, color, etc.) and trains a new model from scratch conditioned on these elements. The advantage of this direction is that one can design a model architecture tailored to the controllable elements. However, because it does not build on an existing foundation image generation model, training is costly every time.

2. Fine-tune a pre-trained model

Another paradigm is to fine-tune the weights of an existing model. For example, ReCo appends bounding box information to the caption and fine-tunes both the pre-trained text encoder and the diffusion model.

3. Add new trainable parameters to a frozen pre-trained model

GLIGEN and ControlNet fall into this category. Instead of changing the weights of the foundation model, they add new learnable parameters that adapt and modify the intermediate features of the existing model.

4. Change sampling direction for a pre-trained model

What if I don’t want to train any parameters at all? Can I still control diffusion models? The answer is Yes! Universal Guidance proposes to use pre-trained discriminative models such as object detectors to control the sampling process.

Approach 

Diffusion Model

Before diving into the GLIGEN approach, let’s first get familiar with a pre-trained diffusion model architecture. 

Typically, a diffusion model uses a U-Net architecture, as shown in the figure below (left), consisting of stacks of residual blocks and Transformer blocks (detailed on the right), where the real magic happens: the self-attention layer enables global processing of visual features, and the cross-attention layer injects caption features.
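
To make this concrete, here is a minimal PyTorch sketch of such a Transformer block. It is a simplified illustration rather than the exact Stable Diffusion implementation (which makes its own normalization and feed-forward choices); the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified diffusion U-Net Transformer block: self-attention over
    visual tokens, followed by cross-attention to caption tokens."""

    def __init__(self, dim, context_dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=context_dim, vdim=context_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, caption_tokens):
        # x: (B, N, dim) flattened visual features; caption_tokens: (B, L, context_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]  # global visual mixing
        h = self.norm2(x)
        x = x + self.cross_attn(h, caption_tokens, caption_tokens, need_weights=False)[0]
        return x + self.ff(self.norm3(x))
```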

This is great if we want to condition the generation process on text alone, but what if we want more control? Training models from scratch conditioned on new control inputs can be quite costly, and it is infeasible to do so every time users ask for a new conditional input!

GLIGEN

GLIGEN sidesteps this problem by adding new control to a pre-trained model without changing its weights. The core idea of GLIGEN is to modify the original visual features with a gated self-attention layer inserted into the Transformer blocks. The figure below (left) shows where this layer is inserted.

Input to Gated Self-Attention

  • This layer takes in the visual features of the image and extra conditional features called grounding tokens. 
  • Grounding tokens represent the new conditional input the user wants. For example, if a user wants to control the generation process with bounding boxes, then the grounding tokens contain the bounding box information (a sketch of how such tokens can be built follows this list). 
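
For bounding-box grounding, the GLIGEN paper builds each grounding token by fusing the text embedding of a noun entity with a Fourier embedding of its box coordinates. The sketch below illustrates the idea; the MLP sizes, the number of Fourier frequencies, and the assumption that entity embeddings come from a text encoder such as CLIP are illustrative choices, not the exact released implementation.

```python
import torch
import torch.nn as nn

def fourier_embed(coords, num_freqs=8):
    """Map normalized box coordinates (B, M, 4) to Fourier features."""
    freqs = 2 ** torch.arange(num_freqs, dtype=coords.dtype, device=coords.device)
    angles = coords.unsqueeze(-1) * freqs * torch.pi          # (B, M, 4, num_freqs)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, M, 4, 2*num_freqs)
    return feats.flatten(-2)                                  # (B, M, 8*num_freqs)

class GroundingTokenizer(nn.Module):
    """Fuse an entity's text embedding with the Fourier embedding of its box.
    out_dim should match the visual feature dimension of the U-Net block."""

    def __init__(self, text_dim, out_dim, num_freqs=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + 8 * num_freqs, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, entity_embeddings, boxes):
        # entity_embeddings: (B, M, text_dim) from a text encoder (e.g., CLIP)
        # boxes: (B, M, 4) normalized [x0, y0, x1, y1]
        box_feats = fourier_embed(boxes)
        return self.mlp(torch.cat([entity_embeddings, box_feats], dim=-1))
```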

Operation within Gated Self-Attention

  • As shown on the right in the figure below, the visual tokens and grounding tokens are concatenated along the sequence dimension and fed into a self-attention layer. For the output, we discard the grounding-token positions and treat the remaining ones as a residual (light purple).
  • Instead of adding the residual directly, we first multiply it by a learnable scalar γ initialized to 0. This γ acts like a gate, which is where the layer gets its name: at the beginning of training, the new gated self-attention layer has no effect on the original features, leading to more stable training. A minimal code sketch of this layer follows below.
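
Here is that sketch. It assumes standard PyTorch attention and keeps only the gated self-attention sub-layer (the released implementation also gates a feed-forward part); the tanh on the gate mirrors the paper, and either way the layer starts out as a no-op because γ is initialized to zero.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Gated self-attention over concatenated visual and grounding tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable gate, initialized to 0

    def forward(self, visual_tokens, grounding_tokens):
        # visual_tokens: (B, N, dim); grounding_tokens: (B, M, dim)
        n = visual_tokens.shape[1]
        tokens = torch.cat([visual_tokens, grounding_tokens], dim=1)  # (B, N+M, dim)
        h = self.norm(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        residual = attn_out[:, :n]  # keep only the visual positions
        return visual_tokens + torch.tanh(self.gamma) * residual
```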

Optional Input to the Unet

The above design conceptually works for any extra conditional input, thanks to the generality of the Transformer. We also find that for conditions that are spatially aligned with the output image, such as an edge map or a depth map, training converges faster and more easily if they are also given as input to the U-Net, as shown in the figure below. In this case, the first convolutional (“conv”) layer needs to be modified to take in the extra channels, and it needs to be trainable.
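
One simple way to do this, assuming a Stable-Diffusion-style U-Net, is to copy the pre-trained weights into a wider convolution and zero-initialize the new input channels, so the modified layer initially behaves exactly like the original:

```python
import torch
import torch.nn as nn

def expand_first_conv(old_conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Widen a pre-trained first conv to accept extra condition channels
    (e.g., a depth or edge map), keeping its original behavior at init."""
    new_conv = nn.Conv2d(
        old_conv.in_channels + extra_channels,
        old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        stride=old_conv.stride,
        padding=old_conv.padding,
        bias=old_conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()                                   # new channels start at 0
        new_conv.weight[:, : old_conv.in_channels] = old_conv.weight  # copy pre-trained weights
        if old_conv.bias is not None:
            new_conv.bias.copy_(old_conv.bias)
    return new_conv
```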

Scheduled Sampling 

As we only modify intermediate features of the pre-trained diffusion model, we can freely remove the gated self-attention layers during the sampling process, as shown below. Since the initial steps typically determine the coarse structure of the image, this approach achieves a good trade-off between adhering to the grounding conditions and preserving image quality.
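
Conceptually, scheduled sampling looks like the loop below. This is only a rough sketch: `model`, `scheduler`, and the `enabled` switch on the gated layers are hypothetical placeholders, not GLIGEN’s actual API.

```python
def scheduled_sampling(model, scheduler, latents, cond, grounding_tokens,
                       num_steps=50, beta=0.3):
    """Use grounding for the first `beta` fraction of denoising steps (which
    set the coarse layout), then fall back to the original pre-trained model."""
    for i, t in enumerate(scheduler.timesteps[:num_steps]):
        use_grounding = i < int(beta * num_steps)
        for layer in model.gated_self_attention_layers:
            layer.enabled = use_grounding  # disabled acts like gamma = 0 (a no-op)
        noise_pred = model(latents, t, cond,
                           grounding_tokens if use_grounding else None)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```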

GLIGEN in Action 

Here we demonstrate GLIGEN results across three different grounding modalities: bounding boxes, keypoints, and Canny edge maps.

Grounding on Bounding Boxes

In this modality, users can specify the locations of the objects mentioned in their caption prompt. For example, you can control the positions of your favorite celebrities to create a poster. On top of that, you can also control the style of the generated image by providing a reference image as inspiration for the new creation.
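
If you want to try box-grounded generation yourself, GLIGEN is available through the Hugging Face diffusers library. The snippet below is a minimal sketch; the checkpoint name and the `gligen_*` arguments follow the diffusers documentation, so double-check them against your installed version.

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

# Load a GLIGEN text-plus-box checkpoint (name taken from the diffusers docs).
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a cozy living room with a sofa and a floor lamp",
    gligen_phrases=["a sofa", "a floor lamp"],
    # Boxes are normalized [xmin, ymin, xmax, ymax] coordinates.
    gligen_boxes=[[0.1, 0.5, 0.6, 0.9], [0.7, 0.3, 0.9, 0.9]],
    gligen_scheduled_sampling_beta=0.3,  # ground only the first 30% of steps
    num_inference_steps=50,
).images[0]

image.save("living_room.png")
```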

Grounding on Keypoints

One can also control the pose of the generated subject by providing a set of keypoints. Note that in this case GLIGEN is trained only on human keypoint data, yet the model generalizes to other domains such as monkeys and cartoon characters thanks to the scheduled sampling technique.

Grounding on Canny Maps

GLIGEN also enables users to easily generate various colorized versions of a Canny edge drawing, allowing designers to quickly fill in colors and experiment with different styles. This makes GLIGEN a valuable tool for artists and designers who want to explore multiple design options efficiently.

You can refer to our paper for more details and quantitative analysis.

The Role of Data

Data Used to Train GLIGEN

GLIGEN was trained mostly on data with bounding box grounding. Ideal data for this task would consist of image-text grounding pairs (see below). However, this type of data is scarce (on the order of thousands of samples). To overcome this shortage, the training data was augmented with a combination of three different data types, all of which can be normalized into the same training format (a sketch follows the list below).

  • Grounding data
    • Each image is associated with a caption describing the whole image; noun entities are extracted from the caption and labeled with bounding boxes. 
    • Since the noun entities are taken directly from the natural-language caption, they cover a much richer vocabulary, which is beneficial for open-vocabulary grounded generation. 
  • Detection data
    • Noun entities are pre-defined closed-set categories (e.g., the 80 object classes in COCO). In this case, we use either a null caption or the concatenated class names as the caption. 
    • Detection data exists in much larger quantities (millions of images) than grounding data (thousands), so it greatly increases the amount of training data. 
  • Detection and caption data
    • Noun entities are the same as those in the detection data, and the image is additionally described with a separate text caption. 
    • In this case, the noun entities may not exactly match those in the caption. For example, the caption may only give a high-level description of a living room, whereas the detection annotations provide more fine-grained, object-level details.
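
Whatever the source, each training example can be reduced to the same triple of (caption, grounded phrases, boxes). The sketch below shows one way to normalize the three data types into that shared format; the record field names are hypothetical and not GLIGEN’s actual data schema.

```python
def to_training_example(record, source):
    """Normalize a raw annotation record into (caption, phrases, boxes).
    Field names here are illustrative, not GLIGEN's actual schema."""
    if source == "grounding":
        # Caption plus noun entities extracted from it, each with a box.
        return record["caption"], record["entities"], record["boxes"]
    if source == "detection":
        # Closed-set class names; use a null caption or the class names themselves.
        caption = ", ".join(record["class_names"])  # or simply ""
        return caption, record["class_names"], record["boxes"]
    if source == "detection+caption":
        # Separate caption; the grounded phrases still come from the detection labels.
        return record["caption"], record["class_names"], record["boxes"]
    raise ValueError(f"unknown source: {source}")
```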

GLIGEN for Other Tasks

Object Detection

Object detection is one of the most important perception tasks in computer vision, and it typically requires ever-larger amounts of labeled data for training. Automatic ways of generating training data have been explored in recent years. One common approach is to obtain pseudo-labels from a pre-trained detector such as GLIP. GLIGEN potentially opens another path by generating an essentially unlimited amount of training data, since the bounding boxes used to condition the generation come for free as labels.

Causal Inference  

GLIGEN can also generate counterfactual results, such as an apple that is the same size as a dog, or a hen that is smaller than an egg, as shown below. Such data can potentially help train reasoning models to better understand spatial relationships between objects.

Conclusion & Next Steps

Although T2I diffusion models have limitations when it comes to controllable generation, the community has been actively working to address them. One of the proposed solutions, GLIGEN, modifies the features of a diffusion model without altering its weights, which makes it a cost-effective and modular training approach. The results highlight GLIGEN’s proficiency across multiple grounding modalities, indicating its potential to advance controllable image generation with diffusion models.

Here are some resources to learn more about GLIGEN and give it a try:

And if you really want to explore further, you may want to consider using GLIGEN to build your own high-quality dataset and then visualize it in FiftyOne, the open-source computer vision toolset maintained by Voxel51.
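
For example, if you generate box-grounded images with GLIGEN, you can store each image together with the boxes it was conditioned on and browse the results in the FiftyOne App (a sketch; the file path, labels, and boxes are placeholders):

```python
import fiftyone as fo

dataset = fo.Dataset("gligen-generated")

# Placeholder example: one generated image plus the boxes it was conditioned on.
sample = fo.Sample(filepath="living_room.png")
sample["grounding"] = fo.Detections(
    detections=[
        # FiftyOne boxes are [top-left-x, top-left-y, width, height], normalized.
        fo.Detection(label="sofa", bounding_box=[0.1, 0.5, 0.5, 0.4]),
        fo.Detection(label="floor lamp", bounding_box=[0.7, 0.3, 0.2, 0.6]),
    ]
)
dataset.add_sample(sample)

session = fo.launch_app(dataset)  # explore the generated dataset in the App
```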