CS 180 Proj 5: Fun with Diffusion Models

Part A: The Power of Diffusion Models!

Part 0: Setup

First, we use DeepFloyd's IF diffusion model to generate images from a given text prompt. The model has two stages: the first produces 64x64 images, and the second upsamples them to 256x256. The results are displayed below; the first set of images was generated with 20 inference steps and the second set with 10 inference steps. I used seed=180 for all parts.
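A minimal sketch of this two-stage setup with the Hugging Face diffusers library (model IDs follow the public DeepFloyd release; the exact arguments may differ from the course notebook):

```python
import torch
from diffusers import DiffusionPipeline

torch.manual_seed(180)  # seed = 180 for all parts

# Stage 1 generates 64x64 images; stage 2 upsamples them to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16",
    torch_dtype=torch.float16)

prompt_embeds, negative_embeds = stage_1.encode_prompt(
    "an oil painting of a snowy mountain village")
image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                num_inference_steps=20, output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pil").images[0]
```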

[Images: generated samples — first set at 20 inference steps, second set at 10 inference steps]

The amount of detail increases with the number of inference steps, though both sets of images look somewhat cartoonish.

Part 1: Sampling Loops

In this part, we iteratively add noise to an image until it becomes pure noise, then feed the noisy image into a diffusion model that tries to denoise it. The model's noise estimate can be removed a little at a time, stepping toward a cleaner image, or removed all at once to obtain a one-shot prediction of the original image.

1.1 Implementing the Forward Process

First, we implement the forward process, which takes a clean image and adds noise to it. We do this by following the equation below:

x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε,  where ε ~ N(0, I) and ᾱ_t comes from the model's noise schedule.
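A minimal sketch of the forward process, assuming alphas_cumprod is the ᾱ schedule tensor taken from the stage-1 scheduler:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t.

    alphas_cumprod: 1-D tensor of cumulative alpha products (the a-bar
    schedule), assumed to come from the model's scheduler.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)                      # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
```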

1.2 Classical Denoising

In this part, we use Gaussian blur filtering to try to remove the noise. As expected, the results are noticeably worse than diffusion-based denoising, since blurring smooths away image detail along with the noise.
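A sketch of the classical baseline (kernel size and sigma are illustrative, tuned by eye):

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy_image, kernel_size=5, sigma=2.0):
    # Gaussian blur suppresses high-frequency noise, but also blurs detail.
    return TF.gaussian_blur(noisy_image, kernel_size=kernel_size, sigma=sigma)
```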

1.3 One-Step Denoising

Finally, we use a pretrained diffusion model to denoise the image in a process called one-step denoising: the model estimates the noise in the image, which we then scale and subtract to recover an estimate of the original image. Since the model was trained with text conditioning, we also pass in a text prompt.
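A sketch of one-step denoising, inverting the forward equation using the UNet's noise estimate (unet and prompt_embeds stand in for the stage-1 model and a precomputed text embedding; DeepFloyd's UNet stacks a noise and a variance estimate along channels, so we keep the first three channels):

```python
def one_step_denoise(xt, t, unet, prompt_embeds, alphas_cumprod):
    """Estimate the clean image from x_t in a single step (sketch)."""
    out = unet(xt, t, encoder_hidden_states=prompt_embeds).sample
    eps_hat = out[:, :3]                            # noise-estimate channels
    abar_t = alphas_cumprod[t]
    # Invert the forward equation: x0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)
    return (xt - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```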

The results of the previous three parts are displayed below for t = 250, 500, and 750. Each column corresponds to one value of t: the top row is the original test image, the second row the forward-noised image, the third row the one-step denoised result, and the bottom row the classically denoised result.

[Images: Campanile test image at t = 250, 500, 750 — original, forward-noised, one-step denoised, and Gaussian-blurred results]

1.4 Iterative Denoising

In this part, we will implement iterative denoising, which should produce better results than any of the previous approaches. At each step we move from a noisier timestep t to a less noisy timestep t', following the equation below:

x_{t'} = ( √(ᾱ_{t'}) β_t / (1 − ᾱ_t) ) x_0 + ( √(α_t) (1 − ᾱ_{t'}) / (1 − ᾱ_t) ) x_t + v_σ

where α_t = ᾱ_t / ᾱ_{t'}, β_t = 1 − α_t, x_0 is the current one-step estimate of the clean image, and v_σ is random noise with the predicted variance.

Each step of iterative denoising produces a less noisy image. To avoid running the model for all 1000 timesteps, we use strided timesteps: a list starting at timestep 990 and stepping down by 30 until we reach 0. The noisy image after every 5th loop of denoising is displayed below, followed by the final result of iterative denoising alongside the one-step and classically denoised results.
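A sketch of the denoising loop under the same assumptions as above (the variance term v_σ is omitted here):

```python
def iterative_denoise(x, timesteps, unet, prompt_embeds, alphas_cumprod):
    """Strided iterative denoising (sketch); timesteps = [990, 960, ..., 0]."""
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next
        beta = 1 - alpha
        eps_hat = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        # Current clean-image estimate, as in one-step denoising.
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        # Posterior mean: interpolate between the clean estimate and x_t.
        x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0_hat \
            + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x
    return x
```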

[Images: intermediate results after every 5th denoising loop]
[Images: final iterative, one-step, and Gaussian-blur denoised results]

1.5 Diffusion Model Sampling

In this part, we pass pure noise into the iterative denoising function with i_start = 0 to generate images from scratch. The results are displayed below:

[Images: samples generated from scratch]

1.6 Classifier-Free Guidance (CFG)

The results in the previous part are not of high quality, so we use classifier-free guidance (CFG) to improve the generated images. We compute both a conditional and an unconditional noise estimate, and combine them according to the following equation:

ε = ε_u + γ (ε_c − ε_u)

where ε_c is the conditional noise estimate, ε_u is the unconditional estimate (obtained with an empty prompt), and γ > 1 amplifies the effect of the conditioning.
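A sketch of the CFG estimate, which costs two UNet passes per step:

```python
def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, gamma):
    """Classifier-free guidance (sketch). uncond_embeds is the embedding
    of the empty prompt."""
    eps_c = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_u = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + gamma * (eps_c - eps_u)
```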

The results of iterative denoising with CFG are displayed below:

[Images: samples generated with CFG]

1.7 Image-to-image Translation

In this part, we take the test image of the Campanile, add noise to it, and then denoise it with CFG iterative denoising. The results for starting indices i_start = 1, 3, 5, 7, 10, and 20 are shown below; larger starting indices add less noise, so the results more closely resemble the original test image.
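In code, this is the earlier sketches chained together (names are the hypothetical ones from above; the denoising call would use the CFG noise estimate):

```python
# Noise the test image to the i_start-th strided timestep, then denoise
# from there.
x = forward(test_image, timesteps[i_start], alphas_cumprod)
edited = iterative_denoise(x, timesteps[i_start:], unet, prompt_embeds,
                           alphas_cumprod)
```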

[Images: Campanile edits at i_start = 1, 3, 5, 7, 10, 20]

1.7.1 Editing Hand-Drawn and Web Images

Now we try the same procedure on an image from the internet and two hand-drawn images:

[Images: web image and two hand-drawn images, each edited at increasing i_start]

1.7.2 Inpainting

We can also use the same procedure to generate new content while preserving the rest of the image. We create a mask over the region we want to replace and, after every denoising step, reset the pixels outside the mask to the original image noised to the current timestep, so new content only appears inside the mask. The results are shown below.
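A sketch of the projection step applied after each denoising iteration, reusing the forward() sketch from 1.1:

```python
def inpaint_step(x, t, mask, x_orig, alphas_cumprod):
    """Keep generated content where mask == 1; everywhere else, reset to
    the original image noised to timestep t."""
    return mask * x + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```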

[Images: inpainting results]

1.7.3 Text-Conditional Image-to-image Translation

We can also use text prompts to guide the generation of the new content using the previous procedure. The prompts "a rocket ship", "a photo of a dog", and "an oil painting of a snowy mountain village" are used:

[Images: text-conditional image-to-image results for each prompt]

1.8 Visual Anagrams

In this part, we will implement an optical illusion with diffusion models using two prompts. At each denoising step, we compute a noise estimate for the image with one prompt, flip the image upside down and compute a noise estimate with the other prompt, flip that second estimate back, and average the two estimates. The prompts "an oil painting of an old man" and "an oil painting of people around a campfire" are used for the first image. The prompts "a photo of a man" and "a photo of a dog" are used for the second image. The prompts "a rocket ship" and "a pencil" are used for the third image.
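A sketch of the anagram noise estimate (in practice each of the two estimates is itself a CFG estimate):

```python
import torch

def anagram_noise_estimate(x, t, unet, embeds_a, embeds_b):
    """Estimate noise upright under prompt A and upside-down under
    prompt B, then average the two estimates."""
    eps_a = unet(x, t, encoder_hidden_states=embeds_a).sample[:, :3]
    flipped = torch.flip(x, dims=[-2])              # flip image vertically
    eps_b = unet(flipped, t, encoder_hidden_states=embeds_b).sample[:, :3]
    eps_b = torch.flip(eps_b, dims=[-2])            # flip the estimate back
    return (eps_a + eps_b) / 2
```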

[Images: visual anagram results]

1.9 Hybrid Images

In this part, we combine the low frequencies of one prompt's noise estimate with the high frequencies of another's to create an illusion where one prompt appears up close while the other appears from far away. The prompts "a lithograph of a skull" and "a lithograph of waterfalls" are used for the first image. The prompts "a rocket ship" and "a man wearing a hat" are used for the second image. The prompts "a rocket ship" and "a lithograph of waterfalls" are used for the third image.
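A sketch of the hybrid noise estimate; the low-pass filter is a Gaussian blur, with kernel size and sigma as illustrative values:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x, t, unet, embeds_far, embeds_near):
    """Low frequencies come from the prompt visible from far away,
    high frequencies from the prompt visible up close."""
    eps_far = unet(x, t, encoder_hidden_states=embeds_far).sample[:, :3]
    eps_near = unet(x, t, encoder_hidden_states=embeds_near).sample[:, :3]
    low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
    return low + high
```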

[Images: hybrid image results]

Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising U-Net

1.1 Implementing the UNet

In this part, I implement a UNet according to the diagrams below:

[Diagrams: unconditional UNet architecture and its component operations]

1.2 Using the UNet to Train a Denoiser

We aim to train a denoiser that maps a noisy image to a clean one, optimizing an L2 loss. First, we generate training pairs by adding noise to clean images according to the equation below:

z = x + σ ε,  ε ~ N(0, I)
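A sketch of this noising function:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```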

The noising process for sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] is shown below:

[Images: noising process at each sigma value]

1.2.1 Training

After generating the noisy images, we train our denoiser on images noised with sigma = 0.5, using a batch size of 256 for 5 epochs. The training loss and the results after the 1st and 5th epochs are displayed below:
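A minimal sketch of the training loop, assuming MNIST digits and an Adam optimizer at lr=1e-4 (both assumptions); model stands for the UNet from 1.1, and add_noise is the sketch from above:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

loader = DataLoader(
    datasets.MNIST(".", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # model: UNet from 1.1

for epoch in range(5):
    for x, _ in loader:
        z = add_noise(x, 0.5)                 # noisy input at sigma = 0.5
        loss = F.mse_loss(model(z), x)        # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```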

[Plot: training loss]
[Images: denoising results after epoch 1]
[Images: denoising results after epoch 5]

1.2.2 Out-of-Distribution Testing

The results on sigma values the model wasn't trained on are displayed below:

[Images: denoising results across out-of-distribution sigma values]

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

In this part, we will add time conditioning to the UNet according to the diagram below, using FCBlocks to inject the conditioning signal into the UNet.
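A sketch of what an FCBlock might look like and how its output could be injected; the exact layer layout and injection points are assumptions read off the diagram:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small two-layer MLP mapping the conditioning signal into feature space."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, t):
        # t: (batch, in_dim), e.g. the timestep normalized to [0, 1]
        return self.net(t)

# Inside the UNet's forward pass, the embedding is broadcast over the
# spatial dimensions and combined with an intermediate activation, e.g.:
#   unflatten = unflatten + self.t_fc(t)[:, :, None, None]
```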

[Diagram: time-conditioned UNet with FCBlocks]

2.2 Training the UNet

To train the UNet, we repeatedly pick a random image from the training set and a random timestep t, noise the image to timestep t, and train the denoiser to predict that noise. We repeat this until the model converges.
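A sketch of one training step, assuming a DDPM-style ᾱ schedule as in Part A and a model(x, t) signature with t normalized to [0, 1]:

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, alphas_cumprod, opt):
    """Noise a batch to random timesteps and predict the added noise."""
    T = len(alphas_cumprod)
    t = torch.randint(0, T, (x0.shape[0],))
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps   # DDPM forward process
    t_norm = (t.float() / (T - 1)).view(-1, 1)        # normalize timestep
    loss = F.mse_loss(model(xt, t_norm), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```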

[Plot: time-conditioned UNet training loss]

2.3 Sampling from the UNet

We now sample from the time-conditioned UNet, and the samples at the 5th and 20th epochs are shown below.
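A sketch of DDPM-style sampling under the same assumptions (alphas holds the per-step α_t values, and β_t = 1 − α_t):

```python
import torch

@torch.no_grad()
def sample(model, alphas, shape):
    """Ancestral sampling from pure noise (sketch)."""
    abar = torch.cumprod(alphas, dim=0)
    T = len(alphas)
    x = torch.randn(shape)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((shape[0], 1), t / (T - 1))
        eps = model(x, t_norm)
        x0 = (x - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()
        abar_prev = abar[t - 1] if t > 0 else torch.tensor(1.0)
        beta = 1 - alphas[t]
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        # Posterior mean plus noise scaled by sqrt(beta_t)
        x = (abar_prev.sqrt() * beta / (1 - abar[t])) * x0 \
            + (alphas[t].sqrt() * (1 - abar_prev) / (1 - abar[t])) * x \
            + beta.sqrt() * z
    return x
```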

[Images: samples after epoch 5]
[Images: samples after epoch 20]

2.4 Adding Class-Conditioning to UNet

We can improve the results from the previous part by adding class conditioning, which we do by adding two more FCBlocks to our UNet. We train this model very similarly to the time-conditioned model, but with an added class-conditioning vector c that we randomly zero out for a fraction of training examples, so the model also learns unconditional generation (needed for classifier-free guidance at sampling time).
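A sketch of the class-conditioning dropout; the one-hot encoding, ten classes, and the 10% drop rate are assumptions:

```python
import torch
import torch.nn.functional as F

def class_vector(labels, num_classes=10, p_uncond=0.1):
    """One-hot class conditioning, zeroed out for a random fraction of the
    batch so the model also learns to generate unconditionally."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1) > p_uncond).float()
    return c * keep
```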

[Plot: class-conditioned UNet training loss]

2.5 Sampling from the Class-Conditioned UNet

We now sample from the class-conditioned UNet, and the samples at the 5th and 20th epochs are shown below.

[Images: class-conditioned samples after epoch 5]
[Images: class-conditioned samples after epoch 20]