First, we use DeepFloyd's IF diffusion model to generate images from a given prompt. The model has two stages: the first produces 64x64 images, and the second upsamples them to 256x256. The results of the diffusion model are displayed below. The first set of images was generated with 20 inference steps, and the second set was generated with 10 inference steps. I used seed=180 for all parts.
The amount of detail appears to increase with the number of inference steps. Both sets of images look cartoonish.
In this part, we iteratively add noise to an image until it becomes pure noise. We then feed the noisy image into a diffusion model, which tries to remove the noise. We can either remove the noise completely or obtain a prediction of the original image.
First, we implement the forward process, which takes a clean image and adds noise to it. We do this by following the equation below:
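As a concrete sketch of this forward process (NumPy for illustration, though the project itself uses PyTorch tensors; `alphas_cumprod` stands in for the model's cumulative-alpha noise schedule, and the function name is mine):

```python
import numpy as np

def forward(im, alphas_cumprod, t, rng=np.random.default_rng(180)):
    """Noise a clean image to timestep t: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)  # eps ~ N(0, I)
    return np.sqrt(a_bar) * im + np.sqrt(1.0 - a_bar) * eps, eps
```

Note that both coefficients depend on t: at a timestep where the schedule gives a_bar = 1 this returns the clean image unchanged, and as a_bar approaches 0 the output approaches pure noise.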
In this part, we use Gaussian blur filtering to try to remove the noise. The results will be noticeably worse than those of the diffusion-based approaches.
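A dependency-free sketch of this classical baseline (a separable Gaussian blur written in plain NumPy; the kernel width is illustrative, and the function names are mine):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel; radius defaults to 3*sigma."""
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(im, sigma=2.0):
    """Separable Gaussian blur over the last two (spatial) axes."""
    k = gaussian_kernel1d(sigma)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), -1, im)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), -2, out)
```

Blurring suppresses the high-frequency noise but also destroys high-frequency image detail, which is why this baseline falls short of the learned denoisers.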
Lastly, we also use a pretrained diffusion model to denoise the image in a process called one-step denoising. To do this, we will pass in a text prompt since the model was trained with text-conditioning.
The results of the previous three parts are displayed below for t = [250, 500, 750]. For each column, the top row is the original test image, the second row is the result of the forward process, the third row is the result of one-step denoising, and the last row is the result of classical denoising.
In this part, we will implement iterative denoising, which should produce better results than any of the previous parts. To do this, we will follow the below equation:
Each subsequent step of iterative denoising should produce a less noisy image. We use strided timesteps to avoid running the model 1000 times: we create a list of timesteps starting at 990 and decreasing in steps of 30 until we reach 0. The noisy image after every 5th loop of denoising is displayed, along with the final result of iterative denoising and, for comparison, the results of one-step denoising and classical denoising.
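One update of this iterative scheme can be sketched as follows (NumPy; the function names are mine, the clean estimate `x0_hat` comes from the model's noise prediction as in one-step denoising, and the learned variance term v_sigma is omitted from this sketch):

```python
import numpy as np

def iterative_denoise_step(x_t, x0_hat, alphas_cumprod, t, t_prev):
    """One DDPM-style update from timestep t to the (strided) earlier timestep
    t_prev, blending the current noisy image with the clean estimate x0_hat."""
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = a_bar_t / a_bar_prev          # effective alpha over the stride
    beta_t = 1.0 - alpha_t
    return (np.sqrt(a_bar_prev) * beta_t / (1 - a_bar_t)) * x0_hat \
         + (np.sqrt(alpha_t) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    # (the additive variance noise term v_sigma is omitted in this sketch)

strided_timesteps = list(range(990, -1, -30))  # 990, 960, ..., 30, 0
```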
In this part, we set i_start = 0 in the iterative denoising function to generate images from scratch. The results are displayed below:
The results in the previous part are not of high quality, so we use classifier-free guidance to improve our outputted images. We compute both a conditional and unconditional noise estimate, and define our noise estimate according to the following equation:
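The CFG combination itself is a single line; a sketch (gamma = 7 is shown as a typical guidance scale, with gamma > 1 amplifying the effect of the prompt):

```python
import numpy as np

def cfg_noise_estimate(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the conditional estimate,
    away from the unconditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

At gamma = 1 this reduces to the plain conditional estimate, and at gamma = 0 to the unconditional one.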
The results of iterative denoising with CFG are displayed below:
In this part, we take the test image of the Campanile, add noise to it, and then denoise it using iterative denoising with CFG. The results for starting indices of 1, 3, 5, 7, 10, and 20 are shown below; larger starting indices produce results that more closely resemble the original test image.
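A sketch of the starting point of this image-to-image procedure, assuming the strided timestep list from the iterative-denoising part (the function name is mine; iterative denoising then runs from index i_start onward):

```python
import numpy as np

def sdedit_start(im, i_start, strided_timesteps, alphas_cumprod, rng):
    """Noise the input image to the i_start-th strided timestep. Larger i_start
    means a smaller timestep, so less noise is added and more of the original
    image survives the subsequent denoising."""
    t = strided_timesteps[i_start]
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)
    return np.sqrt(a_bar) * im + np.sqrt(1.0 - a_bar) * eps
```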
Now we try the algorithm on an image downloaded from the internet and on two hand-drawn images:
We can also use the same procedure to generate new content while preserving the original content. We generate a mask over the original image and generate noise over the part of the image we want to replace. The results are shown below.
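After every denoising step, the generated image is projected back onto the known pixels of the original. A sketch of that projection (mask is 1 where new content is generated; `x_orig_noisy_t` is the original image noised to the current timestep with the forward process):

```python
import numpy as np

def inpaint_project(x_t, x_orig_noisy_t, mask):
    """Keep generated content where mask == 1; elsewhere force the original
    image, noised to the same timestep, so only the masked region changes."""
    return mask * x_t + (1 - mask) * x_orig_noisy_t
```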
We can also use text prompts to guide the generation of the new content using the previous procedure. The prompts "a rocket ship", "a photo of a dog", and "an oil painting of a snowy mountain village" are used:
In this part, we will implement an optical illusion with diffusion models, using two prompts. We denoise the image according to one prompt, flip it upside down and denoise it with the other prompt, and then average the flipped-back version with the right-side-up version. The prompts "an oil painting of an old man" and "an oil painting of people around a campfire" are used for the first image. The prompts "a photo of a man" and "a photo of a dog" are used for the second image. The prompts "a rocket ship" and "a pencil" are used for the third image.
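The noise-estimate averaging at the heart of this step can be sketched as follows (NumPy; `eps_prompt2_on_flipped` is the model's estimate computed on the upside-down image, which must be flipped back before averaging, and the function name is mine):

```python
import numpy as np

def anagram_noise_estimate(eps_prompt1, eps_prompt2_on_flipped):
    """Average the prompt-1 noise estimate with the prompt-2 estimate after
    flipping the latter back right-side up (axis -2 is the vertical axis)."""
    return (eps_prompt1 + np.flip(eps_prompt2_on_flipped, axis=-2)) / 2.0
```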
In this part, we will combine high frequencies with low frequencies to generate an illusion where one prompt appears from close up while another prompt appears from far away. The prompts "a lithograph of a skull" and "a lithograph of waterfalls" are used for the first image. The prompts "a rocket ship" and "a man wearing a hat" are used for the second image. The prompts "a rocket ship" and "a lithograph of waterfalls" are used for the third image.
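A sketch of the frequency split behind these hybrid images (a crude box filter stands in here for the Gaussian low-pass actually used, to keep the sketch dependency-free; function names are mine):

```python
import numpy as np

def lowpass(x, k=3):
    """Box-filter low-pass over the last two (spatial) axes."""
    kern = np.ones(k) / k
    out = np.apply_along_axis(lambda v: np.convolve(v, kern, mode="same"), -1, x)
    return np.apply_along_axis(lambda v: np.convolve(v, kern, mode="same"), -2, out)

def hybrid_noise_estimate(eps_far, eps_near):
    """Low frequencies from the 'seen from far away' prompt's estimate,
    high frequencies (residual after low-pass) from the 'close up' one."""
    return lowpass(eps_far) + (eps_near - lowpass(eps_near))
```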
In this part, I implement a UNet according to the diagrams below:
We aim to train a denoiser such that it maps a noisy image to a clean image. We will optimize over an L2 loss. First, we will generate training data pairs by adding noise to clean images. We will follow the below equation for adding noise:
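The training-pair noising step is a single line (sketch; the function name is mine):

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Build a noisy training input: z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)
```

The denoiser is then trained to map z back to x under an L2 loss.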
The noising process for sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] is shown below:
After generating the noisy images, we train our denoiser on images noised with sigma = 0.5. A batch size of 256 was used, and the model was trained for 5 epochs. The training loss and the results after the 1st and 5th epochs are displayed below:
The results on different sigma values that the model wasn't trained for are displayed below:
In this part, we will add time conditioning to the UNet according to the diagram below, using FCBlocks to inject the conditioning signal into the UNet.
To train the UNet, we will pick a random image from the training set, a random time t, and train the denoiser to predict the noise at time t. We will do this until the model converges.
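This training setup can be sketched as follows (NumPy for illustration; `alphas_cumprod` is an assumed noise schedule, the function name is mine, and in the actual model t is normalized before being fed through the FCBlocks):

```python
import numpy as np

def timestep_training_batch(x0_batch, alphas_cumprod, rng):
    """Build one batch for the time-conditioned denoiser: a random timestep t
    per image, the correspondingly noised input x_t, and the noise eps that
    the UNet is trained (with an L2 loss) to predict."""
    n = x0_batch.shape[0]
    t = rng.integers(0, len(alphas_cumprod), size=n)
    a_bar = alphas_cumprod[t].reshape(n, 1, 1, 1)
    eps = rng.standard_normal(x0_batch.shape)
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return x_t, t, eps
```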
We now sample from the time-conditioned UNet, and the samples at the 5th and 20th epochs are shown below.
We can improve the results from the previous part by adding class conditioning. We will do this by adding two more FCBlocks to our UNet. We train our model very similarly to the time-conditioned model, but with the added conditioning vector c and periodic unconditional generation.
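The class-conditioning vector construction, including the periodic unconditional generation, can be sketched as follows (the function name is mine, and the drop probability of 0.1 is shown as an example value):

```python
import numpy as np

def class_condition_vectors(labels, num_classes, p_uncond, rng):
    """One-hot class vectors c, randomly zeroed with probability p_uncond so
    the model also learns unconditional generation (needed for CFG sampling)."""
    c = np.eye(num_classes)[labels]
    drop = rng.random(len(labels)) < p_uncond  # which samples go unconditional
    c[drop] = 0.0
    return c
```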
We now sample from the class-conditioned UNet, and the samples at the 5th and 20th epochs are shown below.