The random seed I used was 180. By providing text prompts and sampling from the model, I generated images of an oil painting of a snow mountain village, a man wearing a hat, and a rocket ship using num_inference_steps = 20 and num_inference_steps = 70.
I added noise to an image via the forward process defined in equation 2.
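For reference, here is a minimal sketch of that forward process; the function and variable names (forward_process, alphas_cumprod) are illustrative assumptions rather than the exact names in my code.

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Equation 2: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    alphas_cumprod is assumed to hold the scheduler's cumulative products of alphas."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps
```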
I used a Gaussian blur filter to try to remove the noise from the noisy images generated in part 1.1.
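As a sketch, this classical baseline amounts to a single blur call; the kernel size and sigma here are illustrative, not the exact values I used.

```python
import torchvision.transforms.functional as TF

# Classical denoising baseline: simply blur the noisy image.
blurred = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```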
Using the UNet, I estimate the noise in the noisy image by passing it through stage_1.unet, and then remove this estimated noise to recover the original image in one step.
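Conceptually, one-step denoising inverts equation 2 using the predicted noise; the wrapper name unet_noise_estimate below is a placeholder for the stage_1.unet call (which in practice also returns variance channels that get split off).

```python
# One-step denoising sketch (names are placeholders).
with torch.no_grad():
    eps_hat = unet_noise_estimate(x_t, t, prompt_embeds)  # wraps stage_1.unet
x0_hat = (x_t - torch.sqrt(1 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)
```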
Rather than denoising in one step, I denoise iteratively: starting from an i_start of 10, I step through strided_timesteps, which begin at 990 and decrease with a stride of 30, applying equation 3 at every iteration.
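A sketch of one iteration of equation 3, assuming the same alphas_cumprod schedule as above; the added-noise/variance term is omitted for brevity and the names are assumptions.

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """Move from timestep t (noisier) to t_prime (less noisy)."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp
    beta_t = 1 - alpha_t
    x_tp = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t
    return x_tp  # the predicted-variance noise term is omitted in this sketch
```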
To generate images from scratch, I use an i_start of 0 and pass in random noise, producing 5 high quality photos.
To improve image quality, we can employ classifier-free guidance. This is done by computing both a conditional noise estimate (conditioned on the text prompt) and an unconditional noise estimate (conditioned on a null text prompt), then combining the two into a new estimate using a guidance scale of 7. These are the resulting high quality photos.
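The aggregation itself is a single line; this sketch reuses the hypothetical unet_noise_estimate wrapper from above.

```python
# Classifier-free guidance with a scale of 7.
guidance_scale = 7
eps_cond = unet_noise_estimate(x_t, t, prompt_embeds)         # conditioned on the text prompt
eps_uncond = unet_noise_estimate(x_t, t, null_prompt_embeds)  # conditioned on the null prompt
eps_cfg = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```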
Here, we follow the SDEdit algorithm to make a series of edits that look more and more like our original image as the starting index increases, using starting indices of 1, 3, 5, 7, 10, and 20.
Here, we repeat the same SDEdit procedure on images from the internet and on hand-drawn images.
Now, we want to constrain generation and only run the diffusion denoising loop within the bounds of a binary mask. I created masks for my test, crater lake, and lantern images and inpainted them.
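The key step, sketched below with the forward_process helper from earlier, is forcing everything outside the mask back to the appropriately noised original after each denoising iteration (mask is assumed to be 1 where new content should be generated).

```python
# Inpainting constraint applied after every denoising step.
x_t = mask * x_t + (1 - mask) * forward_process(x_orig, t, alphas_cumprod)[0]
```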
Now, we repeat the same SDEdit procedure but with a text prompt. Using "a rocket ship" as our prompt, this transforms our images more and more to look like a rocket ship.
Next, we want to create visual anagrams: images that look like one thing right side up and something else upside down. To do this, we pass in two separate prompts and denoise with each. For the first prompt, we denoise as usual. For the second prompt, we first flip the image upside down, compute the noise estimate, and then flip that noise estimate back right side up. We perform the reverse diffusion step using the average of these two noise estimates.
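In code, the two estimates can be combined roughly as below (again using the hypothetical unet_noise_estimate wrapper); flipping along the height dimension turns the image upside down.

```python
# Visual anagram sketch: average the upright and the flipped noise estimates.
eps_1 = unet_noise_estimate(x_t, t, prompt_embeds_1)                  # first prompt, upright
eps_2 = torch.flip(
    unet_noise_estimate(torch.flip(x_t, dims=[-2]), t, prompt_embeds_2),
    dims=[-2])                                                         # second prompt, flipped
eps = (eps_1 + eps_2) / 2
```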
To create hybrid images, we again use two separate prompts and generate a noise estimate from each. Rather than flipping as in part 1.8, we run the first noise estimate through a low-pass filter (a Gaussian kernel) and the second through a high-pass filter (original minus low-pass). The combined low-pass and high-pass noise estimates are then used for the diffusion step.
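A sketch of that noise-estimate compositing; the kernel size, sigma, and prompt/variable names are illustrative assumptions.

```python
import torchvision.transforms.functional as TF

# Hybrid image sketch: low-pass one estimate, high-pass the other, then combine.
eps_low_src = unet_noise_estimate(x_t, t, prompt_embeds_far)    # prompt seen from afar
eps_high_src = unet_noise_estimate(x_t, t, prompt_embeds_near)  # prompt seen up close
low = TF.gaussian_blur(eps_low_src, kernel_size=33, sigma=2.0)
high = eps_high_src - TF.gaussian_blur(eps_high_src, kernel_size=33, sigma=2.0)
eps = low + high
```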
I constructed the denoiser as a UNet by implementing its inner convolutional and average-pooling layers and upsampling, downsampling, and concatenating the feature maps as needed. I generated noisy images via z = x + sigma * epsilon and applied the denoiser to the noisy images at varying sigma values. The result is shown in the figure below.
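The noising itself is straightforward; the sigma value below is illustrative and unet stands in for my UNet denoiser.

```python
# z = x + sigma * epsilon, then denoise with the (unconditional) UNet.
sigma = 0.5
z = x + sigma * torch.randn_like(x)
denoised = unet(z)
```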
Then, I trained this denoiser on the MNIST dataset, optimizing an L2 loss.
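A minimal training-loop sketch, assuming an Adam optimizer and a fixed sigma of 0.5 during training; the exact hyperparameters may differ from what I used.

```python
import torch.nn as nn

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
criterion = nn.MSELoss()  # L2 loss
for x, _ in train_loader:  # MNIST images; labels are unused here
    z = x + 0.5 * torch.randn_like(x)  # noisy input
    loss = criterion(unet(z), x)       # regress back to the clean digit
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```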
These are the denoised results on the test set after training for one epoch.
These are the denoised results on the test set after training for five epochs. As you can see, it performs better than after one epoch.
Here is a visualization of denoising at varying levels of sigma.
Next, I created a time-conditioned UNet that can iteratively denoise an image via DDPM. This is the loss curve from training our new time-conditioned UNet for 20 epochs.
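The training procedure, sketched below under assumed names and hyperparameters (T = 300, a linear beta schedule), samples a random timestep per image, noises it, and trains the UNet to predict that noise.

```python
import torch
import torch.nn as nn

T = 300
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

for x0, _ in train_loader:
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    loss = nn.functional.mse_loss(unet(x_t, t / T), eps)  # t normalized to [0, 1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```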
These are the sampled results from the time-conditioned UNet for 5 and 20 epochs respectively.
I also created a class-conditioned UNet that can iteratively denoise an image via DDPM, with and without class conditioning. This is the loss curve from training our new class-conditioned UNet for 20 epochs.
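Class conditioning can be sketched as a one-hot vector that is randomly dropped to the null (all-zero) condition, so the model also learns unconditional denoising; the drop probability and call signature below are assumptions.

```python
# One-hot class conditioning with random dropout to the null condition.
c = nn.functional.one_hot(labels, num_classes=10).float()
drop = (torch.rand(c.shape[0], 1) < 0.1).float()  # assumed p_uncond = 0.1
c = c * (1 - drop)
eps_hat = unet(x_t, t / T, c)  # assumed signature: (noisy image, normalized t, class vector)
loss = nn.functional.mse_loss(eps_hat, eps)
```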
And these are the sampled results from the class-conditioned UNet for 5 and 20 epochs respectively.
Website Template: w3.css