How Do Diffusion Models Work?
AI has been on a roll in the last decade, with many advancements that have been at the forefront of technology.
Let’s understand what happens under the hood of a basic diffusion model, the kind that went on to become the driving engine behind many generative AI marvels like Stable Diffusion, DALL-E 2, and Midjourney.
To go over the basic terms: generative AI is the ability of algorithms to generate new content inspired by the content they were trained on. The new content can be anything from text to images, music, and so on. Let me show you an AI-created poem:
Upon the stage of silicon, a marvel doth arise, A spectacle that lends belief to our mortal eyes. In lines of code, deep thoughts are sown, A testament to human's genius, in circuits grown.
From a time when algorithms could only perform tasks that were mundane, computation-heavy, and required no creativity, we have come a long way. GANs (Generative Adversarial Networks) paved the way for truly groundbreaking generative algorithms in the last decade, followed by diffusion models. Diffusion models have also become foundational models for research in life sciences and drug discovery. With the potential being this vast, it doesn’t hurt to know at least a little about such marvels, even if you are not particularly technically inclined.
The building block of a basic diffusion model is a neural network (NN). Let’s say you have a lot of images of cats and avocados. You trained your NN to learn the representations of both well. But now you need more pictures of cats, and you want your algorithm to generate them for you.
- The first step in this process is “noising the images.” When the NN sees the first, clean image, it will say with confidence, ‘Okay, this is a cat!’ With the next, slightly noisier image, the NN should say, ‘This is probably a cat, but let me add in a few details so that it looks exactly like one.’ By the last image, the NN will see pure noise and will maybe try to give it an outline in an attempt to make it resemble a cat! (A small code sketch of this noising process follows right below.)
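To make the noising step concrete, here is a minimal sketch in Python (using PyTorch; the schedule values and image sizes below are purely illustrative assumptions, not the exact numbers any real model uses) of how a clean image gets corrupted with Gaussian noise at increasing levels:

```python
import torch

# Illustrative noise schedule -- the exact numbers here are an assumption;
# real implementations tune these hyperparameters carefully.
T = 500                                    # number of noise levels (timesteps)
betas = torch.linspace(1e-4, 0.02, T)      # how much noise gets added at each step
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def noise_image(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Corrupt a clean image x0 to noise level t using the DDPM closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I).
    """
    eps = torch.randn_like(x0)             # Gaussian noise, one sample per pixel
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

# A stand-in "cat image": 3 channels, 64x64 pixels, scaled to [-1, 1]
cat = torch.rand(3, 64, 64) * 2 - 1
slightly_noisy = noise_image(cat, t=50)      # still recognisably a cat
almost_pure_noise = noise_image(cat, t=499)  # nearly indistinguishable from noise
```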
- So the goal of your NN is to take in images at different noise levels and turn them back into amazing pictures of cats. It does this by learning to remove the noise that we added to the images.
- The noise itself is normally distributed, meaning each pixel of the noise fed into the NN is drawn from a Gaussian (normal) distribution. A rough sketch of how the network is trained on such noisy images follows below.
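If you are curious how the network is actually taught to remove this noise, here is a hedged sketch of one training step. The tiny `NoisePredictor` below is a hypothetical stand-in for the U-Net used in real diffusion models, and the schedule is the same illustrative one as in the previous sketch:

```python
import torch
import torch.nn as nn

# Same illustrative schedule as in the previous sketch
T = 500
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Toy stand-in for the real network (usually a U-Net). Its output is predicted noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, noisy_images):
        # Real diffusion models also take the noise level t as an input;
        # it is omitted here to keep the sketch short.
        return self.net(noisy_images)

model = NoisePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Noise a batch of clean images, then ask the NN to predict that exact noise."""
    eps = torch.randn_like(x0)                                   # Gaussian noise, pixel by pixel
    x_t = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    loss = nn.functional.mse_loss(model(x_t), eps)               # how far off was the prediction?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# A batch of 8 fake "cat images" at a medium noise level
loss = training_step(torch.rand(8, 3, 64, 64) * 2 - 1, t=250)
```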
- Now, in the diagram above, something happens between the NN block and the newly created image (in the red area) that we haven’t discussed yet. The direct output of the NN is NOT the final image!
- What the NN actually does is predict noise! And we subtract this predicted noise from our original noisy image to get something closer to a cat. Imagine it this way: you are given a glass of water with a drop of black ink in it. Initially you can see the ink separate from the water, but gradually it diffuses into the water, leaving us with an inky solution. Using the same analogy, our NN predicts the noise (the ink), and we subtract this ink from the inky water in the glass. Now we see a less concentrated solution of water and ink. By doing this iteratively, we are left with just clear water! (This idea is sketched in code below.)
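The ink analogy maps to only a couple of lines of code. This deliberately over-simplified sketch skips the scaling factors a real sampler needs (the proper update rule appears in the DDPM sketch further down) and reuses the hypothetical `model` from the training sketch above:

```python
import torch

def naive_denoise_step(noisy_image: torch.Tensor, model) -> torch.Tensor:
    """Deliberately simplified: predict the 'ink' and take it back out of the water."""
    predicted_noise = model(noisy_image)
    return noisy_image - predicted_noise

# Usage, reusing the hypothetical `model` from the training sketch above:
# x = torch.randn(1, 3, 64, 64)          # start from a glass full of "inky water" (pure noise)
# for _ in range(50):
#     x = naive_denoise_step(x, model)   # each pass takes a little more ink out
```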
- As you can see, with every iteration, the NN learns to better predict the noise that can be subtracted from the input noisy image to generate something that resembles a cat.
- We use the DDPM (Denoising Diffusion Probabilistic Models) sampling algorithm to sample the noise that is fed into the NN.
- Now, one small yet important detail left out in the diagram above is that when we get the first output (the noisy image minus the predicted noise), we cannot feed it directly to the NN in the next iteration. The reason: our noisy input was normally distributed, but the output image is not! So, to be able to feed the output back into the NN in the next iteration, we must add some additional noise to it.
- This crucial step ensures that, by the end of the denoising iterations, the NN doesn’t create an image that is blobby or just a mean of all the images. A sketch of one full DDPM sampling step, with this extra noise included, follows below.
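Putting the last few bullets together, here is a hedged sketch of one DDPM sampling step, including the fresh noise that gets added back in. The exact scaling factors come from the DDPM paper; the schedule and `model` are still the illustrative stand-ins from the earlier sketches:

```python
import torch

# Same illustrative schedule as before
T = 500
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample_step(x_t: torch.Tensor, t: int, model) -> torch.Tensor:
    """One DDPM step: remove the (scaled) predicted noise, then add a bit of fresh noise back."""
    eps_pred = model(x_t)                                             # the NN's noise prediction
    # Subtract the predicted noise, with the scaling from the DDPM paper
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
    if t == 0:
        return mean                                                   # final step: no extra noise
    z = torch.randn_like(x_t)                                         # fresh Gaussian noise
    # Added back so the next input stays (roughly) normally distributed;
    # sqrt(beta_t) is one common choice of noise scale.
    return mean + betas[t].sqrt() * z

# Full sampling loop, reusing the hypothetical `model` from the training sketch:
# x = torch.randn(1, 3, 64, 64)            # start from pure noise
# for t in reversed(range(T)):
#     x = ddpm_sample_step(x, t, model)    # x should gradually turn into a "cat"
```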
- Now, these concepts have barely scratched the surface of what diffusion models are capable of. For example, consider giving your NN the task of creating an image of a cat sleeping inside a half-cut avocado! This imaginative or creative ability can be induced into your model using context.
- To be able to control your model this way, we use vector embeddings that represent the meaning of your context.
- We can thus create an embedding that captures the meaning of “a cat inside an avocado”; it is fed into the NN, which outputs predicted noise and eventually creates a beautiful picture of exactly what you want!
- Thus, context is a vector that controls what your model generates. A rough sketch of how such a context vector could be fed into the network follows below.
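As a rough idea of what “feeding in context” can look like in code: the way the context is injected below is a simplified assumption for illustration; real models like Stable Diffusion use cross-attention over embeddings from a text encoder such as CLIP.

```python
import torch
import torch.nn as nn

class ConditionalNoisePredictor(nn.Module):
    """Hypothetical noise predictor that also looks at a context embedding."""
    def __init__(self, context_dim: int = 64):
        super().__init__()
        self.image_net = nn.Conv2d(3, 32, 3, padding=1)
        self.context_proj = nn.Linear(context_dim, 32)   # map the context vector into image space
        self.out = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, noisy_image, context):
        h = self.image_net(noisy_image)
        # Inject the context by adding its projection at every pixel location
        h = h + self.context_proj(context)[:, :, None, None]
        return self.out(torch.relu(h))                   # still predicts noise, now steered by context

# A made-up 64-dimensional vector standing in for "a cat inside an avocado".
# In a real system this embedding comes from a text encoder (Stable Diffusion uses CLIP).
context = torch.randn(1, 64)

model = ConditionalNoisePredictor()
x = torch.randn(1, 3, 64, 64)                 # start from pure noise, as before
predicted_noise = model(x, context)           # the prediction is now guided by the context
```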
- In the next blog, we will go through this process pythonically and see how it works!