How does an AI generate an image?

The Conversation | May 20, 2025

Artificial intelligence can now generate highly realistic images from simple text descriptions. This is possible thanks to training on millions of images and a process that gradually transforms noise into an image.

From a victory over the best human Go players to, more recently, weather forecasts of unprecedented accuracy, AI advances continue to surprise. An even more unsettling development is the generation of strikingly realistic images, which blurs the line between real and fake. But how are these images generated automatically?

Image generation models rely on deep learning, that is, very large neural networks that can reach several billion parameters. A neural network can be thought of as a function that maps input data to output predictions. This function is made up of a set of initially random parameters (numerical values) whose values the network learns to set during training.
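
As a rough illustration (a minimal sketch in Python, not the architecture of any real image generator), such a function can be written as a handful of randomly initialised numbers that transform an input into an output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters: initially random numbers. Real image models chain thousands of
# layers like this one, for a total of billions of parameters.
weights = rng.normal(size=(4, 3))
biases = rng.normal(size=(3,))

def network(x):
    """Map an input vector to an output prediction using the current parameters."""
    return np.maximum(0, x @ weights + biases)  # linear step plus a simple non-linearity

prediction = network(rng.normal(size=(4,)))  # meaningless until the parameters are learned
```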

To give an order of magnitude, Stable Diffusion, a model capable of generating realistic images, has around 8 billion parameters, and training it cost $600,000.

These parameters need to be learned. To explain how they are learned, we can look at the simpler case of object detection from images. An image is presented as input to the network, and the network must predict possible object labels (car, person, cat, etc.) as output.

Learning then consists of finding a combination of parameters that predicts, as accurately as possible, the objects present in the images. The quality of learning depends mainly on the quantity of labeled data, the size of the model, and the available computing power.
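
As an illustration, here is a minimal sketch of such a training loop in PyTorch; the image size, label set and model are placeholders chosen for the example, not those of any particular system.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 32x32 colour images and 10 possible object labels.
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(32 * 32 * 3, 256), nn.ReLU(),
                      nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # measures how wrong the predicted labels are

def training_step(images, labels):
    """One learning step: nudge the parameters to reduce the prediction error."""
    optimizer.zero_grad()
    predictions = model(images)           # forward pass with the current parameters
    loss = loss_fn(predictions, labels)   # compare predictions with the true labels
    loss.backward()                       # compute how each parameter should change
    optimizer.step()                      # update the parameters accordingly
    return loss.item()

# Fake batch standing in for real labeled images.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
training_step(images, labels)
```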

In the case of image generation, we want to do, in a sense, the reverse: from a text describing a scene, the model is expected to produce an image matching that description, which is considerably more complex than predicting a label.

Destroy to create

First, let's forget about the text and focus on the image alone. While generating an image is a complex process, even for a human, destroying one (the inverse problem) is trivial. In concrete terms, starting from an image made of pixels, randomly changing the color of some of those pixels is a simple way to alter it.
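
A sketch of that alteration step, assuming only NumPy and an image stored as an array of pixel values between 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, amount):
    """Destroy part of an image by blending it with random pixel values.

    amount = 0.0 leaves the image untouched; amount = 1.0 gives pure noise.
    """
    noise = rng.random(image.shape)               # random pixel values
    return (1 - amount) * image + amount * noise  # mix image and noise

image = rng.random((64, 64, 3))          # stand-in for a real photo
slightly_noisy = add_noise(image, 0.1)   # the network's input ...
target = image                           # ... and the original it must predict
```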

We can present a neural network with a slightly altered image as input and ask it to predict the original image as output. We can then train the model to denoise images, which is a first step towards image generation. Thus, if we start with a highly noisy image and call the model repeatedly, we obtain a less and less noisy image with each call, until we end up with a completely denoised image.
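
A hedged sketch of that training objective, with a deliberately tiny network standing in for the much larger ones used in practice:

```python
import torch
import torch.nn as nn

# Tiny placeholder denoiser: takes a noisy image, predicts the clean one.
denoiser = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def denoising_step(clean_images):
    """Learn to recover the original images from slightly noisy copies."""
    noisy_images = clean_images + 0.1 * torch.randn_like(clean_images)  # destroy a little
    predicted = denoiser(noisy_images)                                  # try to rebuild
    loss = ((predicted - clean_images) ** 2).mean()                     # pixel-wise error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

batch = torch.rand(4, 3, 64, 64)  # stand-in for real photos from the training set
denoising_step(batch)
```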

If we push this process to the extreme, we could start with an image composed entirely of noise (a snow of random pixels), in other words an image of nothing, and repeat the calls to our "denoiser" model until we end up with an image like the one shown below:

[Generated image: "fried egg flowers in the bacon garden", produced by Stable Diffusion]
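
A minimal sketch of that repeated call; the `denoiser` below is an untrained placeholder so the snippet runs, whereas a real system would use the trained network and a more careful noise schedule:

```python
import torch
import torch.nn as nn

# Placeholder standing in for a trained denoising network.
denoiser = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1))

image = torch.randn(1, 3, 64, 64)  # start from pure noise: an image of nothing
with torch.no_grad():
    for step in range(50):                           # call the denoiser again and again
        image = 0.9 * denoiser(image) + 0.1 * image  # remove most of the noise, keep a little
# With a trained model, `image` would now look like a plausible photograph.
```
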
We then have a process capable of generating images, but one of limited interest: depending on the initial random noise, it can end up producing anything after several iterations. We therefore need a way to guide the denoising process, and text is used for this task.

From noise to image

The denoising process needs images; these come from the internet and make up the training dataset. The text needed to guide the denoising is simply taken from the captions of those same images found on the web. In parallel with learning to denoise images, a network representing the text is trained alongside. Thus, when the model learns to denoise an image, it also learns which words that denoising is associated with. Once training is complete, we obtain a model that, starting from a descriptive text and pure noise, eliminates the noise over successive iterations and converges towards an image matching the textual description.
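
Conceptually, generation then becomes a denoising loop that consults the text at every step. The sketch below is purely illustrative: the `text_encoder` and `denoiser` are toy placeholders, and real models condition on the text in far more sophisticated ways (for example through attention layers).

```python
import torch
import torch.nn as nn

# Toy placeholders: a text encoder and a denoiser that also sees a text signal.
text_encoder = nn.Embedding(10_000, 64)       # turns word ids into vectors
denoiser = nn.Conv2d(3 + 1, 3, 3, padding=1)  # 3 image channels + 1 crude text channel

def generate(word_ids, steps=50):
    """Start from pure noise and denoise it, guided by the caption at every step."""
    text = text_encoder(word_ids).mean(dim=0)           # one vector summarising the caption
    guidance = text.mean() * torch.ones(1, 1, 64, 64)   # crude "text channel" for the sketch
    image = torch.randn(1, 3, 64, 64)                   # total noise
    with torch.no_grad():
        for _ in range(steps):
            image = denoiser(torch.cat([image, guidance], dim=1))
    return image

image = generate(torch.tensor([12, 7, 3]))  # stand-in for a tokenized caption
```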

This process removes the need for specific manual labeling: it draws on the millions of images and captions already available on the web. Finally, since a picture is worth a thousand words, the image above was generated by the Stable Diffusion model from the text "fried egg flowers in the bacon garden".
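
In practice, such pretrained models can be called in a few lines. For instance, with Hugging Face's diffusers library (the checkpoint name below is one publicly released Stable Diffusion model, given only as an example; any comparable checkpoint would do), something along these lines:

```python
from diffusers import StableDiffusionPipeline  # pip install diffusers transformers torch

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")  # use a GPU if one is available

# The same kind of prompt as the article's example image.
image = pipe("fried egg flowers in the bacon garden").images[0]
image.save("generated.png")
```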

Christophe Rodrigues, Lecturer-researcher in computer science, Pôle Léonard de Vinci

This article is republished from The Conversation under a Creative Commons license. Read the original article.
