Diffusion Models: How AI Creates Images From Noise
By Learnia Team
DALL-E, Midjourney, Stable Diffusion—they all create images from text using a technique called diffusion. The concept is beautifully counterintuitive: start with pure noise and gradually reveal an image.
What Are Diffusion Models?
Diffusion models generate images by learning to reverse a noise-adding process. They're trained to remove noise, and by applying this repeatedly to random noise, they create coherent images.
The Core Idea
Training: Learn how to remove noise from images
Generation: Start with noise, remove it step-by-step
It's like a sculptor revealing a statue from a block of marble, except the marble is random static.
The Two Directions
Forward Process (Training)
Take a real image and gradually add noise until it's unrecognizable:
Step 0: 🖼️ Clear photo of a cat
Step 20: 📷 Slightly noisy
Step 40: 📺 Quite noisy
Step 60: 📻 Very noisy
Step 80: ⬜ Mostly noise
Step 100: ▪️▫️▪️ Pure random noise
The model observes this process.
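Here is a minimal sketch of that forward process in PyTorch, assuming a simple linear noise schedule; the variable names are illustrative, not from any particular library:

```python
import torch

T = 1000                                   # total number of noise steps
betas = torch.linspace(1e-4, 0.02, T)      # how much noise each step adds
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative "signal kept" factor

def add_noise(x0, t):
    """Jump straight to noise level t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = torch.randn_like(x0)             # fresh Gaussian noise
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

# Example: push a dummy "image" halfway toward pure static
x0 = torch.randn(1, 3, 64, 64)
x_noisy, true_noise = add_noise(x0, t=500)
```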
Reverse Process (Generation)
Learn to go backwards—predicting and removing noise at each step:
Step 100: ▪️▫️▪️ Pure random noise
Step 80: ⬜ "I think there's something here..."
Step 60: 📻 "Shape emerging..."
Step 40: 📺 "This looks like an animal..."
Step 20: 📷 "It's a cat!"
Step 0: 🖼️ Clear image of a cat
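A conceptual sketch of that reverse loop, assuming `model` is a trained network that predicts the noise present in its input; the update rule is the standard DDPM step, and the schedule matches the forward sketch above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def generate(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                          # step T: pure random noise
    for t in reversed(range(T)):                    # walk back toward step 0
        eps_hat = model(x, t)                       # "what noise is in x right now?"
        a, a_bar, b = alphas[t], alpha_bars[t], betas[t]
        # Subtract the predicted noise and rescale what remains (DDPM update)
        x = (x - b / (1 - a_bar).sqrt() * eps_hat) / a.sqrt()
        if t > 0:
            x = x + b.sqrt() * torch.randn_like(x)  # keep a little randomness
    return x                                        # step 0: a coherent image
```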
Why This Works
Pattern Recognition at Scale
The model is trained on billions of image-text pairs:
"A golden retriever on a beach" + [image]
"Sunset over mountains" + [image]
"Modern office interior" + [image]
... billions more
At each noise level, it learns:
"Given this noise pattern + this text prompt,
what should the slightly-less-noisy version look like?"
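In code, one training step looks roughly like this: pick a random noise level, noise the image, and score how well the model guesses the noise that was added. This is a sketch only; `model`, `optimizer`, `text_embeddings`, and `alpha_bars` are assumed to exist as in the earlier snippets.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, images, text_embeddings, alpha_bars, T=1000):
    t = torch.randint(0, T, (images.shape[0],))               # random noise level per image
    eps = torch.randn_like(images)                            # the noise we will add
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * eps  # forward process
    eps_hat = model(noisy, t, text_embeddings)                # model's guess at the noise
    loss = F.mse_loss(eps_hat, eps)                           # how wrong was the guess?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```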
Guided by Text
Your prompt guides the denoising direction:
Same starting noise + "cat" → reveals a cat
Same starting noise + "dog" → reveals a dog
The text tells the model which patterns to uncover.
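You can see this directly with the Hugging Face diffusers library. A sketch, assuming the library is installed and a CUDA GPU is available; the model ID is just one common choice:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A fixed seed fixes the starting noise, so both runs begin from identical static.
cat = pipe("a cat", generator=torch.Generator("cuda").manual_seed(42)).images[0]
dog = pipe("a dog", generator=torch.Generator("cuda").manual_seed(42)).images[0]
# Same noise, different prompt: the text decides which patterns get revealed.
```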
The Generation Process
Step-by-Step
1. Start with random noise (pure static)
2. Text prompt is encoded into guidance signal
3. Model predicts: "What noise to remove to match this prompt?"
4. Remove predicted noise
5. Repeat 20-50 times
6. Final result: coherent image
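Put together, the whole loop fits in a few lines of pseudocode. Here `text_encoder`, `unet`, `scheduler`, and `vae` are stand-ins for the usual components, not real library objects:

```python
import torch

@torch.no_grad()
def text_to_image(prompt, text_encoder, unet, scheduler, vae, num_steps=30):
    guidance = text_encoder(prompt)                    # 2. encode the prompt
    latents = torch.randn(1, 4, 64, 64)                # 1. start from pure noise
    for t in scheduler.timesteps(num_steps):           # 5. repeat 20-50 times
        eps_hat = unet(latents, t, guidance)           # 3. predict the noise to remove
        latents = scheduler.step(eps_hat, t, latents)  # 4. remove it
    return vae.decode(latents)                         # 6. decode into a coherent image
```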
Why Multiple Steps?
One-step denoising: Too much guesswork, poor quality
Many steps: Gradual refinement, better details
It's like sketching:
Step 1: Rough shapes
Step 2: Basic forms
Step 3: Details
Step 4: Refinement
Step 5: Final touches
Key Concepts
Latent Space
Modern diffusion works in "latent space"—a compressed representation:
Image: 512×512×3 = 786,432 numbers
Latent: 64×64×4 = 16,384 numbers
~50× smaller → much faster processing
This is why it's called "Latent Diffusion" (used by Stable Diffusion).
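The size difference is easy to verify:

```python
pixels = 512 * 512 * 3   # 786,432 numbers per image in pixel space
latents = 64 * 64 * 4    # 16,384 numbers in latent space
print(pixels / latents)  # 48.0 -> roughly 50x fewer numbers to denoise
```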
CFG (Classifier-Free Guidance)
Controls how strictly the model follows your prompt:
CFG = 1: Very loose interpretation
CFG = 7: Balanced (typical default)
CFG = 15: Very strict adherence
CFG = 20+: Over-constrained, artifacts
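Under the hood, CFG is a one-line formula: the model predicts the noise twice, once with the prompt and once without, and the gap between the two is amplified (illustrative names, not a specific library's API):

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale=7.0):
    # Push the prediction away from "no prompt" and toward "this prompt".
    # Higher scale = stricter adherence; too high and artifacts appear.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```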
Steps
How many denoising iterations:
10 steps: Fast but rough
20-30 steps: Good balance
50+ steps: Diminishing returns
Why Certain Things Fail
Hands and Text
Problem: Extra fingers, mangled text
Why:
- Hands appear in varied positions in training
- No consistent "hand structure" learned
- Text requires precise character placement
- Model sees text as shapes, not symbols
Specific Counts
Prompt: "Three apples"
Result: often 2 or 4 apples instead
Why: Model doesn't truly "count"—it associates
"three" with visual patterns, not mathematics.
Unusual Compositions
Prompt: "Astronaut riding a horse underwater"
Result: May struggle or look unrealistic
Why: Training data rarely contains such combinations.
The model interpolates from what it knows.
Popular Diffusion Models (2025)
| Model | Creator | Key Strength |
|-------|---------|--------------|
| DALL-E 3 | OpenAI | Text handling, ChatGPT integration |
| Imagen 3/4 | Google | Speed, typography, quality |
| Midjourney v6 | Midjourney | Artistic quality, aesthetics |
| Stable Diffusion 3 | Stability AI | Open source, customizable |
| FLUX | Black Forest Labs | Quality, community fine-tunes |
Diffusion vs Other Approaches
GANs (Older Approach)
GANs: Two networks compete (generator vs discriminator)
Pros: Fast generation
Cons: Training instability, mode collapse
Diffusion: Single network, gradual denoising
Pros: Stable training, diverse outputs
Cons: Slower generation
Why Diffusion Won
2020: GANs dominated image generation
2022: DALL-E 2 and Stable Diffusion changed everything
2025: Diffusion is the standard
Key advantage: More stable, more controllable, better quality
What's Next?
Faster Generation
Consistency Models: High quality in 1-4 steps
Distillation: Smaller models, same quality
Better Control
ControlNet: Pose, edge, depth guidance
IP-Adapter: Style transfer from images
Inpainting: Edit specific regions
Video and Beyond
Sora (OpenAI): Diffusion for video
Veo (Google): Text-to-video with audio
3D: Emerging diffusion for 3D models
Key Takeaways
- Diffusion models generate by removing noise from random static
- Training: Learn to denoise images at every noise level
- Generation: Start with noise, denoise step-by-step
- Text guides which patterns to reveal
- Struggles with hands, text, and counting: not flaws, just how the technique works
Ready to Master AI Image Generation?
This article covered the what and why of diffusion models. But effective image prompting requires understanding each tool's strengths and techniques.
In our Module 7 — Creative & Multimodal Prompts, you'll learn:
- Prompt structures for each major tool
- Controlling style, composition, and mood
- Working around common limitations
- Brand consistency in AI images
- Video generation with Sora and Veo
Module 7 — Multimodal & Creative Prompting
Generate images and work across text, vision, and audio.