Diffusion Models: How AI Creates Images From Noise
By Learnia Team
DALL-E, Midjourney, Stable Diffusion—they all create images from text using a technique called diffusion. The concept is beautifully counterintuitive: start with pure noise and gradually reveal an image.
What Are Diffusion Models?
Diffusion models generate images by learning to reverse a noise-adding process. They're trained to remove noise, and by applying this repeatedly to random noise, they create coherent images.
The Core Idea
Training: Learn how to remove noise from images
Generation: Start with noise, remove it step-by-step
It's like a sculptor revealing a statue from a block of marble, except the marble is random static.
The Two Directions
Forward Process (Training)
Take a real image and gradually add noise until it's unrecognizable:
Step 0: 🖼️ Clear photo of a cat
Step 20: 📷 Slightly noisy
Step 40: 📺 Quite noisy
Step 60: 📻 Very noisy
Step 80: ⬜ Mostly noise
Step 100: ▪️▫️▪️ Pure random noise
The model observes this process.
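Here is a minimal sketch of that forward process in PyTorch, assuming a simple linear noise schedule; the variable names are illustrative, not from any particular library:

```python
import torch

T = 1000                                   # total number of noise steps
betas = torch.linspace(1e-4, 0.02, T)      # how much noise each step adds
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative "signal kept" factor

def add_noise(x0, t):
    """Jump straight to noise level t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = torch.randn_like(x0)             # fresh Gaussian noise
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

# Example: push a dummy "image" halfway toward pure static
x0 = torch.randn(1, 3, 64, 64)
x_noisy, true_noise = add_noise(x0, t=500)
```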
Reverse Process (Generation)
Learn to go backwards—predicting and removing noise at each step:
Step 100: ▪️▫️▪️ Pure random noise
Step 80: ⬜ "I think there's something here..."
Step 60: 📻 "Shape emerging..."
Step 40: 📺 "This looks like an animal..."
Step 20: 📷 "It's a cat!"
Step 0: 🖼️ Clear image of a cat
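A conceptual sketch of that reverse loop, assuming `model` is a trained network that predicts the noise present in its input; the update rule is the standard DDPM step, and the schedule matches the forward sketch above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def generate(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                          # step T: pure random noise
    for t in reversed(range(T)):                    # walk back toward step 0
        eps_hat = model(x, t)                       # "what noise is in x right now?"
        a, a_bar, b = alphas[t], alpha_bars[t], betas[t]
        # Subtract the predicted noise and rescale what remains (DDPM update)
        x = (x - b / (1 - a_bar).sqrt() * eps_hat) / a.sqrt()
        if t > 0:
            x = x + b.sqrt() * torch.randn_like(x)  # keep a little randomness
    return x                                        # step 0: a coherent image
```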
Why This Works
Pattern Recognition at Scale
The model is trained on billions of image-text pairs:
"A golden retriever on a beach" + [image]
"Sunset over mountains" + [image]
"Modern office interior" + [image]
... billions more
At each noise level, it learns:
"Given this noise pattern + this text prompt,
what should the slightly-less-noisy version look like?"
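In code, one training step looks roughly like this: pick a random noise level, noise the image, and score how well the model guesses the noise that was added. This is a sketch only; `model`, `optimizer`, `text_embeddings`, and `alpha_bars` are assumed to exist as in the earlier snippets.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, images, text_embeddings, alpha_bars, T=1000):
    t = torch.randint(0, T, (images.shape[0],))               # random noise level per image
    eps = torch.randn_like(images)                            # the noise we will add
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * eps  # forward process
    eps_hat = model(noisy, t, text_embeddings)                # model's guess at the noise
    loss = F.mse_loss(eps_hat, eps)                           # how wrong was the guess?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```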
Guided by Text
Your prompt guides the denoising direction:
Same starting noise + "cat" → reveals a cat
Same starting noise + "dog" → reveals a dog
The text tells the model which patterns to uncover.
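You can see this directly with the Hugging Face diffusers library. A sketch, assuming the library is installed and a CUDA GPU is available; the model ID is just one common choice:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A fixed seed fixes the starting noise, so both runs begin from identical static.
cat = pipe("a cat", generator=torch.Generator("cuda").manual_seed(42)).images[0]
dog = pipe("a dog", generator=torch.Generator("cuda").manual_seed(42)).images[0]
# Same noise, different prompt: the text decides which patterns get revealed.
```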
The Generation Process
Step-by-Step
1. Start with random noise (pure static)
2. Text prompt is encoded into guidance signal
3. Model predicts: "What noise to remove to match this prompt?"
4. Remove predicted noise
5. Repeat 20-50 times
6. Final result: coherent image
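Put together, the whole loop fits in a few lines of pseudocode. Here `text_encoder`, `unet`, `scheduler`, and `vae` are stand-ins for the usual components, not real library objects:

```python
import torch

@torch.no_grad()
def text_to_image(prompt, text_encoder, unet, scheduler, vae, num_steps=30):
    guidance = text_encoder(prompt)                    # 2. encode the prompt
    latents = torch.randn(1, 4, 64, 64)                # 1. start from pure noise
    for t in scheduler.timesteps(num_steps):           # 5. repeat 20-50 times
        eps_hat = unet(latents, t, guidance)           # 3. predict the noise to remove
        latents = scheduler.step(eps_hat, t, latents)  # 4. remove it
    return vae.decode(latents)                         # 6. decode into a coherent image
```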
Why Multiple Steps?
One-step denoising: Too much guesswork, poor quality
Many steps: Gradual refinement, better details
It's like sketching:
Step 1: Rough shapes
Step 2: Basic forms
Step 3: Details
Step 4: Refinement
Step 5: Final touches
Key Concepts
Latent Space
Modern diffusion works in "latent space"—a compressed representation:
Image: 512×512×3 = 786,432 numbers
Latent: 64×64×4 = 16,384 numbers
~50× smaller → much faster processing
This is why it's called "Latent Diffusion" (used by Stable Diffusion).
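The size difference is easy to verify:

```python
pixels = 512 * 512 * 3   # 786,432 numbers per image in pixel space
latents = 64 * 64 * 4    # 16,384 numbers in latent space
print(pixels / latents)  # 48.0 -> roughly 50x fewer numbers to denoise
```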
CFG (Classifier-Free Guidance)
Controls how strictly the model follows your prompt:
CFG = 1: Very loose interpretation
CFG = 7: Balanced (typical default)
CFG = 15: Very strict adherence
CFG = 20+: Over-constrained, artifacts
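Under the hood, CFG is a one-line formula: the model predicts the noise twice, once with the prompt and once without, and the gap between the two is amplified (illustrative names, not a specific library's API):

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale=7.0):
    # Push the prediction away from "no prompt" and toward "this prompt".
    # Higher scale = stricter adherence; too high and artifacts appear.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```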
Steps
How many denoising iterations:
10 steps: Fast but rough
20-30 steps: Good balance
50+ steps: Diminishing returns
Why Certain Things Fail
Hands and Text
Problem: Extra fingers, mangled text
Why:
- Hands appear in varied positions in training
- No consistent "hand structure" learned
- Text requires precise character placement
- Model sees text as shapes, not symbols
Specific Counts
Prompt: "Three apples"
Result: often 2 or 4 apples instead
Why: Model doesn't truly "count"—it associates
"three" with visual patterns, not mathematics.
Unusual Compositions
Prompt: "Astronaut riding a horse underwater"
Result: May struggle or look unrealistic
Why: Training data rarely contains such combinations.
The model interpolates from what it knows.
Popular Diffusion Models (2025)
| Model | Creator | Key Strength |
|-------|---------|--------------|
| DALL-E 3 | OpenAI | Text handling, ChatGPT integration |
| Imagen 3/4 | Google | Speed, typography, quality |
| Midjourney v6 | Midjourney | Artistic quality, aesthetics |
| Stable Diffusion 3 | Stability AI | Open source, customizable |
| FLUX | Black Forest Labs | Quality, community fine-tunes |
Diffusion vs Other Approaches
GANs (Older Approach)
GANs: Two networks compete (generator vs discriminator)
Pros: Fast generation
Cons: Training instability, mode collapse
Diffusion: Single network, gradual denoising
Pros: Stable training, diverse outputs
Cons: Slower generation
Why Diffusion Won
2020: GANs dominated image generation
2022: DALL-E 2 and Stable Diffusion changed everything
2025: Diffusion is the standard
Key advantage: More stable, more controllable, better quality
What's Next?
Faster Generation
Consistency Models: High quality in 1-4 steps
Distillation: Smaller models, same quality
Better Control
ControlNet: Pose, edge, depth guidance
IP-Adapter: Style transfer from images
Inpainting: Edit specific regions
Video and Beyond
Sora (OpenAI): Diffusion for video
Veo (Google): Text-to-video with audio
3D: Emerging diffusion for 3D models
Key Takeaways
- Diffusion models generate by removing noise from random static
- Training: Learn to denoise images at every noise level
- Generation: Start with noise, denoise step-by-step
- Text guides which patterns to reveal
- Struggles with hands, text, and counting: not flaws, just how the technique works
Ready to Master AI Image Generation?
This article covered the what and why of diffusion models. But effective image prompting requires understanding each tool's strengths and techniques.
In our Module 7 — Creative & Multimodal Prompts, you'll learn:
- Prompt structures for each major tool
- Controlling style, composition, and mood
- Working around common limitations
- Brand consistency in AI images
- Video generation with Sora and Veo
Module 7 — Multimodal & Creative Prompting
Generate images and work across text, vision, and audio.