Diffusion models for controllable synthesis and editing
This project explores diffusion models through two complementary approaches. In Part A, I use the pretrained DeepFloyd IF model to understand the forward noising process, iterative denoising, classifier-free guidance, SDEdit-style image editing, inpainting, visual anagrams, and hybrid images. In Part B, I build a UNet from scratch for MNIST digit generation, adding time conditioning for flow matching and class conditioning for controllable sampling.
For Part A, I ran all experiments locally with DeepFloyd checkpoints and kept the random seed fixed at 100 for reproducibility. I used a CFG scale of 7 for all guided sampling. For Part B, I trained on a single GPU using the Adam optimizer with batch size 64 and hidden dimension 64. I used a CFG scale of 5 for class-conditional sampling. The key takeaway is that conditioning signals and CFG dramatically improve sample quality, while learning rate schedules help stabilize training.
I initialized DeepFloyd IF locally and fixed the random seed at 100 for every subsequent sampling run. I then generated three images, shown below, first with 20 inference steps and then with 40.
The bird image shows clear improvement and now accurately depicts the prompt. The "man mid fall" image improves in quality but still misses the intended visual of a man in the air or slipping. The alien-fleet image improves in both visual quality and fidelity to the described scene.
To simulate the forward diffusion trajectory, I sample directly from the analytic distribution q(xt | x0). The function mixes the clean Campanile image with Gaussian noise according to the cumulative product of the noise schedule: for each timestep, I scale the clean pixels by √ᾱt and inject fresh noise scaled by √(1 - ᾱt), i.e. xt = √ᾱt · x0 + √(1 - ᾱt) · ε with ε ~ N(0, I). I reuse the cached alphas_cumprod table so that every forward sample exactly matches the DeepFloyd IF scheduler. This keeps the combined signal-and-noise variance consistent across timesteps and produces the predictable degradation shown below.
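Concretely, one forward sample reduces to two lines. Here is a NumPy sketch (the function name and the rng argument are mine), assuming alphas_cumprod is the scheduler's cached ᾱ table:

```python
import numpy as np

def forward_noise(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # fresh Gaussian noise per call
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
```

Because ᾱt + (1 − ᾱt) = 1, the variance of the mixture stays matched to the data for unit-variance inputs, which is what keeps the degradation predictable across timesteps.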
Here I apply a 2D Gaussian blur with σ = 1.2 and a 5x5 kernel to each noisy Campanile frame using
torchvision.transforms.GaussianBlur. This averages neighboring pixels and suppresses high-frequency
noise without any learning. The classical method is visibly less effective: averaging cannot distinguish noise from detail, so the output trades residual noise for an extra layer of blur on top of the degraded image.
Using the pretrained DeepFloyd UNet, I predict the noise term ε̂θ(xt, t) for each noisy
Campanile frame and perform the single reverse step
x̂_0 = (x_t - √(1 - ᾱ_t) · ε̂) / √ᾱ_t. Even without iteration, this brings back sharp roof edges,
contrast in the tower windows, and the surrounding trees. It looks far cleaner than the Gaussian blur baseline.
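The single reverse step is the exact algebraic inverse of the forward sample whenever ε̂ equals the true noise; a NumPy sketch (names mine):

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, abar_t):
    """Invert the forward process in one shot:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
```

In practice ε̂ comes from the pretrained UNet, so the estimate is only approximate, which is why the iterative version in the next part looks better.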
Starting from a noisy Campanile frame at timestep t = 690, I followed the strided cosine schedule toward 0.
Each iteration calls the Stage 1 UNet (conditioned on the "high quality picture" prompt embedding) to produce
both a noise prediction and a variance term. I reconstruct x̂0 and blend it with the current image using
DDPM weights derived from alphas_cumprod. Then I inject the scheduler variance via add_variance and move to the
next timestep. Every fifth step, I convert the tensor from [-1, 1] to display space with to_display to
visualize the trajectory. The second strip compares the iterative output with the one-step and Gaussian-blur baselines.
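The per-step blend of x̂0 with the current image can be sketched as below. This is a NumPy sketch of the standard DDPM posterior-mean update with my own variable names; the scheduler variance is passed in as a precomputed noise term rather than recomputed here:

```python
import numpy as np

def ddpm_step(x_t, x0_hat, abar_t, abar_prev, noise=0.0):
    """Blend the clean estimate x0_hat with the current image x_t using
    DDPM posterior coefficients, then add the scheduler variance term."""
    alpha = abar_t / abar_prev          # effective alpha for this (strided) step
    beta = 1.0 - alpha
    coef_x0 = np.sqrt(abar_prev) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + noise
```

A useful sanity check: when the next timestep is fully clean (ᾱ_prev = 1), the x_t coefficient vanishes and the update returns x̂0 exactly.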
I sampled directly from Gaussian noise by running the same iterative scheduler (cosine/strided steps) used in 1.4, starting at the highest timestep and conditioning on the prompt "a high quality picture." Each image below comes from a different random seed, showing varied camera angles, lighting, and subject detail while keeping photorealistic structure.
I enabled classifier-free guidance by running two UNet passes per timestep: one conditioned on the prompt "a high quality picture" and one unconditional. Their noise predictions are blended as ε = εu + γ(εc − εu) with guidance scale γ = 7. The blended noise then feeds the usual DDPM update with the scheduler variance (using the conditional branch’s variance). Sampling starts from pure Gaussian noise and follows the same strided/cosine schedule as in 1.4. The five samples below (different seeds) show improved sharpness and prompt adherence while keeping reasonable diversity.
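The guidance blend itself is one line; a NumPy sketch (function name mine):

```python
import numpy as np

def cfg_blend(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction through the conditional one by factor gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

With γ = 0 this recovers the unconditional prediction, γ = 1 the conditional one, and γ > 1 (γ = 7 here) extrapolates past it, which is what sharpens prompt adherence.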
I applied SDEdit-style image-to-image translation by first noising a real image to different starting timesteps (indices [1, 3, 5, 7, 10, 20] in the strided schedule), then denoising with CFG (γ = 7) using conditional and unconditional passes. Higher start indices inject more noise, so lower timesteps preserve structure while higher ones enable stronger edits. All runs share the same prompt embedding for "a high quality picture" and reuse the variance-aware DDPM update from earlier parts.
I applied the same SDEdit loop to a web photo and two hand-drawn inputs using the prompt
"a high quality picture". Each image is resized to 64x64 and normalized in
process_pil_im, then denoised with the six starting indices
[1, 3, 5, 7, 10, 20] (timesteps 960, 900, 840, 780, 690, 390).
I implemented RePaint-style inpainting. At each step, I re-noise the preserved region with the current
timestep's forward pass and splice it with pure noise inside the binary mask (1 = edit, 0 = keep). Then
I denoise with classifier-free guidance (γ = 7) using the same variance-aware DDPM update as before. The
inpaint function keeps everything in fp16 on CUDA and iterates over the strided schedule
to gradually hallucinate new content inside the masked area while freezing the unmasked pixels.
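The splice that freezes the unmasked pixels at each step can be sketched as follows (a NumPy sketch with my own names, using the mask convention from the text: 1 = edit, 0 = keep):

```python
import numpy as np

def inpaint_splice(x_t, x_orig, mask, t, alphas_cumprod, rng):
    """Outside the edit mask (mask == 0), overwrite x_t with a fresh
    forward-noised copy of the original image at the current timestep,
    so the kept region always matches the true noise level."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    x_known = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_known
```

Repeating this before every denoising step is what lets the model hallucinate freely inside the mask while the kept region stays pinned to the original.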
I used the same SDEdit schedule but swapped the embedding to the prompt "a high quality crayon drawing".
CFG steers the denoising toward crayon texture while the image conditioning keeps structure from the original
inputs. Compared to the unconditional runs, the text prompt adds a strong style prior without changing
content. Buildings stay recognizable, the bunny pose persists, and Earth retains its continents.
I implemented flip illusions by averaging classifier-free guidance noise from two prompts applied to upright and 180-degree-flipped versions of the same latent at every DDPM step. For prompt pair (p₁, p₂), I compute ε₁ on the upright image and ε₂ on the flipped image, then flip ε₂ back. I take ε = (ε₁ + ε₂)/2 and run the usual variance-aware DDPM update. The shared unconditional embedding supports CFG (γ = 7). Starting from pure noise lets the image converge to one concept upright and the other when rotated.
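The averaged two-view noise prediction looks like this in a NumPy sketch; eps_fn stands in for the CFG-guided UNet call, and a vertical flip stands in for the 180-degree transform (all names are mine):

```python
import numpy as np

def anagram_eps(eps_fn, x_t, emb1, emb2):
    """Average noise predictions from two views of the same latent:
    prompt 1 sees the upright image; prompt 2 sees the flipped image,
    and its prediction is flipped back before averaging."""
    eps1 = eps_fn(x_t, emb1)                                       # upright view
    eps2 = np.flip(eps_fn(np.flip(x_t, axis=-2), emb2), axis=-2)   # flipped view, un-flipped
    return 0.5 * (eps1 + eps2)
```

Because the flip is its own inverse, both predictions live in the upright frame before averaging, so each denoising step pushes the latent toward both concepts at once.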
I built factorized diffusion hybrids by splitting CFG noise into low and high frequencies for two prompts. At each timestep, I compute ε₁ for prompt 1 and ε₂ for prompt 2, then blur both with a 33x33 Gaussian (σ=2). I keep low(ε₁) for coarse structure and compute high(ε₂) = ε₂ - low(ε₂) for fine detail. Then I use ε = low(ε₁) + high(ε₂) in the variance-aware DDPM update. This keeps large shapes from prompt 1 while injecting textures and edges from prompt 2. I reused CFG scale γ=7 and the same strided schedule.
Up close you see the high-frequency donor (brush bristles, manatee skin), while from farther away the coarse silhouette matches the low-frequency prompt.
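The frequency split can be sketched in NumPy as below; gaussian_blur here is a simple separable filter standing in for torchvision's, and all names are mine:

```python
import numpy as np

def gaussian_blur(img, sigma=2.0, ksize=33):
    """Separable Gaussian low-pass with edge padding (same output size)."""
    r = ksize // 2
    xs = np.arange(-r, r + 1)
    k = np.exp(-xs**2 / (2.0 * sigma**2))
    k /= k.sum()                          # normalized kernel preserves constants
    pad = np.pad(img, r, mode="edge")
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, tmp)

def hybrid_eps(eps1, eps2, sigma=2.0, ksize=33):
    """Coarse structure from prompt 1, fine detail from prompt 2:
    eps = low(eps1) + (eps2 - low(eps2))."""
    return gaussian_blur(eps1, sigma, ksize) + (eps2 - gaussian_blur(eps2, sigma, ksize))
```

The low-pass of ε₁ carries the silhouette you see at a distance, while the residual high-pass of ε₂ carries the texture you see up close.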
Training UNet-based denoisers and flow-matching models on MNIST, with time and class conditioning.
I built a lightweight 28x28 MNIST UNet with two stride-2 conv downsamples to 14x14 then 7x7, AvgPool to 1x1, transpose-conv unflatten back to 7x7, then two upsample stages back to 28x28. Each Conv, Up, and Down block uses 3x3 convolutions with BatchNorm and GELU. Channel widths follow D to 2D on the encoder and mirror back to D before a final 1x1 output head. D is a hyperparameter.
I generated paired (z, x) examples by adding Gaussian noise z = x + σε with ε sampled from N(0, I) to normalized MNIST digits in [0, 1]. I implemented a visualization cell that samples a small batch from the test set, applies σ values of 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0 with a fixed seed, clamps to [0, 1], and renders a grid. This confirms the denoising task setup before training.
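Note that, unlike Part A's variance-preserving forward process, this noising is purely additive. A NumPy sketch (names mine):

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Part B noising: z = x + sigma * eps with eps ~ N(0, I),
    then clamp to [0, 1] for display."""
    z = x + sigma * rng.standard_normal(x.shape)
    return np.clip(z, 0.0, 1.0)
```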
I trained the MNIST denoiser with σ = 0.5 noise for 5 epochs. I used batch size 256, Adam with lr = 1e-4, hidden width D = 128, and on-the-fly noising each batch (z = x + 0.5ε). I loaded data from torchvision MNIST with shuffling on train and trained the UNet from 1.1 end-to-end on GPU. Total wall-clock time was 5 minutes and 21.7 seconds.
I evaluated the σ = 0.5-trained denoiser on held-out test digits across noise scales of 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0, most of which were unseen during training. Using a fixed batch and seed, each digit was noised at a given σ then passed through the trained UNet. The model generalizes cleanly for σ at or below 0.5, retains legibility at 0.6 with mild blur, and degrades at 0.8 to 1.0 where inputs are nearly pure noise. Performance peaks near the training σ and falls off smoothly as the gap widens.
I retrained the same UNet to map pure Gaussian noise ε from N(0, I) directly to clean MNIST digits for 5 epochs with batch size 256 and Adam lr = 1e-4. Inputs are sampled noise and targets are clean digits. The model learns a deterministic "centroid" of the training set under MSE. Early outputs are blurry digit-like blobs, and later epochs sharpen strokes but still average across modes.
I injected normalized scalar time t in [0, 1] with two small FCBlocks and used them to modulate decoder features. An FCBlock is Linear then GELU then Linear. The time t goes through fc1_t to scale the bottleneck after unflatten, and fc2_t to scale the first up block. This preserves the original down/skip path while letting time control the flow field prediction.
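A minimal NumPy sketch of one such FCBlock, with stand-in weights for the learned parameters and the tanh approximation of GELU (the exact form uses erf):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fc_block(t, w1, b1, w2, b2):
    """FCBlock: Linear -> GELU -> Linear, mapping scalar t in [0, 1]
    to a per-channel modulation vector. Weights here are stand-ins."""
    h = gelu(w1 @ np.atleast_1d(t) + b1)
    return w2 @ h + b2
```

The resulting vector is broadcast over spatial dimensions to scale a feature map, e.g. `features * mod[:, None, None]` for a (C, H, W) tensor.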
I trained the time-conditioned UNet to predict the flow x1 - x0 on MNIST. For each batch, I sample t from U[0,1], draw x0 from N(0,I), mix xt = (1-t)x0 + t x1, and minimize MSE between predicted flow uθ(xt, t) and target x1 - x0. I used batch 64, D = 64, Adam lr = 1e-2, and ExponentialLR with gamma = 0.1^(1/num_epochs) for 5 epochs.
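Building one training pair can be sketched as follows (a NumPy sketch, names mine):

```python
import numpy as np

def flow_matching_pair(x1, t, rng):
    """Interpolate x_t = (1 - t) * x0 + t * x1 between noise x0 and a clean
    digit x1; the regression target is the constant flow x1 - x0."""
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0
```

A quick consistency check: following the target flow from xt for the remaining time (1 − t) lands exactly on the data point, i.e. xt + (1 − t)(x1 − x0) = x1.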
I sample MNIST digits by starting from pure Gaussian noise x0 ~ N(0, I) and running the Euler sampler from Algorithm 2 for T uniform steps t = 0 → 1. At each step I query the trained time-conditioned UNet for the flow uθ(xt, t) and update xt+1 = xt + (1/T) · uθ. After T steps the final xT is decoded as an image. I reuse the same sampler while loading checkpoints after different training epochs to see how sample quality evolves.
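The Euler sampler is a short loop; flow_fn below stands in for the trained UNet (names mine):

```python
import numpy as np

def euler_sample(flow_fn, x0, T):
    """Integrate dx/dt = u(x, t) from t = 0 to t = 1 with T uniform steps."""
    x = x0
    for i in range(T):
        t = i / T
        x = x + (1.0 / T) * flow_fn(x, t)
    return x
```

As a sanity check, the ideal flow toward a fixed target x1, u(x, t) = (x1 − x)/(1 − t), drives any starting point exactly onto x1 in T steps.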
I extended the time-conditioned UNet with class conditioning by adding two more FCBlocks (fc1_c and fc2_c)
that take the class label c ∈ {0, …, 9} as input. The label is converted to a 10-dim one-hot
vector and passed through each FCBlock to produce modulation vectors c1 and c2. These combine
with time modulations as: bottleneck = c1 * unflatten + t1 and
up1 = c2 * up1 + t2. This lets class information scale features while time shifts them.
I trained the class-conditioned UNet with the same hyperparameters as the time-only version: batch 64, hidden dim D = 64, Adam lr = 1e-2, and ExponentialLR decay with gamma = 0.1^(1/num_epochs) for 10 epochs. The key addition is the 10% class dropout (p_uncond = 0.1) applied per-sample each batch, which prepares the model for classifier-free guidance at inference.
The training loop is implemented in class_fm_forward, which samples t from U[0,1], mixes noise with clean images, randomly masks the class conditioning, and minimizes MSE on the predicted flow.
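The per-sample class dropout can be sketched as below (a NumPy sketch, names mine):

```python
import numpy as np

def one_hot_with_dropout(labels, num_classes, p_uncond, rng):
    """One-hot encode labels, then zero out a random fraction p_uncond of
    rows so the model also learns an unconditional flow (CFG prep)."""
    c = np.eye(num_classes)[labels]
    keep = rng.random(len(labels)) >= p_uncond   # False -> drop conditioning
    return c * keep[:, None]
```

A zeroed row is the same "null class" input used for the unconditional pass at inference, so training and CFG sampling see consistent conditioning.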
I sampled from the class-conditioned UNet using classifier-free guidance with γ = 5.0. At each Euler step, I run two forward passes: one with class conditioning (mask = 1) and one unconditional (mask = 0). Then I blend flows as u = u_uncond + γ(u_cond − u_uncond). For each digit 0 to 9, I generate 4 samples to produce a 4x10 grid that shows controllability across all classes.
To remove the exponential learning rate scheduler while maintaining performance, I lowered the constant learning rate from 1e-2 to 1e-3. The scheduler originally decayed lr from 1e-2 down to about 1e-3 over 10 epochs. A fixed 1e-3 approximates the average effective rate and avoids early instability from a high initial lr. The resulting samples below are comparable in quality, demonstrating that simplicity can match scheduled training when the constant lr is chosen appropriately.
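The decayed endpoint is easy to verify: ExponentialLR multiplies the lr by gamma after every epoch, and with gamma = 0.1^(1/10) ten such steps shrink it by exactly a factor of 10.

```python
# ExponentialLR multiplies lr by gamma once per epoch.
num_epochs = 10
gamma = 0.1 ** (1 / num_epochs)
lr = 1e-2
for _ in range(num_epochs):
    lr *= gamma
# lr ends at 1e-2 * 0.1 = 1e-3 (up to float rounding)
```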