Teaching VAEs at AMLD 2020: a workshop, in retrospect

We took a hands-on workshop on autoencoders and VAEs to Applied ML Days 2020 in Lausanne. Five years on, here's what held up — and what diffusion has since rewritten.

In January 2020, two of us took a workshop to Applied Machine Learning Days at EPFL in Lausanne. The title was Generative Modeling for Computer Vision. The thesis was simple: if you really want to understand how generative models work, you have to build the smallest one that does something interesting and stare at its latent space until it stops being magic.

The four notebooks we shipped — autoencoder vs. VAE on MNIST, a convolutional VAE on CelebA, and two experimental VQ-VAE notebooks — are still in the public repo. It’s late 2024 now, the generative-modeling landscape has been rewritten twice since we ran the room, and we’ve been meaning to write down what we learned.

The pedagogical bet

Most generative-modeling tutorials open with the ELBO and a wall of KL-divergence algebra. We didn’t want to do that. The participants we were going to get were practitioners — people who had trained classifiers and segmentation models and wanted to know what changes when the output is a distribution instead of a label.

So we inverted the order. Build a vanilla autoencoder first. Look at its 2D latent space. Notice it has holes — regions where decoding produces garbage because the model never had to map them to anything. Then add the two things VAEs add — a stochastic latent and a KL term that pulls the encoder’s outputs toward a unit Gaussian — and look at the latent space again. The holes are gone. You can now sample.

The math comes after the picture. Once you’ve watched the latent fill in, the reparameterization trick stops feeling like a hack — it’s the obvious way to backprop through “draw a sample.”
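
For readers who want the two additions in code, here is a minimal sketch. It is not the notebook's implementation: the function names, the MSE reconstruction term, and the `beta` weight are assumptions made for illustration. It shows the reparameterized sample, the closed-form KL to a unit Gaussian, and a check that gradients reach both `mu` and `logvar` through the sample.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, sigma^2) as mu + eps * sigma, with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)  # standard-normal noise; carries no gradient
    return mu + eps * torch.exp(0.5 * logvar)

def kl_to_unit_gaussian(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl_per_sample = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl_per_sample.mean()

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    """Reconstruction error plus a beta-weighted KL term (beta is an illustrative knob)."""
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.shape[0]
    return recon + beta * kl_to_unit_gaussian(mu, logvar)

# Gradients flow through the sample because the randomness lives only in eps.
torch.manual_seed(0)
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad)      # d z / d mu is 1 everywhere
print(logvar.grad)  # d z / d logvar is 0.5 * eps * sigma, nonzero in general
```

Raising `beta` pulls the encoder outputs harder toward the unit Gaussian; lowering it lets reconstruction dominate.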

What’s in the notebooks

ae_vae_mnist.ipynb — the side-by-side. Both models share an MLP encoder/decoder; the VAE adds two linear heads (mu, logvar) and a reparameterization step:

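import torch
import torch.nn as nn

class VariationalAutoEncoder(nn.Module):
    def __init__(self, inp_shape, hidden_dim, out_dim, z_dim):
        super().__init__()
        # Shared MLP encoder/decoder, same as in the plain autoencoder
        self.enc = Encoder(inp_shape, hidden_dim, out_dim)
        self.dec = Decoder(inp_shape, hidden_dim, out_dim)
        # Two linear heads parameterize the approximate posterior
        self.mu     = nn.Linear(out_dim, z_dim)
        self.logvar = nn.Linear(out_dim, z_dim)
        # Maps z back to out_dim (presumably the decoder's input width)
        self.fc     = nn.Linear(z_dim, out_dim)

    def sample_z(self, mu, logvar):
        # As shipped: rand_like draws uniform noise on [0, 1);
        # the reparameterization trick calls for torch.randn_like (standard normal).
        eps = torch.rand_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)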

Train both with z_dim=2 and you can plot the entire latent on a single chart, color-coded by digit. The AE clusters the ten classes into ten islands separated by empty water — sample from the gaps and you get noise. The VAE pulls the clusters together into a continuous blob; the gaps fill in with plausible morphs (a 4 that’s slowly becoming a 9). That single picture does more pedagogical work than a half-hour of derivations.

cnn_vae_celeba.ipynb — the same idea, scaled up. Convolutional encoder/decoder on cropped, face-aligned CelebA at 64×64. Three things were worth showing:

Latent-space arithmetic. CelebA ships with 40 binary attributes per face — Smiling, Eyeglasses, Male, Young, etc. Encode all the smiling faces and all the non-smiling faces, average the latents, and the difference is a “smile direction” in latent space. Add it to a neutral face’s latent, decode, and the face smiles. Same trick puts sunglasses on anyone. This is the moment in the workshop where the room visibly perks up.
Perceptual loss. Pixel-MSE reconstructions of faces are blurry: when several sharp outputs are plausible, MSE is minimized by their average, which smears edges. We added a VGG16-based perceptual loss using forward hooks on layers 5 and 15:
```
feat_indices = [5, 15]
self.hooks = [Hook(self.vgg16_head[i]) for i in feat_indices]
```
The total loss becomes negative ELBO + λ·perceptual. Reconstructions sharpen up dramatically in the first few epochs.
The KL/reconstruction tradeoff. Crank the KL weight up and your latent is smooth and samplable but reconstructions go mushy. Crank it down and you’ve trained a slightly-noisy autoencoder. Tuning that one knob is half the practical battle with VAEs, and seeing the failure modes live is more useful than reading about β-VAEs.
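
The hook-based perceptual loss above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the `Hook` and `PerceptualLoss` classes are written from scratch here, and a tiny two-layer conv stack with indices `[1, 3]` stands in for the pretrained VGG16 head and its layers 5 and 15, so the example runs without downloading weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hook:
    """Stores the output of one module every time it runs (illustrative helper)."""
    def __init__(self, module):
        self.features = None
        self.handle = module.register_forward_hook(self._store)

    def _store(self, module, inputs, output):
        self.features = output

    def remove(self):
        self.handle.remove()

class PerceptualLoss(nn.Module):
    def __init__(self, feature_net, feat_indices):
        super().__init__()
        self.net = feature_net.eval()
        for p in self.net.parameters():
            p.requires_grad_(False)  # the feature network stays frozen
        self.hooks = [Hook(self.net[i]) for i in feat_indices]

    def _features(self, x):
        self.net(x)  # one forward pass fills every hook
        return [h.features for h in self.hooks]

    def forward(self, x_hat, x):
        target = [f.detach() for f in self._features(x)]
        pred = self._features(x_hat)
        # Sum of feature-space MSEs over the hooked layers
        return sum(F.mse_loss(p, t) for p, t in zip(pred, target))

# Stand-in feature network; in the workshop this role is played by a VGG16 head.
torch.manual_seed(0)
stand_in = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
perceptual = PerceptualLoss(stand_in, feat_indices=[1, 3])
x = torch.rand(2, 3, 64, 64)
x_hat = torch.rand(2, 3, 64, 64, requires_grad=True)
loss = perceptual(x_hat, x)
loss.backward()  # gradients reach the reconstruction, not the frozen network
```

With a real pretrained network, inputs would also need the normalization that network was trained with.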

The two VQ-VAE notebooks were marked experimental for a reason. The pitch is gorgeous: replace the continuous latent with a learned discrete codebook of K embeddings, and let the encoder snap to the nearest one. On MNIST with K=10, the obvious hope is that each codebook entry learns one digit — perfect disentanglement.

It mostly didn’t. Some digits cleanly captured an entry; others split across two; one entry ended up unused. We left it in the workshop anyway because watching a clean idea fail in an interesting way is its own lesson, and because the VQ-VAE line of work was already pointing at the future. (VQ-VAE-2 had come out in mid-2019, a few months before AMLD; the VQ-plus-learned-prior recipe now underlies many tokenized image and multimodal models.)
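
The "snap to the nearest codebook entry" step can be sketched as below. It is a minimal illustration, not the experimental notebooks' code: the function name, the random stand-in codebook, and the sizes are assumptions, and the codebook and commitment losses that a full VQ-VAE trains with are left out. The usage count at the end is the kind of check that reveals unused entries.

```python
import torch

def quantize(z_e, codebook):
    """Snap each encoder output to its nearest codebook entry.

    z_e:      (N, D) encoder outputs
    codebook: (K, D) embeddings
    """
    distances = torch.cdist(z_e, codebook)  # (N, K) Euclidean distances
    indices = distances.argmin(dim=1)       # nearest entry for each input
    z_q = codebook[indices]                 # (N, D) quantized latents
    # Straight-through estimator: forward pass uses z_q,
    # backward pass copies the gradient straight to z_e.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices

torch.manual_seed(0)
K, D = 10, 16
codebook = torch.randn(K, D)                  # random stand-in for a learned codebook
z_e = torch.randn(256, D, requires_grad=True)  # stand-in encoder outputs
z_q, indices = quantize(z_e, codebook)
usage = torch.bincount(indices, minlength=K)  # how often each entry was chosen
print(usage)
```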

What surprised us, running the room

A few things from the workshop floor:

People believed the visualizations and disbelieved the loss curves. A 2D scatter of the MNIST latent did more for intuition than any plot of training loss. We’ve stolen this for client work since: when explaining a model to a non-ML stakeholder, find the one chart where the model’s behavior is legible.
Latent-space arithmetic on CelebA is a magic trick that survives close inspection. Every group we walked through it had the same reaction — disbelief, then they tried their own attribute combinations. Old + Eyeglasses − Young is genuinely fun. It’s also a clean way to teach what “linear structure in latent space” actually means.
Reparameterization is the part that finally clicks. We’ve taught autoencoders enough times to know the moment: someone goes “wait, you sample, but you backprop through the sample, by writing it as μ + ε·σ?” — and then the rest of the VAE machinery falls into place. The trick is small. The conceptual unlock is enormous.
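
The attribute arithmetic that groups played with boils down to a difference of means. The sketch below uses synthetic latents rather than real CelebA encodings, with an attribute that shifts one latent dimension by construction; the function name and all sizes are illustrative assumptions.

```python
import torch

def attribute_direction(latents, has_attribute):
    """Mean latent of examples with the attribute minus mean latent of those without it."""
    return latents[has_attribute].mean(dim=0) - latents[~has_attribute].mean(dim=0)

# Synthetic stand-in: 1000 latents of dimension 32, where having the
# attribute shifts dimension 0 by +2 and leaves the rest untouched.
torch.manual_seed(0)
latents = torch.randn(1000, 32)
has_attribute = torch.rand(1000) < 0.5  # boolean mask, e.g. a "Smiling" label
latents[has_attribute, 0] += 2.0

smile = attribute_direction(latents, has_attribute)
neutral_z = latents[~has_attribute][0]
edited_z = neutral_z + 1.0 * smile      # scale the step to strengthen or weaken the edit
```

In a real pipeline `edited_z` would then go through the decoder; combinations such as Old + Eyeglasses − Young are just sums and differences of directions computed this way.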

Reading the workshop in late 2024

A lot has changed.

Diffusion has eaten image generation. When we ran the workshop in January 2020, DDPM was still six months from publication. By late 2024 — Stable Diffusion 3, Flux, DALL·E 3, Imagen 3 — the production answer for “generate a realistic image” is almost never a vanilla VAE anymore. If you want sharp samples and broad coverage of a complex distribution, you train a diffusion model. The pixel-blurriness that perceptual loss patched over in our CelebA notebook is a problem diffusion just doesn’t have.

But VAEs didn’t die. They got demoted, and then re-promoted. The dominant image-generation architecture today is latent diffusion: a VAE compresses a 512×512 RGB image into a much smaller latent grid (64×64×4 in Stable Diffusion 1.x, about 48 times fewer values), and a diffusion model runs in that latent space. The VAE encoder/decoder we taught in 2020 is, structurally, the same kind of component that ships inside Stable Diffusion. A workshop participant who internalized “the VAE gives you a smooth, samplable latent that a downstream model can operate in” was being prepared for exactly the right thing; they just didn’t know what the downstream model would turn out to be.

VQ-VAE won, eventually. The discrete-codebook idea that didn’t quite click on our MNIST experiment is the heart of every modern tokenized-image model — VQ-GAN, MaskGIT, Parti, the image branches of multimodal LLMs. Our workshop participants who pushed on the experimental notebooks were a few years early, not wrong.

What we’d add to the workshop today. A fifth notebook: a tiny DDPM on MNIST. It fits in a Colab. The U-Net is small. Watching denoising emerge — pure noise gradually resolving into a digit over 50 timesteps — does the same intuition-building work that the AE-vs-VAE side-by-side did in 2020. We’d keep everything else and let the diffusion notebook be the bridge from the workshop’s vintage to today’s.
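
As a rough idea of how small the core of such a notebook could be, here is a sketch of the forward noising process and the noise-prediction loss. Everything here is an assumption for illustration: the linear schedule endpoints, the function names, and the placeholder model that stands in for the small U-Net. Only the 50 timesteps come from the text.

```python
import torch
import torch.nn.functional as F

T = 50                                          # number of timesteps
betas = torch.linspace(1e-4, 0.2, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # fraction of signal kept up to step t

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_loss(model, x0):
    """The model sees a noised image and its timestep and tries to predict the noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)

torch.manual_seed(0)
x0 = torch.rand(8, 1, 28, 28) * 2 - 1               # stand-in for digits scaled to [-1, 1]
dummy_model = lambda x_t, t: torch.zeros_like(x_t)  # placeholder for the U-Net
loss = training_loss(dummy_model, x0)
```

Sampling runs the learned denoiser backwards from pure noise, one timestep at a time, which is the part worth animating in a notebook.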

The notebooks still run

Almost. The requirements.txt pins torch==1.3.1, tensorflow==2.1.0, and Pillow==6.2.2 (typical of a 2020 Colab snapshot), and those pins won’t install on a modern Python. The notebooks themselves are mostly forward-compatible: replace torch.rand_like(mu) with the standard-normal torch.randn_like(mu) (a small bug we’d fix on a v2), unpin everything, and they run on current PyTorch with minor adjustments.
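
A quick check makes the difference between the two noise sources concrete. This is a standalone illustration, not part of the notebooks; the tensor size is arbitrary.

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(100_000)

eps_uniform = torch.rand_like(mu)  # uniform on [0, 1): mean near 0.5, std near 0.29
eps_normal = torch.randn_like(mu)  # standard normal: mean near 0, std near 1

print(eps_uniform.mean().item(), eps_uniform.std().item())
print(eps_normal.mean().item(), eps_normal.std().item())
```

With uniform noise every sample lands on or above `mu` and is under-dispersed relative to the Gaussian that the KL term assumes, so the latent the decoder sees does not match the prior you later sample from.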

The repo is public: MaxinAI/amld2020-workshop. The slides are in there too.

If you’re teaching generative modeling to engineers in 2025, the bones still work. Build the smallest model that does the interesting thing. Plot the latent. Let the math arrive after the picture.