
# Variational Autoencoders

Generative AI isn't just about Large Language Models. At its core, generative AI is about learning what data looks like well enough to produce new samples. Standard autoencoders are excellent for compression, but they fail as generative models because their latent space cannot be sampled reliably. This page explores the **Variational Autoencoder (VAE)**, introduced by [Kingma & Welling (2013)](https://arxiv.org/abs/1312.6114).

## The Core Problem

Traditional autoencoders compress an image into a single, fixed point in a low-dimensional "latent space."

<Warning>
  **The Discontinuity Gap:** Because the latent space is not regularized, it
  tends to be disorganized. A random point sampled between two trained
  clusters usually decodes to "gibberish" because the decoder has never
  learned to interpret those empty regions.
</Warning>

## A Probabilistic Approach

Instead of mapping an input to a single point, a VAE maps it to a **probability distribution** (specifically a Gaussian).

* **The Encoder:** Predicts parameters of the distribution: Mean ($\mu$) and Variance ($\sigma^2$).
* **The Latent Space:** By representing data as overlapping "clouds" rather than points, the space becomes continuous.

***

## The Objective Function: ELBO

To train a VAE, we maximize the **Evidence Lower Bound (ELBO)**. This objective balances reconstruction accuracy with latent space organization.

### Mathematical Derivation of ELBO

The goal is to maximize the likelihood of our data, expressed as the log density $\ln p(x)$. Computing it requires marginalizing over the latent variable $z$, and that integral is intractable for a neural-network decoder, so we derive a lower bound that can be optimized instead.

**Step 1: Marginalization**
$\ln p(x) = \ln \int p(x, z) dz$

**Step 2: The Variational Trick**
We multiply and divide the integrand by the approximate posterior $q(z|x)$ (our Encoder), which turns the integral into an expectation under $q$:
$\ln p(x) = \ln \int q(z|x) \frac{p(x, z)}{q(z|x)} dz = \ln \mathbb{E}_{z \sim q(z|x)} \left[ \frac{p(x, z)}{q(z|x)} \right]$

**Step 3: Jensen's Inequality**
Because the logarithm is concave, Jensen's inequality gives $\ln \mathbb{E}[Y] \ge \mathbb{E}[\ln Y]$, so moving the log inside the expectation yields a lower bound:
$\ln p(x) \ge \mathbb{E}_{z \sim q(z|x)} \left[ \ln \frac{p(x, z)}{q(z|x)} \right]$

**Step 4: Final Decomposition**
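Using the product rule ($p(x, z) = p(x|z)p(z)$), the log ratio splits into $\ln p(x|z) + \ln \frac{p(z)}{q(z|x)}$. The expectation of the second term under $q$ is $-D_{KL}(q(z|x) \parallel p(z))$, which gives the two components used for training: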
$\text{ELBO} = \underbrace{\mathbb{E}_{z \sim q(z|x)}[\ln p(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q(z|x) \parallel p(z))}_{\text{Regularization}}$
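The bound can be checked by hand on a model small enough to solve exactly. The sketch below assumes a hypothetical one-dimensional toy model that is not part of the original derivation: prior $p(z) = \mathcal{N}(0, 1)$ and decoder $p(x|z) = \mathcal{N}(z, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The function names are illustrative only.

```python
import math

def log_evidence(x):
    """Exact ln p(x) for the toy model, where p(x) = N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for q(z|x) = N(m, s2) in the toy model."""
    # Reconstruction term: E_q[ln N(x; z, 1)]
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Regularization term: KL(N(m, s2) || N(0, 1))
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return recon - kl

x = 1.5
exact = log_evidence(x)
loose = elbo(x, m=0.0, s2=1.0)       # q set equal to the prior
tight = elbo(x, m=x / 2.0, s2=0.5)   # q set equal to the true posterior

print(f"ln p(x)         = {exact:.4f}")
print(f"ELBO, q = prior = {loose:.4f}")
print(f"ELBO, q = post. = {tight:.4f}")
```

In this toy setting the ELBO stays below $\ln p(x)$ for a poor choice of $q$ and meets it when $q$ matches the true posterior, which is why improving the Encoder tightens the bound.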

***

### 1. Reconstruction Loss ($L_2$)

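This term is the expected log-likelihood. It measures how well the Decoder can recreate the original data $x$ given a latent sample $z$. If the Decoder's output distribution is a Gaussian with fixed variance, maximizing this term is equivalent to minimizing the squared error between $x$ and the reconstruction $\hat{x}$. It is typically implemented as **Mean Squared Error (MSE)**, which differs from the sum below only by a constant factor:

$\mathcal{L}_{\text{recon}} = \sum_i (x_i - \hat{x}_i)^2$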
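The link between the Gaussian assumption and squared error can be checked numerically. This is a minimal sketch using plain Python lists; the helper names and the unit-variance default are assumptions made for illustration, not something the original text specifies.

```python
import math

def sum_squared_error(x, x_hat):
    """Sum over elements of (x_i - x_hat_i)^2."""
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))

def gaussian_nll(x, x_hat, sigma=1.0):
    """Negative log-likelihood of x under N(x_hat, sigma^2 I)."""
    n = len(x)
    scaled_error = 0.5 * sum_squared_error(x, x_hat) / sigma ** 2
    constant = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return scaled_error + constant

x = [0.0, 0.5, 1.0]
good = [0.1, 0.5, 0.7]   # a close reconstruction
bad = [0.9, 0.0, 0.2]    # a poor reconstruction

# With sigma fixed, the two losses differ only by a scale and an offset,
# so they rank reconstructions identically.
print(sum_squared_error(x, good), sum_squared_error(x, bad))
print(gaussian_nll(x, good), gaussian_nll(x, bad))
```

Because the constant does not depend on $\hat{x}$, minimizing the squared error and maximizing the Gaussian log-likelihood pick out the same reconstruction.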

### 2. KL Divergence

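This term measures the "distance" between the approximate posterior $q(z|x)$ and the prior $p(z)$ (it is not a true distance, since it is asymmetric). We typically assume the prior is a **Standard Normal Distribution**, $p(z) = \mathcal{N}(0, I)$. For a single latent dimension with $q(z|x) = \mathcal{N}(\mu, \sigma^2)$, the closed-form solution is given by the following expression, and for a diagonal Gaussian the total is its sum over latent dimensions:

$D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \parallel \mathcal{N}(0, 1)\right) = \frac{1}{2} \left( \sigma^2 + \mu^2 - 1 - \ln(\sigma^2) \right)$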
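The closed form translates directly into code. The sketch below sums it over the latent dimensions of a diagonal Gaussian. It assumes the Encoder outputs the log-variance $\ln \sigma^2$ rather than $\sigma^2$ itself, a common convention for numerical stability that the text above does not mandate; the function name is illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    total = 0.0
    for m, lv in zip(mu, log_var):
        sigma_sq = math.exp(lv)  # recover sigma^2 from the log-variance
        total += 0.5 * (sigma_sq + m * m - 1.0 - lv)
    return total

# The penalty is zero only when q already equals the prior.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean away from zero is penalized...
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
# ...and so is shrinking the variance toward a single point.
print(kl_to_standard_normal([0.0, 0.0], [math.log(0.01), 0.0]))
```

The last call illustrates why the KL term discourages the Encoder from collapsing each input to a point: the penalty grows without bound as $\sigma^2 \to 0$.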

<Note>
  **The Tug-of-War:** $L_2$ wants to separate data to ensure accuracy
  (scattering), while $D_{KL}$ wants to pull all data toward the center
  (overlapping). This tension creates a smooth, navigable latent space.
</Note>

***

## The Reparameterization Trick

In standard backpropagation, you cannot flow gradients through a random sampling operation ($z \sim \mathcal{N}(\mu, \sigma^2)$). To solve this, we move the randomness to an external variable $\epsilon$.

### Mathematical Deduction

We define the latent vector $z$ as a deterministic function:

$z = \mu + \sigma \odot \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, I)$
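The formula can be sketched with the standard library alone. The helper below is hypothetical and again assumes the Encoder outputs the log-variance; in a real model the same arithmetic would run on tensors inside an autodiff framework.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # sigma = sqrt(exp(log_var))
        eps = rng.gauss(0.0, 1.0)    # all randomness lives in eps
        z.append(m + sigma * eps)
    return z

rng = random.Random(0)
mu = [2.0, -1.0]
log_var = [math.log(0.25), math.log(4.0)]  # sigma = 0.5 and 2.0

samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
n = len(samples)
mean = [sum(s[d] for s in samples) / n for d in range(2)]
var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n for d in range(2)]
print("empirical mean:", mean)
print("empirical variance:", var)
```

The empirical mean and variance should land close to the requested $\mu$ and $\sigma^2$, confirming that the shifted and scaled noise has the same distribution as a direct sample from $\mathcal{N}(\mu, \sigma^2)$.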

By treating $\epsilon$ as a constant during the backward pass, we can calculate gradients for $\mu$ and $\sigma$ directly:

$\frac{\partial z}{\partial \mu} = 1, \quad \frac{\partial z}{\partial \sigma} = \epsilon$
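These two derivatives can be verified with finite differences, holding $\epsilon$ fixed exactly as the backward pass does. This is a scalar sketch with arbitrarily chosen values; the helper names are illustrative.

```python
def sample_z(mu, sigma, eps):
    """Deterministic in mu and sigma once eps has been drawn."""
    return mu + sigma * eps

def central_difference(f, value, h=1e-6):
    """Numerical derivative of f at `value`."""
    return (f(value + h) - f(value - h)) / (2 * h)

mu, sigma, eps = 0.3, 1.7, -0.8  # eps is treated as a constant

dz_dmu = central_difference(lambda m: sample_z(m, sigma, eps), mu)
dz_dsigma = central_difference(lambda s: sample_z(mu, s, eps), sigma)

print("dz/dmu    ~", dz_dmu)      # should be close to 1
print("dz/dsigma ~", dz_dsigma)   # should be close to eps
```

An autodiff framework computes the same quantities analytically, which is what lets the Encoder's parameters receive gradients through the sampled $z$.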

***

## Capabilities & Trade-offs

<CardGroup cols={2}>
  <Card title="Smooth Interpolation" icon="wand-magic-sparkles">
    You can "walk" between two latent vectors to seamlessly blend features
    (e.g., changing a smile to a frown).
  </Card>

  <Card title="Data Generation" icon="grid-2">
    Generate entirely new samples by drawing random vectors from the standard
    normal prior.
  </Card>
</CardGroup>
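Both capabilities come down to choosing latent vectors and handing them to the Decoder. The sketch below covers only the latent-space side and assumes a trained decoder exists elsewhere; the two-dimensional vectors and the function names are made up for illustration.

```python
import random

def interpolate(z_a, z_b, steps):
    """Return `steps` latent vectors evenly spaced on the line from z_a to z_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

def sample_prior(dim, rng):
    """Draw one latent vector from the standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Hypothetical latent codes for two encoded images.
z_smile = [1.0, -2.0]
z_frown = [-1.0, 2.0]

path = interpolate(z_smile, z_frown, steps=5)   # smooth interpolation
z_new = sample_prior(dim=2, rng=random.Random(0))  # data generation

for z in path:
    print(z)
print("new sample:", z_new)
```

Passing each vector in `path` through the Decoder would produce the gradual blend described above, and decoding `z_new` would produce a new sample.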

### Limitations

* **Blurriness:** VAEs tend to produce softer images than GANs. This is because $L_2$ loss encourages the model to "average" its predictions when uncertain.
* **Fine detail:** VAE-style autoencoders are a building block of latent diffusion models like **Stable Diffusion**, but vanilla VAEs struggle to produce sharp, high-resolution detail without modifications such as VQ-VAE.

***

## Resources

* **Original Paper:** [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)
* **Concepts:** ELBO, Reparameterization Trick, Latent Variables.