Section 16 · Lesson 16.2

Kullback–Leibler Divergence

How different are two distributions, in nats?

Kullback–Leibler (KL) divergence measures how different a distribution $P$ is from a reference $Q$:

$$D_{\mathrm{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

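With the natural logarithm the value is in nats (base 2 gives bits). KL is always non-negative, and zero if and only if $P = Q$. It is not symmetric: $D_{\mathrm{KL}}(P \| Q) \ne D_{\mathrm{KL}}(Q \| P)$ in general, and it does not satisfy the triangle inequality, so it is not a true metric.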
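The definition and the asymmetry can be checked numerically. The following is a minimal sketch for discrete distributions given as plain lists. The function name `kl_divergence` and the example distributions are illustrative assumptions, not part of the lesson.

```python
import math

def kl_divergence(p, q):
    """Return D_KL(P || Q) in nats for discrete distributions p and q.

    Hypothetical helper: p and q are equal-length lists of probabilities.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: a term with P(x) = 0 contributes 0
        if qx == 0.0:
            return math.inf  # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total

# Illustrative distributions over two outcomes (assumed for this example).
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # about 0.511 nats
kl_qp = kl_divergence(q, p)  # about 0.368 nats
kl_pp = kl_divergence(p, p)  # exactly 0.0, since P = Q

print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments changes the result, which is the asymmetry described above. The early return of infinity reflects the usual convention that the divergence is infinite when $Q(x) = 0$ but $P(x) > 0$.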

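KL is the workhorse of probabilistic ML. Variational inference picks an approximate posterior $Q$ to minimize $D_{\mathrm{KL}}(Q \| P)$, where $P$ is the true posterior. Cross-entropy loss in classification equals the KL divergence from the predicted label distribution to the empirical one plus the entropy of the empirical labels; that entropy does not depend on the model, so minimizing cross-entropy minimizes the KL. In risk management, KL appears in entropy-based VaR estimates and model-validation tests.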
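The link between cross-entropy and KL is the identity $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q)$, where $H(P)$ does not depend on the model. The sketch below checks it numerically. The helper names and the label and prediction vectors are assumptions chosen for illustration, and the code assumes the prediction is strictly positive wherever the label has mass.

```python
import math

def entropy(p):
    """H(P) in nats; terms with P(x) = 0 contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q > 0 wherever p > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical 3-class example: a soft empirical label and a model prediction.
p_label = [0.1, 0.8, 0.1]
q_pred = [0.2, 0.7, 0.1]

ce = cross_entropy(p_label, q_pred)
decomposed = entropy(p_label) + kl(p_label, q_pred)
print(ce, decomposed)  # the two values agree up to floating-point error

# With a one-hot label the entropy term is 0, so cross-entropy equals KL.
one_hot = [0.0, 1.0, 0.0]
ce_one_hot = cross_entropy(one_hot, q_pred)
kl_one_hot = kl(one_hot, q_pred)
```

For one-hot labels the constant is zero, which is why the cross-entropy loss and the KL divergence coincide in the common classification setup.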

Which property does KL divergence have?
