Divergence measures between two probability distributions
Let's suppose you have a discrete data generating process $p_{data}(x)$ and a model that outputs a probability mass function $q(x)$. In many machine learning tasks, the goal is to tune the parameters of the model so that $q(x)$ becomes as similar to $p_{data}(x)$ as possible. A very useful measure is the Kullback-Leibler divergence:

$$D_{KL}(p_{data} \| q) = \sum_x p_{data}(x) \log \frac{p_{data}(x)}{q(x)}$$
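This quantity (also known as information gain) expresses the amount of information lost when the approximation $q(x)$ is used instead of the original data generating process. It's immediate to see that $q(x) = p_{data}(x) \Rightarrow D_{KL}(p_{data} \| q) = 0$, while the divergence is greater than 0 (and unbounded above) when there's a mismatch. Manipulating the previous expression, it's possible to gain a deeper understanding:

$$D_{KL}(p_{data} \| q) = \sum_x p_{data}(x) \log p_{data}(x) - \sum_x p_{data}(x) \log q(x) = -H(p_{data}) + H(p_{data}, q)$$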
The first term is the negative entropy of the data generating process, which is a constant with respect to the model parameters. The second one, instead, is the cross-entropy between the two distributions. Hence, if we minimize the cross-entropy, we also minimize the Kullback-Leibler divergence. In the following chapters, we are going to analyze some models based on this loss function, which is extremely useful in multiclass and multilabel classification. Therefore, I invite readers who are not familiar with these concepts to fully understand their rationale before proceeding.
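The relationship between the Kullback-Leibler divergence, the entropy, and the cross-entropy can be checked numerically. The following is only a minimal sketch in plain Python: the function names (entropy, cross_entropy, kl_divergence) and the two example distributions are assumptions introduced for illustration, natural logarithms are used (so the results are in nats), and terms with $p_{data}(x) = 0$ are treated as contributing 0.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats; terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; infinite if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total -= px * math.log(qx)
    return total


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total


# Hypothetical distributions over three outcomes (illustrative values only)
p_data = [0.5, 0.3, 0.2]
q_model = [0.4, 0.4, 0.2]

kl = kl_divergence(p_data, q_model)
decomposed = cross_entropy(p_data, q_model) - entropy(p_data)

# The two values should agree up to floating-point rounding
print(kl, decomposed)

# A perfect match yields a null divergence
print(kl_divergence(p_data, p_data))
```

As the entropy of p_data doesn't depend on q_model, any change in q_model that lowers the cross-entropy lowers the divergence by the same amount.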
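In some cases, it's preferable to work with a symmetric and bounded measure. The Jensen-Shannon divergence is defined as follows:

$$D_{JS}(p_{data} \| q) = \frac{1}{2} D_{KL}\left(p_{data} \,\Big\|\, \frac{p_{data} + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p_{data} + q}{2}\right)$$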
Even if it seems more complex, its behavior is similar to that of the Kullback-Leibler divergence, with the main differences that the two distributions can now be swapped and, above all, $0 \le D_{JS}(p_{data} \| q) \le \log 2$. As it's expressed in terms of Kullback-Leibler divergences computed against the mixture of the two distributions, its minimization can also be related to a cross-entropy reduction. The Jensen-Shannon divergence is employed in advanced models (such as Generative Adversarial Networks, or GANs), and it's helpful to know it because it can be useful in tasks where the Kullback-Leibler divergence grows without bound (that is, when the two distributions have no overlap and $q(x) \to 0$ where $p_{data}(x) > 0$).
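The difference between the two measures becomes evident with two distributions that have no overlap. The following is a minimal sketch in plain Python, assuming the standard definition of the Jensen-Shannon divergence as the average of the two Kullback-Leibler divergences computed against the mixture $(p_{data} + q)/2$; the function names and the example distributions are hypothetical, and natural logarithms are used.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                # q assigns no mass to an outcome that p can produce
                return math.inf
            total += px * math.log(px / qx)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, computed against the mixture."""
    m = [0.5 * (px + qx) for px, qx in zip(p, q)]
    # m(x) > 0 wherever p(x) > 0 or q(x) > 0, so both terms are finite
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions with disjoint supports (illustrative values only)
p_data = [0.5, 0.5, 0.0, 0.0]
q_model = [0.0, 0.0, 0.5, 0.5]

# The Kullback-Leibler divergence is infinite in this case
print(kl_divergence(p_data, q_model))

# The Jensen-Shannon divergence reaches its upper bound log(2)
print(js_divergence(p_data, q_model), math.log(2))

# Swapping the arguments doesn't change the result
print(js_divergence(q_model, p_data))
```

In this sketch, the mixture always has non-zero mass wherever either distribution does, which is why the Jensen-Shannon divergence stays finite even when the Kullback-Leibler divergence doesn't.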