Machine Learning Algorithms (Second Edition)

Divergence measures between two probability distributions

Let's suppose you have a discrete data generating process pdata(x) and a model that outputs a probability mass function q(x). In many machine learning tasks, the goal is to tune up the parameters so that q(x) becomes as similar to pdata(x) as possible. A very useful measure is the Kullback-Leibler divergence:

D_{KL}(p_{data} \| q) = \sum_x p_{data}(x) \log\left(\frac{p_{data}(x)}{q(x)}\right)
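As a quick numerical illustration (a minimal sketch, not taken from the book; the two probability mass functions below are hypothetical), the divergence can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical discrete distributions over the same support
p_data = np.array([0.6, 0.3, 0.1])   # assumed data generating process
q = np.array([0.5, 0.3, 0.2])        # assumed model approximation

# D_KL(p_data || q) = sum_x p_data(x) * log(p_data(x) / q(x))
d_kl = np.sum(p_data * np.log(p_data / q))
print(d_kl)  # A small positive value; it would be exactly 0 if q == p_data
```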
This quantity (also known as information gain) expresses the amount of information that is lost when the approximation q(x) is used instead of the original data generating process. It's immediate to see that if q(x) = pdata(x), then DKL(pdata||q) = 0, while it's greater than 0 (and unbounded) whenever there's a mismatch. Manipulating the previous expression, it's possible to gain a deeper understanding:

D_{KL}(p_{data} \| q) = \sum_x p_{data}(x) \log p_{data}(x) - \sum_x p_{data}(x) \log q(x) = -H(p_{data}) + H(p_{data}, q)

The first term is the negative entropy of the data generating process, which is a constant. The second one, instead, is the cross-entropy between the two distributions; hence, if we minimize the cross-entropy, we also minimize the Kullback-Leibler divergence. In the following chapters, we are going to analyze some models based on this loss function, which is extremely useful in multi-label classification. Therefore, I invite the reader who is not familiar with these concepts to fully understand their rationale before proceeding.
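To make the decomposition concrete, here is a small sketch (again using hypothetical distributions; it is not an example from the book) that verifies DKL(pdata||q) = H(pdata, q) - H(pdata) numerically:

```python
import numpy as np

# Hypothetical discrete distributions over the same support
p_data = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy = -np.sum(p_data * np.log(p_data))    # H(p_data): constant, independent of q
cross_entropy = -np.sum(p_data * np.log(q))   # H(p_data, q): the term we can minimize
d_kl = np.sum(p_data * np.log(p_data / q))

# The divergence equals the cross-entropy minus the (constant) entropy
print(np.isclose(d_kl, cross_entropy - entropy))  # True
```

Since the entropy term is fixed by the data, any reduction of the cross-entropy translates into an identical reduction of the divergence.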

In some cases, it's preferable to work with a symmetric and bounded measure. The Jensen-Shannon divergence is defined as follows:

D_{JS}(p_{data} \| q) = \frac{1}{2} D_{KL}\left(p_{data} \,\Big\|\, \frac{p_{data} + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p_{data} + q}{2}\right)
Even if it seems more complex, its behavior is equivalent to that of the Kullback-Leibler divergence, with the main difference that the two distributions can now be swapped and, above all, 0 ≤ DJS(pdata||q) ≤ log(2). As it's expressed as a function of DKL(pdata||q), it's easy to prove that its minimization is proportional to a cross-entropy reduction. The Jensen-Shannon divergence is employed in advanced models (such as Generative Adversarial Networks (GANs)), and it's helpful to know it because it can be useful in tasks where the Kullback-Leibler divergence diverges and would cause a numerical overflow (that is, when the two distributions have no overlap and q(x) → 0).
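The following sketch (a minimal illustration under the assumption of two hypothetical, completely non-overlapping distributions; it is not code from the book) shows that DJS stays bounded by log(2) in exactly the situation where DKL would blow up:

```python
import numpy as np

def kl(p, q):
    # Restrict the sum to the support of p, so 0 * log(0 / m) is treated as 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    # D_JS(p || q) = 0.5 * D_KL(p || m) + 0.5 * D_KL(q || m), with m = (p + q) / 2
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical distributions with disjoint supports: D_KL would be infinite here
p_data = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])

print(js(p_data, q))  # log(2) ≈ 0.6931, the upper bound of the Jensen-Shannon divergence
```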