6.2. INFORMATION THEORY AND MODEL PERFORMANCE

In plainer language, the divergence is the average difference in log probability between the target (p) and model (q). This divergence is just the difference between two entropies: the entropy of the target distribution p and the entropy arising from using q to predict p. When p = q, we know the actual probabilities of the events. In that case:

$$D_{KL}(p, q) = D_{KL}(p, p) = \sum_i p_i \left( \log(p_i) - \log(p_i) \right) = 0.$$

There is no additional uncertainty induced when we use a probability distribution to represent itself. That's somehow a comforting thought. But more importantly, as q grows more different from p, the divergence $D_{KL}$ also grows.

What divergence can do for us now is help us contrast different approximations to p. As an approximating function q becomes more accurate, $D_{KL}(p, q)$ will shrink. So if we have a pair of candidate distributions, say $q_1$ and $q_2$, then the candidate that minimizes the divergence will be closest to the target. Since predictive models specify probabilities of events (observations), we can use divergence to compare the accuracy of models.

Overthinking: Cross entropy and divergence. Deriving divergence is easier than you might think. The insight is in realizing that when we use a probability distribution q to predict events from another distribution p, this defines something known as cross entropy: $H(p, q) = -\sum_i p_i \log(q_i)$. The notion is that events arise according to the p's, but they are expected according to the q's, so the entropy is inflated, depending upon how different p and q are.

Divergence is defined as the additional entropy induced by using q. So it's just the difference between H(p), the actual entropy of events, and H(p, q):

$$D_{KL}(p, q) = H(p, q) - H(p) = -\sum_i p_i \log(q_i) - \left( -\sum_i p_i \log(p_i) \right) = -\sum_i p_i \left( \log(q_i) - \log(p_i) \right)$$

So divergence really is measuring how far q is from the target p, in units of entropy. Notice that which is the target matters: H(p, q) does not in general equal H(q, p). For more on that fact, see the rethinking box that follows.

Rethinking: Divergence depends upon direction. In general, H(p, q) is not equal to H(q, p). The direction matters when computing divergence. Understanding why this is true is of some value, so here's a contrived teaching example.

Suppose we get in a rocket and head to Mars. But we have no control over our landing spot, once we reach Mars. Let's try to predict whether we land in water or on dry land, using the Earth to provide a probability distribution q to approximate the actual distribution on Mars, p. For the Earth, q = {0.7, 0.3}, for probability of water and land, respectively. Mars is totally dry, but let's say for the sake of the example that there is 1% surface water, so p = {0.01, 0.99}. If we count the ice caps, that's not too big a lie. Now compute the divergence going from Earth to Mars. It turns out to be $D_{E \to M} = D_{KL}(p, q) = 1.14$. That's the additional uncertainty induced by using the Earth to predict the Martian landing spot.

Now consider going back the other direction. The numbers in p and q stay the same, but we swap their roles, and now $D_{M \to E} = D_{KL}(q, p) = 2.62$. The divergence is more than double in this direction. This result seems to defy comprehension. How can the distance from Earth to Mars be shorter than the distance from Mars to Earth?

Divergence behaves this way as a feature, not a bug. There really is more additional uncertainty induced by using Mars to predict Earth than by using Earth to predict Mars. The reason is that, going
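To make the Earth/Mars numbers in the box above concrete, here is a minimal sketch that computes entropy, cross entropy, and KL divergence with natural logarithms and then checks both directions of the example. It is written in Python rather than the book's R, so it is not the author's code; the function and variable names (entropy, cross_entropy, kl_divergence, earth, mars) are illustrative choices, not anything defined in the text.

```python
import math

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i); terms with p_i = 0 contribute nothing.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i): events occur with probabilities p,
    # but are expected with probabilities q.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(p, q) = H(p, q) - H(p) = sum_i p_i * (log(p_i) - log(q_i)).
    return cross_entropy(p, q) - entropy(p)

# Probabilities of (water, land) on each planet, as in the example above.
earth = [0.70, 0.30]  # q: the approximating distribution
mars  = [0.01, 0.99]  # p: the target distribution

print(kl_divergence(mars, earth))  # D_{E->M}: about 1.14
print(kl_divergence(earth, mars))  # D_{M->E}: about 2.62
```

Running this gives roughly 1.14 and 2.62, matching the values in the box. The asymmetry falls out of the weighting: each log ratio is weighted by the target distribution's probabilities, so whichever distribution plays the role of p decides which surprises count the most.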
