Overthinking: More on entropy. Above I said that information entropy is the average log-probability. But there's also a −1 in the definition. Multiplying the average log-probability by −1 just makes the entropy H increase from zero, rather than decrease from zero. It's conventional, but not functional. The logarithms above are natural logs (base e), but changing the base rescales without any effect on inference. Binary logarithms, base 2, are just as common. As long as all of the entropies you compare use the same base, you'll be fine.

The only trick in computing H is to deal with the inevitable question of what to do when p_i = 0. Since log(0) = −∞, that won't do. However, L'Hôpital's rule tells us that lim_{p_i → 0} p_i log(p_i) = 0. So just assume that 0 log(0) = 0 when you compute H. In other words, events that never happen drop out. This is not really a trick. It follows from the definition of a limit. But it isn't obvious. It may make more sense to just remember that when an event never happens, there's no point in keeping it in the model.

Rethinking: The benefits of maximizing uncertainty. Information theory has many applications. A particularly important application is MAXIMUM ENTROPY, also known as MAXENT. Maximum entropy is a family of techniques for finding probability distributions that are most consistent with states of knowledge. In other words, given what we know, what is the least surprising distribution? It turns out that one answer to this question maximizes the information entropy, using the prior knowledge as constraint.78

Many of the common distributions used in statistical modeling (uniform, Gaussian, binomial, Poisson, etc.) are also maximum entropy distributions, given certain constraints. For example, if all we know about a measure is its mean and variance, then the distribution that maximizes entropy (and therefore minimizes surprise) is the Gaussian. There are both epistemological and ontological interpretations of this fact. Epistemologically, the Gaussian is expected to cover more possible combinations of events, precisely because it is the least surprising distribution consistent with the constraints provided (mean and variance, infinite real line). Ontologically, frequencies of events in the real world often come to resemble Gaussian distributions, because when a natural process adds values together, it sheds all information other than mean and variance. You saw an example of this in Chapter 4.

When you meet the other exponential family distributions, in later chapters, you'll also learn something of their maximum entropy interpretations.

6.2.3. From entropy to accuracy. It's nice to have a way to quantify uncertainty. H provides this. So we can now say, in a precise way, how hard it is to hit the target. But how can we use information entropy to say how far a model is from the target? The key lies in DIVERGENCE:

Divergence: The additional uncertainty induced by using probabilities from one distribution to describe another distribution.

This is often known as Kullback-Leibler divergence or simply K-L divergence, named after the people who introduced it for this purpose.79

Suppose for example that the true distribution of events is p_1 = 0.3, p_2 = 0.7. If we believe instead that these events happen with probabilities q_1 = 0.25, q_2 = 0.75, how much additional uncertainty have we introduced, as a consequence of using q = {q_1, q_2} to approximate p = {p_1, p_2}?
The formal answer to this question is based upon H, and has a similarly simple formula:

D_KL(p, q) = Σ_i p_i (log(p_i) − log(q_i)).
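To make the formula concrete, here is a minimal base R sketch (not code from the book; the function names H and DKL are just illustrative) that computes the entropy of p and the divergence for the example above, using the 0 log(0) = 0 convention from the Overthinking box:

```r
# The example distributions above: p is the "truth", q is the approximation.
p <- c(0.3, 0.7)
q <- c(0.25, 0.75)

# Information entropy H(p), with the 0*log(0) = 0 convention so that
# events that never happen simply drop out of the sum.
H <- function(p) {
  terms <- ifelse(p == 0, 0, p * log(p))
  -sum(terms)
}
H(p)        # roughly 0.61 nats

# Kullback-Leibler divergence: the additional uncertainty induced by
# using the probabilities in q to describe p.
DKL <- function(p, q) sum(p * (log(p) - log(q)))
DKL(p, q)   # roughly 0.006, because q is a close approximation of p
DKL(p, p)   # exactly 0: no extra uncertainty when the model matches the target
```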
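And as a quick back-of-the-envelope check of the maximum entropy claim in the Rethinking box (again not from the book): among continuous distributions with the same variance, the Gaussian has larger differential entropy than, for example, a uniform distribution with that variance.

```r
# Differential entropy of a Gaussian with standard deviation sigma is
# 0.5 * log(2 * pi * e * sigma^2); a uniform with the same variance has
# width sigma * sqrt(12), and its entropy is the log of that width.
sigma <- 1
H_gaussian <- 0.5 * log(2 * pi * exp(1) * sigma^2)  # about 1.42 nats
H_uniform  <- log(sigma * sqrt(12))                 # about 1.24 nats
H_gaussian > H_uniform                              # TRUE, as maxent predicts
```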
