from Mars to Earth, Mars has so little water on its surface that we will be very very surprised when we land in water on Earth. The Earth is mainly covered in water, after all. In contrast, Earth has good amounts of both water and dry land. So when we use the Earth to predict Mars, we expect both water and land, to some extent, even though we do expect more water than land. So we won't be nearly as surprised when we inevitably arrive on Martian dry land, because 30% of Earth is dry land.

6.2.4. From divergence to deviance. At this point in the chapter, dear reader, you may be wondering where the chapter is headed. At the start, the goal was to deal with overfitting and underfitting. But now we've spent pages and pages on entropy and other fantasies. It's as if I promised you a day at the beach, but now you find yourself at a dark cabin in the woods, wondering if this is a necessary detour or rather a sinister plot.

It is a necessary detour. The point of all the preceding material about information theory and divergence is to establish both:

(1) How to measure the distance of a model from our target. Information theory gives us the distance measure we need, the K-L divergence.

(2) How to estimate the divergence. Having identified the right measure of distance, we now need a way to estimate it in real statistical modeling tasks.

Item (1) is accomplished. Item (2) remains. We're going to show now that the divergence leads us to using a measure of model fit known as DEVIANCE.

To use D_KL to compare models, it seems like we would have to know p, the target probability distribution. In all of the examples so far, I've just assumed that p is known. But when we want to find a model q that is the best approximation to p, the "truth," there is usually no way to access p directly. We wouldn't be doing statistical inference if we already knew p.

But there's an amazing way out of this predicament. It helps that we are only interested in comparing the divergences of different candidates, say q and r. In that case, most of p just subtracts out, because there is an E log(p_i) term in the divergence of both q and r. This term has no effect on the distance of q and r from one another. So while we don't know where p is, we can estimate how far apart q and r are, and which is closer to the target. It's as if we can't tell how far any particular archer is from the target, but we can tell which archer is closer and by how much.

All of this also means that all we need to know is a model's average log probability: E log(q_i) for q and E log(r_i) for r. These expressions look a lot like log probabilities of outcomes, like the log-likelihoods you've been using already. Indeed, just summing the log-likelihoods of each case provides an approximation of E log(q_i). We don't have to know the p inside the expectation, because nature takes care of presenting the events for us.

So we can compare the average log probability from each model to get an estimate of the relative distance of each model from the target. This also means that the absolute magnitude of these values will not be interpretable: neither E log(q_i) nor E log(r_i) by itself suggests a good or bad model. Only the difference E log(q_i) − E log(r_i) informs us about the divergence of each model from the target p.

All of this delivers us to a very common measure of model fit, one that also turns out to be an approximation of K-L divergence, the DEVIANCE, which is defined as:

D(q) = −2 Σ_i log(q_i)
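To make the Earth/Mars asymmetry concrete, here is a minimal sketch in Python (not code from the book, which works in R). It scores both directions of prediction with the K-L divergence, using the 30% dry land figure from the text for Earth; the 1% surface water figure for Mars is only an assumed stand-in for "very little water".

    from math import log

    def kl_divergence(p, q):
        """D_KL(p, q) = sum over events of p_i * (log(p_i) - log(q_i))."""
        return sum(p_i * (log(p_i) - log(q_i)) for p_i, q_i in zip(p, q))

    # Probabilities of (water, land). Earth's 0.7/0.3 follows the text above;
    # Mars' 0.01/0.99 is an assumed stand-in for a nearly dry planet.
    earth = [0.7, 0.3]
    mars  = [0.01, 0.99]

    print(kl_divergence(earth, mars))   # target Earth, model Mars: about 2.6
    print(kl_divergence(mars, earth))   # target Mars, model Earth: about 1.1

The divergence is much larger when the nearly dry Mars model is used to predict the mostly wet Earth than the other way around, matching the intuition in the text.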
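And here is a similarly hedged sketch of the deviance calculation itself, again in Python rather than the book's R, with hypothetical simulated observations. The "true" probability p_true generates the cases but is never used in the scoring: each model's deviance is computed only from the model's log probability of the observed cases, so the difference in deviances estimates the models' relative distance from the target without ever knowing p.

    from math import log
    import random

    random.seed(1)

    # Hypothetical data: 1 = water, 0 = land, generated by a target distribution
    # that the scoring below never looks at.
    p_true = 0.7
    obs = [1 if random.random() < p_true else 0 for _ in range(1000)]

    def deviance(prob_water, data):
        """D(q) = -2 * sum_i log(q_i), where q_i is the model's probability of case i."""
        return -2 * sum(log(prob_water if y == 1 else 1 - prob_water) for y in data)

    d_q = deviance(0.7, obs)    # candidate model q
    d_r = deviance(0.5, obs)    # candidate model r
    print(d_q, d_r, d_q - d_r)  # only the difference is interpretable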
