11.07.2015 Views

statisticalrethinkin..


the data from the second tank. You can think of this procedure like the Bayesian updating example in Chapter 2. Tank 1’s estimate improves our estimate for tank 2.

But wait. There’s nothing special about the order the tanks appear, as long as they are independent. What if we lost the data for tank 1, and so started our analysis with tank 2? After calculating the observed proportion surviving in tank 2, an assistant finds the lost data for tank 1. Now, before you see what is written on the data sheet, can you say anything about what the proportion will be? Again, of course you can, because you know what happened in tank 2. You take that educated guess and update it, once you are handed the actual data for tank 1. And so tank 2’s estimate improves our estimate for tank 1.

In truth, this relationship is symmetric. Tank 1’s data improves the estimate for tank 2, and likewise tank 2’s data improves the estimate for tank 1. As a result, if you do the estimation correctly, according to Bayes’ theorem, you end up with estimates for both tanks in which you have pooled all the information to improve each estimate. But it also means that neither estimate is exactly the same as the naive estimate you’d get by forcing the estimate for each tank to ignore the other tank. Instead, both estimates have shrunk towards the mean proportion across both tanks. These estimates will exhibit shrinkage, due to pooling information.

To gain a better appreciation of the shrinkage phenomenon, and why it varies as it does across the tanks, it will help to directly address the inferential benefits of pooling information in this way. That’s where we turn next.

13.3. Varying effects and the underfitting/overfitting tradeoff

The first major benefit of using these varying effects estimates, instead of the empirical raw estimates, is that they provide more accurate estimates of the individual cluster (tank) intercepts.
On average, the varying effects actually provide a better estimate of the individual tank (cluster) means. In this section, I’ll explain why this is true, in the context of building the model above, the one with varying intercepts by tank.

The basic reason that the varying intercepts provide better estimates is that they do a better job of trading off underfitting and overfitting. You first met this distinction and estimation problem in Chapter 6. To understand this in the context of our current data example, suppose that instead of tanks we had ponds with some continuity through time, so that we might be concerned with making predictions for the same clusters in the future. We’ll approach the problem of predicting future survival in these ponds from two extreme perspectives.

First, suppose you ignore the varying effects and just use the overall mean across all ponds, α, to make your predictions for each pond. A lot of data contributes to your estimate of α, and so it can be quite precise. However, your estimate of α is unlikely to exactly match the mean of any particular pond. As a result, the total sample mean underfits the data. This approach is sometimes known as COMPLETE POOLING, because you pool all the data from all ponds to produce a single estimate that is applied to every pond. It’s equivalent to assuming that ponds do not vary at all in their mean survival probability.

In contrast, suppose you use the raw empirical survival proportions to make predictions for each pond. This is like using a separate constant regression intercept for each pond. In previous chapters, you estimated models like this by including a dummy variable for each cluster in the data, like with the academic departments in the Berkeley admissions data. The blue points in FIGURE 13.1 are this same kind of estimate. In each particular pond, relatively little data contributes to each estimate, and so these estimates are rather imprecise.
This is particularly true of the smaller ponds, where less data goes into producing the estimates. As
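
To make the two extreme perspectives concrete, here is a minimal Python sketch that simulates some ponds and compares the complete pooling estimate with the raw per-pond proportions. Everything in it is an assumption made for illustration, not something taken from the text: the pond sizes (5 and 35 individuals), the log-odds mean of 1.4 and standard deviation of 1.5 used to generate true survival probabilities, the random seed, and the function names.

```python
import math
import random


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_ponds(sizes, mean_logit=1.4, sd_logit=1.5, seed=13):
    """Simulate hypothetical ponds as (true_prob, survivors, size) tuples.

    The generating values are illustrative assumptions only.
    """
    rng = random.Random(seed)
    ponds = []
    for n in sizes:
        p_true = inv_logit(rng.gauss(mean_logit, sd_logit))
        # Each of the n individuals survives independently with p_true.
        survivors = sum(rng.random() < p_true for _ in range(n))
        ponds.append((p_true, survivors, n))
    return ponds


def complete_pooling(ponds):
    """One grand survival proportion, applied to every pond."""
    total_survivors = sum(s for _, s, _ in ponds)
    total_size = sum(n for _, _, n in ponds)
    return [total_survivors / total_size] * len(ponds)


def no_pooling(ponds):
    """Each pond's own raw empirical survival proportion."""
    return [s / n for _, s, n in ponds]


def mean_abs_error(estimates, ponds):
    """Average absolute gap between estimates and true probabilities."""
    gaps = [abs(e - p) for e, (p, _, _) in zip(estimates, ponds)]
    return sum(gaps) / len(gaps)


# 500 small ponds followed by 500 larger ponds (assumed sizes).
sizes = [5] * 500 + [35] * 500
ponds = simulate_ponds(sizes)
small, large = ponds[:500], ponds[500:]

pooled = complete_pooling(ponds)
raw = no_pooling(ponds)

err_pooled = mean_abs_error(pooled, ponds)
err_raw_small = mean_abs_error(raw[:500], small)
err_raw_large = mean_abs_error(raw[500:], large)

print("complete pooling, all ponds:", round(err_pooled, 3))
print("raw proportions, small ponds:", round(err_raw_small, 3))
print("raw proportions, large ponds:", round(err_raw_large, 3))
```

Under these assumed settings, the raw proportions should come out noticeably noisier in the small ponds than in the large ones, while the pooled estimate is the same single number for every pond regardless of its size.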
