Rethinking: AIC and "true" models. It is possible to read both that (1) AIC assumes the data generating model is one of the candidate models and (2) AIC does not assume the data generating model is a candidate. This confusion arises because there are multiple ways to derive AIC. The gambit described above does not employ a "true" model, except to generate training and testing data. But other derivations do focus on "truth." The general lesson here is that just because one derivation or justification of a procedure makes use of an assumption, it doesn't mean that there isn't another possible justification that uses different assumptions.

Even more generally, the consequences of violating an assumption are sometimes benign and other times catastrophic. Procedures do not simply stop being useful when an assumption is violated. If that were true, then no statistical procedure would ever work, at least in the large world. Still, caution requires taking note of violated assumptions and hopefully evaluating the consequences of these violations.

6.3.1. Limits to AIC's generality. But more generally, AIC is not general. It is a special case of a much larger phenomenon: the severity of overfitting is an increasing function of the number of parameters. But this function is not always as simple as 2k, as it is in AIC. A few common conditions benefit from a more general solution.

6.3.1.1. Parameter count close to sample size. Suppose a model has k parameters and is fit to N observations. When k is close to N, overfitting rises very rapidly. This happens because the model starts perfectly encoding the training sample. So when the model sees the test sample, it's always very surprised. A conservative approximation for this rise in overfitting is given by a common generalization of AIC:

$$\mathrm{AIC}_c = D_{\mathrm{train}} + \frac{2k}{1 - (k + 1)/N}$$

When N is very much larger than k, the above simplifies to plain AIC. But as k approaches N − 1, the penalty on the right approaches infinity. So anytime AIC is appropriate, AICc may be a better choice. A short numerical sketch of this behavior appears at the end of this section.

6.3.1.2. Informative priors. If the model's priors are not flat, then AIC can get the penalty very wrong. This is the reason. The penalty term in AIC estimates how flexible the model is. In the classical case of flat priors, it turns out that each additional parameter adds 2 to the penalty. But when priors are not flat, but instead more concentrated around zero, then the model is less flexible. Remember, priors function inside Bayes' theorem just as if they were previous posterior inferences. So they behave like ghostly data of a kind, previously accumulated evidence. So if the prior has an effect, it will be to prevent the model from learning everything from the sample. This reduces overfitting.

As a result, each parameter with an informative prior tends to count less than 2 in the penalty. This is the very reason that using informative priors can substitute for using a measure like AIC. There are generalizations for dealing with this, and they are important also for the next problem. A small simulation at the end of this section illustrates this trade-off.

6.3.1.3. Multilevel models. Each level in a multilevel model serves as a kind of prior for the next level. This means that multilevel models necessarily induce the problem above with informative priors. So counting the number of parameters in a multilevel model never tells you the proper penalty term.
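To see how the AICc penalty behaves, here is a minimal sketch in Python, assuming a made-up training deviance and sample size (the function names and values are illustrative, not from the text):

```python
# A minimal sketch (not from the text) showing how the AICc penalty
# behaves as the parameter count k approaches the sample size N.
# The deviance value is a hypothetical placeholder.

def aic(d_train: float, k: int) -> float:
    """Plain AIC: training deviance plus 2 penalty units per parameter."""
    return d_train + 2 * k

def aicc(d_train: float, k: int, n: int) -> float:
    """AICc: the 2k penalty is inflated by 1 / (1 - (k + 1) / n)."""
    return d_train + 2 * k / (1 - (k + 1) / n)

d_train = 100.0  # hypothetical training deviance
n = 20           # hypothetical sample size
for k in (2, 5, 10, 15, 18):
    print(f"k={k:2d}  AIC={aic(d_train, k):6.1f}  AICc={aicc(d_train, k, n):6.1f}")
# AIC grows linearly in k, while the AICc penalty explodes as k
# approaches N - 1 (= 19 here): at k = 18 it is 2*18/0.05 = 720.
```

Note that when k is small relative to N, the two criteria nearly agree, which is why AICc is safe to use whenever AIC is.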

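To make the "ghostly data" intuition concrete, here is a hedged simulation sketch, my illustration rather than the book's code: a Gaussian prior centered on zero is equivalent (assuming unit noise variance) to the L2 penalty of ridge regression, so comparing lam = 0 (flat prior) against lam > 0 (concentrated prior) shows how an informative prior trades training fit for less out-of-sample surprise.

```python
# A hedged simulation sketch (illustrative, not from the text):
# a concentrated Gaussian prior keeps the model from learning
# everything from a small sample, which reduces overfitting.
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 15                      # parameter count close to sample size
beta_true = np.zeros(k)
beta_true[0] = 1.0                 # only one predictor actually matters
X = rng.normal(size=(n, k))
y = X @ beta_true + rng.normal(size=n)
X_test = rng.normal(size=(1000, k))
y_test = X_test @ beta_true + rng.normal(size=1000)

def posterior_mean(X, y, lam):
    # Posterior mean under a Normal(0, 1/lam) prior on each coefficient
    # (unit noise variance assumed); lam = 0 recovers flat-prior
    # least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1.0, 10.0):
    b = posterior_mean(X, y, lam)
    train_mse = np.mean((y - X @ b) ** 2)
    test_mse = np.mean((y_test - X_test @ b) ** 2)
    print(f"lam={lam:5.1f}  train MSE={train_mse:5.2f}  test MSE={test_mse:5.2f}")
# Typically the flat prior (lam = 0) fits the training sample best but
# predicts worst out of sample; the concentrated priors learn less from
# the sample and so overfit less.
```

This is the sense in which each parameter under an informative prior "counts less than 2" in the penalty: the prior has already done some of the regularizing work that the penalty term would otherwise have to approximate.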