5.3. WHEN ADDING VARIABLES HURTS

FIGURE 5.11. The effect of correlated predictor variables on the narrowness of the posterior distribution. The vertical axis shows the standard deviation of the posterior distribution of the slope. The horizontal axis is the correlation between the predictor of interest and another predictor added to the model. As the correlation increases, the standard deviation inflates.

You can usually diagnose the problem by looking for a big inflation of the standard deviation when both variables are included in the same model.

Different disciplines have different conventions for dealing with collinear variables. In some fields, it is typical to engage in some kind of data reduction procedure, like PRINCIPAL COMPONENTS or FACTOR ANALYSIS, and then to use the components/factors as predictor variables. In other fields, this is considered voodoo, because principal components and factors are notoriously hard to interpret, and there are usually hidden decisions that go into producing them. It is also difficult to use such constructed variables in prediction, because learning that component 2 is associated with an outcome doesn't immediately tell us how any real measurement is associated with the outcome. With experience, you can cope with these issues, but they remain issues. A popular defensive approach is to show that models using any one member from a cluster of highly correlated variables will produce the same inferences and predictions. For example, sometimes rainfall and soil moisture are highly correlated across sites in an ecological analysis. Using both in a model meant to predict presence/absence of a species of plant may run afoul of multicollinearity, but if you can show that a model with either produces nearly the same predictions, it will be reassuring that both get at the same underlying ecological dimension.

The problem of multicollinearity is really a member of a family of problems with fitting models, a family sometimes known as NON-IDENTIFIABILITY. When a parameter is non-identifiable, it means that the structure of the data and model do not make it possible to estimate the parameter's value. Sometimes this problem arises from mistakes in coding a model, but many important types of models present non-identifiable or weakly identifiable parameters, even when coded completely correctly.

In general, there's no guarantee that the available data contain much information about a parameter of interest. When that's true, your Bayesian machine will return a very wide posterior distribution. That doesn't mean anything is wrong; you got the right answer to the question you asked. But it might lead you to ask a better question.

Overthinking: Simulating collinearity. The code to produce FIGURE 5.11 involves writing a function that generates correlated predictors, fits a model, and returns the standard deviation of the posterior distribution for the slope relating perc.fat to kcal.per.g. Then the code repeatedly calls this function, with different degrees of correlation as input, and collects the results.
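A minimal sketch of that simulation, assuming the milk data from the rethinking package and using lm as a quick stand-in for the full Bayesian fit (with flat priors, the standard error from lm closely approximates the posterior standard deviation of the slope):

sim.coll <- function( r=0.9 ) {
    # simulate a predictor x with correlation r to perc.fat;
    # this mean/sd choice gives x the same variance as perc.fat
    d$x <- rnorm( nrow(d) , mean=r*d$perc.fat ,
        sd=sqrt( (1-r^2)*var(d$perc.fat) ) )
    m <- lm( kcal.per.g ~ perc.fat + x , data=d )
    sqrt( diag( vcov(m) ) )[2]   # stddev of the perc.fat slope
}

# average the stddev over n simulated datasets at a given correlation
rep.sim.coll <- function( r=0.9 , n=100 ) {
    stddev <- replicate( n , sim.coll(r) )
    mean(stddev)
}

library(rethinking)   # for the milk data
data(milk)
d <- milk

# sweep the correlation from 0 to 0.99 and collect the results
r.seq <- seq( from=0 , to=0.99 , by=0.01 )
stddev <- sapply( r.seq , function(z) rep.sim.coll( r=z , n=100 ) )
plot( stddev ~ r.seq , type="l" , xlab="correlation" , ylab="stddev" )

Plotting stddev against r.seq should reproduce the inflation pattern shown in Figure 5.11; the sweep stops at 0.99 because a correlation of exactly 1 makes the two predictors perfectly collinear and the slope non-identifiable.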
