Statistical Rethinking
A Bayesian Course with R Examples

Richard McElreath

This version compiled May 6, 2014


Contents

Preface
   Audience
   Teaching strategy
   How to use this book
   Acknowledgments

Chapter 1. The Golem of Prague
   1.1. Statistical Golems
   1.2. Wrecking Prague
   1.3. Three Tools for Golem Engineering
   1.4. Plan

Chapter 2. Small Worlds and Large Worlds
   2.1. Probability is just counting
   2.2. Colombo's first Bayesian model
   2.3. Components of the model
   2.4. Making the model go
   2.5. Summary
   2.6. Practice

Chapter 3. Sampling the Imaginary
   3.1. Sampling from a grid-approximate posterior
   3.2. Sampling to summarize
   3.3. Sampling to simulate prediction
   3.4. Summary
   3.5. Practice

Chapter 4. Linear Models
   4.1. Why normal distributions are normal
   4.2. A language for describing models
   4.3. A Gaussian model of height
   4.4. Adding a predictor
   4.5. Polynomial regression
   4.6. Summary
   4.7. Practice

Chapter 5. Multivariate Linear Models
   5.1. Spurious association
   5.2. Masked relationship
   5.3. When adding variables hurts
   5.4. Categorical variables
   5.5. Ordinary least squares and lm
   5.6. Summary
   5.7. Practice (Easy, Medium, Hard)

Chapter 6. Model Selection, Comparison, and Averaging
   6.1. The problem with parameters
   6.2. Information theory and model performance
   6.3. Akaike information criterion
   6.4. Deviance information criterion
   6.5. Using AIC
   6.6. In praise of complexity
   6.7. Practice

Chapter 7. Interactions
   7.1. Building an interaction
   7.2. Symmetry of the linear interaction
   7.3. Continuous interactions
   7.4. Higher-order interactions
   7.5. Interactions with lm
   7.6. Summary
   7.7. Practice

Chapter 8. Markov Chain Monte Carlo Estimation
   8.1. Good King Markov and His island kingdom
   8.2. Markov chain Monte Carlo
   8.3. Easy HMC: map2stan
   8.4. Care and feeding of your Markov chain
   8.5. Need another model fitting example
   8.6. Practice

Chapter 9. Big Entropy and the Generalized Linear Model
   9.1. Generalized Linear Models

Chapter 10. Distance and Duration
   10.1. Survival analysis
   10.2. Gamma

Chapter 11. Counting and Classification
   11.1. Binomial
   11.2. Poisson
   11.3. Multinomial

Chapter 12. Monsters and Mixtures
   12.1. Ordered Categorical Outcomes
   12.2. Ranked Outcomes
   12.3. Variable Probabilities: Beta-binomial
   12.4. Variable Rates: Gamma-Poisson
   12.5. Variable Process: Zero-Inflated Outcomes

Chapter 13. Multilevel Models
   13.1. What are multilevel models good for?
   13.2. Multilevel tadpoles
   13.3. Varying effects and the underfitting/overfitting tradeoff
   13.4. Cross-classified models

Chapter 14. Multilevel Models II: Slopes
   14.1. Everything can vary and probably should

Chapter 15. Missing Data and Other Opportunities
   15.1. Measurement error
   15.2. Missing data
   15.3. Space and networks

Chapter 16. Writing Statistics

Endnotes
Bibliography
Index


Preface

   Masons, when they start upon a building,
   Are careful to test out the scaffolding;

   Make sure that planks won't slip at busy points,
   Secure all ladders, tighten bolted joints.

   And yet all this comes down when the job's done
   Showing off walls of sure and solid stone.

   So if, my dear, there sometimes seem to be
   Old bridges breaking between you and me

   Never fear. We may let the scaffolds fall
   Confident that we have built our wall.

   ("Scaffolding" by Seamus Heaney, 1939–2013)

This book means to help you raise your knowledge of and confidence in statistical modeling. It is meant as a scaffold, one that will allow you to construct the wall that you need, even though you will discard it afterwards. As a result, this book teaches the material in often inconvenient fashion, forcing you to perform step-by-step calculations that are usually automated. The reason for all the algorithmic fuss is to ensure that you understand enough of the details to make reasonable choices and interpretations in your own modeling work. So although you will move on to use more automation, it's important to take things slow at first. Put up your wall, and then let the scaffolding fall.

Audience

The principal audience is researchers in the natural and social sciences, whether new PhD students or seasoned professionals, who have had a basic course on regression but nevertheless remain uneasy about statistical modeling. This audience accepts that there is something vaguely wrong about typical statistical practice in the early 21st century, dominated as it is by p-values and a confusing menagerie of testing procedures. They see alternative methods in journals and books. But these people are not sure where to go to learn about these methods.

As a consequence, this book doesn't really argue against p-values and the like. It assumes the reader is ready to try doing statistical inference without them. This isn't the ideal


situation. It would be better to have material that helps you spot common mistakes and misunderstandings of p-values and tests in general, as all of us have to understand such things, even if we don't use them. So I've tried to sneak in a little material of that kind, but unfortunately cannot devote much space to it. The book would be too long, and it would disrupt the teaching flow of the material.

It's important to realize, however, that the disregard paid to p-values is not a uniquely Bayesian attitude. Indeed, significance testing can be—and has been—formulated as a Bayesian procedure as well. So the choice to avoid significance testing is stimulated instead by epistemological concerns, some of which are briefly discussed in the first chapter.

Teaching strategy

The book uses much more computer code than formal mathematics. Even excellent mathematicians can have trouble understanding an approach, until they see a working algorithm. This is because implementation in code form removes all ambiguities. So material of this sort is easier to learn, if you also learn how to implement it.

In addition to any pedagogical value of presenting code, so much of statistics is now computational that a purely mathematical approach is anyway insufficient. As you'll see in later parts of this book, the same mathematical statistical model can sometimes be implemented in different ways, and the differences matter. So when you move beyond this book to more advanced or specialized statistical modeling, the computational emphasis here will help you recognize and cope with all manner of practical troubles.

The book is organized into four interdependent parts. The first three chapters are foundational. They introduce Bayesian inference and the basic tools for performing Bayesian calculations. They move quite slowly. The next four chapters, 4 through 7, build multiple linear regression as a Bayesian tool, with Bayesian processing of results. These chapters also move rather slowly, largely because of the emphasis on plotting results, including interaction effects. Problems of model complexity—overfitting—also feature prominently. The third part of the book, chapters 8 through 10, presents generalized linear models of several types, using the techniques of earlier chapters to process and understand them. The last part gets around to multilevel models, both linear and generalized linear, as well as specialized types that address measurement error, spatial correlation, social networks, and shared ancestry (phylogeny).

This last part is where Markov chain Monte Carlo (MCMC) is introduced. Most Bayesian statistics books introduce MCMC quite early, and with good reason. This book differs, because fitting Bayesian models without MCMC simultaneously educates you in how much non-Bayesian modeling works. This removes a great deal of confusion about the relationship between Bayesian and non-Bayesian results. It can also be a bit much to ask students to become comfortable with regression while they simultaneously wrestle with sometimes-fickle Markov chains.

How to use this book

This book is not a reference, but a course. It doesn't try to support random access. Rather, it expects sequential access. This has immense pedagogical advantages, but it has the disadvantage of violating how most scientists actually read books.

This book has a lot of code in it, integrated fully into the main text. The reason for this is that doing model-based statistics in the 21st century really requires programming, of at


least a minor sort. The code is not optional. Everyplace, I have erred on the side of including too much code, rather than too little. In my experience teaching scientific programming, novices learn more quickly when they have working code to modify, rather than needing to write an algorithm from scratch. My generation was probably the last to have to learn some programming to use a computer, and so coding has gotten harder and harder to teach as time goes on. My students are very computer literate, but they have no idea what computer code looks like.

What the book assumes. This book does not try to teach the reader to program, in the most basic sense. It assumes that you have made a basic effort to learn how to install and process data in R. In most cases, a short introduction to R programming will be enough. I know many people have found Emmanuel Paradis' R for Beginners helpful. You can find it and many other beginner guides here:

http://cran.r-project.org/other-docs.html

To make use of this book, you should know already that y


read through a first time, for concepts. Then a second pass with a deeper reading, including executing code, will be needed to really cement the knowledge.

Optional sections. Reflecting realism in how books like this are actually read, there are two kinds of optional sections: (1) "rethinking" and (2) "overthinking." The "rethinking" sections look like this:

Rethinking: Think again. The point of these "rethinking" boxes is to provide broader context for the material. They allude to connections to other approaches, provide historical background, or call out common misunderstandings. These boxes are meant to be optional.

The "overthinking" sections look like this:

Overthinking: Getting your hands dirty. These sections, set in smaller type, provide more detailed explanations of code or mathematics. This material isn't essential for understanding the main text. But it does have a lot of value, especially on a second reading. For example, sometimes it matters how you perform a calculation. Mathematics tells us that these two expressions are equivalent:

   p1 = log(0.01^200)
   p2 = 200 × log(0.01)

But when you use R to compute them, they yield different answers:

R code 0.2
   ( log( 0.01^200 ) )
   ( 200 * log(0.01) )

   [1] -Inf
   [1] -921.034

The second line is the right answer. This problem arises because of rounding error, when the computer rounds very small decimal values to zero. This loses precision and can introduce substantial errors in inference. As a result, we nearly always do statistical calculations using the logarithm of a probability, rather than the probability itself. You can ignore most of these overthinking sections on a first read.

The command line is the best tool. Programming at the level needed to perform 21st century statistical inference is not that complicated, but it is unfamiliar at first. Why not just teach the reader how to do all of this with a point-and-click program? There are big advantages to doing statistics with text commands, rather than pointing and clicking on menus. Foremost among them is that each analysis documents itself, so that years from now you can come back to your analysis and replicate it exactly. You can re-use your old files and send them to colleagues. Pointing and clicking, however, leaves no trail of breadcrumbs. A file with your R commands inside it does. Once you get in the habit of planning, running, and preserving your statistical analyses in this way, it pays for itself many times over.

We don't use the command line because we are hardcore or elitist (although we might be). We use the command line because it is better. Unlike the point-and-click interface, you do have to learn a basic set of commands to get started with a command line interface. However, the costs later are reduced. With point-and-click, you pay down the road, rather


than initially. You'll see that the command line makes some things that are prohibitively difficult for point-and-click interfaces, like recoding large data tables or running 10-thousand resampled regressions, trivial.

How you should work. But I would be cruel, if I just told the reader to use a command-line tool, without also explaining something about how to do it. You do have to relearn some habits, but it isn't a major change. For readers who have only used menu-driven statistics software before, there will be some significant readjustment. But after a few days, it will seem natural to you. For readers who have used command-driven statistics software like Stata and SAS, there is still some readjustment ahead. I'll explain the overall approach, first. Then I'll say why even Stata and SAS users are in for a change.

First, the sane approach to scripting statistical analyses is to work back-and-forth between two applications: (1) a plain text editor of your choice and (2) the R program itself. A plain text editor is a program that creates and edits simple formatting-free text files. Common examples include Notepad (in Windows) and TextEdit (in Mac OS X) and Emacs (in most *NIX distributions, including Mac OS X). You will use the plain text editor to keep a running log of the commands you feed into the R application for processing. You absolutely do not want to just type out commands directly into R itself. Instead, you want to either copy-and-paste lines of code from your plain text editor into R, or instead read entire script files directly into R. You will always of course enter commands directly into R as you explore data or debug or merely play. But your serious work should be implemented through the plain text editor. There are several practical reasons for this.

First, editing commands in R is awkward. Inevitably, you will make typos. Editing a complex line of code in R to find and fix a single misplaced character is a pain. In contrast, in your plain text editor, you can easily move the cursor to any position to edit, without having to fight with the R command line at the same time.

Second, you can see the whole picture in the text editor, because it separates input from output. The text editor holds the plan of action. You want to plan your strategy and build the commands to implement it there, free from the clutter of intermediate outputs and warning messages and other vomitus of the R application itself. While it is possible to save a log of commands input into R, this hardly helps with planning.

Third, you want a secure and portable record of your work. Once you solve a scripting problem, you can consult your previous scripts to remember how you did it. And when a colleague asks you how you did an analysis, you can just email them the script. The only field I know of in which such a practice is actually common is economics (I believe AER enforces this policy?), but it should be the norm everyplace. The format should be portable, so that any colleague can open it on any computer system. Plain text files fit this bill. MS Word files do not. Additionally, a complex text processor like MS Word actually gets in the way, because it forces formatting and spell checking that has no effect on coding. You want a plain text editor to show every space you type.

You can add comments to your R scripts to help you plan the code and remember later what the code is doing. To make a comment, just begin a line with the # symbol. To help clarify the approach, below I provide a very short complete script for running a linear regression on one of R's built-in sets of data. Even if you don't know what the code does yet, hopefully you will see it as a basic model of clarity of formatting and use of comments.


R code 0.3
   # Load the data:
   # car braking distances in feet paired with speeds in km/h
   # see ?cars for details
   data(cars)

   # fit a linear regression of distance on speed
   m <- lm( dist ~ speed , data=cars )

   # estimated coefficients from the model
   coef(m)

   # plot residuals against speed
   plot( resid(m) ~ speed , data=cars )
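One convenient way to use such a script, as a minimal sketch of the workflow described above (the file name here is hypothetical, not from the book): save the code in a plain text file, then have R execute the whole file at once.

R code
   # run every command in the script file, in order
   # (assumes the file sits in R's current working directory)
   source( "cars_regression.R" )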


1. The Golem of Prague

In the 16th century, the House of Habsburg controlled much of Central Europe, the Netherlands, and Spain, as well as Spain's colonies in the Americas. The House was maybe the first true world power, with the Sun always shining on some portion of it. Its ruler was also Holy Roman Emperor, and his seat of power was Prague. The Emperor in the late 16th century, Rudolph II, loved intellectual life. He invested in the arts, the sciences (including astrology and alchemy), and mathematics, making Prague into a world center of learning and scholarship. It is appropriate then that in this learned atmosphere arose an early robot, the Golem of Prague.

A golem (GOH-lem) is a clay robot known in Jewish folklore, constructed from dust and fire and water. It is brought to life by inscribing emet, Hebrew for "truth," on its brow. Animated by truth, but lacking free will, a golem always does exactly what it is told. This is lucky, because the golem is incredibly powerful, able to withstand and accomplish more than its creators could. However, its obedience also brings danger, as careless instructions or unexpected events can turn a golem against its makers. Its abundance of power is matched by its lack of wisdom.

In some versions of the golem legend, the Rabbi Judah Loew ben Bezalel sought a way to defend the Jews of Prague. As in many parts of 16th century Central Europe, the Jews of Prague were persecuted. Using secret techniques from the Kabbalah, Rabbi Judah was able to build a golem, animate it with "truth," and order it to defend the Jewish people of Prague. Not everyone agreed with Judah's action, fearing unintended consequences of toying with the power of life. And ultimately Judah was forced to destroy the golem, as its combination of extraordinary power with clumsiness eventually led to innocent deaths. Wiping away one letter from the inscription emet to spell instead met, "death," Rabbi Judah decommissioned the robot.

1.1. Statistical Golems

Scientists also make golems.[1] Our golems rarely have physical form, but they too are often made of clay, living in silicon as computer code. These golems are scientific models. But these golems have real effects on the world, through the predictions they make and the intuitions they challenge or inspire. A concern with "truth" enlivens these models, but just like a golem or a modern robot, scientific models are neither true nor false, neither prophets nor charlatans. Rather they are constructs engineered for some purpose. These constructs are incredibly powerful, dutifully conducting their programmed calculations. Sometimes their unyielding logic reveals implications previously hidden to their designers.


But while they may be entirely logical in themselves, once let loose in the real world their logic can produce surprising, silly, or dangerous behavior. Rather than idealized angels of reason, scientific models are powerful clay robots without intent of their own, bumbling along according to the myopic instructions they embody. Like with Rabbi Judah's golem, the golems of science are wisely regarded with both awe and apprehension. We absolutely have to use them, but doing so always entails some risk.

There are many kinds of statistical models. Whenever someone deploys even a simple statistical procedure, like a classical t-test, she is deploying a small golem that will obediently carry out an exact calculation, performing it the same way (nearly[2]) every time, without complaint. Nearly every branch of science relies upon the senses of statistical golems. In many cases, it is no longer possible to even measure a phenomenon of interest, without making use of a model. To measure the strength of natural selection or the speed of a neutrino or the number of species in the Amazon, we must use models. The golem is a prosthesis, doing the measuring for us, performing impressive calculations, finding patterns where none are obvious.

However, there is no wisdom in the golem. It doesn't discern when the context is inappropriate for its answers. It just knows its own procedure, nothing else. It just does as it's told. And so it remains a triumph of statistical science that there are now so many diverse golems, each useful in a particular context. Viewed this way, statistics is neither mathematics nor a science, but rather a branch of engineering. And like engineering, a common set of design principles and constraints produces a great diversity of specialized applications.

This diversity of applications helps to explain why introductory statistics courses are so often confusing to the initiates. Instead of a single method for building, refining, and critiquing statistical models, students are offered a zoo of pre-constructed golems known as "tests." Each test has a particular purpose. Decision trees, like the one in FIGURE 1.1, are common. By answering a series of sequential questions, the user is meant to choose the "correct" procedure for their research circumstances.

Unfortunately, while experienced statisticians grasp the unity of these procedures, students and researchers rarely do. Advanced courses in statistics do emphasize engineering principles, but most scientists never get that far. Teaching statistics this way is somewhat like teaching engineering backwards, starting with bridge building and ending with basic physics. So students and many scientists tend to use charts like FIGURE 1.1 without much thought to their underlying structure, without much awareness of the models that each procedure embodies, and without any framework to help them make the inevitable compromises required by real research. It's not their fault.

For some, the toolbox of pre-manufactured golems is all they will ever need. Provided they stay within well-tested contexts, using only a few different procedures in appropriate tasks, a lot of good science can be completed. This is similar to how plumbers can do a lot of useful work without knowing much about fluid dynamics. Serious trouble begins when scholars move on to conducting innovative research, pushing the boundaries of their specialties. It's as if we got our hydraulic engineers by promoting plumbers.

Why aren't the tests enough for innovative research? The classical procedures of introductory statistics tend to be inflexible and fragile. By inflexible, I mean that they have very limited ways to adapt to unique research contexts. By fragile, I mean that they fail in unpredictable ways when applied to new contexts. This matters, because at the boundaries of most sciences, it is hardly ever clear which procedure is appropriate. None of the traditional golems has been evaluated in novel research settings, and so it can be hard to choose one


converses with idiosyncratic theories in a dialect barely understood by other scientists from other tribes. Statistical experts outside the discipline can help, but they are limited by lack of fluency in the empirical and theoretical concerns of the discipline. In such settings, pre-manufactured golems may do nothing useful at all. Worse, they might wreck Prague. And if we keep adding new types of tools, soon there will be far too many to keep track of.

Instead, what researchers need is some unified theory of golem engineering, a set of principles for designing, building, and refining special-purpose statistical procedures. Every major branch of statistical philosophy possesses such a unified theory. But the theory is never taught in introductory—and often not even in advanced—courses. So there are benefits in rethinking statistical inference as a set of strategies, instead of a set of pre-made tools.

1.2. Wrecking Prague

A lot can go wrong with statistical inference, and this is one reason that beginners are so anxious about it. When the framework is to choose a pre-made test from a flowchart, then the anxiety can mount as one worries about choosing the "correct" test. Statisticians, for their part, can derive pleasure from scolding scientists, which just makes the psychological battle worse.

But anxiety can be cultivated into wisdom. That is the reason that this book insists on working with the computational nuts-and-bolts of each golem. If you don't understand how the golem processes information, then you can't interpret the golem's output. This requires knowing the statistical model in greater detail than is customary, and it requires doing the computations the hard way, at least until you are wise enough to use the push-button solutions.

But there are conceptual obstacles as well, obstacles with how scholars define statistical objectives and interpret statistical results. Understanding any individual golem is not enough, in these cases. Instead, we need some statistical epistemology, an appreciation of how statistical models relate to hypotheses and the natural mechanisms of interest. What are we supposed to be doing with these little computational machines, anyway?

The greatest obstacle that I encounter among students and colleagues is the tacit belief that the proper objective of statistical inference is to test null hypotheses.[3] This is the proper objective, the thinking or unthinking goes, because Karl Popper argued that science advances by falsifying hypotheses. Karl Popper (1902–1994) is possibly the most influential philosopher of science, at least among scientists. He did persuasively argue that science works better by developing hypotheses which are, in principle, falsifiable. Seeking out evidence that might embarrass our ideas is a normative standard, and one that most scholars—whether they describe themselves as scientists or not—subscribe to. So statistical procedures should falsify hypotheses, if we wish to be good scientists.

But the above is a kind of folk Popperism, an informal philosophy of science common among scientists but not among philosophers of science. Science is not described by the falsification standard, as Popper recognized and argued.[4] In fact, deductive falsification is impossible in nearly every scientific context. In this section, I review two reasons for this impossibility.

(1) Hypotheses are not models. The relations among hypotheses and different kinds of models are complex. Many models correspond to the same hypothesis, and many hypotheses correspond to a single model. This makes strict falsification impossible.


FIGURE 1.2. Relations among hypotheses (left), detailed process models (middle), and statistical models (right), illustrated by the example of "neutral" models of evolution. The figure connects two hypotheses (H0: "Evolution is neutral"; H1: "Selection matters") to four process models (P0A: neutral, equilibrium; P0B: neutral, non-equilibrium; P1A: constant selection; P1B: fluctuating selection) and to three statistical models (MI, MII, MIII). Hypotheses (H) are typically vague, and so correspond to more than one process model (P). Statistical evaluations of hypotheses rarely address process models directly. Instead, they rely upon statistical models (M), all of which reflect only some aspects of the process models. As a result, relations are multiple in both directions: hypotheses do not imply unique models, and models do not imply unique hypotheses. This fact greatly complicates statistical inference.

the frequency distribution (histogram) of the frequency of different genetic variants (alleles). Some alleles are rare, appearing in only a few individuals. Others are very common, appearing in very many individuals in the population. A famous result in population genetics is that a model like P0A produces a power law distribution of allele frequencies. And so this fact yields a statistical model, MII, that predicts a power law in the data. In contrast the constant selection process model P1A predicts something quite different, MIII. Unfortunately, other selection models (P1B) imply the same statistical model, MII, as does the neutral model. We've reached the uncomfortable lesson:

(1) Any given statistical model (M) may correspond to more than one process model (P).
(2) Any given hypothesis (H) may correspond to more than one process model (P).
(3) Any given statistical model (M) may correspond to more than one hypothesis (H).

Now look what happens when we compare the statistical models to data. The classical approach is to take the "neutral" model as a null hypothesis. If the data are not sufficiently similar to the expectation under the null, then we say that we "reject" the null hypothesis. Suppose we follow the history of this subject and take P0A as our null hypothesis. This implies data corresponding to MII. But since the same statistical model corresponds to a selection


model P1B, it's not at all clear what we are to make of either rejecting or accepting the null. The null model is not unique to any process model nor hypothesis. If we reject the null, we can't really conclude that selection matters, because there are other neutral models that predict different distributions of alleles. And if we accept the null, we can't really conclude that evolution is neutral, because some selection models expect the same frequency distribution.

This is a huge bother. Once we have the diagram in FIGURE 1.2, it's easy to see the problem. But few of us are so lucky. While population genetics has recognized this issue, scholars in other disciplines continue to test frequency distributions against power law expectations, arguing even that there is only one neutral model.[10] Even if there were only one neutral model, there are so many non-neutral models that mimic the predictions of neutrality, that neither rejecting nor failing to reject the null model carries much inferential power.

So what can be done? Well, if you have multiple process models, a lot can be done. If it turns out that all of the process models of interest make very similar predictions, then you know to search for a different description of the evidence, a description under which the processes look different. For example, while P0A and P1B make very similar power law predictions for the frequency distribution of alleles, they make very dissimilar predictions for the distribution of changes in allele frequency over time. In other words, explicitly compare predictions of more than one model, and you can save yourself from some ordinary kinds of folly.

And even when the data do discriminate among recognized models, we have to imagine that there might be other process models that would correspond to the same predictions. This recognition lies behind many common practices in statistical inference, such as consideration of unobserved variables and sampling bias.

Rethinking: Entropy and model identification. One reason that statistical models routinely correspond to many different detailed process models is because they rely upon distributions like the normal, binomial, Poisson, and others. These distributions are members of a family, the EXPONENTIAL FAMILY. Nature loves the members of this family. Nature loves them, because nature loves entropy, and all of the exponential family distributions are MAXIMUM ENTROPY distributions. Taking the natural personification out of that explanation will take time. The practical implication is that one can no more infer evolutionary process from a power law than we can infer developmental process from the fact that height is normally distributed. This fact should make us humble about what typical regression models—the meat of this book—can teach us about mechanistic process. On the other hand, the maximum entropy nature of these distributions means we can use them to do useful statistical work, even when we can't identify the underlying error process. Not only can we not identify it, but we don't have to.

1.2.2. Measurement matters. The logic of falsification is very simple. We have a hypothesis H, and we show that it entails some observation D. Then we look for D. If we don't find it, we must conclude that H is false. Logicians call this kind of reasoning MODUS TOLLENS, which is Latin shorthand for "the method of destruction." In contrast, finding D tells us nothing necessarily about H, because other hypotheses might also predict D.

A compelling, but misleading, scientific fable that uses modus tollens concerns the color of swans. Before Europeans voyaged to Australia, all swans that any European had ever seen or read about had white feathers. This led to the belief that all swans are white. Let's call this a formal hypothesis:

   H0: All swans are white.


When Europeans reached Australia, however, they encountered swans with black feathers. This evidence seemed to instantly prove H0 to be false. Indeed, not all swans are white. Some are certainly black, according to all observers. The key insight here is that, before voyaging to Australia, no number of observations of white swans could prove H0 to be true. However it required only one observation of a black swan to prove it false.

This is a seductive story. If we can believe that important scientific hypotheses can be stated in this form, then we have a powerful method for improving the accuracy of our theories: look for evidence that disconfirms our hypotheses. Whenever we find a black swan, H0 must be false. Progress!

Seeking disconfirming evidence is important, but it cannot be as powerful as the swan story makes it appear. In addition to the correspondence problems among hypotheses and models, discussed in the previous section, most of the problems scientists confront are not so logically discrete. Instead, we most often face two simultaneous problems that make the swan fable misrepresentative. First, observations are prone to error, especially at the boundaries of scientific knowledge. Second, most hypotheses are quantitative, concerning degrees of existence, rather than discrete, concerning total presence or absence. Let's briefly consider each of these problems.

1.2.2.1. Observation error. All observers will agree under most conditions that a swan is either black or white. There are few intermediate shades, and most observers' eyes work similarly enough that there will be little, if any, disagreement about which swans are white and which are black. But this kind of example is hardly commonplace in science, at least in mature fields. Instead, we routinely confront contexts in which we are not sure if we have detected a disconfirming result. At the edges of scientific knowledge, the ability to measure a hypothetical phenomenon is often in question as much as the phenomenon itself. Here are two examples.

In 2005, a team of ornithologists from Cornell claimed to have evidence of an individual Ivory-billed Woodpecker (Campephilus principalis), a species thought extinct. The hypothesis implied here is:

   H0: The Ivory-billed Woodpecker is extinct.

It would only take one observation to falsify this hypothesis. However, many doubted the evidence. Despite extensive search efforts and a $50,000 cash reward for information leading to a live specimen, no evidence satisfying all parties has yet (by 2013) emerged. Even if good physical evidence does eventually arise, this episode should serve as a counter-point to the swan story. Finding disconfirming cases is complicated by the difficulties of observation. Black swans are not always really black swans, and sometimes white swans are really black swans. There are mistaken confirmations (false positives) and mistaken disconfirmations (false negatives). Against this background of measurement difficulties, scientists who already believe that the Ivory-billed Woodpecker is extinct will always be suspicious of a claimed falsification. Those who believe it is still alive will tend to count the vaguest evidence as falsification.

Another example, this one from physics, focuses on the detection of faster-than-light (FTL) neutrinos.[11] In September 2011, a large and respected team of physicists announced detection of neutrinos—small, neutral sub-atomic particles able to pass easily and harmlessly through most matter—that arrived from Switzerland to Italy in slightly faster-than-lightspeed time. According to Einstein, neutrinos cannot travel faster than the speed of light. So this seems to be a falsification of special relativity. If so, it would turn physics on its head.


The dominant reaction from the physics community was not "Einstein was wrong!" but instead "How did the team mess up the measurement?" The team that made the measurement had the same reaction, and asked others to check their calculations and attempt to replicate the result.

What could go wrong in the measurement? You might think measuring speed is a simple matter of dividing distance by time. It is, at the scale and energy you live at. But with a fundamental particle like a neutrino, if you measure when it starts its journey, you stop the journey. The particle is consumed by the measurement. So more subtle approaches are needed. The detected difference from lightspeed, furthermore, is quite small, and so even the latency of the time it takes a reading to travel from a detector to a control room can be orders of magnitude larger. And since the "measurement" in this case is really an estimate from a statistical model, all of the assumptions of the model are now suspect. As I write this in 2013, the physics community is unanimous that the FTL neutrino result was measurement error. They found the technical error, which involved a poorly attached cable, among other things.[12] Furthermore, neutrinos clocked from supernova events are consistent with Einstein, and those distances are much larger and so would reveal differences in speed much better.

In both the woodpecker and neutrino dramas, the key dilemma is whether the falsification is real or spurious. Measurement is complicated in both cases, but in quite different ways, rendering both true-detection and false-detection plausible. Popper himself was aware of this limitation inherent in measurement, and it may be one reason that Popper himself saw science as being broader than falsification. But the probabilistic nature of evidence rarely appears when practicing scientists discuss the philosophy and practice of falsification.[13] My reading of the history of science is that these sorts of measurement problems are the norm, not the exception.[14]

1.2.2.2. Continuous hypotheses. Another problem for the swan story is that most interesting scientific hypotheses are not of the kind "all swans are white" but rather of the kind:

   H0: 80% of swans are white.

Or maybe:

   H0: Black swans are rare.

Now what are we to conclude, after observing a black swan? The null hypothesis doesn't say black swans do not exist, but rather that they have some frequency. The task here is not to disprove or prove a hypothesis of this kind, but rather to estimate and explain the distribution of swan coloration as accurately as we can. Even when there is no measurement error of any kind, this problem will prevent us from applying the modus tollens swan story to our science.[15]

You might object that the hypothesis above is just not a good scientific hypothesis, because it isn't easy to disprove. But if that's the case, then most of the important questions about the world are not good scientific hypotheses. In that case, we should conclude that the definition of a "good hypothesis" isn't doing us much good. Now, nearly everyone agrees that it is a good practice to design experiments and observations that can differentiate competing hypotheses. But in many cases, the comparison must be probabilistic, a matter of degree, not kind.[16]

1.2.3. Falsification is consensual. The scientific community does come to regard some hypotheses as false. The caloric theory of heat and the geocentric model of the universe are no


longer taught in science courses, unless it's to teach how they were falsified. And evidence often—but not always—has something to do with such falsification.

But falsification is always consensual, not logical. In light of the real problems of measurement error and the continuous nature of natural phenomena, scientific communities argue towards consensus about the meaning of evidence. These arguments can be messy. After the fact, some textbooks misrepresent the history so it appears like logical falsification.[17] Such historical revisionism may hurt everyone. It may hurt scientists, by rendering it impossible for their own work to live up to the legends that precede them. It may make science an easy target, by promoting an easily attacked model of scientific epistemology. And it may hurt the public, by exaggerating the definitiveness of scientific knowledge.[18]

1.3. Three Tools for Golem Engineering

So if attempting to mimic falsification is not a generally useful approach to statistical methods, what are we to do? We are to model. Models can be made into testing procedures—all statistical tests are also models[19]—but they can also be used to measure, forecast, and argue. Doing research benefits from the ability to produce and manipulate statistical models, both because scientific problems are more general than "testing" and because the pre-made golems you maybe met in introductory statistics courses are ill-fit to many research contexts. If you want to reduce your chances of wrecking Prague, then some golem engineering know-how is needed. Make no mistake: You will wreck Prague eventually. But if you are a good golem engineer, at least you'll notice the destruction. And since you'll know a lot about how your golem works, you stand a good chance to figure out what went wrong. Then your next golem won't be as bad. Without the engineering training, you're always at someone else's mercy.

It can be hard to get a good education in statistical model building and criticism, though. Applied statistical modeling in the early 21st century is marked by the heavy use of several engineering tools that are almost always absent from introductory, and even many advanced, statistics courses. These tools aren't really new, but they are newly popular. And many of the recent advances in statistical inference depend upon computational innovations that feel more like computer science than classical statistics, so it's not clear who is responsible for teaching them, if anyone.

There are many tools worth learning. In this book I've chosen to focus on three broad ones that are in demand in both the social and biological sciences. These tools are:

(1) Bayesian data analysis
(2) Multilevel models
(3) Model comparison using information criteria

These tools are deeply related to one another, so it makes sense to teach them together. Understanding of these tools comes, as always, only with implementation—you can't comprehend golem engineering until you do it. And so this book focuses mostly on code, how to do things. But in the rest of this section, I provide brief introductions to the three tools.

1.3.1. Bayesian data analysis. For the classical Greeks and Romans, wisdom and chance were enemies. Minerva (Athena), symbolized by the owl, was the personification of wisdom. Fortuna (Tyche), symbolized by the wheel of fortune, was the personification of luck, both good and bad. Minerva was mindful and measuring, while Fortuna was fickle and unreliable. Only a fool would rely on Fortuna, while all wise folk appealed to Minerva.[20]


FIGURE 1.3. Saturn, much like Galileo must have seen it. The true shape is uncertain, but not because of any sampling variation. Probability theory can still help.

The rise of probability theory changed that. Statistical inference compels us instead to rely on Fortuna as a route to Minerva, to use chance and uncertainty to discover reliable knowledge. All of statistical inference has this motivation, to some extent. But Bayesian data analysis goes further than most traditions. It actually encodes knowledge with probability, using the language of chance to describe the plausibility of different possibilities.

The core premise is that we can use probability theory as a general way to represent uncertainty, whether that uncertainty references countable events in the world or rather theoretical constructs like parameters. Once you accept this gambit, the rest follows logically. Once we have defined our assumptions, Bayesian inference forces a uniquely logical way of processing that information to produce inference. In other words, Bayesian inference takes a question in the form of a model and uses logic to produce an answer in the form of probabilities.

To appreciate the nature of Bayesian logic, it helps to have something to compare it to. Bayesian probability stands in contrast to another important probability interpretation in statistics, the FREQUENTIST perspective, which requires that all probabilities be defined by connection to countable events.[21] This leads to frequentist uncertainty being premised on imaginary resampling of data—if we were to repeat the measurement many, many times, we would end up collecting a list of values that will have some pattern to it. The distribution of these measurements is called a SAMPLING DISTRIBUTION. This resampling is never done, and in general it doesn't even make sense—repeat sampling of the diversification of songbirds in the Andes, anyone? As Sir Ronald Fisher put it:

   [...] the only populations that can be referred to in a test of significance have no objective reality, being exclusively the product of the statistician's imagination [...][22]

But in some contexts, like controlled greenhouse experiments, it's a useful device for describing uncertainty. Whatever the context, it's just part of the model, an assumption about what the data would look like under resampling. It's just as fantastical as the Bayesian gambit of using probability to describe uncertainty.[23]


But the different attitudes towards probability do enforce different tradeoffs. It will help to encounter a simple example where the difference between Bayesian and frequentist probability matters. In the year 1610, Galileo turned a primitive telescope to the night sky and became the first human to see Saturn's rings. Well, he probably saw a blob, with some smaller blobs attached to it (FIGURE 1.3). Since the telescope was primitive, it couldn't really focus the image very well. Saturn always appeared blurred. This is a statistical problem, of a sort. There's uncertainty about the planet's shape, but notice that none of the uncertainty is a result of variation in repeat measurements. We could look through the telescope a thousand times, and it will always give the same blurred image (for any given position of the Earth and Saturn). So the sampling distribution of any measurement is constant, because the measurement is deterministic—there's nothing "random" about it. Frequentist statistical inference has a lot of trouble getting started here. In contrast, Bayesian inference proceeds as usual, because the deterministic "noise" can still be modeled using probability, as long as we don't identify probability with frequency. As a result, the field of image reconstruction and processing is dominated by Bayesian algorithms.[24]

In more routine statistical procedures, like linear regression, this difference in probability concepts has less of an effect. However, it is important to realize that even when a Bayesian procedure and frequentist procedure give exactly the same answer, our Bayesian golems aren't justifying their inferences with imagined repeat sampling. More generally, Bayesian golems treat "randomness" as a property of information, not of the world. Nothing in the real world—excepting controversial interpretations of quantum physics—is actually random. Presumably, if we had more information, we could exactly predict everything. We just use randomness to describe our uncertainty in the face of incomplete knowledge. From the perspective of our golem, the coin toss is "random," but it's really the golem that is random, not the coin.

Rethinking: Probability is not unitary. It will make some readers uncomfortable to suggest that there is more than one way to define "probability." Aren't mathematical concepts uniquely correct? They are not. Once you adopt some set of premises, or axioms, everything does follow logically in mathematical systems. But the axioms are open to debate and interpretation. So not only is there "Bayesian" and "frequentist" probability, but there are different versions of Bayesian probability even, relying upon different arguments to justify the approach. In more advanced Bayesian texts, you'll come across names like Bruno de Finetti, Richard T. Cox, and Leonard "Jimmie" Savage. Each of these figures is associated with a somewhat different conception of Bayesian probability. There are others. This book mainly follows the "logical" Cox (better labeled Laplace-Jeffreys-Cox-Jaynes) interpretation, but dips into de Finetti (a "betting" interpretation) occasionally.

How can different versions of probability thrive, even within the same analyst? By themselves, mathematical objects don't "mean" anything, in the sense of real world implication. What does it mean to take the square root of a negative number? What does it mean to take a limit as something approaches infinity? These are essential and routine concepts, but their meanings depend upon context and analyst, upon beliefs about how well abstraction represents reality. Mathematics doesn't access the real world directly. So answering such questions remains a contentious and entertaining project, in all branches of applied mathematics. So while everyone subscribes to the same axioms of probability, not everyone agrees in all contexts about how to interpret probability.

Before moving on to describe the next two tools, it's worth emphasizing an advantage of Bayesian data analysis, at least when scholars are learning statistical modeling. This entire book could be rewritten to remove any mention of "Bayesian." In places, it would become


easier. In others, it would become much harder. But having taught applied statistics both ways, I have found that the Bayesian framework presents a distinct pedagogical advantage: many people find it more intuitive. Perhaps the best evidence for this is that very many scientists interpret non-Bayesian results in Bayesian terms, for example interpreting ordinary P-values as Bayesian posterior probabilities and non-Bayesian confidence intervals as Bayesian ones (you'll learn about posterior probability and confidence intervals in Chapters 2 and 3). Even statistics instructors make these mistakes.[25] In this sense then, Bayesian models lead to more intuitive interpretations, the ones scientists tend to project onto statistical results.

The opposite pattern of mistake—interpreting a posterior probability as a P-value—seems to happen only rarely, if ever. The reason for the asymmetry may be that the errors tend to be in the hopeful direction. Bayesian posterior probabilities do provide information about what we hope to reason with—probabilities of models and parameters. No one appears inclined to interpret the results instead as less-informative probabilities of data. After all, we already know the data.

Note that none of this ensures that Bayesian models will be more correct than non-Bayesian models. It just means that the scientist's intuitions will less commonly be at odds with the actual logic of the framework. This simplifies some of the aspects of teaching statistical modeling.

Rethinking: A little history. Bayesian statistical inference is much older than the typical tools of introductory statistics, most of which were developed in the early 20th century. Versions of the Bayesian approach were applied to scientific work in the late 1700s and repeatedly in the 19th century. But after World War I, anti-Bayesian statisticians, like Sir Ronald Fisher, succeeded in marginalizing the approach. All Fisher said about Bayesian analysis (then called inverse probability) in his influential 1925 handbook was:

   [...] the theory of inverse probability is founded upon an error, and must be wholly rejected.[26]

Bayesian data analysis became increasingly accepted within statistics during the second half of the 20th century, because it proved not to be founded upon an error. All philosophy aside, it worked. Beginning in the 1990s, new computational approaches led to a rapid rise in application of Bayesian methods.[27] Bayesian methods remain computationally expensive, however. And so as data sets have increased in scale—millions of rows is common in genomic analysis, for example—alternatives to or approximations to Bayesian inference remain important.

1.3.2. Multilevel models. In an apocryphal telling of Hindu cosmology, it is said that the Earth rests on the back of a great elephant, who in turn stands on the back of a massive turtle. When asked upon what the turtle stands, a guru is said to reply, "it's turtles all the way down." Statistical models don't contain turtles, but they do contain parameters. And parameters support inference. Upon what do parameters themselves stand? Sometimes, in some of the most powerful models, it's parameters all the way down. What this means is that any particular parameter can be usefully regarded as a placeholder for a missing model. Given some model of how the parameter gets its value, it is simple enough to embed the new model inside the old one.
This results in a model with multiple levels of uncertainty, each feeding into the next: a MULTILEVEL MODEL. A minimal sketch of the idea appears just below.
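To make the turtles concrete, here is a small simulation, not from the book, of data with two levels of uncertainty. Observations within each of five clusters scatter around that cluster's mean, and the cluster means themselves scatter around a higher-level mean. The names used here (mu, tau, alpha, and so on) are hypothetical illustrations, not the book's notation.

R code
   set.seed(1)
   mu    <- 10   # higher-level mean of the cluster means
   tau   <- 2    # higher-level spread of the cluster means
   sigma <- 1    # spread of observations within each cluster

   # level 2: a model for the parameters themselves
   alpha <- rnorm( 5 , mean=mu , sd=tau )

   # level 1: a model for the observations, 10 per cluster
   j <- rep( 1:5 , each=10 )    # cluster id for each observation
   y <- rnorm( 50 , mean=alpha[j] , sd=sigma )

In a fitted multilevel model, the logic runs in reverse: the golem estimates both the cluster-level parameters alpha and the higher-level parameters mu and tau from the data y.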


Multilevel models—also known as hierarchical, mixed effects, or random effects models—are becoming de rigueur in the biological and social sciences. Fields as diverse as educational testing and bacterial phylogenetics now depend upon routine multilevel models to process data. Like Bayesian data analysis, multilevel modeling is not particularly new, but it has only been available on desktop computers for a few decades. And since such models have a natural Bayesian representation, they have grown hand-in-hand with Bayesian data analysis.

There are four typical reasons that researchers use multilevel models.

(1) To adjust estimates for repeat sampling. When more than one observation arises from the same individual, location, or time, then traditional, single-level models may mislead us.

(2) To adjust estimates for imbalance in sampling. When some individuals, locations, or times are sampled more than others, we may also be misled by single-level models.

(3) To study variation. If our research questions include variation among individuals or other groups within the data, then multilevel models are a big help, because they model variation explicitly.

(4) To avoid averaging. Frequently, scholars pre-average some data to construct variables for a regression analysis. This can be dangerous, because averaging removes variation. It therefore manufactures false confidence. Multilevel models allow us to preserve the uncertainty in the original, pre-averaged values, while still using the average to make predictions.

All four apply to contexts in which the researcher recognizes clusters or groups of measurements that may differ from one another. These clusters or groups may be individuals such as different students, locations such as different cities, or times such as different years. Since each cluster may well have a different average tendency or respond differently to any treatment, clustered data often benefit from being modeled by a golem that expects such variation.

But the scope of multilevel modeling is much greater than these examples. Diverse model types turn out to be multilevel: models for missing data (imputation), measurement error, factor analysis, some time series models, types of spatial and network regression, and phylogenetic regressions all are special applications of the multilevel strategy. This is why, for some scholars, grasping the concept of multilevel modeling leads to a perspective shift. Suddenly single-level models end up looking like mere components of multilevel models. The multilevel strategy provides an engineering principle to help us to introduce these components into a particular analysis, exactly where we think we need them.

I want to convince the reader of something that appears unreasonable: multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach. Certainly some data and contexts do not need the multilevel treatment. But most contemporary studies in the social and natural sciences, whether experimental or not, would benefit from it. Perhaps the most important reason is that even well-controlled treatments interact with unmeasured aspects of the individuals, groups, or populations studied. This leads to variation in treatment effects, in which individuals or groups vary in how they respond to the same circumstance. Multilevel models attempt to quantify the extent of this variation, as well as identify which units in the data responded in which ways.

These benefits don't come for free, however. Fitting and interpreting multilevel models can be considerably harder than fitting and interpreting a traditional regression model. In practice, many researchers simply trust their black-box software and interpret multilevel regression exactly like single-level regression. In time, this will change. There was a time in applied statistics when even ordinary multiple regression was considered cutting edge,
In time, this will change. There was a time in applied statistics when even ordinary multiple regression was considered cutting edge,


something for only experts to fiddle with. Instead, scientists used many simple procedures, like t-tests. Now, almost everyone uses the better multivariate tools. The same will eventually be true of multilevel models. But scholarly culture and curriculum still have a little catching up to do.

1.3.3. Model comparison and information criteria. Beginning seriously in the 1960s and 1970s, statisticians began to develop a peculiar family of metrics for comparing structurally different models: INFORMATION CRITERIA. All of these criteria aim to let us compare models based upon future predictive accuracy. But they do so in more and less general ways and by using different approximations for different model types. So the number of unique information criteria in the statistical literature has grown quite large. Still, they all share this common enterprise.

The most famous information criterion is AIC, the Akaike (ah-kah-ee-kay) information criterion. AIC and related metrics—we'll discuss DIC and WAIC as well—explicitly build a model of the prediction task and use that model to estimate the performance of each model you might wish to compare. Because the prediction is modeled, it depends upon assumptions. So information criteria do not in fact achieve the impossible, by seeing the future. They are still golems.

AIC and its kin are known as "information" criteria, because they develop their measure of model accuracy from INFORMATION THEORY. Information theory has a scope far beyond comparing statistical models. But it will be necessary to understand a little bit of general information theory, in order to really comprehend information criteria. So in Chapter 6, you'll also find a conceptual crash course in information theory.

What AIC and its kin actually do for a researcher is help with two common difficulties in model comparison.

(1) The most important statistical phenomenon that you may have never heard of is OVERFITTING. 28 Overfitting is the subject of Chapter 6. For now, you can understand it as a consequence of a golem learning too much from the data. Future data will not be exactly like past data, and so any golem that is unaware of this fact tends to make worse predictions than it could. The surprising consequence is that more complex statistical models always fit the available data better than simpler ones, even when the additional complexity has nothing to do with the true causal process. So if we wish to make good predictions, we cannot judge our models simply on how well they fit our data.

(2) A major benefit of using AIC and its kin is that they allow for comparison of multiple non-null models to the same data. Frequently, several plausible models of a phenomenon are known to us. The neutral evolution debate (page 18) is one example. In some empirical contexts, like social networks and evolutionary phylogenies, there may be no reasonable or uniquely "null" model. This was also true of the neutral evolution example. In such cases, it's not only a good idea to explicitly compare models. It's also mandatory. Information criteria aren't the only way to conduct the comparison. But they are an accessible and widely used way.

Multilevel modeling and Bayesian data analysis have been worked on for decades and centuries, respectively. Information criteria are comparatively very young. Many statisticians have never used information criteria in an applied problem, and there is no consensus about which metrics are best and how best to use them.
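To make this concrete, here is a tiny preview using base R's lm and AIC (illustration only; the tools developed later in this book are different). A needlessly complex model always fits the sample at least as well, but AIC charges a price for each extra parameter.

set.seed(1)
x <- rnorm(50)
y <- 0.5*x + rnorm(50)       # simulate data with a purely linear relationship
m1 <- lm( y ~ x )            # the structure that generated the data
m2 <- lm( y ~ x + I(x^2) )   # adds a superfluous quadratic term
AIC( m1 , m2 )               # lower AIC = better estimated predictive accuracy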
Still, information criteria are already in frequent use in the sciences—appearing in prominent publications and featuring in


prominent debates 29 —and a great deal is known about them, both from analysis and experience.

Need some bridge to next chapter?

1.4. Plan


2 Small Worlds and Large Worlds

When Cristoforo Colombo (Christopher Columbus) infamously sailed west in the year 1492, he believed that the Earth was spherical. In this, he was like most educated people of his day. He was unlike most people, though, in that he also believed the planet was much smaller than it actually is—only 30,000 km around its middle instead of the actual 40,000 km (FIGURE 2.1). 30 This was one of the most consequential mistakes in European history. If Colombo had believed instead that the Earth was 40,000 km around, he would have correctly reasoned that his fleet could not carry enough food and potable water to complete a journey all the way westward to Asia. But at 30,000 km around, Asia would lie a bit west of the coast of California. It was possible to carry enough supplies to make it that far. Emboldened in part by his unconventional estimate, Colombo set sail, eventually making landfall in the Bahamas.

Colombo made a prediction based upon his view that the world was small. But since he lived in a large world, aspects of the prediction were wrong. In his case, the error was lucky. His small world model was wrong in an unanticipated way: there was a lot of land in the way. If he had been wrong in the expected way, with nothing but ocean between Europe and Asia, he and his entire expedition would have run out of supplies long before reaching the East Indies.

Colombo's small and large worlds provide a contrast between model and reality. All statistical modeling has these same two frames: the small world of the model itself and the large world we hope to deploy the model in. 31 Navigating between these two worlds remains a central challenge of statistical modeling. The challenge is made harder by forgetting the distinction.

The SMALL WORLD is the self-contained logical world of the model. Within the small world, all possibilities are nominated. There are no pure surprises, like the existence of a huge continent between Europe and Asia. Within the small world of the model, it is important to be able to verify the model's logic, making sure that it performs as expected under favorable assumptions. Bayesian models have some advantages in this regard, as they have reasonable claims to optimality: no alternative model could produce more accurate estimates, assuming the small world is an accurate description of the real world. In contrast, other modeling traditions, while useful on the whole, can more easily produce logically inconsistent models on small world terms.

The LARGE WORLD is the broader context in which one deploys a model. In the large world, there may be events that were not imagined in the small world. Moreover, the model is always an incomplete representation of the large world, and so will make mistakes, even if all kinds of events have been properly nominated. The logical consistency of a model in


the small world is no guarantee that it will be optimal in the large world. But it is certainly a warm comfort.

FIGURE 2.1. On the left, Martin Behaim's 1492 globe, showing the small world that Colombo anticipated. The big island labeled "Cipangu" is Japan. On the right, the same globe with an overlay of the Americas. Overlay made with G.Projector: http://www.giss.nasa.gov/tools/gprojector/

In this chapter, you will begin to build Bayesian models. The way that Bayesian models learn from evidence is arguably optimal in the small world. When their assumptions approximate reality, they also perform well in the large world. But large world performance has to be demonstrated rather than logically deduced, just like with all kinds of models.

This chapter focuses on the small world, where the logical virtues of Bayesian data analysis are on display. It explains probability theory in its essential form: counting the ways things can happen. Bayesian inference arises automatically from this perspective. Then the chapter presents the standard components of a Bayesian statistical model, a model for learning from data. It shows you how to animate the model, to produce estimates. All this work provides a foundation for the next chapter, in which you'll learn to summarize Bayesian estimates, as well as to consider large world obligations.

2.1. Probability is just counting

Suppose you have a globe representing our planet, the Earth. This version of the world is small enough to hold in your hands. You are curious how much of the surface is covered in water. You adopt the following strategy. You will toss the globe up in the air. When you catch it, you will record whether or not the surface under your right index finger is water or land. Then you toss the globe up in the air again and repeat the procedure. 32 This strategy generates a sequence of surface samples from the globe. The first 9 samples might look like:

W L W W W L W L W

where W indicates water and L indicates land. So you observe 6 W (water) observations and 3 L (land) observations. Call this sequence of observations the data.

In order to use these data to learn about the proportion of water on the globe, we'll go right to the heart of probability theory. We'll get to the heart by forgetting about probabilities and just counting the ways things can happen. We'll just tally all the ways different possible


proportions of water—from 0 to 1—could produce the data. The number of ways that each possible proportion could produce the data will provide a relative measure of its plausibility. And that's Bayesian inference.

2.1.1. Enumerate the possibilities. To begin, we'll simplify the analysis by considering only 11 different proportions of water: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1. This abbreviated list will get the lesson across. Then in the next major section, you'll see how to generalize to an infinite number of possibilities.

This leads us to a forced choice: what do we think are the relative plausibilities of each of these 11 proportions, before we've seen any data? The easiest answer, and the most conventional, is to answer that we want to count all of them equally, until we have a good reason not to. 33 We do already have good reason not to: since you are probably reading this book on dry land, there is obviously some land. So the proportion of water is not 1. And since you know there are oceans on planet Earth, the proportion is not 0. But we'll work with the equal counting approach, because it is foundational and easiest to understand. Later in the chapter, we'll repeat the analysis from a different beginning, so you can see its impact. For now, think of this as announcing:

"I do not know which of these possibilities is correct. So until I have good reason to do otherwise, I'll keep track of all of the possibilities equally."

As you'll see later, if ever you receive information that contradicts this egalitarian bookkeeping, it's very easy to incorporate it.

So begin by stating that there's at least one way that each proportion could be the correct one. So we assign a count of "1" to each possibility. Let's start organizing this information in a table:

proportion   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
ways           1    1    1    1    1    1    1    1    1    1    1

Going forward, we'll keep counting the ways things can happen, updating this table at each step.

2.1.2. How many ways can the data happen? Building on top of this meager beginning, consider only the first datum: W (water). How many ways can each of the possible proportions generate this observation? To answer this question, it helps to remember that globe tosses, like coin flips, are not really random. The physics are deterministic. It's just that we don't know enough about the initial conditions and cannot measure them with sufficient precision in order to predict the outcome. Probability indicates ignorance of an observer—whether machine or person—not true indeterminacy. The "randomness" of the globe or coin resides in us and the machines we build, never in the globe or coin itself.

So consider the globe toss in detail. When you pick up the inflatable globe, you hold it in a particular way. There are very many ways to hold it. Then when you toss it in the air, there are very many ways to impart momentum to it. Then as the globe briefly defies gravity, it interacts with air currents and particles in the air. There are many possibilities here as well. And when you catch the globe, there are again many ways to grasp and hold it, each leading to either "water" or "land." All of these possibilities form a garden of forking paths, each a perfectly deterministic series of events, producing a perfectly deterministic outcome.

But since we do not know enough to identify the one unique path that did happen, we speak of the globe toss as "random." What's important is the relative numbers of paths that


produce W (water) or L (land). Say for example that there are a million ways to toss the globe. You can probably appreciate that a globe that is half blue and half green will have roughly the same number of ways of producing W as of producing L, about half a million ways to observe W and half a million ways to observe L. So for a globe with any proportion of water, it should be true that the relative number of ways to toss the globe that will produce W is approximately equal to the proportion of water on the surface of the globe. For example (and still assuming a million ways to toss the globe), a globe that is covered with 20% water will have about 200-thousand ways to produce W and 800-thousand ways to produce L. Of course we cannot be sure this is correct—it is an assumption. As you'll see, there's no avoiding assumptions, if you hope to learn from data.

This assumption allows us to count up the ways that each possible proportion of water could have produced the first observation, W. Consider the first possibility, a proportion of zero. This is the easiest case, as there must be zero ways for such a globe to produce a W. Now consider the second possible proportion, 0.1. This means that 10% of the ways should produce a W. For simplicity of calculation, suppose there are only 10 meaningfully different ways to toss the globe. So a proportion of 0.1 indicates that 1 of these ways produces W and 9 of them produce L. Similarly for the possibility 0.2, 2 of these ways produce W and 8 produce L. We can tabulate all of these counts to summarize the relative numbers of ways that each possible proportion could produce the first observation:

proportion      0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
ways              1    1    1    1    1    1    1    1    1    1    1
ways to see W     0    1    2    3    4    5    6    7    8    9   10

And I display these counts in the lefthand plot of FIGURE 2.2.

FIGURE 2.2. Left: The relative numbers of ways for each possible proportion of water (horizontal axis) to produce an observation of W (water). Right: The relative numbers of ways for each possible proportion to produce an observation of L (land).

There's no magic here. All we've done is tabulate that each possible proportion of water on the globe has a different (relative) number of ways to produce the observation W. If we had seen L (land) instead, the numbers of ways would be shown by the righthand plot in FIGURE 2.2. These counts just


reverse the order, because the problem is symmetric: if there are 2 out of 10 ways to see W, then there are 8 out of 10 ways to see L.

2.1.3. Update the possibilities. We're not really interested directly in the numbers of ways that each possibility can produce the data. Those counts are just a means to an end. Instead, we're directly interested in the number of ways that each possibility could be correct. So now we use the initial relative counts for each possibility, in combination with the numbers of ways to see the data, to produce a new count of ways that each possible proportion of water could happen. These counts are, as before, proportional to the relative plausibility of each possibility, because things that can happen in more ways are more plausible.

The way to update the counts for each possible proportion is just to count. Consider the first possibility, a proportion of zero. There was initially 1 (relative) way for it to be correct, by assumption. For this one way, we found that there are zero ways for it to produce the observation W. So there are 1 × 0 = 0 ways for a proportion of zero to be correct. Similarly for the second possibility, there was 1 way for 0.1 to be correct, and we counted 1 way (out of every 10 ways) for it to produce the observation W, resulting in a new count of 1 × 1 = 1. And so on, for all of the other possibilities. The new counts, and therefore new relative plausibilities, are shown in the table.

proportion        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
ways                1    1    1    1    1    1    1    1    1    1    1
after seeing W      0    1    2    3    4    5    6    7    8    9   10

That's easy enough. But consider now the second observation in the sequence of globe tosses, L. The numbers of ways for each possible proportion to produce this observation are given by the righthand plot in FIGURE 2.2, as explained before. To update again the numbers of ways each possible proportion could be correct, again take the current numbers and multiply them by the respective numbers of ways each possibility could produce the data. This gives us:

proportion         0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
ways                 1    1    1    1    1    1    1    1    1    1    1
after seeing W       0    1    2    3    4    5    6    7    8    9   10
ways to see L       10    9    8    7    6    5    4    3    2    1    0
after seeing WL      0    9   16   21   24   25   24   21   16    9    0

You can confirm these calculations quickly in R. First make a list of initial ways for each possible proportion to be correct:

ways <- rep( 1 , 11 )   # one initial way for each of the 11 proportions


And now we can multiply these lists to repeat the calculations shown in the table. After seeing the first W (the parentheses around the line of code below just force R to print out the values—it would perform the calculation silently, otherwise):

R code 2.3
( ways <- ways * 0:10 )   # update with the relative ways to see W

[1]  0  1  2  3  4  5  6  7  8  9 10

FIGURE 2.3. The numbers of ways, for each possible proportion of water (horizontal), to observe the sequence of outcomes WL (vertical). The taller the bar, the more ways for nature to produce the data, given the proportion of water on the horizontal axis.
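The same move folds in the second observation, L: multiply the current counts by the relative ways to see L (a minimal sketch, reconstructed from the table above):

( ways <- ways * 10:0 )   # update with the relative ways to see L

[1]  0  9 16 21 24 25 24 21 16  9  0

These are the counts plotted in FIGURE 2.3, and they match the "after seeing WL" row of the table.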


So the counts are just as honest as they are dependent upon the information used to define them.

2.1.4. More updating. But we're not done. There's more data. So let's see how the counts evolve as we fold in each remaining observation. The full sequence of globe tosses was:

W L W W W L W L W

You've already accounted for the first two. I'll go ahead and fill in the table, accounting for each remaining observation in turn:

proportion  start    W     L     W      W      W       L       W       L        W
0.0             1    0     0     0      0      0       0       0       0        0
0.1             1    1     9     9      9      9      81      81     729      729
0.2             1    2    16    32     64    128    1024    2048   16384    32768
0.3             1    3    21    63    189    567    3969   11907   83349   250047
0.4             1    4    24    96    384   1536    9216   36864  221184   884736
0.5             1    5    25   125    625   3125   15625   78125  390625  1953125
0.6             1    6    24   144    864   5184   20736  124416  497664  2985984
0.7             1    7    21   147   1029   7203   21609  151263  453789  3176523
0.8             1    8    16   128   1024   8192   16384  131072  262144  2097152
0.9             1    9     9    81    729   6561    6561   59049   59049   531441
1.0             1   10     0     0      0      0       0       0       0        0

This table is oriented with data along the top and the possible proportions on the left side. We started this analysis by saying there is at least one way for each possibility to be right, as seen in the "start" column. This just means that we'll weight each possibility equally, for now. Then the first W observation leads us to compute that there are more ways that the larger proportions could produce the data. That's the first W column. As each subsequent observation is incorporated, the numbers of ways for each possibility to produce the sequence of globe tosses grow rapidly. By the time we reach the righthand edge, there are more than 3 million ways for the proportion 0.7 to produce the observed sequence.

2.1.5. From counts to probabilities. Of course the raw magnitude of these numbers is of no interest. It's only the relative magnitudes that matter. So usually we don't work with these raw counts. Instead we work with standardized counts. And that is all that's left for us to arrive at probability in the usual form.

Consider the numbers in the final column of the table in the previous section. If we add up all of these numbers, we get the total number of ways for all possibilities to produce the sequence of data. Let's compute that number. Here's R code to compute the values in the final column (reconstructed from the counts above):

ways_start <- rep( 1 , 11 )                # initial ways, all equal
ways_W <- 0:10                             # relative ways to see W
ways_L <- 10:0                             # relative ways to see L
ways <- ways_start * ways_W^6 * ways_L^3   # six W's and three L's, in any order


FIGURE 2.4. Left: Final numbers of ways for each possible proportion of water (horizontal axis) to produce the entire sequence of data. Each rectangle represents 100-thousand. Right: The standardized, relative numbers of ways for each possible proportion of water (horizontal axis) to produce the observed data sequence.

Now we add up these counts and divide each by the sum. This yields the standardized plausibility of each possible proportion:

R code 2.6
# add them all up
total_ways <- sum( ways )
# divide by the total, so the standardized values sum to 1
ways_std <- ways / total_ways
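Assuming the reconstruction above, you can inspect both quantities directly. The tallest bar in the righthand plot of FIGURE 2.4 belongs to the proportion 0.7:

total_ways              # 11912505
round( ways_std , 2 )   # 0.00 0.00 0.00 0.02 0.07 0.16 0.25 0.27 0.18 0.04 0.00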


Only one possibility (or none) can be correct, so don't think that just because these numbers are now probabilities that they define frequencies of any events. Probabilities change when our information changes; frequencies do not. And that's because probabilities are just a shortcut to counting up possibilities.

If you find all of this rather deflationary—probability theory seems kinda simple minded—good. Probability theory is just a mechanism for keeping track of the ways that things can happen, according to our assumptions. It isn't the mind of angels.

2.2. Colombo's first Bayesian model

You can solve any problem in probability by just counting the numbers of ways each possibility could produce each event of interest. But that's tedious. So instead we nearly always work with standardized counts, probabilities, and decompose the counts into component probabilities that represent the different elements in our counting in the previous section. This decomposition comprises a statistical model of one sort.

Consider four different kinds of things we counted in the previous section.

(1) The initial number of ways each possibility could be correct.
(2) The number of ways each possibility could produce an individual W or L observation.
(3) The accumulated number of ways each possibility could produce the entire data.
(4) The total number of ways to produce the data, across all possibilities.

Each of these counts has a direct analog in Bayesian data analysis.

[older content follows, still needs revision]

To get the logic moving, we need to make assumptions, and these assumptions constitute the model. Designing a simple Bayesian model benefits from a design loop with three steps.

(1) Data story: Motivate the model by narrating how the data might arise.
(2) Update: Educate your model by feeding it the data.
(3) Evaluate: All statistical models require supervision, leading possibly to model revision.

The next sections walk through these steps, in the context of the globe tossing evidence.

Rethinking: The logic of plausibility. Logic is usually identified with Aristotle and rules for consistent reasoning about true and false statements. The logic applied here is a continuous version that deals in the plausibility of each statement. The American physicist Richard T. Cox showed that if one wishes to reason with plausibility and still be consistent with Aristotelian logic, then one is forced to use probability theory to perform the reasoning. 34 The consequence of this fact is that Bayesian analysis, which obeys the laws of probability, is a consistent system for plausible reasoning.

2.2.1. A data story. Bayesian data analysis usually means producing a story for how the data came to be. This story may be descriptive, specifying associations that can be used to predict outcomes, given observations. Or it may be causal, a theory of how some events produce other events. Typically, any story you intend to be causal may also be descriptive. But many descriptive stories are hard to interpret causally. All data stories are complete, however, in the sense that they are sufficient for specifying an algorithm for simulating new data. In later sections of this chapter, you'll see examples of doing just that, as simulating new data is a necessary part of model criticism.
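Once the components in the following sections are in place, simulating new data really is a one-liner. For the globe example (a sketch using R's built-in binomial sampler; the value 0.7 is purely illustrative, not an estimate):

rbinom( 10 , size=9 , prob=0.7 )   # 10 simulated counts of water out of 9 tosses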


You can motivate your data story by trying to explain how each piece of data is born. This usually means describing aspects of the underlying reality as well as the sampling process. The data story in this case is simply a restatement of the sampling process:

(1) The true proportion of water covering the globe is p.
(2) A single toss of the globe has a probability p of producing a water (W) observation. It has a probability 1 − p of producing a land (L) observation.
(3) Each toss of the globe is independent of the others.

The data story is then translated into a formal probability model. This probability model is easy to build, because the construction process can be usefully broken down into a series of component decisions. Before meeting these components, however, it'll be useful to visualize how a Bayesian model behaves. After you've become acquainted with how such a model learns from data, we'll pop the machine open and investigate its engineering.

Rethinking: The value of storytelling. The data story has value, even if you quickly abandon it and never use it to build a model or simulate new observations. Indeed, it is important to eventually discard the story, because many different stories always correspond to the same model. As a result, showing that a model does a good job does not in turn uniquely support our data story. Still, the story has value because in trying to outline the story, often one realizes that additional questions must be answered. Most data stories are much more specific than are the verbal hypotheses that inspire data collection. Hypotheses can be vague, such as "it's more likely to rain on warm days." When you are forced to consider sampling and measurement and make a precise statement of how temperature predicts rain, many stories and resulting models will be consistent with the same vague hypothesis. Resolving that ambiguity often leads to important realizations and model revisions, before any model is fit to data.

2.2.2. Bayesian updating. Our problem is one of using the evidence—the sequence of globe tosses—to decide among different possible proportions of water on the globe. Each possible proportion may be more or less plausible, given the evidence. A Bayesian model begins with one set of plausibilities assigned to each of these possibilities. Then it updates them in light of the data. This updating process is a kind of learning, called BAYESIAN UPDATING. The details of this updating—how it is mechanically achieved—can wait until later in the chapter. For now, let's look at how such a machine behaves.

In this example, let's program our Bayesian machine to initially assign the same plausibility to every proportion of water, every value of p. Now look at the top-left plot in FIGURE 2.5. The dashed horizontal line represents this initial, equal plausibility of the values of p.

After seeing the first toss, which is a "W," the model updates the plausibilities to the solid line. The plausibility of p = 0 has now fallen to exactly zero—the equivalent of "impossible." Why? Because we observed at least one speck of water on the globe, so now we know there is some water. The model executes this logic automatically. You don't have to instruct it to account for this consequence. Probability theory takes care of it for you.

Likewise, the plausibility of p > 0.5 has increased. This is because there is not yet any evidence that there is land on the globe, so the initial plausibilities are modified to be consistent with this.
Note however that the relative plausibilities are what matter, and there isn't yet much evidence. So the differences in plausibility are not yet very large. In this way, the amount of evidence seen so far is embodied in the plausibilities of each value of p.
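If you'd like to watch this machine run before its internals are explained, here is a compact sketch (not the book's code) that replays the updating in FIGURE 2.5 on a dense grid of candidate values of p:

p <- seq( 0 , 1 , length.out=100 )              # candidate proportions of water
plaus <- rep( 1 , 100 )                         # equal initial plausibility
tosses <- c("W","L","W","W","W","L","W","L","W")
for ( obs in tosses ) {
    # relative ways for each candidate value of p to produce this observation
    plaus <- plaus * ( if ( obs=="W" ) p else 1 - p )
    plot( p , plaus / max(plaus) , type="l" )   # one panel per toss
}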


FIGURE 2.5. How a Bayesian model learns. Each toss of the globe produces an observation of water (W) or land (L), shown across panels for n = 0 through n = 9 tosses. The model's estimate of the proportion of water on the globe is a plausibility for every possible value. The lines and curves in this figure are these collections of plausibilities. In each plot, the previous plausibilities (dashed curve) are updated in light of the latest observation to produce a new set of plausibilities (solid curve).

In the remaining plots in FIGURE 2.5, the additional samples from the globe are introduced to the model, one at a time. Each dashed curve is just the solid curve from the previous plot, moving left-to-right and top-to-bottom. Every time a "W" is seen, the peak of the plausibility curve moves to the right, towards larger values of p. Every time an "L" is seen, it moves the other direction. The maximum height of the curve increases with each sample,


meaning that the range of p values consistent with the data shrinks as the amount of evidence increases. As each new observation is added, the curve is updated consistent with all previous observations.

Notice that every updated set of plausibilities becomes the initial plausibilities for the next observation. Every conclusion is the starting point for future inference. However, this updating process works backwards, as well as forwards. Given the final set of plausibilities in the bottom-right plot of FIGURE 2.5, and knowing the final observation (W), it is possible to mathematically subtract the observation, to infer the previous plausibility curve. So the data could be presented to your model in any order, or all at once even. In most cases, you will present the data all at once, for the sake of convenience.

Rethinking: Sample size and reliable inference. It is common to hear that there is a minimum number of observations for a useful statistical estimate. In non-Bayesian statistical inference, procedures are often justified by the method's behavior at very large sample sizes, so-called asymptotic behavior. As a result, performance at small sample sizes is questionable.

In contrast, Bayesian estimates are valid for any sample size. This does not mean that more data isn't helpful—it certainly is. Rather, the estimates have a clear and valid interpretation, no matter the sample size. But the price for this power is dependency upon the initial estimates, the prior. If the prior is a bad one, then the resulting inference will be misleading. There's no free lunch, 35 when it comes to learning about the world. A Bayesian golem must choose an initial plausibility, and a non-Bayesian golem must choose an estimator. Both golems pay for lunch with their assumptions.

2.2.3. Evaluate. The Bayesian model learns in a way that is demonstrably optimal, provided that the real, large world is accurately described by the model. This is to say that your Bayesian machine guarantees perfect inference, within the small world. No other way of using the available information, and beginning with the same state of information, could do better.

Don't get too excited about this logical virtue, however. The model may malfunction, so its results always have to be checked. And if there are important differences between the model and reality, then there is no logical guarantee of large world performance. There are many implications of a small and large world mismatch. Let's consider just two for the moment.

First, the model's certainty is no guarantee that the model is a good one. As the amount of data increases, the globe tossing model will grow increasingly sure of the proportion of water. This means that the curves in FIGURE 2.5 will become increasingly narrow and tall, restricting plausible values within a very narrow range. But models of all sorts—Bayesian or not—can be very confident about an estimate, even when the model is seriously misleading. This is because the estimates are conditional on the model. What your model is telling you is that, given a commitment to this particular model, it can be very sure that the plausible values are in a narrow range. Under a different model, things might look different.

Second, it is important to supervise and critique your model's work. Consider again the fact that the updating in the previous section works in any order of data arrival. We could shuffle the order of the observations, as long as six W's and three L's remain, and still end up with the same final plausibility curve.
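You can check the order claim directly (a quick sketch, not from the text): the product of per-toss likelihoods is unchanged by any permutation of the tosses, up to floating-point rounding.

p <- 0.6                                     # any candidate value works
tosses <- c("W","L","W","W","W","L","W","L","W")
lik <- function(t) prod( ifelse( t=="W" , p , 1-p ) )
lik( tosses )           # original order
lik( sample(tosses) )   # shuffled order: same likelihood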
That is only true, however, because the model assumes that order is irrelevant to inference. When something is irrelevant to the machine, it won't affect the inference directly. But it may affect it indirectly, because the data will depend upon order. So it is important to check the model's inferences in light of aspects of the data it does


not know about. Such checks are an inherently creative enterprise, left to the analyst and the scientific community. Golems are very bad at it.

Near the end of the chapter, you'll see some examples of such checks. For now, note that the goal is not to test the truth value of the model's assumptions. We know the model's assumptions are never exactly right, in the sense of matching the true data generating process. Therefore there's no point in checking if the model is true. Failure to conclude that a model is false must be a failure of our imagination, not a success of the model. Moreover, models do not need to be exactly true in order to produce highly precise and useful inferences. All manner of small world assumptions about error distributions and the like can be violated in the large world, but a model may still produce a perfectly useful estimate. This is because models are essentially information processing machines, and there are some surprising aspects of information that cannot be easily captured by framing the problem in terms of the truth of assumptions. 36

Instead, the objective is to check the model's adequacy for some purpose. This usually means asking and answering additional questions, beyond those that originally constructed the model. Both the questions and answers will depend upon the scientific context. So it's hard to provide general advice. There will be many examples, throughout the book, and of course the scientific literature is replete with evaluations of the suitability of models for different jobs—prediction, comprehension, measurement, and persuasion.

2.3. Components of the model

Now that you've seen how the Bayesian model behaves, when confronted with the globe tossing data, it's time to open up the machine and learn how it works. Bayesian models all share a common structure, comprising (1) a likelihood, (2) one or more parameters, and (3) a prior. These components are usually chosen in that order. In this section, you'll meet these components in some detail.

2.3.1. Likelihood. The first and most influential component of a Bayesian model is the LIKELIHOOD. The likelihood is a mathematical formula that specifies the plausibility of the data. You can build your own likelihood formula from basic assumptions of your story for how the data arise, or you can use one of several off-the-shelf likelihoods that are common in the sciences. Either way, the likelihood needs to tell you the probability of any possible observation, for any given state of the (small) world.

In the case of the globe tossing model, the likelihood can be deduced from the data story. Begin by nominating all of the possible events. There are two: water (W) and land (L). There are no other events. The globe never gets stuck to the ceiling, for example. When we observe a sample of W's and L's of length N (9 in the actual sample), we need to say how likely that exact sample is, out of the universe of potential samples of the same length. That might sound challenging, but it's the kind of thing you get good at very quickly, once you start practicing.

In this case, once we add our assumptions that (1) every toss is independent of the other tosses and (2) the probability of W is the same on every toss, probability theory provides a unique answer, known as the binomial distribution. This is the common "coin tossing" distribution. And so the probability of observing w W's in n tosses, with a probability p of W, is:

Pr(w|n, p) = ( n! / ( w! (n − w)! ) ) p^w (1 − p)^(n−w)

Read the above as:


The count of W's, w, is distributed binomially, with probability p of a W on each toss and n tosses in total.

And the binomial distribution formula is built into R, so you can easily compute the likelihood of the data—6 W's in 9 tosses—under any value of p with:

R code 2.8
dbinom( 6 , size=9 , prob=0.5 )

[1] 0.1640625

Change the 0.5 to any other value, to see how the value changes.

Sometimes, likelihoods are written L(p|w, n): the likelihood of p, conditional on w and n. Note however that this notation reverses what is on the left side of the | symbol. Just keep in mind that likelihoods are probabilities of data, not of parameters or models.

Overthinking: Names and probability distributions. The "d" in dbinom stands for density. Functions named in this way almost always have corresponding partners that begin with "r" for random samples and that begin with "p" for cumulative probabilities. See the help ?dbinom.

Rethinking: A central role for likelihood. A great deal of ink has been spilled focusing on how Bayesian and non-Bayesian data analysis differ. Focusing on differences is useful, but sometimes it distracts us from fundamental similarities. Notably, the most influential assumptions in both Bayesian and many non-Bayesian models are the likelihood functions and their relations to the parameters. The assumptions about the likelihood influence inference for every piece of data, and as sample size increases, the likelihood matters more and more. This helps to explain why Bayesian and non-Bayesian inferences are often so similar.

2.3.2. Parameters. For most likelihood functions, there are adjustable inputs. In the binomial likelihood, these inputs are p (the probability of seeing a W), n (the sample size), and w (the number of W's). One or all of these may also be quantities that we wish to estimate from data, PARAMETERS. In our globe tossing example, both n and w are data—we believe that we have observed their values without error. That leaves p as an unknown parameter, and our Bayesian machine's job is to describe what the data tell us about it.

But in other analyses, different inputs of the likelihood may be the targets of our analysis. It's common in field biology, for example, to estimate both p and n. Such an analysis may be called mark-recapture, capture-recapture, or sight-resight. The goal is usually to estimate population size, the n input of the binomial distribution. In other cases, we might know p and n, but not w. Your Bayesian machine can tell you what the data say about any parameter, once you tell it the likelihood and which bits have been observed. In some cases, the golem will just tell you that not much can be learned—Bayesian data analysis isn't magic, after all. But it is usually an advance to learn that our data don't discriminate among the possibilities.

In future chapters, there will be more parameters in your models, in excess of the small number that may appear in the likelihood. In statistical modeling, many of the most common questions we ask about data are answered directly by parameters:

• What is the average difference between treatment groups?
• How strong is the association between a treatment and an outcome?
• Does the effect of the treatment depend upon a covariate?
• How much variation is there among groups?


You'll see how these questions become extra parameters inside the likelihood.

Rethinking: Datum or parameter? It is typical to conceive of data and parameters as completely different kinds of entities. Data are measured and known; parameters are unknown and must be estimated from data. Usefully, in the Bayesian framework the distinction between a datum and a parameter is fuzzy: a datum can be recast as a very narrow probability density for a parameter, and a parameter as a datum with uncertainty. Much later in the book, you'll see how to use this continuity between certainty (data) and uncertainty (parameters) to incorporate measurement error and missing data into your modeling.

2.3.3. Prior. For every parameter you intend your Bayesian machine to estimate, you must provide to the machine a PRIOR. A Bayesian machine must have an initial plausibility assignment for each possible value of the parameter. The prior is this initial set of plausibilities. When you have a previous estimate to provide to the machine, that can become the prior, as in the steps in FIGURE 2.5. Back in FIGURE 2.5, the machine did its learning one piece of data at a time. As a result, each estimate becomes the prior for the next step. But this doesn't resolve the problem of providing a prior, because at the dawn of time, when n = 0, the machine still had an initial estimate for the parameter p: a flat line specifying equal plausibility for every possible value.

Overthinking: Prior as probability distribution. You could write the prior in the example here as:

Pr(p) = 1 / (1 − 0) = 1

The prior is a probability distribution for the parameter. In general, for a uniform prior from a to b, the probability of any point in the interval is 1/(b − a). If you're bugged by the fact that the probability of every value of p is 1, remember that every probability distribution must sum (integrate) to 1. The expression 1/(b − a) ensures that the area under the flat line from a to b is equal to 1.

The prior is what the machine "believes" before it sees the data. It is part of the model, not necessarily a reflection of what you believe. Clearly the likelihood contains many assumptions that are unlikely to be exactly true—completely independent tosses of the globe, no events other than W and L, constant p regardless of how the globe is tossed. Using a likelihood function does not force you to believe in it, even though the machine acts as if it believes. The same goes for priors.

So where do priors come from? They are engineering assumptions, chosen to help the machine learn. The flat prior in FIGURE 2.5 is very common, but it is hardly ever the best prior. You'll see later in the book that priors that gently nudge the machine usually improve inference. Such priors are sometimes called REGULARIZING or WEAKLY INFORMATIVE priors. They are so useful that non-Bayesian statistical procedures have adopted a mathematically equivalent approach, PENALIZED LIKELIHOOD. You'll meet examples in later chapters.

More generally, priors are useful for constraining parameters to reasonable ranges, as well as for expressing any knowledge we have about the parameter, before any data are observed. For example, in the globe tossing case, you know before the globe is tossed even once that the values p = 0 and p = 1 are completely implausible. You also know that any values of p very close to 0 and 1 are less plausible than values near p = 0.5. Even vague knowledge like this can be useful, when evidence is scarce.
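As a concrete sketch (not the book's code), one way to encode exactly that vague knowledge is a Beta(2, 2) prior, which assigns zero plausibility to the extremes and the most to p = 0.5:

p <- seq( 0 , 1 , length.out=100 )
plot( p , dbeta( p , 2 , 2 ) , type="l" )   # gently favors middling values of p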


There is a school of Bayesian inference that emphasizes choosing priors based upon the personal beliefs of the analyst. 37 While this SUBJECTIVE BAYESIAN approach thrives in some statistics and philosophy and economics programs, it is rare in the sciences. Within Bayesian data analysis in the natural and social sciences, the prior is considered to be just part of the model. As such it should be chosen, evaluated, and revised just like all of the other components of the model. In practice, the subjectivist and the non-subjectivist will often analyze data in nearly the same way. And sometimes the most persuasive data analysis is the one that lets an opponent choose the prior, so you can show them that the data and model require them to revise their beliefs.

None of this should be understood to mean that any statistical analysis is not inherently subjective, because of course it is—lots of little subjective decisions are involved in all parts of science. It's just that priors and Bayesian data analysis are no more inherently subjective than are likelihoods and the repeat sampling assumptions required for significance testing. 38 Anyone who has visited a statistics help desk at a university has probably experienced this subjectivity—statisticians do not in general exactly agree on how to analyze anything but the simplest of problems. The fact that statistical inference uses mathematics does not imply that there is only one reasonable or useful way to conduct an analysis.

Beyond all of the above, there's no law mandating we use only one prior. If you don't have a strong argument for any particular prior, then try different ones. Because the prior is an assumption, it should be interrogated like other assumptions: by altering it and checking how sensitive inference is to the assumption. No one is required to swear an oath to the assumptions of a model, and no set of assumptions deserves such respect.

Rethinking: Prior, prior pants on fire. Historically, some opponents of Bayesian inference objected to the arbitrariness of priors. It's true that priors are very flexible, being able to encode many different states of information. If the prior can be anything, isn't it possible to get any answer you want? Indeed it is. Regardless, after a couple hundred years of Bayesian calculation, it hasn't turned out that people use priors to lie. If your goal is to lie with statistics, you'd be a fool to do it with priors, because such a lie would be easily uncovered. Better to use the more opaque machinery of the likelihood. Or better yet—don't actually take this advice!—massage the data, drop some "outliers," and otherwise engage in motivated data transformation.

It is true though that choice of the likelihood is much more conventionalized than choice of prior. But conventional choices are often poor ones, smuggling in influences that can be hard to discover. In this regard, both a Bayesian and a non-Bayesian model are equally harried, because both traditions depend heavily upon likelihood functions and conventionalized model forms. And the fact that the non-Bayesian procedure doesn't have to make an assumption about the prior is of little comfort. This is because non-Bayesian procedures need to make choices that Bayesian ones do not, such as choice of estimator or likelihood penalty. Often, such choices can be shown to be equivalent to some Bayesian choice of prior or rather choice of loss function. (You'll meet loss functions later in Chapter 3.)

2.3.4. Posterior.
Once you have chosen a likelihood, which parameters are to be estimated, and a prior for each parameter, a Bayesian model treats the estimates as a purely logical consequence of those assumptions. For every unique combination of data, likelihood, parameters, and prior, there is a unique set of estimates. The resulting estimates—the relative plausibility of different parameter values, conditional on the data—are known as the POSTERIOR DISTRIBUTION. The posterior distribution takes the form of the inverse of the likelihood, Pr(p|n, w). Whereas the likelihood is the probability of the data, conditional on the parameters, the posterior is the probability of the parameters, conditional on the data.


The mathematical procedure that defines the logic of the posterior density is BAYES' THEOREM. This is the theorem that gives Bayesian data analysis its name. But the theorem itself is a trivial implication of probability theory. Here's a quick derivation of it, in the context of the globe tossing example. To grasp this derivation, remember that we are using probabilities to represent plausibility. But all the rules of ordinary probability theory apply and ensure that the logic of plausible reasoning is correct. For this example, I'm going to omit n, the sample size. Nothing important is being hidden here. It'll just be easier to follow the derivation without having to carry the inert n around in the mathematics.

The joint probability of the data w and any particular value of p is:

Pr(w, p) = Pr(w|p) Pr(p)

This just says that the probability of w and p is the product of the likelihood Pr(w|p) and the prior probability Pr(p). This is like saying that the probability of rain and cold on the same day is equal to the probability of rain, when it's cold, times the probability that it's cold. This much is just definition. But it's just as true that:

Pr(w, p) = Pr(p|w) Pr(w)

All I've done is reverse which probability is conditional, on the righthand side. It is still a true definition. It's like saying that the probability of rain and cold on the same day is equal to the probability that it's cold, when it's raining, times the probability of rain. Compare this statement to the one in the previous paragraph.

Now since both righthand sides are equal to the same thing, we can set them equal to one another and solve for the posterior probability, Pr(p|w):

Pr(p|w) = Pr(w|p) Pr(p) / Pr(w)

And this is Bayes' theorem. It says that the probability of any particular value of p, considering the data, is equal to the product of the likelihood and prior, divided by this thing Pr(w), which I'll call the average likelihood. In word form:

Posterior = (Likelihood × Prior) / (Average Likelihood)

The average likelihood, Pr(w), can be confusing. It is commonly called the "evidence" or the "probability of the data," neither of which is a transparent name. The probability Pr(w) is merely the average likelihood of the data. Averaged over what? Averaged over the prior. Its job is to standardize the posterior, to ensure it sums (integrates) to one. In mathematical form:

Pr(w) = E( Pr(w|p) ) = ∫ Pr(w|p) Pr(p) dp

The operator E means to take an expectation, which you can think of as an average. Such averages are commonly called marginals in mathematical statistics, and so you may also see this same probability called a marginal likelihood. And the integral above just defines the proper way to compute the average over a continuous distribution of values, like the infinite possible values of p.

The key lesson for now is that the posterior is proportional to the product of the prior and the likelihood. FIGURE 2.6 illustrates the multiplicative interaction of a prior and a likelihood. On each row, a prior on the left is multiplied by the likelihood in the middle to produce a


posterior on the right.

FIGURE 2.6. The posterior distribution, as a product of the prior distribution and likelihood. Top row: A flat prior constructs a posterior that is simply proportional to the likelihood. Middle row: A step prior, assigning zero probability to all values less than 0.5, resulting in a truncated posterior. Bottom row: A peaked prior that shifts and skews the posterior, relative to the likelihood.

The likelihood in each case is the same, the likelihood for the globe toss data. The priors however vary. As a result, the posterior distributions vary.

Rethinking: Bayesian data analysis isn't about Bayes' theorem. A common notion about Bayesian data analysis, and Bayesian inference more generally, is that it is distinguished by the use of Bayes' theorem. This is a mistake. Instead, Bayesian approaches are characterized by the adoption of a broader notion of probability. There are several routes to these broader probability concepts, but what they all share is using probability as a general method for quantifying uncertainty. In contrast,


other probability concepts restrict probabilities to frequencies of data, real or imagined. Bayesian probability includes frequencies as a special case.

Inference under any probability concept will eventually make use of Bayes' theorem. Common introductory examples of "Bayesian" analysis using HIV and DNA testing are not necessarily Bayesian, because, since all of the elements of the calculation are frequencies of observations, a non-Bayesian analysis would do exactly the same thing. Instead, Bayesian approaches get to use Bayes' theorem more generally, to quantify uncertainty about theoretical entities that cannot be observed, like parameters and models. Powerful inferences can be produced under both Bayesian and non-Bayesian probability concepts, but different justifications and sacrifices are necessary.

2.4. Making the model go

Recall that your Bayesian model is a machine, a figurative golem. It has built-in definitions for the likelihood, the parameters, and the prior. And then at its heart lies a motor that processes data, producing a posterior distribution. The action of this motor can be thought of as conditioning the prior on the data. As explained in the previous section, this conditioning is governed by the rules of probability theory, which define a uniquely logical posterior for every prior, likelihood, and data.

However, knowing the mathematical rule is often of little help, because many of the interesting models in contemporary science cannot be conditioned formally, no matter your skill in mathematics. And while some broadly useful models like linear regression can be conditioned formally, this is only possible if you constrain your choice of prior to special forms that are easy to do mathematics with. We'd like to avoid forced modeling choices of this kind, instead favoring conditioning engines that can accommodate whichever prior is most useful for inference.

What this means is that various numerical techniques are needed to approximate the mathematics that follows from the definition of Bayes' theorem. In this book, you'll meet three different conditioning engines, numerical techniques for computing posterior distributions.

(1) Grid approximation
(2) Quadratic approximation
(3) Markov chain Monte Carlo (MCMC)

There are many other engines, and new ones are being invented all the time. But the three you'll get to know here are common and widely useful. In addition, as you learn them, you'll also learn principles that will help you understand other techniques.

Rethinking: How you fit the model is part of the model. Earlier in this chapter, I implicitly defined the model as a composite of a prior and a likelihood. That definition is typical. But in practical terms, we should also consider how the model is fit to data as part of the model. In very simple problems, like the globe tossing example that consumes this chapter, calculation of the posterior density is trivial and foolproof. In even moderately complex problems, however, the details of fitting the model to data force us to recognize that our numerical technique influences our inferences. This is because different mistakes and compromises arise under different techniques. The same model fit to the same data using different techniques may produce different answers. When something goes wrong, every piece of the machine may be suspect. And so our golems carry with them their updating engines, as much slaves to their engineering as they are to the priors and likelihoods we program into them.


2.4.1. Grid approximation. One of the simplest techniques is grid approximation. While most parameters are continuous, capable of taking on an infinite number of values, it turns out that we can achieve an excellent approximation of the continuous posterior distribution by considering only a finite grid of parameter values. At any particular value of a parameter, p′, it's a simple matter to compute the posterior probability: just multiply the prior probability of p′ by the likelihood at p′. Repeating this procedure for each value in the grid generates an approximate picture of the exact posterior distribution. In this section, you'll see how to accomplish this, using simple bits of R code.

In the context of the globe tossing problem, grid approximation works extremely well. So let's build a grid approximation for the model we've constructed so far. Here is the recipe:

(1) Define the grid. This means you decide how many points to use in estimating the posterior, and then you make a list of the parameter values on the grid.
(2) Compute the value of the prior at each parameter value on the grid.
(3) Compute the likelihood at each parameter value.
(4) Compute the unstandardized posterior at each parameter value, by multiplying the prior by the likelihood.
(5) Finally, standardize the posterior, by dividing each value by the sum of all values.

In the globe tossing context, here's the code to complete all five of these steps (the body of this block is reconstructed to follow the five steps above, using a 20-point grid as in FIGURE 2.7):

R code 2.9
# define grid
p_grid <- seq( from=0 , to=1 , length.out=20 )

# define prior
prior <- rep( 1 , 20 )

# compute likelihood at each value in grid
likelihood <- dbinom( 6 , size=9 , prob=p_grid )

# compute product of likelihood and prior
unstd.posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
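To display the result (a minimal base-R sketch, matching the righthand panel of FIGURE 2.7):

plot( p_grid , posterior , type="b" ,
    xlab="probability of water" , ylab="posterior probability" )
mtext( "20 points" )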


FIGURE 2.7. Computing posterior distribution by grid approximation. In each plot, the posterior distribution for the globe toss data and model is approximated with a finite number of evenly-spaced points. With only 5 points (left), the approximation is terrible. But with 20 points (right), the approximation is already quite good. Compare to the analytically-solved, exact posterior distribution in FIGURE 2.5 (page 39).

You can display the grid-approximate posterior with a simple plot, and then experiment with the two other priors considered earlier in the chapter:

R code 2.10
plot( p_grid , posterior , type="b" ,
    xlab="probability of water" , ylab="posterior probability" )

R code 2.11
prior <- ifelse( p_grid < 0.5 , 0 , 1 )
prior <- exp( -5*abs( p_grid - 0.5 ) )

2.4.2. Quadratic approximation. The major drawback of grid approximation is that it scales very poorly as the number of parameters grows: a grid of 100 values for each of 10 parameters already requires 100^10 evaluations. So we need a method that remains cheap as models become more complex.


A useful approach is QUADRATIC APPROXIMATION. Under quite general conditions, the region near the peak of the posterior distribution will be nearly Gaussian—or "normal"—in shape. This means the posterior distribution can be usefully approximated by a Gaussian distribution. A Gaussian distribution is convenient, because it can be completely described by only two numbers: the location of its center (mean) and its spread (variance).

A Gaussian approximation is called "quadratic approximation" because the logarithm of a Gaussian distribution forms a parabola. And a parabola is a quadratic function. So this approximation essentially represents any log-posterior with a parabola. Later, in Chapter 4, you'll work directly with Gaussian distributions, and then you'll have a chance to see this quadratic nature more directly.

We'll use quadratic approximation for much of the first half of this book. For many of the most common procedures in applied statistics—linear regression, for example—the approximation works very well. Sometimes, it is even exactly correct, not actually an approximation at all. Computationally, quadratic approximation is very inexpensive, at least compared to grid approximation and MCMC (discussed next). The procedure, which R will happily conduct at your command, contains two steps.

(1) Find the posterior mode. This is usually accomplished by some optimization algorithm, a procedure that virtually "climbs" the posterior distribution, as if it were a mountain. The golem doesn't know where the peak is, but it does know the slope under its feet. There are many well-developed optimization procedures, most of them more clever than simple hill climbing. But all of them try to find peaks.

(2) Once you find the peak of the posterior, you must estimate the curvature near the peak. This curvature is sufficient to compute a quadratic approximation of the entire posterior distribution. In some cases, these calculations can be done analytically, but usually your computer uses some numerical technique instead.

To compute the quadratic approximation for the globe tossing data, we'll use a tool in the rethinking package: map. The abbreviation MAP stands for MAXIMUM A POSTERIORI, which is just a fancy Latin name for the mode of the posterior density. We're going to be using map a lot in the book. It's a flexible model fitting tool that will allow us to specify a large number of different "regression" models. So it'll be worth trying it out right now. You'll get a more thorough understanding of it later.

To compute the quadratic approximation to the globe tossing data:

R code 2.12
library(rethinking)
globe.qa <- map(
    alist(
        w ~ dbinom(9,p) ,  # binomial likelihood
        p ~ dunif(0,1)     # uniform prior
    ) ,
    data=list(w=6) ,
    start=list(p=0.5) )

# display summary of quadratic approximation
precis( globe.qa )

To use map, you provide a formula, a list of data, and a list of parameter start values. The formula defines the likelihood and the prior. The data list here contains just the count of water (6), and the start list tells map to start searching at p = 0.5. Now let's see the output:

Warning messages:
1: In dbinom(x = nw, prob = p, size = 9, log = TRUE) : NaNs produced
2: In dbinom(x = nw, prob = p, size = 9, log = TRUE) : NaNs produced

  Mean StdDev 2.5% 97.5%
p 0.67   0.16 0.36  0.97

Before explaining the output, let's talk about warning messages. Note that the first thing that happens when you execute map above is that R spits out those scary looking warnings about something called NaNs. These warnings are usually harmless. What they indicate in this case is that, while searching for the value of p that maximizes the posterior probability, R tried some values outside of the 0–1 interval that p must lie within. In geek-speak, p ∈ [0, 1], but we never told R that, so it tried values outside the range. When dbinom was told to compute the likelihood using a "probability" less than zero or greater than one, it refused and returned NaN, which stands for "not a number." This is R's polite way of saying that it was asked a silly question. Try this experiment to directly see this behavior:

R code 2.13
dbinom( 6 , size=9 , prob=2 )

[1] NaN
Warning message:
In dbinom(6, size = 9, prob = 2) : NaNs produced

There's your NaN. Later on, you'll see how to prevent this kind of thing, mainly by using transformations of the parameters. But for now, you do need to know what these warnings mean, so you don't overreact to them.

Returning to the quadratic approximation, the function precis presents a brief summary of it. In this case, it shows a MAP value of p = 0.67, which it calls the "Mean." The curvature is labeled "StdDev". This stands for standard deviation. This value is the standard deviation of the posterior distribution, while the mean value is its peak. Finally, the last two values in the precis output show the 95% percentile interval, which you'll learn more about in the next chapter. You can read this kind of approximation like: Assuming the posterior is Gaussian, it is maximized at 0.67, and its standard deviation is 0.16.

Since we already know the posterior, let's compare to see how good the approximation is. I'll use the analytical approach here, which uses dbeta. I won't explain this calculation, but it ensures that we have exactly the right answer, with no approximations. You can find an explanation and derivation of it in just about any more advanced textbook on Bayesian inference.

R code 2.14
# analytical calculation
w <- 6
n <- 9
curve( dbeta( x , w+1 , n-w+1 ) , from=0 , to=1 )
# quadratic approximation
curve( dnorm( x , 0.67 , 0.16 ) , lty=2 , add=TRUE )
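To demystify what map is doing here, the two steps can be carried out by hand with base R's optim. This is a minimal sketch of my own, not code from the text; it climbs the log-posterior to find the mode, and then converts the returned curvature (the Hessian, discussed below) into a standard deviation:

# negative log-posterior for the globe model; the flat prior
# contributes only a constant, so it is omitted
neg_log_post <- function( p ) -dbinom( 6 , size=9 , prob=p , log=TRUE )

# step 1: climb to the posterior mode
# step 2: hessian=TRUE returns the curvature at the peak
fit <- optim( par=0.5 , fn=neg_log_post , method="L-BFGS-B" ,
    lower=0.001 , upper=0.999 , hessian=TRUE )

fit$par               # posterior mode, about 0.67
sqrt( 1/fit$hessian ) # standard deviation, about 0.16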


FIGURE 2.8. Accuracy of the quadratic approximation. In each plot, the exact posterior distribution is plotted in blue, and the quadratic approximation is plotted as the black curve. Left: The globe tossing data with n = 9 tosses and w = 6 waters. Middle: Double the amount of data, with the same fraction of water, n = 18 and w = 12. Right: Four times as much data, n = 36 and w = 24.

You can see this plot (with a little extra formatting) on the left in FIGURE 2.8. The blue curve is the analytical posterior and the black curve is the quadratic approximation. The black curve does alright on its left side, but looks pretty bad on its right side. It even assigns positive probability to p = 1, which we know is impossible, since we saw at least one land sample. As the amount of data increases, however, the quadratic approximation gets better. In the middle of FIGURE 2.8, the sample size is doubled to n = 18 tosses, but with the same fraction of water, so that the mode of the posterior is in the same place. The quadratic approximation looks better now, although still not great. At quadruple the data, on the right side of the figure, the two curves are nearly the same.

This phenomenon, where the quadratic approximation improves with the amount of data, is very common. It's one of the reasons that so many classical statistical procedures are nervous about small samples: those procedures use quadratic (or other) approximations that are only known to be safe with infinite data. Often, these approximations are useful with less than infinite data, obviously. But the rate of improvement as sample size increases varies greatly depending upon the details. In some model types, the quadratic approximation can remain terrible even with thousands of samples.

Using the quadratic approximation in a Bayesian context brings with it all the same concerns. But you can always lean on some algorithm other than quadratic approximation, if you have doubts. Indeed, grid approximation works very well with small samples, because in such cases the model must be simple and the computations will be quite fast. You can also use MCMC, which is introduced next.

Rethinking: Maximum likelihood estimation. The quadratic approximation, either with a uniform prior or with a lot of data, is usually equivalent to a MAXIMUM LIKELIHOOD ESTIMATE (MLE) and its STANDARD ERROR. The MLE is a very common non-Bayesian parameter estimate. This equivalence between a Bayesian approximation and a common non-Bayesian estimator is both a blessing and a curse. It is a blessing, because it allows us to re-interpret a wide range of published non-Bayesian model fits in Bayesian terms. It is a curse, because maximum likelihood estimates have some curious drawbacks, which we'll explore in later chapters.

Overthinking: The Hessians are coming. Sometimes it helps to know more about how the quadratic approximation is computed. In particular, the approximation sometimes fails. When it does, chances are you'll get a confusing error message that says something about the "Hessian." Students of world history may know that the Hessians were German mercenaries hired by the British in the 18th century to do various things, including fight against the American revolutionary George Washington. These mercenaries are named after a region of what is now central Germany, Hesse.

The Hessian that concerns us here has little to do with mercenaries. It is named after mathematician Ludwig Otto Hesse (1811–1874). A Hessian is a square matrix of second derivatives. It is used for many purposes in mathematics, but in the quadratic approximation it holds the second derivatives of the log of posterior probability with respect to the parameters. It turns out that these derivatives are sufficient to describe a Gaussian distribution, because the logarithm of a Gaussian distribution is just a parabola. Parabolas have no derivatives beyond the second, so once we know the center of the parabola (the posterior mode) and its second derivative, we know everything about it. And indeed the second derivative of the logarithm of a Gaussian distribution is proportional to the inverse of its squared standard deviation, so the curvature and the standard deviation carry the same information.

The standard deviation is typically computed from the Hessian, so computing the Hessian is nearly always a necessary step. But sometimes the computation goes wrong, and your golem will choke while trying to compute the Hessian. In those cases, you have several options. Not all hope is lost. But for now it's enough to recognize the term and associate it with an attempt to find the standard deviation for a quadratic approximation.

2.4.3. Markov chain Monte Carlo. There are lots of important model types, like multilevel (mixed-effects) models, for which neither grid approximation nor quadratic approximation is always satisfactory. Such models can easily have hundreds or thousands or tens of thousands of parameters. Grid approximation obviously fails here, because it just takes too long—the Sun will go dark before your computer finishes the grid. Special forms of quadratic approximation might work, if everything is just right. But commonly, something is not just right. Furthermore, multilevel models do not always allow us to write down a single, unified function for the posterior distribution. This means that the function to maximize (when finding the MAP) is not known, but must be computed in pieces.

As a result, various counter-intuitive model fitting techniques have arisen. The most popular of these is MARKOV CHAIN MONTE CARLO (MCMC), which is a family of specific conditioning engines that are capable of handling highly complex models. It is fair to say that MCMC is largely responsible for the insurgence of Bayesian data analysis that began in the 1990s. While MCMC is much older than the 1990s, affordable computer power is not, so we must also thank the engineers. Much later in the book (Chapter 8), you'll meet a simple and precise example of MCMC model fitting, aimed at helping you really understand the outline of the technique.

The conceptual challenge with MCMC lies in its highly non-obvious strategy. Instead of attempting to compute or approximate the posterior density directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities. You can then build a picture of the posterior from the histogram of these samples. But commonly, we work directly with the samples, because they are in many ways more convenient than having the posterior directly. And so that's where we turn in the next chapter.

2.5. Summary

This chapter introduced the conceptual mechanics of Bayesian data analysis. Bayesian probabilities state the plausibilities of different possibilities, whether those possibilities are data or parameters. Plausibilities are updated in light of observations, a process known as Bayesian updating.

A Bayesian model is a composite of a likelihood, a choice of parameters, and a prior. The likelihood provides the plausibility of an observation (data), given a fixed value for the parameters. The prior provides the plausibility of each possible value of the parameters, before accounting for the data. The rules of probability tell us that the only logical way to compute the plausibilities, after accounting for the data, is to use Bayes' theorem. This results in the posterior distribution.

In practice, Bayesian models are fit to data using numerical techniques, like grid approximation, quadratic approximation, and Markov chain Monte Carlo. Each method imposes different tradeoffs.

2.6. Practice

Easy.

2.6.1. e1. Which of the expressions below correspond to the statement: the probability of rain on Monday?

(1) Pr(rain)
(2) Pr(rain|Monday)
(3) Pr(Monday|rain)
(4) Pr(rain, Monday) / Pr(Monday)

2.6.2. e2. Which of the following statements corresponds to the expression: Pr(Monday|rain)?

(1) The probability of rain on Monday.
(2) The probability of rain, given that it is Monday.
(3) The probability that it is Monday, given that it is raining.
(4) The probability that it is Monday and that it is raining.

2.6.3. e3. Which of the expressions below correspond to the statement: the probability that it is Monday, given that it is raining?

(1) Pr(Monday|rain)
(2) Pr(rain|Monday)
(3) Pr(rain|Monday) Pr(Monday)
(4) Pr(rain|Monday) Pr(Monday) / Pr(rain)
(5) Pr(Monday|rain) Pr(rain) / Pr(Monday)

2.6.4. e4. The Bayesian statistician Bruno de Finetti (1906–1985) began his book on probability theory with the declaration: "PROBABILITY DOES NOT EXIST." (Capitals in the original.) What he meant is that probability is a device for describing uncertainty from the perspective of an observer with limited knowledge; it has no objective reality. Discuss the globe tossing example from the chapter, in light of this statement. What does it mean to say "the probability of water is 0.7"?


Medium.

2.6.5. m1. Recall the globe tossing model from the chapter. Compute and plot the grid approximate posterior distribution for each of the following sets of observations. In each case, assume a uniform prior for p.

(1) W, W, W
(2) W, W, W, L
(3) L, W, W, L, W, W, W

2.6.6. m2. Now assume a prior for p that is equal to zero when p < 0.5 and is a positive constant when p ≥ 0.5. Again compute and plot the grid approximate posterior distribution for each of the sets of observations in the problem just above.

2.6.7. m3. Suppose there are two globes, one for Earth and one for Mars. The Earth globe is 70% covered in water. The Mars globe is 100% land. Further suppose that one of these globes—you don't know which—was tossed in the air and produced a "land" observation. Assume that each globe was equally likely to be tossed. Show that the posterior probability that the globe was the Earth, conditional on seeing "land" (Pr(Earth|land)), is 0.23.

2.6.8. Cards and faces. Suppose you have a deck with only three cards. Each card has two sides, and each side is either black or white. One card has two black sides. The second card has one black and one white side. The third card has two white sides. Now suppose all three cards are placed in a bag and shuffled. Someone reaches into the bag and pulls out a card and places it flat on a table. A black side is shown facing up, but you don't know the color of the side facing down. Show that the probability that the other side is also black is 2/3. Use the counting method (section 2 of the chapter) to approach this problem. This means counting up the ways that each card could produce the observed data (a black side facing up on the table).

2.6.9. More cards. Now suppose there are four cards: B/B, B/W, W/W, and another B/B. Again suppose a card is drawn from the bag and a black side appears face up. Again calculate the probability that the other side is black.

2.6.10. Heavy cards. Imagine that black ink is heavy, and so cards with black sides are heavier than cards with white sides. As a result, it's less likely that a card with black sides is pulled from the bag. So again assume there are three cards: B/B, B/W, and W/W. After experimenting a number of times, you conclude that for every way to pull the B/B card from the bag, there are 2 ways to pull the B/W card and 3 ways to pull the W/W card. Again suppose that a card is pulled and a black side appears face up. Show that the probability the other side is black is now 0.5. Use the counting method, as before.

2.6.11. Sequence of cards. Assume again the original card problem, with a single card showing a black side face up. Before looking at the other side, we draw another card from the bag and lay it face up on the table. The face that is shown on the new card is white. Show that the probability that the first card, the one showing a black side, has black on its other side is now 0.75. Use the counting method, if you can. Hint: Treat this like the sequence of globe tosses, counting all the ways to see each observation, for each possible first card.

Hard.


2.6.12. A Tale of Two Pandas. Suppose there are two species of panda bear. Both are equally common in the wild and live in the same places. They look exactly alike and eat the same food, and there is yet no genetic assay capable of telling them apart. They differ however in their family sizes. Species A gives birth to twins 10% of the time, otherwise birthing a single infant. Species B births twins 20% of the time, otherwise birthing singleton infants. Assume these numbers are known with certainty, from many years of field research.

Now suppose you are managing a captive panda breeding program. You have a new female panda of unknown species, and she has just given birth to twins. What is the probability that her next birth will also be twins?

Solution. To solve this problem, realize first that it is asking for a conditional probability: Pr(twins2|twins1), the probability the second birth is twins, conditional on the first birth being twins. Remember the definition of conditional probability:

Pr(twins2|twins1) = Pr(twins1, twins2) / Pr(twins1)

So our job is to define Pr(twins1, twins2), the joint probability that both births are twins, and Pr(twins1), the unconditioned probability of twins. Since any single birth has the same chance of being twins, I'll write the latter simply as Pr(twins).

Pr(twins) is easier, so let's do that one first. The "unconditioned" probability just means that we have to average over the possibilities. In this case, that means the species have to be averaged over. The problem implies that both species are equally common, so there's a half chance that any given panda is of either species. This gives us:

Pr(twins) = (1/2)(0.1) + (1/2)(0.2)

where the first term is the contribution of species A and the second of species B. A little arithmetic tells us that Pr(twins) = 0.15.

Now for Pr(twins1, twins2). The probability that a female from species A has two sets of twins is 0.1 × 0.1 = 0.01. The corresponding probability for species B is 0.2 × 0.2 = 0.04. Averaging over species identity:

Pr(twins1, twins2) = (1/2)(0.01) + (1/2)(0.04) = 0.025

Finally, we combine these probabilities to get the answer:

Pr(twins2|twins1) = 0.025/0.15 = 25/150 = 1/6 ≈ 0.17

Note that this is higher than Pr(twins). This is because the first set of twins provides some information about which species we have, and this information was automatically used in the calculation.

2.6.13. Name that panda. Recall all the facts from the problem above. Now compute the probability that the panda we have is from species A, assuming we have observed only the first birth and that it was twins.

Solution. Our target now is Pr(A|twins1), the posterior probability that the panda is species A, given that we observed twins. Bayes' theorem tells us:

Pr(A|twins1) = Pr(twins1|A) Pr(A) / Pr(twins1)

We calculated Pr(twins1) = 0.15 in the previous problem, and we were given Pr(twins|A) = 0.1 as well. The only stumbling block may be Pr(A), the prior probability of species A. This was also given, implied by the equal abundance of both species. So prior to observing the birth, we have Pr(A) = 0.5. So:

Pr(A|twins1) = (0.1)(0.5)/0.15 = 5/15 = 1/3

So the posterior probability of species A, after observing twins, falls to 1/3, from a prior probability of 1/2. This also implies a posterior probability of 2/3 that our panda is species B, since we are assuming only two possible species. These are small world probabilities that trust the assumptions.

2.6.14. Update that panda. Continuing on from the previous problem, suppose the same panda mother has a second birth and that it is not twins, but a singleton infant. Compute the posterior probability that this panda is species A.

Solution. There are a few ways to arrive at the answer. The easiest is perhaps to recall that Bayes' theorem accumulates evidence, using Bayesian updating. So we can take the posterior probabilities from the previous problem and use them as prior probabilities in this problem. This implies Pr(A) = 1/3. Now we can ignore the first observation, the twins, and concern ourselves with only the latest observation, the singleton birth. The previous observation is embodied in the prior, so there's no need to account for it again.

The formula is:

Pr(A|singleton) = Pr(singleton|A) Pr(A) / Pr(singleton)

We already have the prior, Pr(A). The other pieces are straightforward:

Pr(singleton|A) = 1 − 0.1 = 0.9
Pr(singleton) = Pr(singleton|A) Pr(A) + Pr(singleton|B) Pr(B)
              = (0.9)(1/3) + (0.8)(2/3) = 5/6

Combining everything, we get:

Pr(A|singleton) = (0.9)(1/3) / (5/6) = 9/25 = 0.36

This is a modest increase in the posterior probability of species A, from about 0.33 to 0.36.

The other way to proceed is to go back to the original prior, Pr(A) = 0.5, before observing any births. Then you can treat both observations (twins, singleton) as data and update the original prior. I'm going to start abbreviating "twins" as T and "singleton" as S. The formula:

Pr(A|T, S) = Pr(T, S|A) Pr(A) / Pr(T, S)

Let's start with the average likelihood, Pr(T, S), because it will force us to define the likelihoods anyway.

Pr(T, S) = Pr(T, S|A) Pr(A) + Pr(T, S|B) Pr(B)

I'll go slowly through this, so I don't lose anyone along the way. The first likelihood is just the probability a species A mother has twins and then a singleton:

Pr(T, S|A) = (0.1)(0.9) = 0.09

And the second likelihood is similar, but for species B:

Pr(T, S|B) = (0.2)(0.8) = 0.16

The priors are both 0.5, so all together:

Pr(T, S) = (0.09)(0.5) + (0.16)(0.5) = 1/8 = 0.125

We already have the likelihood needed for the numerator, so we can go to the final answer now:

Pr(A|T, S) = (0.09)(0.5)/0.125 = 0.36

Unsurprisingly, the same answer we got the other way.

2.6.15. A panda for all data. A common boast of Bayesian statisticians is that Bayesian inference makes it easy to use all of the data, even if the data are of different types.

So suppose now that a veterinarian comes along who has a new genetic test that she claims can identify the species of our mother panda. But the test, like all tests, is imperfect. This is the information you have about the test.

• The probability it correctly identifies a species A panda is 0.8.
• The probability it correctly identifies a species B panda is 0.65.

The vet administers the test to your panda and tells you that the test is positive for species A. First ignore your previous information from the births and compute the posterior probability that your panda is species A. Then redo your calculation, now using the birth data as well.

Solution. First, ignoring the births. This is what we know about the test:

Pr(test A|A) = 0.8
Pr(test A|B) = 1 − 0.65 = 0.35

We use the original prior, Pr(A) = 0.5. Plugging everything into Bayes' theorem:

Pr(A|test A) = (0.8)(0.5) / [ (0.8)(0.5) + (0.35)(0.5) ] ≈ 0.7

So the test has increased the confidence in species A from 0.5 to 0.7.

Now to use the birth information as well. The easiest way to do this is just to begin with the posterior from the previous problem. That posterior already used the birth data, so if we adopt it as our prior, we automatically use the birth data. And so our prior becomes Pr(A) = 0.36. Then the approach is just the same as above:

Pr(A|test A) = Pr(test A|A) Pr(A) / Pr(test A)
Pr(A|test A) = (0.8)(0.36) / [ (0.36)(0.8) + (1 − 0.36)(0.35) ] ≈ 0.56

And since this posterior uses all of the data, the two births and the genetic test, it would be honest to label it as Pr(A|test A, twins, singleton) ≈ 0.56.
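All of these panda calculations can also be verified numerically. Here is a small sketch of my own (not code from the text) that carries the prior through all three observations by repeated Bayesian updating:

# prior over the two species (A, B)
prior <- c( 0.5 , 0.5 )

# likelihood of each observation, for species A and B respectively
twins     <- c( 0.10 , 0.20 )
singleton <- c( 0.90 , 0.80 )
test_A    <- c( 0.80 , 0.35 )

# Bayesian updating: multiply prior by likelihood, then standardize
update <- function( prior , lik ) {
    unstd <- prior * lik
    unstd / sum(unstd)
}

post <- update( prior , twins )     # Pr(A) = 1/3
post <- update( post , singleton )  # Pr(A) = 0.36
post <- update( post , test_A )     # Pr(A) about 0.56
post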


3. Sampling the Imaginary

Lots of books on Bayesian statistics introduce posterior inference by using a medical testing scenario. To repeat the structure of common examples, suppose there is a blood test that correctly detects vampirism 95% of the time. This implies Pr(positive|vampire) = 0.95. It's a very accurate test. It does make mistakes, though, in the form of false positives: 1% of the time, it incorrectly diagnoses normal people as vampires, implying Pr(positive|mortal) = 0.01. The final bit of information we are told is that vampires are rather rare, being only 0.1% of the population, implying Pr(vampire) = 0.001. Suppose now that someone tests positive for vampirism. What's the probability that he or she is a bloodsucking immortal?

The correct approach is just to use Bayes' theorem to invert the probability, to compute Pr(vampire|positive). The calculation can be presented as:

Pr(vampire|positive) = Pr(positive|vampire) Pr(vampire) / Pr(positive)

where Pr(positive) is the average probability of a positive test result, that is:

Pr(positive) = Pr(positive|vampire) Pr(vampire) + Pr(positive|mortal) (1 − Pr(vampire))

Performing the calculation in R:

R code 3.1
PrPV <- 0.95
PrPM <- 0.01
PrV <- 0.001
PrP <- PrPV*PrV + PrPM*(1-PrV)
( PrVP <- PrPV*PrV / PrP )

[1] 0.08683729

So there is an 8.7% chance that a person who tests positive is actually a vampire, despite the apparent accuracy of the test. Many people find this result counterintuitive.


Few people find it easy to remember which number goes where, probably because they never grasp the logic of the procedure. It's just a formula that descends from the sky.

There is a way to present the same problem that does make it more intuitive, however. Suppose that, instead of reporting probabilities, as before, I tell you the following.

(1) In a population of 100,000 people, 100 of them are vampires.
(2) Of the 100 who are vampires, 95 of them will test positive for vampirism.
(3) Of the 99,900 mortals, 999 of them will test positive for vampirism.

Now tell me, if we test all 100,000 people, what proportion of those who test positive for vampirism actually are vampires? Many people, although certainly not all people, find this presentation a lot easier. 39 Now we can just count up the number of people who test positive: 95 + 999 = 1094. Out of these 1094 positive tests, 95 of them are real vampires, so that implies:

Pr(vampire|positive) = 95/1094 ≈ 0.087

It's exactly the same answer as before, but without a seemingly arbitrary rule.

The second presentation of the problem, using counts rather than probabilities, is often called the frequency format or natural frequencies. Why a frequency format helps people intuit the correct approach remains contentious. Some people think that human psychology naturally works better when it receives information in the form a person in a natural environment would receive it. In the real world, we encounter frequencies only. No one has ever seen a probability, the thinking goes. But everyone sees counts (frequencies) in their daily lives. Maybe so.

Rethinking: The natural frequency phenomenon is not unique. Changing the representation of a problem often makes it easier to address or inspires new ideas that were not available in an old representation. 40 In physics, switching between Newtonian and Lagrangian mechanics can make problems much easier. In evolutionary biology, switching between inclusive fitness and multilevel selection sheds new light on old models. And in statistics in general, switching between Bayesian and non-Bayesian representations often teaches us new things about both approaches.

Regardless of the explanation for this phenomenon, we can exploit it. And in this chapter we exploit it by taking the probability distributions from the previous chapter and sampling from them to produce counts. The posterior distribution is a probability distribution. And like all probability distributions, we can imagine drawing samples from it. The sampled events in this case are parameter values. Most parameters have no exact empirical realization. The Bayesian formalism treats parameter distributions as relative plausibility, not as any real random process. Randomness is always a property of knowledge, never of the real world. But inside of the computer, parameters are just as empirical as the outcome of a coin flip or a die toss or an agricultural experiment. The posterior defines the expected frequency that different parameter values will appear, once we start plucking parameters out of it.

This chapter teaches you basic skills for working with samples from the posterior distribution. It will seem a little silly to work with samples at this point, because the posterior distribution for the globe tossing model is very simple. It's so simple that it's no problem to work directly with the grid approximation or even the exact mathematical form. 41 But there are two reasons to adopt the sampling approach early on, before it's really necessary.

First, many scientists are quite shaky about integral calculus, even though they have strong and valid intuitions about how to summarize data. Working with samples transforms a problem in calculus into a problem in data summary, into a frequency format problem. An integral in a typical Bayesian context is just the total probability in some interval. That can be a challenging calculus problem, but once you have samples from the probability distribution, it's just a matter of counting values in the interval. Even seemingly simple calculations, like confidence intervals, are made difficult once a model has many parameters. In those cases, one must average over the uncertainty in all other parameters, when describing the uncertainty in a focal parameter. This requires a complicated integral, but only a very simple data summary. An empirical attack on the posterior allows the scientist to ask and answer more questions about the model, without relying upon a captive mathematician. For this reason, it is often easier and more intuitive to work with samples from the posterior, than to work with probabilities and integrals directly.

Second, some of the most capable methods of computing the posterior produce nothing but samples. Many of these methods are variants of Markov chain Monte Carlo techniques (MCMC). So if you learn early on how to conceptualize and process samples from the posterior, then when you inevitably must fit a model to data using MCMC—and chances are you will—you will already know how to make sense of the output. Beginning with Chapter 8 of this book, you will use MCMC to open up the types and complexity of models you can practically fit to data. MCMC is no longer a technique only for experts, but rather part of the standard toolkit of quantitative science. So it's worth planning ahead.

So in this chapter we'll begin to use samples to summarize and simulate model output. The skills you learn here will apply to every problem in the remainder of the book, even though the details of the models, how they are fit to data, and how the samples are produced will vary.

Rethinking: Why statistics can't save bad science. The vampirism example at the start of this chapter has the same logical structure as many different signal detection problems: (1) There is some binary state that is hidden from us; (2) We observe an imperfect cue of the hidden state; (3) We (should) use Bayes' theorem to logically deduce the impact of the cue on our uncertainty.

Scientific inference is often framed in similar terms: (1) An hypothesis is either true or false; (2) We conduct a statistical test and get an imperfect cue of the hypothesis' falsity (usually framed as significant versus insignificant results); (3) We (should) use Bayes' theorem to logically deduce the impact of the cue on the status of the hypothesis. It's the third step that is hardly ever done. But let's do it, for a toy example, so you can see how little statistical tests may do for us.

Suppose the probability of a significant result, when an hypothesis is true, is Pr(sig|true) = 0.95. That's the power of the test. Suppose that the probability of a significant result, when an hypothesis is false, is Pr(sig|false) = 0.05. That's the false-positive rate, like the 5% of conventional significance testing. Finally, we have to state the base rate at which hypotheses are true. Suppose for example that 1 in every 100 hypotheses turns out to be true. Then Pr(true) = 0.01. No one knows what this value should be, but the history of science suggests it's very small. Now use Bayes' theorem to compute the posterior:

Pr(true|sig) = Pr(sig|true) Pr(true) / Pr(sig)
             = Pr(sig|true) Pr(true) / [ Pr(sig|true) Pr(true) + Pr(sig|false) Pr(false) ]

Plug in the appropriate values, and the answer is approximately Pr(true|sig) = 0.16. So a significant test result corresponds to a 16% chance that the hypothesis is true. This is the same low base-rate phenomenon that applies in medical (and vampire) testing. You can shrink the false-positive rate to 1% and get this posterior probability up to around 0.5, only as good as a coin flip. The most important thing to do is to improve the base rate, Pr(true), and that requires thinking, not testing. 42
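To verify that arithmetic, the whole calculation is one line of R (my own check, not code from the text):

# Pr(true|sig) with power 0.95, false-positive rate 0.05, base rate 0.01
( 0.95*0.01 ) / ( 0.95*0.01 + 0.05*0.99 )

[1] 0.1610169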


FIGURE 3.1. Sampling parameter values from the posterior distribution. Left: 10,000 samples from the posterior implied by the globe tossing data and model. Right: The density of samples (vertical) at each parameter value (horizontal).

3.1. Sampling from a grid-approximate posterior

Before beginning to work with samples, we need to generate them. Here's a reminder for how to compute the posterior for the globe tossing model, using grid approximation:

R code 3.2
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)

Now we wish to draw 10,000 samples from this posterior. Each value of p will appear among the samples in proportion to its posterior plausibility:

R code 3.3
samples <- sample( p_grid , size=1e4 , replace=TRUE , prob=posterior )


You can see what these samples look like by plotting them in the order they were drawn:

R code 3.4
plot( samples )

In this plot, it's as if you are flying over the posterior distribution, looking down on it. There are many more samples from the dense region near 0.6 and very few samples below 0.25. On the right, the plot shows the density estimate computed from these samples.

R code 3.5
library(rethinking)
dens( samples )

You can see that the estimated density is very similar to the ideal posterior you computed via grid approximation. If you draw even more samples, maybe 1e5 or 1e6, the density estimate will get more and more similar to the ideal.

All you've done so far is crudely replicate the posterior density you had already computed. So next it is time to use these samples to describe and understand the posterior.

3.2. Sampling to summarize

Once your model produces a posterior distribution, the model's work is done. But your work has just begun. It is necessary to summarize the posterior distribution. Exactly how it is summarized depends upon your purpose. But common questions include:

• How much posterior probability lies below some parameter value?
• How much posterior probability lies between two parameter values?
• Which parameter value marks the lower 5% of the posterior probability?
• Which range of parameter values contains 90% of the posterior probability?
• Which parameter value has highest posterior probability?

These simple questions can be usefully divided into questions about (1) intervals of defined boundaries, (2) intervals of defined probability mass, and (3) point estimates. We'll see how to approach these questions using samples from the posterior.

3.2.1. Intervals of defined boundaries. Suppose I ask you for the posterior probability that the proportion of water is less than 0.5. Using the grid-approximate posterior, you can just add up all of the probabilities, where the corresponding parameter value is less than 0.5:

R code 3.6
# add up posterior probability where p < 0.5
sum( posterior[ p_grid < 0.5 ] )

[1] 0.1718746

So about 17% of the posterior probability is below 0.5. Couldn't be easier. But since grid approximation isn't practical in general, it won't always be so easy. Once there is more than one parameter in the posterior distribution (wait until the next chapter for that complication), even this simple sum is no longer very simple.

So let's see how to perform the same calculation, using samples from the posterior. This approach does generalize to complex models with many parameters, and so you can use it everywhere. All you have to do is similarly add up all of the samples below 0.5, but also divide the resulting count by the total number of samples. In other words, find the frequency of parameter values below 0.5:


R code 3.7
sum( samples < 0.5 ) / 1e4

[1] 0.1726

And that's nearly the same answer as the grid approximation provided, although your answer will not be exactly the same, because the exact samples you drew from the posterior will be different. This region is shown in the upper-left plot in FIGURE 3.2. Using the same approach, you can ask how much posterior probability lies between 0.5 and 0.75:

R code 3.8
sum( samples > 0.5 & samples < 0.75 ) / 1e4

[1] 0.6059

So about 61% of the posterior probability lies between 0.5 and 0.75. This region is shown in the upper-right plot of FIGURE 3.2.

Overthinking: Counting with sum. In the R code examples just above, I used the function sum to effectively count up how many samples fulfill a logical criterion. Why does this work? It works because R internally converts a logical expression, like samples < 0.5, to a vector of TRUE and FALSE results, one for each element of samples, saying whether or not each element matches the criterion. Go ahead and enter samples < 0.5 on the R prompt, to see this for yourself. Then when you sum this vector of TRUE and FALSE values, R counts each TRUE as 1 and each FALSE as 0. So it ends up counting how many TRUE values are in the vector, which is the same as the number of elements in samples that match the logical criterion.

3.2.2. Intervals of defined mass. It is more common to see scientific journals reporting an interval of defined mass, usually known as a CONFIDENCE INTERVAL. An interval of posterior probability, such as the ones we are working with, may instead be called a CREDIBLE INTERVAL, although the terms may also be used interchangeably, in the usual polysemy that arises when commonplace words are used in technical definitions. It's easy to keep track of what's being summarized, however, as long as you pay attention to how the model is defined. In this book, I'll try to be clear by always writing posterior interval when referring to summaries of posterior probability and instead confidence interval when referring to summaries of data or simulated data.

These posterior intervals report two parameter values that contain between them a specified amount of posterior probability, a probability mass. For this type of interval, it is easier to find the answer by using samples from the posterior than by using a grid approximation. Suppose for example you want to know the boundaries of the lower 80% posterior probability. You know this interval starts at p = 0. To find out where it stops, think of the samples as data and ask where the 80th percentile lies:

R code 3.9
quantile( samples , 0.8 )

80%
0.7607608

This region is shown in the bottom-left plot in FIGURE 3.2. Similarly, the middle 80% interval lies between the 10th percentile and the 90th percentile. These boundaries are found using the same approach:


FIGURE 3.2. Two kinds of confidence interval. Top row: intervals of defined boundaries. Top-left: the blue area is the posterior probability below a parameter value of 0.5. Top-right: the posterior probability between 0.5 and 0.75. Bottom row: intervals of defined mass. Bottom-left: lower 80% posterior probability exists below a parameter value of about 0.75. Bottom-right: middle 80% posterior probability lies between the 10% and 90% quantiles.

R code 3.10
quantile( samples , c( 0.1 , 0.9 ) )

10%       90%
0.4464464 0.8118118

This region is shown in the bottom-right plot in FIGURE 3.2.

Intervals of this sort, which assign equal probability mass to each tail, are very common in the scientific literature. We'll call them PERCENTILE INTERVALS (PI). These intervals do a good job of communicating the shape of a distribution, as long as the distribution isn't too asymmetrical. But in terms of supporting inferences about which parameters are consistent with the data, they are not perfect. Consider the posterior distribution and different intervals in FIGURE 3.3. This posterior is consistent with observing 3 waters in 3 tosses and a uniform (flat) prior. It is highly skewed, having its maximum value at the boundary, p = 1. You can compute it, via grid approximation, with:

R code 3.11
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 3 , size=3 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
samples <- sample( p_grid , size=1e4 , replace=TRUE , prob=posterior )

Then you can compute the 50% percentile interval, using the PI function from the rethinking package. The interval runs from roughly 0.70 to 0.93, assigning 25% of the probability mass to each tail:

R code 3.12
PI( samples , prob=0.5 )

FIGURE 3.3. The difference between percentile and highest posterior density confidence intervals. The posterior density here corresponds to a flat prior and observing three water samples in three total tosses of the globe. Left: 50% percentile interval. This interval assigns equal mass (25%) to both the left and right tail. As a result, it omits the most probable parameter value, p = 1. Right: 50% highest posterior density interval, HPDI. This interval finds the narrowest region with 50% of the posterior probability. Such a region always includes the most probable parameter value.


In contrast, the right-hand plot in FIGURE 3.3 displays the 50% HIGHEST POSTERIOR DENSITY INTERVAL (HPDI). 43 The HPDI is the narrowest interval containing the specified probability mass. If you think about it, there must be an infinite number of posterior intervals with the same mass. But if you want an interval that best represents the parameter values most consistent with the data, then you want the densest of these intervals. That's what the HPDI is. Compute it from the samples with HPDI (also part of rethinking):

R code 3.13
HPDI( samples , prob=0.5 )

lower 0.5 upper 0.5
0.8418418 1.0000000

This interval captures the parameters with highest posterior probability, as well as being noticeably narrower: 0.16 in width rather than 0.23 for the PI.

So the HPDI has some advantages over the PI. But in most cases, these two types of interval are very similar. 44 They only look so different in this case because the posterior distribution is so highly skewed. If we used samples from the posterior distribution for 6 waters in 9 tosses (as in most of this chapter), instead these intervals would be nearly identical. Try it for yourself, using different probability masses, such as prob=0.8 and prob=0.95. When the posterior is bell shaped, it hardly matters which type of interval you use. Remember, we're not launching rockets or calibrating atom smashers, so fetishizing precision to the 5th decimal place will not improve your science.

The HPDI also has some disadvantages. HPDI is more computationally intensive than PI and suffers from greater simulation variance, which is a fancy way of saying that it is sensitive to how many samples you draw from the posterior. It is also harder to understand and many scientific audiences will not appreciate its features, while they will immediately understand a percentile interval, as ordinary non-Bayesian intervals are nearly always percentile intervals (although of sampling distributions, not posterior distributions).

Overall, if the choice of interval type makes a big difference, then you shouldn't be using intervals to summarize the posterior. Remember, the entire posterior distribution is the Bayesian estimate. It summarizes the relative plausibilities of each possible value of the parameter. Intervals of the distribution are just helpful for summarizing it. If choice of interval leads to different inferences, then you'd be better off just plotting the entire posterior distribution.

Rethinking: Why 95%? The most common interval mass in the natural and social sciences is the 95% interval. This interval leaves 5% of the probability outside, corresponding to a 5% chance of the parameter not lying within the interval (although see below). This customary interval also reflects the customary threshold for statistical significance, which is 5% or P < 0.05. It is not easy to defend the choice of 95% (5%), outside of pleas to convention. Ronald Fisher is sometimes blamed for this choice, but his widely-cited 1925 invocation of it was not enthusiastic:

"The [number of standard deviations] for which P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not." 45

Most people don't think of convenience as a serious criterion. Later in his career, Fisher actively advised against always using the same threshold for significance or width of confidence interval. 46

So what are you supposed to do then? There is no consensus, but thinking is always a good idea. If you are trying to say that an interval doesn't include some value, then you might use the widest interval that excludes the value. If you want to argue that two groups are different, then instead of an interval, you might just compute their overlap. Often, all confidence intervals do is communicate the shape of a distribution. In that case, a series of nested intervals may be more useful than any one interval. For example, why not present 80%, 95%, and 99% intervals, along with the median?
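Here is one way to produce such a series of nested intervals from the samples, a sketch of my own using the PI function introduced above:

# nested percentile intervals plus the median communicate
# the shape of the posterior better than any single interval
PI( samples , prob=0.80 )
PI( samples , prob=0.95 )
PI( samples , prob=0.99 )
median( samples )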


Rethinking: What do confidence intervals mean? It is common to hear that a 95% confidence interval means that there is a probability 0.95 that the true parameter value lies within the interval. In strict non-Bayesian statistical inference, such a statement is never correct, because strict non-Bayesian inference forbids using probability to measure uncertainty about parameters. Instead, one should say that if we repeated the study and analysis a very large number of times, then 95% of the computed intervals would contain the true parameter value. If the distinction is not entirely clear to you, then you are in good company. Most scientists find the definition of a confidence interval to be bewildering, and many of them slip unconsciously into a Bayesian interpretation.

But whether you use a Bayesian interpretation or not, a 95% interval does not contain the true value 95% of the time. The history of science teaches us that confidence intervals exhibit chronic overconfidence. 47 The word true should set off alarms that something is wrong with a statement like "contains the true value." The 95% is a small world number (see the introduction to Chapter 2), only true in the model's logical world. So it will never apply exactly to the real or large world. It is what the golem believes, but you are free to believe something else. Regardless, the width of the interval, and the values it covers, can provide valuable advice.

3.2.3. Point estimates. The third and final common summary task for the posterior is to produce point estimates of some kind. Given the entire posterior distribution, what value should you report? This seems like an innocent question, but it is difficult to answer. The Bayesian parameter estimate is precisely the entire posterior distribution, which is not a single number, but instead a function that maps each unique parameter value onto a plausibility value. And so in order to produce a POINT ESTIMATE from the posterior, we'll have to ask and answer more questions.

The first thing to note is that you don't have to choose a point estimate at all. But beyond that, it's worth seeing how to justify a point estimate. Consider the following example. Suppose again the globe tossing experiment in which we observe 3 waters out of 3 tosses, as in FIGURE 3.3. Let's consider three alternative point estimates. First, it is very common for scientists to report the parameter value with highest posterior probability, a maximum a posteriori (MAP) estimate. You can easily compute the MAP in this example:

R code 3.14
p_grid[ which.max(posterior) ]

[1] 1

Or if you instead have samples from the posterior, you can still approximate the same point:

R code 3.15
chainmode( samples , adj=0.01 )

[1] 0.9985486

But why is this point, the mode, interesting? Why not report the posterior mean or median?


FIGURE 3.4. Point estimates and loss functions. Left: Posterior distribution (blue) after observing 3 waters in 3 tosses of the globe. Vertical lines show the locations of the mode, median, and mean. Each point implies a different loss function. Right: Expected loss under the rule that loss is proportional to absolute distance of decision (horizontal axis) from the true value. The point marks the value of p that minimizes the expected loss, the posterior median.

R code 3.16
mean( samples )
median( samples )

[1] 0.8005558
[1] 0.8408408

These are also point estimates, and they also summarize the posterior. But all three—the mode (MAP), mean, and median—are different in this case. How can we choose among them? FIGURE 3.4 shows this posterior distribution and the locations of these point summaries.

One principled way to go beyond using the entire posterior as the estimate is to choose a LOSS FUNCTION. A loss function is a rule that tells you the cost associated with using any particular point estimate. While statisticians and game theorists have long been interested in loss functions, and how Bayesian inference supports them, scientists hardly ever use them explicitly. The key insight is that different loss functions imply different point estimates.

Here's an example to help us work through the procedure. Suppose I offer you a bet. Tell me which value of p, the proportion of water on the Earth, you think is correct. I will pay you one-hundred United States Dollars, if you get it exactly right. But I will subtract money from your gain, proportional to the distance of your decision from the correct value. Precisely, your loss is proportional to the absolute value of d − p, where d is your decision and p is the correct answer. We could change the precise dollar values involved, without changing the important aspects of this problem. What matters is that the loss is proportional to the distance of your decision from the true value.


Now once you have the posterior density in hand, how should you use it to maximize your expected winnings? It turns out that the parameter value that maximizes expected winnings (minimizes expected loss) is the median of the posterior density. Let's calculate that fact, without using a mathematical proof. Those interested in the proof should follow the endnote. 48

Calculating expected loss for any given decision means using the posterior to average over our uncertainty in the true value. Of course we don't know the true value, in most cases. But if we are going to use our model's information about the parameter, that means using the entire posterior distribution. So suppose we decide p = 0.5 will be our decision. Then the expected loss will be:

R code 3.17
sum( posterior*abs( 0.5 - p_grid ) )

[1] 0.3128752

The symbols posterior and p_grid are the same ones we've been using throughout this chapter, containing the posterior probabilities and the parameter values, respectively. All the code above does is compute the weighted average loss, where each loss is weighted by its corresponding posterior probability. There's a trick for repeating this calculation for every possible decision, using the function sapply, said like ess apply. For now, just use this trick. I'll explain how and why it works, later in the book.

R code 3.18
loss <- sapply( p_grid , function(d) sum( posterior*abs( d - p_grid ) ) )

Now the symbol loss contains a list of expected loss values, one for each possible decision on the grid. From here it's easy to find the parameter value that minimizes the expected loss:

R code 3.19
p_grid[ which.min(loss) ]

[1] 0.8408408

And this is actually the posterior median, the same value computed from the samples earlier. The right-hand plot in FIGURE 3.4 shows the expected loss at every possible decision, with the minimizing point, the median, marked.

So different loss functions nominate different point estimates. The absolute loss used above leads to the median. A quadratic loss, proportional to (d − p)^2, instead leads to the posterior mean as the optimal decision. When the posterior is symmetric and normal-looking, the median and mean converge, and it hardly matters which you report. In principle, though, the details of the applied context may demand a unique loss function. Consider the problem of deciding whether or not to order an evacuation, based upon an estimate of hurricane wind speed. Damage to life and property increases very rapidly as wind speed increases, while the costs of a needless evacuation are comparatively modest. Therefore the implied loss function is highly asymmetric, rising sharply as true wind speed exceeds our guess, but rising only slowly as true wind speed falls below our guess. In this context, the optimal point estimate would tend to be larger than posterior mean or median. Moreover, the real issue is whether or not to order an evacuation, and so producing a point estimate of wind speed may not be necessary at all.

Usually, research scientists don't think about loss functions. And so any point estimate like the mean or MAP that they may report isn't intended to support any particular decision, but rather to describe the shape of the posterior. You might argue that the decision to make is whether or not to accept an hypothesis. But the challenge then is to say what the relevant costs and benefits would be, in terms of the knowledge gained or lost. 49 Usually it's better to communicate as much as you can about the posterior distribution, as well as the data and the model itself, so that others can build upon your work. Premature decisions to accept or reject hypotheses can cost lives. 50

It's healthy to keep these issues in mind, if only because they remind us that many of the routine questions in statistical inference can only be answered under consideration of a particular empirical context and applied purpose. Statisticians can provide general outlines and standard answers, but a motivated and attentive scientist will always be able to improve upon such general advice.

3.3. Sampling to simulate prediction

Another common job for samples from the posterior is to ease simulation of the model's implied observations. Generating implied observations—predictions—from a model is useful for at least four distinct reasons.

(1) Model checking. After a model is fit to real data, it is worth simulating implied observations, to check both whether the fit worked correctly and to investigate model failures.
(2) Software validation. In order to be sure that our model fitting software is working, it helps to simulate observations under a known model and then attempt to recover the values of the parameters the data were simulated under.
(3) Research design. If you can simulate observations from your hypothesis, then you can evaluate whether the research design can be effective. In a narrow sense, this means doing power analysis, but the possibilities are much broader.
(4) Forecasting. Estimates can be used to simulate new predictions, for new cases and future observations. These forecasts can be useful as applied prediction, but also for model criticism and revision.

In this final section of the chapter, we'll look at how to produce simulated observations and how to perform some simple model checks.

3.3.1. Dummy data. Let's summarize the globe tossing model that you've been working with for two chapters now. A fixed true proportion of water p exists, and that is the target of our inference. Tossing the globe in the air and catching it produces observations of "water" and "land" that appear in proportion to p and 1 − p, respectively.

Now note that these assumptions not only allow us to infer the plausibility of each possible value of p, after observation. That's what you did in the previous chapter. These assumptions also allow us to simulate the observations that the model implies. They allow this, because likelihood functions work in both directions. Given a realized observation, the likelihood function says how plausible the observation is. And given only the parameters, the likelihood defines a distribution of possible observations that we can sample from, to simulate observation. In this way, Bayesian models are always generative, capable of simulating predictions. Many non-Bayesian models are also generative, but many are not.

We will call such simulated data DUMMY DATA, to indicate that it is a stand-in for actual data, like a crash test dummy. With the globe tossing model, the dummy data arises from a binomial likelihood:

Pr(w|p, n) = [ n! / ( w!(n − w)! ) ] p^w (1 − p)^(n−w)

where w is an observed count of "water" and n is the number of tosses. Suppose n = 2, two tosses of the globe. Then there are only three possible observations: 0 water, 1 water, 2 water. You can quickly compute the likelihood of each, for any given value of p. Let's use p = 0.7, which is just about the true proportion of water on the Earth:

R code 3.20
dbinom( 0:2 , size=2 , prob=0.7 )

[1] 0.09 0.42 0.49

This means that there's a 9% chance of observing w = 0, a 42% chance of w = 1, and a 49% chance of w = 2. If you change the value of p, you'll get a different distribution of implied observations.

Now we're going to simulate observations, using these likelihoods. This is done by sampling from the distribution just described above. You could use sample to do this, but R provides convenient sampling functions for all the ordinary probability distributions, like the binomial. So a single dummy data observation of w can be sampled with:

R code 3.21
rbinom( 1 , size=2 , prob=0.7 )

[1] 1

The "r" in rbinom stands for "random." It can also generate more than one simulation at a time. A set of 10 simulations can be made by:

R code 3.22
rbinom( 10 , size=2 , prob=0.7 )

[1] 2 2 2 1 2 1 1 1 0 2

Let's generate 100,000 dummy observations, just to verify that each value (0, 1, or 2) appears in proportion to its likelihood:

R code 3.23
dummy_w <- rbinom( 1e5 , size=2 , prob=0.7 )
table(dummy_w)/1e5

The resulting frequencies will be very close to the analytically computed likelihoods above, about 0.09, 0.42, and 0.49, although your values will differ slightly because of random variation in the simulation.


FIGURE 3.5. Distribution of simulated sample observations from 9 tosses of the globe. These samples assume the proportion of water is 0.7.

The same kind of simulation works for any number of tosses. Here is the simulation for nine tosses, the same sample size as the original globe tossing data, with the resulting distribution of counts shown in FIGURE 3.5:

dummy_w <- rbinom( 1e5 , size=9 , prob=0.7 )
simplehist( dummy_w , xlab="dummy water count" )
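You can also check the simulated frequencies against the analytically computed likelihoods, just as we did for the two-toss case:

dbinom( 0:9 , size=9 , prob=0.7 )   # analytic likelihoods of 0 through 9 water
table( dummy_w ) / 1e5              # simulated proportions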


3.3.2. Model checking. MODEL CHECKING means (1) ensuring the model fitting worked correctly and (2) evaluating the adequacy of a model for some purpose. Since Bayesian models are always generative, able to simulate observations as well as estimate parameters from observations, once you condition a model on data, you can simulate to examine the model's empirical expectations.

3.3.2.1. Did the software work? In the simplest case, we can check whether the software worked by checking for correspondence between implied predictions and the data used to fit the model. You might also call these implied predictions retrodictions, as they ask how well the model reproduces the data used to educate it. An exact match is neither expected nor desired. But when there is no correspondence at all, it probably means the software did something wrong.

There is no way to really be sure that software works correctly. Even when the retrodictions correspond to the observed data, there may be subtle mistakes. And when you start working with multilevel models, you'll have to expect a certain pattern of lack of correspondence between retrodictions and observations. Despite there being no perfect way to ensure software has worked, the simple check I'm encouraging here often catches silly mistakes, mistakes of the kind everyone makes from time to time.

In the case of the globe tossing analysis, the software implementation is simple enough that it can be checked against analytical results. So instead let's move directly to considering the model's adequacy.

3.3.2.2. Is the model adequate? After assessing whether the posterior distribution is the correct one, because the software worked correctly, it's useful to also look for aspects of the data that are not well-described by the model's expectations. The goal is not to test whether the model's assumptions are "true," because all models are false. Rather, the goal is to assess exactly how the model fails to describe the data, as a path towards model revision and improvement. All models fail in some respect, so you have to use your judgment—as well as the judgments of your colleagues—to decide whether any particular failure is or is not important.

Again, few scientists want to produce models that do nothing more than re-describe existing samples. So imperfect prediction (retrodiction) is not a bad thing. Typically we hope to either predict future observations or understand enough that we might usefully tinker with the world. We'll consider these problems again, in Chapter 6.

For now, we need to learn how to combine sampling of simulated observations, as in the previous section, with sampling parameters from the posterior distribution. We expect to do better when we use the entire posterior distribution, not just some point estimate derived from it. Why? Because there is a lot of information about uncertainty in the entire posterior distribution. We lose this information when we pluck out a single parameter value and then perform calculations with it. This loss of information leads to overconfidence.

Let's do some basic model checks, using simulated observations for the globe tossing model. The observations in our example case are counts of water, over tosses of the globe. The implied predictions of the model are doubly uncertain. First, for any unique value of the parameter p, there is a unique implied pattern of observations that the model expects. These are what you sampled in the previous section.
There is uncertainty in the predicted observations, because even if you know p with certainty, you won't know the next globe toss with certainty (unless p = 0 or p = 1). Second, since there is also uncertainty about p, there is uncertainty about everything that depends upon p.


The uncertainty in p arises from the spread of the posterior distribution for p, not from sampling variation. But it will interact with the sampling variation, when we try to assess what the model tells us about outcomes.

We'd like to propagate the parameter uncertainty—carry it forward—as we evaluate the implied predictions. All that is required is averaging over the posterior density for p, while computing the predictions. For each possible value of the parameter p, there is an implied distribution of outcomes. So if you were to compute the sampling distribution of outcomes at each value of p, then you could average all of these prediction distributions together, using the posterior probabilities of each value of p, to get a POSTERIOR PREDICTIVE DISTRIBUTION.

FIGURE 3.6 illustrates this averaging. On the left, the posterior distribution is shown, with three unique parameter values highlighted by the vertical lines. The implied distribution of observations specific to each of these parameter values is shown in the middle column of plots. Observations are never certain for any value of p, but they do shift around in response to it. Finally, on the right-hand side, the sampling distributions for all values of p are combined, using the posterior probabilities to compute the weighted average frequency of each possible observation, zero to nine water samples. The actually observed value, six water samples, is shown by the thick line.

The resulting distribution is for predictions, but it incorporates all of the uncertainty embodied in the posterior distribution for the parameter p. As a result, it is honest. While the model does a good job of predicting the data—the most likely observation is indeed the observed data—predictions are still quite spread out. If instead you were to use only a single parameter value to compute implied predictions, say the most probable value at the peak of the posterior distribution, you'd produce an overconfident distribution of predictions, narrower than the right-hand plot in FIGURE 3.6 and more like the distribution shown for p = 0.64 in the middle row of the middle column. The usual effect of this overconfidence will be to lead you to believe that the model is more consistent with the data than it really is—the predictions will cluster around the observations more tightly. This illusion arises from tossing away uncertainty about the parameters.

So how do you actually do the calculations? To simulate predicted observations for a single value of p, say p = 0.6, you can use rbinom to generate random binomial samples:

w <- rbinom( 1e4 , size=9 , prob=0.6 )

This produces 10,000 simulated counts of water in 9 tosses of the globe, all computed while assuming that p = 0.6.
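To propagate parameter uncertainty into these predictions, all you need to do is replace the fixed value 0.6 with samples from the posterior. A sketch, assuming samples is the vector of posterior samples of p from earlier in the chapter:

w <- rbinom( 1e4 , size=9 , prob=samples )

For each sampled value of p, a count of water is simulated using that value. Since the values in samples appear in proportion to their posterior plausibility, the resulting simulated counts are averaged over the posterior.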


FIGURE 3.6. Simulating predictions from the total posterior. Left: The familiar posterior distribution for the globe tossing data. Three example parameter values (0.38, 0.64, 0.89) are marked by the vertical lines. Middle column: Each of the three parameter values is used to simulate observations. Right: Combining simulated observation distributions for all parameter values (not just 0.38, 0.64, and 0.89), each weighted by its posterior probability, produces the posterior predictive distribution. This distribution propagates uncertainty about the parameter to uncertainty about prediction. The observed value (6) is highlighted.

You can compute intervals and point statistics using the same procedures. If you plot these samples, you'll see the distribution shown in the right-hand plot in FIGURE 3.6.

The simulated model predictions are quite consistent with the observed data in this case—the actual count of 6 lies right in the middle of the simulated distribution. There is quite a lot of spread to the predictions, but a lot of this spread arises from the binomial process itself, not uncertainty about p. Still, it'd be premature to conclude that the model is perfect. So far, we've viewed the data only as the model views it: each toss of the globe is completely independent of the others. This assumption is questionable. Unless the person tossing the globe is careful, it is easy to induce correlations and therefore patterns among the sequential tosses. Consider for example that about half of the globe (and planet) is covered by the Pacific Ocean. As a result, water and land are not uniformly distributed on the globe, and therefore unless the globe spins and rotates enough while in the air, the position when tossed could easily influence the sample once it lands. The same problem arises in coin tosses, and indeed skilled individuals can influence the outcome of a coin toss, by exploiting the physics of it. 51


FIGURE 3.7. Alternative views of the same posterior predictive distribution (see FIGURE 3.6). Instead of considering the data as the model saw it, as a sum of water samples, now we view the data as both the length of the maximum run of water or land (left) and the number of switches between water and land samples (right). Observed values are highlighted in each. While the simulated predictions are consistent with the run length (3 water in a row), they are much less consistent with the frequent switches (6 switches in 9 tosses).

So with the goal of seeking out aspects of prediction in which the model fails, let's look at the data in two different ways. Recall that the sequence of nine tosses was W L W W W L W L W. First, consider the length of the longest run of either water or land. This will provide a crude measure of correlation between tosses. So in the observed data, the longest run is 3 W's. Second, consider the number of times in the data that the sample switches from water to land or from land to water. This is another measure of correlation between samples. In the observed data, the number of switches is 6. There is nothing special about these two new ways of describing the data. They just serve to inspect the data in new ways. In your own modeling, you'll have to imagine aspects of the data that are relevant in your context, for your purposes.

FIGURE 3.7 shows the simulated predictions, viewed in these two new ways. On the left, the length of the longest run of water or land is plotted, with the observed value of 3 highlighted by the bold line. Again, the true observation is the most common simulated observation, but with a lot of spread around it. On the right, the number of switches from water to land and land to water is shown, with the observed value of 6 highlighted in bold. Now the simulated predictions appear less consistent with the data, as the majority of simulated observations have fewer switches than were observed in the actual sample. This is consistent with a lack of independence between tosses of the globe, in which each toss is negatively correlated with the last.
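If you'd like to compute these two summaries yourself, here is one minimal sketch, coding water as 1 and land as 0:

tosses <- c(1,0,1,1,1,0,1,0,1)     # W L W W W L W L W
max( rle(tosses)$lengths )         # longest run: 3
sum( diff(tosses) != 0 )           # number of switches: 6

Applying the same two lines to each simulated sequence of nine tosses produces distributions like those plotted in FIGURE 3.7.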


Does this mean that the model is bad? That depends. The model will always be wrong in some sense, always mis-specified. But whether or not the mis-specification should lead us to try other models will depend upon our specific interests. In this case, if tosses do tend to switch from W to L and L to W, then each toss will provide less information about the true coverage of water on the globe. In the long run, even the wrong model we've used throughout the chapter will converge on the correct proportion. But it will do so more slowly than the posterior distribution may lead us to believe.

Rethinking: What does more extreme mean? A common way of measuring deviation of an observation from a model is to count up the tail area that includes the observed data and any more extreme data. Ordinary p-values are an example of such a tail-area probability. When comparing observations to distributions of simulated predictions, as in FIGURE 3.6 and FIGURE 3.7, we might wonder how far out in the tail the observed data must be before we conclude that the model is a poor one. Because statistical contexts vary so much, it's impossible to give a universally useful answer.

But more importantly, there are usually very many ways to view data and define "extreme." Ordinary p-values view the data in just the way the model expects it, and so provide a very weak form of model checking. For example, the far-right plot in FIGURE 3.6 evaluates model fit in the best way for the model. Alternative ways of defining "extreme" may provide a more serious challenge to a model. The different definitions of extreme in FIGURE 3.7 can more easily embarrass it.

Model fitting remains an objective procedure—everyone and every golem conducts Bayesian updating in a way that doesn't depend upon personal preferences. But model checking is inherently subjective, and this actually allows it to be quite powerful, since subjective knowledge of an empirical domain provides expertise. Expertise in turn allows for imaginative checks of model performance. Since golems have terrible imaginations, we need the freedom to engage our imaginations. In this way, the objective and subjective work together. E. T. Jaynes (1922–1998) said all of this much more succinctly:

    It would be very nice to have a formal apparatus that gives us some "optimal" way of recognizing unusual phenomena and inventing new classes of hypotheses that are most likely to contain the true one; but this remains an art for the creative human mind. 52

3.4. Summary

This chapter introduced the basic procedures for manipulating posterior distributions. Our fundamental tool is samples of parameter values drawn from the posterior distribution. Samples transform problems in integral calculus into problems in data summary. These samples can be used to produce intervals, point estimates, posterior predictive checks, and other kinds of simulations.

Posterior predictive checks combine uncertainty about parameters, as described by the posterior distribution, with uncertainty about outcomes, as described by the assumed likelihood function. These checks are useful for verifying that your software worked correctly. They are also useful for prospecting for ways in which your model is inadequate.

Once models become more complex, posterior predictive simulations will be used for a broader range of applications. Even understanding a model often requires simulating implied observations. We'll keep working with samples from the posterior, to make these tasks as easy and customizable as possible.


3.5. Practice

Easy. These problems use the samples from the posterior distribution for the globe tossing example. This code will give you a specific set of samples, so that you can check your answers exactly.

p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
set.seed(100)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )


Hard. The practice problems here all use the data below. These data indicate the gender (male=1, female=0) of officially reported first and second born children in 100 two-child families. These data come from a certain country in which families are restricted to two legal children.

R code 3.28
# the birth1 and birth2 vectors also ship with the rethinking package:
library(rethinking)
data(homeworkch3)


3.5.17. Predict second borns with sisters. The model assumes that the sexes of first and second births are independent. To check this assumption, focus now on second births that followed female first borns. Compare 10-thousand simulated counts of boys to only those second births that followed girls. To do this correctly, you need to count the number of first borns who were girls and simulate that many births, 10-thousand times. Compare the counts of boys in your simulations to the actual observed count of boys following girls. How does the model look in this light? Any guesses what is going on in these data?


4. Linear Models

History has been unkind to Ptolemy. Claudius Ptolemy (born 90 CE, died 168 CE) was an Egyptian mathematician and astronomer, most famous for his geocentric model of the solar system. These days, when scientists wish to mock someone, they might compare him to a supporter of the geocentric model. But Ptolemy was a genius. His mathematical model of the motions of the planets (FIGURE 4.1) was extremely accurate. To achieve its accuracy, it employed a device known as an epicycle, a circle on a circle. It is even possible to have epi-epicycles, circles on circles on circles. With enough epicycles in the right places, Ptolemy's model could predict with accuracy greater than anyone had achieved before him. And so the model was utilized for over a thousand years. And Ptolemy and people like him, toiling over centuries, worked it all out without the aid of a computer. Anyone should be flattered to be compared to Ptolemy.

The trouble of course is that the geocentric model is wrong, in many respects. If you used it to plot the path of your Mars probe, you'd miss the red planet by quite a distance. But for spotting Mars in the night sky, it remains an excellent model. It would have to be re-calibrated every century or so, depending upon which heavenly body you wish to locate. But the geocentric model continues to make useful predictions, provided those predictions remain within a narrow domain of questioning.

The strategy of using epicycles might seem crazy, once you know the correct physical structure of the solar system. But it turns out that the ancients had hit upon a generalized system of approximation. Given enough circles embedded in enough places, the Ptolemaic strategy is the same as a Fourier series, a way of decomposing a periodic function (like an orbit) into a series of sine and cosine functions. So no matter the actual arrangement of planets and moons, a geocentric model can be built to describe their paths against the night sky.

LINEAR REGRESSION is the geocentric model of applied statistics. By "linear regression," we will mean a family of simple statistical golems that attempt to learn about the mean and variance of some measurement, using an additive combination of other measurements. Like geocentrism, linear regression can usefully describe a very large variety of natural phenomena. Like geocentrism, linear regression is a descriptive model that corresponds to many different process models. If we read its structure too literally, we're likely to make mistakes. But used wisely, these little linear golems continue to be useful.

This chapter introduces linear regression as a Bayesian procedure. Under a probability interpretation, which is necessary for Bayesian work, linear regression uses a Gaussian (normal) distribution to describe our golem's uncertainty about some measurement of interest. This type of model is simple, flexible, and commonplace. Like all statistical models, it is not universally useful. But linear regression has a strong claim to being foundational, in the sense that once you learn to build and interpret linear regression models, you can quickly move on to other types of regression which are less normal.


FIGURE 4.1. The Ptolemaic Universe, in which complex motion of the planets in the night sky was explained by orbits within orbits, called epicycles. The model is incredibly wrong, yet makes quite good predictions.

4.1. Why normal distributions are normal

Suppose you and a thousand of your closest friends line up on the half-way line of a soccer field (football pitch). Each of you has a coin in your hand. At the sound of the whistle, you begin flipping the coins. Each time a coin comes up heads, that person moves one step towards the lefthand goal. Each time a coin comes up tails, that person moves one step towards the righthand goal. Each person flips the coin 16 times, follows the implied moves, and then stands still. Now we measure the distance of each person from the half-way line. Can you predict what proportion of the one-thousand people will be standing on the half-way line? How about the proportion 5 yards left of the line?

It's hard to say where any individual person will end up, but you can say with great confidence what the collection of positions will be. The distances will be distributed in approximately normal, or Gaussian, fashion. This is true even though the underlying distribution is binomial. This happens because there are so many more possible ways to realize a sequence of left-right steps that sums to zero. There are slightly fewer ways to realize a sequence that ends up one step left or right of zero. And so on, with the number of possible sequences declining in the characteristic bell curve of the normal distribution.

4.1.1. Normal by addition. Let's see this result, by simulating this experiment in R. To show that there's nothing special about the underlying coin flip, assume instead that each step is different from all the others, a random distance between zero and one yard.


FIGURE 4.2. Random walks on the soccer field converge to a normal distribution. The more steps are taken, the closer the match between the real empirical distribution of positions and the ideal normal distribution, superimposed in the last plot in the bottom panel.

Thus a coin is flipped, a distance between zero and one yard is taken in the indicated direction, and the process repeats. To simulate this, we generate for each person a list of 16 random numbers between −1 and 1. These are the individual steps. Then we add these steps together to get the position after 16 steps. Then we need to replicate this procedure one-thousand times. This is the sort of task that would be harrowing in a point-and-click interface, but it is made trivial by the command line. Here's a single line to do the whole thing:

pos <- replicate( 1000 , sum( runif(16,-1,1) ) )

Plotting the density of these positions, with plot(density(pos)), will show a familiar bell shape. FIGURE 4.2 shows the same simulation at 4, 8, and 16 steps.


You can experiment with even larger numbers of steps to verify for yourself that the distribution of positions is stabilizing on the Gaussian. You can square the step sizes and transform them in a number of arbitrary ways, without changing the result: normality emerges. Where does it come from?

Any process that adds together random values from the same distribution converges to a normal. But it's not easy to grasp why addition should result in a bell curve of sums. 53 Here's a conceptual way to think of the process. Whatever the average value of the source distribution, each sample from it can be thought of as a fluctuation from that average value. When we begin to add these fluctuations together, they also begin to cancel one another out. A large positive fluctuation will cancel a large negative one. The more terms in the sum, the more chances for each fluctuation to be cancelled by another, or by a series of smaller ones in the opposite direction. So eventually the most likely sum, in the sense that there are the most ways to realize it, will be a sum in which every fluctuation is cancelled by another, a sum of zero (relative to the mean). 54

It doesn't matter what shape the underlying distribution possesses. It could be uniform, like in our example above, or it could be (nearly) anything else. 55 Depending upon the underlying distribution, the convergence might be slow, but it will be inevitable. Often, as in this example, convergence is rapid.

4.1.2. Normal by multiplication. Here's another way to get a normal distribution. Suppose the growth rate of an organism is influenced by a dozen loci, each with several alleles that code for more growth. Suppose also that all of these loci interact with one another, such that each increases growth by a percentage. This means that their effects multiply, rather than add. For example, we can sample a random growth rate for this example with this line of code:

R code 4.2
prod( 1 + runif(12,0,0.1) )

This code just samples 12 random numbers between 1.0 and 1.1, each representing a proportional increase in growth. Thus 1.0 means no additional growth and 1.1 means a 10% increase. The product of all twelve is computed and returned as output. Now what distribution do you think these random products will take? Let's generate 10-thousand of them and see:

R code 4.3
growth <- replicate( 10000 , prod( 1 + runif(12,0,0.1) ) )
dens( growth , norm.comp=TRUE )


The distribution of these products turns out to be approximately normal as well. This works because multiplying small deviations is approximately the same as adding them: for example, 1.1 × 1.1 = 1.21, which is very close to 1 + 0.1 + 0.1 = 1.2. The smaller the effect of each locus, the better this additive approximation will be. In this way, small effects that multiply together are approximately additive, and so they also tend to stabilize on Gaussian distributions. Verify this for yourself by comparing:

big <- replicate( 10000 , prod( 1 + runif(12,0,0.5) ) )
small <- replicate( 10000 , prod( 1 + runif(12,0,0.01) ) )
dens( big , norm.comp=TRUE )
dens( small , norm.comp=TRUE )


4.1.4.2. Epistemological justification. But the natural occurrence of the Gaussian distribution is only one reason to build models around it. Another route to justifying the Gaussian as our choice of skeleton, and a route that will help us appreciate later why it is often a poor choice, is that it represents a particular state of ignorance. When all we know or are willing to say about a distribution of measures (measures are continuous values on the real number line) is their mean and variance, then the Gaussian distribution arises as the most consistent with our assumptions.

That is to say that the Gaussian distribution is the most natural expression of our state of ignorance, because if all we are willing to assume is that a measure has finite variance, the Gaussian distribution is the shape that can be realized in the largest number of ways and does not introduce any new assumptions. It is the least surprising and least informative assumption to make. In this way, the Gaussian is the distribution most consistent with our assumptions. Or rather, it is the most consistent with our golem's assumptions. If you don't think the distribution should be Gaussian, then that implies that you know something else that you should tell your golem about, something that would improve inference.

This epistemological justification is premised on INFORMATION THEORY and MAXIMUM ENTROPY. We'll dwell on information theory in Chapter 6, when it'll be more useful. Then in later chapters, other common and useful distributions will be used to build generalized linear models (GLMs). When these other distributions are introduced, you'll learn the constraints that make them the uniquely most appropriate (consistent with our assumptions) distributions.

For now, let's take the ontological and epistemological justifications of just the normal distribution as reasons to start building models of measures around it. Throughout all of this modeling, keep in mind that using a model is not equivalent to swearing an oath to it. The golem is your servant, not the other way around. And so the golem's beliefs are not your own. There is no contract that forces us to believe the assumptions of our models. They just need to be useful robots.

Overthinking: Gaussian distribution. You don't have to memorize the Gaussian probability distribution formula to make good use of it. Your computer already knows it. But a little knowledge of its form can help demystify it. The probability of some value y, given a Gaussian (normal) distribution with mean µ and standard deviation σ, is:

Pr(y | µ, σ) = ( 1 / √(2πσ²) ) exp( −(y − µ)² / (2σ²) )

This looks monstrous. But the important bit is just the (y − µ)² part. This is the part that gives the normal distribution its fundamental shape, a quadratic shape. Once you exponentiate the quadratic shape, you get the classic bell curve. The rest of it just scales and standardizes the distribution so that it sums to one, as all probability distributions must. But an expression as simple as exp(−y²) yields the Gaussian prototype.

The Gaussian is also a continuous distribution, unlike the binomial probabilities of earlier chapters. This means that the value y in the Gaussian distribution can be any continuous value. The binomial, in contrast, requires integers. Probability distributions with only discrete outcomes, like the binomial, are often called probability mass functions, while continuous ones like the Gaussian are called probability density functions. For mathematical reasons, probability densities, but not masses, can be greater than 1.
Try dnorm(0,0,0.1), for example, which is the way to make R calculate Pr(0 | 0, 0.1). The answer, about 4, is no mistake. Probability density is the rate of change in cumulative probability. So where cumulative probability is increasing rapidly, density can easily exceed 1. But if we calculate the area under the density function, it will never exceed 1. Such areas are also called probability mass.
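To see this concretely, here's a quick check you can run:

dnorm( 0 , 0 , 0.1 )                            # density, about 3.99, bigger than 1
integrate( dnorm , -1 , 1 , mean=0 , sd=0.1 )   # but the area under the curve is 1

The mass accumulated over any interval never exceeds 1, even where the density itself does.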


You can usually ignore all these density/mass details while doing computational work. But it's good to be aware of the distinctions, if only passively.

The Gaussian distribution is routinely seen without σ but with another parameter, τ. The parameter τ in this context is usually called precision and defined as τ = 1/σ². This change of parameters gives us the equivalent formula (just substitute σ = 1/√τ):

Pr(y | µ, τ) = √(τ / 2π) exp( −(1/2) τ (y − µ)² )

This form is common in Bayesian data analysis, and Bayesian model fitting software, such as BUGS or JAGS, sometimes requires using τ rather than σ.

4.2. A language for describing models

This book adopts a standard language for describing and coding statistical models. You find this language in many statistical texts and in nearly all statistical journals, as it is general to both Bayesian and non-Bayesian modeling. Scientists increasingly use this same language to describe their statistical methods, as well. So learning this language is an investment, no matter where you are headed next.

Here's the approach, in abstract. There will be many examples later. These numbered items describe the choices that will be encoded in your model description.

(1) First, we recognize a set of measurements that we hope to predict or understand, the outcome variable or variables.
(2) For each of these outcome variables, we define a likelihood distribution that defines the plausibility of individual observations. In linear regression, this distribution is always Gaussian.
(3) Then we recognize a set of other measurements that we hope to use to predict or understand the outcome. Call these predictor variables.
(4) We relate the exact shape of the likelihood distribution—its precise location and variance and other aspects of its shape, if it has them—to the predictor variables. In choosing a way to relate the predictors to the outcomes, we are forced to name and define all of the parameters of the model.
(5) Finally, we choose priors for all of the parameters in the model. These priors define the initial information state of the model, before seeing the data.

After all these decisions are made—and most of them will come to seem automatic to you before long—we summarize the model with something mathy like:

outcome_i ∼ Normal(µ_i, σ)
µ_i = β × predictor_i
β ∼ Normal(0, 10)
σ ∼ Cauchy(0, 1)

If that doesn't make much sense, good. That indicates that you are holding the right textbook, since this book teaches you how to read and write these mathematical model descriptions. We won't do any mathematical manipulation of them. Instead, they provide an unambiguous way to define and communicate our models. Once you get comfortable with their grammar, when you start reading these mathematical descriptions in other books or in scientific journals, you'll find them less obtuse.


The approach above surely isn't the only way to describe statistical modeling, but it is a widespread and productive language. Once a scientist learns this language, it becomes easier to communicate the assumptions of our models. We no longer have to remember seemingly arbitrary lists of bizarre conditions like homoscedasticity (constant variance), because we can just read these conditions from the model definitions. We will also be able to see natural ways to change these assumptions, instead of feeling trapped within some procrustean model type like regression or multiple regression or ANOVA or ANCOVA or such. These are all the same kind of model, and that fact becomes obvious once we know how to talk about models as mappings of one set of variables through a probability distribution onto another set of variables.

4.2.1. Re-describing the globe tossing model. It's good to work with examples. Recall the proportion of water problem from previous chapters. The model in that case was always:

w ∼ Binomial(n, p)
p ∼ Uniform(0, 1)

where w was the observed count of water samples, n was the total number of samples, and p was the proportion of water on the actual globe. Read the above statement as:

The count w is distributed binomially with sample size n and probability p. The prior for p is assumed to be uniform between zero and one.

Once we know the model in this way, we automatically know all of its assumptions. We know the binomial distribution assumes that each sample (globe toss) is independent of the others, and so we also know that the model assumes that sample points are independent of one another.

For now, we'll focus on simple models like the above. In these models, the first line defines the likelihood function used in Bayes' theorem. The other lines define priors. Both of the lines in this model are STOCHASTIC, as indicated by the ∼ symbol. A stochastic relationship is just a mapping of a variable or parameter onto a distribution. It is stochastic because no single instance of the variable on the left is known with certainty. Instead, the mapping is probabilistic: some values are more plausible than others, but very many different values are plausible under any model. Later, we'll have models with deterministic definitions in them as well.

Overthinking: From model definition to Bayes' theorem. To relate the mathematical format above to Bayes' theorem, you could use the model definition to define the posterior distribution:

Pr(p | w, n) = Binomial(w | n, p) Uniform(p | 0, 1) / ∫ Binomial(w | n, p) Uniform(p | 0, 1) dp

That monstrous denominator is just the average likelihood again. It standardizes the posterior to sum to 1. The action is in the numerator, where the posterior probability of any particular value of p is seen again to be proportional to the product of the likelihood and prior. In R code form, this is the same grid approximation calculation you've been using all along. In a form recognizable as the above expression:

R code 4.6
w <- 6; n <- 9;
p_grid <- seq( from=0 , to=1 , length.out=100 )
posterior <- dbinom( w , n , p_grid ) * dunif( p_grid , 0 , 1 )
posterior <- posterior / sum(posterior)


Compare to the calculations in earlier chapters.

4.3. A Gaussian model of height

Let's build a linear regression model now. Well, it'll be a "regression" once we have a predictor variable in it. For now, we'll get the scaffold in place and construct the predictor variable in the next section.

We'll work through this material by using real sets of data. In this case, we want a single measurement variable to model as a Gaussian distribution. There will be two parameters describing the distribution's shape, the mean µ and the standard deviation σ. Bayesian updating will allow us to consider every possible combination of values for µ and σ and to score each combination by its relative plausibility, in light of the data. These relative plausibilities are the posterior probabilities of each combination of values µ, σ.

Another way to say the above is this. There are an infinite number of possible Gaussian distributions. Some have small means. Others have large means. Some are wide, with a large σ. Others are narrow. We want our Bayesian machine to consider every possible distribution, each defined by a combination of µ and σ, and rank them by posterior plausibility. Posterior plausibility provides a measure of the logical compatibility of each possible distribution with the data and model.

In practice we'll use approximations to the formal analysis. So we won't really consider every possible value of µ and σ. But that won't cost us anything in most cases. Instead the thing to worry about is keeping in mind that the "estimate" here will be the entire posterior distribution, not any point within it. And as a result, the posterior distribution will be a distribution of Gaussian distributions. Yes, a distribution of distributions. If that doesn't make sense yet, then that just means you are being honest with yourself. Hold on, work hard, and it will make plenty of sense before long.

4.3.1. The data. The data contained in data(Howell1) are partial census data for the Dobe area !Kung San, compiled from interviews conducted by Nancy Howell in the late 1960's. 56 For the non-anthropologists reading along, the !Kung San are the most famous foraging population of the 20th century, largely because of detailed quantitative studies by people like Howell.

Load the data and place them into a convenient object with:

library(rethinking)
data(Howell1)
d <- Howell1

Now inspect the structure of the resulting data frame with str( d ):


'data.frame': 544 obs. of 4 variables:
 $ height: num  152 140 137 157 145 ...
 $ weight: num  47.8 36.5 31.9 53 41.3 ...
 $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

This data frame contains four columns. Each column has 544 entries, so there are 544 individuals in these data. Each individual has a recorded height (centimeters), weight (kilograms), age (years), and "maleness" (0 indicating female and 1 indicating male).

We're going to work with just the height column, for the moment. The column containing the heights is really just a regular old R vector, the kind of list we have been working with in many of the code examples. You can access this vector by using its name:

R code 4.9
d$height

Read the symbol $ as extract, as in extract the column named height from the data frame d. All it does is give you the column that follows it.

Overthinking: Data frames. It might seem like this whole data frame thing is kind of annoying, right now. If we're working with only one column here, why bother with this d thing at all? You don't have to use a data frame, as you can just pass raw vectors to every command we'll use in this book. But keeping related variables in the same data frame is a huge convenience. Once we have more than one variable, and we wish to model one as a function of the others, you'll better see the value of the data frame. You won't have to wait long.

More technically, a data frame is a special kind of list in R. So you access the individual variables with the usual list "double bracket" notation, like d[[1]] for the first variable or d[['x']] for the variable named x. Unlike regular lists, however, data frames force all variables to have the same length. That isn't always a good thing. And that's why some statistical packages, like the powerful Stan Markov chain sampler (mc-stan.org), accept plain lists of data, rather than proper data frames.

All we want for now are heights of adults in the sample. The reason to filter out non-adults for now is that height is strongly correlated with age, before adulthood. Later in the chapter, I'll ask you to tackle the age problem. But for now, better to postpone it. You can filter the data frame down to individuals of age 18 or greater with:

R code 4.10
d2 <- d[ d$age >= 18 , ]

We'll be working with the data frame d2 now. It should have 352 rows (individuals) in it.

Overthinking: Index magic. The square bracket notation used in the code above is index notation. It is very powerful, but also quite compact and confusing. The data frame d is a matrix, a rectangular grid of values. You can access any value in the matrix with d[row,col], replacing row and col with row and column numbers. If row or col are lists of numbers, then you get more than one row or column. If you leave the spot for row or col blank, then you get all of whatever you leave blank. For example, d[ 3 , ] gives all columns at row 3. Typing d[,] just gives you the entire matrix, because it returns all rows and all columns.

So what d[ d$age >= 18 , ] does is give you all of the rows in which d$age is greater-than-or-equal-to 18. It also gives you all of the columns, because the spot after the comma is blank. The result is stored in d2, the new data frame containing only adults. With a little practice, you can use this square bracket index notation to perform custom searches of your data, much like performing a database query.
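Here are a few such queries, just to make the notation concrete (the column names come from the data frame d above):

d[ 3 , ]           # row 3, all columns
d[ , "height" ]    # all rows, just the height column
d$height[ 1:5 ]    # the first five heights
nrow( d2 )         # count the adults: 352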


4.3.2. The model. Our goal is to model these values using a Gaussian distribution. First, go ahead and plot the distribution of heights, with dens(d2$height). These data look rather Gaussian in shape, as is typical of height data. This may be because height is a sum of many small growth factors. As you saw at the start of the chapter, a distribution of sums tends to converge to a Gaussian distribution. Whatever the reason, adult heights are nearly always approximately normal.

So it's reasonable for the moment to adopt the stance that the model's likelihood should be Gaussian. But be careful about choosing the Gaussian distribution only when the plotted outcome variable looks Gaussian to you. Gawking at the raw data, to try to decide how to model them, is usually not a good idea. The data could be a mixture of different Gaussian distributions, for example, and in that case you won't be able to detect the underlying normality just by eyeballing the outcome distribution. Furthermore, as mentioned earlier in this chapter, the empirical distribution needn't be actually Gaussian in order to justify using a Gaussian likelihood.

So which Gaussian distribution? There are an infinite number of them, with an infinite number of different means and standard deviations. We're ready to write down the general model and compute the plausibility of each combination of µ and σ. To define the heights as normally distributed with a mean µ and standard deviation σ, we write:

h_i ∼ Normal(µ, σ).

In many books you'll see the same model written as h_i ∼ N(µ, σ), which means the same thing. The symbol h refers to the list of heights, and the subscript i means each individual element of this list. It is conventional to use i because it stands for index. The index i takes on row numbers, and so in this example can take any value from 1 to 352 (the number of heights in d2$height). As such, the model above is saying that all the golem knows about each height measurement is defined by the same normal distribution, with mean µ and standard deviation σ. Before long, those little i's are going to show up on the righthand side of the model definition, and you'll be able to see why we must bother with them. So don't ignore the i, even if it seems like useless ornamentation right now.

Rethinking: Independent and identically distributed. The short model above is sometimes described as assuming that the values h_i are independent and identically distributed, which may be abbreviated i.i.d., iid, or IID. You might even see the same model written:

h_i ∼ Normal(µ, σ),  i.i.d.

"iid" indicates that each value h_i has the same probability function, independent of the other h values and using the same parameters. A moment's reflection tells us that this is hardly ever true, in a physical sense. Whether measuring the same distance repeatedly or studying a population of heights, it is hard to argue that every measurement is independent of the others. For example, heights within families are correlated because of alleles shared through recent shared ancestry.

The i.i.d. assumption doesn't have to seem awkward, however, as long as you remember that probability is inside the golem, not outside in the world. The i.i.d. assumption is about how the golem represents its uncertainty. It is an epistemological assumption. It is not a physical assumption about the world, an ontological one, unless you insist that it is. E. T. Jaynes (1922–1998) called this the mind projection fallacy, the mistake of confusing epistemological claims with ontological claims. 57


The point isn't to say epistemology trumps reality, but rather that in ignorance of such correlations the distribution most consistent with our state of information is i.i.d. Such an assumption can be rejected, by checking the model. Furthermore, there is a mathematical result known as de Finetti's theorem that tells us that values which are EXCHANGEABLE can be approximated by mixtures of i.i.d. distributions. Colloquially, exchangeable values can be reordered. The practical impact of this is that "i.i.d." as an assumption cannot be read too literally, as different process models again correspond to the same statistical model (as argued in Chapter 1). Even furthermore, there are many types of correlation that do little or nothing to the overall shape of a distribution, but only affect the precise sequence in which values appear. For example, pairs of sisters have highly correlated heights. But the overall distribution of female height remains almost perfectly normal. In such cases, i.i.d. remains perfectly useful, despite ignoring the correlations. Consider for example that Markov chain Monte Carlo (Chapter 8) can use highly-correlated sequential samples to estimate most any iid distribution we like.

To complete the model, we're going to need some priors. The parameters to be estimated are both µ and σ, so we need a prior Pr(µ, σ), the joint prior probability for all parameters. In most cases, priors are specified independently for each parameter, which amounts to assuming Pr(µ, σ) = Pr(µ) Pr(σ). Then we can write:

h_i ∼ Normal(µ, σ)      [likelihood]
µ ∼ Normal(156, 10)     [µ prior]
σ ∼ Uniform(0, 50)      [σ prior]

The labels on the right are not part of the model, but instead just notes to help you keep track of the purpose of each line. The prior for µ is a broad Gaussian prior, centered on 156cm, with 95% of probability between 156 ± 20.

It's a very good idea to plot your priors, so you have a sense of the assumptions they build into the model. In this case:

R code 4.11
curve( dnorm( x , 156 , 10 ) , from=100 , to=200 )

Execute that code yourself, to see that the golem is assuming that the average height (not each individual height) is almost certainly between 120cm and 190cm. So this prior carries a little information, but not a lot. The σ prior is a truly flat prior, a uniform one, that functions just to constrain σ to have positive probability between zero and 50cm. View it with:

R code 4.12
curve( dunif( x , 0 , 50 ) , from=-10 , to=60 )

A standard deviation like σ must be positive, so bounding it at zero makes sense. How should we pick the upper bound? In this case, a standard deviation of 50cm would imply that 95% of individual heights lie within 100cm of the average height. That's a very large range.

All this talk is nice, but it'll help to really see what these priors imply about the distribution of individual heights. You didn't specify a prior probability distribution of heights directly, but once you've chosen priors for µ and σ, these imply a prior distribution of individual heights. You can quickly simulate heights by sampling from the prior, like you sampled from the posterior back in Chapter 3. Remember, every posterior is also potentially a prior for a subsequent analysis, so you can process priors just like posteriors.


sample_mu <- rnorm( 1e4 , 156 , 10 )
sample_sigma <- runif( 1e4 , 0 , 50 )
prior_h <- rnorm( 1e4 , sample_mu , sample_sigma )
dens( prior_h )

The result is the distribution of individual heights implied by the priors, averaging over the prior uncertainty in µ and σ.


4.3.3. Grid approximation of the posterior distribution. Since this is the first Gaussian model in the book, it's worth quickly mapping out the posterior distribution through brute-force grid approximation. Once you have the samples you'll produce in this subsection, you can compare them to the quadratic approximation in the next.

Unfortunately, doing the calculations here requires some technical tricks that add little, if any, conceptual insight. So I'm going to present the code here without explanation. You can execute it and keep going for now, but later return and follow the endnote for an explanation of the algorithm. 58 For now, here are the guts of the golem:

R code 4.14
mu.list <- seq( from=140 , to=160 , length.out=200 )
sigma.list <- seq( from=4 , to=9 , length.out=200 )
post <- expand.grid( mu=mu.list , sigma=sigma.list )
post$LL <- sapply( 1:nrow(post) , function(i) sum( dnorm(
    d2$height ,
    mean=post$mu[i] ,
    sd=post$sigma[i] ,
    log=TRUE ) ) )
post$prod <- post$LL + dnorm( post$mu , 156 , 10 , TRUE ) +
    dunif( post$sigma , 0 , 50 , TRUE )
post$prob <- exp( post$prod - max(post$prod) )

4.3.4. Sampling from the posterior. To sample combinations of µ and σ from this grid posterior, draw rows of post in proportion to the values in post$prob, and then pull out the parameter values on those rows:

sample.rows <- sample( 1:nrow(post) , size=1e4 , replace=TRUE ,
    prob=post$prob )
sample.mu <- post$mu[ sample.rows ]
sample.sigma <- post$sigma[ sample.rows ]

You end up with 10,000 samples, with replacement, from the posterior for the height data. Take a look at them:

plot( sample.mu , sample.sigma , cex=0.5 , pch=16 , col=col.alpha(rangi2,0.1) )


FIGURE 4.3. Samples from the posterior distribution for the heights data. The density of points is highest in the center, reflecting the most plausible combinations of µ and σ.

I reproduce this plot in FIGURE 4.3. Note that the function col.alpha is part of the rethinking R package. All it does is make colors transparent, which helps the plot in FIGURE 4.3 more easily show density, where samples overlap. Adjust the plot to your tastes by playing around with cex (character expansion, the size of the points), pch (plot character), and the 0.1 transparency value.

Now that you have these samples, you can describe the distribution of confidence in each combination of µ and σ by summarizing the samples. Think of them like data and describe them, just like in Chapter 3. For example, to characterize the shapes of the marginal posterior densities of µ and σ, all we need to do is:

R code 4.19
dens( sample.mu )
dens( sample.sigma )

The jargon "marginal" here means "averaging over the other parameters." Execute the above code and inspect the plots. These densities are very close to being normal distributions. And this is quite typical. As sample size increases, posterior densities approach the normal distribution. If you look closely, though, you'll notice that the density for σ has a longer righthand tail. I'll exaggerate this tendency a bit later, to show you that this condition is very common for standard deviation parameters.

To summarize the widths of these densities with highest posterior density intervals (HPDIs), just like in Chapter 3:

R code 4.20
HPDI( sample.mu , prob=0.95 )
HPDI( sample.sigma , prob=0.95 )

Since these samples are just vectors of numbers, you can compute any statistic from them that you could from ordinary data. If you want the mean or median, just use the corresponding R functions.

Overthinking: Sample size and the normality of σ's posterior. Before moving on to using quadratic approximation (map) as a shortcut to all of this inference, it is worth repeating the analysis of the height data above, but now with only a fraction of the original data. The reason to do this is to demonstrate that, in principle, the posterior is not always so Gaussian in shape.


There's no trouble with the mean, µ. For a Gaussian likelihood and a Gaussian prior on µ, the posterior distribution is always Gaussian as well, regardless of sample size. It is the standard deviation σ that causes problems. So if you care about σ—often people do not—you do need to be careful of abusing the quadratic approximation.

The deep reasons for the posterior of σ tending to have a long righthand tail are complex. But a useful way to conceive of the problem is that variances must be positive. As a result, there must be more uncertainty about how big the variance (or standard deviation) is than about how small it is. For example, if the variance is estimated to be near zero, then you know for sure that it can't be much smaller. But it could be a lot bigger.

Let's quickly analyze only 20 of the heights from the height data to reveal this issue. To sample 20 random heights from the original list:

R code 4.21
d3 <- sample( d2$height , size=20 )
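Repeating the grid approximation with these 20 heights follows the same recipe as before. Here's a sketch; the grid boundaries below are my own guesses, widened for σ because 20 heights leave it quite uncertain:

post2 <- expand.grid( mu=seq(150,170,length.out=200) ,
    sigma=seq(4,20,length.out=200) )
post2$LL <- sapply( 1:nrow(post2) , function(i)
    sum( dnorm( d3 , mean=post2$mu[i] , sd=post2$sigma[i] , log=TRUE ) ) )
post2$prod <- post2$LL + dnorm( post2$mu , 156 , 10 , TRUE ) +
    dunif( post2$sigma , 0 , 50 , TRUE )
post2$prob <- exp( post2$prod - max(post2$prod) )
sample2.rows <- sample( 1:nrow(post2) , size=1e4 , replace=TRUE ,
    prob=post2$prob )
dens( post2$sigma[ sample2.rows ] , norm.comp=TRUE )

The dens call overlays a normal comparison, so you can see the long righthand tail in σ's marginal posterior, as in FIGURE 4.4.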


FIGURE 4.4. Samples from the posterior for only 20 adult heights, instead of the full set of 352. On the left: Samples of µ and σ drawn from the posterior. On the right: The posterior of σ, averaging over µ. The black curve is a comparison to a Gaussian distribution with the same mean and variance. Unlike for the posterior from the full set of 352 heights, now the uncertainty in σ is asymmetrical, with a long tail towards larger values. Your own plots will look a bit different, because the 20 heights are sampled randomly.

4.3.5. Fitting the model with map. Now let's leave grid approximation behind and fit the same model with quadratic approximation, using the map function in the rethinking package. map works by using the model definition you were introduced to earlier in this chapter. Each line in the definition has a corresponding definition in the form of R code. The engine inside map then uses these definitions to define the posterior probability at each combination of parameter values. Then it can climb the posterior distribution and find the peak, its MAP.

Let's begin by repeating the code to load the data and select out the adults:

library(rethinking)
data(Howell1)
d <- Howell1
d2 <- d[ d$age >= 18 , ]


Now we're ready to define the model, translating the mathematical definition from earlier into a list of formulas:

flist <- alist(
    height ~ dnorm( mu , sigma ) ,
    mu ~ dnorm( 156 , 10 ) ,
    sigma ~ dunif( 0 , 50 )
)

Note the commas at the end of each line, except the last. These commas separate each line of the model definition.

The next step is to specify starting values for each parameter in the model. In this case, that means the parameters µ and σ. The start values are the locations that R begins hill-climbing at. Here's a good list of starting values in this case:

R code 4.25
start <- list(
    mu=mean(d2$height) ,
    sigma=sd(d2$height)
)


4.3. A GAUSSIAN MODEL OF HEIGHT 101allow you to make a collection of arbitrary R objects. ey differ in one important respect: list evaluatesthe code you embed inside it, while alist does not. So when you define a list of formulas, youshould use alist, so the code isn’t executed. But when you define a list of start values for parameters,you should use list, so that code like mean(d2$height) will be evaluated to a numeric value.e priors we used before are very weak, both because they are nearly flat and becausethere is so much data. So I’ll splice in a more informative prior for µ, so you can see theeffect. All I’m going to do is change the standard deviation of the prior to 0.1, so it’s a verynarrow prior. I’ll also build the formula and start lists right into the call to map, so you cansee how to build it all at once.m4.2


Overthinking: How strong is a prior? A Gaussian prior with standard deviation σ carries information equivalent to having observed n = 1/σ² values. The Normal(156, 0.1) prior above thus implies n = 1/0.1² = 100: it is as if the golem had previously observed 100 heights with mean value 156. That's a pretty strong prior. In contrast, the former Normal(156, 10) prior implies n = 1/10² = 0.01, a hundredth of an observation. This is an extremely weak prior. But of course exactly how strong or weak either prior is will depend upon how much data is used to update it.

4.3.6. Sampling from a map fit. The above explains how to get a MAP quadratic approximation of the posterior, using map. But how do you then get samples from the quadratic approximate posterior distribution? The answer is rather simple, but non-obvious, and it requires recognizing that a quadratic approximation to a posterior distribution with more than one parameter dimension—µ and σ each contribute one dimension—is just a multi-dimensional Gaussian distribution.

As a consequence, when R constructs a quadratic approximation, it calculates not only standard deviations for all parameters, but also the covariances among all pairs of parameters. Just like a mean and standard deviation (or its square, a variance) are sufficient to describe a one-dimensional Gaussian distribution, a list of means and a matrix of variances and covariances is sufficient to describe a multi-dimensional Gaussian distribution. To see this matrix of variances and covariances, for model m4.1, use:

R code 4.29
vcov( m4.1 )

                mu        sigma
mu    0.1695256052 0.0003866944
sigma 0.0003866944 0.0849067256

The above is a VARIANCE-COVARIANCE matrix. It is the multi-dimensional glue of a quadratic approximation, because it tells us how each parameter relates to every other parameter in the posterior distribution. A variance-covariance matrix can be factored into two elements: (1) a vector of variances for the parameters and (2) a correlation matrix that tells us how changes in any parameter lead to correlated changes in the others. This decomposition is usually easier to understand. So let's do that now:

R code 4.30
diag( vcov( m4.1 ) )
cov2cor( vcov( m4.1 ) )

        mu      sigma
0.16952561 0.08490673

              mu      sigma
mu    1.00000000 0.00322314
sigma 0.00322314 1.00000000

The two-element vector in the output is the list of variances. If you take the square root of this vector, you get the standard deviations that are shown in precis output. The two-by-two matrix in the output is the correlation matrix. Each entry shows the correlation, bounded between −1 and +1, for each pair of parameters. The 1's indicate a parameter's correlation with itself. If these values were anything except 1, we would be worried. The other entries are typically closer to zero, and they are very close to zero in this example. This indicates that learning µ tells us nothing about σ and likewise that learning σ tells us nothing about µ.


This is typical of simple Gaussian models of this kind, but it is quite rare in more general modeling.

Okay, so how do we get samples from this multi-dimensional posterior? Now instead of sampling single values from a simple Gaussian distribution, we sample vectors of values from a multi-dimensional Gaussian distribution. The rethinking package provides a convenience function to do exactly that:

library(rethinking)
post <- extract.samples( m4.1 , n=1e4 )

The result, post, is a data frame with one column per parameter and one row per sample, so you can summarize it just like the grid samples from before.
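Under the hood, extract.samples does little more than draw from a multivariate Gaussian defined by the MAP means and the variance-covariance matrix. A sketch of the equivalent call, using the mvrnorm function from the MASS package:

library(MASS)
post <- mvrnorm( n=1e4 , mu=coef(m4.1) , Sigma=vcov(m4.1) )

The result is a matrix rather than a data frame, but the numbers are samples from the same multi-dimensional posterior.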


4.4. Adding a predictor

What we've done above is a Gaussian model of height in a population of adults. But it doesn't really have the usual feel of "regression" to it. Typically, we are interested in modeling how an outcome is related to some predictor variable. And by including a predictor variable in a particular way, we'll have linear regression.

So now let's look at how height in these Kalahari foragers covaries with weight. This isn't the most thrilling scientific question, I know. But it is an easy relationship to start with. Go ahead and plot height and weight against one another to get an idea of how strongly they covary:

R code 4.34
plot( d2$height ~ d2$weight )

There's obviously a relationship here: knowing a person's weight helps you predict height. But how do we take our Gaussian model from the previous section and incorporate predictor variables?

Rethinking: What is "regression"? Many diverse types of models are called "regression." The term has become synonymous with using one or more predictor variables to model the distribution of one or more outcome variables. The original use of the term, however, arose from anthropologist Francis Galton's (1822–1911) observation that the sons of tall and short men tended to be more similar to the population mean, hence regression to the mean. 59

This phenomenon arises statistically whenever individual measurements are assigned a common distribution, leading to shrinkage as each measurement informs the others. In the context of Galton's height data, attempting to predict each son's height on the basis of only his father's height is folly. Better to use the population of fathers. This leads to a prediction for each son which is similar to each father but "shrunk" towards the overall mean. Such predictions are routinely better. This same regression/shrinkage phenomenon applies at higher levels of abstraction and forms one basis of multilevel modeling (Chapter 13).

4.4.1. The linear model strategy. The strategy is to make the parameter for the mean of a Gaussian distribution, µ, into a linear function of the predictor variable and other, new parameters that we invent. This strategy is often simply called the LINEAR MODEL. The linear model strategy instructs the golem to assume that the predictor variable has a perfectly constant and additive relationship to the mean of the outcome. The golem then computes the posterior distribution of this constant relationship.

What this means, recall, is that the machine considers every possible combination of the parameter values. With a linear model, some of the parameters now stand for the strength of association between the mean of the outcome and the value of the predictor. For each combination of values, the machine computes the posterior probability, which is a measure of relative plausibility, given the model and data. So the posterior distribution ranks the infinite possible combinations of parameter values by their logical plausibility. As a result, the posterior distribution provides relative plausibilities of the different possible strengths of association, given the assumptions you programmed into the model.


Here's how it works, in the simplest case of only one predictor variable. We'll wait until the next chapter to confront more than one predictor. Recall the basic Gaussian model:

h_i ∼ Normal(µ, σ)        [likelihood]
µ ∼ Normal(156, 10)       [µ prior]
σ ∼ Uniform(0, 50)        [σ prior]

Now how do we get weight into a Gaussian model of height? Let x be the mathematical name for the column of weight measurements, d2$weight. Now we have a predictor variable x, which is a list of measures of the same length as h. We'd like to say how knowing the values in x can help us describe or predict the values in h. To get weight into the model in this way, we define the mean µ as a function of the values in x. This is what it looks like, with explanation to follow:

h_i ∼ Normal(µ_i, σ)      [likelihood]
µ_i = α + βx_i            [linear model]
α ∼ Normal(156, 100)      [α prior]
β ∼ Normal(0, 10)         [β prior]
σ ∼ Uniform(0, 50)        [σ prior]

Again, I've labeled each line on the righthand side, by the type of definition it encodes. We'll discuss them in turn.

4.4.1.1. Likelihood. To decode all of this, let's begin with just the likelihood, the first line of the model. This is nearly identical to before, except now there is a little index i on the µ, as well as the h. This is necessary now, because the mean µ now depends upon unique predictor values on each row i. So the little i on µ_i indicates that the mean depends upon the row.

4.4.1.2. Linear model. The mean µ is no longer a parameter to be estimated. Rather, as seen in the second line of the model, µ_i is constructed from other parameters, α and β, and the predictor variable x. This line is not a stochastic relationship—there is no ∼ in it, but rather an = in it—because the definition of µ_i is deterministic, not probabilistic. That is to say that, once we know α and β and x_i, we also know µ_i.

The value x_i is just the weight value on row i. It refers to the same individual as the height value, h_i, on the same row. The parameters α and β are more mysterious. Where did they come from? We made them up. The parameters µ and σ are necessary and sufficient to describe a Gaussian distribution. But α and β are instead devices we invent for manipulating µ, allowing it to vary systematically across cases in the data.

You'll be making up all manner of parameters as your skills improve. One way to understand these made-up parameters is to think of them as targets of learning. Each parameter is something that must be described in the posterior density. So when you want to know something about the data, you ask your golem by inventing a parameter for it. This will make more and more sense as you progress. Here's how it works in this context. The second line of the model definition is just:

µ_i = α + βx_i

What this tells the regression golem is that you are asking two questions about the mean of the outcome.


(1) What is the expected height, when x_i = 0? The parameter α answers this question. For this reason, α is often called the intercept.

(2) What is the change in expected height, when x_i changes by 1 unit? The parameter β answers this question.

Jointly these two parameters, together with x, ask the golem to find a line that relates x to h, a line that passes through α when x_i = 0 and has slope β. That is a task that golems are very good at. It's up to you, though, to be sure it's a good question.

Rethinking: Nothing special or natural about linear models. Note that there's nothing special about the linear model, really. You can choose a different relationship between α and β and µ. For example, the following is a perfectly legitimate definition for µ_i:

µ_i = α exp(−βx_i)

This does not define a linear regression, but it does define a regression model. The linear relationship we are using instead is conventional, but nothing requires that you use it. It is very common in some fields, like ecology and demography, to use functional forms for µ that come from theory, rather than the geocentrism of linear models.

Overthinking: Units and regression models. Readers who had a traditional training in physical sciences will know how to carry units through equations of this kind. For their benefit, here's the model again (omitting priors for brevity), now with units of each symbol added:

h_i cm ∼ Normal(µ_i cm, σ cm)
µ_i cm = α cm + β (cm/kg) × x_i kg

So you can see that β must have units of cm/kg in order for the mean µ_i to have units cm. One of the facts that labeling with units clears up is that a parameter like β is a kind of rate.

4.4.1.3. Priors. The remaining lines in the model define priors for the parameters to be estimated: α, β, and σ. All of these are weak priors, leading to inferences that will echo non-Bayesian methods of model fitting, such as maximum likelihood. But as always, you should play around with the priors—plotting them, changing them and refitting the model—to get a sense of their influence.

You've seen priors for α and σ before, although α was called µ back then. I've widened the prior for α, since as you'll see it is common for the intercept in a linear model to swing a long way from the mean of the outcome variable. The flat prior here with a huge standard deviation will allow it to move wherever it needs to.

The prior for β deserves explanation. Why have a Gaussian prior with mean zero? This prior places just as much probability below zero as it does above zero, and when β = 0, weight has no relationship to height. So many people see it as a conservative assumption. And such a prior will pull probability mass towards zero, leading to more conservative estimates than a perfectly flat prior will. So many people find such a prior appealing for that reason.

But note that a Gaussian prior with standard deviation of 10 is still very weak, so the amount of conservatism it induces will be very small. As you make the standard deviation in this prior smaller, the amount of shrinkage towards zero increases and your model produces more and more conservative estimates about the relationship between height and weight.
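One way to see just how weak these priors are is to sample intercepts and slopes from them, before seeing any data, and plot the lines they imply. This is my illustration, not code from the text; the plot limits are arbitrary choices:

# sample 20 intercept/slope pairs from the priors and draw the implied lines
a.prior <- rnorm( 20 , 156 , 100 )
b.prior <- rnorm( 20 , 0 , 10 )
plot( NULL , xlim=c(30,65) , ylim=c(-500,800) ,
    xlab="weight" , ylab="height" )
for ( i in 1:20 ) abline( a=a.prior[i] , b=b.prior[i] )

Most of these prior lines are absurd as statements about human height, which is exactly what it means for a prior to be weak: the data will easily overrule it.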


Before fitting this model, though, consider whether this zero-centered prior really makes sense. Do you think there's just as much chance that the relationship between height and weight is negative as that it is positive? Of course you don't. In this context, such a silly prior is harmless, because there is a lot of data. But in other contexts, your golem may need a little nudge in the right direction.

Rethinking: What's the correct prior? People commonly ask what the correct prior is for a given analysis. The question sometimes implies that for any given set of data, there is a uniquely correct prior that must be used, or else the analysis will be invalid. This is a mistake. There is no more a uniquely correct prior than there is a uniquely correct likelihood. Instead, statistical models are machines for inference. Many machines will work, but some work better than others. Priors can be wrong, but only in the same sense that a kind of hammer can be wrong for building a table.

In choosing priors, there are simple guidelines to get you started. Priors encode states of information before seeing data. So priors allow us to explore the consequences of beginning with different information. In cases in which we have good prior information that discounts the plausibility of some parameter values, like negative associations between height and weight, we can encode that information directly into priors. When we don't have such information, we still usually know enough about the plausible range of values. And you can vary the priors and repeat the analysis in order to study how different states of initial information influence inference. Frequently, there are many reasonable choices for a prior, and all of them produce the same inference.

Making choices tends to make novices nervous. There's an illusion sometimes that default procedures are more objective than procedures that require user choice, such as choosing priors. If that's true, then all "objective" means is that everyone does the same thing. It carries no guarantees of realism or accuracy. Furthermore, non-Bayesian procedures make all manner of automatic choices, some of which are equivalent to choosing particular priors. So it isn't the case that Bayesian models make more assumptions than non-Bayesian models; they just make it easier to notice the assumptions.

4.4.2. Fitting the model. The code needed to fit this model via quadratic approximation is a straightforward modification of the kind of code you've already seen. All we have to do is incorporate our new model for the mean into the model specification inside map and be sure to add our new parameters to the start list. Let's repeat the model definition, now with the corresponding R code on the righthand side:

h_i ∼ Normal(µ_i, σ)       height ~ dnorm(mu,sigma)
µ_i = α + βx_i             mu <- a + b*weight
α ∼ Normal(156, 100)       a ~ dnorm(156,100)
β ∼ Normal(0, 10)          b ~ dnorm(0,10)
σ ∼ Uniform(0, 50)         sigma ~ dunif(0,50)


R code 4.35
# load data again
library(rethinking)
data(Howell1)
d <- Howell1
d2 <- d[ d$age >= 18 , ]

# fit model
m4.3 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b*weight ,
        a ~ dnorm( 156 , 100 ) ,
        b ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ) ,
    data=d2 ,
    start=list( a=mean(d2$height) , sigma=sd(d2$height) , b=0 ) )

The linear model can also be embedded directly inside the likelihood, in the first line:


R code 4.36
m4.3 <- map(
    alist(
        height ~ dnorm( a + b*weight , sigma ) ,
        a ~ dnorm( 156 , 100 ) ,
        b ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ) ,
    data=d2 ,
    start=list( a=mean(d2$height) , sigma=sd(d2$height) , b=0 ) )

This is exactly the same model, but with the linear definition for µ embedded in the first line of code. The form with the explicit linear model, mu <- a + b*weight, is easier to read, however, and it will pay off later, once your models contain more than one linear model.


Rethinking: What do parameters mean? A basic issue with interpreting model-based estimates is in knowing the meaning of parameters. There is no consensus about what an estimate means, however, because different people take different philosophical stances towards models, probability, and prediction. The perspective in this book is a common Bayesian perspective: posterior probabilities of parameter values describe the relative compatibility of different states of the world with the data, according to the model. These are small world (Chapter 2) numbers. So reasonable people may disagree about the large world meaning, and the details of those disagreements depend strongly upon context. Such disagreements are productive, because they lead to model criticism and revision, something that golems cannot do for themselves.

4.4.3.1. Tables of estimates. With the new linear regression fit to the Kalahari data, we inspect the estimates:

R code 4.37
precis( m4.3 )

       Mean StdDev   2.5%  97.5%
a    113.89   1.91 110.16 117.63
sigma  5.07   0.19   4.70   5.45
b      0.90   0.04   0.82   0.99

The first row gives the quadratic approximation for α, the second the approximation for σ, and the third the approximation for β. Let's try to make some sense of them, in this very simple model.

Best to begin from the bottom, with b (β), because it's the new parameter. Since β is a slope, the value 0.90 can be read as a person 1 kg heavier is expected to be 0.90 cm taller. 95% of the posterior probability lies between 0.82 and 0.99. That suggests that β values close to zero or greatly above one are highly incompatible with these data and this model. If you were thinking that perhaps there was no relationship at all between height and weight, then this estimate becomes strong evidence of a positive relationship instead. But maybe you just wanted as precise a measurement as possible of the relationship between height and weight. This estimate embodies that measurement, conditional on the model, as always. For a different model, the measure of the relationship might be different.

The estimate of α, a in the precis table, indicates that a person of weight 0 should be 114 cm tall. This is nonsense, since real people always have positive weight, yet it is also true. Parameters like α are "intercepts" that tell us the value of µ when all of the predictor variables have value zero. As a consequence, the value of the intercept is frequently uninterpretable without also studying any β parameters.

Finally, the estimate for σ, sigma, informs us of the width of the distribution of heights around the mean. A quick way to interpret it is to recall that about 95% of the probability in a Gaussian distribution lies within two standard deviations of the mean. So in this case, the estimate tells us that it is most likely that 95% of heights lie within 10 cm (2σ) of the mean height. But there is also uncertainty about this, as always. So we can also take note of the fact that the 95% percentile interval implies that 95% of heights lie somewhere within 9.4 cm to 10.9 cm of the mean. In other words, 95% of the plausible values of 2σ lie between 9.4 and 10.9.

The numbers in the default precis output aren't sufficient to describe the quadratic posterior completely. For that, we also require the variance-covariance matrix. We're interested in correlations among parameters—we already have their variance in the table above—so let's go straight to the correlation matrix:


R code 4.38
precis( m4.3 , corr=TRUE )

        Mean StdDev   2.5%  97.5%     a sigma     b
a     113.89   1.91 110.16 117.63  1.00     0 -0.99
sigma   5.07   0.19   4.70   5.45  0.00     1  0.00
b       0.90   0.04   0.82   0.99 -0.99     0  1.00

The new columns on the far right show the correlations among the parameters. This is the same information you'd get by using cov2cor(vcov(m4.3)). Notice that α and β are almost perfectly negatively correlated. Right now, this is harmless. It just means that these two parameters carry the same information—as you change the slope of the line, the best intercept changes to match it. But in more complex models, strong correlations like this can make it difficult to fit the model to the data. So we'll want to use some golem engineering tricks to avoid it, when possible.

The first trick is CENTERING. Centering is the procedure of subtracting the mean of a variable from each value. To create a centered version of the weight variable:

R code 4.39
d2$weight.c <- d2$weight - mean(d2$weight)
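To see what centering buys you, refit the model with weight.c in place of weight. The code below is a sketch following the chapter's conventions; the model name m4.4 is my assumption:

# refit the model, using the centered predictor (m4.4 is an assumed name)
m4.4 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b*weight.c ,
        a ~ dnorm( 156 , 100 ) ,
        b ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ) ,
    data=d2 ,
    start=list( a=mean(d2$height) , sigma=sd(d2$height) , b=0 ) )
precis( m4.4 , corr=TRUE )

In the new output, the estimate for a matches the mean height in the sample, and its correlation with b is now zero.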


FIGURE 4.5. Height in centimeters (vertical) plotted against weight in kilograms (horizontal), with the maximum a posteriori line for the mean height at each weight plotted in black.

After centering, the intercept estimate gives the expected value of the outcome, when the predictor is at its average value. This makes interpreting the intercept a lot easier.

4.4.3.2. Plotting posterior inference against the data. It's almost always helpful to plot the posterior inference against the data. Not only does plotting help in interpreting the posterior, but it also provides an informal check on model assumptions. When the model's predictions don't come close to key observations or patterns in the plotted data, then you might suspect the model is badly specified and return to critique its assumptions.

But even if you only treat plots as a way to help in interpreting the posterior, they are invaluable. For simple models like this one, it is easy to just read the table of numbers and understand what the model says. But for even slightly more complex models, especially those that include interaction effects (Chapter 7), interpreting posterior distributions is hard. Combine with this the problem of incorporating the information in vcov into your interpretations, and the plots are irreplaceable.

We're going to start with a simple version of that task, superimposing just the MAP values over the height and weight data. Then we'll slowly add more and more information to the prediction plots, until we've used the entire posterior distribution.

To superimpose the MAP values for mean height over the actual data:

R code 4.42
plot( height ~ weight , data=d2 )
abline( a=coef(m4.3)["a"] , b=coef(m4.3)["b"] )

You can see the resulting plot in FIGURE 4.5. Each point in this plot is a single individual. The black line is defined by the MAP slope β and MAP intercept α. Notice that in the code above, I pulled the numbers straight from the fit model. The function coef returns a vector of MAP values, and the names of the parameters extract the intercept and slope.

4.4.3.3. Adding uncertainty around the mean. The MAP line is just the posterior mean, the most plausible line in the infinite universe of lines the posterior distribution has considered. Plots of the MAP line, like FIGURE 4.5, are useful for getting an impression of the magnitude of the estimated influence of a variable, like weight, on an outcome, like height. But they do a poor job of communicating uncertainty. Remember, the posterior distribution considers every possible regression line connecting height to weight. It assigns a relative plausibility to each.


This means that each combination of α and β has a posterior probability. It could be that there are many lines with nearly the same posterior probability as the MAP line. Or it could be instead that the posterior distribution is rather narrow near the MAP line.

So how can we get that uncertainty onto the plot? Together, a combination of α and β define a line. And so we could sample a bunch of lines from the posterior distribution. Then we could display those lines on the plot, to visualize the uncertainty in the regression relationship.

Let's do that: sample from the posterior distribution and display a bunch of lines on the plot. What's nice about working with samples here, as always, is that the samples will retain the correlation structure of the posterior distribution. So the samples will be correlated in the right way, and anything we compute from them will also be correlated in the right way. To extract some samples from the model:

post <- extract.samples( m4.3 )
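Each row of post is a correlated sample of a, b, and sigma, so each row implies a line. Displaying a bunch of them is a short loop; this sketch is my reconstruction of the plotting step, with the cosmetic details assumed:

# plot the raw data, then overlay 100 lines sampled from the posterior
plot( height ~ weight , data=d2 , col=rangi2 )
for ( i in 1:100 )
    abline( a=post$a[i] , b=post$b[i] , col=col.alpha("black",0.1) )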


FIGURE 4.6. Samples from the quadratic approximate posterior distribution for the height/weight model, m4.3. Left: Each point is a joint sample from the posterior. Estimates of the slope b and the intercept a are strongly negatively correlated. Right: One hundred lines sampled from the posterior distribution, showing the uncertainty in the regression relationship.

The result is shown in the righthand plot in FIGURE 4.6. By plotting multiple regression lines, sampled from the posterior, it is easy to see both the highly confident aspects of the relationship and the less confident aspects. The cloud of regression lines displays greater uncertainty at extreme values for weight. This is very common. In this case, the lines do not vary a lot, however, since there is so much data. But you'll encounter other examples in which the relationship is much less precise.

4.4.3.4. Plotting regression intervals and contours. The cloud of regression lines in FIGURE 4.6 is an appealing display, because it communicates uncertainty about the relationship in a way that many people find intuitive. But it's much more common to see the uncertainty displayed by plotting an interval or contour around the MAP regression line. In this section, I'll walk you through how to compute any arbitrary interval you like, using the underlying cloud of regression lines embodied in the posterior distribution. Then we'll plot a shaded region around the MAP line, to display the interval.

Here's how to plot a 95% interval around the regression line. This interval incorporates uncertainty in both the slope β and intercept α at the same time. To understand how it works, focus for the moment on a single weight value, say 50 kilograms. You can quickly make a list of 10-thousand values of µ for an individual who weighs 50 kilograms, by using your samples from the posterior:

R code 4.46
mu_at_50 <- post$a + post$b * 50


FIGURE 4.7. The quadratic approximate posterior distribution of the mean height, µ, when weight is 50 kg. This distribution represents the relative plausibility of different values of the mean.

Since the samples of both a and b went into computing each, the variation across those means incorporates the uncertainty in and correlation between both parameters. It might be helpful at this point to actually plot the density for this vector of means:

R code 4.47
dens( mu_at_50 , col=rangi2 , lwd=2 , xlab="mu|weight=50" )

I reproduce this plot in FIGURE 4.7. Since the components of µ have distributions, so too does µ. And since the distributions of α and β are Gaussian, so too is the distribution of µ (adding Gaussian distributions always produces a Gaussian distribution).

Since the posterior for µ is a distribution, you can find intervals for it, just like for any posterior distribution. To find the 95% highest posterior density interval (HPDI) of µ at 50 kg, just use the HPDI command as usual:

R code 4.48
HPDI( mu_at_50 , prob=0.95 )

   lower    upper
158.4466 159.7910

What these numbers mean is that there is a 95% chance that the average height is between 158.4 cm and 159.8 cm (conditional on the model and data), assuming the weight is 50 kg.

That's good so far, but we need to repeat the above calculation for every weight value on the horizontal axis, not just when it is 50 kg. We want to draw 95% HPDIs around the MAP slope in FIGURE 4.5.

This is made simple by strategic use of the link function (part of rethinking). What link will do is take your map model fit, sample from the posterior distribution, and then compute µ for each sample from the posterior distribution. It will do that for each case in the data, as well. Here's what it looks like for the data you used to fit the model:

R code 4.49
mu <- link( m4.3 )


You end up with a big matrix of values of µ. Each row is a sample from the posterior distribution (link defaults to using 1000 samples). Each column is a case (row) in the data. There are 352 rows in d2, corresponding to 352 individuals. So there are 352 columns in the matrix mu above.

Now what can we do with this big matrix? Lots of things. The function link provides a posterior distribution of µ for each case we feed it. So above we have a distribution of µ for each individual in the original data. We actually want something slightly different: a distribution of µ for each unique weight value on the horizontal axis. It's only slightly harder to compute that, by just passing link some new data:

R code 4.50
# define sequence of weights to compute predictions for
# these values will be on the horizontal axis
weight.seq <- seq( from=25 , to=70 , by=1 )

# use link to compute mu for each sample from the posterior
# and for each weight in weight.seq
mu <- link( m4.3 , data=data.frame(weight=weight.seq) )
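The plotting code below relies on column-wise summaries of this matrix, named mu.mean and mu.HPDI. Computing them takes two lines; the names are taken from how they are used in the plot code that follows:

# summarize the distribution of mu at each weight value
mu.mean <- apply( mu , 2 , mean )
mu.HPDI <- apply( mu , 2 , HPDI , prob=0.95 )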


FIGURE 4.8. Left: The first 100 values in the distribution of µ at each weight value. Right: The !Kung height data again, now with 95% HPDI of the mean indicated by the shaded region. Compare this region to the distributions of blue points on the left.

You can plot these summaries on top of the data with a few lines of R code:

R code 4.53
# plot raw data
# fading out points to make line and interval more visible
plot( height ~ weight , data=d2 , col=col.alpha(rangi2,0.5) )

# plot the MAP line, aka the mean mu for each weight
lines( weight.seq , mu.mean )

# plot a shaded region for 95% HPDI
shade( mu.HPDI , weight.seq )

You can see the results in the righthand plot in FIGURE 4.8.

Using this approach, you can derive and plot posterior prediction means and intervals for quite complicated models, for any data you choose. It's true that it is possible to use analytical formulas to compute intervals like this. I have tried teaching such an analytical approach before, and it has always been a disaster. Part of the reason is probably my own failure as a teacher, but another part is that most social and natural scientists have never had much training in probability theory and tend to get very nervous around ∫'s. I'm sure with enough effort, every one of them could learn to do the mathematics. But all of them can quickly learn to generate and summarize samples derived from the posterior distribution. So while the mathematics would be a more elegant approach, and there is some additional insight that comes from knowing the mathematics, the pseudo-empirical approach presented here is very flexible and allows a much broader audience of scientists to pull insight from their statistical modeling. And again, when you start estimating models with MCMC, this is really the only approach available. So it's worth learning now.

To summarize, here's the recipe for generating predictions and intervals from the posterior of a fit model.


(1) Use link to generate distributions of posterior values for µ. The default behavior of link is to use the original data, so you have to pass it a list of new horizontal axis values you want to plot posterior predictions across.

(2) Use summary functions like mean or HPDI or PI to find averages and lower and upper bounds of µ for each value of the predictor variable.

(3) Finally, use plotting functions like lines and shade to draw the lines and intervals. Or you might plot the distributions of the predictions, or do further numerical calculations with them. It's really up to you.

This recipe works for every model we fit in the book. As long as you know how parameters relate to the data, you can use samples from the posterior to sample predictions.

Rethinking: Overconfident confidence intervals. The confidence interval for the regression line in FIGURE 4.8 clings tightly to the MAP line. Thus there is very little uncertainty about the average height as a function of average weight. But you have to keep in mind that these inferences are always conditional on the model. Even a very bad model can have very tight confidence intervals. It may help if you think of the regression line in FIGURE 4.8 as saying: conditional on the assumption that height and weight are related by a straight line, then this is the most plausible line, and these are its plausible bounds.

Overthinking: How link works. The function link is not really very sophisticated. All it is doing is using the formula you provided when you fit the model to compute the value of the linear model. It does this for each sample from the posterior distribution, for each case in the data. You could accomplish the same thing for any model, fit by any means, by performing these steps yourself. This is how it'd look for m4.3:

R code 4.54
post <- extract.samples(m4.3)
mu.link <- function(weight) post$a + post$b*weight
weight.seq <- seq( from=25 , to=70 , by=1 )
mu <- sapply( weight.seq , mu.link )
mu.mean <- apply( mu , 2 , mean )
mu.HPDI <- apply( mu , 2 , HPDI , prob=0.95 )


Actual height predictions depend as well upon the stochastic definition in the first line. The Gaussian distribution on the first line tells us that the model expects observed heights to be distributed around µ, not right on top of it. And the spread around µ is governed by σ. All of this suggests we need to incorporate σ in the predictions somehow.

Here's how you do it. Imagine simulating heights. For any unique weight value, you sample from a Gaussian distribution with the correct mean µ for that weight, using the correct value of σ sampled from the same posterior distribution. If you do this for every sample from the posterior, for every weight value of interest, you end up with a collection of simulated heights that embody the uncertainty in the posterior as well as the uncertainty in the Gaussian likelihood.

R code 4.55
sim.height <- sim( m4.3 , data=list(weight=weight.seq) )
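From here, summarizing and plotting follow the same recipe as for µ. This sketch is my reconstruction of the remaining steps; the object name height.PI matches its later use, while the plotting details are assumptions:

# summarize simulated heights with a 95% percentile interval at each weight
height.PI <- apply( sim.height , 2 , PI , prob=0.95 )

# plot raw data, MAP line, HPDI of the mean, and the prediction interval
plot( height ~ weight , data=d2 , col=col.alpha(rangi2,0.5) )
lines( weight.seq , mu.mean )
shade( mu.HPDI , weight.seq )
shade( height.PI , weight.seq )

The wider shaded region produced by the last line is the prediction interval shown in FIGURE 4.9.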


FIGURE 4.9. 95% prediction interval for height, as a function of weight. The solid line is the MAP estimate of the mean height at each weight. The two shaded regions show different 95% plausible regions. The narrow shaded interval around the line is the distribution of µ. The wider shaded region represents the region within which the model expects to find 95% of actual heights in the population, at each weight.

Notice that the outline for the wide shaded interval is a little jagged. This is the simulation variance in the tails of the sampled Gaussian values. If it really bothers you, increase the number of samples you take from the posterior distribution. The optional n parameter for sim controls how many samples are used. Try for example:

R code 4.58
sim.height <- sim( m4.3 , data=list(weight=weight.seq) , n=1e4 )
height.PI <- apply( sim.height , 2 , PI , prob=0.95 )


post <- extract.samples(m4.3)
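Just as link could be rolled by hand, sim can be too. Starting from the samples in post above, this sketch (my reconstruction, with the details assumed) draws Gaussian heights for each weight, using each posterior sample of a, b, and sigma:

# for each weight, simulate one height per posterior sample
weight.seq <- 25:70
sim.height <- sapply( weight.seq , function(weight)
    rnorm(
        n=nrow(post) ,
        mean=post$a + post$b*weight ,
        sd=post$sigma ) )
height.PI <- apply( sim.height , 2 , PI , prob=0.95 )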


4.5. Polynomial regression

Rethinking: Linear, additive, funky. The parabolic model of µ_i above is still called a "linear model" of the mean. This is so, even though the equation is clearly not of a straight line. Better would be to say that the model is "additive," as it is the sum of terms that multiply parameters by data (and transformations of data). This means that parameters only interact with one another by addition, hence the label "additive."

Additive models have the advantage of being easier to fit to data. They are also often easier to interpret, because they assume that parameters act independently on the mean. They have the disadvantage of being strongly conventional. They are often used thoughtlessly, even when obviously inappropriate. When you have real physical knowledge of your study system, it is often easy to do better than an additive model.

Words like "linear" are routinely used in different ways in different contexts, which makes them confusing. Polysemy is the rule rather than the exception in science, so one way to save yourself from some mistakes is to always insist on a clear mathematical description of the model, like the standard model formulas in this text.

Fitting these models to data is easy. Interpreting them can be hard. We'll begin with the easy part, fitting a parabolic model of height on weight. The first thing to do is to STANDARDIZE the predictor variable. This means to first center the variable and then divide it by its standard deviation. Why do this? You already have some sense of the value of centering. Going further to standardize leaves the mean at zero but also rescales the range of the data. This is helpful for two reasons.

(1) Interpretation might be easier. For a standardized variable, a change of one unit is equivalent to a change of one standard deviation. In many contexts, this is more interesting and more revealing than a one unit change on the natural scale. And once you start making regressions with more than one kind of predictor variable, standardizing all of them makes it easier to compare their relative influence on the outcome, using only estimates. On the other hand, you might want to interpret the data on the natural scale. So standardization can make interpretation harder, not easier. With a little practice, though, you can learn to quickly convert back and forth between natural and standard scales.

(2) More important though are the advantages for fitting the model to the data. When predictor variables have very large values in them, there are sometimes numerical glitches. Even well-known statistical software can suffer from these glitches, leading to mistaken estimates. These problems are very common for polynomial regression, because the square or cube of a large number can be truly massive. Standardizing largely resolves this issue.

To standardize weight, all you do is subtract the mean and then divide by the standard deviation. This will do it:

R code 4.61
d$weight.s <- ( d$weight - mean(d$weight) ) / sd(d$weight)
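As a quick sanity check, which is my illustration rather than the text's code, a standardized variable should come out with mean zero and standard deviation one:

# verify the standardization
mean( d$weight.s )   # essentially 0
sd( d$weight.s )     # exactly 1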


To fit the parabolic model, just modify the definition of µ_i. Here's the model (with very weak priors):

h_i ∼ Normal(µ_i, σ)             height ~ dnorm(mu,sigma)
µ_i = α + β_1 x_i + β_2 x_i²     mu <- a + b1*weight.s + b2*weight.s^2
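The map call that fits this parabola follows the same pattern as before. In this sketch, the model name m4.5, the prior means, and the weight.s grid used for predictions are all my assumptions:

# fit the second-order polynomial (name and prior means assumed)
m4.5 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b1*weight.s + b2*weight.s^2 ,
        a ~ dnorm( 140 , 100 ) ,
        b1 ~ dnorm( 0 , 10 ) ,
        b2 ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ) ,
    data=d ,
    start=list( a=mean(d$height) , b1=0 , b2=0 , sigma=sd(d$height) ) )

# compute mu and simulated heights over a grid of standardized weights
weight.seq <- seq( from=-2.2 , to=2 , length.out=30 )
pred_dat <- list( weight.s=weight.seq )
mu <- link( m4.5 , data=pred_dat )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI , prob=0.95 )
sim.height <- sim( m4.5 , data=pred_dat )

The interval computed from sim.height in the next line is the wider prediction interval shown in FIGURE 4.10.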


height.PI <- apply( sim.height , 2 , PI , prob=0.95 )

FIGURE 4.10. Polynomial regressions of height on weight (standardized), for the full !Kung data. In each plot, the raw data are shown by the circles. The solid curves show the path of µ in each model, and the shaded regions show the 95% interval of the mean (close to the solid curve) and the 95% interval of predictions (wider). (a) Linear regression. (b) A second order polynomial, a parabolic regression. (c) A third order polynomial, a cubic regression.


A third-order polynomial adds one more term to the linear model:

m4.6 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        # cubic term added; priors and start values are assumptions,
        # mirroring the parabolic model above
        mu <- a + b1*weight.s + b2*weight.s^2 + b3*weight.s^3 ,
        a ~ dnorm( 140 , 100 ) ,
        b1 ~ dnorm( 0 , 10 ) ,
        b2 ~ dnorm( 0 , 10 ) ,
        b3 ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ) ,
    data=d ,
    start=list( a=mean(d$height) , b1=0 , b2=0 , b3=0 , sigma=sd(d$height) ) )


4.7.1. e1. In the model definition below, which line is the likelihood?

y_i ∼ Normal(µ, σ)
µ ∼ Normal(0, 10)
σ ∼ Uniform(0, 10)

4.7.2. e2. In the model definition just above, how many parameters are in the posterior distribution?

4.7.3. e3. Using the model definition above, write down Bayes' theorem. (See page 89 for an example.)

4.7.4. e4. In the model definition below, which line is the linear model?

y_i ∼ Normal(µ_i, σ)
µ_i = α + βx_i
α ∼ Normal(0, 10)
β ∼ Normal(0, 1)
σ ∼ Uniform(0, 10)

4.7.5. e5. In the model definition just above, how many parameters are in the posterior distribution?

Medium.

4.7.6. m1. For the model definition below, simulate observed heights from the prior (not the posterior). See pages 88–89 for an example.

y_i ∼ Normal(µ, σ)
µ ∼ Normal(0, 10)
σ ∼ Uniform(0, 10)

4.7.7. m2. Translate the model just above into a map formula.

4.7.8. m3. Translate the map model formula below into a mathematical model definition.

flist <- alist(
    y ~ dnorm( mu , sigma ) ,
    mu <- a + b*x ,
    a ~ dnorm( 0 , 50 ) ,
    b ~ dunif( 0 , 10 ) ,
    sigma ~ dunif( 0 , 50 )
)


4.7.12. Missing heights. The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 90% confidence intervals (HPDI or PCI is fine) for each of these individuals. That is, fill in the table below, using model-based predictions.

Individual   weight   expected height   90% confidence interval
1            46.95
2            43.72
3            64.78
4            32.59
5            54.63

4.7.13. Non-adults. Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.

(a) Fit a linear regression to these data, using map. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?

(b) Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Superimpose the MAP regression line and 90% HPDI for the mean. Also superimpose the 90% HPDI for predicted heights.

(c) What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don't have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.

4.7.14. Logjam. Suppose a colleague of yours, who works on allometry, glances at the practice problems just above. Your colleague exclaims, "That's silly. Everyone knows that it's only the logarithm of body weight that scales with height!" Let's take your colleague's advice and see what happens.

(a) Model the relationship between height (cm) and the natural logarithm of weight (log-kg). Use the entire Howell1 data frame, all 544 rows, adults and non-adults. Fit this model, using quadratic approximation:

h_i ∼ Normal(µ_i, σ)
µ_i = α + β log(w_i)
α ∼ Normal(138, 100)
β ∼ Normal(0, 100)
σ ∼ Uniform(0, 50)

where h_i is the height of individual i and w_i is the weight (in kg) of individual i. The function for computing a natural log in R is just log. Can you interpret the resulting estimates?

(b) Begin with this plot:

R code 4.69
plot( height ~ weight , data=Howell1 ,
    col=col.alpha("slateblue",0.4) )

Then use samples from the quadratic approximate posterior of the model in (a) to superimpose on the plot: (1) the predicted mean height as a function of weight, (2) the 95% HPDI for the mean, and (3) the 95% HPDI for predicted heights.


5. Multivariate Linear Models

One of the most reliable sources of waffles in North America is a Waffle House diner. Waffle House is nearly always open, even just after a hurricane, as most diners invest in disaster preparedness. The United States' disaster relief agency (FEMA) informally uses Waffle House as an index of disaster severity.⁶¹ If the Waffle House is closed, that's a serious event.

That's all nice, but when you inspect the correlation between Waffle House and the stability of marriages in a region, it looks like waffles are causing some disasters (FIGURE 5.1). States with many Waffle Houses per person, like Georgia and Alabama, also have some of the highest divorce rates in the United States. The lowest divorce rates are found where there are zero Waffle Houses.

Could always-available waffles and hash brown potatoes put marriage at risk? Probably not. This is an example of a misleading correlation. No one thinks there is any plausible mechanism by which Waffle House diners make divorce more likely. Instead, when we see a correlation of this kind, we immediately start asking about other variables that are "really" driving the relationship between waffles and divorce. In this case, Waffle House began in Georgia in the year 1955. Over time, the diners spread across the Southern United States, remaining largely bounded within it. So Waffle House is associated with the South. Divorce is not a uniquely Southern institution, but is more common anyplace that people marry young, and many communities in the South still frown on young people "shacking up" and living together out of wedlock. So it's probably just an accident of history that Waffle House and high divorce rates both occur in the South.

In other contexts, however, it's easy to be fooled by such spurious correlations. This is one reason why so much statistical effort is devoted to MULTIVARIATE REGRESSION, using more than one predictor variable to model an outcome. Reasons often given for multivariate models include:

(1) Statistical "control" for confounds. A confound is a variable that may be correlated with another variable of interest. The spurious waffles and divorce correlation is one possible type of confound, where the confound (Southernness) makes a variable with no real importance (Waffle House density) appear to be important. But confounds can hide real important variables just as easily as they can produce false ones. In a particularly important type of confound, known as SIMPSON'S PARADOX, the entire direction of an apparent association between a predictor and outcome can be reversed by considering a confound.⁶²

(2) Multiple causation. Even when confounds are absent, due for example to tight experimental control, a phenomenon may really arise from multiple causes. Measurement of each cause is useful, so when we can use the same data to estimate more than one type of influence, we should. Furthermore, when causation is multiple, one cause can hide another. Multivariate models can help in such settings.


FIGURE 5.1. The number of Waffle House diners per million people is associated with divorce rate (in the year 2009) within the United States. Each point is a State. "Southern" (former Confederate) States shown in blue. Shaded region is 95% percentile interval of the mean. These data are in data(WaffleDivorce) in the rethinking package.

(3) Interactions. Even when variables are completely uncorrelated, the importance of each may still depend upon the other. For example, plants benefit from both light and water. But in the absence of either, the other is no benefit at all. Such INTERACTIONS occur in a very large number of systems. When this is the case, effective inference about one variable will usually depend upon consideration of other variables.

In this chapter, we begin to deal with the first two of these, using multivariate regression to deal with simple confounds and to take multiple measurements of influence. You'll see how to include any arbitrary number of main effects in your linear model of the Gaussian mean. These main effects are additive combinations of variables, the simplest type of multivariate model.

We'll focus on two valuable things multivariate models can help us with: (1) revealing spurious correlations like the Waffle House correlation with divorce and (2) revealing important correlations that may be masked by unrevealed correlations with other variables. But multiple predictor variables can hurt as much as they can help. So the chapter describes some dangers of multivariate models, notably multicollinearity. Along the way, you'll meet CATEGORICAL VARIABLES, which usually must be broken down into multiple predictor variables.

Rethinking: Causal inference. Despite its central importance to science, there is no unified approach to causal inference yet in the sciences or in statistics. There are even people who argue that cause does not really exist; it's just a psychological illusion.⁶³ About one thing, however, there is general agreement: causal inference always depends upon unverifiable assumptions. Another way to say this is that it's always possible to imagine some way in which your inference about cause is mistaken, no matter how careful the research design or statistical analysis. A lot can be accomplished, despite this ultimate barrier on inference.⁶⁴

5.1. Spurious association

Let's leave waffles behind, at least for the moment. An example that is easier to understand is the correlation between divorce rate and marriage rate (FIGURE 5.2). The rate at which adults marry is a great predictor of divorce rate, as seen in the lefthand plot in the figure.


FIGURE 5.2. Divorce rate is associated with both marriage rate (left) and median age at marriage (right). Both predictor variables are standardized in this example. The average marriage rate across States is 20, and the average median age at marriage is 26.

But does marriage cause divorce? In a trivial sense it obviously does: one cannot get a divorce without first getting married. But there's no reason beyond that for high marriage rate to be correlated with divorce—it's easy to imagine high marriage rate indicating high cultural valuation of marriage and therefore being associated with low divorce rate. So something is suspicious here.

Another predictor associated with divorce is the median age at marriage, displayed in the righthand plot in FIGURE 5.2. Age at marriage is also a good predictor of divorce rate—higher age at marriage predicts less divorce. You can replicate the righthand plot in the figure by fitting this linear regression model:

D_i ∼ Normal(µ_i, σ)
µ_i = α + β_a a_i
α ∼ Normal(10, 10)
β_a ∼ Normal(0, 1)
σ ∼ Uniform(0, 10)

D_i is the divorce rate for State i, and a_i is State i's median age at marriage. There are no new code tricks or techniques here, but I'll add comments to help explain the mass of code. We're going to standardize the predictor here, because it's a good habit to get into.

# load data
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

# standardize predictor
d$MedianAgeMarriage.s <- ( d$MedianAgeMarriage - mean(d$MedianAgeMarriage) ) /
    sd(d$MedianAgeMarriage)


# fit model
m5.1 <- map(
    alist(
        Divorce ~ dnorm( mu , sigma ) ,
        mu <- a + ba*MedianAgeMarriage.s ,
        a ~ dnorm( 10 , 10 ) ,
        ba ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d ,
    # start values follow the chapter's conventions
    start=list( a=mean(d$Divorce) , ba=0 , sigma=sd(d$Divorce) ) )
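A sketch of reproducing the righthand plot in FIGURE 5.2 from this fit, with a shaded percentile interval of the mean (my reconstruction; the grid and plotting details are assumptions):

# compute and plot the mean and its 95% percentile interval
MAM.seq <- seq( from=-3 , to=3.5 , length.out=30 )
mu <- link( m5.1 , data=data.frame(MedianAgeMarriage.s=MAM.seq) )
mu.PI <- apply( mu , 2 , PI , prob=0.95 )
plot( Divorce ~ MedianAgeMarriage.s , data=d , col=rangi2 )
lines( MAM.seq , apply(mu,2,mean) )
shade( mu.PI , MAM.seq )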


But merely comparing parameter means between different bivariate regressions is no way to decide which predictor is better. Both of these predictors could provide independent value, or they could be redundant, or one could eliminate the value of the other. So we'll build a multivariate model with the goal of measuring the partial value of each predictor. The question we want answered is:

What is the predictive value of a variable, once I already know all of the other predictor variables?

So for example once you fit a multivariate regression to predict divorce using both marriage rate and age at marriage, the model answers the questions:

(1) After I already know marriage rate, what additional value is there in also knowing age at marriage?

(2) After I already know age at marriage, what additional value is there in also knowing marriage rate?

The parameter estimates corresponding to each predictor are the (often opaque) answers to these questions. Next we'll fit the model that asks these questions.

Rethinking: "Control" is out of control. Very often, the question just above is spoken of as "statistical control," as in controlling for the effect of one variable while estimating the effect of another. But this is sloppy language, as it implies too much. It implies a causal interpretation ("effect"), and it implies an experimental disassociation of the predictor variables ("control"). You may be willing to make these assumptions, but they are not part of the model, so be wary.

The point here isn't to police language. Instead, the point is to observe the distinction between small world and large world interpretations. Since most people who use statistics are not statisticians, sloppy language like "control" can promote a sloppy culture of interpretation. Such cultures tend to overestimate the power of statistical methods, so resisting them can be difficult. Disciplining your own language may be enough. Disciplining another's language is hard to do, without seeming like a fastidious scold, as this very box must seem.

5.1.1. Multivariate notation. Multivariate regression formulas look a lot like the polynomial models at the end of the previous chapter—they add more parameters and variables to the definition of µ_i. The strategy is straightforward:

(1) Nominate the predictor variables you want in the linear model of the mean.

(2) For each predictor, make a parameter that will measure its association with the outcome.

(3) Multiply the parameter by the variable and add that term to the linear model.

Examples are always necessary, so here is the model that predicts divorce rate, using both marriage rate and age at marriage:

D_i ∼ Normal(µ_i, σ)           [likelihood]
µ_i = α + β_m m_i + β_a a_i    [linear model]
α ∼ Normal(10, 10)             [prior for α]
β_m ∼ Normal(0, 1)             [prior for β_m]
β_a ∼ Normal(0, 1)             [prior for β_a]
σ ∼ Uniform(0, 10)             [prior for σ]


You can use whatever symbols you like for the parameters and variables, but here I've chosen m for marriage rate and a for age at marriage, reusing these symbols as subscripts for the corresponding parameters. But feel free to use whichever symbols reduce the load on your own memory.

So what does it mean to assume µ_i = α + β_m m_i + β_a a_i? It means that the expected outcome for any State with marriage rate m_i and median age at marriage a_i is the sum of three independent terms. The first term is a constant, α. Every State gets this. The second term is the product of the marriage rate, m_i, and the coefficient, β_m, that measures the association between marriage rate and divorce rate. The third term is similar, but for the association with median age at marriage instead.

If you are like most people, this is still pretty mysterious. So it might help to read the + symbols as "or" and then say: a State's divorce rate can be a function of either its marriage rate or its median age at marriage. The "or" indicates independent associations, which may be purely statistical or rather causal.

Overthinking: Compact notation and the design matrix. Often, linear models are written using a compact form like:

µ_i = α + Σ_{j=1}^{n} β_j x_{ji}

where j is an index over predictor variables and n is the number of predictor variables. The same model may also be abbreviated:

µ_i = α + β_1 x_1i + β_2 x_2i + ... + β_n x_ni

Both of these forms may be read as: the mean is modeled as the sum of an intercept and an additive combination of the products of parameters and predictors. Even more compactly, using matrix notation:

m = Xb

where m is a vector of predicted means, one for each row in the data, b is a (column) vector of parameters, one for each predictor variable, and X is a matrix. This matrix is called a design matrix. It has as many rows as the data, and as many columns as there are predictors plus one. So X is basically a data frame, but with an extra first column. The extra column is filled with 1s. These 1s are multiplied by the first parameter, which is the intercept, and so return the unmodified intercept. When X is matrix-multiplied by b, you get the predicted means. In R notation, this operation is X %*% b.

We're not going to use the design matrix approach in this book. And in general you don't need to. But it's good to recognize it, and sometimes it can save you a lot of work. For example, for linear regressions, there is a nice matrix formula for the maximum likelihood (or least squares) estimates. Most statistical software exploits that formula, which requires using a design matrix. (A small code sketch of this formula follows below.)

5.1.2. Fitting the model. To fit this model to the divorce data, we just expand the linear model. Here's the model definition again, now with the code on the righthand side:

D_i ∼ Normal(µ_i, σ)           Divorce ~ dnorm(mu,sigma)
µ_i = α + β_m m_i + β_a a_i    mu <- a + bm*Marriage.s + ba*MedianAgeMarriage.s
α ∼ Normal(10, 10)             a ~ dnorm(10,10)
β_m ∼ Normal(0, 1)             bm ~ dnorm(0,1)
β_a ∼ Normal(0, 1)             ba ~ dnorm(0,1)
σ ∼ Uniform(0, 10)             sigma ~ dunif(0,10)
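Here is that promised sketch of the design-matrix formula, the least squares estimates computed as (X'X)⁻¹X'y. This is my illustration, not the book's code, and it assumes Marriage.s has already been standardized in the same way as MedianAgeMarriage.s:

# build a design matrix: a column of 1s for the intercept, then predictors
y <- d$Divorce
X <- cbind( 1 , d$Marriage.s , d$MedianAgeMarriage.s )

# least squares estimates via the normal equations
b <- solve( t(X) %*% X ) %*% t(X) %*% y

# predicted means, one per State
m <- X %*% b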


FIGURE 5.3. Quadratic approximate posterior for model m5.3. Points show each posterior mode, and solid black line segments show percentile intervals.

And here is the map fitting code:

m5.3 <- map(
    alist(
        Divorce ~ dnorm( mu , sigma ) ,
        mu <- a + bm*Marriage.s + ba*MedianAgeMarriage.s ,
        a ~ dnorm( 10 , 10 ) ,
        bm ~ dnorm( 0 , 1 ) ,
        ba ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d ,
    # start values follow the chapter's conventions
    start=list( a=mean(d$Divorce) , bm=0 , ba=0 , sigma=sd(d$Divorce) ) )
precis( m5.3 )


You can interpret these estimates as saying:

Once we know median age at marriage for a State, there is little or no additional predictive power in also knowing the rate of marriage in that State.

Note that this does not mean that there is no value in knowing marriage rate. If you didn't have access to age-at-marriage data, then you'd definitely find value in knowing the marriage rate.

But how did the model achieve this result? To answer that question, we'll draw some pictures.

5.1.3. Plotting multivariate posteriors. Visualizing the posterior distribution in simple bivariate regressions (the previous chapter) is easy. There's only one predictor variable, so a single scatterplot can convey a lot of information. And so in the previous chapter we used scatters of the data. Then we overlaid regression lines and intervals to both (1) visualize the size of the association between the predictor and outcome and (2) to get a crude sense of the ability of the model to predict the individual observations.

With multivariate regression, you'll need more plots. There is a huge literature detailing a variety of plotting techniques that all attempt to help one understand multiple linear regression. None of these techniques is suitable for all jobs, and most do not generalize beyond linear regression. So the approach I take here is to instead help you compute whatever you need from the model. I offer three types of interpretive plots:

(1) Predictor residual plots. These plots show the outcome against residual predictor values.

(2) Counterfactual plots. These show the implied predictions for imaginary experiments in which the different predictor variables can be changed independently of one another.

(3) Posterior prediction plots. These show model-based predictions against raw data, or otherwise display the error in prediction.

Each of these plot types has its advantages and deficiencies, depending upon the context and the question of interest. In the rest of this section, I show you how to manufacture each of these in the context of the divorce data.

5.1.3.1. Predictor residual plots. A predictor variable residual is the average prediction error when we use all of the other predictor variables to model a predictor of interest. That's a complicated concept, so we'll go straight to the example, where it will make sense. The benefit of computing these things is that, once plotted against the outcome, we have a bivariate regression of sorts that has already "controlled" for all of the other predictor variables. It just leaves in the variation that is not expected by the model of the mean, µ, as a function of the other predictors.

In our multivariate model of divorce rate, we have two predictors: (1) marriage rate (Marriage.s) and (2) median age at marriage (MedianAgeMarriage.s). To compute predictor residuals for either, we just use the other predictor to model it. So for marriage rate, this is the model we need:


m_i ∼ Normal(µ_i, σ)
µ_i = α + βa_i
α ∼ Normal(0, 10)
β ∼ Normal(0, 1)
σ ∼ Uniform(0, 10)

As before, m is marriage rate and a is median age at marriage. Note that since we standardized both variables, we already expect the mean α to be around zero. So I've centered α's prior there, but it's still so flat that it hardly matters.

This code will fit the model:

m5.4 <- map(
    alist(
        Marriage.s ~ dnorm( mu , sigma ) ,
        mu <- a + b*MedianAgeMarriage.s ,
        a ~ dnorm( 0 , 10 ) ,
        b ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d )


# compute expected marriage rate at MAP, for each State
mu <- coef(m5.4)['a'] + coef(m5.4)['b']*d$MedianAgeMarriage.s
# compute residual for each State
m.resid <- d$Marriage.s - mu

# plot the relationship, then draw a residual segment for each State
plot( Marriage.s ~ MedianAgeMarriage.s , d , col=rangi2 )
abline( a=coef(m5.4)['a'] , b=coef(m5.4)['b'] )
for ( i in 1:length(m.resid) ) {
    x <- d$MedianAgeMarriage.s[i]  # x location of line segment
    y <- d$Marriage.s[i]           # observed endpoint of line segment
    # draw the line segment
    lines( c(x,x) , c(mu[i],y) , lwd=0.5 , col=col.alpha("black",0.7) )
}

FIGURE 5.4. Residual marriage rate in each State, after accounting for the linear association with median age at marriage. Each gray line segment is a residual, the distance of each observed marriage rate from the expected value, attempting to predict marriage rate with median age at marriage alone. So States that lie above the black regression line have higher rates of marriage than expected, according to age at marriage. Those below the line have lower rates than expected.

The result is shown as FIGURE 5.4. Notice that the residuals are variation in marriage rate that is left over, after taking out the purely linear relationship between the two variables.

Now to use these residuals, let's put them on a horizontal axis and plot them against the actual outcome of interest, divorce rate. I plot these residuals against divorce rate in FIGURE 5.5 (lefthand plot), also overlaying the linear regression of the two variables. You can think of this plot as displaying the linear relationship between divorce and marriage rates, having statistically "controlled" for median age of marriage. The vertical dashed line indicates marriage rate that exactly matches the expectation from median age at marriage. So States to the right of the line marry faster than expected. States to the left of the line marry slower than expected. Average divorce rate on both sides of the line is about the same, and so the regression line demonstrates little relationship between divorce and marriage rates. The slope of the regression line is −0.13, exactly what we found in the multivariate model, m5.3.

The righthand plot in FIGURE 5.5 displays the same kind of calculation, but now for median age at marriage, "controlling" for marriage rate. So States to the right of the vertical dashed line have older than expected median age at marriage, while those to the left have younger than expected median age at marriage. Now we find that the average divorce rate on the right is lower than the rate on the left, as indicated by the regression line. States in which people marry older than expected for a given rate of marriage tend to have less divorce. The slope of the regression line here is −1.13, again the same as in the multivariate model, m5.3.

So what's the point of all of this? There's direct value in seeing the model-based predictions displayed against the outcome, after subtracting out the influence of other predictors. The plots in FIGURE 5.5 do this. But this procedure also brings home the message that regression models answer with the remaining association of each predictor with the outcome, after already knowing the other predictors. In computing the predictor residual plots, you had to perform those calculations yourself. In the unified multivariate model, it all happens automatically.


FIGURE 5.5. Predictor residual plots for the divorce data. Left: States with fast marriage rates for their median age of marriage have about the same divorce rates as do States with slow marriage rates. Right: States with old median age of marriage for their marriage rate have lower divorce rates, while States with young median age of marriage have higher divorce rates.

Linear regression models do all of this with a very specific additive model of how the predictors relate to one another. This fact is embodied in the residual calculations you did above, in which you used separate linear regressions to generate the residuals. But predictor variables can be related to one another in non-additive ways. The basic logic of statistical control does not change in those cases, but the details definitely do.

5.1.3.2. Counterfactual plots. A second sort of inferential plot displays the implied predictions of the model. I call these plots counterfactual, because they can be produced for any values of the predictor variables you like, even unobserved or impossible combinations like very high median age of marriage and very high marriage rate. There are no States with this combination, but in a counterfactual plot, you can ask the model for a prediction for such a State.

The simplest use of a counterfactual plot is to see how the predictions change as you change only one predictor at a time. This means holding the values of all predictors constant, except for a single predictor of interest. Such predictions will not necessarily look like your raw data—they are counterfactual after all—but they will help you understand the implications of the model. Since it's hard to interpret raw numbers in a table, plots that help you understand the model's implications are priceless.

Let's draw a pair of counterfactual plots for the divorce model. Beginning with a plot showing the impact of changes in Marriage.s on predictions:

# prepare new counterfactual data
MAM.avg <- mean( d$MedianAgeMarriage.s )
R.seq <- seq( from=-3 , to=3 , length.out=30 )
pred.data <- data.frame(
    Marriage.s=R.seq ,
    MedianAgeMarriage.s=MAM.avg
)


# compute counterfactual mean divorce (mu)
mu <- link( m5.3 , data=pred.data )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI , prob=0.95 )

# simulate counterfactual divorce outcomes
R.sim <- sim( m5.3 , data=pred.data , n=1e4 )
R.PI <- apply( R.sim , 2 , PI , prob=0.95 )

# display predictions, hiding raw data with type="n"
plot( Divorce ~ Marriage.s , data=d , type="n" )
mtext( "MedianAgeMarriage.s = 0" )
lines( R.seq , mu.mean )
shade( mu.PI , R.seq )
shade( R.PI , R.seq )

The lefthand plot in FIGURE 5.6 shows the result. The same steps, varying median age at marriage instead and holding marriage rate at its average, produce the righthand plot:

# now vary MedianAgeMarriage.s, holding Marriage.s at its average
R.avg <- mean( d$Marriage.s )
MAM.seq <- seq( from=-3 , to=3.5 , length.out=30 )
pred.data2 <- data.frame(
    Marriage.s=R.avg ,
    MedianAgeMarriage.s=MAM.seq
)

mu <- link( m5.3 , data=pred.data2 )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI , prob=0.95 )

MAM.sim <- sim( m5.3 , data=pred.data2 , n=1e4 )
MAM.PI <- apply( MAM.sim , 2 , PI , prob=0.95 )

plot( Divorce ~ MedianAgeMarriage.s , data=d , type="n" )
mtext( "Marriage.s = 0" )
lines( MAM.seq , mu.mean )
shade( mu.PI , MAM.seq )
shade( MAM.PI , MAM.seq )


FIGURE 5.6. Counterfactual plots for the multivariate divorce model, m5.3. Each plot shows the change in predicted mean across values of a single predictor, holding the other predictor constant at its mean value (zero in both cases). Shaded regions show 95% percentile intervals of the mean (dark, narrow) and 95% prediction intervals (light, wide).

These plots have the same slopes as the residual plots in the previous section. But they don't display any data, raw or residual, because they are counterfactual. And they also show percentile intervals on the scale of the data, instead of on that weird residual scale. As a result, they are direct displays of the impact on prediction of a change in each variable.

A tension with such plots, however, lies in their counterfactual nature. Is it really possible—in the large world rather than the small world of the model—to change median age of marriage without also changing the marriage rate? Probably not. Suppose for example that you pay young couples to postpone marriage until they are 35 years old. Surely this will also decrease the number of couples who ever get married—some people will die before turning 35, among other reasons—decreasing the overall marriage rate. An extraordinary and evil degree of control over people would be necessary to really hold marriage rate constant while forcing everyone to marry at a later age.

In this example, the difficulty of separately manipulating marriage rate and marriage age doesn't impede inference much, only because marriage rate has almost no effect on prediction, once median age of marriage is taken into account. But in many problems, including a later one in this chapter, more than one predictor variable has a sizable impact on the outcome. In that case, while these counterfactual plots always help in understanding the model, they may also mislead by displaying predictions for impossible combinations of predictor values. If our goal is to intervene in the world, there may not be any realistic way to manipulate each predictor without also manipulating the others.


5.1.3.3. Posterior prediction plots. In addition to understanding the estimates, it's important to check the model fit against the observed data. This is what you did in Chapter 3, when you simulated globe tosses, averaging over the posterior, and compared the simulated results to the observed. These kinds of checks are useful in many ways. For now, we'll focus on two uses for them.

(1) Did the model fit correctly? Golems do make mistakes, as do golem engineers. Many common software and user errors can be more easily diagnosed by comparing implied predictions to the raw data. Some caution is required, because not all models try to exactly match the sample. But even then, you'll know what to expect from a successful fit. You'll see some examples later.

(2) How does the model fail? All models are useful fictions, so they always fail in some way. Sometimes, the model fits correctly but is still so poor for our purposes that it must be discarded. More often, a model predicts well in some respects, but not in others. By inspecting the individual cases where the model makes poor predictions, you might get an idea of how to improve the model.

Let's begin by simulating predictions, averaging over the posterior.
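A hedged reconstruction of R code 5.10, whose two comments survive, together with plotting code consistent with FIGURE 5.7(a); the axis choices and interval loop are assumptions:

R code 5.10
# call link without specifying new data
# so it uses original data
mu <- link( m5.3 )

# summarize samples across cases
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI )

# simulate observations, again using the original data
divorce.sim <- sim( m5.3 , n=1e4 )
divorce.PI <- apply( divorce.sim , 2 , PI )

# plot predicted against observed, with a dashed line of perfect prediction
plot( mu.mean ~ d$Divorce , col=rangi2 , ylim=range(mu.PI) ,
    xlab="Observed divorce" , ylab="Predicted divorce" )
abline( a=0 , b=1 , lty=2 )
for ( i in 1:nrow(d) )
    lines( rep(d$Divorce[i],2) , c(mu.PI[1,i],mu.PI[2,i]) ,
        col=rangi2 )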


[FIGURE 5.7. Posterior predictive plots for the multivariate divorce model, m5.3. (a) Predicted divorce rate against observed, with 95% confidence intervals of the average prediction. The dashed line shows perfect prediction. (b) Average prediction error for each State, with 95% interval of the mean (black line) and 95% prediction interval (gray +). (c) Average prediction error (residuals) against number of Waffle Houses per capita, with superimposed regression of the two variables.]

The easiest States to spot in FIGURE 5.7(a) are Idaho (ID) and Utah (UT), both of which have much lower divorce rates than the model expects them to have. The easiest way to label a few select points is to use identify:

R code 5.12
identify( x=d$Divorce , y=mu.mean , labels=d$Loc , cex=0.8 )

After executing the line of code above, R will wait for you to click near a point in the active plot window. It'll then place a label near that point, on the side you choose. When you are done labeling points, press your right mouse button (or press ESC, on some platforms).

The plot in FIGURE 5.7(a) makes it hard to see the amount of prediction error, in many cases. For this reason, lots of people also use residual plots that show the mean prediction


error for each row. I'm also going to use order to sort the States from lowest prediction error to highest. To compute residuals and display them:
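A hedged sketch of R code 5.13; the residual computation follows the surviving comment and name, while the ordered dotchart and its arguments are assumptions:

R code 5.13
# compute residuals
divorce.resid <- d$Divorce - mu.mean
# get ordering from lowest to highest residual
o <- order( divorce.resid )
# make the plot
dotchart( divorce.resid[o] , labels=d$Loc[o] , xlim=c(-6,5) , cex=0.6 )
abline( v=0 , col=col.alpha("black",0.2) )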


5.2. Masked relationship
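The discussion that follows concerns the primate milk data (milk in the rethinking package) and repeatedly uses a complete-case subset named dcc. A minimal sketch of the setup the later model fits assume:

library(rethinking)
data(milk)
d <- milk
# neocortex.perc contains missing values; keep only complete cases
dcc <- d[ complete.cases(d) , ]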


We will end up needing female body mass as well, to see the masking that hides the relationships among the variables.

The first model to consider is the simple bivariate regression between kilocalories and neocortex percent. You know how to set up this regression. Fit the model with:
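A hedged reconstruction of R code 5.16; the priors and the data and start arguments reappear in the text just below, while the likelihood and linear model lines are assumptions consistent with the description:

R code 5.16
m5.5 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + bn*neocortex.perc ,
        a ~ dnorm( 0 , 100 ) ,
        bn ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=dcc ,
    start=list(a=mean(d$kcal.per.g),bn=0,sigma=sd(d$kcal.per.g)) )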


You might get some NaNs produced warnings, but the model will now fit correctly. Take a look at the quadratic approximate posterior now:

R code 5.20
precis( m5.5 , digits=3 )

        Mean StdDev   2.5%  97.5%
a      0.353  0.471 -0.569  1.276
bn     0.005  0.007 -0.009  0.018
sigma  0.166  0.028  0.110  0.221

First, note that I added digits=3 to the precis call. The reason is that the posterior mean for bn is very small. So we need more digits to see that it's not exactly zero. A change from the smallest neocortex percent in the data, 55%, to the largest, 76%, would result in an expected change of only:

R code 5.21
coef(m5.5)["bn"] * ( 76 - 55 )

0.09456748

That's less than 0.1 kilocalories. The kilocalories in the data range from less than 0.5 to more than 0.9 per gram, so this effect isn't so impressive. More importantly, it isn't very precise. The 95% interval of the parameter extends a good distance on both sides of zero. You can plot the predicted mean and 95% interval for the mean to see this more easily:
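A hedged sketch of the plot; np.seq follows the text, while the rest is a standard way to draw the mean and its interval over the range of neocortex percent:

np.seq <- 0:100
pred.data <- data.frame( neocortex.perc=np.seq )

mu <- link( m5.5 , data=pred.data , n=1e4 )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI )

plot( kcal.per.g ~ neocortex.perc , data=dcc , col=rangi2 )
lines( np.seq , mu.mean )
lines( np.seq , mu.PI[1,] , lty=2 )
lines( np.seq , mu.PI[2,] , lty=2 )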


[FIGURE 5.8. Milk energy and neocortex among primates. In the top two plots, simple bivariate regressions of kilocalories per gram of milk on (left) neocortex percent and (right) log female body mass show weak and uncertain associations. However, on the bottom, a single regression with both neocortex percent and log body mass suggests strong association with both variables. Both neocortex and body mass are associated with milk energy, but in opposite directions. This masks each variable's relationship with the outcome, unless both are considered simultaneously.]

Now consider another predictor variable, adult female body mass, mass in the data frame. Let's use the logarithm of mass, log(mass), as a predictor as well. Why the logarithm of mass instead of the raw mass in kilograms? It is often true that scaling measurements like body mass are related by magnitudes to other variables. Taking the log of a measure translates the measure into magnitudes. So by using the logarithm of body mass here, we're saying that we suspect that the magnitude of a mother's body mass is related to milk energy, in a linear fashion.

To fit another bivariate regression, with the logarithm of body mass as the predictor variable now, I recommend transforming mass first to make a new column in the data:


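A hedged sketch of the transformation, a log-mass bivariate model (m5.6), and the two-predictor model m5.7 whose precis output appears below; the priors mirror m5.5 and, like the m5.6 numbering, are assumptions:

dcc$log.mass <- log( dcc$mass )

# bivariate regression: milk energy on log body mass
m5.6 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + bm*log.mass ,
        a ~ dnorm( 0 , 100 ) ,
        bm ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=dcc )

# multivariate regression with both predictors
m5.7 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + bn*neocortex.perc + bm*log.mass ,
        a ~ dnorm( 0 , 100 ) ,
        bn ~ dnorm( 0 , 1 ) ,
        bm ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=dcc ,
    start=list(a=mean(dcc$kcal.per.g) , bn=0 , bm=0 ,
        sigma=sd(dcc$kcal.per.g)) )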


precis(m5.7)

        Mean StdDev  2.5% 97.5%
a      -1.08   0.47 -2.00 -0.17
bn      0.03   0.01  0.01  0.04
bm     -0.10   0.02 -0.14 -0.05
sigma   0.11   0.02  0.08  0.15

By incorporating both predictor variables in the regression, the estimated association of both with the outcome has increased. The posterior mean for the association of neocortex percent has increased more than 6-fold, and its 95% interval is now entirely above zero. The posterior mean for log body mass is more strongly negative.

Let's plot the intervals for the predicted mean kilocalories, for this new model. Here's the code for the relationship between kilocalories and neocortex percent. These are counterfactual plots, so we'll use the mean log body mass in this calculation, showing only how predicted energy varies as a function of neocortex percent:
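A hedged sketch of R code 5.26, following the same pattern as the earlier counterfactual plots; only the mean.log.mass name comes from the text:

R code 5.26
mean.log.mass <- mean( dcc$log.mass )
np.seq <- 0:100
pred.data <- data.frame(
    neocortex.perc=np.seq ,
    log.mass=mean.log.mass
)

mu <- link( m5.7 , data=pred.data , n=1e4 )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI )

plot( kcal.per.g ~ neocortex.perc , data=dcc , type="n" )
lines( np.seq , mu.mean )
lines( np.seq , mu.PI[1,] , lty=2 )
lines( np.seq , mu.PI[2,] , lty=2 )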


The fact that the two variables, body size and neocortex, are correlated across species makes it hard to see these relationships, unless we statistically account for both.

Overthinking: Simulating a masking relationship. Just as with understanding spurious association (page 144), it may help to simulate data in which two meaningful predictors act to mask one another. Suppose again a single outcome, y, and two predictors, x_pos and x_neg. The predictor x_pos is positively associated with y, while x_neg is negatively associated with y. Furthermore, the two predictors are positively correlated with one another. Here's code to produce data meeting these criteria:
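A minimal sketch of such a simulation; the correlation value and error scales are assumptions:

N <- 100                            # number of cases
rho <- 0.7                          # correlation between x_pos and x_neg
x_pos <- rnorm( N )                 # x_pos as standard Gaussian
x_neg <- rnorm( N , rho*x_pos ,     # x_neg correlated with x_pos
    sqrt(1-rho^2) )
y <- rnorm( N , x_pos - x_neg )     # y associated with both, in opposite directions
d <- data.frame(y,x_pos,x_neg)      # bind all together in a data frame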


5.3. When adding variables hurts

MULTICOLLINEARITY means a very strong correlation between two or more predictor variables. The consequence is that the posterior distribution may seem to say that none of the predictors is reliably associated with the outcome, even when all of the variables are in reality strongly associated with the outcome variable. This frustrating phenomenon arises from the details of how statistical control works. So once you understand multicollinearity, you will better understand multivariate models in general.

To explore multicollinearity, let's begin with a simple simulation. Then we'll turn to the primate milk data again and see multicollinearity in a real data set.

5.3.1. Multicollinear legs. The simulation example is predicting an individual's height using the length of his or her legs as predictor variables. Surely height is positively associated with leg length, or at least the simulation will assume it is. Nevertheless, once you put both leg lengths into the model, something vexing will happen.

The code below will simulate the heights and leg lengths of 100 individuals. For each, first a height is simulated from a Gaussian distribution. Then each individual gets a simulated proportion of height for their legs, ranging from 0.4 to 0.5. Finally, each leg is salted with a little measurement or developmental error, so the left and right legs are not exactly the same length, as is typical in real populations. At the end, the code puts height and the two leg lengths into a common data frame.
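A hedged sketch of R code 5.28, following the description above exactly, plus the model m5.8 whose output appears below; the priors for m5.8 are assumptions:

R code 5.28
N <- 100                              # number of individuals
height <- rnorm( N , 10 , 2 )         # sim total height of each
leg_prop <- runif( N , 0.4 , 0.5 )    # leg as proportion of height
leg_left <- leg_prop*height +         # sim left leg, adding error
    rnorm( N , 0 , 0.02 )
leg_right <- leg_prop*height +        # sim right leg, adding error
    rnorm( N , 0 , 0.02 )
d <- data.frame(height,leg_left,leg_right)

# regression with both legs as predictors
m5.8 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + bl*leg_left + br*leg_right ,
        a ~ dnorm( 10 , 100 ) ,
        bl ~ dnorm( 2 , 10 ) ,
        br ~ dnorm( 2 , 10 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d )
precis(m5.8)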


        Mean StdDev  2.5% 97.5%
bl     -0.06   1.94 -3.87  3.75
br      2.11   1.96 -1.72  5.94
sigma   0.57   0.04  0.49  0.65

Those posterior means and standard deviations look crazy. This is a case in which a graphical view of the precis output is more useful, because it displays the posterior means and 95% intervals in a way that allows us with a glance to see that something has gone wrong here:

R code 5.30
plot(precis(m5.8))

[Plot of the precis output: posterior means and 95% intervals for a, bl, br, and sigma.]

Your numbers and precis plot will not look exactly the same, due to simulation variance. But they will show the same odd result. If both legs have almost identical lengths, and height is so strongly associated with leg length, then why is this posterior distribution so weird? Did the model fitting work correctly?

The model did fit correctly, and the posterior distribution here is the right answer to the question we asked. Recall that a multiple linear regression answers the question: what is the value of knowing each predictor, after already knowing all of the other predictors? So in this case, the question becomes: what is the value of knowing each leg's length, after already knowing the other leg's length?

The answer to this weird question is equally weird, but perfectly logical. The posterior distribution is the answer to this question, considering every possible combination of the parameters and assigning relative plausibilities to every combination, conditional on this model and these data. It might help to look at the bivariate posterior distribution for bl and br:
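A sketch using extract.samples; the plotting arguments are assumptions:

post <- extract.samples(m5.8)
plot( bl ~ br , post , col=col.alpha(rangi2,0.1) , pch=16 )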


[FIGURE 5.9. Left: Posterior distribution of the association of each leg with height, from model m5.8. Since both variables contain almost identical information, the posterior is a narrow ridge of negatively correlated values. Right: The posterior distribution of the sum of the two parameters is centered on the proper association of either leg with height.]

Since both leg variables contain almost exactly the same information, the model is effectively:

    yᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + β₁xᵢ + β₂xᵢ

which is really:

    yᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + (β₁ + β₂)xᵢ

All I've done is factor xᵢ out of each term. The parameters β₁ and β₂ cannot be pulled apart, because they never separately influence the mean µ. Only their sum, β₁ + β₂, influences µ. So this means the posterior distribution ends up reporting the practically infinite combinations of β₁ and β₂ that make their sum close to the actual association of x with y.

And the posterior distribution in this simulated example has done exactly that: it has produced a good estimate of the sum of bl and br. Here's how you can compute the posterior distribution of their sum, and then plot it:
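A sketch of the sum and its density, followed by the single-leg model m5.9 whose output appears below; m5.9's data and start arguments match the precis call that follows, while its likelihood lines and priors are assumptions:

sum_blbr <- post$bl + post$br
dens( sum_blbr , col=rangi2 , lwd=2 , xlab="sum of bl and br" )

# for comparison, a regression with only one leg length
m5.9 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + bl*leg_left ,
        a ~ dnorm( 10 , 100 ) ,
        bl ~ dnorm( 2 , 10 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d ,
    start=list(a=10,bl=0,sigma=1) )
precis(m5.9)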


        Mean StdDev  2.5% 97.5%
a       0.87   0.28  0.32  1.42
bl      2.04   0.06  1.92  2.16
sigma   0.57   0.04  0.49  0.65

That 2.04 is almost identical to the mean value of sum_blbr.

You'll get slightly different results in your own simulation, due to random variation across simulations. But the basic lesson remains intact across different simulations: When two predictor variables are very strongly correlated, including both in a model may lead to confusion. The posterior distribution isn't wrong, in such cases. It's just giving the right answer to a badly-formed question.

5.3.2. Multicollinear milk. In the leg length example, it's easy to see that including both legs in the model is a little silly. But the problem that arises in real data sets is that we may not anticipate a clash between highly correlated predictors. And therefore we may mistakenly read the posterior distribution to say that neither predictor is important. In this section, we look at an example of this issue with real data.

Let's return to the primate milk data from earlier in the chapter. Let's get back the original data again:
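Getting the original data back is just:

library(rethinking)
data(milk)
d <- milk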


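The two bivariate models and the joint model m5.12 compared in this subsection can be sketched as follows; the comment on the second model and m5.12's data and start arguments follow the text, while the priors are assumptions patterned on earlier fits:

# kcal.per.g regressed on perc.fat
m5.10 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + bf*perc.fat ,
        a ~ dnorm( 0.6 , 10 ) ,
        bf ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d )

# kcal.per.g regressed on perc.lactose
m5.11 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + bl*perc.lactose ,
        a ~ dnorm( 0.6 , 10 ) ,
        bl ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d )

# both predictors in the same model
m5.12 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + bf*perc.fat + bl*perc.lactose ,
        a ~ dnorm( 0.6 , 10 ) ,
        bf ~ dnorm( 0 , 1 ) ,
        bl ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d ,
    start=list(a=mean(d$kcal.per.g),bf=0,bl=0,sigma=sd(d$kcal.per.g)) )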


[FIGURE 5.10. A pairs plot of the total energy, percent fat, and percent lactose variables from the primate milk data. Percent fat and percent lactose are strongly negatively correlated with one another, providing mostly the same information.]

precis( m5.12 , digits=3 )

        Mean StdDev   2.5%  97.5%
a      1.007  0.200  0.615  1.399
bf     0.002  0.002 -0.003  0.007
bl    -0.009  0.002 -0.013 -0.004
sigma  0.061  0.008  0.045  0.077

Now the posterior means of both bf and bl are closer to zero. And the standard deviations for both parameters are twice as large as in the bivariate models (m5.10 and m5.11). In the case of percent fat, the posterior mean is essentially zero.

What has happened here? This is the same phenomenon as in the leg length example. What has happened is that the variables perc.fat and perc.lactose contain much of the same information. They are substitutes for one another. As a result, when you include both in a regression, the posterior distribution ends up describing a long ridge of combinations of bf and bl that are equally plausible.

In the case of the fat and lactose, these two variables form essentially a single axis of variation. The easiest way to see this is to use a pairs plot:

R code 5.37
pairs( ~ kcal.per.g + perc.fat + perc.lactose ,
    data=d , col=rangi2 )


I display this plot in FIGURE 5.10. Along the diagonal, the variables are labeled. In each scatterplot off the diagonal, the vertical axis variable is the variable labeled on the same row and the horizontal axis variable is the variable labeled in the same column. For example, the two scatterplots in the first row in FIGURE 5.10 are kcal.per.g (vertical) against perc.fat (horizontal) and then kcal.per.g (vertical) against perc.lactose (horizontal). Notice that percent fat is positively correlated with the outcome, while percent lactose is negatively correlated with it. Now look at the rightmost scatterplot in the middle row. This plot is the scatter of percent fat (vertical) against percent lactose (horizontal). Notice that the points line up almost entirely along a straight line. These two variables are negatively correlated, and so strongly so that they are nearly redundant. Either helps in predicting kcal.per.g, but neither helps much once you already know the other.

You can compute the correlation between the two variables with cor:

R code 5.38
cor( d$perc.fat , d$perc.lactose )

[1] -0.9416373

That's a pretty strong correlation. How strong does a correlation have to get, before you should start worrying about multicollinearity? There's no easy answer to that question. Correlations do have to get pretty high before this problem ruins your analysis. But what matters isn't just the correlation between a pair of variables. Rather, what matters is the correlation that remains after accounting for any other predictors. But with only two predictors here, we can address the correlation question directly, with a little simulation experiment. Suppose we have only kcal.per.g and perc.fat. Now we construct a random predictor variable, call it x, that is correlated with perc.fat at some predetermined level. Then fit the regression model that tries to predict kcal.per.g using both perc.fat and our random fake variable x. Record the standard error of the estimated effect of perc.fat. Now repeat this procedure many times, at different levels of correlation between perc.fat and x.

If you do this, you'll get a plot like that in FIGURE 5.11. The vertical axis is the average standard deviation across 100 regressions, using a simulated correlated predictor variable x. The horizontal axis shows the intensity of correlation between x and perc.fat. When the two variables are uncorrelated, on the left side of the plot, then the standard deviation of the posterior is small. This means the posterior distribution is piled up in a narrow range of values. As the correlation increases (and keep in mind that we aren't adding any information here, just a correlated string of random numbers) the standard deviation inflates. But the effect is far from linear; it accelerates rapidly, as the correlation increases. Above a correlation of 0.9, the standard deviation increases very rapidly, approaching in fact ∞ as the correlation approaches 1. The code for producing this plot contains some techniques not yet explained, so I include it in an optional overthinking box at the end of this section.

What can be done about multicollinearity? The best thing to do is be aware of it. You can anticipate this problem by checking the predictor variables against one another in a pairs plot. Any pair or cluster of variables with very large correlations, over about 0.9, may be problematic, once included as main effects in the same model. However, it isn't always true that highly-correlated variables are completely redundant; other predictors might be correlated with only one of the pair, and so help extract the unique information each predictor provides. So you can't know just from a table of correlations nor from a matrix of scatterplots whether multicollinearity will prevent you from including sets of variables in the same model. Still,


[FIGURE 5.11. The effect of correlated predictor variables on the narrowness of the posterior distribution. The vertical axis shows the standard deviation of the posterior distribution of the slope. The horizontal axis is the correlation between the predictor of interest and another predictor added to the model. As the correlation increases, the standard deviation inflates.]

you can usually diagnose the problem by looking for a big inflation of standard deviation, when both variables are included in the same model.

Different disciplines have different conventions for dealing with collinear variables. In some fields, it is typical to engage in some kind of data reduction procedure, like PRINCIPAL COMPONENTS or FACTOR ANALYSIS, and then to use the components/factors as predictor variables. In other fields, this is considered voodoo, because principal components and factors are notoriously hard to interpret, and there are usually hidden decisions that go into producing them. It is also difficult to use such constructed variables in prediction, because learning that component 2 is associated with an outcome doesn't immediately tell us how any real measurement is associated with the outcome. With experience, you can cope with these issues, but they remain issues. A popular defensive approach is to show that models using any one member from a cluster of highly correlated variables will produce the same inferences and predictions. For example, sometimes rainfall and soil moisture are highly correlated across sites in an ecological analysis. Using both in a model meant to predict presence/absence of a species of plant may run afoul of multicollinearity, but if you can show that a model with either produces nearly the same predictions, it will be reassuring that both get at the same underlying ecological dimension.

The problem of multicollinearity is really a member of a family of problems with fitting models, a family sometimes known as NON-IDENTIFIABILITY. When a parameter is non-identifiable, it means that the structure of the data and model do not make it possible to estimate the parameter's value. Sometimes this problem arises from mistakes in coding a model, but many important types of models present non-identifiable or weakly-identifiable parameters, even when coded completely correctly.

In general, there's no guarantee that the available data contain much information about a parameter of interest. When that's true, your Bayesian machine will return a very wide posterior distribution. That doesn't mean anything is wrong; you got the right answer to the question you asked. But it might lead you to ask a better question.

Overthinking: Simulating collinearity. The code to produce FIGURE 5.11 involves writing a function that generates correlated predictors, fits a model, and returns the standard deviation of the posterior distribution for the slope relating perc.fat to kcal.per.g. Then the code repeatedly calls this function, with different degrees of correlation as input, and collects the results.
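A hedged sketch of R code 5.39, implementing exactly that recipe; the function names and plotting details are assumptions:

R code 5.39
library(rethinking)
data(milk)
d <- milk

# fit one model with a fake predictor correlated with perc.fat at level r,
# returning the standard deviation of the perc.fat slope
sim.coll <- function( r=0.9 ) {
    d$x <- rnorm( nrow(d) , mean=r*d$perc.fat ,
        sd=sqrt( (1-r^2)*var(d$perc.fat) ) )
    m <- lm( kcal.per.g ~ perc.fat + x , data=d )
    sqrt( diag( vcov(m) ) )[2]      # stddev of perc.fat slope
}

# average that standard deviation over n repeat simulations
rep.sim.coll <- function( r=0.9 , n=100 ) {
    stddev <- replicate( n , sim.coll(r) )
    mean(stddev)
}

# sweep across correlations and plot
r.seq <- seq( from=0 , to=0.99 , by=0.01 )
stddev <- sapply( r.seq , function(z) rep.sim.coll(r=z,n=100) )
plot( stddev ~ r.seq , type="l" , col=rangi2 , lwd=2 ,
    xlab="correlation" )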


5.4. Categorical variables
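This section works with the !Kung census data (Howell1), introduced in Chapter 4; the str output below describes it. A minimal sketch of the setup:

library(rethinking)
data(Howell1)
d <- Howell1
str(d)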


$ weight: num  47.8 36.5 31.9 53 41.3 ...
$ age   : num  63 63 65 41 51 35 32 27 19 54 ...
$ male  : int  1 0 0 1 0 1 0 1 0 1 ...

The male variable is our new predictor, an example of a DUMMY VARIABLE. Dummy variables are devices for encoding categories into quantitative models. The purpose of the male variable is to indicate when a person in the sample is male. So it takes the value 1 whenever the person is male, but it takes the value 0 when the person is female. It doesn't matter which category, "male" or "female," is indicated by the 1. The model won't care. But correctly interpreting the model will demand that you remember, so it's a good idea to name the variable after the category assigned the 1 value.

The effect of a dummy variable is to turn a parameter on for those cases in the category. Simultaneously, the variable turns the same parameter off for those cases in another category. This will make more sense, once you see it in the mathematical definition of the model. The model to fit is:

    hᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + β_m mᵢ
    α ∼ Normal(150, 100)
    β_m ∼ Normal(0, 10)
    σ ∼ Uniform(0, 50)

where h is height and m is the dummy variable indicating a male individual. The parameter β_m influences prediction only for those cases where mᵢ = 1. When mᵢ = 0, it has no effect on prediction, because it is multiplied by zero inside the linear model, α + β_m mᵢ, canceling it out, whatever its value.

To fit this model, use the usual format inside map:
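A sketch that translates the mathematical definition above directly into map's notation:

m5.13 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + bm*male ,
        a ~ dnorm( 150 , 100 ) ,
        bm ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ) ,
    data=d )
precis(m5.13)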


The parameter a is now the average height among females, about 135 cm, and bm is the average difference between males and females, 7.3 cm. So to compute the average male height, you just add these two estimates: 135 + 7.3 = 142.3.

That's good enough for the posterior mean of average male height, but you'll also need to consider the width of the posterior distribution. And because the parameters α and β_m are correlated with one another, you can't just add together the boundaries in the precis output and get correct boundaries for their sum. 66 But as usual, the most accessible way to derive a percentile interval for average male height is just to sample from the posterior. Then you can just add samples of a and bm together to get the posterior distribution of their sum. Here's all it takes:
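A sketch of R code 5.42:

R code 5.42
post <- extract.samples(m5.13)
mu.male <- post$a + post$bm
PI(mu.male)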


5.4.2. Many categories. Binary categories are easy, because you only need one dummy variable. You just pick one of the categories to be indicated by the dummy. The other category is then measured by the intercept, α.

But when there are more than two categories, you'll need more than one dummy variable. Here's the general rule:

To include k categories in a linear model, you require k − 1 dummy variables.

Each dummy variable indicates, with the value 1, a unique category. The category with no dummy variable assigned to it ends up again as the "intercept" category.

Let's explore an example using the primate milk data again. We're interested now in the clade variable, which encodes the broad taxonomic membership of each species:
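A sketch of reloading the data and building dummy variables for three of the four clades; the exact factor labels used in the comparisons are assumptions:

data(milk)
d <- milk
unique(d$clade)

# dummy variables for three of the four categories
d$clade.NWM <- ifelse( d$clade=="New World Monkey" , 1 , 0 )
d$clade.OWM <- ifelse( d$clade=="Old World Monkey" , 1 , 0 )
d$clade.S <- ifelse( d$clade=="Strepsirrhine" , 1 , 0 )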


The model we aim to fit to the data is kcal.per.g regressed on the dummy variables for clade:

    kᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + β_NWM NWMᵢ + β_OWM OWMᵢ + β_S Sᵢ
    α ∼ Normal(0.6, 10)
    β_NWM ∼ Normal(0, 1)
    β_OWM ∼ Normal(0, 1)
    β_S ∼ Normal(0, 1)
    σ ∼ Uniform(0, 10)

A linear model like this really defines 4 different linear models, each corresponding to a different category. A table might help you to understand how dummy variables make individual parameters contingent on category:

Category            NWMᵢ   OWMᵢ   Sᵢ    µᵢ
Ape                  0      0     0     µᵢ = α
New World monkey     1      0     0     µᵢ = α + β_NWM
Old World monkey     0      1     0     µᵢ = α + β_OWM
Strepsirrhine        0      0     1     µᵢ = α + β_S

Each category implies a different set of 1s and 0s in the dummy variables, which in turn imply a different equation for µᵢ, once simplified.

Fitting the model is straightforward:
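A sketch of R code 5.47, written directly from the definition above:

R code 5.47
m5.14 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , sigma ) ,
        mu <- a + b.NWM*clade.NWM + b.OWM*clade.OWM + b.S*clade.S ,
        a ~ dnorm( 0.6 , 10 ) ,
        b.NWM ~ dnorm( 0 , 1 ) ,
        b.OWM ~ dnorm( 0 , 1 ) ,
        b.S ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d )
precis(m5.14)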


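To get posterior distributions of the average milk energy in each category, sample from the posterior and add the relevant parameter to the intercept; a hedged sketch, whose first comment follows the text:

# sample posterior
post <- extract.samples(m5.14)

# compute averages for each category
mu.ape <- post$a
mu.NWM <- post$a + post$b.NWM
mu.OWM <- post$a + post$b.OWM
mu.S <- post$a + post$b.S

# summarize using precis
precis( data.frame(mu.ape,mu.NWM,mu.OWM,mu.S) )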


5.5. Ordinary least squares and lm

5.5.1. Design formulas. The input format for models in lm, and many of the other model-fitting functions in R, is a compact form of notation known as a DESIGN FORMULA. Design formulas take the expression for µᵢ in the detailed mathematical form of a model and strip out the parameters, leaving only a sum of predictor names. These are sufficient to describe the model's design, provided all priors are flat.

For example, if we have the linear regression:

    yᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + βxᵢ

then the corresponding design formula is:

    y ~ 1 + x

The leading 1 on the righthand side indicates the intercept, α. As you add more predictor variables, the design formula expands. For example, this design formula:

    y ~ 1 + x + z + w

corresponds to the model:

    yᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + β_x xᵢ + β_z zᵢ + β_w wᵢ

Design formulas are intimately related to a common matrix algebra description of linear models. See the overthinking box on page 134 for a short introduction to this approach.

5.5.2. Using lm. To fit a linear regression using OLS, just provide the design formula and the name of the data frame. So for both examples just above, the calls to lm would be (assuming the data are in a frame named d):
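A sketch; the model names on the left are assumptions:

m.lin1 <- lm( y ~ 1 + x , data=d )
m.lin2 <- lm( y ~ 1 + x + z + w , data=d )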


5.5.2.2. Categorical variables. Black box functions like lm will automatically expand categorical factors into the necessary number of dummy variables. But R is not a mind reader, and sometimes the same variable can be treated as categories or rather as continuous. For example, you might code a variable like season with the values 1, 2, 3, and 4. These could then be categories (implying you don't care about order, but rather about estimating any differences among seasons) or a continuous variable marking the passage of time. R is easily confused in these circumstances.

To prevent R from getting confused, and therefore confusing yourself, it's best to get into the habit of explicitly telling lm when you mean to use a variable as categories. This will do it:
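A sketch of R code 5.53, using as.factor:

R code 5.53
m5.16 <- lm( y ~ 1 + as.factor(season) , data=d )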


5.6. Summary

This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? Implicit in this question are: (1) the value of the predictors for description of the sample, instead of forecasting a future sample, and (2) the assumption that the value of each predictor does depend upon the values of the other predictors. In the next two chapters, we confront these two issues.

5.7. Practice

Easy

5.7.1. e1. Which of the linear models below are multiple linear regressions?

(1) µᵢ = α + βxᵢ
(2) µᵢ = β_x xᵢ + β_z zᵢ
(3) µᵢ = α + β(xᵢ − zᵢ)
(4) µᵢ = α + β_x xᵢ + β_z zᵢ

5.7.2. e2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.

5.7.3. e3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory are by themselves good predictors of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

5.7.4. e4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let Aᵢ be an indicator variable that is 1 where case i is in category A. Also suppose Bᵢ, Cᵢ, and Dᵢ for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it's possible to compute one posterior distribution from the posterior distribution of another model.

(1) µᵢ = α + β_A Aᵢ + β_B Bᵢ + β_D Dᵢ
(2) µᵢ = α + β_A Aᵢ + β_B Bᵢ + β_C Cᵢ + β_D Dᵢ
(3) µᵢ = α + β_B Bᵢ + β_C Cᵢ + β_D Dᵢ
(4) µᵢ = α_A Aᵢ + α_B Bᵢ + α_C Cᵢ + α_D Dᵢ
(5) µᵢ = α_A(1 − Bᵢ − Cᵢ − Dᵢ) + α_B Bᵢ + α_C Cᵢ + α_D Dᵢ

Medium

5.7.5. m1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

5.7.6. m2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.


5.7.7. m3. It is sometimes observed that the best predictor of fire risk is the presence of fire fighters: States and localities with many fire fighters also have more fires. Presumably fire fighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause fire fighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

5.7.8. m4. In the divorce data, States with high numbers of Mormons (members of the Church of Jesus Christ of Latter-day Saints) have much lower divorce rates than the regression models expected. Find a list of Mormon population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent Mormon population (possibly standardized). You may want to consider transformations of the raw percent Mormon variable. [I actually don't know how this regression turns out. I suspect getting it right requires interacting Mormon population with median age at marriage, which will have to wait until Chapter 7.]

5.7.9. m5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

Hard

All three exercises below use the same data, data(foxes) (part of rethinking). 69 The urban fox (Vulpes vulpes) is a successful exploiter of human habitat. Since urban foxes move in packs and defend territories, data on habitat quality and population density is also included. The data frame has five columns:

(1) group: Number of the social group the individual fox belongs to
(2) avgfood: The average amount of food available in the territory
(3) groupsize: The number of foxes in the social group
(4) area: Size of the territory
(5) weight: Body weight of the individual fox

5.7.10. Bivariate entrée. Fit two bivariate Gaussian regressions, using map: (1) body weight as a linear function of territory size (area), and (2) body weight as a linear function of groupsize. Plot the results of these regressions, displaying the MAP regression line and the 95% confidence interval of the mean. Is either variable important for predicting fox body weight?

5.7.11. Multivariate course. Now fit a multiple linear regression with weight as the outcome and both area and groupsize as predictor variables. Plot the predictions of the model for each predictor, holding the other predictor constant at its mean. What does this model say about the importance of each variable? Why do you get different results than you got in the exercise just above?


5.7.12. Model comparison dessert. Finally, consider the avgfood variable. Fit two more multiple regressions: (1) body weight as an additive function of avgfood and groupsize, and (2) body weight as an additive function of all three variables, avgfood and groupsize and area. Compare the results of these models to the previous models you've fit, in the first two exercises.

(a) Is avgfood or area a better predictor of body weight? If you had to choose one or the other to include in a model, which would it be? Support your assessment with any tables or plots you choose.

(b) When both avgfood and area are in the same model, their effects are reduced (closer to zero) and their standard errors are larger than when they are included in separate models. Can you explain this result?


6. Model Selection, Comparison, and Averaging

Nicolaus Copernicus (1473–1543): Polish astrologer, ecclesiastical lawyer, and blasphemer. Famous for his heliocentric model of the solar system, Copernicus argued for replacing the geocentric model, because the heliocentric model was more "harmonious." This position eventually led (decades later) to Galileo's famous disharmony with, and trial by, the Church.

This story has become a fable of science's triumph over ideology and superstition. But Copernicus' justification looks poor to us now, ideology aside. There are two problems: the model was neither particularly harmonious nor more accurate than the geocentric model. The Copernican model was very complicated. In fact, it had similar epicycle clutter as the Ptolemaic model (FIGURE 6.1). Copernicus had moved the Sun to the center, but since he still used perfect circles for orbits, he still needed epicycles. And so "harmony" doesn't quite describe the model's appearance. Just like the Ptolemaic model, the Copernican model was effectively a Fourier series, a means of approximating periodic functions. This leads to the second problem: the heliocentric model made exactly the same predictions as the geocentric model. Equivalent approximations can be constructed whether the Earth is stationary or rather moving. So there was no reason to prefer it on the basis of accuracy alone.

Copernicus didn't appeal just to some vague harmony, though. He also argued for the superiority of his model on the basis of needing fewer causes: "We thus follow Nature, who producing nothing in vain or superfluous often prefers to endow one cause with many effects." 70 And it was true that a heliocentric model required fewer circles and epicycles to make the same predictions as a geocentric model. In this sense, it was simpler than the geocentric model.

Scholars often prefer simpler theories. This preference is often left vague, a kind of aesthetic preference. Other times we retreat to pragmatism, preferring simpler models because they are easier to work with. Frequently, scientists cite a loose principle known as OCKHAM'S RAZOR: models with fewer assumptions are to be preferred. In the case of Copernicus and Ptolemy, the razor makes a clear recommendation. It cannot guarantee that Copernicus was right (he wasn't, after all), but since the heliocentric and geocentric models make the same predictions, at least the razor offers a clear resolution to the dilemma. But the razor can be hard to use more generally, because usually we must choose among models that differ in both their accuracy and their simplicity. How are we to trade these different criteria against one another? The razor offers no guidance.

This chapter describes some of the most commonly used tools for coping with this tradeoff. Some notion of simplicity usually features in all of these tools, and so each is commonly compared to Ockham's razor. But each tool is equally about improving predictive accuracy. So they are not like the razor, because they explicitly trade off accuracy and simplicity.


[FIGURE 6.1. Ptolemaic (left) and Copernican (right) models of the solar system. Both models use epicycles (circles on circles), and both models produce exactly the same predictions. However, the Copernican model requires fewer circles.]

So instead of Ockham's razor, think of Ulysses' compass. Ulysses was the hero of Homer's Odyssey. During his voyage, Ulysses had to navigate a narrow strait between the many-headed beast Scylla, who attacked from a cliff face and gobbled up sailors, and the sea monster Charybdis, who pulled boats and men down to a watery grave. Passing too close to either meant disaster. In the context of scientific models, you can think of these monsters as representing two fundamental kinds of statistical error:

(1) The many-headed beast of OVERFITTING, which leads to poor prediction by learning too much from the data
(2) The whirlpool of UNDERFITTING, which leads to poor prediction by learning too little from the data

Our job is to carefully navigate between these monsters. There are two common families of approaches. The first is to use a REGULARIZING PRIOR to tell the model not to get too excited by the data. The second is to use some scoring device, like INFORMATION CRITERIA, to explicitly model the prediction task. This chapter focuses on information criteria, because they are harder to explain and entail a number of new concepts. But both families of approaches are routinely used in the natural and social sciences, and frequently a procedure from one family can be shown to be equivalent to a procedure from the other. Furthermore, they can be, and maybe should be, used in combination.

This chapter informally develops and demonstrates two popular information criteria, AIC (Akaike information criterion) and DIC (Deviance information criterion). AIC and DIC provide an estimate of a model's forecasting error. To build up to AIC and DIC, this chapter first discusses overfitting and underfitting in more detail. Then there are two distinct steps in justifying AIC and its relatives. First, a measure of model performance known as DEVIANCE is chosen for comparing models. Information theory justifies this choice. Second,


we care about prediction, not fitting. So we require an estimate of the deviance of a model when applied to new data. Information criteria like AIC and DIC accomplish this by building a model of forecasting. After establishing AIC and DIC, the chapter presents examples of their use in model selection, comparison, and averaging.

Rethinking: Stargazing. The most common form of model selection among practicing scientists is to search for a model in which every coefficient is statistically significant. Statisticians sometimes call this STARGAZING, as it is embodied by scanning for asterisks (**) trailing after estimates. A colleague of mine once called this approach the "Space Odyssey," in honor of A. C. Clarke's novel and film. The model that is full of stars, the thinking goes, is best.

But such a model is not best. Whatever you think about null hypothesis significance testing in general, using it to select among structurally different models is a mistake; p-values are not designed to help you navigate between underfitting and overfitting. As you'll see once you start using AIC and related measures, it is true that predictor variables that do improve prediction are not always statistically significant. It is also possible for variables that are statistically significant to do nothing useful for prediction. And since the conventional 5% threshold is purely conventional, we shouldn't expect it to optimize anything.

Rethinking: Is AIC Bayesian? AIC is not usually thought of as a Bayesian tool. There are both historical and statistical reasons for this. Historically, AIC was originally derived without reference to Bayesian probability. Statistically, AIC uses MAP estimates instead of the entire posterior, and it requires flat priors. So it doesn't look particularly Bayesian. Reinforcing this impression is the existence of another model comparison metric, the BAYESIAN INFORMATION CRITERION (BIC). However, BIC also requires flat priors and MAP estimates, although it's not actually an "information criterion." Regardless, AIC has a clear and pragmatic interpretation under Bayesian probability, and Akaike and others have long argued for alternative Bayesian justifications of the procedure. 71 And as you'll see later in the book, more obviously Bayesian information criteria like DIC and WAIC provide almost exactly the same results as AIC, when AIC's assumptions are met. In this light, we can fairly regard AIC as a special limit of a Bayesian criterion like WAIC, even if that isn't how AIC was originally derived.

6.1. The problem with parameters

In the previous chapter, we saw how adding variables and parameters to a model can help to reveal hidden effects and improve estimates. You also saw that adding variables can hurt, in particular when predictor variables are highly correlated with one another. But what about when the predictor variables are not highly correlated? Would it be safe to just add them all?

The answer is "no." There are two principal concerns with just adding variables. The first is that adding parameters (making the model more complex) always improves the fit of a model family to the data. By "fit" I mean any reasonable measure of how well the model can retrodict the data used to fit the model. In the context of linear Gaussian models, R² is the most common measure of this kind. Fit always improves, even when the variables you add to a model are just random numbers, with no relation to the outcome. So it's no good to choose among models using fit to the data. Second, while more complex models always fit the data better, they often predict new data worse. Models that have too many parameters tend to overfit more than do simpler models. This means that a complex model will be very sensitive to the exact sample used to fit it, leading to potentially large mistakes when future


[FIGURE 6.2. Average brain volume in cubic centimeters against body mass in kilograms, for seven hominin species (afarensis, africanus, habilis, boisei, rudolfensis, ergaster, sapiens). What model best describes the relationship between brain size and body size?]

data is not exactly like the past data. But simple models, with too few parameters, tend instead to underfit, systematically over-predicting or under-predicting the data, regardless of how well future data resemble past data. So we can't always favor either simple models or complex models.

Let's examine both of these issues in the context of a simple data example.

6.1.1. More parameters always improve fit. OVERFITTING occurs when a model learns too much from the sample. What this means is that there are both regular and irregular features in every sample. The regular features are the targets of our learning, because they generalize well or answer a question of interest. Regular features are useful, given an objective of our choice. The irregular features are instead aspects of the data that do not generalize well or mislead us.

Overfitting happens automatically, unfortunately. Here's an example. The data displayed in FIGURE 6.2 are average brain volumes and body masses for seven hominin species. 72 Let's get these data into R, so you can work with them. I'm going to build these data from direct input, rather than loading a pre-made data frame, just so you see an example of how to build a data frame from scratch.
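A sketch of R code 6.1; the species names come from the figure, while the volume and mass values are transcriptions of the published example and should be treated as assumptions:

R code 6.1
sppnames <- c( "afarensis","africanus","habilis","boisei",
    "rudolfensis","ergaster","sapiens" )
brainvolcc <- c( 438 , 452 , 612 , 521 , 752 , 871 , 1350 )
masskg <- c( 37.0 , 35.5 , 34.5 , 41.5 , 55.5 , 61.0 , 53.5 )
d <- data.frame( species=sppnames , brain=brainvolcc , mass=masskg )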


One common goal with data like these is to model brain volume as a function of body mass, statistically controlling for other variables like ecology or diet or species age. This is the same "statistical control" strategy explained in the previous chapter.

But why use a line to relate body size to brain size? It's not clear why nature demands that the relationship among species be a straight line. Why not consider a curved model, like a parabola? Indeed, why not a cubic function of body size, or even a quintic model? I agree that there's no reason yet given to suppose a priori that brain size scales only linearly with body size. Indeed, many readers will prefer to model a linear relationship between log brain volume and log body mass (an exponential relationship). But that's not the direction I'm headed with this example. The lesson here will arise, no matter how we transform the data. Even after a log transform of both variables, there's no reason to insist that linear is the only imaginable relationship.

Let's fit a series of increasingly complex model families and see which function fits the data best. We're going to use lm to fit these models, instead of using map. Take a look back at Section 5.5 (page 166), if you've forgotten or missed how these two methods of fitting linear models relate to one another. In this case, using lm will get us expediently to the lesson of this section. We'll also use polynomial regressions, so review Section 4.5 (page 121) if necessary.

The simplest model that relates brain size to body size is the linear one. It will be the first model we consider:

    vᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α + β₁mᵢ

This is just to say that the average brain volume vᵢ of species i is a linear function of its body mass mᵢ. Priors are necessarily flat here, since we're using lm. Now fit this model to the data using lm:
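A sketch of the fit; the R² computation matches the values displayed in FIGURE 6.3, but the exact line the original used is an assumption:

m6.1 <- lm( brain ~ mass , data=d )

# R^2: 1 minus the ratio of residual variance to total variance
1 - var( resid(m6.1) ) / var( d$brain )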


This model family adds one more parameter, β₂, but uses all of the same data as m6.1. Fit this model to the data:
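The second-degree model referenced above presumably has the form µᵢ = α + β₁mᵢ + β₂mᵢ². A sketch of R code 6.4 using lm's I() notation, together with the higher-degree polynomials that FIGURE 6.3 displays (the model names beyond m6.2 are assumptions consistent with the m6.6 reference later):

R code 6.4
m6.2 <- lm( brain ~ mass + I(mass^2) , data=d )

# third- through sixth-degree polynomials, same pattern
m6.3 <- lm( brain ~ mass + I(mass^2) + I(mass^3) , data=d )
m6.4 <- lm( brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4) , data=d )
m6.5 <- lm( brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4) +
    I(mass^5) , data=d )
m6.6 <- lm( brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4) +
    I(mass^5) + I(mass^6) , data=d )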


[FIGURE 6.3. Polynomial linear models of increasing degree, fit to the hominin brain size and body size data. Each plot shows the predicted mean in black, with 95% interval of the mean shaded. R² is displayed above each plot. (a) First degree polynomial, R² = 0.49. (b) Second degree, R² = 0.54. (c) Third degree, R² = 0.68. (d) Fourth degree, R² = 0.81. (e) Fifth degree, R² = 0.99. (f) Sixth degree, R² = 1.00.]


[FIGURE 6.4. An underfit model of hominin brain volume. This model ignores any association between body mass and brain volume, producing a horizontal line of predictions. As a result, the model fits badly and (presumably) predicts badly.]

A fitted model's parameters are summaries of relationships among the data. These summaries compress the data into a simpler form, although with loss of information ("lossy" compression) about the sample. The parameters can then be used to generate new data, effectively decompressing the data.

When a model has a parameter to correspond to each datum, such as m6.6, then there is actually no compression. The model just encodes the raw data in a different form, using parameters instead. As a result, we learn nothing about the data from such a model. Learning about the data requires using a simpler model that achieves some compression, but not too much. This view of model selection is often known as MINIMUM DESCRIPTION LENGTH (MDL). 73

6.1.2. Too few parameters hurt too. The overfit polynomial models manage to fit the data extremely well, but they suffer for this within-sample accuracy by making nonsensical out-of-sample predictions. In contrast, UNDERFITTING produces models that are inaccurate both within and out of sample. They have learned too little, failing to recover regular features of the sample.

For example, consider this model of brain volume:

    vᵢ ∼ Normal(µᵢ, σ)
    µᵢ = α

There are no predictor variables here, just the intercept α. Fit this model with:
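A sketch of the intercept-only fit, continuing the numbering in the text:

m6.7 <- lm( brain ~ 1 , data=d )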


[FIGURE 6.5. Underfitting and overfitting as under-sensitivity and over-sensitivity to sample. In both plots, a regression is fit to the seven sets of data made by dropping one row from the original data. (a) An underfit model is insensitive to the sample, changing little as individual points are dropped. (b) An overfit model is sensitive to the sample, changing dramatically as points are dropped.]

Underfit models are insensitive to the exact sample, while overfit models are overly sensitive to it. You can see the truth of this in FIGURE 6.5. In both plots in the figure what I've done is drop each row of the data, one at a time, and refit the model. So each line plotted in (a) is a first degree polynomial fit to one of the seven possible sets of data constructed from dropping one row. The curves in (b) are instead different fifth order polynomials fit to the same seven sets of data. Notice that the straight lines hardly vary, while the curves fly about wildly. This is a general contrast between underfit and overfit models: sensitivity to exact composition of the sample used to fit the model.

Overthinking: Dropping rows. The calculations needed to produce FIGURE 6.5 are made easy by a trick of R's index notation. To drop a row i from a data frame d, just use:
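The trick itself, plus a sketch of a loop that could produce panel (a); the plotting details are assumptions:

d.new <- d[ -i , ]

# e.g., refit the first-degree model to each leave-one-out set
plot( brain ~ mass , d )
for ( i in 1:nrow(d) ) {
    d.new <- d[ -i , ]
    m0 <- lm( brain ~ mass , data=d.new )
    abline( m0 , col=col.alpha("black",0.5) )
}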


Rethinking: Bias and variance. The underfitting/overfitting dichotomy is often described as the BIAS-VARIANCE TRADEOFF. 74 "Bias" here refers to underfitting, while "variance" refers to overfitting. These terms are confusing though, because they are used in many different ways in different contexts, even within statistics. The term "bias" also sounds like a bad thing, even though increasing bias often leads to better predictions. For these reasons, this book prefers underfitting/overfitting, but you should expect to see the same concepts discussed as bias/variance.

6.2. Information theory and model performance

So how do we navigate between the hydra of overfitting and the vortex of underfitting? Many common approaches begin by defining a precise criterion of model performance. What do you want the model to do well at? It can be hard to define "fit" in even a simple, uncontroversial way, however. This is because accuracy depends upon the definition of the target, and there is no unique best target.

In defining a target, there are two major dimensions to worry about:

(1) Cost-benefit analysis. How much does it cost when we're wrong? How much do we win when we're right? Most scientists never ask these questions in any formal way, but applied scientists must routinely answer them.

(2) Accuracy in context. Some prediction tasks are inherently easier than others. So even if we ignore costs and benefits, we still need a way to judge "accuracy" that accounts for how much a model could possibly improve prediction.

It will help to explore these two dimensions in an example.

6.2.1. Firing the weatherperson. Suppose in a certain city, a certain weatherperson issues uncertain predictions for rain or shine on each day of the year. 75 The predictions are in the form of probabilities of rain. The currently employed weatherperson predicted these chances of rain over a ten-day sequence, with the actual outcomes shown below each prediction (it rained on the first three days and was sunny on the remaining seven):

Day         1     2     3     4     5     6     7     8     9     10
Prediction  1     1     1     0.6   0.6   0.6   0.6   0.6   0.6   0.6
Observed    rain  rain  rain  sun   sun   sun   sun   sun   sun   sun

A newcomer rolls into town, and this newcomer boasts that he can best the current weatherperson, by always predicting sunshine. Over the same ten day period, the newcomer's record would be:

Day         1     2     3     4     5     6     7     8     9     10
Prediction  0     0     0     0     0     0     0     0     0     0
Observed    rain  rain  rain  sun   sun   sun   sun   sun   sun   sun

"So by rate of correct prediction alone," the newcomer announces, "I'm the best person for the job."

The newcomer is right. Define hit rate as the average chance of a correct prediction. So for the current weatherperson, she gets 3 × 1 + 7 × 0.4 = 5.8 hits in 10 days, for a rate of 5.8/10 = 0.58 correct predictions per day. In contrast, the newcomer gets 3 × 0 + 7 × 1 = 7, for 7/10 = 0.7 hits per day. The newcomer wins.


6.2.1.1. Costs and benefits. But it's not hard to find another criterion, other than rate of correct prediction, that makes the newcomer look foolish. Any consideration of costs and benefits will suffice. Suppose for example that you hate getting caught in the rain, but you also hate carrying an umbrella. Let's define the cost of getting wet as −5 points of happiness and the cost of carrying an umbrella as −1 point of happiness. Suppose your chance of carrying an umbrella is equal to the forecast probability of rain. Your job is now to maximize your happiness by choosing a weatherperson. Here are your points, following either the current weatherperson or the newcomer:

Day        1    2    3    4     5     6     7     8     9     10
Observed   rain rain rain sun   sun   sun   sun   sun   sun   sun
Points
Current    −1   −1   −1   −0.6  −0.6  −0.6  −0.6  −0.6  −0.6  −0.6
Newcomer   −5   −5   −5   0     0     0     0     0     0     0

So the current weatherperson nets you −7.2 happiness, while the newcomer nets you −15 happiness. So the newcomer doesn't look so clever now. You can play around with the costs and the decision rule, but since the newcomer always gets you caught unprepared in the rain, it's not hard to beat his forecast.

6.2.1.2. Measuring accuracy. But even if we ignore costs and benefits of any actual decision based upon the forecasts, there's still ambiguity about which measure of "accuracy" to adopt. There's nothing special about "hit rate." Consider for example computing the probability of predicting the exact sequence of days. This means computing the probability of a correct prediction for each day. Then multiply all of these probabilities together to get the joint probability of correctly predicting the observed sequence. This is the same thing as the joint likelihood, which you've been using up to this point to fit models with Bayes' theorem.

In this light, the newcomer looks even worse. The probability for the current weatherperson is 1³ × 0.4⁷ ≈ 0.005. For the newcomer, it's 0³ × 1⁷ = 0. So the newcomer has zero probability of getting the sequence correct. This is because the newcomer's predictions never expect rain. So even though the newcomer has a high average probability of being correct (hit rate), he has a terrible joint probability of being correct.

So which definition of "accuracy" should we use: joint probability, average probability, or even some other? We can answer this question by first answering two other questions.

(1) What is a perfect prediction? Well, suppose for the moment that we could know the true probabilities of rain or shine on each day. (If the word "true" bugs you, see the rethinking box further down.) That would make a valuable target, because in principle it would be unbeatable, if we knew it. So this answer provides a target.

(2) How should we measure distance from the target? A perfect prediction would just report the true probabilities of rain on each day. So when either weatherperson provides a prediction that differs from the target, we can measure the distance of the prediction from the target. But what distance should we adopt?

This second question is the harder one. It's important to get the answer right, and it's not obvious how to go about answering it. One reason is that some targets are just easier to hit than other targets. We need a measure of distance that accounts for this. For example, suppose we extend the weather forecast into the winter. Now there are three types of days: rain, sun, and snow. Now there are three ways to be wrong, instead of just two. This has to be


184 6. MODEL SELECTION, COMPARISON, AND AVERAGINGreflected in any reasonable measure of distance from the target, because by adding anothertype of event, the target has gotten harder to hit.It’s like taking a two-dimensional archery bullseye and forcing the archer to hit the targetat the right time—a third dimension—as well. Now the possible distance between thebest archer and the worst archer has grown, because there’s another way to miss. And withanother way to miss, one might also say that there is another way for an archer to impress.As the potential distance between the target and the shot increases, so too does the potentialimprovement and ability of a talented archer to impress us.Rethinking: What is a true model? It’s hard to define “true” probabilities, because all models arefalse. So what does “truth” mean in this context? It means the right probabilities, given our stateof ignorance. Our state of ignorance is described by the model. e probability is in the model,not in the world. If we had all of the information relevant to producing a forecast, then rain or sunwould be deterministic, and the “true” probabilities would be just 0’s and 1’s. Absent some relevantinformation, as in all modeling, outcomes in the small world are uncertain, even though they remainperfectly deterministic in the large world. Because of our ignorance, we can have “true” probabilitiesbetween zero and one.An example might help. Suppose you toss the globe, as in Chapter 2. Before you catch it, theoutcome is uncertain. ere is a “true” probability of observing water, conditional on our assumedmodel. But if we had enough information about the globe toss—initial conditions, angular momentumvector, and such—then the outcome would be knowable with certainty. No two tosses are everexactly alike, so the “true” probability of observing water must average over the unknown differencesto describe the relative plausibility of water compared to land. ere is a right answer to the questionthis model poses—70% water. It’s our ignorance of the physics of the globe toss that leads us to useit as a way to estimate the amount of water on the surface.6.2.2. Information and uncertainty. One solution to the problem of how to measure distanceof a model’s accuracy from a target was provided in the late 1940s. 76 Originally appliedto problems in communication of messages, such as telegraph, the field of INFORMATIONTHEORY is now important across the basic and applied sciences, and it has deep connectionsto Bayesian inference. And like many successful fields, information theory has spawned alarge number of bogus applications, as well. 77e basic insight is to ask: how much is our uncertainty reduced by learning an outcome?Consider the weather forecasts again. Forecasts are issued in advance and the weather isuncertain. When the actual day arrives, the weather is no longer uncertain. e reductionin uncertainty is then a natural measure of how much we have learned, how much “information”we derive from observing the outcome. So if we can develop a precise definition of“uncertainty,” we can provide a baseline measure of how hard it is to predict, as well as howmuch improvement is possible. e measured decrease in uncertainty is the definition ofinformation in this context.Information: e reduction in uncertainty derived from learning an outcome.To use this definition, what we need is a principled way to quantify the uncertainty inherentin a probability distribution. 
So suppose again that there are two possible weather events on any particular day: either it is sunny or it is rainy. Each of these events occurs with some probability, and these probabilities add up to one. What we want is a function that uses


the probabilities of shine and rain and produces a measure of uncertainty. What properties should such a measure of uncertainty possess? There are three intuitive desiderata.

(1) The measure of uncertainty should be continuous. If it were not, then an arbitrarily small change in any of the probabilities, for example the probability of rain, would result in a massive change in uncertainty.

(2) The measure of uncertainty should increase as the number of possible events increases. For example, suppose there are two cities that need weather forecasts. In the first city, it rains on half of the days in the year and is sunny on the others. In the second, it rains, shines and hails, each on 1 out of every 3 days in the year. We'd like our measure of uncertainty to be larger in the second city, where there is one more kind of event to predict.

(3) The measure of uncertainty should be additive. What this means is that if we first measure the uncertainty about rain or shine (2 possible events) and then the uncertainty about hot or cold (2 different possible events), the uncertainty over the four combinations of these events—rain/hot, rain/cold, shine/hot, shine/cold—should be the sum of the separate uncertainties.

There is only one function that satisfies these desiderata. This function is usually known as INFORMATION ENTROPY, and has a surprisingly simple definition. If there are n different possible events and each event i has probability p_i, and we call the list of probabilities p, then the unique measure of uncertainty we seek is:

    H(p) = −E log(p_i) = −∑_{i=1}^{n} p_i log(p_i).        (6.1)

In plainer words:

    The uncertainty contained in a probability distribution is the average log-probability of an event.

“Event” here might refer to a type of weather, like rain or shine, or a particular species of bird or even a particular nucleotide in a DNA sequence. While it's not worth going into the details of the derivation of H, it is worth pointing out that nothing about this function is arbitrary. Every part of it derives from the three requirements above.

An example will help. To compute the information entropy for the weather, suppose the true probabilities of rain and shine are p_1 = 0.3 and p_2 = 0.7, respectively. Then:

    H(p) = −( p_1 log(p_1) + p_2 log(p_2) ) ≈ 0.61

Suppose instead we live in Abu Dhabi. Then the probabilities of rain and shine might be more like p_1 = 0.01 and p_2 = 0.99. Now the entropy would be approximately 0.06. Why has the uncertainty decreased? Because in Abu Dhabi it hardly ever rains. Therefore there's much less uncertainty about any given day, compared to a place in which it rains 30% of the time. It's in this way that information entropy measures the uncertainty inherent in a distribution of events. Similarly, if we add another kind of event to the distribution—forecasting into winter, so also predicting snow—entropy tends to increase, due to the added dimensionality of the prediction problem. For example, suppose probabilities of sun, rain, and snow are p_1 = 0.7, p_2 = 0.15, and p_3 = 0.15, respectively. Then entropy is about 0.82.

These entropy values by themselves don't mean much to us, though. But we can use them to build a measure of accuracy. That comes next.
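All of these entropy values are quick to verify in R. The function below is just a direct transcription of the definition in (6.1):

    # information entropy: the average log-probability, times -1
    H <- function( p ) -sum( p * log(p) )

    H( c(0.3,0.7) )          # rain and shine: ~0.61
    H( c(0.01,0.99) )        # Abu Dhabi: ~0.06
    H( c(0.7,0.15,0.15) )    # sun, rain, and snow: ~0.82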


Overthinking: More on entropy. Above I said that information entropy is the average log-probability. But there's also a −1 in the definition. Multiplying the average log-probability by −1 just makes the entropy H increase from zero, rather than decrease from zero. It's conventional, but not functional. The logarithms above are natural logs (base e), but changing the base rescales without any effect on inference. Binary logarithms, base 2, are just as common. As long as all of the entropies you compare use the same base, you'll be fine.

The only trick in computing H is to deal with the inevitable question of what to do when p_i = 0. Then log(0) = −∞, which won't do. However, L'Hôpital's rule tells us that lim_{p_i → 0} p_i log(p_i) = 0. So just assume that 0 log(0) = 0, when you compute H. In other words, events that never happen drop out. This is not really a trick. It follows from the definition of a limit. But it isn't obvious. It may make more sense to just remember that when an event never happens, there's no point in keeping it in the model.

Rethinking: The benefits of maximizing uncertainty. Information theory has many applications. A particularly important application is MAXIMUM ENTROPY, also known as MAXENT. Maximum entropy is a family of techniques for finding probability distributions that are most consistent with states of knowledge. In other words, given what we know, what is the least surprising distribution? It turns out that one answer to this question maximizes the information entropy, using the prior knowledge as constraint.78

Many of the common distributions used in statistical modeling—uniform, Gaussian, binomial, Poisson, etc.—are also maximum entropy distributions, given certain constraints. For example, if all we know about a measure is its mean and variance, then the distribution that maximizes entropy (and therefore minimizes surprise) is the Gaussian. There are both epistemological and ontological interpretations of this fact. Epistemologically, the Gaussian is expected to cover more possible combinations of events, precisely because it is the least surprising distribution consistent with the constraints provided (mean and variance, infinite real line). Ontologically, frequencies of events in the real world often come to resemble Gaussian distributions, because when a natural process adds values together, it sheds all information other than mean and variance. You saw an example of this in Chapter 4.

When you meet the other exponential family distributions, in later chapters, you'll also learn something of their maximum entropy interpretations.

6.2.3. From entropy to accuracy. It's nice to have a way to quantify uncertainty. H provides this. So we can now say, in a precise way, how hard it is to hit the target. But how can we use information entropy to say how far a model is from the target? The key lies in DIVERGENCE:

    Divergence: The additional uncertainty induced by using probabilities from one distribution to describe another distribution.

This is often known as Kullback-Leibler divergence or simply K-L divergence, named after the people who introduced it for this purpose.79

Suppose for example that the true distribution of events is p_1 = 0.3, p_2 = 0.7. If we believe instead that these events happen with probabilities q_1 = 0.25, q_2 = 0.75, how much additional uncertainty have we introduced, as a consequence of using q = {q_1, q_2} to approximate p = {p_1, p_2}?
The formal answer to this question is based upon H, and has a similarly simple formula:

    D_KL(p, q) = ∑_i p_i ( log(p_i) − log(q_i) ).
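This, too, is a one-line function in R. The last two calls below compute the Earth/Mars numbers used in the Rethinking box further down:

    # K-L divergence of q from the target p
    D.KL <- function( p , q ) sum( p * ( log(p) - log(q) ) )

    D.KL( c(0.3,0.7) , c(0.25,0.75) )   # ~0.006: q is close to p
    D.KL( c(0.3,0.7) , c(0.3,0.7) )     # exactly 0: p predicting itself

    # the Earth/Mars example from the Rethinking box below
    D.KL( c(0.01,0.99) , c(0.7,0.3) )   # ~1.14: Earth predicting Mars
    D.KL( c(0.7,0.3) , c(0.01,0.99) )   # ~2.62: Mars predicting Earth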


In plainer language, the divergence is the average difference in log probability between the target (p) and model (q). This divergence is just the difference between two entropies: the entropy of the target distribution p and the entropy arising from using q to predict p. When p = q, we know the actual probabilities of the events. In that case:

    D_KL(p, q) = D_KL(p, p) = ∑_i p_i ( log(p_i) − log(p_i) ) = 0.

There is no additional uncertainty induced when we use a probability distribution to represent itself. That's somehow a comforting thought. But more importantly, as q grows more different from p, the divergence D_KL also grows.

What divergence can do for us now is help us contrast different approximations to p. As an approximating function q becomes more accurate, D_KL(p, q) will shrink. So if we have a pair of candidate distributions, say q_1 and q_2, then the candidate that minimizes the divergence will be closest to the target. Since predictive models specify probabilities of events (observations), we can use divergence to compare the accuracy of models.

Overthinking: Cross entropy and divergence. Deriving divergence is easier than you might think. The insight is in realizing that when we use a probability distribution q to predict events from another distribution p, this defines something known as cross entropy: H(p, q) = −∑_i p_i log(q_i). The notion is that events arise according to the p's, but they are expected according to the q's, so the entropy is inflated, depending upon how different p and q are.

Divergence is defined as the additional entropy induced by using q. So it's just the difference between H(p), the actual entropy of events, and H(p, q):

    D_KL(p, q) = H(p, q) − H(p)
               = −∑_i p_i log(q_i) − ( −∑_i p_i log(p_i) ) = −∑_i p_i ( log(q_i) − log(p_i) )

So divergence really is measuring how far q is from the target p, in units of entropy. Notice that which is the target matters: H(p, q) does not in general equal H(q, p). For more on that fact, see the rethinking box that follows.

Rethinking: Divergence depends upon direction. In general, H(p, q) is not equal to H(q, p). The direction matters, when computing divergence. Understanding why this is true is of some value, so here's a contrived teaching example.

Suppose we get in a rocket and head to Mars. But we have no control over our landing spot, once we reach Mars. Let's try to predict whether we land in water or on dry land, using the Earth to provide a probability distribution q to approximate the actual distribution on Mars, p. For the Earth, q = {0.7, 0.3}, for probability of water and land, respectively. Mars is totally dry, but let's say for the sake of the example that there is 1% surface water, so p = {0.01, 0.99}. If we count the ice caps, that's not too big a lie. Now compute the divergence going from Earth to Mars. It turns out to be D_{E→M} = D_KL(p, q) = 1.14. That's the additional uncertainty induced by using the Earth to predict the Martian landing spot.

Now consider going back the other direction. The numbers in p and q stay the same, but we swap their roles, and now D_{M→E} = D_KL(q, p) = 2.62. The divergence is more than double in this direction. This result seems to defy comprehension. How can the distance from Earth to Mars be shorter than the distance from Mars to Earth?

Divergence behaves this way as a feature, not a bug. There really is more additional uncertainty induced by using Mars to predict Earth than by using Earth to predict Mars. The reason is that, going


from Mars to Earth, Mars has so little water on its surface that we will be very very surprised when we land in water on Earth. The Earth is mainly covered in water, after all. In contrast, Earth has good amounts of both water and dry land. So when we use the Earth to predict Mars, we expect both water and land, to some extent, even though we do expect more water than land. So we won't be nearly as surprised when we inevitably arrive on Martian dry land, because 30% of Earth is dry land.

6.2.4. From divergence to deviance. At this point in the chapter, dear reader, you may be wondering where the chapter is headed. At the start, the goal was to deal with overfitting and underfitting. But now we've spent pages and pages on entropy and other fantasies. It's as if I promised you a day at the beach, but now you find yourself at a dark cabin in the woods, wondering if this is a necessary detour or rather a sinister plot.

It is a necessary detour. The point of all the preceding material about information theory and divergence is to establish both:

(1) How to measure the distance of a model from our target. Information theory gives us the distance measure we need, the K-L divergence.

(2) How to estimate the divergence. Having identified the right measure of distance, we now need a way to estimate it in real statistical modeling tasks.

Item (1) is accomplished. Item (2) remains for last. We're going to show now that the divergence leads us to using a measure of model fit known as DEVIANCE.

To use D_KL to compare models, it seems like we would have to know p, the target probability distribution. In all of the examples so far, I've just assumed that p is known. But when we want to find a model q that is the best approximation to p, the “truth,” there is usually no way to access p directly. We wouldn't be doing statistical inference, if we already knew p.

But there's an amazing way out of this predicament. It helps that we are only interested in comparing the divergences of different candidates, say q and r. In that case, most of p just subtracts out, because there is an E log(p_i) term in the divergence of both q and r. This term has no effect on the distance of q and r from one another. So while we don't know where p is, we can estimate how far apart q and r are, and which is closer to the target. It's as if we can't tell how far any particular archer is from the target, but we can tell which archer is closer and by how much.

All of this also means that all we need to know is a model's average log probability: E log(q_i) for q and E log(r_i) for r. These expressions look a lot like log probabilities of outcomes, like the log-likelihoods you've been using already. Indeed, just summing the log-likelihoods of each case provides an approximation of E log(q_i). We don't have to know the p inside the expectation, because nature takes care of presenting the events for us.

So we can compare the average log probability from each model to get an estimate of the relative distance of each model from the target. This also means that the absolute magnitude of these values will not be interpretable—neither E log(q_i) nor E log(r_i) by itself suggests a good or bad model. Only the difference E log(q_i) − E log(r_i) informs us about the divergence of each model from the target p.

All of this delivers us to a very common measure of model fit, one that also turns out to be an approximation of K-L divergence, the DEVIANCE, which is defined as:

    D(q) = −2 ∑_i log(q_i)


where i indexes each observation (case), and each q_i is just the likelihood of case i. The −2 in front doesn't do anything important. It's there for historical reasons.

You can compute the deviance for any model you've fit already in this book, just by using the MAP estimates to compute a log-probability of the observed data for each row. These probabilities are the q values. Then you add these log-probabilities together and multiply by −2. In many cases, R automates these steps. Most of the standard model fitting functions support logLik, which will do the hard part: compute the sum of log-probabilities, usually known as the log-likelihood of the data. For example:

    # fit model with lm; y and x stand in for whatever outcome
    # and predictor are at hand
    m6.1 <- lm( y ~ x )

    # compute deviance by multiplying the log-likelihood by -2
    (-2) * logLik( m6.1 )
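And to see that logLik really is just summing the log-probability of each row, here is the same deviance computed by hand, on simulated data used only for illustration. The fitted means and the maximum likelihood estimate of σ play the role of the MAP estimates:

    set.seed(10)
    x <- rnorm( 50 )
    y <- rnorm( 50 , 2 + 0.5*x )
    m <- lm( y ~ x )

    mu <- fitted( m )                        # predicted mean for each row
    sigma <- sqrt( mean( residuals(m)^2 ) )  # ML estimate of sigma

    (-2) * sum( dnorm( y , mu , sigma , log=TRUE ) )
    (-2) * logLik( m )    # same value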


Indeed, for simple Gaussian models—ordinary linear regression—deviance and R² contain similar information.

There is an accessible way to use deviance to navigate between underfitting and overfitting. It's usually known as CROSS-VALIDATION. There are many varieties of cross-validation, but here's the premise. Suppose we have two samples of the same size. The first sample is usually called the training sample, and the second the testing sample. We use the training sample to fit a model, to estimate its parameters. Then we use the testing sample to evaluate the model's performance. This means using the estimates based on the training sample to compute the deviance for the observations in the testing sample. Once you've done this train-test procedure for all the models you wish to compare, you can use their relative test-sample deviance values to do so.

Cross-validation is common and useful. But often there aren't two samples, or rather we'd like to use all of the available samples to fit the model, so we can use the full strength of the evidence. But what we can do is imagine what would happen if we did have another sample to predict. In other words, we can make a meta-model of forecasting. Then we can ask the meta-model to estimate deviance in a testing sample, from theory alone.

This is the strategy behind the various INFORMATION CRITERIA, such as the renowned AKAIKE INFORMATION CRITERION, abbreviated AIC.80 AIC attempts to estimate the test-sample deviance of a model. Here's the AIC gambit.

(1) Suppose there's a training sample of size N.
(2) Fit a model (not necessarily the data generating model) to the training sample, and compute the deviance on the training sample. Call this deviance D_train.
(3) Suppose another sample of size N from the same process. This is the test sample.
(4) Compute the deviance on the test sample. This means using the MAP estimates from step (2) to compute the deviance for the data in the test sample. Call this deviance D_test.
(5) Compute the difference D_test − D_train. This difference will usually be positive, because the model will tend to perform worse (have a higher deviance) in testing than in training.
(6) Finally, imagine repeating this procedure many times. The average difference then tells us the expected overfitting, how much the training deviance underestimates the divergence of the model.

I call the above logic a gambit because it cannot provide guarantees. But it can provide valuable advice. It turns out that this gambit leads to an astonishingly simple formula for the expected test-sample deviance:

    AIC = D_train + 2k ≈ E(D_test)

where k is the number of parameters in the model. The term 2k is often called the penalty term. It is a measure of expected overfitting.

This result depends upon weak priors, a Gaussian posterior distribution, and a number of parameters k much less than the number of cases N. So it's appropriate for ordinary linear regression, and it even works quite well for many non-Gaussian regressions (generalized linear models, GLMs) that we'll examine later in this book.
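R's built-in AIC function applies exactly this penalty to a model's training deviance, counting σ as one of the k parameters. A quick check, on simulated data used only for illustration:

    set.seed(11)
    x <- rnorm( 20 )
    y <- rnorm( 20 , 0.5*x )
    m <- lm( y ~ x )

    k <- length( coef(m) ) + 1               # intercept, slope, and sigma
    D.train <- (-2) * as.numeric( logLik(m) )
    D.train + 2*k                            # AIC by hand
    AIC( m )                                 # built-in, same value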


Rethinking: AIC and “true” models. It is possible to read both that (1) AIC assumes the data generating model is one of the candidate models and (2) AIC does not assume the data generating model is a candidate. This confusion arises because there are multiple ways to derive AIC. The gambit described above does not employ a “true” model, except to generate training and testing data. But other derivations do focus on “truth.” The general lesson here is that just because one derivation or justification of a procedure makes use of an assumption, it doesn't mean that there isn't another possible justification that uses different assumptions.

Even more generally, the consequences of violating an assumption are sometimes benign and other times catastrophic. Procedures do not simply stop being useful, when an assumption is violated. If that were true, then no statistical procedure would ever work, at least in the large world. Still, caution requires taking note of violated assumptions and hopefully evaluating the consequences of these violations.

6.3.1. Limits to AIC's generality. But more generally, AIC is not general. It is a special case of a much larger phenomenon: the severity of overfitting is an increasing function of the number of parameters. But this function is not always as simple as 2k, as it is in AIC. A few common conditions benefit from a more general solution.

6.3.1.1. Parameter count close to sample size. Suppose a model has k parameters and is fit to N observations. When k is close to N, overfitting rises very rapidly. This happens because the model starts perfectly encoding the training sample. So when the model sees the test sample, it's always very surprised. A conservative approximation for this rise in overfitting is given by a common generalization of AIC:

    AICc = D_train + 2k / ( 1 − (k + 1)/N )

When N is very much larger than k, the above simplifies to plain AIC. But as k approaches N − 1, the penalty on the right approaches infinity. So anytime AIC is appropriate, AICc may be a better choice.

6.3.1.2. Informative priors. If the model's priors are not flat, then AIC can get the penalty very wrong. This is the reason: the penalty term in AIC estimates how flexible the model is. In the classical case of flat priors, it turns out that each additional parameter adds 2 to the penalty. But when priors are not flat, but instead more concentrated around zero, then the model is less flexible. Remember, priors function inside Bayes' theorem just as if they were previous posterior inferences. So they behave like ghostly data of a kind, previously accumulated evidence. So if the prior has an effect, it will be to prevent the model from learning everything from the sample. This reduces overfitting.

As a result, each parameter with an informative prior tends to count less than 2 in the penalty. This is the very reason that using informative priors can substitute for using a measure like AIC. There are generalizations for dealing with this, and they are important also for the next problem.

6.3.1.3. Multilevel models. Each level in a multilevel model serves as a kind of prior for the next level. This means that multilevel models necessarily induce the problem above with informative priors. So counting the number of parameters in a multilevel model never tells you the proper penalty term.


There are generalizations of AIC that address this problem, as well as the informative prior problem more generally. The most common and already in wide use is DIC, the DEVIANCE INFORMATION CRITERION.81 More general yet is WAIC, known as the WIDELY APPLICABLE INFORMATION CRITERION82 or rather the WATANABE-AKAIKE INFORMATION CRITERION.83 We'll consider DIC later in this very chapter, to handle informative priors. WAIC will make an appearance much later, in Chapter 13.

6.3.1.4. Other prediction frameworks. The AIC gambit entails predicting a test sample of the same size and nature as the training sample. This most certainly does not mean AIC can only be used when we plan to predict a sample of the same size as training. For example, AIC approximates some forms of cross-validation.84

But AIC's implied prediction task is hardly representative of everything we might wish to do with models. For example, some statisticians prefer to evaluate predictions using a PREQUENTIAL framework, in which models are judged on their accumulated learning error over the training sample.85 And once you start using multilevel models, “prediction” is no longer uniquely defined, because the test sample can differ from the training sample in ways that forbid use of some of the parameter estimates. We'll worry about that issue in Chapter 13.

Rethinking: Uniformitarianism and AIC. The AIC gambit described on page 190 pulls the test sample from the same process as the training sample. This is a kind of uniformitarian assumption, in which future data are expected to come from the same process as past data and have the same rough range of values. This can cause problems. For example, suppose we fit a regression that predicts height using body weight. The training sample comes from a poor town, in which most people are pretty thin. The relationship between height and weight turns out to be positive and strong. Now also suppose our prediction goal is to guess the heights in another, much wealthier, town. Plugging the weights from the wealthy individuals into the model fit to the poor individuals will predict outrageously tall people. The reason is that, once weight becomes large enough, it has essentially no relationship with height. AIC will not automatically recognize nor solve this problem. Nor will any other isolated procedure. But over repeated rounds of model fitting, attempts at prediction, and model criticism, it is possible to overcome this kind of limitation. As always, statistics is no substitute for science.

6.3.2. Simulating AIC. The gambit that leads to AIC is complicated. So let's walk through a simulation of this procedure. Hirotugu Akaike (1927–2009) himself originally did this entirely with formal mathematics. But we'll use an expedient simulation, so it'll be easier to understand and to modify. This section is not necessary for using AIC. But it may be necessary for understanding it, both with respect to its derivation and its scope. It will also provide an example structure for how to study forecasting for any unique modeling problem you might encounter in the future.

Let's begin by writing a function that implements the Akaike gambit on page 190. Here's the code, followed by explanation. The simple Gaussian data generating process in it is an assumption of this particular simulation, not a requirement of the gambit:

R code 6.11
sim.train.test <- function( N=20 ) {
    # (1) simulate the training sample
    y.train <- rnorm( N , mean=1 )
    # (2) simulate the test sample, from the same process
    y.test <- rnorm( N , mean=1 )
    # (3) fit a linear model to the training sample
    m <- lm( y.train ~ 1 )
    mu <- coef(m)[1]
    sigma <- sqrt( mean( residuals(m)^2 ) )
    # (4) compute the deviance on the training sample
    dev.train <- (-2)*sum( dnorm( y.train , mu , sigma , log=TRUE ) )
    # (5) compute the deviance on the testing sample
    dev.test <- (-2)*sum( dnorm( y.test , mu , sigma , log=TRUE ) )
    # (6) report the difference
    return( dev.test - dev.train )
}


6.3. AKAIKE INFORMATION CRITERION 193}return( dev.test - dev.train )e first line just announces that sim.train.test is a function, which is a bundle ofreusable code with adjustable inputs. e adjustable input here is N, the sample size. elines of code between the braces are the reusable code. ese lines perform these jobs, inorder:(1) Simulate the training sample, stored in y.train(2) Simulate the test sample, stored in y.test(3) Fit a linear model to the training sample(4) Compute the deviance on the training sample(5) Compute the deviance on the testing sample(6) Report back as output the difference between the test and training devianceOur goal is to repeat this procedure thousands of times, to get a distribution for the differenceD test − D train . R makes this easy:diff


FIGURE 6.6. Simulations of the AIC gambit. (a) The distribution of D_test − D_train, from 10-thousand simulations. Mean difference marked by the vertical dashed line. (b) Like before, but now with N = 10. (c) D_train and D_test as a function of model complexity. Each point is the mean of 10-thousand simulations, each with sample size N = 20. The filled points are training, and the empty points are testing. The solid black line displays AIC for each model. (d) Like before, but now with only N = 10. AICc now displayed by the dashed line.

Panel (d) displays the same analysis, but now with half the sample size, N = 10. Now the more complex models with four and five parameters overfit more than AIC expects them to, as indicated by the open points lying above the solid black line. The dashed line at the top shows AICc, which recall uses the penalty 2k/(1 − (k + 1)/N). In this example, AICc anticipates the accelerating overfitting as k approaches N. But it is also rather conservative, overestimating D_test by sometimes a large amount. Being conservative isn't always the best approach, but peer review tends to favor conservative analyses, so using AICc in this kind of situation is conventional.

However, notice that in this case both AIC and AICc nominate the same model as best, the simplest model with one parameter. This model is also the model with the best D_test. This is true even though the correct model is the model with three parameters. This situation is


perfectly normal. The best model for prediction may not be the correct model. The basic reason is that parameters must be estimated. When the training sample is small relative to the complexity of a model, the parameters will not be estimated accurately. This means that using a deliberately simple model can sometimes improve prediction. And AIC is capable of detecting when this happens, at least sometimes.

Overthinking: Simulating AIC again. The function that simulates train and test for the multiple regression example in FIGURE 6.6 is given by the code below.
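A minimal sketch of such a function follows. The data generating process (two true slopes, set here to 0.15 and −0.4) and the use of ordinary least squares are illustrative assumptions of the sketch. The argument k sets the number of parameters in the fitted model, intercept included:

    sim.train.test2 <- function( N=20 , k=3 , b=c(0.15,-0.4) ) {
        # simulate train and test sets; the true model uses only
        # the first two predictors
        n_pred <- max( k-1 , 2 )
        X.train <- matrix( rnorm(N*n_pred) , nrow=N )
        X.test <- matrix( rnorm(N*n_pred) , nrow=N )
        y.train <- rnorm( N , X.train[,1:2] %*% b )
        y.test <- rnorm( N , X.test[,1:2] %*% b )
        # fit a model with k parameters: an intercept and k-1 slopes
        if ( k==1 ) {
            m <- lm( y.train ~ 1 )
            mu.test <- rep( coef(m)[1] , N )
        } else {
            X <- X.train[ , 1:(k-1) , drop=FALSE ]
            m <- lm( y.train ~ X )
            mu.test <- cbind( 1 , X.test[ , 1:(k-1) , drop=FALSE ] ) %*% coef(m)
        }
        sigma <- sqrt( mean( residuals(m)^2 ) )   # ML estimate of sigma
        dev.train <- (-2)*sum( dnorm( y.train , fitted(m) , sigma , log=TRUE ) )
        dev.test <- (-2)*sum( dnorm( y.test , mu.test , sigma , log=TRUE ) )
        c( dev.train , dev.test )
    }

Averaging the two deviances over many calls, for each value of k, reproduces the trends in panels (c) and (d).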


We can also ask how a procedure behaves in this large-data limit. With practically infinite data, AIC will always select the most complex model, so AIC is sometimes accused of “overfitting.” But at the large-data limit, the most complex model will make predictions identical to the true model. The reason is that with so much data every parameter can be very very precisely estimated. And so using an overly complex model will not hurt prediction. For example, the model with 5 parameters in FIGURE 6.6 (d) will tell you that the coefficients for predictors after the 2nd are almost exactly zero. Therefore failing to identify the “correct” model does not hurt us, at least not in this sense.

6.4. Deviance information criterion

AIC is an amazing result, but it has a lot of restrictions on its use. Most salient of these restrictions is the requirement to use flat (or merely very weak) priors. So in this section I introduce DIC, the DEVIANCE INFORMATION CRITERION.86 DIC and AIC are different procedures for estimating the same thing: the test sample deviance, D_test. In this way both attempt to approximate the information divergence of a model, which allows us to contrast the accuracy of different models fit to the same data. But DIC, unlike AIC, does not require flat priors. DIC was developed specifically for Bayesian models fit with MCMC, so it has a more fully Bayesian flavor, as well. DIC can also handle many multilevel models (Chapter 13). But when priors are flat, DIC and AIC provide nearly identical estimates. In this sense then, DIC is a more general version of AIC, despite the differences of history.

Still, DIC is not perfectly general—nothing in statistics ever is. Like AIC, DIC assumes an approximately Gaussian posterior density. And unlike AIC, it has no known AICc-style adjustment. So DIC works best when the effective number of parameters is much less than the sample size. We say “effective” number of parameters, because as you'll see shortly, not all parameters necessarily contribute equally to overfitting, once we start using informative priors.

DIC is also a greater bother in computation. Unlike AIC, DIC has no quick formula for the overfitting penalty. Instead DIC requires processing the posterior distribution from the training sample. Luckily, you've already performed more bothersome calculations, using samples from the posterior density. So you'll find computing DIC to be a natural extension of skills you've already acquired. And amazingly, it is possible to get a useful estimate of D_test, just by processing the posterior density of D_train.

Rethinking: What about BIC? The BAYESIAN INFORMATION CRITERION, abbreviated BIC and also known as the Schwarz criterion,87 is more commonly juxtaposed with AIC. The choice between BIC or AIC (or neither!) is not about being Bayesian or not. There are both Bayesian and non-Bayesian ways to motivate both, and depending upon how strict one wishes to be, neither may be considered Bayesian.

BIC is an estimate of the average likelihood of a simple linear model. The average likelihood is the denominator in Bayes' theorem, the likelihood averaged over the prior. There is a venerable tradition in Bayesian inference of comparing average likelihoods as a means to comparing models. A ratio of average likelihoods is called a BAYES FACTOR. On the log scale, these ratios are differences, and so comparing differences in average likelihoods is much like comparing differences in AIC or DIC.
Since average likelihood is, indeed, averaged over the prior, more parameters induce a natural penalty on complexity. This helps guard against overfitting, even though the exact penalty is not in general the same as with AIC or DIC.


But some well-known Bayesian statisticians dislike the Bayes factor approach,88 and all admit that there are technical obstacles to its use. Foremost for our purposes, computing average likelihood is hard. AIC, DIC, and the like have their own computational problems, but nothing like average likelihood. And even when priors are weak and have no influence on estimates within models, priors can have a huge impact on comparisons between models.

So while the approach is tremendously valuable, and learning an alternative to information criteria necessarily helps one to understand information criteria even better, a robust treatment of Bayes factors is just beyond the scope of this book. It's important to realize, though, that the choice of Bayesian or not does not also decide between information criteria or Bayes factors. Moreover, there's no need to choose, really. We can always use both and learn from the ways they agree and disagree.

6.4.1. Defining DIC. First some notation. There are two elements in calculating DIC: the average deviance and the deviance of the average of the parameters.

(1) The average deviance arises from noting that, since deviance is computed from parameters, and since parameters have a distribution, then the deviance must also have a distribution. Compute the posterior density of the deviance, and then take its mean. That's the average deviance, D̄.

(2) The deviance of the average of the parameters implies finding the mean of the parameters and then computing a single deviance value for this posterior mean. That's the deviance of the average of the parameters, D̂.

Then DIC is defined as:

    DIC = D̂ + 2(D̄ − D̂) ≈ E(D_test)

In English: the overfitting penalty is twice the difference between the average deviance and the deviance of the average of the parameters. The difference D̄ − D̂ is called the EFFECTIVE NUMBER OF PARAMETERS, often labeled p_D. It takes the place of the parameter count k in AIC's formula.

Overthinking: Computing DIC using posterior variance. An alternative formula for DIC is:

    DIC = D̂ + var(D)

where var(D) is the variance of the posterior distribution of the deviance. This implies that var(D)/2 is an estimate of the effective number of parameters, p_D. This formula is sometimes more accurate or more convenient than the original difference formula. Much of the time, it's almost exactly the same. Whichever you use, just be sure to compute DIC the same way on all models you wish to compare. Or better yet, you could compute DIC both ways, to be sure inference is not sensitive to the details of computation. It doesn't matter so much that the two formulas for DIC give slightly different answers. What matters is that the relative positions of the models remain the same, using the different formulas.

6.4.2. Simulating DIC. To get a grasp on DIC, let's do an experiment that reveals both how to compute it and why it behaves as it does. Suppose we observe a single measurement, and this measurement has value 1. Also suppose we know the measurement has a standard error of σ = 1. So our inference problem is just the mean value of the measurement. Using a flat prior, we fit the model and compute D̄ and D̂:


R code 6.15
m6.9 <- map(
    alist(
        y ~ dnorm( mu , 1 )
    ) ,
    data=list(y=1) , start=list(mu=0) )

# posterior distribution of the deviance
post <- sample.qa.posterior( m6.9 , n=1e4 )
dev <- (-2)*dnorm( 1 , post$mu , 1 , log=TRUE )
D.bar <- mean( dev )                                     # average deviance
D.hat <- (-2)*dnorm( 1 , mean(post$mu) , 1 , log=TRUE )  # deviance of the average
D.bar - D.hat                 # effective number of parameters, about 1
D.hat + 2*( D.bar - D.hat )   # DIC
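The companion model m6.10, discussed below, is the same except for an informative Normal(0, 1) prior on µ. A sketch, mirroring the code above:

    m6.10 <- map(
        alist(
            y ~ dnorm( mu , 1 ) ,
            mu ~ dnorm( 0 , 1 )
        ) ,
        data=list(y=1) , start=list(mu=0) )

    post <- sample.qa.posterior( m6.10 , n=1e4 )
    dev <- (-2)*dnorm( 1 , post$mu , 1 , log=TRUE )
    D.bar <- mean( dev )
    D.hat <- (-2)*dnorm( 1 , mean(post$mu) , 1 , log=TRUE )
    D.bar - D.hat    # effective number of parameters, now about 0.5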


FIGURE 6.7. How DIC estimates the overfitting penalty. In each plot, samples from the posterior density of µ, mu, are shown against deviance at each value of µ. The vertical black dashed line marks the posterior mean, E µ. The horizontal black dashed line marks its deviance, D̂. The blue dashed line marks the average deviance, D̄. The difference D̄ − D̂ estimates the effective number of parameters. This difference is shown by the gray line segment in the right margin of each plot, labeled by its value. (a) A flat prior on µ, resulting in D̄ − D̂ ≈ 1. (b) An informative µ ∼ Normal(0, 1) prior, resulting in D̄ − D̂ ≈ 0.5.

The informative prior has this effect because priors act like previously observed data. In m6.10, the prior is equivalent to exactly one previous observation, located at y_i = 0. Go ahead and fit a model to one observation at y_i = 0, using a flat prior, and you'll see that the posterior density of µ will have mean 0 and standard error 1. So m6.10 updates that posterior in light of another observation at y_i = 1. Since the prior and the sample are equally strong, the posterior mean ends up right between them, at µ = 0.5. And likewise, since the prior and sample influence the posterior equally, the effective number of parameters is exactly half the number of parameters. In terms of the distance D̄ − D̂, the prior effectively doubles the sample size, concentrating the posterior more around its mean. If the posterior is more concentrated, then the average deviance (blue dashed line) must be closer to the deviance of the mean (black dashed line).

Typically, you'll work with much larger samples, so any informative prior won't be nearly as influential as in the pedagogical example in FIGURE 6.7. But the phenomenon is quite general. Informative priors reduce the flexibility of the model—how freely it can encode the sample. Reduced flexibility tends to mean less risk of overfitting.

To see how well DIC does at estimating D_test, let's repeat the simulation strategy we developed for AIC (FIGURE 6.6, panels c and d). Now however, we'll use Normal(0, 1) priors on every regression slope (but not the intercept) in the multivariate models. So the models


look like this:

    y_i ∼ Normal(µ_i, 1)
    µ_i = α + β_1 x_i + β_2 z_i
    α ∼ Normal(0, 100)
    β_1 ∼ Normal(0, 1)
    β_2 ∼ Normal(0, 1)

Priors of this kind—zero mean and applied to regression coefficients—are often called REGULARIZING priors. They are designed to produce less overfitting. Let's see how DIC anticipates this.

FIGURE 6.8. Accuracy of the DIC estimate. Black points are D_train and open points D_test. The blue trend shows DIC for each model. Data simulated identically to FIGURE 6.6, but all models now use a regularizing prior. See main text for details. Each point is the mean of 10-thousand simulations. (a) N = 20. (b) N = 10.

FIGURE 6.8 displays plots analogous to panels c and d in FIGURE 6.6, but now with DIC shown by the blue trend. Black points still show D_train and open points D_test. In panel (a), note that models with 3, 4, and 5 parameters overfit by less than twice the number of parameters. DIC also correctly anticipates this fact, although it was a little too pessimistic about the model with 3 parameters.

In panel (b), the sample size is cut in half to N = 10. Compare to FIGURE 6.6, panel (d). With flat priors, the more complex models overfit more than AIC expects. But now using the regularizing priors, DIC manages to get much closer to the target. A little bit of regularization can go a long way towards taming overfit models. The reason is that overfit models are fitting parameters on the basis of very little evidence per parameter. This results in wide posterior densities, indicating broad uncertainty about the parameters. Just a little bit of prior information on the parameters can have a big effect on unreliable estimates, while having little impact on the parameters we can be confident about.
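In map form, the model above looks like the sketch below. The simulated data, with true slopes 0.15 and −0.4 as in the earlier AIC simulation sketch, are an illustrative assumption only:

    library(rethinking)

    # simulate a small illustrative sample
    N <- 20
    d.sim <- data.frame( x=rnorm(N) , z=rnorm(N) )
    d.sim$y <- rnorm( N , 0.15*d.sim$x - 0.4*d.sim$z , 1 )

    m.reg <- map(
        alist(
            y ~ dnorm( mu , 1 ) ,
            mu ~ a + b1*x + b2*z ,
            a ~ dnorm( 0 , 100 ) ,
            b1 ~ dnorm( 0 , 1 ) ,
            b2 ~ dnorm( 0 , 1 )
        ) ,
        data=d.sim , start=list(a=0,b1=0,b2=0) )
    precis( m.reg )    # the regularized estimates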


Rethinking: Ridge regression. Linear models in which the slope parameters use Gaussian priors, centered at zero, are sometimes known as RIDGE REGRESSION. Ridge regression typically takes as input a precision λ that essentially describes the narrowness of the prior. λ > 0 results in less overfitting. However, just as with the Bayesian version, if λ is too large, we risk underfitting.

While not originally developed as Bayesian, ridge regression is another example of how a statistical procedure can be understood from both Bayesian and non-Bayesian perspectives. Ridge regression does not compute a posterior density. Instead it uses a modification of OLS that stitches λ into the usual matrix algebra formula for the estimates. The function lm.ridge, built into R's MASS library, will fit linear models this way.

Overthinking: Simulating DIC. The code for producing FIGURE 6.8 is a bit monstrous. But it is very much like the code used to simulate AIC (page 195).
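A compact sketch for the two-predictor case follows; the full version also loops over the number of parameters. As before, the data generating process and the number of posterior samples are illustrative assumptions:

    library(rethinking)

    sim.train.test.DIC <- function( N=20 , b=c(0.15,-0.4) , n.samples=1e3 ) {
        # simulate train and test samples from the same process
        d.train <- data.frame( x=rnorm(N) , z=rnorm(N) )
        d.train$y <- rnorm( N , b[1]*d.train$x + b[2]*d.train$z , 1 )
        d.test <- data.frame( x=rnorm(N) , z=rnorm(N) )
        d.test$y <- rnorm( N , b[1]*d.test$x + b[2]*d.test$z , 1 )
        # fit with regularizing priors on the slopes
        m <- map(
            alist(
                y ~ dnorm( mu , 1 ) ,
                mu ~ a + b1*x + b2*z ,
                a ~ dnorm( 0 , 100 ) ,
                b1 ~ dnorm( 0 , 1 ) ,
                b2 ~ dnorm( 0 , 1 )
            ) ,
            data=d.train , start=list(a=0,b1=0,b2=0) )
        post <- sample.qa.posterior( m , n=n.samples )
        # posterior distribution of the training deviance
        dev <- sapply( 1:n.samples , function(i) {
            mu <- post$a[i] + post$b1[i]*d.train$x + post$b2[i]*d.train$z
            (-2)*sum( dnorm( d.train$y , mu , 1 , log=TRUE ) )
        } )
        # deviance at the posterior mean, and DIC
        mu.hat <- mean(post$a) + mean(post$b1)*d.train$x +
            mean(post$b2)*d.train$z
        dev.hat <- (-2)*sum( dnorm( d.train$y , mu.hat , 1 , log=TRUE ) )
        DIC <- dev.hat + 2*( mean(dev) - dev.hat )
        # test-sample deviance, using the posterior mean estimates
        mu.test <- mean(post$a) + mean(post$b1)*d.test$x +
            mean(post$b2)*d.test$z
        dev.test <- (-2)*sum( dnorm( d.test$y , mu.test , 1 , log=TRUE ) )
        c( dev.train=dev.hat , dev.test=dev.test , DIC=DIC )
    }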


I've used map instead of a faster matrix algebra procedure, so the code will be easier to understand and modify. But that means these simulations are a lot slower than the previous ones. Some status reporting in the loop over models helps reduce anxiety:
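A schematic of that loop, shown here driving the AIC simulator sketched earlier; the DIC loop has the same shape, driving a simulator that accepts the number of parameters:

R code 6.18
N <- 20
n.sim <- 1e3
result <- list()
for ( k in 1:5 ) {
    # status report, so you know the loop is still alive
    print( paste( "simulating models with" , k , "parameters" ) )
    result[[k]] <- replicate( n.sim , sim.train.test2( N=N , k=k ) )
}
sapply( result , rowMeans )   # mean train and test deviance for each k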


6.5. Using AIC

Model averaging helps guard against overconfidence in model structure, in the same way that using the entire posterior density helps guard against overconfidence in parameter values. This section demonstrates how to conduct comparison and averaging, using a simple example with a few predictor variables. Later chapters continue using these tools, and the details of examples do vary. So be wary not to overgeneralize the example that follows.

6.5.1. Model comparison. Recall the primate milk data from the previous chapter. Let's load it into R, remove the NAs, and rescale one of the explanatory variables:

    data(milk)
    d <- milk[ complete.cases(milk) , ]
    d$neocortex <- d$neocortex.perc / 100


a.start <- mean( d$kcal.per.g )
sigma.start <- log( sd( d$kcal.per.g ) )

m6.11 <- map(
    alist(
        kcal.per.g ~ dnorm( a , exp(log.sigma) )
    ) ,
    data=d , start=list(a=a.start,log.sigma=sigma.start) )
m6.12 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , exp(log.sigma) ) ,
        mu ~ a + bn*neocortex
    ) ,
    data=d , start=list(a=a.start,bn=0,log.sigma=sigma.start) )
m6.13 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , exp(log.sigma) ) ,
        mu ~ a + bm*log(mass)
    ) ,
    data=d , start=list(a=a.start,bm=0,log.sigma=sigma.start) )
m6.14 <- map(
    alist(
        kcal.per.g ~ dnorm( mu , exp(log.sigma) ) ,
        mu ~ a + bn*neocortex + bm*log(mass)
    ) ,
    data=d , start=list(a=a.start,bn=0,bm=0,log.sigma=sigma.start) )


6.5. USING AIC 205compare( m6.11 , m6.12 , m6.13 , m6.14 , nobs=nrow(d) )R code6.23k AICc w.AICc dAICcm6.14 4 -14.02 0.93 0.00m6.11 2 -7.60 0.04 6.42m6.13 3 -6.89 0.03 7.13m6.12 3 -5.03 0.01 8.99e function compare takes fit models as input, as well as the number of observations (nobs)the models were fit to. It returns a table in which models are ranked from best to worst, withfour columns.(1) k is the number of free parameters in each model.(2) AICc is the small-sample adjusted AIC value for each model. Smaller is better.Notice here that the values are actually negative. is is fine and not unusual withGaussian linear models. e most negative AICc is best.(3) w.AICc is the AKAIKE WEIGHT for each model. ese values are transformed AICcvalues. I’ll explain them below.(4) dAICc is the difference between each AICc and the lowest AICc.ese last two columns, w.AICc and dAICc, are there to ease interpretation of the AICcvalues. Recall that the absolute magnitude of any information criterion contains little information.It’s only the differences between models that provide estimates of relative accuracy.So these two columns transform AICc into measures of relative difference.dAICc, usually called delta-AICc or δAICc, is just a difference. So it provides a quickguide to how quickly accuracy degrades as you move down the model rank. is is useful,because oen the best model is barely better than the second best. Sometimes several modelshave nearly identical AICc values. e differences between AICc values just makes it easierto diagnose such situations. In this case, model m6.14 is more than 6 units of deviance betterthan the second model. From there, differences between models are much smaller.How big of a difference is 6? Well, it’s equivalent to adding 3 meaningless and uncorrelatedpredictor variables. But honestly it’s hard to judge on this scale how much moreaccurate this makes model m6.14. is is where the Akaike weights help. e weight for amodel i in a set of m models is given by:w i =exp(−0.5δAICc i )∑ mj=1 exp(−0.5δAICc j)is example uses AICc, but the formula is the same for any other information criterion. Tocompute these weights yourself, without the convenience of compare:# compute AICc values for all four modelsaicc


The Akaike weight formula might look rather odd, but really all it is doing is putting AICc on a probability scale. So each weight will be a number from 0 to 1, and the weights together always sum to 1. Now larger values are better.

But what do these probabilities mean? There actually isn't a consensus about that. But here's Akaike's interpretation, which is common.89

    A model's weight is an estimate of the probability that the model will make the best predictions on new data, conditional on the set of models considered.

Here's the heuristic explanation. First, regard AICc as the expected deviance of a model on future data. That is to say that AICc gives us an estimate of E(D_test). Akaike weights convert these deviance values, which are log-likelihoods, to plain likelihoods and then standardize them all. This is just like Bayes' theorem uses a sum in the denominator to standardize the product of the likelihood and prior. Therefore the Akaike weights are analogous to posterior probabilities of models, conditional on expected future data, and assuming equal priors.90 So you can usefully read each weight as an estimated probability that each model will perform best on future data. In simulation at least, interpreting weights in this way turns out to be appropriate.91

In this analysis, the best model has 94% of the model weight. That's pretty good. Note however that these weights are conditional upon the set of models considered. If you add another model, or take one away, all of the weights will change. This is still a small world analysis, incapable of considering models that we have yet to imagine or analyze. Still, it's clear that if we restrict ourselves to these four simple linear regressions, we benefit a lot from using both of the predictor variables.

Notice as well that either predictor alone is actually expected to do worse than the model without either predictor, m6.11. The expected difference is small, but since neither predictor alone improves the deviance very much, the penalty term knocks down the two models with only one predictor.

Overthinking: Calculating Akaike weights. The formula for Akaike weights looks very odd at first. Here's the quick run down of its pieces. Deviance, and therefore AICc, is on a log-probability scale. So exponentiating restores it as regular probability. The factor of −0.5 just cancels out the −2 used in computing deviance. The delta-AICc (δAICc) values are used instead of raw AICc values, because this makes the numerical calculations inside the computer more reliable. Specifically, exponentiating a very negative number will sometimes round to zero. Try exp(-1000) for example. Rescaling to differences doesn't change the weights, but does change accuracy. So always do it. Finally, each exponentiated value is summed up in the denominator, to ensure that all of the individual weights sum to 1, like good probabilities.

Rethinking: How big a difference in AIC is “significant”? Newcomers to information criteria often ask whether a difference between AIC values is “significant.” For example, models m6.14 and m6.11 differ by 6.42 units of deviance. Is it possible to say whether this difference is big enough to conclude that m6.14 is significantly better? In general, it is not possible to provide a principled threshold of difference that makes one model “significantly” better than another, whatever that means. The same is actually true of ordinary significance testing—the 5% convention is just a convention.
We could invent some convention for AIC, but it too would just be a convention. Moreover, we know the models will not make the same predictions—they are different models. So “significance” in this context must have a very different definition than usual.


The attitude this book encourages is to retain and present all models, no matter how big or small the differences in AIC (or another criterion). The more information in your summary, the more information for peer review, and the more potential for the scholarly community to accumulate information. And keep in mind that averaging models often produces better results than selecting any single model, obviating the “significance” question.

Rethinking: AIC metaphors. Here are two metaphors used to help explain the concepts behind using AIC (or another information criterion) to compare models.

Think of models as race horses. In any particular race, the best horse may not win. But it's more likely to win than is the worst horse. And when the winning horse finishes in half the time of the second-place horse, you can be pretty sure the winning horse is also the best. But if instead it's a photo-finish, with a near tie between first and second place, then it is much harder to be confident about which is the best horse. AIC values are analogous to these race times—smaller values are better, and the distances between the horses/models are informative. Akaike weights transform differences in finishing time into probabilities of being the best model/horse on future data/races. But if the track conditions or jockey changes, these probabilities may mislead. Forecasting future racing/prediction based upon a single race/fit carries no guarantees.

Think of models as stones thrown to skip on a pond. No stone will ever reach the other side (perfect prediction), but some sorts of stones make it further than others, on average (make better test predictions). But on any individual throw, lots of unique conditions avail—the wind might pick up or change direction, a duck could surface to intercept the stone, or the thrower's grip might slip. So which stone will go farthest is not certain. Still, the relative distances reached by each stone therefore provide information about which stone will do best on average. But we can't be too confident about any individual stone, unless the distances between stones is very large.

Of course neither metaphor is perfect. Metaphors never are. But many people find these to be helpful in interpreting information criteria.

6.5.1.2. Comparing estimates. In addition to comparing models on the basis of expected test deviance, it is nearly always useful to compare parameter estimates among models. Comparing estimates helps in at least two major ways. First, it is useful to understand why a particular model or models have lower AIC values. Changes in posterior densities, across models, provide useful hints. Second, regardless of AIC values, we often want to know whether some parameter's posterior density is stable across models. For example, scholars often ask whether a predictor remains important as other predictors are added and subtracted from the model. To address that kind of question, one typically looks for a parameter's posterior density to remain stable across models, as well as for all models that contain that parameter to have lower AICc (or DIC) than those models without it.

In the primate milk example, comparing estimates confirms what you already learned in the previous chapter: the model with both predictors does much better, because each predictor masks the other. In order to demonstrate that in the previous chapter, we actually did fit three of the models currently at hand. Looking at a consolidated table of the MAP estimates makes the comparison a lot easier.
The coeftab function takes a series of fit models as input and builds such a table:

R code 6.25
coeftab(m6.11,m6.12,m6.13,m6.14)


              m6.11  m6.12  m6.13  m6.14
    a          0.66   0.35   0.71  -1.09
    log.sigma -1.79  -1.80  -1.85  -2.16
    bn           NA   0.45     NA   2.79
    bm           NA     NA  -0.03  -0.10
    nobs         17     17     17     17

The nobs at the bottom are the number of observations, just there to help you make sure you fit each model to the same observations. From scanning the table, you can see that the estimates for both bn and bm get farther from zero when they are both present in the model. But standard errors aren't represented here, and seeing how the uncertainty changes is just as important as seeing how the location changes. You can get coeftab to add standard errors to the table (see ?coeftab), but that still doesn't make it easy to appreciate changes in the width of posterior densities. Better to plot these estimates:

R code 6.26
plot( coeftab(m6.11,m6.12,m6.13,m6.14) )

FIGURE 6.9. Comparing the posterior densities of parameters for the four models fit to the primate milk data. Each point is a MAP estimate, and each black line segment is a 95% confidence interval. Estimates are grouped by parameter identity, and each row in a group is a model.

The result is shown in FIGURE 6.9. Each point is a MAP estimate, and each black line segment is a 95% confidence interval. Each group of estimates corresponds to the same named parameter, across models. Each row in each group is a model, labeled on the left. Now you can quickly scan each group of estimates to see how estimates change or not across models. You can adjust these plots to group by model instead of by parameter and to display only some parameters. See ?coeftab.plot for details.

Rethinking: Barplots suck. The plot in FIGURE 6.9 is called a DOTCHART. It is a replacement for a typical BARPLOT. The only problem with barplots is that they have bars. The bars carry only a little


information—which way to zero, usually—but greatly clutter the presentation and generate optical illusions. Better to remove them and use dotcharts instead of barplots. R's dotchart function is sufficient to produce FIGURE 6.9. It's usually all you need, but there are more flexible options. Any book on R graphics will present several.

FIGURE 6.10. Model averaged posterior predictive distribution for the primate milk analysis. The black regression line and dashed 95% confidence interval correspond to the minimum-AICc model, m6.14. The gray line and shaded 95% confidence region correspond to the model averaged predictions. (Horizontal axis: neocortex; vertical axis: kcal.per.g.)

6.5.2. Model averaging. Way back in Chapter 3, we saw how to preserve the uncertainty about parameters when simulating predictions from a model. Now we have the analogous problem of preserving the uncertainty about models. Treating model weights as heuristic plausibilities that each model will perform best in testing, it makes sense to try to preserve these relative plausibilities when generating predictions. And doing so is mechanically very similar to the procedure with a single model.

To review, let's simulate and plot counterfactual predictions for the minimum-AICc model, m6.14. Here's the familiar code for simulating the posterior predictive distribution, focusing on counterfactual predictions across the range of neocortex.
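A sketch of that code; the value at which mass is held constant and the 30-point grid are illustrative choices here:

R code 6.27
post <- sample.qa.posterior( m6.14 , n=1e4 )
nc.seq <- seq( from=0.5 , to=0.8 , length.out=30 )
mass.mean <- mean( d$mass )
mu <- sapply( nc.seq , function(nc)
    post$a + post$bn*nc + post$bm*log(mass.mean) )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , quantile , probs=c(0.025,0.975) )

plot( kcal.per.g ~ neocortex , data=d )
lines( nc.seq , mu.mean )
lines( nc.seq , mu.PI[1,] , lty=2 )
lines( nc.seq , mu.PI[2,] , lty=2 )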


Now let's simulate and add model averaged posterior predictions. Here's the procedure, and then I'll show you the code.

(1) Compute AICc (or another information criterion) for each model
(2) Compute the weight for each model
(3) Sample a model, using each model's weight as a probability
(4) Sample a vector of parameters from the posterior density of the model chosen in step (3)
(5) Repeat steps (3) and (4) until you have enough sampled parameter values
(6) Use the collection of samples from all models to simulate predictions

Typically the number of models to average over is small. In this example, it is only four. So it makes more sense to sample from the posterior of each model in proportion to each model weight. This ensures that you end up with a collection of sampled parameter values, with samples from each individual model appearing in proportion to model weight. When a model doesn't include a parameter, that parameter is implicitly set to the value zero.

And this is what sample.qa.posterior can do, once you pass it a list of models instead of just one model. So to sample according to AICc weight (the default behavior) and then compute posterior predictions:
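A sketch, continuing from the previous code block; the nobs argument follows the compare example above, and the exact interface is assumed:

R code 6.28
post <- sample.qa.posterior(
    list(m6.11,m6.12,m6.13,m6.14) , n=1e4 , nobs=nrow(d) )
# parameters absent from a sampled model are set to zero,
# so the full linear form covers every sample
mu <- sapply( nc.seq , function(nc)
    post$a + post$bn*nc + post$bm*log(mass.mean) )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , quantile , probs=c(0.025,0.975) )
lines( nc.seq , mu.mean , col="gray" )
lines( nc.seq , mu.PI[1,] , lty=2 , col="gray" )
lines( nc.seq , mu.PI[2,] , lty=2 , col="gray" )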


Consider by analogy the Curse of Tippecanoe.92 From the year 1840 until 1960, every United States president who was elected in a year ending in the digit 0 (which happens every 20 years, given 4-year terms) died in office. William Henry Harrison was the first, being elected in 1840 and dying of pneumonia the next year. John F. Kennedy was the last, elected in 1960 and assassinated in 1963. Seven American presidents died in sequence in this pattern. Ronald Reagan was elected in 1980, but despite at least one attempt on his life, he managed to live long after his term was up, breaking the curse. Given enough time and data, a pattern like this can be found for almost any body of data. But without any compelling reason to believe a pattern is meaningful, the mere existence of such patterns is hardly compelling. Most large sets of data will contain patterns of correlation that are strong and surprising. If we search hard enough, we are bound to find a Curse of Tippecanoe. There are many other patterns in presidential names and dates, and no doubt new ones are being found and circulated all the time.

Fiddling with and constructing many predictor variables is a great way to find coincidences, but not necessarily a great way to evaluate hypotheses. However, fitting many possible models isn't always a dangerous idea, provided some judgment is exercised in weeding down the list of variables at the start. There are two scenarios in which this strategy appears defensible. First, sometimes all one wants to do is explore a set of data, because there are no clear hypotheses to evaluate. This is rightly labeled pejoratively as DATA DREDGING, when one does not admit to it. But when used together with model averaging, and freely admitted, it can be a way to stimulate future investigation. Second, sometimes we need to convince an audience that we have tried all of the combinations of predictors, because none of the variables seem to help much in prediction.

6.6. In praise of complexity

Need to add discussion of reasons for using complex models, even when AIC/DIC recommend against it.

6.7. Practice

All three problems to follow use the same data. Pull out the old Howell !Kung demography data and split it into two equally-sized data frames. Here's the code to do it:

R code 6.29
library(rethinking)
data(Howell1)
d <- Howell1
d$age <- ( d$age - mean(d$age) ) / sd(d$age)
set.seed( 1000 )                  # so the random split is reproducible
i <- sample( 1:nrow(d) , size=nrow(d)/2 )
d1 <- d[ i , ]     # training half
d2 <- d[ -i , ]    # testing half


6.7.1. An average polynomial. Let h_i and x_i be the height and centered age values, respectively, on row i. Fit the following models to the data in d1:

M_1: h_i ∼ Normal(µ_i, σ), µ_i = α + β_1 x_i
M_2: h_i ∼ Normal(µ_i, σ), µ_i = α + β_1 x_i + β_2 x_i^2
M_3: h_i ∼ Normal(µ_i, σ), µ_i = α + β_1 x_i + β_2 x_i^2 + β_3 x_i^3
M_4: h_i ∼ Normal(µ_i, σ), µ_i = α + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + β_4 x_i^4
M_5: h_i ∼ Normal(µ_i, σ), µ_i = α + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + β_4 x_i^4 + β_5 x_i^5
M_6: h_i ∼ Normal(µ_i, σ), µ_i = α + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + β_4 x_i^4 + β_5 x_i^5 + β_6 x_i^6

Use map to fit these.

Note that fitting all of these polynomials to the height-by-age relationship is not a good way to derive insight. It would be better to have a simpler approach that would allow for more insight, like perhaps a piecewise linear model. But the set of polynomial families above will serve to help you practice and understand model comparison and averaging.

(a) Compare the models above, using AICc. Compare the model rankings, as well as the AICc weights.

(b) For each model, produce a plot with the estimated mean and 95% confidence interval of the mean, superimposed on the raw data. How do predictions differ across models?

(c) Now also plot the model averaged predictions, across all models. In what ways do the averaged predictions differ from the predictions of the model with the lowest AICc value?

6.7.2. Cross-validate. (a) Compute the test-sample deviance for each model. This means calculating deviance, but using the data in d2 now. You can compute the log-likelihood of the height data with:

R code 6.30
sum( dnorm( d2$height , mu , sigma , log=TRUE ) )

where mu is a vector of predicted means (based upon age values and MAP parameters) and sigma is the MAP standard deviation.

(b) Compare the deviances from (a) to the AICc values. It might be easier to compare if you subtract the smallest value in each list from the others. For example, subtract the minimum AICc from all of the AICc values so that the best AICc is normalized to zero. Which model makes the best out-of-sample predictions in this case? Does AICc do a good job of estimating the test deviance?
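As a concrete starting point for these two problems, here is one possible sketch of fitting M_2 and computing its test-sample deviance. The priors and start bounds are illustrative choices, not prescribed by the problems:

# fit the quadratic model to the first half of the data
m2 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b1*age + b2*age^2 ,
        a ~ dnorm( 140 , 100 ) ,
        b1 ~ dnorm( 0 , 50 ) ,
        b2 ~ dnorm( 0 , 50 ) ,
        sigma ~ dunif( 0 , 80 )
    ) ,
    data=d1 )

# test-sample deviance, using MAP estimates and the d2 data
k <- coef(m2)
mu <- k["a"] + k["b1"]*d2$age + k["b2"]*d2$age^2
-2 * sum( dnorm( d2$height , mu , k["sigma"] , log=TRUE ) )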


6.7.3. The deviance of the average. Now compute the out-of-sample deviance of the averaged model, using AICc. Calculate the model-averaged predictions and use those predictions to compute the deviance, just as you did in the previous problem. Compare the deviance of the AICc-averaged model to the deviances you computed for each individual model. How does the averaged model compare in accuracy to the individual models? Can you explain these results?


7 Interactions

How is a manatee like an A.W.38 bomber? The manatee (Trichechus manatus) is a slow-moving, aquatic mammal that lives in warm, shallow waters. Manatees have no natural predators, but they do share their waters with motor boats. And motor boats have propellers. Manatees are related to elephants, and they have very thick skins. But propeller blades can and do kill them. A majority of adult manatees bear some kind of scar earned in a collision with a boat (FIGURE 7.1, top). 93

The Armstrong Whitworth A.W.38 Whitley was a frontline Royal Air Force bomber. During the Second World War, the A.W.38 carried bombs and pamphlets into German territory. Unlike the manatee, the A.W.38 had fierce natural enemies: artillery and interceptor fire. Many planes never returned from their missions. And those that survived had the scars to prove it (FIGURE 7.1, bottom).

In both cases—manatee propeller scars and bomber bullet holes—we'd like to do something to improve the odds, to help manatees and bombers survive. Most observers intuit that helping manatees or bombers means reducing the kind of damage we see on them. For manatees, this might mean requiring propeller guards (on the boats, not the manatees). For bombers, it'd mean up-armoring the parts of the plane that show the most damage.

But in both cases, the evidence misleads us. Propellers do not cause most of the injury and death suffered by manatees. Rather, autopsies confirm that collisions with blunt parts of the boat, like the keel, do far more damage. Similarly, up-armoring the damaged portions of returning bombers did little good. Instead, improving the A.W.38 bomber meant armoring the undamaged sections. 94

The evidence from surviving manatees and bombers is misleading, because it is conditional on survival. Manatees and bombers that perished look different. A manatee struck by a keel is less likely to live than another grazed by a propeller. So among the survivors, propeller scars are common. Similarly, bombers that returned home conspicuously lacked damage to the cockpit and engines. They got lucky. Bombers that never returned home were less so. To get the right answer, in either context, we have to realize that the kind of damage seen is conditional on survival.

CONDITIONING is one of the most important principles of statistical inference. Data, like the manatee scars and bomber damage, are conditional on how they get into our sample. Posterior distributions are conditional on the data. All model-based inference is conditional on the model. Every inference is conditional on something, whether we notice it or not.

And a large part of the power of statistical modeling comes from creating devices that allow probability to be conditional on aspects of each case. The linear models you've grown to love are just crude devices that allow each outcome y_i to be conditional on a set of predictors


for each case i. Like the epicycles of the Ptolemaic and Copernican models (Chapters 4 and 6), linear models give us a way to describe conditionality.

FIGURE 7.1. TOP: Dorsal scars for 5 adult Florida manatees. Rows of short scars, for example on the individuals Africa and Flash, are indicative of propeller laceration. BOTTOM: Three exemplars of damage on A.W.38 bombers returning from missions.

Simple linear models frequently fail to provide enough conditioning, however. Every model so far in this book has assumed that each predictor has an independent association with the mean of the outcome. What if we want to allow the association to be conditional? For example, in the primate milk data from the previous chapters, suppose the relationship between milk energy and brain size varies by taxonomic group (ape, monkey, prosimian). This is the same as suggesting that the influence of brain size on milk energy is conditional on taxonomic group. The linear models of previous chapters cannot address this question.

To model deeper conditionality—where the importance of one predictor depends upon another predictor—we need INTERACTION. Interaction is a kind of conditioning, a way of allowing parameters (really their posterior distributions) to be conditional on further aspects of the data. The simplest kind of interaction, a linear interaction, is built by extending the linear modeling strategy to parameters within the linear model. So it is akin to placing epicycles on epicycles in the Ptolemaic and Copernican models. It is descriptive, but very powerful.

More generally, interactions are central to most statistical models beyond the cozy world of Gaussian outcomes and linear models of the mean. In generalized linear models (GLMs, Chapter 9 and onwards), even when one does not explicitly define variables as interacting, they will always interact to some degree. Moreover, every variable essentially interacts with itself, as the impact of a change in its value will depend upon its current value. Say goodbye to


simple constant slopes of linear regression. Say hello to impacts that may depend upon the covariation of dozens of predictor variables.

Multilevel models induce similar effects. Common sorts of multilevel models are essentially massive interaction models, in which estimates (intercepts and slopes) are conditional on clusters (person, genus, village, city, galaxy) in the data. Multilevel interaction effects are complex. They're not just allowing the impact of a predictor variable to change depending upon some other variable, but they are also estimating aspects of the distribution of those changes. This may sound like genius, or madness, or both. Regardless, you can't have the power of multilevel modeling without it.

Models that allow for complex interactions are easy to fit to data. But they can be considerably harder to understand. And so I spend this chapter reviewing simple interaction effects: how to specify them, how to interpret them, and how to plot them. The chapter starts with a case of an interaction between a single categorical (dummy) variable and a single continuous variable. In this context, it is easy to appreciate the sort of hypothesis that an interaction allows for. Then the chapter moves on to show how manipulating the model of the mean itself can help us understand the behavior of an interaction, before moving on to more complex interactions between multiple continuous predictor variables as well as higher-order interactions among more than two variables. In every section of this chapter, the model predictions are visualized, averaging over uncertainty in parameters.

My hope is that this chapter lays a solid foundation for interpreting generalized linear models and multilevel models in the later chapters.

Rethinking: Statistics all-star, Abraham Wald. The World War II bombers story is the work of Abraham Wald (1902–1950). Wald was born in what is now Romania, but immigrated to the United States after the Nazi invasion of Austria. Wald made many contributions over his short life. Perhaps most germane to the current material, Wald proved that for many types of rules for making statistical decisions, there will exist a Bayesian rule that is at least as good as any non-Bayesian one. Wald proved this, remarkably, beginning with non-Bayesian premises, and so anti-Bayesians could not ignore it. This work was summarized in Wald's 1950 book, published just before his death. 95 Wald died much too young, in a plane crash while touring India.

7.1. Building an interaction

Africa is special. The second largest continent, it is the most culturally and genetically diverse. Africa has about 3 billion fewer people than Asia, but it has just as many living languages. Africa is so genetically diverse that most of the genetic variation outside of Africa is just a subset of the variation within Africa. Africa is also geographically special, in a puzzling way: bad geography tends to be related to good economies within Africa, while the opposite is true outside of Africa.

To appreciate the puzzle, look at regressions of terrain ruggedness (a particular kind of bad geography) against economic performance (log GDP per capita in the year 2000), both inside and outside of Africa (FIGURE 7.2). Load the data used in this figure, and split it into Africa and non-Africa, with this code:

library(rethinking)
data(rugged)
d <- rugged

# make log version of outcome
d$log_gdp <- log( d$rgdppc_2000 )


# extract countries with GDP data
dd <- d[ complete.cases(d$rgdppc_2000) , ]

# split countries into Africa and not-Africa
d.A1 <- dd[ dd$cont_africa==1 , ]   # Africa
d.A0 <- dd[ dd$cont_africa==0 , ]   # not Africa

FIGURE 7.2. Separate linear regressions inside and outside of Africa, for log-GDP against terrain ruggedness. The slope is positive inside Africa, but negative outside. How can we recover this reversal of the slope, using the combined data?


# African nations
m7.1 <- map(
    alist(
        log_gdp ~ dnorm( mu , sigma ) ,
        mu <- a + br*rugged ,
        a ~ dnorm( 8 , 100 ) ,
        br ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d.A1 )

# non-African nations
m7.2 <- map(
    alist(
        log_gdp ~ dnorm( mu , sigma ) ,
        mu <- a + br*rugged ,
        a ~ dnorm( 8 , 100 ) ,
        br ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=d.A0 )


Third, we may want to use information criteria or another method to compare models. In order to compare a model that treats all continents the same way to a model that allows different slopes in different continents, we need models that use all of the same data (as explained in Chapter 6). This means we can't split the data, but have to make the model split the data.

Fourth, once you begin using multilevel models (Chapter 13), you'll see that there are advantages to borrowing information across categories like "Africa" and "not Africa." This is especially true when sample sizes vary across categories, such that overfitting risk is higher within some categories. In other words, what we learn about ruggedness outside of Africa should have some effect on our estimate within Africa, and vice versa. Multilevel models borrow information in this way, in order to improve estimates in all categories. When we split the data, this borrowing is impossible.

So let's see how to recover the reversal of slope, within a single model.

7.1.1. Adding a dummy variable doesn't work. The first thing to realize is that just including the categorical variable (dummy variable) cont_africa won't reveal the reversed slope. It's worth fitting this model to prove it to yourself, though. I'm going to walk through this as a simple model comparison exercise, just so you begin to get some applied examples of concepts you've accumulated from earlier chapters.

The question is to what extent singling out African nations changes predictions. There are two models to fit, to start. The first is just the simple linear regression of log-GDP on ruggedness, but now for the entire data set:

R code 7.3
m7.3 <- map(
    alist(
        log_gdp ~ dnorm( mu , sigma ) ,
        mu <- a + br*rugged ,
        a ~ dnorm( 8 , 100 ) ,
        br ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=dd ,
    start=list( a=mean(log(dd$rgdppc_2000)) , br=0 ,
        sigma=sd(log(dd$rgdppc_2000)) ) )


The second model adds the dummy variable cont_africa:

m7.4 <- map(
    alist(
        log_gdp ~ dnorm( mu , sigma ) ,
        mu <- a + br*rugged + bA*cont_africa ,
        a ~ dnorm( 8 , 100 ) ,
        br ~ dnorm( 0 , 1 ) ,
        bA ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=dd ,
    start=list( a=mean(log(dd$rgdppc_2000)) , br=0 , bA=0 ,
        sigma=sd(log(dd$rgdppc_2000)) ) )

Now to compare these models, using DIC:

R code 7.5
compare( m7.3 , m7.4 )

        DIC   pD  dDIC weight
m7.4 475.66 3.98  0.00      1
m7.3 539.90 3.05 64.24      0

This is a case in which it will be safe to ignore the lower ranked model, as m7.4 gets all the model weight. Notice also that DIC has done a pretty superb job of estimating the number of parameters in each model, 4 and 3. But recall that if the priors were stronger, either model might end up with pD less than the number of parameters.

Now let's plot the posterior predictions for m7.4, so you can see how, despite its plausible superiority to m7.3, it still doesn't manage different slopes inside and outside of Africa. To sample from the posterior and compute the predicted means and intervals for both African and non-African nations:

rugged.seq <- seq( from=-1 , to=8 , by=0.25 )

post <- sample.qa.posterior( m7.4 , n=1e4 )

# expected log-GDP across ruggedness, inside and outside Africa
mu.Africa <- sapply( rugged.seq , function(r)
    post$a + post$br*r + post$bA*1 )
mu.NotAfrica <- sapply( rugged.seq , function(r)
    post$a + post$br*r + post$bA*0 )

# means and 95% percentile intervals
mu.Africa.mean <- apply( mu.Africa , 2 , mean )
mu.Africa.PI <- apply( mu.Africa , 2 , quantile , probs=c(0.025,0.975) )
mu.NotAfrica.mean <- apply( mu.NotAfrica , 2 , mean )
mu.NotAfrica.PI <- apply( mu.NotAfrica , 2 , quantile , probs=c(0.025,0.975) )


FIGURE 7.3. Including a dummy variable for African nations has no effect on the slope. African nations are shown in blue. Non-African nations are shown in gray. Regression means for each subset of nations are shown in corresponding colors, along with 95% percentile intervals shown by shading.

The model just fit, m7.4, is:

y_i ∼ Normal(µ_i, σ)
µ_i = α + β_r r_i + β_A A_i

where y is log(rgdppc_2000), A is cont_africa, and r is rugged. As you've done since Chapter 4, the linear model is built by replacing the parameter µ in the top line, the likelihood, with a linear equation that is a function of data and new parameters such as α and β.

We'll build interactions by extending this strategy. Now you want to allow the relationship between y and r to vary as a function of A. Within the model, this relationship is measured by the slope β_r. Following the same strategy of replacing parameters with linear models, the most straightforward way to make β_r depend upon A is just to define the slope β_r as a linear model itself, one that includes A. This approach results in this likelihood (you'll also need priors, which we'll add in a bit):

y_i ∼ Normal(µ_i, σ)            [likelihood]
µ_i = α + γ_i r_i + β_A A_i     [linear model of µ]
γ_i = β_r + β_Ar A_i            [linear model of slope]

This is the first model with two linear models in it, but its structure is the same as every Gaussian model you've already fit in this book. So you don't need to learn any new tricks for fitting this model to data. The tricks lie entirely in interpreting it.

The first line above is the same Gaussian likelihood you've been using since Chapter 4. The second line is the same kind of additive definition of µ_i that you've seen many times. The third line is the new bit. The new symbol γ_i is just a placeholder for the linear function that defines the slope between GDP and ruggedness. We use "gamma" (γ) here, because it follows "beta" (β) in the Greek alphabet. The equation for γ_i defines the interaction between ruggedness and African nations. It is a linear interaction effect, because the equation for γ_i is a familiar linear model.

By defining the relationship between GDP and ruggedness in this way, you are explicitly modeling the hypothesis that the slope between GDP and ruggedness depends—is conditional—upon whether or not a nation is in Africa. The parameter β_Ar defines the strength of this dependency. If you set β_Ar = 0, then you get the previous model back. If instead β_Ar > 0, then African nations have a more positive slope between GDP and ruggedness. If β_Ar < 0, African nations have a more negative slope. For any nation not in Africa, A_i = 0 and so the interaction parameter β_Ar has no effect on prediction for that nation. Of course you are going to compute the posterior distribution for β_Ar from the data. But once you


have the posterior distribution, only an understanding of where the parameter fits into your model will allow you to interpret it.

To fit this new model, you can just use map as before. Here's the code to fit the model that includes an interaction between ruggedness and being in Africa:

m7.5 <- map(
    alist(
        log_gdp ~ dnorm( mu , sigma ) ,
        mu <- a + gamma*rugged + bA*cont_africa ,
        gamma <- br + bAr*cont_africa ,
        a ~ dnorm( 8 , 100 ) ,
        bA ~ dnorm( 0 , 1 ) ,
        br ~ dnorm( 0 , 1 ) ,
        bAr ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 10 )
    ) ,
    data=dd )


FIGURE 7.4. Posterior predictions for the terrain ruggedness model, including the interaction between Africa and ruggedness.

post <- sample.qa.posterior( m7.5 , n=1e4 )

# expected log-GDP across ruggedness, for African nations
mu.Africa <- sapply( rugged.seq , function(r)
    post$a + (post$br + post$bAr*1)*r + post$bA*1 )
mu.Africa.mean <- apply( mu.Africa , 2 , mean )
mu.Africa.PI <- apply( mu.Africa , 2 , quantile , probs=c(0.025,0.975) )

# and for non-African nations
mu.NotAfrica <- sapply( rugged.seq , function(r)
    post$a + post$br*r )
mu.NotAfrica.mean <- apply( mu.NotAfrica , 2 , mean )
mu.NotAfrica.PI <- apply( mu.NotAfrica , 2 , quantile , probs=c(0.025,0.975) )


Plotting these calculations is just like previous examples. Here's the code:

# plot African nations with regression
d.A1 <- dd[ dd$cont_africa==1 , ]
plot( log(rgdppc_2000) ~ rugged , data=d.A1 , col=rangi2 ,
    ylab="log GDP year 2000" , xlab="Terrain Ruggedness Index" )
mtext( "African nations" , 3 )
lines( rugged.seq , mu.Africa.mean , col=rangi2 )
lines( rugged.seq , mu.Africa.PI[1,] , lty=2 , col=rangi2 )
lines( rugged.seq , mu.Africa.PI[2,] , lty=2 , col=rangi2 )

# plot non-African nations with regression
d.A0 <- dd[ dd$cont_africa==0 , ]
plot( log(rgdppc_2000) ~ rugged , data=d.A0 ,
    ylab="log GDP year 2000" , xlab="Terrain Ruggedness Index" )
mtext( "Non-African nations" , 3 )
lines( rugged.seq , mu.NotAfrica.mean )
lines( rugged.seq , mu.NotAfrica.PI[1,] , lty=2 )
lines( rugged.seq , mu.NotAfrica.PI[2,] , lty=2 )


Interaction models ruin this paradise, however. Look at the interaction likelihood again:

y_i ∼ Normal(µ_i, σ)            [likelihood]
µ_i = α + γ_i r_i + β_A A_i     [linear model of µ]
γ_i = β_r + β_Ar A_i            [linear model of slope]

Now the change in µ_i that results from a unit change in r_i is given by γ_i. And since γ_i is a function of three things—β_r, β_Ar, and A_i—we have to know all three in order to know the influence of r_i on the outcome. The only time the slope β_r has its old meaning is when A_i = 0, which makes γ_i = β_r. Otherwise, to compute the influence of r_i on the outcome, we have to simultaneously consider two parameters and another predictor variable.

The practical implication of this fact is that you can no longer read the influence of either predictor from the table of estimates. Here are the parameter estimates:

R code 7.12
precis(m7.5)

        Mean StdDev  2.5% 97.5%
a       9.18   0.14  8.92  9.45
br     -0.18   0.08 -0.33 -0.04
bA     -1.85   0.22 -2.27 -1.42
bAr     0.35   0.13  0.10  0.60
sigma   0.93   0.05  0.83  1.03

Since γ (gamma) doesn't appear in this table—it wasn't estimated—we have to compute it ourselves. It's easy enough to do that at the MAP values (posterior means). For example, the MAP slope relating ruggedness to log-GDP within Africa is:

γ = β_r + β_Ar(1) = −0.18 + 0.35 = 0.17

And outside of Africa:

γ = β_r + β_Ar(0) = −0.18

So the relationship between ruggedness and log-GDP is essentially reversed inside and outside of Africa.

7.1.4.2. Incorporating uncertainty. But that's only at the MAP values. To get some idea of the uncertainty around those γ values, we'll need to use the whole posterior. Since γ depends upon parameters, and those parameters have a posterior distribution, γ must also have a posterior distribution.

To compute the posterior distribution of γ, you could do some integral calculus, or you could just process the samples from the posterior:

R code 7.13
post <- sample.qa.posterior( m7.5 , n=1e4 )
gamma.Africa <- post$br + post$bAr*1
gamma.notAfrica <- post$br + post$bAr*0


R code 7.14
mean( gamma.Africa )
mean( gamma.notAfrica )

[1] 0.1631653
[1] -0.1824547

Nearly identical to the MAP values.

But now we also have full distributions of the slopes within and outside of Africa. Let's plot them on the same axis, so we can see their overlap clearly:

R code 7.15
dens( gamma.Africa , xlim=c(-0.5,0.6) , ylim=c(0,5.5) ,
    xlab="gamma" , col=rangi2 )
dens( gamma.notAfrica , add=TRUE )

FIGURE 7.5. Posterior distributions of the slope relating terrain ruggedness to log-GDP. Blue: African nations. Black: non-African nations.

And this plot appears as FIGURE 7.5. The posterior distribution of the slope within Africa is shown in blue. The posterior distribution outside of Africa is shown in black.

From here, you could use the samples to ask a bunch of questions, depending upon your interest. For example, what's the probability (according to this model and these data) that the slope within Africa is less than the slope outside of Africa? All we need to do is compute the difference between the slopes, for each sample from the posterior, and then ask what proportion of these differences is below zero. This code will do it:

diff <- gamma.Africa - gamma.notAfrica
sum( diff < 0 ) / length( diff )


this model and these data, it's highly implausible that the slope associating ruggedness with log-GDP is lower inside Africa than outside it.

Rethinking: More on the meaning of posterior probability. Keep in mind that this number, 0.0036, is not the probability of any observable event. It is not, for example, the frequency with which we should expect to observe a γ within Africa below the γ outside of Africa, in repeat sampling of the world. Instead, 0.0036 is the relative plausibility that your golem assigns to the question you asked of it, and only for the data you presented to it. So the golem is telling you that it is very, very skeptical of the notion that γ within Africa is lower than γ outside of Africa. How skeptical? Of all the possible states of the world it knows about, only a proportion 0.0036 of them are consistent with both the data and the claim that γ in Africa is less than γ outside Africa. Your golem is skeptical, but it's usually a good idea for you to remain skeptical of your golem.

7.2. Symmetry of the linear interaction

Buridan's ass is a toy philosophical problem in which an ass who always moves towards the closest pile of food will starve to death when he finds himself equidistant between two identical piles. The basic problem is one of symmetry: how can the ass decide between two identical options? Like many toy problems, you can't take this one too seriously. Of course the ass will not starve. But thinking about how the symmetry is broken can be productive.

Interactions are like Buridan's ass. Like the two piles of identical food, the linear interaction contains two symmetrical interpretations. Absent some other information, outside the model, there's no logical basis for preferring one over the other. Consider for example the GDP and terrain ruggedness problem. The interaction there has two equally valid phrasings.

(1) How much does the influence of ruggedness (on GDP) depend upon whether the nation is in Africa?
(2) How much does the influence of being in Africa (on GDP) depend upon ruggedness?

While these two possibilities sound different to most humans, your golem thinks they are identical.

In this section, we'll examine this fact analytically. Then we'll plot the ruggedness and GDP example again, but with the reverse phrasing—the influence of Africa depends upon ruggedness.

7.2.1. Buridan's interaction. Consider yet again the mathematical form of the likelihood:

y_i ∼ Normal(µ_i, σ)            [likelihood]
µ_i = α + γ_i r_i + β_A A_i     [linear model of µ]
γ_i = β_r + β_Ar A_i            [linear model of slope]

Let's expand γ_i into the expression for µ_i:

µ_i = α + (β_r + β_Ar A_i) r_i + β_A A_i
    = α + β_r r_i + β_Ar A_i r_i + β_A A_i

Now factor together the terms with A_i in them:

µ_i = α + β_r r_i + (β_A + β_Ar r_i) A_i

Call the term in parentheses G.


The term labeled G above looks a lot like γ_i in the original form. This is the same model of µ_i, but re-expressed so that the linear interaction applies to A_i.

The point of the algebra above is to prove that linear interactions are symmetric, just like the choice facing Buridan's ass. Within the model, there's no basis to prefer one interpretation over the other, because in fact they are the same interpretation. But when we reason causally about models, our minds tend to prefer one interpretation over the other.

7.2.2. Africa depends upon ruggedness. It'll be helpful to plot the reverse interpretation: the influence of being in Africa depends upon terrain ruggedness. The computation steps are very similar to what we did in the previous section. But now we will display cont_africa on the horizontal axis, while using different lines for different values of rugged.

# get minimum and maximum rugged values
q.rugged <- range( dd$rugged )

# expected log-GDP for both continents, at minimum and maximum ruggedness
mu.r.lo <- sapply( 0:1 , function(A)
    post$a + (post$br + post$bAr*A)*q.rugged[1] + post$bA*A )
mu.r.hi <- sapply( 0:1 , function(A)
    post$a + (post$br + post$bAr*A)*q.rugged[2] + post$bA*A )

# plot nations, coloring those above median ruggedness blue
med.r <- median(dd$rugged)
plot( dd$cont_africa , log(dd$rgdppc_2000) ,
    col=ifelse( dd$rugged > med.r , rangi2 , "black" ) ,
    xlim=c(-0.25,1.25) , xaxt="n" ,
    ylab="log GDP year 2000" , xlab="Continent" )
axis( 1 , at=c(0,1) , labels=c("other","Africa") )

# dashed line: minimum ruggedness; blue line: maximum ruggedness
lines( 0:1 , apply( mu.r.lo , 2 , mean ) , lty=2 )
lines( 0:1 , apply( mu.r.hi , 2 , mean ) , col=rangi2 )


FIGURE 7.6. The other side of the interaction between ruggedness and continent. Blue points are nations with above median ruggedness. Black points are below the median. Dashed black line: relationship between continent and log-GDP, for an imaginary nation with minimum observed ruggedness (0.003). Blue line: an imaginary nation with maximum observed ruggedness (6.2).

At the maximum observed ruggedness, in contrast, being in Africa has essentially no effect—the line does slope upwards a tiny amount, but the confidence bounds should prevent us from getting excited about that fact. For a nation with very high ruggedness, there is almost no negative effect on GDP of being in Africa.

This perspective on the GDP and terrain ruggedness data is completely consistent with the previous perspective. It's simultaneously true in these data (and with this model) that (1) the influence of ruggedness depends upon continent and (2) the influence of continent depends upon ruggedness. Indeed, something is gained by looking at the data in this symmetrical perspective. Just inspecting the first view of the interaction, back on page 224, it's not obvious that African nations are on average nearly always worse off. It's just at very high values of rugged that nations inside and outside of Africa have the same expected log-GDP. This second way of plotting the interaction makes it clear that terrain ruggedness doesn't quite help African nations. Perhaps it just stops them from being hurt so badly.

7.3. Continuous interactions

The main point I want to convince the reader of is that interaction effects are difficult to interpret. They are nearly impossible to interpret, using only posterior means and standard deviations. We've already discussed two reasons for this (page 225). A third reason to be wary of using only tables of numbers to interpret interactions is that interactions among continuous variables are especially opaque. It's one thing to make a slope conditional upon a category, as in the previous example of ruggedness and being in Africa. In such a context, the model reduces to estimating a different slope for each category. But it's quite a lot harder to understand that a slope varies in a continuous fashion with a continuous variable. Interpretation is much harder in this case, even though the mathematics of the model are essentially the same as in the categorical case.

In pursuit of clarifying the construction and interpretation of continuous interactions among two or more continuous predictor variables, in this section I develop a simple regression example and show you a way to plot the two-way interaction between two continuous variables. The method I present for plotting this interaction is a triptych plot, a panel of three complementary figures that comprise a whole picture of the regression results. There's nothing magic about having three figures—in other cases you might want more or fewer. Instead, the utility lies in making multiple figures that allow one to see how the interaction alters a slope, across changes in a chosen variable.


This example will also allow me to illustrate two benefits of CENTERING prediction variables. You met centering already (page 111). When a prediction variable is "centered," it is just rescaled so that its mean is zero. Why would you ever do that to the data? There are two common benefits. First, centering the prediction variables can make it much easier to lean on the coefficients alone in understanding the model, especially when you want to compare the estimates from models with and without an interaction. Second, sometimes model fitting has a hard time with uncentered variables. Centering (and possibly also standardizing) the data before fitting the model can help you achieve a faster and more reliable set of estimates. Still, even with everything centered and double-checked, it can be hard or impossible to really understand continuous interactions from numbers alone. So the message here is to always plot posterior predictions, counterfactual or not, in order to avoid misunderstanding the model fit.

7.3.1. The data. The data in this example are sizes of blooms from beds of tulips grown in greenhouses, under different soil and light conditions. Load the data with:

library(rethinking)
data(tulips)
d <- tulips


The main effect likelihood is (we'll add priors later):

B_i ∼ Normal(µ_i, σ)
µ_i = α + β_w w_i + β_s s_i

And the full interaction likelihood is:

B_i ∼ Normal(µ_i, σ)
µ_i = α + β_w w_i + β_s s_i + β_ws w_i s_i

where B_i is the value of blooms on row i, w_i is the value of water, and s_i is the value of shade. I'm leaving the categorical variable bed out of this analysis, but I actually think a sincere analysis requires it, and in a later chapter, we'll come back and add bed to the analysis. The points I wish to make right now don't depend upon it, however.

Fitting the models with map is just as you might expect. I'll use practically flat priors here, so we get results nearly identical to typical maximum likelihood inference. This isn't to imply that this is necessarily the best thing to do. Instead, it's to present another example of how to do it.

R code 7.19
m7.6 <- map(
    alist(
        blooms ~ dnorm( mu , sigma ) ,
        mu <- a + bw*water + bs*shade ,
        a ~ dnorm( 0 , 100 ) ,
        bw ~ dnorm( 0 , 100 ) ,
        bs ~ dnorm( 0 , 100 ) ,
        sigma ~ dunif( 0 , 100 )
    ) ,
    data=d )

m7.7 <- map(
    alist(
        blooms ~ dnorm( mu , sigma ) ,
        mu <- a + bw*water + bs*shade + bws*water*shade ,
        a ~ dnorm( 0 , 100 ) ,
        bw ~ dnorm( 0 , 100 ) ,
        bs ~ dnorm( 0 , 100 ) ,
        bws ~ dnorm( 0 , 100 ) ,
        sigma ~ dunif( 0 , 100 )
    ) ,
    data=d )

If you run this code, you will likely see warnings that the optimizer reached its maximum number of iterations before converging.


It's easy enough to fix the problem, though. Let's fix it, then I'll explain why it happened. There are two basic solutions.

(1) We can use another method of optimization. There are many different ways to climb the posterior density. R's optim knows several, because no method always works. The default for map is called BFGS. Two others to try, when you have trouble, are called Nelder-Mead and (when all else fails) SANN. Tell map to use another method by adding the parameter method to the call. There will be an example below.

(2) We can tell optim to search longer, so it doesn't reach its maximum iterations. You can tell map to allow a larger number of iterations by passing it a control list. There will be an example below.

I'll go ahead and use both solutions, which is more than we need, just so you can see how they are implemented.

m7.6 <- map(
    alist(
        blooms ~ dnorm( mu , sigma ) ,
        mu <- a + bw*water + bs*shade ,
        a ~ dnorm( 0 , 100 ) ,
        bw ~ dnorm( 0 , 100 ) ,
        bs ~ dnorm( 0 , 100 ) ,
        sigma ~ dunif( 0 , 100 )
    ) ,
    data=d ,
    method="Nelder-Mead" ,
    control=list(maxit=1e4) )

m7.7 <- map(
    alist(
        blooms ~ dnorm( mu , sigma ) ,
        mu <- a + bw*water + bs*shade + bws*water*shade ,
        a ~ dnorm( 0 , 100 ) ,
        bw ~ dnorm( 0 , 100 ) ,
        bs ~ dnorm( 0 , 100 ) ,
        bws ~ dnorm( 0 , 100 ) ,
        sigma ~ dunif( 0 , 100 )
    ) ,
    data=d ,
    method="Nelder-Mead" ,
    control=list(maxit=1e4) )

Inspect the estimates side by side with coeftab(m7.6,m7.7):


        m7.6   m7.7
bs    -38.91  34.94
sigma  57.34  46.23
bws       NA -39.50
nobs      27     27

Now consider these estimates and try to figure out what the models are telling us about the influence of water and shade on the blooms. First, consider the intercepts a, α. The estimate of the intercept changes a lot from one model to the next, from 53 to −84. What do these values mean? Remember, the intercept is the expected value of the outcome when all of the predictor variables take the value zero. In this case, neither of the predictor variables ever takes the value zero within the data. As a result, these intercept estimates are very hard to interpret. This is a very common issue, and most of the regression examples so far in this book have suffered from it.

Now, consider the slope parameters. In the main-effect-only model, m7.6, the MAP value for the main effect of water is positive and the main effect for shade is negative. Take a look at the standard deviations and intervals in precis(m7.6) to verify that both posterior distributions are reliably on one side of zero. You might infer that these posterior distributions suggest that water increases blooms while shade reduces them. For every additional level of soil moisture, blooms increase by 76, on average. For every additional unit of shade, blooms decrease by 42, on average. Those sound reasonable.

But the analogous posterior distributions from the interaction model, m7.7, are quite different. First, assure yourself that the interaction model is indeed a much better model:

R code 7.22
compare( m7.6 , m7.7 )

        DIC   pD  dDIC weight
m7.7 293.41 4.87  0.00   0.99
m7.6 303.88 4.29 10.47   0.01

This is a slam dunk for m7.7. So let's seriously consider the posterior distribution from m7.7. Now both main effects are positive, but the new interaction posterior mean is negative. Are you to conclude now that the main effect of shade is to help the tulips? And the negative interaction itself implies that as shade increases, water has a reduced impact on blooms. But reduced by how much?

Sampling from the posterior now and plotting the model's predictions would help immensely with interpretation. And that's the course I want to encourage and provide code for. In general, I don't think it's ever safe to interpret interactions without plotting them. But for the moment, let's instead look at the value of centering the predictor variables and re-estimating the models.

7.3.3. Center and re-estimate. To center a variable means to create a new variable that contains the same information as the original, but has a new mean of zero. For example, to make centered versions of shade and water, just subtract the mean of the original from each value:

R code 7.23
d$shade.c <- d$shade - mean(d$shade)
d$water.c <- d$water - mean(d$water)
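A quick check, if you want it, that the centering behaved as expected:

mean( d$shade.c )   # effectively zero, up to floating point error
mean( d$water.c )   # effectively zero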


The new centered variables have the same variance as the originals, but now have a mean of zero. In this case, centering recodes the levels of water and shade so that instead of their ranging from 1 to 3, they now range from −1 to 1.

Centering in this analysis will do two things for us. First, it'll fix our previous problem with maximum iterations. Second, it'll make the estimates easier to interpret. Let's re-estimate the two regression models, but now using the new centered variables shade.c and water.c. I'm also going to remove the optim fixes from before, because now we don't need them.

m7.8 <- map(
    alist(
        blooms ~ dnorm( mu , sigma ) ,
        mu <- a + bw*water.c + bs*shade.c ,
        a ~ dnorm( 0 , 100 ) ,
        bw ~ dnorm( 0 , 100 ) ,
        bs ~ dnorm( 0 , 100 ) ,
        sigma ~ dunif( 0 , 100 )
    ) ,
    data=d ,
    start=list( a=mean(d$blooms) , bw=0 , bs=0 , sigma=sd(d$blooms) ) )

m7.9 <- map(
    alist(
        blooms ~ dnorm( mu , sigma ) ,
        mu <- a + bw*water.c + bs*shade.c + bws*water.c*shade.c ,
        a ~ dnorm( 0 , 100 ) ,
        bw ~ dnorm( 0 , 100 ) ,
        bs ~ dnorm( 0 , 100 ) ,
        bws ~ dnorm( 0 , 100 ) ,
        sigma ~ dunif( 0 , 100 )
    ) ,
    data=d ,
    start=list( a=mean(d$blooms) , bw=0 , bs=0 , bws=0 , sigma=sd(d$blooms) ) )


The primary reason is that, when the predictors are centered, the MAP value for α is just the empirical mean for the outcome, mean(d$blooms). That is also the start value given to map. With un-centered predictors, the MAP value for α lies a long way from the empirical mean. Hence, the long search that failed.

7.3.3.2. Estimates changed less across models. Why did centering the predictor variables result in the main effect posterior means remaining the same across the models with and without the interaction? In the un-centered models, the interaction effect is applied to every case, and so none of the parameters in µ makes sense alone. This is because neither of the predictors in those models, shade and water, is ever zero. As a result the interaction parameter always factors into generating a prediction. Consider for example a tulip at the average moisture and shade levels, 2 in each case. The expected blooms for such a tulip is:

µ_i | (s_i=2, w_i=2) = α + β_w(2) + β_s(2) + β_ws(2 × 2)

So to figure out the effect of increasing water by 1 unit, you have to use all of the β parameters. Plugging in the MAP values for the un-centered interaction model, m7.7, we get:

µ_i | (s_i=2, w_i=2) = −150.8 + 181.5(2) + 64.1(2) − 52.9(2 × 2) = 128.8

You can compute the prediction in R:

R code 7.25
k <- coef(m7.7)
k[1] + k[2]*2 + k[3]*2 + k[4]*2*2
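To see why every β matters here, a small sketch of the change in expected blooms from one additional unit of water, holding shade at 2. The parameter names follow the model definitions above; indexing coef() by name is safer than by position:

k <- coef(m7.7)
mu.w3 <- k["a"] + k["bw"]*3 + k["bs"]*2 + k["bws"]*3*2
mu.w2 <- k["a"] + k["bw"]*2 + k["bs"]*2 + k["bws"]*2*2
mu.w3 - mu.w2   # equals bw + 2*bws, because shade is fixed at 2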


For comparison, here is part of the precis output for the centered interaction model, m7.9:

        Mean StdDev   2.5%  97.5%
bws   -51.87  12.95 -77.25 -26.49
sigma  45.22   6.15  33.17  57.28

And here are justifiable readings of each estimate:

• The estimate a, α, is the expected value of blooms when both water and shade are at their average values. Their average values are both zero (0), because they were centered before fitting the model.

• The estimate bw, β_w, is the expected change in blooms when water increases by one unit and shade is at its average value (of zero). This parameter does not tell you the expected rate of change for any other value of shade. This estimate suggests that when shade is at its average value, increasing water is highly beneficial to blooms.

• The estimate bs, β_s, is the expected change in blooms when shade increases by one unit and water is at its average value (of zero). This parameter does not tell you the expected rate of change for any other value of water. This estimate suggests that when water is at its average value, increasing shade is highly detrimental to blooms.

• The estimate bws, β_ws, is the interaction effect. Like all linear interactions, it can be explained in more than one way. First, the estimate tells us the expected change in the influence of water on blooms when increasing shade by one unit. Second, it tells us the expected change in the influence of shade on blooms when increasing water by one unit.

So why is the interaction estimate, bws, negative? The short answer is that water and shade have opposite effects on blooms, but that each also makes the other more important to the outcome. If you don't see how to read that from the number −52, you are in good company. And that's why the best thing to do is to plot implied predictions.

7.3.4. Plotting implied predictions. Golems (models) have awesome powers of reason, but terrible people skills. The golem provides a posterior density, but for us humans to understand its implications, we need to decode the posterior into something else. Centered predictors or not, plotting posterior predictions always tells you what the golem is thinking, on the scale of the outcome. That's why we've emphasized plotting so much. But in previous chapters, there were no interactions. As a result, when plotting model predictions as a function of any one predictor, you could hold the other predictors constant at any value you liked. So the choice of values for the un-viewed predictor variables hardly mattered.

Now that'll be different. Once there are interactions in a model, the effect of changing a predictor depends upon the values of the other predictors. Maybe the simplest way to go about plotting such interdependency is to make a frame of multiple bivariate plots. In each plot, you choose different values for the un-viewed variables. Then by comparing the plots to one another, you can see how big of a difference the changes make.

Here's how you might accomplish this visualization, for the tulip data. I'm going to make three plots in a single panel. Such a panel of three plots that are meant to be viewed together is a triptych, and triptych plots are very handy for understanding the impact of interactions. First, just draw samples from the quadratic posterior as usual:

post <- sample.qa.posterior( m7.9 , n=1e4 )


Now for the plotting. Here's the strategy. I want each plot to show the bivariate relationship between shade and blooms, as predicted by the model. Each plot will show predictions for a different value of water. For this example, it is easy to pick which values of water to use, because there are only three values: −1, 0, and 1 (this variable was centered, recall). So the first plot will show the predicted relationship between blooms and shade, holding water constant at −1. The second plot will show the same relationship, but now holding water constant at 0. The final plot will hold water constant at 1. In addition, in each plot, I'll show only the raw data with the appropriate water value.

You already know how to produce each plot, using the same kind of sapply code from previous chapters. I'm going to wrap that kind of code in a loop now, and iterate over the three values of water.c. Here's the code:

R code 7.29
# make a plot window with three panels in a single row
par(mfrow=c(1,3)) # 1 row, 3 columns

# loop over values of water.c and plot predictions
for ( w in -1:1 ) {
    dt <- d[ d$water.c==w , ]
    plot( blooms ~ shade.c , data=dt , col=rangi2 ,
        main=paste("water.c =",w) , xlim=c(-1,1) , ylim=c(0,362) ,
        xlab="shade (centered)" )
    mu <- sapply( -1:1 , function(s)
        post$a + post$bw*w + post$bs*s + post$bws*w*s )
    mu.mean <- apply( mu , 2 , mean )
    mu.PI <- apply( mu , 2 , quantile , probs=c(0.025,0.975) )
    lines( -1:1 , mu.mean )
    lines( -1:1 , mu.PI[1,] , lty=2 )
    lines( -1:1 , mu.PI[2,] , lty=2 )
}


FIGURE 7.7. Triptych plot of predicted blooms across water treatments, without (top row) and with (bottom row) an interaction effect. Blue points in each plot are data. The solid line is the posterior mean and the dashed lines give the 95% interval of the mean. Top row: Without the interaction, model m7.8. Each of the three plots of blooms against shade level is for a different water level. The slope of the regression line in each case is exactly the same, because there is no interaction in this model. Bottom row: With the interaction, model m7.9. Now the slope of blooms against shade has a different value in each plot.

At the mean water value, in the middle plot, increasing shade clearly reduces the size of blooms. Going from the least (−1) to the most shade, blooms are approximately halved in size. At the highest water value, on the right, the slope becomes even more negative, and shade is an even stronger predictor of smaller blooms. Going from the least to the most shade, bloom size declines by about two-thirds.

What is going on here? The likely explanation for these results is that tulips need both water and light to produce blooms. At low water levels, shade can't have much of an effect, because the tulips don't have enough water to produce blooms anyway. At higher water levels, shade can matter more, because the tulips have enough water to produce some blooms. At very high water levels, water is no longer limiting the blooms very much, and so shade can have a much more dramatic impact on the outcome. The same explanation works symmetrically for shade. If there isn't enough light, then more water hardly helps. You could remake FIGURE 7.7 with water on the horizontal axis and shade level varied from left to right, if you'd like to visualize the model predictions that way.
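If you want to try that remake, here is one possible sketch, reusing the posterior samples from above, with one panel per (centered) shade level:

par(mfrow=c(1,3))
for ( s in -1:1 ) {
    dt <- d[ d$shade.c==s , ]
    plot( blooms ~ water.c , data=dt , col=rangi2 ,
        main=paste("shade.c =",s) , xlim=c(-1,1) , ylim=c(0,362) ,
        xlab="water (centered)" )
    mu <- sapply( -1:1 , function(w)
        post$a + post$bw*w + post$bs*s + post$bws*w*s )
    lines( -1:1 , apply( mu , 2 , mean ) )
    lines( -1:1 , apply( mu , 2 , quantile , probs=0.025 ) , lty=2 )
    lines( -1:1 , apply( mu , 2 , quantile , probs=0.975 ) , lty=2 )
}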


7.4. Higher-order interactions

[Not sure if I should include an example of a 3-way interaction. This is just a placeholder for now.]

Princeton, NJ 2012 wine tasting event.

Interaction of judge nationality and wine location is bias. Bias depends upon flight. Interaction of judge and flight is preference. Preference depends upon nation. Interaction of nation and flight is alignment(?). Alignment depends upon judge.

R code 7.30
library(rethinking)
data(Wines2012)
d <- Wines2012
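As a sketch of the kind of model these notes describe, a three-way interaction can be written compactly in lm syntax. The column names below (score, judge.amer, wine.amer, flight) are assumptions about Wines2012, not confirmed by the text:

# hypothetical: score as a function of all main effects and
# all two- and three-way interactions among the three factors
m.wine <- lm( score ~ judge.amer * wine.amer * flight , data=d )

The * operator expands to all lower-order terms, so this single formula contains the bias, preference, and alignment interactions sketched above.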




8 Markov Chain Monte Carlo Estimation

For most of human history, chance has been a villain. In classical Roman civilization, chance was personified by Fortuna, goddess of cruel fate, with her ever-spinning wheel of luck. Opposed to her sat Minerva, goddess of wisdom and understanding. Only the desperate would pray to Fortuna, while everyone implored Minerva for aid. Certainly science was the domain of Minerva, a realm with no useful role for Fortuna to play.

But by the beginning of the 20th century, the opposition between Fortuna and Minerva had changed to a collaboration. Scientists, servants of Minerva, began publishing books of random numbers, instruments of chance to be used for learning about the world. Now, chance and wisdom share a cooperative relationship, and few of us are any longer bewildered by the notion that an understanding of chance could help us acquire wisdom. Everything from weather forecasting to finance to evolutionary biology is dominated by the study of stochastic processes.

This chapter introduces one of the more marvelous examples of how Fortuna and Minerva cooperate: the estimation of posterior probability densities using a stochastic process known as Markov chain Monte Carlo (MCMC) estimation. Unlike in every earlier chapter in this book, here we'll produce samples from the joint posterior of a model without maximizing anything. Instead of having to lean on quadratic and other approximations of the shape of the posterior, now we'll be able to sample directly from the posterior without assuming any nice Gaussian, or any other, shape for it.

The cost of this power is that it may take much longer for our estimates to finish, and usually more work is required to specify the model as well. But the benefit is escaping the awkwardness of assuming multivariate normality. Equally important is the ability to directly estimate models, such as the multilevel models of later chapters. Such models are not only frequently intractable with the methods we've used so far in this book, but their posterior distributions are also often non-Gaussian in dramatic ways.

The good news is that tools for building and inspecting MCMC estimates are getting better all the time. In this chapter you'll meet a convenient way to convert the map formulas you've used so far into Markov chains. The engine that makes this possible is Stan (mc-stan.org). Stan's creators describe it as "a probabilistic programming language implementing statistical inference." You won't be working directly in Stan to begin with—the rethinking package provides tools that hide it from you for now. But as you move on to more advanced techniques, you'll be able to quickly generate models you already understand in Stan's language. Then you can tinker with them and gain access to Stan's full power.


FIGURE 8.1. Good King Markov's island kingdom.

Rethinking: Stanislaw Ulam. The Stan programming language is not an abbreviation or acronym. Rather, it is named after Stanislaw Ulam (1909–1984). Ulam is credited as an inventor of Markov chain Monte Carlo. Together with Ed Teller, Ulam applied it to designing fusion bombs. But he and others soon applied the general Monte Carlo method to diverse problems of a less monstrous nature. Ulam made important contributions in pure mathematics, chaos theory, and molecular and theoretical biology, as well.

8.1. Good King Markov and His island kingdom

For the moment, forget about posterior densities and MCMC. Consider instead the tale of Good King Markov. King Markov was a benevolent autocrat of an island kingdom, a circular archipelago, with 10 islands. Each island was neighbored by two others, and the entire archipelago formed a ring. The islands were of different sizes, and so had different sized populations living on them. The second island was about twice as populous as the first, the third about three times as populous as the first, and so on, up to the largest island, which was 10 times as populous as the smallest. The good king's island kingdom is displayed in FIGURE 8.1, with the islands numbered by their relative population sizes.

The Good King was an autocrat, but he did have a number of obligations to His people. Among these obligations, King Markov agreed to visit each island in His kingdom from time to time. Since the people love their king, each island would prefer that he visit them more often. And so everyone agreed that the king should visit each island in proportion to its population size, visiting the largest island 10 times as often as the smallest, for example.

The Good King Markov, however, wasn't one for schedules or bookkeeping, and so he wanted a way to fulfill his obligation without planning his travels months ahead of time. Also, since the archipelago was a ring, the King insisted that he only move among adjacent islands, to minimize time spent on the water—like many citizens of his kingdom, the king believed there were sea monsters in the middle of the archipelago.

The king's advisor, a Mr Metropolis, engineered a clever solution to these demands. We'll call this solution the Metropolis algorithm. Here's how it works.


(1) Wherever the King is, each week he decides between staying put for another week or moving to one of the two adjacent islands. To decide his next move, he flips a coin.

(2) If the coin turns up heads, the King considers moving to the adjacent island clockwise around the archipelago. If the coin turns up tails, he considers instead moving counter-clockwise. Call the island the coin nominates the proposal island.

(3) Now, to see whether or not he moves to the proposal island, King Markov counts out a number of seashells equal to the relative population size of the proposal island. So for example, if the proposal island is number 9, then he counts out 9 seashells. Then he also counts out a number of stones equal to the relative population of the current island. So for example, if the current island is number 10, then King Markov ends up holding 10 stones, in addition to the 9 seashells.

(4) When there are more seashells than stones, King Markov always moves to the proposal island. But if there are fewer shells than stones, he discards a number of stones equal to the number of shells. So for example, if there are 4 shells and 6 stones, he ends up with 4 shells and 6 − 4 = 2 stones. Then he places the shells and the remaining stones in a bag. He reaches in and randomly pulls out one object. If it is a shell, he moves to the proposal island. Otherwise, he stays put another week. As a result, the probability that he moves is equal to the number of shells divided by the original number of stones.

This procedure may seem baroque and, honestly, a bit crazy. But it does work. The king will appear to move around the islands randomly, sometimes staying on one island for weeks, other times bouncing around without apparent pattern. But in the long run, this procedure guarantees that the king will be found on each island in proportion to its population size.

You can prove this to yourself, by simulating King Markov's journey. Here's a short piece of code to do this, storing the history of the king's island positions in the vector positions:

num.weeks <- 1e4
positions <- rep( 0 , num.weeks )
current <- 10
for ( i in 1:num.weeks ) {
    # record current position
    positions[i] <- current

    # flip coin to generate proposal
    proposal <- current + sample( c(-1,1) , size=1 )
    # make sure he loops around the archipelago
    if ( proposal < 1 ) proposal <- 10
    if ( proposal > 10 ) proposal <- 1

    # move?
    prob.move <- proposal/current
    current <- ifelse( runif(1) < prob.move , proposal , current )
}
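To reproduce plots in the spirit of FIGURE 8.2 from this simulation, two base R commands are enough:

plot( positions[1:200] , type="o" , xlab="week" , ylab="island" )
plot( table(positions) , xlab="island" , ylab="number of weeks" )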


FIGURE 8.2. Results of the king following the Metropolis algorithm, after 10000 weeks. The lefthand plot shows the king's position (vertical axis) across weeks (horizontal axis). In any particular week, it's nearly impossible to say where the king will be. The righthand plot shows the long-run behavior of the algorithm, as the time spent on each island turns out to be proportional to its population size.

The first lines of the simulation code above define the number of weeks to simulate and set the king's starting position (the biggest island, number 10). Then the for loop steps through the weeks. Each week, it records the king's current position. Then it simulates a coin flip to nominate a proposal island. The only trick here lies in making sure that a proposal of "11" loops around to island 1 and a proposal of "0" loops around to island 10. Finally, a random number between zero and one is generated (runif(1)), and the king moves, if this random number is less than the ratio of the proposal island's population to the current island's population (proposal/current).

You can see the results of this simulation in FIGURE 8.2. The lefthand plot shows the king's location across the first 200 weeks of his simulated travels. As you move from the left to the right in this plot, the points show the king's location through time. The king travels among islands, or sometimes stays in place for a few weeks. This plot demonstrates the seemingly pointless path the Metropolis algorithm sends the king on. The righthand plot shows that the path is far from pointless, however. The horizontal axis is now islands (and their relative populations), while the vertical is the number of weeks the king is found on each. After the entire 10-thousand weeks of the simulation, you can see that the proportion of time spent on each island converges to be almost exactly proportional to the relative populations of the islands.

The algorithm will still work in this way, even if we allow the king to be equally likely to propose a move to any island from any island, not just among neighbors. As long as King Markov still uses the ratio of the proposal island's population to the current island's population as his probability of moving, in the long run, he will spend the right amount of time on each island. The algorithm would also work for any size archipelago, even if the king didn't know how many islands were in it. All he needs to know at any point in time is the population of the current island and the population of the proposal island. Then, without any


forward planning or backwards record keeping, King Markov can satisfy his royal obligation to visit his people proportionally.

8.2. Markov chain Monte Carlo

The precise algorithm King Markov used is a special case of the general METROPOLIS ALGORITHM from the real world. 98 And this algorithm is an example of Markov chain Monte Carlo. In real applications, the goal is of course not to help an autocrat schedule his journeys, but instead to draw samples from an unknown and usually complex target distribution, like a posterior probability density.

• The "islands" in our objective are parameter values, and they need not be discrete, but can instead take on a continuous range of values as usual.
• The "population sizes" in our objective are the posterior probabilities at each parameter value.
• The "weeks" in our objective are samples taken from the joint posterior of the parameters in the model.

Provided the way we choose our proposed parameter values at each step is symmetric—so that there is an equal chance of proposing from A to B and from B to A—then the Metropolis algorithm will eventually give us a collection of samples from the joint posterior. We can then use these samples just like all the samples you've already used in this book.

The Metropolis algorithm is the grandparent of several different strategies for getting samples from unknown posterior distributions. In the remainder of this section, I briefly explain the concepts behind two of the most important in contemporary Bayesian inference: Gibbs sampling and Hamiltonian (aka Hybrid, aka HMC) Monte Carlo. Then we'll turn to using Hamiltonian Monte Carlo to do some new things with linear regression models.

8.2.1. Gibbs sampling. The Metropolis algorithm works whenever the probability of proposing a jump to B from A is equal to the probability of proposing A from B—that is, when the proposal distribution is symmetric. There is a more general method, known as Metropolis-Hastings 99, that allows asymmetric proposals. This would mean, in the context of King Markov's fable, that the King's coin was biased to lead him clockwise on average.

Why would we want an algorithm that allows asymmetric proposals? One reason is that it makes it easier to handle parameters, like standard deviations, that have boundaries at zero. A better reason, however, is that it allows us to generate savvy proposals that explore the posterior density more efficiently. By "more efficiently," I mean that we can acquire an equally good image of the posterior distribution in fewer steps.

The most common way to generate savvy proposals is a technique known as GIBBS SAMPLING. 100 Gibbs sampling is a variant of the Metropolis-Hastings algorithm that uses clever proposals and is therefore more efficient. By "efficient," I mean that you can get a good estimate of the posterior from Gibbs sampling with many fewer samples than a comparable Metropolis approach. The improvement arises from adaptive proposals, in which the distribution of proposed parameter values adjusts itself intelligently, depending upon the parameter values at the moment.

First, let's describe the problem a little better. Then I'll explain how Gibbs sampling works in detail. A major problem with basic Metropolis-Hastings (M-H), which you can think of as King Markov's system, is that it can take a long time to adequately explore the posterior density. In principle, King Markov fulfills his promise to his people, but he might be dead before he visits the smaller islands again. The primary reason it can take so long to get around


the parameter values is that the proposal distribution in basic M-H knows nothing about the target distribution.

To understand this, let's focus on the size of the changes in parameter value that are allowed. In King Markov's fable, he could only move to neighboring islands. But what if he could move two or three islands on each side? Let's call the distance of proposals from the current parameter value the step size. If the step size is too large, then many proposals will be rejected (the King won't move), because they will jump too far outside the parameter values with high posterior probability. If the step size is instead too small, then most proposals will be accepted, but it will take a very long time for us to explore the tails of the posterior distribution. If you set the step size just right, then you can trade off one of these concerns for the other.

But even if we optimize the step size, the algorithm will still be inefficient, because what we really want is to take big steps when we are far from the center of the posterior distribution and small steps when we are close to the center. In other words, we'd like the proposals to be a function of how close we are to the target. Metropolis-Hastings, in its usual form, has no way to do this. If we could make the proposal distribution a function of the current parameter values, then we could improve the efficiency of the Markov chain.

To work towards one solution to this problem, let's consider the final samples from a Metropolis chain, one that has given us a good picture of the target posterior density. Suppose for example we are estimating the mean µ and standard deviation σ for 20 adult heights from the !Kung data you met in Chapter 4. I display such a posterior in FIGURE 8.3. The lefthand plot shows samples from the posterior for both parameters, or Pr(µ, σ|D). The vertical dashed line locates a particular slice through the posterior, at µ = 150. The righthand plot shows the profile of the posterior at this slice. What you are seeing is the posterior density of σ, where µ = 150. Another way to say this is that you are viewing Pr(σ|D, µ).

FIGURE 8.3. Samples from the posterior for 20 adult !Kung heights. The vertical dashed line on the left shows the location of the slice through the posterior of σ shown on the right. These slices are posterior densities conditional on the value of the other parameter, µ in this case. Writing analytical expressions for these slices makes Gibbs sampling possible.


Now, the interesting thing about this kind of slice through the full posterior is that, even when we cannot compute an analytical expression for the full density Pr(µ, σ|D), we often can compute an analytical expression for Pr(σ|µ, D) (or for Pr(µ|σ, D)). Treating one of the parameters as a constant, like the data D, makes the math easier. In the case of a Gaussian outcome, it is possible to write down analytical expressions for Pr(σ|µ, D) and for Pr(µ|σ, D).

Why might we want to do this? In order to generate intelligent proposals, so that the Markov chain will be more efficient. Think of it this way. If the Markov chain is at µ = 150, then the target for σ is the righthand plot in FIGURE 8.3. So if we have a formula to give us the shape of the slice in FIGURE 8.3, then we can use that function to produce random proposal steps that will automatically be smaller, when the chain is close to the center of the target, and automatically be larger, when the chain is far from the center of the target. So if we sample out proposal values for both σ and µ from Pr(σ|µ, D) and Pr(µ|σ, D), respectively, then the Markov chain will converge much faster and we'll need fewer samples overall to get a good picture of the full target, Pr(µ, σ|D), the joint posterior.

The catch is that, in order to get an analytical expression for these slices, we'll have to make an explicit choice of the prior distribution for each parameter. Flat priors will no longer do. Indeed, we must choose a prior distribution that is conjugate with the likelihood function. So-called conjugate pairs are matching distributions for priors and likelihoods that produce a posterior with the same family as the prior. For example, the likelihood function in this case is Gaussian. For σ, the standard deviation of the Gaussian, it turns out that we want to use an inverse-gamma distribution (you'll meet gamma distributions in a later chapter). The analogous conjugate prior for the mean µ is a normal distribution.

All of this means we can rewrite our Markov chain to generate proposals from the analytical expressions for the slices through the joint posterior. Not only does this vastly accelerate how quickly the Markov chain gets into the high-probability region of parameter values, but it also means that Gibbs sampling will always accept any proposal. No rejected proposals means more efficient search of the target distribution. The King never stands still.

Alright, so let's compare the efficiency of the Gibbs algorithm above to the analogous Metropolis-Hastings algorithm that uses fixed proposal distributions. You have the Metropolis-Hastings code, in the previous section. The first thing to compare is how long each approach takes to finish 100-thousand samples. On my 2.66 GHz desktop, the Gibbs chain took 5 seconds, while the plain Metropolis-Hastings took about 17 seconds. That's about 19956 samples per second for the Gibbs chain, and 5974 for the Metropolis-Hastings chain. Now, these models are so simple that the improvement hardly matters. But at this relative rate of improvement, a Metropolis chain that takes an hour to finish can be made to finish in about 18 minutes, once converted to Gibbs sampling.

Not only will the Gibbs chain execute faster, but it'll also explore the parameter space better. We can compare how quickly the two approaches get into the high probability region of the target distribution. In FIGURE 8.4, I display the first 500 samples from both chains, with Gibbs sampling on the left and fixed-step Metropolis-Hastings on the right.
Both chains started at the same parameter values, but Gibbs sampling approached the target region in exactly two giant steps. Metropolis-Hastings, in contrast, took hundreds of steps to arrive in the same region. In the end, both algorithms produced the same approximate estimate of the joint posterior. But Gibbs did so much more efficiently.

Gibbs sampling is the basis of the powerful and widely used software BUGS (Bayesian inference Using Gibbs Sampling) and JAGS (Just Another Gibbs Sampler). BUGS and JAGS


allow you to define models in much the same style as the map models you've been using so far in this book. If you use conjugate priors, BUGS and JAGS models can sample very efficiently. If you don't use conjugate priors, they fall back on Metropolis sampling.

FIGURE 8.4. The first 500 samples from Gibbs sampling (left) and fixed-step Metropolis-Hastings (right). The black plus signs in each plot locate the target high density region. Gibbs sampling gets nearby in exactly two steps. [Each panel plots mu against sigma.]

But there are some practical limitations to Gibbs sampling. First, maybe we don't want to use conjugate priors. Some conjugate priors seem silly, and choosing a prior so that the model fits efficiently isn't really a strong argument from a scientific perspective. Second, as models become more complex and contain hundreds or thousands or tens-of-thousands of parameters, Gibbs sampling can become shockingly inefficient. In those cases, there are other algorithms.

8.2.2. Hamiltonian Monte Carlo.

It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought. —E. T. Jaynes101

The Metropolis algorithm and Gibbs sampling are both highly random procedures. They try out new parameter values and see how good they are, compared to the current values. But Gibbs sampling gains efficiency by reducing this randomness and exploiting knowledge of the target distribution. This seems to fit Jaynes' suggestion, quoted above, that when there is a random way of accomplishing some calculation, there is probably a less random way that is better.

Another important MCMC algorithm, HAMILTONIAN MONTE CARLO (or Hybrid Monte Carlo, HMC), pushes Jaynes' principle further. HMC is much more computationally costly at each step than are Metropolis or Gibbs sampling. But its proposals are typically much more efficient than even Gibbs sampling's. As a result, it doesn't need as many samples to describe the posterior distribution. And as models become more complex—thousands or tens-of-thousands of parameters—HMC can really outshine other algorithms.
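The next pages develop the intuition. As a preview, here is a minimal sketch of a single HMC trajectory for one parameter whose log-posterior is standard normal, using the leapfrog integrator. None of this code comes from the rethinking package; every name in it is my own illustration.

U <- function(q) 0.5*q^2        # negative log-posterior (potential energy)
grad_U <- function(q) q         # its gradient
q <- 1                          # current parameter value
p <- rnorm(1)                   # random initial momentum
p0 <- p
eps <- 0.1 ; L <- 20            # step size and number of leapfrog steps
p <- p - eps/2 * grad_U(q)      # half step for momentum
for ( i in 1:L ) {
    q <- q + eps * p            # full step for position
    if ( i < L ) p <- p - eps * grad_U(q)
}
p <- p - eps/2 * grad_U(q)      # final half step for momentum
# a Metropolis correction on the total energy keeps the samples exact
accept <- runif(1) < exp( U(1) + 0.5*p0^2 - U(q) - 0.5*p^2 )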


We're going to be using HMC on and off for the remainder of this book. You won't have to implement it yourself. But understanding some of the concepts behind it will help you grasp how it outperforms Metropolis and Gibbs sampling and also why it is not a universal solution to all MCMC problems.

Suppose King Markov's cousin Monty is King on the mainland. Monty's kingdom is not a discrete set of islands. Instead, it is a continuous territory stretched out along a narrow valley. But the King has a similar obligation: to visit his citizens in proportion to their local density. Like Markov, Monty doesn't wish to bother with schedules and calculations. So likewise he's not going to take a full census and solve for some optimal travel schedule.

Also like Markov, Monty has a highly-educated and mathematically gifted advisor. His name is Hamilton. Hamilton realized that a much more efficient way to visit the citizens in the continuous Kingdom is to travel back and forth along its length. In order to spend more time in densely settled areas, they should slow the royal vehicle down when houses grow more dense. Likewise, they should speed up when houses grow more sparse. This strategy requires knowing how quickly population density is changing, at their current location. But it doesn't require remembering where they've been or knowing the population distribution anyplace else. And a major benefit of this strategy compared to that of Metropolis is that the King makes a full sweep of the kingdom before revisiting anyone.

This story is analogous to how Hamiltonian Monte Carlo works. In statistical applications, the royal vehicle is the current vector of parameter values. Let's consider the single-parameter case, just to keep things simple. In that case, the log-posterior is like a bowl, with the MAP at its nadir. Then the job is to sweep across the surface of the bowl, adjusting speed in proportion to how high up we are.

HMC really does run a physics simulation, pretending the vector of parameters gives the position of a little frictionless particle. The log-posterior provides a surface for this particle to glide across. When the log-posterior is very flat, because there isn't much information in the likelihood and the priors are rather flat, then the particle can glide for a long time before the slope (gradient) makes it turn around. When instead the log-posterior is very steep, because either the likelihood is very concentrated or the priors are, then the particle doesn't get far before turning around.

8.3. Easy HMC: map2stan

The rethinking package provides a convenient interface, map2stan, to compile lists of formulas, like the lists you've been using so far to construct map estimates, into Stan HMC code. A little more housekeeping is needed to use map2stan: you need to preprocess any variable transformations, and you need to construct a clean data frame with only the variables you will use. But otherwise installing Stan on your computer is the hardest part. And once you get comfortable with interpreting samples produced in this way, you can go peek inside and see exactly how the model formulas you already understand correspond to the code that drives the Markov chain.

To see how it's done, let's revisit the terrain ruggedness example from Chapter 7.

library(rethinking)
data(rugged)
d <- rugged


We're going to fit the interaction model, predicting log-GDP with terrain ruggedness, whether or not a nation is African, and the interaction of the two. Here's the way to do it with map, just like before, but now with some priors.

R code 8.3
# drop nations with missing GDP and compute log-GDP first
# (these preprocessing lines are reconstructed; rgdppc_2000 is the
# GDP column in the rugged data frame)
dd <- d[ complete.cases(d$rgdppc_2000) , ]
dd$log_gdp <- log( dd$rgdppc_2000 )

m8.1 <- map(
    alist(
        log_gdp ~ dnorm( mu , sigma ) ,
        mu <- a + br*rugged + bA*cont_africa + brA*rugged*cont_africa ,
        a ~ dnorm( 0 , 100 ) ,
        br ~ dnorm( 0 , 10 ) ,
        bA ~ dnorm( 0 , 10 ) ,
        brA ~ dnorm( 0 , 10 ) ,
        sigma ~ dcauchy( 0 , 1 )
    ) ,
    data=dd )
# (the alist body is reconstructed from the formula list that
# show(m8.1stan) prints later in this section)


Next trim the data frame down to only the variables we'll actually use:

# (reconstructed; the three variable names come from the model above)
dd.trim <- dd[ , c("log_gdp","rugged","cont_africa") ]
str(dd.trim)

$ cont_africa: int 1 0 0 0 0 0 0 0 0 1 ...

The data frame dd.trim contains only the three variables we're using.

8.3.2. Estimation. Now provided you have the rstan package installed (mc-stan.org), you can get samples from the posterior distribution with this code:

m8.1stan <- map2stan( m8.1 , data=dd.trim )


FIGURE 8.5. Pairs plot of the samples produced by Stan. The diagonal shows a density estimate for each parameter. Below the diagonal, correlations between parameters are shown.

Assuming you have extracted the samples in the usual way,

post <- extract.samples( m8.1stan )

note that post here is a list, not a data.frame. This fact will be very useful later on, when we encounter multilevel models. For now, if it doesn't behave like you expect it to, you can coerce it to a data frame with

post <- as.data.frame( post )
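For example, here are a few quick summaries, assuming the post list just extracted (dens and HPDI are rethinking utilities you have already met):

precis( as.data.frame(post) )   # means and intervals for every parameter
dens( post$brA )                # posterior density of the interaction
HPDI( post$brA , prob=0.95 )    # highest posterior density interval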


To produce the pairs plot:

R code 8.8
mcmcpairs(post)

FIGURE 8.5 shows the resulting plot. This is a pairs plot, so it's still a matrix of bivariate scatterplots. But now along the diagonal the smoothed histogram of each parameter is shown, along with its name. And in the lower triangle of the matrix, below the diagonal, the correlation between each pair of parameters is shown, with stronger correlations indicated by relative size.

For this model and these data, the resulting posterior distribution is quite nearly multivariate Gaussian. The density for sigma is certainly skewed in the expected direction. But otherwise the quadratic approximation does almost as well as Hamiltonian Monte Carlo. This is a very simple kind of model structure of course, with Gaussian priors, so an approximately quadratic posterior should be no surprise. Later, we'll see some more exotic posterior distributions.

Overthinking: Stan messages. When you fit a model using map2stan, R will first translate your model formula into a Stan language model. Then it sends that model to Stan. The messages you see in your R console are status updates from Stan. Stan first again translates the model, this time into C++ code. That code is then sent to a C++ compiler, to build an executable file that is a specialized sampling engine for your model. Then Stan feeds the data and starting values to that executable file, and if all goes well, sampling begins. You will see Stan count through the iterations. During sampling, you might occasionally see a scary-looking warning, something like this:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue: Error in function stan::prob::multi_normal_log(...): Covariance matrix is not positive definite. Covariance matrix(0,0) is 0:0. If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine, but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Severely ill-conditioned or misspecified? That certainly sounds bad. But rarely does this message indicate a serious problem. As long as it happens only a handful of times, and especially if it only happens during warmup, then odds are very good the chain is fine. You should still always check the chain for problems, of course. Just don't panic when you see this message.

8.3.4. Using the samples. Once you have samples in an object like post, you work with them just as you've already learned to do. If you have the samples from the posterior and you know the model, you can do anything: simulate predictions, compute differences between parameters, and calculate DIC.

Actually, by default map2stan computes DIC for you. You can extract its value with DIC(m8.1stan), for example. DIC is also reported and broken down by its components on the default show output for a map2stan model fit.

R code 8.9
show(m8.1stan)

map2stan model fit
Formula:
log_gdp ~ dnorm(mu, sigma)


mu ~ a + br * rugged + bA * cont_africa + brA * rugged * cont_africa
a ~ dnorm(0, 100)
br ~ dnorm(0, 10)
bA ~ dnorm(0, 10)
brA ~ dnorm(0, 10)
sigma ~ dcauchy(0, 1)

Log-likelihood at expected values: -229.43
Deviance: 458.86
DIC: 468.66
Effective number of parameters (pD): 4.9

This report just reiterates the formulas used to define the Markov chain and then reports the log-likelihood, deviance, DIC, and the effective number of parameters pD (as estimated in computing DIC).

8.3.5. Checking the chain. Provided the Markov chain is defined correctly—and it is here—then it is guaranteed to converge in the long run to the answer we want, the posterior distribution. But the machine does sometimes malfunction. In the next major section, we'll dwell on causes of and solutions to malfunction.

For now, let's meet the most broadly useful tool for diagnosing malfunction, a TRACE PLOT. A trace plot merely plots the samples in sequential order, joined by a line. Looking at the trace of each parameter in this way can quickly diagnose many common problems. And once you come to recognize a healthy, functioning Markov chain, quick checks of trace plots provide a lot of peace of mind. A trace plot isn't the last thing analysts do to inspect MCMC output. But it's nearly always the first.

In the ruggedness example, the trace plot shows a very healthy chain. View it with:

R code 8.10
plotchains(m8.1stan)

The result is shown in FIGURE 8.6. Each plot in this figure is similar to what you'd get if you just used, for example, plot(post$a,type="l"), but with some extra information and labeling to help out. You can think of the zig-zagging trace of each parameter as the path the chain took through each dimension of parameter space. It's easier to see this path in the zoomed plot in the lower right corner, which displays only the first 100 samples after warmup.

The gray region in each plot, the first one-thousand samples, marks the adaptation samples. During adaptation, the Markov chain is learning to more efficiently sample from the posterior distribution. So these samples are not necessarily reliable to use for inference. They are automatically discarded by extract.samples, which returns only the samples shown in the white regions of FIGURE 8.6.

Now, how is this chain a healthy one? Typically we look for two things in these trace plots: stationarity and good mixing. Stationarity refers to the path staying within the posterior distribution. Notice that these traces, for example, all stick around a very stable central tendency, the center of gravity of each dimension of the posterior. Another way to think of this is that the mean value of the chain is quite stable from beginning to end.

A well-mixing chain means that each successive sample within each parameter is not highly correlated with the sample before it. Visually, you can see this by the rapid zig-zag


motion of each path, as the trace traverses the posterior density without getting mired anyplace. It doesn't sit still for long.

FIGURE 8.6. Trace plot of the Markov chain from the ruggedness model, m8.1stan. This is a clean, healthy Markov chain, both stationary and well-mixing. The gray region is warmup, during which the Markov chain was adapting to improve sampling efficiency. The white region contains the samples used for inference. In lower-right, a close look at the first 100 samples after warmup.

To really understand these points, though, you'll have to see some trace plots for unhealthy chains. That's the project of the next section.

Overthinking: Raw Stan model code. All map2stan does is translate a list of formulas into Stan's modeling language. Then Stan does the rest. Learning how to write Stan code is not necessary for most of the models in this book. But other models do require some direct interaction with Stan, because it is capable of much more than map2stan allows you to express. And even for simple models, you'll gain additional comprehension and control, if you peek into the machine. You can always access the raw Stan code that map2stan produces by using the function stancode. For example, stancode(m8.1stan) prints out the Stan code for the ruggedness model. Before you're familiar with Stan's language, it'll look long and weird. So let's focus on just the most important part, the "model block":

model{
    real mu[N];
    a ~ normal( 0 , 100 );


    br ~ normal( 0 , 10 );
    bA ~ normal( 0 , 10 );
    brA ~ normal( 0 , 10 );
    sigma ~ cauchy( 0 , 1 );
    for ( i in 1:N ) {
        // loop body and likelihood line reconstructed from the model formula
        mu[i] <- a + br*rugged[i] + bA*cont_africa[i]
            + brA*rugged[i]*cont_africa[i];
    }
    log_gdp ~ normal( mu , sigma );
}


8.4.2. How many chains do you need? It is very common to run more than one Markov chain, when estimating a single model. To do this with map2stan or stan itself, the chains parameter specifies the number of independent Markov chains to sample from. All of the non-warmup samples from each chain will be combined in the resulting inferences.

So the question naturally arises: how many chains do we need? There are two answers to this question. When getting started with a specific model, you need more than one chain. When you begin the final run that you'll make inferences from, however, most of the time you only need one chain. I'll briefly explain these answers.

The first time you try to sample from a chain, you might not be sure whether the chain is working right. So of course you will check the trace plot. Having more than one chain during these checks helps to make sure that the Markov chains are all converging to the same distribution. Sometimes, individual chains look like they've settled down to a stable distribution, but if you run the chain again, it might settle down to a different distribution. When you run multiple Markov chains, and see that all of them end up in the same region of parameter space, it provides a check that the machine is working correctly. Using 3 or 4 chains is conventional, and quite often more than enough to reassure you.

But once you've verified that the sampling is working well, and you have a good idea of how many warmup samples you need, it's perfectly safe to just run one long chain. For example, suppose we learn that we need 1000 warmup samples and about 9000 real samples in total. Should we run one chain, with warmup=1000 and iter=10000, or rather 3 chains, with warmup=1000 and iter=4000? It doesn't really matter, in terms of inference.

But it might matter in efficiency, because the 3 chains cost you an extra 2000 samples of warmup that just get thrown away. And since warmup is typically the slowest part of the chain, these extra 2000 samples cost a disproportionate amount of your computer's time. On the other hand, if you run the chains on different computers or processor cores within a single computer, then you might prefer 3 chains, because you can spread the load and finish the whole job faster.

There are exotic situations in which all of the advice above must be modified. But for typical regression models, you can live by the motto four short chains to check, one long chain for inference. Things may still go wrong—you'll see some examples in the next sections, so you know what to look for. And once you know what to look for, you can fix any problems before running a long final Markov chain.

Overthinking: Parallelizing Stan samples. Explain sflist2stanfit.

Rethinking: Convergence diagnostics and ritual dangers. Discuss Rhat and importance of not trusting them.

8.4.3. Taming a wild chain. One common problem with some models is that there are broad, flat regions of the posterior density. This happens most often, as you might guess, when one uses flat priors. The problem this can generate is a wild, wandering Markov chain that erratically samples extremely positive and extremely negative parameter values.

Let's look at a simple example. The code below tries to estimate the mean and standard deviation of the two Gaussian observations −1 and 1.


R code 8.11
y <- c(-1,1)
m8.2 <- map2stan(
    alist(
        y ~ dnorm( mu , sigma ) ,
        mu <- a
    ) ,
    data=list(y=y) , start=list(a=0,sigma=1) ,
    chains=2 , iter=1e4 , warmup=1000 )
# (body reconstructed; no priors on purpose, matching the flat-prior
# setup the text describes)

The implicit flat priors here are what let the chains wander so wildly; the lefthand panel of FIGURE 8.7 shows the sickly result. Weakly informative priors for α and σ clear the problem up:

R code 8.12
m8.3 <- map2stan(
    alist(
        y ~ dnorm( mu , sigma ) ,
        mu <- a ,
        a ~ dnorm( 1 , 10 ) ,
        sigma ~ dcauchy( 0 , 1 )
    ) ,
    data=list(y=y) , start=list(a=0,sigma=1) ,


    chains=2 , iter=1e4 , warmup=1000 )
precis(m8.3)

       Mean StdDev lower 0.95 upper 0.95
a      0.02   1.78      -3.34       3.56
sigma  2.03   2.19       0.42       5.28

FIGURE 8.7. Diagnosing and healing a sick Markov chain. Left: Trace plot from two independent chains defined by model m8.2. These chains are not stationary and should not be used for inference. Right: Adding weakly informative priors (see m8.3) clears up the condition right away. These chains are fine to use for inference.

That's much better. Take a look at the righthand panel in FIGURE 8.7. This trace plot looks healthy. Both chains are stationary around the same values, and mixing is excellent. No more wild detours off to negative 30-million.

To appreciate what has happened, take a look at the priors (dashed) and posteriors (blue) in FIGURE 8.8. Both the Gaussian prior for α and the Cauchy (KO-shee) prior for σ contain very gradual downhill slopes. They are so gradual that, even with only two observations, as in this example, the likelihood almost completely overcomes them. The mean of the prior for α is 1, but the mean of the posterior is zero, just as the likelihood says it should be. The prior for σ is maximized at zero. But the posterior has its median around 1.4. The standard deviation of the data is 1.4.

These weakly informative priors have helped by providing a very gentle nudge towards reasonable values of the parameters. Now values like 30-million are no longer as plausible as small values like 1 or 2. Lots of problematic chains want subtle priors like these, designed to tune estimation by assuming a little bit of prior information about each parameter. And even though the priors end up getting washed out right away—two observations were enough here—they still have a big effect on inference, by allowing us to get an answer. That answer is also a good answer.
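Since σ is constrained to be positive, the Cauchy prior above really acts as a half-Cauchy, the positive half of a Cauchy distribution. A one-line sketch of its long, gradual tail, using base R's dcauchy (the factor of 2 renormalizes the positive half; the scale value is the one from m8.3):

curve( 2*dcauchy(x, location=0, scale=1) , from=0 , to=20 ,
    xlab="sigma" , ylab="prior density" )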


FIGURE 8.8. Prior (dashed) and posterior (blue) for the model with weakly informative priors, m8.3. Even with only two observations, the likelihood easily overcomes these priors. Yet the model cannot be successfully estimated without them. [Left panel: alpha; right panel: sigma.]

Overthinking: Cauchy distribution. Need a box on Cauchy? Should explain here that the Cauchy in the model above is a half-Cauchy.

8.4.4. Non-identifiable parameters. Back in Chapter 5, you met the problem of highly correlated predictors and the non-identifiable parameters they can create. Here you'll see what such parameters look like inside of a Markov chain. You'll also see how you can repair them, without re-estimating the posterior.

To construct a non-identifiable model, we'll first simulate 100 observations from a Gaussian distribution with mean zero and standard deviation 1.

R code 8.13
y <- rnorm( 100 , mean=0 , sd=1 )

Then we fit a model in which the mean is the sum of two intercepts, a1 and a2. Only their sum is identified by the data:

R code 8.14
# (the alist body is reconstructed from the surrounding text)
m8.4 <- map2stan(
    alist(
        y ~ dnorm( mu , sigma ) ,
        mu <- a1 + a2
    ) ,


    data=list(y=y) , start=list(a1=0,a2=0,sigma=1) ,
    chains=2 , iter=1e4 )
precis(m8.4)

          Mean  StdDev lower 0.95 upper 0.95
a1    -9209.45 3086.31  -13604.71   -4819.27
a2     9209.46 3086.31    4819.40   13604.79
sigma     1.02    0.05       0.92       1.11

Those estimates look suspicious. The means for a1 and a2 are almost exactly the same distance from zero, on opposite sides. And the standard deviations of the chains are massive. This is of course a result of the fact that we cannot simultaneously estimate a1 and a2, but only their sum.

Looking at the trace plot reveals more. FIGURE 8.9 shows two Markov chains from the model above. These chains do not look like they are stationary, nor do they seem to be mixing very well. Indeed, when you see a pattern like this, it is reason to worry. Don't use these samples.

FIGURE 8.9. A chain with unidentifiable parameters, a1 and a2.

Again, weak priors can rescue us. Now the model fitting code is:


R code 8.15
m8.5 <- map2stan(
    alist(
        y ~ dnorm( mu , sigma ) ,
        mu <- a1 + a2 ,
        a1 ~ dnorm( 0 , 10 ) ,
        a2 ~ dnorm( 0 , 10 ) ,
        sigma ~ dcauchy( 0 , 1 )
    ) ,
    data=list(y=y) , start=list(a1=0,a2=0,sigma=1) ,
    chains=2 , iter=1e4 )
# (the priors are reconstructed; "weak priors" per the text and FIGURE 8.10)

FIGURE 8.10. Trace plot from the same model as in FIGURE 8.9, but now with weak priors. The chains converge and mix well now.


And since the chain is working now, in this case we can actually fix the weird parameterization problem, without changing the model or running the chains again. Notice that in each chain, the trace for a1 is the mirror image of the trace for a2. This is because the chain is successfully estimating the posterior distribution of the sum of a1 and a2. You can verify this with the samples:

post <- extract.samples( m8.5 )
dens( post$a1 + post$a2 )   # the sum, unlike a1 and a2 alone, is well identified


9 Big Entropy and the Generalized Linear Model

Probability theory is just counting. It looks fancy, and the mathematics can be opaque. But the trick I introduced in Chapter 3, where we converted a posterior probability calculation to a counting problem, works because probability theory really is just counting. What does it count? Probability counts the ways something can happen, given what we know (or think we know). Remember, probability is a device for describing uncertainty, the relative plausibility of different things, given certain constraints on how we believe things can happen. As a result, if you just write down everything that can happen, eliminate things that are incompatible with the constraints, and then count up how many ways each thing can happen, you are doing probability theory. Counting is not an approximation. Instead, it's the real thing.

And that's just about all Bayesian inference is, at least mechanically. The probability distributions we routinely use in statistical modeling—the Gaussian, binomial, and others—are exactly the distributions that you arrive at when you just count up all the ways each thing can happen, given a precise statement of prior information. As a result, these distributions are the most honest expressions of the state of information that goes into defining the problem. They are the unique MAXIMUM ENTROPY distributions that are consistent with our assumptions. There is no guarantee that what happened or will happen obeys the distribution. But there is a guarantee that no other distribution can do a better job of describing the uncertainty.

Here's an example. Think back to the globe tossing experiment from Chapter 2. Instead of tossing the real globe, we'll imagine tossing an inflatable ball that is half covered in blue, "water," and half in green, "land." As a result, there are as many ways to get a water result as there are to get a land result. Remember, the physics of ball tossing are deterministic—any randomness in the outcome is a result of our lack of information, not of any true indeterminacy in the physics. So by saying the probability of water is one-half, we must mean that there are just about as many initial conditions that lead to observing water as there are initial conditions that lead to observing land.

9.1. Generalized Linear Models

The most accessible way to resist the Tyranny of Gauss is to learn to construct, estimate and interpret generalized linear models (GLM's). This label is unfortunate, because it references a mathematical definition rather than a purpose, and it is easy to confuse with general linear models, which are quite different. Sadly, GLM is the standard terminology, so you have to know it.


Better to think of generalized linear models as non-Gaussian regression models. If you have an outcome you wish to model as a result of a multivariate process, but the outcome is a distance, duration, count, rank or other non-Gaussian measure, then GLM's provide a practical and flexible approach. These models assume non-Gaussian stochastic nodes, but they use the same strategy of replacing parameters of the stochastic node with additive models that incorporate predictor variables. So most of what you've already learned about building and interpreting Gaussian regression models remains relevant for non-Gaussian regression. There are some new wrinkles, as you'll see. But the additional power and predictive precision are worth it.

In this book, I focus mainly on GLM's that use a family of outcome distributions known as the exponential family. Indeed, many people actually define GLM's as those regressions that use exponential family distributions. The normal distribution you've come to know and love is the most famous member of the family, but there are others which are just as useful and natural. These distributions are scientifically important and mathematically convenient, partly because they arise naturally from many real-world processes, just like the normal distribution. The naturalness of these distributions helps us appreciate why, for example, the exponential distribution is a fundamental distribution of distances and durations.

But the overall strategy used in building GLM's is not constrained to the exponential family. I'll provide a couple of important applied examples, in the form of rank and ordered regression models. These are not exactly exponential family distributions, although they do build on top of them. Still, learning how these distributions are built up from more basic considerations of probability will help open your mind to the broader approach of seriously considering how to predict outcomes, even when they are measured on idiosyncratic scales. The goal is to build a model that addresses the measurement design of the data.

Here's the plan. Much of the remainder of this chapter will teach you the basics of modeling distances and durations, kinds of measures I like to think of as displacements. Displacements are continuous measures that contain only information about distance from a point of reference. They are always positive (or zero, in special cases). In the next chapter, I focus on count distributions and models. And then in the next, I discuss special distributions for awkward measures like ordered categories and ranks, as well as ways to mix exponential family distributions together to begin to model variability in underlying processes. All of this work is on a path towards multilevel models, in Chapter 13.

But before settling into the serious work of developing these GLM's, it will pay first to meet the most prominent members of the family. The important point to make, before moving on, is that these distributions are not arbitrary choices. Rather, they relate naturally to the kinds of measures we wish to model. Then I want to foreground some of the additional issues in fitting and interpretation that come with GLM's. You'll meet these issues in detail as you develop the applied cases in the chapters to follow. But it'll help now to outline these issues for you, so you can understand how most of them arise from the same basic characteristic of generalized linear models: the variance of most non-Gaussian outcomes is not independent of the mean.
9.1.1. Meet the family. FIGURE 9.1 illustrates the representative shapes of the densities for the most common exponential family distributions used in GLM's. The horizontal axis in each plot represents values of a variable, and the vertical axis represents density. For each distribution, the figure also provides the notation for its stochastic node (above each density plot), and the name of R's corresponding built-in density function (below each density plot).
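As a quick reference, the five densities in the figure can be drawn with those same built-in functions. All parameter values below are arbitrary choices for display:

par( mfrow=c(2,3) )
curve( dexp( x , rate=1 ) , 0 , 6 , main="exponential: dexp" )
curve( dgamma( x , shape=2 , rate=1 ) , 0 , 8 , main="gamma: dgamma" )
curve( dnorm( x , 0 , 1 ) , -4 , 4 , main="normal: dnorm" )
plot( 0:10 , dpois( 0:10 , lambda=3 ) , type="h" , main="Poisson: dpois" )
plot( 0:10 , dbinom( 0:10 , size=10 , prob=0.3 ) , type="h" ,
    main="binomial: dbinom" )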


FIGURE 9.1. Some of the exponential family distributions, their notation, and some of their relationships. Center: exponential distribution. Clockwise, from top-left: gamma, normal (Gaussian), binomial and Poisson distributions. [Panel notation: y ∼ Exponential(λ) with dexp; y ∼ Gamma(λ, …) with dgamma; y ∼ Normal(µ, σ) with dnorm; y ∼ Binomial(n, p) with dbinom; y ∼ Poisson(λ) with dpois. Arrow labels include "sum," "large mean," "count events, low rate," and "many trials, low probability."]

While there are other important distributions both in the family and outside it, you can usefully think of these distributions as the most common building blocks of probability modeling. Like the normal distribution (see Chapter 4), each of these exponential family distributions arises (approximately or exactly) from natural processes. Not only do they arise from natural processes, but they are also related to one another through natural processes and the transformations that happen during measurement. These are not ad hoc mathematical conveniences. Instead they are natural distributions that arise through many processes in both nature and the laboratory. Just like Gaussian distributions populate the world, so too do exponential, gamma, Poisson, binomial, and others. Or, the way I prefer to think of it, these distributions have strong informational "gravity" that attracts collections of measures towards them. Just like a collection of sums from (almost) any distribution tends towards a Gaussian distribution (Chapter 4), other ways of combining or transforming values lead to other distributions.

Later, both in this chapter and the next, I'll provide heuristic explanations of some processes that give rise to these distributions, as well as how they relate to one another. For now, note the gray arrows in FIGURE 9.1. Each arrow is labeled with a process, whether it occurs naturally or through human measurement, that leads each distribution to generate another, either approximately or exactly. For example, as I'll explain in more detail later in this chapter, summing values sampled from an exponential distribution (center) leads directly to the


gamma distribution (top-left). This isn't the only way to produce a gamma distribution, but it is a natural and common one. Furthermore, if you add enough exponential values together, the gamma converges to the normal. Beginning again with the exponential in the center, the count distributions at the bottom of the figure can arise by counting the number of exponentially-distributed events that happen in a finite interval of time. There are other relationships among these distributions, and further relationships between these distributions and information theory. But much of that detail is beyond the scope of this book. I just want you to be aware of the wider context these distributions live in.

I'm going to split the introduction of these distributions across this chapter and the next. In this chapter, I introduce and show you how to build models based upon the exponential (FIGURE 9.1, center) and gamma (top-left). These are natural distributions of durations and distances. Both durations and distances are measures of displacement, absolute length from some point of reference. Therefore, unlike truly continuous measures, durations and distances are strictly positive or zero. They are never negative. A direct consequence of this fact is that the pattern of variation around the mean always varies with the mean. This is quite unlike the Gaussian distribution, in which the mean and variance are independent (unless modeled otherwise).

The same general relationship between the mean and the variance is true of count distributions, as well. In the next chapter, I focus on the count distributions at the bottom of FIGURE 9.1, the Poisson (bottom-left) and binomial (bottom-right) distributions. Counts are discrete positive integers—like 1, 2 and 3—or zero. Again, they are never negative. And again, a consequence of this fact is that the variance of a count distribution is closely related to its mean. Both the Poisson and binomial can be derived from the exponential distribution, as indicated in FIGURE 9.1. And the Poisson can be derived from the binomial, by choosing special values for its parameters.

It isn't important in this book to grasp the mathematical details of the relationships among these distributions, for the most part. Instead, I describe them in order to help the reader appreciate when each distribution is perhaps most helpful, natural and interpretable.

9.1.2. Linking models and distributions. The first component of the GLM strategy is to open up many natural probability distributions to apply to our outcome measurements. But just like with ordinary Gaussian models, we have to work with the parameters of these distributions in order to build models that include predictor variables, other measurements. The choice of how to associate the basic parameters of the outcome distribution with the predictor variables is called the link.

To remind you, in the case of an ordinary Gaussian regression model with a single predictor variable, the linking strategy is to replace the parameter µ, the mean of the Gaussian distribution, with a linear model that contains the predictor variable. This leads to the now-familiar model definition:

y_i ∼ Normal(µ_i, σ)
µ_i = α + β x_i

The mean for each case i is linked to an additive model that includes x_i and the two parameters, α and β.

If you choose a different outcome distribution—a non-Gaussian likelihood—then the fundamental parameters of that distribution will not be µ and σ.
Importantly, most distributions other than the Gaussian do not have means that are independent of their standard deviations. So in most cases it isn't possible to just replace the parameter for the mean with


an additive model. Some other link is needed. And many fundamental parameters are not continuous like µ. Instead, they must be positive (like σ) or bounded between zero and one. These facts can make establishing a useful link a little more complicated.

As this book introduces each outcome distribution, it will also introduce common linking strategies for associating distribution parameters and predictor variables. The important principle to remember is that there is no single right way to accomplish this goal. Your hypotheses tell you how you want the predictor variables to relate to aspects of the outcome, like its mean. The link is a mathematical strategy for modeling that hypothesis. Like all models, all you need to hope for is that it be useful.

But while there is no one right way to build a link, there are well-practiced and useful ways to accomplish it. Depending upon the distribution chosen—exponential, binomial or another—there are usually common links that provide a time-tested way to embed an additive model. Still, it's worth being critical of link choices. Just like with other aspects of a model, such as the prior and the likelihood function, there is always some subjective choice involved in the choice of a link. While scientists tend to be much more suspicious of priors than links, both can alter inference in powerful ways.

9.1.3. Living with the family. Moving from Gaussian models to models based upon other distributions does have some costs. We need GLM's, because there are many important types of data that are not even approximately Gaussian and cannot be usefully coerced into Gaussian form through transformation. But these non-Gaussian distributions all share the inconvenient feature of having patterns of variation that are related to the mean. For example, if the average count of some event is close to zero, then the variation around that mean must extend further above the mean than below it. This is just because counts cannot go below zero. The distribution will be skewed. As the mean increases, the pattern of variation will change, usually becoming more symmetrical.

Such relationships between the mean and variation in a collection of measures are natural. Distances and counts really cannot be negative, and their means and variances really do covary. It's how nature actually works. But in the practice of statistics, it presents some new problems for us.

9.1.3.1. Needle in a non-linear haystack. Search is harder.
9.1.3.2. Ceilings, floors and interpreting parameters. Everything interacts.
9.1.3.3. The weakest link. Link functions matter and even smuggle in assumptions.
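To fix ideas before the applied chapters, here is what the two most common links do, in a couple of lines of base R each. The function names here are mine, not the book's:

logit <- function(p) log( p / (1-p) )          # (0,1) -> whole real line
inv_logit <- function(x) 1 / ( 1 + exp(-x) )   # real line -> (0,1)
inv_logit( logit(0.25) )   # recovers 0.25
exp( log(3.5) )            # the log link: positive values -> real line and back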


10 Distance and Duration

A curious thing about machines and organisms alike is that they tend to fail predictably. I don't mean that it is easy to predict when any particular machine will break or organism will die. Rather I mean that the distribution of these failures tends to take a characteristic form, or one very close to it.

Suppose there is a machine with a number of components necessary for its function. When any one of these components fails, the machine fails. Now, also suppose each component will fail sometime between today and one year from now. Each component however has equal probability of failing on any day of the year, and each component fails independently of the others.

What will the distribution of times to failure look like? The answer depends critically on how many components there are in the machine (FIGURE 10.1). For one component (FIGURE 10.1, left), the distribution is naturally just uniform—any day between tomorrow and one year from now is equally likely. You can simulate the distribution for yourself with this one line of R code:

y <- runif( 1e5 , 0 , 365 )   # days until the single component fails
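To see the rest of the pattern in FIGURE 10.1, simulate machines with more components: time to failure is then the minimum of the component lifetimes. A minimal sketch, with component counts mirroring the figure:

par( mfrow=c(1,3) )
for ( k in c(1,3,10) ) {
    fail <- replicate( 1e4 , min( runif( k , 0 , 365 ) ) )
    plot( density(fail) , main=paste( k , "components" ) ,
        xlab="days until failure" )
}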


FIGURE 10.1. Times to failure for a machine (or organism) in which failure of any one component leads to total failure. In each plot, time to failure is on the horizontal axis and density is on the vertical. Assume that each component of the machine has an equal chance of failure on any day from now until a year from now. Time to failure is the time it took for at least one component to fail. Left: A machine with only one component shows the uniform distribution of the single component's failure probability. Middle: A machine with 3 components usually fails sooner rather than later. Right: A machine with 10 components exhibits a nearly perfect exponential failure density. Comparison to perfect exponential in black.

The number of machines remaining after any fixed amount of time follows the exponential. The expected proportion remaining at time T will be:

exp(−λT).

The number of failures is highest at the start, because the most machines are still around to fail. After a while, there are very few machines left, and so fewer fail per unit time. Still, they are failing at the same rate. Substituting other events for "failure" leads one to notice exponential processes in many natural systems, from atoms to ecosystems.

As with the Gaussian density in Chapter 4, there is a maximum entropy interpretation of the exponential density as well. If all we know about a collection of measures is their average displacement from some point, then the least surprising (maximum entropy) distribution for the measures will be an exponential distribution.103

10.1. Survival analysis

As a fundamental distribution of displacements—distances in space or durations in time—the exponential is very handy for modeling. In this section, I provide a very brief introduction to its use in making GLM's. This is a good place to start, because the exponential density has only one parameter, a rate, that determines both its mean and variance. And so it will be easier to first encounter some aspects of GLM's in the context of exponential distributions. On the other hand, the broader subfield that exponential GLM's are part of, often called survival analysis (in the biological sciences) or event history analysis (in the social sciences),


can be quite complicated. It is the topic of entire books.104 And so here all I'm going to attempt is a simple application to illustrate how to get a linear model into the exponential density.

Add tenure data example.

10.2. Gamma

Built up from exponential — add together exponential waiting times and get a gamma.
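A minimal simulation of that fact, with three exponential waits per observation (the rate 0.1 is an arbitrary choice): the sum matches a gamma with shape 3 and the same rate.

waits <- matrix( rexp( 3*1e4 , rate=0.1 ) , ncol=3 )
y <- rowSums( waits )                      # sum of 3 exponential waits
plot( density(y) , main="sum of 3 exponential waits" )
curve( dgamma( x , shape=3 , rate=0.1 ) , add=TRUE , lty=2 )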


11 Counting and Classification

[Summarize Spelke experiments on counting? Goal is to provide intuition for why uncertainty around a count scales with the count.]

11.1. Binomial

A filtered exponential

The binomial density is, as you saw early in this book:

y ∼ Binomial(n, p),

where y is a count (zero or a positive whole number), p is the probability any particular "trial" is a success, and n is the number of trials.

11.1.1. The importance of log-odds. How do we convert the parameters of the binomial distribution into an additive function of parameters and predictor variables? What is a good link? Several different strategies are available, but the best isn't to just replace, for example, the probability p with an additive model. Instead, we define the log-odds of the event as an additive model. The log-odds of any event E is just:

log( p / (1 − p) ).

This is simply the logarithm of the odds. Why do you want to define the density through the log-odds? Why not the plain odds, or even just the probability p itself? A very basic answer, but one that is unsatisfying for most, is that the log-odds are a natural parameter of the binomial distribution, much like the precision τ, but not σ, is a natural parameter of the normal. But so what? What does using log-odds give us that plain probability or regular odds do not?

Log-odds have several pragmatic advantages. First, unlike the probability p, the log-odds are continuous. They can take any value between −∞ and +∞. A probability like p is in contrast bounded between zero and one. Having a continuous scale to build our models upon will make them much easier to work with, as you'll see.

Second, unlike the plain non-log odds, p/(1 − p), the log-odds are symmetrical: They accelerate at the same rate in both directions, as p increases above one-half or as it decreases below one-half. The same is absolutely not true of the regular odds. This will be easier to appreciate with a graph. In FIGURE 11.1, I show the contrast in symmetry between log-odds and ordinary odds. On the left, the log-odds of an event are plotted against the probability of the event. The horizontal dashed line shows the log-odds value at which the event is equally likely to happen as not, p = 0.5. Notice that this occurs where the log-odds are equal to zero. Furthermore, the shape of the log-odds curve is perfectly symmetrical on both sides of zero, as if reflected in a mirror. In contrast, the plot on the right shows that ordinary odds are


highly asymmetrical. Again, the horizontal dashed line shows the value on the vertical axis at which the event has probability one-half. This is where the odds are equal to one. But now the curve is far from symmetrical on both sides of this line. Instead it accelerates towards infinity as the probability increases above one-half.

FIGURE 11.1. Log-odds versus ordinary odds. On the left, the log-odds of an event are plotted against the raw probability. The horizontal dashed line shows the log-odds value at which the event has probability one-half. Notice that the log-odds curve is perfectly symmetrical around this line. On the right, ordinary odds are plotted against probability. Again, the horizontal dashed line shows the value of the odds at which the probability is one-half. Now however, the curve is far from symmetrical around this line. Odds are not invariant to the choice of which event to focus on, while log-odds are invariant.

Why is the symmetry of the log-odds superior to the asymmetry of the odds? We want a scale of measurement that isn't influenced by arbitrary measurement choices. The horizontal probability axis in both plots could just as easily go the other direction, measuring the probability of the event not happening. The log-odds plot would be unaffected by this decision, while the odds plot would change dramatically. Since the choice of which event to label on the horizontal axis is largely arbitrary—a matter of our convenience—it is important to have a scale of measurement that is invariant to such arbitrary decisions. Log-odds provide such a scale.

Finally, a reasonable model of plain probabilities and odds must be fundamentally multiplicative, because otherwise the model would allow these values to dip below zero, which should be impossible. In contrast, multiplicative relations, once logged, become additive. You've seen this already, when you learned that joint likelihood is the product of each individual likelihood, but the joint log-likelihood is the sum of each log-likelihood. And so like log-likelihood, log-odds are fundamentally additive, and therefore lend themselves more easily to the generalized linear modeling framework.
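The figure's two panels are easy to reproduce, which also makes for a good way to convince yourself of the symmetry claim:

p <- seq( 0.01 , 0.99 , length.out=200 )
par( mfrow=c(1,2) )
plot( p , log( p/(1-p) ) , type="l" , ylab="log-odds" )
abline( h=0 , lty=2 )                  # the p = 0.5 line
plot( p , p/(1-p) , type="l" , ylab="odds" )
abline( h=1 , lty=2 )                  # the p = 0.5 line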


11.1.2. The logit link. So we'll adopt the strategy of modeling the log-odds, instead of the probability p directly. More precisely, what this means is that instead of defining the binomial model this way:

y_i ∼ Binomial(n, p_i)
p_i = α + β x_i

we define the log-odds instead:

y_i ∼ Binomial(n, p_i)
log( p_i / (1 − p_i) ) = α + β x_i

Equivalently, you can write:

logit(p_i) = α + β x_i

So it's still true that we are using the GLM approach to model a unique probability for each case i. But the additive model defines the log-odds for case i, not the probability. If we made p_i itself a linear model, then it would be easy to find parameter values that would make p_i less than zero or greater than one. Since a probability should be between zero and one, that approach could cause problems. But the log-odds extend symmetrically around zero to both −∞ and +∞, so making the log-odds into a linear model is a lot safer.

Okay, so what does this imply about the probability itself, p_i? We have to care about the answer to this question, because in order to fit the model to the data, we have to calculate likelihoods. So we want to solve for p_i. Suppose for example that the additive model of the log-odds for case i is just called ϕ_i. In math speak, this implies:

log( p_i / (1 − p_i) ) = ϕ_i.    (11.1)

Now we solve for the probability itself. Do this by taking (11.1) and solving for p_i. After a little algebra, you get:

p_i = exp(ϕ_i) / (1 + exp(ϕ_i)).

You might recognize this probability as the LOGISTIC, with the log-odds sometimes called the LOGIT.105 The logistic function pops up all over science. I first learned it in biology, as the result of natural selection on the frequency of a favorable allele. But it has emerged here just as a consequence of the desire to be able to make our model about log-odds.

What we've done so far is establish a LOGIT LINK between the additive model and the probability of an event. Now you need to see how to code this link.

11.1.3. Example: Prosocial chimpanzees. The data for this example come from an experiment aimed at evaluating the prosocial tendencies of chimpanzees.106 Each chimpanzee was placed in a room with two levers, one on the left and one on the right. Across from the focal animal was a transparent divider, on the other side of which another chimpanzee would sit, in one of the treatment conditions. On each side of the divider, each lever was attached to a plate. When the focal animal pulled the left lever, the plates on the left side would move close enough to each animal for him or her to retrieve anything on it. As a result, if the focal animal pulled the left lever, whatever rested on the left plates became available to each animal. If he or she instead pulled the right lever, whatever was on the right plates became available.

There were two treatment variables to be concerned with for now. First, one of the two sets of plates, left or right, always had a "prosocial" option that had food on both plates, while


the other side had food only on the side close to the focal animal. Think of the prosocial option as being 1/1, indicating food on both sides. The nonsocial option, in contrast, was always 1/0, indicating food only on the focal side. If the focal animal pulled the lever on the 1/1 side, both animals received food. If he or she instead pulled the 1/0 lever, only the puller received food. The side of the 1/1 option was balanced across trials, to control for handedness preferences. So the first question to ask of these data will be: do the chimpanzees prefer the 1/1 option?

The second key treatment variable is whether or not another chimpanzee sat on the other side of the transparent divider. This is the condition dummy variable, with 1 indicating there was another chimpanzee present. So the second question to ask of the data will be: do chimpanzees prefer the 1/1 option more when another animal is present? Or are they instead indifferent to the presence of the conspecific?

To address these questions, we will try to predict which side the focal animal pulls, conditional on which side had the 1/1 option and whether or not another animal was present. Since the outcome is binary (left or right lever), we'll use a binomial stochastic node and build a GLM with a logit link. Load the data from the rethinking package:

R code 11.1
library(rethinking)
data(chimpanzees)
d <- chimpanzees


FIGURE 11.2. Proportion of trials in which an individual pulled the lefthand lever (top plot) and the 1/1 (prosocial) lever (bottom plot). Each group of four points are the data from a single individual chimpanzee, with the numbers corresponding to actor labels in the data. Circles represent pulls when the 1/0 option was on the left. Triangles represent pulls when the 1/1 (prosocial) option was on the left. Filled symbols represent pulls when another individual was present (social condition).

First, fit an intercept-only model, to measure the average preference for the lefthand lever. The listing below is a reconstruction, and the outcome name pulled.left is an assumption about the data frame:

R code 11.2
m10.1 <- map(
    alist(
        pulled.left ~ dbinom( 1 , p ) ,
        logit(p) <- a ,
        a ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0) )
precis(m10.1)


   Estimate  S.E.  2.5%  97.5%
a      0.32  0.09  0.14   0.50

This implies a MAP probability of pulling the left lever of logistic(0.32) ≈ 0.58, with a 95% quadratic estimate interval of 0.54 to 0.62:

R code 11.3
logistic( c(0.14,0.5) )

[1] 0.5349429 0.6224593

So we have to note at the start that the chimpanzees, on average, exhibit a preference for the lefthand lever.

Now fit the next two models, which allow us to investigate the hypothesis. We don't fit a model with only condition, because we don't think there is any reason to suspect that chimpanzees pull the lefthand lever more or less when another chimpanzee is present. In contrast, we have reason to suspect that they pull the lefthand lever more when the lefthand lever is both the 1/1 option and another chimpanzee is present. This means we want to estimate these additional two models:

L_i ∼ Binomial(p_i, 1)
log( p_i / (1 − p_i) ) = α + β_P P_i
α ∼ Normal(0, 10)
β_P ∼ Normal(0, 10)

L_i ∼ Binomial(p_i, 1)
log( p_i / (1 − p_i) ) = α + (β_P + β_PC C_i) P_i
α ∼ Normal(0, 10)
β_P ∼ Normal(0, 10)
β_PC ∼ Normal(0, 10)

where P is prosoc.left, a 1/0 variable indicating when the 1/1 option was on the left side, and C is condition, a 1/0 variable indicating when another animal was present. Note that the second model above has a two-way interaction, but it doesn't consider one of the main effects. This is because we are assuming that there is no direct effect of condition on pulling left versus right. You may want to fit a model that considers the main effect of condition, at the end of this section, so you can see if inferences change any.

To fit these models, just build the linear model of log-odds:

R code 11.4
m10.2 <- map(
    alist(
        pulled.left ~ dbinom( 1 , p ) ,
        logit(p) <- a + bp*prosoc.left ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10)


    ) ,
    data=d , start=list(a=0,bp=0) )

m10.3 <- map(
    alist(
        pulled.left ~ dbinom( 1 , p ) ,
        logit(p) <- a + (bp + bpC*condition)*prosoc.left ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10) ,
        bpC ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bp=0,bpC=0) )
# (the m10.3 listing is a reconstruction from the model definitions above)
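One way to read these fits is to push the estimates back through the logistic onto the probability scale. A sketch, using extract.samples, which works on map fits as well by sampling from the quadratic approximation:

post <- extract.samples( m10.3 )
p.alone <- logistic( post$a + post$bp )               # 1/1 on left, no partner
p.partner <- logistic( post$a + post$bp + post$bpC )  # 1/1 on left, partner present
mean( p.partner - p.alone )    # average difference on the probability scale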


FIGURE 11.3. Model averaged predictions for the chimpanzee data. The blue symbols and lines are the actual data. As in FIGURE 11.2, circles represent pulls when the 1/0 option was on the left, triangles represent pulls when the 1/1 (prosocial) option was on the left, and filled symbols represent pulls when another individual was present (social condition). Chimpanzees did appear to prefer the 1/1 (prosocial) option, but were insensitive to whether another individual was there to benefit from it.

# define link function
# (the body of p.link is a reconstruction from the models above; it
# assumes a post object of samples)
p.link <- function( prosoc.left , condition )
    logistic( post$a + (post$bp + post$bpC*condition)*prosoc.left )


}
lines( x , y , col="slateblue" )
lines( x , pred.propleft[x] )
lines( x , pred.propleft.ci[1,x] , lty=2 )
lines( x , pred.propleft.ci[2,x] , lty=2 )

The result is shown in FIGURE 11.3. Those are some bad predictions! Nevertheless, there are no signs that these chimpanzees are responding much to the partner condition.

11.1.3.1. Still Gaussian. It's useful to see how good a job the quadratic approximation is doing. Let's compare the estimates above to the same model fit using Stan.

d2 <- d
d2$recipient <- NULL   # drop the column with missing values (assumed name)
m10.3stan <- map2stan( m10.3 , data=d2 , iter=1e4 , warmup=1000 )


FIGURE 11.4. Pairs plot of the posterior samples for m10.3stan, sampled with Hamiltonian Monte Carlo. The posterior is multivariate Gaussian, and so quadratic approximation for these models is perfectly fine. It isn't always so, however.

R code 11.10
# dummy variables for individual chimpanzees (actor 1 is the baseline;
# construction assumed)
for ( j in 2:7 )
    d2[[ paste("actor",j,sep="") ]] <- as.numeric( d$actor==j )
m10.3b.stan <- map2stan(
    alist(
        pulled.left ~ dbinom( 1 , p ) ,
        logit(p) <- a + (bp + bpC*condition)*prosoc.left +
            h2*actor2 + h3*actor3 + h4*actor4 +
            h5*actor5 + h6*actor6 + h7*actor7 ,
        a ~ dnorm(0,10),
        bp ~ dnorm(0,10),
        bpC ~ dnorm(0,10),
        h2 ~ dnorm(0,10),
        h3 ~ dnorm(0,10),
        h4 ~ dnorm(0,10),
        h5 ~ dnorm(0,10),
        h6 ~ dnorm(0,10),
        h7 ~ dnorm(0,10)
    ) ,
    data=d2 ,
    start=list(a=0,bp=0,bpC=0,h2=0,h3=0,h4=0,h5=0,h6=0,h7=0),
    iter=1e4 , warmup=1000 )
precis(m10.3b.stan)

       Mean  StdDev  lower 0.95  upper 0.95
a     -0.73    0.27       -1.27       -0.21
bp     0.83    0.26        0.36        1.37
bpC   -0.13    0.29       -0.68        0.48
h2    11.40    5.14        3.86       22.13
h3    -0.33    0.36       -1.02        0.39
h4    -0.32    0.36       -1.02        0.38
h5    -0.01    0.35       -0.71        0.66
h6     0.95    0.36        0.21        1.62
h7     2.55    0.46        1.69        3.49

This posterior is definitely not entirely Gaussian—look at the interval for h2. This is a ceiling effect, and it would be unidentified without a prior.

New plotting code (this really needs to be made easier to understand):

R code 11.11
post <- extract.samples( m10.3b.stan )   # posterior samples (call assumed)


for ( j in 1:7 ) {
    # predicted proportions for actor j in the four treatment combinations
    # (reconstruction; names and plotting positions are illustrative)
    h <- if ( j==1 ) 0 else post[[ paste("h",j,sep="") ]]
    pred.propleft <- sapply( 0:1 , function(ci)
        sapply( 0:1 , function(pl)
            mean( logistic( post$a + h + (post$bp + post$bpC*ci)*pl ) ) ) )
    lines( 1:4 + 4*(j-1) , as.vector(pred.propleft) )
}


11.1.4. Example: UCB graduate admissions. These are the famous University of California, Berkeley graduate admissions data, aggregated by academic department and applicant gender. Load them and code a dummy variable for male applicants:

R code 11.18
library(rethinking)
data(UCBadmit)
d <- UCBadmit
d$male <- ifelse( d$applicant.gender=="male" , 1 , 0 )
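The first model, called m1 in the figure that follows, predicts admission using only the male dummy. Here is a sketch of the fit and of the prediction plot that produces FIGURE 11.5; the priors, sample counts, and plotting details are assumptions:

m1 <- map(
    alist(
        admit ~ dbinom( applications , p ) ,
        logit(p) <- a + bm*male ,
        a ~ dnorm(0,10) ,
        bm ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bm=0) )
post <- sample.qa.posterior( m1 , n=1e4 )
# predicted probability of admission for each of the 12 rows of data
p.pred <- sapply( 1:nrow(d) , function(i)
    logistic( post$a + post$bm*d$male[i] ) )
pred.pr.admit <- apply( p.pred , 2 , mean )
pred.pr.admit.ci <- apply( p.pred , 2 , HPDI , prob=0.95 )
plot( d$admit/d$applications , col="slateblue" , ylim=c(0,1) ,
    xlab="row" , ylab="proportion admitted" )
lines( 1:12 , pred.pr.admit )
lines( 1:12 , pred.pr.admit.ci[1,] , lty=2 )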


lines( 1:12 , pred.pr.admit.ci[2,] , lty=2 )

FIGURE 11.5. Predictions for model m1. Blue points and lines are the data, converted to proportions admitted. Lines connect female (circles) and male (triangles) data from the same academic department. Black lines are the predictions arising from m1's naive posterior, with 95% HPDI's shown by the dashed lines.

Wow, pretty terrible predictions! There are only two departments in which females had a lower rate of admission than males, and yet the model says that females should expect to have a 14% lower chance of admission. What has gone wrong here?

These data have a hierarchical structure. What we really want is a multilevel model, but that will wait a few more chapters. For now, we'll use the dummy variable (fixed effects) approach, assigning dummies to each department, in order to account for different rates of overall admission among departments.

R code 11.19
# code dummy variables for each department
d$deptB <- ifelse( d$dept=="B" , 1 , 0 )
d$deptC <- ifelse( d$dept=="C" , 1 , 0 )
d$deptD <- ifelse( d$dept=="D" , 1 , 0 )
d$deptE <- ifelse( d$dept=="E" , 1 , 0 )
d$deptF <- ifelse( d$dept=="F" , 1 , 0 )

The first new model uses only the department dummies:

R code 11.20
m10.6 <- map(
    alist(
        admit ~ dbinom( applications , p ) ,
        logit(p) <- a + bdB*deptB + bdC*deptC + bdD*deptD +
            bdE*deptE + bdF*deptF ,
        c(bdB,bdC,bdD,bdE,bdF) ~ dnorm(0,10) ,
        a ~ dnorm(0,10)
    ) ,
    data=d ,
    start=list(a=0,bdB=0,bdC=0,bdD=0,bdE=0,bdF=0) )


The second adds the male dummy back in:

R code 11.21
m10.7 <- map(
    alist(
        admit ~ dbinom( applications , p ) ,
        logit(p) <- a + bdB*deptB + bdC*deptC + bdD*deptD +
            bdE*deptE + bdF*deptF + bm*male ,
        a ~ dnorm(0,10) ,
        bm ~ dnorm(0,10)
    ) ,
    data=d ,
    start=list(a=0,bm=0,bdB=0,bdC=0,bdD=0,bdE=0,bdF=0) )


FIGURE 11.6. Model-averaged predictions for the UCB admissions data.
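A sketch of how those averaged predictions could be computed. This assumes, as described later in the chapter, that a list of models can be passed to sample.qa.posterior, and that parameters absent from a model come back as zeros:

post <- sample.qa.posterior( list(m10.6,m10.7) , n=1e4 , nobs=nrow(d) )
p.pred <- sapply( 1:nrow(d) , function(i)
    logistic( post$a +
        post$bdB*d$deptB[i] + post$bdC*d$deptC[i] + post$bdD*d$deptD[i] +
        post$bdE*d$deptE[i] + post$bdF*d$deptF[i] +
        post$bm*d$male[i] ) )
pred.pr.admit <- apply( p.pred , 2 , mean )
pred.pr.admit.ci <- apply( p.pred , 2 , HPDI , prob=0.95 )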


As before, we can check the quadratic approximation against a Stan version of the bigger model:

R code 11.22
m10.7stan <- map2stan(
    alist(
        admit ~ dbinom( applications , p ) ,
        logit(p) ~ a + bdB*deptB + bdC*deptC + bdD*deptD +
            bdE*deptE + bdF*deptF +
            bm*male ,
        a ~ dnorm(0,10) ,
        bm ~ dnorm(0,10)
    ) ,
    data=d ,
    start=list(a=0,bm=0,bdB=0,bdC=0,bdD=0,bdE=0,bdF=0) ,
    iter=1e4 , warmup=1000 )
precis(m10.7stan)

       Mean  StdDev  lower 0.95  upper 0.95
a      0.68    0.10        0.49        0.88
bm    -0.10    0.08       -0.27        0.05
bdB   -0.04    0.11       -0.26        0.16
bdC   -1.26    0.11       -1.47       -1.05
bdD   -1.30    0.11       -1.49       -1.08
bdE   -1.74    0.13       -2.00       -1.49
bdF   -3.32    0.17       -3.64       -2.99

Go ahead and extract the samples and take a look at the pairs plot. You'll see that again the quadratic approximation has done a very good job.

11.1.5. Fitting binomial models with glm. This gives the same results as the map approach in the previous section:

R code 11.24
# formula assumed; glm takes a cbind of successes and failures as the outcome
m0g <- glm( cbind(admit,reject) ~ male + dept , data=d , family=binomial )

11.2. Poisson


Should I include an explanation of why this is "natural"? The easiest derivation is something like:

Pr(y|λ) = λ^y exp(−λ) / y!

Take the logarithm of the density function, since any exponential family model can be pulled apart this way:

log Pr(y|λ) = y log(λ) − λ − log(y!)

To find the "natural" parameter of the distribution, find the coefficient that multiplies the outcome, y. In this case, that is:

log(λ)

So the logarithm of the mean is the natural parameter of the Poisson. Could just put this derivation in an endnote, instead.

11.2.1. Example: Polynesian tool complexity. Data from Kline and Boyd 2010.¹⁰⁷ Hypotheses: (1) the number of tools increases with population size, (2) the number of tools increases with contact rate. Very small data set, so consider the hazards of assuming the posterior is normal. Will fit again later in the book using MCMC and then compare. Also possibly a good RJMCMC example, because it will compute reasonably quickly (only 10 cases!).

R code 11.25
library(rethinking)
data(islands)
d <- islands


FIGURE 11.7. Locations of the populations in the islands data. Map reproduced from Kline and Boyd 2010, supplemental.

The most basic Poisson model contains no prediction variables, just a constant mean:

n_i ~ Poisson(λ)

where n_i is the value of Total.Tools on row i and λ is the mean count.

Before fitting this model, though, let's establish the link function to use. We don't technically need a link right now, but it'll be easier to move on to the later models, if we establish the link now. Also, it will help with estimation. It's most common to use a log link with a Poisson model, such that the mean λ is modeled:

n_i ~ Poisson(λ_i)
log λ_i = α

This log link is the reason that Poisson models are often called log-linear models. The linear model part of the GLM is the logarithm of the observed mean. This actually implies, as you'll see in a bit, that on the real count scale the posited relationship will not be linear.

If you solve the second line above for λ_i, you find that this log link implies that λ_i = exp(α). We really do need a link here, otherwise the mean would be allowed to go below zero. If the linear model is exponentiated, as it is here, this guarantees that λ will be positive, even if the linear model itself is negative. Go ahead and try it, if you don't immediately intuit this truth.

Now here's the code to fit the model:

R code 11.26
m10.8 <- map(
    alist(
        Total.Tools ~ dpois( lambda ) ,
        log(lambda) <- a ,
        a ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0) )
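As invited above, one line is enough to check that exponentiation keeps the mean positive, even when the linear model is strongly negative:

exp( c(-5,-1,0,1,5) )
# [1]   0.0067   0.3679   1.0000   2.7183 148.4132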


R code 11.28
exp( coef(m10.8)[1] )

    a
34.8

And by no small coincidence, this is the same value you get by just empirically averaging the Total.Tools column.

Adding a predictor is as you'd expect by now: just add terms to the second line. For example, here is the second model we'd like to fit, using log-population as a predictor:

n_i ~ Poisson(λ_i)
log λ_i = α + β_P log P_i

And the code to fit it:

R code 11.29
d$log.pop <- log( d$population )   # population column name assumed
m10.9 <- map(
    alist(
        Total.Tools ~ dpois( lambda ) ,
        log(lambda) <- a + bp*log.pop ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bp=0) )

The remaining two models add contact rate and its interaction with log-population (the contact.high dummy construction is assumed):

R code 11.30
d$contact.high <- ifelse( d$contact=="high" , 1 , 0 )
m10.10 <- map(
    alist(
        Total.Tools ~ dpois( lambda ) ,
        log(lambda) <- a + bp*log.pop + bc*contact.high ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10) ,
        bc ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bp=0,bc=0) )
m10.11 <- map(
    alist(
        Total.Tools ~ dpois( lambda ) ,
        log(lambda) <- a + bp*log.pop + bc*contact.high +
            bcp*contact.high*log.pop ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10) ,
        bc ~ dnorm(0,10) ,
        bcp ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bp=0,bc=0,bcp=0) )


R code 11.31
compare( m10.8 , m10.9 , m10.10 , m10.11 , nobs=nrow(d) , DIC=TRUE )

         k    AICc  w.AICc  dAICc     DIC    pD  wDIC   dDIC
m10.10   3   81.05    0.76   0.00   76.99  2.97  0.53   0.00
m10.9    2   83.77    0.20   2.72   82.07  2.00  0.04   5.08
m10.11   4   86.91    0.04   5.86   77.44  3.26  0.42   0.45
m10.8    1  133.92    0.00  52.87  133.32  0.95  0.00  56.33

These rankings suggest that at least log-population is a useful predictor, and possibly contact rate as well.

First, you can inspect the estimates and see if the coefficients are in the predicted directions. Comparing estimates across models:

R code 11.32
coeftab(m10.8,m10.9,m10.10,m10.11)

      m10.8  m10.9  m10.10  m10.11
a      3.55   1.33    0.93    0.94
bp       NA   0.24    0.27    0.26
bc       NA     NA    0.29   -0.09
bcp      NA     NA      NA    0.04
nobs     10     10      10      10

And since it's hard to see the uncertainty around these estimates, FIGURE 11.8 shows the result of plotting the coeftab. In models m10.9 and m10.10, the main effects of both log-population and contact rate are consistent with the hypotheses: they are both positive, being associated with more tools. Interpreting m10.11 is harder, because we didn't center the predictors before fitting (remember centering, from Chapter 4). But the interaction estimate, at least, can be interpreted. It is positive, as it should be if high contact rate and population size act synergistically. But the size of the estimate is very small, and if you inspect the standard error on it, you'll see it's very uncertain.

To get a sense for what's being predicted, let's plot the model predictions, as usual. There are no new coding tricks here, aside from the detail of how to convert the linear model into the prediction scale. To illustrate that, I'm going to produce two plots. In the first, I'll draw the mean and 95% confidence interval of the mean, using the usual lines. In the second, I'll plot Poisson random samples from the posterior, so you can see how to simulate observations from a fitted Poisson model.

To calculate predicted mean counts, you just need the usual sapply code, now with the inverse link function embedded. This is exactly what you did when you plotted the binomial models. But now the link function isn't the logit, but rather the logarithm. So this code will compute the predicted mean and 95% confidence interval of the mean, for low contact and high contact islands, respectively:
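A sketch of that computation, using model m10.10; the sequence of log-population values and the PCI percentile intervals are illustrative choices:

post <- sample.qa.posterior( m10.10 , n=1e4 )
log.pop.seq <- seq( from=7 , to=12.5 , length.out=50 )
lambda.low <- sapply( log.pop.seq , function(z)
    exp( post$a + post$bp*z ) )
lambda.high <- sapply( log.pop.seq , function(z)
    exp( post$a + post$bp*z + post$bc ) )
mu.low <- apply( lambda.low , 2 , mean )
mu.low.ci <- apply( lambda.low , 2 , PCI , prob=0.95 )
mu.high <- apply( lambda.high , 2 , mean )
mu.high.ci <- apply( lambda.high , 2 , PCI , prob=0.95 )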


FIGURE 11.8. Means and 90% intervals for the posterior distributions of the parameters (bp, bc, bcp) of the islands models m10.9, m10.10, and m10.11.

FIGURE 11.9. Predicted mean tool counts (Total.Tools against log.pop) and 95% confidence intervals of the mean, for both low contact rate islands (open circles and black lines) and high contact rate islands (filled blue circles and blue lines).


[FIGURE 11.10 panels: model m10.10 (left) and model-averaged (right), each plotting Total.Tools against log.pop.]

FIGURE 11.10. Predicted tool counts, with low contact rate predictions shown by open black points and high contact rate shown by filled blue points. The actual island data are shown by the larger points. On the left, predictions for only model m10.10 show a clear separation between high and low contact rate islands. On the right, the model-averaged predictions are less confident about the importance of contact rate, and little separation appears between the filled and open points.

Since the mean of the Poisson distribution is just λ, to compute the mean you only need to exponentiate the linear model. When there is a log link, the inverse is exponential. Plotting these computations with the usual lines commands, the result is shown in FIGURE 11.9. The filled points are islands with high contact rate, while the empty points are islands with low contact rate. The black lines show predictions for the low-contact islands, while the blue lines show predictions for the high-contact islands. While the trend with increasing population size is quite strong, there is a lot of overlap between the high and low contact rate predictions. So while model comparison supports using contact rate to improve prediction, it isn't half as important as log-population size.

Now for simulating predictions from the model. The function you want now is rpois, which produces random Poisson-distributed values. We'll embed this function in sapply, just as you embedded rnorm in earlier chapters. Here's the code to produce 500 random predictions for both low and high contact rate islands:
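A sketch of one way to do this, drawing a random posterior sample for each simulated island (still using m10.10's samples in post; the names and ranges are illustrative):

npts <- 500
sim.tools <- function( z , hi ) {
    i <- sample( 1:length(post$a) , 1 )   # a random posterior sample
    rpois( 1 , lambda=exp( post$a[i] + post$bp[i]*z + post$bc[i]*hi ) )
}
xs <- runif( npts , min=7 , max=12.5 )    # random log-population values
y.low  <- sapply( xs , sim.tools , hi=0 )
y.high <- sapply( xs , sim.tools , hi=1 )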


In FIGURE 11.10, I show these simulated island tool counts. In both plots in this figure, the open black circles are predicted counts for low contact rate, while the filled blue circles are for high contact rate. The plot on the left represents predicted counts for just model m10.10, the model with only main effects of both log-pop and contact rate. This model shows a visible separation between open and filled predictions: high contact islands tend to have higher counts, although there is still considerable overlap. On the right, the model-averaged predictions show practically no visible separation at all. Indeed, there is less visible separation in these actual count predictions than there was in FIGURE 11.9, which showed only predicted means, not actual counts.

11.2.1.1. HMC islands. Is it Gaussian still?

R code 11.35
m10.11stan <- map2stan( m10.11 , data=d , iter=1e4 , warmup=1000 )


[FIGURE 11.11 shows two pairs plots of the parameters a, bp, bc, and bcp. In the uncentered model (left), several pairs are strongly correlated (for example, −0.98 between a and bp, −0.99 between bc and bcp); in the centered model (right), the correlations are weak.]

FIGURE 11.11. Posterior distributions for models m10.11stan (left) and m10.11stan.c (right).

R code 11.36
d$log.pop.c <- d$log.pop - mean(d$log.pop)   # centered log-population
m10.11stan.c <- map2stan(
    alist(
        Total.Tools ~ dpois( lambda ) ,
        log(lambda) <- a + bp*log.pop.c + bc*contact.high +
            bcp*contact.high*log.pop.c ,
        a ~ dnorm(0,10) ,    # priors other than bcp assumed
        bp ~ dnorm(0,1) ,
        bc ~ dnorm(0,1) ,
        bcp ~ dnorm(0,1)
    ) ,
    data=d ,
    start=list(a=log(mean(d$Total.Tools)),bp=0,bc=0,bcp=0) ,
    iter=1e4 , warmup=1000 )

The estimates are going to look different, because of the centering, but the predictions remain the same. But we're interested in the shape of the posterior distribution. Look now at the righthand pairs plot in FIGURE 11.11. Now those strong correlations are gone. And the Markov chain was also more efficient, resulting in a greater number of effective samples. Stan provides some numerical advice about how well the chain is mixing. We'll focus on n_eff, which is a crude estimate of the effective number of samples. When the chain explores the posterior well, n_eff will be larger. That means you'll need fewer samples to get a good picture of the posterior distribution. In these models, they sample so quickly that you might not care—you could take 100-thousand samples in under a minute. But with more data and more complex models, you might not want to wait the hours it takes to get a good picture. So recoding the data a little pays off.

To see Stan's n_eff estimates, use summary:

R code 11.37
summary(m10.11stan)

       mean  se_mean   sd   2.5%    25%    50%    75%  97.5%  n_eff  Rhat
a       0.9      0.0  0.4    0.2    0.7    0.9    1.2    1.7   2092     1
bp      0.3      0.0  0.0    0.2    0.2    0.3    0.3    0.3   2115     1
bc     -0.1      0.0  0.8   -1.8   -0.7   -0.1    0.5    1.5   2118     1
bcp     0.0      0.0  0.1   -0.1    0.0    0.0    0.1    0.2   2118     1
dev    74.3      0.1  2.6   71.3   72.4   73.6   75.4   80.9   2415     1
lp__  915.4      0.0  1.5  911.7  914.8  915.8  916.5  917.2   2123     1

And for the model with centered population size:

R code 11.38
summary(m10.11stan.c)

       mean  se_mean   sd   2.5%    25%    50%    75%  97.5%  n_eff  Rhat
a       3.3        0  0.1    3.1    3.3    3.3    3.4    3.5   3157     1
bp      0.3        0  0.0    0.2    0.2    0.3    0.3    0.3   3568     1
bc      0.3        0  0.1    0.0    0.2    0.3    0.4    0.5   3079     1
bcp     0.1        0  0.2   -0.3    0.0    0.1    0.2    0.4   4112     1
dev    74.9        0  2.8   71.4   72.8   74.2   76.3   81.7   3262     1
lp__  915.4        0  1.4  911.9  914.7  915.7  916.4  917.1   3260     1

Compare the two n_eff columns. It's like getting more than a thousand free samples, just by centering the predictor.

Another reason to do this is that it makes DIC more accurate. These models have 4 parameters each, and the priors are quite uninformative. So the estimated number of parameters should be very close to 4. Comparing the two:

R code 11.39
DIC(m10.11stan)
DIC(m10.11stan.c)

[1] 77.59912
attr(,"pD")
[1] 3.339596
[1] 78.82325
attr(,"pD")
[1] 3.962123

So the model with centered log.pop gets it right, because its posterior distribution better matches DIC's assumptions.

11.3. Multinomial

Example: categorize wines by score, data(Wines2012).

y_i ~ Multinomial(n, p_{1i}, p_{2i}, ..., p_{mi})

Basic notion: Each category gets a score, and that score is used to generate an unordered probability distribution, through standardization. So if we have m categories in the outcome distribution, then the probability of observing category y_i = k is:

Pr(y_i = k) = exp(φ_{ki}) / Σ_j exp(φ_{ji})
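This standardization is the softmax transform, and it is a one-line function; a minimal illustration with three hypothetical category scores:

softmax <- function( phi ) exp(phi) / sum( exp(phi) )
softmax( c(2,0.5,-1) )
# [1] 0.786 0.175 0.039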


12 Monsters and Mixtures

This chapter is about building more complex statistical models, by piecing together the simpler tools of previous chapters. The tension is always between the power we derive from the models and the dangers we face when we misunderstand or misapply them. Inference becomes simultaneously harder, riskier, and potentially more powerful.

Monsters are models that are pieced together from simpler, often exponential family, pieces to form specialized outcome distributions. Ordered categories and ranks are very common kinds of data that don't really suit any of the previous distributions. These kinds of data need their own, somewhat monstrous, probability densities.

Mixtures are models that comprise multiple stochastic processes, usually embedded within one another. Measurement error and other stochastic processes can be blended with the outcome distributions you've already seen to produce models that can estimate and cope with outcomes that depend upon more than one stochastic process.

12.1. Ordered Categorical Outcomes

Describe why we need a density for ordered categorical outcomes. Show examples. Very common in the social sciences. In the natural sciences, appears in ecological field data, for example.

12.1.1. An ordered probability density. The probability distribution we seek here will be born of stating how we'd like to be able to interpret it. The main criterion is that we'd like to operate with LOG-ODDS. Recall from the last chapter, when you met binomial GLM's, that the log-odds of any event E is just:

log( Pr(E) / (1 − Pr(E)) )

This is simply the logarithm of the odds. Back there (page 277), I explained the advantages of modeling the log-odds as an additive combination of parameters and variables. All of those reasons matter here, in the same ways.

So we want to work with log-odds, again. But there's also another desirable property of log-odds, which will help us now that we have, in effect, a multinomial distribution with more than two mutually exclusive outcomes: We should focus on the CUMULATIVE LOG-ODDS of each possible value. Let me explain why. The probability density we are after has a number of discrete, ordered observable values. Our goal is to get a formula—a probability density—that tells us the likelihood of each value. This function will be like the dnorm and dbinom and dpois of earlier GLM's. It will help us in getting to this goal if we start with cumulative probability, though, instead of the discrete probability of each observable value. Cumulative probability is just the probability of any value or smaller value. Why go cumulative at the start


here? Because cumulative probability naturally constrains itself to never exceed a total probability of one. And because this is an ordered density, we know that the cumulative log-odds of the largest observable value must be +∞, which is the same as a cumulative probability of one. This anchors the distribution and standardizes it at the same time. If you start instead with discrete individual probabilities of each outcome, then you'd have to later standardize these probabilities to ensure they sum to exactly one. It turns out to be easier to just start with the cumulative probability and then work backwards to the individual probabilities. I know, this seems weird. But I'll walk you through it.¹⁰⁸

What we want is for the cumulative log-odds of an observed value y_i being equal to or less than some possible value k to be:

log( Pr(y_i ≤ k) / (1 − Pr(y_i ≤ k)) ) = φ_k        (12.1)

where φ_k is a continuous value, different for each observable value k. We'll make this value into a linear model, in a bit. For now, it's just a placeholder. The above function is just a direct embodiment of the log-odds and cumulative density objectives we've stated so far. It actually says nothing else at all. Now we solve for the cumulative probability density itself. Do this by taking (12.1) and solving for Pr(y_i ≤ k). After a little algebra, you get:

Pr(y_i ≤ k) = exp(φ_k) / (1 + exp(φ_k))

You might recognize this probability as the logistic, same as in the last chapter. It arose in the same way, establishing the logistic function as the inverse link for the binomial model. But now we have a CUMULATIVE LOGISTIC, since the probability Pr(y_i ≤ k) is cumulative.

But we still need likelihoods, which are not cumulative. So how do you use this thing? Well, it's a probability density, so you can use it to define the likelihood of any observation y_i. By definition, in a discrete probability density, the likelihood of any observation y_i = k must be:

Pr(y_i = k) = Pr(y_i ≤ k) − Pr(y_i ≤ k − 1)        (12.2)

This just says that since the logistic is cumulative, we can compute the discrete probability of exactly y_i = k by subtracting the cumulative probability of the next lower observable value.

12.1.2. Putting the GLM in the φ. We're almost ready to walk through actually writing the code to fit models to this distribution. But before we get to coding, there is one more conceptual step. To build additive models of the log-odds values φ_k, we'll need to distinguish the intercept at each value k from the rest of the additive model.

In the simplest case, there are no predictor variables, and so each observable value k has a unique log-odds value φ_k that just translates between the cumulative probability scale and the log-odds scale of the model. For example, consider a case in which the observable values are the numbers 1, 2 and 3. Now the maximum likelihood estimate of the cumulative probability of observing a "1" is just going to be the observed proportion of 1's in the data. The estimate of the cumulative probability of 2-or-less is going to be the proportion of the observed values that are 2 or 1. The cumulative probability of a 3 must be exactly 1, because it's the maximum value. So if there are, say, 100 observations, and 31 of them are "1" and 49 of them are "2", then the estimated cumulative probabilities of 1, 2 and 3 are: 31/100, (31+49)/100, 1. Crunched into proportions, those values are: 0.31, 0.80, and 1. Now the φ


value corresponding to each is just the cumulative log-odds of each. So you can compute them directly from these proportions, just by using the formula for log-odds. For example, the log-odds for "2" must be:

φ_2 = log( 0.8 / (1 − 0.8) ) ≈ 1.39

That is what I mean when I say that each φ_k value translates between the probability and log-odds scales. If you know one, you can compute the other.

But we need a way to allow the distribution to change, as a result of predictor variables. This is where the generalized linear model comes back. Think of the example just above as a GLM in which there are only intercepts, no predictor variables. This is akin to a linear regression with only a single parameter, α, in the model of µ_i. In this case, however, there has to be a unique intercept for each observable value (1, 2 or 3, in the example). Otherwise, the distribution wouldn't be sufficiently flexible. Recall, we're resorting to this ordered categorical density because we don't have any basis to say that the "distance" between each observable value is the same. All these unique intercepts do is take an arbitrary cumulative probability density and translate it to cumulative log-odds.

So how do we get predictor variables and β coefficients back in here? Just add them to φ_k. But while there still needs to be a unique intercept for each observable value k, the rest of the additive model will be common to all values φ_k. In effect, this means pulling out a unique intercept parameter for each observable value, and then adding a common additive model to each. So now the cumulative log-odds of any value k will be defined as:

log( Pr(y_i ≤ k) / (1 − Pr(y_i ≤ k)) ) = α_k + φ_i

In turn, φ_i is an additive model that may contain slope parameters and predictor variables. It's functionally just like µ_i, from all of those Gaussian models you fit in earlier chapters.

What these assumptions do is allow the distribution of observable values to vary as a consequence of the predictor variables inside φ_i. This will make more sense, once you fit one of these models and plot its implied predictions. I'll walk you through that. But first, we need to actually code the density function that will make estimating the naive posterior possible.

Overthinking: Coding the ordered logistic density. A density function is a block of code that takes observed values and parameters as inputs and returns the likelihood (or log-likelihood) corresponding to those inputs. There's nothing stopping you from writing your own density functions. Then you can use them in map and in other calculations. The density function we need to perform ordered logistic regression, dordlogit, is already in the rethinking package. But this is a good spot to provide a glimpse at how R actually uses density functions to fit models. Here, we'll write our first completely custom density function.

You do, however, have to obey a small number of conventions, in writing a density function. First, the first parameter must be a vector of values to compute likelihoods for, and this vector must be named x, just because R expects it to have this name. Second, there must be a parameter named log, and when it is set to TRUE, the function should return log-likelihood, rather than ordinary likelihood. Ideally, the function will return one (log-)likelihood for each element of the input x. Other parameters defined in the function are for the parameters of the likelihood function, the µ's and σ's and such.
You can name those whatever you want.

Let's write the density function for the ordered categorical distribution. This function is already built into the rethinking package, but it's worth examining it, as an example of how to build your own density function. I'll present a working function definition here, and then I'll step through its pieces and explain the important parts.

R code 12.1
dordlogit <- function( x , phi , a , log=FALSE ) {
    a <- c( as.numeric(a) , Inf )   # cumulative log-odds, +Inf for the top value
    p <- logistic( a[x] - phi )     # cumulative probability of each x
    na <- c( -Inf , a )             # intercepts shifted down one value
    np <- logistic( na[x] - phi )   # cumulative probability of each x-1
    p <- p - np                     # discrete probability, as in (12.2)
    if ( log==TRUE ) p <- log(p)
    p
}

Note that the linear model phi is subtracted from each intercept, a point I'll return to once we add predictor variables.
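As a quick check, the function recovers the worked example above, assuming the two intercepts are the cumulative log-odds of the proportions 0.31 and 0.80:

a <- c( log(0.31/0.69) , log(0.80/0.20) )
dordlogit( 1:3 , phi=0 , a=a )
# [1] 0.31 0.49 0.20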


12.1.3. Example: Moral intuition. The data for this example come from a series of experiments conducted by philosophers, led by Cushman.¹⁰⁹ Yes, philosophers do sometimes conduct experiments. In this case, the experiments aim to collect empirical evidence relevant to debates about moral intuition, the forms of reasoning through which people develop judgments about the moral goodness and badness of actions. These debates are relevant to all of the social sciences, because they touch on broader issues of reasoning, the role of emotions in decision making, and theories of moral development, both in individuals and groups.

Specifically, these experiments address scenarios sometimes known as "trolley problems" that have proved vexing or paradoxical to moral philosophers. The reason these scenarios can be vexing is that the analytical context of them can be identical, and yet people reliably reach different judgments about the goodness or badness of the actions. Previous research has led to at least three important principles of unconscious reasoning that may explain variations in judgment. These principles are:

The action principle: Harm caused by action is morally worse than equivalent harm caused by omission.

The intention principle: Harm intended as the means to a goal is morally worse than equivalent harm foreseen as the side effect of a goal.

The contact principle: Using physical contact to cause harm to a victim is morally worse than causing equivalent harm to a victim without using physical contact.

The experimental context we'll explore these principles within comprises stories that vary these principles, while keeping many of the basic objects and actors the same. For example, the most renowned story is about a speeding boxcar. Here is the version of the boxcar story that implies the action principle, but not the others:

Standing by the railroad tracks, Dennis sees an empty, out-of-control boxcar about to hit five people. Next to Dennis is a lever that can be pulled, sending the boxcar down a side track and away from the five people. But pulling the lever will also lower the railing on a footbridge spanning the side track, causing one person to fall off the footbridge and onto the side track, where he will be hit by the boxcar. If Dennis pulls the lever, the boxcar will switch tracks and not hit the five people, and the one person will fall and be hit by the boxcar. If Dennis does not pull the lever, the boxcar will continue down the tracks and hit five people, and the one person will remain safe above the side track.

Since the actor (Dennis) had to do something to create the outcome, rather than remain passive, this is an action scenario. However, the harm caused to the one man who will fall is not necessary, or intended, in order to save the five. Thus it is not an example of the intention principle. And there is no direct contact, so it is also not an example of the contact principle. Now, how morally permissible is it to pull the lever? The data we'll be interested in predicting are responses to that question, rated on a scale from 1 (never permissible) to 7 (always permissible).

You can contrast the above story with the same outline, but now with both the action principle and the intention principle. That is, in this version, the actor both does something to change the outcome and the action must cause harm to the one person in order to save the other five:

Standing by the railroad tracks, Evan sees an empty, out-of-control boxcar about to hit five people.
Next to Evan is a lever that can be pulled, lowering the railing on a footbridge that spans the main track, and causing one person to fall off the


footbridge and onto the main track, where he will be hit by the boxcar. The boxcar will slow down because of the one person, therefore preventing the five from being hit. If Evan pulls the lever, the one person will fall and be hit by the boxcar, and therefore the boxcar will slow down and not hit the five people. If Evan does not pull the lever, the boxcar will continue down the tracks and hit the five people, and the one person will remain safe above the main track.

Most people judge that, if Evan pulls the lever, it is morally worse (less permissible) than when Dennis pulls the lever. You'll see by how much, as we analyze these data.

Load the data with:

R code 12.4
library(rethinking)
data(Trolley)
d <- Trolley
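The first model, m11.1, contains only the six intercepts. A sketch of the fit that the next passage examines, with the start values being rough guesses, in line with the discussion of starting values that follows:

m11.1 <- map(
    alist(
        response ~ dordlogit( phi , c(a1,a2,a3,a4,a5,a6) ) ,
        phi <- 0 ,
        c(a1,a2,a3,a4,a5,a6) ~ dnorm(0,10)
    ) ,
    data=d ,
    start=list(a1=-2,a2=-1,a3=0,a4=1,a5=2,a6=2.5) )   # rough guesses
precis(m11.1)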


Take a look at the parameter estimates. They are just log-odds estimates for the cumulative probability of each value. There is no estimate for the value "7", because the cumulative probability of the maximum value must be 1. I'm going to hold off on plotting the predictions from this model, until we have models with predictor variables to compare it to.

Overthinking: Starting values for intercepts. Before moving on to adding predictor variables to the model, it's worth commenting on how to get reasonable starting values for the intercepts. In the example here, when you fit m11.1, I just guessed at some starting values, and it turned out okay. But you won't always get so lucky. Now that we're working with non-linear models, maximum likelihood sometimes needs a helping hand to get to the global maximum. If you accidentally pick poor starting values for the parameters, you can get either non-sensical estimates or R may fail to find any likely estimates at all. This can be very frustrating.

A good method for picking starting values for the intercepts in an ordered categorical model of this kind is to just use the log-odds definition to find them. Here's an example, in which I compute the implied log-odds of each observable value, using the data.

R code 12.6
logodds <- sapply( 1:6 , function(k) {
    pk <- mean( d$response <= k )   # cumulative proportion of responses <= k
    log( pk / (1-pk) )              # convert to cumulative log-odds
} )
logodds


Why did I subtract the new additive portion from the intercept? If you look back at the code for dordlogit, you'll see that I've already snuck this assumption past you once. The reason for doing this is to make the parameter estimates easier to interpret, later. We would like a positive estimate for, say, β_A to indicate an increase in the average value of y_i, when A_i increases. But in order to increase the average y_i, we need to bunch up probability on the high end of the distribution. This corresponds to reducing the cumulative log-odds of every possible outcome value, leaving the remaining probability mass at the high end. I know this is weird. It will make sense once we estimate this model and plot its predictions, across changes in the predictor variables. So bear with me.

You fit this model just as you'd expect, by adding the slopes and predictor variables to the phi parameter inside dordlogit. Here's a working model:

R code 12.7
m11.2 <- map(
    alist(
        response ~ dordlogit( phi , c(a1,a2,a3,a4,a5,a6) ) ,
        phi <- bA*action + bI*intention + bC*contact ,
        bA ~ dnorm(0,10),
        bI ~ dnorm(0,10),
        bC ~ dnorm(0,10),
        c(a1,a2,a3,a4,a5,a6) ~ dnorm(0,10)
    ) ,
    data=d ,
    start=list(bA=0,bI=0,bC=0,
        a1=-1.9,a2=-1.2,a3=-0.7,a4=0.2,a5=0.9,a6=1.8) )

And here is the third model, which adds the two interactions with intention:

R code 12.8
m11.3 <- map(
    alist(
        response ~ dordlogit( phi , c(a1,a2,a3,a4,a5,a6) ) ,
        phi <- bA*action + bI*intention + bC*contact +
            bAI*action*intention + bCI*contact*intention ,
        bA ~ dnorm(0,10),
        bI ~ dnorm(0,10),
        bC ~ dnorm(0,10),
        bAI ~ dnorm(0,10),
        bCI ~ dnorm(0,10),
        c(a1,a2,a3,a4,a5,a6) ~ dnorm(0,10)
    ) ,
    data=d ,
    start=list(bA=0,bI=0,bC=0,bAI=0,bCI=0,
        a1=-1.9,a2=-1.2,a3=-0.7,a4=0.2,a5=0.9,a6=1.8) )


No new tricks here. The above just adds two interaction terms and two interaction parameters, bAI and bCI.

Now let's compare these three models. You can use coeftab to get a quick comparison of estimates:

R code 12.9
coeftab(m11.1,m11.2,m11.3)

       m11.1  m11.2  m11.3
a1     -1.92  -2.84  -2.63
a2     -1.27  -2.16  -1.94
a3     -0.72  -1.57  -1.34
a4      0.25  -0.55  -0.31
a5      0.89   0.12   0.36
a6      1.77   1.02   1.27
bA        NA  -0.71  -0.47
bI        NA  -0.72  -0.28
bC        NA  -0.96  -0.33
bAI       NA     NA  -0.45
bCI       NA     NA  -1.27
nobs    9930   9930   9930

Whatever do these estimates mean? The first six rows, from a1 to a6, are just the α intercepts, one for each value below the maximum of "7". These really can't be interpreted on their own, unless you are very used to reading log-odds values.

The next five rows, from bA to bCI, are the various slope parameters: three main effects and two interactions. These are interpretable on their own, to a limited extent. It makes sense to ask, first, if they are very far from zero. You can check the standard errors and 95% confidence intervals with precis and verify that all of the slope estimates are quite reliably negative. Second, all of the slopes are negative, which implies that each factor or interaction reduces the average response. Including action, intention or contact in a story leads people to judge it as less morally permissible. But by how much? Remember, these parameters are part of a function defining cumulative log-odds, so they can be interpreted as changes in cumulative log-odds. But unless you are very comfortable thinking about log-odds and cumulative probability densities, that doesn't help you much. It also doesn't help that this change applies to the cumulative log-odds of every value of the response variable, aside from the maximum one (which is fixed at cumulative log-odds ∞).

So what to do? When in doubt, plot the model's predictions. Let's compare these models using AICc:

R code 12.10
compare( m11.1 , m11.2 , m11.3 , nobs=nrow(d) )

        k      AICc  w.AICc   dAICc
m11.3  11  36929.21       1    0.00
m11.2   9  37089.91       0  160.70
m11.1   6  37854.46       0  925.25

Now, since model m11.3 absolutely dominates based upon AICc—it gets very nearly 100% of the weight—we can safely proceed here, ignoring model averaging. But if the AICc's had turned out to be less decisive, the only thing you'd change in the code that follows is passing a list of models to sample.qa.posterior and then making sure to plot enough samples from the posterior to represent the influence of low-probability models.

There is no perfect way to plot the predictions of these cumulative log-odds models. But a common and useful way is to use the horizontal axis for a predictor variable and the vertical axis for cumulative probability. Then you can plot a curve for each response value, as it changes across values of the predictor variable. After plotting a curve for each response value, you'll end up mapping the distribution of responses, as it changes across values of the predictor variable.

So let's do that. First, as usual, sample from the naive posterior:

R code 12.11
post <- sample.qa.posterior( m11.3 , n=1000 )
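The plotting itself follows the pattern below — a minimal sketch for the upper-left panel of FIGURE 12.1, with action and contact both set to zero; subtracting the linear model from each intercept mirrors dordlogit's convention:

plot( NULL , xlim=c(0,1) , ylim=c(0,1) ,
    xlab="intention" , ylab="probability" )
aj <- 0   # action
cj <- 0   # contact
for ( k in 1:6 ) {
    ak <- post[[ paste("a",k,sep="") ]]
    phi0 <- post$bA*aj + post$bI*0 + post$bC*cj +
        post$bAI*aj*0 + post$bCI*cj*0
    phi1 <- post$bA*aj + post$bI*1 + post$bC*cj +
        post$bAI*aj*1 + post$bCI*cj*1
    lines( 0:1 , c( mean(logistic(ak-phi0)) , mean(logistic(ak-phi1)) ) ,
        col="slateblue" )
}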


[FIGURE 12.1 panels: action=0, contact=0; action=1, contact=0; action=0, contact=1. Each plots probability (vertical) against intention (horizontal).]

FIGURE 12.1. Predictions of the ordered categorical model with interactions, m11.3. Each plot shows how the distribution of predicted responses varies by intention. Upper-left: Effect of intention when action and contact are both zero. The other two plots each change either action or contact to one.

I know this plotting code is complicated, compared to previous chapters. But as the models become more monstrous, so too does the code needed to compute predictions and display them. With power comes hardship.

Finally, plot the MAP predictions, so it's easier to see where the center of each boundary is located:

R code 12.13
# MAP boundary locations for the upper-left panel (reconstruction; names assumed)
pmap <- sapply( c(0,1) , function(int)
    sapply( 1:6 , function(k)
        logistic( mean( post[[paste("a",k,sep="")]] ) - mean(post$bI)*int ) ) )


The results are shown in FIGURE 12.1. In each plot, the blue lines indicate the boundaries between response values, numbered 1 through 7, bottom to top. The thickness of the blue lines corresponds to the variation in predictions due to variation in samples from the posterior. Since there is so much data in this example, the path of the predicted boundaries is quite certain.

The horizontal axis represents values of intention, either zero or one. The change in height of each boundary going from left to right in each plot indicates the predicted impact of changing a story from non-intention to intention. Finally, each plot sets the other two predictor variables, action and contact, to either zero or one. In the upper-left, both are set to zero. This plot shows the predicted effect of taking a story with no-action, no-contact, and no-intention and adding intention to it. In the upper-right, action is now set to one. This plot shows the predicted impact of taking a story with action and no-intention (action and contact never go together in this experiment, recall) and adding intention. This upper-right plot demonstrates the interaction between action and intention. Finally, in the lower-left, contact is set to one. This plot shows the predicted impact of taking a story with contact and no-intention and adding intention to it. This plot shows the large interaction effect between contact and intention, the largest estimated effect in the model.

12.1.3.3. What about repeat measures? You might have realized at the start of this example that each participant responded to multiple (more than 30) different scenarios ("case" in the data frame). In other words, we have repeat measures on each participant. Not only that, but each participant responded to different versions of the same story ("story" in the data frame). So not only are there repeat measures by participant, but there are also repeat measures by story.

These data are crying out for a multilevel analysis, using random effects. Many readers, whether they've learned about random effects before or not, may have heard of a related concern called pseudo-replication. It stands to reason that all of the responses from a single participant are likely correlated with one another, if for no other reason than that some people like to give high responses and others give low responses. What we're really interested in is how the factors action, intention and contact change a person's response, accounting for these clustering effects. We might also be interested in the strength and patterning of the clustering itself. Whether you're interested in the clustering itself or not, failure to model it can result in imprecise or misleading estimates.

There are some good ways to deal with this kind of issue. The best approach is to use an actual multilevel (also known as hierarchical) model. You'll have to wait until the next chapter for that solution. Another approach, quite common still in fields like economics, is to just add a categorical variable to the model, labeling each cluster of possibly-correlated outcomes with a different dummy variable. This is known as the fixed effects approach, for reasons you can worry about when you get to Chapter 13. In this case, the fixed effects approach is pretty expensive, computationally. There are 331 different participants in the data, so that implies 330 new independent intercept parameters! These parameters allow each participant to have their own unique average response.
Then the slope parameters will indicate changes within participants, rather than changes relative to an average statistical participant.

Asking you to fit such a model, using map, would be an act of torture. You'd have to construct 330 new dummy variables and then type in an absurdly long formula for phi, with an equally absurd start list. So before we leave ordered categorical outcomes behind, the


last thing I'd like to do is teach you how to use the convenient function polr to fit the same models you did above with map.

12.1.4. Using polr. Using polr to model the fixed effect on id.

R code 12.14
library(MASS)
# formula details assumed; polr wants the outcome as an ordered factor
m11.4 <- polr( as.ordered(response) ~ action + intention + contact +
    action:intention + contact:intention + as.factor(id) ,
    data=d , Hess=TRUE )


answering the same scenarios. So is it fair to reward m11.4 for predicting the average responses of these specific individuals? I think not. Now, the estimates for the slope parameters in m11.4 may indeed produce better out-of-sample predictions than those in m11.3. But we can't easily decide that from the AICc comparison above, because the comparison is dominated by the individual intercepts.

Instead, we'd rather have a way to focus on particular aspects of prediction, like the slope parameters, while ignoring others, like the individual intercepts. This concern is sometimes called the question of focus. We're not going to deal with this concern in any detail, right now. But when you reach Chapter 13, the question of focus will appear again.

12.2. Ranked Outcomes

Problem with ranks: Must predict the entire vector at once, because only one item can be #1. Use the Keener and Waldman approach with orthant probabilities. Code for this is ready. Need a good example data frame. Alex's village data? Ryan's field data?

12.3. Variable Probabilities: Beta-binomial

Throughout the book so far, we've been implicitly assuming that the "true" value of each parameter was the same for every unit in the data. Every individual person, chimpanzee, nation, or location had the same α and the same β and the same p. But suppose this isn't the case. Suppose instead that different units in the data actually experience different underlying probabilities or rates.

You've already seen an empirical case in which this possibility was important, when you analyzed the UCB admissions data earlier in the previous chapter. In ignoring the fact that people applying to different departments had different overall probabilities of admission, we were led to the wrong conclusion. It was only by estimating a different baseline, or intercept, probability in each case that we could get the model to tell us what was obvious from plotting the data.

This kind of situation is quite common. Whenever there are clusters of some kind in the data, they might experience different probabilities of an event, because of unobserved factors linking the individual observations within each cluster. In the case of the academic departments, selection criteria and funding amounts are clustered by department, and so exert a common causal influence on everyone who applies, even as deviations within a department may be due to differences among applicants.

In this section, I'm going to show you the simplest way to begin to model this kind of heterogeneity among units, without using the fixed effects approach we used for the UCB admissions data. There are a number of reasons to move beyond the fixed effects approach, the approach of assigning a dummy variable to each cluster. I'll speak to these reasons in much more depth in the next chapter. For now, it is sufficient to realize that the average probability in each cluster does inform us about the average probabilities in the other clusters. The fixed effects approach instead assumes that the per-cluster probabilities are completely independent of one another, and therefore the estimate of each takes no advantage of the data from the other clusters. So it will help us to pool the information and make better inferences, if we explicitly model the varying probabilities by cluster.

The second important reason, for now, is that often we'd like to actually estimate the distribution of the probabilities across units. We'll need such an estimate, if we want to generalize predictions to units we haven't yet sampled.
For example, what about academic departments


not in the UCB admission data? How can we make useful predictions for them, if we don't know their overall probabilities of admission? Well, if we know the distribution of overall probabilities, then we can at least give bracketed predictions, categorized by the 10% most selective and the 10% least, for example. This is important, because highly selective departments reject most everyone, and so good prediction will have to account for this possibility. Likewise, in most field biology contexts, we'd like to generalize to streams, or individuals, or transects that aren't in our data. The fixed effects approach doesn't tell us the shape of the heterogeneity among units, and so it can't help us predict future units very well.

Finally, sometimes the scientific question itself is about heterogeneity. How much variation is there among units? In non-linear models, you can't just start calculating sums of squares in classic ANOVA fashion, so some more subtle approach is needed. One approach is to actually model the heterogeneity and estimate the parameters that describe it. That's what we're going to do now.

12.3.1. Stacking distributions. What we're going to do to address this issue is build a mixture model, a model that stacks together two different probability distributions, so that one or more parameters in the top distribution are given their distributions by the lower distributions. The top distribution will codify our assumptions about the observed outcomes, as usual. The second distribution will codify our assumptions about the parameters of the first distribution. This will allow us to explicitly model the distribution of the binomial probabilities.

What does a mixture model look like? We'll build one of the most common and useful ones here, the beta-binomial model. This model combines, as the name suggests, a binomial outcome distribution with a beta distribution that defines the probabilities in the population. Each unit i in the data produces an observed count, y_i. These counts are assumed to emerge from different underlying probabilities p_i, however, with each unit i having a different unknown probability p_i. These p_i's in turn come from a beta distribution with shape parameters A and B. This is much like the way the y_i values come from a binomial distribution, except that we don't get to see the p_i's. Instead we have to infer them from the observable y_i's. This is naturally a bit confusing, so hang on, and we'll apply the approach and give you a chance to work through it in the context of an empirical example.

Before getting a better idea of what the beta distribution looks like, it'd help to view the above two-node model from another perspective. Viewed purely causally, nature is sampling from the beta to produce a probability that then generates binomial outcomes. Nature samples a different p_i for each unit i. This generates heterogeneity among units. Another way to see the same model, viewing it more mathematically now, is that the second node defines the prior probability density for p_i. Remember, when we assign a distribution to a parameter in a Bayesian perspective, we are assigning it a prior to be updated in light of the data. And so in this context we are saying that we expect the values p_i to have a beta prior. But instead of merely assuming a flat beta density, out of ignorance, we can use the data itself to estimate the prior. This isn't cheating, because there is nothing cheating about estimating a density using Bayes' theorem.
It’s just that the prior for p i is the posterior for Aand B. So we will have to make some kind of ignorance assumptions for A and B, as you’llsee. ose will be the prior for those parameters. But they will be updated using the data,to produce a posterior for A and B, which will in turn become the prior for p i . Remember,every posterior is something’s prior. Sometimes priors come from aspects of the data, whileother priors in the same model do not.


[FIGURE 12.2 panels: alpha=1, beta=1; alpha=3, beta=3; alpha=4, beta=1; alpha=0.9, beta=0.8. Each plots density against p_i over the interval from zero to one.]

FIGURE 12.2. The beta distribution defines the probabilities of different probabilities, p_i. The horizontal axis in each plot is the range of binomial probabilities from which to draw samples. It is a parameter, but now with a different value for each case i. The vertical axis is the likelihood (density) of each probability, according to a beta distribution with shape defined by the A and B values shown at the top of each plot.

12.3.2. Beta distribution. You haven't seen a beta distribution before, unless it was outside this book, so it's worth saying a little about it. The BETA DISTRIBUTION is a probability density for probabilities themselves, continuous values between zero and one. It is a member of the exponential family, just like the Gaussian and Poisson and binomial. But I didn't mention it before, because it isn't so often used as a top-level distribution.¹¹⁰ Instead, it becomes valuable when we start stacking and mixing probability distributions.

The beta distribution is very flexible, and its shape is defined by two parameters, A and B. Sadly, the conventional names for these parameters are α and β, and that's how they appear in most books. But since α and β are the same symbols often used to build linear models, I'm going to use the "fancy" capital letter names for them. Hopefully this will reduce any confusion for now. As you'll see a little later, we're going to re-parameterize the beta distribution anyway.

Let's aim to understand how the distribution can take different shapes. You can see some of its characteristic shapes, and how they correspond to the values of A and B, in FIGURE 12.2.


The beta distribution can take many shapes, from flat (A = 1, B = 1, top-left), to peaked in the middle (A = 3, B = 3, top-right), to highly skewed (A = 4, B = 1, bottom-left), to massing probability at the ends near zero and one (A = 0.9, B = 0.8, bottom-right). You can explore more shapes by changing the A and B values in the following code:

R code 12.20
A <- 4
B <- 1
curve( dbeta( x , A , B ) , from=0 , to=1 , xlab="p" , ylab="Density" )


But we're going to be interested in using the solution numerically, not manipulating it analytically. So we want the beta-binomial density function, within R. Several packages, including rethinking, provide it. The rethinking package calls it dbetabinom.

12.3.4. Beta-binomial links. The fundamental parameters of the beta distribution are A and B, as you saw above. But it's usually going to be more convenient for us to model the beta distribution as a function of its mean, the average probability drawn from it, and its dispersion, how spread out the distribution is. Under this parameterization, the mean is called p̄ (say "pee-bar") and the dispersion θ. These two parameters relate to A and B in a simple way:

p̄ = A / (A + B) ,    θ = A + B

This way of using the beta distribution is very common, and so functions like dbetabinom already expect input in this form, using the labels prob for p̄ and theta for θ.¹¹¹

Redefining the beta distribution in this way is equivalent to using a link function. It's just like using a link, because neither of the parameters A nor B is the mean. But we'd like to model the mean, so we need a function to link these parameters to the mean. Since p̄ is the mean, it defines our link function. But we're also going to need to use log-odds again, so that our linear model translates logically to the probability space between zero and one. So what does the model look like, under this link strategy? Defining the logit of p̄ as a linear model, we get:

y_i ~ BetaBinomial(n_i, p̄_i, θ)
log( p̄_i / (1 − p̄_i) ) = α + β x_i

Effectively, we've pushed the logistic down one level of stochasticity, since the binomial level is now determined by the beta probabilities. The parameters that are directly estimated in this case will be α, β, and θ.

Now, in some cases, you may need to also use a link for θ. Even if you don't want to model the dispersion as a linear model—and I'm not going to encourage that in most circumstances, mainly because trying to model both p̄ and θ with linear models is a formula for stumbling into a nonidentifiable model—even if you don't want to do that, you have to restrict θ to be greater than zero. An easy approach to doing this is just to use a log link, just like the mean λ of a Poisson, so that the estimates on the natural scale are always positive. This is what it'd look like:

y_i ~ BetaBinomial(n_i, p̄_i, θ)
log( p̄_i / (1 − p̄_i) ) = α + β x_i
log θ = τ

The name τ is arbitrary. It just needs to be different from θ, so you can keep in mind that you've estimated a parameter on the log scale as a stand-in for θ. Then you'd estimate α, β, and τ as parameters. To get the posterior for θ, you'd just exponentiate the posterior for τ. I'll use this kind of link in one of the model fits to come, so you'll get to see what it looks like in code form.
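Since the two parameterizations carry the same information, converting between them is a one-liner in each direction; a minimal sketch (the function names are just for illustration):

# from shape parameters to mean and dispersion
ab2pbar  <- function( A , B ) A / (A + B)
ab2theta <- function( A , B ) A + B
# and back again
pbar2A <- function( pbar , theta ) pbar * theta
pbar2B <- function( pbar , theta ) (1 - pbar) * theta
ab2pbar( 4 , 1 )    # 0.8
pbar2A( 0.8 , 5 )   # 4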


12.3.5. Example: Bolker's reed frogs. We'll use an example also from Ben Bolker: mortality data on reed frog tadpoles variably exposed to aquatic predators at experimentally determined densities. You can load the package and data with:

R code 12.21
library(rethinking)
data(reedfrogs)
d <- reedfrogs
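The two models compared just below are a plain binomial, m12.4, and a beta-binomial, m12.5, each with only an intercept. These are sketches reconstructed from the estimates and starting values discussed below; the dbetabinom argument order, priors, and start values are assumptions:

m12.4 <- map(
    alist(
        surv ~ dbinom( density , p ) ,
        logit(p) <- a ,
        a ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0) )

m12.5 <- map(
    alist(
        surv ~ dbetabinom( density , pbar , theta ) ,
        logit(pbar) <- a ,
        log(theta) <- tau ,
        a ~ dnorm(0,10) ,
        tau ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,tau=log(2)) )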


The θ parameter controls how dispersed the beta distribution is. When θ is large, the beta density is bunched up around the mode. When θ is small, the density is more spread out or, at very low values, even bunched up on the edges of the probability space. Starting the search at log-odds equal to zero and θ = 2 defines a flat beta distribution.

Let's compare the fits of these model estimations, before taking a look at the estimates:

R code 12.23
compare(m12.4,m12.5,nobs=sum(d$density))

        k    AICc  w.AICc  dAICc
m12.5   2  273.09       1    0.0
m12.4   1  574.49       0  301.4

Okay, so the beta-binomial model fits the data much, much better, even after accounting for having one more parameter. Now how do its inferences differ, such that it does such a better job? Taking a look at the estimates from both models:

R code 12.24
coeftab(m12.4,m12.5)

      m12.4  m12.5
a      0.84   0.92
tau      NA   0.96
nobs     48     48

So the log-odds estimates, a, aren't so different. For the binomial model, the mean posterior probability and 95% HPDI are:

R code 12.25
post <- sample.qa.posterior( m12.4 , n=1e4 )
logistic( mean(post$a) )   # mean probability, about 0.70
logistic( HPDI(post$a) )   # interval, about 0.67 to 0.73


FIGURE 12.3. Estimated distribution of survival probabilities (black lines) over the empirical distribution of survival proportions (blue), for the tadpole data. Dashed curves show the 95% percentile confidence interval of the beta density estimates, sampled from the posterior.

Remember that the shape parameters A and B are each functions of both p̄ and θ (α + β). So when we characterize the posterior distribution of the beta distribution, we have to account for the covariance between these two parameters. That is, we aren't going to compute a confidence interval for each or either parameter, but rather for the distribution implied by correlated samples of both parameters. Weird to say, but the code will make it easier to understand. Here's how we do it:
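A sketch of that computation; dbeta2 is the rethinking package's beta density parameterized by prob and theta, and the sequence and sample counts here are illustrative:

p.seq <- seq( from=0.01 , to=0.99 , length.out=100 )
post <- sample.qa.posterior( m12.5 , n=1000 )
# the beta density implied by each correlated sample of (a, tau)
dens.mat <- sapply( 1:1000 , function(i)
    dbeta2( p.seq , prob=logistic(post$a[i]) , theta=exp(post$tau[i]) ) )
dens.mean <- apply( dens.mat , 1 , mean )
dens.ci <- apply( dens.mat , 1 , PCI , prob=0.95 )
dens( d$propsurv , col="slateblue" )   # empirical proportions surviving
lines( p.seq , dens.mean )
lines( p.seq , dens.ci[1,] , lty=2 )
lines( p.seq , dens.ci[2,] , lty=2 )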


The results are visible in FIGURE 12.3. The black estimated curves show the same skewed shape as the empirical density. This is what the beta-binomial model is capable of coping with, while the binomial model assumes instead a concentrated density, peaked at 0.7 on the horizontal axis in FIGURE 12.3, with 95% of its probability between 0.67 and 0.73 (the HPDI you computed earlier). Thus the pure-binomial model does a poor job of describing the heterogeneity in survival across cases, while the beta-binomial is built to do better at this kind of job.

Still, you might ask if the beta-binomial model is just doing better here because we haven't yet included predictor variables to soak up some of that heterogeneity. After all, the central purpose of GLM's is to explain variation in outcomes using variation in predictors. So let's include some predictors, now.

Adding predators. Let's add a dummy variable for the presence of predators, which is likely to explain a lot of variation in survival. To make the dummy and fit both new models:

R code 12.29
d$pred.yes <- ifelse( d$pred=="pred" , 1 , 0 )
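The two new fits, m12.6 and m12.7, add the predator dummy to each linear model. These sketches are reconstructed from the estimates below, with priors and start values assumed:

m12.6 <- map(
    alist(
        surv ~ dbinom( density , p ) ,
        logit(p) <- a + bp*pred.yes ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bp=0) )

m12.7 <- map(
    alist(
        surv ~ dbetabinom( density , pbar , theta ) ,
        logit(pbar) <- a + bp*pred.yes ,
        log(theta) <- tau ,
        a ~ dnorm(0,10) ,
        bp ~ dnorm(0,10) ,
        tau ~ dnorm(0,10)
    ) ,
    data=d , start=list(a=0,bp=0,tau=log(2)) )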


R code 12.30
compare(m12.6,m12.7,nobs=sum(d$density))

        k    AICc  w.AICc  dAICc
m12.7   3  233.96       1   0.00
m12.6   2  274.27       0  40.31

Again, we can compare the estimates to get a better idea of how the beta-binomial is improving prediction, by accounting for heterogeneity:

R code 12.31
coeftab(m12.6,m12.7)

      m12.6  m12.7
a      2.54   2.21
bp    -2.65  -2.21
tau      NA   2.26
nobs     48     48

And again the estimated log-odds (a) are similar across both models, as is the new estimate for the effect of predation on log-odds, bp. Predators being present reduces survival probability, unsurprisingly. In fact, it takes the average survival from logistic(2.2) ≈ 0.9 down to logistic(2.2−2.2) ≈ 0.5.

But what is theta doing? Again, it's describing the heterogeneity that remains. Now to plot the predictions, we'll have to treat this somewhat like an interaction effect, making one plot for when predators are absent and another for when predators are present. Here's the code to compute the mean and confidence interval, when predators are absent (I'll leave it to the reader to modify it for when predators are present):

R code 12.32
post <- sample.qa.posterior( m12.7 , n=1000 )
# then proceed as with the beta density intervals above, using
# prob = logistic( post$a ) , since pred.yes = 0 when predators are absent


[FIGURE 12.4 panels: predators absent (left) and predators present (right), each plotting density against proportion surviving.]

FIGURE 12.4. Estimated distribution of survival probabilities (black lines) over the empirical distribution of survival proportions (blue), for the tadpole data, now separated by cases in which predators were absent (left) and present (right). Dashed curves show the 95% percentile confidence interval of the beta density estimates, sampled from the posterior.


12.4. Variable Rates: Gamma-Poisson

The gamma-Poisson model is to the Poisson what the beta-binomial is to the binomial. In a gamma-Poisson model or process, the mean number of events per unit time in the Poisson, λ, is assumed to vary among sampling units. These units may be different vials of fruit flies or strands of DNA or soccer teams or, as you'll model in this section, Swedish roads.

You can write down the mathematical form of the gamma-Poisson model, just like you did for the beta-binomial before.

y_i ~ GammaPoisson(µ_i, s)
log µ_i = α + β x_i

I've gone ahead and established the link function we'll need, as well. Let me explain this model, one line at a time. The top line represents the stochastic process that produces the actual observations, Poisson-distributed values y, one observed for each unique case i. This Poisson process is governed by a collection of rate parameters, λ_i, one for each unit i in the data. These varying rates in turn arise from the second line of the model. The second line represents a deeper stochastic process that creates variable λ values, according to a gamma distribution. The gamma is governed by two parameters, a mean µ_i that is unique to each unit i and a common dispersion parameter s ("scale"). Finally, we use a log link to model µ_i, because that will ensure the gamma mean is always positive. Another common choice, as you saw in Chapter 9, is the inverse link, which can misbehave but is nevertheless the natural link for the gamma. It's also very common to use a log link for the scale parameter, to ensure it doesn't go below zero.

Why a gamma distribution? Why not use the beta again, or even a normal distribution? The first virtue of a gamma distribution is that it has the right domain for values of λ. The mean number of events in a Poisson process must be a positive real value. The gamma distribution produces positive real values. Neither the beta nor the normal does. The second virtue is that the gamma is the natural sister distribution to the Poisson. If you start with a gamma-distributed prior and update it with Poisson observations, the resulting posterior will also be a gamma distribution. This is just like the beta is to the binomial. The practical value of this fact is that the likelihood function for the gamma-Poisson model is easy to solve for and compute. This will not necessarily be true for the more complex multilevel models that you'll meet in the next chapter.

Like with the beta and the binomial, note that the gamma distribution in the model effectively defines the prior distribution for the parameter λ of the Poisson. The data, Poisson observations, will allow us to estimate the shape of this prior. The parameters of the gamma, in light of the data, will have a posterior that provides a distribution of λ_i values.

Since you met the gamma already in Chapter 7, we can jump right into using the gamma-Poisson now. That's what's next.

12.4.1. Example: Swedish traffic accidents. The data for this example come from an experiment conducted by the Swedish government in the early 1960s. They wanted to know how effective imposing speed limits would be for reducing traffic accidents. On some days a speed limit was enforced, while on other days it was not. These data have some interesting aspects we'll glide over for now. We're just going to be interested in estimating the distribution of traffic accidents, using the speed limit as a predictor variable.

Load the data with:


R code 12.34
library(rethinking)
data(Traffic)
d <- Traffic


A plain Poisson model of these data will assume that every day in the data experienced one of two accident rates, either the no-limit rate (presumably higher) or the limited rate (presumably lower). The gamma-Poisson model, in contrast, will instead assume that each day in the data experiences a different rate, drawn from a gamma distribution. However, days on which the speed limit was in effect will draw from a different gamma distribution, with a different mean, than those on which there was no limit.
Here's the code to fit both the plain Poisson and the gamma-Poisson:
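The fitting code is cut off in this copy. The sketch below is consistent with the parameter names that appear in the precis output further down (a, bl, s); the dummy coding of limit, the priors, and the start values are my assumptions.

# dummy variable: 1 on days the speed limit was in effect
d$limit_yes <- ifelse( d$limit=="yes" , 1 , 0 )

# plain Poisson
m12.10 <- map(
    alist(
        y ~ dpois( lambda ),
        log(lambda) <- a + bl*limit_yes,
        a ~ dnorm(0,10),
        bl ~ dnorm(0,1)
    ) , data=d )

# gamma-Poisson, with dispersion s
m12.11 <- map(
    alist(
        y ~ dgampois( mu , s ),
        log(mu) <- a + bl*limit_yes,
        a ~ dnorm(0,10),
        bl ~ dnorm(0,1),
        s ~ dcauchy(0,2)
    ) , data=d , start=list(a=3,bl=0,s=2) )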


In a model comparison (not shown in this copy), the gamma-Poisson model m12.11 dominates, even over the plain Poisson that includes the speed limit variable. This suggests that most of the variation in the data cannot be easily explained by the speed limit. You'll see how the gamma-Poisson deals with that variation, when we plot the predictions a little bit later.
Before plotting, though, it'll be useful to look at the parameter estimates. For model m12.11, which dominates the model weights:

R code 12.38
precis( m12.11 )

     Mean StdDev  2.5% 97.5%
a    3.14   0.03  3.07  3.20
s    2.19   0.34  1.53  2.85
bl  -0.19   0.06 -0.30 -0.07

The estimate for the speed limit is negative, −0.19, and reliably below zero. This indicates that the speed limit did indeed reduce accidents, as assumed. But by how much? To find out, we have to reverse the link function, to get back on the count scale. The mean of the gamma will then, in turn, be the mean λ of the Poisson. The mean λ will be the mean observed count. And so we just need to exponentiate the linear model to get a MAP predicted mean number of accidents. Without and with the speed limit, the mean predictions are:

R code 12.39
exp( 3.14 )
exp( 3.14 - 0.19 )

[1] 23.10387
[1] 19.10595

So the speed limit possibly reduced the number of accidents, on average, by 4 per day. Whether you think that is a lot or not depends upon how much you care about each accident.
But the speed limit has done more than just reduce the mean. Since the variance of counts tends to scale with the mean, it has also reduced the variation in the number of accidents. Importantly, it may have reduced the high range of counts, such that there are many fewer days with counts above, say, 30 or more, when the speed limit was in effect. In order to see the data and predictions this way, we'll need to plot the distribution of predictions, not just consider the central estimate.
I'll provide the code here to plot the predicted distribution of counts over the empirical distribution, for days on which the speed limit was in effect. The reader should be able to modify the code to produce the sister plot for days on which there was no speed limit. The plot will use samples from the posterior, as always. It'll also explicitly compare the Poisson and gamma-Poisson predictions. Here's the code, reconstructed around the fragments that survive in this copy (the setup lines and the Poisson loop are assumptions):

R code 12.40
post <- extract.samples( m12.11 )
n <- 100      # number of posterior curves to draw
fade <- 0.15  # transparency of each curve
# empirical distribution of counts on speed limit days
dens( d$y[ d$limit_yes==1 ] , col="slateblue" , xlim=c(0,50) ,
    xlab="accidents" )
# gamma-Poisson predictive curves, one per posterior sample
for ( i in 1:n ) {
    dy <- sapply( 0:50 , function(z)
        dgampois( z , mu=exp(post$a[i] + post$bl[i]) ,
            scale=post$s[i] ) )
    lines( 0:50 , dy , col=col.alpha("black",fade) )
}
# plain Poisson predictive curves, for comparison
post2 <- extract.samples( m12.10 )
for ( i in 1:n ) {
    dy <- dpois( 0:50 , lambda=exp(post2$a[i] + post2$bl[i]) )
    lines( 0:50 , dy , col=col.alpha("black",fade) )
}

FIGURE 12.5. Predicted densities of accident counts, according to both Poisson and gamma-Poisson models. Left-hand plot shows days with the speed limit. Right-hand plot shows days without it. The blue density in each plot is the empirical count distribution. The black transparent curves represent the gamma-Poisson (wider group of curves) and Poisson (narrower) prediction distributions. Each curve is a single sample from the posterior.


FIGURE 12.5 shows the payoff of the gamma-Poisson process. The numbers of accidents across days are highly variable, much more variable than a Poisson process can suggest. In particular, on days without the speed limit (right-hand plot), there is a very long tail of high accident numbers. On days with the speed limit, this tail is greatly reduced, although there are still some days with more than 40 accidents, even then. The gamma-Poisson distributions can cover this heterogeneity in number of accidents, while the Poisson curves in both cases are far too narrow and too symmetrical.
You can simulate counts and directly estimate the probability of, say, 30 or more accidents per day with and without the speed limit. This will help me show you how to work with the posterior some more, as well as demonstrate that the speed limit has a more appreciable effect on the variation than it does on the mean. Remember, the average number of accidents reduced by the speed limit is only 4. But let's ask how much the upper tail gets reduced. Here is how we might compute the predicted proportions of days, with and without the limit, with 30 or more accidents.
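R code 12.41 is cut off in this copy; the sketch below shows one way to do the computation, using rgampois from the rethinking package. The number of simulated days is arbitrary.

post <- extract.samples( m12.11 )
n <- 1e4
idx <- sample( 1:length(post$a) , n , replace=TRUE )
# simulate daily counts with and without the limit
y_limit <- rgampois( n , mu=exp(post$a[idx] + post$bl[idx]) ,
    scale=post$s[idx] )
y_nolimit <- rgampois( n , mu=exp(post$a[idx]) , scale=post$s[idx] )
mean( y_limit >= 30 )    # proportion of days with 30+ accidents, limit on
mean( y_nolimit >= 30 )  # same, limit off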


12.5. Variable Process: Zero-Inflated Outcomes

Suppose we are counting birds, and a count of zero can arise in two ways: (1) measurement error, (2) no birds present. Let p be the probability of an error. Let λ be the Poisson mean of actual bird densities. So the likelihood of observing a zero is:

Pr(0 | p, λ) = Pr(error | p) + Pr(no error | p) Pr(0 | λ)
             = p + (1 − p) exp(−λ)

And the likelihood of a non-zero value y is:

Pr(y | p, λ) = Pr(no error | p) Pr(y | λ)
             = (1 − p) λ^y exp(−λ) / y!

So define ZIPoisson as the distribution above, with parameters p (probability of a zero) and λ (mean of the Poisson) to describe its shape. Then a zero-inflated Poisson regression takes the form:

y_i ∼ ZIPoisson(p_i, λ_i)
logit(p_i) = α_p + β_p x_i
log λ_i = α_λ + β_λ x_i

Notice that there are two linear models and two link functions, one for each process in the ZIPoisson. The parameters of the linear models differ, because any predictor such as x may be associated differently with each component.
To fit to data, this'll work:
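The call itself is truncated in this copy. Below is a sketch of the pattern with map and dzipois; the data frame and variable names (d, y, x) and the priors are placeholders, and the precis output further down shows the actual model evidently used two predictors rather than one.

m12.12 <- map(
    alist(
        y ~ dzipois( p , lambda ),
        logit(p) <- ap + bp*x,     # process producing extra zeros
        log(lambda) <- al + bl*x,  # Poisson process for the counts
        ap ~ dnorm(0,1),
        al ~ dnorm(0,10),
        bp ~ dnorm(0,1),
        bl ~ dnorm(0,1)
    ) , data=d )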


The Stan version of this model, m12.12stan, was fit with data=d and start=list(ap=0, an=1, bfp=0, bfn=0, bep=0, ben=0). Its estimates:

precis(m12.12stan)

     Mean StdDev lower 0.95 upper 0.95
ap   1.12   0.15       0.86       1.45
an   2.16   0.05       2.08       2.25
bfp -0.37   0.32      -1.06       0.28
bfn  0.06   0.09      -0.13       0.23
bep  0.01   0.09      -0.18       0.18
ben  0.03   0.03      -0.02       0.09

Overthinking: Zero-inflated Poisson density function. The function dzipois is implemented in a way that guards against some kinds of numerical error. So its code looks confusing—just type dzipois at the R prompt and see. But really all it's doing is implementing the likelihood formula defined in the section above. Here's a more transparent version of the function, in which it's much easier to see the mixture definition:
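Only the function's name survives in this copy, so the body below is written directly from the likelihood formulas above rather than copied from the source.

R code 12.45
dzip <- function( x , p , lambda , log=TRUE ) {
    ll <- ifelse(
        x==0 ,
        p + (1-p)*exp(-lambda) ,           # Pr(0|p,lambda)
        (1-p)*dpois(x,lambda,log=FALSE) )  # Pr(y|p,lambda), y > 0
    if ( log==TRUE ) ll <- log(ll)
    return(ll)
}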


$ kg.meat : num 0 0 0 4.5 0 0 0 0 0 0 ...
$ hours   : num 6.97 9 1.55 8 3 7.5 9 7.3 8.1 6 ...
$ datatype: num 1 1 1 1 1 1 1 1 1 1 ...

The outcome is kg.meat, the mass of meat returned to camp by an individual trip. It is a mix of zeros and positive real values.
Zero-augmented gamma:
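The data preparation code is garbled in this copy. A sketch of what this step needs: an indicator for the zeros and a second data frame holding only the positive harvests. The column name harvest is inferred from the start values used below.

d$zero <- ifelse( d$kg.meat==0 , 1 , 0 )
d2 <- d[ d$kg.meat > 0 , ]
d2$harvest <- d2$kg.meat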


Only the tail of the fitting code for this model, m12.13stan, survives in this copy. It shows Normal(0,10) priors on ma2, pa, and pa2, a Cauchy(0,2.5) prior on s, data=d2, start values start=list(ah=mean(log(d2$harvest)), af=0, ma=0, ma2=0, pa=0, pa2=0, s=1), and the constraint constraints=list(s="lower=0"). The estimates:

precis(m12.13stan)

     Mean StdDev lower 0.95 upper 0.95
ah   1.95   0.01       1.93       1.96
af  -0.16   0.02      -0.21      -0.12
ma   0.00   0.01      -0.02       0.01
ma2 -0.05   0.01      -0.07      -0.04
pa  -0.01   0.02      -0.04       0.03
pa2  0.26   0.02       0.23       0.29
s    3.01   0.05       2.89       3.10

From just the numbers, we can say that harvests decline with age (negative ma2), while failures increase with age (positive pa2). For a more detailed view, we'll need to compute posterior predictions.


13 Multilevel Models

Intro idea: Case of Clive Wearing, as described by his wife Deborah in her book Forever Today. Wearing is a conductor and pianist who acquired anterograde amnesia after a case of Herpes simplex encephalitis. He can still play the piano and conduct, but he cannot form new long-term memories. He is always forgetting recent events.
Statistical models that are not multilevel are similar, in the sense that as they move from one cluster (individual, group) in the data to another, trying to estimate parameters for each cluster, they forget everything about the previous clusters. They behave this way because the model assumptions force them to, assuming that cluster intercepts and slopes contain no joint information. This would be like forgetting you had ever been in a coffee shop, each time you go to the coffee shop. Coffee shops do vary, but information about one does help your expectations about the others. As it is in life, so it is in statistics: anterograde amnesia is bad for learning about the world. We want models that instead use all of the information in savvy ways. Multilevel models do that.

13.1. What are multilevel models good for?

The basic reason to use a multilevel model is that your data are clustered somehow. By clustered, I mean that observations (cases) within different groups differ in their averages, resulting in correlations within groups. As a result, describing these differences can improve prediction and inference. There are two common forms of clustering: (1) repeat observations of the same entities and (2) observations of separate entities that are associated temporally, spatially, or causally. In my mind, these are really the same thing, but it's worth providing examples of each.
Repeat observations arise when we observe the same individual (person, tadpole, chimpanzee) multiple times. This may arise because we use the same individuals in multiple treatments of an experiment (within-subjects design). Or it may arise because some individuals simply appear more often by chance. Another common way to realize repeat observations is with a time series, in which each individual appears once at each time step, but many times within the data. In all of these contexts, heterogeneity among individuals will lead to a particular pattern of variation between observations: variation will tend to cluster, if only a little, by individual.
The entities that exhibit clustering could just as easily be days of the week, cities, or streams. We might observe individuals only once, but many individuals on common days and in common places. Observations of all individuals on Monday, for example, may cluster, because of unobserved causal influences on that day. Likewise, observations of all individuals from a common stream or other location may also cluster. In this way, we actually have repeat observations from the same time or place, and the problem is structurally similar to the repeat observation of individuals. Again, clustering may arise.


In some circles, these repeat observation and clustering scenarios are known as PSEUDOREPLICATION, or false replication. 113 Pseudoreplication can mean two distinct things, so you have to be careful with the term. The sense that is meant here is the presence of potentially clustered observations, for either of the reasons above. We might have repeat observations of the individual or of multiple individuals in the same place. Inference for such a data structure can confound classical analysis of variance, and the pseudoreplication concern usually arises in this context.
Classical analysis of variance can be confounded in these cases, because it is the wrong model for our hypothesis about how the data arose. The pseudoreplication concern has so permeated branches of biology that a senior colleague of mine once announced at a prestigious conference that we should treat all observations from the same individual as a single data point! Another colleague once told me that multiple fish from the same tank are equivalent to sampling a single fish! These colleagues are throwing the data out with the bath water. Of course more data is almost always better. Individuals don't always behave the same way, so multiple samples from each of them are helpful. Likewise, not every fish in the same tank behaves the same way. Thankfully, multilevel models make pseudoreplication a non-issue, in almost all cases. The mistake is to think that we can't change the model to reflect the story we believe gave rise to the data. Instead, those colleagues wanted to change the data to reflect the model.
A similar classical concern arises from MULTIPLE COMPARISON. If we have many groups in the data, whether the "groups" are collections of individuals or rather of observations from the same individuals, then there are potentially many comparisons of group means. With many tests comes an inflated chance of a significant test result. This problem has led to a number of post-estimation adjustments to P-values, such as Bonferroni corrections and such, in hopes of reducing the rate of false alarms. But an essential problem with these approaches is that they continue to assume that every group is independent of the others, and as a consequence, the estimates themselves are misleading in the first place. Post hoc correction of P-values cannot fix them. A better approach is to use a better model, one in which the estimates are not independent of one another. A multilevel model will do the job.
When data are clustered, and we have good hypotheses about how that clustering might be structured, a multilevel model can provide better estimates of the variation within and between groups, as well as better estimates of common influences across groups (treatment effects, for example). So whether you are interested in the variation among groups, in itself, or rather in the effect of common predictors across groups, you are usually going to be better off using a multilevel model. In any context in which you are tempted to add dummy variables for even a modest number of groups, you are almost certainly better off using a multilevel model with "random" effects, rather than a single-level model with "fixed" effects.
There are some costs of the multilevel approach to modeling, however. The first is that we have to make more assumptions. We have to define the distributions from which the average characteristics of the clusters arise. This is just like what we did in the previous chapter, with the beta-binomial and gamma-Poisson models. There are also new estimation challenges that come with the full multilevel approach, in which we not only estimate the parameters of the distribution of cluster characteristics, but also estimate each cluster's unique characteristic. This estimation challenge leads us headfirst into MCMC estimation. Finally, multilevel models can be hard to understand, not only because they stack together stochastic, and even non-additive, effects, but also because they make predictions at different levels of the data.


In many cases, we are interested in only one or a few of those levels, and as a consequence, model comparison using metrics like AIC and DIC becomes more difficult. The basic logic remains unchanged, but now we have to make more decisions about which parameters in the model we wish to focus on.
There's a lot to accomplish in this chapter. First, let's explore the advantages of the multilevel approach, in the context of a familiar data example from the previous chapter.

13.2. Multilevel tadpoles

In the previous chapter, I used the tadpole predation data, data(reedfrogs), to illustrate the value of explicitly modeling the heterogeneity among units in the data. In that case, the units were separate aquarium tanks with tadpoles in them. The outcome of interest was the number of survivors after a fixed period. Even after accounting for differing treatment effects among the tanks, there was still substantial variation in survival rate across tanks. As a result, the beta-binomial model that explicitly modeled that variation could produce a better description of the data.
That model, recall, was specified as:

y_i ∼ Binomial(n_i, p_i)
p_i ∼ Beta(p̄, θ)

where y_i is the number of survivors in tank i, p_i is the probability of each tadpole in tank i surviving, n_i is the number of tadpoles in tank i at the beginning of the experiment, and finally the parameters to be estimated are p̄ and θ. These are the parameters of the beta distribution, which estimates the shape of the variation among tanks.
But what is conspicuously missing from our work with this model are estimates of the survival probability, p_i, in each tank. We defined those parameters, but we never estimated them. We just estimated their distribution. Wouldn't it be nice to have the individual tank estimates, in addition to the estimates defining the distribution of which they are members? If we ever hope to make decent predictions on a per-tank basis, knowing the p_i's would be a huge help.
Of course, maybe in this case you don't care much about predicting anything in each tank, because it was an experiment, and so the tanks are regarded as ephemeral collections. But in many cases, we do care about the specific collections, because they are lasting entities. Suppose for example that the "tanks" were streams or ponds in the natural world. Now making accurate predictions is suddenly more important, and in addition identifying streams with unusually high mortality may lead to useful measurement that identifies additional causal variables. But even when we don't care about the unique clusters in our sample, it helps to estimate the per-cluster parameters, because doing so can provide better estimates of any treatment effects in the model.
Call these p_i parameters, all 48 of them, VARYING EFFECTS. 114 They are parameters like any other, when used to make predictions. But they are varying among clusters of observations in the data. The clusters here are tanks, and the observations are individual survive/die events of tadpoles. Varying effects provide the needed per-cluster parameters to aid in inference about both unique clusters and predictor variables. These parameters are also commonly called random effects. They stand in contrast to the so-called fixed effects or constant effects parameters of a single-level model, in which no assumption is made about the distribution from which the parameters arise.


So how do we modify the tadpole beta-binomial model to estimate the p_i parameters? I'm going to take a step to the side here and drop the beta distribution. Instead we'll move forward using a normal distribution now for the log-odds of survival in each tank, rather than a beta distribution of the probability (p_i) in each tank. I do this because this normal distribution assumption is the most common general approach to solving this problem, and it's also the easiest to estimate. In the next chapter, I'll show you how to bring the beta distribution back, if you want to. But you can get all of the conceptual lessons, either way.
What we want to do is define the log-odds of survival in each tank j as α + α_j. The parameter α, with no j, is what it has usually been: the average log-odds in the entire sample. But each α_j, with a j, is the deviation from this global average within each tank j. These will be the varying intercepts for each tank. Then we state that each varying intercept comes from a normal distribution with mean zero and standard deviation σ. In sum, the multilevel tadpole survival model is:

s_ij ∼ Binomial(n_i, p_ij)
log( p_ij / (1 − p_ij) ) = α + α_j
α_j ∼ Normal(0, σ)
α ∼ Normal(0, 1)
σ ∼ Cauchy(0, 1)

You can fit this model, using techniques I'll summarize later in this chapter, with:

R code 13.1
library(rethinking)
data(reedfrogs)
d <- reedfrogs
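The rest of the fitting code is cut off in this copy. Below is a sketch consistent with the model above and with the names used in the next step (m13.1, sigma_tank); the reedfrogs columns surv and density are real, but the exact call is an assumption.

d$tank <- 1:nrow(d)
m13.1 <- map2stan(
    alist(
        surv ~ dbinom( density , p ),
        logit(p) <- a + a_tank[tank],
        a_tank[tank] ~ dnorm( 0 , sigma_tank ),
        a ~ dnorm(0,1),
        sigma_tank ~ dcauchy(0,1)
    ) , data=d )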


This model fit provides estimates for 50 parameters: one overall sample intercept α, the standard deviation among tanks σ, and then 48 per-tank intercepts α_j. Let's check DIC, though, to see the effective number of parameters:

R code 13.2
DIC(m13.1)

[1] 212.0444
attr(,"pD")
[1] 40.26574

Only 40 effective parameters. There are 10 fewer effective parameters than actual parameters, because the prior assigned to each α_j shrinks them all towards zero. In this case, the prior is reasonably strong. Check the average value of sigma_tank with precis or coef and you'll see it's around 1.6. So each α_j is assigned a Gaussian prior with mean zero and standard deviation 1.6. This is a REGULARIZING PRIOR, like you've used in previous chapters, but now the amount of regularization has been learned from the data itself.
You can plot the median per-tank estimates with:
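The plotting code is truncated here; the sketch below reproduces the elements of FIGURE 13.1, assuming the fit m13.1 from above (propsurv is a real reedfrogs column):

post <- extract.samples( m13.1 )
# median per-tank survival probabilities, on the probability scale
d$propsurv.est <- logistic( median(post$a) +
    apply( post$a_tank , 2 , median ) )
plot( d$propsurv , ylim=c(0,1) , pch=16 , col="slateblue" ,
    xlab="tank" , ylab="probability of survival in tank" )
points( d$propsurv.est )
# overall average survival, on the probability scale
abline( h=logistic(median(post$a)) , lty=2 )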


FIGURE 13.1. Empirical proportions of survivors in each tadpole tank, shown by the filled blue points, plotted with the 48 per-tank estimates from the multilevel model, shown by the black circles. The dashed line locates the overall average proportion of survivors across all tanks. The vertical lines divide tanks with different initial counts of tadpoles: 10 (left), 25 (middle), and 35 (right). Each value on the horizontal axis is the index of a tank (row) in the data. In every tank, the estimate from the multilevel model is closer to the dashed line than the empirical proportion is. This reflects the pooling of information across tanks, to help with inference about each tank.

13.2.1. Pooling. All three of these phenomena arise from a common cause: pooling information across clusters (tanks) to improve estimates. What POOLING means here is that each tank provides information that can be used to improve the estimates for all of the other tanks. Each tank helps in this way, because we made an assumption about how the varying log-odds in each tank, α_j, related to all of the others. We assumed a distribution, the normal distribution in this case. Once we have a distributional assumption, we can use Bayes' theorem to optimally share information among the clusters.
Think of it this way. Suppose you encounter the data for each tank of tadpoles in sequence. Before you see the outcome of the first tank, tank 1, you can have only a naive notion of what the observed survival proportion will be. But once you see the outcome of tank 1, you've gained information and your guess has improved. Now the second tank comes along. Before you realize the outcome of the second tank, can you say anything about its proportion that will survive? Of course you can, because you have the outcome of the first tank to improve your prediction. You won't be too confident in your guess, perhaps. But it'll be better on average than your truly naive guess was for tank 1. And once the second tank's data are realized, you can improve your guess from generalizing the first tank's outcome, using the data from the second tank.


You can think of this procedure like the Bayesian updating example in Chapter 2. Tank 1's estimate improves our estimate for tank 2.
But wait. There's nothing special about the order the tanks appear in, as long as they are independent. What if we lost the data for tank 1, and so started our analysis with tank 2? After calculating the observed proportion surviving in tank 2, an assistant finds the lost data for tank 1. Now, before you see what is written on the data sheet, can you say anything about what the proportion will be? Again, of course you can, because you know what happened in tank 2. You take that educated guess and update it, once you are handed the actual data for tank 1. And so tank 2's estimate improves our estimate for tank 1.
In truth, this relationship is symmetric. Tank 1's data improves the estimate for tank 2, and likewise tank 2's data improves the estimate for tank 1. As a result, if you do the estimation correctly, according to Bayes' theorem, you end up with estimates for both tanks in which you have pooled all the information to improve each estimate. But as a result, neither estimate is exactly the same as the naive estimate you'd get by forcing the estimate for each tank to ignore the other tank. Instead, both estimates have shrunk towards the mean proportion across both tanks. These estimates will exhibit shrinkage, due to pooling information.
To gain a better appreciation of the shrinkage phenomenon, and why it varies as it does across the tanks, it will help to directly address the inferential benefits of pooling information in this way. That's where we turn next.

13.3. Varying effects and the underfitting/overfitting tradeoff

The first major benefit of using these varying effects estimates, instead of the empirical raw estimates, is that they provide more accurate estimates of the individual cluster (tank) intercepts. 115 On average, the varying effects actually provide a better estimate of the individual tank (cluster) means. In this section, I'll explain why this is true, in the context of building the model above, the one with varying intercepts by tank.
The basic reason that the varying intercepts provide better estimates is that they do a better job of trading off underfitting and overfitting. You first met this distinction and estimation problem in Chapter 6. To understand this in the context of our current data example, suppose that instead of tanks we had ponds with some continuity through time, so that we might be concerned with making predictions for the same clusters in the future. We'll approach the problem of predicting future survival in these ponds from two extreme perspectives.
First, suppose you ignore the varying effects and just use the overall mean across all ponds, α, to make your predictions for each pond. A lot of data contributes to your estimate of α, and so it can be quite precise. However, your estimate of α is unlikely to exactly match the mean of any particular pond. As a result, the total sample mean underfits the data. This approach is sometimes known as COMPLETE POOLING, because you pool all the data from all ponds to produce a single estimate that is applied to every pond. It's equivalent to assuming that ponds do not vary at all in their mean survival probability.
In contrast, suppose you use the raw empirical survival proportions to make predictions for each pond. This is like using a separate constant regression intercept for each pond. In previous chapters, you estimated models like this by including a dummy variable for each cluster in the data, like with the academic departments in the Berkeley admissions data. The blue points in FIGURE 13.1 are this same kind of estimate. In each particular pond, quite little data contributes to each estimate, and so these estimates are rather imprecise. This is particularly true of the smaller ponds, where less data goes into producing the estimates.


As a consequence, the error of these estimates is high, and they are rather overfit to the data. Standard errors for each intercept can be very large, and in extreme cases, even infinite. These are sometimes called the NO POOLING estimates. No information is shared across ponds, in this case. It's like assuming that the variation among ponds is infinite.
When you estimate varying intercepts, however, you use PARTIAL POOLING of information to produce estimates for each group that are less underfit than the grand mean and less overfit than the no pooling estimates. As a consequence, they tend to be better estimates of the true per-cluster (per-pond) means. This will be especially true when ponds have few tadpoles in them, because then the no pooling estimates will be especially overfit. When a lot of data goes into each pond, then there will be less difference between the varying effect estimates and the no pooling estimates, but any difference will still be expected to be to the varying effect estimates' advantage.
It's useful to demonstrate this fact, but to do so, we'll have to simulate some tadpole data. In that case, we'll know the true per-pond survival probabilities. Then we can compare the no-pooling estimates to the varying effect estimates, by computing how close each gets to the true values they are trying to estimate.
The rest of this subsection outlines how to do such a simulation. Learning to simulate and validate models and model fitting in this way is extremely valuable. Once you start using more complex models, you will want to ensure that your code is working and that you understand the model. You can help in this project by simulating data from the model, with specified parameter values, and then making sure that your method of estimation can recover the parameters within tolerable ranges of precision. Even just simulating data from a model structure has a huge impact on understanding, in my experience.
The first step is to define the model we'll be using. I'll use the same basic multilevel binomial model as before:

s_ij ∼ Binomial(n_i, p_ij)
log( p_ij / (1 − p_ij) ) = α + α_j
α_j ∼ Normal(0, σ)
α ∼ Normal(0, 1)
σ ∼ Cauchy(0, 1)

So to simulate data from this process, we need to assign values to:
• α, the average log-odds of survival in the entire population of ponds
• σ, the standard deviation of the distribution of log-odds of survival among ponds
• α_j, a series of individual pond intercepts, one for each pond j.
We'll also need to assign sample sizes, n_i, to each pond. But once we've made all of those choices, we can easily simulate counts of surviving tadpoles, straight from the top-level binomial process, using rbinom.
Assign values to the parameters. I'm going to assign specific values representative of the actual tadpole data, to make the upcoming plot that demonstrates the increased accuracy of the varying effects estimates. But you can come back to this step later and change them to whatever you want.
Here's the code to initialize the values of α, σ, the number of ponds, and the sample size n_i in each pond.


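The initialization and simulation code is cut off in this copy. The sketch below uses values of my own choosing, scaled to resemble the tadpole data; the sixty ponds with sample sizes 5, 10, 25, and 35 match FIGURE 13.2, and the dsim column names (ni, si, true_a) are assumptions reused in the steps below.

a <- 1.4        # average log-odds of survival
sigma <- 1.5    # standard deviation among ponds
nponds <- 60
ni <- as.integer( rep( c(5,10,25,35) , each=15 ) )
# simulate the true per-pond intercepts, then the survivor counts
true_a <- rnorm( nponds , mean=0 , sd=sigma )
dsim <- data.frame( pond=1:nponds , ni=ni , true_a=true_a )
dsim$si <- rbinom( nponds , size=dsim$ni ,
    prob=logistic( a + dsim$true_a ) )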


Compute the no-pooling estimates. We're ready to start analyzing the fake data now. The easiest task is to just compute the no-pooling estimates. We can accomplish this straight from the empirical data, just by calculating the proportion of survivors in each pond. I'll keep these estimates on the probability scale, instead of translating them to the log-odds scale, because we'll want to compare the quality of the estimates on the probability scale later.

R code 13.8
# observed proportion surviving in each pond (column name assumed)
dsim$p.nopool <- dsim$si / dsim$ni


R code 13.11
logistic( 1.1 )

[1] 0.7502601

So the average rate of survival in the population of ponds is just around 75% (in this simulation—yours will differ some). To get the expected probability of survival for any pond j, just add the varying intercept for it to α. For example, for pond 1:

R code 13.12
logistic( 1.1 + 1.54 )

[1] 0.933392

Now that we've found these estimates, let's compute the predicted survival proportions, according to them, and add those proportions to our growing simulation data frame. I'll call the column varying.pi, to indicate that it contains the expected per-pond survival probabilities, derived from the varying effect estimates.
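The computation code is truncated here. A sketch, assuming the multilevel model was fit earlier (as, say, m13.3) with varying intercepts named a_pond; only the column name varying.pi comes from the text:

post <- extract.samples( m13.3 )
estimated.aj <- apply( post$a_pond , 2 , median )
a_est <- median( post$a )
dsim$varying.pi <- logistic( a_est + estimated.aj )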


FIGURE 13.2. Error of no-pooling and varying effects estimates, for the simulated tadpoles. The horizontal axis displays pond number, j. The vertical axis measures the absolute error in the predicted proportion of survivors, compared to the true value used in the simulation. The higher the point, the worse the estimate. No-pooling empirical estimates of proportion surviving in each tank are shown by the filled blue points. Varying estimates are shown by the black circles. The dashed blue and black lines show the average error for each kind of estimate, across each initial density of tadpoles (sample size). Smaller tanks produce more error, but the varying estimates are also always better, on average, especially in smaller tanks.

Second, the varying effects estimates always have the advantage, at least on average. Third, and this is the observation that will help you grasp shrinkage, the distance between the blue dashed line and the black dashed line grows the smaller the pond gets. So while both kinds of estimates suffer from reduced sample size, the varying effect estimates suffer less.
Okay, so what are we to make of all of this? Remember, back in FIGURE 13.1, the smaller tanks demonstrated more shrinkage towards the mean. Here, the ponds with the smallest sample size show the greatest improvement over the naive no-pooling estimates. This is no coincidence. Shrinkage towards the mean results from trying to negotiate the underfitting and overfitting risks of the grand mean on one end and the individual means of each pond on the other. The smaller tanks/ponds contain less information, and so their varying estimates are influenced more by the pooled information from the other ponds. In other words, small ponds are prone to overfitting (high variance), and so they receive a bigger dose of the underfit grand mean (bias), to move them towards the optimal tradeoff of underfitting and overfitting. Likewise, the larger ponds shrink much less, because they contain more information and are prone to less overfitting. Therefore they need less correcting to negotiate the bias-variance (underfitting-overfitting) tradeoff.


When individual ponds are very large, pooling in this way does hardly anything to improve estimates, because the estimates don't have far to go. But in that case, they also don't do any harm, and the information pooled from them can substantially help prediction in smaller ponds.
The partially pooled estimates provide better average prediction. And they do so because they adjust individual cluster (pond) estimates to negotiate the tradeoff between underfitting and overfitting.

13.4. Cross-classified models

We can use more than one type of cluster in the same model. This is especially common when units in the data possess MULTIPLE MEMBERSHIP in more than one kind of cluster. For example, the observations in data(chimpanzees) are lever pulls. Each pull is within a cluster of pulls belonging to an individual chimpanzee. But each pull is also within an experimental block done at the same time. So each observed pull has membership in both an actor (1 to 7) and a block (1 to 6).
Re-estimate the chimpanzee model, using varying effects and now being able to include the experiment block:

L_ijk ∼ Binomial(1, p_ijk)
log( p_ijk / (1 − p_ijk) ) = α + α_Aj + α_Bk + (β_P + β_PC C_ijk) P_ijk
α_Aj ∼ Normal(0, σ_A)
α_Bk ∼ Normal(0, σ_B)
α ∼ Normal(0, 1)
β_P ∼ Normal(0, 1)
β_PC ∼ Normal(0, 1)
σ_A ∼ Cauchy(0, 1)
σ_B ∼ Cauchy(0, 1)

where L_ijk is pulled.left for case i and actor j and block k, α_Aj is the varying effect of actor j, α_Bk is the varying effect of experiment block k, and σ_A and σ_B are the standard deviations that correspond to each varying effect type. The predictors P_ijk and C_ijk are prosoc.left and condition. To load the data and estimate the model:

library(rethinking)
data(chimpanzees)
d <- chimpanzees

m13.4 <- map2stan(
    alist(
        pulled.left ~ dbinom( 1 , p ),
        # the likelihood and linear model lines follow the mathematical
        # form above; only the priors, data list, and start list survive
        # verbatim in this copy
        logit(p) <- a + aj[actor] + ak[block_num] +
            (bp + bpc*condition)*prosoc.left,
        aj[actor] ~ dnorm( 0 , sigma_actor ),
        ak[block_num] ~ dnorm( 0 , sigma_block ),
        a ~ dnorm(0,1),
        bp ~ dnorm(0,1),
        bpc ~ dnorm(0,1),
        sigma_actor ~ dcauchy(0,1),
        sigma_block ~ dcauchy(0,1)
    ) ,
    data=list(
        pulled.left=d$pulled.left,
        prosoc.left=d$prosoc.left,
        condition=d$condition,
        actor=as.integer(d$actor),
        block_num=as.integer(d$block) ) ,
    start=list(a=0,bp=0,bpc=0,sigma_actor=1,sigma_block=1,
        aj=rep(0,7),ak=rep(0,6)),
    warmup=1000 , iter=1e4 )

Looking at estimates:

R code 13.17
precis(m13.4)

             Mean StdDev lower 0.95 upper 0.95
a            0.27   0.63      -0.99       1.52
bp           0.77   0.25       0.31       1.26
bpc         -0.09   0.28      -0.63       0.44
sigma_actor  2.16   0.84       0.94       3.76
sigma_block  0.23   0.19       0.01       0.56
aj[1]       -0.98   0.66      -2.22       0.41
aj[2]        4.22   1.51       1.73       7.31
aj[3]       -1.28   0.67      -2.65      -0.01
aj[4]       -1.28   0.66      -2.59       0.06
aj[5]       -0.98   0.66      -2.32       0.32
aj[6]       -0.03   0.66      -1.35       1.25
aj[7]        1.50   0.71       0.15       2.93
ak[1]       -0.17   0.23      -0.68       0.18
ak[2]        0.04   0.19      -0.34       0.46
ak[3]        0.06   0.19      -0.31       0.49
ak[4]        0.01   0.19      -0.39       0.40
ak[5]       -0.03   0.19      -0.44       0.33
ak[6]        0.12   0.21      -0.26       0.57

A lot of variation among chimpanzees (actor), as you saw in Chapter 11. But not much by experiment block (block).

13.4.1. Marginal posterior predictions. Simulating over varying intercepts.


FIGURE 13.3. Posterior distributions of the standard deviations of varying intercepts by actor (black) and experimental block (blue), for the model m13.4.

R code 13.18
# Only fragments of this block survive here; the function wrapper and the
# use of logistic() are assumptions around the linear model that remains.
post <- extract.samples( m13.4 )
n <- length( post$a )
# marginalize over actors by simulating new varying intercepts
p.link <- function( P , C ) {
    logodds <- post$a +
        rnorm(n,0,post$sigma_actor) +
        (post$bp + post$bpc*C)*P
    return( logistic(logodds) )
}


14 Multilevel Models II: Slopes

14.1. Everything can vary and probably should

Reasons to use varying slopes.

14.1.1. Varying slopes and correlated effects. Another reason for the growing popularity of multilevel models is to improve inference about predictor variables. Even if you don't hypothesize that the effect of a variable is different across clusters, accounting for variation in the overall responses among clusters can help you get better estimates of even the non-varying effects.
The easiest way to appreciate this fact is to recall the UCBadmit data from Chapter 11. In those data, failing to model the varying means across departments led to exactly the opposite inference of the truth. I addressed the problem back then by using naive no-pooling estimates derived from adding a dummy variable for each department. But since the varying effect estimates are on average superior to such no-pooling estimates, we could have done even better by using a multilevel model.
In fact, let's quickly return to those data and do just that. This will also allow me to show you some more aspects of the notation and of these models. Unlike in the tadpole data, now our clusters (departments) will span the rows in the data frame. So we're going to need another index, j, to indicate clusters, while we keep i to indicate rows within them. Let's load the data and add these i and j labels, to make this point clear.

R code 14.1
data(UCBadmit)
d <- UCBadmit
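The rest of the preparation code is cut off here. A sketch that produces the columns shown in the printout below (the exact construction is an assumption):

d$male <- ifelse( d$applicant.gender=="male" , 1 , 0 )
d$i <- 1:nrow(d)            # row index within the whole data frame
d$j <- as.integer( d$dept ) # department (cluster) index, 1 to 6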


   dept applicant.gender admit reject applications male  i j
10    E           female    94    299          393    0 10 5
11    F             male    22    351          373    1 11 6
12    F           female    24    317          341    0 12 6

On the right-hand side, you can see the new columns. The first row is row 1 within cluster 1. The 6th row is the 2nd row within cluster 3.
With this new notation in hand, the model we want to estimate is:

A_ij ∼ Binomial(n_ij, p_ij)  [likelihood]
logit(p_ij) = α + α_j + β m_ij  [linear model]
α_j ∼ Normal(0, σ)  [prior for varying intercepts]
α ∼ Normal(0, 10)  [prior for α]
β ∼ Normal(0, 1)  [prior for β]
σ ∼ Cauchy(0, 2.5)  [prior for σ]

The outcome variable A_ij is the number of admit decisions, admit, and the sample size in each case is n_ij, applications. Notice that the index on the p values is now ij. This is because probabilities vary by department, as indexed by j. Once there are predictor variables in the log-odds linear model, p_ij will also vary by i. Likewise, the parameters to estimate, α_j, also vary by department. All that has really changed is that now multiple rows in the data will contribute data to the estimated varying effect for each cluster. But the above notation is very standard and actually much more common than the tadpole example above. So it's worth making sure you understand it.
Here's the code to fit this model:
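R code 14.2 is truncated in this copy. A sketch of the fit, matching the parameter names in the precis output below (a_dept, bm, sigma_dept); the priors repeat the mathematical form above:

m14.1 <- map2stan(
    alist(
        admit ~ dbinom( applications , p ),
        logit(p) <- a + a_dept[dept] + bm*male,
        a_dept[dept] ~ dnorm( 0 , sigma_dept ),
        a ~ dnorm(0,10),
        bm ~ dnorm(0,1),
        sigma_dept ~ dcauchy(0,2.5)
    ) ,
    data=list( admit=d$admit , applications=d$applications ,
        male=d$male , dept=d$j ) )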


a_dept[4]  -0.06 0.64 -1.41  1.06
a_dept[5]  -0.50 0.64 -1.77  0.65
a_dept[6]  -2.05 0.66 -3.35 -0.88
sigma_dept  1.66 0.81  0.65  3.19

The estimated effect of male is very similar to what we got in Chapter 11. But now we also have better estimates of the individual department average acceptance rates. You'll see that the departments are ordered from those with the highest proportions accepted to the lowest. Remember, the values above are the α_j estimates, and so they are deviations from the global mean α, which in this case has mean −0.56.

14.1.2. Varying effects of being male. Now, in order to teach you about varying slopes, let's consider the variation in gender bias among departments. Sure, overall there isn't much evidence of gender bias in the previous model. But what if we allow the effect of an applicant's being male to vary in the same way we already allowed the overall rate of admission to vary? Such a model is a VARYING SLOPES (or RANDOM SLOPES) model. In this case, the model is specified by defining a series of individual department effects of the variable male. In effect, we assume that every department not only has its own intercept (its log-odds of admission), but it also has its own slope (the change in log-odds of admission arising from changing an applicant's gender from female to male).
Specifying this model will require a little more work, although conceptually it is familiar territory. The trouble is that, in order to successfully pool information, we'll need to define distributions of not only the varying intercepts and slopes, but also their correlation. Another way to think of this is that we are defining a prior probability density for the cluster intercepts and slopes. We'll estimate this prior from the data, so it's an empirical prior, but we still need to define its basic shape, the parameters that will describe it. If we don't allow for the Gaussian distribution of intercepts to be correlated with the Gaussian distribution of slopes, then we are assuming they are uncorrelated. Now, maybe the overall rate of admissions in each department really is uncorrelated with any bias by gender in admissions. But what if they are correlated? What if departments that admit most applicants also favor female applicants? In that case, defining the distribution of varying intercepts and slopes as being correlated allows us to pool information not only across clusters (departments), but also across parameter types.
Suppose for example that you analyze the data for all departments except for the smallest one, finding a strong correlation between the overall probability of admission and the amount of gender bias. Now, before even seeing the data for the final department, you expect its intercept and slope to be correlated. Furthermore, if you learn its slope, it changes your guess about its intercept. Likewise, if you learn its intercept, it changes your guess about its slope. In reality, both estimates simultaneously inform one another, just as all departments simultaneously inform one another. This is how allowing the varying intercepts and slopes to be correlated allows us to better pool information across clusters in the data.
Another reason to allow for the correlation is that, if you want to generalize to new sampling clusters in the future, then you will need a complete definition of the joint distribution of the varying intercepts and slopes. If you assume they are uncorrelated, you might well make worse predictions than if you estimated their correlation. Now, if you have good reason—either theoretical or practical—to assume the correlation is zero, then certainly do so. But otherwise, assuming that the distribution of varying intercepts is uncorrelated with the distribution of varying slopes is a lot like assuming that the posterior distributions of any two parameters are perfectly uncorrelated. As you know by now, having used so many variance-covariance matrices in this book, this is sometimes nearly true, but often not even close to true.


This is what the model looks like:

A_ij ∼ Binomial(n_ij, p_ij)  [likelihood]
logit(p_ij) = α + α_j + (β + β_j) m_ij  [linear model]
(α_j, β_j) ∼ MVNormal( (0, 0) , SRS )  [joint prior for varying effects]
α ∼ Normal(0, 10)  [prior for α]
β ∼ Normal(0, 1)  [prior for β]
(σ_α, σ_β) ∼ Cauchy(0, 2.5)  [prior for each σ]
R ∼ LKJCorr(2)  [prior for correlation matrix]

The symbol m_ij indicates the value of male for the i-th row of the j-th department. It is multiplied by the sum β + β_j, which is a total slope defined by both a value common to all departments, β, and a value unique to each department, β_j. Just as with the varying intercepts α_j, the varying slopes β_j are drawn from a multivariate normal distribution, and the distribution of both the varying intercepts and slopes is defined by the third line.
So how are you supposed to read that third line? This mess is what is needed to define a joint prior for the varying intercepts, α_j, and varying slopes, β_j. Now we have two standard deviation parameters, one governing each type of varying effect, so I have relabeled them σ_α and σ_β to indicate the batch of effects each describes. But these two normal distributions are also possibly correlated with one another, and their correlation is measured by ρ ("rho"). The third line above reads: sample pairs of α_j and β_j values, one pair for each department j, from a bivariate normal distribution with variance-covariance matrix defined by the variances σ²_α and σ²_β and correlation ρ.
Think of these α_j and β_j values as being analogous to the sets of samples you drew from the naive posterior in previous chapters. Using the variance-covariance matrix of the posterior (vcov), you could define a multivariate normal distribution and sample random numbers from it. The values across columns in the resulting table had a special correlation structure. Likewise, these varying intercepts and slopes will have a special correlation structure, defined by ρ. The notation is a bit weird at first, but you'll get used to it, after a few more examples.
To estimate this model, use:

R code 14.3
m14.2 <- map2stan(
    alist(
        admit ~ dbinom( applications , p ),
        # lines above the data list are reconstructed from the math above;
        # the joint-prior syntax in this draft is uncertain
        logit(p) <- a + a_dept[dept] + (bm + bm_dept[dept])*male,
        c(a_dept,bm_dept)[dept] ~ dmvnorm( 0 , Sigma_dept , Rho_dept ),
        a ~ dnorm(0,10),
        bm ~ dnorm(0,1),
        Sigma_dept ~ dcauchy(0,2.5),
        Rho_dept ~ dlkjcorr(2)
    ) ,
    data=list( admit=d$admit ,
        applications=d$applications ,
        male=d$male , dept=d$j ) ,
    start=list( a=0 , bm=0 , a_dept=rep(0,6) , bm_dept=rep(0,6) ,
        Sigma_dept=c(1,1) , Rho_dept=diag(2) ) )

Now let's look at the non-varying estimates:

R code 14.4
precis(m14.2)

               Mean StdDev lower 0.95 upper 0.95
a             -0.51   0.79      -2.13       1.01
bm            -0.17   0.28      -0.71       0.39
a_dept[1]      1.83   0.80       0.32       3.48
a_dept[2]      1.27   0.81      -0.17       3.05
a_dept[3]     -0.14   0.79      -1.54       1.63
a_dept[4]     -0.11   0.79      -1.79       1.41
a_dept[5]     -0.63   0.79      -2.20       0.96
a_dept[6]     -2.09   0.80      -3.60      -0.42
bm_dept[1]    -0.64   0.37      -1.37       0.01
bm_dept[2]    -0.06   0.38      -0.84       0.73
bm_dept[3]     0.25   0.29      -0.34       0.80
bm_dept[4]     0.08   0.29      -0.49       0.69
bm_dept[5]     0.29   0.31      -0.22       0.98
bm_dept[6]     0.04   0.34      -0.69       0.67
Sigma_dept[1]  1.76   0.71       0.81       3.17
Sigma_dept[2]  0.55   0.36       0.07       1.16
Rho_dept[1,1]  1.00   0.00       1.00       1.00
Rho_dept[1,2] -0.28   0.37      -0.89       0.43
Rho_dept[2,1] -0.28   0.37      -0.89       0.43
Rho_dept[2,2]  1.00   0.00       1.00       1.00

The first row of output is the estimate for α, the average log-odds of admission across all departments. Notice that this is a highly imprecise estimate, which results from there being a lot of variation among departments in their admissions rates. The second row of output is the estimate of β, the average effect of being male on the log-odds of admission. This is negative, but also highly imprecise. The Sigma_dept rows are the estimates for σ_α and σ_β, the standard deviations of the varying intercepts and slopes.
What can you say from these estimates? Notice that the standard deviation of the intercepts is more than three times as large as the standard deviation of the slopes for male. A solid inference to draw from this difference is that more of the variation in log-odds of admission in the entire data arises from differences among departments in overall rates of admission than it does from differences in gender bias. That is, heterogeneity in overall admission rates is more influential than heterogeneity in gender bias in admission.


15 Missing Data and Other Opportunities

15.1. Measurement error

Divorce example. Let's begin by loading the data and plotting the measurement error of the outcome, as standard error:

R code 15.1
library(rethinking)
library(rstan)
data(WaffleDivorce)
d <- WaffleDivorce
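The plotting code is cut off here. A sketch of the left panel of FIGURE 15.1, using the WaffleDivorce column names; the styling details are assumptions:

plot( d$Divorce ~ d$MedianAgeMarriage , ylim=c(4,15) ,
    xlab="Median age marriage" , ylab="Divorce rate" )
# vertical bars: plus and minus one standard deviation of measurement
segments( d$MedianAgeMarriage , d$Divorce - d$Divorce.SE ,
    d$MedianAgeMarriage , d$Divorce + d$Divorce.SE )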


FIGURE 15.1. Left: Divorce rate by median age of marriage, States of the United States. Vertical bars show plus and minus one standard deviation of the Gaussian uncertainty in measured divorce rate. Southern States shown in blue. Right: Divorce rate, again with standard deviations, against log population of each State. Smaller States produce more uncertain estimates.

And then we also use these D_est,i as data in the regression equation. This will not only allow us to estimate coefficients for predictions that take into account the uncertainty in the outcome, but it will also update the prior for divorce rate in each State.
Here's what the model looks like:

D_est,i ∼ Normal(µ_i, σ)  ["likelihood" for estimates]
µ_i = α + β_A A_i + β_R R_i  [linear model]
D_est,i ∼ Normal(D_obs,i, D_SE,i)  [prior for estimates]
α ∼ Normal(0, 10)
β_A ∼ Normal(0, 10)
β_R ∼ Normal(0, 10)
σ ∼ Cauchy(0, 2.5)

So really the only difference between this model and a typical linear regression is replacing the outcome with a vector of parameters and assigning to each "outcome" parameter a prior that is a Gaussian distribution defined by the respective mean and standard deviation of the measurement.
Now for the Stan model code. The data block and the last lines of the model block are missing from this copy, so they are filled in here to match the variables that the surviving lines use:

R code 15.2
model_code <- '
data{
    int<lower=1> N;
    vector[N] divorce_obs;
    vector[N] divorce_sd;
    vector[N] MR;
    vector[N] MAM;
}
parameters{
    real a;
    real bMR;
    real bMAM;
    real sigma;
    vector[N] divorce_est;
}
model{
    vector[N] mu;
    a ~ normal(0,10);
    bMR ~ normal(0,10);
    bMAM ~ normal(0,10);
    sigma ~ cauchy(0,1)T[0,];
    // measurement error: observed rates inform the estimated rates
    divorce_est ~ normal( divorce_obs , divorce_sd );
    // regression of the estimated rates on the predictors
    mu <- a + bMR*MR + bMAM*MAM;
    divorce_est ~ normal( mu , sigma );
}
'


FIGURE 15.2. Left: Shrinkage resulting from modeling the measurement error. Right: Comparison of a regression that ignores measurement error (dashed line and gray shading) with the regression that incorporates measurement error (blue line and shading).

divorce_est[1]  11.75 0.66 10.46 13.05
...
divorce_est[50] 11.53 1.09  9.48 13.73
lp__           -57.99 6.53 -71.05 -45.50

If you look back at Chapter 5, you'll see that the former estimate for bMAM was about −1. Now it's almost half that, but still reliably negative. So compared to the original regression that ignores measurement error, the association between divorce and age at marriage has been reduced.
If you look again at FIGURE 15.1, you can see a hint of why this has happened. States with extremely low and high ages at marriage tend to also have more uncertain divorce rates. As a result those rates have been shrunk towards the expected mean defined by the regression line. FIGURE 15.2 displays this shrinkage phenomenon. On the left of the figure, the differences between the observed and estimated divorce rates are shown on the vertical axis, while the standard error of the observed is shown on the horizontal. The dashed line at zero indicates no change from observed to estimated. Notice that States with more uncertain divorce rates—further right on the plot—have estimates more different from observed. This is your friend SHRINKAGE from before. Less certain estimates are improved by pooling information from more certain estimates.
This shrinkage results in pulling divorce rates towards the regression line, as seen in the right-hand plot in the same figure. This plot shows the estimated divorce rates against the observed ages at marriage, with the old no-error regression in gray and the new with-error regression in blue.


15.1.2. Error on both outcome and predictor. Adding measurement error to a predictor works the same way:

D_est,i ∼ Normal(µ_i, σ)  [likelihood for outcome estimates]
µ_i = α + β_A A_i + β_R R_est,i  [linear model using predictor estimates]
D_est,i ∼ Normal(D_obs,i, D_SE,i)  [prior for outcome estimates]
R_est,i ∼ Normal(R_obs,i, R_SE,i)  [prior for predictor estimates]
α ∼ Normal(0, 10)
β_A ∼ Normal(0, 10)
β_R ∼ Normal(0, 10)
σ ∼ Cauchy(0, 2.5)

And the code below defines the equivalent Stan model.
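The Stan code itself is missing from this copy. A sketch that mirrors model_code above, adding the measured-with-error marriage rate; the names marriage_obs and marriage_sd match the data list further down:

model_code2 <- '
data{
    int<lower=1> N;
    vector[N] divorce_obs;
    vector[N] divorce_sd;
    vector[N] marriage_obs;
    vector[N] marriage_sd;
    vector[N] MAM;
}
parameters{
    real a;
    real bMR;
    real bMAM;
    real sigma;
    vector[N] divorce_est;
    vector[N] marriage_est;
}
model{
    vector[N] mu;
    a ~ normal(0,10);
    bMR ~ normal(0,10);
    bMAM ~ normal(0,10);
    sigma ~ cauchy(0,1)T[0,];
    // priors from the measurements, for outcome and predictor
    divorce_est ~ normal( divorce_obs , divorce_sd );
    marriage_est ~ normal( marriage_obs , marriage_sd );
    // the regression uses the estimated predictor values
    mu <- a + bMR*marriage_est + bMAM*MAM;
    divorce_est ~ normal( mu , sigma );
}
'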


FIGURE 15.3. Left: Shrinkage for the predictor marriage rate. Notice that shrinkage is not balanced, but rather that the model believes the observed values tended to be overestimates. Right: Shrinkage of both the outcome, divorce rate, and marriage rate. Solid points are the posterior means. Open points are observed. Lines connect pairs of points for the same State.

The data list for the model (the sampling call itself is cut off here):

data_list <- list(
    N = nrow(d),  # (assumed; the other elements survive verbatim)
    divorce_obs = d$Divorce,
    divorce_sd = d$Divorce.SE,
    marriage_obs = d$Marriage,
    marriage_sd = d$Marriage.SE,
    MAM = d$MedianAgeMarriage )


Since the relationship between divorce and marriage rate is weak, there is little information in divorce rate to help us improve estimates of marriage rate. In contrast, since the relationship between divorce and median age at marriage is strong, there's a lot of information in age at marriage to help us improve estimates of divorce rate. That's why divorce estimates shrink more than marriage rate estimates do.
Of course, that information in age at marriage may be illusory, as we haven't incorporated uncertainty about age at marriage. If we had those data, in the form of standard errors or better yet a distribution of ages at marriage in each State, then we could blend that information into the model using the same approach as above.
The big take-home point for this section is that when you have a distribution of values, don't reduce it down to a single value to use in a regression. Instead, use the entire distribution.

15.2. Missing data

With measurement error, the insight is to realize that any datum can be replaced by a distribution that reflects uncertainty. That distribution comprises a prior that can be updated by the model, improving estimates of data while simultaneously estimating coefficients. Information actually flows among the cases, through the regression model, allowing each measurement to borrow strength from the others.
Sometimes however data are simply missing—no measurement is available at all. At first, this seems like a lost cause. What can be done when there is no measurement at all, not even one with error? A lot can be done, because just like information flows among cases in the measurement error example, it can also flow from present data to missing data, as long as we are willing to make a model of the whole variable.

15.2.1. Baldini's example here. Case of the giant banker?

15.2.2. Imputing neocortex. Primate milk example. Back in Chapter 5, we used data(milk) to illustrate masking, using both neocortex percent and body mass to predict milk energy. One aspect of those data is the 12 missing values in the neocortex.perc column. We used a COMPLETE-CASE analysis back then, dropping those 12 cases and the 12 good body mass and milk energy values in them. That left us with only 17 cases to work with.
Now we'll see how to use Bayesian IMPUTATION to conserve and use the information in the other columns, while producing estimates for the missing values. The estimates we derive for the missing values won't be particularly crisp—they will have wide posteriors. But they will be honest, given the assumptions built into the model.
What we're going to do is build an example of MCAR (Missing Completely At Random) imputation. With MCAR, we assume that the location of the missing values is completely random with respect to those values and all other values in the data. This is the simplest and easiest to understand form of imputation. So it makes sense to start here.
The trick to all imputation is to simultaneously model the predictor variable that has missing values together with the outcome variable. The not-missing values will produce estimates that comprise a prior for the missing values, and these will then be updated by the relationship between the predictor and outcome. But we have to conceive of the predictor now as a mixed vector of data and parameters. In our case, the variable with missing values is neocortex percent. Call it N.

N = (0.55, N_2, N_3, N_4, 0.65, 0.65, ..., 0.76, 0.75)


At every index i where there is a missing value, there is a parameter N_i that will form an estimate for it.
This is the model:

k_i ∼ Normal(µ_i, σ)  [likelihood for outcome k]
µ_i = α + β_N N_i + β_M log M_i  [linear model]
N_i ∼ Normal(ν, σ_N)  [likelihood/prior for observed/missing N]
α ∼ Normal(0, 10)
β_N ∼ Normal(0, 1)
β_M ∼ Normal(0, 1)
σ ∼ Cauchy(0, 1)
ν ∼ Normal(0.5, 1)
σ_N ∼ Cauchy(0, 1)

Note that when N_i is observed, then the third line above is a likelihood, just like any old linear regression. But when N_i is missing and therefore a parameter, that same line is interpreted as a prior. Since the parameters ν and σ_N are also estimated, the prior is learned from the data.
Specifying this model in Stan can be done several different ways. All of the ways are a little awkward, because the locations of missing values have to be respected. The approach I'll use here hews closely to the discussion just above: we'll merge the observed values and parameters into a vector that we'll use as "data" in the regression.
First, get the data loaded and transform predictors:

R code 15.7
library(rethinking)
library(rstan)
data(milk)
d <- milk
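The transformation code is cut off here. The transforms implied by the model above, in a minimal sketch (the column names created are assumptions):

d$neocortex <- d$neocortex.perc / 100  # proportion scale
d$logmass <- log( d$mass )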


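Nearly all of this step's code is lost from this copy; only the name nc survives. A sketch of the bookkeeping the merge approach needs—splitting the neocortex column into observed values and the indexes of the missing ones—with every name beyond nc being mine:

nc <- d$neocortex
nc_missing_idx <- which( is.na(nc) )  # positions of the 12 missing values
nc_obs <- nc[ -nc_missing_idx ]       # the 17 observed values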


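R code 15.11 is reduced to its first token in this copy. A sketch of a plausible data list and sampling call, assuming the Stan model for this section is stored as model_code; all list element names are assumptions:

data_list <- list(
    N = nrow(d),
    N_miss = length(nc_missing_idx),
    kcal = d$kcal.per.g,
    neocortex_obs = nc_obs,
    nc_missing_idx = nc_missing_idx,
    logmass = d$logmass )
fit <- stan( model_code=model_code , data=data_list ,
    iter=1e4 , chains=2 )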


FIGURE 15.4. Left: Inferred relationship between milk energy (vertical) and neocortex proportion (horizontal), with imputed values shown in blue. Blue line segments are 90% confidence intervals. Right: Inferred relationship between the two predictors, neocortex proportion and log mass. Imputed values again shown in blue.

So by using all the cases, the strength of the inferred relationships has diminished. This might make you sad, but ask yourself whether you would want your colleagues to use all the data, even if it meant inferring a weaker relationship. Then apply that same standard to yourself.
Let's do some plotting to visualize what's happened here. FIGURE 15.4 displays both the inferred relationship between milk energy and neocortex (left) and the relationship between the two predictors (right). Both plots show imputed neocortex values in blue, with 90% confidence intervals shown by the blue line segments. Although there's a lot of uncertainty in the imputed values—hey, Bayesian inference isn't magic, just logic—they do show a gentle tilt towards the regression line. This has happened because the observed values provide information that guides the estimation of the missing values.
The right-hand plot shows the inferred relationship between the predictors. We already know that these two predictors are positively associated—that's what creates the masking problem. But notice here that the imputed values do not show an upward slope. They do not, because the imputation model—the first regression with neocortex (observed and missing) as the outcome—assumed no relationship. So odds are we can improve this model by changing the imputation model to estimate the relationship between the two predictors.
Let's do that now. The notion is to change the imputation line of the model from:

N_i ∼ Normal(ν, σ_N)

to:

N_i ∼ Normal(ν_i, σ_N)
ν_i = α_N + γ_M log M_i


where α_N and γ_M now describe the linear relationship between neocortex and log mass. The objective is to extract information from the observed cases and exploit it to improve the estimates of the missing values inside N.

The modified Stan code:
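A sketch of model_code2, under the same assumptions as the earlier sketch; only the imputation portion changes, gaining an intercept a_N and a slope gM on log mass (the priors on those two parameters are guesses):

R code 15.14
    model_code2 <- '
    data{
        int N;
        int N_miss;
        real kcal[N];
        real nc[N];
        real logmass[N];
        int miss_idx[N_miss];
    }
    parameters{
        real alpha;
        real bN;
        real bM;
        real<lower=0> sigma;
        real a_N;               // intercept of the imputation model
        real gM;                // slope on log mass in the imputation model
        real<lower=0> sigma_N;
        real N_impute[N_miss];
    }
    model{
        real Nmerge[N];
        for ( i in 1:N ) Nmerge[i] <- nc[i];
        for ( i in 1:N_miss ) Nmerge[miss_idx[i]] <- N_impute[i];
        alpha ~ normal(0,10);
        bN ~ normal(0,1);
        bM ~ normal(0,1);
        sigma ~ cauchy(0,1);
        a_N ~ normal(0.5,1);
        gM ~ normal(0,1);
        sigma_N ~ cauchy(0,1);
        // imputation model now regresses neocortex on log mass
        for ( i in 1:N )
            Nmerge[i] ~ normal( a_N + gM*logmass[i] , sigma_N );
        for ( i in 1:N )
            kcal[i] ~ normal( alpha + bN*Nmerge[i] + bM*logmass[i] , sigma );
    }
    '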


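Sampling works as before. A minimal sketch, assuming the same data list; the explicit start list mirrors the draft's use of one, but the particular initial values here are guesses:

    start <- list( alpha=0 , bN=0 , bM=0 , sigma=1 ,
        a_N=0.5 , gM=0 , sigma_N=1 ,
        N_impute=rep( 0.5 , data_list$N_miss ) )
    fit2 <- stan( model_code=model_code2 , data=data_list ,
        init=list(start,start) , iter=10000 , chains=2 )

FIGURE 15.5 shows the consequence: the imputed values now tilt with log mass, and the inferred relationship between milk energy and neocortex grows stronger.

15.3. Space and networks

The example here returns to the Oceanic societies and their tool counts, now letting the distance between islands structure the correlation among them.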


FIGURE 15.5. Same relationships as shown in FIGURE 15.4, but now for the imputation model that estimates the association between the predictors. The information in the association between predictors has been used to infer a stronger relationship between milk energy and the imputed values.

Now we can write the rest of the model. The notion is that varying intercepts for the n islands are sampled from a multivariate Gaussian distribution with mean zero and correlation matrix R, as defined above. The rest is just a Poisson regression, predicting total tools as a function of log population size and the varying intercept for each island. This is what it looks like:

    T_i ∼ Poisson(λ_i)
    log λ_i = α + γ_i σ + β log P_i
    (γ_1, ..., γ_n) ∼ MVNormal((0, ..., 0), R)
    R_{i,j} = θ exp(−k D²_{i,j})(1 − δ_{i,j}) + δ_{i,j}
    α ∼ Normal(0, 10)
    β ∼ Normal(0, 1)
    σ ∼ Uniform(0, 10)
    θ ∼ Beta(2, 2)
    k ∼ Normal(2, 3) T[0, ]

Here D_{i,j} is the distance between islands i and j, and δ_{i,j} equals 1 when i = j and zero otherwise, so that each island is perfectly correlated with itself. The T[0, ] notation truncates the prior for k to positive values.

One of the odd things about this model is that while the matrix R appears as a parameter inside the multivariate Gaussian distribution, it isn't directly estimated. Instead, the correlation matrix is defined by other parameters (k and θ), which are estimated.

Now for code:


    library(rethinking)
    data(islands2)
    d <- islands2
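The model code that follows can be sketched as below, assuming a squared distance matrix Dmat_sq computed from inter-island distances in thousands of kilometers; the data object names and Stan details are reconstructions, not the author's exact code:

    model_code <- '
    data{
        int N;                      // number of islands
        int total_tools[N];         // total tools on each island
        vector[N] logpop;           // log population size
        matrix[N,N] Dmat_sq;        // squared distances, thousands of km
    }
    parameters{
        real a;
        real b;
        real<lower=0,upper=10> sigma;
        real<lower=0,upper=1> theta;
        real<lower=0> k;
        vector[N] gamma;
    }
    model{
        matrix[N,N] R;
        // correlation declines with squared distance; diagonal is 1
        for ( i in 1:N )
            for ( j in 1:N )
                R[i,j] <- theta * exp( -k * Dmat_sq[i,j] );
        for ( i in 1:N ) R[i,i] <- 1;
        a ~ normal(0,10);
        b ~ normal(0,1);
        theta ~ beta(2,2);
        k ~ normal(2,3);            // lower bound 0 enforces the truncation
        gamma ~ multi_normal( rep_vector(0,N) , R );
        // Poisson regression with spatially correlated intercepts
        for ( i in 1:N )
            total_tools[i] ~ poisson_log( a + gamma[i]*sigma + b*logpop[i] );
    }
    '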


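Then assemble the data and sample. Another sketch, with assumed column names in islands2 (total_tools, population) and an assumed distance matrix Dmat; the draft's fit also reports a deviance (dev), which would come from a generated quantities block not reproduced here:

    Dmat_sq <- Dmat^2    # Dmat assumed: inter-island distances, thousands of km
    data_list <- list(
        N = nrow(d),
        total_tools = d$total_tools,
        logpop = log(d$population),
        Dmat_sq = Dmat_sq
    )
    fit <- stan( model_code=model_code , data=data_list ,
        iter=10000 , chains=2 )
    print(fit)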


FIGURE 15.6. Posterior correlation function. Median correlation by distance shown by the black curve. The other curves are 200 functions sampled from the posterior.

    b       0.26    0.09    0.08    0.45
    dev   -31.44    2.31  -36.13  -27.63
    lp__  913.50    3.55  906.51  920.08

A few things can be inferred straight from the table. First, notice that b, the coefficient for log population, is still positive. Population size increases total tools, even after accounting for any spatial effects. Second, there's a lot of uncertainty as to the autocorrelation function. So we're going to need to plot it, in order to understand it.

Let's begin by plotting the drop in correlation with distance.
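A sketch of the plotting code, assuming the stanfit object above; the 200 sampled curves follow the model's correlation function θ exp(−k D²):

    post <- rstan::extract(fit)
    d_seq <- seq( from=0 , to=5 , length.out=100 )   # thousands of km
    plot( NULL , xlim=c(0,5) , ylim=c(0,1) ,
        xlab="kilometers (thousands)" , ylab="correlation" )
    # 200 correlation functions sampled from the posterior
    for ( i in 1:200 )
        lines( d_seq , post$theta[i] * exp( -post$k[i] * d_seq^2 ) ,
            col=col.alpha("black",0.2) )
    # posterior median correlation function
    lines( d_seq , median(post$theta) * exp( -median(post$k) * d_seq^2 ) ,
        lwd=2 )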


FIGURE 15.7. Posterior correlations among total tools on each island. Strength of correlation is shown by the darkness of each line segment connecting a pair of islands.
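The map in FIGURE 15.7 can be drawn along these lines. A sketch assuming islands2 contains lon, lat, and culture columns (hypothetical names), with segment darkness scaled by the posterior median correlation between each pair of islands:

R code 15.23
    # posterior median correlation for each pair of islands
    n <- nrow(d)
    Rmed <- matrix( 1 , n , n )
    for ( i in 1:(n-1) )
        for ( j in (i+1):n ) {
            Rmed[i,j] <- median( post$theta * exp( -post$k * Dmat_sq[i,j] ) )
            Rmed[j,i] <- Rmed[i,j]
        }
    psize <- d$population / max(d$population)   # scale symbols by population
    plot( d$lon , d$lat , xlab="longitude" , ylab="latitude" ,
        cex=1+2*psize )
    text( d$lon , d$lat , labels=d$culture , pos=3 )
    # darker segments indicate stronger correlation
    for ( i in 1:(n-1) )
        for ( j in (i+1):n )
            lines( c(d$lon[i],d$lon[j]) , c(d$lat[i],d$lat[j]) ,
                col=col.alpha("black", Rmed[i,j]^2 ) )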


FIGURE 15.8. Posterior correlations among total tools on each island, overlaid on the relationship between tools and population.


16 Writing Statistics


Endnotes

Chapter 1

1. I draw this metaphor from Collins and Pinch, The Golem: What You Should Know about Science. It is very similar to E. T. Jaynes' 2003 metaphor of statistical models as robots, although with a less precise and more monstrous implication. [13]
2. There are probably no algorithms nor machines that never break, bend, or malfunction. A common citation for this observation is Wittgenstein (1953), Philosophical Investigations, section 193. Malfunction will interest us, later in the book, when we consider more complex models and the procedures needed to fit them to data. [14]
3. See Mulkay and Gilbert 1981. I sometimes teach a PhD core course that includes some philosophy of science, and PhD students are nearly all shocked by how little their casual philosophy resembles that of Popper or any other philosopher of science. The first half of Ian Hacking's Representing and Intervening (1983) is probably the quickest way into the history of the philosophy of science. It's getting out of date, but remains readable and broad-minded. [16]
4. Maybe best to begin with Popper's last book, The Myth of the Framework (1996). I also recommend interested readers to go straight to a modern translation of Popper's earlier Logic of Scientific Discovery. Chapters 6, 8, 9 and 10 in particular demonstrate that Popper appreciated the difficulties with describing science as an exercise in falsification. Other later writings, many collected in Objective Knowledge: An Evolutionary Approach, show that Popper viewed the generation of scientific knowledge as an evolutionary process that admits many different methods. [16]
5. George E. P. Box is famous for this dictum. As far as I can tell, his first published use of it was as a section heading in a 1979 paper (Box, 1979). Population biologists like myself are more familiar with a philosophically similar essay about modeling in general by Richard Levins, "The Strategy of Model Building in Population Biology" (Levins, 1966). [17]
6. Ohta and Gillespie (1996). [17]
7. Hubbell (2001). The theory has been productive in that it has forced greater clarity of modeling and understanding of relations between theory and data. But the theory has had its difficulties. See Clark (2012). For a more general skeptical attitude towards "neutrality," see Proulx and Adler (2010). [17]
8. For direct application of Kimura's model to cultural variation, see for example Hahn and Bentley (2003). All of the same epistemic problems reemerge here, but in a context with much less precision of theory. Hahn and Bentley have since adopted a more nuanced view of the issue. See their comment to Lansing and Cox (2011), as well as the similar comment by Feldman. [17]
9. Gillespie (1977). [17]
10. Lansing and Cox (2011). See objections by Hahn, Bentley, and Feldman in the peer commentary to the article. [19]
11. See Cho 2011 for a December 2011 summary focusing on debates about measurement. [20]


12. For an autopsy of the experiment, see http://profmattstrassler.com/articles-and-posts/particle-physics-basics/neutrinos/nefaster-than-light/opera-what-went-wrong/. [21]
13. See Mulkay and Gilbert 1981 for many examples of "Popperism" from practicing scientists, including famous ones. [21]
14. For an accessible history of some measurement issues in the development of physics and biology, including early experiments on relativity and abiogenesis, I recommend Collins and Pinch 1998. Some scientists have read this book as an attack on science. However, as the authors clarify in the second edition, this was not their intention. Science makes myths, like all cultures do. That doesn't necessarily imply that science does not work. See also Daston and Galison (2007), which tours concepts of objective measurement, spanning several centuries. [21]
15. The first chapter of Sober (2008) contains a similar discussion of modus tollens. Note that the statistical philosophy of Sober's book is quite different from that of the book you are holding. In particular, Sober is weakly anti-Bayesian. This is important, because it emphasizes that rejecting modus tollens as a model of statistical inference has nothing to do with any debates about Bayesian versus non-Bayesian tools. [21]
16. Popper himself had to deal with this kind of theory, because the rise of quantum mechanics in his lifetime presented rather serious challenges to the notion that measurement was unproblematic. See chapter 9 in his Logic of Scientific Discovery, for example. [21]
17. See the Afterward to the 2nd edition of Collins and Pinch (1998) for examples of textbooks getting it wrong by presenting tidy fables about the definitiveness of evidence. [22]
18. A great deal has been written about the sociology of science and the interface of science and public interest. Interested novices might begin with Kitcher (2011), Science in a Democratic Society, which has a very broad topical scope and so can serve as an introduction to many dilemmas. [22]
19. Yes, even procedures that claim to be free of assumptions do have assumptions and are a kind of model. All systems of formal representation, including numbers, do not directly reference reality. For example, there is more than one way to construct "real" numbers in mathematics, and there are important consequences in some applications. In application, all formal systems are like models. See http://plato.stanford.edu/entries/philosophy-mathematics/ for a short overview of some different stances that can be sustained towards reasoning in mathematical systems. [22]
20. Saint Augustine, in City of God, famously admonished against trusting in luck, as personified by Fortuna: "How, therefore, is she good, who without discernment comes to both the good and to the bad?" See also the introduction to Gigerenzer et al. (1990). Rao (1997) presents a page from an old book of random numbers, commenting upon how seemingly useless such a thing would have been in previous eras. [22]
21. Most scholars trace frequentism to British logician John Venn (1834–1923), as for example presented in his 1876 book. Speaking of the proportion of male births in all births, Venn said, "probability is nothing but that proportion" (page 84). Venn taught Fisher some of his maths, so this may be where Fisher got his dogmatic opposition to Bayesian probability. [23]
22. Fisher (1956). See also Fisher (1955), the first major section of which discusses the same point. [23]
23. This last sentence is a rephrasing from Lindley (1971): "A statistician faced with some data often embeds it in a family of possible data that is just as much a product of his fantasy as is a prior distribution". Dennis V. Lindley (1923–2013) was a prominent defender of Bayesian data analysis when it had very few defenders. [23]
24. It's hard to find an accessible introduction to image analysis, because it's a very computational subject. At the intermediate level, see Marin and Robert (2007), chapter 8. You can hum over their mathematics and still acquaint yourself with the different goals and procedures. See also Jaynes (1984) for spirited comments on the history of Bayesian image analysis and his pessimistic assessment of non-Bayesian approaches. [24]
25. See Gigerenzer et al. 2004. [25]


26. Fisher (1925), page 9. See Gelman and Robert (2013) for reflection on intemperate anti-Bayesian attitudes from the middle of last century. [25]
27. See McGrayne (2011) for a non-technical history of Bayesian data analysis. See also Fienberg (2006), which describes (among many other things) applied use of Bayesian multilevel models in election prediction, beginning in the early 1960's. [25]
28. I borrow this phrasing from Silver (2012). Silver's book is a well-written, non-technical survey of modeling and prediction in a range of domains. [27]
29. See Theobald (2010) for a fascinating example in which multiple non-null phylogenetic models are contrasted. [28]

Chapter 2

30. Morison (1942). In addition to underestimating the circumference, Colombo also overestimated the size of Asia and the distance between mainland China and Japan. [29]
31. This distinction and vocabulary derive from Savage (1962). [29]
32. I first encountered this globe tossing strategy in Gelman and Nolan (2002). Since I've been using it in classrooms, several people have told me that they have seen it in other places, but I've been unable to find a primeval citation, if there is one. [30]
33. This assumption is usually called either the principle of indifference or the principle of insufficient reason. There is a very long history of debate over this principle, and there are numerous ways to justify it. None of these ways provide any guarantees, because indeed no method of inference provides guarantees. [31]
34. Van Horn (2003). Technically, any system with the same proportionality as probability theory will do. [37]
35. There is actually a set of theorems, the No Free Lunch theorems. These theorems—and others which are similar but named and derived separately—effectively state that there is no optimal way to pick priors (for Bayesians) or select estimators or procedures (for non-Bayesians). See Wolpert and Macready (1997) for example. [40]
36. This is a subtle point that will be expanded in other places. On the topic of accuracy of assumptions versus information processing, see e.g. Appendix A of Jaynes (1985): the Gaussian, or normal, error distribution needn't be physically correct in order to be the most useful assumption. [41]
37. This approach is usually identified with Bruno de Finetti and L. J. Savage. See Kadane (2011) for review and explanation. [44]
38. See Berger and Berry (1988), for example, for further exploration of these ideas. [44]

Chapter 3

39. Gigerenzer and Hoffrage (1995). [60]
40. Feynman (1967) provides one of the best defenses of this device in scientific discovery. [60]
41. For a binary outcome problem of this kind, the posterior density is given by dbeta(p,w+1,n-w+1), where p is the proportion of interest, w is the observed count of water, and n is the number of tosses. If you're curious about how to prove this fact, look up "beta-binomial conjugate prior." I avoid discussing the analytical approach in this book, because very few problems are so simple that they have exact analytical solutions like this. [60]
42. See Ioannidis (2005) for another narrative of the same idea. The problem is almost certainly worse than the calculation suggests, because non-significant results are filtered out of the published literature. [61]
43. See Box and Tiao 1973, page 84 and then page 122 for a general discussion. [67]
44. Gelman et al. 2013a page 33 comment on differences between percentile intervals and HPDIs. [67]
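To make note 41 concrete, here is a brief sketch in R; the counts w = 6 and n = 9 are arbitrary example values, not tied to anything in the notes:

    # exact posterior for a proportion p, after w successes in n trials, flat prior
    w <- 6 ; n <- 9
    curve( dbeta( x , w+1 , n-w+1 ) , from=0 , to=1 ,
        xlab="proportion" , ylab="posterior density" )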


45. Fisher (1925), in Chapter III within section 12 on the normal distribution. There are a couple of other places in the book in which the same resort to convenience or convention is used. Fisher seems to indicate that the 5% mark was already widely practiced by 1925 and already without clear justification. [67]
46. Fisher (1956). [67]
47. See Henrion and Fischoff (1986) for examples from the estimation of physical constants, such as the speed of light. [68]
48. Robert (2007) provides concise proofs of optimal estimators under several standard loss functions, like this one. It also covers the history of the topic, as well as many related issues in deriving good decisions from statistical procedures. [70]
49. Rice (2010) presents an interesting construction of classical Fisherian testing through the adoption of loss functions. [71]
50. See Hauer (2004) for three tales from transportation safety in which testing resulted in premature incorrect decisions and a demonstrable and continuing loss of human life. [71]
51. It is poorly appreciated that coin tosses are very hard to bias, as long as you catch them in the air. Once they land and bounce and spin, however, it is very easy to bias them. [77]
52. Jaynes (1985), page 351. [78]

Chapter 4

53. Leo Breiman, at the start of Chapter 9 of his classic book on probability theory (Breiman, 1968), says "there is really no completely satisfying answer" to the question "why normal?" Many mathematical results remain mysterious, even after we prove them. So if you don't quite get why the normal distribution is the limiting distribution, you are in good company. [86]
54. For the reader hungry for mathematical details, see Frank (2009) for a nicely illustrated explanation of this, using Fourier transforms. [86]
55. Technically, the distribution of sums converges to normal only when the original distribution has finite variance. What this means practically is that the magnitude of any newly sampled value cannot be so big as to overwhelm all of the previous values. There are natural phenomena with effectively infinite variance, but we won't be working with any. Or rather, when we do, I won't comment on it. [86]
56. Howell 2010 and Howell 2000. See also Lee and DeVore 1976. Much more raw data is available for download from https://tspace.library.utoronto.ca/handle/1807/10395. [91]
57. Jaynes (2003), page 21–22. See that book's index for other mentions in various statistical arguments. [93]
58. The strategy is the same grid approximation strategy as before (page 48). But now there are two dimensions, and so there is a geometric (literally) increase in bother. The algorithm is mercifully short, however, if not transparent. Think of the code as being six distinct commands. The first two lines of code just establish the range of µ and σ values, respectively, to calculate over, as well as how many points to calculate in between. The third line of code expands those chosen µ and σ values into a matrix of all of the combinations of µ and σ. This matrix is stored in a data frame, post. In the monstrous fourth line of code, shown in expanded form to make it easier to read, the log-likelihood at each combination of µ and σ is computed. This line looks so awful, because we have to be careful here to do everything on the log scale. Otherwise rounding error will quickly make all of the posterior probabilities zero. So what sapply does is pass the unique combination of µ and σ on each row of post to a function that computes the log-likelihood of each observed height, and adds all of these log-likelihoods together (sum). In the fifth line, we multiply the prior by the likelihood to get the product that is proportional to the posterior density. The priors are also on the log scale, and so we add them to the log-likelihood, which is equivalent to multiplying the raw densities by the likelihood. Finally, the obstacle for getting back on the probability scale is that rounding error is always a threat when moving from log-probability to probability. If you use the obvious approach, like exp( post$prod ), you'll get a vector full of zeros, which isn't very helpful. This is a result of R's rounding very small probabilities to zero. Remember, in large samples, all unique samples are unlikely. This is why you have to work with log-probability. The code in the box dodges this problem by scaling all of the log-products by the maximum log-product. As a result, the values in post$prob are not all zero, but they also aren't exactly probabilities. Instead they are relative posterior probabilities. But that's good enough for what we wish to do with these values. [96]
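A compact sketch of the six steps note 58 describes, with assumed grid ranges and priors (the specific numbers here are placeholders, and d2$height stands in for the height data the note refers to):

    mu.list <- seq( from=140 , to=160 , length.out=200 )
    sigma.list <- seq( from=4 , to=9 , length.out=200 )
    post <- expand.grid( mu=mu.list , sigma=sigma.list )
    post$LL <- sapply( 1:nrow(post) , function(i)
        sum( dnorm( d2$height , mean=post$mu[i] , sd=post$sigma[i] , log=TRUE ) ) )
    post$prod <- post$LL + dnorm( post$mu , 178 , 20 , log=TRUE ) +
        dunif( post$sigma , 0 , 50 , log=TRUE )
    # scale by the maximum so exp() doesn't round everything to zero
    post$prob <- exp( post$prod - max(post$prod) )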


59. The most accessible of Galton's writings on the topic has been reprinted as Galton (1989). [104]
60. The implied definition of α in a parabolic model is α = E[y_i] − β_1 E[x_i] − β_2 E[x_i²]. Now even when the average x_i is zero, E[x_i] = 0, the average square will likely not be zero. So α becomes hard to directly interpret again. [123]

Chapter 5

61. "How to Measure a Storm's Fury One Breakfast at a Time." The Wall Street Journal: September 1, 2011. [129]
62. Simpson (1951). Simpson's paradox is very famous in statistics, probably because recognizing it increases the apparent usefulness of statistical modeling. It's a lot less known outside of statistics. [129]
63. Debates about causal inference go back a long time. David Hume is a key citation. One curious obstacle in modern statistics is that classic causal reasoning requires that if A causes B, then B will always appear when A appears. But with probabilistic relationships, like those described in most contemporary scientific models, it is unsurprising to talk about probabilistic causes, in which B only sometimes follows A. See http://plato.stanford.edu/entries/causation-probabilistic/. [130]
64. See Pearl (2014) for an accessible introduction, with discussion. See also Rubin (2005) for a related approach. [130]
65. Data from Table 2 of Hinde and Milligan (2011). [145]
66. Provided the posterior distributions are Gaussian, you could, however, get the variance of their sum by adding their variances and twice their covariance. The variance of the sum (or difference) of two normal distributions a and b is given by σ_a² + σ_b² ± 2ρσ_aσ_b, where ρ is the correlation between the two (plus for the sum, minus for the difference). [162]
67. See Gelman and Stern (2006) for further explanation, and see Nieuwenhuis et al. (2011) for some evidence of how commonly this mistake occurs. [165]
68. See Stigler (1981) for historical context. There are a number of legitimate ways to derive the method of least squares estimation. Gauss' approach was Bayesian, but a probability interpretation isn't always necessary. [166]
69. These data are modified from an example in Grafen and Hails (2002). [170]

Chapter 6

70. De Revolutionibus, Book 1, Chapter 10. [173]
71. See e.g. Akaike (1978), as well as discussion in Burnham and Anderson (2002). [175]
72. Data from Table 1 of McHenry and Coffing (2000). [176]
73. See Grünwald (2007) for a book-length treatment of these ideas. [180]
74. There are many discussions of bias and variance in the literature, some much more mathematical than others. For a broad treatment, I recommend Chapter 7 of Hastie, Tibshirani and Friedman's 2009 book, which explores BIC, AIC, cross-validation and other measures, all in the context of the bias-variance tradeoff. [182]
75. I first encountered this kind of example in Jaynes (1976), page 246. Jaynes himself credits G. David Forney's 1972 information theory course notes. Forney is an important figure in information theory, having won several awards for his contributions. [182]
76. Shannon (1948). For a more accessible introduction, see the venerable textbook Elements of Information Theory, by Cover and Thomas. Slightly more advanced, but having lots of added value, is Jaynes' (2003, Chapter 11) presentation. A foundational book in applying information theory to statistical inference is Kullback (1959), but it's not easy reading. [184]


77. See two famous editorials on the topic: Shannon (1956) and Elias (1958). Elias' editorial is a clever work of satire and remains as current today as it was in 1958. Both of these one-page editorials are readily available online. [184]
78. I really wish I could say there is an accessible introduction to maximum entropy, at the level of most natural and social scientists' math training. If there is, I haven't found it yet. Jaynes (2003) is an essential source, but if your integral calculus is rusty, progress will be very slow. Better might be Steven Frank's papers (2009; 2011) that explain the approach and relate it to common distributions in nature. You can mainly hum over the maths in these and still get the major concepts. See also Harte (2011), for a textbook presentation of applications in ecology. [186]
79. Kullback and Leibler 1951. Note however that Kullback and Leibler did not name this measure after themselves. See Kullback (1987) for Solomon Kullback's reflections on the nomenclature. For what it's worth, Kullback and Leibler make it clear in their 1951 paper that Harold Jeffreys had used this measure already in the development of Bayesian statistics. [186]
80. Akaike (1973). See also Akaike (1974, 1978, 1981), where AIC was further developed and related to Bayesian approaches. Population biologists tend to know about AIC from Burnham and Anderson (2002), which strongly advocates its use. [190]
81. Lunn et al. (2013) contains a fairly understandable presentation of DIC, including a number of different ways to compute it. [192]
82. Watanabe (2010). [192]
83. Gelman et al. (2013b) re-dub WAIC to give explicit credit to Watanabe, in the same way people renamed AIC after Akaike. Gelman et al. (2013b) is worthwhile also for the broad perspective it takes on the inference problem. [192]
84. Stone (1977). [192]
85. This is closely related to minimum description length. See Grünwald (2007). [192]
86. Spiegelhalter et al. (2002). See Lunn et al. (2013) for an updated discussion. [196]
87. Schwarz (1978). [196]
88. Gelman and Rubin (1995). See also section 7.4, page 182, of Gelman et al. (2013a). [197]
89. See Burnham and Anderson (2002) for a thorough argument in favor of this interpretation. The authors note that some aspects of the procedure remain heuristic, because priors within models are left behind when using information criteria, unlike when using Bayes factors. For some, that's a virtue. For others, a sin. [206]
90. Or alternatively the deviance of each model provides the likelihood and the penalty term provides the prior. See the presentation in Burnham and Anderson (2002). [206]
91. See both Burnham and Anderson (2002) and Claeskens and Hjort (2008). [206]
92. William Henry Harrison's military history earned him the nickname "Old Tippecanoe." Tippecanoe was the site of a large battle between Native Americans and Harrison, in 1811. In popular opinion, he was a war hero. But in popular imagination, Harrison was cursed by the Native Americans in the aftermath of the battle. [211]

Chapter 7

93. All manatee facts here taken from Lightsey et al. (2006); Rommel et al. (2007). [215]
94. Wald (1943). See Mangel and Samaniego (1984) for a more accessible presentation and historical context. [215]
95. Wald (1950). Wald's foundational paper is Wald (1939). Fienberg (2006) is a highly-recommended read for historical context. For more technical discussions, see Berger (1985), Robert (2007), and Jaynes 2003, page 406. [217]


96. From Nunn and Puga 2011. [218]
97. Riley et al. 1999. [218]

Chapter 8

98. Metropolis et al. (1953). The algorithm has been named after the first author of this paper, however it's not clear how each co-author participated in discovery and implementation of the algorithm. Among the other authors were Edward Teller, most famous as the father of the hydrogen bomb, and Marshall Rosenbluth, a renowned physicist in his own right, as well as their wives Augusta and Arianna (respectively), who did much of the computer programming. Nicholas Metropolis led the research group. Their work was in turn based on earlier work with Stanislaw Ulam, Metropolis and Ulam (1949). [247]
99. Hastings (1970). [247]
100. Geman and Geman (1984) is the original. I recommend Casella and George (1992) as well. Note that Gibbs sampling is named after physicist and mathematician J. W. Gibbs, one of the founders of statistical physics. However, Gibbs died in the year 1903, long before even the Metropolis algorithm was invented. Instead the strategy is named after Gibbs, both to honor him and in light of the algorithm's connections to statistical physics. [247]
101. Chapter 16 of Jaynes (2003). [250]
102. Cite Gelman chapter in Handbook of MCMC. [258]

Chapter 9

103. Does Jaynes have a clear proof of this? Could cite Frank paper, but I think he just cites Jaynes. [274]
104. Common intro text is Singer and Willett 2003. [275]

Chapter 10

105. No one knows precisely why this function has this name, but there are plausible guesses. The term appears to originate with Verhulst's 1838 paper on population growth, in which he called it logistique. At the time, the terms logarithm and logistic were sometimes used interchangeably. So perhaps Verhulst was stating nothing more than that this is a logarithmic curve. The term logit is a play on the term probit, which is itself a shortening of probability unit. Probit usually refers to the Gaussian quantile distribution. [279]
106. Silk et al. (2005). [279]
107. Kline and Boyd 2010. [294]

Chapter 11

108. Because these models are built on cumulative densities, rather than regular densities, they are sometimes called cumulative link models. Remember, the "link" in a GLM refers to the function that connects the additive model to the outcome. So a "cumulative" link uses a cumulative density function, of log-odds in this case, to do so. [304]
109. Cushman et al. 2006. [307]
110. There are notable exceptions. Find a good introductory citation. [318]
111. R calls A shape1 and B shape2. See ?dbeta, for example. [320]
112. The built-in R function dnbinom provides the same likelihoods, but under a different guise. Using dgampois is pedagogically useful, but functionally equivalent to dnbinom. [328]


113. Hurlbert (1984) is the typical citation in ecology and population biology. [338]
114. I adopt the terminology of Gelman 2005, who argues that the common term random effects hardly aids with understanding, for most people. Indeed, it seems to encourage misunderstanding, partly because the terms fixed and random mean different things to different statisticians. See pages 20–21 of Gelman's paper. I fully realize, however, that by trying to spread Gelman's alternative jargon, I am essentially spitting into a very strong wind. [339]
115. This fact has been understood much longer than multilevel models have been practical to use. See Stein (1955) for an influential paper. [343]


Bibliography

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. and Csaki, F., editors, Second International Symposium on Information Theory, pages 267–281.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math., 30:9–14.
Akaike, H. (1981). Likelihood of a model and information criteria. Journal of Econometrics, 16:3–14.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, 2nd edition.
Berger, J. O. and Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, pages 159–165.
Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In Launer, R. and Wilkinson, G., editors, Robustness in Statistics. Academic Press, New York.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley Pub. Co., Reading, Mass.
Breiman, L. (1968). Probability. Addison-Wesley Pub. Co.
Burnham, K. and Anderson, D. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, 2nd edition.
Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3):167–174.
Cho, A. (2011). Superluminal neutrinos: Where does the time go? Science, 334(6060):1200–1201.
Claeskens, G. and Hjort, N. (2008). Model Selection and Model Averaging. Cambridge University Press.
Clark, J. S. (2012). The coherence problem with the unified neutral theory of biodiversity. Trends in Ecology and Evolution, 27:198–202.
Collins, H. M. and Pinch, T. (1998). The Golem: What You Should Know about Science. Cambridge University Press, 2nd edition.
Cushman, F., Young, L., and Hauser, M. (2006). The role of conscious reasoning and intuition in moral judgment: Testing three principles of harm. Psychological Science, 17(12):1082–1089.
Daston, L. J. and Galison, P. (2007). Objectivity. MIT Press, Cambridge MA.
Elias, P. (1958). Two famous papers. IRE Transactions on Information Theory, 4:99.
Feynman, R. (1967). The Character of Physical Law. MIT Press.
Fienberg, S. E. (2006). When did Bayesian inference become "Bayesian"? Bayesian Analysis, 1(1):1–40.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society B, 17(1):69–78.
Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Hafner, New York, NY.
Frank, S. A. (2009). The common patterns of nature. Journal of Evolutionary Biology, 22:1563–1585.


Frank, S. A. (2011). Measurement scale in maximum entropy models of species abundance. Journal of Evolutionary Biology, 24:485–496.
Galton, F. (1989). Kinship and correlation. Statistical Science, 4(2):81–86.
Gelman, A. (2005). Analysis of variance: Why it is more important than ever. The Annals of Statistics, 33(1):1–53.
Gelman, A., Carlin, J. C., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013a). Bayesian Data Analysis. Chapman & Hall/CRC, 3rd edition.
Gelman, A., Hwang, J., and Vehtari, A. (2013b). Understanding predictive information criteria for Bayesian models.
Gelman, A. and Nolan, D. (2002). Teaching Statistics: A Bag of Tricks. Oxford University Press.
Gelman, A. and Robert, C. P. (2013). "Not only defended but also applied": The perceived absurdity of Bayesian inference. The American Statistician, 67(1):1–5.
Gelman, A. and Rubin, D. B. (1995). Avoiding model selection in Bayesian social research. Sociological Methodology, 25:165–173.
Gelman, A. and Stern, H. (2006). The difference between "significant" and "not significant" is not itself statistically significant. The American Statistician, 60(4):328–331.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
Gigerenzer, G. and Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102:684–704.
Gigerenzer, G., Krauss, S., and Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In Kaplan, D., editor, The Sage Handbook of Quantitative Methodology for the Social Sciences, pages 391–408. Sage Publications, Inc., Thousand Oaks.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Kruger, L. (1990). The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge University Press.
Gillespie, J. H. (1977). Sampling theory for alleles in a random environment. Nature, 266:443–445.
Grafen, A. and Hails, R. (2002). Modern Statistics for the Life Sciences. Oxford University Press, Oxford.
Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press, Cambridge MA.
Hacking, I. (1983). Representing and Intervening: Introductory Topics in the Philosophy of Natural Science. Cambridge University Press, Cambridge.
Hahn, M. W. and Bentley, R. A. (2003). Drift as a mechanism for cultural change: an example from baby names. Proceedings of the Royal Society B, 270:S120–S123.
Harte, J. (2011). Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics. Oxford Series in Ecology and Evolution. Oxford University Press, Oxford.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
Hauer, E. (2004). The harm done by tests of significance. Accident Analysis & Prevention, 36:495–500.
Henrion, M. and Fischoff, B. (1986). Assessing uncertainty in physical constants. American Journal of Physics, 54:791–798.
Hinde, K. and Milligan, L. M. (2011). Primate milk synthesis: Proximate mechanisms and ultimate perspectives. Evolutionary Anthropology, 20:9–23.
Howell, N. (2000). Demography of the Dobe !Kung. Aldine de Gruyter, New York.
Howell, N. (2010). Life Histories of the Dobe !Kung: Food, Fatness, and Well-being over the Life-span. Origins of Human Behavior and Culture. University of California Press.
Hubbell, S. P. (2001). The Unified Neutral Theory of Biodiversity and Biogeography. Princeton University Press, Princeton.
Hurlbert, S. H. (1984). Pseudoreplication and the design of ecological field experiments. Ecological Monographs, 54:187–211.


Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8):0696–0701.
Jaynes, E. T. (1976). Confidence intervals vs Bayesian intervals. In Harper, W. L. and Hooker, C. A., editors, Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, page 175.
Jaynes, E. T. (1984). The intuitive inadequacy of classical statistics. Epistemologia, 7:43–74.
Jaynes, E. T. (1985). Highly informative priors. Bayesian Statistics, 2:329–360.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
Kadane, J. B. (2011). Principles of Uncertainty. Chapman & Hall/CRC.
Kitcher, P. (2011). Science in a Democratic Society. Prometheus Books, Amherst, New York.
Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, NY.
Kullback, S. (1987). The Kullback-Leibler distance. The American Statistician, 41(4):340.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86.
Lansing, J. S. and Cox, M. P. (2011). The domain of the replicators: Selection, neutrality, and cultural evolution (with commentary). Current Anthropology, 52:105–125.
Lee, R. B. and DeVore, I., editors (1976). Kalahari Hunter-Gatherers: Studies of the !Kung San and Their Neighbors. Harvard University Press, Cambridge.
Levins, R. (1966). The strategy of model building in population biology. American Scientist, 54.
Lightsey, J. D., Rommel, S. A., Costidis, A. M., and Pitchford, T. D. (2006). Methods used during gross necropsy to determine watercraft-related mortality in the Florida manatee (Trichechus manatus latirostris). Journal of Zoo and Wildlife Medicine, 37(3):262–275.
Lindley, D. V. (1971). Estimation of many parameters. In Godambe, V. P. and Sprott, D. A., editors, Foundations of Statistical Inference. Holt, Rinehart and Winston, Toronto.
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2013). The BUGS Book. CRC Press.
Mangel, M. and Samaniego, F. (1984). Abraham Wald's work on aircraft survivability. Journal of the American Statistical Association, 79:259–267.
Marin, J.-M. and Robert, C. (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.
McGrayne, S. B. (2011). The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy. Yale University Press.
McHenry, H. M. and Coffing, K. (2000). Australopithecus to Homo: Transformations in body and mind. Annual Review of Anthropology, 29:125–146.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21(6):1087–1092.
Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341.
Morison, S. E. (1942). Admiral of the Ocean Sea: A Life of Christopher Columbus. Little, Brown and Company, Boston.
Mulkay, M. and Gilbert, G. N. (1981). Putting philosophy to work: Karl Popper's influence on scientific practice. Philosophy of the Social Sciences, 11:389–407.
Nieuwenhuis, S., Forstmann, B. U., and Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14(9):1105–1107.
Nunn, N. and Puga, D. (2011). Ruggedness: The blessing of bad geography in Africa. Review of Economics and Statistics.
Ohta, T. and Gillespie, J. H. (1996). Development of neutral and nearly neutral theories. Theoretical Population Biology, 49:128–142.
Pearl, J. (2014). Understanding Simpson's paradox. The American Statistician, 68:8–13.
Popper, K. (1996). The Myth of the Framework: In Defence of Science and Rationality. Routledge.


Proulx, S. R. and Adler, F. R. (2010). The standard of neutrality: still flapping in the breeze? Journal of Evolutionary Biology, 23:1339–1350.
Rao, C. R. (1997). Statistics and Truth: Putting Chance to Work. World Scientific Publishing.
Rice, K. (2010). A decision-theoretic formulation of Fisher's approach to testing. The American Statistician, 64(4):345–349.
Riley, S. J., DeGloria, S. D., and Elliot, R. (1999). A terrain ruggedness index that quantifies topographic heterogeneity. Intermountain Journal of Sciences, 5:23–27.
Robert, C. P. (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. Springer, 2nd edition.
Rommel, S. A., Costidis, A. M., Pitchford, T. D., Lightsey, J. D., Snyder, R. H., and Haubold, E. M. (2007). Forensic methods for characterizing watercraft from watercraft-induced wounds on the Florida manatee (Trichechus manatus latirostris). Marine Mammal Science, 23(1):110–132.
Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331.
Savage, L. J. (1962). The Foundations of Statistical Inference. Methuen.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27:379–423.
Shannon, C. E. (1956). The bandwagon. IRE Transactions on Information Theory, 2:3.
Silk, J. B., Brosnan, S. F., Vonk, J., Henrich, J., Povinelli, D. J., Richardson, A. S., Lambeth, S. P., Mascaro, J., and Schapiro, S. J. (2005). Chimpanzees are indifferent to the welfare of unrelated group members. Nature, 437:1357–1359.
Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail—but Some Don't. Penguin Press, New York.
Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B, 13:238–241.
Sober, E. (2008). Evidence and Evolution: The Logic Behind the Science. Cambridge University Press, Cambridge.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B, 64:583–639.
Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, Berkeley. University of California Press.
Stigler, S. M. (1981). Gauss and the invention of least squares. The Annals of Statistics, 9(3):465–474.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society B, 39(1):44–47.
Theobald, D. L. (2010). A formal test of the theory of universal common ancestry. Nature, 465:219–222.
Van Horn, K. S. (2003). Constructing a logic of plausible inference: a guide to Cox's theorem. International Journal of Approximate Reasoning, 34:3–24.
Venn, J. (1876). The Logic of Chance. Macmillan and Co, New York, 2nd edition.
Wald, A. (1939). Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics, 10(4):299–326.
Wald, A. (1943). A method of estimating plane vulnerability based on damage of survivors. Technical report, Statistical Research Group, Columbia University.
Wald, A. (1950). Statistical Decision Functions. J. Wiley, New York.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11:3571–3594.
Wittgenstein, L. (1953). Philosophical Investigations.


Wolpert, D. and Macready, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, page 67.


Index

Akaike (1973), 380, 383
Akaike (1974), 380, 383
Akaike (1978), 379, 380, 383
Akaike (1981), 380, 383
Berger and Berry (1988), 377, 383
Berger (1985), 380, 383
Box and Tiao (1973), 377, 383
Box (1979), 375, 383
Breiman (1968), 378, 383
Burnham and Anderson (2002), 379, 380, 383
Casella and George (1992), 381, 383
Cho (2011), 375, 383
Claeskens and Hjort (2008), 380, 383
Clark (2012), 375, 383
Collins and Pinch (1998), 376, 383
Cushman et al. (2006), 381, 383
Daston and Galison (2007), 376, 383
Elias (1958), 379, 383
Feynman (1967), 377, 383
Fienberg (2006), 377, 380, 383
Fisher (1925), 377, 383
Fisher (1955), 376, 383
Fisher (1956), 376, 378, 383
Frank (2009), 378, 380, 383
Frank (2011), 380, 383
Galton (1989), 379, 384
Gelman and Nolan (2002), 377, 384
Gelman and Robert (2013), 377, 384
Gelman and Rubin (1995), 380, 384
Gelman and Stern (2006), 379, 384
Gelman et al. (2013a), 377, 380, 384
Gelman et al. (2013b), 380, 384
Gelman (2005), 381, 384
Geman and Geman (1984), 381, 384
Gigerenzer and Hoffrage (1995), 377, 384
Gigerenzer et al. (1990), 376, 384
Gigerenzer et al. (2004), 376, 384
Gillespie (1977), 375, 384
Grafen and Hails (2002), 379, 384
Grünwald (2007), 379, 380, 384
Hacking (1983), 375, 384
Hahn and Bentley (2003), 375, 384
Harte (2011), 380, 384
Hastie et al. (2009), 379, 384
Hastings (1970), 381, 384
Hauer (2004), 378, 384
Henrion and Fischoff (1986), 378, 384
Hinde and Milligan (2011), 379, 384
Howell (2000), 378, 384
Howell (2010), 378, 384
Hubbell (2001), 375, 384
Hurlbert (1984), 381, 384
Ioannidis (2005), 377, 384
Jaynes (1976), 379, 385
Jaynes (1984), 376, 385
Jaynes (1985), 377, 378, 385
Jaynes (2003), 375, 378, 380, 381, 385
Kadane (2011), 377, 385
Kitcher (2011), 376, 385
Kullback and Leibler (1951), 380, 385
Kullback (1959), 379, 385
Kullback (1987), 380, 385
Lansing and Cox (2011), 375, 385
Lee and DeVore (1976), 378, 385
Levins (1966), 375, 385
Lightsey et al. (2006), 380, 385
Lindley (1971), 376, 385
Lunn et al. (2013), 380, 385
Mangel and Samaniego (1984), 380, 385
Marin and Robert (2007), 376, 385
McGrayne (2011), 377, 385
McHenry and Coffing (2000), 379, 385
Metropolis and Ulam (1949), 381, 385
Metropolis et al. (1953), 381, 385
Morison (1942), 377, 385
Mulkay and Gilbert (1981), 375, 376, 385
Nieuwenhuis et al. (2011), 379, 385
Nunn and Puga (2011), 380, 385
Ohta and Gillespie (1996), 375, 385
Pearl (2014), 379, 385
Popper (1996), 375, 385
Proulx and Adler (2010), 375, 385
Rao (1997), 376, 386
Rice (2010), 378, 386


Riley et al. (1999), 381, 386
Robert (2007), 378, 380, 386
Rommel et al. (2007), 380, 386
Rubin (2005), 379, 386
Savage (1962), 377, 386
Schwarz (1978), 380, 386
Shannon (1948), 379, 386
Shannon (1956), 379, 386
Silk et al. (2005), 381, 386
Silver (2012), 377, 386
Simpson (1951), 379, 386
Sober (2008), 376, 386
Spiegelhalter et al. (2002), 380, 386
Stein (1955), 382, 386
Stigler (1981), 379, 386
Stone (1977), 380, 386
Theobald (2010), 377, 386
Van Horn (2003), 377, 386
Venn (1876), 376, 386
Wald (1939), 380, 386
Wald (1943), 380, 386
Wald (1950), 380, 386
Watanabe (2010), 380, 386
Wittgenstein (1953), 375, 386
Wolpert and Macready (1997), 377, 386

AIC, 184
Akaike weight, 199
Akaike information criterion, 184
barplot, 202
Bayes factor, 190
Bayes' theorem, 38
Bayesian information criterion, 169, 190
Bayesian updating, 32
beta distribution, 312
bias-variance tradeoff, see also overfitting, 176
burn-in, 252
categorical variable, 154
categorical variables, 124
centering, 103, 105, 224, 225
complete pooling, 337
complete-case, 359
Conditioning, 209
confidence interval, 58
consistent, 189
credible interval, 58
cross-validation, 184
cumulative log-odds, 297
cumulative logistic, 298
data compression, 172
data dredging, 205
design formula, 161
deviance, 168, 182
deviance information criterion, 186, 190
DIC, 186
divergence, see also Kullback-Leibler divergence, 180
dotchart, 202
dummy data, 66
dummy variable, 155
effective number of parameters, 191
exchangeable, 88
exponential family, 19, 81
factor analysis, 153
frequentist, 23
gamma-Poisson, 320
Gibbs sampling, 241
Hamiltonian Monte Carlo, 244
highest posterior density interval, 61
imputation, 359
information criteria, 27, 168, 184
information entropy, 179
information theory, 27, 82, 178
  Kullback-Leibler divergence, 180
interaction, 210
interactions, 124
large world, 29
likelihood, 35
linear model, 98
Linear regression, 77
log-odds, 297
logistic, 273
logit, 273
logit link, 273
loss function, 63
Markov chain Monte Carlo, 47
maxent, 180
maximum a posteriori, 44
maximum entropy, 19, 82, 180
maximum likelihood estimate, 46
MCAR, 359
Metropolis algorithm, 241
Minimum Description Length, 174
Model averaging, 196
Model checking, 68
Model comparison, 196
model selection, 196
modus tollens, 19
multicollinearity, 145
multilevel model, 25
multiple comparison, 332
multiple membership, 343
multivariate regression, 123
no pooling, 338


non-identifiability, 153
Ockham's razor, 167
OLS, 160
ordinary least squares, 160
Overfitting, 170
overfitting, 15, 27, 145, 168, 169
pairs, 151
parameters, 36
partial pooling, 338
penalized likelihood, 37
percentile intervals, 59
point estimate, 62
polynomial regression, 115
pooling, 336
posterior distribution, 38
posterior predictive distribution, 69
prequential, 186
principal components, 153
prior, 36
pseudoreplication, 332
quadratic approximation, 43
random slopes, 349
regularizing, 37, 194
regularizing prior, 168, 335
ridge regression, 195
sampling distribution, 23
shrinkage, 335, 356
Simpson's paradox, 123
small world, 29
Stan, 237
standard error, 46
standardize, 116
stargazing, 169
stochastic, 84
subjective Bayesian, 37
trace plot, 250
underfitting, 168, see also overfitting, 174
variance-covariance, 96
varying effects, 333
varying slopes, 349
WAIC, 186
Watanabe-Akaike information criterion, 186
weakly informative, 37
widely applicable information criterion, 186
