Regression Using Bayesian Statistics in R - LISA

Regression Using Bayesian Statistics in R
Nels Johnson
Lead Collaborator, Laboratory for Interdisciplinary Statistical Analysis
07/21/2010


Outline
What is Linear Regression?
Intro to Bayesian Statistics
More on Priors
Bayesian Estimation and Inference
Computing Issues
Examples


Regression
Suppose you have a variable Y whose outcome is considered random. Examples: the amount of soil erosion at a particular site; how much time it takes for a fungus to develop resistance to a fungicide; how many feet a moving car will need to come to a complete stop.
Suppose you have some other variables X which are treated as fixed. Examples: the kind of plants growing in the soil; the composition of the fungicide; how fast the car is traveling.
If you think the observed value of Y depends on X, then regression is a tool for modeling that dependence.


Linear Regression
Linear regression assumes that the underlying structure of Y is a linear combination of the variables X.
Example: Y_dist = β0 + X_speed β_speed + ε
Often the formula for linear regression is condensed using matrix algebra: Y = Xβ + ε
Since Y is considered a random variable, we naturally expect it not to follow the underlying structure exactly.
ε is the term of the model that describes Y's random variation around its underlying structure Xβ.
The most common way to describe the random variation is ε ∼ N(0, σ²).
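
As a concrete illustration of the model Y = Xβ + ε, here is a minimal R sketch that simulates stopping-distance data and fits the classical model with lm(); the coefficient values and variable names are made up for illustration.

    # Simulate the stopping-distance example: dist = beta0 + speed * beta_speed + eps
    set.seed(1)
    n     <- 50
    speed <- runif(n, 5, 40)              # treated as fixed
    eps   <- rnorm(n, mean = 0, sd = 2)   # eps ~ N(0, sigma^2)
    dist  <- 1.5 + 0.8 * speed + eps      # linear structure plus random variation
    fit   <- lm(dist ~ speed)             # classical (frequentist) fit for comparison
    summary(fit)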


Linear Regression
[Figure: scatter plot of Y against X]


Estimating Parameters
Since we don't know the values of β and σ², we need to estimate them based on our data (Y, X).
Traditionally we would do so by finding the β and σ² that maximize the likelihood:
L(Y_i | X_i, β, σ²) = ∏_{i=1}^{n} N(Y_i | X_i β, σ²)
Note: the likelihood is also the joint distribution of the data.
The estimates β̂ = (XᵀX)⁻¹XᵀY and s² = (Y − Xβ̂)ᵀ(Y − Xβ̂)/(n − p) would be considered random variables because they are functions of the random variable Y_i.
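
Continuing the simulated data from the sketch above, these closed-form estimates can be checked directly in R (a sketch; speed, dist, and fit are the illustrative objects created earlier):

    X    <- cbind(1, speed)                    # design matrix with an intercept column
    bhat <- solve(t(X) %*% X, t(X) %*% dist)   # beta-hat = (X'X)^{-1} X'Y
    res  <- dist - X %*% bhat
    s2   <- sum(res^2) / (nrow(X) - ncol(X))   # s^2 = RSS / (n - p)
    cbind(bhat, coef(fit))                     # matches lm()'s estimates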


Interpretation of Parameters
We might compute a confidence interval for β or perform a test to see if it is significantly different from zero.
Confidence intervals are interpreted in terms of the proportion (i.e. relative frequency) of times they should capture the true parameter in the long run.
This is because it is the endpoints of the interval that are random variables, while the parameter is fixed. The interval either captures the fixed parameter or it doesn't.
Because of this interpretation, this paradigm of statistics is called the frequentist paradigm (or classical paradigm).


Illustration of Frequency Interpretation
[Figure]


Bayes' Theorem
The Bayesian paradigm is named after the Rev. Thomas Bayes (by the probabilist Bruno de Finetti) for its use of his theorem.
Take the rule for conditional probability for two events A and B:
P(A|B) = P(A ∩ B) / P(B)
Bayes discovered that this is equivalent to:
P(A|B) = P(B|A)P(A) / P(B) = P(B|A)P(A) / ∫ P(B|A)P(A) dA
This is known as Bayes' Theorem or Bayes' Rule.
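
A toy numeric check of the rule in R; the events and probabilities below are invented purely for illustration.

    # A = "part is defective", B = "test flags the part" (illustrative numbers)
    p_A          <- 0.02    # P(A)
    p_B_given_A  <- 0.95    # P(B | A)
    p_B_given_nA <- 0.10    # P(B | not A)
    p_B <- p_B_given_A * p_A + p_B_given_nA * (1 - p_A)   # total probability: P(B)
    p_A_given_B <- p_B_given_A * p_A / p_B                 # Bayes' Rule
    p_A_given_B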


Bayesian Paradigm
The mathematician Pierre-Simon Laplace had the idea that instead of defining probability only on variables, we could also define probability on parameters, and that by using Bayes' Rule we can make inference on parameters, effectively treating parameters as random variables.
In our regression example, let θ = {β, σ²} and D = the data. Using Bayes' Rule we get:
P(θ|D) = P(D|θ)P(θ) / P(D)
P(θ|D) is called the posterior distribution. It is what we use to make inference about the parameters θ = {β, σ²}.
P(D|θ) is the likelihood we discussed previously. It contains all the information about θ we can learn from the data.
P(θ) is called the prior distribution for θ. It contains the information we believe about θ before we observe the data.
P(D) is the normalizing constant of the function P(D|θ)P(θ), which makes P(θ|D) a proper probability distribution.


Proportionality
Often, knowing the normalizing constant P(D) is not necessary for Bayesian inference.
When it is removed we get:
P(θ|D) ∝ P(D|θ)P(θ)  ⇔  posterior ∝ likelihood × prior
The ∝ symbol means that P(θ|D) is proportional to (i.e. a scalar multiple of) P(D|θ)P(θ). In this case that scalar is 1/P(D).
Usually P(D|θ) and P(θ) only need to be known up to proportionality as well, because their normalizing constants get lumped in with P(D).
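
A short R sketch of what "up to proportionality" buys us: evaluate likelihood × prior on a grid and only normalize at the end. The data, grid, and prior below are illustrative, not from the slides.

    # Grid approximation for a single mean parameter theta
    y      <- c(1.2, 0.7, 1.9, 1.4)                   # a few observations
    theta  <- seq(-3, 5, length.out = 1000)           # grid of candidate values
    lik    <- sapply(theta, function(t) prod(dnorm(y, mean = t, sd = 1)))
    prior  <- dnorm(theta, mean = 0, sd = 2)
    unnorm <- lik * prior                             # posterior up to a constant
    post   <- unnorm / sum(unnorm * diff(theta)[1])   # divide by an approximation of P(D)
    plot(theta, post, type = "l")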


How It Works
One advantage of Bayesian analysis is sequentially updating beliefs about θ.
Process: prior belief about θ → observe data → updated belief about θ. That is, P(θ) → P(D|θ)P(θ) → P(θ|D).
Now that we've done one experiment and have P(θ|D_1), we can conduct another experiment using P(θ|D_1) as the prior for θ. This leads to:
P(θ|D_2, D_1) ∝ P(D_2|θ)P(θ|D_1)
This process can be continually repeated.
Example: Last year you did a study relating the levels of heavy metals in streams to the size of mussels in the streams. Use the posterior from last year's study as your prior for this year's study.
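
A minimal sketch of this sequential updating for a single normal mean with known variance, using the standard conjugate update formulas; the update_normal helper and the simulated "year 1"/"year 2" data are illustrative, not part of the mussel study.

    # Conjugate normal-mean update: prior N(m, v), data y with known sigma^2
    update_normal <- function(m, v, y, sigma2) {
      v_post <- 1 / (1 / v + length(y) / sigma2)
      m_post <- v_post * (m / v + sum(y) / sigma2)
      c(mean = m_post, var = v_post)
    }
    set.seed(3)
    year1 <- rnorm(20, mean = 2, sd = 1)
    year2 <- rnorm(20, mean = 2, sd = 1)
    post1 <- update_normal(m = 0, v = 10, y = year1, sigma2 = 1)   # first study
    post2 <- update_normal(m = post1[["mean"]], v = post1[["var"]],
                           y = year2, sigma2 = 1)                  # year 1's posterior is year 2's prior
    post2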


Prior Distributions
Just as you need to specify the likelihood for the problem, you need to specify the prior distribution.
This is not an intuitive process for many people at first, mainly because of the prior.
What helped me the most: think of the prior as a measure of our uncertainty about θ's true value.
Very often it is easiest to write the joint prior as independent univariate priors, e.g. P(θ) = P(β)P(σ²).
Some more important terminology that comes up when talking about priors:
Informative and noninformative priors
Proper and improper priors
Conjugate priors
Reference priors


Heavy Prior Weight
Informative priors put a lot of weight on a certain region of the parameter space. This corresponds to a strong prior belief about what θ should be and thus provides a lot of information about θ.
Careful selection of an informative prior can make Bayesian estimators superior to their frequentist counterparts.
However, poor selection of an informative prior can really make a mess of things unless your sample size is "large enough".
Uninformative priors (also called diffuse or flat priors) spread weight over a wide range of the parameter space. This results in less information about the parameter. It is a common choice for practitioners without much prior information.
Maximum likelihood estimates for θ in the frequentist paradigm often have analogous Bayesian estimates where an uninformative prior is placed on θ.
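
A small sketch of how prior weight matters, using the conjugate normal-mean update with a known variance; the prior variances 0.01 (informative) and 100 (diffuse) are arbitrary illustrative choices.

    set.seed(4)
    y <- rnorm(10, mean = 2, sd = 1)
    post_mean <- function(prior_var, prior_mean = 0, sigma2 = 1) {
      v <- 1 / (1 / prior_var + length(y) / sigma2)    # posterior variance
      v * (prior_mean / prior_var + sum(y) / sigma2)   # posterior mean
    }
    post_mean(0.01)   # informative prior centred at 0 pulls the estimate toward 0
    post_mean(100)    # diffuse prior: posterior mean is close to the sample mean
    mean(y)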


Informative Prior
[Figure: likelihood, prior, and posterior densities for θ]


Improper Priors
Improper priors are a special kind of uninformative prior where π(θ) is not actually a probability distribution.
In regression, some examples are π(β) ∝ 1 and π(σ²) ∝ 1/σ². These are examples of Jeffreys priors (see next slide).
It's OK to use an improper prior only when the resulting posterior is a valid probability distribution.
This can be difficult to verify sometimes, so if you aren't sure, don't use one.
Using a proper prior (one where π(θ) is a valid probability distribution) will always result in a valid posterior distribution.


Reference Priors
The idea behind reference priors is that they are supposed to be a "default" noninformative prior.
Reference priors give the posterior what are called "optimal frequentist" properties.
For our purposes that means the credible intervals give the same results as confidence intervals.
Reference priors are specific to each modeling problem and may not exist for every problem.
If θ is one dimensional (i.e. there is only one parameter), then the Jeffreys prior is the reference prior.


Conjugacy
If P(θ) is a conjugate prior for θ with likelihood P(D|θ), then the posterior P(θ|D) belongs to the same family of distributions as the prior.
Examples:
If P(D|β, σ²) ∼ N(Xβ, σ²) and P(β) is some normal distribution, then P(β|D, σ²) also follows a normal distribution.
If P(D|β, σ²) ∼ N(Xβ, σ²) and P(σ⁻²) is some gamma distribution, then P(σ⁻²|D, β) also follows a gamma distribution.
This is nice because we know a lot about the properties of the normal and gamma distributions. We can also readily generate values from these distributions with computer programs.
For a list of conjugate priors: http://en.wikipedia.org/wiki/Conjugate_prior
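
A sketch of the second example in a simplified case (known mean, unknown precision): with a gamma prior on 1/σ², the posterior is again a gamma, so we can draw from it directly. The prior and data values are illustrative.

    set.seed(5)
    y  <- rnorm(30, mean = 0, sd = 2)
    a0 <- 2; b0 <- 1                           # prior: 1/sigma^2 ~ Gamma(shape a0, rate b0)
    a1 <- a0 + length(y) / 2                   # posterior shape
    b1 <- b0 + sum((y - 0)^2) / 2              # posterior rate (mean known to be 0)
    prec_draws <- rgamma(5000, shape = a1, rate = b1)   # posterior is again a gamma
    quantile(1 / prec_draws, c(0.025, 0.5, 0.975))      # summaries of sigma^2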


Without Conjugacy?
Without conjugacy we might end up with a posterior we know nothing about.
This can make it very hard to compute summaries of the posterior, and for many years it crippled our ability to do Bayesian inference.
Even with conjugacy, we can often only get conjugate posterior distributions that are conditional on other parameters in the model (like the examples above).
Luckily, in the last 20 years the power of computers has changed this, which is why you are seeing so many more Bayesian papers.


Posterior Distributions
Bayesian inference is usually done on the joint posterior distribution of all the parameters, P(θ|D).
Sometimes it is done on the marginal distribution of a single parameter, such as:
P(β|D) = ∫ P(θ|D) dσ²
Because the posterior is a probability distribution for θ, performing inference on θ is as simple as finding relevant summaries of its posterior.


Point Estimation
Two very popular point estimators for θ:
The mean of the posterior distribution (the posterior mean).
The maximum a posteriori estimator, also known as the MAP estimator.
The MAP estimator is argmax_θ P(θ|D).
You could use any measure of center that makes sense for P(θ|D).
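
Both estimators are easy to approximate from posterior draws; the draws below are a stand-in for samples from P(θ|D), and the kernel-density mode is only a rough MAP approximation.

    set.seed(6)
    draws <- rgamma(10000, shape = 3, rate = 1)   # pretend these came from P(theta | D)
    mean(draws)                                   # posterior mean
    d <- density(draws)
    d$x[which.max(d$y)]                           # crude MAP: mode of a kernel density estimate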


Illustration of MAP Estimator
[Figure: posterior density with its mode, θ̂ = 3, marked]


Interval Estimation
Bayesian confidence intervals are called credible intervals, so a 95% Bayesian confidence interval is called a 95% credible interval.
There are two popular ways to find them:
Highest posterior density, often shortened to HPD.
Equal tail probability.
A 95% HPD credible interval is the smallest interval with probability 0.95.
A 95% equal-tail-probability interval uses the values with cumulative probability 0.025 and 0.975 as the endpoints. This is not preferred when the posterior is highly skewed or multimodal.
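
Continuing the illustrative draws from the point-estimation sketch, the equal-tail interval is a quantile computation; an HPD interval is available from the coda package, whose use here is an assumption (check ?HPDinterval).

    quantile(draws, c(0.025, 0.975))   # 95% equal-tail credible interval
    # 95% HPD interval, assuming the coda package is installed:
    # library(coda); HPDinterval(as.mcmc(draws), prob = 0.95)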


HPD and Equal Tail Credible Intervals
[Figure: posterior density with credible-interval endpoints at 1.089865 and 8.767273]


Hypothesis Testing
Since we can talk about the probability of θ being in some interval, interpretation of certain kinds of hypothesis tests becomes much easier.
For instance, to test H0: β ≤ 0 vs. Ha: β > 0, you can find P_{β|D}(β ≤ 0), and if it is sufficiently small, reject the null hypothesis in favor of the alternative.
We could also test H0: a ≤ β ≤ b vs. Ha: β ≤ a or b ≤ β, where a and b are chosen so that if H0 were true, β would have no practical effect. Again, compute P_{β|D}(H0) and, if it is sufficiently small, reject H0.
Some special steps need to be taken when testing H0: β = 0, which are beyond the scope of this course.
Also, instead of rejecting H0 or accepting Ha, you could just report the evidence (i.e. the probability) for the reader to decide.
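
With posterior draws of β, these probabilities are just sample proportions; the draws and the interval (a, b) = (0.2, 0.5) below are illustrative.

    set.seed(7)
    beta_draws <- rnorm(10000, mean = 0.8, sd = 0.4)   # pretend these came from P(beta | D)
    mean(beta_draws <= 0)                              # estimate of P(beta <= 0 | D)
    mean(beta_draws >= 0.2 & beta_draws <= 0.5)        # estimate of P(a <= beta <= b | D)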


Sampling From the Joint Posterior
As stated previously, we want to perform inference on θ using the joint posterior P(θ|D).
Summarizing the joint posterior may be difficult (or impossible).
A solution to this problem is to use a computer algorithm to sample observations from the joint posterior.
If we have a large enough sample from P(θ|D), then the distribution of the sample should approximate P(θ|D).
This means we can summarize the sample from P(θ|D) to make inference on θ instead of using the distribution P(θ|D) directly.
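
A quick sanity check of this idea in R, using draws from a known distribution as a stand-in for posterior samples:

    set.seed(8)
    theta_draws <- rnorm(10000, mean = 1, sd = 2)   # stand-in for draws from P(theta | D)
    c(mean(theta_draws), sd(theta_draws))           # close to the true mean 1 and sd 2
    quantile(theta_draws, c(0.025, 0.975))          # close to the true quantiles below
    qnorm(c(0.025, 0.975), mean = 1, sd = 2)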


Using Samples to Approximate the Posterior
[Figure: density estimates of θ from samples of size N = 20, 100, 500, 1000, 5000, and 10000]


One (Three) Important Algorithm(s)
The big three algorithms are as follows:
Gibbs sampling.
The Metropolis algorithm.
The Metropolis-Hastings algorithm.
All three of these are Markov chain Monte Carlo algorithms, commonly known as MCMC algorithms.
What that means for us is that after you run the algorithm "long enough", you will eventually be sampling from P(θ|D).
The Metropolis-Hastings algorithm is a generalization of the other two, and it is so similar to the Metropolis algorithm that the two names are often used interchangeably.
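
A minimal random-walk Metropolis sketch for a single parameter, targeting likelihood × prior on the log scale; every value here (data, prior, proposal standard deviation, burn-in) is an illustrative assumption.

    set.seed(9)
    y <- rnorm(25, mean = 1.5, sd = 1)
    log_post <- function(theta) {
      sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) +   # log-likelihood
        dnorm(theta, mean = 0, sd = 10, log = TRUE)       # log-prior
    }
    n_iter <- 5000
    theta  <- numeric(n_iter)                             # chain starts at 0
    for (i in 2:n_iter) {
      prop <- rnorm(1, mean = theta[i - 1], sd = 0.5)     # symmetric random-walk proposal
      if (log(runif(1)) < log_post(prop) - log_post(theta[i - 1])) {
        theta[i] <- prop                                  # accept the move
      } else {
        theta[i] <- theta[i - 1]                          # reject: stay put
      }
    }
    mean(theta[-(1:1000)])                                # discard burn-in, then summarize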


More on Gibbs
The Gibbs sampler is a much simplified version of the Metropolis-Hastings algorithm, so simplified, in fact, that it isn't obvious at all that they are related.
All three algorithms require you to know the conditional posterior distribution of each parameter in the model (e.g. P(β|D, σ²) and P(σ²|D, β)).
The Gibbs sampler requires that the user can sample from each of the conditional posterior distributions using computer software.
This is where conjugate priors become very popular. If the conditional posterior distributions are all normal, gamma, or some other distribution we can readily sample from, then the Gibbs algorithm makes sampling from the joint posterior very easy.
Other details about this algorithm are beyond the scope of this course.
The data analysis examples we will discuss both use a Gibbs sampler.
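
A sketch of a Gibbs sampler for the regression model with the conjugate priors just described: a normal conditional for β and a gamma conditional for 1/σ². The simulated data, prior settings, and burn-in length are illustrative.

    set.seed(10)
    n <- 60
    X <- cbind(1, rnorm(n))
    y <- drop(X %*% c(1, 2) + rnorm(n, sd = 1.5))
    b0 <- rep(0, 2); B0inv <- diag(1e-4, 2)          # vague normal prior on beta
    a0 <- 0.01; d0 <- 0.01                           # vague gamma prior on 1/sigma^2
    n_iter <- 5000
    beta <- matrix(0, n_iter, 2)
    sig2 <- numeric(n_iter); sig2[1] <- 1
    for (i in 2:n_iter) {
      # 1. Draw beta | sigma^2, D from its normal conditional
      Bn <- solve(B0inv + crossprod(X) / sig2[i - 1])
      bn <- Bn %*% (B0inv %*% b0 + crossprod(X, y) / sig2[i - 1])
      beta[i, ] <- bn + t(chol(Bn)) %*% rnorm(2)
      # 2. Draw 1/sigma^2 | beta, D from its gamma conditional
      resid   <- y - X %*% beta[i, ]
      sig2[i] <- 1 / rgamma(1, shape = a0 + n / 2, rate = d0 + sum(resid^2) / 2)
    }
    colMeans(beta[-(1:1000), ])   # posterior means of the coefficients
    mean(sig2[-(1:1000)])         # posterior mean of sigma^2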


More on Metropolis-Hastings
If we can't find conjugate priors, or some other prior that makes it easy to sample from the conditional posteriors (as in logistic regression), then we have to use the Metropolis-Hastings algorithm, which allows us to draw samples from almost any conditional posterior distribution.
Some papers contain tricks for turning problems that need Metropolis-Hastings into problems that only need Gibbs.
The probit regression example we will do uses a trick by Albert and Chib (1993) that allows us to use Gibbs instead of M-H.


Examples
Open up your R console!
We'll be discussing:
How to estimate a simple linear regression and a probit regression using the bayesm package in R.
How to assess whether our Gibbs algorithm is working properly.
How the Bayesian model fit compares to the classical model fit.
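
As a preview, a hedged sketch of what a bayesm call might look like for the simulated stopping-distance data from earlier; the argument names (Data, Mcmc, R) and output components (betadraw, sigmasqdraw) are from memory of the bayesm interface and should be checked against ?runiregGibbs.

    # Sketch only: verify the interface in the bayesm documentation before relying on it.
    # install.packages("bayesm")
    library(bayesm)
    out <- runiregGibbs(Data = list(y = dist, X = cbind(1, speed)),
                        Mcmc = list(R = 10000))
    colMeans(out$betadraw)                       # posterior means of the regression coefficients
    quantile(out$sigmasqdraw, c(0.025, 0.975))   # credible interval for sigma^2
    coef(fit)                                    # compare with the classical fit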
