Regression Using Bayesian Statistics in R - LISA

Regression Using Bayesian Statistics in R
Nels Johnson
Lead Collaborator, Laboratory for Interdisciplinary Statistical Analysis
07/21/2010


Outline
What is Linear Regression?
Intro to Bayesian Statistics
More on Priors
Bayesian Estimation and Inference
Computing Issues
Examples


Regression
Suppose you have a variable Y whose outcome is considered random. Examples: the amount of soil erosion at a particular site; how much time it takes for a fungus to develop resistance to a fungicide; how many feet a moving car will need to come to a complete stop.
Suppose you have some other variables X which are treated as fixed. Examples: the kind of plants growing in the soil; the composition of the fungicide; how fast the car is traveling.
If you think the observed value of Y depends on X, then regression is a tool for modeling that dependence.


Linear Regression
Linear regression assumes that the underlying structure of Y is a linear combination of the variables X.
Example: Y_dist = β0 + X_speed β_speed + ε
Often the formula for linear regression is condensed using matrix algebra: Y = Xβ + ε
Since Y is considered a random variable, we naturally expect it not to follow the underlying structure exactly.
ε is the term of the model that describes Y's random variation around its underlying structure Xβ.
The most common way to describe the random variation is ε ∼ N(0, σ²).
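
As a concrete illustration of the model Y = Xβ + ε, here is a minimal R sketch that simulates stopping-distance data and fits the classical model with lm(); the coefficient values and variable names are made up for illustration.

    # Simulate the stopping-distance example: dist = beta0 + speed * beta_speed + eps
    set.seed(1)
    n     <- 50
    speed <- runif(n, 5, 40)              # treated as fixed
    eps   <- rnorm(n, mean = 0, sd = 2)   # eps ~ N(0, sigma^2)
    dist  <- 1.5 + 0.8 * speed + eps      # linear structure plus random variation
    fit   <- lm(dist ~ speed)             # classical (frequentist) fit for comparison
    summary(fit)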


Linear Regression
[Figure: scatter plot of Y against X]


Estimating Parameters
Since we don't know the values of β and σ², we need to estimate them based on our data (Y, X).
Traditionally we would do so by finding the β and σ² that maximize the likelihood:
L(Y_i | X_i, β, σ²) = ∏_{i=1}^{n} N(Y_i | X_i β, σ²)
Note: the likelihood is also the joint distribution of the data.
The estimates β̂ = (XᵀX)⁻¹XᵀY and s² = (Y − Xβ̂)ᵀ(Y − Xβ̂)/(n − p) would be considered random variables because they are functions of the random variable Y_i.
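
Continuing the simulated data from the sketch above, these closed-form estimates can be checked directly in R (a sketch; speed, dist, and fit are the illustrative objects created earlier):

    X    <- cbind(1, speed)                    # design matrix with an intercept column
    bhat <- solve(t(X) %*% X, t(X) %*% dist)   # beta-hat = (X'X)^{-1} X'Y
    res  <- dist - X %*% bhat
    s2   <- sum(res^2) / (nrow(X) - ncol(X))   # s^2 = RSS / (n - p)
    cbind(bhat, coef(fit))                     # matches lm()'s estimates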


Interpretation of Parameters
We might compute a confidence interval for β or perform a test to see if it is significantly different from zero.
Confidence intervals are interpreted in terms of the proportion (i.e. relative frequency) of times they should capture the true parameter in the long run.
This is because it is the endpoints of the interval that are random variables, while the parameter is fixed. The interval either captures the fixed parameter or it doesn't.
Because of this interpretation, this paradigm of statistics is called the frequentist paradigm (or classical paradigm).


Illustration of Frequency Interpretation
[Figure]


Bayes' Theorem
The Bayesian paradigm is named after the Rev. Thomas Bayes (by the probabilist Bruno de Finetti) for its use of his theorem.
Take the rule for conditional probability for two events A and B:
P(A|B) = P(A ∩ B) / P(B)
Bayes discovered that this is equivalent to:
P(A|B) = P(B|A)P(A) / P(B) = P(B|A)P(A) / ∫ P(B|A)P(A) dA
This is known as Bayes' Theorem or Bayes' Rule.
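
A toy numeric check of the rule in R; the events and probabilities below are invented purely for illustration.

    # A = "part is defective", B = "test flags the part" (illustrative numbers)
    p_A          <- 0.02    # P(A)
    p_B_given_A  <- 0.95    # P(B | A)
    p_B_given_nA <- 0.10    # P(B | not A)
    p_B <- p_B_given_A * p_A + p_B_given_nA * (1 - p_A)   # total probability: P(B)
    p_A_given_B <- p_B_given_A * p_A / p_B                 # Bayes' Rule
    p_A_given_B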


Bayesian Paradigm
The mathematician Pierre-Simon Laplace had the idea that instead of defining probability only on variables, we could also define probability on parameters, and that by using Bayes' Rule we can make inference on parameters, effectively treating parameters as random variables.
In our regression example, let θ = {β, σ²} and D = the data. Using Bayes' Rule we get:
P(θ|D) = P(D|θ)P(θ) / P(D)
P(θ|D) is called the posterior distribution. It is what we use to make inference about the parameters θ = {β, σ²}.
P(D|θ) is the likelihood we discussed previously. It contains all the information about θ we can learn from the data.
P(θ) is called the prior distribution for θ. It contains the information we believe about θ before we observe the data.
P(D) is the normalizing constant of the function P(D|θ)P(θ), which makes P(θ|D) a proper probability distribution.


Proportionality
Often, knowing the normalizing constant P(D) is not necessary for Bayesian inference.
When it is removed we get:
P(θ|D) ∝ P(D|θ)P(θ)  ⇔  posterior ∝ likelihood × prior
The ∝ symbol means that P(θ|D) is proportional to (i.e. a scalar multiple of) P(D|θ)P(θ). In this case that scalar is 1/P(D).
Usually P(D|θ) and P(θ) only need to be known up to proportionality as well, because their normalizing constants get lumped in with P(D).
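
A short R sketch of what "up to proportionality" buys us: evaluate likelihood × prior on a grid and only normalize at the end. The data, grid, and prior below are illustrative, not from the slides.

    # Grid approximation for a single mean parameter theta
    y      <- c(1.2, 0.7, 1.9, 1.4)                   # a few observations
    theta  <- seq(-3, 5, length.out = 1000)           # grid of candidate values
    lik    <- sapply(theta, function(t) prod(dnorm(y, mean = t, sd = 1)))
    prior  <- dnorm(theta, mean = 0, sd = 2)
    unnorm <- lik * prior                             # posterior up to a constant
    post   <- unnorm / sum(unnorm * diff(theta)[1])   # divide by an approximation of P(D)
    plot(theta, post, type = "l")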


How It Works
One advantage of Bayesian analysis is sequentially updating beliefs about θ.
Process: prior belief about θ → observe data → updated belief about θ. That is, P(θ) → P(D|θ)P(θ) → P(θ|D).
Now that we've done one experiment and have P(θ|D_1), we can conduct another experiment using P(θ|D_1) as the prior for θ. This leads to:
P(θ|D_2, D_1) ∝ P(D_2|θ)P(θ|D_1)
This process can be continually repeated.
Example: Last year you did a study relating the levels of heavy metals in streams to the size of mussels in the streams. Use the posterior from last year's study as your prior for this year's study.
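
A minimal sketch of this sequential updating for a single normal mean with known variance, using the standard conjugate update formulas; the update_normal helper and the simulated "year 1"/"year 2" data are illustrative, not part of the mussel study.

    # Conjugate normal-mean update: prior N(m, v), data y with known sigma^2
    update_normal <- function(m, v, y, sigma2) {
      v_post <- 1 / (1 / v + length(y) / sigma2)
      m_post <- v_post * (m / v + sum(y) / sigma2)
      c(mean = m_post, var = v_post)
    }
    set.seed(3)
    year1 <- rnorm(20, mean = 2, sd = 1)
    year2 <- rnorm(20, mean = 2, sd = 1)
    post1 <- update_normal(m = 0, v = 10, y = year1, sigma2 = 1)   # first study
    post2 <- update_normal(m = post1[["mean"]], v = post1[["var"]],
                           y = year2, sigma2 = 1)                  # year 1's posterior is year 2's prior
    post2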


Prior Distributions
Just as you need to specify the likelihood for the problem, you need to specify the prior distribution.
This is not an intuitive process for many people at first, mainly because of the prior.
What helped me the most: think of the prior as a measure of our uncertainty about θ's true value.
Very often it is easiest to write the joint prior as independent univariate priors, e.g. P(θ) = P(β)P(σ²).
Some more important terminology that comes up when talking about priors:
Informative and noninformative priors
Proper and improper priors
Conjugate priors
Reference priors


Heavy Prior Weight
Informative priors put a lot of weight on a certain region of the parameter space. This corresponds to a strong prior belief about what θ should be and thus provides a lot of information about θ.
Careful selection of an informative prior can make Bayesian estimators superior to their frequentist counterparts.
However, poor selection of an informative prior can really make a mess of things unless your sample size is "large enough".
Uninformative priors (also called diffuse or flat priors) spread weight over a wide range of the parameter space. This results in less information about the parameter. It is a common choice for practitioners without much prior information.
Maximum likelihood estimates for θ in the frequentist paradigm often have analogous Bayesian estimates where an uninformative prior is placed on θ.
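
A small sketch of how prior weight matters, using the conjugate normal-mean update with a known variance; the prior variances 0.01 (informative) and 100 (diffuse) are arbitrary illustrative choices.

    set.seed(4)
    y <- rnorm(10, mean = 2, sd = 1)
    post_mean <- function(prior_var, prior_mean = 0, sigma2 = 1) {
      v <- 1 / (1 / prior_var + length(y) / sigma2)    # posterior variance
      v * (prior_mean / prior_var + sum(y) / sigma2)   # posterior mean
    }
    post_mean(0.01)   # informative prior centred at 0 pulls the estimate toward 0
    post_mean(100)    # diffuse prior: posterior mean is close to the sample mean
    mean(y)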


Informative Prior
[Figure: likelihood, prior, and posterior densities for θ]


Improper Priors
Improper priors are a special kind of uninformative prior where π(θ) is not actually a probability distribution.
In regression, some examples are π(β) ∝ 1 and π(σ²) ∝ 1/σ². These are examples of Jeffreys priors (see next slide).
It's OK to use an improper prior only when the resulting posterior is a valid probability distribution.
This can be difficult to verify sometimes, so if you aren't sure, don't use one.
Using a proper prior (one where π(θ) is a valid probability distribution) will always result in a valid posterior distribution.


Reference Priors
The idea behind reference priors is that they are supposed to be a "default" noninformative prior.
Reference priors give the posterior what are called "optimal frequentist" properties.
For our purposes that means the credible intervals give the same results as confidence intervals.
Reference priors are specific to each modeling problem and may not exist for every problem.
If θ is one dimensional (i.e. there is only one parameter), then the Jeffreys prior is the reference prior.


Conjugacy
If P(θ) is a conjugate prior for θ with likelihood P(D|θ), then the posterior P(θ|D) belongs to the same family of distributions as the prior.
Examples:
If P(D|β, σ²) ∼ N(Xβ, σ²) and P(β) is some normal distribution, then P(β|D, σ²) also follows a normal distribution.
If P(D|β, σ²) ∼ N(Xβ, σ²) and P(σ⁻²) is some gamma distribution, then P(σ⁻²|D, β) also follows a gamma distribution.
This is nice because we know a lot about the properties of the normal and gamma distributions. We can also readily generate values from these distributions with computer programs.
For a list of conjugate priors: http://en.wikipedia.org/wiki/Conjugate_prior
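
A sketch of the second example in a simplified case (known mean, unknown precision): with a gamma prior on 1/σ², the posterior is again a gamma, so we can draw from it directly. The prior and data values are illustrative.

    set.seed(5)
    y  <- rnorm(30, mean = 0, sd = 2)
    a0 <- 2; b0 <- 1                           # prior: 1/sigma^2 ~ Gamma(shape a0, rate b0)
    a1 <- a0 + length(y) / 2                   # posterior shape
    b1 <- b0 + sum((y - 0)^2) / 2              # posterior rate (mean known to be 0)
    prec_draws <- rgamma(5000, shape = a1, rate = b1)   # posterior is again a gamma
    quantile(1 / prec_draws, c(0.025, 0.5, 0.975))      # summaries of sigma^2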


Without Conjugacy?
Without conjugacy we might end up with a posterior we know nothing about.
This can make it very hard to compute summaries of the posterior, and for many years it crippled our ability to do Bayesian inference.
Even with conjugacy, we can often only get conjugate posterior distributions that are conditional on other parameters in the model (like the examples above).
Luckily, in the last 20 years the power of computers has changed this, which is why you are seeing so many more Bayesian papers.


Posterior Distributions
Bayesian inference is usually done on the joint posterior distribution of all the parameters, P(θ|D).
Sometimes it is done on the marginal distribution of a single parameter, such as:
P(β|D) = ∫ P(θ|D) dσ²
Because the posterior is a probability distribution for θ, performing inference on θ is as simple as finding relevant summaries of its posterior.


Point Estimation
Two very popular point estimators for θ:
The mean of the posterior distribution (the posterior mean).
The maximum a posteriori estimator, also known as the MAP estimator.
The MAP estimator is argmax_θ P(θ|D).
You could use any measure of center that makes sense for P(θ|D).
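
Both estimators are easy to approximate from posterior draws; the draws below are a stand-in for samples from P(θ|D), and the kernel-density mode is only a rough MAP approximation.

    set.seed(6)
    draws <- rgamma(10000, shape = 3, rate = 1)   # pretend these came from P(theta | D)
    mean(draws)                                   # posterior mean
    d <- density(draws)
    d$x[which.max(d$y)]                           # crude MAP: mode of a kernel density estimate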


Illustration of MAP Estimator
[Figure: posterior density with its mode, θ̂ = 3, marked]


Interval Estimation
Bayesian confidence intervals are called credible intervals, so a 95% Bayesian confidence interval is called a 95% credible interval.
There are two popular ways to find them:
Highest posterior density, often shortened to HPD.
Equal tail probability.
A 95% HPD credible interval is the smallest interval with probability 0.95.
A 95% equal-tail-probability interval uses the values with cumulative probability 0.025 and 0.975 as the endpoints. This is not preferred when the posterior is highly skewed or multimodal.
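
Continuing the illustrative draws from the point-estimation sketch, the equal-tail interval is a quantile computation; an HPD interval is available from the coda package, whose use here is an assumption (check ?HPDinterval).

    quantile(draws, c(0.025, 0.975))   # 95% equal-tail credible interval
    # 95% HPD interval, assuming the coda package is installed:
    # library(coda); HPDinterval(as.mcmc(draws), prob = 0.95)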


HPD and Equal Tail Credible Intervals
[Figure: posterior density with credible-interval endpoints at 1.089865 and 8.767273]


Hypothesis Testing
Since we can talk about the probability of θ being in some interval, interpretation of certain kinds of hypothesis tests becomes much easier.
For instance, to test H0: β ≤ 0 vs. Ha: β > 0, you can find P_{β|D}(β ≤ 0), and if it is sufficiently small, reject the null hypothesis in favor of the alternative.
We could also test H0: a ≤ β ≤ b vs. Ha: β ≤ a or b ≤ β, where a and b are chosen so that if H0 were true, β would have no practical effect. Again, compute P_{β|D}(H0) and, if it is sufficiently small, reject H0.
Some special steps need to be taken when testing H0: β = 0, which are beyond the scope of this course.
Also, instead of rejecting H0 or accepting Ha, you could just report the evidence (i.e. the probability) for the reader to decide.
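
With posterior draws of β, these probabilities are just sample proportions; the draws and the interval (a, b) = (0.2, 0.5) below are illustrative.

    set.seed(7)
    beta_draws <- rnorm(10000, mean = 0.8, sd = 0.4)   # pretend these came from P(beta | D)
    mean(beta_draws <= 0)                              # estimate of P(beta <= 0 | D)
    mean(beta_draws >= 0.2 & beta_draws <= 0.5)        # estimate of P(a <= beta <= b | D)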


Sampling From the Joint Posterior
As stated previously, we want to perform inference on θ using the joint posterior P(θ|D).
Summarizing the joint posterior may be difficult (or impossible).
A solution to this problem is to use a computer algorithm to sample observations from the joint posterior.
If we have a large enough sample from P(θ|D), then the distribution of the sample should approximate P(θ|D).
This means we can summarize the sample from P(θ|D) to make inference on θ instead of using the distribution P(θ|D) directly.
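
A quick sanity check of this idea in R, using draws from a known distribution as a stand-in for posterior samples:

    set.seed(8)
    theta_draws <- rnorm(10000, mean = 1, sd = 2)   # stand-in for draws from P(theta | D)
    c(mean(theta_draws), sd(theta_draws))           # close to the true mean 1 and sd 2
    quantile(theta_draws, c(0.025, 0.975))          # close to the true quantiles below
    qnorm(c(0.025, 0.975), mean = 1, sd = 2)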


Using Samples to Approximate the Posterior
[Figure: density estimates of θ from samples of size N = 20, 100, 500, 1000, 5000, and 10000]


One (Three) Important Algorithm(s)
The big three algorithms are as follows:
Gibbs sampling.
The Metropolis algorithm.
The Metropolis-Hastings algorithm.
All three of these are Markov chain Monte Carlo algorithms, commonly known as MCMC algorithms.
What that means for us is that after you run the algorithm "long enough", you will eventually be sampling from P(θ|D).
The Metropolis-Hastings algorithm is a generalization of the other two, and it is so similar to the Metropolis algorithm that the two names are often used interchangeably.
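
A minimal random-walk Metropolis sketch for a single parameter, targeting likelihood × prior on the log scale; every value here (data, prior, proposal standard deviation, burn-in) is an illustrative assumption.

    set.seed(9)
    y <- rnorm(25, mean = 1.5, sd = 1)
    log_post <- function(theta) {
      sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) +   # log-likelihood
        dnorm(theta, mean = 0, sd = 10, log = TRUE)       # log-prior
    }
    n_iter <- 5000
    theta  <- numeric(n_iter)                             # chain starts at 0
    for (i in 2:n_iter) {
      prop <- rnorm(1, mean = theta[i - 1], sd = 0.5)     # symmetric random-walk proposal
      if (log(runif(1)) < log_post(prop) - log_post(theta[i - 1])) {
        theta[i] <- prop                                  # accept the move
      } else {
        theta[i] <- theta[i - 1]                          # reject: stay put
      }
    }
    mean(theta[-(1:1000)])                                # discard burn-in, then summarize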


More on Gibbs
The Gibbs sampler is a much simplified version of the Metropolis-Hastings algorithm, so simplified, in fact, that it isn't obvious at all that they are related.
All three algorithms require you to know the conditional posterior distribution of each parameter in the model (e.g. P(β|D, σ²) and P(σ²|D, β)).
The Gibbs sampler requires that the user can sample from each of the conditional posterior distributions using computer software.
This is where conjugate priors become very popular. If the conditional posterior distributions are all normal, gamma, or some other distribution we can readily sample from, then the Gibbs algorithm makes sampling from the joint posterior very easy.
Other details about this algorithm are beyond the scope of this course.
The data analysis examples we will discuss both use a Gibbs sampler.
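
A sketch of a Gibbs sampler for the regression model with the conjugate priors just described: a normal conditional for β and a gamma conditional for 1/σ². The simulated data, prior settings, and burn-in length are illustrative.

    set.seed(10)
    n <- 60
    X <- cbind(1, rnorm(n))
    y <- drop(X %*% c(1, 2) + rnorm(n, sd = 1.5))
    b0 <- rep(0, 2); B0inv <- diag(1e-4, 2)          # vague normal prior on beta
    a0 <- 0.01; d0 <- 0.01                           # vague gamma prior on 1/sigma^2
    n_iter <- 5000
    beta <- matrix(0, n_iter, 2)
    sig2 <- numeric(n_iter); sig2[1] <- 1
    for (i in 2:n_iter) {
      # 1. Draw beta | sigma^2, D from its normal conditional
      Bn <- solve(B0inv + crossprod(X) / sig2[i - 1])
      bn <- Bn %*% (B0inv %*% b0 + crossprod(X, y) / sig2[i - 1])
      beta[i, ] <- bn + t(chol(Bn)) %*% rnorm(2)
      # 2. Draw 1/sigma^2 | beta, D from its gamma conditional
      resid   <- y - X %*% beta[i, ]
      sig2[i] <- 1 / rgamma(1, shape = a0 + n / 2, rate = d0 + sum(resid^2) / 2)
    }
    colMeans(beta[-(1:1000), ])   # posterior means of the coefficients
    mean(sig2[-(1:1000)])         # posterior mean of sigma^2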


More on Metropolis-Hastings
If we can't find conjugate priors, or some other prior that makes it easy to sample from the conditional posteriors (as in logistic regression), then we have to use the Metropolis-Hastings algorithm, which allows us to draw samples from almost any conditional posterior distribution.
Some papers contain tricks for turning problems that need Metropolis-Hastings into problems that only need Gibbs.
The probit regression example we will do uses a trick by Albert and Chib (1993) that allows us to use Gibbs instead of M-H.


Examples
Open up your R console!
We'll be discussing:
How to estimate a simple linear regression and a probit regression using the bayesm package in R.
How to assess whether our Gibbs algorithm is working properly.
How the Bayesian model fit compares to the classical model fit.
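
As a preview, a hedged sketch of what a bayesm call might look like for the simulated stopping-distance data from earlier; the argument names (Data, Mcmc, R) and output components (betadraw, sigmasqdraw) are from memory of the bayesm interface and should be checked against ?runiregGibbs.

    # Sketch only: verify the interface in the bayesm documentation before relying on it.
    # install.packages("bayesm")
    library(bayesm)
    out <- runiregGibbs(Data = list(y = dist, X = cbind(1, speed)),
                        Mcmc = list(R = 10000))
    colMeans(out$betadraw)                       # posterior means of the regression coefficients
    quantile(out$sigmasqdraw, c(0.025, 0.975))   # credible interval for sigma^2
    coef(fit)                                    # compare with the classical fit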
