Introduction to Bayesian Statistics - Politecnico di Milano
Introduction to Bayesian Statistics

Alessandra Guglielmi
Politecnico di Milano
Department of Mathematics
Milan, Italy
e-mail: alessandra.guglielmi@polimi.it
1 October 2012
A. Guglielmi Bayesian Statistics 1
Bayesian learning

Y = sample space = set of all possible datasets;
y: a single dataset
Θ = parameter space = set of all possible parameter values,
from which we hope to identify the value that best represents
the true population characteristics

Under the Bayesian approach, there are TWO random elements (Y, θ):

π(θ), the prior distribution: describes our belief that θ
represents the true population characteristics
p(y|θ), the likelihood: describes our belief that y would be the
outcome if θ is the true parameter value
Bayes’ Theorem

Once we obtain y, we update our beliefs about θ by computing
π(θ|y), the posterior distribution: describes our belief that θ is
the true value, having observed y

The posterior distribution is computed via Bayes’ Theorem:

π(θ|y) = p(y|θ) π(θ) / ∫_Θ p(y|θ) π(θ) dθ

Recall Bayes’ Theorem for a finite partition.

R Example: binomial_beta.R
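The referenced script binomial_beta.R is not included here; as an illustration only, the conjugate beta-binomial update it presumably performs can be sketched in Python (the function name and default values are hypothetical, chosen to match the coin example on the next slides):

```python
def beta_binomial_posterior(successes, n, alpha, beta):
    """Conjugate update: a Beta(alpha, beta) prior combined with a
    binomial likelihood gives a Beta(alpha + s, beta + n - s) posterior."""
    a_post = alpha + successes
    b_post = beta + n - successes
    post_mean = a_post / (a_post + b_post)  # E(theta | y)
    return a_post, b_post, post_mean

# Uniform prior Beta(1, 1); 3 heads observed out of 10 tosses
a, b, m = beta_binomial_posterior(3, 10, 1.0, 1.0)
print(a, b, m)  # 4.0 8.0 0.333...
```

Note that no integral needs to be computed: conjugacy gives the normalizing constant in Bayes’ Theorem in closed form.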
Bayes’ Theorem

It is a typical scientific approach:
the prior belief is UPDATED via observed data and yields the
posterior distribution
it suggests that scientific inference is based on 2 parts:
one depends on the scientist’s subjective opinion and
understanding of the phenomenon under study BEFORE
the EXPERIMENT is performed; the other depends on
the observed data the scientist has obtained from the
experiment.
Parameter estimation

Carried out by summarizing the posterior distribution π(θ|y_1, ..., y_n)
through, for example,
the posterior mean E[θ|y_1, ..., y_n]
the posterior variance Var[θ|y_1, ..., y_n]
interval estimates C : P(θ ∈ C|y_1, ..., y_n) ≥ 0.95

Computational methods are needed to obtain these estimates, above all
to simulate from the posterior distribution:
MCMC: Markov chain Monte Carlo methods
IDEA: if it is NOT possible to simulate iid random variables from the
posterior, one constructs a Markov chain whose limiting
distribution is the posterior, and applies the Ergodic Theorem
to approximate posterior integrals.
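The MCMC idea above can be illustrated with a minimal random-walk Metropolis sampler, shown here in Python rather than R. This is a sketch, not the course’s implementation; it targets the Beta(4, 8) posterior from the coin example (3 heads in 10 tosses, uniform prior), and the step size and iteration counts are arbitrary choices:

```python
import math
import random

def log_post(theta, successes=3, n=10, alpha=1.0, beta=1.0):
    """Unnormalized log posterior of the beta-binomial model;
    the normalizing constant cancels in the Metropolis ratio."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return ((alpha + successes - 1) * math.log(theta)
            + (beta + n - successes - 1) * math.log(1.0 - theta))

def metropolis(n_iter=50000, step=0.2, seed=42):
    """Random-walk Metropolis: the chain's limiting distribution is the
    posterior, so ergodic averages approximate posterior integrals."""
    rng = random.Random(seed)
    theta, samples = 0.5, []
    for _ in range(n_iter):
        prop = theta + rng.uniform(-step, step)
        # Accept with probability min(1, post(prop) / post(theta))
        if math.log(rng.random()) < log_post(prop) - log_post(theta):
            theta = prop
        samples.append(theta)
    return samples

draws = metropolis()
burned = draws[5000:]              # discard burn-in
print(sum(burned) / len(burned))   # ergodic average, approx E(theta|y) = 1/3
```

Here the ergodic average of the draws approximates the posterior mean, which is known exactly (1/3) and so serves as a check.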
Prediction of new observations

Observe n tosses of a coin: (y_1, ..., y_n) (y_i = 1 if the coin was
H at the i-th toss)
What is the probability that the next toss will be H?
Bayesian prediction: P(Y_{n+1} = 1 | Y_1 = y_1, ..., Y_n = y_n)
Posterior predictive distribution: L(Y_new | Y = y)
Prediction

What is the probability that the next toss will be H?
Likelihood: θ^{Σ y_i} (1 − θ)^{n − Σ y_i}; prior π(θ); posterior π(θ|y)

P(Y_{n+1} = 1 | Y_1 = y_1, ..., Y_n = y_n) = ∫_(0,1) P(Y_{n+1} = 1 | θ) π(θ|y) dθ
= ∫_(0,1) θ π(θ|y) dθ = E(θ|y) = (α + Σ y_i) / (α + β + n)

Ex: n = 10, Σ y_i = 3, α = β = 1:

P(Y_11 = 1 | Y_1 = y_1, ..., Y_10 = y_10) = E(θ|y) = 4/12 = 1/3 ≠ 1/2 = E(θ)
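The predictive integral above can also be approximated by simulation, which is how it would be handled when no closed form exists. A hedged Python sketch (function name and simulation size are illustrative choices): draw θ from the Beta posterior and average P(Y_{n+1} = 1 | θ) = θ over the draws.

```python
import random

def predictive_mc(successes, n, alpha=1.0, beta=1.0,
                  n_sims=100000, seed=1):
    """Monte Carlo version of the predictive integral:
    draw theta ~ pi(theta|y) = Beta(alpha + s, beta + n - s),
    then average P(Y_{n+1} = 1 | theta) = theta over the draws."""
    rng = random.Random(seed)
    a, b = alpha + successes, beta + n - successes
    draws = [rng.betavariate(a, b) for _ in range(n_sims)]
    return sum(draws) / n_sims

est = predictive_mc(3, 10)  # approximates (1 + 3) / (1 + 1 + 10) = 1/3
print(est)
```

The simulation estimate agrees with the exact value E(θ|y) = 1/3 up to Monte Carlo error.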
Bayesian vs non-Bayesian approach

FREQUENTIST approach
parameters are fixed at their true but unknown value
“objective” notion of probability
good large sample properties
estimation: maximizing the likelihood
confidence intervals: difficult to interpret
no symmetry in testing hypotheses H_0 and H_1, difficult
interpretation of p-values

BAYESIAN approach
parameters are r.v.s with distributions attached to them
subjective notion of probability (prior) combined with data
does not require large sample approximations (inference is
exact for any n)
estimation: via summary statistics of the posterior
distributions; their computation via simulation-based
approaches (MCMC)
credible intervals: NO problems in interpreting them
H_0 and H_1 are symmetric
Bayesian Hierarchical Models

Multilevel data:
results of a test for students in a population of schools in the US
patients within several hospitals
people (or items) within provinces within regions within countries

Two levels:
groups
units within groups

y_ij is the data of the i-th unit in group j, i = 1, ..., n_j,
j = 1, ..., J

(Y_1, Y_2, ..., Y_J), where Y_j = (Y_{1,j}, ..., Y_{n_j,j})
Bayesian Gaussian hierarchical model

Y_{1,j}, ..., Y_{n_j,j} | ϕ_j  iid ~ N(ϕ_j, σ²)      within-group model
ϕ_1, ..., ϕ_J | (µ, τ²)  iid ~ N(µ, τ²)              between-group model
(µ, τ²) ~ π

population of groups/group-parameters: prediction on a student
coming from a new school, selected at random from the
population of groups

group-specific parameters:
ϕ_1, ..., ϕ_J are NOT independent, since we want to share
information between the groups; the dependency is mild
(exchangeability)
Bayesian Gaussian hierarchical model

The prior is completed assuming:

1/σ² ~ gamma(ν_0/2, ν_0 σ_0²/2)      σ² = within-group variance
1/τ² ~ gamma(η_0/2, η_0 σ̃_0²/2)      τ² = between-group variance
µ ~ N(µ_0, γ_0²)

R Example: Bayesian_hierarchical.R
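The referenced script Bayesian_hierarchical.R is not reproduced here; to make the two-level structure concrete, the following hedged Python sketch forward-simulates data from the model (the function name and all numerical values are illustrative, not taken from the course):

```python
import random

def simulate_hierarchical(J=8, n_j=20, mu=50.0, tau=5.0,
                          sigma=10.0, seed=0):
    """Forward simulation of the two-level Gaussian model:
    phi_j ~ N(mu, tau^2) for each group j, then
    y_{ij} ~ N(phi_j, sigma^2) for each unit i within group j."""
    rng = random.Random(seed)
    phi = [rng.gauss(mu, tau) for _ in range(J)]                 # group means
    data = [[rng.gauss(p, sigma) for _ in range(n_j)] for p in phi]
    return phi, data

phi, data = simulate_hierarchical()
group_means = [sum(g) / len(g) for g in data]  # y-bar_j for each group
```

Sampling a new ϕ from N(µ, τ²) and then a new observation from N(ϕ, σ²) is exactly the "student from a new school" prediction described on the previous slide.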
Borrowing strength

E(ϕ_j | ȳ_j, µ, τ², σ²) = [n_j/σ² / (n_j/σ² + 1/τ²)] ȳ_j + [1/τ² / (n_j/σ² + 1/τ²)] µ

a weighted average of ȳ_j, the frequentist estimator of ϕ_j, and µ, the prior mean of ϕ_j.

When n_j is small, i.e. group j gives little info about ϕ_j:
E(ϕ_j | ...) ≈ µ
The Bayesian estimate is obtained by borrowing strength from the
other groups (through µ).

When τ² is large (heterogeneous groups): E(ϕ_j | ...) ≈ ȳ_j; there
is less shrinkage toward µ, relying more on the info in group j.

NO POOLING: τ² = +∞ ⇔ one analysis for each group
COMPLETE POOLING: τ² = 0 ⇔ ϕ_1 = ... = ϕ_J = µ, one single parameter
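The shrinkage formula above is just a precision-weighted average, which a few lines of Python make explicit (function name and the numbers below are illustrative):

```python
def shrinkage_estimate(ybar_j, n_j, mu, sigma2, tau2):
    """Conditional posterior mean of phi_j: a precision-weighted
    average of the group mean ybar_j and the population mean mu."""
    w_data = n_j / sigma2    # precision of ybar_j given phi_j
    w_prior = 1.0 / tau2     # prior precision of phi_j
    return (w_data * ybar_j + w_prior * mu) / (w_data + w_prior)

# Small group: little data, the estimate shrinks almost all the way to mu
print(shrinkage_estimate(ybar_j=70.0, n_j=2, mu=50.0,
                         sigma2=100.0, tau2=1.0))      # about 50.4
# Large group: the data dominate and the estimate stays near ybar_j
print(shrinkage_estimate(ybar_j=70.0, n_j=5000, mu=50.0,
                         sigma2=100.0, tau2=1.0))      # about 69.6
```

Letting tau2 grow without bound recovers the no-pooling estimate ȳ_j, while tau2 → 0 forces every group estimate to µ (complete pooling), matching the two limiting cases on this slide.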