In the curve fitting problem, we are given the training data \mathbf{x} and \mathbf{t}, along with a new test point x, and our goal is to predict the value of t. We therefore wish to evaluate the predictive distribution p(t|x, \mathbf{x}, \mathbf{t}). Here we shall assume that the parameters \alpha and \beta are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form

    p(t|x, \mathbf{x}, \mathbf{t}) = \int p(t|x, \mathbf{w}) \, p(\mathbf{w}|\mathbf{x}, \mathbf{t}) \, \mathrm{d}\mathbf{w}.    (1.68)

Here p(t|x, \mathbf{w}) is given by (1.60), and we have omitted the dependence on \alpha and \beta to simplify the notation. Here p(\mathbf{w}|\mathbf{x}, \mathbf{t}) is the posterior distribution over parameters, and can be found by normalizing the right-hand side of (1.66). We shall see in Section 3.3 that, for problems such as the curve-fitting example, this posterior distribution is a Gaussian and can be evaluated analytically. Similarly, the integration in (1.68) can also be performed analytically, with the result that the predictive distribution is given by a Gaussian of the form

    p(t|x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\bigl(t \,|\, m(x), s^2(x)\bigr)    (1.69)

where the mean and variance are given by

    m(x) = \beta \, \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n) t_n    (1.70)

    s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S} \boldsymbol{\phi}(x).    (1.71)

Here the matrix \mathbf{S} is given by

    \mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n) \boldsymbol{\phi}(x_n)^{\mathrm{T}}    (1.72)

where \mathbf{I} is the unit matrix, and we have defined the vector \boldsymbol{\phi}(x) with elements \phi_i(x) = x^i for i = 0, \ldots, M.

We see that the variance, as well as the mean, of the predictive distribution in (1.69) is dependent on x. The first term in (1.71) represents the uncertainty in the predicted value of t due to the noise on the target variables, and was expressed already in the maximum likelihood predictive distribution (1.64) through \beta_{\mathrm{ML}}^{-1}. However, the second term arises from the uncertainty in the parameters \mathbf{w} and is a consequence of the Bayesian treatment. The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.
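As a concrete illustration, the following is a minimal numerical sketch (not from the text) of equations (1.70)-(1.72) for the polynomial curve-fitting model, written in Python with NumPy. The function names, the synthetic sinusoidal data set, and the fixed parameter values for \alpha and \beta are assumptions chosen purely for illustration, not a prescription from this section.

import numpy as np

def polynomial_features(x, M):
    # Design vectors phi(x) with elements phi_i(x) = x**i for i = 0,...,M.
    return np.vander(np.atleast_1d(x), M + 1, increasing=True)   # shape (N, M+1)

def bayesian_predictive(x_train, t_train, x_test, M=9, alpha=5e-3, beta=11.1):
    # Predictive mean m(x) and variance s^2(x); alpha, beta assumed fixed and known.
    Phi = polynomial_features(x_train, M)                         # (N, M+1)
    # S^{-1} = alpha I + beta * sum_n phi(x_n) phi(x_n)^T  -- equation (1.72)
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    phi_test = polynomial_features(x_test, M)                     # (K, M+1)
    # m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n          -- equation (1.70)
    mean = beta * phi_test @ S @ (Phi.T @ t_train)
    # s^2(x) = 1/beta + phi(x)^T S phi(x)                  -- equation (1.71)
    var = 1.0 / beta + np.sum((phi_test @ S) * phi_test, axis=1)
    return mean, var

# Illustrative synthetic sinusoidal data in the spirit of the running example.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)
x_test = np.linspace(0.0, 1.0, 5)
m, s2 = bayesian_predictive(x_train, t_train, x_test)

Note that the second term of the returned variance, \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S} \boldsymbol{\phi}(x), is what distinguishes this predictive distribution from the maximum likelihood one: it grows where the training data provide little constraint on \mathbf{w}.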
