
Bayesian Linear Regression - CEDAR


Bayesian Linear Regression
Sargur Srihari
srihari@cedar.buffalo.edu


Motivation of Bayesian Approach

• We have seen Linear Regression with M basis functions:

    y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

• The maximum likelihood solution is

    w_ML = (Φ^T Φ)^{-1} Φ^T t

  where Φ is the N x M design matrix with rows [φ_0(x_n), ..., φ_{M-1}(x_n)], n = 1, ..., N
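
A minimal NumPy sketch of the maximum likelihood solution above. The polynomial basis, inputs, and targets here are illustrative assumptions, not from the slides:

```python
import numpy as np

# A toy polynomial basis phi_j(x) = x^j, j = 0..M-1 (an assumption for illustration)
def design_matrix(x, M):
    return np.vstack([x ** j for j in range(M)]).T        # shape (N, M)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)                                # assumed inputs
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)    # assumed noisy targets

Phi = design_matrix(x, M=4)
# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the pseudo-inverse for numerical stability
w_ml = np.linalg.pinv(Phi) @ t
```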


Shortcomings of MLE

• The maximum likelihood estimate of the parameters w does not address
  – Model complexity M: how many basis functions should we use?
  – Overfitting for a given M is controlled by the data set size N: more data allows a better fit without overfitting
• Regularization also controls overfitting (λ controls its effect):

    E(w) = E_D(w) + λ E_W(w)

  where

    E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2   and   E_W(w) = (1/2) w^T w

• But M and the choice of the basis functions φ_j are still important
  – M can be determined by a holdout set, but that is wasteful of data
• Model complexity and overfitting are better handled using the Bayesian approach


Prior Distribution for Parameters

• Assume a multivariate Gaussian prior for w (which has components w_0, ..., w_{M-1}):

    p(w) = N(w | m_0, S_0)

  with mean m_0 and covariance matrix S_0
• If we choose S_0 = α^{-1} I, the variances of the weights are all equal to α^{-1} and the covariances are zero
• Figure: p(w) with zero mean (m_0 = 0) and isotropic over the weights (same variance for w_0 and w_1)


Likelihood of Data

• Assume additive Gaussian noise with precision parameter β:

    t = y(x, w) + ε,    p(t | x, w, β) = N(t | y(x, w), β^{-1})

• The likelihood of the data t given the parameters w and inputs X has an exponential form:

    p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})


Posterior Distribution of Parameters

• The posterior is proportional to Likelihood x Prior:

    p(w | t) = p(t | w) p(w) / p(t)

  where the likelihood function is

    p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})

• Multiplying by the Gaussian prior p(w) = N(w | m_0, S_0), the posterior (a product of Gaussians) is also Gaussian and can be written directly as

    p(w | t) = N(w | m_N, S_N)

  – where the mean of the posterior is given by  m_N = S_N (S_0^{-1} m_0 + β Φ^T t)
  – and the covariance matrix of the posterior is given by  S_N^{-1} = S_0^{-1} + β Φ^T Φ
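
A small sketch of this posterior update for a general Gaussian prior N(w | m_0, S_0); the function name and the straight-line example setup are assumptions for illustration:

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Mean m_N and covariance S_N of p(w|t) for a Gaussian prior N(w|m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi            # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)      # m_N = S_N (S_0^{-1} m0 + beta Phi^T t)
    return mN, SN

# Example usage with an assumed 2-parameter straight-line basis [1, x]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 10)
Phi = np.column_stack([np.ones_like(x), x])
mN, SN = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2) / 2.0, beta=25.0)
```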


Properties of Posterior

1. Since the posterior p(w | t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean
   – Thus the maximum posterior weight vector is w_MAP = m_N
2. For an infinitely broad prior S_0 = α^{-1} I with α → 0, the mean m_N reduces to the maximum likelihood value

    w_ML = (Φ^T Φ)^{-1} Φ^T t

3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points


Choose a simple Gaussian prior p(w)

• Zero mean (m_0 = 0), isotropic (same variances) Gaussian with a single precision parameter α:

    p(w | α) = N(w | 0, α^{-1} I)

• The corresponding posterior distribution is

    p(w | t) = N(w | m_N, S_N)

  where  m_N = β S_N Φ^T t  and  S_N^{-1} = α I + β Φ^T Φ
• Note: β is the noise precision and α is the precision (inverse variance) of the parameters w in the prior
• With infinite samples the posterior collapses to a point estimate in (w_0, w_1) space


Sampling p(w) and p(w|t)

• Each sample of w represents a straight line in data space
• Figure: the distributions p(w) and p(w|t) in (w_0, w_1) space, each with six sampled lines
• Goal of Bayesian Linear Regression: determine p(w | t)


Interesting Relationship to MLE

• Since

    p(w | t) ∝ [ Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) ] N(w | 0, α^{-1} I)

  the log of the posterior is

    ln p(w | t) = -(β/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 - (α/2) w^T w + const

• Thus maximization of the posterior distribution with respect to w is equivalent to minimization of the sum-of-squares error function

    E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w

  with the addition of the quadratic regularization term w^T w, with λ = α/β
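
This equivalence can be checked numerically; a sketch assuming an arbitrary random design matrix and targets:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 5))              # assumed design matrix
t = rng.normal(size=30)                     # assumed targets
alpha, beta = 2.0, 25.0
lam = alpha / beta                          # lambda = alpha / beta

# MAP estimate: mode (= mean) of the Gaussian posterior under the isotropic prior
S_N = np.linalg.inv(alpha * np.eye(5) + beta * Phi.T @ Phi)
w_map = beta * S_N @ Phi.T @ t

# Regularized (ridge) least squares with the same lambda
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ t)

assert np.allclose(w_map, w_ridge)          # the two solutions coincide
```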


Bayesian Linear Regression Example (Straight Line Fit)

• Single input variable x, single target variable t
• Goal is to fit the linear model y(x, w) = w_0 + w_1 x
• The goal of Linear Regression is to recover w = [w_0, w_1] given the samples (x, t)


Data Generation

• Synthetic data is generated from f(x, w) = w_0 + w_1 x with parameter values w_0 = -0.3 and w_1 = 0.5
  – First choose x_n from U(x | -1, 1), then evaluate f(x_n, w)
  – Add Gaussian noise with standard deviation 0.2 to obtain the target t_n
  – Noise precision parameter β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2:

    p(w | α) = N(w | 0, α^{-1} I)
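
A sketch that reproduces this synthetic-data setup; the variable names and the sample size are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([-0.3, 0.5])                          # w_0 = -0.3, w_1 = 0.5
N = 20                                                  # assumed number of data points

x = rng.uniform(-1, 1, N)                               # x_n drawn from U(x | -1, 1)
t = w_true[0] + w_true[1] * x + rng.normal(0, 0.2, N)   # add Gaussian noise, std dev 0.2

beta = (1 / 0.2) ** 2                                   # noise precision = 25
alpha = 2.0                                             # precision of the prior over w
```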


Sequential Bayesian Learning

• Since there are only two parameters, we can plot the prior and posterior distributions in parameter space and look at the sequential update of the posterior as data arrives
• The figure has rows for: no data point, the first data point (x_1, t_1), the second data point, and twenty data points; the columns show
  – The likelihood p(t | x, w) for the latest point alone, as a function of w, with the true parameter value marked by a white cross
  – The prior/posterior: multiplying the likelihood by the current p(w) gives p(w | t); the band of high probability represents values of w_0, w_1 whose straight lines pass near the observed data point
  – Six samples (regression functions) y(x, w) with w drawn from the current posterior
• With infinite points the posterior becomes a delta function centered at the true parameters (white cross)
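
A sketch of the sequential update illustrated by the figure: the posterior after each observed point acts as the prior for the next. The straight-line basis and data generator follow the earlier slides; the function name is mine:

```python
import numpy as np

def sequential_update(mN, SN, phi_n, t_n, beta):
    """One Bayesian update: the previous posterior N(mN, SN) acts as the prior."""
    SN_inv_new = np.linalg.inv(SN) + beta * np.outer(phi_n, phi_n)
    SN_new = np.linalg.inv(SN_inv_new)
    mN_new = SN_new @ (np.linalg.inv(SN) @ mN + beta * phi_n * t_n)
    return mN_new, SN_new

alpha, beta = 2.0, 25.0
mN, SN = np.zeros(2), np.eye(2) / alpha           # start from the prior N(0, alpha^{-1} I)
rng = np.random.default_rng(1)
for _ in range(20):
    x_n = rng.uniform(-1, 1)
    t_n = -0.3 + 0.5 * x_n + rng.normal(0, 0.2)
    mN, SN = sequential_update(mN, SN, np.array([1.0, x_n]), t_n, beta)
# After 20 points mN is close to the true parameters (-0.3, 0.5) and SN has shrunk
```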


Generalization of Gaussian prior

• The Gaussian prior over the parameters is

    p(w | α) = N(w | 0, α^{-1} I)

  and maximization of the posterior ln p(w | t) is equivalent to minimization of the sum-of-squares error with a quadratic regularizer:

    E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w

• A generalization of the Gaussian prior is

    p(w | α) = [ (q/2) (α/2)^{1/q} (1/Γ(1/q)) ]^M exp( -(α/2) Σ_{j=1}^{M} |w_j|^q )

  – q = 2 corresponds to the Gaussian
  – It corresponds to minimization of the regularized error function

    (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q


Example of Predictive Distribution

• Data generated from sin(2πx)
• Model: nine Gaussian basis functions,

    y(x, w) = Σ_{j=0}^{8} w_j φ_j(x) = w^T φ(x)

• Predictive distribution:

    p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))   where   σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

  with  m_N = β S_N Φ^T t,  S_N^{-1} = α I + β Φ^T Φ,  and Φ the N x M design matrix with rows [φ_0(x_n), ..., φ_{M-1}(x_n)]
• Figure: plot of p(t | x) for one data point, showing the mean of the predictive distribution (red) and one standard deviation around it (pink)
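
A sketch of this predictive distribution with nine Gaussian basis functions; the basis centers, width, and training set are assumptions:

```python
import numpy as np

centers = np.linspace(0, 1, 9)                    # assumed centers of the 9 Gaussian bases
def Phi_of(x, s=0.1):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x_train = rng.uniform(0, 1, 25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 25)

Phi = Phi_of(x_train)
S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

def predict(x):
    phi = Phi_of(x)
    mean = phi @ m_N                                     # m_N^T phi(x)
    var = 1 / beta + np.sum((phi @ S_N) * phi, axis=1)   # 1/beta + phi(x)^T S_N phi(x)
    return mean, var

mean, var = predict(np.linspace(0, 1, 5))
```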


Predictive Distribution (Sinusoidal Data)

    p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))

• Figure: panels for N = 1, 2, 4, 25 data points, showing the mean of the Gaussian predictive distribution and one standard deviation from the mean
• The standard deviation of t is smallest in the neighborhood of the data points
• Uncertainty decreases as more data points are observed
• The plot only shows the point-wise predictive variance; to show the covariance between predictions at different values of x, draw samples of w from the posterior p(w | t) and plot the corresponding functions y(x, w)


Equivalent Kernel

• The posterior mean of w is m_N = β S_N Φ^T t, where Φ is the design matrix and S_N the posterior covariance
• Substitute this mean value into the regression function

    y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

• The mean of the predictive distribution at a point x is then

    y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t
              = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n
              = Σ_{n=1}^{N} k(x, x_n) t_n

  where k(x, x') = β φ(x)^T S_N φ(x') is the equivalent kernel
• The mean of the predictive distribution is a linear combination of the training set target variables
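
A sketch of the equivalent kernel and its use as a linear smoother, assuming a small polynomial basis and synthetic straight-line data:

```python
import numpy as np

phi = lambda x: np.array([x ** j for j in range(4)])   # assumed basis phi_j(x) = x^j
alpha, beta = 2.0, 25.0

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 50)
t_train = -0.3 + 0.5 * x_train + rng.normal(0, 0.2, 50)
Phi = np.vstack([phi(xn) for xn in x_train])
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)

def k(x, x_prime):
    """Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * phi(x) @ S_N @ phi(x_prime)

# Predictive mean as a linear smoother: y(x, m_N) = sum_n k(x, x_n) t_n
x0 = 0.25
y_mean = sum(k(x0, xn) * tn for xn, tn in zip(x_train, t_train))
```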


Kernel Function

• Regression functions such as

    y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n,   with   k(x, x') = β φ(x)^T S_N φ(x'),

  that take a linear combination of the training set target values are known as linear smoothers
• They depend on the input values x_n from the data set, since these appear in the definition of S_N


Equivalent kernel for Gaussian Basis

• Gaussian basis functions:

    φ_j(x) = exp( -(x - µ_j)^2 / (2 s^2) )

• Equivalent kernel: k(x, x') = β φ(x)^T S_N φ(x')
• Figure: plot of k(x, x') as a function of x and x', together with slices for three fixed values of x; the data set used to generate the kernel was 200 values of x equally spaced in (-1, 1)
• The kernel is localized around x, i.e., it peaks when x' = x: local evidence is weighted more heavily than distant evidence when the kernel is used directly in regression,

    y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n


Equivalent Kernel for Polynomial Basis Function

• Basis: φ_j(x) = x^j
• Equivalent kernel: k(x, x') = β φ(x)^T S_N φ(x')
• The kernel is a localized function of x' even though the corresponding basis functions are nonlocal


Equivalent Kernel for Sigmoidal Basis Function

• Basis:

    φ_j(x) = σ( (x - µ_j) / s )   where   σ(a) = 1 / (1 + exp(-a))

• Equivalent kernel: k(x, x') = β φ(x)^T S_N φ(x')
• The kernel is a localized function of x' even though the corresponding basis functions are nonlocal


Covariance between y(x) and y(x')

    cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')]
                     = φ(x)^T S_N φ(x')
                     = β^{-1} k(x, x')

• The equivalent kernel captures the covariance: from its form, the predictive means at nearby points y(x), y(x') will be highly correlated, while for more distant pairs the correlation is smaller


Kernel Function formulation

• The formulation of Linear Regression in terms of a kernel function suggests an alternative: directly define kernel functions and use them to make predictions for a new input x
  – This leads to Gaussian Processes for Regression and Classification
• It can be shown that the equivalent kernel can be written as an inner product

    k(x, z) = ψ(x)^T ψ(z)   where   ψ(x) = β^{1/2} S_N^{1/2} φ(x)

• The equivalent kernel also satisfies the important property

    Σ_{n=1}^{N} k(x, x_n) = 1
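
Both statements can be checked numerically; a sketch under the same assumed polynomial basis, using an eigendecomposition to form S_N^{1/2}:

```python
import numpy as np

alpha, beta = 2.0, 25.0
phi = lambda x: np.array([x ** j for j in range(4)])     # assumed basis
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 200)
Phi = np.vstack([phi(xn) for xn in x_train])
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)

# psi(x) = beta^{1/2} S_N^{1/2} phi(x), with S_N^{1/2} from an eigendecomposition
evals, evecs = np.linalg.eigh(S_N)
S_N_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
psi = lambda x: np.sqrt(beta) * S_N_half @ phi(x)

k = lambda x, z: beta * phi(x) @ S_N @ phi(z)
assert np.isclose(k(0.3, 0.7), psi(0.3) @ psi(0.7))      # k(x, z) = psi(x)^T psi(z)
print(sum(k(0.3, xn) for xn in x_train))                 # close to 1 for x inside the data region
```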


Probabilistic Linear Regression Revisited

• Re-derive the predictive distribution by working in terms of a distribution over functions y(x, w)
  – This will provide a specific example of a Gaussian Process
• Consider a model with M fixed basis functions:

    y(x) = w^T φ(x)

  where x is the input vector and w is the M-dimensional weight vector
• Assume a Gaussian distribution over the weights:

    p(w) = N(w | 0, α^{-1} I)

• The probability distribution over w induces a probability distribution over functions y(x)


Probabilistic Linear Regression is a GP

• We wish to evaluate y(x) at the training points x_1, ..., x_N
• Our interest is the joint distribution of the values y = [y(x_1), ..., y(x_N)]^T
• Since y(x) = w^T φ(x) we can write y = Φw
  – where Φ is the design matrix with elements Φ_nk = φ_k(x_n)
  – (N x M) times (M x 1) yields an (N x 1) vector
• Since y is a linear combination of the elements of w, which are Gaussian distributed as p(w) = N(w | 0, α^{-1} I), y is itself Gaussian, with
  – mean:  E[y] = Φ E[w] = 0
  – covariance:  cov[y] = E[y y^T] = Φ E[w w^T] Φ^T = (1/α) Φ Φ^T = K
• K is the N x N Gram matrix with elements

    K_nm = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m)

  where k(x, x') is the kernel function: the covariance is expressed by a kernel
• Thus a Gaussian-distributed weight vector induces a Gaussian joint distribution over the training samples
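
A sketch of this construction: forming K = (1/α) Φ Φ^T and comparing a draw of y = Φw against a direct draw from N(0, K). The basis and input grid are assumptions:

```python
import numpy as np

alpha = 2.0
phi = lambda x: np.array([x ** j for j in range(4)])     # assumed basis
x_train = np.linspace(-1, 1, 10)
Phi = np.vstack([phi(xn) for xn in x_train])             # design matrix, N x M

K = Phi @ Phi.T / alpha                # Gram matrix K_nm = (1/alpha) phi(x_n)^T phi(x_m)

rng = np.random.default_rng(0)
# Sampling w ~ N(0, alpha^{-1} I) and forming y = Phi w ...
w = rng.normal(scale=np.sqrt(1 / alpha), size=Phi.shape[1])
y_from_w = Phi @ w
# ... gives the same distribution as sampling y ~ N(0, K) directly
jitter = 1e-10 * np.eye(len(x_train))  # small jitter keeps K numerically positive semi-definite
y_direct = rng.multivariate_normal(np.zeros(len(x_train)), K + jitter)
```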


General definition of GP

• We saw a particular example of a Gaussian process: linear regression using the parametric model y(x) = w^T φ(x)
  – With samples y = [y(x_1), ..., y(x_N)]^T and prior p(w) = N(w | 0, α^{-1} I)
  – Then E[y] = 0 and cov[y] = K, the Gram matrix whose entries are the pairwise kernel values
• More generally, a Gaussian process is a probability distribution over functions y(x) such that the set of values of y(x) evaluated at arbitrary points x_1, ..., x_N jointly have a multivariate Gaussian distribution
  – For a single input x_1 the output y_1 is univariate Gaussian; for two inputs x_1, x_2 the outputs y_1, y_2 are bivariate Gaussian, and so on


Parametric Regression vs GP Regression

• In parametric regression we have several samples from which we learn the parameters
• In Gaussian processes we view the samples as one large vector that has a Gaussian distribution, with a mean and a covariance matrix
• A Gaussian process is a stochastic process X_t, t ∈ T, for which any finite linear combination of samples has a joint Gaussian distribution
  – Any linear functional applied to the sample function X_t will give a normally distributed result. Notation-wise, one can write X ~ GP(m, K), meaning the random function X is distributed as a GP with mean function m and covariance function K


Gaussian Process with Two Samples

• Let y be a function (curve) of a one-dimensional variable x
• We take two samples y_1 and y_2 corresponding to inputs x_1 and x_2, and assume they have a bivariate Gaussian distribution p(y_1, y_2)
• Each point drawn from this distribution
  – has an associated probability
  – also defines a function y(x), assuming that two points are enough to define a curve
• More than two points will generally be needed to define a curve, which leads to a higher-dimensional probability distribution


Gaussian Process with N samples

• A process that generates samples over time {y_1, ..., y_N} is a GP iff every set of samples Y = [y_1, ..., y_N] is a vector-valued Gaussian random variable
• We define a distribution over all functions, constrained by the samples
• The samples have a joint Gaussian distribution in N-dimensional space
• Figure: regression as the set of possible functions constrained by N samples, illustrated for the case of two samples t_1 and t_2 that are bivariate Gaussian


Specifying a Gaussian Process

• Key point about Gaussian stochastic processes: the joint distribution over the N variables y_1, ..., y_N is completely specified by the second-order statistics, i.e., the mean and covariance
• With mean zero, a GP is completely specified by the covariance of y(x) evaluated at any two values of x, which is given by a kernel function:

    E[y(x_n) y(x_m)] = k(x_n, x_m)

• For the Gaussian Process defined by the linear regression model y(x, w) = w^T φ(x) with prior p(w) = N(w | 0, α^{-1} I), the kernel function is

    K_nm = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m)


Defining a Kernel Directly

• A GP can also be specified directly by the choice of kernel function (instead of indirectly through basis functions)
• Samples of functions drawn from Gaussian processes for two different choices of kernel function are shown next


Samples from Gaussian Processes for two kernels

• Functions are drawn from Gaussian processes; note that each sample is a function
• Gaussian kernel:

    k(x, x') = exp( -||x - x'||^2 / (2σ^2) )

• Exponential kernel:

    k(x, x') = exp( -θ |x - x'| )

  – This corresponds to the Ornstein-Uhlenbeck process for Brownian motion
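
A sketch that draws such sample functions on a grid; the grid, kernel parameters, and jitter term are assumptions:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=0.3):
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

def exponential_kernel(x1, x2, theta=4.0):
    return np.exp(-theta * np.abs(x1 - x2))       # Ornstein-Uhlenbeck kernel

x = np.linspace(-1, 1, 200)
rng = np.random.default_rng(0)
for kernel in (gaussian_kernel, exponential_kernel):
    K = kernel(x[:, None], x[None, :])            # covariance of y evaluated on the grid
    # Each row drawn from N(0, K) is one sample function evaluated at the grid points
    jitter = 1e-8 * np.eye(len(x))                # small jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(len(x)), K + jitter, size=5)
```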


Prob Distributions and Stochastic Processes

• A probability distribution describes the distribution of scalars (x) or vectors (x)
• A stochastic process describes the distribution of functions f(x)
  – One can think of a function as a very long vector where each entry is the value f(x) of the function at an input x
• A Gaussian process is a generalization of the Gaussian distribution that describes a distribution over functions
  – Since the vector would be infinite-dimensional, we constrain it to only those points x corresponding to the training and test points


GP for Regression (Direct Approach)

• We specify the Gaussian Process directly over functions, abandoning the approach of defining a distribution over weights w
• Take into account noise on the observed target values:

    t_n = y_n + ε_n    where   y_n = y(x_n)

  – The noise process has a Gaussian distribution: p(t_n | y_n) = N(t_n | y_n, β^{-1})
  – Note that the target t_n is the output y_n corrupted by noise
• Defining t = (t_1, ..., t_N)^T, our goal is to determine the distribution p(t), which is a distribution over functions


GP for Regression (Direct Approach, continued)

• Assuming the noise is independent for each data point, the joint distribution of t = (t_1, ..., t_N)^T conditioned on the values y = (y_1, ..., y_N)^T is

    p(t | y) = N(t | y, β^{-1} I_N)

• From the definition of a GP, the marginal distribution of y is a Gaussian with zero mean and covariance matrix given by the Gram matrix K:

    p(y) = N(y | 0, K)

• The kernel function that determines K is chosen to express the property that for points x_n, x_m that are similar, the corresponding values y(x_n), y(x_m) will be more strongly correlated than for dissimilar points; K depends on the application


Regression: Marginal Distribution

• From the distributions p(t | y) and p(y) we can get the marginal p(t), conditioned on the input values x_1, ..., x_N
• Applying the sum rule and product rule, and using the result that when p(y) and p(t | y) are Gaussian, p(t) is also Gaussian:

    p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C)

  where the covariance matrix C has elements

    C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_nm

  – The first term is due to y(x); the second (δ) term is due to ε
• The two Gaussian sources of randomness, y(x) and ε, are independent, so their covariances simply add
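
A sketch constructing C and drawing noisy targets from the marginal; the kernel choice and inputs are assumptions:

```python
import numpy as np

def rbf(x1, x2, sigma=0.3):
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))    # assumed kernel for K

rng = np.random.default_rng(0)
beta = 25.0
x = rng.uniform(-1, 1, 30)

K = rbf(x[:, None], x[None, :])
C = K + np.eye(len(x)) / beta        # C(x_n, x_m) = k(x_n, x_m) + beta^{-1} delta_nm

# Noisy targets t are one draw from the marginal p(t) = N(t | 0, C)
t = rng.multivariate_normal(np.zeros(len(x)), C)
```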


Widely used kernel function for GP

• The exponential of a quadratic form, with the addition of constant and linear terms:

    k(x_n, x_m) = θ_0 exp( -(θ_1/2) ||x_n - x_m||^2 ) + θ_2 + θ_3 x_n^T x_m

  – The linear term corresponds to a parametric model that is a linear function of the input variables
• Samples from this prior can be plotted for various values of the parameters θ_0, θ_1, θ_2, θ_3
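
A sketch of this kernel; the default parameter values are placeholders, not values from the slides:

```python
import numpy as np

def gp_kernel(xn, xm, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    """k(xn, xm) = theta0 * exp(-theta1/2 * ||xn - xm||^2) + theta2 + theta3 * xn^T xm."""
    xn, xm = np.atleast_1d(xn), np.atleast_1d(xm)
    sq_dist = np.sum((xn - xm) ** 2)
    return theta0 * np.exp(-0.5 * theta1 * sq_dist) + theta2 + theta3 * np.dot(xn, xm)

# Example: Gram matrix over a few scalar inputs
xs = np.linspace(-1, 1, 5)
K = np.array([[gp_kernel(a, b) for b in xs] for a in xs])
```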
