
Least Squares

Ken Kreutz-Delgado
(Nuno Vasconcelos)

ECE 175A – Winter 2012 – UCSD


(Unweighted) Least Squares

• Assume linearity in the unknown, deterministic model parameters $\theta$.

• Scalar, additive noise model:

  $y = f(x;\theta) + \varepsilon = \Phi(x)^T \theta + \varepsilon$

  – E.g., for a line ($f$ affine in $x$),

    $f(x;\theta) = \theta_1 x + \theta_0, \qquad
     \Phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}, \qquad
     \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$

• This can be generalized to arbitrary functions of $x$:

  $\Phi(x) = \begin{bmatrix} \phi_0(x) \\ \vdots \\ \phi_k(x) \end{bmatrix}, \qquad
   \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix}$

2


Examples

• Components of $\Phi(x)$ can be arbitrary nonlinear functions of $x$, as long as the model is linear in $\theta$:

  – Line Fitting (affine in $x$):

    $f(x;\theta) = \theta_1 x + \theta_0, \qquad \Phi(x)^T = [\,1 \;\; x\,]$

  – Polynomial Fitting (nonlinear in $x$):

    $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Phi(x)^T = [\,1 \;\; x \;\; \cdots \;\; x^k\,]$

  – Truncated Fourier Series (nonlinear in $x$):

    $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i x), \qquad \Phi(x)^T = [\,1 \;\; \cos(x) \;\; \cdots \;\; \cos(kx)\,]$

3
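
As a concrete illustration of these feature maps (my own sketch, not part of the original slides), the NumPy snippet below builds $\Phi(x)$ for each of the three examples; the function names and the choice of k are mine.

    import numpy as np

    def phi_line(x):
        """Feature vector for an affine (line) model: [1, x]."""
        return np.array([1.0, x])

    def phi_poly(x, k):
        """Feature vector for a k-th order polynomial: [1, x, ..., x^k]."""
        return np.array([x**i for i in range(k + 1)])

    def phi_fourier(x, k):
        """Feature vector for a truncated cosine series: [1, cos(x), ..., cos(kx)]."""
        return np.array([np.cos(i * x) for i in range(k + 1)])

    # In every case the model prediction is linear in theta:
    # f(x; theta) = phi(x) @ theta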


(Unweighted) Least Squares

• Loss = Euclidean norm of the model error:

  $L(\theta) = \| y - \Gamma(x)\,\theta \|^2$

  where

  $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad
   \Gamma(x) = \begin{bmatrix} \Phi(x_1)^T \\ \vdots \\ \Phi(x_n)^T \end{bmatrix}, \qquad
   \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix}$

• The loss function can also be rewritten as

  $L(\theta) = \sum_{i=1}^{n} \big( y_i - \Phi(x_i)^T \theta \big)^2$

  E.g., for the line,

  $L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta_1 x_i - \theta_0 \big)^2$

4


Examples

• The most important component is the Design Matrix $\Gamma(x)$:

  – Line Fitting:

    $f(x;\theta) = \theta_1 x + \theta_0, \qquad
     \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$

  – Polynomial Fitting:

    $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad
     \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}$

  – Truncated Fourier Series:

    $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i x), \qquad
     \Gamma(x) = \begin{bmatrix} 1 & \cos(x_1) & \cdots & \cos(k x_1) \\ \vdots & & \vdots \\ 1 & \cos(x_n) & \cdots & \cos(k x_n) \end{bmatrix}$

5
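
A minimal sketch (mine, not from the slides) of how these design matrices can be assembled by stacking feature vectors row by row; the helper and the sample points are my own choices.

    import numpy as np

    def design_matrix(phi, xs):
        """Stack feature vectors phi(x_i) as the rows of the n x (k+1) design matrix Gamma(x)."""
        return np.vstack([phi(x) for x in xs])

    k = 3
    xs = np.linspace(0.0, 1.0, 5)
    Gamma_line = design_matrix(lambda x: np.array([1.0, x]), xs)                            # [1, x]
    Gamma_poly = design_matrix(lambda x: np.array([x**i for i in range(k + 1)]), xs)        # [1, x, ..., x^k]
    Gamma_fourier = design_matrix(lambda x: np.array([np.cos(i * x) for i in range(k + 1)]), xs)  # [1, cos(x), ..., cos(kx)]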


(Unweighted) Least Squares

• One way to minimize

  $L(\theta) = \| y - \Gamma(x)\,\theta \|^2$

  is to find a value $\hat\theta$ such that

  $\nabla_\theta L(\hat\theta) = -2\,\Gamma(x)^T \big( y - \Gamma(x)\hat\theta \big) = 0
   \qquad \text{and} \qquad
   \nabla^2_\theta L(\hat\theta) = 2\,\Gamma(x)^T \Gamma(x) > 0$

  – These conditions will hold when $\Gamma(x)$ is one-to-one, yielding

    $\hat\theta = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y$

• Thus, the least squares solution is determined by the Pseudoinverse of the Design Matrix $\Gamma(x)$:

  $\Gamma(x)^{+} = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T$

6
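
The sketch below (my own, not the course's code) computes the pseudoinverse solution both via the explicit normal equations and with NumPy's least-squares routine; the data-generating line and noise level are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(50)    # noisy line, true theta = [1, 2]

    Gamma = np.column_stack([np.ones_like(x), x])         # design matrix with rows [1, x_i]

    # Explicit normal-equations / pseudoinverse solution (fine for illustration)
    theta_hat = np.linalg.solve(Gamma.T @ Gamma, Gamma.T @ y)

    # Numerically preferred solver; should agree with theta_hat
    theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)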


(Unweighted) Least Squares

• If the design matrix $\Gamma(x)$ is one-to-one (has full column rank), the least squares solution is conceptually easy to compute. This will nearly always be the case in practice.

• Recall the example of fitting a line in the plane:

  $\Gamma(x)^T \Gamma(x) =
   \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix}
   \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
   = n \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}$

  $\Gamma(x)^T y =
   \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix}
   \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}
   = n \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix}$

  where $\overline{x} = \tfrac{1}{n}\sum_i x_i$, $\overline{x^2} = \tfrac{1}{n}\sum_i x_i^2$, $\overline{y} = \tfrac{1}{n}\sum_i y_i$, and $\overline{xy} = \tfrac{1}{n}\sum_i x_i y_i$ denote sample averages.

7


(Unweighted) Least Squares

• Pseudoinverse solution = LS solution:

  $\hat\theta = \Gamma(x)^{+} y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y$

• The LS line fit solution is

  $\hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1}
                \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix}$

  which (naively) requires

  – a 2x2 matrix inversion

  – followed by a 2x1 vector post-multiplication

8
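
Here is a small sketch (my own) of the sample-average form of the line fit; on exact data it recovers the line, and on noisy data it should match the general lstsq solution from the earlier sketch up to floating-point error.

    import numpy as np

    def line_fit_from_averages(x, y):
        """Line fit theta_hat = [theta_0, theta_1] from the 2x2 sample-moment matrix."""
        xbar, ybar = x.mean(), y.mean()
        x2bar, xybar = (x * x).mean(), (x * y).mean()
        M = np.array([[1.0, xbar], [xbar, x2bar]])
        v = np.array([ybar, xybar])
        return np.linalg.solve(M, v)       # 2x2 solve, then 2x1 product

    x = np.linspace(0.0, 1.0, 50)
    y = 1.0 + 2.0 * x                      # noiseless line with theta = [1, 2]
    print(line_fit_from_averages(x, y))    # expected: approximately [1., 2.]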


(Unweighted) Least Squares

• What about fitting a k-th order polynomial?

  $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad
   \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}$

  $\Gamma(x)^T \Gamma(x) =
   \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix}
   \begin{bmatrix} 1 & \cdots & x_1^k \\ \vdots & & \vdots \\ 1 & \cdots & x_n^k \end{bmatrix}
   = n \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}$

9


(Unweighted) Least Squares

• and

  $\Gamma(x)^T y =
   \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix}
   \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}
   = n \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix}$

• Thus, when $\Gamma(x)$ is one-to-one (has full column rank):

  $\hat\theta = \Gamma(x)^{+} y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y
   = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1}
     \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix}$

• Mathematically, this is a very straightforward procedure.

• Numerically, this is generally NOT how the solution is computed. (Viz. the sophisticated algorithms in Matlab.)

10
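
A brief sketch (mine) of the k-th order polynomial fit, computed both from the sample-moment matrix above and with a numerically stabler solver; the moment-matrix route becomes badly conditioned as k grows, which is one reason it is not how solutions are computed in practice.

    import numpy as np

    def poly_fit_moments(x, y, k):
        """Polynomial LS fit via the (k+1)x(k+1) sample-moment matrix (illustration only)."""
        moments = np.array([(x**p).mean() for p in range(2 * k + 1)])
        M = np.array([[moments[i + j] for j in range(k + 1)] for i in range(k + 1)])
        v = np.array([((x**i) * y).mean() for i in range(k + 1)])
        return np.linalg.solve(M, v)

    def poly_fit_lstsq(x, y, k):
        """Same fit via the design matrix and a QR/SVD-based solver (preferred)."""
        Gamma = np.vander(x, k + 1, increasing=True)     # columns [1, x, ..., x^k]
        theta, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
        return theta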


(Unweighted) Least Squares

• Note the computational costs

  – when (naively) fitting a line (1st order polynomial):

    $\hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1}
                  \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix}$

    a 2x2 matrix inversion, followed by a 2x1 vector post-multiplication

  – when (naively) fitting a general k-th order polynomial:

    $\hat\theta = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1}
                  \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix}$

    a (k+1)x(k+1) matrix inversion, followed by a (k+1)x1 vector post-multiplication

• It is evident that the computational complexity is an increasing function of the number of parameters.

• More efficient (and numerically stable) algorithms are used in practice, but the complexity still scales with the number of parameters.

11


(Unweighted) Least Squares

• Suppose the dataset {x_i} is constant in value over repeated, non-constant measurements of the dataset {y_i}.

  – e.g. consider some process (e.g., temperature) where the measurements {y_i} are taken at the same locations {x_i} every day

• Then the design matrix $\Gamma(x)$ is constant in value!

  – In advance (off-line), compute:

    $A = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1}$

  – Every day (on-line), re-compute only:

    $\hat\theta = A \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix}$

• Hence, least squares can sometimes be implemented very efficiently.

• There are also efficient recursive updates, e.g. the Kalman filter.

12
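
A sketch (my own) of this off-line/on-line split: the pseudoinverse of the fixed design matrix is computed once, and each day's new measurements only require a single matrix-vector product. The fixed locations and the cubic model are illustrative assumptions.

    import numpy as np

    # Off-line: locations are fixed, so the design matrix and its pseudoinverse are fixed.
    x = np.linspace(0.0, 10.0, 24)                 # e.g. fixed measurement locations/times
    Gamma = np.vander(x, 4, increasing=True)       # cubic polynomial model (my choice)
    Gamma_pinv = np.linalg.pinv(Gamma)             # (Gamma^T Gamma)^{-1} Gamma^T

    # On-line: each new day's measurements give a new theta with one matrix-vector product.
    def daily_fit(y_today):
        return Gamma_pinv @ y_today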


Geometric Solution & Interpretation

• Alternatively, we can derive the least squares solution using geometric (Hilbert space) considerations.

• Goal: minimize the size (norm) of the model prediction error (aka residual error), $e(\theta) = y - \Gamma(x)\theta$:

  $L(\theta) = \| e(\theta) \|^2 = \| y - \Gamma(x)\,\theta \|^2$

• Note that, given a known design matrix $\Gamma(x)$, the vector

  $\Gamma(x)\,\theta =
   \begin{bmatrix} | & & | \\ \gamma_1 & \cdots & \gamma_k \\ | & & | \end{bmatrix}
   \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_k \end{bmatrix}
   = \sum_{i=1}^{k} \theta_i \gamma_i$

  is a linear combination of the column vectors $\gamma_i$.

• I.e. $\Gamma(x)\theta$ is in the range space (column space) of $\Gamma(x)$.

13


(Hilbert Space) Geometric Interpretation

• The vector $\Gamma\theta$ lives in the range (column space) of $\Gamma = \Gamma(x) = [\gamma_1, \ldots, \gamma_k]$.

• Assume that $y$ is as shown (figure: $y$ lies outside the subspace spanned by $\gamma_1, \ldots, \gamma_k$, with $\Gamma\hat\theta$ its projection onto that subspace and $y - \Gamma\hat\theta$ the residual).

• Equivalent statements are:

  – $\Gamma\hat\theta$ is the value of $\Gamma\theta$ closest to $y$ in the range of $\Gamma$.

  – $\Gamma\hat\theta$ is the orthogonal projection of $y$ onto the range of $\Gamma$.

  – The residual $e = y - \Gamma\hat\theta$ is orthogonal ($\perp$) to the range of $\Gamma$.

14


Geometric Interpretation of LS

• $e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$ iff it is orthogonal to every column of $\Gamma$:

  $\gamma_i^T \big( y - \Gamma\hat\theta \big) = 0, \qquad i = 1, \ldots, k$

• Thus $e = y - \Gamma\hat\theta$ is in the nullspace of $\Gamma^T$:

  $\Gamma^T \big( y - \Gamma\hat\theta \big) =
   \begin{bmatrix} \gamma_1^T \\ \vdots \\ \gamma_k^T \end{bmatrix} \big( y - \Gamma\hat\theta \big)
   = \begin{bmatrix} \gamma_1^T ( y - \Gamma\hat\theta ) \\ \vdots \\ \gamma_k^T ( y - \Gamma\hat\theta ) \end{bmatrix}
   = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$

15


Geometric Interpretation of LS

• Note: Nullspace Condition ⟺ Normal Equation:

  $\Gamma^T \big( y - \Gamma\hat\theta \big) = 0
   \quad \Longleftrightarrow \quad
   \Gamma^T \Gamma\, \hat\theta = \Gamma^T y$

• Thus, if $\Gamma$ is one-to-one (has full column rank), we again get the pseudoinverse solution:

  $\hat\theta = \Gamma(x)^{+} y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y$

16
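
As a quick numerical check (my own, not in the slides) that the LS residual is orthogonal to the column space of Gamma, i.e. that the normal equation holds:

    import numpy as np

    rng = np.random.default_rng(1)
    Gamma = rng.standard_normal((30, 4))      # an arbitrary full-column-rank design matrix
    y = rng.standard_normal(30)

    theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
    residual = y - Gamma @ theta_hat

    # Gamma^T (y - Gamma theta_hat) should be (numerically) zero.
    print(np.allclose(Gamma.T @ residual, 0.0, atol=1e-10))   # expected: True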


Probabilistic Interpretation

• We have seen that estimating a parameter by minimizing the least squares loss function is a special case of MLE.

• This interpretation holds for many loss functions.

  – First, note that

    $\hat\theta = \arg\min_\theta L\big( y, f(x;\theta) \big) = \arg\max_\theta\, e^{-L( y,\, f(x;\theta) )}$

  – Now note that, because

    $e^{-L( y,\, f(x;\theta) )} \ge 0, \qquad \forall\, y, x,$

    we can make this exponential function into a Y|X pdf by an appropriate normalization.

17


Prob. Interpretation of Loss Minimization

• I.e. by defining

  $P_{Y|X}( y \,|\, x; \theta ) = \frac{1}{Z(x;\theta)}\, e^{-L( y,\, f(x;\theta) )},
   \qquad Z(x;\theta) = \int e^{-L( y,\, f(x;\theta) )}\, dy$

• If the normalization constant $Z(x;\theta)$ does not depend on $\theta$, i.e. $Z(x;\theta) = Z(x)$, then

  $\hat\theta = \arg\max_\theta\, e^{-L( y,\, f(x;\theta) )} = \arg\max_\theta\, P_{Y|X}( y \,|\, x; \theta ),$

  which makes the problem a special case of MLE.

18
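
For concreteness (a small added step, not on the original slide), the squared-error loss illustrates why the normalization constant can be independent of $\theta$:

  $Z(x;\theta) = \int e^{-( y - f(x;\theta) )^2}\, dy
   \;\overset{u \,=\, y - f(x;\theta)}{=}\; \int e^{-u^2}\, du = \sqrt{\pi},$

which does not depend on $\theta$, so minimizing the squared-error loss is exactly MLE under a Gaussian error model.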


Prob. Interpretation of Loss Minimization

• Note that, for loss functions of the form

  $L(\theta) = g\big[\, y - f(x;\theta) \,\big],$

  the model $f(x;\theta)$ only changes the mean of Y.

  – A shift in mean does not change the shape of a pdf, and therefore cannot change the value of the normalization constant.

  – Hence, for loss functions of the type above, it is always true that

    $\hat\theta = \arg\max_\theta\, e^{-L( y,\, f(x;\theta) )} = \arg\max_\theta\, P_{Y|X}( y \,|\, x; \theta )$

19


Regression

• This leads to the interpretation that we saw last time.

• Which is the usual definition of a regression problem:

  – two random variables X and Y

  – a dataset of examples D = {(x_1, y_1), ..., (x_n, y_n)}

  – a parametric model of the form

    $y = f(x;\theta) + \varepsilon$

  – where $\theta$ is a parameter vector, and $\varepsilon$ a random variable that accounts for noise

• The pdf of the noise determines the loss in the other formulation.

20


Regression

• Error pdf and the equivalent loss function:

  – Gaussian:

    $P(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\varepsilon^2}{2\sigma^2}}
     \quad\Longleftrightarrow\quad
     \text{L}_2 \text{ distance: } L(x,y) = \big( y - f(x;\theta) \big)^2$

  – Laplacian:

    $P(\varepsilon) = \frac{1}{2\lambda}\, e^{-\frac{|\varepsilon|}{\lambda}}
     \quad\Longleftrightarrow\quad
     \text{L}_1 \text{ distance: } L(x,y) = \big| y - f(x;\theta) \big|$

  – Rayleigh:

    $P(\varepsilon) = \frac{\varepsilon}{\sigma^2}\, e^{-\frac{\varepsilon^2}{2\sigma^2}}
     \quad\Longleftrightarrow\quad
     \text{Rayleigh distance: } L(x,y) = \frac{\big( y - f(x;\theta) \big)^2}{2\sigma^2} - \log\big( y - f(x;\theta) \big)$

21
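
The correspondence follows by taking negative logs of each error pdf (a short added derivation, not on the original slide). With $\varepsilon = y - f(x;\theta)$,

  $-\log P(\varepsilon) = \frac{\varepsilon^2}{2\sigma^2} + \text{const}, \qquad
   -\log P(\varepsilon) = \frac{|\varepsilon|}{\lambda} + \text{const}, \qquad
   -\log P(\varepsilon) = \frac{\varepsilon^2}{2\sigma^2} - \log\varepsilon + \text{const},$

which, up to additive constants and positive scale factors, are exactly the L2, L1, and Rayleigh losses listed above.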


Regression

• We know how to solve the problem with losses.

  – Why would we want the added complexity incurred by introducing error probability models?

• The main reason is that this allows a data-driven definition of the loss.

• One good way to see this is the problem of weighted least squares.

  – Suppose that you know that not all measurements (x_i, y_i) have the same importance.

  – This can be encoded in the loss function.

  – Remember that the unweighted loss is

    $L(\theta) = \| y - \Gamma(x)\,\theta \|^2 = \sum_i \big( y_i - \Phi(x_i)^T \theta \big)^2$

22


Regression

• To weigh different points differently we could use

  $L(\theta) = \sum_i w_i \big( y_i - \Phi(x_i)^T \theta \big)^2, \qquad w_i \ge 0,$

  or even the more generic form

  $L(\theta) = \big( y - \Gamma(x)\theta \big)^T W \big( y - \Gamma(x)\theta \big), \qquad W = W^T \ge 0.$

• In the latter case the solution is (homework)

  $\theta^{*} = \big( \Gamma(x)^T W\, \Gamma(x) \big)^{-1} \Gamma(x)^T W\, y$

• The question is "how do I know these weights?"

• Without a probabilistic model one has little guidance on this.

23
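
A minimal sketch (mine) of the weighted least squares solution quoted above; here W is taken to be diagonal, built from per-point weights w_i, and the example weights are purely illustrative.

    import numpy as np

    def weighted_ls(Gamma, y, w):
        """Weighted LS: theta* = (Gamma^T W Gamma)^{-1} Gamma^T W y, with W = diag(w)."""
        W = np.diag(w)
        A = Gamma.T @ W @ Gamma
        b = Gamma.T @ W @ y
        return np.linalg.solve(A, b)

    # Example usage: down-weight the second half of 50 measurements.
    # w = np.concatenate([np.ones(25), 0.1 * np.ones(25)])
    # theta_star = weighted_ls(Gamma, y, w)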


Regression

• The probabilistic equivalent

  $\theta^{*} = \arg\max_\theta\, e^{-L( y,\, f(x;\theta) )}
   = \arg\max_\theta\, \exp\Big\{ -\tfrac{1}{2}\big( y - \Gamma(x)\theta \big)^T W \big( y - \Gamma(x)\theta \big) \Big\}$

  is the MLE for a Gaussian pdf of known covariance $\Sigma = W^{-1}$.

• In the case where the covariance is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, we have

  $L(\theta) = \sum_i \frac{1}{\sigma_i^2} \big( y_i - \Phi(x_i)^T \theta \big)^2$

• In this case, each point is weighted by the inverse variance.

24


Regression

• This makes sense:

  – Under the probabilistic formulation, $\sigma_i^2$ is the variance of the error associated with the i-th observation.

  – This means that it is a measure of the uncertainty of the observation.

• When

  $y_i = f(x_i;\theta) + \varepsilon_i, \qquad W = \Sigma^{-1},$

  we are weighting each point by the inverse of its uncertainty (variance).

• We can also check the goodness of this weighting matrix by plotting the histogram of the errors.

• If W is chosen correctly, the errors should be Gaussian.

25


Model Validation

• In fact, by analyzing the errors of the fit we can say a lot.

• This is called model validation:

  – Leave some data on the side, and run it through the predictor.

  – Analyze the errors to see if there is deviation from the assumed model (Gaussianity for least squares).

26
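
A small sketch (my own) of this validation loop: fit on a training split, predict on held-out data, and inspect the residuals. The histogram/normality check here uses scipy.stats.normaltest as a stand-in for whatever diagnostic the course prefers.

    import numpy as np
    from scipy import stats

    def validate(Gamma, y, fit_fraction=0.8, seed=0):
        """Fit on a random split, return held-out residuals (and a rough normality p-value)."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        n_fit = int(fit_fraction * len(y))
        train, test = idx[:n_fit], idx[n_fit:]

        theta, *_ = np.linalg.lstsq(Gamma[train], y[train], rcond=None)
        residuals = y[test] - Gamma[test] @ theta

        # e.g. histogram the residuals and run a normality test on them
        _, p_value = stats.normaltest(residuals)
        return residuals, p_value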


Model Validation & Improvement

• Many times this will give you hints to alternative models that may fit the data better.

• Typical problems for least squares (and solutions):

27


Model Validation & Improvement

• Example 1: error histogram.

• This does not look Gaussian.

• Look at the scatter plot of the error $y - f(x;\theta^{*})$:

  – increasing trend; maybe we should try

    $\log(y_i) = f(x_i;\theta) + \varepsilon_i$

28
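
A brief sketch (mine) of the suggested fix: refit with the log-transformed target and re-examine the residuals, which is all the transformation amounts to in the linear-in-theta setting. The data-generating model below is an assumption chosen to mimic multiplicative noise.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.1, 2.0, 100)
    y = np.exp(1.0 + 0.5 * x) * rng.lognormal(sigma=0.1, size=100)   # multiplicative noise, y > 0
    Gamma = np.column_stack([np.ones_like(x), x])

    # Fit log(y) ~ Gamma theta instead of y ~ Gamma theta, then inspect the residuals.
    theta_log, *_ = np.linalg.lstsq(Gamma, np.log(y), rcond=None)
    residuals = np.log(y) - Gamma @ theta_log     # these feed the new error histogram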


Model Validation & Improvement

• Example 1: error histogram for the new model.

• This looks Gaussian:

  – this model is probably better

  – there are statistical tests that you can use to check this objectively

  – these are covered in statistics classes

29


Model Validation & Improvement

• Example 2: error histogram.

• This also does not look Gaussian.

• Checking the scatter plot now seems to suggest trying a transformed target, e.g.

  $\sqrt{y_i} = f(x_i;\theta) + \varepsilon_i$

30


Model Validation & Improvement

• Example 2: error histogram for the new model.

• Once again, this seems to work.

• The residual behavior looks Gaussian.

  – However, it is NOT always the case that such changes will work. If not, maybe the problem is the assumption of Gaussianity itself.

  – In that case, move away from least squares and try MLE with other error pdfs.

31


END

32
