UCSD DSP Lab
Least Squares
Ken Kreutz-Delgado
(Nuno Vasconcelos)
ECE 175A – Winter 2012 - UCSD
(Unweighted) Least Squares

• Assume linearity in the unknown, deterministic model parameters $\theta$.
• Scalar, additive noise model:
  $$ y = f(x;\theta) + \varepsilon = \phi(x)^T \theta + \varepsilon $$
  – E.g., for a line ($f$ affine in $x$):
    $$ f(x;\theta) = \theta_1 x + \theta_0 = [\,1 \;\; x\,] \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \phi(x)^T \theta, \qquad \phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix} $$
• This can be generalized to arbitrary functions of $x$:
  $$ \phi(x) = \begin{bmatrix} \phi_0(x) \\ \vdots \\ \phi_k(x) \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix} $$
Examples

• Components of $\phi(x)$ can be arbitrary nonlinear functions of $x$ that are linear in $\theta$:
  – Line fitting (affine in $x$):
    $$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \phi(x) = [\,1 \;\; x\,]^T $$
  – Polynomial fitting (nonlinear in $x$):
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \phi(x) = [\,1 \;\; x \;\; \cdots \;\; x^k\,]^T $$
  – Truncated Fourier series (nonlinear in $x$):
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i\pi x), \qquad \phi(x) = [\,1 \;\; \cos(\pi x) \;\; \cdots \;\; \cos(k\pi x)\,]^T $$
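As a concrete sketch, the three feature maps above can be written as short NumPy functions. The function names (`phi_line`, `phi_poly`, `phi_fourier`) and the sample values are mine, not the slides':

```python
import numpy as np

# Feature maps phi(x) for the three example models; k follows the slide's
# notation. All three give f(x; theta) = phi(x) @ theta, linear in theta.
def phi_line(x):
    return np.array([1.0, x])

def phi_poly(x, k):
    return np.array([x**i for i in range(k + 1)])

def phi_fourier(x, k):
    return np.array([np.cos(i * np.pi * x) for i in range(k + 1)])

theta = np.array([2.0, -1.0, 0.5])        # illustrative parameter vector
fx = phi_poly(0.5, 2) @ theta             # 2 - 0.5 + 0.125 = 1.625
```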
(Unweighted) Least Squares

• Loss = squared Euclidean norm of the model error:
  $$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 $$
  where
  $$ y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad \Gamma(x) = \begin{bmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix} $$
• The loss function can also be rewritten as
  $$ L(\theta) = \sum_{i=1}^{n} \big( y_i - \phi(x_i)^T \theta \big)^2 $$
  E.g., for the line,
  $$ L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta_0 - \theta_1 x_i \big)^2 $$
Examples

• The most important component is the Design Matrix $\Gamma(x)$:
  – Line fitting:
    $$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} $$
  – Polynomial fitting:
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} $$
  – Truncated Fourier series:
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i\pi x), \qquad \Gamma(x) = \begin{bmatrix} 1 & \cdots & \cos(k\pi x_1) \\ \vdots & & \vdots \\ 1 & \cdots & \cos(k\pi x_n) \end{bmatrix} $$
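A minimal sketch of building the polynomial design matrix by stacking the $\phi(x_i)^T$ as rows, and checking that the matrix form and the sum form of the loss agree. The toy data values are mine:

```python
import numpy as np

# Polynomial design matrix Gamma(x): row i is [1, x_i, ..., x_i^k].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.5, 4.0])
k = 2
Gamma = np.vander(x, k + 1, increasing=True)   # columns [1, x, x^2]

theta = np.array([0.5, 1.0, -0.1])
loss_matrix = np.sum((y - Gamma @ theta) ** 2)                    # ||y - Gamma theta||^2
loss_sum = sum((yi - Gamma[i] @ theta) ** 2 for i, yi in enumerate(y))
```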
(Unweighted) Least Squares

• One way to minimize
  $$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 $$
  is to find a value $\hat\theta$ such that
  $$ \nabla L(\hat\theta) = -2\,\Gamma(x)^T \big( y - \Gamma(x)\hat\theta \big) = 0 \quad\text{and}\quad \nabla^2 L(\hat\theta) = 2\,\Gamma(x)^T \Gamma(x) > 0 $$
  – These conditions will hold when $\Gamma(x)$ is one-to-one, yielding
    $$ \hat\theta = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y $$
• Thus, the least squares solution is determined by the Pseudoinverse of the Design Matrix $\Gamma(x)$:
  $$ \Gamma(x)^\dagger = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T $$
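The pseudoinverse solution can be sketched directly in NumPy. The toy line data is mine; `np.linalg.pinv` computes $\Gamma^\dagger$ (via the SVD), so for a full-column-rank design it should agree with the normal-equations route:

```python
import numpy as np

# Least squares for a line via the normal equations vs. the pseudoinverse.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x + 0.01 * rng.standard_normal(20)     # approx. y = 2 + 3x

Gamma = np.column_stack([np.ones_like(x), x])          # design matrix for a line
theta_normal = np.linalg.solve(Gamma.T @ Gamma, Gamma.T @ y)
theta_pinv = np.linalg.pinv(Gamma) @ y                 # Gamma^+ y
```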
(Unweighted) Least Squares

• If the design matrix $\Gamma(x)$ is one-to-one (has full column rank), the least squares solution is conceptually easy to compute. This will nearly always be the case in practice.
• Recall the example of fitting a line in the plane:
  $$ \Gamma(x)^T \Gamma(x) = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = n \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix} $$
  $$ \Gamma(x)^T y = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = n \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
  where the overbar denotes the sample mean, e.g. $\overline{x} = \frac{1}{n}\sum_i x_i$.
(Unweighted) Least Squares

• Pseudoinverse solution = LS solution:
  $$ \hat\theta = \Gamma(x)^\dagger y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y $$
• The LS line-fit solution is
  $$ \hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
  (the factors of $n$ cancel), which (naively) requires
  – a 2x2 matrix inversion
  – followed by a 2x1 vector post-multiplication
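The line fit can be computed directly from the four sample moments $\overline{x}, \overline{x^2}, \overline{y}, \overline{xy}$, and matches the pseudoinverse solution. The data values are mine:

```python
import numpy as np

# Line fit from sample moments: theta = [[1, xbar], [xbar, x2bar]]^-1 [ybar, xybar].
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])                # roughly y = 1 + 2x

M = np.array([[1.0, x.mean()],
              [x.mean(), (x**2).mean()]])
b = np.array([y.mean(), (x * y).mean()])
theta_hat = np.linalg.solve(M, b)                      # [theta0, theta1]

Gamma = np.column_stack([np.ones_like(x), x])
```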
(Unweighted) Least Squares

• What about fitting a k-th order polynomial?
  $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} $$
  $$ \Gamma(x)^T \Gamma(x) = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} 1 & \cdots & x_1^k \\ \vdots & & \vdots \\ 1 & \cdots & x_n^k \end{bmatrix} = n \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix} $$
(Unweighted) Least Squares

• and
  $$ \Gamma(x)^T y = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = n \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Thus, when $\Gamma(x)$ is one-to-one (has full column rank):
  $$ \hat\theta = \Gamma(x)^\dagger y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Mathematically, this is a very straightforward procedure.
• Numerically, this is generally NOT how the solution is computed. (Cf. the sophisticated algorithms in Matlab.)
(Unweighted) Least Squares

• Note the computational costs:
  – When (naively) fitting a line (1st order polynomial):
    $$ \hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
    a 2x2 matrix inversion, followed by a 2x1 vector post-multiplication.
  – When (naively) fitting a general k-th order polynomial:
    $$ \hat\theta = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
    a (k+1)x(k+1) matrix inversion, followed by a (k+1)x1 multiplication.
• It is evident that the computational complexity is an increasing function of the number of parameters.
• More efficient (and numerically stable) algorithms are used in practice, but the complexity still scales with the number of parameters.
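A sketch of why the naive normal-equations inverse is avoided in practice: forming $\Gamma^T\Gamma$ roughly squares the condition number of $\Gamma$, while `np.linalg.lstsq` (which uses an SVD-based solver) works on $\Gamma$ directly. The cubic test data is mine:

```python
import numpy as np

# Compare the well-posed lstsq route with the conditioning of the
# normal-equations matrix for a polynomial design.
x = np.linspace(0.0, 1.0, 50)
theta_true = np.array([1.0, -2.0, 3.0, 0.5])
Gamma = np.vander(x, 4, increasing=True)               # cubic design matrix
y = Gamma @ theta_true                                 # noiseless cubic data

theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
cond_normal = np.linalg.cond(Gamma.T @ Gamma)          # ~ cond(Gamma)^2
```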
(Unweighted) Least Squares

• Suppose the dataset {x_i} is constant in value over repeated, non-constant measurements of the dataset {y_i}.
  – E.g., consider some process (e.g., temperature) where the measurements {y_i} are taken at the same locations {x_i} every day.
• Then the design matrix $\Gamma(x)$ is constant in value!
  – In advance (off-line), compute:
    $$ A = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix} $$
  – Every day (on-line), re-compute only:
    $$ \hat\theta = A^{-1} \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Hence, least squares can sometimes be implemented very efficiently.
• There are also efficient recursive updates, e.g. the Kalman filter.
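The off-line/on-line split can be sketched as follows: when the $x_i$ never change, factor the normal-equations matrix once (here with a Cholesky factorization, since $\Gamma^T\Gamma$ is symmetric positive definite), then each day only two cheap triangular solves are needed. The sensor-location data is mine:

```python
import numpy as np

# Off-line: fixed design matrix and its Cholesky factor, computed once.
x = np.linspace(0.0, 10.0, 100)                        # fixed measurement locations
Gamma = np.vander(x, 3, increasing=True)               # quadratic model
L = np.linalg.cholesky(Gamma.T @ Gamma)                # Gamma^T Gamma = L L^T

def daily_fit(y):
    # On-line: two triangular solves instead of a fresh matrix inversion.
    b = Gamma.T @ y
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

y_today = 1.0 + 0.5 * x + 0.02 * x**2                  # today's (noiseless) readings
theta = daily_fit(y_today)
```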
Geometric Solution & Interpretation

• Alternatively, we can derive the least squares solution using geometric (Hilbert space) considerations.
• Goal: minimize the size (norm) of the model prediction error (aka residual error), $e(\theta) = y - \Gamma(x)\theta$:
  $$ L(\theta) = \| e(\theta) \|^2 = \| y - \Gamma(x)\,\theta \|^2 $$
• Note that, given a known design matrix $\Gamma(x)$, the vector
  $$ \Gamma(x)\,\theta = \begin{bmatrix} | & & | \\ \gamma_1 & \cdots & \gamma_k \\ | & & | \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_k \end{bmatrix} = \sum_{i=1}^{k} \theta_i \gamma_i $$
  is a linear combination of the column vectors $\gamma_i$.
• I.e., $\Gamma(x)\,\theta$ is in the range space (column space) of $\Gamma(x)$.
(Hilbert Space) Geometric Interpretation

• The vector $\Gamma\theta$ lives in the range (column space) of $\Gamma = \Gamma(x) = [\gamma_1, \ldots, \gamma_k]$.
• Assume that $y$ is as shown.
• Equivalent statements are:
  – $\Gamma\hat\theta$ is the value of $\Gamma\theta$ closest to $y$ in the range of $\Gamma$.
  – $\Gamma\hat\theta$ is the orthogonal projection of $y$ onto the range of $\Gamma$.
  – The residual $e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$.

[Figure: $y$, its orthogonal projection $\Gamma\hat\theta$ onto the span of $\gamma_1, \ldots, \gamma_k$, and the residual $y - \Gamma\hat\theta$.]
Geometric Interpretation of LS

• $e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$ iff
  $$ \gamma_1^T \big( y - \Gamma\hat\theta \big) = 0, \quad \ldots, \quad \gamma_k^T \big( y - \Gamma\hat\theta \big) = 0 $$
• Thus $e = y - \Gamma\hat\theta$ is in the nullspace of $\Gamma^T$:
  $$ \Gamma^T \big( y - \Gamma\hat\theta \big) = \begin{bmatrix} \gamma_1^T ( y - \Gamma\hat\theta ) \\ \vdots \\ \gamma_k^T ( y - \Gamma\hat\theta ) \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} $$

[Figure: $y$, the projection $\Gamma\hat\theta$, and the residual $y - \Gamma\hat\theta$ orthogonal to the span of $\gamma_1, \ldots, \gamma_k$.]
Geometric Interpretation of LS

• Note: Nullspace Condition ⟺ Normal Equation:
  $$ \Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T \Gamma\,\hat\theta = \Gamma^T y $$
• Thus, if $\Gamma$ is one-to-one (has full column rank), we again get the pseudoinverse solution:
  $$ \hat\theta = \Gamma(x)^\dagger y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y $$
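The orthogonality condition is easy to check numerically: the LS residual should be orthogonal to every column of the design matrix. The random toy data is mine:

```python
import numpy as np

# Numerical check of the normal equations: Gamma^T (y - Gamma theta_hat) = 0,
# i.e. the residual lies in the nullspace of Gamma^T.
rng = np.random.default_rng(2)
Gamma = rng.standard_normal((30, 4))                   # generic full-rank design
y = rng.standard_normal(30)

theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
residual = y - Gamma @ theta_hat
```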
Probabilistic Interpretation

• We have seen that estimating a parameter by minimizing the least squares loss function is a special case of MLE.
• This interpretation holds for many loss functions.
  – First, note that
    $$ \hat\theta = \arg\min_\theta L\big( y, f(x;\theta) \big) = \arg\max_\theta e^{-L( y, f(x;\theta) )} $$
  – Now note that, because
    $$ e^{-L( y, f(x;\theta) )} \ge 0, \quad \forall\, y, x, $$
    we can make this exponential function into a $P_{Y|X}$ pdf by an appropriate normalization.
Prob. Interpretation of Loss Minimization

• I.e., by defining
  $$ P_{Y|X}(y \mid x; \theta) = \frac{1}{Z(x;\theta)}\, e^{-L( y, f(x;\theta) )}, \qquad Z(x;\theta) = \int e^{-L( y, f(x;\theta) )}\, dy $$
• If the normalization constant $Z(x;\theta)$ does not depend on $\theta$, i.e. $Z(x;\theta) = Z(x)$, then
  $$ \hat\theta = \arg\max_\theta e^{-L( y, f(x;\theta) )} = \arg\max_\theta P_{Y|X}(y \mid x; \theta), $$
  which makes the problem a special case of MLE.
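The equivalence can be demonstrated numerically: since the Gaussian normalizer does not depend on $\theta$, the minimizer of the sum-of-squares loss and the maximizer of the Gaussian log-likelihood coincide over any grid of candidate parameters. The data, grid, and choice of $\sigma = 1$ are mine:

```python
import numpy as np

# Squared loss <-> Gaussian MLE, checked over a grid of candidate slopes.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 * x + 0.1 * rng.standard_normal(40)            # true slope = 2

thetas = np.linspace(0.0, 4.0, 401)
loss = np.array([np.sum((y - t * x) ** 2) for t in thetas])
# Gaussian log-likelihood with sigma = 1 (theta-independent normalizer):
loglik = np.array([np.sum(-0.5 * (y - t * x) ** 2 - 0.5 * np.log(2 * np.pi))
                   for t in thetas])
```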
Prob. Interpretation of Loss Minimization

• Note that for loss functions of the form
  $$ L(\theta) = g\big[\, y - f(x;\theta) \,\big] $$
  the model $f(x;\theta)$ only changes the mean of $Y$.
  – A shift in mean does not change the shape of a pdf, and therefore cannot change the value of the normalization constant.
  – Hence, for loss functions of the type above, it is always true that
    $$ \hat\theta = \arg\max_\theta e^{-L( y, f(x;\theta) )} = \arg\max_\theta P_{Y|X}(y \mid x; \theta) $$
Regression

• This leads to the interpretation that we saw last time, which is the usual definition of a regression problem:
  – two random variables X and Y
  – a dataset of examples D = {(x_1, y_1), …, (x_n, y_n)}
  – a parametric model of the form
    $$ y = f(x;\theta) + \varepsilon $$
    where $\theta$ is a parameter vector and $\varepsilon$ is a random variable that accounts for noise.
• The pdf of the noise determines the loss in the other formulation.
Regression

• Error pdf:
  – Gaussian:
    $$ P(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\varepsilon^2}{2\sigma^2}} $$
  – Laplacian:
    $$ P(\varepsilon) = \frac{1}{2\sigma}\, e^{-\frac{|\varepsilon|}{\sigma}} $$
  – Rayleigh:
    $$ P(\varepsilon) = \frac{\varepsilon}{\sigma^2}\, e^{-\frac{\varepsilon^2}{2\sigma^2}} $$
• Equivalent loss function:
  – L2 distance: $L(x, y) = (y - x)^2$
  – L1 distance: $L(x, y) = |y - x|$
  – Rayleigh distance: $L(x, y) = \frac{(y - x)^2}{2\sigma^2} - \log(y - x)$
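A sketch of how the noise pdf picks the loss: for a constant model $f(x) = c$, the L2 loss (Gaussian errors) is minimized by the sample mean, while the L1 loss (Laplacian errors) is minimized by the sample median. The data, including the outlier used to separate the two, is mine:

```python
import numpy as np

# Grid search over the constant c for both losses.
y = np.array([1.0, 1.2, 0.9, 1.1, 10.0])               # one gross outlier

cs = np.linspace(0.0, 10.0, 10001)
l2 = np.array([np.sum((y - c) ** 2) for c in cs])
l1 = np.array([np.sum(np.abs(y - c)) for c in cs])

c_l2 = cs[np.argmin(l2)]                               # ~ mean(y)
c_l1 = cs[np.argmin(l1)]                               # ~ median(y)
```

The L1/Laplacian fit shrugs off the outlier, which is exactly the robustness argument for swapping the error pdf.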
Regression

• We know how to solve the problem with these losses.
  – Why would we want the added complexity incurred by introducing error probability models?
• The main reason is that this allows a data-driven definition of the loss.
• One good way to see this is the problem of weighted least squares.
  – Suppose that you know that not all measurements (x_i, y_i) have the same importance.
  – This can be encoded in the loss function.
  – Remember that the unweighted loss is
    $$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 = \sum_i \big( y_i - \phi(x_i)^T \theta \big)^2 $$
Regression

• To weigh different points differently we could use
  $$ L = \sum_i w_i \big( y_i - \phi(x_i)^T \theta \big)^2, \qquad w_i \ge 0, $$
  or even the more generic form
  $$ L = \big( y - \Gamma(x)\theta \big)^T W \big( y - \Gamma(x)\theta \big), \qquad W = W^T \ge 0. $$
• In the latter case the solution is (homework)
  $$ \theta^* = \big( \Gamma(x)^T W\, \Gamma(x) \big)^{-1} \Gamma(x)^T W\, y $$
• The question is "how do I know these weights?"
• Without a probabilistic model one has little guidance on this.
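A minimal sketch of the weighted closed form $\theta^* = (\Gamma^T W \Gamma)^{-1}\Gamma^T W y$, checked against the unweighted solution when $W = I$. The toy data and the choice of weights are mine:

```python
import numpy as np

# Weighted least squares closed form.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 25)
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(25)
G = np.column_stack([np.ones_like(x), x])

def wls(G, y, W):
    return np.linalg.solve(G.T @ W @ G, G.T @ W @ y)

theta_unweighted = wls(G, y, np.eye(len(y)))           # W = I: ordinary LS

# Heavily up-weighting the last point pulls the fit toward it.
W = np.diag(np.r_[np.ones(24), 100.0])
theta_w = wls(G, y, W)
```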
Regression

• The probabilistic equivalent
  $$ \theta^* = \arg\max_\theta e^{-L( y, f(x;\theta) )} = \arg\max_\theta \exp\!\Big\{ -\tfrac{1}{2}\big( y - \Gamma(x)\theta \big)^T W \big( y - \Gamma(x)\theta \big) \Big\} $$
  is the MLE for a Gaussian pdf of known covariance $\Sigma = W^{-1}$.
• In the case where the covariance is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, we have
  $$ L(\theta) = \sum_i \frac{1}{\sigma_i^2} \big( y_i - \phi(x_i)^T \theta \big)^2 $$
• In this case, each point is weighted by the inverse variance.
Regression

• This makes sense:
  – Under the probabilistic formulation, $\sigma_i^2$ is the variance of the error associated with the i-th observation.
  – This means that it is a measure of the uncertainty of the observation.
• When
  $$ y_i = f(x_i;\theta) + \varepsilon_i, \qquad W = \Sigma^{-1}, $$
  we are weighting each point by the inverse of its uncertainty (variance).
• We can also check the goodness of this weighting matrix by plotting the histogram of the errors.
• If W is chosen correctly, the errors should be Gaussian.
Model Validation

• In fact, by analyzing the errors of the fit we can say a lot.
• This is called model validation:
  – Leave some data on the side, and run it through the predictor.
  – Analyze the errors to see if there is deviation from the assumed model (Gaussianity for least squares).
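The hold-out procedure above can be sketched as follows: fit on part of the data, then inspect the residuals on the held-out part. Here a correct linear model on Gaussian noise passes a simple zero-mean/scale sanity check; the data and thresholds are mine, and in practice one would plot the residual histogram or run a formal normality test:

```python
import numpy as np

# Hold-out validation: fit on even-indexed points, validate on odd-indexed.
rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(200)     # noise sigma = 0.1

idx = np.arange(200)
train, test = idx % 2 == 0, idx % 2 == 1
G = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(G[train], y[train], rcond=None)

resid = y[test] - G[test] @ theta                      # held-out residuals
```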
Model Validation & Improvement

• Many times this will give you hints about alternative models that may fit the data better.
• Typical problems for least squares (and their solutions):
Model Validation & Improvement

• Example 1: error histogram.
  – This does not look Gaussian.
• Look at the scatter plot of the error $y - f(x, \theta^*)$:
  – Increasing trend; maybe we should try
    $$ \log(y_i) = f(x_i;\theta) + \varepsilon_i $$
Model Validation & Improvement

• Example 1: error histogram for the new model.
  – This looks Gaussian; this model is probably better.
  – There are statistical tests that you can use to check this objectively.
  – These are covered in statistics classes.
Model Validation & Improvement

• Example 2: error histogram.
  – This also does not look Gaussian.
• Checking the scatter plot now seems to suggest trying
  $$ \sqrt{y_i} = f(x_i;\theta) + \varepsilon_i $$
Model Validation & Improvement

• Example 2: error histogram for the new model.
• Once again, this seems to work: the residual behavior looks Gaussian.
  – However, it is NOT always the case that such changes will work.
  – If not, maybe the problem is the assumption of Gaussianity itself:
    move away from least squares, and try MLE with other error pdfs.
END