UCSD DSP Lab
Least Squares
Ken Kreutz-Delgado
(Nuno Vasconcelos)
ECE 175A – Winter 2012 - UCSD
(Unweighted) Least Squares

• Assume linearity in the unknown, deterministic model parameters $\theta$.
• Scalar, additive noise model:
  $$ y = f(x;\theta) + \varepsilon = \phi(x)^T \theta + \varepsilon $$
  – E.g., for a line ($f$ affine in $x$):
    $$ f(x;\theta) = \theta_1 x + \theta_0 = [\,1 \;\; x\,] \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \phi(x)^T \theta, \qquad \phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix} $$
• This can be generalized to arbitrary functions of $x$:
  $$ \phi(x) = \begin{bmatrix} \phi_0(x) \\ \vdots \\ \phi_k(x) \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix} $$
Examples

• Components of $\phi(x)$ can be arbitrary nonlinear functions of $x$ that are linear in $\theta$:
  – Line fitting (affine in $x$):
    $$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \phi(x) = [\,1 \;\; x\,]^T $$
  – Polynomial fitting (nonlinear in $x$):
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \phi(x) = [\,1 \;\; x \;\; \cdots \;\; x^k\,]^T $$
  – Truncated Fourier series (nonlinear in $x$):
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i\pi x), \qquad \phi(x) = [\,1 \;\; \cos(\pi x) \;\; \cdots \;\; \cos(k\pi x)\,]^T $$
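As a concrete sketch, the three feature maps above can be written as short NumPy functions. The function names (`phi_line`, `phi_poly`, `phi_fourier`) and the sample values are mine, not the slides':

```python
import numpy as np

# Feature maps phi(x) for the three example models; k follows the slide's
# notation. All three give f(x; theta) = phi(x) @ theta, linear in theta.
def phi_line(x):
    return np.array([1.0, x])

def phi_poly(x, k):
    return np.array([x**i for i in range(k + 1)])

def phi_fourier(x, k):
    return np.array([np.cos(i * np.pi * x) for i in range(k + 1)])

theta = np.array([2.0, -1.0, 0.5])        # illustrative parameter vector
fx = phi_poly(0.5, 2) @ theta             # 2 - 0.5 + 0.125 = 1.625
```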
(Unweighted) Least Squares

• Loss = squared Euclidean norm of the model error:
  $$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 $$
  where
  $$ y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad \Gamma(x) = \begin{bmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix} $$
• The loss function can also be rewritten as
  $$ L(\theta) = \sum_{i=1}^{n} \big( y_i - \phi(x_i)^T \theta \big)^2 $$
  E.g., for the line,
  $$ L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta_0 - \theta_1 x_i \big)^2 $$
Examples

• The most important component is the Design Matrix $\Gamma(x)$:
  – Line fitting:
    $$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} $$
  – Polynomial fitting:
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} $$
  – Truncated Fourier series:
    $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(i\pi x), \qquad \Gamma(x) = \begin{bmatrix} 1 & \cdots & \cos(k\pi x_1) \\ \vdots & & \vdots \\ 1 & \cdots & \cos(k\pi x_n) \end{bmatrix} $$
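A minimal sketch of building the polynomial design matrix by stacking the $\phi(x_i)^T$ as rows, and checking that the matrix form and the sum form of the loss agree. The toy data values are mine:

```python
import numpy as np

# Polynomial design matrix Gamma(x): row i is [1, x_i, ..., x_i^k].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.5, 4.0])
k = 2
Gamma = np.vander(x, k + 1, increasing=True)   # columns [1, x, x^2]

theta = np.array([0.5, 1.0, -0.1])
loss_matrix = np.sum((y - Gamma @ theta) ** 2)                    # ||y - Gamma theta||^2
loss_sum = sum((yi - Gamma[i] @ theta) ** 2 for i, yi in enumerate(y))
```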
(Unweighted) Least Squares

• One way to minimize
  $$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 $$
  is to find a value $\hat\theta$ such that
  $$ \nabla L(\hat\theta) = -2\,\Gamma(x)^T \big( y - \Gamma(x)\hat\theta \big) = 0 \quad\text{and}\quad \nabla^2 L(\hat\theta) = 2\,\Gamma(x)^T \Gamma(x) > 0 $$
  – These conditions will hold when $\Gamma(x)$ is one-to-one, yielding
    $$ \hat\theta = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y $$
• Thus, the least squares solution is determined by the Pseudoinverse of the Design Matrix $\Gamma(x)$:
  $$ \Gamma(x)^\dagger = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T $$
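The pseudoinverse solution can be sketched directly in NumPy. The toy line data is mine; `np.linalg.pinv` computes $\Gamma^\dagger$ (via the SVD), so for a full-column-rank design it should agree with the normal-equations route:

```python
import numpy as np

# Least squares for a line via the normal equations vs. the pseudoinverse.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x + 0.01 * rng.standard_normal(20)     # approx. y = 2 + 3x

Gamma = np.column_stack([np.ones_like(x), x])          # design matrix for a line
theta_normal = np.linalg.solve(Gamma.T @ Gamma, Gamma.T @ y)
theta_pinv = np.linalg.pinv(Gamma) @ y                 # Gamma^+ y
```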
(Unweighted) Least Squares

• If the design matrix $\Gamma(x)$ is one-to-one (has full column rank), the least squares solution is conceptually easy to compute. This will nearly always be the case in practice.
• Recall the example of fitting a line in the plane:
  $$ \Gamma(x)^T \Gamma(x) = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = n \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix} $$
  $$ \Gamma(x)^T y = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = n \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
  where the overbar denotes the sample mean, e.g. $\overline{x} = \frac{1}{n}\sum_i x_i$.
(Unweighted) Least Squares

• Pseudoinverse solution = LS solution:
  $$ \hat\theta = \Gamma(x)^\dagger y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y $$
• The LS line-fit solution is
  $$ \hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
  (the factors of $n$ cancel), which (naively) requires
  – a 2x2 matrix inversion
  – followed by a 2x1 vector post-multiplication
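The line fit can be computed directly from the four sample moments $\overline{x}, \overline{x^2}, \overline{y}, \overline{xy}$, and matches the pseudoinverse solution. The data values are mine:

```python
import numpy as np

# Line fit from sample moments: theta = [[1, xbar], [xbar, x2bar]]^-1 [ybar, xybar].
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])                # roughly y = 1 + 2x

M = np.array([[1.0, x.mean()],
              [x.mean(), (x**2).mean()]])
b = np.array([y.mean(), (x * y).mean()])
theta_hat = np.linalg.solve(M, b)                      # [theta0, theta1]

Gamma = np.column_stack([np.ones_like(x), x])
```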
(Unweighted) Least Squares

• What about fitting a k-th order polynomial?
  $$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} $$
  $$ \Gamma(x)^T \Gamma(x) = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} 1 & \cdots & x_1^k \\ \vdots & & \vdots \\ 1 & \cdots & x_n^k \end{bmatrix} = n \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix} $$
(Unweighted) Least Squares

• and
  $$ \Gamma(x)^T y = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = n \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Thus, when $\Gamma(x)$ is one-to-one (has full column rank):
  $$ \hat\theta = \Gamma(x)^\dagger y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Mathematically, this is a very straightforward procedure.
• Numerically, this is generally NOT how the solution is computed. (Cf. the sophisticated algorithms in Matlab.)
(Unweighted) Least Squares

• Note the computational costs:
  – When (naively) fitting a line (1st order polynomial):
    $$ \hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
    a 2x2 matrix inversion, followed by a 2x1 vector post-multiplication.
  – When (naively) fitting a general k-th order polynomial:
    $$ \hat\theta = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
    a (k+1)x(k+1) matrix inversion, followed by a (k+1)x1 multiplication.
• It is evident that the computational complexity is an increasing function of the number of parameters.
• More efficient (and numerically stable) algorithms are used in practice, but the complexity still scales with the number of parameters.
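A sketch of why the naive normal-equations inverse is avoided in practice: forming $\Gamma^T\Gamma$ roughly squares the condition number of $\Gamma$, while `np.linalg.lstsq` (which uses an SVD-based solver) works on $\Gamma$ directly. The cubic test data is mine:

```python
import numpy as np

# Compare the well-posed lstsq route with the conditioning of the
# normal-equations matrix for a polynomial design.
x = np.linspace(0.0, 1.0, 50)
theta_true = np.array([1.0, -2.0, 3.0, 0.5])
Gamma = np.vander(x, 4, increasing=True)               # cubic design matrix
y = Gamma @ theta_true                                 # noiseless cubic data

theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
cond_normal = np.linalg.cond(Gamma.T @ Gamma)          # ~ cond(Gamma)^2
```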
(Unweighted) Least Squares

• Suppose the dataset {x_i} is constant in value over repeated, non-constant measurements of the dataset {y_i}.
  – E.g., consider some process (e.g., temperature) where the measurements {y_i} are taken at the same locations {x_i} every day.
• Then the design matrix $\Gamma(x)$ is constant in value!
  – In advance (off-line), compute:
    $$ A = \begin{bmatrix} 1 & \cdots & \overline{x^k} \\ \vdots & \ddots & \vdots \\ \overline{x^k} & \cdots & \overline{x^{2k}} \end{bmatrix} $$
  – Every day (on-line), re-compute only:
    $$ \hat\theta = A^{-1} \begin{bmatrix} \overline{y} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Hence, least squares can sometimes be implemented very efficiently.
• There are also efficient recursive updates, e.g. the Kalman filter.
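The off-line/on-line split can be sketched as follows: when the $x_i$ never change, factor the normal-equations matrix once (here with a Cholesky factorization, since $\Gamma^T\Gamma$ is symmetric positive definite), then each day only two cheap triangular solves are needed. The sensor-location data is mine:

```python
import numpy as np

# Off-line: fixed design matrix and its Cholesky factor, computed once.
x = np.linspace(0.0, 10.0, 100)                        # fixed measurement locations
Gamma = np.vander(x, 3, increasing=True)               # quadratic model
L = np.linalg.cholesky(Gamma.T @ Gamma)                # Gamma^T Gamma = L L^T

def daily_fit(y):
    # On-line: two triangular solves instead of a fresh matrix inversion.
    b = Gamma.T @ y
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

y_today = 1.0 + 0.5 * x + 0.02 * x**2                  # today's (noiseless) readings
theta = daily_fit(y_today)
```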
Geometric Solution & Interpretation

• Alternatively, we can derive the least squares solution using geometric (Hilbert space) considerations.
• Goal: minimize the size (norm) of the model prediction error (aka residual error), $e(\theta) = y - \Gamma(x)\theta$:
  $$ L(\theta) = \| e(\theta) \|^2 = \| y - \Gamma(x)\,\theta \|^2 $$
• Note that, given a known design matrix $\Gamma(x)$, the vector
  $$ \Gamma(x)\,\theta = \begin{bmatrix} | & & | \\ \gamma_1 & \cdots & \gamma_k \\ | & & | \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_k \end{bmatrix} = \sum_{i=1}^{k} \theta_i \gamma_i $$
  is a linear combination of the column vectors $\gamma_i$.
• I.e., $\Gamma(x)\,\theta$ is in the range space (column space) of $\Gamma(x)$.
(Hilbert Space) Geometric Interpretation

• The vector $\Gamma\theta$ lives in the range (column space) of $\Gamma = \Gamma(x) = [\gamma_1, \ldots, \gamma_k]$.
• Assume that $y$ is as shown.
• Equivalent statements are:
  – $\Gamma\hat\theta$ is the value of $\Gamma\theta$ closest to $y$ in the range of $\Gamma$.
  – $\Gamma\hat\theta$ is the orthogonal projection of $y$ onto the range of $\Gamma$.
  – The residual $e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$.

[Figure: $y$, its orthogonal projection $\Gamma\hat\theta$ onto the span of $\gamma_1, \ldots, \gamma_k$, and the residual $y - \Gamma\hat\theta$.]
Geometric Interpretation of LS

• $e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$ iff
  $$ \gamma_1^T \big( y - \Gamma\hat\theta \big) = 0, \quad \ldots, \quad \gamma_k^T \big( y - \Gamma\hat\theta \big) = 0 $$
• Thus $e = y - \Gamma\hat\theta$ is in the nullspace of $\Gamma^T$:
  $$ \Gamma^T \big( y - \Gamma\hat\theta \big) = \begin{bmatrix} \gamma_1^T ( y - \Gamma\hat\theta ) \\ \vdots \\ \gamma_k^T ( y - \Gamma\hat\theta ) \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} $$

[Figure: $y$, the projection $\Gamma\hat\theta$, and the residual $y - \Gamma\hat\theta$ orthogonal to the span of $\gamma_1, \ldots, \gamma_k$.]
Geometric Interpretation of LS

• Note: Nullspace Condition ⟺ Normal Equation:
  $$ \Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T \Gamma\,\hat\theta = \Gamma^T y $$
• Thus, if $\Gamma$ is one-to-one (has full column rank), we again get the pseudoinverse solution:
  $$ \hat\theta = \Gamma(x)^\dagger y = \big( \Gamma(x)^T \Gamma(x) \big)^{-1} \Gamma(x)^T y $$
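The orthogonality condition is easy to check numerically: the LS residual should be orthogonal to every column of the design matrix. The random toy data is mine:

```python
import numpy as np

# Numerical check of the normal equations: Gamma^T (y - Gamma theta_hat) = 0,
# i.e. the residual lies in the nullspace of Gamma^T.
rng = np.random.default_rng(2)
Gamma = rng.standard_normal((30, 4))                   # generic full-rank design
y = rng.standard_normal(30)

theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
residual = y - Gamma @ theta_hat
```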
Probabilistic Interpretation

• We have seen that estimating a parameter by minimizing the least squares loss function is a special case of MLE.
• This interpretation holds for many loss functions.
  – First, note that
    $$ \hat\theta = \arg\min_\theta L\big( y, f(x;\theta) \big) = \arg\max_\theta e^{-L( y, f(x;\theta) )} $$
  – Now note that, because
    $$ e^{-L( y, f(x;\theta) )} \ge 0, \quad \forall\, y, x, $$
    we can make this exponential function into a $P_{Y|X}$ pdf by an appropriate normalization.
Prob. Interpretation of Loss Minimization

• I.e., by defining
  $$ P_{Y|X}(y \mid x; \theta) = \frac{1}{Z(x;\theta)}\, e^{-L( y, f(x;\theta) )}, \qquad Z(x;\theta) = \int e^{-L( y, f(x;\theta) )}\, dy $$
• If the normalization constant $Z(x;\theta)$ does not depend on $\theta$, i.e. $Z(x;\theta) = Z(x)$, then
  $$ \hat\theta = \arg\max_\theta e^{-L( y, f(x;\theta) )} = \arg\max_\theta P_{Y|X}(y \mid x; \theta), $$
  which makes the problem a special case of MLE.
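The equivalence can be demonstrated numerically: since the Gaussian normalizer does not depend on $\theta$, the minimizer of the sum-of-squares loss and the maximizer of the Gaussian log-likelihood coincide over any grid of candidate parameters. The data, grid, and choice of $\sigma = 1$ are mine:

```python
import numpy as np

# Squared loss <-> Gaussian MLE, checked over a grid of candidate slopes.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 * x + 0.1 * rng.standard_normal(40)            # true slope = 2

thetas = np.linspace(0.0, 4.0, 401)
loss = np.array([np.sum((y - t * x) ** 2) for t in thetas])
# Gaussian log-likelihood with sigma = 1 (theta-independent normalizer):
loglik = np.array([np.sum(-0.5 * (y - t * x) ** 2 - 0.5 * np.log(2 * np.pi))
                   for t in thetas])
```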
Prob. Interpretation of Loss Minimization

• Note that for loss functions of the form
  $$ L(\theta) = g\big[\, y - f(x;\theta) \,\big] $$
  the model $f(x;\theta)$ only changes the mean of $Y$.
  – A shift in mean does not change the shape of a pdf, and therefore cannot change the value of the normalization constant.
  – Hence, for loss functions of the type above, it is always true that
    $$ \hat\theta = \arg\max_\theta e^{-L( y, f(x;\theta) )} = \arg\max_\theta P_{Y|X}(y \mid x; \theta) $$
Regression

• This leads to the interpretation that we saw last time, which is the usual definition of a regression problem:
  – two random variables X and Y
  – a dataset of examples D = {(x_1, y_1), …, (x_n, y_n)}
  – a parametric model of the form
    $$ y = f(x;\theta) + \varepsilon $$
    where $\theta$ is a parameter vector and $\varepsilon$ is a random variable that accounts for noise.
• The pdf of the noise determines the loss in the other formulation.
Regression

• Error pdf:
  – Gaussian:
    $$ P(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\varepsilon^2}{2\sigma^2}} $$
  – Laplacian:
    $$ P(\varepsilon) = \frac{1}{2\sigma}\, e^{-\frac{|\varepsilon|}{\sigma}} $$
  – Rayleigh:
    $$ P(\varepsilon) = \frac{\varepsilon}{\sigma^2}\, e^{-\frac{\varepsilon^2}{2\sigma^2}} $$
• Equivalent loss function:
  – L2 distance: $L(x, y) = (y - x)^2$
  – L1 distance: $L(x, y) = |y - x|$
  – Rayleigh distance: $L(x, y) = \frac{(y - x)^2}{2\sigma^2} - \log(y - x)$
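A sketch of how the noise pdf picks the loss: for a constant model $f(x) = c$, the L2 loss (Gaussian errors) is minimized by the sample mean, while the L1 loss (Laplacian errors) is minimized by the sample median. The data, including the outlier used to separate the two, is mine:

```python
import numpy as np

# Grid search over the constant c for both losses.
y = np.array([1.0, 1.2, 0.9, 1.1, 10.0])               # one gross outlier

cs = np.linspace(0.0, 10.0, 10001)
l2 = np.array([np.sum((y - c) ** 2) for c in cs])
l1 = np.array([np.sum(np.abs(y - c)) for c in cs])

c_l2 = cs[np.argmin(l2)]                               # ~ mean(y)
c_l1 = cs[np.argmin(l1)]                               # ~ median(y)
```

The L1/Laplacian fit shrugs off the outlier, which is exactly the robustness argument for swapping the error pdf.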
Regression

• We know how to solve the problem with these losses.
  – Why would we want the added complexity incurred by introducing error probability models?
• The main reason is that this allows a data-driven definition of the loss.
• One good way to see this is the problem of weighted least squares.
  – Suppose that you know that not all measurements (x_i, y_i) have the same importance.
  – This can be encoded in the loss function.
  – Remember that the unweighted loss is
    $$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 = \sum_i \big( y_i - \phi(x_i)^T \theta \big)^2 $$
Regression

• To weigh different points differently we could use
  $$ L = \sum_i w_i \big( y_i - \phi(x_i)^T \theta \big)^2, \qquad w_i \ge 0, $$
  or even the more generic form
  $$ L = \big( y - \Gamma(x)\theta \big)^T W \big( y - \Gamma(x)\theta \big), \qquad W = W^T \ge 0. $$
• In the latter case the solution is (homework)
  $$ \theta^* = \big( \Gamma(x)^T W\, \Gamma(x) \big)^{-1} \Gamma(x)^T W\, y $$
• The question is "how do I know these weights?"
• Without a probabilistic model one has little guidance on this.
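A minimal sketch of the weighted closed form $\theta^* = (\Gamma^T W \Gamma)^{-1}\Gamma^T W y$, checked against the unweighted solution when $W = I$. The toy data and the choice of weights are mine:

```python
import numpy as np

# Weighted least squares closed form.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 25)
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(25)
G = np.column_stack([np.ones_like(x), x])

def wls(G, y, W):
    return np.linalg.solve(G.T @ W @ G, G.T @ W @ y)

theta_unweighted = wls(G, y, np.eye(len(y)))           # W = I: ordinary LS

# Heavily up-weighting the last point pulls the fit toward it.
W = np.diag(np.r_[np.ones(24), 100.0])
theta_w = wls(G, y, W)
```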
Regression

• The probabilistic equivalent
  $$ \theta^* = \arg\max_\theta e^{-L( y, f(x;\theta) )} = \arg\max_\theta \exp\!\Big\{ -\tfrac{1}{2}\big( y - \Gamma(x)\theta \big)^T W \big( y - \Gamma(x)\theta \big) \Big\} $$
  is the MLE for a Gaussian pdf of known covariance $\Sigma = W^{-1}$.
• In the case where the covariance is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, we have
  $$ L(\theta) = \sum_i \frac{1}{\sigma_i^2} \big( y_i - \phi(x_i)^T \theta \big)^2 $$
• In this case, each point is weighted by the inverse variance.
Regression

• This makes sense:
  – Under the probabilistic formulation, $\sigma_i^2$ is the variance of the error associated with the i-th observation.
  – This means that it is a measure of the uncertainty of the observation.
• When
  $$ y_i = f(x_i;\theta) + \varepsilon_i, \qquad W = \Sigma^{-1}, $$
  we are weighting each point by the inverse of its uncertainty (variance).
• We can also check the goodness of this weighting matrix by plotting the histogram of the errors.
• If W is chosen correctly, the errors should be Gaussian.
Model Validation

• In fact, by analyzing the errors of the fit we can say a lot.
• This is called model validation:
  – Leave some data on the side, and run it through the predictor.
  – Analyze the errors to see if there is deviation from the assumed model (Gaussianity for least squares).
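The hold-out procedure above can be sketched as follows: fit on part of the data, then inspect the residuals on the held-out part. Here a correct linear model on Gaussian noise passes a simple zero-mean/scale sanity check; the data and thresholds are mine, and in practice one would plot the residual histogram or run a formal normality test:

```python
import numpy as np

# Hold-out validation: fit on even-indexed points, validate on odd-indexed.
rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(200)     # noise sigma = 0.1

idx = np.arange(200)
train, test = idx % 2 == 0, idx % 2 == 1
G = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(G[train], y[train], rcond=None)

resid = y[test] - G[test] @ theta                      # held-out residuals
```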
Model Validation & Improvement

• Many times this will give you hints about alternative models that may fit the data better.
• Typical problems for least squares (and their solutions):
Model Validation & Improvement

• Example 1: error histogram.
  – This does not look Gaussian.
• Look at the scatter plot of the error $y - f(x, \theta^*)$:
  – Increasing trend; maybe we should try
    $$ \log(y_i) = f(x_i;\theta) + \varepsilon_i $$
Model Validation & Improvement

• Example 1: error histogram for the new model.
  – This looks Gaussian; this model is probably better.
  – There are statistical tests that you can use to check this objectively.
  – These are covered in statistics classes.
Model Validation & Improvement

• Example 2: error histogram.
  – This also does not look Gaussian.
• Checking the scatter plot now seems to suggest trying
  $$ \sqrt{y_i} = f(x_i;\theta) + \varepsilon_i $$
Model Validation & Improvement

• Example 2: error histogram for the new model.
• Once again, this seems to work: the residual behavior looks Gaussian.
  – However, it is NOT always the case that such changes will work.
  – If not, maybe the problem is the assumption of Gaussianity itself:
    move away from least squares, and try MLE with other error pdfs.
END