
Support Vector Machines

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Copyright © 2001, 2003, Andrew W. Moore
Nov 23rd, 2001

Linear Classifiers (Slides 2-5)

x → f(x, α) → y_est        (figure legend: one marker denotes +1, the other denotes -1)

f(x, w, b) = sign(w · x - b)

How would you classify this data?

(Slides 2-5 repeat this question over the same two-class scatter plot, each with a different candidate separating line drawn in.)

Linear Classifiers (Slide 6)

f(x, w, b) = sign(w · x - b)

Any of these would be fine...

...but which is best?


Classifier Margin (Slide 7)

f(x, w, b) = sign(w · x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin (Slide 8)

f(x, w, b) = sign(w · x - b)        ← the Linear SVM

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).


Maximum Margin (Slide 9)

f(x, w, b) = sign(w · x - b)        ← the Linear SVM

Support Vectors are those datapoints that the margin pushes up against.

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).

Why Maximum Margin? (Slide 10)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us the least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.


Specifying a line and margin (Slide 11)

(Figure: the Plus-Plane, the Classifier Boundary and the Minus-Plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

• How do we represent this mathematically?
• ...in m input dimensions?

Specifying a line and margin (Slide 12)

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

(The three parallel lines in the figure are w · x + b = +1, w · x + b = 0 and w · x + b = -1.)

Classify as:
  +1                  if w · x + b ≥ 1
  -1                  if w · x + b ≤ -1
  Universe explodes   if -1 < w · x + b < 1
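To make the rule and the three zones concrete, here is a minimal sketch. It is not part of the slides; it assumes NumPy, and the w and b values are made up.

```python
import numpy as np

def classify(x, w, b):
    """Linear SVM decision rule from the slides: f(x,w,b) = sign(w.x - b)."""
    return int(np.sign(np.dot(w, x) - b))

def zone(x, w, b):
    """Which of the three regions of the slide a point falls in.
    Note the slides switch between 'w.x - b' and 'w.x + b'; this sketch
    uses the plus-plane / minus-plane convention w.x + b = +/-1."""
    s = np.dot(w, x) + b
    if s >= 1:
        return "Predict Class = +1 zone"
    elif s <= -1:
        return "Predict Class = -1 zone"
    return "inside the margin (the 'universe explodes' region)"

w = np.array([2.0, 1.0])   # example weight vector (made up)
b = -1.0                   # example offset (made up)
print(zone(np.array([1.5, 0.5]), w, b))
```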


Computing the margin width (Slides 13-14)

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus Plane. Why?
(Let u and v be two vectors on the Plus Plane. What is w · (u - v)?)

And so of course the vector w is also perpendicular to the Minus Plane.


Computing the margin width (Slides 15-16)

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x⁻ be any point on the minus plane (any location in R^m: not necessarily a datapoint)
• Let x⁺ be the closest plus-plane-point to x⁻
• Claim: x⁺ = x⁻ + λw for some value of λ. Why?


Computing the margin width (Slide 17)

• Claim: x⁺ = x⁻ + λw for some value of λ. Why?

The line from x⁻ to x⁺ is perpendicular to the planes. So to get from x⁻ to x⁺, travel some distance in direction w.

Computing the margin width (Slides 18-19)

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M

It's now easy to get M in terms of w and b:

  w · (x⁻ + λw) + b = 1
  ⇒ w · x⁻ + b + λ w·w = 1
  ⇒ -1 + λ w·w = 1
  ⇒ λ = 2 / (w·w)

Computing the margin width (Slide 20)

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M
• λ = 2 / (w·w)

So:

  M = |x⁺ - x⁻| = |λw| = λ|w| = λ√(w·w) = 2√(w·w) / (w·w) = 2 / √(w·w)
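A quick numeric sanity check of this result, a sketch assuming NumPy: it walks from the minus-plane to the plus-plane along w, exactly as in the derivation, and compares the measured distance with 2/√(w·w). The values of w and b are made up.

```python
import numpy as np

w = np.array([3.0, 4.0])            # any nonzero weight vector (made up)
b = 0.7                             # any offset (made up)

# A point x_minus on the minus-plane (w.x + b = -1), then
# x_plus = x_minus + lambda*w with lambda = 2/(w.w) as on the slide.
x_minus = -(1 + b) * w / np.dot(w, w)      # satisfies w.x_minus + b = -1
lam = 2.0 / np.dot(w, w)
x_plus = x_minus + lam * w                 # satisfies w.x_plus + b = +1

M_direct = np.linalg.norm(x_plus - x_minus)
M_formula = 2.0 / np.sqrt(np.dot(w, w))
print(M_direct, M_formula)                 # both 0.4 for this w
```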


Learning the Maximum Margin Classifier (Slide 21)

M = Margin Width = 2 / √(w·w)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

Learning via Quadratic Programming (Slide 22)

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.


Quadratic Programming (Slides 23-24)

Find

  arg max over u of   c + dᵀu + (uᵀRu)/2        (the quadratic criterion)

Subject to n additional linear inequality constraints:

  a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
  a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
    :
  a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

And subject to e additional linear equality constraints:

  a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
  a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
  a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly... you probably don't want to write one yourself.)
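For concreteness, here is a hedged sketch of handing a problem in exactly this form to an off-the-shelf QP package. It assumes the CVXOPT library, which is not mentioned on the slides; cvxopt.solvers.qp minimizes (1/2)uᵀPu + qᵀu subject to Gu ≤ h and Au = b, so the slide's maximization is passed in with the signs flipped.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_slide_qp(R, d, A_ineq, b_ineq, A_eq, b_eq):
    """Maximize c + d'u + u'Ru/2  subject to  A_ineq u <= b_ineq and A_eq u = b_eq.
    c only shifts the optimal value, so it is ignored. Assumes -R is positive
    semidefinite (so the minimization below is convex) and that at least one
    inequality and one equality constraint are supplied."""
    P = matrix(-np.asarray(R, dtype=float))        # flip sign: maximize -> minimize
    q = matrix(-np.asarray(d, dtype=float))
    G = matrix(np.asarray(A_ineq, dtype=float))
    h = matrix(np.asarray(b_ineq, dtype=float))
    A = matrix(np.asarray(A_eq, dtype=float))
    b = matrix(np.asarray(b_eq, dtype=float))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()              # the optimizing u
```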


Learning the Maximum Margin Classifier (Slide 25)

M = 2 / √(w·w)

Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?

Learning the Maximum Margin Classifier (Slide 26)

M = 2 / √(w·w). Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
  Minimize w·w

How many constraints will we have?
  R

What should they be?
  w · x_k + b ≥ 1    if y_k = +1
  w · x_k + b ≤ -1   if y_k = -1
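To see this criterion and these R constraints in action without writing a QP by hand, here is a small sketch using scikit-learn, which is an assumption (the library is not part of the slides). A linear-kernel SVC with a very large C approximates the hard-margin maximum-margin classifier and exposes w, b and the support vectors directly; the data below are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)   # huge C approximates the hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # the learned weight vector
b = clf.intercept_[0]               # note: sklearn's convention is f(x) = sign(w.x + b)
print("w =", w, "b =", b)
print("margin width 2/sqrt(w.w) =", 2 / np.sqrt(w @ w))
print("support vectors:\n", clf.support_vectors_)
```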


Uh-oh! (Slide 27)

(Figure: a +1/-1 dataset that is not linearly separable.)

This is going to be a problem!
What should we do?

Uh-oh! (Slide 28)

This is going to be a problem! What should we do?

Idea 1:
  Find minimum w·w, while minimizing number of training set errors.

  Problemette: Two things to minimize makes for an ill-defined optimization.


Uh-oh! (Slide 29)

Idea 1.1:
  Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?

Uh-oh! (Slide 30)

Idea 1.1:  Minimize  w·w + C (#train errors)

Can't be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)

So... any other ideas?


Uh-oh! (Slide 31)

Idea 2.0:
  Minimize  w·w + C (distance of error points to their correct place)

Learning Maximum Margin with Noise (Slide 32)

M = 2 / √(w·w)

Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?


Learning Maximum Margin with Noise (Slides 33-35)

(Figure: the same picture with slack distances ε_2, ε_7, ε_11 marking how far three offending points are from their correct zones.)

Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
  Minimize   (1/2) w·w + C Σ_{k=1..R} ε_k

How many constraints will we have?
  R

What should they be?
  w · x_k + b ≥ 1 - ε_k    if y_k = +1
  w · x_k + b ≤ -1 + ε_k   if y_k = -1
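The ε_k above are just how far each point sits on the wrong side of its plane. A sketch, assuming NumPy, that computes them for a given guess of w and b, together with the slide's criterion:

```python
import numpy as np

def slacks(X, y, w, b):
    """epsilon_k = max(0, 1 - y_k (w.x_k + b)): zero for points in their
    correct zone, positive for points inside the margin or misclassified."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

def soft_margin_objective(X, y, w, b, C):
    """The slide's criterion: (1/2) w.w + C * sum_k epsilon_k."""
    return 0.5 * np.dot(w, w) + C * slacks(X, y, w, b).sum()
```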


Learning Maximum Margin with Noise (Slide 36)

Minimize   (1/2) w·w + C Σ_{k=1..R} ε_k

subject to
  w · x_k + b ≥ 1 - ε_k    if y_k = +1
  w · x_k + b ≤ -1 + ε_k   if y_k = -1
  ε_k ≥ 0                  for all k

There's a bug in this QP. Can you spot it?


An Equivalent QP (Slide 37)

Maximize

  Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where Q_kl = y_k y_l (x_k · x_l)

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Subject to these constraints:

  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:

  w = Σ_{k=1..R} α_k y_k x_k
  b = y_K (1 - ε_K) - x_K · w      where K = arg max_k α_k

Then classify with:

  f(x, w, b) = sign(w · x - b)

An Equivalent QP (Slide 38)

(The same QP, definitions and classification rule as Slide 37, with two notes added:)

• Datapoints with α_k > 0 will be the support vectors...
• ...so the sum w = Σ_k α_k y_k x_k only needs to be over the support vectors.
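A hedged sketch of handing this equivalent (dual) QP to a solver. It again assumes CVXOPT, which is not part of the slides; Q_kl = y_k y_l (x_k · x_l) becomes the quadratic term, and the box constraints 0 ≤ α_k ≤ C plus Σ α_k y_k = 0 are passed as the linear constraints. The recovery of b averages over on-margin support vectors and uses the f(x) = sign(w·x + b) convention, a common practical variant of the slide's single-K formula.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    """Solve: max  sum(alpha) - 0.5 * alpha' Q alpha
       s.t.  0 <= alpha_k <= C  and  sum_k alpha_k y_k = 0,
       where Q_kl = y_k y_l (x_k . x_l)."""
    R = len(y)
    Q = np.outer(y, y).astype(float) * (X @ X.T)
    P = matrix(Q)                                   # cvxopt minimizes, so ...
    q = matrix(-np.ones(R))                         # ... negate the linear term
    G = matrix(np.vstack([-np.eye(R), np.eye(R)]))  # -alpha <= 0  and  alpha <= C
    h = matrix(np.hstack([np.zeros(R), C * np.ones(R)]))
    A = matrix(y.reshape(1, -1).astype(float))      # sum_k alpha_k y_k = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ X                             # w = sum_k alpha_k y_k x_k
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)        # on-margin support vectors
    b_out = np.mean(y[sv] - X[sv] @ w)              # averaged, f(x) = sign(w.x + b) convention
    return alpha, w, b_out
```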


An Equivalent QP (Slide 39)

Why did I tell you about this equivalent QP?

• It's a formulation that QP packages can optimize more quickly.
• Because of further jaw-dropping developments you're about to learn.

(As before: datapoints with α_k > 0 will be the support vectors, so the sum defining w only needs to be over the support vectors.)

Suppose we're in 1 dimension (Slide 40)

What would SVMs do with this data?

(Figure: a handful of +1 and -1 points on the number line, with x = 0 marked.)


Suppose we're in 1 dimension (Slide 41)

Not a big surprise: the maximum margin boundary sits between the two clumps, with the positive "plane" and the negative "plane" on either side of it.

Harder 1-dimensional dataset (Slide 42)

That's wiped the smirk off SVM's face. What can be done about this?

(Figure: positive and negative points interleaved along the line, so no single threshold separates them.)


Harder 1-dimensional dataset (Slides 43-44)

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  z_k = ( x_k , x_k² )

(Slide 44 shows the same dataset after this mapping: in the (x, x²) plane the two classes become linearly separable.)
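A tiny sketch of what this lift does to the "harder" 1-d dataset (NumPy assumed, data made up): points that interleave on the line become linearly separable once each x_k is mapped to z_k = (x_k, x_k²).

```python
import numpy as np

x = np.array([-3.0, -2.0, 2.0, 3.0,   # say these are the negative points
              -1.0, -0.5, 0.5, 1.0])  # and these the positives (interleaved on the line)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

Z = np.column_stack([x, x**2])        # z_k = (x_k, x_k^2)

# In the lifted space a horizontal line x^2 = 2 separates the two classes:
print(np.all((Z[:, 1] < 2) == (y == 1)))   # True
```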


Common SVM basis functions (Slide 45)

z_k = ( polynomial terms of x_k of degree 1 to q )

z_k = ( radial basis functions of x_k ):      z_k[j] = φ_j(x_k) = KernelFn( |x_k - c_j| / KW )

z_k = ( sigmoid functions of x_k )

This is sensible. Is that the end of the story? No... there's one more trick!

Quadratic Basis Functions (Slide 46)

  Φ(x) = ( 1,                                           ← constant term
           √2 x_1, √2 x_2, ..., √2 x_m,                 ← linear terms
           x_1², x_2², ..., x_m²,                       ← pure quadratic terms
           √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_1 x_m,
           √2 x_2 x_3, ..., √2 x_(m-1) x_m )            ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2.

You may be wondering what those √2's are doing.
• You should be happy that they do no harm.
• You'll find out why they're there soon.
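A sketch of this quadratic basis function (NumPy assumed), including the √2 factors; it also checks the (m+2)(m+1)/2 term count for a small m.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Phi(x) from the slide: constant term, sqrt(2)-scaled linear terms,
    pure quadratic terms, and sqrt(2)-scaled cross-terms."""
    x = np.asarray(x, dtype=float)
    parts = [np.array([1.0]),                     # constant term
             np.sqrt(2) * x,                      # linear terms
             x ** 2,                              # pure quadratic terms
             np.array([np.sqrt(2) * x[i] * x[j]   # quadratic cross-terms
                       for i, j in combinations(range(len(x)), 2)])]
    return np.concatenate(parts)

m = 5
z = phi_quadratic(np.arange(1, m + 1))
print(len(z), (m + 2) * (m + 1) // 2)   # both 21
```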


QP with basis functions (Slides 47-48)

Maximize

  Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where Q_kl = y_k y_l ( Φ(x_k) · Φ(x_l) )

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Subject to these constraints:

  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:

  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
  b = y_K (1 - ε_K) - x_K · w      where K = arg max_k α_k

Then classify with:

  f(x, w, b) = sign(w · φ(x) - b)

We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R²m²/4. Yeeks!

...or does it?


Quadratic Dot Products (Slides 49-50)

Write out Φ(a) · Φ(b) term by term (constant term, linear terms, pure quadratic terms, cross-terms):

  Φ(a) · Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + Σ_{i=1..m} Σ_{j=i+1..m} 2 a_i a_j b_i b_j

Just out of casual, innocent interest, let's look at another function of a and b:

  (a·b + 1)² = (a·b)² + 2 a·b + 1
             = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
             = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
             = Σ_{i=1..m} (a_i b_i)² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1


Quadratic Dot Products (Slide 51)

  Φ(a) · Φ(b) = 1 + 2 Σ_i a_i b_i + Σ_i a_i² b_i² + Σ_i Σ_{j>i} 2 a_i a_j b_i b_j

  (a·b + 1)²  = Σ_i (a_i b_i)² + 2 Σ_i Σ_{j>i} a_i b_i a_j b_j + 2 Σ_i a_i b_i + 1

They're the same! And this is only O(m) to compute!
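A quick numeric confirmation of the "they're the same" claim, a sketch assuming NumPy: it compares the four-term expansion of Φ(a)·Φ(b) from the previous slide against (a·b + 1)², the O(m) kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
a, b = rng.normal(size=m), rng.normal(size=m)

# The term-by-term expansion of Phi(a).Phi(b) from the slide:
lhs = (1
       + 2 * np.sum(a * b)                                   # linear terms
       + np.sum((a * b) ** 2)                                # pure quadratic terms
       + 2 * sum(a[i] * a[j] * b[i] * b[j]                   # cross-terms, j > i
                 for i in range(m) for j in range(i + 1, m)))

rhs = (np.dot(a, b) + 1) ** 2                                # the O(m) kernel
print(np.isclose(lhs, rhs))   # True
```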

QP with Quadratic basis functions (Slide 52)

Maximize   Σ_k α_k - (1/2) Σ_k Σ_l α_k α_l Q_kl      where Q_kl = y_k y_l ( Φ(x_k) · Φ(x_l) )

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Subject to:  0 ≤ α_k ≤ C for all k,  and  Σ_k α_k y_k = 0. Then define w, b and classify with f(x, w, b) = sign(w · φ(x) - b) exactly as before.

We must still do R²/2 dot products to get this matrix ready, but each dot product now only requires m additions and multiplications.
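A sketch of building the R×R matrix Q_kl "sneakily" with the kernel (NumPy assumed): one Gram matrix X Xᵀ and an elementwise power, so each entry costs O(m) rather than O(m²).

```python
import numpy as np

def Q_matrix(X, y, degree=2):
    """Q_kl = y_k y_l * K(x_k, x_l) with the polynomial kernel K(a,b) = (a.b + 1)^degree,
    equivalent to y_k y_l * Phi(x_k).Phi(x_l) without ever forming Phi."""
    K = (X @ X.T + 1.0) ** degree      # R x R kernel (Gram) matrix
    return np.outer(y, y) * K
```

Raising the degree in this sketch only changes the exponent, which is the point of the cost table on the next slide.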


Higher Order Polynomials (Slide 53)

Polynomial   φ(x)                              Cost to build Q_kl       Cost if 100     φ(a)·φ(b)    Cost to build Q_kl    Cost if 100
                                               matrix traditionally     inputs                       matrix sneakily       inputs
Quadratic    all m²/2 terms up to degree 2     m²R²/4                   2,500 R²        (a·b+1)²     mR²/2                 50 R²
Cubic        all m³/6 terms up to degree 3     m³R²/12                  83,000 R²       (a·b+1)³     mR²/2                 50 R²
Quartic      all m⁴/24 terms up to degree 4    m⁴R²/48                  1,960,000 R²    (a·b+1)⁴     mR²/2                 50 R²

QP with Quintic basis functions (Slide 54)

Maximize   Σ_k α_k - (1/2) Σ_k Σ_l α_k α_l Q_kl      where Q_kl = y_k y_l ( Φ(x_k) · Φ(x_l) )

subject to 0 ≤ α_k ≤ C for all k and Σ_k α_k y_k = 0; then define w, b and classify with f(x, w, b) = sign(w · φ(x) - b) as before.

We must do R²/2 dot products to get this matrix ready, but in 100-d each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?


QP with Quintic basis functions (Slides 55-56)

(Same QP as the previous slide.) In 100-d each dot product needs 103 operations instead of 75 million, but there are still worrying things lurking away. What are they?

• The fear of overfitting with this enormous number of terms.
  → The use of Maximum Margin magically makes this not a problem.

• The evaluation phase (doing a set of predictions on a test set) will be very expensive. Why? Because each w · φ(x) (see the next slide) needs 75 million operations. What can be done?


QP with Quintic basis functions (Slide 57)

The answer: never compute w · Φ(x) explicitly. Instead,

  w · Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) · Φ(x)
           = Σ_{k s.t. α_k > 0} α_k y_k ( x_k · x + 1 )⁵

Only S·m operations (S = #support vectors).
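The trick on this slide in code form, a sketch assuming NumPy: classification never touches Φ or w explicitly, only the S support vectors and the O(m) kernel.

```python
import numpy as np

def decision(x, sv_X, sv_y, sv_alpha, b, degree=5):
    """w.Phi(x) - b computed as sum over support vectors of
    alpha_k y_k (x_k.x + 1)^degree, minus b.
    Only S*m work per prediction (S = number of support vectors)."""
    k = (sv_X @ x + 1.0) ** degree          # kernel between x and each support vector
    return np.sign(np.sum(sv_alpha * sv_y * k) - b)
```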

QP with Quintic basis functions (Slide 58)

  w · Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k ( x_k · x + 1 )⁵        Only S·m operations (S = #support vectors)

When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long."


QP with Quintic basis functions (Slide 59)

Andrew's opinion of why SVMs don't overfit as much as you'd think:

No matter what the basis function, there are really only up to R parameters: α_1, α_2, ..., α_R, and usually most are set to zero by the Maximum Margin.

Asking for small w·w is like "weight decay" in Neural Nets, like Ridge Regression parameters in Linear Regression, and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.

SVM Kernel Functions (Slide 60)

• K(a, b) = (a · b + 1)^d is an example of an SVM Kernel Function.
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function.
• Radial-Basis-style Kernel Function:
    K(a, b) = exp( -(a - b)² / (2σ²) )
• Neural-net-style Kernel Function:
    K(a, b) = tanh( κ a·b - δ )

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM* (*see last lecture).
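Sketches of the three kernel families on this slide (NumPy assumed); σ, κ and δ are the "magic parameters" the slide mentions.

```python
import numpy as np

def poly_kernel(a, b, d=2):
    """(a.b + 1)^d -- the polynomial kernel used throughout these slides."""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """exp(-|a-b|^2 / (2 sigma^2)) -- the radial-basis-style kernel."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """tanh(kappa a.b - delta) -- only a valid (Mercer) kernel for some settings."""
    return np.tanh(kappa * np.dot(a, b) - delta)
```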


VC-dimension of an SVM (Slide 61)

• Very very very loosely speaking there is some theory which, under some different assumptions, puts an upper bound on the VC dimension as

    ⌈ Diameter / Margin ⌉

  where
  • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and
  • Margin is the smallest margin we'll let the SVM use.
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation.

SVM Performance (Slide 62)

• Anecdotally they work very very well indeed.
• Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
• Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
• Despite this, some practitioners (including your lecturer) are a little skeptical.


Doing multi-class classification (Slide 63)

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
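A sketch of this one-vs-rest recipe. It assumes scikit-learn purely for the per-class binary SVMs (not part of the slides); the deciding rule is exactly the slide's "furthest into the positive region", i.e. the largest decision value.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svm_kwargs):
    """Learn one binary SVM per class: 'Output == c' vs 'Output != c'."""
    return {c: SVC(**svm_kwargs).fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, x):
    """Pick the class whose SVM puts x furthest into its positive region."""
    scores = {c: m.decision_function([x])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

# Example use (data and settings made up):
# models = train_one_vs_rest(X, y, classes=np.unique(y), kernel='linear', C=1.0)
# label = predict_one_vs_rest(models, X[0])
```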

References (Slide 64)

• An excellent tutorial on VC-dimension and Support Vector Machines:
  C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.


What You Should Know (Slide 65)

• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
