
Support Vector Machines

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Copyright © 2001, 2003, Andrew W. Moore
Nov 23rd, 2001

Linear Classifiers (Slides 2-5)

x → f(x, α) → y_est        (figure legend: one marker denotes +1, the other denotes -1)

f(x, w, b) = sign(w · x - b)

How would you classify this data?

(Slides 2-5 repeat this question over the same two-class scatter plot, each with a different candidate separating line drawn in.)

Linear Classifiers (Slide 6)

f(x, w, b) = sign(w · x - b)

Any of these would be fine...

...but which is best?


Classifier Margin (Slide 7)

f(x, w, b) = sign(w · x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin (Slide 8)

f(x, w, b) = sign(w · x - b)        ← the Linear SVM

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).


Maximum Margin (Slide 9)

f(x, w, b) = sign(w · x - b)        ← the Linear SVM

Support Vectors are those datapoints that the margin pushes up against.

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).

Why Maximum Margin? (Slide 10)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us the least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.


Specifying a line and margin (Slide 11)

(Figure: the Plus-Plane, the Classifier Boundary and the Minus-Plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

• How do we represent this mathematically?
• ...in m input dimensions?

Specifying a line and margin (Slide 12)

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

(The three parallel lines in the figure are w · x + b = +1, w · x + b = 0 and w · x + b = -1.)

Classify as:
  +1                  if w · x + b ≥ 1
  -1                  if w · x + b ≤ -1
  Universe explodes   if -1 < w · x + b < 1
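To make the rule and the three zones concrete, here is a minimal sketch. It is not part of the slides; it assumes NumPy, and the w and b values are made up.

```python
import numpy as np

def classify(x, w, b):
    """Linear SVM decision rule from the slides: f(x,w,b) = sign(w.x - b)."""
    return int(np.sign(np.dot(w, x) - b))

def zone(x, w, b):
    """Which of the three regions of the slide a point falls in.
    Note the slides switch between 'w.x - b' and 'w.x + b'; this sketch
    uses the plus-plane / minus-plane convention w.x + b = +/-1."""
    s = np.dot(w, x) + b
    if s >= 1:
        return "Predict Class = +1 zone"
    elif s <= -1:
        return "Predict Class = -1 zone"
    return "inside the margin (the 'universe explodes' region)"

w = np.array([2.0, 1.0])   # example weight vector (made up)
b = -1.0                   # example offset (made up)
print(zone(np.array([1.5, 0.5]), w, b))
```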


Computing the margin width (Slides 13-14)

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus Plane. Why?
(Let u and v be two vectors on the Plus Plane. What is w · (u - v)?)

And so of course the vector w is also perpendicular to the Minus Plane.


Computing the margin width (Slides 15-16)

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x⁻ be any point on the minus plane (any location in R^m: not necessarily a datapoint)
• Let x⁺ be the closest plus-plane-point to x⁻
• Claim: x⁺ = x⁻ + λw for some value of λ. Why?


Computing the margin width (Slide 17)

• Claim: x⁺ = x⁻ + λw for some value of λ. Why?

The line from x⁻ to x⁺ is perpendicular to the planes. So to get from x⁻ to x⁺, travel some distance in direction w.

Computing the margin width (Slides 18-19)

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M

It's now easy to get M in terms of w and b:

  w · (x⁻ + λw) + b = 1
  ⇒ w · x⁻ + b + λ w·w = 1
  ⇒ -1 + λ w·w = 1
  ⇒ λ = 2 / (w·w)

Computing the margin width (Slide 20)

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M
• λ = 2 / (w·w)

So:

  M = |x⁺ - x⁻| = |λw| = λ|w| = λ√(w·w) = 2√(w·w) / (w·w) = 2 / √(w·w)
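A quick numeric sanity check of this result, a sketch assuming NumPy: it walks from the minus-plane to the plus-plane along w, exactly as in the derivation, and compares the measured distance with 2/√(w·w). The values of w and b are made up.

```python
import numpy as np

w = np.array([3.0, 4.0])            # any nonzero weight vector (made up)
b = 0.7                             # any offset (made up)

# A point x_minus on the minus-plane (w.x + b = -1), then
# x_plus = x_minus + lambda*w with lambda = 2/(w.w) as on the slide.
x_minus = -(1 + b) * w / np.dot(w, w)      # satisfies w.x_minus + b = -1
lam = 2.0 / np.dot(w, w)
x_plus = x_minus + lam * w                 # satisfies w.x_plus + b = +1

M_direct = np.linalg.norm(x_plus - x_minus)
M_formula = 2.0 / np.sqrt(np.dot(w, w))
print(M_direct, M_formula)                 # both 0.4 for this w
```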


Learning the Maximum Margin Classifier (Slide 21)

M = Margin Width = 2 / √(w·w)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

Learning via Quadratic Programming (Slide 22)

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.


Quadratic Programming (Slides 23-24)

Find

  arg max over u of   c + dᵀu + (uᵀRu)/2        (the quadratic criterion)

Subject to n additional linear inequality constraints:

  a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
  a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
    :
  a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

And subject to e additional linear equality constraints:

  a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
  a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
  a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly... you probably don't want to write one yourself.)
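For concreteness, here is a hedged sketch of handing a problem in exactly this form to an off-the-shelf QP package. It assumes the CVXOPT library, which is not mentioned on the slides; cvxopt.solvers.qp minimizes (1/2)uᵀPu + qᵀu subject to Gu ≤ h and Au = b, so the slide's maximization is passed in with the signs flipped.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_slide_qp(R, d, A_ineq, b_ineq, A_eq, b_eq):
    """Maximize c + d'u + u'Ru/2  subject to  A_ineq u <= b_ineq and A_eq u = b_eq.
    c only shifts the optimal value, so it is ignored. Assumes -R is positive
    semidefinite (so the minimization below is convex) and that at least one
    inequality and one equality constraint are supplied."""
    P = matrix(-np.asarray(R, dtype=float))        # flip sign: maximize -> minimize
    q = matrix(-np.asarray(d, dtype=float))
    G = matrix(np.asarray(A_ineq, dtype=float))
    h = matrix(np.asarray(b_ineq, dtype=float))
    A = matrix(np.asarray(A_eq, dtype=float))
    b = matrix(np.asarray(b_eq, dtype=float))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()              # the optimizing u
```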


Learning the Maximum Margin Classifier (Slide 25)

M = 2 / √(w·w)

Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?

Learning the Maximum Margin Classifier (Slide 26)

M = 2 / √(w·w). Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
  Minimize w·w

How many constraints will we have?
  R

What should they be?
  w · x_k + b ≥ 1    if y_k = +1
  w · x_k + b ≤ -1   if y_k = -1
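To see this criterion and these R constraints in action without writing a QP by hand, here is a small sketch using scikit-learn, which is an assumption (the library is not part of the slides). A linear-kernel SVC with a very large C approximates the hard-margin maximum-margin classifier and exposes w, b and the support vectors directly; the data below are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)   # huge C approximates the hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # the learned weight vector
b = clf.intercept_[0]               # note: sklearn's convention is f(x) = sign(w.x + b)
print("w =", w, "b =", b)
print("margin width 2/sqrt(w.w) =", 2 / np.sqrt(w @ w))
print("support vectors:\n", clf.support_vectors_)
```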


Uh-oh! (Slide 27)

(Figure: a +1/-1 dataset that is not linearly separable.)

This is going to be a problem!
What should we do?

Uh-oh! (Slide 28)

This is going to be a problem! What should we do?

Idea 1:
  Find minimum w·w, while minimizing number of training set errors.

  Problemette: Two things to minimize makes for an ill-defined optimization.


Uh-oh! (Slide 29)

Idea 1.1:
  Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?

Uh-oh! (Slide 30)

Idea 1.1:  Minimize  w·w + C (#train errors)

Can't be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)

So... any other ideas?


Uh-oh! (Slide 31)

Idea 2.0:
  Minimize  w·w + C (distance of error points to their correct place)

Learning Maximum Margin with Noise (Slide 32)

M = 2 / √(w·w)

Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?


Learning Maximum Margin with Noise (Slides 33-35)

(Figure: the same picture with slack distances ε_2, ε_7, ε_11 marking how far three offending points are from their correct zones.)

Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/-1.

What should our quadratic optimization criterion be?
  Minimize   (1/2) w·w + C Σ_{k=1..R} ε_k

How many constraints will we have?
  R

What should they be?
  w · x_k + b ≥ 1 - ε_k    if y_k = +1
  w · x_k + b ≤ -1 + ε_k   if y_k = -1
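The ε_k above are just how far each point sits on the wrong side of its plane. A sketch, assuming NumPy, that computes them for a given guess of w and b, together with the slide's criterion:

```python
import numpy as np

def slacks(X, y, w, b):
    """epsilon_k = max(0, 1 - y_k (w.x_k + b)): zero for points in their
    correct zone, positive for points inside the margin or misclassified."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

def soft_margin_objective(X, y, w, b, C):
    """The slide's criterion: (1/2) w.w + C * sum_k epsilon_k."""
    return 0.5 * np.dot(w, w) + C * slacks(X, y, w, b).sum()
```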


Learning Maximum Margin with Noise (Slide 36)

Minimize   (1/2) w·w + C Σ_{k=1..R} ε_k

subject to
  w · x_k + b ≥ 1 - ε_k    if y_k = +1
  w · x_k + b ≤ -1 + ε_k   if y_k = -1
  ε_k ≥ 0                  for all k

There's a bug in this QP. Can you spot it?


An Equivalent QP (Slide 37)

Maximize

  Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where Q_kl = y_k y_l (x_k · x_l)

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Subject to these constraints:

  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:

  w = Σ_{k=1..R} α_k y_k x_k
  b = y_K (1 - ε_K) - x_K · w      where K = arg max_k α_k

Then classify with:

  f(x, w, b) = sign(w · x - b)

An Equivalent QP (Slide 38)

(The same QP, definitions and classification rule as Slide 37, with two notes added:)

• Datapoints with α_k > 0 will be the support vectors...
• ...so the sum w = Σ_k α_k y_k x_k only needs to be over the support vectors.
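A hedged sketch of handing this equivalent (dual) QP to a solver. It again assumes CVXOPT, which is not part of the slides; Q_kl = y_k y_l (x_k · x_l) becomes the quadratic term, and the box constraints 0 ≤ α_k ≤ C plus Σ α_k y_k = 0 are passed as the linear constraints. The recovery of b averages over on-margin support vectors and uses the f(x) = sign(w·x + b) convention, a common practical variant of the slide's single-K formula.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    """Solve: max  sum(alpha) - 0.5 * alpha' Q alpha
       s.t.  0 <= alpha_k <= C  and  sum_k alpha_k y_k = 0,
       where Q_kl = y_k y_l (x_k . x_l)."""
    R = len(y)
    Q = np.outer(y, y).astype(float) * (X @ X.T)
    P = matrix(Q)                                   # cvxopt minimizes, so ...
    q = matrix(-np.ones(R))                         # ... negate the linear term
    G = matrix(np.vstack([-np.eye(R), np.eye(R)]))  # -alpha <= 0  and  alpha <= C
    h = matrix(np.hstack([np.zeros(R), C * np.ones(R)]))
    A = matrix(y.reshape(1, -1).astype(float))      # sum_k alpha_k y_k = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ X                             # w = sum_k alpha_k y_k x_k
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)        # on-margin support vectors
    b_out = np.mean(y[sv] - X[sv] @ w)              # averaged, f(x) = sign(w.x + b) convention
    return alpha, w, b_out
```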


An Equivalent QP (Slide 39)

Why did I tell you about this equivalent QP?

• It's a formulation that QP packages can optimize more quickly.
• Because of further jaw-dropping developments you're about to learn.

(As before: datapoints with α_k > 0 will be the support vectors, so the sum defining w only needs to be over the support vectors.)

Suppose we're in 1 dimension (Slide 40)

What would SVMs do with this data?

(Figure: a handful of +1 and -1 points on the number line, with x = 0 marked.)


Suppose we're in 1 dimension (Slide 41)

Not a big surprise: the maximum margin boundary sits between the two clumps, with the positive "plane" and the negative "plane" on either side of it.

Harder 1-dimensional dataset (Slide 42)

That's wiped the smirk off SVM's face. What can be done about this?

(Figure: positive and negative points interleaved along the line, so no single threshold separates them.)


Harder 1-dimensional dataset (Slides 43-44)

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  z_k = ( x_k , x_k² )

(Slide 44 shows the same dataset after this mapping: in the (x, x²) plane the two classes become linearly separable.)
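A tiny sketch of what this lift does to the "harder" 1-d dataset (NumPy assumed, data made up): points that interleave on the line become linearly separable once each x_k is mapped to z_k = (x_k, x_k²).

```python
import numpy as np

x = np.array([-3.0, -2.0, 2.0, 3.0,   # say these are the negative points
              -1.0, -0.5, 0.5, 1.0])  # and these the positives (interleaved on the line)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

Z = np.column_stack([x, x**2])        # z_k = (x_k, x_k^2)

# In the lifted space a horizontal line x^2 = 2 separates the two classes:
print(np.all((Z[:, 1] < 2) == (y == 1)))   # True
```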


Common SVM basis functions (Slide 45)

z_k = ( polynomial terms of x_k of degree 1 to q )

z_k = ( radial basis functions of x_k ):      z_k[j] = φ_j(x_k) = KernelFn( |x_k - c_j| / KW )

z_k = ( sigmoid functions of x_k )

This is sensible. Is that the end of the story? No... there's one more trick!

Quadratic Basis Functions (Slide 46)

  Φ(x) = ( 1,                                           ← constant term
           √2 x_1, √2 x_2, ..., √2 x_m,                 ← linear terms
           x_1², x_2², ..., x_m²,                       ← pure quadratic terms
           √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_1 x_m,
           √2 x_2 x_3, ..., √2 x_(m-1) x_m )            ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2.

You may be wondering what those √2's are doing.
• You should be happy that they do no harm.
• You'll find out why they're there soon.
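A sketch of this quadratic basis function (NumPy assumed), including the √2 factors; it also checks the (m+2)(m+1)/2 term count for a small m.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Phi(x) from the slide: constant term, sqrt(2)-scaled linear terms,
    pure quadratic terms, and sqrt(2)-scaled cross-terms."""
    x = np.asarray(x, dtype=float)
    parts = [np.array([1.0]),                     # constant term
             np.sqrt(2) * x,                      # linear terms
             x ** 2,                              # pure quadratic terms
             np.array([np.sqrt(2) * x[i] * x[j]   # quadratic cross-terms
                       for i, j in combinations(range(len(x)), 2)])]
    return np.concatenate(parts)

m = 5
z = phi_quadratic(np.arange(1, m + 1))
print(len(z), (m + 2) * (m + 1) // 2)   # both 21
```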


QP with basis functions (Slides 47-48)

Maximize

  Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where Q_kl = y_k y_l ( Φ(x_k) · Φ(x_l) )

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Subject to these constraints:

  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:

  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
  b = y_K (1 - ε_K) - x_K · w      where K = arg max_k α_k

Then classify with:

  f(x, w, b) = sign(w · φ(x) - b)

We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R²m²/4. Yeeks!

...or does it?


Quadratic Dot Products (Slides 49-50)

Write out Φ(a) · Φ(b) term by term (constant term, linear terms, pure quadratic terms, cross-terms):

  Φ(a) · Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + Σ_{i=1..m} Σ_{j=i+1..m} 2 a_i a_j b_i b_j

Just out of casual, innocent interest, let's look at another function of a and b:

  (a·b + 1)² = (a·b)² + 2 a·b + 1
             = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
             = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
             = Σ_{i=1..m} (a_i b_i)² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1


Quadratic Dot Products (Slide 51)

  Φ(a) · Φ(b) = 1 + 2 Σ_i a_i b_i + Σ_i a_i² b_i² + Σ_i Σ_{j>i} 2 a_i a_j b_i b_j

  (a·b + 1)²  = Σ_i (a_i b_i)² + 2 Σ_i Σ_{j>i} a_i b_i a_j b_j + 2 Σ_i a_i b_i + 1

They're the same! And this is only O(m) to compute!
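A quick numeric confirmation of the "they're the same" claim, a sketch assuming NumPy: it compares the four-term expansion of Φ(a)·Φ(b) from the previous slide against (a·b + 1)², the O(m) kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
a, b = rng.normal(size=m), rng.normal(size=m)

# The term-by-term expansion of Phi(a).Phi(b) from the slide:
lhs = (1
       + 2 * np.sum(a * b)                                   # linear terms
       + np.sum((a * b) ** 2)                                # pure quadratic terms
       + 2 * sum(a[i] * a[j] * b[i] * b[j]                   # cross-terms, j > i
                 for i in range(m) for j in range(i + 1, m)))

rhs = (np.dot(a, b) + 1) ** 2                                # the O(m) kernel
print(np.isclose(lhs, rhs))   # True
```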

QP with Quadratic basis functions (Slide 52)

Maximize   Σ_k α_k - (1/2) Σ_k Σ_l α_k α_l Q_kl      where Q_kl = y_k y_l ( Φ(x_k) · Φ(x_l) )

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Subject to:  0 ≤ α_k ≤ C for all k,  and  Σ_k α_k y_k = 0. Then define w, b and classify with f(x, w, b) = sign(w · φ(x) - b) exactly as before.

We must still do R²/2 dot products to get this matrix ready, but each dot product now only requires m additions and multiplications.
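A sketch of building the R×R matrix Q_kl "sneakily" with the kernel (NumPy assumed): one Gram matrix X Xᵀ and an elementwise power, so each entry costs O(m) rather than O(m²).

```python
import numpy as np

def Q_matrix(X, y, degree=2):
    """Q_kl = y_k y_l * K(x_k, x_l) with the polynomial kernel K(a,b) = (a.b + 1)^degree,
    equivalent to y_k y_l * Phi(x_k).Phi(x_l) without ever forming Phi."""
    K = (X @ X.T + 1.0) ** degree      # R x R kernel (Gram) matrix
    return np.outer(y, y) * K
```

Raising the degree in this sketch only changes the exponent, which is the point of the cost table on the next slide.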


Higher Order Polynomials (Slide 53)

Polynomial   φ(x)                              Cost to build Q_kl       Cost if 100     φ(a)·φ(b)    Cost to build Q_kl    Cost if 100
                                               matrix traditionally     inputs                       matrix sneakily       inputs
Quadratic    all m²/2 terms up to degree 2     m²R²/4                   2,500 R²        (a·b+1)²     mR²/2                 50 R²
Cubic        all m³/6 terms up to degree 3     m³R²/12                  83,000 R²       (a·b+1)³     mR²/2                 50 R²
Quartic      all m⁴/24 terms up to degree 4    m⁴R²/48                  1,960,000 R²    (a·b+1)⁴     mR²/2                 50 R²

QP with Quintic basis functions (Slide 54)

Maximize   Σ_k α_k - (1/2) Σ_k Σ_l α_k α_l Q_kl      where Q_kl = y_k y_l ( Φ(x_k) · Φ(x_l) )

subject to 0 ≤ α_k ≤ C for all k and Σ_k α_k y_k = 0; then define w, b and classify with f(x, w, b) = sign(w · φ(x) - b) as before.

We must do R²/2 dot products to get this matrix ready, but in 100-d each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?


QP with Quintic basis functions (Slides 55-56)

(Same QP as the previous slide.) In 100-d each dot product needs 103 operations instead of 75 million, but there are still worrying things lurking away. What are they?

• The fear of overfitting with this enormous number of terms.
  → The use of Maximum Margin magically makes this not a problem.

• The evaluation phase (doing a set of predictions on a test set) will be very expensive. Why? Because each w · φ(x) (see the next slide) needs 75 million operations. What can be done?


QP with Quintic basis functions (Slide 57)

The answer: never compute w · Φ(x) explicitly. Instead,

  w · Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) · Φ(x)
           = Σ_{k s.t. α_k > 0} α_k y_k ( x_k · x + 1 )⁵

Only S·m operations (S = #support vectors).
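The trick on this slide in code form, a sketch assuming NumPy: classification never touches Φ or w explicitly, only the S support vectors and the O(m) kernel.

```python
import numpy as np

def decision(x, sv_X, sv_y, sv_alpha, b, degree=5):
    """w.Phi(x) - b computed as sum over support vectors of
    alpha_k y_k (x_k.x + 1)^degree, minus b.
    Only S*m work per prediction (S = number of support vectors)."""
    k = (sv_X @ x + 1.0) ** degree          # kernel between x and each support vector
    return np.sign(np.sum(sv_alpha * sv_y * k) - b)
```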

QP with Quintic basis functions (Slide 58)

  w · Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k ( x_k · x + 1 )⁵        Only S·m operations (S = #support vectors)

When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long."


QP with Quintic basis functions (Slide 59)

Andrew's opinion of why SVMs don't overfit as much as you'd think:

No matter what the basis function, there are really only up to R parameters: α_1, α_2, ..., α_R, and usually most are set to zero by the Maximum Margin.

Asking for small w·w is like "weight decay" in Neural Nets, like Ridge Regression parameters in Linear Regression, and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.

SVM Kernel Functions (Slide 60)

• K(a, b) = (a · b + 1)^d is an example of an SVM Kernel Function.
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function.
• Radial-Basis-style Kernel Function:
    K(a, b) = exp( -(a - b)² / (2σ²) )
• Neural-net-style Kernel Function:
    K(a, b) = tanh( κ a·b - δ )

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM* (*see last lecture).
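Sketches of the three kernel families on this slide (NumPy assumed); σ, κ and δ are the "magic parameters" the slide mentions.

```python
import numpy as np

def poly_kernel(a, b, d=2):
    """(a.b + 1)^d -- the polynomial kernel used throughout these slides."""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """exp(-|a-b|^2 / (2 sigma^2)) -- the radial-basis-style kernel."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """tanh(kappa a.b - delta) -- only a valid (Mercer) kernel for some settings."""
    return np.tanh(kappa * np.dot(a, b) - delta)
```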


VC-dimension of an SVM (Slide 61)

• Very very very loosely speaking there is some theory which, under some different assumptions, puts an upper bound on the VC dimension as

    ⌈ Diameter / Margin ⌉

  where
  • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and
  • Margin is the smallest margin we'll let the SVM use.
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation.

SVM Performance (Slide 62)

• Anecdotally they work very very well indeed.
• Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
• Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
• Despite this, some practitioners (including your lecturer) are a little skeptical.


Doing multi-class classification (Slide 63)

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
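A sketch of this one-vs-rest recipe. It assumes scikit-learn purely for the per-class binary SVMs (not part of the slides); the deciding rule is exactly the slide's "furthest into the positive region", i.e. the largest decision value.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svm_kwargs):
    """Learn one binary SVM per class: 'Output == c' vs 'Output != c'."""
    return {c: SVC(**svm_kwargs).fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, x):
    """Pick the class whose SVM puts x furthest into its positive region."""
    scores = {c: m.decision_function([x])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

# Example use (data and settings made up):
# models = train_one_vs_rest(X, y, classes=np.unique(y), kernel='linear', C=1.0)
# label = predict_one_vs_rest(models, X[0])
```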

References (Slide 64)

• An excellent tutorial on VC-dimension and Support Vector Machines:
  C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.


What You Should Know (Slide 65)

• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
