Sparse Kernel machines


Support Vector Machines 

Relevance Vector Machines 

Application to 'brain activity decoding' 

Bertrand Thirion, Parietal team, 

INRIA Saclay-Île-de-France 

16/11/2009 Decoding group 


Problem statement

● Given a vector x that describes your data, predict a variable of interest y
– Regression
– Classification
● Generic solution: estimate φ, w, b such that $y \approx w^T \phi(x) + b$ (or its sign, for classification)
● We may consider that φ(x) = x
● Problem: only n observations $x_1, \dots, x_n$, with n < p (the number of features)

Outline

● Simple classifiers
– Naïve Bayes
– Linear Discriminant Analysis
● Support Vector Machines
– Large margin classifiers
– Soft margin classifier
– Support Vector Regression
● Relevance Vector Machines
– Regression
– Classifiers



Naïve Bayes Classification

● The features are assumed to be independent conditionally on the class: $p(x \mid y) = \prod_{j=1}^{p} p(x_j \mid y)$
● Naïve Bayes simply combines the responses from the different 'dimensions' (features)
● In that case, learning boils down to estimating the one-dimensional densities $p(x_j \mid y)$
● Strong assumption, not very efficient
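
As a minimal sketch of the idea above, the following assumes Gaussian class-conditional densities for each feature (an assumption; the slides do not specify the density family). The names `fit_gaussian_nb`, `log_posterior` and `predict`, and the toy data, are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate, per class, a prior and a per-feature mean and variance."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))          # one tuple per feature
        means = [sum(col) / n for col in columns]
        # The small constant keeps variances strictly positive (assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalised log p(y | x): log prior plus a sum over features."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        s = math.log(prior)
        for v, m, var in zip(x, means, variances):
            s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = s
    return scores

def predict(model, x):
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Toy data: two features, two classes (made up for illustration).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Each feature contributes one additive term to the log-posterior, which is the sense in which the classifier "combines the responses" of the dimensions.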



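Linear Discriminant Analysis

● Generative data model: $x \mid y = k \sim \mathcal{N}(\mu_k, \Sigma)$, with a covariance Σ shared by the classes
● Discriminant function (two classes): $\delta(x) = w^T x + b$
  where $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $b = -\tfrac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0) + \log(\pi_1 / \pi_0)$
● Learning means learning the class parameters $(\mu_k, \Sigma, \pi_k)$
● Problem when the number of dimensions p is large: the estimate of Σ becomes singular as soon as p exceeds n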
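
The following sketch illustrates two-class LDA under the usual assumptions (Gaussian classes sharing one covariance matrix, priors estimated from class frequencies). It is restricted to two features so that the covariance can be inverted by hand; the function names and data are hypothetical.

```python
import math

def mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(X0, X1, m0, m1):
    """Shared 2x2 covariance, pooled over both classes."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((X0, m0), (X1, m1)):
        for x in rows:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    S[i][j] += d[i] * d[j]
    n = len(X0) + len(X1)
    return [[S[i][j] / (n - 2) for j in range(2)] for i in range(2)]

def fit_lda(X0, X1):
    """Return (w, b) of the linear discriminant w.x + b (class 1 if > 0)."""
    m0, m1 = mean(X0), mean(X1)
    S = pooled_covariance(X0, X1, m0, m1)
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]   # zero if S is singular
    Sinv = [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [Sinv[i][0] * diff[0] + Sinv[i][1] * diff[1] for i in range(2)]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    b = -(w[0] * mid[0] + w[1] * mid[1]) + math.log(len(X1) / len(X0))
    return w, b

def decide(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Toy data (made up): two well-separated squares of points.
X0 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X1 = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0]]
w, b = fit_lda(X0, X1)
```

The division by `det` is where the large-p problem shows up: with more features than samples the pooled covariance is singular and cannot be inverted without regularization.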



Large margin classifiers

● Idea: assuming that the classes are well separated, find the (w, b) that maximally separates them, i.e. that maximizes the margin between the two classes

[Figure: find the hyperplane (w, b) such that the margin is maximal]

● One can fix $y_i (w^T \phi(x_i) + b) \geq 1$ for all i, with labels $y_i \in \{-1, +1\}$, where equality holds for support vectors
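
To make the canonical scaling concrete, here is a small sketch that checks the constraints for a given hyperplane, lists the samples for which equality holds, and computes the margin width 2/||w||. It assumes φ(x) = x and labels in {-1, +1}; the hyperplane and data are made up rather than learned.

```python
import math

def functional_margins(w, b, X, y):
    """y_i * (w . x_i + b) for every sample."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

def margin_width(w):
    """Distance between the two hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

X = [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]]
y = [-1, -1, 1, 1, 1]
w, b = [1.0, 0.0], -1.0            # hypothetical hyperplane x_1 = 1

margins = functional_margins(w, b, X, y)
# Support vectors are the samples for which the constraint is an equality.
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]
width = margin_width(w)
```

Here the last sample lies strictly beyond the margin, so it is not a support vector and could be removed without changing the hyperplane.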



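Large margin classifiers

● The problem that is solved is thus
  $\min_{w,b} \tfrac{1}{2}\|w\|^2$ subject to $y_i (w^T \phi(x_i) + b) \geq 1$, i = 1..n
● After some algebra, it is found that the solution is
  $y(x) = \sum_{i=1}^{n} a_i y_i k(x, x_i) + b$
● where the coefs a are defined as
  $a = \arg\max_a \sum_i a_i - \tfrac{1}{2}\sum_{i,j} a_i a_j y_i y_j k(x_i, x_j)$ subject to $a_i \geq 0$ and $\sum_i a_i y_i = 0$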
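
The "algebra" can be sketched as the standard Lagrangian derivation below. It assumes labels $y_i \in \{-1, +1\}$, multipliers $a_i \geq 0$ and the kernel $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$; this is the textbook route and not necessarily the exact presentation used in the talk.

```latex
\begin{align*}
L(w, b, a) &= \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} a_i \left[ y_i \left( w^T \phi(x_i) + b \right) - 1 \right],
  \qquad a_i \geq 0 \\
\frac{\partial L}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i=1}^{n} a_i y_i \phi(x_i) \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} a_i y_i = 0 \\
\tilde{L}(a) &= \sum_{i=1}^{n} a_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j k(x_i, x_j) \\
y(x) &= w^T \phi(x) + b = \sum_{i=1}^{n} a_i y_i k(x, x_i) + b
\end{align*}
```

Substituting the two stationarity conditions into $L$ removes $w$ and $b$, leaving a quadratic program in the n multipliers only.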



Large margin classifiers

● The terms $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ represent scalar products or, more generally, a kernel
● Note that the problem is now of size n (number of subjects/trials) rather than p (number of features)
● Only a subset of the observations have $a_i > 0$ (support vectors)
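
The change of problem size can be seen by building the kernel (Gram) matrix, which is n × n whatever the number of features. The sketch below uses a linear kernel and, as one example of a more general kernel, an RBF kernel with an arbitrary `gamma`; names and data are illustrative.

```python
import math

def linear_kernel(u, v):
    """Plain scalar product, i.e. phi(x) = x."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel; gamma is an arbitrary illustrative value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def gram_matrix(X, kernel):
    """n x n matrix of kernel values between all pairs of samples."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

# n = 3 samples described by p = 5 features (made-up values).
X = [[1.0, 0.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 3.0, 1.0],
     [1.0, 1.0, 1.0, 1.0, 1.0]]

K_lin = gram_matrix(X, linear_kernel)
K_rbf = gram_matrix(X, rbf_kernel)
```

Both matrices are 3 × 3 even though each sample has 5 features; the dual problem only ever sees these entries.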



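Soft margin classifier

● The classes may not be perfectly separable
● Slack variables $\xi_i \geq 0$ measure how badly sample i violates the margin
● The problem becomes
  $\min_{w,b,\xi} C \sum_{i=1}^{n} \xi_i + \tfrac{1}{2}\|w\|^2$ with C > 0
  subject to $y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$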
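
As an illustration, the slack of each sample for a fixed hyperplane is the smallest non-negative value that satisfies its constraint, i.e. max(0, 1 - y_i (w·x_i + b)). The sketch below assumes φ(x) = x; the hyperplane and the one-dimensional data are made up.

```python
def decision(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def slacks(w, b, X, y):
    """Smallest xi_i >= 0 such that y_i * f(x_i) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """C * sum of slacks + 0.5 * ||w||^2."""
    return C * sum(slacks(w, b, X, y)) + 0.5 * sum(wj * wj for wj in w)

X = [[0.0], [2.0], [1.5], [0.5]]
y = [-1, 1, 1, 1]
w, b = [1.0], -1.0                 # hypothetical hyperplane x = 1

xi = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)
```

A slack of 0 means the sample respects the margin, a value between 0 and 1 means it sits inside the margin but on the correct side, and a value above 1 means it is misclassified.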



Soft margin classifier

● The solution remains the same!
● But under different constraints: $0 \leq a_i \leq C$ and $\sum_i a_i y_i = 0$
● In practice we only need to decide the value of C > 0 (inverse regularization)
– Internal cross validation
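
One way to picture the selection of C by cross-validation is the sketch below. It swaps in a plain full-batch subgradient descent on the primal linear soft-margin objective instead of a dual QP solver (an assumption made to keep the example dependency-free), and uses interleaved folds; the grid, step size, epoch count and data are all arbitrary.

```python
def train_linear_svm(X, y, C, epochs=200, lr=0.01):
    """Subgradient descent on C * sum(hinge) + 0.5 * ||w||^2."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0              # gradient of the 0.5*||w||^2 term
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                gw = [g - C * yi * xj for g, xj in zip(gw, xi)]
                gb -= C * yi
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def accuracy(w, b, X, y):
    hits = 0
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
        hits += pred == yi
    return hits / len(X)

def cross_validate_C(X, y, grid, k=3):
    """Return the C with the best mean held-out accuracy, and all scores."""
    scores = {}
    for C in grid:
        fold_acc = []
        for f in range(k):
            held_out = set(range(f, len(X), k))     # interleaved folds
            Xtr = [x for i, x in enumerate(X) if i not in held_out]
            ytr = [t for i, t in enumerate(y) if i not in held_out]
            Xte = [x for i, x in enumerate(X) if i in held_out]
            yte = [t for i, t in enumerate(y) if i in held_out]
            w, b = train_linear_svm(Xtr, ytr, C)
            fold_acc.append(accuracy(w, b, Xte, yte))
        scores[C] = sum(fold_acc) / k
    return max(scores, key=scores.get), scores

# Made-up 1-D data: negative values are class -1, positive values class +1.
X = [[-3.0], [3.0], [-2.0], [2.0], [-1.5], [1.5],
     [-2.5], [2.5], [-1.0], [1.0], [-3.5], [3.5]]
y = [-1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1]
grid = [0.01, 0.1, 1.0]
best_C, scores = cross_validate_C(X, y, grid)
```

For the selection to be "internal", this loop should be run on the training part of the data only, with a separate held-out set kept for the final performance estimate.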



Multi-class SVM

● From binary to multi-class classifiers
– One versus the rest
– One versus one: K(K-1)/2 problems → vote procedure
– No procedure is optimal!
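
The one-versus-one vote can be sketched as follows. The pairwise rules here are hypothetical nearest-centre rules on a one-dimensional input, standing in for trained binary SVMs.

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(x, classes, pairwise):
    """pairwise[(a, b)] is a binary rule that returns a or b for input x."""
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for K(K-1)/2 trained binary classifiers.
classes = ["A", "B", "C", "D"]
centres = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}

def make_rule(a, b):
    return lambda x: a if abs(x - centres[a]) <= abs(x - centres[b]) else b

pairwise = {(a, b): make_rule(a, b) for a, b in combinations(classes, 2)}
```

With K = 4 classes this builds 6 binary problems; ties in the vote are broken arbitrarily here, which is one of the reasons no procedure is considered optimal.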



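Support Vector Regression (SVR)

● Now: $y \in \mathbb{R}$
● Model: $y(x) = w^T \phi(x) + b$
● Learning means solving
  $\min_{w,b} C \sum_{i=1}^{n} f_\varepsilon(y(x_i) - y_i) + \tfrac{1}{2}\|w\|^2$
● Where $f_\varepsilon(z) = \max(0, |z| - \varepsilon)$ is the ε-insensitive loss function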
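
A minimal sketch of the ε-insensitive loss and of the resulting objective, assuming φ(x) = x; the model parameters and the data are made up rather than fitted.

```python
def eps_insensitive(residual, eps):
    """Zero inside the tube |residual| <= eps, linear outside."""
    return max(0.0, abs(residual) - eps)

def svr_objective(w, b, X, y, C, eps):
    """C * sum of eps-insensitive losses + 0.5 * ||w||^2."""
    loss = sum(
        eps_insensitive(sum(wj * xj for wj, xj in zip(w, xi)) + b - yi, eps)
        for xi, yi in zip(X, y)
    )
    return C * loss + 0.5 * sum(wj * wj for wj in w)

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.05, 1.0, 2.5, 2.0]
w, b = [1.0], 0.0                  # hypothetical model y(x) = x

value = svr_objective(w, b, X, y, C=1.0, eps=0.1)
```

The first two samples fall inside the tube and contribute nothing, which is what makes the dual solution sparse.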



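Support Vector Regression (SVR)

● The solution is similar to SVM: $y(x) = \sum_{i=1}^{n} (a_i - \hat{a}_i) k(x, x_i) + b$
● where $0 \leq a_i \leq C$, $0 \leq \hat{a}_i \leq C$ and $\sum_i (a_i - \hat{a}_i) = 0$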




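Relevance vector machines

● Shortcomings of SVMs
– Only provide a boundary: no posterior probability of classification
– Multi-class is problematic (in theory)
– Choice of C
● RVMs are defined directly in the kernel domain: $y(x) = \sum_{i=1}^{n} w_i k(x, x_i) + b$
● ARD prior on w: $p(w \mid \alpha) = \prod_{i=1}^{n} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$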




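Relevance vector machines

[Figure: graphical model with nodes α, w, y]

● Estimation of w → estimation of α
● EM algorithm: iterative estimation of the parameters and hyper-parameters
● In theory many values of w are shrunk to 0; nonzero values are associated with relevance vectors, but here these are typical samples (prototypes) rather than samples close to the boundary
● Use for classification: we introduce a sigmoidal non-linearity $p(y = 1 \mid x) = \sigma(y(x))$ with $\sigma(a) = 1 / (1 + e^{-a})$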
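
The prediction step of a relevance vector classifier can be sketched as below: a kernel expansion over the few retained relevance vectors, passed through a sigmoid. The weights and relevance vectors are hypothetical values, not the output of the EM estimation, and the RBF kernel with `gamma=1.0` is an arbitrary choice.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def rbf(u, v, gamma=1.0):
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(u, v)))

def rvm_predict_proba(x, relevance_vectors, weights, bias=0.0):
    """p(y = 1 | x) from a sparse kernel expansion."""
    a = bias + sum(w * rbf(x, rv) for w, rv in zip(weights, relevance_vectors))
    return sigmoid(a)

# Hypothetical sparse solution: one prototype per class.
relevance_vectors = [[0.0, 0.0], [3.0, 3.0]]
weights = [-4.0, 4.0]

p_near_class1 = rvm_predict_proba([3.0, 3.0], relevance_vectors, weights)
p_near_class0 = rvm_predict_proba([0.0, 0.0], relevance_vectors, weights)
p_midway = rvm_predict_proba([1.5, 1.5], relevance_vectors, weights)
```

Unlike the sign of an SVM decision function, the output is a probability, so a point midway between the two prototypes is reported as maximally uncertain.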



Relevance vector machines

● The class prototypes are estimated, not the boundary
● Probabilistic model → better handles multi-class problems in classification (in theory)
● Less efficient than SVMs



Discussion

● All the correlations between the variables are implicitly embedded in the kernel terms $k(x_i, x_j)$
● Problem: this is hidden and cannot be recovered exactly if k is not bilinear
● Estimation is nicely reduced from a p-dimensional problem to an n-dimensional problem (n < p)
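
For the bilinear (linear) kernel the feature-space weights can be recovered from the dual coefficients as w = Σ a_i y_i x_i, which the sketch below illustrates. The dual coefficients are written by hand for a two-point hard-margin problem and are not the output of a solver; function names are hypothetical.

```python
def linear_kernel(u, v):
    return sum(p * q for p, q in zip(u, v))

def kernel_decision(x, alphas, y, X, b, kernel):
    """Dual form: sum_i a_i y_i k(x_i, x) + b."""
    return sum(a * t * kernel(xi, x) for a, t, xi in zip(alphas, y, X)) + b

def primal_weights(alphas, y, X):
    """w = sum_i a_i y_i x_i; only meaningful for the linear kernel."""
    p = len(X[0])
    return [sum(a * t * xi[j] for a, t, xi in zip(alphas, y, X))
            for j in range(p)]

X = [[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
y = [-1, 1, 1]
alphas = [0.5, 0.5, 0.0]           # hand-written; third sample is not a support vector
b = -1.0

w = primal_weights(alphas, y, X)
x_new = [3.0, 1.0]
dual_value = kernel_decision(x_new, alphas, y, X, b, linear_kernel)
primal_value = sum(wj * xj for wj, xj in zip(w, x_new)) + b
```

With a non-bilinear kernel such as the RBF, `kernel_decision` still works but there is no finite vector `w` over the original features to inspect, which is the interpretability problem raised above.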
