
Sparse Kernel Machines

Support Vector Machines

Relevance Vector Machines

Application to 'brain activity decoding'

Bertrand Thirion, Parietal team,
INRIA Saclay-Île-de-France

16/11/2009, Decoding group


Problem statement

● Given a vector x that describes your data, predict a variable of interest y
  – Regression
  – Classification
● Generic solution: estimate φ, w, b such that y ≈ wᵀφ(x) + b
● We may consider that φ(x) = x
● Problem: only n observations x_1 .. x_n are available, with n << p (far fewer samples than features)
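As a concrete illustration of this setup (not from the slides: synthetic data and an arbitrary w, b), the same linear score wᵀx + b covers both tasks — used directly it gives a regression prediction, thresholded by its sign it gives a classification:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 1000                      # n observations, p features, n << p
X = rng.normal(size=(n, p))

# Hypothetical parameters; in practice w and b are estimated from the data
w = rng.normal(size=p)
b = 0.5

scores = X @ w + b                   # the linear model w^T x + b (here phi(x) = x)
y_regression = scores                # regression: predict the value itself
y_classification = np.sign(scores)   # classification: predict the label from the sign
print(y_regression[:3], y_classification[:3])
```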


Outline

● Simple classifiers
  – Naïve Bayes
  – Linear Discriminant Analysis
● Support Vector Machines
  – Large margin classifiers
  – Soft margin classifier
  – Support Vector Regression
● Relevance Vector Machines
  – Regression
  – Classifiers



Naïve Bayes Classification

● The features are assumed to be independent conditionally on the class: p(x | C_k) = Π_j p(x_j | C_k)
● Naïve Bayes simply combines the responses from the different 'dimensions' (features)
● In that case, learning boils down to estimating the per-feature class-conditional densities p(x_j | C_k) and the class priors p(C_k)
● Strong assumption; often not very accurate in practice

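A minimal sketch of this classifier, assuming scikit-learn's GaussianNB and synthetic two-class data (neither is part of the slides):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes, p features, one Gaussian per feature and class (the naive assumption)
n, p = 40, 10
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, p)),
               rng.normal(1.0, 1.0, (n // 2, p))])
y = np.repeat([0, 1], n // 2)

clf = GaussianNB().fit(X, y)      # estimates p(x_j | C_k) and p(C_k)
print(clf.predict(X[:3]))         # predicted class labels
print(clf.predict_proba(X[:3]))   # posterior class probabilities
```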


Linear Discriminant Analysis

● Generative data model: each class is Gaussian with a shared covariance, p(x | C_k) = N(x; μ_k, Σ)
● Discriminant function: y_k(x) = w_kᵀ x + w_k0
  where w_k = Σ⁻¹ μ_k and w_k0 = −½ μ_kᵀ Σ⁻¹ μ_k + ln p(C_k)
● Learning means: estimating the class parameters (μ_k, Σ, p(C_k))
● Problem when the number of dimensions p is large: the shared covariance Σ is a p×p matrix, hard to estimate and invert from few samples

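A hedged sketch with scikit-learn's LinearDiscriminantAnalysis on synthetic data (not from the slides); the shrinkage option is one common way to cope with the large-p covariance problem mentioned above:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p = 40, 10
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, p)),
               rng.normal(1.0, 1.0, (n // 2, p))])
y = np.repeat([0, 1], n // 2)

# Shrinkage regularizes the shared covariance estimate, useful when p is large
lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
print(lda.coef_.shape)    # (1, p): the linear discriminant direction
print(lda.score(X, y))    # training accuracy
```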


Large margin classifiers

● Idea: assuming that the classes are well separated, find the (w, b) that maximally separates them, i.e. that maximizes the margin between the two classes

[Figure: the separating hyperplane (w, b) is chosen so that the margin between the two classes is maximal]

● One can fix the scaling of (w, b) so that y_i (wᵀx_i + b) ≥ 1 for all samples, where equality holds for support vectors



Large margin classifiers

● The problem that is solved is thus: minimize ½‖w‖² subject to y_i (wᵀx_i + b) ≥ 1 for all i
● After some algebra, it is found that the solution is w = Σ_i a_i y_i x_i
● where the coefficients a are defined as the solution of the dual problem: maximize Σ_i a_i − ½ Σ_i Σ_j a_i a_j y_i y_j x_iᵀx_j, subject to a_i ≥ 0 and Σ_i a_i y_i = 0

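A minimal sketch of the (nearly) hard-margin case, assuming scikit-learn's SVC on well-separated synthetic data (a very large C approximates the hard-margin problem; none of this is from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 40, 10
X = np.vstack([rng.normal(-2.0, 1.0, (n // 2, p)),
               rng.normal(+2.0, 1.0, (n // 2, p))])
y = np.repeat([-1, 1], n // 2)

svm = SVC(kernel='linear', C=1e6).fit(X, y)

print(svm.support_.size)       # number of support vectors (samples with a_i > 0)
print(svm.dual_coef_.shape)    # (1, n_support): the a_i * y_i coefficients
w = svm.dual_coef_ @ svm.support_vectors_   # recovers w = sum_i a_i y_i x_i
print(w.shape)
```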


Large margin classifiers

● The products x_iᵀx_j that appear in the dual problem represent scalar products, or more generally the kernel k(x_i, x_j)
● Note that the problem is now of size n (number of subjects/trials) rather than p (number of features)
● Only a subset of the observations have a_i > 0 (support vectors)

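A sketch (assumptions: scikit-learn, a synthetic n << p dataset with random labels, an RBF kernel) showing that once the n×n Gram matrix is formed, the learning problem only involves n samples; the point here is the shapes, not the accuracy:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
n, p = 40, 500                     # n << p, as in the decoding setting
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n) * 2 - 1  # random labels, for illustration only

K = rbf_kernel(X, gamma=1.0 / p)   # n x n Gram matrix: the problem size is n, not p
svm = SVC(kernel='precomputed', C=1.0).fit(K, y)

K_test = rbf_kernel(X[:5], X)      # kernel between test and training samples
print(svm.predict(K_test))
```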


Soft margin classifier

● The classes may not be perfectly separable
● Slack variables ξ_i ≥ 0 measure how badly each sample violates the margin
● The problem becomes: minimize ½‖w‖² + C Σ_i ξ_i, with C > 0
  Subject to y_i (wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i



Soft margin classifier

● The solution remains the same!
● But under different constraints: 0 ≤ a_i ≤ C (instead of a_i ≥ 0)
● In practice we only need to decide the value of C > 0 (inverse regularization)
  – Internal cross-validation

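A minimal sketch of choosing C by internal cross-validation, assuming scikit-learn's GridSearchCV and synthetic data (the grid of C values is arbitrary):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n, p = 60, 50
X = np.vstack([rng.normal(-0.5, 1.0, (n // 2, p)),
               rng.normal(+0.5, 1.0, (n // 2, p))])
y = np.repeat([0, 1], n // 2)

# Internal cross-validation over the single free parameter C
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': np.logspace(-3, 3, 7)},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```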


Multi-class SVM

● From binary to multi-class classifiers
  – One versus the rest
  – One versus one
    ● K(K−1)/2 problems → vote procedure
  – No procedure is optimal!

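A sketch of the two schemes with scikit-learn's meta-estimators on synthetic three-class data (none of this is from the slides):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

rng = np.random.default_rng(0)
K, n_per, p = 3, 20, 10
X = np.vstack([rng.normal(k, 1.0, (n_per, p)) for k in range(K)])
y = np.repeat(np.arange(K), n_per)

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)   # K binary problems
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)    # K(K-1)/2 problems, then a vote
print(len(ovr.estimators_), len(ovo.estimators_))           # 3 and 3 here (K = 3)
print(ovr.score(X, y), ovo.score(X, y))
```

Note that scikit-learn's SVC already applies a one-versus-one scheme internally when given more than two classes.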


Support Vector Regression (SVR)

● Now: the target y is a continuous (real-valued) variable
● Model: y(x) = wᵀφ(x) + b
● Learning means solving: minimize C Σ_i f_ε(y(x_i) − y_i) + ½‖w‖², where y_i is the observed target
● Where f_ε is the ε-insensitive loss function: f_ε(z) = 0 if |z| < ε, and |z| − ε otherwise



Support Vector Regression (SVR)

● The solution is similar to the SVM case: y(x) = Σ_i (a_i − â_i) k(x, x_i) + b
● where the coefficients satisfy 0 ≤ a_i ≤ C and 0 ≤ â_i ≤ C, and only a subset of them are nonzero (support vectors)

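A minimal sketch with scikit-learn's SVR on a synthetic linear regression problem (the C and epsilon values are arbitrary choices, not from the slides):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)   # continuous target

# epsilon sets the width of the insensitive tube, C the regularization trade-off
svr = SVR(kernel='linear', C=1.0, epsilon=0.1).fit(X, y)
print(svr.support_.size)    # samples outside the epsilon-tube become support vectors
print(svr.predict(X[:3]))
```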


Relevance Vector Machines

● Shortcomings of SVMs
  – Only provide a boundary: no posterior probability of classification
  – Multi-class is problematic (in theory)
  – Choice of C
● RVMs are defined directly in the kernel domain: y(x) = Σ_i w_i k(x, x_i) + b
● ARD prior on w: p(w | α) = Π_i N(w_i; 0, 1/α_i), one precision hyper-parameter α_i per weight



Relevance Vector Machines

[Graphical model: the hyper-parameters α govern the weights w, which generate the observations y]

● Estimation of w → estimation of α
● EM algorithm: iterative estimation of the parameters and hyper-parameters
● In theory many values of w are shrunk to 0; nonzero values are associated with relevance vectors: but here these are typical samples
● Use for classification: we introduce a sigmoidal non-linearity, y(x) = σ(Σ_i w_i k(x, x_i) + b), with σ(z) = 1 / (1 + e^(−z))

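scikit-learn ships no RVM estimator; as a rough sketch of the same idea, one can place an ARD prior over weights attached to kernel basis functions using ARDRegression. This is an approximation in the spirit of RVM regression, not the exact EM procedure of the slide, and the data and threshold below are made up:

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
n, p = 60, 20
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# One basis function per training sample: design matrix Phi[i, j] = k(x_i, x_j)
Phi = rbf_kernel(X, gamma=1.0 / p)

# ARD prior: one precision per weight; most weights are driven to ~0
rvm_like = ARDRegression().fit(Phi, y)
relevance = np.flatnonzero(np.abs(rvm_like.coef_) > 1e-3)
print(relevance.size, "relevance vectors out of", n)

# Predicting for new data requires the kernel against the training samples
Phi_new = rbf_kernel(X[:5], X)
print(rvm_like.predict(Phi_new))
```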


Relevance Vector Machines

● The class prototypes are estimated, not the boundary
● Probabilistic model → better handles multi-class problems in classification (in theory)
● Less efficient than SVMs



Discussion

● All the correlations between the variables are implicitly embedded in the kernel terms k(x_i, x_j)
● Problem: this is hidden and cannot be recovered exactly if k is not bilinear
● Estimation is nicely reduced from a p-dimensional problem to an n-dimensional problem (n << p)
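When the kernel is bilinear (linear), the dual solution can indeed be mapped back to feature space, which matters for decoding where one wants a weight per feature (e.g. per voxel). A sketch, assuming scikit-learn's SVC and synthetic data with random labels:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 40, 1000                   # decoding-like setting: n << p
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n) * 2 - 1

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# With a linear kernel the dual solution maps back to feature space:
# w = sum_i a_i y_i x_i, giving one weight per feature
w = (svm.dual_coef_ @ svm.support_vectors_).ravel()
print(w.shape)                    # (p,): recoverable because k is bilinear
assert np.allclose(w, svm.coef_.ravel())
```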
