Sparse Kernel Machines<br />
Support Vector Machines<br />
Relevance Vector Machines<br />
Application to 'brain activity decoding'<br />
Bertrand Thirion, Parietal team,<br />
INRIA Saclay-Île-de-France<br />
16/11/2009 Decoding group<br />
Problem statement<br />
● Given a vector x that describes your data, predict a<br />
variable of interest y<br />
– Regression<br />
– Classification<br />
● Generic solution: estimate φ, w, b such that<br />
● We may consider that φ(x)=x<br />
● Problem: only n observations $x_1, \dots, x_n$, with $n < p$ (p being the number of features)<br />
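The generic model can be sketched in a few lines. The form `w · φ(x) + b`, with the sign taken for classification, is a common reading of this setup and is assumed here; the function name `predict` and its arguments are illustrative only.

```python
def predict(x, w, b, phi=lambda v: v, classify=False):
    """Linear prediction w . phi(x) + b (assumed form of the generic model).

    phi defaults to the identity, i.e. phi(x) = x.
    """
    score = sum(wi * fi for wi, fi in zip(w, phi(x))) + b
    if classify:
        # Classification: keep only the side of the hyperplane.
        return 1 if score >= 0 else -1
    # Regression: return the real-valued prediction.
    return score
```

With `classify=False` the function acts as a regressor; with `classify=True` it returns a label in {-1, +1}.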
Outline<br />
● Simple classifiers<br />
– Naïve Bayes<br />
– Linear Discriminant Analysis<br />
● Support Vector Machines<br />
– Large margin classifiers<br />
– Soft margin classifier<br />
– Support Vector Regression<br />
● Relevance Vector Machines<br />
– Regression<br />
– Classifiers<br />
Naïve Bayes Classification<br />
● The features are assumed to be independent conditionally on the class<br />
● Naïve Bayes simply combines the responses from the different 'dimensions' (features)<br />
● In that case, learning boils down to estimating the class-conditional distribution of each feature, $p(x_j \mid y)$<br />
● Strong assumption, not very efficient<br />
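A minimal sketch of a naïve Bayes classifier follows. It assumes Gaussian class-conditional densities for each feature, which is one common choice and not something the slide specifies; the function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate, per class, a prior and a per-feature mean and variance."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add up.
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=log_posterior)
```

Learning only touches one feature at a time, which is what makes the method cheap and, at the same time, blind to correlations between features.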
Linear Discriminant analysis<br />
● Generative data model<br />
● Discriminant function:<br />
where<br />
and<br />
● Learning means learning the class parameters<br />
● Problem when the number of dimensions p is large<br />
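For two classes sharing one covariance matrix, the discriminant takes a standard linear form, sketched below. The symbols $\mu_0, \mu_1$ (class means), $\Sigma$ (shared covariance) and $\pi_0, \pi_1$ (class priors) are assumptions of this sketch, not notation taken from the slide.

```latex
\begin{aligned}
p(x \mid y = k) &= \mathcal{N}(x \mid \mu_k, \Sigma), \qquad k \in \{0, 1\} \\
f(x) &= w^\top x + b \\
w &= \Sigma^{-1} (\mu_1 - \mu_0) \\
b &= -\tfrac{1}{2} (\mu_1 + \mu_0)^\top \Sigma^{-1} (\mu_1 - \mu_0) + \log \frac{\pi_1}{\pi_0}
\end{aligned}
```

Estimating $\Sigma$ involves $p(p+1)/2$ free parameters, which is why the approach degrades when p is large compared with n.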
Large margin classifiers<br />
● Idea: assuming that the classes are well separated, find<br />
the (w,b) that maximally separate them, i.e. that<br />
maximize the margin between the two classes<br />
[Figure: the hyperplane (w,b) such that the margin is maximal]<br />
One can fix the scale of (w,b) so that $y_i (w^\top x_i + b) \ge 1$ for all i, where equality holds for the support vectors<br />
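The quantity being maximized can be made concrete with a short sketch. It assumes labels coded as +1 / -1 and φ(x) = x; the helper name `geometric_margin` is hypothetical.

```python
import math


def geometric_margin(X, y, w, b):
    """Smallest signed distance of the samples to the hyperplane w . x + b = 0.

    Labels y are assumed to be +1 / -1. A positive result means that
    (w, b) separates the two classes.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yi * (sum(wi * xj for wi, xj in zip(w, xi)) + b) / norm
        for xi, yi in zip(X, y)
    )
```

The result does not change when (w, b) is rescaled, which is why the scale can be fixed freely. A large margin classifier searches for the (w, b) that makes this value as large as possible.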
Large margin classifiers<br />
● The problem that is solved is thus<br />
● After some algebra, it is found that the solution is<br />
● where the coefficients $a_i$ are defined as<br />
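In the standard hard-margin formulation (assumed here, with labels $y_i \in \{-1, +1\}$ and $\phi(x) = x$), the three steps above read as follows: the primal problem, the form of its solution, and the dual problem that defines the coefficients $a_i$.

```latex
\begin{aligned}
&\min_{w, b} \; \tfrac{1}{2} \|w\|^2
  \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n \\
&w = \sum_{i=1}^{n} a_i \, y_i \, x_i \\
&\max_{a} \; \sum_{i=1}^{n} a_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^\top x_j
  \quad \text{subject to} \quad a_i \ge 0, \quad \sum_{i=1}^{n} a_i y_i = 0
\end{aligned}
```

The data enter the dual only through the scalar products $x_i^\top x_j$.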
Large margin classifiers<br />
● $x_i^\top x_j$, or more generally the kernel $k(x_i, x_j)$, represents scalar products<br />
● Note that the problem is now of size n (number of<br />
subjects/trials) rather than p (number of features)<br />
● Only a subset of observations have $a_i > 0$ (the support vectors)<br />
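A short sketch of the resulting kernel decision function, assuming the usual form $f(x) = \sum_i a_i y_i k(x_i, x) + b$. The function names, the RBF kernel and the toy data are illustrative choices, not taken from the slides.

```python
import math


def linear_kernel(u, v):
    """Plain scalar product u . v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def decision_function(x, support_vectors, labels, a, b, kernel=linear_kernel):
    """f(x) = sum_i a_i * y_i * k(x_i, x) + b, summed over support vectors only."""
    return sum(ai * yi * kernel(sv, x)
               for sv, yi, ai in zip(support_vectors, labels, a)) + b


# Toy problem: two points on the first axis. For this data set,
# a_1 = a_2 = 0.5 and b = 0 solve the hard-margin dual with a linear kernel.
svs = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
a = [0.5, 0.5]
```

Observations with $a_i = 0$ never appear in the sum, so prediction cost scales with the number of support vectors rather than with n or p.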
Soft margin classifier<br />
● The classes may not be perfectly separable<br />
● Slack variables measure how badly each sample violates the margin<br />
● The problem becomes<br />
[Figure: margin]<br />
with C>0<br />
subject to<br />
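The role of the slack variables can be sketched as follows, assuming the standard soft-margin objective $\tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$ with $\xi_i = \max(0, 1 - y_i (w^\top x_i + b))$. Function names are hypothetical.

```python
def slack_variables(X, y, w, b):
    """xi_i = max(0, 1 - y_i * (w . x_i + b)); zero for samples beyond the margin."""
    return [
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]


def soft_margin_objective(X, y, w, b, C):
    """0.5 * ||w||^2 + C * sum_i xi_i (standard soft-margin objective, assumed)."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slack_variables(X, y, w, b))
```

A slack of 0 means the sample is on the correct side of the margin, a value between 0 and 1 means it lies inside the margin but is still correctly classified, and a value above 1 means it is misclassified.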
Soft margin classifier<br />
● The solution remains the same!<br />
● But under different constraints:<br />
● In practice we only need to decide the value of C>0<br />
(inverse regularization)<br />
– Internal cross validation<br />
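In the standard soft-margin dual (assumed here), the only change with respect to the separable case is an upper bound on the coefficients:

```latex
0 \le a_i \le C, \qquad \sum_{i=1}^{n} a_i \, y_i = 0
```

A large C approaches the hard-margin classifier, while a small C tolerates more margin violations, that is, regularizes more strongly.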
Multi-class SVM<br />
● From binary to multi-class classifiers<br />
– One versus the rest<br />
– One versus one: K(K-1)/2 problems → vote procedure<br />
– No procedure is optimal!<br />
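The one-versus-one vote can be sketched as below. `pairwise_winner` is a hypothetical stand-in for the K(K-1)/2 trained binary classifiers, and the nearest-centre rule is only a toy replacement for them.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise_winner, x):
    """Run every pairwise duel and return the class with the most votes.

    pairwise_winner(a, b, x) is a hypothetical callable standing in for a
    trained binary classifier; it must return either a or b.
    """
    votes = Counter(pairwise_winner(a, b, x) for a, b in combinations(classes, 2))
    # Ties are broken by first-seen order, which is arbitrary.
    return votes.most_common(1)[0][0]


# Toy stand-in for the binary classifiers: the class whose centre is closer wins.
centres = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}


def nearest_centre(a, b, x):
    return a if abs(x - centres[a]) <= abs(x - centres[b]) else b
```

With K = 4 classes there are 6 duels. The arbitrary tie-breaking in the vote is one illustration of why no such procedure is optimal.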
Support Vector Regression (SVR)<br />
● Now:<br />
● Model:<br />
● Learning means solving<br />
● where $f_\varepsilon$ is the ε-insensitive loss function<br />
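A minimal sketch of the ε-insensitive loss, $f_\varepsilon(r) = \max(0, |r| - \varepsilon)$, together with a standard form of the SVR objective. The objective shown is the usual textbook one and is an assumption here; function names are hypothetical.

```python
def eps_insensitive_loss(residual, eps):
    """f_eps(r) = max(0, |r| - eps): errors smaller than eps cost nothing."""
    return max(0.0, abs(residual) - eps)


def svr_objective(residuals, w, C, eps):
    """C * sum_i f_eps(r_i) + 0.5 * ||w||^2 (a standard SVR objective, assumed)."""
    data_term = sum(eps_insensitive_loss(r, eps) for r in residuals)
    return C * data_term + 0.5 * sum(wj * wj for wj in w)
```

Samples whose residual falls inside the ε-tube contribute nothing to the data term, which is what makes the solution sparse.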
Support Vector Regression (SVR)<br />
● The solution is similar to that of the SVM<br />
● where<br />
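In the usual dual formulation of SVR (assumed here), each sample carries two coefficients, $a_i$ and $a_i^{*}$, one for each side of the ε-tube, and the prediction reads:

```latex
f(x) = \sum_{i=1}^{n} (a_i - a_i^{*}) \, k(x_i, x) + b,
\qquad 0 \le a_i \le C, \quad 0 \le a_i^{*} \le C
```

Only samples lying on or outside the tube have a nonzero $a_i$ or $a_i^{*}$; these play the role of support vectors.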
Relevance Vector Machines<br />
● Shortcomings of SVMs<br />
– Only provide a boundary: no posterior probability of classification<br />
– Multi-class is problematic (in theory)<br />
– Choice of C<br />
● RVMs are defined directly in the kernel domain<br />
● ARD (automatic relevance determination) prior on w:<br />
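The ARD prior is usually written as an independent zero-mean Gaussian on each weight, with one precision hyper-parameter $\alpha_i$ per weight. This standard form is assumed here:

```latex
p(w \mid \alpha) = \prod_{i} \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)
```

When the data drive some $\alpha_i$ towards infinity, the corresponding $w_i$ is pinned to zero, which is the source of sparsity.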
Relevance Vector Machines<br />
[Figure: graphical model with nodes α, w, y]<br />
● Estimation of w → estimation of α<br />
● EM algorithm: iterative estimation of the parameters and hyper-parameters<br />
● In theory many values of w are shrunk to 0; nonzero values are associated with relevance vectors, but here these are typical samples<br />
● Use for classification: we introduce a sigmoidal non-linearity<br />
with<br />
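The classification step can be sketched as below, assuming the logistic sigmoid $\sigma(t) = 1 / (1 + e^{-t})$ applied to a kernel expansion over the relevance vectors. Function and argument names are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def rvm_class_probability(x, relevance_vectors, weights, bias, kernel):
    """p(y = 1 | x) = sigmoid(sum_i w_i * k(x, x_i) + bias)."""
    activation = sum(wi * kernel(x, rv)
                     for rv, wi in zip(relevance_vectors, weights)) + bias
    return sigmoid(activation)
```

Unlike the SVM decision function, the output is a probability, so a confidence can be attached to each prediction.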
Relevance Vector Machines<br />
● The class prototypes are estimated, not the boundary<br />
● Probabilistic model → better handles multi-class problems in classification (in theory)<br />
● Less efficient than SVMs<br />
Discussion<br />
● All the correlations between the variables are implicitly embedded in the kernel terms $k(x_i, x_j)$<br />
● Problem: this is hidden and cannot be recovered exactly if k is not bilinear<br />
● Estimation is nicely reduced from a p-dimensional problem to an n-dimensional problem ($n < p$)<br />
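When the kernel is bilinear, that is the plain scalar product, the feature-space weights can be recovered from the dual coefficients as $w = \sum_i a_i y_i x_i$. The sketch below assumes that linear case; the helper name `primal_weights` is hypothetical.

```python
def primal_weights(support_vectors, labels, a):
    """w = sum_i a_i * y_i * x_i; valid only for the linear kernel k(u, v) = u . v."""
    p = len(support_vectors[0])
    return [sum(ai * yi * sv[j] for sv, yi, ai in zip(support_vectors, labels, a))
            for j in range(p)]
```

For a nonlinear kernel no such p-dimensional weight vector exists in the input space, so the contribution of each feature stays hidden inside the kernel terms.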