Online (and Offline) on an Even Tighter Budget - Leon Bottou
Online (and Offline) on an Even Tighter Budget

Jason Weston
NEC Laboratories America, Princeton, NJ, USA
jasonw@nec-labs.com

Antoine Bordes
NEC Laboratories America, Princeton, NJ, USA
antoine@nec-labs.com

Leon Bottou
NEC Laboratories America, Princeton, NJ, USA
leonb@nec-labs.com

Abstract
We develop a fast online kernel algorithm for classification which can be viewed as an improvement over the one suggested by Crammer, Kandola and Singer (2004), titled "Online Classification on a Budget". In that previous work, the authors introduced an on-the-fly compression of the number of examples used in the prediction function, using the size of the margin as a quality measure. Although it displays impressive results on relatively noise-free data, we show how their algorithm is susceptible to noise. Using a new quality measure for an included example, namely the error induced on a selected subset of the training data, we gain improved compression rates and insensitivity to noise over the existing approach. Our method also extends to the batch training mode.
1 Introduction

Rosenblatt's Perceptron (Rosenblatt, 1957) efficiently constructs a hyperplane separating labeled examples (x_i, y_i) ∈ ℝ^n × {−1, +1}. Memory requirements are minimal because the Perceptron is an online algorithm: each iteration considers a single example and updates a candidate hyperplane accordingly. Yet it globally converges to a separating hyperplane if such a hyperplane exists.

The Perceptron returns an arbitrary separating hyperplane regardless of the minimal distance, or margin, between the hyperplane and the examples. In contrast, the Generalized Portrait algorithm (Vapnik and Lerner, 1963) explicitly seeks a hyperplane with maximal margin.

All the above methods produce a hyperplane whose normal vector is expressed as a linear combination of examples. Both training and recognition can be carried out with only the knowledge of the dot products x_i' x_j between examples.
Support Vector Machines (Boser, Guyon and Vapnik, 1992) produce maximum margin non-linear separating hypersurfaces by simply replacing the dot products with a Mercer kernel K(x_i, x_j).
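Concretely, a predictor of this form never needs explicit feature vectors: prediction only evaluates kernel values between the stored examples and the query point. A minimal sketch (our own illustration, not code from the paper), using a Gaussian RBF as one example of a Mercer kernel:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Example Mercer kernel K(x_i, x_j): a Gaussian RBF.
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def predict(x, support_x, support_coef, kernel=rbf_kernel):
    # Dual-form prediction: the hyperplane normal is a linear combination
    # of stored examples, so only kernel evaluations are required.
    score = sum(c * kernel(sx, x) for sx, c in zip(support_x, support_coef))
    return 1 if score >= 0 else -1
```

The cost of one prediction is proportional to the number of stored examples, which is why the sparsity of the expansion matters below.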
Neither the Generalized Portrait nor Support Vector Machines (SVM) are online algorithms. A set of training examples must be gathered (and stored in memory) prior to running the algorithm. Several authors have proposed online Perceptron variants that feature both the margin and kernel properties. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) (Li and Long, 2002), the Approximate Maximal Margin Classification Algorithm (ALMA) (Gentile, 2001), and the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003).

The computational requirements¹ of kernel algorithms are closely related to the sparsity of the linear combination defining the separating hypersurface. Each iteration of most Perceptron variants considers a single example and decides whether to insert it into the linear combination. The Budget Perceptron (Crammer, Kandola and Singer, 2004) achieves greater sparsity by also trying to remove some of the examples already present in the linear combination.

This discussion only applies to the case where all examples can be separated by a hyperplane or a hypersurface, that is to say in the absence of noise. Support Vector Machines use Soft Margins (Cortes and Vapnik, 1995) to handle noisy examples at the expense of sparsity. Even in the case where the training examples can be separated, using Soft Margins often improves the test error. Noisy data sharply degrades the performance of all the Perceptron variants discussed above.

We propose a variant of the Perceptron algorithm that addresses this problem by removing examples from the linear combination on the basis of a direct measurement of the training error, in the spirit of Kernel Matching Pursuit (KMP) (Vincent and Bengio, 2000). We show that this algorithm performs well on both noisy and non-noisy data.

¹ and sometimes the generalization properties
2 Learning on a Budget

Figure 1 shows the Budget Perceptron algorithm (Crammer, Kandola and Singer, 2004). Like Rosenblatt's Perceptron, this algorithm ensures that the hyperplane normal w_t can always be expressed as a linear combination of the examples in the set C_t:

    w_t = Σ_{i∈C_t} α_i x_i.    (1)

Whereas Rosenblatt's Perceptron updates the hyperplane normal w_t whenever the current example (x_t, y_t) is misclassified, the Budget Perceptron updates the normal whenever the margin is smaller than a predefined parameter β > 0, that is to say whenever y_t (x_t · w_t) ≤ β.
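One update of this scheme can be sketched as follows, in the linear case for readability (a sketch under our own naming conventions, not the paper's code; the cache stores (x_i, y_i, α_i) triples so that w = Σ_i α_i y_i x_i, and eviction uses the margin-based rule described above):

```python
import numpy as np

def budget_perceptron_step(w, cache, x_t, y_t, beta, p):
    """One Budget Perceptron update (linear case). `cache` is a list of
    (x_i, y_i, alpha_i) triples defining w = sum_i alpha_i * y_i * x_i."""
    if y_t * np.dot(w, x_t) > beta:
        return w, cache            # margin large enough: no update
    if len(cache) >= p:            # cache full: evict one example
        # remove the example that remains recognized with the largest
        # margin once its own contribution is subtracted from w
        def margin_without(item):
            x_j, y_j, a_j = item
            return y_j * np.dot(w - a_j * y_j * x_j, x_j)
        i = max(range(len(cache)), key=lambda k: margin_without(cache[k]))
        x_i, y_i, a_i = cache.pop(i)
        w = w - a_i * y_i * x_i
    cache.append((x_t, y_t, 1.0))  # insert the new example with alpha = 1
    return w + y_t * x_t, cache
```

In the kernel setting, the dot products above would be replaced by kernel evaluations against the cached examples.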
3 Learning on a Tighter Budget

The Budget Perceptron removal process simulates the removal of each example and eventually selects the example that remains recognized with the largest margin. This margin can be viewed as an indirect estimate of the impact of the removal on the overall performance of the hyperplane. Thus, to improve the Budget algorithm we propose to replace this indirect estimate with a direct evaluation of the misclassification rate. We term this algorithm the Tighter Budget Perceptron. The idea is simply to replace the margin-based measure

    i = argmax_{j∈C_t} y_j ((w_{t−1} − α_j y_j x_j) · x_j)    (2)

with the overall error rate on all currently seen examples:

    i = argmin_{j∈C_t} Σ_{k=1}^{t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)).
Intuitively, if an example is well classified (has a large margin), then not only will it be correctly classified when it is removed, as in equation (2), but all other examples will still be well classified as well. On the other hand, if an example is an outlier, then its contribution to the expansion of w is likely to classify points incorrectly. Therefore, when removing an example from the cache one is likely to remove either noise points or well-classified points first. Apart from when the kernel matrix has very low rank, we do indeed observe this behaviour, e.g. in figure 8.

Compared to the original Budget Perceptron, this removal rule is more expensive to compute, as it now requires O(t(p + n)) operations (see section 2). Therefore, in section 3.2 we discuss ways of approximating this computation whilst still retaining its desirable properties. First, however, we discuss the relationship between this algorithm and existing approaches.
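The deletion criterion above can be sketched directly, again in the linear case with the 0/1 loss (our own illustrative naming, not the paper's code):

```python
import numpy as np

def choose_removal(w, cache, seen):
    """Error-based deletion step of the Tighter Budget Perceptron,
    sketched for the linear case. `cache` holds (x_j, y_j, alpha_j)
    triples; `seen` holds the (x_k, y_k) pairs observed so far."""
    def errors_without(item):
        x_j, y_j, a_j = item
        w_minus = w - a_j * y_j * x_j   # simulate removing example j
        # 0/1 loss of the reduced predictor on all seen examples
        return sum(1 for x_k, y_k in seen
                   if y_k * np.dot(w_minus, x_k) <= 0)
    return min(range(len(cache)), key=lambda k: errors_without(cache[k]))
```

Note how a mislabeled cached example tends to be picked first: subtracting its contribution from w typically reduces the error count on the rest of the data.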
3.1 Relation to Other Algorithms

Kernel Matching Pursuit. The idea of kernel matching pursuit (KMP) (Vincent and Bengio, 2000) is to build a predictor w = Σ_i α_i x_i greedily by adding one example at a time, until a pre-chosen cache size p is reached. The example to add is chosen by searching for the example which gives the largest decrease in error rate, usually in terms of the squared loss, but other choices of loss function are possible. While this procedure is for batch learning, and not online learning, clearly this criterion for addition is the same as our criterion for deletion.

There are various variants of KMP; two of them, called basic and backfitting, are described in figure 4. Basic adapts only a single α_i in the insertion step, whereas backfitting adjusts all α_i of previously chosen points. The latter
Input: margin β > 0, cache limit p.
Initialize: set α_t = 0 for all t, w_0 = 0, C_0 = ∅.
Loop: for t = 1, ..., T
  – Get a new instance x_t ∈ ℝ^n.
  – Predict ŷ_t = sign(x_t · w_{t−1}).
  – Get a new label y_t = ±1.
  – If y_t (x_t · w_{t−1}) ≤ β, update:
    1. If |C_t| = p, remove one example:
       (a) Find i = argmin_{j∈C_t} Σ_{k=1}^{t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)).
       (b) Update w_{t−1} ← w_{t−1} − α_i y_i x_i.
       (c) Remove: C_{t−1} ← C_{t−1} \ {i}.
    2. Insert: C_t ← C_{t−1} ∪ {t}.
    3. Set α_t = 1.
    4. Compute w_t ← w_{t−1} + α_t y_t x_t.
Output: H(x) = sign(w_T · x).

Figure 3: The Tighter Budget Perceptron algorithm.
can be computed efficiently if the kernel matrix fits into memory (the algorithm is given in (Vincent and Bengio, 2000)), but is expensive for large datasets. The basic algorithm, on the other hand, does not perform well for classification, as shown in figure 8.

Note that we could adapt our algorithm's addition step to also be based on training error. However, using the Perceptron rule, an example is only added to the cache if it is an error, making it more efficient to compute. Note that variants of KMP have also been introduced that incorporate a deletion as well as an insertion step (Nair, Choudhury and Keane, 2002).
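For concreteness, the basic-KMP greedy step can be sketched as follows for the linear case with squared loss (a sketch under our own naming, not the paper's code): at each of p steps, pick the single example, with a single fitted coefficient, that most reduces the residual.

```python
import numpy as np

def basic_kmp(X, y, p):
    """Greedy basic-KMP sketch (linear case, squared loss): at each step,
    add the one example whose single fitted alpha most reduces the
    residual error on all m training points."""
    m, n = X.shape
    w = np.zeros(n)
    chosen = []
    G = X @ X.T                     # Gram matrix of dot products
    for _ in range(p):
        r = y - X @ w               # current residuals y_j - w . x_j
        best = (None, 0.0, np.inf)
        for k in range(m):
            g = G[k]                # g_j = x_k . x_j
            denom = g @ g
            if denom == 0:
                continue
            alpha = (r @ g) / denom        # closed-form 1-D least squares
            loss = np.sum((r - alpha * g) ** 2)
            if loss < best[2]:
                best = (k, alpha, loss)
        k, alpha, _ = best
        w = w + alpha * X[k]
        chosen.append(k)
    return w, chosen
```

Replacing the Gram matrix of dot products with a kernel matrix gives the kernelized version; backfitting would refit all chosen coefficients jointly at each step instead of only the newest one.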
Condense and Multi-edit. Condense and multi-edit (Devijver and Kittler, 1982) are editing algorithms to "sparsify" k-NN. Condense removes examples that are far from the decision boundary. The Perceptron and the SVM already have their own "condense" step, as such points typically have α_i = 0. The Budget Perceptron is an attempt to make the condense step of the Perceptron more aggressive. Multi-edit attempts to remove all the examples that are on the wrong side of the Bayes decision boundary. One is then left with learning a decision rule for non-overlapping classes with the same Bayes decision boundary as before, but with Bayes risk equal to zero. Note that neither the Perceptron nor the SVM (with soft margin) performs this step², and all incorrectly classified examples become support vec-
² An algorithm designed to combine the multi-edit step with SVMs is developed in (Bakır, Bottou and Weston, 2004).
Input: cache limit p.
Initialize: set α_t = 0 for all t, w_0 = 0, C_0 = ∅.
Loop: for t = 1, ..., p
  – Choose (k, α) = argmin_{α, k=1...m} Σ_{i=1}^{m} (y_i − (w_{t−1} + α x_k) · x_i)².
  – Insert: C_t ← C_{t−1} ∪ {k}.
  – Basic-KMP: set w_t ← w_{t−1} + α x_k.
  – Backfitting-KMP: set w_t ← Σ_{i∈C_t} α_i x_i where
    {α_i} = argmin_{{α_i}} Σ_{j=1}^{m} (y_j − Σ_{i∈C_t} α_i (x_i · x_j))².
Output: H(x) = sign(w_p · x).

Figure 4: The basic and backfitting Kernel Matching Pursuit (KMP) algorithms (Vincent and Bengio, 2000).
tors with α_i > 0. Combining condense and multi-edit together, one tries to keep only the correctly classified examples close to the decision boundary. The Tighter Budget Perceptron is also an approximate way of trying to achieve these two goals, as previously discussed.

Regularization. One could also view the Tighter Budget Perceptron as an approximation of minimizing a regularized objective of the form

    (1/m) Σ_i L(y_i, f(x_i)) + γ ||α||_0,

where the operator ||·||_0 counts the number of nonzero coefficients. That is to say, the fixed-size cache acts as a regularizer to reduce the capacity of the set of functions implementable by the Perceptron rule, the goal of which is to minimize the classification loss. This means that for noisy problems, with a reduced cache size one should see improved generalization error for the Tighter Budget Perceptron compared to a standard Perceptron, and we indeed find experimentally that this is the case.
3.2 Making the per-time-step complexity bounded by a constant independent of t

An important requirement of online algorithms is that their per-time-step complexity should be bounded by a constant independent of t (t being the time-step index), for it is assumed that samples arrive at a constant rate. The cost of each iteration of the algorithm in figure 3 grows linearly with the time t because of the computation in step 1(a), that is, when we choose the example
Input: Q_{t−1}, x_t, s.
  – Q_t ← Q_{t−1} ∪ {x_t}.
  – If |Q_t| > q:
    1. i = argmin_{i∈Q_{t−1}} s_i.
    2. Q_t ← Q_t \ {x_i}.
Output: Q_t.

Figure 5: Algorithm for maintaining a fixed cache size q of relevant examples for estimating the training error. The idea is to maintain a count s_i of the number of times the prediction changes label for example i. One then retains the examples which change label most often.
in the cache to delete which results in the minimal loss over all t observations:

    i = argmin_{j∈C_t} Σ_{k=1}^{t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)).

(Note that this extra computational expense is only incurred when x_t is a margin error which, if the problem has a low error rate, happens only on a small fraction of the iterations.) Nevertheless, it is possible to consider approximations to this equation that make the algorithm independent of t.

We could simply reduce the measure of loss to only the fixed p examples in the cache:

    i = argmin_{j∈C_t} Σ_{k∈C_t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)).    (3)
While this is faster to compute, it may be suboptimal, as we wish to have an estimator of the loss that is as unbiased as possible, and the points selected in the cache are a biased sample. However, they do have the advantage that many of them may be close to the decision surface.

A less biased sample could be chosen simply by picking a fixed number of randomly chosen examples, say q examples, where we choose q in advance. We define this subset as Q_t, where |Q_t| = min(q, t), taken from the t available examples until the cache is filled. Then we compute:

    i = argmin_{j∈C_t} Σ_{k∈Q_t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)).    (4)

The problem with this strategy is that many of these examples may be either very easy to classify or always mislabeled (noise), so this could be wasteful.
We therefore suggest a secondary caching scheme to choose the q examples with which we estimate the error. We wish to keep the examples that are most likely to change label, as these are most likely to give us information about the performance of the classifier. If an example is well classified it will not change label easily when the classifier changes slightly. Likewise, if an example is an outlier it will be consistently incorrectly classified. In fact, the number of examples that are relevant in this context should be relatively small. We therefore keep a count s_i of the number of times example x_i has changed label, divided by the amount of time it has been in the cache. If this value is small then we can consider removing this point from the secondary cache. When we receive a new observation x_t at time t, we thus perform the update given in figure 5 in the case that x_t is a margin error.
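The secondary-cache update can be sketched as follows (our own naming, and consistent with the stated goal of retaining the examples that change label most often, i.e. evicting the example with the smallest flip rate s_i):

```python
def update_secondary_cache(Q, s, t, q):
    """Secondary-cache update in the spirit of figure 5. Q is a list of
    example indices, s[i] is the label-flip rate of example i (flips
    divided by time in the cache), t is the index of the newly observed
    example, and q is the cache limit."""
    Q = Q + [t]                      # tentatively admit the new example
    if len(Q) > q:
        # evict the example whose prediction flips least often: such
        # points (easy examples and consistent outliers) carry the least
        # information about changes in the classifier
        i = min(Q[:-1], key=lambda j: s[j])
        Q.remove(i)
    return Q
```

The flip rates s would be refreshed on each margin error by re-evaluating the predictions of the cached examples and counting sign changes.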
Complexity. The last variant of the Tighter Budget Perceptron has a deletion step costing O(pq + qn) operations, where p is the cache size, q is the secondary cache size, and n is the input dimensionality. This should be compared to O(p) for the Budget Perceptron, whose deletion step is clearly still less expensive.

In the case of relatively noise-free problems with a reasonable cache size p, the deletion step occurs infrequently: by the time the cache becomes full, the perceptron performs well enough to make margin errors rare. The insertion step then dominates the computational cost. In the case of noisy problems, the cheaper deletion step of the Budget Perceptron performs too poorly to be considered a valid alternative. Moreover, as we shall see experimentally, the Tighter Budget Perceptron can achieve the same test error as the Budget Perceptron with a smaller cache size p.
4 Experiments

4.1 2D Experiments, Online Mode

Figure 6 shows a 2D classification problem of 1000 points from two classes separated by a very small margin. We show the decision rule found after one epoch of Perceptron, Budget Perceptron and Tighter Budget Perceptron training, using a linear kernel. Both Budget Perceptron variants produce sparser solutions than the Perceptron, although the Budget Perceptron provides slightly less accurate solutions, even for larger cache sizes. Figure 7 shows a similar dataset, but with overlapping classes. The Perceptron algorithm fails to converge with multiple epochs in this case. After one epoch, a decision rule is obtained with a relatively large number of SVs. Most examples which are on the wrong side of the Bayes decision rule are SVs.³
The Budget Perceptron fails to alleviate this problem. Although one can reduce the cache size to force more sparsity, the decision rule obtained is highly inaccurate. This is due to noise points which are far from the decision boundary
³ Note that support vector machines (not shown) also suffer from a similar deficiency in terms of sparsity: all incorrectly classified examples are SVs.
[Panels: Perceptron (16 SVs); Tighter Budget Ptron (5 SVs); Budget Perceptron (5 SVs); Budget Perceptron (10 SVs)]

Figure 6: Separable toy data in online mode. The Budget Perceptron of (Crammer, Kandola and Singer, 2004) and our Tighter Budget Perceptron provide sparser solutions than the Perceptron algorithm; however, the Budget Perceptron sometimes seems to provide slightly worse solutions.
being the last vectors to be removed from the cache, as can be seen in the example with a cache size of 50.
4.2 2D Experiments, Batch Mode

Figure 8 shows a 2D binary classification problem with the decision surface found by the Tighter Budget Perceptron, Budget Perceptron, Perceptron, SVM, and two flavors of KMP when using the same Gaussian kernel. For the online algorithms we ran multiple epochs over the dataset until convergence. This example gives a simple demonstration of how the Budget and Tighter Budget Perceptrons can achieve a greater level of sparsity than the Perceptron, whilst choosing examples that are close to the margin, in contrast to the KMP algorithms.

Where possible in the fixed-cache algorithms, we fixed the cache sizes to 10 SVs (examples highlighted with squares), as a trained SVM uses this number. The Perceptron is not as sparse as the SVM, and uses 19 SVs. However, both the Budget Perceptron and the Tighter Budget Perceptron still separate the data with 10 SVs. The Perceptron required 14 epochs to converge, the Tighter Budget Perceptron required 22, and the Budget Perceptron required 26 (however, we had to decrease the width of the Gaussian kernel for the last algorithm, as it did not converge for larger widths). Backfitting-KMP provides sparsity as good as the SVM. Basic-KMP does not give zero error even after 400 iterations, and by this time has used 37 SVs (it cycles around the same SVs many times). Note that all the algorithms except KMP choose
[Panels: Perceptron (103 SVs); Tighter Budget Ptron (5 SVs); Budget Perceptron (5 SVs); Budget Perceptron (50 SVs)]

Figure 7: Noisy toy data in online mode. The Perceptron and the Budget Perceptron (independent of cache size) fail when problems contain noise, as demonstrated by a simple 2D problem with Gaussian noise in one dimension. The Tighter Budget Perceptron, however, finds a separation very close to the Bayes rule.

SVs close to the margin.
4.3 Benchmark Datasets

We conducted experiments on three well-known datasets: the US Postal Service (USPS) Database, Waveform and Banana.⁴ A summary of the datasets is given in Table 1. For all three cases (USPS, Waveform, Banana) we chose an RBF kernel; the width values were taken from (Vapnik, 1998) and (Rätsch, Onoda and Müller, 2001) and are optimal for SVMs, the latter two chosen by cross validation. We used the same widths for all other algorithms, even though these may be suboptimal for them. For USPS we considered the two-class problem of separating digit zero from the rest. We also constructed a second experiment by corrupting USPS with label noise: we randomly flipped the labels of 10% of the data in the training set to observe the effect on the performance of the algorithms tested (for SVMs, we report the test error with the optimal value of C chosen on the test set, in this case C = 10).
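The label-noise corruption above is straightforward to reproduce. The following sketch (our own illustration, not code from the paper) flips a given fraction of binary ±1 labels:

```python
import numpy as np

def flip_labels(y, fraction=0.1, seed=0):
    """Return a copy of binary labels y (+1/-1) with a random
    fraction of entries sign-flipped, as in the noisy USPS setup."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(fraction * len(y))
    # pick n_flip distinct examples and flip their labels
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = -y_noisy[idx]
    return y_noisy

y = np.ones(1000, dtype=int)
y_noisy = flip_labels(y, fraction=0.1)
print((y_noisy != y).sum())  # 100 labels flipped
```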
We tested the Tighter Budget Perceptron, Budget Perceptron and Perceptron in an online setting by allowing only one pass through the training data. Obviously this puts these algorithms at a disadvantage compared to batch algorithms such as SVMs, which can look at training points multiple times. Nevertheless, we compare with SVMs as our gold standard measure of performance. The results are given in figures 9-12. In all cases for the Perceptron variants we use β = 0, and for the Tighter Budget Perceptron we employ the algorithm given in figure 3 without the computational efficiency techniques given in section 3.2. We show the test error against different fixed cache sizes p, resulting in p support vectors. We report averages over 5 runs for USPS and 10 runs for Waveform and Banana. The error bars indicate the maximum and minimum values over all runs.

⁴ USPS is available at ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data. Waveform and Banana are available at http://mlg.anu.edu.au/~raetsch/data/.

Figure 8: Nonlinear Toy Data in Batch Mode. Panels: Perceptron (19 SVs), Tighter Budget Ptron (10 SVs), Budget Perceptron (10 SVs), basic-KMP (37 SVs), backfitting-KMP (10 SVs), SVM (10 SVs). We compare various algorithms on a simple nonlinear dataset following (Vincent and Bengio, 2000). See the text for more explanation.
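The per-cache-size curves with min/max error bars can be summarized as follows; this is a hypothetical helper matching the reporting described above, not the authors' code:

```python
import numpy as np

def summarize_runs(errors):
    """Mean test error per cache size, with min/max error bars
    over repeated runs. `errors` has shape (n_runs, n_cache_sizes)."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean(axis=0), errors.min(axis=0), errors.max(axis=0)

# e.g. two runs at two cache sizes
mean, lo, hi = summarize_runs([[0.10, 0.20], [0.30, 0.40]])
```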
The results show the Tighter Budget Perceptron yielding similar test error performance to the SVM but with considerably fewer SVs. The Budget Perceptron fares less well, with the test error degrading considerably faster for decreasing cache size than for the Tighter Budget Perceptron, particularly on the noisy problems. Note that if the cache size is large enough, both the Budget and Tighter Budget Perceptrons converge to the Perceptron solution; hence the two curves meet at their furthest point. However, while the test error immediately degrades for the Budget Perceptron, for the Tighter Budget Perceptron the test error in fact improves over the Perceptron test error on both of the noisy problems. This is to be expected, as the standard Perceptron cannot deal with overlapping classes.

Table 1: Datasets used in the experiments. The hyperparameters are for an SVM with RBF kernel, taken from (Vapnik, 1998) and (Rätsch, Onoda and Müller, 2001).

Name      Inputs  Train  Test  σ     C
USPS      256     7329   2000  128   1000
Waveform  21      4000   1000  3.16  1
Banana    2       4000   1300  0.7   316

Figure 9: USPS Digit 0 vs Rest. Test error against number of SVs (0 to 250) for the Budget Perceptron, Tighter Budget Perceptron, Tighter Budget Ptron (10 epochs), SVM, Perceptron, and Perceptron (10 epochs).

Figure 10: USPS Digit 0 vs Rest + 10% noise. Test error against number of SVs (log scale, 7 to 2981) for the Budget Perceptron, Tighter Budget Perceptron, SVM, Perceptron, and Perceptron (10 epochs).
In figure 9 we also show the Tighter Budget Perceptron run for (up to) 10 epochs on USPS (typically the algorithm converges before 10 epochs). For small cache sizes the performance is similar to that of a single epoch. For larger cache sizes, the maximum cache size converges to the same performance as a Perceptron with 10 epochs, which in this case is slightly better than any cache size achievable with 1 epoch.
4.4 Faster Error Computation

In this section we explore the error evaluation cache strategies described previously in section 3.2. We compared the following strategies for evaluating error rates: (i) using all points seen so far, (ii) using only the support vectors in the cache, i.e. equation (3), (iii) using a random cache of size q, i.e. equation (4), and (iv) using a cache of size q of the examples that flip label most often, i.e. figure 5.

Figure 11: Waveform Dataset. Test error against number of SVs (200 to 1000) for the Budget Perceptron, Tighter Budget Perceptron, SVM, Perceptron, and Perceptron (10 epochs).

Figure 12: Banana Dataset. Test error against number of SVs (200 to 800) for the same algorithms.
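To make strategy (iii) concrete, a removal step that scores each cached SV by the error its deletion induces on a fixed evaluation cache might look like the following sketch. The function names and the RBF helper are ours; the paper's exact procedure is the one given in figure 3.

```python
import numpy as np

def rbf(X, Z, sigma):
    """RBF kernel matrix between rows of X and rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def best_sv_to_remove(alpha, sv_x, sv_y, eval_x, eval_y, sigma):
    """Index of the cached SV whose removal gives the lowest error
    rate on the evaluation cache (an equation (4)-style criterion;
    a simplified sketch, not the paper's exact code)."""
    K = rbf(eval_x, sv_x, sigma)          # q x p kernel block
    f = K @ (alpha * sv_y)                # current predictions
    errors = []
    for i in range(len(alpha)):
        # predictions with SV i's contribution subtracted out
        f_wo = f - K[:, i] * alpha[i] * sv_y[i]
        errors.append(np.mean(np.sign(f_wo) != eval_y))
    return int(np.argmin(errors))
```

With a random evaluation cache of size q this costs O(qp) kernel values per removal, rather than O(np) when using all n points seen so far.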
Figure 13 compares these methods on the USPS dataset for fixed cache sizes of support vectors p = 35 and p = 85. The results are averaged over 40 runs (the error bars show the standard error). We compare different numbers of evaluation vectors q. The results show that a considerable computational speedup can be gained by any of the methods compared to keeping all training vectors. Keeping a cache of the examples that change label most often performs better than a random cache for small cache sizes. Using the support vectors themselves also performs better than the random strategy for the same cache size. This makes sense, as support vectors are themselves likely to be examples that can change label easily, making this strategy similar to the cache of examples that most often change label. Nevertheless, it can be worthwhile to keep a small number of support vectors for fast evaluation, but a larger set of error evaluation vectors for when an error is encountered. We suggest choosing q and p such that a cache of the qp kernel calculations fits in memory at all times.
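For instance, with double-precision kernel values the largest feasible q for a given p and memory budget is easy to bound; this is a back-of-the-envelope helper of our own, with assumed constants, not a recipe from the paper:

```python
def max_eval_cache_size(p, mem_bytes, bytes_per_entry=8):
    """Largest evaluation cache size q such that the q x p block of
    precomputed kernel values (8-byte doubles by default) fits in
    mem_bytes. Illustrative only; the constants are assumptions."""
    return mem_bytes // (p * bytes_per_entry)

# e.g. p = 85 SVs and a 64 MB budget for the kernel block
q = max_eval_cache_size(85, 64 * 2**20)
```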
Figure 13: Error Rates for Different Error Measure Cache Strategies on USPS, digit zero versus the rest. Test error against the error-evaluation cache size Q for a random cache, a cache of the most-flipped examples, and a cache of only the SVs. The number of SVs is fixed to 35 in the top row and 85 in the bottom row; the left-hand plots are log plots of the right-hand ones. The different strategies change the number of points used to evaluate the error rate for the SV cache deletion process.
5 Summary

We have introduced a sparse online algorithm that is a variant of the Perceptron. It attempts to deal with some of the computational issues of using kernel algorithms in an online setting by restricting the number of SVs one can use. It also allows methods such as the Perceptron to deal with overlapping classes and noise. It can be considered an improvement over the Budget Perceptron of (Crammer, Kandola and Singer, 2004) because it is empirically sparser than that method for the same error rate, and can handle noisy problems while that method cannot. Our method tends to keep only points that are close to the margin and that lie on the correct side of the Bayes decision rule. This occurs because other examples are less useful for describing a decision rule with low error rate, which is the quality measure we use for inclusion in the cache.

However, the cost of this is that the quality measure used to evaluate training points is more expensive to compute than for the Budget Perceptron (and in this sense the name "Tighter Budget Perceptron" is slightly misleading). Nevertheless, we believe various approximations exist to speed up this method whilst retaining its useful properties. We explored some strategies in this vein by introducing a small secondary cache of evaluation vectors, with positive results. Future work should investigate further ways to improve on this; some first suggestions are to only look at a subset of points to remove on each iteration, or to remove the worst n points every n iterations.
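The batched variant suggested above might be sketched as follows; the function and score names are hypothetical, not an algorithm from the paper:

```python
import heapq

def remove_worst(cache, error_without, n):
    """Batched cache pruning: drop the n support vectors whose
    removal raises the evaluation error the least, i.e. those with
    the smallest 'error without this SV' scores. A sketch of the
    'worst n every n iterations' idea mentioned as future work."""
    drop = set(heapq.nsmallest(n, range(len(cache)),
                               key=lambda i: error_without[i]))
    return [sv for i, sv in enumerate(cache) if i not in drop]
```

Amortized over n iterations, this keeps the per-step cost of the quality measure closer to that of the Budget Perceptron's margin criterion.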
Acknowledgements

Part of this work was funded by NSF grant CCR-0325463.
References

Bakır, G., Bottou, L., and Weston, J. (2004). Breaking SVM Complexity with Cross-Training. In Advances in Neural Information Processing Systems 17 (NIPS 2004). MIT Press, Cambridge, MA. To appear.

Boser, B. E., Guyon, I. M., and Vapnik, V. (1992). A Training Algorithm for Optimal Margin Classifiers. In Haussler, D., editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA. ACM Press.

Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning, 20:273–297.

Crammer, K., Kandola, J., and Singer, Y. (2004). Online Classification on a Budget. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

Crammer, K. and Singer, Y. (2003). Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research, 3:951–991.

Devijver, P. and Kittler, J. (1982). Pattern Recognition, A Statistical Approach. Prentice Hall, Englewood Cliffs.

Gentile, C. (2001). A New Approximate Maximal Margin Classification Algorithm. Journal of Machine Learning Research, 2:213–242.

Kecman, V., Vogt, M., and Huang, T. (2003). On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Machines. In Proceedings of the European Symposium on Artificial Neural Networks, ESANN'2003, pages 215–222, Evere, Belgium. D-side Publications.

Li, Y. and Long, P. (2002). The Relaxed Online Maximum Margin Algorithm. Machine Learning, 46:361–387.

Nair, P. B., Choudhury, A., and Keane, A. J. (2002). Some Greedy Learning Algorithms for Sparse Regression and Classification with Mercer Kernels. Journal of Machine Learning Research, 3:781–801.

Rätsch, G., Onoda, T., and Müller, K.-R. (2001). Soft Margins for AdaBoost. Machine Learning, 42(3):287–320. Also: NeuroCOLT Technical Report 1998-021.

Rosenblatt, F. (1957). The Perceptron: A Perceiving and Recognizing Automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.

Vapnik, V. and Lerner, A. (1963). Pattern Recognition using Generalized Portrait Method. Automation and Remote Control, 24:774–780.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.

Vincent, P. and Bengio, Y. (2000). Kernel Matching Pursuit. Technical Report 1179, Département d'Informatique et Recherche Opérationnelle, Université de Montréal. Presented at Snowbird'00.