
Online (and Offline) on an Even Tighter Budget

Jason Weston
NEC Laboratories America, Princeton, NJ, USA
jasonw@nec-labs.com

Antoine Bordes
NEC Laboratories America, Princeton, NJ, USA
antoine@nec-labs.com

Leon Bottou
NEC Laboratories America, Princeton, NJ, USA
leonb@nec-labs.com

Abstract

We develop a fast online kernel algorithm for classification which can be viewed as an improvement over the one suggested by (Crammer, Kandola and Singer, 2004), titled "Online Classification on a Budget". In that previous work, the authors introduced an on-the-fly compression of the number of examples used in the prediction function, using the size of the margin as a quality measure. Although displaying impressive results on relatively noise-free data, we show how their algorithm is susceptible to noisy problems. Utilizing a new quality measure for an included example, namely the error induced on a selected subset of the training data, we gain improved compression rates and insensitivity to noise over the existing approach. Our method is also extendable to the batch training mode case.

1 Introduction

Rosenblatt's Perceptron (Rosenblatt, 1957) efficiently constructs a hyperplane separating labeled examples (x_i, y_i) ∈ R^n × {−1, +1}. Memory requirements are minimal because the Perceptron is an online algorithm: each iteration considers a single example and updates a candidate hyperplane accordingly. Yet it globally converges to a separating hyperplane if such a hyperplane exists.

The Perceptron returns an arbitrary separating hyperplane regardless of the minimal distance, or margin, between the hyperplane and the examples. In contrast, the Generalized Portrait algorithm (Vapnik and Lerner, 1963) explicitly seeks a hyperplane with maximal margin.

All the above methods produce a hyperplane whose normal vector is expressed as a linear combination of examples. Both training and recognition can be carried out with knowledge of only the dot products x_i' x_j between examples.

Support Vector Machines (Boser, Guyon and Vapnik, 1992) produce maximum margin non-linear separating hypersurfaces by simply replacing the dot products with a Mercer kernel K(x_i, x_j).

Neither the Generalized Portrait nor the Support Vector Machine (SVM) is an online algorithm. A set of training examples must be gathered (and stored in memory) prior to running the algorithm. Several authors have proposed online Perceptron variants that feature both the margin and kernel properties. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) (Li and Long, 2002), the Approximate Maximal Margin Classification Algorithm (ALMA) (Gentile, 2001), and the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003).

The computational requirements¹ of kernel algorithms are closely related to the sparsity of the linear combination defining the separating hyper-surface. Each iteration of most Perceptron variants considers a single example and decides whether to insert it into the linear combination. The Budget Perceptron (Crammer, Kandola and Singer, 2004) achieves greater sparsity by also trying to remove some of the examples already present in the linear combination.

This discussion only applies to the case where all examples can be separated by a hyperplane or a hypersurface, that is to say in the absence of noise. Support Vector Machines use Soft Margins (Cortes and Vapnik, 1995) to handle noisy examples at the expense of sparsity. Even in the case where the training examples can be separated, using Soft Margins often improves the test error. Noisy data sharply degrades the performance of all the Perceptron variants discussed above.

We propose a variant of the Perceptron algorithm that addresses this problem by removing examples from the linear combination on the basis of a direct measurement of the training error, in the spirit of Kernel Matching Pursuit (KMP) (Vincent and Bengio, 2000). We show that this algorithm has good performance on both noisy and non-noisy data.

¹ and sometimes the generalization properties


2 Learning on a Budget

Figure 1 shows the Budget Perceptron algorithm (Crammer, Kandola and Singer, 2004). Like Rosenblatt's Perceptron, this algorithm ensures that the hyperplane normal w_t can always be expressed as a linear combination of the examples in the set C_t:

    w_t = Σ_{i∈C_t} α_i x_i.   (1)

Whereas Rosenblatt's Perceptron updates the hyperplane normal w_t whenever the current example (x_t, y_t) is misclassified, the Budget Perceptron updates the normal whenever the margin is smaller than a predefined parameter β > 0, that is to say whenever y_t(x_t · w_t) ≤ β.
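As a concrete illustration of this update (a minimal hypothetical Python sketch, not the authors' implementation; the dot product could equally be a kernel evaluation), the expansion of equation (1) is kept as a list of cached indices with coefficients, and an example is inserted whenever its margin falls below β:

```python
import numpy as np

def decision_value(x, cache, alphas, X, y):
    # w . x with w expanded over the cached examples as in equation (1);
    # the labels y[i] appear explicitly because alpha_t is set to 1 on insertion
    return sum(alphas[i] * y[i] * np.dot(X[i], x) for i in cache)

def budget_step(t, cache, alphas, X, y, beta=0.0, p=50):
    """One online step of a Budget-style Perceptron (removal step omitted)."""
    if y[t] * decision_value(X[t], cache, alphas, X, y) <= beta:
        if len(cache) >= p:
            pass  # cache full: one cached example must first be removed (see section 3)
        cache.append(t)
        alphas[t] = 1.0
```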


3 Learning on a Tighter Budget

The Budget Perceptron removal process simulates the removal of each example and eventually selects the example that remains recognized with the largest margin. This margin can be viewed as an indirect estimate of the impact of the removal on the overall performance of the hyperplane. Thus, to improve the Budget algorithm we propose to replace this indirect estimate by a direct evaluation of the misclassification rate. We term this algorithm the Tighter Budget Perceptron. The idea is simply to replace the measure of margin

    i = argmax_{j∈C_t} { y_j (w_{t−1} − α_j y_j x_j) · x_j }   (2)

with the overall error rate on all currently seen examples:

    i = argmin_{j∈C_t} { Σ_{k=1}^{t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)) }.

Intuitively, if an example is well classified (has a large margin), then not only will it be correctly classified when it is removed, as in equation (2), but all other examples will still be well classified as well. On the other hand, if an example is an outlier, then its contribution to the expansion of w is likely to classify points incorrectly. Therefore, when removing an example from the cache, one is likely to remove either noise points or well-classified points first. Apart from when the kernel matrix has very low rank, we do indeed observe this behaviour, e.g. in figure 8.
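To make the contrast explicit, here is a hedged sketch (hypothetical Python over a precomputed kernel matrix K and label vector y; not code from the paper) of the two removal rules, the margin criterion of equation (2) and the error criterion above:

```python
import numpy as np

def removal_by_margin(cache, alphas, y, K):
    # Budget rule, equation (2): remove the cached example that is still
    # recognized with the largest margin once its own term is taken out
    def margin_after_removal(j):
        f_j = sum(alphas[i] * y[i] * K[i, j] for i in cache if i != j)
        return y[j] * f_j
    return max(cache, key=margin_after_removal)

def removal_by_error(cache, alphas, y, K, seen):
    # Tighter Budget rule: remove the example whose deletion induces the
    # fewest errors on the examples `seen` so far
    def errors_without(j):
        f = lambda k: sum(alphas[i] * y[i] * K[i, k] for i in cache if i != j)
        return sum(np.sign(f(k)) != y[k] for k in seen)
    return min(cache, key=errors_without)
```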

Compared to the original Budget Perceptron, this removal rule is more expensive to compute: it now requires O(t(p + n)) operations (see section 2). Therefore, in section 3.2 we discuss ways of approximating this computation whilst still retaining its desirable properties. First, however, we discuss the relationship between this algorithm and existing approaches.

3.1 Relation to Other Algorithms

Kernel Matching Pursuit. The idea of kernel matching pursuit (KMP) (Vincent and Bengio, 2000) is to build a predictor w = Σ_i α_i x_i greedily by adding one example at a time, until a pre-chosen cache size p is reached. The example to add is chosen by searching for the example which gives the largest decrease in error rate, usually in terms of squared loss, but other choices of loss function are possible. While this procedure is for batch learning, and not online learning, clearly this criterion for addition is the same as our criterion for deletion.

There are several variants of KMP; two of them, called basic and backfitting, are described in figure 4. Basic adapts only a single α_i in the insertion step, whereas backfitting adjusts all α_i of previously chosen points.

Input: Margin β > 0, Cache Limit p.
Initialize: Set α_t = 0 for all t, w_0 = 0, C_0 = ∅.
Loop: For t = 1, ..., T
  – Get a new instance x_t ∈ R^n.
  – Predict ŷ_t = sign(x_t · w_{t−1}).
  – Get the new label y_t ∈ {−1, +1}.
  – If y_t(x_t · w_{t−1}) ≤ β update:
      1. If |C_{t−1}| = p remove one example:
         (a) Find i = argmin_{j∈C_{t−1}} { Σ_{k=1}^{t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)) }.
         (b) Update w_{t−1} ← w_{t−1} − α_i y_i x_i.
         (c) Remove: C_{t−1} ← C_{t−1} \ {i}.
      2. Insert: C_t ← C_{t−1} ∪ {t}.
      3. Set α_t = 1.
      4. Compute w_t ← w_{t−1} + α_t y_t x_t.
Output: H(x) = sign(w_T · x).

Figure 3: The Tighter Budget Perceptron algorithm.
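The following is a minimal sketch of figure 3 in hypothetical Python, assuming a kernel function `kernel` and in-memory arrays X, y; it is meant to show the control flow, not to be an efficient or faithful reimplementation:

```python
import numpy as np

def tighter_budget_perceptron(X, y, kernel, beta=0.0, p=50):
    cache, alphas = [], {}

    def f(x):  # decision value with w expanded over the cached examples
        return sum(alphas[i] * y[i] * kernel(X[i], x) for i in cache)

    for t in range(len(X)):
        if y[t] * f(X[t]) <= beta:          # margin error: update
            if len(cache) >= p:              # cache full: remove one example
                def errors_without(j):
                    errs = 0
                    for k in range(t + 1):   # loss over all examples seen so far
                        f_k = sum(alphas[i] * y[i] * kernel(X[i], X[k])
                                  for i in cache if i != j)
                        errs += (np.sign(f_k) != y[k])
                    return errs
                cache.remove(min(cache, key=errors_without))
            cache.append(t)                  # insert the new example
            alphas[t] = 1.0
    return cache, alphas
```

A practical version would cache kernel rows rather than recomputing kernel(X[i], X[k]) inside the removal loop; section 3.2 discusses cheaper approximations of that loop.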

The backfitting variant can be computed efficiently if the kernel matrix can be fit into memory (the algorithm is given in (Vincent and Bengio, 2000)), but is expensive for large datasets. The basic algorithm, on the other hand, does not perform well for classification, as shown in figure 8.

Note that we could adapt our algorithm's addition step to also be based on training error. However, using the Perceptron rule, an example is only added to the cache if it is an error, making it more efficient to compute. Note that variants of KMP have also been introduced that incorporate a deletion as well as an insertion step (Nair, Choudhury and Keane, 2002).

Condense and Multi-edit. Condense and multi-edit (Devijver and Kittler, 1982) are editing algorithms to "sparsify" k-NN. Condense removes examples that are far from the decision boundary. The Perceptron and the SVM already have their own "condense" step, as such points typically have α_i = 0. The Budget Perceptron is an attempt to make the condense step of the Perceptron more aggressive.

Multi-edit attempts to remove all the examples that are on the wrong side of the Bayes decision boundary. One is then left with learning a decision rule on non-overlapping classes with the same Bayes decision boundary as before, but with Bayes risk equal to zero. Note that neither the Perceptron nor the SVM (with soft margin) performs this step².

² An algorithm designed to combine the multi-edit step into SVMs is developed in (Bakır, Bottou and Weston, 2004).


Input: Cache Limit p.
Initialize: Set α_t = 0 for all t, w_0 = 0, C_0 = ∅.
Loop: For t = 1, ..., p
  – Choose (k, α) = argmin_{α, j=1,...,m} Σ_{i=1}^{m} (y_i − (w_{t−1} + α x_j) · x_i)².
  – Insert: C_t ← C_{t−1} ∪ {k}.
  – Basic-KMP: Set w_t ← w_{t−1} + α x_k.
  – Backfitting-KMP: Set w_t ← Σ_{i∈C_t} α_i x_i, where {α_i} = argmin_{{α_i}} Σ_{j=1}^{m} (y_j − Σ_{i∈C_t} α_i (x_i · x_j))².
Output: H(x) = sign(w_T · x).

Figure 4: The Basic and Backfitting Kernel Matching Pursuit (KMP) Algorithms (Vincent and Bengio, 2000).
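For comparison, here is a sketch of the basic-KMP insertion step (hypothetical Python over a precomputed kernel matrix K; the closed-form α per candidate column is our own algebra for the squared loss in figure 4, not an expression taken from the paper):

```python
import numpy as np

def basic_kmp(K, y, p):
    """Greedy basic-KMP: add p columns of the kernel matrix K, one at a time."""
    m = len(y)
    coef = np.zeros(m)                   # expansion coefficients alpha_i
    chosen = []
    for _ in range(p):
        r = y - K @ coef                 # current residual y_i - w . x_i
        num = K.T @ r                    # sum_i r_i K[i, j] for each candidate j
        den = np.sum(K ** 2, axis=0)     # sum_i K[i, j]^2
        gain = num ** 2 / den            # squared-loss reduction per candidate
        k = int(np.argmax(gain))
        coef[k] += num[k] / den[k]       # basic-KMP: adjust only alpha_k
        chosen.append(k)
    return coef, chosen
```

Nothing prevents the same example from being selected repeatedly, which is consistent with the basic-KMP behaviour reported in section 4.2; backfitting-KMP would instead refit all coefficients in `chosen` by least squares at each step.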

All incorrectly classified examples become support vectors with α_i > 0. Combining condense and multi-edit together, one tries to keep only the correctly classified examples close to the decision boundary. The Tighter Budget Perceptron is also an approximate way of trying to achieve these two goals, as previously discussed.

Regularization. One could also view the Tighter Budget Perceptron as an approximation to minimizing a regularized objective of the form

    (1/m) Σ_i L(y_i, f(x_i)) + γ ||α||_0,

where the operator || · ||_0 counts the number of nonzero coefficients. That is to say, the fixed-size cache acts as a regularizer that reduces the capacity of the set of functions implementable by the Perceptron rule, whose goal is to minimize the classification loss. This means that for noisy problems, with a reduced cache size, one should see improved generalization error for the Tighter Budget Perceptron compared to a standard Perceptron, and we indeed find experimentally that this is the case.
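For concreteness, the objective can be evaluated as below (a minimal sketch; the 0-1 loss and the value of γ are our illustrative assumptions):

```python
import numpy as np

def l0_regularized_objective(alphas, preds, y, gamma=0.01):
    # (1/m) sum_i L(y_i, f(x_i)) + gamma * ||alpha||_0,
    # with L taken here as the 0-1 classification loss
    zero_one = np.mean(np.sign(preds) != y)
    return zero_one + gamma * np.count_nonzero(alphas)
```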

3.2 Making the per-time-step complexity bounded by a constant independent of t

An important requirement of online algorithms is that their per-time-step complexity should be bounded by a constant independent of t (t being the time-step index), for it is assumed that samples arrive at a constant rate. The algorithm in figure 3 grows linearly in the time t because of the computation in step 1(a), that is, when we choose the example

in the cache to delete which results in the minimal loss over all t observations:

    i = argmin_{j∈C_t} { Σ_{k=1}^{t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)) }.

(Note that this extra computational expense is only invoked when x_t is a margin error, which, if the problem has a low error rate, happens only on a small fraction of the iterations.) Nevertheless, it is possible to consider approximations to this equation that make the algorithm independent of t.

Input: Q_{t−1}, x_t, s.
  – Q_t ← Q_{t−1} ∪ {x_t}.
  – If |Q_t| > q:
      1. i = argmin_{i∈Q_{t−1}} s_i
      2. Q_t ← Q_t \ {x_i}
Output: Q_t.

Figure 5: Algorithm for maintaining a fixed cache size q of relevant examples for estimating the training error. The idea is to maintain a count s_i of the number of times the prediction changes label for example i. One then retains the examples which change labels most often.

We could simply reduce the measure of loss to only the fixed p examples in the cache:

    i = argmin_{j∈C_t} { Σ_{k∈C_t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)) }.   (3)

While this is faster to compute, it may be suboptimal, as we wish to have an estimator of the loss that is as unbiased as possible, and the points that are selected in the cache are a biased sample. However, they do have the advantage that many of them may be close to the decision surface.

A more unbiased sample could be chosen simply by picking a fixed number of randomly chosen examples, say q examples, where we choose q in advance. We define this subset as Q_t, where |Q_t| = min(q, t), taken from the t available examples until the cache is filled. Then we compute:

    i = argmin_{j∈C_t} { Σ_{k∈Q_t} L(y_k, sign((w_{t−1} − α_j y_j x_j) · x_k)) }.   (4)
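Both approximations can be written as the same subset-restricted removal rule (a hypothetical Python sketch, with `subset` standing for C_t in equation (3) or Q_t in equation (4)):

```python
import numpy as np

def removal_on_subset(cache, alphas, y, K, subset):
    """Pick the cached example whose removal gives fewest errors on `subset`."""
    def errors_without(j):
        errs = 0
        for k in subset:                  # C_t for eq. (3), Q_t for eq. (4)
            f_k = sum(alphas[i] * y[i] * K[i, k] for i in cache if i != j)
            errs += (np.sign(f_k) != y[k])
        return errs
    return min(cache, key=errors_without)
```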

The problem with this strategy is that many of these examples may be either very easy to classify or always mislabeled (noise), so this could be wasteful.

We therefore suggest a secondary caching scheme to choose the q examples with which we estimate the error. We wish to keep the examples that are most likely to change label, as these are most likely to give us information about the performance of the classifier. If an example is well classified it will not change label easily when the classifier changes slightly. Likewise, if an example is an outlier it will be consistently incorrectly classified. In fact, the number of examples that are relevant in this context should be relatively small. We therefore keep a count s_i of the number of times example x_i has changed label, divided by the amount of time it has been in the cache. If this value is small then we can consider removing this point from the secondary cache. When we receive a new observation x_t at time t, we perform the update given in figure 5 whenever x_t is a margin error.
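A minimal sketch of this bookkeeping (hypothetical Python; the data structures Q, flips, age and last_label are our own naming, and the flip rate divides the flip count by the time spent in the cache as described above):

```python
import numpy as np

def update_secondary_cache(Q, flips, age, last_label, preds, t, q=100):
    """Maintain a secondary cache Q (set of indices) of at most q examples.

    preds[i] is the current decision value for example i, flips[i] counts how
    often its predicted label has changed while in Q, and age[i] how long it
    has been there.
    """
    Q.add(t)                                   # the new margin error enters the cache
    for i in list(Q):
        age[i] = age.get(i, 0) + 1
        label = np.sign(preds[i])
        if i in last_label and label != last_label[i]:
            flips[i] = flips.get(i, 0) + 1     # prediction flipped since last update
        last_label[i] = label
    if len(Q) > q:                             # evict the example that flips least often
        Q.discard(min(Q, key=lambda i: flips.get(i, 0) / age[i]))
```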

Complexity. The last variant of the Tighter Budget Perceptron has a deletion step cost of O(pq + qn) operations, where p is the cache size, q is the secondary cache size, and n is the input dimensionality. This should be compared to O(p) for the Budget Perceptron, whose deletion step is clearly still less expensive.

In the case of relatively noise-free problems with a reasonable cache size p, the deletion step occurs infrequently: by the time the cache becomes full, the perceptron performs well enough to make margin errors rare. The insertion step then dominates the computational cost. In the case of noisy problems, the cheaper deletion step of the Budget Perceptron performs too poorly to be considered a valid alternative. Moreover, as we shall see experimentally, the Tighter Budget Perceptron can achieve the same test error as the Budget Perceptron for a smaller cache size p.

4 Experiments

4.1 2D Experiments - Online mode

Figure 6 shows a 2D classification problem of 1000 points from two classes separated by a very small margin. We show the decision rule found after one epoch of Perceptron, Budget Perceptron and Tighter Budget Perceptron training, using a linear kernel. Both Budget Perceptron variants produce sparser solutions than the Perceptron, although the Budget Perceptron provides slightly less accurate solutions, even for larger cache sizes. Figure 7 shows a similar dataset, but with overlapping classes. The Perceptron algorithm will fail to converge in this case, even with multiple epochs. After one epoch a decision rule is obtained with a relatively large number of SVs. Most examples which are on the wrong side of the Bayes decision rule are SVs.³

The Budget Perceptron fails to alleviate this problem. Although one can reduce the cache size to force more sparsity, the decision rule obtained is highly inaccurate. This is due to noise points which are far from the decision boundary being the last vectors to be removed from the cache, as can be seen in the example with a cache size of 50.

³ Note that support vector machines, not shown, also suffer from a similar deficiency in terms of sparsity: all incorrectly classified examples are SVs.

Figure 6: Separable Toy Data in Online Mode. (Panels: Perceptron (16 SVs), Tighter Budget Ptron (5 SVs), Budget Perceptron (5 SVs), Budget Perceptron (10 SVs).) The Budget Perceptron of (Crammer, Kandola and Singer, 2004) and our Tighter Budget Perceptron provide sparser solutions than the Perceptron algorithm; however, the Budget Perceptron seems sometimes to provide slightly worse solutions.

4.2 2D Experiments - Batch mode

Figure 8 shows a 2D binary classification problem with the decision surface found by the Tighter Budget Perceptron, Budget Perceptron, Perceptron, SVM, and two flavors of KMP when using the same Gaussian kernel. For the online algorithms we ran multiple epochs over the dataset until convergence. This example gives a simple demonstration of how the Budget and Tighter Budget Perceptrons can achieve a greater level of sparsity than the Perceptron, whilst choosing examples that are close to the margin, in contrast to the KMP algorithm.

Where possible in the fixed-cache algorithms, we fixed the cache sizes to 10 SVs (examples highlighted with squares), as a trained SVM uses this number. The Perceptron is not as sparse as the SVM, and uses 19 SVs. However, both the Budget Perceptron and the Tighter Budget Perceptron still separate the data with 10 SVs. The Perceptron required 14 epochs to converge, the Tighter Budget Perceptron required 22, and the Budget Perceptron required 26 (however, we had to decrease the width of the Gaussian kernel for the last algorithm as it did not converge for larger widths). Backfitting-KMP provides as good sparsity as the SVM. Basic-KMP does not give zero error even after 400 iterations, and by this time has used 37 SVs (it cycles around the same SVs many times). Note that all the algorithms except KMP choose SVs close to the margin.


Figure 7: Noisy Toy Data in Online Mode. (Panels: Perceptron (103 SVs), Tighter Budget Ptron (5 SVs), Budget Perceptron (5 SVs), Budget Perceptron (50 SVs).) The Perceptron and Budget Perceptron (independent of cache size) fail when problems contain noise, as demonstrated by a simple 2D problem with Gaussian noise in one dimension. The Tighter Budget Perceptron, however, finds a separation very close to the Bayes rule.

4.3 Benchmark Datasets

We conducted experiments on three well-known datasets: the US Postal Service (USPS) Database, Waveform and Banana.⁴ A summary of the datasets is given in Table 1. For all three cases (USPS, Waveform, Banana) we chose an RBF kernel; the width values were taken from (Vapnik, 1998) and (Rätsch, Onoda and Müller, 2001) and are chosen to be optimal for SVMs, the latter two by cross validation. We used the same widths for all other algorithms, even though these may be suboptimal in those cases. For USPS we considered the two-class problem of separating digit zero from the rest. We also constructed a second experiment by corrupting USPS with label noise. We randomly flipped the labels of 10% of the data in the training set to observe the effects on the performance of the algorithms tested (for SVMs, we report the test error with the optimal value of C chosen on the test set, in this case C = 10).

We tested the Tighter Budget Perceptron, Budget Perceptron and Perceptron in an online setting by only allowing one pass through the training data. Obviously this puts these algorithms at a disadvantage compared to batch algorithms such as SVMs, which can look at training points multiple times. Nevertheless, we compare with SVMs as our gold standard measure of performance.

⁴ USPS is available at ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data. Waveform and Banana are available at http://mlg.anu.edu.au/~raetsch/data/.

Figure 8: Nonlinear Toy Data in Batch Mode. (Panels: Perceptron (19 SVs), Tighter Budget Ptron (10 SVs), Budget Perceptron (10 SVs), basic-KMP (37 SVs), backfitting-KMP (10 SVs), SVM (10 SVs).) We compare various algorithms on a simple nonlinear dataset following (Vincent and Bengio, 2000). See the text for more explanation.

The results are

given in figures 9-12. In all cases for the Perceptron variants we use β = 0, and for the Tighter Budget Perceptron we employ the algorithm given in figure 3 without the computational efficiency techniques of section 3.2. We show the test error against different fixed cache sizes p, resulting in p support vectors. We report averages over 5 runs for USPS and 10 runs for Waveform and Banana. The error bars indicate the maximum and minimum values over all runs.

Name      Inputs  Train  Test  σ     C
USPS      256     7329   2000  128   1000
Waveform  21      4000   1000  3.16  1
Banana    2       4000   1300  0.7   316

Table 1: Datasets used in the experiments. The hyperparameters are for an SVM with RBF kernel, taken from (Vapnik, 1998) and (Rätsch, Onoda and Müller, 2001).


Figure 9: USPS Digit 0 vs Rest. (Test error against number of SVs for the Budget Perceptron, Tighter Budget Perceptron, Tighter Budget Ptron (10 epochs), SVM, Perceptron, and Perceptron (10 epochs).)

Figure 10: USPS Digit 0 vs Rest + 10% noise. (Test error against number of SVs for the Budget Perceptron, Tighter Budget Perceptron, SVM, Perceptron, and Perceptron (10 epochs).)

The results show the Tighter Budget Perceptron yielding similar test error performance to the SVM but with considerably fewer SVs. The Budget Perceptron fares less well, with the test error degrading considerably faster for decreasing cache size compared to the Tighter Budget Perceptron, particularly on the noisy problems. Note that if the cache size is large enough, both the Budget and Tighter Budget Perceptrons converge to the Perceptron solution, hence the two curves meet at their furthest point. However, while the test error immediately degrades for the Budget Perceptron, for the Tighter Budget Perceptron the test error in fact improves over the Perceptron test error in both of the noisy problems. This should be expected, as the standard Perceptron cannot deal with overlapping classes.

In figure 9 we also show the Tighter Budget Perceptron with (up to) 10 epochs on USPS (typically the algorithm converges before 10 epochs). The performance is similar to that with only 1 epoch for small cache sizes. For larger cache sizes, the maximum cache size converges to the same performance as a Perceptron with 10 epochs, which in this case gives slightly better performance than any cache size possible with 1 epoch.

Figure 11: Waveform Dataset. (Test error against number of SVs for the Budget Perceptron, Tighter Budget Perceptron, SVM, Perceptron, and Perceptron (10 epochs).)

Figure 12: Banana Dataset. (Test error against number of SVs for the same algorithms.)

4.4 Faster Error Computation

In this section we explore the error-evaluation cache strategies described previously in section 3.2. We compared the following strategies for evaluating error rates: (i) using all points seen so far, (ii) using only the support vectors in the cache, i.e. equation (3), (iii) using a random cache of size q, i.e. equation (4), and (iv) using a cache of size q of the examples that flip label most often, i.e. figure 5.

Figure 13 compares these methods on the USPS dataset for fixed cache sizes of support vectors p = 35 and p = 85. The results are averaged over 40 runs (the error bars show the standard error).

We compare different numbers of evaluation vectors q. The results show that considerable computational speedup can be gained by any of the methods compared to keeping all training vectors. Keeping a cache of examples that change label most often performs better than a random cache for small cache sizes. Using the support vectors themselves also performs better than the random strategy for the same cache size. This makes sense, as support vectors themselves are likely to be examples that can change label easily, making this strategy similar to the cache of examples that most often change label. Nevertheless, it can be worthwhile to have a small number of support vectors for fast evaluation, but a larger set of error-evaluation vectors when an error is encountered. We suggest choosing q and p such that a cache of the qp kernel calculations fits in memory at all times.
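As a rough sizing aid (a sketch with assumed numbers, not figures from the paper), one can pick the largest q such that the q × p block of cached kernel values fits in a chosen memory budget:

```python
def max_secondary_cache_size(p, memory_bytes, bytes_per_value=8):
    """Largest q such that a q x p cache of kernel values fits in memory_bytes."""
    return memory_bytes // (p * bytes_per_value)

# e.g. with p = 85 support vectors and a 64 MB budget of double-precision values:
q = max_secondary_cache_size(85, 64 * 1024 * 1024)
```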


Figure 13: Error Rates for Different Error Measure Cache Strategies on USPS, digit zero versus the rest. (Test error against error-evaluation cache size Q for three strategies: random cache, most flipped in cache, and only SVs in cache.) The number of SVs is fixed to 35 in the top row and 85 in the bottom row; the left-hand plots are log plots of the right-hand ones. The different strategies change the number of points used to evaluate the error rate for the SV cache deletion process.

5 Summary

We have introduced a sparse online algorithm that is a variant of the Perceptron. It attempts to deal with some of the computational issues of using kernel algorithms in an online setting by restricting the number of SVs one can use. It also allows methods such as the Perceptron to deal with overlapping classes and noise. It can be considered an improvement over the Budget Perceptron of (Crammer, Kandola and Singer, 2004) because it is empirically sparser than that method for the same error rate, and can handle noisy problems while that method cannot. Our method tends to keep only points that are close to the margin and that lie on the correct side of the Bayes decision rule. This occurs because other examples are less useful for describing a decision rule with low error rate, which is the quality measure we use for inclusion into the cache.

However, the cost of this is that the quality measure used to evaluate training points is more expensive to compute than for the Budget Perceptron (and in this sense the name "Tighter Budget Perceptron" is slightly misleading). We believe there exist various approximations to speed up this method whilst retaining its useful properties. We explored some strategies in this vein by introducing a small secondary cache of evaluation vectors, with positive results. Future work should investigate further ways to improve on this, some first suggestions being to only look at a subset of points to remove on each iteration, or to remove the worst n points every n iterations.

Acknowledgements

Part of this work was funded by NSF grant CCR-0325463.

References

Bakır, G., Bottou, L., and Weston, J. (2004). Breaking SVM Complexity with Cross-Training. In Advances in Neural Information Processing Systems 17 (NIPS 2004). MIT Press, Cambridge, MA. To appear.

Boser, B. E., Guyon, I. M., and Vapnik, V. (1992). A Training Algorithm for Optimal Margin Classifiers. In Haussler, D., editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA. ACM Press.

Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning, 20:273-297.

Crammer, K., Kandola, J., and Singer, Y. (2004). Online Classification on a Budget. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

Crammer, K. and Singer, Y. (2003). Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research, 3:951-991.

Devijver, P. and Kittler, J. (1982). Pattern Recognition, A Statistical Approach. Prentice Hall, Englewood Cliffs.

Gentile, C. (2001). A New Approximate Maximal Margin Classification Algorithm. Journal of Machine Learning Research, 2:213-242.

Kecman, V., Vogt, M., and Huang, T. (2003). On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Machines. In Proceedings of the European Symposium on Artificial Neural Networks, ESANN'2003, pages 215-222, Evere, Belgium. D-side Publications.

Li, Y. and Long, P. (2002). The Relaxed Online Maximum Margin Algorithm. Machine Learning, 46:361-387.

Nair, P. B., Choudhury, A., and Keane, A. J. (2002). Some Greedy Learning Algorithms for Sparse Regression and Classification with Mercer Kernels. Journal of Machine Learning Research, 3:781-801.

Rätsch, G., Onoda, T., and Müller, K.-R. (2001). Soft Margins for AdaBoost. Machine Learning, 42(3):287-320. Also: NeuroCOLT Technical Report 1998-021.

Rosenblatt, F. (1957). The Perceptron: A Perceiving and Recognizing Automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.

Vapnik, V. and Lerner, A. (1963). Pattern Recognition using Generalized Portrait Method. Automation and Remote Control, 24:774-780.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.

Vincent, P. and Bengio, Y. (2000). Kernel Matching Pursuit. Technical Report 1179, Département d'Informatique et Recherche Opérationnelle, Université de Montréal. Presented at Snowbird'00.
