21.10.2013 Views

Very Large SVM Training using Core Vector Machines


on using (10), (11) and αℓ = 0. Thus, the pattern chosen<br />

in (13) also makes the most progress towards maximizing<br />

the dual objective. This subset selection heuristic has been<br />

commonly used by various decomposition algorithms (e.g.,<br />

(Chang & Lin, 2004; Joachims, 1999; Platt, 1999)).<br />

4.1.4 Finding the MEB<br />

At each iteration of step 4, we find the MEB by using the<br />

QP formulation in Section 3.2. As the size |St| of the<br />

core-set is much smaller than m in practice (Section 5),<br />

the computational complexity of each QP sub-problem is<br />

much lower than solving the whole QP. Besides, as only<br />

one core vector is added at each iteration, efficient rank-one<br />

update procedures (Cauwenberghs & Poggio, 2001; Vishwanathan<br />

et al., 2003) can also be used. The cost then becomes<br />

quadratic rather than cubic. In the current implementation<br />

(Section 5), we use SMO. As only one point is<br />

added each time, the new QP is just a slight perturbation of<br />

the original. Hence, by using the MEB solution obtained<br />

from the previous iteration as starting point (warm start),<br />

SMO can often converge in a small number of iterations.<br />
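The warm-start idea can be illustrated with a minimal sketch. It is not the implementation described above: a simple Frank-Wolfe update on the MEB dual is swapped in for SMO, and the function names `solve_meb_dual` and `core_set_meb`, the initialization from two far-apart points, and the inner tolerance `eps / 2` are all assumptions made for illustration. Each time a point joins the core-set, the previous dual solution is reused with a zero weight for the new point. The sketch assumes `eps > 0`.

```python
import math
import random


def solve_meb_dual(K, alpha, tol=1e-4, max_iter=200_000):
    """Frank-Wolfe on the MEB dual: max sum_i a_i K_ii - a'Ka over the simplex.

    K is the kernel matrix of the core-set and alpha the starting point
    (the previous solution padded with 0 gives a warm start).
    Returns (alpha, squared radius, number of update steps).
    """
    n = len(K)
    alpha = list(alpha)
    steps = 0
    while True:
        g = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        aKa = sum(alpha[i] * g[i] for i in range(n))
        R2 = sum(alpha[i] * K[i][i] for i in range(n)) - aKa
        # Squared feature-space distance from the current centre to each point.
        d2 = [K[i][i] - 2.0 * g[i] + aKa for i in range(n)]
        j = max(range(n), key=d2.__getitem__)
        if d2[j] <= (1.0 + tol) ** 2 * R2 or d2[j] <= 1e-15 or steps >= max_iter:
            return alpha, max(R2, 0.0), steps
        lam = (d2[j] - R2) / (2.0 * d2[j])  # exact line search towards point j
        alpha = [(1.0 - lam) * a for a in alpha]
        alpha[j] += lam
        steps += 1


def core_set_meb(X, kernel, eps):
    """Grow a core-set until every point lies within (1 + eps) * R of the centre."""
    def pair_dist2(a, b):
        return kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)

    # Assumed initialization: two points that are far apart.
    za = max(X, key=lambda x: pair_dist2(X[0], x))
    zb = max(X, key=lambda x: pair_dist2(za, x))
    S, alpha, R2 = [za, zb], [0.5, 0.5], 0.0
    for _ in range(len(X)):
        K = [[kernel(a, b) for b in S] for a in S]
        alpha, R2, _steps = solve_meb_dual(K, alpha, tol=eps / 2.0)
        aKa = sum(ai * K[i][j] * aj
                  for i, ai in enumerate(alpha) for j, aj in enumerate(alpha))

        def centre_dist2(x):
            cross = sum(a * kernel(s, x) for a, s in zip(alpha, S))
            return kernel(x, x) - 2.0 * cross + aKa

        far = max(X, key=centre_dist2)
        if centre_dist2(far) <= (1.0 + eps) ** 2 * R2:
            break  # no point falls outside the (1 + eps)-ball
        S.append(far)
        alpha.append(0.0)  # warm start: the new point enters with zero weight
    return S, alpha, math.sqrt(R2)


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.uniform(-0.6, 0.6), rng.uniform(-0.6, 0.6)) for _ in range(60)]
    pts += [(math.cos(a), math.sin(a))
            for a in (math.pi / 2, 7 * math.pi / 6, 11 * math.pi / 6)]
    linear = lambda a, b: a[0] * b[0] + a[1] * b[1]
    S, alpha, R = core_set_meb(pts, linear, eps=0.01)
    print(len(S), R)
```

In the demo data, three points sit on the unit circle at the vertices of an equilateral triangle and all other points lie strictly inside, so the exact MEB radius is 1 and the returned radius can be compared against it.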

4.2 Convergence to (Approximate) Optimality<br />

First, consider ε = 0. The proof in (Bădoiu & Clarkson,<br />
2002) does not apply, as it requires ε > 0. Nevertheless, since<br />
the number of core vectors increases by one at each iteration<br />
and the training set is finite, CVM must terminate<br />
in a finite number (say, τ) of iterations. With ε = 0,<br />
MEB(Sτ) is an enclosing ball for all the points on termination.<br />
Moreover, because Sτ is a subset of the whole training set,<br />
MEB(Sτ) cannot be larger than the MEB of the<br />
whole set. Hence, MEB(Sτ) must also be the exact MEB<br />
of the whole (φ̃-transformed) training set. In other words,<br />
when ε = 0, CVM outputs the exact solution of the kernel<br />
problem.<br />

Now, consider ε > 0. Assume that the algorithm terminates<br />
at the τth iteration. Then<br />
\[ R_\tau \le r_{\mathrm{MEB}(S)} \le (1+\epsilon) R_\tau \tag{14} \]<br />
by definition. Recall that the optimal primal objective $p^*$<br />
of the kernel problem in Section 3.2.1 (or 3.2.2) is equal to<br />
the optimal dual objective $d_2^*$ in (7) (or (9)), which in turn<br />
is related to the optimal dual objective $d_1^* = r^2_{\mathrm{MEB}(S)}$ in (2)<br />
by (6). Together with (14), we can then bound $p^*$ as<br />
\[ R_\tau^2 \le p^* + \tilde\kappa \le (1+\epsilon)^2 R_\tau^2. \tag{15} \]<br />
Hence,<br />
\[ \max\left( \frac{R_\tau^2}{p^* + \tilde\kappa}, \frac{p^* + \tilde\kappa}{R_\tau^2} \right) \le (1+\epsilon)^2, \]<br />
and thus CVM is a (1 + ε)²-approximation algorithm. This also holds with<br />
high probability when probabilistic speedup is used.<br />
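The step from (14) to the approximation ratio can be spelled out as follows. This is a sketch that assumes $R_\tau > 0$ and that, as the form of (15) suggests, the relations cited above give $r^2_{\mathrm{MEB}(S)} = p^* + \tilde\kappa$.

```latex
\begin{align*}
R_\tau \le r_{\mathrm{MEB}(S)} \le (1+\epsilon) R_\tau
  &\;\Longrightarrow\;
  R_\tau^2 \le r^2_{\mathrm{MEB}(S)} = p^* + \tilde\kappa \le (1+\epsilon)^2 R_\tau^2, \\
\frac{p^* + \tilde\kappa}{R_\tau^2} &\le (1+\epsilon)^2
  \quad \text{(upper bound, divided by } R_\tau^2\text{)}, \\
\frac{R_\tau^2}{p^* + \tilde\kappa} &\le 1 \le (1+\epsilon)^2
  \quad \text{(lower bound, divided by } p^* + \tilde\kappa\text{)}.
\end{align*}
```

Both ratios are therefore bounded by $(1+\epsilon)^2$, which is what the maximum in the text expresses.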

As mentioned in Section 1, practical SVM implementations<br />
also output only approximate solutions. Typically,<br />
a parameter similar to our ε is required at termination. For<br />
example, in SMO and SVMlight (Joachims, 1999), training<br />
stops when the KKT conditions are fulfilled within ε.<br />
Experience with these software packages indicates that near-optimal<br />
solutions are often good enough in practical applications.<br />
Moreover, it can also be shown that when CVM terminates,<br />
all the points satisfy loose KKT conditions as in<br />
SMO and SVMlight.<br />

4.3 Time and Space Complexities<br />

Existing decomposition algorithms cannot guarantee the<br />
number of iterations and consequently the overall time<br />
complexity (Chang & Lin, 2004). In this section, we show<br />
how this can be obtained for CVM. In the following, we assume<br />
that a plain QP implementation, which takes O(m³)<br />
time and O(m²) space for m patterns, is used for the MEB<br />
sub-problem in Section 4.1.4. Moreover, we assume that<br />
each kernel evaluation takes constant time.<br />

As proved in (Bădoiu & Clarkson, 2002), CVM converges<br />
in at most 2/ε iterations. In other words, the total number<br />
of iterations, and consequently the size of the final core-set,<br />
is τ = O(1/ε). In practice, it has often been observed<br />
that the size of the core-set is much smaller than this worst-case<br />
theoretical upper bound (Kumar et al., 2003). This<br />
will also be corroborated by our experiments in Section 5.<br />

Consider first the case where probabilistic speedup is not<br />
used in Section 4.1.2. As only one core vector is added at<br />
each iteration, |St| = t + 2. Initialization takes O(m) time<br />
while distance computations in steps 2 and 3 take O((t +<br />
2)² + tm) = O(t² + tm) time. Finding the MEB in step 4<br />
takes O((t + 2)³) = O(t³) time, and the other operations<br />
take constant time. Hence, the tth iteration takes O(tm +<br />
t³) time, and the overall time for τ = O(1/ε) iterations is<br />
\[ \sum_{t=1}^{\tau} O(tm + t^3) = O(\tau^2 m + \tau^4) = O\left( \frac{m}{\epsilon^2} + \frac{1}{\epsilon^4} \right), \]<br />
which is linear in m for a fixed ε.<br />
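The summation can be checked numerically with a small sketch. The cost model `t * m + t ** 3` simply drops all hidden constants, and the function names are hypothetical, so the numbers are illustrative rather than running-time predictions. The closed form uses the standard identities for the sum of the first τ integers and of their cubes.

```python
def total_cost(m, tau):
    """Sum the per-iteration cost model t*m + t**3 for t = 1..tau."""
    return sum(t * m + t ** 3 for t in range(1, tau + 1))


def closed_form(m, tau):
    """Same sum via sum(t) = tau(tau+1)/2 and sum(t^3) = (tau(tau+1)/2)^2."""
    tri = tau * (tau + 1) // 2
    return tri * m + tri ** 2


if __name__ == "__main__":
    tau = 200  # e.g. the worst-case bound 2/eps for eps = 0.01
    for m in (10**4, 10**5, 10**6):
        print(m, total_cost(m, tau))
```

For a fixed `tau`, the part of the cost that depends on `m` grows in proportion to `m`, while the `tau**4` term stays constant, which matches the O(τ²m + τ⁴) form.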

As for space⁵, since only the core vectors are involved<br />
in the QP, the space complexity for the tth iteration is<br />
O(|St|²). As τ = O(1/ε), the space complexity for the<br />
whole procedure is O(1/ε²), which is independent of m<br />
for a fixed ε.<br />

On the other hand, when probabilistic speedup is used, initialization<br />
only takes O(1) time while distance computations<br />
in steps 2 and 3 take O((t + 2)²) = O(t²) time. Time<br />
for the other operations remains the same. Hence, the tth iteration<br />
takes O(t³) time and the whole procedure takes<br />
\[ \sum_{t=1}^{\tau} O(t^3) = O(\tau^4) = O\left( \frac{1}{\epsilon^4} \right). \]<br />
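Why the dependence on m disappears can be illustrated with a sketch. It assumes that the probabilistic speedup of Section 4.1.2 examines a fixed-size random sample instead of all m points when looking for the furthest point. The sample size of 59 and the "top 5%" target are illustrative assumptions here, and the function names are hypothetical.

```python
import random


def success_probability(sample_size, top_fraction=0.05):
    """Chance that at least one sampled point is among the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** sample_size


def sampled_furthest(candidates, dist2, sample_size=59, rng=random):
    """Return the furthest point within a random sample, not the whole set."""
    k = min(sample_size, len(candidates))
    sample = rng.sample(candidates, k)
    return max(sample, key=dist2)


if __name__ == "__main__":
    print(success_probability(59))
    points = list(range(100_000))
    # dist2 is a stand-in here: the "distance" of a point is its own value.
    print(sampled_furthest(points, dist2=lambda x: x, rng=random.Random(0)))
```

Because the sample size is a constant, the number of distance evaluations per iteration no longer grows with m, which is consistent with the O(t²) cost quoted for steps 2 and 3.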

⁵ As the patterns may be stored out of core, we ignore the<br />
O(m) space required for storing the m patterns.
