Very Large SVM Training using Core Vector Machines

[Figure 2 panels: log-scale plots of (a) CPU time in seconds, (b) number of SV's (including the core-set size), and (c) testing error, each against the size of the training set (1K–1M), for L2-SVM (CVM), L2-SVM (LIBSVM), L2-SVM (RSVM), L1-SVM (LIBSVM), and L1-SVM (SimpleSVM).]

Figure 2: Results on the checkerboard data set (except for the CVM, all the other implementations had to terminate early because of insufficient memory and/or excessively long training time). Note that the CPU time, the number of support vectors, and the size of the training set are in log scale.

2002), we aim at separating class 2 from the other classes. 1%–90% of the whole data set (with a maximum of 522,911 patterns) is used for training, while the remainder is used for testing. We set β = 10000 for the Gaussian kernel. Preliminary studies show that the number of support vectors is over ten thousand; consequently, RSVM and SimpleSVM cannot be run on our machine. Similarly, for the low-rank approximation, preliminary studies show that thousands of basis vectors are required for a good approximation. Therefore, only the two LIBSVM implementations are compared with the CVM here.
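For concreteness, the following minimal sketch shows a Gaussian kernel evaluation with this β. It assumes the common convention k(x, z) = exp(−‖x − z‖² / β), which the excerpt does not spell out; the function name and NumPy usage are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def gaussian_kernel(x, z, beta=10000.0):
    """Gaussian (RBF) kernel value for two pattern vectors.

    Assumes the convention k(x, z) = exp(-||x - z||^2 / beta);
    the text only states beta = 10000, so this form is an assumption.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / beta)

# Toy usage: nearby patterns give a kernel value close to 1.
print(gaussian_kernel([1.0, 2.0], [1.5, 2.5]))  # ~= exp(-0.5 / 10000)
```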

Figure 3 shows that CVM is, again, as accurate as the others. Note that when the training set is small, more training patterns bring in additional information useful for classification, and so the number of core vectors increases with the training set size. However, after processing around 100K patterns, both the time and space requirements of CVM begin to exhibit constant scaling with the training set size. With hindsight, one might simply sample 100K training patterns and hope to obtain comparable results.⁹ However, for satisfactory classification performance, different problems require samples of different sizes, and CVM has the important advantage that the required sample size does not have to be pre-specified. Without such prior knowledge, random sampling gives poor testing results, as demonstrated in (Lee & Mangasarian, 2001).

5.3 Relatively Small Data Sets: UCI Adult Data¹⁰

Following (Platt, 1999), we use training sets with up to 32,562 patterns. As can be seen in Figure 4, CVM is still among the most accurate methods. However, as this data set is relatively small, more training patterns do carry more classification information. Hence, as discussed in Section 5.2, the number of iterations, the core-set size and, consequently, the CPU time all increase with the number of training patterns.

⁹ In fact, we tried both LIBSVM implementations on a random sample of 100K training patterns, but their testing accuracies are inferior to that of CVM.

¹⁰ http://research.microsoft.com/users/jplatt/smo.html


From another perspective, recall that the worst-case core-set size is 2/ε, independent of m (Section 4.3). For the value of ε = 10⁻⁶ used here, 2/ε = 2 × 10⁶. Although we have seen that the actual core-set size is often much smaller than this worst-case value, when m ≪ 2/ε the number of core vectors can still depend on m. Moreover, as observed in Section 5.1, the CVM is slower than the more sophisticated LIBSVM on these smaller data sets.
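To make the arithmetic above concrete, a small sketch (a plain illustration of the quoted numbers, not code from the paper) compares the worst-case bound against training set sizes in this range:

```python
# Worst-case core-set size from Section 4.3: 2/epsilon, independent
# of the training set size m.
epsilon = 1e-6
bound = 2 / epsilon  # = 2e6 core vectors in the worst case

for m in (1_000, 10_000, 32_562):  # sizes in the Adult data range
    # All these sizes satisfy m << 2/epsilon, so the actual number
    # of core vectors can still grow with m.
    print(f"m = {m:>6}: m / (2/eps) = {m / bound:.4f}")
```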

6 Conclusion

In this paper, we exploit the “approximateness” in SVM implementations. We formulate kernel methods as equivalent MEB problems, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. The proposed CVM procedure is simple, and does not require the sophisticated heuristics used in other decomposition methods. Moreover, despite its simplicity, CVM has small asymptotic time and space complexities. In particular, for a fixed ε, its asymptotic time complexity is linear in the training set size m, while its space complexity is independent of m. When probabilistic speedup is used, it even has constant asymptotic time and space complexities for a fixed ε, independent of the training set size m. Experimentally, on large data sets, it is much faster and produces far fewer support vectors (and thus allows faster testing) than existing methods. On the other hand, on relatively small data sets where m ≪ 2/ε, SMO can be faster. CVM can also be used for other kernel methods, such as support vector regression; details will be reported elsewhere.
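For intuition, here is a minimal sketch of the generic core-set strategy for the (1+ε)-approximate MEB, in the spirit of Bădoiu and Clarkson (2002): repeatedly step the center toward the current farthest point, which plays the role of the next core vector. This plain Euclidean version omits everything CVM-specific (the kernel-induced feature space, the QP solved on the core-set, and the probabilistic speedup), so it illustrates the idea rather than the CVM algorithm itself:

```python
import numpy as np

def meb_approx(points, epsilon=1e-2):
    """Simple (1+eps)-approximate minimum enclosing ball, in the spirit
    of Badoiu & Clarkson (2002): start from an arbitrary point and move
    the center 1/(i+1) of the way toward the current farthest point.
    About 1/eps^2 iterations suffice for a (1+eps)-approximation."""
    pts = np.asarray(points, dtype=float)
    center = pts[0].copy()
    for i in range(1, int(np.ceil(1.0 / epsilon ** 2)) + 1):
        # The farthest point from the current center acts as the
        # next "core vector".
        dists = np.linalg.norm(pts - center, axis=1)
        far = pts[np.argmax(dists)]
        center += (far - center) / (i + 1)
    radius = np.linalg.norm(pts - center, axis=1).max()
    return center, radius

# Toy usage: 1,000 random 2-D points.
rng = np.random.default_rng(0)
c, r = meb_approx(rng.standard_normal((1000, 2)), epsilon=0.05)
print(c, r)
```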

References

Bădoiu, M., & Clarkson, K. (2002). Optimal core-sets for balls. DIMACS Workshop on Computational Geometry.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.
