Very Large SVM Training using Core Vector Machines

[Figure 2 panels: log-scale plots of (a) CPU time in seconds, (b) number of SV's (including the core-set size), and (c) testing error, each against the size of the training set (1K–1M), for L2-SVM (CVM), L2-SVM (LIBSVM), L2-SVM (RSVM), L1-SVM (LIBSVM), and L1-SVM (SimpleSVM).]

Figure 2: Results on the checkerboard data set (except for the CVM, all the other implementations had to terminate early because of insufficient memory and/or excessively long training time). Note that the CPU time, the number of support vectors, and the size of the training set are in log scale.

2002), we aim at separating class 2 from the other classes. 1%–90% of the whole data set (with a maximum of 522,911 patterns) is used for training, while the remainder is used for testing. We set β = 10000 for the Gaussian kernel. Preliminary studies show that the number of support vectors is over ten thousand; consequently, RSVM and SimpleSVM cannot be run on our machine. Similarly, for the low-rank approximation, preliminary studies show that thousands of basis vectors are required for a good approximation. Therefore, only the two LIBSVM implementations are compared with the CVM here.
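For concreteness, the following minimal sketch shows a Gaussian kernel evaluation with this β. It assumes the common convention k(x, z) = exp(−‖x − z‖² / β), which the excerpt does not spell out; the function name and NumPy usage are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def gaussian_kernel(x, z, beta=10000.0):
    """Gaussian (RBF) kernel value for two pattern vectors.

    Assumes the convention k(x, z) = exp(-||x - z||^2 / beta);
    the text only states beta = 10000, so this form is an assumption.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / beta)

# Toy usage: nearby patterns give a kernel value close to 1.
print(gaussian_kernel([1.0, 2.0], [1.5, 2.5]))  # ~= exp(-0.5 / 10000)
```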

Figure 3 shows that CVM is, again, as accurate as the others. Note that when the training set is small, more training patterns bring in additional information useful for classification, and so the number of core vectors increases with the training set size. However, after processing around 100K patterns, both the time and space requirements of CVM begin to exhibit constant scaling with the training set size. With hindsight, one might simply sample 100K training patterns and hope to obtain comparable results.⁹ However, for satisfactory classification performance, different problems require samples of different sizes, and CVM has the important advantage that the required sample size does not have to be pre-specified. Without such prior knowledge, random sampling gives poor testing results, as demonstrated in (Lee & Mangasarian, 2001).

5.3 Relatively Small Data Sets: UCI Adult Data¹⁰

Following (Platt, 1999), we use training sets with up to 32,562 patterns. As can be seen in Figure 4, CVM is still among the most accurate methods. However, as this data set is relatively small, more training patterns do carry more classification information. Hence, as discussed in Section 5.2, the number of iterations, the core-set size and, consequently, the CPU time all increase with the number of training patterns.

⁹ In fact, we tried both LIBSVM implementations on a random sample of 100K training patterns, but their testing accuracies are inferior to that of CVM.

¹⁰ http://research.microsoft.com/users/jplatt/smo.html


From another perspective, recall that the worst-case core-set size is 2/ε, independent of m (Section 4.3). For the value of ε = 10⁻⁶ used here, 2/ε = 2 × 10⁶. Although we have seen that the actual core-set size is often much smaller than this worst-case value, when m ≪ 2/ε the number of core vectors can still depend on m. Moreover, as observed in Section 5.1, the CVM is slower than the more sophisticated LIBSVM on these smaller data sets.
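To make the arithmetic above concrete, a small sketch (a plain illustration of the quoted numbers, not code from the paper) compares the worst-case bound against training set sizes in this range:

```python
# Worst-case core-set size from Section 4.3: 2/epsilon, independent
# of the training set size m.
epsilon = 1e-6
bound = 2 / epsilon  # = 2e6 core vectors in the worst case

for m in (1_000, 10_000, 32_562):  # sizes in the Adult data range
    # All these sizes satisfy m << 2/epsilon, so the actual number
    # of core vectors can still grow with m.
    print(f"m = {m:>6}: m / (2/eps) = {m / bound:.4f}")
```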

6 Conclusion

In this paper, we exploit the “approximateness” in SVM implementations. We formulate kernel methods as equivalent MEB problems, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. The proposed CVM procedure is simple, and does not require the sophisticated heuristics used in other decomposition methods. Moreover, despite its simplicity, CVM has small asymptotic time and space complexities. In particular, for a fixed ε, its asymptotic time complexity is linear in the training set size m, while its space complexity is independent of m. When probabilistic speedup is used, it even has constant asymptotic time and space complexities for a fixed ε, independent of the training set size m. Experimentally, on large data sets, it is much faster and produces far fewer support vectors (and thus allows faster testing) than existing methods. On the other hand, on relatively small data sets where m ≪ 2/ε, SMO can be faster. CVM can also be used for other kernel methods, such as support vector regression; details will be reported elsewhere.
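For intuition, here is a minimal sketch of the generic core-set strategy for the (1+ε)-approximate MEB, in the spirit of Bădoiu and Clarkson (2002): repeatedly step the center toward the current farthest point, which plays the role of the next core vector. This plain Euclidean version omits everything CVM-specific (the kernel-induced feature space, the QP solved on the core-set, and the probabilistic speedup), so it illustrates the idea rather than the CVM algorithm itself:

```python
import numpy as np

def meb_approx(points, epsilon=1e-2):
    """Simple (1+eps)-approximate minimum enclosing ball, in the spirit
    of Badoiu & Clarkson (2002): start from an arbitrary point and move
    the center 1/(i+1) of the way toward the current farthest point.
    About 1/eps^2 iterations suffice for a (1+eps)-approximation."""
    pts = np.asarray(points, dtype=float)
    center = pts[0].copy()
    for i in range(1, int(np.ceil(1.0 / epsilon ** 2)) + 1):
        # The farthest point from the current center acts as the
        # next "core vector".
        dists = np.linalg.norm(pts - center, axis=1)
        far = pts[np.argmax(dists)]
        center += (far - center) / (i + 1)
    radius = np.linalg.norm(pts - center, axis=1).max()
    return center, radius

# Toy usage: 1,000 random 2-D points.
rng = np.random.default_rng(0)
c, r = meb_approx(rng.standard_normal((1000, 2)), epsilon=0.05)
print(c, r)
```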

References

Bădoiu, M., & Clarkson, K. (2002). Optimal core-sets for balls. DIMACS Workshop on Computational Geometry.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.
