Very Large SVM Training using Core Vector Machines


4 Core Vector Machine (CVM)

After formulating the kernel method as an MEB problem, we obtain a transformed kernel $\tilde{k}$, together with the associated feature space $\tilde{\mathcal{F}}$, mapping $\tilde{\varphi}$ and constant $\tilde{\kappa} = \tilde{k}(z, z)$. To solve this kernel-induced MEB problem, we adopt the approximation algorithm$^3$ described in the proof of Theorem 2.2 in (Bădoiu & Clarkson, 2002). As mentioned in Section 2, the idea is to incrementally expand the ball by including the point furthest away from the current center. In the following, we denote the core-set, the ball's center and radius at the $t$th iteration by $S_t$, $c_t$ and $R_t$ respectively. Also, the center and radius of a ball $B$ are denoted by $c_B$ and $r_B$. Given an $\epsilon > 0$, the CVM then works as follows:

1. Initialize $S_0$, $c_0$ and $R_0$.
2. Terminate if there is no $\tilde{\varphi}(z)$ (where $z$ is a training point) falling outside the $(1+\epsilon)$-ball $B(c_t, (1+\epsilon)R_t)$.
3. Find $z$ such that $\tilde{\varphi}(z)$ is furthest away from $c_t$. Set $S_{t+1} = S_t \cup \{z\}$.
4. Find the new $\mathrm{MEB}(S_{t+1})$ from (5) and set $c_{t+1} = c_{\mathrm{MEB}(S_{t+1})}$ and $R_{t+1} = r_{\mathrm{MEB}(S_{t+1})}$ using (3).
5. Increment $t$ by 1 and go back to step 2.

In the sequel, points that are added to the core-set will be called core vectors. Details of each of the above steps will be described in Section 4.1. Despite its simplicity, CVM has an approximation guarantee (Section 4.2) and also provably small time and space complexities (Section 4.3).

4.1 Detailed Procedure

4.1.1 Initialization

Bădoiu and Clarkson (2002) simply used an arbitrary point $z \in S$ to initialize $S_0 = \{z\}$. However, a good initialization may lead to fewer updates and so we follow the scheme in (Kumar et al., 2003). We start with an arbitrary point $z \in S$ and find $z_a \in S$ that is furthest away from $z$ in the feature space $\tilde{\mathcal{F}}$. Then, we find another point $z_b \in S$ that is furthest away from $z_a$ in $\tilde{\mathcal{F}}$. The initial core-set is then set to be $S_0 = \{z_a, z_b\}$. Obviously, $\mathrm{MEB}(S_0)$ (in $\tilde{\mathcal{F}}$) has center $c_0 = \frac{1}{2}(\tilde{\varphi}(z_a) + \tilde{\varphi}(z_b))$. On using (3), we thus have $\alpha_a = \alpha_b = \frac{1}{2}$ and all the other $\alpha_i$'s are zero.
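The two-pass furthest-point initialization described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian kernel `rbf_kernel`, its `gamma` value, the helper names and the toy data are all assumptions, and the kernel merely stands in for the transformed kernel $\tilde{k}$. Feature-space distances are obtained from kernel values only, via $\|\varphi(a) - \varphi(b)\|^2 = k(a,a) + k(b,b) - 2k(a,b)$.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel; an assumed stand-in for the transformed kernel."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def feature_dist_sq(kernel, a, b):
    # Squared feature-space distance expressed through kernel evaluations.
    return kernel(a, a) + kernel(b, b) - 2.0 * kernel(a, b)

def init_core_set(Z, kernel, start=0):
    """Return indices (a, b) of the initial core-set and the weights alpha."""
    # Pass 1: point furthest from an arbitrary starting point.
    d_from_start = [feature_dist_sq(kernel, Z[start], z) for z in Z]
    a = int(np.argmax(d_from_start))
    # Pass 2: point furthest from z_a.
    d_from_a = [feature_dist_sq(kernel, Z[a], z) for z in Z]
    b = int(np.argmax(d_from_a))
    # The center is the midpoint, so alpha_a = alpha_b = 1/2, all others zero.
    alpha = np.zeros(len(Z))
    alpha[[a, b]] = 0.5
    return a, b, alpha

# Toy one-dimensional data (hypothetical): the extremes are 0.0 and 5.0.
Z = np.array([[0.0], [1.0], [5.0], [2.0]])
a, b, alpha = init_core_set(Z, rbf_kernel)
print(a, b, alpha)  # selects the points 5.0 (index 2) and 0.0 (index 0)
```

Both passes cost one kernel row each, which matches the $O(m)$ initialization cost quoted in the complexity analysis.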
The initial radius is

$$R_0 = \frac{1}{2}\|\tilde{\varphi}(z_a) - \tilde{\varphi}(z_b)\| = \frac{1}{2}\sqrt{\|\tilde{\varphi}(z_a)\|^2 + \|\tilde{\varphi}(z_b)\|^2 - 2\tilde{\varphi}(z_a)'\tilde{\varphi}(z_b)} = \frac{1}{2}\sqrt{2\tilde{\kappa} - 2\tilde{k}(z_a, z_b)}.$$

In a classification problem, one may further require $z_a$ and $z_b$ to come from different classes. On using (10), $R_0$ then becomes

$$R_0 = \frac{1}{2}\sqrt{2\left(\kappa + 2 + \frac{1}{C}\right) + 2k(x_a, x_b)}.$$

As $\kappa$ and $C$ are constants, choosing the pair $(x_a, x_b)$ that maximizes $R_0$ is then equivalent to choosing the closest pair belonging to opposing classes, which is also the heuristic used in initializing the SimpleSVM (Vishwanathan et al., 2003).

$^3$ A similar algorithm is also described in (Kumar et al., 2003).

4.1.2 Distance Computations

Steps 2 and 3 involve computing $\|c_t - \tilde{\varphi}(z_\ell)\|$ for $z_\ell \in S$. Now,

$$\|c_t - \tilde{\varphi}(z_\ell)\|^2 = \sum_{z_i, z_j \in S_t} \alpha_i \alpha_j \tilde{k}(z_i, z_j) - 2 \sum_{z_i \in S_t} \alpha_i \tilde{k}(z_i, z_\ell) + \tilde{k}(z_\ell, z_\ell), \qquad (12)$$

on using (3). Hence, computations are based on kernel evaluations instead of the explicit $\tilde{\varphi}(z_i)$'s, which may be infinite-dimensional. Note that, in contrast, existing MEB algorithms only consider finite-dimensional spaces.

However, in the feature space, $c_t$ cannot be obtained as an explicit point but rather as a convex combination of (at most) $|S_t|$ $\tilde{\varphi}(z_i)$'s. Computing (12) for all $m$ training points takes $O(|S_t|^2 + m|S_t|) = O(m|S_t|)$ time at the $t$th iteration. This becomes very expensive when $m$ is large. Here, we use the probabilistic speedup method in (Smola & Schölkopf, 2000). The idea is to randomly sample a sufficiently large subset $S'$ from $S$, and then take the point in $S'$ that is furthest away from $c_t$ as the approximate furthest point over $S$. As shown in (Smola & Schölkopf, 2000), by using a small random sample of, say, size 59, the furthest point obtained from $S'$ is with probability 0.95 among the furthest 5% of points from the whole $S$. Instead of taking $O(m|S_t|)$ time, this randomized method only takes $O(|S_t|^2 + |S_t|) = O(|S_t|^2)$ time, which is much faster as $|S_t| \ll m$. This trick can also be used in initialization.
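The distance formula (12) and the sampling trick can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code: a Gaussian kernel (for which $k(z,z) = 1$) stands in for $\tilde{k}$, and the function names, `gamma`, and the synthetic data are hypothetical. The sample size of 59 is consistent with the quoted guarantee because the chance that none of 59 independent draws falls in the furthest 5% is $0.95^{59} \approx 0.048$.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=0.5):
    """Gaussian kernel matrix between rows of A and rows of B (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def dist_sq_to_center(Z, core_idx, alpha, cand_idx, gamma=0.5):
    """Eq. (12) for each candidate; c_t = sum_i alpha_i phi(z_i) over the core-set."""
    K_ss = rbf_kernel_matrix(Z[core_idx], Z[core_idx], gamma)  # |S_t| x |S_t|
    K_sc = rbf_kernel_matrix(Z[core_idx], Z[cand_idx], gamma)  # |S_t| x n_cand
    k_cc = np.ones(len(cand_idx))                              # k(z, z) = 1 for this kernel
    return alpha @ K_ss @ alpha - 2.0 * alpha @ K_sc + k_cc

def sampled_furthest(Z, core_idx, alpha, rng, sample_size=59, gamma=0.5):
    """Approximate furthest point: search a random subset instead of all m points."""
    cand = rng.choice(len(Z), size=min(sample_size, len(Z)), replace=False)
    d2 = dist_sq_to_center(Z, core_idx, alpha, cand, gamma)
    j = int(np.argmax(d2))
    return int(cand[j]), float(d2[j])

# Synthetic data and a two-point core-set with equal weights (hypothetical setup).
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 2))
core_idx = np.array([0, 1])
alpha = np.array([0.5, 0.5])
idx, d2 = sampled_furthest(Z, core_idx, alpha, rng)
print(idx, d2)
```

Here `alpha` holds only the core-set weights, so the cost per candidate is proportional to $|S_t|$ and independent of $m$.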
4.1.3 Adding the Furthest Point

Points outside $\mathrm{MEB}(S_t)$ have zero $\alpha_i$'s (Section 4.1.1) and so violate the KKT conditions of the dual problem. As in (Osuna et al., 1997), one can simply add any such violating point to $S_t$. Our step 3, however, takes a greedy approach by including the point furthest away from the current center. In the classification case$^4$ (Section 3.2.2), we have

$$\arg\max_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} \|c_t - \tilde{\varphi}(z_\ell)\|^2 = \arg\min_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} \sum_{z_i \in S_t} \alpha_i y_i y_\ell (k(x_i, x_\ell) + 1) = \arg\min_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} y_\ell (w'\varphi(x_\ell) + b), \qquad (13)$$

on using (10), (11) and (12). Hence, (13) chooses the worst violating pattern corresponding to the constraint (8). Also, the dual objective in (9) has gradient $-2\tilde{K}\alpha$, and for a pattern $\ell$ currently outside the ball,

$$(\tilde{K}\alpha)_\ell = \sum_{i=1}^{m} \alpha_i \left( y_i y_\ell k(x_i, x_\ell) + y_i y_\ell + \frac{\delta_{i\ell}}{C} \right) = y_\ell (w'\varphi(x_\ell) + b),$$

on using (10), (11) and $\alpha_\ell = 0$. Thus, the pattern chosen in (13) also makes the most progress towards maximizing the dual objective. This subset selection heuristic has been commonly used by various decomposition algorithms (e.g., (Chang & Lin, 2004; Joachims, 1999; Platt, 1999)).

$^4$ The case for one-class classification (Section 3.2.1) is similar.

4.1.4 Finding the MEB

At each iteration of step 4, we find the MEB by using the QP formulation in Section 3.2. As the size $|S_t|$ of the core-set is much smaller than $m$ in practice (Section 5), the computational complexity of each QP sub-problem is much lower than that of solving the whole QP. Besides, as only one core vector is added at each iteration, efficient rank-one update procedures (Cauwenberghs & Poggio, 2001; Vishwanathan et al., 2003) can also be used. The cost then becomes quadratic rather than cubic. In the current implementation (Section 5), we use SMO. As only one point is added each time, the new QP is just a slight perturbation of the original. Hence, by using the MEB solution obtained from the previous iteration as the starting point (warm start), SMO can often converge in a small number of iterations.

4.2 Convergence to (Approximate) Optimality

First, consider $\epsilon = 0$. The proof in (Bădoiu & Clarkson, 2002) does not apply as it requires $\epsilon > 0$. Nevertheless, as the number of core vectors increases by one at each iteration and the training set size is finite, CVM must terminate in a finite number (say, $\tau$) of iterations. With $\epsilon = 0$, $\mathrm{MEB}(S_\tau)$ is an enclosing ball for all the points on termination. Moreover, $S_\tau$ is a subset of the whole training set, and the MEB of a subset cannot be larger than the MEB of the whole set. Hence, $\mathrm{MEB}(S_\tau)$ must also be the exact MEB of the whole ($\tilde{\varphi}$-transformed) training set. In other words, when $\epsilon = 0$, CVM outputs the exact solution of the kernel problem.

Now, consider $\epsilon > 0$. Assume that the algorithm terminates at the $\tau$th iteration. Then

$$R_\tau \le r_{\mathrm{MEB}(S)} \le (1 + \epsilon) R_\tau \qquad (14)$$

by definition.
Recall that the optimal primal objective $p^*$ of the kernel problem in Section 3.2.1 (or 3.2.2) is equal to the optimal dual objective $d_2^*$ in (7) (or (9)), which in turn is related to the optimal dual objective $d_1^* = r_{\mathrm{MEB}(S)}^2$ in (2) by (6). Together with (14), we can then bound $p^*$ as

$$R_\tau^2 \le p^* + \tilde{\kappa} \le (1 + \epsilon)^2 R_\tau^2. \qquad (15)$$

Hence,

$$\max\left( \frac{R_\tau^2}{p^* + \tilde{\kappa}}, \frac{p^* + \tilde{\kappa}}{R_\tau^2} \right) \le (1 + \epsilon)^2,$$

and thus CVM is a $(1 + \epsilon)^2$-approximation algorithm. This also holds with high probability when probabilistic speedup is used.

As mentioned in Section 1, practical SVM implementations also output only approximate solutions. Typically, a parameter similar to our $\epsilon$ is required at termination. For example, in SMO and SVM$^{light}$ (Joachims, 1999), training stops when the KKT conditions are fulfilled within $\epsilon$. Experience with these software packages indicates that near-optimal solutions are often good enough in practical applications. Moreover, it can also be shown that when the CVM terminates, all the points satisfy loose KKT conditions as in SMO and SVM$^{light}$.

4.3 Time and Space Complexities

Existing decomposition algorithms cannot guarantee the number of iterations and consequently the overall time complexity (Chang & Lin, 2004). In this section, we show how this can be obtained for CVM. In the following, we assume that a plain QP implementation, which takes $O(m^3)$ time and $O(m^2)$ space for $m$ patterns, is used for the MEB sub-problem in Section 4.1.4. Moreover, we assume that each kernel evaluation takes constant time.

As proved in (Bădoiu & Clarkson, 2002), CVM converges in at most $2/\epsilon$ iterations. In other words, the total number of iterations, and consequently the size of the final core-set, are both $\tau = O(1/\epsilon)$. In practice, it has often been observed that the size of the core-set is much smaller than this worst-case theoretical upper bound (Kumar et al., 2003). This will also be corroborated by our experiments in Section 5.
Consider first the case where probabilistic speedup is not used in Section 4.1.2. As only one core vector is added at each iteration, $|S_t| = t + 2$. Initialization takes $O(m)$ time while distance computations in steps 2 and 3 take $O((t + 2)^2 + tm) = O(t^2 + tm)$ time. Finding the MEB in step 4 takes $O((t + 2)^3) = O(t^3)$ time, and the other operations take constant time. Hence, the $t$th iteration takes $O(tm + t^3)$ time, and the overall time for $\tau = O(1/\epsilon)$ iterations is

$$\sum_{t=1}^{\tau} O(tm + t^3) = O(\tau^2 m + \tau^4) = O\left( \frac{m}{\epsilon^2} + \frac{1}{\epsilon^4} \right),$$

which is linear in $m$ for a fixed $\epsilon$.

As for space$^5$, since only the core vectors are involved in the QP, the space complexity for the $t$th iteration is $O(|S_t|^2)$. As $\tau = O(1/\epsilon)$, the space complexity for the whole procedure is $O(1/\epsilon^2)$, which is independent of $m$ for a fixed $\epsilon$.

On the other hand, when probabilistic speedup is used, initialization only takes $O(1)$ time while distance computations in steps 2 and 3 take $O((t + 2)^2) = O(t^2)$ time. Time for the other operations remains the same. Hence, the $t$th iteration takes $O(t^3)$ time and the whole procedure takes

$$\sum_{t=1}^{\tau} O(t^3) = O(\tau^4) = O\left( \frac{1}{\epsilon^4} \right).$$

$^5$ As the patterns may be stored out of core, we ignore the $O(m)$ space required for storing the $m$ patterns.
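Putting steps 1 to 5 of Section 4 together, the overall loop might look like the sketch below. It is a simplified illustration under several assumptions that are not in the paper: a Gaussian kernel replaces the transformed kernel $\tilde{k}$, the full Gram matrix is precomputed (which a large-scale implementation would avoid), probabilistic speedup is omitted, and a Frank-Wolfe iteration with exact line search is swapped in for the QP/SMO solver of Section 4.1.4. All function names and parameter values are hypothetical.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Full Gaussian Gram matrix (assumed kernel; only sensible for small demos)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def solve_meb(K, alpha, n_iter=2000):
    """Frank-Wolfe on max_a a'diag(K) - a'Ka over the simplex, warm-started at alpha."""
    diag = np.diag(K)
    for _ in range(n_iter):
        grad = diag - 2.0 * K @ alpha
        i = int(np.argmax(grad))
        d = -alpha.copy()
        d[i] += 1.0                      # direction towards vertex e_i
        gap = grad @ d                   # equals (max core distance)^2 - R^2
        curv = d @ K @ d
        if gap <= 1e-12 or curv <= 1e-15:
            break
        alpha = alpha + min(1.0, gap / (2.0 * curv)) * d  # exact line search
    r2 = alpha @ diag - alpha @ K @ alpha
    return alpha, np.sqrt(max(r2, 0.0))

def cvm_meb(K, eps=0.1, max_iter=500):
    diag = np.diag(K)
    # Step 1: two-pass furthest-point initialization (Section 4.1.1).
    a = int(np.argmax(diag[0] + diag - 2.0 * K[0]))
    b = int(np.argmax(diag[a] + diag - 2.0 * K[a]))
    core, alpha = [a, b], np.array([0.5, 0.5])
    for _ in range(max_iter):
        # Step 4: MEB of the current core-set, warm-started.
        K_ss = K[np.ix_(core, core)]
        alpha, R = solve_meb(K_ss, alpha)
        # Steps 2 and 3: squared distances of all points to the center, eq. (12).
        d2 = alpha @ K_ss @ alpha - 2.0 * alpha @ K[core] + diag
        far = int(np.argmax(d2))
        # Stop when every point lies in the (1 + eps)-ball; the membership
        # check is a safeguard against an inexact sub-solver.
        if np.sqrt(max(d2[far], 0.0)) <= (1.0 + eps) * R or far in core:
            break
        core.append(far)
        alpha = np.append(alpha, 0.0)
    return core, alpha, R

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
K = rbf_gram(X)
core, alpha, R = cvm_meb(K, eps=0.1)
d2_all = alpha @ K[np.ix_(core, core)] @ alpha - 2.0 * alpha @ K[core] + np.diag(K)
print(len(core), R, np.sqrt(d2_all.max()))
```

The Frank-Wolfe step was chosen only because it fits in a few lines and keeps the weights on the simplex; any QP solver for the dual in Section 3.2 could take its place.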