This contradicts the optimality of f, b, since for f′ = 0, b′ = β it holds that

E_{(x,y)∼D} l((f′(x) + b′)y) ≤ λ·l(−β) + (1 − λ)·((1 − θ)·l(−β) + θ·l(β)) = (1 − θ)·l(−β) + θ·l(β) + o(1)

We can now conclude the proof of Theorem 2.6. By choosing d large enough and using Equation (12), we can guarantee that g|_{x : ⟨x,e⟩ = −γ} is strongly concentrated around its expectation. In particular, if (x, y) is sampled according to D, then w.p. > 0.5·(1 − θ)·(1 − λ) = Ω(1) it holds that yg(x) < 0. Thus, Err_{D,0−1}(g) = Ω(1), while Err_γ(D) ≤ λ = O(γ·poly(log(m))).

To conclude the proof of Theorem 3.2, we note that we can assume that g is O(e)-invariant. Otherwise, we can replace it with P_e f + b; this increases neither ‖f‖_{H_k} nor Err_{D,l}(f + b), so the solution P_e f + b is optimal as well. Now it follows that g|_{x : ⟨x,e⟩ = −γ} is constant, and we finish as before.

5.5.2 Theorem 3.1

Let L be the Lipschitz constant of l, and let β > α > 0 be such that l(α) > l(β). Choose 0 < θ < 1 large enough so that (1 − θ)·l(−β) + θ·l(β) < θ·l(α). First, define probability measures µ_1, µ_2, µ_3 and µ over [−1, 1] × {±1} as follows.

µ_1(γ, 1) = θ,  µ_1(−γ, −1) = 1 − θ

µ_2(−γ, 1) = 1

The measure µ_3 is the product of the uniform measure on {±1} and the measure over [−1, 1] whose density function is

w(x) = 8 / (π·√(1 − (8x)²)) for |x| ≤ 1/8, and w(x) = 0 for |x| > 1/8

Finally, µ = (1 − λ_2 − λ_3)·µ_1 + λ_2·µ_2 + λ_3·µ_3, with λ_2, λ_3 > 0 to be chosen later.

By Lemma 5.15 (see Remark 5.16), there is a continuous normalized kernel k′ such that w.p. ≥ 1 − 2·exp(−1/γ) the function returned by the algorithm is of the form f + b with ‖f‖_{H_{k′}} ≤ c·m_A³(γ) for some c > 0 (depending only on l). Now, let e ∈ S^{d−1} be the vector from Lemma 5.22 corresponding to the kernel k′.
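The construction above can be sanity-checked numerically. The snippet below is an illustrative sketch, not part of the proof: the hinge loss and the concrete values of α, β, θ are assumptions chosen for the check. It verifies that w is a probability density (via a midpoint rule, whose midpoints avoid the endpoint singularities) and that a θ close to 1 satisfies the required inequality.

```python
import math

# The density of mu_3's marginal: w(x) = 8 / (pi * sqrt(1 - (8x)^2)) on
# |x| <= 1/8, and 0 otherwise. Substituting x = sin(t)/8 shows it integrates
# to 1; here we confirm numerically with a midpoint rule.
def w(x: float) -> float:
    return 8.0 / (math.pi * math.sqrt(1.0 - (8.0 * x) ** 2)) if abs(x) < 0.125 else 0.0

N = 200_000
h = 0.25 / N  # cell width over the support [-1/8, 1/8]
total = h * sum(w(-0.125 + (i + 0.5) * h) for i in range(N))
assert abs(total - 1.0) < 1e-2  # w is a probability density

# The choice of theta: for the hinge loss (convex, non-increasing) and
# beta > alpha > 0 with l(alpha) > l(beta), any theta close enough to 1
# satisfies (1 - theta) * l(-beta) + theta * l(beta) < theta * l(alpha).
def loss(z: float) -> float:  # hinge loss as a concrete choice of l
    return max(0.0, 1.0 - z)

alpha, beta, theta = 0.5, 1.0, 0.9  # illustrative values, not from the text
assert (1 - theta) * loss(-beta) + theta * loss(beta) < theta * loss(alpha)
```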
The distribution D is the pullback of µ w.r.t. e. By considering the affine functional Λ_{e,0}, it holds that Err_γ(D) ≤ λ_3 + λ_2.

Let g be the solution returned by the algorithm. With probability ≥ 1 − exp(−1/γ), g = f + b, where f, b is a solution to program (11) with C = C_A(γ) and with an additive error ≤ √γ. As in the proof of Theorem 2.6, it holds that, w.p. ≥ 1 − 10·exp(−1/γ), for m = m_A(γ),

|∫_{x : ⟨x,e⟩ = γ} g − ∫_{x : ⟨x,e⟩ = −γ} g| ≤ 128·l(0)·γ·K^{3.5} / (|∂⁺l(0)|·λ_3) + 10·c·K^{3.5}·E·m³·(r_K + s_d)    (13)

Denote the last bound by ε. It holds that

Err_{D,l}(g) = (1 − λ_2 − λ_3)·E_{µ_e^1} l(yg(x)) + λ_2·E_{µ_e^2} l(yg(x)) + λ_3·E_{µ_e^3} l(yg(x))    (14)
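Equation (14) is linearity of expectation over the mixture µ = (1 − λ_2 − λ_3)·µ_1 + λ_2·µ_2 + λ_3·µ_3. A toy numeric check (all numbers below, including the predictor g and the discrete stand-in for µ_3, are illustrative assumptions, not values from the text):

```python
# The l-error under the mixture equals the same weighted combination of the
# per-component l-errors.
gamma, theta, l2, l3 = 0.1, 0.9, 0.05, 0.02   # illustrative parameters
loss = lambda z: max(0.0, 1.0 - z)            # hinge loss as a concrete l
g = lambda x: 3.0 * x + 0.1                   # an arbitrary affine predictor

# Each measure is a list of ((x, y), probability) pairs.
mu1 = [((gamma, 1), theta), ((-gamma, -1), 1 - theta)]
mu2 = [((-gamma, 1), 1.0)]
mu3 = [((0.05, 1), 0.5), ((0.05, -1), 0.5)]   # discrete stand-in for mu_3

def err(mu):  # E_mu l(y * g(x))
    return sum(p * loss(y * g(x)) for (x, y), p in mu)

weights = [1 - l2 - l3, l2, l3]
mixture = [(pt, wt * p) for wt, mu in zip(weights, (mu1, mu2, mu3)) for pt, p in mu]
lhs = err(mixture)
rhs = sum(wt * err(mu) for wt, mu in zip(weights, (mu1, mu2, mu3)))
assert abs(lhs - rhs) < 1e-12  # Equation (14): the error decomposes linearly
```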
Now, denote δ = ∫_{x : ⟨x,e⟩ = −γ} g. It holds that

E_{µ_e^1} l(yg(x)) = θ·∫_{x : ⟨x,e⟩ = γ} l(g(x)) + (1 − θ)·∫_{x : ⟨x,e⟩ = −γ} l(−g(x))
 ≥ θ·l(∫_{x : ⟨x,e⟩ = γ} g) + (1 − θ)·l(−∫_{x : ⟨x,e⟩ = −γ} g)
 ≥ θ·l(δ) + (1 − θ)·l(−δ) − L·ε    (15)

where the first inequality is Jensen's inequality and the second uses Equation (13) together with the L-Lipschitzness of l. Thus,

Err_{D,l}(g) ≥ (1 − λ_2 − λ_3)·(θ·l(δ) + (1 − θ)·l(−δ)) − L·ε + λ_2·E_{µ_e^2} l(yg(x))

However, by considering the constant solution δ, it follows that

Err_{D,l}(g) ≤ (1 − λ_2 − λ_3)·(θ·l(δ) + (1 − θ)·l(−δ)) + λ_2·l(δ) + λ_3·(1/2)·(l(δ) + l(−δ)) + √γ
 ≤ (1 − λ_2 − λ_3)·(θ·l(δ) + (1 − θ)·l(−δ)) + λ_2·l(δ) + λ_3·l(−|δ|) + √γ    (16)

Combining the last two bounds,

Err_{µ_e^2,l}(g) ≤ L·ε/λ_2 + l(δ) + (λ_3/λ_2)·l(−|δ|) + √γ/λ_2
 = 128·L·l(0)·γ·K^{3.5} / (|∂⁺l(0)|·λ_2·λ_3) + (10·c·L·K^{3.5}/λ_2)·E·m³·(r_K + s_d) + l(δ) + (λ_3/λ_2)·l(−|δ|) + √γ/λ_2

Now, relying on the assumption that γ·log⁸(m) = o(1), it is possible to choose λ_2 = Θ(√γ·K⁴) = Θ(√γ·log⁴(m)), λ_3 = √γ, K = Θ(log(m/γ)), and d = Θ(log(m/γ)) such that the bound in Equation (13), the quantity 128·L·l(0)·γ·K^{3.5}/(|∂⁺l(0)|·λ_2·λ_3) + (10·c·K^{3.5}/λ_2)·E·m³·(r_K + s_d), λ_2, λ_3 and λ_3/λ_2 are all o(1).

Since the bound in Equation (13) is o(1), it follows, as in the proof of Theorem 2.6, that l(δ) ≤ l(α/2), and consequently, by Markov's inequality, w.p. > (l(0) − l(α/2))/l(0) − o(1) over µ_e^2 it holds that l(g(x)) < l(0), which implies g(x) > 0. Since the marginal distributions of µ_e^1 and µ_e^2 on {x : ⟨x,e⟩ = −γ} are the same, it follows that, if (x, y) is chosen according to D, then w.p. > ((l(0) − l(α/2))/l(0) − o(1))·(1 − λ_2 − λ_3)·(1 − θ) = Ω(1), yg(x) < 0. Thus, Err_{D,0−1}(g) = Ω(1) while Err_γ(D) ≤ λ_2 + λ_3 = O(√γ·poly(log(m))). ✷
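The chain (15)–(16) rests on two facts about a convex, non-increasing loss: Jensen's inequality, and (1/2)·(l(δ) + l(−δ)) ≤ l(−|δ|). Both can be checked numerically for a concrete choice of l; the hinge loss and the sample points below are illustrative assumptions:

```python
# Two convexity facts used in (15)-(16), checked for the hinge loss
# l(z) = max(0, 1 - z), which is convex and non-increasing.
loss = lambda z: max(0.0, 1.0 - z)

# (i) Jensen: E l(g) >= l(E g), on an arbitrary two-point distribution.
vals, probs = [-0.5, 1.5], [0.3, 0.7]
mean = sum(p * v for p, v in zip(probs, vals))
assert sum(p * loss(v) for p, v in zip(probs, vals)) >= loss(mean)

# (ii) (1/2)(l(d) + l(-d)) <= l(-|d|): since l is non-increasing,
# l(|d|) <= l(-|d|), so the average of l(d) and l(-d) is at most l(-|d|).
# Checked over a grid of delta values.
for i in range(-100, 101):
    d = i / 50.0
    assert 0.5 * (loss(d) + loss(-d)) <= loss(-abs(d)) + 1e-12
```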