The error rate of learning halfspaces using kernel-SVM

The proof of Theorem 2.7

To prove Theorem 2.7, we prove, using John's Lemma (Matousek, 2002), that for every embedding $\psi : S^{d-1} \to B_1$ we can construct a kernel $k : S^{d-1} \times S^{d-1} \to \mathbb{R}$ and a probability measure $\mu_N$ over $S^{d-1}$ with the following property: if $f$ is an approximate solution of program (4), where a $\gamma$ fraction of the distribution $D$ is perturbed by $\mu_N$, then $\|f\|_k \le O\!\left(m^{1.5}/\gamma\right)$. Using this, we adapt the proof as sketched above to prove Theorem 2.7.

3 Additional Results

Low dimensional distributions. It is of interest to examine Theorem 2.6 when $D$ is supported on $B^d$ for small $d$. We show that for $d = O(\log(1/\gamma))$, the approximation ratio is $\Omega\!\left(\frac{1}{\sqrt{\gamma}\cdot\mathrm{poly}(\log(1/\gamma))}\right)$. Most commonly used kernels (e.g., the polynomial, RBF, and hyperbolic tangent kernels, as well as the kernel used in (Shalev-Shwartz et al., 2011)) are symmetric. Namely, for all unit vectors $x, y \in B$, $k(x, y) := \langle \psi(x), \psi(y) \rangle_{H_1}$ depends only on $\langle x, y \rangle_H$. For symmetric kernels, we show that even with the restriction that $d = O(\log(1/\gamma))$, the approximation ratio is still $\Omega\!\left(\frac{1}{\gamma\cdot\mathrm{poly}(\log(1/\gamma))}\right)$. However, the result for symmetric kernels is proved only for (idealized) algorithms that return the exact solution of program (5).

Theorem 3.1 Let $A$ be a kernel-based learner corresponding to a Lipschitz surrogate. Assume that $m_A(\gamma) = \exp(o(\gamma^{-1/8}))$. Then, for every $\gamma > 0$, there exists a distribution $D$ on $B^d$, for $d = O(\log(m_A(\gamma)/\gamma))$, such that, w.p. $\ge 1 - \exp(-1/\gamma)$,
$$\frac{\mathrm{Err}_{D,0\text{-}1}(A(\gamma))}{\mathrm{Err}_\gamma(D)} \;\ge\; \Omega\!\left(\frac{1}{\sqrt{\gamma}\cdot\mathrm{poly}(\log(m_A(\gamma)))}\right).$$

Theorem 3.2 Assume that $m_A(\gamma) = \exp(o(\gamma^{-2/7}))$ and $\psi$ is continuous and symmetric. For every $\gamma > 0$, there exists a distribution $D$ on $B^d$, for $d = O(\log(m_A(\gamma)))$, and a solution to program (5) whose 0-1-error is $\Omega\!\left(\frac{1}{\gamma\,\mathrm{poly}(\log(m_A(\gamma)))}\right)\cdot\mathrm{Err}_\gamma(D)$.

The integrality gap. In bounding the approximation ratio, we considered a predefined loss $\ell$. We believe that similar bounds hold as well for algorithms that can choose $\ell$ according to $\gamma$. However, at the moment, we only know how to lower bound the integrality gap, as defined below.

If we let $\ell$ depend on $\gamma$, we should redefine the complexity of Program (5) to be $C \cdot L$, where $L$ is the Lipschitz constant of $\ell$ (see the discussion following Program (5)). The ($\gamma$-)integrality gap of programs (5) and (4) is defined as the worst case, over all possible choices of $D$, of the ratio between the optimum of the program, run on the input $\gamma$, and $\mathrm{Err}_\gamma(D)$. We note that $\mathrm{Err}_{D,0\text{-}1}(f) \le \mathrm{Err}_{D,\ell}(f)$ for every convex surrogate $\ell$. Thus, the integrality gap always upper bounds the approximation ratio. Moreover, this fact establishes most (if not all) guarantees for algorithms that solve Program (5) or Program (4).

We denote by $\partial^+ f$ the right derivative of the real function $f$. Note that $\partial^+ f$ always exists for $f$ convex. Also, $|\partial^+ f(x)| \le L$ for all $x \in \mathbb{R}$ if $f$ is $L$-Lipschitz. We prove:
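To make the last paragraph concrete, here is a standard illustration with the hinge loss (used here only as an example; the text above works with an arbitrary $L$-Lipschitz convex surrogate):
$$\ell_{\mathrm{hinge}}(x) = \max(0,\,1-x), \qquad \partial^+\ell_{\mathrm{hinge}}(x) = \begin{cases} -1, & x < 1,\\ \phantom{-}0, & x \ge 1.\end{cases}$$
The hinge loss is convex and $1$-Lipschitz, so $|\partial^+\ell_{\mathrm{hinge}}(x)| \le 1$ for all $x$; and since $\ell_{\mathrm{hinge}}(x) \ge \mathbf{1}[x \le 0]$, taking expectations gives $\mathrm{Err}_{D,0\text{-}1}(f) \le \mathrm{Err}_{D,\ell_{\mathrm{hinge}}}(f)$, an instance of the inequality noted above.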
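In symbols, writing $\mathrm{OPT}_{(5)}(D,\gamma)$ for the optimal value of Program (5) on input $\gamma$ under the distribution $D$ (notation introduced here for concreteness; it does not appear above), the ($\gamma$-)integrality gap of Program (5) can be written as
$$\mathrm{Gap}_{(5)}(\gamma) \;=\; \sup_{D}\,\frac{\mathrm{OPT}_{(5)}(D,\gamma)}{\mathrm{Err}_\gamma(D)},$$
and analogously for Program (4).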
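For the symmetry property discussed above, two familiar examples are the polynomial and Gaussian (RBF) kernels; the degree $n$ and width $\sigma$ below are generic parameters, not values taken from the text. Restricted to unit vectors $x, y$, where $\|x-y\|^2 = 2 - 2\langle x, y\rangle$, both depend on the pair only through $\langle x, y\rangle$:
$$k_{\mathrm{poly}}(x,y) = (1 + \langle x, y\rangle)^n, \qquad k_{\mathrm{RBF}}(x,y) = \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right) = \exp\!\left(\frac{\langle x, y\rangle - 1}{\sigma^2}\right).$$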
