5.5.3 The integrality gap – Theorem 3.3

Our first step is a reduction to the hinge loss. Let $a = \partial_+ l(0)$. Define

\[
l^*(x) =
\begin{cases}
ax + 1 & x \le -\frac{1}{a} \\
0 & \text{otherwise}
\end{cases}
\]

It is not hard to see that $l^*$ is a convex surrogate satisfying $l^*(x) \le l(x)$ for all $x$ and $\partial_+ l^*(0) = \partial_+ l(0)$. Thus, if we substitute $l$ with $l^*$, we only decrease the integrality gap, and hence we can assume that $l = l^*$. Now, we note that if we consider program (11) with $l = l^*$, the integrality gap coincides with what we get by replacing $C$ with $|a| \cdot C$ and $l^*$ with the hinge loss. To see that, note that for every $f \in H_k$, $b \in \mathbb{R}$,

\[
\mathrm{Err}_{D,l^*}(f + b) = \mathrm{Err}_{D,\mathrm{hinge}}(|a| \cdot f + |a| \cdot b).
\]

Thus, minimizing $\mathrm{Err}_{D,l^*}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le C$ is equivalent to minimizing $\mathrm{Err}_{D,\mathrm{hinge}}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le |a| \cdot C$. Thus, it is enough to prove the Theorem for $l = l_{\mathrm{hinge}}$.

Next, we show that we can assume that the embedding is symmetric (i.e., corresponds to a symmetric kernel). As the integrality gap is at least as large as the approximation ratio, using Theorem 3.2 this will complete our argument. (The reduction to the hinge loss yields bounds with universal constants in the asymptotic terms.)

Let $\gamma > 0$ and let $D$ be a distribution on $S^{d-1} \times \{\pm 1\}$. It is enough to find a (possibly different) distribution $D_1$ with the same $\gamma$-margin error as $D$, for which the optimum of program (11) (with $l = l_{\mathrm{hinge}}$) is not smaller than the optimum of the program

\[
\begin{aligned}
\min \quad & \mathrm{Err}_{D,\mathrm{hinge}}(f + b) \\
\text{s.t.} \quad & f \in H_{k_s},\ b \in \mathbb{R} \\
& \|f\|_{H_{k_s}} \le C
\end{aligned}
\tag{17}
\]

Denote the optimal value of program (17) by $\alpha$ and assume, towards contradiction, that whenever $\mathrm{Err}_\gamma(D_1) = \mathrm{Err}_\gamma(D)$, the optimum of program (11) is strictly less than $\alpha$. For every $A \in O(d)$, let $D_A$ be the distribution of the r.v. $(Ax, y) \in S^{d-1} \times \{\pm 1\}$, where $(x, y) \sim D$. Since clearly $\mathrm{Err}_\gamma(D_A) = \mathrm{Err}_\gamma(D)$, there exist $f_A \in H_k$ and $b_A \in \mathbb{R}$ such that $\|f_A\|_{H_k} \le C$ and $\mathrm{Err}_{D_A,\mathrm{hinge}}(g_A) < \alpha$, where $g_A := f_A + b_A$. Define $f \in H_{k_s}$ by $f(x) = \int_{O(d)} f_A(Ax)\,dA$ and let $b = \int_{O(d)} b_A\,dA$ and $g = f + b$. By Theorem 5.18, $\|f\|_{H_{k_s}} \le C$. Finally, for $l = l_{\mathrm{hinge}}$,

\[
\begin{aligned}
\mathrm{Err}_{D,\mathrm{hinge}}(g) &= E_{(x,y)\sim D}\, l(y\,g(x)) \\
&= E_{(x,y)\sim D}\, l\big(y\,E_{A\sim O(d)}\, g_A(Ax)\big) \\
&\le E_{(x,y)\sim D}\, E_{A\sim O(d)}\, l(y\,g_A(Ax)) \\
&= E_{A\sim O(d)}\, E_{(x,y)\sim D}\, l(y\,g_A(Ax)) \\
&= E_{A\sim O(d)}\, E_{(x,y)\sim D_A}\, l(y\,g_A(x)) < \alpha
\end{aligned}
\]

The inequality is Jensen's inequality, applied to the convex hinge loss. This is contrary to the assumption that $\alpha$ is the optimum of program (17).

5.5.4 Finite dimension - Theorems 2.7 and 3.4

Let $V \subseteq C(S^{d-1})$ be the linear space $\{\Lambda_{v,b} \circ \psi : v \in \mathbb{R}^m,\ b \in \mathbb{R}\}$ and denote $\bar{W} = \{\Lambda_{v,b} \circ \psi : v \in W,\ b \in \mathbb{R}\}$. We note that $\dim(V) \le m + 1$. Instead of program (4) we consider the equivalent formulation

\[
\begin{aligned}
\min \quad & \mathrm{Err}_{D,l}(f) \\
\text{s.t.} \quad & f \in \bar{W}
\end{aligned}
\tag{18}
\]

The following lemma is very similar to Lemma 5.12, but with better dependency on $m$ ($m^{1.5}$ instead of $m^2$).

Lemma 5.25 Let $l$ be a convex surrogate and let $V \subset C(S^{d-1})$ be an $m$-dimensional vector space. There exists a continuous kernel $k : S^{d-1} \times S^{d-1} \to \mathbb{R}$ with $\sup_{x \in S^{d-1}} k(x, x) \le 1$ such that $H_k = V$ as a vector space and there exists a probability measure $\mu_N$ such that

\[
\forall f \in V, \quad \|f\|_{H_k} \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\, \mathrm{Err}_{\mu_N,l}(f)
\]

Proof Let $\psi : S^{d-1} \to V^*$ be the evaluation operator. It maps each $x \in S^{d-1}$ to the linear functional $f \in V \mapsto f(x)$. We claim that

1. $\psi$ is continuous,
2. $\mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1})) = V^*$,
3. $V = \{v^{**} \circ \psi : v^{**} \in V^{**}\}$.

Proof of 1: We need to show that $\psi(x_n) \to \psi(x)$ if $x_n \to x$. Since $V^*$ is finite dimensional, it suffices to show that $\psi(x_n)(f) \to \psi(x)(f)$ for every $f \in V$, which follows from the continuity of $f$.

Proof of 2: Note that $0 \in U = \mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$, so $U$ is a linear space. Now, define $T : U^* \to V$ via $T(u^*) = u^* \circ \psi$. We claim that $T$ is onto, whence $\dim(U) = \dim(U^*) = \dim(V) = \dim(V^*)$, so that $U = V^*$. Indeed, for $f \in V$, let $u^*_f \in U^*$ be the functional $u^*_f(u) = u(f)$. Now, $T(u^*_f)(x) = u^*_f(\psi(x)) = \psi(x)(f) = f(x)$, thus $T(u^*_f) = f$.

Proof of 3: From $U = V^*$ it follows that $U^* = V^{**}$, so that the mapping $T : V^{**} \to V$ is onto, showing that $V = \{v^{**} \circ \psi : v^{**} \in V^{**}\}$.

Let us apply John's Lemma to $K = \mathrm{conv}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$. It yields an inner product on $V^*$ with $K$ contained in the unit ball and containing the ball around $0$ with radius $\frac{1}{\sqrt{m}}$. Let $k$ be the kernel $k(x, y) = \langle \psi(x), \psi(y) \rangle$. Since $\psi$ is continuous, $k$ is continuous as well. By Theorem 5.1.1 and since $T$ is onto, it follows that, as a vector space, $V = H_k$.
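
As an illustration of the kernel $k(x, y) = \langle \psi(x), \psi(y) \rangle$ used in this proof, the following is a minimal numerical sketch and not the construction itself. It makes several assumptions that are not in the text: $V$ is taken to be the span of the constant function and the $d$ coordinate functions (the hypothetical `features` helper), the sphere is replaced by a finite random sample (the hypothetical `sample_sphere` helper), and John's Lemma is replaced by a crude rescaling of the Euclidean inner product. The rescaling only guarantees that the sampled points $\psi(x)$ lie in the unit ball; it does not guarantee the inner ball of radius $1/\sqrt{m}$.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere S^{d-1} (hypothetical helper)."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def features(x):
    """Coordinates of psi(x) for an assumed basis of V:
    the constant function 1 and the d coordinate functions, so m = d + 1."""
    return np.hstack([np.ones((x.shape[0], 1)), x])


rng = np.random.default_rng(0)
d = 3
X = sample_sphere(200, d, rng)
Psi = features(X)  # row i holds the coordinates of psi(x_i)

# Stand-in for John's Lemma (assumption): rescale the Euclidean inner
# product so that every sampled psi(x_i) lies in the unit ball.
scale = np.max(np.linalg.norm(Psi, axis=1))
Psi_scaled = Psi / scale

# Gram matrix of the kernel k(x_i, x_j) = <psi(x_i), psi(x_j)>.
G = Psi_scaled @ Psi_scaled.T

max_diag = float(np.max(np.diag(G)))          # sup of k(x, x) over the sample
rank = int(np.linalg.matrix_rank(G))          # bounded by m = dim(V)
min_eig = float(np.min(np.linalg.eigvalsh(G)))  # near-zero or positive for a PSD kernel

print(max_diag, rank, min_eig)
```

In this sketch the diagonal bound on the sample mirrors $\sup_x k(x, x) \le 1$, and the rank bound reflects that the functions $k(\cdot, x_i)$ all lie in the $m$-dimensional space $V$.
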
Since $K$ is contained in the unit ball, it follows that $\sup_{x \in S^{d-1}} k(x, x) \le 1$. It remains to prove the existence of the measure $\mu_N$.

Let $e_1, \ldots, e_m \in V^*$ be an orthonormal basis. For every $i \in [m]$, choose $(x_i^1, y_i), \ldots, (x_i^{m+1}, y_i) \in S^{d-1} \times \{\pm 1\}$ and $\lambda_i^1, \ldots, \lambda_i^{m+1} \ge 0$ such that $\sum_{j=1}^{m+1} \lambda_i^j = 1$ and

\[
\frac{1}{\sqrt{m}}\, e_i = \sum_{j=1}^{m+1} \lambda_i^j\, y_i\, \psi(x_i^j).
\]

Define $\mu_N(x_i^j, 1) = \mu_N(x_i^j, -1) = \frac{\lambda_i^j}{2m}$.

Let $f \in V$. By Theorem 5.1.1 there exists $v \in V^*$ such that $f = \Lambda_{v,0} \circ \psi$ and $\|f\|_{H_k} = \|v\|_{V^*}$. It follows that, for $a = \partial_+ l(0)$,