The error rate of learning halfspaces using kernel-SVM

For $1 \le i \le t$ let $v_i = x_i - P_{M_i} x_i$. Note that $\|v_i\|_{\mathcal{H}} = d(x_i, M_i) \ge \frac{t}{C}$. Now, given a function $f : A \to [-1,1]$, we will show that $f \in \mathcal{F}_{\mathcal{H}}(\mathcal{X}, C)$. Consider the affine functional
\[
\Lambda(x) \;=\; \sum_{i=1}^{t} \frac{f(x_i)}{\|v_i\|_{\mathcal{H}}^{2}} \left\langle v_i,\, x - P_{M_i}(x) \right\rangle_{\mathcal{H}}
\;=\; \left\langle \sum_{i=1}^{t} \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}},\, x \right\rangle_{\mathcal{H}} - \sum_{i=1}^{t} \left\langle \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}},\, P_{M_i}(x) \right\rangle_{\mathcal{H}}.
\]
Note that since $v_i$ is perpendicular to $M_i$, the term $b := -\sum_{i=1}^{t} \left\langle \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}},\, P_{M_i}(x) \right\rangle_{\mathcal{H}}$ does not depend on $x$. Let $w := \sum_{i=1}^{t} \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}}$. We have
\[
\|w\|_{\mathcal{H}} \;\le\; \sum_{i=1}^{t} \frac{|f(x_i)|}{\|v_i\|_{\mathcal{H}}} \;\le\; t \cdot \frac{C}{t} \;=\; C.
\]
Therefore, $\Lambda|_{\mathcal{X}} \in \mathcal{F}_{\mathcal{H}}(\mathcal{X}, C)$. Finally, for every $1 \le j \le t$ we have
\[
\Lambda(x_j) \;=\; \sum_{i=1}^{t} \frac{f(x_i)}{\|v_i\|_{\mathcal{H}}^{2}} \left\langle v_i,\, v_j - P_{M_i}(v_j) \right\rangle_{\mathcal{H}} \;=\; f(x_j).
\]
Here, the last equality follows from the fact that for $i \ne j$, since $v_i$ is perpendicular to $M_i$, $\langle v_i, v_j - P_{M_i}(v_j) \rangle_{\mathcal{H}} = 0$, while for $i = j$, since $v_j \perp M_j$ implies $P_{M_j}(v_j) = 0$, the term equals $f(x_j)$. Therefore, $f = \Lambda|_A$.

Let $\ell : \mathbb{R} \to \mathbb{R}$ be a surrogate loss function. We say that an algorithm $(\epsilon, \delta)$-learns $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ using $m$ examples w.r.t. $\ell$ if:

• Its input is a sample of $m$ points in $\mathcal{X} \times \{\pm 1\}$ and its output is a hypothesis in $\mathcal{F}$.

• For every distribution $\mathcal{D}$ on $\mathcal{X} \times \{\pm 1\}$, it returns, with probability $\ge 1 - \delta$, a hypothesis $\hat{f} \in \mathcal{F}$ with $\mathrm{Err}_{\mathcal{D},\ell}(\hat{f}) \le \inf_{f \in \mathcal{F}} \mathrm{Err}_{\mathcal{D},\ell}(f) + \epsilon$.

Lemma 5.14. Suppose that an algorithm $\mathcal{A}$ $(\epsilon, \delta)$-learns $\mathcal{F}$ using $m$ examples w.r.t. a surrogate loss $\ell$. Then, for every pair of distributions $\mathcal{D}$ and $\mathcal{D}'$ on $\mathcal{X} \times \{\pm 1\}$, if $\hat{f} \in \mathcal{F}$ is the hypothesis returned by $\mathcal{A}$ running on $\mathcal{D}$, then $\mathrm{Err}_{\mathcal{D}',\ell}(\hat{f}) \le m(\ell(0) + \epsilon)$ w.p. $\ge 1 - 2e\delta$.

Proof. Suppose toward a contradiction that w.p. $\ge 2e\delta$ we have $\mathrm{Err}_{\mathcal{D}',\ell}(\hat{f}) > a$ for some $a > m(\ell(0) + \epsilon)$. Consider the following distribution $\tilde{\mathcal{D}}$: w.p. $\frac{1}{m}$ we sample from $\mathcal{D}'$, and w.p. $1 - \frac{1}{m}$ we sample from $\mathcal{D}$. Suppose now that we run the algorithm $\mathcal{A}$ on $\tilde{\mathcal{D}}$. Conditioning on the event that all the samples are from $\mathcal{D}$, we have, w.p. $\ge 2e\delta$, $\mathrm{Err}_{\mathcal{D}',\ell}(\hat{f}) > a$, and therefore $\mathrm{Err}_{\tilde{\mathcal{D}},\ell}(\hat{f}) > \frac{a}{m}$ (since $\mathrm{Err}_{\tilde{\mathcal{D}},\ell} = \frac{1}{m}\mathrm{Err}_{\mathcal{D}',\ell} + (1 - \frac{1}{m})\mathrm{Err}_{\mathcal{D},\ell} \ge \frac{1}{m}\mathrm{Err}_{\mathcal{D}',\ell}$, the loss being nonnegative). The probability that indeed all the $m$ samples are from $\mathcal{D}$ is $(1 - \frac{1}{m})^m > \frac{1}{2e}$. Hence, with probability $> \delta$, we have $\mathrm{Err}_{\tilde{\mathcal{D}},\ell}(\hat{f}) > \frac{a}{m}$.

On the other hand, with probability $\ge 1 - \delta$ we have $\mathrm{Err}_{\tilde{\mathcal{D}},\ell}(\hat{f}) \le \inf_{f \in \mathcal{F}} \mathrm{Err}_{\tilde{\mathcal{D}},\ell}(f) + \epsilon \le \mathrm{Err}_{\tilde{\mathcal{D}},\ell}(0) + \epsilon = \ell(0) + \epsilon$. Hence, with positive probability,
\[
\frac{a}{m} \;\le\; \ell(0) + \epsilon.
\]
It follows that $a \le m(\ell(0) + \epsilon)$, a contradiction. ✷
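To make the interpolation step above concrete, here is a minimal numerical sketch in finite dimension. It is illustrative only: it assumes $\mathcal{H} = \mathbb{R}^d$ and takes $M_i$ to be the span of the remaining points $\{x_k : k \ne i\}$ (the actual definition of $M_i$ appears earlier in the paper and is not restated in this excerpt); all names in the snippet are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 10, 5
X = rng.normal(size=(t, d))          # rows are x_1, ..., x_t
f = rng.uniform(-1.0, 1.0, size=t)   # arbitrary target values in [-1, 1]

def perp_residual(x, M):
    """Component of x orthogonal to the row span of M."""
    Q, _ = np.linalg.qr(M.T)         # orthonormal basis of span(rows of M)
    return x - Q @ (Q.T @ x)

# v_i = x_i - P_{M_i} x_i, with M_i = span{x_k : k != i}  (illustrative assumption)
V = np.stack([perp_residual(X[i], np.delete(X, i, axis=0)) for i in range(t)])

# w = sum_i f(x_i) v_i / ||v_i||^2; for linear M_i the offset b vanishes,
# so Lambda(x) = <w, x>
w = sum(f[i] * V[i] / V[i].dot(V[i]) for i in range(t))

assert np.allclose(X @ w, f)         # Lambda(x_j) = f(x_j) for every j
print("||w|| =", np.linalg.norm(w),
      " triangle-inequality bound:",
      sum(abs(f[i]) / np.linalg.norm(V[i]) for i in range(t)))
```

When each $d(x_i, M_i) \ge t/C$, the printed bound (and hence $\|w\|$) is at most $t \cdot \frac{C}{t} = C$, which is exactly the norm estimate used in the proof.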
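The probability bookkeeping at the end of the proof of Lemma 5.14 rests on two elementary numeric facts: $(1 - \frac{1}{m})^m > \frac{1}{2e}$ for every $m \ge 2$, and consequently an event of probability $\ge 2e\delta$ for a run on $\mathcal{D}$ keeps probability $> \delta$ for a run on $\tilde{\mathcal{D}}$. A quick sanity check (the value of $\delta$ below is arbitrary, chosen only for illustration):

```python
import math

delta = 0.01
for m in [2, 3, 10, 100, 10**6]:
    p_all_from_D = (1 - 1 / m) ** m           # all m samples drawn from D
    assert p_all_from_D > 1 / (2 * math.e)    # holds for every m >= 2
    assert 2 * math.e * delta * p_all_from_D > delta
    print(f"m={m:>7}  (1-1/m)^m={p_all_from_D:.4f}  1/(2e)={1/(2*math.e):.4f}")
```

Note that $(1 - \frac{1}{m})^m$ is increasing in $m$ and already equals $\frac{1}{4} > \frac{1}{2e}$ at $m = 2$, which is why the constant $2e$ in the lemma suffices for every sample size.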
