The error rate of learning halfspaces using kernel-SVM

For $1 \le i \le t$ let $v_i = x_i - P_{M_i} x_i$. Note that $\|v_i\|_{\mathcal{H}} = d(x_i, M_i) \ge \frac{t}{C}$. Now, given a function $f : A \to [-1,1]$, we will show that $f \in \mathcal{F}_{\mathcal{H}}(\mathcal{X}, C)$. Consider the affine functional
\[
\Lambda(x) \;=\; \sum_{i=1}^{t} \frac{f(x_i)}{\|v_i\|_{\mathcal{H}}^{2}} \left\langle v_i,\, x - P_{M_i}(x) \right\rangle_{\mathcal{H}}
\;=\; \left\langle \sum_{i=1}^{t} \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}},\, x \right\rangle_{\mathcal{H}} - \sum_{i=1}^{t} \left\langle \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}},\, P_{M_i}(x) \right\rangle_{\mathcal{H}}.
\]
Note that since $v_i$ is perpendicular to $M_i$, the term $b := -\sum_{i=1}^{t} \left\langle \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}},\, P_{M_i}(x) \right\rangle_{\mathcal{H}}$ does not depend on $x$. Let $w := \sum_{i=1}^{t} \frac{f(x_i)\, v_i}{\|v_i\|_{\mathcal{H}}^{2}}$. We have
\[
\|w\|_{\mathcal{H}} \;\le\; \sum_{i=1}^{t} \frac{|f(x_i)|}{\|v_i\|_{\mathcal{H}}} \;\le\; t \cdot \frac{C}{t} \;=\; C.
\]
Therefore, $\Lambda|_{\mathcal{X}} \in \mathcal{F}_{\mathcal{H}}(\mathcal{X}, C)$. Finally, for every $1 \le j \le t$ we have
\[
\Lambda(x_j) \;=\; \sum_{i=1}^{t} \frac{f(x_i)}{\|v_i\|_{\mathcal{H}}^{2}} \left\langle v_i,\, v_j - P_{M_i}(v_j) \right\rangle_{\mathcal{H}} \;=\; f(x_j).
\]
Here, the last equality follows from the fact that for $i \ne j$, since $v_i$ is perpendicular to $M_i$, $\langle v_i, v_j - P_{M_i}(v_j) \rangle_{\mathcal{H}} = 0$, while for $i = j$, since $v_j \perp M_j$ implies $P_{M_j}(v_j) = 0$, the term equals $f(x_j)$. Therefore, $f = \Lambda|_A$.

Let $\ell : \mathbb{R} \to \mathbb{R}$ be a surrogate loss function. We say that an algorithm $(\epsilon, \delta)$-learns $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ using $m$ examples w.r.t. $\ell$ if:

• Its input is a sample of $m$ points in $\mathcal{X} \times \{\pm 1\}$ and its output is a hypothesis in $\mathcal{F}$.

• For every distribution $\mathcal{D}$ on $\mathcal{X} \times \{\pm 1\}$, it returns, with probability $\ge 1 - \delta$, a hypothesis $\hat{f} \in \mathcal{F}$ with $\mathrm{Err}_{\mathcal{D},\ell}(\hat{f}) \le \inf_{f \in \mathcal{F}} \mathrm{Err}_{\mathcal{D},\ell}(f) + \epsilon$.

Lemma 5.14. Suppose that an algorithm $\mathcal{A}$ $(\epsilon, \delta)$-learns $\mathcal{F}$ using $m$ examples w.r.t. a surrogate loss $\ell$. Then, for every pair of distributions $\mathcal{D}$ and $\mathcal{D}'$ on $\mathcal{X} \times \{\pm 1\}$, if $\hat{f} \in \mathcal{F}$ is the hypothesis returned by $\mathcal{A}$ running on $\mathcal{D}$, then $\mathrm{Err}_{\mathcal{D}',\ell}(\hat{f}) \le m(\ell(0) + \epsilon)$ w.p. $\ge 1 - 2e\delta$.

Proof. Suppose toward a contradiction that w.p. $\ge 2e\delta$ we have $\mathrm{Err}_{\mathcal{D}',\ell}(\hat{f}) > a$ for some $a > m(\ell(0) + \epsilon)$. Consider the following distribution $\tilde{\mathcal{D}}$: w.p. $\frac{1}{m}$ we sample from $\mathcal{D}'$, and w.p. $1 - \frac{1}{m}$ we sample from $\mathcal{D}$. Suppose now that we run the algorithm $\mathcal{A}$ on $\tilde{\mathcal{D}}$. Conditioning on the event that all the samples are from $\mathcal{D}$, we have, w.p. $\ge 2e\delta$, $\mathrm{Err}_{\mathcal{D}',\ell}(\hat{f}) > a$, and therefore $\mathrm{Err}_{\tilde{\mathcal{D}},\ell}(\hat{f}) > \frac{a}{m}$ (since $\mathrm{Err}_{\tilde{\mathcal{D}},\ell} = \frac{1}{m}\mathrm{Err}_{\mathcal{D}',\ell} + (1 - \frac{1}{m})\mathrm{Err}_{\mathcal{D},\ell} \ge \frac{1}{m}\mathrm{Err}_{\mathcal{D}',\ell}$, the loss being nonnegative). The probability that indeed all the $m$ samples are from $\mathcal{D}$ is $(1 - \frac{1}{m})^m > \frac{1}{2e}$. Hence, with probability $> \delta$, we have $\mathrm{Err}_{\tilde{\mathcal{D}},\ell}(\hat{f}) > \frac{a}{m}$.

On the other hand, with probability $\ge 1 - \delta$ we have $\mathrm{Err}_{\tilde{\mathcal{D}},\ell}(\hat{f}) \le \inf_{f \in \mathcal{F}} \mathrm{Err}_{\tilde{\mathcal{D}},\ell}(f) + \epsilon \le \mathrm{Err}_{\tilde{\mathcal{D}},\ell}(0) + \epsilon = \ell(0) + \epsilon$. Hence, with positive probability,
\[
\frac{a}{m} \;\le\; \ell(0) + \epsilon.
\]
It follows that $a \le m(\ell(0) + \epsilon)$, a contradiction. ✷
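To make the interpolation step above concrete, here is a minimal numerical sketch in finite dimension. It is illustrative only: it assumes $\mathcal{H} = \mathbb{R}^d$ and takes $M_i$ to be the span of the remaining points $\{x_k : k \ne i\}$ (the actual definition of $M_i$ appears earlier in the paper and is not restated in this excerpt); all names in the snippet are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 10, 5
X = rng.normal(size=(t, d))          # rows are x_1, ..., x_t
f = rng.uniform(-1.0, 1.0, size=t)   # arbitrary target values in [-1, 1]

def perp_residual(x, M):
    """Component of x orthogonal to the row span of M."""
    Q, _ = np.linalg.qr(M.T)         # orthonormal basis of span(rows of M)
    return x - Q @ (Q.T @ x)

# v_i = x_i - P_{M_i} x_i, with M_i = span{x_k : k != i}  (illustrative assumption)
V = np.stack([perp_residual(X[i], np.delete(X, i, axis=0)) for i in range(t)])

# w = sum_i f(x_i) v_i / ||v_i||^2; for linear M_i the offset b vanishes,
# so Lambda(x) = <w, x>
w = sum(f[i] * V[i] / V[i].dot(V[i]) for i in range(t))

assert np.allclose(X @ w, f)         # Lambda(x_j) = f(x_j) for every j
print("||w|| =", np.linalg.norm(w),
      " triangle-inequality bound:",
      sum(abs(f[i]) / np.linalg.norm(V[i]) for i in range(t)))
```

When each $d(x_i, M_i) \ge t/C$, the printed bound (and hence $\|w\|$) is at most $t \cdot \frac{C}{t} = C$, which is exactly the norm estimate used in the proof.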
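The probability bookkeeping at the end of the proof of Lemma 5.14 rests on two elementary numeric facts: $(1 - \frac{1}{m})^m > \frac{1}{2e}$ for every $m \ge 2$, and consequently an event of probability $\ge 2e\delta$ for a run on $\mathcal{D}$ keeps probability $> \delta$ for a run on $\tilde{\mathcal{D}}$. A quick sanity check (the value of $\delta$ below is arbitrary, chosen only for illustration):

```python
import math

delta = 0.01
for m in [2, 3, 10, 100, 10**6]:
    p_all_from_D = (1 - 1 / m) ** m           # all m samples drawn from D
    assert p_all_from_D > 1 / (2 * math.e)    # holds for every m >= 2
    assert 2 * math.e * delta * p_all_from_D > delta
    print(f"m={m:>7}  (1-1/m)^m={p_all_from_D:.4f}  1/(2e)={1/(2*math.e):.4f}")
```

Note that $(1 - \frac{1}{m})^m$ is increasing in $m$ and already equals $\frac{1}{4} > \frac{1}{2e}$ at $m = 2$, which is why the constant $2e$ in the lemma suffices for every sample size.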
