The error rate of learning halfspaces using kernel-SVM


We now expand on this brief description of the main steps.

Polynomial sample implies small C

Let X be a subset of the unit ball of some Hilbert space H and let C > 0. Assume that affine functionals over X with norm ≤ C can be learnt using m examples with error ɛ and confidence δ. That is, assume that there is an algorithm such that:

• Its input is a sample of m points in X × {±1} and its output is an affine functional Λ_{w,b} with ‖w‖ ≤ C.
• For every distribution D on X × {±1}, it returns, with probability 1 − δ, a pair w, b with Err_{D,hinge}(Λ_{w,b}) ≤ inf_{‖w′‖≤C, b′∈ℝ} Err_{D,hinge}(Λ_{w′,b′}) + ɛ.

We will show that there is an equivalent inner product ⟨·,·⟩′ on H under which X is contained in a unit ball (not necessarily centred at 0), and the affine functional returned by any learning algorithm as above must have norm O(m³) w.r.t. the new norm.

The construction of the norm proceeds as follows. We first find an affine subspace M ⊂ H of dimension d ≤ m that is very close to X, in the sense that the distance of every point in X from M is at most m/C. To find such an M, we assume toward a contradiction that no such M exists, and use this to show that there is a subset A ⊂ X such that every function f : A → [−1, 1] can be realized by some affine functional with norm ≤ C. This contradicts the assumption that affine functionals with norm ≤ C can be learnt using m examples.

Having the subspace M at hand, we construct, using John's lemma (e.g., Matousek (2002)), an inner product ⟨·,·⟩′′ on M and a distribution μ_N on X with the following properties: the projection of X on M is contained in a ball of radius 1/2 w.r.t. ⟨·,·⟩′′, and the hinge error of every affine functional w.r.t. μ_N is lower bounded by the norm of that functional divided by m². We show that this entails that any affine functional returned by the algorithm must have norm O(m³) w.r.t. the inner product ⟨·,·⟩′′.

Finally, we construct an inner product ⟨·,·⟩′ on H by using ⟨·,·⟩′′ on M and multiplying the original inner product by C²/(2m²) on M⊥.

The one-dimensional distribution

We define a distribution D on [−1, 1] as follows. Start with the distribution D₁ that takes the values ±(γ, 1), with D₁(γ, 1) = 0.7 and D₁(−γ, −1) = 0.3. Clearly, for this distribution the threshold 0 has zero error rate. To construct D, we perturb D₁ with "noise": let D = (1 − λ)D₁ + λD₂, where D₂ is defined as follows. The labels are uniform and independent of the instance, and the marginal probability over the instances is given by the density

    ρ(x) = 8 / (π √(1 − (8x)²))  if |x| ≤ 1/8,  and  ρ(x) = 0  if |x| > 1/8.

This choice of ρ simplifies our calculations due to its relation to Chebyshev polynomials; however, other choices of ρ supported on a small interval around zero would also work.
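To make the construction of D concrete, the following sketch (ours, not from the paper) samples from the mixture D = (1 − λ)D₁ + λD₂. It relies on the fact that if u is uniform on [−1, 1], then x = sin(πu/2)/8 has exactly the density ρ above; the values of gamma and lam below are illustrative placeholders, not values fixed by the paper.

```python
import numpy as np

def sample_D(n, gamma=0.25, lam=0.1, seed=0):
    """Sample n labelled points (x, y) from D = (1 - lam)*D1 + lam*D2.

    D1 puts mass 0.7 on (gamma, +1) and 0.3 on (-gamma, -1); the
    threshold at 0 classifies it perfectly. D2 has uniform labels and
    instance density rho(x) = 8 / (pi*sqrt(1 - (8x)^2)) on |x| <= 1/8.
    gamma and lam are free parameters of the construction; the defaults
    here are illustrative only.
    """
    rng = np.random.default_rng(seed)
    noisy = rng.random(n) < lam  # which mixture component each point uses

    # D1 component: (gamma, +1) with prob. 0.7, (-gamma, -1) with prob. 0.3.
    sign = np.where(rng.random(n) < 0.7, 1.0, -1.0)
    x = gamma * sign
    y = sign.copy()

    # D2 component: pushing u ~ Uniform[-1, 1] through sin(pi*u/2)/8 yields
    # the density rho (the Chebyshev/arcsine weight squeezed into [-1/8, 1/8]);
    # labels are uniform and independent of the instance.
    u = rng.uniform(-1.0, 1.0, n)
    x[noisy] = np.sin(np.pi * u[noisy] / 2.0) / 8.0
    y[noisy] = rng.choice([-1.0, 1.0], size=noisy.sum())

    return x, y

x, y = sample_D(10_000)
print(f"fraction with |x| <= 1/8: {np.mean(np.abs(x) <= 0.125):.3f}")
```

The structure that drives the lower bound is visible in the samples: the clean component D₁ is perfectly separated by the threshold 0, while the λ-fraction of noise is concentrated on the tiny interval [−1/8, 1/8] around that very threshold.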
