
Boosting Algorithm

Given a sample $S$ of $n$ labeled examples $x_1, \ldots, x_n$, initialize each example $x_i$ to have a weight $w_i = 1$. Let $w = (w_1, \ldots, w_n)$.

For $t = 1, 2, \ldots, t_0$ do
    Call the weak learner on the weighted sample $(S, w)$, receiving hypothesis $h_t$.
    Multiply the weight of each example that was misclassified by $h_t$ by $\alpha = \frac{1/2 + \gamma}{1/2 - \gamma}$. Leave the other weights as they are.
End

Output the classifier MAJ$(h_1, \ldots, h_{t_0})$ which takes the majority vote of the hypotheses returned by the weak learner. Assume $t_0$ is odd so there is no tie.

Figure 6.6: The boosting algorithm
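
Rendered as code, the procedure is only a few lines. The following Python sketch of Figure 6.6 assumes labels in $\{-1, +1\}$ and a hypothetical callable weak_learner(S, labels, w) standing in for the weak learner of Definition 6.4 below; these names and the label convention are illustrative, not part of the text.

```python
def boost(S, labels, weak_learner, gamma, t0):
    """Run the Figure 6.6 loop; return the majority-vote classifier."""
    n = len(S)
    w = [1.0] * n                          # initialize each weight w_i = 1
    alpha = (0.5 + gamma) / (0.5 - gamma)  # misclassification multiplier
    hypotheses = []
    for _ in range(t0):
        h = weak_learner(S, labels, w)     # hypothesis h_t on weighted sample (S, w)
        hypotheses.append(h)
        for i in range(n):
            if h(S[i]) != labels[i]:       # wrong on x_i: scale its weight by alpha
                w[i] *= alpha

    def majority(x):                       # MAJ(h_1, ..., h_t0); odd t0 avoids ties
        return 1 if sum(h(x) for h in hypotheses) > 0 else -1

    return majority
```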

Definition 6.4 (γ-Weak learner on sample) A weak learner is an algorithm that, given examples, their labels, and a nonnegative real weight $w_i$ on each example $x_i$, produces a classifier that correctly labels a subset of examples with total weight at least $\left(\frac{1}{2} + \gamma\right) \sum_{i=1}^{n} w_i$.

At a high level, boosting makes use of the intuitive notion that if an example was misclassified, one needs to pay more attention to it. The boosting procedure is in Figure 6.6.
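
To make Definition 6.4 concrete, here is one plausible (hypothetical, not from the text) weak learner: a decision stump over one-dimensional real examples, again with labels in $\{-1, +1\}$. It simply returns the threshold classifier of maximum total correct weight; whether that weight actually reaches $\left(\frac{1}{2} + \gamma\right) \sum_{i=1}^{n} w_i$, as the definition demands, depends on the data.

```python
def stump_learner(S, labels, w):
    """Return the 1-D threshold classifier of maximum total correct weight."""
    best, best_weight = None, -1.0
    for theta in sorted(set(S)):
        for sign in (+1, -1):
            # candidate classifier: predict `sign` if x >= theta, else -sign
            correct = sum(wi for x, y, wi in zip(S, labels, w)
                          if (sign if x >= theta else -sign) == y)
            if correct > best_weight:
                best_weight, best = correct, (theta, sign)
    theta, sign = best
    return lambda x: sign if x >= theta else -sign
```

Such a function can be passed directly as the weak_learner argument of the boost sketch above.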

Theorem 6.20 Let A be a γ-weak learner for sample $S$. Then $t_0 = O\left(\frac{1}{\gamma^2} \log n\right)$ is sufficient so that the classifier MAJ$(h_1, \ldots, h_{t_0})$ produced by the boosting procedure has training error zero.

Pro<strong>of</strong>: Suppose m is the number <strong>of</strong> examples the final classifier gets wrong. Each <strong>of</strong><br />

these m examples was misclassified at least t 0 /2 times so each has weight at least α t0/2 .<br />

Thus the total weight is at least mα t0/2 . On the other hand, at time t+1, only the weights<br />

<strong>of</strong> examples misclassified at time t were increased. By the property <strong>of</strong> weak learning, the<br />

total weight <strong>of</strong> misclassified examples is at most ( 1 − γ) <strong>of</strong> the total weight at time t. Let<br />

2<br />

weight(t) be the total weight at time t. Then<br />

(<br />

weight(t + 1) ≤ α ( 1<br />

− γ) + ( ) 1<br />

+ γ) × weight(t)<br />

2 2<br />

= (1 + 2γ) × weight(t).<br />
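
The proof plays this $(1 + 2\gamma)$ per-round growth of the total weight off against the $\alpha^{t_0/2}$ weight that any example misclassified a majority of the time must accumulate. As a quick numerical sanity check of the theorem's $t_0 = O\left(\frac{1}{\gamma^2} \log n\right)$ bound, the sketch below (with illustrative values of $n$ and $\gamma$, not from the text) exhibits a $t_0$ of that order at which one such example alone would already outweigh the bound on the total weight.

```python
import math

# Illustrative values, not from the text.
n, gamma = 1000, 0.1
alpha = (0.5 + gamma) / (0.5 - gamma)

# Smallest t0 with n * (1 - 4*gamma**2)**(t0/2) < 1, which is O(log(n)/gamma**2).
t0 = math.ceil(2 * math.log(n) / -math.log(1 - 4 * gamma**2))

upper = n * (1 + 2 * gamma) ** t0   # bound on the total weight after t0 rounds
lower = alpha ** (t0 / 2)           # weight of one example wrong >= t0/2 times
print(t0, lower > upper)            # prints 339 True: no example can remain wrong
```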
