Foundations of Data Science


described using O(k log d) bits: log_2(d) bits to give the index of the feature in the root, O(1) bits to indicate for each child if it is a leaf and if so what label it should have, and then O(k_L log d) and O(k_R log d) bits respectively to describe the left and right subtrees, where k_L is the number of nodes in the left subtree and k_R is the number of nodes in the right subtree. So, by Theorem 6.5, we can be confident the true error is low if we can produce a consistent tree with fewer than ε|S|/ log(d) nodes.
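To make the bit-counting argument concrete, here is a minimal Python sketch. It assumes a toy tree representation in which an internal node is a tuple ("node", feature_index, left, right) and a leaf is ("leaf", label); the representation and the exact constant charged per leaf are illustrative assumptions, not part of the text.

```python
import math

def description_bits(tree, d):
    """Bits to write down `tree` over d features, following the scheme above."""
    if tree[0] == "leaf":
        # O(1) bits per leaf: one bit saying "this is a leaf", one bit for its
        # label (illustrative constant).
        return 2
    _, feature, left, right = tree
    root_bits = math.ceil(math.log2(d))  # log_2(d) bits for the feature index at the root
    return root_bits + description_bits(left, d) + description_bits(right, d)

# Example: a tree with k = 2 internal nodes over d = 1000 features needs
# roughly 2 * log_2(1000) + O(1) ≈ 26 bits.
tree = ("node", 17,
        ("node", 204, ("leaf", 1), ("leaf", 0)),
        ("leaf", 1))
print(description_bits(tree, d=1000))  # 26
```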

6.4 Regularization: penalizing complexity

Theorems 6.3 and 6.5 suggest the following idea. Suppose that there is no simple rule that is perfectly consistent with the training data, but we notice there are very simple rules with training error 20%, say, and then some more complex rules with training error 10%, and so on. In this case, perhaps we should optimize some combination of training error and simplicity. This is the notion of regularization, also called complexity penalization. Specifically, a regularizer is a penalty term that penalizes more complex hypotheses.

Given our theorems so far, a natural measure of complexity of a hypothesis is the number of bits we need to write it down.²⁰ Consider now fixing some description language, and let H_i denote those hypotheses that can be described in i bits in this language, so |H_i| ≤ 2^i. Let δ_i = δ/2^i. Rearranging the bound of Theorem 6.3, we know that with probability at least 1 − δ_i, all h ∈ H_i satisfy
\[
\mathrm{err}_D(h) \le \mathrm{err}_S(h) + \sqrt{\frac{\ln(|H_i|) + \ln(2/\delta_i)}{2|S|}}.
\]
Now, applying the union bound over all i, using the fact that δ_1 + δ_2 + δ_3 + · · · = δ, and also the fact that ln(|H_i|) + ln(2/δ_i) ≤ i ln(4) + ln(2/δ), gives the following corollary.

Corollary 6.6 Fix any description language, and consider a training sample S drawn from distribution D. With probability greater than or equal to 1 − δ, all hypotheses h satisfy
\[
\mathrm{err}_D(h) \le \mathrm{err}_S(h) + \sqrt{\frac{\mathrm{size}(h)\,\ln(4) + \ln(2/\delta)}{2|S|}}
\]
where size(h) denotes the number of bits needed to describe h in the given language.
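To spell out the two facts used in this derivation: since δ_i = δ/2^i and |H_i| ≤ 2^i,
\[
\delta_1 + \delta_2 + \delta_3 + \cdots = \frac{\delta}{2} + \frac{\delta}{4} + \frac{\delta}{8} + \cdots = \delta,
\]
\[
\ln(|H_i|) + \ln(2/\delta_i) \le \ln(2^i) + \ln\!\left(\frac{2 \cdot 2^i}{\delta}\right) = 2i\ln(2) + \ln(2/\delta) = i\ln(4) + \ln(2/\delta).
\]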

Corollary 6.6 gives us the tradeoff we were looking for. It tells us that rather than searching for a rule of low training error, we instead may want to search for a rule with a low right-hand-side in the displayed formula. If we can find one for which this quantity is small, we can be confident true error will be low as well.
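As an illustration of searching for a rule with a low right-hand side, here is a minimal Python sketch. The candidate (training error, size-in-bits) pairs, the sample size n, and the helper name occam_bound are hypothetical, chosen purely for illustration.

```python
import math

def occam_bound(err_S, size_bits, n, delta=0.05):
    """Right-hand side of Corollary 6.6: training error plus the complexity penalty."""
    return err_S + math.sqrt((size_bits * math.log(4) + math.log(2 / delta)) / (2 * n))

# Hypothetical candidates: (training error on S, description length in bits).
candidates = [(0.20, 30), (0.10, 300), (0.05, 3000)]
n = 1000  # |S|, the number of training examples

for err_S, size_bits in candidates:
    print(err_S, size_bits, round(occam_bound(err_S, size_bits, n), 3))

# Pick the candidate whose bound on true error is smallest.
best = min(candidates, key=lambda c: occam_bound(c[0], c[1], n))
print("selected:", best)
```

With these particular numbers the simplest rule is selected even though its training error is highest, because the penalty term grows with description length.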

information gain of x_i is defined as: Ent(S_v) − [ (|S_v^0|/|S_v|) Ent(S_v^0) + (|S_v^1|/|S_v|) Ent(S_v^1) ]. Here, Ent(S′) is the binary entropy of the label proportions in set S′; that is, if a p fraction of the examples in S′ are positive, then Ent(S′) = p log_2(1/p) + (1 − p) log_2(1/(1 − p)), defining 0 log_2(0) = 0. This then continues until all leaves are pure—they have only positive or only negative examples.

²⁰ Later we will see support vector machines that use a regularizer for linear separators based on the margin of separation of data.
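The entropy and information-gain computation in the footnote above is easy to state in code. The following is a minimal Python sketch, assuming examples are 0/1 feature vectors and labels are 0/1; the function names and data format are illustrative.

```python
import math

def entropy(labels):
    """Binary entropy Ent(S') of the labels in a set, with 0*log_2(0) taken to be 0."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of positive (label 1) examples
    if p == 0.0 or p == 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def information_gain(examples, labels, i):
    """Ent(S_v) minus the size-weighted entropies of the two children from splitting on feature i."""
    S0 = [y for x, y in zip(examples, labels) if x[i] == 0]
    S1 = [y for x, y in zip(examples, labels) if x[i] == 1]
    n = len(labels)
    return entropy(labels) - (len(S0) / n * entropy(S0) + len(S1) / n * entropy(S1))
```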

