
Notice that the sum of the slack variables is the total hinge loss of w. So this convex optimization is minimizing a weighted sum of 1/γ², where γ is the margin, and the total hinge loss. If we were to add the constraint that all s_i = 0, then this would be solving for the maximum-margin linear separator for the data. However, in practice, optimizing a weighted combination generally performs better.
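To make the objective concrete, here is a minimal sketch (Python with NumPy; the function name and the trade-off constant c are illustrative choices, not from the text). It uses the standard normalization in which the margin constraints read y_i⟨w, x_i⟩ ≥ 1 − s_i, so that 1/γ² = |w|²:

    import numpy as np

    def soft_margin_objective(w, X, y, c=1.0):
        # X: (n, d) array of examples; y: (n,) array of +/-1 labels.
        # Returns a weighted sum of 1/gamma^2 (= ||w||^2 under the
        # normalization above) and the total hinge loss.
        margins = y * (X @ w)
        slacks = np.maximum(0.0, 1.0 - margins)  # s_i = hinge loss on example i
        return c * np.dot(w, w) + slacks.sum()

Setting all slacks to zero, i.e., requiring y_i⟨w, x_i⟩ ≥ 1 for every i, recovers the hard maximum-margin program; the weighted combination instead tolerates a few violated constraints in exchange for a larger margin.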

6.9 VC-Dimension

In Section 6.2 we presented several theorems showing that so long as the training set S is large compared to (1/ɛ) log(|H|), we can be confident that every h ∈ H with err_D(h) ≥ ɛ will have err_S(h) > 0, and that if S is large compared to (1/ɛ²) log(|H|), then we can be confident that every h ∈ H will have |err_D(h) − err_S(h)| ≤ ɛ. In essence, these results used log(|H|) as a measure of the complexity of the class H. VC-dimension is a different, tighter measure of complexity for a concept class and, as we will see, is also sufficient to yield confidence bounds. For any class H, VCdim(H) ≤ log₂(|H|), but it can also be quite a bit smaller. Let's introduce and motivate it through an example.

Consider a database consisting of the salary and age for a random sample of the adult population in the United States. Suppose we are interested in using the database to answer questions of the form: "What fraction of the adult population in the United States has age between 35 and 45 and salary between $50,000 and $70,000?" That is, we are interested in queries that ask about the fraction of the adult population within some axis-parallel rectangle. What we can do is calculate the fraction of the database satisfying this condition and return that as our answer. This brings up the following question: how large does our database need to be so that, with probability at least 1 − δ, our answer will be within ±ɛ of the truth for every possible rectangle query of this form?
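Computing the database's answer to such a query is straightforward; a minimal sketch (Python with NumPy; the array and function names are assumptions for illustration):

    import numpy as np

    def rectangle_fraction(ages, salaries, a_lo, a_hi, s_lo, s_hi):
        # Fraction of the database sample that falls inside the
        # axis-parallel rectangle [a_lo, a_hi] x [s_lo, s_hi].
        inside = ((ages >= a_lo) & (ages <= a_hi) &
                  (salaries >= s_lo) & (salaries <= s_hi))
        return inside.mean()

    # The query from the text: ages 35-45, salaries $50,000-$70,000.
    # rectangle_fraction(ages, salaries, 35, 45, 50_000, 70_000)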

If we assume our values are discretized, say to 100 possible ages and 1,000 possible salaries, then there are at most (100 × 1,000)² = 10¹⁰ possible rectangles. This means we can apply Theorem 6.3 with |H| ≤ 10¹⁰. Specifically, we can think of the target concept c∗ as the empty set, so that err_S(h) is exactly the fraction of the sample inside rectangle h and err_D(h) is exactly the fraction of the whole population inside h.²² This would tell us that a sample size of

(1/(2ɛ²)) (10 ln 10 + ln(2/δ))

would be sufficient.
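To get a feel for the numbers, a quick sketch (Python; the choices ɛ = 0.01 and δ = 0.05 are purely illustrative) evaluating this bound:

    import math

    def sample_size(eps, delta, ln_H):
        # Evaluates (1/(2*eps^2)) * (ln|H| + ln(2/delta)), with
        # ln|H| <= 10 ln 10 for the 10^10 possible rectangles.
        return math.ceil((ln_H + math.log(2.0 / delta)) / (2.0 * eps**2))

    print(sample_size(eps=0.01, delta=0.05, ln_H=10 * math.log(10)))
    # -> 133574: roughly 1.3 × 10⁵ samples suffice for ±1% accuracy on
    # every rectangle query, with probability at least 0.95.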

However, what if we do not wish to discretize our concept class? Another approach would be to say that if there are only N adults total in the United States, then there are at most N⁴ rectangles that are truly different with respect to D (each of the four boundaries need only be chosen from among the N people's coordinate values), and so we could use |H| ≤ N⁴. Still, this suggests that S needs to grow with N, albeit logarithmically, and one might wonder if that is really necessary. VC-dimension, and the notion of the growth function of a concept class, address exactly this question.

²² Technically, D is the uniform distribution over the adult population of the United States, and we want to think of S as an independent, identically distributed sample from this D.

