Foundations of Data Science


6 Machine Learning

6.1 Introduction

Machine learning algorithms are general-purpose tools that solve problems from many disciplines without detailed domain-specific knowledge. They have proven to be very effective in a large number of contexts, including computer vision, speech recognition, document classification, automated driving, computational science, and decision support.

The core problem. A core problem underlying many machine learning applications is learning a good classification rule from labeled data. This problem consists of a domain of interest X, called the instance space, such as email messages or patient records, and a classification task, such as classifying email messages into spam versus non-spam or determining which patients will respond well to a given medical treatment. We will typically assume our instance space is X = {0,1}^d or X = R^d, corresponding to data that is described by d Boolean or real-valued features. Features for email messages could be the presence or absence of various types of words, and features for patient records could be the results of various medical tests. To perform the learning task, our learning algorithm is given a set S of labeled training examples, which are points in X along with their correct classification. This training data could be a collection of email messages, each labeled as spam or not spam, or a collection of patients, each labeled by whether or not they responded well to the given medical treatment. Our algorithm then aims to use the training examples to produce a classification rule that will perform well over new data. A key feature of machine learning, which distinguishes it from other algorithmic tasks, is that our goal is generalization: to use one set of data in order to perform well on new data we have not seen yet. We focus on binary classification, where items in the domain of interest are classified into two categories, as in the medical and spam-detection examples above.
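As a concrete illustration of this setup (a minimal sketch; the vocabulary and example messages below are invented for the example, not taken from the text), email messages can be mapped into the instance space X = {0,1}^d by recording the presence or absence of d chosen words:

```python
# Map email messages into X = {0,1}^d, where each coordinate records
# the presence (1) or absence (0) of one chosen word.
# The vocabulary and messages are illustrative placeholders.
VOCAB = ["winner", "prize", "meeting", "free", "report"]  # d = 5 features

def to_feature_vector(message):
    """Return the Boolean feature vector of a message over VOCAB."""
    words = set(message.lower().split())
    return [1 if w in words else 0 for w in VOCAB]

# A labeled training set S: points in X paired with their correct
# classification (+1 = spam, -1 = not spam).
S = [
    (to_feature_vector("free prize winner click now"), +1),
    (to_feature_vector("meeting moved to friday"),     -1),
    (to_feature_vector("claim your free prize"),       +1),
    (to_feature_vector("quarterly report attached"),   -1),
]
```

A learning algorithm sees only S, i.e., the feature vectors and their labels; its output rule is then judged on new messages encoded the same way.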

How to learn. A high-level approach to solving this problem, which many algorithms we discuss will follow, is to try to find a "simple" rule with good performance on the training data. For instance, in the case of classifying email messages, we might find a set of highly indicative words such that every spam email in the training data has at least one of these words and none of the non-spam emails has any of them; in this case, the rule "if the message has any of these words then it is spam, else it is not" would be a simple rule that performs well on the training data. Or, we might find a way of weighting words with positive and negative weights such that the total weighted sum of the words in the email message is positive on the spam emails in the training data and negative on the non-spam emails. We will then argue that so long as the training data is representative of what future data will look like, we can be confident that any sufficiently "simple" rule that performs well on the training data will also perform well on future data. To make this into a formal mathematical statement, we need to be precise about what we mean by "simple" as well as what it means for training data to be "representative" of future data. In fact, we will see several notions of complexity, including bit-counting and VC-dimension, that allow us to make such statements precise.

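The two kinds of simple rules just described can be sketched as follows (a hypothetical illustration: the word indices, weights, and training set are invented here, and finding good ones is precisely the learning algorithm's job):

```python
# Two "simple" rules over Boolean feature vectors x in {0,1}^d.
# Hypothetical feature order: [winner, prize, meeting, free, report].

INDICATIVE = [0, 1, 3]  # indices of highly indicative words: winner, prize, free

def word_list_rule(x):
    """Spam (+1) iff the message contains at least one indicative word."""
    return +1 if any(x[i] == 1 for i in INDICATIVE) else -1

# Positive weights on spam-like words, negative on the others (illustrative).
WEIGHTS = [2.0, 2.0, -3.0, 1.5, -2.0]

def weighted_sum_rule(x):
    """Spam (+1) iff the weighted sum of the words present is positive."""
    total = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return +1 if total > 0 else -1

# A small labeled training set S (+1 = spam, -1 = not spam).
S = [
    ([1, 1, 0, 1, 0], +1),  # "free prize winner ..."
    ([0, 0, 1, 0, 0], -1),  # "meeting moved ..."
    ([0, 1, 0, 1, 0], +1),  # "claim your free prize"
    ([0, 0, 0, 0, 1], -1),  # "quarterly report attached"
]
# Both rules classify every training example correctly, i.e., both are
# "simple" rules that perform well on the training data.
for rule in (word_list_rule, weighted_sum_rule):
    assert all(rule(x) == label for x, label in S)
```

Whether such a rule also performs well on future data is exactly the generalization question the complexity notions above are meant to answer.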
