The adult census income binary classification dataset is an example of a training dataset that can
be used to create a new model that predicts whether a person’s income is greater than $50,000 a year
or not. The prediction is based on known input variables such as age, education, job type,
marital status, race, and number of hours worked per week.
A key point to note is that, in this example, a specific “binary” outcome is defined for a given set of
input data. Based on the input elements, a person’s income is predicted to be only one of the two
following possibilities:
Income = Less than or equal to $50,000 a year.
Income = Greater than $50,000 a year.
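The two possible outcomes above can be represented as a 0/1 label before a model is trained. The following is a minimal sketch only: the label strings ">50K" and "<=50K" and the function name encode_income are assumptions made for illustration, and the actual dataset may encode the income column differently.

```python
# Hypothetical label strings; the real dataset's encoding may differ.
HIGH_INCOME = ">50K"   # income greater than $50,000 a year
LOW_INCOME = "<=50K"   # income less than or equal to $50,000 a year


def encode_income(label: str) -> int:
    """Map an income label to 1 (greater than $50,000) or 0 (otherwise)."""
    cleaned = label.strip()
    if cleaned == HIGH_INCOME:
        return 1
    if cleaned == LOW_INCOME:
        return 0
    # A binary outcome allows exactly two values; reject anything else.
    raise ValueError(f"unexpected income label: {label!r}")


labels = ["<=50K", ">50K", " <=50K "]
encoded = [encode_income(value) for value in labels]
```

Rejecting unknown labels, rather than silently mapping them to 0, keeps the outcome strictly binary.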
If you browse this sample dataset manually in Microsoft Excel, patterns that likely affect the outcome
quickly emerge, and they match today’s common knowledge: education level and occupation are major
factors in predicting income. No wonder parents constantly remind their children to stay in school
and get a good education. Supervised learning prediction algorithms attempt the same basic process:
they determine a repeatable pattern of inference that can be applied to a new set of input data.
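To make the idea of a "repeatable pattern of inference" concrete, here is a deliberately simple sketch that learns the most common outcome for each education level and applies it to new input. It is not the algorithm used by any particular product. The rows are invented for illustration and are not taken from the real dataset, and the names learn_majority_rule and predict are hypothetical.

```python
from collections import Counter, defaultdict

# Toy rows invented for illustration only: (education, outcome),
# where outcome 1 means income greater than $50,000 and 0 means otherwise.
training_rows = [
    ("Bachelors", 1),
    ("Bachelors", 1),
    ("Bachelors", 0),
    ("HS-grad", 0),
    ("HS-grad", 0),
    ("HS-grad", 1),
    ("Masters", 1),
]


def learn_majority_rule(rows):
    """Learn the most frequent outcome for each education level."""
    counts = defaultdict(Counter)
    for education, outcome in rows:
        counts[education][outcome] += 1
    rule = {edu: c.most_common(1)[0][0] for edu, c in counts.items()}
    # Fallback for education levels never seen in training.
    default = Counter(outcome for _, outcome in rows).most_common(1)[0][0]
    return rule, default


def predict(rule, default, education):
    """Apply the learned pattern to a new input value."""
    return rule.get(education, default)


rule, default = learn_majority_rule(training_rows)
```

Real supervised learning algorithms combine many input variables rather than one, but the shape is the same: learn from rows with known outcomes, then apply the learned pattern to rows whose outcome is unknown.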
Once generated, a new model can be validated for accuracy by using testing datasets. Here is
where it all gets really interesting: when retrained on larger and more diverse “training” datasets,
predictive models can incrementally improve and keep learning.
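One common way to obtain separate training and testing datasets is to hold back part of the available rows. The sketch below assumes a simple random holdout split. The function name split_rows, the 25 percent test fraction, and the fixed seed are illustrative choices, not something the text prescribes.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and split them into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    test_size = max(1, int(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


rows = list(range(20))  # stand-in for 20 dataset rows
train_rows, test_rows = split_rows(rows)
```

The model is generated from train_rows only, so test_rows acts as data the model has not seen.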
Predictive models can generally achieve better accuracy when provided with new (and more
recent) datasets. The prediction evaluation process can be expressed as shown in Figure 2-4.
FIGURE 2-4 Testing the new prediction model.
The evaluation process for new prediction models that use supervised learning primarily consists of
determining the accuracy of the newly generated model. In this case, the accuracy can easily be
determined because both the input values and the outcomes are already known. The question then
becomes how closely the model’s predictions match the known outcomes for the input values
supplied.
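Because the outcomes in a testing dataset are already known, accuracy can be computed as the fraction of predictions that match them. A minimal sketch follows. The function name accuracy and the sample values are illustrative assumptions, not results from a real model.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that equal the known outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one outcome is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Illustrative values only: 1 = greater than $50,000, 0 = otherwise.
known_outcomes = [1, 0, 0, 1, 0]
model_predictions = [1, 0, 1, 1, 0]
score = accuracy(model_predictions, known_outcomes)  # 4 of 5 match: 0.8
```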
Each time a new prediction model is generated, the first step should always be to evaluate the results