06.09.2023 Views

25949117

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

The adult census income binary classification dataset would be an example of a training data set that

could be used to create a new model to predict whether a person’s income level would be greater or

less than $50,000. This prediction is based on the known input variables like age, education, job type,

marital status, race, and number of hours worked per week.

A key point to note is that, in this example, a specific “binary” outcome is defined for a given set of

input data. Based on the input elements, a person’s income is predicted to be only one of the two

following possibilities:

Income = Less than or equal to $50,000 a year.

Income = Greater than $50,000 a year.

Browsing this sample dataset manually in Microsoft Excel, you can easily start to see patterns emerge

that would likely affect the outcome based on today’s common knowledge, specifically that education

level and occupation are major factors in predicting the outcome. No wonder parents constantly

remind their children to stay in school and get a good education. This is also the same basic process

that supervised learning prediction algorithms attempt to achieve: to determine a repeatable pattern of

inference that can be applied to a new set of input data.

Once generated, a new model can then be validated for accuracy by using testing datasets. Here is

where it all gets really interesting: by using larger and more diverse “training” datasets, predictive

models can incrementally improve themselves and keep learning.

Predictive models can generally achieve better accuracy results when provided with new (and more

recent) datasets. The prediction evaluation process can be expressed as shown in Figure 2-4.

FIGURE 2-4 Testing the new prediction model.

The evaluation process for new prediction models that use supervised learning primarily consists of

determining the accuracy of the new generated model. In this case, the prediction model accuracy can

easily be determined because the input values and outcomes are already known. The question then

becomes how approximate the model’s prediction is based on the known input and output values

supplied.

Each time a new prediction model is generated, the first step should always be to evaluate the results

30

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!