01.11.2020 Views

Machine Learning in Python Essential Techniques for Predictive Analysis by Michael Bowles (z-lib.org).epub

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

The attributes shown in Table 2.1 come in two different types:

numeric variables and categorical (or factor) variables. Attribute 1

(height) is a numeric variable and is the most usual type of attribute.

Attribute 2 is gender and is indicated by the entry Male or Female.

This type of attribute is called a categorical or factor variable.

Categorical variables have the property that there’s no order relation

between the various values. There’s no sense to Male < Female

(despite centuries of squabbling). Categorical variables can be twovalued,

like Male Female, or multivalued, like states (AL, AK, AR . .

. WY). Other distinctions can be drawn regarding attributes (integer

versus float, for example), but they do not have the same impact on

machine learning algorithms. The reason for this is that many

machine learning algorithms take numeric attributes only; they

cannot handle categorical or factor variables. Penalized regression

algorithms deal only with numeric attributes. The same is true for

support vector machines, kernel methods, and K-nearest neighbors.

Chapter 4 will cover methods for converting categorical variables to

numeric variables. The nature of the variables will shape your

algorithm choices and the direction you take in developing a

predictive model, so it’s one of the things you need to pay attention to

when you face a new problem.

A similar dichotomy arises for the labels. The labels shown in Table

2.1 are numeric: the amount of money that the individual spent on

books online last year. In other problems, though, the labels may also

be categorical. For example, if the job with Table 2.1 were to predict

which individuals would spend more than $200 next year the problem

would change, and the problem approach would change. The new

problem of predicting which customers would spend more than $200

would have new labels. The new labels would take one of two values.

Table 2.2 shows the relationship between the labels given in Table 2.1

and new labels based on the logical proposition Spending > $200.

The new labels shown in Table 2.2 take one of two values—True or

False.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!