
Learning Data Mining with Python


Chapter 5

Other features describe a dataset in terms of its components:

• Frequency of a given subcomponent, such as a word in a book
• Number of subcomponents and/or the number of different subcomponents
• Average size of the subcomponents, such as the average sentence length
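The three kinds of component features listed above can be sketched for a short piece of text. This is a minimal illustration only: the helper name component_features, the sample sentence, and the simple rules used to split words and sentences are all assumptions made for the example, not something taken from the chapter.

```python
import re
from collections import Counter

def component_features(text):
    """Describe a text by its words and sentences (hypothetical helper)."""
    # Split into sentences on ., ! or ? and drop empty pieces
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercase words made of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "word_frequency": counts,            # frequency of each word
        "num_words": len(words),             # number of subcomponents
        "num_distinct_words": len(counts),   # number of different subcomponents
        # Average sentence length in words; 0.0 if there are no sentences
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

sample = "The cat sat. The cat ran away. A dog barked!"
features = component_features(sample)
print(features["word_frequency"]["cat"])
print(features["num_words"], features["num_distinct_words"])
print(features["avg_sentence_length"])
```

A real text would need more careful tokenization (abbreviations, numbers, hyphenated words), but the shape of the features stays the same.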

Ordinal features allow us to perform ranking, sorting, and grouping of similar values. As we have seen in previous chapters, features can be numerical or categorical. Numerical features are often described as being ordinal. For example, three people, Alice, Bob, and Charlie, may have heights of 1.5 m, 1.6 m, and 1.7 m. We would say that Alice and Bob are more similar in height than are Alice and Charlie.
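The height example can be written out directly. The sketch below uses the three heights from the text; the dictionary and the height_gap helper are names chosen here for illustration.

```python
# Heights in metres, as given in the example
heights = {"Alice": 1.5, "Bob": 1.6, "Charlie": 1.7}

# Ranking: order the names from shortest to tallest
ranking = sorted(heights, key=heights.get)

def height_gap(a, b):
    """Absolute difference in height between two people (hypothetical helper)."""
    return abs(heights[a] - heights[b])

# Similarity: a smaller gap means the two people are more alike in height
alice_closer_to_bob = height_gap("Alice", "Bob") < height_gap("Alice", "Charlie")

print(ranking)
print(alice_closer_to_bob)
```

Both operations rely only on the values having a meaningful order and distance, which is what makes the feature ordinal.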

The Adult dataset that we loaded in the last section contains examples of continuous, ordinal features. For example, the Hours-per-week feature tracks how many hours per week people work. Certain operations make sense on a feature like this, including computing the mean, standard deviation, minimum, and maximum. The describe method in pandas returns basic summary statistics of this type:

adult["Hours-per-week"].describe()

The result tells us a little about this feature:

count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
dtype: float64
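As a rough cross-check on what such a summary contains, the same quantities can be computed by hand on a small made-up sample. The sketch below uses Python's standard statistics module rather than pandas. The hours list is invented for illustration, and the use of the sample standard deviation and the "inclusive" quartile method is an assumption intended to mirror pandas' defaults.

```python
import statistics

# Invented sample of weekly working hours (not taken from the Adult dataset)
hours = [20, 40, 40, 45, 60]

def summarize(values):
    """Return describe()-style summary statistics as a dict (hypothetical helper)."""
    # Quartiles with linear interpolation between the closest data points
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

summary = summarize(hours)
for name, value in summary.items():
    print(f"{name:<6}{value:>12.6f}")
```

The 50% row is the median, so half of the sample works that many hours or fewer.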

Some of these operations do not make sense for other features. For example, it doesn't make sense to compute the sum of the education statuses.

There are also features that are not numerical, but still ordinal. The Education feature in the Adult dataset is an example of this. For example, a Bachelor's degree is a higher education status than finishing high school, which is a higher status than not completing high school. It doesn't quite make sense to compute the mean of these values, but we can create an approximation by taking the median value. The dataset provides a helpful feature, Education-Num, which assigns a number that is basically equivalent to the number of years of education completed. This allows us to quickly compute the median:

adult["Education-Num"].median()
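When a dataset has no numeric companion column like Education-Num, the same idea can be applied by ranking the categories first. The following is a minimal sketch under stated assumptions: the level names, their ordering, and the ordinal_median helper are hypothetical and are not the values used in the Adult dataset.

```python
import statistics

# Hypothetical education levels, ordered from lowest to highest status
LEVELS = ["No high school diploma", "High school", "Bachelor's", "Master's"]
RANK = {level: i for i, level in enumerate(LEVELS)}

def ordinal_median(values):
    """Median of an ordered categorical feature (hypothetical helper)."""
    ranks = [RANK[v] for v in values]
    # median_low always returns an observed rank, so it maps back to a level
    return LEVELS[statistics.median_low(ranks)]

sample = [
    "High school",
    "Bachelor's",
    "High school",
    "Master's",
    "No high school diploma",
]
print(ordinal_median(sample))
```

Using median_low avoids averaging two ranks when the sample size is even, which would produce a value that corresponds to no actual education level.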

