01.11.2020 Views

Machine Learning in Python Essential Techniques for Predictive Analysis by Michael Bowles (z-lib.org).epub

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

facilitate iterating through the process of training and tweaking. If the

data set grows to 1,000 x 1,000, the training times will grow to a

fraction of a minute for penalized linear regression and a few minutes

for an ensemble method. As the data set gets to several tens of

thousands of rows and columns, the training times will expand to 3 or

4 hours for penalized linear regression and 12 to 24 hours for an

ensemble method. The larger training times will have an impact on

your development time because you’ll iterate a number of times.

The second important observation regarding row and column counts

is that if the data set has many more columns than rows, you may be

more likely to get the best prediction with penalized linear regression

and vice versa. Chapter 3, “Predictive Model Building: Balancing

Performance, Complexity, and Big Data,” and the examples you’ll

run later will give you a better understanding of why that’s true.

The next step in the checklist is to determine how many of the

columns of data are numeric versus categorical. Listing 2-2 shows

code to accomplish this for the rocks versus mine data set. The code

runs down each column and adds up the number of entries that are

numeric (int or float), the number of entries that are nonempty

strings, and the number that are empty. The result is that the first 60

columns contain all numeric values and the last column contains all

strings. The string values are the labels. Generally, categorical

variables are presented as strings, as in this example. In some cases,

binary-valued categorical variables are presented as a 0,1 numeric

variable.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!