
For example, if the training set is on the order of a gigabyte, training times may be on the order of 30 minutes for penalized linear regression and 5 or 6 hours for an ensemble method. If the feature engineering process requires 10 iterations to select the best feature set, the computation time alone is the difference between taking a day and taking a week to accomplish feature engineering. A useful process, therefore, is to train a penalized linear model during the early stages of development and feature engineering. That gives the data scientist a feel for which variables are going to be useful and important, as well as a baseline performance for comparison with other algorithms later in development.
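To make the baseline idea concrete, here is a minimal sketch, not from the book, using scikit-learn's LassoCV on synthetic data; the data set, feature counts, and noise level are illustrative assumptions.

```python
# A minimal sketch of the early-stage baseline idea, assuming a numeric
# feature matrix X and target y are already prepared; LassoCV picks the
# penalty strength by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=500, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

model = LassoCV(cv=5).fit(X, y)

# The coefficients reveal which variables the penalty kept; most are driven
# to exactly zero, so the survivors are candidates for feature engineering.
important = np.flatnonzero(model.coef_)
print("features retained:", important)
print("baseline R^2:", model.score(X, y))
```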

Besides their training-time advantage, penalized linear methods generate predictions much faster than ensemble methods. Generating a prediction means evaluating the trained model. The trained model for penalized linear regression is simply a list of real numbers, one for each feature used to make the prediction, so the number of floating-point operations involved is proportional to the number of variables: roughly one multiply-add per feature. For highly time-sensitive predictions such as high-speed trading or Internet ad insertion, computation time makes the difference between making money and losing money.
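A sketch of why prediction is so cheap: the coefficient vector and intercept below are made-up values standing in for a trained penalized linear model, and a prediction is a single dot product over the features.

```python
# The trained model is just a coefficient vector plus an intercept, so one
# prediction is a single dot product (values here are illustrative).
import numpy as np

coef = np.array([0.8, -1.2, 0.0, 3.5])   # one weight per feature
intercept = 0.25

def predict(x):
    # roughly one multiply-add per feature
    return float(np.dot(coef, x) + intercept)

print(predict(np.array([1.0, 2.0, 0.5, -1.0])))
```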

For some problems, linear methods may give equivalent or even better performance than ensemble methods. Some problems do not require complicated models. Chapter 3 goes into some detail about the nature of problem complexity and how the data scientist's task is to balance problem complexity, predictive model complexity, and data set size to achieve the best deployable model. The basic idea is that on problems that are not complex, and on problems for which sufficient data are not available, linear methods may achieve better overall performance than more complicated ensemble methods. Genetic data provide a good illustration of this type of problem.

The general perception is that there's an enormous amount of genetic data around. Genetic data sets are indeed large when measured in bytes, but in terms of generating accurate predictions, they aren't very large. To understand this distinction, consider the following thought experiment. Suppose that you have two people, one with a heritable condition and the other without. If you had genetic sequences for the two people, could you determine which gene was responsible for the condition? Obviously not, because many genes will differ between the two people. So how many people would it take? At a minimum, it would take gene sequences from as many people as there are genes, and given any noise in the measurements, it would take even more. Humans have roughly 20,000 genes, depending on how you count, and sequencing one person's genome costs roughly $1,000. So having just enough data to resolve the disease with perfect measurements would cost $20 million.
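The linear algebra behind the thought experiment can be seen in a few lines of NumPy; the sizes here are toy stand-ins for people and genes, not real genetic data.

```python
# With far fewer people (rows) than genes (columns), the linear system is
# underdetermined: many exact fits exist, so no single gene can be blamed.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_genes = 2, 100
X = rng.standard_normal((n_people, n_genes))
y = X[:, 7] * 2.0                    # gene 7 actually drives the condition

# lstsq returns the minimum-norm solution among infinitely many exact fits
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("residual:", np.abs(X @ w - y).max())  # ~0: a perfect fit...
print("weight on true gene:", w[7])          # ...that spreads blame around
```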

This situation is very similar to fitting a line to two points, as discussed earlier in this chapter. Models need to have fewer degrees of freedom than the number of data points; the data set size typically needs to be a multiple of the degrees of freedom in the model. Because the data set size is fixed, the degrees of freedom in the model need to be adjustable. The chapters dealing with penalized linear regression will show you how that adjustability is built into penalized linear regression and how to use it to achieve optimum performance.
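As a rough preview of that adjustability (the details come in the penalized regression chapters), sweeping the Lasso penalty alpha on a synthetic problem with more features than samples shows the effective degrees of freedom, counted here as nonzero coefficients, shrinking and growing with the penalty; the problem sizes are illustrative assumptions.

```python
# Sweeping the Lasso penalty trades degrees of freedom (nonzero coefficients)
# against fit, which is what lets a linear model cope with more features than
# data points.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=50, n_features=500, n_informative=5,
                       noise=1.0, random_state=0)

for alpha in (10.0, 1.0, 0.1):
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    print(f"alpha={alpha:5.1f}  nonzero coefs={np.count_nonzero(fit.coef_):3d}")
```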

NOTE The two broad categories of algorithms addressed in this book match those that Jeremy Howard and I presented at the Strata Conference in 2012. Jeremy took ensemble methods, and I took penalized linear regression. We had fun arguing about the relative merits of the two groups. In reality, however, those two cover something like 80 percent of the model building that I do, and there are good reasons for that.

Chapter 3 goes into more detail about why one algorithm or another is a better choice for a given problem. It has to do with the complexity of the problem and the number of data points available to fit a model of that complexity.
