Machine Learning in Python: Essential Techniques for Predictive Analysis, by Michael Bowles

LISTING 2-1: SIZING UP A NEW DATA SET - ROCKVMINESUMMARIES.PY (OUTPUT: OUTPUTROCKSVMINESSUMMARIES.TXT)

    __author__ = 'mike_bowles'
    #Python 2 code: urllib2 was replaced by urllib.request in Python 3
    import urllib2
    import sys

    #read data from uci data repository
    target_url = ("https://archive.ics.uci.edu/ml/machine-learning-"
                  "databases/undocumented/connectionist-bench/sonar/sonar.all-data")

    data = urllib2.urlopen(target_url)

    #arrange data into list for labels and list of lists for attributes
    xList = []
    labels = []
    for line in data:
        #split on comma
        row = line.strip().split(",")
        xList.append(row)

    sys.stdout.write("Number of Rows of Data = " + str(len(xList)) + '\n')
    sys.stdout.write("Number of Columns of Data = " + str(len(xList[1])))

Output:

    Number of Rows of Data = 208
    Number of Columns of Data = 61

As you can see in the sample output, this data set has 208 rows (lines) and 61 columns (fields per line). What difference does this make? The number of rows and columns has several impacts on how you proceed.

First, the overall size gives you a rough idea of how long your training times are going to be. For a small data set like the rocks versus mines data, training time will be less than a minute, which will facilitate iterating through the process of training and tweaking. If the data set grows to 1,000 x 1,000, the training times will grow to a fraction of a minute for penalized linear regression and a few minutes for an ensemble method. As the data set gets to several tens of thousands of rows and columns, the training times will expand to 3 or 4 hours for penalized linear regression and 12 to 24 hours for an ensemble method. The larger training times will have an impact on your development time because you’ll iterate a number of times.

The second important observation regarding row and column counts is that if the data set has many more columns than rows, you may be more likely to get the best prediction with penalized linear regression, and vice versa. Chapter 3, “Predictive Model Building: Balancing Performance, Complexity, and Big Data,” and the examples you’ll run later will give you a better understanding of why that’s true.

The next step in the checklist is to determine how many of the columns of data are numeric versus categorical. Listing 2-2 shows code to accomplish this for the rocks versus mines data set. The code runs down each column and adds up the number of entries that are numeric (int or float), the number of entries that are nonempty strings, and the number that are empty. The result is that the first 60 columns contain all numeric values and the last column contains all strings. The string values are the labels. Generally, categorical variables are presented as strings, as in this example. In some cases, binary-valued categorical variables are presented as a 0,1 numeric variable.
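The column-by-column tally described above can be sketched as follows. This is a minimal Python 3 illustration, not the book's Listing 2-2: the function names `classify_entry` and `count_column_types` and the three sample rows are assumptions made up for the example, and the rows are not taken from the sonar data. An entry counts as numeric if `float()` accepts it, so strings such as "nan" or "inf" would also be counted as numeric.

```python
# Sketch: count numeric, non-empty string, and empty entries in each column.
# The sample rows are invented for illustration (not real sonar data).
sample_rows = [
    ["0.02", "0.0371", "R"],
    ["0.0453", "", "M"],
    ["0.0262", "0.0582", "R"],
]


def classify_entry(entry):
    """Return 'numeric', 'string', or 'empty' for one raw text field."""
    text = entry.strip()
    if not text:
        return "empty"
    try:
        float(text)  # accepts ints and floats written as text
        return "numeric"
    except ValueError:
        return "string"


def count_column_types(rows):
    """Return one dict of type counts per column, in column order."""
    n_cols = len(rows[0])
    counts = [{"numeric": 0, "string": 0, "empty": 0} for _ in range(n_cols)]
    for row in rows:
        for j in range(n_cols):
            counts[j][classify_entry(row[j])] += 1
    return counts


col_counts = count_column_types(sample_rows)

# Print one line per column: index, numeric count, string count, empty count
for j, c in enumerate(col_counts):
    print(j, c["numeric"], c["string"], c["empty"])
```

A column whose count is entirely numeric can be treated as a real-valued attribute, while a column that is entirely nonempty strings (like the last column here) is a candidate categorical attribute or label.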
