
Thinking Data Science: A Data Science Practitioner's Guide


1 Data Science Process

in model building itself. To become a modern data scientist, you need to understand how to handle these new data types and learn modern machine learning technologies.

So, let us begin our journey by first understanding the data science process. I will first introduce you to the traditional model-building process followed by old-school data scientists. Do not take this the wrong way: though these processes were developed many years ago, they still find their use in modern data science. I will provide you with definite guidelines on when to use the traditional approach and when to use a more advanced modern approach.

Traditional Model Building

All these years, a data scientist building an AI application would first start with exploratory data analysis (EDA). After all, understanding the data yourself is vital before you can tell the machine what it means. In technical terms, it is important for us to understand the features (independent variables) in our dataset so that we can do a predictive analysis on the target, the dependent variable. Using these features and targets, we would create a training dataset for training a statistical algorithm. Such EDA often requires deep domain knowledge, and that is why people with domain knowledge in various vertical industries strive to become data scientists. I will try to meet the aspirations of every such individual.
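This features-and-target split can be sketched in a few lines of Python with pandas; the toy dataset and its column names below are invented for illustration, not taken from the book:

```python
import pandas as pd

# A toy dataset; columns "age" and "income" are features,
# "purchased" plays the role of the target (dependent variable).
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000, 52000, 80000, 91000],
    "purchased": [0, 0, 1, 1],
})

# Separate the independent variables (X) from the dependent variable (y):
# this (X, y) pair is the training dataset fed to a statistical algorithm.
X = df.drop(columns=["purchased"])
y = df["purchased"]
```

The same split works for any tabular dataset once you know, from your EDA and domain knowledge, which column is the target.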

As said earlier, in those days the data was mostly numeric. As a data scientist, you have to make sure the data is clean before feeding it to an algorithm; this is where data cleansing comes in. First, find out whether there are any missing values. If so, either remove those columns from your analysis or impute proper values into the missing fields. Once you ensure the data is clean, you need to do some preprocessing on it to make it ready for machine learning.
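Both options for handling missing values, dropping a column or imputing values, can be sketched with pandas; the toy DataFrame below is invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "weight": [70.0, 82.0, np.nan, 77.0],
    "notes": [np.nan, np.nan, np.nan, np.nan],  # an entirely empty column
})

# Option 1: remove columns that carry no information (all values missing)
df = df.dropna(axis=1, how="all")

# Option 2: impute the remaining gaps, here with each column's mean
df = df.fillna(df.mean(numeric_only=True))
```

Mean imputation is only one choice; the median, the mode, or a domain-informed constant may be more appropriate depending on the data.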

The various steps required in data preparation include studying the data variance in columns, scaling data, searching for correlations, dimensionality reduction, and so on. For this, the data scientist would use the many available tools for data exploration, get visual representations of data distributions and of correlations between columns, and apply several dimensionality reduction techniques. The list is endless; the process is time-consuming and laborious.
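The preparation steps just listed (variance, correlations, scaling, dimensionality reduction) can be sketched with pandas and scikit-learn; the random data below is a stand-in, with one column deliberately made redundant:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Random stand-in data; "e" is constructed to be perfectly correlated with "a"
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df["e"] = 2 * df["a"] + 0.1

# Study the variance in each column and the correlations between columns
variances = df.var()
correlations = df.corr()          # correlations.loc["a", "e"] is ~1.0

# Scale every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(df)

# Dimensionality reduction: keep enough principal components
# to explain 95% of the variance (the redundant column drops out)
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
```

Inspecting `variances` and `correlations` before scaling is the programmatic counterpart of the visual exploration described above.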

After the data scientist makes the dataset ready for machine learning, his next task is to select an appropriate algorithm based on his knowledge and experience. After the algorithm is trained, we say we have built the model. The data scientist now uses known performance evaluation methods to test the trained model on his test datasets. If the performance metrics do not give acceptable accuracies, he will try tweaking the hyper-parameters of the algorithm. If that does not work, he may have to go back to the data preparation stage, select new features, do additional feature engineering and further dimensionality reduction, and then retrain his algorithm for improved accuracy. If this too does not work out, he will try another statistical algorithm. The entire process continues over many iterations until he achieves an acceptable accuracy.
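This train-evaluate-tweak loop can be sketched with scikit-learn; the synthetic dataset, the choice of logistic regression, and the grid of `C` values are illustrative assumptions, not a prescription from the book:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# A synthetic stand-in for a dataset that has already been prepared
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the chosen algorithm: the fitted object is the "model"
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

# If the accuracy is unacceptable, tweak hyper-parameters,
# e.g. by searching over the regularization strength C
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
tuned = accuracy_score(y_test, search.predict(X_test))
```

If tuning still falls short, the loop continues as described: back to feature selection and engineering, or on to a different algorithm entirely.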
