
operations in real-time data streams. Today, we have machine learning applications, like traffic analysis, that work on live data streams. All these new models required machine learning engineers (ML engineers for short) and data scientists to develop and learn new methods for processing image data. After all, our computers understand only binary data; although image data is already binary, we still need to transform it into a machine-understandable format. Note that each single image comprises several binary data items (pixels), and each pixel in an image data file is an RGB representation.
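
To make this transformation concrete, here is a minimal sketch that reads an image and turns its RGB pixels into a numeric array a model can consume. The use of Pillow and NumPy, and the file name photo.jpg, are illustrative assumptions, not the book's own code.

```python
# Sketch: turning image pixels into machine-readable numbers.
# Pillow/NumPy and "photo.jpg" are assumptions for illustration.
from PIL import Image
import numpy as np

img = Image.open("photo.jpg").convert("RGB")

# Each pixel becomes an (R, G, B) triplet of integers in [0, 255].
pixels = np.array(img)                     # shape: (height, width, 3)
print(pixels.shape, pixels.dtype)

# Models typically expect floats scaled to [0, 1].
features = pixels.astype(np.float32) / 255.0
```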

The older classical ML technology failed to meet these new requirements, and the industry started looking at alternative approaches. Artificial neural network (ANN) technology, invented many decades ago, came to the rescue. Modern computing resources made this technology practical for developing such models. Neural network training requires several gigabytes of memory, GPUs, and many hours of training. Tech giants had those kinds of resources, and they trained networks that we can reuse and whose functionality we can extend for our own purposes. We call this new technology transfer learning, and I give it exhaustive coverage later in the book.
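
As a small preview of what that coverage looks like in practice, the sketch below reuses a pretrained network and extends it with a new classification head. The tf.keras stack, the MobileNetV2 base model, and the 10-class head are all assumptions made for illustration; the book's own examples may differ.

```python
# Illustrative transfer-learning sketch (framework and base model are
# assumed for this example, not taken from the book).
import tensorflow as tf

# Reuse a network a tech giant already trained on ImageNet.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,           # drop the original classifier head
    weights="imagenet",
)
base.trainable = False           # freeze the pretrained weights

# Extend it with a small head for our own task (10 classes is illustrative).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new head on our own data
```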

Model Development on Text Datasets

After seeing the success of ANN/DNN (deep neural network) technology in building image applications, researchers started exploring its application to text data. Thus came a new term and a field of its own: natural language processing (NLP). Preparing text data for machine learning requires a different kind of approach compared to numeric data. Numeric data is contained in databases having a few columns, and each column is a potential candidate for a feature. Thus, in numeric machine learning models, the number of features is typically very low.

Now, consider text data, where each word or sentence is a potential candidate for a feature. For an email-spam application, you use every word in your text as a feature, while for a document-summarization model, you will use every sentence as a feature. Considering the vocabulary of words that we have, you can easily imagine the number of features.
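
To see this feature explosion concretely, the sketch below builds a bag-of-words representation where every unique word becomes one feature. The use of scikit-learn's CountVectorizer and the tiny invented corpus are assumptions for illustration.

```python
# Sketch: every unique word becomes a feature (scikit-learn and the
# tiny corpus are assumptions for illustration).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "win a free prize now",
    "meeting agenda for monday",
    "free tickets claim your prize",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)             # documents x vocabulary matrix

print(len(vec.get_feature_names_out()))   # vocabulary size = feature count
print(X.shape)                            # (3 documents, N word features)
```

Even this three-sentence corpus yields a dozen features; a realistic corpus with a full vocabulary yields tens of thousands.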

The traditional dimensionality reduction techniques that we have for numeric data do not apply to a text dataset. Instead, you need to cleanse the entire text corpus. Cleansing text data requires several steps: removing punctuation, removing stop words, removing or converting numbers, lowercasing, and so on. Beyond this data cleaning operation, to reduce the feature count, we need to apply a few more techniques, such as building a dictionary of unique words, stemming, and lemmatization. Finally, you need to understand tokenization, by which these words/sentences are transformed into machine-understandable formats.
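
A minimal cleansing and tokenization pipeline might look like the sketch below, assuming plain Python plus NLTK's PorterStemmer; the deliberately tiny stop-word list is illustrative, not a production list.

```python
# Minimal text-cleansing pipeline (assumed tooling; the stop-word
# list is a deliberately tiny illustrative subset).
import string
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and"}
stemmer = PorterStemmer()

def cleanse(text):
    text = text.lower()                                    # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = "".join(ch for ch in text if not ch.isdigit())  # remove numbers
    tokens = text.split()                                  # simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [stemmer.stem(t) for t in tokens]               # stemming

print(cleanse("The 2 runners were running in the Rain!"))
# e.g. ['runner', 'were', 'run', 'rain']
```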

For text data, the context in which a word appears also plays an important role in its analysis. So there came a new branch called natural language understanding (NLU).
