
Preface


Now comes the next challenge for a data scientist, and that is clustering a dataset without labeled data points. We call this unsupervised learning. I have devoted a large section (Part II), comprising Chaps. 9 through 16, to clustering, giving you in-depth coverage of several clustering techniques. The notion of a cluster is not well-defined, and there is usually no consensus on the results produced by clustering algorithms. So we have many clustering algorithms that deal with small, medium, large, and really huge spatial datasets. I cover many clustering algorithms, explaining their applications for datasets of various sizes.

Chapter 9 (Centroid-Based Clustering) discusses centroid-based clustering algorithms, which are probably the simplest and are the starting points for clustering huge spatial datasets. The chapter covers both the K-Means and K-Medoids clustering algorithms. For K-Means, I describe its working, followed by an explanation of the algorithm itself. I discuss the purpose of the objective function and techniques for selecting the optimal number of clusters: the Elbow, Average Silhouette, and Gap Statistic methods. This is followed by a discussion of K-Means' limitations and where to use it. For the K-Medoids algorithm, I follow a similar approach, describing its working, algorithm, merits, demerits, and implementation.
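
To give a flavor of the Elbow method mentioned above, here is a minimal sketch using scikit-learn's KMeans; the synthetic dataset and the range of k values are illustrative assumptions, not the book's own example:

```python
# A minimal sketch of the Elbow method with scikit-learn's KMeans.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Inertia is the K-Means objective function: the sum of squared
# distances of points to their nearest centroid.
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(range(1, 10), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```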

Chapter 10 (Connectivity-Based Clustering) describes two connectivity-based clustering algorithms: Agglomerative and Divisive. For Agglomerative clustering, I describe the Single, Complete, and Average linkages while explaining its full working. I then discuss its advantages, disadvantages, and the practical situations where this algorithm finds its use. For Divisive clustering, I take a similar approach and discuss its implementation challenges.
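
As a small taste of the linkage choices discussed in this chapter, here is a minimal sketch with scikit-learn's AgglomerativeClustering; the dataset and cluster count are illustrative assumptions:

```python
# A minimal sketch comparing the three linkages on the same data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# "single" merges by the closest pair of points between clusters,
# "complete" by the farthest pair, "average" by the mean pairwise distance.
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    print(linkage, "labels of first 10 points:", model.labels_[:10])
```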

Chapter 11 (Gaussian Mixture Model) describes another type of clustering algorithm, in which the data is modeled as a mixture of Gaussian distributions. I explain how to select the optimal number of clusters with a practical example.
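
One common way to pick the number of mixture components is an information criterion such as BIC; the sketch below assumes that approach and a synthetic dataset, and the book's own example may use a different criterion:

```python
# A minimal sketch of selecting the number of mixture components with BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)

# Lower BIC indicates a better trade-off between fit and model complexity.
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=1).fit(X)
    print(f"components={k}: BIC={gmm.bic(X):.1f}")
```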

Chapter 12 (Density-Based Clustering) focuses on density-based clustering techniques. Here I describe three algorithms: DBSCAN, OPTICS, and Mean Shift. I discuss why we use DBSCAN and, after covering the preliminaries, its full working. I then discuss its advantages, disadvantages, and implementation with the help of a project. To understand OPTICS, I first explain a few terms such as core distance and reachability distance. As with DBSCAN, I discuss its implementation with the help of a project. Finally, I describe Mean Shift clustering, explaining its full working and how to select the bandwidth. A discussion of the algorithm's strengths, weaknesses, and applications, along with a practical implementation illustrated through a project, follows.

Chapter 13 (BIRCH) discusses another important clustering algorithm, called BIRCH. This algorithm helps data scientists cluster huge datasets where all the earlier algorithms fail. BIRCH splits the huge dataset into subclusters by creating a hierarchical tree-like structure. The algorithm clusters incrementally, eliminating the need to load the entire dataset into memory. In this chapter, I discuss why and where to use this algorithm and explain its working by showing you how the algorithm constructs a CF tree.
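
To illustrate the incremental nature of BIRCH, here is a minimal sketch using scikit-learn's Birch and its partial_fit method; the batch size, threshold, and synthetic data are illustrative assumptions:

```python
# A minimal sketch of BIRCH clustering data in chunks via partial_fit.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(threshold=0.5, n_clusters=3)

# Feed the data in batches, as if it were too large to hold in memory;
# Birch maintains a CF tree and updates it with each batch.
for _ in range(10):
    batch = rng.normal(size=(100, 2)) + rng.choice([-5.0, 0.0, 5.0])
    model.partial_fit(batch)

print("subclusters in the CF tree:", len(model.subcluster_centers_))
```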

Chapter 14 (CLARANS) discusses another important algorithm for clustering enormous datasets, called CLARANS. It builds on CLARA (Clustering LARge Applications), a sampling-based extension of K-Medoids.
