Thinking Data Science: A Data Science Practitioner's Guide
2 Dimensionality Reduction

Why Reduce Dimensionality?

One might assume that having more features for inference is better than having just a handful. In machine learning, this is not true. Having many features adds to the woes of a data scientist; we call this the "curse of dimensionality." A high-dimensional dataset is considered a curse to a data scientist. Why so? Here are a few challenges a data scientist faces while handling high-dimensional datasets:

• A large dataset is likely to contain many nulls, so you must do thorough data cleansing: you will either need to drop those columns or impute them with appropriate values.

• Certain columns may be highly correlated, so you need to keep only one of them.

• There is a high probability of over-fitting the model.

• Training would be computationally expensive.

Looking at these issues, we must consider ways to reduce the dimensions; rather, we should use only the truly meaningful features in model development. For example, keeping two columns like birthdate and age is pointless, as both convey the same information to the model, so we can easily drop one without compromising the model's ability to infer on unseen data. Similarly, an id field in the dataset is totally redundant for machine learning and should be removed. Such manual inspection for reducing dimensions is practically impossible at scale, and thus many techniques have been developed to reduce dimensionality programmatically.
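The manual pruning described above can be sketched in a few lines of pandas. The DataFrame below is hypothetical, invented only to illustrate the birthdate/age and id examples from the text:

```python
import pandas as pd

# Hypothetical dataset: "age" duplicates the information already in
# "birthdate", and "id" carries no predictive signal.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "birthdate": ["1990-05-01", "1985-11-20", "2000-02-14"],
    "age": [34, 38, 24],
    "income": [52000, 61000, 30000],
})

# Drop the redundant columns before model development.
reduced = df.drop(columns=["id", "age"])
print(list(reduced.columns))  # ['birthdate', 'income']
```

This works for a handful of obvious columns, but, as the text notes, it does not scale to datasets with hundreds of features.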

Dimensionality Reduction Techniques

Dimensionality reduction essentially means reducing the number of features in your dataset. Two approaches can achieve this: keeping only the most important features by eliminating unwanted ones, or combining some features to reduce the total count.
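The two approaches can be illustrated side by side on toy data. This is only a sketch using scikit-learn: variance thresholding stands in for "keeping only the most important features," and PCA stands in for "combining features"; the data and thresholds are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Toy data: 100 samples, 5 features; feature 0 is constant,
# so it carries no information.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] = 1.0

# Approach 1 (selection): keep features, drop unwanted ones.
X_selected = VarianceThreshold(threshold=0.1).fit_transform(X)
print(X_selected.shape)  # (100, 4) -- the constant column is gone

# Approach 2 (combination): build fewer new features from all of them.
X_combined = PCA(n_components=2).fit_transform(X)
print(X_combined.shape)  # (100, 2) -- two derived components
```

Note the difference: selection preserves original, interpretable columns, while combination produces new derived features.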

A lot of work has been done on dimensionality reduction, and several techniques have been devised for it. You need to learn all of these, as one technique alone may not meet your purpose; most of the time you will need to apply several techniques in a certain order to achieve the desired outcome. I will now discuss the following techniques. Though the list is not exhaustive, it covers the most widely used approaches:

• Columns with missing values

• Filtering columns based on variance

• Filtering highly correlated columns

• Random forest

• Backward elimination

• Forward feature selection

• Factor analysis
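To see how several of these techniques chain together "in a certain order," here is a minimal sketch combining the first three items of the list with pandas. The data and the cutoffs (50% missing, variance 0.01, |r| > 0.95) are hypothetical; in practice you would tune them to your dataset:

```python
import numpy as np
import pandas as pd

# Synthetic frame with one mostly-missing, one constant, and one
# near-duplicate column, invented to exercise each filter.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)  # highly correlated with "a"
df["e"] = np.nan                                          # all missing
df["f"] = 1.0                                             # zero variance

# 1. Columns with missing values: drop those over 50% null.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Filter columns based on variance: drop near-constant ones.
df = df.loc[:, df.var() > 0.01]

# 3. Filter highly correlated columns: keep one per correlated pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(sorted(df.columns))  # ['a', 'c', 'd']
```

The remaining techniques in the list (random forest importances, backward elimination, forward feature selection, and factor analysis) are covered in the sections that follow.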
