Thinking Data Science: A Data Science Practitioner's Guide
2 Dimensionality Reduction

Why Reduce Dimensionality?

One might assume that having more features for inference is better than having just a handful. In machine learning, this is not true. Having many features adds to the woes of a data scientist; we call this the "curse of dimensionality." A high-dimensional dataset is considered a curse to a data scientist. Why so? Here are a few challenges a data scientist faces while handling high-dimensional datasets:

• A large dataset is likely to contain many nulls, so you must do thorough data cleansing: you will either need to drop those columns or impute them with appropriate values.

• Certain columns may be highly correlated, so you need to keep only one of them.

• There is a high probability of over-fitting the model.

• Training would be computationally expensive.

Looking at these issues, we must consider ways to reduce the dimensions; rather, we should use only the truly meaningful features in model development. For example, keeping two columns like birthdate and age is pointless, as both convey the same information to the model, so we can easily drop one without compromising the model's ability to infer on unseen data. Similarly, an id field in the dataset is totally redundant for machine learning and should be removed. Such manual inspection for reducing dimensions is practically impossible at scale, and thus many techniques have been developed to reduce dimensionality programmatically.
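The manual pruning described above can be sketched in a few lines of pandas. The DataFrame below is hypothetical, invented only to illustrate the birthdate/age and id examples from the text:

```python
import pandas as pd

# Hypothetical dataset: "age" duplicates the information already in
# "birthdate", and "id" carries no predictive signal.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "birthdate": ["1990-05-01", "1985-11-20", "2000-02-14"],
    "age": [34, 38, 24],
    "income": [52000, 61000, 30000],
})

# Drop the redundant columns before model development.
reduced = df.drop(columns=["id", "age"])
print(list(reduced.columns))  # ['birthdate', 'income']
```

This works for a handful of obvious columns, but, as the text notes, it does not scale to datasets with hundreds of features.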

Dimensionality Reduction Techniques

Dimensionality reduction essentially means reducing the number of features in your dataset. Two approaches can achieve this: keeping only the most important features by eliminating unwanted ones, or combining some features to reduce the total count.
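The two approaches can be illustrated side by side on toy data. This is only a sketch using scikit-learn: variance thresholding stands in for "keeping only the most important features," and PCA stands in for "combining features"; the data and thresholds are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Toy data: 100 samples, 5 features; feature 0 is constant,
# so it carries no information.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] = 1.0

# Approach 1 (selection): keep features, drop unwanted ones.
X_selected = VarianceThreshold(threshold=0.1).fit_transform(X)
print(X_selected.shape)  # (100, 4) -- the constant column is gone

# Approach 2 (combination): build fewer new features from all of them.
X_combined = PCA(n_components=2).fit_transform(X)
print(X_combined.shape)  # (100, 2) -- two derived components
```

Note the difference: selection preserves original, interpretable columns, while combination produces new derived features.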

A lot of work has been done on dimensionality reduction, and several techniques have been devised for it. You need to learn all of these, as one technique alone may not meet your purpose; most of the time you will need to apply several techniques in a certain order to achieve the desired outcome. I will now discuss the following techniques. Though the list is not exhaustive, it covers the most widely used approaches:

• Columns with missing values

• Filtering columns based on variance

• Filtering highly correlated columns

• Random forest

• Backward elimination

• Forward feature selection

• Factor analysis
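To see how several of these techniques chain together "in a certain order," here is a minimal sketch combining the first three items of the list with pandas. The data and the cutoffs (50% missing, variance 0.01, |r| > 0.95) are hypothetical; in practice you would tune them to your dataset:

```python
import numpy as np
import pandas as pd

# Synthetic frame with one mostly-missing, one constant, and one
# near-duplicate column, invented to exercise each filter.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)  # highly correlated with "a"
df["e"] = np.nan                                          # all missing
df["f"] = 1.0                                             # zero variance

# 1. Columns with missing values: drop those over 50% null.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Filter columns based on variance: drop near-constant ones.
df = df.loc[:, df.var() > 0.01]

# 3. Filter highly correlated columns: keep one per correlated pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(sorted(df.columns))  # ['a', 'c', 'd']
```

The remaining techniques in the list (random forest importances, backward elimination, forward feature selection, and factor analysis) are covered in the sections that follow.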
