11.04.2024 Views

Thinking-data-science-a-data-science-practitioners-guide


22 2 Dimensionality Reduction

Columns with Missing Values

In a large dataset, you may find a few columns with many missing values. Counting the null values in each column tells you which columns are affected. You can then either impute the missing values or eliminate those columns, on the assumption that removing them will not affect the model's performance.

To eliminate specific columns, we decide on a threshold value; say, eliminate any column that has over 20% missing values. The simple Python code below identifies such columns, using a stricter threshold of 3%:

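# Fraction of missing values in each column
a = df.isnull().sum() / len(df)

# Names of all columns except the last one
variables = df.columns[:-1]

# Collect the columns whose missing fraction exceeds the threshold
variable = []
for i in range(len(variables)):
    if a.iloc[i] > 0.03:  # setting the threshold as 3%
        variable.append(variables[i])

print(variable)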

This is the output on our dataset:

['Self_Employed', 'LoanAmount', 'Credit_History']

As you can see, the columns Self_Employed, LoanAmount, and Credit_History are marked to be dropped, because the fraction of missing values in each of them exceeds our threshold.
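The loop above only lists the affected columns. As a minimal sketch of how the listing and the actual removal could be combined, the following uses a boolean mask and `DataFrame.drop`. The small DataFrame is a hypothetical stand-in for the loan dataset with invented values, and `threshold`, `missing_fraction`, `to_drop`, and `df_reduced` are illustrative names, not taken from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan dataset; the values are invented.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
    "Self_Employed": ["No", None, "Yes", None, "No"],
    "LoanAmount": [120.0, np.nan, 66.0, 141.0, 95.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "N"],
})

threshold = 0.03  # same 3% cut-off as in the loop above

# Fraction of missing values per column, leaving out the last column
# (assumed here to be the target).
missing_fraction = df.iloc[:, :-1].isnull().mean()

# Names of the columns whose missing fraction exceeds the threshold.
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

# drop() returns a new frame; the original df is left unchanged.
df_reduced = df.drop(columns=to_drop)

print(to_drop)
print(df_reduced.columns.tolist())
```

Keeping the original `df` untouched makes it easy to compare model performance with and without the dropped columns.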

In some situations, you may still decide to keep columns that have missing values. In this case, you will need to impute the missing values with either the mode or the median value of that column. In our dataset, let us consider three columns that have missing values: loan amount, loan amount term, and credit history. Fig. 2.2 shows histograms of the data distribution for these three columns.
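As a numerical companion to the histograms, you could also count the distinct values in each column: a column with only a handful of distinct values is likely discrete. This is a rough sketch, not part of the original analysis. The sample values are invented, and the column name `Loan_Amount_Term` is an assumption, since the text does not spell it out.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
df = pd.DataFrame({
    "LoanAmount": [120.0, 66.0, np.nan, 141.0, 95.0, 128.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan, 360.0, 360.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, np.nan, 1.0],
})

cols = ["LoanAmount", "Loan_Amount_Term", "Credit_History"]

# nunique() ignores missing values by default, so NaN is not counted
# as a distinct value.
distinct_counts = df[cols].nunique()

print(distinct_counts)
```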

Looking at the histograms, you can see that loan amount term and credit history have discrete distributions, so we will fill their missing values with each column's mode. For the loan amount, we will fill the missing values with the median. We show this in the code snippet below:
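A minimal sketch of this kind of imputation, assuming pandas, might look as follows. It is not necessarily the original snippet. The sample values are invented, and the column name `Loan_Amount_Term` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
df = pd.DataFrame({
    "LoanAmount": [120.0, 66.0, np.nan, 141.0, 95.0, 128.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan, 360.0, 360.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, np.nan, 1.0],
})

# Discrete columns: fill missing values with the most frequent value.
# mode() returns a Series (there can be ties), so take its first entry.
for col in ["Loan_Amount_Term", "Credit_History"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Continuous column: fill missing values with the median, which is
# less sensitive to outliers than the mean.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

# Total number of missing values left in the frame.
print(df.isnull().sum().sum())
```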
