
2 Dimensionality Reduction

This is the output on our dataset:

Gender               1.778751e-01
Married              2.351972e-01
Dependents           1.255589e+00
Education            1.708902e-01
Self_Employed        2.859435e-01
ApplicantIncome      3.732039e+07
CoapplicantIncome    8.562930e+06
LoanAmount           7.074027e+03
Loan_Amount_Term     4.151048e+03
Credit_History       1.241425e-01
Property_Area        6.201280e-01
Loan_Status          2.152707e-01
dtype: float64
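
For reference, this kind of per-column variance report can be produced with a single pandas call. The snippet below is only a minimal sketch, assuming df holds the label-encoded Loan Prediction data with Loan_ID as its first column:

# Sketch: compute the variance of every column except the identifier.
# Assumes all categorical columns have already been encoded to numbers.
variances = df.drop(columns=['Loan_ID']).var()
print(variances)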

As you can observe, the variance for Credit_History is very low, so we can safely drop this column. For datasets with a large number of columns, this kind of visual inspection of variances may not be practical, so you may set a threshold for filtering out such columns, just as you did when detecting columns with large numbers of missing values. This is done using the following code snippet:

# Omitting 'Loan_ID' and 'Loan_Status'
numeric = df[df.columns[1:-1]]
var = numeric.var()
numeric_cols = numeric.columns
variable = []
for i in range(0, len(numeric_cols)):
    if var.iloc[i] >= 10:  # variance threshold; .iloc for positional access
        variable.append(numeric_cols[i])
variable

This is the output:

['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

As you can see, the four columns listed above exhibit high variance, so they are significant to us for machine training. You may likewise filter out the columns having low variance, which are therefore insignificant to us.
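
If you prefer a library routine to the manual loop, scikit-learn offers a VarianceThreshold selector that applies the same idea. The snippet below is only a sketch, assuming numeric is the encoded DataFrame built above; note that scikit-learn uses a strict greater-than comparison against the threshold, which makes no difference for these columns:

from sklearn.feature_selection import VarianceThreshold

# Keep only the columns whose variance exceeds the threshold of 10,
# mirroring the manual loop shown earlier.
selector = VarianceThreshold(threshold=10)
selector.fit(numeric)
high_variance_cols = numeric.columns[selector.get_support()].tolist()
print(high_variance_cols)  # expected: the same four columns as above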
