Chapter 3
Supervised Learning Using Python
Likewise, if r is close to 0, it means that x and y are essentially uncorrelated. A
simplified formula for calculating r is shown here:
r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
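To make the formula concrete, here is a minimal sketch that computes r directly from the sums in the formula and checks the result against SciPy's pearsonr; the arrays x and y hold illustrative values only.

import numpy as np
from scipy.stats import pearsonr

# Illustrative data (placeholder values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
# Numerator and denominator of the simplified formula for r
numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                      (n * np.sum(y ** 2) - np.sum(y) ** 2))
r = numerator / denominator

# Cross-check against SciPy's implementation; the two values agree
r_scipy, _ = pearsonr(x, y)
print(r, r_scipy)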
You can easily use correlation for dimensionality reduction. Let's say
Y is a variable that is a weighted sum of n variables X1, X2, ... Xn, and you
want to reduce this set of X variables to a smaller set. To do so, you calculate
the correlation coefficient for each pair of X variables. If Xi and Xj are highly
correlated, you then compare the correlation of Y with Xi against the correlation
of Y with Xj: if Y is more strongly correlated with Xi, you remove Xj from the
set, and vice versa. The following function is an example of dropping features
using correlation:
import numpy as np
from scipy.stats import pearsonr

def drop_features(y_train, X_train, X, index):
    i1 = 0
    processed = 0
    while(1):
        flag = True
        for i in range(X_train.shape[1]):
            if i > processed:
                i1 = i1 + 1
                # Correlation of the i-th feature with the target
                corr = pearsonr(X_train[:, i], y_train)
                # Probable error of r: 0.674 * (1 - r^2) / sqrt(n)
                PEr = .674 * (1 - corr[0] * corr[0]) / (len(X_train[:, i]) ** (1 / 2.0))
                # Drop the feature if r does not exceed its probable error
                if corr[0] < PEr:
                    X_train = np.delete(X_train, i, 1)
                    # The original listing breaks off here; the bookkeeping
                    # below (restarting the scan and returning the reduced
                    # matrix) is a reconstruction.
                    processed = i - 1
                    flag = False
                    break
        if flag:
            break
    return X_train
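As a quick check, you can run the function on a small synthetic matrix. The data below is illustrative only (not from the book): two columns drive the target and one appended column is pure noise, so any column whose correlation with the target does not exceed its probable error gets deleted.

import numpy as np

# Illustrative synthetic data: two informative features plus one
# pure-noise column appended at the end.
rng = np.random.RandomState(0)
informative = rng.rand(200, 3)
noise = rng.rand(200, 1)
X_train = np.hstack([informative, noise])
y_train = 2.0 * informative[:, 1] + 3.0 * informative[:, 2]

# X and index are not touched in the portion of the listing shown
# above, so placeholders are passed here.
X_reduced = drop_features(y_train, X_train, None, None)
print(X_train.shape, X_reduced.shape)  # X_reduced has at most as many columns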