Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)
Chapter 3
Supervised Learning Using Python
break
if flag:
break
return X_train, index
index.append(X.columns[i1-1])
processed = i - 1
flag = False
The actual use case of this code is shown at the end of the chapter.
Now, the question is what threshold the correlation coefficient must
exceed before you can say that X and Y are correlated. A common practice
is to assume that r > 0.5 means the variables are correlated and that
r < 0.5 means no correlation. One big limitation of this approach
is that it does not consider the size of the data set. For example, a 0.5
correlation in a set of 20 data points should not carry the same weight as a
0.5 correlation in a set of 10,000 data points. To overcome this problem, the
concept of probable error has been introduced, as shown here:
$$PE_r = 0.674 \times \frac{1 - r^2}{\sqrt{n}}$$
r is the correlation coefficient, and n is the sample size.
Here, r > 6PEr means that X and Y can be treated as correlated, and r < PEr
means that X and Y can be treated as independent. Using this approach, you
can see that even r = 0.1 counts as a correlation when the data set is huge:
with n = 10,000, PEr is about 0.0067, so 6PEr is about 0.04, well below 0.1.
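The probable error rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the book's own code: the function names `probable_error` and `classify_correlation` are hypothetical, the formula is taken with the square root of n in the denominator, the absolute value of r is used so that negative correlations are handled the same way, and the "inconclusive" label for values between PEr and 6PEr is an assumption, since the text does not name that case.

```python
import math


def probable_error(r, n):
    """Probable error of a correlation coefficient r computed from n points."""
    return 0.674 * (1 - r ** 2) / math.sqrt(n)


def classify_correlation(r, n):
    """Apply the rule from the text: compare |r| against PEr and 6 * PEr."""
    pe = probable_error(r, n)
    if abs(r) > 6 * pe:
        return "correlated"
    if abs(r) < pe:
        return "independent"
    # Assumption: the text does not label the band between PEr and 6 * PEr.
    return "inconclusive"


# The same r is judged differently depending on the sample size.
for r, n in [(0.5, 20), (0.5, 10_000), (0.1, 20), (0.1, 10_000)]:
    print(f"r={r}, n={n}: PEr={probable_error(r, n):.4f} -> "
          f"{classify_correlation(r, n)}")
```

Under these assumptions, r = 0.5 on 20 points falls short of 6PEr, while r = 0.1 on 10,000 points clears it, which is the point the text makes about sample size.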
One interesting application of correlation is in product
recommendations on an e-commerce site. A recommender can identify
similar users by calculating the correlation of their ratings for
the products they have both rated. Similarly, you can find similar products
by calculating the correlation of the ratings they received from the same
users. This approach is known as collaborative filtering.
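As a rough sketch of the user-to-user case, the following example computes the Pearson correlation between two users over only the products both have rated. The ratings, user names, the helper names `pearson` and `user_similarity`, and the `min_common` cutoff are all made up for illustration. A real recommender would also have to deal with sparse data and rating scale differences.

```python
import math

# Hypothetical ratings: user -> {product: rating on a 1-5 scale}.
ratings = {
    "alice": {"book": 5, "lamp": 3, "mug": 4, "pen": 1},
    "bob":   {"book": 4, "lamp": 2, "mug": 5, "pen": 1},
    "carol": {"book": 1, "lamp": 5, "mug": 2, "desk": 4},
}


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant rating vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def user_similarity(user_a, user_b, min_common=3):
    """Correlate two users' ratings over the products they both rated."""
    common = sorted(set(ratings[user_a]) & set(ratings[user_b]))
    if len(common) < min_common:
        return None  # too few shared products to say anything
    x = [ratings[user_a][p] for p in common]
    y = [ratings[user_b][p] for p in common]
    return pearson(x, y)


print("alice vs bob:  ", user_similarity("alice", "bob"))
print("alice vs carol:", user_similarity("alice", "carol"))
```

With this toy data, alice and bob come out positively correlated and alice and carol negatively correlated, so bob's highly rated products would be the better source of suggestions for alice. Swapping the roles of users and products in the dictionary gives the product-to-product variant.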