Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)

Chapter 3

Supervised Learning Using Python

                index.append(X.columns[i1-1])
                processed = i - 1
                flag = False
                break
        if flag:
            break
    return X_train, index

The actual use case of this code is shown at the end of the chapter.

Now, the question is what threshold the correlation coefficient r must
exceed before you can say that X and Y are correlated. A common practice
is to assume that r > 0.5 means the variables are correlated and that
r < 0.5 means they are not. One big limitation of this approach is that it
does not consider the sample size. For example, a correlation of 0.5 in a
set of 20 data points should not carry the same weight as a correlation of
0.5 in a set of 10,000 data points. To overcome this problem, the concept
of probable error has been introduced, as shown here:

$$PE_r = 0.674 \times \frac{1 - r^2}{\sqrt{n}}$$

r is the correlation coefficient, and n is the sample size.

Here, r > 6PEr means that X and Y are significantly correlated, and
r < PEr means that there is no evidence of correlation between X and Y.
Using this approach, you can see that even r = 0.1 indicates a significant
correlation when the sample is very large.
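The effect of sample size on these thresholds can be checked numerically. The following is a minimal sketch rather than the book's own code: the function names `probable_error` and `interpret_correlation` are hypothetical, the "inconclusive" label for values between PEr and 6PEr is an assumption, and the rule is applied to the magnitude of r so that negative correlations are treated symmetrically.

```python
import math


def probable_error(r: float, n: int) -> float:
    """Probable error of a correlation coefficient: 0.674 * (1 - r^2) / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return 0.674 * (1.0 - r ** 2) / math.sqrt(n)


def interpret_correlation(r: float, n: int) -> str:
    """Apply the r > 6*PEr and r < PEr rules to |r| (assumption: sign is ignored)."""
    pe = probable_error(r, n)
    if abs(r) > 6 * pe:
        return "correlated"
    if abs(r) < pe:
        return "not correlated"
    # Assumption: values between PEr and 6*PEr are labeled inconclusive.
    return "inconclusive"


# Same r, different sample sizes: the verdict changes with n.
for r, n in [(0.5, 20), (0.5, 10_000), (0.1, 10_000)]:
    print(r, n, round(probable_error(r, n), 4), interpret_correlation(r, n))
```

With these inputs, r = 0.5 on 20 data points falls between PEr and 6PEr, so the rule gives no clear verdict, while r = 0.1 on 10,000 data points exceeds 6PEr.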

One interesting application of correlation is in product recommendations
on an e-commerce site. You can identify similar users by calculating the
correlation of their ratings for the products they have both rated.
Similarly, you can find similar products by calculating the correlation of
the ratings they received from the same users. This approach is known as
collaborative filtering.
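The user-to-user case can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the book's implementation: the helper names `pearson` and `user_similarity`, the nested-dictionary layout, and the ratings themselves are all made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # constant ratings: correlation is undefined
    return cov / math.sqrt(var_x * var_y)


def user_similarity(ratings, user_a, user_b):
    """Correlate two users' ratings over the products both have rated."""
    common = sorted(set(ratings[user_a]) & set(ratings[user_b]))
    return pearson([ratings[user_a][p] for p in common],
                   [ratings[user_b][p] for p in common])


# Made-up ratings (user -> {product: rating}), for illustration only.
ratings = {
    "alice": {"book": 5, "lamp": 3, "mug": 1, "pen": 4},
    "bob":   {"book": 4, "lamp": 2, "mug": 1},
    "carol": {"book": 1, "lamp": 3, "mug": 5, "pen": 2},
}

print(user_similarity(ratings, "alice", "bob"))    # strongly positive
print(user_similarity(ratings, "alice", "carol"))  # strongly negative
```

Inverting the dictionary so that it maps each product to the ratings it received from each user gives the product-to-product similarity with the same two functions.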
