09.10.2023 Views

Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)


Chapter 4

Unsupervised Learning: Clustering

Clustering groups objects based on a similarity or distance measure. It is an example of unsupervised learning. The main difference between clustering and classification is that the latter has well-defined target classes. The characteristics of the target classes are defined by the training data and the models learned from it, which is why classification is supervised in nature. In contrast, clustering tries to define meaningful classes based on the data itself and its similarity or distance. Figure 4-1 illustrates a document clustering process.

Figure 4-1. Document clustering

K-Means Clustering

Let’s suppose that a retail distributor has an online system where local agents enter trading information manually. One of the fields they have to fill in is City. Because this data entry process is manual, people tend to make many spelling errors. For example, instead of Delhi, people enter Dehi, Dehli, Delly, and so on. You can try to solve this problem using clustering because the number of clusters is already known; in other words, the retailer is aware of how many cities the agents operate in. This is an example of K-means clustering.
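As a rough illustration of the city-name scenario, the sketch below clusters misspelled names with a small hand-written K-means. It is not taken from the book. Several choices are assumptions made only for this example: names are turned into character-bigram count vectors (K-means needs numeric vectors), the initial centroids are picked with a deterministic farthest-first rule instead of at random, and the Mumbai and Kolkata spellings are hypothetical additions to the Delhi variants mentioned above.

```python
from collections import Counter


def bigrams(name):
    """Return the character bigrams of a lowercased name."""
    s = name.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_vector(name, vocab):
    """Count bigrams of a name over a fixed vocabulary (assumed representation)."""
    counts = Counter(bigrams(name))
    return [float(counts[b]) for b in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(X, k, max_iter=100):
    """Minimal K-means: returns (centroids, labels) for data X and cluster count k."""
    # Assumption: deterministic farthest-first initialization, so the
    # result does not depend on a random seed.
    centroids = [X[0][:]]
    while len(centroids) < k:
        far = max(X, key=lambda x: min(sq_dist(x, c) for c in centroids))
        centroids.append(far[:])

    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector gets the index of its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
                      for x in X]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Delhi variants come from the text; the other spellings are hypothetical.
names = ["Delhi", "Dehi", "Dehli", "Delly",
         "Mumbai", "Mumbay", "Munbai",
         "Kolkata", "Kolkatta", "Kolkota"]
vocab = sorted({b for n in names for b in bigrams(n)})
X = [bigram_vector(n, vocab) for n in names]

# K is known in advance: the agents operate in three cities.
centroids, labels = kmeans(X, k=3)
for name, label in zip(names, labels):
    print(label, name)
```

On this toy data the spellings of each city end up sharing one label. Real data entry errors are messier, so the bigram representation and the choice of distance would probably need tuning.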

The K-means algorithm has two inputs. The first is the data X, which is a set of N vectors, and the second is K, which represents the number of clusters that need to be created. The output is a set of K centroids, one for each cluster, as well as a label for each vector in X that indicates
