Advanced Data Analytics Using Python_ With Machine Learning, Deep Learning and NLP Examples ( 2023)
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Chapter 4
Unsupervised Learning: Clustering
Clustering classifies objects into groups based on similarity or
distance measure. This is an example of unsupervised learning. The main
difference between clustering and classification is that the latter has welldefined
target classes. The characteristics of target classes are defined by
the training data and the models learned from it. That is why classification
is supervised in nature. In contrast, clustering tries to define meaningful
classes based on data and its similarity or distance. Figure 4-1 illustrates a
document clustering process.
Figure 4-1. Document clustering
K-Means Clustering
Let’s suppose that a retail distributer has an online system where local
agents enter trading information manually. One of the fields they have
to fill in is City. But because this data entry process is manual, people
normally tend to make lots of spelling errors. For example, instead of
Delhi, people enter Dehi, Dehli, Delly, and so on. You can try to solve
this problem using clustering because the number of clusters are already
known; in other words, the retailer is aware of how many cities the agents
operate in. This is an example of K-means clustering.
The K-means algorithm has two inputs. The first one is the data X, which
is a set of N number of vectors, and the second one is K, which represents
the number of clusters that needs to be created. The output is a set of K
centroids in each cluster as well as a label to each vector in X that indicates
78