Advanced Data Analytics Using Python_ With Machine Learning, Deep Learning and NLP Examples ( 2023)

Chapter 4

Unsupervised Learning: Clustering

Choosing K: The Elbow Method

There are cases where you have to determine K in K-means clustering yourself. For this purpose, you can use the elbow method, which treats the percentage of variance explained as a function of the number of clusters. You start with a small number of clusters and add one at a time; at some point, the additional cluster no longer improves the modeling of the data much. The number of clusters is chosen at that point, which is the elbow criterion. This "elbow" cannot always be unambiguously identified. The percentage of variance is computed as the ratio of the between-group variance to the total variance. Assume that in the previous example, the retailer has four cities: Delhi, Kolkata, Mumbai, and Chennai. The programmer does not know that, so he runs clustering with K = 2 to K = 9 and plots the percentage of variance. He will get an elbow curve that clearly indicates K = 4 is the right number of clusters.
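The procedure can be sketched as follows. This is a minimal illustration, not the retailer's actual data: the four synthetic blobs stand in for the four cities, and scikit-learn's `KMeans` is assumed to be available. The percentage of variance explained is computed as 1 minus the within-cluster sum of squares (`inertia_`) divided by the total sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: four well-separated clusters standing in for
# Delhi, Kolkata, Mumbai, and Chennai in the retailer example.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Total sum of squares around the global mean (the "total variance").
total_ss = ((X - X.mean(axis=0)) ** 2).sum()

# Percentage of variance explained for K = 2..9:
# between-group SS / total SS = 1 - within-group SS (inertia) / total SS.
explained = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    explained[k] = 1.0 - km.inertia_ / total_ss

for k in sorted(explained):
    print(f"K={k}: {explained[k]:.3f}")
```

Plotting `explained` against K (for example with `matplotlib.pyplot.plot`) produces the elbow curve: the value rises steeply until K = 4 and then flattens out, since further clusters only split genuine groups.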

Distance or Similarity Measure

The measure of distance or similarity is one of the key factors in clustering. In this section, I will describe the different kinds of distance and similarity measures. Before that, I'll explain what distance actually means here.

Properties

Distances are measures that satisfy the following properties:

• dist(x, y) = 0 if and only if x = y.

• dist(x, y) > 0 when x ≠ y.

• dist(x, y) = dist(y, x) (symmetry).

• dist(x, y) + dist(y, z) >= dist(x, z) for all x, y, and z (the triangle inequality).
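As a quick sanity check, the four properties can be verified for one concrete measure. The sketch below uses the Euclidean distance (via `numpy.linalg.norm`) on a few hand-picked points; the points themselves are arbitrary illustrations, not data from the example.

```python
import numpy as np

def dist(x, y):
    """Euclidean distance, one concrete measure satisfying the axioms."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

x, y, z = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]

assert dist(x, x) == 0.0                      # identity: 0 iff the points coincide
assert dist(x, y) > 0                         # positivity when x != y
assert dist(x, y) == dist(y, x)               # symmetry
assert dist(x, y) + dist(y, z) >= dist(x, z)  # triangle inequality
print(dist(x, y))                             # prints 5.0 (the 3-4-5 triangle)
```

Not every similarity measure used in practice is a true metric in this sense (cosine similarity, for instance, is not), which is why the distinction between distance and similarity matters in the sections that follow.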
