Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)
Chapter 4
Unsupervised Learning: Clustering
Choosing K: The Elbow Method
In some cases you must choose the number of clusters, K, for K-means
clustering yourself. A common heuristic for this is the elbow method, which
treats the percentage of variance explained as a function of the number of
clusters. The percentage of variance explained is computed as the ratio of
the between-cluster variance to the total variance. Starting from a small
number of clusters, you add one cluster at a time; the first few additions
improve the model substantially, but at some point adding another cluster
makes the modeling of the data only marginally better. The number of
clusters is chosen at that point, which is the elbow criterion. Note that this
"elbow" cannot always be unambiguously identified. Assume that in the
previous example the retailer has customers in four cities: Delhi, Kolkata,
Mumbai, and Chennai. The programmer does not know that, so he runs
clustering with K=2 to K=9 and plots the percentage of variance explained.
He will get an elbow curve that clearly indicates K=4 is the right number
for K.
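The elbow curve described above can be sketched numerically. The following is a minimal illustration (not taken from the book) using scikit-learn's KMeans on synthetic data with four well-separated clusters, a hypothetical stand-in for the four-city example; it records the within-cluster sum of squares (inertia) for each K, which drops sharply up to K=4 and only slightly afterward:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four well-separated clusters, a hypothetical stand-in
# for the retailer's Delhi/Kolkata/Mumbai/Chennai customer groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Fit K-means for K = 2..9 and record the within-cluster sum of squares
# (inertia); the "elbow" is where adding a cluster stops helping much.
ks = range(2, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` (or, equivalently, the percentage of variance explained) would show the bend at K=4.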
Distance or Similarity Measure
The measure of distance or similarity is one of the key factors in clustering.
In this section, I will describe the different kinds of distance and similarity
measures. Before that, I'll explain what a distance actually means here.
Properties
A distance is a measure that satisfies the following properties:
• dist(x, y) = 0 if and only if x = y (identity).
• dist(x, y) > 0 when x ≠ y (positivity).
• dist(x, y) = dist(y, x) (symmetry).
• dist(x, y) + dist(y, z) >= dist(x, z) for all x, y, and z (triangle inequality).
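As a quick sanity check (my own illustration, not from the book), the familiar Euclidean distance satisfies all four properties; the points below are arbitrary choices:

```python
import math

# Euclidean distance between two points of equal dimension
def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

assert dist(x, x) == 0                        # identity: dist(x, y) = 0 iff x = y
assert dist(x, y) > 0                         # positivity when x != y
assert dist(x, y) == dist(y, x)               # symmetry
assert dist(x, y) + dist(y, z) >= dist(x, z)  # triangle inequality

print(dist(x, y))  # the classic 3-4-5 right triangle: prints 5.0
```

Similarity measures such as cosine similarity do not satisfy these axioms directly, which is why the distinction matters in the sections that follow.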