Advanced Data Analytics Using Python_ With Machine Learning, Deep Learning and NLP Examples ( 2023)
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Chapter 4
Unsupervised Learning: Clustering
Although Levenshtein is simple in implementation and
computationally less expensive, if you want to introduce a gap in string
matching (for example, New Delhi and NewDelhi), then the Needleman-
Wunsch algorithm is the better choice.
Similarity in the Context of Document
A similarity measure between documents indicates how identical two
documents are. Generally, similarity measures are bounded in the range
[-1,1] or [0,1] where a similarity score of 1 indicates maximum similarity.
Types of Similarity
To measure similarity, documents are realized as a vector of terms
excluding the stop words. Let’s assume that A and B are vectors
representing two documents. In this case, the different similarity measures
are shown here:
• Dice
The Dice coefficient is denoted by the following:
AÇ
B
sim(q,d j ) = D(A,B) =
a A + ( 1-a)
B
Also,
aÎ[ 01 , ] and let a = 1 2
• Overlap
The Overlap coefficient is computed as follows:
AÇ
B
Sim(q,d j ) = O(A,B) =
min ( A , B )
The Overlap coefficient is calculated using the max
operator instead of min.
87