09.10.2023 Views

Advanced Data Analytics Using Python_ With Machine Learning, Deep Learning and NLP Examples ( 2023)

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 4

Unsupervised Learning: Clustering

Although Levenshtein is simple in implementation and

computationally less expensive, if you want to introduce a gap in string

matching (for example, New Delhi and NewDelhi), then the Needleman-

Wunsch algorithm is the better choice.

Similarity in the Context of Document

A similarity measure between documents indicates how identical two

documents are. Generally, similarity measures are bounded in the range

[-1,1] or [0,1] where a similarity score of 1 indicates maximum similarity.

Types of Similarity

To measure similarity, documents are realized as a vector of terms

excluding the stop words. Let’s assume that A and B are vectors

representing two documents. In this case, the different similarity measures

are shown here:

• Dice

The Dice coefficient is denoted by the following:

B

sim(q,d j ) = D(A,B) =

a A + ( 1-a)

B

Also,

aÎ[ 01 , ] and let a = 1 2

• Overlap

The Overlap coefficient is computed as follows:

B

Sim(q,d j ) = O(A,B) =

min ( A , B )

The Overlap coefficient is calculated using the max

operator instead of min.

87

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!