
Preface


Now comes the next challenge for a data scientist, and that is clustering a dataset without labeled data points. We call this unsupervised learning. I have devoted a large section (Part II), comprising Chaps. 9 through 16, to clustering, giving you in-depth coverage of several clustering techniques. The notion of a cluster is not well-defined, and there is usually no consensus on the results produced by clustering algorithms. So we have many clustering algorithms that deal with small, medium, large, and really huge spatial datasets. I cover many clustering algorithms, explaining their applications for datasets of various sizes.

Chapter 9 (Centroid-Based Clustering) discusses centroid-based clustering algorithms, which are probably the simplest and are the starting points for clustering huge spatial datasets. The chapter covers both the K-Means and K-Medoids clustering algorithms. For K-Means, I describe its working, followed by an explanation of the algorithm itself. I discuss the purpose of the objective function and techniques for selecting the optimal number of clusters: the Elbow, Average Silhouette, and Gap Statistic methods. This is followed by a discussion of K-Means' limitations and where to use it. For the K-Medoids algorithm, I follow a similar approach, describing its working, algorithm, merits, demerits, and implementation.
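
To give a flavor of the Elbow method mentioned above, here is a minimal sketch using scikit-learn's KMeans; the synthetic dataset and the range of k values are illustrative assumptions, not the book's own example:

```python
# A minimal sketch of the Elbow method with scikit-learn's KMeans.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Inertia is the K-Means objective function: the sum of squared
# distances of points to their nearest centroid.
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(range(1, 10), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```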

Chapter 10 (Connectivity-Based Clustering) describes two connectivity-based clustering algorithms: Agglomerative and Divisive. For Agglomerative clustering, I describe the Single, Complete, and Average linkages while explaining its full working. I then discuss its advantages, disadvantages, and the practical situations where this algorithm finds its use. For Divisive clustering, I take a similar approach and discuss its implementation challenges.
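
As a small taste of the linkage choices discussed in this chapter, here is a minimal sketch with scikit-learn's AgglomerativeClustering; the dataset and cluster count are illustrative assumptions:

```python
# A minimal sketch comparing the three linkages on the same data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# "single" merges by the closest pair of points between clusters,
# "complete" by the farthest pair, "average" by the mean pairwise distance.
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    print(linkage, "labels of first 10 points:", model.labels_[:10])
```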

Chapter 11 (Gaussian Mixture Model) describes another type of clustering algorithm, in which the data is modeled as a mixture of Gaussian distributions. I explain how to select the optimal number of clusters with a practical example.
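
One common way to pick the number of mixture components is an information criterion such as BIC; the sketch below assumes that approach and a synthetic dataset, and the book's own example may use a different criterion:

```python
# A minimal sketch of selecting the number of mixture components with BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)

# Lower BIC indicates a better trade-off between fit and model complexity.
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=1).fit(X)
    print(f"components={k}: BIC={gmm.bic(X):.1f}")
```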

Chapter 12 (Density-Based Clustering) focuses on density-based clustering techniques. Here I describe three algorithms: DBSCAN, OPTICS, and Mean Shift. I discuss why we use DBSCAN and, after covering the preliminaries, its full working. I then discuss its advantages, disadvantages, and implementation with the help of a project. To understand OPTICS, I first explain a few terms such as core distance and reachability distance. As with DBSCAN, I discuss its implementation with the help of a project. Finally, I describe Mean Shift clustering, explaining its full working and how to select the bandwidth. A discussion of the algorithm's strengths, weaknesses, and applications, along with a practical implementation illustrated through a project, follows.

Chapter 13 (BIRCH) discusses another important clustering algorithm, called BIRCH. This algorithm helps data scientists cluster huge datasets where all the earlier algorithms fail. BIRCH splits the huge dataset into subclusters by creating a hierarchical tree-like structure. The algorithm clusters incrementally, eliminating the need to load the entire dataset into memory. In this chapter, I discuss why and where to use this algorithm and explain its working by showing you how the algorithm constructs a CF tree.
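
To illustrate the incremental nature of BIRCH, here is a minimal sketch using scikit-learn's Birch and its partial_fit method; the batch size, threshold, and synthetic data are illustrative assumptions:

```python
# A minimal sketch of BIRCH clustering data in chunks via partial_fit.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(threshold=0.5, n_clusters=3)

# Feed the data in batches, as if it were too large to hold in memory;
# Birch maintains a CF tree and updates it with each batch.
for _ in range(10):
    batch = rng.normal(size=(100, 2)) + rng.choice([-5.0, 0.0, 5.0])
    model.partial_fit(batch)

print("subclusters in the CF tree:", len(model.subcluster_centers_))
```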

Chapter 14 (CLARANS) discusses another important algorithm for clustering enormous datasets, called CLARANS. It builds on CLARA (Clustering LARge Applications), a sampling-based extension of K-Medoids.
