Foundations of Data Science

8.6.2 Gaussian Mixture Model

A second example is the Gaussian Mixture Model with k spherical Gaussians as components, discussed in Section 3.6.2. We saw there that for two Gaussians, each of variance one in each direction, data points are at distance O(√d) from their correct centers; if the separation between the centers is O(1), which is much smaller than O(√d), an approximately optimal solution could misclassify almost all points. We already saw in Section 3.6.2 that the SVD of the data matrix A helps in this case.
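To make these distance scales concrete, the following sketch (a hypothetical numerical illustration, not from the text) samples points from a d-dimensional spherical Gaussian of unit variance and checks that the distance to the correct center concentrates near √d, while the distance to a second center only O(1) away is nearly indistinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500

# n points from a spherical Gaussian centered at the origin,
# variance one in each coordinate direction
X = rng.standard_normal((n, d))

# a second (wrong) center at distance 1 from the true one
c = np.zeros(d)
c[0] = 1.0

dist_own = np.linalg.norm(X, axis=1)        # distance to the correct center
dist_other = np.linalg.norm(X - c, axis=1)  # distance to the other center

# both concentrate near sqrt(d) ~ 31.6; the gap between them is tiny
# compared to the noise, so an approximately optimal solution can
# misassign many points
print(dist_own.mean(), dist_other.mean())
```

The mean distance to the wrong center is about √(d+1), barely larger than √d, which is why separation O(1) between centers is not enough on its own.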

We will show that a natural and simple SVD-based approach called Spectral Clustering helps solve not only these two examples, but a more general class of problems for which there may or may not be a stochastic model. The result can be qualitatively described by a simple statement when k, the number of clusters, is O(1): if there is some clustering with cluster centroids separated by at least a constant times the “standard deviation”, then we can find a clustering close to this clustering. This should remind the reader of the mnemonic “Means separated by 6, or some constant number of, standard deviations”.
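As a sketch of the idea (assuming two spherical unit-variance Gaussians with centers 6 “standard deviations” apart, a plain SVD projection, and a few Lloyd's iterations for the clustering step; none of this code comes from the text), projecting the data onto the span of the top-k right singular vectors and clustering there recovers the planted clustering almost perfectly:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_per = 200, 2, 150

# two spherical Gaussians, unit variance per coordinate,
# centers separated by 6 (a constant number of "standard deviations")
c1 = np.zeros(d)
c1[0] = 6.0
A = np.vstack([rng.standard_normal((n_per, d)),
               c1 + rng.standard_normal((n_per, d))])
truth = np.array([0] * n_per + [1] * n_per)

# project each data point onto the span of the top-k right singular vectors
_, _, Vt = np.linalg.svd(A, full_matrices=False)
P = A @ Vt[:k].T                       # n x k projected data

# a few Lloyd's (k-means) iterations on the k-dimensional projection;
# initialize with the two extreme points along the top direction
centers = P[[np.argmin(P[:, 0]), np.argmax(P[:, 0])]]
for _ in range(20):
    labels = np.argmin(((P[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([P[labels == j].mean(axis=0) for j in range(k)])

# agreement with the planted clustering, up to relabeling
agree = max((labels == truth).mean(), (labels != truth).mean())
print(agree)
```

In the projection, the within-cluster spread per direction stays roughly 1 while the centers remain about 6 apart, so the clusters separate cleanly even though in the full d-dimensional space every point is at distance about √d from its center.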

8.6.3 Standard Deviation without a stochastic model

First, how do we define mean and standard deviation for a clustering problem without assuming a stochastic model of the data? In a stochastic model, each cluster consists of independent, identically distributed points from a distribution, so the mean of the cluster is just the mean of the distribution. Analogously, we can define the mean as the centroid of the data points in a cluster, whether or not we have a stochastic model. If we had a distribution, it would have a standard deviation in each direction, namely, the square root of the mean squared distance from the mean of the distribution in that direction. But the same definition also applies with no distribution or stochastic model. We give the formal definitions after introducing some notation.

Notation: We will denote by A the data matrix, which is n × d, with each row a data point; k will denote the number of clusters. A k-clustering will be represented by an n × d matrix C; row i of C is the center of the cluster that a_i belongs to. So C will have k distinct rows.

The variance of the clustering C along the direction v, where v is a vector of length 1, is the mean squared distance of data points from their cluster centers in the direction v, namely, it is

$$\frac{1}{n} \sum_{i=1}^{n} \big( (a_i - c_i) \cdot v \big)^2 .$$

The variance may differ from direction to direction, but we define the variance, denoted σ², of the clustering to be the maximum over all directions, namely,

$$\sigma^2(C) = \frac{1}{n} \max_{|v|=1} \sum_{i=1}^{n} \big( (a_i - c_i) \cdot v \big)^2 = \frac{1}{n} \, \|A - C\|_2^2 .$$
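The two expressions for σ²(C) can be checked numerically: the maximum over unit directions of the mean squared projection of A − C is attained at the top right singular vector of A − C, and equals the squared spectral norm of A − C divided by n. A minimal sketch on synthetic data (the helper name `variance_along` is ours, not the text's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 300, 6, 3

# a hypothetical clustering: row i of C is the center of a_i's cluster
cluster_centers = 4.0 * rng.standard_normal((k, d))
assignment = rng.integers(0, k, size=n)
C = cluster_centers[assignment]
A = C + rng.standard_normal((n, d))    # data = center + noise

def variance_along(A, C, v):
    """Mean squared distance from cluster centers in unit direction v."""
    return np.mean(((A - C) @ v) ** 2)

# sigma^2(C) via the spectral norm: (1/n) * ||A - C||_2^2
sigma2 = np.linalg.norm(A - C, 2) ** 2 / n

# the maximum over unit directions is attained at the top right singular vector
_, _, Vt = np.linalg.svd(A - C, full_matrices=False)
v_top = Vt[0]
print(sigma2, variance_along(A, C, v_top))

# any other unit direction gives at most sigma^2(C)
v_rand = rng.standard_normal(d)
v_rand /= np.linalg.norm(v_rand)
print(variance_along(A, C, v_rand))
```

Here `np.linalg.norm(A - C, 2)` is the spectral norm (largest singular value) of A − C, so dividing its square by n gives exactly the maximum in the definition above.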

