
$$
\underbrace{\left(\begin{array}{cccccccccccc}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{array}\right)}_{\text{adjacency matrix}}
\qquad
\underbrace{\left(\begin{array}{ccc}
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & 0 & 1 \\
0 & 0 & 1
\end{array}\right)}_{\text{singular vectors}}
$$

Figure 8.3: In spectral clustering of the vertices of a graph, one finds a few top singular vectors of the adjacency matrix and forms a new matrix whose columns are these singular vectors. The rows of this matrix are points in a lower-dimensional space. These points can be clustered with k-means clustering. In the above example there are three clusters that are cliques with no connecting vertices. If one removed a few edges from the cliques and added a few edges connecting vertices in different cliques, the rows of a clique would map to small clusters of nearby points rather than all to a single point.
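The situation in Figure 8.3 can be reproduced with a minimal NumPy sketch (not from the text): build the block-diagonal adjacency matrix of three disjoint cliques of sizes 5, 4, and 3, take the top three left singular vectors, and observe that rows belonging to the same clique collapse to a single point.

```python
import numpy as np

# Adjacency matrix of three disjoint cliques of sizes 5, 4 and 3
# (with self-loops, matching Figure 8.3).
sizes = [5, 4, 3]
n = sum(sizes)
A = np.zeros((n, n))
start = 0
for s in sizes:
    A[start:start + s, start:start + s] = 1
    start += s

# Top-3 left singular vectors; their rows are the low-dimensional points.
U, S, Vt = np.linalg.svd(A)
points = U[:, :3]

# Rows from the same clique are identical (scaled indicator vectors,
# up to sign), so the 12 rows collapse to just 3 distinct points.
print(np.round(points, 3))
```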

Sometimes there is clearly a correct target clustering we want to find, but approximately optimal k-means or k-median clusterings can be very far from the target. We describe two important stochastic models where this situation arises. In these cases, as well as many others, a clustering algorithm based on SVD, called Spectral Clustering, is useful.

Spectral Clustering first projects data points onto the space spanned by the top $k$ singular vectors of the data matrix and works in the projection. It is widely used and indeed finds a clustering close to the target clustering for data arising from many stochastic models. We will prove this in a generalized setting which includes both stochastically generated data and data with no stochastic model.
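As a concrete illustration of the projection step, here is a minimal sketch assuming NumPy and scikit-learn are available (the function name spectral_cluster is ours, not from the text): project the rows of the data matrix onto the span of its top $k$ singular vectors and run k-means on the projected points.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, k, seed=0):
    """Cluster the rows of A by projecting them onto the span of the
    top-k singular vectors and running k-means in that k-dimensional space."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    projected = U[:, :k] * S[:k]   # coordinates of A's rows in the top-k subspace
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(projected)
```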

8.6.1 Stochastic Block Model<br />

Stochastic Block Models are models of communities. Suppose there are $k$ communities $C_1, C_2, \ldots, C_k$ among a population of $n$ people. Suppose the probability of two people in the same community knowing each other is $p$, and if they are in different communities, the probability is $q$, with $q < p$.
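A small sketch of sampling such a graph (the helper sbm_adjacency is ours, under the assumptions just stated): each pair of people in the same community is joined independently with probability p, and each cross-community pair with probability q.

```python
import numpy as np

def sbm_adjacency(sizes, p, q, seed=0):
    """Sample a symmetric adjacency matrix for a stochastic block model with
    community sizes `sizes`, intra-community edge probability p and
    inter-community edge probability q (typically q < p)."""
    rng = np.random.default_rng(seed)
    n = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # community label of each person
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # decide each pair once
    return (upper | upper.T).astype(int)

# Example: three communities of 50 people each, p = 0.5, q = 0.05.
A = sbm_adjacency([50, 50, 50], p=0.5, q=0.05)
```

On such a matrix, running the spectral_cluster sketch above with $k = 3$ typically recovers the three communities, in line with the claim that spectral clustering finds clusterings close to the target for data from such stochastic models.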
