
probability is $q$ (where $q < p$).$^{32}$ We assume the events that person $i$ knows person $j$ are independent across all $i$ and $j$.

Specifically, we are given an $n \times n$ data matrix $A$, where $a_{ij} = 1$ if and only if $i$ and $j$ know each other. We assume the $a_{ij}$ are independent random variables, and use $a_i$ to denote the $i$th row of $A$. It is useful to think of $A$ as the adjacency matrix of a graph, such as the friendship network in Facebook. We will also think of the rows $a_i$ as data points. The clustering problem is to classify the data points into the communities they belong to. In practice, the graph is fairly sparse, i.e., $p$ and $q$ are small, namely, $O(1/n)$ or $O(\ln n / n)$.

Consider the simple case of two communities with $n/2$ people in each and with
$$p = \frac{\alpha}{n}, \qquad q = \frac{\beta}{n},$$
where $\alpha, \beta \in O(\ln n)$.
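To make the model concrete, here is a minimal sketch in Python that samples such a matrix $A$. The function name `sample_two_block_model` and the parameter values are illustrative choices, not fixed by the text; the sketch also symmetrizes the matrix on the assumption that "knows" is mutual.

```python
import numpy as np

def sample_two_block_model(n, alpha, beta, rng):
    """Sample a symmetric n x n 0/1 adjacency matrix with two communities
    of n/2 nodes each: P(edge) = alpha/n inside a community, beta/n across."""
    p, q = alpha / n, beta / n
    half = n // 2
    probs = np.full((n, n), q)          # cross-community probability q everywhere...
    probs[:half, :half] = p             # ...then p on the C1 x C1 block
    probs[half:, half:] = p             # ...and p on the C2 x C2 block
    A = (rng.random((n, n)) < probs).astype(float)
    A = np.triu(A, 1)                   # keep one draw per pair, drop the diagonal
    return A + A.T                      # symmetrize: assume "knows" is mutual

rng = np.random.default_rng(0)
A = sample_two_block_model(2000, alpha=50.0, beta=10.0, rng=rng)
```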

Let $u$ and $v$ be the centroids of the data points in community one and community two, respectively; so $u_i \approx p$ for $i \in C_1$, $u_j \approx q$ for $j \in C_2$, $v_i \approx q$ for $i \in C_1$, and $v_j \approx p$ for $j \in C_2$. We have
$$|u - v|^2 \approx \sum_{j=1}^{n} (u_j - v_j)^2 = \frac{(\alpha - \beta)^2}{n^2}\, n = \frac{(\alpha - \beta)^2}{n}.$$
$$\text{Inter-centroid distance} \approx \frac{\alpha - \beta}{\sqrt{n}}. \tag{8.1}$$
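A rough numerical check of (8.1), continuing the sketch above with the same $A$, $n$, $\alpha$, and $\beta$; since the empirical centroids carry sampling noise, only approximate agreement should be expected.

```python
n, alpha, beta = 2000, 50.0, 10.0
half = n // 2
u = A[:half].mean(axis=0)            # centroid of community one's rows
v = A[half:].mean(axis=0)            # centroid of community two's rows
print(np.linalg.norm(u - v))         # empirical |u - v|
print((alpha - beta) / np.sqrt(n))   # prediction (alpha - beta)/sqrt(n) from (8.1)
```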

On the other hand, the distance between a data point and its cluster centroid is much greater. For $i \in C_1$,
$$E\left(|a_i - u|^2\right) = \sum_{j=1}^{n} E\left((a_{ij} - u_j)^2\right) = \frac{n}{2}\left[p(1-p) + q(1-q)\right] \in \Omega(\alpha + \beta).$$
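The same kind of check works for this expectation, again continuing the sketch above: the empirical average of $|a_i - u|^2$ over $i \in C_1$ should land near $\frac{n}{2}[p(1-p) + q(1-q)] \approx (\alpha + \beta)/2$.

```python
p, q = alpha / n, beta / n
sq = ((A[:half] - u) ** 2).sum(axis=1)        # |a_i - u|^2 for each i in C_1
print(sq.mean())                              # empirical average, roughly (alpha + beta)/2
print((n / 2) * (p * (1 - p) + q * (1 - q)))  # the predicted value
```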

We see that if $\alpha, \beta \in O(\ln n)$, then the ratio
$$\frac{\text{distance between cluster centroids}}{\text{distance of a data point to its cluster centroid}} \in O\left(\sqrt{\frac{\ln n}{n}}\right).$$

So the centroid of a cluster is much farther from the data points in its own cluster than it is from the centroid of the other cluster.

Now consider the $k$-median objective function. Suppose we wrongly classify a point in $C_1$ as belonging to $C_2$. The extra cost we incur is at most the distance between centroids, which is only $O(\sqrt{\ln n}/\sqrt{n})$ times the $k$-median cost of the data point. So just by examining the cost, we cannot rule out an $\varepsilon$-approximate $k$-median clustering from misclassifying all points. A similar argument can also be made about the $k$-means objective function.
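To see the effect numerically, one can compare the $k$-median cost of the true clustering against a clustering that misclassifies every point. This continues the sketch above and, purely for illustration, holds the two centers fixed at $u$ and $v$ rather than recomputing them.

```python
def kmedian_cost(points, center):
    # Sum of Euclidean distances from each row of `points` to `center`.
    return np.linalg.norm(points - center, axis=1).sum()

correct = kmedian_cost(A[:half], u) + kmedian_cost(A[half:], v)
swapped = kmedian_cost(A[:half], v) + kmedian_cost(A[half:], u)
print((swapped - correct) / correct)   # small relative increase in cost
```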

$^{32}$More generally, for each pair of communities $a$ and $b$, there could be a probability $p_{ab}$ that a person from community $a$ knows a person from community $b$. But for the discussion here, we take $p_{aa} = p$ for all $a$ and $p_{ab} = q$ for all $a \neq b$.
