08.10.2016 Views

Foundations of Data Science


are likely to be balanced given that $a_{ij}$ are all between 0 and 1. In this case, there will be only $O(1)$ groups and the log factors disappear.

Another measure of density is based on similarities. Recall that the similarity between objects represented by vectors (rows of $A$) is defined by their dot products. Thus, similarities are entries of the matrix $AA^T$. Define the average cohesion $f(S)$ of a set $S$ of rows of $A$ to be the sum of all pairwise dot products of rows in $S$ divided by $|S|$. The average cohesion of $A$ is the maximum over all subsets of rows of the average cohesion of the subset.

Since the singular values of $AA^T$ are squares of singular values of $A$, we expect $f(A)$ to be related to $\sigma_1(A)^2$ and $d(A)^2$. Indeed it is. We state the following without proof.

Lemma 8.12 $d(A)^2 \le f(A) \le d(A) \log n$. Also, $\sigma_1(A)^2 \ge f(A) \ge \dfrac{c\,\sigma_1(A)^2}{\log n}$.
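One of these inequalities, $\sigma_1(A)^2 \ge f(A)$, can be checked in a few lines. The sketch below assumes that "all pairwise dot products" means the sum over all ordered pairs $(i, j)$ with $i, j \in S$, including $i = j$; the text does not spell this out. The symbols $a_i$ (the $i$-th row of $A$) and $\mathbf{1}_S$ (the 0-1 indicator vector of $S$) are notation introduced here for illustration.

$$f(S) = \frac{1}{|S|} \sum_{i \in S} \sum_{j \in S} a_i \cdot a_j = \frac{\mathbf{1}_S^T A A^T \mathbf{1}_S}{|S|} = \frac{|A^T \mathbf{1}_S|^2}{|\mathbf{1}_S|^2} \le \sigma_1(A^T)^2 = \sigma_1(A)^2.$$

The third equality uses $|\mathbf{1}_S|^2 = |S|$, and the inequality uses the fact that $\sigma_1(A^T)$ is the maximum of $|A^T x| / |x|$ over nonzero $x$. Taking the maximum over $S$ gives $f(A) \le \sigma_1(A)^2$. The remaining bounds in the lemma are not derived here.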

f(A) can be found exactly using flow techniques as we will see later.<br />
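To make the definition of average cohesion concrete, here is a minimal Python sketch that computes $f(S)$ and finds $f(A)$ by exhaustive search on a tiny matrix. It is not the flow technique mentioned above, only a brute-force check that takes exponential time. The function names are hypothetical, and the code assumes that "all pairwise dot products" means the sum over all ordered pairs $(i, j)$ with $i, j \in S$, including $i = j$.

```python
from itertools import combinations


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def average_cohesion(A, S):
    """f(S): sum of a_i . a_j over all i, j in S, divided by |S|.

    Assumption: ordered pairs are summed and i == j is included.
    """
    S = list(S)
    total = sum(dot(A[i], A[j]) for i in S for j in S)
    return total / len(S)


def max_average_cohesion(A):
    """f(A) by trying every nonempty subset of rows (exponential time)."""
    n = len(A)
    best_value, best_set = float("-inf"), None
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            value = average_cohesion(A, S)
            if value > best_value:
                best_value, best_set = value, S
    return best_value, best_set


# Rows 0 and 1 are identical, row 2 is orthogonal to both.
A = [[1, 0], [1, 0], [0, 1]]
value, subset = max_average_cohesion(A)
```

In this example the two identical rows form the most cohesive subset; adding the orthogonal third row increases $|S|$ faster than it increases the sum of dot products.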

8.11 Community Finding and Graph Partitioning<br />

Assume that data are nodes in a possibly weighted graph where edges represent some notion of affinity between their endpoints. In particular, let $G = (V, E)$ be a weighted graph. Given two sets of nodes $S$ and $T$, define

$$E(S, T) = \sum_{i \in S,\; j \in T} e_{ij}.$$

We then define the density of a set $S$ to be

$$d(S, S) = \frac{E(S, S)}{|S|}.$$
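The two definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: edge weights are stored in a dictionary keyed by ordered pairs `(i, j)`, missing pairs are treated as weight 0, and an undirected edge is stored in both directions. The function names and the example graph are hypothetical.

```python
def edge_weight_sum(weights, S, T):
    """E(S, T): sum of e_ij over i in S and j in T (missing pairs count as 0)."""
    return sum(weights.get((i, j), 0) for i in S for j in T)


def density(weights, S):
    """d(S, S) = E(S, S) / |S|."""
    S = set(S)
    return edge_weight_sum(weights, S, S) / len(S)


def undirected(edges):
    """Store each undirected edge (i, j, w) in both directions."""
    weights = {}
    for i, j, w in edges:
        weights[(i, j)] = w
        weights[(j, i)] = w
    return weights


# Example graph (an assumption for illustration): a 4-clique on vertices 0..3
# plus a pendant vertex 4 attached to vertex 3, all with unit weights.
edges = [(i, j, 1) for i in range(4) for j in range(i + 1, 4)] + [(3, 4, 1)]
w = undirected(edges)

clique_density = density(w, {0, 1, 2, 3})
whole_graph_density = density(w, range(5))
```

Because each undirected edge appears as both `(i, j)` and `(j, i)`, $E(S, S)$ counts every internal edge twice, so $d(S, S)$ equals the average degree inside $S$.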

If $G$ is an undirected graph, then $d(S, S)$ can be viewed as the average degree in the vertex-induced subgraph over $S$. The set $S$ of maximum density is therefore the subgraph of maximum average degree. Finding such a set can be viewed as finding a tight-knit community inside some network. In the next section, we describe an algorithm for finding such a set using network flow techniques.
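As a baseline for small graphs, the set of maximum average degree can be found by trying every subset of vertices. The sketch below is not the network flow method the text goes on to describe; it is an exponential-time reference useful only for checking results on tiny inputs. It assumes an unweighted, undirected graph given as a dictionary of neighbor sets, and the function names and example graph are hypothetical.

```python
from itertools import combinations


def average_degree(adj, S):
    """Average degree in the subgraph induced by S.

    adj maps each vertex to the set of its neighbors (undirected, unweighted).
    """
    S = set(S)
    return sum(len(adj[v] & S) for v in S) / len(S)


def densest_subset_bruteforce(adj):
    """Return (best average degree, best vertex set) over all nonempty subsets."""
    vertices = sorted(adj)
    best_value, best_set = -1.0, None
    for k in range(1, len(vertices) + 1):
        for S in combinations(vertices, k):
            value = average_degree(adj, S)
            if value > best_value:
                best_value, best_set = value, set(S)
    return best_value, best_set


# A 4-clique on vertices 0..3 with a pendant vertex 4 attached to vertex 3.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
best_value, best_set = densest_subset_bruteforce(adj)
```

Here the clique is denser than the whole graph, since including the pendant vertex adds one vertex but only one edge.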

8.11.1 Flow Methods<br />

Here we consider dense induced subgraphs of a graph. An induced subgraph of a graph consists of a subset of the vertices of the graph along with all edges of the graph that connect pairs of vertices in that subset. We show that finding an induced subgraph with maximum average degree can be done by network flow techniques. This is simply maximizing the density $d(S, S)$ over all subsets $S$ of the vertices of the graph. First consider the problem of finding a subset of vertices such that the induced subgraph has average
