# Discovering Clusters in Networks - SNAP - Stanford University


CS246: Mining Massive Datasets

Jure Leskovec, Stanford University

http://cs246.stanford.edu

### Network communities

Networks contain tightly connected groups. Network communities (also called clusters, groups, or modules) are sets of nodes with many connections inside and few connections to the outside (the rest of the network).

2/14/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How can we automatically find such densely connected groups of nodes? Ideally, the discovered clusters then correspond to real groups. For example:

Zachary's karate club network: Zachary observed social ties and rivalries in a university karate club. During his observation, conflicts led the group to split. The split could be explained by a minimum cut in the network.

Find micro-markets by partitioning the "query × advertiser" graph, with queries on one side and advertisers on the other.

Searching for small communities in the Web graph: what is the signature of a community / discussion in a Web graph? A dense 2-layer graph. Intuition: many people all talking about the same things [Kumar et al. '99]. Use this to define "topics": what the same people on the left talk about on the right. (Remember HITS!)

A more well-defined problem: enumerate complete bipartite subgraphs K_{s,t}, where K_{s,t} has s nodes on the "left" that each link to the same t nodes on the "right". Example: K_{3,4} is fully connected from left to right, with |X| = s = 3 and |Y| = t = 4.

Two points [Kumar et al. '99]:

(1) A dense bipartite graph is the signature of a community/discussion.
(2) A complete bipartite subgraph K_{s,t} is a graph on s nodes that each link to the same t other nodes.

Plan:

(A) From (2) get back to (1), via: any dense enough graph contains a smaller K_{s,t} as a subgraph.
(B) How do we solve (2) in a giant graph? What similar problems were solved on big non-graph data? Frequent itemset enumeration [Agrawal-Srikant '99].

### Frequent itemsets

Market-basket analysis. Setting:

- Market: a universe U of n items.
- Baskets: m subsets of U: S_1, S_2, …, S_m ⊆ U (S_i is the set of items one person bought).
- Support: a frequency threshold f.

Goal [Agrawal-Srikant '99]: find all subsets T such that T ⊆ S_i for at least f of the sets S_i (i.e., the items in T were bought together at least f times).

What's the connection between frequent itemsets and complete bipartite graphs?
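As a concrete sketch of this setting, the following Python enumerates all size-t itemsets with support at least f by direct counting. This is naive enumeration, not the Agrawal-Srikant algorithm itself, and `frequent_itemsets` plus the toy baskets are illustrative names:

```python
from itertools import combinations

def frequent_itemsets(baskets, f, t):
    """Return all size-t itemsets contained in at least f baskets.

    Naive direct counting -- a sketch of the setting above, not the
    Agrawal-Srikant algorithm itself."""
    counts = {}
    for basket in baskets:
        # Every size-t subset of a basket is a candidate itemset.
        for T in combinations(sorted(basket), t):
            counts[T] = counts.get(T, 0) + 1
    return {T for T, c in counts.items() if c >= f}

baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}, {"b", "c"}]
print(sorted(frequent_itemsets(baskets, f=3, t=2)))  # [('a', 'b'), ('b', 'c')]
```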

View each node i as the set S_i of nodes that i points to (e.g., S_i = {a, b, c, d}). Find frequent itemsets with:

- s … minimum support
- t … itemset size

Then we have found K_{s,t}: a K_{s,t} is exactly a set Y of size t that occurs in s of the sets S_i. Say we find a frequent itemset Y = {a, b, c} of support s [Kumar et al. '99]. Then there are s nodes that all link to all of {a, b, c}:

(Figure: left-side nodes x, y, z each link to right-side nodes a, b, c, so X = {x, y, z} and Y = {a, b, c} form a K_{3,3}.)

Itemset mining finds complete bipartite graphs! How? View each node i as the set S_i of nodes i points to. A K_{s,t} is a set Y of size t that occurs in s sets S_i. So, to look for K_{s,t}, set the frequency threshold to s and look at "layer" t: all frequent itemsets of size t [Kumar et al. '99].
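The reduction can be sketched directly in Python: treat each out-neighborhood as a basket and bucket nodes by their size-t subsets. `find_kst` and the toy graph are illustrative names, not part of the original algorithm:

```python
from itertools import combinations

def find_kst(adj, s, t):
    """Find K_{s,t} subgraphs: treat each out-neighborhood adj[i] as a basket;
    any size-t subset ("bucket") hit by >= s nodes is the right side Y of a
    K_{s,t}, and the nodes hitting it form the left side X."""
    buckets = {}
    for i, S_i in adj.items():
        for Y in combinations(sorted(S_i), t):
            buckets.setdefault(Y, set()).add(i)
    return {Y: X for Y, X in buckets.items() if len(X) >= s}

# Toy directed graph: x, y, z all point to {a, b, c} (z also to d).
adj = {"x": {"a", "b", "c"}, "y": {"a", "b", "c"}, "z": {"a", "b", "c", "d"}}
for Y, X in find_kst(adj, s=3, t=3).items():
    print(sorted(X), "->", list(Y))  # ['x', 'y', 'z'] -> ['a', 'b', 'c']
```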

(Figure: nodes i, j, k on the left each point to nodes among a, b, c, d on the right; S_i = {a, b, c, d}. Left layer X, right layer Y, with s … minimum support (|X| = s) and t … itemset size (|Y| = t).)

### From K_{s,t} to communities

Informally, every dense enough graph G contains a bipartite subgraph K_{s,t}, where s and t depend on the size (number of nodes) and density (average degree) of G.

Theorem [Kővári-Sós-Turán '54]: Let G = (X, Y, E) with |X| = |Y| = n and average degree d ≥ s^{1/t} · n^{1−1/t} + t. Then G contains K_{s,t} as a subgraph.

For the proof we will need the following facts:

- Let f(x) = x(x−1)(x−2)⋯(x−k). Once x ≥ k, f(x) curves upward (it is convex), so the average of f over a set of values is at least f evaluated at the average.
- Recall the binomial coefficient: C(a, b) = a(a−1)⋯(a−b+1) / b!

Proof idea: consider a node i of degree k_i with neighbor set S_i. Put node i into a bucket for every size-t subset of i's neighbors; e.g., for S_i = {a, b, c, d} and t = 2, node i goes into buckets (a,b), (a,c), (a,d), (b,c), …. The buckets are the potential right-hand sides of a K_{s,t} (i.e., all size-t subsets of S_i). As soon as s nodes appear in the same bucket, we have found a K_{s,t}.

How many buckets does node i contribute to? C(k_i, t), the number of ways to select t elements out of k_i (where k_i is the degree of node i).

What is the total size of all buckets? ∑_i C(k_i, t) ≥ n · C(d, t) by convexity, where d is the average degree. So the total height of all buckets is at least n · (d − t)^t / t!.

How many buckets are there? At most C(n, t) ≤ n^t / t!.

What is the average height of a bucket?

avg. height ≥ [n · (d − t)^t / t!] · [t! / n^t] = (d − t)^t / n^(t−1) ≥ s,

where the last inequality uses the assumption d ≥ s^{1/t} · n^{1−1/t} + t.

So the average bucket height is at least s. ⇒ By the pigeonhole principle, there must be at least one bucket with s or more nodes in it. ⇒ We have found a K_{s,t}.

Summary [Kumar et al. '99]:

Analytical result: complete bipartite subgraphs K_{s,t} are embedded in larger, dense enough graphs (i.e., in the communities); bipartite subgraphs act as "signatures" of communities.

Algorithmic result: frequent itemset extraction and dynamic programming find the graphs K_{s,t}, and the method is super scalable.

### Graph partitioning

We now work with undirected graphs G(V, E), possibly with (non-negative) edge weights.

Bi-partitioning task: divide the vertices into two disjoint groups A and B.

Questions:

- How can we define a "good" partition of G?
- How can we efficiently identify such a partition?

(Figure: a 6-node example graph with nodes 1, 2, 3 in group A and nodes 4, 5, 6 in group B.)

What makes a good partition?

- Maximize the number of within-group connections.
- Minimize the number of between-group connections.

Express the partitioning objective as a function of the "edge cut" of the partition.

Cut: the set of edges with exactly one endpoint in each group. In the 6-node example with A = {1, 2, 3} and B = {4, 5, 6}, cut(A, B) = 2.
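The cut score can be computed by counting edges with exactly one endpoint in A. This is a minimal sketch with unit edge weights; `cut` is an illustrative helper, and the edge list is read off the slides' 6-node example graph:

```python
def cut(edges, A):
    """Cut score of the partition (A, complement of A): number of edges
    with exactly one endpoint in A (unit edge weights)."""
    return sum(1 for u, v in edges if (u in A) != (v in A))

# The 6-node example graph from the slides.
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(cut(edges, A={1, 2, 3}))  # 2: edges (1,5) and (3,4) cross the cut
```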

Criterion: minimum cut. Minimize the weight of the connections between the groups: min_{A,B} cut(A, B).

Degenerate case: the minimum cut can differ badly from the intuitively "optimal" cut, e.g. by slicing off a single peripheral node. Problem: the criterion only considers external cluster connections; it does not consider internal cluster connectivity.

Criterion: normalized cut [Shi-Malik '97]. Measure the connectivity between groups relative to the density of each group:

ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B)

where vol(A) is the total weight of the edges with at least one endpoint in A: vol(A) = ∑_{i∈A} k_i.
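A sketch of the normalized-cut score on the same 6-node example (unit weights; `ncut` is an illustrative helper, taking vol(S) to be the sum of the degrees of the nodes in S, as defined above):

```python
def ncut(edges, A, B):
    """Normalized cut with unit weights:
    ncut(A,B) = cut/vol(A) + cut/vol(B), vol(S) = sum of degrees in S."""
    c = sum(1 for u, v in edges if (u in A) != (v in A))
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    vol = lambda S: sum(deg[i] for i in S)
    return c / vol(A) + c / vol(B)

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(ncut(edges, {1, 2, 3}, {4, 5, 6}))  # 0.5  (= 2/8 + 2/8)
```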

Let A be the adjacency matrix of the undirected graph G: A_ij = 1 if (i, j) is an edge, else 0. Let x be a vector in ℝ^n with components (x_1, …, x_n), thought of as just a label/value on each node of G.

What is the meaning of A·x? The j-th coordinate of y = A·x is the sum of the x-values x_i over the neighbors i of j. We can make this the new value at node j.

### Spectral graph theory

Analyze the "spectrum" of a matrix representing G. Spectrum: the eigenvectors of the graph, ordered by the magnitude (strength) of their corresponding eigenvalues.

Suppose all nodes in G have degree d and G is connected. What are some eigenvalues/eigenvectors of G, i.e., solutions of A·x = λx? Consider x = (1, 1, …, 1): then A·x = (d, d, …, d) = d·x, so λ = d with the all-ones eigenvector.

What if G is not connected? Say G has 2 components A and B, each d-regular. What are some eigenvectors? x′ = put all 1s on A and 0s on B, or vice versa; each such vector is again an eigenvector with eigenvalue d.
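A quick numerical check of this fact with NumPy, on a 2-regular connected graph (a 4-cycle):

```python
import numpy as np

# A 4-cycle: every node has degree d = 2 and the graph is connected.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
x = np.ones(4)
print(A @ x)   # [2. 2. 2. 2.], i.e. A x = d x: (d, all-ones) is an eigenpair
```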

Adjacency matrix (A): an n×n matrix, A = [a_ij], with a_ij = 1 if there is an edge between nodes i and j. For the 6-node example:

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 1 |
| 5 | 1 | 0 | 0 | 1 | 0 | 1 |
| 6 | 0 | 0 | 0 | 1 | 1 | 0 |

Important properties: A is a symmetric matrix, so its eigenvectors are real and orthogonal.

Degree matrix (D): an n×n diagonal matrix, D = [d_ii], with d_ii = degree of node i.

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 3 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 3 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 3 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 2 |

Laplacian matrix (L): an n×n symmetric matrix, L = D − A. For the 6-node example:

|   | 1  | 2  | 3  | 4  | 5  | 6  |
|---|----|----|----|----|----|----|
| 1 | 3  | -1 | -1 | 0  | -1 | 0  |
| 2 | -1 | 2  | -1 | 0  | 0  | 0  |
| 3 | -1 | -1 | 3  | -1 | 0  | 0  |
| 4 | 0  | 0  | -1 | 3  | -1 | -1 |
| 5 | -1 | 0  | 0  | -1 | 3  | -1 |
| 6 | 0  | 0  | 0  | -1 | -1 | 2  |

What is the trivial eigenvector/eigenvalue? x = (1, …, 1) with λ = 0, since every row of L sums to 0.

For a symmetric matrix M, the second-smallest eigenvalue is

λ_2 = min_{x: x^T w_1 = 0, ‖x‖ = 1} x^T M x,

where w_1 is the eigenvector of the smallest eigenvalue.

What is the meaning of min x^T L x on G? For the Laplacian, x^T L x = ∑_{(i,j)∈E} (x_i − x_j)^2.

What else do we know about x? x is a unit vector, ∑_i x_i^2 = 1, and x is orthogonal to the trivial all-ones eigenvector, so ∑_i x_i = 0. Constraints: ∑_i x_i^2 = 1 and ∑_i x_i = 0.
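These properties can be checked numerically on the 6-node example; this NumPy sketch verifies the trivial eigenpair and the identity x^T L x = Σ (x_i − x_j)^2 for an arbitrary x (variable names are illustrative):

```python
import numpy as np

# Laplacian of the 6-node example graph: L = D - A.
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Trivial eigenpair: L @ (all ones) = 0, since every row of L sums to 0.
print(L @ np.ones(6))                  # [0. 0. 0. 0. 0. 0.]

# x^T L x equals the sum of (x_i - x_j)^2 over the edges, for any x.
x = np.arange(1.0, 7.0)
lhs = x @ L @ x
rhs = sum((x[u - 1] - x[v - 1]) ** 2 for u, v in edges)
print(lhs == rhs)                      # True
```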

Say we want to minimize the cut score (the number of edges crossing the partition). We can express a partition (A, B) as a vector of ±1 node labels; since x^T L x = ∑_{(i,j)∈E} (x_i − x_j)^2, each crossing edge then contributes 4, so the cut score is x^T L x / 4. We can therefore minimize the cut score of the partition by finding a non-trivial vector x minimizing x^T L x.

How to define a "good" partition of a graph? Minimize a given graph cut criterion. How to efficiently identify such a partition? Approximate it using the information provided by the eigenvalues and eigenvectors of the graph. This is spectral clustering.

### Spectral clustering algorithm

Three basic stages:

1. Pre-processing: construct a matrix representation of the graph.
2. Decomposition: compute the eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors.
3. Grouping: assign points to two or more clusters, based on the new representation.

Example on the 6-node graph:

Pre-processing: build the Laplacian matrix L of the graph (the matrix shown above).

Decomposition: find the eigenvalues λ and eigenvectors x of the matrix L:

λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)

Map the vertices to the corresponding components of the second eigenvector x_2:

| node | component of x_2 |
|------|------|
| 1 | 0.3 |
| 2 | 0.6 |
| 3 | 0.3 |
| 4 | −0.3 |
| 5 | −0.3 |
| 6 | −0.6 |

How do we now find clusters?

Grouping: sort the components of the reduced 1-dimensional vector, and identify clusters by splitting the sorted vector in two.

How to choose a splitting point?

- Naïve approaches: split at 0 (or at the mean or median value).
- More expensive approaches: attempt to minimize the normalized cut criterion in 1 dimension.

Splitting at 0 in the example gives cluster A, the positive points (nodes 1, 2, 3 with components 0.3, 0.6, 0.3), and cluster B, the negative points (nodes 4, 5, 6 with components −0.3, −0.3, −0.6).
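Putting the three stages together for the 6-node example: a minimal NumPy sketch that builds L, takes the eigenvector of λ_2, and splits at 0 (variable names are illustrative):

```python
import numpy as np

# Spectral bi-partitioning of the 6-node example graph.
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1.0
L = np.diag(A.sum(axis=1)) - A

w, V = np.linalg.eigh(L)       # eigenvalues in ascending order
fiedler = V[:, 1]              # eigenvector of lambda_2
if fiedler[0] < 0:             # eigenvectors are only defined up to sign
    fiedler = -fiedler
cluster_A = sorted(i + 1 for i in range(6) if fiedler[i] > 0)
cluster_B = sorted(i + 1 for i in range(6) if fiedler[i] <= 0)
print(cluster_A, cluster_B)    # [1, 2, 3] [4, 5, 6]
```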

### Partitioning into k clusters

How do we partition a graph into k clusters? Two basic approaches:

- Recursive bi-partitioning [Hagen et al. '92]: recursively apply the bi-partitioning algorithm in a hierarchical divisive manner. Disadvantages: inefficient, unstable.
- Cluster multiple eigenvectors [Shi-Malik '00]: build a reduced space from multiple eigenvectors; node i is described by its eigenvector components (x_{2,i}, x_{3,i}, …, x_{k,i}); use k-means to cluster the points. A preferable approach.
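A minimal sketch of the multiple-eigenvector approach, with a tiny hand-rolled k-means (deterministic farthest-point initialization) standing in for a library call; `spectral_clusters` and the 6-node example are illustrative:

```python
import numpy as np

def spectral_clusters(L, k, iters=50):
    """Shi-Malik-style sketch: embed node i as (x_2[i], ..., x_k[i]) using
    the eigenvectors of L, then run a tiny k-means on the embedded points."""
    w, V = np.linalg.eigh(L)          # eigenvalues ascending
    P = V[:, 1:k]                     # skip the trivial all-ones eigenvector
    # Deterministic farthest-point initialization of the k centers.
    centers = [P[0]]
    for _ in range(k - 1):
        d = np.min([((P - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(P[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):            # standard Lloyd iterations
        labels = np.argmin(((P[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = P[labels == c].mean(axis=0)
    return labels

# The 6-node example graph.
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1.0
L = np.diag(A.sum(axis=1)) - A
labels = spectral_clusters(L, k=2)
print(labels)  # nodes {1,2,3} and {4,5,6} land in different clusters
```

For k = 2 this reduces to the split-the-Fiedler-vector procedure above; the embedding plus k-means only pays off for larger k.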

### Selecting k: the eigengap

Eigengap: the difference between two consecutive eigenvalues, Δ_k = |λ_k − λ_{k−1}|. The most stable clustering is generally given by the value of k that maximizes the eigengap.

Example: (Figure: eigenvalues plotted in decreasing order for k = 1…20; the largest gap is between λ_1 and λ_2, i.e., max_k Δ_k = |λ_2 − λ_1| ⇒ choose k = 2.)
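The eigengap rule can be sketched as a small helper over a sorted spectrum (`choose_k` is an illustrative name, and the example spectrum is made up to mirror the slide's plot):

```python
import numpy as np

def choose_k(eigenvalues):
    """Return the k (>= 2) maximizing the eigengap
    Delta_k = |lambda_k - lambda_{k-1}| over a sorted spectrum."""
    w = np.asarray(eigenvalues, dtype=float)
    gaps = np.abs(np.diff(w))        # gaps[i] = Delta_{i+2} in 1-indexed terms
    return int(np.argmax(gaps)) + 2

# A made-up spectrum shaped like the slide's plot: big gap after lambda_1.
print(choose_k([41.0, 12.0, 10.0, 9.5, 9.0]))  # 2
```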

### Computing eigenvectors: Rayleigh quotient iteration

Standard Rayleigh quotient iteration: start with a random vector x and make a guess for λ (e.g., λ = 0.2). Then iterate: solve (A − λI)·x′ = x, normalize x ← x′ / ‖x′‖, and update λ ← x^T A x.
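A NumPy sketch of the iteration (`rayleigh_quotient_iteration` is an illustrative name; which eigenpair it converges to depends on the starting vector and the initial guess for λ):

```python
import numpy as np

def rayleigh_quotient_iteration(A, lam=0.2, iters=10, seed=0):
    """Sketch of standard RQI on a symmetric matrix A: repeatedly solve
    (A - lam*I) y = x, normalize, and update lam with the Rayleigh
    quotient x^T A x."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        try:
            y = np.linalg.solve(A - lam * np.eye(n), x)
        except np.linalg.LinAlgError:   # lam hit an eigenvalue exactly
            break
        x = y / np.linalg.norm(y)
        lam = x @ A @ x
    return lam, x

# Run it on the Laplacian of the 6-node example graph.
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
M = np.zeros((6, 6))
for u, v in edges:
    M[u - 1, v - 1] = M[v - 1, u - 1] = 1.0
L = np.diag(M.sum(axis=1)) - M
lam, x = rayleigh_quotient_iteration(L, lam=0.2)
print(round(lam, 6))  # an eigenvalue of L near the initial guess
```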

### Software for graph partitioning

- METIS: heuristic, but works really well in practice. http://glaros.dtc.umn.edu/gkhome/views/metis
- Graclus: based on kernel k-means. http://www.cs.utexas.edu/users/dml/Software/graclus.html
- Cluto: http://glaros.dtc.umn.edu/gkhome/views/cluto/