19.08.2013 Views

Discovering Clusters in Networks - SNAP - Stanford University


CS246: Mining Massive Datasets

Jure Leskovec, Stanford University

http://cs246.stanford.edu


Networks of tightly connected groups

Network communities:

Sets of nodes with lots of connections inside and few to outside (the rest of the network)

2/14/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Communities, clusters, groups, modules


How to automatically find such densely connected groups of nodes?

Ideally such clusters then correspond to real groups

For example:



Zachary’s Karate club network:

Observe social ties and rivalries in a university karate club

During his observation, conflicts led the group to split

Split could be explained by a minimum cut in the network



Find micro-markets by partitioning the “query × advertiser” graph:

[Figure: query × advertiser graph; axes: query, advertiser]


Searching for small communities in the Web graph

What is the signature of a community / discussion in a Web graph?

Dense 2-layer graph

Intuition: Many people all talking about the same things

[Kumar et al. ‘99]

Use this to define “topics”:

What the same people on the left talk about on the right

Remember HITS!


A more well-defined problem:

Enumerate complete bipartite subgraphs $K_{s,t}$

Where $K_{s,t}$: s nodes on the “left” where each links to the same t other nodes on the “right”

[Figure: $K_{3,4}$, fully connected, with |X| = s = 3 nodes on the left and |Y| = t = 4 nodes on the right]


[Kumar et al. ‘99]

Two points:

(1) Dense bipartite graph: the signature of a community/discussion

(2) Complete bipartite subgraph $K_{s,t}$

$K_{s,t}$ = graph on s nodes, each links to the same t other nodes

Plan:

(A) From (2) get back to (1):

Via: Any dense enough graph contains a smaller $K_{s,t}$ as a subgraph

(B) How do we solve (2) in a giant graph?

What similar problems were solved on big non-graph data?

(3) Frequent itemset enumeration [Agrawal-Srikant ‘99]


Market-basket analysis. Setting:

Market: Universe U of n items

Baskets: m subsets of U: $S_1, S_2, \ldots, S_m \subseteq U$

($S_i$ is a set of items one person bought)

Support: Frequency threshold f

Goal: [Agrawal-Srikant ‘99]

Find all subsets T s.t. $T \subseteq S_i$ for at least f of the sets $S_i$

(items in T were bought together at least f times)
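As a minimal sketch of the goal stated above, the brute-force Python below counts every non-empty itemset T and keeps those contained in at least f baskets. The function name `frequent_itemsets` and the toy baskets are illustrative assumptions, not part of the lecture; a practical system would prune candidates (A-Priori style) instead of enumerating all subsets.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, f):
    """Return {itemset: support} for every non-empty T contained in >= f baskets."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        # Count each non-empty subset of this basket exactly once.
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {t: c for t, c in counts.items() if c >= f}

# Hypothetical toy data: m = 4 baskets over the universe U = {a, b, c, d}.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}, {"a", "b", "c"}]
result = frequent_itemsets(baskets, f=3)
```

With f = 3 the pairs {a, b} and {a, c} survive, while {b, c} (support 2) does not.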

What’s the connection between the itemsets and complete bipartite graphs?


View each node i as a set $S_i$ of nodes i points to

[Figure: node i pointing to nodes a, b, c, d, so $S_i$ = {a, b, c, d}]

Find frequent itemsets:

s … minimum support

t … itemset size

Say we find a frequent itemset Y = {a, b, c} of supp. s

So, there are s nodes that link to all of {a, b, c}:

[Figure: nodes x, y, z (the set X) each linking to a, b, c (the set Y)]

We found $K_{s,t}$!

$K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$

[Kumar et al. ‘99]

Itemsets find complete bipartite graphs!

How?

View each node i as a set $S_i$ of nodes i points to

$K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$

To look for $K_{s,t}$: set the frequency threshold to s and look at layer t (all frequent sets of size t)

[Kumar et al. ‘99]
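The reduction above can be sketched in a few lines of Python: each node's out-link set is treated as a basket, each size-t subset of it is a bucket, and any bucket holding at least s nodes yields a complete bipartite subgraph. The function name `find_kst` and the toy graph are assumptions made for illustration; this is a direct enumeration, not the scalable method of Kumar et al.

```python
from collections import defaultdict
from itertools import combinations

def find_kst(out_links, s, t):
    """Map each size-t node set Y to the nodes X linking to all of Y, if |X| >= s.

    out_links maps node i to the set S_i of nodes that i points to.
    """
    buckets = defaultdict(list)
    for i, s_i in out_links.items():
        # Node i goes into one bucket per size-t subset of its neighbors.
        for y in combinations(sorted(s_i), t):
            buckets[y].append(i)
    return {y: xs for y, xs in buckets.items() if len(xs) >= s}

# Hypothetical toy graph: x, y and z all link to a, b and c.
out_links = {
    "x": {"a", "b", "c", "d"},
    "y": {"a", "b", "c"},
    "z": {"a", "b", "c"},
    "w": {"a", "d"},
}
found = find_kst(out_links, s=3, t=3)
```

Here the only bucket with three or more nodes is (a, b, c), so X = {x, y, z} and Y = {a, b, c} form a $K_{3,3}$.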

[Figure: nodes i, j, k (the set X) pointing to nodes a, b, c, d (the set Y); $S_i$ = {a, b, c, d}]

s … minimum support (|X| = s)

t … itemset size (|Y| = t)


From $K_{s,t}$ to Communities: Informally, every dense enough graph G contains a bipartite subgraph $K_{s,t}$ where s and t depend on size (# of nodes) and density (avg. degree) of G

[Kővári-Sós-Turán ‘53]

Theorem:

Let G = (X, Y, E), |X| = |Y| = n, with avg. degree $\bar{k} = s^{1/t} n^{1 - 1/t} + t$

Then G contains $K_{s,t}$ as a subgraph

For the proof we will need the following fact

Recall:

Let f(x) = x(x-1)(x-2)…(x-k)

Once x ≥ k, f(x) curves upward (convex)

Suppose a setting:

g(y) is convex

$\binom{a}{b} = \frac{a(a-1)\cdots(a-b+1)}{b!}$


Consider node i of degree $k_i$ and neighbor set $S_i$

Put node i in buckets for all size t subsets of i’s neighbors

[Figure: node i with neighbors a, b, c, d, …; node i is placed in the buckets (a,b), (a,c), (a,d), (b,c), …]

Potential right-hand sides of $K_{s,t}$ (i.e., all size t subsets of $S_i$)

As soon as s nodes appear in a bucket we have a $K_{s,t}$


Note: As soon as s nodes appear in a bucket we found a $K_{s,t}$

How many buckets does node i contribute to?

$\binom{k_i}{t}$ = # of ways to select t elements out of $k_i$ ($k_i$ … degree of node i)




What is the total size of all buckets?


So, the total height of

all buckets is…
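A sketch of how this total can be bounded, assuming the average degree in the theorem is $\bar{k} = s^{1/t} n^{1-1/t} + t$ (the standard Kővári-Sós-Turán form) and that the convexity fact recalled earlier lets the sum be lower-bounded by placing every node at the average degree:

```latex
\sum_{i=1}^{n} \binom{k_i}{t}
  \;\ge\; n \binom{\bar{k}}{t}
  \;\ge\; n \,\frac{(\bar{k} - t)^{t}}{t!}
  \;=\; n \,\frac{\bigl(s^{1/t}\, n^{1 - 1/t}\bigr)^{t}}{t!}
  \;=\; \frac{s\, n^{t}}{t!}
```

The second step uses $\bar{k}(\bar{k}-1)\cdots(\bar{k}-t+1) \ge (\bar{k}-t)^t$, and the last step uses $(s^{1/t} n^{1-1/t})^t = s\,n^{t-1}$.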


We have: Total height of all buckets: $\ge \frac{s\, n^t}{t!}$

How many buckets are there? $\binom{n}{t} \le \frac{n^t}{t!}$

What is the average height of buckets?

$\ge \frac{s\, n^t / t!}{n^t / t!} = s$

So, avg. bucket height ≥ s

⇒ By pigeonhole principle, there must be at least one bucket with at least s nodes in it

⇒ We found a $K_{s,t}$


Analytical result:

Complete bipartite subgraphs $K_{s,t}$ are embedded in larger dense enough graphs (i.e., the communities)

Bipartite subgraphs act as “signatures” of communities

Algorithmic result:

Frequent itemset extraction and dynamic programming finds graphs $K_{s,t}$

Method is super scalable

[Kumar et al. ‘99]


Undirected graphs (but edges can have non-negative weights)


Undirected graph G(V,E):

[Figure: example graph on nodes 1 to 6]

Bi-partitioning task:

Divide vertices into two disjoint groups A, B

[Figure: the same graph split into group A and group B]

Questions:

How can we define a “good” partition of G?

How can we efficiently identify such a partition?


What makes a good partition?

Maximize the number of within-group connections

Minimize the number of between-group connections

[Figure: example graph on nodes 1 to 6 split into groups A and B]


Express partitioning objectives as a function of the “edge cut” of the partition

Cut: Set of edges with only one vertex in a group:

$\mathrm{cut}(A,B) = \sum_{i \in A,\, j \in B} w_{ij}$ ($w_{ij}$ … weight of edge (i, j); 1 for unweighted graphs)

[Figure: example graph on nodes 1 to 6 split into groups A and B]

cut(A,B) = 2


Criterion: Minimum-cut

Minimise weight of connections between groups

$\min_{A,B} \mathrm{cut}(A,B)$

Degenerate case:

[Figure: graph in which the minimum cut differs from the “optimal cut”]

Problem:

Only considers external cluster connections

Does not consider internal cluster connectivity


Criterion: Normalized-cut [Shi-Malik, ’97]

Connectivity between groups relative to the density of each group

vol(A): total weight of the edges with at least one endpoint in A: $\mathrm{vol}(A) = \sum_{i \in A} k_i$ ($k_i$ … degree of node i)
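To make the criterion concrete, here is a small sketch that evaluates cut, vol and the normalized cut on the six-node example graph from these slides. It assumes the usual Shi-Malik form ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B) and takes vol as the sum of node degrees; the helper names are illustrative.

```python
# Example graph from the slides (nodes 1..6), as an undirected edge list.
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def cut(edges, A, B):
    """Number of edges with one endpoint in A and the other in B."""
    return sum(1 for u, v in edges if (u in A and v in B) or (u in B and v in A))

def vol(edges, A):
    """Sum of the degrees of the nodes in A (unweighted graph)."""
    return sum((u in A) + (v in A) for u, v in edges)

def ncut(edges, A, B):
    """Normalized cut, assuming the usual Shi-Malik definition."""
    c = cut(edges, A, B)
    return c / vol(edges, A) + c / vol(edges, B)

A, B = {1, 2, 3}, {4, 5, 6}
balanced = ncut(edges, A, B)                  # cut = 2, vol(A) = vol(B) = 8
lopsided = ncut(edges, {6}, {1, 2, 3, 4, 5})  # cut = 2, vol = 2 and 14
```

Both partitions cut two edges, so minimum cut cannot tell them apart, but the normalized cut penalizes the lopsided split that isolates node 6.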


A: adjacency matrix of undirected G

$A_{ij}$ = 1 if (i, j) is an edge, else 0

x is a vector in $\mathbb{R}^n$ with components $(x_1, \ldots, x_n)$

just a label/value of each node of G

What is the meaning of A·x?

Entry $y_j$ is a sum of labels $x_i$ of neighbors of j
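A quick numerical check of this reading of A·x, using the adjacency matrix of the six-node example graph from these slides. The label values in `x` are arbitrary numbers chosen for illustration.

```python
import numpy as np

# Adjacency matrix of the example graph (row/column order: nodes 1..6).
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

x = np.array([10, 20, 30, 40, 50, 60])  # hypothetical node labels
y = A @ x  # y[j] is the sum of the labels of the neighbors of node j+1
```

For instance, node 1 has neighbors 2, 3 and 5, so its entry is 20 + 30 + 50 = 100.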



j-th coordinate of Ax:

Sum of the x-values of neighbors of j

Make this a new value at node j

Spectral Graph Theory:

Analyze the “spectrum” of matrix representing G

Spectrum: Eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues


Suppose all nodes in G have degree d and G is connected

What are some eigenvalues/vectors of G?

A·x = λ x What is λ? What x?

Consider: x = (1, 1, …, 1)

Then A·x = (d, d, …, d) = d·x, so λ = d


What if G is not connected?

Say G has 2 components, each d-regular

What are some eigenvectors?

x’= Put all 1s on A, 0s on B or vice versa
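A small check of the disconnected case, using two disjoint triangles as a hypothetical graph with two 2-regular components. The indicator vector of each component is an eigenvector with eigenvalue d = 2.

```python
import numpy as np

# Two disjoint triangles: components A = nodes 0..2 and B = nodes 3..5.
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
zeros = np.zeros((3, 3), dtype=int)
adj = np.block([[triangle, zeros], [zeros, triangle]])

d = 2
x_a = np.array([1, 1, 1, 0, 0, 0])  # all 1s on A, 0s on B
x_b = np.array([0, 0, 0, 1, 1, 1])  # all 1s on B, 0s on A

# Each indicator vector is mapped to d times itself.
is_eig_a = np.array_equal(adj @ x_a, d * x_a)
is_eig_b = np.array_equal(adj @ x_b, d * x_b)
```

Because both vectors share the eigenvalue d, any combination of them (such as +1 on A and -1 on B) is also an eigenvector, which is what makes the sign pattern informative about the components.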


Adjacency matrix (A):

n × n matrix

$A = [a_{ij}]$, $a_{ij}$ = 1 if edge between node i and j

      1  2  3  4  5  6
  1   0  1  1  0  1  0
  2   1  0  1  0  0  0
  3   1  1  0  1  0  0
  4   0  0  1  0  1  1
  5   1  0  0  1  0  1
  6   0  0  0  1  1  0

Important properties:

Symmetric matrix

Eigenvectors are real and orthogonal


Degree matrix (D):

n × n diagonal matrix

$D = [d_{ii}]$, $d_{ii}$ = degree of node i

      1  2  3  4  5  6
  1   3  0  0  0  0  0
  2   0  2  0  0  0  0
  3   0  0  3  0  0  0
  4   0  0  0  3  0  0
  5   0  0  0  0  3  0
  6   0  0  0  0  0  2


Laplacian matrix (L):

n × n symmetric matrix

L = D - A

      1   2   3   4   5   6
  1   3  -1  -1   0  -1   0
  2  -1   2  -1   0   0   0
  3  -1  -1   3  -1   0   0
  4   0   0  -1   3  -1  -1
  5  -1   0   0  -1   3  -1
  6   0   0   0  -1  -1   2

What is trivial eigenvector, eigenvalue?
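The three matrices for the six-node example graph can be built in a few lines of NumPy. This sketch also checks the trivial eigenpair: every row of L sums to zero, so the all-ones vector is an eigenvector with eigenvalue 0.

```python
import numpy as np

# Adjacency matrix of the example graph (nodes 1..6).
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

D = np.diag(A.sum(axis=1))  # degree matrix: row sums of A on the diagonal
L = D - A                   # graph Laplacian

ones = np.ones(6)
trivial = L @ ones          # all zeros, i.e. L * 1 = 0 * 1
```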


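For symmetric matrix M:

$\lambda_2 = \min_{x \perp w_1} \frac{x^T M x}{x^T x}$, where $w_1$ is the eigenvector of the smallest eigenvalue $\lambda_1$

What is the meaning of $\min x^T L x$ on G?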
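One way to read $x^T L x$ is through the standard identity $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2$ for an unweighted graph: the quadratic form adds up squared label differences across edges. The sketch below checks this on the six-node example graph with a ±1 labeling of the partition {1, 2, 3} versus {4, 5, 6}; the labeling is chosen for illustration.

```python
import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

# Build the Laplacian edge by edge: each edge adds 1 to two diagonal
# entries and -1 to the two matching off-diagonal entries.
L = np.zeros((n, n))
for u, v in edges:
    i, j = u - 1, v - 1
    L[i, i] += 1
    L[j, j] += 1
    L[i, j] -= 1
    L[j, i] -= 1

x = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # +1 on A, -1 on B
quad = x @ L @ x
by_edges = sum((x[u - 1] - x[v - 1]) ** 2 for u, v in edges)
```

Only the two crossing edges contribute, each with (1 - (-1))² = 4, so both expressions equal 8, which is four times the cut size.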


What else do we know about x?

x is unit vector: $\sum_i x_i^2 = 1$

x is orthogonal to the trivial eigenvector (1, …, 1): $\sum_i x_i = 0$


Say, we want to minimize the cut score

(#edges crossing)

We can express partition A, B as a vector

We can minimize the cut score of the

partition by finding a non-trivial vector




How to define a “good” partition of a graph?

Minimize a given graph cut criterion

How to efficiently identify such a partition?

Approximate using information provided by the

eigenvalues and eigenvectors of a graph

Spectral Clustering



Three basic stages:

1. Pre-processing

Construct a matrix representation of the graph

2. Decomposition

Compute eigenvalues and eigenvectors of the matrix

Map each point to a lower-dimensional

representation based on one or more eigenvectors

3. Grouping

Assign points to two or more clusters, based on the

new representation



Pre-processing:

Build Laplacian matrix L of the graph

      1   2   3   4   5   6
  1   3  -1  -1   0  -1   0
  2  -1   2  -1   0   0   0
  3  -1  -1   3  -1   0   0
  4   0   0  -1   3  -1  -1
  5  -1   0   0  -1   3  -1
  6   0   0   0  -1  -1   2

Decomposition:

Find eigenvalues λ and eigenvectors x of the matrix L

λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)

X = (one column per eigenvector, in the same order as λ; one row per node)

  1   0.4   0.3  -0.5  -0.2  -0.4  -0.5
  2   0.4   0.6   0.4  -0.4   0.4   0.0
  3   0.4   0.3   0.1   0.6  -0.4   0.5
  4   0.4  -0.3   0.1   0.6   0.4  -0.5
  5   0.4  -0.3  -0.5  -0.2   0.4   0.5
  6   0.4  -0.6   0.4  -0.4  -0.4   0.0

Map vertices to corresponding components of $x_2$, the eigenvector of $\lambda_2$:

  1: 0.3   2: 0.6   3: 0.3   4: -0.3   5: -0.3   6: -0.6

How do we now find clusters?


Grouping:

Sort components of reduced 1-dimensional vector

Identify clusters by splitting the sorted vector in two

How to choose a splitting point?

Naïve approaches:

Split at 0, (or mean or median value)

More expensive approaches:

Attempt to minimize normalized cut criterion in 1-dim

Split at 0:

Cluster A: Positive points (nodes 1, 2, 3 with values 0.3, 0.6, 0.3)

Cluster B: Negative points (nodes 4, 5, 6 with values -0.3, -0.3, -0.6)
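The three stages (pre-processing, decomposition, grouping) can be sketched end to end with NumPy on the six-node example graph. This is a minimal illustration that uses a dense eigensolver and the naive split at 0; the sign of an eigenvector is arbitrary, so which cluster comes out "positive" may differ from the slide.

```python
import numpy as np

# 1. Pre-processing: Laplacian of the example graph (nodes 1..6).
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
L = np.diag(A.sum(axis=1)) - A

# 2. Decomposition: eigh returns eigenvalues of a symmetric matrix in
#    ascending order, with the matching eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(L)
x2 = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue

# 3. Grouping: naive split of the 1-dimensional embedding at 0.
cluster_pos = {i + 1 for i in range(6) if x2[i] > 0}
cluster_neg = {i + 1 for i in range(6) if x2[i] <= 0}
```

For large sparse graphs a dense `eigh` call would be replaced by an iterative solver that computes only the few smallest eigenpairs.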




How do we partition a graph into k clusters?

Two basic approaches:

Recursive bi-partitioning [Hagen et al., ’92]

Recursively apply bi-partitioning algorithm in a

hierarchical divisive manner

Disadvantages: Inefficient, unstable

Cluster multiple eigenvectors [Shi-Malik, ’00]

Build a reduced space from multiple eigenvectors

Node i is described by its k eigenvector components $(x_{2,i}, x_{3,i}, \ldots, x_{k,i})$

Use k-means to cluster the points

A preferable approach…



Eigengap:

The difference between two consecutive

eigenvalues

Most stable clustering is generally given by

the value k that maximizes the eigengap:

Example:

[Figure: plot of eigenvalue (y-axis, 0 to 50) against k (x-axis, 1 to 20), with $\lambda_1$ and $\lambda_2$ marked]

$\max_k \Delta_k = |\lambda_2 - \lambda_1|$ ⇒ Choose k = 2
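A minimal sketch of the eigengap heuristic, assuming the gap is defined as $\Delta_k = |\lambda_k - \lambda_{k-1}|$ over the eigenvalues in the order they are listed, which is the reading suggested by the example. The function name `choose_k` and the spectrum values are hypothetical.

```python
def choose_k(eigenvalues):
    """Return (k, gap) maximizing |lambda_k - lambda_{k-1}| for k >= 2 (1-indexed)."""
    gaps = {
        k: abs(eigenvalues[k - 1] - eigenvalues[k - 2])
        for k in range(2, len(eigenvalues) + 1)
    }
    best_k = max(gaps, key=gaps.get)
    return best_k, gaps[best_k]

# Hypothetical spectrum with one dominant eigenvalue, loosely shaped like the plot.
spectrum = [48.0, 12.0, 10.5, 9.0, 8.0, 7.5]
k, gap = choose_k(spectrum)
```

With these values the largest gap lies between the first and second eigenvalue, so the heuristic picks k = 2.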


Standard Rayleigh quotient iteration:

Start with random vector x, make a guess for λ

Then iterate:




Start with random x, make a guess for λ=0.2
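A minimal sketch of standard Rayleigh quotient iteration for a symmetric matrix, assuming the usual textbook update: solve (M - λI)y = x, normalize y to get the new x, then set λ to the Rayleigh quotient $x^T M x$. The function name, stopping rule and random seed are illustrative choices; the matrix is the Laplacian of the six-node example graph and the initial guess is λ = 0.2 as on the slide.

```python
import numpy as np

def rayleigh_quotient_iteration(M, x0, lam0, iters=20, tol=1e-10):
    """Refine an eigenpair (lam, x) of the symmetric matrix M."""
    n = M.shape[0]
    x = x0 / np.linalg.norm(x0)
    lam = lam0
    for _ in range(iters):
        try:
            y = np.linalg.solve(M - lam * np.eye(n), x)
        except np.linalg.LinAlgError:
            break  # shift is numerically an exact eigenvalue; keep current pair
        x = y / np.linalg.norm(y)
        lam = x @ M @ x  # Rayleigh quotient of the unit vector x
        if np.linalg.norm(M @ x - lam * x) < tol:
            break
    return lam, x

A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)
lam, x = rayleigh_quotient_iteration(L, rng.standard_normal(6), lam0=0.2)
residual = np.linalg.norm(L @ x - lam * x)
```

The method converges to some eigenpair near the initial guess, not necessarily the second-smallest one, so in practice the start vector is usually kept orthogonal to the trivial eigenvector when $\lambda_2$ is the target.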


METIS:

Heuristic but works really well in practice

http://glaros.dtc.umn.edu/gkhome/views/metis

Graclus:

Based on kernel k-means

http://www.cs.utexas.edu/users/dml/Software/graclus.html

Cluto:

http://glaros.dtc.umn.edu/gkhome/views/cluto/
