Discovering Clusters in Networks - SNAP - Stanford University
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Networks of tightly
connected groups
Network communities:
Sets of nodes with lots of
connections inside and
few to outside (the rest
of the network)
2/14/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
Communities, clusters,
groups, modules
How to automatically
find such densely
connected groups of
nodes?
Ideally such clusters
then correspond to
real groups
For example:
Zachary’s Karate club network:
Observe social ties and rivalries in a university karate club
During his observation, conflicts led the group to split
Split could be explained by a minimum cut in the network
Find micro-markets by partitioning the
“query x advertiser” graph:
query
advertiser
Searching for small communities in
the Web graph
What is the signature of a community /
discussion in a Web graph?
…
…
Dense 2-layer graph
Intuition: Many people all talking about the same things
[Kumar et al. ‘99]
Use this to define “topics”:
What the same people on
the left talk about on the right
Remember HITS!
A more well-defined problem:
Enumerate complete bipartite subgraphs $K_{s,t}$
Where $K_{s,t}$: s nodes on the “left” where each links
to the same t other nodes on the “right”
Example (figure): $K_{3,4}$, fully connected between X and Y, with |X| = s = 3 and |Y| = t = 4
[Kumar et al. ‘99]
Two points:
(1) Dense bipartite graph: the signature of a
community/discussion
(2) Complete bipartite subgraph $K_{s,t}$
$K_{s,t}$ = graph on s nodes, each links to the same t other nodes
Plan:
(A) From (2) get back to (1):
Via: Any dense enough graph contains a
smaller $K_{s,t}$ as a subgraph
(B) How do we solve (2) in a giant graph?
What similar problems were solved on big non-graph data?
(3) Frequent itemset enumeration [Agrawal-Srikant ‘99]
Market basket analysis. Setting:
Market: Universe U of n items
Baskets: m subsets of U: $S_1, S_2, \dots, S_m \subseteq U$
($S_i$ is a set of items one person bought)
Support: Frequency threshold f
Goal: [Agrawal-Srikant ‘99]
Find all subsets T s.t. $T \subseteq S_i$ for at least f sets $S_i$
(items in T were bought together at least f times)
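The goal above can be illustrated with a deliberately naive counter. This is a minimal sketch, not the Apriori algorithm from the cited paper: it enumerates every subset of one fixed size in every basket, which is only workable for tiny inputs. The function name `frequent_itemsets` and the basket contents are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, f, size):
    """Return itemsets of the given size that are contained in at least f baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes each subset a canonical tuple, so equal sets share one key.
        for subset in combinations(sorted(basket), size):
            counts[subset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= f}

# Hypothetical baskets S_1..S_4 over a small universe of items.
baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "eggs"},
]
pairs = frequent_itemsets(baskets, f=3, size=2)
print(pairs)
```

With threshold f = 3, only the pair (bread, milk) survives, since it is the only pair contained in three baskets.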
What’s the connection between the
itemsets and complete bipartite graphs?
View each node i as a
set $S_i$ of nodes i points to,
e.g., $S_i$ = {a, b, c, d}
Find frequent itemsets:
s … minimum support
t … itemset size
We found $K_{s,t}$!
$K_{s,t}$ = a set Y of size t
that occurs in s sets $S_i$
[Kumar et al. ‘99]
Say we find a frequent
itemset Y={a,b,c} of supp. s
So, there are s nodes that
link to all of {a,b,c}:
e.g., with s = 3, nodes X = {x, y, z} each link to a, b, and c, which gives a $K_{3,3}$
Itemsets find complete bipartite graphs!
How?
View each node i as a
set $S_i$ of nodes i points to, e.g., $S_i$ = {a, b, c, d}
$K_{s,t}$ = a set Y of size t
that occurs in s sets $S_i$
Looking for $K_{s,t}$: set the
frequency threshold to s
and look at layer t (all
frequent sets of size t)
[Kumar et al. ‘99]
s … minimum support (|X|=s)
t … itemset size (|Y|=t)
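The reduction can be sketched directly: treat each node's out-links as a basket, group nodes by every size-t subset of their targets, and report the groups with at least s members. This is a brute-force illustration under the assumption of a small graph, not the scalable procedure of Kumar et al.; the function name `find_kst` and the example links are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_kst(out_links, s, t):
    """Return (X, Y) pairs where Y has size t and X holds the >= s nodes linking to all of Y."""
    supporters = defaultdict(set)
    for node, targets in out_links.items():
        # Each size-t subset of a node's targets is a candidate right-hand side Y.
        for Y in combinations(sorted(targets), t):
            supporters[Y].add(node)
    return [(sorted(X), list(Y)) for Y, X in sorted(supporters.items()) if len(X) >= s]

# Hypothetical link structure: x, y, z all point to a, b, c.
out_links = {
    "x": {"a", "b", "c"},
    "y": {"a", "b", "c", "d"},
    "z": {"a", "b", "c"},
    "w": {"d"},
}
found = find_kst(out_links, s=3, t=3)
print(found)
```

Here the only itemset of size 3 with support 3 is {a, b, c}, supported by x, y and z, so the output describes one $K_{3,3}$.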
From $K_{s,t}$ to Communities: Informally, every
dense enough graph G contains a bipartite
subgraph $K_{s,t}$ where s and t depend on size
(# of nodes) and density (avg. degree) of G
[Kovari-Sos-Turan ‘53]
Theorem:
Let G=(X, Y, E), |X|=|Y| = n
with avg. degree $\bar{k} = s^{1/t} n^{1-1/t} + t$;
then G contains $K_{s,t}$ as a subgraph
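For the proof we will need the following fact
Recall: $\binom{a}{b} = \frac{a(a-1)\cdots(a-b+1)}{b!}$
Let f(x) = x(x-1)(x-2)…(x-k)
Once x ≥ k, f(x) curves upward (convex)
Suppose a setting:
g(y) is convex
Then, for a fixed total $\sum_{i=1}^{n} x_i = x$, the sum $\sum_{i=1}^{n} g(x_i)$ is smallest when every $x_i = x/n$ (Jensen's inequality)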
Consider node i of degree $k_i$ and neighbor set $S_i$
Put node i in buckets for all size t subsets of i’s neighbors
Buckets are the potential right-hand sides of $K_{s,t}$ (i.e., all size t subsets of $S_i$)
Example with t = 2 and $S_i$ = {a, b, c, d}: node i goes into buckets (a,b), (a,c), (a,d), (b,c), …
Note: As soon as s nodes appear in a bucket we have found a $K_{s,t}$
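The bucket construction can be checked on a toy input. The sketch below fills the buckets for t = 2 and confirms that the total bucket height equals the sum over nodes of "degree choose t". The neighbor sets are invented for illustration, and `bucket_heights` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations
from math import comb

def bucket_heights(neighbors, t):
    """Height of a bucket = number of nodes whose neighbor set contains that size-t subset."""
    heights = Counter()
    for node, S in neighbors.items():
        for subset in combinations(sorted(S), t):
            heights[subset] += 1
    return heights

# Hypothetical neighbor sets S_i with degrees 4, 3 and 2.
neighbors = {1: {"a", "b", "c", "d"}, 2: {"a", "b", "c"}, 3: {"a", "b"}}
heights = bucket_heights(neighbors, t=2)

total = sum(heights.values())
expected = sum(comb(len(S), 2) for S in neighbors.values())  # 6 + 3 + 1
tallest, height = max(heights.items(), key=lambda kv: kv[1])
print(total, expected, tallest, height)
```

Bucket (a, b) reaches height 3, so nodes 1, 2, 3 together with {a, b} form a $K_{3,2}$.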
How many buckets does node i contribute to?
$\binom{k_i}{t}$ = # of ways to select t elements out of $k_i$
($k_i$ … degree of node i)
What is the total size of all buckets?
$\sum_{i=1}^{n} \binom{k_i}{t} \ge n \binom{\bar{k}}{t}$ (by convexity, with $\bar{k}$ the avg. degree)
So, the total height of all buckets is, for $\bar{k} = s^{1/t} n^{1-1/t} + t$:
$n \binom{\bar{k}}{t} = n \frac{\bar{k}(\bar{k}-1)\cdots(\bar{k}-t+1)}{t!} \ge n \frac{(\bar{k}-t)^t}{t!} = n \frac{s\, n^{t-1}}{t!} = \frac{n^t s}{t!}$
We have: Total height of all buckets: $\ge \frac{n^t s}{t!}$
How many buckets are there? $\binom{n}{t} \le \frac{n^t}{t!}$
What is the average height of buckets?
$\ge \frac{n^t s}{t!} \cdot \frac{t!}{n^t} = s$
So, avg. bucket height ≥ s
⇒ By pigeonhole principle, there must be at
least one bucket with at least s nodes in it
⇒ We found a $K_{s,t}$
Analytical result:
Complete bipartite subgraphs $K_{s,t}$ are embedded in
larger dense enough graphs (i.e., the communities)
Bipartite subgraphs act as “signatures” of communities
Algorithmic result:
Frequent itemset extraction and dynamic
programming find graphs $K_{s,t}$
Method is super scalable
[Kumar et al. ‘99]
Undirected graphs (but edges can have
non-negative weights)
Undirected graph G(V,E):
Bi-partitioning task:
Divide vertices into two disjoint groups A, B
[Figure: example graph on nodes 1 to 6, split into groups A and B]
Questions:
How can we define a “good” partition of G?
How can we efficiently identify such a partition?
What makes a good partition?
Maximize the number of within-group
connections
Minimize the number of between-group
connections
Express partitioning objectives as a function
of the “edge cut” of the partition
Cut: Set of edges with only one vertex in a
group: $cut(A,B) = \sum_{i \in A,\, j \in B} w_{ij}$
Example (figure, nodes 1 to 6 split into A and B): cut(A,B) = 2
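As a quick check of the cut score, the sketch below counts crossing edges on the six-node example graph. The edge list is read off the adjacency matrix shown on the "Adjacency matrix (A)" slide, and the split A = {1, 2, 3}, B = {4, 5, 6} is assumed to be the one in the figure; `cut_size` is a hypothetical helper for unweighted edges.

```python
# Edges of the six-node example graph (unweighted, undirected).
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def cut_size(edges, A, B):
    """Number of edges with one endpoint in A and the other in B."""
    return sum(1 for u, v in edges if (u in A and v in B) or (u in B and v in A))

A, B = {1, 2, 3}, {4, 5, 6}
crossing = [(u, v) for u, v in edges if (u in A) != (v in A)]
print(cut_size(edges, A, B), crossing)
```

The two crossing edges are (1, 5) and (3, 4), which matches cut(A,B) = 2.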
Criterion: Minimum-cut
Minimize weight of connections between groups
$\min_{A,B} cut(A,B)$
Degenerate case (figure): the minimum cut differs from the “optimal cut”
Problem:
Only considers external cluster connections
Does not consider internal cluster connectivity
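Criterion: Normalized-cut [Shi-Malik, ’97]
Connectivity between groups relative to the
density of each group
$ncut(A,B) = \frac{cut(A,B)}{vol(A)} + \frac{cut(A,B)}{vol(B)}$
vol(A): total weight of the edges with at least
one endpoint in A: $vol(A) = \sum_{i \in A} k_i$ ($k_i$ … degree of node i)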
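To see why normalization matters, the sketch below compares two partitions of the six-node example graph that have the same cut score. It assumes the standard Shi-Malik form, ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B), with vol computed as the sum of node degrees; the helper names are hypothetical.

```python
# Edges of the six-node example graph (unweighted, undirected).
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def volume(edges, group):
    """Sum of degrees of the nodes in group (an internal edge counts twice)."""
    return sum((u in group) + (v in group) for u, v in edges)

def ncut(edges, A, B):
    """Normalized cut; assumes A and B together cover every node exactly once."""
    c = sum(1 for u, v in edges if (u in A) != (v in A))
    return c / volume(edges, A) + c / volume(edges, B)

balanced = ncut(edges, {1, 2, 3}, {4, 5, 6})   # cut = 2, vol = 8 and 8
lopsided = ncut(edges, {2}, {1, 3, 4, 5, 6})   # cut = 2, vol = 2 and 14
print(balanced, lopsided)
```

Both partitions cut two edges, but the balanced one scores 0.5 while slicing off node 2 alone scores above 1, so the normalized criterion prefers the balanced split.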
A: adjacency matrix of undirected G
$A_{ij}$ = 1 if (i, j) is an edge, else 0
x is a vector in $\mathbb{R}^n$ with components $(x_1, \dots, x_n)$
just a label/value of each node of G
What is the meaning of A⋅x?
Entry $y_j$ is a sum of labels $x_i$ of neighbors of j: $y_j = \sum_{i=1}^{n} A_{ji} x_i = \sum_{(j,i) \in E} x_i$
j-th coordinate of Ax:
Sum of the x-values
of neighbors of j
Make this a new value at node j
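A small numeric check of this reading of A·x, using NumPy and the six-node example graph (its adjacency matrix is the one shown on the "Adjacency matrix (A)" slide). The labels x are arbitrary, chosen here as x_i = i so that the sums are easy to verify by hand.

```python
import numpy as np

# Adjacency matrix of the six-node example graph (rows/columns are nodes 1..6).
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
x = np.array([1, 2, 3, 4, 5, 6])  # label of node i is simply i

y = A @ x  # y[j] = sum of the labels of the neighbors of node j+1
print(y)
```

For example, node 1 has neighbors 2, 3 and 5, so its new value is 2 + 3 + 5 = 10.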
Spectral Graph Theory:
Analyze the “spectrum” of matrix representing G
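Spectrum: Eigenvectors $x_i$ of a graph, ordered by the
magnitude (strength) of their corresponding
eigenvalues $\lambda_i$: $\Lambda = \{\lambda_1, \lambda_2, \dots, \lambda_n\}$ with $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$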
Suppose all nodes in
G have degree d
and G is connected
What are some eigenvalues/vectors of G?
A·x = λ·x What is λ? What x?
Consider x = (1, 1, …, 1): then A·x = (d, d, …, d) = d·x, so λ = d
What if G is not connected?
Say G has 2 components, each d-regular
What are some eigenvectors?
x’ = put all 1s on component A and 0s on component B, or vice versa: either way A·x’ = d·x’, so λ = d again
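These claims are easy to verify numerically. The sketch below uses a hypothetical disconnected graph made of two triangles, so each component is 2-regular, and checks that the all-ones vector and both indicator vectors are eigenvectors of the adjacency matrix with eigenvalue d = 2.

```python
import numpy as np

# Two disjoint triangles: every node has degree d = 2.
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
zero = np.zeros((3, 3), dtype=int)
A = np.block([[tri, zero], [zero, tri]])
d = 2

ones = np.ones(6)
on_first = np.array([1, 1, 1, 0, 0, 0])   # 1s on the first component only
on_second = np.array([0, 0, 0, 1, 1, 1])  # 1s on the second component only

for x in (ones, on_first, on_second):
    print(A @ x, d * x)

# Eigenvalue d appears once per connected component.
eigenvalues = np.linalg.eigvalsh(A)
multiplicity = int(np.sum(np.isclose(eigenvalues, d)))
print(multiplicity)
```

The eigenvalue d shows up twice, once for each component, which is the observation that makes eigenvectors useful for finding nearly disconnected groups.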
Adjacency matrix (A):
n×n matrix
A = [$a_{ij}$], $a_{ij}$ = 1 if edge between node i and j

     1   2   3   4   5   6
1    0   1   1   0   1   0
2    1   0   1   0   0   0
3    1   1   0   1   0   0
4    0   0   1   0   1   1
5    1   0   0   1   0   1
6    0   0   0   1   1   0

Important properties:
Symmetric matrix
Eigenvectors are real and orthogonal
Degree matrix (D):
n×n diagonal matrix
D = [$d_{ii}$], $d_{ii}$ = degree of node i

     1   2   3   4   5   6
1    3   0   0   0   0   0
2    0   2   0   0   0   0
3    0   0   3   0   0   0
4    0   0   0   3   0   0
5    0   0   0   0   3   0
6    0   0   0   0   0   2
Laplacian matrix (L):
n×n symmetric matrix
L = D - A

     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2

What is trivial eigenvector,
eigenvalue?
x = (1, …, 1): every row of L sums to 0, so L·x = 0 and λ = $\lambda_1$ = 0
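The three matrices can be built in a few lines of NumPy. This sketch assumes the six-node example graph from the adjacency-matrix slide (nodes numbered from 1, hence the index shifts) and checks that the all-ones vector is mapped to zero by L.

```python
import numpy as np

# Edges of the six-node example graph (nodes are numbered from 1).
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1   # undirected: fill both directions

D = np.diag(A.sum(axis=1))   # degrees on the diagonal
L = D - A                    # graph Laplacian

print(L)
print(L @ np.ones(n))        # all zeros: (1, ..., 1) is an eigenvector with eigenvalue 0
```

Because each row of L holds the degree on the diagonal and a -1 for every neighbor, the rows sum to zero, which is exactly why the all-ones vector is the trivial eigenvector.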
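For symmetric matrix M:
$\lambda_2 = \min_{x \perp w_1} \frac{x^T M x}{x^T x}$ ($w_1$ … eigenvector of the smallest eigenvalue $\lambda_1$)
What is the meaning of min $x^T L x$ on G?
$x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2$
What else do we know about x?
x is unit vector: $\sum_i x_i^2 = 1$
x is orthogonal to the 1st eigenvector (1, …, 1), so $\sum_i x_i = 0$
Then: $\lambda_2 = \min \sum_{(i,j) \in E} (x_i - x_j)^2$ over $x \in \mathbb{R}^n$
Constraints: $\sum_i x_i = 0$, $\sum_i x_i^2 = 1$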
Say, we want to minimize the cut score
(#edges crossing)
We can express partition A, B as a vector: $x_i = +1$ if $i \in A$, $x_i = -1$ if $i \in B$
We can minimize the cut score of the
partition by finding a non-trivial vector x with $x_i \in \{-1, +1\}$ that minimizes $\sum_{(i,j) \in E} (x_i - x_j)^2$
(each edge crossing the cut contributes 4, each edge inside a group contributes 0)
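The link between the partition vector and the cut score can be checked numerically. For a labeling with entries +1 and -1, the quadratic form x^T L x equals the sum of (x_i - x_j)^2 over edges, which is 4 times the number of crossing edges. The sketch below assumes the six-node example graph and the split {1, 2, 3} versus {4, 5, 6}.

```python
import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1
L = np.diag(A.sum(axis=1)) - A

# Partition vector: +1 for nodes 1..3 (group A), -1 for nodes 4..6 (group B).
x = np.array([1, 1, 1, -1, -1, -1])

quadratic_form = int(x @ L @ x)
edge_sum = sum(int(x[u - 1] - x[v - 1]) ** 2 for u, v in edges)
crossing = sum(1 for u, v in edges if x[u - 1] != x[v - 1])
print(quadratic_form, edge_sum, 4 * crossing)
```

All three numbers agree, so minimizing the quadratic form over such vectors is the same as minimizing the cut; relaxing x to real values is what turns this into an eigenvector problem.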
How to define a “good” partition of a graph?
Minimize a given graph cut criterion
How to efficiently identify such a partition?
Approximate using information provided by the
eigenvalues and eigenvectors of a graph
Spectral Clustering
Three basic stages:
1. Pre-processing
Construct a matrix representation of the graph
2. Decomposition
Compute eigenvalues and eigenvectors of the matrix
Map each point to a lower-dimensional
representation based on one or more eigenvectors
3. Grouping
Assign points to two or more clusters, based on the
new representation
Pre-processing:
Build Laplacian matrix L of the graph

     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2

Decomposition:
Find eigenvalues λ and eigenvectors x of the matrix L

λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)

X (column j is the eigenvector of the j-th eigenvalue; rows are nodes 1 to 6):
1    0.4   0.3  -0.5  -0.2  -0.4  -0.5
2    0.4   0.6   0.4  -0.4   0.4   0.0
3    0.4   0.3   0.1   0.6  -0.4   0.5
4    0.4  -0.3   0.1   0.6   0.4  -0.5
5    0.4  -0.3  -0.5  -0.2   0.4   0.5
6    0.4  -0.6   0.4  -0.4  -0.4   0.0

Map vertices to corresponding components of $x_2$, the eigenvector of $\lambda_2$:
1: 0.3, 2: 0.6, 3: 0.3, 4: -0.3, 5: -0.3, 6: -0.6
How do we now
find clusters?
Grouping:
Sort components of reduced 1-dimensional vector
Identify clusters by splitting the sorted vector in two
How to choose a splitting point?
Naïve approaches:
Split at 0 (or mean or median value)
More expensive approaches:
Attempt to minimize normalized cut criterion in 1-dim
Split at 0:
Cluster A: Positive points (nodes 1, 2, 3 with values 0.3, 0.6, 0.3)
Cluster B: Negative points (nodes 4, 5, 6 with values -0.3, -0.3, -0.6)
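The three stages can be put together in a short NumPy sketch for the six-node example graph. It uses `numpy.linalg.eigh`, which returns eigenvalues of a symmetric matrix in ascending order, and splits the second eigenvector at 0. The function name `spectral_bipartition` is hypothetical, and the overall sign of an eigenvector is arbitrary, so which group comes out "positive" may differ from the slide.

```python
import numpy as np

def spectral_bipartition(A):
    """Split nodes (numbered from 1) by the sign of the second-smallest Laplacian eigenvector."""
    L = np.diag(A.sum(axis=1)) - A              # pre-processing: L = D - A
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # decomposition (ascending eigenvalues)
    x2 = eigenvectors[:, 1]                     # eigenvector of lambda_2
    positive = {i + 1 for i in range(len(x2)) if x2[i] > 0}   # grouping: split at 0
    negative = set(range(1, len(x2) + 1)) - positive
    return eigenvalues[1], x2, positive, negative

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1

lam2, x2, positive, negative = spectral_bipartition(A)
print(round(lam2, 3), np.round(x2, 1), positive, negative)
```

On this graph the second eigenvalue is 1 and the split separates nodes {1, 2, 3} from {4, 5, 6}, in line with the values 0.3, 0.6, 0.3, -0.3, -0.3, -0.6 on the slide (up to a global sign).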
How do we partition a graph into k clusters?
Two basic approaches:
Recursive bi-partitioning [Hagen et al., ’92]
Recursively apply bi-partitioning algorithm in a
hierarchical divisive manner
Disadvantages: Inefficient, unstable
Cluster multiple eigenvectors [Shi-Malik, ’00]
Build a reduced space from multiple eigenvectors
Node i is described by its k eigenvector components $(x_{2,i}, x_{3,i}, \dots, x_{k,i})$
Use k-means to cluster the points
A preferable approach…
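A minimal sketch of the multiple-eigenvector approach, under several assumptions: the embedding uses the Laplacian eigenvectors x_2 … x_k as listed above, the clustering step is a plain Lloyd-style k-means with a deterministic farthest-point start (a simplification chosen for reproducibility, not something the slides prescribe), and the graph is a made-up toy of three triangles chained by single edges.

```python
import numpy as np

def spectral_embedding(A, k):
    """Rows are nodes; columns are the eigenvectors x_2..x_k of L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    return vecs[:, 1:k]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

# Hypothetical toy graph: triangles {0,1,2}, {3,4,5}, {6,7,8} joined by edges 2-3 and 5-6.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6)]
A = np.zeros((9, 9))
for u, v in edges:
    A[u, v] = A[v, u] = 1

labels = kmeans(spectral_embedding(A, k=3), k=3)
print(labels)
```

Each triangle ends up with its own label. In practice a library k-means with several random restarts would be the more usual choice for the grouping step.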
Eigengap:
The difference between two consecutive
eigenvalues
Most stable clustering is generally given by
the value k that maximizes the eigengap: $\Delta_k = |\lambda_k - \lambda_{k-1}|$
Example (figure: eigenvalue plotted against k = 1, …, 20; the largest gap is between $\lambda_1$ and $\lambda_2$):
$\max_k \Delta_k = |\lambda_2 - \lambda_1|$ ⇒ Choose k=2
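The eigengap rule is a one-liner once the eigenvalues are in hand. The sketch below assumes the eigenvalues are listed as λ_1, λ_2, … in the order used in the figure and that the gap at k is |λ_k − λ_{k−1}|; the numbers are invented to resemble a spectrum with one dominant eigenvalue, and `choose_k_by_eigengap` is a hypothetical helper.

```python
def choose_k_by_eigengap(eigenvalues):
    """Return the k >= 2 that maximizes |lambda_k - lambda_{k-1}| (eigenvalues[0] is lambda_1)."""
    gaps = {
        k: abs(eigenvalues[k - 1] - eigenvalues[k - 2])
        for k in range(2, len(eigenvalues) + 1)
    }
    return max(gaps, key=gaps.get)

# Hypothetical spectrum: one large eigenvalue followed by slowly decaying ones.
eigenvalues = [45.0, 12.0, 9.0, 7.5, 6.0]
k = choose_k_by_eigengap(eigenvalues)
print(k)
```

With these values the largest gap sits between the first and second eigenvalue, so the rule picks k = 2.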
Standard Rayleigh quotient iteration:
Start with random vector x, make a guess for λ
Then iterate: solve $(M - \lambda I)\,y = x$, normalize $x \leftarrow y / \|y\|$, update $\lambda \leftarrow x^T M x$
Example: Start with random x, make a guess for λ=0.2
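A sketch of Rayleigh quotient iteration in its textbook form (shifted linear solve, normalize, update the shift with the Rayleigh quotient), applied to the Laplacian of the six-node example graph with the initial guess λ = 0.2. The fixed random seed, the tolerance and the iteration cap are arbitrary choices, and which eigenpair the iteration lands on depends on the starting vector.

```python
import numpy as np

def rayleigh_quotient_iteration(M, x0, lam0, max_iter=50, tol=1e-10):
    """Return an (eigenvalue, eigenvector) estimate for the symmetric matrix M."""
    x = x0 / np.linalg.norm(x0)
    lam = lam0
    identity = np.eye(M.shape[0])
    for _ in range(max_iter):
        if np.linalg.norm(M @ x - lam * x) < tol:
            break                      # residual is small: converged
        try:
            y = np.linalg.solve(M - lam * identity, x)
        except np.linalg.LinAlgError:
            break                      # shift hit an eigenvalue exactly
        x = y / np.linalg.norm(y)
        lam = x @ M @ x                # Rayleigh quotient (x has unit length)
    return lam, x

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1
L = np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)         # fixed seed, only for reproducibility
lam, x = rayleigh_quotient_iteration(L, rng.standard_normal(6), lam0=0.2)
residual = np.linalg.norm(L @ x - lam * x)
print(lam, residual)
```

Convergence is typically very fast near an eigenpair, but the method gives no control over which eigenvalue it finds, so in practice it is paired with a good initial guess or with deflation of known eigenvectors.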
METIS:
Heuristic but works really well in practice
http://glaros.dtc.umn.edu/gkhome/views/metis
Graclus:
Based on kernel k-means
http://www.cs.utexas.edu/users/dml/Software/graclus.html
Cluto:
http://glaros.dtc.umn.edu/gkhome/views/cluto/