Catch the Wind: Graph Workload Balancing on Cloud
Abstract— Graph partitioning is a key issue in graph database processing systems for achieving high efficiency on Cloud. However, balanced graph partitioning itself is difficult because it is known to be NP-complete. In addition, a static graph partitioning cannot keep all graph algorithms efficient over a long run in parallel on Cloud, because the workload balance in different iterations of different graph algorithms can all be different. In this paper, we investigate graph behaviors by exploring the changes of the working window (we call it wind), where a working window is the set of active vertices that a graph algorithm really needs to access in parallel computing. We investigate nine classic graph algorithms using real datasets, and propose simple yet effective policies that can achieve both high graph workload balancing and efficient partitioning on Cloud.
I. INTRODUCTION

Because a large number of new applications need to deal with massive graphs, several graph database processing systems have been developed [22]. As one of the representatives, Google has developed Pregel [15] as its internal graph processing platform based on the bulk-synchronous parallel (BSP) model [25]. Pregel takes a vertex-centric approach and computes in a sequence of supersteps. In a superstep, Pregel applies a user-defined function (UDF) to every active vertex v in parallel, with the capability of receiving messages sent to v from other vertices in the previous superstep, and the capability of sending messages to other vertices (neighbors of v), which will be delivered in the next superstep. The idea behind Pregel is similar to MapReduce [7]. But, as pointed out in [15], MapReduce needs to pass the massive graph itself from one step to another iteratively, which is time consuming. Pregel adopts BSP and implements a stateful model to support long-lived processes. Besides Google's own implementation, there are some public open-source implementations that take similar approaches, such as HAMA [1] and Giraph [2]. Shao et al. in [22] survey systems and implementations for managing and mining large graphs on Cloud.
On Cloud with K computational nodes¹, the efficiency of any graph algorithm² relies on how to partition a large graph into K subgraphs, where all subgraphs are similar in size, the sets of vertices of the K subgraphs are disjoint, and the number of edges across two subgraphs is minimized. Here, the similar sizes are for workload balancing, and the minimum number of crossing edges is for communication cost minimization, in parallel computing.

¹ We use node to indicate a computational node, and vertex to indicate a vertex in the graph.
² Here we mean general graph algorithms run on Pregel, which are not aware of the underlying distributed environment.

Zechao Shang, Jeffrey Xu Yu
The Chinese University of Hong Kong, Hong Kong, China
{zcshang,yu}@se.cuhk.edu.hk

However, graph partitioning is
challenging because it is known to be NP-complete [11], [4], and is challenging on Cloud because partitioning extremely large, irregular datasets is only beginning to be addressed [8]. Currently, the de-facto standard for handling large graphs like social networks is random partitioning [20]. The two main reasons behind the random partitioning of large graphs are: (i) the extremely high cost of graph partitioning even in parallel [21], and (ii) the questionable efficiency that a single graph partitioning can achieve for graph algorithms in general, where the efficiency is achieved by workload balancing among all computational nodes on Cloud. To deal with graph partitioning, several replication-based approaches have recently been proposed [20], [16], [26], where the sets of vertices in the K subgraphs can overlap. Yang et al. in [26] investigate dynamic adaptation to changing workloads with such replication, based on a reasonably good initial partitioning. Stanton et al. in [23] study several heuristic methods for graph partitioning under the streaming computation model. Although these approaches achieve considerable performance and scalability, they deal with static workloads. In this paper, as an attempt, we concentrate on dynamic workload balancing based on the vertex-centric computing that Pregel and similar systems take, without any requirements on the initial partitioning, and we do not allow vertex overlapping between computational nodes, because vertex overlapping may require an additional mechanism with possibly high overhead for consistency maintenance, and costs more storage.
The main contributions of this work are summarized below. We investigate graph behaviors when executing graph algorithms in Pregel. The graph behavior is modeled as the set of active vertices that a graph algorithm really needs to access in a superstep. The motivation is our observation that workload balancing should not be done once for the entire graph and then used forever, because a graph algorithm does not always need to access all vertices in every superstep. This explains why a single static graph partitioning cannot achieve workload balancing: such a graph partitioning is built for the entire graph. We investigate graph behaviors by exploring the working window changes (we call it wind for short). We classify nine classic graph algorithms into three categories, and conduct extensive experimental studies on the working window changes for a random graph partitioning. Based on our experimental studies, we propose simple yet effective policies that can achieve workload balancing for the active vertices and reduce the number of edges across different computational nodes in supersteps, starting from an initial random graph partitioning. We confirm the effectiveness and the efficiency of our policies on a prototype system we have built on top of HAMA [1], an open-source implementation of Pregel.
The remainder of the paper is organized as follows. We discuss Pregel, on which our prototype system is built, in Section II, and discuss nine graph algorithms in three categories, for which we investigate how to support dynamic workload balancing, in Section III. We present our dynamic partitioning approach in Section IV and our experimental studies in Section V. We discuss the related work on graph partitioning in Section VI, and conclude the paper in Section VII.
II. THE PREGEL

Google has developed Pregel [15] as its internal graph processing platform based on the bulk-synchronous parallel (BSP) model [25]. Pregel distributes data over all computational nodes, and during the entire computation all data are assumed to reside in main memory. Pregel takes a vertex-centric approach and computes in a sequence of supersteps. In each superstep, every node in Pregel computes a user-defined function (UDF) against each vertex u in the graph in parallel. The UDF computes u's value based on the messages received by u, which were sent from other vertices in the previous superstep, and the UDF may also send messages from u to other vertices (neighbors of u), which will receive the messages in the next superstep. In general, a vertex u can receive/send messages from/to its neighbors, which may reside in the same or different nodes. Synchronization is done at the end of every superstep to ensure that all nodes complete their tasks (including sending/receiving messages) before entering the next superstep.

Pregel hides the underlying communication mechanism completely. An application built on top of Pregel cannot control any low-level operations such as the mode of sending/receiving messages, workload balancing, data partitioning, and replication.

Some functions provided in the vertex-centric API in Pregel are shown below for accessing vertices.
class Vertex {
  VertexValue getVertexValue();
  void setVertexValue(Value);
  int getSuperStep();
  void SendMessageTo(Vertex, MessageValue);
  void VoteToHalt();
  abstract void UDF(MessageIterator);
}
Here, getVertexValue() and setVertexValue() get and set the vertex value, respectively. SendMessageTo() takes two inputs: the vertex to send to and the message to be delivered. getSuperStep() gets the current superstep number. UDF is the function to be implemented by an application. The input to UDF is a MessageIterator, which is used to access every message received by the corresponding vertex. In the following, for simplicity, for a vertex u, we take the MessageIterator as a set of messages to u, denoted as Mu. Any vertex is initially set as active, and becomes inactive when VoteToHalt() is called by
Algorithm 1: A Computational Node in Pregel

1 foreach active vertex u do
2     u.UDF(received messages to u from the previous superstep);
3 send out all outgoing messages;
4 enter barrier (all nodes wait here for sync);
TABLE I
SUMMARY OF THE NINE GRAPH ALGORITHMS

No.  Algorithm                                     Category
A1   PageRank [15], [17]                           I
A2   Semi-clustering [15]                          I
A3   Graph Coloring [19], [13]                     I
A4   Single Source Shortest Path (SSSP) [15], [5]  II
A5   Breadth First Search (BFS) [6]                II
A6   Random-Walk [18]                              II
A7   Maximal Matching (MM) [15], [3]               III
A8   Minimum Spanning Tree (MST) [19], [10]        III
A9   Maximal Independent Sets (MIS) [19], [14]     III
the vertex itself. An inactive vertex becomes active again when it receives message(s). Pregel repeats the supersteps until all vertices become inactive.

Algorithm 1 depicts a node in Pregel. In the for loop, it computes the UDF for each active vertex u in a superstep, and then sends out all messages that vertices have requested to send. The sending (line 3) in Algorithm 1 is completely different from the member function SendMessageTo() in the Vertex class: the latter indicates that messages need to be sent, whereas the former actually sends them. Besides Google's own implementation, there are some public open-source implementations that take similar approaches, such as HAMA [1] and Giraph [2].
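The superstep loop of Algorithm 1, together with the active/inactive semantics above, can be sketched as a toy single-machine engine in Python (our own illustration, not the HAMA or Giraph API; the names run_pregel, udf, and send are hypothetical):

```python
def run_pregel(vertices, udf, init, max_supersteps=50):
    """Toy single-machine Pregel loop: udf(v, value, messages, send)
    returns (new_value, halt), where halt mimics VoteToHalt()."""
    values = {v: init(v) for v in vertices}
    active = set(vertices)                  # every vertex starts active
    inbox = {v: [] for v in vertices}
    for _ in range(max_supersteps):
        if not active:                      # all vertices inactive: stop
            break
        outbox = {v: [] for v in vertices}
        def send(target, msg, box=outbox):  # messages go to the NEXT superstep
            box[target].append(msg)
        for v in sorted(active):
            values[v], halt = udf(v, values[v], inbox[v], send)
            if halt:
                active.discard(v)           # VoteToHalt()
        inbox = outbox                      # barrier: deliver all messages
        active |= {v for v in vertices if inbox[v]}   # messages wake vertices
    return values

# Usage: minimum-label propagation; each vertex learns the smallest
# vertex id in its strongly connected component (assumed example data).
edges = {0: [1], 1: [2], 2: [0], 3: [4], 4: [3]}

def min_label_udf(v, value, messages, send):
    new = min([value] + messages)
    if new < value or not messages:         # improved, or superstep 0
        for w in edges[v]:
            send(w, new)
    return new, True                        # halt; messages reactivate us

result = run_pregel(list(edges), min_label_udf, init=lambda v: v)
```

Note how the vertices halt in every superstep yet keep running as long as messages arrive, exactly the wake-up behavior described above.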
III. NINE GRAPH ALGORITHMS

We discuss nine classic graph algorithms: PageRank [15], [17], Semi-clustering [15], Graph Coloring [19], [13], Single Source Shortest Path (SSSP) [15], [5], Breadth First Search (BFS) [6], Random-Walk [18], Maximal Matching (MM) [15], [3], Minimum Spanning Tree (MST) [19], [10], and Maximal Independent Sets (MIS) [19], [14]. In the Pregel framework, all graph algorithms need to implement a vertex-centric UDF function, as indicated in line 2 of Algorithm 1. We divide the nine graph algorithms into three categories by the way the UDF function is designed, namely, Always-Active-Style, Traversal-Style, and Multi-Phase-Style. We discuss them below.
Always-Active-Style (Category I): In Always-Active-Style, every vertex in every superstep sends messages to all its neighbors. The main steps are illustrated in Algorithm 2. Here, the UDF function defined on a vertex u receives a set of messages Mu, which were sent to u in the immediately preceding superstep by the neighbors of u. The UDF function first computes the value of u based on both the old value that u holds and the messages received (line 1). Then, it informs all u's neighbors of the current value of u (lines 2-3), and also
Algorithm 2: Always-Active-Style (Mu)
Input: Mu, all incoming messages to vertex u

1 Call an aggregate function on u based on Mu;
2 foreach outgoing edge e = (u, v) do
3     Send a message to v;
4 Change the local vertex value on u if necessary;
Algorithm 3: PageRank-UDF
Input: Mu, all incoming messages to vertex u

1 sum ← 0;
2 foreach message m ∈ Mu do
3     sum ← sum + m;
4 p ← u's value;
5 p ← α · p + (1 − α) · sum;
6 foreach outgoing edge e = (u, v) from u do
7     Send a message with value p/degree(u) to vertex v;
8 Set u's value to be p;
updates the current value of u to be held for the next superstep (line 4). PageRank, Semi-clustering, and Graph Coloring are all in this category.

As an example, the vertex-centric PageRank UDF is shown in Algorithm 3, based on the Pregel implementation [15]. Every vertex is assigned an initial value, which is its initial rank. The UDF function computes a new rank at the i-th superstep for vertex u by combining its previous rank and the aggregation of the ranks of its neighbors (lines 1-5). It then sends the new rank of u to its neighbors (lines 6-7), and sets the new rank of u to be held for the next iteration. The termination condition is based on a pre-determined number of supersteps, and all nodes terminate at the same time. It is important to notice that all vertices are active, sending their updated ranks to their neighbors, because such rank values have impacts on the ranks of their neighbors, their neighbors' neighbors, and so on.
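To make lines 1-8 concrete, the following is a small runnable transcription of Algorithm 3 in Python (a sketch under our own conventions: edges maps each vertex to its out-neighbors, and one round of message exchange is simulated in a single function):

```python
ALPHA = 0.15  # the mixing factor alpha in line 5 of Algorithm 3

def pagerank_superstep(edges, ranks):
    """One Always-Active-Style superstep: every vertex aggregates the
    p/degree(u) messages sent in the previous superstep (lines 1-3),
    mixes them with its old rank (lines 4-5), and the resulting ranks
    are what gets sent out for the next superstep (lines 6-8)."""
    inbox = {u: 0.0 for u in edges}
    for u, p in ranks.items():              # messages from the previous step
        for v in edges[u]:
            inbox[v] += p / len(edges[u])   # line 7: value p/degree(u)
    return {u: ALPHA * ranks[u] + (1 - ALPHA) * inbox[u] for u in edges}

# Usage: on a symmetric 3-cycle the uniform ranks are a fixed point
# (assumed example data).
edges = {0: [1], 1: [2], 2: [0]}
ranks = {u: 1 / 3 for u in edges}
for _ in range(20):   # termination: a pre-determined number of supersteps
    ranks = pagerank_superstep(edges, ranks)
```

Since every vertex sends in every superstep, the working window here is always the whole vertex set, which is characteristic of Category I.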
Traversal-Style (Category II): In Traversal-Style, a specific vertex is treated as the starting point, and the other vertices are involved in the computation based on how it propagates under conditions. The main steps are illustrated in Algorithm 4. Initially, only one vertex is set as active and all others are in the inactive status. During the computation, a vertex u becomes active when it receives some messages (Mu). As shown in Algorithm 4, the UDF function updates the value of u based on the messages received (lines 1-2). If the value of u is updated, it implies that u is now involved in the computation of the given graph algorithm, and it will propagate to some of its selected neighbors under conditions (lines 4-5). Because u does not necessarily need to be involved in the computation all the time, u votes to halt by calling the member function VoteToHalt() to voluntarily become inactive (line 7). Note that a vertex will be woken up by the messages it receives. It is interesting to note that the Category I algorithms, for example PageRank, cannot effectively use VoteToHalt(), because the value of a vertex will
Algorithm 4: Traversal-Style (Mu)
Input: Mu, all incoming messages to vertex u

1 Call an aggregate function on u based on Mu;
2 Update the vertex value for u if needed;
3 if u's vertex value is updated then
4     foreach outgoing edge e = (u, v) do
5         Send a message to v on condition;
6     Set the local vertex value on u if needed;
7 VoteToHalt();
Algorithm 5: SSSP-UDF
Input: Mu, all incoming messages to vertex u

1 dmin ← ∞;
2 foreach message m ∈ Mu do
3     dmin ← min(m, dmin);
4 p ← u's value;
5 if dmin < p then
6     foreach outgoing edge e = (u, v) from u do
7         Send a message to v, with a value of dmin + ω(u, v);
8     Set u's local value as dmin;
9 VoteToHalt();
always affect others, even if the value of the vertex satisfies the convergence condition, for example, when the error bound between the current value and its previous value is less than or equal to a threshold. Typical algorithms in Category II are Breadth First Search [6], Single Source Shortest Path [6], and Random-Walk [18].

The implementation of SSSP is illustrated in Algorithm 5. Initially every vertex u, except the start vertex, is assigned the initial value ∞ for the shortest distance from the starting vertex to u. In SSSP, a vertex u receives messages from some of its neighbors if those neighbors have updated their shortest distances from the starting vertex. The UDF function determines the new shortest distance from the starting vertex to u itself via some of u's neighbors (lines 1-3). If the new shortest distance becomes smaller, u sends messages to its neighbors to tell them the new updates (lines 6-7), where ω(u, v) is the edge weight on the edge from u to its neighbor v. Because u may not necessarily need to be involved in the shortest distance computation afterwards, it votes to halt (line 9).
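The whole traversal can be simulated end-to-end; below is a compact Python sketch of Algorithm 5's message flow (our own simulation harness: edges maps u to a dict of weighted out-neighbors, and the source is woken up with a distance-0 message in superstep 0):

```python
INF = float("inf")

def pregel_sssp(edges, source):
    """Simulate Algorithm 5: only vertices with incoming messages are
    active; an improved tentative distance (line 5) is relayed to the
    neighbors as dmin + w(u, v) (lines 6-7), then the vertex halts."""
    dist = {u: INF for u in edges}
    inbox = {u: [] for u in edges}
    inbox[source] = [0]                    # superstep 0: wake the source
    while any(inbox.values()):             # some vertex is still active
        outbox = {u: [] for u in edges}
        for u, msgs in inbox.items():
            if not msgs:
                continue                   # halted vertex stays idle
            dmin = min(msgs)               # lines 1-3: aggregate messages
            if dmin < dist[u]:             # line 5: improvement found
                for v, w in edges[u].items():
                    outbox[v].append(dmin + w)   # lines 6-7
                dist[u] = dmin             # line 8
        inbox = outbox                     # barrier: deliver messages
    return dist

# Usage on a small weighted digraph (assumed example data):
edges = {0: {1: 1, 2: 4}, 1: {2: 2, 3: 6}, 2: {3: 1}, 3: {}}
dist = pregel_sssp(edges, source=0)
```

Unlike the PageRank sketch, the set of active vertices (the working window) changes from superstep to superstep, which is exactly the Category II behavior the paper studies.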
Multi-Phase-Style (Category III): As illustrated in Algorithm 6, a single phase, for example, identifying one matching edge among many, cannot be done in a single superstep, because a matching edge is an edge (u, v) such that neither u nor v is involved in another matching edge. It needs to be done in several supersteps in Pregel. The entire computation is divided into a number of phases, and each phase Pj is done in k supersteps: Pj0, Pj1, ..., Pji, ..., Pjk−1. Typical algorithms in Category III are MM [15], [3], MST [19], [10], and MIS [19], [14].
We discuss the maximal matching problem (MM) as an example, which is to find a maximal subset of edges, ME,
Algorithm 6: Multi-Phase-Style (Mu)
Input: Mu, all incoming messages to vertex u

1 if u has completed its computation then
2     VoteToHalt(); return;
3 switch getSuperStep() % k do
4     ...;
5     case i /* 0 ≤ i < k */
6         call either an Always-Active-Style or a Traversal-Style UDF; break;
7     ...;
Algorithm 7: MM-UDF
Input: Mu, all incoming messages to vertex u

1 if u itself is a matched point then
2     VoteToHalt(); return;
3 switch getSuperStep() % 4 do
4     case 0 /* invitation */
5         foreach outgoing edge e = (u, v) from u do
6             Send a message to v to invite;
7         acceptu ← false;
8         break;
9     case 1 /* acceptance */
10        Randomly pick an edge e = (u, v) from Mu;
11        Send a message to v to accept; acceptu ← true; break;
12    case 2 /* confirmation */
13        if acceptu = false then
14            Randomly pick an edge e = (u, v) from Mu;
15            Send a message to v to confirm;
16            Set u as a matched point; VoteToHalt(); break;
17    case 3 /* marking */
18        if Mu ≠ ∅ then
19            Set u as a matched point; VoteToHalt(); break;
in a given graph G where no two edges in ME have a common vertex. In Pregel, a randomized algorithm [3] is used to support bipartite maximal matching, which can be extended to handle maximal matching in general graphs. The computation is divided into a number of phases, and each phase, Pi, is done in four supersteps, namely, Pi0, Pi1, Pi2, and Pi3. In Pi0, all vertices that have not been involved in any matching send a message to their neighbors to invite them to join a match. In Pi1, since a vertex may receive many matching invitations from its neighbors, it randomly accepts one of the invitations and replies to the corresponding sender. In Pi2, in a similar fashion, a vertex that has sent invitations may receive several acceptances, and confirms one acceptance by replying a message. In Pi3, the vertex that receives a confirmation marks itself as matched. The computation repeats until no more matches can be identified. As illustrated in Algorithm 7, one important point is that a vertex must not respond to any message once it has already been matched.
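To make the four-phase protocol concrete, the following is a minimal sequential simulation of the invitation/acceptance/confirmation/marking phases. This is a sketch for illustration, not the Pregel implementation: the function name `maximal_matching`, the adjacency-dict representation, and the seeded random generator are our own assumptions.

```python
import random

def maximal_matching(adj, seed=0):
    """Sequentially simulate the phases of the randomized maximal-matching
    algorithm: invitation, acceptance, then confirmation/marking.
    `adj` maps each vertex to the set of its neighbors (symmetric)."""
    rng = random.Random(seed)
    matched = set()   # vertices marked as matched points
    matching = set()  # matching edges found so far
    while True:
        # Invitation: every unmatched vertex invites its unmatched neighbors.
        invites = {}  # receiver -> list of inviters
        for u in adj:
            if u in matched:
                continue
            for v in adj[u]:
                if v not in matched:
                    invites.setdefault(v, []).append(u)
        if not invites:
            break  # no unmatched vertex has an unmatched neighbor: maximal
        # Acceptance: each invited vertex randomly accepts one invitation.
        accepts = {}  # inviter -> list of accepting neighbors
        for v, senders in invites.items():
            accepts.setdefault(rng.choice(senders), []).append(v)
        # Confirmation and marking: each inviter confirms one still-free
        # acceptance; both endpoints then become matched points.
        for u, accepters in accepts.items():
            if u in matched:
                continue
            free = [w for w in accepters if w not in matched]
            if not free:
                continue
            v = rng.choice(free)
            matching.add((u, v))
            matched.update((u, v))
    return matching
```

Each pass of the `while` loop corresponds to one phase Pi; at least one edge is matched per phase, so the loop terminates with a maximal matching.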
IV. SUPER-DYNAMIC PARTITIONING
In this section, we discuss two important issues which conflict with each other in Pregel-like systems for large graph processing. The two issues are workload balancing and communication cost minimization. In Pregel, the number of vertices is the dominant factor for the workload on each computational node, because of its vertex-centric computing. Workload balancing suggests that all nodes should have a similar amount of workload in every superstep. The intuition is that the computing cost of a superstep is the computing cost of the node that has the largest workload, so workload balancing leads to minimization of the computational cost of supersteps. On the other hand, communication cost minimization suggests that the communication cost among supersteps shall be minimized. It is worth noting that, in graph processing in Pregel, such communication cost depends on the number of messages across computational nodes, and the number of messages is heavily dependent on the number of edges that cross different computational nodes. Minimizing the communication cost will significantly reduce the time for synchronization. The two issues are important, because the system execution time depends on both of them. The two issues conflict. First, balancing workload among computational nodes may increase the number of edges that cross different computational nodes, which possibly leads to high communication cost. Second, minimizing the communication cost may push all workload to a single computational node.
Graph Notations: Consider a graph G = (V, E), where V is the set of vertices, and E ⊆ V × V is the set of directed edges. We use n = |V| and m = |E| to denote the numbers of vertices and edges, respectively.
A graph G is distributed to K computational nodes, {N1, N2, · · · , NK}, in Pregel for parallel processing, where all Ni have the same memory capacity C.
At the s-th superstep, each node Ni holds a subset of vertices, V_i^s ⊆ V, under the conditions that V_i^s ∩ V_j^s = ∅ for i ≠ j and V = ∪_i V_i^s, and holds a subset of internal edges, IE_i^s = {(u, v) | (u, v) ∈ E, u ∈ V_i^s, v ∈ V_i^s}. The edges that cross nodes are called external edges. We use EE_{i,j}^s to denote the set of external edges which cross two different computational nodes, Ni and Nj, i ≠ j, i.e., EE_{i,j}^s = {(u, v) | (u, v) ∈ E, u ∈ V_i^s, v ∈ V_j^s} ∪ {(u, v) | (u, v) ∈ E, u ∈ V_j^s, v ∈ V_i^s}. The set of edges associated with partition Ni is E_i^s = (∪_j EE_{i,j}^s) ∪ IE_i^s.
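Under these notations, splitting a concrete edge list into internal and external edge sets is straightforward. The sketch below is our own helper for illustration, with a vertex-to-node `owner` map standing in for the partition V_i^s:

```python
def split_edges(edges, owner):
    """Split directed edges into internal edges per node (IE_i) and
    external edges per unordered node pair (EE_{i,j}).
    `owner[v]` is the computational node currently holding vertex v."""
    IE = {}  # node -> set of edges with both endpoints on that node
    EE = {}  # frozenset({Ni, Nj}) -> set of edges crossing Ni and Nj
    for (u, v) in edges:
        i, j = owner[u], owner[v]
        if i == j:
            IE.setdefault(i, set()).add((u, v))
        else:
            EE.setdefault(frozenset((i, j)), set()).add((u, v))
    return IE, EE
```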
Working Window, Workload, and Workload Balancing: The computational cost on a computational node mainly depends on the number of active vertices that receive messages in a vertex-centric computing framework like Pregel. Here, the active vertices are a subset of the entire vertex set in a superstep, because not all vertices will be invoked by messages in a superstep. We call the set of active vertices in the s-th superstep on a computational node Ni a working window, and denote it as W_i^s; the entire working window at the s-th superstep is W^s = ∪_{i=1}^K W_i^s. The average size of the working window per node at the s-th superstep is denoted as W̄^s, which is computed as W̄^s = |W^s| / K.
We define the workload in the s-th superstep on a computational node Ni, denoted as 𝒲_i^s, as

  𝒲_i^s = |W_i^s| + |AE_i^s|   (1)

Here, |W_i^s| is the number of active vertices on Ni, and AE_i^s is the active subset of the edges. The average workload in the s-th superstep is 𝒲̄^s = (Σ_{i=1}^K 𝒲_i^s) / K.
Eq. (1) indicates that the workload depends on two factors: |W_i^s| is the number of UDF invocations that must be computed, and |AE_i^s| is the number of messages that those UDF invocations must process. Eq. (1) takes the simple sum of |W_i^s| and |AE_i^s| for two reasons. First, it is based on our observations: for most UDFs, the number of computing operations is linear w.r.t. the number of incoming messages. Second, the message serialization/deserialization time is linear in the size of the messages, which usually overtakes other operations and dominates the UDF execution time.
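As a minimal illustration of Eq. (1), a node's workload is the sum of its active-vertex count and its active-edge (message) count. The helper names below are our own:

```python
def node_workload(active_vertices, active_edges):
    """Eq. (1): workload of node Ni in superstep s is
    |W_i^s| (active vertices) + |AE_i^s| (active edges / messages)."""
    return len(active_vertices) + len(active_edges)

def average_workload(workloads):
    """Average workload over the K computational nodes in a superstep."""
    return sum(workloads) / len(workloads)
```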
In addition, we use D_{i,j}^s to denote the number of messages sent from Ni to Nj at the s-th superstep, which is based on EE_{i,j}^s. D_{i,j}^s implies the communication cost from Ni to Nj at the s-th superstep. The total communication cost at the s-th superstep is denoted as D^s.
Because it is very difficult to optimize the two objectives, namely, workload balancing and communication cost minimization, at the same time, in this paper we study how to minimize the communication cost under the condition that workloads are balanced. The problem statement is given below.
Problem Statement: For a given graph algorithm expressible in Pregel, the problem we study is to minimize Σ_s D^s, subject to two conditions in every superstep s: (1) |V_i^s| + |E_i^s| ≤ C, and (2) all computational nodes are balanced, namely 𝒲_i^s ≤ (1 + ɛ) × 𝒲̄^s, ∀i ∈ {1..K}, where 0 ≤ ɛ ≪ 1. The first condition suggests that a node Ni must be able to hold the workload assigned, and the second condition suggests that the workload must be balanced.
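Condition (2) can be checked directly from the per-node workloads. A sketch of our own, with `eps` standing in for ɛ:

```python
def is_balanced(workloads, eps):
    """Condition (2): every node's workload stays within a factor of
    (1 + eps) of the average workload of this superstep."""
    avg = sum(workloads) / len(workloads)
    return all(w <= (1 + eps) * avg for w in workloads)
```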
The problem is challenging. First, the problem itself is NP-hard even in a static setting, as the size-balanced graph partition problem is shown to be NP-complete [4]. In addition, the active vertices we are most interested in are those that will be active in the near future rather than those that are active in the current superstep. This is because in the current superstep the active vertices have already been determined, based on the messages received in the immediately preceding superstep, and cannot be rebalanced. Second, it is known to be very difficult to predict the workload on computational nodes in parallel graph processing. Third, there does not exist an optimal graph partitioning under one set of conditions that can be used efficiently for all graph problems in general, for the reason that the workloads vary greatly, as discussed for the 3 categories of graph algorithms. Fourth, this calls for a super-dynamic solution during the computation of a graph algorithm, one that must be both effective and simple.
[Fig. 1. A Graph Example]
[Fig. 2. Graph Examples: (a) PageRank; (b) BFS (Start from 1); (c) MM]

A. Various Working Window Behaviors
We show various working window behaviors using a graph example. Fig. 1 shows a graph with 17 vertices and 31 edges. Assume that there are 3 nodes, N1, N2, and N3. As divided by the dotted lines in Fig. 1, N1 holds the vertices V1 = {v1, v2, v3, v4, v5} and the internal edges among V1, N2 holds the vertices V2 = {v6, v7, v8, v9, v10, v11} and the internal edges among V2, and N3 holds the vertices V3 = {v12, v13, v14, v15, v16, v17} and the internal edges among V3, respectively. The numbers of vertices in the 3 nodes are balanced with 5, 6, and 6 vertices. This implies that the workload is balanced if we simply let the working window be W_i^s = Vi, for 1 ≤ i ≤ 3, regardless of the graph algorithm. The numbers of external edges are the minimum: 2 edges between N1 and N2, and 3 edges between N2 and N3. This implies that the communication cost is minimized. Such a graph partition can be produced by most state-of-the-art graph partitioners, such as METIS [12].
Fig. 2 shows the working windows in several supersteps for 3 graph algorithms: PageRank, BFS, and MM. In every superstep, a shaded vertex indicates an active vertex, and a white vertex indicates an inactive vertex. Fig. 2(a) shows the working windows for PageRank in the first 4 supersteps from top to bottom. All vertices in every superstep are active. Fig. 2(b) shows the working windows for BFS in the first 4 supersteps from top to bottom, starting from the starting vertex v1. As can be seen from the supersteps, the 4 working windows are W^1 = {v1}, W^2 = {v2, v5}, W^3 = {v3, v4, v6}, and W^4 = {v7, v8, v9, v10, v11}, in the 1st, 2nd, 3rd, and 4th superstep, respectively. In the 1st and 2nd supersteps, the working windows are in N1. In the 3rd superstep, N1 and N2 have active vertices to work on, but the number of active vertices is small. In the 4th superstep, all active vertices
[Fig. 3. Working Window Sizes — size of working set (as a fraction of all vertices) vs. superstep, for (a) PageRank (A1), (b) SSSP (A4), (c) BFS (A5), (d) MM (A7), (e) MST (A8), (f) MIS (A9)]
are on N2. Fig. 2(c) shows the working windows for MM. Unlike the previous figures, we show the 1st, 5th, 9th, and 13th supersteps, because a single phase consists of k = 4 supersteps. Initially, in the 1st superstep, all vertices are active. In the later supersteps, unlike the others, the node N3 does not have many active vertices to work on.
As a remark, it is difficult in general to predict how the working windows change from time to time as the algorithm varies. Such changes depend on both the graph algorithm used and the large graph being processed.
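The Traversal-Style working windows above (W^1 = {v1} growing outward for BFS) can be reproduced with a simple frontier computation. The sketch below is our own illustration on a generic adjacency dict, not the Pregel code, and the small test graph is hypothetical rather than the 17-vertex example of Fig. 1:

```python
def bfs_windows(adj, start, steps):
    """Working window per superstep for Traversal-Style BFS:
    the set of vertices newly reached (i.e., receiving messages)
    in each superstep."""
    visited = {start}
    window = {start}
    windows = [window]
    for _ in range(steps - 1):
        # Next window: neighbors of the current frontier not yet visited.
        nxt = {v for u in window for v in adj[u]} - visited
        visited |= nxt
        windows.append(nxt)
        window = nxt
    return windows
```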
B. Catch the Working Window Patterns
We investigate the behaviors of the nine graph algorithms in Pregel in Table III by analyzing the properties of their working windows. The 9 graph algorithms are numbered A1, A2, · · · , A9. We have conducted the testing using all the datasets listed in our experimental studies, and they all exhibit similar behaviors. Below, we show the results using the CA-CondMat data (http://snap.stanford.edu/data/), which is a collaboration dataset with 23K vertices and 186K edges. Auxiliary supersteps which do not involve any communication are omitted from the following charts for clarity.
Dynamic Working Window Changes: First, we show that working windows change dynamically. Fig. 3 shows the ratio of the working window size to the entire vertex set in all supersteps. Fig. 3(a) shows that all vertices are active for PageRank (Category I). For SSSP and BFS of Traversal-Style in Category II, the working windows first increase and then decrease, as shown in Fig. 3(b) and Fig. 3(c). For MM, MST, and MIS of Multi-Phase-Style in Category III, the working windows show periodical patterns in different ways, as shown in Fig. 3(d), Fig. 3(e), and Fig. 3(f). This confirms that it is difficult to use one graph partition to support all graph algorithms in Pregel. We also verify it by partitioning the
[Fig. 4. Average correlation between the working windows — average hit rate vs. distance between supersteps, for (a) PageRank (A1), (b) SSSP (A4), (c) BFS (A5), (d) MM (A7), (e) MST (A8), (f) MIS (A9)]
CA-CondMat graph data into 16 computational nodes using METIS (the static graph partitioning approach) and running the nine graph algorithms. To show the workload balancing, we define an imbalance factor, which is the sum of the max workload over all supersteps divided by the sum of the average workload over all supersteps. If the workload is well balanced, the imbalance factor is near 1; a larger imbalance factor implies that workloads are not well balanced. Our experiment shows that during graph algorithm processing, the imbalance factor can be up to 1.6. Consequently, a costly static graph partitioning cannot serve the purpose of workload balancing, because the workload changes dynamically.
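The imbalance factor just described can be computed directly from per-superstep, per-node workloads. A small sketch, in our own formulation of the definition above:

```python
def imbalance_factor(workloads_per_superstep):
    """Sum over supersteps of the maximum per-node workload, divided by
    the sum over supersteps of the average per-node workload.
    A value near 1 means the workload is well balanced."""
    total_max = sum(max(ws) for ws in workloads_per_superstep)
    total_avg = sum(sum(ws) / len(ws) for ws in workloads_per_superstep)
    return total_max / total_avg
```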
Next, we discuss whether it is possible to catch the wind (working window) in some way, even though it is known to be very difficult. We first give some notations. Consider two working windows, W^p and W^q, at the p-th and q-th supersteps, respectively. The distance between the two working windows is denoted as dis(W^p, W^q) = |p − q|. We say a working window W^p is a k-distance previous working window of W^q if p < q and dis(W^p, W^q) = k. The hit-rate of a working window W^q regarding a k-distance previous working window W^p is denoted as hit_k(W^p, W^q) and is defined as hit_k(W^p, W^q) = |W^p ∩ W^q| / |W^q|.
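With working windows represented as vertex sets, the hit-rate is a one-line set computation; a sketch, with the convention for an empty current window being our own assumption:

```python
def hit_rate(w_prev, w_cur):
    """hit_k(W^p, W^q) = |W^p ∩ W^q| / |W^q|, where w_prev is a
    k-distance previous working window of the current window w_cur."""
    if not w_cur:
        return 0.0  # assumed convention for an empty current window
    return len(w_prev & w_cur) / len(w_cur)
```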
Long-Term Working Window Behaviors: Fig. 4 shows the relationship between the k-distance and the hit-rate for every pair of working windows, for each of the graph algorithms. Here, the x-axis in Fig. 4 is the k-distance, and the y-axis shows the average hit-rate over every pair of working windows at k-distance. For example, when k = 1, the average hit-rate is the average of all hit_1(W^{q−1}, W^q). In Fig. 4, we observe the behaviors of graph algorithms in the three categories. Fig. 4(a) shows that the hit-rate remains 100% for any two working windows at k-distance, for 1 ≤ k ≤ 10, when PageRank (A1) is
[Fig. 5. The hit-rates between the working windows — hit rate per superstep for k = 1, 2, 3, for (a) PageRank (A1), (b) SSSP (A4), (c) BFS (A5), (d) MM (A7), (e) MST (A8), (f) MIS (A9)]
executed. This is the typical behavior of graph algorithms in Category I (Table III), because all vertices are always active (Algorithm 2). Fig. 4(b) and Fig. 4(c) show the hit-rates when SSSP (A4) and BFS (A5), in Category II, are executed, respectively. The working windows show a locality property: the hit-rate decreases naturally as the distance becomes larger. Fig. 4(d), Fig. 4(e), and Fig. 4(f) show the periodical behaviors of the hit-rates for MM (A7), MST (A8), and MIS (A9), in Category III, respectively. The periodical behavior of the hit-rate for Category III algorithms is worth emphasizing: every three supersteps, a working window at a superstep q accesses a similar set of vertices to its 3-distance previous working window at superstep q − 3. This has a strong relationship with the design of the vertex-centric algorithms, as can be seen in MM-UDF (Algorithm 7). For all three categories, this suggests that moving the working window to the right computational nodes at a superstep can possibly reduce the communication cost in the following supersteps.
Short-Term Working Window Behaviors: Fig. 5 shows hit_k(W^{q−k}, W^q), for k = 1, 2, 3, at each superstep. Unlike Fig. 4, which shows the average, Fig. 5 shows the hit-rate for each superstep from 1 to 15. For PageRank, since it is Always-Active-Style, the hit-rate is always 100% (Fig. 5(a)) for k = 1, 2, 3. For SSSP and BFS, during the computation, the hit-rate increases up to a peak and then decreases (Fig. 5(b) and Fig. 5(c)); both exhibit a high hit-rate between two consecutive supersteps (k = 1). For MM, MST, and MIS, Fig. 5(d), Fig. 5(e), and Fig. 5(f) show periodical patterns: the hit-rate is high for k = 3 instead of k = 1, due to the design of the Category III graph algorithms.
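As a concrete reading of these curves, the overlap statistic can be computed directly from per-superstep active-vertex sets. The sketch below is illustrative (the helper names are ours, and plain Python sets stand in for working windows): it computes hit_k as the fraction of the current window that was already active k supersteps earlier.

```python
def hit_rate(prev_window, cur_window):
    """Fraction of the current working window that also appears in an
    earlier window; 1.0 means the algorithm revisits the same vertices."""
    if not cur_window:
        return 0.0
    return len(cur_window & prev_window) / len(cur_window)

def hit_k(windows, q, k):
    """Hit-rate between the window at superstep q and the one k supersteps
    earlier, mirroring the hit_k(W^{q-k}, W^q) statistic of Fig. 5."""
    return hit_rate(windows[q - k], windows[q])

# A toy Category-III-like trace: the window at superstep q resembles q - 3.
windows = [{1, 2, 3}, {4, 5}, {6}, {1, 2, 9}, {4, 7}, {6, 8}]
print(hit_k(windows, 3, 3))  # high overlap with superstep 0
print(hit_k(windows, 3, 1))  # no overlap with superstep 2
```

On such a trace, k = 3 yields a high hit-rate while k = 1 yields a low one, which is exactly the periodical pattern described above for MM, MST, and MIS.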
Discussions: Traditional parallel-system load-balancing approaches collect statistics over a moderate-length history before performing analysis and applying strategies. This becomes infeasible in our problem, because we have to consider load balancing in every superstep while processing a graph algorithm. In other words, we have to deal with load balancing using a very short, dynamically changing history. A question that arises is whether there exist patterns that allow us to predict working window movements. The answer depends on both the underlying graph and the graph algorithm, and it is very difficult to obtain an analytical result at this stage. In this work, based on extensive experimental studies, we observe that the three categories of graph algorithms have potential predictability. The Always-Active-Style algorithms have stable working windows, so we can predict their working windows. For the Traversal-Style algorithms, the predictability cannot be ensured in general; however, in real applications most large-scale graphs have the low-diameter property. This means that the average distance between two vertices is short, hypothesized to be on the scale of the logarithm of the graph size, and that the size of the k-neighborhood, the set of all vertices reachable in no more than k hops, grows exponentially with k. It implies that some Traversal-Style algorithms, for example SSSP, have a reasonable hit-rate between supersteps. On the other hand, BFS in Category II has a very low hit-rate between supersteps, because it literally traverses the graph. For Category III algorithms, we need to consider a phase as a whole. The hit-rate between two successive phases is very high, due to the nature of the algorithm design, which accesses a certain set of vertices, determines some partial results, and then moves from that set of vertices to another nearby set. Our observation suggests that in many cases the working windows can be predicted using a very short history.
C. A Distributed Vertex-Centric Scheduler
In this section, we discuss our distributed vertex-centric scheduler. The main idea is to allow each node Ni to decide by itself how much workload should be moved out to another node Nj, in order to minimize the total communication cost. There are two issues: first, how much workload we should move from Ni to Nj, and second, what we should move from Ni to Nj.
How much to move: For the first issue, we keep to the hard workload limit as much as possible, in order to ensure workload balancing. Assume the current superstep is s. If Ni is overloaded in the current s-th superstep, all other nodes Nj, for j ≠ i, should not move any vertices towards Ni, and the overloaded node Ni should move its vertices out to other nodes, to keep all nodes balanced in the (s+1)-th superstep. This approach can prevent nodes from being severely overloaded. We introduce a quota Q^s_{i,j} to determine the workload to be moved from Ni to Nj in the current s-th superstep for the next (s+1)-th superstep, and call it a quota function.

    Q^s_{i,j} = (W^{s-1}_i − W^{s-1}_j) / K    (2)
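Eq. (2) is a direct pairwise computation over the previous superstep's workload counters. A minimal sketch (names are illustrative; W_prev stands for the per-node workloads W^{s-1}_i):

```python
def quota(W_prev, i, j, K):
    """Eq. (2): workload that node i may move to node j for the coming
    superstep, from previous-superstep workloads W_prev (length-K list).
    Positive when i was more loaded than j; negative forbids the move."""
    return (W_prev[i] - W_prev[j]) / K

W_prev = [120, 80, 100, 60]   # workloads of N0..N3 at superstep s-1
K = len(W_prev)
print(quota(W_prev, 0, 3, K))  # (120 - 60) / 4 = 15.0
print(quota(W_prev, 3, 0, K))  # negative: N3 must not push work to N0
```

Note the antisymmetry: Q^s_{i,j} = −Q^s_{j,i}, so for any pair of nodes at most one direction of movement is permitted.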
Algorithm 8: Move(i, j)
 1 if Q^s_{i,j} = 0 then
 2     Q^s_{i,j} ← Q_threshold;
 3 Δ ← Q^s_{i,j};
 4 let L_{i,j} be a ranking list of vertices in N_i;
 5 while Δ > 0 do
 6     remove u from the top of L_{i,j};
 7     if W^s_{i,j}(u) < Δ then
 8         move u from N_i to N_j;
 9         Δ ← Δ − W^s_{i,j}(u);
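Algorithm 8 can be made executable with ordinary Python lists. This sketch uses illustrative names, takes the ranking list already sorted by willingness, and adds a guard so the loop terminates when the list empties (a detail the pseudocode leaves implicit):

```python
def move(quota_ij, ranking, vertex_cost, q_threshold):
    """Algorithm 8: pop vertices off the ranking list L_{i,j} and move them
    from N_i to N_j until the quota Delta is used up.

    quota_ij    -- Q^s_{i,j}; if 0, fall back to the small threshold
    ranking     -- vertices ordered by willingness to move (best first)
    vertex_cost -- maps u to W^s_{i,j}(u) = 1 + number of active edges of u
    """
    delta = quota_ij if quota_ij != 0 else q_threshold
    moved = []
    ranking = list(ranking)               # work on a copy of L_{i,j}
    while delta > 0 and ranking:          # guard: stop when the list empties
        u = ranking.pop(0)                # remove u from the top of L_{i,j}
        if vertex_cost[u] < delta:        # u fits in the remaining quota
            moved.append(u)               # move u from N_i to N_j
            delta -= vertex_cost[u]
    return moved

cost = {"a": 3, "b": 5, "c": 2, "d": 9}
print(move(8, ["a", "b", "c", "d"], cost, 1))  # -> ['a', 'c']
```

Vertices that do not fit in the remaining quota are simply skipped, so the moved set never exceeds Q^s_{i,j}.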
Here, Q^s_{i,j} is determined based on the workload in the previous (s−1)-th superstep. A question that arises is why we must use the previous workloads, W^{s-1}_i − W^{s-1}_j, rather than the current workloads, W^s_i − W^s_j. The reason is that the workload in the current superstep has already been determined and cannot be changed without introducing an extra synchronization barrier, which is costly in BSP computing. Based on the discussions in Section IV-B, the working window has a high hit-rate with respect to the previous working window, so we use W^{s-1}_i − W^{s-1}_j to estimate W^s_i − W^s_j. This is confirmed to be effective in our extensive testing.
Based on Q^s_{i,j}, a node Ni can move workload from Ni to Nj if Q^s_{i,j} > 0. No workload can be transferred from Ni to Nj if Q^s_{i,j} < 0. With a non-zero Q^s_{i,j}, it becomes unlikely that another node Nj will move its workload to Ni when Ni is overloaded. When Q^s_{i,j} = 0 in a superstep, or in the initial superstep where we do not yet know the workload, Ni and Nj may move some workload to each other. We control the amount of workload to be moved under a small threshold, for example, α = 1% × (|V|+|E|)/K². The numerator, |V| + |E|, is the total number of vertices and edges, and (|V|+|E|)/K² is the amount that can be moved from Ni to Nj on average; we only move out up to 1% of it, which is a very small amount. It is better to allow some vertices to be moved from Ni to Nj even if Q^s_{i,j} = 0. Otherwise, if the workload is perfectly balanced, which could be achieved by a hash-based distribution policy, no move would be allowed at all, and for Always-Active-Style algorithms there would never be any move.
Suppose Q^s_{i,j} > 0. We move some workload from Ni to Nj, as bounded by Q^s_{i,j}. Suppose that we move a vertex u from Ni to Nj; then the workload associated with this move of the vertex u is denoted as W^s_i(u). Here, W^s_i(u) is equal to 1 + |AE^s_{i,j}(u)|, where the value 1 is the number of vertices to be moved and |AE^s_{i,j}(u)| is the number of active edges. A vertex u can be moved from Ni to Nj if W^s_i(u) is less than Q^s_{i,j}. Algorithm 8 illustrates the idea. It is worth noting that a ranking list, L_{i,j}, is used for Q^s_{i,j} in Algorithm 8. A vertex can appear in a ranking list at most once, and the vertices in the list are ranked based on the benefit of moving them from Ni to Nj. Below, we discuss the ranking, which is based on a function called the willingness score.
What to move: Assume a vertex u is in a computational node Ni. Let I^s_{i,j}(u) and O^s_{i,j}(u) be the number of in-messages received and the number of out-messages sent by the vertex u from/to a node Nj in the s-th superstep, respectively. We denote the total communication cost for the vertex u by IO^s_{i,j}(u), which is defined in Eq. (3).

    IO^s_{i,j}(u) = I^s_{i,j}(u) + O^s_{i,j}(u)    (3)

It is important to note that a vertex u is an active vertex if it receives/sends messages. Based on IO^s_{i,j}(u), the first vertex-centric willingness function for workload balancing selects the computational node Nj for a vertex u to move to if Nj has the maximum IO^s_{i,j}(u) value among all nodes (Eq. (4)).

    ϖ^s_0(u) = argmax_j IO^s_{i,j}(u)    (4)
Suppose u is in node Ni. Eq. (4) takes the simple majority and decides the node Nj for u to move to. If u has more internal edges than external edges, ϖ^s_0(u) returns Ni, which suggests that u is not supposed to be moved and will stay in the same node Ni.
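The simple-majority rule of Eq. (4) is just an argmax over per-node message counts. A minimal sketch (hypothetical names; io_per_node[j] stands for IO^s_{i,j}(u)):

```python
def willingness_0(io_per_node):
    """Eq. (4): pick the node exchanging the most messages with u.
    io_per_node maps node id -> IO^s_{i,j}(u); it includes i itself, so a
    vertex with mostly internal edges simply stays put."""
    return max(io_per_node, key=io_per_node.get)

# u lives on node 0; most of its traffic goes to node 2, so it should move.
print(willingness_0({0: 4, 1: 1, 2: 7}))  # -> 2
# Mostly internal edges: the argmax is the home node, i.e. no move.
print(willingness_0({0: 9, 1: 2, 2: 3}))  # -> 0
```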
Suppose u is about to be moved from Ni to Nj. As can be seen from Eq. (4), it does not consider whether the neighbors of u in Nj will move to a different node Nk, for k ≠ j, at the same time as u is moved to Nj. However, it is difficult to predict whether the neighbors of u in Nj will move or not, and it would be costly to do so even if it could be done. Alternatively, from a different angle, the strategy we take is to keep a vertex u in Ni for a certain interval, in order for other vertices to be moved to Ni. In other words, we do not move u frequently from one computational node to another. Based on this observation, we define a new willingness function, ϖ^s_1(u).

    ϖ^s_1(u) = ϖ^s_0(u), if u has not been moved in the past β supersteps;
               0, otherwise.    (5)

Here, β is a parameter that controls at least how long a vertex u is forced to stay in a node.
Next, we consider a potential problem with the simple majority used in Eq. (4). Consider the following example. With Eq. (4), suppose that if u remains in the same node Ni, IO^s_{i,i}(u) = 100, and further suppose that if u moves to a different node Nj, IO^s_{i,j}(u) = 101. The gain of moving u to a different node is small, and the move has a cost. In order to handle this problem, we take a ratio into account when moving vertices, instead of the simple majority. In other words, we only move u to Nj if the gain is higher than a ratio λ, i.e., IO^s_{i,j}(u) / IO^s_{i,i}(u) ≥ λ. We modify Eq. (4) as follows.

    IO′^s_{i,j}(u) = λ(I^s_{i,j}(u) + O^s_{i,j}(u)), if i = j;
                     I^s_{i,j}(u) + O^s_{i,j}(u), otherwise.    (6)

where λ > 1. Accordingly, we define a new willingness function, ϖ^s_2(u), below.

    ϖ^s_2(u) = argmax_j IO′^s_{i,j}(u)    (7)
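The λ-adjusted score of Eq. (6) and the argmax of Eq. (7) can be sketched as follows (illustrative names; io_per_node[j] stands for I^s_{i,j}(u) + O^s_{i,j}(u), and home is the node Ni currently holding u):

```python
def willingness_2(io_per_node, home, lam=1.05):
    """Eqs. (6)-(7): inflate the stay-at-home score by lam (> 1), so that a
    move to another node only wins when its traffic beats the home node's
    by at least the ratio lam."""
    adjusted = {j: (lam * io if j == home else io)
                for j, io in io_per_node.items()}
    return max(adjusted, key=adjusted.get)

# A 1% gain (101 vs 100) does not justify a move when lam = 1.05 ...
print(willingness_2({0: 100, 1: 101}, home=0))  # stays on node 0
# ... but a clearly larger gain does.
print(willingness_2({0: 100, 1: 120}, home=0))  # moves to node 1
```

Multiplying the home score by λ is equivalent to requiring IO^s_{i,j}(u) / IO^s_{i,i}(u) ≥ λ before any move, which filters out marginal-gain moves like the 100-versus-101 example above.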
All the above willingness functions are vertex-centric and do not consider the workload on the computational nodes. Since the workload is a very important factor to be balanced, we define a communication cost formula

    IO″^s_{i,j}(u) = (I^s_{i,j}(u) + O^s_{i,j}(u)) × (1 − W^{s-1}_j / W̄^{s-1})    (8)

that takes both the vertex's willingness to move and the immediately previous workload into consideration. Consider a computational node Nj for a vertex to be moved to. If its workload in the previous (s−1)-th superstep, W^{s-1}_j, is higher than the average workload W̄^{s-1}, then we do not attempt to move a vertex to Nj. There are many possible ways to define such a function; for simplicity, we show the simplest one here. Accordingly, we give a new willingness function, ϖ^s_3(u).

    ϖ^s_3(u) = argmax_j IO″^s_{i,j}(u)    (9)

D. Implementation
We have implemented our prototype system on HAMA [1] (Version 0.4), which is a BSP computing framework on top of HDFS (the Hadoop Distributed File System). Our proposed methods can easily be migrated to any BSP-based graph processing system. Assume that there are K computational nodes, N1, N2, · · · , NK. We discuss the two main components in our prototype system on top of HAMA: workload monitoring and workload moving.
Workload Monitoring: On HAMA, in every superstep, a vertex is active if it was active already, or if it was inactive in the previous superstep but receives messages in the current superstep. The number of active vertices, as well as the number of messages received and the number of messages sent, can be collected. The workload W^s_i in the s-th superstep on Ni, for 1 ≤ i ≤ K, is sent to all the other computational nodes, and in the (s+1)-th superstep this collected workload assists the determination of the workload to be moved among computational nodes, Q^{s+1}_{i,j}, for 1 ≤ i ≤ K and 1 ≤ j ≤ K. The information exchange can be done using the built-in functions, for either synchronization or broadcasting, provided by BSP-based graph processing systems. In our implementation on HAMA, we define several subclasses of the HAMA BSPMessage class and exchange messages using the HAMA MessageManager. Based on the workload, each computational node selects its active vertices to move.
Workload Moving: The quota Q^s_{i,j} on Ni is known at the beginning of the s-th superstep. On a node Ni, we maintain K priority queues, where each priority queue, P_{i,j}, maintains the top active vertices that are willing to move from Ni to Nj, up to Q^s_{i,j}, using the willingness score functions defined in the previous subsection. An active vertex appears in at most one queue, and inactive vertices do not appear in any queue. At the end of the superstep, before entering the synchronization barrier, we remove the top active vertices from each priority queue P_{i,j} on Ni and move them to Nj.
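The per-destination queues can be sketched with heapq (an illustrative structure, not HAMA's API: scores come from the willingness functions above, and costs are the per-vertex workloads W^s_i(u) = 1 + active edges):

```python
import heapq

def plan_moves(candidates, quotas):
    """For each destination j, keep a max-heap P_{i,j} of candidate vertices
    and pop the most willing ones until the quota Q^s_{i,j} runs out.
    candidates: list of (vertex, dest, score, cost); quotas: dest -> Q^s_{i,j}.
    Returns dest -> list of vertices to ship at the barrier."""
    queues = {}
    for v, j, score, cost in candidates:
        # heapq is a min-heap, so push negated scores for max-first order
        heapq.heappush(queues.setdefault(j, []), (-score, v, cost))
    moves = {}
    for j, heap in queues.items():
        delta = quotas.get(j, 0)
        while heap and delta > 0:
            _, v, cost = heapq.heappop(heap)
            if cost < delta:                     # fits in the remaining quota
                moves.setdefault(j, []).append(v)
                delta -= cost
    return moves

cands = [("a", 1, 9, 2), ("b", 1, 5, 3), ("c", 2, 7, 4)]
print(plan_moves(cands, {1: 4, 2: 3}))  # -> {1: ['a']}
```

Each vertex appears under exactly one destination (its willingness argmax), matching the rule that an active vertex sits in at most one queue.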
It is important to note that we need to trace where the vertices are when they are moved from one computational node to another. This is because, initially, if a computational
TABLE II
DATASETS

Graph              Type             |V |    |E|
CA-CondMat³        Collaboration    23K     186K
citeseerx⁴         Academic         6.5M    15M
cit-Patents³       Citation         3.8M    16.5M
dblp⁶              Academic         1M      8M
Flickr⁵            Social Networks  80K     11.8M
soc-LiveJournal1³  Social Networks  4.8M    70M
web-Google³        Web graphs       0.8M    5.9M
uk-2005⁷           Web graphs       39.5M   936M
node, Ni, maintains a vertex u that links to another vertex v not in Ni, then Ni knows which computational node Nj keeps v. Since a vertex v can be moved around, Ni needs to identify the computational node Nk that currently holds v. If the BSP-based graph processing system has built-in functions that support transactional vertex movement ensuring the ACID properties, we use those functions to move vertices. Otherwise, we can support vertex moving by using a state-of-the-art graph partition management module such as Zephyr [9] or a lookup table [24]. In our implementation on HAMA version 0.4, we implement the lookup table as used in [24] with an additional caching function to reduce the overhead.
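A vertex-location lookup table with a small cache can be sketched as follows (an illustrative design, not HAMA's or [24]'s API: a dict as the authoritative table plus a per-node cache invalidated on movement):

```python
class VertexLocator:
    """Minimal sketch of a lookup-table service: records which computational
    node currently holds each vertex, with a local cache to cut lookups."""

    def __init__(self, initial_placement):
        self.table = dict(initial_placement)  # authoritative vertex -> node
        self.cache = {}                       # per-node cache of lookups

    def locate(self, v):
        if v not in self.cache:               # cache miss: consult the table
            self.cache[v] = self.table[v]
        return self.cache[v]

    def moved(self, v, new_node):
        """Called when v migrates; keeps table and cache consistent."""
        self.table[v] = new_node
        self.cache.pop(v, None)               # drop any stale cached entry

loc = VertexLocator({"u": 0, "v": 1})
print(loc.locate("v"))   # 1
loc.moved("v", 2)
print(loc.locate("v"))   # 2
```

The cache only pays off because vertices move rarely (the β cooldown above); every migration must invalidate the cached entry, or stale locations would misroute messages.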
V. EXPERIMENTS
We have implemented our prototype system on HAMA [1] (Version 0.4) on top of HDFS (the Hadoop Distributed File System). We conducted extensive testing in a cloud environment with 24 PCs connected by a 100Mbps network, where each PC is equipped with an Intel i3-2100 3.1GHz CPU, 4GB memory, and a 320GB disk, running Scientific Linux release 6.1. Table II shows basic information about the graph datasets used in our experiments.
Nine Policies: In the following, we show the performance of our proposed policies based on ϖ^s_0 (Eq. (4)), ϖ^s_1 (Eq. (5)), ϖ^s_2 (Eq. (7)), and ϖ^s_3 (Eq. (9)). Here, the default β used in ϖ^s_1 is β = 1, and the default λ used for ϖ^s_2 is λ = 1.05. The 8 policies are formed as follows: P0 (ϖ^s_0), P1 (ϖ^s_3), P2 (ϖ^s_2), P3 (ϖ^s_3 + ϖ^s_2), P4 (ϖ^s_1), P5 (ϖ^s_3 + ϖ^s_1), P6 (ϖ^s_2 + ϖ^s_1), and P7 (ϖ^s_3 + ϖ^s_2 + ϖ^s_1). We tested all the policies against all the graph datasets (Table II). Initially, a graph is randomly partitioned into 16 partitions. Because all graphs show similar behaviors, we report the results measured on the CA-Condmat dataset.
We consider the edges that connect to the active vertices. We call an edge an active edge if it connects to at least one active vertex. Fig. 6(a) shows the percentage of the total number of active internal edges over the total number of active edges for the CA-Condmat dataset, and Fig. 6(b) shows the percentage of the total number of moved active vertices over the total number of active vertices in the CA-Condmat dataset, for all 8 policies and all 9 graph algorithms.

³http://snap.stanford.edu/data/
⁴http://csxstatic.ist.psu.edu/about/data
⁵http://socialcomputing.asu.edu/datasets/Flickr
⁶http://dblp.uni-trier.de/xml/
⁷http://law.dsi.unimi.it/webdata/uk-2005/

[Fig. 6. Nine Policies: (a) the percentage of active internal edges; (b) the percentage of moved active vertices, per policy (P0–P7) and per algorithm.]
A higher percentage implies a lower communication cost between computational nodes. As shown in Fig. 6(a), the policies P3 and P7 outperform the other policies in most cases. In Fig. 6(b), the policy P7 moves a smaller number of vertices than P3. This suggests that the combination of the three functions, ϖ^s_3 + ϖ^s_2 + ϖ^s_1, reduces the maximum number of external edges effectively with a small overhead. In the following, we use P7 as the default policy and call it CatchW.
With CatchW, we further test the graph datasets. As a basis for comparison, Fig. 7(a) shows the percentage of the total number of active internal edges over the total number of active edges using the initial random partitioning without CatchW. This percentage is very low (4–6%), which implies a high communication cost due to the large percentage of active external edges across different computational nodes. Fig. 7(b) and Fig. 7(c) show that, with CatchW, we can increase the number of active internal edges up to 70% with a small overhead (the number of moved active vertices).
CatchW vs METIS: Given a set of active vertices in a superstep, it is infeasible to run METIS to partition using the active vertices, due to its high overhead. In this testing, as an indicator, we compare CatchW with METIS. For METIS, we initially partition the CA-Condmat dataset into 16 partitions using METIS, and we re-run METIS every superstep using the information of the active vertices. For CatchW, we randomly partition the CA-Condmat dataset into 16 partitions and use the default policy P7 to balance the workload at run time. As expected, METIS outperforms CatchW in terms of the percentage of the total number of active internal edges over the total number of active edges. However, Fig. 8 shows that our CatchW can approach METIS, which dynamically repartitions the graph with a high overhead, for four represen-
Percentage of Internal Edges<br />
Percentage of Internal Edges<br />
Percentage of moved vertices<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
20<br />
15<br />
10<br />
100<br />
5<br />
0<br />
80<br />
60<br />
40<br />
20<br />
0<br />
PR<br />
BFS<br />
SSSP<br />
Coloring<br />
Semi-cluster<br />
RWR<br />
CM<br />
citex<br />
cit-P<br />
dblp<br />
Flickr<br />
LJ<br />
Google<br />
MatchingMST MIS<br />
(a) The Percentage of Active Internal Edges (Random)<br />
PR<br />
CM<br />
citex<br />
cit-P<br />
dblp<br />
Flickr<br />
LJ<br />
Google<br />
MIS<br />
MST<br />
Matcching<br />
RWR<br />
BFS<br />
SSSP<br />
Coloring<br />
Semi-cluster<br />
(b) The Percentage of Active Internal Edges (<str<strong>on</strong>g>Catch</str<strong>on</strong>g>W)<br />
CM<br />
citex<br />
cit-P<br />
dblp<br />
Flickr<br />
LJ<br />
Google<br />
PR<br />
BFS<br />
SSSP<br />
Coloring<br />
Semi-cluster<br />
RWR Matcching<br />
MIS<br />
MST<br />
(c) The Percentage of Moved Active Vertices (<str<strong>on</strong>g>Catch</str<strong>on</strong>g>W)<br />
Fig. 7. Various of <str<strong>on</strong>g>Graph</str<strong>on</strong>g> Datasets<br />
tative graph algorithms, PageRank, SSSP, Random-Walk, and<br />
MST.<br />
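The definition of policy P7 is given earlier in the paper and is not restated in this section. Purely as a hedged illustration of the kind of run-time rebalancing CatchW performs, the sketch below greedily proposes moving each active vertex to the partition holding most of its active neighbors, under a small move budget; the function name, data layout, and budget mechanism are all our assumptions, not the paper's exact policy.

```python
from collections import Counter

def plan_moves(adj, partition_of, active, budget):
    """Greedy sketch of a CatchW-style rebalancing step (illustrative only,
    not the paper's exact policy P7).

    For each active vertex, find the partition holding the plurality of its
    active neighbors; propose a move there if it differs from the current
    assignment. Stop after `budget` moves so the overhead (the number of
    moved active vertices, cf. Fig. 7(c)) stays small.
    """
    moves = {}
    for v in active:
        if len(moves) >= budget:
            break
        tally = Counter(partition_of[u] for u in adj[v] if u in active)
        if not tally:
            continue  # no active neighbors, nothing to gain by moving
        target, _ = tally.most_common(1)[0]
        if target != partition_of[v]:
            moves[v] = target
    return moves
```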
Cold-Start and Hot-Start: We test the cold-start and hot-start of different graph algorithms when we use CatchW to dynamically balance the workload. All 9 graph algorithms are named A1, A2, ..., A9 (refer to Table III). Fig. 9 shows the percentages of the total number of active internal edges over the total number of active edges for the CA-Condmat dataset, which is initially randomly partitioned into 16 partitions. Fig. 10 shows the percentage of the total number of moved active vertices over the total number of active vertices. We explain our testing results, taking PageRank (A1) as an example. In Fig. 9(a), the horizontal blue line shows the percentage of the number of active internal edges when we run PageRank under the cold-start, using the initial randomly partitioned graph. Each of the 9 bars, representing the 9 graph algorithms, shows the same percentage for PageRank using the graph partition resulting from one of the 9 graph algorithms under the hot-start. In a similar fashion, in Fig. 10(a), the horizontal blue line shows the percentage of the number of active vertices moved when we run PageRank under the cold-start, using the initial randomly partitioned graph. Each of the 9 bars, representing the 9 graph algorithms, shows the same percentage for PageRank using the graph partition resulting from one of the 9 graph algorithms under the hot-start. Fig. 9(a) and Fig. 10(a) show that the moved active vertices help shift the workload in the right direction. In other words, with a small overhead (Fig. 10(a)), similar results can be obtained (Fig. 9(a)). Similar patterns can be seen for the other algorithms as well.

[Fig. 8. The Percentage of Internal Edges: CatchW vs METIS per superstep, for (a) PageRank (A1), (b) SSSP (A4), (c) Random-Walk (A6), (d) MST (A8).]

[Fig. 9. Hot Starts: The Percentage of Internal Edges, for (a) PageRank (A1), (b) Semi-clustering (A2), (c) Graph Coloring (A3), (d) SSSP (A4), (e) BFS (A5), (f) Random-Walk (A6), (g) MM (A7), (h) MST (A8), (i) MIS (A9); each panel has one bar per algorithm A1-A9.]
Large Dataset: We show our testing results using the uk-2005 dataset, which has roughly 39M vertices and 936M edges. We randomly partition it into 24 partitions, and compare CatchW with HAMA. Fig. 11 shows the results for PageRank in every superstep. Fig. 11(a) shows the percentages of the number of active internal edges over the total number of active edges per superstep; CatchW maintains over 3.4 times as many active internal edges as HAMA. Fig. 11(b) shows that the total execution time is reduced by 31.5%, and Fig. 11(c) shows that the total communication time is reduced by 42.6%. Note that the communication time is included in the execution time. Fig. 12(a) and Fig. 12(b) also show the percentages of the number of active internal edges over the total number of active edges in every superstep when computing SSSP and MM, respectively. For SSSP and MM, the execution times are reduced by 2% and 9%, respectively.
[Fig. 10. Hot Starts: The Percentage of Moved Active Vertices, for (a) PageRank (A1), (b) Semi-clustering (A2), (c) Graph Coloring (A3), (d) SSSP (A4), (e) BFS (A5), (f) Random-Walk (A6), (g) MM (A7), (h) MST (A8), (i) MIS (A9); each panel has one bar per algorithm A1-A9.]
[Fig. 11. PageRank (per superstep), HAMA vs CatchW: (a) percentage of internal edges, (b) execution time (s), (c) communication time (s).]
Scaleup: We test the scaleup using a part of uk-2005 that can be held in 8 computational nodes, and increase the number of computational nodes to 24. Our CatchW outperforms HAMA (Fig. 13).
VI. RELATED WORKS

Graph partitioning is to partition a large graph into a number of partitions, and is widely used to support scientific simulations in a parallel environment [21]. Because graph partitioning is known to be an NP-complete problem [11], [4], many heuristic and approximate approaches have been proposed. Schloegel et al. in [21] survey static and dynamic graph partitioning methods, where the former implies that all information is given for graph partitioning, and the latter works with only partial information. We discuss the approaches based on [21].

The static graph partitioning techniques include combinatorial techniques, spectral methods, and multilevel schemes. The combinatorial techniques partition a graph based on its highly connected components; the representative approaches are the KL (Kernighan-Lin) algorithm and the FM (Fiduccia-Mattheyses) algorithm. Spectral methods partition a graph based on the Laplacian of the graph's matrix representation, computing the eigenvector for the second-smallest eigenvalue of the Laplacian. Multilevel schemes consist of three phases. In the first, graph coarsening, phase, they recursively collapse vertices/edges into a smaller, coarser graph. In the second, initial partitioning, phase, they partition the sufficiently coarse graph into partitions. In the multilevel refinement phase, they project the partitioning results from the coarsest graph back to the original graph, level by level. The representative
[Fig. 12. Percentage of internal edges per superstep, HAMA vs CatchW: (a) SSSP, (b) MM.]

[Fig. 13. Scaleup: running time (s) of CatchW vs HAMA on clusters of 8, 12, 16, and 24 nodes.]
method is METIS [12]. A qualitative comparison is given in [21]. These approaches cannot be used to address our problem for two reasons. First, it is difficult to know in advance all the information needed by a graph algorithm. Second, it is infeasible to repartition the graph before processing each graph algorithm due to the high overhead.
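To make the spectral method above concrete, here is a minimal textbook sketch of spectral bisection using NumPy: it splits vertices by the sign of the Fiedler vector, the eigenvector for the second-smallest eigenvalue of L = D - A. This is our own illustration, not code from METIS or [12].

```python
import numpy as np

def spectral_bipartition(adj_matrix):
    """Split a graph in two by the sign of the Fiedler vector
    (a minimal sketch of the spectral method; assumes an undirected,
    connected graph given as a dense adjacency matrix)."""
    A = np.asarray(adj_matrix, dtype=float)
    D = np.diag(A.sum(axis=1))
    L = D - A  # graph Laplacian
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 is the eigenvector for the second-smallest eigenvalue.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    part_a = [i for i in range(len(A)) if fiedler[i] >= 0]
    part_b = [i for i in range(len(A)) if fiedler[i] < 0]
    return part_a, part_b
```

On two triangles joined by a single bridge edge, the sign split recovers the two triangles exactly.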
Dynamic graph partitioning (or adaptive graph partitioning) aims to minimize the communication cost of rebalancing workloads among computational nodes. The objective function takes both the balance and the cost to balance (the number of vertices to be moved) into consideration. There are repartitioning methods, scratch-remap methods, and diffusion-based methods. The repartitioning methods partition the graph either from scratch or by some cut-and-paste strategy. The scratch-remap methods compute the new partitioning, and minimize the cost of moving from the existing old partitions to the new partitions. The diffusion-based methods take an incremental approach and repeatedly migrate workload from the overloaded computational nodes to their underloaded neighbor nodes. The diffusion-based methods mainly address two questions: how much to move and what to move. It is worth noting that the diffusion-based methods need several iterations to achieve global balancing. In our problem setting, we have to rebalance a small amount of data simultaneously within a single iteration.
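As a hedged sketch of the diffusion idea ("how much to move"), the toy routine below repeatedly ships a fraction of each node's surplus over a neighbor to that neighbor until the loads even out across the node graph; the names and damping factor are our own choices, not taken from [21].

```python
def diffuse_load(load, neighbors, alpha=0.5, iters=50):
    """First-order diffusion sketch of load balancing (illustrative only).

    Each computational node repeatedly sends alpha * (surplus over a
    neighbor) / degree to that neighbor; flows in one round are computed
    from the loads at the start of the round, so total load is conserved.
    Note it takes many iterations to reach a global balance, which is
    exactly the drawback noted in the text.
    """
    load = dict(load)
    for _ in range(iters):
        flows = {n: 0.0 for n in load}
        for u in load:
            deg = len(neighbors[u])
            for v in neighbors[u]:
                diff = load[u] - load[v]
                if diff > 0:  # only overloaded nodes send
                    amount = alpha * diff / deg
                    flows[u] -= amount
                    flows[v] += amount
        for n in load:
            load[n] += flows[n]
    return load
```

On a 3-node path with all load on one end, the loads converge toward the average while the total stays constant.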
Devine et al. in [8] survey partitioning and load balancing for emerging parallel applications and architectures.
VII. CONCLUSION

In this paper, we investigate graph behaviors by exploring the working window changes in parallel computing on HAMA, a public open-source implementation of Pregel, for nine graph algorithms using real datasets. We show that it is possible to use the immediately preceding working window as a basis to catch the working window of the next superstep in a Pregel-like system. We propose simple yet effective policies that move a small number of active vertices in order to achieve high graph workload balancing. We confirm the effectiveness and the efficiency of our policies on a prototype system built on top of HAMA [1].
ACKNOWLEDGMENT

The work was supported by grant No. 418512 of the Research Grants Council of the Hong Kong SAR, China.
REFERENCES

[1] http://hama.apache.org/.
[2] http://incubator.apache.org/giraph/.
[3] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker. High speed switch scheduling for local area networks. In ASPLOS '92, 1992.
[4] K. Andreev and H. Räcke. Balanced graph partitioning. In SPAA '04, 2004.
[5] B. V. Cherkassky, A. V. Goldberg, and T. Radzik. Shortest paths algorithms: theory and experimental evaluation. Math. Program., 73(2), 1996.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04, 2004.
[8] K. D. Devine, E. G. Boman, and G. Karypis. Partitioning and load balancing for emerging parallel applications and architectures. In Frontiers of Scientific Computing. SIAM, 2006.
[9] A. J. Elmore, S. Das, D. Agrawal, and A. El Abbadi. Zephyr: live migration in shared nothing databases for elastic cloud platforms. In SIGMOD '11, 2011.
[10] R. G. Gallager, P. A. Humblet, and P. M. Spira. A distributed algorithm for minimum-weight spanning trees. ACM Trans. Program. Lang. Syst., 5(1), 1983.
[11] M. R. Garey, D. S. Johnson, and L. J. Stockmeyer. Some simplified NP-complete problems. In STOC '74, 1974.
[12] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20, 1998.
[13] N. Linial. Locality in distributed graph algorithms. SIAM J. Comput., 21(1), 1992.
[14] M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput., 15(4), 1986.
[15] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD '10, 2010.
[16] J. Mondal and A. Deshpande. Managing large dynamic graphs efficiently. In SIGMOD '12, 2012.
[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab, 1999.
[18] K. Pearson. The problem of the random walk. Nature, 72, 1905.
[19] D. Peleg. Distributed Computing: A Locality-Sensitive Approach. SIAM, 2000.
[20] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez. The little engine(s) that could: scaling online social networks. In SIGCOMM '10, 2010.
[21] K. Schloegel, G. Karypis, and V. Kumar. Graph partitioning for high performance scientific simulations. Tech. Report TR 00-018, Computer Science and Engineering, University of Minnesota, 2000.
[22] B. Shao, H. Wang, and Y. Xiao. Managing and mining large graphs: systems and implementations (tutorial). In SIGMOD '12, 2012.
[23] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In KDD '12, 2012.
[24] A. L. Tatarowicz, C. Curino, E. P. C. Jones, and S. Madden. Lookup tables: fine-grained partitioning for distributed databases. In ICDE '12, 2012.
[25] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8), 1990.
[26] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In SIGMOD '12, 2012.