Executing Task Graphs Using Work-Stealing - IPDPS
Executing Task Graphs<br />
Using Work-Stealing<br />
Jim Sukha<br />
MIT CSAIL<br />
IPDPS, 4/21/2010<br />
Kunal Agrawal (Washington University in St. Louis)<br />
Charles E. Leiserson (MIT)
Task Graphs<br />
A task graph is a directed acyclic graph (dag) where<br />
Every node A is a task requiring computation,<br />
Every edge (A, B) means that the computation of B<br />
depends on the result of A’s computation.<br />
[Figure: an example task graph with source s, tasks A–G, and sink t]
Example: Counting Paths<br />
Counting the number of paths through a dag can be<br />
expressed as a task graph.<br />
The source starts with value 1.<br />
The value of a node X is the sum of the values of<br />
X’s immediate predecessors.<br />
[Figure: the example dag with source s (value 1), nodes A–G annotated with their path counts, and sink t (value 3)]
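The path-counting rule above can be sketched in plain C++ (not Nabbit code; the dag, the node numbering, and the function name are our own illustration): when nodes are numbered in topological order and edges are sorted by source node, one pass over the edges accumulates each node's predecessor values.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch: value(X) = sum of the values of X's immediate predecessors;
// the source (node 0) starts with value 1. Assumes nodes are numbered
// in topological order and edges are sorted by source node.
std::vector<long> countPaths(int n,
        const std::vector<std::pair<int, int>>& edges) {
    std::vector<long> value(n, 0);
    value[0] = 1;                      // source
    for (auto [a, b] : edges)          // edge (a, b): b depends on a
        value[b] += value[a];
    return value;
}
```

On a small diamond-shaped dag with a chord, the sink accumulates one count per distinct source-to-sink path.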
Fork-Join Languages<br />
Some task graphs can be executed in parallel using<br />
a fork-join language such as Cilk++.<br />
void f1() {<br />
  A();<br />
  cilk_spawn B();<br />
  C();<br />
  cilk_sync;<br />
  D();<br />
}<br />
void f2() {<br />
  E(); F(); G();<br />
}<br />
int main() {<br />
  cilk_spawn f1();<br />
  f2();<br />
  cilk_sync;<br />
}<br />
[Figure: the resulting dag from source s to sink t, with f1 computing A, B, C, D and f2 computing E, F, G]
Graphs with Arbitrary Dependencies<br />
Unfortunately, one cannot directly express task<br />
graphs with arbitrary dependencies using only<br />
spawn and sync.<br />
Counting paths in a non-series-parallel dag.<br />
[Figure: a non-series-parallel dag from source s to sink t; path counts annotate the nodes, with the sink t reaching 5]<br />
Question: Can we efficiently execute arbitrary task<br />
graphs in parallel in a fork-join language such as Cilk++?
Our Contributions: Nabbit<br />
We are developing Nabbit, a Cilk++ library for<br />
executing task graphs with arbitrary dependencies.<br />
Nabbit is built on top of Cilk++. It utilizes Cilk++’s<br />
provably-efficient work-stealing scheduler without<br />
any modification to the Cilk++ runtime.<br />
Using Nabbit, the computation of an individual task<br />
graph node can itself be parallel.<br />
[Figure: the example task graph with source s, nodes A–G, and sink t]
Provable Bounds for Nabbit<br />
We are developing Nabbit, a Cilk++ library for<br />
executing task graphs with arbitrary dependencies.<br />
Nabbit offers provable bounds on the time<br />
required for parallel execution of (static and<br />
dynamic) task graphs.<br />
The time bounds for Nabbit are asymptotically<br />
optimal for task graphs whose nodes have<br />
constant in-degree and out-degree.
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Dynamic-Programming Example<br />
Generic dynamic programs can often be<br />
expressed as a task graph.<br />
M(i, j) = max{ g₁(M(i-1, j-1)), E(i, j), F(i, j) }<br />
E(i, j) = g₂( M(1, j), M(2, j), …, M(i-1, j) )<br />
F(i, j) = g₃( M(i, 1), M(i, 2), …, M(i, j-1) )<br />
[Figure: grid of cells (i, j); each M(i, j) depends on earlier cells in its row, its column, and its diagonal]
Static Task Graphs<br />
For this example, we can use a static task graph,<br />
i.e., a task graph where the structure of the dag is<br />
known before the execution begins. *<br />
M(i, j) = max{ g₁(M(i-1, j-1)), E(i, j), F(i, j) }<br />
E(i, j) = g₂( M(1, j), M(2, j), …, M(i-1, j) )<br />
F(i, j) = g₃( M(i, 1), M(i, 2), …, M(i, j-1) )<br />
[Figure: grid of cells (i, j) with dependency edges between cells]<br />
Create a node for every cell M(i, j). Then add dependency edges.<br />
* In Nabbit, static task graphs still require dynamic scheduling.<br />
We assume the compute time for each node may be unknown.
Interface for Static Nabbit<br />
For static task graphs, each task graph node is<br />
derived from Nabbit’s DAGNode class and<br />
overrides the node’s Compute() method.<br />
class Mnode: public DAGNode {<br />
Mnode(int key, MDag* dag);<br />
void Compute();<br />
};<br />
class MDag {<br />
int N; int *M;<br />
Mnode* g;<br />
MDag(int N_, int* M_);<br />
};<br />
Typically, each node needs to know its identity and<br />
the global parameters of the task graph.<br />
The programmer builds their own task graph.
Constructing a Static Task Graph<br />
Programmers use Nabbit’s AddDep method to<br />
specify dependencies between task graph nodes.<br />
class MDag {<br />
  int N; int* M; Mnode* g;<br />
  MDag(int N_, int* M_) : N(N_), M(M_) {<br />
    // Allocate nodes.<br />
    g = new Mnode[N*N];<br />
    for (int i = 0; i < N; i++) {<br />
      for (int j = 0; j < N; j++) {<br />
        int k = N*i + j;<br />
        g[k].key = k; g[k].dag = this;<br />
        // M(i, j) has edges from M(i-1, j) and M(i, j-1).<br />
        if (i > 0) g[k].AddDep(&g[k-N]);<br />
        if (j > 0) g[k].AddDep(&g[k-1]);<br />
      }<br />
    }<br />
  }<br />
};
Implementing Task Nodes<br />
<strong>Task</strong> graph nodes inherit from a DAGNode class,<br />
and override the node’s Compute() method.<br />
class Mnode: public DAGNode {<br />
  int i, j;<br />
  void Compute() {<br />
    int z = -INFINITY;<br />
    int Eij = calcE(dag->M, i, j);<br />
    int Fij = calcF(dag->M, i, j);<br />
    if ((i > 0) && (j > 0))<br />
      z = g1(dag->M, i, j);<br />
    dag->M[key] = max(z, Eij, Fij);<br />
  }<br />
};<br />
One can call other Cilk++ functions inside the<br />
Compute() method, including methods that<br />
spawn and sync.
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Static Nabbit Implementation<br />
Nabbit uses a simple algorithm to execute static<br />
task graphs in parallel. Each node<br />
1. Maintains a count of the # of its immediate predecessors that are<br />
still incomplete. (Each node keeps a join counter.)<br />
2. Notifies its immediate successors in parallel after it is computed.<br />
3. Recursively computes any successors which become ready.<br />
void ComputeAndNotify() {<br />
  this->Compute();<br />
  cilk_for (int q = 0; q < successors.size(); q++) {<br />
    DAGNode* Y = successors[q];<br />
    int val = AtomicDecAndFetch(&Y->join);<br />
    if (val == 0) Y->ComputeAndNotify();<br />
  }<br />
}<br />
Start execution by calling ComputeAndNotify<br />
from the source (root) node.
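The three steps above can be sketched sequentially in plain C++ (an illustration only: Nabbit notifies successors in parallel with cilk_for under Cilk++'s work-stealing scheduler, and the types and names here are ours, not Nabbit's actual interface):

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Minimal join-counter executor (sequential sketch of the protocol).
struct Node {
    std::atomic<int> join{0};          // # of incomplete predecessors
    std::vector<int> successors;
    std::function<void()> compute;
};

void computeAndNotify(std::vector<Node>& g, int x) {
    g[x].compute();                    // run this node's Compute()
    for (int s : g[x].successors) {
        // Atomic decrement: whoever drops a successor's join counter
        // to zero is responsible for recursively computing it.
        if (g[s].join.fetch_sub(1) - 1 == 0)
            computeAndNotify(g, s);
    }
}
```

Because exactly one decrement takes each counter to zero, every node is computed exactly once, after all of its predecessors.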
Work-Stealing in Nabbit<br />
Nabbit is able to rely on Cilk++’s work-stealing<br />
scheduler to load-balance the computation.<br />
When a processor runs<br />
out of work, it tries to<br />
steal work from other<br />
processors.<br />
Nabbit spawns task<br />
nodes in a way that<br />
makes the Cilk++<br />
runtime likely to steal<br />
nodes along the critical<br />
path of the task graph.<br />
[Figure: sample dynamic-program execution with P = 4 processors]
Smith-Waterman Dynamic Program<br />
As a benchmark, we consider a dynamic program<br />
modeling the Smith-Waterman algorithm with a<br />
generic gap penalty:<br />
M(i, j) = max{ M(i-1, j-1) + s(i, j), E(i, j), F(i, j) }<br />
E(i, j) = max over k∊{0, 1, …, i-1} of M(k, j) + γ(i-k)<br />
F(i, j) = max over k∊{0, 1, …, j-1} of M(i, k) + γ(j-k)<br />
In this example, s(i, j) and γ(k) are constant arrays.<br />
[Figure: grid of cells (i, j) showing the dependency structure]
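A sequential reference for this recurrence can be sketched as follows (a sketch under assumptions: the arrays sArr and gamma are made-up inputs, and the base case M(i, 0) = M(0, j) = 0 is our choice for the sketch, not taken from the talk):

```cpp
#include <algorithm>
#include <vector>

// Sequential evaluation of:
//   M(i,j) = max{ M(i-1,j-1) + s(i,j), E(i,j), F(i,j) }
//   E(i,j) = max_{0<=k<i} M(k,j) + gamma(i-k)
//   F(i,j) = max_{0<=k<j} M(i,k) + gamma(j-k)
// with first row and column fixed at 0 (assumed base case).
std::vector<std::vector<long>> evalRecurrence(int n,
        const std::vector<std::vector<long>>& sArr,
        const std::vector<long>& gamma) {
    std::vector<std::vector<long>> M(n, std::vector<long>(n, 0));
    for (int i = 1; i < n; i++)
        for (int j = 1; j < n; j++) {
            long best = M[i-1][j-1] + sArr[i][j];   // diagonal term
            for (int k = 0; k < i; k++)             // E(i, j)
                best = std::max(best, M[k][j] + gamma[i-k]);
            for (int k = 0; k < j; k++)             // F(i, j)
                best = std::max(best, M[i][k] + gamma[j-k]);
            M[i][j] = best;
        }
    return M;
}
```

This naive loop is the correctness baseline; the task-graph versions compute the same table with the cells as dag nodes.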
Comparison with Divide-and-Conquer<br />
For the dynamic program, we compare the task graph<br />
evaluation using Nabbit with alternative algorithms.<br />
K-way divide-and-conquer (shown for K=2)<br />
Wavefront (synchronous)<br />
[Figure: recursive block ordering for 2-way divide-and-conquer, and diagonal wavefront numbering of the blocks]<br />
For all algorithms, the base case is blocks of size B by B.<br />
The grid is arranged in a cache-oblivious layout.
16-core AMD Barcelona (Opteron 8354, 2.2 GHz)<br />
Comparing implementations of the dynamic program.<br />
[Figure: two speedup plots comparing Nabbit, K-way divide-and-conquer (K=2 and K=5), and wavefront; serial running times 4.4 s (c = 4.4e-9 if Tₛ ≈ cN³) and 664 s (c = 5.3e-9 if Tₛ ≈ cN³)]
Comparison, N=3000, P=16<br />
[Figure: running-time comparison of Nabbit, wavefront, and divide-and-conquer (K=2 and K=5) at N=3000 on 16 processors]
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Definitions<br />
Let D=(V, E) be a task graph to execute.<br />
Consider the execution dag associated with the<br />
Compute() method of a task node A∊V.<br />
W(A): the work of A (the # of nodes in A’s execution dag)<br />
S(A): the span of A (the length of the longest path in that dag)<br />
M: the # of task nodes on the longest path through the task graph D<br />
∆: the maximum degree of any task node<br />
[Figure: an execution dag for a node A with W(A) = 11 and S(A) = 6, and the example task graph (s, A–G, t) with M = 5 and ∆ = 2]
Work and Span of a Task Graph<br />
We can define a “total” work and span for the<br />
task graph execution. Define T₁ and T∞ as:<br />
T₁ = Σ_{A∊V} W(A) + O(|E|)<br />
T∞ = max over all paths p through D of Σ_{A∊p} ( S(A) + O(1) )<br />
Any execution of the task graph on P processors<br />
requires time at least max{ T₁/P, T∞ }.<br />
[Figure: the example task graph with source s, nodes A–G, and sink t]
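These definitions can be checked on a small example (an illustrative sketch with our own names; the O(|E|) and O(1) bookkeeping terms are dropped, and nodes are assumed numbered topologically with edges sorted by source):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// T1 sums the per-node work W(A); T_inf is the maximum over
// source-to-sink paths of the summed per-node spans S(A).
struct TaskGraph {
    std::vector<long> W, S;                    // per-node work and span
    std::vector<std::pair<int, int>> edges;    // sorted by source node
};

long totalWork(const TaskGraph& d) {
    long t = 0;
    for (long w : d.W) t += w;
    return t;
}

long totalSpan(const TaskGraph& d) {
    // longest[v] = maximum span-weight of any path ending at node v
    std::vector<long> longest(d.S.begin(), d.S.end());
    for (auto [a, b] : d.edges)
        longest[b] = std::max(longest[b], longest[a] + d.S[b]);
    long best = 0;
    for (long v : longest) best = std::max(best, v);
    return best;
}
```

For a three-node chain with W = (2, 3, 4) and S = (1, 2, 1), this gives T₁ = 9 and T∞ = 4, matching the definitions by hand.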
Completion Time for Static Nabbit<br />
THEOREM 1: Nabbit executes a static task graph<br />
D = (V, E) on P processors in expected time<br />
O( T₁/P + T∞ + M lg ∆ + C(D) ),<br />
where C(D) = O( |E|/P + M min{∆, P} ).<br />
T₁/P + T∞ : the bound for ordinary Cilk-like work-stealing.<br />
M lg ∆ : the span of notifying task node successors.<br />
C(D) : worst-case contention for atomic decrements<br />
(min{∆, P} is the time one decrement can wait).<br />
The bound is asymptotically optimal when ∆ = Θ(1).<br />
(Recall: M is the # of nodes on the longest path through the<br />
task graph D, and ∆ is the maximum degree of any task node.)
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Dynamic Nabbit<br />
Nabbit also supports dynamic task graphs. Roughly, a<br />
dynamic task graph can be thought of as performing a<br />
parallel traversal of a two-phase dag, where the first<br />
Init() phase creates new nodes.<br />
[Figure: a dynamic task graph drawn as a two-phase dag over nodes A–H. The Init() section is linked by creation edges (a node is initialized on its first visit); the Compute() section is linked by dependency edges (a node is computed on its last visit).]
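A rough sequential analogue of this two-phase idea (not Nabbit's implementation or API; the names deps, combine, and evaluate are hypothetical) is memoized recursion: reaching an unknown key plays the role of Init() by discovering that node's dependencies, and the node's value is computed once all of those dependencies have values.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

// Sequential sketch of a dynamic task graph: the graph's structure
// is discovered during the traversal rather than built up front.
struct DynDag {
    // deps(key) plays the role of Init(): it names the predecessors.
    std::function<std::vector<int>(int)> deps;
    // combine(key, vals) plays the role of Compute().
    std::function<long(int, const std::vector<long>&)> combine;
    std::unordered_map<int, long> done;

    long evaluate(int key) {
        auto it = done.find(key);
        if (it != done.end()) return it->second;   // already computed
        std::vector<long> vals;
        for (int d : deps(key))                    // visit/create preds
            vals.push_back(evaluate(d));
        return done[key] = combine(key, vals);
    }
};
```

With deps(k) = {k-1, k-2} for k ≥ 2, this traversal unfolds a Fibonacci-style dag on demand, one node per distinct key.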
Complications for Dynamic Nabbit<br />
Dynamic task graphs are more complicated because<br />
Init() and Compute() happen concurrently.<br />
[Figure: the same two-phase dag as on the previous slide; the Init() and Compute() sections of different nodes overlap during execution.]
Completion Time for Dynamic Nabbit<br />
THEOREM 2: Nabbit executes a dynamic task graph<br />
D = (V, E) on P processors in expected time<br />
O( T₁/P + T∞ + M∆ + C(D) ),<br />
where C(D) = O( |E|/P + M min{∆, P} ).<br />
T₁ and T∞ are modified to account for Init() for each node.<br />
M∆ : a weaker bound, because not all edges in the graph are<br />
known ahead of time.
Applications for Nabbit?<br />
We are interested in possibly extending Nabbit in<br />
several directions:<br />
Strongly Dynamic Task Graphs<br />
Compute() of a task node can generate a new task.<br />
Reusing Nodes and Garbage Collection<br />
Hierarchical Task Graphs<br />
Runtime/Compiler Support for Nabbit<br />
Applications!<br />
The value of these possible extensions to Nabbit depends<br />
on programs that use static or dynamic task graphs.<br />
We value any feedback regarding potential applications!
Questions?