Executing Task Graphs Using Work-Stealing - IPDPS
Executing Task Graphs<br />
Using Work-Stealing<br />
Jim Sukha<br />
MIT CSAIL<br />
IPDPS, 4/21/2010<br />
Kunal Agrawal (Washington University in St. Louis)<br />
Charles E. Leiserson (MIT)
Task Graphs<br />
A task graph is a directed acyclic graph (dag) where<br />
Every node A is a task requiring computation,<br />
Every edge (A, B) means that the computation of B<br />
depends on the result of A’s computation.<br />
[Figure: an example task graph with source s, tasks A–G, and sink t]
Example: Counting Paths<br />
Counting the number of paths through a dag can be<br />
expressed as a task graph.<br />
The source starts with value 1.<br />
The value of a node X is the sum of the values of<br />
X’s immediate predecessors.<br />
[Figure: the example dag with source s (value 1), nodes A–G annotated with their path counts, and sink t (value 3)]
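The path-counting rule above can be sketched in plain C++ (not Nabbit code; the dag, the node numbering, and the function name are our own illustration): when nodes are numbered in topological order and edges are sorted by source node, one pass over the edges accumulates each node's predecessor values.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch: value(X) = sum of the values of X's immediate predecessors;
// the source (node 0) starts with value 1. Assumes nodes are numbered
// in topological order and edges are sorted by source node.
std::vector<long> countPaths(int n,
        const std::vector<std::pair<int, int>>& edges) {
    std::vector<long> value(n, 0);
    value[0] = 1;                      // source
    for (auto [a, b] : edges)          // edge (a, b): b depends on a
        value[b] += value[a];
    return value;
}
```

On a small diamond-shaped dag with a chord, the sink accumulates one count per distinct source-to-sink path.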
Fork-Join Languages<br />
Some task graphs can be executed in parallel using<br />
a fork-join language such as Cilk++.<br />
void f1() {<br />
  A();<br />
  cilk_spawn B();<br />
  C();<br />
  cilk_sync;<br />
  D();<br />
}<br />
void f2() {<br />
  E(); F(); G();<br />
}<br />
int main() {<br />
  cilk_spawn f1();<br />
  f2();<br />
  cilk_sync;<br />
}<br />
[Figure: the resulting dag from source s to sink t, with f1 computing A, B, C, D and f2 computing E, F, G]
Graphs with Arbitrary Dependencies<br />
Unfortunately, one cannot directly express task<br />
graphs with arbitrary dependencies using only<br />
spawn and sync.<br />
Counting paths in a non-series-parallel dag.<br />
[Figure: a non-series-parallel dag from source s to sink t; path counts annotate the nodes, with the sink t reaching 5]<br />
Question: Can we efficiently execute arbitrary task<br />
graphs in parallel in a fork-join language such as Cilk++?
Our Contributions: Nabbit<br />
We are developing Nabbit, a Cilk++ library for<br />
executing task graphs with arbitrary dependencies.<br />
Nabbit is built on top of Cilk++. It utilizes Cilk++’s<br />
provably-efficient work-stealing scheduler without<br />
any modification to the Cilk++ runtime.<br />
Using Nabbit, the computation of an individual task<br />
graph node can itself be parallel.<br />
[Figure: the example task graph with source s, nodes A–G, and sink t]
Provable Bounds for Nabbit<br />
We are developing Nabbit, a Cilk++ library for<br />
executing task graphs with arbitrary dependencies.<br />
Nabbit offers provable bounds on the time<br />
required for parallel execution of (static and<br />
dynamic) task graphs.<br />
The time bounds for Nabbit are asymptotically<br />
optimal for task graphs whose nodes have<br />
constant in-degree and out-degree.
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Dynamic-Programming Example<br />
Generic dynamic programs can often be<br />
expressed as a task graph.<br />
M(i, j) = max{ g₁(M(i-1, j-1)), E(i, j), F(i, j) }<br />
E(i, j) = g₂( M(1, j), M(2, j), …, M(i-1, j) )<br />
F(i, j) = g₃( M(i, 1), M(i, 2), …, M(i, j-1) )<br />
[Figure: grid of cells (i, j); each M(i, j) depends on earlier cells in its row, its column, and its diagonal]
Static Task Graphs<br />
For this example, we can use a static task graph,<br />
i.e., a task graph where the structure of the dag is<br />
known before the execution begins. *<br />
M(i, j) = max{ g₁(M(i-1, j-1)), E(i, j), F(i, j) }<br />
E(i, j) = g₂( M(1, j), M(2, j), …, M(i-1, j) )<br />
F(i, j) = g₃( M(i, 1), M(i, 2), …, M(i, j-1) )<br />
[Figure: grid of cells (i, j) with dependency edges between cells]<br />
Create a node for every cell M(i, j). Then add dependency edges.<br />
* In Nabbit, static task graphs still require dynamic scheduling.<br />
We assume the compute time for each node may be unknown.
Interface for Static Nabbit<br />
For static task graphs, each task graph node is<br />
derived from Nabbit’s DAGNode class and<br />
overrides the node’s Compute() method.<br />
class Mnode: public DAGNode {<br />
Mnode(int key, MDag* dag);<br />
void Compute();<br />
};<br />
class MDag {<br />
int N; int *M;<br />
Mnode* g;<br />
MDag(int N_, int* M_);<br />
};<br />
Typically, each node needs to know its identity and<br />
the global parameters of the task graph.<br />
The programmer builds their own task graph.
Constructing a Static Task Graph<br />
Programmers use Nabbit’s AddDep method to<br />
specify dependencies between task graph nodes.<br />
class MDag {<br />
  int N; int* M; Mnode* g;<br />
  MDag(int N_, int* M_) : N(N_), M(M_) {<br />
    // Allocate nodes.<br />
    g = new Mnode[N*N];<br />
    for (int i = 0; i < N; i++) {<br />
      for (int j = 0; j < N; j++) {<br />
        int k = N*i + j;<br />
        g[k].key = k; g[k].dag = this;<br />
        // M(i, j) has edges from M(i-1, j) and M(i, j-1).<br />
        if (i > 0) g[k].AddDep(&g[k-N]);<br />
        if (j > 0) g[k].AddDep(&g[k-1]);<br />
      }<br />
    }<br />
  }<br />
};
Implementing Task Nodes<br />
<strong>Task</strong> graph nodes inherit from a DAGNode class,<br />
and override the node’s Compute() method.<br />
class Mnode: public DAGNode {<br />
  int i, j;<br />
  void Compute() {<br />
    int z = -INFINITY;<br />
    int Eij = calcE(dag->M, i, j);<br />
    int Fij = calcF(dag->M, i, j);<br />
    if ((i > 0) && (j > 0))<br />
      z = g1(dag->M, i, j);<br />
    dag->M[key] = max(z, Eij, Fij);<br />
  }<br />
};<br />
One can call other Cilk++ functions inside the<br />
Compute() method, including methods that<br />
spawn and sync.
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Static Nabbit Implementation<br />
Nabbit uses a simple algorithm to execute static<br />
task graphs in parallel. Each node<br />
1. Maintains a count of the # of its immediate predecessors that are<br />
still incomplete. (Each node keeps a join counter.)<br />
2. Notifies its immediate successors in parallel after it is computed.<br />
3. Recursively computes any successors which become ready.<br />
void ComputeAndNotify() {<br />
  this->Compute();<br />
  cilk_for (int q = 0; q < successors.size(); q++) {<br />
    DAGNode* Y = successors[q];<br />
    int val = AtomicDecAndFetch(&Y->join);<br />
    if (val == 0) Y->ComputeAndNotify();<br />
  }<br />
}<br />
Start execution by calling ComputeAndNotify<br />
from the source (root) node.
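The three steps above can be sketched sequentially in plain C++ (an illustration only: Nabbit notifies successors in parallel with cilk_for under Cilk++'s work-stealing scheduler, and the types and names here are ours, not Nabbit's actual interface):

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Minimal join-counter executor (sequential sketch of the protocol).
struct Node {
    std::atomic<int> join{0};          // # of incomplete predecessors
    std::vector<int> successors;
    std::function<void()> compute;
};

void computeAndNotify(std::vector<Node>& g, int x) {
    g[x].compute();                    // run this node's Compute()
    for (int s : g[x].successors) {
        // Atomic decrement: whoever drops a successor's join counter
        // to zero is responsible for recursively computing it.
        if (g[s].join.fetch_sub(1) - 1 == 0)
            computeAndNotify(g, s);
    }
}
```

Because exactly one decrement takes each counter to zero, every node is computed exactly once, after all of its predecessors.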
Work-Stealing in Nabbit<br />
Nabbit is able to rely on Cilk++’s work-stealing<br />
scheduler to load-balance the computation.<br />
When a processor runs<br />
out of work, it tries to<br />
steal work from other<br />
processors.<br />
Nabbit spawns task<br />
nodes in a way that<br />
makes the Cilk++<br />
runtime likely to steal<br />
nodes along the critical<br />
path of the task graph.<br />
[Figure: sample dynamic-program execution with P = 4 processors]
Smith-Waterman Dynamic Program<br />
As a benchmark, we consider a dynamic program<br />
modeling the Smith-Waterman algorithm with a<br />
generic gap penalty:<br />
M(i, j) = max{ M(i-1, j-1) + s(i, j), E(i, j), F(i, j) }<br />
E(i, j) = max over k∊{0, 1, …, i-1} of M(k, j) + γ(i-k)<br />
F(i, j) = max over k∊{0, 1, …, j-1} of M(i, k) + γ(j-k)<br />
In this example, s(i, j) and γ(k) are constant arrays.<br />
[Figure: grid of cells (i, j) showing the dependency structure]
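A sequential reference for this recurrence can be sketched as follows (a sketch under assumptions: the arrays sArr and gamma are made-up inputs, and the base case M(i, 0) = M(0, j) = 0 is our choice for the sketch, not taken from the talk):

```cpp
#include <algorithm>
#include <vector>

// Sequential evaluation of:
//   M(i,j) = max{ M(i-1,j-1) + s(i,j), E(i,j), F(i,j) }
//   E(i,j) = max_{0<=k<i} M(k,j) + gamma(i-k)
//   F(i,j) = max_{0<=k<j} M(i,k) + gamma(j-k)
// with first row and column fixed at 0 (assumed base case).
std::vector<std::vector<long>> evalRecurrence(int n,
        const std::vector<std::vector<long>>& sArr,
        const std::vector<long>& gamma) {
    std::vector<std::vector<long>> M(n, std::vector<long>(n, 0));
    for (int i = 1; i < n; i++)
        for (int j = 1; j < n; j++) {
            long best = M[i-1][j-1] + sArr[i][j];   // diagonal term
            for (int k = 0; k < i; k++)             // E(i, j)
                best = std::max(best, M[k][j] + gamma[i-k]);
            for (int k = 0; k < j; k++)             // F(i, j)
                best = std::max(best, M[i][k] + gamma[j-k]);
            M[i][j] = best;
        }
    return M;
}
```

This naive loop is the correctness baseline; the task-graph versions compute the same table with the cells as dag nodes.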
Comparison with Divide-and-Conquer<br />
For the dynamic program, we compare the task graph<br />
evaluation using Nabbit with alternative algorithms.<br />
K-way divide-and-conquer (shown for K=2)<br />
Wavefront (synchronous)<br />
[Figure: recursive block ordering for 2-way divide-and-conquer, and diagonal wavefront numbering of the blocks]<br />
For all algorithms, the base case is blocks of size B by B.<br />
The grid is arranged in a cache-oblivious layout.
16-core AMD Barcelona (Opteron 8354, 2.2 GHz)<br />
Comparing implementations of the dynamic program.<br />
[Figure: two speedup plots comparing Nabbit, K-way divide-and-conquer (K=2 and K=5), and wavefront; serial running times 4.4 s (c = 4.4e-9 if Tₛ ≈ cN³) and 664 s (c = 5.3e-9 if Tₛ ≈ cN³)]
Comparison, N=3000, P=16<br />
[Figure: running-time comparison of Nabbit, wavefront, and divide-and-conquer (K=2 and K=5) at N=3000 on 16 processors]
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Definitions<br />
Let D=(V, E) be a task graph to execute.<br />
Consider the execution dag associated with the<br />
Compute() method of a task node A∊V.<br />
W(A): the work of A (the # of nodes in A’s execution dag)<br />
S(A): the span of A (the length of the longest path in that dag)<br />
M: the # of task nodes on the longest path through the task graph D<br />
∆: the maximum degree of any task node<br />
[Figure: an execution dag for a node A with W(A) = 11 and S(A) = 6, and the example task graph (s, A–G, t) with M = 5 and ∆ = 2]
Work and Span of a Task Graph<br />
We can define a “total” work and span for the<br />
task graph execution. Define T₁ and T∞ as:<br />
T₁ = Σ_{A∊V} W(A) + O(|E|)<br />
T∞ = max over all paths p through D of Σ_{A∊p} ( S(A) + O(1) )<br />
Any execution of the task graph on P processors<br />
requires time at least max{ T₁/P, T∞ }.<br />
[Figure: the example task graph with source s, nodes A–G, and sink t]
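These definitions can be checked on a small example (an illustrative sketch with our own names; the O(|E|) and O(1) bookkeeping terms are dropped, and nodes are assumed numbered topologically with edges sorted by source):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// T1 sums the per-node work W(A); T_inf is the maximum over
// source-to-sink paths of the summed per-node spans S(A).
struct TaskGraph {
    std::vector<long> W, S;                    // per-node work and span
    std::vector<std::pair<int, int>> edges;    // sorted by source node
};

long totalWork(const TaskGraph& d) {
    long t = 0;
    for (long w : d.W) t += w;
    return t;
}

long totalSpan(const TaskGraph& d) {
    // longest[v] = maximum span-weight of any path ending at node v
    std::vector<long> longest(d.S.begin(), d.S.end());
    for (auto [a, b] : d.edges)
        longest[b] = std::max(longest[b], longest[a] + d.S[b]);
    long best = 0;
    for (long v : longest) best = std::max(best, v);
    return best;
}
```

For a three-node chain with W = (2, 3, 4) and S = (1, 2, 1), this gives T₁ = 9 and T∞ = 4, matching the definitions by hand.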
Completion Time for Static Nabbit<br />
THEOREM 1: Nabbit executes a static task graph<br />
D = (V, E) on P processors in expected time<br />
O( T₁/P + T∞ + M lg ∆ + C(D) ),<br />
where C(D) = O( |E|/P + M min{∆, P} ).<br />
T₁/P + T∞ : the bound for ordinary Cilk-like work-stealing.<br />
M lg ∆ : the span of notifying task node successors.<br />
C(D) : worst-case contention for atomic decrements<br />
(min{∆, P} is the time one decrement can wait).<br />
The bound is asymptotically optimal when ∆ = Θ(1).<br />
(Recall: M is the # of nodes on the longest path through the<br />
task graph D, and ∆ is the maximum degree of any task node.)
Outline<br />
Static Task Graphs using Nabbit<br />
Nabbit Implementation<br />
Completion Time Bound<br />
Dynamic Task Graphs and Other Extensions
Dynamic Nabbit<br />
Nabbit also supports dynamic task graphs. Roughly, a<br />
dynamic task graph can be thought of as performing a<br />
parallel traversal of a two-phase dag, where the first<br />
Init() phase creates new nodes.<br />
[Figure: a dynamic task graph drawn as a two-phase dag over nodes A–H. The Init() section is linked by creation edges (a node is initialized on its first visit); the Compute() section is linked by dependency edges (a node is computed on its last visit).]
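A rough sequential analogue of this two-phase idea (not Nabbit's implementation or API; the names deps, combine, and evaluate are hypothetical) is memoized recursion: reaching an unknown key plays the role of Init() by discovering that node's dependencies, and the node's value is computed once all of those dependencies have values.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

// Sequential sketch of a dynamic task graph: the graph's structure
// is discovered during the traversal rather than built up front.
struct DynDag {
    // deps(key) plays the role of Init(): it names the predecessors.
    std::function<std::vector<int>(int)> deps;
    // combine(key, vals) plays the role of Compute().
    std::function<long(int, const std::vector<long>&)> combine;
    std::unordered_map<int, long> done;

    long evaluate(int key) {
        auto it = done.find(key);
        if (it != done.end()) return it->second;   // already computed
        std::vector<long> vals;
        for (int d : deps(key))                    // visit/create preds
            vals.push_back(evaluate(d));
        return done[key] = combine(key, vals);
    }
};
```

With deps(k) = {k-1, k-2} for k ≥ 2, this traversal unfolds a Fibonacci-style dag on demand, one node per distinct key.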
Complications for Dynamic Nabbit<br />
Dynamic task graphs are more complicated because<br />
Init() and Compute() happen concurrently.<br />
[Figure: the same two-phase dag as on the previous slide; the Init() and Compute() sections of different nodes overlap during execution.]
Completion Time for Dynamic Nabbit<br />
THEOREM 2: Nabbit executes a dynamic task graph<br />
D = (V, E) on P processors in expected time<br />
O( T₁/P + T∞ + M∆ + C(D) ),<br />
where C(D) = O( |E|/P + M min{∆, P} ).<br />
T₁ and T∞ are modified to account for Init() for each node.<br />
M∆ : a weaker bound, because not all edges in the graph are<br />
known ahead of time.
Applications for Nabbit?<br />
We are interested in possibly extending Nabbit in<br />
several directions:<br />
Strongly Dynamic Task Graphs<br />
Compute() of a task node can generate a new task.<br />
Reusing Nodes and Garbage Collection<br />
Hierarchical Task Graphs<br />
Runtime/Compiler Support for Nabbit<br />
Applications!<br />
The value of these possible extensions to Nabbit depends<br />
on programs that use static or dynamic task graphs.<br />
We value any feedback regarding potential applications!
Questions?