Executing Task Graphs Using Work-Stealing - IPDPS

Executing Task Graphs
Using Work-Stealing

Jim Sukha
MIT CSAIL
IPDPS, 4/21/2010

Kunal Agrawal (Washington University in St. Louis)
Charles E. Leiserson (MIT)


Task Graphs

A task graph is a directed acyclic graph (dag) where
- every node A is a task requiring computation, and
- every edge (A, B) means that the computation of B depends on the result of A's computation.

[Diagram: an example task graph with source s, internal nodes A, B, C, D, E, F, G, and sink t.]


Example: Counting Paths

Counting the number of paths through a dag can be expressed as a task graph. The source starts with value 1. The value of a node X is the sum of the values of X's immediate predecessors.

[Diagram: the example dag with computed values s = 1, A = 1, B = 1, C = 1, D = 2, E = 1, F = 1, G = 1, and t = 3.]


Fork-Join Languages

Some task graphs can be executed in parallel using a fork-join language such as Cilk++.

    void f1() {
        A();
        cilk_spawn B();
        C();
        cilk_sync;
        D();
    }

    void f2() {
        E(); F(); G();
    }

    int main() {
        cilk_spawn f1();
        f2();
        cilk_sync;
    }

[Diagram: f1 executes the series-parallel subgraph A, then B and C in parallel, then D; f2 executes E, F, G in series; main runs f1 and f2 in parallel between s and t.]


Graphs with Arbitrary Dependencies

Unfortunately, one cannot directly express task graphs with arbitrary dependencies using only spawn and sync.

[Diagram: counting paths in a non-series-parallel dag with source s (value 1) and sink t (value 5).]

Question: Can we efficiently execute arbitrary task graphs in parallel in a fork-join language such as Cilk++?


Our Contributions: Nabbit

We are developing Nabbit, a Cilk++ library for executing task graphs with arbitrary dependencies.

Nabbit is built on top of Cilk++. It utilizes Cilk++'s provably efficient work-stealing scheduler without any modification to the Cilk++ runtime.

Using Nabbit, the computation of an individual task graph node can itself be parallel.

[Diagram: the example task graph s, A through G, t.]


Provable Bounds for Nabbit

We are developing Nabbit, a Cilk++ library for executing task graphs with arbitrary dependencies.

Nabbit offers provable bounds on the time required for parallel execution of (static and dynamic) task graphs.

The time bounds for Nabbit are asymptotically optimal for task graphs whose nodes have constant in-degree and out-degree.


Outline

- Static Task Graphs using Nabbit
- Nabbit Implementation
- Completion Time Bound
- Dynamic Task Graphs and Other Extensions


Dynamic-Programming Example

Generic dynamic programs can often be expressed as a task graph.

    M(i, j) = max { g1(M(i-1, j-1)), E(i, j), F(i, j) }
    E(i, j) = g2( M(1, j), M(2, j), …, M(i-1, j) )
    F(i, j) = g3( M(i, 1), M(i, 2), …, M(i, j-1) )

[Diagram: a grid of cells (i, j); e.g., (0, 3), (1, 2), (1, 3), and (2, 0) through (2, 3).]


Static Task Graphs

For this example, we can use a static task graph, i.e., a task graph where the structure of the dag is known before the execution begins.

    M(i, j) = max { g1(M(i-1, j-1)), E(i, j), F(i, j) }
    E(i, j) = g2( M(1, j), M(2, j), …, M(i-1, j) )
    F(i, j) = g3( M(i, 1), M(i, 2), …, M(i, j-1) )

[Diagram: the grid of cells.] Create a node for every cell M(i, j).


Static Task Graphs

For this example, we can use a static task graph, i.e., a task graph where the structure of the dag is known before the execution begins.*

[Diagram: the same recurrence and grid as on the previous slide.] Create a node for every cell M(i, j). Then add dependency edges.

* In Nabbit, static task graphs still require dynamic scheduling. We assume the compute time for each node may be unknown.


Interface for Static Nabbit

For static task graphs, each task graph node is derived from Nabbit's DAGNode class and overrides the node's Compute() method.

    class Mnode: public DAGNode {
        Mnode(int key, MDag* dag);
        void Compute();
    };

    class MDag {
        int N; int* M;
        Mnode* g;
        MDag(int N_, int* M_);
    };

Typically, each node needs to know its identity and the global parameters for the task graph. The programmer builds their own task graph.


Constructing a Static Task Graph

Programmers use Nabbit's AddDep method to specify dependencies between task graph nodes: M(i, j) has edges from M(i-1, j) and M(i, j-1).

    class MDag {
        int N; int* M; Mnode* g;
        MDag(int N_, int* M_) : N(N_), M(M_) {
            g = new Mnode[N*N];   // allocate nodes
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
                    int k = N*i + j;
                    g[k].key = k; g[k].dag = this;
                    if (i > 0) g[k].AddDep(&g[k-N]);
                    if (j > 0) g[k].AddDep(&g[k-1]);
                }
            }
        }
    };


Implementing Task Nodes

Task graph nodes inherit from a DAGNode class and override the node's Compute() method.

    class Mnode: public DAGNode {
        int i, j;
        void Compute() {
            int z = -INFINITY;   // the g1 term applies only when i, j > 0
            int Eij = calcE(dag->M, i, j);
            int Fij = calcF(dag->M, i, j);
            if ((i > 0) && (j > 0))
                z = g1(dag->M, i, j);
            dag->M[key] = max(z, Eij, Fij);
        }
    };

One can call other Cilk++ functions inside the Compute() method, including methods that spawn and sync.


Outline

- Static Task Graphs using Nabbit
- Nabbit Implementation
- Completion Time Bound
- Dynamic Task Graphs and Other Extensions


Static Nabbit Implementation

Nabbit uses a simple algorithm to execute static task graphs in parallel. Each node

1. maintains a count of the number of its immediate predecessors that are still incomplete (each node keeps a join counter);
2. notifies its immediate successors in parallel after it is computed;
3. recursively computes any successors that become ready.

    void ComputeAndNotify() {
        this->Compute();
        cilk_for (int q = 0; q < successors.size(); q++) {
            DAGNode* Y = successors[q];
            int val = AtomicDecAndFetch(&Y->join);
            if (val == 0) Y->ComputeAndNotify();
        }
    }

Start execution by calling ComputeAndNotify from the source (root) node.


Work-Stealing in Nabbit

Nabbit is able to rely on Cilk++'s work-stealing scheduler to load-balance the computation.

When a processor runs out of work, it tries to steal work from other processors. Nabbit spawns task nodes in a way that makes the Cilk++ runtime likely to steal nodes along the critical path of the task graph.

[Figure: sample dynamic-program execution with P = 4.]


Smith-Waterman Dynamic Program

As a benchmark, we consider a dynamic program modeling the Smith-Waterman algorithm with a generic gap penalty:

    M(i, j) = max { M(i-1, j-1) + s(i, j), E(i, j), F(i, j) }
    E(i, j) = max over k ∊ {0, 1, …, i-1} of M(k, j) + γ(i-k)
    F(i, j) = max over k ∊ {0, 1, …, j-1} of M(i, k) + γ(j-k)

[Diagram: the grid of cells.]

In this example, s(i, j) and γ(k) are constant arrays.


Comparison with Divide-and-Conquer

For the dynamic program, we compare the task graph evaluation using Nabbit with alternative algorithms:

- K-way divide-and-conquer (shown for K = 2)
- Wavefront (synchronous), which processes the grid one antidiagonal per round

[Diagrams: the recursive ordering of quadrants for K = 2, and the wavefront round numbers 1 through 7 over a 4-by-4 grid of blocks.]

For all algorithms, the base case is blocks of size B by B. The grid is arranged in a cache-oblivious layout.


16-core AMD Barcelona

Comparing implementations of the dynamic program: Nabbit, wavefront, and K-way divide-and-conquer with K = 5 and K = 2.

[Plots: two comparisons.
Left: serial running time = 4.4 s (c = 4.4e-9 if T_S ≈ cN³).
Right: serial running time = 664 s (c = 5.3e-9 if T_S ≈ cN³).]

Opteron Processor 8354: 2.2 GHz.


Comparison, N = 3000, P = 16

[Plot: comparison of Nabbit, wavefront, K = 5, and K = 2.]


Outline

- Static Task Graphs using Nabbit
- Nabbit Implementation
- Completion Time Bound
- Dynamic Task Graphs and Other Extensions


Definitions

Let D = (V, E) be a task graph to execute. Consider the execution dag associated with the Compute() method of a task node A ∊ V.

W(A): the work of A (the # of nodes in the execution dag).
S(A): the span of A (the length of the longest path in the execution dag).

M: the # of task nodes on the longest path through the task graph D.
∆: the maximum degree of any task node.

[Diagram: an execution dag for a node A with W(A) = 11 and S(A) = 6; for the example task graph s, A through G, t, we have M = 5 and ∆ = 2.]


Work and Span of a Task Graph

We can define a "total" work and span for the task graph execution. Define T1 and T∞ as:

    T1 = Σ_{A ∊ V} W(A) + O(E)

    T∞ = max over all paths p through D of Σ_{A ∊ p} ( S(A) + O(1) )

Any execution of the task graph on P processors requires time at least max { T1/P, T∞ }.

[Diagram: the example task graph s, A through G, t.]


Completion Time for Static Nabbit

THEOREM 1: Nabbit executes a static task graph D = (V, E) on P processors in expected time

    O( T1/P + T∞ + M lg ∆ + C(D) ),  where  C(D) = O( E/P + M min{∆, P} ).

- T1/P + T∞: the bound for ordinary Cilk-like work-stealing.
- M lg ∆: the span of notifying task node successors.
- C(D): the worst-case contention for atomic decrements (min{∆, P} is the time a decrement can wait).

(Recall: M is the # of task nodes on the longest path through the task graph D, and ∆ is the maximum degree of any task node.)

The theorem is asymptotically optimal when ∆ = Θ(1).


Outline

- Static Task Graphs using Nabbit
- Nabbit Implementation
- Completion Time Bound
- Dynamic Task Graphs and Other Extensions


Dynamic Nabbit

Nabbit also supports dynamic task graphs. Roughly, a dynamic task graph can be thought of as performing a parallel traversal of a two-phase dag, where the first Init() phase creates new nodes.

[Diagram: a two-phase dag in which each task node has an Init() section and a Compute() section. A creation edge initializes a node on the first visit; a dependency edge computes a node on the last visit.]


Complications for Dynamic Nabbit

Dynamic task graphs are more complicated because Init() and Compute() happen concurrently.

[Diagram: the same two-phase dag, with Init() sections, creation edges, Compute() sections, and dependency edges.]


Completion Time for Dynamic Nabbit

THEOREM 2: Nabbit executes a dynamic task graph D = (V, E) on P processors in expected time

    O( T1/P + T∞ + M∆ + C(D) ),  where  C(D) = O( E/P + M min{∆, P} ).

- T1 and T∞ are modified to account for the Init() of each node.
- M∆: a weaker bound than in the static case, because not all edges in the graph are known ahead of time.


Topics for Future Investigation

We are interested in possibly extending Nabbit in several directions:

- Strongly Dynamic Task Graphs: the Compute() of a task node can generate a new task.
- Reusing Nodes and Garbage Collection
- Hierarchical Task Graphs
- Runtime/Compiler Support for Nabbit


Applications for Nabbit?

Applications! The value of the possible extensions above depends on programs that use static or dynamic task graphs. We value any feedback regarding potential applications!


Questions?
