Computer Systems @ Colorado
COMPUTER AND CYBERPHYSICAL SYSTEMS RESEARCH AT COLORADO
http://systems.cs.colorado.edu

A Methodology for Fine-Grained Parallelism in JavaScript Applications

Jeff Fifield and Dirk Grunwald
Department of Computer Science
University of Colorado

LCPC Workshop
September 8, 2011


Modern JavaScript

Most web applications are written using JavaScript
• Email
• Office suites
• Games
• IDEs

Many server side applications use JavaScript
• CommonJS
• node.js, RingoJS, wakanda, many more...
• MongoDB map/reduce, CouchDB views

APIs for Desktop, Mobile


Why is JavaScript Slow?

It's not because JavaScript is interpreted:
- current JavaScript engines are JIT compilers
- they perform standard optimizations and code gen
- they can generate good code

It's because JavaScript is untyped:
    what does c = a + b mean?
    if a = 1, b = 2
    if a = "cat", b = "dog"

Other reasons too:
    object creation, use of eval, sparse arrays, ...
    but these things shouldn't be in hot loops anyway
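
To make the `c = a + b` question concrete, here is a small illustration of how the same `+` selects a different operation depending on the runtime types of its operands. The `add` helper is a hypothetical name used only for this example.

```javascript
// Illustrative only: one source-level "+" maps to different operations,
// so an engine cannot pick the machine code without knowing the types.
function add(a, b) {
  return a + b;
}

console.log(add(1, 2));         // numeric addition: 3
console.log(add("cat", "dog")); // string concatenation: catdog
console.log(add(1, "dog"));     // number converted to string: 1dog
```

A JIT that has only ever seen numbers flow through `add` can emit a plain numeric add guarded by a type check; the string cases force the slower generic path.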


Type Specializing JITs

Luckily, programmers usually write Type-Stable code:

That is,
    a = 1;
    b = 2;
    if (something)
      b = 5;
    c = a + b;

and not,
    a = 1;
    b = 2;
    if (something)
      b = "word";
    c = a + b;

Type specializing, dynamic optimizing, JIT compilers
• Observe types through profiling or tracing (still emit checks)
• Prove types using type inference (no checks, research area)
• Generate code specialized to given types
• Google V8 Crankshaft, Mozilla TraceMonkey, JägerMonkey

Good JS performance requires Type-Stable Code


Go Faster with Parallelism

● JavaScript does not support parallelism
  ● Lots of concurrent programming models in use
  ● Various multi-threaded implementations
● WebWorkers
  ● Coming soon, threads with asynchronous message passing
● WebCL
  ● Coming soon, OpenCL for JavaScript
● Other proposals from the Server Side JavaScript community
  ● Sync and lock – JVM semantics
  ● Fork – as in fork a new process
● None of these are appealing for fine-grained parallelism
  ● Fine-grained parallelism == loop or procedure level
  ● Too low level or too heavy weight


Making Parallelism Easier

Use an Embedded Domain Specific Language (EDSL)

How to put a DSL into a general purpose language:
1. restrict the language to not do things you don't like
2. augment the language to do the things you do like
3. add runtime AST construction + DSL compiler

Get the benefits of the host syntax, compiler, libraries, etc.

Examples:
    Copperhead for Python,
    Intel Array Building Blocks / RapidMind / Ct for C++,
    Microsoft Accelerator for .NET


Sluice: EDSL for Stream Parallelism

• Kahn Process Network model of computation
• StreamIt style program construction

[Figure: stream graphs built from Kernel, Split, and Join nodes. Panels: pipeline, split-join, data parallel]

    function Adder(arg) {
      this.a = arg;
      this.work = function() {
        var e = this.pop();
        e = e + this.a;
        this.push(e);
        return false;
      }
    }

    var add0 = new Adder(1);
    var add1 = new Adder(1);
    var add2 = new Adder(1);

    var sj = Sluice.SplitRR(1, add0, add1, add2).JoinRR(1);
    var p = Sluice.Pipeline(new Count(10), sj, new Printer());

    > p.run()
    1 2 3 4 5 6 7 8 9 10
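
The round-robin split-join in the example above can be modeled over a finite array to see why the output order is preserved. This is only a sketch: `splitRR` and `joinRR` are hypothetical helpers (not the Sluice API), and it assumes `Count(10)` emits 0 through 9, which is consistent with the printed output after each `Adder(1)` adds one.

```javascript
// Hypothetical model of SplitRR(1, ...) / JoinRR(1) over an array.
// Deal one item at a time to each branch in turn.
function splitRR(items, branches) {
  const lanes = Array.from({ length: branches }, () => []);
  items.forEach((item, i) => lanes[i % branches].push(item));
  return lanes;
}

// Take one item at a time from each branch in turn.
function joinRR(lanes) {
  const out = [];
  const longest = Math.max(...lanes.map((lane) => lane.length));
  for (let i = 0; i < longest; i++) {
    for (const lane of lanes) {
      if (i < lane.length) out.push(lane[i]);
    }
  }
  return out;
}

// Assumed source values 0..9, standing in for Count(10).
const counted = Array.from({ length: 10 }, (_, i) => i);
// Each of the three branches applies the Adder(1) work: e + 1.
const added = splitRR(counted, 3).map((lane) => lane.map((e) => e + 1));
const joined = joinRR(added);
console.log(joined.join(" ")); // 1 2 3 4 5 6 7 8 9 10
```

Because the join drains the branches in the same order the split filled them, the three adders can run independently without reordering the stream.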


Default Sluice Execution: Run as Normal JavaScript

● Sluice uses a pure JavaScript implementation by default
● Kernels execute without modification (it's just JavaScript)
● Map streams to JavaScript Array with bounded sizes
● Simple Scheduling Algorithm:
  ● Single threaded scheduler based on coroutines
  ● yield to the scheduler when blocked on full or empty stream
  ● Choose next kernel to run using round-robin
● Requires generators – not yet in the language.
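
A minimal sketch of the scheduling idea above, assuming ES2015 generator functions (standardized after this talk) stand in for the coroutines. The kernel functions, the `CAPACITY` bound, and `runRoundRobin` are hypothetical names for illustration and are not Sluice's implementation.

```javascript
// Hypothetical sketch: streams are plain arrays with an assumed size bound,
// and each kernel is a generator that yields whenever it is blocked.
const CAPACITY = 2;

function* source(out, n) {
  for (let i = 0; i < n; i++) {
    while (out.length >= CAPACITY) yield; // blocked on a full stream
    out.push(i);
  }
}

function* adder(inp, out, a, n) {
  for (let i = 0; i < n; i++) {
    while (inp.length === 0) yield;       // blocked on an empty stream
    const e = inp.shift();
    while (out.length >= CAPACITY) yield; // blocked on a full stream
    out.push(e + a);
  }
}

function* sink(inp, results, n) {
  for (let i = 0; i < n; i++) {
    while (inp.length === 0) yield;       // blocked on an empty stream
    results.push(inp.shift());
  }
}

// Single-threaded round-robin scheduler: resume each kernel in turn and
// drop it from the run list once its generator has finished.
function runRoundRobin(kernels) {
  let live = kernels.slice();
  while (live.length > 0) {
    live = live.filter((k) => !k.next().done);
  }
}

const s0 = [];
const s1 = [];
const results = [];
runRoundRobin([source(s0, 10), adder(s0, s1, 1, 10), sink(s1, results, 10)]);
console.log(results.join(" ")); // 1 2 3 4 5 6 7 8 9 10
```

The kernels never see the scheduler directly; blocking is expressed entirely by yielding inside the stream accesses, which is what lets unmodified kernel bodies run in a single thread.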


Stream Parallelism Simplifies Program Analysis

    function CalculateForces(pos_rd, softeningSquared) {
      this.m_pos_rd = pos_rd;
      this.m_softeningSquared = softeningSquared;
      this.work = function () {
        var force = [0.0,0.0,0.0];
        var i = this.pop()*4;
        var N = this.pop()*4;
        for (var j=0; j


Sluice Acceleration

Target a low level representation for stream programs
  ● Stream and Kernel Intermediate Representation (SKIR)
  ● SKIR = LLVM + Kernels + Streams

Code generation:
  ● Typed kernels only, i.e. type-stable + inference
  ● Generate AST from Sluice kernel at runtime
  ● Translate AST to SKIR-C to SKIR

Currently uses programmer annotation:
    var k = new MyKernel(...);
    Sluice.toSkir(k, function(err, ret) {
      Sluice.Pipeline(..., ret, ...).run();
    });


Kernel Offload

● Offload selected kernels to SKIR runtime
  ● These kernels run in parallel with Sluice
● SKIR Runtime consists of:
  ● JIT compiler for stream parallel computation
    – Standard Synchronous Dataflow optimizations
      ● e.g. fusion, fission, batching
    – Shared memory multi-core and OpenCL backends
    – Support for dynamic optimization
  ● Dynamic scheduling algorithm
    – Single threaded or parallel execution
  ● Out of process communication mechanisms
    – RPC implemented as protobufs over sockets
    – Supports shared memory stream communication
    – Supports shared memory for kernel state
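
As a rough illustration of two of the dataflow optimizations named above, the sketch below shows fusion and fission on ordinary arrays. It is not SKIR's implementation: `scale`, `offset`, `fuse`, and `fission` are hypothetical names, and the fission version only shows how the data is partitioned and reassembled, not actual parallel execution.

```javascript
// Hypothetical per-element kernels connected by a stream.
const scale = (x) => x * 2;
const offset = (x) => x + 1;

// Unfused: the first kernel's output is materialized as an intermediate stream.
function runUnfused(input) {
  const intermediate = input.map(scale);
  return intermediate.map(offset);
}

// Fusion: combine both kernels into one, removing the intermediate stream.
function fuse(first, second) {
  return (x) => second(first(x));
}

// Fission: apply a stateless kernel to `ways` round-robin slices of the
// input, then interleave the partial results back into the original order.
function fission(kernel, input, ways) {
  const parts = Array.from({ length: ways }, () => []);
  input.forEach((x, i) => parts[i % ways].push(kernel(x)));
  return input.map((_, i) => parts[i % ways][Math.floor(i / ways)]);
}

const samples = [1, 2, 3, 4, 5];
const fusedKernel = fuse(scale, offset);
console.log(runUnfused(samples));              // [ 3, 5, 7, 9, 11 ]
console.log(samples.map(fusedKernel));         // same values
console.log(fission(fusedKernel, samples, 2)); // same values, same order
```

Both transformations are only valid here because the kernels are stateless and consume and produce one item per firing; kernels with state or other rates need more care.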


Kernel Offload

[Figure: kernel offload across the process boundary between JavaScript (Sluice) and SKIR. Labels: K1, K2 (offloaded), K3; shared memory input stream, shared memory output stream, shared memory state copy; RPC]

1) Compile K2 to SKIR code
2) Transfer SKIR code
3) Allocate streams
4) Allocate & copy state
5) Call K2
6) JIT compile K2
7) Execute K2
8) Copy state back to JS


Evaluation

● Set of compute intensive kernels
  ● Routines from Pixastic image processing library
    – Already JavaScript, trivial to port to Sluice
    – 5 Megapixel images (2592x1944)
    – Task parallelism
  ● n-body particle physics simulation from CUDA SDK
    – Ported to JavaScript and Sluice from C++
    – Varying number of particles
    – Pipeline parallelism, Data parallelism
● Compare with V8 (node.js) execution on a 4-core Intel Nehalem


Pixastic Benchmarks

Ported 4 image processing routines: invert, sepia, sharpen, and edge detection

Original Invert Code

    function invert_ref(params)
    {
      var data = Pixastic.prepareData(params);
      var invertAlpha = !!params.options.invertAlpha;
      var rect = params.options.rect;
      var p = rect.width * rect.height;
      var pix = p*4, pix1 = pix + 1;
      var pix2 = pix + 2, pix3 = pix + 3;
      while (p--) {
        data[pix-=4] = 255 - data[pix];
        data[pix1-=4] = 255 - data[pix1];
        data[pix2-=4] = 255 - data[pix2];
        if (invertAlpha)
          data[pix3-=4] = 255 - data[pix3];
      }
      return true;
    }

Sluice Invert Code

    function invert_sluice(data, invert_alpha)
    {
      this.invertAlpha = invert_alpha;
      this.data = data;
      this.work = function () {
        var w = this.pop();
        var h = this.pop();
        var p = w * h;
        var pix = p*4, pix1 = pix + 1;
        var pix2 = pix + 2, pix3 = pix + 3;
        while (p--) {
          this.data[pix-=4] = 255 - this.data[pix];
          this.data[pix1-=4] = 255 - this.data[pix1];
          this.data[pix2-=4] = 255 - this.data[pix2];
          if (this.invertAlpha)
            this.data[pix3-=4] = 255 - this.data[pix3];
        }
        this.push(w);
        this.push(h);
        return false;
      };
    }


Results: Pixastic

[Figure, left: execution time (ms) per benchmark (invert, sepia, sharpen, edges) for ref, skir cold, and skir warm]
[Figure, right: time per image (ms) vs. number of images (1, 2, 4, 6, 8, 10) for edges, sharpen, sepia, and invert]

● Test of acceleration due to SKIR
  ● Offload processing of single image to SKIR
  ● ref = ordinary JavaScript
  ● cold = SKIR - includes JIT overhead
  ● warm = SKIR - does not include JIT overhead
● Test of task parallelism
  ● Launch 1 to 10 image processing kernels in parallel
  ● Shows mean time per image
  ● 8 worker threads (4 cores / 8 hardware threads)


n-body simulation

    function CalculateForces(pos_rd, softeningSquared) {
      this.m_pos_rd = pos_rd;
      this.m_softeningSquared = softeningSquared;
      this.work = function () {
        var force = [0.0,0.0,0.0];
        var i = this.pop()*4;
        var N = this.pop()*4;
        for (var j=0; j


Results: n-body

[Figure, left: time per iteration (ms) vs. number of bodies for ref, sluice, and skir]
[Figure, right: speedup vs. number of threads (2, 4, 8)]

● Test of acceleration due to SKIR
  ● Offload CalculateForces kernel to SKIR
  ● Shows mean time per simulation iteration
  ● ref = ordinary JavaScript
  ● sluice = sluice implementation
  ● skir = sluice implementation + SKIR offload
● Test of data parallelism
  ● Offload CalculateForces kernel as a data parallel kernel, use SKIR kernel fission
  ● Shows speedup vs. single thread
  ● Use 2, 4 or 8 SKIR threads (4 cores / 8 hardware threads)
  ● JavaScript also uses one thread


Summary

● Stream parallelism
  ● Isolates program state
  ● Abstracts inter-task communication
● Presented a method for making concurrency explicit and aiding type inference in JavaScript Applications using Stream Parallelism
● Showed how to use these techniques to offload program kernels for parallel execution using a high performance JIT and runtime system for streaming computation
● Showed significant performance gains for the tested benchmarks


Blank Slide


extend kernel work function

    function createKernel(workfn)
    {
      var that = workfn;
      if (that._sluice_kernel !== true) {
        that._sluice_kernel = true;
        if (that.work == undefined)
          that.work = workfn;
        that.ins = [];
        that.outs = [];
        that.pop = createPop(that,0);
        that.popN = createPop(that);
        that.push = createPush(that,0);
        that.pushN = createPush(that);
        that.fiber = Fiber(
          function() {
            var r = false;
            while(r == false)
              r = that.work.call(that);
            return r;
          });
      }
      return that;
    }

    // return a pop operation
    function createArrayPop(kernel,stream)
    {
      var k = kernel;
      if (stream != null) {
        var s = stream;
        return function() {
          var a = k.ins[s].data;
          while (!a.length) yield(k.ins[s].src == 1);
          var e = a.shift();
          return e;
        };
      } else {
        return function(s) {
          var a = k.ins[s].data;
          while (!a.length) yield(k.ins[s].src == 1);
          var e = a.shift();
          return e;
        };
      }
    }


    enchmarks['invert'] = {
      ref : function (rgba, cb) {
        var ret = invert_ref(rgba, 2592, 1944, false);
        cb(ret);
      },
      skir : function (rgba, cb) {
        var invert = new invert_sluice(rgba, 0);
        toSkir(invert,
          function(err, ret) {
            var p = new sluice.Pipeline(
              function() { this.push(2592); //w
                           this.push(1944); //h
                           return true; },
              ret,
              function() { this.pop();
                           return true; }
            ).run(function() { cb(invert.data); });
          });
      }
    };

    var rgba = [];
    for (var i=0; i
