Lecture Notes in Computer Science 4917

Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors 39

interfaces and re-formulation of parallel algorithms to fully exploit the additional hardware. Furthermore, scheduling code on accelerators and orchestrating parallel execution and data transfers between host processors and accelerators is a non-trivial exercise [5].

Consider the problem of identifying the most appropriate programming model and accelerator configuration for a given parallel application. The simplest way to identify the best combination is to exhaustively measure the execution time of all possible combinations of programming models and mappings of the application to the hardware. Unfortunately, this technique does not scale to large, complex systems, large applications, or applications whose behavior varies significantly with the input. The execution time of a complex application is a function of many parameters. A given parallel application may consist of N phases, where each phase is affected differently by accelerators. Each phase can exploit d dimensions of parallelism, or any combination thereof, such as ILP, TLP, or both. Each phase or dimension of parallelism can use any of m different programming and execution models, such as message passing, shared memory, SIMD, or any combination thereof. Accelerator availability or use may consist of c possible configurations, involving different numbers of accelerators. Exhaustive analysis of the execution time for all combinations therefore requires at least N × d × m × c trials with any given input.
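The size of this search space can be illustrated with a minimal sketch. The phase count, parallelism dimensions, model names, and accelerator configurations below are hypothetical placeholders, not values from the paper; the point is only that the number of timing trials grows as the product N × d × m × c:

```python
from itertools import product

def exhaustive_trial_count(num_phases, parallelism_dims, prog_models, accel_configs):
    """Count the timing trials needed to exhaustively evaluate every
    (phase, parallelism dimension, programming model, accelerator
    configuration) combination -- the N x d x m x c lower bound."""
    trials = list(product(range(num_phases), parallelism_dims,
                          prog_models, accel_configs))
    return len(trials)

# Hypothetical application: 4 phases, ILP/TLP, three programming
# models, and accelerator counts of 1, 2, 4, or 8.
count = exhaustive_trial_count(
    4,
    ["ILP", "TLP"],
    ["message passing", "shared memory", "SIMD"],
    [1, 2, 4, 8],
)
# 4 phases x 2 dimensions x 3 models x 4 configurations = 96 trials
```

Even with these modest, made-up parameters the exhaustive approach needs 96 timed runs per input, which is exactly the cost an analytical model such as MMGP is designed to avoid.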

Models of parallel computation have been instrumental in the adoption and use of parallel systems. Unfortunately, commonly used models [6,7] are not directly portable to accelerator-based systems. First, the heterogeneous processing common to these systems is not reflected in most models of parallel computation. Second, current models do not capture the effects of multi-grain parallelism. Third, few models account for the effects of using multiple programming models in the same program. Parallel programming at multiple dimensions and with a synthesis of models consumes both enormous amounts of programming effort and significant amounts of execution time, if not handled with care. To overcome these deficits, we present a model for multi-dimensional parallel computation on heterogeneous multi-core processors. Considering that each dimension of parallelism reflects a different degree of computation granularity, we name the model MMGP, for Model of Multi-Grain Parallelism.

MMGP is an analytical model which formalizes the process of programming accelerator-based systems and reduces the need for exhaustive measurements. This paper presents a generalized MMGP model for accelerator-based architectures with one layer of host processor parallelism and one layer of accelerator parallelism, followed by the specialization of this model for the Cell Broadband Engine.

The input to MMGP is an explicitly parallel program, with parallelism expressed through machine-independent abstractions, using common programming libraries and constructs. Upon identification of a few key application parameters, derived from micro-benchmarking and profiling of a sequential run, MMGP predicts with reasonable accuracy the execution time under all feasible mappings of the application to host processors and accelerators. MMGP is fast
