
MPI: Message Passing on Multicore Processors with On-Chip Interconnect

be slow, but some offer first-class on-chip inter-core network support. Technology scaling is enabling such network-interconnected parallel systems to be built on a chip, offering users extremely low-latency networks. The MIT Raw processor [33], [31], [30], [32] builds on this idea and provides a prototype on which to evaluate it. Raw includes first-class instruction set architecture (ISA) support for inter-processor communication, enabling orders-of-magnitude improvements in communication latency.

This paper investigates the merits of tightly integrated on-chip networks, especially in light of their programmability and performance. It introduces rMPI, which provides a scalable interface that allows transparent migration of the large extant legacy code base that will have to run on multicores. rMPI leverages the on-chip network of the Raw multicore processor to build an abstraction with which many programmers are familiar: the Message Passing Interface (MPI). The processor cores that constitute chip multiprocessors (CMPs) such as Raw are tightly coupled through fast integrated on-chip networks, making such CMPs quite different from more traditional, heavily decoupled parallel computer systems. Additionally, some CMPs eliminate many layers of abstraction between the user program and the underlying hardware, allowing programmers to interact directly with hardware resources. Because these layers are removed, CMPs can have extremely fast interrupts with low overhead. Removing standard computer-system layers such as the operating system represents an opportunity for improved performance, but it also places an increased responsibility on the programmer to develop robust software. These and other novel features of multicore architectures motivated designing rMPI to best take advantage of the tightly coupled networks and direct access to hardware resources that many CMPs offer. rMPI offers the following features: 1) robust, deadlock-free, and scalable programming mechanisms; 2) an interface that is compatible with current MPI software; 3) an easy interface for programmers already familiar with high-level message-passing paradigms; and 4) fine-grain control over their programs when automatic parallelization tools do not yield sufficient performance.
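As a purely illustrative sketch of feature 2, the program below uses only standard MPI calls of the kind an MPI-compatible runtime such as rMPI would have to accept unmodified; the ranks and payload are hypothetical and nothing here is drawn from the rMPI distribution itself. The same C source could in principle be compiled against a conventional MPI library or against rMPI targeting Raw.

/* Hypothetical legacy-style MPI program: rank 0 sends a small buffer to
 * rank 1, which receives it and prints a confirmation. Only standard
 * MPI-1 point-to-point calls are used. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int payload[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            /* Point-to-point send to rank 1. */
            MPI_Send(payload, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(payload, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d %d %d %d\n",
                   payload[0], payload[1], payload[2], payload[3]);
        }
    }

    MPI_Finalize();
    return 0;
}

On a traditional cluster implementation, the MPI_Send/MPI_Recv pair above would typically traverse operating-system and socket layers; under an on-chip implementation such as rMPI, the intent is that the same calls are carried directly over the processor's network.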

Multicores with low-latency on-chip networks offer a great opportunity for performance and energy savings [29], [33]. However, this opportunity can be quickly squandered if programmers do not structure their applications and runtime systems in ways that leverage the aforementioned unique aspects of multicores. Multicores with on-chip networks and small on-chip memories usually perform best when data are communicated directly from core to core without accessing off-chip memory, encouraging communication-centric algorithms [33]. Multicores also perform well when the underlying networks provide the ability to send fine-grain messages between cores within a few cycles. MPI, by contrast, was originally developed 15 years ago under the assumption of coarser-grain communication, in which per-message overhead usually included operating system calls and socket processing. rMPI allows investigation into how well MPI, given its assumptions about system overheads, maps to multicore architectures with on-chip networks.
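To illustrate the kind of fine-grain messaging discussed above, the hypothetical sketch below exchanges a single word with neighboring ranks on every iteration of a loop. Under MPI's original assumptions, each such tiny message would pay operating-system and socket overhead; a network that can deliver a message in a few cycles is what makes this pattern attractive on a multicore.

/* Hypothetical fine-grain exchange: each rank repeatedly trades one
 * integer with its neighbors in a ring. The per-message payload is a
 * single word, so messaging overhead, not data volume, dominates the
 * cost. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    int send_word, recv_word = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    for (i = 0; i < 1000; i++) {
        send_word = rank + i;   /* one-word payload per message */
        MPI_Sendrecv(&send_word, 1, MPI_INT, right, 0,
                     &recv_word, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}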

The evaluation of rMPI presented in this paper attempts to understand how well it succeeds in offering the above-mentioned features, and whether MPI is still an appropriate API in the multicore domain. rMPI is evaluated in comparison to two references. To develop a qualitative intuition about the scaling properties of rMPI, it is compared against LAM/MPI, a highly optimized commercial MPI implementation running on
