Lecture Notes in Computer Science 4917
MPI: Message Passing on Multicore Processors with On-Chip Interconnect
be slow, but some offer first-class on-chip inter-core network support. Technology scaling is enabling such network-interconnected parallel systems to be built on a chip, offering users extremely low-latency networks. The MIT Raw processor [33], [31], [30], [32] builds on this idea and provides a prototype on which to evaluate it. Raw includes first-class instruction set architecture (ISA) support for inter-processor communication, enabling orders-of-magnitude improvements in communication latency.
This paper investigates the merits of tightly integrated on-chip networks, especially in light of their programmability and performance. It introduces rMPI, which provides a scalable interface that allows transparent migration of the large extant legacy code base that will have to run on multicores. rMPI leverages the on-chip network of the Raw multicore processor to build an abstraction with which many programmers are familiar: the Message Passing Interface (MPI). The processor cores that constitute chip multiprocessors (CMPs) such as Raw are tightly coupled through fast integrated on-chip networks, making such CMPs quite different from more traditional, heavily decoupled parallel computer systems. Additionally, some CMPs eliminate many layers of abstraction between the user program and the underlying hardware, allowing programmers to interact directly with hardware resources. Because these layers are removed, CMPs can support extremely fast interrupts with low overhead. Removing standard system layers such as the operating system represents an opportunity for improved performance, but it also places increased responsibility on the programmer to develop robust software. These and other novel features of multicore architectures motivated designing rMPI to take best advantage of the tightly coupled networks and direct access to hardware resources that many CMPs offer. rMPI offers the following features:
1) robust, deadlock-free, and scalable programming mechanisms; 2) an interface that is compatible with current MPI software; 3) an easy interface for programmers already familiar with high-level message-passing paradigms; and 4) fine-grain control over their programs when automatic parallelization tools do not yield sufficient performance.
Multicores with low-latency on-chip networks offer a great opportunity for performance and energy savings [29], [33]. However, this opportunity can be quickly squandered if programmers do not structure their applications and runtime systems in ways that leverage the aforementioned unique aspects of multicores. Multicores with on-chip networks and small on-chip memories usually perform best when data are communicated directly from core to core without accessing off-chip memory, encouraging communication-centric algorithms [33]. Multicores also perform well when the underlying networks provide the ability to send fine-grain messages between cores within a few cycles. MPI, by contrast, was originally developed 15 years ago, when communication between processors was assumed to be coarser-grained and communication overhead typically included operating system calls and socket overhead. rMPI allows investigation into how well MPI, given its assumptions about system overheads, maps to multicore architectures with on-chip networks.
The evaluation of rMPI presented in this paper attempts to understand how well it succeeds in offering the above-mentioned features, and whether MPI is still an appropriate API in the multicore domain. rMPI is evaluated against two references. To develop a qualitative intuition about the scaling properties of rMPI, it is compared against LAM/MPI, a highly optimized MPI implementation running on