
above for the matrix multiplication application. In these cases, the messages are quite small (20 words or less), so sending many more, as in the 16 processor case, affects LAM/MPI's performance more drastically than it does rMPI's. In general, though, both applications achieve speedups ranging from approximately 6x–14.5x on both rMPI and LAM/MPI. These speedups are larger than those of matrix multiply and jacobi, which algorithmically have significantly lower computation-to-communication ratios. The results for all four applications evaluated agree with intuition: rMPI and LAM/MPI both exhibit better performance scalability for applications with larger computation-to-communication ratios.
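
To make the ratio concrete, the sketch below shows one iteration of a 1-D row-block Jacobi solver written against the standard MPI interface. It is an illustrative example rather than the benchmark code used in the evaluation; the function name jacobi_step and the row-block decomposition are assumptions. Each iteration communicates one boundary row per neighbor but performs only a five-point stencil update per interior point, which is the kind of low computation-to-communication ratio referred to above.

/* Illustrative sketch (not the evaluated benchmark code): one Jacobi
 * iteration over a row-block of the grid, with ghost rows at index 0 and
 * rows-1.  The halo exchange moves one row per neighbor per iteration,
 * while the update does only a handful of arithmetic operations per point. */
#include <mpi.h>

void jacobi_step(double *grid, double *next, int rows, int cols,
                 int rank, int nprocs)
{
    MPI_Status st;

    /* Exchange boundary rows with the ranks above and below. */
    if (rank > 0)
        MPI_Sendrecv(&grid[1 * cols], cols, MPI_DOUBLE, rank - 1, 0,
                     &grid[0 * cols], cols, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, &st);
    if (rank < nprocs - 1)
        MPI_Sendrecv(&grid[(rows - 2) * cols], cols, MPI_DOUBLE, rank + 1, 0,
                     &grid[(rows - 1) * cols], cols, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, &st);

    /* Five-point stencil update over the interior points. */
    for (int i = 1; i < rows - 1; i++)
        for (int j = 1; j < cols - 1; j++)
            next[i * cols + j] = 0.25 * (grid[(i - 1) * cols + j] +
                                         grid[(i + 1) * cols + j] +
                                         grid[i * cols + j - 1] +
                                         grid[i * cols + j + 1]);
}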

5 Related Work

A large body of prior message passing work has influenced this work. The iWarp system [5], [16] attempted to integrate a VLIW processor and fine-grained communication system on a single chip. The INMOS transputer [4] had computing elements that could send messages to one another. The MIT Alewife machine [19] also contained support for fast user-level messaging. Other multicore microprocessors include VIRAM [18], Wavescalar [29], TRIPS [23], Smart Memories [21], [22], and the Tarantula [7] extension to Alpha. Some commercial chip multiprocessors include the POWER4 [8] and Intel Pentium D [1]. This work is applicable to many newer architectures that are similar to Raw in that they contain multiple processing elements on a single chip.

This paper primarily concentrated on Raw's dynamic network, but much work has been done using Raw's static network, which operates on scalar operands. Prior work [33], [12] shows that considerable speedups can result from using the static network for stream computation. Additionally, there exist a number of compiler systems for Raw that automatically generate statically-scheduled communication patterns. CFlow [13], for instance, is a compiler system that enables statically-scheduled message passing between programs running on separate processors. Raw's rawcc [20] automatically parallelizes C programs, generating communication instructions where necessary.

While this paper showed that MPI can be successfully ported to a multicore architecture, its inherent overhead causes it to squander the multicore opportunity. The Multicore Association's CAPI API [3] offers a powerful alternative: a lightweight API for multicore architectures that is optimized for low-latency inter-core networks and boasts a small memory footprint that can fit into in-core memory. There also exist a number of MPI implementations for a variety of platforms and interconnection devices, including MPICH [34], LAM/MPI [6], and OpenMPI [11]. [17] discusses software overhead in messaging layers.

6 Conclusion

This paper presented rMPI, an MPI-compliant message passing library for multicore architectures with on-chip interconnect. rMPI introduces robust, deadlock-free mechanisms to program multicores, offering an interface that is compatible with current MPI software. Likewise, rMPI gives programmers already familiar with MPI an easy interface with which to program Raw, one that enables fine-grained control over their programs.
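
As a minimal illustration of that interface compatibility, the following program uses only standard MPI calls; it is an assumed example rather than code taken from the paper, but it is the style of source that an MPI-compliant library such as rMPI is intended to accept unchanged.

/* Minimal sketch using only the standard MPI interface; the point is that
 * source like this needs no changes to target an MPI-compliant library such
 * as rMPI.  (Illustrative example, not taken from the paper.) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Under a conventional MPI implementation this program runs with at least two ranks, e.g. mpirun -np 2 ./a.out.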
