
[Fig. 5. Speedup (cycle ratio) for Jacobi Relaxation (16×16 input matrix). Curves: GDN, rMPI, LAM; x-axis: processors (2, 4, 8, 16); y-axis: speedup (0 to 3.5).]

[Fig. 6. Speedup (cycle ratio) for Jacobi Relaxation (2048×2048 input matrix). Curves: GDN, rMPI, LAM; x-axis: processors (2, 4, 8, 16); y-axis: speedup (0 to 3.5).]

For the 16×16 input matrix (Figure 5), neither MPI implementation achieves a meaningful speedup because of the low computation-to-communication ratio; there is simply not enough computation to effectively amortize the cost imposed by the MPI semantics. On the other hand, the extremely low overhead of the GDN allows it to achieve a non-trivial speedup despite the small input matrix.
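For reference, the kernel being parallelized is standard Jacobi relaxation, in which each interior element of the matrix is replaced by the average of its four neighbors on every sweep. The following is a minimal serial sketch in C; the function and array names are illustrative, not taken from the rMPI benchmark sources.

#include <stddef.h>

/* One Jacobi relaxation sweep over an N x N grid (5-point stencil).
 * Boundary rows and columns are assumed to hold fixed values that
 * the caller has already copied into "next". */
void jacobi_sweep(size_t N, const float grid[N][N], float next[N][N])
{
    for (size_t i = 1; i + 1 < N; i++)
        for (size_t j = 1; j + 1 < N; j++)
            next[i][j] = 0.25f * (grid[i-1][j] + grid[i+1][j] +
                                  grid[i][j-1] + grid[i][j+1]);
}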

For an input matrix size of 2048×2048, the three configurations scale very similarly, as seen in Figure 6. This is congruent with intuition: the low-overhead GDN outperforms both MPI implementations for small input data sizes because its low overhead immunizes it against low computation-to-communication ratios, but for the larger input the MPI overhead is amortized over a longer-running program with larger messages. rMPI even outperforms the GDN in the 16-processor case, most likely due to memory access synchronization on the GDN: the GDN algorithm is broken into phases in which more than one processor accesses memory at the same time. In contrast, the interrupt-driven approach used in rMPI effectively staggers memory accesses, and in this case such staggering provides a win for rMPI.
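The per-iteration communication that must be amortized is the boundary (halo) exchange between neighboring ranks. The sketch below shows how such an exchange is commonly written against the MPI interface, assuming a 1-D row-block decomposition with one ghost row above and below the local block; the decomposition and all identifiers here are illustrative assumptions, not code from the rMPI benchmark.

#include <mpi.h>

/* "local" holds rows_per_rank owned rows of width N, framed by a
 * ghost row at index 0 and another at index rows_per_rank + 1.
 * This 1-D decomposition is an assumption for illustration. */
void exchange_halos(float *local, int rows_per_rank, int N,
                    int rank, int nprocs)
{
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send first owned row up; receive the lower ghost row from below. */
    MPI_Sendrecv(&local[1 * N], N, MPI_FLOAT, up, 0,
                 &local[(rows_per_rank + 1) * N], N, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send last owned row down; receive the upper ghost row from above. */
    MPI_Sendrecv(&local[rows_per_rank * N], N, MPI_FLOAT, down, 1,
                 &local[0], N, MPI_FLOAT, up, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Each such call carries the buffering, matching, and envelope-processing work that the text describes as the cost imposed by the MPI semantics; on the raw GDN those costs are absent, which is why the GDN wins when messages are small.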

Figure 7 summarizes the speedup characteristics for the GDN, rMPI, and LAM/MPI. Again, the GDN achieves a 3x speedup immediately, even for small input data sizes. On the other hand, both MPI implementations exhibit slowdowns for small data sizes, as their overhead is too high and is not amortized at low computation-to-communication ratios. rMPI starts to see a speedup before LAM/MPI does, however, beginning with an input matrix of size 64×64. One potential reason rMPI exhibits more speedup is its fast interrupt mechanism and lack of operating system layers.
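For concreteness, the speedup plotted in these figures is a cycle ratio; the captions name the metric but do not spell out the formula, so assuming the usual definition, for p processors

    speedup(p) = cycles_serial / cycles_parallel(p)

and a value below 1 is a slowdown, which is exactly what both MPI implementations exhibit at the smallest input sizes.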

One clearly interesting input data size for the GDN and rMPI graphs is 512×512: both show a significant speedup spike. Figure 8, which shows the throughput (computed elements per clock cycle) of the serial Jacobi implementation running on Raw, sheds some light on why this speedup jump occurs. Up until the 512×512 input size, the entire data set fits into the Raw data cache, obviating the need to go to DRAM. The 512×512 data size, however, no longer fits into the cache of a single Raw tile, so a significant dip in throughput occurs for the serial version at that data size. On the other hand, since the data set is broken up for the parallelized GDN and rMPI versions, this cache bottleneck does not occur until even larger data sizes, which explains the jump in speedup. It should be noted that the distributed cache architecture evident in many multicore processors can be generally beneficial, as it allows fast caches to be tightly coupled with nearby processors.
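A back-of-the-envelope working-set calculation makes the cache effect concrete. The sketch below assumes 4-byte elements and an illustrative 32 KB per-tile data cache (this excerpt does not quote Raw's actual cache size): splitting an N×N matrix across p tiles divides each tile's share by p, pushing the point at which the data spills to DRAM out to larger N.

#include <stdio.h>

#define CACHE_BYTES (32u * 1024u)  /* assumed per-tile data cache size */

int main(void)
{
    const int procs[] = {1, 16};   /* serial vs. a 16-tile decomposition */
    for (int N = 64; N <= 2048; N *= 2) {
        for (int k = 0; k < 2; k++) {
            size_t per_tile = (size_t)N * N * sizeof(float) / procs[k];
            printf("N=%4d p=%2d per-tile working set = %8zu bytes (%s)\n",
                   N, procs[k], per_tile,
                   per_tile <= CACHE_BYTES ? "fits" : "spills to DRAM");
        }
    }
    return 0;
}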
