

[Fig. 7. Speedup for Jacobi (GDN, rMPI, and LAM). x-axis shows N, where the input matrix is N×N (N = 16 through 2048); y-axis shows speedup.]

[Fig. 8. Throughput (computed elements/cycle) for Jacobi on Raw for input sizes 16 through 2048. The data cache overflows at the 512×512 input size.]

[Fig. 9. Overhead of rMPI for Jacobi Relaxation relative to a GDN implementation for various input sizes (N=16 through N=2048) on 2, 4, 8, and 16 tiles. Overhead computed using cycle counts.]

[Fig. 10. Run time (cycles) of jacobi with varying instruction cache size (2kB through 256kB). Lower is better.]

Figure 3 characterizes the overhead of rMPI for a simple send/receive pair of processors on the Raw chip. To characterize rMPI's overhead for real applications, which may have complex computation and communication patterns, the overhead of rMPI was also measured for jacobi. The experiment measured the complete running time of jacobi for various input data sizes and numbers of processors under both the GDN and rMPI implementations. Figure 9 shows the results of this experiment. Here, rMPI overhead is computed as overhead_rMPI = (cycles_rMPI − cycles_GDN) / cycles_GDN.
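As a concrete reading of that formula, the short C sketch below computes the overhead from a pair of cycle counts. The absolute counts are hypothetical placeholders chosen to reproduce a roughly 450% overhead; the paper reports only the resulting ratios, not raw counts.

    /* Hypothetical cycle counts; only their ratio is meaningful here. */
    unsigned long long cycles_gdn  = 1000000ULL;   /* assumed GDN run time  */
    unsigned long long cycles_rmpi = 5500000ULL;   /* assumed rMPI run time */

    /* Overhead as defined above: (cycles_rMPI - cycles_GDN) / cycles_GDN. */
    double overhead = (double)(cycles_rmpi - cycles_gdn) / (double)cycles_gdn;
    /* overhead == 4.5, i.e. 450%, comparable to the worst case in Fig. 9. */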

As can be seen, rMPI's overhead is quite large for small data sets. Furthermore, its overhead is particularly high for small data sets running on a large number of processors, as evidenced by the 16×16 case on 16 processors, which has an overhead of nearly 450%. However, as the input data set grows, rMPI's overhead drops quickly. It should also be noted that for data sizes from 16×16 through 512×512, adding processors increases overhead, but for data sizes larger than 512×512, adding processors decreases overhead. In fact, the 1024×1024 data size on 16 processors has just a 1.7% overhead. The 2048×2048 case on 16 processors actually shows a speedup beyond the GDN
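For context on what the benchmark's communication looks like, the sketch below shows one Jacobi relaxation step with a row-wise decomposition, written against standard MPI point-to-point calls of the kind rMPI provides. It is an illustrative sketch only, not the benchmark source: the row-block layout with ghost rows, the matrix size, the function and variable names, and the use of MPI_Sendrecv (rather than separate MPI_Send/MPI_Recv pairs) are all assumptions.

    /* Illustrative sketch: one Jacobi step on a row-decomposed N x N grid.
       Each rank owns local_rows rows plus one ghost row above and below. */
    #include <mpi.h>

    #define N 1024   /* assumed global matrix dimension */

    void jacobi_step(double *cur, double *next, int local_rows,
                     int rank, int size)
    {
        MPI_Status st;
        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Halo exchange with nearest neighbors: the first owned row goes up,
           the bottom ghost row arrives from below, and vice versa. */
        MPI_Sendrecv(&cur[1 * N],               N, MPI_DOUBLE, up,   0,
                     &cur[(local_rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Sendrecv(&cur[local_rows * N],       N, MPI_DOUBLE, down, 1,
                     &cur[0],                    N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, &st);

        /* Jacobi update: average the four nearest neighbors of each
           interior point in the locally owned rows. */
        for (int i = 1; i <= local_rows; i++)
            for (int j = 1; j < N - 1; j++)
                next[i*N + j] = 0.25 * (cur[(i-1)*N + j] + cur[(i+1)*N + j]
                                      + cur[i*N + j - 1] + cur[i*N + j + 1]);
    }

In the hand-coded GDN version the same boundary rows are sent directly over the general dynamic network, whereas rMPI services the MPI calls in software on top of the GDN; the difference between the two is what the overhead figures above measure.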
