

1.5GHz POWER5 chips with 16GBytes of RAM. Each POWER5 chip is dual-core and dual-thread (giving a total of 8 threads on the system). The parallel version of FTDock for that multicore uses the FFTW 3.2alpha2 library to perform the 3D FFT, and consists of dividing the rotations among different tasks using OpenMPI. Therefore, we are comparing a coarse-grain parallelization on that multicore against a fine-grain parallelization on the Cell BE.
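
As a point of reference, a minimal sketch of such a coarse-grain scheme could look like the following. The grid size, rotation count and the score_rotation helper are hypothetical placeholders used only for illustration; they are not taken from the FTDock sources.

/* Sketch: each MPI task takes a round-robin share of the rotations
 * and runs the 3D FFTs locally with FFTW. */
#include <mpi.h>
#include <fftw3.h>

#define N    128   /* grid size per dimension (128^3 case, assumed) */
#define NROT 9240  /* number of rigid-body rotations (assumed) */

static void score_rotation(int r, float *grid, fftwf_complex *spec,
                           fftwf_plan fwd)
{
    /* Fill 'grid' for rotation r (omitted), then transform it. */
    fftwf_execute(fwd);  /* 3D real-to-complex FFT of the rotated grid */
    (void)r; (void)spec; /* correlation and scoring omitted */
}

int main(int argc, char **argv)
{
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    float *grid = fftwf_malloc(sizeof(float) * N * N * N);
    fftwf_complex *spec =
        fftwf_malloc(sizeof(fftwf_complex) * N * N * (N / 2 + 1));
    fftwf_plan fwd = fftwf_plan_dft_r2c_3d(N, N, N, grid, spec, FFTW_MEASURE);

    /* Coarse-grain split: rotation r is handled by task (r % ntasks). */
    for (int r = rank; r < NROT; r += ntasks)
        score_rotation(r, grid, spec, fwd);

    fftwf_destroy_plan(fwd);
    fftwf_free(spec);
    fftwf_free(grid);
    MPI_Finalize();
    return 0;
}

Each task works on whole rotations independently, so communication is limited to gathering the best scores at the end, which is what makes the decomposition coarse-grained.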

Figure 8 shows the total execution times of FTDock for 1, 2, 4 and 8 tasks or SPEs on a POWER5 multicore and on a Cell BE respectively, for 128³ and 256³ grid sizes. We can see that the Cell BE FTDock outperforms the multicore parallel implementation of FTDock. The reasons seem to be the same as those discussed when comparing the PPU and the SPU in Section 6.2: the memory hierarchy becomes a bottleneck for the POWER5 tasks, whereas the SPEs of the Cell BE avoid the MSHR limitation when accessing main memory. Moreover, the Cell is more cost-effective and more power-efficient than the POWER5 multicore [24].

[Figure 8: execution time in seconds versus number of tasks/SPEs (1, 2, 4, 8) for CellBE and POWER5 multicore; panel a) 128³ grid size, panel b) 256³ grid size]

Fig. 8. Total execution time of the parallel implementation of FTDock for a multicore POWER5 and a Cell BE for a 128³ grid size (left) and a 256³ grid size (right)

8 Conclusions

In this paper we have evaluated and analyzed an implementation of a protein docking application, FTDock, on a Cell BE Blade using different algorithm parameters and levels of parallelization.

We have achieved a significant speedup of FTDock by offloading the most time-consuming functions to the SPEs. However, improving the PPU processor and its memory hierarchy would increase the potential of the Cell BE for applications that cannot be completely offloaded to the SPEs. Increasing the number of SPEs per task (one thread running on one PPE) also improves the performance of the application. However, linear speedup is not achieved because main memory and EIB contention increases when several SPEs access main memory on the same Cell BE. Therefore, one should parallelize the application in such a way that the main memory and EIB contention is distributed between the two Cell BEs of the blade [18].
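
For illustration, a minimal sketch of one such task, assuming libspe2 and one POSIX thread per SPE context, might look as follows. NSPES and ftdock_spu_kernel are hypothetical names, not the actual FTDock symbols.

/* Sketch: one PPE thread per SPE context, each running the offloaded
 * SPU kernel; varying NSPES corresponds to the 1/2/4/8 SPE points. */
#include <libspe2.h>
#include <pthread.h>

#define NSPES 8  /* SPEs used by this task (assumed) */

extern spe_program_handle_t ftdock_spu_kernel;  /* embedded SPU binary (assumed) */

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Blocks until the SPU program stops; argp/envp omitted here. */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

int main(void)
{
    spe_context_ptr_t ctx[NSPES];
    pthread_t thr[NSPES];

    for (int i = 0; i < NSPES; i++) {
        ctx[i] = spe_context_create(0, NULL);
        spe_program_load(ctx[i], &ftdock_spu_kernel);
        pthread_create(&thr[i], NULL, run_spe, ctx[i]);
    }
    for (int i = 0; i < NSPES; i++) {
        pthread_join(thr[i], NULL);
        spe_context_destroy(ctx[i]);
    }
    return 0;
}

Since all NSPES contexts issue DMAs to the same main memory through the same EIB, spreading the tasks across the two Cell BEs of the blade spreads that contention as well.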

The relatively small Local Store of the SPE increases the number of main memory accesses needed to keep partial results. With FTDock, we have seen that increasing
