

6.3 Parallelization Using the Two Cell BEs of the Blade

So far, we have presented FTDock execution time results using several SPEs on the same Cell BE processor of a Cell BE Blade. In this section we present results for the MPI parallelization of FTDock using the two Cell BEs of a Blade.

Figure 7 shows the total execution time using 1 or 2 MPI tasks (one task per PPE) and 1, 2, 4 and 8 SPEs per task, for the 128³ (left) and 256³ (right) grid sizes. Using 1 task with n SPEs is slower (less efficient) than using 2 tasks with n/2 SPEs each. The main reason is that with two tasks the contention on the EIB and on main memory is distributed between the two Cell BEs of the Blade. This can also be seen by comparing the execution times of 1 task and 2 tasks for the same number of SPEs per task: the execution time is not divided by two.

Fig. 7. Total execution time of the parallel implementation of FTDock using 1 or 2 MPI tasks and 1, 2, 4 and 8 SPEs/task, for the 128³ (left) and 256³ (right) grid sizes
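To make the two-task scheme concrete, the following is a minimal sketch in C, not the authors' code, assuming that rotations of the mobile molecule are the unit of work distributed between MPI tasks; NUM_ROTATIONS and dock_rotation() are illustrative placeholders, and each task would internally drive its own SPEs as described above.

#include <mpi.h>

#define NUM_ROTATIONS 9240   /* hypothetical rotation count, for illustration */

/* Stand-in for the real per-rotation work (grid correlation via FFT),
 * which each task would offload to its own SPEs. */
static void dock_rotation(int r) { (void)r; }

int main(int argc, char **argv)
{
    int rank, ntasks, r;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);   /* 1 or 2 tasks, one per PPE */

    /* Interleave rotations across tasks so each Cell BE works on an
     * independent share, spreading EIB and main-memory traffic over
     * the two processors of the Blade. */
    for (r = rank; r < NUM_ROTATIONS; r += ntasks)
        dock_rotation(r);

    /* Reduction of the best-scoring results per task would go here. */
    MPI_Finalize();
    return 0;
}

Launching this sketch with one or two tasks corresponds to the 1-task and 2-task configurations of Fig. 7; the SPEs-per-task dimension is hidden inside dock_rotation().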

6.4 Parallelization Using the Dual-Thread PPU Feature

The PPE hardware supports two simultaneous threads of execution [2], duplicating the architected and special-purpose registers, except for some system-level resources such as memory.

The Discretize function accesses a working data set that fits entirely in the second-level cache. Moreover, the function contains many branches that may not be predicted correctly by the PPE branch-prediction hardware. Hence, a second thread can make forward progress and increase PPE pipeline utilization and system throughput. We achieve a 1.5x speedup of the Discretize function when parallelizing it with OpenMP, and a 1.1-1.2x overall improvement of the FTDock application. In any case, function offloading and vectorization of that function would yield better performance improvements and scalability.
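As an illustration only, the sketch below spreads a Discretize-style triple loop over the two PPE hardware threads with OpenMP; the grid size, data layout, and the inside_atom() test are hypothetical stand-ins for the real, branch-heavy discretization code.

#include <omp.h>

#define GRID 128   /* hypothetical grid dimension */

static char grid_map[GRID][GRID][GRID];

/* Stand-in for the real, branch-heavy per-cell classification. */
static int inside_atom(int x, int y, int z)
{
    return ((x * 31 + y * 17 + z) % 7) == 0;
}

void discretize(void)
{
    /* Two OpenMP threads map onto the two simultaneous PPE threads of
     * execution; when one thread stalls on a mispredicted branch, the
     * other can keep the shared PPE pipeline busy. */
    #pragma omp parallel for num_threads(2) schedule(static)
    for (int x = 0; x < GRID; x++)
        for (int y = 0; y < GRID; y++)
            for (int z = 0; z < GRID; z++)
                grid_map[x][y][z] = (char) inside_atom(x, y, z);
}

This relies only on standard OpenMP work-sharing; the choice of two threads matches the two hardware threads the PPE exposes.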

7 Comparison with a POWER5 Multicore Platform

In this section we compare our Cell BE implementation of the FTDock with a parallel version of the FTDock, running on a POWER5 multicore with two
