Lecture Notes in Computer Science 4917

Experiences with Parallelizing a Bio-informatics Program on the Cell BE

[Fig. 5 shows two bar charts of execution time (secs): (a) the effect of the optimizations on pairwise alignment (PW), with bars PPU, SPU-base, SPU-control, SPU-SIMD and SPU-SIMD-pipelined; (b) the effect of the optimizations on progressive alignment (PA), with bars PPU, SPU-base, SPU-SIMD-prfscore, SPU-control and SPU-SIMD.]

Fig. 5. The effect of each of the optimizations on the execution time of pairwise alignment and progressive alignment

Again, executing the original code on the SPU is slower than running on the PPU (bar "SPU-base" vs. "PPU"). Again, we attribute this to excessive control flow. In PA, we identify two possible causes: the loop inside the prfscore() function and the remaining control flow inside the loop bodies of the forward and backward loops. First, we remove all control flow in the prfscore() function by unrolling the loop, vectorizing, and using pre-computed masks to deal with the loop iteration count (see Section 5.1). This brings performance close to the PPU execution time (bar "SPU-SIMD-prfscore"). Second, we remove the remaining control flow in the same way as in the PW loop nests. This gives an overall speedup of 1.6 over PPU execution (bar "SPU-control").
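The following is a minimal sketch of the pre-computed-mask idea in plain scalar C; the real code uses SPU SIMD intrinsics, and the names (prfscore_orig, prfscore_masked, mask_tab, MAX_LEN) and the assumed upper bound on the profile length are illustrative, not taken from the ClustalW source.

/* Replace a data-dependent loop bound with a fixed trip count plus a
 * pre-computed 0/1 mask, so the SPU never executes a data-dependent branch. */
#define MAX_LEN 32                      /* assumed upper bound on the profile length */

/* Original shape: the trip count 'len' varies per call, so every
 * iteration ends in a branch that the SPU handles poorly. */
static int prfscore_orig(const int *prf1, const int *prf2, int len)
{
    int score = 0;
    for (int i = 0; i < len; i++)
        score += prf1[i] * prf2[i];
    return score;
}

/* Mask table: mask_tab[len][i] == 1 iff i < len.  Filled once at start-up. */
static int mask_tab[MAX_LEN + 1][MAX_LEN];

static void init_mask_tab(void)
{
    for (int len = 0; len <= MAX_LEN; len++)
        for (int i = 0; i < MAX_LEN; i++)
            mask_tab[len][i] = (i < len);
}

/* Branch-free shape: always run MAX_LEN iterations (fully unrollable and
 * vectorizable) and let the mask zero out the iterations beyond 'len'.
 * prf1/prf2 are assumed to be padded to MAX_LEN entries. */
static int prfscore_masked(const int *prf1, const int *prf2, int len)
{
    const int *mask = mask_tab[len];
    int score = 0;
    for (int i = 0; i < MAX_LEN; i++)   /* fixed trip count: no data-dependent branch */
        score += mask[i] * prf1[i] * prf2[i];
    return score;
}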

Vectorizing the forward and backward loops improves performance, but the effect is relatively small (bar "SPU-SIMD"). The reason is that the inner loop contains calls to prfscore(). The execution of these calls remains sequential, which significantly reduces the benefit of vectorization. Since these calls are also responsible for most of the execution time, there is no benefit from unrolling the vectorized loops: the unaligned memory accesses are relatively unimportant compared to prfscore(). Furthermore, removing unaligned memory accesses requires many registers, but the vectorized loop nest is already close to using all registers.
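A sketch of why the gain stays small: even after the DP recurrence is vectorized, the profile scores that feed it must still be produced one call at a time. The loop structure, the vec4 stand-in type, and the prfscore() signature below are assumptions for illustration, not the paper's actual code.

/* Illustrative shape of the vectorized forward loop. */
typedef struct { int v[4]; } vec4;      /* stand-in for one 128-bit SPU vector */

extern int prfscore(const int *prf1, const int *prf2, int len);  /* assumed signature */

void forward_pass_vectorized(int rows, int cols,      /* cols assumed padded to 4 */
                             int **prf1, int **prf2, int len,
                             vec4 *hh /* one DP row, 4 cells per vector */)
{
    for (int i = 1; i < rows; i++) {
        for (int j = 0; j < cols; j += 4) {
            /* The four profile scores are still computed by four sequential
             * calls: prfscore() cannot be issued as one vector operation
             * across the cells j..j+3. */
            vec4 s;
            for (int k = 0; k < 4; k++)
                s.v[k] = prfscore(prf1[i], prf2[j + k], len);

            /* Only the recurrence that combines the scores with neighbouring
             * DP cells runs as vector arithmetic (shown here as a placeholder
             * add); since prfscore() dominates the run time, the overall
             * speedup from vectorization stays small. */
            for (int k = 0; k < 4; k++)
                hh[j / 4].v[k] += s.v[k];
        }
    }
}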

6.3 Scaling with Multiple SPUs

The final part of our analysis concerns the scaling of performance when using multiple SPUs. In the following, we use the best version of each phase. Figure 6(b) shows the speedup over PPU-only execution when using an increasing number of SPUs.

As expected, the PW phase scales very well with multiple SPUs. With 8 SPUs, the parallelized and optimized PW phase runs 51.2 times faster than the original code on the PPU.
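The good scaling of PW follows from the fact that every sequence pair is an independent alignment, so the pairs can simply be dealt out to the SPUs. The round-robin split and the names below are an illustrative assumption, not necessarily the paper's exact distribution scheme.

/* Sketch: assign the n*(n-1)/2 independent sequence pairs to SPUs
 * round-robin; each SPU then runs the optimized pairwise-alignment
 * kernel on its own pairs with no inter-SPU communication. */
typedef struct { int seq_a, seq_b; } pair_t;

int assign_pairs(int nseq, int num_spus, int my_id, pair_t *my_pairs)
{
    int k = 0, mine = 0;
    for (int a = 0; a < nseq; a++) {
        for (int b = a + 1; b < nseq; b++, k++) {
            if (k % num_spus == my_id)      /* pair k goes to SPU (k mod num_spus) */
                my_pairs[mine++] = (pair_t){ a, b };
        }
    }
    return mine;   /* number of pairs this SPU will align */
}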
