

Figures 3.a and 4.a show the total execution time of the 3D FFT divided into communication and non-communication time. We measured the communication time by running the executions with no computation on the SPEs. Non-communication time in the figures is the total execution time minus the communication time; that is, it is the computation that is not overlapped with communication. Figures 3.b and 4.b show the total execution time divided into the five mandatory steps of the 3D FFT, except for the 128-A case with the 128³ grid size (Figure 3.b). In that case the 128 × 128 plane fits in the LS of the SPE, so Steps 1, 2, and 3 are joined into one step, called FFT2D in the figure, and Steps 4 and 5 are joined into another step, called transf+fft in the figure.
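
Expressed as a formula, the decomposition used in these figures is simply

    T_non-comm = T_total - T_comm,

where T_comm is the communication time measured in the runs with the SPE computation disabled, so T_non-comm captures only the computation that is not hidden behind communication.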

First, regardless of whether one or two SPEs are used to swap submatrices in the transpositions, the communication time increases when the blocksize is reduced, because this reduction lowers the memory bandwidth we obtain. In particular, single-row DMA transfers, which occur during the second matrix transposition (for the Nx × Nz planes), make poor use of the EIB memory bandwidth, especially for a blocksize of 32. For rows of 32 elements there are only 256 bytes to transfer, which is less than the minimum number of bytes (1024 bytes) needed to achieve near-peak memory bandwidth [18]. Moreover, DMA list transfers can help to improve memory bandwidth on the second matrix transposition: a DMA list transfer allows gather/scatter operations, which increases the effective size of those 256-byte DMA transfers.
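
As a rough illustration of this gather effect, the sketch below (plain C against the Cell SDK's spu_mfcio.h interface; the BLOCK constant, the buffer names, and the fetch_submatrix helper are our own, not the paper's code) builds a DMA list that brings in the 32-element rows of one submatrix, 256 bytes each, with a single mfc_getl command instead of issuing one small mfc_get per row.

#include <stdint.h>
#include <spu_mfcio.h>

#define BLOCK     32                      /* blocksize B: rows of 32 complex elements     */
#define ROW_BYTES (BLOCK * 8)             /* 32 elements x 8 bytes (two 4-byte floats)    */

/* One list element per row: each element moves only 256 bytes, below the
 * 1024 bytes needed for near-peak bandwidth, but the whole list is issued
 * as a single DMA command and the MFC streams the rows back to back.       */
static mfc_list_element_t dma_list[BLOCK] __attribute__((aligned(8)));
static float ls_block[BLOCK * BLOCK * 2] __attribute__((aligned(128)));

/* Gather one BLOCK x BLOCK submatrix whose rows are strided in main memory.
 * All rows must lie in the same 4 GB region: the list supplies only the low
 * 32 address bits, the high bits come from the ea argument of mfc_getl.    */
void fetch_submatrix(uint64_t row0_ea, uint64_t row_stride_bytes, unsigned int tag)
{
    int r;
    for (r = 0; r < BLOCK; r++) {
        dma_list[r].notify = 0;
        dma_list[r].size   = ROW_BYTES;                                  /* bytes per row  */
        dma_list[r].eal    = (uint32_t)(row0_ea + r * row_stride_bytes); /* low 32 EA bits */
    }
    /* One list command instead of BLOCK separate 256-byte mfc_get commands. */
    mfc_getl(ls_block, row0_ea, dma_list, sizeof(dma_list), tag, 0, 0);

    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();            /* block until the whole list has completed */
}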

Figures 3 and 4 show performance differences between the A 3D FFT and B 3D FFT transposition strategies in the communication part. That is because, in the A 3D FFT strategy, one SPE uses double buffering when swapping and transposing the two B × B submatrices, whereas in the B 3D FFT strategy no double buffering is done; in this case one SPE performs the matrix transposition of only one submatrix. Therefore, for the same blocksize, the execution time of the A 3D FFT strategy is lower than that of the B 3D FFT strategy. However, for the 256³ grid size, the B 3D FFT strategy shows better performance when using a blocksize of 128 (Figure 4). That is because, with this blocksize, we increase both the utilization of the EIB memory bandwidth and the amount of computation done in each SPE. The minimum DMA transfer size to achieve peak memory bandwidth is 1024 bytes [18], which corresponds to a row of 128 elements at two 4-byte floats per element.
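
A minimal sketch of the double-buffering pattern described for the A strategy follows, again in C with spu_mfcio.h. The helper names (submatrix_ea, process_block), the two-buffer layout, and the choice of a 32 blocksize are illustrative assumptions (a single mfc_get/mfc_put moves at most 16 KB, so a 128 × 128 block would itself need a DMA list or several commands); the point is only that the DMA for block i+1 is in flight under one tag group while block i is being processed under the other.

#include <stdint.h>
#include <spu_mfcio.h>

#define B         32                      /* blocksize used for this sketch              */
#define BLK_BYTES (B * B * 8)             /* B*B elements, two 4-byte floats per element */

static float buf[2][B * B * 2] __attribute__((aligned(128)));

extern uint64_t submatrix_ea(int i);      /* EA of the i-th B x B submatrix (hypothetical) */
extern void process_block(float *blk);    /* transpose/compute one block (hypothetical)    */

void transpose_blocks(int nblocks)
{
    int i, cur = 0, nxt = 1;

    /* Prime the pipeline: start fetching the first block under tag 0. */
    mfc_get(buf[cur], submatrix_ea(0), BLK_BYTES, cur, 0, 0);

    for (i = 0; i < nblocks; i++) {
        /* Prefetch block i+1 into the other buffer; the barrier form (getb)
         * orders it after the earlier put in the same tag group, so the
         * buffer is not overwritten while its write-back is still in flight. */
        if (i + 1 < nblocks)
            mfc_getb(buf[nxt], submatrix_ea(i + 1), BLK_BYTES, nxt, 0, 0);

        /* Wait only for the current block, then work on it while the next
         * block streams in under the other tag group.                      */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur]);

        /* Write the processed block back, reusing the same tag group. */
        mfc_put(buf[cur], submatrix_ea(i), BLK_BYTES, cur, 0, 0);

        cur ^= 1;
        nxt ^= 1;
    }

    /* Drain any outstanding write-backs on both tag groups. */
    mfc_write_tag_mask(3);
    mfc_read_tag_status_all();
}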

Finally, for the 128-A case with the 128³ grid size, we have reduced the number of DMA transfers to be done and increased the reuse of the data per DMA transfer (Figure 3).

Figure 5 shows the speed-up with respect to the execution time with 1 SPE, using the best blocksize for each transposition strategy and grid size. The speed-up achieved is not linear, mostly because of contention when accessing main memory and conflicts on the EIB. However, the one-SPE strategy scales very well for the 128³ grid size thanks to the reduction of DMA transfers. For the two-SPE strategy, we estimate the execution time by assuming that it achieves the same speed-up on 2 SPEs as the one-SPE strategy.
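
In other words, the speed-up plotted in Figure 5 is the usual ratio

    speed-up(n SPEs) = T(1 SPE, best blocksize) / T(n SPEs, best blocksize),

computed separately for each transposition strategy and grid size.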

Finally, the synchronization needed to access the work scheduler list does not seem to be a performance bottleneck. However, we have to interleave wait loops between reads of the atomic counter to reduce contention.
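
The contention throttling mentioned in the last sentence can be sketched as follows (C with spu_mfcio.h; the counter layout, the DELAY_ITERS constant, and the wait_for_counter name are our own assumptions, not the paper's scheduler code): each SPE polls the shared counter with the lock-line (atomic) read command and spins in a local delay loop between polls, so the SPEs do not flood the EIB with back-to-back atomic reads.

#include <stdint.h>
#include <spu_mfcio.h>

#define DELAY_ITERS 256                   /* arbitrary local back-off between polls */

/* Local-store copy of the 128-byte lock line holding the shared counter;
 * this sketch assumes the counter sits at the start of a 128-byte-aligned
 * line in main memory.                                                      */
static uint32_t counter_line[32] __attribute__((aligned(128)));

/* Spin until the shared counter at ea_counter reaches 'target', polling it
 * with the lock-line (atomic) read command and backing off locally between
 * polls so the SPEs do not hammer the EIB with consecutive atomic reads.   */
void wait_for_counter(uint64_t ea_counter, uint32_t target)
{
    volatile int d;

    for (;;) {
        mfc_getllar(counter_line, ea_counter, 0, 0);  /* atomic read of the line */
        (void)mfc_read_atomic_status();               /* wait for it to complete */

        if (*(volatile uint32_t *)&counter_line[0] >= target)
            return;

        /* Interleaved wait loop: burn a few cycles locally instead of
         * issuing another atomic command right away.                    */
        for (d = 0; d < DELAY_ITERS; d++)
            ;
    }
}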
