Lecture Notes in Computer Science 4917
184 H. Servat et al.
Figures 3.a and 4.a show the total execution time of the 3D FFT divided into communication and non-communication time. We measured the communication time with executions that perform no computation in the SPEs. Non-communication time in the Figures is the total execution time minus the communication time; that is, non-communication time is the computation that is not overlapped with communication. Figures 3.b and 4.b show the total execution time divided into the five mandatory steps of the 3D FFT, except for the 128-A and 128³ grid size case (Figure 3.b). For the 128-A and 128³ grid size case, the 128 × 128 plane fits in the LS of the SPE, so Steps 1, 2, and 3 are joined into one step called FFT2D in the Figure, and Steps 4 and 5 are joined into another step called transf+fft in the Figure.
First, independently of using one or two SPEs to swap submatrices in the transpositions, communication time increases when the blocksize is reduced, because this reduction lowers the memory bandwidth that we obtain. In particular, one-row DMA transfers, which happen in the second matrix transposition (for Nx × Nz planes), underuse the EIB memory bandwidth, especially for a blocksize of 32. For rows of 32 elements, there are only 256 bytes to transfer, which is less than the minimum transfer size (1024 bytes) needed to achieve near-peak memory bandwidth [18]. DMA list transfers can help improve memory bandwidth in the second matrix transposition: a DMA list transfer allows gather/scatter operations, which increases the effective size of those 256-byte DMA transfers.
Figures 3 and 4 show performance differences between the A 3D FFT and B 3D FFT transposition strategies in the communication part. That is because, in the A 3D FFT strategy, one SPE uses double buffering when swapping and transposing the two B × B submatrices, while in the B 3D FFT strategy no double buffering is done; in this case one SPE performs the matrix transposition of only one submatrix. Therefore, for the same blocksize, the execution time of the A 3D FFT strategy is less than that of the B 3D FFT strategy. However, for the 256³ grid size, the B 3D FFT strategy shows better performance when using a blocksize of 128 (Figure 4). That is because we increase both the usage of the EIB memory bandwidth and the computation done in an SPE. The minimum DMA transfer size to achieve peak memory bandwidth is 1024 bytes [18], which corresponds to a row of 128 elements with two 4-byte floats per element.
Finally, for the 128-A and 128³ grid size case, we have reduced the number of DMA transfers to perform and increased the reuse of data per DMA transfer (Figure 3).
Figure 5 shows the speed-up relative to the execution time with 1 SPE, using the best blocksize for each transposition strategy and grid size. The speedup achieved is not linear, mostly due to contention when accessing main memory and conflicts on the EIB. However, the one-SPE strategy scales very well for grid size 128³ thanks to the reduction of DMA transfers. For the two-SPE strategy, we estimate the execution time by assuming that the one- and two-SPE strategies achieve the same speedup on 2 SPEs.
Finally, synchronization, which is necessary when accessing the work scheduler list, does not seem to be a performance bottleneck. However, we have to interleave wait loops between reads of an atomic counter to reduce contention.