RECENT DEVELOPMENT IN COMPUTATIONAL SCIENCE

Recommendations

Info

ISCS 2011 Selected Papers Vol.2 Time Benchmarks for the OpenMP and GPU GFLOPS GFLOPS 350 300 250 200 150 100 50 0 300 250 200 150 100 50 0 (a) real number (M2) DGEMM(8 Threads) DGEMM(Tesla C1060) SGEMM(Tesla C1060) Ratio(Single/Double) 256 512 1024 2048 Matrix size 4096 (c) real number (M3) Tesla C1060 Tesla M2050 with cuda3.1 Tesla M2050 with cuda3.2 256 512 1024 2048 Matrix size 4096 5120 5120 7 6 5 4 3 2 1 0 RATIO GFLOPS GFLOPS 350 300 250 200 150 100 50 0 300 250 200 150 100 50 0 (b) complex number (M2) ZGEMM(8 Threads) ZGEMM(Tesla C1060) CGEMM(Tesla C1060) Ratio(Single/Double) 256 512 1024 2048 Matrix size 4096 5120 (d) complex number (M3) Tesla C1060 Tesla M2050 with cuda3.1 Tesla M2050 with cuda3.2 256 512 1024 Matrix size 2048 Figure 4: Effective FLOPS values of the MM for various matrix sizes (256 ∼ 5120) by SGEMM, CGEMM, DGEMM and ZGEMM routines in Tesla GPUs (Tesla C1060 and Tesla M2050). The upper and lower two panels are obtained in the machines of M2 (a)(b) and M3 (c)(d) (see Table 1), respectively. The times for memory transfer are included in the estimation. Memory Transfer Rate [%] 100 80 60 40 20 0 MM DtoH HtoD 37.5% 14.6% 47.8% 128 51.8% 11.3% 36.8% 256 56.9% 20.1% 22.9% 1024 79.2% 8.6% 12.2% 4096 Figure 5: Data transfer ratio in DGEMM calculation for various matrix sizes in M3. 21 4096 6 4 2 0 RATIO
ISCS 2011 Selected Papers Vol.2 J. Gotou, S. Haraguchi, M. Tsujikawa, T. Oda Figure 6: Parallelization scheme for three dimensional fast Fourier transformation (3D-FFT) in the OpenMP. speedup factor 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 Num.of Processors Figure 7: The parallelization efficiency (speed-up factor) of 3D-FFT in the OpenMP. The size of 48 × 48 × 216 is used. number of available threads in OpenMP. The parallelization efficiency in OpenMP is presented in Fig. 7. Using 8 cores, we obtain the efficiency of 65%. The parallelization efficiency has decreased compared with the MM (see Fig. 3). This may be because in the scheme the temporal memory for 1D-FFT, as the overhead of calculation, is stored discontinuously from the original memory area of 3D-FFT. As another investigation for 3D-FFT, the automatic parallelized version of FFTW [8] in OpenMP are tested (not reported here), resulting the parallelization efficiency of 25% for using 8 cores. In this work, we have carried out the test of CUFFT for 3D-FFT. The computational times are presented in Fig. 8. The efficiency is found to depend on the size of FFT (spiky structure in the figure). There are many sizes where the efficiency is worse than that of FFTW [8]. This implies the 3D-FFT by CUFFT is not appropriate for the CPMD code in which the size of FFT depends on the input parameters. 4 Results on CPMD code 4.1 MPI+OpenMP (Hybrid) The hybrid version of code is organized with k point parallelization on MPI, the OpenMP calculation both in MM and FFT, and other minor parallelizations. The efficiencies are presented with those of the MPI version in Fig. 9. Using 16 cores (2 MPI+ 8 OpenMP), the performance amounts to 37%. This value is still fearly well, but the additional improvement must be required for an more efficient calculation. 4.2 MPI+GPU The MPI+GPU version of CPMD code has been checked mainly in M2, which has 8 cores and 3 GPUs. The BaTiO3 system is used for the computation. The time report is presented in Figs. 10 and 11. In the latter, the times for MM, FFT are included. In these figures, the labels of BLAS and FFTW indicate the single processing calculation for the MM and the FFT, respectively. The 22
Page 1 and 2: RECENT DEVELOPMENT IN COMPUTATIONAL
Page 3 and 4: Group Photo February 17, 2011 �
Page 8 and 9: Compression Stress Effect on Disloc
Page 10 and 11: ISCS 2011 Selected Papers Vol.2 Com
Page 12 and 13: ISCS 2011 Selected Papers Vol.2 Com
Page 14: ISCS 2011 Selected Papers Vol.2 Com
Page 17 and 18: ISCS 2011 Selected Papers Vol.2 S.
Page 25 and 26: ISCS 2011 Selected Papers Vol.2 J.
Page 27: ISCS 2011 Selected Papers Vol.2 J.
Page 31 and 32: ISCS 2011 Selected Papers Vol.2 J.
Page 34 and 35: Simulation Of Fluid-Solid Interacti
Page 36 and 37: ISCS 2011 Selected Papers Vol.2 Sim
Page 44 and 45: High-Pressure Crystal Structure Pre
Page 46 and 47: ISCS 2011 Selected Papers Vol.2 Cry
Page 52: ISCS 2011 Selected Papers Vol.2 Cry
Page 55 and 56: ISCS 2011 Selected Papers Vol.2 M.
Page 67 and 68: ISCS 2011 Selected Papers Vol.2 Mic
Page 77 and 78: ISCS 2011 Selected Papers Vol.2 Gia
Page 79 and 80:
ISCS 2011 Selected Papers Vol.2 Gia
Page 81 and 82:
Page 83 and 84:
Page 86 and 87:
International Symposium on Computat
Page 88 and 89:
February 16 (Wednesday): 2 nd day M
Page 90 and 91:
List of Participants Akhmaloka Rect
Page 92:
Gia Septiana Wulandari Triati Dewi
show all

RECENT DEVELOPMENT IN COMPUTATIONAL SCIENCE

Create successful ePaper yourself

Delete template?

Save as template?