Tutorial: Introduction to CUDA Fortran | GTC 2013

Figure (timelines, NOT TO SCALE): three ways to schedule the transfers and the FFT-based convolution.
Naive approach: Transfer A, Transfer B, FFT(A), FFT(B), conv, IFFT(C), Transfer C.
Smarter approach: an I/O row (Transfer A, Transfer B, Transfer C) alongside a Compute row (FFT(A), FFT(B), conv, IFFT(C)).
Optimal approach: an H2D row (A(1) B(1), A(2) B(2), through A(N) B(N)), a Compute row (FFT A(k), FFT B(k), conv, IFFT C(k) for k = 1, 2, through N), and a D2H row (C(1), C(2), through C(N)).
Time labels shown in the figure: 0.3s and 0.18s.
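The difference between running the stages back to back and overlapping them can be estimated with a rough cost model. The following is only a sketch and is not taken from the slides: the symbols N, t_H, t_C and t_D are introduced here for illustration, every frame is assumed to cost the same, and the host-to-device copy, the compute stage and the device-to-host copy are assumed to be able to run concurrently (which in practice needs pinned host memory, hardware that supports copy/compute overlap, and enough device buffers).

```latex
% Assumed symbols (not from the slides):
%   N   = number of frames
%   t_H = host-to-device copy time per frame (A and B together)
%   t_C = compute time per frame (two forward FFTs, product, inverse FFT)
%   t_D = device-to-host copy time per frame
\begin{align}
T_{\text{serial}}    &= N\,(t_H + t_C + t_D) \\
T_{\text{pipelined}} &\approx (t_H + t_C + t_D) + (N-1)\,\max(t_H,\, t_C,\, t_D)
\end{align}
```

Under these assumptions the pipelined total approaches N times the slowest stage for large N, so overlapping pays off most when the three stage times are comparable.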
program driver
  use cudafor
  use cufft   ! cuFFT interface module; the calls below use subroutine-style bindings
  implicit none

  complex, allocatable, dimension(:,:,:), pinned :: A, B       ! page-locked host arrays
  complex, allocatable, dimension(:,:,:), device :: A_d, B_d   ! one device slot per stream
  real :: elapsed_time, scale
  integer, parameter :: num_streams = 8
  integer :: nxx, nyy, nomega, plan, stream(num_streams)
  integer :: clock_start, clock_end, clock_rate, istat, ifr, i, j, ind
  logical :: plog

  nxx = 512; nyy = 512; nomega = 196

  ! Initialize FFT plan
  call cufftPlan2d(plan, nyy, nxx, CUFFT_C2C)

  ! Create the streams
  do i = 1, num_streams
    istat = cudaStreamCreate(stream(i))
  end do

  ! Find the clock rate
  call SYSTEM_CLOCK(COUNT_RATE=clock_rate)

  allocate(A(nxx,nyy,nomega), B(nxx,nyy,nomega))
  allocate(A_d(nxx,nyy,num_streams), B_d(nxx,nyy,num_streams))

  istat = cudaThreadSynchronize()   ! deprecated name; cudaDeviceSynchronize is the current equivalent
  call SYSTEM_CLOCK(COUNT=clock_start)   ! Start timing

  scale = 1./(nxx*nyy)

  do ifr = 1, nomega
    ind = mod(ifr, num_streams) + 1   ! device slot and stream used for this frame

    ! Send data to GPU (stream passed by keyword; the fourth positional argument is the copy direction)
    istat = cudaMemcpyAsync(A_d(1,1,ind), A(1,1,ifr), nxx*nyy, stream=stream(ind))
    istat = cudaMemcpyAsync(B_d(1,1,ind), B(1,1,ifr), nxx*nyy, stream=stream(ind))

    ! Execute FFTs on GPU
    call cufftSetStream(plan, stream(ind))
    call cufftExecC2C(plan, A_d(1,1,ind), A_d(1,1,ind), CUFFT_FORWARD)
    call cufftExecC2C(plan, B_d(1,1,ind), B_d(1,1,ind), CUFFT_FORWARD)

    ! Convolution and scaling of the arrays, launched in the same stream
    !$cuf kernel do(2) <<<*,*,stream=stream(ind)>>>
    do j = 1, nyy
      do i = 1, nxx
        B_d(i,j,ind) = A_d(i,j,ind)*B_d(i,j,ind)*scale
      end do
    end do

    ! Execute inverse FFT on GPU
    call cufftExecC2C(plan, B_d(1,1,ind), B_d(1,1,ind), CUFFT_INVERSE)

    ! Copy results back
    istat = cudaMemcpyAsync(B(1,1,ifr), B_d(1,1,ind), nxx*nyy, stream=stream(ind))
  end do

  istat = cudaThreadSynchronize()
  call SYSTEM_CLOCK(COUNT=clock_end)   ! End timing
  print *, "Elapsed time :", REAL(clock_end-clock_start)/REAL(clock_rate)

  deallocate(A, B, A_d, B_d)
end program
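The driver reuses only num_streams device slots for nomega frames, so it helps to see which frames share a slot. The host-only sketch below needs no GPU and simply replays the driver's index formula; the names slot_mapping, slot_of and show_slots are hypothetical and introduced here for illustration.

```fortran
! Host-only sketch: replays the driver's slot/stream indexing, no GPU required.
! The module, function and program names are illustrative assumptions.
module slot_mapping
  implicit none
contains
  ! Same formula as the driver: ind = mod(ifr, num_streams) + 1
  pure integer function slot_of(ifr, num_streams)
    integer, intent(in) :: ifr, num_streams
    slot_of = mod(ifr, num_streams) + 1
  end function slot_of
end module slot_mapping

program show_slots
  use slot_mapping
  implicit none
  integer, parameter :: num_streams = 8   ! value used in the driver
  integer, parameter :: nomega = 196      ! value used in the driver
  integer :: ifr, ind, uses(num_streams)

  uses = 0
  do ifr = 1, nomega
    ind = slot_of(ifr, num_streams)
    uses(ind) = uses(ind) + 1              ! count frames handled by each slot
    if (ifr <= 2*num_streams) then
      print '(a,i3,a,i2)', 'frame ', ifr, ' -> slot ', ind
    end if
  end do
  print '(a,8i4)', 'frames per slot: ', uses
end program show_slots
```

Frame 1 lands in slot 2 and frame 8 wraps around to slot 1, so frames ifr and ifr + num_streams always share both a device slot and a stream. Because operations issued to a single CUDA stream execute in issue order, the copy back of the earlier frame completes before the later frame's upload overwrites that slot, which is what makes the buffer reuse in the driver safe.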
Page 1 and 2: Introduction to CUDA Fortran
Page 3 and 4: Introduction
Page 5 and 6: Heterogeneous Programming
Page 7 and 8: Data Transfers
Page 9 and 10: Data Transfers
Page 11 and 12: F90 Array Increment
Page 13 and 14: CUDA Fortran - Device Code
Page 15 and 16: Execution Model
Page 17 and 18: Mapping Arrays to Thread Blocks
Page 19 and 20: Built-in Variables for Device Code
Page 21 and 22: Multidimensional Example - Host
Page 23 and 24: 2D Example - Device Code
Page 25 and 26: Kernel Loop Directives (CUF Kernels)
Page 27 and 28: Reduction using CUF Kernels
Page 29 and 30: Compilation
Page 31 and 32: Separate Compilation
Page 33 and 34: Host-Device Transfers
Page 35 and 36: Page-Locked Data Transfers
Page 37 and 38: Asynchronous Data Transfers
Page 39 and 40: GPU/CPU Synchronization
Page 41 and 42: Outline
Page 43 and 44: Misaligned Data Access
Page 45 and 46: Strided Data Access
Page 47 and 48: Shared Memory
Page 49 and 50: Matrix Transpose - Shared Memory
Page 51 and 52: Textures
Page 53 and 54: Textures - Host Code
Page 55 and 56: Execution Configuration
Page 57 and 58: Thread-Level Parallelism
Page 59 and 60: Thread-Level Parallelism
Page 61 and 62: Instruction-Level Parallelism
Page 63 and 64: Calling CUBLAS from CUDA Fortran
Page 65 and 66: Calling Thrust from CUDA Fortran
Page 67 and 68: CUDA Fortran Sorting with Thrust
Page 69: Convolution Example
Page 73 and 74: CUDA Libraries from CUDA Fortran
Page 75 and 76: Computing π
Page 77 and 78: Reductions on GPU
Page 79 and 80: Computing π with CUDA Fortran Kern
Page 81 and 82: Computing π with an Atomic Lock
Page 83 and 84: Multi-GPU Programming
Page 85 and 86: Multi-GPU Memory Allocation
Page 87 and 88: Unified Virtual Addressing
Page 89 and 90: Direct Transfers
Page 91 and 92: Direct Transfer Example - Device Co
Page 93 and 94: Direct Transfer Example - Host Code