Using Thrust to Sort CUDA FORTRAN Arrays - NVIDIA Developer ...

Thrust Wrapper Functions Sort Implementation PerspectivesUsing Thrust to Sort CUDA FORTRAN Arrays©NVIDIA Corporation | Summer, 2011 | Ty McKercherCUDA How-To Guide© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesValue Proposition• Implement high performance parallel applicationswith minimal programming effort• Rich collection of object-based data parallelprimitives to implement complex algorithms withconcise, readable source code• Now accessible via FORTRAN!© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesAcknowledgements• Massimiliano Fatica (NVIDIA)• Thomas Toy (Portland Group)• Jared Hoberock (NVIDIA)• Nathan Bell (NVIDIA)© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesWhat is Thrust?Standard templatelibrary for GPU• Leverage ParallelPrimitives forrapiddevelopmentHighly optimizedalgorithms• Sorting, PrefixSum, Reduction,more…Ships free withCUDA v4.0• Integrate with• CUDA C/C++• CUDA FORTRAN…© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesThrust ComponentsContainers• Manage host and device memory• Simplify data transfersIterators• Act like pointers• Keep track of memory spacesAlgorithms• Applied to Containers© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesSimple Sort Example using Thrust#include #include #include #include #include #include int main(void){// generate 32M random numbers on the hostthrust::host_vector h_vec(32

Thrust Wrapper FunctionsSort Implementation PerspectivesOutlineWhat is Thrust?Wrapper FunctionsSort ImplementationPerspectives© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesConfigure CUDA FORTRAN for CUDA v4.0rc4.oset CUDAROOT=/usr/local/cuda;set CUDAVERSION=4.0;MakefileNVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3Until PGI supports CUDA v4.0 natively• Create rc4.o file• Compile CUDA FORTRAN files (.cuf) using –rc flag• Add –L/usr/local/cuda/lib64 if using CUDA v4.0 Toolkit© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesThrust conversion featureIn order to call Thrustfrom CUDA FORTRANMust convertdevice containerto standard Cpointer// allocate device vectorthrust::device_vector d_vec(4);// obtain raw pointer to device vector’s memoryint *ptr = thrust::raw_pointer_cast(&d_vec[0]);© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesCreate C Wrapper to Thrust sort functioncsort.cu#include #include #include extern "C" {void sort_int_wrapper( int *data, int N){thrust::device_ptr dev_ptr(data);thrust::sort(dev_ptr, dev_ptr+N);}void sort_float_wrapper( float *data, int N){thrust::device_ptr dev_ptr(data);thrust::sort(dev_ptr, dev_ptr+N);}NVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3all: csort.ocsort.o: csort.cunvcc -c -arch sm_13 $(NVINC) $^ -o $@clean:rm csort.oMakefile}void sort_double_wrapper( double *data, int N){thrust::device_ptr dev_ptr(data);thrust::sort(dev_ptr, dev_ptr+N);}© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesAdd FORTRAN interface for Wrappersmodule thrustinterface thrustsortthrust_module.cufMakefileNVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3subroutine sort_int(input,N) bind(C,name="sort_int_wrapper")use iso_c_bindinginteger(c_int),device:: input(*)integer(c_int),value:: Nend subroutinesubroutine sort_float(input,N) bind(C,name="sort_float_wrapper")use iso_c_bindingreal(c_float),device:: input(*)integer(c_int),value:: Nend subroutineall: csort.o thrust_module.ocsort.o: csort.cunvcc -c -arch sm_13 $(NVINC) $^ -o $@thrust_module.o: thrust_module.cufpgf90 –c $(F90FLAGS) $^ -o $@clean:rm csort.osubroutine sort_double(input,N) bind(C,name="sort_double_wrapper")use iso_c_bindingreal(c_double),device:: input(*)integer(c_int),value:: Nend subroutineend interfaceend module thrust© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort ImplementationPerspectivesOutlineUsing Thrust to sort FORTRAN arrayWrapper FunctionSort ImplementationPerspectives© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesCUDA FORTRAN test sort programprogram testsortuse thrusttest_sort.cufMakefileNVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3real, allocatable :: cpuData(:)real, allocatable, device :: gpuData(:)integer:: N=10allocate(cpuData(N))allocate(gpuData(N))do i=1,NcpuData(i)=random(i)end docpuData(5)=100.print *,"Before sorting", cpuDatagpuData=cpuDatacall thrustsort(gpuData,size(gpuData))cpuData=gpuDataprint *,"After sorting", cpuDataall: test_sorttest_sort: test_sort.o csort.o thrust_module.opgf90 $(F90FLAGS) -o $@ $^test_sort.o:pgf90 –c $(F90FLAGS) $< -o $@test_sort.cuf thrust_module.ocsort.o: csort.cunvcc -c -arch sm_13 $(NVINC) $^ -o $@thrust_module.o: thrust_module.cufpgf90 –c $(F90FLAGS) $^ -o $@clean:rm csort.o test_sort.o test_sortend program© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesAdd timing to sort programprogram timesortuse cudaforuse thrustimplicit nonedo i=1,NcpuData(i)=random(i)end doprint *,"Sorting array of ",N, " single precision"real, allocatable :: cpuData(:)real, allocatable, device :: gpuData(:)integer:: i,N=100000000type ( cudaEvent ) :: startEvent , stopEventreal :: time, randominteger :: istatistat = cudaEventCreate ( startEvent )istat = cudaEventCreate ( stopEvent )allocate(cpuData(N))allocate(gpuData(N))gpuData=cpuDataistat = cudaEventRecord ( startEvent , 0)call thrustsort(gpuData,size(gpuData))istat = cudaEventRecord ( stopEvent , 0)istat = cudaEventSynchronize ( stopEvent )istat = cudaEventElapsedTime ( time,startEvent,stopEvent )cpuData=gpuDataprint *," Sorted array in:",time," (ms)"print *,"After sort", cpuData(1:5),cpuData(N-4:N)end program© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesRun Time (Seconds)Single Precision Timing Results1.4Single Precision Sort Time Comparison(CUDA FORTRAN Wrapper vs Native Thrust)1.210.80.60.4M2090 CUDA FORTRANM2090 Native Thrust0.20100 M 200 M 300 M 400 M 500 M 600 MNumber of Single Precision Array Elements© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesRun Time (seconds)Double Precision Timing Results1.61.41.2Double Precision Sort Time Comparison(CUDA FORTRAN Wrapper vs Native Thrust)10.80.6M2090 CUDA FORTRANM2090 Native Thrust0.40.2050 M 100 M 200 M 300 MNumber of Double Precision Array Elements© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesOutlineUsing Thrust to sort FORTRAN arrayWrapper FunctionsSort ImplementationPerspectives© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesConclusionsThrust deliversperformance andimprovesprogrammerproductivityWrapper functionsintroducenegligibleoverheadCUDA FORTRANcan benefit fromThrust innovationsCUDA FORTRANWrapperperformance =native Thrustperformance© 2011 NVIDIA Corporation

Thrust Wrapper Functions Sort Implementation PerspectivesReferences• Collection of CUDA examples, tricks, and suggestions– http://cudamusing.blogspot.com/• PGI Compiler and Tools– http://www.cse.scitech.ac.uk/events/GPU_2010/16_Miles.pdf• Thrust Google Project Page– http://code.google.com/p/thrust/© 2011 NVIDIA Corporation

Using Thrust to Sort CUDA FORTRAN Arrays - NVIDIA Developer ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?