13.07.2015 Views

Using Thrust to Sort CUDA FORTRAN Arrays - NVIDIA Developer ...

Using Thrust to Sort CUDA FORTRAN Arrays - NVIDIA Developer ...

Using Thrust to Sort CUDA FORTRAN Arrays - NVIDIA Developer ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation Perspectives<strong>Using</strong> <strong>Thrust</strong> <strong>to</strong> <strong>Sort</strong> <strong>CUDA</strong> <strong>FORTRAN</strong> <strong>Arrays</strong>©<strong>NVIDIA</strong> Corporation | Summer, 2011 | Ty McKercher<strong>CUDA</strong> How-To Guide© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesValue Proposition• Implement high performance parallel applicationswith minimal programming effort• Rich collection of object-based data parallelprimitives <strong>to</strong> implement complex algorithms withconcise, readable source code• Now accessible via <strong>FORTRAN</strong>!© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesAcknowledgements• Massimiliano Fatica (<strong>NVIDIA</strong>)• Thomas Toy (Portland Group)• Jared Hoberock (<strong>NVIDIA</strong>)• Nathan Bell (<strong>NVIDIA</strong>)© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesWhat is <strong>Thrust</strong>?Standard templatelibrary for GPU• Leverage ParallelPrimitives forrapiddevelopmentHighly optimizedalgorithms• <strong>Sort</strong>ing, PrefixSum, Reduction,more…Ships free with<strong>CUDA</strong> v4.0• Integrate with• <strong>CUDA</strong> C/C++• <strong>CUDA</strong> <strong>FORTRAN</strong>…© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation Perspectives<strong>Thrust</strong> ComponentsContainers• Manage host and device memory• Simplify data transfersItera<strong>to</strong>rs• Act like pointers• Keep track of memory spacesAlgorithms• Applied <strong>to</strong> Containers© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesSimple <strong>Sort</strong> Example using <strong>Thrust</strong>#include #include #include #include #include #include int main(void){// generate 32M random numbers on the hostthrust::host_vec<strong>to</strong>r h_vec(32


<strong>Thrust</strong> Wrapper Functions<strong>Sort</strong> Implementation PerspectivesOutlineWhat is <strong>Thrust</strong>?Wrapper Functions<strong>Sort</strong> ImplementationPerspectives© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesConfigure <strong>CUDA</strong> <strong>FORTRAN</strong> for <strong>CUDA</strong> v4.0rc4.oset <strong>CUDA</strong>ROOT=/usr/local/cuda;set <strong>CUDA</strong>VERSION=4.0;MakefileNVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3Until PGI supports <strong>CUDA</strong> v4.0 natively• Create rc4.o file• Compile <strong>CUDA</strong> <strong>FORTRAN</strong> files (.cuf) using –rc flag• Add –L/usr/local/cuda/lib64 if using <strong>CUDA</strong> v4.0 Toolkit© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation Perspectives<strong>Thrust</strong> conversion featureIn order <strong>to</strong> call <strong>Thrust</strong>from <strong>CUDA</strong> <strong>FORTRAN</strong>Must convertdevice container<strong>to</strong> standard Cpointer// allocate device vec<strong>to</strong>rthrust::device_vec<strong>to</strong>r d_vec(4);// obtain raw pointer <strong>to</strong> device vec<strong>to</strong>r’s memoryint *ptr = thrust::raw_pointer_cast(&d_vec[0]);© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesCreate C Wrapper <strong>to</strong> <strong>Thrust</strong> sort functioncsort.cu#include #include #include extern "C" {void sort_int_wrapper( int *data, int N){thrust::device_ptr dev_ptr(data);thrust::sort(dev_ptr, dev_ptr+N);}void sort_float_wrapper( float *data, int N){thrust::device_ptr dev_ptr(data);thrust::sort(dev_ptr, dev_ptr+N);}NVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3all: csort.ocsort.o: csort.cunvcc -c -arch sm_13 $(NVINC) $^ -o $@clean:rm csort.oMakefile}void sort_double_wrapper( double *data, int N){thrust::device_ptr dev_ptr(data);thrust::sort(dev_ptr, dev_ptr+N);}© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesAdd <strong>FORTRAN</strong> interface for Wrappersmodule thrustinterface thrustsortthrust_module.cufMakefileNVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3subroutine sort_int(input,N) bind(C,name="sort_int_wrapper")use iso_c_bindinginteger(c_int),device:: input(*)integer(c_int),value:: Nend subroutinesubroutine sort_float(input,N) bind(C,name="sort_float_wrapper")use iso_c_bindingreal(c_float),device:: input(*)integer(c_int),value:: Nend subroutineall: csort.o thrust_module.ocsort.o: csort.cunvcc -c -arch sm_13 $(NVINC) $^ -o $@thrust_module.o: thrust_module.cufpgf90 –c $(F90FLAGS) $^ -o $@clean:rm csort.osubroutine sort_double(input,N) bind(C,name="sort_double_wrapper")use iso_c_bindingreal(c_double),device:: input(*)integer(c_int),value:: Nend subroutineend interfaceend module thrust© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> ImplementationPerspectivesOutline<strong>Using</strong> <strong>Thrust</strong> <strong>to</strong> sort <strong>FORTRAN</strong> arrayWrapper Function<strong>Sort</strong> ImplementationPerspectives© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation Perspectives<strong>CUDA</strong> <strong>FORTRAN</strong> test sort programprogram testsortuse thrusttest_sort.cufMakefileNVINC = -I/usr/local/cuda/includeF90FLAGS = -rc=rc4.o -Mcuda=cc20 -O3real, allocatable :: cpuData(:)real, allocatable, device :: gpuData(:)integer:: N=10allocate(cpuData(N))allocate(gpuData(N))do i=1,NcpuData(i)=random(i)end docpuData(5)=100.print *,"Before sorting", cpuDatagpuData=cpuDatacall thrustsort(gpuData,size(gpuData))cpuData=gpuDataprint *,"After sorting", cpuDataall: test_sorttest_sort: test_sort.o csort.o thrust_module.opgf90 $(F90FLAGS) -o $@ $^test_sort.o:pgf90 –c $(F90FLAGS) $< -o $@test_sort.cuf thrust_module.ocsort.o: csort.cunvcc -c -arch sm_13 $(NVINC) $^ -o $@thrust_module.o: thrust_module.cufpgf90 –c $(F90FLAGS) $^ -o $@clean:rm csort.o test_sort.o test_sortend program© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesAdd timing <strong>to</strong> sort programprogram timesortuse cudaforuse thrustimplicit nonedo i=1,NcpuData(i)=random(i)end doprint *,"<strong>Sort</strong>ing array of ",N, " single precision"real, allocatable :: cpuData(:)real, allocatable, device :: gpuData(:)integer:: i,N=100000000type ( cudaEvent ) :: startEvent , s<strong>to</strong>pEventreal :: time, randominteger :: istatistat = cudaEventCreate ( startEvent )istat = cudaEventCreate ( s<strong>to</strong>pEvent )allocate(cpuData(N))allocate(gpuData(N))gpuData=cpuDataistat = cudaEventRecord ( startEvent , 0)call thrustsort(gpuData,size(gpuData))istat = cudaEventRecord ( s<strong>to</strong>pEvent , 0)istat = cudaEventSynchronize ( s<strong>to</strong>pEvent )istat = cudaEventElapsedTime ( time,startEvent,s<strong>to</strong>pEvent )cpuData=gpuDataprint *," <strong>Sort</strong>ed array in:",time," (ms)"print *,"After sort", cpuData(1:5),cpuData(N-4:N)end program© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesRun Time (Seconds)Single Precision Timing Results1.4Single Precision <strong>Sort</strong> Time Comparison(<strong>CUDA</strong> <strong>FORTRAN</strong> Wrapper vs Native <strong>Thrust</strong>)1.210.80.60.4M2090 <strong>CUDA</strong> <strong>FORTRAN</strong>M2090 Native <strong>Thrust</strong>0.20100 M 200 M 300 M 400 M 500 M 600 MNumber of Single Precision Array Elements© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesRun Time (seconds)Double Precision Timing Results1.61.41.2Double Precision <strong>Sort</strong> Time Comparison(<strong>CUDA</strong> <strong>FORTRAN</strong> Wrapper vs Native <strong>Thrust</strong>)10.80.6M2090 <strong>CUDA</strong> <strong>FORTRAN</strong>M2090 Native <strong>Thrust</strong>0.40.2050 M 100 M 200 M 300 MNumber of Double Precision Array Elements© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesOutline<strong>Using</strong> <strong>Thrust</strong> <strong>to</strong> sort <strong>FORTRAN</strong> arrayWrapper Functions<strong>Sort</strong> ImplementationPerspectives© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesConclusions<strong>Thrust</strong> deliversperformance andimprovesprogrammerproductivityWrapper functionsintroducenegligibleoverhead<strong>CUDA</strong> <strong>FORTRAN</strong>can benefit from<strong>Thrust</strong> innovations<strong>CUDA</strong> <strong>FORTRAN</strong>Wrapperperformance =native <strong>Thrust</strong>performance© 2011 <strong>NVIDIA</strong> Corporation


<strong>Thrust</strong> Wrapper Functions <strong>Sort</strong> Implementation PerspectivesReferences• Collection of <strong>CUDA</strong> examples, tricks, and suggestions– http://cudamusing.blogspot.com/• PGI Compiler and Tools– http://www.cse.scitech.ac.uk/events/GPU_2010/16_Miles.pdf• <strong>Thrust</strong> Google Project Page– http://code.google.com/p/thrust/© 2011 <strong>NVIDIA</strong> Corporation

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!