
Actas JP2011 - Universidad de La Laguna


Proceedings of the XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, 7-9 September 2011

Parallelization of the Generalized Hough Transform on GPU

Juan Gómez-Luna 1,a, José María González-Linares b, José Ignacio Benavides a, Emilio L. Zapata b and Nicolás Guil b

Abstract— Programs developed under the Compute Unified Device Architecture (CUDA) obtain the highest performance when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. To achieve this, load balancing among threads and a high processor occupancy, i.e. the ratio of active threads, are indispensable. However, in certain applications an optimally balanced implementation may limit the occupancy, due to a greater need for registers and shared memory. This is the case of the Fast Generalized Hough Transform (Fast GHT), an image processing technique for localizing an object within an image. In this work, we present two parallelization alternatives for the Fast GHT: one that optimizes load balancing and another that maximizes occupancy. We have compared them on a large set of real images to test their strong and weak points, and we have drawn several conclusions about the conditions under which it is better to use one or the other. We have also tackled several parallelization problems related to sparse data distribution, divergent execution paths and irregular memory access patterns in updating operations, by proposing a set of generic techniques such as compacting, sorting and memory storage replication.

Keywords— GPU, CUDA, occupancy, load balancing, Generalized Hough Transform.

I. INTRODUCTION

Graphics Processing Units (GPUs) have emerged as general-purpose coprocessors in recent years.
An extensive variety of applications nowadays benefits from their impressive potential, especially after the launch of CUDA [9]. GPUs offer a huge number of computing threads, arranged in a Single-Program Multiple-Data (SPMD) model. Such extensive resources make them attractive for general-purpose computation. This interest has boosted the development of GPU programming tools, such as CUDA and OpenCL [10]. General programming recommendations are to optimize load balancing and to increase processor occupancy. However, depending on the algorithm structure, both recommendations cannot always be applied simultaneously. Some kind of tradeoff must then be undertaken, since an optimally balanced implementation may increase the use of registers and the need for sharing data among threads, which decreases occupancy. Moreover, parallelization becomes even more challenging if the algorithm presents workload-dependent computations and nonlinear memory references. The former provokes divergence among threads if the layout is not carefully planned. The latter harms the locality of references, which entails serialized memory accesses.

In this work, we discuss how the performance problems caused by the previous considerations can be mitigated using suitable strategies. We illustrate them by implementing an efficient parallelization of the GHT.

1 Corresponding author: el1goluj@uco.es
a Computer Architecture and Electronics Department, University of Córdoba, Córdoba, Spain
b Computer Architecture Department, University of Málaga, Málaga, Spain
We conduct an exhaustive analysis of the previous considerations, which leads us to the following results:

• We propose compacting and sorting, in order to reduce the accesses to global memory and the number of executed instructions.
• We present two efficient mechanisms for distributing two sorted lists among blocks and threads. One of them optimizes load balancing, whilst the other keeps occupancy at the highest possible values.
• We use replicated sub-histograms per block, with successful results.

The rest of the paper is organized as follows. Section 2 depicts the characteristics of regular and irregular applications. Section 3 presents the GHT. Section 4 describes our proposals for an efficient implementation of the irregular parts. In Section 5, we propose the use of replicated sub-histograms per block, in order to improve the voting process. Section 6 presents the execution results. Finally, Section 7 concludes the paper.

II. REGULAR AND IRREGULAR PROBLEMS

Parallelizing any application on any parallel platform requires programmers to apply a certain level of abstraction. The change from a sequential conception to a parallel one is never trivial. However, some algorithms are regular in the sense that they use linear addressing and apply the same computation to every instance of the input data. This inherent parallelism makes such algorithms easier to implement on an SPMD platform, as a GPU is. In this regard, many image and video processing applications exhibit regular computation patterns and regular memory accesses; the CUDA SDK includes some samples of regular image processing.

Following the optimization recommendations when a regular problem is parallelized ensures good performance and impressive speedups on CUDA-enabled GPUs.
However, achieving an important improvement with the implementation of an irregular problem is always harder. The distribution of work and data in such algorithms cannot be characterized a priori, because these quantities are input-dependent and evolve with the computation itself. Algorithms with these properties yield performance problems for parallel implementations, where equal distribution of work over
