
Actas JP2011 - Universidad de La Laguna


Actas XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, 7-9 September 2011

a bucket that keeps track of the objects contained within the ball (c, r_c). Each bucket holds the k elements closest to c; thus the radius r_c is the distance between the center c and its k-th nearest neighbor. The buckets are filled sequentially as the centers are created, so a given element located in the intersection of two or more center balls remains assigned to the first bucket that holds it. The first center is chosen at random from the set of objects; the next ones are selected so that they maximize the sum of the distances to all previous centers.

A range query q with radius r is solved by scanning the centers in order of creation. For each center, d(q, c) is computed, and only if d(q, c) ≤ r_c + r is it necessary to compare the query against the objects of the associated bucket. This process ends either at the first center for which d(q, c) < r_c − r, meaning that the query ball (q, r) is totally contained in the center ball (c, r_c), or when all centers have been considered.

B. Sparse Spatial Selection (SSS-Index)

During construction, this pivot-based index [6] selects some objects from the collection as pivots and then computes the distance between these pivots and the rest of the database. The result is a table of distances whose columns are the pivots and whose rows are the objects; each cell contains the distance between the object and the respective pivot. These distances are used to solve queries as follows. For a range query (q, r), the distances between the query and all pivots are computed. An object x from the collection can be discarded if there exists a pivot p_i for which the condition |d(p_i, x) − d(p_i, q)| > r holds.
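As a rough sequential illustration of the bucket construction and the range search just described (not the paper's GPU implementation; the function names build_loc and range_query and the use of 1-D integer points are our own), the scheme can be sketched in Python:

```python
import random

def build_loc(objects, dist, k):
    """Build the index: each entry is (center, radius, bucket), where the
    bucket holds the k objects closest to the center and the radius is the
    distance from the center to its k-th nearest remaining neighbor."""
    remaining = list(objects)
    clusters = []
    center = random.choice(remaining)          # first center is random
    while remaining:
        remaining.remove(center)
        remaining.sort(key=lambda x: dist(center, x))
        bucket = remaining[:k]
        radius = dist(center, bucket[-1]) if bucket else 0.0
        clusters.append((center, radius, bucket))
        remaining = remaining[k:]
        if not remaining:
            break
        # next center maximizes the sum of distances to all previous centers
        centers = [c for c, _, _ in clusters]
        center = max(remaining, key=lambda x: sum(dist(c, x) for c in centers))
    return clusters

def range_query(clusters, dist, q, r):
    """Scan centers in creation order, pruning buckets with r_c + r and
    stopping early when the query ball is contained in a center ball."""
    result = []
    for center, radius, bucket in clusters:
        d = dist(q, center)
        if d <= r:                              # the center itself may answer
            result.append(center)
        if d <= radius + r:                     # bucket may intersect query ball
            result.extend(x for x in bucket if dist(q, x) <= r)
        if d < radius - r:                      # query ball inside this center ball
            break
    return result
```

The early exit is safe because elements in the intersection of several center balls live in the earliest bucket, so once the query ball is contained in a center ball no later bucket can contribute.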
The objects that pass this test are considered potential members of the final set of objects that form the solution of the query, and therefore they are compared directly against the query by applying the condition d(x, q) ≤ r. The gain in performance comes from the fact that discarding objects using the table is much cheaper than computing the distance between the candidate objects and the query.

A key issue in this index is the method that selects the pivots, which must be good enough to drastically reduce the total number of distance computations between the objects and the query. An effective method is as follows. Let (X, d) be a metric space, U ⊂ X an object collection, and M the maximum distance between any pair of objects, M = max{d(x, y) : x, y ∈ U}. The set of pivots initially contains only the first object of the collection. Then, for each element x_i ∈ U, x_i is chosen as a new pivot if its distance to every pivot in the current set of pivots is equal to or greater than αM, where α is a constant parameter. Therefore, an object in the collection becomes a new pivot if it is located at more than a fraction α of the maximum distance from all the current pivots.
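A minimal sequential sketch of the pivot selection rule and the table-based filtering described above (the names select_pivots, build_table and sss_range_query are ours, and the 1-D metric is only for illustration):

```python
def select_pivots(objects, dist, alpha, max_dist):
    """An object becomes a pivot if its distance to every pivot chosen so
    far is at least alpha * M, with M the maximum pairwise distance."""
    pivots = [objects[0]]                       # start with the first object
    for x in objects[1:]:
        if all(dist(x, p) >= alpha * max_dist for p in pivots):
            pivots.append(x)
    return pivots

def build_table(objects, pivots, dist):
    """Distance table: one row per object, one column per pivot."""
    return {x: [dist(x, p) for p in pivots] for x in objects}

def sss_range_query(objects, pivots, table, dist, q, r):
    """Discard x when |d(p_i, x) - d(p_i, q)| > r for some pivot; the
    survivors are verified directly with d(q, x) <= r."""
    dq = [dist(q, p) for p in pivots]
    result = []
    for x in objects:
        if any(abs(dx, ) > r for dx in ()) if False else any(
                abs(dx - dqi) > r for dx, dqi in zip(table[x], dq)):
            continue                            # pruned using the table only
        if dist(q, x) <= r:
            result.append(x)
    return result
```

Since the discard rule is a triangle-inequality bound, the filter never drops a true answer; it only reduces how many direct distance computations are needed.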
If the threads in a warp execute different code paths, only those that follow the same path can execute simultaneously, and a penalty is incurred.

Warps are further organized into a grid of CUDA blocks: threads within a block can cooperate with each other by (1) efficiently sharing data through a low-latency shared local memory and (2) synchronizing their execution via barriers. In contrast, threads from different blocks can only coordinate their execution via accesses to a high-latency global memory. Within certain restrictions, the programmer specifies how many blocks and how many threads per block are assigned to the execution of a given kernel. When a kernel is launched, threads are created by hardware and dispatched to the GPU cores.

According to NVIDIA, the most significant factor affecting performance is bandwidth usage. Although the GPU takes advantage of multithreading to hide memory access latencies, having hundreds of threads simultaneously accessing global memory puts high pressure on the memory bus bandwidth. The memory hierarchy includes a large register file (statically partitioned per thread) and a software-controlled low-latency shared memory (per multiprocessor). Therefore, reducing global memory accesses by using local shared memory to exploit inter-thread locality and data reuse largely improves kernel execution time. In addition, improving memory access patterns is important to allow coalescing of warp loads and to avoid bank conflicts on shared-memory accesses.

IV. Range Queries

In this section we describe the mapping of three range search algorithms onto CUDA-enabled GPUs: a brute-force approach and two index-based search methods. All of them exploit two different levels of parallelism. As in some previous papers [8][9], we assume a high frequency of incoming queries and exploit coarse-grained inter-query parallelism, i.e., we always solve nq queries in parallel.
However, we also exploit the fine-grained parallelism available when solving a single query. Overall, each query is processed by a different CUDA block that contains hundreds of threads.

¹ Currently, there are 32 threads per warp.
