
Duane Merrill (NVIDIA) - GPU Technology Conference

Duane Merrill (NVIDIA)
Michael Garland (NVIDIA)


Agenda
• Sparse graphs
• Breadth-first search (BFS)
• Maximal independent set (MIS)
[Slide figure: example graphs annotated with BFS distance labels and an MIS selection]


Sparse graphs
• Graph G(V, E) where n = |V| and m = |E|
• Convey structure via relationships
[Examples: Wikipedia '07 link graph; 3D lattice]


Graph sparsity
Adjacency matrix of graph G(n, m):
  v0: {v0, v1}
  v1: {v1, v2}
  v2: {v0, v2, v3}
  v3: {v1, v3}
[Figure: "spy" plots of example sparse matrices]


Compressed sparse row (CSR) representation
Adjacency matrix of graph G(n, m): v0: {v0, v1}, v1: {v1, v2}, v2: {v0, v2, v3}, v3: {v1, v3}
Column indices, O(m):  [0] [1] [2] [3] [4] [5] [6] [7] [8] = 0 1 1 2 0 2 3 1 3
Row offsets, O(n):     [0] [1] [2] [3] [4] = 0 2 4 7 9
(The trailing offset equals m = 9, the total number of edges.)
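As a concrete illustration (my sketch, not code from the talk), the example graph above can be stored in CSR form with two plain arrays. `CsrGraph` and the helper name are hypothetical:

```cuda
#include <vector>

// Minimal host-side CSR container (hypothetical names).
struct CsrGraph {
    int n;                           // number of vertices
    int m;                           // number of (directed) edges
    std::vector<int> row_offsets;    // n+1 entries; adjacency list of v is
                                     // column_indices[row_offsets[v] .. row_offsets[v+1])
    std::vector<int> column_indices; // m entries
};

// The 4-vertex example from the slide above:
//   v0 -> {0, 1},  v1 -> {1, 2},  v2 -> {0, 2, 3},  v3 -> {1, 3}
CsrGraph make_example() {
    return CsrGraph{
        4, 9,
        {0, 2, 4, 7, 9},             // row offsets, O(n)
        {0, 1, 1, 2, 0, 2, 3, 1, 3}  // column indices, O(m)
    };
}
```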


Breadth-first search (BFS)
1. Pick a source node
2. Rank every vertex by the length of the shortest path from the source (or by predecessor vertex)
[Animation over four slides: the frontier expands outward from the source, labeling vertices at distance 0, 1, 2, 3, ...]


BFS applications
Algorithmic primitive:
• Reachability (e.g., garbage collection)
• Belief propagation (statistical inference)
Simple performance-analog for many applications:
• Pointer-chasing
• Work queues
• Finding community structure
• Path-finding
Core benchmark kernel:
• Graph 500
• Rodinia
• Parboil


The sequential algorithm
Cycles newly-discovered-but-unvisited vertices through a queue:
1. Dequeue vertex v
2. For each undiscovered neighbor n_i of v:
   • dist[n_i] = dist[v] + 1
   • Enqueue n_i
[Animation: Q: {a} → Q: {b, c, d, e, f} → ...]
(A minimal code rendering follows.)
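Here is a hedged host-side rendering of this loop (my sketch, not the speakers' code; it assumes the hypothetical `CsrGraph` shown earlier and uses dist = -1 for "undiscovered"):

```cuda
#include <queue>
#include <vector>

// Sequential BFS over CSR (sketch). Returns dist[v] = hops from source,
// or -1 if v is unreachable.
std::vector<int> bfs_sequential(const CsrGraph& g, int source) {
    std::vector<int> dist(g.n, -1);
    std::queue<int> q;                  // newly-discovered-but-unvisited vertices
    dist[source] = 0;
    q.push(source);
    while (!q.empty()) {
        int v = q.front(); q.pop();     // 1. dequeue vertex v
        for (int off = g.row_offsets[v]; off < g.row_offsets[v + 1]; ++off) {
            int n_i = g.column_indices[off];
            if (dist[n_i] < 0) {        // only undiscovered neighbors
                dist[n_i] = dist[v] + 1;
                q.push(n_i);            // 2. label and enqueue n_i
            }
        }
    }
    return dist;
}
```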


Common parallelization strategies
• Level-synchronous: parallel work done in "epochs" by level
  a) Quadratic work
  b) Linear work
• Randomized: parallel work done within independent sub-problems
  a) Ullman-Yannakakis


Quadratic-work approach
1. Inspect every vertex at every time step (e.g., one thread per vertex)
2. If updated, update its neighbors
3. Repeat until no further updates
[Animation over four slides: every time step re-inspects all vertices as the labeled region grows]


Quadratic-work approach
• Input: vertex set V, row-offsets array R, column-indices array C, source vertex s
• Output: array dist[0..n-1] with dist[v] holding the distance from s to v

1.  parallel for (i in V) :              // kernel
2.      dist[i] := ∞
3.  dist[s] := 0
4.  iteration := 0
5.  do :
6.      done := true
7.      parallel for (i in V) :          // kernel
8.          if (dist[i] == iteration)
9.              done := false
10.             for (offset in R[i] .. R[i+1]-1) :
11.                 j := C[offset]
12.                 if (dist[j] == ∞) dist[j] := iteration + 1
13.     iteration++
14. while (!done)

Good: trivial data-parallel GPU stencils
Bad: worst-case quadratic O(n² + m) work
(Harish et al., Deng et al., Hong et al.)
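A hedged CUDA rendering of the inner parallel-for, one thread per vertex (my sketch of this Harish-style step, not the speakers' kernel; `INF` and the host driver loop are my assumptions):

```cuda
#include <cuda_runtime.h>

const int INF = 0x7fffffff;   // stand-in for "∞" (assumption)

// One BFS time step of the quadratic approach: every thread inspects one vertex.
__global__ void bfs_quadratic_step(const int* row_offsets,
                                   const int* column_indices,
                                   int* dist, int n, int iteration,
                                   int* done) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (dist[i] == iteration) {              // vertex was labeled last step
        *done = 0;                           // benign race: every writer stores 0
        for (int off = row_offsets[i]; off < row_offsets[i + 1]; ++off) {
            int j = column_indices[off];
            if (dist[j] == INF)              // only label undiscovered vertices
                dist[j] = iteration + 1;     // benign race: all writers agree
        }
    }
}

// Host loop (sketch): set dist[] to INF and dist[s] = 0, then relaunch with
// iteration = 0, 1, 2, ... resetting *done to 1 before each launch, until it
// comes back still set.
```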


[Figure: active edges/vertices per BFS iteration (search depth) for six datasets: Wikipedia '07, 3D cubic lattice, R-MAT small-world, Europe road atlas, PDE-constrained optimization, auto transmission mesh]


Linear-work approach
1. Expand neighbors in parallel: enqueue adjacency lists (vertex frontier → edge frontier)
2. Filter out previously-visited vertices in parallel (edge frontier → vertex frontier)
3. Repeat
Good: linear O(n + m) work
Challenging: need shared queues for tracking frontiers
(Merrill et al. (PPoPP'12), Luo et al., etc.)
[Figure: logical edge and vertex frontier sizes vs. search depth]
(A naive queueing sketch follows; the slides below replace its atomics with prefix sums.)
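For contrast with the prefix-sum technique coming up, here is a naive sketch (mine, not the talk's) of step 1 that builds the edge frontier with a single global atomic counter; the atomics-bandwidth slide below quantifies why this counter serializes badly:

```cuda
// Naive neighbor expansion: each thread dequeues one frontier vertex and
// appends its whole adjacency list to the edge frontier with atomicAdd.
// Correct, but serialized on the global counter.
__global__ void expand_atomic(const int* row_offsets,
                              const int* column_indices,
                              const int* vertex_frontier, int frontier_size,
                              int* edge_frontier, int* edge_frontier_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= frontier_size) return;
    int v     = vertex_frontier[tid];
    int begin = row_offsets[v];
    int count = row_offsets[v + 1] - begin;
    int base  = atomicAdd(edge_frontier_size, count);  // the bottleneck
    for (int k = 0; k < count; ++k)
        edge_frontier[base + k] = column_indices[begin + k];
}
```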


Quadratic vs. linear
[Figure: referenced vertex identifiers (millions) per BFS iteration, quadratic vs. linear, on a 3D Poisson lattice (300³ = 27M vertices)]
The quadratic approach does 130x more work!


Linear-work challenges for parallelization
Generic challenges:
1. Load imbalance between cores → coarse work-stealing queues
2. Bandwidth inefficiency → use a "status bitmask" when filtering already-visited nodes
GPU-specific challenges:
1. Cooperative allocation (global queues)
2. Load imbalance within SIMD lanes
3. Poor locality within SIMD lanes
4. Simultaneous discovery (i.e., the benign race condition)


1) Cooperative data movement (queuing)
Problem: need shared work "queues"
GPU rate limits (C2050):
• 16 billion vertex identifiers/sec (124 GB/s)
• Only 67M global atomics/sec (238x slowdown)
• Only 600M smem atomics/sec (27x slowdown)
Solution: compute enqueue offsets using prefix sum
[Figure: available bandwidth (GB/s) vs. frequency of atomic ops per vertex dequeued]


Prefix sum for allocation
Allocation requirement (per thread):  t0: 2, t1: 1, t2: 0, t3: 3
Result of prefix sum:                 t0: 0, t1: 2, t2: 3, t3: 3
Output slots: [0] [1] [2] [3] [4] [5]
• Each output is computed as the sum of the previous inputs: O(n) work
• Use the results as scatter offsets
• Fits the GPU machine model well: proceeds at copy bandwidth, only ~8 thread-instructions per input
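The same allocation expressed with a device-wide exclusive scan. This is a sketch using Thrust as a stand-in (the talk's implementation uses its own scan):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Turn per-thread allocation requirements into scatter offsets.
// Input  (requirements): 2 1 0 3
// Output (offsets):      0 2 3 3   -> total output size = 3 + 3 = 6
void compute_offsets() {
    thrust::device_vector<int> req(4);
    req[0] = 2; req[1] = 1; req[2] = 0; req[3] = 3;
    thrust::device_vector<int> offsets(4);
    thrust::exclusive_scan(req.begin(), req.end(), offsets.begin());
    // Thread t scatters its req[t] items to offsets[t] .. offsets[t]+req[t]-1
    // with no atomics; total size = offsets.back() + req.back().
}
```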


Prefix sum for edge expansion
Adjacency list size (per thread):  t0: 2, t1: 1, t2: 0, t3: 3
Result of prefix sum:              t0: 0, t1: 2, t2: 3, t3: 3
Expanded edge frontier slots: [0] [1] [2] [3] [4] [5]


Data-flow of efficient CTA scan (granularity coarsening)
Parallelism narrows and widens around barriers: x16 → barrier → x4 → x1 → x4 → barrier → x16
(For good GPU scan details, see Baxter's tutorials @ http://www.moderngpu.com/)
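Below is a toy CTA-wide exclusive scan (my sketch, not the talk's or a library's code) showing the coarsening idea: a few "raking" threads each scan a serial run, one thread scans the partials, and the results are seeded back out. The sizes (64 items, 16 raking threads, a runs-of-4 layout) are arbitrary choices:

```cuda
// Toy CTA-wide exclusive scan of 64 ints illustrating granularity coarsening.
// Launch as: cta_scan_64<<<1, 64>>>(d_data);
__global__ void cta_scan_64(int* data) {
    __shared__ int items[64];
    __shared__ int partials[16];
    int tid = threadIdx.x;

    items[tid] = data[tid];
    __syncthreads();

    if (tid < 16) {                        // 16 raking threads...
        int sum = 0;
        for (int k = 0; k < 4; ++k) {      // ...each scan 4 items serially
            int x = items[tid * 4 + k];
            items[tid * 4 + k] = sum;
            sum += x;
        }
        partials[tid] = sum;
    }
    __syncthreads();

    if (tid == 0) {                        // one thread scans the 16 partials
        int sum = 0;
        for (int k = 0; k < 16; ++k) {
            int x = partials[k];
            partials[k] = sum;
            sum += x;
        }
    }
    __syncthreads();

    data[tid] = items[tid] + partials[tid / 4];  // seed back out at full width
}
```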


2) Poor load balancing within SIMD lanes
Problem:
• Large variance in adjacency list sizes
• Exacerbated by wide SIMD widths
Solution: cooperative neighbor expansion — enlist nearby threads to help process each adjacency list in parallel
a) Bad: serial expansion & processing (one thread per vertex)
b) Slightly better: coarse warp-centric parallel expansion (one warp per adjacency list)
c) Best: fine-grained parallel expansion (packed by prefix sum)
(A warp-centric sketch of (b) follows.)
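A sketch of strategy (b) above (mine, not the talk's kernels): one warp per frontier vertex, with the 32 lanes striding that vertex's adjacency list, so a long list no longer serializes a single thread. It assumes a preceding scan produced `out_offsets` (one scatter offset per frontier vertex), as on the prefix-sum slides; variant (c) would additionally pack the work across the whole CTA:

```cuda
// Warp-centric expansion (strategy b): warp w processes frontier vertex w.
// Launch with blockDim.x a multiple of 32 and enough warps to cover the frontier.
__global__ void expand_warp_centric(const int* row_offsets,
                                    const int* column_indices,
                                    const int* vertex_frontier, int frontier_size,
                                    const int* out_offsets,   // from prefix sum
                                    int* edge_frontier) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp >= frontier_size) return;
    int v     = vertex_frontier[warp];
    int begin = row_offsets[v];
    int count = row_offsets[v + 1] - begin;
    int base  = out_offsets[warp];             // scatter offset for this list
    for (int k = lane; k < count; k += 32)     // lanes cooperate on one list
        edge_frontier[base + k] = column_indices[begin + k];
}
```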


3) Poor locality within SIMD lanes
Problem:
• The referenced adjacency lists are arbitrarily located
• Exacerbated by wide SIMD widths
• Can't afford to have SIMD threads streaming through unrelated data
Solution: cooperative neighbor expansion, as above (serial per-thread expansion is bad; coarse warp-centric expansion is slightly better; fine-grained expansion packed by prefix sum is best)


4) SIMD simultaneous discovery
[Figure: a 3x3 lattice explored over iterations 1-4; multiple frontier vertices discover the same neighbors]
"Duplicates" in the edge frontier → redundant work
• Exacerbated by wide SIMD
• Compounded every iteration


4) SIMD simultaneous discovery
Normally not a problem for CPU implementations (serial adjacency list inspection)
A big problem for cooperative SIMD expansion:
• Spatially-descriptive datasets
• Power-law datasets
Solution: localized duplicate removal using hashes in local scratch (sketch below)
[Figure: logical edge/vertex frontier vs. actually expanded/contracted vertices, by search depth]
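A sketch in the spirit of the talk's warp-level culling (names, the 128-slot table size, and the use of `__syncwarp` for CUDA 9+ are my choices, not the original code). Duplicated neighbor ids collide on the same hash slot; the lanes race to claim it and all but one cull themselves. Misses from hash collisions are harmless because the later status-bitmask filter still guarantees correctness:

```cuda
#define WARPS_PER_CTA 4
#define HASH_SIZE 128   // slots per warp (assumption)

// Caller allocates: __shared__ volatile int scratch[WARPS_PER_CTA * HASH_SIZE];
// and passes this warp's region: scratch + (threadIdx.x / 32) * HASH_SIZE.
// Returns true if this thread's neighbor is a detected duplicate within the warp.
__device__ bool warp_cull(int neighbor, volatile int* warp_scratch) {
    int lane = threadIdx.x % 32;
    int hash = neighbor & (HASH_SIZE - 1);

    warp_scratch[hash] = neighbor;                // duplicates hit the same slot
    __syncwarp();
    bool candidate = (warp_scratch[hash] == neighbor);
    __syncwarp();
    if (candidate)
        warp_scratch[hash] = lane;                // candidates race; one lane wins
    __syncwarp();
    // Losers hold the same neighbor id as the winner: safe to cull.
    return candidate && (warp_scratch[hash] != lane);
}
```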


Single-socket comparison

| Graph | Average search depth | Nehalem (billion TE/s) | Nehalem parallel speedup | Tesla C2050 (billion TE/s) | C2050 speedup ††† |
|---|---|---|---|---|---|
| europe.osm | 19314 | – | – | 0.3 | 11x |
| grid5pt.5000 | 7500 | – | – | 0.6 | 7.3x |
| hugebubbles | 6151 | – | – | 0.4 | 15x |
| grid7pt.300 | 679 | 0.12 (4-core †) | 2.4x | 1.1 | 28x |
| nlpkkt160 | 142 | 0.47 (4-core †) | 3.0x | 2.5 | 10x |
| audikw1 | 62 | – | – | 3.0 | 4.6x |
| cage15 | 37 | 0.23 (4-core †) | 2.6x | 2.2 | 18x |
| kkt_power | 37 | 0.11 (4-core †) | 3.0x | 1.1 | 23x |
| coPapersCiteseer | 26 | – | – | 3.0 | 5.9x |
| wikipedia-2007 | 20 | 0.19 (4-core †) | 3.2x | 1.6 | 25x |
| kron_g500-logn20 | 6 | – | – | 3.1 | 13x |
| random.2Mv.128Me | 6 | 0.50 (8-core ††) | 7.0x | 3.0 | 29x |
| rmat.2Mv.128Me | 6 | 0.70 (8-core ††) | 6.0x | 3.3 | 22x |

† 2.5 GHz Core i7 4-core (Leiserson et al.)
†† 2.7 GHz EX Xeon X5570 8-core (Agarwal et al.)
††† vs. 3.4 GHz Core i7 2600K (Sandy Bridge)


Maximal independent set
A vertex subset U ⊆ V of an undirected graph G = (V, E):
• U is independent if no two vertices in U are adjacent
• U is maximal if no vertex can be added without violating independence
Example (vertex a adjacent to both b and c): both {a} and {b, c} are maximal independent sets.
(Finding the maximum independent set, by contrast, is NP-hard.)


MIS applications
Graph coloring (vertex coloring):
• Co-scheduling sets of independent resources
• Useful for GPUs: avoid atomic updates by touching items in different kernels
Matching (edge coloring):
• Pair-wise assignments: scheduling of jobs onto hosts; matching people with desks, dance partners, kidneys, etc.
• Run MIS on G' where: there is a node in G' for every edge in G, and nodes are connected in G' if their edges in G are adjacent


Example: MIS(2) for mesh refinement in algebraic multigrid
(Bell et al., Cohen et al., etc.)


The sequential algorithm
Repeatedly add a vertex to the MIS, removing it and its neighbors from the graph:
1. Select the first identifier v from V
2. Add v to the MIS
3. Remove v and neighbors(v) from V
4. Repeat until V = {}
[Animation: MIS: {} → MIS: {a} → ...]
(A minimal code rendering follows.)
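A hedged host-side rendering (my sketch, assuming the hypothetical `CsrGraph` shown earlier, with symmetric adjacency since G is undirected). Visiting vertices in identifier order automatically "selects the first remaining identifier":

```cuda
#include <vector>

// Sequential MIS over CSR (sketch). in_mis[v] marks membership.
std::vector<bool> mis_sequential(const CsrGraph& g) {
    std::vector<bool> in_mis(g.n, false), removed(g.n, false);
    for (int v = 0; v < g.n; ++v) {       // 1. first remaining identifier
        if (removed[v]) continue;
        in_mis[v]  = true;                // 2. add v to the MIS
        removed[v] = true;                // 3. remove v and neighbors(v)
        for (int off = g.row_offsets[v]; off < g.row_offsets[v + 1]; ++off)
            removed[g.column_indices[off]] = true;
    }                                     // 4. loop runs until V is exhausted
    return in_mis;
}
```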


The parallel algorithm (Luby85)
Repeatedly add whole sets to the MIS, removing them and their neighbors from the graph:
1. Randomly permute the identifier ordering of V
2. Identify all m_i such that identifier m_i precedes the identifiers of all of m_i's remaining neighbors
3. Add every such m_i to the MIS
4. Remove the m_i and neighbors(m_i) from V
5. Repeat until V = {}
[Animation: several independent local minima join the MIS in the same round]


Data-parallel GPU approach
• Input: vertex set V, row-offsets array R, column-indices array C
• Output: array MIS[0..n-1] with MIS[v] holding a membership flag (1 | -1) for the maximal independent set constructed

1.  parallel for (i in V) :              // kernel
2.      MIS[i] := 0
3.  do :
4.      done := true
5.      parallel for (i in V) :          // kernel
6.          if (MIS[i] == 0)
7.              dependent_i := false
8.              done := false
9.              local_min_i := i
10.             for (offset in R[i] .. R[i+1]-1) :
11.                 j := C[offset]
12.                 if (MIS[j] == 1)
13.                     dependent_i := true
14.                 else if ((MIS[j] == 0) && (permute(j) < permute(i)))
15.                     local_min_i := j
16.             if (dependent_i)
17.                 MIS[i] := -1
18.             else if (local_min_i == i)
19.                 MIS[i] := 1
20. while (!done)

Good: trivial data-parallel GPU stencils (e.g., CUSP library; Hoberock, Cohen, Wang, etc.)
Bad: average case O(m log n) work because edges are never removed
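A direct CUDA rendering of the inner kernel above (my sketch, not the CUSP code). `permute()` is an assumption: a cheap hash with the vertex id appended so that the ordering is strict and ties between adjacent vertices are impossible:

```cuda
// Stand-in random permutation of identifiers (assumption, not Luby's original).
__device__ unsigned long long permute(int v) {
    unsigned h = (unsigned)v * 2654435761u;               // Knuth multiplicative hash
    return ((unsigned long long)h << 32) | (unsigned)v;   // unique per vertex
}

// One Luby round: MIS[i] is 0 (undecided), 1 (in the set), or -1 (excluded).
__global__ void mis_step(const int* row_offsets, const int* column_indices,
                         int* MIS, int n, int* done) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || MIS[i] != 0) return;
    *done = 0;
    bool dependent = false;                   // some neighbor already in the MIS?
    int  local_min = i;                       // earliest undecided id seen so far
    for (int off = row_offsets[i]; off < row_offsets[i + 1]; ++off) {
        int j = column_indices[off];
        if (MIS[j] == 1)
            dependent = true;
        else if (MIS[j] == 0 && permute(j) < permute(i))
            local_min = j;
    }
    if (dependent)           MIS[i] = -1;     // a neighbor won earlier: drop out
    else if (local_min == i) MIS[i] = 1;      // local minimum: join the MIS
}

// Host loop (sketch): zero MIS[], then relaunch with *done reset to 1 each
// round until it comes back still set.
```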


Future work: actually remove vertices from G
Don't visit all n vertices at every time step!
Should significantly improve utilization in subsequent rounds (higher density of active vertices)
CSR isn't a particularly friendly format for vertex removal:
• Easy: use prefix sum to compact the row_offsets array, doubling its size to pair segment-begin offsets with explicit segment-end offsets (sketch below)
• Hard: no great way to find and remove vertex identifiers from adjacency lists
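One possible rendering of the "easy" half (my sketch, not the talk's plan), using Thrust stream compaction, which is scan-based under the hood. Removed vertices then cost nothing in later rounds; the stale identifiers still sitting in adjacency lists (the "hard" half) remain:

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Predicate over the alive-flag stencil.
struct is_alive {
    __host__ __device__ bool operator()(int flag) const { return flag != 0; }
};

// Keep (begin, end) offset pairs only for surviving vertices.
// begins/ends must be presized to n; copy_if's return iterator gives the
// compacted count.
void compact_rows(const thrust::device_vector<int>& row_offsets, // n+1 entries
                  const thrust::device_vector<int>& alive,       // n flags
                  thrust::device_vector<int>& begins,
                  thrust::device_vector<int>& ends) {
    thrust::copy_if(row_offsets.begin(), row_offsets.end() - 1,   // R[v]
                    alive.begin(), begins.begin(), is_alive());
    thrust::copy_if(row_offsets.begin() + 1, row_offsets.end(),   // R[v+1]
                    alive.begin(), ends.begin(), is_alive());
}
```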


Summary
• Purely data-parallel approaches can be uncompetitive (e.g., one thread per vertex/edge at each time step)
• Dynamic workload management: prefix sum (vs. atomic-add)
• Utilization requires fine-grained expansion and contraction
• GPUs can be very amenable to dynamic, cooperative problems


dumerrill@nvidia.com


GTC 2013: The Premier Event in Accelerated Computing
Registration is open! Four days, March 18-21, San Jose, CA
• Three keynotes
• 300+ sessions
• One day of pre-conference developer tutorials
• 100+ research posters
• Lots of networking events and opportunities


Experimental corpus

| Graph | Description | Average search depth | Vertices (millions) | Edges (millions) |
|---|---|---|---|---|
| europe.osm | Road network | 19314 | 50.9 | 108.1 |
| grid5pt.5000 | 2D Poisson stencil | 7500 | 25.0 | 125.0 |
| hugebubbles-00020 | 2D mesh | 6151 | 21.2 | 63.6 |
| grid7pt.300 | 3D Poisson stencil | 679 | 27.0 | 188.5 |
| nlpkkt160 | Constrained optimization problem | 142 | 8.3 | 221.2 |
| audikw1 | Finite element matrix | 62 | 0.9 | 76.7 |
| cage15 | Transition prob. matrix | 37 | 5.2 | 94.0 |
| kkt_power | Optimization (KKT) | 37 | 2.1 | 13.0 |
| coPapersCiteseer | Citation network | 26 | 0.4 | 32.1 |
| wikipedia-20070206 | Wikipedia page links | 20 | 3.6 | 45.0 |
| kron_g500-logn20 | Graph500 random graph | 6 | 1.0 | 100.7 |
| random.2Mv.128Me | Uniform random graph | 6 | 2.0 | 128.0 |
| rmat.2Mv.128Me | RMAT random graph | 6 | 2.0 | 128.0 |


Comparison of expansion techniques
[Figure: performance comparison of neighbor-expansion techniques]


Coupling of expansion & contraction phases
Alternatives:
• Expand-contract: wholly realize the vertex frontier in DRAM; suitable for all types of BFS iterations
• Contract-expand: wholly realize the edge frontier in DRAM; even better for small, fleeting BFS iterations
• Two-phase: wholly realize both frontiers in DRAM; even better for large, saturating BFS iterations (surprising!)
[Figure: logical edge and vertex frontier sizes vs. BFS iteration]


Comparison of coupling approaches
[Figure: traversal rate (10⁹ edges/sec, absolute and normalized) for expand-contract, contract-expand, two-phase, and hybrid strategies]


Local duplicate culling
Contraction uses local collision hashes in smem scratch space:
• Hash per warp (instantaneous coverage)
• Hash per CTA (recent-history coverage)
Redundant work < 10% in all datasets
No atomics needed
[Figure: redundant expansion factor (log scale), simple vs. local duplicate culling, per dataset (e.g., 4.2x reduced toward 1.1x)]


Multi-GPU traversal
Expand neighbors → sort by GPU (with filter) → read from peer GPUs (with filter) → repeat
[Figure: normalized traversal rate (10⁹ edges/sec) for 1, 2, 3, and 4 Tesla C2050s]
