
Parallel Combinatorial BLAS and Applications in Graph Computations

Primitives for Graph Computations

• By analogy to numerical linear algebra: what would the combinatorial BLAS look like?
• BLAS 3: n-by-n matrix-matrix multiply
• BLAS 2: n-by-n matrix-vector multiply
• BLAS 1: sum of scaled n-vectors

The Case for Primitives

It takes a "certain" level of expertise to get any kind of performance in this jungle of parallel computing. (I think you'll agree with me by the end of the talk :))

[Cartoon: "I can just implement it (w/ enough coffee)" vs. "What's bandwidth anyway?" vs. "The right primitive!" — choosing the right primitive yields a 480x speedup for all-pairs shortest paths on the GPU]

The Case for Sparse Matrices

• Many irregular applications contain sufficient coarse-grained parallelism that can only be exploited using abstractions at the proper level.

| Traditional graph computations | Graphs in the language of linear algebra |
| --- | --- |
| Data driven; unpredictable communication patterns | Fixed communication patterns; overlapping opportunities |
| Irregular and unstructured; poor locality of reference | Operations on matrix blocks; exploits memory hierarchy |
| Fine-grained data accesses; dominated by latency | Coarse-grained parallelism; bandwidth limited |

Identification of Primitives

‣ Sparse matrix-matrix multiplication (SpGEMM): the most general and challenging parallel primitive
‣ Sparse matrix-vector multiplication (SpMV)
‣ Sparse matrix-transpose-vector multiplication (SpMV_T): equivalently, multiplication from the left
‣ Addition and other point-wise operations (SpAdd): included in SpGEMM; "proudly" parallel
‣ Indexing and assignment (SpRef, SpAsgn): A(I,J), where I and J are arrays of indices; reduces to SpGEMM

All matrices are on semirings, e.g. (×, +), (and, or), (+, min).

Why focus on SpGEMM?

Submatrix indexing is itself an SpGEMM: A(I,J) is the product of a length(I)-by-m row-selection matrix (e.g. [0 1 0 0; 0 0 0 1]), the m-by-n matrix A, and an n-by-length(J) column-selection matrix (e.g. [1 0 0; 0 1 0; 0 0 1; 0 0 0]).

• Graph clustering (Markov, peer pressure)
• Shortest path calculations
• Betweenness centrality
• Subgraph / submatrix indexing
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element computations
• Context-free parsing ...

(Thursday, April 9, 2009)


Vitals of Combinatorial BLAS

1. Scalability, in the presence of increasing processors, problem size, and sparsity.

[Figure: estimated (ideal) speedup of 1D vs. 2D distributed-memory SpGEMM algorithms, as a function of matrix dimension N and processor count P, for two parallel data distributions]

• 2D algorithms have the potential to scale, if implemented correctly. Sparse matrix additions, overlapping communication, and maintaining load balance are crucial.
• The break-even point between 1D and 2D algorithms is around 50 processors; the performance of 1D algorithms flattens out around 40 processors.

Sequential Kernel

The standard algorithm is O(nnz + flops + n).

• Strictly O(nnz) data structure
• Outer-product formulation
• Work-efficient

Node Level Considerations

Submatrices are hypersparse (i.e., nnz < n).

Addressing the Load Balance

R-MAT: a model for graphs with high variance in vertex degrees.

• Random permutations are useful. But...
• Bulk synchronous algorithms may still suffer.
• Asynchronous algorithms have no notion of stages.
• Overall, no significant imbalance.


Scaling Results for SpGEMM

• Asynchronous implementation, one-sided MPI-2
• Runs on TACC's Lonestar cluster (dual-core, dual-socket Intel Xeon, 2.66 GHz)
• RMat × RMat product, average degree (nnz/n) ≈ 8

[Figure: parallel PSpGEMM scalability, RMAT scale 20 — time (seconds) vs. 1-256 processors]
[Figure: PSpGEMM scalability with increasing problem size on 64 processors — time (seconds) vs. number of vertices, up to ~4.2M]

Vitals of Combinatorial BLAS

2. Generality, of the numeric type of matrix elements, the algebraic operation performed, and the library interface.

Without the language-abstraction penalty: C++ templates

template <class IT, class NT, class DER> class SpMat;

• Achieve mixed-precision arithmetic: type traits
• Enforcing the interface and strong type checking: CRTP (curiously recurring template pattern)
• General semiring operations: function objects

➡ The abstraction penalty is not just a programming-language issue.
➡ In particular, view matrices as indexed data structures and stay away from single-element access (the interface should discourage it).

Vitals of Combinatorial BLAS

3. Extendability, of the library while maintaining compatibility and seamless upgrades.

➡ Decouple the parallel logic from the sequential part.
➡ Even Boost's "serializable" concept might be restrictive (and slow).

SpPar wraps any SpSeq; the parallel logic can be asynchronous, bulk synchronous, etc. The sequential implementations (CSC, DCSC, Tuples) share two commonalities:
- They support the sequential API.
- They are composed of a number of arrays.

Applications and Algorithms

A typical software stack for an application enabled with the Combinatorial BLAS:

• Applications: community detection, network vulnerability analysis
• Combinatorial algorithms: betweenness centrality (Brandes' algorithm), graph clustering, graph contraction
• Parallel Combinatorial BLAS: SpGEMM, SpRef/SpAsgn, SpMV, SpAdd

Betweenness centrality C_B(v): among all the shortest paths, what fraction of them pass through the node of interest?

Betweenness Centrality using Sparse Matrices [Robinson, Kepner]

[Figure: an example graph and its adjacency matrix A^T]

• Adjacency matrix: a sparse array with nonzeros for graph edges
• Storage-efficient implementation from sparse data structures
• Betweenness centrality algorithm:
  1. Pick a starting vertex, v.
  2. Compute shortest paths from v to all other nodes.
  3. Starting with the most distant nodes, roll back and tally paths.

Betweenness Centrality using BFS

[Figure: x, A^T x, and (A^T x) .* ¬x on the example graph; the frontiers t_1, t_2, t_3, t_4 are accumulated as x += t_i]

• Every iteration discovers another level of the BFS.
• Sparsity is preserved, but sparse matrix times sparse vector has very little potential parallelism (it does only o(nnz) work).

Parallelism: Multiple-source BFS

[Figure: X, A^T X, and (A^T X) .* ¬X on the example graph, one column per source vertex]

• Batch processing of multiple source vertices
• Sparse matrix-matrix multiplication ⇒ work-efficient
• Potential parallelism is much higher
• The same applies to the tallying phase

Betweenness Centrality on Combinatorial BLAS

• Semi-basic implementation: 2D matrices, synchronous matrix multiplication, no overlapping of communication with computation, some remote DMA, mixed-type arithmetic, no template specialization for boolean matrices
• Fundamental trade-off: parallelism vs. memory usage
• Batch processing (batch/proc = 4, 8, 16, 32) greatly helps for large p
• Input: RMAT scale 17, only ~1M edges; likely to perform better on large inputs
• The code is only a few lines longer than the Matlab version

[Figure: millions of TEPS vs. processor count (4-64) for the four batch sizes]

Thank You! Questions?