Primitives for Graph Computations

graphanalysis.org

Parallel Combinatorial BLAS and Applications in Graph Computations

Primitives for Graph Computations

• By analogy to numerical linear algebra, what would the combinatorial BLAS look like?
  - BLAS 3: n-by-n matrix-matrix multiply
  - BLAS 2: n-by-n matrix-vector multiply
  - BLAS 1: sum of scaled n-vectors

(Figure: measured performance of the three BLAS levels against machine peak.)


The Case for Primitives

It takes a "certain" level of expertise to get any kind of performance in this jungle of parallel computing. I think you'll agree with me by the end of the talk :)

(Cartoon: "I can just implement it (w/ enough coffee)" vs. "What's bandwidth anyway?" vs. "The right primitive!" — a 480x speedup for all-pairs shortest paths on the GPU.)


The Case for Sparse Matrices

• Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.

Traditional graph computations:
- Data driven; unpredictable communication patterns
- Irregular and unstructured; poor locality of reference
- Fine-grained data accesses; dominated by latency

Graphs in the language of linear algebra:
- Fixed communication patterns; overlapping opportunities
- Operations on matrix blocks; exploits memory hierarchy
- Coarse-grained parallelism; bandwidth limited


Identification of Primitives

‣ Sparse matrix-matrix multiplication (SpGEMM)
  The most general and challenging parallel primitive.
‣ Sparse matrix-vector multiplication (SpMV)
‣ Sparse matrix-transpose-vector multiplication (SpMVT)
  Equivalently, multiplication from the left.
‣ Addition and other point-wise operations (SpAdd)
  Included in SpGEMM; "proudly" parallel.
‣ Indexing and assignment (SpRef, SpAsgn)
  A(I,J), where I and J are arrays of indices; reduces to SpGEMM.

Matrices on semirings, e.g. (×, +), (and, or), (+, min).
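The semiring parameterization in the last line is what lets one primitive serve many algorithms. A minimal sketch (illustrative names and formats, not the Combinatorial BLAS API) of an SpMV that takes the semiring's add and multiply as function objects:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Toy triples (coordinate) sparse format; illustrative only.
struct Triple { int row, col; double val; };

// Generic y = A x over a user-supplied semiring: "add", "multiply", and the
// additive identity are passed in as function objects.
template <class Add, class Mul>
std::vector<double> SpMV(const std::vector<Triple>& A, const std::vector<double>& x,
                         int nrows, Add add, Mul mul, double id) {
    std::vector<double> y(nrows, id);
    for (const Triple& t : A)
        y[t.row] = add(y[t.row], mul(t.val, x[t.col]));
    return y;
}

// On the tropical semiring (min, +), repeated SpMV relaxes shortest-path
// distances: each call is one Bellman-Ford-style step.
double demo_sssp() {
    const double inf = std::numeric_limits<double>::infinity();
    // Edges 0->1 (w=2), 0->2 (w=5), 1->2 (w=1); stored so that row = head.
    std::vector<Triple> A = {{1, 0, 2.0}, {2, 0, 5.0}, {2, 1, 1.0}};
    std::vector<double> d = {0.0, inf, inf};   // distances from vertex 0
    auto mn = [](double a, double b) { return a < b ? a : b; };
    auto pl = [](double a, double b) { return a + b; };
    for (int it = 0; it < 2; ++it) {
        std::vector<double> y = SpMV(A, d, 3, mn, pl, inf);
        for (int i = 0; i < 3; ++i) d[i] = mn(d[i], y[i]);  // keep best-so-far
    }
    return d[2];  // 0 -> 1 -> 2 costs 2 + 1 = 3
}
```

Swapping in (and, or) gives reachability, and (+, ×) gives ordinary numerics, with no change to the kernel.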


Why focus on SpGEMM? (Thursday, April 9, 2009)

(Figure: submatrix indexing A(I,J) written as a triple product of a length(I)-by-m boolean selection matrix, the m-by-n matrix A, and an n-by-length(J) boolean selection matrix.)

• Graph clustering (Markov, peer pressure)
• Shortest path calculations
• Betweenness centrality
• Subgraph / submatrix indexing
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element computations
• Context-free parsing ...
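The identity behind the figure — that indexing reduces to SpGEMM — can be checked on a toy example: A(I,J) = R_I · A · R_J^T, where R_I and R_J are 0/1 selection matrices. A dense sketch with illustrative names:

```cpp
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<int>>;

// Plain triple-loop matrix multiply, enough for the identity check.
Mat matmul(const Mat& A, const Mat& B) {
    int m = (int)A.size(), k = (int)B.size(), n = (int)B[0].size();
    Mat C(m, std::vector<int>(n, 0));
    for (int i = 0; i < m; ++i)
        for (int l = 0; l < k; ++l)
            for (int j = 0; j < n; ++j)
                C[i][j] += A[i][l] * B[l][j];
    return C;
}

Mat transpose(const Mat& A) {
    int m = (int)A.size(), n = (int)A[0].size();
    Mat T(n, std::vector<int>(m, 0));
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) T[j][i] = A[i][j];
    return T;
}

// Length(I)-by-n row-selection matrix R with R[k][I[k]] = 1.
Mat selector(const std::vector<int>& I, int n) {
    Mat R(I.size(), std::vector<int>(n, 0));
    for (std::size_t k = 0; k < I.size(); ++k) R[k][I[k]] = 1;
    return R;
}

// SpRef as SpGEMM: A(I,J) = R_I * A * (R_J)^T.
Mat submatrix(const Mat& A, const std::vector<int>& I, const std::vector<int>& J) {
    Mat RI = selector(I, (int)A.size());
    Mat RJ = selector(J, (int)A[0].size());
    return matmul(matmul(RI, A), transpose(RJ));
}
```

A real implementation would of course keep the selection matrices sparse, which is exactly why SpGEMM is the primitive to optimize.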




Vitals of Combinatorial BLAS

1. Scalability, in the presence of increasing processors, problem size, and sparsity.

(Figure: estimated (ideal) speedup of 1D and 2D SpGEMM algorithms under the corresponding parallel data distributions, as functions of the matrix dimension N and processor count P.)

➡ 2D algorithms have the potential to scale, if implemented correctly. Sparse matrix additions, overlapping communication, and maintaining load balance are crucial.
➡ The actual break-even point between 1D and 2D algorithms is around 50 processors; performance of the 1D algorithms flattens out around 40 processors.


Sequential Kernel

• Standard algorithm is O(nnz + flops + n)
• Strictly O(nnz) data structure
• Outer-product formulation
• Work-efficient

(Figure: the kernel's cost scales with flops and nnz, not with the dimension n.)
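For comparison with the bound above, here is a sketch of the classic column-by-column (Gustavson) SpGEMM kernel; its dense sparse accumulator (SPA) is what incurs the O(n) term, which the outer-product formulation on the slide avoids. Data-structure names are illustrative, not the library's:

```cpp
#include <cassert>
#include <vector>

// Minimal CSC (compressed sparse column) matrix; square for simplicity.
struct CSC {
    int n;
    std::vector<int> colptr;   // size n+1
    std::vector<int> rowind;   // row index of each nonzero
    std::vector<double> val;   // value of each nonzero
};

// Column-by-column SpGEMM, C = A * B, in O(nnz + flops + n) time.
CSC SpGEMM(const CSC& A, const CSC& B) {
    int n = A.n;
    CSC C{n, {0}, {}, {}};
    std::vector<double> spa(n, 0.0);   // sparse accumulator for column j of C
    std::vector<char> occ(n, 0);       // which SPA slots are live
    std::vector<int> nzs;              // their row indices
    for (int j = 0; j < n; ++j) {
        nzs.clear();
        for (int p = B.colptr[j]; p < B.colptr[j + 1]; ++p) {
            int k = B.rowind[p];                     // B(k,j) != 0
            for (int q = A.colptr[k]; q < A.colptr[k + 1]; ++q) {
                int i = A.rowind[q];                 // A(i,k) != 0
                if (!occ[i]) { occ[i] = 1; nzs.push_back(i); }
                spa[i] += A.val[q] * B.val[p];       // C(i,j) += A(i,k) * B(k,j)
            }
        }
        for (int i : nzs) {                          // flush column j of C
            C.rowind.push_back(i);
            C.val.push_back(spa[i]);
            spa[i] = 0.0;
            occ[i] = 0;
        }
        C.colptr.push_back((int)C.rowind.size());
    }
    return C;
}
```

Each flop touches the SPA once, so work is proportional to flops; the O(n)-sized SPA is the part a strictly O(nnz) structure must do without.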


Node Level Considerations

Submatrices are hypersparse (i.e., nnz < n).


Addressing the Load Balance

RMat: a model for graphs with high variance in degrees.

• Random permutations are useful. But...
• Bulk-synchronous algorithms may still suffer.
• Asynchronous algorithms have no notion of stages.
• Overall, no significant imbalance.
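The random permutations in the first bullet amount to relabeling vertices with one permutation applied symmetrically (P A P^T), so the heavy rows and columns of an RMat matrix scatter across processor blocks without changing the graph. A toy edge-list sketch, illustrative rather than the library's code:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Relabel every vertex by a random permutation pi, applied to both endpoints
// of each edge (the matrix view of P A P^T). The graph is unchanged up to
// isomorphism, but dense rows/columns no longer cluster in one block.
std::vector<std::pair<int, int>>
relabel(const std::vector<std::pair<int, int>>& edges, int n, unsigned seed) {
    std::vector<int> pi(n);
    std::iota(pi.begin(), pi.end(), 0);
    std::mt19937 gen(seed);
    std::shuffle(pi.begin(), pi.end(), gen);
    std::vector<std::pair<int, int>> out;
    out.reserve(edges.size());
    for (auto [u, v] : edges) out.push_back({pi[u], pi[v]});
    return out;
}
```

Because the same permutation hits rows and columns, degrees are preserved exactly; only their placement changes.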




Scaling Results for SpGEMM

Parallel PSpGEMM scalability, RMat scale 20:
• Asynchronous implementation
• One-sided MPI-2
• Runs on TACC's Lonestar cluster (dual-core dual-socket Intel Xeon 2.66 GHz)
• RMat × RMat product, average degree (nnz/n) ≈ 8

(Figures: time in seconds vs. processor count from 1 to 256; and PSpGEMM scalability with increasing problem size on 64 processors, from 1 to ~4.2 million vertices.)


Vitals of Combinatorial BLAS

2. Generality, of the numeric type of matrix elements, the algebraic operation performed, and the library interface.

Without the language abstraction penalty: C++ templates
  template <...> class SpMat;
• Achieve mixed-precision arithmetic: type traits
• Enforce the interface and strong type checking: CRTP
• General semiring operation: function objects

➡ The abstraction penalty is not just a programming language issue.
➡ In particular, view matrices as indexed data structures and stay away from single-element access (the interface should discourage it).


Vitals of Combinatorial BLAS

3. Extendability, of the library, while maintaining compatibility and seamless upgrades.

➡ Decouple the parallel logic from the sequential part.
➡ Even Boost's Serializable concept might be restrictive (and slow).

(Class diagram: SpPar<SpSeq> carries any parallel logic — asynchronous, bulk synchronous, etc. — over a sequential SpSeq, implemented by CSC, DCSC, or Tuples.)

Commonalities:
- Support the sequential API
- Composed of a number of arrays
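The decoupling above can be sketched with the CRTP named on the previous slide: the parallel wrapper is templated on the sequential part and dispatches statically, with no virtual-call overhead. Class names follow the diagram, but the code itself is illustrative, not the library's:

```cpp
#include <cassert>
#include <cstddef>

// CRTP base: enforces the sequential API at compile time. Derived classes
// must provide getnnz_impl(), or this fails to compile.
template <class Derived>
struct SpMat {
    std::size_t getnnz() const {   // static dispatch, no virtual calls
        return static_cast<const Derived&>(*this).getnnz_impl();
    }
};

// One concrete sequential format (a tuples store); CSC or DCSC classes would
// derive from SpMat<...> in exactly the same way.
struct SpTuples : SpMat<SpTuples> {
    std::size_t nnz = 0;
    std::size_t getnnz_impl() const { return nnz; }
};

// The parallel layer only sees the sequential API, so swapping the local data
// structure (or the parallel logic: asynchronous, bulk synchronous, ...)
// never touches the other side.
template <class Seq>
struct SpPar {
    Seq local;   // this processor's submatrix
    std::size_t local_nnz() const { return local.getnnz(); }
};
```

SpPar<SpTuples>, SpPar<SpDCSC>, and so on then differ only in the template argument, which is the "seamless upgrade" path the slide argues for.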


Applications and Algorithms

A typical software stack for an application enabled with the Combinatorial BLAS:

Applications: community detection, network vulnerability analysis
Combinatorial algorithms: betweenness centrality, graph clustering, contraction
Parallel Combinatorial BLAS: SpGEMM, SpRef/SpAsgn, SpMV, SpAdd

Betweenness centrality C_B(v): among all the shortest paths, what fraction of them pass through the node of interest? (Brandes' algorithm)


Betweenness Centrality using Sparse Matrices [Robinson, Kepner]

(Figure: a 7-vertex directed graph, its adjacency matrix A^T, and a starting vector x.)

• Adjacency matrix: sparse array with nonzeros for graph edges
• Storage-efficient implementation from sparse data structures
• Betweenness centrality algorithm:
  1. Pick a starting vertex, v
  2. Compute shortest paths from v to all other nodes
  3. Starting with the most distant nodes, roll back and tally paths


Betweenness Centrality using BFS

(Figure: the same 7-vertex graph; each product (A^T x) .* ¬x advances the frontier x by one BFS level, accumulating x += t_i over steps t_1, t_2, t_3, t_4.)

• Every iteration, another level of the BFS is discovered.
• Sparsity is preserved, but sparse matrix times sparse vector has very little potential parallelism (it has o(nnz) work).
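One level of the iteration above, written out over the boolean (or, and) semiring with the ¬x-style mask; dense containers are used only to keep the sketch short, since the point is the algebra rather than the storage:

```cpp
#include <cassert>
#include <vector>

using BVec = std::vector<bool>;
using BMat = std::vector<std::vector<bool>>;

// One BFS level: next = (A^T x) .* !visited, where A[u][v] means edge u->v,
// x is the current frontier, and visited masks out discovered vertices.
BVec bfs_step(const BMat& A, const BVec& x, const BVec& visited) {
    int n = (int)A.size();
    BVec next(n, false);
    for (int v = 0; v < n; ++v) {
        bool hit = false;
        for (int u = 0; u < n; ++u)     // (A^T x)(v) = OR_u (A(u,v) AND x(u))
            if (A[u][v] && x[u]) { hit = true; break; }
        next[v] = hit && !visited[v];   // elementwise .* with the complement mask
    }
    return next;
}
```

Each call discovers exactly one more level, matching the t_1 ... t_4 steps in the figure.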


Parallelism: Multiple-source BFS

(Figure: the same graph, with a frontier matrix X and the product (A^T X) .* ¬X.)

• Batch processing of multiple source vertices
• Sparse matrix-matrix multiplication => work efficient
• Potential parallelism is much higher
• Same applies to the tallying phase
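The batched version stacks the k frontiers as columns of X, so one level for all sources is the single product (A^T X) .* ¬X — a matrix-matrix multiply instead of k matrix-vector multiplies. A dense sketch of that step:

```cpp
#include <cassert>
#include <vector>

using BMat = std::vector<std::vector<bool>>;

// One level of a k-source BFS. X and Visited are n x k; column s carries the
// frontier and discovered set of source s. Computing all k columns in one
// product is what turns BFS into work-efficient SpGEMM.
BMat bfs_level(const BMat& A, const BMat& X, const BMat& Visited) {
    int n = (int)A.size();
    int k = (int)X[0].size();
    BMat Next(n, std::vector<bool>(k, false));
    for (int v = 0; v < n; ++v)
        for (int s = 0; s < k; ++s) {
            bool hit = false;
            for (int u = 0; u < n; ++u)   // (A^T X)(v,s) over the (or, and) semiring
                if (A[u][v] && X[u][s]) { hit = true; break; }
            Next[v][s] = hit && !Visited[v][s];
        }
    return Next;
}
```

The same batching carries over to the tallying phase, as the slide notes.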


Betweenness Centrality on Combinatorial BLAS

• Semi-basic implementation: 2D matrices, synchronous matrix multiplication, no overlapping of communication with computation, some remote DMA, mixed-type arithmetic, no template specialization for boolean matrices.
• Fundamental trade-off: parallelism vs. memory usage.
• Batch processing greatly helps for large p (batch/proc = 4, 8, 16, 32).
• Input: RMAT scale 17, ~1M edges only. Likely to perform better on large inputs.
• Code only a few lines longer than the Matlab version.

(Figure: millions of TEPS vs. processor count from 4 to 64, for batch sizes 4-32 per processor.)


Thank You! Questions?
