Dense Matrix Algorithms -- Chapter 8 Introduction

Matrix-Matrix Multiplication C = AB
• Assume the best serial algorithm is $O(n^3)$
  – This is not strictly true: Strassen's algorithm uses fewer operations, though not substantially fewer
  – There are others as well
• In parallel, three algorithms are discussed:
• A simple block algorithm
  – Communication contention and a large memory requirement; parallel runtime $\Omega(n)$
• Cannon's block algorithm, which reduces the memory requirement
  – Allows computation/communication overlap
    » Unfortunately, this changes the parallel runtime only a little
• The DNS algorithm (Dekel, Nassimi, Sahni algorithm)
  – Partitions intermediate data so that the parallel runtime is reduced to $\Theta(\log n)$, an upper bound lower than the lower bound for the two algorithms above
5/6/2003 densematrix 15

The Simple Algorithm
• Assume matrices A and B of size $n \times n$
• Assume p processors in a grid of size $\sqrt{p} \times \sqrt{p}$
• Assume both A and B are distributed in blocks of size $(n/\sqrt{p}) \times (n/\sqrt{p})$, one block of each matrix per processor
• Algorithm:
  – Perform an all-to-all broadcast of the blocks of A within each row of processors
    » For row i, this ensures that every block of the i-th block row of A is on every processor in the i-th row of the grid
  – Perform an all-to-all broadcast of the blocks of B within each column of processors
    » For column j, this ensures that every block of the j-th block column of B is on every processor in the j-th column of the grid
  – On each processor, multiply the block row of A by the block column of B; this computes the corresponding block of C
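The steps of the simple algorithm can be sketched as a serial simulation. This is an illustration under stated assumptions, not the slides' own code: the function names (`split_blocks`, `block_multiply_add`, `simple_block_matmul`) are hypothetical, `q` stands for $\sqrt{p}$ and is assumed to divide n, and the two all-to-all broadcasts are modelled simply by giving every simulated processor a full block row of A and a full block column of B.

```python
def split_blocks(M, q):
    """Split an n x n matrix (list of lists) into a q x q grid of blocks.

    Assumes q divides n; q plays the role of sqrt(p).
    """
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)]
            for i in range(q)]


def block_multiply_add(C, A, B):
    """Accumulate C += A * B for small square blocks stored as lists of lists."""
    for i in range(len(A)):
        for k in range(len(B)):
            a = A[i][k]
            for j in range(len(B[0])):
                C[i][j] += a * B[k][j]


def simple_block_matmul(A, B, q):
    """Simulate the simple block algorithm on a q x q processor grid."""
    n = len(A)
    b = n // q
    A_blocks = split_blocks(A, q)
    B_blocks = split_blocks(B, q)

    # Simulated all-to-all broadcast in each processor row:
    # processor (i, j) ends up holding the whole i-th block row of A.
    row_of_A = {(i, j): [A_blocks[i][k] for k in range(q)]
                for i in range(q) for j in range(q)}
    # Simulated all-to-all broadcast in each processor column:
    # processor (i, j) ends up holding the whole j-th block column of B.
    col_of_B = {(i, j): [B_blocks[k][j] for k in range(q)]
                for i in range(q) for j in range(q)}

    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # Local work on processor (i, j): q block multiplications.
            C_ij = [[0] * b for _ in range(b)]
            for k in range(q):
                block_multiply_add(C_ij, row_of_A[(i, j)][k], col_of_B[(i, j)][k])
            # Copy the finished block into the global result for inspection.
            for r in range(b):
                C[i * b + r][j * b:(j + 1) * b] = C_ij[r]
    return C
```

Each simulated processor stores q blocks of A and q blocks of B after the broadcasts, which is the source of the large memory requirement noted for this algorithm.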
Performance and Analysis for the Simple Algorithm
• Communication:
  – Two all-to-all broadcasts on $\sqrt{p}$ processors, each communicating blocks of $n^2/p$ elements
  – The time taken is: $2\left(t_s \log\sqrt{p} + t_w \frac{n^2}{p}(\sqrt{p} - 1)\right)$
• Computation:
  – The matrix multiplications ($\sqrt{p}$ multiplications of blocks of size $(n/\sqrt{p}) \times (n/\sqrt{p})$)
  – The time taken is: $\sqrt{p}\,(n/\sqrt{p})^3 = n^3/p$
• Parallel runtime:
  – Approximately: $n^3/p + 2 t_s \log\sqrt{p} + 2 t_w n^2/\sqrt{p}$

Cost and Isoefficiency Function for the Simple Algorithm
• Cost:
  – Approximately: $n^3 + 2 p\, t_s \log\sqrt{p} + 2 t_w \sqrt{p}\, n^2$
• Cost optimal:
  – Provided $p = O(n^2)$
• Isoefficiency: $\Omega(p^{3/2})$
  – From the $t_s$ term: $t_s\, p \log p$
  – From the $t_w$ term: $8 t_w^3 p^{3/2} = \Theta(p^{3/2})$
  – From the degree of concurrency: $\Omega(p^{3/2})$
• Memory:
  – Every processor uses: $2\sqrt{p}\,(n/\sqrt{p})^2 = \Theta(n^2/\sqrt{p})$
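The formulas above can be evaluated numerically to see how the terms trade off. The following is a minimal sketch, assuming one time unit per multiply-add (as the $n^3$ computation term implies) and a base-2 logarithm, which the slides do not specify; the function name `simple_algorithm_model` and the returned dictionary keys are hypothetical.

```python
import math


def simple_algorithm_model(n, p, t_s, t_w):
    """Evaluate the simple algorithm's cost model for an n x n problem on p processors.

    Assumptions (not stated in the slides): log is base 2, and one
    multiply-add takes one time unit. t_s is the message startup time,
    t_w the per-element transfer time.
    """
    sqrt_p = math.sqrt(p)

    # sqrt(p) block multiplications of size (n/sqrt(p))^3 each = n^3 / p.
    computation = sqrt_p * (n / sqrt_p) ** 3

    # Two all-to-all broadcasts among sqrt(p) processors,
    # each moving blocks of n^2/p elements.
    communication = 2 * (t_s * math.log2(sqrt_p)
                         + t_w * (n ** 2 / p) * (sqrt_p - 1))

    runtime = computation + communication
    cost = p * runtime                      # processor-time product
    efficiency = n ** 3 / cost              # serial work / parallel cost
    memory = 2 * sqrt_p * (n / sqrt_p) ** 2  # elements stored per processor

    return {
        "computation": computation,
        "communication": communication,
        "runtime": runtime,
        "cost": cost,
        "efficiency": efficiency,
        "memory_per_processor": memory,
    }
```

Calling the function for a fixed n while increasing p shows the communication term growing relative to the $n^3/p$ computation term, which is why the problem size must grow on the order of $p^{3/2}$ to hold efficiency constant.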