Dense Matrix Algorithms -- Chapter 8 Introduction
<strong>Dense</strong> <strong>Matrix</strong> <strong>Algorithms</strong> --<br />
<strong>Chapter</strong> 8<br />
CS 442/EECE 432<br />
Brian T. Smith, UNM, CS Dept.<br />
Spring, 2003<br />
5/6/2003 densematrix 1<br />
<strong>Introduction</strong><br />
• <strong>Dense</strong> matrix operations<br />
– Operations on matrices with no or very few zero entries<br />
• Essentially, so few it is not worth creating a separate data<br />
structure or algorithm to avoid the computation with the zero<br />
entries<br />
• Deal with square matrices whose dimension is divisible by<br />
the number of processors, for ease of presentation; most of the<br />
algorithmic processes also work for rectangular matrices and uneven<br />
distributions<br />
– Uses mainly data decomposition: partitioning of the input or<br />
output data<br />
<strong>Matrix</strong>-Vector Multiplication y = Ax<br />
• Assumptions:<br />
• The matrix A is of size n×n<br />
• The number of processors is n, arranged in a linear array<br />
• The matrix A is distributed by rows, 1 row per processor<br />
• The vector x is initially distributed rowwise -- that is, one<br />
element per processor<br />
• Algorithm<br />
• Perform an all-to-all broadcast of x to all processors<br />
• Perform the local row-times-vector dot product to produce a<br />
row-distributed value for the vector y<br />
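The two algorithm steps above can be sketched as a sequential simulation (a sketch only; the function name and the list-of-lists matrix representation are mine, not from the slides):<br />

```python
def matvec_1d_row(A, x):
    # Simulate the 1-D row-partitioned y = Ax with n "processors",
    # each initially holding one row of A and one element of x.
    n = len(A)
    # Each processor i owns the single element x[i].
    local_x = [[x[i]] for i in range(n)]
    # All-to-all broadcast: afterwards every processor holds all of x.
    full_x = [elem for part in local_x for elem in part]
    # Each processor computes its own y[i] = dot(A[i], x); y ends up
    # row-distributed, one element per processor.
    return [sum(A[i][j] * full_x[j] for j in range(n)) for i in range(n)]
```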
Parallel Runtime<br />
• The all-to-all broadcast takes Θ(n)<br />
• n processors (p=n) and message of length m=1 from<br />
Table 4.1 on most networks<br />
• The operation time is also Θ(n)<br />
• Thus, the overall parallel time is Θ(n)<br />
• The cost or processor-time product is thus Θ(n²)<br />
• Because the serial time is also Θ(n²), the<br />
algorithm is cost-optimal<br />
Fewer Than n Processors<br />
• Suppose the number of processors p < n<br />
• Place n/p rows of A and n/p elements of x per processor<br />
• An all-to-all broadcast distributes the n/p local elements of x<br />
• Each processor computes n/p elements of y<br />
• No further distribution of y is needed<br />
• The broadcast takes: t_s log p + t_w(n/p)(p–1) ≈ t_s log p + t_w n<br />
• The computation per processor is: (n/p)·n = n²/p<br />
• The total parallel runtime T_P is: n²/p + t_s log p + t_w n<br />
• The cost is: n² + t_s p log p + t_w np = Θ(n²)<br />
• The work W = n²<br />
• The algorithm is cost-optimal, provided p = O(n)<br />
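The runtime and cost expressions above can be turned into a small model for experimentation (illustrative only; the function names and parameter values are mine):<br />

```python
from math import log2

def tp_1d(n, p, t_s, t_w):
    # Parallel runtime model from the slide: n^2/p + t_s*log p + t_w*n
    return n * n / p + t_s * log2(p) + t_w * n

def cost_1d(n, p, t_s, t_w):
    # Cost (processor-time product): n^2 + t_s*p*log p + t_w*n*p
    return p * tp_1d(n, p, t_s, t_w)
```

For example, with n = 8, p = 4, and t_s = t_w = 1, the model gives 64/4 + 2 + 8 = 26 time units.<br />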
Diagram Of The Computational And<br />
Communication Process For 1-D Row<br />
Partition <strong>Matrix</strong>-Vector Multiplication<br />
Scalability Analysis<br />
• The parallel overhead is: T_o = pT_P – W<br />
• That is, T_o = t_s p log p + t_w np<br />
• To determine the isoefficiency function, we use the<br />
equation W = KT_o, where K = E/(1–E) and E is constant<br />
• We bound and express W in terms of p and the constants E,<br />
t_w, and t_s<br />
– Using the larger term t_w np in T_o shows that W = K²t_w²p², i.e. W = Θ(p²)<br />
– Using the degree of concurrency:<br />
» The maximum number of processors p is n: that is, p = O(n)<br />
» That is, n = Ω(p), n² = Ω(p²), and thus W = Ω(p²)<br />
» This places a lower bound on the isoefficiency function which is the<br />
same as the upper bound<br />
• Thus, the isoefficiency function for 1-D row partition matrix-vector<br />
multiply is Θ(p²)<br />
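Written out, the t_w-term bound above follows in three steps (a sketch of the algebra, using W = n² and K = E/(1–E)):<br />

```latex
W = K T_o \;\Rightarrow\; n^2 = K\, t_w\, n\, p
\;\Rightarrow\; n = K\, t_w\, p
\;\Rightarrow\; W = n^2 = K^2 t_w^2 p^2 = \Theta(p^2)
```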
2-D Partitioned Algorithm For<br />
<strong>Matrix</strong>-Vector Multiplication<br />
• We begin by assuming n² processors arranged in<br />
an n×n 2-D grid<br />
• A is assumed to be distributed with one element per processor<br />
• x is assumed distributed as one element per processor in the<br />
last column processors of the grid<br />
• Algorithm:<br />
• Distribute the elements of x from the last processor in each row<br />
to the diagonal processor in each row<br />
• Perform a simultaneous one-to-all broadcast from the diagonal<br />
processors to the processors in each column<br />
• Each processor multiplies its one element of A by its one<br />
element of x (the elements each processor owns)<br />
• Finally, a sum reduction along the rows to the last processor in<br />
the row is performed to compute each component of y<br />
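The four steps above can be sketched as a sequential simulation of the one-element-per-processor scheme (illustrative only; the function name and data layout are mine):<br />

```python
def matvec_2d(A, x):
    # Simulate the 2-D partition of y = Ax: "processor" (i, j) owns A[i][j].
    n = len(A)
    # Step 1: x starts in the last column; row i sends x[i] to diagonal (i, i).
    diag = {i: x[i] for i in range(n)}
    # Step 2: one-to-all broadcast down column j: every (i, j) now holds x[j].
    xj = [[diag[j] for j in range(n)] for _ in range(n)]
    # Step 3: each processor multiplies its element of A by its copy of x[j].
    partial = [[A[i][j] * xj[i][j] for j in range(n)] for i in range(n)]
    # Step 4: sum-reduce along each row to the last column to form y[i].
    return [sum(partial[i]) for i in range(n)]
```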
Diagram Of The Computational And<br />
Communication Process For 2-D<br />
Partition <strong>Matrix</strong>-Vector Multiplication<br />
Parallel Runtime Of 2-D Partition<br />
Algorithm<br />
• The three communication steps<br />
• The first (from last column to diagonal) takes Θ(1)<br />
• The second and third (broadcast and reduction) each<br />
take Θ(log n) -- see Table 4.1 with m = 1<br />
• The computation takes Θ(1)<br />
• Thus, the parallel runtime is: Θ(log n)<br />
• The cost (processor-time product) is: Θ(n² log n)<br />
• Thus, the algorithm is not cost optimal<br />
Increasing The Granularity To Obtain<br />
A Cost Optimal Algorithm<br />
• Use fewer processors than n², say p processors<br />
• Instead of 1 element per processor, assume the matrix is distributed<br />
with an n/√p × n/√p submatrix (block) per processor<br />
• Assume x is distributed on the last column in blocks of size n/√p in<br />
the natural way<br />
• Use the same algorithm -- the differences are:<br />
– n/√p elements of x are distributed to the diagonal processors<br />
» Takes t_s + t_w n/√p time (versus unit time)<br />
– n/√p elements of x are moved from the diagonal processor in a one-to-all<br />
broadcast to all processors in a column<br />
» Takes (t_s + t_w n/√p) log √p time (versus Θ(log n))<br />
– A matrix-vector subproduct is computed<br />
» Takes (n/√p)² = n²/p time (versus Θ(1))<br />
– n/√p elements of partial results for y are sum-reduced to the last<br />
processor in each row<br />
» Takes (t_s + t_w n/√p) log √p time (versus Θ(log n))<br />
Scalability Analysis<br />
• Adding these times up, the parallel runtime is<br />
approximately n²/p + t_s log p + (t_w n/√p) log p<br />
• The cost (as long as p does not grow too fast relative to n) is Θ(n²) --<br />
the algorithm is cost optimal<br />
• But what is the bounding relationship?<br />
– Let's develop the isoefficiency function<br />
– The overhead T_o is: T_o = pT_P – W = t_s p log p + t_w n√p log p<br />
• Using this result and some algebra, we find that W = Θ(p log²p)<br />
– Using a "degree of concurrency" argument, W = Ω(p), a lower<br />
bound for W<br />
Determining The Cost-Optimal<br />
Constraint on p<br />
• We now have two expressions for W<br />
coming from the requirement for cost<br />
optimality<br />
• Setting them equal, p log²p = O(n²)<br />
• Now we need to derive an expression for p in terms<br />
of n to determine the upper bound on p in terms of n<br />
• Taking the log on both sides, solving for log p, and<br />
substituting back into the above equation gives:<br />
p = O(n² / log²n)<br />
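The substitution step can be written out as follows (a sketch of the algebra; the final step uses log p = Θ(log n), which holds because p is polynomial in n):<br />

```latex
p \log^2 p = O(n^2)
\;\Rightarrow\; \log p + 2\log\log p = O(\log n)
\;\Rightarrow\; \log p = O(\log n)
\;\Rightarrow\; p = O\!\left(\frac{n^2}{\log^2 p}\right) = O\!\left(\frac{n^2}{\log^2 n}\right)
```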
Comparison: 1-D Versus 2-D<br />
• Runtime:<br />
• 1-D: n²/p + t_s log p + t_w n<br />
• 2-D: n²/p + t_s log p + (t_w n/√p) log p<br />
– The 2-D partition is faster -- smaller t_w term<br />
• Isoefficiency (the growth in the work needed to keep the<br />
efficiency fixed):<br />
• 1-D: Θ(p²)<br />
• 2-D: Θ(p log²p)<br />
– The 2-D partition is more scalable<br />
» That is, the efficiency can be maintained with fewer<br />
processors (or on a wider range of processors)<br />
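A quick numeric check of the two runtime expressions above (illustrative; the function names and the parameter values t_s = 10, t_w = 1 are made up for the comparison):<br />

```python
from math import log2, sqrt

def tp_1d(n, p, t_s, t_w):
    # 1-D row partition runtime: n^2/p + t_s*log p + t_w*n
    return n * n / p + t_s * log2(p) + t_w * n

def tp_2d(n, p, t_s, t_w):
    # 2-D partition runtime: n^2/p + t_s*log p + (t_w*n/sqrt(p))*log p
    return n * n / p + t_s * log2(p) + t_w * (n / sqrt(p)) * log2(p)

# With n = 1024, p = 64: the 2-D t_w term is (1024/8)*6 = 768,
# versus 1024 for 1-D, so the 2-D version is faster here.
```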
<strong>Matrix</strong>-<strong>Matrix</strong> Multiplications C = AB<br />
• Assume the best serial algorithm is O(n³)<br />
• This is not strictly true, however<br />
– Strassen's algorithm has fewer operations, but not substantially<br />
– There are others as well<br />
• In parallel, three algorithms are discussed:<br />
• A simple block algorithm<br />
– Communication contention and lots of memory -- parallel runtime Ω(n)<br />
• Cannon's block algorithm -- reduces the memory requirement<br />
– Allows computation/communication overlap<br />
» Changes the parallel runtime a little, unfortunately<br />
• The DNS algorithm (Dekel, Nassimi, Sahni algorithm)<br />
– Partitions intermediate data so that the parallel runtime is reduced to<br />
Θ(log n) -- an upper bound lower than the lower bound for the above<br />
two algorithms<br />
The Simple Algorithm<br />
• Assume matrices A and B of size n×n<br />
• Assume p processors in a grid of size √p×√p<br />
• Assume the matrices are distributed by blocks of size<br />
n/√p×n/√p on each processor for both A and B<br />
– Algorithm:<br />
• Perform an all-to-all broadcast in each row of processors of the<br />
blocks of A in the particular row<br />
– For row i, this ensures that every block of the i-th block row of A<br />
is on every processor in the i-th row of the grid<br />
• Perform an all-to-all broadcast in each column of processors of<br />
the blocks of B in the particular column<br />
– For column j, this ensures that every block of the j-th block<br />
column of B is on every processor in the j-th column of the grid<br />
• Perform the row-block by column-block multiplication of<br />
the blocks on each processor -- this computes the appropriate<br />
block of C<br />
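The three steps above can be sketched as a sequential simulation on a √p × √p grid of block "processors" (a sketch only; the function name and block layout are mine, and n must be divisible by q = √p):<br />

```python
def simple_block_matmul(A, B, q):
    # Simulate the simple algorithm on a q x q grid (q = sqrt(p)).
    n = len(A)
    b = n // q  # block size
    def block(M, i, j):
        return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    out = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # After the two all-to-all broadcasts, processor (i, j) holds the
            # whole i-th block row of A and the whole j-th block column of B.
            row_blocks = [block(A, i, k) for k in range(q)]
            col_blocks = [block(B, k, j) for k in range(q)]
            # q block multiplications of size b x b accumulate block C(i, j).
            for k in range(q):
                for r in range(b):
                    for c in range(b):
                        out[i*b + r][j*b + c] += sum(
                            row_blocks[k][r][t] * col_blocks[k][t][c]
                            for t in range(b))
    return out
```

Note the memory cost visible in the sketch: each processor ends up holding a full block row and block column, which is the Θ(n²/√p) figure derived on the next slide.<br />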
Performance And Analysis for<br />
the Simple Algorithm<br />
• Communication:<br />
• Two all-to-all broadcasts on √p processors,<br />
each communicating n²/p elements<br />
– The time taken is: 2(t_s log √p + t_w(n²/p)(√p–1))<br />
• Computation:<br />
• The matrix multiplications (√p multiplications of<br />
size n/√p×n/√p)<br />
– The time taken is: √p·(n/√p)³ = n³/p<br />
• Parallel runtime:<br />
• Approximately: n³/p + 2t_s log √p + 2t_w n²/√p<br />
Cost and Isoefficiency Function<br />
For The Simple Algorithm<br />
• Cost:<br />
• Approximately: n³ + 2t_s p log √p + 2t_w √p n²<br />
• Cost optimal:<br />
• Provided p = O(n²)<br />
• Isoefficiency: Θ(p^{3/2})<br />
• From the t_s term: t_s p log p<br />
• From the t_w term: 8K³t_w³p^{3/2} = Θ(p^{3/2})<br />
• From the degree of concurrency: Ω(p^{3/2})<br />
• Memory:<br />
• Every processor uses: 2√p·(n/√p)² = Θ(n²/√p)<br />
Cannon's Algorithm<br />
• Recall the algorithm:<br />
• Initially<br />
– Align A by shifting the data in the rows circularly left a<br />
distance equal to the row number<br />
– Align B by shifting the data in the columns circularly up<br />
a distance equal to the column number<br />
• Repeatedly, for √p steps:<br />
– Perform a matrix multiplication of the block of A by the<br />
block of B that is on each processor<br />
– Shift circularly the blocks of A in rows left one position<br />
– Shift circularly the blocks of B in columns up one position<br />
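The alignment and shift-multiply steps above can be sketched as a sequential simulation (a sketch only; the function name and block layout are mine, with q = √p and n divisible by q):<br />

```python
def cannon_matmul(A, B, q):
    # Simulate Cannon's algorithm on a q x q grid of block "processors".
    n = len(A)
    b = n // q  # block size
    def block(M, i, j):
        return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    # Initial alignment: A-block row i shifted left by i,
    # B-block column j shifted up by j.
    a = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    C = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for step in range(q):
        # Multiply-accumulate the blocks currently resident on each processor.
        for i in range(q):
            for j in range(q):
                for r in range(b):
                    for c in range(b):
                        C[i][j][r][c] += sum(a[i][j][r][k] * bb[i][j][k][c]
                                             for k in range(b))
        # Shift A-blocks left one position and B-blocks up one position.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the full matrix C from its blocks.
    out = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(b):
                for c in range(b):
                    out[i*b + r][j*b + c] = C[i][j][r][c]
    return out
```

Unlike the simple algorithm, each processor here holds only one block of A and one block of B at a time, which is the memory reduction the slide refers to.<br />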
Performance Analysis Of Cannon's Algorithm<br />
• Initial alignment<br />
• Shifts at most √p – 1 positions<br />
– Time is at most: 2(t_s + t_w n²/p)<br />
• √p steps:<br />
• Total shifting time: 2(t_s + t_w n²/p)√p<br />
• Total computation time: (n/√p)³·√p = n³/p<br />
• Total parallel runtime is:<br />
n³/p + 2t_s√p + 2t_w n²/√p<br />
• Cf. the simple algorithm: n³/p + 2t_s log √p + 2t_w n²/√p<br />
• The cost optimality condition and isoefficiency function are<br />
the same for Cannon's algorithm as for the simple algorithm<br />
The DNS Algorithm<br />
• Partitions the intermediate data as well as the input data<br />
• Results in a parallel time of Θ(log n) using Ω(n³/log n) processors<br />
• The essence of the algorithm:<br />
• Use a processor, labeled P_ijk, to perform each scalar product A_ik·B_kj<br />
(n³ products in total) and add the results up to obtain each C_ij,<br />
simultaneously, in log n steps<br />
• The parallel algorithm:<br />
• Consider the processors arranged in n planes of n×n processors<br />
– We arrange the matrices A and B so that the elements A_ik and B_kj are on<br />
processor (i, j) in the k-th plane<br />
– Each processor performs its multiplication of elements<br />
– Then n² simultaneous sum-reductions are performed, say down the<br />
dimension k to the bottom plane of processors, to produce the product C<br />
on the bottom plane<br />
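After the data movement, the compute and reduction phases above amount to the following (a sketch only; the function name is mine, and the simulation models the n³ virtual processors P_ijk as a nested list):<br />

```python
def dns_matmul(A, B):
    # Simulate the DNS scheme: after the broadcasts, processor P[i][j][k]
    # holds A[i][k] and B[k][j] and computes one scalar product.
    n = len(A)
    prod = [[[A[i][k] * B[k][j] for k in range(n)]
             for j in range(n)] for i in range(n)]
    # Sum-reduction along the k dimension (log n parallel steps on the
    # machine; a plain sum here) yields C on the bottom plane.
    return [[sum(prod[i][j]) for j in range(n)] for i in range(n)]
```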
Communication For The DNS<br />
Algorithm<br />
• Initially A and B are assumed to be<br />
distributed element by element on the<br />
bottom plane of processors (k = 0)<br />
• For A, move its k-th column to the diagonal column in<br />
the k-th plane<br />
– Now broadcast those elements simultaneously to the rows<br />
of each plane<br />
• For B, move its k-th row to the diagonal row in the k-th<br />
plane<br />
– Now broadcast those elements simultaneously to the<br />
columns of each plane<br />
Diagram Of The DNS Communication For A<br />
[Figure: a 4×4 bottom plane of processors labeled (i, j), with axes i, j, and k, showing the initial element-by-element placement of A before the move to the k-th planes; the grid labels do not survive the extraction.]<br />
Diagram Of The Communication For A<br />
[Figure: the elements of each column k of A are moved to the k-th plane and then broadcast along the columns of that plane ("Broadcast along columns"); axes i, j, and k; the grid labels do not survive the extraction.]<br />
The DNS Algorithm Continued<br />