Dense Matrix Algorithms -- Chapter 8 Introduction
<strong>Dense</strong> <strong>Matrix</strong> <strong>Algorithms</strong> --<br />
<strong>Chapter</strong> 8<br />
CS 442/EECE 432<br />
Brian T. Smith, UNM, CS Dept.<br />
Spring, 2003<br />
5/6/2003 densematrix 1<br />
<strong>Introduction</strong><br />
• <strong>Dense</strong> matrix operations<br />
– Operations on matrices with no or very few zero entries<br />
• Essentially, so few it is not worth creating a separate data<br />
structure or algorithm to avoid the computation with the zero<br />
entries<br />
• Deal with square matrices whose dimension is divisible by<br />
the number of processors, for ease of presentation; most of the<br />
algorithmic processes also work for rectangular matrices and uneven<br />
distributions<br />
– Uses mainly data decomposition: partitioning of the input or<br />
output data<br />
<strong>Matrix</strong>-Vector Multiplication y = Ax<br />
• Assumptions:<br />
• The matrix A is of size n×n<br />
• The number of processors is n, arranged in a linear array<br />
• The matrix A is distributed by rows, 1 row per processor<br />
• The vector x is initially distributed rowwise -- that is, one<br />
element per processor<br />
• Algorithm<br />
• Perform an all-to-all broadcast of x to all processors<br />
• Perform the local row-times-vector dot product to produce a<br />
row-distributed value for the vector y<br />
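The two algorithm steps above can be sketched as a sequential simulation (a sketch only; the function name and the list-of-lists matrix representation are mine, not from the slides):<br />

```python
def matvec_1d_row(A, x):
    # Simulate the 1-D row-partitioned y = Ax with n "processors",
    # each initially holding one row of A and one element of x.
    n = len(A)
    # Each processor i owns the single element x[i].
    local_x = [[x[i]] for i in range(n)]
    # All-to-all broadcast: afterwards every processor holds all of x.
    full_x = [elem for part in local_x for elem in part]
    # Each processor computes its own y[i] = dot(A[i], x); y ends up
    # row-distributed, one element per processor.
    return [sum(A[i][j] * full_x[j] for j in range(n)) for i in range(n)]
```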
Parallel Runtime<br />
• The all-to-all broadcast takes Θ(n)<br />
• n processors (p=n) and message of length m=1 from<br />
Table 4.1 on most networks<br />
• The operation time is also Θ(n)<br />
• Thus, the overall parallel time is Θ(n)<br />
• The cost or processor-time product is thus Θ(n²)<br />
• Because the serial time is also Θ(n²), the<br />
algorithm is cost-optimal<br />
Fewer Than n Processors<br />
• Suppose the number of processors p < n<br />
• Place n/p rows of A and n/p elements of x per processor<br />
• An all-to-all broadcast distributes the n/p local elements of x<br />
• Each processor computes n/p elements of y<br />
• No further distribution of y is needed<br />
• The broadcast takes: t_s log p + t_w(n/p)(p–1) ≈ t_s log p + t_w n<br />
• The computation per processor is: (n/p)·n = n²/p<br />
• The total parallel runtime T_P is: n²/p + t_s log p + t_w n<br />
• The cost is: n² + t_s p log p + t_w np = Θ(n²)<br />
• The work W = n²<br />
• The algorithm is cost-optimal, provided p = O(n)<br />
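The runtime and cost expressions above can be turned into a small model for experimentation (illustrative only; the function names and parameter values are mine):<br />

```python
from math import log2

def tp_1d(n, p, t_s, t_w):
    # Parallel runtime model from the slide: n^2/p + t_s*log p + t_w*n
    return n * n / p + t_s * log2(p) + t_w * n

def cost_1d(n, p, t_s, t_w):
    # Cost (processor-time product): n^2 + t_s*p*log p + t_w*n*p
    return p * tp_1d(n, p, t_s, t_w)
```

For example, with n = 8, p = 4, and t_s = t_w = 1, the model gives 64/4 + 2 + 8 = 26 time units.<br />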
Diagram Of The Computational And<br />
Communication Process For 1-D Row<br />
Partition <strong>Matrix</strong>-Vector Multiplication<br />
Scalability Analysis<br />
• The parallel overhead is: T_o = pT_P – W<br />
• That is, T_o = t_s p log p + t_w np<br />
• To determine the isoefficiency function, we use the<br />
equation W = KT_o, where K = E/(1–E) and E is constant<br />
• We bound and express W in terms of p and the constants E,<br />
t_w, and t_s<br />
– Using the larger term t_w np in T_o shows that W = K²t_w²p², i.e. W = Θ(p²)<br />
– Using the degree of concurrency:<br />
» The maximum number of processors p is n: that is, p = O(n)<br />
» That is, n = Ω(p), n² = Ω(p²), and thus W = Ω(p²)<br />
» This places a lower bound on the isoefficiency function which is the<br />
same as the upper bound<br />
• Thus, the isoefficiency function for 1-D row partition matrix-vector<br />
multiply is Θ(p²)<br />
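Written out, the t_w-term bound above follows in three steps (a sketch of the algebra, using W = n² and K = E/(1–E)):<br />

```latex
W = K T_o \;\Rightarrow\; n^2 = K\, t_w\, n\, p
\;\Rightarrow\; n = K\, t_w\, p
\;\Rightarrow\; W = n^2 = K^2 t_w^2 p^2 = \Theta(p^2)
```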
2-D Partitioned Algorithm For<br />
<strong>Matrix</strong>-Vector Multiplication<br />
• We begin by assuming n² processors arranged in<br />
an n×n 2-D grid<br />
• A is assumed to be distributed with one element per processor<br />
• x is assumed distributed as one element per processor in the<br />
last column processors of the grid<br />
• Algorithm:<br />
• Distribute the elements of x from the last processor in each row<br />
to the diagonal processor in each row<br />
• Perform a simultaneous one-to-all broadcast from the diagonal<br />
processors to the processors in each column<br />
• Each processor multiplies its one element of A by its one<br />
element of x (the elements each processor owns)<br />
• Finally, a sum reduction along the rows to the last processor in<br />
the row is performed to compute each component of y<br />
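The four steps above can be sketched as a sequential simulation of the one-element-per-processor scheme (illustrative only; the function name and data layout are mine):<br />

```python
def matvec_2d(A, x):
    # Simulate the 2-D partition of y = Ax: "processor" (i, j) owns A[i][j].
    n = len(A)
    # Step 1: x starts in the last column; row i sends x[i] to diagonal (i, i).
    diag = {i: x[i] for i in range(n)}
    # Step 2: one-to-all broadcast down column j: every (i, j) now holds x[j].
    xj = [[diag[j] for j in range(n)] for _ in range(n)]
    # Step 3: each processor multiplies its element of A by its copy of x[j].
    partial = [[A[i][j] * xj[i][j] for j in range(n)] for i in range(n)]
    # Step 4: sum-reduce along each row to the last column to form y[i].
    return [sum(partial[i]) for i in range(n)]
```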
Diagram Of The Computational And<br />
Communication Process For 2-D<br />
Partition <strong>Matrix</strong>-Vector Multiplication<br />
Parallel Runtime Of 2-D Partition<br />
Algorithm<br />
• The three communication steps<br />
• The first (from last column to diagonal) takes Θ(1)<br />
• The second and third (broadcast and reduction) each<br />
take Θ(log n) -- see Table 4.1 with m = 1<br />
• The computation takes Θ(1)<br />
• Thus, the parallel runtime is: Θ(log n)<br />
• The cost (processor-time product) is: Θ(n² log n)<br />
• Thus, the algorithm is not cost optimal<br />
Increasing The Granularity To Obtain<br />
A Cost Optimal Algorithm<br />
• Use fewer processors than n², say p processors<br />
• Instead of 1 element per processor, assume the matrix is distributed<br />
with an n/√p × n/√p submatrix (block) per processor<br />
• Assume x is distributed on the last column in blocks of size n/√p in<br />
the natural way<br />
• Use the same algorithm -- the differences are:<br />
– n/√p elements of x are distributed to the diagonal processors<br />
» Takes t_s + t_w n/√p time (versus unit time)<br />
– n/√p elements of x are moved from the diagonal processor in a one-to-all<br />
broadcast to all processors in a column<br />
» Takes (t_s + t_w n/√p) log √p time (versus Θ(log n))<br />
– A matrix-vector subproduct is computed<br />
» Takes (n/√p)² = n²/p time (versus Θ(1))<br />
– n/√p elements of partial results for y are sum-reduced to the last<br />
processor in each row<br />
» Takes (t_s + t_w n/√p) log √p time (versus Θ(log n))<br />
Scalability Analysis<br />
• Adding these times up, the parallel runtime is<br />
approximately n²/p + t_s log p + (t_w n/√p) log p<br />
• The cost (as long as p does not grow too fast relative to n) is Θ(n²) --<br />
the algorithm is cost optimal<br />
• But what is the bounding relationship?<br />
– Let's develop the isoefficiency function<br />
– The overhead T_o is: T_o = pT_P – W = t_s p log p + t_w n√p log p<br />
• Using this result and some algebra, we find that W = Θ(p log²p)<br />
– Using a "degree of concurrency" argument, W = Ω(p), a lower<br />
bound for W<br />
Determining The Cost-Optimal<br />
Constraint on p<br />
• We now have two expressions for W<br />
coming from the requirement for cost<br />
optimality<br />
• Setting them equal, p log²p = O(n²)<br />
• Now we need to derive an expression for p in terms<br />
of n to determine the upper bound on p in terms of n<br />
• Taking the log on both sides, solving for log p, and<br />
substituting back into the above equation gives:<br />
p = O(n² / log²n)<br />
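The substitution step can be written out as follows (a sketch of the algebra; the final step uses log p = Θ(log n), which holds because p is polynomial in n):<br />

```latex
p \log^2 p = O(n^2)
\;\Rightarrow\; \log p + 2\log\log p = O(\log n)
\;\Rightarrow\; \log p = O(\log n)
\;\Rightarrow\; p = O\!\left(\frac{n^2}{\log^2 p}\right) = O\!\left(\frac{n^2}{\log^2 n}\right)
```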
Comparison: 1-D Versus 2-D<br />
• Runtime:<br />
• 1-D: n²/p + t_s log p + t_w n<br />
• 2-D: n²/p + t_s log p + (t_w n/√p) log p<br />
– The 2-D partition is faster -- smaller t_w term<br />
• Isoefficiency (the growth in the work needed to keep the<br />
efficiency fixed):<br />
• 1-D: Θ(p²)<br />
• 2-D: Θ(p log²p)<br />
– The 2-D partition is more scalable<br />
» That is, the efficiency can be maintained with fewer<br />
processors (or on a wider range of processors)<br />
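A quick numeric check of the two runtime expressions above (illustrative; the function names and the parameter values t_s = 10, t_w = 1 are made up for the comparison):<br />

```python
from math import log2, sqrt

def tp_1d(n, p, t_s, t_w):
    # 1-D row partition runtime: n^2/p + t_s*log p + t_w*n
    return n * n / p + t_s * log2(p) + t_w * n

def tp_2d(n, p, t_s, t_w):
    # 2-D partition runtime: n^2/p + t_s*log p + (t_w*n/sqrt(p))*log p
    return n * n / p + t_s * log2(p) + t_w * (n / sqrt(p)) * log2(p)

# With n = 1024, p = 64: the 2-D t_w term is (1024/8)*6 = 768,
# versus 1024 for 1-D, so the 2-D version is faster here.
```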
<strong>Matrix</strong>-<strong>Matrix</strong> Multiplications C = AB<br />
• Assume the best serial algorithm is O(n³)<br />
• This is not strictly true, however<br />
– Strassen's algorithm has fewer operations, but not substantially<br />
– There are others as well<br />
• In parallel, three algorithms are discussed:<br />
• A simple block algorithm<br />
– Communication contention and lots of memory -- parallel runtime Ω(n)<br />
• Cannon's block algorithm -- reduces the memory requirement<br />
– Allows computation/communication overlap<br />
» Changes the parallel runtime a little, unfortunately<br />
• The DNS algorithm (Dekel, Nassimi, Sahni algorithm)<br />
– Partitions intermediate data so that the parallel runtime is reduced to<br />
Θ(log n) -- an upper bound lower than the lower bound for the above<br />
two algorithms<br />
The Simple Algorithm<br />
• Assume matrices A and B of size n×n<br />
• Assume p processors in a grid of size √p×√p<br />
• Assume the matrices are distributed by blocks of size<br />
n/√p×n/√p on each processor for both A and B<br />
– Algorithm:<br />
• Perform an all-to-all broadcast in each row of processors of the<br />
blocks of A in the particular row<br />
– For row i, this ensures that every block of the i-th block row of A<br />
is on every processor in the i-th row of the grid<br />
• Perform an all-to-all broadcast in each column of processors of<br />
the blocks of B in the particular column<br />
– For column j, this ensures that every block of the j-th block<br />
column of B is on every processor in the j-th column of the grid<br />
• Perform the row-block by column-block multiplication of<br />
the blocks on each processor -- this computes the appropriate<br />
block of C<br />
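The three steps above can be sketched as a sequential simulation on a √p × √p grid of block "processors" (a sketch only; the function name and block layout are mine, and n must be divisible by q = √p):<br />

```python
def simple_block_matmul(A, B, q):
    # Simulate the simple algorithm on a q x q grid (q = sqrt(p)).
    n = len(A)
    b = n // q  # block size
    def block(M, i, j):
        return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    out = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # After the two all-to-all broadcasts, processor (i, j) holds the
            # whole i-th block row of A and the whole j-th block column of B.
            row_blocks = [block(A, i, k) for k in range(q)]
            col_blocks = [block(B, k, j) for k in range(q)]
            # q block multiplications of size b x b accumulate block C(i, j).
            for k in range(q):
                for r in range(b):
                    for c in range(b):
                        out[i*b + r][j*b + c] += sum(
                            row_blocks[k][r][t] * col_blocks[k][t][c]
                            for t in range(b))
    return out
```

Note the memory cost visible in the sketch: each processor ends up holding a full block row and block column, which is the Θ(n²/√p) figure derived on the next slide.<br />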
Performance And Analysis for<br />
the Simple Algorithm<br />
• Communication:<br />
• Two all-to-all broadcasts on √p processors,<br />
each communicating n²/p elements<br />
– The time taken is: 2(t_s log √p + t_w(n²/p)(√p–1))<br />
• Computation:<br />
• The matrix multiplications (√p multiplications of<br />
size n/√p×n/√p)<br />
– The time taken is: √p·(n/√p)³ = n³/p<br />
• Parallel runtime:<br />
• Approximately: n³/p + 2t_s log √p + 2t_w n²/√p<br />
Cost and Isoefficiency Function<br />
For The Simple Algorithm<br />
• Cost:<br />
• Approximately: n³ + 2t_s p log √p + 2t_w √p n²<br />
• Cost optimal:<br />
• Provided p = O(n²)<br />
• Isoefficiency: Θ(p^{3/2})<br />
• From the t_s term: t_s p log p<br />
• From the t_w term: 8K³t_w³p^{3/2} = Θ(p^{3/2})<br />
• From the degree of concurrency: Ω(p^{3/2})<br />
• Memory:<br />
• Every processor uses: 2√p·(n/√p)² = Θ(n²/√p)<br />
Cannon's Algorithm<br />
• Recall the algorithm:<br />
• Initially<br />
– Align A by shifting the data in the rows circularly left a<br />
distance equal to the row number<br />
– Align B by shifting the data in the columns circularly up<br />
a distance equal to the column number<br />
• Repeatedly, for √p steps:<br />
– Perform a matrix multiplication of the block of A by the<br />
block of B that is on each processor<br />
– Shift circularly the blocks of A in rows left one position<br />
– Shift circularly the blocks of B in columns up one position<br />
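The alignment and shift-multiply steps above can be sketched as a sequential simulation (a sketch only; the function name and block layout are mine, with q = √p and n divisible by q):<br />

```python
def cannon_matmul(A, B, q):
    # Simulate Cannon's algorithm on a q x q grid of block "processors".
    n = len(A)
    b = n // q  # block size
    def block(M, i, j):
        return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    # Initial alignment: A-block row i shifted left by i,
    # B-block column j shifted up by j.
    a = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    C = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for step in range(q):
        # Multiply-accumulate the blocks currently resident on each processor.
        for i in range(q):
            for j in range(q):
                for r in range(b):
                    for c in range(b):
                        C[i][j][r][c] += sum(a[i][j][r][k] * bb[i][j][k][c]
                                             for k in range(b))
        # Shift A-blocks left one position and B-blocks up one position.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the full matrix C from its blocks.
    out = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(b):
                for c in range(b):
                    out[i*b + r][j*b + c] = C[i][j][r][c]
    return out
```

Unlike the simple algorithm, each processor here holds only one block of A and one block of B at a time, which is the memory reduction the slide refers to.<br />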
Performance Analysis Of Cannon's Algorithm<br />
• Initial alignment<br />
• Shifts at most √p – 1 positions<br />
– Time is at most: 2(t_s + t_w n²/p)<br />
• √p steps:<br />
• Total shifting time: 2(t_s + t_w n²/p)√p<br />
• Total computation time: (n/√p)³·√p = n³/p<br />
• Total parallel runtime is:<br />
n³/p + 2t_s√p + 2t_w n²/√p<br />
• Cf. the simple algorithm: n³/p + 2t_s log √p + 2t_w n²/√p<br />
• The cost optimality condition and isoefficiency function are<br />
the same for Cannon's algorithm as for the simple algorithm<br />
The DNS Algorithm<br />
• Partitions the intermediate data as well as the input data<br />
• Results in a parallel time of Θ(log n) using Ω(n³/log n) processors<br />
• The essence of the algorithm:<br />
• Use a processor, labeled P_ijk, to perform each scalar product A_ik·B_kj<br />
(n³ products in total) and add the results up to obtain each C_ij,<br />
simultaneously, in log n steps<br />
• The parallel algorithm:<br />
• Consider the processors arranged in n planes of n×n processors<br />
– We arrange the matrices A and B so that the elements A_ik and B_kj are on<br />
processor (i, j) in the k-th plane<br />
– Each processor performs its multiplication of elements<br />
– Then n² simultaneous sum-reductions are performed, say down the<br />
dimension k to the bottom plane of processors, to produce the product C<br />
on the bottom plane<br />
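After the data movement, the compute and reduction phases above amount to the following (a sketch only; the function name is mine, and the simulation models the n³ virtual processors P_ijk as a nested list):<br />

```python
def dns_matmul(A, B):
    # Simulate the DNS scheme: after the broadcasts, processor P[i][j][k]
    # holds A[i][k] and B[k][j] and computes one scalar product.
    n = len(A)
    prod = [[[A[i][k] * B[k][j] for k in range(n)]
             for j in range(n)] for i in range(n)]
    # Sum-reduction along the k dimension (log n parallel steps on the
    # machine; a plain sum here) yields C on the bottom plane.
    return [[sum(prod[i][j]) for j in range(n)] for i in range(n)]
```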
Communication For The DNS<br />
Algorithm<br />
• Initially A and B are assumed to be<br />
distributed element by element on the<br />
bottom plane of processors (k = 0)<br />
• For A, move its k-th column to the diagonal column in<br />
the k-th plane<br />
– Now broadcast those elements simultaneously to the rows<br />
of each plane<br />
• For B, move its k-th row to the diagonal row in the k-th<br />
plane<br />
– Now broadcast those elements simultaneously to the<br />
columns of each plane<br />
Diagram Of The DNS Communication For A<br />
[Figure: a 4×4 bottom plane of processors labeled (i, j), with axes i, j, and k, showing the initial element-by-element placement of A before the move to the k-th planes; the grid labels do not survive the extraction.]<br />
Diagram Of The Communication For A<br />
[Figure: the elements of each column k of A are moved to the k-th plane and then broadcast along the columns of that plane ("Broadcast along columns"); axes i, j, and k; the grid labels do not survive the extraction.]<br />
The DNS Algorithm Continued<br />