
Dense Matrix Algorithms -- Chapter 8
CS 442/EECE 432
Brian T. Smith, UNM, CS Dept.
Spring, 2003

Introduction

• Dense matrix operations
  – Operations on matrices with no or very few zero entries
    • Essentially, so few that it is not worth creating a separate data structure or algorithm to avoid computing with the zero entries
• We deal with square matrices whose dimension is divisible by the number of processors, for ease of presentation, but most of the algorithmic processes work for rectangular matrices and uneven distributions
  – Mainly uses data-decomposition partitioning of the input or output data



Matrix-Vector Multiplication y = Ax

• Assumptions:
  • The matrix A is of size n×n
  • The number of processors is n, arranged in a linear array
  • The matrix A is distributed by rows, 1 row per processor
  • The vector x is initially distributed rowwise -- that is, one element per processor
• Algorithm (a serial sketch follows this slide):
  • Perform an all-to-all broadcast of x to all processors
  • Perform the row-by-vector dot product to produce a row-distributed value for the vector y
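A minimal serial sketch of this 1-D rowwise scheme, using numpy and a hypothetical helper name matvec_1d_rowwise (neither is from the slides); each loop iteration stands for the local work of one processor after the all-to-all broadcast of x.

```python
import numpy as np

# Serial sketch of the 1-D rowwise algorithm (hypothetical helper, not from the
# slides).  "Processor" i owns row i of A and element x[i]; after the
# all-to-all broadcast every processor sees the whole of x and computes its
# own component of y.

def matvec_1d_rowwise(A, x):
    n = A.shape[0]
    y = np.empty(n)
    for i in range(n):            # one iteration per "processor"
        y[i] = A[i, :] @ x        # local dot product with the gathered x
    return y

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
assert np.allclose(matvec_1d_rowwise(A, x), A @ x)
```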

Parallel Runtime

• The all-to-all broadcast takes Θ(n)
  • n processors (p = n) and messages of length m = 1, from Table 4.1, on most networks
• The local dot-product computation also takes Θ(n)
• Thus, the overall parallel time is Θ(n)
• The cost, or processor-time product, is thus Θ(n²)
• Because the serial time is also Θ(n²), the algorithm is cost-optimal

Fewer Than n Processors

• Suppose the number of processors p < n
  • Place n/p rows of A and n/p elements of x per processor
  • An all-to-all broadcast distributes the n/p elements of x held by each processor
  • Each processor computes n/p elements of y
  • No further distribution of y is needed
• The broadcast takes: t_s log p + t_w (n/p)(p−1) ≈ t_s log p + t_w n
• The computation per processor is: (n/p)·n = n²/p
• The total parallel runtime T_P is: n²/p + t_s log p + t_w n
• The cost is: n² + t_s p log p + t_w np = Θ(n²)
• The work W = n²
• The algorithm is cost-optimal, provided p = O(n)

Diagram Of The Computational And Communication Process For 1-D Row-Partition Matrix-Vector Multiplication

[Figure not reproduced in this text version]



Scalability Analysis

• The parallel overhead is: T_o = pT_P − W
  • That is: T_o = t_s p log p + t_w np
• To determine the isoefficiency function, we use the equation W = K·T_o, where K = E/(1−E) and the efficiency E is held constant
• We try to bound and express W in terms of p, with E, t_w, and t_s treated as constants
  – Using the larger term, t_w np, of T_o shows that W = K² t_w² p², i.e., W = Θ(p²) (a derivation sketch follows this slide)
  – Using the degree of concurrency:
    » The maximum number of processors p is n; that is, p = O(n)
    » That is, n = Ω(p), n² = Ω(p²), and thus W = Ω(p²)
    » This places a lower bound on the isoefficiency function which is the same as the upper bound
• Thus, the isoefficiency function for 1-D row-partition matrix-vector multiplication is Θ(p²)
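A short sketch of the algebra behind the t_w-term bound mentioned above, written out in LaTeX:

```latex
% Isoefficiency from the t_w term of T_o (1-D row partition).
% Balance the work W = n^2 against the dominant overhead term K t_w n p:
\[
  W = K\,t_w\,n\,p, \qquad W = n^2
  \;\Longrightarrow\; n = K\,t_w\,p
  \;\Longrightarrow\; W = n^2 = K^2 t_w^2\, p^2 = \Theta(p^2).
\]
```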

2-D Partitioned Algorithm For Matrix-Vector Multiplication

• We begin by assuming n² processors arranged in an n×n 2-D grid
  • A is assumed to be distributed with one element per processor
  • x is assumed to be distributed one element per processor in the last column of processors of the grid
• Algorithm (a serial sketch follows this slide):
  • Distribute the elements of x from the last processor in each row to the diagonal processor in that row
  • Perform a simultaneous one-to-all broadcast from the diagonal processors to the processors in each column
  • Each processor multiplies its one element of A by its one element of x
  • Finally, a sum reduction along each row to the last processor in the row is performed to compute each component of y
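A serial sketch of the one-element-per-processor steps, using numpy and a hypothetical helper name matvec_2d_elementwise (not from the slides); the full arrays stand in for what each processor would hold after the alignment and column broadcast.

```python
import numpy as np

# Serial sketch of the 2-D, one-element-per-processor algorithm (hypothetical
# helper, not from the slides).  Processor (i, j) owns A[i, j]; x starts in the
# last column, is aligned on the diagonal, broadcast down each column,
# multiplied, and then sum-reduced along each row to form y.

def matvec_2d_elementwise(A, x):
    n = A.shape[0]
    # After aligning x on the diagonal and broadcasting down the columns,
    # processor (i, j) holds x[j]; model that with one copy of x per row.
    xcast = np.tile(x, (n, 1))
    # Each processor forms its local product A[i, j] * x[j].
    partial = A * xcast
    # Sum-reduce along each row to the last processor of the row.
    return partial.sum(axis=1)

A = np.arange(9.0).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(matvec_2d_elementwise(A, x), A @ x)
```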



Diagram Of The Computational And Communication Process For 2-D Partition Matrix-Vector Multiplication

[Figure not reproduced in this text version]

Parallel Runtime Of The 2-D Partition Algorithm

• The three communication steps:
  • The first (from the last column to the diagonal) takes Θ(1)
  • The second and third (broadcast and reduction) each take Θ(log n) -- see Table 4.1 with m = 1
• The computation takes Θ(1)
• Thus, the parallel runtime is Θ(log n)
• The cost (processor-time product) is Θ(n² log n)
• Thus, the algorithm is not cost-optimal



Increasing The Granularity To Obtain A Cost-Optimal Algorithm

• Use fewer processors than n², say p processors
  • Instead of 1 element per processor, assume the matrix is distributed with an n/√p × n/√p submatrix (block) per processor
  • Assume x is distributed on the last column in blocks of size n/√p in the natural way
• Use the same algorithm (a blockwise sketch follows this slide) -- the differences are:
  – n/√p elements of x are distributed to the diagonal processors
    » Takes t_s + t_w n/√p time (versus unit time)
  – n/√p elements of x are moved from the diagonal processor in a one-to-all broadcast to all processors in its column
    » Takes (t_s + t_w n/√p) log √p time (versus Θ(log n))
  – A matrix-vector subproduct is computed
    » Takes (n/√p)², or n²/p, time (versus Θ(1))
  – n/√p elements of partial results for y are sum-reduced to the last processor in each row
    » Takes (t_s + t_w n/√p) log √p time (versus Θ(log n))
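A blockwise serial sketch of the same steps, assuming p = q² processors and the hypothetical helper name matvec_2d_block (names not from the slides):

```python
import numpy as np

# Serial sketch of the block 2-D algorithm (hypothetical helper, not from the
# slides).  Each of the p = q*q "processors" owns an (n/q) x (n/q) block of A;
# x lives in blocks of n/q, and the same align / broadcast / multiply /
# row-reduce steps are applied blockwise.

def matvec_2d_block(A, x, q):
    n = A.shape[0]
    b = n // q                        # block size n / sqrt(p)
    y = np.zeros(n)
    for i in range(q):                # processor row
        for j in range(q):            # processor column
            Ablk = A[i*b:(i+1)*b, j*b:(j+1)*b]
            xblk = x[j*b:(j+1)*b]     # block of x broadcast down column j
            y[i*b:(i+1)*b] += Ablk @ xblk   # local subproduct; += models the row reduction
    return y

A = np.arange(36.0).reshape(6, 6)
x = np.linspace(1, 6, 6)
assert np.allclose(matvec_2d_block(A, x, q=3), A @ x)
```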

Scalability Analysis

• Adding these times up, the parallel runtime is approximately: n²/p + t_s log p + t_w (n/√p) log p
• The cost (as long as p is suitably bounded in terms of n; see the constraint two slides ahead) is Θ(n²) -- the algorithm is cost-optimal
• But what is the bounding relationship?
  – Let's develop the isoefficiency function
  – The overhead T_o is: T_o = pT_P − W = t_s p log p + t_w n√p log p
• Using this result and some algebra (sketched after this slide), we find that W = Θ(p log² p)
  – Using a "degree of concurrency" argument, W = Ω(p), a lower bound for W
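The algebra alluded to above, sketched in LaTeX by balancing W against the dominant t_w term of the overhead:

```latex
% Isoefficiency from the t_w term of T_o (block 2-D partition).
\[
  W = K\,t_w\,n\sqrt{p}\,\log p, \qquad W = n^2
  \;\Longrightarrow\; n = K\,t_w\sqrt{p}\,\log p
  \;\Longrightarrow\; W = n^2 = K^2 t_w^2\, p\,\log^2 p = \Theta(p\log^2 p).
\]
```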



Determining The Cost-Optimal Constraint On p

• We now have two expressions for W coming from the requirement for cost optimality
• Setting them equal: p log² p = O(n²)
• Now we need to derive an expression for p in terms of n to determine the upper bound on p in terms of n
• Taking the log of both sides, solving for log p, and substituting back into the above equation (the substitution is sketched after this slide) gives:

  p = O(n² / log² n)
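The substitution step spelled out in LaTeX:

```latex
% From p log^2 p = O(n^2):
%   taking logs,  log p + 2 log log p = O(log n),  so  log p = O(log n);
%   substituting back into  p = O(n^2 / log^2 p)  gives the stated bound.
\[
  p\log^2 p = O(n^2)
  \;\Rightarrow\; \log p = O(\log n)
  \;\Rightarrow\; p = O\!\left(\frac{n^2}{\log^2 n}\right).
\]
```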

Comparison: 1-D Versus 2-D

• Runtime:
  • 1-D: n²/p + t_s log p + t_w n
  • 2-D: n²/p + t_s log p + t_w (n/√p) log p
  – The 2-D partition is faster -- its t_w term is smaller
• Isoefficiency (the growth in the work required to keep the efficiency fixed):
  • 1-D: Θ(p²)
  • 2-D: Θ(p log² p)
  – The 2-D partition is more scalable
    » That is, the efficiency can be maintained with fewer processors (or over a wider range of processors)



Matrix-Matrix Multiplication C = AB

• Assume the best serial algorithm is O(n³)
  • This is not strictly true, however
    – Strassen's algorithm has fewer operations, though not substantially fewer
    – There are others as well
• In parallel, three algorithms are discussed:
  • A simple block algorithm
    – Communication contention and lots of memory -- parallel runtime Ω(n)
  • Cannon's block algorithm -- reduces the memory requirement
    – Allows computation/communication overlap
      » Changes the parallel runtime a little, unfortunately
  • The DNS algorithm (Dekel, Nassimi, Sahni)
    – Partitions intermediate data so that the parallel runtime is reduced to Θ(log n) -- an upper bound lower than the lower bound for the above two algorithms

The Simple Algorithm

• Assume matrices A and B of size n×n
• Assume p processors in a grid of size √p×√p
• Assume the matrices are distributed in blocks of size n/√p×n/√p per processor, for both A and B
• Algorithm (a serial sketch follows this slide):
  • Perform an all-to-all broadcast, within each row of processors, of the blocks of A in that row
    – For row i, this ensures that every block of the i-th block row of A is on every processor in the i-th row of the grid
  • Perform an all-to-all broadcast, within each column of processors, of the blocks of B in that column
    – For column j, this ensures that every block of the j-th block column of B is on every processor in the j-th column of the grid
  • Perform the row-block by column-block multiplications of the blocks on each processor -- this computes the appropriate block of C
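A serial sketch of what each processor computes after the two broadcasts, using numpy and a hypothetical helper name matmul_simple_blocks (not from the slides):

```python
import numpy as np

# Serial sketch of the simple block algorithm (hypothetical helper, not from
# the slides).  After the row/column all-to-all broadcasts, processor (i, j)
# holds the whole i-th block row of A and the whole j-th block column of B,
# and forms C_ij = sum over k of A_ik @ B_kj locally.

def matmul_simple_blocks(A, B, q):
    n = A.shape[0]
    b = n // q                                   # block size n / sqrt(p)
    C = np.zeros((n, n))
    for i in range(q):
        for j in range(q):                       # "processor" (i, j)
            for k in range(q):                   # blocks gathered by the broadcasts
                C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                    A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b]
                )
    return C

A = np.random.default_rng(0).random((6, 6))
B = np.random.default_rng(1).random((6, 6))
assert np.allclose(matmul_simple_blocks(A, B, q=3), A @ B)
```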



Performance Analysis For The Simple Algorithm

• Communication:
  • Two all-to-all broadcasts among √p processors, each communicating n²/p elements
    – The time taken is: 2(t_s log √p + t_w (n²/p)(√p−1))
• Computation:
  • The block matrix multiplications (√p multiplications of size n/√p×n/√p)
    – The time taken is: √p (n/√p)³ = n³/p
• Parallel runtime:
  • Approximately: n³/p + 2 t_s log √p + 2 t_w n²/√p

Cost And Isoefficiency Function For The Simple Algorithm

• Cost:
  • Approximately: n³ + 2 t_s p log √p + 2 t_w √p n²
• Cost-optimal:
  • Provided p = O(n²)
• Isoefficiency: Θ(p^(3/2))
  • From the t_s term: t_s p log p
  • From the t_w term: 8 t_w³ p^(3/2) = Θ(p^(3/2)) (the balance is sketched after this slide)
  • From the degree of concurrency: Ω(p^(3/2))
• Memory:
  • Every processor uses: 2√p (n/√p)² = Θ(n²/√p)
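Where the 8 t_w³ p^(3/2) term comes from, sketched in LaTeX by balancing the work W = n³ against the t_w part of the overhead:

```latex
% Overhead t_w term: T_o ~ 2 t_w n^2 sqrt(p).  Set W = K T_o with W = n^3:
\[
  n^3 = K\cdot 2\,t_w\,n^2\sqrt{p}
  \;\Rightarrow\; n = 2K\,t_w\sqrt{p}
  \;\Rightarrow\; W = n^3 = 8K^3 t_w^3\, p^{3/2} = \Theta(p^{3/2}).
\]
```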



Cannon's Algorithm

• Recall it (a serial sketch follows this slide):
  • Initially:
    – Align A by shifting the blocks in each row circularly left a distance equal to the row number
    – Align B by shifting the blocks in each column circularly up a distance equal to the column number
  • Repeatedly, for √p steps:
    – Perform a matrix multiplication of the block of A by the block of B that is on each processor
    – Shift the blocks of A circularly left one position within their rows
    – Shift the blocks of B circularly up one position within their columns
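A serial sketch of Cannon's algorithm on a q×q grid of blocks, using numpy and a hypothetical helper name cannon_matmul (not from the slides); the index arithmetic models the circular shifts that the processors would perform with nearest-neighbour communication.

```python
import numpy as np

# Serial sketch of Cannon's algorithm (hypothetical helper, not from the slides).
def cannon_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    # View the matrices as q x q grids of b x b blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    # Initial alignment: row i of A shifts left by i, column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):                                        # sqrt(p) multiply-and-shift steps
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]               # local block multiply
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]   # shift A left by one
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # shift B up by one
    return np.block(Cb)

A = np.random.default_rng(2).random((6, 6))
B = np.random.default_rng(3).random((6, 6))
assert np.allclose(cannon_matmul(A, B, q=3), A @ B)
```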

Performance Analysis Of Cannon's Algorithm

• Initial alignment:
  • Shifts of at most √p − 1 positions
    – Time is at most: 2(t_s + t_w n²/p)
• The √p steps:
  • Total shifting time: 2(t_s + t_w n²/p)√p
  • Total computation time: (n/√p)³ √p = n³/p
• Total parallel runtime: n³/p + 2 t_s √p + 2 t_w n²/√p
  • Cf. the simple algorithm: n³/p + 2 t_s log √p + 2 t_w n²/√p
• The cost-optimality condition and the isoefficiency function are the same for Cannon's algorithm as for the simple algorithm



The DNS Algorithm

• Partitions the intermediate data as well as the input data
  • Results in a parallel time of Θ(log n) using Ω(n³/log n) processors
• The essence of the algorithm:
  • Use a processor, labeled P_ijk, to perform each scalar product A_ik·B_kj, for a total of n³ products, and add the results to obtain each C_ij simultaneously in log n steps
• The parallel algorithm (a serial sketch follows this slide):
  • Consider the processors arranged as n planes of n×n processors
    – Arrange the matrices A and B so that the elements A_ik and B_kj are on processor (i, j) in the k-th plane
    – Each processor performs its multiplication of elements
    – Then n² simultaneous sum-reductions are performed, say down the k dimension to the bottom plane of processors, to produce the product C on the bottom plane
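A serial sketch of the DNS data layout and reduction, using numpy and a hypothetical helper name dns_matmul (not from the slides); the first axis of the intermediate array plays the role of the k planes of processors.

```python
import numpy as np

# Serial sketch of the DNS layout (hypothetical helper, not from the slides):
# plane k holds the scalar products A[i, k] * B[k, j] at position (i, j);
# summing down the k dimension (done in log n parallel steps) yields C.

def dns_matmul(A, B):
    # products[k, i, j] = A[i, k] * B[k, j] -- one scalar product per "processor"
    products = A.T[:, :, None] * B[:, None, :]
    # Sum-reduce down the k dimension to the bottom plane.
    return products.sum(axis=0)

A = np.random.default_rng(4).random((5, 5))
B = np.random.default_rng(5).random((5, 5))
assert np.allclose(dns_matmul(A, B), A @ B)
```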

Communication For The DNS Algorithm

• Initially, A and B are assumed to be distributed element by element on the bottom plane of processors (k = 0)
  • For A, move its k-th column to the diagonal column of the k-th plane
    – Then broadcast those elements simultaneously along the rows of each plane
  • For B, move its k-th row to the diagonal row of the k-th plane
    – Then broadcast those elements simultaneously along the columns of each plane



Diagram Of The DNS Communication For A

[Figure: a 4×4 grid of element labels (i, j) of A on the bottom plane, with axes i, j, k; not reproduced in this text version]

Diagram Of The Communication For A (continued)

[Figure: the elements of A broadcast along the columns of each plane, with axes i, j, k; not reproduced in this text version]


The DNS Algorithm Continued

[Slide content (figure) not reproduced in this text version]
