Parallelizing the Divide and Conquer Algorithm - Innovative ...

To compute the approximate eigenvalues and the quantities $\hat\lambda_j - d_i$ stably and efficiently, we use the hybrid scheme for the rational interpolation of $f(x)$ developed by Li [25]. The hybrid scheme keeps the peak number of iterations for solving the secular equation relatively small. This helps our parallel implementation, because the execution time of this part is determined by whichever eigenvalue takes the largest number of iterations.

4.3 Back Transformation

The main cost of the divide and conquer algorithm is in computing the product $QU$ (see (2.4)). The efficiency of the whole implementation relies on a proper implementation of this back transformation. The goal is to reduce the size of the matrix-matrix multiplication when transforming the eigenvectors of the perturbed diagonal matrix to the eigenvectors of the tridiagonal matrix. In this section, we explain a permutation strategy originally suggested by Gu [18] and used in the serial LAPACK divide and conquer code. Then we derive a permutation strategy more suitable for our parallel implementation. This new strategy is one of the major contributions of our work.

After the deflation process, we denote by $G$ the product of all the Givens rotations used to set to zero the components of $z$ corresponding to nearly equal diagonal elements of $D$, and by $P$ the accumulation of permutations used to move the zero components of $z$ to the bottom of $z$:

$$PG(D + zz^T)G^TP^T = \begin{pmatrix} \tilde D + \tilde z\tilde z^T & 0 \\ 0 & \bar\Lambda \end{pmatrix}. \qquad (4.3)$$

Here $\bar\Lambda$ denotes the diagonal block associated with the zero components of $z$. Let $(\tilde U, \tilde\Lambda)$ be the spectral decomposition of $\tilde D + \tilde z\tilde z^T$. Then

$$\begin{pmatrix} \tilde D + \tilde z\tilde z^T & 0 \\ 0 & \bar\Lambda \end{pmatrix} = \begin{pmatrix} \tilde U & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} \tilde\Lambda & 0 \\ 0 & \bar\Lambda \end{pmatrix} \begin{pmatrix} \tilde U & 0 \\ 0 & I \end{pmatrix}^T = U\Lambda U^T,$$

and the spectral decomposition of the tridiagonal matrix $T = Q(D + zz^T)Q^T$ is obtained from

$$T = Q(PG)^T\,PG(D + zz^T)(PG)^T\,PGQ^T = Q(PG)^T\,U\Lambda U^T\,PGQ^T = W\Lambda W^T$$
with $W = Q(PG)^TU$. When not properly implemented, the computation of $W$ can be very expensive. To simplify the explanation, we illustrate it with a $4 \times 4$ example: we suppose that $d_1 = d_3$ and that $G$ is the Givens rotation used to set to zero the third component of $z$. The matrix $P$ is a permutation that moves $z_3$ to the bottom of $z$. In the diagrams, entries whose values have changed are marked.

There are two ways of applying the transformation $(PG)^T$: either on the left, that is, to $Q$, or on the right, that is, to $U$. Note that $Q = \mathrm{diag}(Q_1, Q_2)$ is block diagonal (see (2.3)), and we would like to take advantage of this structure: it halves the cost of the matrix multiplication if $Q_1$ and $Q_2$ are of the same size. To preserve the block diagonal form of $Q$, we need to apply $(PG)^T$ on the right:

$$Q(PG)^TU = Q\,\bigl((PG)^TU\bigr).$$

[Sparsity diagrams of the $4 \times 4$ factors: entries not recoverable.]

The product of the last two matrices is performed with 64 flops instead of the $2n^3 = 128$ flops of a full matrix product. However, if we apply $(PG)^T$ on the left, we can further reduce the number of flops. Consider again the $4 \times 4$ example:

$$Q(PG)^TU = \bigl(QG^TP^T\bigr)\,U.$$

[Sparsity diagrams of the $4 \times 4$ factors: entries not recoverable.]
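The 64 versus 128 flop counts quoted above for the $4 \times 4$ example can be checked with a small counting experiment. The following is only an illustrative sketch in plain Python, not the implementation discussed in the text: the helper names `matmul_count`, `block_diag_matmul_count` and `block_diag` are hypothetical, and the random matrix `V` merely stands in for the already transformed factor $(PG)^TU$. Each inner-product term is counted as one multiplication plus one addition, which is the convention behind $2n^3$.

```python
import random


def matmul_count(A, B):
    """Dense product A @ B; returns (C, flops), counting 2 flops per term."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    flops = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                C[i][j] += A[i][k] * B[k][j]
                flops += 2  # one multiply, one add
    return C, flops


def block_diag_matmul_count(blocks, B):
    """Product diag(blocks) @ B that only touches the diagonal blocks."""
    C, flops, row = [], 0, 0
    for Qi in blocks:
        size = len(Qi)
        # Block Qi only meets the matching band of rows of B.
        Ci, f = matmul_count(Qi, B[row:row + size])
        C.extend(Ci)
        flops += f
        row += size
    return C, flops


def block_diag(blocks):
    """Assemble the full matrix diag(blocks), zeros off the blocks."""
    n = sum(len(b) for b in blocks)
    Q = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, block_row in enumerate(b):
            for j, value in enumerate(block_row):
                Q[offset + i][offset + j] = value
        offset += len(b)
    return Q


def random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]


random.seed(0)
Q1, Q2 = random_matrix(2, 2), random_matrix(2, 2)  # assumed equal-size blocks
V = random_matrix(4, 4)  # hypothetical stand-in for (PG)^T U

full, full_flops = matmul_count(block_diag([Q1, Q2]), V)
blocked, blocked_flops = block_diag_matmul_count([Q1, Q2], V)

print(full_flops, blocked_flops)  # 128 64
```

Both routines return the same product. The blocked version skips the multiplications by the zero off-diagonal blocks of $Q$, which is where the factor of two comes from when $Q_1$ and $Q_2$ have the same size.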
