Here the Qi are temporary matrices. Seven multiplications and eighteen additions or subtractions are required. Because addition and subtraction need only ~n² operations, Strassen's matrix multiplication is faster than the conventional method for large matrices (see details in the next section). The input matrices do not need to be square; the only requirement is that their sizes be divisible by two.

Theoretically better algorithms than Strassen's can be found in the literature, see, e.g., [9]. These methods are generally not usable in practice, because their benefit appears only for extremely large matrices. Strassen's algorithm is itself slower than the conventional method on small matrices; in SUNRED we found it to be optimal for matrices with n > 500.

The calculation of the Qi matrices can be parallelized. Full parallelization demands 5.25n² of additional temporary memory, whereas without parallelization only 1.5n² is needed. We have chosen a compromise of two multiplications at a time, which requires 2.25n² of extra memory. When n > 1000, Strassen's multiplication becomes recursive, so the multiplication runs on more than two threads.

A disadvantage of Strassen's method is the degradation of numerical stability [5]. For SUNRED this generally means a decrease of accuracy from 7-10 decimal digits to 5-8 digits. In practice this is not a problem, because the deviation of material parameters, the fitting of components, the uncertainty of the radiated power and other effects would in any case limit the simulation accuracy to the range of percents at best.

Strassen presented a method for inversion as well. The matrix to be inverted is again divided into blocks:

⎡ C11  C12 ⎤   ⎡ A11  A12 ⎤^-1
⎣ C21  C22 ⎦ = ⎣ A21  A22 ⎦                                  (7)

Strassen's inversion:

R1  = A11^-1
R2  = A21 × R1
R3  = R1 × A12
R4  = A21 × R3
R5  = R4 − A22
R6  = R5^-1                                                  (8)
C12 = R3 × R6
C21 = R6 × R2
R7  = R3 × C21
C11 = R1 − R7
C22 = −R6

Two inversions and six multiplications remain, so the number of n³ operations is unchanged. However, because the multiplications can be done with Strassen's algorithm and the inversions are recursively decomposable, the theoretical n^(log2 7) ≈ n^2.807 improvement is achievable. In practice the situation is even better: an n×n inversion takes 2.5-3 times longer than an n×n multiplication, and here inversion is replaced by multiplication. Parallelization is also possible in the inversion, though not to the same extent as in the multiplication: R2 and R3 can be computed simultaneously, as can C12 and C21.

In SUNRED half of the multiplications and all of the inversions are performed on symmetric matrices, so special routines were written for these; they are almost twice as fast as the non-symmetric ones.

Frequency-domain (AC) simulation requires complex arithmetic. A complex multiplication contains four real multiplications, one addition and one subtraction:

(A + iB)(C + iD) = (AC − BD) + i(AD + BC)                    (9)

However, similarly to Strassen's method, the multiplication can be reordered so that the number of multiplications decreases to three, while the number of additions and subtractions increases to five:

(A + iB)(C + iD) = (AC − BD) + i[(A + B)(C + D) − AC − BD]   (10)

The three multiplications can be executed in parallel; temporary matrices of 2-5n² size are required, depending on the level of parallelization.
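The rebracketing in (10) translates directly into code. The sketch below is illustrative only (the function name and template signature are ours, not the solver's routine): T may be a plain double for scalar complex arithmetic, or any matrix type with +, - and * defined, in which case the three products are the independent units that can be run in parallel as noted above.

```cpp
// Three-multiplication complex product following (10).
// T: double for scalars, or any matrix type providing +, - and *.
#include <utility>

template <typename T>
std::pair<T, T> complexMul3(const T& A, const T& B,    // operand A + iB
                            const T& C, const T& D) {  // operand C + iD
    T ac = A * C;                  // 1st multiplication
    T bd = B * D;                  // 2nd multiplication
    T t  = (A + B) * (C + D);      // 3rd multiplication
    return { ac - bd,              // real part:      AC - BD
             t - ac - bd };        // imaginary part: (A+B)(C+D) - AC - BD
}
```

For example, complexMul3(1.0, 2.0, 3.0, 4.0) returns the pair {-5, 10}, matching (1 + 2i)(3 + 4i) = -5 + 10i.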
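The block-inversion recursion (7)-(8) can likewise be written out directly. The sketch below is a minimal illustration and not the SUNRED implementation: it assumes a simple row-major Matrix type, restricts the size to powers of two, inverts a 1×1 block as a scalar reciprocal, uses a conventional cubic product where the solver would use Strassen's multiplication and its symmetric routines, and performs no pivoting and no parallel execution of the independent steps (R2/R3, C12/C21).

```cpp
// Minimal sketch of the block inversion (7)-(8). Assumptions (ours, not the
// paper's): n is a power of two, every pivot block met during the recursion
// is invertible, and the 1x1 base case is a scalar reciprocal.
#include <vector>
#include <cstddef>

struct Matrix {
    std::size_t n;                  // square, n x n
    std::vector<double> a;          // row-major storage
    explicit Matrix(std::size_t size) : n(size), a(size * size, 0.0) {}
    double&  operator()(std::size_t i, std::size_t j)       { return a[i * n + j]; }
    double   operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

static Matrix mul(const Matrix& x, const Matrix& y) {      // conventional O(n^3)
    Matrix r(x.n);
    for (std::size_t i = 0; i < x.n; ++i)
        for (std::size_t k = 0; k < x.n; ++k)
            for (std::size_t j = 0; j < x.n; ++j)
                r(i, j) += x(i, k) * y(k, j);
    return r;
}

static Matrix sub(const Matrix& x, const Matrix& y) {      // x - y
    Matrix r(x.n);
    for (std::size_t i = 0; i < x.a.size(); ++i) r.a[i] = x.a[i] - y.a[i];
    return r;
}

static Matrix neg(const Matrix& x) {                       // -x
    Matrix r(x.n);
    for (std::size_t i = 0; i < x.a.size(); ++i) r.a[i] = -x.a[i];
    return r;
}

static Matrix getBlock(const Matrix& m, std::size_t i0, std::size_t j0) {
    Matrix b(m.n / 2);                                     // copy half-sized block out
    for (std::size_t i = 0; i < b.n; ++i)
        for (std::size_t j = 0; j < b.n; ++j) b(i, j) = m(i0 + i, j0 + j);
    return b;
}

static void putBlock(Matrix& m, std::size_t i0, std::size_t j0, const Matrix& b) {
    for (std::size_t i = 0; i < b.n; ++i)                  // copy block back in
        for (std::size_t j = 0; j < b.n; ++j) m(i0 + i, j0 + j) = b(i, j);
}

static Matrix invert(const Matrix& A) {                    // recursion of (8)
    if (A.n == 1) {                                        // 1x1 base case
        Matrix r(1);
        r(0, 0) = 1.0 / A(0, 0);
        return r;
    }
    const std::size_t h = A.n / 2;
    Matrix A11 = getBlock(A, 0, 0), A12 = getBlock(A, 0, h);
    Matrix A21 = getBlock(A, h, 0), A22 = getBlock(A, h, h);

    Matrix R1  = invert(A11);        // R1  = A11^-1
    Matrix R2  = mul(A21, R1);       // R2  = A21 x R1
    Matrix R3  = mul(R1, A12);       // R3  = R1  x A12
    Matrix R4  = mul(A21, R3);       // R4  = A21 x R3
    Matrix R5  = sub(R4, A22);       // R5  = R4  - A22
    Matrix R6  = invert(R5);         // R6  = R5^-1
    Matrix C12 = mul(R3, R6);        // C12 = R3  x R6
    Matrix C21 = mul(R6, R2);        // C21 = R6  x R2
    Matrix R7  = mul(R3, C21);       // R7  = R3  x C21
    Matrix C11 = sub(R1, R7);        // C11 = R1  - R7
    Matrix C22 = neg(R6);            // C22 = -R6

    Matrix C(A.n);
    putBlock(C, 0, 0, C11);  putBlock(C, 0, h, C12);
    putBlock(C, h, 0, C21);  putBlock(C, h, h, C22);
    return C;
}
```

In practice the recursion would stop at a block size where a conventional inversion routine beats further decomposition rather than at 1×1; that cutoff, like the n > 500 and n > 1000 thresholds quoted above for multiplication, is a tuning parameter not specified here.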
IV. BENCHMARK RESULTS

The speed gain of the algorithms is presented in this section. The following abbreviations are used in the tables:

1024×1 field: a real-life sample; 1024×1024 grid resolution, 1-layer (2D) thermal field, full DC simulation.
64×16 field: a real-life sample; 64×64 grid resolution, 16-layer (3D) thermal field, full DC simulation.
n×n matrix: multiplication of two 3200×3200 real (64-bit double precision) matrices, or inversion of one.
m×o matrix: multiplication of a 4800×1920 and a 1920×1600 real (64-bit double precision) matrix.

Test computer: Dell Dimension 9200 with an Intel Core 2 Duo E6400 (2.13 GHz, 2 MB cache), 4 GB DDR2 667 MHz RAM, Windows Vista Ultimate 32-bit operating system. The solver was compiled with the MS Visual Studio .NET 2003 C++ compiler for the SSE2 instruction set. The run times are the averages of three measurements.

Table I shows the effect of the thread number. The test system contained a dual-core processor, so more than two threads change the speed only minimally, although four threads are slightly faster than two. The 64×16 field gained much more from the increased thread number, because 90% of its runtime was consumed by matrix operations, versus 50% for the 1024×1 field.

Table II presents the results of blocked matrix multiplication (2) and loop unrolling; a 14-26% acceleration was gained in the real-life applications (a generic sketch of such a kernel is given after Table I).

TABLE I
Effect of thread number on the speed (dual-core system)

                 Threads   Runtime     Speed ratio
1024×1 field     1         30.65 sec   100.0%
                 2         19.75 sec   155.2%
                 4         19.57 sec   156.6%
                 64        19.85 sec   154.4%
                 1024      21.34 sec   143.6%
64×16 field      1         16.75 sec   100.0%
                 2          9.57 sec   175.1%
                 4          9.49 sec   176.5%
                 64         9.60 sec   174.5%
                 1024      10.52 sec   159.3%
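For illustration, the kind of kernel exercised in Table II, a cache-blocked product with an unrolled innermost loop, can be sketched as follows. This is a generic example, not the solver's own routine: the tile size BLOCK and the unroll factor of four are placeholder choices, and the specific blocking scheme of (2) is not reproduced here.

```cpp
// Generic cache-blocked matrix product with a 4-way unrolled inner loop.
// Illustrative only: BLOCK and the unroll factor are placeholder choices,
// n must be a multiple of BLOCK, and c must be zero-initialized by the
// caller (the kernel accumulates c += a * b).
#include <vector>
#include <cstddef>

constexpr std::size_t BLOCK = 64;   // tile edge; chosen so tiles stay in cache

void blockedMul(const std::vector<double>& a,   // n x n, row-major
                const std::vector<double>& b,   // n x n, row-major
                std::vector<double>& c,         // n x n, row-major, pre-zeroed
                std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                // process one BLOCK x BLOCK tile combination at a time
                for (std::size_t i = ii; i < ii + BLOCK; ++i)
                    for (std::size_t k = kk; k < kk + BLOCK; ++k) {
                        const double aik = a[i * n + k];
                        // innermost loop unrolled by four (BLOCK is a
                        // multiple of four, so no remainder handling)
                        for (std::size_t j = jj; j < jj + BLOCK; j += 4) {
                            c[i * n + j    ] += aik * b[k * n + j    ];
                            c[i * n + j + 1] += aik * b[k * n + j + 1];
                            c[i * n + j + 2] += aik * b[k * n + j + 2];
                            c[i * n + j + 3] += aik * b[k * n + j + 3];
                        }
                    }
}
```

Blocking keeps a tile of each operand resident in cache across the inner loops, while unrolling reduces loop overhead and exposes independent operations to the compiler's SSE2 code generation.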
