Here the Qi are temporary matrices. Seven multiplications and eighteen additions or subtractions are required. Because addition and subtraction need only ~n² operations, Strassen's matrix multiplication is faster than the conventional method for large matrices (see details in the next section). The input matrices do not need to be square; the only requirement is that their sizes be divisible by two.

Theoretically better algorithms than Strassen's can be found in the literature, see, e.g., [9]. These methods are generally not usable in practice, because their benefit appears only for extremely large matrices. Strassen's algorithm is itself slower than the conventional method on small matrices; in SUNRED we found it to be optimal for matrices with n > 500.

The calculation of the Qi matrices can be parallelized. Full parallelization demands 5.25n² of additional temporary memory, whereas without parallelization only 1.5n² is needed. We have chosen a compromise of two multiplications at a time, which requires 2.25n² of extra memory. When n > 1000, Strassen's multiplication becomes recursive, so the multiplication runs on more than two threads.

A disadvantage of Strassen's method is the degradation of numerical stability [5]. For SUNRED this generally means a decrease of accuracy from 7-10 decimal digits to 5-8 digits. In practice this is not a problem, because the deviation of material parameters, the fitting of components, the uncertainty of the radiated power and other effects would in any case limit the simulation accuracy to the range of percents at best.

Strassen presented a method for inversion as well. The matrix to be inverted is again divided into blocks:

⎡ C11  C12 ⎤   ⎡ A11  A12 ⎤^-1
⎣ C21  C22 ⎦ = ⎣ A21  A22 ⎦                                  (7)

Strassen's inversion:

R1  = A11^-1
R2  = A21 × R1
R3  = R1 × A12
R4  = A21 × R3
R5  = R4 − A22
R6  = R5^-1                                                  (8)
C12 = R3 × R6
C21 = R6 × R2
R7  = R3 × C21
C11 = R1 − R7
C22 = −R6

Two inversions and six multiplications remain, so the number of n³ operations is unchanged. However, because the multiplications can be done with Strassen's algorithm and the inversions are recursively decomposable, the theoretical n^(log2 7) ≈ n^2.807 improvement is achievable. In practice the situation is even better: an n×n inversion takes 2.5-3 times longer than an n×n multiplication, and here inversion is replaced by multiplication. Parallelization is also possible in the inversion, though not to the same extent as in the multiplication: R2 and R3 can be computed simultaneously, as can C12 and C21.

In SUNRED half of the multiplications and all of the inversions are performed on symmetric matrices, so special routines were written for these; they are almost twice as fast as the non-symmetric ones.

Frequency-domain (AC) simulation requires complex arithmetic. A complex multiplication contains four real multiplications, one addition and one subtraction:

(A + iB)(C + iD) = (AC − BD) + i(AD + BC)                    (9)

However, similarly to Strassen's method, the multiplication can be reordered so that the number of multiplications decreases to three, while the number of additions and subtractions increases to five:

(A + iB)(C + iD) = (AC − BD) + i[(A + B)(C + D) − AC − BD]   (10)

The three multiplications can be executed in parallel; temporary matrices of 2-5n² size are required, depending on the level of parallelization.
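The rebracketing in (10) translates directly into code. The sketch below is illustrative only (the function name and template signature are ours, not the solver's routine): T may be a plain double for scalar complex arithmetic, or any matrix type with +, - and * defined, in which case the three products are the independent units that can be run in parallel as noted above.

```cpp
// Three-multiplication complex product following (10).
// T: double for scalars, or any matrix type providing +, - and *.
#include <utility>

template <typename T>
std::pair<T, T> complexMul3(const T& A, const T& B,    // operand A + iB
                            const T& C, const T& D) {  // operand C + iD
    T ac = A * C;                  // 1st multiplication
    T bd = B * D;                  // 2nd multiplication
    T t  = (A + B) * (C + D);      // 3rd multiplication
    return { ac - bd,              // real part:      AC - BD
             t - ac - bd };        // imaginary part: (A+B)(C+D) - AC - BD
}
```

For example, complexMul3(1.0, 2.0, 3.0, 4.0) returns the pair {-5, 10}, matching (1 + 2i)(3 + 4i) = -5 + 10i.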
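The block-inversion recursion (7)-(8) can likewise be written out directly. The sketch below is a minimal illustration and not the SUNRED implementation: it assumes a simple row-major Matrix type, restricts the size to powers of two, inverts a 1×1 block as a scalar reciprocal, uses a conventional cubic product where the solver would use Strassen's multiplication and its symmetric routines, and performs no pivoting and no parallel execution of the independent steps (R2/R3, C12/C21).

```cpp
// Minimal sketch of the block inversion (7)-(8). Assumptions (ours, not the
// paper's): n is a power of two, every pivot block met during the recursion
// is invertible, and the 1x1 base case is a scalar reciprocal.
#include <vector>
#include <cstddef>

struct Matrix {
    std::size_t n;                  // square, n x n
    std::vector<double> a;          // row-major storage
    explicit Matrix(std::size_t size) : n(size), a(size * size, 0.0) {}
    double&  operator()(std::size_t i, std::size_t j)       { return a[i * n + j]; }
    double   operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

static Matrix mul(const Matrix& x, const Matrix& y) {      // conventional O(n^3)
    Matrix r(x.n);
    for (std::size_t i = 0; i < x.n; ++i)
        for (std::size_t k = 0; k < x.n; ++k)
            for (std::size_t j = 0; j < x.n; ++j)
                r(i, j) += x(i, k) * y(k, j);
    return r;
}

static Matrix sub(const Matrix& x, const Matrix& y) {      // x - y
    Matrix r(x.n);
    for (std::size_t i = 0; i < x.a.size(); ++i) r.a[i] = x.a[i] - y.a[i];
    return r;
}

static Matrix neg(const Matrix& x) {                       // -x
    Matrix r(x.n);
    for (std::size_t i = 0; i < x.a.size(); ++i) r.a[i] = -x.a[i];
    return r;
}

static Matrix getBlock(const Matrix& m, std::size_t i0, std::size_t j0) {
    Matrix b(m.n / 2);                                     // copy half-sized block out
    for (std::size_t i = 0; i < b.n; ++i)
        for (std::size_t j = 0; j < b.n; ++j) b(i, j) = m(i0 + i, j0 + j);
    return b;
}

static void putBlock(Matrix& m, std::size_t i0, std::size_t j0, const Matrix& b) {
    for (std::size_t i = 0; i < b.n; ++i)                  // copy block back in
        for (std::size_t j = 0; j < b.n; ++j) m(i0 + i, j0 + j) = b(i, j);
}

static Matrix invert(const Matrix& A) {                    // recursion of (8)
    if (A.n == 1) {                                        // 1x1 base case
        Matrix r(1);
        r(0, 0) = 1.0 / A(0, 0);
        return r;
    }
    const std::size_t h = A.n / 2;
    Matrix A11 = getBlock(A, 0, 0), A12 = getBlock(A, 0, h);
    Matrix A21 = getBlock(A, h, 0), A22 = getBlock(A, h, h);

    Matrix R1  = invert(A11);        // R1  = A11^-1
    Matrix R2  = mul(A21, R1);       // R2  = A21 x R1
    Matrix R3  = mul(R1, A12);       // R3  = R1  x A12
    Matrix R4  = mul(A21, R3);       // R4  = A21 x R3
    Matrix R5  = sub(R4, A22);       // R5  = R4  - A22
    Matrix R6  = invert(R5);         // R6  = R5^-1
    Matrix C12 = mul(R3, R6);        // C12 = R3  x R6
    Matrix C21 = mul(R6, R2);        // C21 = R6  x R2
    Matrix R7  = mul(R3, C21);       // R7  = R3  x C21
    Matrix C11 = sub(R1, R7);        // C11 = R1  - R7
    Matrix C22 = neg(R6);            // C22 = -R6

    Matrix C(A.n);
    putBlock(C, 0, 0, C11);  putBlock(C, 0, h, C12);
    putBlock(C, h, 0, C21);  putBlock(C, h, h, C22);
    return C;
}
```

In practice the recursion would stop at a block size where a conventional inversion routine beats further decomposition rather than at 1×1; that cutoff, like the n > 500 and n > 1000 thresholds quoted above for multiplication, is a tuning parameter not specified here.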
IV. BENCHMARK RESULTS

The speed gain of the algorithms is presented in this section. The following abbreviations are used in the tables:

1024×1 field: a real-life sample; 1024×1024 grid resolution, 1-layer (2D) thermal field, full DC simulation.
64×16 field: a real-life sample; 64×64 grid resolution, 16-layer (3D) thermal field, full DC simulation.
n×n matrix: multiplication of two 3200×3200 real (64-bit double precision) matrices, or inversion of one.
m×o matrix: multiplication of a 4800×1920 and a 1920×1600 real (64-bit double precision) matrix.

Test computer: Dell Dimension 9200 with an Intel Core 2 Duo E6400 (2.13 GHz, 2 MB cache), 4 GB DDR2 667 MHz RAM, Windows Vista Ultimate 32-bit operating system. The solver was compiled with the MS Visual Studio .NET 2003 C++ compiler for the SSE2 instruction set. The run times are the averages of three measurements.

Table I shows the effect of the thread number. The test system contained a dual-core processor, so more than two threads change the speed only minimally, although four threads are slightly faster than two. The 64×16 field gained much more from the increased thread number, because 90% of its runtime was consumed by matrix operations, versus 50% for the 1024×1 field.

Table II presents the results of blocked matrix multiplication (2) and loop unrolling; a 14-26% acceleration was gained in the real-life applications (a generic sketch of such a kernel is given after Table I).

TABLE I
Effect of thread number on the speed (dual-core system)

                 Threads   Runtime     Speed ratio
1024×1 field     1         30.65 sec   100.0%
                 2         19.75 sec   155.2%
                 4         19.57 sec   156.6%
                 64        19.85 sec   154.4%
                 1024      21.34 sec   143.6%
64×16 field      1         16.75 sec   100.0%
                 2          9.57 sec   175.1%
                 4          9.49 sec   176.5%
                 64         9.60 sec   174.5%
                 1024      10.52 sec   159.3%
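For illustration, the kind of kernel exercised in Table II, a cache-blocked product with an unrolled innermost loop, can be sketched as follows. This is a generic example, not the solver's own routine: the tile size BLOCK and the unroll factor of four are placeholder choices, and the specific blocking scheme of (2) is not reproduced here.

```cpp
// Generic cache-blocked matrix product with a 4-way unrolled inner loop.
// Illustrative only: BLOCK and the unroll factor are placeholder choices,
// n must be a multiple of BLOCK, and c must be zero-initialized by the
// caller (the kernel accumulates c += a * b).
#include <vector>
#include <cstddef>

constexpr std::size_t BLOCK = 64;   // tile edge; chosen so tiles stay in cache

void blockedMul(const std::vector<double>& a,   // n x n, row-major
                const std::vector<double>& b,   // n x n, row-major
                std::vector<double>& c,         // n x n, row-major, pre-zeroed
                std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                // process one BLOCK x BLOCK tile combination at a time
                for (std::size_t i = ii; i < ii + BLOCK; ++i)
                    for (std::size_t k = kk; k < kk + BLOCK; ++k) {
                        const double aik = a[i * n + k];
                        // innermost loop unrolled by four (BLOCK is a
                        // multiple of four, so no remainder handling)
                        for (std::size_t j = jj; j < jj + BLOCK; j += 4) {
                            c[i * n + j    ] += aik * b[k * n + j    ];
                            c[i * n + j + 1] += aik * b[k * n + j + 1];
                            c[i * n + j + 2] += aik * b[k * n + j + 2];
                            c[i * n + j + 3] += aik * b[k * n + j + 3];
                        }
                    }
}
```

Blocking keeps a tile of each operand resident in cache across the inner loops, while unrolling reduces loop overhead and exposes independent operations to the compiler's SSE2 code generation.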
