PPKE ITK PhD and MPhil Thesis Classes

More documents

Recommendations

Info

46 2. MAPPING THE NUMERICAL SIMULATIONS OF PARTIAL DIFFERENTIAL EQUATIONS in LUTs for each nonlinear template element. In case of the zero-order nonlinear templates, only one parameter should be stored in the LUT, while in case of the first-order nonlinearity, the gradient value and the constant shift of the current section should be stored. By using this arrangement, for zero-order nonlinear templates, the difference of the value of the currently processed cell and the value of the neighboring cell should be compared to the boundary points. The result of this comparison is used to acquire the adequate nonlinear value. In case of the first-order nonlinear template, additional computation is required. After identifying the proper interval of nonlinearity, the difference should be multiplied by the gradient value and added to the constant. Since the SPEs on the Cell processor are vector processors, the values of the nonlinear function and the boundary points are also stored as a 4-element vector. In each step four differences are computed in parallel and all boundary points must be examined to determine the four nonlinear template elements. To get an efficient implementation, double buffering, vectorization, pipelining, and loop unrolling techniques can be used similarly to the linear template implementation. To investigate the performance of loop unrolling in a nonlinear case, the same 256×256 sized cell array and 16 forward Euler iterations were used as in linear case. The global maximum finder template [21] was used where the template values were defined by a first order nonlinearity which can be partitioned into 2 or 4 segments. Here the number of required instructions were calculated in the case when the inner loop of the computation was not rolled out as well as unrolled 2, 4 and 8 times. In addition, the nonlinearity was segmented into two (A) and four (B) segments. The result of the investigation is shown in Figure 2.10. We can see that most of the time was spent by an SPE with the single and dual cycle instructions and stall due to dependency, which can be reduced by the unrolling technique. Without unrolling the SPE is stalled due to dependency more than 8 million clock cycles in case the nonlinearity is partitioned into two parts (A). However, by using the unrolling technique it is reduced to nearly 259 000 clock cycles, so almost 50% less clock cycles are required for the computation in the 8 times unrolled case. In addition, the effect of increasing the number of segments of the nonlinearity was investigated. Our experiments show that the number of required instructions
2.1 Introduction 47 2.50E+07 1.88E+07 1.25E+07 6.25E+06 number of instructions 0E+00 Unroll 0 (A) Unroll 2 (A) Unroll 4 (A) Unroll 8 (A) Single Cycle Nop Cycle Stall due to prefetch miss Stall due to waiting for hint target Dual Cycle Stall due to branch miss Stall due to dependency Channel stall cycle Figure 2.10: Comparison of the instruction number in case of different unrolling is more than 25% higher in case of using four nonlinearity segments (B) than if the nonlinearity is partitioned only into two parts (A). The comparison of the instruction distribution of the upper cases shows that the main difference between them is the number of required single cycles due to usage of more conditional structures during the implementation as shown in Figure 2.10. In the four segment case (B) the number of stall cycles is smaller than in the two segment case (A) without loop unrolling. By using loop unrolling 30% performance increase can be achieved. In nonlinear case the performance can also be improved if multiple SPEs are used. Finally, we studied the effect of multiple SPEs. In this test case the inner loop of the computation was rolled out at 8 times, and ran on 1, 2, 4 and 8 SPEs in parallel with the nonlinearity partitioned into two (A) and four segments (B). The comparison of the performance of these cases is shown in Figure 2.11. If the nonlinearity is partitioned into two segments (A) the maximum performance of the simulation on 1 SPE is about 223 million cell iterations/s, which is
Page 1:
An Optimal Mapping of Numerical Sim
Page 5 and 6:
Acknowledgements It is not so hard
Page 7 and 8:
Contents 1 Introduction 1 1.1 Cellu
Page 9:
CONTENTS iii 4.3.3 Operating Steps
Page 12 and 13:
vi LIST OF FIGURES 2.4 Performance
Page 15: List of Tables 2.1 Comparison of di
Page 18 and 19: xii LIST OF ABBREVIATIONS FPE Falco
Page 21 and 22: Chapter 1 Introduction Due to the r
Page 23 and 24: 3 can be represented in arbitrary t
Page 25 and 26: 1.1 Cellular Neural/Nonlinear Netwo
Page 27 and 28: 1.2 Cellular Neural/Nonlinear Netwo
Page 29 and 30: 1.3 CNN-UM Implementations 9 the Lo
Page 31 and 32: 1.3 CNN-UM Implementations 11 modif
Page 33 and 34: 1.4 Field Programmable Gate Arrays
Page 47 and 48: 1.5 IBM Cell Broadband Engine Archi
Page 55 and 56: Chapter 2 Mapping the Numerical Sim
Page 57 and 58: 2.1 Introduction 37 18 Load instruc
Page 59 and 60: 2.1 Introduction 39 Instruction his
Page 61 and 62: 2.1 Introduction 41 2E+07 1E+07 9E+
Page 63 and 64: 2.1 Introduction 43 Main memory 8 t
Page 65: 2.1 Introduction 45 6 4.5 3 Speedup
Page 69 and 70: 2.2 Ocean model and its implementat
Page 71 and 72: 2.2 Ocean model and its implementat
Page 73 and 74: 2.3 Computational Fluid Flow Simula
Page 89 and 90: Chapter 3 Investigating the Precisi
Page 91 and 92: 3.2 Numerical Solutions of the PDEs
Page 93 and 94: 3.3 Testing Methodology 73 In fixed
Page 95 and 96: 3.4 Properties of the Arithmetic Un
Page 97 and 98: 3.4 Properties of the Arithmetic Un
Page 99 and 100: 3.5 Results 79 in the L2 cache of t
Page 101 and 102: 3.5 Results 81 1E-02 1E-03 Error 1E
Page 103 and 104: 3.5 Results 83 arithmetic unit requ
Page 105: 3.6 Conclusion 85 point and the 40
Page 108 and 109: 4. IMPLEMENTING A GLOBAL ANALOGIC P
Page 116 and 117:
4. IMPLEMENTING A GLOBAL ANALOGIC P
Page 118 and 119:
Page 120 and 121:
Page 122 and 123:
Page 124 and 125:
Page 126 and 127:
Page 128 and 129:
Page 130 and 131:
110 5. SUMMARY OF NEW SCIENTIFIC RE
Page 132 and 133:
Page 134 and 135:
Page 136 and 137:
Page 138 and 139:
Page 141 and 142:
References The author’s journal p
Page 143 and 144:
5.2 Application of the results 123
Page 145 and 146:
Page 147 and 148:
show all

PPKE ITK PhD and MPhil Thesis Classes

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?