
2. MAPPING THE NUMERICAL SIMULATIONS OF PARTIAL DIFFERENTIAL EQUATIONS

operations (channel stall cycles). The utilization of these SPEs is less than 15%, while the other SPEs are almost as efficient as a single SPE. Investigating the required memory bandwidth shows that each SPE must load two 32-bit floating point values and store the result in main memory. If 600 million cells are updated every second, each SPE requires 7.2 GB/s of memory I/O bandwidth, and the available 25.6 GB/s bandwidth is not enough to support all 8 SPEs.
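The bandwidth figure can be verified with a short calculation (assuming two 4-byte operand loads and one 4-byte result store per cell update, as stated above):

    traffic per cell update = (2 loads + 1 store) x 4 bytes = 12 bytes
    per-SPE bandwidth       = 600 x 10^6 cells/s x 12 bytes = 7.2 GB/s
    all 8 SPEs              = 8 x 7.2 GB/s = 57.6 GB/s > 25.6 GB/s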

To reduce this high bandwidth requirement, a pipelining technique can be used, as shown in Figure 2.6. In this case the SPEs are chained one after the other and each SPE computes a different iteration step using the results of the previous SPE, so the Cell processor works similarly to a systolic array. Only the first and last SPEs in the pipeline need to access main memory; the other SPEs load the results of the previous iteration directly from the local memory of the neighboring SPE. Due to the ring structure of the Element Interconnect Bus (EIB), communication between neighboring SPEs is very efficient. The instruction histogram of the optimized CNN simulator kernel is summarized in Figure 2.7.
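As an illustration, a minimal sketch of the SPU-side inner loop of one pipeline stage is given below, assuming the Cell SDK's spu_mfcio.h DMA interface. The kernel function compute_one_iteration(), the buffer size, and the tag value are placeholders, and the double buffering and the mailbox handshake that keep neighboring stages in step are omitted:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define TAG        0
    #define LINE_BYTES 16384  /* one strip of cells; 16 KB is the DMA maximum */

    static float in_buf[LINE_BYTES / 4]  __attribute__((aligned(128)));
    static float out_buf[LINE_BYTES / 4] __attribute__((aligned(128)));

    extern void compute_one_iteration(const float *in, float *out);

    /* prev_ea: effective address of the previous stage's output (the
     * previous SPE's local store, or main memory for the first stage). */
    void pipeline_stage(uint64_t prev_ea, uint64_t result_ea,
                        int last, int nlines)
    {
        for (int i = 0; i < nlines; i++) {
            /* Pull the previous iteration's result over the EIB. */
            mfc_get(in_buf, prev_ea + (uint64_t)i * LINE_BYTES,
                    LINE_BYTES, TAG, 0, 0);
            mfc_write_tag_mask(1 << TAG);
            mfc_read_tag_status_all();      /* wait for the DMA to finish */

            compute_one_iteration(in_buf, out_buf);

            if (last) {
                /* Only the last stage writes back to main memory; the
                 * intermediate stages simply leave out_buf in their
                 * local store, from which the next SPE fetches it. */
                mfc_put(out_buf, result_ea + (uint64_t)i * LINE_BYTES,
                        LINE_BYTES, TAG, 0, 0);
                mfc_write_tag_mask(1 << TAG);
                mfc_read_tag_status_all();
            }
        }
    }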

Though the memory bandwidth bottleneck is eliminated by using the SPEs in a pipeline, the overall performance of the simulation kernel cannot be improved. The instruction histogram in Figure 2.7 still shows a huge number of channel stall cycles in the 4 and 8 SPE cases. To find the source of these stall cycles, detailed profiling of the simulation kernel is required. First, profiling instructions are inserted into the source code to determine the length of the computation part of the program and of the overhead. The results are summarized in Figure 2.8.
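The inserted profiling instructions can be as simple as reading the SPU decrementer around each phase. The sketch below assumes the spu_read_decrementer()/spu_write_decrementer() convenience macros of the Cell SDK; do_setup() and run_simulation_kernel() are placeholders for the setup and computation parts of the program:

    #include <spu_mfcio.h>

    extern void do_setup(void);               /* thread/address setup phase */
    extern void run_simulation_kernel(void);  /* computation part */

    unsigned int setup_ticks, compute_ticks;

    int main(void)
    {
        unsigned int t0, t1, t2;

        spu_write_decrementer(0xFFFFFFFF);  /* start the counter */
        t0 = spu_read_decrementer();
        do_setup();
        t1 = spu_read_decrementer();
        run_simulation_kernel();
        t2 = spu_read_decrementer();

        /* The decrementer counts down, so elapsed time is start - end;
         * it ticks at the timebase frequency, which must be scaled to
         * processor cycles when comparing with instruction counts. */
        setup_ticks   = t0 - t1;
        compute_ticks = t1 - t2;
        return 0;
    }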

The profiling results show that in the 4 and 8 SPE cases some SPEs spend a huge number of cycles in the setup portion of the program. This behavior is caused by the SPE thread creation and initialization process. First, the required number of SPE threads is created; then the addresses required to set up communication between the SPEs and to form the pipeline are determined. Meanwhile, the already started threads are blocked. In the 4 SPE case the threads on SPE 0 and SPE 1 are created first, hence these threads have to wait not only for the address determination phase but also until the other two threads are created. Unfortunately, this thread creation overhead cannot be eliminated. For a long analogic algorithm which uses several templates, the SPEs and the PPE should be used in a client/server model.
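For reference, a minimal PPE-side sketch of this thread creation sequence is given below, assuming the libspe2 interface (the program handle name, the flags, and the omitted mailbox-based address exchange are illustrative):

    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t cnn_kernel_spu;  /* embedded SPU program */

    #define NUM_SPES 4

    static void *spe_thread(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);  /* blocks until the SPU program exits */
        return NULL;
    }

    int main(void)
    {
        spe_context_ptr_t ctx[NUM_SPES];
        pthread_t thread[NUM_SPES];

        /* Threads are created one after the other: the SPU programs
         * started first (SPE 0 and SPE 1) sit idle until every context
         * exists and the local store addresses have been exchanged. */
        for (int i = 0; i < NUM_SPES; i++) {
            ctx[i] = spe_context_create(0, NULL);
            spe_program_load(ctx[i], &cnn_kernel_spu);
            pthread_create(&thread[i], NULL, spe_thread, ctx[i]);
        }

        /* The local store addresses needed to link the pipeline stages
         * are obtained here (e.g. via spe_ls_area_get()) and sent to
         * the waiting SPEs; the mailbox exchange is omitted. */

        for (int i = 0; i < NUM_SPES; i++)
            pthread_join(thread[i], NULL);
        for (int i = 0; i < NUM_SPES; i++)
            spe_context_destroy(ctx[i]);
        return 0;
    }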
