Architecture of Computing Systems (Lecture Notes in Computer ...)

240 F. Xudong et al.

threads in the wavefront have to execute the branch, which means all the paths are executed serially. This situation degrades kernel performance greatly. Therefore, branch divergences in kernels should be eliminated as much as possible.

We convert control dependence to data dependence, which caters to the GPU's powerful data processing capability [10]. Our branch elimination is a two-step strategy: a) Branch fusion. Branch fusion is only suitable for the situation where the left-hand expressions of the if and else branches are the same. If not, there is no benefit in using branch fusion, since the expressions in both branches have to be executed anyway. b) Expression simplification. The second step simplifies the expressions obtained from branch fusion in the hope of eliminating all the redundant computations. With branch elimination, we can eliminate all eight branches in the Interp kernel.

3.4 CPU-GPU Task Distribution

GPUs are good at performing ALU-intensive tasks, which qualifies them as good accelerators for CPUs. The philosophy of using GPUs to accelerate applications is to launch massive numbers of threads to exploit inter-thread parallelism and hide memory access latencies. So when the problem size is very small, there may not be enough threads to occupy the stream processing cores and fully exploit parallelism. Take the problem size 16³ for example. Assuming that there are enough GPRs, only 4K threads are needed to process the computation at this problem size, which is much less than the maximum 10K threads that the RV770 core can provide, not to mention the smaller problem sizes.

When the speedup obtained by the GPU is less than one, we should consider moving the task back to run on the CPU. Nevertheless, porting computing tasks to the CPU entails inevitable overhead such as data communication latency. This means the performance gain from distributing the task between the CPU and the GPU must outweigh this overhead for the overall system performance to improve. Only then is distributing tasks between the CPU and the GPU sure to outperform execution on a single computing device, whether CPU or GPU.

4 Experimental Evaluation

To examine the benefits of our optimization strategies, we implemented the Mgrid application using Brook+ on an AMD Radeon HD4870 GPU. All the results are compared to the single-thread CPU version, which is measured on an Intel Xeon E5405 CPU running at 2 GHz with 256 KB L1 cache and 12 MB L2 cache. We used the Intel ifort compiler as the CPU compiler with the optimization option -O3.

Mgrid is a 3D multigrid application in the SPECfp/NAS benchmark. Notably, it is the only application found in both the SPEC and NAS benchmark suites, and among the few SPEC 2000 applications that survived from SPEC 95 and SPEC 98. The main process of Mgrid follows a V-cycle pattern performed on multilevel grids over multiple passes (iterations), as illustrated in Fig. 1(b). Mgrid
