Architecture of Computing Systems (Lecture Notes in Computer ...

Recommendations

Info

114 P. Petoumenos et al. Studying these approaches in more detail we discovered that, in some cases, small changes in the IQ size bring about significant degradation in performance. New found understanding of the relation of the size of the instruction queue to performance, by the work of Chou et al. [7], Karkhanis and Smith [6], Eyerman and Eeckhout [4], Qureshi et al. [8], points to the main culprit for this: Memory-Level Parallelism (MLP) [5]. In the presence of misses and MLP the single most important factor that affects performance in relation to the IQ size is whether MLP is preserved or harmed. While this new understanding seems, in retrospect, obvious and easy to integrate in previous proposals, in fact it requires a new approach in managing the IQ. Specifically, while prior feedback-loop proposals based their decisions on a coarse sampling period measured in thousands of cycles or instructions, an MLP-aware technique requires a much more fine-grain approach where decisions must be taken at instruction intervals whose size is comparable to the size of the IQ. The reason for this is that MLP itself exists if misses appear within the instruction window of the processor and must be handled at that resolution. More accurately, MLP exists if the distance between misses is less than the size of the reorder buffer (ROB) [6]. In our case, we use a combined IQ and ROB in the form of a Register Update Unit [9], thus the ROB size defaults to the size of the IQ. In the rest of the paper we discuss MLP with respect to the size of the IQ. Contributions of this paper: • We expose MLP (when it exists) as the main factor that affects performance in IQ resizing techniques. • We propose a methodology to integrate MLP-awareness in IQ resizing techniques by measuring and predicting MLP in fine-grain segments of the dynamic instruction stream. • We propose an example practical implementation of this approach and show that it consistently outperforms in total EDP (for the whole processor) an optimized non-MLP-aware technique as well as improving EDP across the board compared to base case of an unmanaged IQ. Structure of this paper–– Section 2 presents related work, while Section 3 motivates the need for IQ resizing using MLP. In Section 4 we present our proposal for MLPawareness in the IQ resizing and in Section 5 we delve into some details. Section 6 offers our evaluation and Section 7 concludes the paper. 2 Related Work IQ Resizing Techniques–– The instruction window (including possibly a separate reorder buffer, an instruction queue and the load/store queue) has been prime target for energy optimizations. The reason is twofold: first they are greedy consumers of the processor power budget; second, typically these structures are sized to support peak execution in a wide out-of-order processor. Although there are many proposals for IQ resizing, we list here the three main proposals that are relevant to our work.
MLP-Aware Instruction Queue Resizing: The Key to Power-Efficient Performance 115 The proposals by Buyuktosunoglu et al [1] and Kucuk et al [3], resize the IQ by physically partitioning it in segments and disabling and enabling each segment 1 as needed. Both techniques use a feedback loop to control the size of the IQ. The Buyuktosunoglu et al. technique [1] is based on the number of active entries (i.e., ready-to-issue entries) per segment to decide whether or not a segment deserves to remain active, while Kucuk et al. [3] argue that the occupancy of the IQ (in valid instructions) rather than active entries is a better indication of its required size. In both techniques the IPC is measured periodically; a sudden drop in IPC signals the need for IQ upsizing. However in both approaches there is no other mechanism to revert back to the full IQ size. The Folegnani and González approach [2] is distinguished by being a logical resizing of the IQ (limiting the range of the entries that can be allocated) not a physical partitioning as the other two. In addition, the feedback mechanism is based on how much the youngest instructions in the IQ contribute to the actual ILP. If they do not contribute much, this means that the size of the IQ can be reduced further. However, there is no way to adapt the IQ to larger sizes except periodically revering back to full size. Nevertheless, the Folegnani and González resizing policy is very good at adjusting the IQ size so as to not harm the ILP in the program. Modeling of MLP–– Karkhanis and Smith presented a first-order model of a superscalar core [6] that deepened our understanding of MLP. This work exposed the importance of MLP in shaping the performance of out-of-order execution. Karkhanis and Smith show that in absence of any upsetting events such as branch mispredictions and L2 misses the number of instructions that will issue (on average) is a property of the program and is expressed by the so-called IW characteristic of the program. The presence, however, of upsetting events, such as L2 misses, decreases the IPC from the ideal point of the IW characteristic depending on the apparent cost of the L2 miss. This is because, a L2 miss drains the pipeline, and eventually stalls it when the instruction that misses blocks the retirement of other instructions [6]. MLP, in this case, spreads the cost of accessing main memory over all the instructions that miss in parallel, making their apparent cost appear much less. Existing IQ resizing mechanisms focus mainly on the variations of the ILP (effectively moving along the IW characteristic) ignoring what happens during misses. Our work, focuses instead on the latter. MLP-Aware techniques–– Qureshi et al. exploited MLP to optimize cache replacement [8]. MLP in their case is associated with data (cache lines) and the MLP prediction is done in the cache. Data that do not exhibit MLP are given preference in staying in the cache over data that exhibit MLP. Eyerman and Eeckhout exploit MLP to optimize fetch policies for Simultaneous Multi-Threaded (SMT) processors [4]. In their case, MLP is associated with instructions that miss in the cache. Our MLP prediction mechanism is similar to the one proposed by Eyerman and Eeckhout, but because we use it not to regulate the fetch stage, but to manage the entire IQ, we associate MLP information to larger segments of code. 1 Sequentially from the end of the IQ in [1] or independently of position in [3].
Page 2 and 3:
Lecture Notes in Computer Science 5
Page 4 and 5:
Volume Editors Christian Müller-Sc
Page 6 and 7:
General Chair Organization Christia
Page 8 and 9:
Organization IX Hartmut Schmeck Kar
Page 10 and 11:
Keynote Table of Contents HyVM - Hy
Page 12 and 13:
Table of Contents XIII JetBench:AnO
Page 14 and 15:
How to Enhance a Superscalar Proces
Page 16 and 17:
4 J. Mische et al. The Real-time Vi
Page 18 and 19:
6 J. Mische et al. 4.1 Instruction
Page 20 and 21:
8 J. Mische et al. Additionally the
Page 22 and 23:
10 J. Mische et al. % of unused pip
Page 24 and 25:
12 J. Mische et al. % of cycles spe
Page 26 and 27:
14 J. Mische et al. 16. Lickly, B.,
Page 28 and 29:
16 G. Aşılıoğlu, E.M. Kaya, and
Page 30 and 31:
Page 32 and 33:
Page 34 and 35:
Page 36 and 37:
Page 38 and 39:
26 T.B. Preußer, P. Reichel, and R
Page 40 and 41:
Page 42 and 43:
Page 44 and 45:
Page 46 and 47:
Page 48 and 49:
Page 50 and 51:
38 P. Bellasi, W. Fornaciari, and D
Page 52 and 53:
Page 54 and 55:
Page 56 and 57:
Page 58 and 59:
Page 60 and 61:
Page 62 and 63:
50 J. Zeppenfeld and A. Herkersdorf
Page 64 and 65:
Page 66 and 67:
Page 68 and 69:
Page 70 and 71:
Page 72 and 73:
Page 74 and 75:
62 B. Jakimovski, B. Meyer, and E.
Page 76 and 77: 64 B. Jakimovski, B. Meyer, and E.
Page 86 and 87: 74 M. Bonn and H. Schmeck Fig. 1. J
Page 88 and 89: 76 M. Bonn and H. Schmeck 2.2 Node
Page 90 and 91: 78 M. Bonn and H. Schmeck Uptime-ba
Page 92 and 93: 80 M. Bonn and H. Schmeck 2.4 Simul
Page 94 and 95: 82 M. Bonn and H. Schmeck done rate
Page 96 and 97: 84 M. Bonn and H. Schmeck Fig. 8. J
Page 98 and 99: 86 M. Bonn and H. Schmeck tells the
Page 100 and 101: 88 J.-P. Steghöfer et al. � �
Page 102 and 103: 90 J.-P. Steghöfer et al. mechanis
Page 104 and 105: 92 J.-P. Steghöfer et al. resource
Page 106 and 107: 94 J.-P. Steghöfer et al. Choose a
Page 108 and 109: 96 J.-P. Steghöfer et al. 1. Defin
Page 110 and 111: 98 J.-P. Steghöfer et al. and all
Page 112 and 113: 100 J.-P. Steghöfer et al. 19. Kim
Page 114 and 115: 102 K. Kloch et al. large-scale sys
Page 116 and 117: 104 K. Kloch et al. a�t� 1.0 0.
Page 118 and 119: 106 K. Kloch et al. constant. This
Page 120 and 121: 108 K. Kloch et al. Relative number
Page 122 and 123: 110 K. Kloch et al. (a) infection r
Page 124 and 125: 112 K. Kloch et al. (ii) Phase of a
Page 128 and 129: 116 P. Petoumenos et al. % of Misse
Page 130 and 131: 118 P. Petoumenos et al. IQ: n 4-in
Page 132 and 133: 120 P. Petoumenos et al. downsizing
Page 134 and 135: 122 P. Petoumenos et al. As long as
Page 136 and 137: 124 P. Petoumenos et al. comparable
Page 138 and 139: Exploiting Inactive Rename Slots fo
Page 140 and 141: 128 M. Kayaalp et al. In a supersca
Page 142 and 143: 130 M. Kayaalp et al. INSTRUCTION 1
Page 144 and 145: 132 M. Kayaalp et al. time. Alterna
Page 146 and 147: 134 M. Kayaalp et al. Fig. 3. Numbe
Page 148 and 149: 136 M. Kayaalp et al. The results o
Page 150 and 151: Efficient Transaction Nesting in Ha
Page 152 and 153: 140 Y. Liu et al. HTMs include TCC
Page 154 and 155: 142 Y. Liu et al. rollback T0 begin
Page 156 and 157: 144 Y. Liu et al. Processor core Pr
Page 158 and 159: 146 Y. Liu et al. 5.2 Results and A
Page 160 and 161: 148 Y. Liu et al. decreases, partia
Page 162 and 163: Decentralized Energy-Management to
Page 164 and 165: 152 B. Becker et al. Furthermore, t
Page 166 and 167: 154 B. Becker et al. An optimizing
Page 168 and 169: 156 B. Becker et al. smart-home man
Page 170 and 171: 158 B. Becker et al. freedom like w
Page 172 and 173: 160 B. Becker et al. Power [W] 4500
Page 174 and 175: EnergySaving Cluster Roll: Power Sa
Page 176 and 177:
164 M.F. Dolz et al. Themodulequeri
Page 178 and 179:
166 M.F. Dolz et al. This daemon al
Page 180 and 181:
168 M.F. Dolz et al. been submitted
Page 182 and 183:
170 M.F. Dolz et al. On the other h
Page 184 and 185:
172 M.F. Dolz et al. at the inactiv
Page 186 and 187:
Effect of the Degree of Neighborhoo
Page 188 and 189:
176 T. Abdullah et al. A zone based
Page 190 and 191:
178 T. Abdullah et al. A consumer/p
Page 192 and 193:
180 T. Abdullah et al. Messages 140
Page 194 and 195:
182 T. Abdullah et al. (except when
Page 196 and 197:
184 T. Abdullah et al. % Matchmakin
Page 198 and 199:
186 T. Abdullah et al. show that th
Page 200 and 201:
188 M. Schindewolf, D. Kramer, and
Page 202 and 203:
Page 204 and 205:
Page 206 and 207:
Page 208 and 209:
Page 210 and 211:
Page 212 and 213:
200 R. Plyaskin and A. Herkersdorf
Page 214 and 215:
Page 216 and 217:
Page 218 and 219:
Page 220 and 221:
Page 222 and 223:
Page 224 and 225:
212 M.Y. Qadri, D. Matichard, and K
Page 226 and 227:
Page 228 and 229:
Page 230 and 231:
Page 232 and 233:
Page 234 and 235:
A Tightly Coupled Accelerator Infra
Page 236 and 237:
224 F. Nowak and R. Buchty where A
Page 238 and 239:
226 F. Nowak and R. Buchty Fig. 3.
Page 240 and 241:
228 F. Nowak and R. Buchty Table 2.
Page 242 and 243:
230 F. Nowak and R. Buchty Table 4.
Page 244 and 245:
232 F. Nowak and R. Buchty 5.3 Comp
Page 246 and 247:
Optimizing Stencil Application on M
Page 248 and 249:
236 F. Xudong et al. compared to th
Page 250 and 251:
238 F. Xudong et al. stencil comput
Page 252 and 253:
240 F. Xudong et al. threads in the
Page 254 and 255:
242 F. Xudong et al. Speedup 14 12
Page 256 and 257:
244 F. Xudong et al. 5 Related Work
Page 258:
Abdullah, Tariq 174 Alima, Luc Onan
show all

Architecture of Computing Systems (Lecture Notes in Computer ...

Create successful ePaper yourself

Delete template?

Save as template?