Copyright by William Lloyd Bircher 2010 - The Laboratory for ...

of Quad-Core AMD processors, this is the dominant effect. When an active core performs a cache probe of an idle core, latency is higher than when probing an active core. The performance loss can be significant for memory-bound (cache probe-intensive) workloads.

Direct performance effects are due to the current operating frequency of an active core. This effect tends to be smaller than the indirect effect, since operating systems are reasonably effective at matching current operating frequency to performance demand.

These effects are illustrated in Figure 6.1. Two extremes of workloads are presented: the compute-bound crafty and the memory-bound equake. For each workload, two cases are presented: fixed and normal scheduling. Fixed scheduling isolates indirect performance loss by eliminating the effect of OS frequency scheduling and thread migration. This is accomplished by forcing the software thread to a particular core for the duration of the experiment. In this case, the thread always runs at the maximum frequency, and the idle cores always run at the minimum frequency. As a result, crafty achieves 100 percent of the performance of a processor that does not use dynamic power management. In contrast, the memory-bound equake shows significant performance loss due to the reduced performance of idle cores.

Direct performance loss is shown in the dark solid and light solid lines, which utilize OS scheduling of frequency and threads. Because direct performance losses are caused by suboptimal frequency in active cores, the compute-bound crafty shows a significant performance loss. The memory-bound equake actually shows a performance improvement for low idle core frequencies. This is caused by idle cores remaining at a high frequency following a transition from active to idle.

Figure 6.1 Direct and Indirect Performance Impact
[Plot: Performance (60% to 100%) versus Idle Core Frequency (200 to 2200 MHz); series: crafty-fixed, equake-fixed, equake, crafty]

6.1.4 Indirect Performance Effects

The amount of indirect performance loss depends mostly on three factors: idle core frequency, OS p-state transition characteristics, and OS scheduling characteristics. The probe latency (time to respond to a probe) is largely independent of idle core frequency above the “breakover” frequency (FreqB). Below FreqB, performance drops rapidly at an approximately linear rate. This can be seen in Figure 6.1 as the dashed light line. The value of FreqB depends primarily on the inherent probe latency of the processor and the number of active and idle cores. Increasing the active core frequency increases the demand for probes and therefore increases FreqB.
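The breakover behavior described in Section 6.1.4 can be sketched as a piecewise-linear function: flat above FreqB, falling roughly linearly below it. The following is a minimal illustration only, not the model used in the dissertation. The function name, the parameter names, and the numeric values for FreqB and the slope are hypothetical placeholders and are not taken from Figure 6.1.

```python
def relative_performance(idle_freq_mhz, freq_b_mhz, slope_per_mhz, plateau=1.0):
    """Piecewise-linear sketch of performance versus idle core frequency.

    At or above the breakover frequency (freq_b_mhz), performance is treated
    as independent of idle core frequency and equal to `plateau`.
    Below it, performance falls linearly by `slope_per_mhz` for every MHz of
    shortfall. All parameters are illustrative assumptions.
    """
    if idle_freq_mhz <= 0 or freq_b_mhz <= 0:
        raise ValueError("frequencies must be positive")
    if slope_per_mhz < 0:
        raise ValueError("slope must be non-negative")

    if idle_freq_mhz >= freq_b_mhz:
        return plateau

    loss = slope_per_mhz * (freq_b_mhz - idle_freq_mhz)
    # Clamp at zero so an aggressive slope cannot yield negative performance.
    return max(0.0, plateau - loss)


if __name__ == "__main__":
    # Hypothetical values: FreqB = 1200 MHz, 0.02% performance lost per MHz.
    for freq in (200, 700, 1200, 1700, 2200):
        perf = relative_performance(freq, freq_b_mhz=1200, slope_per_mhz=0.0002)
        print(f"{freq:5d} MHz -> {perf:.0%}")
```

In this sketch, raising `freq_b_mhz` moves the knee of the curve to the right, which is one way to picture the statement that a higher active core frequency increases FreqB: a wider range of idle core frequencies then falls in the linear-loss region.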