
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 29, NO. 9, SEPTEMBER 2010

Temperature-Aware Integrated DVFS and Power Gating for Executing Tasks with Runtime Distribution

Kyungsu Kang, Student Member, IEEE, Jungsoo Kim, Student Member, IEEE, Sungjoo Yoo, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE

Abstract: At high operating temperatures, chip cooling is crucial due to the exponential temperature dependence of leakage current. However, traditional cooling methods, e.g., power/clock gating applied when a temperature threshold is reached, often cause excessive performance degradation. In this paper, we propose a method for delivering lower energy consumption by integrating cooling and running in a temperature-aware manner without incurring a performance penalty. To further reduce the energy consumption, we analytically exploit the runtime distribution of each sub-segment of a task, called a "bin," such that the time budget for cooling in each bin is allocated in proportion to the probability of occurrence of the bin. We apply the proposed method to two realistic software programs, an H.264 decoder and ray tracing, and a benchmark program, equake. The experimental results show that the proposed method yields an additional 19.4%–27.2% reduction in energy consumption compared with existing methods.

Index Terms: Dynamic voltage and frequency scaling (DVFS), energy minimization, hard real time, power gating (PG), runtime distribution.

I. Introduction

TECHNOLOGY scaling has resulted in a sharp growth in transistor density, up to 0.4 billion transistors/cm² [1]. Such transistor counts have pushed chip power density up to 250–300 W/cm² [2]. High power density causes temperature-related problems, including reliability problems such as negative bias temperature instability, and performance degradation. Leakage power consumption, which is the most critical among temperature-related problems [3], is caused by the exponential dependency of leakage current on temperature. It affects battery lifetime in mobile devices and increases the cost of cooling in high-performance computing [4]. In particular, handheld devices such as thin notebooks, personal digital assistants, and cell phones are likely to suffer from temperature-induced leakage power because they are not equipped with active cooling facilities, e.g., cooling fans, due to the small form factor requirement.

Manuscript received September 4, 2009; revised February 23, 2010. Date of current version August 20, 2010. This work was supported by the National Research Foundation (NRF) of Korea funded by the Korean government (MEST), under Grant 2010-0000823. This paper was recommended by Associate Editor M. Poncino.

K. Kang, J. Kim, and C.-M. Kyung are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305–701, Korea (e-mail: kskang@vslab.kaist.ac.kr; jskim@vslab.kaist.ac.kr; kyung@ee.kaist.ac.kr).

S. Yoo is with the Department of Electronics and Electrical Engineering, Pohang University of Science and Technology, Pohang 790–784, Korea (e-mail: sungjoo.yoo@postech.ac.kr).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2010.2059290

0278-0070/$26.00 © 2010 IEEE

Dynamic thermal management (DTM) is an effective method to control the chip temperature. When the operating temperature approaches the thermal limit, several solutions are available for cooling, such as stopping with power/clock gating, running at lower clock frequencies and/or lower voltages, and issuing fewer instructions/functions [5]–[7]. DTM trades performance for temperature reduction, often incurring a performance penalty.

There have been several studies on temperature-aware dynamic voltage and frequency scaling (DVFS) [8]–[10] that avoid the performance penalty by setting frequency and voltage in a temperature-aware manner. In such methods, given a time slack (= deadline − remaining worst-case execution time) and statistical workload estimation [10], the performance level is set to minimize the total energy consumption, which consists of temperature-induced leakage energy and switching energy.

Although temperature-aware DVFS and DTM are self-sufficient techniques, applying them together helps fully exploit the potential of both. For instance, at high temperature, where temperature-induced leakage power consumption dominates, we utilize the time slack to cool down the processor by applying power/clock gating, e.g., turning off the processor during the slack. In the low-temperature situation, where switching power dominates, we focus on reducing the switching power consumption by lowering the operating voltage/frequency. As our motivational example in Section III shows, such a temperature-aware tradeoff between power/clock gating and voltage/frequency setting gives a significantly larger reduction in energy consumption than applying either technique without the other.

Most complex programs are characterized by a runtime distribution [17]–[19]. There are two sources of runtime distribution: data-dependent behavior and conflicts in accessing architectural resources, e.g., cache and memory. Programs with loops and if/else statements tend to have different loop counts and take different execution paths depending on (input) data. For instance, in the case of a video codec, object movements, i.e., input picture data, can determine the execution time of time-consuming functions, e.g., motion estimation. Cache misses and DRAM access scheduling are typical sources of runtime variation in the architectural area. It is well known that exploiting runtime distribution can reduce energy consumption further than considering only the worst-case runtime [17]–[19]. In this paper, we extend the use of runtime distribution information to the temperature-aware tradeoff between power/clock gating and voltage/frequency scaling.

In this paper, we present a solution that integrates both DTM (to be exact, power/clock gating¹) and DVFS in a temperature-aware manner to minimize the total energy consumption of a periodic hard real-time system. Compared with earlier works, ours is unique in two aspects: 1) a temperature-aware tradeoff between PG and DVFS, and 2) exploitation of runtime distribution in that temperature-aware tradeoff.

This paper is organized as follows. Section II reviews related works. Section III explains the motivation for our work. Section IV presents our temperature-aware power estimation method. Section V gives the problem definition and solution overview. Sections VI and VII explain the temperature-aware integration of PG and DVFS. Section VIII shows how to apply the proposed solution during runtime. Section IX reports experimental results, followed by the conclusion in Section X.

II. Related Works

Temperature-aware design methods require thermal models. HotSpot [11] utilizes an equivalent circuit comprising the thermal resistance and capacitance of architectural blocks and the thermal characteristics of the package. Han et al. [12] proposed a time-invariant linear thermal system that uses an adaptive time interval to linearize the temperature calculation and speed up the thermal analysis. Kumar et al. [13] proposed a regression-based thermal model that uses hardware performance counters available in the processor.

"HybDTM" [5] controls temperature by clock gating or limiting task execution when the temperature reaches a thermal threshold. Rao et al. [6] proposed an analytical solution that maximizes the average throughput of the processor within the given time and temperature constraints. DTM techniques can effectively lower the chip temperature, but the benefit often comes at the cost of performance degradation. Yang et al. [7] proposed temperature-aware task scheduling to keep the chip temperature below a given threshold. This method explores different execution orders of hot and cold tasks to keep the temperature under control. Yuan et al. [14] proposed a runtime PG algorithm, called TALK, to minimize temperature-induced leakage energy without a performance penalty. This method performs PG when the system temperature is too high according to the ratio of remaining execution cycles to time-to-deadline. Srinivasan and Adve [32] proposed a predictive frame-based DTM algorithm that adapts the architectural configurations (the issue queue and register file sizes as well as the number of active arithmetic logic units) and operating frequency to the frame type (I, P, or B)-dependent characteristics of average IPC and power consumption.

1 For simplicity, throughout this paper, we will use the term "power gating (PG)" instead of "DTM with power/clock gating."

In [15], Liu et al. proposed a design-time temperature-aware DVFS technique through static temperature analysis. They formulated the problem of minimizing peak temperature as a nonlinear programming problem, and aimed at reducing the system energy consumption under the peak temperature constraint. In [9], Bao et al. proposed a DVFS technique for temperature-aware energy minimization based on both static and dynamic temperature analysis. In [16], Yuan and Qu proposed design-time and runtime solutions to minimize system energy consumption while suppressing tasks that cannot be completed by DVFS due to system overheat.

Runtime distribution is exploited by several DVFS methods [10], [17]–[19]. In [17] and [18], the authors solved the distribution-aware DVFS problem by analytical approaches. In [19], Lorch and Smith proposed the processor acceleration to conserve energy (PACE) algorithm, which proactively increases the clock frequency as the task execution progresses to take advantage of runtime distribution. However, these works have not utilized the option of cooling to reduce the temperature-dependent leakage power consumption during task execution. In [10], Zhang and Chatha solved a stochastic thermal-aware DVFS problem in order to keep the expected latency within the designer-specified level, subject to the condition that the probability of the peak temperature exceeding a given value is sufficiently small. In [33], the authors proposed a group of pictures (GOP)-level DVFS algorithm for MPEG-2 decoding, which exploits the variation of frame decoding times within a GOP. The slack obtained during the frame decoding of the previous GOP is exploited to lower the operating frequency of frame decoding of the next GOP, thereby lowering the temperature.

Existing slack reclamation-based DVFS methods are targeted at energy reduction [9], [16]–[19], and apply power/clock gating only when the task execution finishes and idle time remains until the deadline. In contrast, our method is applicable to hard real-time systems and exploits the slack, i.e., performs slack reclamation, to switch between PG and DVFS according to the temperature. That is, our method applies power/clock gating during task execution in order to lower the operating temperature. Compared with existing slack reclamation-based DTM methods [14], [15], [29]–[31], which assume that the slack is fixed, our method dynamically switches between power gating and DVFS according to the temperature level. We found that adjusting the overall slack size and distributing it over the allowed time interval results in further energy reduction as well as lower temperature. For instance, at high temperature, our method increases the slack size to obtain more idle slots (for further cooling) by running the processor at a higher frequency during the active slots.

Unlike previous works [9], [14], [16]–[19], which try to reduce the energy consumption by either DVFS or PG, often without considering runtime distribution, our method provides an integration of PG and DVFS to minimize the total energy consumption considering both the temperature and runtime distribution.



Fig. 1. (a) Worst-case execution cycles of each task instance in a period are divided into three bins. (b) Cumulative distribution function for each bin j, where CDF(j + 1) − CDF(j) denotes the probability of the task being finished with bin j.

III. Motivation

Fig. 1 shows an example of executing a periodic task with runtime distribution, where the worst-case execution cycle (WCEC) for a task period is divided into three bins as shown in [19]. The task has one execution instance (shown as "bin executed" at the top of Fig. 1) during each period of 1.2 s. In Fig. 1, the four instances require 200, 200, 400, and 600 million cycles, respectively. We can compute the execution probability of each bin with respect to the total number of instances. The first bin has a probability of 100% (= 4/4) because it is executed in every instance (there are four instances in Fig. 1). The second and third bins have probabilities of 50% (= 2/4) and 25% (= 1/4), respectively. Therefore, the probabilities of a task instance finishing with the execution of the first, second, and third bin are 50% (= 100 − 50), 25% (= 50 − 25), and the remaining 25%, respectively. The cumulative distribution function (CDF) of each bin is shown in Fig. 1(b).
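The bin probabilities of this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `bin_probabilities` and the assumption of three equal bins of 200 million cycles (a 600-million-cycle WCEC split three ways) are ours, not taken from the paper.

```python
from math import ceil

def bin_probabilities(instance_cycles, wcec, num_bins):
    """Return (execution probability, finishing probability) for each bin."""
    bin_size = ceil(wcec / num_bins)              # cycles per bin (assumed equal-sized)
    n = len(instance_cycles)
    # Number of bins each observed instance has to execute before it completes.
    bins_needed = [ceil(c / bin_size) for c in instance_cycles]
    # Bin j is executed by every instance that needs at least j bins.
    p_exec = [sum(b >= j for b in bins_needed) / n for j in range(1, num_bins + 1)]
    # An instance finishes in bin j if it needs exactly j bins.
    p_finish = [sum(b == j for b in bins_needed) / n for j in range(1, num_bins + 1)]
    return p_exec, p_finish

# The four instances of Fig. 1: 200, 200, 400, and 600 million cycles.
cycles = [200_000_000, 200_000_000, 400_000_000, 600_000_000]
p_exec, p_finish = bin_probabilities(cycles, wcec=600_000_000, num_bins=3)
print(p_exec)    # [1.0, 0.5, 0.25]
print(p_finish)  # [0.5, 0.25, 0.25]
```

The execution probabilities are non-increasing in the bin index by construction, which is the property the later slack allocation relies on.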

Let us assume that, given an initial temperature condition, the PG and voltage/frequency setting need to be determined for this task. The system has M discrete frequency levels (each of which has a corresponding voltage level) and two operation states: active and sleep. In the active state, the system executes the task at one of the M frequency levels, consuming both switching and leakage power. In the sleep state, no task is executed and the processor consumes only leakage power through PG. When the processor switches from the sleep to the active state (or vice versa), additional time and energy overhead, called "power state transition overhead," is incurred.

Fig. 2(a) shows how the operation states are determined by a traditional temperature-aware DVFS method, which assigns a single frequency level in proportion to the ratio of WCEC to time-to-deadline when that level is higher than the critical speed.² In this case, if the task execution finishes earlier than the deadline, the processor is simply power-gated. In this example, we assume that the frequency level is set to the critical speed, 600 MHz. Conventional DVFS in Fig. 2(a) yields an average temperature of 78.36 °C and an energy consumption of 20.93 J.³ Note that the temperature in Fig. 2 represents the average temperature of the active state.

2 Critical speed is the frequency such that no slower frequency setting can give further energy reduction due to the increasing leakage power at low frequency.

Fig. 2. Motivational example showing the reduction of energy consumption of the proposed method by exploiting the runtime distribution and applying integrated DVFS and PG. (a) Frequency scaling schedule according to conventional DVFS. (b) Integration of DVFS and PG when the frequency scaling schedule is the same as in (a). (c) Our result with WCEC-based integration of DVFS and PG. (d) Our result with integration of DVFS and PG exploiting statistical information.

In order to reduce the temperature during the active state, the method called TALK [14] can be applied such that the sleep state is inserted during the active state, as Fig. 2(b) shows. The average temperature of the active state is reduced from 78.36 °C to 76.08 °C.

The basic idea of our method is that, when leakage power consumption dominates total power consumption, i.e., at high temperature, we first allocate a share of the given slack to the sleep state in proportion to the temperature and then place the sleep states during task execution, instead of simply applying PG only after task execution [Fig. 2(a)] or during task execution [Fig. 2(b)]. This is because reducing the temperature and, thus, leakage power consumption by cooling can be more urgent than running the task when the temperature is very high.

3 The temperature and energy consumption are calculated with the temperature-aware power estimation method and our processor power/thermal model, to be explained in Sections IV and IX.



Lengthening the active state, i.e., running at a lower frequency, obviously yields less heat generation and a smaller temperature increase because less dynamic power is consumed during the active period. However, lengthening the idle state also helps lower the temperature. As the sum of the active and idle state durations is fixed, we need to find the optimal mix between the two that yields the lowest average temperature. At high temperature, it is sometimes beneficial to borrow time budget from the active states (by running at a higher frequency) to allow more idle states for temperature drop.

Fig. 2(c) shows how the integration of PG and DVFS lowers the temperature and the energy consumption. Such an integration is enabled by borrowing time budget from the active states. First, a portion of the whole time budget is assigned to the sleep states in proportion to the temperature. Then, a higher frequency is assigned to the active states to compensate for the reduced time budget and meet the deadline. Fig. 2(c) shows that such an integration of PG and DVFS yields an additional 16.4% reduction in energy consumption, mostly by reducing the temperature-induced leakage power consumption during active states. Compared with TALK [14] in Fig. 2(b), where only the critical speed is used, in Fig. 2(c) slack is assigned to sleep states in proportion to temperature, and frequency levels higher than the critical speed are also used for active states.⁴

In Fig. 2(c), the worst-case execution time is assumed in applying PG and DVFS. However, by exploiting the runtime distribution, we obtain a further reduction in energy consumption, as Fig. 2(d) shows. The key idea is to assign slack budget to the lower index bins in proportion to their probability of occurrence. Thus, the higher their probability, the more slack budget is assigned to the lower index bins. By utilizing more sleep states in the lower index bins, which have a high execution probability, we allow more cooling of the processor, which leads to a reduction in the leakage power consumption in the lower index bins. The increased slack budget assigned to the lower index bins must then be borrowed from the higher index bins. Thus, the power consumption of the higher index bins will increase. However, the total energy consumption is reduced at high temperature because the eventual energy consumption is the sum, over all bins, of the products of the energy consumed per bin and the associated bin probability, and the lower index bins have a higher bin probability than the higher index bins. The overall energy consumption can therefore be reduced by assigning more slack budget to the lower index bins than to the higher index bins, which results in a lower temperature in the lower index bins. In our method, the operating frequency and the amount of sleep states are analytically determined considering temperature and bin probabilities (see Section VII).
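The probability-weighted sum described above can be illustrated with a small sketch. The per-bin energy values below are invented purely for illustration (they are not results from the paper); the execution probabilities are those of the Fig. 1 example.

```python
def expected_energy(energy_per_bin, p_exec):
    """Sum over bins of (energy consumed in the bin) x (probability the bin is executed)."""
    return sum(e * p for e, p in zip(energy_per_bin, p_exec))

p_exec = [1.0, 0.5, 0.25]      # execution probabilities of bins 1..3 (Fig. 1 example)

# Hypothetical joules per bin, NOT measured data:
uniform = [7.0, 7.0, 7.0]      # slack spread evenly over the bins
skewed = [6.0, 7.0, 9.0]       # more cooling slack in bin 1, borrowed from bin 3

print(expected_energy(uniform, p_exec))  # 12.25
print(expected_energy(skewed, p_exec))   # 11.75
print(sum(uniform), sum(skewed))         # worst-case totals: 21.0 22.0
```

In this made-up case the skewed allocation costs more when all three bins run (22 J versus 21 J) but less in expectation, because the expensive third bin is executed only a quarter of the time.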

4 Note that even in the case that there is no slack in conventional DVFS, e.g., $f_{crit} \le 0.5$ GHz in the example of Fig. 2(a), our method can still give energy reduction by creating additional slack, which is borrowed from the active state, and uniformly inserting idle states as shown in Fig. 2(c).

IV. Preliminaries: Temperature-Aware Power Estimation

Fig. 3. Temperature-aware energy calculation of a task at a fixed clock frequency ($f$) given a time period $[t_s, t_f]$. (a) Task execution time is divided into time intervals $\Delta t$, each small enough to consider the temperature during $\Delta t$ to be constant. (b) Temperature-aware energy consumption can be calculated by (1).

Fig. 4. Problem definition to minimize total energy consumption.

System power consumption can vary during task execution even while the operating frequency and the switching power consumption are constant, because the temperature varies with time (due to the switching power consumption), thereby changing the leakage power consumption. However, because the thermal time constant of the system is usually on the order of milliseconds, it is not necessary to update the temperature every clock cycle for temperature-aware power estimation. For temperature estimation, we use a distributed thermal RC model using HotSpot [11]. During power estimation, we update the temperature, obtained from the thermal model, at each time step $\Delta t$ (= 0.1 ms in our experiment). The time interval $\Delta t$ is sufficiently small that we may consider the temperature during $\Delta t$ to be constant. Fig. 3 illustrates how to calculate energy consumption in a temperature-aware manner. Given a time period $[t_s, t_f]$, where $t_f - t_s = K \cdot \Delta t$, the temperature-aware energy consumption can be calculated by summing the energy consumption of all $K$ intervals as follows:

$$E_{active} = \sum_{k=1}^{K} \left( P_s(f) + P_l(f, T_k) \right) \cdot \Delta t \tag{1}$$

where $P_s$ and $P_l$ denote switching power and leakage power consumption, respectively. We assume that the clock frequency is proportional to $V_{dd}$, i.e., $f \propto V_{dd}$.⁵ In summary, switching power consumption is a function of the clock frequency ($f$), while leakage power consumption is a function of the clock frequency and the temperature ($T_k$). Details of our power ($P_s$ and $P_l$) and thermal ($T_k$) models are explained in Section IX.
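The summation in (1) can be sketched in Python as follows. The two power functions are placeholder models chosen only so the example runs (a cubic switching term and an exponential leakage term); they are assumptions, not the power model of Section IX, and in practice the temperature samples would come from a thermal simulator such as HotSpot.

```python
import math

def switching_power(f_ghz):
    # Placeholder model (assumption): P_s ~ f * Vdd^2 with Vdd proportional to f.
    return 10.0 * f_ghz ** 3

def leakage_power(f_ghz, temp_c):
    # Placeholder model (assumption): leakage grows exponentially with temperature.
    return 2.0 * f_ghz * math.exp(0.02 * (temp_c - 45.0))

def active_energy(f_ghz, temps_c, dt):
    """Eq. (1): sum of (P_s(f) + P_l(f, T_k)) * dt over the K temperature samples."""
    return sum((switching_power(f_ghz) + leakage_power(f_ghz, t)) * dt for t in temps_c)

dt = 1e-4                                        # 0.1 ms time step
temps = [60.0 + 0.5 * k for k in range(20)]      # hypothetical trace, K = 20 samples
e_cool = active_energy(0.6, temps, dt)
e_hot = active_energy(0.6, [t + 20.0 for t in temps], dt)
print(e_cool, e_hot)                             # the hotter trace costs more energy
```

At a fixed frequency the switching term is identical in both runs, so the whole difference between the two results comes from the temperature-dependent leakage term.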

5 This is because the upper limit on the clock frequency is determined by the inverse of the propagation delay of gate elements, which is roughly proportional to $V_{dd}$.



V. Problem Definition

In real environments, a processor changes operation states and/or frequency levels on a discrete time basis. The length of the discrete time period, $l$, is mostly determined by the scheduler in the operating system. Within $l$, which is set at 2 ms in our experiment, the operation state and frequency level are fixed. The time interval $[0, D]$, where $D$ denotes the task deadline, is divided into $N$ slots of duration $l$ each, as shown in Fig. 4. A slot where the processor is in the active (sleep) state is called an active (sleep) slot. In an active slot, the power consumption is the sum of $P_s$ and $P_l$. In a sleep slot, only a small amount of leakage power, $P_{sleep}$ ($\ll P_l$), is consumed via PG. The processor energy consumption, $E_{total}$, can be calculated by the following equation:

$$E_{total} = \sum_{i=1}^{N} \left( E_{active}[i] \cdot x[i] + E_{sleep} \cdot (1 - x[i]) \right) \tag{2}$$

where

$$E_{active}[i] = P_s(f[i]) \cdot l + \sum_{k=1}^{L} P_l(f[i], T_k) \cdot \Delta t \tag{3}$$

and

$$E_{sleep} = P_{sleep} \cdot l \tag{4}$$

where $l = L \cdot \Delta t$ ($L = 20 \gg 1$), $\Delta t$ is the temperature update time step, and $f[i]$ is the clock frequency of the $i$th slot. $x[i]$ is 1 if the $i$th slot is an active slot, and 0 if it is a sleep slot.
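As a minimal sketch of (2) and (4), the total energy of a slot schedule can be accumulated as below. The schedule, the per-slot active energies, and the sleep power are hypothetical values chosen for illustration; in the paper's formulation each active-slot energy would come from (3).

```python
def total_energy(active_energies, x, p_sleep, slot_len):
    """Eq. (2): sum over slots of E_active[i]*x[i] + E_sleep*(1 - x[i])."""
    e_sleep = p_sleep * slot_len                  # Eq. (4): energy of one sleep slot
    return sum(ea * xi + e_sleep * (1 - xi) for ea, xi in zip(active_energies, x))

x = [1, 0, 1, 1, 0]            # hypothetical schedule: 1 = active slot, 0 = sleep slot
e_active = [0.04] * 5          # hypothetical E_active[i] in joules for each slot
e_total = total_energy(e_active, x, p_sleep=0.5, slot_len=2e-3)
print(e_total)                 # three active slots plus two sleep slots
```

Because `x[i]` is binary, each slot contributes exactly one of the two terms, so a sleep slot's entry in `active_energies` is ignored.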

We define the problem of energy minimization as follows. Given an initial temperature ($T_{init}$), a task with a deadline $D$, and its runtime distribution (including the information of WCEC), and assuming that the interval $[0, D]$ is divided into $N$ slots and the WCEC is divided into $B$ bins,⁶ the problem is to find $f[i]$ and $x[i]$ such that the total energy consumption, $E_{total}$ in (2), is minimized. The problem complexity is $O((M+1)^N)$ because each slot can be at one of the $M$ frequency levels or in the sleep state. The complexity can be extremely high because $M \sim 20$ and $N \sim 500$ even for a deadline of 1 s ($l$ = 2 ms).
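To give a feel for this search space, the number of candidate schedules for the quoted figures can be computed exactly with Python's arbitrary-precision integers. This is plain arithmetic on the values stated above, not part of the proposed method.

```python
M = 20                          # frequency levels
deadline_ms, slot_ms = 1000, 2
N = deadline_ms // slot_ms      # 500 slots for a 1 s deadline with 2 ms slots

candidates = (M + 1) ** N       # each slot: one of M frequencies or the sleep state
digits = len(str(candidates))
print(N, digits)                # 500 662
```

The count has 662 decimal digits, i.e., it is on the order of $10^{661}$, which rules out exhaustive enumeration.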

In the remainder of this paper, we present our solution to the problem as follows. In Section VI, we explain the temperature-aware integration of DVFS and PG, assuming a single distinct execution cycle for a task. In Section VII, we incorporate runtime distribution into this integration of DVFS and PG to present a complete design-time solution. Finally, in Section VIII, we explain how to apply the design-time solution at runtime.

6 Each bin consists of $C_{bin} = \lceil WCEC/B \rceil$ cycles.

VI. Temperature-Aware Integrated DVFS and Power Gating

In this section, assuming a pre-determined execution cycle and a given deadline, we explain how to determine the operating frequency and sleep slots (Section VI-A) and the locations of sleep slots (Section VI-B) in a temperature-aware manner.

Fig. 5. Total energy consumption (= switching + leakage energy) per cycle as a function of clock frequency.

A. Energy-Optimal Frequency Decision

Given the execution cycle and deadline, we first determine the frequency that minimizes the total energy consumption during active states. Fig. 5 shows the total energy consumption per cycle, $E_{cycle}$, of the processor used in our experiments.⁷ $E_{cycle}$ is a function of frequency and temperature. To obtain $E_{cycle}(f)$ at each frequency $f$, we simulate the task execution on the power/thermal model until the temperature reaches steady state, assuming that the deadline is sufficiently larger than the thermal time constant of the processor.

Fig. 5 shows that the curve of Ecycle is convex. Thus, the<br />

frequency for minimal energy consumption, called the optimal<br />

frequency, fopt, is obtained where the slope of the curve is<br />

zero. We apply a root-finding algorithm such as the bisection<br />

method [28] to find the optimal frequency.<br />
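As an illustration only, the root finding on the slope of a convex Ecycle curve can be sketched in Python as follows. The function names, the finite-difference step, and the toy energy curve are assumptions made for this sketch; they are not the power model used in the experiments.

```python
def bisect_min(e_cycle, f_lo, f_hi, tol=1e-3, df=1e-4):
    """Locate the minimizer of a convex function by bisecting on its slope."""

    def slope(f):
        # Central-difference estimate of dEcycle/df (two Ecycle evaluations).
        return (e_cycle(f + df) - e_cycle(f - df)) / (2.0 * df)

    if slope(f_lo) >= 0.0:       # curve already rising: minimum at the lower bound
        return f_lo
    if slope(f_hi) <= 0.0:       # curve still falling: minimum at the upper bound
        return f_hi
    while f_hi - f_lo > tol:
        f_mid = 0.5 * (f_lo + f_hi)
        if slope(f_mid) < 0.0:
            f_lo = f_mid         # zero slope lies to the right of f_mid
        else:
            f_hi = f_mid         # zero slope lies at or to the left of f_mid
    return 0.5 * (f_lo + f_hi)


def toy_e_cycle(f_ghz):
    # Hypothetical convex curve (arbitrary units): a term that grows with
    # frequency plus a term that grows as the frequency drops.
    return f_ghz ** 2 + 4.0 / f_ghz


f_opt = bisect_min(toy_e_cycle, 0.6, 3.0)
```

For this toy curve the slope vanishes at the cube root of 2 (about 1.26 GHz), which the search recovers to within the chosen tolerance.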

Fig. 6 shows the flow of energy, power, and temperature<br />

estimation. The bisection method shown in Fig. 6(a) performs,<br />

at each intermediate frequency, four estimations of<br />

Ecycle to obtain the gradients of Ecycle at fleft and fmid.<br />

The estimation of Ecycle consists of average power estimation<br />

and steady state temperature estimation. The steady state<br />

temperature can be estimated from the product of the average<br />

power and the thermal resistance extracted from the HotSpot<br />

simulator. As Fig. 6(b) shows, the estimation of Ecycle needs<br />

iterations of power and temperature estimation because of<br />

the dependency of leakage power on temperature. In our<br />

experiments, five iterations are usually sufficient for<br />

both the estimated temperature and power<br />

consumption to converge to within 0.1%. The bisection<br />

method used to calculate fopt is a general binary search<br />

algorithm and the computational complexity is O(log2 NF ),<br />

where NF is the number of possible frequency values (NF =<br />

240 in our experiments). We measured the computational time<br />

of the root-finding algorithm [Fig. 6(a)] on the experimental<br />

platform, an LG xnote LW25 laptop, running at 2 GHz. The<br />

runtime overhead is less than 1 ms.<br />
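The power/temperature iteration of Fig. 6(b) can be sketched as a fixed-point loop. This is a minimal sketch under stated assumptions: the exponential leakage form, the thermal resistance, the leakage reference values, and the addition of the ambient temperature to the power-resistance product are all hypothetical choices for illustration. Only Cs = 7.942 nF and Tamb = 50 °C are taken from Section IX.

```python
import math


def estimate_e_cycle(f_ghz, vdd, c_eff_nf=7.942, r_th=0.5, t_amb=50.0,
                     leak_ref_w=5.0, t_ref=50.0, k_leak=0.02,
                     rel_tol=1e-3, max_iter=50):
    """Iterate power -> temperature -> leakage until both settle."""
    p_switch = c_eff_nf * vdd ** 2 * f_ghz        # nF * V^2 * GHz = W
    temp = t_amb
    p_total = p_switch
    for _ in range(max_iter):
        # Assumed leakage model: exponential in temperature (hypothetical).
        p_leak = leak_ref_w * math.exp(k_leak * (temp - t_ref))
        new_p_total = p_switch + p_leak
        # Steady-state temperature: ambient plus power times thermal resistance.
        new_temp = t_amb + r_th * new_p_total
        converged = (abs(new_temp - temp) <= rel_tol * abs(new_temp) and
                     abs(new_p_total - p_total) <= rel_tol * abs(new_p_total))
        temp, p_total = new_temp, new_p_total
        if converged:
            break
    e_cycle_nj = p_total / f_ghz                  # W / GHz = nJ per cycle
    return e_cycle_nj, temp, p_total


e_cycle, t_ss, p_avg = estimate_e_cycle(f_ghz=2.0, vdd=1.0)
```

With these hypothetical parameters the loop settles after a handful of iterations, in line with the observation above that a few iterations usually suffice.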

7 Our processor power model is derived from the Intel Penryn processor<br />

[21]. Details of the power model are given in Section IX.



Fig. 6. Algorithm flow of energy-optimal frequency decision. (a) Flow of<br />

bisection method. (b) Flow of Ecycle estimation.<br />

Fig. 7. Temperature-aware sleep slot placement performed in two steps.<br />

(a) Calculation of the numbers of active and sleep slots, Na and Ns.<br />

(b) Determining the sleep slot location using the threshold temperature, Tthr.<br />

The optimal frequency, fopt, is assigned to active slots. 8<br />

After determining the optimal frequency, fopt, the number<br />

of active slots, Na, and that of sleep slots, Ns, are easily<br />

calculated by (5) and (6). Fig. 7(a) illustrates the Na active and<br />

Ns sleep slots determined by the optimal frequency, fopt<br />

$$N_a = \left\lceil \frac{w / f_{opt}}{l} \right\rceil \qquad (5)$$

$$N_s = N - N_a = N - \left\lceil \frac{w / f_{opt}}{l} \right\rceil \qquad (6)$$

where w is the number of execution cycles of the task.<br />

8 In the case of discrete frequencies, the nearest discrete frequency not lower<br />

than fopt is assigned to active states.<br />

Fig. 8. Temperature and leakage energy according to the number of consecutive<br />

active slots (Na). (a) Temperature. (b) Leakage energy.<br />

Fig. 9. Two examples of idle slot distribution for a task which is running<br />

at a fixed clock frequency. (a) Idle slots are uniformly distributed during task<br />

execution. (b) Idle slots are non-uniformly distributed during task execution.<br />

B. Sleep Slot Location<br />

The position of each sleep slot affects the operating temperature<br />

of ensuing active slots, and thus leakage power<br />

consumption in active states. Determining the locations of<br />

active and sleep slots is a permutation problem. The number of<br />

permutations is N!/(Na!·Ns!). Considering that N (= Na + Ns)<br />

can easily be as large as 500 (e.g., with 1 s time-to-deadline<br />

and 2 ms time slots), the complexity is huge.<br />
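To make the size of this search space concrete, the slot counts of (5) and (6) and the number of placements N!/(Na!·Ns!) can be evaluated with a few lines of Python. The sketch uses integer units (microseconds and MHz) to avoid rounding at slot boundaries. The 1 GHz operating frequency is a hypothetical value chosen for illustration, and the cycle count is the H.264 WCEC reported in Section IX.

```python
import math


def slot_counts(w_cycles, f_opt_mhz, slot_us, deadline_us):
    """Return (N, Na, Ns): total, active, and sleep slot counts."""
    n_total = deadline_us // slot_us
    cycles_per_slot = f_opt_mhz * slot_us          # MHz * us = cycles
    n_active = -(-w_cycles // cycles_per_slot)     # integer ceiling division
    n_sleep = n_total - n_active
    if n_sleep < 0:
        raise ValueError("deadline cannot be met at this frequency")
    return n_total, n_active, n_sleep


# 6.0E+8 cycles at a hypothetical 1 GHz, 2 ms slots, 1 s time-to-deadline.
n, na, ns = slot_counts(600_000_000, 1000, 2000, 1_000_000)
placements = math.comb(n, na)                      # equals N! / (Na! * Ns!)
```

Here N = 500, Na = 300, and Ns = 200, and the number of candidate placements has well over one hundred decimal digits, which rules out exhaustive enumeration.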

We solved the problem of locating sleep slots in two steps.<br />

First, we prove (Lemma 1) that a uniform distribution of sleep<br />

slots can give the optimal leakage energy consumption. Then,<br />

we present a method of determining sleep slot locations during<br />

runtime.<br />

Lemma 1: Given Na active slots and Ns sleep slots for a<br />

task which is running at a fixed clock frequency during active<br />

slots, the energy consumption is minimal when idle slots are<br />

evenly distributed across the task execution.<br />

Proof : Fig. 8(a) shows the temperature with respect to the<br />

number of consecutive active slots (Na). In Fig. 8(a), larger Na<br />

leads to higher operating temperature. Since leakage power is<br />

exponentially dependent on the temperature of active slots, the<br />

leakage energy with respect to Na, E(Na), is a convex function<br />

as shown in Fig. 8(b). Assume that idle slots are uniformly<br />

distributed as shown in Fig. 9(a). In this case, the same number<br />

of consecutive active slots (Nopt) is located between sleep<br />

slots and the leakage energy consumed by Nopt consecutive<br />

active slots in a period is E(Nopt) as shown in Fig. 8(b).<br />

Now we assume that idle slots are non-uniformly distributed<br />

as shown in Fig. 9(b), where the numbers of consecutive active<br />

slots of period i and period i + 1 are Nopt + x and Nopt −<br />

x, respectively. The sum of leakage energy of period i and<br />

period i + 1 is E(Nopt + x) + E(Nopt − x). Since E(Na) is a<br />

convex function, E(Nopt + x) + E(Nopt − x) ≥ 2 · E(Nopt). The<br />

minimal energy consumption is achieved when equality holds.<br />

The equality holds only when E(Nopt + x) and E(Nopt − x)<br />

are the same, i.e., x = 0. Thus, the energy consumption is



minimal when idle slots are uniformly distributed across the<br />

task execution.<br />
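The effect described in Lemma 1 can be reproduced numerically with a toy thermal model. The sketch below is not the HotSpot-based model of this paper. It assumes a first-order thermal response per slot and a leakage power that grows exponentially with temperature, and all parameter values are hypothetical. It compares a uniform placement (case 3 of footnote 9) with a placement that leaves all sleep slots at the end.

```python
import math


def leakage_energy(schedule, t_amb=50.0, t_active_ss=100.0, alpha=0.3,
                   leak_ref=1.0, k_leak=0.03):
    """Sum the leakage energy over active slots for a string of 'A'/'S' slots."""
    temp = t_amb
    energy = 0.0
    for slot in schedule:
        # Active slots heat toward t_active_ss; sleep slots cool toward ambient.
        target = t_active_ss if slot == "A" else t_amb
        temp += alpha * (target - temp)
        if slot == "A":
            # Assumed exponential temperature dependence of leakage power.
            energy += leak_ref * math.exp(k_leak * (temp - t_amb))
    return energy


uniform = "AAASAAASAAS"      # 8 active and 3 sleep slots, evenly spread
clustered = "AAAAAAAASSS"    # same counts, all sleep slots at the end

e_uniform = leakage_energy(uniform)
e_clustered = leakage_energy(clustered)
```

In this toy model the uniform placement consumes less leakage energy, because every active slot after the first sleep slot starts from a lower temperature than its counterpart in the clustered schedule.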

Lemma 1 shows that a uniform location of idle slots can give<br />

an optimal leakage energy consumption. However, there can be<br />

several possible uniform locations. For instance, if there<br />

are 8 active slots and 3 idle slots, we can have three different<br />

cases of uniform location. 9 In order to simplify the solution<br />

while maintaining the same uniform idle slot insertions, we<br />

present a heuristic solution. Fig. 7(b) illustrates the solution.<br />

If the current temperature exceeds a threshold temperature Tthr,<br />

we insert a sleep slot, ideally between active slots, to lower<br />

the temperature. Tthr needs to be<br />

determined to minimize the maximal operating temperature.<br />

For instance, if Tthr is set too high, too few sleep slots are<br />

inserted between active slots. In such a case, most sleep slots<br />

may be used only after all the active slots, thereby failing<br />

to contribute to lowering the temperature during the task<br />

execution. 10 If Tthr is set too low, most sleep slots are inserted<br />

at the beginning of the task, causing the temperature to rise<br />

later in the remaining active slots after all available sleep slots<br />

are exhausted. To find the optimal Tthr which distributes sleep<br />

slots evenly between active slots, we perform a binary search<br />

between Tamb and Tmax. We set Tmax to 105 °C, which<br />

is the allowed maximum junction temperature of our processor<br />

model specified in [23].<br />
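The threshold heuristic and the search for Tthr can be sketched as follows. This is a minimal sketch under stated assumptions: the first-order thermal model and its parameters are hypothetical, and the search criterion (lower Tthr when sleep slots are left over at the end, raise it otherwise, and keep the candidate with the lowest peak temperature) is one plausible reading of the description above, not a confirmed detail of the original implementation.

```python
def place_sleep_slots(n_active, n_sleep, t_thr, t_amb=50.0,
                      t_active_ss=110.0, alpha=0.1):
    """Greedy placement: sleep whenever the temperature exceeds t_thr."""
    temp, peak = t_amb, t_amb
    active_left, sleep_left = n_active, n_sleep
    schedule = []
    while active_left > 0 or sleep_left > 0:
        too_hot = temp > t_thr
        if sleep_left > 0 and (too_hot or active_left == 0):
            schedule.append("S")
            sleep_left -= 1
            target = t_amb            # gated core cools toward ambient
        else:
            schedule.append("A")
            active_left -= 1
            target = t_active_ss      # active core heats toward steady state
        temp += alpha * (target - temp)
        peak = max(peak, temp)
    return "".join(schedule), peak


def find_threshold(n_active, n_sleep, t_amb=50.0, t_max=105.0, iters=30):
    """Binary search over [t_amb, t_max]; returns (t_thr, schedule, peak)."""
    lo, hi = t_amb, t_max
    best = None
    for _ in range(iters):
        t_thr = 0.5 * (lo + hi)
        schedule, peak = place_sleep_slots(n_active, n_sleep, t_thr, t_amb=t_amb)
        if best is None or peak < best[2]:
            best = (t_thr, schedule, peak)
        if schedule.endswith("S"):
            hi = t_thr    # sleep slots left over at the end: threshold too high
        else:
            lo = t_thr    # sleep slots used up early: try a higher threshold
    return best


t_thr, schedule, peak = find_threshold(n_active=8, n_sleep=3)
```

Passing an infinite threshold to `place_sleep_slots` reproduces the case in which all sleep slots follow the active slots, which gives a convenient baseline for the peak temperature.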

VII. Temperature and Bin Probability-Aware<br />

Statistical Integrated DVFS and Power Gating<br />

The basic idea of incorporating the statistical information,<br />

i.e., runtime distribution, into the integrated DVFS and PG<br />

is to allocate time budget (as given by the time-to-deadline)<br />

to lower index bins in proportion to their bin probabilities.<br />

We consider a bin as a virtual sub-task and the time budget<br />

of the bin as the time-to-deadline of the virtual sub-task.<br />

Thus, allocating more time budget to a low-index bin corresponds<br />

to assigning a longer time-to-deadline to the bin. Fig. 10<br />

illustrates, assuming a fixed execution cycle of a (virtual) task,<br />

the energy consumption per cycle vs. time-to-deadline (i.e., the<br />

given time budget) at the minimal-energy frequency, fopt (explained<br />

in Section VI). Increasing the time-to-deadline (i.e., more<br />

time budget) allows operation at a lower optimal frequency<br />

(fopt), thereby decreasing energy consumption by<br />

lowering the supply voltage. Note that the increased time-to-deadline<br />

reduces leakage power consumption as well as<br />

switching power consumption by lowering the supply voltage.<br />

This is because the reduced switching power consumption gives<br />

a smaller temperature increase and thereby less leakage power consumption<br />

than the case when the time-to-deadline is not extended.<br />

For our problem formulation (to be given in this section),<br />

we approximate Ecycle(fj) in Fig. 10 with a function,<br />

$a \cdot S^{-b} + c$, where S is the given time-to-deadline (i.e., time<br />

budget) and a, b, and c are fitting parameters depending on<br />

the target system conditions such as processor, heat sink, and<br />

ambient temperature.<br />

9 The three cases are: 1) A, A, S, A, A, A, S, A, A, A, S; 2) A, A, A, S,<br />

A, A, S, A, A, A, S; and 3) A, A, A, S, A, A, A, S, A, A, S, where A and<br />

S represent active and sleep slots.<br />

10 In a multitask system, the sleep slots of previously executed tasks<br />

can affect the initial operating temperature, i.e., the power consumption of<br />

following tasks. In this paper, however, we focus on the solution to the single<br />

task problem. The extension of this paper to multitask systems will be our<br />

future work.<br />

Fig. 10. Total energy consumption per cycle vs. time-to-deadline.<br />

Our strategy is to reduce the energy consumption of lower<br />

index bins by extending their time-to-deadlines in this way. As<br />

mentioned in Section III, the increased time budget for lower<br />

index bins is borrowed from higher index bins, which increases<br />

the operating frequency and power consumption of higher<br />

index bins. However, since the bin probability of a lower index<br />

bin is higher than that of a higher index bin, we can eventually<br />

obtain a reduction in total energy consumption if the energy<br />

reduction of lower index bins multiplied by their probability<br />

is larger than the probability-weighted energy increase of higher index bins. We formulate this<br />

problem of time budgeting as follows:<br />

Find $S_j$ for all $j = 1, 2, \ldots, B$,<br />

where $S_j$ is the time budget allocated to bin j<br />

and B is the number of bins,<br />

such that<br />

$$\mathrm{Cost} = \sum_{j=1}^{B} p_j \cdot E_{cycle}(f_j) = \sum_{j=1}^{B} p_j \cdot \left(a \cdot S_j^{-b} + c\right) \qquad (7)$$

is minimized, subject to<br />

$$\sum_{j=1}^{B} S_j \le D \qquad (8)$$

where D is the time-to-deadline.<br />

In (7), pj is the bin probability. Thus, pj · Ecycle(fj) represents<br />

the expected energy consumption per cycle as contributed by<br />

bin j. Note that each bin has the same number of execution<br />

cycles. Thus, if the execution cycle is taken into account,<br />

then the function Cost in (7) becomes the expectation of the<br />

total energy consumption of the task. Because a and c are<br />

constants in (7), the function Cost in (7) can be replaced by



Cost* as follows:<br />

$$\mathrm{Cost}^{*} = \sum_{j=1}^{B} p_j \cdot S_j^{-b}. \qquad (9)$$

Note that, as shown in Fig. 10, Ecycle(fj) is monotonically<br />

decreasing as the time budget increases. b must be positive<br />

for $S_j^{-b}$ to be convex in (9). We can apply Jensen’s inequality<br />

[20] 11 to find the condition minimizing (9). To do that, we<br />

rewrite (9) as follows:<br />

$$\mathrm{Cost}^{*} = \sum_{j=1}^{B} \left(p_j^{-1/b} \cdot S_j\right)^{-b}. \qquad (10)$$

According to Jensen’s inequality, Cost* in (10) can be minimized<br />

when $p_j^{-1/b} \cdot S_j$ has the same value for all j’s, i.e.,<br />

$p_j^{-1/b} \cdot S_j = p_{j+1}^{-1/b} \cdot S_{j+1}$ for all j’s. Sj needs to satisfy the<br />

following relation:<br />

$$S_{j+1} = \sqrt[b]{\frac{p_{j+1}}{p_j}} \cdot S_j. \qquad (11)$$

Cost* has a lower bound given as follows:<br />

$$B \cdot \left(\frac{\sum_{j=1}^{B} p_j^{-1/b} \cdot S_j}{B}\right)^{-b}. \qquad (12)$$

In order to calculate Sj using (11), we first need to determine<br />

S1, which is used to calculate the remaining Sj’s by iterative<br />

applications of (11). S1 is obtained by iterative binary subdivision,<br />

i.e., by sweeping S1 values over the interval [0, D] such<br />

that Cost in (7) is minimized while satisfying the constraint<br />

in (8).<br />
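The recurrence (11) can be turned into a short budgeting routine. The sketch below chains (11) from bin 1 and then scales all budgets so that the constraint (8) holds with equality. Using this closed-form scaling instead of the binary subdivision over S1 is an assumption of the sketch, motivated by the fact that every Sj is proportional to S1 and the cost decreases as the budgets grow. The bin probabilities and the fitting parameter b are hypothetical.

```python
def allocate_time_budget(probs, b, deadline):
    """Budgets S_j satisfying S_{j+1} = (p_{j+1} / p_j)^(1/b) * S_j, sum = D."""
    if b <= 0:
        raise ValueError("b must be positive")
    if any(p <= 0 for p in probs):
        raise ValueError("bin probabilities must be positive")
    # Chaining (11) from bin 1 gives S_j = (p_j / p_1)^(1/b) * S_1.
    ratios = [(p / probs[0]) ** (1.0 / b) for p in probs]
    s1 = deadline / sum(ratios)        # scale so that the budgets sum to D
    return [r * s1 for r in ratios]


probs = [0.40, 0.30, 0.15, 0.10, 0.05]   # hypothetical bin probabilities
budgets = allocate_time_budget(probs, b=1.5, deadline=1.0)
```

With decreasing bin probabilities the resulting budgets also decrease with the bin index, i.e., the lower index bins receive the longer time-to-deadlines.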

VIII. Application of Design-Time Solution to<br />

Runtime<br />

The solution presented in Sections VI and VII, given the<br />

initial temperature and runtime distribution, yields a design-time<br />

solution to the problem given in Section V, i.e., the<br />

operating frequencies of active slots, and threshold temperature,<br />

Tthr (which determines the number and locations of<br />

sleep slots). In this section, we propose a runtime solution<br />

obtained by applying the pre-computed design-time solutions<br />

according to the current temperature. We prepare a lookup<br />

table (LUT) storing the pre-computed design-time results<br />

as shown in Fig. 11(a). Each entry in the LUT yields the<br />

operating frequency, f , and the threshold temperature, Tthr, for<br />

each bin according to the initial temperature, Tinit. If there is<br />

no exact match to the initial temperature in the LUT, the nearest<br />

entry whose Tinit is not lower than the current temperature is selected<br />

in order to prevent the operating temperature from violating<br />

the given maximum temperature constraint. For example, in<br />

Fig. 11, assume that the task starts with temperature 62 °C<br />

(≤70 °C). There is no exact match in the LUT of Fig. 11(a).<br />

Thus, the entry corresponding to Tinit = 70 °C is chosen.<br />
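The entry selection rule can be sketched with a sorted table and a binary search. The table contents below (initial temperatures, per-bin frequencies, and thresholds) are hypothetical placeholders, not the values of Fig. 11(a). Only the selection rule, the nearest entry whose Tinit is not lower than the current temperature, follows the text.

```python
import bisect

# Hypothetical LUT: (Tinit in °C, per-bin frequency in GHz, per-bin Tthr in °C),
# sorted by Tinit in ascending order.
LUT = [
    (30.0, [1.2, 1.4, 1.8], [60.0, 65.0, 70.0]),
    (50.0, [1.2, 1.4, 1.9], [62.0, 67.0, 72.0]),
    (70.0, [1.3, 1.5, 2.0], [75.0, 78.0, 82.0]),
    (90.0, [1.3, 1.6, 2.1], [92.0, 94.0, 96.0]),
]


def lookup_entry(lut, t_current):
    """Return the entry with the smallest Tinit that is >= t_current."""
    keys = [t_init for t_init, _, _ in lut]
    idx = bisect.bisect_left(keys, t_current)
    if idx == len(keys):
        raise ValueError("current temperature is above every LUT entry")
    return lut[idx]


t_init, freqs, thresholds = lookup_entry(LUT, 62.0)
```

For a starting temperature of 62 °C this selects the 70 °C entry, matching the example above, and an exact match such as 70 °C returns that same entry.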

11If f (x) is convex on the interval a



Fig. 12. Peak temperature violation check performed at the start of each<br />

slot.<br />

prepare Tss(fi) for each frequency level fi (19 levels in our<br />

experiment).<br />

When the estimated temperature Ti+1 violates the peak<br />

temperature constraint, we adjust the frequency level, fi, to<br />

a level fpt (peak threshold frequency). fpt can be obtained<br />

from the method of calculating fopt in Section VI-A by setting<br />

Tthr to Tmax. In other words, fpt is the maximum frequency<br />

which gives an optimal energy consumption without violating<br />

the peak temperature constraint.<br />

The number of entries in the LUT can affect the energy<br />

reduction obtained during runtime. Experimental results in the<br />

next section will show that only a single entry can support the<br />

temperature range between 30 °C and 90 °C with a negligible<br />

degradation in energy gain.<br />

IX. Experimental Results<br />

A. Experimental Setup<br />

Our experimental results are based on the data collected<br />

from the Penryn [21] processor, which is a dual-core processor with<br />

a unified L2 cache, manufactured in 45-nm high-k metal<br />

gate silicon process technology. The floorplanning and power<br />

decomposition of the processor are obtained from [21] and<br />

shown in Fig. 13. In our power model, we estimated the<br />

switching power based on the relationship between power,<br />

frequency and voltage, i.e., $P_s = C_s \cdot V_{dd}^2 \cdot f$. We estimated<br />

the effective capacitance, Cs, from the power values in the<br />

datasheet [23]. To be specific, we obtained the switching power<br />

value, Ps, from total and leakage power values at the junction<br />

temperature of 105 °C [23] and estimated Cs (Cs = 7.942 nF).<br />

Then, we applied it to the switching power estimation at<br />

different frequency and voltage levels. To account for the<br />

temperature-dependent leakage power consumption, we utilize<br />

a modified version of the leakage power model proposed in [3]. 13<br />

Although our experiments are based on the Penryn processor,<br />

it should be noted that our solution is general and independent<br />

of the processor model. Fig. 14 shows the power values<br />

for various frequency and temperature levels in our power<br />

model. The frequency ranges from 0.6 GHz to 3.0 GHz with<br />

19 discrete levels. The runtime overhead of voltage/frequency<br />

transition is assumed to be 10 µs [34].<br />
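The switching power relation and the extraction of the effective capacitance can be written down directly. In the sketch below only Cs = 7.942 nF comes from the text. The supply voltages and the pairing of voltages with frequencies are hypothetical, since the voltage levels of the model are not listed here.

```python
C_EFF_F = 7.942e-9   # effective capacitance Cs in farads (from the text)


def switching_power_w(vdd_v, f_hz, c_eff_f=C_EFF_F):
    """Ps = Cs * Vdd^2 * f."""
    return c_eff_f * vdd_v ** 2 * f_hz


def effective_capacitance_f(p_switch_w, vdd_v, f_hz):
    """Invert the relation to recover Cs from a known switching power."""
    return p_switch_w / (vdd_v ** 2 * f_hz)


# Hypothetical operating points: (Vdd in V, f in Hz).
p_low = switching_power_w(1.0, 1.0e9)
p_high = switching_power_w(1.2, 3.0e9)
```

At a hypothetical 1.0 V and 1 GHz the relation gives 7.942 W, and scaling to 1.2 V and 3 GHz multiplies this by 1.44 × 3.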

13 The gate leakage power is controlled by the property of high-k dielectric<br />

materials used in our processor model. According to [22], high-k materials can reduce<br />

the gate leakage down to 0.2% and 1.1% for PMOS and NMOS, respectively.<br />

We scaled the gate leakage in [3] to 0.5% in our power estimation as an<br />

approximate average of the two cases.<br />

Fig. 13. (a) Floorplanning and (b) power decomposition of our processor<br />

model.<br />

Fig. 14. Power consumption of our power model according to clock frequency<br />

and operating temperature (40 °C, 60 °C, 80 °C, and 100 °C).<br />

When PG is activated in the processor, the voltage is<br />

lowered only to the point where both the state of core and<br />

cache can be retained. Thus, there is only a small wake-up<br />

time overhead caused by the state recovery including phase-locked<br />

loop turn-on time (15 µs in [23]). When the processor<br />

is turned on, in order to hide the wake-up overhead, the<br />

turn-on operation can be started earlier than the end of<br />

the sleep slot by the wake-up overhead. Thus, the processor becomes ready to<br />

execute when the sleep slot finishes and a new active slot<br />

starts.<br />

According to the product datasheet [23], we used a sleep<br />

state power consumption of 1.9 W, where both core and cache<br />

can retain their states. We set the slot size l to be 2 ms, which<br />

is much longer than the power state transition time overhead<br />

and other typical threshold delays of power on-off transition<br />

[24].



Fig. 15. HotSpot coefficients for temperature simulation. (a) Side view of a<br />

typical chip package. (b) HotSpot parameter.<br />

For temperature calculation, we used HotSpot [11] as the<br />

thermal modeling tool, and modified its parameters according<br />

to our processor model. The coefficients related to the<br />

temperature model of the processor are given in Fig. 15. The<br />

ambient temperature Tamb is assumed to be 50 °C. We performed<br />

the temperature simulations using a sampling interval<br />

t of 0.1 ms.<br />

Experiments were performed with two real examples, i.e.,<br />

H.264 decoders with different frame rates (14 f/s and 7 f/s in<br />

our experiments) and ray tracing. We also used a benchmark<br />

program, equake from SPEC2000. For ray tracing and equake,<br />

we assume the time-to-deadline according to a utilization<br />

ratio (i.e., WETC 14 /D = 1.13 and WETC/D = 0.70 in our<br />

experiment). We collected the runtime distribution from a performance<br />

profiling application programming interface [25] on<br />

an Intel Core2Duo 7200 processor in an LG xnote LW25 laptop. We<br />

used 2000 frames of input picture for H.264 decoder, 15 360<br />

pixels of input picture for ray tracing, and 130 simulation<br />

cycles for equake. The input picture for H.264 decoder was<br />

extracted from the movie Harry Potter. We divided the input<br />

data into two sets: training set (1000 frames for H.264, 7680<br />

pixels for ray tracing, and 65 simulation cycles for equake)<br />

and evaluation set (the other frames, pixels, and simulation<br />

cycles). We obtained design-time solutions with the training<br />

set and then obtained experimental results in this section<br />

by applying the design-time solution to the evaluation sets.<br />

Fig. 16 shows the runtime distribution of H.264 decoder<br />

and ray tracing [26]. The runtime distribution of H.264 (ray<br />

tracing) was obtained from per-frame (per-pixel) execution<br />

cycles collected during the execution with the training set. The<br />

WCECs of H.264 decoder and ray tracing are 6.0E+8 cycles<br />

and 2.0E+8 cycles, respectively. We assume the number of<br />

bins for both applications is five. Thus, each bin of H.264<br />

decoder and ray tracing contains 1.2E+8 cycles and 0.4E+8<br />

cycles, respectively.<br />

14 In this paper, we define WETC, the worst-case execution time at critical<br />

speed, as WCEC divided by the critical speed.<br />

Fig. 16. Runtime distribution of real examples. (a) H.264 decoder. (b) Ray<br />

tracing. (c) Equake.<br />
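The bin sizes quoted above follow from Cbin = ⌈WCEC/B⌉ (footnote 6). A short check in Python, where the helper `bin_index` and its 1-based numbering are illustrative assumptions rather than part of the original method:

```python
def cycles_per_bin(wcec, n_bins):
    """Cbin = ceil(WCEC / B), computed with integer arithmetic."""
    return -(-wcec // n_bins)


def bin_index(cycles_done, c_bin):
    """1-based index of the bin that contains the next cycle to execute."""
    return cycles_done // c_bin + 1


c_bin_h264 = cycles_per_bin(600_000_000, 5)   # WCEC = 6.0E+8 cycles
c_bin_ray = cycles_per_bin(200_000_000, 5)    # WCEC = 2.0E+8 cycles
```

This reproduces the 1.2E+8 and 0.4E+8 cycles per bin stated for the H.264 decoder and ray tracing.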

B. Experimental Results<br />

We performed the experiment using two different versions<br />

of the proposed methods (T-OURS and TD-OURS) and three<br />

existing ones (PACE, T-DVFS, and TALK) as follows.<br />

PACE [19]: Only DVFS is applied while exploiting runtime<br />

distribution without considering operating temperature.<br />

Temperature-aware DVFS (T-DVFS) [9]: Only DVFS is applied<br />

considering steady state temperature without exploiting<br />

runtime distribution and PG.<br />

TALK [14]: Only PG is applied at the critical speed considering<br />

time-varying temperature without exploiting runtime<br />

distribution.<br />

T-OURS: Our first method (in Section VI) where only time-varying<br />

temperature is considered.<br />

TD-OURS: Our second method (in Section VII) considering<br />

both temperature variation and runtime distribution.<br />

Table I compares the features of these methods.<br />

Table II shows the comparison of energy consumption for<br />

H.264 video decoding, ray tracing, and equake. All the energy<br />

consumption data are normalized with respect to the energy<br />

consumption of the T-DVFS method. Note that TALK is not<br />

applicable when the frequency needs to be set at a higher level



TABLE I<br />

Comparison of Our Methods With Existing Ones<br />

Method | DVFS | Power gating | Runtime dist. | Temperature<br />

PACE | O | X | O | X<br />

T-DVFS | O | X | X | O<br />

TALK | X | O | X | O<br />

T-OURS | O | O | X | O<br />

TD-OURS | O | O | O | O<br />

TABLE II<br />

Comparison of Energy Consumption Results for H.264 Decoder,<br />

Ray Tracing, and Equake<br />

Benchmark | T-DVFS (J) | PACE | TALK | T-OURS | TD-OURS<br />

WETC ≥ D (14 f/s for H.264)<br />

H.264 | 15.34 | 1.06 | N/A | 0.85 | 0.79<br />

Ray tracing | 4.31 | 1.08 | N/A | 0.91 | 0.81<br />

Equake | 122.47 | 1.07 | N/A | 0.89 | 0.78<br />

WETC



Fig. 19. Energy comparisons for H.264 decoder and ray tracing benchmark with different ambient temperature. The energy consumptions presented in this<br />

figure are normalized against T-DVFS. (a) H.264 decoder when WETC ≥ D. (b) H.264 decoder when WETC



TABLE V<br />

Increase in Energy Consumption for Varying Initial<br />

Temperatures<br />

Tinit (°C) | 30 | 40 | 50 | 60 | 70 | 80 | 90<br />

Energy increase (%) | 0.33 | 0.27 | 0.27 | 0.17 | 0.10 | 0.08 | 0.00<br />

Fig. 20. Temperature behavior comparison for executing H.264 with TD-OURS<br />

and runtime solution when initial temperature is 30 °C. (a) Temperature<br />

profiles during entire simulation. (b) Zoom-in of temperature profiles over<br />

time.<br />

Tinit from 30 °C to 80 °C with 10 °C step. Table V shows<br />

the results. In Table V, the first and second rows indicate the<br />

initial temperature and the maximum increase (%) of energy<br />

consumption for H.264 and ray tracing compared with the<br />

energy consumption at 90 °C, respectively. In Table V, the<br />

energy consumption increases only by up to 0.33% compared<br />

with the energy consumption at 90 °C.<br />

Fig. 20 shows the thermal behavior of IEC module which<br />

confirms the results in Table V. In Fig. 20, DT−30 (DT−90)<br />

represents the temperature profiles obtained by applying the<br />

design-time solution prepared with the initial temperature of<br />

30 °C (90 °C) to the case when the initial temperature is 30 °C<br />

for H.264. No significant difference in energy, i.e., less than<br />

0.33%, is observed according to Tinit. Thus, in our experiments,<br />

we need only the design-time solution obtained at 90 °C for the LUT.<br />

Fig. 20(b) shows a close-up view of the circled point in<br />

Fig. 20(a), the temperature profile during the initial period.<br />

As Fig. 20(b) shows, DT−90 gives a higher local maximum<br />

than DT−30 since DT−90 has a higher Tthr than DT−30.<br />

Such a difference comes from the difference in the initial<br />

temperatures, i.e., 30 °C and 90 °C.<br />

In our experiments, the LUT is managed as a normal<br />

data array and has only 11 entries to store optimal working<br />

frequencies and threshold temperatures of bins. Since four<br />

bytes are used for each entry, a total of 44 bytes of memory space<br />

is required for the LUT. The LUT is accessed only when a new<br />

bin starts. In that case, two entries are read from the LUT to<br />

determine working frequency and threshold temperature (Tthr)<br />

of a newly starting bin. Reading two entries can be executed<br />

with only one load instruction at the target processor. Thus,<br />

the LUT has a negligible overhead in terms of area, runtime<br />

and power consumption.<br />

X. Conclusion<br />

In this paper, we proposed a method of integrating DVFS<br />

and power gating in a temperature/runtime distribution-aware<br />

manner. The proposed method analytically assigns active<br />

(including frequency level) and sleep (including locations)<br />

states to each time slot according to temperature and runtime<br />

distribution for total energy reduction. The higher the temperature<br />

is, the more time budget is assigned to the beginning<br />

of task execution. The proposed method contributes to the<br />

overall energy saving by lowering the temperature and the<br />

leakage power consumption, especially during the period of<br />

high execution probability, i.e., the beginning phase of task<br />

execution. In experiments with two real applications, H.264<br />

decoder and ray tracing, and one benchmark program, equake,<br />

the proposed method yields 19.4%–27.2% of additional energy<br />

savings compared with existing methods.<br />

In future work, we will extend the proposed idea to applications<br />

running on multi/many-core for high-performance<br />

computing (with response time constraints) as well as mobile/home<br />

real-time systems. In addition, since active cooling<br />

is an effective method of temperature management, we will<br />

also work on the integration of active cooling and the proposed<br />

method considering the benefits and cost of active cooling<br />

and PG.<br />

References<br />

[1] Intel Homepage [Online]. Available: http://www.intel.com/products/processor/core2duo/mobile/specifications.htm<br />

[2] Semiconductor Research Corporation. 2005 Packing Thrust Strategic<br />

Needs [Online]. Available: http://www.src.org/<br />

[3] W. P. Liao, L. He, and K. M. Lepak, “Temperature and supply voltage<br />

aware performance and power modeling at micro-architecture level,”<br />

IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7,<br />

pp. 1042–1053, Jul. 2005.<br />

[4] Google Data Center [Online]. Available: http://www.datacenterknowledge.com/<br />

[5] A. Kumar, L. Shang, L.-S. Peh, and N. K. Jha, “HybDTM: A coordinated<br />

hardware-software approach for dynamic thermal management,” in Proc.<br />

DAC, 2006, pp. 548–553.<br />

[6] R. Rao, S. Vrudhula, C. Chakrabarti, and N. Chang, “An optimal<br />

analytical solution for processor speed control with thermal constraints,”<br />

in Proc. ISLPED, 2006, pp. 292–297.<br />

[7] J. Yang, X. Zhou, M. Chrobak, Y. Zhang, and L. Jin, “Dynamic thermal<br />

management through task scheduling,” in Proc. Int. Symp. ISPASS, 2008,<br />

pp. 191–201.<br />

[8] A. Coskun, T. Rosing, and K. Whisnant, “Temperature aware task<br />

scheduling in MPSoCs,” in Proc. DATE, 2007, pp. 1659–1664.<br />

[9] M. Bao, A. Andrei, P. Eles, and Z. Peng, “Temperature-aware<br />

voltage selection for energy minimization,” in Proc. DATE, 2008,<br />

pp. 1083–1086.<br />

[10] S. Zhang and K. S. Chatha, “System-level thermal aware design of<br />

applications with uncertain execution times,” in Proc. ICCAD, 2008,<br />

pp. 242–249.


[11] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, and D. Tarjan,<br />

“Temperature-aware microarchitecture: Modeling and implementation,”<br />

ACM Trans. Architect. Code Optim., vol. 1, no. 1, pp. 94–125, Mar.<br />

2004.<br />

[12] Y. Han, I. Koren, and C. M. Krishna, “Temptor: A lightweight runtime<br />

temperature monitoring tool using performance counters,” in Proc. 3rd<br />

Workshop TACS, Held Conjunct. ISCA-33, 2006.<br />

[13] A. Kumar, L. Shang, L.-S. Peh, and N. K. Jha, “System-level dynamic<br />

thermal management for high-performance microprocessors,”<br />

IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 1,<br />

pp. 96–108, Jan. 2008.<br />

[14] L. Yuan, S. Leventhal, and G. Qu, “Temperature-aware leakage minimization<br />

technique for real-time systems,” in Proc. ICCAD, 2006, pp.<br />

761–764.<br />

[15] Y. Liu, H. Yang, R. P. Dick, H. Wang, and L. Shang, “Thermal vs. energy<br />

optimization for DVFS-enabled processors in embedded systems,” in<br />

Proc. Int. Symp. ISQED, 2007, pp. 204–209.<br />

[16] L. Yuan and G. Qu, “ALT-DVFS: Dynamic voltage scaling with awareness<br />

of leakage and temperature for real-time systems,” in Proc. AHS,<br />

2007, pp. 660–670.<br />

[17] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, “Dynamic<br />

voltage scaling of supply and body bias exploiting software runtime<br />

distribution,” in Proc. DATE, 2008, pp. 242–247.<br />

[18] J. Kim, S. Yoo, and C.-M. Kyung, “Program phase and runtime<br />

distribution-aware online DVFS for combined Vdd/Vbb scaling,” in<br />

Proc. DATE, 2009, pp. 417–422.<br />

[19] J. R. Lorch and A. J. Smith, “Improving dynamic voltage scaling<br />

algorithms with PACE,” ACM SIGMETRICS Perform. Eval. Rev., vol.<br />

29, no. 1, pp. 50–61, Jun. 2001.<br />

[20] S. Krantz, S. Kress, and R. Kress, Jensen’s Inequality. Cambridge, MA:<br />

Birkhauser, 1999.<br />

[21] V. George, S. Jahagirdar, C. Tong, K. Smits, S. Damaraju, S. Siers, V.<br />

Naydenov, T. Khondker, S. Sarkar, and P. Singh, “Penryn: 45-nm next<br />

generation Intel Core 2 processor,” in Proc. ASSCC, 2007, pp. 14–17.<br />

[22] K. Mistry, C. Allen, C. Auth, B. Beattie, D. Bergstrom, M. Bost, M.<br />

Brazier, M. Buehler, A. Cappellani, R. Chau, C. H. Choi, G. Ding, K.<br />

Fischer, T. Ghani, R. Grover, W. Han, D. Hanken, M. Hattendorf, J.<br />

He, J. Hicks, R. Huessner, D. Ingerly, P. Jain, R. James, L. Jong, S.<br />

Joshi, C. Kenyon, K. Kuhn, K. Lee, H. Liu, J. Maiz, B. McIntyre, P.<br />

Moon, J. Neirynck, S. Pae, C. Parker, D. Parsons, C. Prasad, L. Pipes,<br />

M. Prince, P. Ranade, T. Reynolds, J. Sandford, L. Shifren, J. Sebastian,<br />

J. Seiple, D. Simon, S. Sivakumar, P. Smith, C. Thomas, T. Troeger, P.<br />

Vandervoorn, S. Williams, and K. Zawadzki, “A 45 nm logic technology<br />

with high-k + metal gate transistors, strained silicon, 9 Cu interconnect<br />

layers, 193 nm dry patterning, and 100% Pb-free packaging,” in Proc.<br />

IEDM, 2007, pp. 247–250.<br />

[23] Intel. (2009 Mar.). “Intel Core2 solo mobile processor and Intel Core2<br />

extreme mobile processor on 45-nm process datasheet” [Online]. Available:<br />

http://www.intel.com/<br />

[24] H. Kim, H. Hong, H.-S. Kim, J.-H. Ahn, and S. Kang, “Total energy<br />

minimization of real-time tasks in an on-chip multiprocessor using<br />

dynamic voltage scaling efficiency metric,” IEEE Trans. Comput.-Aided<br />

Design Integr. Circuits Syst., vol. 27, no. 11, pp. 2088–2092, Nov. 2008.<br />

[25] PAPI [Online]. Available: http://icl.cs.utk.edu/papi<br />

[26] J. Bikker. “Raytracing: Theory & implementation” [Online]. Available:<br />

http://www.devmaster.net/articles/raytracing−series/part7.php<br />

[27] JEDEC. 2006 Failure Mechanisms and Models for Semiconductor<br />

Devices [Online]. Available: http://www.jedec.org<br />

[28] G. Corliss, “Which root does the bisection algorithm find?” SIAM Rev.,<br />

vol. 19, no. 2, pp. 325–327, 1977.<br />

[29] N. Bansal and K. Pruhs, “Speed scaling to manage temperature,” in<br />

Proc. STACS, 2005, pp. 460–471.<br />

[30] J.-J. Chen, C.-M. Hung, and T.-W. Kuo, “On the minimization of the<br />

instantaneous temperature for periodic real-time tasks,” in Proc. RTAS,<br />

2007, pp. 236–248.<br />

[31] A. K. Coskun, T. T. Rosing, K. A. Whisnant, and K. C. Gross, “Static<br />

and dynamic temperature-aware scheduling for multiprocessor SoCs,”<br />

IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 9, pp. 1127–<br />

1140, Sep. 2008.<br />

[32] J. Srinivasan and S. V. Adve, “Predictive dynamic thermal management<br />

for multimedia applications,” in Proc. ICS, 2003, pp. 109–120.<br />

[33] W. Lee, K. Patel, and M. Pedram, “GOP-level dynamic thermal management<br />

in MPEG-2 decoding,” IEEE Trans. Very Large Scale Integr.<br />

Syst., vol. 16, no. 6, pp. 662–672, Jun. 2008.<br />

[34] J. S. Lee, K. Skadron, and S. W. Chung, “Predictive temperature-aware<br />

DVFS,” IEEE Trans. Comput., vol. 59, no. 1, pp. 127–133, Jan. 2010.<br />

Kyungsu Kang (S’06) received the B.S. degree<br />

from the Department of Electrical and Electronic Engineering,<br />

Kyungpook National University, Daegu,<br />

Korea, in 2003. Since 2003, he has been pursuing<br />

the unified course of the M.S. and Ph.D. degrees<br />

from the Department of Electrical Engineering and<br />

Computer Science, Korea Advanced Institute of Science<br />

and Technology, Daejeon, Korea.<br />

His current research interests include power and<br />

thermal management for 2-D/3-D chip multiprocessors.<br />

Jungsoo Kim (S’06) received the B.S. degree in<br />

electrical engineering from the Korea Advanced<br />

Institute of Science and Technology (KAIST), Daejeon,<br />

Korea, in 2005. Since 2005, he has been pursuing<br />

the unified course of the M.S. and Ph.D. degrees<br />

from the Department of Electrical Engineering and<br />

Computer Science, KAIST.<br />

His current research interests include dynamic<br />

power and thermal management, MPSoC design, and<br />

massive parallel processing.<br />

Sungjoo Yoo (M’09) received the B.S., M.S., and<br />

Ph.D. degrees in electronics engineering from Seoul<br />

National University, Seoul, Korea, in 1992, 1995,<br />

and 2000, respectively.<br />

He was a Researcher with TIMA Laboratory,<br />

Grenoble, France, from 2000 to 2004. He was a<br />

Senior and Principal Engineer with Samsung Electronics,<br />

Yongin, Gyeonggi, Korea, from 2004 to<br />

2008. He has been with the Pohang University of<br />

Science and Technology, Pohang, Korea, since 2008.<br />

His current research interests include low-power<br />

design and memory/storage architecture for embedded systems.<br />

Chong-Min Kyung (S’76–M’81–SM’99–F’08) received<br />

the B.S. degree in electronics engineering<br />

from Seoul National University, Seoul, Korea, in<br />

1975, the M.S. and Ph.D. degrees in electrical<br />

engineering from the Korea Advanced Institute of<br />

Science and Technology (KAIST), Daejeon, Korea,<br />

in 1977 and 1981, respectively.<br />

From 1981 to 1983, he was with Bell Telephone<br />

Laboratories, Murray Hill, NJ, as a Postdoctoral<br />

Research Fellow. Since he joined KAIST in 1983,<br />

he has been working on system-on-a-chip design and<br />

verification methodology, processor and graphics architectures for high-speed<br />

and/or low-power applications including mobile video codec.<br />

Dr. Kyung received the Most Excellent Design Award and Special Feature<br />

Award in the University Design Contest in the ASP-DAC in 1997 and 1998,<br />

respectively. He received the Best Paper Award in the 36th DAC held in<br />

New Orleans, LA, the 10th International Conference on Signal Processing<br />

Application and Technology, Orlando, FL, in 1999, and the 1999 International<br />

Conference on Computer Design, Austin, TX. He was General Chair of the<br />

Asian Solid-State Circuits Conference 2007, and ASP-DAC 2008. In 2000, he<br />

received the National Medal from the Korean government for his contributions<br />

to research and education in integrated circuit designs. He is a member of the<br />

National Academy of Engineering Korea and Korean Academy of Science<br />

and Technology. He is the Hynix Chair Professor at KAIST.
