
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 29, NO. 9, SEPTEMBER 2010 1381 

Temperature-Aware Integrated DVFS and 

Power Gating for Executing Tasks 

with Runtime Distribution 

Kyungsu Kang, Student Member, IEEE, Jungsoo Kim, Student Member, IEEE, Sungjoo Yoo, Member, IEEE, 

and Chong-Min Kyung, Fellow, IEEE 

Abstract—At high operating temperatures, chip cooling is crucial

due to the exponential temperature dependence of leakage 

current. However, traditional cooling methods, e.g., power/clock 

gating applied when a temperature threshold is reached, often 

cause excessive performance degradation. In this paper, we 

propose a method for delivering lower energy consumption by 

integrating the cooling and running in a temperature-aware 

manner without incurring a performance penalty. In order to

further reduce the energy consumption, we exploit the runtime

distribution of each sub-segment of a task called “bin” in 

an analytical manner such that time budget for cooling in 

each bin is allocated in proportion to the probability of the 

occurrence of the bin. We apply the proposed method to two 

realistic software programs, H.264 decoder and ray tracing and a 

benchmark program, equake. The experimental results show that 

the proposed method yields an additional 19.4%–27.2% reduction

in energy consumption compared with existing methods. 

Index Terms—Dynamic voltage and frequency scaling (DVFS), 

energy minimization, hard real time, power gating (PG), runtime 

distribution. 

I. Introduction 

TECHNOLOGY scaling has resulted in a sharp growth 

in transistor density, up to 0.4 billion transistors/cm² [1]. Such transistor counts have pushed chip power

density up to 250–300 W/cm² [2]. High power density

incurs temperature-related problems, including reliability

problems such as negative bias temperature instability, and performance

degradation. Leakage power consumption, which is the most

critical among temperature-related problems [3], is caused by 

the exponential dependency of leakage current on temperature. 

It affects battery lifetime in mobile devices and increases the 

cost of cooling in high-performance computing [4]. Especially 

Manuscript received September 4, 2009; revised February 23, 2010. Date 

of current version August 20, 2010. This work was supported by the National 

Research Foundation (NRF) of Korea funded by the Korean government 

(MEST), under Grant 2010-0000823. This paper was recommended by 

Associate Editor M. Poncino. 

K. Kang, J. Kim, and C.-M. Kyung are with the Department of Electrical 

Engineering, Korea Advanced Institute of Science and Technology, Daejeon 

305–701, Korea (e-mail: kskang@vslab.kaist.ac.kr; jskim@vslab.kaist.ac.kr; 

kyung@ee.kaist.ac.kr). 

S. Yoo is with the Department of Electronics and Electrical Engineering, 

Pohang University of Science and Technology, Pohang 790–784, Korea (email: 

sungjoo.yoo@postech.ac.kr). 

Color versions of one or more of the figures in this paper are available 

online at http://ieeexplore.ieee.org. 

Digital Object Identifier 10.1109/TCAD.2010.2059290 

0278-0070/$26.00 © 2010 IEEE

handheld devices such as thin notebooks, personal digital assistants, 

and cell phones are likely to suffer from temperature-induced

leakage power because they are not equipped with 

active cooling facilities, e.g., cooling fans, due to the small 

form factor requirement. 

Dynamic thermal management (DTM) is an effective 

method to control the chip temperature. When the operating 

temperature approaches the thermal limit, several solutions are 

available for cooling such as stopping with power/clock gating, 

running at lower clock frequencies and/or lower voltages, and 

issuing fewer instructions/functions [5]–[7]. DTM trades performance

for temperature reduction, often incurring a performance

penalty.

There have been several studies on temperature-aware dynamic 

voltage and frequency scaling (DVFS) [8]–[10] to avoid 

the performance penalty by setting frequency and voltage 

in a temperature-aware manner. In such methods, given a 

time slack (= deadline − remaining worst-case execution time) and

statistical workload estimation [10], the performance level is 

set to minimize the total energy consumption consisting of 

temperature-induced leakage energy and switching energy. 

Although temperature-aware DVFS and DTM are self-sufficient

techniques, applying them together will help fully 

exploit the potential of both techniques. For instance, in case 

of high temperature where temperature-induced leakage power 

consumption dominates, we utilize the time slack to cool down 

the processor by applying power/clock gating, e.g., turning 

off the processor during the slack. In the low-temperature 

situation where switching power dominates, we focus on 

reducing the switching power consumption by lowering the 

operating voltage/frequency. As our motivational example in 

Section III shows, such a temperature-aware tradeoff between 

power/clock gating and voltage/frequency setting gives a significantly larger

reduction in energy consumption than applying either

technique without the other. 

Most complex programs are characterized by the runtime 

distribution [17]–[19]. There are two sources of runtime distribution: 

data-dependent behavior and conflicts in accessing 

architectural resources, e.g., cache and memory. Programs with 

loops and if/else statements tend to have different loop counts 

and take different execution paths depending on (input) data. 

For instance, in the case of a video codec, object movements,

i.e., input picture data, can determine the execution time


of time-consuming functions, e.g., motion estimation. Cache 

misses and DRAM access scheduling are typical sources 

of runtime variation in the architectural area. It is well 

known that exploiting runtime distribution can give a greater

reduction in energy consumption than considering only the

worst-case runtime [17]–[19]. In this paper, we extend

the application of the information of runtime distribution to 

the temperature-aware tradeoff between power/clock gating 

and voltage/frequency scaling. 

In this paper, we present a solution that integrates both DTM 

(to be exact, power/clock gating¹) and DVFS in a temperature-aware

manner to minimize the total energy consumption of a

periodic hard real-time system. Compared with earlier works,

ours is unique in two aspects: 1) temperature-aware tradeoff 

between PG and DVFS, and 2) exploiting runtime distribution 

in a temperature-aware tradeoff between PG and DVFS. 

This paper is organized as follows. Section II reviews 

related works. Section III explains the motivation for our work. 

Section IV presents our temperature-aware power estimation 

method. Section V gives the problem definition and solution 

overview. Sections VI and VII explain the temperature-aware 

integration of PG and DVFS. Section VIII shows how to 

apply the proposed solution during runtime. Section IX reports 

experimental results, followed by the conclusion in Section X. 

II. Related Works 

Temperature-aware design methods require thermal models. 

HotSpot [11] utilizes an equivalent circuit comprising thermal 

resistance and capacitance of architectural blocks and thermal 

characteristics of the package. Han et al. [12] proposed a time-invariant

linear thermal system that uses an adaptive time interval

to linearize the temperature calculation and speed up the

thermal analysis. Kumar et al. [13] proposed a regression-based

thermal model that uses hardware performance counters 

available in the processor. 

“HybDTM” [5] controls temperature by clock gating or 

limiting task execution when the temperature reaches a thermal 

threshold. Rao et al. [6] proposed an analytical solution 

which maximizes the average throughput of processor within 

the given time and temperature constraints. DTM techniques 

can effectively lower the chip temperature while the benefit 

often comes at the cost of performance degradation. Yang 

et al. [7] proposed a temperature-aware task scheduling to 

keep the chip temperature below the given threshold. This 

method explores different execution orders of hot and cold 

tasks to keep the temperature under control. Yuan et al. 

[14] proposed a runtime PG algorithm, called TALK, to 

minimize temperature-induced leakage energy without performance 

penalty. This method performs PG when the system 

temperature is too high according to the ratio of remaining 

execution cycle to time-to-deadline. Srinivasan and Adve 

[32] proposed a predictive frame-based DTM algorithm that 

adapts the architectural configurations (the issue queue and 

register file sizes as well as the number of active arithmetic 

logic units) and operating frequency to the frame type 

1 For simplicity, throughout this paper, we will use the term “power gating 

(PG)” instead of “DTM with power/clock gating.” 

(I, P, or B)-dependent characteristics of average IPC and power 

consumption. 

In [15], Liu et al. proposed a design-time temperature-aware

DVFS technique through static temperature analysis. 

They formulated the problem of minimizing peak temperature 

as a nonlinear programming problem, and aimed at reducing 

the system energy consumption under the peak temperature 

constraint. In [9], Bao et al. proposed a DVFS technique for 

temperature-aware energy minimization based on both static 

and dynamic temperature analysis. In [16], Yuan and Qu proposed 

design-time and runtime solutions to minimize system 

energy consumption while suppressing tasks that cannot be 

completed by DVFS due to system overheat. 

Runtime distribution is exploited by several DVFS methods 

[10], [17]–[19]. In [17] and [18], the authors solved the 

distribution-aware DVFS problem by analytical approaches. 

In [19], Lorch and Smith proposed processor acceleration to 

conserve energy (PACE) algorithm which proactively increases 

the clock frequency as the task execution progresses to take 

advantage of runtime distribution. However, these works have

not utilized the option of cooling to reduce the temperature-dependent

leakage power consumption during task execution. 

In [10], Zhang and Chatha solved a stochastic thermal-aware 

DVFS problem in order to keep the expected latency within 

the designer-specified level subject to the condition that the 

probability of peak temperature exceeding a given value is 

sufficiently small. In [33], the authors proposed a group of 

pictures (GOP)-level DVFS algorithm in MPEG-2 decoding, 

which exploits the variation of frame decoding times for a 

GOP. The slack obtained during the frame decoding of the 

previous GOP is exploited to lower the operating frequency 

of frame decoding of the next GOP thereby lowering the 

temperature. 

Existing slack reclamation-based DVFS methods are targeted 

at energy reduction [9], [16]–[19] and apply power/clock

gating only when the task execution finishes and there

still remains an idle time until the deadline. On the contrary, 

our method is applicable to hard real-time systems, and 

exploits the slack, i.e., performs slack reclamation, to

switch between PG and DVFS according to the temperature.

That is, our method applies power/clock gating during task

execution in order to lower the operating temperature. Compared 

with existing slack reclamation-based DTM methods 

[14], [15], [29]–[31] which assume that the slack is fixed, 

our method dynamically switches between power gating and 

DVFS according to the temperature level. We found

that adjusting the overall slack size and distributing it over 

the allowed time interval results in further energy reduction 

as well as temperature lowering. For instance, in case of high 

temperature, our method increases the slack size for more idle 

slots (for further cooling) by running the processor at a higher 

frequency during the active slots. 

Unlike previous works [9], [14], [16]–[19] which try to 

reduce the energy consumption by either DVFS or PG often 

without considering runtime distribution, our method provides 

an integration of PG and DVFS to minimize the total energy 

consumption considering both the temperature and runtime 

distribution.


Fig. 1. (a) Worst-case execution cycles of each task instance in a period are 

divided into three bins. (b) Cumulative distribution function for each bin j, 

where CDF(j +1) − CDF(j) denotes the probability of the task being finished 

with bin j. 

III. Motivation 

Fig. 1 shows an example of executing a periodic task 

with runtime distribution where the worst-case execution cycle 

(WCEC) for a task period is divided into three bins as shown 

in [19]. The task has one execution instance (shown as “bin 

executed” at the top of Fig. 1) during each period of 1.2 s. 

In Fig. 1, the four instances require two, two, four, and six 

hundred million cycles, respectively. We can compute the 

probability of each bin with respect to the total number of 

instances. The first bin has the probability of 100% (= 4/4) 

because the first bin is executed in every instance (there are 

four instances in Fig. 1). The second and third bin have the 

probabilities of 50% (= 2/4) and 25% (= 1/4), respectively. 

Therefore, the probabilities of each task instance being finished 

with the execution of the first, second, and third bin are 

50% (= 100 − 50), 25% (= 50 − 25), and remaining 25%, 

respectively. The cumulative distribution function (CDF) of 

each bin is shown in Fig. 1(b). 
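As a minimal sketch of the bookkeeping described above, the bin probabilities of Fig. 1 can be recomputed from the four observed cycle counts. The function name `bin_probabilities` and its arguments are illustrative and not part of the original work; the bin size follows Cbin = ⌈WCEC/B⌉.

```python
import math

def bin_probabilities(instance_cycles, wcec, num_bins):
    """Return per-bin execution and finish probabilities from observed cycle counts."""
    c_bin = math.ceil(wcec / num_bins)  # cycles per bin, Cbin = ceil(WCEC / B)
    # 1-based index of the last bin each instance executes
    last_bin = [max(1, math.ceil(c / c_bin)) for c in instance_cycles]
    n = len(instance_cycles)
    # Probability that bin j is executed at all
    executed = [sum(lb >= j for lb in last_bin) / n for j in range(1, num_bins + 1)]
    # Probability that the task instance finishes in bin j
    finished = [sum(lb == j for lb in last_bin) / n for j in range(1, num_bins + 1)]
    return executed, finished

# The four instances of Fig. 1: two, two, four, and six hundred million cycles
cycles = [200_000_000, 200_000_000, 400_000_000, 600_000_000]
executed, finished = bin_probabilities(cycles, wcec=600_000_000, num_bins=3)
print(executed)  # [1.0, 0.5, 0.25]
print(finished)  # [0.5, 0.25, 0.25]
```

The two lists reproduce the 100%, 50%, 25% execution probabilities and the 50%, 25%, 25% finish probabilities stated in the text.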

Let us assume that, given an initial temperature condition, 

the PG and voltage/frequency setting need to be determined 

for this task. The system has M discrete frequency levels 

(each of which has a corresponding voltage level) and two 

operation states: active and sleep states. In the active state, 

the system executes the task at one of the M frequency levels 

consuming both switching and leakage power. In the sleep 

state, no task is executed and the processor consumes only 

leakage power through PG. When the processor switches from 

sleep to active state (or vice versa), additional time and energy 

overhead called “power state transition overhead” is incurred. 

Fig. 2(a) shows how the operation states are determined by 

a traditional temperature-aware DVFS method which assigns 

a single frequency level in proportion to the ratio of WCEC to 

time-to-deadline when it is higher than the critical speed.² In this

case, if the task execution is finished earlier than the deadline, 

then the processor is simply power-gated. In this example, we 

assume that the frequency level is set to the critical speed, 

2 Critical speed is the frequency such that no slower frequency setting can 

give further energy reduction due to the increasing leakage power at low 

frequency. 

Fig. 2. Motivational example showing the reduction of energy consumption 

of the proposed method by exploiting the runtime distribution and applying 

integrated DVFS and PG. (a) Frequency scaling schedule according to 

conventional DVFS. (b) Integration of DVFS and PG when the frequency 

scaling schedule is the same as in (a). (c) Our result with WCEC-based 

integration of DVFS and PG. (d) Our result with integration of DVFS and 

PG exploiting statistical information. 

600 MHz. Conventional DVFS in Fig. 2(a) yields an average

temperature of 78.36 °C and an energy consumption of 20.93 J.³

Note that the temperature in Fig. 2 represents the average

temperature of the active state.
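The frequency choice of the conventional scheme in Fig. 2(a) can be sketched as follows. This is only an illustration: the function name is hypothetical, and we assume that the deadline equals the 1.2 s period and that the WCEC is the six hundred million cycles of the longest instance in Fig. 1.

```python
def conventional_dvfs_frequency(wcec_cycles, time_to_deadline_s, f_crit_hz):
    """Single frequency level: the required speed, but never below the critical speed."""
    required_hz = wcec_cycles / time_to_deadline_s
    return max(required_hz, f_crit_hz)

# Assumed numbers: WCEC = 600 million cycles, deadline = 1.2 s, critical speed = 600 MHz
f_hz = conventional_dvfs_frequency(600e6, 1.2, 600e6)
worst_case_runtime_s = 600e6 / f_hz
slack_s = 1.2 - worst_case_runtime_s  # idle time that is simply power-gated after the task
print(f_hz / 1e6, "MHz,", round(slack_s, 3), "s slack")
```

Under these assumptions the required speed is about 500 MHz, so the critical speed of 600 MHz is selected and roughly 0.2 s of slack remains even in the worst case.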

In order to reduce the temperature during the active state, 

the method called TALK [14] can be applied such that the 

sleep state can be inserted during the active state as Fig. 2(b) 

shows. The average temperature of active state is reduced from 

78.36 °C to 76.08 °C. 

The basic idea of our method is that, when leakage power 

consumption dominates total power consumption, i.e., at high 

temperature, we first allocate a share of the given slack to 

the sleep state in proportion to the temperature and then 

place the sleep states during task execution instead of simply 

applying PG only after task execution [Fig. 2(a)] or during task 

execution [Fig. 2(b)]. This is because reducing the temperature 

and, thus, leakage power consumption by cooling can be more 

urgent than running the task when the temperature is very high. 

3 The temperature and energy consumption are calculated with the 

temperature-aware power estimation method and our processor power/thermal 

model to be explained in Sections IV and IX.


Lengthening the active state (i.e., running at a lower frequency) yields less heat generation

and a smaller temperature increase because dynamic power is lower

during the active period. However, lengthening the idle state also helps

lower the temperature. As the total time spent in the active and idle states

is fixed, we need to find the mix of active and idle time

that yields the lowest average temperature. At

high temperature, it is sometimes beneficial to borrow time

budget from the active states (by running at a higher frequency) to

allow more idle states for cooling.

Fig. 2(c) shows how the integration of PG and DVFS 

lowers the temperature and the energy consumption. Such an 

integration is enabled by borrowing time budget from active 

states. First, a portion of the whole time budget is assigned to 

the sleep states in proportion to the temperature. Then, a higher 

frequency is assigned to the active states to compensate for the 

reduced time budget and meet the deadline. Fig. 2(c) shows 

that such an integration of PG and DVFS yields additional 

16.4% reduction in energy consumption mostly by reducing 

the temperature-induced leakage power consumption during 

active states. Compared with TALK [14] in Fig. 2(b), where 

only the critical speed is used, slack is assigned to sleep states 

in Fig. 2(c) in proportion to temperature and frequency levels 

higher than the critical speed are also used for active states.⁴

In Fig. 2(c), worst-case execution time is assumed in 

applying the PG and DVFS. However, by exploiting the 

runtime distribution, we obtain a further reduction in energy 

consumption as Fig. 2(d) shows. The key idea is to assign 

slack budget to the lower index bins in proportion to their 

probability of occurrence. Thus, the higher their probability, 

the more slack budget is assigned to the lower index bins. By 

utilizing more sleep states in the lower index bins with high execution

probability, we allow more cooling of the processor 

which leads to a reduction in the leakage power consumption 

in the lower index bins. The increased slack budget assigned to 

the lower index bins must, then, be borrowed from the higher 

index bins. Thus, the power consumption of higher index 

bins will increase. However, the total energy consumption 

is reduced at high temperature because the eventual energy 

consumption is the sum of the products, over all bins, of 

consumed energy per bin and the associated bin probability. 

The lower index bins have higher bin probability than the 

higher index bins. The overall energy consumption can be 

reduced by assigning more slack budget to the lower index 

bins than higher index bins, which results in lower temperature 

in the lower index bins. In our method the operating frequency 

and the amount of sleep states are analytically determined considering 

temperature and bin probabilities (see Section VII). 
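The probability-weighted sum mentioned above can be illustrated with a toy calculation. The per-bin energy function below is purely hypothetical (switching-like energy proportional to 1/budget², with leakage and temperature ignored), and the two budget vectors are arbitrary; the sketch only shows why shifting time budget toward high-probability bins can lower the expected energy, and it is not the analytical solution of Section VII.

```python
def expected_energy(budgets_s, bin_probs, bin_energy):
    """Sum over all bins of (bin execution probability) x (energy consumed in that bin)."""
    return sum(p * bin_energy(t) for t, p in zip(budgets_s, bin_probs))

def toy_bin_energy(budget_s, k=1.0):
    # Hypothetical model: a fixed number of cycles executed within budget_s at
    # f = cycles / budget_s with Vdd proportional to f gives energy ~ 1 / budget_s**2.
    return k / budget_s ** 2

probs = [1.0, 0.5, 0.25]     # bin execution probabilities of Fig. 1
uniform = [0.4, 0.4, 0.4]    # equal share of a 1.2 s deadline per bin
shifted = [0.5, 0.4, 0.3]    # budget borrowed from bin 3 and given to bin 1

e_uniform = expected_energy(uniform, probs, toy_bin_energy)
e_shifted = expected_energy(shifted, probs, toy_bin_energy)
print(e_uniform, e_shifted)
```

In this toy setting the shifted allocation has the lower expected energy although bin 3 alone becomes more expensive, because bin 3 is executed in only a quarter of the instances.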

IV. Preliminaries: Temperature-Aware Power 

Estimation 

System power consumption can vary during task execution 

while the operating frequency and the switching power 

consumption are constant, because temperature is varying 

4 Note that even in the case that there is no slack in conventional DVFS, 

e.g., fcrit ≤ 0.5 GHz in the example of Fig. 2(a), our method can still give 

energy reduction by creating additional slack, which is borrowed from active 

state, and uniformly inserting idle state as shown in Fig. 2(c). 

Fig. 3. Temperature-aware energy calculation of a task at a fixed clock

frequency (f) given a time period [ts, tf]. (a) Task execution time is divided

into time intervals Δt, each small enough that the temperature

during Δt can be considered constant. (b) Temperature-aware energy consumption can be

calculated by (1).

Fig. 4. Problem definition to minimize total energy consumption. 

with time (due to the switching power consumption) thereby 

changing the leakage power consumption. However, because 

the thermal time constant of the system is usually on the order

of milliseconds, it is not necessary to update temperature for

every clock cycle for temperature-aware power estimation. For

temperature estimation, we use a distributed thermal RC model

using HotSpot [11]. During power estimation, we update

temperature, obtained from the thermal model, at each time

step Δt (= 0.1 ms in our experiment). The time interval, Δt,

is sufficiently small that we may consider the temperature

during Δt to be constant. Fig. 3 illustrates how to calculate

energy consumption in a temperature-aware manner. Given a

time period [ts, tf], where tf − ts = K · Δt, temperature-aware

energy consumption can be calculated by summing the energy

consumption of all K intervals as follows:

\[ E_{active} = \sum_{k=1}^{K} \bigl(P_s(f) + P_l(f, T_k)\bigr) \cdot \Delta t \qquad (1) \]

where Ps and Pl denote switching power and leakage power

consumption, respectively. We assume that clock frequency

is proportional to Vdd, i.e., f ∝ Vdd.⁵ In summary, switching

power consumption is a function of clock frequency (f ), while 

leakage power consumption is a function of clock frequency 

and temperature (Tk). Details of our power (Ps and Pl) and 

thermal (Tk) models are explained in Section IX. 
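The summation in (1) can be sketched in a few lines. Everything numeric here is an assumption made for illustration: a single-node lumped RC model stands in for the distributed HotSpot model, and the power coefficients, thermal resistance, and thermal capacitance are invented rather than taken from Section IX.

```python
import math

DT = 1e-4  # time step of 0.1 ms, as stated in the text

def switching_power(f_hz):
    # Hypothetical P_s = c_eff * Vdd^2 * f with Vdd proportional to f (1.0 V at 1 GHz)
    v = f_hz / 1e9
    return 1.0e-8 * v * v * f_hz

def leakage_power(f_hz, temp_c):
    # Hypothetical leakage: proportional to Vdd, exponential in temperature
    v = f_hz / 1e9
    return 2.0 * v * math.exp(0.02 * (temp_c - 60.0))

def active_energy(f_hz, t_start, t_finish, temp_init_c,
                  t_amb_c=45.0, r_th=1.0, c_th=0.05):
    """Accumulate (P_s + P_l(T_k)) * DT over K steps, updating T_k with a lumped RC model."""
    k_steps = round((t_finish - t_start) / DT)
    temp, energy = temp_init_c, 0.0
    for _ in range(k_steps):
        power = switching_power(f_hz) + leakage_power(f_hz, temp)
        energy += power * DT  # one term of the sum in (1)
        # Forward Euler step of C * dT/dt = P - (T - T_amb) / R
        temp += DT / c_th * (power - (temp - t_amb_c) / r_th)
    return energy, temp

e_cool, _ = active_energy(1e9, 0.0, 0.1, temp_init_c=50.0)
e_hot, _ = active_energy(1e9, 0.0, 0.1, temp_init_c=90.0)
print(e_cool, e_hot)
```

With the same frequency and duration, the run that starts hotter consumes more energy, which is the temperature dependence that (1) is meant to capture.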

5 This is because the upper limit on the clock frequency is determined 

by the inverse of the propagation delay of gate elements, which is roughly 

proportional to Vdd.


V. Problem Definition 

In real environments, a processor changes operation states

and/or frequency levels on a discrete time basis. The length of

the discrete time period, l, is mostly determined by the scheduler

in the operating system. Within l, which is set at 2 ms in our 

experiment, the operation state and frequency level are fixed. 

The time interval [0, D], where D denotes the task deadline, is 

divided into N slots of l duration each as shown in Fig. 4. A 

slot where the processor is in the active (sleep) state is called 

active (sleep) slot. In the active slot, the power consumption 

is the sum of Ps and Pl. In the sleep slot, only a small amount 

of leakage power, Psleep (≪Pl), is consumed via PG. The 

processor energy consumption, Etotal, can be calculated by the 

following equation: 

\[ E_{total} = \sum_{i=1}^{N} \bigl(E_{active}[i] \cdot x[i] + E_{sleep} \cdot (1 - x[i])\bigr) \qquad (2) \]

where

\[ E_{active}[i] = P_s(f[i]) \cdot l + \sum_{k=1}^{L} P_l(f[i], T_k) \cdot \Delta t \qquad (3) \]

and

\[ E_{sleep} = P_{sleep} \cdot l \qquad (4) \]

where l = L · Δt (L = 20 ≫ 1) and f[i] is the clock

frequency of the ith slot. x[i] is 1 if the ith slot is an active

slot, and 0 if it is a sleep slot.

We define the problem of energy minimization as follows. 

Given an initial temperature (Tinit), a task with a deadline 

D, and its runtime distribution (including the information of 

WCEC), and assuming that the interval, [0, D] is divided 

into N slots and the WCEC is divided into B bins,⁶ the

problem is to find f[i] and x[i] such that the total energy

consumption, Etotal in (2), is minimized. The problem complexity

is O((M + 1)^N) because each slot can be at one of

the M frequency levels or in the sleep state. The complexity can be

extremely high because M ∼ 20 and N ∼ 500 even for a

deadline of 1 s (l = 2 ms).
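To make the objective and the size of the search space concrete, the following sketch evaluates (2) for a given slot assignment and counts the candidate assignments. The per-slot energies in the example are made-up numbers; in the actual formulation, Eactive[i] comes from (3) and depends on the temperature history.

```python
def total_energy(x, e_active, e_sleep):
    """Eq. (2): x[i] is 1 for an active slot, 0 for a sleep slot."""
    return sum(ea * xi + e_sleep * (1 - xi) for xi, ea in zip(x, e_active))

def search_space_size(m_levels, n_slots):
    """Each slot is either in one of M frequency levels or in the sleep state."""
    return (m_levels + 1) ** n_slots

# Hypothetical three-slot schedule: active, sleep, active (energies in joules)
e_example = total_energy([1, 0, 1], [0.030, 0.030, 0.032], e_sleep=0.001)
print(e_example)

# M = 20 and N = 500, as quoted for a 1 s deadline with 2 ms slots
digits = len(str(search_space_size(20, 500)))
print(digits)  # 662 decimal digits, i.e., more than 10**661 candidates
```

Exhaustive enumeration is therefore out of the question, which motivates the decomposition described in the following sections.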

In the remainder of this paper, we present our solution 

to the problem as follows. In Section VI, we explain the 

temperature-aware integration of DVFS and PG, assuming a 

single distinct execution cycle for a task. In Section VII, we 

will incorporate runtime distribution into this integration of 

DVFS and PG to present a complete design-time solution. 

Finally, in Section VIII, we will explain how to apply the 

design-time solution to runtime. 

VI. Temperature-Aware Integrated DVFS and 

Power Gating 

In this section, assuming a pre-determined execution cycle 

and a given deadline, we explain how to determine the 

operating frequency and sleep slots (Section VI-A) and the 

6 Each bin consists of Cbin = ⌈WCEC/B⌉ cycles. 

Fig. 5. Total energy consumption (= switching + leakage energy) per cycle 

as a function of clock frequency. 

locations of sleep slots (Section VI-B) in a temperature-aware 

manner. 

A. Energy-Optimal Frequency Decision 

Given the execution cycle and deadline, we first determine 

the frequency which minimizes total energy consumption 

during active states. Fig. 5 shows the total energy consumption 

per cycle, Ecycle, of the processor used in our experiments.⁷

Ecycle is a function of frequency and temperature. In order 

to obtain Ecycle(f ), at each frequency f , we obtain Ecycle 

by simulating the task execution on the power/thermal model 

until the temperature reaches steady state, assuming that the 

deadline is sufficiently larger than the thermal time constant 

of the processor. 

Fig. 5 shows that the curve of Ecycle is convex. Thus, the 

frequency for minimal energy consumption, called the optimal

frequency, fopt, is obtained where the slope of the curve is

zero. We apply a root-finding algorithm such as the bisection

method [28] to find the optimal frequency.

Fig. 6 shows the flow of energy, power, and temperature 

estimation. The bisection method shown in Fig. 6(a) performs,

at each intermediate frequency, four estimations of

Ecycle to obtain the gradients of Ecycle at fleft and fmid.

The estimation of Ecycle consists of average power estimation 

and steady state temperature estimation. The steady state 

temperature can be estimated from the product of the average 

power and the thermal resistance extracted from the HotSpot 

simulator. As Fig. 6(b) shows, the estimation of Ecycle needs 

iterations of power and temperature estimation because of 

the dependency of leakage power on temperature. In our 

experiments, five iterations are usually sufficient to yield a 

convergence in both the estimated temperature and power 

consumption with a variation of less than 0.1%. The bisection 

method used to calculate fopt is a general binary search 

algorithm and the computational complexity is O(log₂ NF),

where NF is the number of possible frequency values (NF = 

240 in our experiments). We measured the computational time 

of the root-finding algorithm [Fig. 6(a)] from the experimental 

platform, an LG xnote LW25 laptop, running at 2 GHz. The 

runtime overhead is less than 1 ms. 
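The two flows of Fig. 6 can be sketched as below. The power model, ambient temperature, and thermal resistance are toy assumptions chosen only so that Ecycle has an interior minimum; they are not the Penryn-derived model of Section IX. The steady-state temperature is taken as ambient plus average power times thermal resistance, and the slope is approximated by a finite difference, so each bisection step uses four Ecycle estimations.

```python
import math

T_AMB_C = 45.0  # assumed ambient temperature
R_TH = 1.0      # assumed thermal resistance in K/W

def switching_power(f_ghz):
    return 10.0 * f_ghz ** 3  # W; Vdd proportional to f gives P_s ~ f^3 (toy coefficient)

def leakage_power(f_ghz, temp_c):
    return (1.0 + f_ghz) * math.exp(0.02 * (temp_c - 60.0))  # W; toy model

def energy_per_cycle(f_ghz, iterations=5):
    """Fig. 6(b)-style loop: alternate power and steady-state temperature estimation."""
    temp = T_AMB_C
    for _ in range(iterations):
        power = switching_power(f_ghz) + leakage_power(f_ghz, temp)
        temp = T_AMB_C + R_TH * power
    power = switching_power(f_ghz) + leakage_power(f_ghz, temp)
    return power / f_ghz  # W / GHz = nJ per cycle

def slope(f_ghz, h=1e-3):
    # Finite-difference gradient: two Ecycle estimations per call
    return (energy_per_cycle(f_ghz + h) - energy_per_cycle(f_ghz)) / h

def optimal_frequency(f_lo, f_hi, tol=1e-3):
    """Fig. 6(a)-style bisection on the sign of the slope of Ecycle."""
    left, right = f_lo, f_hi
    while right - left > tol:
        mid = 0.5 * (left + right)
        if slope(left) * slope(mid) <= 0.0:
            right = mid  # zero slope lies in [left, mid]
        else:
            left = mid   # zero slope lies in [mid, right]
    return 0.5 * (left + right)

f_opt = optimal_frequency(0.1, 1.5)
print(f_opt, energy_per_cycle(f_opt))
```

The sketch assumes that the slope is negative at `f_lo` and positive at `f_hi`, which holds for a convex curve such as the one in Fig. 5 when the search interval brackets the minimum.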

7 Our processor power model is derived from the Intel Penryn processor 

[21]. Details of the power model are given in Section IX.


Fig. 6. Algorithm flow of energy-optimal frequency decision. (a) Flow of 

bisection method. (b) Flow of Ecycle estimation. 

Fig. 7. Temperature-aware sleep slot placement performed in two steps. 

(a) Calculation of the numbers of active and sleep slots, Na and Ns. 

(b) Determining the sleep slot location using the threshold temperature, Tthr. 

The optimal frequency, fopt, is assigned to active slots.⁸

After determining the optimal frequency, fopt, the number

of active slots, Na, and that of sleep slots, Ns, are easily

calculated by (5) and (6). Fig. 7(a) illustrates Na active and

Ns sleep slots determined by the optimal frequency, fopt:

\[ N_a = \left\lceil \frac{w / f_{opt}}{l} \right\rceil \qquad (5) \]

\[ N_s = N - N_a = N - \left\lceil \frac{w / f_{opt}}{l} \right\rceil \qquad (6) \]

where w is the number of execution cycles of the task.

B. Sleep Slot Location

The position of each sleep slot affects the operating temperature of ensuing active slots, and thus leakage power consumption in active states. Determining the locations of active and sleep slots is a permutation problem. The number of permutations is N!/(Na!·Ns!). Considering that N (= Na + Ns) can easily be as large as 500 (e.g., with 1 s time-to-deadline and 2 ms time slots), the complexity is huge.

8 In case of discrete frequencies, the nearest discrete frequency not lower than fopt is assigned to active states.

Fig. 8. Temperature and leakage energy according to the number of consecutive active slots (Na). (a) Temperature. (b) Leakage energy.

Fig. 9. Two examples of idle slot distribution for a task which is running at a fixed clock frequency. (a) Idle slots are uniformly distributed during task execution. (b) Idle slots are non-uniformly distributed during task execution.
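The slot counts of (5) and (6) and the number of candidate placements can be checked with a short sketch. The workload and frequency below are arbitrary example values, and the rounding in (5) is taken as a ceiling so that the active slots always cover the whole execution.

```python
import math

def slot_counts(w_cycles, f_opt_hz, slot_len_s, n_slots):
    """Number of active slots (5) and sleep slots (6) for a task of w_cycles cycles."""
    n_active = math.ceil((w_cycles / f_opt_hz) / slot_len_s)
    n_sleep = n_slots - n_active
    if n_sleep < 0:
        raise ValueError("task does not fit into the deadline at this frequency")
    return n_active, n_sleep

# Hypothetical task: 500 million cycles at 0.8 GHz, 2 ms slots, N = 500 (1 s deadline)
n_a, n_s = slot_counts(500e6, 0.8e9, 2e-3, 500)
print(n_a, n_s)  # 313 187

# Number of ways to place the sleep slots: N! / (Na! * Ns!)
placements = math.comb(n_a + n_s, n_a)
print(len(str(placements)), "decimal digits")
```

Even for this single example the number of placements is astronomically large, which is why the placement is handled analytically (Lemma 1) and heuristically rather than by enumeration.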

We solved the problem of locating sleep slots in two steps. 

First, we prove (Lemma 1) that a uniform distribution of sleep 

slots can give the optimal leakage energy consumption. Then, 

we present a method of determining sleep slot locations during 

runtime. 

Lemma 1: Given Na active slots and Ns sleep slots for a 

task which is running at a fixed clock frequency during active 

slots, the energy consumption is minimal when idle slots are 

evenly distributed across the task execution. 

Proof : Fig. 8(a) shows the temperature with respect to the 

number of consecutive active slots (Na). In Fig. 8(a), larger Na 

leads to higher operating temperature. Since leakage power is 

exponentially dependent on the temperature of active slots, the 

leakage energy with respect to Na, E(Na) is a convex function 

as shown in Fig. 8(b). Assume that idle slots are uniformly 

distributed as shown in Fig. 9(a). In this case, the same number 

of consecutive active slots (Nopt) are located between sleep 

slots and the leakage energy consumed by Nopt consecutive 

active slots in a period is E(Nopt) as shown in Fig. 8(b). 

Now we assume that idle slots are non-uniformly distributed 

as shown in Fig. 9(b), where the numbers of consecutive active 

slots of period i and period i + 1 are Nopt + x and Nopt − 

x, respectively. The sum of leakage energy of period i and 

period i + 1 is E(Nopt + x) + E(Nopt − x). Since E(Na) is a

convex function, E(Nopt + x)+E(Nopt − x) ≥ 2 · E(Nopt). The 

minimal energy consumption is achieved when equality holds. 

The equality holds only when E(Nopt + x) and E(Nopt − x) 

are the same, i.e., x = 0. Thus, the energy consumption is


minimal when idle slots are uniformly distributed across the 

task execution. 
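The convexity argument of Lemma 1 can be checked numerically with a toy model. The exponential leakage coefficients, the heating rate, and the assumption that every run of consecutive active slots starts from the same temperature are illustrative simplifications and do not come from the HotSpot-based model.

```python
import math

SLOT_S = 2e-3  # slot length of 2 ms, as in Section V

def leakage_energy(n_consecutive, t_start_c=60.0):
    """Leakage energy (J) of n consecutive active slots starting at t_start_c."""
    temp, energy = t_start_c, 0.0
    for _ in range(n_consecutive):
        leak_w = 2.0 * math.exp(0.02 * (temp - 60.0))  # toy exponential leakage model
        energy += leak_w * SLOT_S
        temp += 0.04 * (90.0 - temp)  # toy heating toward a 90 degC steady state
    return energy

n_opt = 8
for x in range(1, n_opt + 1):
    uneven = leakage_energy(n_opt + x) + leakage_energy(n_opt - x)
    even = 2 * leakage_energy(n_opt)
    print(x, uneven - even)  # positive for every x > 0
```

Because each additional consecutive active slot runs hotter than the previous one, the energy added by the longer run always exceeds the energy saved by the shorter run, so the even split is the cheapest.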

Lemma 1 shows that a uniform location of idle slots can give 

an optimal leakage energy consumption. However, there can be 

different possibilities of uniform location. For instance, if there 

are 8 active slots and 3 idle slots, we can have three different 

cases of uniform location.⁹ In order to simplify the solution

while maintaining the same uniform idle slot insertions, we 

present a heuristic solution. Fig. 7(b) illustrates the solution. 

If the current temperature exceeds a threshold temperature Tthr, 

we insert a sleep slot, hopefully between active slots, to drop 

the temperature as illustrated in Fig. 7(b). Tthr needs to be 

determined to minimize the maximal operating temperature. 

For instance, if Tthr is set too high, too few sleep slots are 

inserted between active slots. In such a case, most sleep slots 

may be used only after all the active slots thereby failing 

to contribute to lowering the temperature during the task 

execution.¹⁰ If Tthr is set too low, most sleep slots are inserted

in the beginning of the task causing the temperature to rise 

later in the remaining active slots after all available sleep slots 

are exhausted. To find the optimal Tthr which distributes sleep 

slots evenly between active slots, we perform a binary search 

sweeping between Tamb and Tmax. We set Tmax to 105 °C, which 

is the allowed maximum junction temperature of our processor 

model specified in [23]. 
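
The threshold heuristic and the search for Tthr can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the first-order thermal model in `next_temp`, its constants `alpha` and `t_active_ss`, and the search criterion (the lowest threshold for which the sleep-slot budget is not used up before the last active slot) are all hypothetical. Only the 50 °C ambient temperature and the 105 °C limit come from the text.

```python
T_AMB = 50.0   # ambient temperature used in the experiments (deg C)
T_MAX = 105.0  # maximum junction temperature quoted in the text (deg C)


def next_temp(temp, active, alpha=0.2, t_active_ss=100.0):
    # Hypothetical first-order model: during one slot the temperature moves a
    # fraction alpha toward the steady state of the current slot type.
    target = t_active_ss if active else T_AMB
    return temp + alpha * (target - temp)


def simulate(t_thr, n_active, t_init=T_AMB):
    # Apply the threshold rule with an unlimited sleep budget and report how
    # many sleep slots are inserted before the last active slot, plus the peak.
    temp, peak, sleeps, done = t_init, t_init, 0, 0
    while done < n_active:
        if temp > t_thr:
            temp = next_temp(temp, active=False)
            sleeps += 1
        else:
            temp = next_temp(temp, active=True)
            done += 1
        peak = max(peak, temp)
    return sleeps, peak


def find_t_thr(n_active, n_sleep, tol=0.1):
    # Binary search between T_AMB and T_MAX. The upper end always respects
    # the budget, so the returned threshold does too.
    lo, hi = T_AMB, T_MAX
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        sleeps, _ = simulate(mid, n_active)
        if sleeps > n_sleep:
            lo = mid   # budget would run out early: threshold too low
        else:
            hi = mid   # budget suffices: try a lower threshold
    return hi


t_thr = find_t_thr(n_active=8, n_sleep=3)
sleeps, peak = simulate(t_thr, n_active=8)
print(f"Tthr = {t_thr:.1f} C, sleep slots used = {sleeps}, peak = {peak:.1f} C")
```

The 8 active and 3 sleep slots mirror the example given for Lemma 1. A real flow would replace `next_temp` with the thermal simulator used in the experiments.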

VII. Temperature and Bin Probability-Aware 

Statistical Integrated DVFS and Power Gating 

The basic idea of incorporating the statistical information, 

i.e., runtime distribution into the integrated DVFS and PG 

is to allocate time budget (as given by the time-to-deadline) 

to lower index bins in proportion to their bin probabilities. 

We consider a bin as a virtual sub-task and the time budget 

of the bin as the time-to-deadline of the virtual sub-task. 

Thus, allocating more time budget to a low-index bin corresponds 

to assigning a longer time-to-deadline to the bin. Fig. 10 

illustrates, assuming a fixed execution cycle of a (virtual) task, 

the energy consumption per cycle vs. time-to-deadline (i.e., the 

given time budget) at the minimal-energy frequency, fopt (explained 

in Section VI). Increasing the time-to-deadline (i.e., granting more 

time budget) allows operation at a lower optimal frequency 

(fopt), thereby decreasing energy consumption by 

lowering the supply voltage. Note that the increased time-to-deadline 

reduces leakage power consumption as well as 

switching power consumption by lowering the supply voltage. 

This is because the reduced switching power consumption causes 

a smaller temperature increase and therefore less leakage power consumption 

than when the time-to-deadline is not extended. 

For our problem formulation (to be given in this section), 

we approximate Ecycle(fj) in Fig. 10 with a function, 

$a \cdot S^{-b} + c$, where S is the given time-to-deadline (i.e., time 

budget), and a, b, and c are fitting parameters depending on 

the target system conditions such as processor, heat sink, and 

ambient temperature. 

Fig. 10. Total energy consumption per cycle vs. time-to-deadline. 

9 The three cases are: 1) A, A, S, A, A, A, S, A, A, A, S; 2) A, A, A, S, 

A, A, S, A, A, A, S; and 3) A, A, A, S, A, A, A, S, A, A, S, where A and 

S represent active and sleep slots. 

10 In a multitask system, the sleep slots of previously executed tasks 

can affect the initial operating temperature, i.e., the power consumption of 

following tasks. In this paper, however, we focus on the solution to the single 

task problem. The extension of this paper to multitask systems will be our 

future work. 

Our strategy is to reduce the energy consumption of lower 

index bins by extending the time-to-deadlines this way. As 

mentioned in Section III, the increased time budget for lower 

index bins is borrowed from higher index bins, which increases 

the operating frequency and power consumption of higher 

index bins. However, since the bin probability of a lower index 

bin is higher than that of a higher index bin, the total expected 

energy consumption decreases if the probability-weighted energy 

reduction in lower index bins exceeds the probability-weighted 

energy increase in higher index bins. We formulate this 

problem of time budgeting as follows: 

Find $S_j$ for all $j = 1, 2, \ldots, B$, 

where $S_j$ is the time budget allocated to bin $j$ 

and $B$ is the number of bins, 

such that 

$$\mathrm{Cost} = \sum_{j=1}^{B} p_j \cdot E_{cycle}(f_j) = \sum_{j=1}^{B} p_j \cdot \left(a \cdot S_j^{-b} + c\right) \tag{7}$$

is minimized, subject to 

$$\sum_{j=1}^{B} S_j \le D \tag{8}$$

where D is the time-to-deadline. 

In (7), pj is the bin probability. Thus, pj ·Ecycle(fj) represents 

the expected energy consumption per cycle as contributed by 

bin j. Note that each bin has the same number of execution 

cycles. Thus, if the execution cycle is taken into account, 

then the function Cost in (7) becomes the expectation of the 

total energy consumption of the task. Because a and c are 

constants in (7), the function Cost in (7) can be replaced by


$\mathrm{Cost}^{*}$ as follows: 

$$\mathrm{Cost}^{*} = \sum_{j=1}^{B} p_j \cdot S_j^{-b}. \tag{9}$$

Note that, as shown in Fig. 10, Ecycle(fj) is monotonically 

decreasing as the time budget is increasing. b must be positive 

to let $S_j^{-b}$ be convex in (9). We can apply Jensen’s inequality 

[20] 11 to find the condition minimizing (9). To do that, we 

rewrite (9) as follows: 

$$\mathrm{Cost}^{*} = \sum_{j=1}^{B} \left(p_j^{-1/b} \cdot S_j\right)^{-b}. \tag{10}$$

According to Jensen’s inequality, $\mathrm{Cost}^{*}$ in (10) can be minimized 

when $p_j^{-1/b} \cdot S_j$ has the same value for all j’s, i.e., 

$p_j^{-1/b} \cdot S_j = p_{j+1}^{-1/b} \cdot S_{j+1}$ for all j’s. $S_j$ needs to satisfy the 

following relation: 

$$S_{j+1} = \sqrt[b]{\frac{p_{j+1}}{p_j}} \cdot S_j. \tag{11}$$

$\mathrm{Cost}^{*}$ has a lower bound given as follows: 

$$B \cdot \left(\frac{\sum_{j=1}^{B} p_j^{-1/b} \cdot S_j}{B}\right)^{-b}. \tag{12}$$

In order to calculate Sj using (11), we first need to determine 

S1, which is used to calculate the remaining Sj’s by iterative 

applications of (11). S1 is obtained by iterative binary subdivision, 

i.e., by sweeping S1 values over the interval [0, D] such 

that Cost in (7) is minimized while satisfying the constraint 

in (8). 
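
As a rough illustration of how relation (11) turns bin probabilities into time budgets, the sketch below computes all Sj at once. It assumes that constraint (8) is met with equality: under (11) every Sj scales linearly with S1 and the cost falls as budgets grow, so S1 can then be written in closed form instead of being found by search. The bin probabilities, the exponent b = 1.5, and the deadline of 71.4 ms are hypothetical values chosen for the example, and the function names are not from the paper.

```python
def allocate_budgets(p, b, deadline):
    # Relation (11) makes S_j proportional to p_j ** (1 / b); scaling the
    # weights so that they sum to the deadline satisfies (8) with equality.
    weights = [pj ** (1.0 / b) for pj in p]
    total = sum(weights)
    return [deadline * w / total for w in weights]


def cost_star(p, s, b):
    # Cost* from (9): sum of p_j * S_j ** (-b).
    return sum(pj * sj ** (-b) for pj, sj in zip(p, s))


def jensen_bound(p, s, b):
    # Lower bound from (12): B * (mean of p_j ** (-1/b) * S_j) ** (-b).
    n_bins = len(p)
    mean = sum(pj ** (-1.0 / b) * sj for pj, sj in zip(p, s)) / n_bins
    return n_bins * mean ** (-b)


p = [0.40, 0.30, 0.15, 0.10, 0.05]   # hypothetical bin probabilities
b, deadline_ms = 1.5, 71.4           # hypothetical fit exponent and deadline
s = allocate_budgets(p, b, deadline_ms)
print([round(sj, 2) for sj in s])
print(cost_star(p, s, b), jensen_bound(p, s, b))
```

With budgets chosen this way, the value of (9) coincides with the bound in (12), because every term p_j^{-1/b} · S_j takes the same value.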

VIII. Application of Design-Time Solution to 

Runtime 

The solution presented in Sections VI and VII, given the 

initial temperature and runtime distribution, yields a design-time 

solution to the problem given in Section V, i.e., the 

operating frequencies of active slots, and threshold temperature, 

Tthr (which determines the number and locations of 

sleep slots). In this section, we propose a runtime solution 

obtained by applying the pre-computed design-time solutions 

according to the current temperature. We prepare a lookup 

table (LUT) storing the pre-computed design-time results 

as shown in Fig. 11(a). Each entry in the LUT yields the 

operating frequency, f , and the threshold temperature, Tthr, for 

each bin according to the initial temperature, Tinit. If the LUT 

contains no entry exactly matching the current temperature, the entry 

with the nearest Tinit that is not lower than the current temperature is selected 

in order to prevent the operating temperature from violating 

the given maximum temperature constraint. For example, in 

Fig. 11, assume that the task starts with temperature 62 °C 

(≤70 °C). There is no exact match in the LUT of Fig. 11(a). 

Thus, the entry corresponding to Tinit = 70 °C is chosen. 
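
The selection rule can be sketched with a sorted list of Tinit keys. The key set [50, 70, 90] and the function name `select_entry` are assumptions made for illustration. The text only confirms that a 70 °C entry exists and that a start temperature of 62 °C maps to it.

```python
from bisect import bisect_left


def select_entry(lut_temps, t_current):
    # lut_temps: sorted T_init keys (deg C) of the pre-computed LUT entries.
    # Return the nearest key that is not lower than the current temperature.
    i = bisect_left(lut_temps, t_current)
    if i == len(lut_temps):
        raise ValueError("current temperature exceeds the hottest LUT entry")
    return lut_temps[i]


lut_temps = [50, 70, 90]             # hypothetical key set
print(select_entry(lut_temps, 62))   # no exact match: rounds up to 70
print(select_entry(lut_temps, 70))   # exact match: 70
```

Rounding up rather than to the nearest key is what keeps the choice conservative: the selected design-time solution was computed for a start temperature at least as hot as the actual one.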

11If f (x) is convex on the interval a


Fig. 12. Peak temperature violation check performed at the start of each 

slot. 

prepare Tss(fi) for each frequency level fi (19 levels in our 

experiment). 

When the estimated temperature Ti+1 violates the peak 

temperature constraint, we adjust the frequency level, fi, to 

a level fpt (peak threshold frequency). fpt can be obtained 

from the method of calculating fopt in Section VI.A by setting 

Tthr to Tmax. In other words, fpt is the maximum frequency 

which gives an optimal energy consumption without violating 

the peak temperature constraint. 

The number of entries in the LUT can affect the energy 

reduction obtained during runtime. Experimental results in the 

next section will show that only a single entry can support the 

temperature range between 30 °C and 90 °C with a negligible 

degradation in energy gain. 

IX. Experimental Results 

A. Experimental Setup 

Our experimental results are based on the data collected 

from the Penryn [21] processor, which is a dual-core with 

a unified L2 cache, manufactured in 45-nm high-k metal 

gate silicon process technology. The floorplanning and power 

decomposition of the processor are obtained from [21] and 

shown in Fig. 13. In our power model, we estimated the 

switching power based on the relationship between power, 

frequency and voltage, i.e., $P_s = C_s \cdot V_{dd}^2 \cdot f$. We estimated 

the effective capacitance, Cs, from the power values in the 

datasheet [23]. To be specific, we obtained the switching power 

value, Ps, from total and leakage power values at the junction 

temperature of 105 °C [23] and estimated Cs (Cs = 7.942 nF). 

Then, we applied it to the switching power estimation at 

different frequency and voltage levels. To account for the 

temperature-dependent leakage power consumption, we utilize 

a modified version of leakage power model proposed in [3]. 13 

Although our experiments are based on the Penryn processor, 

it should be noted that our solution is general and independent 

of the processor model. Fig. 14 shows the power values 

for various frequency and temperature levels in our power 

model. The frequency ranges from 0.6 GHz to 3.0 GHz with 

19 discrete levels. The runtime overhead of voltage/frequency 

transition is assumed to be 10 µs [34]. 
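
The switching-power model can be evaluated directly from the stated effective capacitance. In the sketch below, the 1.0 V and 0.5 V supply voltages and the even spacing of the 19 frequency levels are assumptions for illustration. The text gives only the capacitance, the frequency range, and the number of levels.

```python
C_S = 7.942e-9  # effective switching capacitance from the text (farads)


def switching_power(v_dd, f_hz, c_s=C_S):
    # P_s = C_s * V_dd^2 * f, in watts.
    return c_s * v_dd ** 2 * f_hz


# 19 levels from 0.6 GHz to 3.0 GHz; even spacing is an assumption.
levels_hz = [0.6e9 + i * (3.0e9 - 0.6e9) / 18 for i in range(19)]

# Hypothetical supply voltages; real voltage/frequency pairs are not given here.
print(switching_power(1.0, levels_hz[-1]))
print(switching_power(0.5, levels_hz[-1]))
```

Halving the supply voltage at a fixed frequency cuts the switching power to a quarter, which is why a longer time budget (lower frequency and voltage) pays off more than linearly.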

13 The gate leakage power is controlled by the property of high-k dielectric 

materials used in our processor model. In [22], high-k materials can reduce 

the gate leakage down to 0.2% and 1.1% for PMOS and NMOS, respectively. 

We scaled the gate leakage in [3] to 0.5% in our power estimation as an 

approximate average of the two cases. 

Fig. 13. (a) Floorplanning and (b) power decomposition of our processor 

model. 

Fig. 14. Power consumption of our power model according to clock frequency 

and operating temperature (40 °C, 60 °C, 80 °C, and 100 °C). 

When PG is activated in the processor, the voltage is 

lowered only to the point where both the state of core and 

cache can be retained. Thus, there is only a small wake-up 

time overhead caused by the state recovery, including the phase-locked 

loop turn-on time (15 µs in [23]). To eliminate the wake-up 

overhead when the processor is turned on, the turn-on operation 

can be started before the end of the sleep slot, earlier by the 

wake-up overhead. Thus, the processor is ready to 

execute when the sleep slot finishes and a new active slot 

starts. 

According to the product datasheet [23], we used a sleep 

state power consumption of 1.9 W, where both core and cache 

can retain their states. We set the slot size l to be 2 ms which 

is much longer than the power state transition time overhead 

and other typical threshold delays of power on-off transition 

[24].


Fig. 15. HotSpot coefficients for temperature simulation. (a) Side view of a 

typical chip package. (b) HotSpot parameter. 

For temperature calculation, we used HotSpot [11] as the 

thermal modeling tool, and modified its parameters according 

to our processor model. The coefficients related with the 

temperature model of the processor are given in Fig. 15. The 

ambient temperature Tamb is assumed to be 50 °C. We performed 

the temperature simulations using a sampling interval 

t of 0.1 ms. 

Experiments were performed with two real examples, i.e., 

H.264 decoders with different frame rates (14 f/s and 7 f/s in 

our experiments) and ray tracing. We also used a benchmark 

program, equake from SPEC2000. For ray tracing and equake, 

we assume the time-to-deadline according to a utilization 

ratio (i.e., WETC 14 /D = 1.13 and WETC/D = 0.70 in our 

experiment). We collected the runtime distribution from a performance 

profiling application programming interface [25] on 

Intel Core2Duo 7200 processor in LG xnote LW25 laptop. We 

used 2000 frames of input picture for H.264 decoder, 15 360 

pixels of input picture for ray tracing, and 130 simulation 

cycles for equake. The input picture for H.264 decoder was 

extracted from the movie Harry Potter. We divided the input 

data into two sets: training set (1000 frames for H.264, 7680 

pixels for ray tracing, and 65 simulation cycles for equake) 

and evaluation set (the other frames, pixels, and simulation 

cycles). We obtained design-time solutions with the training 

set and then obtained experimental results in this section 

by applying the design-time solution to the evaluation sets. 

Fig. 16 shows the runtime distribution of H.264 decoder 

and ray tracing [26]. The runtime distribution of H.264 (ray 

tracing) was obtained from per-frame (per-pixel) execution 

cycles collected during the execution with the training set. The 

WCECs of H.264 decoder and ray tracing are 6.0E+8 cycles 

and 2.0E+8 cycles, respectively. We assume the number of 

bins for both applications is five. Thus, each bin of H.264 

decoder and ray tracing contains 1.2E+8 cycles and 0.4E+8 

cycles, respectively. 

14 In this paper, we define WETC as the worst-case execution time at critical 

speed, i.e., WCEC divided by the critical speed. 

Fig. 16. Runtime distribution of real examples. (a) H.264 decoder. (b) Ray 

tracing. (c) Equake. 

B. Experimental Results 

We performed the experiment using two different versions 

of the proposed methods (T-OURS and TD-OURS) and three 

existing ones (PACE, T-DVFS, and TALK) as follows. 

PACE [19]: Only DVFS is applied while exploiting runtime 

distribution without considering operating temperature. 

Temperature-aware DVFS (T-DVFS) [9]: Only DVFS is applied 

considering steady state temperature without exploiting 

runtime distribution and PG. 

TALK [14]: Only PG is applied at the critical speed considering 

time-varying temperature without exploiting runtime 

distribution. 

T-OURS: Our first method (in Section VI) where only time-varying 

temperature is considered. 

TD-OURS: Our second method (in Section VII) considering 

both temperature variation and runtime distribution. 

Table I shows the result of comparison. 

Table II shows the comparison of energy consumption for 

H.264 video decoding, ray tracing, and equake. All the energy 

consumption data are normalized with respect to the energy 

consumption of T-DVFS method. Note that TALK is not 

applicable when the frequency needs to be set at a higher level


TABLE I 

Comparison of Our Methods With Existing Ones 

Method     DVFS   Power gating   Runtime dist.   Temperature 
PACE       O      X              O               X 
T-DVFS     O      X              X               O 
TALK       X      O              X               O 
T-OURS     O      O              X               O 
TD-OURS    O      O              O               O 

TABLE II 

Comparison of Energy Consumption Results for H.264 Decoder, 

Ray Tracing, and Equake 

Benchmark     T-DVFS (J)   PACE   TALK   T-OURS   TD-OURS 
WETC ≥ D (14 f/s for H.264) 
H.264         15.34        1.06   N/A    0.85     0.79 
Ray tracing   4.31         1.08   N/A    0.91     0.81 
Equake        122.47       1.07   N/A    0.89     0.78 

WETC


Fig. 19. Energy comparisons for H.264 decoder and ray tracing benchmark with different ambient temperature. The energy consumptions presented in this 

figure are normalized against T-DVFS. (a) H.264 decoder when WETC ≥ D. (b) H.264 decoder when WETC


TABLE V 

Increase in Energy Consumption for Varying Initial 

Temperatures 

Tinit (°C)            30     40     50     60     70     80     90 
Energy increase (%)   0.33   0.27   0.27   0.17   0.10   0.08   0.00 

Fig. 20. Temperature behavior comparison for executing H.264 with TD- 

OURS and runtime solution when initial temperature is 30 °C. (a) Temperature 

profiles during entire simulation. (b) Zoom-in of temperature profiles over 

time. 

Tinit from 30 °C to 80 °C with 10 °C step. Table V shows 

the results. In Table V, the first and second rows indicate the 

initial temperature and the maximum increase (%) of energy 

consumption for H.264 and ray tracing compared with the 

energy consumption at 90 °C, respectively. In Table V, the 

energy consumption increases only by up to 0.33% compared 

with the energy consumption at 90 °C. 

Fig. 20 shows the thermal behavior of IEC module which 

confirms the results in Table V. In Fig. 20, DT−30 (DT−90) 

represents the temperature profiles obtained by applying the 

design-time solution prepared with the initial temperature of 

30 °C (90 °C) to the case when the initial temperature is 30 °C 

for H.264. No significant difference in energy, i.e., less than 

0.33%, is observed across the values of Tinit. Thus, in our experiments, 

we need only the design-time solution obtained at 90 °C for the LUT. 

Fig. 20(b) shows a close-up view of the circled point in 

Fig. 20(a), the temperature profile during the initial period. 

As Fig. 20(b) shows, DT−90 gives a higher local maximum 

than DT−30 since DT−90 has a higher Tthr than DT−30. 

Such a difference comes from the difference in the initial 

temperatures, i.e., 30 °C and 90 °C. 

In our experiments, the LUT is managed as a normal 

data array and has only 11 entries to store optimal working 

frequencies and threshold temperatures of bins. Since four 

bytes are used for each entry, a total of 44-byte memory space 

is required for the LUT. The LUT is accessed only when a new 

bin starts. In that case, two entries are read from the LUT to 

determine working frequency and threshold temperature (Tthr) 

of a newly starting bin. Reading two entries can be executed 

with only one load instruction at the target processor. Thus, 

the LUT has a negligible overhead in terms of area, runtime 

and power consumption. 

X. Conclusion 

In this paper, we proposed a method of integrating DVFS 

and power gating in a temperature/runtime distribution-aware 

manner. The proposed method analytically assigns active 

(including frequency level) and sleep (including locations) 

states to each time slot according to temperature and runtime 

distribution for total energy reduction. The higher the temperature 

is, the more time budget is assigned to the beginning 

of task execution. The proposed method contributes to the 

overall energy saving by lowering the temperature and the 

leakage power consumption, especially, during the period of 

high execution probability, i.e., the beginning phase of task 

execution. In experiments with two real applications, H.264 

decoder and ray tracing and one benchmark program, equake, 

the proposed method yields 19.4%–27.2% of additional energy 

savings compared with existing methods. 

In future work, we will extend the proposed idea to applications 

running on multi/many-core for high-performance 

computing (with response time constraints) as well as mobile/home 

real-time systems. In addition, since active cooling 

is an effective method of temperature management, we will 

also work on the integration of active cooling and the proposed 

method considering the benefits and cost of active cooling 

and PG. 

References 

[1] Intel Homepage [Online]. Available: http://www.intel.com/products/ 

processor/core2duo/mobile/specifications.htm 

[2] Semiconductor Research Corporation. 2005 Packing Thrust Strategic 

Needs [Online]. Available: http://www.src.org/ 

[3] W. P. Liao, L. He, and K. M. Lepak, “Temperature and supply voltage 

aware performance and power modeling at micro-architecture level,” 

IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, 

pp. 1042–1053, Jul. 2005. 

[4] Google Data Center [Online]. Available: http://www.datacenterknowledge.com/ 

[5] A. Kumar, L. Shang, L.-S. Peh, and N. K. Jha, “HybDTM: A coordinated 

hardware-software approach for dynamic thermal management,” in Proc. 

DAC, 2006, pp. 548–553. 

[6] R. Rao, S. Vrudhula, C. Chakrabarti, and N. Chang, “An optimal 

analytical solution for processor speed control with thermal constraints,” 

in Proc. ISLPED, 2006, pp. 292–297. 

[7] J. Yang, X. Zhou, M. Chrobak, Y. Zhang, and L. Jin, “Dynamic thermal 

management through task scheduling,” in Proc. Int. Symp. ISPASS, 2008, 

pp. 191–201. 

[8] A. Coskun, T. Rosing, and K. Whisnant, “Temperature aware task 

scheduling in MPSoCs,” in Proc. DATE, 2007, pp. 1659–1664. 

[9] M. Bao, A. Andrei, P. Eles, and Z. Peng, “Temperature-aware 

voltage selection for energy minimization,” in Proc. DATE, 2008, 

pp. 1083–1086. 

[10] S. Zhang and K. S. Chatha, “System-level thermal aware design of 

applications with uncertain execution times,” in Proc. ICCAD, 2008, 

pp. 242–249.


[11] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, and D. Tarjan, 

“Temperature-aware microarchitecture: Modeling and implementation,” 

ACM Trans. Architect. Code Optim., vol. 1, no. 1, pp. 94–125, Mar. 

2004. 

[12] Y. Han, I. Koren, and C. M. Krishna, “Temptor: A lightweight runtime 

temperature monitoring tool using performance counters,” in Proc. 3rd 

Workshop TACS, Held Conjunct. ISCA-33, 2006. 

[13] A. Kumar, L. Shang, L.-S. Peh, and N. K. Jha, “System-level dynamic 

thermal management for high-performance microprocessors,” 

IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 1, 

pp. 96–108, Jan. 2008. 

[14] L. Yuan, S. Leventhal, and G. Qu, “Temperature-aware leakage minimization 

technique for real-time systems,” in Proc. ICCAD, 2006, pp. 

761–764. 

[15] Y. Liu, H. Yang, R. P. Dick, H. Wang, and L. Shang, “Thermal vs. energy 

optimization for DVFS-enabled processors in embedded systems,” in 

Proc. Int. Symp. ISQED, 2007, pp. 204–209. 

[16] L. Yuan and G. Qu, “ALT-DVFS: Dynamic voltage scaling with awareness 

of leakage and temperature for real-time systems,” in Proc. AHS, 

2007, pp. 660–670. 

[17] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, “Dynamic 

voltage scaling of supply and body bias exploiting software runtime 

distribution,” in Proc. DATE, 2008, pp. 242–247. 

[18] J. Kim, S. Yoo, and C.-M. Kyung, “Program phase and runtime 

distribution-aware online DVFS for combined Vdd/Vbb scaling,” in 

Proc. DATE, 2009, pp. 417–422. 

[19] J. R. Lorch and A. J. Smith, “Improving dynamic voltage scaling 

algorithm with PACE,” ACM SIGMETRICS Perform. Eval. Rev., vol. 

29, no. 1, pp. 50–61, Jun. 2001. 

[20] S. Krantz, S. Kress, and R. Kress, Jensen’s Inequality. Cambridge, MA: 

Birkhauser, 1999. 

[21] V. George, S. Jahagirdar, C. Tong, K. Smits, S. Damaraju, S. Siers, V. 

Naydenov, T. Khondker, S. Sarkar, and P. Singh, “Penryn: 45-nm next 

generation Intel Core 2 processor,” in Proc. ASSCC, 2007, pp. 14–17. 

[22] K. Mistry, C. Allen, C. Auth, B. Beattie, D. Bergstrom, M. Bost, M. 

Brazier, M. Buehler, A. Cappellani, R. Chau, C. H. Choi, G. Ding, K. 

Fischer, T. Ghani, R. Grover, W. Han, D. Hanken, M. Hattendorf, J. 

He, J. Hicks, R. Huessner, D. Ingerly, P. Jain, R. James, L. Jong, S. 

Joshi, C. Kenyon, K. Kuhn, K. Lee, H. Liu, J. Maiz, B. McIntyre, P. 

Moon, J. Neirynck, S. Pae, C. Parker, D. Parsons, C. Prasad, L. Pipes, 

M. Prince, P. Ranade, T. Reynolds, J. Sandford, L. Shifren, J. Sebastian, 

J. Seiple, D. Simon, S. Sivakumar, P. Smith, C. Thomas, T. Troeger, P. 

Vandervoorn, S. Williams, and K. Zawadzki, “A 45 nm logic technology 

with high-k + metal gate transistors, strained silicon, 9 Cu interconnect 

layers, 193 nm dry patterning, and 100% Pb-free packaging,” in Proc. 

IEDM, 2007, pp. 247–250. 

[23] Intel. (2009 Mar.). “Intel Core2 solo mobile processor and Intel Core2 

extreme mobile processor on 45-nm process datasheet” [Online]. Available: 

http://www.intel.com/ 

[24] H. Kim, H. Hong, H.-S. Kim, J.-H. Ahn, and S. Kang, “Total energy 

minimization of real-time tasks in an on-chip multiprocessor using 

dynamic voltage scaling efficiency metric,” IEEE Trans. Comput.-Aided 

Design Integr. Circuits Syst., vol. 27, no. 11, pp. 2088–2092, Nov. 2008. 

[25] PAPI [Online]. Available: http://icl.cs.utk.edu/papi 

[26] J. Bikker. “Raytracing: Theory & implementation” [Online]. Available: 

http://www.devmaster.net/articles/raytracing−series/part7.php 

[27] JEDEC. 2006 Failure Mechanisms and Models for Semiconductor 

Devices [Online]. Available: http://www.jedec.org 

[28] G. Corliss, “Which root does the bisection algorithm find?” SIAM Rev., 

vol. 19, no. 2, pp. 325–327, 1977. 

[29] N. Bansal and K. Pruhs, “Speed scaling to manage temperature,” in 

Proc. STACS, 2005, pp. 460–471. 

[30] J.-J. Chen, C.-M. Hung, and T.-W. Kuo, “On the minimization of the 

instantaneous temperature for periodic real-time tasks,” in Proc. RTAS, 

2007, pp. 236–248. 

[31] A. K. Coskun, T. T. Rosing, K. A. Whisnant, and K. C. Gross, “Static 

and dynamic temperature-aware scheduling for multiprocessor SoCs,” 

IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 9, pp. 1127– 

1140, Sep. 2008. 

[32] J. Srinivasan and S. V. Adve, “Predictive dynamic thermal management 

for multimedia applications,” in Proc. ICS, 2003. pp. 109–120. 

[33] W. Lee, K. Patel, and M. Pedram, “GOP-level dynamic thermal management 

in MPEG-2 decoding,” IEEE Trans. Very Large Scale Integr. 

Syst., vol. 16, no. 6, pp. 662–672, Jun. 2008. 

[34] J. S. Lee, K. Skadron, and S. W. Chung, “Predictive temperature-aware 

DVFS,” IEEE Trans. Comput., vol. 59, no. 1, pp. 127–133, Jan. 2010. 

Kyungsu Kang (S’06) received the B.S. degree 

from the Department of Electrical and Electronic Engineering, 

Kyungpook National University, Daegu, 

Korea, in 2003. Since 2003, he has been pursuing 

the unified course of the M.S. and Ph.D. degrees 

from the Department of Electrical Engineering and 

Computer Science, Korea Advanced Institute of Science 

and Technology, Daejeon, Korea. 

His current research interests include power and 

thermal management for 2-D/3-D chip multiprocessors. 

Jungsoo Kim (S’06) received the B.S. degree in 

electrical engineering from the Korea Advanced 

Institute of Science and Technology (KAIST), Daejeon, 

Korea, in 2005. Since 2005, he has been pursuing 

the unified course of the M.S. and Ph.D. degrees 

from the Department of Electrical Engineering and 

Computer Science, KAIST. 

His current research interests include dynamic 

power and thermal management, MPSoC design, and 

massive parallel processing. 

Sungjoo Yoo (M’09) received the B.S., M.S., and 

Ph.D. degrees in electronics engineering from Seoul 

National University, Seoul, Korea, in 1992, 1995, 

and 2000, respectively. 

He was a Researcher with TIMA Laboratory, 

Grenoble, France, from 2000 to 2004. He was a 

Senior and Principal Engineer with Samsung Electronics, 

Yongin, Gyeonggi, Korea, from 2004 to 

2008. He has been with the Pohang University of 

Science and Technology, Pohang, Korea, since 2008. 

His current research interests include low-power 

design and memory/storage architecture for embedded systems. 

Chong-Min Kyung (S’76–M’81–SM’99–F’08) received 

the B.S. degree in electronics engineering 

from Seoul National University, Seoul, Korea, in 

1975, the M.S. and Ph.D. degrees in electrical 

engineering from the Korea Advanced Institute of 

Science and Technology (KAIST), Daejeon, Korea, 

in 1977 and 1981, respectively. 

From 1981 to 1983, he was with Bell Telephone 

Laboratories, Murray Hill, NJ, as a Postdoctoral 

Research Fellow. Since he joined KAIST in 1983, 

he has been working on system-on-a-chip design and 

verification methodology, processor and graphics architectures for high-speed 

and/or low-power applications including mobile video codec. 

Dr. Kyung received the Most Excellent Design Award and Special Feature 

Award in the University Design Contest in the ASP-DAC in 1997 and 1998, 

respectively. He received the Best Paper Award in the 36th DAC held in 

New Orleans, LA, the 10th International Conference on Signal Processing 

Application and Technology, Orlando, FL, in 1999, and the 1999 International 

Conference on Computer Design, Austin, TX. He was General Chair of the 

Asian Solid-State Circuits Conference 2007, and ASP-DAC 2008. In 2000, he 

received the National Medal from the Korean government for his contributions 

to research and education in integrated circuit designs. He is a member of the 

National Academy of Engineering Korea and Korean Academy of Science 

and Technology. He is the Hynix Chair Professor at KAIST.
