Lecture Notes in Computer Science 4917

216 N. AbouGhazaleh et al.

Simultaneous increase in IPC and L2 cache access (rule 1): Our approach reacts to the first positive feedback case by reducing the core speed rather than increasing it, as in the independent policy. This decision is based on the observation that the increase in IPC was accompanied by an increase in the number of L2 cache accesses. This increase may indicate the start of a program phase with high memory traffic. Hence, we preemptively reduce the core speed to avoid overloading the L2 cache domain with excess traffic. In contrast, increasing the core speed would exacerbate the load in both domains. We choose to decrease the core speed rather than keeping it unchanged to save core energy, especially given the likelihood of longer core stalls due to the expected higher L2 cache traffic.

Simultaneous decrease in IPC and L2 cache access (rule 5): We target the second undesired positive feedback scenario, where the independent policy decreases both the core and cache speeds. From observing the cache workload, we deduce that the decrease in IPC is not due to higher L2 traffic. Thus, longer core stalls are a result of local core activity such as branch misprediction. Hence, increasing or decreasing the core speed may not eliminate the source of these stalls; by doing so, we risk unnecessarily increasing the application's execution time or energy consumption. We therefore choose to maintain the core speed without any change in this case, to break the positive feedback scenario without hurting delay or energy.
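The two rules above can be summarized as a decision over the observed changes in IPC and L2 accesses. The following minimal sketch illustrates that structure; the function name, threshold-free sign tests, and the fallback for the unlisted cases are illustrative assumptions, not the authors' actual policy parameters.

```python
def integrated_core_decision(delta_ipc: float, delta_l2_accesses: float) -> str:
    """Sketch of the integrated policy's core-domain reaction to the
    changes in IPC and L2 cache accesses over a monitoring interval.
    (Illustrative only; the paper's rule table has more cases.)"""
    if delta_ipc > 0 and delta_l2_accesses > 0:
        # Rule 1: rising IPC with rising L2 traffic may signal a
        # memory-heavy phase, so slow the core preemptively instead of
        # speeding it up as the independent policy would.
        return "decrease"
    if delta_ipc < 0 and delta_l2_accesses < 0:
        # Rule 5: falling IPC without extra L2 traffic points to local
        # core stalls (e.g., branch mispredictions); a speed change
        # cannot remove them, so hold the core speed to break the
        # positive-feedback loop.
        return "hold"
    # Remaining cases: assumed fallback to IPC-driven scaling.
    return "increase" if delta_ipc > 0 else "decrease"
```

Only the two positive-feedback cells differ from naive IPC-driven scaling; the sketch makes that contrast explicit.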

5 Evaluation

In this section, we evaluate the efficacy of our integrated DVS policy, which considers domain interactions, at reducing a chip's energy and energy-delay product. We use the Simplescalar and Wattch architectural simulators with an MCD extension by Zhu et al. [9] that models inter-domain synchronization events and speed scaling overheads. To model the MCD design in Figure 1, we altered the simulator kindly provided by Zhu et al. by merging the different core domains into a single domain and separating the L2 cache into its own domain. In the independent DVS policy, we monitor the instruction fetch queue to control the core domain, and the number of L2 accesses to control the L2 cache domain.
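The independent policy's two monitors can be sketched as separate controllers, one per domain. The occupancy thresholds and speed-step names below are hypothetical placeholders for illustration; the paper's actual parameters differ.

```python
def independent_core_decision(fetch_queue_occupancy: float,
                              low: float = 0.25, high: float = 0.75) -> str:
    """Sketch: scale the core domain from instruction fetch queue
    occupancy. A full queue suggests the core is the bottleneck; a
    near-empty queue suggests it can be slowed. Thresholds are assumed."""
    if fetch_queue_occupancy > high:
        return "increase"
    if fetch_queue_occupancy < low:
        return "decrease"
    return "hold"

def independent_l2_decision(l2_accesses: int, prev_l2_accesses: int) -> str:
    """Sketch: scale the L2 cache domain with the trend in the number
    of L2 accesses per interval."""
    if l2_accesses > prev_l2_accesses:
        return "increase"
    if l2_accesses < prev_l2_accesses:
        return "decrease"
    return "hold"
```

Because each controller sees only its own domain, the two can reinforce each other; the positive-feedback cases addressed by rules 1 and 5 arise exactly when both move in the same direction.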

Since our goal is to devise a DVS policy for an embedded processor with MCD extensions, we use Configuration A from Table 3 as representative of a simple embedded processor (Simplescalar's StrongArm configuration [10]). We use the MiBench benchmarks with the long input datasets. Since MiBench applications are relatively short, we fast-forward only 500 million instructions and simulate the following 500 million instructions or until benchmark completion.

To extend our evaluation and check whether our policy generalizes to different arenas (in particular, higher-performance processors), we also use the SPEC2000 benchmarks and a high-end embedded processor [9] (see Configuration B in Table 3). We run the SPEC2000 benchmarks using the reference data set. We use the same execution window and fast-forward amount (500M instructions) for uniformity.

Our goal is twofold. First, to show the benefit of accounting for domain interactions, we compare our integrated DVS policy with the independent policy described in Section 3. For a fair comparison, we use the same policy parameters and thresholds used by
