Lecture Notes in Computer Science 4917



V.M. Weaver and S.A. McKee

They conclude that SimPoint and SMARTS give the most accurate results. Over
70% of the previous 10 years of HPCA, ISCA, and MICRO papers (ending in
2005) use reduced simulation methods that are less accurate. Most remaining
papers use full input sets. Sampling is thus an under-utilized technique that can
greatly increase the breadth and accuracy of computer architecture research.

Collecting data needed by SimPoint is difficult and time consuming; we
present two tools to more easily generate the Basic Block Vectors (BBVs) that
SimPoint needs. Our tools greatly expand the platforms for which BBVs can
be generated, including a number of embedded platforms. We implement the
tools using dynamic binary instrumentation (DBI), a technique that generates
BBVs much faster than simulation. DBI tools are easier to use than simulators,
removing many barriers to wider SimPoint use. Features inherent in the tools we
extend make it possible to collect data that previous tools cannot. This includes
creating cross-platform BBV files (e.g., generating MIPS BBVs from MIPS binaries
on an IA32 host), as well as collecting BBVs that include operating system
information along with normal user-space information.
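
As an illustration of what a BBV records, the following minimal Python sketch splits a trace of executed basic blocks into fixed-size intervals and counts the instructions attributed to each block. It is not the code of the Qemu or Valgrind tools; the names `build_bbvs` and `format_bbv`, the trace representation, and the text layout printed by `format_bbv` (modeled on the `T:id:count` lines commonly used for SimPoint frequency vectors) are assumptions made for this example.

```python
from collections import Counter


def build_bbvs(trace, interval_size):
    """Group a basic-block trace into per-interval instruction counts.

    trace: iterable of (block_id, instruction_count) pairs, one pair per
           executed basic block (a hypothetical stand-in for what an
           instrumentation tool would observe at run time).
    interval_size: number of instructions per interval.
    Returns a list of dicts mapping block_id -> instructions executed.
    """
    bbvs = []
    current = Counter()
    executed = 0
    for block_id, n_instr in trace:
        # Weight each block entry by the number of instructions in the block.
        current[block_id] += n_instr
        executed += n_instr
        if executed >= interval_size:
            # Interval is full: emit its vector and start a new one.
            bbvs.append(dict(current))
            current = Counter()
            executed = 0
    if current:
        # Keep the final, partially filled interval.
        bbvs.append(dict(current))
    return bbvs


def format_bbv(bbv):
    """Render one vector as a text line (assumed layout, see lead-in)."""
    return "T" + " ".join(f":{bid}:{count}" for bid, count in sorted(bbv.items()))


if __name__ == "__main__":
    # Tiny made-up trace: (block id, instructions in that block).
    demo_trace = [(1, 10), (2, 5), (1, 10), (3, 20), (2, 5)]
    for vector in build_bbvs(demo_trace, interval_size=25):
        print(format_bbv(vector))
```

A real tool gathers the same counts inside the instrumentation framework as the program runs, with intervals that are many orders of magnitude longer than in this toy trace.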

We validate the generated BBVs and compare them against the PinPoint [10]
BBVs generated by the Pin utility. We validate all three methods using hardware
performance counters while running the SPEC CPU2000 [15] and SPEC
CPU2006 [16] benchmark suites on a variety of 32-bit Intel Linux systems. Our
website contains source code for our Qemu and Valgrind modifications.

2 Generating Simulation Points

SimPoint exploits phase behavior in programs. Many applications exhibit cyclic
behavior: code executing at one point in time behaves similarly to code running
at some other point. Entire program behavior can be approximated by modeling
only a representative set of intervals (in our case, simulation points or SimPoints).
Figures 1, 2, and 3 show examples of program phase behavior at a granularity
of 100M instructions; these are captured using hardware performance counters
on the CPU2000 benchmarks. Each figure shows two metrics: the top is L1 D-Cache
miss rate, and the bottom is cycles per instruction (CPI). Figure 1 shows
twolf, which exhibits almost completely uniform behavior. For this type of program,
one interval is enough to approximate whole-program behavior. Figure 2
shows the mcf benchmark, which has more complex behavior. Periodic behavior
is evident: representative intervals from the various phases can be used to approximate
total behavior. The last example, Figure 3, shows the extremely complex
behavior of gcc running the 200.i input set. Few patterns are apparent;
this type of program is difficult to approximate with the SimPoint methodology
(smaller phase intervals are needed to recognize patterns, and variable-size
phases are possible, but choosing appropriate interval lengths is non-trivial). We
run the CPU2000 benchmarks on nine implementations of architectures running
the IA32 ISA, finding that phase behavior is consistent across all platforms when
using the same binaries, despite large differences in hardware process and design.
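
The idea of approximating a whole run from a few representative intervals can be sketched in Python as follows. This is a simplified illustration and not the SimPoint implementation: it clusters normalized BBVs with a plain k-means, picks the interval nearest each centroid as the representative, and weights it by cluster size. The function names, the hand-chosen `k`, the deterministic farthest-point initialization, and the sample data are all assumptions made for this example.

```python
import math


def normalize(bbv, dims):
    """Turn a block_id -> count dict into a vector whose entries sum to 1."""
    total = sum(bbv.values())
    return [bbv.get(d, 0) / total for d in dims]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(distance(p, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def pick_simulation_points(bbvs, k):
    """Return (interval_index, weight) pairs, one per non-empty cluster."""
    dims = sorted({d for bbv in bbvs for d in bbv})
    points = [normalize(bbv, dims) for bbv in bbvs]
    labels, centroids = kmeans(points, k)
    chosen = []
    for j in range(k):
        members = [i for i, l in enumerate(labels) if l == j]
        if members:
            # Representative: the interval closest to the cluster centroid.
            rep = min(members, key=lambda i: distance(points[i], centroids[j]))
            chosen.append((rep, len(members) / len(points)))
    return chosen


def estimate_metric(chosen, per_interval_metric):
    """Weighted estimate of a whole-program metric such as CPI."""
    return sum(weight * per_interval_metric[i] for i, weight in chosen)


if __name__ == "__main__":
    # Made-up BBVs for six intervals showing two distinct phases.
    bbvs = [{1: 90, 2: 10}, {1: 88, 2: 12}, {3: 100},
            {3: 90, 1: 10}, {3: 98, 1: 2}, {1: 91, 2: 9}]
    cpi = [1.0, 1.1, 3.0, 2.8, 3.1, 0.9]  # made-up per-interval CPI
    points = pick_simulation_points(bbvs, k=2)
    print(points, estimate_metric(points, cpi), sum(cpi) / len(cpi))
```

For a uniform program such as twolf, a single cluster would suffice under this scheme, whereas a program with the irregular behavior described for gcc would need many clusters or shorter intervals before the weighted estimate tracks the measured value.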
