Lecture Notes in Computer Science 4917

310 V.M. Weaver and S.A. McKee

Note that gcc uses an extremely large stack. By default Qemu only emulates a 512KB stack, but the -s command-line option enables at least 8MB of stack space, which allows all gcc benchmarks to run to completion.
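As a sketch of what such an invocation looks like (the benchmark binary and its arguments are placeholders, not taken from the paper), Qemu's user-mode -s option takes the stack size in bytes, so 8MB is 8388608:

```shell
# Run a statically linked benchmark under user-mode Qemu with an 8MB
# stack (-s takes the size in bytes; the default 524288 is 512KB).
# The binary name and input file are illustrative.
qemu-i386 -s 8388608 ./gcc_base input.i -o output.s
```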

2.4 Valgrind

Valgrind [9] is a dynamic binary instrumentation tool for the PowerPC, IA32, and AMD64 architectures. It was originally designed to detect application memory allocation errors, but it has developed into a generic and flexible DBI utility. Valgrind translates native processor code into a RISC-like intermediate code. Instrumentation occurs on this intermediate code, which is then recompiled back to the native instruction set. Translated blocks are cached.

Our BBV generation code is a plugin to Valgrind 3.2.3. By default, Valgrind instruments at a super-block level rather than the basic block level. A super-block has only one entrance, but can have multiple exit points. We use the --vex-guest-chase-thresh=0 option to force Valgrind to use basic blocks, although our experiments show that using super-blocks yields similar results.

Valgrind implements just-in-time translation of the program being run. We instrument every instruction to call our BBV generation routine. It would be more efficient to call the routine only once per block, but in order to work around some problems with the "rep" instruction prefix (described later) we must instrument every instruction. When calling our instruction routine, we look up the current basic block in a hash table to find a data structure that holds the relevant statistics. We increment the basic block counter and the total instruction count. If we finish an interval by overflowing the committed instruction count, we update BBV information and clear all counts. Valgrind runs from 5 (art) to 114 (vortex) times slower than native execution, making it the slowest of the tools we evaluate.
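The per-instruction bookkeeping described above can be sketched in Python (a simplified model, not the actual Valgrind plugin; the dictionary standing in for the hash table and the block addresses are illustrative, while the 100M-instruction interval follows the paper):

```python
# Simplified model of the BBV bookkeeping: each executed instruction is
# attributed to its basic block; when the committed-instruction count
# overflows the interval length, the basic block vector is emitted and
# all per-block counts are cleared.

INTERVAL = 100_000_000  # committed instructions per interval

class BBVCollector:
    def __init__(self, interval=INTERVAL):
        self.interval = interval
        self.block_counts = {}  # hash table: block address -> count
        self.instr_count = 0    # instructions committed this interval
        self.bbvs = []          # one basic block vector per interval

    def on_instruction(self, block_addr):
        """Called once per executed instruction, as the plugin does."""
        self.block_counts[block_addr] = self.block_counts.get(block_addr, 0) + 1
        self.instr_count += 1
        if self.instr_count >= self.interval:
            # Interval finished: record the BBV and clear all counts.
            self.bbvs.append(dict(self.block_counts))
            self.block_counts.clear()
            self.instr_count = 0
```

Counting once per instruction means each block's entry in the vector is proportional to the instructions executed in it, which matches the instruction-weighted vectors SimPoint consumes.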

3 Evaluation

To evaluate the BBV generation tools, we use the SPEC CPU2000 [15] and CPU2006 [16] benchmarks with full reference inputs. We compile the benchmarks on SuSE Linux 10.2 with gcc 4.1 and "-O2" optimization (except for vortex, which we compile without optimization because it crashes otherwise). We link binaries statically to avoid library differences on the machines we use to gather data. The choice to use static linking is not due to tool dependencies; all three tools handle both dynamic and static executables.
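A representative build command under these settings might look like the following (the source and output names are placeholders; only the flags come from the text):

```shell
# Build a benchmark as described: gcc 4.1, -O2, statically linked.
# vortex alone is built with -static but without -O2.
gcc -O2 -static -o benchmark benchmark.c
```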

We choose IA32 as our test platform because it is widely used and because all three tools support it. We use the Perfmon2 [5] interface to gather hardware performance counter results for the platforms described in Table 1.

The performance counters are set to write out the relevant statistics every 100M instructions. The data collected are used in conjunction with simulation points and weights generated by SimPoint to calculate estimated CPI. We calculate actual CPI for the benchmarks by using the performance counter data, and use this as a basis for our error calculations. Note that calculated statistics
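The error computation described here can be illustrated with a small Python sketch; the weights and CPI values below are made-up numbers, but the formula (estimated CPI as the SimPoint-weighted sum of the chosen intervals' CPIs, compared against the measured whole-run CPI) follows the text:

```python
# Estimated CPI from SimPoint: weighted sum of the CPI measured (via
# the performance counters) over each chosen simulation point. The
# relative error compares it against the CPI of the full run.

def estimated_cpi(weights, point_cpis):
    """weights: SimPoint weights (sum to 1); point_cpis: CPI per point."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * c for w, c in zip(weights, point_cpis))

def relative_error(estimate, actual):
    return abs(estimate - actual) / actual

# Illustrative numbers only:
est = estimated_cpi([0.25, 0.75], [1.2, 0.8])  # ~0.9
err = relative_error(est, actual=1.0)          # ~10% error
```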
