Lecture Notes in Computer Science 4917

Using Dynamic Binary Instrumentation

Table 1. Machines used

machine    processor            memory  L1 I/D cache    L2/L3 cache  performance counters used
nestle     400MHz Pentium II    256MB   16KB/16KB       512KB        inst_retired, cpu_clk_unhalted
spruengli  550MHz Pentium III   512MB   16KB/16KB       512KB        inst_retired, cpu_clk_unhalted
itanium    800MHz Itanium       1GB     16KB/16KB       96KB/3MB     ia32_inst_retired, cpu_cycles
chocovic   1.66GHz Core Duo     1GB     32KB/32KB       1MB          instructions_retired, unhalted_core_cycles
milka      1.733GHz Athlon MP   512MB   64KB/64KB       256KB        retired_instructions, cpu_clk_unhalted
gallais    1.8GHz Pentium 4     256MB   12Kμops/16KB    256KB        instr_retired:nbogusntag, global_power_events:running
jennifer   2GHz Athlon64 X2     1GB     64KB/64KB       512KB        retired_instructions, cpu_clk_unhalted
sampaka12  2.8GHz Pentium 4     2GB     12Kμops/16KB    512KB        instr_retired:nbogusntag, global_power_events:running
domori25   3.46GHz Pentium D    4GB     12Kμops/16KB    2MB          instr_retired:nbogusntag, global_power_events:running

are ideal, with full warmup. If we were analyzing via a simulation, the results would likely vary in accuracy depending on how architectural state is warmed up after fast-forwarding between simulation points. We use SimPoint version 3.2, the newest version from the SimPoint website, to generate our simulation points.

3.1 The Rep Prefix<br />

When validating against actual hardware, total retired instruction counts closely match Pin results, but Qemu and Valgrind results diverge on certain benchmarks. We find the cause of this problem to be the IA32 rep prefix. This prefix appears before string instructions (which typically implement a memory operation followed by a pointer auto-increment). The prefix causes the string instruction to repeat, decrementing the ecx register until it reaches zero. A naive implementation of the rep prefix treats each repetition as a committed instruction. In actual hardware, this instruction is grouped in multiples of 4096, so only every 4096th repetition counts as one committed instruction. The performance counters and Pin both show this behavior. Our Valgrind and Qemu plugins are modified to compensate for this, so that we achieve consistent committed instruction counts across all of the BBV generators and actual hardware.
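The difference between the two counting models can be illustrated with a minimal Python sketch. The function names are hypothetical and this is not the plugin code. It assumes that a rep-prefixed instruction retires once per started group of 4096 repetitions, and at least once even when ecx is zero. The text does not specify these boundary cases, so treat them as assumptions.

```python
REP_GROUP = 4096  # grouping size stated in the text


def naive_count(repetitions: int) -> int:
    """Naive model: every repetition counts as one committed instruction."""
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    return repetitions


def grouped_count(repetitions: int, group: int = REP_GROUP) -> int:
    """Assumed hardware-style model: one committed instruction per started
    group of `group` repetitions, and at least one overall (assumption)."""
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    started_groups = -(-repetitions // group)  # ceiling division
    return max(1, started_groups)


def overcount(repetitions: int) -> int:
    """How many instructions the naive model adds relative to the grouped one."""
    return naive_count(repetitions) - grouped_count(repetitions)


if __name__ == "__main__":
    for ecx in (1, 4096, 4097, 10000):
        print(ecx, naive_count(ecx), grouped_count(ecx), overcount(ecx))
```

Under these assumptions, a tool that counts naively could subtract `overcount(ecx)` for each rep-prefixed instruction to bring its totals in line with the grouped model.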

3.2 The Art Benchmark<br />

Under Valgrind, the art floating point benchmark finishes with half the number of instructions committed by actual hardware. Valgrind uses 64-bit floating point arithmetic for portability reasons, but by default on Linux IA32, programs use 80-bit floating point operations. The art benchmark unwisely uses the “==” C operator to compare two floating point numbers, and due to rounding errors between the 80-bit and 64-bit versions, the 64-bit version can finish early, while still generating the proper reference output.
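The effect of an exact equality test on loop length can be sketched in Python. This is a hypothetical convergence loop, not the art benchmark's code. Python's standard library has no 80-bit floats, so the sketch substitutes a 64-bit versus 32-bit comparison for the 80-bit versus 64-bit case described above. The helper names and the series being summed are assumptions chosen for illustration.

```python
import struct


def round_to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def keep_float64(x: float) -> float:
    """Leave the value at Python's native 64-bit precision."""
    return x


def iterations_until_equal(round_fn, limit: int = 1000):
    """Sum 1 + 1/2 + 1/4 + ... and stop when `==` says the total is unchanged.

    `round_fn` is applied after every operation to emulate a narrower format.
    Returns (iterations executed, final total).
    """
    total = round_fn(1.0)
    term = 1.0
    for i in range(1, limit + 1):
        term = round_fn(term / 2.0)
        new_total = round_fn(total + term)
        if new_total == total:  # exact comparison, as with C's "=="
            return i, total
        total = new_total
    return limit, total


if __name__ == "__main__":
    narrow_iters, narrow_total = iterations_until_equal(round_to_float32)
    wide_iters, wide_total = iterations_until_equal(keep_float64)
    print("32-bit:", narrow_iters, narrow_total)
    print("64-bit:", wide_iters, wide_total)
```

In this sketch the narrower format exits the loop after fewer iterations while both totals agree to well within the narrower format's precision, which mirrors how a lower-precision run can execute fewer instructions and still produce acceptable output. Comparing against an explicit tolerance that is coarser than both precisions, instead of using `==`, would make the exit point in this sketch independent of the arithmetic width.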
