21.11.2014 Views

Itanium Performance Insights from the IMPACT Compiler

Itanium Performance Insights from the IMPACT Compiler

Itanium Performance Insights from the IMPACT Compiler

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Hot Chips 13<br />

August 21, 2001<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong><br />

<strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

J. Sias, M. Merten, E. Nystrom, R. Barnes,<br />

C. Shannon, J. Matarazzo, S. Ryoo,<br />

J. Olivier, W. Hwu<br />

<strong>IMPACT</strong> Research Group<br />

Coordinated Science Laboratory<br />

University of Illinois at Urbana-Champaign<br />

http://www.crhc.uiuc.edu/<strong>IMPACT</strong>/<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

<strong>Itanium</strong> Development History<br />

August 21, 2001<br />

Intel and Hewlett Packard<br />

Intel/HP<br />

alliance<br />

input<br />

<strong>from</strong><br />

UIUC<br />

Intel<br />

begins<br />

prototype<br />

compiler<br />

based on<br />

<strong>IMPACT</strong><br />

Intel/HP<br />

announce<br />

<strong>Itanium</strong><br />

Intel<br />

commercially<br />

licenses<br />

<strong>IMPACT</strong><br />

Intel donates<br />

prototype<br />

compiler and<br />

system<br />

to <strong>IMPACT</strong><br />

Research<br />

Intel/HP<br />

publically<br />

release<br />

<strong>Itanium</strong><br />

processor &<br />

compilers<br />

‘87<br />

‘95<br />

‘96 ‘97 ‘98 ‘99 ‘00 ‘01<br />

‘02<br />

<strong>IMPACT</strong><br />

ILP<br />

compiler<br />

born<br />

Research<br />

Partnership<br />

with HP Labs:<br />

predication &<br />

speculation<br />

Advanced<br />

control flow,<br />

predication &<br />

speculation<br />

balancing<br />

research<br />

Focused<br />

<strong>Itanium</strong><br />

development<br />

begins at<br />

<strong>IMPACT</strong><br />

Research<br />

General<br />

spec.<br />

support<br />

add to<br />

linux<br />

kernel<br />

University of Illinois at Urbana-Champaign<br />

Results<br />

comparable<br />

to Intel ecc<br />

achieved<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

<strong>Itanium</strong> Architecture Overview<br />

August 21, 2001<br />

• <strong>Itanium</strong> design goal: enhance scalability of parallelism by<br />

moving complex decisions to <strong>the</strong> compiler<br />

– Bundling: enables static scheduling by communicating instruction<br />

parallelism<br />

– Predication: allows compiler to optimize across multiple paths by<br />

providing an alternative to control flow<br />

– Speculation support: allows compiler to select specific instructions<br />

for early execution<br />

Original Control Speculation Predication<br />

...<br />

BR<br />

If addr != 0 If addr == 0<br />

r = [addr] ...<br />

r = [addr]<br />

...<br />

BR<br />

If addr != 0 If addr == 0<br />

If p: use r<br />

r = [addr]<br />

...<br />

p =<br />

(addr != 0)<br />

If !p: ...<br />

use r<br />

use r<br />

...<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

<strong>Itanium</strong> Compilation Landscape<br />

August 21, 2001<br />

• Increased reliance on <strong>the</strong> compiler for performance<br />

– Explicit control of <strong>the</strong> architecture: realities of modern<br />

microarchitecture have become visible at software level<br />

– Particular problems: effects of runtime uncertainty<br />

• Control resolution, variable memory latency, etc.<br />

– Solutions <strong>from</strong> EPIC/VLIW research<br />

• Memory disambiguation, profiling<br />

• Static scheduling, control speculation, predication<br />

Applicability<br />

Public/<br />

Proprietary<br />

Peephole<br />

opti level<br />

ILP opti<br />

level<br />

Extensibility<br />

gcc<br />

Very High<br />

Public<br />

Low<br />

Very Low<br />

Low<br />

ecc<br />

High<br />

Proprietary<br />

High<br />

High<br />

High<br />

<strong>IMPACT</strong><br />

Medium<br />

Future Public<br />

Medium<br />

Very High<br />

Very High<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

August 21, 2001<br />

<strong>IMPACT</strong> <strong>Compiler</strong><br />

• <strong>IMPACT</strong> compiler supports ILP compilation and research<br />

– Extensible framework for easy implementation of new optimizations<br />

– Versatile and pervasive intermediate representation (IR)<br />

– Comprehensive predicate-aware dataflow and predicate analyses<br />

– Direct analysis <strong>from</strong> IR at most stages of compilation<br />

– Advanced interprocedural pointer analysis<br />

Profiling<br />

Machine Description Relied<br />

Upon Throughout<br />

Prepass<br />

Scheduling<br />

Function<br />

Inlining<br />

Traditional<br />

Optimization<br />

Predicate<br />

Formulation<br />

Predicate-Aware<br />

Classical and<br />

ILP Optimization<br />

Instruction<br />

Selection<br />

Register<br />

Allocation<br />

Interprodecural<br />

Pointer Analysis<br />

Early Utilization of<br />

Predication<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

Many Optimizations<br />

Utilize Control<br />

Speculation<br />

Meticulous Static<br />

Scheduling<br />

Postpass<br />

Scheduling<br />

<strong>IMPACT</strong>


Hot Chips 13<br />

SPECcpu2000 Ratio Comparison<br />

August 21, 2001<br />

164.gzip<br />

500<br />

256.bzip2<br />

300.twolf<br />

400<br />

300<br />

200<br />

100<br />

0<br />

175.vpr<br />

impact<br />

ecc<br />

gcc<br />

181.mcf<br />

255.vortex<br />

186.crafty<br />

254.gap<br />

197.parser<br />

Dual processor HP i2000 <strong>Itanium</strong> 800 MHz with 2MB L3 cache<br />

linux kernel 2.4.7 with PMC patch, gcc 2.9.6, ecc 5.0.1 beta<br />

(ecc measurements not official Intel results)<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

Cycle Accounting Breakdown<br />

August 21, 2001<br />

Execution time / SPEC reference time<br />

0.7<br />

0.6<br />

0.5<br />

0.4<br />

0.3<br />

0.2<br />

0.1<br />

Pipeline flush<br />

Memory<br />

Dependency<br />

Unstalled execution<br />

0<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

164 175 181 186 197 254 255 256 300<br />

Measured machine execution cycles using performance monitoring counters<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

<strong>Performance</strong> of Memory Hierarchy<br />

August 21, 2001<br />

Satisfied accesses x latency (1E12)<br />

1.8<br />

1.6<br />

1.4<br />

1.2<br />

1.0<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0.0<br />

Main memory<br />

(est. 80 cycles)<br />

L3 cache<br />

L2 cache<br />

L1D cache<br />

L1I cache (demand)<br />

L1I cache (prefetch)<br />

INT REQ<br />

L1 I Cache<br />

16KB<br />

4-way<br />

32B lines<br />

2 cycles<br />

L1 D<br />

Cache<br />

16KB<br />

4-way<br />

32B lines<br />

2 cycles<br />

FP REQ<br />

MISS<br />

MISS<br />

L2 Unified<br />

Cache<br />

96KB<br />

6-way<br />

64B lines<br />

6 cycles int<br />

9 cycles fp<br />

MISS<br />

L3 Unified<br />

off chip, on<br />

package<br />

cache<br />

2MB<br />

21 cycles int<br />

24 cycles fp<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

164 175 181 186 197 254 255 256 300<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

Approaches to Control Speculation<br />

August 21, 2001<br />

General Speculation<br />

Write Value Into Register<br />

Hit<br />

Valid<br />

Inexpensive<br />

Moderate<br />

Expensive<br />

Result<br />

ld.s<br />

Check<br />

DTLB<br />

NaT<br />

Page<br />

Walk<br />

VHPT<br />

OS<br />

Write NaT Into Register<br />

Check<br />

Page<br />

Table<br />

Not Valid<br />

Missing<br />

Load<br />

Page<br />

Except on<br />

Consumption<br />

Sentinel Speculation<br />

Write Value Into Register<br />

ld.s<br />

Check<br />

DTLB<br />

Write NaT<br />

or Value<br />

Into<br />

Register<br />

chk.s<br />

NaT<br />

ld<br />

Check<br />

DTLB<br />

Hit<br />

NaT<br />

Page<br />

Walk<br />

VHPT<br />

OS<br />

Check<br />

Page<br />

Table<br />

Valid<br />

Missing<br />

Not Valid<br />

Load<br />

Page<br />

Except<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

Architecture and OS Support<br />

August 21, 2001<br />

• IPF allows OS to mix and match general and sentinel<br />

speculation models<br />

• Linux kernel modifications for general speculation<br />

– Basic support (system-wide general speculation): modify Default<br />

Control Register (DCR) not to defer transparent exceptions<br />

– Basic support caused page fault on every speculative NULL<br />

pointer access<br />

– Solution: install page 0 translation as a NaT page<br />

– Advanced support (selective general speculation): Bit in binary<br />

header indicates presence of recovery code. If no recovery code,<br />

ED bits in page table entries are cleared, overriding <strong>the</strong> DCR<br />

setting, so that no serviceable faults are deferred<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

August 21, 2001<br />

Cycle Accounting Breakdown Revisited<br />

Execution time / SPEC reference time<br />

0.7<br />

0.6<br />

0.5<br />

0.4<br />

0.3<br />

0.2<br />

0.1<br />

Pipeline flush<br />

Memory<br />

Dependency<br />

Unstalled execution<br />

0<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

164 175 181 186 197 254 255 256 300<br />

Measured machine execution cycles using performance monitoring counters<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

August 21, 2001<br />

Hyperblock Transformation<br />

CB6<br />

CB5<br />

Example <strong>from</strong> 164.gzip deflate()<br />

CB10<br />

P<br />

Predicate<br />

Definition<br />

CB93<br />

CB94<br />

CB133<br />

CB140<br />

• Commonly executed code<br />

pulled toge<strong>the</strong>r into a<br />

straight-line sequence<br />

P<br />

P<br />

P<br />

P<br />

P<br />

P<br />

Implicit NOP<br />

Explicit NOP<br />

CB166<br />

CB141<br />

CB142<br />

CB143<br />

• Serial branch execution<br />

replaced with parallel<br />

predicate evaluation<br />

P<br />

P<br />

P<br />

P<br />

CB157<br />

CB158<br />

• Compromise between ILP<br />

and code size<br />

P<br />

P<br />

P<br />

P<br />

P<br />

P<br />

P<br />

66%<br />

1%<br />

CB160<br />

CB159<br />

– Bundling reduces explicit<br />

horizontal NOPs<br />

P<br />

P<br />

0%<br />

CB165<br />

CB168<br />

– Interlocking reduces explicit<br />

vertical NOPs<br />

P<br />

P<br />

0%<br />

33%<br />

CB181<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

Branch Effects of ILP Transformations<br />

August 21, 2001<br />

• ILP compilation eliminates branches<br />

– Predication and branch combining<br />

– Loop unrolling<br />

1.0<br />

0.9<br />

0.8<br />

0.7<br />

Taken mispredict<br />

Not-taken mispredict<br />

Taken<br />

Not-taken<br />

• Reduction in total dynamic branches<br />

• Proportionally fewer taken branches<br />

• Reduction in mispredicted branches<br />

0.6<br />

0.5<br />

0.4<br />

0.3<br />

0.2<br />

0.1<br />

0.0<br />

basic ILP basic ILP basic ILP<br />

164.gzip 175.vpr 255.vortex<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

Comparison of Branch Behavior<br />

August 21, 2001<br />

7.E+10<br />

6.E+10<br />

5.E+10<br />

Not-taken incorrect<br />

Taken incorrect<br />

Taken correct<br />

Not-taken correct<br />

4.E+10<br />

3.E+10<br />

2.E+10<br />

1.E+10<br />

0.E+00<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

gcc<br />

ecc<br />

impact<br />

Branches<br />

164 175 181 186 197 254 255 256 300<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

Challenges and Future Directions<br />

August 21, 2001<br />

• Profile dependence and accuracy<br />

– Getting profile into build systems<br />

– Transparent runtime data collection and re-optimization, making more use of<br />

<strong>the</strong> <strong>Itanium</strong> performance monitoring counters<br />

• Effectiveness of ILP compilation algorithms<br />

– New algorithms to increase both aggressiveness and stability of speculation<br />

and predication<br />

– Program level ILP transformations<br />

• Memory subsystem improvements<br />

– More efficient pointer analysis algorithms<br />

– Memory data flow analysis and optimization<br />

• More results are being derived for presentation at Microprocessor Forum<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>


Hot Chips 13<br />

August 21, 2001<br />

Acknowledgements<br />

• Former <strong>IMPACT</strong> members<br />

– David August, Ben-Chung Cheng, Daniel Connors, Kevin Crozier, Brian<br />

Deitrich, John Gyllenhaal, Richard Hank, Teresa Johnson, Dan Lavery, Scott<br />

Mahlke, Le-Chun Wu<br />

• Intel <strong>Itanium</strong> Team<br />

– Carole Dulong, John Crawford, Dan Lavery, Steve Skedzielewski, Jim Pierce<br />

• HP <strong>Itanium</strong> Team<br />

– Richard Holman, Vatsa Santhanam, Carol Thompson, Richard Hank<br />

• HP Labs PD Team<br />

– Bob Rau, Mike Schlansker, Vinod Kathail, Scott Mahlke<br />

• HP Labs Linux Team<br />

– Brian Lynn, David Mosberger, Hans Boehm, Stephane Eranian<br />

• HP Philanthropy - 9 i2000 <strong>Itanium</strong> machines, expedited<br />

– Karen Fontana, Tony Napolitan, Ralph Hyver, Chris Hsiung, Rob Bouzon<br />

• Intel - 2 alpha/beta <strong>Itanium</strong>s<br />

– Carole Dulong, John Crawford<br />

<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />

<strong>IMPACT</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!