Itanium Performance Insights from the IMPACT Compiler
Itanium Performance Insights from the IMPACT Compiler
Itanium Performance Insights from the IMPACT Compiler
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Hot Chips 13<br />
August 21, 2001<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong><br />
<strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
J. Sias, M. Merten, E. Nystrom, R. Barnes,<br />
C. Shannon, J. Matarazzo, S. Ryoo,<br />
J. Olivier, W. Hwu<br />
<strong>IMPACT</strong> Research Group<br />
Coordinated Science Laboratory<br />
University of Illinois at Urbana-Champaign<br />
http://www.crhc.uiuc.edu/<strong>IMPACT</strong>/<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
<strong>Itanium</strong> Development History<br />
August 21, 2001<br />
Intel and Hewlett Packard<br />
Intel/HP<br />
alliance<br />
input<br />
<strong>from</strong><br />
UIUC<br />
Intel<br />
begins<br />
prototype<br />
compiler<br />
based on<br />
<strong>IMPACT</strong><br />
Intel/HP<br />
announce<br />
<strong>Itanium</strong><br />
Intel<br />
commercially<br />
licenses<br />
<strong>IMPACT</strong><br />
Intel donates<br />
prototype<br />
compiler and<br />
system<br />
to <strong>IMPACT</strong><br />
Research<br />
Intel/HP<br />
publically<br />
release<br />
<strong>Itanium</strong><br />
processor &<br />
compilers<br />
‘87<br />
‘95<br />
‘96 ‘97 ‘98 ‘99 ‘00 ‘01<br />
‘02<br />
<strong>IMPACT</strong><br />
ILP<br />
compiler<br />
born<br />
Research<br />
Partnership<br />
with HP Labs:<br />
predication &<br />
speculation<br />
Advanced<br />
control flow,<br />
predication &<br />
speculation<br />
balancing<br />
research<br />
Focused<br />
<strong>Itanium</strong><br />
development<br />
begins at<br />
<strong>IMPACT</strong><br />
Research<br />
General<br />
spec.<br />
support<br />
add to<br />
linux<br />
kernel<br />
University of Illinois at Urbana-Champaign<br />
Results<br />
comparable<br />
to Intel ecc<br />
achieved<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
<strong>Itanium</strong> Architecture Overview<br />
August 21, 2001<br />
• <strong>Itanium</strong> design goal: enhance scalability of parallelism by<br />
moving complex decisions to <strong>the</strong> compiler<br />
– Bundling: enables static scheduling by communicating instruction<br />
parallelism<br />
– Predication: allows compiler to optimize across multiple paths by<br />
providing an alternative to control flow<br />
– Speculation support: allows compiler to select specific instructions<br />
for early execution<br />
Original Control Speculation Predication<br />
...<br />
BR<br />
If addr != 0 If addr == 0<br />
r = [addr] ...<br />
r = [addr]<br />
...<br />
BR<br />
If addr != 0 If addr == 0<br />
If p: use r<br />
r = [addr]<br />
...<br />
p =<br />
(addr != 0)<br />
If !p: ...<br />
use r<br />
use r<br />
...<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
<strong>Itanium</strong> Compilation Landscape<br />
August 21, 2001<br />
• Increased reliance on <strong>the</strong> compiler for performance<br />
– Explicit control of <strong>the</strong> architecture: realities of modern<br />
microarchitecture have become visible at software level<br />
– Particular problems: effects of runtime uncertainty<br />
• Control resolution, variable memory latency, etc.<br />
– Solutions <strong>from</strong> EPIC/VLIW research<br />
• Memory disambiguation, profiling<br />
• Static scheduling, control speculation, predication<br />
Applicability<br />
Public/<br />
Proprietary<br />
Peephole<br />
opti level<br />
ILP opti<br />
level<br />
Extensibility<br />
gcc<br />
Very High<br />
Public<br />
Low<br />
Very Low<br />
Low<br />
ecc<br />
High<br />
Proprietary<br />
High<br />
High<br />
High<br />
<strong>IMPACT</strong><br />
Medium<br />
Future Public<br />
Medium<br />
Very High<br />
Very High<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
August 21, 2001<br />
<strong>IMPACT</strong> <strong>Compiler</strong><br />
• <strong>IMPACT</strong> compiler supports ILP compilation and research<br />
– Extensible framework for easy implementation of new optimizations<br />
– Versatile and pervasive intermediate representation (IR)<br />
– Comprehensive predicate-aware dataflow and predicate analyses<br />
– Direct analysis <strong>from</strong> IR at most stages of compilation<br />
– Advanced interprocedural pointer analysis<br />
Profiling<br />
Machine Description Relied<br />
Upon Throughout<br />
Prepass<br />
Scheduling<br />
Function<br />
Inlining<br />
Traditional<br />
Optimization<br />
Predicate<br />
Formulation<br />
Predicate-Aware<br />
Classical and<br />
ILP Optimization<br />
Instruction<br />
Selection<br />
Register<br />
Allocation<br />
Interprodecural<br />
Pointer Analysis<br />
Early Utilization of<br />
Predication<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
Many Optimizations<br />
Utilize Control<br />
Speculation<br />
Meticulous Static<br />
Scheduling<br />
Postpass<br />
Scheduling<br />
<strong>IMPACT</strong>
Hot Chips 13<br />
SPECcpu2000 Ratio Comparison<br />
August 21, 2001<br />
164.gzip<br />
500<br />
256.bzip2<br />
300.twolf<br />
400<br />
300<br />
200<br />
100<br />
0<br />
175.vpr<br />
impact<br />
ecc<br />
gcc<br />
181.mcf<br />
255.vortex<br />
186.crafty<br />
254.gap<br />
197.parser<br />
Dual processor HP i2000 <strong>Itanium</strong> 800 MHz with 2MB L3 cache<br />
linux kernel 2.4.7 with PMC patch, gcc 2.9.6, ecc 5.0.1 beta<br />
(ecc measurements not official Intel results)<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
Cycle Accounting Breakdown<br />
August 21, 2001<br />
Execution time / SPEC reference time<br />
0.7<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
Pipeline flush<br />
Memory<br />
Dependency<br />
Unstalled execution<br />
0<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
164 175 181 186 197 254 255 256 300<br />
Measured machine execution cycles using performance monitoring counters<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
<strong>Performance</strong> of Memory Hierarchy<br />
August 21, 2001<br />
Satisfied accesses x latency (1E12)<br />
1.8<br />
1.6<br />
1.4<br />
1.2<br />
1.0<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
0.0<br />
Main memory<br />
(est. 80 cycles)<br />
L3 cache<br />
L2 cache<br />
L1D cache<br />
L1I cache (demand)<br />
L1I cache (prefetch)<br />
INT REQ<br />
L1 I Cache<br />
16KB<br />
4-way<br />
32B lines<br />
2 cycles<br />
L1 D<br />
Cache<br />
16KB<br />
4-way<br />
32B lines<br />
2 cycles<br />
FP REQ<br />
MISS<br />
MISS<br />
L2 Unified<br />
Cache<br />
96KB<br />
6-way<br />
64B lines<br />
6 cycles int<br />
9 cycles fp<br />
MISS<br />
L3 Unified<br />
off chip, on<br />
package<br />
cache<br />
2MB<br />
21 cycles int<br />
24 cycles fp<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
164 175 181 186 197 254 255 256 300<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
Approaches to Control Speculation<br />
August 21, 2001<br />
General Speculation<br />
Write Value Into Register<br />
Hit<br />
Valid<br />
Inexpensive<br />
Moderate<br />
Expensive<br />
Result<br />
ld.s<br />
Check<br />
DTLB<br />
NaT<br />
Page<br />
Walk<br />
VHPT<br />
OS<br />
Write NaT Into Register<br />
Check<br />
Page<br />
Table<br />
Not Valid<br />
Missing<br />
Load<br />
Page<br />
Except on<br />
Consumption<br />
Sentinel Speculation<br />
Write Value Into Register<br />
ld.s<br />
Check<br />
DTLB<br />
Write NaT<br />
or Value<br />
Into<br />
Register<br />
chk.s<br />
NaT<br />
ld<br />
Check<br />
DTLB<br />
Hit<br />
NaT<br />
Page<br />
Walk<br />
VHPT<br />
OS<br />
Check<br />
Page<br />
Table<br />
Valid<br />
Missing<br />
Not Valid<br />
Load<br />
Page<br />
Except<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
Architecture and OS Support<br />
August 21, 2001<br />
• IPF allows OS to mix and match general and sentinel<br />
speculation models<br />
• Linux kernel modifications for general speculation<br />
– Basic support (system-wide general speculation): modify Default<br />
Control Register (DCR) not to defer transparent exceptions<br />
– Basic support caused page fault on every speculative NULL<br />
pointer access<br />
– Solution: install page 0 translation as a NaT page<br />
– Advanced support (selective general speculation): Bit in binary<br />
header indicates presence of recovery code. If no recovery code,<br />
ED bits in page table entries are cleared, overriding <strong>the</strong> DCR<br />
setting, so that no serviceable faults are deferred<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
August 21, 2001<br />
Cycle Accounting Breakdown Revisited<br />
Execution time / SPEC reference time<br />
0.7<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
Pipeline flush<br />
Memory<br />
Dependency<br />
Unstalled execution<br />
0<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
164 175 181 186 197 254 255 256 300<br />
Measured machine execution cycles using performance monitoring counters<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
August 21, 2001<br />
Hyperblock Transformation<br />
CB6<br />
CB5<br />
Example <strong>from</strong> 164.gzip deflate()<br />
CB10<br />
P<br />
Predicate<br />
Definition<br />
CB93<br />
CB94<br />
CB133<br />
CB140<br />
• Commonly executed code<br />
pulled toge<strong>the</strong>r into a<br />
straight-line sequence<br />
P<br />
P<br />
P<br />
P<br />
P<br />
P<br />
Implicit NOP<br />
Explicit NOP<br />
CB166<br />
CB141<br />
CB142<br />
CB143<br />
• Serial branch execution<br />
replaced with parallel<br />
predicate evaluation<br />
P<br />
P<br />
P<br />
P<br />
CB157<br />
CB158<br />
• Compromise between ILP<br />
and code size<br />
P<br />
P<br />
P<br />
P<br />
P<br />
P<br />
P<br />
66%<br />
1%<br />
CB160<br />
CB159<br />
– Bundling reduces explicit<br />
horizontal NOPs<br />
P<br />
P<br />
0%<br />
CB165<br />
CB168<br />
– Interlocking reduces explicit<br />
vertical NOPs<br />
P<br />
P<br />
0%<br />
33%<br />
CB181<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
Branch Effects of ILP Transformations<br />
August 21, 2001<br />
• ILP compilation eliminates branches<br />
– Predication and branch combining<br />
– Loop unrolling<br />
1.0<br />
0.9<br />
0.8<br />
0.7<br />
Taken mispredict<br />
Not-taken mispredict<br />
Taken<br />
Not-taken<br />
• Reduction in total dynamic branches<br />
• Proportionally fewer taken branches<br />
• Reduction in mispredicted branches<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
0.0<br />
basic ILP basic ILP basic ILP<br />
164.gzip 175.vpr 255.vortex<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
Comparison of Branch Behavior<br />
August 21, 2001<br />
7.E+10<br />
6.E+10<br />
5.E+10<br />
Not-taken incorrect<br />
Taken incorrect<br />
Taken correct<br />
Not-taken correct<br />
4.E+10<br />
3.E+10<br />
2.E+10<br />
1.E+10<br />
0.E+00<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
gcc<br />
ecc<br />
impact<br />
Branches<br />
164 175 181 186 197 254 255 256 300<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
Challenges and Future Directions<br />
August 21, 2001<br />
• Profile dependence and accuracy<br />
– Getting profile into build systems<br />
– Transparent runtime data collection and re-optimization, making more use of<br />
<strong>the</strong> <strong>Itanium</strong> performance monitoring counters<br />
• Effectiveness of ILP compilation algorithms<br />
– New algorithms to increase both aggressiveness and stability of speculation<br />
and predication<br />
– Program level ILP transformations<br />
• Memory subsystem improvements<br />
– More efficient pointer analysis algorithms<br />
– Memory data flow analysis and optimization<br />
• More results are being derived for presentation at Microprocessor Forum<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>
Hot Chips 13<br />
August 21, 2001<br />
Acknowledgements<br />
• Former <strong>IMPACT</strong> members<br />
– David August, Ben-Chung Cheng, Daniel Connors, Kevin Crozier, Brian<br />
Deitrich, John Gyllenhaal, Richard Hank, Teresa Johnson, Dan Lavery, Scott<br />
Mahlke, Le-Chun Wu<br />
• Intel <strong>Itanium</strong> Team<br />
– Carole Dulong, John Crawford, Dan Lavery, Steve Skedzielewski, Jim Pierce<br />
• HP <strong>Itanium</strong> Team<br />
– Richard Holman, Vatsa Santhanam, Carol Thompson, Richard Hank<br />
• HP Labs PD Team<br />
– Bob Rau, Mike Schlansker, Vinod Kathail, Scott Mahlke<br />
• HP Labs Linux Team<br />
– Brian Lynn, David Mosberger, Hans Boehm, Stephane Eranian<br />
• HP Philanthropy - 9 i2000 <strong>Itanium</strong> machines, expedited<br />
– Karen Fontana, Tony Napolitan, Ralph Hyver, Chris Hsiung, Rob Bouzon<br />
• Intel - 2 alpha/beta <strong>Itanium</strong>s<br />
– Carole Dulong, John Crawford<br />
<strong>Itanium</strong> <strong>Performance</strong> <strong>Insights</strong> <strong>from</strong> <strong>the</strong> <strong>IMPACT</strong> <strong>Compiler</strong><br />
<strong>IMPACT</strong>