
32035 Rev. 3.22 November 2007    Compiler Usage Guidelines for AMD64 Platforms

-O3 (level-3) specifies aggressive global optimization. This optimization level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.

-O4 (level-4) performs all level-1, level-2, and level-3 optimizations and enables hoisting of guarded invariant floating-point expressions.
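For illustration only (the program and file names are hypothetical), these optimization levels are selected on the compiler command line:

    # Aggressive global optimization (level 3)
    pgcc -O3 -o myapp myapp.c

    # Level-4 optimization, adding hoisting of guarded invariant
    # floating-point expressions
    pgf90 -O4 -o mysim mysim.f90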

Loop Optimization using -Munroll, -Mvect, and -Mconcur. Loop performance may be improved through vectorization or unrolling options and, on systems with multiple processors, by using parallelization options.

-Munroll unrolls loops. Executing multiple instances during each loop iteration reduces branch overhead, improving execution speed by creating better opportunities for instruction scheduling. The -Munroll sub-options c:number and n:number, or the -Mnounroll option, control whether and how loops are unrolled.
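As an example (hypothetical file names; the unroll counts are chosen only for illustration), unrolling can be requested or suppressed as follows:

    # Unroll candidate loops using the compiler defaults
    pgcc -O3 -Munroll -o loops loops.c

    # Control unrolling through the c:number and n:number sub-options
    pgcc -O3 -Munroll=c:4,n:4 -o loops loops.c

    # Suppress loop unrolling entirely
    pgcc -O3 -Mnounroll -o loops loops.c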

The -Mvect option triggers the vectorizer to scan code searching for loops that are candidates for high-level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). The vectorizer transformation can be controlled by arguments to the -Mvect option. By default, -Mvect without sub-options is equivalent to -Mvect=assoc,cachesize:262144. Vectorization sub-options are assoc, cachesize:number, sse, and prefetch.
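For example (hypothetical file names; the cache size shown is simply the documented default written out explicitly):

    # Vectorize with the default sub-options made explicit
    pgcc -O3 -Mvect=assoc,cachesize:262144 -o kernel kernel.c

    # Generate SSE instructions and prefetches for vectorized loops
    pgcc -O3 -Mvect=sse,prefetch -o kernel kernel.c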

The -Mconcur option instructs the compiler to scan code searching for loops that are candidates for auto-parallelization. -Mconcur must be used at compile time and link time. The parallelizer performs various operations that are controlled by arguments to the -Mconcur option. By default, -Mconcur without sub-options is equivalent to -Mconcur=dist:block. Auto-parallelization sub-options are altcode:number, dist:block, dist:cycle, cncall, noassoc, and innermost.
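A minimal sketch of auto-parallelization, assuming hypothetical file names; note that -Mconcur appears on both the compile and link lines, as required above. The NCPUS environment variable used to select the thread count at run time is the usual PGI convention and is stated here as an assumption rather than taken from this document:

    # Compile and link with auto-parallelization enabled
    pgcc -O3 -Mconcur -c solver.c
    pgcc -O3 -Mconcur -o solver solver.o

    # Select the number of threads at run time (PGI convention; assumption)
    export NCPUS=4
    ./solver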

Interprocedural Analysis and Optimization using -Mipa. Interprocedural analysis (IPA) can improve performance for many programs. To compile programs with IPA, use an aggregated sub-option such as -Mipa=fast. Refer to the PGI Compiler User's Guide for available sub-options.
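As an illustration (hypothetical source files), IPA is typically applied at both the compile and link steps so that cross-file analysis can take place:

    # Compile each translation unit with interprocedural analysis
    pgcc -Mipa=fast -c main.c util.c

    # Link with the same IPA option to enable cross-file optimization
    pgcc -Mipa=fast -o app main.o util.o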

Function Inlining using -Minline. Inlining allows a call to a function or subroutine to be replaced by a copy of the body of that function or subroutine. Several -Minline sub-options determine the selection criteria for functions to be inlined. Available sub-options are except:func, name:func, size:number, levels:number, and lib:filename.ext. Note that in C++ releases prior to 6.2, function inlining does not occur unless the -Minline switch is used. Beginning with release 6.2, inlining occurs automatically for C++ functions specified by means of the inline keyword or for methods defined in the body of the class. Also, if C++ exceptions are not used, the --no_exceptions flag improves performance.
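For illustration (the function and file names are hypothetical, and the sub-option combinations are only examples of the selection criteria listed above):

    # Inline calls to a specific routine and restrict inlining to small functions
    pgcc -O3 -Minline=name:saxpy,size:50 -o blaslike blaslike.c

    # Inline through up to two levels of nested calls; avoid C++ exception
    # overhead when exceptions are not used
    pgCC -O3 -Minline=levels:2 --no_exceptions -o app app.cpp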

3.1.4 Linking with ACML<br />

Due to the strategic importance of the AMD multi-core processor architecture, libraries are in place to assist developers in porting software to AMD processors. AMD Core Math Library (ACML) is designed to "squeeze" the greatest possible performance from AMD multi-core platforms and is integrated in all PGI Toolkits. As the number of cores increases over time, future processor

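A minimal sketch of linking against ACML with the PGI compilers; the installation paths and the multi-threaded library name are assumptions about a typical ACML installation and are not taken from this document:

    # Link a Fortran program against the single-threaded ACML (assumed install path)
    pgf90 -O3 -o solve solve.f90 -L/opt/acml/pgi64/lib -lacml

    # Link against the multi-threaded (OpenMP) ACML using the PGI -mp flag
    pgf90 -O3 -mp -o solve solve.f90 -L/opt/acml/pgi64_mp/lib -lacml_mp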
