
32035 Rev. 3.22 November 2007    Compiler Usage Guidelines for AMD64 Platforms

-O3 (level-3) specifies aggressive global optimization. This optimization level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.

-O4 (level-4) performs all level-1, level-2, and level-3 optimizations and enables hoisting of guarded invariant floating-point expressions.
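For illustration only (the program and file names are hypothetical), these optimization levels are selected on the compiler command line:

    # Aggressive global optimization (level 3)
    pgcc -O3 -o myapp myapp.c

    # Level-4 optimization, adding hoisting of guarded invariant
    # floating-point expressions
    pgf90 -O4 -o mysim mysim.f90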

Loop Optimization using -Munroll, -Mvect, and -Mconcur. Loop performance may be improved through vectorization or unrolling options and, on systems with multiple processors, by using parallelization options.

-Munroll unrolls loops. Executing multiple instances during each loop iteration reduces branch overhead, improving execution speed by creating better opportunities for instruction scheduling. The -Munroll sub-options c:number and n:number, or the -Mnounroll option, control whether and how loops are unrolled.
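As an example (hypothetical file names; the unroll counts are chosen only for illustration), unrolling can be requested or suppressed as follows:

    # Unroll candidate loops using the compiler defaults
    pgcc -O3 -Munroll -o loops loops.c

    # Control unrolling through the c:number and n:number sub-options
    pgcc -O3 -Munroll=c:4,n:4 -o loops loops.c

    # Suppress loop unrolling entirely
    pgcc -O3 -Mnounroll -o loops loops.c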

The -Mvect option triggers the vectorizer to scan code searching for loops that are candidates for high-level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). The vectorizer transformation can be controlled by arguments to the -Mvect option. By default, -Mvect without sub-options is equivalent to -Mvect=assoc,cachesize:262144. Vectorization sub-options are assoc, cachesize:number, sse, and prefetch.
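For example (hypothetical file names; the cache size shown is simply the documented default written out explicitly):

    # Vectorize with the default sub-options made explicit
    pgcc -O3 -Mvect=assoc,cachesize:262144 -o kernel kernel.c

    # Generate SSE instructions and prefetches for vectorized loops
    pgcc -O3 -Mvect=sse,prefetch -o kernel kernel.c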

The -Mconcur option instructs the compiler to scan code searching for loops that are candidates for auto-parallelization. -Mconcur must be used at compile time and link time. The parallelizer performs various operations that are controlled by arguments to the -Mconcur option. By default, -Mconcur without sub-options is equivalent to -Mconcur=dist:block. Auto-parallelization sub-options are altcode:number, dist:block, dist:cycle, cncall, noassoc, and innermost.
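A minimal sketch of auto-parallelization, assuming hypothetical file names; note that -Mconcur appears on both the compile and link lines, as required above. The NCPUS environment variable used to select the thread count at run time is the usual PGI convention and is stated here as an assumption rather than taken from this document:

    # Compile and link with auto-parallelization enabled
    pgcc -O3 -Mconcur -c solver.c
    pgcc -O3 -Mconcur -o solver solver.o

    # Select the number of threads at run time (PGI convention; assumption)
    export NCPUS=4
    ./solver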

Interprocedural Analysis and Optimization using -Mipa. Interprocedural analysis (IPA) can improve performance for many programs. To compile programs with IPA, use an aggregated sub-option such as -Mipa=fast. Refer to the PGI Compiler User's Guide for available sub-options.
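As an illustration (hypothetical source files), IPA is typically applied at both the compile and link steps so that cross-file analysis can take place:

    # Compile each translation unit with interprocedural analysis
    pgcc -Mipa=fast -c main.c util.c

    # Link with the same IPA option to enable cross-file optimization
    pgcc -Mipa=fast -o app main.o util.o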

Function Inlining using -Minline. Inlining allows a call to a function or subroutine to be replaced by a copy of the body of that function or subroutine. Several -Minline sub-options determine the selection criteria for functions to be inlined. Available sub-options are except:func, name:func, size:number, levels:number, and lib:filename.ext. Note that in C++ releases prior to 6.2, function inlining does not occur unless the -Minline switch is used. Beginning with release 6.2, inlining occurs automatically for C++ functions specified by means of the inline keyword or for methods defined in the body of the class. Also, if C++ exceptions are not used, the --no_exceptions flag improves performance.
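For illustration (the function and file names are hypothetical, and the sub-option combinations are only examples of the selection criteria listed above):

    # Inline calls to a specific routine and restrict inlining to small functions
    pgcc -O3 -Minline=name:saxpy,size:50 -o blaslike blaslike.c

    # Inline through up to two levels of nested calls; avoid C++ exception
    # overhead when exceptions are not used
    pgCC -O3 -Minline=levels:2 --no_exceptions -o app app.cpp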

3.1.4 Linking with ACML<br />

Due to the strategic importance of the AMD multi-core processor architecture, libraries are in place to assist developers in porting software to AMD processors. AMD Core Math Library (ACML) is designed to "squeeze" the greatest possible performance from AMD multi-core platforms and is integrated in all PGI Toolkits. As the number of cores increases over time, future processor

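A minimal sketch of linking against ACML with the PGI compilers; the installation paths and the multi-threaded library name are assumptions about a typical ACML installation and are not taken from this document:

    # Link a Fortran program against the single-threaded ACML (assumed install path)
    pgf90 -O3 -o solve solve.f90 -L/opt/acml/pgi64/lib -lacml

    # Link against the multi-threaded (OpenMP) ACML using the PGI -mp flag
    pgf90 -O3 -mp -o solve solve.f90 -L/opt/acml/pgi64_mp/lib -lacml_mp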
