PGI User's Guide

Chapter 3. Optimizing & Parallelizing

Vectorization using -Mvect

Assume the preceding program is compiled as follows, where -Mvect=nosse disables SSE vectorization:

    % pgfortran -fast -Mvect=nosse -Minfo vadd.f
    vector_op:
        4, Loop unrolled 4 times
    loop:
        18, Loop unrolled 4 times

The following output shows a sample result if the generated executable is run and timed on a standalone AMD Opteron 2.2 GHz system:

    % /bin/time vadd
    -1.000000 -771.000 -3618.000 -6498.00 -9999.00
    5.39user 0.00system 0:05.40elapsed 99%CPU

Now, recompile with SSE vectorization enabled, and you see results similar to these:

    % pgfortran -fast -Minfo vadd.f -o vadd
    vector_op:
        4, Unrolled inner loop 8 times
           Loop unrolled 7 times (completely unrolled)
    loop:
        18, Generated 4 alternate loops for the inner loop
            Generated vector sse code for inner loop
            Generated 3 prefetch instructions for this loop

Notice the informational message for the loop at line 18.

• The first two lines of the message indicate that the loop was vectorized, SSE instructions were generated, and four alternate versions of the loop were also generated. The loop count and alignments of the arrays determine which of these versions is executed.
• The last line of the informational message indicates that prefetch instructions have been generated for three loads to minimize latency of data transfers from main memory.

Executing again, you should see results similar to the following:

    % /bin/time vadd
    -1.000000 -771.000 -3618.00 -6498.00 -9999.0
    3.59user 0.00system 0:03.59elapsed 100%CPU

The result is a 50% speed-up (5.39 versus 3.59 seconds of user time) over the equivalent scalar, that is, non-SSE, version of the program.

Speed-up realized by a given loop or program can vary widely based on a number of factors:

• When the vectors of data are resident in the data cache, performance improvement using vector SSE or SSE2 instructions is most effective.
• If data is aligned properly, performance will be better in general than when using vector SSE operations on unaligned data.
• If the compiler can guarantee that data is aligned properly, even more efficient sequences of SSE instructions can be generated.
• The efficiency of loops that operate on single-precision data can be higher. SSE2 vector instructions can operate on four single-precision elements concurrently, but only two double-precision elements.

Note

Compiling with -Mvect=sse can result in numerical differences from the executables generated with less optimization. Certain vectorizable operations, for example dot products, are sensitive to the order of operations and to the associative transformations necessary to enable vectorization (or parallelization).

Auto-Parallelization using -Mconcur

With the -Mconcur option the compiler scans code searching for loops that are candidates for auto-parallelization. -Mconcur must be used at both compile-time and link-time. When the parallelizer finds opportunities for auto-parallelization, it parallelizes loops, and you are informed of the line or loop being parallelized if the -Minfo option is present on the compile line. Refer to “Optimization Controls” in the PGI Reference Guide for a complete specification of -Mconcur.

A loop is considered parallelizable if it doesn't contain any cross-iteration data dependencies. Cross-iteration dependencies from reductions and expandable scalars are excluded from consideration, enabling more loops to be parallelizable. In general, loops with calls are not parallelized due to unknown side effects. Also, loops with low trip counts are not parallelized, since the overhead of setting up and starting a parallel loop will likely outweigh the potential benefits. In addition, the default is to not parallelize innermost loops, since these are often vectorizable using SSE instructions and it is seldom profitable to both vectorize and parallelize the same loop, especially on multi-core processors.
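The remarks above on dot products and on reductions can be illustrated with a small C sketch. It is not what the compiler generates; it only mimics, by hand, the kind of reassociation a vectorizer or parallelizer may apply to a reduction. The function names dot_serial and dot_partial4 are hypothetical, and the choice of four partial sums is an assumption that mirrors the four single-precision lanes of an SSE register.

```c
#include <stddef.h>

/* Serial reduction: a single accumulator, strict left-to-right order. */
float dot_serial(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Reassociated reduction (illustrative only): four independent partial
 * sums, one per hypothetical vector lane, combined after the main loop.
 * The cross-iteration dependence on the accumulator is removed, but the
 * additions are now performed in a different order. */
float dot_partial4(const float *x, const float *y, size_t n)
{
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s[0] += x[i]     * y[i];
        s[1] += x[i + 1] * y[i + 1];
        s[2] += x[i + 2] * y[i + 2];
        s[3] += x[i + 3] * y[i + 3];
    }

    float sum = (s[0] + s[1]) + (s[2] + s[3]);

    /* Remainder iterations that do not fill a group of four. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

For inputs whose products and sums are exactly representable, both functions return the same value. For general floating-point data they can differ in the low-order bits, because floating-point addition is not associative; this is the kind of numerical difference the Note describes.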
Compiler switches and directives are available to let you override most of these restrictions on auto-parallelization.

Auto-parallelization Sub-options

The parallelizer performs various operations that can be controlled by arguments to the -Mconcur command line option. The following sections describe the arguments that affect the operation of the parallelizer. In addition, these parallelizer operations can be controlled from within code using directives and pragmas. For details on the use of directives and pragmas, refer to Chapter 8, “Using Directives and Pragmas”.

By default, -Mconcur without any sub-options is equivalent to:

    -Mconcur=dist:block

This enables parallelization of loops with blocked iteration allocation across the available threads of execution. These defaults may vary depending on the target system.

Altcode Option

The option -Mconcur=altcode instructs the parallelizer to generate alternate serial code for parallelized loops. If altcode is specified without arguments, the parallelizer determines an appropriate cutoff length and generates serial code to be executed whenever the loop count is less than or equal to that length. If altcode:n is specified, the serial altcode is executed whenever the loop count is less than or equal to n. If noaltcode is specified, no alternate serial code is generated.
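The blocked distribution and the altcode cutoff described above can be modeled with a short C sketch. This is a conceptual model under stated assumptions, not the code the parallelizer emits: the names block_range, block_bounds, and use_serial_altcode are hypothetical, and the way leftover iterations are spread over the first threads is an assumption, since the text only says that iterations are allocated in blocks.

```c
#include <stddef.h>

/* Half-open iteration range [lo, hi) assigned to one thread. */
typedef struct {
    size_t lo;
    size_t hi;
} block_range;

/* Blocked distribution of n iterations over nthreads threads (nthreads
 * must be nonzero). Assumption: when n is not a multiple of nthreads,
 * the first (n % nthreads) threads each take one extra iteration. */
block_range block_bounds(size_t n, size_t nthreads, size_t tid)
{
    size_t base  = n / nthreads;
    size_t extra = n % nthreads;
    block_range r;

    r.lo = tid * base + (tid < extra ? tid : extra);
    r.hi = r.lo + base + (tid < extra ? 1 : 0);
    return r;
}

/* Altcode decision as stated in the text: the serial version runs
 * whenever the loop count is less than or equal to the cutoff. */
int use_serial_altcode(size_t loop_count, size_t cutoff)
{
    return loop_count <= cutoff;
}
```

With 10 iterations and 4 threads, this model gives the ranges [0, 3), [3, 6), [6, 8), and [8, 10). A loop of 100 iterations with a cutoff of 100 takes the serial path, while 101 iterations take the parallel path.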