PGI User's Guide

fastmath        Use routines from the fast math library.
keepbin         Keep the binary (.bin) files.
keepgpu         Keep the kernel source (.gpu) files.
keepptx         Keep the portable assembly (.ptx) file for the GPU code.
maxregcount:n   Specify the maximum number of registers to use on the GPU. Leaving this blank indicates no limit.
mul24           Use 24-bit multiplication for subscripting.
nofma           Do not generate fused multiply-add instructions.
time            Link in a limited-profiling library, as described in “Profiling Accelerator Kernels,” on page 102.
[no]wait        Wait for each kernel to finish before continuing in the host program.

host            Select NO accelerator target. Generate PGI Unified Binary Code, as described in “PGI Unified Binary for Accelerators,” on page 100.

The compiler automatically invokes the necessary CUDA software tools to create the kernel code and embeds the kernels in the object file.

Note
To access accelerator libraries, you must link an accelerator program with the -ta flag.

PGI Unified Binary for Accelerators

Note
The information and capabilities described in this section are supported only on 64-bit systems.

PGI compilers support the PGI Unified Binary feature, which generates executables with functions optimized for different host processors, all packed into a single binary. This release extends the PGI Unified Binary technology to accelerators. Specifically, you can generate a single binary that includes two versions of functions:
• one optimized for the accelerator
• one that runs on the host processor when the accelerator is not available or when you want to compare host and accelerator execution.

To enable this feature, use the extended -ta flag:

    -ta=nvidia,host

This flag tells the compiler to generate two versions of functions that have valid accelerator regions:
• A compiled version that targets the accelerator.
• A compiled version that ignores the accelerator directives and targets the host processor.
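As a rough mental model of what such a binary does at run time, the C sketch below pairs a host version and an accelerator version of one function behind a dispatcher. It is an illustration under stated assumptions, not PGI's implementation: the names `add_host`, `add_nvidia`, `select_version`, `add_dispatch`, and the `gpu_found` flag are hypothetical, and the "accelerator" version is a stand-in that computes on the CPU. The selection rule follows the run-time behavior this guide describes for -ta=nvidia,host binaries: ACC_DEVICE set to NVIDIA or nvidia selects the GPU version, any other value selects the host version, and when the variable is unset the GPU version runs only if the CUDA libraries load and a GPU is found. The guide does not say what happens when ACC_DEVICE requests the GPU but none is present, so the sketch does not model that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Which compiled version of a subprogram to run. */
typedef enum { VERSION_HOST, VERSION_NVIDIA } version_t;

/* Hypothetical host version: a plain loop on the CPU. */
static void add_host(float *b, const float *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        b[i] += a[i];
}

/* Hypothetical stand-in for the accelerator version. A real unified binary
   would launch the generated GPU kernel here; this computes the same result
   on the CPU so the sketch stays self-contained. */
static void add_nvidia(float *b, const float *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        b[i] += a[i];
}

/* Selection rule modeled on the guide's description:
   - ACC_DEVICE is "NVIDIA" or "nvidia"  -> GPU version
   - ACC_DEVICE set to any other value   -> host version
   - ACC_DEVICE unset                    -> GPU version only if gpu_found */
version_t select_version(const char *acc_device, int gpu_found) {
    if (acc_device != NULL) {
        if (strcmp(acc_device, "NVIDIA") == 0 || strcmp(acc_device, "nvidia") == 0)
            return VERSION_NVIDIA;
        return VERSION_HOST;
    }
    return gpu_found ? VERSION_NVIDIA : VERSION_HOST;
}

/* Dispatcher: reads ACC_DEVICE, picks a version, and calls it.
   gpu_found stands in for "CUDA libraries loaded and a GPU detected";
   the probe itself is not implemented in this sketch. */
version_t add_dispatch(float *b, const float *a, size_t n, int gpu_found) {
    assert(n == 0 || (a != NULL && b != NULL));
    version_t v = select_version(getenv("ACC_DEVICE"), gpu_found);
    if (v == VERSION_NVIDIA)
        add_nvidia(b, a, n);
    else
        add_host(b, a, n);
    return v;
}
```

Keeping the rule in `select_version`, separate from the `getenv` call, lets the decision table be checked without changing the process environment; both versions produce the same result, which is what makes host-to-accelerator comparisons meaningful.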
Chapter 7. Using an Accelerator

If you use the -Minfo flag, you get messages similar to the following:

    s1:
    s1:
         12, PGI Unified Binary version for -tp=barcelona-64 -ta=host
         18, Generated an alternate loop for the inner loop
             Generated vector sse code for inner loop
             Generated 1 prefetch instructions for this loop
         12, PGI Unified Binary version for -tp=barcelona-64 -ta=nvidia
         15, Generating copy(b(:,2:90))
             Generating copyin(a(:,2:90))
         16, Loop is parallelizable
         18, Loop is parallelizable
             Parallelization requires privatization of array t(2:90)
             Accelerator kernel generated
         16, !$acc do parallel
         18, !$acc do parallel, vector(256)
             Using register for t

The PGI Unified Binary message shows that two versions of the subprogram s1 were generated:
• one for no accelerator (-ta=host)
• one for the NVIDIA GPU (-ta=nvidia)

At run time, the program tries to load the NVIDIA CUDA dynamic libraries and test for the presence of a GPU. If the libraries are not available or no GPU is found, the program runs the host version.

You can also set an environment variable to tell the program to run on the NVIDIA GPU. To do this, set ACC_DEVICE to the value NVIDIA or nvidia. Any other value of the environment variable causes the program to use the host version.

Note
The only supported -ta targets for this release are nvidia and host.

Multiple Processor Targets

With 64-bit processors, you can use the -tp flag with multiple processor targets along with the -ta flag.
You see the following behavior:
• If you specify one -tp value and one -ta value:
  You see one version of each subprogram generated for that specific target processor and target accelerator.
• If you specify one -tp value and multiple -ta values:
  The compiler generates two versions of each subprogram that contains accelerator regions, both for the specified target processor, one for each target accelerator.
• If you specify multiple -tp values and one -ta value:
  The compiler generates up to as many versions of each subprogram as there are -tp values, one per target processor, and each version also targets the selected accelerator.
• If you specify multiple -tp values and multiple -ta values: