PGI User's Guide

Chapter 7. Using an Accelerator

Availability

The PGI 11.1 Fortran & C Accelerator compilers are available only on x86 processor-based workstations and servers with an attached NVIDIA CUDA-enabled GPU or Tesla card. These compilers target all platforms that PGI supports except 64-bit Mac OS X. All examples included in this chapter are developed and presented on such a platform. For a list of supported GPUs, refer to the Accelerator Installation and Supported Platforms list in the latest PGI Release Notes.

User-directed Accelerator Programming

In user-directed accelerator programming the user specifies the regions of a host program to be targeted for offloading to an accelerator device. The bulk of a user's program, as well as regions containing constructs that are not supported on the targeted accelerator, are executed on the host. This chapter concentrates on specification of loops and regions of code to be offloaded to an accelerator.

Features Not Covered or Implemented

This chapter does not describe features or limitations of the host programming environment as a whole. Further, it does not cover automatic detection and offloading of regions of code to an accelerator by a compiler or other tool. While future versions of the PGI compilers may allow for automatic offloading or multiple accelerators of different types, these features are not currently supported.

Terminology

Clear and consistent terminology is important in describing any programming model. This section provides definitions of the terms required for you to effectively use this chapter and the associated programming model.

Accelerator
a special-purpose co-processor attached to a CPU and to which the CPU can offload data and executable kernels to perform compute-intensive calculations.

Compute intensity
for a given loop, region, or program unit, the ratio of the number of arithmetic operations performed on computed data divided by the number of memory transfers required to move that data between two levels of a memory hierarchy.

Compute region
a region defined by an Accelerator compute region directive. A compute region is a structured block containing loops which are compiled for the accelerator. A compute region may require device memory to be allocated and data to be copied from host to device upon region entry, and data to be copied from device to host memory and device memory deallocated upon exit. Compute regions may not contain other compute regions or data regions.

CUDA
stands for Compute Unified Device Architecture; the CUDA environment from NVIDIA is a C-like programming environment used to explicitly control and program an NVIDIA GPU.

Data region
a region defined by an Accelerator data region directive, or an implicit data region for a function or subroutine containing Accelerator directives. Data regions typically require device memory to be allocated and data to be copied from host to device memory upon entry, and data to be copied from device to host memory and device memory deallocated upon exit. Data regions may contain other data regions and compute regions.

Device
a general reference to any type of accelerator.

Device memory
memory attached to an accelerator which is physically separate from the host memory.

Directive
in C, a #pragma, or in Fortran, a specially formatted comment statement that is interpreted by a compiler to augment information about or specify the behavior of the program.

DMA
Direct Memory Access, a method to move data between physically separate memories; this is typically performed by a DMA engine, separate from the host CPU, that can access the host physical memory as well as an IO device or GPU physical memory.

GPU
a Graphics Processing Unit; one type of accelerator device.

GPGPU
General Purpose computation on Graphics Processing Units.

Host
the main CPU that in this context has an attached accelerator device. The host CPU controls the program regions and data loaded into and executed on the device.

Loop trip count
the number of times a particular loop executes.

OpenCL - Open Compute Language
a proposed standard C-like programming environment similar to CUDA that enables portable low-level general-purpose programming on GPUs and other accelerators.

Private data
with respect to an iterative loop, data which is used only during a particular loop iteration. With respect to a more general region of code, data which is used within the region but is not initialized prior to the region and is re-initialized prior to any use after the region.

Region
a structured block identified by the programmer or implicitly defined by the language. Certain actions may occur when program execution reaches the start and end of a region, such as device memory allocation or data movement between the host and device memory. Loops in a compute region are targeted for execution on the accelerator.

Structured block
in C, an executable statement, possibly compound, with a single entry at the top and a single exit at the bottom. In Fortran, a block of executable statements with a single entry at the top and a single exit at the bottom.