with CUDA Fortran

Execution Configuration
• GPUs are high latency, 100s of cycles per device memory request
• For good performance, you need to ensure there is enough parallelism to hide this latency
• Such parallelism can come from:
  – Thread-level parallelism
  – Instruction-level parallelism
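As a rough illustration of where the execution configuration appears in code, the sketch below launches a simple increment kernel with an explicit block size and a grid large enough to cover the array. The module, kernel, and variable names (increment_m, increment, threadsPerBlock) and the block size of 256 are assumptions chosen for illustration, not values taken from the slides.

```fortran
module increment_m
contains
  ! Kernel: each thread increments at most one array element.
  attributes(global) subroutine increment(a, b, n)
    implicit none
    integer :: a(:)
    integer, value :: b, n
    integer :: i
    ! Global 1-based index of this thread.
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    ! Guard against threads past the end of the array.
    if (i <= n) a(i) = a(i) + b
  end subroutine increment
end module increment_m

program launchConfig
  use cudafor
  use increment_m
  implicit none
  integer, parameter :: n = 1024*1024
  integer, parameter :: threadsPerBlock = 256   ! assumed value; tune per kernel
  integer, allocatable :: a(:)
  integer, device, allocatable :: a_d(:)
  type(dim3) :: grid, tBlock
  integer :: istat

  allocate(a(n), a_d(n))
  a = 1
  a_d = a                        ! host-to-device copy

  ! Execution configuration: threads per block, and enough
  ! blocks (rounded up) to cover all n elements.
  tBlock = dim3(threadsPerBlock, 1, 1)
  grid   = dim3((n + threadsPerBlock - 1)/threadsPerBlock, 1, 1)

  call increment<<<grid, tBlock>>>(a_d, 3, n)
  istat = cudaDeviceSynchronize()

  a = a_d                        ! device-to-host copy
  if (all(a == 4)) then
     print *, 'Test passed'
  else
     print *, 'Test failed'
  end if
  deallocate(a, a_d)
end program launchConfig
```

With roughly a million elements and 256 threads per block, this launch creates 4096 blocks, which gives the hardware many threads to switch between while memory requests are outstanding. The best block size depends on the kernel and the architecture, so 256 is only a starting point.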
Thread-Level Parallelism
• Execution configuration dictates number of threads per block
  – Limit on number of threads per block for each architecture
• Number of concurrent blocks on a multiprocessor limited by
  – Register use per thread
  – Shared memory use per thread block
  – Limit on number of threads per multiprocessor
• Occupancy
  – Ratio of actual to maximum number of concurrent threads per multiprocessor
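The occupancy definition above can be made concrete with a small worked calculation. The numbers here are hypothetical and chosen only for illustration: a multiprocessor limit of 2048 concurrent threads, 256 threads per block, and register use that allows at most 6 concurrent blocks.

```latex
\[
\text{occupancy}
  = \frac{\text{concurrent blocks} \times \text{threads per block}}
         {\text{maximum concurrent threads per multiprocessor}}
  = \frac{6 \times 256}{2048}
  = \frac{1536}{2048}
  = 0.75
\]
```

Under these assumed limits, the thread limit alone would permit 2048 / 256 = 8 blocks, so it is the register use that holds occupancy to 75%. Reducing registers per thread or shared memory per block are the usual ways to raise the number of concurrent blocks.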
