Tutorial CUDA

Mixed Precision Arithmetic

Researchers are achieving great speedups at high accuracy using mixed 32/64-bit arithmetic.

© NVIDIA Corporation 2008

“Exploiting Mixed Precision Floating Point Hardware in Scientific Computations.” Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou, Julien Langou, Piotr Luszczek, and Stanimire Tomov. November 2007.
http://www.netlib.org/utk/people/JackDongarra/PAPERS/par_comp_iter_ref.pdf

Abstract: By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures and the Cell BE are presented.
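
To make the idea concrete, below is a minimal sketch of mixed-precision iterative refinement in plain host C. It is not code from the tutorial or the paper; the matrix, right-hand side, tolerance, and the unpivoted Gaussian elimination are illustrative choices. The pattern matches the technique the abstract describes: solve in fast 32-bit arithmetic, compute the residual in 64-bit, and correct until the answer reaches double accuracy.

/* mixed_ir.c -- hypothetical, minimal mixed-precision iterative refinement.
 * Build: cc mixed_ir.c -o mixed_ir -lm
 */
#include <stdio.h>
#include <math.h>

#define N 3

/* Solve A x = b entirely in 32-bit arithmetic via Gaussian elimination.
 * No pivoting: adequate for the diagonally dominant example below. */
static void solve_single(float A[N][N], float b[N], float x[N])
{
    float M[N][N], v[N];
    for (int i = 0; i < N; i++) {
        v[i] = b[i];
        for (int j = 0; j < N; j++) M[i][j] = A[i][j];
    }
    for (int k = 0; k < N; k++)               /* forward elimination */
        for (int i = k + 1; i < N; i++) {
            float f = M[i][k] / M[k][k];
            for (int j = k; j < N; j++) M[i][j] -= f * M[k][j];
            v[i] -= f * v[k];
        }
    for (int i = N - 1; i >= 0; i--) {        /* back substitution */
        float s = v[i];
        for (int j = i + 1; j < N; j++) s -= M[i][j] * x[j];
        x[i] = s / M[i][i];
    }
}

int main(void)
{
    /* Problem data held in 64-bit precision (illustrative values). */
    double A[N][N] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
    double b[N]    = {1, 2, 3};

    /* 32-bit copies for the fast solver. */
    float Af[N][N], bf[N], xf[N];
    for (int i = 0; i < N; i++) {
        bf[i] = (float)b[i];
        for (int j = 0; j < N; j++) Af[i][j] = (float)A[i][j];
    }

    /* Step 1: initial solve in fast 32-bit arithmetic. */
    solve_single(Af, bf, xf);
    double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)xf[i];

    /* Step 2: refine. Residual r = b - A x in 64-bit (this is what
     * recovers double accuracy); correction solve A d = r in 32-bit. */
    for (int iter = 0; iter < 10; iter++) {
        double r[N], rnorm = 0.0;
        for (int i = 0; i < N; i++) {
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
            rnorm = fmax(rnorm, fabs(r[i]));
        }
        if (rnorm < 1e-14) break;             /* at double-precision accuracy */
        float rf[N], df[N];
        for (int i = 0; i < N; i++) rf[i] = (float)r[i];
        solve_single(Af, rf, df);
        for (int i = 0; i < N; i++) x[i] += (double)df[i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %.15f\n", i, x[i]);
    return 0;
}

For brevity this sketch refactors A on every correction solve. In the scheme the paper describes, the expensive O(n^3) factorization is performed once in single precision and reused, so only the cheap O(n^2) residual and correction solves repeat; that one-time fast factorization is where the speedup on GPUs and the Cell BE comes from.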
