OpenCL Optimization



Outline
• Overview
• The CUDA architecture
• Memory optimization
• Execution configuration optimization
• Instruction optimization
• Summary


Overall Optimization Strategies
• Maximize parallel execution
  – Expose data parallelism in algorithms
  – Choose the execution configuration
  – Overlap memory transfer with computation
• Maximize memory bandwidth
  – Keep the hardware busy
• Maximize instruction throughput
  – Get the job done with as few clock cycles as possible
We will talk about how to do these on NVIDIA GPUs.




2nd Gen CUDA Architecture: GT200
• Device contains 30 Streaming Multiprocessors (SMs)
• Each SM contains
  – 8 scalar processors
  – 1 double precision unit
  – 2 special function units
  – shared memory (16 KB)
  – registers (16,384 32-bit registers = 64 KB)


Execution Model
• Work-item → scalar processor
  – Work-items are executed by scalar processors
• Work-group → multiprocessor
  – Work-groups are executed on multiprocessors
  – Work-groups do not migrate
  – Several concurrent work-groups can reside on one SM, limited by SM resources
    (local and private memory)
• Grid → device
  – A kernel is launched as a grid of work-groups
  – Only one kernel can execute on a device at one time


Warp and SIMT
• Work-groups are divided into groups of 32 threads called warps
• All threads in a warp execute the same instruction (SIMT)
• Warps are the basic scheduling units
• 4 clock cycles to dispatch an instruction to all the threads in a warp
• A lot of warps can hide memory latency
(Figure: a work-group split into 32-thread warps, and the SM's multithreaded warp
scheduler interleaving instructions from different warps over time, e.g. warp 8
instruction 11, warp 1 instruction 42, warp 3 instruction 95, …)


OpenCL Memory Hierarchy
• Global: R/W per-kernel
• Constant: R per-kernel
• Local: R/W per-work-group
• Private: R/W per-work-item (thread)
(Figure: work-items with private memory, grouped into compute units that share local
memory, all accessing the compute device's global/constant memory through the
global/constant memory data cache)
A short kernel sketch of these address spaces follows.
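To make the four address spaces concrete, here is a minimal kernel sketch (the kernel name, array sizes, and the work-group size of 64 are illustrative assumptions, not part of the original slides):

__constant float coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};  /* constant: read-only per kernel  */

__kernel void addressSpaces(__global float* out)        /* global: R/W per kernel          */
{
    __local float scratch[64];                          /* local: R/W per work-group       */
    float tmp;                                          /* private: R/W per work-item      */

    int lid = get_local_id(0);                          /* assumes a work-group size of 64 */
    scratch[lid] = coeff[lid % 4];
    barrier(CLK_LOCAL_MEM_FENCE);                       /* make the local writes visible   */
    tmp = scratch[(lid + 1) % 64];
    out[get_global_id(0)] = tmp;
}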


Mapping between OpenCL and CUDA
• OpenCL private memory                      ↔ CUDA registers
• OpenCL work-item (on a processing element) ↔ CUDA thread (on a thread processor)
• OpenCL compute unit                        ↔ CUDA multiprocessor
• OpenCL local memory                        ↔ CUDA shared memory
• OpenCL global / constant memory data cache ↔ CUDA global / constant memory data cache
• OpenCL global memory                       ↔ CUDA global/local memory
• OpenCL compute device memory               ↔ CUDA compute device memory




Overview of Memory Optimization
• Minimize host–device data transfer
• Coalesce global memory accesses
• Use local memory as a cache


Minimizing Host–Device Data Transfer
• Host–device data transfer has much lower bandwidth than global memory access
  – 8 GB/s (PCIe x16 Gen2) vs. 141 GB/s (GTX 280)
• Minimize transfers
  – Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
  – Sometimes it is even better to recompute on the GPU, or to run kernels that show no
    speed-up by themselves, just to avoid transferring data back and forth
• Group transfers
  – One large transfer is much better than many small ones (see the sketch below)
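A minimal host-side sketch of the "group transfers" point (buffer names, sizes, and the blocking write are assumptions, not from the original slides): one clEnqueueWriteBuffer call covering the whole data set instead of many small ones.

#include <CL/cl.h>

#define N_CHUNKS     64
#define CHUNK_FLOATS 1024

void upload_grouped(cl_command_queue queue, cl_mem d_buf, const float* h_data)
{
    /* Preferred: one large transfer covering all N_CHUNKS pieces. */
    clEnqueueWriteBuffer(queue, d_buf, CL_TRUE, 0,
                         N_CHUNKS * CHUNK_FLOATS * sizeof(float),
                         h_data, 0, NULL, NULL);

    /* Avoid: N_CHUNKS small transfers, each paying the PCIe/driver overhead:
     * for (int i = 0; i < N_CHUNKS; i++)
     *     clEnqueueWriteBuffer(queue, d_buf, CL_TRUE,
     *                          i * CHUNK_FLOATS * sizeof(float),
     *                          CHUNK_FLOATS * sizeof(float),
     *                          h_data + i * CHUNK_FLOATS, 0, NULL, NULL);
     */
}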


Coalescing
• Global memory latency: 400–600 cycles. The single most important performance
  consideration!
• Global memory accesses by the threads of a half warp can be coalesced into one
  transaction for words of size 8-bit, 16-bit, 32-bit or 64-bit, or into two
  transactions for 128-bit words
• Global memory can be viewed as composed of aligned segments of 16 and 32 words
  – e.g., for 32-bit words: 64-byte and 128-byte segments


Coalescing in Compute Capability 1.0 and 1.1
• The k-th thread in a half warp must access the k-th word in a segment; however,
  not all threads need to participate
• Coalesced – 1 transaction
• Out of sequence – 16 transactions
• Misaligned – 16 transactions
(Figure: one access pattern for each of the three cases)


Coalescing in Compute Capability 1.2 and 1.3
• Coalescing for any pattern of access that fits into a segment size
• # of transactions = # of accessed segments


Example of Misaligned Accesses

__kernel void offsetCopy(__global float* odata,
                         __global float* idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}

With offset = 1, effective bandwidth on the GTX 280 (compute capability 1.3) drops by
a factor of 1.7, while on the FX 5600 (compute capability 1.0) it drops by a factor
of 8.


Example of Strided Accesses

__kernel void strideCopy(__global float* odata,
                         __global float* idata,
                         int stride)
{
    int xid = get_global_id(0) * stride;
    odata[xid] = idata[xid];
}

With stride = 2: large strides often arise in applications; however, strided global
accesses can be avoided by staging the data in local memory.


Local Memory
• Latency ~100x lower than global memory
• Cache data to reduce global memory accesses
• Use local memory to avoid non-coalesced global memory accesses
• Threads can cooperate through local memory


Caching Example 1: Matrix Multiplication
• C = A × B
• Every work-item computes one element of C
• Uncached version:

__kernel void simpleMultiply(__global float* a,
                             __global float* b,
                             __global float* c,
                             int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}


Memory Access Pattern of a Half-Warp
(Figure: the elements of A and B touched by one half warp in the uncached kernel)
• A lot of repeated access to the same row of A
• Un-coalesced on compute capability 1.0/1.1 devices




Matrix Multiplication (cont.)
Cached and coalesced version:

__kernel void coalescedMultiply(__global float* a,
                                __global float* b,
                                __global float* c,
                                int N,
                                __local float aTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}


Matrix Multiplication (cont.)

Optimization                                    GeForce GTX 280   Quadro FX 5600
No optimization                                 8.8 GBps          0.62 GBps
Coalesced, using shared memory to store
a tile of A                                     14.3 GBps         7.34 GBps
Using shared memory to eliminate
redundant reads of a tile of B                  29.7 GBps         15.5 GBps


Coalescing Example 2: Matrix Transpose
• B = A' (the transpose of A)
• Naïve implementation: strided global memory accesses, resulting in 16 transactions
  per half warp if the stride > 16
• Fix: move the strided access into a local memory read by staging a tile of A in
  local memory (see the kernel sketch below)
(Figure: naïve copy A → B vs. tiled copy A → tile → B)
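A minimal sketch of the tiled approach (kernel and parameter names are illustrative, not the deck's own code; TILE_DIM is assumed to equal the work-group width and height and to divide the matrix dimensions):

#define TILE_DIM 16

__kernel void transposeTiled(__global float* odata,        /* B = transpose of A     */
                             __global const float* idata,  /* A: width x height      */
                             int width, int height)
{
    __local float tile[TILE_DIM][TILE_DIM];

    /* Coalesced read of a TILE_DIM x TILE_DIM tile of A into local memory. */
    int x = get_group_id(0) * TILE_DIM + get_local_id(0);
    int y = get_group_id(1) * TILE_DIM + get_local_id(1);
    tile[get_local_id(1)][get_local_id(0)] = idata[y * width + x];

    barrier(CLK_LOCAL_MEM_FENCE);

    /* Coalesced write of the transposed tile: the strided ("column") access
       now happens in local memory instead of global memory. */
    x = get_group_id(1) * TILE_DIM + get_local_id(0);
    y = get_group_id(0) * TILE_DIM + get_local_id(1);
    odata[y * height + x] = tile[get_local_id(0)][get_local_id(1)];
}

Note that the final read walks down a column of the tile; that is where the bank conflicts addressed on the next slides come from.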


Matrix Transpose Performance

Optimization                                   GeForce GTX 280   Quadro FX 5600
No optimization                                1.1 GBps          0.4 GBps
Using shared memory to coalesce global reads   24.8 GBps         13.3 GBps
Removing bank conflicts                        30.3 GBps         15.6 GBps


Bank Conflicts
• A 2nd-order effect compared to global memory coalescing
• Local memory is divided into banks
  – Successive 32-bit words are assigned to successive banks
  – Number of banks = 16 for compute capability 1.x
• Reads/writes to different banks can be performed simultaneously
• Bank conflict: if two reads/writes fall in the same bank, the accesses are serialized
• Thus, local memory accesses should be designed to avoid bank conflicts
  (see the padding sketch below)
(Figure: local memory banks 0–15)
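For the transpose sketch above, the usual fix (again a sketch, not the deck's code) is to pad each row of the tile by one element so that the elements of a tile column fall into different banks:

__local float tile[TILE_DIM][TILE_DIM + 1];   /* one padding column per row    */
/* With 16 banks of 32-bit words, tile[0][c], tile[1][c], ... now map to 16
   different banks, so the column read in the transpose no longer serializes. */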




Work-group Heuristics
• # of work-groups > # of SMs
  – Each SM has at least one work-group to execute
• # of work-groups / # of SMs > 2
  – Multiple work-groups can run concurrently on one SM
  – The SM can work on another work-group while one work-group is waiting on a barrier
• # of work-groups / # of SMs > 100 to scale well to future devices


Work-item Heuristics
• The number of work-items per work-group should be a multiple of 32 (the warp size)
• Want as many warps running as possible to hide latencies
• Minimum: 64 work-items per work-group
• Larger sizes, e.g. 256, may be better
• It depends on the problem – do experiments! (see the host-side sketch below)
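One way to apply these heuristics from the host, as a sketch (kernel, device, queue, and global_size are assumed to exist already, global_size is assumed to remain a multiple of the chosen local size, and 32 is the NVIDIA warp size):

size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);

size_t local_size = 256;        /* start from a generous multiple of 32 */
while (local_size > max_wg)
    local_size /= 2;            /* 128, 64, ... : still multiples of 32 */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);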


Occupancy
• Hiding latency: a thread's instructions are executed sequentially, so executing
  other warps while one warp is stalled is the only way to hide latencies and keep
  the hardware busy
• Occupancy: ratio of active warps per SM to the maximum number of allowed warps
  – 32 on GT200, 24 on GeForce 8- and 9-series GPUs


Global Memory Latency Hiding
• Enough warps can hide the latency of global memory accesses
• Example: assume the code has 8 independent arithmetic instructions (4 cycles each)
  for every global memory access (~400 cycles)
  – We need 400/4 = 100 instructions' worth of work to cover the latency
  – So 100/8 ≈ 13 warps are enough
  – 13 of the 24 warps an SM can hold (GeForce 8/9-series) corresponds to ~54% occupancy


Register Dependency Latency Hiding
• If an instruction uses a result stored in a register written by an instruction
  before it, the latency is ~24 cycles
• So we need 24/4 = 6 warps to hide register dependency latency
  – 6 of 24 warps corresponds to 25% occupancy


Occupancy Considerations
• Increase occupancy to achieve latency hiding
• After some point (e.g. 50%), further increases in occupancy won't lead to a
  performance increase
• Occupancy is limited by resource usage:
  – Registers
  – Local memory
  – Scheduling hardware


Resource Limitation on Occupancy
• The work-groups resident on an SM partition its registers and local memory
• Example (8,192 registers per SM):
  – If every work-item uses 10 registers and every work-group has 256 work-items,
    then 3 work-groups use 256*10*3 = 7,680 < 8,192 registers, and 100% occupancy
    can be achieved
  – However, if every work-item uses 11 registers, then 256*11*3 > 8,192, so only
    2 work-groups are allowed and occupancy is reduced to 66%!
  – But if each work-group has 128 work-items, then 128*11*5 < 8,192 and occupancy
    can be 83% (the arithmetic is worked out below)
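For reference, a worked version of that arithmetic, assuming (as the 8,192-register budget implies) a compute capability 1.0/1.1 SM with a 24-warp limit:

$$
\begin{aligned}
10\ \text{regs},\ 256\ \text{items}: &\quad 3 \cdot 256 \cdot 10 = 7680 \le 8192, & \tfrac{3 \cdot 256}{32} = 24\ \text{warps}, &\quad \tfrac{24}{24} = 100\% \\
11\ \text{regs},\ 256\ \text{items}: &\quad 3 \cdot 256 \cdot 11 = 8448 > 8192 \Rightarrow 2\ \text{groups}, & \tfrac{2 \cdot 256}{32} = 16\ \text{warps}, &\quad \tfrac{16}{24} \approx 66.7\% \\
11\ \text{regs},\ 128\ \text{items}: &\quad 5 \cdot 128 \cdot 11 = 7040 \le 8192, & \tfrac{5 \cdot 128}{32} = 20\ \text{warps}, &\quad \tfrac{20}{24} \approx 83.3\%
\end{aligned}
$$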


Other Resource Limitations on Occupancy
• Maximum number of warps per SM (32 on GT200, 24 on GeForce 8/9-series)
• Maximum number of work-groups per SM: 8
• So the occupancy calculation in realistic cases is complicated, thus…


Occupancy Calculator




Instruction Throughput
• Throughput: # of instructions executed per clock cycle
• In the SIMT architecture, if T is the number of operations per clock cycle:
  SM throughput = T / WarpSize
• Maximizing throughput: use as few cycles as possible to get the job done


Arithmetic Instruction Throughput
• Int add, shift, min, max and float add, mul, mad: T = 8
• Int divide and modulo are expensive
• Avoid automatic conversion between double and float: add "f" to floating-point
  literals (e.g. 1.0f), because unsuffixed literals default to double
  (see the sketch below)
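A small kernel sketch of the literal-suffix point (kernel name and constants are illustrative):

__kernel void scaleBias(__global float* v)
{
    int i = get_global_id(0);
    v[i] = v[i] * 0.5f + 1.0f;   /* the 'f' suffix keeps the arithmetic in single precision */
    /* v[i] * 0.5 + 1.0 would promote to double and then convert back to float. */
}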


Memory Instructions
• Use local memory to reduce global memory accesses
• Increase the algorithm's arithmetic intensity (the ratio of arithmetic instructions
  to global memory access instructions). The higher this ratio, the fewer warps are
  required to hide global memory latency.


Scalar Architecture and Compiler
• NVIDIA GPUs have a scalar architecture
  – Use vector types in OpenCL for convenience, not performance
  – Generally you want more work-items rather than large vectors per work-item
• Use the -cl-mad-enable compiler option
  – Permits the use of FMADs, which can lead to large performance gains
• Investigate using the -cl-fast-relaxed-math compiler option
  – Enables many aggressive compiler optimizations
(A host-side sketch of passing these options follows.)
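Both options are passed as build options, for example (program and device are assumed to have been created already; this is a sketch, not the deck's code):

const char* opts = "-cl-mad-enable -cl-fast-relaxed-math";
cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
/* -cl-fast-relaxed-math relaxes IEEE and NaN handling, so enable it only when
   the kernel's numerics can tolerate that. */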


Math Libraries
• There are two types of runtime math functions
  – native_function(): maps directly to the hardware level – faster but lower accuracy
  – function(): slower but higher accuracy
• Use the native math functions whenever speed matters more than precision
  (see the sketch below)
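Kernel-side sketch (kernel and argument names are illustrative):

__kernel void fastSine(__global float* out, __global const float* in)
{
    int i = get_global_id(0);
    out[i] = native_sin(in[i]);   /* maps to hardware: faster, lower accuracy */
    /* out[i] = sin(in[i]);          full-accuracy version                    */
}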


Control Flow
• If branching happens within a warp, the different execution paths must be
  serialized, increasing the total number of instructions executed
• No penalty if different warps diverge
  – No divergence if the controlling condition depends only on
    local_id / warp_size (see the sketch below)
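A sketch of that last point (a warp size of 32 is assumed; the kernel name and operations are illustrative):

__kernel void warpUniformBranch(__global float* data)
{
    int i = get_global_id(0);
    if ((get_local_id(0) / 32) % 2 == 0)   /* uniform within each warp: no divergence */
        data[i] *= 2.0f;                   /* even-numbered warps                     */
    else
        data[i] += 1.0f;                   /* odd-numbered warps                      */
    /* A condition such as (get_local_id(0) % 2 == 0) would instead diverge inside
       every warp and serialize the two paths. */
}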


Summary
• OpenCL programs running on GPUs can achieve great performance if you
  – Maximize parallel execution
  – Maximize memory bandwidth
  – Maximize instruction throughput
• Thank you and enjoy OpenCL!

© NVIDIA Corporation 2009


Additional Topics
• Async transfer
• Zero copy
• Texture memory
• OpenCL extensions
• Interoperability
• Multi-GPU




