Inside Kepler
Lars Nyland – Architecture
Stephen Jones – CUDA
NVIDIA Corporation


Welcome the Kepler GK110 GPU
Performance
Efficiency
Programmability


SMX: Efficient Performance
Power-Aware SMX Architecture
Clocks & Feature Size
SMX result: Performance up, Power down


Power vs Clock Speed Example

                      Logic             Clocking
                      Area    Power     Area    Power
Fermi   (2x clock)    1.0x    1.0x      1.0x    1.0x
Kepler  (1x clock)    1.8x    0.9x      1.0x    0.5x


SMX Balance of Resources

Resource                     Kepler GK110 vs Fermi
Floating point throughput    2-3x
Max Blocks per SMX           2x
Max Threads per SMX          1.3x
Register File Bandwidth      2x
Register File Capacity       2x
Shared Memory Bandwidth      2x
Shared Memory Capacity       1x


New ISA Encoding: 255 Registers per Thread
Fermi limit: 63 registers per thread
  A common Fermi performance limiter
  Leads to excessive spilling
Kepler: Up to 255 registers per thread
  Especially helpful for FP64 apps
  Ex. Quda QCD fp64 sample runs 5.3x faster
  Spills are eliminated with extra registers


New High-Performance SMX Instructions
SHFL (shuffle) -- Intra-warp data exchange
ATOM -- Broader functionality, faster
Compiler-generated, high-performance instructions:
  bit shift
  bit rotate
  fp32 division
  read-only cache


New Instruction: SHFL
Data exchange between threads within a warp
Avoids use of shared memory
One 32-bit value per exchange
4 variants (example below):
  __shfl()      -- Indexed any-to-any
  __shfl_up()   -- Shift right to nth neighbour
  __shfl_down() -- Shift left to nth neighbour
  __shfl_xor()  -- Butterfly (XOR) exchange
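
As a concrete illustration (not from the slides), here is a minimal warp-sum sketch using __shfl_down(); the helper name warpReduceSum is an assumption, and CUDA 9.0+ code would use the __shfl_down_sync() variant instead.

// Sketch: warp-wide sum via __shfl_down(), no shared memory needed.
__device__ float warpReduceSum(float val)
{
    // Each step pulls a value from the lane 'offset' positions higher,
    // halving the number of lanes that still hold partial sums.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;   // lane 0 now holds the sum of all 32 lanes
}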


ATOM Instruction Enhancements
Added int64 functions to match existing int32:

  Atom Op       int32   int64
  add           x       x
  cas           x       x
  exch          x       x
  min/max       x       x
  and/or/xor    x       x

2 – 10x performance gains
  Shorter processing pipeline
  More atomic processors
  Slowest 10x faster
  Fastest 2x faster
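
A minimal sketch of one of the new int64 atomics (the kernel name globalMax64 is an assumption; atomicMax on unsigned long long requires compute capability 3.5):

__global__ void globalMax64(const unsigned long long *input,
                            unsigned long long *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(result, input[i]);   // 64-bit min/max atomics are new on GK110
}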


High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: Data reduction (sum of all values)

Without Atomics:
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Output is N values
4. Second launch of N threads, reduces outputs to single value


High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: Data reduction (sum of all values)

With Atomics (sketch below):
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Write output directly via atomic. No need for second kernel launch.
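
A minimal sketch of this pattern (the kernel name reduceSum and the launch shape are assumptions, not from the slides):

// Launch as: reduceSum<<<N, threads, threads * sizeof(float)>>>(input, result, n);
__global__ void reduceSum(const float *input, float *result, int n)
{
    extern __shared__ float partial[];            // one slot per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // Standard tree reduction of this block's section in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Fold the block's partial sum straight into the final result:
    // no second kernel launch is needed.
    if (tid == 0)
        atomicAdd(result, partial[0]);
}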


Texture Performance
Texture:
  Provides hardware-accelerated filtered sampling of data (1D, 2D, 3D)
  Read-only data cache holds fetched samples
  Backed up by the L2 cache
SMX vs Fermi SM:
  4x filter ops per clock
  4x cache capacity
(Diagram: SMX -> Tex -> Read-only Data Cache -> L2)


Texture Cache Unlocked
Added a new path for compute:
  Avoids the texture unit
  Allows a global address to be fetched and cached
  Eliminates texture setup
Why use it?
  Separate pipeline from shared/L1
  Highest miss bandwidth
  Flexible, e.g. unaligned accesses
Managed automatically by compiler:
  "const __restrict" indicates eligibility
(Diagram: SMX -> Read-only Data Cache -> L2, bypassing Tex)


const __restrict Example
Annotate eligible kernel parameters with const __restrict
Compiler will automatically map loads to use the read-only data cache path

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture
    // for "input"
    output[offset] = (input[offset] * x) + y;
}
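
As a related sketch (not on the slide), the __ldg() intrinsic available on compute capability 3.5 forces a load through the read-only data cache explicitly, which can help when the compiler cannot prove a pointer is read-only; the kernel name saxpy_ldg is an assumption.

__global__ void saxpy_ldg(float x, float y, const float *input, float *output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // __ldg() reads input[offset] through the read-only data cache path.
    output[offset] = (__ldg(&input[offset]) * x) + y;
}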


Kepler GK110 Memory System Highlights
Efficient memory controller for GDDR5
  Peak memory clocks achievable
More L2
  Double bandwidth
  Double size
More efficient DRAM ECC implementation
  DRAM ECC lookup overhead reduced by 66%
  (average, from a set of application traces)


Bonsai GPU Tree-Code
Jeroen Bédorf, Simon Portegies Zwart – Leiden Observatory, The Netherlands
Evghenii Gaburov – CIERA @ Northwestern U.; SARA, The Netherlands
Journal of Computational Physics, 231:2825-2839, April 2012
Galaxies generated with GalactICs: Widrow L. M., Dubinski J., 2005, Astrophysical Journal, 631, 838


Improving Programmability
Dynamic Parallelism:
  Library Calls from Kernels
  Simplify CPU/GPU Divide
  Batching to Help Fill GPU
  Dynamic Load Balancing
  Occupancy
  Data-Dependent Execution
  Recursive Parallel Algorithms


What is Dynamic Parallelism?
The ability to launch new grids from the GPU:
  Dynamically
  Simultaneously
  Independently
Fermi: Only CPU can generate GPU work
Kepler: GPU can generate work for itself (see the sketch below)
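
A minimal sketch of a GPU-side launch (kernel names are assumptions; requires compute capability 3.5 and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

__global__ void childKernel(int *data)
{
    data[threadIdx.x] *= 2;                  // work generated on the GPU
}

__global__ void parentKernel(int *data)
{
    // One thread of the parent grid launches a child grid directly from the
    // GPU; the child completes before the parent grid is considered finished.
    if (threadIdx.x == 0)
        childKernel<<<1, 32>>>(data);
}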


What Does It Mean?
GPU as Co-Processor -> Autonomous, Dynamic Parallelism


Data-Dependent Parallelism
Computational power allocated to regions of interest
CUDA Today vs. CUDA on Kepler


Dynamic Work Generation
Fixed Grid: statically assign a conservative worst-case grid
Initial Grid -> Dynamic Grid: dynamically assign performance where accuracy is required


Batched & Nested Parallelism
CPU-Controlled Work Batching:
  CPU programs limited by a single point of control
  Can run at most 10s of threads
  CPU is fully consumed with controlling launches
(Diagram: Multiple LU-Decomposition, Pre-Kepler: a CPU control thread serially launches batched dgetf2, dswap, dtrsm, dgemm kernels; algorithm flow simplified for illustrative purposes)


Grid Management Unit
Fermi: Stream Queue Mgmt -> Work Distributor (16 active grids) -> SM SM SM SM
Kepler GK110: Stream Queue Mgmt + CUDA-Generated Work -> Grid Management Unit (Pending & Suspended Grids, 1000s of pending grids) -> Work Distributor (32 active grids) -> SMX SMX SMX SMX


Fermi Concurrency
Stream 1: A -- B -- C
Stream 2: P -- Q -- R
Stream 3: X -- Y -- Z
Hardware Work Queue: A--B--C  P--Q--R  X--Y--Z
Fermi allows 16-way concurrency:
  Up to 16 grids can run at once
  But CUDA streams multiplex into a single queue
  Overlap only at stream edges


Kepler Improved Concurrency
Stream 1: A -- B -- C
Stream 2: P -- Q -- R
Stream 3: X -- Y -- Z
Multiple Hardware Work Queues: A--B--C | P--Q--R | X--Y--Z
Kepler allows 32-way concurrency (example below):
  One work queue per stream
  Concurrency at full-stream level
  No inter-stream dependencies
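
A minimal host-side sketch of per-stream concurrency (the kernel name work and the sizes are assumptions, not from the slides):

#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;      // placeholder workload
}

int main()
{
    const int nStreams = 3, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], n * sizeof(float));
        // Each independent grid goes to its own stream; on Kepler each stream
        // maps to its own hardware work queue, so the grids can overlap fully.
        work<<<(n + 255) / 256, 256, 0, streams[i]>>>(buf[i], n);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}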


Fermi: Time-Division Multiprocess
CPU Processes: A B C D E F
Shared GPU


Hyper-Q: Simultaneous Multiprocess
CPU Processes: A B C D E F
Shared GPU


Without Hyper-Q
(Chart: GPU Utilization % over time; processes A-F execute one after another)


With Hyper-Q
(Chart: GPU Utilization % over time; processes A-F overlap on the GPU)


Whitepaper: http://www.nvidia.com/object/nvidia-kepler.html
