Evolution of the NVIDIA GPU Architecture


Graphics Pipeline


Computational Elements of a GPU
∗ Streaming Processor (SP) – core of the design
  ∗ Place where all of the computation takes place
∗ Streaming Multiprocessor (SM)
  ∗ Group of streaming processors
  ∗ In addition to the SPs, these also contain the Special Function Units and Load/Store Units
  ∗ Instruction schedulers
  ∗ Complex control logic


Streaming Multiprocessor Architecture


Types of GPU Memory
∗ Global
  ∗ DRAM
  ∗ Slowest performance
∗ Texture
  ∗ Cached global memory
  ∗ "Bound" at runtime
∗ Constant
  ∗ Cached global memory
∗ Shared
  ∗ Local to a block of threads
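
A minimal declaration sketch of these memory spaces in CUDA C, assuming a .cu file of this era (the array names and sizes are illustrative, and the texture reference uses the legacy API):

    __constant__ float coeffs[16];        // constant memory: cached, read-only inside kernels
    texture<float, 1> texData;            // texture reference: cached global memory,
                                          // "bound" on the host with cudaBindTexture() at runtime

    __global__ void memorySpaces(float* in, float* out)   // in/out point to global memory (DRAM)
    {
        __shared__ float tile[256];       // shared memory: local to one block of threads
                                          // (assumes a 256-thread block)
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[id] * coeffs[0];
        __syncthreads();
        out[id] = tile[threadIdx.x];
    }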


Architectural Memory Hierarchy


Fermi Architecture


Fermi Improvements
∗ Increased the number of SPs per SM
∗ Unified request path for load/store instructions
∗ Implementation of a cache hierarchy
  ∗ L1 cache per SM, configurable with shared memory (see the sketch below)
  ∗ L2 cache is shared globally
∗ Register spilling
  ∗ Occurs when the register requirements of a thread exceed what is available on the device
  ∗ Previous generation: spill to DRAM (global memory)
  ∗ Fermi: spills go to the L1 cache
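
The L1/shared split is selected per kernel from the host; a minimal sketch using the runtime call for that purpose (the kernel name is hypothetical):

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data) { /* ... */ }

    int main()
    {
        // Favor a larger L1 (48 KB L1 / 16 KB shared on Fermi): useful when a kernel
        // spills registers, since Fermi spills go to L1 rather than straight to DRAM.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        // Alternatively, favor shared memory (16 KB L1 / 48 KB shared):
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        myKernel<<<1, 256>>>(NULL);
        return 0;
    }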


Summary


Kepler SM Design


Warp Scheduler
∗ 4 warp schedulers per SM
∗ Each scheduler can issue up to 2 independent instructions per cycle when its warp is ready to issue


Kepler Memory Architecture
∗ Shared memory and L1 are still physically shared
  ∗ New configuration: 32 KB L1 / 32 KB shared
  ∗ Shared memory bandwidth is doubled compared with Fermi
∗ Increased the size of L2
  ∗ Doubled relative to Fermi, increasing it to 1536 KB
∗ Introduction of a read-only data cache
  ∗ Previously, this storage was used in Fermi as the texture cache
  ∗ 48 KB of storage (see the sketch below)
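
On GK110 (compute capability 3.5), loads can be routed through the 48 KB read-only data cache either with const __restrict__ qualifiers or explicitly with the __ldg() intrinsic; a minimal sketch with illustrative names:

    // Compile for sm_35 or newer; both the qualifiers and __ldg() steer the load
    // through the read-only data cache instead of L1.
    __global__ void scale(const float* __restrict__ in, float* __restrict__ out, float k, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            out[id] = k * __ldg(&in[id]);   // read-only cache load
    }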


Warp Shuffle Instructions
∗ In Fermi, data could only be exchanged between threads using shared memory
  ∗ Resulted in additional synchronization time
∗ Kepler adds the shuffle functions, which
  ∗ Exchange data between threads without using shared memory
  ∗ Handle the store-and-load operation as a single step
∗ Data can only be shared within the same warp
∗ In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction
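
A common use is a warp-wide reduction; a minimal sketch of a 32-lane sum (Kepler-era CUDA exposed this as __shfl_down(); CUDA 9 and later renamed it to the _sync form used here):

    // Sums a value across the 32 lanes of a warp with no shared memory
    // and no __syncthreads(); lane 0 ends up holding the full sum.
    __device__ float warpSum(float val)
    {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }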


Kepler Hardware Features
∗ Dynamic Parallelism
  ∗ Any kernel can launch more kernels from within itself (see the sketch below)
  ∗ Takes additional load off of the CPU
∗ Hyper-Q
  ∗ 32 hardware-managed work queues
  ∗ Fermi had 1 queue
∗ Grid Management Unit
  ∗ Needed to manage the number of grids that are executed
  ∗ Introduction of the GMU to handle all of the grids that can be active at one time
∗ NVIDIA GPUDirect
  ∗ Ability for CUDA-enabled GPUs to interact without the need for CPU intervention
  ∗ The GPU can interact directly with the NIC
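
A minimal sketch of dynamic parallelism with hypothetical kernel names (requires compute capability 3.5+ and compilation with nvcc -arch=sm_35 -rdc=true, linking cudadevrt):

    __global__ void childKernel(float* data, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) data[id] *= 2.0f;
    }

    __global__ void parentKernel(float* data, int n)
    {
        // One thread sizes the child grid and launches it directly from the GPU,
        // with no round trip to the CPU.
        if (threadIdx.x == 0 && blockIdx.x == 0)
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }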


Comparison of Kepler and Fermi


Use for Computation
∗ Historically, GPUs were used for graphics to offload CPU work
∗ Current trend – combine the CPU and GPU on a single chip
∗ Due to the massively parallel nature of the work, GPUs are ideal because of their number of processing cores
  ∗ However, they are only ideal when there are few data dependencies
∗ Introduction of CUDA and the Tesla GPUs


CUDA Programming
∗ Extensions to the C language
  ∗ With some C++ support
∗ Programming support
  ∗ Windows – Visual Studio
  ∗ Linux/Mac – Eclipse
∗ Programming paradigm where each computation takes place on a separate thread
∗ Requires an NVIDIA GPU for acceleration
  ∗ Simulators are used for research purposes


Example – Vector Addition

C
    for( int i = 0; i < SIZE; ++i ) {
        c[ i ] = a[ i ] + b[ i ];
    }

CUDA
    __global__ void addVectors( float* a, float* b, float* c ) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index across all blocks
        if( id < SIZE ) {
            c[ id ] = a[ id ] + b[ id ];
        }
    }


Programming Requirements
∗ Explicit memory operations to allocate and copy data from the CPU to the GPU (see the host-side sketch below)
  ∗ Some exceptions do apply
∗ All kernels execute asynchronously with respect to the CPU
  ∗ Explicit synchronization barriers between the processors
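
A minimal host-side sketch of these steps around the vector-addition kernel from the earlier slide (SIZE and the launch configuration are illustrative):

    #include <cuda_runtime.h>
    #define SIZE 1024

    __global__ void addVectors(float* a, float* b, float* c)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < SIZE) c[id] = a[id] + b[id];
    }

    int main()
    {
        float h_a[SIZE], h_b[SIZE], h_c[SIZE];
        for (int i = 0; i < SIZE; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

        // Explicit allocation on the GPU and host-to-device copies
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, SIZE * sizeof(float));
        cudaMalloc(&d_b, SIZE * sizeof(float));
        cudaMalloc(&d_c, SIZE * sizeof(float));
        cudaMemcpy(d_a, h_a, SIZE * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, SIZE * sizeof(float), cudaMemcpyHostToDevice);

        // The launch returns immediately (asynchronous with respect to the CPU)...
        addVectors<<<SIZE / 256, 256>>>(d_a, d_b, d_c);

        // ...so synchronize before depending on the results on the host
        cudaDeviceSynchronize();
        cudaMemcpy(h_c, d_c, SIZE * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }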


Synchronization and Performance
∗ To meet data dependencies, CUDA provides synchronization primitives
  ∗ __syncthreads() – synchronizes all threads in a block
  ∗ Atomic operations – depending on the compute/CUDA version, these are possible on global and shared memory
∗ Performance is dictated by memory operations and synchronization cost
  ∗ Memory coalescence
  ∗ Warp divergence
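
A minimal sketch combining both primitives: a block-level sum that uses __syncthreads() to satisfy the shared-memory data dependencies and a global-memory atomicAdd() to combine per-block results (assumes 256-thread blocks and float atomics, i.e. compute capability 2.0 or newer; names are illustrative):

    __global__ void blockSum(const float* in, float* total, int n)
    {
        __shared__ float partial[256];
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        partial[threadIdx.x] = (id < n) ? in[id] : 0.0f;
        __syncthreads();                     // all stores visible before any thread reads

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();                 // barrier between each reduction step
        }

        if (threadIdx.x == 0)
            atomicAdd(total, partial[0]);    // combine block results in global memory
    }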


Relation to Other Architectures
∗ SMT
  ∗ Many smaller cores, with less functionality, to compute results
  ∗ Each core has a hardware context for a thread that can be switched out
∗ Vector Processors
  ∗ Computation of results in parallel that could be done sequentially by a CPU
  ∗ Ability to access large chunks of data from memory at a given time
  ∗ Banks of shared memory – could lead to bank conflicts
∗ Digital Signal Processors
  ∗ As with DSP algorithms, many applications could also use the MAC elements; these are built into the GPU by design
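
A minimal sketch of the bank-conflict point above, assuming a 32x32 thread block and an illustrative kernel name: shared memory on these GPUs is divided into 32 banks of 32-bit words, so an access pattern that keeps a warp's threads in different banks is serviced at full speed, while one that funnels them into a single bank is serialized.

    __global__ void bankConflictDemo(float* out)
    {
        __shared__ float tile[32][32];

        tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
        __syncthreads();

        // Conflict-free: within a warp, thread x reads a word in bank x
        float rowRead = tile[threadIdx.y][threadIdx.x];

        // 32-way conflict: the warp's 32 threads all read column 0, whose words
        // are 32 apart and therefore map to the same bank, so reads serialize
        float colRead = tile[threadIdx.x][0];

        out[threadIdx.y * 32 + threadIdx.x] = rowRead + colRead;
        // A common fix is to pad the inner dimension: __shared__ float tile[32][33];
    }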


Conclusions
∗ GPUs are massively parallel devices that can be used for general-purpose computing, in addition to graphics processing
∗ As the cost continues to decrease, these devices become off-the-shelf components that can be used to build larger systems
∗ In addition to compute capabilities, Kepler offers the benefit of additional performance per watt, making for a more power-efficient design
∗ When used with other technologies, like OpenCL, GPUs can be used in heterogeneous platforms


References
∗ http://www.nvidia.com/page/corporate_timeline.html
∗ http://www.pcmag.com/encyclopedia_term/0,2542,t=graphics+pipeline&i=43933,00.asp
∗ S. L. Alarcon, "CUDA Memories," unpublished.
∗ NVIDIA. (2012, April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
∗ NVIDIA. (2009). NVIDIA's Next Generation CUDA Compute Architecture: Fermi. [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
