Evolution of the NVIDIA GPU Architecture


Graphics Pipeline


Computational Elements of a GPU
∗ Streaming Processor (SP) – core of the design
  ∗ Place where all of the computation takes place
∗ Streaming Multiprocessor (SM)
  ∗ Group of streaming processors
  ∗ In addition to the SPs, these also contain the Special Function Units and Load/Store Units
  ∗ Instruction schedulers
  ∗ Complex control logic


Streaming Multiprocessor Architecture


Types of GPU Memory
∗ Global
  ∗ DRAM
  ∗ Slowest performance
∗ Texture
  ∗ Cached global memory
  ∗ "Bound" at runtime
∗ Constant
  ∗ Cached global memory
∗ Shared
  ∗ Local to a block of threads
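
A minimal declaration sketch of these memory spaces in CUDA C, assuming a .cu file of this era (the array names and sizes are illustrative, and the texture reference uses the legacy API):

    __constant__ float coeffs[16];        // constant memory: cached, read-only inside kernels
    texture<float, 1> texData;            // texture reference: cached global memory,
                                          // "bound" on the host with cudaBindTexture() at runtime

    __global__ void memorySpaces(float* in, float* out)   // in/out point to global memory (DRAM)
    {
        __shared__ float tile[256];       // shared memory: local to one block of threads
                                          // (assumes a 256-thread block)
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[id] * coeffs[0];
        __syncthreads();
        out[id] = tile[threadIdx.x];
    }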


Architectural Memory Hierarchy


Fermi Architecture


Fermi Improvements
∗ Increased the number of SPs per SM
∗ Unified request path for load/store instructions
∗ Implementation of a cache hierarchy
  ∗ L1 cache per SM, configurable with shared memory (see the sketch below)
  ∗ L2 cache is shared globally
∗ Register spilling
  ∗ Occurs when the register requirements of a thread exceed what is available on the device
  ∗ Previous generation: spill to DRAM (global memory)
  ∗ Fermi: spills go to the L1 cache
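
The L1/shared split is selected per kernel from the host; a minimal sketch using the runtime call for that purpose (the kernel name is hypothetical):

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data) { /* ... */ }

    int main()
    {
        // Favor a larger L1 (48 KB L1 / 16 KB shared on Fermi): useful when a kernel
        // spills registers, since Fermi spills go to L1 rather than straight to DRAM.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        // Alternatively, favor shared memory (16 KB L1 / 48 KB shared):
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        myKernel<<<1, 256>>>(NULL);
        return 0;
    }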


Summary


Kepler SM Design


Warp Scheduler
∗ 4 warp schedulers per SM
∗ Each scheduler can issue up to 2 independent instructions per cycle when its warp is ready to issue


Kepler Memory Architecture
∗ Shared memory and L1 are still physically shared
  ∗ New configuration: 32 KB L1 / 32 KB shared
  ∗ Shared memory bandwidth is doubled compared with Fermi
∗ Increased the size of L2
  ∗ Doubled relative to Fermi, increasing it to 1536 KB
∗ Introduction of a read-only data cache
  ∗ Previously, this storage was used in Fermi as the texture cache
  ∗ 48 KB of storage (see the sketch below)
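
On GK110 (compute capability 3.5), loads can be routed through the 48 KB read-only data cache either with const __restrict__ qualifiers or explicitly with the __ldg() intrinsic; a minimal sketch with illustrative names:

    // Compile for sm_35 or newer; both the qualifiers and __ldg() steer the load
    // through the read-only data cache instead of L1.
    __global__ void scale(const float* __restrict__ in, float* __restrict__ out, float k, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            out[id] = k * __ldg(&in[id]);   // read-only cache load
    }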


Warp Shuffle Instructions
∗ In Fermi, data could only be exchanged between threads using shared memory
  ∗ Resulted in additional synchronization time
∗ Kepler adds the shuffle functions, which
  ∗ Exchange data between threads without using shared memory
  ∗ Handle the store-and-load operation as a single step
∗ Data can only be shared within the same warp
∗ In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction
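
A common use is a warp-wide reduction; a minimal sketch of a 32-lane sum (Kepler-era CUDA exposed this as __shfl_down(); CUDA 9 and later renamed it to the _sync form used here):

    // Sums a value across the 32 lanes of a warp with no shared memory
    // and no __syncthreads(); lane 0 ends up holding the full sum.
    __device__ float warpSum(float val)
    {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }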


Kepler Hardware Features
∗ Dynamic Parallelism
  ∗ Any kernel can launch more kernels from within itself (see the sketch below)
  ∗ Takes additional load off of the CPU
∗ Hyper-Q
  ∗ 32 hardware-managed work queues
  ∗ Fermi had 1 queue
∗ Grid Management Unit
  ∗ Needed to manage the number of grids that are executed
  ∗ Introduction of the GMU to handle all of the grids that can be active at one time
∗ NVIDIA GPUDirect
  ∗ Ability for CUDA-enabled GPUs to interact without the need for CPU intervention
  ∗ The GPU can interact directly with the NIC
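
A minimal sketch of dynamic parallelism with hypothetical kernel names (requires compute capability 3.5+ and compilation with nvcc -arch=sm_35 -rdc=true, linking cudadevrt):

    __global__ void childKernel(float* data, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) data[id] *= 2.0f;
    }

    __global__ void parentKernel(float* data, int n)
    {
        // One thread sizes the child grid and launches it directly from the GPU,
        // with no round trip to the CPU.
        if (threadIdx.x == 0 && blockIdx.x == 0)
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }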


Comparison of Kepler and Fermi


Use for Computation
∗ Historically, GPUs were used for graphics to offload CPU work
∗ Current trend – combine the CPU and GPU on a single chip
∗ Due to the massively parallel nature of the work, GPUs are ideal because of their number of processing cores
  ∗ However, they are only ideal when there are few data dependencies
∗ Introduction of CUDA and the Tesla GPUs


CUDA Programming
∗ Extensions to the C language
  ∗ With some C++ support
∗ Programming support
  ∗ Windows – Visual Studio
  ∗ Linux/Mac – Eclipse
∗ Programming paradigm where each computation takes place on a separate thread
∗ Requires an NVIDIA GPU for acceleration
  ∗ Simulators are used for research purposes


Example – Vector Addition

C
    for( int i = 0; i < SIZE; ++i ) {
        c[ i ] = a[ i ] + b[ i ];
    }

CUDA
    __global__ void addVectors( float* a, float* b, float* c ) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index across all blocks
        if( id < SIZE ) {
            c[ id ] = a[ id ] + b[ id ];
        }
    }


Programming Requirements
∗ Explicit memory operations to allocate and copy data from the CPU to the GPU (see the host-side sketch below)
  ∗ Some exceptions do apply
∗ All kernels execute asynchronously with respect to the CPU
  ∗ Explicit synchronization barriers between the processors
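
A minimal host-side sketch of these steps around the vector-addition kernel from the earlier slide (SIZE and the launch configuration are illustrative):

    #include <cuda_runtime.h>
    #define SIZE 1024

    __global__ void addVectors(float* a, float* b, float* c)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < SIZE) c[id] = a[id] + b[id];
    }

    int main()
    {
        float h_a[SIZE], h_b[SIZE], h_c[SIZE];
        for (int i = 0; i < SIZE; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

        // Explicit allocation on the GPU and host-to-device copies
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, SIZE * sizeof(float));
        cudaMalloc(&d_b, SIZE * sizeof(float));
        cudaMalloc(&d_c, SIZE * sizeof(float));
        cudaMemcpy(d_a, h_a, SIZE * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, SIZE * sizeof(float), cudaMemcpyHostToDevice);

        // The launch returns immediately (asynchronous with respect to the CPU)...
        addVectors<<<SIZE / 256, 256>>>(d_a, d_b, d_c);

        // ...so synchronize before depending on the results on the host
        cudaDeviceSynchronize();
        cudaMemcpy(h_c, d_c, SIZE * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }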


Synchronization and Performance
∗ To meet data dependencies, CUDA provides synchronization primitives
  ∗ __syncthreads() – synchronizes all threads in a block
  ∗ Atomic operations – depending on the compute/CUDA version, these are possible on global and shared memory
∗ Performance is dictated by memory operations and synchronization cost
  ∗ Memory coalescence
  ∗ Warp divergence
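
A minimal sketch combining both primitives: a block-level sum that uses __syncthreads() to satisfy the shared-memory data dependencies and a global-memory atomicAdd() to combine per-block results (assumes 256-thread blocks and float atomics, i.e. compute capability 2.0 or newer; names are illustrative):

    __global__ void blockSum(const float* in, float* total, int n)
    {
        __shared__ float partial[256];
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        partial[threadIdx.x] = (id < n) ? in[id] : 0.0f;
        __syncthreads();                     // all stores visible before any thread reads

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();                 // barrier between each reduction step
        }

        if (threadIdx.x == 0)
            atomicAdd(total, partial[0]);    // combine block results in global memory
    }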


Relation to Other Architectures
∗ SMT
  ∗ Many smaller cores, with less functionality, to compute results
  ∗ Each core has a hardware context for a thread that can be switched out
∗ Vector Processors
  ∗ Computation of results in parallel that could be done sequentially by a CPU
  ∗ Ability to access large chunks of data from memory at a given time
  ∗ Banks of shared memory – could lead to bank conflicts
∗ Digital Signal Processors
  ∗ As with DSP algorithms, many applications could also use the MAC elements; these are built into the GPU by design
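
A minimal sketch of the bank-conflict point above, assuming a 32x32 thread block and an illustrative kernel name: shared memory on these GPUs is divided into 32 banks of 32-bit words, so an access pattern that keeps a warp's threads in different banks is serviced at full speed, while one that funnels them into a single bank is serialized.

    __global__ void bankConflictDemo(float* out)
    {
        __shared__ float tile[32][32];

        tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
        __syncthreads();

        // Conflict-free: within a warp, thread x reads a word in bank x
        float rowRead = tile[threadIdx.y][threadIdx.x];

        // 32-way conflict: the warp's 32 threads all read column 0, whose words
        // are 32 apart and therefore map to the same bank, so reads serialize
        float colRead = tile[threadIdx.x][0];

        out[threadIdx.y * 32 + threadIdx.x] = rowRead + colRead;
        // A common fix is to pad the inner dimension: __shared__ float tile[32][33];
    }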


Conclusions
∗ GPUs are massively parallel devices that can be used for general-purpose computing, in addition to graphics processing
∗ As the cost continues to decrease, these devices become off-the-shelf components that can be used to build larger systems
∗ In addition to compute capabilities, Kepler offers the benefit of additional performance per watt, making for a more power-efficient design
∗ When used with other technologies, like OpenCL, GPUs can be used in heterogeneous platforms


References
∗ http://www.nvidia.com/page/corporate_timeline.html
∗ http://www.pcmag.com/encyclopedia_term/0,2542,t=graphics+pipeline&i=43933,00.asp
∗ S. L. Alarcon, "CUDA Memories," unpublished.
∗ NVIDIA. (2012, April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
∗ NVIDIA. (2009). NVIDIA's Next Generation CUDA Compute Architecture: Fermi. [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
