
CUDA and MPI
Dale Southard, Senior Solution Architect


Pinning Host Memory


What You’ve Learned So Far…
- Use new or malloc() to allocate host memory
- Use cudaMalloc() to allocate device memory
- Transfer using cudaMemcpy()
- This works but won’t achieve full PCIe speed
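As a concrete baseline, that pattern looks roughly like this (a sketch; N is assumed defined elsewhere, error checking omitted):

    // Pageable host memory + cudaMemcpy: portable, but the driver must
    // stage pageable pages through an internal pinned buffer, so the
    // transfer does not saturate the PCIe bus.
    size_t size = N * sizeof(float);
    float *h_data = (float *)malloc(size);   // pageable host allocation
    float *d_data;
    cudaMalloc((void **)&d_data, size);      // device allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    // ... launch kernels on d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);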


Pinning Host Memory
- Page-locked (pinned) memory can’t be swapped out
- Because it is guaranteed to be resident in RAM, the GPU can do more efficient DMA transfers (saturating the PCIe bus)
- Pinned memory can be allocated with either:
    cudaMallocHost(void **ptr, size_t size)
    cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
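The same transfer rewritten with a pinned allocation (a sketch; error checking omitted):

    // Pinned host memory: the GPU can DMA directly from/to this buffer.
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, size);  // page-locked host allocation
    cudaMalloc((void **)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // full-speed DMA
    // ... launch kernels on d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFreeHost(h_data);  // pinned memory is freed with cudaFreeHost, not free()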


Zero-Copy Memory before CUDA 4.0
- Since we can efficiently DMA from pinned memory, the GPU can access it in situ:

    float *h_ptr, *d_ptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);                      // must be set before the context is created
    cudaHostAlloc((void **)&h_ptr, size, cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);        // device alias of h_ptr
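A kernel can then dereference the mapped pointer directly; a small illustrative example (the kernel itself is not from the slides):

    // Each access goes over PCIe to the pinned host buffer: no explicit copy.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    scale<<<(n + 255) / 256, 256>>>(d_ptr, 2.0f, n);
    cudaDeviceSynchronize();  // results are now visible through h_ptr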


GPUDirect


The DMA/RDMA Problem
- CUDA driver allocates its own pinned memory region for DMA transfers to/from GPU
- IB driver allocates its own pinned memory region for RDMA transfers to/from IB card
- GPU can only access system memory
- IB can only access system memory
- MPI stack has no knowledge of GPU
- So, for efficient copies we need two pinned buffers
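In user code, the sender side of that pattern looks roughly like this (a sketch with illustrative names; d_buf is assumed to be a cudaMalloc’d device buffer, error checking omitted):

    // Step 1: GPU -> pinned host buffer owned by the CUDA path.
    float *h_buf;
    cudaMallocHost((void **)&h_buf, size);
    cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);

    // Step 2: hand off to MPI. Without GPUDirect, the IB stack RDMAs from
    // its own, separately pinned region, so an extra CPU copy happens
    // between the two pinned buffers before the data reaches the wire.
    MPI_Send(h_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);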


[Diagram: MPI and CUDA before GPUDirect — CPU, main memory, chipset, GPU, and GPU memory; transfers require two copies (1, 2) staged through main memory]


What is GPUDirect?
- GPUDirect is an umbrella term for improving interoperability with third-party devices (especially cluster fabric hardware)
- Long-term goal is to reduce dependence on the CPU for managing transfers
- Contains both programming model and system software enhancements
- Linux only (for now)


GPUDirect v1
- Jointly developed with Mellanox
- Enables IB driver and CUDA driver to share the same pinned memory
- Eliminates CPU memcpy()s
- Kernel patch for additional kernel mode callback
- Guarantees proper cleanup of shared physical memory at process shutdown
- Currently shipping


[Diagram: GPUDirect v1 — CPU, chipset, GPU, GPU memory, and InfiniBand; the CUDA and IB drivers share one pinned buffer, so only one copy (1) is needed]


CUDA 4.0 Enhancements


No-copy Pinning of System Memory
- Reduce system memory usage and CPU memcpy() overhead
- Easier to add CUDA acceleration to existing applications
- Just register malloc’d system memory for async operations and then call cudaMemcpy() as usual
- All CUDA-capable GPUs on Linux or Windows
- Requires Linux kernel 2.6.15+ (RHEL 5)

Before no-copy pinning (extra allocation and extra copy required):
    malloc(a)
    cudaMallocHost(b)
    memcpy(b, a)
    cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
    memcpy(a, b)
    cudaFreeHost(b)

With no-copy pinning (just register and go!):
    malloc(a)
    cudaHostRegister(a)
    cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
    cudaHostUnregister(a)
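Spelled out as compilable code, the registered path might look like this (a sketch; note that early CUDA releases required the pointer and size to be page-aligned):

    float *a = (float *)malloc(size);
    // ... fill a ...
    cudaHostRegister(a, size, cudaHostRegisterDefault);   // pin in place
    cudaMemcpy(d_data, a, size, cudaMemcpyHostToDevice);  // full-speed DMA
    // ... launch kernels ...
    cudaMemcpy(a, d_data, size, cudaMemcpyDeviceToHost);
    cudaHostUnregister(a);                                // unpin before freeing
    free(a);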


Unified Virtual Addressing
- One address space for all CPU and GPU memory
- Determine physical memory location from pointer value
- Enables libraries to simplify their interfaces (e.g. cudaMemcpy)
- Supported on Tesla 20-series and other Fermi GPUs
- 64-bit applications on Linux and Windows TCC

Before UVA (separate options for each permutation):
    cudaMemcpyHostToHost
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
    cudaMemcpyDeviceToDevice

With UVA (one function handles all cases):
    cudaMemcpyDefault
    (data location becomes an implementation detail)
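In practice (a sketch; src and dst can each be a host or a device pointer):

    // The driver infers each pointer's location from its value,
    // so one call covers H2D, D2H, D2D, and H2H.
    cudaMemcpy(dst, src, size, cudaMemcpyDefault);

    // The location can also be queried explicitly:
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, src);  // reports host vs. device residency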


Unified Virtual Addressing: Easier to Program with a Single Address Space
[Diagram: without UVA, system memory and each GPU’s memory are separate address spaces (each running 0x0000–0xFFFF); with UVA, CPU, GPU0, and GPU1 share one address space across PCI-e]


GPUDirect v2
- Uses UVA
- GPU-aware MPI: MPI calls handle both GPU and CPU pointers
- Improves programmer productivity
- Data movement done in SW; same performance as v1
- Requires:
  - CUDA 4.0 and unified address space support
  - 64-bit host app and GF100+ only
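With a GPU-aware MPI stack (for example MVAPICH2 or Open MPI built with CUDA support), the staging from the earlier example disappears from user code; a sketch:

    float *d_buf;
    cudaMalloc((void **)&d_buf, size);
    // ... kernel produces data in d_buf ...

    // The device pointer goes straight into MPI; via UVA the stack detects
    // that d_buf lives on the GPU and pipelines the copies internally.
    MPI_Send(d_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);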


Multiple GPUs Before NVIDIA GPUDirect v2.0: Required Copy into Main Memory
- Two copies required:
  1. cudaMemcpy(sysmem, GPU1)
  2. cudaMemcpy(GPU2, sysmem)
[Diagram: GPU1 memory → system memory → GPU2 memory, through the CPU/chipset over PCI-e]


Multiple GPUs with NVIDIA GPUDirect v2.0: Peer-to-Peer Communication
- Direct transfers between GPUs
- Only one copy required:
  1. cudaMemcpy(GPU2, GPU1)
[Diagram: GPU1 memory → GPU2 memory directly over PCI-e, bypassing system memory]


GPUDirect v2.0: Peer-to-Peer Communication
- Direct communication between GPUs
  - Faster: no system memory copy overhead
  - More convenient multi-GPU programming
- Direct transfers: copy from GPU0 memory to GPU1 memory
  - Works transparently with UVA
- Direct access: GPU0 reads or writes GPU1 memory (load/store)
- Supported only on Tesla 20-series (Fermi)
- 64-bit applications on Linux and Windows TCC
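A minimal peer-to-peer sketch (device IDs and buffer names are illustrative; error checking omitted):

    // Check whether device 0 can map device 1's memory, then enable it.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
    }

    // Direct transfer: dst on GPU1, src on GPU0.  (With UVA, a plain
    // cudaMemcpy with cudaMemcpyDefault works as well.)
    cudaMemcpyPeer(d1_ptr, 1, d0_ptr, 0, size);

Enabling peer access also lets kernels running on device 0 dereference d1_ptr directly, which is the load/store case above.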


GPUDirect Future Directions
- P2P protocol could be extended to other devices
  - Network cards
  - Storage devices (flash?)
  - Other?
- Extended PCI topologies
- More GPU autonomy
- Better NUMA topology discovery/exposure


Topology and Locality


Topology
[Diagram: a four-socket system (memory, CPU, and chipset per socket) with two GPUs; measured transfer bandwidth is 6.5 GB/s vs. 4.34 GB/s depending on where the process runs relative to the GPU]


And More Topology
[Diagram: a different four-CPU system with two chipsets and two GPUs; bandwidths of 11.0 GB/s vs. 7.4 GB/s depending on placement]


There’s No Easy Answer Here…
- Memory hierarchies are getting deeper, uniformity is decreasing
- Portable Hardware Locality (hwloc)
- NUMA placement (SLURM SPANK plugins, some MPIs)
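As one illustration of using hwloc for placement (a sketch, not from the slides; the rank-to-core policy is made up for the example):

    #include <hwloc.h>

    /* Bind the calling thread to a core chosen from the MPI rank, so each
       rank stays on a fixed socket (ideally the one nearest its GPU). */
    void bind_rank_to_core(int rank)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                                 rank % ncores);
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);

        hwloc_topology_destroy(topo);
    }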
