
CUDA and MPI
Dale Southard, Senior Solution Architect


Pinning Host Memory


What You’ve Learned So Far…
- Use new or malloc() to allocate host memory
- Use cudaMalloc() to allocate device memory
- Transfer using cudaMemcpy()
- This works but won’t achieve full PCIe speed
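As a concrete baseline, that pattern looks roughly like this (a sketch; N is assumed defined elsewhere, error checking omitted):

    // Pageable host memory + cudaMemcpy: portable, but the driver must
    // stage pageable pages through an internal pinned buffer, so the
    // transfer does not saturate the PCIe bus.
    size_t size = N * sizeof(float);
    float *h_data = (float *)malloc(size);   // pageable host allocation
    float *d_data;
    cudaMalloc((void **)&d_data, size);      // device allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    // ... launch kernels on d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);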


Pinning Host Memory
- Page-locked (pinned) memory can’t be swapped out
- Because it is guaranteed to be resident in RAM, the GPU can do more efficient DMA transfers (saturating the PCIe bus)
- Pinned memory can be allocated with either:
    cudaMallocHost(void **ptr, size_t size)
    cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
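The same transfer rewritten with a pinned allocation (a sketch; error checking omitted):

    // Pinned host memory: the GPU can DMA directly from/to this buffer.
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, size);  // page-locked host allocation
    cudaMalloc((void **)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // full-speed DMA
    // ... launch kernels on d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFreeHost(h_data);  // pinned memory is freed with cudaFreeHost, not free()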


Zero-Copy Memory before CUDA 4.0
- Since we can efficiently DMA from pinned memory, the GPU can access it in situ:

    float *h_ptr, *d_ptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);                      // must be set before the context is created
    cudaHostAlloc((void **)&h_ptr, size, cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);        // device alias of h_ptr
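A kernel can then dereference the mapped pointer directly; a small illustrative example (the kernel itself is not from the slides):

    // Each access goes over PCIe to the pinned host buffer: no explicit copy.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    scale<<<(n + 255) / 256, 256>>>(d_ptr, 2.0f, n);
    cudaDeviceSynchronize();  // results are now visible through h_ptr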


GPUDirect


The DMA/RDMA Problem
- CUDA driver allocates its own pinned memory region for DMA transfers to/from GPU
- IB driver allocates its own pinned memory region for RDMA transfers to/from IB card
- GPU can only access system memory
- IB can only access system memory
- MPI stack has no knowledge of GPU
- So, for efficient copies we need two pinned buffers
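In user code, the sender side of that pattern looks roughly like this (a sketch with illustrative names; d_buf is assumed to be a cudaMalloc’d device buffer, error checking omitted):

    // Step 1: GPU -> pinned host buffer owned by the CUDA path.
    float *h_buf;
    cudaMallocHost((void **)&h_buf, size);
    cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);

    // Step 2: hand off to MPI. Without GPUDirect, the IB stack RDMAs from
    // its own, separately pinned region, so an extra CPU copy happens
    // between the two pinned buffers before the data reaches the wire.
    MPI_Send(h_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);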


[Diagram: MPI and CUDA before GPUDirect — CPU, main memory, chipset, GPU, and GPU memory; transfers require two copies (1, 2) staged through main memory]


What is GPUDirect?
- GPUDirect is an umbrella term for improving interoperability with third-party devices (especially cluster fabric hardware)
- Long-term goal is to reduce dependence on the CPU for managing transfers
- Contains both programming model and system software enhancements
- Linux only (for now)


GPUDirect v1
- Jointly developed with Mellanox
- Enables IB driver and CUDA driver to share the same pinned memory
- Eliminates CPU memcpy()s
- Kernel patch for additional kernel mode callback
- Guarantees proper cleanup of shared physical memory at process shutdown
- Currently shipping


[Diagram: GPUDirect v1 — CPU, chipset, GPU, GPU memory, and InfiniBand; the CUDA and IB drivers share one pinned buffer, so only one copy (1) is needed]


CUDA 4.0 Enhancements


No-copy Pinning of System Memory
- Reduce system memory usage and CPU memcpy() overhead
- Easier to add CUDA acceleration to existing applications
- Just register malloc’d system memory for async operations and then call cudaMemcpy() as usual
- All CUDA-capable GPUs on Linux or Windows
- Requires Linux kernel 2.6.15+ (RHEL 5)

Before no-copy pinning (extra allocation and extra copy required):
    malloc(a)
    cudaMallocHost(b)
    memcpy(b, a)
    cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
    memcpy(a, b)
    cudaFreeHost(b)

With no-copy pinning (just register and go!):
    malloc(a)
    cudaHostRegister(a)
    cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
    cudaHostUnregister(a)
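Spelled out as compilable code, the registered path might look like this (a sketch; note that early CUDA releases required the pointer and size to be page-aligned):

    float *a = (float *)malloc(size);
    // ... fill a ...
    cudaHostRegister(a, size, cudaHostRegisterDefault);   // pin in place
    cudaMemcpy(d_data, a, size, cudaMemcpyHostToDevice);  // full-speed DMA
    // ... launch kernels ...
    cudaMemcpy(a, d_data, size, cudaMemcpyDeviceToHost);
    cudaHostUnregister(a);                                // unpin before freeing
    free(a);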


Unified Virtual Addressing
- One address space for all CPU and GPU memory
- Determine physical memory location from pointer value
- Enables libraries to simplify their interfaces (e.g. cudaMemcpy)
- Supported on Tesla 20-series and other Fermi GPUs
- 64-bit applications on Linux and Windows TCC

Before UVA (separate options for each permutation):
    cudaMemcpyHostToHost
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
    cudaMemcpyDeviceToDevice

With UVA (one function handles all cases):
    cudaMemcpyDefault
    (data location becomes an implementation detail)
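In practice (a sketch; src and dst can each be a host or a device pointer):

    // The driver infers each pointer's location from its value,
    // so one call covers H2D, D2H, D2D, and H2H.
    cudaMemcpy(dst, src, size, cudaMemcpyDefault);

    // The location can also be queried explicitly:
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, src);  // reports host vs. device residency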


Unified Virtual Addressing: Easier to Program with a Single Address Space
[Diagram: without UVA, system memory and each GPU’s memory are separate address spaces (each running 0x0000–0xFFFF); with UVA, CPU, GPU0, and GPU1 share one address space across PCI-e]


GPUDirect v2
- Uses UVA
- GPU-aware MPI: MPI calls handle both GPU and CPU pointers
- Improves programmer productivity
- Data movement done in SW; same performance as v1
- Requires:
  - CUDA 4.0 and unified address space support
  - 64-bit host app and GF100+ only
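With a GPU-aware MPI stack (for example MVAPICH2 or Open MPI built with CUDA support), the staging from the earlier example disappears from user code; a sketch:

    float *d_buf;
    cudaMalloc((void **)&d_buf, size);
    // ... kernel produces data in d_buf ...

    // The device pointer goes straight into MPI; via UVA the stack detects
    // that d_buf lives on the GPU and pipelines the copies internally.
    MPI_Send(d_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);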


Multiple GPUs Before NVIDIA GPUDirect v2.0: Required Copy into Main Memory
- Two copies required:
  1. cudaMemcpy(sysmem, GPU1)
  2. cudaMemcpy(GPU2, sysmem)
[Diagram: GPU1 memory → system memory → GPU2 memory, through the CPU/chipset over PCI-e]


Multiple GPUs with NVIDIA GPUDirect v2.0: Peer-to-Peer Communication
- Direct transfers between GPUs
- Only one copy required:
  1. cudaMemcpy(GPU2, GPU1)
[Diagram: GPU1 memory → GPU2 memory directly over PCI-e, bypassing system memory]


GPUDirect v2.0: Peer-to-Peer Communication
- Direct communication between GPUs
  - Faster: no system memory copy overhead
  - More convenient multi-GPU programming
- Direct transfers: copy from GPU0 memory to GPU1 memory
  - Works transparently with UVA
- Direct access: GPU0 reads or writes GPU1 memory (load/store)
- Supported only on Tesla 20-series (Fermi)
- 64-bit applications on Linux and Windows TCC
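A minimal peer-to-peer sketch (device IDs and buffer names are illustrative; error checking omitted):

    // Check whether device 0 can map device 1's memory, then enable it.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
    }

    // Direct transfer: dst on GPU1, src on GPU0.  (With UVA, a plain
    // cudaMemcpy with cudaMemcpyDefault works as well.)
    cudaMemcpyPeer(d1_ptr, 1, d0_ptr, 0, size);

Enabling peer access also lets kernels running on device 0 dereference d1_ptr directly, which is the load/store case above.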


GPUDirect Future Directions
- P2P protocol could be extended to other devices
  - Network cards
  - Storage devices (flash?)
  - Other?
- Extended PCI topologies
- More GPU autonomy
- Better NUMA topology discovery/exposure


Topology and Locality


Topology
[Diagram: a four-socket system (memory, CPU, and chipset per socket) with two GPUs; measured transfer bandwidth is 6.5 GB/s vs. 4.34 GB/s depending on where the process runs relative to the GPU]


And More Topology
[Diagram: a different four-CPU system with two chipsets and two GPUs; bandwidths of 11.0 GB/s vs. 7.4 GB/s depending on placement]


There’s No Easy Answer Here…
- Memory hierarchies are getting deeper, uniformity is decreasing
- Portable Hardware Locality (hwloc)
- NUMA placement (SLURM SPANK plugins, some MPIs)
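As one illustration of using hwloc for placement (a sketch, not from the slides; the rank-to-core policy is made up for the example):

    #include <hwloc.h>

    /* Bind the calling thread to a core chosen from the MPI rank, so each
       rank stays on a fixed socket (ideally the one nearest its GPU). */
    void bind_rank_to_core(int rank)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                                 rank % ncores);
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);

        hwloc_topology_destroy(topo);
    }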
