Accelerating High Performance Computing with GPUDirect RDMA

Accelerating High Performance Computing with GPUDirect RDMA

Accelerating High Performance Computing withGPUDirect RDMAWednesday, August 7, 2013 - 10AM-11AM PST

Leading Supplier of End-to-End Interconnect SolutionsServer / ComputeSwitch / GatewayVirtual Protocol InterconnectVirtual Protocol Interconnect56G IB & FCoIB56G InfiniBand10/40/56GbE & FCoE10/40/56GbEStorageFront / Back-EndComprehensive End-to-End InfiniBand and Ethernet PortfolioICsAdapter CardsSwitches/GatewaysHost/Fabric SoftwareCables© 2013 Mellanox Technologies 2

User SpaceSystem SpaceUser SpaceSystem SpaceI/O Offload Frees Up CPU for Application ProcessingWithout RDMAWith RDMA and Offload~53% CPUEfficiency~88% CPUEfficiency~47% CPUOverhead/Idle~12% CPUOverhead/Idle© 2013 Mellanox Technologies 3

Mellanox Interconnect Development TimelineQDR InfiniBand End-to-EndGPUDirect Technology ReleasedFDR InfiniBand End-to-EndMPI/SHMEM Collectives Offloads(FCA), Scalable HPC (MXM), OpenSHMEM, PGAS/UPCConnect-IB - 100Gb/s HCADynamically ConnectedTransport20082009 2010 2011 2012 2013World’s First Petaflop SystemsCORE-Direct TechnologyInfiniBand – Ethernet BridgingLong-Haul SolutionsGPUDirect RDMATechnology and Solutions Leadership© 2013 Mellanox Technologies 4

GPUDirect History• The GPUDirect project was announced Nov 2009• “NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks”• GPUDirect was developed together by Mellanox and NVIDIA• New interface (API) within the Tesla GPU driver• New interface within the Mellanox InfiniBand drivers• Linux kernel modification to allow direct communication between drivers• GPUDirect 1.0 was announced Q2’10• “Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency”• “Mellanox was the lead partner in the development of NVIDIA GPUDirect”• GPUDirect RDMA alpha release is available today• Mellanox has over two dozen developers using and providing feedback• “Proof of Concept” designs in flight today with commercial end-customers and government entities• Ohio State University has a version of MVAPICH2 for GPUDirect RDMA available for MPI application developersGPUDirect RDMA is targeted for a GA release in Q4’13© 2013 Mellanox Technologies 5

1 2GPU-InfiniBand Bottleneck (pre-GPUDirect)• GPU communications uses “pinned” buffers for data movement• A section in the host memory that is dedicated for the GPU• Allows optimizations such as write-combining and overlapping GPU computation and data transfer forbest performance• InfiniBand uses “pinned” buffers for efficient RDMA transactions• Zero-copy data transfers, Kernel bypass• Reduces CPU overheadCPUSystemMemoryChipsetGPUInfiniBandGPUMemory© 2013 Mellanox Technologies 6

1 21GPUDirect 1.0ReceiveTransmitSystemMemory21CPUNon GPUDirectCPUSystemMemoryGPUChipsetChipsetGPUInfiniBandInfiniBandGPUMemoryGPUMemorySystemMemory1CPUCPUSystemMemoryGPUChipsetChipsetGPUGPUMemoryInfiniBandGPUDirect 1.0InfiniBandGPUMemory© 2013 Mellanox Technologies 7

GPUDirect 1.0 – Application Performance• LAMMPS• 3 nodes, 10% gain3 nodes, 1 GPU per node 3 nodes, 3 GPUs per node• Amber – Cellulose• 8 nodes, 32% gain• Amber – FactorIX• 8 nodes, 27% gain© 2013 Mellanox Technologies 8

11GPUDirect RDMAReceiveTransmitSystemMemory1CPUGPUDirect 1.0CPUSystemMemoryGPUChipsetChipsetGPUInfiniBandInfiniBandGPUMemoryGPUMemorySystemMemory1CPUCPUSystemMemoryGPUChipsetChipsetGPUGPUMemoryInfiniBandGPUDirect RDMAInfiniBandGPUMemory© 2013 Mellanox Technologies 9

Hardware considerations for GPUDirect RDMANote : A requirement for GPUDirect RDMA to work properly is that the NVIDIAGPU and the Mellanox InfiniBand Adapter share the same root complexCPUCPUChipset orPCIE switchChipset© 2013 Mellanox Technologies 10

How to get started evaluating GPUDirect RDMA…How do I get started with the GPUDirect RDMA alpha code release?The only way to get access to the alpha release is by sending an email to , you will receive a response within 24 hours, and will be able todownload the code via an FTP site.If you would like to evaluate MVAPICH2.1.9-GDR, please state this in the email request,and you will receive a separate email on how to download it.How can I ensure I get the latest updates and information?The Community Site at Mellanox ( is a great place to getthe very latest information on GPUDirect RDMA. You will also be able to connect withyour peers, ask questions, exchange ideas, find additional resources, and share bestpractices.© 2013 Mellanox Technologies 11

A Community for Mellanox Technology Enthusiasts© 2013 Mellanox Technologies 12

Thank You© 2013 Mellanox Technologies 13

MVAPICH2-GDR: MVAPICH2 with GPUDirect RDMAWebinar onAccelerating High Performance Computing with GPUDirect RDMAbyProf. Dhabaleswar K. (DK) PandaThe Ohio State UniversityE-mail: panda@cse.ohio-state.eduMVAPICH Project

Large-scale InfiniBand Cluster Installations• 205 IB Clusters (41%) in the June 2013 Top500 list(• Installations in the Top 40 (18 systems):462,462 cores (Stampede) at TACC (6 th ) 138,368 cores (Tera-100) at France/CEA (25 th )147, 456 cores (Super MUC) in Germany (7 th ) 53,504 cores (PRIMERGY) at Australia/NCI (27 th )110,400 cores (Pangea) at France/Total (11th) 77,520 cores (Conte) at Purdue University (28 th )73,584 (Spirit) at USA/Air Force (14 th ) 48,896 cores (MareNostrum) at Spain/BSC (29 th )77,184 cores (Curie thin nodes) at France/CEA(15 th )78,660 cores (Lomonosov) in Russia (31 st )120, 640 cores (Nebulae) atChina/NSCS (16 th ) 137,200 cores (Sunway Blue Light) in China 33 rd )72,288 cores (Yellowstone) at NCAR (17 th ) 46,208 cores (Zin) at LLNL (34 th )125,980 cores (Pleiades) at NASA/Ames (19 th ) 38,016 cores at India/IITM (36 th )70,560 cores (Helios) at Japan/IFERC (20 th ) More are getting installed !73,278 cores (Tsubame 2.0) at Japan/GSIC (21 st )MVAPICH2 with GPUDirect RDMA2

MVAPICH2/MVAPICH2-X Software• High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, andRDMA over Converged Enhanced Ethernet (RoCE)– MVAPICH (MPI-1) ,MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002– MVAPICH2-X (MPI + PGAS), Available since 2012– Used by more than 2,055 organizations (HPC Centers, Industry and Universities) in70 countries– More than 180,000 downloads from OSU site directly– Empowering many TOP500 clusters• 7 th ranked 204,900-core cluster (Stampede) at TACC• 14 th ranked 125,980-core cluster (Pleiades) at NASA• 17 th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology• 75 th ranked 16,896-core cluster (Keenland) at GaTech• and many others– Available with software stacks of many IB, HSE, and server vendorsincluding Linux Distros (RedHat and SuSE)– http://mvapich.cse.ohio-state.eduMVAPICH2 with GPUDirect RDMA3

MVAPICH2 1.9 and MVAPICH2-X 1.9• Released on 05/06/13• Major Features and Enhancements– Based on MPICH-3.0.3• Support for all MPI-3 features – Non-blocking collectives, Neighborhood Collectives, etc.– Support for single copy intra-node communication using Linux supported CMA (Cross Memory Attach)• Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA– Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)• Support for application-level checkpointing• Support for hierarchical system-level checkpointing– Scalable UD-multicast-based designs and tuned algorithm selection for collectives– Improved job startup time• Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup onhomogeneous clusters– Revamped Build system with support for parallel builds– Many enhancements related to GPU• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models.– Based on MVAPICH2 1.9 including MPI-3 features; Compliant with UPC 2.16.2 and OpenSHMEM v1.0dMVAPICH2 with GPUDirect RDMA4

Designing GPU-Aware MPI Library• OSU started this research and development direction in 2011• Initial support was provided in MVAPICH2 1.8a (SC ‘11)• Since then many enhancements and new designs related to GPUcommunication have been incorporated in 1.8 and 1.9 series• Have also extended OSU Micro-Benchmark Suite (OMB) to testand evaluate– GPU-aware MPI communication– OpenACCMVAPICH2 with GPUDirect RDMA5

What is GPU-Aware MPI Library?MVAPICH2 with GPUDirect RDMA6

MPI + CUDA - Naive• Data movement in applications with standard MPI and CUDA interfacesAt Sender:CPUcudaMemcpy(s_hostbuf, s_devbuf, . . .);MPI_Send(s_hostbuf, size, . . .);At Receiver:MPI_Recv(r_hostbuf, size, . . .);cudaMemcpy(r_devbuf, r_hostbuf, . . .);PCIeGPUNICSwitchHigh Productivity and Low PerformanceMVAPICH2 with GPUDirect RDMA7

MPI + CUDA - Advanced• Pipelining at user level with non-blocking MPI and CUDA interfacesAt Sender:for (j = 0; j < pipeline_len; j++)cudaMemcpyAsync(s_hostbuf + j * blk, s_devbuf + j * blksz,…);for (j = 0; j < pipeline_len; j++) {while (result != cudaSucess) {result = cudaStreamQuery(…);if(j > 0) MPI_Test(…);}MPI_Isend(s_hostbuf + j * block_sz, blksz . . .);}MPI_Waitall();CPUPCIeGPUNICSwitchLow Productivity and High PerformanceMVAPICH2 with GPUDirect RDMA8

GPU-Aware MPI Library: MVAPICH2-GPU• Standard MPI interfaces used for unified data movement• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)• Overlaps data movement from GPU with RDMA transfersAt Sender:MPI_Send(s_devbuf, size, …);insideMVAPICH2At Receiver:MPI_Recv(r_devbuf, size, …);High Performance and High ProductivityMVAPICH2 with GPUDirect RDMA9

MPI Micro-benchmark PerformanceTime (us)30002500200015001000Memcpy+SendMemcpyAsync+IsendMVAPICH2-GPU45 %24 %Better500032K 64K 128K 256K 512K 1M 2M 4MMessage Size (bytes)• 45% improvement compared with a naïve user-level implementation(Memcpy+Send), for 4MB messages• 24% improvement compared with an advanced user-level implementation(MemcpyAsync+Isend), for 4MB messagesH. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPUCommunication for InfiniBand Clusters, ISC ‘11MVAPICH2 with GPUDirect RDMA10

MVAPICH2 1.9 Features for NVIDIA GPU Clusters• Support for MPI communication from NVIDIA GPU devicememory• High performance RDMA-based inter-node point-to-pointcommunication (GPU-GPU, GPU-Host, and Host-GPU)• High performance intra-node point-to-point communication formulti-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)• Taking advantage of CUDA IPC (available since CUDA 4.1) inintra-node communication for multiple GPU adapters/node• Optimized and tuned collectives for GPU device buffers• MPI datatype support for point-to-point and collectivecommunication from GPU device buffersMVAPICH2 with GPUDirect RDMA11

GPU-Direct RDMA with CUDA 5.0• Fastest possible communicationSystemMemorybetween GPU and other PCI-EdevicesCPU• Network adapter can directlyread/write data from/to GPUdevice memoryInfiniBandChipsetGPU• Avoids copies through the host• Allows for better asynchronousGPUMemorycommunication• OFED with GPU-Direct is underwork by NVIDIA and MellanoxMVAPICH2 with GPUDirect RDMA12

Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA• Peer2Peer (P2P) bottlenecks onSNB E5-2670Sandy BridgeCPUSystemMemory• Design of MVAPICH2– Hybrid designGPU– Takes advantage of GPU-Direct-RDMA for writes to GPU– Uses host-based buffered design inIBAdapterChipsetGPUMemorycurrent MVAPICH2 for reads– Works around the bottlenecksP2P write: 5.2 GB/sP2P read: < 1.0 GB/stransparentlyMVAPICH2 with GPUDirect RDMA13

Performance of MVAPICH2 with GPU-Direct-RDMAGPU-GPU Internode MPI Latency3025Small Message LatencyMVAPICH2-1.9MVAPICH2-1.9-GDR1000900800Large Message LatencyMVAPICH2-1.9MVAPICH2-1.9-GDRLatency (us)20151019.167.5 %Latency (us)Better700600500400300Better56.220010001 4 16 64 256 1K 4K016K 64K 256K 1M 4MMessage Size (bytes)Message Size (bytes)Based on MVAPICH2-1.9Intel Sandy Bridge (E5-2670) node with 16 coresNVIDIA Telsa K20c GPU, Mellanox ConnectX-3 FDR HCACUDA 5.5, OFED with GPU-Direct-RDMA PatchMVAPICH2 with GPUDirect RDMA14

Performance of MVAPICH2 with GPU-Direct-RDMA800700GPU-GPU Internode MPI Uni-Directional BandwidthSmall Message BandwidthMVAPICH2-1.9MVAPICH2-1.9-GDR70006000Large Message BandwidthMVAPICH2-1.9MVAPICH2-1.9-GDRBandwidth (MB/s)6005004003002002.8xBetterBandwidth (MB/s)500040003000200033%Better100100001 4 16 64 256 1K 4K08K 32K 128K 512K 2MMessage Size (bytes)MVAPICH2 with GPUDirect RDMABased on MVAPICH2-1.9Intel Sandy Bridge (E5-2670) node with 16 coresNVIDIA Telsa K20c GPU, Mellanox ConnectX-3 FDR HCACUDA 5.5, OFED with GPU-Direct-RDMA PatchMessage Size (bytes)15

Performance of MVAPICH2 with GPU-Direct-RDMAGPU-GPU Internode MPI Bi-directional Bandwidth1000Small Message Bi-Bandwidth12000Large Message Bi-Bandwidth900800MVAPICH2-1.9MVAPICH2-1.9-GDR10000MVAPICH2-1.9MVAPICH2-1.9-GDRBandwidth (MB/s)7006005004003003xBetterBandwidth (MB/s)80006000400054%Better2002000100001 4 16 64 256 1K 4K8K 32K 128K 512K 2MMessage Size (bytes)Message Size (bytes)Based on MVAPICH2-1.9Intel Sandy Bridge (E5-2670) node with 16 coresNVIDIA Telsa K20c GPU, Mellanox ConnectX-3 FDR HCACUDA 5.5, OFED with GPU-Direct-RDMA PatchMVAPICH2 with GPUDirect RDMA16

How will it help me?• MPI applications can be made GPU-aware to use directcommunication from/to GPU buffers, as supported byMVAPICH2 1.9, and extract performance benefits• GPU-Aware MPI applications using short and mediummessages can extract added performance and scalabilitybenefits with MVAPICH2-GPUDirect RDMA (MVAPICH2-GDR)MVAPICH2 with GPUDirect RDMA17

How can I get Started with GDR Experimentation?• Two modules are needed– GPUDirect RDMA (GDR) driver from Mellanox– MVAPICH2-GDR from OSU• Send a note to• You will get alpha versions of GDR driver and MVAPICH2-GDR(based on MVAPICH2 1.9 release)• You can get started with this version• MVAPICH2 team is working on multiple enhancements (collectives,datatypes, one-sided) to exploit the advantages of GDR• As GDR driver matures, successive versions of MVAPICH2-GDR withenhancements will be made available to the communityMVAPICH2 with GPUDirect RDMA18

Will it be too Hard to Use GDR?• No• You need to first install the OFED-GDR driver from Mellanox• Install MVAPICH2-GDR• Current GPU-aware features in MVAPICH2 are triggered witha runtime parameter: MV2_USE_CUDA=1• To activate GDR functionality, you just need to use one moreruntime parameter: MV2_USE_GPUDIRECT =1• A short demo will be shown now to illustrate the easy usageof MVAPICH2-GDRMVAPICH2 with GPUDirect RDMA19

Additional Information and Contact Point for Questions Web Pagehttp://mvapich.cse.ohio-state.edupanda@cse.ohio-state.eduMVAPICH2 with GPUDirect RDMA20

Accelerating HighPerformance Computingwith GPUDirect RDMANVIDIA webinar 8/7/2013

Outline! GPUDirect technology family! Current NVIDIA Software and Hardware Requirements! Current MPI Status! Using GPU Direct with IBVerbs extensions! Using GPUDirect RDMA and MPI! CUDA 6 – moving from GPUDirect alpha to beta to GA! Team Q & A

GPUDirect is a family of technologies! GPUDirect Shared GPU-Sysmem for inter-node copy optimization! How: Use GPUDirect-aware 3 rd party network drivers! GPUDirect P2P for intra-node, accelerated GPU-GPU memcpy! How: Use CUDA APIs directly in application! How: use P2P-aware MPI implementation! GPUDirect P2P for intra-node, inter-GPU LD/ST access! How: Access remote data by address directly in GPU device code! GPUDirect RDMA for inter-node copy optimization! What: 3 rd party PCIe devices can read and write GPU memory! How: Use GPUDirect RDMA-aware 3 rd party network drivers* and MPIimplementations* or custom device drivers for other hardware* forthcoming

NVIDIA Software and Hardware Requirements! What drivers and CUDA versions are required to support GPUDirect?! Alpha Patches work with – CUDA 5.0 or CUDA 5.5! Final release based on CUDA 6.0 (beta in October)! New driver, probably version 331! Register at for early access! NVIDIA Hardware Requirements! RDMA available on Tesla and Quadro Kepler class hardware

GPU Aware MPI Libraries – Current StatusMVAPICHOpen MPIIBM PlatformComputingComputingIBM Platform MPIAll libraries allow• GPU and network device share same sysmem buffers• Utilizes best transfer mode (such as CUDA IPC directtransfer within a node between GPUs)• Send and receive of GPU buffers and mostcollectivesVersions:• MVAPICH2 1.9• OpenMPI 1.7.2• IBM Platform MPI V9.1Reference• NVIDIA GPUDirect Technology Overview

IB verbs extensions for GPUDirect RDMA! Developers may program at the IB verbs level or with MPI! Current version with RDMA support (available via Mellanox)! Gives application developers early access to an RDMA path! IB verbs was changed to provide:! Extended memory registration API’s to support GPU buffer! GPU memory de-allocation call-back (for efficient MPI implementations)

IB verbs with GPUDirect RDMAUse existing memory registration APIs:! struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,size_t length, int access)! pd: protection domain! Access:IBV_ACCESS_LOCAL_WRITE = 1,IBV_ACCESS_REMOTE_WRITE = (1

Example:d_buf=cudaMalloc();mr=ibv_reg_mr(pd,d_buf,size,…);// … RDMA on buffer hereibv_dereg_mr(mr)cudaFree(d_buf)

GPUDirect RDMAGPUPCIeCPU/IOHPCIeThird PartyHardwareGPUMemoryCPUMemoryTodayMellanoxTomorrow??

GPUDirect RDMA: Common use cases! Inter-Node MPI communication! Transfer data between local GPU memory and a remote Node! Interface with third party hardware! It requires adopting NVIDIA GPUDirect-Interop API in vendor softwarestack

GPUDirect RDMA: What does it get you?! MPI_Send latency of ~20us with Shared GPU-Sysmem! No overlap possible! Bidirectional transfer is difficult! MPI_Send latency of ~6us with RDMA! Does not affect running kernels! Unlimited concurrency! RDMA possible!

So what happens at CUDA 6?! No change to MPI programs! Interfaces are simplified, reducing the work for MPI implementors! Programmers working at the verbs level also benefit! Requires upgrading to CUDA 6 and then current NVIDIA driver! Register to receive updates! Release of CUDA 6 (RC1, then final)! Release of MVAPICH2 beta, then final! Progress from other MPI vendors

Contacts and Resources! NVIDIA! Register at forearly access to CUDA 6! Developer Zone GPU Direct page! Developer Zone RDMA page! Mellanox! Register! Ohio State!! Emails!!!

The EndQuestion Time

Upcoming GTC Express WebinarsAugust 13 - GPUs in the Film Visual Effects PipelineAugust 14 - Beyond Real-time Video Surveillance Analytics with GPUsAugust 15 - CUDA 5.5 Production Release: Features OverviewSeptember 5 - Data Discovery through High-Data-Density Visual Analysisusing NVIDIA GRID GPUsSeptember 12 - Guided Performance Analysis with NVIDIA Visual ProfilerRegister at

GTC 2014 Call for SubmissionsLooking for submissions in the fields of! Science and research! Professional graphics! Mobile computing! Automotive applications! Game development! Cloud computingSubmit by September 27 at

More magazines by this user
Similar magazines