Accelerating CFD Solvers for Massive Scalability | GTC 2013

WHAT IS CFD?
» Using numerical methods to solve fluid flow problems
» Important for the aviation, automobile, and naval industries
» CFD solvers are used throughout the DoD, NASA, and elsewhere
» Real-world CFD simulations require massive computational power
[Diagram: Models → Meshes → Solvers → Visuals]

MODERNIZATION OF CFD SOLVERS
» HPC hardware technology has recently evolved much faster than software implementations
  » GPUs / multicore
» Some new solvers are being developed
» Legacy solvers still exist …
» Year 3 of our efforts

CASE STUDY
» NAVAIR uses CFD solvers to model ships and their interaction with moving planes
» Full ships require >500,000 CPU hours on a supercomputer
» More performance means faster turnaround times for simulations and more complex models

ON LEGACY SOLVERS
» Old solvers are still in widespread use
  » Validation
  » Momentum
  » Preference for feature development vs. structure or modernization
» Yet more performance is desired

CFD SOLVERS USED

AVUS – Case Study
Created by: AFRL
» 50k lines of Fortran
» Can be coupled to external codes
Used by: Primarily DoD

FUN3D
Created by: NASA Langley
» 765k lines of Fortran
» Multiphysics
Used by: NASA, commercial, DoD

Common to both:
» Finite volume, unstructured grid solvers
» “MPI Everywhere” structure
» Majority of time is spent in the iterative solution


A MESH
» Unstructured tetrahedral grid
» Essentially a sparse matrix
» Room 210F talk immediately following this one

A MESH – DIVIDED BY PARMETIS
» Each zone is independent
» Neighbor zones are likely to be on the same node, but optimality of placement is not guaranteed

STEP 1: COMPUTE
» Each mesh region can be computed in parallel
» Good performance assumes the cost of each region is identical

STEP 2: COMMUNICATE
» Primarily between neighbors
» Occasional global communication
» All MPI
Then repeat
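A minimal sketch of this compute-then-communicate iteration, assuming hypothetical names (compute_zone, send_halo, neighbor_rank, and so on); the real solvers organize their sweeps differently, but the pattern of per-zone work, neighbor halo exchange, and an occasional global reduction is the same.

```
/* Sketch only: per-rank zone update, neighbor halo exchange, and an
   occasional global reduction. All names here are illustrative. */
#include <mpi.h>

#define MAX_NEIGHBORS 64   /* assumed upper bound for this sketch */

extern void compute_zone(double *zone, double **send_halo);  /* hypothetical */

void sweep(double *zone, double **send_halo, double **recv_halo,
           const int *halo_len, const int *neighbor_rank, int num_neighbors,
           double local_residual)
{
    MPI_Request reqs[2 * MAX_NEIGHBORS];
    int nreq = 0;

    /* Step 1: compute -- each mesh zone can be updated in parallel */
    compute_zone(zone, send_halo);

    /* Step 2: communicate -- primarily point-to-point with neighbor zones */
    for (int n = 0; n < num_neighbors; ++n) {
        MPI_Irecv(recv_halo[n], halo_len[n], MPI_DOUBLE,
                  neighbor_rank[n], 0, MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(send_halo[n], halo_len[n], MPI_DOUBLE,
                  neighbor_rank[n], 0, MPI_COMM_WORLD, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    /* Occasional global communication, e.g. a residual norm across ranks */
    double global_residual;
    MPI_Allreduce(&local_residual, &global_residual, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
    /* ...then repeat the sweep until converged */
}
```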

MPI = EASY PARALLELISM
[Diagram: one MPI rank per core; Node 1 and Node 2, four cores each]

NOW, ADD THE GPU
GTC 2012 results:
» MPI ranks cannot share a GPU, but there are more ranks than GPUs in most systems
» Either every rank gets its own GPU or waits will occur; using fewer ranks is not realistic
» Giving more work to the GPU-owning ranks just shifts other bottlenecks; all ranks must be equally accelerated for all code paths
» Solution: convert to one-rank OpenMP
» This was the only realistic architecture without a full rewrite
[Diagram: single mean flow sweep; accelerated ranks wait on unaccelerated CPU ranks]

GTC 2012 SOLUTION
» Convert to one rank per node + OpenMP
» Spawn many (more) zones, balance by flexing tasks to/from the GPU
» Problems: full rewrite, work required to handle load balancing
[Diagram: MPI between node processes; OpenMP threads within each node process]
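A minimal sketch of that one-rank-per-node layout, assuming hypothetical per-zone routines (the GPU path would launch CUDA kernels in the real code); the point is the extra layer of OpenMP scheduling, and the load-balancing work it implies, that the MPI-everywhere original never needed.

```
/* Sketch only: one MPI rank per node, OpenMP threads inside the node,
   zones flexed between CPU threads and the GPU. Names are illustrative. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Stand-ins for the real per-zone solvers; the GPU path would launch CUDA kernels. */
static void process_zone_cpu(int zone_id) { printf("CPU zone %d\n", zone_id); }
static void process_zone_gpu(int zone_id) { printf("GPU zone %d\n", zone_id); }

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    const int num_zones = 64;               /* spawn many more zones than threads */

    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)   /* dynamic schedule handles uneven zones */
        for (int z = 0; z < num_zones; ++z) {
            if (omp_get_thread_num() == 0)  /* one thread flexes its work to the GPU */
                process_zone_gpu(z);
            else
                process_zone_cpu(z);
        }
    }

    /* Inter-node halo exchange still goes over MPI (omitted here) */
    MPI_Finalize();
    return 0;
}
```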


KEPLER / HYPER-Q
» Only available recently: Supercomputing 2012
» Hardware feature
» Requires K20 (no K10)

CUDA PROXY
» Launch the CUDA Proxy daemon
  » nvidia-cuda-proxy-control -d
  » CRAY_CUDA_PROXY=1
» The proxy daemon intercepts all GPU calls and routes them to a single process
» Ranks are routed to individual Hyper-Q queues
[Diagram: kernels from MPI ranks pass through the Proxy into one CUDA process]
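With the proxy daemon running, the per-rank code can stay an ordinary MPI + CUDA program, along the lines of the hypothetical sketch below: every rank targets device 0, and the proxy funnels the launches into one CUDA process with separate Hyper-Q queues. The kernel and problem size are placeholders, not the real solver.

```
/* Sketch only: unmodified per-rank CUDA code that the CUDA Proxy can
   multiplex onto one K20. Each rank behaves as if it owns device 0. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void sweep_kernel(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5;              /* stand-in for the real flux sweep */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                    /* every rank targets the same GPU */

    const int n = 1 << 20;
    double *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(double));

    sweep_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    printf("rank %d finished its sweep\n", rank);
    cudaFree(d_x);
    MPI_Finalize();
    return 0;
}
```

Per the slide, the proxy is started (nvidia-cuda-proxy-control -d, or CRAY_CUDA_PROXY=1 on the Cray systems used here) before the job launches; the application source itself is unchanged.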

GTC 2013 SOLUTION
» Keep the existing MPI structure
» Use the Proxy to share the GPU across ranks
» Count on CUDA-aware MPI for intra-node transfers
» Use GPUDirect RDMA for inter-node transfers
  » Not available for the current results
» Nice part: mostly involved undoing the 2012 code
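A hedged sketch of the CUDA-aware MPI piece of this plan: device pointers go straight into the MPI call and the library handles any staging, so the application drops its explicit host copies. Function and argument names are illustrative.

```
/* Sketch only: with a CUDA-aware MPI, d_send and d_recv can be pointers
   returned by cudaMalloc; the MPI stack moves the halo without an explicit
   host staging buffer in user code. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo_cuda_aware(double *d_send, double *d_recv, int halo_len,
                              int neighbor, MPI_Comm comm)
{
    MPI_Sendrecv(d_send, halo_len, MPI_DOUBLE, neighbor, 0,
                 d_recv, halo_len, MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
}
```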

BENCHMARKING SETUP
» Cray "Nano" at NVIDIA
» 16 nodes
  » Identical to Titan nodes
  » 1x 16-core AMD Interlagos 6274 per node
  » 1x Tesla K20X per node
  » PCI-E 2.0
» Cray MPI stack and compiler
» Two test problems: 1.4M unknowns and 5.9M
» Thanks to Cliff Woolley!

OVERALL SCALING
» 8 ranks per node for each test (others tested)
[Chart: iterations per second vs. number of nodes (up to 16), CPU and GPU curves; 1.3x annotation]

SCALING EFFICIENCY
[Chart: scaling efficiency vs. number of nodes (up to 16), CPU and GPU curves; 96% annotation]

COMPUTE REGION BENCHMARKS
» K1 is the forward solve
» K2 is the reverse solve
» 8 MPI ranks used for CPU and GPU tests
» Proxy overhead is not a factor
[Chart: speedup (0-14x) vs. number of nodes (1-16) for K1 and K2 on the 1.4M and 5.9M problems]

TIME SPENT COMMUNICATING
[Chart: percent of time spent communicating (0-100%) vs. number of nodes (1-16), for GPU and CPU runs on the 1.4M and 5.9M problems]
» Primary motivator for GPUDirect RDMA

GPUDIRECT RDMA
» Current MPI-GPU procedure: a 5-step process
» New feature, possible since CUDA 5.0
» Not widely available (yet)
» Requires interconnect driver and MPI support
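For contrast, a sketch of the 5-step procedure the slide refers to, read here as the usual host-staged path: device-to-host copy, send, transfer over the interconnect, receive, host-to-device copy. Buffer and rank names are illustrative; GPUDirect RDMA would let the network adapter read and write the device buffers directly.

```
/* Sketch only: host-staged halo transfer that GPUDirect RDMA would remove. */
#include <mpi.h>
#include <cuda_runtime.h>

void send_halo_staged(const double *d_halo, double *h_buf, int halo_len,
                      int neighbor, MPI_Comm comm)
{
    cudaMemcpy(h_buf, d_halo, halo_len * sizeof(double),
               cudaMemcpyDeviceToHost);                       /* step 1 */
    MPI_Send(h_buf, halo_len, MPI_DOUBLE, neighbor, 0, comm); /* steps 2-3 */
}

void recv_halo_staged(double *d_halo, double *h_buf, int halo_len,
                      int neighbor, MPI_Comm comm)
{
    MPI_Recv(h_buf, halo_len, MPI_DOUBLE, neighbor, 0, comm,
             MPI_STATUS_IGNORE);                              /* step 4 */
    cudaMemcpy(d_halo, h_buf, halo_len * sizeof(double),
               cudaMemcpyHostToDevice);                       /* step 5 */
}
```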

SUMMARY AND NEXT STEPS

CUDA PROXY IS A MAJOR ENABLER FOR LEGACY CODES
» Efficient porting from legacy MPI to a CUDA cluster; minimal changes

MPI LAYER IMPROVEMENTS WILL ENABLE PERFORMANCE
» Current code still has manual transfers through host memory

FURTHER CODE ENHANCEMENTS
» More kernels; fewer blocking MPI calls

Thanks! Questions?
» Convention hall @ booth 311
» Online @ www.emphotonics.com
