ACCELERATING COMPUTATIONAL FLUID DYNAMICS SOLVERS FOR MASSIVE SCALABILITY: A CASE STUDY
John Humphrey
EM Photonics
March 20, 2013


WHAT IS CFD?
» Using numerical methods to solve fluid flow problems
» Important for the aviation, automobile, and naval industries
» CFD solvers are used throughout the DoD, NASA, and elsewhere
» Real-world CFD simulations require massive computational power
Models → Meshes → Solvers → Visuals


MODERNIZATION OF CFD SOLVERS
» HPC hardware technology has recently evolved much faster than software implementations
  » GPUs / multicore
» Some new solvers are being developed
» Legacy solvers still exist …
» Year 3 of our efforts

Case Study
» NAVAIR uses CFD solvers to model ships and their interaction with moving planes
» Full ships require >500,000 CPU hours on a supercomputer
» More performance means faster turnaround times for simulations and more complex models


ON LEGACY SOLVERS
» Old solvers are still in widespread use
  » Validation
  » Momentum
  » Preference for feature development vs. structure or modernization
» Yet more performance is desired


CFD SOLVERS USED
AVUS – Case Study
» Created by: AFRL
» 50k lines of Fortran
» Can be coupled to external codes
» Used by: primarily DoD

FUN3D
» Created by: NASA Langley
» 765k lines of Fortran
» Multiphysics
» Used by: NASA, commercial, DoD

Both solvers:
» Finite volume, unstructured grid solvers
» “MPI Everywhere” structure
» Majority of time is spent in the iterative solution


BASIC UNSTRUCTURED SOLVER FLOW AND PARALLELISM


A MESH
» Unstructured tetrahedral grid
» Essentially a sparse matrix
» Room 210F talk immediately following this one
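Thinking of the mesh as a sparse matrix: the sketch below shows one common way to hold cell-to-cell adjacency in compressed sparse row (CSR) form, which is the layout partitioners and iterative kernels typically traverse. The struct and field names are illustrative assumptions, not taken from AVUS or FUN3D.

```c
/* Hypothetical sketch: cell-to-cell adjacency of an unstructured
 * tetrahedral mesh stored as a CSR-style sparse graph.
 * Names and fields are illustrative, not from AVUS/FUN3D. */
typedef struct {
    int  ncells;   /* number of cells (graph vertices)            */
    int *xadj;     /* size ncells+1: row offsets into adjncy      */
    int *adjncy;   /* concatenated face-neighbor lists            */
} mesh_graph;

/* Visit every face-neighbor pair once per owning cell. */
static void for_each_neighbor(const mesh_graph *g,
                              void (*visit)(int cell, int nbr))
{
    for (int c = 0; c < g->ncells; ++c)
        for (int j = g->xadj[c]; j < g->xadj[c + 1]; ++j)
            visit(c, g->adjncy[j]);
}
```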


A MESH – DIVIDED BY PARMETIS
» Each zone is independent
» Neighbor zones are likely to be on the same node, but optimality of placement is not guaranteed
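A hedged sketch of how a distributed CSR mesh graph might be handed to ParMETIS (via ParMETIS_V3_PartKway) to split it into zones. The array names, unweighted graph, and 5% imbalance tolerance are assumptions for illustration; the actual solvers have their own preprocessing paths.

```c
#include <stdlib.h>
#include <mpi.h>
#include <parmetis.h>

/* Partition a distributed CSR graph into nzones parts; writes one
 * zone id per local cell into part[]. Illustrative sketch only. */
void partition_zones(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy,
                     idx_t nzones, idx_t *part, MPI_Comm comm)
{
    idx_t  wgtflag = 0, numflag = 0, ncon = 1, edgecut = 0;
    idx_t  options[3] = {0, 0, 0};            /* library defaults       */
    real_t ubvec[1]   = {1.05};               /* 5% imbalance tolerance */
    real_t *tpwgts    = malloc(nzones * sizeof(real_t));

    for (idx_t i = 0; i < nzones; ++i)
        tpwgts[i] = 1.0 / (real_t)nzones;     /* equal-sized zones      */

    ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL,
                         &wgtflag, &numflag, &ncon, &nzones,
                         tpwgts, ubvec, options, &edgecut, part, &comm);
    free(tpwgts);
}
```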


STEP 1: COMPUTE
» Each mesh region can be computed in parallel
» Good performance assumes the cost of each region is identical


STEP 2: COMMUNICATE
» Primarily between neighbors
» Occasional global communication
» All MPI
Then repeat
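A hedged sketch of this compute/communicate iteration pattern (not actual AVUS/FUN3D code): each rank owns one zone, computes locally, exchanges halo data with its neighbors, and joins an occasional global reduction. The helper routines are placeholders.

```c
#include <mpi.h>

void   compute_local_zone(void);   /* placeholder: per-zone flux/update */
double local_residual(void);       /* placeholder: local residual norm  */

void iterate(double *halo_send, double *halo_recv, int halo_len,
             const int *neighbors, int nneigh, int niters, MPI_Comm comm)
{
    MPI_Request reqs[64];          /* assumes nneigh <= 32 for brevity  */

    for (int it = 0; it < niters; ++it) {
        compute_local_zone();                           /* Step 1 */

        int r = 0;                                      /* Step 2 */
        for (int n = 0; n < nneigh; ++n) {
            MPI_Irecv(&halo_recv[n * halo_len], halo_len, MPI_DOUBLE,
                      neighbors[n], 0, comm, &reqs[r++]);
            MPI_Isend(&halo_send[n * halo_len], halo_len, MPI_DOUBLE,
                      neighbors[n], 0, comm, &reqs[r++]);
        }
        MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);

        double local = local_residual(), global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,   /* occasional   */
                      MPI_SUM, comm);                   /* global step  */
    }
}
```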


MPI = EASY PARALLELISM
[Diagram: two nodes with four cores each; one MPI rank runs on every core]
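A minimal sketch of the flat "MPI Everywhere" model shown in the diagram: launch one rank per core and every rank runs the same code on its own zone. Purely illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d: solving its own zone\n", rank, size);
    /* ... compute / communicate loop from the previous sketch ... */

    MPI_Finalize();
    return 0;
}
```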


NOW, ADD THE GPU
GTC 2012 results:
» MPI ranks cannot share a GPU, but there are more ranks than GPUs in most systems
» Every rank needs a GPU or waits will occur, or use fewer ranks; not realistic
» Giving more work to GPU ranks just shifts other bottlenecks – all ranks must be equally accelerated for all code paths
» Solution: convert to one-rank OpenMP
» This was the only realistic architecture without a full rewrite
[Diagram: single mean flow sweep; accelerated ranks finish early and wait on the CPU-only ranks]


GTC2012 SOLUTION
» Convert to one rank per node, with OpenMP inside the node
» Spawn many (more) zones, balance by flexing tasks to/from the GPU
» Problems: full rewrite, work required to handle load balancing
[Diagram: one MPI process per node, OpenMP threads within each node process]
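A hedged sketch of that 2012 layout: one MPI rank per node, OpenMP threads sweeping the node's zones, with a portion of the zones flexed to the GPU. The function names are illustrative placeholders.

```c
#include <omp.h>

void process_zone_on_gpu(int zone);   /* placeholder GPU path */
void process_zone_on_cpu(int zone);   /* placeholder CPU path */

void sweep_node_zones(int nzones, int gpu_zone_budget)
{
    /* Dynamic schedule helps when GPU and CPU zones cost different
     * amounts of time; this is the load-balancing burden the slide
     * mentions. */
    #pragma omp parallel for schedule(dynamic)
    for (int z = 0; z < nzones; ++z) {
        if (z < gpu_zone_budget)
            process_zone_on_gpu(z);
        else
            process_zone_on_cpu(z);
    }
    /* All MPI traffic happens once per node, outside the OpenMP
     * region, since there is only one rank per node. */
}
```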


KEPLER + CUDA PROXY


KEPLER / HYPER-Q
http://www.nvidia.com/object/nvidia-kepler.html
» Only available recently – Supercomputing 2012
» Hardware feature
» Requires K20 (not supported on K10)


CUDA PROXY
» Launch the CUDA Proxy daemon
  » nvidia-cuda-proxy-control -d
  » CRAY_CUDA_PROXY=1 (on Cray systems)
» The proxy daemon intercepts all GPU calls and routes them to a single CUDA process
» Ranks are routed to individual Hyper-Q queues
[Diagram: kernels from multiple MPI ranks funneled through the proxy into one CUDA process]
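The appealing part for legacy codes is that each rank keeps ordinary CUDA host code; nothing in the rank is proxy-specific. The sketch below (illustrative kernel and sizes, not solver code) is the kind of per-rank GPU step the proxy multiplexes onto one GPU context, with Hyper-Q letting the ranks' work overlap on the device.

```c
#include <cuda_runtime.h>

__global__ void sweep_kernel(double *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        cells[i] *= 0.5;                /* placeholder update */
}

/* One rank's GPU step; unchanged whether or not the proxy is running. */
void rank_gpu_step(double *host_cells, int n)
{
    double *d_cells;
    cudaMalloc(&d_cells, n * sizeof(double));
    cudaMemcpy(d_cells, host_cells, n * sizeof(double),
               cudaMemcpyHostToDevice);
    sweep_kernel<<<(n + 255) / 256, 256>>>(d_cells, n);
    cudaMemcpy(host_cells, d_cells, n * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_cells);
}
```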


GTC2013 SOLUTION
» Keep the existing MPI structure
» Use the Proxy to share the GPU across ranks
» Count on CUDA-aware MPI for intra-node transfers
» Use GPUDirect RDMA for inter-node transfers
  » Not available for the current results
» Nice part: mostly involved undoing the 2012 code
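A hedged sketch of what "CUDA-aware MPI" buys the halo exchange: device pointers go straight into the MPI calls and the library decides how to move the data (staged through the host, or peer-to-peer/RDMA where supported). It assumes d_send and d_recv were allocated with cudaMalloc and that the MPI stack accepts device pointers.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange one halo with one neighbor using device buffers directly. */
void exchange_halo_device(double *d_send, double *d_recv, int halo_len,
                          int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(d_recv, halo_len, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(d_send, halo_len, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```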


BENCHMARKING SETUP
» Cray Nano at NVIDIA
  » 16 nodes
  » Identical to Titan nodes
» 1x 16-core AMD Opteron 6274 (Interlagos) per node
» 1x Tesla K20X per node
» PCI-E 2.0
» Cray MPI stack and compiler
» Two test problems: 1.4M and 5.9M unknowns
» Thanks to Cliff Woolley!


OVERALL SCALING
» 8 ranks per node for each test (others tested)
[Chart: iterations per second vs. number of nodes (1 to 16), CPU and GPU series; annotated speedups of 1.3x and 2x]


SCALING EFFICIENCY
[Chart: scaling efficiency vs. number of nodes (1 to 16), CPU and GPU series; 96% efficiency annotated]


COMPUTE REGION BENCHMARKS
» K1 is the forward solve
» K2 is the reverse solve
» 8 MPI ranks used for CPU and GPU tests
» Proxy overhead is not a factor
[Chart: speedup vs. number of nodes (1 to 16) for K1 and K2 on the 1.4M and 5.9M problems]


TIME SPENT COMMUNICATING
[Chart: percent of time spent communicating vs. number of nodes (1 to 16) for GPU and CPU runs on the 1.4M and 5.9M problems]
» Primary motivator for GPUDirect RDMA


GPUDIRECT RDMA
» Current MPI-GPU procedure: a 5-step process
» New feature, possible since CUDA 5.0
» Not widely available (yet)
» Requires support from the interconnect driver and the MPI library
https://developer.nvidia.com/content/introduction-cuda-aware-mpi
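The slide does not enumerate the five steps, so the sketch below assumes the usual staged path: synchronize, copy device-to-host, MPI send and receive, copy host-to-device. This is the host bounce-buffering that GPUDirect RDMA would remove by letting the NIC read and write GPU memory directly. Buffer names are illustrative.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Staged (non-RDMA) halo exchange: GPU data bounces through pinned or
 * pageable host buffers on both sides of the transfer. */
void staged_halo_exchange(double *d_send, double *d_recv,
                          double *h_send, double *h_recv, int halo_len,
                          int neighbor, MPI_Comm comm)
{
    size_t bytes = halo_len * sizeof(double);

    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, halo_len, MPI_DOUBLE, neighbor, 0,
                 h_recv, halo_len, MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}
```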


SUMMARY AND NEXT STEPS
CUDA PROXY IS A MAJOR ENABLER FOR LEGACY CODES
» Efficient porting from legacy MPI to a CUDA cluster; minimal changes
MPI LAYER IMPROVEMENTS WILL ENABLE PERFORMANCE
» Current code still has manual transfers through host memory
FURTHER CODE ENHANCEMENTS
» More kernels; fewer blocking MPI calls


Thanks! Questions?
» Convention hall @ booth 311
» Online @ www.emphotonics.com
