Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems


Company Overview: Area of expertise
• Complete package:
– HPC hardware: rackmount servers, clusters, workstations, GPUs
– CFD consulting
– HPC software


Company Overview: HPC software based on GPU computing
• Motivation for GPU-accelerated CFD:
– Shorter development cycles
– Larger models → increased accuracy
– (Automated) optimization
– ... many more ...
• LBultra
– Lattice-Boltzmann method: speedup of 20x comparing a single GPU to a CPU (4 cores)
– Available as a stand-alone version and as a plugin for a design suite
• Culises
– Library for accelerated CFD on hybrid GPU-CPU systems


Library Culises: Schematic overview
• OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition
– Each process (CPU0, CPU1, CPU2, ...) owns one partition; the MPI-parallel assembly of the system matrices remains on the CPUs
• Culises: solves the linear system(s) Ax = b on multiple GPUs (GPU0, GPU1, GPU2, ...) using PCG, PBiCG, or AMGPCG
• Interface: cudaMemcpy(..., cudaMemcpyHostToDevice) uploads the system, cudaMemcpy(..., cudaMemcpyDeviceToHost) returns the solution x (see the sketch below)
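
The interface described above boils down to two host-device copies wrapped around the GPU solve. A minimal sketch of that pattern follows; culisesSolve() is a hypothetical stand-in for the library's solver entry point, since the actual Culises API is not shown in this deck:

    // Minimal sketch of the CPU -> GPU -> CPU round trip described above.
    #include <cuda_runtime.h>

    // Hypothetical solver entry point (the real Culises API is not shown here).
    extern "C" void culisesSolve(const double* A, const double* b, double* x,
                                 size_t nnz, size_t n);

    void solveOnGpu(const double* h_A, const double* h_b, double* h_x,
                    size_t nnz, size_t n)
    {
        double *d_A, *d_b, *d_x;
        cudaMalloc(&d_A, nnz * sizeof(double));
        cudaMalloc(&d_b, n * sizeof(double));
        cudaMalloc(&d_x, n * sizeof(double));

        // Upload the matrix (assembled on the CPU) and the right-hand side.
        cudaMemcpy(d_A, h_A, nnz * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, n * sizeof(double), cudaMemcpyHostToDevice);

        culisesSolve(d_A, d_b, d_x, nnz, n);  // solve Ax = b on the GPU

        // Return the solution x to OpenFOAM on the host.
        cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);

        cudaFree(d_A); cudaFree(d_b); cudaFree(d_x);
    }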


Library Culises: Solvers available
• State-of-the-art solvers for linear systems
– Multi-GPU
– Single or double precision (only DP results are shown)
• Krylov subspace methods
– Conjugate Gradient or Bi-Conjugate Gradient method, for symmetric and non-symmetric system matrices
– Preconditioning:
• Jacobi (DiagonalPCG)
• Incomplete Cholesky (DICPCG)
• Algebraic multigrid (AMGPCG)
• Stand-alone multigrid method under development
(A sketch of the Jacobi-preconditioned variant follows this list.)
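
For orientation, here is a minimal CPU-side sketch of Jacobi (diagonal) preconditioned CG, the textbook algorithm behind DiagonalPCG. This is illustration only, not Culises source; the CSR storage is an assumption:

    // Textbook Jacobi-preconditioned CG; illustration only, not Culises source.
    #include <vector>
    #include <cmath>

    struct Csr { std::vector<int> rowPtr, col; std::vector<double> val; int n; };

    static void spmv(const Csr& A, const std::vector<double>& x,
                     std::vector<double>& y) {
        for (int i = 0; i < A.n; ++i) {
            double s = 0.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                s += A.val[k] * x[A.col[k]];
            y[i] = s;
        }
    }

    static double dot(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Solve Ax = b to a specified residual tolerance, as in the slides' setup.
    void jacobiPcg(const Csr& A, const std::vector<double>& b,
                   std::vector<double>& x, double tol, int maxIter) {
        int n = A.n;
        std::vector<double> invDiag(n), r(n), z(n), p(n), Ap(n);
        for (int i = 0; i < n; ++i)                       // M^-1 = 1/diag(A)
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                if (A.col[k] == i) invDiag[i] = 1.0 / A.val[k];
        spmv(A, x, Ap);
        for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];  // r = b - Ax
        for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
        p = z;
        double rz = dot(r, z);
        for (int it = 0; it < maxIter && std::sqrt(dot(r, r)) > tol; ++it) {
            spmv(A, p, Ap);
            double alpha = rz / dot(p, Ap);
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
            double rzNew = dot(r, z);
            for (int i = 0; i < n; ++i) p[i] = z[i] + (rzNew / rz) * p[i];
            rz = rzNew;
        }
    }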


Library Culises: Parallel approach
• 1-1 link between MPI process/rank and GPU
→ CPU partitioning equals GPU partitioning
→ at peak CPU performance the GPUs are under-utilized
• Bunching of MPI ranks is therefore required: an n-1 linkage option (e.g. 3-1, where three CPU ranks share one GPU); a rank-to-GPU binding sketch follows below
• GPUDirect
– Peer-to-peer data exchange (CUDA 4.1 IPC)
– Directly hidden in the MPI implementation (release candidates: OpenMPI, MVAPICH2)
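
A minimal sketch of binding MPI ranks to GPUs, assuming one process per GPU. The rank % deviceCount mapping is a common convention, not necessarily the scheme Culises uses internally:

    // Bind each MPI rank to a GPU (1-1 linkage).
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // as referenced on the slide

        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        // 1-1 linkage: each rank takes its own GPU. With more ranks than
        // GPUs (size > deviceCount) this degenerates to n-1 bunching,
        // i.e. several ranks sharing one GPU.
        cudaSetDevice(rank % deviceCount);

        // ... assemble the local partition on the CPU, solve on the GPU ...

        MPI_Finalize();
        return 0;
    }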


Example results: Setup
• Amdahl's law: let f be the fraction of the computation that is ported to the GPU and a the acceleration achieved on the GPU. The speedup is

    s = \frac{1}{(1 - f) + f/a}

and the theoretical maximum speedup is

    s_{\max} = \lim_{a \to \infty} s(a) = \frac{1}{1 - f}, \qquad E = \frac{s}{s_{\max}}

where E is the efficiency.
• Example: on the CPU, solution of the linear system consumes 80% of the total CPU time, so f = 0.8; with a = 10 this gives s_max = 5, s = 3.57, and E = 0.71 (reproduced in the sketch below).
[Figure: s plotted over f for a = 5, 10, 15, and the limit a → ∞]
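
A short helper that reproduces the slide's worked example from the formulas above (f = 0.8, a = 10 → s_max = 5, s = 3.57, E = 0.71):

    // Amdahl's-law helper reproducing the slide's worked example.
    #include <cstdio>

    int main() {
        double f = 0.8;    // fraction of work ported to the GPU
        double a = 10.0;   // acceleration of that fraction on the GPU
        double s    = 1.0 / ((1.0 - f) + f / a);  // overall speedup
        double sMax = 1.0 / (1.0 - f);            // limit a -> infinity
        double E    = s / sMax;                   // efficiency
        std::printf("s = %.2f, s_max = %.2f, E = %.2f\n", s, sMax, E);
        return 0;  // prints: s = 3.57, s_max = 5.00, E = 0.71
    }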


Example results: Setup
• CFD solver: OpenFOAM® 2.0.1/2.1.0
• Fair comparison: best linear solver on CPU vs. best linear solver on GPU
– Krylov: preconditioned Conjugate Gradient method
– Multigrid method: needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, ...)
– Same convergence criterion: specified tolerance of the residual
• Hardware configuration: Tyan board with
– 2 CPUs: Intel Xeon X5650 @ 2.67 GHz
– 8 GPUs: Tesla 2070 (6 GB each)
1. SIMPLE: Semi-Implicit Method for Pressure-Linked Equations


Example results: Automotive, DrivAER
• Generic car shape model (DrivAER geometry)
• Incompressible flow: simpleFoam solver
– SIMPLE¹ method for pressure-velocity coupling
– Poisson equation for the pressure; the linear system is solved by Culises
– k-ω SST turbulence model
• 2 computational grids:
– 3 million grid cells (sequential runs)
– 22 million grid cells (parallel runs)
• Solver control (OpenFOAM®) via config files, CPU on the left, GPU via Culises on the right:

    solvers {                          solvers {
        p {                                p {
            solver          PCG;               solver          PCGGPU;
            preconditioner  DIC;               preconditioner  AMG;
            tolerance       1e-6;              tolerance       1e-6;
            ...                                ...
        }                                  }
    }                                  }

1. SIMPLE: Semi-Implicit Method for Pressure-Linked Equations


Example results: DrivAER, 3M grid cells
• Single CPU vs. single CPU+GPU
– Converged solution (4000 time steps)
– Validation: comparison of results (DICPCG on CPU vs. AMGPCG on GPU)
• Memory requirement (estimate sketch below):
– AMGPCG: 40% of 6 GB; 1M cells require 0.80 GB → Tesla 2070 fits ≈7.5M cells
– DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → Tesla 2070 fits ≈23M cells
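
A back-of-the-envelope check of the capacity figures above, using only the slide's GB-per-million-cells numbers and the 6 GB Tesla 2070:

    // Capacity estimate: GB per million cells -> cells that fit on 6 GB.
    #include <cstdio>

    int main() {
        const double gpuMemGB = 6.0;
        const double gbPerMCells[] = { 0.80, 0.26 };  // AMGPCG, DiagonalPCG
        const char*  name[]        = { "AMGPCG", "DiagonalPCG" };
        for (int i = 0; i < 2; ++i)
            std::printf("%-12s: %.1fM cells max\n",
                        name[i], gpuMemGB / gbPerMCells[i]);
        return 0;  // prints ~7.5M and ~23.1M, matching the slide
    }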


Example results: DrivAER, 3M grid cells
• Speedup with a single GPU (the note below shows how f relates to the measured s and a):

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s = 1/((1-f)+f/a) | Max s_max = 1/(1-f) | Efficiency E = s/s_max
GAMG¹        | AMGPCG      | 0.55       | 3.36                         | 1.56                      | 2.22                | 68%
DICPCG       | DiagonalPCG | 0.78       | 5.8                          | 2.7                       | 4.55                | 60%
DiagonalPCG  | DiagonalPCG | 0.87       | 11.6                         | 4.9                       | 7.7                 | 64%

1. GAMG: generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face area
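
A note on how the table's columns fit together: given the measured overall speedup s and the linear-solver acceleration a, Amdahl's law can be inverted to recover the ported fraction f:

    s = \frac{1}{(1 - f) + f/a}
    \quad\Longrightarrow\quad
    f = \frac{1 - 1/s}{1 - 1/a}

For the last row, s = 4.9 and a = 11.6 give f = (1 - 1/4.9)/(1 - 1/11.6) ≈ 0.87, consistent with the table.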


Example results: DrivAER, 3M grid cells
• Performance with multiple GPUs
• Strong scaling: multiple CPUs+GPUs (1-1 linkage)
– Scaling of the total code versus # of CPUs and # of GPUs
– Scaling of the linear solver versus # of CPUs and # of GPUs
[Figure: simulation time (total and linear solver) and scaling (total and linear solver) for the AMGPCG solver, plotted over # of CPUs = # of GPUs from 1 to 6; left axis simulation time 0-1200, right axis scaling 0-3.5]


Example results: DrivAER, 3M grid cells
• Speedup by adding multiple GPUs (1-1 linkage), n CPUs + n GPUs vs. n CPUs only:

Speedup total s:
Solver CPU vs. Solver GPU    | 1 CPU + 1 GPU | 2 CPUs + 2 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs
GAMG vs. AMGPCG              | 1.56          | 1.64            | 1.29            | 1.27
DICPCG vs. DiagonalPCG       | 2.7           | 1.49            | 1.20            | 1.45
DiagonalPCG vs. DiagonalPCG  | 4.9           | 2.84            | 1.79            | 2.03

Speedup linear solver a:
Solver CPU vs. Solver GPU    | 1 CPU + 1 GPU | 2 CPUs + 2 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs
GAMG vs. AMGPCG              | 3.36          | 3.06            | 2.38            | 2.13
DICPCG vs. DiagonalPCG       | 5.8           | 1.95            | 1.46            | 1.84
DiagonalPCG vs. DiagonalPCG  | 11.6          | 4.14            | 2.39            | 2.80

• Example: the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than when running on 2 CPUs only.


Example results: DrivAER, 22M grid cells
• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈90%)
[Figure: simulation time and scaling (total and linear solver) over # of CPUs = # of GPUs for 3, 4, 6, 8; GAMG on CPUs only (dashed) vs. AMGPCG on CPUs+GPUs (solid)]


Example results: DrivAER, 22M grid cells
• Speedup by adding multiple GPUs (1-1 linkage), GAMG solver vs. AMGPCG solver:

# of CPUs                     | 3 CPUs  | 4 CPUs  | 6 CPUs  | 8 CPUs
# of GPUs added               | +3 GPUs | +4 GPUs | +6 GPUs | +8 GPUs
Speedup s                     | 1.56    | 1.58    | 1.54    | 1.42
Speedup linear solver a       | 3.4     | 2.81    | 2.91    | 2.33
Fraction f                    | 0.60    | 0.59    | 0.57    | 0.50
Theoretical max speedup s_max | 2.50    | 2.43    | 2.33    | 2.00
Efficiency E                  | 62%     | 65%     | 66%     | 71%

• Utilization is not yet optimal; further optimization under development: n-1 linkage between CPUs and GPUs


Example results: Multiphase flow, ship hull
• LTSinterFoam solver
– Steady-state, using a local time-stepping method
– Volume-of-fluid (VoF) method
– Pressure solver: linear system → Culises
• 4M grid cells

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
DICPCG       | DiagonalPCG | 0.43       | 4.91                         | 1.54      | 1.75      | 88%
DiagonalPCG  | DiagonalPCG | 0.55       | 8.66                         | 2.12      | 2.22      | 95%


Example results: Heat transfer, heated room
• buoyantPimpleFoam solver
– Unsteady PISO¹ method
– Pressure solver: linear system → Culises
• 4M grid cells

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
DICPCG       | DiagonalPCG | 0.72       | 6.11                         | 2.45      | 3.57      | 69%
DiagonalPCG  | DiagonalPCG | 0.80       | 9.90                         | 3.59      | 5.00      | 72%

1. PISO: Pressure-Implicit with Splitting of Operators


Example results: Process industry, flow molding
• pisoFoam solver
– Unsteady
– Pressure solver: linear system → Culises
• 500K grid cells

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
DICPCG       | DiagonalPCG | 0.84       | 3.6                          | 2.65      | 6.25      | 42%
DiagonalPCG  | DiagonalPCG | 0.94       | 10.4                         | 6.9       | 16.7      | 42%


Example results: Pharmaceutical, generic bioreactor
• interFoam solver
– Unsteady
– VoF method
– Pressure solver: linear system → Culises
• 500k grid cells
• Setup: liquid surface driven by a shaking device (off-centered spindle)

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
GAMG         | AMGPCG      | 0.53       | 2.59                         | 1.44      | 2.12      | 68%
DiagonalPCG  | DiagonalPCG | 0.81       | 5.94                         | 3.00      | 5.26      | 57%


Summary
• Speedup categorized by application, obtained from (averaged) single-GPU test cases: automotive, multiphase, heat transfer, pharmaceutics, process industry
[Bar chart: for each application, the overall speedup, the linear-solver acceleration, and the efficiency, relative to the OpenFOAM® CPU baseline of 1]


Future Culises features: Under development
• Stand-alone multigrid solver
• Multi-GPU usage and scalability
– Optimized load balancing via n-1 linkage between CPUs and GPUs
– Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers; see the sketch below
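
For reference, a minimal sketch of what CUDA peer-to-peer exchange between two GPUs looks like. The device IDs and buffer size are illustrative; the slides do not show how Culises wires this up internally:

    // Copy a buffer from GPU 0 to GPU 1 directly over PCIe,
    // without staging through host memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);  // can GPU 1 reach GPU 0?
        if (!canAccess) { std::printf("no P2P path\n"); return 1; }

        const size_t bytes = 1 << 20;
        double *buf0, *buf1;
        cudaSetDevice(0); cudaMalloc(&buf0, bytes);
        cudaSetDevice(1); cudaMalloc(&buf1, bytes);
        cudaDeviceEnablePeerAccess(0, 0);  // let GPU 1 access GPU 0

        // Direct device-to-device copy (a halo exchange would go this way).
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0); cudaFree(buf0);
        return 0;
    }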


Questions?
