Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems


Company Overview: Area of expertise
• Complete package:
– HPC hardware: rackmount servers, clusters, workstations, GPUs
– CFD consulting
– HPC software


Company Overview: HPC software based on GPU computing
• Motivation for GPU-accelerated CFD:
– Shorter development cycles
– Larger models → increased accuracy
– (Automated) optimization
– ... many more ...
• LBultra
– Lattice-Boltzmann method: speedup of 20x comparing a single GPU to a CPU (4 cores)
– Available as a stand-alone version and as a plugin for a design suite
• Culises
– Library for accelerated CFD on hybrid GPU-CPU systems


Library Culises: Schematic overview
• OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition
– Each process (CPU0, CPU1, CPU2, ...) owns one partition; the MPI-parallel assembly of the system matrices remains on the CPUs
• Culises: solves the linear system(s) Ax = b on multiple GPUs (GPU0, GPU1, GPU2, ...) using PCG, PBiCG, or AMGPCG
• Interface: cudaMemcpy(..., cudaMemcpyHostToDevice) uploads the system, cudaMemcpy(..., cudaMemcpyDeviceToHost) returns the solution x (see the sketch below)
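
The interface described above boils down to two host-device copies wrapped around the GPU solve. A minimal sketch of that pattern follows; culisesSolve() is a hypothetical stand-in for the library's solver entry point, since the actual Culises API is not shown in this deck:

    // Minimal sketch of the CPU -> GPU -> CPU round trip described above.
    #include <cuda_runtime.h>

    // Hypothetical solver entry point (the real Culises API is not shown here).
    extern "C" void culisesSolve(const double* A, const double* b, double* x,
                                 size_t nnz, size_t n);

    void solveOnGpu(const double* h_A, const double* h_b, double* h_x,
                    size_t nnz, size_t n)
    {
        double *d_A, *d_b, *d_x;
        cudaMalloc(&d_A, nnz * sizeof(double));
        cudaMalloc(&d_b, n * sizeof(double));
        cudaMalloc(&d_x, n * sizeof(double));

        // Upload the matrix (assembled on the CPU) and the right-hand side.
        cudaMemcpy(d_A, h_A, nnz * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, n * sizeof(double), cudaMemcpyHostToDevice);

        culisesSolve(d_A, d_b, d_x, nnz, n);  // solve Ax = b on the GPU

        // Return the solution x to OpenFOAM on the host.
        cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);

        cudaFree(d_A); cudaFree(d_b); cudaFree(d_x);
    }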


Library Culises: Solvers available
• State-of-the-art solvers for linear systems
– Multi-GPU
– Single or double precision (only DP results are shown)
• Krylov subspace methods
– Conjugate Gradient or Bi-Conjugate Gradient method, for symmetric and non-symmetric system matrices
– Preconditioning:
• Jacobi (DiagonalPCG)
• Incomplete Cholesky (DICPCG)
• Algebraic multigrid (AMGPCG)
• Stand-alone multigrid method under development
(A sketch of the Jacobi-preconditioned variant follows this list.)
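
For orientation, here is a minimal CPU-side sketch of Jacobi (diagonal) preconditioned CG, the textbook algorithm behind DiagonalPCG. This is illustration only, not Culises source; the CSR storage is an assumption:

    // Textbook Jacobi-preconditioned CG; illustration only, not Culises source.
    #include <vector>
    #include <cmath>

    struct Csr { std::vector<int> rowPtr, col; std::vector<double> val; int n; };

    static void spmv(const Csr& A, const std::vector<double>& x,
                     std::vector<double>& y) {
        for (int i = 0; i < A.n; ++i) {
            double s = 0.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                s += A.val[k] * x[A.col[k]];
            y[i] = s;
        }
    }

    static double dot(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Solve Ax = b to a specified residual tolerance, as in the slides' setup.
    void jacobiPcg(const Csr& A, const std::vector<double>& b,
                   std::vector<double>& x, double tol, int maxIter) {
        int n = A.n;
        std::vector<double> invDiag(n), r(n), z(n), p(n), Ap(n);
        for (int i = 0; i < n; ++i)                       // M^-1 = 1/diag(A)
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                if (A.col[k] == i) invDiag[i] = 1.0 / A.val[k];
        spmv(A, x, Ap);
        for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];  // r = b - Ax
        for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
        p = z;
        double rz = dot(r, z);
        for (int it = 0; it < maxIter && std::sqrt(dot(r, r)) > tol; ++it) {
            spmv(A, p, Ap);
            double alpha = rz / dot(p, Ap);
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
            double rzNew = dot(r, z);
            for (int i = 0; i < n; ++i) p[i] = z[i] + (rzNew / rz) * p[i];
            rz = rzNew;
        }
    }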


Library Culises: Parallel approach
• 1-1 link between MPI process/rank and GPU
→ CPU partitioning equals GPU partitioning
→ at peak CPU performance the GPUs are under-utilized
• Bunching of MPI ranks is therefore required: an n-1 linkage option (e.g. 3-1, where three CPU ranks share one GPU); a rank-to-GPU binding sketch follows below
• GPUDirect
– Peer-to-peer data exchange (CUDA 4.1 IPC)
– Directly hidden in the MPI implementation (release candidates: OpenMPI, MVAPICH2)
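
A minimal sketch of binding MPI ranks to GPUs, assuming one process per GPU. The rank % deviceCount mapping is a common convention, not necessarily the scheme Culises uses internally:

    // Bind each MPI rank to a GPU (1-1 linkage).
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // as referenced on the slide

        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        // 1-1 linkage: each rank takes its own GPU. With more ranks than
        // GPUs (size > deviceCount) this degenerates to n-1 bunching,
        // i.e. several ranks sharing one GPU.
        cudaSetDevice(rank % deviceCount);

        // ... assemble the local partition on the CPU, solve on the GPU ...

        MPI_Finalize();
        return 0;
    }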


Example results: Setup
• Amdahl's law: let f be the fraction of the computation that is ported to the GPU and a the acceleration achieved on the GPU. The speedup is

    s = \frac{1}{(1 - f) + f/a}

and the theoretical maximum speedup is

    s_{\max} = \lim_{a \to \infty} s(a) = \frac{1}{1 - f}, \qquad E = \frac{s}{s_{\max}}

where E is the efficiency.
• Example: on the CPU, solution of the linear system consumes 80% of the total CPU time, so f = 0.8; with a = 10 this gives s_max = 5, s = 3.57, and E = 0.71 (reproduced in the sketch below).
[Figure: s plotted over f for a = 5, 10, 15, and the limit a → ∞]
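
A short helper that reproduces the slide's worked example from the formulas above (f = 0.8, a = 10 → s_max = 5, s = 3.57, E = 0.71):

    // Amdahl's-law helper reproducing the slide's worked example.
    #include <cstdio>

    int main() {
        double f = 0.8;    // fraction of work ported to the GPU
        double a = 10.0;   // acceleration of that fraction on the GPU
        double s    = 1.0 / ((1.0 - f) + f / a);  // overall speedup
        double sMax = 1.0 / (1.0 - f);            // limit a -> infinity
        double E    = s / sMax;                   // efficiency
        std::printf("s = %.2f, s_max = %.2f, E = %.2f\n", s, sMax, E);
        return 0;  // prints: s = 3.57, s_max = 5.00, E = 0.71
    }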


Example results: Setup
• CFD solver: OpenFOAM® 2.0.1/2.1.0
• Fair comparison: best linear solver on CPU vs. best linear solver on GPU
– Krylov: preconditioned Conjugate Gradient method
– Multigrid method: needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, ...)
– Same convergence criterion: specified tolerance of the residual
• Hardware configuration: Tyan board with
– 2 CPUs: Intel Xeon X5650 @ 2.67 GHz
– 8 GPUs: Tesla 2070 (6 GB each)
1. SIMPLE: Semi-Implicit Method for Pressure-Linked Equations


Example results: Automotive, DrivAER
• Generic car shape model (DrivAER geometry)
• Incompressible flow: simpleFoam solver
– SIMPLE¹ method for pressure-velocity coupling
– Poisson equation for the pressure; the linear system is solved by Culises
– k-ω SST turbulence model
• 2 computational grids:
– 3 million grid cells (sequential runs)
– 22 million grid cells (parallel runs)
• Solver control (OpenFOAM®) via config files, CPU on the left, GPU via Culises on the right:

    solvers {                          solvers {
        p {                                p {
            solver          PCG;               solver          PCGGPU;
            preconditioner  DIC;               preconditioner  AMG;
            tolerance       1e-6;              tolerance       1e-6;
            ...                                ...
        }                                  }
    }                                  }

1. SIMPLE: Semi-Implicit Method for Pressure-Linked Equations


Example results: DrivAER, 3M grid cells
• Single CPU vs. single CPU+GPU
– Converged solution (4000 time steps)
– Validation: comparison of results (DICPCG on CPU vs. AMGPCG on GPU)
• Memory requirement (estimate sketch below):
– AMGPCG: 40% of 6 GB; 1M cells require 0.80 GB → Tesla 2070 fits ≈7.5M cells
– DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → Tesla 2070 fits ≈23M cells
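
A back-of-the-envelope check of the capacity figures above, using only the slide's GB-per-million-cells numbers and the 6 GB Tesla 2070:

    // Capacity estimate: GB per million cells -> cells that fit on 6 GB.
    #include <cstdio>

    int main() {
        const double gpuMemGB = 6.0;
        const double gbPerMCells[] = { 0.80, 0.26 };  // AMGPCG, DiagonalPCG
        const char*  name[]        = { "AMGPCG", "DiagonalPCG" };
        for (int i = 0; i < 2; ++i)
            std::printf("%-12s: %.1fM cells max\n",
                        name[i], gpuMemGB / gbPerMCells[i]);
        return 0;  // prints ~7.5M and ~23.1M, matching the slide
    }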


Example results: DrivAER, 3M grid cells
• Speedup with a single GPU (the note below shows how f relates to the measured s and a):

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s = 1/((1-f)+f/a) | Max s_max = 1/(1-f) | Efficiency E = s/s_max
GAMG¹        | AMGPCG      | 0.55       | 3.36                         | 1.56                      | 2.22                | 68%
DICPCG       | DiagonalPCG | 0.78       | 5.8                          | 2.7                       | 4.55                | 60%
DiagonalPCG  | DiagonalPCG | 0.87       | 11.6                         | 4.9                       | 7.7                 | 64%

1. GAMG: generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face area
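
A note on how the table's columns fit together: given the measured overall speedup s and the linear-solver acceleration a, Amdahl's law can be inverted to recover the ported fraction f:

    s = \frac{1}{(1 - f) + f/a}
    \quad\Longrightarrow\quad
    f = \frac{1 - 1/s}{1 - 1/a}

For the last row, s = 4.9 and a = 11.6 give f = (1 - 1/4.9)/(1 - 1/11.6) ≈ 0.87, consistent with the table.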


Example results: DrivAER, 3M grid cells
• Performance with multiple GPUs
• Strong scaling: multiple CPUs+GPUs (1-1 linkage)
– Scaling of the total code versus # of CPUs and # of GPUs
– Scaling of the linear solver versus # of CPUs and # of GPUs
[Figure: simulation time (total and linear solver) and scaling (total and linear solver) for the AMGPCG solver, plotted over # of CPUs = # of GPUs from 1 to 6; left axis simulation time 0-1200, right axis scaling 0-3.5]


Example results: DrivAER, 3M grid cells
• Speedup by adding multiple GPUs (1-1 linkage), n CPUs + n GPUs vs. n CPUs only:

Speedup total s:
Solver CPU vs. Solver GPU    | 1 CPU + 1 GPU | 2 CPUs + 2 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs
GAMG vs. AMGPCG              | 1.56          | 1.64            | 1.29            | 1.27
DICPCG vs. DiagonalPCG       | 2.7           | 1.49            | 1.20            | 1.45
DiagonalPCG vs. DiagonalPCG  | 4.9           | 2.84            | 1.79            | 2.03

Speedup linear solver a:
Solver CPU vs. Solver GPU    | 1 CPU + 1 GPU | 2 CPUs + 2 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs
GAMG vs. AMGPCG              | 3.36          | 3.06            | 2.38            | 2.13
DICPCG vs. DiagonalPCG       | 5.8           | 1.95            | 1.46            | 1.84
DiagonalPCG vs. DiagonalPCG  | 11.6          | 4.14            | 2.39            | 2.80

• Example: the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than when running on 2 CPUs only.


Example results: DrivAER, 22M grid cells
• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈90%)
[Figure: simulation time and scaling (total and linear solver) over # of CPUs = # of GPUs for 3, 4, 6, 8; GAMG on CPUs only (dashed) vs. AMGPCG on CPUs+GPUs (solid)]


Example results: DrivAER, 22M grid cells
• Speedup by adding multiple GPUs (1-1 linkage), GAMG solver vs. AMGPCG solver:

# of CPUs                     | 3 CPUs  | 4 CPUs  | 6 CPUs  | 8 CPUs
# of GPUs added               | +3 GPUs | +4 GPUs | +6 GPUs | +8 GPUs
Speedup s                     | 1.56    | 1.58    | 1.54    | 1.42
Speedup linear solver a       | 3.4     | 2.81    | 2.91    | 2.33
Fraction f                    | 0.60    | 0.59    | 0.57    | 0.50
Theoretical max speedup s_max | 2.50    | 2.43    | 2.33    | 2.00
Efficiency E                  | 62%     | 65%     | 66%     | 71%

• Utilization is not yet optimal; further optimization under development: n-1 linkage between CPUs and GPUs


Example results: Multiphase flow, ship hull
• LTSinterFoam solver
– Steady-state, using a local time-stepping method
– Volume-of-fluid (VoF) method
– Pressure solver: linear system → Culises
• 4M grid cells

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
DICPCG       | DiagonalPCG | 0.43       | 4.91                         | 1.54      | 1.75      | 88%
DiagonalPCG  | DiagonalPCG | 0.55       | 8.66                         | 2.12      | 2.22      | 95%


Example results: Heat transfer, heated room
• buoyantPimpleFoam solver
– Unsteady PISO¹ method
– Pressure solver: linear system → Culises
• 4M grid cells

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
DICPCG       | DiagonalPCG | 0.72       | 6.11                         | 2.45      | 3.57      | 69%
DiagonalPCG  | DiagonalPCG | 0.80       | 9.90                         | 3.59      | 5.00      | 72%

1. PISO: Pressure-Implicit with Splitting of Operators


Example results: Process industry, flow molding
• pisoFoam solver
– Unsteady
– Pressure solver: linear system → Culises
• 500K grid cells

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
DICPCG       | DiagonalPCG | 0.84       | 3.6                          | 2.65      | 6.25      | 42%
DiagonalPCG  | DiagonalPCG | 0.94       | 10.4                         | 6.9       | 16.7      | 42%


Example results: Pharmaceutical, generic bioreactor
• interFoam solver
– Unsteady
– VoF method
– Pressure solver: linear system → Culises
• 500k grid cells
• Setup: liquid surface driven by a shaking device (off-centered spindle)

Solver CPU   | Solver GPU  | Fraction f | Linear-solver acceleration a | Speedup s | Max s_max | Efficiency E
GAMG         | AMGPCG      | 0.53       | 2.59                         | 1.44      | 2.12      | 68%
DiagonalPCG  | DiagonalPCG | 0.81       | 5.94                         | 3.00      | 5.26      | 57%


Summary
• Speedup categorized by application, obtained from (averaged) single-GPU test cases: automotive, multiphase, heat transfer, pharmaceutics, process industry
[Bar chart: for each application, the overall speedup, the linear-solver acceleration, and the efficiency, relative to the OpenFOAM® CPU baseline of 1]


Future Culises features: Under development
• Stand-alone multigrid solver
• Multi-GPU usage and scalability
– Optimized load balancing via n-1 linkage between CPUs and GPUs
– Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers; see the sketch below
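
For reference, a minimal sketch of what CUDA peer-to-peer exchange between two GPUs looks like. The device IDs and buffer size are illustrative; the slides do not show how Culises wires this up internally:

    // Copy a buffer from GPU 0 to GPU 1 directly over PCIe,
    // without staging through host memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);  // can GPU 1 reach GPU 0?
        if (!canAccess) { std::printf("no P2P path\n"); return 1; }

        const size_t bytes = 1 << 20;
        double *buf0, *buf1;
        cudaSetDevice(0); cudaMalloc(&buf0, bytes);
        cudaSetDevice(1); cudaMalloc(&buf1, bytes);
        cudaDeviceEnablePeerAccess(0, 0);  // let GPU 1 access GPU 0

        // Direct device-to-device copy (a halo exchange would go this way).
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0); cudaFree(buf0);
        return 0;
    }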


Questions?
