Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems

B. Landmann

Company Overview – Area of expertise
• Complete package
  – HPC hardware
  – CFD consulting
  – HPC software
• Rackmount servers, clusters, workstations, GPUs


Company Overview – HPC software based on GPU computing
• Motivation for GPU-accelerated CFD
  – Shorter development cycles
  – Larger models → increased accuracy
  – (Automated) optimization
  – … many more …
• LBultra
  – Lattice-Boltzmann method: speedup of 20x comparing a single GPU with a CPU (4 cores)
  – Available as a stand-alone version and as a plugin for a design suite
• Culises
  – Library for accelerated CFD on hybrid GPU-CPU systems


Library Culises – Schematic overview
• OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition
  – MPI-parallel assembly of the system matrices remains on the CPUs (CPU0, CPU1, CPU2)
  – Each processor partition hands its linear system Ax = b to Culises and receives the solution x
• Interface: cudaMemcpy(…, cudaMemcpyHostToDevice) / cudaMemcpy(…, cudaMemcpyDeviceToHost) (see the sketch below)
• Culises: solves the linear system(s) on multiple GPUs (GPU0, GPU1, GPU2) with PCG, PBiCG, and AMGPCG
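To make the host-device hand-off concrete, here is a minimal sketch of what one solver call looks like conceptually: the CPU-assembled right-hand side (and matrix, handled analogously) is copied to the GPU, the system is solved there, and the solution vector is copied back. This is not the actual Culises API; all names (solve_on_gpu, culises_like_solve, …) are illustrative assumptions.

```cpp
// Minimal sketch of the CPU<->GPU hand-off around one linear solve (illustrative only).
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder for the GPU-side preconditioned Krylov solve; in Culises this
// would run PCG/PBiCG/AMGPCG on the device.
static void solve_on_gpu(double* /*d_x*/, const double* /*d_b*/, std::size_t /*n*/) {}

void culises_like_solve(double* h_x, const double* h_b, std::size_t n)
{
    double *d_x = nullptr, *d_b = nullptr;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_b, n * sizeof(double));

    // Matrix assembly stays on the CPU (OpenFOAM side); only the linear-system
    // data crosses the PCIe bus, as indicated on the slide.
    cudaMemcpy(d_b, h_b, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice); // initial guess

    solve_on_gpu(d_x, d_b, n);

    cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost); // solution x back to OpenFOAM

    cudaFree(d_x);
    cudaFree(d_b);
}
```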


Library Culises – Solvers available
• State-of-the-art solvers for linear systems
  – Multi-GPU
  – Single or double precision (only DP results are shown)
• Krylov subspace methods (a generic outline is sketched below)
  – Conjugate or Bi-Conjugate Gradient method for symmetric and non-symmetric system matrices
  – Preconditioning
    • Jacobi (DiagonalPCG)
    • Incomplete Cholesky (DICPCG)
    • Algebraic multigrid (AMGPCG)
• Stand-alone multigrid method under development
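For orientation, the structure of a preconditioned Conjugate Gradient iteration that such a solver implements is outlined below in plain C++. This is a generic textbook PCG, not the Culises GPU code; apply_A and apply_M are hypothetical callbacks for the matrix-vector product and the preconditioner (Jacobi, incomplete Cholesky, AMG, …).

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Generic preconditioned CG for a symmetric positive-definite system A x = b.
using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;   // y = operator(x)

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void pcg(const Op& apply_A, const Op& apply_M, const Vec& b, Vec& x,
         double tol, int max_iter)
{
    const std::size_t n = b.size();
    Vec r(n), z(n), p(n), Ap(n);

    apply_A(x, Ap);                                   // r = b - A x
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    apply_M(r, z);                                    // z = M^{-1} r
    p = z;
    double rz = dot(r, z);

    for (int k = 0; k < max_iter && std::sqrt(dot(r, r)) > tol; ++k) {
        apply_A(p, Ap);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        apply_M(r, z);                                // re-apply preconditioner
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
}
```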


Library Culises – Parallel approach
• 1-1 link between MPI process/rank and GPU (see the sketch below)
  – CPU partitioning equals GPU partitioning
  – At peak CPU performance this leads to under-utilization of the GPUs
• Bunching of MPI ranks therefore required → n-1 linkage option
  (Diagram: CPU0, CPU1, CPU2 mapped to GPU0, GPU1, GPU2, contrasting 1-1 and 3-1 linkage; MPI_Comm_size(comm, &size) gives the number of ranks.)
• GPUDirect
  – Peer-to-peer data exchange (CUDA 4.1 IPC)
  – Directly hidden in the MPI implementation; release candidates: OpenMPI, MVAPICH2
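A minimal sketch of the 1-1 linkage described above, assuming one MPI rank per GPU; with rank bunching (n-1 linkage), several ranks would simply map to the same device index. The modulo mapping is an assumption for illustration, not necessarily how Culises assigns devices.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // as quoted on the slide

    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    // 1-1 linkage: each rank owns "its" GPU; if there are more ranks than
    // GPUs this degenerates to an n-1 mapping (several ranks share a device).
    if (num_gpus > 0)
        cudaSetDevice(rank % num_gpus);

    // ... assemble the local matrix on the CPU, solve on the selected GPU ...

    MPI_Finalize();
    return 0;
}
```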


Example results – Setup
• Amdahl's law and theoretical maximum speedup:

    s = 1 / ((1 - f) + f/a),    s_max = lim(a→∞) s(a) = 1 / (1 - f),    efficiency E = s / s_max

  where f is the fraction of the computation that is ported to the GPU and a is the acceleration achieved on the GPU.
  (Plot: speedup s versus f for a = 5, 10, 15, and a → ∞.)
• Example: on the CPU the solution of the linear system consumes 80% of the total CPU time, i.e. f = 0.8; with a = 10 this gives s_max = 5, s = 3.57, E = 0.71.
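As a quick cross-check of the numbers on this slide, a few lines of C++ reproduce the worked example (f = 0.8, a = 10 gives s ≈ 3.57, s_max = 5, E ≈ 0.71):

```cpp
#include <cstdio>

// Amdahl's law as used on the slide: f = ported fraction, a = GPU acceleration.
double speedup(double f, double a)    { return 1.0 / ((1.0 - f) + f / a); }
double max_speedup(double f)          { return 1.0 / (1.0 - f); }
double efficiency(double f, double a) { return speedup(f, a) / max_speedup(f); }

int main()
{
    const double f = 0.8, a = 10.0;
    std::printf("s = %.2f, s_max = %.2f, E = %.2f\n",
                speedup(f, a), max_speedup(f), efficiency(f, a));
    // prints: s = 3.57, s_max = 5.00, E = 0.71
    return 0;
}
```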


Example results – Setup
• CFD solver: OpenFOAM® 2.0.1/2.1.0
• Fair comparison: best linear solver on the CPU vs. best linear solver on the GPU
  – Krylov: preconditioned Conjugate Gradient method
  – Multigrid method: needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, …)
  – Same convergence criterion: specified tolerance of the residual
• Hardware configuration: Tyan board with
  – 2 CPUs: Intel Xeon X5650 @ 2.67 GHz
  – 8 GPUs: Tesla 2070 (6 GB each)
¹ SIMPLE: Semi-Implicit Method for Pressure-Linked Equations


Example results – Automotive: DrivAER
• Generic car shape model (DrivAER geometry)
• Incompressible flow: simpleFoam solver
  – SIMPLE¹ method for pressure-velocity coupling; the Poisson equation for the pressure yields the linear system solved by Culises
  – k-ω SST turbulence model
  – 2 computational grids: 3 million grid cells (sequential runs) and 22 million grid cells (parallel runs)
• Solver control (OpenFOAM®) via config files:
  – CPU version:
      solvers {
          p
          solver          PCG
          preconditioner  DIC
          tolerance       1e-6
          ...
      }
  – GPU version:
      solvers {
          p
          solver          PCGGPU
          preconditioner  AMG
          tolerance       1e-6
          ...
      }
¹ SIMPLE: Semi-Implicit Method for Pressure-Linked Equations


Example results – DrivAER, 3M grid cells
• Single CPU vs. single CPU+GPU
  – Converged solution (4000 timesteps)
  – Validation: comparison of results for DICPCG on the CPU and AMGPCG on the GPU
  (Figure: comparison of results, DICPCG on a single CPU vs. AMGPCG on a single CPU+GPU.)
• Memory requirement
  – AMGPCG: 40% of 6 GB used; 1M cells require 0.80 GB → Tesla 2070 fits 7.5M cells
  – DiagonalPCG: 13% of 6 GB used; 1M cells require 0.26 GB → Tesla 2070 fits 23M cells
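The per-solver memory figures above translate into a simple capacity estimate; the sketch below (using the slide's per-million-cell footprints as inputs) reproduces the 7.5M and 23M cell limits for a 6 GB Tesla 2070:

```cpp
#include <cstdio>

int main()
{
    const double gpu_mem_gb = 6.0;          // Tesla 2070
    const double gb_per_mcell_amg  = 0.80;  // AMGPCG footprint per 1M cells (slide value)
    const double gb_per_mcell_diag = 0.26;  // DiagonalPCG footprint per 1M cells (slide value)

    std::printf("AMGPCG:      max %.1fM cells\n", gpu_mem_gb / gb_per_mcell_amg);   // ~7.5M
    std::printf("DiagonalPCG: max %.1fM cells\n", gpu_mem_gb / gb_per_mcell_diag);  // ~23M
    return 0;
}
```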


Example results – DrivAER, 3M grid cells
• Speedup with a single GPU
  (f: fraction ported to the GPU, s = 1/((1-f)+f/a): total speedup, s_max = 1/(1-f): theoretical maximum speedup, a: GPU acceleration of the linear solver, E = s/s_max: efficiency)

  Solver CPU  | Solver GPU  |  f   |  s   | s_max |  a   |  E
  GAMG¹       | AMGPCG      | 0.55 | 1.56 | 2.22  | 3.36 | 68%
  DICPCG      | DiagonalPCG | 0.78 | 2.7  | 4.55  | 5.8  | 60%
  DiagonalPCG | DiagonalPCG | 0.87 | 4.9  | 7.7   | 11.6 | 64%

¹ GAMG: generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face areas


Example results – DrivAER, 3M grid cells
• Performance with multiple GPUs
• Strong scaling: multiple CPUs+GPUs (1-1 linkage)
  – Scaling of the total code versus the number of CPUs and GPUs
  – Scaling of the linear solver versus the number of CPUs and GPUs
(Plot: simulation time and scaling of the AMGPCG solver versus the number of CPUs = number of GPUs; curves for total time, linear-solver time, total scaling, and linear-solver scaling.)


Example results – DrivAER, 3M grid cells
• Speedup by adding multiple GPUs (1-1 linkage); each column compares n CPUs + n GPUs against n CPUs only

  Speedup of total code, s:
  Solver CPU vs. Solver GPU   | n = 1 | n = 2 | n = 4 | n = 6
  GAMG vs. AMGPCG             | 1.56  | 1.64  | 1.29  | 1.27
  DICPCG vs. DiagonalPCG      | 2.7   | 1.49  | 1.20  | 1.45
  DiagonalPCG vs. DiagonalPCG | 4.9   | 2.84  | 1.79  | 2.03

  Speedup of linear solver, a:
  Solver CPU vs. Solver GPU   | n = 1 | n = 2 | n = 4 | n = 6
  GAMG vs. AMGPCG             | 3.36  | 3.06  | 2.38  | 2.13
  DICPCG vs. DiagonalPCG      | 5.8   | 1.95  | 1.46  | 1.84
  DiagonalPCG vs. DiagonalPCG | 11.6  | 4.14  | 2.39  | 2.80

• Example: the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than on 2 CPUs only.


Example results – DrivAER, 22M grid cells
• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈ 90%)
(Plot: simulation time and scaling versus the number of CPUs = number of GPUs (3, 4, 6, 8); GAMG on CPUs only (dashed) vs. AMGPCG on CPUs+GPUs (solid), showing total time, linear-solver time, and the corresponding scalings.)


Example results – DrivAER, 22M grid cells
• Speedup by adding multiple GPUs (1-1 linkage): GAMG solver vs. AMGPCG solver

  # of CPUs                     | 3    | 4    | 6    | 8
  # of GPUs added               | +3   | +4   | +6   | +8
  Speedup s                     | 1.56 | 1.58 | 1.54 | 1.42
  Speedup linear solver a       | 3.4  | 2.81 | 2.91 | 2.33
  Fraction f                    | 0.60 | 0.59 | 0.57 | 0.50
  Theoretical max speedup s_max | 2.50 | 2.43 | 2.33 | 2.00
  Efficiency E                  | 62%  | 65%  | 66%  | 71%

• Utilization is not yet optimal; further optimization under development: n-1 linkage between CPU and GPU


Example results – Multiphase flow: ship hull
• LTSinterFoam solver
  – Steady, using a local time-stepping method
  – Volume-of-fluid (VoF) method
  – Pressure solver: linear system → Culises
• 4M grid cells

  Solver CPU  | Solver GPU  |  f   |  s   | s_max |  a   |  E
  DICPCG      | DiagonalPCG | 0.43 | 1.54 | 1.75  | 4.91 | 88%
  DiagonalPCG | DiagonalPCG | 0.55 | 2.12 | 2.22  | 8.66 | 95%


Example results – Heat transfer: heated room
• buoyantPimpleFoam solver
  – Unsteady PISO¹ method
  – Pressure solver: linear system → Culises
• 4M grid cells

  Solver CPU  | Solver GPU  |  f   |  s   | s_max |  a   |  E
  DICPCG      | DiagonalPCG | 0.72 | 2.45 | 3.57  | 6.11 | 69%
  DiagonalPCG | DiagonalPCG | 0.80 | 3.59 | 5.00  | 9.90 | 72%

¹ PISO: Pressure-Implicit with Splitting of Operators


Example results – Process industry: flow molding
• pisoFoam solver
  – Unsteady
  – Pressure solver: linear system → Culises
  – 500K grid cells

  Solver CPU  | Solver GPU  |  f   |  s   | s_max |  a    |  E
  DICPCG      | DiagonalPCG | 0.84 | 2.65 | 6.25  | 3.6   | 42%
  DiagonalPCG | DiagonalPCG | 0.94 | 6.9  | 16.7  | 10.4  | 42%


Example results – Pharmaceutical: generic bioreactor
• interFoam solver
  – Unsteady
  – VoF method
  – Pressure solver: linear system → Culises
• 500k grid cells
(Figure: liquid surface in the bioreactor; shaking device with off-centered spindle.)

  Solver CPU  | Solver GPU  |  f   |  s   | s_max |  a   |  E
  GAMG        | AMGPCG      | 0.53 | 1.44 | 2.12  | 2.59 | 68%
  DiagonalPCG | DiagonalPCG | 0.81 | 3.00 | 5.26  | 5.94 | 57%


Summary
• Speedup categorized by application
(Bar chart: for each application area – automotive, multiphase, heat transfer, pharmaceutics, process industry – the speedup, linear-solver acceleration, and efficiency relative to the OpenFOAM® baseline of 1; values obtained from (averaged) single-GPU test cases.)


Future Culises features – Under development
• Stand-alone multigrid solver
• Multi-GPU usage and scalability
  – Optimized load balancing via n-1 linkage between CPU and GPU
  – Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers, as sketched below
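As an illustration of what peer-to-peer exchange means at the CUDA level, the sketch below checks P2P capability between two devices and copies a halo buffer directly from GPU 0 to GPU 1 without staging through host memory. Device IDs and buffer size are placeholders; this is not Culises code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
    const int src_dev = 0, dst_dev = 1;          // placeholder device IDs
    const std::size_t halo_bytes = 1 << 20;      // placeholder halo-buffer size

    double *d_src = nullptr, *d_dst = nullptr;
    cudaSetDevice(src_dev);
    cudaMalloc(&d_src, halo_bytes);
    cudaSetDevice(dst_dev);
    cudaMalloc(&d_dst, halo_bytes);

    // Enable direct peer access if the PCIe topology allows it.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (can_access) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);
    }

    // Peer-to-peer copy over PCIe; CUDA falls back to a copy staged through
    // host memory if direct peer access is not available.
    cudaMemcpyPeer(d_dst, dst_dev, d_src, src_dev, halo_bytes);

    cudaFree(d_dst);
    cudaSetDevice(src_dev);
    cudaFree(d_src);
    return 0;
}
```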


Questions?
