CFD - GPU Technology Conference
Stan Posey<br />
NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com
Agenda: <strong>GPU</strong> Acceleration for Applied <strong>CFD</strong><br />
Overview of <strong>GPU</strong> Progress for <strong>CFD</strong><br />
<strong>GPU</strong> Acceleration of ANSYS Fluent<br />
<strong>GPU</strong> Acceleration of OpenFOAM<br />
2
<strong>GPU</strong> Progress Summary for <strong>GPU</strong>-Parallel <strong>CFD</strong><br />
<strong>GPU</strong> progress in <strong>CFD</strong> research continues to expand<br />
Growth from particle-based <strong>CFD</strong> and high-order methods<br />
Explicit schemes have generally seen more progress than implicit<br />
Strong <strong>GPU</strong> investments by commercial <strong>CFD</strong> vendors (ISVs)<br />
Breakthroughs in <strong>GPU</strong>-parallel linear solvers and preconditioners<br />
<strong>GPU</strong>s provide 2nd-level parallelism that preserves the costly MPI investment<br />
ISV focus on hybrid parallel <strong>CFD</strong> that utilizes all CPU cores + <strong>GPU</strong><br />
<strong>GPU</strong> progress for end-user-developed <strong>CFD</strong> with OpenACC<br />
Most benefit to aerospace companies with legacy Fortran<br />
<strong>GPU</strong>s behind fast growth in particle-based commercial <strong>CFD</strong><br />
New ISV developments in lattice Boltzmann (LBM) and SPH<br />
3
<strong>CFD</strong> Software Character and <strong>GPU</strong> Suitability<br />
Explicit schemes (usually compressible), structured grid FV:<br />
Numerical operations on an I,J,K stencil, no “solver”<br />
[Typically flat profiles: <strong>GPU</strong> strategy of directives (OpenACC)]<br />
Implicit schemes (usually incompressible), unstructured FV and FE:<br />
Sparse matrix linear algebra – iterative solvers<br />
[Hot spot ~50% of runtime, small % of LoC: <strong>GPU</strong> strategy of CUDA and libraries]<br />
4
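The explicit, structured-grid character above can be sketched in a few lines (an illustrative Python sketch, not slide content; the grid size, coefficient `c`, and function name are invented for the example). Every cell applies the same I,J,K neighbor update with no linear solve, which is why such loops map well onto GPU threads or OpenACC directives.

```python
# Illustrative sketch: one explicit update sweep over a structured I,J,K grid.
# Each cell reads only its six neighbors, so all cells can update in parallel.

def stencil_sweep(u, c=0.1):
    """One Jacobi-style diffusion update on an NI x NJ x NK structured grid."""
    ni, nj, nk = len(u), len(u[0]), len(u[0][0])
    new = [[[u[i][j][k] for k in range(nk)] for j in range(nj)] for i in range(ni)]
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            for k in range(1, nk - 1):
                lap = (u[i-1][j][k] + u[i+1][j][k] +
                       u[i][j-1][k] + u[i][j+1][k] +
                       u[i][j][k-1] + u[i][j][k+1] - 6.0 * u[i][j][k])
                new[i][j][k] = u[i][j][k] + c * lap
    return new

# A 4x4x4 grid with a hot spot in the middle diffuses outward:
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
grid[1][1][1] = 1.0
grid = stencil_sweep(grid)
```

The flat profile the slide mentions follows from this shape: the whole time is spent in one uniform triple loop, with no single small "hot spot" routine to extract.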
<strong>CFD</strong> Speedups for <strong>GPU</strong> Relative to 8-Core CPU<br />
Explicit (usually compressible), structured grid FV: ~10x<br />
Turbostream; SJTU RANS<br />
Explicit, unstructured FV and FE: ~5x<br />
Veloxi; SD++ (Stanford, Jameson); FEFLO (Lohner)<br />
Structured grid explicit is generally the best <strong>GPU</strong> fit<br />
5
Turbostream: <strong>CFD</strong> for Turbomachinery<br />
Source:<br />
http://www.turbostream-cfd.com/<br />
Sample Turbostream <strong>GPU</strong> Simulations<br />
Typical Routine Simulation<br />
Large-scale Simulation<br />
~19x Speedup<br />
6
Commercial Aircraft Wing Design on <strong>GPU</strong>s<br />
COMAC and SJTU<br />
Commercial Aircraft Corporation of China<br />
<strong>GPU</strong> Application<br />
SJTU-developed explicit <strong>CFD</strong> RANS for aerodynamic evaluation of wing shapes<br />
<strong>GPU</strong> Benefit<br />
Use of Tesla C2070: 37x vs. a single core of an Intel Core i7 CPU<br />
Faster simulations for more wing design candidates vs. costly wind tunnel tests<br />
Expanding to multi-<strong>GPU</strong> and full aircraft<br />
[Images: COMAC wing candidate; ONERA M6 wing <strong>CFD</strong> simulation]<br />
7
<strong>CFD</strong> Speedups for <strong>GPU</strong> Relative to 8-Core CPU<br />
Explicit (usually compressible), structured grid FV: ~15x<br />
Turbostream; SJTU RANS<br />
Explicit, unstructured FV and FE: ~5x<br />
Veloxi; SD++ (Stanford, Jameson); FEFLO (Lohner)<br />
Implicit (usually incompressible), unstructured FV and FE: ~2x<br />
Unstructured FV: ANSYS Fluent; Culises for OpenFOAM; SpeedIT for OpenFOAM; <strong>CFD</strong>-ACE+; FIRE<br />
Unstructured FE: Moldflow; AcuSolve; Moldex3D<br />
Commercial <strong>CFD</strong> is mostly unstructured implicit<br />
8
NVIDIA Strategy for <strong>GPU</strong>-Accelerated <strong>CFD</strong><br />
Strategic Alliances<br />
Business and technical alliances with key ISVs (ANSYS, CD-adapco, etc.)<br />
Invest in long-term technical collaboration for ANSYS Fluent acceleration<br />
Develop key technical collaborations with <strong>CFD</strong> research community:<br />
TiTech—Aoki, Stanford—Jameson, Oxford—Giles, Wyoming—Mavriplis, others<br />
Software Development<br />
NVIDIA linear solver toolkit with emphasis on AMG for industry <strong>CFD</strong><br />
Invest in relevant high-order methods (DGM, flux reconstruction, etc.)<br />
Applications Support<br />
Direct developer support for range of ISV and customer requests<br />
Implicit Schemes: Integration support of libraries and solver toolkit<br />
Explicit Schemes: Stencil libraries, OpenACC support for Fortran<br />
9
Primary Commercial CAE and <strong>GPU</strong> Progress<br />
ISV and Primary Applications (green color indicates CUDA-ready during 2013)<br />
ANSYS: ANSYS Mechanical; ANSYS Fluent; ANSYS HFSS<br />
DS SIMULIA: Abaqus/Standard; Abaqus/Explicit; Abaqus/<strong>CFD</strong><br />
MSC Software: MSC Nastran; Marc; Adams<br />
Altair: RADIOSS; AcuSolve<br />
CD-adapco: STAR-CD; STAR-CCM+<br />
Autodesk: AS Mechanical; Moldflow; AS <strong>CFD</strong><br />
ESI Group: PAM-CRASH imp; <strong>CFD</strong>-ACE+<br />
Siemens: NX Nastran<br />
LSTC: LS-DYNA; LS-DYNA <strong>CFD</strong><br />
Mentor: FloEFD; FloTherm<br />
Metacomp: <strong>CFD</strong>++<br />
10
Additional Commercial <strong>GPU</strong> Developments<br />
ISV Domain Location Primary Applications<br />
FluiDyna <strong>CFD</strong> Germany Culises for OpenFOAM; LBultra<br />
Vratis <strong>CFD</strong> Poland Speed-IT for OpenFOAM; ARAEL<br />
Prometech <strong>CFD</strong> Japan Particleworks<br />
Turbostream <strong>CFD</strong> England, UK Turbostream<br />
IMPETUS Explicit FEA Sweden AFEA<br />
AVL <strong>CFD</strong> Austria FIRE<br />
CoreTech <strong>CFD</strong> (molding) Taiwan Moldex3D<br />
Intes Implicit FEA Germany PERMAS<br />
Next Limit <strong>CFD</strong> Spain XFlow<br />
CPFD <strong>CFD</strong> USA BARRACUDA<br />
Flow Science <strong>CFD</strong> USA FLOW-3D<br />
SCSK Implicit FEA Japan ADVENTURECluster<br />
CDH Implicit FEA Germany AMLS; FastFRS<br />
FunctionBay MB Dynamics S. Korea RecurDyn<br />
Cradle Software <strong>CFD</strong> Japan SC/Tetra; scSTREAM<br />
11
Status Summary of ISVs and <strong>GPU</strong> Acceleration<br />
Every primary ISV has products available on <strong>GPU</strong>s or ongoing evaluation<br />
The 4 largest ISVs all have products based on <strong>GPU</strong>s, some at 3rd generation<br />
ANSYS SIMULIA MSC Software Altair<br />
4 of the top 5 ISV applications are available on <strong>GPU</strong>s today<br />
ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran, . . . LS-DYNA implicit only<br />
Several new ISVs were founded with <strong>GPU</strong>s as a primary competitive strategy<br />
Prometech, FluiDyna, Vratis, IMPETUS, Turbostream<br />
Open source <strong>CFD</strong> OpenFOAM available on <strong>GPU</strong>s today with many options<br />
Commercial options: FluiDyna, Vratis; Open source options: Cufflink, Symscape ofgpu, RAS, etc.<br />
12
Basics of <strong>GPU</strong> Computing for ISV Software<br />
ISV software use of <strong>GPU</strong> acceleration is user-transparent<br />
Jobs launch and complete without additional user steps<br />
User informs ISV application (GUI, command) that a <strong>GPU</strong> exists<br />
Schematic of a CPU with an attached <strong>GPU</strong> accelerator<br />
CPU begins/ends the job; <strong>GPU</strong> manages the heavy computations<br />
[Diagram: x86 CPU with cache and DDR memory, connected via PCI-Express to a <strong>GPU</strong> with GDDR memory; numbered arrows 1 - 4 mark the steps below]<br />
1. ISV job launched on CPU<br />
2. Solver operations sent to <strong>GPU</strong><br />
3. <strong>GPU</strong> sends results back to CPU<br />
4. ISV job completes on CPU<br />
13
Commercial <strong>CFD</strong> Focus on Sparse Solvers<br />
<strong>CFD</strong> application software flow:<br />
CPU: read input, matrix set-up<br />
<strong>GPU</strong>: implicit sparse matrix operations, 40% - 65% of profile time, small % of LoC<br />
(Hand-CUDA parallel; <strong>GPU</strong> libraries, CUBLAS; OpenACC directives)<br />
CPU: global solution, write output<br />
(Investigating OpenACC for more tasks on <strong>GPU</strong>)<br />
14
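The sparse matrix operations described above come down to a small kernel: a sparse matrix-vector product. Below is an illustrative Python sketch in CSR (compressed sparse row) storage, not vendor code; the matrix and names are invented for the example. It shows why the slide's point holds: a few lines of code account for most of the runtime, so offloading just this kernel to the GPU captures the hot spot.

```python
# Sketch of the hot-spot operation behind implicit CFD solvers:
# y = A @ x for a sparse matrix A stored in CSR form.

def csr_spmv(row_ptr, col_idx, vals, x):
    """Multiply a CSR sparse matrix by a dense vector."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of row r are visited.
        for nz in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[nz] * x[col_idx[nz]]
        y.append(acc)
    return y

# 3x3 tridiagonal test matrix [[2,-1,0],[-1,2,-1],[0,-1,2]]:
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
vals    = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
```

On a GPU, each row's dot product becomes an independent thread's work, which is the form CUDA libraries such as CUBLAS/cuSPARSE-style kernels exploit.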
NVIDIA Offers an Accelerated Solver Toolkit<br />
Toolkit of linear solvers, preconditioners, and other components for large sparse Ax=b<br />
Available schemes include:<br />
AMG – a multi-level scheme popular with several commercial <strong>CFD</strong> codes<br />
Jacobi, BiCGStab, FGMRES, MC-DILU, and others<br />
Use of NVIDIA linear solver toolkit for industry-ready <strong>CFD</strong>:<br />
ANSYS 14.5 collaboration introduced their AMG-<strong>GPU</strong> solver in Nov 2012<br />
FluiDyna collaboration on Culises 2.0 AMG solver library for OpenFOAM<br />
Other ISVs and customer <strong>CFD</strong> software undergoing evaluation . . .<br />
15
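The AMG scheme named above builds its grid hierarchy algebraically, but the multilevel idea is easiest to see geometrically. Below is a hedged sketch (not the NVIDIA toolkit; problem, function names, and parameters are invented) of a two-grid V-cycle for the 1D Poisson problem -u'' = f: smooth on the fine grid, restrict the residual, solve a coarse correction, prolong it back, and smooth again.

```python
import math

def jacobi(u, f, h, sweeps, w=0.8):
    """Weighted-Jacobi smoothing for -u'' = f with u[0] = u[-1] = 0."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = (1 - w) * u[i] + w * 0.5 * (u[i-1] + u[i+1] + h * h * f[i])
        u = new
    return u

def residual(u, f, h):
    """r = f - A u for the standard 3-point Laplacian."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2 * u[i] - u[i-1] - u[i+1]) / (h * h)
    return r

def two_grid_vcycle(u, f, h):
    u = jacobi(u, f, h, sweeps=3)                        # pre-smooth
    r = residual(u, f, h)
    rc = [r[2 * i] for i in range((len(u) + 1) // 2)]    # restrict by injection
    ec = jacobi([0.0] * len(rc), rc, 2 * h, sweeps=50)   # near-exact coarse solve
    for i in range(1, len(u) - 1):                       # prolong and correct
        u[i] += ec[i // 2] if i % 2 == 0 else 0.5 * (ec[i // 2] + ec[i // 2 + 1])
    return jacobi(u, f, h, sweeps=3)                     # post-smooth

# Solve -u'' = pi^2 sin(pi x) on [0,1], exact solution sin(pi x):
n, h = 9, 1.0 / 8.0
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n)]
u = [0.0] * n
for _ in range(8):
    u = two_grid_vcycle(u, f, h)
```

Smoothers like Jacobi kill high-frequency error cheaply; the coarse grid handles the smooth error that Jacobi stalls on. AMG applies the same cycle, but constructs the coarse levels from the matrix entries rather than from a mesh.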
<strong>GPU</strong> Developments for Aircraft <strong>CFD</strong><br />
External Aero<br />
Developer Location Software<br />
(Green color indicates <strong>GPU</strong>-ready during 2013)<br />
NASA USA OVERFLOW<br />
NASA USA FUN3D<br />
AFRL USA AVUS<br />
ONERA France elsA<br />
Stanford/Jameson USA SD++<br />
JAXA Japan UPACS<br />
ANSYS USA ANSYS Fluent 15.0<br />
CD-adapco USA/UK STAR-CCM+<br />
Metacomp USA <strong>CFD</strong>++<br />
Internal Flows<br />
ANSYS USA ANSYS Fluent 15.0<br />
FluiDyna Germany Culises for OpenFOAM 2.2.0<br />
Vratis Poland Speed-IT for OpenFOAM 2.2.0<br />
CD-adapco USA/UK STAR-CCM+<br />
16
<strong>GPU</strong> Developments for Turbine Engine <strong>CFD</strong><br />
Turbomachinery<br />
Developer Location Software<br />
(Green color indicates CUDA-ready during 2013)<br />
Turbostream England, UK Turbostream 3.0<br />
Oxford / Rolls Royce England, UK OP2 / Hydra<br />
ANSYS USA ANSYS <strong>CFD</strong> 15.0 (Fluent + CFX)<br />
Combustor<br />
ANSYS USA ANSYS Fluent 15.0<br />
FluiDyna Germany Culises for OpenFOAM 2.2.0<br />
Vratis Poland Speed-IT for OpenFOAM 2.2.0<br />
Cascade Technologies USA CHARLES<br />
Convergent Science USA Converge <strong>CFD</strong><br />
Sandia NL / Oak Ridge NL USA S3D<br />
Nozzle / Noise<br />
Naval Research Lab USA JENRE<br />
Aviadvigatel OJSC Russia GHOST <strong>CFD</strong><br />
17
<strong>GPU</strong> Status of Select Automotive CAE Software<br />
Automotive CAE Application Select CAE Software <strong>GPU</strong> Status<br />
CSM: Durability (Stress) and Fatigue MSC Nastran Available Today<br />
Road Handling and VPG Adams (for MBD) Evaluation<br />
Powertrain Stress Analysis Abaqus/Standard Available Today<br />
Body NVH MSC Nastran Available Today<br />
Crashworthiness and Safety LS-DYNA Implicit only, beta<br />
<strong>CFD</strong>: Aerodynamics / Thermal UH ANSYS Fluent Available Today, beta<br />
IC Engine Combustion STAR-CCM+ Evaluation<br />
Aerodynamics / HVAC OpenFOAM Available Today<br />
Plastic Mold Injection Moldflow Available Today<br />
18
<strong>GPU</strong> Progress Summary for <strong>GPU</strong>-Parallel <strong>CFD</strong><br />
<strong>GPU</strong> progress in <strong>CFD</strong> research continues to expand<br />
Growth from particle-based <strong>CFD</strong> and high-order methods<br />
Explicit schemes have generally seen more progress than implicit<br />
Strong <strong>GPU</strong> investments by commercial <strong>CFD</strong> vendors (ISVs)<br />
Breakthroughs in <strong>GPU</strong>-parallel linear solvers and preconditioners<br />
<strong>GPU</strong>s provide 2nd-level parallelism that preserves the costly MPI investment<br />
ISV focus on hybrid parallel <strong>CFD</strong> that utilizes all CPU cores + <strong>GPU</strong><br />
<strong>GPU</strong> progress for end-user-developed <strong>CFD</strong> with OpenACC<br />
Most benefit to aerospace companies with legacy Fortran<br />
<strong>GPU</strong>s behind fast growth in particle-based commercial <strong>CFD</strong><br />
New ISV developments in lattice Boltzmann (LBM) and SPH<br />
19
Particle-Based Commercial <strong>CFD</strong> Software Growing<br />
ISV Software Application Method <strong>GPU</strong> Status<br />
PowerFLOW Aerodynamics LBM Evaluation<br />
LBultra Aerodynamics LBM Available v2.0<br />
XFlow Aerodynamics LBM Evaluation<br />
Project Falcon Aerodynamics LBM Evaluation<br />
Particleworks Multiphase/FS MPS (~SPH) Available v3.5<br />
BARRACUDA Multiphase/FS MP-PIC In development<br />
EDEM Discrete phase DEM In development<br />
ANSYS Fluent–DDPM Multiphase/FS DEM In development<br />
STAR-CCM+ Multiphase/FS DEM Evaluation<br />
AFEA High impact SPH Available v2.0<br />
ESI High impact SPH, ALE In development<br />
LSTC High impact SPH, ALE Evaluation<br />
Altair High impact SPH, ALE Evaluation<br />
20
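The SPH entries in the table above all rest on one core operation: estimating field values by summing over neighbor particles with a smoothing kernel. Below is an illustrative Python sketch (not from any listed product; the 1D setup, particle spacing, and names are invented) of summation density with the standard cubic-spline kernel. Because each particle's sum is independent, this is the kind of loop that maps directly to GPU threads.

```python
# Sketch of the SPH building block: rho_i = sum_j m_j W(|x_i - x_j|, h).

def w_cubic(r, h):
    """1D cubic-spline SPH kernel (normalization 2/(3h)); support radius 2h."""
    q = r / h
    s = 2.0 / (3.0 * h)
    if q < 1.0:
        return s * (1.0 - 1.5 * q * q + 0.75 * q ** 3)
    if q < 2.0:
        return s * 0.25 * (2.0 - q) ** 3
    return 0.0

def sph_density(xs, mass, h):
    """Summation density at every particle (each sum is independent work)."""
    return [sum(mass * w_cubic(abs(xi - xj), h) for xj in xs) for xi in xs]

dx = 0.1
xs = [i * dx for i in range(21)]            # uniform row of particles
rho = sph_density(xs, mass=1.0 * dx, h=dx)  # mass chosen so rho = 1 in interior
```

Interior particles recover the reference density exactly on this uniform spacing, while particles near the ends show the usual boundary deficiency, a well-known SPH artifact.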
TiTech Aoki Lab LBM Solution of External Flows<br />
A Peta-scale LES (Large-Eddy Simulation) for Turbulent Flows<br />
Based on Lattice Boltzmann Method, Prof. Dr. Takayuki Aoki<br />
http://registration.gputechconf.com/quicklink/8Is4ClC<br />
www.sim.gsic.titech.ac.jp<br />
Aoki <strong>CFD</strong> solver using Lattice<br />
Boltzmann method (LBM) with<br />
Large Eddy Simulation (LES)<br />
21
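To make the LBM approach named above concrete, here is a minimal sketch of one lattice Boltzmann update (illustrative only, not the Aoki lab code): a 1D three-velocity (D1Q3) diffusion model with BGK collision on a periodic lattice. The weights and relaxation time are assumptions chosen for the example; production codes use 3D lattices, but the collide-then-stream structure is the same, and its locality is exactly what makes LBM so GPU-friendly.

```python
# Minimal D1Q3 lattice Boltzmann step: collide toward equilibrium, then stream.

W = [4.0 / 6.0, 1.0 / 6.0, 1.0 / 6.0]   # weights for velocities 0, +1, -1
TAU = 1.0                                # BGK relaxation time (illustrative)

def lbm_step(f):
    """One collide-and-stream update; f[k][x] are the distribution functions."""
    n = len(f[0])
    rho = [f[0][x] + f[1][x] + f[2][x] for x in range(n)]
    # Collide: relax each distribution toward its equilibrium W[k] * rho.
    post = [[f[k][x] + (W[k] * rho[x] - f[k][x]) / TAU for x in range(n)]
            for k in range(3)]
    # Stream: velocity 0 stays, +1 moves right, -1 moves left (periodic).
    return [post[0],
            [post[1][(x - 1) % n] for x in range(n)],
            [post[2][(x + 1) % n] for x in range(n)]]

# A unit mass pulse at the lattice center spreads diffusively:
n = 16
f = [[W[k] * (1.0 if x == n // 2 else 0.0) for x in range(n)] for k in range(3)]
for _ in range(10):
    f = lbm_step(f)
density = [f[0][x] + f[1][x] + f[2][x] for x in range(n)]
```

Collision is purely local per site and streaming is a fixed-pattern copy, so there is no global solve at all, the same property that lets codes like this scale to many GPUs.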
FluiDyna Lattice Boltzmann Solver LBultra<br />
http://www.fluidyna.com/content/lbultra<br />
www.fluidyna.de<br />
Spin-Off in 2006<br />
from TU Munich<br />
<strong>CFD</strong> solver using<br />
Lattice Boltzmann<br />
method (LBM)<br />
Demonstrated 25x speedup on a single <strong>GPU</strong><br />
Multi-<strong>GPU</strong> ready<br />
Contact FluiDyna<br />
for license details<br />
22
Prometech and Particleworks for Particle <strong>CFD</strong><br />
http://www.prometech.co.jp<br />
Oil Flow in<br />
HB Gearbox<br />
MPS-based method developed at the<br />
University of Tokyo [Prof. Koshizuka]<br />
Particleworks 3.0: <strong>GPU</strong> vs. 4-core Core i7<br />
Courtesy of Prometech Software and Particleworks <strong>CFD</strong> Software<br />
23
Agenda: <strong>GPU</strong> Acceleration for Applied <strong>CFD</strong><br />
Overview of <strong>GPU</strong> Progress for <strong>CFD</strong><br />
<strong>GPU</strong> Acceleration of ANSYS Fluent<br />
<strong>GPU</strong> Acceleration of OpenFOAM<br />
24
ANSYS and NVIDIA Technical Collaboration<br />
Release 13.0 (Dec 2010): ANSYS Mechanical: SMP, single <strong>GPU</strong>, sparse and PCG/JCG solvers; ANSYS EM: ANSYS Nexxim<br />
Release 14.0 (Dec 2011): ANSYS Mechanical: + Distributed ANSYS, + multi-node support; ANSYS Fluent: radiation heat transfer (beta); ANSYS EM: ANSYS Nexxim<br />
Release 14.5 (Nov 2012): ANSYS Mechanical: + multi-<strong>GPU</strong> support, + hybrid PCG, + Kepler <strong>GPU</strong> support; ANSYS Fluent: + radiation HT, + <strong>GPU</strong> AMG solver (beta), single <strong>GPU</strong>; ANSYS EM: ANSYS Nexxim<br />
Release 15.0 (Q4-2013): ANSYS Mechanical: + CUDA 5 Kepler tuning; ANSYS Fluent: + multi-<strong>GPU</strong> AMG solver, + CUDA 5 Kepler tuning; ANSYS EM: ANSYS Nexxim, ANSYS HFSS (Transient)<br />
25
ANSYS Fluent 14.5 and Radiation HT on <strong>GPU</strong><br />
VIEWFAC utility: use on CPUs, <strong>GPU</strong>s, or both, with ~2x speedup<br />
RAY TRACING utility: uses the OptiX library from NVIDIA with up to ~15x speedup (use on <strong>GPU</strong> only)<br />
Radiation HT applications:<br />
- Underhood cooling<br />
- Cabin comfort HVAC<br />
- Furnace simulations<br />
- Solar loads on buildings<br />
- Combustor in turbine<br />
- Electronics passive cooling<br />
26
ANSYS Fluent Use of NVIDIA Solver Toolkit<br />
ANSYS Fluent 15.0 will offer a <strong>GPU</strong>-based AMG solver (Nov/Dec 2013)<br />
Developed with support for MPI across multiple nodes and multiple <strong>GPU</strong>s<br />
Solver collaboration on pressure-based coupled Navier-Stokes, others to follow<br />
Early results published at Parallel <strong>CFD</strong> 2013, 20-24 May, Changsha, CN<br />
<strong>GPU</strong>-Accelerated Algebraic Multigrid for Applied <strong>CFD</strong><br />
27
ANSYS Fluent CPU Profile for Coupled Solver<br />
Non-linear iterations:<br />
Assemble linear system of equations: ~35% of runtime<br />
Solve linear system of equations Ax = b: ~65% of runtime (accelerate this first)<br />
Converged? If no, iterate again; if yes, stop<br />
28
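The profile above invites a quick Amdahl's-law estimate (a sketch; the ~35%/~65% split is taken from the slide, the function name is invented). Accelerating only the solver portion bounds the whole-job speedup, which is why the linear solve is accelerated first and why moving more tasks to the GPU matters later.

```python
# Amdahl's law applied to the coupled-solver profile: only the solver
# fraction of the runtime is accelerated, the assembly phase is not.

def overall_speedup(solver_fraction, solver_speedup):
    """Whole-job speedup when only the solver portion is accelerated."""
    return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)

speedup_5x = overall_speedup(0.65, 5.0)    # solver runs 5x faster on the GPU
ceiling = overall_speedup(0.65, 1e12)      # limit of an infinitely fast solver
```

With the solver at 65% of runtime, a 5x solver gives only about 2.1x on the job, and even an infinitely fast solver caps out near 1/0.35, roughly 2.9x. That gap between solver speedup and job speedup also shows up in the OpenFOAM results later in the deck.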
ANSYS Fluent 14.5 <strong>GPU</strong> Solver Convergence<br />
nvAMG Preview of ANSYS Fluent Convergence Behavior<br />
[Plot: error residuals from 1.0E+00 down to 1.0E-08 vs. iteration number (1 to 141) for continuity and X/Y/Z-momentum equations, NVAMG vs. FLUENT curves]<br />
Numerical results, Mar 2012: test for convergence at each iteration matches precise Fluent behavior<br />
Model FL5S1: incompressible, flow in a bend, 32K hex cells, coupled solver<br />
29
ANSYS Fluent 14.5 <strong>GPU</strong> Acceleration<br />
Preview of ANSYS Fluent 14.5 Performance – by ANSYS, Aug 2012<br />
[Chart: AMG solver time in seconds, lower is better; Helix model; dual-socket CPU vs. dual-socket CPU + Tesla C2075]<br />
2 x Xeon X5650, only 1 core used: 2832 sec<br />
2 x Xeon X5650, all 12 cores used: 933 sec<br />
CPU + Tesla C2075: 517 sec (5.5x vs. 1 core, 1.8x vs. 12 cores)<br />
Helix geometry: 1.2M tet cells, unsteady, laminar, coupled PBNS, DP<br />
AMG F-cycle on CPU; AMG V-cycle on <strong>GPU</strong><br />
NOTE: all jobs solver time only<br />
30
ANSYS Fluent with <strong>GPU</strong>-Based AMG Solver<br />
ANSYS Fluent 14.5 Performance – Results by NVIDIA, Nov 2012<br />
[Chart: AMG solver time per iteration in seconds, lower is better; airfoil and aircraft models with hexahedral cells; Tesla K20X vs. 2 x Core-i7 3930K, only 6 cores used]<br />
Airfoil (hex 784K): 2.4x; Aircraft (hex 1798K): 2.4x<br />
Solver settings:<br />
CPU Fluent solver: F-cycle, agg8, DILU, 0pre, 3post<br />
<strong>GPU</strong> nvAMG solver: V-cycle, agg8, MC-DILU, 0pre, 3post<br />
NOTE: times for solver only<br />
31
<strong>GPU</strong>s and Distributed Cluster Computing<br />
Geometry decomposed on the CPU into partitions 1 - 4; partitions put on independent cluster nodes N1 - N4; CPU distributed parallel processing<br />
Nodes run distributed parallel using MPI and combine for the global solution<br />
32
<strong>GPU</strong>s and Distributed Cluster Computing<br />
Geometry decomposed on the CPU into partitions; partitions put on independent cluster nodes N1 - N4; CPU distributed parallel processing<br />
Execution on CPU + <strong>GPU</strong>: each node N1 - N4 drives a <strong>GPU</strong> G1 - G4<br />
<strong>GPU</strong>s run shared-memory parallel using OpenMP under the distributed parallel level; nodes combine for the global solution<br />
33
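The decomposition pictured above can be sketched in miniature (an illustrative Python sketch, not vendor code; the 1D cell row, partition counts, and function names are invented). Each partition owns a contiguous block of cells and keeps halo copies of its neighbors' boundary cells, the values that MPI would exchange between nodes every step.

```python
# Toy domain decomposition: split cells across "nodes", then exchange halos.

def partition(cells, nparts):
    """Split a cell list into nparts contiguous, near-equal partitions."""
    size, rem = divmod(len(cells), nparts)
    parts, start = [], 0
    for p in range(nparts):
        stop = start + size + (1 if p < rem else 0)
        parts.append(cells[start:stop])
        start = stop
    return parts

def exchange_halos(parts):
    """Each partition receives its neighbors' edge cells (like MPI sendrecv)."""
    halos = []
    for p in range(len(parts)):
        left = parts[p - 1][-1] if p > 0 else None
        right = parts[p + 1][0] if p < len(parts) - 1 else None
        halos.append((left, right))
    return halos

parts = partition(list(range(10)), 4)   # 10 cells over 4 nodes
halos = exchange_halos(parts)
```

In the slide's hybrid scheme this MPI level stays untouched, and each partition's interior work is then offloaded to that node's GPU, which is how the existing MPI investment is preserved.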
ANSYS Fluent for 3.6M Cell Aerodynamic Case<br />
Multi-<strong>GPU</strong> acceleration of 16-core ANSYS Fluent 15.0 (preview), external aero<br />
2.9x solver speedup: Xeon E5-2667 + 4 x Tesla K20X <strong>GPU</strong>s<br />
CPU configuration: 16-core server node (2 x 8 cores)<br />
CPU + <strong>GPU</strong> configuration: the same node plus 4 <strong>GPU</strong>s (G1 - G4)<br />
34
ANSYS Fluent for 14M Cell Aerodynamic Case<br />
ANSYS Fluent 15.0 Preview Performance – Results by NVIDIA, Jun 2013<br />
[Chart: AMG solver time per iteration in seconds, lower is better; truck body model; Intel Xeon E5-2667, 2.90GHz vs. the same CPUs + Tesla K20X]<br />
1 x node, 2 CPUs (12 cores total): 69 sec<br />
2 x nodes, 4 CPUs (24 cores total), 8 <strong>GPU</strong>s (4 per node): 41 sec CPU-only, 12 sec with <strong>GPU</strong>s (3.5x)<br />
4 x nodes, 8 CPUs (48 cores total), 16 <strong>GPU</strong>s (4 per node): 28 sec CPU-only, 9 sec with <strong>GPU</strong>s (3.3x)<br />
Truck body model: 14M mixed cells, DES turbulence, coupled PBNS, SP<br />
Times for 1 iteration; AMG F-cycle on CPU; <strong>GPU</strong>: preconditioned FGMRES with AMG<br />
NOTE: all jobs solver time only<br />
35
Agenda: <strong>GPU</strong> Acceleration for Applied <strong>CFD</strong><br />
Overview of <strong>GPU</strong> Progress for <strong>CFD</strong><br />
<strong>GPU</strong> Acceleration of ANSYS Fluent<br />
<strong>GPU</strong> Acceleration of OpenFOAM<br />
36
2013: Further Expansion of OF Community<br />
ESI acquired Open<strong>CFD</strong> from SGI during Sep 2012<br />
IDAJ acquired a majority stake in ICON during May 2013<br />
This year there are 3 (up from 2) global OpenFOAM user events:<br />
APR 24 – 26, Frankfurt, DE: ESI OpenFOAM Users <strong>Conference</strong> (first ever)<br />
http://www.esi-group.com/corporate/events/2013/OpenFOAM2013<br />
Concentration on OpenFOAM from Open<strong>CFD</strong><br />
JUN 11 – 14, Jeju, KR: 8th International OpenFOAM Workshop (first in Asia)<br />
http://www.openfoamworkshop2013.org/<br />
Concentration on OpenFOAM-extend and Wikki<br />
OCT 24 – 25, Hamburg, DE: 7th Open Source <strong>CFD</strong> International <strong>Conference</strong> (ICON)<br />
http://www.opensourcecfd.com/conference2013/<br />
Concentration on both OpenFOAM and OpenFOAM-extend<br />
37
NVIDIA Market Strategy for OpenFOAM<br />
Provide technical support for commercial <strong>GPU</strong> solver developments<br />
FluiDyna Culises AMG solver library using NVIDIA toolkit<br />
Vratis Speed-IT library, development of CUSP-based AMG<br />
Alliances (but no development) with key OpenFOAM organizations<br />
ESI and Open<strong>CFD</strong> Foundation (H. Weller, M. Salari)<br />
Wikki and OpenFOAM-extend community (H. Jasak)<br />
IDAJ in Japan and ICON in the UK – support of both OF and OF-ext<br />
Conduct performance studies and customer benchmark evaluations<br />
Collaborations: developers, customers, OEMs (Dell, SGI, HP, etc.)<br />
38
Culises: <strong>CFD</strong> Solver Library for OpenFOAM<br />
Culises easy-to-use AMG-PCG solver:<br />
#1. Download and license from http://www.FluiDyna.de<br />
#2. Automatic installation with FluiDyna-provided script<br />
#3. Activate Culises and <strong>GPU</strong>s with 2 edits to the config-file<br />
[Screenshots: config-file CPU-only; config-file CPU+<strong>GPU</strong>]<br />
FluiDyna: TU Munich spin-off from 2006<br />
Culises provides a linear solver library<br />
Culises requires only two edits to the control file of OpenFOAM<br />
Multi-<strong>GPU</strong> ready<br />
Contact FluiDyna for license details<br />
www.fluidyna.de<br />
39
Culises Coupling to OpenFOAM<br />
Culises coupling is user-transparent<br />
www.fluidyna.de<br />
40
OpenFOAM Speedups Based on <strong>CFD</strong> Application<br />
<strong>GPU</strong> speedups for different industry cases:<br />
Range of model sizes and different solver schemes (Krylov, AMG-PCG, etc.)<br />
Automotive: 1.6x; Multiphase: 1.9x; Thermal: 3.0x; Pharma <strong>CFD</strong>: 2.2x; Process <strong>CFD</strong>: 4.7x<br />
[Chart compares job speedup, solver speedup, and OpenFOAM CPU-only efficiency per case]<br />
www.fluidyna.de<br />
41
FluiDyna Culises: <strong>CFD</strong> Solver for OpenFOAM<br />
Culises: A Library for Accelerated <strong>CFD</strong> on Hybrid <strong>GPU</strong>-CPU Systems<br />
Dr. Bjoern Landmann, FluiDyna<br />
developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0293-GTC2012-Culises-Hybrid-<strong>GPU</strong>.pdf<br />
DrivAer: joint car body shape by BMW and Audi<br />
http://www.aer.mw.tum.de/en/research-groups/automotive/drivaer<br />
Mesh size (all on 2 CPUs): 9M / 18M / 36M cells<br />
Added <strong>GPU</strong>s: +1 <strong>GPU</strong> / +2 <strong>GPU</strong>s / +4 <strong>GPU</strong>s<br />
Solver speedup: 2.5x / 4.2x / 6.9x<br />
Job speedup: 1.36x / 1.52x / 1.67x<br />
Solver speedup of 7x for 2 CPU + 4 <strong>GPU</strong>: 36M cells (mixed type); GAMG on CPU; AMGPCG on <strong>GPU</strong><br />
www.fluidyna.de<br />
42
Conclusions For Applied <strong>CFD</strong> on <strong>GPU</strong>s<br />
<strong>GPU</strong>s provide significant speedups for solver-intensive jobs<br />
Improved product quality with higher fidelity modeling<br />
Shorten product engineering cycles with faster simulation turnaround<br />
Simulations recently considered impractical now possible<br />
Unsteady RANS, Large Eddy Simulation (LES) practical in cost and time<br />
Effective parameter optimization from large increase in number of jobs<br />
43
Stan Posey<br />
NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com