Lagrangian Particle Tracking on a GPU

Author: Nils Brünggel
Advisors: Prof. Jörg Hofstetter, Prof. Dr. Josef Bürgler

July 15, 2011


Contents

1 Introduction
  1.1 Motivation
  1.2 Project Goal
2 OpenFOAM
  2.1 Introduction
  2.2 Mesh Representation
  2.3 The Lagrangian Framework
  2.4 Parallelization and Support for GPUs
3 The Particle Tracking Algorithm
  3.1 Basic Particle Tracking Algorithm
  3.2 Modified Tracking Algorithm
  3.3 Implementation in OpenFOAM (solidParticle Library)
4 GPU Computing
  4.1 Introduction
  4.2 Architecture of Modern GPUs
  4.3 CUDA
5 Software Development
  5.1 Software Design
  5.2 Complete Particle Engine
  5.3 Data Reduction
  5.4 Tools and Validation
6 Test Cases and Results
  6.1 Tunnel
  6.2 Torus
  6.3 Measurements
  6.4 Performance Analysis of Computational Kernels
  6.5 Memory Requirements
7 Conclusion and Future Work
8 Summaries of Related Works
  8.1 Nvidia's Particles Demo
  8.2 The Lagrangian Particles in the EM Driven Turbulent Flow with Fine Mesh
  8.3 Complex Chemistry Modeling of Diesel Spray Combustion
  8.4 Porting Large Fortran Codebases to GPUs
9 Appendix
  9.1 Assertions on the GPU
  9.2 Curriculum Vitae
  9.3 Project Definition


List of Figures

1  Class diagram outlining the solidParticle library.
2  Illustration of circular dependencies in OpenFOAM using templates and inheritance.
3  Example situation in two dimensions illustrating the particle tracking algorithm.
4  Two-dimensional illustration of a particle trajectory through multiple cells.
5  Nvidia's Fermi architecture, taken from [25].
6  Cell centres arranged in memory.
7  Mesh data layout in memory.
8  Owner and neighbour data of the particle tunnel case.
9  Simplified class diagram of the GPU tracking library.
10 Complete particle tracking sequence.
11 Reducing the particle labels array to those which still need tracking.
12 Width plot of the GPU time spent for calculating lambdas; the CPU is used for data reduction.
13 Particle tunnel with randomly distributed particles.
14 A torus defined by surface revolution of a circle.
15 A tetrahedral mesh generated in a torus.
16 Plot showing the execution time for a growing number of particles.
17 Width plot showing the time of kernel functions and memcpy operations.
18 Width plot showing the actual time spent on the GPU.


Declaration of Independent Work

I hereby declare that I wrote this thesis independently and used no resources other than those stated. All text excerpts, quotations and content by other authors are explicitly marked as such.

Horw, 15 July 2011
Nils Brünggel


Acknowledgments

I would like to thank the many people who helped me complete this thesis, especially my advisor Josef Bürgler, who took the time to discuss the many problems I ran into during this work and helped me develop solutions for them. Brahim Aakti taught me many things about fluid dynamics that I never learned as a computer science student. Thomas Koller helped me better understand C++ and reviewed my software design. Luca Mangani discussed the implementation in OpenFOAM with me. And thanks to all the people who read this thesis (or parts of it) and gave feedback: Antoine Hauck, Kym Brünggel, Marcel Brünggel and David Roos Launchbury.


Abstract

In addition to the most common numerical method, the finite volume method, particles are often used in complex fluid simulations. For example, when simulating combustion in a Diesel engine the droplets are modeled as particles while the fluid is calculated on a discrete mesh. The two phases are called Eulerian and Lagrangian. While the Lagrangian phase does not need a mesh itself, coupling it with the Eulerian phase requires knowing, for each time step, in which cell each particle resides. With unstructured meshes it is not trivial to find a particle's occupancy cell given its position, so a sophisticated particle tracking algorithm was developed.

When using particles it is usually desirable to use a large number of them, because this leads to better results. The aim of this project was to track particles more efficiently. Because the particle tracking algorithm tracks each particle individually, it is suitable for massively parallel computing. A graphics processing unit (GPU), which can run thousands of threads concurrently, was used.

Using an existing computational fluid dynamics (CFD) package, OpenFOAM, a simple solver was developed in which particles are dragged by a given velocity field through the mesh. This work involved converting the mesh and particle data into structures suitable for parallel processing as well as porting the algorithm to the GPU. Testing and validating the code was the biggest part: because of the lack of object-oriented constructs for code on the GPU, the data was stored in simple arrays, making it necessary to calculate array indices in the program. This led to rather error-prone code. To ensure correctness, the intermediate results of the sequential implementation were extracted and compared to the results of the new parallel implementation. Finally, a switch was added to the solver which causes a mesh search after each time step: for each particle a mesh search is executed for its end position, revealing the actual occupancy cell. This is then compared with the occupancy calculated by the newly developed particle tracking algorithm. With this switch turned on, one can ensure that the occupancy cells of all particles are correct; this, of course, makes the solver extremely slow.

Comparing the computation time for the tracking algorithm revealed a huge speedup. The GPU implementation is around 30 times faster than the sequential implementation in OpenFOAM. Because it is necessary to copy all the data over a slow bus to the GPU, the practical execution time is slower, but still much faster than the sequential implementation.


1 Introduction

1.1 Motivation

In computational fluid dynamics (CFD) the finite volume method (FVM) is most commonly used to solve the partial differential equations (PDE) which describe the fluid flow. In order to discretize the simulation domain a mesh is constructed which covers the whole domain. The equations are then evaluated at the centroid of each cell, and the results, such as velocity, pressure, temperature, etc., are stored for each cell. In addition to the FVM, some parts of a simulation may be modeled as particles, such as Diesel droplets in an engine [24]. The gaseous phase, represented in the mesh, is called Eulerian, while the liquid phase, represented by particles, is called Lagrangian. These are different ways of looking at the flow field. In the Lagrangian view the observer follows an individual particle as it moves through space and time; this can be visualized as sitting in a boat and drifting down a river. The Eulerian specification of the flow field focuses on specific locations in space through which the fluid flows as time passes; this can be visualized by sitting on the bank of a river and watching the water pass the fixed location [8]. There are different levels of coupling between these two phases: if the Lagrangian particles are merely dragged by the Eulerian phase, this is referred to as one-way coupling. If the Lagrangian particles also influence the Eulerian phase, it is called two-way coupling. The mesh plays an important role when the two phases are coupled: the Eulerian equations are evaluated at the centroid of every cell. The Lagrangian particles do not need a mesh themselves, but the coupling requires knowing in which cell a particle resides so that the coupling terms can be added to the Eulerian phase. More specifically, we need to know for every time step through which cells the particle travels and how much time it spends in each cell.

A distinction is made between structured and unstructured meshes. Structured refers to the way the mesh data is stored in memory: a direct mapping between the connections in the mesh and the addresses of the data exists. This structure restricts the shape of the cells in the mesh to hexahedra. Unstructured meshes, on the other hand, allow cells with any number of faces; this makes them more flexible but also increases the amount of work a program has to do, for example to access a cell next to a given cell. Structured meshes can be transformed into a uniform cartesian grid [4]. It is therefore trivial to find the cell in which a particle resides, given its position. In an unstructured mesh this is no longer possible and requires a sophisticated algorithm, which is explained in section 3. OpenFOAM comes with support for unstructured meshes because this makes it simpler to automatically mesh complex geometries from a CAD system, and it simplifies the import of meshes from different meshing tools [19].

1.2 Project Goal

Particle tracking requires the execution of simple vector calculations for a large number of particles. This can be done more efficiently by using a massively parallel processor such as a GPU (see section 4). Based on the existing OpenFOAM code, the goal is to develop a solver which moves particles using the velocity field from the mesh. The solver will use the GPU to execute the computations required for the particle tracking algorithm in a massively parallel way, which should lead to a large speedup compared to the existing single-threaded implementation in OpenFOAM. The original project definition can be found in the appendix, section 9.3.


2 OpenFOAM

2.1 Introduction

OpenFOAM is an open source framework for computational fluid dynamics. It is entirely written in C++ and licensed under the GNU General Public License (GPL). OpenFOAM comes with basic meshing tools, various solvers and utilities to import meshes and export data for post-processing. As opposed to many commercial packages, the OpenFOAM tools do not provide a graphical user interface. Instead, the user edits configuration files within an OpenFOAM case and then runs the solver in a UNIX-like fashion on the command line. One of the strengths of OpenFOAM is that custom solvers can be written or modified quickly because of its modular design and the usage of advanced C++ features.

2.2 Mesh Representation

OpenFOAM supports unstructured polyhedral meshes with any number of faces. The mesh is described in the official Users Guide [6] (chapter 5), the Programmers Guide [5] (chapter 2.3) as well as in Jasak's PhD thesis [18] (chapter 3.2). Here I will recall a short summary. OpenFOAM stores the mesh in the following files (located in constant/polyMesh in an OpenFOAM case directory), which are generated by mesh utilities such as blockMesh.

• points: Contains a list of coordinates of vertices. OpenFOAM implicitly indexes the points starting from 0 up to the number of points minus one.

• faces: Contains a list of faces. The point indices defined in the points file are used to reference the vertices. A distinction is made between internal faces and boundary faces: internal faces lie between two cells and always have an owner and a neighbour cell assigned, while boundary faces lie on the boundary of the computational domain and only have an owner cell assigned (see below). Furthermore, a face normal is defined such that it points out of the owner cell. In case the face is a boundary face, the normal points out of the computational domain. The order of the vertices of a face is defined such that, if the normal vector points towards the viewer, the vertices are arranged in an anti-clockwise path. The faces are also implicitly indexed, starting with 0 for the first face. It is common that the faces inside the simulation domain, called inner faces, have a lower face label than those on the boundary, called boundary faces. OpenFOAM lets the user define various boundary conditions by referring to a set of boundary faces.

• owner: Each face is assigned an owner cell; in case of an internal face this is the cell with the lower label of the two cells adjacent to the face. The face normal can also be defined as pointing from the cell with the lower label to the cell with the higher label.

• neighbour: Internal faces also have a neighbour cell. Since an internal face is always between two cells, the cell that is not the owner is the neighbour cell (which is the cell with the higher label). Faces lying on the boundary have no neighbour cell and are omitted in the neighbour list. Since some faces always lie on the boundary, this list has fewer items than the owner list. A short sketch after this list illustrates how the two lists are used together.

• boundary: The boundary file contains a list of patches which refer to a set of boundary faces.
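
To make the owner/neighbour convention concrete, here is a minimal C++ sketch (the struct and function names are chosen for illustration and are not part of OpenFOAM) of how the two lists can be used to find the cell on the other side of a face:

#include <vector>
#include <cstddef>

// The owner list has one entry per face, the neighbour list one entry per
// internal face only (boundary faces have no neighbour). This assumes, as is
// common in OpenFOAM meshes, that the internal faces are numbered first.
struct FaceAdjacency {
    std::vector<int> owner;
    std::vector<int> neighbour;

    bool isInternal(std::size_t face) const {
        return face < neighbour.size();
    }

    // For an internal face, return the cell on the other side of the face.
    int otherCell(std::size_t face, int cell) const {
        return (owner[face] == cell) ? neighbour[face] : owner[face];
    }
};
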
2.3 The Lagrangian Framework

OpenFOAM includes a Lagrangian framework which was originally developed to simulate Diesel droplets [24]. The simplest Lagrangian library allows the simulation of solid particles which are moved by the Eulerian field. The work carried out here is based on the solidParticle library.


Figure 1: Class diagram outlining the solidParticle library. (The diagram shows the Particle class template with members position_, celli_, facei_ and stepFraction_ and methods readFields(), writeFields(), write(), position(), cell() and trackToFace(); IDLList, an intrusively doubly-linked list; Cloud, a collection of particles with a move(TrackingData& td) method; IOPosition, a helper IO class to read and write particle positions; and the derived solidParticle and solidParticleCloud classes. solidParticle is a simple solid spherical particle class with one-way coupling with the continuous phase, with members d_ and U_ and methods move(trackData&) and hitWallPatch(const wallPolyPatch&, int&).)

Figure 1 shows the most important classes belonging to the solidParticle library. Note that, for the sake of simplicity, not all class members are shown. The particles are stored in a doubly-linked list. The particle tracking algorithm is implemented in a flexible way: the most basic functionality is in the Particle class via the trackToFace() function. Functionality specific to a certain particle type, such as the drag model, is implemented in the solidParticle class. The solidParticle class provides the move() method, which is called from the solidParticleCloud class for every particle. Using the specific drag model, the destination of a particle is estimated; trackToFace() is then called with the previously calculated destination.

Usage of curiously recurring template patterns. When looking at the classes of the solidParticle library, a remarkable circular dependency between base and derived classes can be found, for example between the classes Particle and solidParticle in the Lagrangian framework. solidParticle is derived from the Particle class template, which takes the type of the derived class as template argument.

Figure 2: Illustration of circular dependencies in OpenFOAM using templates and inheritance. (The diagram shows Particle and solidParticle referencing each other.)

The dependency goes in one direction via inheritance and in the other direction via templates. This pattern was named "Curiously Recurring Template Pattern" (CRTP) by James Coplien in 1995 [14].


It can be used to implement static polymorphism without the need for virtual functions, as illustrated in the example below.

#include <cstdio>

template <typename DerivedType>
struct A {
    void doSomething() {
        static_cast<DerivedType&>(*this).doSomethingElse();
    }
};

struct B : public A<B> {
    void doSomethingElse() {
        printf("Hello from B!\n");
    }
};

struct C : public A<C> {
    void doSomethingElse() {
        printf("Hello from C!\n");
    }
};

int main() {
    B b;
    b.doSomething();
    C c;
    c.doSomething();
    return 0;
}

// Output:
// Hello from B!
// Hello from C!

As mentioned before, this idiom is used in the Lagrangian classes, which can be found in the OpenFOAM source tree in the directory src/lagrangian. The Particle class is declared as follows:

template <class ParticleType>
class Particle : public IDLList<ParticleType>::link {
    // ..
};

And the solidParticle class as follows:

class solidParticle : public Particle<solidParticle> {
    // ..
};


The Particle class can now call members which are to be defined in the derived class by using a static cast on the this-pointer:

static_cast<ParticleType&>(*this)

Using a reference to the derived type, it can also be checked whether the particle is of a special type, as done on line 299 of the file Particle.C:

// ..
else if (static_cast<ParticleType&>(*this).softImpact())
{
    // ..

This is just one example in which this idiom is used; there are many more within the code of OpenFOAM.

2.4 Parallelization and Support for GPUs

The OpenFOAM code is mostly sequential C++ code. Larger simulations can be decomposed into several domains, and multiple instances of a solver can be launched which communicate using the message passing interface (MPI) [10]. While OpenFOAM currently does not support fine-grained parallelism, it has recently been shown that OpenFOAM could benefit from a hybrid parallelization model where OpenMP is used in addition to MPI for the fine-grained parallelization on each host [20].

There is officially no support for GPU computing in OpenFOAM; however, various third-party solutions exist. The SpeedIT tools [21] make it possible to use OpenFOAM with CUBLAS, Nvidia's implementation of BLAS (Basic Linear Algebra Subprograms). OFGPU [32] runs the preconditioned conjugate gradient solver for symmetric matrices (PCG) and the preconditioned biconjugate gradient solver for asymmetric matrices (PBiCG) on the GPU by using the CUSP library [7], a library for sparse linear algebra and graph computations on the GPU.


3 The Particle Tracking Algorithm

The particle tracking algorithm used in OpenFOAM was published in [22]. Since it is crucial for the optimization, it is presented here.

3.1 Basic Particle Tracking Algorithm

Figure 3: Example situation in two dimensions illustrating the particle tracking algorithm. (The figure shows a particle travelling from a to b across cells a, c and b, crossing faces at p and p'; C_f and C_c mark a face centre and the cell centre, S_f the face normal, and the faces of the starting cell are labelled 0 to 3.)

For the Lagrangian-Eulerian coupling it is required to know, for every time step, which cells a particle crosses and how much time it spent there. Figure 3 illustrates the particle tracking algorithm during a time step. The particle is initially located at position a and moves to b. While traveling along the straight line from a to b it changes the cell twice, at p and p'.

The point where the particle hits the face is calculated by the following equation:

    p = a + \lambda_a \cdot (b - a)    (1)

λ_a denotes the fraction of the path vector which the particle travels until it hits the first face (the distance from a to p). In this equation λ_a and p are unknown; we only know the start and end position of the particle (a and b). The vector from C_f to p along the face is orthogonal to the face normal. This leads to another equation, which can be used to calculate p:

    (p - C_f) \bullet S_f = 0    (2)

In the equation above we used the fact that the dot product of two orthogonal vectors is zero. Substituting equation (1) into equation (2) and solving for λ_a gives:

    \lambda_a = \frac{(C_f - a) \bullet S_f}{(b - a) \bullet S_f}    (3)

Since p no longer appears in equation (3), we can calculate λ_a for each face of the respective cell. The face with the smallest λ_a in [0, 1] is the face which is crossed by the particle. The particle is then moved along the line from a to b onto the face hit, i.e. p = a + λ_a · (b − a). The particle's occupancy information is updated to the cell on the other side of the face. The next tracking event works in the same way as the one just presented. This is repeated until the whole time step is processed.
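As a small illustration of equation (3), the following C++ sketch (data types and names are chosen here for illustration, not taken from the thesis code) computes λ_a for every face of the current cell and returns the face with the smallest value in [0, 1]:

#include <vector>
#include <cmath>
#include <cstddef>

struct Vec3 { double x, y, z; };

static double dot(const Vec3& u, const Vec3& v) {
    return u.x * v.x + u.y * v.y + u.z * v.z;
}
static Vec3 sub(const Vec3& u, const Vec3& v) {
    return Vec3{u.x - v.x, u.y - v.y, u.z - v.z};
}

// Among the faces of the current cell, find the face with the smallest
// lambda_a in [0, 1] (equation (3)). Returns -1 if no face is crossed,
// i.e. the end position b lies in the same cell as a.
int findHitFace(const Vec3& a, const Vec3& b,
                const std::vector<Vec3>& faceCentres,   // C_f for each face of the cell
                const std::vector<Vec3>& faceNormals)   // S_f for each face of the cell
{
    int hit = -1;
    double best = 1.0;
    for (std::size_t f = 0; f < faceCentres.size(); ++f) {
        const double denom = dot(sub(b, a), faceNormals[f]);
        if (std::fabs(denom) < 1e-300) continue;        // path parallel to the face plane
        const double la = dot(sub(faceCentres[f], a), faceNormals[f]) / denom;
        if (la >= 0.0 && la <= best) { best = la; hit = static_cast<int>(f); }
    }
    return hit;
}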


If λ_a is not in the interval [0, 1] for any face, then the particle's end position b must lie inside the same cell.

3.2 Modified Tracking Algorithm

If a face is defined by more than three vertices, then these do not necessarily lie in a plane. In case a face is non-planar, the mesh stores the face centroid and interpolates a normal vector which represents the effective plane of the face. When using the effective planes as cell faces, the cells in a mesh are no longer space-filling! It is therefore possible to lose track of a particle when it crosses a face close to a vertex. The deficiencies of the basic particle tracking algorithm are further described in [22, page 267]. While these deficiencies seem to appear only in rather large, non-standard cases, it should be mentioned that the modified tracking algorithm also solves a simple implementation problem of the basic algorithm: naively moving the particle onto the face hit and then evaluating the formula for λ_a again for the new occupancy cell may yield the same face again, depending on how the dice of floating-point accuracy roll. The problem can be solved by moving the particle a little bit further into the next cell using an ε-environment or by disabling the face in the subsequent calculation. (In the author's experience the first solution, using an ε-environment, did not work in all cases; it was necessary to disable the face hit for the next calculation.) There is no such problem with the modified tracking algorithm, since the cell centre, and not the particle's actual position, is taken as reference point inside the cell in order to obtain the list of faces which might be hit by the particle.

With reference to figure 3, instead of taking the starting point of the particle, the cell centre is taken as reference point inside the cell to determine which faces the particle may cross (if any). Replacing a with C_c in equation (3) gives:

    \lambda_c = \frac{(C_f - C_c) \bullet S_f}{(b - C_c) \bullet S_f}    (4)

The line from C_c to b (shown dashed in figure 3) crosses faces 1 and 0 (or, more precisely, the plane which is defined by the face normal of face 0). Equation (4) therefore yields λ_c between 0 and 1 for face 0 and face 1. If there is no face with λ_c in [0, 1], then b must be in the same cell as a. Otherwise it is also necessary to calculate λ_a using equation (3) for the faces which are crossed by the line from C_c to b (faces 0 and 1 in this case). The lowest value of λ_a then determines which face was actually hit, and the fraction of the time step which it took to travel to the face can be determined from λ_a. The cell occupancy is changed to the neighbouring cell of the face which was crossed (c in this case). With λ_m = min(1, max(0, λ_a)), equation (5) is then used to move the particle to p.

    p = a + \lambda_m \cdot (b - a)    (5)

The complete tracking algorithm is shown below as pseudocode, taken from [22] (algorithm 1).

Algorithm 1: Complete tracking algorithm

    while the particle has not yet reached its end position at b do
        find the set of faces F_i for which 0 <= λ_c <= 1
        if size of F_i = 0 then
            move the particle to the end position
        else
            find the face f in F_i for which λ_a is smallest
            move the particle according to equation (5) using this value of λ_a
            set the particle's cell occupancy to the neighbouring cell of face f
        end if
    end while
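Putting equations (3) to (5) together, the control flow of algorithm 1 for a single particle might be sketched as follows. The Particle type and the helpers reachedEndPosition(), faces(), lambdaC(), lambdaA(), moveTowards() and otherCell() are assumed to exist and to wrap the mesh data and the equations above; this illustrates the loop structure only, not the thesis implementation:

#include <vector>
#include <algorithm>

struct Particle;                                   // holds position and end position b

// Assumed helpers wrapping the mesh data and equations (3) to (5):
bool reachedEndPosition(const Particle& p);        // true once the particle sits at b
std::vector<int> faces(int cell);                  // labels of the faces bounding a cell
double lambdaC(int cell, int face);                // equation (4) for one face
double lambdaA(const Particle& p, int face);       // equation (3) for one face
void moveTowards(Particle& p, double lambdaM);     // equation (5)
int otherCell(int face, int cell);                 // cell on the other side of a face

void trackParticle(Particle& p, int& cell) {
    while (!reachedEndPosition(p)) {
        // Faces of the current cell whose plane is crossed by the line from C_c to b.
        std::vector<int> candidates;
        for (int f : faces(cell)) {
            const double lc = lambdaC(cell, f);
            if (lc >= 0.0 && lc <= 1.0) candidates.push_back(f);
        }
        if (candidates.empty()) {
            moveTowards(p, 1.0);                   // b lies in the current cell
        } else {
            // The candidate face with the smallest lambda_a is the face actually hit.
            int hit = candidates[0];
            double la = lambdaA(p, hit);
            for (int f : candidates) {
                const double l = lambdaA(p, f);
                if (l < la) { la = l; hit = f; }
            }
            moveTowards(p, std::min(1.0, std::max(0.0, la)));   // equation (5)
            cell = otherCell(hit, cell);           // occupancy moves to the neighbouring cell
        }
    }
}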


3.3 Implementation in OpenFOAM (solidParticle Library)

The actual implementation of the particle tracking algorithm is more complicated. In the example used to explain the particle tracking algorithm (figure 3), the velocity of the particle was assumed to be constant during the whole tracking step. In the solidParticle library the velocity of the particle is coupled to the velocity of the fluid: a drag model is used which calculates the drag exerted by the fluid phase on the particle, assuming spherical particles. In other words, the velocity of the particle depends on the velocity of the cell in which it resides, and therefore the velocity changes once the particle changes the cell. It is therefore required to distinguish between Eulerian and Lagrangian time steps.

Figure 4: Two-dimensional illustration of a particle trajectory through multiple cells. (The particle starts at a; the successively re-estimated end positions are marked b_1, b_2, b_3 and the final end position b.)

Figure 4 shows a particle which starts at the beginning of a time step at position a. Its end position is estimated using the velocity field in the cell in which point a lies; it is marked as b_1 in figure 4. After changing the cell the first time, the end position is re-estimated using the velocity field in the new cell. This way the Eulerian time step is broken up into several smaller Lagrangian steps (5 in this example). The thick red line shows the actual trajectory of the particle. The light-red arrows show the estimated end positions at the corresponding sub-steps.

The drag model used to calculate the velocity of the particles in the solidParticle library takes only drag and buoyancy forces into account.

    \frac{dU_p}{dt} = D_c \cdot |U_c - U_p| + \left(1 - \frac{\rho_c}{\rho_p}\right) \cdot g    (6)

Here U_p is the velocity of the particle, U_c the velocity of the cell in which the particle resides, and ρ_p and ρ_c are the densities of the particle and the fluid respectively. The drag coefficient D_c is calculated using the Schiller-Naumann approximation [30].

    D_c = \frac{18}{d^2} \cdot \nu_c \cdot \frac{\rho_c}{\rho_p} \cdot \left(1 + 0.15 \cdot Re_p^{0.687}\right)    (7)

Here d is the diameter of the particle. The Reynolds number is calculated using the diameter d as the characteristic length.

    Re_p = \frac{d}{\nu_c} \cdot |U_c - U_p|    (8)
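As a small numeric illustration of equations (7) and (8) (a sketch; the function and variable names are chosen here and do not come from the solidParticle source):

#include <cmath>

// Equation (8): particle Reynolds number, using the particle diameter d as the
// characteristic length; magUr stands for |U_c - U_p|, the magnitude of the
// relative velocity between fluid and particle.
double particleReynolds(double d, double nuC, double magUr) {
    return d / nuC * magUr;
}

// Equation (7): drag coefficient D_c from the Schiller-Naumann approximation.
double dragCoeff(double d, double nuC, double rhoC, double rhoP, double Re) {
    return 18.0 / (d * d) * nuC * (rhoC / rhoP) * (1.0 + 0.15 * std::pow(Re, 0.687));
}

The resulting D_c then enters equation (6) together with the buoyancy term (1 − ρ_c/ρ_p) · g.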


4 GPU Computing

4.1 Introduction

Advances in computer graphics and the growth of the video game industry led to the development of powerful GPUs in the past ten years. It is worth noting that the video game industry nowadays generates more revenue than Hollywood [33]. While early GPUs were designed to accelerate a fixed graphics pipeline controlled from the CPU, modern GPUs are fully programmable, giving the programmer control of the graphics pipeline and allowing the execution of small programs, called shaders, directly on the GPU. Developers soon realized that this computing power cannot just be used for graphics, but also to accelerate many problems in computational science. With the release of CUDA in 2007 it became practical to develop computationally intensive applications for GPUs. Since then thousands of works have been published in which GPUs are used to accelerate computing.

The next two chapters give a short overview of the GPU architecture and the framework used to program GPUs, in order to understand the rest of the thesis. For an extensive introduction to GPU programming the author recommends the book "Programming Massively Parallel Processors: A Hands-on Approach" by David B. Kirk and Wen-mei W. Hwu [16] as well as the CUDA C programming guide [27], which comes with the CUDA software development kit.

4.2 Architecture of Modern GPUs

The performance of modern hardware is largely constrained by the latency of dynamic random access memory (DRAM). While the performance of processors increased greatly, the latency of DRAM access did not decrease much; this is often referred to as the memory wall. The reason for this lies in the way DRAM works: DRAM is basically a very large array of tiny semiconductor capacitors. The presence of a tiny amount of electrical charge distinguishes between 0 and 1. To read the data the charge must be shared with a sensor, and if a sufficient amount of charge is present, the sensor detects a 1 (0 otherwise). This is a slow process (furthermore, DRAM cells must be periodically refreshed, otherwise they lose their charge; during refresh cycles the data cannot be accessed); the latency of a global memory access is around 400 to 800 clock cycles [27, 5.2.3] on a GPU. Modern DRAMs use a parallel process to increase the data rate: each time a location is requested, many consecutive locations, which include the requested location, are read. Data at consecutive locations can be read at once and then be transferred at high speed to the processor. The programmer must ensure that the program arranges its data in a way that it can be accessed consecutively whenever possible. Otherwise the application will not perform optimally, because more memory transactions than theoretically needed will be executed.

While CPUs are kept busy by using complex automatic caching mechanisms (around one third of all transistors in a modern CPU are used just for caching), GPUs hide latency using massive parallelism. Thread scheduling on GPUs is implemented in hardware: each streaming multiprocessor (SM) has its own scheduler. The scheduler bundles threads into warps which are executed simultaneously. If a warp has to wait for data, it can be exchanged with another warp that is ready for calculations. Because scheduling is implemented in hardware, dispatching and switching warps has negligible overhead. The SM can automatically coalesce memory accesses, which is how the advertised high memory bandwidth [27, Figure 1.1] is achieved: if multiple threads in the same warp access consecutive memory locations at the same time, the hardware coalesces this into a single memory transaction [27, Appendix F 3.2.1]. Furthermore, vectorized loading of fundamental data types is supported [28, Table 86].

Figure 5 illustrates Nvidia's current architecture, named after the Italian physicist Enrico Fermi. Fermi is Nvidia's second generation architecture which supports CUDA. The figure shows a single streaming multiprocessor. The GeForce GTX 470, for example, has 14 multiprocessors and each one has 32 cores, which explains the warp size of 32. The register file is shared among all the threads assigned to an SM.


Each thread has its own register state and program counter, so threads can branch independently. However, since a warp executes one common instruction at a time, divergent branching within a warp is inefficient. Data-dependent branches may lead to threads taking different execution paths; in this case both parts of a branch are executed sequentially with the concerned threads disabled in either branch [27, Chapter 4.1]. Kernels (programs executed on the GPU, see chapter 4.3) with high register usage limit the number of blocks CUDA can assign to one SM; this may result in poor performance because the scheduler has too few warps to keep the SM busy. To calculate the occupancy of a CUDA program Nvidia provides a spreadsheet [1]. GPUs are often called SIMD (single instruction multiple data) architectures. However, compared to SIMD architectures such as Intel's SSE (streaming SIMD extensions), a GPU offers much more: a scheduler implemented in hardware, state for each thread, independent branching (affects performance, but possible), shared memory, etc. Because of this, Nvidia refers to their architecture as SIMT (single instruction multiple threads).

Figure 5: Nvidia's Fermi architecture, taken from [25]. (The figure shows a single Fermi streaming multiprocessor with its warp schedulers, register file, 32 CUDA cores, 16 load/store units, interconnect network, shared memory / L1 cache and uniform cache.)

4.3 CUDA

CUDA stands for Compute Unified Device Architecture and is a C-style API to program Nvidia's hardware. CUDA is scalable, which means that programs written in CUDA can run on any CUDA-enabled GPU, independent of the exact hardware configuration. For example, the same program can run on a GPU with two streaming multiprocessors or on one with four or eight multiprocessors. However, when it comes to performance optimization, knowledge of the underlying hardware is often still required.

CUDA defines three key abstractions: a hierarchy of thread groups, shared memory and barrier synchronization.


The CUDA terminology differentiates between the host, the CPU executing a host program sequentially, and the device (usually a GPU), which executes the CUDA program in parallel. A CUDA host program can allocate memory on the device, copy data between the host and the device and execute programs on the GPU. Programs running on the GPU are called kernels. A kernel must be configured when it is called, meaning that the programmer must specify on how many thread blocks the kernel runs and the number of threads per block. Kernels are defined and invoked using a proprietary syntax [27, 2.1]. CUDA source files usually end with .cu and can include a mix of host and device code. During the compilation workflow the host code is separated from the device code. The device code is compiled into assembly form, and the host code is passed on to the respective host compiler (for example g++ on a Linux environment) [27, 3.1].
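As a minimal example of the kernel definition and launch syntax described above (a generic sketch, not code from the thesis), the following program scales an array on the device:

#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread scales one element of the array.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d_data = 0;
    cudaMalloc((void**)&d_data, n * sizeof(float));   // allocate device memory
    // ... copy the input data to d_data with cudaMemcpy ...

    // Launch configuration: number of blocks and threads per block.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();                          // wait for the kernel to finish

    cudaFree(d_data);
    return 0;
}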


5 Software Development

5.1 Software Design

The engine for particle tracking on the GPU is built upon the solid particle library. A solver called gpuLagrangianFoam was developed. If it is invoked using the -gpu switch, tracking is executed on the GPU. Because CUDA files must be compiled using Nvidia's proprietary compiler, nvcc, the code using CUDA keywords was developed as a separate shared library against which the solver is linked. The rest of the solver code can be compiled in the same way as any other OpenFOAM application.

Mesh Data. As explained earlier, the OpenFOAM Cloud class stores particles in a linked list, which is not suitable for parallel processing on a GPU. Before the mesh can be used on the GPU, it is converted into a more suitable format, which is described here. Three-dimensional vectors are padded to 4 elements such that they can be fetched using CUDA's built-in vector data types float4 or double4. For example, the cell centres are stored in the order of the cell labels, as illustrated in figure 6.

Figure 6: Cell centres arranged in memory. (For cell labels 0, 1, 2 the coordinates x1 y1 z1, x2 y2 z2, x3 y3 z3 are stored consecutively in the cellCentres array.)

The cell labels themselves are not stored, since they go from 0 to the number of cells minus one. Storing the face labels associated with the cell labels requires some more effort, because in an unstructured mesh a cell can have any number of faces.

Figure 7: Mesh data layout in memory. (faceLabelsIndex holds, for each cell label, the offset into faceLabelsPerCell at which that cell's face labels start; faceLabelsPerCell stores six face labels per cell for this mesh; nFacesPerCell stores the number of faces of each cell.)

Figure 7 illustrates the mapping from the cell labels to the corresponding face labels. The face labels are stored in the array faceLabelsPerCell, cell by cell, meaning that the first six face labels belong to cell 0 and the next six to cell 1. Face label 0 appears twice; this is the label of the face between cells 0 and 1. The faceLabelsIndex array stores the indices into the faceLabelsPerCell array at which the face labels of a given cell start. Note that the array names are identical to the ones used in the code. The mesh data was taken from the tunnel test case (see section 6.1), which consists of ten cubes placed one after another. Data belonging to a certain face, such as the face centres and the face normals, is stored at the same index as the face labels.
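To illustrate how these flattened arrays are used, a device function iterating over the faces of one cell might look as follows (a sketch using the array names from figure 7; the surrounding kernel and data transfer are omitted):

// Iterate over the faces of cell `celli` using the flattened mesh arrays.
// faceLabelsIndex[celli] is the offset of the cell's first face label,
// nFacesPerCell[celli] the number of faces of that cell.
__device__ void forEachFaceOfCell(const int* faceLabelsIndex,
                                  const int* faceLabelsPerCell,
                                  const int* nFacesPerCell,
                                  int celli)
{
    const int start = faceLabelsIndex[celli];
    const int nFaces = nFacesPerCell[celli];
    for (int i = 0; i < nFaces; ++i) {
        const int facei = faceLabelsPerCell[start + i];
        // ... use facei, e.g. load the face centre and normal stored at index facei ...
    }
}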


The neighbour and owner arrays are implemented as described in section 2.2. This is illustrated in figure 8; the data is again taken from the particle tunnel case. There are only ten faces with a neighbour cell: face 0 has neighbour cell 1, face 1 neighbour cell 2, and so forth.

Figure 8: Owner and neighbour data of the particle tunnel case. (The owners array lists the owner cell label for every face label; the neighbours array lists the neighbour cell label for each internal face.)

Particle Data. The particle data is stored in a similar fashion as the mesh data. For example, the labels of the faces for which λ_c is in the interval [0, 1] (see section 3) also use an additional index array.

Abstraction and Data Encapsulation. When programming CUDA, the host code can be written entirely in C++, while only C code with a few extensions borrowed from C++, such as operator overloading or function templates [27, Appendix D], is allowed on the device. This makes it impossible to use containers from the C++ standard template library (STL) or OpenFOAM classes on the device. Using only manually allocated arrays to manage the data often leads to error-prone code and therefore long testing and debugging time. To transform the data into a flat structure suitable for the GPU, the vector container from the STL was used. The C++ ISO standard guarantees that the data is stored contiguously in memory, such that it can be copied to the GPU using raw pointers. From the C++ standard document [12, 23.3.6.1]:

    The elements of a vector are stored contiguously, meaning that if v is a vector<T, Allocator> where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().
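As an illustration of this, a minimal upload helper relying on that contiguity guarantee could look like this (a hypothetical helper, not the thesis's cuvector class, which is described in figure 9 below):

#include <vector>
#include <cuda_runtime.h>

// Upload the contents of a std::vector to newly allocated device memory.
// Relies on the contiguity guarantee quoted above; the caller frees the
// returned pointer with cudaFree().
template <typename T>
T* uploadToDevice(const std::vector<T>& host) {
    T* devicePtr = 0;
    if (host.empty()) return devicePtr;
    cudaMalloc((void**)&devicePtr, host.size() * sizeof(T));
    cudaMemcpy(devicePtr, &host[0], host.size() * sizeof(T), cudaMemcpyHostToDevice);
    return devicePtr;
}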


Figure 9: Simplified class diagram of the GPU tracking library. (cuvector<T> is a wrapper around std::vector which also holds a pointer to the corresponding device memory; device memory is allocated upon construction and freed when the object is destroyed, upload() copies data from host to device and download() from device to host. CuData contains some basic functionality to deal with data arranged for the device. FlatMesh stores the mesh in a flat structure suitable for use on the device. ParticleData holds all the data related to the particles, such as position, velocity, occupancy and the calculated data used for tracking, such as the lambdas and the faces hit. ParticleEngine is the actual engine and holds the functions which do the actual particle tracking, such as calculating the lambdas, estimating the end position or moving the particles.)

5.2 Complete Particle Engine

Tracking a set of particles given a start position and an (estimated) end position involves several steps. First one needs to calculate λ_c for all particles. Recalling equation (4) for λ_c:

    \lambda_c = \frac{(C_f - C_c) \bullet S_f}{(b - C_c) \bullet S_f}

The only data depending on the particle to track is the end position b. Therefore we can calculate the numerator of (4) in advance for every face and cell. This is done right after the mesh data is uploaded. Once λ_c is calculated, we have, for every particle, the set of faces for which λ_c is in the interval [0, 1]. This set usually consists of zero, one or two faces (more faces per particle can be found, but this does not happen very often). Particles for which no face was found have their end position inside the same cell in which they were at the beginning; all we need to do is move them to the end position. Therefore we need to resort the particle data and create a second set consisting of particles which still need tracking; we call this the set of remaining particles. For those, λ_a must be calculated to figure out which face they hit. With this information all the particles can be moved: particles not in the set of remaining particles can be moved to the end. For particles in the set of remaining particles it must be checked whether the face hit is a boundary face, and if so, the particle must be reflected at the boundary. Otherwise the particle is moved onto the face hit and the occupancy information is updated to the adjacent cell. In the next step only the particles in the remaining set of this step need to be considered. Before starting the next iteration, the velocity must be updated using the velocity vector of the new occupancy cell. This process is illustrated in figure 10.


Initial calculation: calculate the numerator of λ_c for each face. Then, for every time step:

1. Estimate the end position. For each particle, estimate the end position using the particle's velocity.
2. Find faces. For each particle, find the set of faces for which λ_c is in [0, 1].
3. Update the set of remaining particles. Create a new set, called remaining particles, consisting of the particles for which one or more faces were found.
4. Calculate λ_a. For each particle in the set of remaining particles, find the smallest λ_a and the face hit by the particle.
5. Move particles. Particles in the set of remaining particles are moved onto the face hit and their occupancy information is updated; particles not in the set can simply be moved to their end position.
6. Update velocity. For all particles in the set of remaining particles, the velocity is updated using the velocity of the new occupancy cell.
7. Update the set of particles. In the next iteration only particles from this step's set of remaining particles are considered.

These steps are repeated until all particles have reached their end position.

Figure 10: Complete particle tracking sequence.
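The initial calculation step in figure 10 precomputes the numerator (C_f − C_c) • S_f of equation (4) once for every cell-face pair. A CUDA sketch of such a kernel could look as follows (one thread per cell; the names faceCentres, faceNormals, cellCentres and numerator are chosen for illustration, while the remaining array names follow figure 7; the actual kernel organization in the thesis code may differ):

// Precompute the numerator of equation (4), (C_f - C_c) . S_f, for every
// entry of faceLabelsPerCell. One thread handles all faces of one cell.
__global__ void precomputeLambdaCNumerator(const float4* faceCentres,      // C_f per face label
                                           const float4* faceNormals,      // S_f per face label
                                           const float4* cellCentres,      // C_c per cell label
                                           const int*    faceLabelsIndex,
                                           const int*    faceLabelsPerCell,
                                           const int*    nFacesPerCell,
                                           float*        numerator,        // one value per entry
                                           int           nCells)
{
    int celli = blockIdx.x * blockDim.x + threadIdx.x;
    if (celli >= nCells) return;

    const float4 Cc = cellCentres[celli];
    const int start = faceLabelsIndex[celli];
    for (int i = 0; i < nFacesPerCell[celli]; ++i) {
        const int facei = faceLabelsPerCell[start + i];
        const float4 Cf = faceCentres[facei];
        const float4 Sf = faceNormals[facei];
        numerator[start + i] = (Cf.x - Cc.x) * Sf.x + (Cf.y - Cc.y) * Sf.y + (Cf.z - Cc.z) * Sf.z;
    }
}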


5.3 Data Reduction

After calculating λ_c it is known which particles stay in their cell. The set of particles is then reduced to the set of remaining particles using the array which holds the number of faces found per particle, as illustrated in figure 11.

    particleLabels:           0 1 2 3 4 5
    nFacesFound:              0 1 1 0 0 2
    particleLabelsRemaining:  1 2 5

Figure 11: Reducing the particle labels array to those which still need tracking.

When programming sequentially, this can be done with one simple loop. Copying the data back from the device to the host, sorting it in a single thread and uploading it again takes far too much time. This is illustrated in figure 12, where the runtime of all kernel functions and memory copies is plotted. In this test the total computing time spent on the GPU is only around 15%; a huge part is used to copy memory between the host and the device. The white gap shows that the GPU is idle during the time spent on the host to reduce the data. The test was done on the torus case (see section 6.2), where the lambdas for 100'000 particles were calculated.

Figure 12: Width plot of the GPU time spent for calculating lambdas (total span 114096.344 usec; the operations shown are memcpyHtoD, calcLambdaAKernel, calcLambdacnumKernel, memcpyDtoH, memsetScalarKernel, memsetIntegralKernel and findFacesKernel). The CPU is used for data reduction.

Doing such a reduction efficiently on massively parallel hardware is far less trivial. Fortunately, sorting on the GPU is already implemented [23]. Thrust [11] is a library of basic algorithms for the GPU with an interface similar to the Standard Template Library; it includes algorithms for counting, sorting and searching. The particle labels array is reduced on the GPU by the following steps (a sketch using Thrust follows the list):

1. Count the number of zeros in nFacesFound and use this to calculate how many particles still remain.
2. Sort the particleLabels array using the nFacesFound array as key, in descending order. All labels with no faces hit then end up at an index greater than or equal to the number of particles remaining.
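The two steps above map directly onto Thrust primitives. Below is a minimal sketch; the function signature and the use of device_vector are assumptions, while thrust::count and thrust::sort_by_key with thrust::greater are standard Thrust calls.

#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/sort.h>
#include <thrust/functional.h>

// Reduce particleLabels to the particles that still need tracking.
// Returns the number of remaining particles; after the call the first
// 'remaining' entries of particleLabels are the remaining labels.
int reduceParticleLabels(thrust::device_vector<int>& nFacesFound,
                         thrust::device_vector<int>& particleLabels)
{
    // 1. Count the particles for which no face was found.
    const int zeros = static_cast<int>(
        thrust::count(nFacesFound.begin(), nFacesFound.end(), 0));
    const int remaining = static_cast<int>(nFacesFound.size()) - zeros;

    // 2. Sort the labels by nFacesFound in descending order, so labels with
    //    no face hit end up at indices >= remaining.
    thrust::sort_by_key(nFacesFound.begin(), nFacesFound.end(),
                        particleLabels.begin(), thrust::greater<int>());
    return remaining;
}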


5.4 Tools and Validation

In order to validate the results, solidParticleFoam has been modified so that all the data going through trackToFace(..) is written to disc in binary format.^9 This includes the beginning and the end position of each particle, the cell it occupies at the beginning, the face it hits (if any) and the lambdas calculated. This data is then read by a test program, which recalculates the lambdas and the faces hit on the GPU using the stored start and end positions. The results are then compared and any differences between them are reported. This utility is called calcLambdas and requires the -data argument.

Doing so validates the results of the kernels responsible for calculating the lambdas (steps 2, 3 and 4 in figure 10), but it does not verify whether the particles are moved correctly. The gpuLagrangianFoam utility therefore comes with a switch called -validate. If it is turned on, the mesh is searched after moving the particles at the end of the time step in order to verify that the particles are in the correct cell. To generate a number of particles as test data, a utility called genRandCloud was developed. It takes the desired number of particles as input argument and randomly positions the particles in the simulation domain.

Searching the mesh is done using the octree library [3] that OpenFOAM provides: the problem of finding the correct cell for a given position is reduced to the nearest neighbour problem. The centroids of all the cells in the mesh are taken as the point set and the position of the particle as the query point. The probability that the particle is in the same cell as the closest centroid is quite high, and if it is not, it will be in one of the neighbouring cells. The octree is used for spatial subdivision: the bounding box of the simulation domain is recursively subdivided into 8 smaller boxes, starting from the centre of the bounding box. Recursion stops once every centroid lies in a separate box. During this process a tree is built which contains the direction taken at each step. This is easier to explain in two dimensions, where the bounding box would be divided into 4 rectangles: north west, south west, north east and south east. Using this tree the program only needs to follow the directions for a given position in order to solve the nearest neighbour problem.

^9 In a first attempt the data was written out in text format. Because of the loss of precision it was difficult to compare the results.


6 Test Cases and Results

6.1 Tunnel

The tunnel test case is very simple and was mainly used for debugging during development. It consists of ten cubes. This is illustrated in figure 13; the particles are coloured by the x-component of their velocity (the x-axis goes from left to right in the illustration).

Figure 13: Particle tunnel with randomly distributed particles.

6.2 Torus

The second test case is a simple torus. A torus is defined by two circles orthogonal to each other in three-dimensional space. The smaller circle is rotated around the bigger circle, leading to a surface of revolution which defines the torus.

Figure 14: A torus defined by surface revolution of a circle.

In order to approximate the torus polygonally, discrete points on the rotating circle are calculated using formula (9). The formula requires the following parameters: (x_0, y_0) is the centre of the circle, r is the radius of the circle and (x, y) are the points lying on the circle.

    (x, y) = (x_0 + r · cos(α), y_0 + r · sin(α))    (9)

The mesh was then generated using netgen [31], a simple mesh generator which can export meshes in the OpenFOAM format.
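As a small worked example of formula (9), the following snippet prints discrete points on the rotating circle; the centre, radius and number of points are arbitrary example values, not the values used for the torus case.

#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.14159265358979323846;
    const double x0 = 1.0, y0 = 0.0;  // centre of the rotating circle (example values)
    const double r  = 0.3;            // circle radius (example value)
    const int    n  = 16;             // number of discrete points on the circle
    for (int i = 0; i < n; ++i) {
        const double alpha = 2.0 * pi * i / n;
        // Formula (9): a point on the circle at angle alpha.
        const double x = x0 + r * std::cos(alpha);
        const double y = y0 + r * std::sin(alpha);
        std::printf("%.6f %.6f\n", x, y);
    }
    return 0;
}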


Figure 15: A tetrahedral mesh generated in a torus.

Figure 15 shows a coarse mesh with around 30'000 cells generated in a torus.

6.3 Measurements

The performance measurements are done using the torus case with 228'184 cells. The tests ran on an Intel Core i5-655K CPU^10 and an Nvidia GeForce GTX 470^11; both parts were released to the market in 2010. 50'000 particles are tracked over ten time steps, which is equivalent to calculating the lambdas for 500'000 particles.

                        CPU          GPU
    Single precision    2'810'792    220'866
    Double precision    3'264'259    376'349

Table 1: Execution time in microseconds.

Table 1 compares the time required to calculate the lambdas on the CPU using the OpenFOAM code and on the GPU using the code developed during this thesis. It should be noted that the GPU time includes the time required to copy all the data over the PCI bus; the actual computing time on the GPU is much shorter. Because of this, tracking particles on the GPU only pays off for a larger number of particles, as illustrated by figure 16.

^10 http://ark.intel.com/Product.aspx?id=48750
^11 http://www.nvidia.com/object/product_geforce_gtx_470_us.html


Figure 16: "CPU vs GPU" — execution time in microseconds over the number of particles (in multiples of 10'000, from 0 to 50).

The execution times in figure 16 have been measured ten times at each point. For each point the average, highest and lowest times are shown; the average values are connected by lines. As one can see, there is some scatter in the CPU measurements, most likely because the operating system and other user processes are running on the CPU along with the OpenFOAM code. On the GPU only very little variation can be seen, because the test ran on a dedicated GPU not used for graphics.

For a more detailed analysis of the computing time spent on the GPU, Nvidia's Compute Profiler [9] was used.


Figure 17: Width plot showing the time of the kernel functions and memcpy operations (total span 246762.54 usec; the operations shown are memsetScalarKernel, reorderNFacesFound, memcpyHtoD, calcLambdaAKernel, calcLambdacnumKernel, thrust, memcpyDtoH, memcpyDtoD, memsetIntegralKernel and findFacesKernel). The first chunk covers the upload of the mesh data and the calculation of the static data; the remainder is required for every time step.

Figure 17 shows a width plot of the GPU time.^12 The white spaces in between indicate idle time in which the host code is doing something. It can easily be seen that around 60% of the run time is spent copying memory. Furthermore, the first chunk of data copied is the mesh data; this, together with the initial calculation, must be done only once.

Figure 18: Width plot showing the actual execution time spent on the GPU per time step.

If we do not consider the time spent copying the data over the PCI bus and doing the initial calculation, we get an actual run time on the GPU of just 89'742 usec per time step (double precision). That is more than 30 times faster than the single-threaded CPU implementation. The actual execution time on the GPU is illustrated in figure 18.

6.4 Performance Analysis of Computational Kernels

There are three major kernels involved in the computation: a kernel which calculates the numerator of λ_c at the beginning (initial step in figure 10), with one thread per cell; a kernel which finds all faces with λ_c ∈ [0, 1] (step 2 in figure 10), with one thread per particle; and a kernel which calculates λ_a for the particles with λ_c ∈ [0, 1]^13 (step 4 in figure 10).

^12 The plot was made using the same test case (with double precision) as presented in table 1. The shorter total time is because this is a sum of times measured on the GPU (for copying data and GPU kernels), while the time presented in the table was measured on the CPU and therefore includes the overhead for invoking memory copies and GPU kernels.


Because the kernel responsible for finding faces with λ_c ∈ [0, 1] takes the most time to compute (figure 17, yellow), it is analysed here. The structure of the kernel is as follows: first, the data belonging to the cell and to the particle is fetched using the particle label. Then the kernel iterates over all faces of the cell; for each face the face data (face normal and face centre) is fetched and the lambdas are calculated. If they lie within the desired interval, the label of the face is written into the faces-found array. Once the loop is done, the number of faces found is written back. A hedged sketch of such a kernel is given after the list below. When compiled for double precision, the kernel's occupancy (see section 4.2) is quite low, around 33%, because each thread requires 34 registers. Compiling for single precision improves the occupancy to around 66%, because only 24 registers are required. In both cases the occupancy is limited by register usage; if the kernel could make do with just 16 registers per thread, the occupancy would be 100%. The kernel is memory bound, meaning that execution time is wasted waiting for data from global memory. The main reasons for this are uncoalesced reads and writes as well as the low occupancy. For example, writing the face labels found is uncoalesced because the target addresses differ for every thread, so the writes are serialized. There are several options to further improve the kernel's efficiency:

- Turn off the L1 cache for global memory access. Uncached memory transactions are multiples of 32, 64 and 128 bytes, whereas cached transactions are always multiples of 128 bytes (the cache line size) [27, F.4.2]. In the case of uncoalesced memory access, smaller transactions are better, because the full size of the cache line is not used anyway. Quickly trying this out led to a slight improvement of around 20%.^14

- Use shared memory or the texture cache. Shared memory is fast, on-chip memory which can be accessed by all threads in the same block [27, 3.2.3]. However, tracking particles in an unstructured mesh is not well suited for shared memory: using it requires that threads sharing data are placed in the same CUDA block. The threads for particles which reside in the same cell share data (the cell centre, face centres and normals, etc.), but this would require the particles to be sorted by their occupancy cells. In addition, all CUDA blocks of a kernel launch must be of the same size, and the block size should be a multiple of the warp size (32 threads) and at least 64 threads [26, 4.4]. This makes shared memory unfeasible here. Using texture memory [27, 3.2.10.1] might speed up the kernel, since it led to a significant performance improvement in Nvidia's particles demo (see section 8.1).

- Review the code and rearrange the data in order to improve memory access patterns and reduce the register requirements. The kernel code first fetches the label of the particle. During the first iteration the particle labels correspond directly to the thread indices; the particle labels array is only needed in later iterations, where some particles are already done. Because the number of particles which need to be considered in further iterations is usually much smaller, this could lead to a significant improvement.
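To make the structure described above concrete, here is a minimal sketch of such a find-faces kernel. The flat data layout (a fixed maximum number of faces per cell), all names and the handling of unused face slots are assumptions for illustration; this is not the actual thesis kernel.

// Hedged sketch of a find-faces kernel: one thread per particle, iterating
// over the faces of the particle's occupancy cell.
__global__ void findFacesSketch(const int*     particleLabels,   // labels of the particles to process
                                int            nParticles,
                                const int*     occupancyCell,    // occupancy cell per particle
                                const double3* endPosition,      // estimated end position b per particle
                                const double3* cellCentre,       // C_c per cell
                                const int*     cellFaces,        // face indices, maxFacesPerCell per cell (-1 = unused)
                                int            maxFacesPerCell,
                                const double3* faceNormal,       // S_f per face
                                const double*  lambdaCNum,       // precomputed (C_f - C_c) . S_f per (cell, face slot)
                                int*           facesFound,       // face labels found, maxFacesPerCell per particle
                                int*           nFacesFound)      // number of faces found per particle
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    int p    = particleLabels[i];                  // fetch the particle label
    int cell = occupancyCell[p];                   // fetch cell and particle data
    double3 b  = endPosition[p];
    double3 Cc = cellCentre[cell];

    int found = 0;
    for (int f = 0; f < maxFacesPerCell; ++f) {    // iterate over the faces of the cell
        int face = cellFaces[cell * maxFacesPerCell + f];
        if (face < 0) continue;                    // unused slot
        double3 Sf = faceNormal[face];
        // lambda_c = ((C_f - C_c) . S_f) / ((b - C_c) . S_f); the numerator is precomputed.
        double denom = (b.x - Cc.x) * Sf.x + (b.y - Cc.y) * Sf.y + (b.z - Cc.z) * Sf.z;
        if (denom == 0.0) continue;                // particle moves parallel to this face
        double lambda = lambdaCNum[cell * maxFacesPerCell + f] / denom;
        if (lambda >= 0.0 && lambda <= 1.0)
            facesFound[p * maxFacesPerCell + found++] = face;   // uncoalesced write
    }
    nFacesFound[p] = found;                        // write back the number of faces found
}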
6.5 Memory Requirements

GPU memory systems are designed for high throughput rather than for supporting the very large amounts of memory that CPUs can address. Today GPUs have at most 6 GB of RAM. For larger cases it will probably be necessary to partition the particles into chunks fitting into the memory of the GPU.

In the simulation code a particle requires about 1338 bytes of memory to store all the relevant data (position, velocity, diameter, label, etc.). Processing one million particles at once would therefore require around 1.25 GB of memory just for the particles. A tetrahedral mesh cell requires about 221 bytes of memory, so a mesh with one million cells requires about 212 MB of data.

^13 The kernels that calculate the lambdas are parallelized over the number of particles. It would also be possible to parallelize over all the cells in the mesh, but this would require each thread to iterate over the particles in its cell. Because the number of particles per cell usually varies, this would lead to massively divergent warps and therefore poor performance [27, 5.4.2].
^14 The plots presented before are with the L1 cache enabled.


The calcLambdas test program and the gpuLagrangianFoam solver report the amount of data required on the GPU at the beginning of the simulation.


7 Conclusion and Future Work

It was shown that the Lagrangian particle tracking algorithm can run efficiently on the GPU. Porting it required the conversion of the mesh data into structures suitable for the SIMT architecture. The algorithm itself could be parallelized easily, since the computations are done for each particle individually, which suits massively parallel processors well. The biggest problem with respect to efficiency is the time-consuming memory copies over the PCI bus. Having to copy data over a slow bus repeatedly can nullify the speedups achieved by porting parts of a simulation to massively parallel hardware. While several workarounds exist, such as overlapping copying with kernel execution, the problem will probably disappear in future hardware generations. For example, Advanced Micro Devices (AMD) has already announced its next-generation architecture called "Fusion" [13], in which the GPU and the CPU will access memory over the same high-speed bus.

Now that the basics for Lagrangian particles on GPUs are implemented, using them in an actual simulation is the next step. Just executing the tracking on the GPU would probably not have a big impact on the overall run time, mostly because the particle data must be converted into a suitable format and copied over a slow bus for each time step.^15 However, if there are other computationally intense tasks involving particles, such as collision detection, significant improvements in the overall run time could be achieved. In 2000 Niklas Nordin wrote in his thesis about Diesel combustion [24, Chapter 2.2.6]: "Among the spray sub-models, the weakest model is the collision model." Reasons for this are the mesh dependency of the collision model and the fact that some collision models do not even take the trajectory of the particles into account. Collision models which work independently of the mesh and use the trajectories of the particles tend to be computationally intensive, because each particle must be collision-checked against a subset of all particles. Nvidia has already shown that collisions can be calculated efficiently on a GPU (see section 8.1). Adding such collision models would greatly improve the Lagrangian framework.

^15 It is interesting to see how other teams dealt with this problem. For example, a large CFD code written in Fortran using OpenMP was automatically translated into CUDA code [15] (see section 8.4 for a summary). Having the whole code run on the GPU eliminated the need to copy data between two address spaces during the simulation; it is only required to upload the data at the beginning and download the results at the end. With OpenFOAM this would be unfeasible because of the lack of fine-grained parallelism, the immense complexity of the C++ code and data structures unsuitable for massively parallel processing.


8 Summaries of Related Works

8.1 Nvidia's Particles Demo

The demo [17] shows how to efficiently implement a simple particle system in CUDA. It includes interactions between neighbouring particles using a uniform grid. Nvidia suggests using the demo as a framework upon which more complicated particle interactions, such as smoothed particle hydrodynamics (SPH) or soft-body simulations, can be built.

There are three main steps in the simulation: integration, building the grid data structure and processing collisions. The visualization is done by using OpenGL directly together with CUDA. By default the system simulates 16384 particles in real time; on a GTX 470 it was, however, possible to simulate around 10^5 particles in real time (with a frame rate of about 30 frames per second). The particle-particle interactions are simplified using spatial subdivision. Because the interaction force drops off with distance, the force for a given particle is computed by only comparing it with particles within a certain radius; particles farther away are not considered.

The integration of the particle data (position and velocity) is simply done using the Euler method. The grid consists of cubes with a side length equal to the particle diameter (all particles have the same diameter). Using such a grid greatly simplifies further computations: each particle can cover 8 cells at most, and at most four particles can theoretically reside in one cell. The grid used is called loose, because each particle is only assigned to one cell, even though it might overlap cell boundaries. Because a particle can overlap several cells, all the neighbouring cells must be examined when processing collisions (3 · 3 · 3 = 27).

There are two methods of building the grid. The first one uses atomic operations. Two arrays are used: one which stores the number of particles in each cell and one which stores the particle indices for each cell. Both are rebuilt at every time frame. The kernel runs with one thread per particle; a hedged sketch of this approach is given below. The second method builds the grid using sorting and was designed for older GPUs which do not support atomic operations. However, in the current version of the particles demo the atomic version was removed, because the version which uses sorting performs better.
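As an illustration of the atomic variant just described, the following is a minimal sketch of such a grid-building kernel; the cell-index computation, the fixed per-cell capacity and all names are assumptions for illustration, not the demo's actual code.

// Hedged sketch of building a uniform grid with atomics: one thread per
// particle, counting particles per cell and recording their indices.
__global__ void buildGridAtomic(const float3* position,      // particle positions (assumed >= 0)
                                int           nParticles,
                                float         cellSize,      // grid cube side length
                                int3          gridCells,     // number of cells per axis
                                int*          cellCounts,    // particles per cell (zeroed beforehand)
                                int*          cellParticles, // particle indices, maxPerCell per cell
                                int           maxPerCell)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    float3 p = position[i];
    // Compute the cell the particle centre falls into.
    int cx = min((int)(p.x / cellSize), gridCells.x - 1);
    int cy = min((int)(p.y / cellSize), gridCells.y - 1);
    int cz = min((int)(p.z / cellSize), gridCells.z - 1);
    int cell = (cz * gridCells.y + cy) * gridCells.x + cx;

    // Reserve a slot in the cell with an atomic increment.
    int slot = atomicAdd(&cellCounts[cell], 1);
    if (slot < maxPerCell)
        cellParticles[cell * maxPerCell + slot] = i;
}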
The paper further mentions that binding the global memory arrays to textures improved performance by 45%, because texture reads are cached. These arrays are used to fetch the particles' positions and velocities, which is typically non-coalesced.

8.2 The Lagrangian Particles in the EM Driven Turbulent Flow with Fine Mesh

This paper [29] presents a simulation in which the flow in an induction crucible furnace (ICF) is simulated using the solidParticle library from OpenFOAM. The particles are used to represent the metal flow inside the ICF, while the (turbulent) electromagnetic field is represented by the Eulerian phase. The particles are moved by a one-way coupling in which only the vector field of the cell in which the particle currently resides is taken into account; an interpolation of the velocity of the surrounding cells to the actual position of the particle is not used. The authors mention that the collision with the wall did not work correctly when the particle diameter is larger than the cell side; they therefore had to fix the code in order to get the simulation working. The authors finally conclude that the industrial observations correspond to the observations they made with their simulation.

8.3 Complex Chemistry Modeling of Diesel Spray Combustion

This PhD thesis [24] is about simulating Diesel sprays and combustion. It starts by explaining why it is necessary to treat the liquid phase (Diesel in this case) in a Lagrangian way: in a Diesel engine the fuel is injected through a spray which has a diameter on the order of 0.1 mm, with a velocity of 200-400 m/s. The subsequent ignition and combustion require length scales which are even smaller. Using the finite volume method to simulate everything would require a mesh with scales so small that it would need enormous amounts of memory and computing time, which are not available today or any time soon.


In this thesis various methods to track particles in an unstructured grid are discussed and an early version of the particle tracking algorithm [22] is described. Break-up models and collision models are also presented. Lagrangian particle tracking was introduced to OpenFOAM around the year 2000 with the introduction of dieselFoam, which is described in this thesis.

8.4 Porting Large Fortran Codebases to GPUs

This work deals with porting large legacy codes to CUDA [15]. It starts with a Fortran code called FEFLO which has around one million lines of code; OpenMP is used to parallelize the code for CPUs. Because porting all the code manually was too much work, a translator was written which ports the code to CUDA automatically. The existing OpenMP directives were used to generate CUDA kernels. The author claims that the work was done in a few months using a thousand-line Python script based on FParser [2]. It is further mentioned that memory transfers between the CPU and GPU should be avoided because they are slow, and that the performance gained by porting just a few bottleneck routines will be nullified by the additional time required for data transfers; therefore memory transfers should happen at the beginning and at the end and not during the simulation. When discussing implementation details the author mentions Thrust [11], which was used for reductions. It is concluded that, because of the uniform coding conventions enforced during the development of FEFLO, a large percentage of the code could be ported to CUDA automatically; manual rewriting was necessary only in a few cases. The automatic code translator has a few limitations: fine-grained parallelism must already be expressed in the code, which in this case was done with OpenMP. Additionally, it only supports Fortran; C and C++ are not supported.


9 Appendix

9.1 Assertions on the GPU

CUDA does not come with support for assertions as offered by the C standard library. Using assertions during development helped catching bugs early. Assertions were implemented using the C preprocessor.

// Note: the listing in the thesis is an excerpt; the include guard and the
// #ifdef DEBUG that the #else below refers to are reconstructed here.
#ifndef CUDA_ASSERT_H
#define CUDA_ASSERT_H

#ifdef DEBUG

// This flag is read at the beginning of every kernel; if it is not true the
// kernel returns. This way no more kernels are launched once an assertion
// fails.
__device__ bool run = true;

// Trap cannot be used directly, because if the kernel is interrupted with
// trap the output of printf is never transferred to the host and never
// printed out.
#define cudaAssert(condition) \
    if (!(condition)) { printf("Assertion %s failed!\n", #condition); run = false; }

#define checkRun if (!run) asm("trap;");

#define info(x, ...) printf(x, __VA_ARGS__)

#else

#define cudaAssert(condition)
#define checkRun
#define info(x, ...)

#endif

#endif

Using the assertion macro requires the programmer to add checkRun at the beginning of every kernel. The macro itself can be used just like the C assertion macro, by adding cudaAssert(condition) to test whether a condition expected to be true really holds. Note that the assertions are only evaluated if DEBUG is defined; this way there is no overhead when debugging is turned off.
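A minimal usage sketch, assuming the macros above are visible in the compilation unit; the kernel and its arguments are hypothetical:

// Hypothetical kernel illustrating the use of checkRun and cudaAssert.
__global__ void scaleKernel(double* values, int n, double factor)
{
    checkRun;                         // abort early if an assertion already failed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cudaAssert(factor > 0.0);         // condition expected to hold
    values[i] *= factor;
}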


9.2 Curriculum Vitae

Education
University of Applied Science, Lucerne — MSc Engineering, 2009 - present
University of Applied Science, Lucerne — BSc with specialization in Software Systems, 2005 - 2009; Thesis: Web Crawler for Recommender Systems
Wirtschaftsmittelschule Aarau, 2000 - 2003

Work Experience
AdNovum Informatik AG, Zürich — Work student, part time, 2010; implemented a CAPTCHA system for nevisAuth.
University of Applied Science - CC D3S, Lucerne — Assistant, part time, 2009; worked on different software projects and assisted in a JavaEE class.
ETHZ - Chair of Systems Design, Zürich — System administrator, part time, 2006 - 2008; administration and support of the IT infrastructure, web development with Ruby on Rails.
crossmediacomm AG, Aarau — Internship, 2003 - 2004

9.3 Project Definition

Introduction
In computational fluid dynamics (CFD) a number of PDEs have to be solved. In addition it might be necessary to include particle effects, which is often accomplished by the probability density function (PDF) method. This allows more accurate simulation of combustion processes, as in diesel engines or aircraft engines. During the course of the computation each of millions of particles has to be tracked along its path (particle tracking). The operations applied to each particle are simple and could therefore be executed by the GPU in parallel.

OpenFOAM (Open Field Operation and Manipulation) is a C++ framework for the simulation of fluids, the interaction of fluids with solids and many more problems. OpenFOAM includes approximately 190 standard programs which can easily be adapted and even extended. Since 2004 OpenFOAM has been distributed under the GPL. Its user base is gradually increasing, not only in the research and education area but also in the industrial field.

Job description
During the course of the Master Diploma Work an OpenFOAM library has to be developed which executes the computationally intensive operations of particle tracking on the graphics processing unit (GPU), using CUDA. Starting from the second consolidation project (in spring semester 2010), where a 2D version of the particle tracking outside OpenFOAM was realized, the following tasks have to be done for the master thesis:


1. Extend the algorithm to 3D, where cells can be tetrahedra, hexahedra and, in general, polyhedra. Assume that the velocity field can be computed at any position (by calling an already existing OpenFOAM method).
2. Design a C++ class for OpenFOAM which fetches all the data necessary for particle tracking (such as the position and velocity of each particle, which cell contains which particles, the geometry of the cells, etc.) and converts/adapts the data structures for efficient use on the GPU.
3. For each cell in the simulation domain, move the particles in this cell with the given velocity for a certain time step. Assume that the velocity (see above) is known at each point.
4. If computations can be done in parallel, implement them on the GPU.
5. The correctness of the implementation shall be tested using two examples (1. one-dimensional pipe, 2. torus).
6. Compare your results with the version running solely on the CPU.
7. Document your code accordingly (Doxygen).
8. Make your code and documentation easily accessible (via a git or svn repository).

For the implementation use a recent version of OpenFOAM such as 1.7.x or 1.6-ext.


Bibliography

[1] CUDA GPU Occupancy Calculator. http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls.
[2] f2py - Fortran to Python Interface Generator. http://code.google.com/p/f2py/.
[3] Octree library in OpenFOAM. http://foam.sf.net/doc/Doxygen/html/classFoam_1_1octree.html.
[4] Quick Overview of CFD Grid Terminology. http://www.innovative-cfd.com/cfd-grid.html.
[5] OpenFOAM Programmers Guide. OpenCFD UK, 2008.
[6] OpenFOAM Users Guide. OpenCFD UK, 2008.
[7] CUSP - Generic Parallel Algorithms for Sparse Matrix and Graph Computations. http://code.google.com/p/cusp-library/, 2011.
[8] Lagrangian and Eulerian specification of the flow field — Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/wiki/Lagrangian_and_Eulerian_specification_of_the_flow_field, 2011.
[9] Nvidia Compute Visual Profiler. http://developer.nvidia.com/content/nvidia-visual-profiler, 2011.
[10] Running applications in parallel. http://www.openfoam.com/docs/user/running-applications-parallel.php, 2011.
[11] Thrust - Code at the speed of light. http://code.google.com/p/thrust/, 2011.
[12] Pete Becker. Working Draft, Standard for Programming Language C++. ISO, 2011.
[13] Nathan Brookwood. AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience. http://www.amd.com/us/Documents/48423_fusion_whitepaper_WEB.pdf, 2010.
[14] James O. Coplien. Curiously recurring template patterns. C++ Report, 7:24-27, February 1995.
[15] Andrew Corrigan and Rainald Löhner. Porting Large Fortran Codebases to GPUs. Technical report, Center for Computational Fluid Dynamics, 2010.
[16] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
[17] Simon Green. Particle Simulation using CUDA, 2010.
[18] Hrvoje Jasak. Error Analysis and Estimation for the Finite Volume Method with Applications to Fluid Flows. PhD thesis, Imperial College, 1996.
[19] Hrvoje Jasak. Polyhedral Mesh Handling in OpenFOAM, 2005.
[20] Yuan Liu. Hybrid Parallel Computation of OpenFOAM Solver on Multi-Core Cluster Systems. Master's thesis, KTH Information and Communication Technology, 2011.
[21] Vratis Ltd. SpeedIT Tools: Beyond Acceleration. http://vratis.com/speedITblog/, 2011.
[22] Graham B. Macpherson, Niklas Nordin, and Henry G. Weller. Particle tracking in unstructured, arbitrary polyhedral meshes for use in CFD and molecular dynamics. Communications in Numerical Methods in Engineering, 2008.


[23] N. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium 2009, 2009.
[24] Niklas Nordin. Complex Chemistry Modeling of Diesel Spray Combustion. PhD thesis, 2000.
[25] Nvidia. Fermi Whitepaper, 2009.
[26] Nvidia. CUDA C Best Practices Guide 4.0, 2011.
[27] Nvidia. CUDA C Programming Guide 4.0, 2011.
[28] Nvidia. PTX: Parallel Thread Execution ISA 2.3, 2011.
[29] Mihails Scepanskis and Andris Jakovics. The Lagrangian Particles in the EM Driven Turbulent Flow with Fine Mesh (the Treatment of the solidParticle Library in OpenFOAM). Technical report, University of Latvia, 2011.
[30] L. Schiller and Z. Naumann. Über die grundlegenden Berechnungen bei der Schwerkraftaufbereitung. Zeitschrift des Vereins Deutscher Ingenieure, 1933.
[31] Joachim Schöberl. NETGEN - Automatic Mesh Generator. http://www.hpfem.jku.at/netgen/, 2009.
[32] Symscape. GPU Linear Solver Library for OpenFOAM. http://www.symscape.com/gpu-openfoam, 2011.
[33] Matthew Yi. THEY GOT GAME - Stacks of new releases for hungry video game enthusiasts mean it's boom time for an industry now even bigger than Hollywood. http://www.sfgate.com/cgi-bin/article.cgi?f=/chronicle/archive/2004/12/18/MNGUOAE36I1.DTL, 2004.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!