Experiences and Challenges Scaling PFLOTRAN, a PETSc-based ...

The flux F_{nn'} across the n-n' interface connecting volumes V_n and V_{n'} is defined by

    F_{nn'} = (q\rho)_{nn'}\, X_{nn'} - (\phi D \rho)_{nn'}\, \frac{X_n - X_{n'}}{d_n + d_{n'}},    (8)

where the subscript nn' indicates that the quantity is evaluated at the interface, and the quantities d_n, d_{n'} denote the distances from the centers of the control volumes V_n, V_{n'} to their common interface with interfacial area A_{nn'}. In general, R_n is a nonlinear function of the independent field variables. We use an inexact Newton method to solve the discretized equations for zero residual.

Within the flow and transport modules the equations are solved fully implicitly, but because transport generally requires much smaller time steps than flow, we couple these modules sequentially. A linear interpolation is used to obtain flow field variables within the transport solve. To account for changes in porosity and permeability due to mineral reactions, the transport solver calculates an updated porosity over a flow time step, and the revised porosity is passed back to the flow solver. Future implementations may explore independent grid hierarchies for flow and transport, as well as fully coupled schemes.

3 Architecture and Parallel Implementation

PFLOTRAN has been written from the ground up with parallel scalability in mind and can run on machines ranging from laptops to the largest massively parallel computer architectures. Through judicious use of Fortran 90/95 features, the code employs a highly modular, object-oriented design that can hide many of the details of the parallel implementation from the user, if desired. PFLOTRAN is built on top of the PETSc framework [2; 3; 4] and uses numerous features from PETSc, including nonlinear solvers, linear solvers, sparse matrix data structures (both blocked and non-blocked matrices), vectors, constructs for the parallelism of PDEs on structured grids, the options database (runtime control of solver options), and binary I/O.

PFLOTRAN's parallel paradigm is based on domain decomposition: each MPI process is assigned a subdomain of the problem and a parallel solve is implemented over all processes. Message passing (3D "halo exchange") is required at the subdomain boundaries with adjacent MPI processes to fill ghost points in order to compute the flux terms (Equation 8). A number of different preconditioners from PETSc or other packages (PETSc provides interfaces to several) can be employed, but currently we usually use single-level domain decomposition preconditioners inside a global inexact Newton-Krylov solver. Within the Krylov solver, gather/scatter operations are needed to handle off-processor vector elements in matrix-vector multiplies, and numerous MPI_Allreduce() operations are required to compute vector inner products and norms, making communication highly latency bound.
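PFLOTRAN's source is not reproduced in this paper, but the pattern just described can be illustrated with a minimal PETSc sketch: a DMDA-managed structured 3D grid partitioned across MPI processes, a ghost-point ("halo") exchange before the residual evaluation that computes the interface fluxes, and an inexact Newton-Krylov solve driven by SNES. The sketch is written against a recent PETSc release using the C API (PFLOTRAN itself is Fortran 90/95); the grid dimensions, degrees of freedom, and residual body are placeholders, not PFLOTRAN's.

#include <petscdmda.h>
#include <petscsnes.h>

/* Residual evaluation: fill ghost points from neighboring subdomains,
   then compute R(x) over the locally owned control volumes. */
static PetscErrorCode FormFunction(SNES snes, Vec X, Vec R, void *ctx)
{
  DM  da;
  Vec Xloc;

  PetscFunctionBeginUser;
  PetscCall(SNESGetDM(snes, &da));
  PetscCall(DMGetLocalVector(da, &Xloc));
  /* 3D halo exchange: ghost values from adjacent subdomains are needed
     to evaluate the interface fluxes F_{nn'} of Equation (8). */
  PetscCall(DMGlobalToLocalBegin(da, X, INSERT_VALUES, Xloc));
  PetscCall(DMGlobalToLocalEnd(da, X, INSERT_VALUES, Xloc));
  /* ... loop over local control volumes and accumulate fluxes into R ... */
  PetscCall(VecCopy(X, R)); /* placeholder residual so the sketch runs */
  PetscCall(DMRestoreLocalVector(da, &Xloc));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  DM   da;
  SNES snes;
  Vec  x, r;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* Distributed structured grid: each rank owns a subdomain plus one
     layer of ghost points (stencil width 1).  Sizes are illustrative. */
  PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR,
                         64, 64, 64, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                         1 /* dof per cell */, 1 /* stencil width */,
                         NULL, NULL, NULL, &da));
  PetscCall(DMSetFromOptions(da));
  PetscCall(DMSetUp(da));
  PetscCall(DMCreateGlobalVector(da, &x));
  PetscCall(VecDuplicate(x, &r));

  /* Inexact Newton-Krylov solve; the Krylov method and preconditioner are
     selected through the runtime options database.  No Jacobian routine is
     set here, so SNES falls back to a coloring-based finite-difference
     Jacobian provided by the DMDA. */
  PetscCall(SNESCreate(PETSC_COMM_WORLD, &snes));
  PetscCall(SNESSetDM(snes, da));
  PetscCall(SNESSetFunction(snes, r, FormFunction, NULL));
  PetscCall(SNESSetFromOptions(snes));
  PetscCall(SNESSolve(snes, NULL, x));

  PetscCall(SNESDestroy(&snes));
  PetscCall(VecDestroy(&r));
  PetscCall(VecDestroy(&x));
  PetscCall(DMDestroy(&da));
  PetscCall(PetscFinalize());
  return 0;
}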
4 Parallel Performance

In this section, we examine performance in a strong-scaling context using a benchmark problem from a model of a hypothetical uranium plume at the Hanford 300 Area in southeastern Washington State, described in [5]. The computational domain of the problem measures 1350 x 2500 x 20 meters (x, y, z) and utilizes complex stratigraphy (Figure 1) mapped from the Hanford EarthVision database [6], with material properties provided by [7]. The stratigraphy must be read from a large HDF5 file at initialization time. We examine aspects of both computation and I/O in this section.

In the benchmark runs reported here, we utilize an inexact Newton method with a fixed tolerance for the inner, linear solve. The linear (Jacobian) solves employ the stabilized bi-conjugate gradient solver (BiCGstab, PETSc's BCGS implementation), with a block-Jacobi preconditioner that applies an incomplete LU decomposition with zero fill-in (ILU(0)) on each block. (For linear solves with block-structured matrices, a block-ILU(0) solve is employed.) Our choice of preconditioner is a very simple one, but we have been surprised at how well it has worked at scale on large machines such as Jaguar. We have experimented with more sophisticated preconditioners such as the Hypre/BoomerAMG and Trilinos/ML algebraic multigrid solvers, but we have yet to identify a preconditioner that is as robust and as scalable at high processor counts as block-Jacobi. Solvers such as BoomerAMG display excellent convergence behavior, but show unacceptable growth in setup time above 1000 cores. (Determining how to make good use of multi-level solvers is an area of active research for us.)
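As a concrete illustration of the runtime options database mentioned in Section 3, the flags below select roughly the solver combination just described: an inexact Newton outer solve, BiCGstab for the linear (Jacobian) solves, and a block-Jacobi preconditioner applying ILU(0) on each subdomain block. These are generic PETSc option names with illustrative values; the executable invocation, PFLOTRAN's actual option prefixes, and the tolerances used in the benchmark runs are assumptions and may differ.

mpiexec -n 4096 ./pflotran \
    -snes_type newtonls \
    -ksp_type bcgs \
    -ksp_rtol 1.0e-1 \
    -pc_type bjacobi \
    -sub_pc_type ilu \
    -sub_pc_factor_levels 0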
