Experiences and Challenges Scaling PFLOTRAN, a PETSc-based ...

The flux F_{nn'} across the n-n' interface connecting volumes V_n and V_{n'} is defined by

    F_{nn'} = (q\rho)_{nn'}\, X_{nn'} - (\phi D \rho)_{nn'}\, \frac{X_n - X_{n'}}{d_n + d_{n'}},    (8)

where the subscript nn' indicates that the quantity is evaluated at the interface, and the quantities d_n, d_{n'} denote the distances from the centers of the control volumes V_n, V_{n'} to their common interface with interfacial area A_{nn'}. In general, R_n is a nonlinear function of the independent field variables. We use an inexact Newton method to solve the discretized equations for zero residual.

Within the flow and transport modules the equations are solved fully implicitly, but because transport generally requires much smaller time steps than flow, we couple these modules sequentially. A linear interpolation is used to obtain flow field variables within the transport solve. To account for changes in porosity and permeability due to mineral reactions, the transport solver calculates an updated porosity over a flow time step, and the revised porosity is passed back to the flow solver. Future implementations may explore independent grid hierarchies for flow and transport, as well as fully coupled schemes.

3 Architecture and Parallel Implementation

PFLOTRAN has been written from the ground up with parallel scalability in mind and can run on machines ranging from laptops to the largest massively parallel computer architectures. Through judicious use of Fortran 90/95 features, the code employs a highly modular, object-oriented design that can hide many of the details of the parallel implementation from the user, if desired. PFLOTRAN is built on top of the PETSc framework [2; 3; 4] and uses numerous features from PETSc, including nonlinear solvers, linear solvers, sparse matrix data structures (both blocked and non-blocked matrices), vectors, constructs for the parallelism of PDEs on structured grids, the options database (runtime control of solver options), and binary I/O.

PFLOTRAN's parallel paradigm is based on domain decomposition: each MPI process is assigned a subdomain of the problem and a parallel solve is implemented over all processes. Message passing (3D "halo exchange") is required at the subdomain boundaries with adjacent MPI processes to fill ghost points in order to compute the flux terms (Equation 8). A number of different preconditioners from PETSc or other packages (PETSc provides interfaces to several) can be employed, but currently we usually use single-level domain decomposition preconditioners inside a global inexact Newton-Krylov solver. Within the Krylov solver, gather/scatter operations are needed to handle off-processor vector elements in matrix-vector multiplies, and numerous MPI_Allreduce() operations are required to compute vector inner products and norms, making communication highly latency bound.
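PFLOTRAN's source is not reproduced in this paper, but the pattern just described can be illustrated with a minimal PETSc sketch: a DMDA-managed structured 3D grid partitioned across MPI processes, a ghost-point ("halo") exchange before the residual evaluation that computes the interface fluxes, and an inexact Newton-Krylov solve driven by SNES. The sketch is written against a recent PETSc release using the C API (PFLOTRAN itself is Fortran 90/95); the grid dimensions, degrees of freedom, and residual body are placeholders, not PFLOTRAN's.

#include <petscdmda.h>
#include <petscsnes.h>

/* Residual evaluation: fill ghost points from neighboring subdomains,
   then compute R(x) over the locally owned control volumes. */
static PetscErrorCode FormFunction(SNES snes, Vec X, Vec R, void *ctx)
{
  DM  da;
  Vec Xloc;

  PetscFunctionBeginUser;
  PetscCall(SNESGetDM(snes, &da));
  PetscCall(DMGetLocalVector(da, &Xloc));
  /* 3D halo exchange: ghost values from adjacent subdomains are needed
     to evaluate the interface fluxes F_{nn'} of Equation (8). */
  PetscCall(DMGlobalToLocalBegin(da, X, INSERT_VALUES, Xloc));
  PetscCall(DMGlobalToLocalEnd(da, X, INSERT_VALUES, Xloc));
  /* ... loop over local control volumes and accumulate fluxes into R ... */
  PetscCall(VecCopy(X, R)); /* placeholder residual so the sketch runs */
  PetscCall(DMRestoreLocalVector(da, &Xloc));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  DM   da;
  SNES snes;
  Vec  x, r;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* Distributed structured grid: each rank owns a subdomain plus one
     layer of ghost points (stencil width 1).  Sizes are illustrative. */
  PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR,
                         64, 64, 64, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                         1 /* dof per cell */, 1 /* stencil width */,
                         NULL, NULL, NULL, &da));
  PetscCall(DMSetFromOptions(da));
  PetscCall(DMSetUp(da));
  PetscCall(DMCreateGlobalVector(da, &x));
  PetscCall(VecDuplicate(x, &r));

  /* Inexact Newton-Krylov solve; the Krylov method and preconditioner are
     selected through the runtime options database.  No Jacobian routine is
     set here, so SNES falls back to a coloring-based finite-difference
     Jacobian provided by the DMDA. */
  PetscCall(SNESCreate(PETSC_COMM_WORLD, &snes));
  PetscCall(SNESSetDM(snes, da));
  PetscCall(SNESSetFunction(snes, r, FormFunction, NULL));
  PetscCall(SNESSetFromOptions(snes));
  PetscCall(SNESSolve(snes, NULL, x));

  PetscCall(SNESDestroy(&snes));
  PetscCall(VecDestroy(&r));
  PetscCall(VecDestroy(&x));
  PetscCall(DMDestroy(&da));
  PetscCall(PetscFinalize());
  return 0;
}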
4 Parallel Performance

In this section, we examine performance in a strong-scaling context using a benchmark problem from a model of a hypothetical uranium plume at the Hanford 300 Area in southeastern Washington State, described in [5]. The computational domain of the problem measures 1350 x 2500 x 20 meters (x, y, z) and utilizes complex stratigraphy (Figure 1) mapped from the Hanford EarthVision database [6], with material properties provided by [7]. The stratigraphy must be read from a large HDF5 file at initialization time. We examine aspects of both computation and I/O in this section.

In the benchmark runs reported here, we utilize an inexact Newton method with a fixed tolerance for the inner, linear solve. The linear (Jacobian) solves employ the stabilized bi-conjugate gradient solver (BiCGstab, PETSc's BCGS implementation), with a block-Jacobi preconditioner that applies an incomplete LU decomposition with zero fill-in (ILU(0)) on each block. (For linear solves with block-structured matrices, a block-ILU(0) solve is employed.) Our choice of preconditioner is a very simple one, but we have been surprised at how well it has worked at scale on large machines such as Jaguar. We have experimented with more sophisticated preconditioners such as the Hypre/BoomerAMG and Trilinos/ML algebraic multigrid solvers, but we have yet to identify a preconditioner that is as robust and as scalable at high processor counts as block-Jacobi. Solvers such as BoomerAMG display excellent convergence behavior, but show unacceptable growth in setup time above 1000 cores. (Determining how to make good use of multi-level solvers is an area of active research for us.)
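As a concrete illustration of the runtime options database mentioned in Section 3, the flags below select roughly the solver combination just described: an inexact Newton outer solve, BiCGstab for the linear (Jacobian) solves, and a block-Jacobi preconditioner applying ILU(0) on each subdomain block. These are generic PETSc option names with illustrative values; the executable invocation, PFLOTRAN's actual option prefixes, and the tolerances used in the benchmark runs are assumptions and may differ.

mpiexec -n 4096 ./pflotran \
    -snes_type newtonls \
    -ksp_type bcgs \
    -ksp_rtol 1.0e-1 \
    -pc_type bjacobi \
    -sub_pc_type ilu \
    -sub_pc_factor_levels 0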
