Experiences and Challenges Scaling PFLOTRAN, a PETSc-based Code for Subsurface Reactive Flow Simulations, Towards the Petascale on Cray XT Systems

Richard Tran Mills
Oak Ridge National Laboratory, Oak Ridge, TN 37831

Vamsi Sripathi and G. (Kumar) Mahinthakumar
North Carolina State University, Raleigh, North Carolina 27695

Glenn E. Hammond
Pacific Northwest National Laboratory, Richland, WA 99352

Peter C. Lichtner
Los Alamos National Laboratory, Los Alamos, NM 87545

Barry F. Smith
Argonne National Laboratory, Argonne, IL 60439

ABSTRACT: We describe our experiences running PFLOTRAN—a code for simulation of coupled hydro-thermal-chemical processes in variably saturated, non-isothermal, porous media—on the Cray XT series of computers, including initial experiences running on the petaflop incarnation of Jaguar, the Cray XT5 at the National Center for Computational Sciences at Oak Ridge National Laboratory. PFLOTRAN utilizes fully implicit time-stepping and is built on top of the Portable, Extensible Toolkit for Scientific Computation (PETSc). We discuss some of the hurdles to "at scale" performance with PFLOTRAN and the progress we have made in overcoming them on the Cray XT4 and XT5 architectures.

KEYWORDS: PFLOTRAN, PETSc, groundwater, solvers

1 Introduction

Over the past several decades, subsurface (groundwater) flow and transport models have become vital tools for the U.S. Department of Energy (DOE) in its environmental stewardship mission. These models have been employed to evaluate the impact of alternative energy sources and the efficacy of proposed remediation strategies for legacy waste sites. For years, traditional models—simulating single-phase groundwater flow and single-component solute transport within a single continuum, with basic chemical reactions such as aqueous complexing, mineral precipitation/dissolution and linear sorption to rock/soil surfaces, and including radioactive decay—have been employed. Although these simplified groundwater models are still in wide use, advances in subsurface science have enabled the development of more sophisticated models that employ multiple fluid phases and chemical components coupled through a suite of biological and geochemical reactions at multiple scales. With this increased complexity, however, comes the need for more computational power, typically far beyond that of the average desktop computer. This is especially true when applying these sophisticated multiphase flow and multicomponent reactive transport mod-


The flux F_{nn'} across the n-n' interface connecting volumes V_n and V_{n'} is defined by

    F_{nn'} = (q\rho)_{nn'} X_{nn'} - (\phi D \rho)_{nn'} \frac{X_n - X_{n'}}{d_n + d_{n'}},    (8)

where the subscript nn' indicates that the quantity is evaluated at the interface, and the quantities d_n, d_{n'} denote the distances from the centers of the control volumes V_n, V_{n'} to their common interface with interfacial area A_{nn'}. In general, R_n is a nonlinear function of the independent field variables. We use an inexact Newton method to solve the discretized equations for zero residual.

Within the flow and transport modules the equations are solved fully implicitly, but because transport generally requires much smaller time steps than flow, we couple these modules sequentially. A linear interpolation is used to obtain flow field variables within the transport solve. To account for changes in porosity and permeability due to mineral reactions, the transport solver is used to calculate an updated porosity over a flow time step, and the revised porosity is passed back to the flow solver. Future implementations may explore independent grid hierarchies for flow and transport, as well as fully coupled schemes.

3 Architecture and Parallel Implementation

PFLOTRAN has been written from the ground up with parallel scalability in mind, and can run on machines ranging from laptops to the largest massively parallel computer architectures. Through judicious use of Fortran 90/95 features, the code employs a highly modular, object-oriented design that can hide many of the details of the parallel implementation from the user, if desired. PFLOTRAN is built on top of the PETSc framework [2; 3; 4] and uses numerous features from PETSc, including nonlinear solvers, linear solvers, sparse matrix data structures (both blocked and non-blocked matrices), vectors, constructs for the parallelism of PDEs on structured grids, an options database (runtime control of solver options), and binary I/O.

PFLOTRAN's parallel paradigm is based on domain decomposition: each MPI process is assigned a subdomain of the problem and a parallel solve is implemented over all processes. Message passing (a 3D "halo exchange") is required at the subdomain boundaries with adjacent MPI processes to fill ghost points in order to compute the flux terms (Equation 8). A number of different preconditioners from PETSc or other packages (PETSc provides interfaces to several) can be employed, but currently we usually use single-level domain decomposition preconditioners inside a global inexact Newton-Krylov solver. Within the Krylov solver, gather/scatter operations are needed to handle off-processor vector elements in matrix-vector multiplies, and numerous MPI_Allreduce() operations are required to compute vector inner products and norms, making communication highly latency bound.
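In PFLOTRAN this ghost-point communication is handled by PETSc's structured-grid (DA) constructs rather than by hand-written message passing; the sketch below is therefore only an illustration, in plain MPI, of the 3D halo-exchange pattern just described. The subdomain size and the exchanged values are placeholders, and only one of the two exchanges per coordinate direction is shown.

    program halo_exchange_sketch
      use mpi
      implicit none
      integer, parameter :: nloc = 32                    ! hypothetical local cells per side
      integer :: ierr, nprocs, cart_comm, dir, src, dest
      integer :: dims(3), status(MPI_STATUS_SIZE)
      logical :: periods(3)
      real(8) :: ghost_send(nloc*nloc), ghost_recv(nloc*nloc)

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Arrange the MPI processes as a 3D process mesh, one subdomain per process.
      dims = 0
      periods = .false.
      call MPI_Dims_create(nprocs, 3, dims, ierr)
      call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., cart_comm, ierr)

      ghost_send = 1.0d0    ! stand-in for the boundary layer of the local subdomain

      ! Exchange one layer of ghost points with the neighbor in each coordinate
      ! direction so that interface fluxes (Equation 8) can be evaluated on
      ! subdomain boundaries; the reverse-direction exchange is analogous.
      do dir = 0, 2
         call MPI_Cart_shift(cart_comm, dir, 1, src, dest, ierr)
         call MPI_Sendrecv(ghost_send, nloc*nloc, MPI_DOUBLE_PRECISION, dest, 0, &
                           ghost_recv, nloc*nloc, MPI_DOUBLE_PRECISION, src,  0, &
                           cart_comm, status, ierr)
      end do

      call MPI_Finalize(ierr)
    end program halo_exchange_sketch

Because the halo exchange is point-to-point between neighbors, it scales well; as discussed below, it is the global reductions in the Krylov solver that become the bottleneck at scale.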
4 Parallel Performance

In this section, we examine performance in a strong-scaling context using a benchmark problem from a model of a hypothetical uranium plume at the Hanford 300 Area in southeastern Washington state, described in [5]. The computational domain of the problem measures 1350 × 2500 × 20 meters (x, y, z) and utilizes complex stratigraphy (Figure 1) mapped from the Hanford EarthVision database [6], with material properties provided by [7]. The stratigraphy must be read from a large HDF5 file at initialization time. We examine aspects of both computation and I/O in this section.

In the benchmark runs here, we utilize an inexact Newton method with a fixed tolerance for the inner, linear solve. The linear (Jacobian) solves employ PETSc's BCGS implementation of the stabilized bi-conjugate gradient method (BiCGstab), with a block-Jacobi preconditioner that applies an incomplete LU decomposition with zero fill-in (ILU(0)) on each block. (For linear solves with block-structured matrices, a block-ILU(0) solve is employed.) Our choice of preconditioner is a very simple one, but we have been surprised at how well it has worked at scale on large machines such as Jaguar. We have experimented with more sophisticated preconditioners such as the Hypre/BoomerAMG and Trilinos/ML algebraic multigrid solvers, but we have yet to identify a preconditioner that is as robust and as scalable at high processor counts as block-Jacobi. Solvers such as BoomerAMG display excellent convergence behavior, but show unacceptable growth in setup time above 1000 cores. (Determining how to make good use of multi-level solvers is an area of active research for us.)
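This solver stack is configured through PETSc rather than hand-coded in PFLOTRAN. The sketch below shows roughly what such a configuration could look like from Fortran; it is not PFLOTRAN source, and the include/module conventions assume a recent PETSc release (they have changed between versions). The same choices can also be made entirely at run time through the options database, e.g. -ksp_type bcgs -pc_type bjacobi -sub_pc_type ilu.

    subroutine setup_flow_ksp(ksp, ierr)
    #include <petsc/finclude/petscksp.h>
      use petscksp
      implicit none
      KSP            :: ksp
      PC             :: pc
      PetscErrorCode :: ierr

      call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
      call KSPSetType(ksp, KSPBCGS, ierr)       ! stabilized bi-conjugate gradients
      call KSPGetPC(ksp, pc, ierr)
      call PCSetType(pc, PCBJACOBI, ierr)       ! one block per process; ILU(0) is the
                                                ! default sub-preconditioner on each block
      call KSPSetFromOptions(ksp, ierr)         ! allow run-time overrides, e.g. -ksp_type ibcgs
    end subroutine setup_flow_ksp

Because KSPSetFromOptions() is called last, command-line options can override these defaults, which is how an alternative Krylov method (such as the restructured IBCGS solver discussed later in Section 4.1) can be selected without recompiling.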


Figure 3: Growth of BiCGstab iterations for the strong-scaling case depicted in Figure 2 (relative growth in BiCGstab iterations versus number of processor cores).

Figure 4: Proportion of time spent in vector dot product and norm computations for the strong-scaling case depicted in Figure 2.


Figure 5: Combined cost of all vector dot product or norm calculations for the strong-scaling case depicted in Figure 2.

4.1 Computational Phase

4.1.1 Strong-scaling, flow, 270M DoF

Figure 2 depicts the strong-scaling behavior of the flow solver for a 1350 × 2500 × 80 cell flow-only version of the problem (270 million total degrees of freedom) run on the Cray XT4 at ORNL/NCCS. We ran the problem for 50 time steps. The scaling behavior is good all the way out to 27580 processor cores (near the limit of the number of cores available on the machine with 800 MHz DDR2 memory—we did not wish to introduce load imbalance by mixing memory speeds), although some departure from linear scaling is seen. Because we employ a single-level domain decomposition preconditioner (block-Jacobi), it might be supposed that the loss of effectiveness of the preconditioner as the number of processor cores grows (and hence the size of each local subdomain shrinks) is responsible for a significant portion of the departure from linear scaling. Figure 3 shows that the growth in the total number of BiCGstab iterations is actually very modest, with only approximately 10% more iterations required at 27580 cores than at 1024 cores. The departure from linear scaling appears to be almost entirely due to the large growth in the proportion of time spent in the computation of vector inner products (dot products) and norms (Figure 4), which is dominated by the cost of MPI_Allreduce() operations. In fact, if we examine the strong-scaling behavior of the cost of vector dot products and norms versus the cost of all other function calls (Figures 5 and 6), we see that the strong scalability of the dot products and norms is quite poor, while the scalability of everything else is essentially linear (and even slightly superlinear at 16384 cores).

The time spent in MPI_Allreduce() calls was further investigated using CrayPAT, which breaks the time into two components: MPI_Allreduce and MPI_Allreduce (Sync). According to Cray, MPI_Allreduce (Sync) is due to the implicit synchronization operation in the MPI_Allreduce() call and is likely caused by load imbalance. However, further analysis of the code indicated that only 5% of the processors incur any significant application load imbalance. To investigate this further, we developed a perfectly load balanced dot product kernel that performs approximately the same number of MPI_Allreduce() calls as PFLOTRAN and takes approximately the same execution time (accomplished via redundant local dot products). This kernel yielded nearly the same amount of MPI_Allreduce (Sync) time, indicating that the MPI_Allreduce (Sync) time experienced by the application is not due to application load imbalance but is likely due to system load imbalance (i.e., performance variability among cores).
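The sketch below illustrates the idea behind such a probe (the sizes and repetition counts are placeholders, not the actual kernel): every rank performs identical local work before each MPI_Allreduce(), so any MPI_Allreduce (Sync) time that the profiler still attributes to the run cannot be caused by application load imbalance.

    program allreduce_probe
      use mpi
      implicit none
      integer, parameter :: nlocal = 100000     ! hypothetical local vector length
      integer, parameter :: nreps  = 10000      ! hypothetical number of reductions
      integer :: ierr, myrank, i, j
      real(8) :: x(nlocal), local_dot, global_dot, t0, t1

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      x = 1.0d0

      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t0 = MPI_Wtime()
      do i = 1, nreps
         local_dot = 0.0d0
         do j = 1, nlocal                       ! identical local work on every rank
            local_dot = local_dot + x(j)*x(j)
         end do
         call MPI_Allreduce(local_dot, global_dot, 1, MPI_DOUBLE_PRECISION, &
                            MPI_SUM, MPI_COMM_WORLD, ierr)
      end do
      t1 = MPI_Wtime()

      if (myrank == 0) print '(a,f10.3)', 'balanced dot-product loop (s): ', t1 - t0
      call MPI_Finalize(ierr)
    end program allreduce_probe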


Figure 6: Combined cost of all routines that are not vector dot product or norm calculations for the strong-scaling case depicted in Figure 2.

To improve MPI_Allreduce() performance on the Cray XT4, we experimented with a number of MPI communication related environment variables. Of these settings, MPICH_PTL_MATCH_OFF had the greatest impact, as it eliminates a Portals system call that can add significant overhead to latency-dominated applications such as PFLOTRAN (dominated by MPI_Allreduce()s of a single number in the flow solver). Other options such as MPICH_FAST_MEMCPY and MPICH_COLL_OPT_ON (which we note has become the default behavior in MPT 3) also had some impact. These optimizations reduced the MPI_Allreduce() time by approximately 18%.

As the MPI_Allreduce() calls performed during the BiCGstab solve dominate the overall execution time at large core counts on the XT systems, we implemented an alternative, "improved" version of the algorithm [8] as a new KSP solver ("IBCGS") in PETSc. The standard BiCGstab method requires four MPI_Allreduce() calls per iteration (including the vector norm computation to check convergence). The restructured algorithm used in IBCGS requires only two MPI_Allreduce() calls, at the expense of a considerably more complicated algorithm, more local computation (additional vector operations), and computation of a transpose matrix-vector product at the beginning of the solve. By lagging the calculation of the residual norm, it is possible to further reduce communication and wrap everything into one MPI_Allreduce() per iteration, at the cost of doing one additional iteration. Despite the additional computation involved, the restructured solver is generally faster than the standard BiCGstab solver on the XT4/5 when using more than a few hundred processor cores. The table below displays the breakdown (as reported by CrayPAT) of time spent in MPI_SYNC, MPI, and User time when using the traditional ("BCGS") and restructured ("IBCGS") algorithms for a 30 time step run (including initialization time but no disk output) of the benchmark problem on 16384 cores of the XT4 version of Jaguar. The amount of time spent in User computations is larger in the IBCGS case, but the time spent in MPI_SYNC and MPI is far lower.

    Group       Traditional (BCGS)   Restructured (IBCGS)
    MPI_SYNC    196                  120
    MPI         150                   79
    User        177                  200
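The following sketch is not the IBCGS algorithm itself; it only illustrates the communication-restructuring idea behind it. Several independent inner products share a single MPI_Allreduce() on a short buffer, so the number of latency-bound collectives per iteration drops even though the arithmetic does not (the routine and variable names here are illustrative).

    subroutine fused_inner_products(n, r, s, t, rho, omega_den, comm, ierr)
      use mpi
      implicit none
      integer, intent(in)  :: n, comm
      real(8), intent(in)  :: r(n), s(n), t(n)
      real(8), intent(out) :: rho, omega_den
      integer, intent(out) :: ierr
      real(8) :: partial(2), total(2)

      partial(1) = dot_product(r, s)     ! local part of the first inner product
      partial(2) = dot_product(r, t)     ! local part of the second inner product

      ! One latency-bound collective where the naive formulation would use two.
      call MPI_Allreduce(partial, total, 2, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)

      rho       = total(1)
      omega_den = total(2)
    end subroutine fused_inner_products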


4.1.2 Strong-scaling, transport, 2B DoF

We have recently made some preliminary benchmark runs on the petaflop incarnation of Jaguar, the Cray XT5 at ORNL/NCCS, with a coupled flow and transport problem from the Hanford 300 Area. The problem consists of 850 × 1000 × 160 cells and includes 15 chemical components, resulting in 136 million total degrees of freedom in the flow solve and 2.04 billion total degrees of freedom in the transport solve. This problem size is at the limit of what we can do with 32-bit array indexing. (Although PFLOTRAN and PETSc can support 64-bit indices, some work will be required to get our code working with some of the libraries we link with.) The problem was run for 25 time steps of both flow and transport. In many simulations, we would use a longer time step for flow than for transport, but in this case there is a time-varying boundary condition (the stage of the adjacent Columbia River) that limits the size of the flow time step (to a maximum of one hour), and higher resolution for the transport step is not needed.

Figure 7 depicts the preliminary scaling behavior we have observed on up to 65536 processor cores on Jaguar. Figures 8 and 9 break the performance down into time spent in the flow and transport solves, respectively. It is not surprising that the scalability of the flow problem is poor: at only 136 million degrees of freedom, this is a very small problem to be running on so many processor cores. We employ the large number of processor cores for the benefit of the much larger (15 times) and computationally more expensive transport problem. We note that the coupled flow and reactive transport problems that we might typically run with PFLOTRAN will generally involve transport problems considerably larger than their flow counterparts, perhaps going as high as 20 chemical components. If kinetic rate laws are used for sorption, then the size of the transport problem becomes even larger because a kinetic sorption equation (without flux terms) must be added for each kinetic reaction. It is possible that at some point it may make sense to solve the flow problem redundantly on disjoint groups of MPI processes, much as parallel multi-level solvers do for coarse-grid problems. This may not be necessary, however, as the cost of the transport solves may be dominant, and in many cases it is possible to run the flow solves with a much larger time step size than the transport solve.

We have not yet had the opportunity to conduct further experiments to understand and improve the performance of this benchmark problem. We believe that running it with the restructured BiCGstab solver in PETSc will be beneficial, but we first need to add support in PETSc for the transpose matrix-vector product for blocked sparse matrices with large block size. The data we collected during the run indicate that there are some significant sources of load imbalance that do not derive from the partitioning of work among the processors. (We believe that the partitioning is good, as this problem employs a regular grid and the ratio of maximum to minimum times spent in some key routines is not high even at 65536 cores; for example, it is 1.2 inside the Jacobian evaluation routines.) More detailed evaluation (running experiments with MPI_Barrier() calls inserted before collective communications, for example) is needed to identify the sources of load imbalance.
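A sketch of the kind of barrier-instrumented collective we have in mind is shown below (the routine and timer names are hypothetical): the explicit barrier absorbs the time ranks spend waiting for late arrivals, so the remaining time measures the collective itself rather than load imbalance.

    subroutine timed_allreduce(local_val, global_val, comm, wait_time, reduce_time, ierr)
      use mpi
      implicit none
      real(8), intent(in)    :: local_val
      real(8), intent(out)   :: global_val
      integer, intent(in)    :: comm
      real(8), intent(inout) :: wait_time, reduce_time
      integer, intent(out)   :: ierr
      real(8) :: t0, t1, t2

      t0 = MPI_Wtime()
      call MPI_Barrier(comm, ierr)          ! absorbs time spent waiting for late ranks
      t1 = MPI_Wtime()
      call MPI_Allreduce(local_val, global_val, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
      t2 = MPI_Wtime()

      wait_time   = wait_time   + (t1 - t0) ! attributable to load imbalance
      reduce_time = reduce_time + (t2 - t1) ! attributable to the collective itself
    end subroutine timed_allreduce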
4.2 I/O Phase

PFLOTRAN originally employed serialized I/O through MPI process 0. Because PFLOTRAN spends relatively little time in I/O, this actually proved workable up to about 8000 processor cores. To scale beyond this, we added parallel I/O in the form of 1) routines employing parallel HDF5 for reading input files and writing simulation output files, and 2) direct MPI-IO calls in a PETSc Viewer backend (for checkpointing). This greatly increased the scalability of the code, but initialization costs still became unacceptable when using more than about 16000 cores (roughly half the size of the Jaguar XT4 system). Here we detail how we have mitigated this problem.

The Cray XT architecture uses the Lustre file system (lfs) [9] to facilitate high-bandwidth I/O for applications. Lustre is an object-based parallel file system with three main components: Object Storage Servers (OSSs), Metadata Servers (MDSs), and file system clients. Each OSS can serve multiple Object Storage Targets (OSTs), which act as I/O servers; the MDS manages the names and directories for the file system; and the file system clients are the Cray XT compute nodes. The Cray XT5 system has 672 OSTs, 1 MDS, and 37,544 file system clients (compute nodes). Files are broken into chunks and stored as objects on the OSTs in a round-robin fashion. The size of the object stored on a single OST is defined as the file stripe size, and the number of OSTs along which the file is striped is defined as the file stripe count. The default stripe size is 1 MiB and the default stripe count is 4 OSTs. We have used the default lfs settings for our runs, but these parameters can be modified by the user with the help of lfs commands.
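For reference, the baseline parallel-HDF5 pattern, in which every process opens the file through the MPI-IO driver and participates in the read, looks roughly like the sketch below. The file and dataset names are hypothetical, not PFLOTRAN's; this is the all-processes-read pattern whose initialization cost is examined below.

    subroutine read_dataset_all_ranks(buf, n, ierr)
      use mpi
      use hdf5
      implicit none
      integer, intent(in)  :: n
      integer, intent(out) :: buf(n)
      integer, intent(out) :: ierr
      integer(hid_t)   :: fapl_id, file_id, dset_id
      integer(hsize_t) :: dims(1)

      call h5open_f(ierr)
      call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierr)
      ! Every MPI process participates in the open and the read.
      call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
      call h5fopen_f("hanford_300_area.h5", H5F_ACC_RDONLY_F, file_id, ierr, access_prp=fapl_id)

      dims(1) = n
      call h5dopen_f(file_id, "Material_Ids", dset_id, ierr)
      call h5dread_f(dset_id, H5T_NATIVE_INTEGER, buf, dims, ierr)

      call h5dclose_f(dset_id, ierr)
      call h5fclose_f(file_id, ierr)
      call h5pclose_f(fapl_id, ierr)
      call h5close_f(ierr)
    end subroutine read_dataset_all_ranks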


Figure 7: Combined strong-scaling performance of flow and transport solves for a coupled flow and reactive transport simulation on an 850 × 1000 × 160 cell grid with 15 chemical components.

Figure 8: Strong-scaling performance of the flow solve component (136 million degrees of freedom) of the strong-scaling study depicted in Figure 7.


Figure 9: Strong-scaling performance of the transport solve component (2.04 billion degrees of freedom) of the strong-scaling study depicted in Figure 7.

Figure 10: Performance comparison of the original and improved initialization routines on the 270 million degrees of freedom flow problem on the Cray XT5. The curve labeled "original" depicts the performance when all processes open and read from the HDF5 file, and the "improved" curve depicts performance when only a subset of processes read the data and then broadcast to the other members of their I/O group.


Depending upon the file access pattern, the lfs settings will have an impact on I/O performance.

From the scaling studies performed on the Cray XT, it was observed that the initialization phase of PFLOTRAN does not scale well with increasing processor count. The first curve in Figure 10 shows the timing of the initialization phase at different processor counts on the Cray XT5. The initialization phase consists almost entirely of HDF5 routines which read from a single large HDF5 input file. All processes participate in a parallel read operation that involves a contiguous region of the HDF5 datasets, and there are two different read patterns: either every process reads the dataset entirely, or every process reads a chunk of the dataset starting from its own offset within the dataset. For this kind of read pattern, the lfs settings (stripe size and stripe count) have little impact on performance because most of the processes will be accessing a single OST at any given point during execution.

To investigate the reasons for the poor performance of the initialization phase at scale, we used CrayPAT to profile PFLOTRAN on up to 27572 cores of the Cray XT4. Figure 11 shows the percentage contribution of different routines to overall execution time. The profiling was done for the 270M DoF test problem with 30 time steps and with no output. From the figure, it is evident that the MPI routines MPI_File_open(), MPI_File_close(), and MPI_File_read_at() contribute significantly at higher processor counts. It must be noted here that for the entire file system there is only one metadata server (MDS), and every time a process needs to open or close a file it must poll the MDS. As the number of clients increases, file open/close becomes a much costlier operation, and this setup becomes a bottleneck for I/O performance. This has been identified as a problem for system scalability by others [10; 11].

To avoid this performance penalty at higher processor counts, we have modified the read mechanism within PFLOTRAN. Instead of having every process participate in the read operations, we implemented a mechanism in which only a subset of processes execute the HDF5 routines and are involved in the read operations. The code example below describes the creation of the new sub-communicators that are used in the improved I/O method.

    ! Creation of the sub-communicators used in the improved I/O method (fragment;
    ! prop_id, integer_array, and HDF5_GROUP_SIZE are declared/set elsewhere).
    integer :: iogroup, readers, global_comm, global_rank, group_size
    integer :: color, key, reader_color, reader_key, length, ierr, hdf5_err

    group_size  = HDF5_GROUP_SIZE        ! e.g., 1024
    global_comm = MPI_COMM_WORLD
    call MPI_Comm_rank(global_comm, global_rank, ierr)

    ! Communicator containing the members of this rank's I/O group.
    color = global_rank / group_size
    key   = global_rank
    call MPI_Comm_split(global_comm, color, key, iogroup, ierr)

    ! Communicator containing only the reader (lowest) rank of each I/O group.
    if (mod(global_rank, group_size) == 0) then
       reader_color = 1
    else
       reader_color = 0
    endif
    reader_key = global_rank
    call MPI_Comm_split(global_comm, reader_color, reader_key, readers, ierr)

    ! Only the readers open and read the dataset, collectively over "readers".
    if (reader_color == 1) then
       call h5pset_fapl_mpio_f(prop_id, readers, MPI_INFO_NULL, hdf5_err)
       call h5dread_f(..., H5T_NATIVE_INTEGER, integer_array, length, hdf5_err)
    endif

    ! Each reader then distributes the data to the other members of its I/O group.
    call MPI_Bcast(integer_array, length, MPI_INTEGER, 0, iogroup, ierr)

Temporary buffers must be allocated at the reader processes to hold the data for their group. The reader processes collect the offset values and chunk sizes from the processes in their group by using MPI_Gather().
After the designated processes have completed reading the data from the file, they distribute the data to the rest of the processes in their respective groups using MPI_Bcast() or MPI_Scatterv(), depending upon the offset values and chunk sizes. By careful selection of the reader processes, we ensure that there is no network congestion at this point. Even though we did not encounter out-of-memory problems in our runs, we can avoid this situation in the future, if it arises, by using a smaller group size or by using multiple MPI_Bcast()/MPI_Scatterv() calls.

With these modifications, we were able to reduce the time spent in the initialization phase from 1700 to 40 seconds at 98304 cores of the Cray XT5. The second curve in Figure 10 shows the timing of the initialization phase with the improved I/O method. For the runs up to 16384 cores we used a group size of 1024, and for the runs above 16384 cores a group size of 8192.
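The distribution step might look roughly like the sketch below (illustrative only; the communicator name iogroup follows the earlier snippet, and the counts are assumed to be element counts of the integer dataset). MPI_Scatterv() is the natural choice when the chunk sizes differ between group members; when every member needs the full dataset, a single MPI_Bcast() over iogroup suffices, as in the earlier snippet.

    subroutine distribute_group_data(iogroup, my_count, local_buffer, group_buffer, maxbuf, ierr)
      use mpi
      implicit none
      integer, intent(in)  :: iogroup, my_count, maxbuf
      integer, intent(out) :: local_buffer(my_count)
      integer, intent(in)  :: group_buffer(maxbuf)   ! filled by the reader's h5dread_f
      integer, intent(out) :: ierr
      integer :: group_rank, group_size, i
      integer, allocatable :: counts(:), displs(:)

      call MPI_Comm_rank(iogroup, group_rank, ierr)
      call MPI_Comm_size(iogroup, group_size, ierr)
      allocate(counts(group_size), displs(group_size))

      ! Each process tells its group's reader (rank 0 of iogroup) how many elements it needs.
      call MPI_Gather(my_count, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, 0, iogroup, ierr)

      if (group_rank == 0) then
         displs(1) = 0
         do i = 2, group_size
            displs(i) = displs(i-1) + counts(i-1)     ! contiguous slices in the group buffer
         end do
      end if

      ! Chunk sizes differ between processes, so MPI_Scatterv is used rather than MPI_Bcast.
      call MPI_Scatterv(group_buffer, counts, displs, MPI_INTEGER, &
                        local_buffer, my_count, MPI_INTEGER, 0, iogroup, ierr)
      deallocate(counts, displs)
    end subroutine distribute_group_data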


Figure 11: Breakdown of contributions of different routines to overall execution time of the 270 million degrees of freedom flow problem run on the Cray XT5 for 30 time steps, with no disk output.

5 Conclusions and Future Work

PFLOTRAN has only been under active development for a few years, but has already become a modular, object-oriented, extensible, and scalable code. We have demonstrated that it is currently able to make effective use of a significant portion of a petascale machine. It does not do so with anywhere near the efficiency that we would like, but it is possible to use the code to solve leadership-class computing problems right now.

Further development of PFLOTRAN continues at a rapid pace. One focus is on developing better solvers/preconditioners to further improve scalability. We are exploring several types of multi-level approaches, which we believe are ultimately necessary to achieve good weak scalability. These approaches will probably be combined with physics-based ones [12] that will allow multi-level methods to be employed on linear systems to which they are well suited. Though we did not discuss it in this paper, we are also making progress on adding structured adaptive mesh refinement capabilities to the code, which will greatly decrease the memory and computational requirements for simulations that must resolve certain fine-scale features. We are also adding support for unstructured meshes, which will allow static refinement of the grid around complex geologic features, around wells, etc. Support for multiple-continuum (subgrid) models will also be added to the code, which will dramatically increase the work associated with each grid cell and, incidentally, result in more efficient use of leadership-class machines, as much of the subgrid work can proceed in embarrassingly parallel fashion.

Acknowledgements

This research was partially sponsored by the Climate and Environmental Sciences Division (CESD) of the Office of Biological and Environmental Research (BER) and the Computational Science Research and Partnerships (SciDAC) Division of the Office of Advanced Scientific Computing Research (ASCR) within the U.S. Department of Energy's Office of Science (SC). This research used resources of the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL), which is managed by UT-Battelle, LLC, for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Pacific Northwest National Laboratory is managed for the U.S. Department of En-
