APCOM'07 in conjunction with EPMESC XI, December 3-6, 2007, Kyoto, JAPAN

Efficient Sequential and Parallel Solvers for hp Finite Element Method

Maciej Paszyński 1*, David Pardo 2, Carlos Torres-Verdin 2, Paweł Matuszyk 3
1 Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, Kraków, 30-059, Poland
2 Department of Petroleum and Geosystems Engineering, The University of Texas, 1 University Station C0300, Austin, Texas, 78712, USA
3 Department of Applied Computer Science and Modeling, AGH University of Science and Technology, Al. Mickiewicza 30, Kraków, 30-059, Poland
e-mail: paszynsk@agh.edu.pl, dzubiaur@gmail.com, cverdin@uts.cc.utexas.edu, pjm@agh.edu.pl
Abstract We present a sequential and parallel direct solver designed for the hp Finite Element Method (FEM), applied to numerous problems including the non-stationary heat transfer problem, the Stokes problem, and resistivity logging measurement simulations. The hp FEM incorporates a self-adaptive strategy that generates a sequence of hp-refined meshes, delivering exponential convergence of the numerical error with respect to the number of degrees of freedom (mesh size or CPU time). The hp meshes generated by the self-adaptive strategy are obtained by multiple h or p refinements of the initial mesh. The self-adaptive mesh generated in this way is stored as refinement trees growing down from nodes of the initial mesh. First, we eliminate degrees of freedom starting from the leaves of the refinement trees, and then we eliminate common degrees of freedom traveling up the refinement trees. The solver is parallelized by utilizing the domain decomposition paradigm: it generates Schur complements of local sub-systems, from the bottom of the refinement trees, through initial mesh elements and sub-domains. The global problem then reduces to one relatively small common "interface" problem, and finally backward substitution is executed to propagate the solution from the common interface, through sub-domains and initial mesh elements, down to the leaves of the refinement trees. The LU factorizations computed at different levels of the elimination trees are stored at tree nodes, to be reutilized by the solver after the computational mesh is locally refined. We also present performance measurements of the solver.
Key words: Finite Element Method, hp adaptivity, direct solver, parallel direct solver
INTRODUCTION
We present the data structures and efficient direct solvers for computational meshes utilized by fully automatic hp-adaptive 2D and 3D Finite Element Method (FEM) codes [1,2,3,4]. The codes generate a sequence of hp meshes delivering exponential convergence of the numerical error with respect to the number of degrees of freedom (mesh size or CPU time). The hp meshes consist of finite elements with varying sizes and polynomial orders of approximation, changing locally on finite element faces, edges, and interiors. The final optimal mesh is constructed by a sequence of h or p refinements executed on the initial mesh. An h refinement consists of breaking some finite elements into smaller son elements, whilst a p refinement consists of adjusting the polynomial orders of approximation on some element faces, edges, and interiors. In the utilized data structure, the h refinements are stored as trees growing from initial mesh elements. This allows us to propose an efficient direct solver working on the level of the refinement trees. The degrees of freedom are eliminated by traveling the refinement trees, from leaf nodes to the level of initial mesh element nodes. The local Schur complements associated with particular elimination levels can be stored in the tree nodes. Each time the mesh is locally refined, only the local Schur complements associated with newly refined nodes must be updated. The Schur complements associated with unrefined nodes can still be utilized when the solver executes over the newly refined mesh. The idea of the recursive solver working on the refinement trees is generalized into an elimination tree constructed out of initial mesh elements, as well as into an elimination tree built from the sub-domains obtained by domain decomposition of the entire mesh. The solvers were tested on a sequence of problems, including the non-stationary heat transfer problem, the Stokes problem [5], and the resistivity logging measurements simulation [6].
SEQUENTIAL AND PARALLEL DIRECT SOLVERS
In this section we present a classification of state-of-the-art sequential and parallel direct solvers dedicated to FEM computations.
1. Frontal solvers The solver browses finite elements in the order prescribed by the user and aggregates their degrees of freedom into the so-called frontal matrix. Based on the element connectivity information, it recognizes fully assembled degrees of freedom and eliminates them from the frontal matrix [7]. This is done to keep the size of the frontal matrix as small as possible. The key to efficient operation of the frontal solver is an optimal ordering of the finite elements.
2. Multifrontal solvers The solver constructs a degrees-of-freedom connectivity tree based on an analysis of the geometry of the computational domain [7]. This is usually done by utilizing a graph representation of the computational domain and a graph partitioning algorithm. The frontal elimination pattern is utilized on every tree branch. Finite elements are joined into pairs, and their degrees of freedom are assembled into the frontal matrix associated with the branch. The process is repeated until the root of the assembly tree is reached. Finally, the common dense problem is solved, and partial backward substitutions are recursively executed over the assembly tree.
3. Sub-structuring method solver This is a parallel solver working over a computational domain partitioned into multiple sub-domains. It works in the following steps [8]. First, the sub-domains' internal degrees of freedom are eliminated with respect to the interface degrees of freedom. Second, the interface problem is solved. Finally, the internal problems are solved by executing backward substitution on each sub-domain, utilizing the interface problem solution computed in the second step.
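The three steps can be exercised on a minimal model problem: 1D Poisson with a single interface node shared by two sub-domains. The Python sketch below is our own illustration (the toy matrices and function names are not from the paper's implementation):

```python
def condense(Aii, Aib, Abi, Abb, fi, fb):
    """Step 1: eliminate the sub-domain interior dof with respect to the
    interface dof; returns the Schur complement and the condensed load."""
    return Abb - Abi * Aib / Aii, fb - Abi * fi / Aii

def back_substitute(Aii, Aib, fi, ub):
    """Step 3: recover the interior dof from the interface solution."""
    return (fi - Aib * ub) / Aii

# Model: -u'' = 1 on (0, 4), u(0) = u(4) = 0, four unit finite elements.
# Unknowns sit at nodes 1, 2, 3; node 2 is the interface between the
# left sub-domain (interior node 1) and the right one (interior node 3).
sub = dict(Aii=2.0, Aib=-1.0, Abi=-1.0, Abb=1.0, fi=1.0, fb=0.5)

S1, g1 = condense(**sub)            # left sub-domain contribution
S2, g2 = condense(**sub)            # right sub-domain contribution
ub = (g1 + g2) / (S1 + S2)          # Step 2: assembled interface problem
u1 = back_substitute(sub["Aii"], sub["Aib"], sub["fi"], ub)
u3 = back_substitute(sub["Aii"], sub["Aib"], sub["fi"], ub)
print(u1, ub, u3)                   # 1.5 2.0 1.5
```

With many processors, Step 1 runs concurrently on every sub-domain, and only the small condensed pairs (S, g) need to be communicated.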
4. Multiple fronts solver This is the simplest implementation of the sub-structuring method solver [9]. It performs a partial frontal decomposition on each sub-domain, then sums up the contributions from particular sub-domains into one common interface problem. Finally, it solves the common interface problem by utilizing a sequential frontal solver.
5. Direct sub-structuring method solver In this version of the sub-structuring method solver, the interface problem is solved by utilizing a parallel solver [8].
6. Sparse direct method solver This is a parallel implementation of the multifrontal solver. An example of the sparse direct method solver is the MUltifrontal Massively Parallel Solver (MUMPS) [10-12].
DATA STRUCTURE SUPPORTING HP REFINEMENTS
This section introduces the data structure storing the history of mesh transformations, which can be further utilized by the direct solver. We propose to store two levels of connectivity trees:
• The initial mesh elements connectivity tree
• The h refinements connectivity trees, which grow down from the level of initial mesh elements.
For the parallel implementation of the algorithm we propose three levels of connectivity trees:
• The connectivity tree for sub-domains, built out of the computational mesh distributed into sub-domains
• The connectivity trees for initial mesh elements, built separately on every sub-domain
• The h refinements connectivity trees, which grow down from every refined initial mesh element.
The three connectivity trees related to the computational mesh presented in Fig. 1 are presented in Fig. 2. The partial LU factorizations performed by the solver will be stored at tree nodes for further reutilization.
Fig. 1 Exemplary computational domain with 4 initial mesh elements partitioned into 2 sub-domains. Each initial mesh element is broken into 4 son elements.
Fig. 2 Connectivity trees for the computational mesh presented in Fig. 1
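A minimal sketch of such a multi-level connectivity tree in Python (the class and field names are our own illustration, not the paper's Fortran 90 data structure):

```python
class TreeNode:
    """One node of a connectivity tree: a sub-domain, an initial mesh
    element, or an h-refinement son element."""
    def __init__(self, level, label, sons=None):
        self.level = level        # "subdomain" | "initial" | "refinement"
        self.label = label
        self.sons = sons or []
        self.lu = None            # cached partial LU / Schur complement

def build_tree():
    """Mesh of Fig. 1: 2 sub-domains, 2 initial elements each, every
    initial element broken into 4 son elements."""
    subdomains = []
    for s in range(2):
        initial = []
        for e in range(2):
            sons = [TreeNode("refinement", (s, e, k)) for k in range(4)]
            initial.append(TreeNode("initial", (s, e), sons))
        subdomains.append(TreeNode("subdomain", s, initial))
    return TreeNode("mesh", "root", subdomains)

def leaves(node):
    """Leaf elements are where the elimination starts."""
    if not node.sons:
        return [node]
    return [leaf for son in node.sons for leaf in leaves(son)]

print(len(leaves(build_tree())))  # 16 active son elements
```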
DIRECT SOLVER FOR hp FINITE ELEMENT METHOD
In this section we describe the proposed new direct solver dedicated to the fully automatic hp Finite Element Method [1-4]. The software incorporates a self-adaptive strategy that generates a sequence of hp-refined meshes, delivering exponential convergence of the numerical error with respect to the number of degrees of freedom (mesh size or CPU time). The hp meshes generated by the self-adaptive strategy are obtained by multiple h or p refinements of the initial mesh. An h refinement consists of breaking some finite elements into 2 (in the horizontal or vertical direction) or 4 son elements, while a p refinement consists of increasing the polynomial order of approximation on some finite element edges, faces, and interiors.
The self-adaptive mesh generated in this way is stored as refinement trees growing down from the initial mesh; thus we utilize a tree-like structure for the computational mesh. First, we eliminate degrees of freedom starting from the leaves of the refinement trees, and then we eliminate common degrees of freedom traveling up the refinement trees. In other words, we compute a sequence of Schur complements, starting from the bottom level and traveling up the structure of the refinement trees. Then, we utilize the nested dissection scheme to eliminate degrees of freedom on the level of initial mesh elements.
The parallel version of the solver utilizes the domain decomposition paradigm. The computational mesh is partitioned into multiple sub-domains, with each sub-domain assigned to a separate processor. The solver generates Schur complements of local sub-systems, from the bottom of the refinement trees, through initial mesh elements and sub-domains. The global problem then reduces to one relatively small common "interface" problem, and finally backward substitution is executed to propagate the solution from the common interface, through sub-domains and initial mesh elements, down to the leaves of the refinement trees.
The algorithm of the recursive solver can be summarized in the following pseudo-code:

matrix function recursive_solver(tree_node)
  if tree_node has no son nodes then
    eliminate leaf element stiffness matrix internal nodes
    return Schur complement sub-matrix
  else if tree_node has son nodes then
    do for each son
      son_matrix = recursive_solver(tree_node_son)
      merge son_matrix into new_matrix
    enddo
    decide which unknowns of new_matrix can be eliminated
    perform partial forward elimination on new_matrix
    return Schur complement sub-matrix
  endif
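The pseudo-code can be exercised end-to-end on a tiny 1D model problem. The sketch below is our own Python illustration of the same recursion (the paper's solver is written in Fortran 90): each call merges the sons' Schur complements, eliminates the fully assembled unknowns, and records the eliminated rows so that one final backward substitution recovers the solution. Which unknowns are eliminable at each node is supplied here through an explicit `keep` set.

```python
class Node:
    def __init__(self, sons=(), dofs=None, K=None, f=None, keep=()):
        self.sons = list(sons)
        self.dofs, self.K, self.f = dofs, K, f   # leaf element system
        self.keep = set(keep)                    # dofs visible outside

def merge(systems):
    """Assemble the sons' Schur complements into one new matrix/rhs."""
    dofs = sorted({d for ds, _, _ in systems for d in ds})
    idx = {d: i for i, d in enumerate(dofs)}
    A = [[0.0] * len(dofs) for _ in dofs]
    f = [0.0] * len(dofs)
    for ds, As, fs in systems:
        for i, di in enumerate(ds):
            f[idx[di]] += fs[i]
            for j, dj in enumerate(ds):
                A[idx[di]][idx[dj]] += As[i][j]
    return dofs, A, f

def eliminate(dofs, A, f, elim, steps):
    """Partial forward elimination; returns the Schur complement system
    and records each eliminated row for the backward substitution."""
    order = elim + [d for d in dofs if d not in elim]
    p = [dofs.index(d) for d in order]
    A = [[A[i][j] for j in p] for i in p]
    f = [f[i] for i in p]
    for k in range(len(elim)):
        steps.append((order[k], order[k+1:], A[k][k+1:], f[k], A[k][k]))
        for i in range(k + 1, len(order)):
            c = A[i][k] / A[k][k]
            for j in range(k + 1, len(order)):
                A[i][j] -= c * A[k][j]
            f[i] -= c * f[k]
    ne = len(elim)
    return order[ne:], [row[ne:] for row in A[ne:]], f[ne:]

def recursive_solver(node, steps):
    if node.sons:
        dofs, A, f = merge([recursive_solver(s, steps) for s in node.sons])
    else:
        dofs, A, f = node.dofs[:], [r[:] for r in node.K], node.f[:]
    elim = [d for d in dofs if d not in node.keep]
    return eliminate(dofs, A, f, elim, steps)

def backward(steps):
    u = {}
    for dof, others, row, rhs, piv in reversed(steps):
        u[dof] = (rhs - sum(c * u[d] for c, d in zip(row, others))) / piv
    return u

# -u'' = 1 on (0, 4), u(0) = u(4) = 0, four linear unit elements; dofs
# 1..3 remain after the Dirichlet rows are dropped. Linear elements have
# no interior dofs, so leaves eliminate nothing (keep = all their dofs);
# with p > 1 the leaf step would already condense out interior unknowns.
Ke, fe = [[1.0, -1.0], [-1.0, 1.0]], [0.5, 0.5]
e0 = Node(dofs=[1], K=[[1.0]], f=[0.5], keep=[1])
e1 = Node(dofs=[1, 2], K=Ke, f=fe, keep=[1, 2])
e2 = Node(dofs=[2, 3], K=Ke, f=fe, keep=[2, 3])
e3 = Node(dofs=[3], K=[[1.0]], f=[0.5], keep=[3])
left = Node(sons=[e0, e1], keep=[2])    # eliminates its interior dof 1
right = Node(sons=[e2, e3], keep=[2])   # eliminates its interior dof 3
root = Node(sons=[left, right])         # eliminates the last dof 2

steps = []
recursive_solver(root, steps)
print(backward(steps))  # {2: 2.0, 3: 1.5, 1: 1.5}
```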
The solver can be used to effectively solve problems with multiple right-hand sides, since for each new right-hand side only a new backward substitution must be executed. This is needed in the context of goal-oriented adaptivity, where the solution of the dual problem is required. The solver is written in Fortran 90. The parallelization of the solver consists of assigning tree branches to particular processors and sending the Schur complement contributions from one branch to another. We implemented the parallel version of the solver by utilizing the Message Passing Interface (MPI).
Fig. 3 Elimination patterns over the distributed connectivity tree
It should be emphasized that the communication cost of the solver is related to the size of the local systems of equations associated with common edges between adjacent elements. The interior and edge degrees of freedom that are eliminated on the current level of the connectivity tree are denoted in Fig. 3 by dashed lines, whilst the degrees of freedom that remain uneliminated are denoted by solid lines.
COMPUTATIONAL PROBLEMS
1. 3D DC resistivity logging measurements simulations in deviated wells The problem consists of solving the conductive media equation

∇·(σ∇u) = −∇·J^imp   (1)

in a 3D domain with different formation layers, presented in Fig. 4. A tool with one transmitter and two receiver electrodes is placed in the borehole and shifted along it. The reflected waves are recorded by the receiver electrodes in order to determine the location of the oil formation in the ground. Of particular interest to the oil industry are 3D simulations with deviated wells, where the angle between the borehole and the formation layers is sharp, θ ≠ 90°. This fully 3D problem can be reduced to 2D by considering the three non-orthogonal systems of coordinates presented in Fig. 4. The variational formulation in the new system of coordinates consists in finding u ∈ u_D + H¹(Ω) such that:

( σ̂ ∂u/∂ξ , ∂v/∂ξ )_{L²(Ω)} = ( v , f̂ )_{L²(Ω)}   ∀v ∈ H¹_D(Ω)   (2)
Fig. 4 Three non-orthogonal systems of coordinates in the borehole and formation layers
where the new electrical conductivity of the media is σ̂ := J⁻¹ σ J⁻ᵀ |J| and f̂ := f |J|, with f = ∇·J^imp the divergence of the impressed current, and

J = ∂(x₁, x₂, x₃) / ∂(ζ₁, ζ₂, ζ₃)   (3)

stands for the Jacobian matrix of the change of variables from the Cartesian reference system to the non-orthogonal systems of coordinates, and |J| = det(J) is its determinant.
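As a concrete illustration of definition (3) and of σ̂ = J⁻¹σJ⁻ᵀ|J|, consider a simple shear map x₁ = ζ₁, x₂ = ζ₂, x₃ = ζ₃ + cζ₁ (our own toy example, not the actual borehole coordinate systems of Fig. 4), for which J and σ̂ can be computed by hand and checked in Python:

```python
c, sigma0 = 0.5, 2.0          # shear slope and isotropic conductivity

# Jacobian of the map x1 = z1, x2 = z2, x3 = z3 + c*z1
J = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [c,   0.0, 1.0]]
Jinv = [[1.0, 0.0, 0.0],      # inverse of the unit lower-triangular shear
        [0.0, 1.0, 0.0],
        [-c,  0.0, 1.0]]
detJ = 1.0                    # the shear is volume-preserving

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

sigma = [[sigma0 if i == j else 0.0 for j in range(3)] for i in range(3)]
# sigma_hat = J^{-1} sigma J^{-T} |J|
sigma_hat = [[detJ * x for x in row]
             for row in matmul(matmul(Jinv, sigma), transpose(Jinv))]
print(sigma_hat)
# [[2.0, 0.0, -1.0], [0.0, 2.0, 0.0], [-1.0, 0.0, 2.5]]
```

Even an isotropic σ becomes a full anisotropic tensor in the non-orthogonal coordinates, which is what couples the Fourier modes below.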
We take Fourier series expansions in the azimuthal direction ζ₂:

u(ζ₁, ζ₂, ζ₃) = Σ_{l=−∞}^{+∞} u_l(ζ₁, ζ₃) e^{jlζ₂}   (4)

σ̂(ζ₁, ζ₂, ζ₃) = Σ_{m=−∞}^{+∞} σ̂_m(ζ₁, ζ₃) e^{jmζ₂}   (5)

f̂(ζ₁, ζ₂, ζ₃) = Σ_{n=−∞}^{+∞} f̂_n(ζ₁, ζ₃) e^{jnζ₂}   (6)
The final variational formulation for zero frequency (DC) is the following: Find u ∈ u_D + H¹(Ω) such that:

Σ_{n=k−2}^{k+2} ( σ̂_{k−n} (∂u/∂ξ)_n , (∂v/∂ξ)_k )_{L²(Ω_2D)} = ( v_k , f̂_k )_{L²(Ω_2D)}   ∀v_k   (7)

since five Fourier modes are enough to represent exactly the new material coefficients [13].

Fig. 5 Geometry of the cavity problem

In a similar way we can derive the variational formulation for non-zero frequency (AC): Find E ∈ E_D + H_{Γ_D}(curl; Ω) such that:

Σ_{n=s−2}^{s+2} [ ( μ̂⁻¹_{s−n} (∇×E)_n , (∇×F)_s )_{L²(Ω_2D)} − ( k̂²_{s−n} E_n , F_s )_{L²(Ω_2D)} ] = −jω ( Ĵ^imp_s , F_s )_{L²(Ω_2D)}   ∀F_s   (8)
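The pentadiagonal coupling in the formulations above (mode k interacts only with modes k−2, …, k+2) follows from the convolution of Fourier coefficients when the material coefficient has just five non-zero modes; a quick illustrative check in Python (the coefficient values are arbitrary):

```python
def convolve(a, b):
    """Fourier coefficients of the product of two series:
    (a*b)_k = sum_m a_{k-m} b_m, for coefficients stored in dicts."""
    out = {}
    for m, am in a.items():
        for n, bn in b.items():
            out[m + n] = out.get(m + n, 0.0) + am * bn
    return out

# sigma_hat has exactly five non-zero modes (l = -2..2), as in [13]
sigma_hat = {-2: 0.1, -1: 0.2, 0: 1.0, 1: 0.2, 2: 0.1}
u = {l: 1.0 for l in range(-6, 7)}          # some solution modes

prod_full = convolve(sigma_hat, u)
u_window = {l: u[l] for l in range(-2, 3)}  # only modes 0-2 .. 0+2
prod_win = convolve(sigma_hat, u_window)
# the test equation for mode k = 0 sees only u_{-2}..u_{2}:
print(prod_full[0] == prod_win[0])  # True
```

Hence each Fourier-mode equation couples to at most five unknown modes, and the resulting system over the modes is banded.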
2. 2D Stokes problem We consider the SUPG (Streamline Upwind Petrov-Galerkin [14]) stabilized weak formulation of the Stokes problem: Find the velocity and pressure fields (u, p) ∈ (u_D + V) × Q, where V = {v ∈ H¹(Ω) : v = 0 on Γ_D} and Q = L²(Ω), such that (9)-(10):

∫_Ω 2μ ε̇(u) : ε̇(v) dΩ − ∫_Ω p ∇·v dΩ = ∫_Ω ρb·v dΩ + ∫_{Γ_N} t·v dΓ   (9)

∫_Ω q ∇·u dΩ − Σ_{K∈T_h} ∫_K τ ∇q·∇p dK = Σ_{K∈T_h} ∫_K τ ∇q·(∇·s(û_h)) dK   (10)

The SUPG formulation is utilized to solve the plane flow of an isothermal fluid in the square lid-driven cavity (0,1) × (0,1) presented in Fig. 5. The fluid dynamic viscosity is defined as μ = 1 and the body force is b = 0. The stabilization coefficient is defined as τ_K = αh_K²/(2μ) with α = 0.01.
3. Non-stationary heat transfer problem The weak form of the non-stationary heat transfer problem reads: Find the temperature distribution u ∈ u_D + V, where V = {v ∈ H¹(Ω) : v = 0 on Γ_D}, satisfying:

( ρc_p u̇ , v )_Ω + ∫_Ω k∇u·∇v dΩ + ∫_{Γ_N} βuv dΓ_N = ∫_Ω fv dΩ + ∫_{Γ_N} (βu_N + q)v dΓ_N   ∀v ∈ V   (11)

( ρc_p u(0) , v )_Ω = ( ρc_p u_0 , v )_Ω   ∀v ∈ V   (12)

FE discretization in space gives the following matrix system:

M u̇ + K u = f   (13)

Applying the trapezoidal rule for the time discretization we obtain

(M + αδK) u^{k+1} = [M − (1 − α)δK] u^k + δf   (14)

where M is the mass matrix, δ is the time step, and α ∈ [0, 1] selects among different time integration schemes. We focus on the solution of the heat transfer problem in the L-shape domain presented in Fig. 6.

Fig. 6 Geometry of the step problem

The initial temperature distribution is u = 0 at t = 0. The L-shape domain is heated/cooled with u_N = ±1, with β = 1 and no internal heating, f = 0.
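Scheme (14) can be sketched in a few lines of Python on a toy 2x2 system (the matrices are chosen by us for illustration); α = 0.5 gives the Crank-Nicolson variant of the trapezoidal rule:

```python
def solve2(A, b):
    """Cramer's rule for a 2x2 linear system A x = b."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def step(M, K, f, u, delta, alpha):
    """One step of (M + a d K) u^{k+1} = [M - (1 - a) d K] u^k + d f."""
    L = [[M[i][j] + alpha * delta * K[i][j] for j in range(2)]
         for i in range(2)]
    R = [[M[i][j] - (1 - alpha) * delta * K[i][j] for j in range(2)]
         for i in range(2)]
    b = [R[i][0] * u[0] + R[i][1] * u[1] + delta * f[i] for i in range(2)]
    return solve2(L, b)

M = [[1.0, 0.0], [0.0, 1.0]]       # lumped mass matrix
K = [[2.0, -1.0], [-1.0, 2.0]]     # stiffness matrix
f = [0.0, 0.0]                     # no internal heating, as in the text
u = [1.0, 1.0]                     # initial temperature
for _ in range(100):
    u = step(M, K, f, u, delta=0.05, alpha=0.5)
# the symmetric state decays toward 0 and stays symmetric: u[0] == u[1]
```

Note that the system matrix M + αδK is the same for every step as long as δ, α, and the mesh are fixed, which is exactly why reusing its factorization pays off.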
SOLVER PERFORMANCE AND REUTILIZATION OF PARTIAL LU FACTORIZATIONS
1. 3D DC resistivity logging measurements simulations in deviated wells We performed measurements of the execution time and relative efficiency on the LONESTAR Linux cluster [15] for the 3D resistivity logging measurements simulation problem, with the 2D formulation based on the non-orthogonal system of coordinates and Fourier series expansions. From these measurements it follows that the solver attains 60% relative efficiency on up to 192 processors; compare Figs. 7 and 8.
Fig. 7 Parallel solver execution time for increasing numbers of processors
Fig. 8 Relative efficiency E = T₁/(p·T_p) of the parallel solver
2. 2D Stokes problem We analyzed the percentage of reutilized LU factorizations on a sequence of meshes generated by the self-adaptive hp FEM for the cavity problem, presented in Fig. 9. Some LU factorizations computed in a previous iteration can be effectively reutilized in the next iteration. This is because mesh refinements occur only in the close neighborhood of the two local singularities located at the top left and top right parts of the mesh. LU factorizations associated with elements denoted by the white color in Fig. 10 are not recomputed, but reutilized from the previous mesh. However, at the top of the elimination tree, where refined parts of the mesh are merged with unrefined parts, it is necessary to recompute the LU factorization, since one of the matrix contributions, coming from the refined parts of the mesh, is completely new.
Notice that LU factorizations from the previous mesh cannot be reutilized if the local part of the mesh is either h refined (finite elements are broken) or p refined (the local polynomial order of approximation is changed). This is because an increase of the polynomial order of approximation changes the number of degrees of freedom in the local matrix, and the LU factorization is no longer valid.
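The reuse-or-recompute decision can be sketched as a cache keyed by a per-subtree "refinement signature" (our own illustrative scheme; the actual solver stores the partial LU factors at the elimination tree nodes):

```python
class ElimNode:
    def __init__(self, sons=()):
        self.sons = list(sons)
        self.version = 0          # bumped on any h or p refinement here
        self.cache = None         # (signature, partial LU factors)

computed = []                     # record of recomputed nodes

def factorize(node, name):
    """Return (signature, factors); reuse the cached pair whenever the
    whole subtree is unchanged since the previous solver call."""
    son_sigs = tuple(factorize(s, f"{name}.{i}")[0]
                     for i, s in enumerate(node.sons))
    sig = (node.version, son_sigs)
    if node.cache and node.cache[0] == sig:
        return node.cache          # reutilized, not recomputed
    computed.append(name)
    node.cache = (sig, object())   # stand-in for the real LU factors
    return node.cache

a, b = ElimNode(), ElimNode()
root = ElimNode(sons=[a, b])
factorize(root, "root")            # first call factorizes everything
a.version += 1                     # locally refine the subtree under a
factorize(root, "root")            # recomputes only a and the root
print(computed)
# ['root.0', 'root.1', 'root', 'root.0', 'root']
```

The signature covers both h and p refinements, matching the observation above that either kind of refinement invalidates the stored factorization, while untouched subtrees (here `b`) are reused as-is.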
3. Non-stationary heat transfer problem Finally, we tested the reutilization of partial factorizations for the non-stationary heat transfer problem, see Fig. 11. In this kind of problem, it is possible to reutilize LU factorizations within the sequence of meshes generated for one time step. However, it is not possible to reutilize LU factorizations from one time step to the next, since the problem is non-stationary and changes from one time step to the next, even if the computational mesh is the same.
Fig. 9 Sequence of meshes for the cavity problem. Different colors denote different polynomial orders of approximation, varying from p=1 to p=9, on finite element edges and interiors (in both directions).
RESULTS
Fig. 10 Reutilization of LU factorizations from previous meshes during the first three iterations.
We conclude our presentation by presenting in Fig. 12 numerical results for the 3D resistivity logging measurements problem. In Fig. 13 we also present the velocity distribution for the cavity problem and the final temperature distribution for the non-stationary heat transfer problem. Thanks to the self-adaptive hp FEM, all these results have been computed on optimal meshes delivering the numerical solution with less than 3% relative error.
CONCLUSIONS
We have proposed an efficient sequential and parallel solver for the hp FEM. The solver scales well up to 200 processors. It provides an infrastructure for storing partial LU factorizations at elimination tree nodes, to be reutilized in further calls to the solver after the mesh is hp refined. We showed that partial LU factorizations can be effectively reutilized within iterations of the self-adaptive hp FEM. However, it is not possible to reutilize partial factorizations from the previous time step in non-stationary problems.
Fig. 11 Optimal meshes for particular time steps of the non-stationary heat transfer problem.
Fig. 12 Results for 3D AC resistivity logging measurements with a 20 kHz wireline tool, for resistivities of formation layers presented in the second panel, for an axi-symmetric as well as 30- and 60-degree tilted well.
Fig. 13 Results (horizontal and vertical velocity components) for the cavity problem, as well as the temperature distribution for the non-stationary heat transfer problem in the final time step t = 0.127 s.
Acknowledgements The support of Polish MNiSW grant no. 3 T08B 055 29 is gratefully acknowledged. The first author is also supported by the Foundation for Polish Science under the Homing Programme. The second and third authors are supported by The University of Texas at Austin's Joint Industry Research Consortium on Formation Evaluation sponsored by Aramco, Baker Atlas, BP, British Gas, ConocoPhillips, Chevron, ENI E&P, ExxonMobil, Halliburton Energy Services, Hydro, Marathon Oil Corporation, Mexican Institute for Petroleum, Occidental Petroleum Corporation, Petrobras, Schlumberger, Shell International E&P, Statoil, TOTAL, and Weatherford.
REFERENCES
[1] L. Demkowicz, 2D hp-Adaptive Finite Element Package, TICAM Report 02-06, The University of Texas at Austin (2002)
[2] M. Paszyński, J. Kurtz, L. Demkowicz, Parallel Fully Automatic hp Adaptive 2D Finite Element Package, Computer Methods in Applied Mechanics and Engineering, 195, 7-8 (2006), pp. 711-741
[3] L. Demkowicz, D. Pardo, W. Rachowicz, 3D hp-Adaptive Finite Element Package (3Dhp90). The Ultimate Data Structure for Three Dimensional, Anisotropic hp Refinements, TICAM Report 02-24, The University of Texas at Austin (2002)
[4] M. Paszyński, L. Demkowicz, Parallel Fully Automatic hp Adaptive 3D Finite Element Package, Engineering with Computers, 22, 3-4 (2006), pp. 255-276
[5] P. Matuszyk, M. Paszyński, Extensions of the 2D fully automatic hp adaptive Finite Element Method for Stokes and non-stationary heat transfer problems, 9th US National Congress on Computational Mechanics, USACM, San Francisco, USA (2007)
[6] D. Pardo, L. Demkowicz, C. Torres-Verdin, M. Paszyński, Simulation of Resistivity Logging-While-Drilling (LWD) Measurements Using a Self-Adaptive Goal-Oriented hp-Finite Element Method, SIAM Journal on Applied Mathematics, 66 (2006), pp. 2085-2106
[7] I. S. Duff, J. K. Reid, The multifrontal solution of indefinite sparse symmetric linear systems, ACM Transactions on Mathematical Software, 9 (1983), pp. 302-325
[8] L. Giraud, A. Marocco, J.-C. Rioual, Iterative versus direct parallel substructuring methods in semiconductor device modelling, Numerical Linear Algebra with Applications, 12, 1 (2005), pp. 33-55
[9] J. A. Scott, Parallel Frontal Solvers for Large Sparse Linear Systems, ACM Transactions on Mathematical Software, 29, 4 (2003), pp. 395-417
[10] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, Multifrontal parallel distributed symmetric and unsymmetric solvers, Computer Methods in Applied Mechanics and Engineering, 184 (2000), pp. 501-520
[11] P. R. Amestoy, I. S. Duff, J. Koster, J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23, 1 (2001), pp. 15-41
[12] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, S. Pralet, Hybrid scheduling for the parallel solution of linear systems, accepted to Parallel Computing (2005)
[13] D. Pardo, V. Calo, C. Torres-Verdin, M. J. Nam, Fourier Series Expansion in a Non-Orthogonal System of Coordinates for Simulation of 3D Borehole Resistivity Measurements. Part I: DC, submitted to Computer Methods in Applied Mechanics and Engineering (2007)
[14] T. J. R. Hughes, L. P. Franca, A New FEM for Computational Fluid Dynamics: VII. The Stokes Problem with Various Well-Posed Boundary Conditions: Symmetric Formulations that Converge for All Velocity/Pressure Spaces, Computer Methods in Applied Mechanics and Engineering, 65 (1987), pp. 85-96
[15] Lonestar Cluster Users' Manual, http://www.tacc.utexas.edu/services/userguides/lonestar