A highly scalable matrix-free multigrid solver for µFE ... - ETH Zürich

A highly scalable matrix-free multigrid solver for µFE analysis based on a pointer-less octree

Cyril Flaig and Peter Arbenz 

ETH Zürich, Chair of Computational Science, 8092 Zürich, Switzerland 

Abstract. The state-of-the-art method to predict bone stiffness is micro

finite element (µFE) analysis based on high-resolution computed tomography 

(CT). Modern parallel solvers enable simulations with billions of 

degrees of freedom. In this paper we present a conjugate gradient solver 

that works directly on the CT image and exploits the geometric properties 

of the regular grid and the basic element shapes given by the 

3D pixel. The data is stored in a pointer-less octree. The tree data structure 

provides different resolutions of the image that are used to construct 

a geometric multigrid preconditioner. It enables the use of matrix-free 

representation of all matrices on all levels. The new solver reduces the 

memory footprint by more than a factor of 10 compared to our previous 

solver ParFE. It allows much larger problems to be solved than before and

scales excellently on a Cray XT5 supercomputer.

Keywords: micro-finite element analysis, voxel-based computing, matrix-free,

geometric multigrid preconditioning, pointer-less octree 

1 Introduction 

Osteoporosis is a bone disease affecting millions of people around the world. 

The disease entails low bone quality and increases the risk of bone fracture. 

For a better understanding of bone structures and to improve the prediction 

of bone fractures, a precise estimation of its stiffness and strength is required. 

Micro finite element analysis (µFE) is a tool to this end [12, 17]. It is based on 

high-resolution 3D images that are obtained by computed tomography (CT). 

The high resolution scans produce computation domains of complicated shape 

composed of a huge number of voxels (3D pixels), cf. Fig. 1. Since voxels directly

translate into finite elements the resulting linear systems can have enormous 

numbers of degrees of freedom (dofs). Some years ago, we developed a fully

parallel state-of-the-art solver called ParFE [2, 11] based on the conjugate gradient 

algorithm preconditioned by smoothed aggregation-based algebraic multigrid. 

This code exploits the geometric properties of the underlying rectangular

grid by avoiding the assembly of the system matrix. The largest realistic bone 

model solved with ParFE so far had a size of about 1.5 billion dofs [3]. 

It is natural to represent the voxel-based domains by octrees [4, 6, 15]. Sampath

et al. [15] used the different tree levels to construct a geometric multigrid 

preconditioner.

In this paper, we present a solver based on a pointer-less octree-like data 

structure. Both finite elements and nodes are identified by a key corresponding 

to a space filling curve. This curve is equivalent to an octree. In contrast to [4, 

6, 15] we deal with incomplete octrees due to the bone free space. In full space 

approaches [9, 10] the bone free space is modeled by very soft material and 

its unknowns are included in the computations. With the help of the new data 

structure the algorithm can exploit the sparse structure of the bone. This enables 

us to run the simulation with up to 6 times smaller memory footprint compared 

to the geometric multigrid that also stores the empty bone region [9]. Compared 

to matrix-free ParFE, the memory saving is more than a factor of 10.

2 The mathematical modeling of the problem 

Linear elasticity theory is used to analyse the bone strength. The weak formulation in 3D reads as follows [5]: Find the displacement field $u \in [H_E^1(\Omega)]^3 = \{v \in [H^1(\Omega)]^3 : v|_{\Gamma_D} = u_S\}$ such that

$$\int_\Omega \left[ 2\mu\, \varepsilon(u) : \varepsilon(v) + \lambda\, \operatorname{div} u\, \operatorname{div} v \right] d\Omega = \int_\Omega f^T v\, d\Omega + \int_{\Gamma_N} g_S^T v\, d\Gamma \qquad (1)$$

for all $v \in [H_0^1(\Omega)]^3$, with the volume forces $f$, the boundary traction $g_S$ on the Neumann boundary $\Gamma_N$, the linearized symmetric strain tensor

$$\varepsilon(u) := \tfrac{1}{2}\left(\nabla u + (\nabla u)^T\right),$$

and the Lamé constants

$$\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}.$$

Here, $E$ is the Young's modulus and $\nu$ the Poisson's ratio.
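As a small worked example of the Lamé constants defined above, the following sketch evaluates them for the Poisson's ratio ν = 0.3 used for bone in this paper. The function name `lame_constants` and the choice E = 1 GPa are illustrative assumptions, not part of the solver.

```python
def lame_constants(E, nu):
    """Return (lambda, mu) for Young's modulus E and Poisson's ratio nu."""
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    mu = E / (2.0 * (1.0 + nu))
    return lam, mu


# Illustrative values: E = 1 (in GPa), nu = 0.3 as quoted for bone.
lam, mu = lame_constants(1.0, 0.3)
```

Both constants scale linearly with E, which is why a spatially varying Young's modulus can be handled by scaling a single reference element matrix as long as ν is constant.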

We use two different boundary conditions. The Neumann boundaries are traction free, $g_S = 0$. On the top and bottom of the domain we have Dirichlet boundary conditions with a fixed displacement. The engineers look for regions

with high stresses and strains to determine the quality of the bone [17]. 

The displacements are discretized by trilinear hexahedral elements. These 

are converted one-to-one from the voxels of the CT image. Thus, all elements 

are cubes of the same size. In contrast to ParFE only the Young’s modulus can 

vary in the domain. The Poisson’s ratio ν must be constant. Bone mass has a 

typical Poisson’s ratio ν = 0.3. Applying this finite element discretization to (1) 

results in a symmetric positive definite linear system 

Au = f. 

The number of degrees of freedom can exceed $10^9$. For symmetric positive definite

linear systems of this size the preconditioned conjugate gradient algorithm is the 

solver of choice [13]. We use a geometric multigrid preconditioner.

Algorithm 1 Optimized Search
Require: int SearchIndex(int start, t_octree_key key, t_tree tree)
    int count = 1;
    while key > tree[start + count].key do
        count = count · 8;
    end while
    return binarySearch(start + count/8, start + count, key, tree);

We coarsen by aggregating 2 × 2 × 2 voxels. A voxel of the coarser level 

l + 1 gets its Young’s modulus by averaging the Young’s moduli of the eight 

aggregated smaller voxels of level l, 

$$E^{l+1}_{x,y,z} = \frac{1}{8} \sum_{i,j,k=0}^{1} E^{l}_{2x+i,\,2y+j,\,2z+k}, \qquad (2)$$

where the Young’s modulus of a non-existing child element is zero. If this procedure 

is applied to a homogeneous grid with the standard prolongation (interpolation) 

and restriction it corresponds to the Galerkin product [16]. For smoothing 

we use a Chebyshev polynomial [1]. This type of smoother was successfully used 

in ParFE [2] in the context of a smoothed aggregation-based algebraic multigrid 

preconditioner. 
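The coarsening rule (2) can be sketched in a few lines. This is a minimal illustration that assumes the voxels of one level are held in a dictionary keyed by integer coordinates; the function name `coarsen` and this storage layout are hypothetical and differ from the sorted key array used in the solver.

```python
from collections import defaultdict


def coarsen(youngs):
    """Map level-l moduli {(x, y, z): E} to level l+1 by 2x2x2 averaging."""
    sums = defaultdict(float)
    for (x, y, z), e in youngs.items():
        # Each fine voxel contributes to the coarse voxel that contains it.
        sums[(x // 2, y // 2, z // 2)] += e
    # Always divide by eight: missing children count as Young's modulus zero.
    return {key: s / 8.0 for key, s in sums.items()}
```

Dividing by eight regardless of how many children exist is what makes partially filled coarse voxels softer than fully filled ones.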

3 Implementation details 

The mesh, which is constructed from a 3D image, is stored in an octree. An 

octree divides each spatial dimension into two parts. This means that each tree

node has eight children. Finite elements and nodes of the grid that lie in bone free 

space are not stored. In our application we iterate over all elements of a multigrid 

(or octree) level. These elements have the same size. Both the nodes and the elements of each level are stored in one array. Each element is identified by the coordinates of its node with local number 0. If a data item has a weight $w \ge 0$, it represents both an element with Young's modulus $E_{\text{elem}} = w \cdot 1\,\text{GPa}$ and the node of that element with local number 0. Plain nodes are characterized by a negative weight. The nodes and elements are sorted according to their position in the depth-first traversal of the tree. This so-called Morton ordering corresponds to a space filling curve called the Z-curve [14]. The Morton key can be computed easily from the three coordinates (short int) by interleaving their bits: key = $z_{15} y_{15} x_{15} \cdots z_1 y_1 x_1 z_0 y_0 x_0$. This pointer-less storage scheme reduces the memory needed to hold the octree by 24 bytes (on 64-bit systems by 56 bytes) per node. The whole application needs only about 100 bytes per degree of freedom. That is about 16 times less than the matrix-free ParFE code.
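The bit interleaving that produces the Morton key can be written down directly. The sketch below is a straightforward loop over the 16 bits of each coordinate; the function name `morton_key` is an assumption, and a production code would more likely use bit-mask tricks or lookup tables.

```python
def morton_key(x, y, z, bits=16):
    """Interleave the bits of x, y, z into key = ... z1 y1 x1 z0 y0 x0."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)        # x bit goes to position 3b
        key |= ((y >> b) & 1) << (3 * b + 1)    # y bit goes to position 3b+1
        key |= ((z >> b) & 1) << (3 * b + 2)    # z bit goes to position 3b+2
    return key


# Sorting coordinates by their key yields the Z-curve (Morton) ordering.
cells = [(1, 1, 0), (0, 0, 1), (1, 0, 0), (0, 1, 0), (0, 0, 0)]
cells.sort(key=lambda c: morton_key(*c))
```

With 16 bits per coordinate the key occupies 48 bits, so it fits into a single 64-bit integer.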

3.1 Accessing nodes of an element 

In matrix-free finite element applications the nodes of the corresponding element must be accessed. Usually an element-to-node table is queried to get the indices of the corresponding nodes. With the octree data structure this corresponds to the search of the eight neighbours in positive x, y, z direction. The binary search corresponds to the travel from the root down to the leaves. Nodes with larger coordinates always have a larger Morton key. The search therefore only has to be done from the index of the current element to the end of the array.

Algorithm 2 Prolongation
Require: void Prolongate(Vector c, Vector f)
    ImportGhostNodes(c);
    cindex_tmp = 0;
    for each i in TreeFineLevel do
        coarsekey = i.key/8; bits = i.key mod 8; factor = FactorOfElem(bits);
        cindex_tmp = SearchIndex(cindex_tmp, coarsekey, coarsetree);
        f[IndexOf(i)] += factor · c[cindex_tmp];
        coarsekeylist = AddCoarseKeysIfBitInDimensionIsSet(coarsekey, bits);
        for each cnode in coarsekeylist do
            cindex = SearchIndex(cindex_tmp, cnode, coarsetree);
            f[IndexOf(i)] += factor · c[cindex];
        end for
    end for
    ZeroBoundaryNodes(f);

A faster way to access the neighbouring nodes is to ascend in the tree and 

descend to the desired node [7]. Ascending in the full octree is an exponential

interval search by a factor of eight (see Algorithm 1). The binary search combined 

with an exponential interval search speeds up the application. 
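The combination of exponential interval search and binary search can be sketched as follows. This is an illustration in the spirit of Algorithm 1, not the solver's code: it works on a plain sorted list of keys, adds the bounds checks that the pseudocode leaves implicit, and returns -1 for a missing key, all of which are assumptions made here.

```python
from bisect import bisect_left


def search_index(keys, start, key):
    """Find `key` in the sorted list `keys`, looking at indices >= start."""
    n = len(keys)
    count = 1
    # Exponential phase: enlarge the interval by a factor of eight
    # until its right end is no longer smaller than the wanted key.
    while start + count < n and keys[start + count] < key:
        count *= 8
    lo = start + count // 8
    hi = min(start + count, n - 1)
    # Binary phase: search only inside the bracketing interval.
    i = bisect_left(keys, key, lo, hi + 1)
    return i if i < n and keys[i] == key else -1
```

When the wanted node is close to `start` in the array, as is typical for neighbours on a space filling curve, the interval stays small and the binary search touches only a few entries.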

3.2 Matrix-vector multiplication 

The first step is to store the prescribed values at the Dirichlet boundary points 

and zero the corresponding components of the source vector. This is done because 

the boundary conditions are not taken into account in the matrix. Afterwards we 

import the ghost nodes. Then all elements must be traversed in order to compute 

the matrix-free matrix-vector product. All corresponding displacements of the 

nodes are loaded. This involves the neighbour search described in Section 3.1. 

Then the local stiffness matrix is applied with a scaling parameter that corresponds 

to the Young’s modulus of the element. The results of the local element 

are added into the appropriate places in the destination vector and the ghost 

nodes are exported. Finally the displacements at the Dirichlet boundary points 

are restored. 
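The sequence of steps in the matrix-free product can be illustrated with a deliberately simplified analogue. The sketch below uses a 1D chain of two-node elements with the scalar local stiffness matrix [[1, -1], [-1, 1]] instead of the 24 × 24 matrix of a trilinear hexahedral elasticity element, runs on a single process without ghost nodes, and writes the saved Dirichlet values into the result vector; all of these are assumptions made for brevity, and the name `matvec` is hypothetical.

```python
def matvec(x, youngs, dirichlet):
    """Return y = A x for a 1D element chain without assembling A.

    x         : list of nodal values
    youngs    : dict {element index e: Young's modulus}, element e joins
                nodes e and e+1; missing elements are simply absent
    dirichlet : set of node indices with prescribed values
    """
    saved = {i: x[i] for i in dirichlet}   # store the prescribed values
    src = list(x)
    for i in dirichlet:
        src[i] = 0.0                       # zero them in the source vector
    y = [0.0] * len(x)
    for e, modulus in youngs.items():      # traverse existing elements only
        a, b = src[e], src[e + 1]          # load the element's nodal values
        y[e] += modulus * (a - b)          # scaled local stiffness, row 0
        y[e + 1] += modulus * (b - a)      # scaled local stiffness, row 1
    for i, value in saved.items():
        y[i] = value                       # restore the Dirichlet values
    return y
```

The only per-element data is the scaling factor, which mirrors the role of the Young's modulus in the 3D solver.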

3.3 Prolongation and restriction 

In contrast to the matrix-vector multiplication, prolongation and restriction are procedures that involve two tree levels. Instead of traveling between the

levels, the two different resolutions are traversed concurrently. The keys on the 

coarser level are computed from those of the finer level by a division by eight.

Fig. 1. Load balancing with a space filling curve in a cubical bone sample. 16 partitions 

are used. On the left side all partitions are shown. On the right side the partitions 

numbered three to nine are displayed. Note that partitions need not be connected. 

Because we traverse the mesh in one direction, we can use the fast search described 

in Section 3.1. Algorithm 2 describes the prolongation. The restriction 

is implemented in a similar way. 
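The key arithmetic used when two levels are traversed together can be sketched as follows. Dividing a Morton key by eight drops the lowest bit of each coordinate, and the remainder tells in which dimensions the fine node lies between two coarse nodes. The weight computed here, 1/2 per set bit, is what trilinear interpolation would give and is only an assumption about what FactorOfElem in Algorithm 2 returns; the function name `coarse_parent` is hypothetical.

```python
def coarse_parent(fine_key):
    """Split a fine-level Morton key into coarse key, octant bits and weight."""
    coarse_key = fine_key // 8   # drops the bits z0 y0 x0
    bits = fine_key % 8          # the dropped bits z0 y0 x0
    # Assumed trilinear weight: halve once per dimension whose bit is set.
    factor = 0.5 ** bin(bits).count("1")
    return coarse_key, bits, factor
```

For example, a fine node whose x and y bits are set sits at the centre of a coarse face and would receive a quarter of the value of each of the four surrounding coarse nodes under this assumption.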

3.4 Load balancing 

The domain partitioning is obtained by splitting the space filling curve into equally sized sets of contiguous elements. This avoids the use of a data structure to store the mapping from the nodes to the processes.

After reading the image data, each process sorts its nodes and elements according to the space filling curve. Afterwards, the key space is subdivided by repeated bisection into buckets until each holds fewer data items than a defined upper limit. Each process gets a number of consecutive buckets until the average number of elements per process is reached. This results in a nearly balanced distribution, cf. Fig. 1.
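The assignment of consecutive buckets to processes can be sketched as a single pass over the bucket sizes. The version below is a serial illustration that compares the cumulative load against each process's cumulative share; this particular cut rule and the name `assign_buckets` are assumptions, and the parallel implementation will differ.

```python
def assign_buckets(bucket_sizes, nprocs):
    """Return the owning process for each bucket along the space filling curve."""
    target = sum(bucket_sizes) / nprocs    # average load per process
    owner = []
    proc, load = 0, 0
    for size in bucket_sizes:
        owner.append(proc)
        load += size
        # Switch to the next process once the cumulative share is reached.
        if load >= target * (proc + 1) and proc < nprocs - 1:
            proc += 1
    return owner
```

Because buckets are handed out in curve order, each process owns one contiguous key range, so ownership of any key can be decided from the range boundaries alone.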

4 Numerical results 

We performed a strong and a weak scalability test. We used the boundary conditions 

described in Section 2. In each test we used the following stopping criterion: 

$\|r_k\|_{M^{-1}} \le 10^{-6}\, \|r_0\|_{M^{-1}}$. We used a W-cycle in the multigrid preconditioner

M. On the finest level we used a Chebyshev smoother of degree 6. On each 

coarser level the degree was increased by one. On the coarsest level we solved 

the problem by a Jacobi preconditioned CG algorithm. We stopped CG after 20 

iterations or if the residual norm was decreased by a factor of $10^7$. Usually the

first criterion was met. The timings were made on the Cray XT5 of the Swiss National 

Supercomputing Center [8]. The Cray XT5 is based on Opteron processors 

with six cores running at 2.4 GHz. Each core has 1.33 GiB main memory.

Table 1. Weak scalability timings. The meshing time also includes the time to read the image data.

Mesh  Quantity           64 cores   512 cores  1728 cores  5832 cores  8000 cores
c240  dofs               445·10^6   3.6·10^9   12.0·10^9   40.5·10^9   55.5·10^9
c240  meshing time [s]   8.5        20.9       52.7        154         204
c240  setup time [s]     20.4       21.4       23.1        28.9        33.6
c240  GFlops             32.3       253        854         2888        3947
c320  dofs               758·10^6   6.1·10^9   20.4·10^9   69.1·10^9   94.7·10^9
c320  meshing time [s]   16.6       37.1       92.3        273         804
c320  setup time [s]     34.6       36.0       37.7        44.8        51.0
c320  GFlops             31.8       252        856         2865        3921

[Figure 2: solving time in seconds and number of iterations versus number of cores (up to 8000) for the c320 and c240 meshes.]

Fig. 2. Weak scaling with two different trabecular bone samples embedded in a $320^3$ and a $240^3$ regular grid. 3D mirroring is applied to generate the bigger meshes.

4.1 Weak scalability 

The solver for the bone analysis is designed such that it scales well on MPI-based 

supercomputers with big-sized meshes. We have tested the weak scalability with 

up to 8000 cores with two different meshes, cf. Table 1. The larger grids are 

generated by 3D mirroring [2] from a bone sample encased in a cube, cf. Fig. 1. 

We have used two base meshes: 

– c240 is encased in a $240^3$ cube with $6.9 \cdot 10^6$ degrees of freedom and $1.46 \cdot 10^6$ elements (porosity 10.6%).

– c320 is encased in a $320^3$ cube with $11.8 \cdot 10^6$ degrees of freedom and $2.23 \cdot 10^6$ elements (porosity 6.83%).

The biggest mesh on 8000 cores has $94.7 \cdot 10^9$ dofs and is 62 times bigger than

the largest problem solved with ParFE [3]. In these tests we always used 7 levels 

in the multigrid preconditioner. 

In Fig. 2 we see that the solver scales nearly perfectly up to 8000 cores. With both meshes, the solving time increases only slightly above 125 cores. The setup time and the flop rate of the matrix-vector product also scale very well, cf. Table 1. However, the meshing time does not scale. This time includes the construction of the octree (meshing) and, most of all, the time to distribute the voxel data among the cores. The latter means the broadcast of about 250 MiB = $320^3 \cdot 8$ B of image data from the root core to all other cores, which is a costly procedure.

[Figure 3: left panel, parallel efficiency versus number of cores (27 to 576); right panel, solution time in seconds versus number of cores (27 to 576), with a linear-speedup reference line.]

Fig. 3. Strong scaling with different smoother degrees and numbers of levels in the multigrid algorithm. The c320 mesh, three times 3D-mirrored, is used, starting on 27 cores. Left: parallel efficiency. Right: solution time. The yellow dashed line at the bottom denotes linear speedup.

4.2 Strong scalability 

For the strong scalability test a mesh based on c320 was used with $320 \cdot 10^6$ dofs.

This moderately sized problem could be solved on a machine that is affordable 

for a clinical institute. We have tested the scalability with different parameters 

to identify the limiting factors. The memory that is needed for solving this mesh 

forced us to use at least 27 cores. 

Figure 3 shows that the application scales very well up to 576 cores. If the number of levels is chosen too large (red line), the parallel efficiency decreases, and a configuration with a smoother of higher degree needs less time to solve with 144

cores. The reason is that the problem size on the coarser mesh gets very small 

and the communication dominates. With redistribution and using a smaller set 

of cores on coarser meshes the efficiency would be higher especially for large 

numbers of levels. 
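Parallel efficiency in a strong-scaling test is commonly measured relative to the smallest run that fits into memory. Assuming this convention is the one behind Fig. 3, with the 27-core run as the baseline (the formula itself is not stated in the text), the efficiency on $p$ cores with solution time $T(p)$ would read

$$\eta(p) = \frac{27 \, T(27)}{p \, T(p)}, \qquad p \ge 27.$$

Under this definition $\eta(p) = 1$ corresponds to linear speedup, and any communication overhead on the coarse levels shows up as $\eta(p) < 1$.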

The higher smoother degree results in higher efficiency because on the fine 

meshes the matrix-vector product scales perfectly with the number of processors. 

However, on this mesh the smoother of degree ten needed more time to solve the 

problem than the smoother of degree six if the same number of levels is used. 

5 Conclusions and future work 

We have presented a highly parallel solver for voxel-based µFE bone analysis. 

The solver is based on the PCG method and uses a geometric multigrid preconditioner. 

Because the mesh is stored in an octree-like data structure, all levels are implemented with matrix-free techniques. The minimal memory footprint enabled us to solve huge problems with more than $94 \cdot 10^9$ degrees of freedom. Solving these problems with the old solver ParFE would require 16 times as many processors. The solver also shows nearly perfect weak scalability up to 8000 processors.

We plan to further improve the access to the element nodes by hashing with a low collision rate. Further enhancements could be achieved by repartitioning the coarser levels onto a subset of processors. This would lower the communication complexity and further increase the parallel efficiency.

Acknowledgments 

The work of the first author has been funded in parts by the Swiss National 

Science Foundation project 205320 125114. The computations on the Cray XT5 

have been performed in the framework of a Large User Project grant of the Swiss 

National Supercomputing Centre (CSCS). 

References 

1. Adams, M., Brezina, M., Hu, J., Tuminaro, R.: Parallel multigrid smoothing: polynomial 

versus Gauss–Seidel. J. Comput. Phys. 188(2), 593–610 (2003) 

2. Arbenz, P., van Lenthe, G.H., Mennel, U., Müller, R., Sala, M.: A scalable multilevel 

preconditioner for matrix-free µ-finite element analysis of human bone structures. 

Internat. J. Numer. Methods Engrg. 73(7), 927–947 (2008) 

3. Bekas, C., Curioni, A., Arbenz, P., Flaig, C., van Lenthe, G., Müller, R., Wirth, A.: 

Extreme scalability challenges in micro-finite element simulations of human bone. 

Concurrency Computat.: Pract. Exper. 22(16), 2282–2296 (2010) 

4. Bielak, J., Ghattas, O., Kim, E.J.: Parallel octree-based finite element method 

for large-scale earthquake ground simulation. Comp. Model. in Eng. & Sci. 10(2), 

99–112 (2005) 

5. Braess, D.: Finite Elements: Theory, fast solvers and applications in solid mechanics. 

Cambridge University Press, Cambridge, 2nd edn. (2001) 

6. Burstedde, C., Wilcox, L.C., Ghattas, O.: p4est: Scalable algorithms for parallel 

adaptive mesh refinement on forests of octrees, accepted for publication in SIAM 

J. Sci. Comput. 

7. Castro, R., Lewiner, T., Lopes, H., Tavares, G., Bordignon, A.: Statistical optimization 

of octree searches. Computer Graphics Forum 27(6), 1557–1566 (2008) 

8. Swiss National Supercomputing Centre (CSCS), http://www.cscs.ch/ 

9. Flaig, C., Arbenz, P.: A Scalable Memory Efficient Multigrid Solver for Micro- 

Finite Element Analyses Based on CT Images. Parallel Computing (2011), accepted 

for publication 

10. Margenov, S., Vutov, Y.: Comparative analysis of PCG solvers for voxel FEM 

systems. In: Proceedings of the International Multiconference on Computer Science 

and Information Technology. pp. 591–598 (2006) 

11. The ParFE Project Home Page (2010), http://parfe.sourceforge.net/ 

12. van Rietbergen, B., Weinans, H., Huiskes, R., Polman, B.J.W.: Computational 

strategies for iterative solutions of large FEM applications employing voxel data. 

Internat. J. Numer. Methods Engrg. 39(16), 2743–2767 (1996) 

13. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, PA, 

2nd edn. (2003)

14. Samet, H.: The quadtree and related hierarchical data structures. ACM Comput. 

Surv. 16, 187–260 (1984) 

15. Sampath, R.S., Biros, G.: A parallel geometric multigrid method for finite elements 

on octree meshes. SIAM J. Sci. Comput. 32(3), 1361–1392 (2010) 

16. Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Academic Press, London 

(2000) 

17. Wirth, A., Mueller, T., Vereecken, W., Flaig, C., Arbenz, P., Müller, R., van 

Lenthe, G.H.: Mechanical competence of bone-implant systems can accurately be 

determined by image-based micro-finite element analyses. Arch. Appl. Mech. 80(5), 

513–525 (2010)
