
A highly scalable matrix-free multigrid solver for µFE analysis based on a pointer-less octree

Cyril Flaig and Peter Arbenz

ETH Zürich, Chair of Computational Science, 8092 Zürich, Switzerland

Abstract. The state-of-the-art method to predict bone stiffness is micro finite element (µFE) analysis based on high-resolution computed tomography (CT). Modern parallel solvers enable simulations with billions of degrees of freedom. In this paper we present a conjugate gradient solver that works directly on the CT image and exploits the geometric properties of the regular grid and the basic element shapes given by the 3D pixels. The data is stored in a pointer-less octree. The tree data structure provides different resolutions of the image that are used to construct a geometric multigrid preconditioner. It enables a matrix-free representation of all matrices on all levels. The new solver reduces the memory footprint by more than a factor of 10 compared to our previous solver ParFE. It allows solving much bigger problems than before and scales excellently on a Cray XT-5 supercomputer.

Keywords: micro-finite element analysis, voxel based computing, matrix-free, geometric multigrid preconditioning, pointer-less octree

1 Introduction

Osteoporosis is a bone disease affecting millions of people around the world. The disease entails low bone quality and increases the risk of bone fracture. For a better understanding of bone structures and to improve the prediction of bone fractures, a precise estimation of bone stiffness and strength is required. Micro finite element (µFE) analysis is a tool to this end [12, 17]. It is based on high-resolution 3D images that are obtained by computed tomography (CT). The high-resolution scans produce computational domains of complicated shape composed of a huge number of voxels (3D pixels), cf. Fig. 1. Since voxels translate one-to-one into finite elements, the resulting linear systems can have enormous numbers of degrees of freedom (dofs). Some years ago, we developed a fully parallel state-of-the-art solver called ParFE [2, 11] based on the conjugate gradient algorithm preconditioned by smoothed aggregation-based algebraic multigrid. This code exploits the geometric properties of the underlying rectangular grid by avoiding the assembly of the system matrix. The largest realistic bone model solved with ParFE so far had a size of about 1.5 billion dofs [3].

It is natural to represent the voxel-based domains by octrees [4, 6, 15]. Sampath et al. [15] used the different tree levels to construct a geometric multigrid preconditioner.


In this paper, we present a solver based on a pointer-less octree-like data structure. Both finite elements and nodes are identified by a key corresponding to a space filling curve. This curve is equivalent to an octree. In contrast to [4, 6, 15] we deal with incomplete octrees due to the bone-free space. In full-space approaches [9, 10] the bone-free space is modeled by very soft material and its unknowns are included in the computations. With the help of the new data structure the algorithm can exploit the sparse structure of the bone. This enables us to run the simulation with an up to 6 times smaller memory footprint compared to a geometric multigrid solver that also stores the empty bone region [9]. Compared to matrix-free ParFE, the memory savings exceed a factor of 10.

2 The mathematical modeling of the problem

Linear elasticity theory is used to analyse the bone strength. The weak formulation in 3D reads as follows [5]: Find the displacement field $u \in [H^1_E(\Omega)]^3 = \{v \in [H^1(\Omega)]^3 : v|_{\Gamma_D} = u_S\}$ such that

$$\int_\Omega \left[ 2\mu\, \varepsilon(u) : \varepsilon(v) + \lambda \operatorname{div} u \operatorname{div} v \right] \mathrm{d}\Omega = \int_\Omega f^T v \,\mathrm{d}\Omega + \int_{\Gamma_N} g_S^T v \,\mathrm{d}\Gamma \qquad (1)$$

for all $v \in [H^1_0(\Omega)]^3$, with the volume forces $f$, the boundary traction $g_S$ on the Neumann boundary, the linearized symmetric strain tensor

$$\varepsilon(u) := \tfrac{1}{2}\left(\nabla u + (\nabla u)^T\right),$$

and the Lamé constants

$$\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}.$$

Here, E is the Young's modulus and ν the Poisson's ratio.
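For instance, with the typical bone value ν = 0.3 used below, both Lamé constants become fixed multiples of the Young's modulus, which is why only E has to vary over the domain:

$$\lambda = \frac{0.3\,E}{1.3 \cdot 0.4} \approx 0.577\,E, \qquad \mu = \frac{E}{2 \cdot 1.3} \approx 0.385\,E.$$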

We use two different boundary conditions. The Neuman boundaries are traction<br />

<strong>free</strong>, g S = 0. On the top and bottom of the domain we have Dirichlet<br />

boundary condition with a fixed displacement. The engineers look <strong>for</strong> regions<br />

with high stresses and strains to determine the quality of the bone [17].<br />

The displacements are discretized by trilinear hexahedral elements. These<br />

are converted one-to-one from the voxels of the CT image. Thus, all elements<br />

are cubes of the same size. In contrast to ParFE only the Young’s modulus can<br />

vary in the domain. The Poisson’s ratio ν must be constant. Bone mass has a<br />

typical Poisson’s ratio ν = 0.3. Applying this finite element discretization to (1)<br />

results in a symmetric positive definite linear system<br />

Au = f.<br />

The number of degrees of <strong>free</strong>dom can exceed 10 9 . For symmetric positive definite<br />

linear systems of this size the preconditioned conjugate gradient algorithm is the<br />

<strong>solver</strong> of choice [13]. We use a geometric <strong>multigrid</strong> preconditioner.



We coarsen by aggregating 2 × 2 × 2 voxels. A voxel of the coarser level l+1 gets its Young's modulus by averaging the Young's moduli of the eight aggregated smaller voxels of level l,

$$E^{l+1}_{x,y,z} = \frac{1}{8} \sum_{i,j,k=0}^{1} E^{l}_{2x+i,\,2y+j,\,2z+k}, \qquad (2)$$

where the Young's modulus of a non-existing child element is zero. If this procedure is applied to a homogeneous grid with the standard prolongation (interpolation) and restriction, it corresponds to the Galerkin product [16]. For smoothing we use a Chebyshev polynomial [1]. This type of smoother was successfully used in ParFE [2] in the context of a smoothed aggregation-based algebraic multigrid preconditioner.
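As an illustration, rule (2) can be coded in a few lines. The following C++ sketch is our own (the array layout and names are assumptions, not the solver's code): it uses a dense cube of fine-level moduli with zeros for empty voxels, whereas the actual solver stores only existing voxels.

#include <vector>

// Sketch of the coarsening rule (2): each coarse voxel averages the Young's
// moduli of its 2x2x2 fine children; absent children contribute 0.
// E_fine is a dense n^3 array (n even), indexed x + n*(y + n*z).
std::vector<double> CoarsenModuli(const std::vector<double>& E_fine, int n) {
    int m = n / 2;                                // coarse grid resolution
    std::vector<double> E_coarse(m * m * m, 0.0);
    for (int z = 0; z < m; ++z)
        for (int y = 0; y < m; ++y)
            for (int x = 0; x < m; ++x) {
                double sum = 0.0;
                for (int k = 0; k < 2; ++k)       // the eight children
                    for (int j = 0; j < 2; ++j)
                        for (int i = 0; i < 2; ++i)
                            sum += E_fine[(2*x+i) + n*((2*y+j) + n*(2*z+k))];
                E_coarse[x + m*(y + m*z)] = sum / 8.0;   // eq. (2)
            }
    return E_coarse;
}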

3 Implementation details

The mesh, which is constructed from a 3D image, is stored in an octree. An octree divides each spatial dimension into two parts, so each tree node has eight children. Finite elements and nodes of the grid that lie in bone-free space are not stored. In our application we iterate over all elements of a multigrid (or octree) level; these elements have the same size. Both the nodes and the elements of each level are stored in one array. Each element is identified by the coordinate of its node with local number 0. If a data item has a weight w ≥ 0, it represents both an element with a Young's modulus of E_elem = w · 1 GPa and the node of the element with local number 0. Plain nodes are characterized by a negative weight. The nodes and elements are sorted according to their position in the depth-first traversal of the tree. This so-called Morton ordering corresponds to a space filling curve called the Z-curve [14]. The Morton key can be computed easily from the three coordinates (short int) by interleaving their bits: key = z₁₅ y₁₅ x₁₅ · · · z₁ y₁ x₁ z₀ y₀ x₀. This pointer-less storage scheme reduces the memory needed to hold the octree by 24 bytes (56 bytes on 64-bit systems) per node. The whole application needs only about 100 bytes per degree of freedom, which is about 16 times less than the matrix-free ParFE code.
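The bit interleaving can be implemented branch-free with a few shift-and-mask steps. The following C++ sketch is our own illustration of the scheme just described (the function name is hypothetical); each mask spreads the 16 coordinate bits over 48 bits so that bit i of x lands at position 3i:

#include <cstdint>

// Morton (Z-curve) key: interleave the bits of the three 16-bit coordinates
// as ... z1 y1 x1 z0 y0 x0.
uint64_t MortonKey(uint16_t x, uint16_t y, uint16_t z) {
    auto spread = [](uint64_t v) {      // insert two zero bits between bits
        v = (v | (v << 16)) & 0x0000FF0000FFull;
        v = (v | (v <<  8)) & 0x00F00F00F00Full;
        v = (v | (v <<  4)) & 0x0C30C30C30C3ull;
        v = (v | (v <<  2)) & 0x249249249249ull;
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}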

3.1 Accessing nodes of an element

In matrix-free finite element applications the nodes of the corresponding element must be accessed. Usually an element-to-node table is queried to get the indices of the corresponding nodes. With the octree data structure this corresponds to searching for the eight neighbours in the positive x, y, z directions. A binary search corresponds to travelling from the root down to the leaves. Nodes with bigger coordinates always have a bigger Morton key, so the search only has to cover the range from the index of the actual element to the end of the array.

A faster way to access the neighbouring nodes is to ascend in the tree and then descend to the wanted node [7]. Ascending in the full octree is an exponential interval search by a factor of eight (see Algorithm 1). The binary search combined with the exponential interval search speeds up the application.

Algorithm 1 Optimized Search
Require: int SearchIndex(int start, t_octree_key key, t_tree tree)
  int count = 1;
  while key > tree[start + count].key do
    count = count · 8;
  end while
  return binarySearch(start + count/8, start + count, key, tree);
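A C++ rendering of Algorithm 1 could look as follows. This is a sketch with hypothetical type names; we add bounds checks and use std::lower_bound for the final binary search over the narrowed window:

#include <algorithm>
#include <cstdint>
#include <vector>

struct TreeItem { uint64_t key; float weight; };   // assumed item layout

// Exponential interval search (widening by a factor of eight, i.e. one
// octree level per step) followed by a binary search, cf. Algorithm 1.
// 'tree' is sorted by Morton key; 'start' is the index of the current item.
std::size_t SearchIndex(std::size_t start, uint64_t key,
                        const std::vector<TreeItem>& tree) {
    std::size_t count = 1;
    while (start + count < tree.size() && key > tree[start + count].key)
        count *= 8;                                 // ascend one level
    std::size_t lo = start + count / 8;
    std::size_t hi = std::min(start + count + 1, tree.size());
    auto it = std::lower_bound(tree.begin() + lo, tree.begin() + hi, key,
        [](const TreeItem& item, uint64_t k) { return item.key < k; });
    return static_cast<std::size_t>(it - tree.begin());
}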

3.2 Matrix-vector multiplication

The first step is to store the prescribed values at the Dirichlet boundary points and to zero the corresponding components of the source vector. This is done because the boundary conditions are not taken into account in the matrix. Afterwards we import the ghost nodes. Then all elements must be traversed in order to compute the matrix-free matrix-vector product. For each element, the displacements of all its nodes are loaded; this involves the neighbour search described in Section 3.1. Then the local stiffness matrix is applied with a scaling parameter that corresponds to the Young's modulus of the element. The results of the local element are added into the appropriate places in the destination vector, and the ghost nodes are exported. Finally the displacements at the Dirichlet boundary points are restored.
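The per-element kernel of this product might be sketched as follows (an assumed shape, not the solver's actual code): the 24×24 stiffness matrix K0 of the unit voxel with E = 1 GPa is precomputed once, and every element reuses it scaled by its own Young's modulus. The ghost import/export and Dirichlet handling described above wrap the loop over elements that calls this kernel.

#include <array>
#include <vector>

using LocalK = std::array<std::array<double, 24>, 24>;

// One element's contribution to y += A_e x: gather the 8 nodes' displacements,
// apply the scaled reference stiffness K0, and scatter-add the result.
void ApplyElement(const LocalK& K0, double E_elem,
                  const std::array<int, 8>& nodes,   // from neighbour search
                  const std::vector<double>& x, std::vector<double>& y) {
    double ue[24];
    for (int n = 0; n < 8; ++n)              // gather nodal displacements
        for (int d = 0; d < 3; ++d)
            ue[3 * n + d] = x[3 * nodes[n] + d];
    for (int i = 0; i < 24; ++i) {           // f_e = E_elem * K0 * u_e
        double fi = 0.0;
        for (int j = 0; j < 24; ++j)
            fi += K0[i][j] * ue[j];
        y[3 * nodes[i / 3] + i % 3] += E_elem * fi;   // scatter-add
    }
}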

3.3 Prolongation and restriction

Compared to the matrix-vector multiplication, prolongation and restriction are procedures that involve two tree levels. Instead of travelling between the levels, the two different resolutions are traversed concurrently. The keys on the coarser level are computed from those of the finer level by a division by eight. Because we traverse the mesh in one direction, we can use the fast search described in Section 3.1. Algorithm 2 describes the prolongation. The restriction is implemented in a similar way.

Algorithm 2 Prolongation
Require: void Prolongate(Vector c, Vector f)
  ImportGhostNodes(c);
  cindex_tmp = 0;
  for each i in TreeFineLevel do
    coarsekey = i.key / 8; bits = i.key mod 8; factor = FactorOfElem(bits);
    cindex_tmp = SearchIndex(cindex_tmp, coarsekey, coarsetree);
    f[IndexOf(i)] += factor · c[cindex_tmp];
    coarsekeylist = AddCoarseKeysIfBitInDimensionIsSet(coarsekey, bits);
    for each cnode in coarsekeylist do
      cindex = SearchIndex(cindex_tmp, cnode, coarsetree);
      f[IndexOf(i)] += factor · c[cindex];
    end for
  end for
  ZeroBoundaryNodes(f);

3.4 Load balancing

The domain partitioning is obtained by splitting the space filling curve into equal-sized sets of contiguous elements. This avoids the use of a data structure to store the mapping from the nodes to the processes.

After reading the image data, each process sorts its nodes and elements according to the space filling curve. Afterwards, the key space is recursively bisected into buckets until each bucket holds fewer data items than a defined upper limit. Each process then receives consecutive buckets until the average number of elements per process is reached. This results in a nearly balanced distribution, cf. Fig. 1.

Fig. 1. Load balancing with a space filling curve in a cubical bone sample. 16 partitions are used. On the left side all partitions are shown; on the right side the partitions numbered three to nine are displayed. Note that partitions need not be connected.
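Stripped of the bucket mechanism, the core idea reduces to cutting the Z-curve-ordered item array into P nearly equal contiguous chunks. A toy C++ sketch (our own illustration, not the paper's bucket-based implementation):

#include <algorithm>
#include <cstddef>
#include <utility>

// Process p owns the half-open index range returned here, so no explicit
// node-to-process map has to be stored: ownership follows from the position
// of an item along the space filling curve.
std::pair<std::size_t, std::size_t>
OwnedRange(std::size_t nItems, std::size_t P, std::size_t p) {
    std::size_t base = nItems / P, rem = nItems % P;
    std::size_t begin = p * base + std::min(p, rem);
    std::size_t end   = begin + base + (p < rem ? 1 : 0);
    return {begin, end};
}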

4 Numerical results

We performed a strong and a weak scalability test. We used the boundary conditions described in Section 2. In each test we used the stopping criterion $\|r_k\|_{M^{-1}} \le 10^{-6} \|r_0\|_{M^{-1}}$. We used a W-cycle in the multigrid preconditioner M. On the finest level we used a Chebyshev smoother of degree 6; on each coarser level the degree was increased by one. On the coarsest level we solved the problem by a Jacobi-preconditioned CG algorithm, stopped after 20 iterations or once the residual norm had decreased by a factor of 10⁷. Usually the first criterion was met. The timings were made on the Cray XT5 of the Swiss National Supercomputing Centre [8]. The Cray XT5 is based on Opteron processors with six cores running at 2.4 GHz. Each core has 1.33 GiB of main memory.


      cores                  64        512       1728       5832       8000
c240  dofs              445·10⁶    3.6·10⁹   12.0·10⁹   40.5·10⁹   55.5·10⁹
      meshing time [s]      8.5       20.9       52.7        154        204
      setup time [s]       20.4       21.4       23.1       28.9       33.6
      GFlops               32.3        253        854       2888       3947
c320  dofs              758·10⁶    6.1·10⁹   20.4·10⁹   69.1·10⁹   94.7·10⁹
      meshing time [s]     16.6       37.1       92.3        273        804
      setup time [s]       34.6       36.0       37.7       44.8       51.0
      GFlops               31.8        252        856       2865       3921

Table 1. Weak scalability timings. The meshing time also includes the time to read the image data.
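The flop rates in Table 1 confirm the near-perfect weak scaling. For c240, for example, the core count grows from 64 to 8000, a factor of 125, while the flop rate of the matrix-vector product grows by

$$\frac{3947\ \text{GFlops}}{32.3\ \text{GFlops}} \approx 122,$$

i.e. the product retains roughly 98% parallel efficiency.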

Fig. 2. Weak scaling with two different trabecular bone samples embedded in a 320³ and a 240³ regular grid. 3D mirroring is applied to generate the bigger meshes. (The plots show the solving time and the number of iterations versus the number of cores for c240 and c320.)

4.1 Weak scalability

The solver for the bone analysis is designed such that it scales well on MPI-based supercomputers with very large meshes. We have tested the weak scalability on up to 8000 cores with two different meshes, cf. Table 1. The larger grids are generated by 3D mirroring [2] from a bone sample encased in a cube, cf. Fig. 1. We have used two base meshes:

– c240 is encased in a 240³ cube with 6.9·10⁶ degrees of freedom and 1.46·10⁶ elements (porosity 10.6%).
– c320 is encased in a 320³ cube with 11.8·10⁶ degrees of freedom and 2.23·10⁶ elements (porosity 6.83%).

The biggest mesh on 8000 cores has 94.7·10⁹ dofs and is 62 times bigger than the largest problem solved with ParFE [3]. In these tests we always used 7 levels in the multigrid preconditioner.

In Fig. 2 we see that the solver scales nearly perfectly up to 8000 cores. With both meshes, above 125 cores the solving time increases only slightly. The setup time and the flop rate of the matrix-vector product also scale very well, cf. Table 1. However, the meshing time does not scale. This time includes the construction of the octree (meshing) and, most of all, the time to distribute the voxel data among the cores. The latter means broadcasting about 250 MiB = 320³ · 8 B of image data from the root core to all other cores, which is a costly procedure.


Fig. 3. Strong scaling with different smoother degrees and numbers of levels in the multigrid algorithm. The c320 mesh, three times 3D mirrored, is used on the initial 27 cores. The left plot shows the parallel efficiency, the right plot the solution time; the yellow dashed line denotes linear speedup. (Runs use 27 to 576 cores with Chebyshev degrees 6 and 10 and 5 to 7 multigrid levels.)


4.2 Strong scalability

For the strong scalability test a mesh based on c320 was used with 320·10⁶ dofs. A problem of this moderate size could be solved on a machine that is affordable for a clinical institute. We have tested the scalability with different parameters to identify the limiting factors. The memory needed for solving this mesh forced us to use at least 27 cores.

Figure 3 shows that the application scales very well up to 576 cores. If the number of levels is chosen too big (red line), the parallel efficiency decreases, and a configuration with a smoother of higher degree needs less time to solve on 144 cores. The reason is that the problem size on the coarser meshes gets very small and the communication dominates. With redistribution onto a smaller set of cores on the coarser meshes the efficiency would be higher, especially for large numbers of levels.

The higher smoother degree results in higher efficiency because on the fine meshes the matrix-vector product scales perfectly with the number of processors. However, on this mesh the smoother of degree ten needed more time to solve the problem than the smoother of degree six if the same number of levels was used.

5 Conclusions and future work

We have presented a highly parallel solver for voxel-based µFE bone analysis. The solver is based on the PCG method and uses a geometric multigrid preconditioner. Because the mesh is stored in an octree-like data structure, all levels are implemented with matrix-free techniques. The minimal memory footprint enabled us to solve huge problems with more than 94·10⁹ degrees of freedom. Solving these problems with the old solver ParFE would require 16 times as many processors! The solver also shows nearly perfect weak scalability up to 8000 processors.

We plan to further improve the access to the element nodes by hashing with a low collision rate. Further enhancements could be achieved by enabling repartitioning of the coarser levels onto a subset of the processors. This would lower the communication complexity and further increase the parallel efficiency.

Acknowledgments

The work of the first author has been funded in part by the Swiss National Science Foundation project 205320 125114. The computations on the Cray XT5 have been performed in the framework of a Large User Project grant of the Swiss National Supercomputing Centre (CSCS).

References

1. Adams, M., Brezina, M., Hu, J., Tuminaro, R.: Parallel multigrid smoothing: polynomial versus Gauss–Seidel. J. Comput. Phys. 188(2), 593–610 (2003)
2. Arbenz, P., van Lenthe, G.H., Mennel, U., Müller, R., Sala, M.: A scalable multilevel preconditioner for matrix-free µ-finite element analysis of human bone structures. Internat. J. Numer. Methods Engrg. 73(7), 927–947 (2008)
3. Bekas, C., Curioni, A., Arbenz, P., Flaig, C., van Lenthe, G., Müller, R., Wirth, A.: Extreme scalability challenges in micro-finite element simulations of human bone. Concurrency Computat.: Pract. Exper. 22(16), 2282–2296 (2010)
4. Bielak, J., Ghattas, O., Kim, E.J.: Parallel octree-based finite element method for large-scale earthquake ground simulation. Comp. Model. in Eng. & Sci. 10(2), 99–112 (2005)
5. Braess, D.: Finite Elements: Theory, fast solvers and applications in solid mechanics. Cambridge University Press, Cambridge, 2nd edn. (2001)
6. Burstedde, C., Wilcox, L.C., Ghattas, O.: p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees. Accepted for publication in SIAM J. Sci. Comput.
7. Castro, R., Lewiner, T., Lopes, H., Tavares, G., Bordignon, A.: Statistical optimization of octree searches. Computer Graphics Forum 27(6), 1557–1566 (2008)
8. Swiss National Supercomputing Centre (CSCS), http://www.cscs.ch/
9. Flaig, C., Arbenz, P.: A scalable memory efficient multigrid solver for micro-finite element analyses based on CT images. Parallel Computing (2011), accepted for publication
10. Margenov, S., Vutov, Y.: Comparative analysis of PCG solvers for voxel FEM systems. In: Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 591–598 (2006)
11. The ParFE Project Home Page (2010), http://parfe.sourceforge.net/
12. van Rietbergen, B., Weinans, H., Huiskes, R., Polman, B.J.W.: Computational strategies for iterative solutions of large FEM applications employing voxel data. Internat. J. Numer. Methods Engrg. 39(16), 2743–2767 (1996)
13. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, PA, 2nd edn. (2003)
14. Samet, H.: The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 187–260 (1984)
15. Sampath, R.S., Biros, G.: A parallel geometric multigrid method for finite elements on octree meshes. SIAM J. Sci. Comput. 32(3), 1361–1392 (2010)
16. Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Academic Press, London (2000)
17. Wirth, A., Mueller, T., Vereecken, W., Flaig, C., Arbenz, P., Müller, R., van Lenthe, G.H.: Mechanical competence of bone-implant systems can accurately be determined by image-based micro-finite element analyses. Arch. Appl. Mech. 80(5), 513–525 (2010)
