A highly scalable matrix-free multigrid solver for µFE ... - ETH Zürich
A <strong>highly</strong> <strong>scalable</strong> <strong>matrix</strong>-<strong>free</strong> <strong>multigrid</strong> <strong>solver</strong> <strong>for</strong><br />
<strong>µFE</strong> analysis based on a pointer-less octree<br />
Cyril Flaig and Peter Arbenz<br />
<strong>ETH</strong> <strong>Zürich</strong>, Chair of Computational Science, 8092 <strong>Zürich</strong>, Switzerland<br />
Abstract. The state-of-the-art method to predict bone stiffness is micro<br />
finite element (<strong>µFE</strong>) analysis based on high-resolution computed tomography<br />
(CT). Modern parallel <strong>solver</strong>s enable simulations with billions of<br />
degrees of <strong>free</strong>dom. In this paper we present a conjugate gradient <strong>solver</strong><br />
that works directly on the CT image and exploits the geometric properties<br />
of the regular grid and the basic element shapes given by the<br />
3D pixel. The data is stored in a pointer-less octree. The tree data structure<br />
provides different resolutions of the image that are used to construct<br />
a geometric <strong>multigrid</strong> preconditioner. It enables the use of <strong>matrix</strong>-<strong>free</strong><br />
representation of all matrices on all levels. The new <strong>solver</strong> reduces the<br />
memory footprint by more than a factor of 10 compared to our previous<br />
solver ParFE. It makes it possible to solve much bigger problems than before and<br />
scales excellently on a Cray XT-5 supercomputer.<br />
Keywords: micro-finite element analysis, voxel based computing, matrix-free,<br />
geometric multigrid preconditioning, pointer-less octree<br />
1 Introduction<br />
Osteoporosis is a bone disease affecting millions of people around the world.<br />
The disease entails low bone quality and increases the risk of bone fracture.<br />
For a better understanding of bone structures and to improve the prediction<br />
of bone fractures, a precise estimation of bone stiffness and strength is required.<br />
Micro finite element analysis (<strong>µFE</strong>) is a tool to this end [12, 17]. It is based on<br />
high-resolution 3D images that are obtained by computed tomography (CT).<br />
The high resolution scans produce computation domains of complicated shape<br />
composed of a huge number of voxels (3D pixels), cf. Fig 1. Since voxels directly<br />
translate into finite elements the resulting linear systems can have enormous<br />
numbers of degrees of freedom (dofs). Some years ago, we developed a fully<br />
parallel state-of-the-art solver called ParFE [2, 11] based on the conjugate gradient<br />
algorithm preconditioned by smoothed aggregation-based algebraic multigrid.<br />
This code exploits the geometric properties of the underlying rectangular<br />
grid by avoiding the assembly of the system matrix. The largest realistic bone<br />
model solved with ParFE so far had a size of about 1.5 billion dofs [3].<br />
It is natural to represent the voxel-based domains by octrees [4, 6, 15]. Sampath<br />
et al. [15] used the different tree levels to construct a geometric multigrid<br />
preconditioner.
In this paper, we present a <strong>solver</strong> based on a pointer-less octree-like data<br />
structure. Both finite elements and nodes are identified by a key corresponding<br />
to a space filling curve. This curve is equivalent to an octree. In contrast to [4,<br />
6, 15] we deal with incomplete octrees due to the bone <strong>free</strong> space. In full space<br />
approaches [9, 10] the bone <strong>free</strong> space is modeled by very soft material and<br />
its unknowns are included in the computations. With the help of the new data<br />
structure the algorithm can exploit the sparse structure of the bone. This enables<br />
us to run the simulation with an up to 6 times smaller memory footprint compared<br />
to the geometric multigrid that also stores the empty bone region [9]. Compared<br />
to matrix-free ParFE, the memory saving is more than a factor of 10.<br />
2 The mathematical modeling of the problem<br />
The linear elasticity theory is used to analyse the bone strength. The weak formulation<br />
in 3D reads as follows [5]: Find the displacement field $u \in [H^1_E(\Omega)]^3 = \{v \in [H^1(\Omega)]^3 : v|_{\Gamma_D} = u_S\}$ such that<br />
$$\int_\Omega \left[ 2\mu\, \varepsilon(u) : \varepsilon(v) + \lambda \operatorname{div} u \, \operatorname{div} v \right] d\Omega = \int_\Omega f^T v \, d\Omega + \int_{\Gamma_N} g_S^T v \, d\Gamma \qquad (1)$$<br />
for all $v \in [H^1_0(\Omega)]^3$, with the volume forces $f$, the boundary traction $g_S$ on the<br />
Neumann boundary $\Gamma_N$, the linearized symmetric strain tensor<br />
$$\varepsilon(u) := \frac{1}{2} \left( \nabla u + (\nabla u)^T \right),$$<br />
and the Lamé constants<br />
$$\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}.$$<br />
Here, $E$ is Young's modulus and $\nu$ is Poisson's ratio.<br />
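As a small worked example of the two formulas for the Lamé constants, the following sketch evaluates them for a Poisson's ratio of 0.3, the value the text quotes as typical for bone. The function name `lame_constants` and the choice of a unit Young's modulus are illustrative assumptions, not part of the solver.

```python
def lame_constants(E, nu):
    """Return (lambda, mu) for Young's modulus E and Poisson's ratio nu."""
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    mu = E / (2.0 * (1.0 + nu))
    return lam, mu


# Unit Young's modulus (an assumption for illustration), nu = 0.3 as for bone.
lam, mu = lame_constants(1.0, 0.3)
print(f"lambda/E = {lam:.4f}, mu/E = {mu:.4f}")
```

Both constants are proportional to $E$, which is why a spatially varying Young's modulus can be handled by scaling a single reference element matrix as long as $\nu$ is constant.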
We use two different boundary conditions. The Neumann boundaries are traction<br />
free, $g_S = 0$. On the top and bottom of the domain we have Dirichlet<br />
boundary conditions with a fixed displacement. Engineers look for regions<br />
with high stresses and strains to determine the quality of the bone [17].<br />
The displacements are discretized by trilinear hexahedral elements. These<br />
are converted one-to-one from the voxels of the CT image. Thus, all elements<br />
are cubes of the same size. In contrast to ParFE only the Young’s modulus can<br />
vary in the domain. The Poisson’s ratio ν must be constant. Bone mass has a<br />
typical Poisson’s ratio ν = 0.3. Applying this finite element discretization to (1)<br />
results in a symmetric positive definite linear system<br />
Au = f.<br />
The number of degrees of freedom can exceed $10^9$. For symmetric positive definite<br />
linear systems of this size the preconditioned conjugate gradient algorithm is the<br />
solver of choice [13]. We use a geometric multigrid preconditioner.
Algorithm 1 Optimized Search<br />
Require: int SearchIndex(int start, t_octree_key key, t_tree tree)<br />
int count = 1;<br />
while key > tree[start + count].key do<br />
count = count · 8;<br />
end while<br />
return binarySearch(start + count/8, start + count, key, tree);<br />
We coarsen by aggregating 2 × 2 × 2 voxels. A voxel of the coarser level<br />
l + 1 gets its Young’s modulus by averaging the Young’s moduli of the eight<br />
aggregated smaller voxels of level l,<br />
$$E^{l+1}_{x,y,z} = \frac{1}{8} \sum_{i,j,k=0}^{1} E^{l}_{2x+i,\,2y+j,\,2z+k}, \qquad (2)$$<br />
where the Young’s modulus of a non-existing child element is zero. If this procedure<br />
is applied to a homogeneous grid with the standard prolongation (interpolation)<br />
and restriction it corresponds to the Galerkin product [16]. For smoothing<br />
we use a Chebyshev polynomial [1]. This type of smoother was successfully used<br />
in ParFE [2] in the context of a smoothed aggregation-based algebraic <strong>multigrid</strong><br />
preconditioner.<br />
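The coarsening rule (2) can be sketched in a few lines. The sketch below assumes, purely for illustration, that one level is held as a dictionary from integer voxel coordinates to Young's moduli, with bone-free voxels simply absent; the function name `coarsen_moduli` is hypothetical and the real solver uses the sorted key array described in Section 3.

```python
def coarsen_moduli(fine):
    """Aggregate 2x2x2 voxels of level l into one voxel of level l+1.

    fine maps (x, y, z) -> Young's modulus; missing voxels count as zero.
    """
    coarse = {}
    for (x, y, z), modulus in fine.items():
        parent = (x // 2, y // 2, z // 2)
        # Every existing child adds one eighth of its modulus; non-existing
        # children add nothing, which matches the zero modulus in Eq. (2).
        coarse[parent] = coarse.get(parent, 0.0) + modulus / 8.0
    return coarse


# A full 2x2x2 block of modulus 10 and a single isolated voxel.
fine = {(i, j, k): 10.0 for i in (0, 1) for j in (0, 1) for k in (0, 1)}
fine[(2, 0, 0)] = 10.0
print(coarsen_moduli(fine))
```

A coarse voxel exists as soon as one of its children exists, so the coarse levels stay sparse but become gradually softer near the bone surface.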
3 Implementation details<br />
The mesh, which is constructed from a 3D image, is stored in an octree. An<br />
octree divides each spatial dimension in two parts. This means that each tree<br />
node has eight children. Finite elements and nodes of the grid that lie in bone <strong>free</strong><br />
space are not stored. In our application we iterate over all elements of a <strong>multigrid</strong><br />
(or octree) level. These elements have the same size. Both the nodes and the<br />
elements of each level are stored in one array. Each element is identified by the<br />
coordinate of its node with local number 0. If the data item has a weight $w \ge 0$<br />
then it represents both an element with a Young's modulus of $E_{\mathrm{elem}} = w \cdot 1\,\mathrm{GPa}$<br />
and the node of the element with local number 0. Plain nodes are characterized<br />
by a negative weight. The nodes and elements are sorted according to their position in<br />
the depth-first traversal of the tree. This so-called Morton ordering corresponds<br />
to a space filling curve called Z-curve [14]. The Morton key can be computed<br />
easily from the three coordinates (short int) by interleaving their bits: key =<br />
$z_{15} y_{15} x_{15} \cdots z_1 y_1 x_1 z_0 y_0 x_0$. This pointer-less storage scheme reduces the memory needed<br />
to hold the octree by 24 bytes (on 64-bit systems by 56 bytes) per node. The<br />
whole application needs only about 100 bytes per degree of freedom. That is<br />
about 16 times less than the matrix-free ParFE code.<br />
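The bit interleaving that produces the Morton key can be illustrated as follows. This is a minimal sketch that assumes 16 bits per coordinate, as suggested by the short int coordinates in the text; the names `morton_key` and `morton_decode` are hypothetical, and a production code would typically use table lookups or bit tricks instead of a loop.

```python
def morton_key(x, y, z, bits=16):
    """Interleave the bits of x, y, z into z15 y15 x15 ... z0 y0 x0."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)        # x bit goes to position 3b
        key |= ((y >> b) & 1) << (3 * b + 1)    # y bit goes to position 3b+1
        key |= ((z >> b) & 1) << (3 * b + 2)    # z bit goes to position 3b+2
    return key


def morton_decode(key, bits=16):
    """Inverse of morton_key: recover (x, y, z) from an interleaved key."""
    x = y = z = 0
    for b in range(bits):
        x |= ((key >> (3 * b)) & 1) << b
        y |= ((key >> (3 * b + 1)) & 1) << b
        z |= ((key >> (3 * b + 2)) & 1) << b
    return x, y, z


key = morton_key(3, 5, 6)
print(key, morton_decode(key))
# Dropping the three lowest bits yields the key of the parent on the coarser level.
print(key // 8 == morton_key(3 // 2, 5 // 2, 6 // 2))
```

Sorting data items by this key reproduces the depth-first order of the octree, so no child or parent pointers have to be stored.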
Algorithm 2 Prolongation<br />
Require: void Prolongate(Vector c, Vector f)<br />
ImportGhostNodes(c);<br />
cindex_tmp = 0;<br />
for each i in TreeFineLevel do<br />
coarsekey = i.key/8; bits = i.key mod 8; factor = FactorOfElem(bits);<br />
cindex_tmp = SearchIndex(cindex_tmp, coarsekey, coarsetree);<br />
f[IndexOf(i)] += factor · c[cindex_tmp];<br />
coarsekeylist = AddCoarseKeysIfBitInDimensionIsSet(coarsekey, bits);<br />
for each cnode in coarsekeylist do<br />
cindex = SearchIndex(cindex_tmp, cnode, coarsetree);<br />
f[IndexOf(i)] += factor · c[cindex];<br />
end for<br />
end for<br />
ZeroBoundaryNodes(f);<br />
3.1 Accessing nodes of an element<br />
In matrix-free finite element applications the nodes of the corresponding element<br />
must be accessed. Usually an element-to-node table is queried to get the<br />
indices of the corresponding nodes. With the octree data structure this corresponds<br />
to a search for the eight neighbours in positive x, y, z direction. The<br />
binary search corresponds to a traversal from the root down to the leaves. Nodes<br />
with bigger coordinates always have a bigger Morton key. The search therefore only has to be<br />
done from the index of the current element to the end of the array.<br />
A faster way to access the neighbouring nodes is to ascend in the tree and<br />
descend to the wanted node [7]. Ascending in the full octree is an exponential<br />
interval search by a factor of eight (see Algorithm 1). The binary search combined<br />
with an exponential interval search speeds up the application.<br />
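The combination of an exponential interval search with a binary search from Algorithm 1 might look as follows. This is a sketch under stated assumptions: the level is represented only by its sorted list of Morton keys, the function name `search_index` is hypothetical, and bounds checks as well as a "not found" result of -1 are added here although Algorithm 1 does not show them.

```python
from bisect import bisect_left


def search_index(keys, start, key):
    """Find key in the sorted list keys, looking only at indices >= start.

    Returns the index of key, or -1 if it is not stored (bone-free space).
    """
    n = len(keys)
    count = 1
    # Exponential phase: enlarge the window by a factor of eight, which
    # mimics ascending one level in the octree per step.
    while start + count < n and keys[start + count] < key:
        count *= 8
    lo = start + count // 8
    hi = min(start + count, n - 1)
    # Binary phase: descend inside the bracketing window [lo, hi].
    i = bisect_left(keys, key, lo, hi + 1)
    return i if i < n and keys[i] == key else -1


keys = [0, 1, 3, 8, 9, 10, 11, 15, 64, 65, 70, 200]
print(search_index(keys, 0, 70), search_index(keys, 3, 9), search_index(keys, 0, 2))
```

Because neighbours in positive direction are usually close to the current element on the curve, the window tends to stay small and the search touches only a few entries.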
3.2 Matrix-vector multiplication<br />
The first step is to store the prescribed values at the Dirichlet boundary points<br />
and zero the corresponding components of the source vector. This is done because<br />
the boundary conditions are not taken into account in the <strong>matrix</strong>. Afterwards we<br />
import the ghost nodes. Then all elements must be traversed in order to compute<br />
the <strong>matrix</strong>-<strong>free</strong> <strong>matrix</strong>-vector product. All corresponding displacements of the<br />
nodes are loaded. This involves the neighbour search described in Section 3.1.<br />
Then the local stiffness <strong>matrix</strong> is applied with a scaling parameter that corresponds<br />
to the Young’s modulus of the element. The results of the local element<br />
are added into the appropriate places in the destination vector and the ghost<br />
nodes are exported. Finally the displacements at the Dirichlet boundary points<br />
are restored.<br />
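The sequence of steps described above can be sketched for a serial, simplified setting. The sketch assumes one unknown per node instead of three displacement components, omits the ghost node exchange, and passes the local stiffness matrix for unit Young's modulus as `k_local`; all names are hypothetical. Writing the stored Dirichlet values into the result vector is one possible reading of the final restore step and is an assumption of this sketch.

```python
def matvec(elements, k_local, u, dirichlet):
    """Compute y = A u element by element, without assembling A.

    elements:  list of (node_indices, youngs_modulus) pairs
    k_local:   local stiffness matrix for unit Young's modulus (list of rows)
    dirichlet: set of indices with prescribed values
    """
    work = list(u)
    saved = {i: work[i] for i in dirichlet}   # store the prescribed values
    for i in dirichlet:
        work[i] = 0.0                         # zero them in the source vector
    y = [0.0] * len(u)
    for nodes, modulus in elements:
        local = [work[n] for n in nodes]      # gather the nodal values
        for a, row_node in enumerate(nodes):
            acc = sum(k_local[a][b] * local[b] for b in range(len(nodes)))
            y[row_node] += modulus * acc      # scale by E and scatter
    for i, value in saved.items():
        y[i] = value                          # restore the Dirichlet values (assumed)
    return y


# 1D toy example: two bar elements with moduli 2 and 3, node 0 is fixed.
k_bar = [[1.0, -1.0], [-1.0, 1.0]]
elements = [((0, 1), 2.0), ((1, 2), 3.0)]
print(matvec(elements, k_bar, [5.0, 1.0, 2.0], {0}))
```

Since all elements are cubes of the same size and the Poisson's ratio is constant, a single `k_local` serves every element and only the scalar modulus differs.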
3.3 Prolongation and restriction<br />
In contrast to the matrix-vector multiplication, prolongation and restriction<br />
are procedures that involve two tree levels. Instead of traveling between the<br />
levels, the two different resolutions are traversed concurrently. The keys on the<br />
coarser level are computed from those of the finer level by a division by eight.
Fig. 1. Load balancing with a space filling curve in a cubical bone sample. 16 partitions<br />
are used. On the left side all partitions are shown. On the right side the partitions<br />
numbered three to nine are displayed. Note that partitions need not be connected.<br />
Because we traverse the mesh in one direction, we can use the fast search described<br />
in Section 3.1. Algorithm 2 describes the prolongation. The restriction<br />
is implemented in a similar way.<br />
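The key arithmetic behind the prolongation can be made concrete with a small sketch. It assumes trilinear interpolation, one value per node, dictionaries keyed by Morton keys instead of sorted arrays with the fast search, and a zero contribution from coarse nodes that are not stored; the helper names `encode`, `decode`, and `prolongate` are hypothetical and the ghost node import and boundary handling of Algorithm 2 are left out.

```python
def encode(x, y, z, bits=16):
    """Morton key with bit order ... z1 y1 x1 z0 y0 x0."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key


def decode(key, bits=16):
    """Recover (x, y, z) from a Morton key."""
    x = y = z = 0
    for b in range(bits):
        x |= ((key >> (3 * b)) & 1) << b
        y |= ((key >> (3 * b + 1)) & 1) << b
        z |= ((key >> (3 * b + 2)) & 1) << b
    return x, y, z


def prolongate(coarse, fine_keys):
    """Interpolate coarse nodal values (dict key -> value) to the fine nodes."""
    fine = {}
    for key in fine_keys:
        parent, low = key // 8, key % 8       # coarse key and the three low bits
        px, py, pz = decode(parent)
        dx, dy, dz = low & 1, (low >> 1) & 1, (low >> 2) & 1
        factor = 1.0 / (1 << (dx + dy + dz))  # 1, 1/2, 1/4 or 1/8
        total = 0.0
        # A set bit means the fine node lies halfway between two coarse nodes
        # in that dimension, so the neighbour at offset +1 contributes as well.
        for ox in range(dx + 1):
            for oy in range(dy + 1):
                for oz in range(dz + 1):
                    total += coarse.get(encode(px + ox, py + oy, pz + oz), 0.0)
        fine[key] = factor * total
    return fine


# Coarse values of the linear function 2X + 3Y + 5Z on the corners of one cube.
coarse = {encode(X, Y, Z): 2.0 * X + 3.0 * Y + 5.0 * Z
          for X in (0, 1) for Y in (0, 1) for Z in (0, 1)}
fine = prolongate(coarse, [encode(1, 1, 1), encode(2, 0, 0), encode(1, 0, 2)])
print(fine)
```

The three low bits of a fine key thus decide both the interpolation weight and the list of additional coarse keys, which corresponds to FactorOfElem and AddCoarseKeysIfBitInDimensionIsSet in Algorithm 2.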
3.4 Load balancing<br />
The domain partitioning is obtained by splitting the space filling curve into equal<br />
sized sets of contiguous elements. This avoids the use of a data structure to store<br />
the mapping from the nodes to the processes.<br />
After reading the image data each process sorts its nodes and elements according<br />
to the space filling curve. Afterwards, the key space is recursively bisected<br />
into buckets until each holds fewer data items than a defined upper limit. Each<br />
process gets a number of consecutive buckets until the average number of elements per process<br />
is reached. This results in a nearly balanced distribution, cf. Fig. 1.<br />
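The assignment of consecutive buckets to processes can be sketched as a greedy pass along the curve. The sketch assumes that the bucket sizes are already known globally and in curve order; the function name `assign_buckets` and the exact switching rule are illustrative assumptions rather than the solver's implementation.

```python
def assign_buckets(bucket_sizes, nprocs):
    """Return the owning process of each bucket, in curve order."""
    total = sum(bucket_sizes)
    owners = []
    proc, seen = 0, 0
    for size in bucket_sizes:
        owners.append(proc)
        seen += size
        # Move on to the next process once the running total has reached
        # the cumulative share (proc + 1) * total / nprocs.
        while proc < nprocs - 1 and seen * nprocs >= (proc + 1) * total:
            proc += 1
    return owners


print(assign_buckets([3, 1, 2, 2, 4], 3))
```

Each process owns one contiguous range of keys, so the owner of any node follows from a comparison with the range boundaries and no node-to-process table is needed. The upper limit on the bucket size bounds the imbalance that such a greedy pass can produce.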
4 Numerical results<br />
We performed a strong and a weak scalability test. We used the boundary conditions<br />
described in Section 2. In each test we used the following stopping criterion:<br />
$\|r_k\|_{M^{-1}} \le 10^{-6} \|r_0\|_{M^{-1}}$. We used a W-cycle in the multigrid preconditioner<br />
$M$. On the finest level we used a Chebyshev smoother of degree 6. On each<br />
coarser level the degree was increased by one. On the coarsest level we solved<br />
the problem by a Jacobi preconditioned CG algorithm. We stopped CG after 20<br />
iterations or if the residual norm was decreased by a factor of $10^7$. Usually the<br />
first criterion was met. The timings were made on the Cray XT5 of the Swiss National<br />
Supercomputing Centre [8]. The Cray XT5 is based on Opteron processors<br />
with six cores running at 2.4 GHz. Each core has 1.33 GiB main memory.
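<table>
<tr><th>Mesh</th><th>Quantity</th><th>64 cores</th><th>512 cores</th><th>1728 cores</th><th>5832 cores</th><th>8000 cores</th></tr>
<tr><td>c240</td><td>dofs</td><td>$445 \cdot 10^6$</td><td>$3.6 \cdot 10^9$</td><td>$12.0 \cdot 10^9$</td><td>$40.5 \cdot 10^9$</td><td>$55.5 \cdot 10^9$</td></tr>
<tr><td>c240</td><td>meshing time [s]</td><td>8.5</td><td>20.9</td><td>52.7</td><td>154</td><td>204</td></tr>
<tr><td>c240</td><td>setup time [s]</td><td>20.4</td><td>21.4</td><td>23.1</td><td>28.9</td><td>33.6</td></tr>
<tr><td>c240</td><td>GFlops</td><td>32.3</td><td>253</td><td>854</td><td>2888</td><td>3947</td></tr>
<tr><td>c320</td><td>dofs</td><td>$758 \cdot 10^6$</td><td>$6.1 \cdot 10^9$</td><td>$20.4 \cdot 10^9$</td><td>$69.1 \cdot 10^9$</td><td>$94.7 \cdot 10^9$</td></tr>
<tr><td>c320</td><td>meshing time [s]</td><td>16.6</td><td>37.1</td><td>92.3</td><td>273</td><td>804</td></tr>
<tr><td>c320</td><td>setup time [s]</td><td>34.6</td><td>36.0</td><td>37.7</td><td>44.8</td><td>51.0</td></tr>
<tr><td>c320</td><td>GFlops</td><td>31.8</td><td>252</td><td>856</td><td>2865</td><td>3921</td></tr>
</table>
Table 1. Weak scalability timings. The meshing time also includes the time to read<br />
the image data.<br />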
[Figure 2: plot for the c320 and c240 meshes showing solving time and iteration counts. Axes: Time [s] (ticks 800 to 1800), Iterations (ticks 13 to 17), Number of Cores (8 to 8000).]<br />
Fig. 2. Weak scaling with two different trabecular bone samples embedded in a $320^3$<br />
and a $240^3$ regular grid. 3D mirroring is applied to generate the bigger meshes.<br />
4.1 Weak scalability<br />
The solver for the bone analysis is designed such that it scales well on MPI-based<br />
supercomputers with large meshes. We have tested the weak scalability with<br />
up to 8000 cores with two different meshes, cf. Table 1. The larger grids are<br />
generated by 3D mirroring [2] from a bone sample encased in a cube, cf. Fig. 1.<br />
We have used two base meshes:<br />
– c240 is encased in a $240^3$ cube with $6.9 \cdot 10^6$ degrees of freedom and $1.46 \cdot 10^6$<br />
elements (porosity 10.6%).<br />
– c320 is encased in a $320^3$ cube with $11.8 \cdot 10^6$ degrees of freedom and $2.23 \cdot 10^6$<br />
elements (porosity 6.83%).<br />
The biggest mesh on 8000 cores has $94.7 \cdot 10^9$ dofs and is 62 times bigger than<br />
the largest problem solved with ParFE [3]. In these tests we always used 7 levels<br />
in the multigrid preconditioner.<br />
[Figure 3: two panels. Left panel: Parallel Efficiency (0 to 1). Right panel: Solution Time [s] (ticks 100 and 1000) with a linear speedup reference line. Horizontal ticks in both panels: 27, 36, 72, 144, 288, 576. Curves: degree 6 level 7, degree 6 level 6, degree 6 level 5, degree 10 level 6, degree 10 level 5.]<br />
Fig. 3. Strong scaling with different smoother degrees and numbers of levels in the<br />
multigrid algorithm. The c320 mesh, three times 3D mirrored, is used, initially on 27 cores. On<br />
the left side the parallel efficiency is shown, on the right side the solution time. The yellow<br />
dashed line at the bottom denotes linear speedup.<br />
In Fig. 2 we see that the solver scales nearly perfectly up to 8000 cores. With<br />
both meshes, above 125 cores the solving time increases only a little. Also the setup<br />
time and the flop rate of the matrix-vector product scale very well, cf. Table 1.<br />
However, the meshing time does not scale. This time includes the construction<br />
of the octree (meshing) and, most of all, the time to distribute the voxel data<br />
among the cores. The latter means the broadcast of about 250 MiB = $320^3 \cdot 8$ B<br />
of image data from the root core to all other cores, which is a costly procedure.<br />
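The scaling statements of this section can be checked with a little arithmetic on the GFlops rows of Table 1. In the sketch below, the weak scaling efficiency is defined as the flop rate per core relative to the 64-core run; this definition and the function name `weak_efficiency` are assumptions made for illustration, since the paper does not state a formula.

```python
cores = [64, 512, 1728, 5832, 8000]
gflops = {
    "c240": [32.3, 253, 854, 2888, 3947],
    "c320": [31.8, 252, 856, 2865, 3921],
}


def weak_efficiency(rates, cores):
    """Flop rate per core, normalized by the value of the smallest run."""
    base = rates[0] / cores[0]
    return [rate / n / base for rate, n in zip(rates, cores)]


for mesh, rates in gflops.items():
    print(mesh, [round(e, 3) for e in weak_efficiency(rates, cores)])

# Size of the image broadcast: 320^3 voxels with 8 bytes each, in MiB.
broadcast_mib = 320**3 * 8 / 2**20
print(broadcast_mib)
```

Under this definition the per-core flop rate stays within a few percent of the 64-core value for both meshes, whereas the broadcast volume per core is independent of the number of cores, which is consistent with the meshing time not scaling.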
4.2 Strong scalability<br />
For the strong scalability test a mesh based on c320 was used with $320 \cdot 10^6$ dofs.<br />
This moderately sized problem could be solved on a machine that is affordable<br />
for a clinical institute. We have tested the scalability with different parameters<br />
to identify the limiting factors. The memory needed for solving this mesh<br />
forced us to use at least 27 cores.<br />
Figure 3 shows that the application scales very well up to 576 cores. If the<br />
number of levels is chosen too large (red line) the parallel efficiency decreases and a<br />
configuration with a smoother of higher degree needs less time to solve with 144<br />
cores. The reason is that the problem size on the coarser meshes gets very small<br />
and the communication dominates. With redistribution and a smaller set<br />
of cores on the coarser meshes the efficiency would be higher, especially for large<br />
numbers of levels.<br />
The higher smoother degree results in higher efficiency because on the fine<br />
meshes the <strong>matrix</strong>-vector product scales perfectly with the number of processors.<br />
However, on this mesh the smoother of degree ten needed more time to solve the<br />
problem than the smoother of degree six if the same number of levels is used.<br />
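The memory constraint mentioned at the start of this section can be estimated from figures given elsewhere in the paper: about 100 bytes per degree of freedom and 1.33 GiB of main memory per core. The sketch below is a rough lower bound only, because the 100 bytes are an approximate figure and memory used by the operating system and communication buffers is ignored; the function name `min_cores` is hypothetical.

```python
import math

BYTES_PER_DOF = 100            # approximate figure quoted in Section 3
GIB = 2**30
MEM_PER_CORE = 1.33 * GIB      # main memory per core on the Cray XT5


def min_cores(dofs):
    """Smallest number of cores whose combined memory holds the problem."""
    return math.ceil(dofs * BYTES_PER_DOF / MEM_PER_CORE)


# Strong scaling mesh with 320e6 dofs.
print(min_cores(320e6))

# Largest weak scaling run: memory per core in GiB on 8000 cores.
per_core_gib = 94.7e9 * BYTES_PER_DOF / 8000 / GIB
print(round(per_core_gib, 2))
```

The estimate for the strong scaling mesh lies somewhat below the 27 cores actually used, which is plausible for a bound that neglects all overheads, and the largest weak scaling run fits into the per-core memory with little room to spare.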
5 Conclusions and future work<br />
We have presented a <strong>highly</strong> parallel <strong>solver</strong> <strong>for</strong> voxel-based <strong>µFE</strong> bone analysis.<br />
The <strong>solver</strong> is based on the PCG method and uses a geometric <strong>multigrid</strong> preconditioner.<br />
Because the mesh is stored in an octree-like data structure all levels<br />
are implemented with matrix-free techniques. The minimal memory footprint<br />
enabled us to solve huge problems with more than $94 \cdot 10^9$ degrees of freedom.
Solving these problems with the old solver ParFE would require 16 times as<br />
many processors! The solver also shows nearly perfect weak scalability up to<br />
8000 processors.<br />
We plan to further improve the access to the element nodes by hashing with a low<br />
collision rate. Further enhancements could be achieved by enabling repartitioning<br />
of the coarser levels using a subset of processors. This would lower<br />
the communication complexity and further increase the parallel efficiency.<br />
Acknowledgments<br />
The work of the first author has been funded in parts by the Swiss National<br />
Science Foundation project 205320 125114. The computations on the Cray XT5<br />
have been per<strong>for</strong>med in the framework of a Large User Project grant of the Swiss<br />
National Supercomputing Centre (CSCS).<br />
References<br />
1. Adams, M., Brezina, M., Hu, J., Tuminaro, R.: Parallel <strong>multigrid</strong> smoothing: polynomial<br />
versus Gauss–Seidel. J. Comput. Phys. 188(2), 593–610 (2003)<br />
2. Arbenz, P., van Lenthe, G.H., Mennel, U., Müller, R., Sala, M.: A <strong>scalable</strong> multilevel<br />
preconditioner <strong>for</strong> <strong>matrix</strong>-<strong>free</strong> µ-finite element analysis of human bone structures.<br />
Internat. J. Numer. Methods Engrg. 73(7), 927–947 (2008)<br />
3. Bekas, C., Curioni, A., Arbenz, P., Flaig, C., van Lenthe, G., Müller, R., Wirth, A.:<br />
Extreme scalability challenges in micro-finite element simulations of human bone.<br />
Concurrency Computat.: Pract. Exper. 22(16), 2282–2296 (2010)<br />
4. Bielak, J., Ghattas, O., Kim, E.J.: Parallel octree-based finite element method<br />
<strong>for</strong> large-scale earthquake ground simulation. Comp. Model. in Eng. & Sci. 10(2),<br />
99–112 (2005)<br />
5. Braess, D.: Finite Elements: Theory, fast <strong>solver</strong>s and applications in solid mechanics.<br />
Cambridge University Press, Cambridge, 2nd edn. (2001)<br />
6. Burstedde, C., Wilcox, L.C., Ghattas, O.: p4est: Scalable algorithms <strong>for</strong> parallel<br />
adaptive mesh refinement on <strong>for</strong>ests of octrees, accepted <strong>for</strong> publication in SIAM<br />
J. Sci. Comput.<br />
7. Castro, R., Lewiner, T., Lopes, H., Tavares, G., Bordignon, A.: Statistical optimization<br />
of octree searches. Computer Graphics Forum 27(6), 1557–1566 (2008)<br />
8. Swiss National Supercomputing Centre (CSCS), http://www.cscs.ch/<br />
9. Flaig, C., Arbenz, P.: A Scalable Memory Efficient Multigrid Solver <strong>for</strong> Micro-<br />
Finite Element Analyses Based on CT Images. Parallel Computing (2011), accepted<br />
<strong>for</strong> publication<br />
10. Margenov, S., Vutov, Y.: Comparative analysis of PCG <strong>solver</strong>s <strong>for</strong> voxel FEM<br />
systems. In: Proceedings of the International Multiconference on Computer Science<br />
and In<strong>for</strong>mation Technology. pp. 591–598 (2006)<br />
11. The ParFE Project Home Page (2010), http://parfe.source<strong>for</strong>ge.net/<br />
12. van Rietbergen, B., Weinans, H., Huiskes, R., Polman, B.J.W.: Computational<br />
strategies <strong>for</strong> iterative solutions of large FEM applications employing voxel data.<br />
Internat. J. Numer. Methods Engrg. 39(16), 2743–2767 (1996)<br />
13. Saad, Y.: Iterative Methods <strong>for</strong> Sparse Linear Systems. SIAM, Philadelphia, PA,<br />
2nd edn. (2003)
14. Samet, H.: The quadtree and related hierarchical data structures. ACM Comput.<br />
Surv. 16, 187–260 (1984)<br />
15. Sampath, R.S., Biros, G.: A parallel geometric <strong>multigrid</strong> method <strong>for</strong> finite elements<br />
on octree meshes. SIAM J. Sci. Comput. 32(3), 1361–1392 (2010)<br />
16. Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Academic Press, London<br />
(2000)<br />
17. Wirth, A., Mueller, T., Vereecken, W., Flaig, C., Arbenz, P., Müller, R., van<br />
Lenthe, G.H.: Mechanical competence of bone-implant systems can accurately be<br />
determined by image-based micro-finite element analyses. Arch. Appl. Mech. 80(5),<br />
513–525 (2010)