Hybrid programming with MPI & OpenMP - PRACE Training Portal
Workshop on application porting and performance tuning, 11-12 June 2009
Hybrid programming with MPI & OpenMP
Carlo Cavazzoni (CINECA)


Outline
• Motivation
• Application test case: Quantum ESPRESSO
• Mixed paradigm and FFT
• ScaLAPACK and multithreaded libraries
• Typical loops
• Perspective and conclusion


Motivation: paradigm change in HPC
[Chart: aggregate number of cores of the Top500 supercomputers, June 1993 to June 2008]
New Moore's Law: the number of cores doubles every 2 years.
CPU frequency does not grow any more.
What about (tightly coupled) MPI applications?


MPI inter-process communications
MPI on multi-core CPUs: 1 MPI process per core stresses the network and the OS.
Many MPI codes (including QE) are based on MPI_BCAST and ALLTOALL, so messages = processes * processes.
We need to exploit the hierarchy (cores within a node, nodes across the network): re-design applications to mix message passing and multi-threading.


De-facto standards

MPI (inter-node):
• Distributed memory systems
• Message passing
• Data distribution model
• Version 2.1 (09/08)
• API for C/C++ and Fortran

OpenMP (intra-node):
• Shared memory systems
• Thread creation
• Relaxed-consistency memory model
• Version 3.0 (05/08)
• Compiler directives for C and Fortran
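Before looking at the application, it may help to see how the two standards combine in practice. The following is an illustrative hybrid skeleton, not code from the slides: one MPI process per node with OpenMP threads inside it, requesting the FUNNELED thread-support level so that only the master thread issues MPI calls.

   program hybrid_hello
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, provided, rank, nprocs, tid

      ! Ask MPI for FUNNELED support: threads exist, but only the master calls MPI
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Intra-node parallelism: each MPI process spawns its own OpenMP threads
!$omp parallel private(tid)
      tid = omp_get_thread_num()
      print '(4(a,i0))', 'rank ', rank, ' of ', nprocs, &
                         ', thread ', tid, ' of ', omp_get_num_threads()
!$omp end parallel

      call MPI_Finalize(ierr)
   end program hybrid_hello

Built with an MPI compiler wrapper and the OpenMP flag (for example mpif90 -fopenmp), each rank prints one line per thread.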


Classification of mixed-paradigm parallelization
R. Rabenseifner (HLRS, 2008)


Mixed paradigm
• PROS:
  – Better use of the memory hierarchy
  – Better use of the interconnect
  – Improved scalability
• CONS:
  – Overhead in thread management
  – Greater attention needed to memory access
  – Worse performance if done carelessly


Quantum ESPRESSO package
Quantum ESPRESSO is an open-source suite of computer codes for electronic-structure calculations and materials modeling. It is based on density-functional theory, plane waves, and pseudopotentials.
Features include:
• structural optimizations
• phonons
• elastic constants
• ab-initio molecular dynamics
• dielectric and Raman tensors
• infrared spectra
• NMR spectra
• etc.
"QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials", Paolo Giannozzi et al., submitted for publication (2009)
"First-principles codes for computational crystallography in the Quantum-ESPRESSO package", S. Scandolo et al., Z. Kristallogr. 220, 574 (2005)
www.quantum-espresso.org
www.qe-forge.org


Ab-initio simulations: the "standard model"
• "Molecular dynamics" for the atoms: Ma = F = -dE/dR
• Schroedinger equation for the electrons: Hψ = Eψ
• Electron-electron interactions: Density Functional Theory
• Electron-nuclei interactions: Pseudopotentials
"Ab-initio" molecular dynamics = classical molecular dynamics in the potential energy surface generated by the electrons in their quantum ground state.


CP (Car-Parrinello): code for Car-Parrinello MD
• Disordered systems
• Liquids
• Finite temperature
• More scalable
Roberto Car & Michele Parrinello developed the CP method on a CRAY machine installed at CINECA!

PW (Plane Wave): code for electronic-structure computation
• Structure optimization
• Born-Oppenheimer MD
• Less scalable


Parallelization (before OpenMP), from loosely to tightly coupled:
• Parallelization over images: energy barrier evaluation (loosely coupled).
• Parallelization over pools: k-point sampling of the Brillouin zone; electronic band structure.
• Parallelization of the FFT (charge density, Hamiltonian): real- and reciprocal-space decomposition on a 1x1xN processor grid; one FFT for each electronic state.
• Parallelization of matrix diagonalization on a √N x √N processor grid: at least one state for each electron (tightly coupled).


QE runs on most PRACE/DEISA HPC clusters:
• HPCx (EPCC): IBM P5, 2560 cores
• HECToR (EPCC): Cray XT4, 11328 AMD cores
• LOUHI (SCS): Cray XT4, 11328 AMD cores
• BCX (CINECA): IBM LS21, 5120 AMD cores
• HUYGENS (SARA): IBM, 3328 P6 cores
Next to come:
• SP6 (CINECA): IBM P6, 5576 cores
• BGP (CINECA): IBM BG/P, 4096 cores


Performance of QE, pure MPI (speed-up plots, standard vs ScaLAPACK):
• HECToR - PSIWAT (DEISA benchmark), 256 to 4096 cores
• HECToR - CNT10POR8 (CINECA benchmark), 512 to 4096 cores
• HPCx - PSIWAT (DEISA benchmark), 256 to 1024 cores
• BCX - W256 (CINECA benchmark), 64 to 1024 cores
With the smaller dataset there is no more data to distribute and MPI saturates; here OpenMP can help.
http://www.deisa.eu/science/benchmarking/results


Main algorithms in QE
• 3D FFT
• Linear algebra
  – matrix-matrix multiplication
  – less matrix-vector and vector-vector
  – eigenvalue and eigenvector computation
• Space integrals
• Point function evaluations


Parallelization strategy
• 3D FFT: ad-hoc MPI & OpenMP driver
• Linear algebra: ScaLAPACK + multithreaded BLAS
• Space integrals: MPI & OpenMP loop parallelization and reduction (a sketch of this two-level reduction follows below)
• Point function evaluations: MPI & OpenMP loop parallelization
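The space-integral and point-function strategies above boil down to a two-level reduction: OpenMP reduces within each MPI process, and MPI combines the per-process partial results. A minimal hedged sketch, with illustrative names (integrate, local_f) that do not come from the QE sources:

   subroutine integrate(local_f, nlocal, total, comm)
      use mpi
      implicit none
      integer, intent(in)  :: nlocal, comm
      real(8), intent(in)  :: local_f(nlocal)   ! this process's share of the integrand
      real(8), intent(out) :: total
      real(8) :: partial
      integer :: i, ierr

      ! Thread-level reduction inside the MPI process
      partial = 0.0d0
!$omp parallel do reduction(+:partial)
      do i = 1, nlocal
         partial = partial + local_f(i)
      end do
!$omp end parallel do

      ! Process-level reduction across the communicator
      call MPI_Allreduce(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
   end subroutine integrate

The real QE loops shown on the later slides reduce several quantities at once and have more complicated bodies, but they follow the same pattern.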


MPI & OpenMP works!
BCX - W256 (CINECA benchmark): speed-up relative to 64-core pure MPI, 128 to 1024 cores.
• Pure MPI saturates at 128 cores
• MPI + 2 OpenMP threads saturates at 256 cores
• MPI + 4 OpenMP threads saturates at 512 cores
We observe the same behaviour, but at a higher number of cores.


Changing architecture: HPCx
HPCx - W256 (CINECA benchmark): speed-up relative to 64-core pure MPI, 64 to 1024 cores (pure MPI and MPI+OpenMP with 2, 4 and 8 threads).
On HPCx we can use 8 threads; QE scales up to 1024 cores.


Mixed QE: implicit vs explicit approach
Implicit:
• Linking multi-threaded libraries
• No control of thread-creation overhead
• Relatively simple
Explicit:
• Requires OpenMP syntax knowledge
• Manage code flow and threads yourself
• More efficient
In QE we use both kinds of multi-thread parallelization.


Multithreaded libraries (MKL, ACML, ESSL)
• No explicit OpenMP directives
• Very efficient (in most cases)
• No possibility to mix multithreaded and non-multithreaded versions of the same library (selective calls)
• A workaround using omp_set_num_threads() / omp_get_max_threads() is not always possible (sketched below)
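A hedged sketch of the workaround named in the last bullet (illustrative code, not from the slides; the dgemm calls simply stand in for any routine of a threaded BLAS such as MKL, ACML or ESSL): save the current thread limit, drop to one thread around the call that must not be multithreaded, then restore the limit. Whether the library actually honours the OpenMP runtime setting is implementation dependent, which is why the slide says the workaround is not always possible.

   subroutine threaded_library_workaround(a, b, c, n)
      use omp_lib
      implicit none
      integer, intent(in) :: n
      real(8), intent(inout) :: a(n,n), b(n,n), c(n,n)
      integer :: saved_threads

      ! Normal call: the multithreaded BLAS uses all of its threads here
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      ! Workaround: temporarily force a single thread for a call that must not
      ! be multithreaded, then restore the previous maximum
      saved_threads = omp_get_max_threads()
      call omp_set_num_threads(1)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, c, n, 0.0d0, b, n)
      call omp_set_num_threads(saved_threads)
   end subroutine threaded_library_workaround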


Performance of linear algebra in QE
BCX (CINECA) and HPCx (EPCC): linear-algebra speed-up relative to 64-core pure MPI, 64 to 1024 cores (pure MPI and MPI+OpenMP with 2 and 4 threads; 8 threads on HPCx).
Matrix size 1024x1024.


Understanding the QE 3D FFT: parallelization of plane waves
In reciprocal space a single-state electronic wave function ψ_i(G) lives on plane-wave vectors G inside a sphere of cutoff E_cut, while the charge density ρ(G) extends to 4 E_cut. The columns ("sticks") of G vectors along z are distributed among the MPI processes (PE 0 to PE 3 in the figure), and only about Nx Ny / 5 one-dimensional FFTs along z are needed.
Similar 3D FFTs are present in most ab-initio codes, such as CPMD.


Understanding the QE 3D FFT: task groups
The single-electron wave functions psi(i), i = 1, ..., n (n = number of electronic states), are distributed over the processes P0 ... P3 as slices G0 ... G3.

Without task groups:

   do i = 1, n
      compute 3D FFT( psi(i) )
   end do

With task groups:

   do i = 1, n, groupsize
      merge( psi(i), psi(i+1), ..., psi(i+groupsize-1) )
      compute groupsize 3D FFTs at the same time
   end do


Understanding the QE 3D FFT: master-only
• The local (to each MPI process) set of "z" sticks is distributed to threads (thread-private work).
• Parallel transpose (implemented with isend/irecv); in the master-only scheme this MPI communication happens outside the OpenMP parallel regions.
• The local (to each MPI process) set of "xy" planes is distributed to threads (thread-private work).
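A hedged sketch of the master-only scheme just outlined, not the actual QE driver: fft_z_sticks, fft_xy_planes and transpose_with_isend_irecv are hypothetical placeholders for the real 1D/2D FFT and transpose routines. The thread-parallel work sits inside OpenMP regions, while all MPI communication is issued outside any parallel region.

   subroutine fft3d_master_only(sticks, nsticks, planes, nplanes, comm)
      use mpi
      implicit none
      integer, intent(in) :: nsticks, nplanes, comm
      complex(8), intent(inout) :: sticks(:,:)   ! local "z" sticks of this MPI process
      complex(8), intent(inout) :: planes(:,:)   ! local "xy" planes of this MPI process
      integer :: i, ierr

      ! Local z sticks distributed to threads (thread-private work)
!$omp parallel do private(i)
      do i = 1, nsticks
         call fft_z_sticks( sticks(:,i) )        ! hypothetical 1D FFT along z
      end do
!$omp end parallel do

      ! Parallel transpose among MPI processes (isend/irecv in the real code),
      ! issued outside any OpenMP region: master-only communication
      call transpose_with_isend_irecv( sticks, planes, comm, ierr )

      ! Local xy planes distributed to threads (thread-private work)
!$omp parallel do private(i)
      do i = 1, nplanes
         call fft_xy_planes( planes(:,i) )       ! hypothetical 2D FFT in the xy plane
      end do
!$omp end parallel do
   end subroutine fft3d_master_only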


QE 3D FFT: funneled variant (only the master thread performs MPI calls, from within the OpenMP parallel region).
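A hedged sketch of the funneled variant (assumed shape, reusing the hypothetical helpers of the previous sketch; not the actual QE code): the parallel region now spans the whole transform and only the master thread performs the MPI transpose, saving a fork/join compared with the master-only version.

   subroutine fft3d_funneled(sticks, nsticks, planes, nplanes, comm)
      use mpi
      implicit none
      integer, intent(in) :: nsticks, nplanes, comm
      complex(8), intent(inout) :: sticks(:,:), planes(:,:)
      integer :: i, ierr

!$omp parallel private(i)
      ! z sticks shared among the threads of the team
!$omp do
      do i = 1, nsticks
         call fft_z_sticks( sticks(:,i) )        ! hypothetical 1D FFT along z
      end do
!$omp end do
      ! Only the master thread funnels the MPI communication; the implicit
      ! barrier of the previous "end do" guarantees all sticks are ready
!$omp master
      call transpose_with_isend_irecv( sticks, planes, comm, ierr )
!$omp end master
!$omp barrier
!$omp do
      do i = 1, nplanes
         call fft_xy_planes( planes(:,i) )       ! hypothetical 2D FFT in the xy plane
      end do
!$omp end do
!$omp end parallel
   end subroutine fft3d_funneled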


QE 3D FFT: multiple
• Pros:
  – Overlap of communication and computation
  – Fewer synchronizations between threads and fewer memory flushes
• Cons:
  – Does not exploit the memory and network hierarchy
  – Stresses the network like plain MPI
Not yet completed.
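A hedged sketch of what the "multiple" variant could look like (purely illustrative, since the slide notes the QE version was not yet completed): every thread posts its own non-blocking exchanges, which requires MPI_THREAD_MULTIPLE support and lets each thread overlap its computation with the communication of the others, at the price of stressing the network much like plain MPI.

   subroutine transpose_multiple(sendbuf, recvbuf, chunk, nprocs, comm)
      use mpi
      use omp_lib
      implicit none
      integer, intent(in) :: chunk, nprocs, comm
      complex(8), intent(in)  :: sendbuf(chunk, nprocs)   ! one chunk per destination rank
      complex(8), intent(out) :: recvbuf(chunk, nprocs)   ! one chunk per source rank
      integer :: p, ierr
      integer :: reqs(2)

      ! Each thread exchanges its own share of the chunks: MPI is called
      ! concurrently from several threads, so MPI_THREAD_MULTIPLE is required
!$omp parallel do private(p, reqs, ierr)
      do p = 1, nprocs
         call MPI_Irecv(recvbuf(:,p), chunk, MPI_DOUBLE_COMPLEX, p-1, 0, comm, reqs(1), ierr)
         call MPI_Isend(sendbuf(:,p), chunk, MPI_DOUBLE_COMPLEX, p-1, 0, comm, reqs(2), ierr)
         call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
         ! ... per-chunk computation could start here, overlapping the
         !     communication issued by the other threads ...
      end do
!$omp end parallel do
   end subroutine transpose_multiple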


Performance of the 3D FFT in QE
BCX (CINECA) and HPCx (EPCC): FFT speed-up relative to 64-core pure MPI, 64 to 1024 cores (pure MPI and MPI+OpenMP with 2 and 4 threads; 8 threads on HPCx).
FFT grid size: 200 x 200 x 200; 16 task groups.


Space integrals: electrostatic potential (simple loop parallelization)

!$omp parallel do default(shared), private(rp,is,rhet,rhog,fpibg), reduction(+:eh,ehte,ehti)
   DO ig = gstart, ngm
      rp = (0.D0,0.D0)
      DO is = 1, nsp
         rp = rp + sfac( ig, is ) * rhops( ig, is )    ! ionic density (reciprocal space)
      END DO
      rhet = rhoeg( ig )                               ! electronic density (reciprocal space)
      rhog = rhet + rp
      IF( tscreen ) THEN
         fpibg = fpi / ( g(ig) * tpiba2 ) + screen_coul(ig)
      ELSE
         fpibg = fpi / ( g(ig) * tpiba2 )
      END IF
      vloc(ig) = vloc(ig) + fpibg * rhog               ! Hartree potential
      eh   = eh   + fpibg * rhog * CONJG(rhog)         ! Hartree energy
      ehte = ehte + fpibg * DBLE( rhet * CONJG(rhet) ) ! electronic contribution
      ehti = ehti + fpibg * DBLE( rp   * CONJG(rp)   ) ! ionic contribution
   END DO

Note: the IBM xlf compiler does not like the name of a function being used for an OpenMP reduction.


Space integrals: non-local energy (less simple loop parallelization)
A larger parallel region is used to reduce the fork/join overhead: the parallel construct encloses the whole loop nest, and the work-shared !$omp do is placed on the loop over atoms.

!$omp parallel default(shared), &
!$omp private(is,iv,ijv,isa,ism,ia,inl,jnl,sums,i,iss,sumt), reduction(+:ennl)
   do is = 1, nsp
      do iv = 1, nh(is)
         do jv = iv, nh(is)
            ijv = (jv-1)*jv/2 + iv
            isa = 0
            do ism = 1, is - 1
               isa = isa + na(ism)
            end do
!$omp do
            do ia = 1, na(is)
               inl = ish(is) + (iv-1)*na(is) + ia
               jnl = ish(is) + (jv-1)*na(is) + ia
               sums = 0.d0
               do i = 1, n
                  iss = ispin(i)
                  sums(iss) = sums(iss) + f(i) * bec(inl,i) * bec(jnl,i)
               end do
               sumt = 0.d0
               do iss = 1, nspin
                  rhovan( ijv, isa + ia, iss ) = sums( iss )
                  sumt = sumt + sums( iss )
               end do
               if( iv .ne. jv ) sumt = 2.d0 * sumt
               ennl = ennl + sumt * dvan( jv, iv, is )
            end do
!$omp end do
         end do
      end do
   end do
!$omp end parallel


Point function evaluation: exchange and correlation energy

!$omp parallel do private( rhox, arhox, ex, ec, vx, vc ), reduction(+:etxc)
   do ir = 1, nnr
      rhox = rhor(ir, nspin)                      ! real-space electronic charge density
      arhox = abs(rhox)
      if (arhox.gt.1.d-30) then
         CALL xc( arhox, ex, ec, vx(1), vc(1) )   ! XC functional (external subroutine)
         v(ir,nspin) = e2 * ( vx(1) + vc(1) )     ! XC potential
         etxc = etxc + e2 * ( ex + ec ) * rhox    ! XC energy
      endif
   enddo
!$omp end parallel do


Gram-Schmidt kernel: dealing with thread-private allocatable arrays

!$omp parallel default(shared), private( temp, k, ig )
   ALLOCATE( temp( ngw ) )
!$omp do
   DO k = 1, kmax
      csc(k) = 0.0d0
      IF ( ispin(i) .EQ. ispin(k) ) THEN
         DO ig = 1, ngw
            temp(ig) = cp(1,ig,k) * cp(1,ig,i) + cp(2,ig,k) * cp(2,ig,i)
         END DO
         csc(k) = 2.0d0 * SUM(temp)
         IF (gstart == 2) csc(k) = csc(k) - temp(1)
      ENDIF
   END DO
!$omp end do
   DEALLOCATE( temp )
!$omp end parallel

Each thread allocates its own private copy of the work array temp inside the parallel region and releases it before the region ends.


Example of non-trivial loop parallelization: distributing atoms round-robin to threads.

DO nt = 1, ntyp
   IF ( upf(nt)%tvanp ) THEN
      DO ih = 1, nh(nt)
         DO jh = ih, nh(nt)
            CALL qvan2( ngm, ih, jh, nt, qmod, qgm, ylmk0 )
!$omp parallel default(shared), private(na,qgm_na,is,dtmp,ig,mytid,ntids)
#ifdef __OPENMP
            mytid = omp_get_thread_num()    ! take the thread ID
            ntids = omp_get_num_threads()   ! take the number of threads
#endif
            ALLOCATE( qgm_na( ngm ) )
            DO na = 1, nat
#ifdef __OPENMP
               IF( MOD( na, ntids ) /= mytid ) CYCLE   ! distribute atoms round-robin to threads
#endif
               IF ( ityp(na) == nt ) THEN
                  qgm_na(1:ngm) = qgm(1:ngm) * eigts1(ig1(1:ngm),na) * &
                                  eigts2(ig2(1:ngm),na) * eigts3(ig3(1:ngm),na)
                  DO is = 1, nspin_mag
                     dtmp = 0.0d0
                     DO ig = 1, ngm
                        dtmp = dtmp + aux( ig, is ) * CONJG( qgm_na( ig ) )
                     END DO
                     deeq(ih,jh,na,is) = fact * omega * DBLE( dtmp )
                     deeq(jh,ih,na,is) = deeq(ih,jh,na,is)
                  END DO
               END IF
            END DO
            DEALLOCATE( qgm_na )
!$omp end parallel
         END DO
      END DO
   END IF
END DO


ScaLAPACK diagonalization
• TurboRVB
• Ill-conditioned 25000x25000 matrix
• PDSYEVD, PDSYEVX
• HUYGENS (SARA)
• DECI - QMCHTC
• Ad-hoc diagonalization


PDSYEVX vs ad-hoc subroutine
TurboRVB: localized-orbital DFT + Quantum Monte Carlo for isolated and periodic systems.
The starting Hamiltonian is built from trial non-orthogonal localized orbitals, giving an ill-conditioned symmetric matrix with many eigenvalues close to zero.
PDSYEVX requires larger and larger workspace and longer and longer time to converge.
The ad-hoc parallel subroutine uses mixed MPI+OpenMP: it is slower for standard matrices, but it does not depend on the condition number and needs no additional workspace.


Ad-hoc diagonalization
• Based on the Numerical Recipes/Householder algorithm (two subroutines: LU factorization + QL rotations).
• The matrix is distributed by rows to MPI processes (not as in ScaLAPACK), so an additional redistribution of data is needed.
• Numerical consistency and machine precision are managed carefully (limits based on the physical problem).
• The tridiagonal system is solved by the root process only.
References:
Numerical Recipes: The Art of Scientific Computing, W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Cambridge University Press, Cambridge.
Parallel Numerical Algorithms, T.L. Freeman and C. Phillips, Prentice Hall International (1992).


Conclusion
• The number of cores doubles every two years.
• MPI and OpenMP together exploit multi-core nodes.
• Both implicit and explicit multi-threading are used.
• The mixed paradigm is not a big effort, given good compilers and libraries.
• Next challenge: beyond parallel computing (reliable, hybrid).


Acknowledgments
• Filippo Spiga (University of Milan - Bicocca)
• Mark Bull (EPCC)
• EPCC, SARA, SCS staff
