Hybrid programming with MPI & OpenMP - PRACE Training Portal
Workshop on application porting and performance tuning, 11-12 June 2009
Hybrid programming with MPI & OpenMP
Carlo Cavazzoni (CINECA)


Outline
• Motivation
• Application test case: Quantum ESPRESSO
• Mixed paradigm and FFT
• ScaLAPACK and multithreaded libraries
• Typical loops
• Perspective and conclusion


Motivation: paradigm change in HPC
[Chart: aggregate number of cores of the Top500 supercomputers, June 1993 to June 2008]
New Moore's Law: the number of cores doubles every 2 years.
CPU frequency does not grow any more.
What about (tightly coupled) MPI applications?


MPI inter-process communications
MPI on multi-core CPUs: 1 MPI process per core stresses the network and the OS.
Many MPI codes (including QE) are based on MPI_BCAST and ALLTOALL, so messages = processes * processes.
We need to exploit the hierarchy (cores within a node, nodes across the network): re-design applications to mix message passing and multi-threading.


De-facto standards

MPI (inter-node):
• Distributed memory systems
• Message passing
• Data distribution model
• Version 2.1 (09/08)
• API for C/C++ and Fortran

OpenMP (intra-node):
• Shared memory systems
• Thread creation
• Relaxed-consistency memory model
• Version 3.0 (05/08)
• Compiler directives for C and Fortran
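Before looking at the application, it may help to see how the two standards combine in practice. The following is an illustrative hybrid skeleton, not code from the slides: one MPI process per node with OpenMP threads inside it, requesting the FUNNELED thread-support level so that only the master thread issues MPI calls.

   program hybrid_hello
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, provided, rank, nprocs, tid

      ! Ask MPI for FUNNELED support: threads exist, but only the master calls MPI
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Intra-node parallelism: each MPI process spawns its own OpenMP threads
!$omp parallel private(tid)
      tid = omp_get_thread_num()
      print '(4(a,i0))', 'rank ', rank, ' of ', nprocs, &
                         ', thread ', tid, ' of ', omp_get_num_threads()
!$omp end parallel

      call MPI_Finalize(ierr)
   end program hybrid_hello

Built with an MPI compiler wrapper and the OpenMP flag (for example mpif90 -fopenmp), each rank prints one line per thread.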


Classification of mixed-paradigm parallelization
R. Rabenseifner (HLRS, 2008)


Mixed paradigm
• PROS:
  – Better use of the memory hierarchy
  – Better use of the interconnect
  – Improved scalability
• CONS:
  – Overhead in thread management
  – Greater attention needed to memory access
  – Worse performance if done carelessly


Quantum ESPRESSO package
Quantum ESPRESSO is an open-source suite of computer codes for electronic-structure calculations and materials modeling. It is based on density-functional theory, plane waves, and pseudopotentials.
Features include:
• structural optimizations
• phonons
• elastic constants
• ab-initio molecular dynamics
• dielectric and Raman tensors
• infrared spectra
• NMR spectra
• etc.
"QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials", Paolo Giannozzi et al., submitted for publication (2009)
"First-principles codes for computational crystallography in the Quantum-ESPRESSO package", S. Scandolo et al., Z. Kristallogr. 220, 574 (2005)
www.quantum-espresso.org
www.qe-forge.org


Ab-initio simulations: the "standard model"
• "Molecular dynamics" for the atoms: Ma = F = -dE/dR
• Schroedinger equation for the electrons: Hψ = Eψ
• Electron-electron interactions: Density Functional Theory
• Electron-nuclei interactions: Pseudopotentials
"Ab-initio" molecular dynamics = classical molecular dynamics in the potential energy surface generated by the electrons in their quantum ground state.


CP (Car-Parrinello): code for Car-Parrinello MD
• Disordered systems
• Liquids
• Finite temperature
• More scalable
Roberto Car & Michele Parrinello developed the CP method on a CRAY machine installed at CINECA!

PW (Plane Wave): code for electronic-structure computation
• Structure optimization
• Born-Oppenheimer MD
• Less scalable


Parallelization (before OpenMP), from loosely to tightly coupled:
• Parallelization over images: energy barrier evaluation (loosely coupled).
• Parallelization over pools: k-point sampling of the Brillouin zone; electronic band structure.
• Parallelization of the FFT (charge density, Hamiltonian): real- and reciprocal-space decomposition on a 1x1xN processor grid; one FFT for each electronic state.
• Parallelization of matrix diagonalization on a √N x √N processor grid: at least one state for each electron (tightly coupled).


QE runs on most PRACE/DEISA HPC clusters:
• HPCx (EPCC): IBM P5, 2560 cores
• HECToR (EPCC): Cray XT4, 11328 AMD cores
• LOUHI (SCS): Cray XT4, 11328 AMD cores
• BCX (CINECA): IBM LS21, 5120 AMD cores
• HUYGENS (SARA): IBM, 3328 P6 cores
Next to come:
• SP6 (CINECA): IBM P6, 5576 cores
• BGP (CINECA): IBM BG/P, 4096 cores


Performance of QE, pure MPI (speed-up plots, standard vs ScaLAPACK):
• HECToR - PSIWAT (DEISA benchmark), 256 to 4096 cores
• HECToR - CNT10POR8 (CINECA benchmark), 512 to 4096 cores
• HPCx - PSIWAT (DEISA benchmark), 256 to 1024 cores
• BCX - W256 (CINECA benchmark), 64 to 1024 cores
With the smaller dataset there is no more data to distribute and MPI saturates; here OpenMP can help.
http://www.deisa.eu/science/benchmarking/results


Main algorithms in QE
• 3D FFT
• Linear algebra
  – matrix-matrix multiplication
  – less matrix-vector and vector-vector
  – eigenvalue and eigenvector computation
• Space integrals
• Point function evaluations


Parallelization strategy
• 3D FFT: ad-hoc MPI & OpenMP driver
• Linear algebra: ScaLAPACK + multithreaded BLAS
• Space integrals: MPI & OpenMP loop parallelization and reduction (a sketch of this two-level reduction follows below)
• Point function evaluations: MPI & OpenMP loop parallelization
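The space-integral and point-function strategies above boil down to a two-level reduction: OpenMP reduces within each MPI process, and MPI combines the per-process partial results. A minimal hedged sketch, with illustrative names (integrate, local_f) that do not come from the QE sources:

   subroutine integrate(local_f, nlocal, total, comm)
      use mpi
      implicit none
      integer, intent(in)  :: nlocal, comm
      real(8), intent(in)  :: local_f(nlocal)   ! this process's share of the integrand
      real(8), intent(out) :: total
      real(8) :: partial
      integer :: i, ierr

      ! Thread-level reduction inside the MPI process
      partial = 0.0d0
!$omp parallel do reduction(+:partial)
      do i = 1, nlocal
         partial = partial + local_f(i)
      end do
!$omp end parallel do

      ! Process-level reduction across the communicator
      call MPI_Allreduce(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
   end subroutine integrate

The real QE loops shown on the later slides reduce several quantities at once and have more complicated bodies, but they follow the same pattern.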


MPI & OpenMP works!
BCX - W256 (CINECA benchmark): speed-up relative to 64-core pure MPI, 128 to 1024 cores.
• Pure MPI saturates at 128 cores
• MPI + 2 OpenMP threads saturates at 256 cores
• MPI + 4 OpenMP threads saturates at 512 cores
We observe the same behaviour, but at a higher number of cores.


Changing architecture: HPCx
HPCx - W256 (CINECA benchmark): speed-up relative to 64-core pure MPI, 64 to 1024 cores (pure MPI and MPI+OpenMP with 2, 4 and 8 threads).
On HPCx we can use 8 threads; QE scales up to 1024 cores.


Mixed QE: implicit vs explicit approach
Implicit:
• Linking multi-threaded libraries
• No control of thread-creation overhead
• Relatively simple
Explicit:
• Requires OpenMP syntax knowledge
• Manage code flow and threads yourself
• More efficient
In QE we use both kinds of multi-thread parallelization.


Multithreaded libraries (MKL, ACML, ESSL)
• No explicit OpenMP directives
• Very efficient (in most cases)
• No possibility to mix multithreaded and non-multithreaded versions of the same library (selective calls)
• A workaround using omp_set_num_threads() / omp_get_max_threads() is not always possible (sketched below)
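A hedged sketch of the workaround named in the last bullet (illustrative code, not from the slides; the dgemm calls simply stand in for any routine of a threaded BLAS such as MKL, ACML or ESSL): save the current thread limit, drop to one thread around the call that must not be multithreaded, then restore the limit. Whether the library actually honours the OpenMP runtime setting is implementation dependent, which is why the slide says the workaround is not always possible.

   subroutine threaded_library_workaround(a, b, c, n)
      use omp_lib
      implicit none
      integer, intent(in) :: n
      real(8), intent(inout) :: a(n,n), b(n,n), c(n,n)
      integer :: saved_threads

      ! Normal call: the multithreaded BLAS uses all of its threads here
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      ! Workaround: temporarily force a single thread for a call that must not
      ! be multithreaded, then restore the previous maximum
      saved_threads = omp_get_max_threads()
      call omp_set_num_threads(1)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, c, n, 0.0d0, b, n)
      call omp_set_num_threads(saved_threads)
   end subroutine threaded_library_workaround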


Performance of linear algebra in QE
BCX (CINECA) and HPCx (EPCC): linear-algebra speed-up relative to 64-core pure MPI, 64 to 1024 cores (pure MPI and MPI+OpenMP with 2 and 4 threads; 8 threads on HPCx).
Matrix size 1024x1024.


Understanding the QE 3D FFT: parallelization of plane waves
In reciprocal space a single-state electronic wave function ψ_i(G) lives on plane-wave vectors G inside a sphere of cutoff E_cut, while the charge density ρ(G) extends to 4 E_cut. The columns ("sticks") of G vectors along z are distributed among the MPI processes (PE 0 to PE 3 in the figure), and only about Nx Ny / 5 one-dimensional FFTs along z are needed.
Similar 3D FFTs are present in most ab-initio codes, such as CPMD.


Understanding the QE 3D FFT: task groups
The single-electron wave functions psi(i), i = 1, ..., n (n = number of electronic states), are distributed over the processes P0 ... P3 as slices G0 ... G3.

Without task groups:

   do i = 1, n
      compute 3D FFT( psi(i) )
   end do

With task groups:

   do i = 1, n, groupsize
      merge( psi(i), psi(i+1), ..., psi(i+groupsize-1) )
      compute groupsize 3D FFTs at the same time
   end do


Understanding the QE 3D FFT: master-only
• The local (to each MPI process) set of "z" sticks is distributed to threads (thread-private work).
• Parallel transpose (implemented with isend/irecv); in the master-only scheme this MPI communication happens outside the OpenMP parallel regions.
• The local (to each MPI process) set of "xy" planes is distributed to threads (thread-private work).
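A hedged sketch of the master-only scheme just outlined, not the actual QE driver: fft_z_sticks, fft_xy_planes and transpose_with_isend_irecv are hypothetical placeholders for the real 1D/2D FFT and transpose routines. The thread-parallel work sits inside OpenMP regions, while all MPI communication is issued outside any parallel region.

   subroutine fft3d_master_only(sticks, nsticks, planes, nplanes, comm)
      use mpi
      implicit none
      integer, intent(in) :: nsticks, nplanes, comm
      complex(8), intent(inout) :: sticks(:,:)   ! local "z" sticks of this MPI process
      complex(8), intent(inout) :: planes(:,:)   ! local "xy" planes of this MPI process
      integer :: i, ierr

      ! Local z sticks distributed to threads (thread-private work)
!$omp parallel do private(i)
      do i = 1, nsticks
         call fft_z_sticks( sticks(:,i) )        ! hypothetical 1D FFT along z
      end do
!$omp end parallel do

      ! Parallel transpose among MPI processes (isend/irecv in the real code),
      ! issued outside any OpenMP region: master-only communication
      call transpose_with_isend_irecv( sticks, planes, comm, ierr )

      ! Local xy planes distributed to threads (thread-private work)
!$omp parallel do private(i)
      do i = 1, nplanes
         call fft_xy_planes( planes(:,i) )       ! hypothetical 2D FFT in the xy plane
      end do
!$omp end parallel do
   end subroutine fft3d_master_only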


QE 3D FFT: funneled variant (only the master thread performs MPI calls, from within the OpenMP parallel region).
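A hedged sketch of the funneled variant (assumed shape, reusing the hypothetical helpers of the previous sketch; not the actual QE code): the parallel region now spans the whole transform and only the master thread performs the MPI transpose, saving a fork/join compared with the master-only version.

   subroutine fft3d_funneled(sticks, nsticks, planes, nplanes, comm)
      use mpi
      implicit none
      integer, intent(in) :: nsticks, nplanes, comm
      complex(8), intent(inout) :: sticks(:,:), planes(:,:)
      integer :: i, ierr

!$omp parallel private(i)
      ! z sticks shared among the threads of the team
!$omp do
      do i = 1, nsticks
         call fft_z_sticks( sticks(:,i) )        ! hypothetical 1D FFT along z
      end do
!$omp end do
      ! Only the master thread funnels the MPI communication; the implicit
      ! barrier of the previous "end do" guarantees all sticks are ready
!$omp master
      call transpose_with_isend_irecv( sticks, planes, comm, ierr )
!$omp end master
!$omp barrier
!$omp do
      do i = 1, nplanes
         call fft_xy_planes( planes(:,i) )       ! hypothetical 2D FFT in the xy plane
      end do
!$omp end do
!$omp end parallel
   end subroutine fft3d_funneled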


QE 3D FFT: multiple
• Pros:
  – Overlap of communication and computation
  – Fewer synchronizations between threads and fewer memory flushes
• Cons:
  – Does not exploit the memory and network hierarchy
  – Stresses the network like plain MPI
Not yet completed.
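A hedged sketch of what the "multiple" variant could look like (purely illustrative, since the slide notes the QE version was not yet completed): every thread posts its own non-blocking exchanges, which requires MPI_THREAD_MULTIPLE support and lets each thread overlap its computation with the communication of the others, at the price of stressing the network much like plain MPI.

   subroutine transpose_multiple(sendbuf, recvbuf, chunk, nprocs, comm)
      use mpi
      use omp_lib
      implicit none
      integer, intent(in) :: chunk, nprocs, comm
      complex(8), intent(in)  :: sendbuf(chunk, nprocs)   ! one chunk per destination rank
      complex(8), intent(out) :: recvbuf(chunk, nprocs)   ! one chunk per source rank
      integer :: p, ierr
      integer :: reqs(2)

      ! Each thread exchanges its own share of the chunks: MPI is called
      ! concurrently from several threads, so MPI_THREAD_MULTIPLE is required
!$omp parallel do private(p, reqs, ierr)
      do p = 1, nprocs
         call MPI_Irecv(recvbuf(:,p), chunk, MPI_DOUBLE_COMPLEX, p-1, 0, comm, reqs(1), ierr)
         call MPI_Isend(sendbuf(:,p), chunk, MPI_DOUBLE_COMPLEX, p-1, 0, comm, reqs(2), ierr)
         call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
         ! ... per-chunk computation could start here, overlapping the
         !     communication issued by the other threads ...
      end do
!$omp end parallel do
   end subroutine transpose_multiple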


Performance of the 3D FFT in QE
BCX (CINECA) and HPCx (EPCC): FFT speed-up relative to 64-core pure MPI, 64 to 1024 cores (pure MPI and MPI+OpenMP with 2 and 4 threads; 8 threads on HPCx).
FFT grid size: 200 x 200 x 200; 16 task groups.


Space integrals: electrostatic potential (simple loop parallelization)

!$omp parallel do default(shared), private(rp,is,rhet,rhog,fpibg), reduction(+:eh,ehte,ehti)
   DO ig = gstart, ngm
      rp = (0.D0,0.D0)
      DO is = 1, nsp
         rp = rp + sfac( ig, is ) * rhops( ig, is )    ! ionic density (reciprocal space)
      END DO
      rhet = rhoeg( ig )                               ! electronic density (reciprocal space)
      rhog = rhet + rp
      IF( tscreen ) THEN
         fpibg = fpi / ( g(ig) * tpiba2 ) + screen_coul(ig)
      ELSE
         fpibg = fpi / ( g(ig) * tpiba2 )
      END IF
      vloc(ig) = vloc(ig) + fpibg * rhog               ! Hartree potential
      eh   = eh   + fpibg * rhog * CONJG(rhog)         ! Hartree energy
      ehte = ehte + fpibg * DBLE( rhet * CONJG(rhet) ) ! electronic contribution
      ehti = ehti + fpibg * DBLE( rp   * CONJG(rp)   ) ! ionic contribution
   END DO

Note: the IBM xlf compiler does not like the name of a function being used for an OpenMP reduction.


Space integrals: non-local energy (less simple loop parallelization)
A larger parallel region is used to reduce the fork/join overhead: the parallel construct encloses the whole loop nest, and the work-shared !$omp do is placed on the loop over atoms.

!$omp parallel default(shared), &
!$omp private(is,iv,ijv,isa,ism,ia,inl,jnl,sums,i,iss,sumt), reduction(+:ennl)
   do is = 1, nsp
      do iv = 1, nh(is)
         do jv = iv, nh(is)
            ijv = (jv-1)*jv/2 + iv
            isa = 0
            do ism = 1, is - 1
               isa = isa + na(ism)
            end do
!$omp do
            do ia = 1, na(is)
               inl = ish(is) + (iv-1)*na(is) + ia
               jnl = ish(is) + (jv-1)*na(is) + ia
               sums = 0.d0
               do i = 1, n
                  iss = ispin(i)
                  sums(iss) = sums(iss) + f(i) * bec(inl,i) * bec(jnl,i)
               end do
               sumt = 0.d0
               do iss = 1, nspin
                  rhovan( ijv, isa + ia, iss ) = sums( iss )
                  sumt = sumt + sums( iss )
               end do
               if( iv .ne. jv ) sumt = 2.d0 * sumt
               ennl = ennl + sumt * dvan( jv, iv, is )
            end do
!$omp end do
         end do
      end do
   end do
!$omp end parallel


Point function evaluation: exchange and correlation energy

!$omp parallel do private( rhox, arhox, ex, ec, vx, vc ), reduction(+:etxc)
   do ir = 1, nnr
      rhox = rhor(ir, nspin)                      ! real-space electronic charge density
      arhox = abs(rhox)
      if (arhox.gt.1.d-30) then
         CALL xc( arhox, ex, ec, vx(1), vc(1) )   ! XC functional (external subroutine)
         v(ir,nspin) = e2 * ( vx(1) + vc(1) )     ! XC potential
         etxc = etxc + e2 * ( ex + ec ) * rhox    ! XC energy
      endif
   enddo
!$omp end parallel do


Gram-Schmidt kernel: dealing with thread-private allocatable arrays

!$omp parallel default(shared), private( temp, k, ig )
   ALLOCATE( temp( ngw ) )
!$omp do
   DO k = 1, kmax
      csc(k) = 0.0d0
      IF ( ispin(i) .EQ. ispin(k) ) THEN
         DO ig = 1, ngw
            temp(ig) = cp(1,ig,k) * cp(1,ig,i) + cp(2,ig,k) * cp(2,ig,i)
         END DO
         csc(k) = 2.0d0 * SUM(temp)
         IF (gstart == 2) csc(k) = csc(k) - temp(1)
      ENDIF
   END DO
!$omp end do
   DEALLOCATE( temp )
!$omp end parallel

Each thread allocates its own private copy of the work array temp inside the parallel region and releases it before the region ends.


Example of non-trivial loop parallelization: distributing atoms round-robin to threads.

DO nt = 1, ntyp
   IF ( upf(nt)%tvanp ) THEN
      DO ih = 1, nh(nt)
         DO jh = ih, nh(nt)
            CALL qvan2( ngm, ih, jh, nt, qmod, qgm, ylmk0 )
!$omp parallel default(shared), private(na,qgm_na,is,dtmp,ig,mytid,ntids)
#ifdef __OPENMP
            mytid = omp_get_thread_num()    ! take the thread ID
            ntids = omp_get_num_threads()   ! take the number of threads
#endif
            ALLOCATE( qgm_na( ngm ) )
            DO na = 1, nat
#ifdef __OPENMP
               IF( MOD( na, ntids ) /= mytid ) CYCLE   ! distribute atoms round-robin to threads
#endif
               IF ( ityp(na) == nt ) THEN
                  qgm_na(1:ngm) = qgm(1:ngm) * eigts1(ig1(1:ngm),na) * &
                                  eigts2(ig2(1:ngm),na) * eigts3(ig3(1:ngm),na)
                  DO is = 1, nspin_mag
                     dtmp = 0.0d0
                     DO ig = 1, ngm
                        dtmp = dtmp + aux( ig, is ) * CONJG( qgm_na( ig ) )
                     END DO
                     deeq(ih,jh,na,is) = fact * omega * DBLE( dtmp )
                     deeq(jh,ih,na,is) = deeq(ih,jh,na,is)
                  END DO
               END IF
            END DO
            DEALLOCATE( qgm_na )
!$omp end parallel
         END DO
      END DO
   END IF
END DO


ScaLAPACK diagonalization
• TurboRVB
• Ill-conditioned 25000x25000 matrix
• PDSYEVD, PDSYEVX
• HUYGENS (SARA)
• DECI - QMCHTC
• Ad-hoc diagonalization


PDSYEVX vs ad-hoc subroutine
TurboRVB: localized-orbital DFT + Quantum Monte Carlo for isolated and periodic systems.
The starting Hamiltonian is built from trial non-orthogonal localized orbitals, giving an ill-conditioned symmetric matrix with many eigenvalues close to zero.
PDSYEVX requires larger and larger workspace and longer and longer time to converge.
The ad-hoc parallel subroutine uses mixed MPI+OpenMP: it is slower for standard matrices, but it does not depend on the condition number and needs no additional workspace.


Ad-hoc diagonalization
• Based on the Numerical Recipes/Householder algorithm (two subroutines: LU factorization + QL rotations).
• The matrix is distributed by rows to MPI processes (not as in ScaLAPACK), so an additional redistribution of data is needed.
• Numerical consistency and machine precision are managed carefully (limits based on the physical problem).
• The tridiagonal system is solved by the root process only.
References:
Numerical Recipes: The Art of Scientific Computing, W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Cambridge University Press, Cambridge.
Parallel Numerical Algorithms, T.L. Freeman and C. Phillips, Prentice Hall International (1992).


Conclusion
• The number of cores doubles every two years.
• MPI and OpenMP together exploit multi-core nodes.
• Both implicit and explicit multi-threading are used.
• The mixed paradigm is not a big effort, given good compilers and libraries.
• Next challenge: beyond parallel computing (reliable, hybrid).


Acknowledgments
• Filippo Spiga (University of Milan - Bicocca)
• Mark Bull (EPCC)
• EPCC, SARA, SCS staff
