parallelization issues and particle-in-cell codes - Department of ...
parallelization issues and particle-in-cell codes - Department of ...
parallelization issues and particle-in-cell codes - Department of ...
Transform your PDFs into Flipbooks and boost your revenue!
Leverage SEO-optimized Flipbooks, powerful backlinks, and multimedia content to professionally showcase your products and significantly increase your reach.
PARALLELIZATIONISSUES PARTICLE-IN-CELLCODES AND<br />
<strong>in</strong>PartialFulllment<strong>of</strong>theRequirementsfortheDegree<strong>of</strong> PresentedtotheFaculty<strong>of</strong>theGraduateSchool <strong>of</strong>CornellUniversity ADissertation<br />
Doctor<strong>of</strong>Philosophy<br />
AnneCathr<strong>in</strong>eElster August1994 by
cAnneCathr<strong>in</strong>eElster1994 ALLRIGHTSRESERVED
PARALLELIZATIONISSUES PARTICLE-IN-CELLCODES AnneCathr<strong>in</strong>eElster,Ph.D. AND<br />
\Everyth<strong>in</strong>gshouldbemadeassimpleaspossible,butnotsimpler." CornellUniversity1994<br />
<strong>of</strong><strong>in</strong>dividualmodulessuchasmatrixsolvers<strong>and</strong>factorizers.However,manyapplications<strong>in</strong>volveseveral<strong>in</strong>teract<strong>in</strong>gmodules.Ouranalyses<strong>of</strong>a<strong>particle</strong>-<strong>in</strong>-<strong>cell</strong><br />
Theeld<strong>of</strong>parallelscienticcomput<strong>in</strong>ghasconcentratedon<strong>parallelization</strong> {AlbertE<strong>in</strong>ste<strong>in</strong>.<br />
<strong>in</strong>gdependenciesaectdatapartition<strong>in</strong>g<strong>and</strong>leadtonew<strong>parallelization</strong>strategies concern<strong>in</strong>gprocessor,memory<strong>and</strong>cacheutilization.Ourtest-bed,aKSR1,is codemodel<strong>in</strong>gcharged<strong>particle</strong>s<strong>in</strong>anelectriceld,showthattheseaccompanyever,most<strong>of</strong>thenewmethodspresentedholdgenerallyforhierarchical<strong>and</strong>/or<br />
distributedmemorysystems. adistributedmemorymach<strong>in</strong>ewithagloballysharedaddress<strong>in</strong>gspace.How-<br />
performanceanalyseswithaccompany<strong>in</strong>gKSRbenchmarks,havebeen<strong>in</strong>cludedfor boththisscheme<strong>and</strong>forthetraditionalreplicatedgridsapproach. arraystokeepthe<strong>particle</strong>locationsautomaticallypartiallysorted.Complexity<strong>and</strong> We<strong>in</strong>troduceanovelapproachthatusesdualpo<strong>in</strong>tersonthelocal<strong>particle</strong>
ourresultsdemonstrateitfailstoscaleproperlyforproblemswithlargegrids(say, storage<strong>and</strong>computationtimeassociatedwithadd<strong>in</strong>gthegridcopies,becomes greaterthan128-by-128)runn<strong>in</strong>gonasfewas15KSRnodes,s<strong>in</strong>cetheextra Thelatterapproachma<strong>in</strong>ta<strong>in</strong>sload-balancewithrespectto<strong>particle</strong>s.However,<br />
signicant.<br />
<strong>particle</strong>distributions.Ourdualpo<strong>in</strong>terapproachmayfacilitatethisthroughdynamicallypartitionedgrids.<br />
parallelsystems.Itmay,however,requireloadbalanc<strong>in</strong>gschemesfornon-uniform replicatethewholegrid.Consequently,itscaleswellforlargeproblemsonhighly Ourgridpartition<strong>in</strong>gscheme,althoughhardertoimplement,doesnotneedto<br />
producesa25%sav<strong>in</strong>gs<strong>in</strong>cache-hitsfora4-by-4cache. po<strong>in</strong>tswith<strong>in</strong>thesamecache-l<strong>in</strong>ebyreorder<strong>in</strong>gthegrid<strong>in</strong>dex<strong>in</strong>g.Thisalignment Aconsideration<strong>of</strong>the<strong>in</strong>putdata'seectonthesimulationmayleadt<strong>of</strong>urtherimprovements.Forexample,<strong>in</strong>thecase<strong>of</strong>mean<strong>particle</strong>drift,itis<strong>of</strong>ten<br />
advantageoustopartitionthegridprimarilyalongthedirection<strong>of</strong>thedrift.<br />
Wealso<strong>in</strong>troducehierarchicaldatastructuresthatstoreneighbor<strong>in</strong>ggrid-<br />
<strong>codes</strong>isalsogiven. <strong>in</strong>stabilities.Anoverview<strong>of</strong>themostcentralreferencesrelatedtoparallel<strong>particle</strong> whichleadtopredictablephenomena<strong>in</strong>clud<strong>in</strong>gplasmaoscillations<strong>and</strong>two-stream The<strong>particle</strong>-<strong>in</strong>-<strong>cell</strong><strong>codes</strong>forthisstudyweretestedus<strong>in</strong>gphysicalparameters,
BiographicalSketch<br />
Norway,toSynnve<strong>and</strong>NilsLoeElsteronOctober2,1962. AnneCathr<strong>in</strong>eElsterwasbornjustsouth<strong>of</strong>theArcticCircle,<strong>in</strong>MoiRana, \Cognitoergosum."(Ith<strong>in</strong>k,thereforeIam.)<br />
Herelementary<strong>and</strong>secondaryeducationswhereobta<strong>in</strong>edatMissoradoSchool {Rene'Decartes,DiscourseonMethod,1637.<br />
(1968-70),Monrovia,Liberia;BrevikBarneskole<strong>and</strong>AsenUngdommskole,Brevik, Norway(1970-78),followedbyPorsgrunnvideregaendeskole,Norway,wereshe completedherExamenArtium<strong>in</strong>1981withascienceconcentration<strong>in</strong>chemistry<br />
enrolledthefollow<strong>in</strong>gyearattheUniversity<strong>of</strong>MassachusettsatAmherstwhere <strong>and</strong>physics.AwardedascholarshipthroughtheNorway-AmericaAssociation<br />
shereceivedherB.Sc.degree<strong>in</strong>ComputerSystemsEng<strong>in</strong>eer<strong>in</strong>gcumlaude<strong>in</strong>May major<strong>in</strong>g<strong>in</strong>pre-bus<strong>in</strong>ess,butslant<strong>in</strong>gherprogramtowardscomputerscience.She bytheUniversity<strong>of</strong>Oregon,Eugene,shespentherrstyear<strong>of</strong>collegeocially<br />
1985.Annejo<strong>in</strong>edtheSchool<strong>of</strong>ElectricalEng<strong>in</strong>eer<strong>in</strong>gatCornellUniversity<strong>in</strong> SchlumbergerWellServicesattheirAust<strong>in</strong>SystemsCenterfollow<strong>in</strong>gherPh.D. AnthonyP.Reeveschair<strong>in</strong>gherCommittee.Shehasacceptedapositionwith September1985fromwhichreceivedanMSdegree<strong>in</strong>August1988withPr<strong>of</strong>essor<br />
iii
KareAndreasBjrnerud Tothememory<strong>of</strong><br />
iv
Acknowledgements<br />
FirstIwouldliketothankmyadvisorPr<strong>of</strong>.NielsF.Otani<strong>and</strong>myunocial co-advisorsDr.JohnG.Shaw<strong>and</strong>Dr.PalghatS.Ramesh<strong>of</strong>theXeroxDesign aloud."{Author'sre-write<strong>of</strong>quotebyRalphWaldoEmerson,Friends. \FriendsarepeoplewithwhomImaybes<strong>in</strong>cere.Beforethem,Imayth<strong>in</strong>k<br />
youforbeliev<strong>in</strong>g<strong>in</strong>me<strong>and</strong>lett<strong>in</strong>gmegetaglimpse<strong>of</strong>theworld<strong>of</strong>computational physics! <strong>and</strong>advicewereessentialforthedevelopment<strong>and</strong>completion<strong>of</strong>thiswork.Thank ResearchInstitute(DRI)formak<strong>in</strong>gitallpossible.Theirencouragement,support<br />
gali<strong>of</strong>ComputerScience<strong>and</strong>hisNUMAgroupfornumeroushelpfuldiscussions, <strong>and</strong>SpecialCommitteememberPr<strong>of</strong>.Soo-YoungLeeforhisusefulsuggestions. GratitudeisalsoextendedtomySpecialCommitteememberPr<strong>of</strong>.KeshavP<strong>in</strong>partment<strong>and</strong>theircomputerstas,forprovid<strong>in</strong>g<strong>in</strong>valuablehelp<strong>and</strong>computer<br />
resources,<strong>and</strong>Dr.GregoryW.Zack<strong>of</strong>theXeroxDesignResearchInstitutefor hissupport. IalsowishtothanktheCornellTheoryCenter,theComputerScienceDe-<br />
northespiritneededtogetthroughthisdegreeprogram.Most<strong>of</strong>all,Iwould theirencouragement<strong>and</strong>moralsupport,Iwouldneitherhavehadthecondence Specialthanksareextendedtoallmygoodfriendsthroughtheyears.Without v
(Tusentakk,MammaogPappa!)<strong>and</strong>mysibl<strong>in</strong>gs,JohanFredrik<strong>and</strong>ToneBente. liketoacknowledgemyfamily<strong>in</strong>clud<strong>in</strong>gmyparents,Synnve<strong>and</strong>NilsL.Elster<br />
grant,theMathematicalScienceInstitute,<strong>and</strong>IBMthroughtheComputerScience<strong>Department</strong>,forthesupportthroughGraduateResearchAssistantships;the<br />
F<strong>in</strong>ally,Ialsowishtoexpressmyhonor<strong>and</strong>gratitudetomysponsors<strong>in</strong>clud<strong>in</strong>g: XeroxCorporation<strong>and</strong>theNationalScienceFoundationthroughmyadvisor'sPYI<br />
enticResearch,Mr.<strong>and</strong>Mrs.Rob<strong>in</strong>sonthroughtheThankstoSc<strong>and</strong><strong>in</strong>aviaInc., tricalEng<strong>in</strong>eer<strong>in</strong>g<strong>and</strong>theComputerScience<strong>Department</strong>forthesupportthrough Teach<strong>in</strong>gAssistantships;<strong>and</strong>theRoyalNorwegianCouncilforIndustrial<strong>and</strong>Sci-<br />
NorwegianGovernmentforprovid<strong>in</strong>gscholarships<strong>and</strong>loans;theSchool<strong>of</strong>Elec-<br />
aswellasCornellUniversity,forthegenerousfellowships.<br />
ProjectsAgency,theNationalInstitutes<strong>of</strong>Health,IBMCorporation<strong>and</strong>other tion,<strong>and</strong>NewYorkState.Additionalfund<strong>in</strong>gcomesfromtheAdvancedResearch TheoryCenter,whichreceivesmajorfund<strong>in</strong>gfromtheNationalScienceFounda-<br />
Thisresearchwasalsoconducted<strong>in</strong>partus<strong>in</strong>gtheresources<strong>of</strong>theCornell<br />
members<strong>of</strong>thecenter'sCorporateResearchInstitute. whoprobablyunknow<strong>in</strong>glysupportedmyeortsthroughtheaboveorganizations <strong>and</strong><strong>in</strong>stitutions.Maytheir<strong>in</strong>vestmentspayosomeday! Igratefullyacknowledgeallthetaxpayers<strong>in</strong>Norway<strong>and</strong>theUnitedStates<br />
knownanerscholar. closefriend<strong>and</strong>\brother"whosounexpectedlypassedawayonlyaweekafter defend<strong>in</strong>ghisPh.D.<strong>in</strong>FrenchLiteratureatOxford<strong>in</strong>December1992.Ihavenot Thisdissertationisdedicatedtothememory<strong>of</strong>KareAndrewBjrnerud,my<br />
vi
1Introduction Table<strong>of</strong>Contents 1.1Motivation<strong>and</strong>Goals::::::::::::::::::::::::::1 1.2ParticleSimulationModels:::::::::::::::::::::::4 1.3NumericalTechniques:::::::::::::::::::::::::5 1.4Contributions::::::::::::::::::::::::::::::6 1.5Appendices:::::::::::::::::::::::::::::::7 1.1.1Term<strong>in</strong>ology:::::::::::::::::::::::::::2 1<br />
2PreviousWorkonParticleCodes 2.3OtherParallelPICReferences:::::::::::::::::::::11 2.2ParallelParticle-<strong>in</strong>-Cell<strong>codes</strong>:::::::::::::::::::::9 2.1TheOrig<strong>in</strong>s<strong>of</strong>ParticleCodes:::::::::::::::::::::8 2.3.2Hypercubeapproaches:::::::::::::::::::::12 2.3.1Fight<strong>in</strong>gdatalocalityonthebit-serialMPP:::::::::11 2.2.1Vector<strong>and</strong>low-ordermultitask<strong>in</strong>g<strong>codes</strong>:::::::::::10 8<br />
3ThePhysics<strong>of</strong>ParticleSimulationCodes 2.4LoadBalanc<strong>in</strong>g:::::::::::::::::::::::::::::17 2.3.5Other<strong>particle</strong>methods:::::::::::::::::::::16 2.3.3AMasParapproach:::::::::::::::::::::::15 2.3.4ABBNattempt:::::::::::::::::::::::::16<br />
3.2TheDiscreteModelEquations:::::::::::::::::::::26 3.1ParticleModel<strong>in</strong>g::::::::::::::::::::::::::::25 2.4.1Dynamictaskschedul<strong>in</strong>g::::::::::::::::::::18<br />
3.3Solv<strong>in</strong>gforTheField::::::::::::::::::::::::::31 3.2.1Poisson'sequation:::::::::::::::::::::::27 3.2.2Thecharge,q::::::::::::::::::::::::::28 3.2.3ThePlasmaFrequency,!p:::::::::::::::::::30<br />
3.3.1Poisson'sEquation:::::::::::::::::::::::32 3.3.2FFTsolvers:::::::::::::::::::::::::::33 3.3.3F<strong>in</strong>ite-dierencesolvers:::::::::::::::::::::36 3.3.4Neumannboundaries::::::::::::::::::::::37 vii
3.4Mesh{ParticleInteractions:::::::::::::::::::::::39 3.5Mov<strong>in</strong>gthe<strong>particle</strong>s::::::::::::::::::::::::::42 3.6Test<strong>in</strong>gtheCode{ParameterRequirements:::::::::::::46 3.6.1!p<strong>and</strong>thetimestep::::::::::::::::::::::46 3.4.1Apply<strong>in</strong>gtheeldtoeach<strong>particle</strong>:::::::::::::::39<br />
3.6.2Two-streamInstabilityTest::::::::::::::::::48 3.5.1TheMobilityModel::::::::::::::::::::::42 3.5.2TheAccelerationModel::::::::::::::::::::43 3.4.2Recomput<strong>in</strong>geldsdueto<strong>particle</strong>s::::::::::::::40<br />
4Parallelization<strong>and</strong>HierarchicalMemoryIssues 3.7ResearchApplication{DoubleLayers::::::::::::::::50 4.3TheSimulationGrid::::::::::::::::::::::::::55 4.2DistributedMemoryversusSharedMemory:::::::::::::54 4.1Introduction:::::::::::::::::::::::::::::::53 4.3.1ReplicatedGrids::::::::::::::::::::::::55 4.3.2DistributedGrids::::::::::::::::::::::::56 4.3.3Block-column/Block-rowPartition<strong>in</strong>g:::::::::::::56<br />
4.4ParticlePartition<strong>in</strong>g::::::::::::::::::::::::::57 4.3.4GridsBasedonR<strong>and</strong>omParticleDistributions::::::::56 4.4.2PartialSort<strong>in</strong>g:::::::::::::::::::::::::58 4.4.1FixedProcessorPartition<strong>in</strong>g::::::::::::::::::57<br />
4.6Particlesort<strong>in</strong>g<strong>and</strong><strong>in</strong>homogeneousproblems::::::::::::63 4.5LoadBalanc<strong>in</strong>g:::::::::::::::::::::::::::::60 4.6.1DynamicPartition<strong>in</strong>gs:::::::::::::::::::::64 4.5.3Load<strong>and</strong>distance:::::::::::::::::::::::62 4.5.2Loadbalanc<strong>in</strong>gus<strong>in</strong>gthe<strong>particle</strong>densityfunction::::::61 4.5.1TheUDDapproach:::::::::::::::::::::::60 4.4.3DoublePo<strong>in</strong>terScheme:::::::::::::::::::::59<br />
4.7TheFieldSolver::::::::::::::::::::::::::::66 4.6.2Communicationpatterns::::::::::::::::::::64 4.6.3N-body/MultipoleIdeas::::::::::::::::::::65 4.7.1Processorutilization::::::::::::::::::::::67<br />
5AlgorithmicComplexity<strong>and</strong>PerformanceAnalyses 4.8InputEfects:ElectromagneticConsiderations::::::::::::71 4.9HierarchicalMemoryDataStructures:CellCach<strong>in</strong>g:::::::::72 4.7.3FFTSolvers:::::::::::::::::::::::::::67 4.7.2Non-uniformgrid<strong>issues</strong>::::::::::::::::::::67<br />
5.2Model::::::::::::::::::::::::::::::::::75 5.1Introduction:::::::::::::::::::::::::::::::74 4.7.4Multigrid::::::::::::::::::::::::::::71<br />
5.2.1KSRSpecics::::::::::::::::::::::::::76 viii
5.3SerialPICPerformance:::::::::::::::::::::::::80 5.2.2Modelparameters::::::::::::::::::::::::78 5.2.3ResultSummary::::::::::::::::::::::::80<br />
5.4ParallelPIC{FixedParticlePartition<strong>in</strong>g,Replicatedgrids:::::85 5.3.4FFT-solver:::::::::::::::::::::::::::83 5.3.5Field-GridCalculation:::::::::::::::::::::84 5.3.3Calculat<strong>in</strong>gtheParticles'ContributiontotheChargeDensity(Scatter)::::::::::::::::::::::::::83<br />
5.3.2ParticleUpdates{Positions::::::::::::::::::82 5.3.1ParticleUpdates{Velocities::::::::::::::::::82<br />
5.4.3Calculat<strong>in</strong>gtheParticles'ContributiontotheChargeDensity87 5.4.2ParticleUpdates{Positions::::::::::::::::::86 5.4.1ParticleUpdates{Velocities::::::::::::::::::85 5.5ParallelPIC{PartitionedChargeGridUs<strong>in</strong>gTemporaryBorders 5.5.3Calculat<strong>in</strong>gtheParticles'ContributiontotheChargeDensity96 5.5.2ParticleUpdates{Positions::::::::::::::::::93 5.5.1ParticleUpdates{Velocities::::::::::::::::::92 5.4.5ParallelField-GridCalculation::::::::::::::::91 <strong>and</strong>PartiallySortedLocalParticleArrays::::::::::::::92 5.4.4DistributedmemoryFFTsolver::::::::::::::::88<br />
6ImplementationontheKSR1 5.6HierarchicalDatastructures:Cell-cach<strong>in</strong>g::::::::::::::98 5.5.5ParallelField-GridCalculation::::::::::::::::98 5.5.4FFTsolver:::::::::::::::::::::::::::98<br />
6.3ParallelSupportontheKSR::::::::::::::::::::::105 6.1Architectureoverview::::::::::::::::::::::::::103 6.2Someprelim<strong>in</strong>arytim<strong>in</strong>gresults::::::::::::::::::::103 6.4KSRPICCode:::::::::::::::::::::::::::::106 6.3.1CversusFortran::::::::::::::::::::::::106 6.4.1Port<strong>in</strong>gtheserial<strong>particle</strong>code::::::::::::::::106 102<br />
6.10ParticleScal<strong>in</strong>g:::::::::::::::::::::::::::::117 6.5Cod<strong>in</strong>gtheParallelizations:::::::::::::::::::::::109<br />
6.11Test<strong>in</strong>g::::::::::::::::::::::::::::::::::117 6.9GridScal<strong>in</strong>g:::::::::::::::::::::::::::::::117 6.8FFTSolver:::::::::::::::::::::::::::::::116 6.7DistributedGrid::::::::::::::::::::::::::::113 6.6Replicat<strong>in</strong>gGrids::::::::::::::::::::::::::::111 6.5.1ParallelizationUs<strong>in</strong>gSPC:::::::::::::::::::109<br />
7Conclusions<strong>and</strong>Futurework 7.1FutureWork:::::::::::::::::::::::::::::::123 ix 121
AAnnotatedBibliography A.2ReferenceBooks:::::::::::::::::::::::::::::125 A.1Introduction:::::::::::::::::::::::::::::::124 A.2.1Hockney<strong>and</strong>Eastwood:\ComputerSimulationsUs<strong>in</strong>gParticles"::::::::::::::::::::::::::::::12ulation":::::::::::::::::::::::::::::125<br />
124 A.2.2Birdsall<strong>and</strong>Langdon:\PlasmaPhysicsViaComputerSim-<br />
A.3GeneralParticleMethods:::::::::::::::::::::::126 A.3.1F.H.Harlow:"PIC<strong>and</strong>itsProgeny":::::::::::::126 A.3.2J.Ambrosiano,L.Greengard,<strong>and</strong>V.Rokhl<strong>in</strong>:TheFast A.2.3Others::::::::::::::::::::::::::::::125 A.3.3D.W.Hewett<strong>and</strong>A.B.Langdon:RecentProgresswithAvanti: A.2.4Foxet.al:\Solv<strong>in</strong>gProblemsonConcurrentProcessors"::126<br />
A.3.4S.H.Brecht<strong>and</strong>V.A.Thomas:MultidimensionalSimulationsUs<strong>in</strong>gHybridParticleCodes:::::::::::::::127<br />
MultipoleMethodforGridlessParticleSimulation::::::127 A2.5DEMDirectImplicitPICCode:::::::::::::127<br />
A.4ParallelPIC{SurveyArticles:::::::::::::::::::::129 A.4.1JohnM.Dawson::::::::::::::::::::::::129 A.3.6A.Mank<strong>of</strong>skyetal.:Doma<strong>in</strong>Decomposition<strong>and</strong>Particle A.3.5C.S.L<strong>in</strong>,A.L.Thr<strong>in</strong>g,<strong>and</strong>J.Koga:GridlessParticleSimulationUs<strong>in</strong>gtheMassivelyParallelProcessor(MPP)::::128<br />
A.4.3ClaireMax:\ComputerSimulation<strong>of</strong>AstrophysicalPlasmas"130 A.4.2DavidWalker'ssurveyPaper::::::::::::::::::129 Push<strong>in</strong>gforMultiprocess<strong>in</strong>gComputers::::::::::::128 A.5OtherParallelPICReferences:::::::::::::::::::::130<br />
A.5.7Sturtevant<strong>and</strong>Maccabee(Univ.<strong>of</strong>NewMexico)::::::136 A.5.6PauletteLiewer(JPL)et.al.:::::::::::::::::135 A.5.4Lubeck<strong>and</strong>Faber(LANL):::::::::::::::::::133 A.5.5Azari<strong>and</strong>Lee'sWork::::::::::::::::::::::134 A.5.3DavidWalker(ORNL):::::::::::::::::::::132 A.5.2C.S.L<strong>in</strong>(SouthwestResearchInstitute,TX)etal::::::131 A.5.1Vector<strong>codes</strong>{Horowitzetal:::::::::::::::::130<br />
BCalculat<strong>in</strong>g<strong>and</strong>Verify<strong>in</strong>gthePlasmaFrequency B.2Thepotentialequation:::::::::::::::::::::::::138 B.1Introduction:::::::::::::::::::::::::::::::137 A.5.8PeterMacNeice(Hughes/Goddard)::::::::::::::136<br />
B.3Velocity<strong>and</strong><strong>particle</strong>positions:::::::::::::::::::::140 B.2.2TheSORsolver:::::::::::::::::::::::::138 B.2.1TheFFTsolver:::::::::::::::::::::::::138 B.2.3Theeldequations:::::::::::::::::::::::139 137<br />
x
Bibliography B.4Chargedensity:::::::::::::::::::::::::::::140 B.5Plugg<strong>in</strong>gtheequations<strong>in</strong>toeachother::::::::::::::::144 149<br />
xi
List<strong>of</strong>Tables 2.2Overview<strong>of</strong>ParallelPICReferences{Lieweretal.1988-89::::21 2.1Overview<strong>of</strong>ParallelPICReferences{2.5DHybrid:::::::::20<br />
4.1Boundary/Arearatiosfor2Dpartition<strong>in</strong>gswithunitarea.:::::57 4.2Surface/VolumeRatiosfor3Dpartition<strong>in</strong>gswithunitvolume.:::57 3.1Plasmaoscillations<strong>and</strong>time-step:::::::::::::::::::49 2.5Overview<strong>of</strong>ParallelPICReferences{Others::::::::::::24 2.4Overview<strong>of</strong>ParallelPICReferences{Walker::::::::::::23 2.3Overview<strong>of</strong>ParallelPICReferences{Lieweretal.1990-93::::22<br />
6.4DistributedGrid{InputEects:::::::::::::::::::116 5.1PerformanceComplexity{PICAlgorithms:::::::::::::81 6.3SerialPerformance<strong>of</strong>ParticleCodeSubrout<strong>in</strong>es::::::::::108 6.2SerialPerformanceontheKSR1:::::::::::::::::::107 6.1SSCALSerialTim<strong>in</strong>gResults:::::::::::::::::::::104<br />
xii
List<strong>of</strong>Figures 3.13-Dview<strong>of</strong>a2-D<strong>particle</strong>simulation.Chargesarethought<strong>of</strong>as 3.3Calculation<strong>of</strong>eldatlocation<strong>of</strong><strong>particle</strong>us<strong>in</strong>gbi-l<strong>in</strong>ear<strong>in</strong>terpolation.41 3.4TheLeapfrogMethod.::::::::::::::::::::::::45 3.2Calculation<strong>of</strong>nodeentry<strong>of</strong>lowercorner<strong>of</strong>current<strong>cell</strong>::::::40 3.5Two-stream<strong>in</strong>stabilitytest.a)Initialconditions,b)wavesare \rods".:::::::::::::::::::::::::::::::::29<br />
4.3X-proleforParticleDistributionShown<strong>in</strong>Figure4.2:::::::61 4.5BasicTopologies:(a)2DMesh(grid),(b)R<strong>in</strong>g.::::::::::69 4.4NewGridDistributionDuetoX-prole<strong>in</strong>Figure4.3.:::::::62 4.1Gridpo<strong>in</strong>tdistribution(rows)oneachprocessor.::::::::::56 4.2Inhomogeneous<strong>particle</strong>distribution:::::::::::::::::61 form<strong>in</strong>g,c)characterstictwo-streameye.:::::::::::::::51<br />
5.1Transpose<strong>of</strong>a4-by-4matrixona4-processorr<strong>in</strong>g.:::::::::89 4.6Rowstorageversus<strong>cell</strong>cach<strong>in</strong>gstorage:::::::::::::::73<br />
5.5Cache-hitsfora4x4<strong>cell</strong>-cachedsubgrid:::::::::::::::100 5.3Movement<strong>of</strong><strong>particle</strong>sleav<strong>in</strong>gtheir<strong>cell</strong>s,block-vectorsett<strong>in</strong>g.:::95 5.2GridPartition<strong>in</strong>g;a)block-vector,b)subgrid::::::::::::94<br />
6.1KSRSPCcallsparalleliz<strong>in</strong>g<strong>particle</strong>updaterout<strong>in</strong>es::::::::110 5.4ParallelChargeAccumulation:Particles<strong>in</strong><strong>cell</strong>`A'sharegridpo<strong>in</strong>ts<br />
6.2ParallelScalability<strong>of</strong>ReplicatedGridApproach::::::::::112 po<strong>in</strong>ts,`a'<strong>and</strong>`b',aresharedwith<strong>particle</strong>supdatedbyanother process<strong>in</strong>gnode.::::::::::::::::::::::::::::97 withthe<strong>particle</strong>s<strong>in</strong>the8neighbor<strong>in</strong>g<strong>cell</strong>s.Two<strong>of</strong>thesegrid<br />
6.5Grid-sizeScal<strong>in</strong>g.Replicatedgrids;4nodes;262,144Particles,100 6.3Scatter:DistributedGridversusReplicatedGrids::::::::::114 6.4DistributedGridBenchmarks:::::::::::::::::::::115 6.6ParticleScal<strong>in</strong>g{ReplicatedGrids::::::::::::::::::119 Time-steps:::::::::::::::::::::::::::::::118 xiii
Chapter1 Introduction \Onceexperienced,theexpansion<strong>of</strong>personal<strong>in</strong>tellectualpowermadeavailablebythecomputerisnoteasilygivenup."{SheilaEvansWidnall,Chair<br />
<strong>of</strong>theFacultycommitteeonundergraduateadmissionsatMIT,Science,<br />
<strong>in</strong>gplasmaphysics,xerography,astrophysics,<strong>and</strong>semiconductordevicephysics. 1.1Motivation<strong>and</strong>Goals Particlesimulationsarefundamental<strong>in</strong>manyareas<strong>of</strong>appliedresearch,<strong>in</strong>clud-<br />
August1983.<br />
S<strong>of</strong>ar,thesesimulationshave,duetotheirhighdem<strong>and</strong>forcomputerresources typicallyus<strong>in</strong>guptoanorder<strong>of</strong>1million<strong>particle</strong>s. (especiallymemory<strong>and</strong>CPUpower),beenlimitedto<strong>in</strong>vestigat<strong>in</strong>glocaleects{ magneticelds.Thenumericaltechniquesusedusually<strong>in</strong>volveassign<strong>in</strong>gcharges tosimulated<strong>particle</strong>s,solv<strong>in</strong>gtheassociatedeldequationswithrespecttosimulatedmeshpo<strong>in</strong>ts,apply<strong>in</strong>gtheeldsolutiontothegrid,<strong>and</strong>solv<strong>in</strong>gtherelated<br />
equations<strong>of</strong>motionforthe<strong>particle</strong>s.Codesbasedonthesenumericaltechniques Thesesimulations<strong>of</strong>ten<strong>in</strong>volvetrack<strong>in</strong>g<strong>of</strong>charged<strong>particle</strong>s<strong>in</strong>electric<strong>and</strong><br />
arefrequentlyreferredtoParticle-<strong>in</strong>-Cell(PIC)<strong>codes</strong>. HighlyparallelcomputerssuchastheKendallSquareResearch(KSR)ma- 1
ch<strong>in</strong>e,arebecom<strong>in</strong>gamore<strong>and</strong>more<strong>in</strong>tegralpart<strong>of</strong>scienticcomputation.By develop<strong>in</strong>gnovelalgorithmictechniquestotakeadvantage<strong>of</strong>thesemodernparallelmach<strong>in</strong>es<strong>and</strong>employ<strong>in</strong>gstate-<strong>of</strong>-the-art<strong>particle</strong>simulationmethods,weare<br />
2<br />
target<strong>in</strong>g2-Dsimulationsus<strong>in</strong>gatleast10-100millionsimulation<strong>particle</strong>s.This cont<strong>in</strong>u<strong>in</strong>g<strong>in</strong>vestigation<strong>of</strong>severalproblems<strong>in</strong>magnetospheric<strong>and</strong>ionospheric possible. willfacilitatethestudy<strong>of</strong><strong>in</strong>terest<strong>in</strong>gglobalphysicalphenomenapreviouslynot physics.Inparticular,weexpectthesemethodstobeappliedtoanexist<strong>in</strong>gsimulationwhichmodelstheenergization<strong>and</strong>precipitation<strong>of</strong>theelectronsresponsible<br />
fortheAuroraBorealis.Pr<strong>of</strong>essorOtani<strong>and</strong>hisstudentsarecurrentlyexam<strong>in</strong><strong>in</strong>g These<strong>particle</strong>simulation<strong>parallelization</strong>methodswillthenbeused<strong>in</strong>our<br />
portant<strong>in</strong>produc<strong>in</strong>gthese\auroral"electrons,becausethel<strong>in</strong>earmodestructure <strong>of</strong>thesewaves<strong>in</strong>cludesacomponent<strong>of</strong>theelectriceldwhichisorientedalong magneticplasmawaves)have<strong>in</strong>accelerat<strong>in</strong>gtheseelectronsalongeldl<strong>in</strong>esat altitudes<strong>of</strong>1to2Earthradii.K<strong>in</strong>eticAlfvenwaveshavebeenproposedtobeim-<br />
withsmallercomputerstherolethatk<strong>in</strong>eticAlfvenwaves(lowfrequencyelectro-<br />
Earth'sionosphere.[Sil91][SO93]. thetechniquesdeveloped<strong>in</strong>thisthesis. theEarth'smagneticeld,idealforaccelerat<strong>in</strong>gelectronsdownwardtowardsthe<br />
1.1.1Term<strong>in</strong>ology Theterm<strong>in</strong>ologyused<strong>in</strong>thisthesisisbasedonthetermscommonlyassociated Other<strong>codes</strong>thattargetparallelcomputersshouldalsobenetfrommany<strong>of</strong><br />
withcomputerarchitecture<strong>and</strong>parallelprocess<strong>in</strong>gaswellasthoseadoptedby KSR.Thefollow<strong>in</strong>g\dictionary"isforthebenet<strong>of</strong>thoseunfamiliarwiththis term<strong>in</strong>ology. Cache:Fastlocalmemory. Registers:Reallyfastlocalmemory;onlyafewwords<strong>of</strong>storage.
Subcache:KSR-ismforfastlocalmemory,i.e.whatgenerallywouldbe referredtoasaprocessor'scacheorlocalcache.OntheKSR1the0.5Mb subcacheissplitup<strong>in</strong>toa256Kbdatacache<strong>and</strong>a256Kb<strong>in</strong>structioncache. 3<br />
LocalMemory:KSRcallsthesum<strong>of</strong>theirlocalmemoryAllcache<strong>and</strong>they maybeusedforemphasiswhenreferr<strong>in</strong>gtotheKSR. Thisthesiswillusethetermcacheforfastlocalmemory.Thetermsubcache<br />
eachprocess<strong>in</strong>gelementasthemoregenerallyusedterm,localmemory.On notcommonpractice.Thisthesiswillrefertothedistributedmemoryon hencesometimesrefertolocalmemoryas\localcache".Notethatthisis theKSR,thelocalmemoryconsists<strong>of</strong>128sets<strong>of</strong>16-wayassociativememory<br />
Subpage(alsocalledcachel<strong>in</strong>e):M<strong>in</strong>imumdata-packagecopied<strong>in</strong>tolocal Page:Cont<strong>in</strong>uousmemoryblock<strong>in</strong>localmemory(16KbonKSR)thatmay beswappedouttoexternalmemory(e.g.disk). eachwithapagesize<strong>of</strong>16Kbgiv<strong>in</strong>gatotallocalmemory<strong>of</strong>32Mb.<br />
Thrash<strong>in</strong>g:Swapp<strong>in</strong>g<strong>of</strong>data<strong>in</strong><strong>and</strong>out<strong>of</strong>cacheorlocalmemorydue subcache.OntheKSR1eachpageisdivided<strong>in</strong>to128subpages(cachel<strong>in</strong>es) <strong>of</strong>128bytes(16words).<br />
Process:Runn<strong>in</strong>gprogramorprogramsegment,typicallywithalot<strong>of</strong>OS OSkernel:Operat<strong>in</strong>gSystem(OS)kernel;set<strong>of</strong>OSprograms<strong>and</strong>rout<strong>in</strong>es. acrosspageorsubpageboundaries. tomemoryreferencesrequest<strong>in</strong>gdatafromotherprocessorsordataspread<br />
Threads:Light-weightprocesses,i.e.specialprocesseswithlittleOSkernelover-head;OntheKSRonetypicallyspawnsaparallelprogram<strong>in</strong>toP<br />
kernelsupport,i.e.processestypicallytakeawhiletosetup(create)<strong>and</strong> release<strong>and</strong>arethensometimesreferredtoasheavyweightprocesses(see threads).
Machthreads:Low-level<strong>in</strong>terfacetotheOSkernel;basedontheMach threadrunn<strong>in</strong>gperprocessor. threads,whereP=no.<strong>of</strong>availableprocessorssothattherewillbeone 4<br />
Pthreads(POSIXthreads):Higher-level<strong>in</strong>terfacetoMachthreadsbased <strong>of</strong>KSR'sparallelismsupportisbuiltontop<strong>of</strong>Pthreads.TheKSR1Pthreads adheretoIEEEPOSIXP1003.4ast<strong>and</strong>ard. ontheIEEEPOSIX(PortableOperat<strong>in</strong>gSystemInterface)st<strong>and</strong>ard.Most OS(asopposedtoUNIX).<br />
Note:thetermsthread<strong>and</strong>processormaybeused<strong>in</strong>terchangeablywhentalk<strong>in</strong>g<br />
Particlesimulationmodelsare<strong>of</strong>tendividedup<strong>in</strong>t<strong>of</strong>ourcategories[HE89]: 1.2ParticleSimulationModels aboutaparallelprogramthatrunsonethreadperprocessor. Forfurtherdetailsonourtest-bed,theKSR1,seeChapter6.<br />
1.Correlatedsystemswhich<strong>in</strong>cludeN-bodyproblems<strong>and</strong>relatedmodels concern<strong>in</strong>gcovalentliquids(e.g.moleculardynamics),ionicliquids,stellar<br />
3.Collisionalsystems<strong>in</strong>clud<strong>in</strong>gsubmicronsemiconductordevicesus<strong>in</strong>gthe 2.Collisionlesssystems<strong>in</strong>clud<strong>in</strong>gcollisionlessplasma<strong>and</strong>galaxieswithspiralstructures,<br />
clusters,galaxyclusters,etc.,<br />
4.Collision-dom<strong>in</strong>atedsystems<strong>in</strong>clud<strong>in</strong>gsemiconductordevicesimulations us<strong>in</strong>gthediusionmodel<strong>and</strong><strong>in</strong>viscid,<strong>in</strong>compressibleuidmodelsus<strong>in</strong>g microscopicMonte-Carlomodel,<strong>and</strong> vortexcalculations.
ions,stars,orgalaxies).Collision-dom<strong>in</strong>atedsystems,ontheotherh<strong>and</strong>,takethe betweeneach<strong>particle</strong>simulated<strong>and</strong>physical<strong>particle</strong>smodeled(atoms,molecules Inthecorrelatedsystemsmodelsthereisaone-to-onemapp<strong>in</strong>g(correlation) 5<br />
otherextreme<strong>and</strong>useamathematicaldescriptionthattreatsthevortexelements asacont<strong>in</strong>uous,<strong>in</strong>compressible<strong>and</strong><strong>in</strong>visciduid. tems,whereeachsimulations<strong>particle</strong>,<strong>of</strong>tenreferredtoasa\super<strong>particle</strong>",may representmillions<strong>of</strong>physicalelectronsorions<strong>in</strong>acollisionlessplasma.Thenumericaltechniquesusedusually<strong>in</strong>volveassign<strong>in</strong>gchargestosimulated<strong>particle</strong>s,<br />
solv<strong>in</strong>gtheassociatedeldequationswithrespecttosimulatedmeshpo<strong>in</strong>ts,apply<strong>in</strong>gtheeldsolutiontothegrid,<strong>and</strong>solv<strong>in</strong>gtherelatedequations<strong>of</strong>motionfor<br />
the<strong>particle</strong>s.Codesbasedonthesenumericaltechniquesarefrequentlyreferred <strong>in</strong>furtherdetail<strong>in</strong>Chapter3.Thischapteralso<strong>in</strong>cludesdiscussions<strong>of</strong>methods used<strong>in</strong>verify<strong>in</strong>gtheparameterizations<strong>and</strong>thephysicsbeh<strong>in</strong>dthecode.Tests us<strong>in</strong>grealphysicalparametersdemonstrat<strong>in</strong>gpredictablephenomena<strong>in</strong>clud<strong>in</strong>g Chapter2discussesthema<strong>in</strong>referencescover<strong>in</strong>gcurrentparallel<strong>particle</strong><strong>codes</strong><br />
Ourapplicationfalls<strong>in</strong>betweenunderthesecondcategory,collisionlesssys-<br />
toParticle-<strong>in</strong>-Cell(PIC)<strong>codes</strong>. <strong>and</strong>relatedtopics.Thephysicsbeh<strong>in</strong>dthecollisionlessplasmamodelisdescribed<br />
Ourimplementationsuse2-DFFTs(assum<strong>in</strong>gperiodicboundaries)astheeld 1.3NumericalTechniques solvers.Othertechniques,<strong>in</strong>clud<strong>in</strong>gnitedierence(e.g.SOR)<strong>and</strong>multigrid plasmaoscillation<strong>and</strong>two-stream<strong>in</strong>stabilitiesarediscussed.<br />
methodsmayalsobeused.Theeldisappliedtoeachnode/gridpo<strong>in</strong>tus<strong>in</strong>ga 1Dnitedierenceequationforeachdimension,<strong>and</strong>thentheeldcalculatedat each<strong>particle</strong>locationus<strong>in</strong>gbi-l<strong>in</strong>ear<strong>in</strong>terpolation.Aleapfrog<strong>particle</strong>pusheris usedtoadvancethe<strong>particle</strong>positions<strong>and</strong>velocities.Thesenumericalalgorithms areexpla<strong>in</strong>ed<strong>in</strong>furtherdetail<strong>in</strong>Chapter3.
work;<strong>in</strong>stead,wewillconcentrateonthegeneralapproaches<strong>of</strong>howtoparallelize themasthey<strong>in</strong>teractwithothersections<strong>of</strong>acode. However,ourchoice<strong>of</strong>particularnumericalmethodsisnotthefocus<strong>of</strong>this 6<br />
parallelmethodswith<strong>in</strong>numericalalgorithmswithaneyetowardsma<strong>in</strong>ta<strong>in</strong><strong>in</strong>g theirapplicabilitytomoresophisticatednumericalmethodsthatmaybedeveloped <strong>in</strong>thefuture. Theprimarygoal<strong>of</strong>thisworkishenceto<strong>in</strong>vestigatehowbestto<strong>in</strong>corporate<br />
1.4Contributions<br />
highlightboththeparallelperformance<strong>of</strong>the<strong>in</strong>dividualblocks<strong>of</strong>code<strong>and</strong>also Thisdissertationprovidesan<strong>in</strong>-depthstudy<strong>of</strong>howtoparallelizeafairlycomplex<br />
codemay<strong>in</strong>uenceotherparts<strong>of</strong>thecode.Weshowhowthese<strong>in</strong>teractionsaect giveguidel<strong>in</strong>esastohowthechoices<strong>of</strong>parallelmethodsused<strong>in</strong>eachblock<strong>of</strong> analyses<strong>of</strong>theparallelalgorithmsdevelopedhere<strong>in</strong>,are<strong>in</strong>cluded.Theseanalyses codewithseveral<strong>in</strong>teract<strong>in</strong>gmodules.Algorithmiccomplexity<strong>and</strong>performance<br />
memory<strong>and</strong>cacheutilization. datapartition<strong>in</strong>g<strong>and</strong>leadtonew<strong>parallelization</strong>strategiesconcern<strong>in</strong>gprocessor, memoryaddress<strong>in</strong>gspacesuchasthetheKSRsupercomputer.However,most<strong>of</strong> cludesadiscussion<strong>of</strong>distributedversussharedmemoryenvironments. systemswitheitherahierarchicalordistributedmemorysystem.Chapter4<strong>in</strong>-<br />
thenewmethodsdeveloped<strong>in</strong>thisdissertationgenerallyholdforhigh-performance Ourframeworkisaphysicallydistributedmemorysystemwithagloballyshared<br />
replicated<strong>and</strong>partitionedgrids,aswellasanovelgridpartition<strong>in</strong>gapproachthat leadstoanecientimplementationfacilitat<strong>in</strong>gdynamicallypartitionedgrids.This novelapproachtakesadvantage<strong>of</strong>theshared-memoryaddress<strong>in</strong>gsystem<strong>and</strong>uses adualpo<strong>in</strong>terschemeonthelocal<strong>particle</strong>arraysthatkeepsthe<strong>particle</strong>locations Thischapteralsodescribesatraditionalparallel<strong>particle</strong>-<strong>in</strong>-<strong>cell</strong>methodsus<strong>in</strong>g<br />
automaticallypartiallysorted(i.e.sortedtowith<strong>in</strong>thelocalgridpartition).Load-
alanc<strong>in</strong>gtechniquesassociatedwiththisdynamicschemearealsodiscussed. thecachesize<strong>and</strong>theproblem'smemoryaccessstructure,<strong>and</strong>showthatfurther Chapter4also<strong>in</strong>troduceshierarchicaldatastructuresthataretailoredforboth 7<br />
computeronwhichourtest-<strong>codes</strong>wasimplemented<strong>in</strong>Cus<strong>in</strong>gPthreadsonthe arecovered<strong>in</strong>Chapter5.Chapter6describesourtest-bed,theKSR1super-<br />
improvementscanbemadebyconsider<strong>in</strong>gthe<strong>in</strong>putdata'seectonthesimulation.<br />
experimentalresults. KSR1.Optimizationswereguidedbythemethodsdictatedbyouranalytical<strong>and</strong> Complexity<strong>and</strong>performanceanalyses<strong>of</strong>themethodsdiscussed<strong>in</strong>Chapter4<br />
1.5Appendices Anannotatedbibliography<strong>of</strong>thereferencesdescribed<strong>in</strong>Chapter2isprovided <strong>in</strong>AppendixA.AppendixBshowshowtoverifythatthenumericalmethods usedproducetheexpectedplasmaoscillationbycalculat<strong>in</strong>g<strong>and</strong>analyz<strong>in</strong>gwhat happensdur<strong>in</strong>gonegeneraltime-step.
Chapter2 PreviousWorkonParticleCodes <strong>in</strong>evitable."{HerbertG.Wells[1866-1946],M<strong>in</strong>d<strong>and</strong>theEnd<strong>of</strong>ItsTether \Welive<strong>in</strong>referencetopastexperience<strong>and</strong>nott<strong>of</strong>utureevents,however<br />
Theorig<strong>in</strong>s<strong>of</strong>PIC<strong>codes</strong>datebackto1955whenHarlow<strong>and</strong>hiscolleaguesat 2.1TheOrig<strong>in</strong>s<strong>of</strong>ParticleCodes TheAnalects. \Studythepastifyoudiv<strong>in</strong>ethefuture."{Confucius[551-479B.C.],<br />
toHockney<strong>and</strong>Eastwood[HE89],thisworklaidthefoundationforMorse<strong>and</strong> m<strong>in</strong>d(their1-Dcodedidnotcompetewithsimilar<strong>codes</strong><strong>of</strong>thatera.)Accord<strong>in</strong>g LosAlamos[Har88,Har64]developedthem<strong>and</strong>relatedmethodsforuiddynamics calculations.Theirorig<strong>in</strong>alworkwasa1-Dcodethathad2ormoredimesions<strong>in</strong> (Cloud-<strong>in</strong>-Cell)schemesforplasmas<strong>in</strong>1969. Nielson's,<strong>and</strong>Birdsall'sBerkeleygroup<strong>in</strong>troduction<strong>of</strong>higher-order<strong>in</strong>terpolation<br />
lengthyreview<strong>of</strong>theeld<strong>of</strong>model<strong>in</strong>gcharged<strong>particle</strong>s<strong>in</strong>electric<strong>and</strong>magnetic man'sgroupatStanford<strong>and</strong>Birdsall<strong>and</strong>BridgesatBerkeley<strong>in</strong>thelate'50s,but theirworkdidnotuseameshfortheeldcalculations.Dawson[Daw83]givesa Therst<strong>particle</strong>models<strong>of</strong>electrostaticplasmasare,however,duetoBunne-<br />
8
elds<strong>and</strong>coversseveralphysicalmodel<strong>in</strong>gtechniquestypicallyassociatedwith <strong>particle</strong><strong>codes</strong>.Severalgoodreferencebookson<strong>particle</strong>simulationshavealsobeen published<strong>in</strong>thelastfewyears[HE89][BL91][Taj89][BW91].Foramoredetailed 9<br />
pleaseseetheannotatedbibliography<strong>in</strong>AppendixA. 2.2ParallelParticle-<strong>in</strong>-Cell<strong>codes</strong> desciption<strong>of</strong>thesebooks<strong>and</strong>some<strong>of</strong>themajorpapersreferenced<strong>in</strong>thischapter, Duetotheirdem<strong>and</strong>forcomputationalpower,<strong>particle</strong><strong>codes</strong>areconsideredgood c<strong>and</strong>idatesforparallelhighperformancecomputersystems[FJL+88][AFKW90] [Max91][Wal90].However,because<strong>of</strong>theirseem<strong>in</strong>glynon-parallelstructure,especially<strong>in</strong>the<strong>in</strong>homogeneouscases,muchworkstillrema<strong>in</strong>s{toquoteWalker<br />
[Wal90]:<br />
<strong>in</strong>Tables2.1-2.5attheend<strong>of</strong>thischapter.Thecentralideasprovidedbythese Anoverview<strong>of</strong>thema<strong>in</strong>referencesperta<strong>in</strong><strong>in</strong>gtoparallelPIC<strong>codes</strong>isprovided positionschemesneedtobe<strong>in</strong>vestigated,particularlywith<strong>in</strong>homoge-<br />
neous<strong>particle</strong>distributions." \...ForMIMDdistributedmemorymultiprocessorsalternativedecom-<br />
thisisfrequentlycalledaLagrangi<strong>and</strong>ecomposition.TheparallelPIC<strong>codes</strong> us<strong>in</strong>gapureLagrangi<strong>and</strong>ecompositionwillusuallyreplicatethegridoneach referencesarediscussed<strong>in</strong>subsequentsections. processor.InanEuleri<strong>and</strong>ecomposition,the<strong>particle</strong>sareassignedtothe processorswithitslocalgrid<strong>in</strong>formation.Asthe<strong>particle</strong>smove,theymaymigrate Notethatiftheassignments<strong>of</strong><strong>particle</strong>s(orgrids)toprocessorsrema<strong>in</strong>xed,<br />
toanotherprocessorwheretheirnewlocalgrid<strong>in</strong>formationisthenstored.An loadbalanceas<strong>particle</strong>smove<strong>and</strong>\bunch"togetheroncerta<strong>in</strong>processorsover AdaptiveEulerianapproachwillre-partitionthegrid<strong>in</strong>ordertoachieveabetter time.
2.2.1Vector<strong>and</strong>low-ordermultitask<strong>in</strong>g<strong>codes</strong> (OsakaUniv.)etal.havea3-pagenote[NOY85]thatdescribeshowtheybunch Therst<strong>parallelization</strong>s<strong>of</strong><strong>particle</strong><strong>codes</strong>weredoneonvectormach<strong>in</strong>es.Nishiguchi 10<br />
withtim<strong>in</strong>ganalysesdoneonasimilar3DcodeforaCray.UnlikeNishiguchiet Horowitz(LLNL,laterUniv.<strong>of</strong>Maryl<strong>and</strong>)etal.[Hor87]describesa2Dalgorithm <strong>particle</strong>s<strong>in</strong>their1-DcodetoutilizethevectorprocessoronaVP-100computer.<br />
muchlessstorage. al.whoemploysaxedgrid,Horowitz'sapproachrequiressort<strong>in</strong>g<strong>in</strong>ordertodo thevectorization.Thisschemeisabitslowerforverylargesystems,butrequires scribedbyMank<strong>of</strong>skyetal.[M+88].They<strong>in</strong>cludelow-ordermultiprocess<strong>in</strong>gon systemssuchastheCrayX-MP<strong>and</strong>Cray2.ARGUSisa3-Dsystem<strong>of</strong>simulation<strong>codes</strong><strong>in</strong>clud<strong>in</strong>gmodulesforPIC<strong>codes</strong>.Thesemodules<strong>in</strong>cludeseveraleld<br />
solvers(SOR,Chebyshev<strong>and</strong>FFT)<strong>and</strong>electromagneticsolvers(leapfrog,generalizedimplicit<strong>and</strong>frequencydoma<strong>in</strong>),wheretheyclaimtheir3DFFTsolverto<br />
whereasthelocal<strong>particle</strong>sgotwrittentodisk.Acached<strong>particle</strong>swasthentagged beexceptionallyfast.One<strong>of</strong>themost<strong>in</strong>terest<strong>in</strong>gideas<strong>of</strong>thepaper,however,is howtheyusedthecacheasstoragefor<strong>particle</strong>sthathavelefttheirlocaldoma<strong>in</strong> tomulti-taskover<strong>particle</strong>s(oragroup<strong>of</strong><strong>particle</strong>s)with<strong>in</strong>aeldblock.They ontoalocal<strong>particle</strong><strong>in</strong>itsnew<strong>cell</strong>whenitgotswapped<strong>in</strong>.Theirexperience withCANDOR,a2.5DelectromagneticPICcode,showedthatitprovedecient notethetrade-oduetothefactthat<strong>parallelization</strong>eciency<strong>in</strong>creaseswiththe number<strong>of</strong><strong>particle</strong>spergroup.Thespeed-upfortheirimplementationontheCray number<strong>of</strong><strong>particle</strong>groups,whereasthevectorizationeciency<strong>in</strong>creaseswiththe X-MP/48showedtoreachclosetothetheoreticalmaximum<strong>of</strong>4. modeledascollisionless<strong>particle</strong>swhereastheelectronsaretreatedasan<strong>in</strong>ertia- tomodeltiltmodes<strong>in</strong>eld-reversedcongurations(FRCs).Here,theionsare<br />
Parallelizationeortsontwoproduction<strong>codes</strong>,ARGUS<strong>and</strong>CANDOR,isde-<br />
Horowitzetal.[HSA89]laterdescribea3DhybridPICcodewhichisused
processors<strong>in</strong>themultigridphasebycomput<strong>in</strong>gonedimensiononeachprocessor. lessuid.Amultigridalgorithmisusedfortheeldsolve,whereastheleapfrog methodisusedtopushthe<strong>particle</strong>s.Horowitzmulti-tasksover3<strong>of</strong>the4Cray2 11<br />
<strong>and</strong>hencemulti-taskedachiev<strong>in</strong>ganaverageoverlap<strong>of</strong>about3duetotherelationshipbetweentasklength<strong>and</strong>thetime-sliceprovidedforeachtaskbytheCray.<br />
The<strong>in</strong>terpolation<strong>of</strong>theeldstothe<strong>particle</strong>swasfoundcomputationally<strong>in</strong>tensive<br />
2(many,butsimplercalculations).Theseresultsarehenceclearlydependenton theschedul<strong>in</strong>galgorithms<strong>of</strong>theCrayoperat<strong>in</strong>gsystem. priorityatwhichitran.)The<strong>particle</strong>pushphasesimilarlygotanoverlap<strong>of</strong>about (FortheCraytheyused,thetime-slicedependedonthesize<strong>of</strong>thecode<strong>and</strong>the<br />
2.3OtherParallelPICReferences Withthe<strong>in</strong>troduction<strong>of</strong>distributedmemorysystems,severalnew<strong>issues</strong>arearis<strong>in</strong>g <strong>in</strong>the<strong>parallelization</strong><strong>of</strong><strong>particle</strong><strong>codes</strong>.Datalocalityisstillthecentralissue,but ratherthantry<strong>in</strong>gtollupaset<strong>of</strong>vectorregisters,onenowtriestom<strong>in</strong>imize communicationcosts(globaldata<strong>in</strong>teraction).Ineithercase,whereavailable,one wouldliketolluplocalcachel<strong>in</strong>es.<br />
128-by-128toroidalgrid<strong>of</strong>bit-serialprocessorslocatedatGoddard. electrostaticPICcodeimplementedontheMPP(MassivelyParallelProcessor),a C.S.L<strong>in</strong>etal.[LTK88,L<strong>in</strong>89b,L<strong>in</strong>89a]describeseveralimplementations<strong>of</strong>a1-D 2.3.1Fight<strong>in</strong>gdatalocalityonthebit-serialMPP<br />
processors<strong>and</strong>thegridcomputationsareavoidedbyus<strong>in</strong>gthe<strong>in</strong>verseFFTto ogy<strong>and</strong>notmuchlocalmemory,theyfoundthattheoverhead<strong>in</strong>communication bit-serialSIMD(s<strong>in</strong>gle<strong>in</strong>struction,multipledata)architecturewithagridtopol-<br />
computethe<strong>particle</strong>positions<strong>and</strong>theelectriceld.However,s<strong>in</strong>cetheMPPisa Theyrstdescribeagridlessmodel[LTK88]where<strong>particle</strong>saremappedto<br />
whencomput<strong>in</strong>gthereductionsumsneededforthistechniquewhencomput<strong>in</strong>g
array<strong>and</strong>sortedthe<strong>particle</strong>saccord<strong>in</strong>gtotheir<strong>cell</strong>everytimestep.Thiswas thechargedensitywassohighthat60%<strong>of</strong>theCPUtimewasused<strong>in</strong>thiseort. Inanearlierstudy,theymappedthesimulationdoma<strong>in</strong>directlytotheprocessor 12<br />
wouldnotrema<strong>in</strong>load-balancedovertimes<strong>in</strong>cetheuctuations<strong>in</strong>electricalforces thearrayprocessors<strong>and</strong>thestag<strong>in</strong>gmemory.Theyalsopo<strong>in</strong>toutthatthescheme foundtobehighly<strong>in</strong>ecientontheMPPduetotheexcessiveI/Orequiredbetween wouldcausethe<strong>particle</strong>stodistributenon-uniformlyovertheprocessors.<br />
llsonlyhalfthe<strong>particle</strong>planes(processorgrid)with<strong>particle</strong>stomakethesort<strong>in</strong>g solver.Theimplementationuses64planestostorethe<strong>particle</strong>s.Thisapproach (scatter<strong>particle</strong>sthatareclusteredthroughrotations). Theauthorsimulatesupto524,000<strong>particle</strong>sonthismach<strong>in</strong>eus<strong>in</strong>ganFFT L<strong>in</strong>later[L<strong>in</strong>89b,L<strong>in</strong>89a],usesa<strong>particle</strong>sort<strong>in</strong>gschemebasedonrotations<br />
simplerbybe<strong>in</strong>gabletoshue(\rotate")thedataeasilyonthisbit-serialSIMD mach<strong>in</strong>e.Thespareroomwasused<strong>in</strong>the\shu<strong>in</strong>g"process.Herecongested<strong>cell</strong>s westernneighbor,ifnecessary,dur<strong>in</strong>gthe\sort<strong>in</strong>g"process. hadpart<strong>of</strong>theircontentsrotatedtotheirnorthernneighbor,<strong>and</strong>thentotheir<br />
2.3.2Hypercubeapproaches morecomputationalpower<strong>and</strong>adierent<strong>in</strong>terconnectionnetwork,othertechniqueswillprobablyprovemoreuseful.<br />
ThisimplementationisclearlytiedtotheMPParchitecture.Fornodeswith<br />
NCUBE,haveoat<strong>in</strong>gpo<strong>in</strong>tprocessors(someevenwithvectorboards)<strong>and</strong>alot UnliketheMPP,hypercubessuchastheInteliPSCs,theJPLMarkII,<strong>and</strong>the connectivity(thenumber<strong>of</strong><strong>in</strong>terconnectionsbetweennodesgrowslogarthmically morelocalmemory.Theiradvantageis<strong>in</strong>theirrelativelyhighdegree<strong>of</strong><strong>in</strong>ter-node <strong>in</strong>thenumber<strong>of</strong>nodes)thatprovidesperfectnear-neighborconnectionsforFFT systemsarestillfairlyhomogeneous(withtheexception<strong>of</strong>someI/Oprocessors). algorithmsaswellasO(logN)datagathers<strong>and</strong>scatters.Thenodes<strong>of</strong>these
ontheInteliPSChypercube.AmultigridalgorithmbasedonFredrickson<strong>and</strong> McBryaneortsareusedfortheeldcalculations,whereastheyconsidered3 Lubeck<strong>and</strong>Faber(LANL)[LF88]covera2-Delectrostaticcodebenchmarked 13<br />
basedonthefactthattheir<strong>particle</strong>stendedtocongregate<strong>in</strong>10%<strong>of</strong>the<strong>cell</strong>s, <strong>cell</strong><strong>in</strong>formation(observ<strong>in</strong>gstrictlocality).Theauthorsrejectedthisapproach dierentapproachesforthe<strong>particle</strong>pushphase.<br />
torelaxthelocalityconstra<strong>in</strong>tbyallow<strong>in</strong>g<strong>particle</strong>stobeassignedtoprocessorsnot hencecaus<strong>in</strong>gseriousload-imbalance.Thesecondalternativetheyconsideredwas Theirrstapproachwastoassign<strong>in</strong>g<strong>particle</strong>stowhicheverprocessorhasthe<br />
lancedsolution)wouldbeastrongfunction<strong>of</strong>the<strong>particle</strong><strong>in</strong>putdistribution. <strong>of</strong>thisalternative(moveeitherthegrid<strong>and</strong>/or<strong>particle</strong>stoachieveamoreload-<br />
necessarilyconta<strong>in</strong><strong>in</strong>gtheirspatialregion.Theauthorsarguethattheperformance<br />
astheiPSC).Tous,however,thisseemstorequirealot<strong>of</strong>extraover-head<strong>in</strong> theprocessorssothatanequalnumber<strong>of</strong><strong>particle</strong>scouldbeprocessedateach time-step.Thisachievesaperfectloadbalance(forhomogeneoussystemssuch communicat<strong>in</strong>gthewholegridtoeachprocessorateachtime-step,nottomention Thealternativetheydecidedtoimplementreplicatedthespatialgridamong<br />
hav<strong>in</strong>gtostoretheentiregridateachprocessor.Theydo,however,describea greater"comparedwithasharedmemoryimplementation. thepartition<strong>in</strong>g<strong>of</strong>theirPICalgorithmforthehypercube\anorder<strong>of</strong>magnitude niceperformancemodelfortheirapproach.Theauthorscommentthattheyfound<br />
Aza92].Theirunderly<strong>in</strong>gmotivationistoparallelizeParticle-In-Cell(PIC)<strong>codes</strong> throughahybridscheme<strong>of</strong>Grid<strong>and</strong>ParticlePartitions. onhybridpartition<strong>in</strong>gforPIC<strong>codes</strong>onhypercubes[ALO89,AL91,AL90,AL92, Partition<strong>in</strong>ggridspace<strong>in</strong>volvesdistribut<strong>in</strong>g<strong>particle</strong>sevenlyamongprocessor Azari<strong>and</strong>Lee(Cornell)havepublishedseveralpapersrelatedtoAzari'swork<br />
<strong>and</strong>partition<strong>in</strong>gthegrid<strong>in</strong>toequal-sizedsub-grids,oneperavailableprocessor element(PE).Theneedtosort<strong>particle</strong>sfromtimetotimeisreferredtoasan
undesirableloadbalanc<strong>in</strong>gproblem(dynamicloadbalanc<strong>in</strong>g). areevenlydistributedamongprocessorelements(PEs)nomatterwheretheyare A<strong>particle</strong>partition<strong>in</strong>gimplies,accord<strong>in</strong>gtotheirpapers,thatallthe<strong>particle</strong>s 14<br />
entiresimulation.TheentiregridisassumedtohavetobestoredoneachPE<strong>in</strong> locatedonthegrid.EachPEkeepstrack<strong>of</strong>thesame<strong>particle</strong>throughoutthe ordertokeepthecommunicationoverheadlow.Thestoragerequirementsforthis schemearelarger,<strong>and</strong>aglobalsum<strong>of</strong>thelocalgridentriesisneededaftereach iteration. mentthatbypartition<strong>in</strong>gthespaceonecansavememoryspaceoneachPE,<strong>and</strong> bypartition<strong>in</strong>gthe<strong>particle</strong>sonemayattempttoobta<strong>in</strong>awell-balancedloaddistributionwhichwouldleadtoahigheciency.Theirhybridpartition<strong>in</strong>gscheme<br />
canbeoutl<strong>in</strong>edasfollows: Theirhybridpartition<strong>in</strong>gapproachcomb<strong>in</strong>esthesetwoschemeswiththeargu-<br />
1.thegridispartitioned<strong>in</strong>toequalsubgrids; 2.agroup<strong>of</strong>PEsareassignedtoeachblock; 3.eachgrid-blockisstored<strong>in</strong>thelocalmemory<strong>of</strong>each<strong>of</strong>thePEsassignedto 4.the<strong>particle</strong>s<strong>in</strong>eachblockare<strong>in</strong>itiallypartitionedevenlyamongPEs<strong>in</strong>that thatblock;<br />
accumulator",somek<strong>in</strong>d<strong>of</strong>gather-scattersorterproposedbyG.Fox.etal.(See alsoWalker'sgeneralreference<strong>in</strong>AppendixA.) nothaveafullimplementation<strong>of</strong>thecode.Heusesthe\quasi-staticcrystal Walker(ORNL)[Wal89]describesa3-DPICcodefortheNCUBE,butdoes block.<br />
tations.Her1988paperwithDecyk,Dawson(UCLA)<strong>and</strong>G.Fox(Syracuse) [LDDF88],describesa1-Delectrostaticcodenamed1-DUCLA,decompos<strong>in</strong>gthe physicaldoma<strong>in</strong><strong>in</strong>tosub-doma<strong>in</strong>sequal<strong>in</strong>numbertothenumber<strong>of</strong>processors Liewer(JPL)hasalsoco-authoredseveralpapersonhypercubeimplemen-
availablesuchthat<strong>in</strong>itiallyeachsub-doma<strong>in</strong>hasanequalnumber<strong>of</strong>processors. Theirtest-bedwastheMarkIII32-nodehypercube. Thecodeusesthe1-DconcurrentFFTdescribed<strong>in</strong>Foxetal.Forthe<strong>particle</strong> 15<br />
<strong>of</strong>theprocessors.)Thecodehencepassesthegridarrayamongtheprocessors phase.(Theyneedtopartitionthedoma<strong>in</strong>accord<strong>in</strong>gtotheGrayCodenumber<strong>in</strong>g theFFTsolver<strong>in</strong>ordertotakeadvantage<strong>of</strong>thehypercubeconnectionforthis However,theauthorspo<strong>in</strong>touthowtheyneedtouseadierentpartition<strong>in</strong>gfor push<strong>in</strong>gphase,theydividetheirgridup<strong>in</strong>to(N?p)equal-sizedsub-doma<strong>in</strong>s.<br />
processors)thatbeatstheCrayX-MP.Lieweralsoco-authoredapaper[FLD90] namedGCPIC(GeneralConcurrentPIC)implementedontheMarkIIIfb(64- thecaseifanitedierenceeldsolutionisused<strong>in</strong>place<strong>of</strong>theFFT. twiceateachtimestep.Intheconclusions,theypo<strong>in</strong>toutthatthismaynotbe<br />
theoption<strong>of</strong>boundedorperiodic<strong>in</strong>theotherdimension.Thecodeused21- describ<strong>in</strong>ga2-Delectrostaticcodethatwasperiodic<strong>in</strong>onedimension,<strong>and</strong>with Inthepaperpublishedayearlater[LZDD89]theydescribeasimilarcode<br />
later<strong>in</strong>thisthesis. DFFTs<strong>in</strong>thesolver.Lieweretal.hasrecentlydevelopeda3Dcodeonthe<br />
2.3.3AMasParapproach DeltaTouchstone(512nodegrid)wherethegridsarereplicatedoneachprocessor [DDSL93,LLFD93].Lieweretal.alsohaveapaperonloadbalanc<strong>in</strong>gdescribed MacNeice'spaper[Mac93]describesa3DelectromagneticPICcodere-written foraMasParwitha128-by-128grid.ThecodeisbasedonOscarBuneman's TRISTANcode.Theystorethethethirddimension<strong>in</strong>virtualmemorysothat needtosortaftereachtime-step.Anite-dierenceschemeisusedfortheeld eachprocessorhasagridvector.TheyuseaEuleri<strong>and</strong>ecomposition,<strong>and</strong>hence S<strong>in</strong>cetheyassumesystemswithrelativelymild<strong>in</strong>homogeneities,noloadbalanc<strong>in</strong>g solve,whereasthe<strong>particle</strong>pushphaseisaccomplishedviatheleapfrogmethod.
considerationsweretaken.Thefacttheyonlysimulate400,000<strong>particle</strong>s<strong>in</strong>a105- the128-by-128processorgrid,weassumewasduetothememorylimitations<strong>of</strong>the by-44-by-55system,i.e.onlyaboutone<strong>particle</strong>per<strong>cell</strong><strong>and</strong>henceunder-utiliz<strong>in</strong>g 16<br />
<strong>in</strong>dividual(local)data. 2.3.4ABBNattempt (SIMD)mach<strong>in</strong>e,i.e.theprocessorssharethe<strong>in</strong>structionstream,butoperateon MasParused(64Kb/processor).TheMasParisaS<strong>in</strong>gleInstructionMultipleData<br />
ourtest-bed,theKSR1,isalsoashared-addressspacesystemwithdistributed bythehighcosts<strong>of</strong>copy<strong>in</strong>gverylargeblocks<strong>of</strong>read-onlydata.LiketheBBN, TC2000whoseperformancewasdisappo<strong>in</strong>t<strong>in</strong>g.Theyusedashared-memoryPIC Sturtevant<strong>and</strong>Maccabee[SM90]describeaplasmacodeimplementedontheBBN<br />
overcamesome<strong>of</strong>theobstaclesthatfaceSturtevant<strong>and</strong>Maccabee. memory.Byfocus<strong>in</strong>gmoreondatalocality<strong>issues</strong>,wewilllatershowhowwe algorithmthatdidnotmapwelltothearchitecture<strong>of</strong>theBBN<strong>and</strong>hencegothit<br />
(alternat<strong>in</strong>gdirectionimplicit)asa2-Ddirecteldsolver.Thepaperdoespo<strong>in</strong>t outthatonlym<strong>in</strong>imalconsiderationwasgiventoalgorithmsthatmaybeusedto 2.3.5Other<strong>particle</strong>methods <strong>and</strong>somerelativisticextensions.ThecodeusesaniterativesolutionbasedonADI D.W.Hewett<strong>and</strong>A.B.Langdon[HL88]describethedirectimplicitPICmethod<br />
(severalD).Bytreat<strong>in</strong>gtheelectronsasamasslessuid<strong>and</strong>theionsas<strong>particle</strong>s, tages<strong>in</strong>us<strong>in</strong>gahybrid<strong>particle</strong>codetosimulateplasmasonverylargescalelengths somephysicsthatmagnetohydrodynamics(MHD)<strong>codes</strong>donotprovide(MHDassumeschargeneutrality,i.e.=0),canbe<strong>in</strong>cludedwithoutthecosts<strong>of</strong>afull<br />
S.H.Brecht<strong>and</strong>V.A.Thomas[BT88]describetheadvantages<strong>and</strong>disadvan-<br />
implementtherelativisticextensions.Someconceptsweretestedona1-Dcode.<br />
<strong>particle</strong>code.Theyavoidsolv<strong>in</strong>gthepotentialequationsbyassum<strong>in</strong>gthatthe
plasmaisquasi-neutral(neni),us<strong>in</strong>gtheDarw<strong>in</strong>approximationwherelight wavescanbeignored,<strong>and</strong>assum<strong>in</strong>gtheelectronmasstobezero.Theyhenceuse apredictor-correctormethodtosolvethesimpliedequations. 17<br />
modernhierarchicalsolvers,<strong>of</strong>whichthemostgeneraltechniqueisthefastmultipolemethod(FMM),toavoidsome<strong>of</strong>thelocalsmooth<strong>in</strong>g,boundaryproblems,<br />
J.Ambrosiano,L.Greengard,<strong>and</strong>V.Rokhl<strong>in</strong>[AGR88]advocatetheuse<strong>of</strong> plasmas<strong>and</strong>beams,<strong>and</strong>plasmas<strong>in</strong>complicatedregions.Thepaperdescribesthe <strong>and</strong>alias<strong>in</strong>gproblemsassociatedwithPICmethodswhenusedtosimulatecold FMMmethodforgridless<strong>particle</strong>simulations<strong>and</strong>howitfareswithrespecttothe<br />
MPP,Lieweretal.[LLDD90]haveimplementedadynamicloadbalanc<strong>in</strong>gscheme 2.4LoadBalanc<strong>in</strong>g Asidefromthescroll<strong>in</strong>gmethodspreviouslymentionedthatL<strong>in</strong>developedforthe aforementionedproblemsassociatedwithPICmethods.<br />
ThecodeisbasedontheelectrostaticcodeGCPIC(see2.5.1).Loadbalanc<strong>in</strong>gwas achievedbytheirAdaptiveEulerianapproachthathaseachprocessorcalculate fora1DelectromagneticPICcodeontheMarkIIIHypercubeatCaltech/JPL.<br />
theirsub-gridboundaries<strong>and</strong>theircurrentnumber<strong>of</strong><strong>particle</strong>s.Theypo<strong>in</strong>tout thattheactualplasmadensityprolecouldbeuseddirectly(computedfortheeld solutionstage),butthatitwouldrequiremorecommunicationtomakethedensity anapproximation<strong>of</strong>theplasmadensityprole<strong>and</strong>us<strong>in</strong>gittocomputethegrid partition<strong>in</strong>g.Thiscalculationrequiresallprocessorstobroadcastthelocation<strong>of</strong><br />
muchlargeramount<strong>of</strong>communication<strong>and</strong>computationoverhead.Resultsfrom proleglobal.Othermethods,suchas<strong>particle</strong>sort<strong>in</strong>g,wereassumedtorequirea testcaseswith5120<strong>particle</strong>srunon8processorswereprovided.Intheloadbalanc<strong>in</strong>gcase,the<strong>particle</strong>distributionwasapproximatedevery5time-steps.
Hennessyetal.[SHG92,SHT+92]havealsodonesome<strong>in</strong>terest<strong>in</strong>gload-balanc<strong>in</strong>g 2.4.1Dynamictaskschedul<strong>in</strong>g studiesforhierarchicalN-bodymethodsthatareworth<strong>in</strong>vestigat<strong>in</strong>g. 18<br />
MultipoleMethod.Thepaperstressescash<strong>in</strong>g<strong>of</strong>communicateddata<strong>and</strong>claims centratesonanalyz<strong>in</strong>gtwoN-bodymethods:theBarnes-HutMethod<strong>and</strong>theFast thatformostrealisticscal<strong>in</strong>gcases,boththecommunicationtocomputationratio, aswellastheoptimalcachesize,growslowlyaslargerproblemsarerunonlarger Therstpaper,entitled\Implications<strong>of</strong>hierarchicalN-bodyMethods"con-<br />
N-bodymethods"focusesonhowtoachieveeective<strong>parallelization</strong>sthroughsimultaneouslyconsider<strong>in</strong>gloadbalanc<strong>in</strong>g<strong>and</strong>optimiz<strong>in</strong>gfordatalocalityforthreodsconsidered<strong>in</strong>therstpaper,theyalsoconsiderarecentmethodforradiosity<br />
Thesecondpaper,entitled\LoadBalanc<strong>in</strong>g<strong>and</strong>DataLocality<strong>in</strong>Hierarchical memoryimplementations. overheadssubstantially<strong>in</strong>creaseswhengo<strong>in</strong>gfromshared-memorytodistributed mach<strong>in</strong>es.Theyalsoshowthattheprogramm<strong>in</strong>gcomplexity<strong>and</strong>performance<br />
ma<strong>in</strong>hierarchicalN-bodymethods.InadditiontotheBarnes-Hut<strong>and</strong>FMMmeth-<br />
<strong>in</strong>dicator<strong>of</strong>theworkassociatedwithit<strong>in</strong>thenext.Unabletondaneective calculations<strong>in</strong>computergraphics. predictivemechanismthatcouldprovideloadbalanc<strong>in</strong>g,thebestapproachedthe authorsendedupwithwassometh<strong>in</strong>gtheycallcost-estimates+steal<strong>in</strong>g.Thisuses s<strong>in</strong>ce<strong>in</strong>thiscase,theworkassociatedwithapatch<strong>in</strong>oneiterationisnotagood Thelatterturnsouttorequireaverydierentapproachtoload-balanc<strong>in</strong>g<br />
prol<strong>in</strong>gorcost-estimatesto<strong>in</strong>itializethetaskqueuesateachprocessor,<strong>and</strong>then useson-the-ytasksteal<strong>in</strong>gtoprovideloadbalanc<strong>in</strong>g. atStanford.Ithas16processorsorganized<strong>in</strong>4clusterswhereeachclusterhas 4MIPSR3000processorsconnectedbyasharedbus.Theclustersareconnected together<strong>in</strong>ameshnetwork.Eachprocessorhastwolevel<strong>of</strong>cachesthatarekept ThetestbedforbothpapersistheexperimentalDASHmultiprocessorlocated
coherent<strong>in</strong>hardware,<strong>and</strong>shareanequalfraction<strong>of</strong>thephysicalmemory. dierentfromoursett<strong>in</strong>g,buttheiruse<strong>of</strong>dynamictaskschedul<strong>in</strong>gtoachieveload Boththeapplications(N-bodysimulations)<strong>and</strong>test-bed(DASH)arefairly 19<br />
<strong>in</strong>formationsuchas\load"<strong>and</strong>\distance"thatrelatestothisapproach. balanc<strong>in</strong>g,isworthnot<strong>in</strong>g.Chapter4discussesouridea<strong>of</strong>us<strong>in</strong>gprocessorsystem
Mank<strong>of</strong>sky2.5DhybridCrayX-MPmulti- Author(s) Table2.1:Overview<strong>of</strong>ParallelPICReferences{2.5DHybrid Type ArchitectureParallel methodsSolverpushersimulated FieldParticleMax.ptcls<br />
Horowiz2.5DhybridCray2 [M+88] etal. (CANDOR)X-MP/48task<strong>in</strong>g (ARGUS)(4proc.) 3D Cray2& multi-3D-FFTleapfrogleapfrog<br />
etal. (4proc.) multi-Multigridleap-frogupto106 &others<br />
(1989-92) [HSA89] Azari&2.5DhybridInteliPSC/2hybridpredictor-sortptcls16,384 Lee quasi-neutralhypercubepartitioncorrectoreacht Darw<strong>in</strong> (32proc)(subgrids 43x43x43<br />
[AL90,AL91]ignorelight, [ALO89] [Aza92] [AL92] (neni, me=0) BBN replicated)<br />
20
Author(s) Table2.2:Overview<strong>of</strong>ParallelPICReferences{Lieweretal.1988-89 TypeArchitectureParallel methods Solver Field ParticleMax.ptcls<br />
Dawson&Fox,statichypercube Liewer, Decyk, electro-MarkIIIdecomp.r<strong>and</strong>.gen 1-D JPL Eulerianforvelocity<br />
FFTGaussian pushersimulated<br />
[LDDF88] Liewer, Decyk, [LD89] GCPIC electro-MarkIII(loadbal. statichypercubetried<strong>in</strong> 1-D JPL [LLDD90]) Eulerian FFT leav<strong>in</strong>g buer720,896ptcls ptcls8128gridpts<br />
21
Author(s)TypeArchitectureParallelFieldParticleMax.ptcls Table2.3:Overview<strong>of</strong>ParallelPICReferences{Lieweretal.1990-93<br />
Ferraro,2-D Decyk,electro-MarkIII JPL replicatetwo methodsSolverpushersimulated<br />
[FLD90] Liewer,3-D LiewerstatichypercubeoneachFFTs Deltaprocessor<br />
replicate?? grid 1-D<br />
[DDSL93] Krucken,electro-Touchstonegrid Ferraro,static Decyck 512-nodeprocessor (Intel) oneach ??(3000gridpts)<br />
1.47x108<br />
(prepr<strong>in</strong>t)<br />
22
Author(s) Type Table2.4:Overview<strong>of</strong>ParallelPICReferences{Walker ArchitectureParallel methodsSolver Field ParticleMax.ptcls<br />
[Wal89]implementedconsidered)<strong>of</strong>gridpts Walker3D,butnohypercubedynamic full-scalePIC(NCUBErout<strong>in</strong>gassumed FFT pushersimulated<br />
Walker estimates<br />
SOS 3-D hypercubeAdaptivecommercialSOScodeestimates simulated EulerSOScode only<br />
[Wal90] onCray only<br />
23
Author(s) L<strong>in</strong>etal. Type Table2.5:Overview<strong>of</strong>ParallelPICReferences{Others<br />
[LTK88]electrostatic(16,384 ArchitectureParallel proc.)<br />
reduction\gridless"<strong>in</strong>verse methods sums <strong>in</strong>v.FFTFFT Solverpushersimulated FieldParticleMax.ptcls<br />
Lubeck& [L<strong>in</strong>89b] [L<strong>in</strong>89a]electrostatic(bit-serial) Faberelectrostatic(64proc)gridoneach 1-D 2-D InteliPSCreplicatemultigridsortptclsGridpts: MPP sort<strong>in</strong>g1-DFFTloadbal.524,000 etc. viashifts<br />
MacNeice [Mac93]basedon [LF88] (1993)electromagn.128x128 3-D MasPar processor 3rdD<strong>in</strong>dierence store niteleapfrog200000 eacht
Chapter3<br />
SimulationCodes ThePhysics<strong>of</strong>Particle<br />
3.1ParticleModel<strong>in</strong>g splendour<strong>of</strong>theirown."{Bertr<strong>and</strong>Russell,WhatIBelieve,1925. theend,thefreshairbr<strong>in</strong>gsvigour,<strong>and</strong>allthegreatspaceshavea \Eveniftheopenw<strong>in</strong>dows<strong>of</strong>scienceatrstmakesusshiver...<strong>in</strong><br />
experienced<strong>in</strong>comput<strong>in</strong>g,butnotnecessarilywithphysics,wealsoreviewthebasic algorithmsassociatedwith<strong>particle</strong>simulations.Asabenettothosewhoare physicsbeh<strong>in</strong>dsome<strong>of</strong>theequationsmodeled,i.e.howonearrivesattheequations Thegoal<strong>of</strong>thischapteristoshowhowtoimplementsome<strong>of</strong>thetypicalnumerical<br />
use<strong>of</strong><strong>particle</strong><strong>codes</strong>. <strong>and</strong>whattheymeanwithrespecttothephysicalworldtheyaredescrib<strong>in</strong>g.Based howthisknowledgecanbeusedtoverify<strong>and</strong>testthe<strong>codes</strong>.F<strong>in</strong>ally,wedescribe examples<strong>of</strong><strong>in</strong>terest<strong>in</strong>gphysicalphenomenathatcanbe<strong>in</strong>vestigatedthroughthe onourunderst<strong>and</strong><strong>in</strong>g<strong>of</strong>thephysicalsystemwearetry<strong>in</strong>gtomodel,wethenshow<br />
manynumericalmethodsavailable.Severalothertechniques,somemorecomplex, Thenumericaltechniquesdescribed<strong>in</strong>thischapterrepresentonlyafew<strong>of</strong>the 25
moresuitableforourfuturegoal,i.e.<strong>parallelization</strong>thanothermorecomplex representativeavor<strong>of</strong>themostcommontechniques,<strong>and</strong>mayalsoprovetobe butpossiblymoreaccurate,exist.However,wefeelthetechniqueschosengivea 26<br />
algorithms. oursimulation.Althoughthephysicalworldisclearly3D<strong>and</strong>themethodsused extendtomultipledimensions,wechosetomodelonly2Ds<strong>in</strong>cethenumerical problems(whichcapturepossibleglobaleects){evenonpresentsupercomputers. computationsotherwisewouldtakeaverylongtimeforthelarger<strong>in</strong>terest<strong>in</strong>g Thediscussionsfollow<strong>in</strong>gconcentrateon2Dsimulations,themodelwechosefor<br />
solutions(us<strong>in</strong>gnergrids).Untilthewholeuniversecanbeaccuratelymodeled 3Dmodel,doanevenlarger-scale2D(or1D)simulation,orobta<strong>in</strong>moredetailed <strong>in</strong>ashortamount<strong>of</strong>time,physicalmodelswillalwaysbesubjecttotrade-os! computationalpowerbecomesavailable,onemighteitherdecidetoexp<strong>and</strong>toa Thisisthereasonmanycurrentserial<strong>particle</strong><strong>codes</strong>modelonly1D.Asmore<br />
3.2TheDiscreteModelEquations Thissectionlooksatwheretheequationsthatweremodeledstemfrom<strong>in</strong>the physics.Inoursimulationsus<strong>in</strong>gtheAccelerationmodel,weassumedthethe equationsbe<strong>in</strong>gmodeledwere:<br />
dv=dt=qE=m dx=dt=v; r2=?0; E=?r; (3.4) (3.3) (3.2) (3.1)<br />
Thefollow<strong>in</strong>gsubsectionswilldescribetheseequation<strong>in</strong>moredetail.
3.2.1Poisson'sequation Theeldequationr2=?=0stemsfromthedierentialform<strong>of</strong>Maxwell's equationswhichsayelectriceldswithspacechargesobey: 27<br />
freecharges.rDisalsocalledthedivergence<strong>of</strong>D. whereDistheelectric\displacement"<strong>and</strong>fisthesourcedensityordensity<strong>of</strong> Comb<strong>in</strong>edwiththeequationformatter(withnopolarizationsorboundcharges) rD=f; (3.5)<br />
<strong>and</strong>withE=grad,r,wehave: divr=?0or D=0E div0E= (3.7) (3.6)<br />
eratorsgrad<strong>and</strong>divalsoyieldstheabovePoisson'sequation: Fora2-Dscalareld(x;y),thesequentialapplication<strong>of</strong>thedierentialop-<br />
r2=?0: (3.8)<br />
orr2=?0: r2=@2 @x2+@2<br />
v=dx=dt;<strong>and</strong>dv=dt=qE=mfollowfromNewton'slaw,wheretheforceFisthe @y2 (3.10) (3.9)<br />
theymean<strong>in</strong>oursimulation. sum<strong>of</strong>theelectricforceqE. Letusnowtakeacloserlookatthechargeq<strong>and</strong>thechargedensity<strong>and</strong>what
Thechargedensity,,isusuallythought<strong>of</strong>asunits<strong>of</strong>charge=(length)3(chargeper 3.2.2Thecharge,q volume).InSI-units,thelengthismeasured<strong>in</strong>meters.Now,volume<strong>in</strong>dicatesa3- 28<br />
Dsystem,whereasoursimulationisa2-Done.To\convert"our2-Dsimulationto a3-Dview,onecanth<strong>in</strong>k<strong>of</strong>eachsimulation<strong>particle</strong>represent<strong>in</strong>garod-likeentity <strong>of</strong><strong>in</strong>nitelengththat<strong>in</strong>the3-Dplane<strong>in</strong>tersectsthex-yplaneperpendicularlyat<br />
rodwouldhencerepresentthechargeqLhz.Consequently,thetotalchargeQ<strong>in</strong> Consideravolumehzhxhy,wherehzistheheight<strong>in</strong>units<strong>of</strong>theunitlength, number<strong>of</strong>simulation<strong>particle</strong>s.Consider<strong>in</strong>gqLasthechargeperunitlength,each withhx<strong>and</strong>hybe<strong>in</strong>gthegridspac<strong>in</strong>g<strong>in</strong>x<strong>and</strong>y,respectively,<strong>and</strong>Npbe<strong>in</strong>gthe thesimulations<strong>particle</strong>'slocation(x0,y0)asshown<strong>in</strong>Figure3.1.<br />
thevolumeis: Asmentioned,thechargedensityisQ/volume: Q=volume=(NpqLhz)=(hzhxhy) Q=NpqLhz (3.12) (3.11)<br />
noticethathzalwayscancelsout<strong>and</strong>that=NpqL=(hxhy)hastherightunits: (no?dim)(charge=length) (length)2 =charge (length)3: (3.13)<br />
Inotherwords,<strong>in</strong>our2-Dsimulation,thechargedensity,,isdenedsothat: hxhyX gridpo<strong>in</strong>ts(i;j)i;j=qLNp; (3.14)<br />
<strong>particle</strong>s(Np),number<strong>of</strong>gridpo<strong>in</strong>ts(Ng),gridspac<strong>in</strong>gs(hx;hy),awellasthe Inoursimulations,wehavesettheelectroncharge(q),thenumber<strong>of</strong>simulation (3.15)
z 29<br />
6 qL=charge/unitlength<br />
> eee<br />
?<br />
y - unitlength<br />
??<br />
,,,,,,,,,,<br />
? 6<br />
-x,<br />
Figure3.1:3-Dview<strong>of</strong>a2-D<strong>particle</strong>simulation.Chargesarethought<strong>of</strong>as\rods". ?:simulation<strong>particle</strong>asviewedon2-Dplane
chargespersimulation<strong>particle</strong>)throughqL: meanchargedensity0,allas<strong>in</strong>putvariables.Notethatbyspecify<strong>in</strong>gtheabove parameters,0determ<strong>in</strong>esthesize<strong>of</strong>thesimulation<strong>particle</strong>s(number<strong>of</strong>electron 30<br />
simulation<strong>particle</strong>. importantthatthe<strong>in</strong>putparametersused<strong>in</strong>thetestareconsistentsothatthe wheren0isthe<strong>particle</strong>density,<strong>and</strong>nisthenumber<strong>of</strong>physical<strong>particle</strong>sper Intest<strong>in</strong>gthecodeforplasmaoscillation,aphysicalphenomenon,itistherefore 0=n0q=nNpqL=(hxhy)<br />
<strong>in</strong>realplasmas.Togetafeelforwheretheseoscillationstemfrom,onecantakea resultsmakesense(physically).<br />
lookatwhathappenstotwo<strong>in</strong>niteverticalplanes<strong>of</strong>charges<strong>of</strong>thesamepolarity 3.2.3ThePlasmaFrequency,!p whentheyareparalleltoeachother. Theplasmafrequencyisdenedas!p=qq0=m0:Theseoscillationsareobserved FromGausswehavethattheuxatone<strong>of</strong>theplanesis:<br />
with: theeldsbuttheverticalones(<strong>in</strong><strong>and</strong>out<strong>of</strong>theplane)cancel,sooneendsup Look<strong>in</strong>gata\pill-box"aroundonesmallareaontheplane,onenoticesthatall IEdA=2ZEdA=2EA<br />
IEdA=q0<br />
Know<strong>in</strong>gthat=nq=Area<strong>and</strong>q=RdA,<strong>and</strong>plugg<strong>in</strong>gthisbacktotheux equationonegets: giv<strong>in</strong>gE==20.S<strong>in</strong>cewehavetwoplanes,thetotaleld<strong>in</strong>oursystemis: EA=A E=0
frequency.FromNewtonwehave:F=ma=qE Nowwewanttoshowthatx00+!2x=0,where!woulddescribetheplasma 31<br />
Hence,onecanwrite: Look<strong>in</strong>gattheeldequationfromGauss<strong>in</strong>terms<strong>of</strong>x:<br />
Iftheplanesappearperiodically,onecouldthendeducethefollow<strong>in</strong>goscillatory E=0=nq x00=a=q 0volx=0x<br />
behavior<strong>of</strong>themovement<strong>of</strong>theplaneswithrespecttooneanother: !p=sq m0x<br />
laws,an<strong>in</strong>-depthanalysis<strong>of</strong>onegeneraltimestepcanbeusedtopredictthegeneral behavior<strong>of</strong>thecode.Theanalyticderivationsprovided<strong>in</strong>AppendixBprovethat Thisisreferredtoastheplasmafrequency. Inordertoverifywhatourcodeactuallywillbehaveaccord<strong>in</strong>gtothephysical m0<br />
3.3Solv<strong>in</strong>gforTheField Theelds<strong>in</strong>whichthe<strong>particle</strong>saresimulatedaretypicallydescribedbypartial dierentialequations(PDEs).Whichnumericalmethodtousetosolvetheseequa-<br />
ournumericalapproach<strong>in</strong>deedproducesthepredictedplasmafrequency.<br />
ablility,variation<strong>of</strong>coecients)<strong>and</strong>theboundaryconditions(periodic,mixedor tionsdependsontheproperties<strong>of</strong>theequations(l<strong>in</strong>earity,dimensionality,seper-<br />
boundariesrequiremethodsthatgothroughaset<strong>of</strong>iterationsimprov<strong>in</strong>gonan solutiondirectly(directmethods),whereasthemorecomplicatedequation<strong>and</strong> simple,isolated). Simple,eldequationscanbesolvedus<strong>in</strong>gmethodsthatcomputetheexact
<strong>in</strong>itialguess(iterativemethods).Forvery<strong>in</strong>homogeneous<strong>and</strong>non-l<strong>in</strong>earsystems, currentnumericaltechniquesmaynotbeabletondanysolutions. Foranoverview<strong>of</strong>some<strong>of</strong>theclassicalnumericalmethodsusedforsolv<strong>in</strong>geld 32<br />
Young[You89]givesaniceoverview<strong>of</strong>iterativemethods.Itshould,however benotedthatnumericalanalysisisstillisanevolv<strong>in</strong>gdiscipl<strong>in</strong>ewhosecurrent <strong>and</strong>futurecontributionsmay<strong>in</strong>uenceone'schoice<strong>of</strong>method.Recent<strong>in</strong>terest<strong>in</strong>gtechniques<strong>in</strong>cludemultigridmethods<strong>and</strong>relatedmultileveladaptivemethods<br />
[McC89].<br />
equations,Chapter6<strong>of</strong>Hockney<strong>and</strong>Eastwood'sbook[HE89]isrecommended.<br />
Poisson'sequation: 3.3.1Poisson'sEquation theGauss'lawforchargeconservationwhichcanbedescribedbythefollow<strong>in</strong>g Thebehavior<strong>of</strong>charged<strong>particle</strong>s<strong>in</strong>appliedelectrostaticeldsisgovernedby<br />
where;<strong>and</strong>0aretheelectrostaticpotential,spacechargedensity,<strong>and</strong>dielectric constant,respectively.TheLaplacian,r2,act<strong>in</strong>gupon(x;y),isforthe2-D caseus<strong>in</strong>gaCartesiancoord<strong>in</strong>atesystem<strong>in</strong>x<strong>and</strong>y: r2=?0 r2=@2<br />
(3.16)<br />
PDE:@2 Comb<strong>in</strong><strong>in</strong>gthisdenitionwithGauss'law,henceyieldsthefollow<strong>in</strong>gelliptic1 @x2+@2@y2 (3.17)<br />
wherea,b,c,d,e,f,<strong>and</strong>garegivenrealfunctionswhicharecont<strong>in</strong>uous<strong>in</strong>someregion<strong>in</strong>the (x,y)plane,satisfyb2
potential,,<strong>and</strong>subsequentlytheeld,E=?r(hencethenameeldsolver). whichisthesecondorderequationweneedtosolve<strong>in</strong>ordertodeterm<strong>in</strong>ethe ThePoisson'sequationisacommonPDEwhichwewillshowcanbesolved 33<br />
us<strong>in</strong>gFFT-basedmethodswhenperiodic<strong>and</strong>certa<strong>in</strong>othersimpleboundaryconditionscanbeassumed.Ourparallelimplementationcurrentlyusesthistype<strong>of</strong><br />
DirectsolversbasedontheFFT(FastFourierTransform)<strong>and</strong>cyclicreduction solver. 3.3.2FFTsolvers O(N2)ormoreoperations<strong>and</strong>theaboveschemesmodelellipticequations,these aPDE<strong>in</strong>O(NlogN)orlessoperations[HE89].S<strong>in</strong>cePDEsolverstypicallyare (CR)schemescomputetheexactsolutions<strong>of</strong>Ndierenceequationsrelat<strong>in</strong>gto schemesarecommonlyreferredtoas\rapidellipticsolvers"(RES).RESmayonly beusedforsomecommonspecialcases<strong>of</strong>thegeneralPDEsthatmayarisefrom theeldequation.However,whentheycanbeused,thearegenerallythemethods equationr2=f)<strong>in</strong>simpleregions(e.g.squaresorrectangles)withcerta<strong>in</strong> <strong>of</strong>choices<strong>in</strong>cetheyarebothfast<strong>and</strong>accurate. simpleboundaryconditions. TheFourierTransform FFTsolversrequirethatthePDEshaveconstantcoecients(e.g.thePoisson<br />
The1DFouriertransform<strong>of</strong>afunctionhisafunctionH<strong>of</strong>!: The<strong>in</strong>verseFouriertransformtakesHbacktotheorig<strong>in</strong>alfunctionh: H(!)=Z1<br />
h(x)=1 2Z1 x=?1h(x)e?i!xdx: !=?1H(!)ei!xd!: (3.20) (3.19)
Ifhisadescription<strong>of</strong>aphysicalprocessasafunction<strong>of</strong>time,then!ismeasured <strong>in</strong>cyclesperunittime(theunit<strong>of</strong>frequency).Hencehis<strong>of</strong>tenlabeledthetime doma<strong>in</strong>description<strong>of</strong>thephysicalprocess,whereasHrepresentstheprocess<strong>in</strong>the 34<br />
frequencydoma<strong>in</strong>.However,<strong>in</strong>our<strong>particle</strong>simulation,hisafunction<strong>of</strong>distance. Inthiscase,!is<strong>of</strong>tenreferredtoastheangularfrequency(radianspersecond), s<strong>in</strong>ceit<strong>in</strong>corporatesthe2factorassociatedwithf,theunit<strong>of</strong>frequency(i.e. !=2f).<br />
Thesecondderivativecansimilarlybeobta<strong>in</strong>edbyascal<strong>in</strong>g<strong>of</strong>Hby?!2: Noticethatdierentiation(or<strong>in</strong>tegration)<strong>of</strong>hleadstoamerescal<strong>in</strong>g<strong>of</strong>Hbyi!: Dierentiationviascal<strong>in</strong>g<strong>in</strong>F dx=1 dx2=1 d2h dh2Z1<br />
!=?1i!H(!)ei!xd!: !=?1(i!)dH dxd! (3.22) (3.21)<br />
Hence,iftheFouriertransform<strong>and</strong>its<strong>in</strong>versecanbecomputedquickly,socan<br />
2Z1 !=?1(i!)[(i!)H(!)]ei!xd! !=?1(?!2)H(!)ei!xd!: (3.23)<br />
Discretiz<strong>in</strong>gforcomputers thederivatives<strong>of</strong>afunctionbyus<strong>in</strong>gtheFourier(frequency)doma<strong>in</strong>. (3.24)<br />
InordertobeabletosolvethePDEonacomputer,wemustdiscretizeoursystem (orone<strong>of</strong>itsperiods<strong>in</strong>theperiodiccase)<strong>in</strong>toaniteset<strong>of</strong>gridpo<strong>in</strong>ts.Inorder tosatisfythediscreteFouriertransform(DFT),thesegridpo<strong>in</strong>tsmustbeequally (theh(x)'s)withaspac<strong>in</strong>gd<strong>in</strong>toNcomplexnumbers(theH(!)'s): spaceddimension-wise(wemayapplya1DDFTsequentiallyforeachdimension toobta<strong>in</strong>amulti-dimensionaltransform).A1DDFTmapsNcomplexnumbers H(h(x))=H(!)N?1 Xn=0fnei!n=N (3.25)
erepresentedbyaFouriers<strong>in</strong>etransform(nocos<strong>in</strong>eterms),i.e.xedvalues boundaryconditions.Hence,acceptableboundaries<strong>in</strong>cludethosethathavecan NoticethattheFourierterms(harmonics)mustalsobeabletosatisfythe 35<br />
thefullDFT. thatareperiodic(systemsthatrepeat/\wrap"aroundthemselves)<strong>and</strong>henceuse (Dirichlet);thoseus<strong>in</strong>gaFouriercos<strong>in</strong>etransform,i.e.slopes(Neumann);orthose<br />
widthlimitedtowith<strong>in</strong>thespac<strong>in</strong>gs(i.eallfrequencies<strong>of</strong>gmustsatisfytheb<strong>and</strong>s y,respectively,thentheSampl<strong>in</strong>gTheoremsaysthatifafunctiongisb<strong>and</strong>-<br />
?hx
1)ComputeFFT((x;y)): ~(kx;ky)=1 LxLyN?1 Xx=0M?1 Xy=0(x;y)e2ixkx=Lxe2iyky=Ly 36<br />
comments)wherekx=2=Lx<strong>and</strong>ky=2=Ly,<strong>and</strong>scal<strong>in</strong>gby1=0. 2)Getby<strong>in</strong>tegrat<strong>in</strong>g(kx;ky)twicebydivid<strong>in</strong>gbyk2=k2x+k2y(seeprevious 3)Inordertogetbacktothegrid,(x;y)takethecorrespond<strong>in</strong>g<strong>in</strong>verse (3.27)<br />
FourierTransform:<br />
3.3.3F<strong>in</strong>ite-dierencesolvers 0(x;y)=1 JLJ?1 kx=0L?1 Xky=0(kx;ky)e?2ixkx=Lxe?2iyky=Ly: X<br />
Relaxation)methodwhichweused<strong>in</strong>somesimpletest-casesunderNeumann(re- Wenow<strong>in</strong>cludethedescription<strong>of</strong>a5-po<strong>in</strong>tnite-dierenceSOR(SuccessiveOver (3.28)<br />
ective)<strong>and</strong>Dirichlet(xed)boundaryconditions. low<strong>in</strong>gisobta<strong>in</strong>edforeachcoord<strong>in</strong>ate: Apply<strong>in</strong>ga5-po<strong>in</strong>tcentralnitedierenceoperatorfortheLaplacian,thefol-<br />
thepotentialatagivennode(i,j),yieldsthefollow<strong>in</strong>g2-Dnite-dierenceformula wherehisthegridspac<strong>in</strong>g<strong>in</strong>x<strong>and</strong>y(hereassumedequal). Comb<strong>in</strong><strong>in</strong>gGauss'law<strong>and</strong>theaboveapproximation,<strong>and</strong>thesolv<strong>in</strong>gfori;j, r2(i;j+1+i?1;j?4i;j+i+1;j+i;j?1)=h2; (3.29)<br />
(Gauss-Seidel): i;j=i;j 0h2+i;j+1+i?1;j+i+1;j+i;j?1=4; (3.30)
fortest<strong>in</strong>gsomesimplecases. Thisisthenitedierenceapproximationweused<strong>in</strong>ourorig<strong>in</strong>alPoissonsolver Tospeeduptheconvergence<strong>of</strong>ourniteapproximationtechniques,weused 37<br />
i;j'swerehenceupdated<strong>and</strong>\accelerated"asfollows: aSuccessiveOverRelaxation(SOR)schemetoupdatethe<strong>in</strong>teriorpo<strong>in</strong>ts.The tmp=(i;j+1+i?1;j+i+1;j+i;j?1)=4; i;j=i;j+!(tmp?i;j) (3.31)<br />
codewasrunforseveraldierent!'s,<strong>and</strong>wedeterm<strong>in</strong>edthatforthegridspac<strong>in</strong>gsweused,itseemedtooptimize(i.e.needfeweriterationbeforeconvergence)<br />
between1.7<strong>and</strong>1.85.Asmentioned,convergencewashereassumedtobeanac-<br />
thatEquation3.31isthesameasEquation3.30.Inourtestimplementation,our where!istheaccelerationfactorassumedtobeoptimalbetween1<strong>and</strong>2.Notice (3.32)<br />
SunSparcstations)calculatedbyupdat<strong>in</strong>gtheexitconditionasfollows: ceptablethresholdforround-o(weused10?14formost<strong>of</strong>ourtestrunsonour<br />
theoptimalaccelerationfactor!. methodsexistsuchasmultigridmethods<strong>and</strong>techniquesus<strong>in</strong>glargertemplate Thereisanextensivenumericalliteraturediscuss<strong>in</strong>ghowtopredict<strong>and</strong>calculate TheSORisafairlycommontechnique.Othernewer<strong>and</strong>possiblymoreaccurate exit=max(tmp?(i;j);exit): (3.33)<br />
(morepo<strong>in</strong>ts).S<strong>in</strong>ceweassumedtheeld<strong>in</strong>ournalcodetobeperiodic,we optedfortheFFT-solvers<strong>in</strong>ceitisamuchmoreaccurate(givesusthedirect<br />
toappear\reected"acrosstheboundary.Inotherwords, solution)<strong>and</strong>als<strong>of</strong>airlyquickmethodthatparallelizeswell. 3.3.4Neumannboundaries NaturalNeumannboundariesareboundarieswherethe<strong>in</strong>teriorpo<strong>in</strong>tsareassumed
forpo<strong>in</strong>tsalongtheNeumannborder.Forour2-Dcase,thenitedierenceupdate d38<br />
forborderpo<strong>in</strong>tsonasquaregridhencebecome: i;j?1?i;j+1 2hx =db;x(Leftorrightborders) dx=0<br />
wheredb;x<strong>and</strong>db;yformtheboundaryconditions(derivative<strong>of</strong>thepotential) forthatboundary.Inthecase<strong>of</strong>impermeableNeumannborders,db=0. i?1;j?i+1;j 2hy =db;y(Toporbottomborders) (3.35) (3.34)<br />
thenplugged<strong>in</strong>tothe2-Dcentralnitedierenceformula,asshownbelow(here: h=hx=hy).Tosimplifynotation,weusedthefollow<strong>in</strong>g"template"forournite dierenceapproximations:k Solv<strong>in</strong>gtheaboveequationforthepo<strong>in</strong>tbeyondtheborder,theresultwas<br />
become: Theequationsforapproximat<strong>in</strong>gthepotentialsattheNeumannbordershence l mo n m=(i,j) k=(i+1,j) o=(i-1,j) l=(i,j-1)<br />
n=(i,j+1)<br />
m=m 0hxhy+2hxdb;x+k+2n+o=4:0[Leftborder](3.36)<br />
SORtoconverge,aswellasfortest<strong>in</strong>gpurposes,some<strong>of</strong>theboundarypo<strong>in</strong>tsmay speciedasDirichlet(xedpotential,i.e.electrodes).Inourtestimplementation Similarequationswerederivedforthetop<strong>and</strong>bottomborders. SystemswithperiodicorNeumannboundariesares<strong>in</strong>gular.Inorderforthe 0hxhy?2hxdb;x+k+2l+o=4:0[Rightborder](3.37)<br />
electrode,<strong>and</strong>theupperrightborderbe<strong>in</strong>ga1Velectrode.Weobservedhow weusedamaskto<strong>in</strong>dicateNeumannorDirichletedges. WetestedourSORcodeonagridwithpart<strong>of</strong>thebottomleftbe<strong>in</strong>ga0V
theeldscontourslookedbyus<strong>in</strong>gourplotfacility.Asexpected,thepotentials \smoothed"betweenthe0V<strong>and</strong>1Velectrode,i.e.thepotentialsvaluesateach gridpo<strong>in</strong>tchangedverylittlelocallyafterconvergence.Agroup<strong>of</strong><strong>particle</strong>s<strong>of</strong>the 39<br />
samecharge<strong>in</strong>itializedclosetoeachotherattheOVelectrodewould,asexpected, overtimemovetowardsthe1Velectrode<strong>and</strong>generallyspreadawayfromeach<br />
Thesolverprovideduswiththepotential(i;j)ateachgrid-po<strong>in</strong>t.TheeldEat 3.4Mesh{ParticleInteractions neartheNeumanleftorrightborder. other.Theywould,asexpected,alsocurveback<strong>in</strong>tothesystemwhencom<strong>in</strong>g<br />
eachcorrespond<strong>in</strong>ggrid-po<strong>in</strong>twasthencalculatedus<strong>in</strong>garstorderdierence<strong>in</strong> eachdirection.TheeldEwashencestored<strong>in</strong>twoarrays,Ex<strong>and</strong>Ey,forthex the<strong>in</strong>teriorpo<strong>in</strong>tswhenapply<strong>in</strong>gtheeldtoeachnodeonthegrid: <strong>and</strong>ydirections,respectively.Weusedthefollow<strong>in</strong>g1-Ddierenceequationsfor Exi;j=(i;j?1?i;j+1)=(2hx) Eyi;j=(i?1;j?i+1;j)=(2hy) (3.38)<br />
3.4.1Apply<strong>in</strong>gtheeldtoeach<strong>particle</strong> canbeviewedasthevectorresult<strong>in</strong>gfromcomb<strong>in</strong><strong>in</strong>gEx<strong>and</strong>Ey. wherehx<strong>and</strong>hyarethegridspac<strong>in</strong>gs<strong>in</strong>x<strong>and</strong>y,respectively.Theactualeld (3.39)<br />
Afunctionwasthenwrittentocalculatetheelectriceldatagiven<strong>particle</strong>'s location(x0,y0),giventheeldgridsEx<strong>and</strong>Ey,theirsize,<strong>and</strong>thegridspac<strong>in</strong>gs canbestbedescribedthroughFigures3.2<strong>and</strong>3.3. hx,hy.(Allparameterswerepassedthroughpo<strong>in</strong>ters).Thenite-elementscheme
(i+1,j) 40 (i+1,j+1)<br />
(i,j).........<br />
(x0,y0)<br />
(i,j+1) hy<br />
Theeld'scontributiontoeach<strong>particle</strong>washencecalculatedas: Figure3.2:Calculation<strong>of</strong>nodeentry<strong>of</strong>lowercorner<strong>of</strong>current<strong>cell</strong> j=(<strong>in</strong>t)((x0+hx)/hx);<br />
Epartx=(Ex(i;j)(hx?a)(hy?b) i=(<strong>in</strong>t)((y0+hy)/hy).<br />
+Ex(i+1;j)(hx?a)b +Ex(i;j+1)a(hy?b) +Ex(i+1;j+1)ab)=(hxhy); (3.43) (3.41) (3.42) (3.40)<br />
Eparty=(Ey(i;j)(hx?a)(hy?b) +Ey(i+1;j)(hx?a)b +Ey(i;j+1)a(hy?b) +Ey(i+1;j+1)ab)=(hxhy); (3.45) (3.47) (3.46) (3.44)<br />
3.4.2Recomput<strong>in</strong>geldsdueto<strong>particle</strong>s Totake<strong>in</strong>toaccountthechargeduetoall<strong>particle</strong>s,theeldsneededtobe recomputedaftereachtimestep.Particleshadtobesynchronized<strong>in</strong>timebefore
(i+1,j) 41<br />
(i,j)......ḣx a(x0,y0)<br />
(i+1,j+1)<br />
where b (i,j+1) hy<br />
Figure3.3:Calculation<strong>of</strong>eldatlocation<strong>of</strong><strong>particle</strong>us<strong>in</strong>gbi-l<strong>in</strong>ear<strong>in</strong>terpolation. b=y0-((i-1)*hy). a=x0-((j-1)*hx);<br />
densitywasupdatedaccord<strong>in</strong>gtothe<strong>particle</strong>s'location{i.e.the<strong>particle</strong>s<strong>in</strong>uence theeld.Theeldsateachnodewereupdatedas<strong>in</strong>theprevioussection. theeldupdate.Asbefore,weassumedan<strong>in</strong>itialchargedensitypernode.This<br />
k=p=(hx+hy): <strong>and</strong>basdened<strong>in</strong>Figure3.3.Thecorrespond<strong>in</strong>gequationsfortheupdateswere Thechargedensitygrid,washenceupdatedaftereachtime-stepus<strong>in</strong>ga<br />
(i+1;j+1)+=abk: (i+1;j)+=(hx?a)bk; (i;j+1)+=a(hy?b)k; (i;j)+=(hx?a)(hy?b)k; (3.51) (3.48) (3.49) (3.50)
totrack<strong>particle</strong>s<strong>in</strong>anelectriceld.Particlecoord<strong>in</strong>ateswereread<strong>in</strong>fromale Us<strong>in</strong>gthePoissonsolverdeveloped<strong>in</strong>theprevioussection,wethendevelopedcode 3.5Mov<strong>in</strong>gthe<strong>particle</strong>s 42<br />
usedtotrack<strong>in</strong>dependent<strong>particle</strong>s<strong>in</strong>theelduntiltheyencounteredaboundary. I.e.theeldswerecomputedatthenodesus<strong>in</strong>gF<strong>in</strong>iteDierences(FD)<strong>and</strong> <strong>in</strong>thelastsection,exceptweallowedforanon-zerosourceterm(r2=S,S6=0). <strong>and</strong>trackeduntiltheyhitaboundary.The<strong>in</strong>itialconditionsweresetasdescribed<br />
<strong>in</strong>terpolatedwithF<strong>in</strong>iteElements(FE). Ahybridnite-element/nitedierencemethod<strong>and</strong>Euler<strong>in</strong>tegrationwere<br />
weremovedaccord<strong>in</strong>gtot,avariablethatgotadjustedsothatthe<strong>particle</strong>s accord<strong>in</strong>gtothe<strong>particle</strong>s'location{i.e.the<strong>particle</strong>s<strong>in</strong>uencedtheeld.The eldsateachnodewereupdatedataxedtime-<strong>in</strong>tervalt,whereasthe<strong>particle</strong>s wouldmovenomorethan1/4<strong>of</strong>a<strong>cell</strong>sideperttime<strong>in</strong>crement.The<strong>particle</strong>s' Theprogramassumedan<strong>in</strong>itialchargedensitypernodewhichgotupdated<br />
trajectorieswererecordedonoutputles. calculations\leap"overthepositions,<strong>and</strong>viceversa. leapfrogmethodwasconsideredfortheaccelerationmodel,wherethevelocities 3.5.1TheMobilityModel Fortrack<strong>in</strong>gthe<strong>particle</strong>s,themobilitymodelupdatesthelocationsdirectly.A<br />
IntheMobilitymodel,thenewlocation(x1,y1)<strong>of</strong>a<strong>particle</strong>atlocation(x0,y0)is calculatedgiventheeld(Epart-x,Epart-y)atthat<strong>particle</strong>,mobility(),timestep(dt),<strong>and</strong>grid(look<strong>in</strong>gat<strong>particle</strong>velocity=dx/dt).Tomakesurereasonable<br />
stepsweretaken,theimplementationshouldcheckwhetherthe<strong>particle</strong>hasmoved border,<strong>and</strong>aagbagwasset. morethan1/4<strong>of</strong>the<strong>particle</strong>'s<strong>cell</strong>'sside(hx,hypassed<strong>in</strong>)with<strong>in</strong>atime-step. Ifso,itmayreducethetime-step,recomputethelocation,<strong>and</strong>returnthenew time-step.Inourtestimplementation,the<strong>particle</strong>swerestoppedwhentheyhita
tionupdates<strong>in</strong>themobilitymodel: Giventhepreviousrout<strong>in</strong>es,thefollow<strong>in</strong>gsimpleequationsdescribedtheloca-<br />
43<br />
bility<strong>of</strong>themedium,,isthesame<strong>in</strong>bothx<strong>and</strong>y.Mosteldsconsideredare<strong>in</strong> Theaboveequationsassumethemediumtobeisotropic,thatisthatthemo-<br />
x1=x0+(Epartxt); y1=y0+(Epartyt): (3.53) (3.52)<br />
Bythelaws<strong>of</strong>physics,<strong>particle</strong>swilltendtobeattractedtoDirichletboundaries suchmedia.<br />
Thelatterhappensbecauseasthe<strong>particle</strong>sclose<strong>in</strong>onaNeumannborder,they (electrodes)withoppositehighcharges,butberepelledfromNeumannboundaries. seetheir\imagecharge"reectedacrosstheborder.S<strong>in</strong>ceequallysignedcharges Test<strong>in</strong>gtheMobilityModel<br />
(y=0)equallyspacedbetweenx=0.4<strong>and</strong>0.6(gridrang<strong>in</strong>gfromx=0to1).We veriedthatthe<strong>particle</strong>sdidnotspreadoutunlesstheeldswererecomputed. repeleachother,the<strong>particle</strong>steersaway<strong>and</strong>doesnotcrossaNeumannborder.<br />
As<strong>in</strong>glesimulation<strong>particle</strong>startednearthebottomplate(0V),wouldalwaysgo electrodeplates,respectively.N<strong>in</strong>e<strong>particle</strong>swerethenstartedatthebottomplate Thecodewastestedby<strong>in</strong>itializ<strong>in</strong>gthebottom<strong>and</strong>topborderas0V<strong>and</strong>1V<br />
IntheAccelerationModelformov<strong>in</strong>gthe<strong>particle</strong>s,theforceonthe<strong>particle</strong>is straightuptothetopplate(1V),s<strong>in</strong>cenoother<strong>particle</strong>swouldbepresentto <strong>in</strong>uenceitseld. 3.5.2TheAccelerationModel proportionaltotheeldstrengthratherthanthevelocity: F=qE (3.54)
whereEistheeld(Ex,Ey)atthat<strong>particle</strong>.Thisisthemodelwewillbeus<strong>in</strong>g<br />
<strong>in</strong>ourparallelizedplasmasimulation. 44<br />
elds<strong>and</strong>then<strong>particle</strong>s'contributiontothem.However,<strong>in</strong>stead<strong>of</strong>therout<strong>in</strong>efor Theleapfrogmethodwasusedforupdat<strong>in</strong>gthe<strong>particle</strong>s'locationaccord<strong>in</strong>gto<br />
comput<strong>in</strong>gthe<strong>particle</strong>'svelocity,theotherforupdat<strong>in</strong>gthenewlocation. mov<strong>in</strong>gthe<strong>particle</strong>us<strong>in</strong>gtheMobilityModel,wenowusedtworout<strong>in</strong>es,onefor thismodel.Adragtermproportionaltothevelocitywasalsoadded<strong>in</strong>. Weusedthesamefunctionsasdescribed<strong>in</strong>thelastsectionforcomput<strong>in</strong>gthe<br />
updatelagahalftime-stepbeh<strong>in</strong>dtheupdate<strong>of</strong>a<strong>particle</strong>'sposition(position with<strong>in</strong>itialspeed(vx0,vy0),giventheforceF,<strong>particle</strong>massm,time-step(t), <strong>and</strong>thegrid. Thefunctionscalculatethenewspeed(vx1,vy1)<strong>of</strong>a<strong>particle</strong>atlocation(x0,y0)<br />
(x,y)\leap<strong>in</strong>gover"thevelocity,thentheotherwayaround: Theleap-frogmethodmodelsa<strong>particle</strong>'smovementbyhav<strong>in</strong>gthevelocity<br />
Splitt<strong>in</strong>gthedirectionalvectors<strong>in</strong>tox<strong>and</strong>ytermsgive: vn+1=2=vn?1=2+(F((xn;yn))=m)t Fx=qEpartx (3.55)<br />
Add<strong>in</strong>gthedragterm,thisgaveusthefollow<strong>in</strong>g<strong>codes</strong>egment: Fy=qEparty vx1=vx0+((Fx/mass)*del_t)-(drag*vx0*del_t); (3.57) (3.56)<br />
Withtheequationsforupdat<strong>in</strong>gthelocations: xn+1=xn+vn+1=2t vy1=vy0+((Fy/mass)*del_t)-(drag*vy0*del_t); (3.58)
' 45<br />
vn?1=2 ''?$<br />
?$<br />
Figure3.4:TheLeapfrogMethod. xn vn+1=2 xn+1 vn+3=2-time<br />
orthe<strong>codes</strong>egment:x1=x0+(vx0*del_t);<br />
periodicsystemisassumed.Aagissetifthe<strong>particle</strong>movesmorethanasystem (1=2t),<strong>and</strong>then\leaps"overthelocationstep(Figure3.4). Noticethatatthersttime-step,thevelocityis\pulledback"halfatime-step Whenthe<strong>particle</strong>hitsaborder,the<strong>particle</strong>re-apearsontheothersideifa y1=y0+(vy0*del_t);<br />
lengthwith<strong>in</strong>atime-step.Inthiscasethetime-stepwillneedtobereduced<strong>in</strong><br />
Initialtest<strong>in</strong>gwasdoneasforthemobilitymodel.Aga<strong>in</strong>,westarted5<strong>particle</strong>sat <strong>in</strong>thischapter. Test<strong>in</strong>gtheAccelerationModel ordertohaveaviablesimulation.Parameterizationwillbediscussedfurtherlater<br />
theupperplate,exceptforcaseswherethe<strong>particle</strong>camefairlyclosetotheirleft orrightborders.Aspredicted,the<strong>particle</strong>shead<strong>in</strong>gfortheborderwerethenbe repelledbytheirimagecharges. thebottomplate<strong>and</strong>showedthattheyspreadoutnicelyastheymovedtowards<br />
simulatetheplasmaoscillationsdescribed<strong>in</strong>thenextsection. Thetrue\acid-test"forourcodewas,however,toseewhetheritcouldcorrectly
Theparametershx,hy,Lx,Ly,t,<strong>and</strong>Npneedtosatisfythefollow<strong>in</strong>gconstra<strong>in</strong>ts 3.6Test<strong>in</strong>gtheCode{Parameter Requirements 46<br />
iscollisionless(Hockney<strong>and</strong>Eastwood[1988]describestheseforthe1-Dcase): <strong>in</strong>orderfortheplasmawavestobeadequatelyrepresented<strong>and</strong>sothatthemodel 1.!pt2,where!pistheplasmaoscillationfrequency.Onecanusually 2.hx;hyD,i.e.thatthespac<strong>in</strong>gD,theDebyelengthdenedtobethe expect!ptbetween.1<strong>and</strong>.2togivetheoptimalspeedversusaccuracy.<br />
3.Lx;LyD. 4.NpDLx;Ly,i.e.number<strong>of</strong>simulation<strong>particle</strong>sperDebyelengthshould characteristicwavelength<strong>of</strong>electrostaticoscillation(D=vT=!p,wherevT<br />
belargecomparedtothesimulationarea.Thisgenerallyguaranteesthat isthethermalvelocity<strong>of</strong>theplasma.<br />
therearealargenumber<strong>of</strong>simulation<strong>particle</strong>s<strong>in</strong>therange<strong>of</strong>thevelocities<br />
3.6.1!p<strong>and</strong>thetimestep regard<strong>in</strong>gboththestability<strong>and</strong>thecorrectness<strong>of</strong>thecode. Byanalyz<strong>in</strong>gthecodecarefullyfromthispo<strong>in</strong>t,condencecanbeachieved nearthephasevelocity<strong>of</strong>unstablewaves.<br />
Toseewhetherthecodecouldproducethecorrectplasmafrerquency,wereformulatedthecodeused<strong>in</strong>theaccelerationmodel(Section3.5.2)tohaveperiodic<br />
boundaryconditionsonallboundaries. lengthnq<strong>in</strong>thedirectionperpendiculartothesimulationplane,<strong>and</strong>massper <strong>particle</strong>planeswilleachbesee<strong>in</strong>ganotherplane<strong>of</strong>chargesacrosstheboundaries unitlengthnm.Wearehencenowmodel<strong>in</strong>gan\<strong>in</strong>nitesystem"wherethetwo Thesystemisthenloadedwithtwob<strong>and</strong>s<strong>of</strong><strong>particle</strong>swithchargeperunit
asitisrepelledfromtheotherb<strong>and</strong><strong>in</strong>its<strong>cell</strong>.Thissystemhencecorrespondsto theoscillatorysystemdescribed<strong>in</strong>Section3.2.3. Assum<strong>in</strong>gthesystemsizeisLxbyLy,the<strong>particle</strong>b<strong>and</strong>sarethenplacedeither 47<br />
thesystemshouldoscillate. closetotheboundaryorclosetothecenter<strong>of</strong>thesystem,aligned<strong>in</strong>thexory direction.Aslongastheb<strong>and</strong>sarenotplacedatdistance<strong>of</strong>12Lfromeachother,<br />
accurateresultthantheSORmethod. solverdescribed<strong>in</strong>Section3.3.2forthePoisson'sequationtoobta<strong>in</strong>amore velocities.S<strong>in</strong>cewenowareassum<strong>in</strong>gperiodicconditions,wecouldusetheFFT-<br />
Ifthecodewascorrectlynormalized,the<strong>particle</strong>planesshouldoscillateback Theleap-frog<strong>particle</strong>-pusherwasusedtoadvancethe<strong>particle</strong>positions<strong>and</strong><br />
<strong>and</strong>forth<strong>in</strong>they-direction(orx-direction)throughthecenter<strong>of</strong>thesystemata frequency<strong>of</strong>!0=!p=2=1period. <strong>in</strong>deedapproximatedtheexpectedtheplasmafrequencyaslongas!pt2. Us<strong>in</strong>gthefollow<strong>in</strong>gknownphysicalparameters(fromPhysicsToday,Aug'93): Wetestedourcodeforseveraldierenttime-stepst<strong>and</strong>veriedthatourcode<br />
<strong>and</strong>the<strong>in</strong>putparameters: 0=8.854187817*10?12; q=-1.6021773*10?19;<br />
0=q*n0=-1.602*10?12(n0=107{typicalforsomeplasmas) drag=0.0;t=(vary<strong>in</strong>g{seebelow);tmax=0.0002 m=9.109389*10?31;<br />
wecancalculatetheexpectedplasmafrequency: !p=sq0 m0=vut (9:10938910?31)(8:85418781710?12)1:78105 (?1:602177310?19)2
Toavoidanygrid-renementproblems/<strong>in</strong>terpolationerrors,weputtheb<strong>and</strong>s 48<br />
atx=0.1875<strong>and</strong>at0.8125whichis<strong>in</strong>thecenter<strong>of</strong>theb<strong>and</strong>srespectivecolumn<strong>of</strong> <strong>cell</strong>sforan8x8system.Wewerehenceabletogetthefollow<strong>in</strong>gtestsdemonstrat<strong>in</strong>g the!ptrelationshipshown<strong>in</strong>Table3.1. above.Noticethatfort=1:010?5,!pt=2,<strong>and</strong>aspredictedbythetheory istheturn<strong>in</strong>gpo<strong>in</strong>tforstability. 3.6.2Two-streamInstabilityTest Theseresults,shown<strong>in</strong>Table3.1,agreewiththetheoreticalresultweobta<strong>in</strong>ed<br />
distributed<strong>particle</strong>s(<strong>in</strong>ourcase2D<strong>particle</strong>grids)areloadedwithopposite<strong>in</strong>itial driftvelocities.Detailedknowledge<strong>of</strong>thenon-l<strong>in</strong>earbehaviorassociatedwithsuch performedatwo-stream<strong>in</strong>stabilitytest[BL91].Inthiscase,twoset<strong>of</strong>uniformly simulationsweredeveloped<strong>in</strong>the'60s. T<strong>of</strong>uthertestwhetherour<strong>codes</strong>wereabletosimulatephysicalsystems,wealso<br />
growexponentially<strong>in</strong>time.Inorderforthistesttowork,caremustbetaken bunch<strong>in</strong>g<strong>of</strong><strong>particle</strong>s<strong>in</strong>theotherstream,<strong>and</strong>viceversa.Theperturbationshence densityperturbation(bunch<strong>in</strong>g)<strong>of</strong>onestreamisre<strong>in</strong>forcedbytheforcesdueto movethrougheachotheronewavelength<strong>in</strong>onecycle<strong>of</strong>theplasmafrequency,the Systemsthatsimulateoppos<strong>in</strong>gstreamsareunstables<strong>in</strong>cewhentwostreams<br />
separatelyus<strong>in</strong>gan<strong>in</strong>itialdriftvelocityvdrift=!pLxforhalfthe<strong>particle</strong>s<strong>and</strong> vdrift=?!pLxfortheotherhalf<strong>of</strong>the<strong>particle</strong>s. <strong>in</strong>choos<strong>in</strong>gthe<strong>in</strong>itialconditions.Inourcase,wechosetotesteachdimension<br />
wassetto1:510?7.The\eyes"appearedwith<strong>in</strong>10time-steps<strong>of</strong>thelargewave teristicnon-l<strong>in</strong>ear\eyes"associatedwithtwo-stream<strong>in</strong>stabilities.Ourtime-step appear<strong>in</strong>g.Noticethatthesearedistanceversusvelocityplotsshow<strong>in</strong>g1Deects. AscanbeseenfromFigures3.6a-c,ourcodewasabletocapturethecharac-
Table3.1:Plasmaoscillations<strong>and</strong>time-step 49<br />
t period Two-b<strong>and</strong>test<br />
5.0*10?5blowsup 10*10?5blowsup (seconds)(*10?5) {passesborders<strong>in</strong>1t! !p=2=period<br />
2.0*10?5blowsup (*105)<br />
0.52*10?53.12-3.64 1.5*10?5blowsup<br />
0.50*10?53.00-3.50 1.2*10?5 1.0*10?53.38(avg) 2.5 2.6,butblowsupafter1period<br />
3.25(avg) 1.86 1.93 2.0<br />
0.48*10?53.36 0.40*10?53.20-3.60 0.30*10?53.20-3.60<br />
1.87<br />
0.25*10?53.50 0.10*10?53.50 3.4(avg) 0.05*10?53.50 0.01*10?53.50 1.795 1.85
position.Weobta<strong>in</strong>edsimilarplotsforacorrespond<strong>in</strong>gtest<strong>of</strong>x<strong>and</strong>vx. Eachdot<strong>in</strong>Figure3.6aactuallyrepresentsallthe<strong>particle</strong>s<strong>in</strong>xmapp<strong>in</strong>gtothey 50<br />
Accesstoalargeparallelsystems,suchastheKSR-1,willallowustoremovethe 3.7ResearchApplication{DoubleLayers restrictionsimposedonusbythespeed<strong>of</strong>thesmallercomputerswecurrentlyuse.<br />
asatellitepass<strong>in</strong>gbelowtheplasma.Thedevelopment<strong>of</strong>the<strong>in</strong>stabilitydependson whenpresent,changestheelectronvelocitydistributionthatwouldbeobservedby plasmaphysics<strong>issues</strong>.First,an<strong>in</strong>stabilityoccurs<strong>in</strong>thepresentsimulationsdueto an<strong>in</strong>teractionbetweentheaccelerated<strong>and</strong>backgroundelectrons.The<strong>in</strong>stability, Ourresearchgrouphopes,thereby,tobeabletoclarifyanumber<strong>of</strong><strong>in</strong>terest<strong>in</strong>g<br />
raterepresentation<strong>of</strong>thiseectrequiresbothahigh-speedplatform<strong>and</strong>aparallel algorithmappropriatefortheproblem.Second,wehaveobservedthepresence <strong>of</strong>anomalousresistivity<strong>in</strong>regions<strong>of</strong>substantialAlfvenwave-generatedelectron <strong>and</strong>the<strong>in</strong>stability.Theperpendicular<strong>and</strong>parallelresolutionrequiredforaccu-<br />
theperpendicularstructure<strong>of</strong>thenonl<strong>in</strong>eardevelopment<strong>of</strong>boththeAlfvenwave<br />
iscurrentlyrestrictedbythesimulationsystemlength.Because<strong>of</strong>theperiodic noiseduetoverylargesuper<strong>particle</strong>s.Third,theevolution<strong>of</strong>anAlfvenwavepulse allowustoemployamuchlargernumber<strong>of</strong><strong>particle</strong>s,enabl<strong>in</strong>gustoreducethe drift.Wewouldliketop<strong>in</strong>po<strong>in</strong>tthecause<strong>of</strong>thisresistivity,butitsmechanismhas<br />
boundaryconditions,employed<strong>in</strong>ourcurrentsimulation,anAlfvenwavepacket provedelusiveduetothepresence<strong>of</strong>substantialnoise.Use<strong>of</strong>parallelismwould<br />
musteventuallytraversearegionpreviouslycrossed,encounter<strong>in</strong>gplasmaconditions<strong>of</strong>itsownwake.Aga<strong>in</strong>,thelongersystempossibleonparallelsystemssuch<br />
astheKSR-1,wouldalleviatethisproblem.Therearealsoothersimulation<strong>issues</strong> whichwouldbenetfromtheresources<strong>and</strong><strong>parallelization</strong>possibilitiespresented <strong>in</strong>thisthesis. Our<strong>in</strong>itialexperiments<strong>in</strong>dicatethattheKSR1isagoodmatchforourproblem.
51<br />
vdrift?vdriftvdrift?vdriftvdrift?vdrift xvxyvy<br />
c)characterstictwo-streameye. Figure3.5:Two-stream<strong>in</strong>stabilitytest.a)Initialconditions,b)wavesareform<strong>in</strong>g,
Wendthecomb<strong>in</strong>ation<strong>of</strong>therelativeease<strong>of</strong>implementationprovidedbyits <strong>and</strong>processorresourcesveryattractive. shared-memoryprogramm<strong>in</strong>genvironmentcomb<strong>in</strong>edwithitssignicantmemory 52<br />
2GBusedforOS,program,<strong>and</strong>datastorage).Each<strong>particle</strong>uses4doubleprecisonquantities(velocity<strong>and</strong>location<strong>in</strong>bothx<strong>and</strong>y)<strong>and</strong>henceoccupies32<br />
arraysneedtot<strong>in</strong>localmemory.GiventhecurrentKSR1'shardware,acon-<br />
Duetothecomputational<strong>and</strong>memoryrequirements<strong>of</strong>ourcode,allmajor servativeestimatewouldimplywearerestrictedto2GB<strong>of</strong>memory(theother<br />
orlarger.Thisshouldenableustostudyeectscurrentlynotseenus<strong>in</strong>gcurrent serial<strong>codes</strong>(us<strong>in</strong>g,forexample,256-by-32grids). oryrestrictions,wewouldthereforeliketomodelsystemsthatare4096-by-256 bytes.Particlecode<strong>in</strong>vestigations<strong>of</strong>auroralaccelerationtypicallyemploy10-100 <strong>particle</strong>spergriddepend<strong>in</strong>gontheeectbe<strong>in</strong>gstudied.Giventhecurrentmem-
Chapter4 Parallelization<strong>and</strong>Hierarchical MemoryIssues<br />
theparallelismfor<strong>in</strong>dividualmodules,suchassolvers,matrixtransposers,factorizers,<strong>particle</strong>pushers,etc.Our<strong>particle</strong>simulationcode,however,isfairlycomplex<br />
4.1Introduction Todate,theeld<strong>of</strong>scienticparallelcomputationhasconcentratedonoptimiz<strong>in</strong>g \Parallelismisenjoymentexponentiated."{authorca.1986.<br />
<strong>in</strong>teractionsaect<strong>parallelization</strong>. <strong>and</strong>consists<strong>of</strong>several<strong>in</strong>teract<strong>in</strong>gmodules.Akeypo<strong>in</strong>t<strong>in</strong>ourworkisthereforeto considerthe<strong>in</strong>teractionsbetweenthesesub-programblocks<strong>and</strong>analyzehowthe ignoredthisissue,or<strong>in</strong>thecase<strong>of</strong>Azarietal.[ALO89],usedalocalizeddirect solver.Thelatter,however,onlyworksforveryspecializedcases.Themoregeneral solver)mayimpacttheoverall<strong>particle</strong>partition<strong>in</strong>g.Previousworkhaseither problemsusuallyrequiretheuse<strong>of</strong>somesort<strong>of</strong>numericalPDEsolver. Inparticular,wewouldliketoseehowthesolverpartition<strong>in</strong>g(sayforanFFT<br />
anovelgridpartition<strong>in</strong>gapproachthatleadstoanecientimplementationfacili- Traditionalparallelmethodsus<strong>in</strong>greplicated<strong>and</strong>partitionedgrids,aswellas 53
tat<strong>in</strong>gdynamicallypartitionedgridsaredescribed.Thelatterisanovelapproach thattakesadvantage<strong>of</strong>theshared-memoryaddress<strong>in</strong>gsystem<strong>and</strong>usesadual po<strong>in</strong>terschemeonthelocal<strong>particle</strong>arraystokeepthe<strong>particle</strong>locationspartially 54<br />
grid.Inthecontext<strong>of</strong>gridupdates<strong>of</strong>thechargedensity,wewillrefertothis showanovelapproachus<strong>in</strong>ghierarchicaldatastructuresforstor<strong>in</strong>gthesimulation techniquesassociatedwiththisdynamicschemewillalsobediscussed. sortedautomatically(i.e.sortedtowith<strong>in</strong>thelocalgridpartition).Load-balanc<strong>in</strong>g<br />
techniqueas<strong>cell</strong>cach<strong>in</strong>g. F<strong>in</strong>ally,wewill<strong>in</strong>vestigatehowmemoryhierarchiesaect<strong>parallelization</strong><strong>and</strong><br />
4.2 Distributed Memory versus Shared Memory

The primary problem facing distributed memory systems is maintaining data locality and the overhead associated with it. This problem with parallel overhead also extends to the shared memory setting, where data locality with respect to cache is important. The author proposes that one view the KSR as a shared memory system where all memory is treated as a cache (or hierarchy thereof).

Shaw [Sha] points out that his experience with the SPARC-10s shows they have an interesting property which seems highly relevant. In order to achieve their peak speed (17-19 MFLOPs), the data must be in what Sun calls the SuperCache, which is about 0.5 MBytes per processor (and may increase in future versions of the hardware). This implies that if you are going to partition a problem across a group of SPARC-10s, you have many levels of memory access to worry about:

1. machine access on a network
2. virtual memory access on one machine on a network
3. real memory access on one machine on a network
4. SuperCache access on one processor on one machine on a network

Hence, there is a great deal to worry about in getting a problem to work "right" when you have a network of multiprocessor SPARC-10s. A network of Sun 10's will consequently raise a lot of issues similar to those of the KSR, in that they both possess several levels of cache/memory. (The KSR also has a 0.5 MB local cache on each processor: 0.25 MB for data, 0.25 MB for instructions.) To achieve optimum performance on any given parallel system, no doubt, a lot of fine-tuning is necessary. It is, however, hoped that our work can address the general problems and give some guidelines on the parallelization of fairly complex physics (and similar) codes.
4.3 The Simulation Grid

When parallelizing a particle code, one of the bottlenecks is how to update the grid quantities. The easiest and most common parallel implementations have a local copy of the grid for each thread (processor). In the ideal case, the grid is distributed, the processing nodes only share grid quantities on the borders, and the particles remain totally sorted; i.e., all particles within a sub-grid are handled by the same local thread (processing node).

4.3.1 Replicated Grids
To ensure that all grid updates occur without any contention from threads trying to update the same grid node (applying contributions from particles from different threads), one of the most common techniques is to replicate the grid for each thread. Each thread then calculates the contribution of its own particles by adding them up in a local array equal in size to the whole grid. When all the threads are done, the local grids are then added together, either by one global master thread or in parallel by some or all threads. Experiments performed by the author on the KSR1 show that, due to the overhead in spawning and synchronizing many threads, a global master approach is in fact faster for smaller grids (say, 32-by-32).
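To make the replicated-grid update concrete, here is a minimal C sketch of the idea, not the actual KSR code used in this work: each thread deposits into its own private copy of the grid without locking, and the copies are summed afterwards. The struct layout, the nearest-grid-point deposit (the thesis codes use bilinear weighting), and the function names are assumptions made for illustration.

    /* Sketch (not the thesis code): replicated-grid charge accumulation.  */
    typedef struct { double x, y, q; } Particle;

    /* Each thread deposits its own particles into its private grid copy.
     * Nearest-grid-point deposit is used here for brevity; the PIC codes
     * discussed in the text use bilinear (4-corner) weighting.            */
    void deposit_local(double *local_grid, int nx,
                       const Particle *p, int np)
    {
        for (int i = 0; i < np; i++) {
            int ix = (int)p[i].x;          /* assumes 0 <= x < nx          */
            int iy = (int)p[i].y;          /* assumes 0 <= y < ny          */
            local_grid[iy * nx + ix] += p[i].q;
        }
    }

    /* Global-master summation: add all nthreads private copies into the
     * shared grid. In the parallel variant this loop may itself be split
     * among some or all threads.                                          */
    void sum_grids(double *global_grid, double *const *local_grids,
                   int nthreads, int ng)
    {
        for (int t = 0; t < nthreads; t++)
            for (int g = 0; g < ng; g++)
                global_grid[g] += local_grids[t][g];
    }

Whether sum_grids is run by one master thread or split among the worker threads is exactly the trade-off measured in the KSR1 experiments mentioned above.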
4.3.2 Distributed Grids

Even though the particle push phase is parallelized using a replicated grid, the grid may still be distributed in a parallelized field solver. For an FFT solver this typically involves block-column and block-row distributions. However, when distributing the grid with particle updates in mind, we shall show that a column or row-oriented partitioning may not be the most efficient.

4.3.3 Block-column/Block-row Partitioning
Assuming, nevertheless, that one tries to stick to the column/row distribution, in the particle update phase a node would still need to copy a row (or column) to its neighbor (assuming the number of grid points in each direction equals the number of processing nodes available), since the particle-grid and grid-particle calculations are cell-based. This leaves each processor with the grid structure shown in Figure 4.1.
    o----o----o----o----o----o----o----o----o
    |    |    |    |    |    |    |    |    |
    O----O----O----O----O----O----O----O----O

    o = local grid data;  O = data copied from neighbor

Figure 4.1: Grid point distribution (rows) on each processor.
4.3.4 Grids Based on Random Particle Distributions

Square subgrids give a much better border/interior ratio than skinny rectangles for random distributions of particles in an electrostatic field. The particles then tend to leave their local subdomains less often, and less communication is hence needed. We investigated several other regular polygons (see the following Tables 4.1 and 4.2), but concluded that the square is indeed a good choice since its border/area ratio is reasonable and its implementation simpler.
Table 4.1: Boundary/Area ratios for 2D partitionings with unit area.

    Shape                    Boundary/Area Ratio
    Circle:                  3.54
    Regular Hexagon:         3.72
    "Uniform Grid" Hex:      3.94
    Square:                  4.00
    Regular Triangle:        4.56

Table 4.2: Surface/Volume ratios for 3D partitionings with unit volume.

    Shape                                            Surface/Volume Ratio
    Sphere (optimum):                                4.84
    Cylinder w/ optimum height and radius:           5.54
    Staggered regular hexagon with depth = 1 side:   5.59
    Cube:                                            6.00

4.4 Particle Partitioning

4.4.1 Fixed Processor Partitioning
The easiest and most common scheme for particle partitioning is to distribute the particles evenly among the processors and let each processor keep tracking the same particles over time. This scheme works reasonably well when each processor maintains a local copy of the grid that, after each step, is added to the other local grids to yield a global grid which describes the total charge distribution.
Unfortunately, replicating the grid is not desirable when the grid is large and several processors are used. In addition to the obvious grid summation costs, it would also consume a great deal of valuable memory and therefore hamper our efforts to investigate global physical effects, one of the prime goals of our particle simulation.

Since the particles will become dispersed all over the grid over time, a fixed particle partitioning scheme would also not fare well in combination with a grid partitioning. An alternative would be to use a hybrid partitioning like the one described by Azari and Lee [AL91, AL92] for a distributed memory hypercube (which still has problems for very inhomogeneous cases), or to sort the particles according to local grids periodically. The latter would also require a dynamic grid allocation if load balance is to be maintained. We will get back to these combined schemes later in this chapter.
4.4.2 Partial Sorting

One way to reduce memory conflicts when updating the grid, in the case where the grid is distributed among the processors, is to have the particles partially sorted. By partial sorting we mean that all particle quantities (locations and velocities) within a certain subgrid are maintained by the processor handling the respective subgrid. In this case, memory conflicts (waits for exclusive access to shared variables, here grid locations) are limited to the grid points on the borders of the subgrids. This method can be quite costly if it is necessary to globally sort an array with millions of particles fairly often.

"Dirty" bits

An alternative approach is to maintain local particle arrays that get partially sorted after each time-step. (This would be the equivalent of sending and receiving particles that leave their sub-domain in the distributed memory setting.)

A fairly common technique from the shared-memory vector processing world is to add a "dirty bit" to each particle location and then set or clear this bit depending on whether or not the new location is within the local subgrid. Cray programmers worrying about using too much extra memory for the "dirty" bits have been known to use the least significant bit of the double-precision floating point number describing the location as a "dirty" bit!

If the location is in a new grid, the locations may still be written back with the appropriate "dirty bit" setting. This would, however, require a subsequent search through the "dirty bits" of all local particles by all processors and a corresponding fill-in of "dirty" locations locally on each processor (assuming local memory should not be expanded and wasted).
4.4.3 Double Pointer Scheme

A more elegant way that achieves partial sorting of the local particle arrays automatically during particle updates is to maintain two pointers, a read and a write pointer, to the local particle arrays. If the new particle location is still within the local subgrid, then both pointers get incremented; otherwise only the read pointer gets updated and the exiting particle is written to a scratch memory. Notice how this automatically "sorts" the particles back into the local array. After the thread (processor) is done, it could then go through the global scratch array and fill in incoming particles by updating the write pointer.

It should be pointed out that this dual pointer technique does not lend itself as well to vectorization as the "dirty-bit" approach, unless a set of pointers is used for each vector location. However, in our work we are concentrating on parallelizations across scalar processors.
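A minimal C sketch of the read/write pointer idea follows. The particle layout, the bounds arguments, and the scratch handling are assumptions for illustration (no synchronization on the scratch array is shown), and the position push is simplified to a plain drift.

    typedef struct { double x, y, vx, vy; } Particle;

    /* Dual-pointer partial sort: rd scans all old entries, wr marks the
     * next free slot for particles that stay in the local subgrid; leavers
     * go to a scratch area. Returns the new local count; *nscratch gets the
     * number of exiting particles (scratch is assumed large enough).      */
    int push_and_partial_sort(Particle *local, int nlocal,
                              Particle *scratch, int *nscratch, double dt,
                              double xmin, double xmax,
                              double ymin, double ymax)
    {
        int wr = 0, ns = 0;

        for (int rd = 0; rd < nlocal; rd++) {
            Particle p = local[rd];
            p.x += p.vx * dt;               /* simplified position push     */
            p.y += p.vy * dt;

            if (p.x >= xmin && p.x < xmax && p.y >= ymin && p.y < ymax)
                local[wr++] = p;            /* stays: both pointers advance */
            else
                scratch[ns++] = p;          /* leaves: only rd advances     */
        }
        *nscratch = ns;
        return wr;   /* wr doubles as a per-thread load-balance indicator  */
    }

After this pass, a thread would scan the (global) scratch arrays and append any particles whose new location falls inside its own subgrid at position wr, which is the write-pointer fill-in described above.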
Load balancing information

With this dual pointer scheme, the write pointers automatically tell you how load balanced the computation is after each time-step. If some thread (processor) suddenly gets very many particles, or hardly any, a flag could be raised to initiate a repartitioning of the grid for load balancing purposes. This would also be useful in the extreme case where most of the particles end up in a small number of subgrids, causing memory problems for the local particle arrays.

4.5 Load Balancing

4.5.1 The UDD approach

Load-balancing ideas stemming from the author's work on fault-tolerant matrix algorithms [EUR89] can also be applied to load balancing particle codes. There, algorithmic fault-tolerant techniques were introduced for matrix algorithms that had been especially tailored for efficient multi-processing on hypercubes. The hypercube algorithms were based on an interconnection scheme that involved two orthogonal sets of binary trees. By focusing on redistributing the load for each processor to minimize the effect on the remaining partial orthogonal trees, low communication overhead was maintained.

The optimum re-distribution of particles should be similar to that shown for the UDD (Uniform Data Distribution) approach for matrices [EUR89], i.e. a uniform distribution of the particles over the currently available processors (assuming a homogeneous system). Elster et al. analyzed several re-distribution techniques, of which the Row/Column UDD [Uya86, UR85a, UR85b, UR88] method proved to be the most interesting. First a column-wise UDD was performed on the row with the faulty processor. This involved distributing the data points of the faulty processor equally among the other processors residing in the same row as the healthy ones by rippling the load from processor to processor, so that only near-neighbor communication was needed. Then, by shrinking the y-direction (height) of the sub-matrices on the remaining processors in the row of the faulty processor, while increasing the height of the sub-matrices on the remaining processors correspondingly (row-wise UDD), load balancing was achieved with both the near-neighbor communication pattern and the orthogonal tree patterns upheld.
Figure 4.2: Inhomogeneous particle distribution.

Figure 4.3: X-profile for the particle distribution shown in Figure 4.2.

In our particle sorting setting, similar ideas might prove useful in re-distributing
the grid when opting for load balancing.

4.5.2 Load balancing using the particle density function

One way to do the partitioning is to maintain a watch on the particle density functions of the x and y directions. One would in this case periodically calculate the density "profile" of the system. For example, if the grid had the distribution shown in Figure 4.2, it would give an x-profile as shown in Figure 4.3. The x-profile could then be used to partition the grid in the x-direction (see Figure 4.4). The same could then be done for the y-direction, giving uneven rectangular sub-grids. This is reminiscent of adaptive mesh refinement, so there are surely ideas to be used from that area. The scheme is also similar to the one recently proposed by Liewer et al. [LLDD90] for a 1D code. They use an approximate density function to avoid having to broadcast the particle density for all grid points.

Figure 4.4: New grid distribution due to the x-profile in Figure 4.3.
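As an illustration of the profile idea (a sketch under an assumed unit grid spacing, not the thesis implementation), the following C routine builds the x-profile as a per-column particle count and then places cut points so that each of nproc strips receives roughly np/nproc particles.

    #include <stdlib.h>

    /* Build an x-profile (particles per grid column) and choose strip
     * boundaries so each of the nproc strips holds roughly np/nproc
     * particles. cuts must have room for nproc+1 entries.                 */
    void xprofile_partition(const double *px, int np, int nx, int nproc,
                            int *cuts)
    {
        int *profile = calloc((size_t)nx, sizeof(int));

        for (int i = 0; i < np; i++) {
            int col = (int)px[i];           /* assumes unit grid spacing    */
            if (col >= 0 && col < nx)
                profile[col]++;
        }

        /* Walk the profile, cutting each time the running sum passes the
         * next multiple of the target strip population.                    */
        int target = np / nproc, sum = 0, p = 1;
        cuts[0] = 0;
        for (int col = 0; col < nx && p < nproc; col++) {
            sum += profile[col];
            if (sum >= p * target)
                cuts[p++] = col + 1;
        }
        while (p <= nproc)
            cuts[p++] = nx;                 /* close any remaining strips   */

        free(profile);
    }

Applying the same routine to the y-coordinates within each strip would give the uneven rectangular sub-grids described above.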
4.5.3 Load and distance

Techniques based purely on particle load (as outlined in the previous section) would only achieve load balance for homogeneous systems. Many parallel systems, including the KSR, have nodes with uneven loads due to time-sharing, the presence of I/O processors on a subset of the processors, etc.

One idea we are considering is to use run-time information to aid us in achieving load balance. By incorporating into the code the use of tables maintaining "load" (how busy the individual processing nodes are) and "distance" (how far away the other nodes are with respect to communication time), one could make the implementations more generally suitable, not only for the KSR, but also for distributed system environments such as a network of Suns.

Thoughts to consider when searching for the right approach to particle sorting and grid re-assignment in this context are what input distributions and run-time distributions can commonly be expected.
If the input data is uniform, and one can expect it to remain so during the computations, then it is reasonable to assume a static allocation of sub-grids. However, if the system has one limited area of particles migrating around the system, it is reasonable to consider a dynamic grid approach more seriously.

Either way, "load" and "distance" information could then be used to determine particle/grid partitionings and when to sort (update the partitioning). How one actually sorts (re-arranges partitionings) will depend on: 1) the network topology, which indicates which processors are neighbors and hence dictates how to minimize communication, and 2) the memory model, which affects how message passing and caching occur.

Granted, the KSR tries to hide these two dependencies with the aid of its clever operating system, but if these two points are ignored, one is still likely to end up with an inefficient implementation. Communication overhead is, however, substantial for distributed workstations, so tailoring the implementations to minimize it now becomes crucial.
4.6 Particle sorting and inhomogeneous problems

Azari and Lee [AL91] addressed the problem that results when several particles end up in one processor by assigning each part of the grid to a group of processors (hybrid partitioning). This works reasonably well for fairly homogeneous problems, but would take a serious performance hit for strongly inhomogeneous problems where most of the "action" takes place in a small region of the system (one processor "group"). Unfortunately, there are several such cases in plasma physics, so developing algorithms to handle these inhomogeneous cases is definitely worth investigating.
4.6.1 Dynamic Partitionings

The way we see the problem of inhomogeneous problems being solved is to use both a dynamic grid and a dynamic particle partitioning (well, the particle partitioning is basically implied), i.e. what Walker refers to as an adaptive Eulerian decomposition. The number of grid elements per processor should here reflect the concentration of particles; i.e., if a processor does computations in an area with a lot of particles, it would operate on a smaller grid region, and vice versa. Grid quantities would then need to be dynamically redistributed at run-time as particles congregate in various areas of the grid.

On page 49 of his thesis, Azari [Aza92] indeed mentions re-partitioning of the grid as a possible attempt at load balancing. To quote him:

    One possible attempt for load balancing could be to re-partition the grid
    space using a different method such as bi-partitioning. These methods
    have been designed for non-uniform particle distribution on the grid.
    However, the grid partitioning unbalanced the grid-related calculations
    since the number of grid points in each subgrid would be different. Also,
    the re-partitioning task itself is a new overhead.

The overheads will be further analyzed in this thesis. Since we are using an FFT solver, the grid will need to be re-partitioned regularly regardless of the particle pusher in order to take advantage of parallelism in the solver.

4.6.2 Communication patterns
The communication cost for the transpose associated with a distributed 2D FFT should be similar to that of going from a row distribution to a dynamic sub-grid (for the particle phase). Another argument for doing a re-distribution of the grid is that the row or column distribution used by the FFT is not as suitable for the particle stage, since one here would generally prefer square subgrids on each processor.

If, however, for some reason a more block-column or block-row partitioning is desirable also for the other stages, then the FFT solver's order of the 1D-FFTs should match this partitioning in order to avoid an extra transpose. Azari-Bojanczyk-Lee [ABL88] and Johnsson-Ho [JH87] have investigated matrix transpositions for meshes and hypercubes, respectively. We will get back to this idea in the next chapter.

4.6.3 N-body/Multipole Ideas

The author has also considered some parallel multipole/N-body ideas [ZJ89, BCLL92].
Multipole methods use interesting tree-structured approaches such as the Barnes-Hut and ORB (Orthogonal Recursive Bisection) trees to subdivide particles. One idea would be to use a tree structure similar to that described by Barnes and Hut [BH19] to organize the particles for each "sort".

Barnes-Hut tree

The BH (Barnes-Hut) tree organizes the particles by mapping them onto a binary tree, quad-tree (max. 4 children per node), or oct-tree (max. 8 children per node) for 1-D, 2-D or 3-D spaces, respectively. Considering particles distributed on a 2-D plane (with more than one particle present), its quad-tree is generated by partitioning the space into 4 equal boxes. Each box is then partitioned again until only one particle remains per box. The root depicts the top-level box (the whole space), each internal node represents a cell, and the leaves are particles (or empty if the cell has no particle).

The BH algorithm traverses the tree for each particle, approximating whole subtrees of particles for boxes containing particles sufficiently far away from the present particle during force calculations. (Since it is the data structure I am interested in, I will not go into the detail of what the calculations actually estimate physically.)
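As a data-structure sketch only (the node layout, the domain handling, and the assumption of distinct particle positions are choices made for illustration), the following C code builds the 2-D quad-tree version of the BH structure by recursive insertion.

    #include <stdlib.h>

    /* Minimal BH-style quad-tree: each node covers a square box; a leaf
     * holds at most one particle. Particle positions assumed distinct.    */
    typedef struct QNode {
        double cx, cy, half;        /* box center and half-width           */
        struct QNode *child[4];     /* sub-boxes, NULL if not yet created  */
        int has_particle;           /* non-zero if this leaf holds one     */
        double px, py;
    } QNode;

    QNode *qnode_new(double cx, double cy, double half)
    {
        QNode *n = calloc(1, sizeof(QNode));
        n->cx = cx; n->cy = cy; n->half = half;
        return n;
    }

    static int quadrant(const QNode *n, double x, double y)
    {
        return (x >= n->cx) + 2 * (y < n->cy);          /* 0..3            */
    }

    static QNode *get_child(QNode *n, int q)
    {
        if (!n->child[q]) {
            double h = n->half / 2.0;
            n->child[q] = qnode_new(n->cx + ((q % 2) ? h : -h),
                                    n->cy + ((q < 2) ? h : -h), h);
        }
        return n->child[q];
    }

    /* Insert one particle, splitting occupied leaves so that each box ends
     * up holding at most one particle.                                     */
    void qtree_insert(QNode *n, double x, double y)
    {
        int is_leaf = !n->child[0] && !n->child[1] &&
                      !n->child[2] && !n->child[3];

        if (is_leaf && !n->has_particle) {              /* empty leaf       */
            n->has_particle = 1; n->px = x; n->py = y;
            return;
        }
        if (is_leaf && n->has_particle) {               /* occupied: split  */
            n->has_particle = 0;
            qtree_insert(get_child(n, quadrant(n, n->px, n->py)),
                         n->px, n->py);
        }
        qtree_insert(get_child(n, quadrant(n, x, y)), x, y);
    }

Assigning each processor a set of subtrees of such a tree is the "first idea" discussed in the next paragraphs; the neighbor-awareness it lacks would have to be added on top.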
Tree structures and particle sorting

How does this relate to particle sorting for PIC codes? Since particles invariably wander off in different directions, particles that were local to the grid cells (bound to processor nodes) at the start of the simulations are, after a while, no longer near their origins. This causes a lot of communication traffic for distributed memory machines. One approach to overcome this is to sort the particles to the processors containing their current (and possibly neighboring) cells.

The first idea is to map the new particle locations to the BH tree and then assign the nodes according to the sub-tree distribution this gives. This will yield a nicely balanced distribution of particles, but since the tree structure itself does not account for cell-cell interactions (neighboring cells), a more clever approach is required for PIC codes since they are heavily "neighbor-oriented".
The Fast Multipole Method (FMM) uses a similar recursive decomposition of the computational space into a tree structure. Unlike the BH method, which only involves particle-particle and particle-cell interactions, the FMM also allows cell-cell interactions. How it goes about the actual computations and the different types of interactions is fairly complex, but not that interesting for our purposes.

In order to make use of the structures these methods present, the author believes that a hybrid BH-FMM approach could prove the most useful. Since we only care about neighboring cells, it should be possible to simplify the calculations for cell-cell interactions.
4.7 The Field Solver

Noticing that different implementations use different solvers, it is useful to see what impact choosing an FFT solver would have on the particle-pushing stages of our code, compared to other solvers such as multigrid.

It is also worth noting that an FFT solver and periodic boundary conditions would generally not be used for very inhomogeneous systems. Otani [Ota] has pointed out, however, that it is possible for very large waves to exist in an otherwise homogeneous system which, in turn, lead to significant bunching of the particles. It is predicted that one would not have less than, say, half of the processors handling most of the particles.

4.7.1 Processor utilization

However, using only 50% of the processors effectively on a parallel system is indeed a significant performance degradation. It is "only half" of the parallel system, but with respect to the serial speed, which is the important measure, it is very significant, especially for highly parallel systems. For example, on a 128-processor system, 50% utilization/efficiency gives us a 64-times speed-up (max. theoretical limit) versus a 128-times speedup for a fully used system; i.e., there is a loss of resources with respect to a single node of a factor of 64!
4.7.2 Non-uniform grid issues

Ramesh [Ram] has pointed out that keeping a non-uniform grid in order to load-balance the particle push stage may lead to problems with the field solver.

The grid could alternatively be partitioned uniformly (equal grid spacing) on a system-wide basis. The number of grid points that get stored by each processor would then vary according to the non-uniform distribution. Ramesh's point, however, does still come into play in that we now have a solver that will have an uneven load across the processors, a performance hit, also pointed out by Azari and Lee, that will have to be considered.
Alternatively, a solver for non-uniform meshes could be used. It may, however, lead to another "can of worms" with respect to accuracy, etc.

4.7.3 FFT Solvers

A parallel FFT is considered quite communication-intensive, and because of the communication structure, it is best suited for hypercubes, which have a high degree of communication links. Doing a 2D FFT usually also implies doing a transpose of the rows/columns. G. Fox et al. [FJL+88] point out that for this approach the overhead is functionally the same as the more direct 2D approach, but by performing all the communication at once (transpose), some start-up message overhead is saved.
Claire Chu, a student of Van Loan, wrote a PhD thesis on parallel FFTs for the hypercube. Sarka et al. have investigated parallel FFTs on shared memory systems. Interestingly enough, the only reference I seem to have encountered for parallel FFTs on array processors is Maria Guiterrez's, a student of my MS advisor, who did some image processing work on the MPP (a bit-serial array processor).

Van Loan's book [Loa92] includes both Claire Chu's work on the hypercube and a section on FFTs for shared memory systems. All the algorithms in the book are written in a block-matrix language. As mentioned in Section 2, G. Fox et al. also cover parallel FFTs. Both point out the use of a transpose.
Most distributed parallel algorithms involving tree structures, FFTs or other highly connected communication topologies assume a hypercube interconnection network (it fits so perfectly for the FFT!). In addition, most distributed memory and shared memory approaches found in numerical texts tend to look at algorithms with only one data point per processor.

In our case, this is not a good model. For instance, if we were to use 16 processors, this would only leave us a 4-by-4 grid. Besides, having only 4 grid points per processor (assuming the neighboring column gets copied) seems very inefficient for the particle phase, and certainly does not allow for much dynamic grid allocation.

Consequently, it would be a lot more reasonable to assume that we have, say, an n-by-n grid mapped onto p processing elements (PEs) with n = O(p). In this case, if n = p, we will have p 1D-FFTs to solve in each direction on p PEs. The question is, of course, how to redistribute the grid elements when going from one dimension to the other, and what cost is involved.
Figure 4.5: Basic topologies: (a) 2D mesh (grid), (b) ring.
Let us first consider the standard grid and ring topologies shown in Figure 4.5. Assume that both the processor array and the field grid dimensions are a power of 2 (simplifying the FFTs). For the first FFT, it is clear that it would be most advantageous to have one (or more) columns (or rows) per processor. Assuming only one column per processor, it would then take O(N log N) execution time to perform an N-point FFT. One would then need to "transpose" the grid entries in order to perform the 1D-FFTs in the other direction. Hopefully, such packages could be obtained from the vendor, since this is obviously a time-consuming task given all the communication involved.
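Since the transpose is the communication-heavy step, it is worth being concrete about the local part of it. The following is a simple blocked in-place transpose sketch for one node's square n-by-n block (the block size B is an assumed tuning parameter); the distributed version discussed here would additionally exchange blocks between processing nodes.

    #define B 16   /* assumed blocking factor; 16 doubles = one KSR cache line */

    /* Cache-blocked in-place transpose of a square n-by-n array (row-major). */
    void transpose_square(double *a, int n)
    {
        for (int ib = 0; ib < n; ib += B)
            for (int jb = ib; jb < n; jb += B)
                for (int i = ib; i < ib + B && i < n; i++) {
                    int jstart = (ib == jb) ? i + 1 : jb;
                    for (int j = jstart; j < jb + B && j < n; j++) {
                        double tmp = a[i * n + j];
                        a[i * n + j] = a[j * n + i];
                        a[j * n + i] = tmp;
                    }
                }
    }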
It is not obvious how fast such an algorithm would be on the KSR given the underlying token-based communication structure. There is also the problem of contention from other ring cells that may be excluded from one's current processor set. The reason the latter is likely to happen is that the KSR has some processors with I/O attachments that significantly degrade their computational use, causing an unbalanced system. Given this, it is unlikely that one could get the full use of a 32-processor ring, the ideal for an FFT. If only 16 processors are used and one does not explicitly request exclusive access to the whole ring, other users' processes running on the remaining 16 nodes could cause extra communication traffic on the ring, affecting one's performance. Further details on the KSR's architecture are given in Chapter 6.

The "pool-of-tasks" approach that Van Loan mentions in his book may be a reasonable approach if one considers running on the full "unbalanced" 32-processor ring (or 64 or more). Notice here the potential contention between load-balancing the 1D-FFTs and the transpose.

Matching Grid Structures and Alternating FFTs
As mentioned in the introduction, the local memory on the KSR = (128 sets) x (16-way associativity) x (16 KB page size) = 32 MB. In fact, all physical memory comes in powers of two. On the KSR, local memory thrashing hence occurs when a processor repeatedly references more than 16 addresses with a stride of 128 pages (or 32 addresses with a stride of 64 pages, or 64 addresses with a stride of 32 pages). One should hence take care to ensure that strides are a non-power-of-two multiple of the page size (16 KB).

Notice how this conflicts with the implementation of 2D FFTs, which tend to operate on arrays that indeed are powers of two. It is for this reason, and the fact that one usually can make use of fast unit-stride 1-D FFT routines provided by the manufacturer (often hand-coded in assembler), that 2D FFT implementations often involve an actual transpose.
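Independent of the transpose, the stride problem itself can also be dodged by padding. The sketch below assumes that one extra column of padding per row is acceptable and that the grid is stored row-major; the padding amount is an illustrative choice, not a KSR-specific requirement.

    #include <stdlib.h>

    #define PAD 1   /* assumed padding; one extra element per row often suffices */

    /* Allocate an n-by-n grid of doubles with a padded leading dimension so
     * that successive elements of a column are not a power-of-two apart.   */
    double *alloc_padded_grid(int n, int *ld_out)
    {
        int ld = n + PAD;
        *ld_out = ld;
        return malloc((size_t)n * ld * sizeof(double));
    }

    /* Address of logical element (i, j), row-major with padded rows. */
    double *grid_elem(double *grid, int ld, int i, int j)
    {
        return &grid[(size_t)i * ld + j];
    }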
During our 2-D FFT field solver, there is a reordering (transposition) of the grid between each set of 1-D FFT calls in order to be able to use contiguous vectors for the 1-D FFT computations. This reordering of the grid can be quite costly if the grid is fine and if there are, on the average, only a few particles per cell. If we keep the particles in block-vector grids during the particle push, both re-ordering steps can be saved, and we hence get the following:

1. Column-wise FFT
2. Transpose
3. Row-wise FFT
4. Row-wise inverse FFT
5. Transpose
6. Column-wise inverse FFT
7. Particle push

Notice that this block-column grid partitioning conflicts with the optimum square-shaped grid partitioning for uniform problems.

4.7.4 Multigrid

Noting how similar our dynamic grid-partitioning approach for the particle-grid/grid-particle steps is to that of the adaptive grid schemes one finds in multigrid (MG) methods, one could ask: would it be reasonable to consider using an actual parallel MG method for the field solver? Good multigrid references include [Bri87, McC89]. The problem, however, seems to be that these methods are still not fully developed for parallel systems. The examples in the references focus on problems with Neumann and Dirichlet boundary conditions.
4.8 Input Effects: Electromagnetic Considerations

For electromagnetic codes, the magnetic field lines will, for certain criteria, tend to cause particles to move in a direction primarily along the field. In this case, one would indeed like the subpartitions to be aligned as flat rectangles along this direction.

Electrons tend to be tied to the field lines and move huge distances along the magnetic field, but have difficulty moving perpendicular to it. Ions behave a lot like the electrons, but since they have more mass, they tend to make larger excursions across the fields. If the frequency regime is high, one will only see a part of the ions spiraling around the magnetic field B, and the ions behave as if they were unmagnetized. If the frequency regime is way above the cyclotron frequency (how many times a particle can go around a magnetic field line per second), one may want to model motion where the electrons are unmagnetized. For these unmagnetized cases, there is no preferred general direction of the particles, so square partitionings would be preferable.
4.9 Hierarchical Memory Data Structures: Cell Caching

Grids are typically stored either column or row-wise. However, on a system with memory hierarchies, especially systems with caches, we shall show that these are not necessarily the best storage schemes for PIC codes.

When the particles' charge contributions are collected back on the grid points, each particle will be accessing the four grid points of its cell (assuming a quadrangular grid). If the local grid size exceeds a cache line (which is typically the case, since cache lines tend to be small: 16 words on the KSR1), and column or row storage of the grid is used, each particle will need at least two cache lines to access the four grid points it is contributing to.

Since a significant overhead is paid for each cache hit, we hence propose that
one instead stores the grid according to a cell caching scheme. This means that instead of storing the grid row or column-wise during the particle phase, one should store the grid points in a 1-D array according to little subgrids that may fit into one cache line. On the KSR1, where the cache line is 16 words, this means that the grid is stored as a sequence of either 8-by-2 or 4-by-4 subgrids. The latter would minimize the border effects. However, even the 8-by-2 case shows a 22% reduction in the number of cache-line accesses. Further analysis of this scheme will be presented in Chapter 5.

Row storage of a 2-D m-by-n array:

    a00 a01 ... a0n a10 ... amn

4-by-4 cell-caching storage:

    a00 ... a03 a10 ... a13 ... a04 ... a07 a14 ... a17 ... amn

Figure 4.6: Row storage versus cell caching storage.

In order for the cell-caching scheme to be effective, the grid needs to be subpage (cache-line) aligned. The KSR automatically does this when using malloc in C.

Notice that this cell-caching technique is really a uni-processor technique. Therefore, both parallel and serial cell codes for any system with a cache would benefit from using this alternative block storage. It would, however, affect the alternating FFT approach we will be introducing, since cell-caching storage would not preserve the column/row storage in the FFTs.
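The mapping from grid coordinates to the cell-cached 1-D layout can be written down directly. The sketch below assumes 4-by-4 subgrids, grid dimensions divisible by 4, and blocks laid out row by row; it is meant to illustrate the idea rather than reproduce the exact layout benchmarked in Chapter 6.

    /* Conventional row-major index, for comparison.                        */
    int row_major_index(int i, int j, int nx)
    {
        return i * nx + j;
    }

    /* Cell-cached index: the grid is stored as consecutive 4-by-4 subgrids
     * (16 words each, i.e. one KSR1 cache line), so the four corners of a
     * particle's cell usually fall within a single cache line.             */
    int cell_cached_index(int i, int j, int nx)
    {
        int bi = i / 4, bj = j / 4;       /* which 4-by-4 block             */
        int li = i % 4, lj = j % 4;       /* offset inside the block        */
        int blocks_per_row = nx / 4;      /* assumes nx is a multiple of 4  */
        return (bi * blocks_per_row + bj) * 16 + li * 4 + lj;
    }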
Chapter 5

Algorithmic Complexity and Performance Analyses

"Life is short, the Art long, opportunity fleeting, experience treacherous, judgment difficult." - Hippocrates [ca. 460-357 B.C.], Aphorisms. Proverbial: Ars longa, vita brevis.

5.1 Introduction
In order to get a better understanding of how various parallelization approaches would affect performance, this chapter includes a complexity analysis of the approaches we think are the most reasonable. These analyses consider both computational requirements and memory traffic. By combining the complexity results with the timing results from serial codes and known parallel benchmarks, one can estimate how effective the parallelizations will be on a given parallel system, given certain chosen problem parameters such as grid size and the number of simulation particles used. A fine-tuning of these results is possible after fully implementing and testing the chosen parallel algorithm(s) on a chosen architecture. Our test-bed was the KSR1 (see Chapter 6).
5.2 Model

When moving from sequential computer systems to parallel systems with distributed memory, data locality becomes a major issue for most applications. Since intermediate results and data need to be shared among the processing elements, care must be taken so that this process does not take an inordinate amount of extra run-time and memory space. After all, the purpose of parallelizing the codes is so they can take advantage of the combined speed and memory size of parallel systems.
Typically, communication overhead is modeled as follows (see Hockney and Jessup [HJ81]):

    t_comm = α + β·N,

where t_comm is the communication time for an N-vector. Here α is the start-up time and β a parameter describing the bandwidth of the system.

If one then assumes that the computations do not overlap with communication, the total time a given parallel approach would take (assuming the process does not get swapped out by the operating system at run-time) would then be:

    T_total = t_comp + t_comm,

where t_comp is the time the algorithm spends on computations and t_comm is the communication overhead described above.

Modern parallel systems with distributed memory, starting with the Intel iPSC/2, however, have independent I/O processors that let the users overlap the time spent sending data between processors, i.e. communication, with computation time. How efficient the implementations are then also becomes related to how well the program can request data in advance while doing large chunks of computations. How much the application programmer can take advantage of this overlap feature is hence also strongly application-dependent.
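As a worked example with purely illustrative numbers (not measured KSR parameters): if α = 100 µs and β = 0.1 µs per word, then sending one 1000-word vector costs t_comm = 100 + 0.1·1000 = 200 µs, whereas sending the same data as ten 100-word messages costs 10·(100 + 0.1·100) = 1100 µs. This start-up-dominated behavior is one reason batching communication, such as doing the 2D FFT's data exchange as a single transpose (Chapter 4), pays off.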
On hierarchical memory systems, t_comm will be a function of which levels of the memory hierarchy are accessed for the requested data. For simplicity we will consider a model with only two hierarchical parameters: t_lm, which denotes the time associated with local memory accesses, and t_gm, the time associated with global memory:

    t_comm = t_lm(N) + t_gm(M).

It should be noted that t_lm covers both items within a cache line and items within a local subcache. For a vector of N local data elements, t_lm(N) is hence not a simple constant, but rather a function of whether the individual data elements are 1) within a cache line, 2) within cache lines in the cache, and 3) in local memory. Similarly, t_gm is a function of whether the M data elements of a vector accessed are all within 1) a communication packet (equal to a cache line on the KSR), 2) some distant local memory, or 3) some external storage device such as disk. Since access to external memory devices typically is several orders of magnitude slower than access to system memory, we will assume that this is so undesirable during computation that the problems in this case are tailored to fit within the system memory.
Notice that serial computer systems with local caches also involve t_comm = t_lm terms of the same complexity.

5.2.1 KSR Specifics

The KSR1 also has other hardware features that hamper the accuracy of the above distributed memory model. Among these features are its hierarchical token ring network and the local sub-caches associated with each processor.

Each KSR processor cell includes a 0.5 MByte sub-cache, half of which is dedicated to instructions, the other half to data. The performance of the system will hence also depend on how efficiently the local sub-cache gets utilized. Similarly, the performance of distributed memory systems with vector processors will depend on how efficiently the vector units at each processing node are utilized.

The KSR1's interconnection network consists of a ring of rings. At Cornell, we currently have a 128-node KSR1 system made up of four 32-node rings that
are connected to a top-level communication ring. The "message passing" (actually hidden due to the shared memory programming environment) is further limited in that there may be no more than approximately 14 messages (tokens) on one ring at any given time; i.e., a maximum of 14 messages can fly among 32 processors simultaneously. This is a more severe limit than what is normally seen on distributed machines with I/O chips (e.g. standard rings, meshes and hypercubes). One normally expects a minimum of N messages (near-neighbor) to be handled simultaneously by an N-processor ring. The KSR does, however, have the advantage that the communication bandwidth of its ring is significantly higher than what one sees on typical ring architectures. Also, since the ring is unidirectional and each message-header has to make it back to the sender, there is virtually no difference in communication with the nearest neighbor versus communication with a node connected at the opposite end of the ring. However, major communication delays are experienced when having to access data located on other rings.

The calculation of the expected performance of each node is also complicated by the fact that 3 processors in each KSR ring have I/O devices attached, making these nodes slower than the others. When running an application on all 32 ring nodes, this must hence be considered. KSR fortunately provides an environment variable that lets users avoid these special cells when performing benchmarks.
When analyzing various approaches to our problem, we shall start with the simplest model, i.e., one which assumes uniform processors and no communication overlap. Refinements will then be made as we build up the model and incorporate feedback from our test runs. Our prototype implementations benchmarked in Chapter 6 did not take advantage of pre-fetching features.
5.2.2 Model parameters

The following parameters will be used in our model. The "xx" below is replaced by one of the following:

- part-push-v: Update of particle velocities (includes the field gather).
- part-push-x: Update of particle positions.
- scatter: Calculation of each particle's contribution to the charge density based on its fixed simulation charge and location.
- fft-solver: Field solver using a 2D FFT. May be implemented using 1D FFTs along each of the two simulation axes.
- transpose: Reordering of grid data points from being stored among the PEs as block-columns to block-rows, or vice versa. In the case of 1 vector per PE, this is a straightforward vector transpose.
- grid-sum: Add up local grid copies into a global grid sum.
- border-sum: Add temporary border elements to the global grid.
- part-sort: Procedure for relocating particles that have left the portion of the grid corresponding to their current node. This involves transfer of the v and x data for these particles to the PEs that hold their new local grid.
- find-new-ptcls: Search through the out/scratch arrays for incoming particles.
- loc-ptcls: Accessing local arrays of particle quantities.

The parameters are:

- T: Total time. Describes times which involve both computation time and communication time (where needed). The communication time is considered to be the time spent exchanging needed data among processing units.
- T_xx: Total time for the specific algorithm/method "xx".
- t_comp-xx: Computation time of algorithm "xx".
- t_comm-xx: Communication time of algorithm "xx".
- α, β: Start-up time and bandwidth parameter, respectively, as defined above.
- t_lm(N): Local memory communication time involved in accessing a vector of length N stored on a local node (see Section 5.1).
- t_gm(N): Global memory communication time involved in accessing a vector of length N stored on a parallel system (see Section 5.1). Includes t_lm(N) for the specified vector.
- P: Number of processing nodes available.
- Px: Number of processing nodes in the x-direction when using a sub-grid partitioning.
- Py: Number of processing nodes in the y-direction when using a sub-grid partitioning (P = Px*Py).
- N: Generic integer for algorithmic complexity; e.g. O(N) implies a linear-time algorithm.
- Np: Number of super-particles in the simulation.
- maxlocNp: Maximum number of particles on any given processor.
- Npmoved: Total number of particles that moved within a time-step.
- Nx: Number of grid points in the x-direction (rectangular domain).
- Ny: Number of grid points in the y-direction (rectangular domain).
- Ng: Total number of grid points in the x- and y-directions; Ng = Nx*Ny.
- c_calc: Calculation constant characterizing the speed of the local node.
5.2.3 Result Summary

Table 5.1 shows a summary of the performance complexity figures for the three approaches analyzed. These approaches are 1) serial PIC on a system with a local cache, 2) a parallel algorithm using a fixed particle partitioning and a replicated grid with a parallel sum, and 3) a parallel algorithm using a fixed grid partitioning that automatically partially sorts the local dynamic particle arrays.

From this table one can see that, given that one is limited in how much fast local cache memory one can have per processor, the serial code not only suffers from having only one CPU, but also suffers when the data sets get large and no longer fit in cache (the t_lm arguments are large). One can also see that the replicated grid approach will suffer from the t_grid-sum computation and communication overhead for large grids (Ng) on a large number of processors (P). The grid partitioning approach will clearly be hampered for load-imbalanced systems with maxlocNp >> Np/P for a good portion of the time-steps. Chapter 4 discussed ways to compensate for this imbalance by re-partitioning the grid (and re-sorting the particles).

The following sections describe how we arrived at the equations in Table 5.1 and discuss further issues associated with each approach.
5.3 Serial PIC Performance

Following is an analysis of the serial algorithm. Since it is assumed to run on only one node, "communication" time will here be assumed limited to local memory accesses.

In large simulations that use more than the 32 MB local memory, this will not be the case. The shared memory programming environment will then store some of the data on other nodes. This can be considered equivalent to swapping data to disk on smaller serial systems, except that the KSR uses fast RAM on a high-speed ring instead of slow I/O devices.
Table 5.1: Performance Complexity - PIC Algorithms

                        Serial                  Particle Partitioning        Grid Partitioning
                        w/ cache                (replicated grids            (with automatic
                                                with parallel sum)           partial particle sort)

Gather field at         t_comp = O(Np)          t_comp = O(Np/P)             t_comp = O(maxlocNp)
particles and update    t_comm = O(t_lm(Ng))    t_comm = O(t_gm(Ng))         t_comm = O(t_lm(Ng/P))
velocities                + O(t_lm(Np))           + O(t_lm(Np/P))              + O(t_lm(maxlocNp))

Update particle         t_comp = O(Np)          t_comp = O(Np/P)             t_comp = O(maxlocNp)
positions               t_comm = O(t_lm(Np))    t_comm = O(t_lm(Np/P))       t_comm = O(t_lm(maxlocNp))
                                                                               + O(t_gm(Npmoved))

Scatter particle        t_comp = O(Np)          t_comp = O(Np/P)             t_comp = O(maxlocNp)
charges to grid         t_comm = O(t_lm(Np))      + O(Ng log P)                + O(max(Nx/Px, Ny/Py))
(charge densities)                              t_comm = O(t_lm(Np/P))       t_comm = O(t_lm(maxlocNp))
                                                  + O(t_gm(Ng log P))          + O(t_gm(max(Nx/Px, Ny/Py)))

Need to re-arrange      Not if starting in      Not if doing a series        Not if the grid is partitioned
grid for FFT?           the same dimension;     of grid-sums on              in only one dimension;
                        otherwise               sub-grids; otherwise         otherwise
                        t_comm = O(t_lm(Ng))    t_comm = O(t_gm(Ng))         t_comm = O(t_gm(Ng))

2D FFT solver           t_comp = O(Ng log Ng)   t_comp = O((Ng/P) log Ng)    t_comp = O((Ng/P) log Ng)
                        t_comm = O(t_lm(Ng))    t_comm = O(t_lm(Ng))         t_comm = O(t_lm(Ng))

Field-grid              t_comp = O(Ng)          t_comp = O(Ng/P)             t_comp = O(Ng/P)
calculations            t_comm = O(t_lm(Ng))    t_comm = O(t_lm(Ng/P))       t_comm = O(t_lm(Ng/P))
(finite differences)
5.3.1 Particle Updates - Velocities

The particle routine for pushing the velocities gathers the fields at each particle and then updates their velocities using this field. The field gather is a bilinear interpolation that consists of 7 additions and 8 multiplications. The number of additions can be reduced by 2 by using temporary variables for the weighting sums (hx - a) and (hy - b). The velocity update v = v + [q*E_part*(1/m)*Δt - (drag*v*Δt)] can similarly be optimized to use only 2 multiplications and 2 additions for each dimension. We hence have:

    T_serial-part-push-v = t_comp-gather + t_comp-part-push-v        (5.1)
                           + t_comm-gather + t_comm-part-push-v      (5.2)
                         = Np*c1_calc + Np*c2_calc                   (5.3)
                           + t_lm(Ng) + t_lm(Np)                     (5.4)
                         = O(Np) + O(t_lm(Ng) + t_lm(Np)),           (5.5)

where c1_calc and c2_calc are the times spent doing the multiplications and additions associated with the gather and the velocity updates, respectively.

Notice that if the particles are not sorted with respect to their grid location, t_lm(Ng) would imply a lot of local cache hits (assuming the grid is too large to fit in cache). If the grid is stored either row or column-wise, each gather will also imply a read from two different grid cache lines when the rows or columns exceed the cache-line size, unless a reordering scheme such as cell caching is done. Cell caching was introduced in Chapter 4, and its complexity is discussed later in this chapter.
The<strong>particle</strong>rout<strong>in</strong>eforpush<strong>in</strong>gthelocationsupdateseachlocationwiththefollow<strong>in</strong>g2Dmultiply<strong>and</strong>add:x=x+vt,aga<strong>in</strong>giv<strong>in</strong>g:<br />
Tserial?part?push?x=tcomp?part?push?x+tcomm?part?push?x<br />
chapter. 5.3.2ParticleUpdates{Positions (5.6)
=Npccalc+tlm(Np) 83<br />
5.3.3Calculat<strong>in</strong>gtheParticles'Contributiontothe whereccalcisthetimespentdo<strong>in</strong>gtheabovemultiplications<strong>and</strong>additions. =O(Np)+O(tlm(Np)) (5.7)<br />
ChargeDensity(Scatter) (5.8)<br />
Hereeach<strong>particle</strong>'schargegetsscatteredtothe4gridcorners.Thisoperation requires8additions<strong>and</strong>8multiplications,where,as<strong>in</strong>thegathercase,2additions canbesavedbyus<strong>in</strong>gtemporaryvariablesfortheweigh<strong>in</strong>gsums(hx?a)<strong>and</strong> (hy?b).Wehencehave:<br />
whereccalcisthetimespentdo<strong>in</strong>gtheabovemultiplications<strong>and</strong>additions. Tserial?charge?gather=tcomp?charge?gather+tcomm?charge?gather(5.9)<br />
Likethegathercase,each<strong>particle</strong>islikelytocausetwocache-hitsregard<strong>in</strong>g =O(Np)+O(tlm(Np)) =Npccalc+tlm(Np) (5.11) (5.10)<br />
gridpo<strong>in</strong>tsunlessareorder<strong>in</strong>gschemesuchas<strong>cell</strong>-cach<strong>in</strong>gisemployed. 5.3.4FFT-solver 1DFFTSareknowntotakeonlyO(NlogN)computationtime.Ourcurrentserial <strong>codes</strong>olvesfortheeldaccord<strong>in</strong>gtothefollow<strong>in</strong>gFFT-basedalgorithm: Step1:Allocatetemporaryarrayforstor<strong>in</strong>gcomplexcolumnbeforecall<strong>in</strong>g Step2:Copymatrix<strong>in</strong>tocomplexarray.O(Ng)(memoryaccess) Step3a:CallcomplexFFTfunctionrow-wise.O(NglogNx) Step3b:CallcomplexFFTfunctioncolumn-wise. FFTrout<strong>in</strong>eprovidedbyNumericalRecipes.<br />
(NowhaveFFT(chargedensity),i.e.(kx,ky))O(NglogNy)
Step4:Scale<strong>in</strong>Fouriermode(i.e.dividebyk2=kx2+ky2<strong>and</strong>scaleby Step5a:CallcomplexFFTfunctiondo<strong>in</strong>g<strong>in</strong>verseFFTcolumn-wise.O(NglogNy) 1/epstoget(kx,ky)).O(Ng)additions<strong>and</strong>multiplications. 84<br />
Step7:Obta<strong>in</strong>nalresultbyscal<strong>in</strong>g(multiply<strong>in</strong>g)theresultby1/(NXNy). Step6:Transferresultbacktorealarray.O(Ng)(memoryaccess) Step5b:Callcomplex<strong>in</strong>verseFFTfunctionrow-wise.O(NglogNx)<br />
2,4,6<strong>and</strong>7)areallconsideredtobelocal<strong>and</strong>l<strong>in</strong>ear<strong>in</strong>number<strong>of</strong>gridpo<strong>in</strong>ts,i.e. theyrequireO(Ng)operationswithno<strong>in</strong>ter-processordatatransferstak<strong>in</strong>gplace Thememorycopy<strong>in</strong>g(<strong>in</strong>clud<strong>in</strong>gtranspos<strong>in</strong>g)<strong>and</strong>scal<strong>in</strong>goperations(Steps1, O(Ng)multiplications<br />
However,s<strong>in</strong>cewestronglyrecommendthatvendor-optimizedFFTrout<strong>in</strong>esbe (hopefully). notionthata2-DFFTgenerallytakesO(N2logN)onaserialcomputer. used(<strong>and</strong>these<strong>of</strong>tenuseproprietaryalgorithms),wewillsticktothesimplied Thememorycopy<strong>in</strong>gcouldbeavoidedbyimplement<strong>in</strong>gamoretailoredFFT.<br />
hencehave: aswellas5a<strong>and</strong>5b.Suchtranposesgenerallygeneratealot<strong>of</strong>cache-hits.We Intheabovealgorithmthereisanimplicittransposebetweensteps3a<strong>and</strong>3b Tserial?fft?solver=tcomp?fft?solver+tcomm?fft?solver =O(NglogNg)+O(tlm(Ng)); (5.12)<br />
5.3.5Field-GridCalculation FFTwillalsohavesomecache-hitsassociatedwithit. whereO(tlm(Ng))signiesthecache-hitsassociatedwiththetranspose.The2D (5.13)<br />
1-Dnitedierenceequation<strong>of</strong>thepotentials(onthegrid)calculatedbytheeld Theeld-gridcalculationdeterm<strong>in</strong>estheelectriceld<strong>in</strong>eachdirectionbyus<strong>in</strong>ga
solver.This<strong>in</strong>volves2additions<strong>and</strong>twomultiplicationsforeachgridpo<strong>in</strong>tfor eachdirection,giv<strong>in</strong>gthefollow<strong>in</strong>gcomputationtime: Tserial?field?grid=tcomp?field?grid+tcomm?field?grid 85<br />
Noticethats<strong>in</strong>ceNpgenerallyisanorder<strong>of</strong>magnitudeorsolargerthanNg, =O(Ng)+O(tlm(Ng)) =Ngccalc+tlm(Ng) (5.16) (5.15) (5.14)<br />
wecanexpectthisrout<strong>in</strong>etobefairly<strong>in</strong>signicantwithrespecttotherest<strong>of</strong>the computationtime.Thisisveried<strong>in</strong>Chapter6. 5.4ParallelPIC{FixedParticlePartition<strong>in</strong>g,<br />
when<strong>particle</strong>s<strong>in</strong>neighbor<strong>in</strong>g<strong>cell</strong>strytocontributetothesamechargedensity replicatethegridarray(s).Thisisdonetoavoidthewriteconictsthatmayoccur Asmentioned<strong>in</strong>Chapter4,oneway<strong>of</strong>paralleliz<strong>in</strong>gaPIC<strong>particle</strong>codeisto Replicatedgrids<br />
gridpo<strong>in</strong>ts. densityarray.Theothergrids(thepotential<strong>and</strong>theeldgrids)getcopied<strong>in</strong>on read-access,<strong>and</strong>s<strong>in</strong>cetheyareonlywrittentoonceforeachtime-step,wedonot havetoworryaboutwrite-conictscaus<strong>in</strong>gerroneousresults. Inthesharedmemorysett<strong>in</strong>g,weonlyneedtophysicallyreplicatethecharge<br />
The<strong>particle</strong>pushrout<strong>in</strong>esareidealc<strong>and</strong>idatesfor<strong>parallelization</strong>s<strong>in</strong>cetheyperform<strong>in</strong>dependentcalculationsoneach<strong>particle</strong>.S<strong>in</strong>ceweassumethesamenodes<br />
alwaysprocessthesame<strong>particle</strong>s,<strong>parallelization</strong>isachievedbyparalleliz<strong>in</strong>gthe global<strong>particle</strong>loop.Wehencehave tcomp?replicated?grid?push?v=(Np =O(Np P)ccalc P): (5.18)<br />
5.4.1ParticleUpdates{Velocities<br />
(5.17)
However,s<strong>in</strong>cethe<strong>particle</strong>stendtobecomer<strong>and</strong>omlydistributed<strong>in</strong>space(if theywerenotalready),<strong>and</strong>s<strong>in</strong>ceaprocess<strong>in</strong>gelementisalwaysresponsiblefor Noticethatthereisno<strong>in</strong>ter-processcommunicationassociatedwithwrites. 86<br />
updat<strong>in</strong>gthesamegroup<strong>of</strong><strong>particle</strong>s,regardless<strong>of</strong>the<strong>particle</strong>gridlocation,the readsassociatedwithgather<strong>in</strong>gtheeldatthe<strong>particle</strong>smostlikelywillcause cache-hitsforlargegridss<strong>in</strong>ceitisunlikelythewholeeldgridwillt<strong>in</strong>the subcache. copied<strong>in</strong>tolocalmemoryas<strong>particle</strong>sarescatteredalloverthesystem.Thiscan beviewedasanotherlevel<strong>of</strong>cachehit: Regardless,thereplicatedgridmethodislikelytocausetheentiregridtoget<br />
Assum<strong>in</strong>gthegridwasdistributedbythesolver<strong>and</strong>eld-gridphases<strong>in</strong>toblockcolumnsorblock-rows,thismeansthatwhenus<strong>in</strong>glargesystemsspann<strong>in</strong>gseveral<br />
tcomm?replicated?grid?push?v=tgrid+tloc?ptcls =O(tgm(Ng))+O(tlm(Np P)): (5.20) (5.21) (5.19)<br />
wecantakeadvantage<strong>of</strong>pre-fetch<strong>in</strong>gifsuchoperationsareavailable,<strong>and</strong>ifthe tiontgm.However,s<strong>in</strong>cewecanassumetheentiregrideventuallywillbeneeded, gridwillt<strong>in</strong>localmemory. r<strong>in</strong>gs,thesecopieswilloccurasahierarchicalbroadcasts<strong>in</strong>corporated<strong>in</strong>thefunc-<br />
S<strong>in</strong>cethe<strong>particle</strong>positionupdatesarecompletely<strong>in</strong>dependentboth<strong>of</strong>oneother 5.4.2ParticleUpdates{Positions <strong>and</strong>thegrid,thisrout<strong>in</strong>ecanbe\trivially"parallelized,<strong>and</strong>near-perfectl<strong>in</strong>ear<br />
Thecache-hitsassociatedwithread<strong>in</strong>g<strong>and</strong>writ<strong>in</strong>geach<strong>particle</strong>position speed-upexpected: Treplicated?grid?push?loc=tcomp?push?x+tcomm?push?x =O(Np P)+O(tlm(Np P)): (5.23) (5.22)
5.4.3Calculat<strong>in</strong>gtheParticles'Contributiontothe arrayscanhencebesequentiallyaccessed. (O(tlm(Np P)))areherem<strong>in</strong>imals<strong>in</strong>cethe<strong>particle</strong>partition<strong>in</strong>gisxed.The<strong>particle</strong> 87<br />
byaccumulat<strong>in</strong>gthechargeseach<strong>particle</strong>representonthegridpartition<strong>in</strong>gthe Oncethe<strong>particle</strong>positionsaredeterm<strong>in</strong>ed,onecancalculatethechargedensity simulationspace.Hence,each<strong>particle</strong>'schargegetsscatteredtothe4gridcorners <strong>of</strong>thegrid<strong>cell</strong>conta<strong>in</strong><strong>in</strong>gthe<strong>particle</strong>.S<strong>in</strong>cetherearefour<strong>cell</strong>sassociatedwith ChargeDensity<br />
computed,theyallneedtobesummedup<strong>in</strong>toaglobalarrayforthesolver.This eachgridpo<strong>in</strong>t,write-conictsarelikelytooccurs<strong>in</strong>cethe<strong>particle</strong>partition<strong>in</strong>gis computationphase.However,aftertheselocalchargedensityarrayshavebeen xed.Thesewrite-conictsare,however,avoidedbyreplicat<strong>in</strong>gthegrids<strong>in</strong>the sumcouldbedoneeitherseriallyor<strong>in</strong>parallelus<strong>in</strong>gatreestructure. beneed<strong>in</strong>g,wecantakeadvantageforpre-fetch<strong>in</strong>gifsuchoperationsareavailable. occurashierarchicalgathers.Aga<strong>in</strong>,s<strong>in</strong>ceweknow<strong>in</strong>advancewhichdatawewill Ngelements.Thesetransfers,liketheoneshappen<strong>in</strong>g<strong>in</strong>thepush-vphasemay TheserialsumwouldrequireO(Ngp)additionsplusP?1block-transfers<strong>of</strong><br />
thisphenomenon. Notice,however,thatifp=128,logp=7,whichmeansO(Nglogp)mayapproach Np(assum<strong>in</strong>g10orso<strong>particle</strong>sper<strong>cell</strong>)!Ourbenchmarks<strong>in</strong>Chapter6illustrate Ifoneprocess<strong>in</strong>gnodeweretogatherthewholeresult,thiswouldcausenetwork Ifaglobalparalleltreesumwasused,O(Nglogp)additionswouldberequired.<br />
eachsummationtreewillherecorrespondtothenaldest<strong>in</strong>ation<strong>of</strong>thesubgrid. beavoidedifthesumswerearrangedtobeaccumulatedonsubgrids.Theroot<strong>of</strong> tracforthesolvers<strong>in</strong>cethesolverwouldbeus<strong>in</strong>gadistributedgrid.Thiscould Noticethatthissplitstillcausesthesamesumcomplexity: tcomp?parallel?sum=O(PNg=PlogP)=O(NglogP)
Wehencehave: Treplicated?grid?scatter==tcomp?scatter+tcomp?grid?sum +tcomm?scatter+tcomm?grid?sum 88 (5.24)<br />
5.4.4DistributedmemoryFFTsolver +O(tlmNp =ONp P+O(NglogP) P)+O(tgm(MglogP)) (5.26) (5.27) (5.25)<br />
S<strong>in</strong>cetheFFTsolveriscompletely<strong>in</strong>dependent<strong>of</strong>the<strong>particle</strong>s,its<strong>parallelization</strong> advantage<strong>of</strong>vendor-providedparallellibrarieswheneveravailableforthiscase. canalsobedoneseparately.Infact,itishighlyrecommendedthatonetakes<br />
Transpose Assum<strong>in</strong>gonevectorperprocessor,themostobviouswaytodoar<strong>in</strong>gtransposeis basically<strong>in</strong>volvestranspos<strong>in</strong>gadistributedgrid. Noticethatthecommunication<strong>in</strong>volved<strong>in</strong>go<strong>in</strong>gfromonedimensiontotheother<br />
toattherststep,sendoutthewholevectorexcepttheentrythatthenodeitself willbeus<strong>in</strong>g<strong>in</strong>therow(column)operation.IfafullI/Or<strong>in</strong>gwasavailable,this neighbor. shouldbenoconictss<strong>in</strong>cethenodeswillbesend<strong>in</strong>gthisdatatotheirnearest wouldimplytak<strong>in</strong>gcommunicationtime=+(N?1)onaN-by-Ngrid.There<br />
<strong>of</strong>a4-by-4matrixona4-processorr<strong>in</strong>g. totalcommunicationtime.Figure5.1depictsthedatamovements<strong>in</strong>atranspose the<strong>in</strong>tendedrow(orcolumn)(seeexamplebelow).Theaveragemessagelength wouldbeN/2,butthestart-uptimewouldeachtimebe.ThisgivesN+(N2 Foreachstep,oneentrythengets\pickedo"untileverynodehasreceived<br />
I.e.N=P=4,giv<strong>in</strong>gusacommunicationtime<strong>of</strong>4+42(or4+8)i,e thetransposeisO(N2)!!ThisisworrisomewithrespecttotheO(NlogN)FFTs 2)
step1:a11 keepsend Node1 Node289<br />
a21 keepsend a22a12 keepsend Node3 a13 a23 keepsend Node4 a14<br />
step2:a11a24 a41 a31 a42 a32 a33a43 a24<br />
a14a34 a21a31 a22a41 a32a12 a33a42 a43a13 a44a23 a34<br />
step4:done step3:a11a23 a14 a13 a21a34 a24 a31a41 a33 a32 a42a12 a44 a43<br />
communicationtime. Over-all2D-FFTcomplexity computationaltime,especiallys<strong>in</strong>cecomputationaltimeisusuallyalotlessthan Figure5.1:Transpose<strong>of</strong>a4-by-4matrixona4-processorr<strong>in</strong>g.<br />
Totalcommunicationcostsfor\transpose"byfollow<strong>in</strong>gther<strong>in</strong>gapproachabove, weget: aboveequationsassumethatthereisexactlyoneN-vectorperprocessor.This givesatotalcommunicationtimefortheFFTsolver(P=Nx=Ny): tcommtranspose=N+N2 Thisisthecommunicationtimeitwouldtaketore-shueN-vectors.Notethe 2
Totaltimefora2DparallelFFTwithoneN-vectorperprocessorwillthenbe: tcomm?fft=p+N2g 90<br />
Tparallel?2D?fft=tcomp?2D?fft+tcomm?2D?fft 2<br />
assum<strong>in</strong>gthedata-blocksarebunchedtogethersothatthenumber<strong>of</strong>vectorswith<strong>in</strong> acomputationphasedoesnotaectthestart-uptime.Hereccalcisaconstant =2(N2logN)ccalc+Ng+N2 2 (5.29) (5.28)<br />
<strong>in</strong>dicat<strong>in</strong>gthespeed<strong>of</strong>theprocessor.Notethats<strong>in</strong>cethisanalysisdidnottake <strong>in</strong>toconsiderationthenumber<strong>of</strong>memoryaccessesmadeversusthenumber<strong>of</strong> oat<strong>in</strong>gpo<strong>in</strong>tcalculationsmade,thisparameterwouldhavetobesomestatistical<br />
1-DFFTs. average<strong>of</strong>thetwo<strong>in</strong>ordertobemean<strong>in</strong>gful.Localcach<strong>in</strong>gconsiderationscould<br />
Eachcommunicationphasethentransfers(\transposes")ontheaverageNpN2blocks ThismeansthateachprocessorhasacomputationphasethatperformsNp(NlogN) alsobegured<strong>in</strong>tothisparameter.<br />
<strong>of</strong>dataPtimes. AssumenowaNNgridonaPprocessorr<strong>in</strong>g,whereNP=k,kan<strong>in</strong>teger.<br />
Eachprocess<strong>in</strong>gunit<strong>in</strong>thecomputationphasedoes:<br />
Thisgivestotalcommunicationtime(multiply<strong>in</strong>gabovewithNp): computations. (N2)blocks<strong>of</strong>datadur<strong>in</strong>gthePphases. Eachcommunicationphasethentransfers(\transposes")ontheaverage(NP) tcomp?fft=NP(2NlogN)<br />
(Note:(NP)P=Ng))tcomm?fft=N+N2 2:
Thetotaltimeestimatefora2DFFTishence: T2D?fft=OPNg P(Ng 2)ccalc+O 91<br />
Whenseveralr<strong>in</strong>gsare<strong>in</strong>volved,anothermemoryhierarchygetsadded.Similarly, fortrulylargesystems,one1DFFTmaynott<strong>in</strong>acachel<strong>in</strong>e<strong>and</strong>additional =ONg(Ng PlogNg++Ng) Ng+N2g 2)!(5.30)<br />
performancehitswillhencebetaken. (5.31)<br />
Parallel2DFFTSolver Aparallel2DDFFTsolver<strong>in</strong>volvestwo<strong>of</strong>theaboveparallel2DFFTs(count<strong>in</strong>g boththeregularFFT<strong>and</strong>the<strong>in</strong>verseFFT)<strong>and</strong>ascal<strong>in</strong>g<strong>of</strong>1=(N2)foreachgrid quantity.Afactor<strong>of</strong>2doesnotshow<strong>in</strong>thecomplexityequations,<strong>in</strong>fact,eventhe scal<strong>in</strong>g<strong>of</strong>O(N2),canbeconsidered<strong>in</strong>cluded<strong>in</strong>withthe2DFFT(O(N2logN)). equationsforthe2DFFTsolver: Assum<strong>in</strong>gwehaveagrid<strong>of</strong>sizeNg=N2,wehencehavethefollow<strong>in</strong>gcomplexity Tparallel?fft?solver=tcomp?fft?solver+tcomm?fft?solver =O(NglogNg)+O(tlm(NglogNg)+O(tgm(Ng)); (5.32)<br />
5.4.5ParallelField-GridCalculation transpose.The2DFFTwillalsohavesomecache-hitsassociatedwithit. whereO(tgm(Ng))signiesthecache-hits<strong>and</strong>memorytracassociatedwiththe (5.33)<br />
Thenitedierenceequationsassociatedwiththeeld-gridcalculationsparallelize fairlystraightforwardly.S<strong>in</strong>cethe<strong>in</strong>putgridisthepotentialgridsuppliedbythe achieve: solver,itmakessensetopartitionthe<strong>in</strong>teriorloop<strong>in</strong>tocomparableblock-vectors. Thebordercasesmaybeparallelizedseparately.Oneshouldhencebeableto Tparallel?field?grid=tcomp?field?grid+tcomm?field?grid (5.34)
=Ng Pccalc+tlm(Ng 92<br />
Theeld-gridcalculationsmaycausesomereadcachehitsforthenextphaseif thepusherisnotgo<strong>in</strong>gtobeus<strong>in</strong>gblock-vectorpartition<strong>in</strong>g. =O(Ng P)+O(tlm(Ng)) P) (5.35)<br />
5.5ParallelPIC{PartitionedChargeGrid(5.36)<br />
4whichreplicatesthegridborders<strong>and</strong>usesdualpo<strong>in</strong>tersonthelocal<strong>particle</strong> Inthissectionwewillanalyzethegridpartition<strong>in</strong>gapproachoutl<strong>in</strong>ed<strong>in</strong>Chapter SortedLocalParticleArrays Us<strong>in</strong>gTemporaryBorders<strong>and</strong>Partially<br />
maximumtimerequiredononenode,i.e.: Thetotaltime<strong>of</strong>theupdate<strong>of</strong>thelocal<strong>particle</strong>velocityarraysisequaltothe arrays<strong>in</strong>ordertoma<strong>in</strong>ta<strong>in</strong>anautomaticpartialsort<strong>in</strong>g<strong>of</strong>the<strong>particle</strong>s. 5.5.1ParticleUpdates{Velocities<br />
Add<strong>in</strong>g<strong>in</strong>thecommunicationtimeassociatedwiththegather<strong>and</strong>thevelocity updatesgivesusthefollow<strong>in</strong>gequation: tcomp?serial?part?push?v=maxlocalNpccalc =O(maxlocalNp): (5.38) (5.37)<br />
Tjj?push?v=tcomp?gather+tcomp?part?push?v +tcomm?gather+tcomm?part?push?v =maxlocNpc1calc+maxlocNpc2calc (5.39) (5.40)<br />
wherec1calc<strong>and</strong>c1calcisthetimespentdo<strong>in</strong>gthemultiplicationss<strong>and</strong>additions +tlm(Ng)+tlm(Np) =O(maxlocNp)+O(tlm(Ng)+tlm(maxlocNp))(5.43) (5.41)<br />
associatedwithgather<strong>and</strong>thevelocityupdates,respectively.Noticethatthis (5.42)
method.S<strong>in</strong>cethe<strong>particle</strong>sarealreadypartiallysorted,onlylocalgridpo<strong>in</strong>tsor thoserightontheirborders,contributetotheeldateach<strong>particle</strong>(tlmversustgm). versionsuersalotfewercache-hits<strong>in</strong>thegatherphasethanthereplicatedgrid 93<br />
basedonthischeck,eitherwritethemback<strong>in</strong>tothelocalarrayorouttothe Ifthe<strong>particle</strong>sarestored<strong>in</strong>localarraysus<strong>in</strong>garead<strong>and</strong>writepo<strong>in</strong>teronthe 5.5.2ParticleUpdates{Positions cancheckwhetherthe<strong>particle</strong>s'newlocationsarestillwith<strong>in</strong>thelocalgrid,<strong>and</strong> localpositionarray(s)<strong>and</strong>awritepo<strong>in</strong>terforanoutput/scratcharray,thenone globalscratcharray.Theseoperationsmaybecompletelylocalifeachnodehas itsownscratchareatowriteto{hencetheuse<strong>of</strong>anextrawritepo<strong>in</strong>terpernode fortheout/scratcharray.(Thenodesmayotherwisehavetosp<strong>in</strong>-lockonaglobal writepo<strong>in</strong>ter.)Thispart<strong>of</strong>the<strong>codes</strong>hencetakes:<br />
Itishighlyunlikelythatalargenumber<strong>of</strong><strong>particle</strong>swillleaveonanygiventimestep,s<strong>in</strong>cemost<strong>particle</strong>sshouldbemov<strong>in</strong>gwith<strong>in</strong>eachsubdoma<strong>in</strong>ifaccurate(<strong>and</strong><br />
ecient)resultsaretobeobta<strong>in</strong>ed.HowbadO(maxlocalNp)willgetdepends onhow<strong>in</strong>homegenoustheproblembe<strong>in</strong>gsimulatedgets<strong>and</strong>whetherdynamic grid-reorder<strong>in</strong>g(globalsortsneeded)isdonetocompensatetotheseimbalances. thecost<strong>in</strong>alsohav<strong>in</strong>gtoauto-sortthevelocitiessothattheycorrespondtothe <strong>particle</strong>locations.Thelatterwould<strong>in</strong>volveO(maxlocalNp)localread/writes. <strong>particle</strong>location<strong>in</strong>dexhastobetested. Theaboveguresdonot<strong>in</strong>cludethetest<strong>in</strong>g<strong>of</strong>thenew<strong>particle</strong>locationor<br />
Ifthe<strong>particle</strong>arraysareperfectlyload-balanced,O(maxlocalNp)=O(Np=P). tpush?local?ptcls=maxlocalNpccalc =O(maxlocalNp): (5.45) (5.44)<br />
Toaccountforpossible<strong>in</strong>com<strong>in</strong>g<strong>particle</strong>s,thenodesnowhavetocheckeach Noticethatifthegridisdividedupalongonlyonedimension,thenonlyone
other'sout/scratcharrays.Thiswill<strong>in</strong>volve: tf<strong>in</strong>d?new?<strong>particle</strong>s=tlm(maxlocalNp)[writes] 94<br />
theseupdateswilltakedependsonhowmany<strong>particle</strong>sleavetheirlocaldoma<strong>in</strong>. Noticethatthereads<strong>in</strong>volveread-transfersfromothernodes.Howmuchtime +tgm(Npmove)[reads] (5.47) (5.46)<br />
Ifthe<strong>particle</strong>sareuniformlydistributed,theoptimalgridpartition<strong>in</strong>gwouldbe ict<strong>in</strong>gwiththeonedesirablefortheFFTsolverwhichrequiresablock-vector squaresubgrids(seeChapter4,Section3.4).Noticethatthispartition<strong>in</strong>giscon- thegridpartition<strong>in</strong>g,thedistribution<strong>of</strong>the<strong>particle</strong>s<strong>and</strong>their<strong>in</strong>itialvelocities. Thenumber<strong>of</strong><strong>particle</strong>sleav<strong>in</strong>gtheirlocaldoma<strong>in</strong>eachtime-stepdependson<br />
partition<strong>in</strong>g(Figure5.2).<br />
Asmallsav<strong>in</strong>gswillbeachievedwhenus<strong>in</strong>gblock-vectorpartition<strong>in</strong>g<strong>in</strong>the Figure5.2:GridPartition<strong>in</strong>g;a)block-vector,b)subgrid a)block-vector b)subgrid<br />
thismaybefarout-weighedbythepenaltyfor,<strong>in</strong>theblock-vectorcase,hav<strong>in</strong>g largerborderswithfewer<strong>in</strong>teriorpo<strong>in</strong>tsthanthesubgridpartition<strong>in</strong>g. testfor<strong>particle</strong>locationss<strong>in</strong>ceonlyonedimensionhastobechecked.However, Assumethattheaverageprobabilitythatthelocal<strong>particle</strong>sonaprocess<strong>in</strong>g
nodeleavetheir<strong>cell</strong>aftereachtime-stepisP(locNpleave<strong>cell</strong>).Assum<strong>in</strong>gone roworcolumn<strong>of</strong>grid-<strong>cell</strong>sperprocessor,thenabout50%<strong>of</strong>those<strong>particle</strong>sthat leavetheirlocal<strong>cell</strong>,willstillrema<strong>in</strong>onthelocalprocessor(ignor<strong>in</strong>gtheones 95<br />
leav<strong>in</strong>gdiagonallyothefourcorners).SeeFigure5.3.? -6?<br />
? 6 - 6-? 6-?66- 6- 6<br />
Figure5.3:Movement<strong>of</strong><strong>particle</strong>sleav<strong>in</strong>gtheir<strong>cell</strong>s,block-vectorsett<strong>in</strong>g. ? ?6? ? ? -<br />
processor. ThismeansthatP(locNpleave<strong>cell</strong>)=2<strong>particle</strong>smustbetransferredtoanother<br />
process<strong>in</strong>gnode. over,onlytheborder<strong>cell</strong>swillbecontribut<strong>in</strong>g<strong>particle</strong>stobemovedtoanother <strong>particle</strong>pushphase,thenassum<strong>in</strong>gthe<strong>particle</strong>swillnotmovemorethanone<strong>cell</strong> Theprobability<strong>of</strong><strong>particle</strong>sleav<strong>in</strong>gisafunction<strong>of</strong>the<strong>in</strong>itialvelocities<strong>and</strong> However,ifasub-matrixpartition<strong>in</strong>gwaschosenforthegridpo<strong>in</strong>ts<strong>in</strong>the<br />
partition<strong>in</strong>gcouldadapttothedrift.E.g.ifthe<strong>particle</strong>saregenerallydrift<strong>in</strong>g<strong>in</strong> itscurrentprocessor? how\bad"isthedistribution<strong>of</strong><strong>particle</strong>s,<strong>and</strong>howlikelyisthis\bunch"toleave positions<strong>of</strong>the<strong>particle</strong>s<strong>and</strong>thesize<strong>of</strong>each<strong>cell</strong>.Thequestionthenbecomes:<br />
thex-direction,thenitwouldbeveryadvantageoustopartitionthegridasblockrows<strong>and</strong>therebysignicantlyreduc<strong>in</strong>gtheneedfor<strong>particle</strong>sort<strong>in</strong>g(re-location<br />
<strong>of</strong><strong>particle</strong>s)aftereachtime-step.However,<strong>in</strong>plasmasimulationsthereisusually Noticealsothatifonecouldanticipatetheover-alldrift<strong>of</strong>the<strong>particle</strong>s,thegrid<br />
considerableup-downmotion,eventhoughthereisleft-rightdriftpresent. Thetotaltimetaken,asidefromlocalread/writes<strong>and</strong>dynamicload-balanc<strong>in</strong>g
steps,willbe: Tjj?push?x=tcomp?push?x+tcomm?loc?ptcls+tcomm?f<strong>in</strong>d?new?ptcls(5.48) 96<br />
=O(maxlocNp)+O(tlm(maxlocNp)+tgm(Npmoved)); =maxlocNpccalc+tlm(maxlocNp)+tgm(Npmoved) (5.49)<br />
theout/scratcharraysare<strong>in</strong>cludedastgm(Npmoved). arewith<strong>in</strong>thelocalgriddoma<strong>in</strong><strong>of</strong>theirprocess<strong>in</strong>gnode.Thesearchesthrough whereccalc<strong>in</strong>cludestheadditions<strong>and</strong>multiplicationsassociatedwiththe<strong>particle</strong> positionupdatesaswellasthetestsforwhetherthe<strong>particle</strong>s'newpositionsstill (5.50)<br />
5.5.3Calculat<strong>in</strong>gtheParticles'Contributiontothe<br />
mustbeexpected<strong>in</strong>overcom<strong>in</strong>gthepotentialwrite-conictsthatwillresultwhen sharethegridpo<strong>in</strong>tswith<strong>particle</strong>s<strong>in</strong>thenearestneighbor<strong>in</strong>g<strong>cell</strong>s,someoverhead its<strong>cell</strong>,wherethechargeisthenaddedtothechargedensity.S<strong>in</strong>cethe<strong>particle</strong>s Inthisrout<strong>in</strong>eeach<strong>particle</strong>'schargeisscatteredtothefourgridpo<strong>in</strong>tssurround<strong>in</strong>g ChargeDensity<br />
thegridisdistributed.(SeeFigure5.4) ever,thiswillcauseseveralbusy-waitsunlessthe<strong>particle</strong>sarecompletelysorted. Ast<strong>and</strong>ardapproachwouldbetouseagloballockontheborderelements,how-<br />
theextraborderarraywasthetopborder(simpliesimplementationforperiodic <strong>particle</strong>s'contributiontotheoverallchargedensity,werstcalculatethemasif borders.Noticethatonlyoneborderforeachdimensionneedstobereplicated s<strong>in</strong>cetheotherlocalbordermayaswellbethemastercopy.Whencalculat<strong>in</strong>gthe Anotherapproachwouldbetousetheideafromreplicatedgrids,i.e.replicated<br />
systems).Theseborderarraysthengetaddedtothemastercopy.S<strong>in</strong>ceonlyone canbedone<strong>in</strong>parallel.However,therewillbenetworktrac<strong>in</strong>volvedgett<strong>in</strong>gthe correspond<strong>in</strong>gmastercopyvaluestobeadded<strong>in</strong>. nodehasanextracopy<strong>of</strong>anygivenborder,therewillbenowriteconicts,sothese
97<br />
...<br />
146adA<br />
2 7db358<br />
...<br />
Nodel Nodek<br />
Figure5.4:ParallelChargeAccumulation:Particles<strong>in</strong><strong>cell</strong>`A'sharegridpo<strong>in</strong>ts withthe<strong>particle</strong>s<strong>in</strong>the8neighbor<strong>in</strong>g<strong>cell</strong>s.Two<strong>of</strong>thesegridpo<strong>in</strong>ts,`a'<strong>and</strong>`b', aresharedwith<strong>particle</strong>supdatedbyanotherprocess<strong>in</strong>gnode. Tparallel?scatter=tcomp?scatter+tcomp?border?sum +tcomm?scatter+tcomm?border?sum (5.51)<br />
Noticethatthesize<strong>of</strong>theborder(s)dependsonthepartition<strong>in</strong>g.Ifthedoma<strong>in</strong> +O(tlm(maxlocNp))+O(tgm(max =O(maxlocNp)+O(max Nx Px;Ny Py!) Nx Px;Ny Py!))(5.54) (5.53) (5.52)<br />
isonlypartitioned<strong>in</strong>ondirection(block-vector),say<strong>in</strong>y,thentheborderwillbe maxNx isPNx.Noticehoweventhisisconsiderablylessthanwhatthereplicatedgrid <strong>particle</strong>distributions,willbesquaresub-partition<strong>in</strong>g. uses(PNgwhenNyP). As<strong>in</strong>the<strong>particle</strong>pushphase,theoptimalcaseforsimulationswithuniform Px;Ny Py=Nx=1=Nx<strong>and</strong>thetotalextraspaceallocatedfortheseborders<br />
...<br />
...<br />
...<br />
...
Despitethefactthattheblock-vectorpartition<strong>in</strong>gisfarfromoptimalfortheother parts<strong>of</strong>thecode,itisprobably<strong>in</strong>advisabletotrytoanyotherpartition<strong>in</strong>gfor 5.5.4FFTsolver 98<br />
FFTs<strong>and</strong><strong>in</strong>verseFFTsareavailable<strong>and</strong>allonehastoworryaboutisaccess<strong>in</strong>g/arrang<strong>in</strong>gthedata<strong>in</strong>theappropriateorder.Ifonly1DFFTrout<strong>in</strong>esare<br />
Section5.4.Regardless,theoveralltimeforaparallelizedsolvershouldbethat<strong>of</strong> available,theseriallocalversionmaybeusedasabuild<strong>in</strong>gblockasdescribed<strong>in</strong> thesolverused<strong>in</strong>thereplicatedgridcase.<br />
theFFTphase.Mostvendorsprovideveryecienth<strong>and</strong>-codedFFTrout<strong>in</strong>es thatwouldprobablybealotfasterthanuser-codedversions.Ideallyparallel2D<br />
Thenitedierenceequationsassociatedwiththeeld-gridcalculationsparallelizesasforthereplicatedgridcase:<br />
5.5.5ParallelField-GridCalculation partition<strong>in</strong>g(orviceversa)willrequiretimetgm(Ng). Are-arrangement<strong>of</strong>thegridfromasubgridpartition<strong>in</strong>gtoablock-vector<br />
Asbefore,theeld-gridcalculationsmaycausesomereadcachehitsforthenext Tparallel?field?grid=tcomp?field?grid+tcomm?field?grid =O(Ng =Ng Pccalc+tlm(Ng P)+O(tlm(Ng)) P) (5.56) (5.57) (5.55)<br />
5.6HierarchicalDatastructures:Cell-cach<strong>in</strong>g phaseifthepusherisnotgo<strong>in</strong>gtobeus<strong>in</strong>gblock-vectorpartition<strong>in</strong>g. memoryhierarchies,especiallyasystemswithcaches,weshowed<strong>in</strong>Chapter4 Gridsaretypicallystoredeithercolumnorrow-wise.However,<strong>in</strong>asystemwith thatthesearenotnecessarilythebeststorageschemesforPIC<strong>codes</strong>.Instead,
l<strong>in</strong>esneededforeachscatter(calculation<strong>of</strong>a<strong>particle</strong>'scontributiontothecharge oneshouldtrytobuildadatastructurethatm<strong>in</strong>imizesthenumber<strong>of</strong>cache-<br />
density). 99<br />
gulargridstructure).Ifoneusesacolumnorrowstorage<strong>of</strong>thegrid,onecan henceexpecttheneedtoaccesstwocache-l<strong>in</strong>es<strong>of</strong>gridquantitiesforeach<strong>particle</strong>. each<strong>particle</strong>willbeaccess<strong>in</strong>gallfourgridpo<strong>in</strong>ts<strong>of</strong>its<strong>cell</strong>(assum<strong>in</strong>gaquadran-<br />
E.g.forarow-wisestoredgridtherewilltypicallybeonecache-l<strong>in</strong>econta<strong>in</strong><strong>in</strong>gthe Whenthe<strong>particle</strong>'schargecontributionsarecollectedbackonthegridpo<strong>in</strong>ts,<br />
toptwocorners<strong>and</strong>oneconta<strong>in</strong><strong>in</strong>gthebottomtwocorners. tasmany<strong>cell</strong>saspossiblewith<strong>in</strong>acachel<strong>in</strong>e,thenumber<strong>of</strong>cache-hitscan ratherthanrowsorcolumns(orblock-rows<strong>and</strong>block-columns).For3D<strong>codes</strong>, bereduced.Inthiscase,squaresubdoma<strong>in</strong>s<strong>of</strong>thegridaretted<strong>in</strong>tothecache sub-cubesshouldbeaccommodatedaswellaspossible.Inotherwords,whatever Ifone<strong>in</strong>steadusesthe<strong>cell</strong>-cach<strong>in</strong>gstrategy<strong>of</strong>Chapter4,whereonetriesto<br />
accesspatternthecodeuses,thecacheuseshouldtrytoreectthis. Thismeansthatacolumn/rowstorageapproachwoulduse2x16cache-hits <strong>of</strong>cache-hitsassociatedwitheach<strong>cell</strong><strong>in</strong>thiscase. eachcache-l<strong>in</strong>ecanaccommodatea4-by-4subgrid.Figure5.5showsthenumber whereas<strong>cell</strong>-cach<strong>in</strong>gwoulduse[(3x3)*1]+[(3+3)*2]+4=25cache-hits,an S<strong>in</strong>ceacache-l<strong>in</strong>eontheKSRis128bytes,i.e.1664-bitoat<strong>in</strong>gpo<strong>in</strong>tnumbers,<br />
<strong>cell</strong>is:Totalcachehitsper<strong>cell</strong>=(Cx?1)(Cy?1)1(<strong>in</strong>teriorpo<strong>in</strong>ts):(5.58) improvement<strong>of</strong>morethan25%! Ingeneral,ifthecachesizeisCxCy,thenthetotalnumber<strong>of</strong>cache-hitsper<br />
IfCx=Cy=C,theoptimalsituationfor<strong>cell</strong>-cach<strong>in</strong>g,theaboveequation +((Cx?1)+(Cy?1))2(borders)(5.59) +14(corner) (5.60)
100<br />
f f<br />
f<br />
f=gridpo<strong>in</strong>tsstored<strong>in</strong>cache-l<strong>in</strong>e 12<br />
1<br />
21 21 1f2<br />
Figure5.5:Cache-hitsfora4x4<strong>cell</strong>-cachedsubgrid 4<br />
becomes: however,requiremorearray<strong>in</strong>dex<strong>in</strong>g.ToaccessarrayelementA(i,j),weneed: Inordernottohavetochangealltheotheralgorithms,<strong>cell</strong>-cach<strong>in</strong>gdoes, Totalcachehitsper<strong>cell</strong>=C2+2C+1(<strong>cell</strong>?cach<strong>in</strong>g) A(i;j)=A[((i=C)(C2C2))+((j=C)C2) (5.61)<br />
Nx+j],most<strong>of</strong>theseoperationcanbeperformedbysimpleshiftswhenCisa AlthoughthisrequiresmanymoreoperationthanthetypicalA(i;j)=A[i +(jmodC)+(imodC)C] (5.63) (5.62)<br />
power<strong>of</strong>2.Comparedtograbb<strong>in</strong>ganothercache-l<strong>in</strong>e(whichcouldbeverycostlyif onehastogetitfromacrossthesystemonadierentr<strong>in</strong>g)thisisstillanegligible cost. grid. ratherblock-column-<strong>cell</strong>-cachestoavoidcommunicationcostswhenre-order<strong>in</strong>gthe levelwiththeFFTsolver.Theover-allgridcouldstillbestoredblock-column,or Itshouldbenotedthatthe<strong>cell</strong>-cashedstorageschemeconictsatthelocal
stage<strong>of</strong>thegridsummation<strong>and</strong>aspart<strong>of</strong>theeld-gridcalculationswiththehelp onatemporarygridarray. Inthereplicatedgridcase,there-order<strong>in</strong>gscouldbe<strong>in</strong>corporated<strong>in</strong>thenal 101<br />
benetserialaswellasparallelmach<strong>in</strong>eswithlocalcaches. Itshouldbeemphasizedthat<strong>cell</strong>-cach<strong>in</strong>gisalocalmemoryconstructthat
Chapter6 ImplementationontheKSR1<br />
Chapters4<strong>and</strong>5.Ourtest-bedistheKendallSquareResearchmach<strong>in</strong>eKSR1 Thischapterdescribesourimplementations<strong>of</strong>some<strong>of</strong>theideaspresented<strong>in</strong> havenocerta<strong>in</strong>tyuntilyoutry."{Sophocles[495-406B.C.],Trach<strong>in</strong>iae. \Onemustlearnbydo<strong>in</strong>gtheth<strong>in</strong>g;thoughyouth<strong>in</strong>kyouknowit,you<br />
currentlyavailableattheCornellTheoryCenter.<br />
onewouldhencehavetoconsideritadistributedmemorymach<strong>in</strong>ewithrespectto notpaycarefulattentiontomemorylocality<strong>and</strong>cach<strong>in</strong>g.Foroptimalperformance distributedamongitsprocessors,toaprogrammeritisaddressedlikeasharedmemorymach<strong>in</strong>e.However,seriousperformancehitscanbeexperiencedifonedoes<br />
AlthoughtheKSR1physicallyhasitschunks<strong>of</strong>32Mbma<strong>in</strong>realmemory<br />
works,delaysduetoswapp<strong>in</strong>g<strong>and</strong>datacopy<strong>in</strong>gcanbem<strong>in</strong>imized.Many<strong>of</strong>these datalocality.Byalsohav<strong>in</strong>gagoodunderst<strong>and</strong><strong>in</strong>gforhowthememoryhierarchy <strong>issues</strong>werediscussed<strong>in</strong>Chapter4<strong>and</strong>5. memoryoneachprocess<strong>in</strong>gnode.102<br />
Theconguration<strong>in</strong>nthisstudyhas128processornodeswith32Mb<strong>of</strong>real
to<strong>in</strong>struction,theotherhalftodata.Its64-bit20MIPSprocess<strong>in</strong>gunitsexecutes EachKSRprocessor<strong>cell</strong>consists<strong>of</strong>a0.5Mbsub-cache,half<strong>of</strong>whichisdedicated 6.1Architectureoverview 103<br />
two<strong>in</strong>structionspercyclegiv<strong>in</strong>g,accord<strong>in</strong>gtothemanufacturer,40peakMFLOPs Mb/second.The<strong>cell</strong>sexperiencethefollow<strong>in</strong>gmemorylatencies: per<strong>cell</strong>(28MFLOPFFT<strong>and</strong>32MFLOPmatrixmultiplication). saidtoperformat1Gb/second,whereasthest<strong>and</strong>ardi/ochannelarelistedas30 sub-cache(0.5Mb):2cyclesor.1microsecond; Communicationbetweentheprocess<strong>in</strong>g<strong>cell</strong><strong>and</strong>itslocalmemory(cache)is<br />
localcache(32Mb):20-24cyclesor1microsec;ond cacheonotherr<strong>in</strong>g(s)(33,792Mb):570cyclesor28.5microsecond; externaldisks(variable):400,000cycles!orapproximately20msec. othercacheonsamer<strong>in</strong>g(992Mb):130cyclesor6.5microsecond;<br />
beclearfromthisgurethatprogramsthanneedsomuchmemorythatfrequent variablewassetsothatthethreadsonlyranonnodeswithouti/odevices. whentim<strong>in</strong>gthesenodes.Forthebenchmarkspresented<strong>in</strong>thischapter,asystem Thelongdisklatencyisduealmostentirelytothedisk-accessspeed.Itshould Ourtest-bedcurrentlyhas10nodeswithi/odevices.Caremusthencebetaken<br />
diskI/Oisnecessarydur<strong>in</strong>gcomputation,willtakean<strong>in</strong>ord<strong>in</strong>ateamount<strong>of</strong>time. Infact,one<strong>of</strong>thestrengths<strong>of</strong>highlyparallelcomputersisnotonlytheirpotentially powerfulcollectiveprocess<strong>in</strong>gspeed,butthelargeamounts<strong>of</strong>realmemory<strong>and</strong><br />
implementation<strong>of</strong>thest<strong>and</strong>ardBLASrout<strong>in</strong>eSSCAL(vector-times-scalar)ona Togetabetterfeelfortheperformance<strong>of</strong>theKSR,werstranaserialC-<br />
cachethattheyprovide. 6.2Someprelim<strong>in</strong>arytim<strong>in</strong>gresults
happenedaseachdataelementwasusedonlyonce. series<strong>of</strong>localplatforms.Itshouldbenotedthats<strong>in</strong>ceweperformedthistestwith unitstride,thisletstheKSRtakeadvantage<strong>of</strong>cach<strong>in</strong>g.However,nocachere-use 104<br />
v-length(secs) 100,0000.2499 Arch.:Sun4cSparc1Sun4cSparc2Sun4m/670MPDECAlphaKSR-1 (ottar.cs) Table6.1:SSCALSerialTim<strong>in</strong>gResults<br />
200,0000.500 load:0.08-0.390.00-0.06 0.2666 (ash.tc) 0.1000 (secs) 1.00-1.300.0-0.12.25-5.0 (leo.cs) 0.1000 0.1166 (secs) (sitar.cs)(homer.tc)<br />
0.01670.0600 (secs) (secs)<br />
1,000,000 2,000,000 400,0000.500 0.98 0.217 0.42 0.43 0.200 0.38 0.40 0.050 0.07 0.08 0.120<br />
4,000,00012.87 2.43 2.45 4.85 5.07 1.06 1.08 2.12 2.15 4.25 1.02 1.97 4.07 1.03 2.10 0.20 0.21 0.40 0.22<br />
4.23 0.43 0.82 0.84 0.24 0.56 0.58 1.10 3.42 5.44<br />
theSunSPARC4m/670MPs(althoughnottheDECAlpha1)aslongasthe Ascanbeseenfromtheresults<strong>in</strong>Table6.1,theKSR'sscalarunitsout-perform 8,000,000 1.65 1.67 6.28<br />
giv<strong>in</strong>guswide-rang<strong>in</strong>gtim<strong>in</strong>gresultsforthesameproblem.Thiscouldbedue problemwassmallenough.Forlargerproblems,theloadseemedtoplayarole anothernode.Theseareclearly<strong>issues</strong>thatalsowillneedtobeconsideredfor parallelimplementations. porationwithnootheruserspresent.Alongwithlaterresults,thetim<strong>in</strong>gsobta<strong>in</strong>edseemto <strong>in</strong>dicatethatitsscalarspeedisabout3timesthat<strong>of</strong>aKSRnode<strong>and</strong>about4timesthat<strong>of</strong>a tocach<strong>in</strong>gaswellastheKSROS'sattempttoshipourprocessdynamicallyto<br />
Sun4m/670MP{veryimpressive<strong>in</strong>deed. 1TheDECAlpharesultswhereobta<strong>in</strong>edonaloanermach<strong>in</strong>efromDigitalEquipmentCor-
6.3ParallelSupportontheKSR TheKSRoersaparalleliz<strong>in</strong>gFortrancompilerbasedonPrest<strong>of</strong>orst<strong>and</strong>ard<strong>parallelization</strong>ssuchastil<strong>in</strong>g.Prestocomm<strong>and</strong>sthengetconvertedtoPthreadswhich<br />
105<br />
threadmanipulations.OnemayalsouseSPC(SimplePrestoC)whichprovides sixP1003.4thread)directlyleav<strong>in</strong>gittototheprogrammertodotheactually asimpliedmechanismwhenus<strong>in</strong>gfunctionsthatgetcalledbyateam<strong>of</strong>threads (generally,oneperprocess<strong>in</strong>gnode).TheSPCteamsprovidesanimplicitbarrier arethenaga<strong>in</strong>sitt<strong>in</strong>gontop<strong>of</strong>Machthread.TheCcompilerusesPthreads(Pos-<br />
rightbefore<strong>and</strong>aftereachparallelcall.<br />
tom<strong>in</strong>imizememorytransfers. systemtakescare<strong>of</strong>datareplication,distribution,<strong>and</strong>movement,thoughitis highlyusefultomakenote<strong>of</strong>howthisish<strong>and</strong>ledsothatprogramscanbeoptimized tohaveashareddataarea.Eachthreadalsohasaprivatedataarea.Thememory TheKSR'sprogrammer'smemorymodeldenesallthreads<strong>in</strong>as<strong>in</strong>gleprocess<br />
&execute"foragivenset<strong>of</strong>parallelthreads.Allsubsequentthreadsareequal hierarchically,<strong>and</strong>any<strong>of</strong>thesemayaga<strong>in</strong>callafork. barriers,mutexlocks,<strong>and</strong>conditionvariables.Onlyonethreadcancall\fork ThebarriersareimplementedthroughcallstoBarriercheck<strong>in</strong><strong>and</strong>Barrier TheKSRoersfourPthreadconstructsfor<strong>parallelization</strong>:s<strong>in</strong>glePthreads,<br />
lattermakesthemasterwaitforallslavestonish.TheexplicitKSRPthread barrierconstructhenceoperateasa\half"barrier.Inordertoobta<strong>in</strong>thefull checkout.Theformercausesallslavestowaitforthedesignatedmaster,the<br />
waitformaster).SPCcallsimplicitlyprovidefullbarriers,i.e.acheck-outfollowed synchronizationusuallyassociatedwiththetermbarrier,apthread_checkout byacheck-<strong>in</strong>. (masterwaitsforslaves)needstobeissuedrightbeforeapthread_check<strong>in</strong>(slaves
Fortranmayappeartobethemostnaturallanguageforimplement<strong>in</strong>gnumerical 6.3.1CversusFortran algorithms.AlthoughatpresentonlytheKSRFortrancompileroersautomatic 106<br />
exibility.C,withitspowerfulpo<strong>in</strong>terconstructsfordynamicmemoryallocation <strong>and</strong>strongl<strong>in</strong>ktoUNIX,isalsorapidlybecom<strong>in</strong>gmorepopular.S<strong>in</strong>cewewant <strong>parallelization</strong>,wehavechosentoimplementourcode<strong>in</strong>Cs<strong>in</strong>ceitallowsformore tousesome<strong>of</strong>theCpo<strong>in</strong>terfeatures<strong>in</strong>theimplementation,C<strong>in</strong>deedbecomes appear<strong>in</strong>futureFortranst<strong>and</strong>ards.Ifthathappens,Fortranmayaga<strong>in</strong>appear moreattractive.)C<strong>in</strong>terfaceswellwithbothAssemblyLanguage<strong>and</strong>Fortran, anaturalchoice.(Theauthorexpectsthedynamicmemoryallocationfeatureto <strong>and</strong>isalongwithFortran77<strong>and</strong>C++,theonlylanguagescurrentlyavailableon ourKSR1. 6.4KSRPICCode<br />
edit<strong>in</strong>g<strong>in</strong>clude-statementspo<strong>in</strong>t<strong>in</strong>gtolocallibraryles)wastochangethepr<strong>in</strong>tf theKSR1.Theonlycodechangerequiredtogetitrunn<strong>in</strong>gcorrectly(asidefrom wastoportourserial<strong>particle</strong>codedevelopedonthelocalSunworkstationto 6.4.1Port<strong>in</strong>gtheserial<strong>particle</strong>code<br />
statementsformatt<strong>in</strong>gargument\%lf"<strong>and</strong>\%le"topla<strong>in</strong>\%f"<strong>and</strong>\%e".This Therstth<strong>in</strong>gwedidafterexplor<strong>in</strong>gsomethePthreadfeaturesontestcases,<br />
theimplementorsnotion<strong>of</strong>doubleis. istheresult<strong>of</strong>mov<strong>in</strong>gfroma32-bitmach<strong>in</strong>etoa64-bitmach<strong>in</strong>e<strong>and</strong>hencewhat<br />
isaround12-13secondsforall<strong>of</strong>them.Thisshouldhence<strong>in</strong>dicatethe<strong>in</strong>creased for,benchmarkswereobta<strong>in</strong>edfortheserialversion<strong>of</strong>our<strong>particle</strong>codeonthe correspond<strong>in</strong>gtest-runsfor32x32gridpo<strong>in</strong>tswiththesamenumber<strong>of</strong><strong>particle</strong>s, KSR.SeeTable6.2.Noticethatthedierencebetweenthe16x16entries<strong>and</strong>the Inordertogetabetternotion<strong>of</strong>whatk<strong>in</strong>d<strong>of</strong>performancewecouldhope
solvertimeforgo<strong>in</strong>gfroma16x16gridtoa32x32grid.Noticehowthisgure provessignicantcomparedtotheoveralltimefortherunswithanaverage<strong>of</strong>1-2 <strong>particle</strong>sper<strong>cell</strong>,butbecomesmorenegligibleasthenumber<strong>of</strong><strong>particle</strong>sper<strong>cell</strong> 107<br />
<strong>in</strong>creases.<br />
Runno.GridsizeNo.<strong>of</strong><strong>particle</strong>stime(sec) Table6.2:SerialPerformanceontheKSR1 100time-steps,non-optimizedcode<br />
1 8x8Load=1.5-3.0<br />
2b 2a 2c 1024 2048 4096 64 15.68 27.86 51.81 1.88<br />
2d 3b 3a 3c 16x16 8192 1024 2048 4096 100.42 28.50 39.90<br />
Toga<strong>in</strong>further<strong>in</strong>sightonhowmucheachpart<strong>of</strong>thecodecontributestothe 3d 32x32 8192 113.30 63.76<br />
16,384<strong>particle</strong>sover100timestepsonbothaSunSparc1,SunSparc460/MP,<strong>and</strong> overalltime<strong>of</strong>thecomputations,benchmarkswereobta<strong>in</strong>forrunssimulat<strong>in</strong>g loopsbygloballyden<strong>in</strong>gtheappropriate<strong>in</strong>verses,<strong>and</strong>replac<strong>in</strong>gthedivisionswith multiplications.Asonecanseefromtheresults,thedefaultKSRdivision<strong>in</strong>C theKSR1(Table6.2).Intheoptimizedcase,weremoveddivisionsfrom<strong>particle</strong>
speed-ups,butnotquiteasgoodasouroptimizedcode(exceptforthesolver, isreallyslow.Bysett<strong>in</strong>gthe-qdivagwhencompil<strong>in</strong>g,weobta<strong>in</strong>edsignicant whichwedidnotoptimize). 108<br />
Table6.3:SerialPerformance<strong>of</strong>ParticleCodeSubrout<strong>in</strong>es 128x128=16K<strong>particle</strong>s,32-by-32grid,100time-steps Sun4c/60Sparc1,SunSparc670/MP,<strong>and</strong>theKSR1<br />
Initializations4.93 slowdivslowdivslowdivoptimizedoptimizedoptimized Sparc1670/MPKSR1Sparc1670/MPKSR1 (sec)(sec)(sec)(sec) 9.9 5.37 (sec) 5.92 (sec)<br />
FFTFieldsolve34.03 (Part-rho) Pull-back Scatter 85.45 2.71 35.175.75 1.5 2.32 0.95 31.4 0.52 9.84<br />
Update<strong>particle</strong>152.4 8.9 33.58 13.9 11.2<br />
(<strong>in</strong>cl.gather) velocities -108.20120.7049.4 32.8 8.8<br />
Update<strong>particle</strong>47.93 Fieldgrid locations 1.70 - 32.231.47 1.66 1.08 17.7 0.5 26.2<br />
Likethevectorcode,theseKSRguresalsocomparefavorabletothoseobta<strong>in</strong>ed Simulation314.82-188.3264.94113.379.77 (toE) 0.22<br />
onSunworkstations.TheSparc1took,forourapplication,about1.5timeslonger, ontheaverage,thantheKSRserialrunswithamoderateload.
therest<strong>of</strong>thecode.S<strong>in</strong>cethisrout<strong>in</strong>eisastrictlygrid-dependent(i.e.O(Ng)), Itisworthnot<strong>in</strong>ghowlittletimetheeld-gridcalculationstakescomparedwith Theseresultswereusedwhendevelop<strong>in</strong>gourstrategiesforChapter4<strong>and</strong>5. 109<br />
number<strong>of</strong>simulation<strong>particle</strong>s,Np,i.e.NgNp,<strong>and</strong>alltheotherrout<strong>in</strong>esare 6.5Cod<strong>in</strong>gtheParallelizations dependentonNp. thismatchesouranalysis<strong>in</strong>Chapter5s<strong>in</strong>cetherearealotfewergrid-po<strong>in</strong>tsthan<br />
Ourrsteortwastoparallelizeourcodebypartition<strong>in</strong>gthecodewithrespect to<strong>particle</strong>s.S<strong>in</strong>ceeach<strong>of</strong>the<strong>particle</strong>updatesare<strong>in</strong>dependent,thisisafairly \trivial"<strong>parallelization</strong>fortheserout<strong>in</strong>es.Forthechargecollectionphase,however,<br />
us<strong>in</strong>gSPC(SimplePrestoC)foreachavailableprocessor.Ourcodehenceuses Inordertoachievethe<strong>particle</strong><strong>parallelization</strong>,werstspawnedateam<strong>of</strong>Pthreads 6.5.1ParallelizationUs<strong>in</strong>gSPC thismeansblock<strong>in</strong>g<strong>and</strong>wait<strong>in</strong>gforgridelementsorreplicat<strong>in</strong>gthegrid.<br />
implicitbarrierconstructsasdepictedbelowthroughtheprcallcalls.Although itisratherwastefultospawnathreadforthous<strong>and</strong>s<strong>of</strong><strong>particle</strong>sononly64-256 Pthreadsarelight-weightconstructsproduc<strong>in</strong>gverylittleoverheadoncreation, processors.Wehencegroupthe<strong>particle</strong>sus<strong>in</strong>gthemodulus<strong>and</strong>ceil<strong>in</strong>gfunctions whencalculat<strong>in</strong>gthepo<strong>in</strong>terstothe<strong>particle</strong>numbereachthreadstartsat(similar tothepartition<strong>in</strong>g<strong>of</strong>ar<strong>and</strong>om-sizedmatrixacrossaset<strong>of</strong>processors).Thiscan eitherbedonewith<strong>in</strong>as<strong>in</strong>glesystemcallus<strong>in</strong>gglobalsorus<strong>in</strong>glocalsystemcalls <strong>in</strong>eachsubrout<strong>in</strong>e.Weoptedforthelatterforthesake<strong>of</strong>modularity.Figure 6.1showsthe<strong>codes</strong>egmentforourma<strong>in</strong><strong>particle</strong>loopafteradd<strong>in</strong>gSPCcallsto parallelized<strong>particle</strong>updaterout<strong>in</strong>es.
************************ma<strong>in</strong>simulationloop:***************/ t=0.0; while(t
6.6Replicat<strong>in</strong>gGrids111<br />
end.InasharedmemoryaddressedsystemliketheKSR,thiscanbefairlyeasily implementedbyadd<strong>in</strong>ganextradimensiontothegridarray. havethenodes(threads)updatelocalcopies<strong>and</strong>thenaddthemtogether<strong>in</strong>the (part-rho)rout<strong>in</strong>eistoreplicatethechargedensitygridforeachprocessors<strong>and</strong> Themostpopularwaytoh<strong>and</strong>letheparallelwritesassociatedwiththescatter<br />
accumulation.However,aspredicted,thechargeaccumulationrout<strong>in</strong>ereallytakes velocities<strong>and</strong>positions,respectively,whenus<strong>in</strong>gthesereplicatedgridforcharge thenonparallelizedsum,theoverhead<strong>in</strong>add<strong>in</strong>gthelocalgridcopiestogether amajorperformancehitasthenumber<strong>of</strong>process<strong>in</strong>gnodes<strong>in</strong>creases.Infact,for Figures6.2showshowgreatspeed-upswereachievedforthe<strong>particle</strong>pushersfor<br />
parallelsums). ThisbehaviorcanclearlybeenseenfromtheScattercurve<strong>in</strong>Figure6.2.(usesthe <strong>in</strong>thatitalsoactuallyslowsdownwhenmorethan16process<strong>in</strong>gnodesareused. downfor8ormoreprocessors!Atree-basedparallelizedsumisnotmuchbetter causessomuchoverheadthatthisrout<strong>in</strong>eexperiencesaworsethanl<strong>in</strong>earslow-<br />
<strong>of</strong>processorss<strong>in</strong>cetheseupdatesare<strong>in</strong>dependent<strong>and</strong>hence<strong>in</strong>volveno<strong>in</strong>ter-node memorytrac.Thevelocityupdatesneedtoread<strong>in</strong>gridquantities<strong>and</strong>s<strong>in</strong>ce theseaccessesarefairlyr<strong>and</strong>om,globalreadcopiesneedtobemadebythesystem underneath,<strong>and</strong>thisrout<strong>in</strong>ehencetakesaperformancehitoverthepositionupdate Asexpected,the<strong>particle</strong>positionupdatesscalefairlyl<strong>in</strong>early<strong>in</strong>thenumber<br />
<strong>of</strong>thegrid<strong>in</strong>thisphase. rout<strong>in</strong>e. at118nodes,ourmaximumtestcase,eachthreadtypicallyh<strong>and</strong>ledonlyonerow sion,italsodidnotbenetmuchfrom<strong>parallelization</strong>.Thisisnotsurpris<strong>in</strong>gs<strong>in</strong>ce Thescatter<strong>and</strong>gather/push-vrout<strong>in</strong>esalsoshow<strong>in</strong>terest<strong>in</strong>g\k<strong>in</strong>ks"<strong>in</strong>their S<strong>in</strong>cetheeld-gridrout<strong>in</strong>ewasalreadyfast<strong>and</strong>onlyparallelized<strong>in</strong>onedimen-<br />
curvearound32processors.ThiseectcanprobablybeattributedtotheKSR1's
112<br />
Figure6.2:ParallelScalability<strong>of</strong>ReplicatedGridApproach
32-noder<strong>in</strong>gsize.S<strong>in</strong>ceweelim<strong>in</strong>atedI/Onodes<strong>in</strong>ourruns,thismeantthat whenwerequested32nodes,thosenodeswillnolongerbeonthesamer<strong>in</strong>g,<strong>and</strong> thelongeraccesstimesfor<strong>in</strong>ter-r<strong>in</strong>gmemoryaccessstartsplay<strong>in</strong>garole. 113<br />
6.7DistributedGrid<br />
parableoptimization.Asexpected,thethedistributedgridcaseshowscont<strong>in</strong>ued obta<strong>in</strong>edfromthesetwoapproachesforrunswiththesameproblemsize<strong>of</strong>comheadassociatedwithreplicatedgrids.Figure6.3givesscatterrout<strong>in</strong>ebenchmarks<br />
One<strong>of</strong>thereasonsfordistribut<strong>in</strong>gthegridwastoalleviatedthescatterover-<br />
speed-upwith<strong>in</strong>creas<strong>in</strong>gnumber<strong>of</strong>processors,whereasthereplicatedgridapproachbehavesas<strong>in</strong>Figure6.2,byactuallyshow<strong>in</strong>gslow-downwhenus<strong>in</strong>gmore<br />
than16processors.<br />
results<strong>in</strong>Table6.4.Thistablecomparesasimulationswithrelativelylarge<strong>in</strong>itial <strong>of</strong>thepushrout<strong>in</strong>esareaectedbythe<strong>in</strong>itialconditionsasdemonstratedbythe Itishardertocomparethisdatatothereplicatedgridcase,s<strong>in</strong>cetheeciency <strong>in</strong>clud<strong>in</strong>g<strong>particle</strong>pushresultsaswell. Figure6.4showsthesamebenchmarksforthepartitionedgridscase,butnow<br />
<strong>in</strong>itialvelocitiesonlyone-hundredth<strong>of</strong>that<strong>and</strong>onewithnone.Thesebenchmarks time-step,thisrout<strong>in</strong>etakesalargeperformancehit(almostafactor<strong>of</strong>2<strong>in</strong>our arefairlycomparabletoeachotherexceptfortherout<strong>in</strong>eupdat<strong>in</strong>gtheposition. velocitiessuchastheplasma<strong>in</strong>stabilitytestcase(shown<strong>in</strong>column1),toonewith S<strong>in</strong>cetheglobalscratcharraysgrowwiththenumber<strong>of</strong><strong>particle</strong>sleav<strong>in</strong>gateach testcase). mostcolumnforcomparisons.Althoughthesebenchmarksaredoneononly4pro-<br />
cessor,theyalreadyshowthatthedistributedgridscatterrout<strong>in</strong>eisfasterthan thereplicatedgridscatterrout<strong>in</strong>e.However,theseresultsalsoshowtheoverhead associatedwithpartiallysort<strong>in</strong>g<strong>and</strong>ma<strong>in</strong>ta<strong>in</strong><strong>in</strong>gdynamiclocal<strong>particle</strong>arrays. Benchmarksforthecorrespond<strong>in</strong>greplicatedgridcaseare<strong>in</strong>cluded<strong>in</strong>theright-
114<br />
Figure6.3:Scatter:DistributedGridversusReplicatedGrids
115<br />
Figure6.4:DistributedGridBenchmarks
(seeFigure6.4),thisapproachbecomesfavorableasthescatterrout<strong>in</strong>estartsto dom<strong>in</strong>atethereplicatedgridcase. However,s<strong>in</strong>cethedistributed<strong>particle</strong>updaterout<strong>in</strong>esalsoscalel<strong>in</strong>early<strong>in</strong>time 116<br />
Table 6.4: Distributed Grid -- Input Effects. 128x128 grid; 32,768 particles; 4 processors. The right-most column gives the corresponding replicated grid case.

  Vdrift:            wp2Ly         1/100 (wp2Ly)   0.0           N.A. (replicated)
                     Time (secs)   Time (secs)     Time (secs)   Time (secs)
  Scatter            12.35         11.77           11.83         19.33
  Gather + Push-v    23.86         22.53           22.43         19.20
  Push-x             55.14         29.35           29.13         16.67

6.8 FFT Solver
We did not implement an optimized 2D FFT solver on the KSR. The KSR has a vendor-provided 2D FFT routine, supplied in Fortran, which we hope to be able to utilize in our production runs. Since these routines are developed by the vendors to show off their hardware, they are often highly optimized routines that use assembly language programming, caching, and every other feature the vendor can think of taking advantage of. It is therefore highly unlikely that users will be able to achieve similar performance with their Fortran or C codes, no matter how well they are coded.
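For reference, the core of such a spectral field solve is just a scaling of each Fourier mode of the charge density by 1/(eps0 (kx^2 + ky^2)) (see Appendix B). The sketch below shows only that scaling step, applied to an already-transformed charge density array; the forward and inverse 2D FFTs around it (for example, the vendor routine) are assumed and not shown, and all names here are illustrative rather than taken from any actual library.

    /* Sketch: scale the 2D Fourier modes of rho to obtain phi for
     * -eps0 * laplacian(phi) = rho on a periodic Lx-by-Ly box.
     * rho_hat is assumed to already hold the forward 2D FFT of rho. */
    #include <complex.h>

    void scale_poisson_modes(double complex *rho_hat, int nx, int ny,
                             double Lx, double Ly, double eps0)
    {
        const double PI = 3.141592653589793;
        for (int i = 0; i < nx; i++) {
            /* map row index to a signed wave number kx */
            double kx = 2.0 * PI * ((i <= nx / 2) ? i : i - nx) / Lx;
            for (int j = 0; j < ny; j++) {
                double ky = 2.0 * PI * ((j <= ny / 2) ? j : j - ny) / Ly;
                double k2 = kx * kx + ky * ky;
                /* phi_hat = rho_hat / (eps0 * k^2); the k = 0 (mean) mode
                 * of a periodic system is undetermined and is zeroed.    */
                rho_hat[i * ny + j] = (k2 > 0.0)
                    ? rho_hat[i * ny + j] / (eps0 * k2)
                    : 0.0;
            }
        }
    }

An inverse 2D FFT of the scaled array then yields the potential on the grid.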
6.9 Grid Scaling

From Table 6.3 one would hardly think it worthwhile to parallelize the field-grid routine, since it takes such an insignificant amount of time compared with the rest of the serial code. However, one does actually start noticing its presence as the grids get larger and the rest of the code is parallelized. Figure 6.5 illustrates how the parallelized code, as expected, scales linearly with the grid-size for a fixed number of processors and particles. As expected, the particle push routine for positions showed no effect of the varying grid-size, since this routine is independent of the grid, whereas the gather/push-velocity routine and the scatter routine show the grid effects as predicted.

6.10 Particle Scaling

We also included benchmarks where the grid size and the number of processors were fixed and the number of particles varied (see Figure 6.6). As expected, calculating the field at each grid-point based on the charge distribution (field-grid) shows no effect of this scaling, since it only involves grid quantities. The "kink" in the particle update curves experienced in our benchmarks at 4096 particles/node (4 particles per grid cell) can be attributed to local paging/caching. Since the replicated grid approach accesses the same local particle quantities in sequential memory locations, cache misses will occur when the local cache size is exceeded.

Figure 6.5: Grid-size Scaling. Replicated grids; 4 nodes; 262,144 particles; 100 time-steps

Figure 6.6: Particle Scaling -- Replicated Grids

6.11 Testing

To assure that the codes were still modeling the expected equations, the output of each new parallelized version was checked for validity. The tests described in Chapter 3 for checking for plasma oscillations, symmetry, and two-stream instabilities played an important role in this effort. Unfortunately, we did not have access to the HDF package used to produce the two-stream instability graphs in Chapter 3. We did, however, compare output files of the charge distribution and particle positions with our serial code's corresponding output. We limited the number of iterations to about 10, since these codes produce non-linearities and round-off effects as time progresses.
Chapter 7

Conclusions and Future Work

"I know that you believe you understand what you think I wrote, but I am not sure you realize that what you read is not what I meant..."
-- author's version of a saying first seen on a plaque at Glennwood Pines Restaurant, Rt 89, Ithaca, NY; original by Richard Nixon.

By bringing together knowledge from the fields of computer science, physics, and applied mathematics (numerical analysis), this dissertation presented some guidelines on how to fully leverage parallel supercomputer technology in the area of particle simulations. This work highlighted several relevant basic principles from these fields, including a summary of the main references related to the area of parallel Particle-in-Cell codes. Our code modeled charged particles in an electric field. We analyzed this fairly complex application, which has several interacting modules.

Our framework was a physically distributed memory system with a globally shared memory addressing space, such as the KSR supercomputer. However, most of the new methods developed in this dissertation hold generally for high-performance systems with either hierarchical or distributed memory.

By studying the interactions between our application's sub-program blocks, we showed how the accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization.
We introduced a novel approach that led to an efficient implementation facilitating dynamically partitioned grids. This novel approach takes advantage of the shared-memory addressing system and uses a dual pointer scheme on the local particle arrays that keeps the particle locations automatically partially sorted (i.e. sorted to within the local grid partition). Complexity and performance analyses were included both for this approach and for the traditional replicated grids approach.

We also introduced hierarchical data structures that were tailored for both the cache size and our problem's structure of memory access. By reordering the grid indexing, we aligned the storage of neighboring grid-points with the local cache. This saved us 25% of the cache-hits for a 4-by-4 cache.
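As an illustration of the dual-pointer bookkeeping described above, the following minimal sketch (our own simplification, not the actual KSR1 routine; it assumes a 1-D partition in x and a simple array-of-structs layout) keeps each node's particle array partially sorted by swapping emigrating particles to the tail in a single pass:

    #include <stddef.h>

    typedef struct { double x, y, vx, vy; } Particle;

    /* Two pointers: 'keep' scans from the front, 'give' from the back.
     * Particles that have left this node's grid partition are swapped to
     * the tail of the array, so the head stays sorted "to within the
     * partition" and the tail is ready to be shipped to neighbor nodes. */
    size_t partition_local_particles(Particle *p, size_t n,
                                     double xlo, double xhi)
    {
        size_t keep = 0, give = n;
        while (keep < give) {
            if (p[keep].x >= xlo && p[keep].x < xhi) {
                keep++;                    /* still local: leave in place */
            } else {
                Particle tmp = p[keep];    /* emigrant: swap to the tail  */
                p[keep] = p[--give];
                p[give] = tmp;
            }
        }
        return keep;   /* local count; entries [keep, n) are emigrants */
    }

Particles in the head of the array are guaranteed to lie within the local partition, while the tail can be handed to neighboring nodes without ever performing a global sort.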
We showed that further improvements can be made by considering the input data's effect on the simulation. For example, in the case of mean particle drift, it may be advantageous to partition the grid primarily along the direction of the drift.

Our analyses and implementation benchmarks demonstrate how the replicated grid approach is appropriate for problems run on a limited number of processors. Since the particles are always processed by the same computational node, an initially load-balanced simulation will remain perfectly load balanced. However, the replicated grid fails to scale properly, both with respect to computation and memory requirements, for large problems (say, grids larger than 128-by-128) running on several processors (in our case, greater than 8 with respect to processing time). In this case, the extra storage and computation time associated with adding the local grid copies together was shown to be significant. Our dual pointer scheme for grid partitioning, on the other hand, is more difficult to implement, but scales well for large problems on highly parallel systems with several dozen or more processors, since it does not need to replicate and sum the whole grid.

The particle-in-cell codes for this study were tested using physical parameters which lead to predictable phenomena, including plasma oscillations and two-stream instabilities.
7.1 Future Work

"I like the dreams of the future better than the history of the past."
-- Thomas Jefferson, letter to John Adams, 1816.

The field of computational science, a term first cast by physics Nobel Laureate Ken Wilson, is clearly a growing field that will remain in focus for years to come. During the past 10 years computing has become an essential tool for many scientists and engineers, and we expect this trend to continue as technology provides us with more computational power through advances in semiconductor technologies as well as the use of parallel and distributed computer systems. To illustrate the importance of this field, several universities have recently initiated graduate programs specifically aimed at Computational Science and Engineering [SRS93].

This dissertation focused on analyzing the parallelization of Particle-in-Cell methods, but it is clear that the main techniques developed to parallelize complex computer programs, such as particle simulations, can be applied to many other scientific and engineering codes. Current research areas that take advantage of computational science and engineering include astronomy, biology and virology, chemistry, electromagnetics, fluid dynamics, geometric modeling, material science, medicine, numerical algorithms, physics, problem-solving environments, scientific visualization, signal and image processing, structures technology, symbolic computing, and weather and ocean modeling. The fun has merely begun!
"Man wants to know, and when he ceases to do so, he is no longer man."
-- Nansen [1861-1930], on the reason for polar explorations.
Appendix A

Annotated Bibliography

"Nothing has really happened until it has been recorded."
-- Virginia Woolf, quoted in Harold Nicolson, Diaries.

A.1 Introduction

This section gives a detailed description of some of the main references related to parallel particle codes. Sections A.2 and A.3 cover the main books and general articles, respectively, whereas Sections A.4 and A.5 cover the PIC references most central to our work. An overview table of these references was provided in Chapter 2.

Since the field of computational science is relatively new and lies at the intersection of several science disciplines, references are often hard to find using only the Science Citation Index. INSPEC is a recent library utility that the author found extremely useful (it not only lists authors and titles from published journals, but includes recent conference papers and abstracts as well). However, to really be on top of this rapidly moving field, it is important to follow the main conferences in high performance computing (e.g. Supercomputing '9x, SHPCC '9x, SIAM conferences, The Colorado Conference on Iterative Methods (previously called The Copper Mountain Conference on Iterative Methods), etc.), keep in touch with the main actors of the field (have them send you technical reports), and look out for new journals such as Computational Science & Engineering, due out from IEEE in the spring of 1994.
A.2 Reference Books

A.2.1 Hockney and Eastwood: "Computer Simulation Using Particles"

This book [HE89] is a solid reference that concentrates on particle simulations with applications in the area of semiconductors. They include a historical overview of PIC codes and give a fairly thorough example of 1-D plasma models. They cover several numerical methods for solving the field equation (including the SOR and FFT methods we used), and describe several techniques related to semiconductor device modeling, astrophysics, and molecular dynamics. The FFT is described in the appendix.

A.2.2 Birdsall and Langdon: "Plasma Physics Via Computer Simulation"

This text [BL91] concentrates on describing the general techniques used in plasma physics codes and hence includes fairly detailed descriptions of 2- and 3-D electrostatic and electromagnetic programs. Their book also includes a diskette with a 1-D electrostatic code, complete with graphics for running under MS-DOS and X-windows (X11).

A.2.3 Others

Tajima [Taj89] and Bowers & Wilson [BW91] are also good generic references for plasma simulation codes, with slants towards astrophysics.

For those interested in correlated methods -- statistical mechanics and molecular dynamics involving numerical techniques such as the Monte Carlo method (named so due to the role random numbers play in this method) -- Allen & Tildesley [AT92] and Binder [Bin92] are considered some of the best references on the subject. Although these methods are quite different from the PIC method, they may provide interesting algorithmic ideas that could be used when parallelizing PIC codes.
A.2.4 Fox et al.: "Solving Problems on Concurrent Processors"

This 2-volume set [FJL+88, AFKW90] covers the most common parallel techniques used in parallelizing physical codes. The material centers around the authors' experience on the Caltech hypercube, and is hence biased towards hypercube implementations. Volume I, subtitled General Techniques and Regular Problems, gives a theoretical overview of the algorithms used and their underlying numerical techniques, whereas Volume II, Software for Concurrent Processors, is primarily a compendium of Fortran and C programs based on the algorithms described in Volume I.
A.3 General Particle Methods

In April 1987 a workshop entitled "Particle Methods in Fluid Dynamics and Plasma Physics" was held at Los Alamos in New Mexico. The proceedings from this workshop were published in Computer Physics Communications the following year and contain several interesting papers in the area:

A.3.1 F.H. Harlow: "PIC and its Progeny"

[Har88] is a survey article covering the origin of PIC and related techniques as methods for exploring shock interactions with material interfaces in the '60s. Harlow's PIC work for fluid dynamics simulations ('64) was the foundation for Morse and Nielson's ('69) work on higher-order interpolation (CIC) schemes for plasmas. (According to Hockney and Eastwood, they and Birdsall's Berkeley group ('69) were the first to introduce CIC schemes.)

The bulk of Harlow's paper is a list of 143 references, mostly papers on fluid dynamics by the author and his colleagues at Los Alamos (1955-87).

A.3.2 J. Ambrosiano, L. Greengard, and V. Rokhlin: "The Fast Multipole Method for Gridless Particle Simulation"
[AGR88] advocates the use of modern hierarchical solvers, of which the most general technique is the fast multipole method (FMM), to avoid some of the local smoothing, boundary problems, and aliasing problems associated with PIC methods when used to simulate cold plasmas and beams, and plasmas in complicated regions. The paper describes the FMM method and how it fares with respect to the aforementioned problems associated with PIC methods.

A.3.3 D.W. Hewett and A.B. Langdon: "Recent Progress with Avanti: A 2.5D EM Direct Implicit PIC Code"

[HL88] describes the direct implicit PIC method and some relativistic extensions. The code uses an iterative solution based on ADI (alternating direction implicit) as a 2-D direct field solver. The paper does point out that only minimal consideration was given to algorithms that may be used to implement the relativistic extensions. Some concepts were tested on a 1-D code.
A.3.4 S.H. Brecht and V.A. Thomas: "Multidimensional Simulations Using Hybrid Particle Codes"

[BT88] describes the advantages and disadvantages of using a hybrid particle code to simulate plasmas on very large scale lengths (several Debye lengths). By treating the electrons as a massless fluid and the ions as particles, some physics that magnetohydrodynamics (MHD) codes do not provide (MHD assumes charge neutrality, i.e. rho = 0) can be included without the costs of a full particle code. They avoid solving the potential equations by assuming that the plasma is quasi-neutral (n_e approximately n_i), using the Darwin approximation where light waves can be ignored, and assuming the electron mass to be zero. They hence use a predictor-corrector method to solve the simplified equations.
A.3.5 C.S. Lin, A.L. Thring, and J. Koga: "Gridless Particle Simulation Using the Massively Parallel Processor (MPP)"

[LTK88] describes a gridless model where particles are mapped to processors and the grid computations are avoided by using the inverse FFT to compute the particle positions and the electric field. However, since the MPP is a bit-serial SIMD (single instruction, multiple data) architecture with a grid topology and not much local memory, they found that the overhead in communication when computing the reduction sums needed for the charge density was so high that 60% of the CPU time was used in this effort. (See also the description of other articles by Lin in Section A.5.2.)

A.3.6 A. Mankofsky et al.: "Domain Decomposition and Particle Pushing for Multiprocessing Computers"
[M+88] describes parallelization efforts on two production codes, ARGUS and CANDOR, for multiprocessing on systems such as the Cray X-MP and Cray 2. ARGUS is a 3-D system of simulation codes, and the paper pays particular attention to those modules related to PIC codes. These modules include several field solvers (SOR, Chebyshev and FFT) and electromagnetic solvers (leapfrog, generalized implicit and frequency domain), where they claim their 3D FFT solver to be exceptionally fast.

One of the most interesting ideas of the paper, however, is how they use the cache as storage for particles that have left their local domain, whereas the local particles get written to disk. The cached particles then get tagged onto the local particles in their new cells when they get swapped in. Their experience with CANDOR, a 2.5D electromagnetic PIC code, showed that it proved efficient to multi-task over particles (or a group of particles) within a field block. They note the trade-off between the parallelization efficiency, which increases with the number of particle groups, and the vectorization efficiency, which increases with the number of particles per group. The speed-up for their implementation on the Cray X-MP/48 was shown to reach close to the theoretical maximum of 4.

A.4 Parallel PIC -- Survey Articles

A.4.1 John M. Dawson
John Dawson's 1983 Rev. of Modern Physics paper entitled "Particle simulation of plasmas" [Daw83] is a lengthy review of the field modeling charged particles in electric and magnetic fields. The article covers several physical modeling techniques typically associated with particle codes. It is a serial reference that cites more than 100 related papers.

A.4.2 David Walker's Survey Paper

David Walker's "Characterizing the parallel performance of a large-scale particle-in-cell plasma simulation code", from G. Fox's journal Concurrency: Practice and Experience, Dec. '90 issue [Wal90], gives a nice survey of vector and parallel PIC methods. It covers the most basic parallel techniques explored, with an emphasis on the importance of load balancing. The paper stresses that there is a strong need for future work in the Multiple Instruction Multiple Data (MIMD), i.e. distributed memory, computing arena:

"...For MIMD distributed memory multiprocessors alternative decomposition schemes need to be investigated, particularly with inhomogeneous particle distributions."

The paper cites 54 references.
A.4.3 Claire Max: "Computer Simulation of Astrophysical Plasmas"

Max [Max91] gives a nice general description of plasma codes, and of how, in the coming decade, sophisticated numerical models and simulations will play an important role in the field of plasma astrophysics. She points to current efforts on the CRAY machines and to how plasma astrophysics has a genuine need for the governmental supercomputer resources potentially provided by the NSF, DOE, and NASA Supercomputing Centers.

A.5 Other Parallel PIC References

A.5.1 Vector codes -- Horowitz et al.

Nishiguchi (Osaka Univ.) et al.'s 3-page note [NOY85] describes how they bunch particles in their 1-D code to utilize the vector processor on a VP-100 computer. Horowitz (LLNL, later Univ. of Maryland) et al. [Hor87] describes a 2D algorithm with timing analyses done on a similar 3D code for a Cray. Unlike Nishiguchi et al., who employ a fixed grid, Horowitz's approach requires sorting in order to achieve vectorization. This scheme is a bit slower for very large systems, but requires much less storage.

Horowitz et al. later [HSA89] describe a 3D hybrid PIC code which is used to model tilt modes in field-reversed configurations (FRCs). Here, the ions are modeled as collisionless particles, whereas the electrons are treated as an inertialess fluid. A multigrid algorithm is used for the field solve, whereas the leap-frog method is used to push the particles. Horowitz multi-tasks over 3 of the 4 Cray 2 processors in the multigrid phase by computing one dimension on each processor. The interpolation of the fields to the particles was found to be computationally intensive and was hence multi-tasked, achieving an average overlap of about 3 due to the relationship between task length and the time-slice provided for each task by the Cray. (For the Cray they used, the time-slice depended on the size of the code and the priority at which it ran.) The particle push phase similarly got an overlap of about 2 (many, but simpler calculations). These results are hence clearly dependent on the scheduling algorithms of the Cray operating system.
A.5.2 C.S. Lin (Southwest Research Institute, TX) et al.

C.S. Lin's paper Simulations of Beam Plasma Instabilities Using a Parallel Particle-in-cell Code on the MPP (HCCA4) [Lin89b] uses a particle sorting scheme based on rotations (scattering clustered particles through rotations). The same approach is described in the author's similar, but longer, paper Particle-in-cell Simulations of Wave Particle Interactions Using the Massively Parallel Processor [Lin89a].

The papers describe a 1-D electrostatic PIC code implemented on the MPP (Massively Parallel Processor), a 128-by-128 toroidal grid of bit-serial processors located at Goddard. The author simulates up to 524,000 particles on this machine using an FFT solver.

Two previous sorting studies are mentioned in this paper: one where particles are sorted at each time-step (lots of overhead), and another where a "gridless" FFT that used more computations was considered.

In the first study, they mapped the simulation domain directly to the processor array and sorted the particles according to their cell every time step. This was found to be highly inefficient on the MPP due to the excessive I/O required between the array processors and the staging memory. They also point out that the scheme would not remain load-balanced over time, since the fluctuations in electrical forces would cause the particles to distribute non-uniformly over the processors.

In the other study, they developed what they call a gridless model where they map particles to random processors. This approach claims to avoid charge collection by computing the electric forces directly using the Discrete Fourier Transform (see Section 2.3.4). However, it was shown to be 7 times as slow as a similar PIC code on the CRAY, and is consequently dismissed.

The sorting approach, designed for the bit-serial MPP (consisting of the 64-by-64 toroidal grid of bit-serial processors), uses 64 planes to store 524,000 particles. The approach fills only half the particle planes (processor grid) with particles, to make the sorting simpler by being able to shuffle ("rotate") the data easily in this bit-serial SIMD machine. The spare room was used in the "shuffling" process, where congested cells had part of their contents rotated to their northern neighbor, and then west, if necessary, during the "sorting" process. The implementation is clearly tied to the MPP architecture. For nodes with more computational power and a different interconnection network, other techniques will probably prove more useful.

Only two other recent papers by C.S. Lin were found using a Science Citation Index search: one by Elliason from Umea in Geophysical Research Letters, and a paper by him and his co-authors in JGR-Space Sciences. None of these seem to discuss particle sorting.
A.5.3 David Walker (ORNL)

In "The Implementation of a Three-Dimensional PIC Code on a Hypercube Concurrent Processor" [Wal89], Walker describes a 3-D PIC code for the NCUBE, but does not have a full implementation of the code. He uses the "quasi-static crystal accumulator", a kind of gather-scatter sorter proposed by G. Fox et al. See also Walker's general reference in Section A.4.2.
A.5.4 Lubeck and Faber (LANL)

This 1988 journal paper, entitled "Modeling the performance of hypercubes: A case study using the particle-in-cell application" [LF88], is also highly relevant to our work. The paper covers a 2-D electrostatic code benchmarked on the Intel iPSC hypercube. A multigrid algorithm based on the efforts of Fredrickson and McBryan is used for the field calculations, whereas they considered 3 different approaches for the particle push phase.

The first approach was to assign particles to whatever processor has the cell information (observing strict locality). The authors rejected this approach based on the fact that their particles tend to congregate in 10% of the cells, hence causing serious load-imbalance.

The second alternative they considered was to relax the locality constraint by allowing particles to be assigned to processors not necessarily containing their spatial region. The authors argue that the performance of this alternative (move either the grid and/or the particles to achieve a more load-balanced solution) would be a strong function of the particle input distribution.

The alternative they decided to implement replicated the spatial grid among the processors so that an equal number of particles can be processed at each time-step. This achieves a perfect load balance (for homogeneous systems such as the iPSC). To us, this does seem to require a lot of extra overhead in communicating the whole grid to each processor at each time-step, not to mention having to store the entire grid at each processor.

This paper does describe a nice performance model for their approach. The authors comment that they found the partitioning of their PIC algorithm for the hypercube "an order of magnitude greater" compared with a shared memory implementation.
A.5.5 Azari and Lee's Work

Azari and S.-Y. Lee have published several papers on their work on hybrid partitioning for PIC codes [ALO89, AL91, AL90, AL92, Aza92]. Their underlying motivation is to parallelize Particle-In-Cell (PIC) codes through a hybrid scheme of Grid and Particle Partitions.

Partitioning grid space involves partitioning the grid into equal-sized sub-grids, one per available processor element (PE), and distributing the particles among the PEs accordingly. The need to sort particles from time to time is referred to as an undesirable load balancing problem (dynamic load balancing).

A particle partitioning implies, according to their papers, that all the particles are evenly distributed among processor elements (PEs) no matter where they are located on the grid. Each PE keeps track of the same particles throughout the entire simulation. The entire grid is assumed to have to be stored on each PE in order to keep the communication overhead low. The storage requirements for this scheme are larger, and a global sum of the local grid entries is needed after each iteration.

Their hybrid partitioning approach combines these two schemes with the argument that by partitioning the space one can save memory space on each PE, and by partitioning the particles one may attempt to obtain a well-balanced load distribution, which would lead to a high efficiency. Their hybrid partitioning scheme can be outlined as follows (a small sketch of this kind of assignment is given after the list):

1. the grid is partitioned into equal subgrids;
2. a group of PEs is assigned to each block;
3. each grid-block is stored in the local memory of each of the PEs assigned to that block;
4. the particles in each block are initially partitioned evenly among the PEs in that block.
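The following toy sketch (our own illustration with made-up particle counts, not Azari and Lee's code) shows how such a hybrid assignment might be set up:

    /* Toy sketch of the hybrid assignment outlined above: P PEs are grouped
     * over B equal grid blocks, and each block's particles are then dealt
     * out evenly to the PEs in its group.                                  */
    #include <stdio.h>

    int main(void)
    {
        const int P = 8;                    /* processor elements (PEs)   */
        const int B = 4;                    /* equal-sized grid blocks    */
        const int pes_per_block = P / B;
        const int np_block[4] = {1000, 4000, 250, 2750};  /* hypothetical */

        for (int b = 0; b < B; b++) {
            for (int g = 0; g < pes_per_block; g++) {
                int pe = b * pes_per_block + g;   /* PE id within block b */
                /* each PE in the group gets an (almost) equal share of the
                 * block's particles; the block's grid is stored locally.  */
                int share = np_block[b] / pes_per_block
                          + (g < np_block[b] % pes_per_block ? 1 : 0);
                printf("PE %d: block %d, %d particles\n", pe, b, share);
            }
        }
        return 0;
    }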
The papers then go on to describe specific implementations, including hypercube and BBN implementations and corresponding performance evaluations.

A.5.6 Paulette Liewer (JPL) et al.

Liewer has also co-authored several papers on this topic. Her 1988 paper with Decyk, Dawson (UCLA) and G. Fox (Syracuse) [LDDF88] describes a 1-D electrostatic code (1-D UCLA), decomposing the physical domain into sub-domains equal in number to the number of processors available, such that initially each sub-domain has an equal number of particles. Their test-bed was the Mark III 32-node hypercube.

The code uses the 1-D concurrent FFT described in Fox et al. For the particle pushing phase, they divide their grid up into N_p equal-sized sub-domains. However, the authors point out how they need to use a different partitioning for the FFT solver in order to take advantage of the hypercube connections for this phase. (They need to partition the domain according to the Gray Code numbering of the processors.) The code hence passes the grid array among the processors twice at each time step. In the conclusions, they point out that this may not be the case if a finite difference field solution is used in place of the FFT.

In [LZDD89] they describe a similar code named GCPIC (General Concurrent PIC) implemented on the Mark IIIfp (64 processors) that beats the CRAY X-MP.

In 1990 Liewer co-authored a paper [FLD90] describing a 2-D electrostatic code that was periodic in one dimension, with the option of being bounded or periodic in the other dimension. The code used two 1-D FFTs in the solver. Liewer et al. have recently developed a 3D code on the Delta Touchstone (512-node grid) where the grids are replicated on each processor [DDSL93, LLFD93]. Liewer et al. also have a paper on load balancing, described later in this appendix.
A.5.7 Sturtevant and Maccabee (Univ. of New Mexico)

This paper [SM90] describes a plasma code implemented on the BBN TC2000 whose performance was disappointing. They used a shared-memory PIC algorithm that did not map well to the architecture of the BBN, and hence got hit by the high costs of copying very large blocks of read-only data.

A.5.8 Peter MacNeice (Hughes/Goddard)

This paper [Mac93] describes a 3D electromagnetic PIC code re-written for a MasPar with a 128-by-128 processor grid. The MasPar is a Single Instruction Multiple Data (SIMD) machine. The code is based on Oscar Buneman's TRISTAN code. They store the third dimension in virtual memory so that each processor has a grid vector. They use an Eulerian decomposition, and hence need to sort after each time-step. A finite-difference scheme is used for the field solve, whereas the particle push phase is accomplished via the leap-frog method. Since they assume systems with relatively mild inhomogeneities, no load balancing considerations were taken. The fact that they only simulate 400,000 particles in a 105-by-44-by-55 system, i.e. only about one particle per cell, under-utilizing the 128-by-128 processor grid, we assume was due to the memory limitations of the MasPar used (64 Kb/processor).
Appendix B

Calculating and Verifying the Plasma Frequency

B.1 Introduction

In order to verify what our code actually does, an in-depth analysis of one general time step can predict the general behavior of the code. If the code produces plasma oscillations, it is expected that the charge density at a given grid-point behaves proportionally to a cosine wave. That is,

\rho^{n+1}_{ij} = \mathrm{const}\cdot\cos(\omega_p(t+\Delta t)+\phi), \qquad
\rho^{n}_{ij}   = \mathrm{const}\cdot\cos(\omega_p t+\phi).   (B.1)

One cannot expect the code to produce the exact result from the above equations, but rather a Taylor series approximation of them. Looking at the ratio of the above equations, one can hence expect something like:

\frac{\rho^{n+1}_{ij}}{\rho^{n}_{ij}} \approx 1 - \frac{(\omega_p\Delta t)^2}{2} - \omega_p\Delta t\,\tan(\omega_p t+\phi).   (B.2)
B.2 The potential equation

Looking at the potential \phi at each grid point, we assume it is of the form:

\phi_{i,j} = \mathrm{Re}\left(\tilde\phi\, e^{ik_x x}\, e^{ik_y y}\right),   (B.3)

where L_x and L_y are the limits of the grid. Since x = (i-1)h_x and y = (j-1)h_y, the above equation can be re-written:

\phi_{i,j} = \mathrm{Re}\left[\tilde\phi\, e^{i(k_x(i-1)h_x + k_y(j-1)h_y)}\right].   (B.4)

This function can be thought of as a 2-D plane extending into waves of "mountains" and "valleys" in the third dimension.

B.2.1 The FFT solver

Differentiating twice directly would give the following expression for \phi:

\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right)\tilde\phi\left(e^{ik_x x} e^{ik_y y}\right) = -\frac{\rho_{i,j}}{\epsilon_0},   (B.5)

i^2 k_x^2\,\tilde\phi\, e^{ik_x x + ik_y y} + i^2 k_y^2\,\tilde\phi\, e^{ik_x x + ik_y y} = -\frac{\rho_{i,j}}{\epsilon_0}
\quad\Longrightarrow\quad
-(k_x^2 + k_y^2)\,\phi = -\frac{\rho_{i,j}}{\epsilon_0},   (B.6)

which again gives us:

\phi = \frac{\rho_{i,j}}{\epsilon_0}\,\frac{1}{k_x^2 + k_y^2}.   (B.7)

Notice that this is indeed the quantity that we scaled with in the FFT solver.
B.2.2 The SOR solver

Writing Equation 3.46 for \phi_{i,j+1} in terms of \phi_{i,j} yields:

\phi_{i,j+1} = \phi_{i,j}\, e^{ik_y h_y}.

Proceeding in the same fashion for the other neighboring grid points, one obtains the following expression for the Laplacian of \phi:

\phi_{i,j}\left[e^{-ik_y h_y} + e^{ik_y h_y} - 4 + e^{-ik_x h_x} + e^{ik_x h_x}\right]/h^2 = -\frac{\rho_{i,j}}{\epsilon_0}.   (B.8)

Using the exponential form of the expression for cosine, \cos(x) = \frac{1}{2}(e^{ix}+e^{-ix}), we get:

\frac{\phi_{i,j}}{h^2}\left(2\cos(k_y h_y) - 4 + 2\cos(k_x h_x)\right) = -\frac{\rho_{i,j}}{\epsilon_0}.   (B.9)

Rewriting the above equation to fit the expression 1-\cos(x) = 2\sin^2(x/2), using the Taylor series approximation \sin^2(x/2) \approx x^2/4, and considering h = h_x = h_y, we get:

-\frac{4}{h^2}\left(\frac{(k_x^2+k_y^2)h^2}{4}\right)\phi_{i,j} = -\frac{\rho_{i,j}}{\epsilon_0},   (B.10)

which, if we solve for \phi, gives us:

\phi \approx \frac{\rho_{i,j}}{\epsilon_0}\,\frac{1}{k_x^2+k_y^2}.   (B.11)

Notice how the grid spacing quantities h_x and h_y cancel and we obtain the same expression as we did for the FFT solver. This is not surprising, given that our finite difference approximation is indeed supposed to approximate the differential operator.
B.2.3 The field equations

A similar result can then be obtained for the field E using the above result:

E_{x\,i,j} = -\frac{\phi_{i,j+1} - \phi_{i,j-1}}{2h_x}
           = -\phi_{i,j}\left(\frac{e^{ik_x h_x} - e^{-ik_x h_x}}{2h_x}\right)
           = -\phi_{i,j}\,\frac{i\sin(k_x h_x)}{h_x}.   (B.12-B.13)

Again, using a Taylor series approximation and following the same procedure for E_y, we get:

E_{x\,i,j} \approx -ik_x\,\phi_{i,j},   (B.14-B.15)

E_{y\,i,j} \approx -ik_y\,\phi_{i,j}.   (B.16)

Looking at the field at each particle, E_part:

E_{\mathrm{part}\,x} = \tilde E_x\, e^{ik_x x + ik_y y}.   (B.17)

Whether the x and y here are the coordinates of the grid or the particle does not really matter; in this case, it is not important where the E-field is being evaluated. We hence approximate the fields at the particle's position for x and y to be:

E_{\mathrm{part}\,x} \approx E_x, \qquad E_{\mathrm{part}\,y} \approx E_y.   (B.18-B.19)
B.3 Velocity and particle positions

Recall that we used the following equations for updating the velocity and particle positions:

v = v + \frac{qE}{m}\,\Delta t,   (B.20)

x = x + v\,\Delta t.   (B.21)

Here E = \tilde E\, e^{ik_x x_0 + ik_y y_0}. In the equation for the particle update, x on the left hand side is x_0 + \delta(new time-step), whereas x on the right hand side is x_0 + \delta(old time-step). When using the Leap-Frog method, \Delta t in the velocity equation (Equation B.20) was set to \Delta t/2 for the first time-step, so that the velocity and particle positions would be "leaping" over each other.
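As a concrete illustration of Equations B.20-B.21 (a minimal sketch only; the field gather, normalization and boundary handling of the real routine are omitted):

    /* Minimal leap-frog push for one particle in 2-D; illustrative only. */
    typedef struct { double x, y, vx, vy; } Particle;

    void push(Particle *p, double Ex, double Ey,
              double q_over_m, double dt, int first_step)
    {
        /* On the first step the velocity is advanced only dt/2, so that v
         * and x end up staggered ("leaping") by half a time-step.        */
        double dtv = first_step ? 0.5 * dt : dt;

        p->vx += q_over_m * Ex * dtv;   /* v = v + (q/m) E dt   (B.20) */
        p->vy += q_over_m * Ey * dtv;
        p->x  += p->vx * dt;            /* x = x + v dt         (B.21) */
        p->y  += p->vy * dt;
    }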
B.4 Charge density

Assuming the cell-size is shrunk down to the point where there is only one particle per cell, and looking at a grid point and its 4 adjacent cells, each of the particles in the 4 adjacent cells will have a different \delta(x_0, y_0). These must be considered when updating the charge density.

Looking at the distances the particles in the surrounding cells are from the grid point, we can calculate the charge density at each grid point (Figure 3.5; see also Figure 3.3). In the code, the cell indices are found as j = (int)((x0+hx)/hx) and i = (int)((y0+hy)/hy).

Figure 3.5: Contributions of particles in adjacent cells with regard to charge density at grid point (i,j). [The four particles contribute with the bilinear weights ab, a(hy-b), (hx-a)b, and (hx-a)(hy-b).]

Here a = x0 - ((j-1)*hx) and b = y0 - ((i-1)*hy), so plugging these into the equation we used for calculating the charge density (see Section 3.3):

\rho_{i,j} = \Big[\,ab + a(h_y-b) + (h_x-a)\,b + (h_x-a)(h_y-b)\,\Big]\,\frac{\rho_0 N_g}{h_x h_y N_p},   (B.22)

where each of the four weights is evaluated with the offsets (a, b) of the particle in the corresponding adjacent cell.
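For reference, a minimal sketch of this bilinear (cloud-in-cell) deposition for a single particle is shown below; the zero-based array layout, the omission of the rho0*Ng/(hx*hy*Np) normalization, and the assumption that the particle lies strictly in the interior of the grid are simplifications of our own.

    /* Bilinear (cloud-in-cell) deposit of one particle's charge q onto the
     * four surrounding grid points; rho is an (ny x nx) row-major array.  */
    void deposit(double *rho, int nx, int ny,
                 double hx, double hy, double x0, double y0, double q)
    {
        int j = (int)(x0 / hx);          /* lower-left cell indices        */
        int i = (int)(y0 / hy);
        double a = x0 - j * hx;          /* offsets within the cell        */
        double b = y0 - i * hy;
        double wx = a / hx, wy = b / hy; /* normalized weights in [0,1)    */

        rho[ i      * nx +  j     ] += q * (1.0 - wx) * (1.0 - wy);
        rho[ i      * nx + (j + 1)] += q *        wx  * (1.0 - wy);
        rho[(i + 1) * nx +  j     ] += q * (1.0 - wx) *        wy;
        rho[(i + 1) * nx + (j + 1)] += q *        wx  *        wy;
    }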
The unperturbed particle positions are hence as shown in Figure 3.6.

Figure 3.6: Unperturbed positions of particles in adjacent cells with respect to grid point (i,j). [The four particles sit at (x_o, y_o), (x_o+h_x, y_o), (x_o, y_o+h_y), and (x_o+h_x, y_o+h_y).]
Here, the perturbation of each particle can be viewed as:

a = a_0 + \delta_x, \qquad b = b_0 + \delta_y.

Looking at what \delta_x and \delta_y are in terms of, say, (x_0, y_0), we have:

\delta_x = \tilde\delta_x\, e^{ik_x x_0 + ik_y y_0}, \qquad
\delta_y = \tilde\delta_y\, e^{ik_x x_0 + ik_y y_0}.   (B.23)

Note that each of the four particles in the adjacent cells has a different location as well as a different perturbation. Plugging these back into our previous equation:

\rho_{i,j} = \Big[(h_x - a_0 - \delta_x(x_0+h_x, y_0+h_y))(h_y - b_0 - \delta_y(x_0+h_x, y_0+h_y))
 + (h_x - a_0 - \delta_x(x_0+h_x, y_0))(b_0 + \delta_y(x_0+h_x, y_0))
 + (a_0 + \delta_x(x_0, y_0+h_y))(h_y - b_0 - \delta_y(x_0, y_0+h_y))
 + (a_0 + \delta_x(x_0, y_0))(b_0 + \delta_y(x_0, y_0))\Big]\,\frac{\rho_0 N_g}{h_x h_y N_p},   (B.24-B.25)

and expanding the \delta-terms:

\rho_{i,j} = \Big[((h_x - a_0) - \tilde\delta_x e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)})((h_y - b_0) - \tilde\delta_y e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)})
 + ((h_x - a_0) - \tilde\delta_x e^{ik_x(x_0+h_x)+ik_y y_0})(b_0 + \tilde\delta_y e^{ik_x(x_0+h_x)+ik_y y_0})
 + (a_0 + \tilde\delta_x e^{ik_x x_0+ik_y(y_0+h_y)})((h_y - b_0) - \tilde\delta_y e^{ik_x x_0+ik_y(y_0+h_y)})
 + (a_0 + \tilde\delta_x e^{ik_x x_0+ik_y y_0})(b_0 + \tilde\delta_y e^{ik_x x_0+ik_y y_0})\Big]\,\frac{\rho_0 N_g}{h_x h_y N_p}.   (B.26-B.27)

Multiplying out the quantities, and neglecting the \delta_x\delta_y-terms since they are O(\delta^2) and hence negligible in this context:

\rho_{i,j} = \Big[(h_x-a_0)(h_y-b_0) + (h_x-a_0)b_0 + a_0(h_y-b_0) + a_0 b_0
 - \tilde\delta_x e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)}(h_y-b_0) - \tilde\delta_y e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)}(h_x-a_0)
 - \tilde\delta_x e^{ik_x(x_0+h_x)+ik_y y_0}\,b_0 + \tilde\delta_y e^{ik_x(x_0+h_x)+ik_y y_0}(h_x-a_0)
 + \tilde\delta_x e^{ik_x x_0+ik_y(y_0+h_y)}(h_y-b_0) - \tilde\delta_y e^{ik_x x_0+ik_y(y_0+h_y)}\,a_0
 + \tilde\delta_x e^{ik_x x_0+ik_y y_0}\,b_0 + \tilde\delta_y e^{ik_x x_0+ik_y y_0}\,a_0\Big]\,\frac{\rho_0 N_g}{h_x h_y N_p}.   (B.28)

The first term inside the brackets is merely h_x h_y. Looking at the \tilde\delta_x terms, we get:

\delta_x\text{-terms} = (h_y - b_0)\big[-1 + e^{-ik_x h_x}\big]\tilde\delta_x e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)}
 + b_0\big[-1 + e^{-ik_x h_x}\big]\tilde\delta_x e^{ik_x(x_0+h_x)+ik_y y_0}.   (B.29-B.31)

We now use the following approximation for [-1 + e^{-ix}]:

[-1 + e^{-ix}] \approx -1 + \left(1 - ix + \frac{x^2}{2} + \cdots\right) \approx -ix.   (B.32)

Hence we can simplify our \delta_x-terms equation to:

\delta_x\text{-terms} = (-ik_x h_x)\,h_y\,\delta_x(x_0+h_x, y_0+h_y) + b_0\text{-terms}.   (B.33)

Looking further at the b_0-terms:

b_0\text{-terms} = -b_0(-ik_x h_x)\,\delta_x(x_0+h_x, y_0+h_y) + b_0(-ik_x h_x)\,\delta_x(x_0+h_x, y_0)
 = (-ik_x h_x)\,b_0\big[-1 + e^{ik_y h_y}\big]\,\delta_x(x_0+h_x, y_0+h_y)   (B.34)
 = b_0(-ik_x h_x)(ik_y h_y)\,\delta_x(x_0+h_x, y_0+h_y)
 = b_0\,k_x k_y h_x h_y\,\delta_x(x_0+h_x, y_0+h_y).   (B.35)

Since we assume that \delta_x is small, and these b_0-terms carry the additional factor k_y h_y, they can be neglected. Treating the \tilde\delta_y-terms in the same fashion and collecting the surviving first-order contributions leads to the perturbed charge density given in Equation B.46 below. The key relations from the preceding sections can be summarized as:
E_{\mathrm{part}\,x} \approx E_x, \qquad E_{\mathrm{part}\,y} \approx E_y,   (B.43)

v = v + \frac{qE}{m}\,\Delta t,   (B.44)

x = x + v\,\Delta t.   (B.45)
In order to see what kind of oscillation the code will produce, one can look at what happens between two time-steps n and n+1. In other words, it would be helpful to know the following term:

\rho_{i,j} \approx \rho_0\,\frac{N_g}{N_p}\,(-ik_x\delta_x - ik_y\delta_y).   (B.46)

For the purposes of verifying plasma oscillations, we will assume one particle per cell (N_g = N_p = 1). Looking at the n-th iteration, we want the factor relating \rho^{(n+1)}_{i,j} to \rho^{(n)}_{i,j}. To get \rho^{(n+1)}_{i,j}, we plugged Equation B.46 into Equations B.41-B.46, starting with Equation B.41:

\rho^{(n)}_{i,j} \approx \rho_0\,(-ik_x\delta_x - ik_y\delta_y),   (B.47)

\phi = \frac{\rho_{i,j}}{\epsilon_0}\,\frac{1}{k_x^2+k_y^2},   (B.48)

E_{x\,i,j} \approx -ik_x\,\phi_{i,j} = -ik_x\left[\frac{\rho^{\mathrm{old}}_{i,j}}{\epsilon_0(k_x^2+k_y^2)}\right],

and similarly,

E_{y\,i,j} \approx -ik_y\,\phi_{i,j} = -\frac{ik_y}{\epsilon_0(k_x^2+k_y^2)}\,\rho^{\mathrm{old}}_{i,j}.   (B.49)

We still make the same assumptions about the field at each particle:

E_{\mathrm{part}\,x}(i,j) \approx E_x(i,j), \qquad E_{\mathrm{part}\,y}(i,j) \approx E_y(i,j).   (B.50-B.51)

The calculation, however, gets a little more tricky when considering the equations for updating the particle velocities v and positions x. One will here have to keep track of which are the new and which are the old values:

v = v + \frac{qE}{m}\,\Delta t \;\Longrightarrow\; v_0 + v_{\mathrm{new}} = v_0 + v_{\mathrm{old}} + \Delta t\,\frac{q}{m}E.   (B.52)

Similarly, x = x + \Delta t\,v \;\Longrightarrow\;

x_0 + \delta_{\mathrm{new}} = x_0 + \delta_{\mathrm{old}} + \Delta t\,v_{\mathrm{new}}.   (B.53)

We hence have:

v_{\mathrm{new}} = v_{\mathrm{old}} + \Delta t\,\frac{q}{m}E,   (B.54)

\delta_{\mathrm{new}} = \delta_{\mathrm{old}} + \Delta t\,v_{\mathrm{new}}.   (B.55)
We assume for the first time-step that v_{\mathrm{old}} = v^{\mathrm{old}}_x = v^{\mathrm{old}}_y = 0. (The v_0 and x_0 terms actually cancel, being on both sides of the above equations.) Looking at the x-direction (writing \delta_x for the x-component of \delta) and using Equation B.42:

\delta^{\mathrm{new}}_x = \delta^{\mathrm{old}}_x + \Delta t\,\Delta t\,\frac{q}{m}\,E_x(i,j)
                        = \delta^{\mathrm{old}}_x - \Delta t^2\,\frac{q}{m\epsilon_0}\,\frac{ik_x}{k_x^2+k_y^2}\,\rho^{\mathrm{old}}_{i,j}.   (B.56)

Similarly:

\delta^{\mathrm{new}}_y = \delta^{\mathrm{old}}_y - \Delta t^2\,\frac{q}{m\epsilon_0}\,\frac{ik_y}{k_x^2+k_y^2}\,\rho^{\mathrm{old}}_{i,j}.   (B.57)

Going back to the charge density (Equation B.46) and looking at the x-direction:

\rho^{\mathrm{new}}_x(i,j) = (-ik_x\,\delta^{\mathrm{new}}_x)\,\rho_0
 = \left[-ik_x\left(\delta^{\mathrm{old}}_x - \Delta t^2\,\frac{q}{m\epsilon_0}\,\frac{ik_x\,\rho^{\mathrm{old}}_{i,j}}{k_x^2+k_y^2}\right)\right]\rho_0.   (B.58)

We hence have

\rho^{\mathrm{new}}_x(i,j) = \left[-ik_x\,\delta^{\mathrm{old}}_x - K_x\,\rho^{\mathrm{old}}_{i,j}\right]\rho_0,   (B.59)

where

K_x = \frac{\Delta t^2 q\,k_x^2}{m\epsilon_0(k_x^2+k_y^2)}.   (B.60)

Doing the same for \rho^{\mathrm{new}}_y(i,j) and combining the two equations, we get:

\rho^{\mathrm{new}}_{i,j} = \left[(-ik_x\delta^{\mathrm{old}}_x - ik_y\delta^{\mathrm{old}}_y) - (K_x+K_y)\,\rho^{\mathrm{old}}_{i,j}\right]\rho_0
 = \left[1 - (K_x+K_y)\,\rho_0\right]\rho^{\mathrm{old}}_{i,j}.   (B.61-B.62)

Looking at the K_x, K_y terms together, they simplify as follows:

K_x + K_y = \frac{\Delta t^2 q\,k_x^2}{m\epsilon_0(k_x^2+k_y^2)} + \frac{\Delta t^2 q\,k_y^2}{m\epsilon_0(k_x^2+k_y^2)}
          = \frac{(k_x^2+k_y^2)\,\Delta t^2 q}{m\epsilon_0(k_x^2+k_y^2)} = \frac{\Delta t^2 q}{m\epsilon_0}.
We hence have the following expression for the updated charge density:

\rho^{\mathrm{new}}_{i,j} = \left(1 - \frac{\Delta t^2 q\,\rho_0}{m\epsilon_0}\right)\rho^{\mathrm{old}}_{i,j}.   (B.63)

Noting that for the first time-step (where we made the v_{\mathrm{old}} = 0 assumption) the time-step in the velocity update is actually \Delta t/2 (the velocity lags the position update by half a time-step), we hence actually have:

\rho^{\mathrm{new}}_{i,j} = \left(1 - \frac{\Delta t^2}{2}\,\frac{q\,\rho_0}{m\epsilon_0}\right)\rho^{\mathrm{old}}_{i,j}.   (B.64)

The plasma frequency is defined as \omega_p \equiv \sqrt{q\rho_0/(m\epsilon_0)}. Assuming our result gives the first few entries of the Taylor series approximation of \cos(\omega_0\Delta t),

\cos(\omega_0\Delta t) \approx 1 - \frac{\omega_0^2\Delta t^2}{2},   (B.65)

we can identify our frequency as \omega_0 = \sqrt{q\rho_0/(m\epsilon_0)}, and we have hence shown that our code for this experiment should indeed oscillate at the plasma frequency \omega_p, i.e. \omega_0 = \omega_p.
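As a quick numerical sanity check of this result (an illustrative snippet in normalized units, not part of the original test suite), one can compare the predicted first-step ratio 1 - (w_p Dt)^2/2 with cos(w_p Dt) for a few time-steps; the difference should shrink as O(Dt^4):

    /* Compare the predicted first-step density ratio (Eq. B.64) with the
     * leading terms of cos(wp*dt); normalized units are an assumption.   */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double q = 1.0, m = 1.0, eps0 = 1.0, rho0 = 1.0;
        const double wp = sqrt(q * rho0 / (m * eps0));   /* plasma frequency */

        for (double dt = 0.2; dt >= 0.0125; dt *= 0.5) {
            double predicted = 1.0 - 0.5 * dt * dt * q * rho0 / (m * eps0);
            double cosine    = cos(wp * dt);
            printf("dt=%6.4f  predicted=%.8f  cos(wp*dt)=%.8f  diff=%.2e\n",
                   dt, predicted, cosine, predicted - cosine);
        }
        return 0;
    }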
Bibliography [ABL88]N.G.Azari,A.W.Bojanczyk,<strong>and</strong>S.-Y.Lee.Synchronous<strong>and</strong>AsynchronousAlgorithmsforMatrixTranspositiononaMesh-Connected<br />
[AGR88]J.Ambrosiano,L.Greengard,<strong>and</strong>V.Rokhl<strong>in</strong>.TheFastMultipole [AFKW90]IAngus,G.Fox,J.Kim,<strong>and</strong>D.Walker.Solv<strong>in</strong>gProblemsonConcurrentProcessorsVolumeII:S<strong>of</strong>twareforConcurrentProcessors.<br />
ArrayProcessor.InSPIEConf.Proc.,volume975,pages277{288, Prentice-Hall,1990. MethodforGridlessParticleSimulation.ComputerPhysicsCommunication,48:117{125,January1988.<br />
August1988.<br />
[AL90]N.G.Azari<strong>and</strong>S.-Y.Lee.Paralleliz<strong>in</strong>gParticle-<strong>in</strong>-CellSimulationon [AL91]N.G.Azari<strong>and</strong>S.-Y.Lee.HybridPartition<strong>in</strong>gforParticle-<strong>in</strong>-CellSimulationonSharedMemorySystems.InProc.<strong>of</strong>InternationalConf.<br />
Multiprocessors.InProc.<strong>of</strong>InternationalConf.onParallelProcess<strong>in</strong>g,pages352{353.ThePennsylvaniaStateUniversityPress,August<br />
1990.<br />
[ALO89]N.G.Azari,S.-Y.Lee,<strong>and</strong>N.F.Otani.ParallelGather-ScatterAlgo-<br />
[AL92]N.G.Azari<strong>and</strong>S.-Y.Lee.HybridTaskDecompositionforParticle-<strong>in</strong>- Conf.onParallelProcess<strong>in</strong>g,page9999,August1992. CellMethodonMessagePass<strong>in</strong>gSystems.InProc.<strong>of</strong>International onDistributedComputerSystems,pages526{533,May1991.<br />
[AT92]M.P.Allen<strong>and</strong>D.J.Tildesley.ComputerSimulation<strong>of</strong>Liquids.OxfordUniversityPress,1992.149<br />
GoldenGateEnterprises,March1989. percubes,ConcurrentComputers<strong>and</strong>Applications,pages1241{1245. rithmsforParticle-<strong>in</strong>-CellSimulation.InProc.<strong>of</strong>ForthConf.onHy-
[Aza92]N.G.Azari.ANewApproachtoTaskDecompositionforParallel [BCLL92]A.Bhatt,M.Chen,C.-Y.L<strong>in</strong>,<strong>and</strong>P.Liu.AbstractionsforParallel Particle-<strong>in</strong>-CellSimulation.Ph.D.dissertation,School<strong>of</strong>Electrical Eng<strong>in</strong>eer<strong>in</strong>g,CornellUniversity,Ithaca,NY,August1992. 150<br />
[BH19]J.Barnes<strong>and</strong>P.Hut.AhierarchicalO(NlogN)force-calculation [B<strong>in</strong>92]K.B<strong>in</strong>der,editor.Topics<strong>in</strong>AppliedPhysics{Volume71:TheMonte N-bodySimulations.InProc.ScalableHighPerformanceComput<strong>in</strong>g Conference,pages38{45.IEEEComputerSocietyPress,April1992. algorithm.Nature,324(4):446{449,19.<br />
[Bri87]W.L.Briggs.AMultigridTutorial.SIAM,1987. [BL91]C.K.Birdsall<strong>and</strong>A.B.Langdon.PlasmaPhysicsViaComputerSimulations.AdamHilger,Philadelphia,1991.<br />
CarloMethod<strong>in</strong>CondensedMatterPhysics.Spr<strong>in</strong>ger-Verlag,1992. [BT88]S.H.Brecht<strong>and</strong>V.A.Thomas.MultidimensionalSimulationsUs<strong>in</strong>g [BW91]R.L.Bowers<strong>and</strong>J.R.Wilson.NumericalModel<strong>in</strong>g<strong>in</strong>AppliedPhysics [Daw83]J.M.Dawson.Particlesimulation<strong>of</strong>plasmas.Rev.ModernPhysics, 143,January1988. HybridParticleCodes.ComputerPhysicsCommunication,48:135{<br />
[DDSL93]J.M.Dawson,V.K.Decyk,R.Sydora,<strong>and</strong>P.C.Liewer.High- <strong>and</strong>Astrophysics.Jones<strong>and</strong>Bartlett,1991.<br />
[EUR89]A.C.Elster,M.U.Uyar,<strong>and</strong>A.P.Reeves.Fault-TolerantMatrixOperationsonHypercubeMultiprocessors.InF.Ris<strong>and</strong>P.M.Kogge,<br />
editors,Proc.<strong>of</strong>the1989InternationalConferenceonParallelPro-<br />
55(2):403{447,April1983. PerformanceComput<strong>in</strong>g<strong>and</strong>PlasmaPhysics.PhysicsToday,46:64{ 70,March1993.<br />
[FJL+88]G.Fox,M.Johnson,G.Lyzenga,S.Otto,J.Salmon,<strong>and</strong>D.Walker. [FLD90]R.D.Ferraro,P.C.Liewer,<strong>and</strong>V.K.Decyk.A2DElectrostaticPIC gust1989.Vol.III. cess<strong>in</strong>g,pages169{176.ThePennsylvaniaStateUniversityPress,Au-<br />
pages440{444.IEEEComputerSocietyPress,April1990. editors,Proc.<strong>of</strong>theFifthDistributedMemoryComput<strong>in</strong>gConference, Solv<strong>in</strong>gProblemsonConcurrentProcessorsVolumeI:GeneralTechniques<strong>and</strong>RegularProblems.Prentice-Hall,1988.<br />
CodefortheMARKIIIHypercube.InD.W.Walker<strong>and</strong>Q.F.Stout,
[Har64]F.H.Harlow.TheParticle-<strong>in</strong>-CellComput<strong>in</strong>gMethodforFluidDynamics.InB.Alder,S.Fernbach,<strong>and</strong>A.Rotenberg,editors,Methods<br />
<strong>in</strong>ComputationalPhysics,volumeVol.3,pages319{343.Academic 151 Press,1964. [Har88]F.H.Harlow.PIC<strong>and</strong>ItsProgeny.ComputerPhysicsCommunication,48:1{10,January1988.<br />
[HE89]R.W.Hockney<strong>and</strong>J.W.Eastwood.ComputerSimulationUs<strong>in</strong>gParticles.AdamHilger,NewYork,1989.<br />
[HJ81] R.W. Hockney and C.R. Jesshope. Parallel Computers. Adam Hilger Ltd., Bristol, 1981.

[HL88] D.W. Hewett and A.B. Langdon. Recent Progress with Avanti: A 2.5D EM Direct Implicit PIC Code. Computer Physics Communications, 48:127–133, January 1988.

[Hor87] E.J. Horowitz. Vectorizing the Interpolation Routines of Particle-in-Cell Codes. Journal of Computational Physics, 68:56–65, 1987.

[HSA89] E.J. Horowitz, D.E. Schumaker, and D.V. Anderson. QN3D: A Three-Dimensional Quasi-neutral Hybrid Particle-in-Cell Code with Applications to the Tilt Mode Instability in Field Reversed Configurations. Journal of Computational Physics, 84:279–310, 1989.

[JH87] S.L. Johnsson and C.T. Ho. Algorithms for Multiplying Matrices of Arbitrary Shapes Using Shared Memory Primitives on Boolean Cubes. Technical Report YALEU/DCS/TR-569, Department of Computer Science, Yale University, October 1987.

[LD89] P.C. Liewer and V.K. Decyk. A General Concurrent Algorithm for Plasma Particle-in-Cell Simulation Codes. Journal of Computational Physics, 85:302–322, 1989.

[LDDF88] P.C. Liewer, V.K. Decyk, J.M. Dawson, and G.C. Fox. A Universal Concurrent Algorithm for Plasma Particle-in-Cell Codes. In G. Fox, editor, The Third Conference on Hypercube Concurrent Computers and Applications, pages 1101–1107. ACM, January 1988.

[LF88] O.M. Lubeck and V. Faber. Modeling the performance of hypercubes: A case study using the particle-in-cell application. Parallel Computing, 9:37–52, 1988.

[Lin89a] C.S. Lin. Particle-in-Cell Simulations of Wave Particle Interactions Using the Massively Parallel Processor. In Proc. Supercomputing '89, pages 287–294. ACM Press, November 1989.
[Lin89b] C.S. Lin. Simulations of Beam Plasma Instabilities Using a Parallel Particle-in-Cell Code on the Massively Parallel Processor. In Proc. of Fourth Conf. on Hypercubes, Concurrent Computers and Applications, pages 1247–1254. Golden Gate Enterprises, March 1989.

[LLDD90] P.C. Liewer, E.W. Leaver, V.K. Decyk, and J.M. Dawson. Dynamic Load Balancing in a Concurrent Plasma PIC Code on the JPL/Mark III Hypercube. In D.W. Walker and Q.F. Stout, editors, Proc. of the Fifth Distributed Memory Computing Conference, pages 939–942. IEEE Computer Society Press, April 1990.

[LLFD93] P. Lyster, P.C. Liewer, R. Ferraro, and V.K. Decyk. Implementation of the Three-Dimensional Particle-in-Cell Scheme on Distributed-Memory Multiple-Instruction Multiple-Data Massively Parallel Computers. Preprint, 1993.

[Loa92] C.F. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.

[LTK88] C.S. Lin, A.L. Thring, and J. Koga. Gridless Particle Simulation Using the Massively Parallel Processor. Computer Physics Communications, 48:149–154, January 1988.

[LZDD89] P.C. Liewer, B.A. Zimmerman, V.K. Decyk, and J.M. Dawson. Application of Hypercube Computers to Plasma Particle-in-Cell Simulation Codes. In Proc. Supercomputing '89, pages 284–286. ACM Press, November 1989.

[M+88] A. Mankofsky et al. Domain Decomposition and Particle Pushing for Multiprocessing Computers. Computer Physics Communications, 48:155–165, January 1988.

[Mac93] P. MacNeice. An Electromagnetic PIC Code on the MasPar. In Proc. 6th SIAM Conference on Parallel Processing for Scientific Computing, pages 129–132, Norfolk, VA, March 1993.

[Max91] C.E. Max. Computer Simulation of Astrophysical Plasmas. Computers in Physics, 5(2):152–162, 1991.

[McC89] S.F. McCormick. Multilevel Adaptive Methods for Partial Differential Equations. SIAM, 1989.

[NOY85] A. Nishiguchi, S. Orii, and T. Yabe. Vector Calculation of Particle Code. Journal of Computational Physics, 61:519–522, 1985.

[Ota] N.F. Otani. Personal communications.

[Ram] P.S. Ramesh. Personal communications.

[Sha] J.G. Shaw. Personal communications.

[SHG92] J.P. Singh, J.L. Hennessy, and A. Gupta. Implications of Hierarchical N-body Methods for Multiprocessor Architectures. Technical Report manuscript, Computer Systems Laboratory, Stanford University, 1992.

[SHT+92] J.P. Singh, C. Holt, T. Totsuka, A. Gupta, and J.L. Hennessy. Load Balancing and Data Locality in Hierarchical N-body Methods. Technical Report manuscript, Computer Systems Laboratory, Stanford University, 1992.

[Sil91] Marian Silberstein. Computer Simulation of Kinetic Alfven Waves and Double Layers Along Auroral Magnetic Field Lines. Master's dissertation, School of Electrical Engineering, Cornell University, Ithaca, NY, August 1991.

[SM90] J.E. Sturtevant and A.B. Maccabee. Implementing Particle-In-Cell Plasma Simulation Code on the BBN TC2000. In D.W. Walker and Q.F. Stout, editors, Proc. of the Fifth Distributed Memory Computing Conference, pages 433–439. IEEE Computer Society Press, April 1990.

[SO93] M. Silberstein and N.F. Otani. Computer Simulation of Alfven Waves and Double Layers Along Auroral Magnetic Field Lines. Journal of Geophysical Research, in preparation, 1993.

[SRS93] A. Sameh, J. Riganati, and D. Sarno. Computational Science & Engineering. Computer, 26(10):8–12, October 1993.

[Taj89] Toshiki Tajima. Computational Plasma Physics: With Applications to Fusion and Astrophysics. Addison-Wesley, 1989.

[UR85a] M.U. Uyar and A.P. Reeves. Fault Reconfiguration for the Near Neighbor Problem in a Distributed MIMD Environment. In Proceedings of the 5th International Conference on Distributed Computer Systems, pages 372–379, Denver, CO, May 1985.

[UR85b] M.U. Uyar and A.P. Reeves. Fault Reconfiguration in a Distributed MIMD Environment with a Multistage Network. In Proceedings of the 1985 International Conference on Parallel Processing, pages 798–805, 1985.

[UR88] M.U. Uyar and A.P. Reeves. Dynamic Fault Reconfiguration in a Mesh-Connected MIMD Environment. IEEE Trans. on Computers, 37:1191–1205, October 1988.

[Uya86] M.U. Uyar. Dynamic Fault Reconfiguration in Multiprocessor Systems. Ph.D. dissertation, School of Electrical Engineering, Cornell University, Ithaca, NY, June 1986.

[Wal89] D.W. Walker. The Implementation of a Three-Dimensional PIC Code on a Hypercube Concurrent Processor. In Proc. of Fourth Conf. on Hypercubes, Concurrent Computers and Applications, pages 1255–1261. Golden Gate Enterprises, March 1989.

[Wal90] D.W. Walker. Characterizing the parallel performance of a large-scale, particle-in-cell plasma simulation code. Concurrency: Practice and Experience, 2(4):257–288, December 1990.

[You89] D.M. Young. A Historical Overview of Iterative Methods. Computer Physics Communications, 53:1–17, 1989.

[ZJ89] F. Zhao and S.L. Johnsson. The Parallel Multipole Method on the Connection Machine. Technical Report Series CS89-6, Thinking Machines Corporation, October 1989.