
PARALLELIZATION ISSUES AND PARTICLE-IN-CELL CODES

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Anne Cathrine Elster
August 1994

(c) Anne Cathrine Elster 1994
ALL RIGHTS RESERVED


PARALLELIZATION ISSUES AND PARTICLE-IN-CELL CODES

Anne Cathrine Elster, Ph.D.
Cornell University 1994

"Everything should be made as simple as possible, but not simpler." -- Albert Einstein.

The field of parallel scientific computing has concentrated on parallelization of individual modules such as matrix solvers and factorizers. However, many applications involve several interacting modules. Our analyses of a particle-in-cell code modeling charged particles in an electric field show that these accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization. Our test-bed, a KSR1, is a distributed memory machine with a globally shared addressing space. However, most of the new methods presented hold generally for hierarchical and/or distributed memory systems.

We introduce a novel approach that uses dual pointers on the local particle arrays to keep the particle locations automatically partially sorted. Complexity and performance analyses with accompanying KSR benchmarks have been included for both this scheme and for the traditional replicated grids approach.

The latter approach maintains load-balance with respect to particles. However, our results demonstrate it fails to scale properly for problems with large grids (say, greater than 128-by-128) running on as few as 15 KSR nodes, since the extra storage and computation time associated with adding the grid copies becomes significant.

Our grid partitioning scheme, although harder to implement, does not need to replicate the whole grid. Consequently, it scales well for large problems on highly parallel systems. It may, however, require load balancing schemes for non-uniform particle distributions. Our dual pointer approach may facilitate this through dynamically partitioned grids.

We also introduce hierarchical data structures that store neighboring grid-points within the same cache-line by reordering the grid indexing. This alignment produces a 25% savings in cache-hits for a 4-by-4 cache.

A consideration of the input data's effect on the simulation may lead to further improvements. For example, in the case of mean particle drift, it is often advantageous to partition the grid primarily along the direction of the drift.

The particle-in-cell codes for this study were tested using physical parameters which lead to predictable phenomena including plasma oscillations and two-stream instabilities. An overview of the most central references related to parallel particle codes is also given.


Biographical Sketch

"Cogito ergo sum." (I think, therefore I am.) -- René Descartes, Discourse on Method, 1637.

Anne Cathrine Elster was born just south of the Arctic Circle, in Mo i Rana, Norway, to Synnøve and Nils Loe Elster on October 2, 1962. Her elementary and secondary educations were obtained at Missorado School (1968-70), Monrovia, Liberia; Brevik Barneskole and Åsen Ungdomsskole, Brevik, Norway (1970-78), followed by Porsgrunn videregående skole, Norway, where she completed her Examen Artium in 1981 with a science concentration in chemistry and physics. Awarded a scholarship through the Norway-America Association by the University of Oregon, Eugene, she spent her first year of college officially majoring in pre-business, but slanting her program towards computer science. She enrolled the following year at the University of Massachusetts at Amherst where she received her B.Sc. degree in Computer Systems Engineering cum laude in May 1985. Anne joined the School of Electrical Engineering at Cornell University in September 1985, from which she received an MS degree in August 1988 with Professor Anthony P. Reeves chairing her Committee. She has accepted a position with Schlumberger Well Services at their Austin Systems Center following her Ph.D.


To the memory of Kåre Andreas Bjørnerud


Acknowledgements

"Friends are people with whom I may be sincere. Before them, I may think aloud." -- Author's re-write of quote by Ralph Waldo Emerson, Friends.

First I would like to thank my advisor Prof. Niels F. Otani and my unofficial co-advisors Dr. John G. Shaw and Dr. Palghat S. Ramesh of the Xerox Design Research Institute (DRI) for making it all possible. Their encouragement, support and advice were essential for the development and completion of this work. Thank you for believing in me and letting me get a glimpse of the world of computational physics!

Gratitude is also extended to my Special Committee member Prof. Keshav Pingali of Computer Science and his NUMA group for numerous helpful discussions, and Special Committee member Prof. Soo-Young Lee for his useful suggestions.

I also wish to thank the Cornell Theory Center, the Computer Science Department and their computer staffs, for providing invaluable help and computer resources, and Dr. Gregory W. Zack of the Xerox Design Research Institute for his support.

Special thanks are extended to all my good friends through the years. Without their encouragement and moral support, I would neither have had the confidence nor the spirit needed to get through this degree program. Most of all, I would like to acknowledge my family including my parents, Synnøve and Nils L. Elster (Tusen takk, Mamma og Pappa!) and my siblings, Johan Fredrik and Tone Bente.

Finally, I also wish to express my honor and gratitude to my sponsors including: Xerox Corporation and the National Science Foundation through my advisor's PYI grant, the Mathematical Science Institute, and IBM through the Computer Science Department, for the support through Graduate Research Assistantships; the Norwegian Government for providing scholarships and loans; the School of Electrical Engineering and the Computer Science Department for the support through Teaching Assistantships; and the Royal Norwegian Council for Industrial and Scientific Research, Mr. and Mrs. Robinson through the Thanks to Scandinavia Inc., as well as Cornell University, for the generous fellowships.

This research was also conducted in part using the resources of the Cornell Theory Center, which receives major funding from the National Science Foundation, and New York State. Additional funding comes from the Advanced Research Projects Agency, the National Institutes of Health, IBM Corporation and other members of the center's Corporate Research Institute.

I gratefully acknowledge all the taxpayers in Norway and the United States who probably unknowingly supported my efforts through the above organizations and institutions. May their investments pay off some day!

This dissertation is dedicated to the memory of Kåre Andrew Bjørnerud, my close friend and "brother" who so unexpectedly passed away only a week after defending his Ph.D. in French Literature at Oxford in December 1992. I have not known a finer scholar.


Table of Contents

1 Introduction
1.1 Motivation and Goals
1.1.1 Terminology
1.2 Particle Simulation Models
1.3 Numerical Techniques
1.4 Contributions
1.5 Appendices

2 Previous Work on Particle Codes
2.1 The Origins of Particle Codes
2.2 Parallel Particle-in-Cell codes
2.2.1 Vector and low-order multitasking codes
2.3 Other Parallel PIC References
2.3.1 Fighting data locality on the bit-serial MPP
2.3.2 Hypercube approaches
2.3.3 A MasPar approach
2.3.4 A BBN attempt
2.3.5 Other particle methods
2.4 Load Balancing
2.4.1 Dynamic task scheduling

3 The Physics of Particle Simulation Codes
3.1 Particle Modeling
3.2 The Discrete Model Equations
3.2.1 Poisson's equation
3.2.2 The charge, q
3.2.3 The Plasma Frequency, ωp
3.3 Solving for The Field
3.3.1 Poisson's Equation
3.3.2 FFT solvers
3.3.3 Finite-difference solvers
3.3.4 Neumann boundaries
3.4 Mesh-Particle Interactions
3.4.1 Applying the field to each particle
3.4.2 Recomputing fields due to particles
3.5 Moving the particles
3.5.1 The Mobility Model
3.5.2 The Acceleration Model
3.6 Testing the Code - Parameter Requirements
3.6.1 ωp and the time step
3.6.2 Two-stream Instability Test
3.7 Research Application - Double Layers

4 Parallelization and Hierarchical Memory Issues
4.1 Introduction
4.2 Distributed Memory versus Shared Memory
4.3 The Simulation Grid
4.3.1 Replicated Grids
4.3.2 Distributed Grids
4.3.3 Block-column/Block-row Partitioning
4.3.4 Grids Based on Random Particle Distributions
4.4 Particle Partitioning
4.4.1 Fixed Processor Partitioning
4.4.2 Partial Sorting
4.4.3 Double Pointer Scheme
4.5 Load Balancing
4.5.1 The UDD approach
4.5.2 Load balancing using the particle density function
4.5.3 Load and distance
4.6 Particle sorting and inhomogeneous problems
4.6.1 Dynamic Partitionings
4.6.2 Communication patterns
4.6.3 N-body/Multipole Ideas
4.7 The Field Solver
4.7.1 Processor utilization
4.7.2 Non-uniform grid issues
4.7.3 FFT Solvers
4.7.4 Multigrid
4.8 Input Effects: Electromagnetic Considerations
4.9 Hierarchical Memory Data Structures: Cell Caching

5 Algorithmic Complexity and Performance Analyses
5.1 Introduction
5.2 Model
5.2.1 KSR Specifics
5.2.2 Model parameters
5.2.3 Result Summary
5.3 Serial PIC Performance
5.3.1 Particle Updates - Velocities
5.3.2 Particle Updates - Positions
5.3.3 Calculating the Particles' Contribution to the Charge Density (Scatter)
5.3.4 FFT-solver
5.3.5 Field-Grid Calculation
5.4 Parallel PIC - Fixed Particle Partitioning, Replicated grids
5.4.1 Particle Updates - Velocities
5.4.2 Particle Updates - Positions
5.4.3 Calculating the Particles' Contribution to the Charge Density
5.4.4 Distributed memory FFT solver
5.4.5 Parallel Field-Grid Calculation
5.5 Parallel PIC - Partitioned Charge Grid Using Temporary Borders and Partially Sorted Local Particle Arrays
5.5.1 Particle Updates - Velocities
5.5.2 Particle Updates - Positions
5.5.3 Calculating the Particles' Contribution to the Charge Density
5.5.4 FFT solver
5.5.5 Parallel Field-Grid Calculation
5.6 Hierarchical Data structures: Cell-caching

6 Implementation on the KSR1
6.1 Architecture overview
6.2 Some preliminary timing results
6.3 Parallel Support on the KSR
6.3.1 C versus Fortran
6.4 KSR PIC Code
6.4.1 Porting the serial particle code
6.5 Coding the Parallelizations
6.5.1 Parallelization Using SPC
6.6 Replicating Grids
6.7 Distributed Grid
6.8 FFT Solver
6.9 Grid Scaling
6.10 Particle Scaling
6.11 Testing

7 Conclusions and Future work
7.1 Future Work

A Annotated Bibliography
A.1 Introduction
A.2 Reference Books
A.2.1 Hockney and Eastwood: "Computer Simulations Using Particles"
A.2.2 Birdsall and Langdon: "Plasma Physics Via Computer Simulation"
A.2.3 Others
A.2.4 Fox et al.: "Solving Problems on Concurrent Processors"
A.3 General Particle Methods
A.3.1 F. H. Harlow: "PIC and its Progeny"
A.3.2 J. Ambrosiano, L. Greengard, and V. Rokhlin: The Fast Multipole Method for Gridless Particle Simulation
A.3.3 D. W. Hewett and A. B. Langdon: Recent Progress with Avanti: A 2.5D EM Direct Implicit PIC Code
A.3.4 S. H. Brecht and V. A. Thomas: Multidimensional Simulations Using Hybrid Particle Codes
A.3.5 C. S. Lin, A. L. Thring, and J. Koga: Gridless Particle Simulation Using the Massively Parallel Processor (MPP)
A.3.6 A. Mankofsky et al.: Domain Decomposition and Particle Pushing for Multiprocessing Computers
A.4 Parallel PIC - Survey Articles
A.4.1 John M. Dawson
A.4.2 David Walker's survey Paper
A.4.3 Claire Max: "Computer Simulation of Astrophysical Plasmas"
A.5 Other Parallel PIC References
A.5.1 Vector codes - Horowitz et al.
A.5.2 C. S. Lin (Southwest Research Institute, TX) et al.
A.5.3 David Walker (ORNL)
A.5.4 Lubeck and Faber (LANL)
A.5.5 Azari and Lee's Work
A.5.6 Paulette Liewer (JPL) et al.
A.5.7 Sturtevant and Maccabee (Univ. of New Mexico)
A.5.8 Peter MacNeice (Hughes/Goddard)

B Calculating and Verifying the Plasma Frequency
B.1 Introduction
B.2 The potential equation
B.2.1 The FFT solver
B.2.2 The SOR solver
B.2.3 The field equations
B.3 Velocity and particle positions
B.4 Charge density
B.5 Plugging the equations into each other

Bibliography


List of Tables

2.1 Overview of Parallel PIC References - 2.5D Hybrid
2.2 Overview of Parallel PIC References - Liewer et al. 1988-89
2.3 Overview of Parallel PIC References - Liewer et al. 1990-93
2.4 Overview of Parallel PIC References - Walker
2.5 Overview of Parallel PIC References - Others
3.1 Plasma oscillations and time-step
4.1 Boundary/Area ratios for 2D partitionings with unit area
4.2 Surface/Volume Ratios for 3D partitionings with unit volume
5.1 Performance Complexity - PIC Algorithms
6.1 SSCAL Serial Timing Results
6.2 Serial Performance on the KSR1
6.3 Serial Performance of Particle Code Subroutines
6.4 Distributed Grid - Input Effects


List of Figures

3.1 3-D view of a 2-D particle simulation. Charges are thought of as "rods".
3.2 Calculation of node entry of lower corner of current cell
3.3 Calculation of field at location of particle using bi-linear interpolation
3.4 The Leapfrog Method
3.5 Two-stream instability test. a) Initial conditions, b) waves are forming, c) characteristic two-stream eye.
4.1 Grid point distribution (rows) on each processor
4.2 Inhomogeneous particle distribution
4.3 X-profile for Particle Distribution Shown in Figure 4.2
4.4 New Grid Distribution Due to X-profile in Figure 4.3
4.5 Basic Topologies: (a) 2D Mesh (grid), (b) Ring
4.6 Row storage versus cell caching storage
5.1 Transpose of a 4-by-4 matrix on a 4-processor ring
5.2 Grid Partitioning: a) block-vector, b) subgrid
5.3 Movement of particles leaving their cells, block-vector setting
5.4 Parallel Charge Accumulation: Particles in cell 'A' share grid points with the particles in the 8 neighboring cells. Two of these grid points, 'a' and 'b', are shared with particles updated by another processing node.
5.5 Cache-hits for a 4x4 cell-cached subgrid
6.1 KSR SPC calls parallelizing particle update routines
6.2 Parallel Scalability of Replicated Grid Approach
6.3 Scatter: Distributed Grid versus Replicated Grids
6.4 Distributed Grid Benchmarks
6.5 Grid-size Scaling. Replicated grids; 4 nodes; 262,144 Particles, 100 Time-steps
6.6 Particle Scaling - Replicated Grids


Chapter 1
Introduction

"Once experienced, the expansion of personal intellectual power made available by the computer is not easily given up." -- Sheila Evans Widnall, Chair of the Faculty committee on undergraduate admissions at MIT, Science, August 1983.

1.1 Motivation and Goals

Particle simulations are fundamental in many areas of applied research, including plasma physics, xerography, astrophysics, and semiconductor device physics. So far, these simulations have, due to their high demand for computer resources (especially memory and CPU power), been limited to investigating local effects, typically using up to an order of 1 million particles.

These simulations often involve tracking of charged particles in electric and magnetic fields. The numerical techniques used usually involve assigning charges to simulated particles, solving the associated field equations with respect to simulated mesh points, applying the field solution to the grid, and solving the related equations of motion for the particles. Codes based on these numerical techniques are frequently referred to as Particle-in-Cell (PIC) codes.

Highly parallel computers such as the Kendall Square Research (KSR) machine are becoming a more and more integral part of scientific computation. By developing novel algorithmic techniques to take advantage of these modern parallel machines and employing state-of-the-art particle simulation methods, we are targeting 2-D simulations using at least 10-100 million simulation particles. This will facilitate the study of interesting global physical phenomena previously not possible.

These particle simulation parallelization methods will then be used in our continuing investigation of several problems in magnetospheric and ionospheric physics. In particular, we expect these methods to be applied to an existing simulation which models the energization and precipitation of the electrons responsible for the Aurora Borealis. Professor Otani and his students are currently examining with smaller computers the role that kinetic Alfvén waves (low frequency electromagnetic plasma waves) have in accelerating these electrons along field lines at altitudes of 1 to 2 Earth radii. Kinetic Alfvén waves have been proposed to be important in producing these "auroral" electrons, because the linear mode structure of these waves includes a component of the electric field which is oriented along the Earth's magnetic field, ideal for accelerating electrons downward towards the Earth's ionosphere [Sil91][SO93].

Other codes that target parallel computers should also benefit from many of the techniques developed in this thesis.

1.1.1 Terminology

The terminology used in this thesis is based on the terms commonly associated with computer architecture and parallel processing as well as those adopted by KSR. The following "dictionary" is for the benefit of those unfamiliar with this terminology.

Cache: Fast local memory.

Registers: Really fast local memory; only a few words of storage.

Subcache: KSR-ism for fast local memory, i.e. what generally would be referred to as a processor's cache or local cache. On the KSR1 the 0.5 Mb subcache is split up into a 256 Kb data cache and a 256 Kb instruction cache. This thesis will use the term cache for fast local memory. The term subcache may be used for emphasis when referring to the KSR.

Local Memory: KSR calls the sum of their local memory Allcache and they hence sometimes refer to local memory as "local cache". Note that this is not common practice. This thesis will refer to the distributed memory on each processing element as the more generally used term, local memory. On the KSR, the local memory consists of 128 sets of 16-way associative memory, each with a page size of 16 Kb, giving a total local memory of 32 Mb.

Page: Continuous memory block in local memory (16 Kb on KSR) that may be swapped out to external memory (e.g. disk).

Subpage (also called cache line): Minimum data-package copied into local subcache. On the KSR1 each page is divided into 128 subpages (cache lines) of 128 bytes (16 words).

Thrashing: Swapping of data in and out of cache or local memory due to memory references requesting data from other processors or data spread across page or subpage boundaries.

OS kernel: Operating System (OS) kernel; set of OS programs and routines.

Process: Running program or program segment, typically with a lot of OS kernel support, i.e. processes typically take a while to set up (create) and release and are then sometimes referred to as heavyweight processes (see threads).

Threads: Light-weight processes, i.e. special processes with little OS kernel overhead. On the KSR one typically spawns a parallel program into P threads, where P = no. of available processors, so that there will be one thread running per processor.

Mach threads: Low-level interface to the OS kernel; based on the Mach OS (as opposed to UNIX).

Pthreads (POSIX threads): Higher-level interface to Mach threads based on the IEEE POSIX (Portable Operating System Interface) standard. Most of KSR's parallelism support is built on top of Pthreads. The KSR1 Pthreads adhere to the IEEE POSIX P1003.4a standard.

Note: the terms thread and processor may be used interchangeably when talking about a parallel program that runs one thread per processor. For further details on our test-bed, the KSR1, see Chapter 6.
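Since the test codes described later in this thesis were written in C using Pthreads (see Chapter 6), a minimal, generic POSIX-threads sketch of spawning a program into P threads, one per processor, is given below. This is illustrative only, not KSR-specific SPC code; the thread count and the worker routine are placeholders.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define P 4   /* illustrative: number of available processors */

    /* Work assigned to one thread; in a PIC code this would be, e.g.,
       the update loop over that thread's share of the particles.     */
    static void *worker(void *arg)
    {
        long id = (long) arg;
        printf("thread %ld running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[P];
        long i;

        /* Spawn the program into P threads, one per processor. */
        for (i = 0; i < P; i++)
            if (pthread_create(&tid[i], NULL, worker, (void *) i) != 0) {
                perror("pthread_create");
                exit(EXIT_FAILURE);
            }

        /* Wait for all threads to finish before continuing serially. */
        for (i = 0; i < P; i++)
            pthread_join(tid[i], NULL);

        return 0;
    }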

1.2 Particle Simulation Models

Particle simulation models are often divided up into four categories [HE89]:

1. Correlated systems, which include N-body problems and related models concerning covalent liquids (e.g. molecular dynamics), ionic liquids, stellar clusters, galaxy clusters, etc.,

2. Collisionless systems, including collisionless plasma and galaxies with spiral structures,

3. Collisional systems, including submicron semiconductor devices using the microscopic Monte-Carlo model, and

4. Collision-dominated systems, including semiconductor device simulations using the diffusion model and inviscid, incompressible fluid models using vortex calculations.


In the correlated systems models there is a one-to-one mapping (correlation) between each particle simulated and physical particles modeled (atoms, molecules, ions, stars, or galaxies). Collision-dominated systems, on the other hand, take the other extreme and use a mathematical description that treats the vortex elements as a continuous, incompressible and inviscid fluid.

Our application falls in between, under the second category, collisionless systems, where each simulation particle, often referred to as a "superparticle", may represent millions of physical electrons or ions in a collisionless plasma. The numerical techniques used usually involve assigning charges to simulated particles, solving the associated field equations with respect to simulated mesh points, applying the field solution to the grid, and solving the related equations of motion for the particles. Codes based on these numerical techniques are frequently referred to as Particle-in-Cell (PIC) codes.

Chapter 2 discusses the main references covering current parallel particle codes and related topics. The physics behind the collisionless plasma model is described in further detail in Chapter 3. This chapter also includes discussions of methods used in verifying the parameterizations and the physics behind the code. Tests using real physical parameters demonstrating predictable phenomena including plasma oscillation and two-stream instabilities are discussed.

1.3 Numerical Techniques

Our implementations use 2-D FFTs (assuming periodic boundaries) as the field solvers. Other techniques, including finite difference (e.g. SOR) and multigrid methods, may also be used. The field is applied to each node/grid point using a 1D finite difference equation for each dimension, and then the field is calculated at each particle location using bi-linear interpolation. A leapfrog particle pusher is used to advance the particle positions and velocities. These numerical algorithms are explained in further detail in Chapter 3.
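As a concrete illustration of the gather-and-push step just described, the following minimal C sketch interpolates the field to a particle position with bi-linear interpolation on a periodic grid and advances the particle with a leapfrog step. It is not the thesis code: the grid size, time step, units and charge-to-mass ratio are illustrative placeholders.

    #include <math.h>
    #include <stdio.h>

    #define NX 8
    #define NY 8

    /* Bi-linear interpolation of a grid quantity f at position (x, y),
       measured in units of the grid spacing, with periodic boundaries.  */
    static double gather(double f[NX][NY], double x, double y)
    {
        int i = (int) x, j = (int) y;           /* lower-left grid point */
        double dx = x - i, dy = y - j;          /* offsets within cell   */
        int i1 = (i + 1) % NX, j1 = (j + 1) % NY;
        return (1.0 - dx) * (1.0 - dy) * f[i][j]  + dx * (1.0 - dy) * f[i1][j]
             + (1.0 - dx) * dy         * f[i][j1] + dx * dy         * f[i1][j1];
    }

    /* One leapfrog step: velocities lead/lag positions by half a time step,
       v += (q/m) E(x) dt, then x += v dt (with periodic wrap-around).      */
    static void push(double *x, double *y, double *vx, double *vy,
                     double ex[NX][NY], double ey[NX][NY], double qm, double dt)
    {
        *vx += qm * gather(ex, *x, *y) * dt;
        *vy += qm * gather(ey, *x, *y) * dt;
        *x = fmod(*x + *vx * dt + NX, NX);
        *y = fmod(*y + *vy * dt + NY, NY);
    }

    int main(void)
    {
        static double ex[NX][NY], ey[NX][NY];   /* zero field for this demo */
        double x = 1.5, y = 2.5, vx = 0.25, vy = 0.0;
        int n;
        for (n = 0; n < 10; n++)
            push(&x, &y, &vx, &vy, ex, ey, -1.0, 0.1);
        printf("x = %g, y = %g\n", x, y);
        return 0;
    }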


However, our choice of particular numerical methods is not the focus of this work; instead, we will concentrate on the general approaches of how to parallelize them as they interact with other sections of a code. The primary goal of this work is hence to investigate how best to incorporate parallel methods within numerical algorithms with an eye towards maintaining their applicability to more sophisticated numerical methods that may be developed in the future.

1.4 Contributions

This dissertation provides an in-depth study of how to parallelize a fairly complex code with several interacting modules. Algorithmic complexity and performance analyses of the parallel algorithms developed herein are included. These analyses highlight both the parallel performance of the individual blocks of code and also give guidelines as to how the choices of parallel methods used in each block of code may influence other parts of the code. We show how these interactions affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization.

Our framework is a physically distributed memory system with a globally shared memory addressing space, such as the KSR supercomputer. However, most of the new methods developed in this dissertation generally hold for high-performance systems with either a hierarchical or distributed memory system. Chapter 4 includes a discussion of distributed versus shared memory environments.

This chapter also describes traditional parallel particle-in-cell methods using replicated and partitioned grids, as well as a novel grid partitioning approach that leads to an efficient implementation facilitating dynamically partitioned grids. This novel approach takes advantage of the shared-memory addressing system and uses a dual pointer scheme on the local particle arrays that keeps the particle locations automatically partially sorted (i.e. sorted to within the local grid partition). Load-balancing techniques associated with this dynamic scheme are also discussed.
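The double pointer scheme itself is developed in Chapter 4. Purely as a rough illustration of keeping a local particle array partially sorted, the C sketch below sweeps the array from both ends and swaps particles that have left the local partition to the tail; the one-dimensional ownership test and all names are hypothetical and are not taken from the thesis.

    #include <stdio.h>

    typedef struct { double x, y, vx, vy; } Particle;

    /* One possible reading of a dual-pointer sweep (a sketch only): after a
       push, walk the local particle array from both ends, swapping particles
       that have left the local grid partition to the tail.  The head stays
       "sorted to within the partition"; the tail holds emigrants to be handed
       to other partitions.  Returns the number of particles still local.     */
    static int partition_local(Particle *p, int np, double x_lo, double x_hi)
    {
        int head = 0, tail = np - 1;
        while (head <= tail) {
            if (p[head].x >= x_lo && p[head].x < x_hi) {
                head++;                      /* still inside: leave in place */
            } else {
                Particle tmp = p[head];      /* emigrant: swap to the tail   */
                p[head] = p[tail];
                p[tail] = tmp;
                tail--;
            }
        }
        return head;                         /* particles [0, head) are local */
    }

    int main(void)
    {
        Particle p[5] = { {0.5}, {3.2}, {1.1}, {4.7}, {1.9} };
        int nlocal = partition_local(p, 5, 0.0, 2.0); /* partition owns x in [0,2) */
        printf("%d local particles\n", nlocal);
        return 0;
    }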


Chapter 4 also introduces hierarchical data structures that are tailored for both the cache size and the problem's memory access structure, and shows that further improvements can be made by considering the input data's effect on the simulation.

Complexity and performance analyses of the methods discussed in Chapter 4 are covered in Chapter 5. Chapter 6 describes our test-bed, the KSR1 supercomputer, on which our test-codes were implemented in C using Pthreads. Optimizations were guided by the methods dictated by our analytical and experimental results.
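The cell-caching data structures mentioned above are likewise detailed in Chapter 4. As a rough sketch of the reordered grid indexing idea only (not the thesis implementation), the mapping below stores each 4-by-4 block of grid points contiguously, so that the 16 doubles of a block fit in one 128-byte KSR1 subpage; the grid size and the index formula are illustrative assumptions.

    #include <stdio.h>

    #define NX 16                 /* grid points per dimension (illustrative) */
    #define B  4                  /* 4-by-4 block = 16 doubles = 128 bytes,   */
                                  /* i.e. one KSR1 subpage (cache line)       */

    /* Conventional row-major index of grid point (i, j). */
    static int row_major(int i, int j) { return i * NX + j; }

    /* Blocked ("cell-cached") index: all 16 points of a 4-by-4 subgrid are
       stored contiguously, so a particle and its interpolation neighbors
       usually touch a single cache line rather than several distant rows.   */
    static int blocked(int i, int j)
    {
        int bi = i / B, bj = j / B;          /* which block                   */
        int oi = i % B, oj = j % B;          /* offset within the block       */
        return (bi * (NX / B) + bj) * B * B + oi * B + oj;
    }

    int main(void)
    {
        /* Vertically adjacent points are NX apart in row-major storage but
           stay within the same 16-element block in the blocked ordering.     */
        printf("(5,6): row-major %d, blocked %d\n", row_major(5, 6), blocked(5, 6));
        printf("(6,6): row-major %d, blocked %d\n", row_major(6, 6), blocked(6, 6));
        return 0;
    }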

1.5 Appendices

An annotated bibliography of the references described in Chapter 2 is provided in Appendix A. Appendix B shows how to verify that the numerical methods used produce the expected plasma oscillation by calculating and analyzing what happens during one general time-step.


Chapter 2
Previous Work on Particle Codes

"We live in reference to past experience and not to future events, however inevitable." -- Herbert G. Wells [1866-1946], Mind and the End of Its Tether.

"Study the past if you would divine the future." -- Confucius [551-479 B.C.], The Analects.

2.1 The Origins of Particle Codes

The origins of PIC codes date back to 1955 when Harlow and his colleagues at Los Alamos [Har88, Har64] developed them and related methods for fluid dynamics calculations. Their original work was a 1-D code that had 2 or more dimensions in mind (their 1-D code did not compete with similar codes of that era). According to Hockney and Eastwood [HE89], this work laid the foundation for Morse and Nielson's, and Birdsall's Berkeley group's, introduction of higher-order interpolation (Cloud-in-Cell) schemes for plasmas in 1969.

The first particle models of electrostatic plasmas are, however, due to Buneman's group at Stanford and Birdsall and Bridges at Berkeley in the late '50s, but their work did not use a mesh for the field calculations.


Dawson [Daw83] gives a lengthy review of the field of modeling charged particles in electric and magnetic fields and covers several physical modeling techniques typically associated with particle codes. Several good reference books on particle simulations have also been published in the last few years [HE89][BL91][Taj89][BW91]. For a more detailed description of these books and some of the major papers referenced in this chapter, please see the annotated bibliography in Appendix A.

2.2 Parallel Particle-in-Cell codes

Due to their demand for computational power, particle codes are considered good candidates for parallel high performance computer systems [FJL+88][AFKW90][Max91][Wal90]. However, because of their seemingly non-parallel structure, especially in the inhomogeneous cases, much work still remains -- to quote Walker [Wal90]:

"...For MIMD distributed memory multiprocessors alternative decomposition schemes need to be investigated, particularly with inhomogeneous particle distributions."

An overview of the main references pertaining to parallel PIC codes is provided in Tables 2.1-2.5 at the end of this chapter. The central ideas provided by these references are discussed in subsequent sections.

Note that if the assignments of particles (or grids) to processors remain fixed, this is frequently called a Lagrangian decomposition. The parallel PIC codes using a pure Lagrangian decomposition will usually replicate the grid on each processor. In an Eulerian decomposition, the particles are assigned to the processors with their local grid information. As the particles move, they may migrate to another processor where their new local grid information is then stored. An Adaptive Eulerian approach will re-partition the grid in order to achieve a better load balance as particles move and "bunch" together on certain processors over time.
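As a small illustration of the difference between the two decompositions (with purely hypothetical domain size, processor count and function names), the owner of a particle might be computed as follows.

    #include <stdio.h>

    #define NPROCS 8
    #define LX     64.0           /* illustrative domain length in x */

    /* Eulerian decomposition: the owner is determined by the particle's
       position, so a particle crossing a sub-domain boundary migrates
       (its data moves) to another processor.                            */
    static int eulerian_owner(double x)
    {
        int p = (int)(x / (LX / NPROCS));
        return p < 0 ? 0 : (p >= NPROCS ? NPROCS - 1 : p);
    }

    /* Lagrangian decomposition: the owner is fixed by the particle's index,
       independent of where the particle moves; the grid is then typically
       replicated on every processor.                                      */
    static int lagrangian_owner(int particle_index, int nparticles)
    {
        return (int)((long) particle_index * NPROCS / nparticles);
    }

    int main(void)
    {
        printf("particle at x = 37.5 -> processor %d (Eulerian)\n",
               eulerian_owner(37.5));
        printf("particle 1000 of 4096 -> processor %d (Lagrangian)\n",
               lagrangian_owner(1000, 4096));
        return 0;
    }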


2.2.1 Vector and low-order multitasking codes

The first parallelizations of particle codes were done on vector machines. Nishiguchi (Osaka Univ.) et al. have a 3-page note [NOY85] that describes how they bunch particles in their 1-D code to utilize the vector processor on a VP-100 computer. Horowitz (LLNL, later Univ. of Maryland) et al. [Hor87] describes a 2D algorithm with timing analyses done on a similar 3D code for a Cray. Unlike Nishiguchi et al., who employ a fixed grid, Horowitz's approach requires sorting in order to do the vectorization. This scheme is a bit slower for very large systems, but requires much less storage.

Parallelization efforts on two production codes, ARGUS and CANDOR, are described by Mankofsky et al. [M+88]. They include low-order multiprocessing on systems such as the Cray X-MP and Cray 2. ARGUS is a 3-D system of simulation codes including modules for PIC codes. These modules include several field solvers (SOR, Chebyshev and FFT) and electromagnetic solvers (leapfrog, generalized implicit and frequency domain), where they claim their 3D FFT solver to be exceptionally fast. One of the most interesting ideas of the paper, however, is how they used the cache as storage for particles that have left their local domain, whereas the local particles got written to disk. A cached particle was then tagged onto a local particle in its new cell when it got swapped in. Their experience with CANDOR, a 2.5D electromagnetic PIC code, showed that it proved efficient to multi-task over particles (or a group of particles) within a field block. They note the trade-off due to the fact that parallelization efficiency increases with the number of particle groups, whereas the vectorization efficiency increases with the number of particles per group. The speed-up for their implementation on the Cray X-MP/48 was shown to reach close to the theoretical maximum of 4.

Horowitz et al. [HSA89] later describe a 3D hybrid PIC code which is used to model tilt modes in field-reversed configurations (FRCs). Here, the ions are modeled as collisionless particles whereas the electrons are treated as an inertia-less fluid. A multigrid algorithm is used for the field solve, whereas the leapfrog method is used to push the particles. Horowitz multi-tasks over 3 of the 4 Cray 2 processors in the multigrid phase by computing one dimension on each processor. The interpolation of the fields to the particles was found computationally intensive and hence multi-tasked, achieving an average overlap of about 3 due to the relationship between task length and the time-slice provided for each task by the Cray. (For the Cray they used, the time-slice depended on the size of the code and the priority at which it ran.) The particle push phase similarly got an overlap of about 2 (many, but simpler calculations). These results are hence clearly dependent on the scheduling algorithms of the Cray operating system.

2.3 Other Parallel PIC References

With the introduction of distributed memory systems, several new issues are arising in the parallelization of particle codes. Data locality is still the central issue, but rather than trying to fill up a set of vector registers, one now tries to minimize communication costs (global data interaction). In either case, where available, one would like to fill up local cache lines.

2.3.1 Fighting data locality on the bit-serial MPP

C. S. Lin et al. [LTK88, Lin89b, Lin89a] describe several implementations of a 1-D electrostatic PIC code implemented on the MPP (Massively Parallel Processor), a 128-by-128 toroidal grid of bit-serial processors located at Goddard.

They first describe a gridless model [LTK88] where particles are mapped to processors and the grid computations are avoided by using the inverse FFT to compute the particle positions and the electric field. However, since the MPP is a bit-serial SIMD (single instruction, multiple data) architecture with a grid topology and not much local memory, they found that the overhead in communication when computing the reduction sums needed for this technique when computing the charge density was so high that 60% of the CPU time was used in this effort.

In an earlier study, they mapped the simulation domain directly to the processor array and sorted the particles according to their cell every time step. This was found to be highly inefficient on the MPP due to the excessive I/O required between the array processors and the staging memory. They also point out that the scheme would not remain load-balanced over time since the fluctuations in electrical forces would cause the particles to distribute non-uniformly over the processors.

Lin later [Lin89b, Lin89a] uses a particle sorting scheme based on rotations (scatter particles that are clustered through rotations). The author simulates up to 524,000 particles on this machine using an FFT solver. The implementation uses 64 planes to store the particles. This approach fills only half the particle planes (processor grid) with particles to make the sorting simpler by being able to shuffle ("rotate") the data easily on this bit-serial SIMD machine. The spare room was used in the "shuffling" process. Here, congested cells had part of their contents rotated to their northern neighbor, and then to their western neighbor, if necessary, during the "sorting" process.

This implementation is clearly tied to the MPP architecture. For nodes with more computational power and a different interconnection network, other techniques will probably prove more useful.

2.3.2 Hypercube approaches

Unlike the MPP, hypercubes such as the Intel iPSCs, the JPL Mark II, and the NCUBE have floating point processors (some even with vector boards) and a lot more local memory. Their advantage is in their relatively high degree of inter-node connectivity (the number of interconnections between nodes grows logarithmically in the number of nodes) that provides perfect near-neighbor connections for FFT algorithms as well as O(log N) data gathers and scatters. The nodes of these systems are still fairly homogeneous (with the exception of some I/O processors).

Lubeck and Faber (LANL) [LF88] cover a 2-D electrostatic code benchmarked on the Intel iPSC hypercube. A multigrid algorithm based on Fredrickson and McBryan efforts is used for the field calculations, whereas they considered 3 different approaches for the particle push phase.

Their first approach was to assign particles to whichever processor has the cell information (observing strict locality). The authors rejected this approach based on the fact that their particles tended to congregate in 10% of the cells, hence causing serious load-imbalance. The second alternative they considered was to relax the locality constraint by allowing particles to be assigned to processors not necessarily containing their spatial region. The authors argue that the performance of this alternative (move either the grid and/or particles to achieve a more load-balanced solution) would be a strong function of the particle input distribution.

The alternative they decided to implement replicated the spatial grid among the processors so that an equal number of particles could be processed at each time-step. This achieves a perfect load balance (for homogeneous systems such as the iPSC). To us, however, this seems to require a lot of extra over-head in communicating the whole grid to each processor at each time-step, not to mention having to store the entire grid at each processor. They do, however, describe a nice performance model for their approach. The authors comment that they found the partitioning of their PIC algorithm for the hypercube "an order of magnitude greater" compared with a shared memory implementation.

Azari and Lee (Cornell) have published several papers related to Azari's work on hybrid partitioning for PIC codes on hypercubes [ALO89, AL91, AL90, AL92, Aza92]. Their underlying motivation is to parallelize Particle-In-Cell (PIC) codes through a hybrid scheme of Grid and Particle Partitions.

Partitioning grid space involves distributing particles evenly among processors and partitioning the grid into equal-sized sub-grids, one per available processor element (PE). The need to sort particles from time to time is referred to as an undesirable load balancing problem (dynamic load balancing).

A particle partitioning implies, according to their papers, that all the particles are evenly distributed among processor elements (PEs) no matter where they are located on the grid. Each PE keeps track of the same particle throughout the entire simulation. The entire grid is assumed to have to be stored on each PE in order to keep the communication overhead low. The storage requirements for this scheme are larger, and a global sum of the local grid entries is needed after each iteration.

Their hybrid partitioning approach combines these two schemes with the argument that by partitioning the space one can save memory space on each PE, and by partitioning the particles one may attempt to obtain a well-balanced load distribution which would lead to a high efficiency. Their hybrid partitioning scheme can be outlined as follows (a small illustrative sketch follows the list):

1. the grid is partitioned into equal subgrids;

2. a group of PEs are assigned to each block;

3. each grid-block is stored in the local memory of each of the PEs assigned to that block;

4. the particles in each block are initially partitioned evenly among PEs in that block.
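The following C sketch is one possible reading of steps 1-4 above, not Azari and Lee's implementation; the block count, PEs per block, and round-robin assignment within a block are illustrative assumptions.

    #include <stdio.h>

    #define NBLOCKS       4      /* number of grid blocks (illustrative)      */
    #define PES_PER_BLOCK 2      /* PEs assigned to each block (illustrative) */
    #define LX            64.0   /* domain length in x                        */

    /* Steps 1-2: the grid is split into equal blocks and a group of PEs is
       assigned to each block (each of which stores that block locally,
       step 3).  Step 4: within a block, particles are dealt out evenly
       (here round-robin) among that block's PEs.                             */
    static int hybrid_owner(double x, int particle_index)
    {
        int block = (int)(x / (LX / NBLOCKS));
        if (block < 0) block = 0;
        if (block >= NBLOCKS) block = NBLOCKS - 1;
        int pe_within_block = particle_index % PES_PER_BLOCK;
        return block * PES_PER_BLOCK + pe_within_block;
    }

    int main(void)
    {
        /* Two particles in the same block land on different PEs of that block. */
        printf("PE %d\n", hybrid_owner(20.0, 0));
        printf("PE %d\n", hybrid_owner(21.0, 1));
        return 0;
    }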

Walker (ORNL) [Wal89] describes a 3-D PIC code for the NCUBE, but does not have a full implementation of the code. He uses the "quasi-static crystal accumulator", some kind of gather-scatter sorter proposed by G. Fox et al. (See also Walker's general reference in Appendix A.)

Liewer (JPL) has also co-authored several papers on hypercube implementations. Her 1988 paper with Decyk, Dawson (UCLA) and G. Fox (Syracuse) [LDDF88] describes a 1-D electrostatic code named 1-D UCLA, decomposing the physical domain into sub-domains equal in number to the number of processors available such that initially each sub-domain has an equal number of particles. Their test-bed was the Mark III 32-node hypercube.

The code uses the 1-D concurrent FFT described in Fox et al. For the particle pushing phase, they divide their grid up into (N?p) equal-sized sub-domains. However, the authors point out how they need to use a different partitioning for the FFT solver in order to take advantage of the hypercube connection for this phase. (They need to partition the domain according to the Gray Code numbering of the processors.) The code hence passes the grid array among the processors twice at each time step. In the conclusions, they point out that this may not be the case if a finite difference field solution is used in place of the FFT.

In the paper published a year later [LZDD89] they describe a similar code named GCPIC (General Concurrent PIC) implemented on the Mark IIIfp (64 processors) that beats the Cray X-MP. Liewer also co-authored a paper [FLD90] describing a 2-D electrostatic code that was periodic in one dimension, and with the option of bounded or periodic in the other dimension. The code used 2 1-D FFTs in the solver. Liewer et al. have recently developed a 3D code on the Delta Touchstone (512 node grid) where the grids are replicated on each processor [DDSL93, LLFD93]. Liewer et al. also have a paper on load balancing described later in this thesis.

2.3.3 A MasPar approach

MacNeice's paper [Mac93] describes a 3D electromagnetic PIC code re-written for a MasPar with a 128-by-128 grid. The code is based on Oscar Buneman's TRISTAN code. They store the third dimension in virtual memory so that each processor has a grid vector. They use an Eulerian decomposition, and hence need to sort after each time-step. A finite-difference scheme is used for the field solve, whereas the particle push phase is accomplished via the leapfrog method.


considerationsweretaken.Thefacttheyonlysimulate400,000<strong>particle</strong>s<strong>in</strong>a105- the128-by-128processorgrid,weassumewasduetothememorylimitations<strong>of</strong>the by-44-by-55system,i.e.onlyaboutone<strong>particle</strong>per<strong>cell</strong><strong>and</strong>henceunder-utiliz<strong>in</strong>g 16<br />

<strong>in</strong>dividual(local)data. 2.3.4ABBNattempt (SIMD)mach<strong>in</strong>e,i.e.theprocessorssharethe<strong>in</strong>structionstream,butoperateon MasParused(64Kb/processor).TheMasParisaS<strong>in</strong>gleInstructionMultipleData<br />

ourtest-bed,theKSR1,isalsoashared-addressspacesystemwithdistributed bythehighcosts<strong>of</strong>copy<strong>in</strong>gverylargeblocks<strong>of</strong>read-onlydata.LiketheBBN, TC2000whoseperformancewasdisappo<strong>in</strong>t<strong>in</strong>g.Theyusedashared-memoryPIC Sturtevant<strong>and</strong>Maccabee[SM90]describeaplasmacodeimplementedontheBBN<br />

overcamesome<strong>of</strong>theobstaclesthatfaceSturtevant<strong>and</strong>Maccabee. memory.Byfocus<strong>in</strong>gmoreondatalocality<strong>issues</strong>,wewilllatershowhowwe algorithmthatdidnotmapwelltothearchitecture<strong>of</strong>theBBN<strong>and</strong>hencegothit<br />

(alternat<strong>in</strong>gdirectionimplicit)asa2-Ddirecteldsolver.Thepaperdoespo<strong>in</strong>t outthatonlym<strong>in</strong>imalconsiderationwasgiventoalgorithmsthatmaybeusedto 2.3.5Other<strong>particle</strong>methods <strong>and</strong>somerelativisticextensions.ThecodeusesaniterativesolutionbasedonADI D.W.Hewett<strong>and</strong>A.B.Langdon[HL88]describethedirectimplicitPICmethod<br />

(severalD).Bytreat<strong>in</strong>gtheelectronsasamasslessuid<strong>and</strong>theionsas<strong>particle</strong>s, tages<strong>in</strong>us<strong>in</strong>gahybrid<strong>particle</strong>codetosimulateplasmasonverylargescalelengths somephysicsthatmagnetohydrodynamics(MHD)<strong>codes</strong>donotprovide(MHDassumeschargeneutrality,i.e.=0),canbe<strong>in</strong>cludedwithoutthecosts<strong>of</strong>afull<br />

S.H.Brecht<strong>and</strong>V.A.Thomas[BT88]describetheadvantages<strong>and</strong>disadvan-<br />

implementtherelativisticextensions.Someconceptsweretestedona1-Dcode.<br />

<strong>particle</strong>code.Theyavoidsolv<strong>in</strong>gthepotentialequationsbyassum<strong>in</strong>gthatthe


plasmaisquasi-neutral(neni),us<strong>in</strong>gtheDarw<strong>in</strong>approximationwherelight wavescanbeignored,<strong>and</strong>assum<strong>in</strong>gtheelectronmasstobezero.Theyhenceuse apredictor-correctormethodtosolvethesimpliedequations. 17<br />

modernhierarchicalsolvers,<strong>of</strong>whichthemostgeneraltechniqueisthefastmultipolemethod(FMM),toavoidsome<strong>of</strong>thelocalsmooth<strong>in</strong>g,boundaryproblems,<br />

J.Ambrosiano,L.Greengard,<strong>and</strong>V.Rokhl<strong>in</strong>[AGR88]advocatetheuse<strong>of</strong> plasmas<strong>and</strong>beams,<strong>and</strong>plasmas<strong>in</strong>complicatedregions.Thepaperdescribesthe <strong>and</strong>alias<strong>in</strong>gproblemsassociatedwithPICmethodswhenusedtosimulatecold FMMmethodforgridless<strong>particle</strong>simulations<strong>and</strong>howitfareswithrespecttothe<br />

MPP,Lieweretal.[LLDD90]haveimplementedadynamicloadbalanc<strong>in</strong>gscheme 2.4LoadBalanc<strong>in</strong>g Asidefromthescroll<strong>in</strong>gmethodspreviouslymentionedthatL<strong>in</strong>developedforthe aforementionedproblemsassociatedwithPICmethods.<br />

ThecodeisbasedontheelectrostaticcodeGCPIC(see2.5.1).Loadbalanc<strong>in</strong>gwas achievedbytheirAdaptiveEulerianapproachthathaseachprocessorcalculate fora1DelectromagneticPICcodeontheMarkIIIHypercubeatCaltech/JPL.<br />

theirsub-gridboundaries<strong>and</strong>theircurrentnumber<strong>of</strong><strong>particle</strong>s.Theypo<strong>in</strong>tout thattheactualplasmadensityprolecouldbeuseddirectly(computedfortheeld solutionstage),butthatitwouldrequiremorecommunicationtomakethedensity anapproximation<strong>of</strong>theplasmadensityprole<strong>and</strong>us<strong>in</strong>gittocomputethegrid partition<strong>in</strong>g.Thiscalculationrequiresallprocessorstobroadcastthelocation<strong>of</strong><br />

muchlargeramount<strong>of</strong>communication<strong>and</strong>computationoverhead.Resultsfrom proleglobal.Othermethods,suchas<strong>particle</strong>sort<strong>in</strong>g,wereassumedtorequirea testcaseswith5120<strong>particle</strong>srunon8processorswereprovided.Intheloadbalanc<strong>in</strong>gcase,the<strong>particle</strong>distributionwasapproximatedevery5time-steps.


Hennessyetal.[SHG92,SHT+92]havealsodonesome<strong>in</strong>terest<strong>in</strong>gload-balanc<strong>in</strong>g 2.4.1Dynamictaskschedul<strong>in</strong>g studiesforhierarchicalN-bodymethodsthatareworth<strong>in</strong>vestigat<strong>in</strong>g. 18<br />

MultipoleMethod.Thepaperstressescash<strong>in</strong>g<strong>of</strong>communicateddata<strong>and</strong>claims centratesonanalyz<strong>in</strong>gtwoN-bodymethods:theBarnes-HutMethod<strong>and</strong>theFast thatformostrealisticscal<strong>in</strong>gcases,boththecommunicationtocomputationratio, aswellastheoptimalcachesize,growslowlyaslargerproblemsarerunonlarger Therstpaper,entitled\Implications<strong>of</strong>hierarchicalN-bodyMethods"con-<br />

N-bodymethods"focusesonhowtoachieveeective<strong>parallelization</strong>sthroughsimultaneouslyconsider<strong>in</strong>gloadbalanc<strong>in</strong>g<strong>and</strong>optimiz<strong>in</strong>gfordatalocalityforthreodsconsidered<strong>in</strong>therstpaper,theyalsoconsiderarecentmethodforradiosity<br />

Thesecondpaper,entitled\LoadBalanc<strong>in</strong>g<strong>and</strong>DataLocality<strong>in</strong>Hierarchical memoryimplementations. overheadssubstantially<strong>in</strong>creaseswhengo<strong>in</strong>gfromshared-memorytodistributed mach<strong>in</strong>es.Theyalsoshowthattheprogramm<strong>in</strong>gcomplexity<strong>and</strong>performance<br />

ma<strong>in</strong>hierarchicalN-bodymethods.InadditiontotheBarnes-Hut<strong>and</strong>FMMmeth-<br />

<strong>in</strong>dicator<strong>of</strong>theworkassociatedwithit<strong>in</strong>thenext.Unabletondaneective calculations<strong>in</strong>computergraphics. predictivemechanismthatcouldprovideloadbalanc<strong>in</strong>g,thebestapproachedthe authorsendedupwithwassometh<strong>in</strong>gtheycallcost-estimates+steal<strong>in</strong>g.Thisuses s<strong>in</strong>ce<strong>in</strong>thiscase,theworkassociatedwithapatch<strong>in</strong>oneiterationisnotagood Thelatterturnsouttorequireaverydierentapproachtoload-balanc<strong>in</strong>g<br />

prol<strong>in</strong>gorcost-estimatesto<strong>in</strong>itializethetaskqueuesateachprocessor,<strong>and</strong>then useson-the-ytasksteal<strong>in</strong>gtoprovideloadbalanc<strong>in</strong>g. atStanford.Ithas16processorsorganized<strong>in</strong>4clusterswhereeachclusterhas 4MIPSR3000processorsconnectedbyasharedbus.Theclustersareconnected together<strong>in</strong>ameshnetwork.Eachprocessorhastwolevel<strong>of</strong>cachesthatarekept ThetestbedforbothpapersistheexperimentalDASHmultiprocessorlocated


coherent<strong>in</strong>hardware,<strong>and</strong>shareanequalfraction<strong>of</strong>thephysicalmemory. dierentfromoursett<strong>in</strong>g,buttheiruse<strong>of</strong>dynamictaskschedul<strong>in</strong>gtoachieveload Boththeapplications(N-bodysimulations)<strong>and</strong>test-bed(DASH)arefairly 19<br />

<strong>in</strong>formationsuchas\load"<strong>and</strong>\distance"thatrelatestothisapproach. balanc<strong>in</strong>g,isworthnot<strong>in</strong>g.Chapter4discussesouridea<strong>of</strong>us<strong>in</strong>gprocessorsystem


Mank<strong>of</strong>sky2.5DhybridCrayX-MPmulti- Author(s) Table2.1:Overview<strong>of</strong>ParallelPICReferences{2.5DHybrid Type ArchitectureParallel methodsSolverpushersimulated FieldParticleMax.ptcls<br />

Horowiz2.5DhybridCray2 [M+88] etal. (CANDOR)X-MP/48task<strong>in</strong>g (ARGUS)(4proc.) 3D Cray2& multi-3D-FFTleapfrogleapfrog<br />

etal. (4proc.) multi-Multigridleap-frogupto106 &others<br />

(1989-92) [HSA89] Azari&2.5DhybridInteliPSC/2hybridpredictor-sortptcls16,384 Lee quasi-neutralhypercubepartitioncorrectoreacht Darw<strong>in</strong> (32proc)(subgrids 43x43x43<br />

[AL90,AL91]ignorelight, [ALO89] [Aza92] [AL92] (neni, me=0) BBN replicated)<br />

20


Author(s) Table2.2:Overview<strong>of</strong>ParallelPICReferences{Lieweretal.1988-89 TypeArchitectureParallel methods Solver Field ParticleMax.ptcls<br />

Dawson&Fox,statichypercube Liewer, Decyk, electro-MarkIIIdecomp.r<strong>and</strong>.gen 1-D JPL Eulerianforvelocity<br />

FFTGaussian pushersimulated<br />

[LDDF88] Liewer, Decyk, [LD89] GCPIC electro-MarkIII(loadbal. statichypercubetried<strong>in</strong> 1-D JPL [LLDD90]) Eulerian FFT leav<strong>in</strong>g buer720,896ptcls ptcls8128gridpts<br />

21


Author(s)TypeArchitectureParallelFieldParticleMax.ptcls Table2.3:Overview<strong>of</strong>ParallelPICReferences{Lieweretal.1990-93<br />

Ferraro,2-D Decyk,electro-MarkIII JPL replicatetwo methodsSolverpushersimulated<br />

[FLD90] Liewer,3-D LiewerstatichypercubeoneachFFTs Deltaprocessor<br />

replicate?? grid 1-D<br />

[DDSL93] Krucken,electro-Touchstonegrid Ferraro,static Decyck 512-nodeprocessor (Intel) oneach ??(3000gridpts)<br />

1.47x108<br />

(prepr<strong>in</strong>t)<br />

22


Author(s) Type Table2.4:Overview<strong>of</strong>ParallelPICReferences{Walker ArchitectureParallel methodsSolver Field ParticleMax.ptcls<br />

[Wal89]implementedconsidered)<strong>of</strong>gridpts Walker3D,butnohypercubedynamic full-scalePIC(NCUBErout<strong>in</strong>gassumed FFT pushersimulated<br />

Walker estimates<br />

SOS 3-D hypercubeAdaptivecommercialSOScodeestimates simulated EulerSOScode only<br />

[Wal90] onCray only<br />

23


Author(s) L<strong>in</strong>etal. Type Table2.5:Overview<strong>of</strong>ParallelPICReferences{Others<br />

[LTK88]electrostatic(16,384 ArchitectureParallel proc.)<br />

reduction\gridless"<strong>in</strong>verse methods sums <strong>in</strong>v.FFTFFT Solverpushersimulated FieldParticleMax.ptcls<br />

Lubeck& [L<strong>in</strong>89b] [L<strong>in</strong>89a]electrostatic(bit-serial) Faberelectrostatic(64proc)gridoneach 1-D 2-D InteliPSCreplicatemultigridsortptclsGridpts: MPP sort<strong>in</strong>g1-DFFTloadbal.524,000 etc. viashifts<br />

MacNeice [Mac93]basedon [LF88] (1993)electromagn.128x128 3-D MasPar processor 3rdD<strong>in</strong>dierence store niteleapfrog200000 eacht


Chapter3<br />

SimulationCodes ThePhysics<strong>of</strong>Particle<br />

3.1ParticleModel<strong>in</strong>g splendour<strong>of</strong>theirown."{Bertr<strong>and</strong>Russell,WhatIBelieve,1925. theend,thefreshairbr<strong>in</strong>gsvigour,<strong>and</strong>allthegreatspaceshavea \Eveniftheopenw<strong>in</strong>dows<strong>of</strong>scienceatrstmakesusshiver...<strong>in</strong><br />

experienced<strong>in</strong>comput<strong>in</strong>g,butnotnecessarilywithphysics,wealsoreviewthebasic algorithmsassociatedwith<strong>particle</strong>simulations.Asabenettothosewhoare physicsbeh<strong>in</strong>dsome<strong>of</strong>theequationsmodeled,i.e.howonearrivesattheequations Thegoal<strong>of</strong>thischapteristoshowhowtoimplementsome<strong>of</strong>thetypicalnumerical<br />

use<strong>of</strong><strong>particle</strong><strong>codes</strong>. <strong>and</strong>whattheymeanwithrespecttothephysicalworldtheyaredescrib<strong>in</strong>g.Based howthisknowledgecanbeusedtoverify<strong>and</strong>testthe<strong>codes</strong>.F<strong>in</strong>ally,wedescribe examples<strong>of</strong><strong>in</strong>terest<strong>in</strong>gphysicalphenomenathatcanbe<strong>in</strong>vestigatedthroughthe onourunderst<strong>and</strong><strong>in</strong>g<strong>of</strong>thephysicalsystemwearetry<strong>in</strong>gtomodel,wethenshow<br />

manynumericalmethodsavailable.Severalothertechniques,somemorecomplex, Thenumericaltechniquesdescribed<strong>in</strong>thischapterrepresentonlyafew<strong>of</strong>the 25


moresuitableforourfuturegoal,i.e.<strong>parallelization</strong>thanothermorecomplex representativeavor<strong>of</strong>themostcommontechniques,<strong>and</strong>mayalsoprovetobe butpossiblymoreaccurate,exist.However,wefeelthetechniqueschosengivea 26<br />

algorithms. oursimulation.Althoughthephysicalworldisclearly3D<strong>and</strong>themethodsused extendtomultipledimensions,wechosetomodelonly2Ds<strong>in</strong>cethenumerical problems(whichcapturepossibleglobaleects){evenonpresentsupercomputers. computationsotherwisewouldtakeaverylongtimeforthelarger<strong>in</strong>terest<strong>in</strong>g Thediscussionsfollow<strong>in</strong>gconcentrateon2Dsimulations,themodelwechosefor<br />

solutions(us<strong>in</strong>gnergrids).Untilthewholeuniversecanbeaccuratelymodeled 3Dmodel,doanevenlarger-scale2D(or1D)simulation,orobta<strong>in</strong>moredetailed <strong>in</strong>ashortamount<strong>of</strong>time,physicalmodelswillalwaysbesubjecttotrade-os! computationalpowerbecomesavailable,onemighteitherdecidetoexp<strong>and</strong>toa Thisisthereasonmanycurrentserial<strong>particle</strong><strong>codes</strong>modelonly1D.Asmore<br />

3.2TheDiscreteModelEquations Thissectionlooksatwheretheequationsthatweremodeledstemfrom<strong>in</strong>the physics.Inoursimulationsus<strong>in</strong>gtheAccelerationmodel,weassumedthethe equationsbe<strong>in</strong>gmodeledwere:<br />

dv=dt=qE=m dx=dt=v; r2=?0; E=?r; (3.4) (3.3) (3.2) (3.1)<br />

Thefollow<strong>in</strong>gsubsectionswilldescribetheseequation<strong>in</strong>moredetail.


3.2.1Poisson'sequation Theeldequationr2=?=0stemsfromthedierentialform<strong>of</strong>Maxwell's equationswhichsayelectriceldswithspacechargesobey: 27<br />

freecharges.rDisalsocalledthedivergence<strong>of</strong>D. whereDistheelectric\displacement"<strong>and</strong>fisthesourcedensityordensity<strong>of</strong> Comb<strong>in</strong>edwiththeequationformatter(withnopolarizationsorboundcharges) rD=f; (3.5)<br />

<strong>and</strong>withE=grad,r,wehave: divr=?0or D=0E div0E= (3.7) (3.6)<br />

eratorsgrad<strong>and</strong>divalsoyieldstheabovePoisson'sequation: Fora2-Dscalareld(x;y),thesequentialapplication<strong>of</strong>thedierentialop-<br />

r2=?0: (3.8)<br />

orr2=?0: r2=@2 @x2+@2<br />

v=dx=dt;<strong>and</strong>dv=dt=qE=mfollowfromNewton'slaw,wheretheforceFisthe @y2 (3.10) (3.9)<br />

theymean<strong>in</strong>oursimulation. sum<strong>of</strong>theelectricforceqE. Letusnowtakeacloserlookatthechargeq<strong>and</strong>thechargedensity<strong>and</strong>what


Thechargedensity,,isusuallythought<strong>of</strong>asunits<strong>of</strong>charge=(length)3(chargeper 3.2.2Thecharge,q volume).InSI-units,thelengthismeasured<strong>in</strong>meters.Now,volume<strong>in</strong>dicatesa3- 28<br />

Dsystem,whereasoursimulationisa2-Done.To\convert"our2-Dsimulationto a3-Dview,onecanth<strong>in</strong>k<strong>of</strong>eachsimulation<strong>particle</strong>represent<strong>in</strong>garod-likeentity <strong>of</strong><strong>in</strong>nitelengththat<strong>in</strong>the3-Dplane<strong>in</strong>tersectsthex-yplaneperpendicularlyat<br />

rodwouldhencerepresentthechargeqLhz.Consequently,thetotalchargeQ<strong>in</strong> Consideravolumehzhxhy,wherehzistheheight<strong>in</strong>units<strong>of</strong>theunitlength, number<strong>of</strong>simulation<strong>particle</strong>s.Consider<strong>in</strong>gqLasthechargeperunitlength,each withhx<strong>and</strong>hybe<strong>in</strong>gthegridspac<strong>in</strong>g<strong>in</strong>x<strong>and</strong>y,respectively,<strong>and</strong>Npbe<strong>in</strong>gthe thesimulations<strong>particle</strong>'slocation(x0,y0)asshown<strong>in</strong>Figure3.1.<br />

thevolumeis: Asmentioned,thechargedensityisQ/volume: Q=volume=(NpqLhz)=(hzhxhy) Q=NpqLhz (3.12) (3.11)<br />

noticethathzalwayscancelsout<strong>and</strong>that=NpqL=(hxhy)hastherightunits: (no?dim)(charge=length) (length)2 =charge (length)3: (3.13)<br />

Inotherwords,<strong>in</strong>our2-Dsimulation,thechargedensity,,isdenedsothat: hxhyX gridpo<strong>in</strong>ts(i;j)i;j=qLNp; (3.14)<br />

<strong>particle</strong>s(Np),number<strong>of</strong>gridpo<strong>in</strong>ts(Ng),gridspac<strong>in</strong>gs(hx;hy),awellasthe Inoursimulations,wehavesettheelectroncharge(q),thenumber<strong>of</strong>simulation (3.15)


z 29<br />

6 qL=charge/unitlength<br />

> eee<br />

?<br />

y - unitlength<br />

??<br />

,,,,,,,,,,<br />

? 6<br />

-x,<br />

Figure3.1:3-Dview<strong>of</strong>a2-D<strong>particle</strong>simulation.Chargesarethought<strong>of</strong>as\rods". ?:simulation<strong>particle</strong>asviewedon2-Dplane


chargespersimulation<strong>particle</strong>)throughqL: meanchargedensity0,allas<strong>in</strong>putvariables.Notethatbyspecify<strong>in</strong>gtheabove parameters,0determ<strong>in</strong>esthesize<strong>of</strong>thesimulation<strong>particle</strong>s(number<strong>of</strong>electron 30<br />

simulation<strong>particle</strong>. importantthatthe<strong>in</strong>putparametersused<strong>in</strong>thetestareconsistentsothatthe wheren0isthe<strong>particle</strong>density,<strong>and</strong>nisthenumber<strong>of</strong>physical<strong>particle</strong>sper Intest<strong>in</strong>gthecodeforplasmaoscillation,aphysicalphenomenon,itistherefore 0=n0q=nNpqL=(hxhy)<br />

<strong>in</strong>realplasmas.Togetafeelforwheretheseoscillationstemfrom,onecantakea resultsmakesense(physically).<br />

lookatwhathappenstotwo<strong>in</strong>niteverticalplanes<strong>of</strong>charges<strong>of</strong>thesamepolarity 3.2.3ThePlasmaFrequency,!p whentheyareparalleltoeachother. Theplasmafrequencyisdenedas!p=qq0=m0:Theseoscillationsareobserved FromGausswehavethattheuxatone<strong>of</strong>theplanesis:<br />

with: theeldsbuttheverticalones(<strong>in</strong><strong>and</strong>out<strong>of</strong>theplane)cancel,sooneendsup Look<strong>in</strong>gata\pill-box"aroundonesmallareaontheplane,onenoticesthatall IEdA=2ZEdA=2EA<br />

IEdA=q0<br />

Know<strong>in</strong>gthat=nq=Area<strong>and</strong>q=RdA,<strong>and</strong>plugg<strong>in</strong>gthisbacktotheux equationonegets: giv<strong>in</strong>gE==20.S<strong>in</strong>cewehavetwoplanes,thetotaleld<strong>in</strong>oursystemis: EA=A E=0


frequency.FromNewtonwehave:F=ma=qE Nowwewanttoshowthatx00+!2x=0,where!woulddescribetheplasma 31<br />

Hence,onecanwrite: Look<strong>in</strong>gattheeldequationfromGauss<strong>in</strong>terms<strong>of</strong>x:<br />

Iftheplanesappearperiodically,onecouldthendeducethefollow<strong>in</strong>goscillatory E=0=nq x00=a=q 0volx=0x<br />

behavior<strong>of</strong>themovement<strong>of</strong>theplaneswithrespecttooneanother: !p=sq m0x<br />

laws,an<strong>in</strong>-depthanalysis<strong>of</strong>onegeneraltimestepcanbeusedtopredictthegeneral behavior<strong>of</strong>thecode.Theanalyticderivationsprovided<strong>in</strong>AppendixBprovethat Thisisreferredtoastheplasmafrequency. Inordertoverifywhatourcodeactuallywillbehaveaccord<strong>in</strong>gtothephysical m0<br />

3.3Solv<strong>in</strong>gforTheField Theelds<strong>in</strong>whichthe<strong>particle</strong>saresimulatedaretypicallydescribedbypartial dierentialequations(PDEs).Whichnumericalmethodtousetosolvetheseequa-<br />

ournumericalapproach<strong>in</strong>deedproducesthepredictedplasmafrequency.<br />

ablility,variation<strong>of</strong>coecients)<strong>and</strong>theboundaryconditions(periodic,mixedor tionsdependsontheproperties<strong>of</strong>theequations(l<strong>in</strong>earity,dimensionality,seper-<br />

boundariesrequiremethodsthatgothroughaset<strong>of</strong>iterationsimprov<strong>in</strong>gonan solutiondirectly(directmethods),whereasthemorecomplicatedequation<strong>and</strong> simple,isolated). Simple,eldequationscanbesolvedus<strong>in</strong>gmethodsthatcomputetheexact


<strong>in</strong>itialguess(iterativemethods).Forvery<strong>in</strong>homogeneous<strong>and</strong>non-l<strong>in</strong>earsystems, currentnumericaltechniquesmaynotbeabletondanysolutions. Foranoverview<strong>of</strong>some<strong>of</strong>theclassicalnumericalmethodsusedforsolv<strong>in</strong>geld 32<br />

Young[You89]givesaniceoverview<strong>of</strong>iterativemethods.Itshould,however benotedthatnumericalanalysisisstillisanevolv<strong>in</strong>gdiscipl<strong>in</strong>ewhosecurrent <strong>and</strong>futurecontributionsmay<strong>in</strong>uenceone'schoice<strong>of</strong>method.Recent<strong>in</strong>terest<strong>in</strong>gtechniques<strong>in</strong>cludemultigridmethods<strong>and</strong>relatedmultileveladaptivemethods<br />

[McC89].<br />

equations,Chapter6<strong>of</strong>Hockney<strong>and</strong>Eastwood'sbook[HE89]isrecommended.<br />

Poisson'sequation: 3.3.1Poisson'sEquation theGauss'lawforchargeconservationwhichcanbedescribedbythefollow<strong>in</strong>g Thebehavior<strong>of</strong>charged<strong>particle</strong>s<strong>in</strong>appliedelectrostaticeldsisgovernedby<br />

where;<strong>and</strong>0aretheelectrostaticpotential,spacechargedensity,<strong>and</strong>dielectric constant,respectively.TheLaplacian,r2,act<strong>in</strong>gupon(x;y),isforthe2-D caseus<strong>in</strong>gaCartesiancoord<strong>in</strong>atesystem<strong>in</strong>x<strong>and</strong>y: r2=?0 r2=@2<br />

(3.16)<br />

PDE:@2 Comb<strong>in</strong><strong>in</strong>gthisdenitionwithGauss'law,henceyieldsthefollow<strong>in</strong>gelliptic1 @x2+@2@y2 (3.17)<br />

wherea,b,c,d,e,f,<strong>and</strong>garegivenrealfunctionswhicharecont<strong>in</strong>uous<strong>in</strong>someregion<strong>in</strong>the (x,y)plane,satisfyb2


potential,,<strong>and</strong>subsequentlytheeld,E=?r(hencethenameeldsolver). whichisthesecondorderequationweneedtosolve<strong>in</strong>ordertodeterm<strong>in</strong>ethe ThePoisson'sequationisacommonPDEwhichwewillshowcanbesolved 33<br />

us<strong>in</strong>gFFT-basedmethodswhenperiodic<strong>and</strong>certa<strong>in</strong>othersimpleboundaryconditionscanbeassumed.Ourparallelimplementationcurrentlyusesthistype<strong>of</strong><br />

DirectsolversbasedontheFFT(FastFourierTransform)<strong>and</strong>cyclicreduction solver. 3.3.2FFTsolvers O(N2)ormoreoperations<strong>and</strong>theaboveschemesmodelellipticequations,these aPDE<strong>in</strong>O(NlogN)orlessoperations[HE89].S<strong>in</strong>cePDEsolverstypicallyare (CR)schemescomputetheexactsolutions<strong>of</strong>Ndierenceequationsrelat<strong>in</strong>gto schemesarecommonlyreferredtoas\rapidellipticsolvers"(RES).RESmayonly beusedforsomecommonspecialcases<strong>of</strong>thegeneralPDEsthatmayarisefrom theeldequation.However,whentheycanbeused,thearegenerallythemethods equationr2=f)<strong>in</strong>simpleregions(e.g.squaresorrectangles)withcerta<strong>in</strong> <strong>of</strong>choices<strong>in</strong>cetheyarebothfast<strong>and</strong>accurate. simpleboundaryconditions. TheFourierTransform FFTsolversrequirethatthePDEshaveconstantcoecients(e.g.thePoisson<br />

The1DFouriertransform<strong>of</strong>afunctionhisafunctionH<strong>of</strong>!: The<strong>in</strong>verseFouriertransformtakesHbacktotheorig<strong>in</strong>alfunctionh: H(!)=Z1<br />

h(x)=1 2Z1 x=?1h(x)e?i!xdx: !=?1H(!)ei!xd!: (3.20) (3.19)


Ifhisadescription<strong>of</strong>aphysicalprocessasafunction<strong>of</strong>time,then!ismeasured <strong>in</strong>cyclesperunittime(theunit<strong>of</strong>frequency).Hencehis<strong>of</strong>tenlabeledthetime doma<strong>in</strong>description<strong>of</strong>thephysicalprocess,whereasHrepresentstheprocess<strong>in</strong>the 34<br />

frequencydoma<strong>in</strong>.However,<strong>in</strong>our<strong>particle</strong>simulation,hisafunction<strong>of</strong>distance. Inthiscase,!is<strong>of</strong>tenreferredtoastheangularfrequency(radianspersecond), s<strong>in</strong>ceit<strong>in</strong>corporatesthe2factorassociatedwithf,theunit<strong>of</strong>frequency(i.e. !=2f).<br />

Thesecondderivativecansimilarlybeobta<strong>in</strong>edbyascal<strong>in</strong>g<strong>of</strong>Hby?!2: Noticethatdierentiation(or<strong>in</strong>tegration)<strong>of</strong>hleadstoamerescal<strong>in</strong>g<strong>of</strong>Hbyi!: Dierentiationviascal<strong>in</strong>g<strong>in</strong>F dx=1 dx2=1 d2h dh2Z1<br />

!=?1i!H(!)ei!xd!: !=?1(i!)dH dxd! (3.22) (3.21)<br />

Hence,iftheFouriertransform<strong>and</strong>its<strong>in</strong>versecanbecomputedquickly,socan<br />

2Z1 !=?1(i!)[(i!)H(!)]ei!xd! !=?1(?!2)H(!)ei!xd!: (3.23)<br />

Discretiz<strong>in</strong>gforcomputers thederivatives<strong>of</strong>afunctionbyus<strong>in</strong>gtheFourier(frequency)doma<strong>in</strong>. (3.24)<br />

InordertobeabletosolvethePDEonacomputer,wemustdiscretizeoursystem (orone<strong>of</strong>itsperiods<strong>in</strong>theperiodiccase)<strong>in</strong>toaniteset<strong>of</strong>gridpo<strong>in</strong>ts.Inorder tosatisfythediscreteFouriertransform(DFT),thesegridpo<strong>in</strong>tsmustbeequally (theh(x)'s)withaspac<strong>in</strong>gd<strong>in</strong>toNcomplexnumbers(theH(!)'s): spaceddimension-wise(wemayapplya1DDFTsequentiallyforeachdimension toobta<strong>in</strong>amulti-dimensionaltransform).A1DDFTmapsNcomplexnumbers H(h(x))=H(!)N?1 Xn=0fnei!n=N (3.25)


erepresentedbyaFouriers<strong>in</strong>etransform(nocos<strong>in</strong>eterms),i.e.xedvalues boundaryconditions.Hence,acceptableboundaries<strong>in</strong>cludethosethathavecan NoticethattheFourierterms(harmonics)mustalsobeabletosatisfythe 35<br />

thefullDFT. thatareperiodic(systemsthatrepeat/\wrap"aroundthemselves)<strong>and</strong>henceuse (Dirichlet);thoseus<strong>in</strong>gaFouriercos<strong>in</strong>etransform,i.e.slopes(Neumann);orthose<br />

widthlimitedtowith<strong>in</strong>thespac<strong>in</strong>gs(i.eallfrequencies<strong>of</strong>gmustsatisfytheb<strong>and</strong>s y,respectively,thentheSampl<strong>in</strong>gTheoremsaysthatifafunctiongisb<strong>and</strong>-<br />

?hx


1)ComputeFFT((x;y)): ~(kx;ky)=1 LxLyN?1 Xx=0M?1 Xy=0(x;y)e2ixkx=Lxe2iyky=Ly 36<br />

comments)wherekx=2=Lx<strong>and</strong>ky=2=Ly,<strong>and</strong>scal<strong>in</strong>gby1=0. 2)Getby<strong>in</strong>tegrat<strong>in</strong>g(kx;ky)twicebydivid<strong>in</strong>gbyk2=k2x+k2y(seeprevious 3)Inordertogetbacktothegrid,(x;y)takethecorrespond<strong>in</strong>g<strong>in</strong>verse (3.27)<br />

FourierTransform:<br />

3.3.3F<strong>in</strong>ite-dierencesolvers 0(x;y)=1 JLJ?1 kx=0L?1 Xky=0(kx;ky)e?2ixkx=Lxe?2iyky=Ly: X<br />

Relaxation)methodwhichweused<strong>in</strong>somesimpletest-casesunderNeumann(re- Wenow<strong>in</strong>cludethedescription<strong>of</strong>a5-po<strong>in</strong>tnite-dierenceSOR(SuccessiveOver (3.28)<br />

ective)<strong>and</strong>Dirichlet(xed)boundaryconditions. low<strong>in</strong>gisobta<strong>in</strong>edforeachcoord<strong>in</strong>ate: Apply<strong>in</strong>ga5-po<strong>in</strong>tcentralnitedierenceoperatorfortheLaplacian,thefol-<br />

thepotentialatagivennode(i,j),yieldsthefollow<strong>in</strong>g2-Dnite-dierenceformula wherehisthegridspac<strong>in</strong>g<strong>in</strong>x<strong>and</strong>y(hereassumedequal). Comb<strong>in</strong><strong>in</strong>gGauss'law<strong>and</strong>theaboveapproximation,<strong>and</strong>thesolv<strong>in</strong>gfori;j, r2(i;j+1+i?1;j?4i;j+i+1;j+i;j?1)=h2; (3.29)<br />

(Gauss-Seidel): i;j=i;j 0h2+i;j+1+i?1;j+i+1;j+i;j?1=4; (3.30)


fortest<strong>in</strong>gsomesimplecases. Thisisthenitedierenceapproximationweused<strong>in</strong>ourorig<strong>in</strong>alPoissonsolver Tospeeduptheconvergence<strong>of</strong>ourniteapproximationtechniques,weused 37<br />

i;j'swerehenceupdated<strong>and</strong>\accelerated"asfollows: aSuccessiveOverRelaxation(SOR)schemetoupdatethe<strong>in</strong>teriorpo<strong>in</strong>ts.The tmp=(i;j+1+i?1;j+i+1;j+i;j?1)=4; i;j=i;j+!(tmp?i;j) (3.31)<br />

codewasrunforseveraldierent!'s,<strong>and</strong>wedeterm<strong>in</strong>edthatforthegridspac<strong>in</strong>gsweused,itseemedtooptimize(i.e.needfeweriterationbeforeconvergence)<br />

between1.7<strong>and</strong>1.85.Asmentioned,convergencewashereassumedtobeanac-<br />

thatEquation3.31isthesameasEquation3.30.Inourtestimplementation,our where!istheaccelerationfactorassumedtobeoptimalbetween1<strong>and</strong>2.Notice (3.32)<br />

SunSparcstations)calculatedbyupdat<strong>in</strong>gtheexitconditionasfollows: ceptablethresholdforround-o(weused10?14formost<strong>of</strong>ourtestrunsonour<br />

theoptimalaccelerationfactor!. methodsexistsuchasmultigridmethods<strong>and</strong>techniquesus<strong>in</strong>glargertemplate Thereisanextensivenumericalliteraturediscuss<strong>in</strong>ghowtopredict<strong>and</strong>calculate TheSORisafairlycommontechnique.Othernewer<strong>and</strong>possiblymoreaccurate exit=max(tmp?(i;j);exit): (3.33)<br />

(morepo<strong>in</strong>ts).S<strong>in</strong>ceweassumedtheeld<strong>in</strong>ournalcodetobeperiodic,we optedfortheFFT-solvers<strong>in</strong>ceitisamuchmoreaccurate(givesusthedirect<br />

toappear\reected"acrosstheboundary.Inotherwords, solution)<strong>and</strong>als<strong>of</strong>airlyquickmethodthatparallelizeswell. 3.3.4Neumannboundaries NaturalNeumannboundariesareboundarieswherethe<strong>in</strong>teriorpo<strong>in</strong>tsareassumed


forpo<strong>in</strong>tsalongtheNeumannborder.Forour2-Dcase,thenitedierenceupdate d38<br />

forborderpo<strong>in</strong>tsonasquaregridhencebecome: i;j?1?i;j+1 2hx =db;x(Leftorrightborders) dx=0<br />

wheredb;x<strong>and</strong>db;yformtheboundaryconditions(derivative<strong>of</strong>thepotential) forthatboundary.Inthecase<strong>of</strong>impermeableNeumannborders,db=0. i?1;j?i+1;j 2hy =db;y(Toporbottomborders) (3.35) (3.34)<br />

thenplugged<strong>in</strong>tothe2-Dcentralnitedierenceformula,asshownbelow(here: h=hx=hy).Tosimplifynotation,weusedthefollow<strong>in</strong>g"template"forournite dierenceapproximations:k Solv<strong>in</strong>gtheaboveequationforthepo<strong>in</strong>tbeyondtheborder,theresultwas<br />

become: Theequationsforapproximat<strong>in</strong>gthepotentialsattheNeumannbordershence l mo n m=(i,j) k=(i+1,j) o=(i-1,j) l=(i,j-1)<br />

n=(i,j+1)<br />

m=m 0hxhy+2hxdb;x+k+2n+o=4:0[Leftborder](3.36)<br />

SORtoconverge,aswellasfortest<strong>in</strong>gpurposes,some<strong>of</strong>theboundarypo<strong>in</strong>tsmay speciedasDirichlet(xedpotential,i.e.electrodes).Inourtestimplementation Similarequationswerederivedforthetop<strong>and</strong>bottomborders. SystemswithperiodicorNeumannboundariesares<strong>in</strong>gular.Inorderforthe 0hxhy?2hxdb;x+k+2l+o=4:0[Rightborder](3.37)<br />

electrode,<strong>and</strong>theupperrightborderbe<strong>in</strong>ga1Velectrode.Weobservedhow weusedamaskto<strong>in</strong>dicateNeumannorDirichletedges. WetestedourSORcodeonagridwithpart<strong>of</strong>thebottomleftbe<strong>in</strong>ga0V


theeldscontourslookedbyus<strong>in</strong>gourplotfacility.Asexpected,thepotentials \smoothed"betweenthe0V<strong>and</strong>1Velectrode,i.e.thepotentialsvaluesateach gridpo<strong>in</strong>tchangedverylittlelocallyafterconvergence.Agroup<strong>of</strong><strong>particle</strong>s<strong>of</strong>the 39<br />

samecharge<strong>in</strong>itializedclosetoeachotherattheOVelectrodewould,asexpected, overtimemovetowardsthe1Velectrode<strong>and</strong>generallyspreadawayfromeach<br />

Thesolverprovideduswiththepotential(i;j)ateachgrid-po<strong>in</strong>t.TheeldEat 3.4Mesh{ParticleInteractions neartheNeumanleftorrightborder. other.Theywould,asexpected,alsocurveback<strong>in</strong>tothesystemwhencom<strong>in</strong>g<br />

eachcorrespond<strong>in</strong>ggrid-po<strong>in</strong>twasthencalculatedus<strong>in</strong>garstorderdierence<strong>in</strong> eachdirection.TheeldEwashencestored<strong>in</strong>twoarrays,Ex<strong>and</strong>Ey,forthex the<strong>in</strong>teriorpo<strong>in</strong>tswhenapply<strong>in</strong>gtheeldtoeachnodeonthegrid: <strong>and</strong>ydirections,respectively.Weusedthefollow<strong>in</strong>g1-Ddierenceequationsfor Exi;j=(i;j?1?i;j+1)=(2hx) Eyi;j=(i?1;j?i+1;j)=(2hy) (3.38)<br />

3.4.1Apply<strong>in</strong>gtheeldtoeach<strong>particle</strong> canbeviewedasthevectorresult<strong>in</strong>gfromcomb<strong>in</strong><strong>in</strong>gEx<strong>and</strong>Ey. wherehx<strong>and</strong>hyarethegridspac<strong>in</strong>gs<strong>in</strong>x<strong>and</strong>y,respectively.Theactualeld (3.39)<br />

Afunctionwasthenwrittentocalculatetheelectriceldatagiven<strong>particle</strong>'s location(x0,y0),giventheeldgridsEx<strong>and</strong>Ey,theirsize,<strong>and</strong>thegridspac<strong>in</strong>gs canbestbedescribedthroughFigures3.2<strong>and</strong>3.3. hx,hy.(Allparameterswerepassedthroughpo<strong>in</strong>ters).Thenite-elementscheme


(i+1,j) 40 (i+1,j+1)<br />

(i,j).........<br />

(x0,y0)<br />

(i,j+1) hy<br />

Theeld'scontributiontoeach<strong>particle</strong>washencecalculatedas: Figure3.2:Calculation<strong>of</strong>nodeentry<strong>of</strong>lowercorner<strong>of</strong>current<strong>cell</strong> j=(<strong>in</strong>t)((x0+hx)/hx);<br />

Epartx=(Ex(i;j)(hx?a)(hy?b) i=(<strong>in</strong>t)((y0+hy)/hy).<br />

+Ex(i+1;j)(hx?a)b +Ex(i;j+1)a(hy?b) +Ex(i+1;j+1)ab)=(hxhy); (3.43) (3.41) (3.42) (3.40)<br />

Eparty=(Ey(i;j)(hx?a)(hy?b) +Ey(i+1;j)(hx?a)b +Ey(i;j+1)a(hy?b) +Ey(i+1;j+1)ab)=(hxhy); (3.45) (3.47) (3.46) (3.44)<br />

3.4.2Recomput<strong>in</strong>geldsdueto<strong>particle</strong>s Totake<strong>in</strong>toaccountthechargeduetoall<strong>particle</strong>s,theeldsneededtobe recomputedaftereachtimestep.Particleshadtobesynchronized<strong>in</strong>timebefore


(i+1,j) 41<br />

(i,j)......ḣx a(x0,y0)<br />

(i+1,j+1)<br />

where b (i,j+1) hy<br />

Figure3.3:Calculation<strong>of</strong>eldatlocation<strong>of</strong><strong>particle</strong>us<strong>in</strong>gbi-l<strong>in</strong>ear<strong>in</strong>terpolation. b=y0-((i-1)*hy). a=x0-((j-1)*hx);<br />

densitywasupdatedaccord<strong>in</strong>gtothe<strong>particle</strong>s'location{i.e.the<strong>particle</strong>s<strong>in</strong>uence theeld.Theeldsateachnodewereupdatedas<strong>in</strong>theprevioussection. theeldupdate.Asbefore,weassumedan<strong>in</strong>itialchargedensitypernode.This<br />

k=p=(hx+hy): <strong>and</strong>basdened<strong>in</strong>Figure3.3.Thecorrespond<strong>in</strong>gequationsfortheupdateswere Thechargedensitygrid,washenceupdatedaftereachtime-stepus<strong>in</strong>ga<br />

(i+1;j+1)+=abk: (i+1;j)+=(hx?a)bk; (i;j+1)+=a(hy?b)k; (i;j)+=(hx?a)(hy?b)k; (3.51) (3.48) (3.49) (3.50)


totrack<strong>particle</strong>s<strong>in</strong>anelectriceld.Particlecoord<strong>in</strong>ateswereread<strong>in</strong>fromale Us<strong>in</strong>gthePoissonsolverdeveloped<strong>in</strong>theprevioussection,wethendevelopedcode 3.5Mov<strong>in</strong>gthe<strong>particle</strong>s 42<br />

usedtotrack<strong>in</strong>dependent<strong>particle</strong>s<strong>in</strong>theelduntiltheyencounteredaboundary. I.e.theeldswerecomputedatthenodesus<strong>in</strong>gF<strong>in</strong>iteDierences(FD)<strong>and</strong> <strong>in</strong>thelastsection,exceptweallowedforanon-zerosourceterm(r2=S,S6=0). <strong>and</strong>trackeduntiltheyhitaboundary.The<strong>in</strong>itialconditionsweresetasdescribed<br />

<strong>in</strong>terpolatedwithF<strong>in</strong>iteElements(FE). Ahybridnite-element/nitedierencemethod<strong>and</strong>Euler<strong>in</strong>tegrationwere<br />

weremovedaccord<strong>in</strong>gtot,avariablethatgotadjustedsothatthe<strong>particle</strong>s accord<strong>in</strong>gtothe<strong>particle</strong>s'location{i.e.the<strong>particle</strong>s<strong>in</strong>uencedtheeld.The eldsateachnodewereupdatedataxedtime-<strong>in</strong>tervalt,whereasthe<strong>particle</strong>s wouldmovenomorethan1/4<strong>of</strong>a<strong>cell</strong>sideperttime<strong>in</strong>crement.The<strong>particle</strong>s' Theprogramassumedan<strong>in</strong>itialchargedensitypernodewhichgotupdated<br />

trajectorieswererecordedonoutputles. calculations\leap"overthepositions,<strong>and</strong>viceversa. leapfrogmethodwasconsideredfortheaccelerationmodel,wherethevelocities 3.5.1TheMobilityModel Fortrack<strong>in</strong>gthe<strong>particle</strong>s,themobilitymodelupdatesthelocationsdirectly.A<br />

IntheMobilitymodel,thenewlocation(x1,y1)<strong>of</strong>a<strong>particle</strong>atlocation(x0,y0)is calculatedgiventheeld(Epart-x,Epart-y)atthat<strong>particle</strong>,mobility(),timestep(dt),<strong>and</strong>grid(look<strong>in</strong>gat<strong>particle</strong>velocity=dx/dt).Tomakesurereasonable<br />

stepsweretaken,theimplementationshouldcheckwhetherthe<strong>particle</strong>hasmoved border,<strong>and</strong>aagbagwasset. morethan1/4<strong>of</strong>the<strong>particle</strong>'s<strong>cell</strong>'sside(hx,hypassed<strong>in</strong>)with<strong>in</strong>atime-step. Ifso,itmayreducethetime-step,recomputethelocation,<strong>and</strong>returnthenew time-step.Inourtestimplementation,the<strong>particle</strong>swerestoppedwhentheyhita


tionupdates<strong>in</strong>themobilitymodel: Giventhepreviousrout<strong>in</strong>es,thefollow<strong>in</strong>gsimpleequationsdescribedtheloca-<br />

43<br />

bility<strong>of</strong>themedium,,isthesame<strong>in</strong>bothx<strong>and</strong>y.Mosteldsconsideredare<strong>in</strong> Theaboveequationsassumethemediumtobeisotropic,thatisthatthemo-<br />

x1=x0+(Epartxt); y1=y0+(Epartyt): (3.53) (3.52)<br />

Bythelaws<strong>of</strong>physics,<strong>particle</strong>swilltendtobeattractedtoDirichletboundaries suchmedia.<br />

Thelatterhappensbecauseasthe<strong>particle</strong>sclose<strong>in</strong>onaNeumannborder,they (electrodes)withoppositehighcharges,butberepelledfromNeumannboundaries. seetheir\imagecharge"reectedacrosstheborder.S<strong>in</strong>ceequallysignedcharges Test<strong>in</strong>gtheMobilityModel<br />

(y=0)equallyspacedbetweenx=0.4<strong>and</strong>0.6(gridrang<strong>in</strong>gfromx=0to1).We veriedthatthe<strong>particle</strong>sdidnotspreadoutunlesstheeldswererecomputed. repeleachother,the<strong>particle</strong>steersaway<strong>and</strong>doesnotcrossaNeumannborder.<br />

As<strong>in</strong>glesimulation<strong>particle</strong>startednearthebottomplate(0V),wouldalwaysgo electrodeplates,respectively.N<strong>in</strong>e<strong>particle</strong>swerethenstartedatthebottomplate Thecodewastestedby<strong>in</strong>itializ<strong>in</strong>gthebottom<strong>and</strong>topborderas0V<strong>and</strong>1V<br />

IntheAccelerationModelformov<strong>in</strong>gthe<strong>particle</strong>s,theforceonthe<strong>particle</strong>is straightuptothetopplate(1V),s<strong>in</strong>cenoother<strong>particle</strong>swouldbepresentto <strong>in</strong>uenceitseld. 3.5.2TheAccelerationModel proportionaltotheeldstrengthratherthanthevelocity: F=qE (3.54)


whereEistheeld(Ex,Ey)atthat<strong>particle</strong>.Thisisthemodelwewillbeus<strong>in</strong>g<br />

<strong>in</strong>ourparallelizedplasmasimulation. 44<br />

elds<strong>and</strong>then<strong>particle</strong>s'contributiontothem.However,<strong>in</strong>stead<strong>of</strong>therout<strong>in</strong>efor Theleapfrogmethodwasusedforupdat<strong>in</strong>gthe<strong>particle</strong>s'locationaccord<strong>in</strong>gto<br />

comput<strong>in</strong>gthe<strong>particle</strong>'svelocity,theotherforupdat<strong>in</strong>gthenewlocation. mov<strong>in</strong>gthe<strong>particle</strong>us<strong>in</strong>gtheMobilityModel,wenowusedtworout<strong>in</strong>es,onefor thismodel.Adragtermproportionaltothevelocitywasalsoadded<strong>in</strong>. Weusedthesamefunctionsasdescribed<strong>in</strong>thelastsectionforcomput<strong>in</strong>gthe<br />

updatelagahalftime-stepbeh<strong>in</strong>dtheupdate<strong>of</strong>a<strong>particle</strong>'sposition(position with<strong>in</strong>itialspeed(vx0,vy0),giventheforceF,<strong>particle</strong>massm,time-step(t), <strong>and</strong>thegrid. Thefunctionscalculatethenewspeed(vx1,vy1)<strong>of</strong>a<strong>particle</strong>atlocation(x0,y0)<br />

(x,y)\leap<strong>in</strong>gover"thevelocity,thentheotherwayaround: Theleap-frogmethodmodelsa<strong>particle</strong>'smovementbyhav<strong>in</strong>gthevelocity<br />

Splitt<strong>in</strong>gthedirectionalvectors<strong>in</strong>tox<strong>and</strong>ytermsgive: vn+1=2=vn?1=2+(F((xn;yn))=m)t Fx=qEpartx (3.55)<br />

Add<strong>in</strong>gthedragterm,thisgaveusthefollow<strong>in</strong>g<strong>codes</strong>egment: Fy=qEparty vx1=vx0+((Fx/mass)*del_t)-(drag*vx0*del_t); (3.57) (3.56)<br />

Withtheequationsforupdat<strong>in</strong>gthelocations: xn+1=xn+vn+1=2t vy1=vy0+((Fy/mass)*del_t)-(drag*vy0*del_t); (3.58)


' 45<br />

vn?1=2 ''?$<br />

?$<br />

Figure3.4:TheLeapfrogMethod. xn vn+1=2 xn+1 vn+3=2-time<br />

orthe<strong>codes</strong>egment:x1=x0+(vx0*del_t);<br />

periodicsystemisassumed.Aagissetifthe<strong>particle</strong>movesmorethanasystem (1=2t),<strong>and</strong>then\leaps"overthelocationstep(Figure3.4). Noticethatatthersttime-step,thevelocityis\pulledback"halfatime-step Whenthe<strong>particle</strong>hitsaborder,the<strong>particle</strong>re-apearsontheothersideifa y1=y0+(vy0*del_t);<br />

lengthwith<strong>in</strong>atime-step.Inthiscasethetime-stepwillneedtobereduced<strong>in</strong><br />

Initialtest<strong>in</strong>gwasdoneasforthemobilitymodel.Aga<strong>in</strong>,westarted5<strong>particle</strong>sat <strong>in</strong>thischapter. Test<strong>in</strong>gtheAccelerationModel ordertohaveaviablesimulation.Parameterizationwillbediscussedfurtherlater<br />

theupperplate,exceptforcaseswherethe<strong>particle</strong>camefairlyclosetotheirleft orrightborders.Aspredicted,the<strong>particle</strong>shead<strong>in</strong>gfortheborderwerethenbe repelledbytheirimagecharges. thebottomplate<strong>and</strong>showedthattheyspreadoutnicelyastheymovedtowards<br />

simulatetheplasmaoscillationsdescribed<strong>in</strong>thenextsection. Thetrue\acid-test"forourcodewas,however,toseewhetheritcouldcorrectly


Theparametershx,hy,Lx,Ly,t,<strong>and</strong>Npneedtosatisfythefollow<strong>in</strong>gconstra<strong>in</strong>ts 3.6Test<strong>in</strong>gtheCode{Parameter Requirements 46<br />

iscollisionless(Hockney<strong>and</strong>Eastwood[1988]describestheseforthe1-Dcase): <strong>in</strong>orderfortheplasmawavestobeadequatelyrepresented<strong>and</strong>sothatthemodel 1.!pt2,where!pistheplasmaoscillationfrequency.Onecanusually 2.hx;hyD,i.e.thatthespac<strong>in</strong>gD,theDebyelengthdenedtobethe expect!ptbetween.1<strong>and</strong>.2togivetheoptimalspeedversusaccuracy.<br />

3.Lx;LyD. 4.NpDLx;Ly,i.e.number<strong>of</strong>simulation<strong>particle</strong>sperDebyelengthshould characteristicwavelength<strong>of</strong>electrostaticoscillation(D=vT=!p,wherevT<br />

belargecomparedtothesimulationarea.Thisgenerallyguaranteesthat isthethermalvelocity<strong>of</strong>theplasma.<br />

therearealargenumber<strong>of</strong>simulation<strong>particle</strong>s<strong>in</strong>therange<strong>of</strong>thevelocities<br />

3.6.1!p<strong>and</strong>thetimestep regard<strong>in</strong>gboththestability<strong>and</strong>thecorrectness<strong>of</strong>thecode. Byanalyz<strong>in</strong>gthecodecarefullyfromthispo<strong>in</strong>t,condencecanbeachieved nearthephasevelocity<strong>of</strong>unstablewaves.<br />

Toseewhetherthecodecouldproducethecorrectplasmafrerquency,wereformulatedthecodeused<strong>in</strong>theaccelerationmodel(Section3.5.2)tohaveperiodic<br />

boundaryconditionsonallboundaries. lengthnq<strong>in</strong>thedirectionperpendiculartothesimulationplane,<strong>and</strong>massper <strong>particle</strong>planeswilleachbesee<strong>in</strong>ganotherplane<strong>of</strong>chargesacrosstheboundaries unitlengthnm.Wearehencenowmodel<strong>in</strong>gan\<strong>in</strong>nitesystem"wherethetwo Thesystemisthenloadedwithtwob<strong>and</strong>s<strong>of</strong><strong>particle</strong>swithchargeperunit


asitisrepelledfromtheotherb<strong>and</strong><strong>in</strong>its<strong>cell</strong>.Thissystemhencecorrespondsto theoscillatorysystemdescribed<strong>in</strong>Section3.2.3. Assum<strong>in</strong>gthesystemsizeisLxbyLy,the<strong>particle</strong>b<strong>and</strong>sarethenplacedeither 47<br />

thesystemshouldoscillate. closetotheboundaryorclosetothecenter<strong>of</strong>thesystem,aligned<strong>in</strong>thexory direction.Aslongastheb<strong>and</strong>sarenotplacedatdistance<strong>of</strong>12Lfromeachother,<br />

accurateresultthantheSORmethod. solverdescribed<strong>in</strong>Section3.3.2forthePoisson'sequationtoobta<strong>in</strong>amore velocities.S<strong>in</strong>cewenowareassum<strong>in</strong>gperiodicconditions,wecouldusetheFFT-<br />

Ifthecodewascorrectlynormalized,the<strong>particle</strong>planesshouldoscillateback Theleap-frog<strong>particle</strong>-pusherwasusedtoadvancethe<strong>particle</strong>positions<strong>and</strong><br />

<strong>and</strong>forth<strong>in</strong>they-direction(orx-direction)throughthecenter<strong>of</strong>thesystemata frequency<strong>of</strong>!0=!p=2=1period. <strong>in</strong>deedapproximatedtheexpectedtheplasmafrequencyaslongas!pt2. Us<strong>in</strong>gthefollow<strong>in</strong>gknownphysicalparameters(fromPhysicsToday,Aug'93): Wetestedourcodeforseveraldierenttime-stepst<strong>and</strong>veriedthatourcode<br />

<strong>and</strong>the<strong>in</strong>putparameters: 0=8.854187817*10?12; q=-1.6021773*10?19;<br />

0=q*n0=-1.602*10?12(n0=107{typicalforsomeplasmas) drag=0.0;t=(vary<strong>in</strong>g{seebelow);tmax=0.0002 m=9.109389*10?31;<br />

wecancalculatetheexpectedplasmafrequency: !p=sq0 m0=vut (9:10938910?31)(8:85418781710?12)1:78105 (?1:602177310?19)2


Toavoidanygrid-renementproblems/<strong>in</strong>terpolationerrors,weputtheb<strong>and</strong>s 48<br />

atx=0.1875<strong>and</strong>at0.8125whichis<strong>in</strong>thecenter<strong>of</strong>theb<strong>and</strong>srespectivecolumn<strong>of</strong> <strong>cell</strong>sforan8x8system.Wewerehenceabletogetthefollow<strong>in</strong>gtestsdemonstrat<strong>in</strong>g the!ptrelationshipshown<strong>in</strong>Table3.1. above.Noticethatfort=1:010?5,!pt=2,<strong>and</strong>aspredictedbythetheory istheturn<strong>in</strong>gpo<strong>in</strong>tforstability. 3.6.2Two-streamInstabilityTest Theseresults,shown<strong>in</strong>Table3.1,agreewiththetheoreticalresultweobta<strong>in</strong>ed<br />

distributed<strong>particle</strong>s(<strong>in</strong>ourcase2D<strong>particle</strong>grids)areloadedwithopposite<strong>in</strong>itial driftvelocities.Detailedknowledge<strong>of</strong>thenon-l<strong>in</strong>earbehaviorassociatedwithsuch performedatwo-stream<strong>in</strong>stabilitytest[BL91].Inthiscase,twoset<strong>of</strong>uniformly simulationsweredeveloped<strong>in</strong>the'60s. T<strong>of</strong>uthertestwhetherour<strong>codes</strong>wereabletosimulatephysicalsystems,wealso<br />

growexponentially<strong>in</strong>time.Inorderforthistesttowork,caremustbetaken bunch<strong>in</strong>g<strong>of</strong><strong>particle</strong>s<strong>in</strong>theotherstream,<strong>and</strong>viceversa.Theperturbationshence densityperturbation(bunch<strong>in</strong>g)<strong>of</strong>onestreamisre<strong>in</strong>forcedbytheforcesdueto movethrougheachotheronewavelength<strong>in</strong>onecycle<strong>of</strong>theplasmafrequency,the Systemsthatsimulateoppos<strong>in</strong>gstreamsareunstables<strong>in</strong>cewhentwostreams<br />

separatelyus<strong>in</strong>gan<strong>in</strong>itialdriftvelocityvdrift=!pLxforhalfthe<strong>particle</strong>s<strong>and</strong> vdrift=?!pLxfortheotherhalf<strong>of</strong>the<strong>particle</strong>s. <strong>in</strong>choos<strong>in</strong>gthe<strong>in</strong>itialconditions.Inourcase,wechosetotesteachdimension<br />

wassetto1:510?7.The\eyes"appearedwith<strong>in</strong>10time-steps<strong>of</strong>thelargewave teristicnon-l<strong>in</strong>ear\eyes"associatedwithtwo-stream<strong>in</strong>stabilities.Ourtime-step appear<strong>in</strong>g.Noticethatthesearedistanceversusvelocityplotsshow<strong>in</strong>g1Deects. AscanbeseenfromFigures3.6a-c,ourcodewasabletocapturethecharac-


Table3.1:Plasmaoscillations<strong>and</strong>time-step 49<br />

t period Two-b<strong>and</strong>test<br />

5.0*10?5blowsup 10*10?5blowsup (seconds)(*10?5) {passesborders<strong>in</strong>1t! !p=2=period<br />

2.0*10?5blowsup (*105)<br />

0.52*10?53.12-3.64 1.5*10?5blowsup<br />

0.50*10?53.00-3.50 1.2*10?5 1.0*10?53.38(avg) 2.5 2.6,butblowsupafter1period<br />

3.25(avg) 1.86 1.93 2.0<br />

0.48*10?53.36 0.40*10?53.20-3.60 0.30*10?53.20-3.60<br />

1.87<br />

0.25*10?53.50 0.10*10?53.50 3.4(avg) 0.05*10?53.50 0.01*10?53.50 1.795 1.85


position.Weobta<strong>in</strong>edsimilarplotsforacorrespond<strong>in</strong>gtest<strong>of</strong>x<strong>and</strong>vx. Eachdot<strong>in</strong>Figure3.6aactuallyrepresentsallthe<strong>particle</strong>s<strong>in</strong>xmapp<strong>in</strong>gtothey 50<br />

Accesstoalargeparallelsystems,suchastheKSR-1,willallowustoremovethe 3.7ResearchApplication{DoubleLayers restrictionsimposedonusbythespeed<strong>of</strong>thesmallercomputerswecurrentlyuse.<br />

asatellitepass<strong>in</strong>gbelowtheplasma.Thedevelopment<strong>of</strong>the<strong>in</strong>stabilitydependson whenpresent,changestheelectronvelocitydistributionthatwouldbeobservedby plasmaphysics<strong>issues</strong>.First,an<strong>in</strong>stabilityoccurs<strong>in</strong>thepresentsimulationsdueto an<strong>in</strong>teractionbetweentheaccelerated<strong>and</strong>backgroundelectrons.The<strong>in</strong>stability, Ourresearchgrouphopes,thereby,tobeabletoclarifyanumber<strong>of</strong><strong>in</strong>terest<strong>in</strong>g<br />

raterepresentation<strong>of</strong>thiseectrequiresbothahigh-speedplatform<strong>and</strong>aparallel algorithmappropriatefortheproblem.Second,wehaveobservedthepresence <strong>of</strong>anomalousresistivity<strong>in</strong>regions<strong>of</strong>substantialAlfvenwave-generatedelectron <strong>and</strong>the<strong>in</strong>stability.Theperpendicular<strong>and</strong>parallelresolutionrequiredforaccu-<br />

theperpendicularstructure<strong>of</strong>thenonl<strong>in</strong>eardevelopment<strong>of</strong>boththeAlfvenwave<br />

iscurrentlyrestrictedbythesimulationsystemlength.Because<strong>of</strong>theperiodic noiseduetoverylargesuper<strong>particle</strong>s.Third,theevolution<strong>of</strong>anAlfvenwavepulse allowustoemployamuchlargernumber<strong>of</strong><strong>particle</strong>s,enabl<strong>in</strong>gustoreducethe drift.Wewouldliketop<strong>in</strong>po<strong>in</strong>tthecause<strong>of</strong>thisresistivity,butitsmechanismhas<br />

boundaryconditions,employed<strong>in</strong>ourcurrentsimulation,anAlfvenwavepacket provedelusiveduetothepresence<strong>of</strong>substantialnoise.Use<strong>of</strong>parallelismwould<br />

musteventuallytraversearegionpreviouslycrossed,encounter<strong>in</strong>gplasmaconditions<strong>of</strong>itsownwake.Aga<strong>in</strong>,thelongersystempossibleonparallelsystemssuch<br />

astheKSR-1,wouldalleviatethisproblem.Therearealsoothersimulation<strong>issues</strong> whichwouldbenetfromtheresources<strong>and</strong><strong>parallelization</strong>possibilitiespresented <strong>in</strong>thisthesis. Our<strong>in</strong>itialexperiments<strong>in</strong>dicatethattheKSR1isagoodmatchforourproblem.


51<br />

vdrift?vdriftvdrift?vdriftvdrift?vdrift xvxyvy<br />

c)characterstictwo-streameye. Figure3.5:Two-stream<strong>in</strong>stabilitytest.a)Initialconditions,b)wavesareform<strong>in</strong>g,


Wendthecomb<strong>in</strong>ation<strong>of</strong>therelativeease<strong>of</strong>implementationprovidedbyits <strong>and</strong>processorresourcesveryattractive. shared-memoryprogramm<strong>in</strong>genvironmentcomb<strong>in</strong>edwithitssignicantmemory 52<br />

2GBusedforOS,program,<strong>and</strong>datastorage).Each<strong>particle</strong>uses4doubleprecisonquantities(velocity<strong>and</strong>location<strong>in</strong>bothx<strong>and</strong>y)<strong>and</strong>henceoccupies32<br />

arraysneedtot<strong>in</strong>localmemory.GiventhecurrentKSR1'shardware,acon-<br />

Duetothecomputational<strong>and</strong>memoryrequirements<strong>of</strong>ourcode,allmajor servativeestimatewouldimplywearerestrictedto2GB<strong>of</strong>memory(theother<br />

orlarger.Thisshouldenableustostudyeectscurrentlynotseenus<strong>in</strong>gcurrent serial<strong>codes</strong>(us<strong>in</strong>g,forexample,256-by-32grids). oryrestrictions,wewouldthereforeliketomodelsystemsthatare4096-by-256 bytes.Particlecode<strong>in</strong>vestigations<strong>of</strong>auroralaccelerationtypicallyemploy10-100 <strong>particle</strong>spergriddepend<strong>in</strong>gontheeectbe<strong>in</strong>gstudied.Giventhecurrentmem-


Chapter4 Parallelization<strong>and</strong>Hierarchical MemoryIssues<br />

theparallelismfor<strong>in</strong>dividualmodules,suchassolvers,matrixtransposers,factorizers,<strong>particle</strong>pushers,etc.Our<strong>particle</strong>simulationcode,however,isfairlycomplex<br />

4.1Introduction Todate,theeld<strong>of</strong>scienticparallelcomputationhasconcentratedonoptimiz<strong>in</strong>g \Parallelismisenjoymentexponentiated."{authorca.1986.<br />

<strong>in</strong>teractionsaect<strong>parallelization</strong>. <strong>and</strong>consists<strong>of</strong>several<strong>in</strong>teract<strong>in</strong>gmodules.Akeypo<strong>in</strong>t<strong>in</strong>ourworkisthereforeto considerthe<strong>in</strong>teractionsbetweenthesesub-programblocks<strong>and</strong>analyzehowthe ignoredthisissue,or<strong>in</strong>thecase<strong>of</strong>Azarietal.[ALO89],usedalocalizeddirect solver.Thelatter,however,onlyworksforveryspecializedcases.Themoregeneral solver)mayimpacttheoverall<strong>particle</strong>partition<strong>in</strong>g.Previousworkhaseither problemsusuallyrequiretheuse<strong>of</strong>somesort<strong>of</strong>numericalPDEsolver. Inparticular,wewouldliketoseehowthesolverpartition<strong>in</strong>g(sayforanFFT<br />

anovelgridpartition<strong>in</strong>gapproachthatleadstoanecientimplementationfacili- Traditionalparallelmethodsus<strong>in</strong>greplicated<strong>and</strong>partitionedgrids,aswellas 53


tat<strong>in</strong>gdynamicallypartitionedgridsaredescribed.Thelatterisanovelapproach thattakesadvantage<strong>of</strong>theshared-memoryaddress<strong>in</strong>gsystem<strong>and</strong>usesadual po<strong>in</strong>terschemeonthelocal<strong>particle</strong>arraystokeepthe<strong>particle</strong>locationspartially 54<br />

grid.Inthecontext<strong>of</strong>gridupdates<strong>of</strong>thechargedensity,wewillrefertothis showanovelapproachus<strong>in</strong>ghierarchicaldatastructuresforstor<strong>in</strong>gthesimulation techniquesassociatedwiththisdynamicschemewillalsobediscussed. sortedautomatically(i.e.sortedtowith<strong>in</strong>thelocalgridpartition).Load-balanc<strong>in</strong>g<br />

techniqueas<strong>cell</strong>cach<strong>in</strong>g. F<strong>in</strong>ally,wewill<strong>in</strong>vestigatehowmemoryhierarchiesaect<strong>parallelization</strong><strong>and</strong><br />

4.2 Distributed Memory versus Shared Memory

The primary problem facing distributed memory systems is maintaining data locality and the overhead associated with it. This problem with parallel overhead also extends to the shared memory setting, where data locality with respect to cache is important. The author proposes that one view the KSR as a shared memory system where all memory is treated as a cache (or hierarchy thereof).

Shaw [Sha] points out that his experience with the SPARC-10s shows they have an interesting property which seems highly relevant. In order to achieve their peak speed (17-19 MFLOPs), the data must be in what Sun calls the SuperCache, which is about 0.5 MBytes per processor (and may increase in future versions of the hardware). This implies that if you are going to partition a problem across a group of SPARC-10s, you have many levels of memory access to worry about:

1. machine access on a network
2. virtual memory access on one machine on a network
3. real memory access on one machine on a network
4. SuperCache access on one processor on one machine on a network

Hence, there is a great deal to worry about in getting a problem to work "right" when you have a network of multiprocessor SPARC-10s.

A network of Sun 10's will consequently raise a lot of issues similar to those of the KSR in that they both possess several levels of cache/memory. (The KSR also has a 0.5 MB local cache on each processor: 0.25 MB for data, 0.25 MB for instructions.) To achieve optimum performance on any given parallel system, no doubt, a lot of fine-tuning is necessary. It is, however, hoped that our work can address the general problems and give some guidelines on the parallelization of fairly complex physics (and similar) codes.
4.3 The Simulation Grid

When parallelizing a particle code, one of the bottlenecks is how to update the grid quantities. The easiest and most common parallel implementations have a local copy of the grid for each thread (processor). In the ideal case, the grid is distributed, the processing nodes only share grid quantities on the borders, and the particles remain totally sorted; i.e., all particles within a sub-grid are handled by the same local thread (processing node).

4.3.1 Replicated Grids

To ensure that all grid updates occur without any contention from threads trying to update the same grid node (applying contributions from particles from different threads), one of the most common techniques is to replicate the grid for each thread. Each thread then calculates the contribution of its own particles by adding them up in a local array equal in size to the whole grid. When all the threads are done, the local grids are then added together either by one global master thread or in parallel by some or all threads. Experiments performed by the author on the KSR1 show that, due to the overhead in spawning and synchronizing many threads, a global master approach is in fact faster for smaller grids (say, 32x32).
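As an illustration, the replicated-grid update may be sketched in C roughly as follows. The unit grid spacing, the bilinear weighting, and all names are assumptions made for this sketch only; it is not the implementation benchmarked in Chapter 6.

    /* Each thread scatters its own particles into a private copy of the
     * charge-density grid; the private copies are then summed into the
     * global grid by a master thread (or in parallel over grid rows).   */
    #include <stddef.h>

    typedef struct { double x, y; } Particle;

    void scatter_local(const Particle *p, int np, double *local_rho,
                       int nx, double q)
    {
        for (int i = 0; i < np; i++) {
            int ix = (int)p[i].x, iy = (int)p[i].y;   /* cell indices (unit spacing) */
            double a = p[i].x - ix, b = p[i].y - iy;  /* offsets within the cell     */
            /* bilinear weighting to the 4 surrounding grid points;
             * interior cells assumed (ix+1, iy+1 stay in range)                     */
            local_rho[(size_t)iy * nx + ix]           += q * (1.0 - a) * (1.0 - b);
            local_rho[(size_t)iy * nx + ix + 1]       += q * a * (1.0 - b);
            local_rho[(size_t)(iy + 1) * nx + ix]     += q * (1.0 - a) * b;
            local_rho[(size_t)(iy + 1) * nx + ix + 1] += q * a * b;
        }
    }

    /* Reduction step: add the P private copies into the global grid.    */
    void sum_local_grids(double *global_rho, double *const *local_rho,
                         int nthreads, long ng)
    {
        for (int t = 0; t < nthreads; t++)
            for (long i = 0; i < ng; i++)
                global_rho[i] += local_rho[t][i];
    }

The grid-sum loop is exactly the per-copy O(Ng) cost that makes this approach expensive for large grids on many processors.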


4.3.2 Distributed Grids

Even though the particle push phase is parallelized using a replicated grid, the grid may still be distributed in a parallelized field solver. For an FFT solver this typically involves block-column and block-row distributions. However, when distributing the grid with particle updates in mind, we shall show that a column- or row-oriented partitioning may not be the most efficient.

4.3.3 Block-column/Block-row Partitioning

Assuming, nevertheless, that one tries to stick to the column/row distribution, in the particle update phase a node would still need to copy a row (or column) to its neighbor (assuming the number of grid points in each direction = the number of processing nodes available), since the particle-grid and grid-particle calculations are cell-based. This leaves each processor with the grid structure shown in Figure 4.1.

o----o----o----o----o----o----o----o----o
|    |    |    |    |    |    |    |    |
O----O----O----O----O----O----O----O----O

o = local grid data;  O = data copied from neighbor

Figure 4.1: Grid point distribution (rows) on each processor.

4.3.4 Grids Based on Random Particle Distributions

Square subgrids give a much better border/interior ratio than skinny rectangles for random distributions of particles in an electrostatic field. The particles then tend to leave their local subdomains less often, and less communication is hence needed. We investigated several other regular polygons (see Tables 4.1 and 4.2), but concluded that the square is indeed a good choice since its border/area ratio is reasonable and its implementation simpler.

Table 4.1: Boundary/Area ratios for 2D partitionings with unit area.

    Polygon                Boundary/Area Ratio
    Circle:                3.54
    Regular Hexagon:       3.72
    "Uniform Grid" Hex:    3.94
    Square:                4.00
    Regular Triangle:      4.56

Table 4.2: Surface/Volume ratios for 3D partitionings with unit volume.

    Solid                                             Surface/Volume Ratio
    Sphere (optimum):                                 4.84
    Cylinder w/ optimum height and radius:            5.54
    Staggered regular hexagon with depth = 1 side:    5.59
    Cube:                                             6.00

4.4 Particle Partitioning

4.4.1 Fixed Processor Partitioning

The easiest and most common scheme for particle partitioning is to distribute the particles evenly among the processors and let each processor keep tracking the same particles over time. This scheme works reasonably well when each processor maintains a local copy of the grid that, after each step, is added to the other local grids to yield a global grid which describes the total charge distribution.
Unfortunately, replicating the grid is not desirable when the grid is large and several processors are used. In addition to the obvious grid summation costs, it would also consume a great deal of valuable memory and therefore hamper our efforts to investigate global physical effects -- one of the prime goals of our particle simulation.

Since the particles will become dispersed all over the grid over time, a fixed particle partitioning scheme would also not fare well in combination with a grid partitioning. An alternative would be to use a hybrid partitioning like the one described by Azari and Lee [AL91, AL92] for a distributed memory hypercube (which still has problems for very inhomogeneous cases), or to sort the particles according to local grids periodically. The latter would also require a dynamic grid allocation if load balance is to be maintained. We will get back to these combined schemes later in this chapter.

4.4.2 Partial Sorting

One way to reduce memory conflicts when updating the grid, in the case where the grid is distributed among the processors, is to have the particles partially sorted. By partial sorting we mean that all particle quantities (locations and velocities) within a certain subgrid are maintained by the processor handling the respective subgrid. In this case, memory conflicts (waits for exclusive access to shared variables -- here: grid locations) are limited to the grid points on the borders of the subgrids. This method can be quite costly if it is necessary to globally sort an array with millions of particles fairly often.

"Dirty" bits

An alternative approach is to maintain local particle arrays that get partially sorted after each time-step. (This would be the equivalent of sending and receiving particles that leave their sub-domain in the distributed memory setting.)

A fairly common technique from the shared-memory vector processing world is to add a "dirty bit" to each particle location and then set or clear this bit depending on whether or not the new location is within the local subgrid. Cray programmers worrying about using too much extra memory for the "dirty" bits have been known to use the least significant bit of the double-precision floating point number describing the location as a "dirty" bit!

If the location is on a new grid, the locations may still be written back with the appropriate "dirty bit" setting. This would, however, require a subsequent search through the "dirty bits" of all local particles by all processors and a corresponding fill-in of "dirty" locations locally on each processor (assuming local memory should not be expanded and wasted).
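For concreteness, the least-significant-bit trick can be sketched as below; the helper names are hypothetical, and the precision sacrificed in the location is far below the grid resolution.

    #include <stdint.h>
    #include <string.h>

    /* Use the least significant mantissa bit of the double-precision
     * x-location as the "dirty" flag for a particle that left its subgrid. */
    static double set_dirty(double x)
    {
        uint64_t u; memcpy(&u, &x, sizeof u); u |= 1ULL;  memcpy(&x, &u, sizeof u); return x;
    }
    static double clear_dirty(double x)
    {
        uint64_t u; memcpy(&u, &x, sizeof u); u &= ~1ULL; memcpy(&x, &u, sizeof u); return x;
    }
    static int is_dirty(double x)
    {
        uint64_t u; memcpy(&u, &x, sizeof u); return (int)(u & 1ULL);
    }

The marking pass tags leavers with set_dirty(); the fill-in pass then searches for is_dirty() slots and overwrites them with incoming particles.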

4.4.3 Double Pointer Scheme

A more elegant way that achieves partial sorting of the local particle arrays automatically during particle updates is to maintain two pointers, a read and a write pointer, to the local particle arrays. If the new particle location is still within the local subgrid, then both pointers get incremented; otherwise only the read pointer gets updated and the exiting particle is written to a scratch memory. Notice how this automatically "sorts" the particles back into the local array. After the thread (processor) is done, it could then go through the global scratch array and fill in incoming particles by updating the write pointer.

It should be pointed out that this dual pointer technique does not lend itself as well to vectorization as the "dirty-bit" approach, unless a set of pointers is used for each vector-location. However, in our work we are concentrating on parallelization across scalar processors.
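A minimal C sketch of the dual pointer update follows; the particle layout, the subgrid bounds, and the scratch-array handling are illustrative assumptions rather than the actual code.

    typedef struct { double x, y, vx, vy; } Particle;

    /* Dual-pointer particle push for one thread's local array.  Particles
     * that remain inside the local subgrid [x0,x1) x [y0,y1) are compacted
     * in place through the write pointer; leavers are appended to a scratch
     * array to be collected by the thread owning their new subgrid.
     * Returns the new local particle count.                              */
    int push_and_sort(Particle *local, int nlocal,
                      Particle *scratch, int *nscratch,
                      double x0, double x1, double y0, double y1, double dt)
    {
        int write = 0;                                 /* write pointer */
        for (int read = 0; read < nlocal; read++) {    /* read pointer  */
            Particle p = local[read];
            p.x += p.vx * dt;
            p.y += p.vy * dt;
            if (p.x >= x0 && p.x < x1 && p.y >= y0 && p.y < y1)
                local[write++] = p;          /* stays: both pointers advance */
            else
                scratch[(*nscratch)++] = p;  /* leaves: only read advances   */
        }
        return write;   /* also an indicator of this thread's load */
    }

After the push, each thread scans the shared scratch array for particles whose new positions fall inside its own subgrid and appends them by advancing only its write pointer.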

Load balancing information

With this dual pointer scheme, the write pointers automatically tell you how load-balanced the computation is after each time-step. If some thread (processor) suddenly gets very many particles or hardly any, flags could be raised to initiate a repartitioning of the grid for load balancing purposes. This would also be useful in the extreme case where most of the particles end up in a small number of subgrids, causing memory problems for the local particle arrays.

4.5 Load Balancing

4.5.1 The UDD approach

Load-balancing ideas stemming from the author's work on fault-tolerant matrix algorithms [EUR89] can also be applied to load balancing particle codes. There, algorithmic fault-tolerant techniques were introduced for matrix algorithms that had been especially tailored for efficient multi-processing on hypercubes. The hypercube algorithms were based on an interconnection scheme that involved two orthogonal sets of binary trees. By focusing on redistributing the load for each processor to minimize the effect on the remaining partial orthogonal trees, low communication overhead was maintained.

The optimum re-distribution of particles should be similar to that shown for the UDD (Uniform Data Distribution) approach for matrices [EUR89] -- i.e. a uniform distribution of the particles over the currently available processors (assuming a homogeneous system). Elster et al. analyzed several re-distribution techniques, of which the Row/Column UDD method [Uya86, UR85a, UR85b, UR88] proved to be the most interesting. First, a column-wise UDD was performed on the row with the faulty processor. This involved distributing the data points of the faulty processor equally among the other processors residing in the same row as the healthy ones by rippling the load from processor to processor so that only near-neighbor communication was needed. Then, by shrinking the y-direction (height) of the sub-matrices on the remaining processors in the row of the faulty processor, while increasing the height of the sub-matrices on the remaining processors correspondingly (row-wise UDD), load balancing was achieved with both the near-neighbor communication pattern and the orthogonal tree patterns upheld.

In our particle sorting setting, similar ideas might prove useful in re-distributing the grid when opting for load balancing.

Figure 4.2: Inhomogeneous particle distribution

Figure 4.3: X-profile for the particle distribution shown in Figure 4.2

4.5.2 Load balancing using the particle density function

One way to do the partitioning is to maintain a watch on the particle density functions in the x and y directions. One would in this case periodically calculate the density "profile" of the system. For example, if the grid had the distribution shown in Figure 4.2, it would give an x-profile as shown in Figure 4.3. The x-profile could then be used to partition the grid in the x-direction (see Figure 4.4). The same could then be done for the y-direction, giving uneven rectangular sub-grids. This is reminiscent of adaptive mesh refinement, so there are surely ideas to be used from that area. The scheme is also similar to the one recently proposed by Liewer et al. [LLDD90] for a 1D code. They use an approximate density function to avoid having to broadcast the particle density for all grid points.

Figure 4.4: New grid distribution due to the x-profile in Figure 4.3.
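A minimal sketch of how such an x-profile could be turned into partition boundaries is given below; the equal-count cut rule is our own illustration of the idea rather than the exact scheme.

    /* Given a particle-count histogram over the nx grid columns, choose
     * cut columns so that each of the nproc x-slabs holds roughly
     * np_total/nproc particles.  The same pass can then be repeated in y
     * within each slab to obtain uneven rectangular sub-grids.           */
    void cut_by_xprofile(const long *col_count, int nx, long np_total,
                         int nproc, int *cut /* size nproc+1 */)
    {
        long target = np_total / nproc, running = 0;
        int  p = 1;
        cut[0] = 0;
        for (int i = 0; i < nx && p < nproc; i++) {
            running += col_count[i];
            if (running >= p * target)        /* passed the p-th quantile */
                cut[p++] = i + 1;
        }
        while (p < nproc)                      /* degenerate tail slabs    */
            cut[p++] = nx;
        cut[nproc] = nx;
    }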

4.5.3 Load and distance

Techniques based purely on particle load (as outlined in the previous section) would only achieve load balance for homogeneous systems. Many parallel systems, including the KSR, have nodes with uneven loads due to time-sharing, the presence of I/O processors on a subset of the processors, etc.

One idea we are considering is to use run-time information to aid us in achieving load balance. By incorporating the use of tables maintaining "load" (how busy are the individual processing nodes) and "distance" (how far away are the other nodes with respect to communication time) into the code, one could make the implementations more generally suitable, not only for the KSR, but also for distributed system environments such as a network of Suns.

Thoughts to consider when searching for the right approach to particle sorting and grid re-assignment in this context are what input distributions and run-time distributions can commonly be expected.


If the input data is uniform, and one can expect it to remain so during the computations, then it is reasonable to assume a static allocation of sub-grids. However, if the system has one limited area of particles migrating around the system, it is reasonable to consider a dynamic grid approach more seriously.

Either way, "load" and "distance" information could then be used to determine particle/grid partitionings, and when to sort (update the partitioning). How one actually sorts (re-arranges partitionings) will depend on: 1) the network topology, which indicates which processors are neighbors and hence dictates how to minimize communication, and 2) the memory model, which affects how message passing and caching occur.

Granted, the KSR tries to hide these two dependencies with the aid of its clever operating system, but if these two points are ignored, one is still likely to end up with an inefficient implementation. Communication overhead is, however, substantial for distributed workstations, so tailoring the implementations to minimize it now becomes crucial.

4.6 Particle sorting and inhomogeneous problems

Azari and Lee [AL91] addressed the problem that results when several particles end up in one processor by assigning each part of the grid to a group of processors (hybrid partitioning). This works reasonably well for fairly homogeneous problems, but would take a serious performance hit for strongly inhomogeneous problems where most of the "action" takes place in a small region of the system (one processor "group"). Unfortunately, there are several such cases in plasma physics, so developing algorithms to handle these inhomogeneous cases is definitely worth investigating.


4.6.1 Dynamic Partitionings

The way we see the problem of inhomogeneous problems being solved is to use both a dynamic grid and a dynamic particle partitioning (well, the particle partitioning is basically implied), i.e. what Walker refers to as an adaptive eulerian decomposition. The number of grid elements per processor should here reflect the concentration of particles, i.e. if a processor does computations in an area with a lot of particles, it would operate on a smaller grid region, and vice versa. Grid quantities would then need to be dynamically redistributed at run-time as particles congregate in various areas of the grid.

On page 49 of his thesis, Azari [Aza92] indeed mentions re-partitioning of the grid as a possible attempt at load balancing. To quote him:

    One possible attempt for load balancing could be to re-partition the grid
    space using a different method such as bi-partitioning. These methods
    have been designed for non-uniform particle distribution on the grid.
    However, the grid partitioning unbalanced the grid-related calculations
    since the number of grid points in each subgrid would be different. Also,
    the re-partitioning task itself is a new overhead.

The overheads will be further analyzed in this thesis. Since we are using an FFT solver, the grid will need to be re-partitioned regularly regardless of the particle pusher in order to take advantage of parallelism in the solver.

4.6.2 Communication patterns

The communication cost for the transpose associated with a distributed 2D FFT should be similar to that of going from a row distribution to a dynamic sub-grid (for the particle phase). Another argument for doing a re-distribution of the grid is that the row or column distribution used by the FFT is not as suitable for the particle stage, since one here would generally prefer square subgrids on each processor.


If, however, for some reason a more block-column or block-row partitioning is desirable also for the other stages, then the FFT solver's order of the 1D-FFTs should match this partitioning in order to avoid an extra transpose. We will get back to this idea in the next chapter. Azari-Bojanczyk-Lee [ABL88] and Johnsson-Ho [JH87] have investigated matrix transpositions for meshes and hypercubes, respectively.

4.6.3 N-body/Multipole Ideas

The author has also considered some parallel multipole/N-body ideas [ZJ89, BCLL92].

Multipole methods use interesting tree-structured approaches such as the Barnes-Hut and ORB (Orthogonal Recursive Bisection) trees to subdivide particles. One idea would be to use a tree structure similar to that described by Barnes and Hut [BH19] to organize the particles for each "sort".

Barnes-Hut tree

The BH (Barnes-Hut) tree organizes the particles by mapping them onto a binary tree, quad-tree (max. 4 children per node), or oct-tree (max. 8 children per node) for 1-D, 2-D or 3-D spaces, respectively. Considering particles distributed on a 2-D plane (with more than one particle present), its quad-tree is generated by partitioning the space into 4 equal boxes. Each box is then partitioned again until only one particle remains per box. The root depicts the top-level box (the whole space), each internal node represents a cell, and the leaves are particles (or empty if the cell has no particle).

The BH algorithm traverses the tree for each particle, approximating whole subtrees of particles for boxes containing particles sufficiently far away from the present particle during force calculations. (Since it is the data structure I am interested in, I will not go into the detail of what the calculations actually estimate physically.)
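For reference, such a quad-tree can be sketched in C as follows; the node layout and the insertion routine are illustrative only, and particle positions are assumed distinct.

    #include <stdlib.h>

    /* One node of a 2-D BH quad-tree: an internal node is a square cell
     * with up to 4 children; a leaf holds at most one particle index.    */
    typedef struct QNode {
        double xc, yc, half;      /* cell center and half-width            */
        int    particle;          /* particle index, or -1                 */
        int    internal;          /* set once the cell has been subdivided */
        struct QNode *child[4];
    } QNode;

    static QNode *new_node(double xc, double yc, double half)
    {
        QNode *n = calloc(1, sizeof *n);
        n->xc = xc; n->yc = yc; n->half = half; n->particle = -1;
        return n;
    }

    /* Return (creating if necessary) the child quadrant containing (x,y). */
    static QNode *descend(QNode *n, double x, double y)
    {
        int q = (x >= n->xc) + 2 * (y >= n->yc);
        double h = n->half / 2;
        if (!n->child[q])
            n->child[q] = new_node(n->xc + ((q & 1) ? h : -h),
                                   n->yc + ((q & 2) ? h : -h), h);
        return n->child[q];
    }

    /* Insert particle i; cells are split until each leaf holds one particle. */
    static void qt_insert(QNode *n, int i, const double *px, const double *py)
    {
        for (;;) {
            if (!n->internal && n->particle < 0) { n->particle = i; return; }
            if (!n->internal) {                    /* occupied leaf: split it */
                int old = n->particle;
                n->particle = -1;
                n->internal = 1;
                descend(n, px[old], py[old])->particle = old;
            }
            n = descend(n, px[i], py[i]);          /* follow the new particle */
        }
    }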


Tree structures and particle sorting

How does this relate to particle sorting for PIC codes? Since particles invariably wander off in different directions, particles that were local to the grid cells (bound to processor nodes) at the start of the simulations are, after a while, no longer near their origins. This causes a lot of communication traffic for distributed memory machines. One approach to overcome this is to sort the particles to the processors containing their current (and possibly neighboring) cells.

The first idea is to map the new particle locations to the BH tree and then assign the nodes according to the sub-tree distribution this gives. This will yield a nicely balanced distribution of particles, but since the tree structure itself does not account for cell-cell interaction (neighboring cells), a more clever approach is required for PIC codes since they are heavily "neighbor-oriented".

The Fast Multipole Method (FMM) uses a similar recursive decomposition of the computational space into a tree structure. Unlike the BH method, which only involves particle-particle and particle-cell interactions, the FMM also allows cell-cell interactions. How it goes about the actual computations and the different types of interactions is fairly complex, but not that interesting for our purposes.

In order to make use of the structures these methods present, the author believes that a hybrid BH-FMM approach could prove the most useful. Since we only care about neighboring cells, it should be possible to simplify the calculations for cell-cell interactions.

4.7 The Field Solver

Noticing that different implementations use different solvers, it is useful to see what impact choosing an FFT solver would have on the particle-pushing stages of our code, compared to other solvers such as multigrid.

It is also worth noting that an FFT solver and periodic boundary conditions would generally not be used for very inhomogeneous systems. Otani [Ota] has pointed out, however, that it is possible for very large waves to exist in an otherwise homogeneous system which, in turn, lead to significant bunching of the particles. It is predicted that one would not have less than, say, half of the processors handling most of the particles.

4.7.1 Processor utilization

However, using only 50% of the processors effectively on a parallel system is indeed a significant performance degradation, especially for highly parallel systems. For example, on a 128-processor system, 50% utilization/efficiency gives us a 64-times speed-up (the maximum theoretical limit) versus a 128-times speed-up for a fully used system. It is "only half" of the parallel performance, but with respect to the serial speed, which is the important measure, it is very significant; i.e., there is a loss of resources with respect to a single node of a factor of 64!

4.7.2 Non-uniform grid issues

Ramesh [Ram] has pointed out that keeping a non-uniform grid in order to load-balance the particle push stage may lead to problems with the field solver.

The grid could alternatively be partitioned uniformly (equal grid spacing) on a system-wide basis. The number of grid points that get stored by each processor would then vary according to the non-uniform distribution. Ramesh's point, however, does still come into play in that we now have a solver that will have an uneven load across the processors -- a performance hit, also pointed out by Azari and Lee, that will have to be considered.

Alternatively, a solver for non-uniform meshes could be used. It, however, may lead to another "can of worms" with respect to accuracy, etc.

4.7.3 FFT Solvers

A parallel FFT is considered quite communication-intensive, and because of the communication structure, it is best suited for hypercubes, which have a high degree of communication links. Doing a 2D FFT usually also implies doing a transpose of the rows/columns. G. Fox et al. [FJL+88] point out that for this approach the overhead is functionally the same as for the more direct 2D approach, but by performing all the communication at once (transpose), some start-up message overhead is saved.

Claire Chu, a student of Van Loan, wrote a PhD thesis on parallel FFTs for the hypercube. Sarka et al. have investigated parallel FFTs on shared memory systems. Interestingly enough, the only reference I seem to have encountered for parallel FFTs on array processors is Maria Guiterrez's, a student of my MS advisor who did some image processing work on the MPP (bit-serial array proc.).

A good general reference is Van Loan's book [Loa92]. It includes both Claire Chu's work on the hypercube and a section on FFTs for shared memory systems. All the algorithms in the book are written in a block-matrix language. As mentioned in Section 2, G. Fox et al. also cover parallel FFTs. Both point out the use of a transpose.

Most distributed parallel algorithms involving tree structures, FFTs or other highly connected communication topologies assume a hypercube interconnection network (it fits so perfectly for the FFT!). In addition, most distributed memory and shared memory approaches found in numerical texts tend to look at algorithms with only one data point per processor.

In our case, this is not a good model. For instance, if we were to use 16 processors, this would only leave us a 4-by-4 grid. Besides, having only 4 grid points per processor (assuming the neighboring column gets copied) seems very inefficient for the particle phase -- and certainly does not allow for much dynamic grid allocation.

Consequently, it would be a lot more reasonable to assume that we have, say, an n-by-n grid mapped onto p processing elements (PEs) with n = O(p). In this case, if n = p, we will have p 1D-FFTs to solve in each direction on p PEs. The question is, of course, how to redistribute the grid elements when going from one dimension to the other and what cost is involved.


Figure 4.5: Basic Topologies: (a) 2D Mesh (grid), (b) Ring.

Let us first consider the standard grid and ring topologies shown in Figure 4.5. Assume that both the processor array and the field grid dimensions are a power of 2 (simplifying the FFTs). For the first FFT, it is clear that it would be most advantageous to have one (or more) columns (or rows) per processor. Assuming only one column per processor, it would then take O(N log N) execution time to perform an N-point FFT. One would then need to "transpose" the grid entries in order to perform the 1D-FFTs in the other direction. Hopefully, such packages could be obtained from the vendor since this is obviously a time-consuming task given all the communication involved.

It is not obvious how fast such an algorithm would be on the KSR given the underlying token-based communication structure. There is also the problem with contention from other ring cells that may be excluded from one's current processor set. The reason the latter is likely to happen is that the KSR has some processors with I/O attachments that significantly degrade their computational use, causing an unbalanced system. Given this, it is unlikely that one could get the full use of a 32-processor ring -- the ideal for an FFT. If only 16 processors are used and one does not explicitly request exclusive access to the whole ring, other users' processes running on the remaining 16 nodes could cause extra communication traffic on the ring, affecting one's performance. Further details on the KSR's architecture are given in Chapter 6.

The "pool-of-tasks" approach that Van Loan mentions in his book may be a reasonable approach if one considers running on the full "unbalanced" 32-processor ring (or 64 or more). Notice here the potential contention between load-balancing the 1D-FFTs and the transpose.

Matching Grid Structures and Alternating FFTs

As mentioned in the introduction, the local memory on the KSR = (128 sets) x (16-way associativity) x (16 KB page size) = 32 MB. In fact, all physical memory comes in powers of two. On the KSR, local memory thrashing hence occurs when a processor repeatedly references more than 16 addresses with a stride of 128 pages (or 32 addresses with a stride of 64 pages, or 64 addresses with a stride of 32 pages). One should hence take care to ensure that strides are not a power-of-two multiple of the page size (16 KB).

Notice how this conflicts with the implementation of 2D FFTs, which tend to operate on arrays that indeed are powers of two. It is for this reason, and the fact that one usually can make use of fast unit-stride 1-D FFT routines provided by the manufacturer (often hand-coded in assembler), that 2D FFT implementations often involve an actual transpose.

During our 2-D FFT field solver, there is a reordering (transposition) of the grid between each set of 1-D FFT calls in order to be able to use contiguous vectors for the 1-D FFT computations. This reordering of the grid can be quite costly if the grid is fine and if there are on the average only a few particles per cell. If we keep the particles in block-vector grids during the particle push, both re-ordering steps can be saved, and we hence get the following sequence (a code sketch follows the list):

1. Column-wise FFT
2. Transpose
3. Row-wise FFT
4. Row-wise inverse FFT
5. Transpose
6. Column-wise inverse FFT
7. Particle push
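A sketch of the resulting field-solve cycle is given below; all routine names are placeholders rather than a real library interface, and the k-space field computation between the forward and inverse transforms is only indicated (it is not one of the seven steps above).

    /* Placeholder prototypes -- stand-ins for vendor unit-stride 1-D FFT
     * and transpose routines; not an actual library API.                 */
    void fft_cols(double *g, int nx, int ny, int inverse);
    void fft_rows(double *g, int nx, int ny, int inverse);
    void transpose(double *g, int nx, int ny);
    void solve_in_kspace(double *g, int nx, int ny);

    /* One field-solve cycle when the particle phase works on column blocks:
     * the grid starts and ends column-ordered, so no extra re-ordering is
     * needed around the particle push.                                    */
    void field_solve_step(double *grid, int nx, int ny)
    {
        fft_cols(grid, nx, ny, 0);       /* 1. column-wise FFT             */
        transpose(grid, nx, ny);         /* 2. transpose                   */
        fft_rows(grid, nx, ny, 0);       /* 3. row-wise FFT                */
        solve_in_kspace(grid, nx, ny);   /*    field computation (implied) */
        fft_rows(grid, nx, ny, 1);       /* 4. row-wise inverse FFT        */
        transpose(grid, nx, ny);         /* 5. transpose                   */
        fft_cols(grid, nx, ny, 1);       /* 6. column-wise inverse FFT     */
        /* 7. particle push follows, reading the column-ordered result     */
    }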

Notice that this block-column grid partitioning conflicts with the optimum square-shaped grid partitioning for uniform problems.

4.7.4 Multigrid

In noting how similar our dynamic grid-partitioning approach for the particle-grid/grid-particle steps is to the adaptive grid schemes one finds in multigrid (MG) methods, one could ask: would it be reasonable to consider using an actual parallel MG method for the field solver? Good multigrid references include [Bri87, McC89]. The problem seems, however, to be that these methods are still not fully developed for parallel systems. The examples in the references focus on problems with Neumann and Dirichlet boundary conditions.

4.8 Input Effects: Electromagnetic Considerations

For electromagnetic codes, the magnetic field lines will, for certain criteria, tend to cause particles to move in a direction primarily along the field. In this case, one would indeed like the subpartitions to be aligned as flat rectangles along this direction.

Electrons tend to be tied to the field lines and move huge distances along the magnetic field, but have difficulty moving perpendicular to it. Ions behave a lot like the electrons, but since they have more mass, they tend to make larger excursions across the fields. If the frequency regime is high, one will only see a part of the ions spiraling around the magnetic field B, and the ions behave as if they were unmagnetized. If the frequency regime is way above the cyclotron frequency (how many times a particle can go around a magnetic field line per second), one may want to model motion where the electrons are unmagnetized. For these unmagnetized cases, there is no preferred general direction of the particles, so square partitionings would be preferable.

4.9 Hierarchical Memory Data Structures: Cell Caching

Grids are typically stored either column- or row-wise. However, in a system with memory hierarchies, especially systems with caches, we shall show that these are not necessarily the best storage schemes for PIC codes.

When the particle's charge contributions are collected back on the grid points, each particle will be accessing the four grid points of its cell (assuming a quadrangular grid). If the local grid size exceeds a cache line (which is typically the case since cache lines tend to be small -- 16 words on the KSR1), and column or row storage of the grid is used, each particle will need at least two cache lines to access the four grid points it is contributing to.

Since a significant overhead is paid for each cache hit, we hence propose that one instead stores the grid according to a cell caching scheme. This means that instead of storing the grid row- or column-wise during the particle phase, one should store the grid points in a 1-D array according to little subgrids that may fit into one cache line. On the KSR1, where the cache line is 16 words, this means that the grid is stored as a sequence of either 8x2 or 4x4 subgrids. The latter would minimize the border effects. However, even the 8x2 case shows a 22% reduction in the number of cache-line accesses. Further analysis of this scheme will be presented in Chapter 5.

Row-storage of a 2-D m x n array:
    a00 a01 ... a0n a10 ... amn

4x4 cell-caching storage:
    a00 ... a03 a10 ... a13 ... a04 ... a07 a14 ... a17 ... amn

Figure 4.6: Row storage versus cell caching storage

In order for the cell-caching scheme to be effective, the grid needs to be subpage (cache-line) aligned. The KSR automatically does this when using malloc in C.

Notice that this cell-caching technique is really a uni-processor technique. Therefore, both parallel and serial PIC codes for any system with a cache would benefit from using this alternative block storage. It would, however, affect the alternating FFT approach we will be introducing, since cell-caching storage would not preserve the column/row storage in the FFTs.
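A minimal sketch of the 4x4 cell-caching index map is shown below; the function name is ours, and Nx is assumed to be a multiple of 4 so that each 4x4 block fills exactly one 16-word KSR1 cache line.

    /* Map grid point (i,j) of an nx-column grid to its position in a 1-D
     * array stored as a sequence of 4x4 subgrids.  With the array subpage
     * (cache-line) aligned, all 16 points of a block share one cache line,
     * so a particle's four surrounding grid points often cost one line
     * instead of two.                                                     */
    #define BLK 4
    static long cell_index(long i, long j, long nx)
    {
        long bi = i / BLK, bj = j / BLK;        /* which 4x4 block         */
        long blocks_per_row = nx / BLK;
        return (bi * blocks_per_row + bj) * (BLK * BLK)
             + (i % BLK) * BLK + (j % BLK);     /* offset inside the block */
    }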


Chapter 5

Algorithmic Complexity and Performance Analyses

5.1 Introduction

"Life is short, the Art long, opportunity fleeting, experience treacherous, judgment difficult." -- Hippocrates [ca. 460-357 B.C.], Aphorisms. Proverbial: Ars longa, vita brevis.

In order to get a better understanding of how various parallelization approaches would affect performance, this chapter includes a complexity analysis of the approaches we think are the most reasonable. These analyses consider both computational requirements and memory traffic. By combining the complexity results with the timing results from serial codes and known parallel benchmarks, one can estimate how effective the parallelizations will be on a given parallel system, given certain chosen problem parameters such as grid size and the number of simulation particles used. A fine-tuning of these results is possible after fully implementing and testing the chosen parallel algorithm(s) on a chosen architecture. Our test-bed was the KSR1 (see Chapter 6).


5.2 Model

When moving from sequential computer systems to parallel systems with distributed memory, data locality becomes a major issue for most applications. Since intermediate results and data need to be shared among the processing elements, care must be taken so that this process does not take an inordinate amount of extra run-time and memory space. After all, the purpose of parallelizing the codes is so they can take advantage of the combined speed and memory size of parallel systems.

Typically, communication overhead is modeled as follows (see Hockney and Jessup [HJ81]):

    t_comm = α + β N,

where t_comm is the communication time for an N-vector. Here α is the start-up time and β a parameter describing the bandwidth of the system.

If one then assumes that the computations do not overlap with communication, the total time a given parallel approach would take (assuming the process does not get swapped out by the operating system at run-time) would then be:

    T_total = t_comp + t_comm,

where t_comp is the time the algorithm spends on computations and t_comm is the communication overhead described above.

Modern parallel systems with distributed memory, starting with the Intel iPSC/2, however, have independent I/O processors that let the users overlap the time spent sending data between processors, i.e. communication, with computation time. How efficient the implementations are then also becomes related to how well the program can request data in advance while doing large chunks of computations. How much the application programmer can take advantage of this overlap feature is hence also strongly application-dependent.


On hierarchical memory systems, t_comm will be a function of which levels of the memory hierarchy are accessed for the requested data. For simplicity we will consider a model with only two hierarchical parameters: t_lm, which denotes time associated with local memory accesses, and t_gm, time associated with global memory:

    t_comm = t_lm(N) + t_gm(M).

It should be noted that t_lm covers both items within a cache line and items within a local subcache. For a vector of N local data elements, t_lm(N) is hence not a simple constant, but rather a function of whether the individual data elements are 1) within a cache line, 2) within cache lines in cache, or 3) in local memory. Similarly, t_gm is a function of whether a vector of M data elements accessed is all within 1) a communication packet (equal to a cache line on the KSR), 2) some distant local memory, or 3) some external storage device such as disk. Since access to external memory devices typically is several orders of magnitude slower than access to system memory, we will assume that this is so undesirable during computation that the problems in this case are tailored to fit within the system memory.

Notice that serial computer systems with local caches also involve t_comm = t_lm of the same complexity.

5.2.1 KSR Specifics

The KSR1 also has other hardware features that hamper the accuracy of the above distributed memory model. Among these features are its hierarchical token ring network and the local sub-caches associated with each processor.

Each KSR processor cell includes a 0.5 MByte sub-cache, half of which is dedicated to instructions, the other half to data. The performance of the system will hence also depend on how efficiently the local sub-cache gets utilized. Similarly, the performance of distributed memory systems with vector processors will depend on how efficiently the vector units at each processing node are utilized.

The KSR1's interconnection network consists of a ring of rings. At Cornell, we currently have a 128-node KSR1 system made up of 4 32-node rings that are connected to a top-level communication ring. The "message passing" (actually hidden due to the shared memory programming environment) is further limited in that there may be no more than approximately 14 messages (tokens) on one ring at any given time; i.e. a maximum of 14 messages can fly among 32 processors simultaneously. This is a more severe limit than what is normally seen on distributed machines with I/O chips (e.g. standard rings, meshes and hypercubes). One normally expects a minimum of N messages (near-neighbor) to be handled simultaneously by an N-processor ring. The KSR does, however, have the advantage that the communication bandwidth of its ring is significantly higher than what one sees on typical ring architectures. Also, since the ring is unidirectional and each message-header has to make it back to the sender, there is virtually no difference in communication with the nearest neighbor versus communication with a node connected at the opposite end of the ring. However, major communication delays are experienced when having to access data located on other rings.

The calculation of the expected performance of each node is also complicated by the fact that 3 processors in each KSR ring have I/O devices attached, making these nodes slower than the others. When running an application on all 32 ring nodes, this must hence be considered. KSR fortunately provides an environment variable that lets users avoid these special cells when performing benchmarks.

When analyzing various approaches to our problem, we shall start with the simplest model, i.e., one which assumes uniform processors and no communication overlap. Refinements will then be made as we build up the model and incorporate feed-back from our test runs. Our prototype implementations benchmarked in Chapter 6 did not take advantage of pre-fetching features.


5.2.2 Model parameters

The following parameters will be used in our model. "xx" below is replaced by one of the following:

- part-push-v: Update of particle velocities (includes charge gather).
- part-push-x: Update of particle positions.
- scatter: Calculation of each particle's contribution to the charge density based on its fixed simulation charge and location.
- fft-solver: Field solver using 2D FFT. May be implemented using 1D FFTs along each of the two simulation axes.
- transpose: Reordering of grid data points from being stored among the PEs as block-columns to block-rows, or vice versa. In the case of 1 vector/PE, this is a straightforward vector transpose.
- grid-sum: Add up local grid copies into a global grid sum.
- border-sum: Add temporary border elements to the global grid.
- part-sort: Procedure for relocating particles that have left the portion of the grid corresponding to their current node. This involves transfer of v and x data for these particles to the PEs that have their new local grid.
- find-new-ptcls: Search through out/scratch arrays for incoming particles.
- loc-ptcls: Accessing local arrays of particle quantities.

The parameters are:

- T: Total time -- describes times which involve both computation time and communication time (where needed). The communication time is considered to be the time spent exchanging needed data among processing units.
- T_xx: Total time for the specific algorithm/method "xx".
- t_comp-xx: Computation time of algorithm "xx".
- t_comm-xx: Communication time of algorithm "xx".
- α, β: Start-up time and bandwidth parameter, respectively, as defined above in Section 5.1.
- t_lm(N): Local memory communication time involved in accessing a vector of length N stored on a local node (see Section 5.1).
- t_gm(N): Global memory communication time involved in accessing a vector of length N stored on a parallel system (see Section 5.1). Includes t_lm(N) for the specified vector.
- P: Number of processing nodes available.
- Px: Number of processing nodes in the x-direction when using a sub-grid partitioning.
- Py: Number of processing nodes in the y-direction when using a sub-grid partitioning (P = Px Py).
- N: Generic integer for algorithmic complexity. E.g. O(N) implies a linear-time algorithm.
- Np: Number of super-particles in the simulation.
- Npmoved: Total number of particles that moved within a time-step.
- maxlocNp: Maximum number of particles on any given processor.
- Nx: Number of grid points in the x-direction (rectangular domain).
- Ny: Number of grid points in the y-direction (rectangular domain).
- Ng: Total number of grid points in the x and y directions; Ng = Nx Ny.
- ccalc: Calculation constant characterizing the speed of the local node.


5.2.3 Result Summary

Table 5.1 shows a summary of the performance complexity figures for the three approaches analyzed. These approaches are 1) serial PIC on a system with a local cache, 2) a parallel algorithm using a fixed particle partitioning and a replicated grid with a parallel sum, and 3) a parallel algorithm using a fixed grid partitioning that automatically partially sorts the local dynamic particle arrays.

From this table one can see that, given that one is limited in how much fast local cache memory one can have per processor, the serial code not only suffers from having only one CPU, but also suffers when the data sets get large and no longer fit in cache (the t_lm arguments are large). One can also see that the replicated-grid approach will suffer from the t_grid-sum computation and communication overhead for large grids (Ng) on a large number of processors (P). The fact that maxlocNp >> Np/P for a good portion of the time-steps will clearly hamper the grid partitioning approach for load-imbalanced systems. Chapter 4 discussed ways to compensate for this imbalance by re-partitioning the grid (and re-sorting the particles).

The following sections describe how we arrived at the equations in Table 5.1 and discuss further issues associated with each approach.

5.3 Serial PIC Performance

Following is an analysis of the serial algorithm. Since it is assumed to run on only one node, "communication" time will here be assumed limited to local memory accesses. In large simulations that use more than the 32 MB local memory, this will not be the case. The shared memory programming environment will then store some of the data on other nodes. This can be considered equivalent to swapping data to disk on smaller serial systems, except that the KSR uses fast RAM on a high-speed ring instead of slow I/O devices.


Table 5.1: Performance Complexity -- PIC Algorithms
(Serial = serial PIC with cache; Particle = particle partitioning, replicated grids with parallel sum; Grid = grid partitioning with automatic partial particle sort.)

Gather field at particles and update velocities:
    Serial:    t_comp = O(Np);        t_comm = O(t_lm(Ng)) + O(t_lm(Np))
    Particle:  t_comp = O(Np/P);      t_comm = O(t_gm(Ng)) + O(t_lm(Np/P))
    Grid:      t_comp = O(maxlocNp);  t_comm = O(t_lm(Ng/P)) + O(t_lm(maxlocNp))

Update particle positions:
    Serial:    t_comp = O(Np);        t_comm = O(t_lm(Np))
    Particle:  t_comp = O(Np/P);      t_comm = O(t_lm(Np/P))
    Grid:      t_comp = O(maxlocNp);  t_comm = O(t_lm(maxlocNp)) + O(t_gm(Npmoved))

Scatter particle charges to grid (charge densities):
    Serial:    t_comp = O(Np);                  t_comm = O(t_lm(Np))
    Particle:  t_comp = O(Np/P) + O(Ng log P);  t_comm = O(t_lm(Np/P)) + O(t_gm(Ng log P))
    Grid:      t_comp = O(maxlocNp);            t_comm = O(t_lm(maxlocNp)) + O(t_gm(max(Nx/Px, Ny/Py)))

Need to re-arrange grid for FFT?
    Serial:    Not if starting in the same dimension; otherwise O(t_lm(Ng))
    Particle:  Not if doing a series of grid-sums on sub-grids; otherwise O(t_gm(Ng))
    Grid:      Not if the grid is partitioned in only one dimension; otherwise O(t_gm(Ng/Py))

2D FFT Solver:
    Serial:    t_comp = O(Ng log Ng);      t_comm = O(t_lm(Ng))
    Particle:  t_comp = O((Ng/P) log Ng);  t_comm = O(t_lm(t_comp)) + O(t_lm(Ng))
    Grid:      t_comp = O((Ng/P) log Ng);  t_comm = O(t_lm(t_comp)) + O(t_lm(Ng))

Field-Grid calculations (Finite Differences):
    Serial:    t_comp = O(Ng);     t_comm = O(t_lm(Ng))
    Particle:  t_comp = O(Ng/P);   t_comm = O(t_lm(Ng/P))
    Grid:      t_comp = O(Ng/P);   t_comm = O(t_lm(Ng/P))


5.3.1 Particle Updates – Velocities

The particle routine for pushing the velocities gathers the field at each particle and then updates the velocities using this field. The field gather is a bilinear interpolation that consists of 7 additions and 8 multiplications. The number of additions can be reduced by 2 by using temporary variables for the weighting sums (hx - a) and (hy - b). The velocity update v = v + [q Epart (1/m) Δt - drag v Δt] can similarly be optimized to use only 2 multiplications and 2 additions for each dimension. We hence have:

    Tserial-part-push-v = tcomp-gather + tcomp-part-push-v
                          + tcomm-gather + tcomm-part-push-v          (5.1, 5.2)
                        = Np c1calc + Np c2calc + tlm(Ng) + tlm(Np)    (5.3, 5.4)
                        = O(Np) + O(tlm(Ng) + tlm(Np)),                (5.5)

where c1calc and c2calc are the times spent doing the multiplications and additions associated with the gather and the velocity updates, respectively.

Notice that if the particles are not sorted with respect to their grid location, tlm(Ng) would imply a lot of local cache hits (assuming the grid is too large to fit in cache). If the grid is stored either row- or column-wise, each gather will also imply a read from two different grid cache-lines when the rows or columns exceed the cache-line size, unless a reordering scheme such as cell-caching is done. Cell-caching was introduced in Chapter 4 and its complexity is discussed later in this chapter.
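To make the operation counts above concrete, the following is a minimal C sketch of a bilinear gather plus velocity update for a single particle. The row-major field arrays, the variable names, and the folding of the charge-to-mass, time-step, and normalization factors into the constants qm_dt and drag_dt are illustrative assumptions, not the actual routine.

    /* Sketch only: bilinear gather of E at one particle followed by the
     * velocity update, using temporary weighting variables so the weights
     * are computed once per dimension.  Normalization by the cell area
     * (hx*hy) is assumed folded into qm_dt.                              */
    void gather_and_push_v(double x, double y, double *vx, double *vy,
                           const double *Ex, const double *Ey,
                           int Nx, double hx, double hy,
                           double qm_dt, double drag_dt)
    {
        int    i  = (int)(x / hx);          /* cell indices                 */
        int    j  = (int)(y / hy);
        double a  = x - i * hx;             /* offsets within the cell      */
        double b  = y - j * hy;
        double wx = hx - a, wy = hy - b;    /* temporary weighting sums     */

        /* Bilinear interpolation from the four surrounding grid points.    */
        double ex = wx * wy * Ex[j * Nx + i]       + a * wy * Ex[j * Nx + i + 1]
                  + wx * b  * Ex[(j + 1) * Nx + i] + a * b  * Ex[(j + 1) * Nx + i + 1];
        double ey = wx * wy * Ey[j * Nx + i]       + a * wy * Ey[j * Nx + i + 1]
                  + wx * b  * Ey[(j + 1) * Nx + i] + a * b  * Ey[(j + 1) * Nx + i + 1];

        /* Velocity update: two multiplies and two additions per dimension
         * once qm_dt and drag_dt are precomputed constants.                */
        *vx += qm_dt * ex - drag_dt * (*vx);
        *vy += qm_dt * ey - drag_dt * (*vy);
    }

The temporaries wx and wy correspond to the (hx - a) and (hy - b) weighting sums mentioned above.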

5.3.2 Particle Updates – Positions

The particle routine for pushing the locations updates each location with the following 2D multiply and add: x = x + v Δt, again giving:

    Tserial-part-push-x = tcomp-part-push-x + tcomm-part-push-x    (5.6)
                        = Np ccalc + tlm(Np)                        (5.7)
                        = O(Np) + O(tlm(Np)),                       (5.8)

where ccalc is the time spent doing the above multiplications and additions.

5.3.3 Calculating the Particles' Contribution to the Charge Density (Scatter)

Here each particle's charge gets scattered to the 4 grid corners. This operation requires 8 additions and 8 multiplications, where, as in the gather case, 2 additions can be saved by using temporary variables for the weighting sums (hx - a) and (hy - b). We hence have:

    Tserial-charge-gather = tcomp-charge-gather + tcomm-charge-gather    (5.9)
                          = Np ccalc + tlm(Np)                            (5.10)
                          = O(Np) + O(tlm(Np)),                           (5.11)

where ccalc is the time spent doing the above multiplications and additions.

Like the gather case, each particle is likely to cause two cache-hits regarding grid points unless a reordering scheme such as cell-caching is employed.

5.3.4 FFT-solver

1D FFTs are known to take only O(N log N) computation time. Our current serial code solves for the field according to the following FFT-based algorithm:

Step 1: Allocate a temporary array for storing a complex column before calling the FFT routine provided by Numerical Recipes.

Step 2: Copy the matrix into the complex array. O(Ng) (memory access)

Step 3a: Call the complex FFT function row-wise. O(Ng log Nx)

Step 3b: Call the complex FFT function column-wise. O(Ng log Ny)
(We now have the FFT of the charge density, i.e. ρ(kx, ky).)

Step 4: Scale in Fourier mode (i.e. divide by k² = kx² + ky² and scale by 1/eps to get φ(kx, ky)). O(Ng) additions and multiplications.

Step 5a: Call the complex FFT function doing an inverse FFT column-wise. O(Ng log Ny)

Step 5b: Call the complex inverse FFT function row-wise. O(Ng log Nx)

Step 6: Transfer the result back to the real array. O(Ng) (memory access)

Step 7: Obtain the final result by scaling (multiplying) the result by 1/(Nx Ny). O(Ng) multiplications.
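Step 4 is the only place the Poisson equation itself enters. Purely as an illustration (the interleaved complex layout, the wavenumber convention, and the name eps are assumptions, not taken from our code), the Fourier-mode scaling loop can be sketched as follows:

    #include <math.h>

    /* Sketch of Step 4: divide each Fourier mode of the charge density by
     * k^2 = kx^2 + ky^2 and scale by 1/eps to obtain the potential in
     * Fourier space.  rho_hat holds nx*ny complex numbers, interleaved
     * (re, im), in row-major order; lx and ly are the domain lengths.     */
    void scale_fourier_modes(double *rho_hat, int nx, int ny,
                             double lx, double ly, double eps)
    {
        for (int j = 0; j < ny; j++) {
            /* Map row/column index to a signed wavenumber. */
            double ky = 2.0 * M_PI * ((j <= ny / 2) ? j : j - ny) / ly;
            for (int i = 0; i < nx; i++) {
                double kx = 2.0 * M_PI * ((i <= nx / 2) ? i : i - nx) / lx;
                double k2 = kx * kx + ky * ky;
                double s  = (k2 > 0.0) ? 1.0 / (eps * k2) : 0.0;  /* zero mode set to 0 */
                rho_hat[2 * (j * nx + i)]     *= s;   /* real part      */
                rho_hat[2 * (j * nx + i) + 1] *= s;   /* imaginary part */
            }
        }
    }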

The memory copying (including transposing) and scaling operations (Steps 1, 2, 4, 6 and 7) are all considered to be local and linear in the number of grid points, i.e. they require O(Ng) operations with no inter-processor data transfers taking place (hopefully).

The memory copying could be avoided by implementing a more tailored FFT. However, since we strongly recommend that vendor-optimized FFT routines be used (and these often use proprietary algorithms), we will stick to the simplified notion that a 2-D FFT generally takes O(N² log N) on a serial computer.

In the above algorithm there is an implicit transpose between steps 3a and 3b, as well as between 5a and 5b. Such transposes generally generate a lot of cache-hits. We hence have:

    Tserial-fft-solver = tcomp-fft-solver + tcomm-fft-solver    (5.12)
                       = O(Ng log Ng) + O(tlm(Ng)),             (5.13)

where O(tlm(Ng)) signifies the cache-hits associated with the transpose. The 2D FFT will also have some cache-hits associated with it.

5.3.5 Field-Grid Calculation

The field-grid calculation determines the electric field in each direction by using a 1-D finite difference equation of the potentials (on the grid) calculated by the field solver. This involves 2 additions and 2 multiplications for each grid point for each direction, giving the following computation time:

    Tserial-field-grid = tcomp-field-grid + tcomm-field-grid    (5.14)
                       = Ng ccalc + tlm(Ng)                      (5.15)
                       = O(Ng) + O(tlm(Ng)).                     (5.16)

Notice that since Np generally is an order of magnitude or so larger than Ng, we can expect this routine to be fairly insignificant with respect to the rest of the computation time. This is verified in Chapter 6.
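As an aside, a minimal sketch of such a finite-difference field calculation on a periodic grid is shown below (E = -grad φ via central differences); the array names, the row-major layout, and the periodic wrap-around are illustrative assumptions rather than a copy of our routine.

    /* Sketch: derive Ex and Ey from the potential phi with central
     * differences on a periodic, row-major nx-by-ny grid.  Each grid point
     * costs a couple of additions and multiplications per direction.      */
    void field_grid(const double *phi, double *ex, double *ey,
                    int nx, int ny, double hx, double hy)
    {
        double cx = -0.5 / hx;   /* precomputed 1/(2 hx) factors */
        double cy = -0.5 / hy;

        for (int j = 0; j < ny; j++) {
            int jm = (j == 0)      ? ny - 1 : j - 1;   /* periodic neighbours */
            int jp = (j == ny - 1) ? 0      : j + 1;
            for (int i = 0; i < nx; i++) {
                int im = (i == 0)      ? nx - 1 : i - 1;
                int ip = (i == nx - 1) ? 0      : i + 1;
                ex[j * nx + i] = cx * (phi[j * nx + ip] - phi[j * nx + im]);
                ey[j * nx + i] = cy * (phi[jp * nx + i] - phi[jm * nx + i]);
            }
        }
    }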

5.4 Parallel PIC – Fixed Particle Partitioning, Replicated Grids

As mentioned in Chapter 4, one way of parallelizing a PIC particle code is to replicate the grid array(s). This is done to avoid the write conflicts that may occur when particles in neighboring cells try to contribute to the same charge density grid points.

In the shared memory setting, we only need to physically replicate the charge density array. The other grids (the potential and the field grids) get copied in on read-access, and since they are only written to once for each time-step, we do not have to worry about write-conflicts causing erroneous results.

5.4.1 Particle Updates – Velocities

The particle push routines are ideal candidates for parallelization since they perform independent calculations on each particle. Since we assume the same nodes always process the same particles, parallelization is achieved by parallelizing the global particle loop. We hence have

    tcomp-replicated-grid-push-v = (Np/P) ccalc    (5.17)
                                 = O(Np/P).        (5.18)


Notice that there is no inter-process communication associated with writes. However, since the particles tend to become randomly distributed in space (if they were not already), and since a processing element is always responsible for updating the same group of particles, regardless of the particle grid location, the reads associated with gathering the field at the particles will most likely cause cache-hits for large grids, since it is unlikely the whole field grid will fit in the subcache.

Regardless, the replicated grid method is likely to cause the entire grid to get copied into local memory as particles are scattered all over the system. This can be viewed as another level of cache hit:

    tcomm-replicated-grid-push-v = tgrid + tloc-ptcls                (5.19, 5.20)
                                 = O(tgm(Ng)) + O(tlm(Np/P)).        (5.21)

Assuming the grid was distributed by the solver and field-grid phases into block-columns or block-rows, this means that when using large systems spanning several rings, these copies will occur as hierarchical broadcasts incorporated in the function tgm. However, since we can assume the entire grid eventually will be needed, we can take advantage of pre-fetching if such operations are available, and if the grid will fit in local memory.

5.4.2 Particle Updates – Positions

Since the particle position updates are completely independent both of one another and of the grid, this routine can be "trivially" parallelized, and near-perfect linear speed-up is expected:

    Treplicated-grid-push-loc = tcomp-push-x + tcomm-push-x    (5.22)
                              = O(Np/P) + O(tlm(Np/P)).        (5.23)

The cache-hits associated with reading and writing each particle position (O(tlm(Np/P))) are here minimal since the particle partitioning is fixed. The particle arrays can hence be sequentially accessed.

5.4.3 Calculating the Particles' Contribution to the Charge Density

Once the particle positions are determined, one can calculate the charge density by accumulating the charges each particle represents on the grid partitioning the simulation space. Hence, each particle's charge gets scattered to the 4 grid corners of the grid cell containing the particle. Since there are four cells associated with each grid point, write-conflicts are likely to occur since the particle partitioning is fixed. These write-conflicts are, however, avoided by replicating the grids in the computation phase. However, after these local charge density arrays have been computed, they all need to be summed up into a global array for the solver. This sum could be done either serially or in parallel using a tree structure.

The serial sum would require O(Ng P) additions plus P-1 block-transfers of Ng elements. These transfers, like the ones happening in the push-v phase, may occur as hierarchical gathers. Again, since we know in advance which data we will be needing, we can take advantage of pre-fetching if such operations are available.

If a global parallel tree sum was used, O(Ng log P) additions would be required. Notice, however, that if P = 128, log P = 7, which means O(Ng log P) may approach Np (assuming 10 or so particles per cell)! Our benchmarks in Chapter 6 illustrate this phenomenon.

If one processing node were to gather the whole result, this would cause network traffic for the solver since the solver would be using a distributed grid. This could be avoided if the sums were arranged to be accumulated on subgrids. The root of each summation tree will here correspond to the final destination of the subgrid. Notice that this split still causes the same sum complexity:

    tcomp-parallel-sum = O(P (Ng/P) log P) = O(Ng log P).
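For illustration, a tree sum over replicated grids can be sketched as below in C; the contiguous data layout (one grid copy per processor) and the serial outer loop standing in for the parallel tree levels are simplifying assumptions, not the actual KSR implementation.

    /* Sketch: tree sum of p replicated charge-density grids into copy 0.
     * copies[k*ng + i] holds processor k's local grid.  The outer loop
     * walks the log2(p) levels of the tree; in a parallel version each
     * level's additions run concurrently, one pair per destination node.  */
    void tree_sum_grids(double *copies, int ng, int p)
    {
        for (int stride = 1; stride < p; stride *= 2) {        /* log2(p) levels */
            for (int dst = 0; dst + stride < p; dst += 2 * stride) {
                double       *a = copies + (long)dst * ng;
                const double *b = copies + (long)(dst + stride) * ng;
                for (int i = 0; i < ng; i++)                   /* Ng adds per level */
                    a[i] += b[i];
            }
        }
    }

With Ng additions on the critical path per level and log P levels, this matches the O(Ng log P) figure above; rooting each subtree at the subgrid's final destination changes the data movement, not this count.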


We hence have:

    Treplicated-grid-scatter = tcomp-scatter + tcomp-grid-sum
                               + tcomm-scatter + tcomm-grid-sum           (5.24, 5.25)
                             = O(Np/P) + O(Ng log P)
                               + O(tlm(Np/P)) + O(tgm(Ng log P)).         (5.26, 5.27)

5.4.4 Distributed Memory FFT Solver

Since the FFT solver is completely independent of the particles, its parallelization can also be done separately. In fact, it is highly recommended that one takes advantage of vendor-provided parallel libraries whenever available for this case.

Transpose

Notice that the communication involved in going from one dimension to the other basically involves transposing a distributed grid. Assuming one vector per processor, the most obvious way to do a ring transpose is to, at the first step, send out the whole vector except the entry that the node itself will be using in the row (column) operation. If a full I/O ring was available, this would imply taking communication time = α + (N-1)β on an N-by-N grid. There should be no conflicts since the nodes will be sending this data to their nearest neighbor.

For each step, one entry then gets "picked off" until every node has received the intended row (or column) (see the example below). The average message length would be N/2, but the start-up time would each time be α. This gives Nα + (N²/2)β total communication time. Figure 5.1 depicts the data movements in a transpose of a 4-by-4 matrix on a 4-processor ring.


[Figure 5.1: Transpose of a 4-by-4 matrix on a 4-processor ring. At each step every node keeps one entry and sends the remaining entries on to its neighbor, until all entries of its intended row (or column) have arrived.]

I.e. N = P = 4, giving us a communication time of 4α + 4·2β (or 4α + 8β), i.e. the transpose is O(N²)!! This is worrisome with respect to the O(N log N) FFT computational time, especially since computational time is usually a lot less than communication time.

Over-all 2D-FFT complexity

Following the ring approach above, the total communication cost for the "transpose" is:

    tcomm-transpose = Nα + (N²/2)β.

This is the communication time it would take to re-shuffle the N-vectors. Note that the above equations assume that there is exactly one N-vector per processor. This gives a total communication time for the FFT solver (P = Nx = Ny):

    tcomm-fft = Pα + (N²/2)β.

The total time for a 2D parallel FFT with one N-vector per processor will then be:

    Tparallel-2D-fft = tcomp-2D-fft + tcomm-2D-fft           (5.28)
                     = 2(N² log N) ccalc + Nα + (N²/2)β,      (5.29)

assuming the data-blocks are bunched together so that the number of vectors within a computation phase does not affect the start-up time.

Here ccalc is a constant indicating the speed of the processor. Note that since this analysis did not take into consideration the number of memory accesses made versus the number of floating point calculations made, this parameter would have to be some statistical average of the two in order to be meaningful. Local caching considerations could also be figured into this parameter.

Assume now an N-by-N grid on a P-processor ring, where N/P = k, k an integer. This means that each processor has a computation phase that performs N/P of the (N log N) 1-D FFTs, and each communication phase transfers ("transposes") on the average (N/P)(N/2) blocks of data P times.

Each processing unit in the computation phase does

    tcomp-fft = (N/P)(2N log N)

computations. Each communication phase transfers ("transposes") on the average (N/P)(N/2) blocks of data during the P phases. This gives total communication time (multiplying the above with N/P):

    tcomm-fft = Nα + (N²/2)β.    (Note: (N/P)P = N.)


The total time estimate for a 2D FFT is hence:

    T2D-fft = O((Ng/P) log Ng) ccalc + O(Pα + (Ng/2)β)    (5.30)
            = O((Ng/P) log Ng + Pα + Ng β).               (5.31)

When several rings are involved, another memory hierarchy gets added. Similarly, for truly large systems, one 1D FFT may not fit in a cache line, and additional performance hits will hence be taken.

Parallel 2D FFT Solver

A parallel 2D FFT solver involves two of the above parallel 2D FFTs (counting both the regular FFT and the inverse FFT) and a scaling of 1/(N²) for each grid quantity. The factor of 2 does not show in the complexity equations; in fact, even the scaling of O(N²) can be considered included in the 2D FFT (O(N² log N)). Assuming we have a grid of size Ng = N², we hence have the following complexity equations for the 2D FFT solver:

    Tparallel-fft-solver = tcomp-fft-solver + tcomm-fft-solver             (5.32)
                         = O(Ng log Ng) + O(tlm(Ng log Ng)) + O(tgm(Ng)),   (5.33)

where O(tgm(Ng)) signifies the cache-hits and memory traffic associated with the transpose. The 2D FFT will also have some cache-hits associated with it.

5.4.5 Parallel Field-Grid Calculation

The finite difference equations associated with the field-grid calculations parallelize fairly straightforwardly. Since the input grid is the potential grid supplied by the solver, it makes sense to partition the interior loop into comparable block-vectors. The border cases may be parallelized separately. One should hence be able to achieve:

    Tparallel-field-grid = tcomp-field-grid + tcomm-field-grid    (5.34)
                         = (Ng/P) ccalc + tlm(Ng/P)                (5.35)
                         = O(Ng/P) + O(tlm(Ng/P)).                 (5.36)

The field-grid calculations may cause some read cache hits for the next phase if the pusher is not going to be using block-vector partitioning.

5.5 Parallel PIC – Partitioned Charge Grid Using Temporary Borders and Partially Sorted Local Particle Arrays

In this section we will analyze the grid partitioning approach outlined in Chapter 4, which replicates the grid borders and uses dual pointers on the local particle arrays in order to maintain an automatic partial sorting of the particles.

5.5.1 Particle Updates – Velocities

The total time of the update of the local particle velocity arrays is equal to the maximum time required on one node, i.e.:

    tcomp-serial-part-push-v = maxlocalNp ccalc    (5.37)
                             = O(maxlocalNp).      (5.38)

Adding in the communication time associated with the gather and the velocity updates gives us the following equation:

    T∥-push-v = tcomp-gather + tcomp-part-push-v
                + tcomm-gather + tcomm-part-push-v                        (5.39, 5.40)
              = maxlocNp c1calc + maxlocNp c2calc + tlm(Ng) + tlm(maxlocNp)    (5.41, 5.42)
              = O(maxlocNp) + O(tlm(Ng) + tlm(maxlocNp)),                 (5.43)

where c1calc and c2calc are the times spent doing the multiplications and additions associated with the gather and the velocity updates, respectively. Notice that this version suffers a lot fewer cache-hits in the gather phase than the replicated grid method. Since the particles are already partially sorted, only local grid points, or those right on their borders, contribute to the field at each particle (tlm versus tgm).

5.5.2 Particle Updates – Positions

If the particles are stored in local arrays using a read and a write pointer on the local position array(s), plus a write pointer for an output/scratch array, then one can check whether the particles' new locations are still within the local grid and, based on this check, either write them back into the local array or out to the global scratch array. These operations may be completely local if each node has its own scratch area to write to; hence the use of an extra write pointer per node for the out/scratch array. (The nodes may otherwise have to spin-lock on a global write pointer.) This part of the code hence takes:

    tpush-local-ptcls = maxlocalNp ccalc    (5.44)
                      = O(maxlocalNp).      (5.45)

If the particle arrays are perfectly load-balanced, O(maxlocalNp) = O(Np/P). It is highly unlikely that a large number of particles will leave on any given time-step, since most particles should be moving within each subdomain if accurate (and efficient) results are to be obtained. How bad O(maxlocalNp) will get depends on how inhomogeneous the problem being simulated gets, and whether dynamic grid-reordering (global sorts needed) is done to compensate for these imbalances.

The above figures do not include the testing of the new particle location, or the cost of also having to auto-sort the velocities so that they correspond to the particle locations. The latter would involve O(maxlocalNp) local read/writes. Notice that if the grid is divided up along only one dimension, then only one particle location index has to be tested.
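A minimal sketch of this dual-pointer update is given below; the particle layout (a small struct per particle), the bounds test, and all names are illustrative assumptions rather than the actual data structures.

    /* Sketch: position push with dual pointers.  Particles that stay inside
     * the node's grid domain are written back compactly through the local
     * write pointer; particles that leave are appended to this node's
     * out/scratch area for other nodes to pick up.                          */
    typedef struct { double x, y, vx, vy; } particle;

    int push_positions(particle *local, int n_local,      /* read side         */
                       particle *scratch, int *n_scratch, /* out/scratch area  */
                       double dt,
                       double xmin, double xmax,          /* local grid domain */
                       double ymin, double ymax)
    {
        int kept = 0;                                      /* local write pointer */
        for (int r = 0; r < n_local; r++) {                /* r = read pointer    */
            particle p = local[r];
            p.x += p.vx * dt;
            p.y += p.vy * dt;
            if (p.x >= xmin && p.x < xmax && p.y >= ymin && p.y < ymax)
                local[kept++] = p;                         /* still ours          */
            else
                scratch[(*n_scratch)++] = p;               /* left the subdomain  */
        }
        return kept;                                       /* new local count     */
    }

Because velocities travel with the positions in this sketch, the auto-sort of the velocity arrays mentioned above is implicit; with separate arrays it costs the extra O(maxlocalNp) local read/writes noted in the text.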

To account for possible incoming particles, the nodes now have to check each other's out/scratch arrays. This will involve:

    tfind-new-particles = tlm(maxlocalNp)  [writes]    (5.46)
                          + tgm(Npmoved)   [reads].    (5.47)

Notice that the reads involve read-transfers from other nodes. How much time these updates will take depends on how many particles leave their local domain.

The number of particles leaving their local domain each time-step depends on the grid partitioning, the distribution of the particles, and their initial velocities. If the particles are uniformly distributed, the optimal grid partitioning would be square subgrids (see Chapter 4, Section 3.4). Notice that this partitioning conflicts with the one desirable for the FFT solver, which requires a block-vector partitioning (Figure 5.2).

[Figure 5.2: Grid Partitioning; a) block-vector, b) subgrid.]

A small savings will be achieved when using block-vector partitioning in the test for particle locations, since only one dimension has to be checked. However, this may be far out-weighed by the penalty for, in the block-vector case, having larger borders with fewer interior points than the subgrid partitioning.


Assume that the average probability that the local particles on a processing node leave their cell after each time-step is P(locNp leave cell). Assuming one row or column of grid-cells per processor, then about 50% of those particles that leave their local cell will still remain on the local processor (ignoring the ones leaving diagonally off the four corners). See Figure 5.3.

[Figure 5.3: Movement of particles leaving their cells, block-vector setting.]

This means that P(locNp leave cell)/2 particles must be transferred to another processor.

However, if a sub-matrix partitioning was chosen for the grid points in the particle push phase, then, assuming the particles will not move more than one cell over, only the border cells will be contributing particles to be moved to another processing node.

The probability of particles leaving is a function of the initial velocities and positions of the particles and the size of each cell. The question then becomes: how "bad" is the distribution of particles, and how likely is this "bunch" to leave its current processor?

Notice also that if one could anticipate the over-all drift of the particles, the grid partitioning could adapt to the drift. E.g. if the particles are generally drifting in the x-direction, then it would be very advantageous to partition the grid as block-rows, thereby significantly reducing the need for particle sorting (re-location of particles) after each time-step. However, in plasma simulations there is usually considerable up-down motion, even though there is left-right drift present.


The total time taken, aside from local read/writes and dynamic load-balancing steps, will be:

    T∥-push-x = tcomp-push-x + tcomm-loc-ptcls + tcomm-find-new-ptcls        (5.48)
              = maxlocNp ccalc + tlm(maxlocNp) + tgm(Npmoved)                 (5.49)
              = O(maxlocNp) + O(tlm(maxlocNp) + tgm(Npmoved)),                (5.50)

where ccalc includes the additions and multiplications associated with the particle position updates as well as the tests for whether the particles' new positions are still within the local grid domain of their processing node. The searches through the out/scratch arrays are included as tgm(Npmoved).

5.5.3 Calculating the Particles' Contribution to the Charge Density

In this routine each particle's charge is scattered to the four grid points surrounding its cell, where the charge is then added to the charge density. Since the particles share the grid points with particles in the nearest neighboring cells, some overhead must be expected in overcoming the potential write-conflicts that will result when the grid is distributed. (See Figure 5.4.)

A standard approach would be to use a global lock on the border elements; however, this will cause several busy-waits unless the particles are completely sorted.

Another approach would be to use the idea from replicated grids, i.e. replicated borders. Notice that only one border for each dimension needs to be replicated, since the other local border may as well be the master copy. When calculating the particles' contribution to the overall charge density, we first calculate them as if the extra border array was the top border (this simplifies the implementation for periodic systems). These border arrays then get added to the master copy. Since only one node has an extra copy of any given border, there will be no write conflicts, so these additions can be done in parallel. However, there will be network traffic involved in getting the corresponding master copy values to be added in.


[Figure 5.4: Parallel Charge Accumulation: Particles in cell 'A' share grid points with the particles in the 8 neighboring cells. Two of these grid points, 'a' and 'b', are shared with particles updated by another processing node.]

    Tparallel-scatter = tcomp-scatter + tcomp-border-sum
                        + tcomm-scatter + tcomm-border-sum    (5.51)

                      = O(maxlocNp) + O(max(Nx/Px, Ny/Py))
                        + O(tlm(maxlocNp)) + O(tgm(max(Nx/Px, Ny/Py))).    (5.52-5.54)

Notice that the size of the border(s) depends on the partitioning. If the domain is only partitioned in one direction (block-vector), say in y, then the border will be max(Nx/Px, Ny/Py) = Nx/1 = Nx, and the total extra space allocated for these borders is P Nx. Notice how even this is considerably less than what the replicated grid approach uses (P Ng when Ny >= P).

As in the particle push phase, the optimal case for simulations with uniform particle distributions will be square sub-partitioning.
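To illustrate the replicated-top-border idea, the sketch below deposits charge into a local block that carries one extra row, and then folds that row into the neighbor's master copy with a plain loop, as one can on a shared-address-space machine. The names, the block-vector layout in y, and the periodic wrap in x are assumptions for illustration only.

    /* Sketch: local charge scatter with one replicated (top) border row.
     * rho_local has (ny_loc + 1) rows of nx entries; row ny_loc is the
     * extra border that logically belongs to the neighboring node.       */
    void scatter_with_border(const double *x, const double *y, int n_ptcl,
                             double q, double hx, double hy, double y0,
                             double *rho_local, int nx, int ny_loc)
    {
        for (int p = 0; p < n_ptcl; p++) {
            int    i  = (int)(x[p] / hx);
            int    j  = (int)((y[p] - y0) / hy);        /* local row index  */
            double a  = x[p] / hx - i;
            double b  = (y[p] - y0) / hy - j;
            int    ip = (i + 1) % nx;                   /* periodic in x    */
            rho_local[j * nx + i]        += q * (1 - a) * (1 - b);
            rho_local[j * nx + ip]       += q * a       * (1 - b);
            rho_local[(j + 1) * nx + i]  += q * (1 - a) * b;  /* may land in border row */
            rho_local[(j + 1) * nx + ip] += q * a       * b;
        }
    }

    /* Afterwards the border row is added into the neighbor's master copy;
     * only one node holds an extra copy of a given border, so there are no
     * write conflicts and all nodes can do this step in parallel.          */
    void add_border_to_master(const double *border_row, double *master_row, int nx)
    {
        for (int i = 0; i < nx; i++)
            master_row[i] += border_row[i];
    }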



5.5.4 FFT Solver

Despite the fact that the block-vector partitioning is far from optimal for the other parts of the code, it is probably inadvisable to try any other partitioning for the FFT phase. Most vendors provide very efficient hand-coded FFT routines that would probably be a lot faster than user-coded versions. Ideally, parallel 2D FFTs and inverse FFTs are available, and all one has to worry about is accessing/arranging the data in the appropriate order. If only 1D FFT routines are available, the serial local version may be used as a building block as described in Section 5.4. Regardless, the overall time for a parallelized solver should be that of the solver used in the replicated grid case.

A re-arrangement of the grid from a subgrid partitioning to a block-vector partitioning (or vice versa) will require time tgm(Ng).

5.5.5 Parallel Field-Grid Calculation

The finite difference equations associated with the field-grid calculations parallelize as for the replicated grid case:

    Tparallel-field-grid = tcomp-field-grid + tcomm-field-grid    (5.55)
                         = (Ng/P) ccalc + tlm(Ng/P)                (5.56)
                         = O(Ng/P) + O(tlm(Ng/P)).                 (5.57)

As before, the field-grid calculations may cause some read cache hits for the next phase if the pusher is not going to be using block-vector partitioning.

5.6 Hierarchical Data Structures: Cell-caching

Grids are typically stored either column- or row-wise. However, in a system with memory hierarchies, especially a system with caches, we showed in Chapter 4 that these are not necessarily the best storage schemes for PIC codes. Instead, one should try to build a data structure that minimizes the number of cache-lines needed for each scatter (calculation of a particle's contribution to the charge density).

When the particle's charge contributions are collected back on the grid points, each particle will be accessing all four grid points of its cell (assuming a quadrangular grid structure). If one uses a column or row storage of the grid, one can hence expect the need to access two cache-lines of grid quantities for each particle. E.g. for a row-wise stored grid there will typically be one cache-line containing the top two corners and one containing the bottom two corners.

If one instead uses the cell-caching strategy of Chapter 4, where one tries to fit as many cells as possible within a cache line, the number of cache-hits can be reduced. In this case, square subdomains of the grid are fitted into the cache rather than rows or columns (or block-rows and block-columns). For 3D codes, sub-cubes should be accommodated as well as possible. In other words, whatever access pattern the code uses, the cache use should try to reflect this.

Since a cache-line on the KSR is 128 bytes, i.e. 16 64-bit floating point numbers, each cache-line can accommodate a 4-by-4 subgrid. Figure 5.5 shows the number of cache-hits associated with each cell in this case. This means that a column/row storage approach would use 2x16 cache-hits, whereas cell-caching would use [(3x3)*1] + [(3+3)*2] + 4 = 25 cache-hits, an improvement of more than 25%!

In general, if the cache size is Cx by Cy, then the total number of cache-hits per cell is:

    Total cache hits per cell = (Cx-1)(Cy-1)·1  (interior points)    (5.58)
                                + ((Cx-1)+(Cy-1))·2  (borders)        (5.59)
                                + 1·4  (corner).                      (5.60)

If Cx = Cy = C, the optimal situation for cell-caching, the above equation becomes:

    Total cache hits per cell = C² + 2C + 1  (cell-caching).    (5.61)


[Figure 5.5: Cache-hits for a 4x4 cell-cached subgrid (f = grid points stored in the cache-line).]

In order not to have to change all the other algorithms, cell-caching does, however, require more array indexing. To access array element A(i,j), we need:

    A(i,j) = A[(i/C)(C²(Nx/C)) + (j/C)C² + (j mod C) + (i mod C)C].    (5.62, 5.63)

Although this requires many more operations than the typical A(i,j) = A[i Nx + j], most of these operations can be performed by simple shifts when C is a power of 2. Compared to grabbing another cache-line (which could be very costly if one has to get it from across the system on a different ring), this is still a negligible cost.
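A sketch of such an index computation in C is shown below, for C-by-C tiles with C a power of two; the tile-row stride used here (C² times the number of tiles per row) is an assumed layout for illustration, not necessarily the exact scheme in our code.

    /* Sketch: cell-cached ("tiled") indexing of a grid stored as C-by-C
     * tiles, C a power of two, so divisions and moduli reduce to shifts
     * and masks.  nx is the grid width in points (a multiple of C).        */
    #define CLOG2 2                        /* C = 4, one KSR cache-line      */
    #define C     (1 << CLOG2)

    int cell_cached_index(int i, int j, int nx)
    {
        int block_row      = i >> CLOG2;   /* i / C   */
        int block_col      = j >> CLOG2;   /* j / C   */
        int in_i           = i & (C - 1);  /* i mod C */
        int in_j           = j & (C - 1);  /* j mod C */
        int blocks_per_row = nx >> CLOG2;  /* C-by-C tiles per grid row      */

        return block_row * (C * C * blocks_per_row)   /* skip whole tile rows   */
             + block_col * (C * C)                    /* skip tiles in this row */
             + in_i * C + in_j;                       /* offset inside the tile */
    }

Access then becomes A[cell_cached_index(i, j, Nx)] instead of A[i*Nx + j].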


It should be noted that the cell-cached storage scheme conflicts at the local level with the FFT solver. The overall grid could still be stored block-column, or rather as block-column cell-caches, to avoid communication costs when re-ordering the grid. In the replicated grid case, the re-orderings could be incorporated in the final stage of the grid summation and as part of the field-grid calculations, with the help of a temporary grid array.

It should be emphasized that cell-caching is a local memory construct that benefits serial as well as parallel machines with local caches.


Chapter 6

Implementation on the KSR1

"One must learn by doing the thing; though you think you know it, you have no certainty until you try." – Sophocles [495-406 B.C.], Trachiniae.

This chapter describes our implementations of some of the ideas presented in Chapters 4 and 5. Our test-bed is the Kendall Square Research machine KSR1 currently available at the Cornell Theory Center.

Although the KSR1 physically has its chunks of 32Mb main real memory distributed among its processors, to a programmer it is addressed like a shared-memory machine. However, serious performance hits can be experienced if one does not pay careful attention to memory locality and caching. For optimal performance one would hence have to consider it a distributed memory machine with respect to data locality. By also having a good understanding of how the memory hierarchy works, delays due to swapping and data copying can be minimized. Many of these issues were discussed in Chapters 4 and 5.

The configuration in this study has 128 processor nodes with 32Mb of real memory on each processing node.


6.1 Architecture overview

Each KSR processor cell consists of a 0.5Mb sub-cache, half of which is dedicated to instructions, the other half to data. Its 64-bit, 20 MIPS processing units execute two instructions per cycle, giving, according to the manufacturer, 40 peak MFLOPS per cell (28 MFLOPS FFT and 32 MFLOPS matrix multiplication). Communication between the processing cell and its local memory (cache) is said to perform at 1Gb/second, whereas the standard I/O channels are listed at 30 Mb/second. The cells experience the following memory latencies:

    sub-cache (0.5Mb): 2 cycles or 0.1 microseconds;
    local cache (32Mb): 20-24 cycles or 1 microsecond;
    other cache on same ring (992 Mb): 130 cycles or 6.5 microseconds;
    cache on other ring(s) (33,792 Mb): 570 cycles or 28.5 microseconds;
    external disks (variable): 400,000 cycles! or approximately 20 msec.

The long disk latency is due almost entirely to the disk-access speed. It should be clear from this figure that programs that need so much memory that frequent disk I/O is necessary during computation will take an inordinate amount of time. In fact, one of the strengths of highly parallel computers is not only their potentially powerful collective processing speed, but also the large amounts of real memory and cache that they provide.

Our test-bed currently has 10 nodes with I/O devices. Care must hence be taken when timing these nodes. For the benchmarks presented in this chapter, a system variable was set so that the threads only ran on nodes without I/O devices.

6.2 Some preliminary timing results

To get a better feel for the performance of the KSR, we first ran a serial C implementation of the standard BLAS routine SSCAL (vector times scalar) on a series of local platforms. It should be noted that since we performed this test with unit stride, this lets the KSR take advantage of caching. However, no cache re-use happened, as each data element was used only once.
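For reference, an SSCAL-style kernel with unit stride is essentially the following few lines (a generic sketch of the BLAS operation x := a*x, not a copy of the benchmark source):

    /* Sketch of the kernel timed here: scale a vector by a scalar in place,
     * unit stride, a single pass over the data (hence no cache re-use).     */
    void sscal(long n, double a, double *x)
    {
        for (long i = 0; i < n; i++)
            x[i] *= a;
    }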

[Table 6.1: SSCAL Serial Timing Results. Times in seconds for vector lengths from 100,000 to 8,000,000 elements on a Sun4c Sparc1 (ottar.cs, load 0.08-0.39), Sun4c Sparc2 (ash.tc, load 0.00-0.06), Sun4m/670MP (leo.cs, load 1.00-1.30), DEC Alpha (sitar.cs, load 0.0-0.1), and the KSR-1 (homer.tc, load 2.25-5.0).]

As can be seen from the results in Table 6.1, the KSR's scalar units out-perform the Sun SPARC 4m/670MPs (although not the DEC Alpha¹) as long as the problem was small enough. For larger problems, the load seemed to play a role, giving us wide-ranging timing results for the same problem. This could be due to caching, as well as the KSR OS's attempt to ship our process dynamically to another node. These are clearly issues that will also need to be considered for parallel implementations.

¹The DEC Alpha results were obtained on a loaner machine from Digital Equipment Corporation with no other users present. Along with later results, the timings obtained seem to indicate that its scalar speed is about 3 times that of a KSR node and about 4 times that of a Sun4m/670MP – very impressive indeed.


6.3 Parallel Support on the KSR

The KSR offers a parallelizing Fortran compiler based on Presto for standard parallelizations such as tiling. Presto commands then get converted to Pthreads, which are again sitting on top of Mach threads. The C compiler uses Pthreads (Posix P1003.4 threads) directly, leaving it to the programmer to do the actual thread manipulations. One may also use SPC (Simple Presto C), which provides a simplified mechanism for functions that get called by a team of threads (generally, one per processing node). The SPC teams provide an implicit barrier right before and after each parallel call.

The KSR's programmer's memory model defines all threads in a single process to have a shared data area. Each thread also has a private data area. The memory system takes care of data replication, distribution, and movement, though it is highly useful to make note of how this is handled so that programs can be optimized to minimize memory transfers.

The KSR offers four Pthread constructs for parallelization: single Pthreads, barriers, mutex locks, and condition variables. Only one thread can call "fork & execute" for a given set of parallel threads. All subsequent threads are equal hierarchically, and any of these may again call a fork.

The barriers are implemented through calls to Barrier check-in and Barrier check-out. The former causes all slaves to wait for the designated master; the latter makes the master wait for all slaves to finish. The explicit KSR Pthread barrier construct hence operates as a "half" barrier. In order to obtain the full synchronization usually associated with the term barrier, a pthread_checkout (master waits for slaves) needs to be issued right before a pthread_checkin (slaves wait for master). SPC calls implicitly provide full barriers, i.e. a check-out followed by a check-in.


6.3.1 C versus Fortran

Fortran may appear to be the most natural language for implementing numerical algorithms. Although at present only the KSR Fortran compiler offers automatic parallelization, we have chosen to implement our code in C since it allows for more flexibility. C, with its powerful pointer constructs for dynamic memory allocation and its strong link to UNIX, is also rapidly becoming more popular. Since we want to use some of the C pointer features in the implementation, C indeed becomes a natural choice. (The author expects the dynamic memory allocation feature to appear in future Fortran standards. If that happens, Fortran may again appear more attractive.) C interfaces well with both Assembly Language and Fortran, and is, along with Fortran 77 and C++, the only language currently available on our KSR1.

6.4 KSR PIC Code

6.4.1 Porting the serial particle code

The first thing we did after exploring some of the Pthread features on test cases was to port our serial particle code, developed on the local Sun workstation, to the KSR1. The only code change required to get it running correctly (aside from editing include-statements pointing to local library files) was to change the printf statements' formatting arguments "%lf" and "%le" to plain "%f" and "%e". This is the result of moving from a 32-bit machine to a 64-bit machine, and hence what the implementor's notion of double is.
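In other words, only lines of the following kind were touched (the variable and value are made up for illustration):

    #include <stdio.h>

    int main(void)
    {
        double t = 0.5;               /* illustrative value                 */
        /* printf("t = %lf\n", t);       original Sun-style format string   */
        printf("t = %f\n", t);        /* plain %f used on the 64-bit KSR    */
        return 0;
    }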

In order to get a better notion of what kind of performance we could hope for, benchmarks were obtained for the serial version of our particle code on the KSR. See Table 6.2. Notice that the difference between the 16x16 entries and the corresponding test-runs for 32x32 grid points with the same number of particles is around 12-13 seconds for all of them. This should hence indicate the increased solver time for going from a 16x16 grid to a 32x32 grid. Notice how this figure proves significant compared to the overall time for the runs with an average of 1-2 particles per cell, but becomes more negligible as the number of particles per cell increases.

Table 6.2: Serial Performance on the KSR1
(100 time-steps, non-optimized code, load = 1.5-3.0)

    Run no.   Grid size   No. of particles   Time (sec)
    1         8x8         64                 1.88
    2a        16x16       1024               15.68
    2b        16x16       2048               27.86
    2c        16x16       4096               51.81
    2d        16x16       8192               100.42
    3a        32x32       1024               28.50
    3b        32x32       2048               39.90
    3c        32x32       4096               63.76
    3d        32x32       8192               113.30

To gain further insight into how much each part of the code contributes to the overall time of the computations, benchmarks were obtained for runs simulating 16,384 particles over 100 timesteps on a Sun Sparc1, a Sun Sparc 670/MP, and the KSR1 (Table 6.3). In the optimized case, we removed divisions from particle loops by globally defining the appropriate inverses and replacing the divisions with multiplications. As one can see from the results, the default KSR division in C is really slow. By setting the -qdivag flag when compiling, we obtained significant speed-ups, but not quite as good as our optimized code (except for the solver, which we did not optimize).

[Table 6.3: Serial Performance of Particle Code Subroutines. 128x128 = 16K particles, 32-by-32 grid, 100 time-steps; times in seconds on the Sun4c/60 Sparc1, Sun Sparc 670/MP, and KSR1, each with the slow (default) division and with the optimized code. Rows: Initializations; FFT Field solve; Scatter (part-rho); Pull-back; Update particle velocities (incl. gather); Update particle locations; Field grid (to E); total Simulation time.]

Like the vector code, these KSR figures also compare favorably to those obtained on Sun workstations. The Sparc1 took, for our application, about 1.5 times longer on the average than the KSR serial runs with a moderate load. These results were used when developing our strategies for Chapters 4 and 5.

It is worth noting how little time the field-grid calculation takes compared with the rest of the code. Since this routine is strictly grid-dependent (i.e. O(Ng)), this matches our analysis in Chapter 5, since there are a lot fewer grid points than simulation particles, i.e. Ng << Np, and all the other routines are dependent on Np.

6.5 Coding the Parallelizations

Our first effort was to parallelize our code by partitioning the code with respect to particles. Since each of the particle updates is independent, this is a fairly "trivial" parallelization for these routines. For the charge collection phase, however, this means blocking and waiting for grid elements, or replicating the grid.

6.5.1 Parallelization Using SPC

In order to achieve the particle parallelization, we first spawned a team of Pthreads using SPC (Simple Presto C) for each available processor. Our code hence uses implicit barrier constructs as depicted below through the prcall calls. Although Pthreads are light-weight constructs producing very little overhead on creation, it is rather wasteful to spawn a thread for thousands of particles on only 64-256 processors. We hence group the particles, using the modulus and ceiling functions when calculating the pointers to the particle number each thread starts at (similar to the partitioning of a random-sized matrix across a set of processors). This can either be done within a single system call using globals, or using local system calls in each subroutine. We opted for the latter for the sake of modularity. Figure 6.1 shows the code segment for our main particle loop after adding SPC calls to the parallelized particle update routines.


[Figure 6.1: Code segment for the main simulation loop (t = 0.0; while (t < t_end) { ... }) after adding SPC prcall calls to the parallelized particle update routines.]
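The modulus/ceiling bookkeeping that decides which particles a given thread handles can be sketched as follows; this is a generic C fragment using integer arithmetic only, not the SPC/prcall interface itself, whose exact signatures are not reproduced here.

    /* Sketch: give each of nthreads threads a contiguous block of particles.
     * The first (np % nthreads) threads get one extra particle, the usual
     * ceiling/modulus trick when the block size does not divide evenly.     */
    void particle_range(int np, int nthreads, int tid, int *first, int *count)
    {
        int base  = np / nthreads;          /* floor(np / nthreads)          */
        int extra = np % nthreads;          /* threads that get one more     */

        *count = base + (tid < extra ? 1 : 0);
        *first = tid * base + (tid < extra ? tid : extra);
    }

Each team member then loops over its [first, first + count) slice of the particle arrays between the implicit SPC barriers.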


6.6 Replicating Grids

The most popular way to handle the parallel writes associated with the scatter (part-rho) routine is to replicate the charge density grid for each processor, have the nodes (threads) update local copies, and then add them together in the end. In a shared-memory-addressed system like the KSR, this can be fairly easily implemented by adding an extra dimension to the grid array.

Figure 6.2 shows how great speed-ups were achieved for the particle pushers for velocities and positions, respectively, when using these replicated grids for charge accumulation. However, as predicted, the charge accumulation routine really takes a major performance hit as the number of processing nodes increases. In fact, for the non-parallelized sum, the overhead in adding the local grid copies together causes so much overhead that this routine experiences a worse than linear slow-down for 8 or more processors! A tree-based parallelized sum is not much better, in that it also actually slows down when more than 16 processing nodes are used. This behavior can clearly be seen from the Scatter curve in Figure 6.2 (which uses the parallel sums).

As expected, the particle position updates scale fairly linearly in the number of processors, since these updates are independent and hence involve no inter-node memory traffic. The velocity updates need to read in grid quantities, and since these accesses are fairly random, global read copies need to be made by the system underneath; this routine hence takes a performance hit compared to the position update routine.

Since the field-grid routine was already fast and only parallelized in one dimension, it also did not benefit much from parallelization. This is not surprising since at 118 nodes, our maximum test case, each thread typically handled only one row of the grid in this phase.


[Figure 6.2: Parallel Scalability of Replicated Grid Approach.]


32-noder<strong>in</strong>gsize.S<strong>in</strong>ceweelim<strong>in</strong>atedI/Onodes<strong>in</strong>ourruns,thismeantthat whenwerequested32nodes,thosenodeswillnolongerbeonthesamer<strong>in</strong>g,<strong>and</strong> thelongeraccesstimesfor<strong>in</strong>ter-r<strong>in</strong>gmemoryaccessstartsplay<strong>in</strong>garole. 113<br />

6.7DistributedGrid<br />

parableoptimization.Asexpected,thethedistributedgridcaseshowscont<strong>in</strong>ued obta<strong>in</strong>edfromthesetwoapproachesforrunswiththesameproblemsize<strong>of</strong>comheadassociatedwithreplicatedgrids.Figure6.3givesscatterrout<strong>in</strong>ebenchmarks<br />

One<strong>of</strong>thereasonsfordistribut<strong>in</strong>gthegridwastoalleviatedthescatterover-<br />

speed-upwith<strong>in</strong>creas<strong>in</strong>gnumber<strong>of</strong>processors,whereasthereplicatedgridapproachbehavesas<strong>in</strong>Figure6.2,byactuallyshow<strong>in</strong>gslow-downwhenus<strong>in</strong>gmore<br />

than16processors.<br />

Figure 6.4 shows the same benchmarks for the partitioned grids case, but now including particle push results as well.

It is harder to compare this data to the replicated grid case, since the efficiency of the push routines is affected by the initial conditions, as demonstrated by the results in Table 6.4. This table compares a simulation with relatively large initial velocities, such as the plasma instability test case (shown in column 1), to one with initial velocities only one-hundredth of that, and one with none. These benchmarks are fairly comparable to each other except for the routine updating the position. Since the global scratch arrays grow with the number of particles leaving at each time-step, this routine takes a large performance hit (almost a factor of 2 in our test case).

Benchmarks for the corresponding replicated grid case are included in the right-most column for comparisons. Although these benchmarks are done on only 4 processors, they already show that the distributed grid scatter routine is faster than the replicated grid scatter routine. However, these results also show the overhead associated with partially sorting and maintaining dynamic local particle arrays.
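As a rough illustration of why the distributed-grid scatter avoids the global reduction entirely, the sketch below deposits charge only into the locally owned block of grid rows; a replicated-grid scatter would instead write into a private full-size grid copy that must later be summed across all nodes. The row-block ownership convention, the unit-charge deposit and all names here are assumptions made for this sketch, not our actual routine.

    /* Sketch (assumed row-block partitioning): each node owns grid rows
     * [row_lo, row_hi).  Thanks to the partial sort, the local particle list
     * only holds particles whose cells fall inside that block, so the deposits
     * below never touch another node's portion of the grid.                  */
    void scatter_local(const double *x, const double *y, int n_local,
                       double *rho, int nx, double hx, double hy,
                       int row_lo, int row_hi)
    {
        for (int p = 0; p < n_local; p++) {
            int j = (int)(x[p] / hx);        /* cell index in x */
            int i = (int)(y[p] / hy);        /* cell index in y */
            if (i >= row_lo && i < row_hi)   /* owned rows only */
                rho[i * nx + j] += 1.0;      /* unit charge; area weights
                                                omitted to keep the sketch short */
        }
    }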



Figure 6.3: Scatter: Distributed Grid versus Replicated Grids



Figure 6.4: Distributed Grid Benchmarks


However, since the distributed particle update routines also scale linearly in time (see Figure 6.4), this approach becomes favorable as the scatter routine starts to dominate the replicated grid case.

Table 6.4: Distributed Grid -- Input Effects
(128x128 grid; 32,768 particles; 4 processors)

                      Vdrift = wp^2*Ly   Vdrift = (1/100)wp^2*Ly   Vdrift = 0.0   N.A. (replicated grid)
                      Time (secs)        Time (secs)               Time (secs)    Time (secs)
    Scatter               12.35              11.77                     11.83          19.33
    Gather + Push-v       23.86              22.53                     22.43          19.20
    Push-x                55.14              29.35                     29.13          16.67

6.8 FFT Solver

We did not implement an optimized 2D FFT solver on the KSR. The KSR has a vendor-provided 2D FFT routine, supplied in Fortran, that we hope to be able to utilize in our production runs. Since these routines are developed by the vendors to show off their hardware, they are often highly optimized routines that use assembly language programming, caching and every other feature the vendor can think of taking advantage of. It is therefore highly unlikely that users will be able to achieve similar performance with their Fortran or C codes, no matter how well they are coded.


6.9 Grid Scaling

From Table 6.3 one would hardly think it worthwhile to parallelize the Field-Grid routine since it takes such an insignificant amount of time compared with the rest of the serial code. However, one does actually start noticing its presence as the grids get larger and the rest of the code is parallelized. Figure 6.5 illustrates how the parallelized code, as expected, scales linearly with the grid-size for a fixed number of processors and particles.

As expected, the particle push routines for positions showed no effect of the varying grid-size since this routine is independent of the grid, whereas the gather/push-velocity routine and the scatter routine show the grid effects as predicted.

6.10 Particle Scaling

We also included benchmarks where the number of grids and processors were fixed, and the number of particles varied (see Figure 6.6). As expected, calculating the field at each grid-point based on the charge distribution (field-grid) shows no effect of this scaling since it only involves grid quantities. The "kink" in the particle update curves experienced in our benchmarks at 4096 particles/node (4 particles per grid) can be attributed to local paging/caching. Since the replicated grid approach accesses the same local particle quantities in sequential memory locations, cache misses will occur once the local cache-size is exceeded.

6.11 Testing

To assure that the codes were still modeling the expected equations, the output of each new parallelized version was checked for validity. The tests described in Chapter 3 for checking for plasma oscillations, symmetry and two-stream instabilities played an important role in this effort. Unfortunately, we did not have access to the HDF package used to produce the two-stream instability graphs in Chapter 3.



Figure 6.5: Grid-size Scaling. Replicated grids; 4 nodes; 262,144 particles; 100 time-steps



Figure 6.6: Particle Scaling -- Replicated Grids


We did, however, compare output files of the charge distribution and particle positions with our serial code's corresponding output. We limited the number of iterations to about 10 since these codes produce non-linearities and round-off effects as time progresses.
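A minimal sketch of the kind of file comparison used for these validity checks is given below; the tolerance, array layout and text file format are assumptions made for illustration, not a description of our actual test harness.

    #include <math.h>
    #include <stdio.h>

    /* Sketch: compare a parallel run's charge-density dump against the serial
     * reference, element by element, within a small relative tolerance that
     * allows for round-off differences over a handful of time-steps.          */
    int compare_dumps(const char *serial_file, const char *parallel_file,
                      int n, double tol)
    {
        FILE *fs = fopen(serial_file, "r");
        FILE *fp = fopen(parallel_file, "r");
        if (!fs || !fp) return -1;

        int mismatches = 0;
        for (int k = 0; k < n; k++) {
            double a, b;
            if (fscanf(fs, "%lf", &a) != 1 || fscanf(fp, "%lf", &b) != 1) {
                mismatches = -1;            /* short or malformed file */
                break;
            }
            if (fabs(a - b) > tol * (1.0 + fabs(a)))
                mismatches++;
        }
        fclose(fs);
        fclose(fp);
        return mismatches;                  /* 0 means the dumps agree */
    }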


Chapter 7

Conclusions and Future Work

"I know that you believe you understand what you think I wrote, but I am not sure you realize that what you read is not what I meant..."
-- author's version of a saying first seen on a plaque at Glennwood Pines Restaurant, Rt 89, Ithaca, NY; original by Richard Nixon.

By bringing together knowledge from the fields of computer science, physics, and applied mathematics (numerical analysis), this dissertation presented some guidelines on how to fully leverage parallel supercomputer technology in the area of particle simulations.

This work highlighted several relevant basic principles from these fields, including a summary of the main references related to the area of parallel Particle-in-Cell codes. Our code modeled charged particles in an electric field. We analyzed this fairly complex application, which has several interacting modules.

Our framework was a physically distributed memory system with a globally shared memory addressing space, such as the KSR supercomputer. However, most of the new methods developed in this dissertation hold generally for high-performance systems with either hierarchical or distributed memory.

By studying the interactions between our application's sub-program blocks, we showed how the accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization.


We introduced a novel approach that led to an efficient implementation facilitating dynamically partitioned grids. This novel approach takes advantage of the shared-memory addressing system and uses a dual pointer scheme on the local particle arrays that keeps the particle locations automatically partially sorted (i.e. sorted to within the local grid partition). Complexity and performance analyses were included for both this approach and for the traditional replicated grids approach.

We also introduced hierarchical data structures that were tailored for both the cache size and our problem's structure of memory access. By reordering the grid indexing, we aligned the storage of neighboring grid-points with the local cache. This saved us 25% of the cache-hits for a 4-by-4 cache.

We showed that further improvements can be made by considering the input data's effect on the simulation. E.g., in the case of mean particle drift, it may be advantageous to partition the grid primarily along the direction of the drift.

Our analyses and implementation benchmarks demonstrate how the replicated grid approach is appropriate for problems run on a limited number of processors. Since the particles are always processed by the same computational node, an initially load-balanced simulation will remain perfectly load balanced. However, the replicated grid fails to scale properly, both with respect to computation and memory requirements, for large problems (say, grids larger than 128-by-128) running on several processors (in our case, greater than 8 with respect to processing time). In this case, the extra storage and computation time associated with adding the local grid copies together was shown to be significant. Our dual pointer scheme for grid partitioning, on the other hand, is more difficult to implement, but scales well for large problems on highly parallel systems with several dozen or more processors, since it does not need to replicate and summarize the whole grid.

The particle-in-cell codes for this study were tested using physical parameters which lead to predictable phenomena, including plasma oscillations and two-stream instabilities.

7.1 Future Work

"I like the dreams of the future better than the history of the past."
-- Thomas Jefferson, letter to John Adams, 1816.

The field of computational science, a term first cast by physics Nobel Laureate Ken Wilson, is clearly a growing field that will remain in focus for years to come. During the past 10 years computing has become an essential tool for many scientists and engineers, and we expect this trend to continue as technology provides us with more computational power through the advances in semiconductor technologies as well as the use of parallel and distributed computer systems. To illustrate the importance of this, several universities have recently initiated graduate programs specifically aimed at Computational Science and Engineering [SRS93].

This dissertation focused on analyzing the parallelization of Particle-in-Cell methods, but it is clear that the main techniques developed to parallelize complex computer programs, such as particle simulations, can be applied to many other scientific and engineering codes. Current research areas that take advantage of computational science and engineering include astronomy, biology and virology, chemistry, electromagnetics, fluid dynamics, geometric modeling, material science, medicine, numerical algorithms, physics, problem-solving environments, scientific visualization, signal and image processing, structures technology, symbolic computing, and weather and ocean modeling. The fun has merely begun!

"Man wants to know, and when he ceases to do so, he is no longer man."
-- Nansen [1861-1930], on the reason for polar explorations.


Appendix A

Annotated Bibliography

A.1 Introduction

"Nothing has really happened until it has been recorded."
-- Virginia Woolf, quoted in Harold Nicolson, Diaries.

This section gives a detailed description of some of the main references related to parallel particle codes. Sections A.2 and A.3 cover the main books and general articles, respectively, whereas Sections A.4 and A.5 cover the PIC references most central to our work. An overview table of these references was provided in Chapter 2.

Since the field of computational science is relatively new and lies at the intersection of several science disciplines, references are often hard to find using only the Science Citation Index. INSPEC is a recent library utility that the author found extremely useful (it not only lists authors and titles from published journals, but includes recent conference papers and abstracts as well). However, to really be on top of this rapidly moving field, it is important to follow the main conferences in high performance computing (e.g. Supercomputing '9x, SHPCC '9x, SIAM conferences, The Colorado Conference on Iterative Methods (previously called The Copper Mountain Conference on Iterative Methods), etc.), keep in touch with the main actors of the field (have them send you technical reports), and look out for new journals such as Computational Science & Engineering, due out by IEEE in the spring of 1994.

A.2 Reference Books

A.2.1 Hockney and Eastwood: "Computer Simulations Using Particles"

This book [HE89] is a solid reference that concentrates on particle simulations with applications in the area of semiconductors. They include a historical overview of PIC codes and give a fairly thorough example of 1-D plasma models. They cover several numerical methods for solving the field equation (including the SOR and FFT methods we used), and describe several techniques related to semiconductor device modeling, astrophysics, and molecular dynamics. The FFT is described in the appendix.

A.2.2 Birdsall and Langdon: "Plasma Physics Via Computer Simulation"

This text [BL91] concentrates on describing the general techniques used in plasma physics codes and hence includes fairly detailed descriptions of 2- and 3-D electrostatic and electromagnetic programs. Their book also includes a diskette with a 1-D electrostatic code complete with graphics for running under MS-DOS and X-windows (X11).

A.2.3 Others

Tajima [Taj89] and Bower & Wilson [BW91] are also good generic references for plasma simulation codes with slants towards astrophysics.

For those interested in correlated methods -- statistical mechanics and molecular dynamics involving numerical techniques such as the Monte Carlo method (named so due to the role random numbers play in this method) -- Allen & Tildesley [AT92] and Binder [Bin92] are considered some of the best references on the subject. Although these methods are quite different from the PIC method, they may provide interesting algorithmic ideas that could be used when parallelizing PIC codes.

A.2.4 Fox et al.: "Solving Problems on Concurrent Processors"

This 2-volume set [FJL+88, AFKW90] covers the most common parallel techniques used in parallelizing physical codes. The material centers around the authors' experience on the Caltech hypercube, and is hence biased towards hypercube implementations. Volume I, subtitled General Techniques and Regular Problems, gives a theoretical overview of the algorithms used and their underlying numerical techniques, whereas Volume II, Software for Concurrent Processors, is primarily a compendium of Fortran and C programs based on the algorithms described in Volume I.

A.3 General Particle Methods

In April 1987 a workshop entitled "Particle Methods in Fluid Dynamics and Plasma Physics" was held at Los Alamos in New Mexico. The proceedings from this workshop were published in Computer Physics Communications the following year and contain several interesting papers in the area:

A.3.1 F.H. Harlow: "PIC and its Progeny"

[Har88] is a survey article covering the origin of PIC and related techniques as methods for exploring shock interactions with material interfaces in the '60s. Harlow's PIC work for fluid dynamics simulations ('64) was the foundation for Morse and Nielson's ('69) work on higher-order interpolation (CIC) schemes for plasmas. (According to Hockney and Eastwood, they and Birdsall's Berkeley group ('69) were the first to introduce CIC schemes.)

The bulk of Harlow's paper is a list of 143 references, mostly papers on fluid dynamics by the author and his colleagues at Los Alamos (1955-87).

A.3.2 J. Ambrosiano, L. Greengard, and V. Rokhlin: "The Fast Multipole Method for Gridless Particle Simulation"

[AGR88] advocates the use of modern hierarchical solvers, of which the most general technique is the fast multipole method (FMM), to avoid some of the local smoothing, boundary problems, and aliasing problems associated with PIC methods when used to simulate cold plasmas and beams, and plasmas in complicated regions. The paper describes the FMM method and how it fares with respect to the aforementioned problems associated with PIC methods.

A.3.3 D.W. Hewett and A.B. Langdon: "Recent Progress with Avanti: A 2.5D EM Direct Implicit PIC Code"

[HL88] describes the direct implicit PIC method and some relativistic extensions. The code uses an iterative solution based on ADI (alternating direction implicit) as a 2-D direct field solver. The paper does point out that only minimal consideration was given to algorithms that may be used to implement the relativistic extensions. Some concepts were tested on a 1-D code.

A.3.4 S.H. Brecht and V.A. Thomas: "Multidimensional Simulations Using Hybrid Particle Codes"

[BT88] describes the advantages and disadvantages in using a hybrid particle code to simulate plasmas on very large scale lengths (several lambda_D). By treating the electrons as a massless fluid and the ions as particles, some physics that magnetohydrodynamics (MHD) codes do not provide (MHD assumes charge neutrality, i.e. rho = 0) can be included without the costs of a full particle code. They avoid solving the potential equations by assuming that the plasma is quasi-neutral (n_e ~ n_i), using the Darwin approximation where light waves can be ignored, and assuming the electron mass to be zero. They hence use a predictor-corrector method to solve the simplified equations.

A.3.5 C.S. Lin, A.L. Thring, and J. Koga: "Gridless Particle Simulation Using the Massively Parallel Processor (MPP)"

[LTK88] describes a gridless model where particles are mapped to processors and the grid computations are avoided by using the inverse FFT to compute the particle positions and the electric field. However, since the MPP is a bit-serial SIMD (single instruction, multiple data) architecture with a grid topology and not much local memory, they found that the overhead in communication when computing the reduction sums needed for this technique when computing the charge density was so high that 60% of the CPU time was used in this effort. (See also the description of other articles by Lin in Section A.xx.)

A.3.6 A. Mankofsky et al.: "Domain Decomposition and Particle Pushing for Multiprocessing Computers"

[M+88] describes parallelization efforts on two production codes, ARGUS and CANDOR, for multiprocessing on systems such as the Cray X-MP and Cray 2. ARGUS is a 3-D system of simulation codes, the paper paying particular attention to those modules related to PIC codes. These modules include several field solvers (SOR, Chebyshev and FFT) and electromagnetic solvers (leapfrog, generalized implicit and frequency domain), where they claim their 3D FFT solver to be exceptionally fast.

One of the most interesting ideas of the paper, however, is how they use the cache as storage for particles that have left their local domain, whereas the local particles get written to disk. The cached particles then get tagged onto the local particles in their new cell when they get swapped in. Their experience with CANDOR, a 2.5D electromagnetic PIC code, showed that it proved efficient to multi-task over particles (or a group of particles) within a field block. They note the trade-off due to the parallelization efficiency increasing with the number of particle groups, whereas the vectorization efficiency increases with the number of particles per group. The speed-up for their implementation on the Cray X-MP/48 was shown to reach close to the theoretical maximum of 4.

A.4 Parallel PIC -- Survey Articles

A.4.1 John M. Dawson

John Dawson's 1983 Rev. of Modern Physics paper entitled "Particle simulation of plasmas" [Daw83] is a lengthy review of the field modeling charged particles in electric and magnetic fields. The article covers several physical modeling techniques typically associated with particle codes. It is a serial reference that cites more than 100 related papers.

A.4.2 David Walker's Survey Paper

David Walker's "Characterizing the parallel performance of a large-scale particle-in-cell plasma simulation code" from G. Fox's journal Concurrency: Practice and Experience, Dec. '90 issue [Wal90], gives a nice survey of vector and parallel PIC methods. It covers the most basic parallel techniques explored, with an emphasis on the importance of load balancing. The paper stresses that there is a strong need for future work in the Multiple Instruction Multiple Data (MIMD), i.e. distributed memory, computing arena:

    "...For MIMD distributed memory multiprocessors alternative decomposition schemes need to be investigated, particularly with inhomogeneous particle distributions."

The paper cites 54 references.

A.4.3 Claire Max: "Computer Simulation of Astrophysical Plasmas"

Max [Max91] gives a nice general description of plasma codes, and of how, in the coming decade, sophisticated numerical models and simulations will play an important role in the field of plasma astrophysics. She points to current efforts on the CRAY machines and how plasma astrophysics has a genuine need for the governmental supercomputer resources potentially provided by the NSF, DOE, and NASA Supercomputing Centers.

A.5 Other Parallel PIC References

A.5.1 Vector codes -- Horowitz et al.

Nishiguchi (Osaka Univ.) et al.'s 3-page note [NOY85] describes how they bunch particles in their 1-D code to utilize the vector processor on a VP-100 computer. Horowitz (LLNL, later Univ. of Maryland) et al. [Hor87] describes a 2D algorithm with timing analyses done on a similar 3D code for a Cray. Unlike Nishiguchi et al., who employ a fixed grid, Horowitz's approach requires sorting in order to vectorize. This scheme is a bit slower for very large systems, but requires much less storage.

Horowitz et al. later [HSA89] describe a 3D hybrid PIC code which is used to model tilt modes in field-reversed configurations (FRCs). Here, the ions are modeled as collisionless particles whereas the electrons are treated as an inertialess fluid. A multigrid algorithm is used for the field solve, whereas the leap-frog method is used to push the particles. Horowitz multi-tasks over 3 of the 4 Cray 2 processors in the multigrid phase by computing one dimension on each processor. The interpolation of the fields to the particles was found computationally intensive and hence multi-tasked, achieving an average overlap of about 3 due to the relationship between task length and the time-slice provided for each task by the Cray. (For the Cray they used, the time-slice depended on the size of the code and the priority at which it ran.) The particle push phase similarly got an overlap of about 2 (many, but simpler calculations). These results are hence clearly dependent on the scheduling algorithms of the Cray operating system.

A.5.2 C.S. Lin (Southwest Research Institute, TX) et al.

C.S. Lin's paper "Simulations of Beam Plasma Instabilities Using Parallel Particle-in-cell Code on the MPP" (HCCA4) [Lin89b] uses a particle sorting scheme based on rotations (scatter particles that are clustered through rotations). The same approach is described in the author's similar, but longer, paper, "Particle-in-cell Simulations of Wave Particle Interactions Using the Massively Parallel Processor" [Lin89a].

The papers describe a 1-D electrostatic PIC code implemented on the MPP (Massively Parallel Processor), a 128-by-128 toroidal grid of bit-serial processors located at Goddard. The author simulates up to 524,000 particles on this machine using an FFT solver.

Two previous sorting studies are mentioned in this paper: one where particles are sorted at each time-step (lots of overhead), another where a "gridless" FFT that used more computations was considered.

In the first study, they mapped the simulation domain directly to the processor array and sorted the particles according to their cell every time step. This was found to be highly inefficient on the MPP due to the excessive I/O required between the array processors and the staging memory. They also point out that the scheme would not remain load-balanced over time, since the fluctuations in electrical forces would cause the particles to distribute non-uniformly over the processors.

In the other study, they developed what they call a gridless model where they map particles to random processors. This approach claims to avoid charge collection by computing the electric forces directly using the Discrete Fourier Transform (see Section 2.3.4). However, it was shown to be 7 times as slow as a similar PIC code on the CRAY, and is consequently dismissed.

The sorting approach

Designed for the bit-serial MPP (consisting of the 64-by-64 toroidal grid of bit-serial processors), the implementation uses 64 planes to store 524,000 particles. The approach fills only half the particle planes (processor grid) with particles to make the sorting simpler, by being able to shuffle ("rotate") the data easily on this bit-serial SIMD machine. The spare room was used in the "shuffling" process, where congested cells had part of their contents rotated to their northern neighbor, and then west, if necessary, during the "sorting" process. The implementation is clearly tied to the MPP architecture. For nodes with more computational power and a different interconnection network, other techniques will probably prove more useful.

Only two other recent papers by C.S. Lin were found using a Science Citation Index search, one by Elliason from Umea in Geophysical Research Letters, and a paper by him and his co-authors in JGR-Space Sciences. None of these seem to discuss particle sorting.

A.5.3 David Walker (ORNL)

In "The Implementation of a Three-Dimensional PIC Code on a Hypercube Concurrent Processor" [Wal89], Walker describes a 3-D PIC code for the NCUBE, but does not have a full implementation of the code. He uses the "quasi-static crystal accumulator", some kind of gather-scatter sorter proposed by G. Fox et al. See also Walker's general reference in Section A.xx.

A.5.4 Lubeck and Faber (LANL)

This 1988 journal paper entitled "Modeling the performance of hypercubes: A case study using the particle-in-cell application" [LF88] is also highly relevant to our work. The paper covers a 2-D electrostatic code benchmarked on the Intel iPSC hypercube. A multigrid algorithm based on Fredrickson and McBryan's efforts is used for the field calculations, whereas they considered 3 different approaches for the particle push phase.

The first approach was to assign particles to whatever processor has the cell information (observing strict locality). The authors rejected this approach based on the fact that their particles tend to congregate in 10% of the cells, hence causing serious load-imbalance.

The second alternative they considered was to relax the locality constraint by allowing particles to be assigned to processors not necessarily containing their spatial region. The authors argue that the performance of this alternative (move either the grid and/or particles to achieve a more load-balanced solution) would be a strong function of the particle input distribution.

The alternative they decided to implement replicated the spatial grid among the processors so that an equal number of particles can be processed at each time-step. This achieves a perfect load balance (for homogeneous systems such as the iPSC). To us, this does seem to require a lot of extra overhead in communicating the whole grid to each processor at each time-step, not to mention having to store the entire grid at each processor.

This paper does describe a nice performance model for their approach. The authors comment that they found the partitioning of their PIC algorithm for the hypercube "an order of magnitude greater" compared with a shared memory implementation.


A.5.5 Azari and Lee's Work

Azari and S.Y. Lee have published several papers on their work on hybrid partitioning for PIC codes [ALO89, AL91, AL90, AL92, Aza92]. Their underlying motivation is to parallelize Particle-In-Cell (PIC) codes through a hybrid scheme of Grid and Particle Partitions.

Partitioning grid space involves distributing particles evenly among processors and partitioning the grid into equal-sized sub-grids, one per available processor element (PE). The need to sort particles from time to time is referred to as an undesirable load balancing problem (dynamic load balancing).

A particle partitioning implies, according to their papers, that all the particles are evenly distributed among processor elements (PEs) no matter where they are located on the grid. Each PE keeps track of the same particle throughout the entire simulation. The entire grid is assumed to have to be stored on each PE in order to keep the communication overhead low. The storage requirements for this scheme are larger, and a global sum of the local grid entries is needed after each iteration.

Their hybrid partitioning approach combines these two schemes with the argument that by partitioning the space one can save memory space on each PE, and by partitioning the particles one may attempt to obtain a well-balanced load distribution which would lead to a high efficiency.

Their hybrid partitioning scheme can be outlined as follows (a sketch of the resulting block/PE assignment is given after the list):

1. the grid is partitioned into equal subgrids,
2. a group of PEs is assigned to each block,
3. each grid-block is stored in the local memory of each of the PEs assigned to that block,
4. the particles in each block are initially partitioned evenly among the PEs in that block.
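The index arithmetic below is our own minimal sketch of how such a block/PE-group layout could be expressed for a simple 1-D row-block split; it is an illustration of the idea only, not Azari and Lee's implementation, and all names and conventions in it are hypothetical.

    /* Sketch: ny grid rows split into n_blocks equal row blocks, with
     * pes_per_block PEs assigned to every block.  Each PE stores its block's
     * sub-grid and an even share of that block's particles.                 */
    typedef struct {
        int block;          /* which grid block this PE works on             */
        int rank_in_block;  /* 0 .. pes_per_block-1 within that block        */
        int row_lo, row_hi; /* rows of the grid block held by this PE group  */
    } pe_assignment;

    pe_assignment assign_pe(int pe, int pes_per_block, int n_blocks, int ny)
    {
        pe_assignment a;
        int rows_per_block = ny / n_blocks;   /* assume ny divisible evenly */
        a.block         = pe / pes_per_block;
        a.rank_in_block = pe % pes_per_block;
        a.row_lo        = a.block * rows_per_block;
        a.row_hi        = a.row_lo + rows_per_block;
        return a;
    }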


The papers then go on to describe specific implementations, including hypercube and BBN implementations and corresponding performance evaluations.

A.5.6 Paulette Liewer (JPL) et al.

Liewer has also co-authored several papers on this topic. Her 1988 paper with Decyk, Dawson (UCLA) and G. Fox (Syracuse) [LDDF88] describes a 1-D electrostatic code named 1-D UCLA, decomposing the physical domain into sub-domains equal in number to the number of processors available, such that initially each sub-domain has an equal number of particles. Their test-bed was the Mark III 32-node hypercube.

The code uses the 1-D concurrent FFT described in Fox et al. For the particle pushing phase, they divide their grid up into N_p equal-sized sub-domains. However, the authors point out how they need to use a different partitioning for the FFT solver in order to take advantage of the hypercube connection for this phase. (They need to partition the domain according to the Gray Code numbering of the processors.) The code hence passes the grid array among the processors twice at each time step. In the conclusions, they point out that this may not be the case if a finite difference field solution is used in place of the FFT.

In [LZDD89] they describe a similar code named GCPIC (General Concurrent PIC) implemented on the Mark IIIfp (64 processors) that beats the CRAY X-MP.

In 1990 Liewer co-authored a paper [FLD90] describing a 2-D electrostatic code that was periodic in one dimension, and with the option of bounded or periodic in the other dimension. The code used 2 1-D FFTs in the solver. Liewer et al. have recently developed a 3D code on the Delta Touchstone (512-node grid) where the grids are replicated on each processor [DDSL93, LLFD93]. Liewer et al. also have a paper on load balancing, described later in this appendix.


A.5.7 Sturtevant and Maccabee (Univ. of New Mexico)

This paper [SM90] describes a plasma code implemented on the BBN TC2000 whose performance was disappointing. They used a shared-memory PIC algorithm that did not map well to the architecture of the BBN and hence got hit by the high costs of copying very large blocks of read-only data.

A.5.8 Peter MacNeice (Hughes/Goddard)

This paper [Mac93] describes a 3D electromagnetic PIC code re-written for a MasPar with a 128-by-128 grid. The code is based on Oscar Buneman's TRISTAN code. They store the third dimension in virtual memory so that each processor has a grid vector. They use a Eulerian decomposition, and hence need to sort after each time-step. A finite-difference scheme is used for the field solve, whereas the particle push phase is accomplished via the leap-frog method. Since they assume systems with relatively mild inhomogeneities, no load balancing considerations were taken. The fact that they only simulate 400,000 particles in a 105-by-44-by-55 system, i.e. only about one particle per cell and under-utilizing the 128-by-128 processor grid, we assume was due to the memory limitations of the MasPar used (64 Kb/processor). The MasPar is a Single Instruction Multiple Data (SIMD) machine.


Appendix B

Calculating and Verifying the Plasma Frequency

B.1 Introduction

In order to verify what our code actually does, an in-depth analysis of one general time step can predict the general behavior of the code. If the code produces plasma oscillations, it is expected that the charge density at a given grid-point behaves proportionally to a cosine wave. That is,

    \rho_{ij}^{n+1} = \mathrm{const}\cdot\cos(\omega_p(t+\Delta t)+\phi),
    \rho_{ij}^{n}   = \mathrm{const}\cdot\cos(\omega_p t+\phi).                      (B.1)

One cannot expect the code to produce the exact result from the above equations, but rather a Taylor series approximation of them. Looking at the ratio of the above equations, one can hence expect something like:

    \frac{\rho_{ij}^{n+1}}{\rho_{ij}^{n}} \approx 1 - \frac{(\omega_p\Delta t)^2}{2} - \omega_p\Delta t\,\tan(\omega_p t+\phi).      (B.2)


B.2 The potential equation

Looking at the potential at each grid point, we assume it is of the form:

    \phi_{i,j} = \mathrm{Re}\left(\tilde{\phi}\,e^{ik_x x}e^{ik_y y}\right)           (B.3)

where \lambda_x and \lambda_y are the limits of the grid. Since x = (i-1)h_x and y = (j-1)h_y, the above equation can be re-written:

    \phi_{i,j} = \mathrm{Re}\left[\tilde{\phi}\,e^{i(k_x(i-1)h_x + k_y(j-1)h_y)}\right]   (B.4)

This function can be thought of as a 2-D plane extending into waves of "mountains" and "valleys" in the third dimension.

B.2.1 The FFT solver

Differentiating twice directly would give the following expression for \phi:

    \left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}\right)\tilde{\phi}\,(e^{ik_x x}e^{ik_y y}) = -\frac{\rho_{i,j}}{\epsilon_0}      (B.5)

    i^2 k_x^2\,\tilde{\phi}\,e^{ik_x x+ik_y y} + i^2 k_y^2\,\tilde{\phi}\,e^{ik_x x+ik_y y} = -\frac{\rho_{i,j}}{\epsilon_0}      (B.6)

which again gives us:

    -(k_x^2+k_y^2)\,\phi = -\frac{\rho_{i,j}}{\epsilon_0}
    \quad\Longrightarrow\quad
    \phi = \frac{\rho_{i,j}}{\epsilon_0}\,\frac{1}{k_x^2+k_y^2}.      (B.7)

Notice that this is indeed the quantity that we scaled with in the FFT solver.
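The scaling in Equation B.7 is applied mode by mode to the Fourier coefficients of the charge density. The loop below is a minimal sketch of that step under assumed array names (rho_hat and phi_hat holding the real and imaginary parts of the coefficients); it is an illustration of the scaling only, not the vendor FFT routine or our solver.

    /* Sketch: scale each Fourier coefficient of rho by 1/(eps0*(kx^2+ky^2))
     * to obtain the coefficients of phi, as in Equation B.7.  kx[] and ky[]
     * hold the wave numbers for each mode; the zero mode is left at zero.    */
    void scale_modes(const double *rho_hat_re, const double *rho_hat_im,
                     double *phi_hat_re, double *phi_hat_im,
                     const double *kx, const double *ky,
                     int nkx, int nky, double eps0)
    {
        for (int m = 0; m < nkx; m++) {
            for (int n = 0; n < nky; n++) {
                int    idx = m * nky + n;
                double k2  = kx[m] * kx[m] + ky[n] * ky[n];
                if (k2 == 0.0) {                    /* undefined mean mode */
                    phi_hat_re[idx] = phi_hat_im[idx] = 0.0;
                } else {
                    phi_hat_re[idx] = rho_hat_re[idx] / (eps0 * k2);
                    phi_hat_im[idx] = rho_hat_im[idx] / (eps0 * k2);
                }
            }
        }
    }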

B.2.2 The SOR solver

Writing Equation 3.46 for \phi_{i,j+1} in terms of \phi_{i,j} yields:

    \phi_{i,j+1} = \phi_{i,j}\,e^{ik_y h_y}

Proceeding in the same fashion for the other neighboring grid points, one obtains the following expression for the Laplacian \nabla^2\phi:

    \phi_{i,j}\left[e^{-ik_y h_y}+e^{ik_y h_y}-4+e^{-ik_x h_x}+e^{ik_x h_x}\right]/h^2 = -\frac{\rho_{i,j}}{\epsilon_0}      (B.8)

Using the exponential form of the expression for cosine, \cos(x) = \frac{1}{2}(e^{ix}+e^{-ix}), we get:

    \frac{\phi_{i,j}}{h^2}\left(2\cos(k_y h_y) - 4 + 2\cos(k_x h_x)\right) = -\frac{\rho_{i,j}}{\epsilon_0}      (B.9)

Rewriting the above equation to fit the expression 1-\cos(x) = 2\sin^2(x/2), we get:

    -\frac{4}{h^2}\left[\sin^2\!\left(\frac{k_x h_x}{2}\right)+\sin^2\!\left(\frac{k_y h_y}{2}\right)\right]\phi_{i,j} = -\frac{\rho_{i,j}}{\epsilon_0}      (B.10)

which, if we solve for \phi using the Taylor series approximation \sin^2(x)\approx x^2 and considering h = h_x = h_y, gives us:

    \phi \approx \frac{\rho_{i,j}}{\epsilon_0}\,\frac{1}{k_x^2+k_y^2}.      (B.11)

Notice how the grid spacing quantities h_x and h_y cancel and we obtain the same expression as we did for the FFT solver. This is not surprising, given that our finite difference approximation indeed is supposed to approximate the differential.

B.2.3 The field equations

A similar result can then be obtained for the field E using the above result:

    E_{x\,i,j} = -\frac{\phi_{i,j+1}-\phi_{i,j-1}}{2h_x}      (B.12)
               = -\phi_{i,j}\left(\frac{e^{ik_x h_x}-e^{-ik_x h_x}}{2h_x}\right)      (B.13)
               = -\phi_{i,j}\,\frac{i}{h_x}\sin(k_x h_x)      (B.14)

Again, using a Taylor series approximation and following the same procedure for E_y, we get:

    E_{x\,i,j} \approx -ik_x\,\phi_{i,j}      (B.15)
    E_{y\,i,j} \approx -ik_y\,\phi_{i,j}      (B.16)


Looking at the field at each particle, E^{part}:

    E^{part}_x = \tilde{E}_x\,[e^{ik_x x+ik_y y}]      (B.17)

Whether the x and y here are the coordinates of the grid or the particle does not really matter; in this case, it is not important where the E-field is being evaluated. We hence approximate the fields at the particle's position for x and y to be:

    E^{part}_x \approx E_x      (B.18)
    E^{part}_y \approx E_y      (B.19)

B.3 Velocity and particle positions

Recall that we used the following equations for updating the velocity and particle positions:

    v = v + \frac{qE}{m}\,\Delta t      (B.20)
    x = x + v\,\Delta t                 (B.21)

Here E = \tilde{E}\,e^{ik_x x_0+ik_y y_0}. In the equation for the particle update, x on the left hand side is x_0 + \delta(new time-step), whereas x on the right hand side is x_0 + \delta(old time-step). When using the Leap-Frog method, \Delta t in the velocity equation (Equation B.20) was set to \Delta t/2 for the first time-step so that the velocity and particle positions would be "leaping" over each other.
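A minimal sketch of the leap-frog update described by Equations B.20-B.21 is given below; passing half the time-step for the velocities on the very first call is what staggers v and x. The array names and the per-particle field arguments are assumptions made for illustration, not our actual push routines.

    /* Sketch of the leap-frog particle push (Equations B.20-B.21):
     * velocities live at half time-steps, positions at whole ones.
     * On the first call, pass dt/2 as dt_v to stagger the two.      */
    void push_particles(double *x, double *y, double *vx, double *vy,
                        const double *Ex, const double *Ey, int n,
                        double q, double m, double dt_v, double dt)
    {
        double qm = q / m;
        for (int p = 0; p < n; p++) {
            vx[p] += qm * Ex[p] * dt_v;   /* v = v + (qE/m) dt   (B.20) */
            vy[p] += qm * Ey[p] * dt_v;
            x[p]  += vx[p] * dt;          /* x = x + v dt        (B.21) */
            y[p]  += vy[p] * dt;
        }
    }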

B.4 Charge density

Assume the cell-size is shrunk down to the point where there is only one particle per cell. Now, looking at a grid point and its 4 adjacent cells, each of the particles in the 4 adjacent cells will have a different (x_0, y_0). These must be considered when updating the charge density.

Looking at the distances the particles in the surrounding cells are from the grid point, we can calculate the charge density at each grid point (Figure 3.5). See also Figure 3.3. Here j = (int)((x0 + hx)/hx) and i = (int)((y0 + hy)/hy).

Figure 3.5: Contributions of particles in adjacent cells with regard to charge density at grid point (i,j). [Each particle, at offsets (a, b) from the grid point, contributes through the area weights (h_x - a)(h_y - b), (h_x - a)b, a(h_y - b) and ab.]

Here, a = x0 - ((j-1)*hx) and b = y0 - ((i-1)*hy), so plugging these into the equation we used for calculating charge density (see Section 3.3):

    \rho_{i,j} = \frac{\rho_o N_g}{h_x h_y N_p}\left[(h_x-a)(h_y-b)+(h_x-a)b+a(h_y-b)+ab\right]      (B.22)

The unperturbed particle positions are hence as shown in Figure 3.6.
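The bilinear (area) weighting behind Equation B.22 can be sketched as the following deposit onto the four grid points surrounding a particle; the 0-based indexing, the interior-cell assumption and the names here are illustrative only, and qfac stands in for the \rho_o N_g/(h_x h_y N_p) prefactor.

    /* Sketch of the area-weighted (CIC) charge deposit behind Equation B.22.
     * a and b are the particle's offsets within its cell; each of the four
     * surrounding grid points receives the opposite partial area.  Interior
     * cells assumed (no boundary wrap shown).                               */
    void deposit_cic(double *rho, int nx, double x0, double y0,
                     double hx, double hy, double qfac)
    {
        int    j = (int)(x0 / hx), i = (int)(y0 / hy);
        double a = x0 - j * hx,    b = y0 - i * hy;

        rho[ i      * nx + j    ] += qfac * (hx - a) * (hy - b);
        rho[ i      * nx + j + 1] += qfac * a        * (hy - b);
        rho[(i + 1) * nx + j    ] += qfac * (hx - a) * b;
        rho[(i + 1) * nx + j + 1] += qfac * a        * b;
    }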


Figure 3.6: Unperturbed positions of particles in adjacent cells with respect to grid point (i,j). [The unperturbed neighbors sit at (xo, yo), (xo+hx, yo), (xo, yo+hy) and (xo+hx, yo+hy).]

Here, the perturbations of each particle can be viewed as:

    a = a_o + \delta x, \qquad b = b_o + \delta y      (B.23)

Looking at what \delta x and \delta y are in terms of, say, (x_0, y_0), we have:

    \delta x = \tilde{\delta x}\,e^{ik_x x_0+ik_y y_0}, \qquad \delta y = \tilde{\delta y}\,e^{ik_x x_0+ik_y y_0}      (B.24)

Note that each of the four particles in the adjacent cells has a different location as well as perturbation. Plugging these back into our previous equation:

    \rho_{i,j} = \big[(h_x-a_0-\delta x(x_0+h_x,y_0+h_y))(h_y-b_0-\delta y(x_0+h_x,y_0+h_y))
               + (h_x-a_0-\delta x(x_0+h_x,y_0))(b_0+\delta y(x_0+h_x,y_0))
               + (a_0+\delta x(x_0,y_0+h_y))(h_y-b_0-\delta y(x_0,y_0+h_y))
               + (a_0+\delta x(x_0,y_0))(b_0+\delta y(x_0,y_0))\big]\,\frac{\rho_o N_g}{h_x h_y N_p}      (B.25)

and expanding the \delta-terms:

    \rho_{i,j} = \big[((h_x-a_0)-\tilde{\delta x}\,e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)})((h_y-b_0)-\tilde{\delta y}\,e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)})
               + ((h_x-a_0)-\tilde{\delta x}\,e^{ik_x(x_0+h_x)+ik_y y_0})(b_0+\tilde{\delta y}\,e^{ik_x(x_0+h_x)+ik_y y_0})
               + (a_0+\tilde{\delta x}\,e^{ik_x x_0+ik_y(y_0+h_y)})((h_y-b_0)-\tilde{\delta y}\,e^{ik_x x_0+ik_y(y_0+h_y)})
               + (a_0+\tilde{\delta x}\,e^{ik_x x_0+ik_y y_0})(b_0+\tilde{\delta y}\,e^{ik_x x_0+ik_y y_0})\big]\,\frac{\rho_o N_g}{h_x h_y N_p}      (B.26, B.27)

Multiplying out the quantities, neglecting the \delta x\,\delta y-terms since they are O(\delta^2) and hence negligible in this context:

    \rho_{i,j} = \big[(h_x-a_0)(h_y-b_0)+(h_x-a_0)b_0+a_0(h_y-b_0)+a_0 b_0
               - \tilde{\delta x}\,e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)}(h_y-b_0) - \tilde{\delta y}\,e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)}(h_x-a_0)
               - \tilde{\delta x}\,e^{ik_x(x_0+h_x)+ik_y y_0}\,b_0 + \tilde{\delta y}\,e^{ik_x(x_0+h_x)+ik_y y_0}(h_x-a_0)
               + \tilde{\delta x}\,e^{ik_x x_0+ik_y(y_0+h_y)}(h_y-b_0) - \tilde{\delta y}\,e^{ik_x x_0+ik_y(y_0+h_y)}\,a_0
               + \tilde{\delta x}\,e^{ik_x x_0+ik_y y_0}\,b_0 + \tilde{\delta y}\,e^{ik_x x_0+ik_y y_0}\,a_0\big]\,\frac{\rho_o N_g}{h_x h_y N_p}      (B.28)

The first term inside the "[" "]"s is merely h_x h_y. Looking at the \tilde{\delta x} terms, we get:

    \delta x\ \text{terms} = (h_y-b_0)\,[-1+e^{-ik_x h_x}]\,\tilde{\delta x}\,e^{ik_x(x_0+h_x)+ik_y(y_0+h_y)}
                           + b_0\,[-1+e^{-ik_x h_x}]\,\tilde{\delta x}\,e^{ik_x(x_0+h_x)+ik_y y_0}      (B.29, B.30)

We now use the following approximation for [-1+e^{-ix}]:

    [-1+e^{-ix}] \approx -1+\left(1-ix+\frac{x^2}{2}+\cdots\right) \approx -ix      (B.31, B.32)

Hence we can simplify our \delta x terms equation to:

    \delta x\ \text{terms} = (-ik_x h_x)\,h_y\,\delta x(x_0+h_x,y_0+h_y) + b_0\ \text{terms}      (B.33)

Looking further at the b_0-terms:

    b_0\ \text{terms} = -b_0(-ik_x h_x)\,\delta x(x_0+h_x,y_0+h_y) + b_0(-ik_x h_x)\,\delta x(x_0+h_x,y_0)
                      = (-ik_x h_x)\,b_0\,[-1+e^{ik_y h_y}]\,\delta x(x_0+h_x,y_0+h_y)
                      \approx b_0(-ik_x h_x)(ik_y h_y)\,\delta x(x_0+h_x,y_0+h_y)      (B.34)
                      = b_0\,k_x k_y h_x h_y\,\delta x(x_0+h_x,y_0+h_y)      (B.35)

Since we assume that \delta x is small so that b_0


    E^{part}_x \approx E_x      (B.43)
    E^{part}_y \approx E_y      (B.44)
    v = v + \frac{qE}{m}\,\Delta t, \qquad x = x + v\,\Delta t      (B.45)

In order to see what kind of oscillation the code will produce, one can look at what happens between two time-steps n and n+1. In other words, it would be helpful to know the following term:

    \delta\rho_{i,j} \approx \rho_0\,\frac{N_g}{N_p}\,(-ik_x\,\delta x - ik_y\,\delta y)      (B.46)

For the purposes of verifying plasma oscillations, we will assume one particle per cell (N_g = N_p = 1). Looking at the n-th iteration, Equation B.46 gives us:

    \delta\rho^{(n+1)}_{i,j} = \mathrm{term}\cdot\delta\rho^{(n)}_{(i,j)}

To get \delta\rho^{(n+1)}_{i,j}, we plugged Equation B.46 into Equations B.41-B.46, starting with Equation B.41:

    \delta\rho^{(n)}_{i,j} \approx \rho_0\,(-ik_x\,\delta x - ik_y\,\delta y)      (B.47)

    \delta\phi_{i,j} = \frac{\delta\rho_{i,j}}{\epsilon_0}\,\frac{1}{k_x^2+k_y^2}      (B.48)


    \delta E_{x\,i,j} \approx -ik_x\,\delta\phi_{i,j} = -ik_x\left[\frac{\delta\rho^{old}_{i,j}}{\epsilon_0(k_x^2+k_y^2)}\right]      (B.49)

similarly,

    \delta E_{y\,i,j} \approx -ik_y\,\delta\phi_{i,j} = -\frac{ik_y}{\epsilon_0(k_x^2+k_y^2)}\,\delta\rho^{old}_{i,j}      (B.50)

We still make the same assumptions about the field at each particle:

    \delta E^{part}_x(i,j) \approx \delta E_x(i,j), \qquad \delta E^{part}_y(i,j) \approx \delta E_y(i,j)      (B.51)

The calculation, however, gets a little more tricky when considering the equations for updating particle velocities v and positions x. One will here have to keep track of which are the new and which are the old values:

    v = v + \frac{qE}{m}\,\Delta t \;\Longrightarrow\; v_0 + \delta v_{new} = v_0 + \delta v_{old} + \Delta t\,\frac{q}{m}\,E      (B.52)

Similarly, x = x + \Delta t\,v \Longrightarrow

    x_0 + \delta x_{new} = x_0 + \delta x_{old} + \Delta t\,v_{new}      (B.53)

We hence have:

    \delta v_{new} = \delta v_{old} + \Delta t\,\frac{q}{m}\,E      (B.54)
    \delta x_{new} = \delta x_{old} + \Delta t\,\delta v_{new}      (B.55)

We assume that for the first time-step v_{old} = \delta v^x_{old} = \delta v^y_{old} = 0. (They actually cancelled, being on both sides of the above equations.) Looking at the x-direction (writing \delta x as \delta_x) and using Equation B.42:

    \delta^{new}_x = \delta^{old}_x + \Delta t\,\Delta t\,\frac{q}{m}\,\delta E_x(i,j)
                   = \delta^{old}_x - \Delta t^2\,\frac{q}{m\epsilon_0}\,\frac{ik_x}{k_x^2+k_y^2}\,\delta\rho^{old}_{i,j}      (B.56)

Similarly:

    \delta^{new}_y = \delta^{old}_y - \Delta t^2\,\frac{q}{m\epsilon_0}\,\frac{ik_y}{k_x^2+k_y^2}\,\delta\rho^{old}_{i,j}      (B.57)

Going back to charge density (Equation B.46), looking at the x-direction:

    \delta\rho^{new}_x(i,j) = (-ik_x\,\delta^{new}_x)\,\rho_0
        = \left[-ik_x\left(\delta^{old}_x - \Delta t^2\,\frac{q}{m\epsilon_0}\,\frac{ik_x}{k_x^2+k_y^2}\,\delta\rho^{old}_{i,j}\right)\right]\rho_0      (B.58)

We hence have

    \delta\rho^{new}_x(i,j) = \left[-ik_x\,\delta^{old}_x - \left(\frac{\Delta t^2 q}{m\epsilon_0}\,\frac{k_x^2}{k_x^2+k_y^2}\right)\delta\rho^{old}_{i,j}\right]\rho_0
                            = \left[-ik_x\,\delta^{old}_x - K_x\,\delta\rho^{old}_{i,j}\right]\rho_0      (B.59)

where

    K_x = \frac{\Delta t^2\,q\,k_x^2}{m\epsilon_0(k_x^2+k_y^2)}      (B.60)

Doing the same for \delta\rho^{new}_y(i,j) and combining the two equations, we get:

    \delta\rho^{new}_{i,j} = \left[(-ik_x\,\delta^{old}_x - ik_y\,\delta^{old}_y) - (K_x+K_y)\,\delta\rho^{old}_{i,j}\right]\rho_0
                           = \left[1 - (K_x+K_y)\,\rho_o\right]\delta\rho^{old}_{i,j}.      (B.61, B.62)

Looking at the K_x, K_y terms together, they simplify as follows:

    K_x + K_y = \frac{\Delta t^2 q k_x^2}{m\epsilon_0(k_x^2+k_y^2)} + \frac{\Delta t^2 q k_y^2}{m\epsilon_0(k_x^2+k_y^2)}
              = \frac{(k_x^2+k_y^2)\,\Delta t^2\,q}{m\epsilon_0(k_x^2+k_y^2)} = \frac{\Delta t^2\,q}{m\epsilon_0}.

We hence have the following expression for an updated charge density:

    \delta\rho^{new}_{i,j} = \left(1 - \frac{\Delta t^2\,q}{m\epsilon_0}\,\rho_0\right)\delta\rho^{old}_{i,j}      (B.63)

Noting that for the first time-step (where we made the v_{old} = 0 assumption) the time-step is actually \Delta t/2 (the velocity lags the position update by 1/2 time-step), we hence actually have:

    \delta\rho^{new}_{i,j} = \left(1 - \frac{\Delta t^2}{2}\,\frac{q\rho_0}{m\epsilon_0}\right)\delta\rho^{old}_{i,j}      (B.64)

The plasma frequency is defined as \omega_p \triangleq \sqrt{\frac{q\rho_0}{m\epsilon_0}}. Assuming our result is the first few entries of the Taylor series approximation of \cos(\omega_0\Delta t),

    1 - \frac{\omega_0^2\,\Delta t^2}{2},      (B.65)

we can assume our frequency \omega_0 = \sqrt{\frac{q\rho_0}{m\epsilon_0}}, and we have shown that our code for this experiment should indeed oscillate at the plasma frequency \omega_p, i.e. \omega_0 = \omega_p.
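As a quick numerical illustration of this conclusion (not part of the original verification), one can check how close the one-step factor 1 - \omega_p^2\Delta t^2/2 stays to \cos(\omega_p\Delta t) for a typical stable choice of \Delta t; the parameter values below are arbitrary samples in normalized units.

    #include <math.h>
    #include <stdio.h>

    /* Sample check: the per-step charge-density factor derived above,
     * 1 - (q*rho0/(m*eps0)) * dt*dt / 2, should match the leading terms of
     * cos(omega_p * dt) when omega_p = sqrt(q*rho0/(m*eps0)).               */
    int main(void)
    {
        double q = 1.0, m = 1.0, eps0 = 1.0, rho0 = 1.0;   /* arbitrary units */
        double omega_p = sqrt(q * rho0 / (m * eps0));
        double dt = 0.1 / omega_p;                         /* omega_p*dt << 1 */

        double factor = 1.0 - 0.5 * (q * rho0 / (m * eps0)) * dt * dt;
        printf("factor = %.10f, cos(omega_p*dt) = %.10f\n",
               factor, cos(omega_p * dt));
        return 0;
    }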


Bibliography [ABL88]N.G.Azari,A.W.Bojanczyk,<strong>and</strong>S.-Y.Lee.Synchronous<strong>and</strong>AsynchronousAlgorithmsforMatrixTranspositiononaMesh-Connected<br />

[AGR88]J.Ambrosiano,L.Greengard,<strong>and</strong>V.Rokhl<strong>in</strong>.TheFastMultipole [AFKW90]IAngus,G.Fox,J.Kim,<strong>and</strong>D.Walker.Solv<strong>in</strong>gProblemsonConcurrentProcessorsVolumeII:S<strong>of</strong>twareforConcurrentProcessors.<br />

ArrayProcessor.InSPIEConf.Proc.,volume975,pages277{288, Prentice-Hall,1990. MethodforGridlessParticleSimulation.ComputerPhysicsCommunication,48:117{125,January1988.<br />

August1988.<br />

[AL90]N.G.Azari<strong>and</strong>S.-Y.Lee.Paralleliz<strong>in</strong>gParticle-<strong>in</strong>-CellSimulationon [AL91]N.G.Azari<strong>and</strong>S.-Y.Lee.HybridPartition<strong>in</strong>gforParticle-<strong>in</strong>-CellSimulationonSharedMemorySystems.InProc.<strong>of</strong>InternationalConf.<br />

Multiprocessors.InProc.<strong>of</strong>InternationalConf.onParallelProcess<strong>in</strong>g,pages352{353.ThePennsylvaniaStateUniversityPress,August<br />

1990.<br />

[ALO89]N.G.Azari,S.-Y.Lee,<strong>and</strong>N.F.Otani.ParallelGather-ScatterAlgo-<br />

[AL92]N.G.Azari<strong>and</strong>S.-Y.Lee.HybridTaskDecompositionforParticle-<strong>in</strong>- Conf.onParallelProcess<strong>in</strong>g,page9999,August1992. CellMethodonMessagePass<strong>in</strong>gSystems.InProc.<strong>of</strong>International onDistributedComputerSystems,pages526{533,May1991.<br />

[AT92]M.P.Allen<strong>and</strong>D.J.Tildesley.ComputerSimulation<strong>of</strong>Liquids.OxfordUniversityPress,1992.149<br />

GoldenGateEnterprises,March1989. percubes,ConcurrentComputers<strong>and</strong>Applications,pages1241{1245. rithmsforParticle-<strong>in</strong>-CellSimulation.InProc.<strong>of</strong>ForthConf.onHy-


[Aza92]N.G.Azari.ANewApproachtoTaskDecompositionforParallel [BCLL92]A.Bhatt,M.Chen,C.-Y.L<strong>in</strong>,<strong>and</strong>P.Liu.AbstractionsforParallel Particle-<strong>in</strong>-CellSimulation.Ph.D.dissertation,School<strong>of</strong>Electrical Eng<strong>in</strong>eer<strong>in</strong>g,CornellUniversity,Ithaca,NY,August1992. 150<br />

[BH19]J.Barnes<strong>and</strong>P.Hut.AhierarchicalO(NlogN)force-calculation [B<strong>in</strong>92]K.B<strong>in</strong>der,editor.Topics<strong>in</strong>AppliedPhysics{Volume71:TheMonte N-bodySimulations.InProc.ScalableHighPerformanceComput<strong>in</strong>g Conference,pages38{45.IEEEComputerSocietyPress,April1992. algorithm.Nature,324(4):446{449,19.<br />

[Bri87]W.L.Briggs.AMultigridTutorial.SIAM,1987. [BL91]C.K.Birdsall<strong>and</strong>A.B.Langdon.PlasmaPhysicsViaComputerSimulations.AdamHilger,Philadelphia,1991.<br />

CarloMethod<strong>in</strong>CondensedMatterPhysics.Spr<strong>in</strong>ger-Verlag,1992. [BT88]S.H.Brecht<strong>and</strong>V.A.Thomas.MultidimensionalSimulationsUs<strong>in</strong>g [BW91]R.L.Bowers<strong>and</strong>J.R.Wilson.NumericalModel<strong>in</strong>g<strong>in</strong>AppliedPhysics [Daw83]J.M.Dawson.Particlesimulation<strong>of</strong>plasmas.Rev.ModernPhysics, 143,January1988. HybridParticleCodes.ComputerPhysicsCommunication,48:135{<br />

[DDSL93]J.M.Dawson,V.K.Decyk,R.Sydora,<strong>and</strong>P.C.Liewer.High- <strong>and</strong>Astrophysics.Jones<strong>and</strong>Bartlett,1991.<br />

[EUR89]A.C.Elster,M.U.Uyar,<strong>and</strong>A.P.Reeves.Fault-TolerantMatrixOperationsonHypercubeMultiprocessors.InF.Ris<strong>and</strong>P.M.Kogge,<br />

editors,Proc.<strong>of</strong>the1989InternationalConferenceonParallelPro-<br />

55(2):403{447,April1983. PerformanceComput<strong>in</strong>g<strong>and</strong>PlasmaPhysics.PhysicsToday,46:64{ 70,March1993.<br />

[FJL+88]G.Fox,M.Johnson,G.Lyzenga,S.Otto,J.Salmon,<strong>and</strong>D.Walker. [FLD90]R.D.Ferraro,P.C.Liewer,<strong>and</strong>V.K.Decyk.A2DElectrostaticPIC gust1989.Vol.III. cess<strong>in</strong>g,pages169{176.ThePennsylvaniaStateUniversityPress,Au-<br />

pages440{444.IEEEComputerSocietyPress,April1990. editors,Proc.<strong>of</strong>theFifthDistributedMemoryComput<strong>in</strong>gConference, Solv<strong>in</strong>gProblemsonConcurrentProcessorsVolumeI:GeneralTechniques<strong>and</strong>RegularProblems.Prentice-Hall,1988.<br />

CodefortheMARKIIIHypercube.InD.W.Walker<strong>and</strong>Q.F.Stout,


[Har64] F.H. Harlow. The Particle-in-Cell Computing Method for Fluid Dynamics. In B. Alder, S. Fernbach, and A. Rotenberg, editors, Methods in Computational Physics, volume 3, pages 319–343. Academic Press, 1964.

[Har88] F.H. Harlow. PIC and Its Progeny. Computer Physics Communications, 48:1–10, January 1988.

[HE89] R.W. Hockney and J.W. Eastwood. Computer Simulation Using Particles. Adam Hilger, New York, 1989.

[HJ81] R.W. Hockney and C.R. Jesshope. Parallel Computers. Adam Hilger Ltd., Bristol, 1981.

[HL88] D.W. Hewett and A.B. Langdon. Recent Progress with Avanti: A 2.5D EM Direct Implicit PIC Code. Computer Physics Communications, 48:127–133, January 1988.

[Hor87] E.J. Horowitz. Vectorizing the Interpolation Routines of Particle-in-Cell Codes. Journal of Computational Physics, 68:56–65, 1987.

[HSA89] E.J. Horowitz, D.E. Schumaker, and D.V. Anderson. QN3D: A Three-Dimensional Quasi-neutral Hybrid Particle-in-Cell Code with Applications to the Tilt Mode Instability in Field Reversed Configurations. Journal of Computational Physics, 84:279–310, 1989.

[JH87] S.L. Johnsson and C.T. Ho. Algorithms for Multiplying Matrices of Arbitrary Shapes Using Shared Memory Primitives on Boolean Cubes. Technical Report YALEU/DCS/TR-569, Department of Computer Science, Yale University, October 1987.

[LD89] P.C. Liewer and V.K. Decyk. A General Concurrent Algorithm for Plasma Particle-in-Cell Simulation Codes. Journal of Computational Physics, 85:302–322, 1989.

[LDDF88] P.C. Liewer, V.K. Decyk, J.M. Dawson, and G.C. Fox. A Universal Concurrent Algorithm for Plasma Particle-in-Cell Codes. In G. Fox, editor, The Third Conference on Hypercube Concurrent Computers and Applications, pages 1101–1107. ACM, January 1988.

[LF88] O.M. Lubeck and V. Faber. Modeling the performance of hypercubes: A case study using the particle-in-cell application. Parallel Computing, 9:37–52, 1988.

[Lin89a] C.S. Lin. Particle-in-Cell Simulations of Wave Particle Interactions Using the Massively Parallel Processor. In Proc. Supercomputing '89, pages 287–294. ACM Press, November 1989.


[Lin89b] C.S. Lin. Simulations of Beam Plasma Instabilities Using a Parallel Particle-in-Cell Code on the Massively Parallel Processor. In Proc. of Fourth Conf. on Hypercubes, Concurrent Computers and Applications, pages 1247–1254. Golden Gate Enterprises, March 1989.

[LLDD90] P.C. Liewer, E.W. Leaver, V.K. Decyk, and J.M. Dawson. Dynamic Load Balancing in a Concurrent Plasma PIC Code on the JPL/Mark III Hypercube. In D.W. Walker and Q.F. Stout, editors, Proc. of the Fifth Distributed Memory Computing Conference, pages 939–942. IEEE Computer Society Press, April 1990.

[LLFD93] P. Lyster, P.C. Liewer, R. Ferraro, and V.K. Decyk. Implementation of the Three-Dimensional Particle-in-Cell Scheme on Distributed-Memory Multiple-Instruction Multiple-Data Massively Parallel Computers. Preprint, 1993.

[Loa92] C.F. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.

[LTK88] C.S. Lin, A.L. Thring, and J. Koga. Gridless Particle Simulation Using the Massively Parallel Processor. Computer Physics Communications, 48:149–154, January 1988.

[LZDD89] P.C. Liewer, B.A. Zimmerman, V.K. Decyk, and J.M. Dawson. Application of Hypercube Computers to Plasma Particle-in-Cell Simulation Codes. In Proc. Supercomputing '89, pages 284–286. ACM Press, November 1989.

[M+88] A. Mankofsky et al. Domain Decomposition and Particle Pushing for Multiprocessing Computers. Computer Physics Communications, 48:155–165, January 1988.

[Mac93] P. MacNeice. An Electromagnetic PIC Code on the MasPar. In Proc. 6th SIAM Conference on Parallel Processing for Scientific Computing, pages 129–132, Norfolk, VA, March 1993.

[Max91] C.E. Max. Computer Simulation of Astrophysical Plasmas. Computers in Physics, 5(2):152–162, 1991.

[McC89] S.F. McCormick. Multilevel Adaptive Methods for Partial Differential Equations. SIAM, 1989.

[NOY85] A. Nishiguchi, S. Orii, and T. Yabe. Vector Calculation of Particle Code. Journal of Computational Physics, 61:519–522, 1985.

[Ota] N.F. Otani. Personal communications.


[Ram] P.S. Ramesh. Personal communications.

[Sha] J.G. Shaw. Personal communications.

[SHG92] J.P. Singh, J.L. Hennessy, and A. Gupta. Implications of Hierarchical N-body Methods for Multiprocessor Architectures. Technical Report (manuscript), Computer Systems Laboratory, Stanford University, 1992.

[SHT+92] J.P. Singh, C. Holt, T. Totsuka, A. Gupta, and J.L. Hennessy. Load Balancing and Data Locality in Hierarchical N-body Methods. Technical Report (manuscript), Computer Systems Laboratory, Stanford University, 1992.

[Sil91] Marian Silberstein. Computer Simulation of Kinetic Alfven Waves and Double Layers Along Auroral Magnetic Field Lines. Master's dissertation, School of Electrical Engineering, Cornell University, Ithaca, NY, August 1991.

[SM90] J.E. Sturtevant and A.B. Maccabee. Implementing Particle-In-Cell Plasma Simulation Code on the BBN TC2000. In D.W. Walker and Q.F. Stout, editors, Proc. of the Fifth Distributed Memory Computing Conference, pages 433–439. IEEE Computer Society Press, April 1990.

[SO93] M. Silberstein and N.F. Otani. Computer Simulation of Alfven Waves and Double Layers Along Auroral Magnetic Field Lines. Journal of Geophysical Research, in preparation, 1993.

[SRS93] A. Sameh, J. Riganati, and D. Sarno. Computational Science & Engineering. Computer, 26(10):8–12, October 1993.

[Taj89] Toshiki Tajima. Computational Plasma Physics: With Applications to Fusion and Astrophysics. Addison-Wesley, 1989.

[UR85a] M.U. Uyar and A.P. Reeves. Fault Reconfiguration for the Near Neighbor Problem in a Distributed MIMD Environment. In Proceedings of the 5th International Conference on Distributed Computer Systems, pages 372–379, Denver, CO, May 1985.

[UR85b] M.U. Uyar and A.P. Reeves. Fault Reconfiguration in a Distributed MIMD Environment with a Multistage Network. In Proceedings of the 1985 International Conference on Parallel Processing, pages 798–805, 1985.

[UR88] M.U. Uyar and A.P. Reeves. Dynamic Fault Reconfiguration in a Mesh-Connected MIMD Environment. IEEE Trans. on Computers, 37:1191–1205, October 1988.


[Uya86] M.U. Uyar. Dynamic Fault Reconfiguration in Multiprocessor Systems. Ph.D. dissertation, School of Electrical Engineering, Cornell University, Ithaca, NY, June 1986.

[Wal89] D.W. Walker. The Implementation of a Three-Dimensional PIC Code on a Hypercube Concurrent Processor. In Proc. of Fourth Conf. on Hypercubes, Concurrent Computers and Applications, pages 1255–1261. Golden Gate Enterprises, March 1989.

[Wal90] D.W. Walker. Characterizing the parallel performance of a large-scale, particle-in-cell plasma simulation code. Concurrency: Practice and Experience, 2(4):257–288, December 1990.

[You89] D.M. Young. A Historical Overview of Iterative Methods. Computer Physics Communications, 53:1–17, 1989.

[ZJ89] F. Zhao and S.L. Johnsson. The Parallel Multipole Method on the Connection Machine. Technical Report Series CS89-6, Thinking Machines Corporation, October 1989.
