12.07.2015 Views

One-sided vs. Two-sided Communication Paradigms on Relaxed ...

One-sided vs. Two-sided Communication Paradigms on Relaxed ...

One-sided vs. Two-sided Communication Paradigms on Relaxed ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

F U T U R E T E C H N O L O G I E S G R O U P<str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> <str<strong>on</strong>g>vs</str<strong>on</strong>g>. <str<strong>on</strong>g>Two</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g><str<strong>on</strong>g>Communicati<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>Paradigms</str<strong>on</strong>g> <strong>on</strong><strong>Relaxed</strong> Ordering Interc<strong>on</strong>nectsKhaled IbrahimPaul Hargrove, Costin Iancu, Kathy YelickLAWRENCE BERKELEY NATIONAL LABORATORY


OverviewF U T U R E T E C H N O L O G I E S G R O U P Applicati<strong>on</strong> performance of <strong>on</strong>e-<str<strong>on</strong>g>sided</str<strong>on</strong>g> <str<strong>on</strong>g>vs</str<strong>on</strong>g>. two-<str<strong>on</strong>g>sided</str<strong>on</strong>g> <strong>on</strong> Cray XE06(Gemini Interc<strong>on</strong>nect). Brief Overview of Cray software stack and hardware <strong>on</strong> Hopper. Support of Relaxati<strong>on</strong> Performance comparis<strong>on</strong> of communicati<strong>on</strong> primitives using strictand relaxed ordering. Single-<str<strong>on</strong>g>sided</str<strong>on</strong>g> <str<strong>on</strong>g>vs</str<strong>on</strong>g>. two-<str<strong>on</strong>g>sided</str<strong>on</strong>g> communicati<strong>on</strong> paradigms interacti<strong>on</strong>with relaxed ordering.LAWRENCE BERKELEY NATIONAL LABORATORY


Motivati<strong>on</strong>F U T U R E T E C H N O L O G I E S G R O U PPercentage UPC over MPI speedup250%200%40%30%20%10%0%-10%64 procs256 procsep ft is lu mg sp bt Harm<strong>on</strong>ic meanAlso, why Gasnet is not matching vendor performance?Why single-<str<strong>on</strong>g>sided</str<strong>on</strong>g> performs better than two-<str<strong>on</strong>g>sided</str<strong>on</strong>g> MPI?LAWRENCE BERKELEY NATIONAL LABORATORY


Cray Software StackF U T U R E T E C H N O L O G I E S G R O U P uGNI Multiplecommunicati<strong>on</strong> protocols RDMA (Block TransferEngine (BTE)).• Optimized for largemessages FMA• Optimized for smallmessages DMAPP communicati<strong>on</strong>protocol High-level protocol forsingle-<str<strong>on</strong>g>sided</str<strong>on</strong>g>communicati<strong>on</strong> support collectivesIOCTLGNIKernel LevelGNI (kGNI)GNI CoreFortran, CMPICH2DIRECTSHMEMHardware (Gemini/Arie)UPCDMAPPGeneric Hardware Abstracti<strong>on</strong> Layer (GHAL)libpgasDIRECTLAWRENCE BERKELEY NATIONAL LABORATORY


Hopper HardwareF U T U R E T E C H N O L O G I E S G R O U PHopper Node(Cray XE 6)2 twelve-core AMD 'MagnyCours'2.1-GHz processors per node.Four DDR3 1333-MHz memorychannels per twelve-core'MagnyCours' processorHopper NodeCore 0DDR3 Core 1Core 2Core 3DDR3 Core 4L3Cache6MBHT3L3Cache6MBCore 0Core 1Core 2Core 3Core 4DDR3DDR3201.6 Gflops/nodeL1: 64 KB, L2 caches::512KBCore 5Core 0HT3HT3HT3Core 5Core 06-MB L3 cache shared between 6cores.DDR3Core 1Core 2HT3Core 1Core 2DDR3DDR3Core 3Core 4L3Cache6MBL3Cache6MBCore 3Core 4DDR3Core 5Core 5HT3HT3HT3HT3NIC 0NetLinkBlockNIC 13D torus48 port YARC Router...LAWRENCE BERKELEY NATIONAL LABORATORY


Relaxati<strong>on</strong> of Memory TransfersF U T U R E T E C H N O L O G I E S G R O U P Hypertransport transacti<strong>on</strong>s terms Posted (writes without ack) N<strong>on</strong>-Posted (reads or writes with Ack) Relaxati<strong>on</strong> <strong>on</strong> Gemini Strict (no reodering) Default (n<strong>on</strong>-posted get pass posted writes) <strong>Relaxed</strong> (all n<strong>on</strong>-posted pass posted writes) Relaxati<strong>on</strong> affects both in-node memory transacti<strong>on</strong>s and remotememory transacti<strong>on</strong>s Interc<strong>on</strong>nect Relaxati<strong>on</strong>: How to? Dynamic Routing Multiple virtual channels Why? Performance: easier way to improve throughput (BW), than to improvelatency (wire delay) Resilience and fault tolerance.LAWRENCE BERKELEY NATIONAL LABORATORY


MicrobenchmarkF U T U R E T E C H N O L O G I E S G R O U PNode 0&1UPCT 1UPCT 2...UPCT mNode 2&3UPCT 1+mUPCT 2+m...UPCT 2m Objective: DialectsCompare the performance of lowlevelAPIs <str<strong>on</strong>g>vs</str<strong>on</strong>g>. high level runtimes.uGNI (FMA & BTE)DMAPPBerkeley UPCCray UPCOSU MBW (MPI) Ranks: m Window: w number of outstandingn<strong>on</strong>-blocking operati<strong>on</strong>s. Testbed: defaultbidirecti<strong>on</strong>al48 communicati<strong>on</strong> pairsMultiple outstanding messages(Window size)• Default 1LAWRENCE BERKELEY NATIONAL LABORATORY


DMAPP <str<strong>on</strong>g>vs</str<strong>on</strong>g>. uGNI (FMA and BTE) - strictF U T U R E T E C H N O L O G I E S G R O U PDMAPP strict FMA strict BTE strict1600010000Bandwidth (MB/s)10001008 16 32 64 128 256 512 1K 2K 4K 8K 16KMsg Size2K4K8K16K32K64K128K256K512K1M2M4M1200080004000Bandwidth (MB/s)LAWRENCE BERKELEY NATIONAL LABORATORY


DMAPP <str<strong>on</strong>g>vs</str<strong>on</strong>g>. uGNI (FMA and BTE)- <strong>Relaxed</strong>F U T U R E T E C H N O L O G I E S G R O U PDMAPP relaxed FMA relaxed BTE relaxed1600010000Bandwidth (MB/s)10001008 16 32 64 128 256 512 1K 2K 4K 8K 16KMsg Size2K4K8K16K32K64K128K256K512KLAWRENCE BERKELEY NATIONAL LABORATORY1M2M4MDMAPP can hide complexity of uGNI (FMA and BTE)1200080004000Bandwidth (MB/s)


Strict Ordering and C<strong>on</strong>currencyF U T U R E T E C H N O L O G I E S G R O U PC<strong>on</strong>currency (proc/node): 1 2 4 8 16 241000012000Bandwidth (MB/s)100010010FMA transport8 16 32 64 128 256 512 1K 2K 4K 8K 16KMsg SizeBTE transport2K4K8K16K32K64K128K256K512K1M2MLAWRENCE BERKELEY NATIONAL LABORATORY4M80004000Bandwidth (MB/s)


<strong>Relaxed</strong> Ordering and C<strong>on</strong>currencyF U T U R E T E C H N O L O G I E S G R O U PC<strong>on</strong>currency (proc/node): 1 2 4 8 16 2410000Bandwidth (MB/s)10001001200080004000Bandwidth (MB/s)10FMA transport8 16 32 64 128 256 512 1K 2K 4K 8K 16K2K4K8KMsg SizeBTE transport16K32K64K128K256K512K1M2M4MLAWRENCE BERKELEY NATIONAL LABORATORY


Low-Level API SummaryF U T U R E T E C H N O L O G I E S G R O U P DMAPP hides switchingcomplexity between FMA and BTE It also provides collectives <strong>Relaxed</strong> ordering improvesperformance (expected). Strict ordering hurts performanceMOSTLY under high c<strong>on</strong>currency. Registrati<strong>on</strong> is an expensivememory operati<strong>on</strong>s that better beremoved from critical path ofexecuti<strong>on</strong>. Performance Difference between<str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> and <str<strong>on</strong>g>Two</str<strong>on</strong>g> <str<strong>on</strong>g>sided</str<strong>on</strong>g> is NOTdue to different Performance oflow-level APIs.IOCTLGNIKernel LevelGNI (kGNI)GNI CoreFortran, CMPICH2DIRECTSHMEMHardware (Gemini/Arie)UPCDMAPPGeneric Hardware Abstracti<strong>on</strong> Layer (GHAL)libpgasDIRECTLAWRENCE BERKELEY NATIONAL LABORATORY


<str<strong>on</strong>g>Communicati<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>Paradigms</str<strong>on</strong>g>F U T U R E T E C H N O L O G I E S G R O U P<str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> UPC<str<strong>on</strong>g>Two</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> MPIP0P1P0P1LocalpromotedLocalLocal sharedLocal sharedpromotedput/get operati<strong>on</strong>ssend/recv operati<strong>on</strong>sLAWRENCE BERKELEY NATIONAL LABORATORY


<str<strong>on</strong>g>Communicati<strong>on</strong></str<strong>on</strong>g> Paradigm TerritoryF U T U R E T E C H N O L O G I E S G R O U PTraditi<strong>on</strong>al Shared Memory - CoherentCaches <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> load/store Default to relaxed ordering• Strict for synchr<strong>on</strong>izati<strong>on</strong> Message Passing - nocoherence Send/recv matching Strict matchingLAWRENCE BERKELEY NATIONAL LABORATORY


<str<strong>on</strong>g>Communicati<strong>on</strong></str<strong>on</strong>g> Paradigm TerritoryF U T U R E T E C H N O L O G I E S G R O U PTraditi<strong>on</strong>al Shared Memory - CoherentCaches <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> load/store Default to relaxed ordering• Strict for synchr<strong>on</strong>izati<strong>on</strong>PGAS Shared Memory - CoherentCaches <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> load/store Default to relaxed ordering• Strict for synchr<strong>on</strong>izati<strong>on</strong> Message Passing - nocoherence between nodes Load/store matching Strict matching Partiti<strong>on</strong> Global Address space <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> put/get strict ordering if blocking• <strong>Relaxed</strong> if n<strong>on</strong>-blocking.LAWRENCE BERKELEY NATIONAL LABORATORY


<str<strong>on</strong>g>Communicati<strong>on</strong></str<strong>on</strong>g> Paradigm TerritoryF U T U R E T E C H N O L O G I E S G R O U PTraditi<strong>on</strong>al Shared Memory - CoherentCaches <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> load/store Default to relaxed ordering• Strict for synchr<strong>on</strong>izati<strong>on</strong>MPI+MPI Shared Memory - CoherentCaches Send/recv matching Shared memory bypass Message Passing - nocoherence between nodes Send/recv matching Strict matching Message passing Send/recv Matching Matching strict, progress is notLAWRENCE BERKELEY NATIONAL LABORATORY


<strong>Relaxed</strong> Ordering and<str<strong>on</strong>g>Communicati<strong>on</strong></str<strong>on</strong>g> ParadigmF U T U R E T E C H N O L O G I E S G R O U P Requirement to service multiple outstanding requests Minimal remote-end involvement Unambiguous destinati<strong>on</strong> HPC runtime expectati<strong>on</strong> from Device Driver APIs point-to-point ordering• Node-to-node ordering• Rank-to-rank ordering (multi-core makes it different from node-to-node) Or, relaxed ordering, with multiple completi<strong>on</strong> semantic (local andglobal)LAWRENCE BERKELEY NATIONAL LABORATORY


Unambiguous Destinati<strong>on</strong>(Registrati<strong>on</strong>)F U T U R E T E C H N O L O G I E S G R O U PRegistrati<strong>on</strong> (Resolving remote ambiguity):Prevents memory swappingAllows offloading communicati<strong>on</strong> to NICAllows out-of-order progress of many transfers1000Expensive (hopefully not frequent)Limited and shared resourcesLocalP0LocalP1P0P1u sec10010a) Registrati<strong>on</strong>100010010sharedsharedpromotedpromoted12K4K8K16K32K64K128K256K512KMemory segment size1M2M4M8M1put/get UPCsend/recv MPILAWRENCE BERKELEY NATIONAL LABORATORY


UPC Shared Memory Ordering Challenge-ExampleF U T U R E T E C H N O L O G I E S G R O U P Cray definiti<strong>on</strong> of “Global Completi<strong>on</strong>” GASnet Heisenbug!P0:01: Get (m(A 0,0 ) m(A 1,0 ))02: m(A 0,0 )m(A 0,0 )+103: Put(m(A 0,0 )m(A 1,0 ))04: m(A 0,1 )P 0HT3P 1P1:11: Wait until m(A 0,1 )12: Get (m(A 0,2 ) m(A 1,0 ))13: Print m(A 0,2 )A 0,0A 0,1A 0,2...GeminiNICGlobally address MemoryMemory Address: A node,offsetAll initialized to zeros3D torus010312GeminiNICHT3A 1,0A 1,1A 1,2...P 2P 3LAWRENCE BERKELEY NATIONAL LABORATORY


MPI OrderingF U T U R E T E C H N O L O G I E S G R O U P N<strong>on</strong>-overtake rule (determinism) If a sender sends two messages (Message 1 and Message 2) in successi<strong>on</strong>to the same destinati<strong>on</strong>, and both match the same receive, the receiveoperati<strong>on</strong> will receive Message 1 before Message 2. If a receiver posts two receives (Receive 1 and Receive 2), in successi<strong>on</strong>,and both are looking for the same message, Receive 1 will receive themessage before Receive 2.Send: S order source rankdest rankS 2 0 S 1 0 S 0 0P 0P 1HT3GeminiNICReceive: R order source rankGeminiNIC3D torusHT3R 0 ANY S 1 2 R 2 ANYGemini P 2HT3NICP 3R 0 ANY R 1 ANYP 4P 6LAWRENCE BERKELEY NATIONAL LABORATORY


Cray MPI Method to ExploitGemini <strong>Relaxed</strong> OrderingF U T U R E T E C H N O L O G I E S G R O U PTypical eager <str<strong>on</strong>g>vs</str<strong>on</strong>g>. rendezvousCray’s<str<strong>on</strong>g>Two</str<strong>on</strong>g> eager modes<str<strong>on</strong>g>Two</str<strong>on</strong>g> rendezvous modeSwitching is c<strong>on</strong>trolled based <strong>on</strong> Message sizeRegistrati<strong>on</strong> cache is used(may not be able to tell which memory will be used for communicati<strong>on</strong>)E0 E1 R0 R10 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1MB 4MBMPICH_GNI_MAX_VSHORT_MSG_SIZEMPICH_GNI_MAX_EAGER_MSG_SIZEMPICH_GNI_NDREG_MAXSIZELAWRENCE BERKELEY NATIONAL LABORATORY


MPI Eager 0F U T U R E T E C H N O L O G I E S G R O U PSender(PE0)1. GNI SMSG send(header+data)SMSGMailboxesPEnReceiver(PE1)PE4PE02Completi<strong>on</strong>Queue3.copyMax size changewith the number of ranks.LAWRENCE BERKELEY NATIONAL LABORATORY


MPI Eager 1F U T U R E T E C H N O L O G I E S G R O U PApplicati<strong>on</strong> spaceSender(PE0)1. copy2. GNI SMSG send(header)Pre-allocated(registered)Runtime buffersSMSGMailboxesPE4PE04. RDMA FMAGET5. GNI SMSG (Recv D<strong>on</strong>e)3PEnCQReceiver(PE1)6. copyLAWRENCE BERKELEY NATIONAL LABORATORY


MPI Rendezvous 0F U T U R E T E C H N O L O G I E S G R O U PSender(PE0)2. GNI SMSG send(header)5. RDMA BTEGETSMSGMailboxesPE4PE03PEnCQReceiver(PE1)4. Registerreceive data(or lookup incache)1. Registersend data(or lookup incache)6. GNI SMSG (Recv D<strong>on</strong>e)LAWRENCE BERKELEY NATIONAL LABORATORY


MPI Rendezvous 1F U T U R E T E C H N O L O G I E S G R O U PSender(PE0)1. GNI SMSG send(header)4. GNI SMSG send(CTS)SMSGMailboxesPE4PE0PEnCQReceiver(PE1)3. Registerreceive data(in chunks)25. pipelined GNI RDMA BTE PUTs4. Registersend data(in chunks)6. GNI SMSG Send (d<strong>on</strong>e)LAWRENCE BERKELEY NATIONAL LABORATORY


<str<strong>on</strong>g>Two</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> Summary of ChallengesF U T U R E T E C H N O L O G I E S G R O U P Remote involvement Rank-to-rank strict ordering can cause• Node-to-node strict ordering by hardware• How to handle message cancellati<strong>on</strong>?! Message size ambiguity• int MPI_Irecv(buf, count,datatype, source, tag, …)– count reflects buffer size NOT msg size.– Receiver cannot tell what protocol will be used before matching• Probing obstruct relaxed ordering progress. Remote readiness for transfer All memory space default to local Registrati<strong>on</strong> is <strong>on</strong>-demand• Variability of performance based <strong>on</strong> hit/miss in registrati<strong>on</strong> cache.LAWRENCE BERKELEY NATIONAL LABORATORY


1648Bwindow size1648BF U T U R E T E C H N O L O G I E S G R O U P25.0%11 2 4 8 16 248KB16412.5%25.0%1 111 1 2 2 4 4 8 816 162424 1 2 48KB 16 Experiment: Vary number of messages per rank.81.3%43.8%4 Vary the53.1% 71.9% number of ranks 4per node.434.4%93.2%81.3%43.8%453.1% 71.9%4LAWRENCE BERKELEY NATIONAL 79.5% 34.4%LABORATORYwindow sizeMPI default25.0%MPI <str<strong>on</strong>g>vs</str<strong>on</strong>g>. UPC 8B Transacti<strong>on</strong>s1648B MPI buffers UPC shared (no BTE) dest8B 8Bwindow size16168KB8KB16window size12.5%25.0%25.0%50.0%87.5%100%87.5%75.0%62.5%50.0% 4111 2 C<strong>on</strong>currency4 8 16 24 1 2 4 8 16 248KBMPI default164168KBMPI buffers (no BTE)37.5%25.0%12.5%0.00%100%90.6%81.3%71.9%62.5%453.1%43.8%34.4%25.0%16UPC n<strong>on</strong>-8B100%87.5%75.0%62.5%50.0%37.5%25.0%12.5%0.00%16 8KB100% 86.4%90.6%81.3%79.5%71.9%62.5%53.1%65.9%43.8%25.0%


window sizewindow sizewindow size87.5%62.5%25.0%62.5%25.0%425.0%25.0%44450.0% 4 50.0%12.5%12.5%37.5%50.0%37.5%25.0%25.0%MPI <str<strong>on</strong>g>vs</str<strong>on</strong>g>. UPC 8KB Transacti<strong>on</strong>s25.0%12.5%12.5%0.00%0.00%F U T U R E T E C H N O L O G I E S G R O U P111111 1 2 2 4 4 8 8 16 16 24 1241 2 2 4 4 8 8 16 1616 24 2424 1 2window siz8KB8KB16 16window size481.3% 81.3%53.1% 53.1% 71.9% 71.9%81.3% 81.3% 4 43.8% 443.8%34.4% 34.4%93.2%43.8% 43.8%1 11 11 1 2 2 4 4 8 8 16 16 24 1241 2 2 4 4 8 8 16 161624 24124 1 216window size44256 256 KB KB164MPI Defaultwindow sizeApplicati<strong>on</strong> developer perspective: 16 161616window size4window size8KB168KB16479.5%256 256 KBKBWhat level of c<strong>on</strong>currency should I use?What is the implicati<strong>on</strong> for intra-node communicati<strong>on</strong>s?LAWRENCE BERKELEY NATIONAL LABORATORY62.5% 62.5%UPC shared dest71.9% 71.9%96.8%8KB16100% 100%90.6% 90.6%86.4%81.3% 81.3%71.9% 71.9%62.5% 62.5%4 79.5%53.1% 53.1%43.8% 43.8%34.4% 34.4%25.0% 25.0% 65.9%16256 KB100% 100%90.6% 90.6%81.3% 81.3%71.9% 71.9%62.5%462.5%53.1% 53.1%43.8% 43.8%34.4% 34.4%


window sizwindowwindow sizewindow size71.9%425.0%450.0%481.3% 4 43.8% 412.5%71.9%62.5% 496.8%37.5%53.1%25.0%43.8%62.5%12.5%MPI <str<strong>on</strong>g>vs</str<strong>on</strong>g>. UPC 2MB Transacti<strong>on</strong>s34.4%0.00%25.0%11F U T U R E T E C H1N 1O L O G I E S G R O U P11 1224488161624 124 1 22 444 888 161616 242424 1 21642 8KB16 MBMPI Default81.3%453.1% 71.9%71.9%window siz16window size4216MB28KBMB16UPC shared dest2 MB100% 100%90.6% 90.6%81.3% 81.3%43.8%71.9% 71.9%4 34.4% 34.4% 53.1%96.8%71.9% 62.5% 4 62.5%53.1% 53.1%96.8%43.8% 43.8%34.4% 34.4%25.0% 25.0%11112 4 8 16 24 1 112 4 8 8 16 16 241 2 4 8 16 24 1 2 4 8 16 24 241 2C<strong>on</strong>currencyC<strong>on</strong>currencyC<strong>on</strong>currency256 KB256 KB161616100%90.6%81.3%window size481.3% 4 43.8%71.9%LAWRENCE BERKELEY NATIONAL LABORATORY62.5%71.9%62.5%53.1%43.8%34.4%


<str<strong>on</strong>g>Two</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> MPI <str<strong>on</strong>g>vs</str<strong>on</strong>g>. <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> UPCF U T U R E T E C H N O L O G I E S G R O U PUPC shared UPC n<strong>on</strong>-shared MPI100% window size = 1Difference is NOT due:UPC <str<strong>on</strong>g>vs</str<strong>on</strong>g>. MPIRuntime optimizati<strong>on</strong>Registrati<strong>on</strong> overheador, CopyingDifference is due to:language semanticand ability to exploit relaxedordering.Percentage of peak bandwidth90%80%70%60%50%40%30%20%10%0%1 proc4 procs24 procs8 B1 proc4 procs24 procs2MBLAWRENCE BERKELEY NATIONAL LABORATORY


C<strong>on</strong>clusi<strong>on</strong>F U T U R E T E C H N O L O G I E S G R O U P <str<strong>on</strong>g>One</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> communicati<strong>on</strong> exploits relaxed ordering with ease Communicate-able memory is easier to identify• annotated shared - can be registered upfr<strong>on</strong>t. <strong>Relaxed</strong> <str<strong>on</strong>g>vs</str<strong>on</strong>g>. strict ordering is explicitly specified• by programmer (or programming language). (Default to relaxed) Large percentage of the peak performance for different communicati<strong>on</strong>patterns <str<strong>on</strong>g>Two</str<strong>on</strong>g>-<str<strong>on</strong>g>sided</str<strong>on</strong>g> faces the following challenges Strict matching between send and receives (large startup overhead) Locality noti<strong>on</strong> (everything default to local)• Expensive to prepare buffer for communicati<strong>on</strong> (registrati<strong>on</strong>). Receiver ambiguity about transacti<strong>on</strong>s (probing or over allocati<strong>on</strong>) Performance may need high node c<strong>on</strong>currency (problematic to in-nodecommunicati<strong>on</strong>).LAWRENCE BERKELEY NATIONAL LABORATORY

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!