Large scale applications scalability - HPC Advisory Council
<strong>HPC</strong> Applications Scalability<br />
Gilad@hpcadvisorycouncil.com
Applications Best Practices<br />
• LAMMPS - <strong>Large</strong>-<strong>scale</strong> Atomic/Molecular Massively Parallel Simulator<br />
• D. E. Shaw Research Desmond<br />
• NWChem<br />
• MPQC - Massively Parallel Quantum Chemistry Program<br />
• ANSYS FLUENT and CFX<br />
• CD-adapco STAR-CCM+<br />
• CD-adapco STAR-CD<br />
• MM5 - The Fifth-Generation Meso<strong>scale</strong> Model<br />
• NAMD<br />
• LSTC LS-DYNA<br />
• Schlumberger ECLIPSE<br />
• CPMD - Car-Parrinello Molecular Dynamics<br />
• WRF - Weather Research and Forecast Model<br />
• Dacapo - total energy program based on density functional theory<br />
• Lattice QCD<br />
• OpenFOAM<br />
• GAMESS<br />
• HOMME<br />
• OpenAtom<br />
• POPperf<br />
Applications Best Practices<br />
• Performance evaluation<br />
• Profiling – network, I/O<br />
• Recipes (installation, debugging, command lines, etc.)<br />
• MPI libraries<br />
• Power management<br />
• Optimizations at small <strong>scale</strong><br />
• Optimizations at <strong>scale</strong><br />
<strong>HPC</strong> Applications<br />
Small Scale
ECLIPSE Performance Results - Interconnect<br />
• InfiniBand enables highest <strong>scalability</strong><br />
– Performance accelerates with cluster size<br />
• Performance over GigE and 10GigE does not scale<br />
– A slowdown occurs beyond 8 nodes<br />
[Chart: Schlumberger ECLIPSE (FOURMILL) - Elapsed Time (Seconds) vs. Number of Nodes (4 to 24); GigE, 10GigE, InfiniBand; lower is better; single job per cluster size]<br />
MM5 Performance Results - Interconnect<br />
• Input Data: T3A<br />
– Resolution 36KM, grid size 112x136, 33 vertical levels<br />
– 81 second time-step, 3 hour forecast<br />
• InfiniBand DDR delivers higher performance at every cluster size<br />
– Up to 46% versus GigE and 30% versus 10GigE<br />
[Chart: MM5 Benchmark Results - T3A - Mflop/s vs. Number of Nodes (4 to 22); GigE, 10GigE, InfiniBand-DDR; higher is better; Platform MPI]<br />
MPQC Performance Results - Interconnect<br />
• Input Dataset<br />
– MP2 calculations of the uracil dimer binding energy using the aug-cc-pVTZ basis set<br />
• InfiniBand enables higher <strong>scalability</strong><br />
– Performance accelerates with cluster size<br />
– Outperforms GigE by up to 124% and 10GigE by up to 38%<br />
[Chart: MPQC Benchmark Result (aug-cc-pVTZ) - Elapsed Time (Seconds) vs. Number of Nodes (8, 16, 24); GigE, 10GigE, InfiniBand DDR; lower is better]<br />
NAMD Performance Results – Interconnect<br />
• ApoA1 case - benchmark comprises 92K atoms of lipid, protein, and water<br />
– Models a bloodstream lipoprotein particle<br />
– One of the most used data sets for benchmarking NAMD<br />
• InfiniBand 20Gb/s outperforms GigE and 10GigE at every cluster size<br />
– Up to 79% higher performance vs GigE and 49% vs 10GigE<br />
[Chart: NAMD (ApoA1) - ns/Day vs. Number of Nodes (4 to 24); GigE, 10GigE, InfiniBand; higher is better; Open MPI]<br />
CPMD Performance - MPI<br />
• Platform MPI shows better <strong>scalability</strong> than Open MPI<br />
[Charts: CPMD (Wat32 inp-1) and CPMD (Wat32 inp-2) - Elapsed time (Seconds) vs. Number of Nodes (4, 8, 16); Open MPI, Platform MPI; lower is better; results are based on InfiniBand]<br />
MM5 Performance - MPI<br />
• Platform MPI demonstrates higher performance versus MVAPICH<br />
– Up to 25% higher performance<br />
– Platform MPI advantage increases with increased cluster size<br />
[Chart: MM5 Benchmark Results - T3A - Mflop/s vs. Number of Nodes (4 to 22); MVAPICH, Platform MPI; higher is better; single job on each node]<br />
NAMD Performance - MPI<br />
• Platform MPI and Open MPI provide a similar level of performance<br />
– Platform MPI performs better at cluster sizes below 20 nodes<br />
– Open MPI performs better at 24 nodes<br />
• Configurations larger than 24 nodes were not tested<br />
[Chart: NAMD (ApoA1) - ns/Day vs. Number of Nodes (4 to 24); Open MPI, Platform MPI; higher is better; results are based on InfiniBand]<br />
STAR-CD Performance - MPI<br />
• Test case<br />
– Single job over the entire system<br />
– Input Dataset (A-Class)<br />
• HP-MPI has slightly better performance with CPU affinity enabled<br />
[Chart: STAR-CD Benchmark Results (A-Class) - Total Elapsed Time (s) vs. Number of Nodes (4 to 24); Platform MPI, HP-MPI; lower is better]<br />
CPMD – MPI Profiling<br />
• MPI_Alltoall is the key collective function in CPMD<br />
– The number of MPI_Alltoall messages increases dramatically with cluster size<br />
[Chart: MPI Function Distribution (C120 inp-1) - Number of Messages (Millions) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Alltoall, MPI_Bcast, MPI_Isend, MPI_Recv, MPI_Send); 4, 8, and 16 nodes]<br />
ECLIPSE - MPI Profiling<br />
[Chart: ECLIPSE MPI Profiling - Percentage of Messages vs. Number of Nodes (4 to 24); MPI_Isend &lt; 128 Bytes, MPI_Isend &lt; 256 KB, MPI_Recv &lt; 128 Bytes, MPI_Recv &lt; 256 KB]<br />
• The majority of MPI messages are large<br />
– Demonstrating the need for the highest network throughput<br />
Abaqus/Standard - MPI Profiling<br />
• MPI_Test, MPI_Recv, and MPI_Barrier/Bcast show the highest<br />
communication overhead<br />
[Chart: Abaqus/Standard MPI Profiling (S2A) - MPI Runtime (s), log scale, per MPI function (MPI_Allreduce, MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Send, MPI_Ssend, MPI_Test, MPI_Waitall); 16, 32, and 64 cores]<br />
OpenFOAM MPI Profiling – MPI Functions<br />
• Most used MPI functions<br />
– MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_Recv<br />
– The number of MPI function calls increases with cluster size<br />
ECLIPSE - Interconnect Usage<br />
• Total server throughput increases rapidly with cluster size<br />
[Charts: Data Sent, Data Transferred (MB/s) vs. Timing (s), per node, for 4-, 8-, 16-, and 24-node clusters]<br />
NWChem MPI Profiling – Message Transferred<br />
• Most data-related MPI messages are between 8KB and 256KB in size<br />
– Number of messages increases with cluster size<br />
• Shows the need for highest throughput to ensure highest system utilization<br />
[Chart: NWChem Benchmark Profiling (Siosi7) - Number of Messages (K)]<br />
STAR-CD MPI Profiling – Data Transferred<br />
• Most point-to-point MPI messages are between 1KB and 8KB in size<br />
• Number of messages increases with cluster size<br />
[Chart: STAR-CD MPI Profiling (MPI_Sendrecv) - Number of Messages (Millions) by message size ([0-128B&gt;, [128B-1K&gt;, [1K-8K&gt;, [8K-256K&gt;, [256K-1M&gt;); 4 to 22 nodes]<br />
FLUENT 12 MPI Profiling – Message Transferred<br />
• Most data-related MPI messages are between 256B and 1KB in size<br />
• Typical MPI synchronization messages are lower than 64B in size<br />
• Number of messages increases with cluster size<br />
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) - Number of Messages (Thousands) by message size ([0..64B] through [4M..2GB]); 4, 8, and 16 nodes]<br />
ECLIPSE Performance - Productivity<br />
• InfiniBand increases productivity by allowing multiple jobs to run simultaneously<br />
– Providing required productivity for reservoir simulations<br />
• Three cases are presented<br />
– Single job over the entire system<br />
– Four jobs, each on two cores per CPU per server<br />
– Eight jobs, each on one CPU core per server<br />
• Eight jobs per node increases productivity by up to 142%<br />
[Chart: Schlumberger ECLIPSE (FOURMILL) - Number of Jobs vs. Number of Nodes (8 to 24); 1, 4, and 8 jobs per node; higher is better; InfiniBand]<br />
MPQC Performance - Productivity<br />
• InfiniBand increases productivity by allowing multiple jobs to run simultaneously<br />
– Providing required productivity for MPQC computation<br />
• Two cases are presented<br />
– Single job over the entire system<br />
– Two jobs, each on four cores per server<br />
• Two jobs per node increases productivity by up to 35%<br />
[Chart: MPQC Benchmark Result (aug-cc-pVDZ) - Jobs/Day vs. Number of Nodes (8 to 24); 1 job vs. 2 jobs (up to 35% gain); higher is better; InfiniBand]<br />
LS-DYNA Performance - Productivity<br />
[Charts: LS-DYNA - 3 Vehicle Collision and Neon Refined Revised - Jobs per Day vs. Number of Nodes (4 to 24 nodes, 32 to 192 cores); 1, 2, 4, and 8 parallel jobs; higher is better]<br />
FLUENT Performance - Productivity<br />
• Test cases<br />
– Single job over the entire system<br />
– 2 jobs, each runs on four cores per server<br />
• Running multiple jobs simultaneously improves FLUENT productivity<br />
– Up to 90% more jobs per day for Eddy_417K<br />
– Up to 30% more jobs per day for Aircraft_2M<br />
– Up to 3% more jobs per day for Truck_14M<br />
• The larger the number of elements, the higher the node count required for increased productivity<br />
– The CPU becomes the bottleneck at larger server counts<br />
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs. 1 job) - Productivity Gain (%) vs. Number of Nodes (4 to 24); Eddy_417K, Aircraft_2M, Truck_14M; higher is better; InfiniBand DDR]<br />
MM5 Performance with Power Management<br />
• Test Scenario<br />
– 24 servers, 4 processes per node, 2 processes per CPU (socket)<br />
• Similar performance with power management enabled or disabled<br />
– Only 1.4% performance degradation<br />
[Chart: MM5 Benchmark Results - T3A - Gflop/s with power management enabled vs. disabled; higher is better; InfiniBand DDR]<br />
MM5 Benchmark – Power Cost Savings<br />
• Power management saves $673/year for the 24-node cluster<br />
• As cluster size increases, bigger savings are expected<br />
[Chart: MM5 Benchmark - T3A Power Cost Comparison ($/year, 24-node cluster) - power management enabled vs. disabled; 7% saving]<br />
$/year = Total power consumption/year (kWh) * $0.20<br />
For more information - http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf<br />
InfiniBand DDR<br />
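The cost formula above is easy to sanity-check; the 5500 W average draw below is a hypothetical figure for illustration, not a measurement from the benchmark:

```shell
# Annual power cost per the slide's formula:
#   $/year = total energy per year (kWh) * $0.20
# 5500 W is an illustrative average cluster draw, not a measured value
awk 'BEGIN {
    watts = 5500
    kwh_per_year = watts / 1000 * 24 * 365   # 48180 kWh
    printf "$%.0f/year\n", kwh_per_year * 0.20
}'
```

At that draw the formula gives roughly $9,600/year, so a 7% saving on a cluster in this power class is in line with the ~$673/year quoted above.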
STAR-CD Benchmark – Power Consumption<br />
• Power management reduces 2% of total system power consumption<br />
[Chart: STAR-CD Benchmark Results (AClass) - Power Consumption (Watt) with power management enabled vs. disabled; lower is better; InfiniBand DDR]<br />
STAR-CD Performance with Power Management<br />
• Test Scenario<br />
– 24 servers, 4-Cores/Proc<br />
• Nearly identical performance with power management enabled or<br />
disabled<br />
[Chart: STAR-CD Benchmark Results (AClass) - Total Elapsed Time (s) with power management enabled vs. disabled; lower is better; InfiniBand DDR]<br />
MPQC Performance with CPU Frequency Scaling<br />
• Enabling CPU Frequency Scaling<br />
– Userspace – reducing CPU frequency to 1.9GHz<br />
– Performance – setting for maximum performance (CPU frequency of 2.6GHz)<br />
– OnDemand – adjusts CPU frequency according to application activity<br />
• Userspace increases job run time since CPU frequency is reduced<br />
• Performance and OnDemand enable similar performance<br />
– Due to high resource demands from the application<br />
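On Linux, these governors are typically selected through the cpufreq sysfs interface; a sketch, assuming a cpufreq-enabled kernel, root privileges, and repetition per core (paths and the 1.9GHz setpoint are illustrative):

```shell
# Select a governor for core 0 (repeat for each core in the node)
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand    > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Userspace governor pinned to a fixed 1.9GHz, as in the test scenario
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1900000   > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed  # value in kHz
```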
[Chart: MPQC Performance (24-node cluster) - Elapsed Time (s) for the aug-cc-pVDZ and aug-cc-pVTZ datasets; Userspace (1.9GHz), Performance, OnDemand; InfiniBand DDR]<br />
NWChem Benchmark Results<br />
• Input Dataset - Siosi7<br />
• Open MPI and HP-MPI provide similar performance and <strong>scalability</strong><br />
– The Intel MKL library enables better performance than GCC<br />
[Chart: NWChem Benchmark Result (Siosi7) - Performance (Jobs/Day) vs. Number of Nodes (8, 12, 16); HP-MPI + GCC, HP-MPI + Intel MKL, Open MPI + GCC, Open MPI + Intel MKL; higher is better; InfiniBand QDR]<br />
NWChem Benchmark Results<br />
• Input Dataset - H2O7<br />
• AMD ACML provides higher performance and <strong>scalability</strong> versus<br />
the default BLAS library<br />
[Chart: NWChem Benchmark Result (H2O7 MP2) - Wall time (Seconds) vs. Number of Nodes (4, 8, 16, 24); MVAPICH, HP-MPI, and Open MPI, each with and without ACML; lower is better; InfiniBand DDR]<br />
Tuning MPI For Performance Improvements<br />
• MPI Performance tuning<br />
– Enabling CPU affinity<br />
• mca mpi_paffinity_alone 1<br />
– Increasing the eager limit over InfiniBand to 32K<br />
• mca btl_openib_eager_limit 32767<br />
• Performance increase of up to 10%<br />
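Passed on the command line, the two tuning parameters above would look roughly like this (Open MPI mca syntax; the process count, binary name, and input file are placeholders):

```shell
# Launch NAMD with CPU affinity enabled and a 32K openib eager limit
mpirun -np 192 \
    --mca mpi_paffinity_alone 1 \
    --mca btl_openib_eager_limit 32767 \
    namd2 apoa1.namd
```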
[Chart: NAMD (ApoA1) - ns/Day vs. Number of Nodes (4 to 24); Open MPI default vs. Open MPI tuned; higher is better]<br />
<strong>HPC</strong> Applications<br />
<strong>Large</strong> Scale
Acknowledgments<br />
• The following research was performed under the <strong>HPC</strong><br />
<strong>Advisory</strong> <strong>Council</strong> activities<br />
– Participating members: AMD, Dell, Jülich, Mellanox, NCAR,<br />
ParTec, Sun<br />
– Compute resource: <strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong> Cluster Center,<br />
Jülich Supercomputing Centre<br />
• For more info please refer to<br />
– www.hpcadvisorycouncil.com<br />
HOMME<br />
• High-Order Methods Modeling Environment (HOMME)<br />
– Framework for creating a high-performance scalable global atmospheric model<br />
– Configurable for shallow water or the dry/moist primitive equations<br />
– Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate<br />
System Model (CCSM)<br />
– HOMME supports execution on parallel computers using either MPI, OpenMP or a combination of MPI/OpenMP<br />
– Developed by the Scientific Computing Section at the National Center for Atmospheric Research (NCAR)<br />
POPperf<br />
• POP (Parallel Ocean Program) is an ocean circulation model<br />
– Simulations of the global ocean<br />
– Ocean-ice coupled simulations<br />
– Developed at Los Alamos National Lab<br />
• POPperf is a modified version of POP 2.0 (Parallel Ocean Program)<br />
• POPperf improves POP <strong>scalability</strong> on large processor counts<br />
– Re-writing of the conjugate gradient solver to use a 1D data structure<br />
– The addition of a space-filling curve partitioning technique<br />
– Low memory binary parallel I/O functionality<br />
• Developed by NCAR and freely available to the community<br />
Objectives and Environments<br />
• The presented research was done to provide best practices<br />
– HOMME and POPperf <strong>scalability</strong> and optimizations<br />
– Understanding communication patterns<br />
• <strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong> system<br />
– Dell PowerEdge SC 1435 24-node cluster<br />
– Quad-Core AMD Opteron 2382 (“Shanghai”) CPUs<br />
– Mellanox® InfiniBand ConnectX HCAs and switches<br />
• Jülich Supercomputing Centre - JuRoPA<br />
– Sun Blade servers with Intel Nehalem processors<br />
– Mellanox 40Gb/s InfiniBand HCAs and switches<br />
– ParTec ParaStation Cluster Operation Software and MPI<br />
ParTec MPI Parameters<br />
• PSP_ONDEMAND<br />
– Each MPI connection needs a certain amount of memory by default (0.5MB)<br />
– PSP_ONDEMAND disabled<br />
• MPI connections are established when the application starts<br />
• Avoids the overhead of establishing connections dynamically<br />
– PSP_ONDEMAND enabled<br />
• MPI connections are established as needed<br />
• Reduces unnecessary message checking across a large number of connections<br />
• Reduces the memory footprint<br />
• PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE<br />
– Define the default send/receive queue size (default is 16)<br />
– Changing them can reduce the memory footprint<br />
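In practice these are set as environment variables before launching the job; a sketch, where the queue-size value of 8 is illustrative rather than a recommendation from the study:

```shell
# ParaStation MPI tuning, exported before the job launch
export PSP_ONDEMAND=1            # establish MPI connections only as needed
export PSP_OPENIB_SENDQ_SIZE=8   # shrink the send queue (default 16)
export PSP_OPENIB_RECVQ_SIZE=8   # shrink the receive queue (default 16)
```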
Lessons Learned for <strong>Large</strong> Scale<br />
• Each application needs customized MPI tuning<br />
– Dynamic MPI connection establishment should be enabled if<br />
• Simultaneous connection setup is not a must<br />
• Some MPI functions need to check incoming messages from any source<br />
– Dynamic MPI connection establishment should be disabled if<br />
• MPI_Alltoall is used in the application<br />
• The number of calls checking MPI_ANY_SOURCE is small<br />
• There is enough memory in the system<br />
• At different <strong>scale</strong>s, different parameters may be needed<br />
– PSP_ONDEMAND shouldn’t be enabled on very small clusters<br />
• Different MPI libraries have different characteristics<br />
– Parameters chosen for one MPI may not fit other MPIs<br />
HOMME Performance at Scale<br />
HOMME Performance at Higher Scale…<br />
HOMME Scalability<br />
POPperf Performance at Scale<br />
POPperf Scalability<br />
Summary<br />
• HOMME and POPperf performance depends on a low-latency network<br />
• InfiniBand enables both HOMME and POPperf to <strong>scale</strong><br />
– Tested over 13,824 cores<br />
– The same <strong>scalability</strong> is expected at even higher core counts<br />
• Optimized MPI settings can dramatically improve application performance<br />
– Understanding the application communication pattern<br />
– MPI parameter tuning<br />
Acknowledgements<br />
• Thanks to all the participating <strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong> members<br />
– Ben Mayer and John Dennis from NCAR who provided the code and<br />
benchmark instructions<br />
– Bastian Tweddell, Norbert Eicker, and Thomas Lippert who made the<br />
JuRoPA system available for benchmarking<br />
– Axel Koehler from Sun, Jens Hauke and Hugo R. Falter from ParTec<br />
who helped review the report<br />
Thank You<br />
<strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong><br />
www.hpcadvisorycouncil.com<br />
info@hpcadvisorycouncil.com<br />