
HPC Applications Scalability

HPC Advisory Council

Gilad@hpcadvisorycouncil.com


Applications Best Practices

• LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
• D. E. Shaw Research Desmond
• NWChem
• MPQC - Massively Parallel Quantum Chemistry Program
• ANSYS FLUENT and CFX
• CD-adapco STAR-CCM+
• CD-adapco STAR-CD
• MM5 - The Fifth-Generation Mesoscale Model
• NAMD
• LSTC LS-DYNA
• Schlumberger ECLIPSE
• CPMD - Car-Parrinello Molecular Dynamics
• WRF - Weather Research and Forecasting Model
• Dacapo - a total energy program based on density functional theory
• Lattice QCD
• OpenFOAM
• GAMESS
• HOMME
• OpenAtom
• POPperf


Applications Best Practices

• Performance evaluation
• Profiling – network, I/O
• Recipes (installation, debugging, command lines, etc.)
• MPI libraries
• Power management
• Optimizations at small scale
• Optimizations at scale


HPC Applications: Small Scale


ECLIPSE Performance Results - Interconnect

• InfiniBand enables the highest scalability
  – Performance continues to improve with cluster size
• Performance over GigE and 10GigE does not scale
  – A slowdown occurs beyond 8 nodes

Figure: Schlumberger ECLIPSE (FOURMILL) - elapsed time in seconds (lower is better) vs. number of nodes (4 to 24) for GigE, 10GigE, and InfiniBand; single job per cluster size.


MM5 Performance Results - Interconnect

• Input data: T3A
  – Resolution 36 km, grid size 112x136, 33 vertical levels
  – 81-second time step, 3-hour forecast
• InfiniBand DDR delivers higher performance at every cluster size
  – Up to 46% higher versus GigE and 30% versus 10GigE

Figure: MM5 benchmark results (T3A) - Mflop/s (higher is better) vs. number of nodes (4 to 22) for GigE, 10GigE, and InfiniBand DDR; Platform MPI.


MPQC Performance Results - Interconnect

• Input dataset
  – MP2 calculation of the uracil dimer binding energy using the aug-cc-pVTZ basis set
• InfiniBand enables higher scalability
  – Performance continues to improve with cluster size
  – Outperforms GigE by up to 124% and 10GigE by up to 38%

Figure: MPQC benchmark results (aug-cc-pVTZ) - elapsed time in seconds (lower is better) at 8, 16, and 24 nodes for GigE, 10GigE, and InfiniBand DDR.


NAMD Performance Results – Interconnect

• ApoA1 case - the benchmark comprises 92K atoms of lipid, protein, and water
  – Models a bloodstream lipoprotein particle
  – One of the most widely used datasets for benchmarking NAMD
• InfiniBand 20Gb/s outperforms GigE and 10GigE at every cluster size
  – Up to 79% higher performance vs GigE and up to 49% vs 10GigE

Figure: NAMD (ApoA1) - ns/day (higher is better) vs. number of nodes (4 to 24) for GigE, 10GigE, and InfiniBand; Open MPI.


CPMD Performance - MPI

• Platform MPI shows better scalability than Open MPI

Figure: CPMD (Wat32 inp-1 and Wat32 inp-2) - elapsed time in seconds (lower is better) at 4, 8, and 16 nodes for Open MPI and Platform MPI. These results are based on InfiniBand.


MM5 Performance - MPI

• Platform MPI demonstrates higher performance than MVAPICH
  – Up to 25% higher performance
  – The Platform MPI advantage grows with cluster size

Figure: MM5 benchmark results (T3A) - Mflop/s (higher is better) vs. number of nodes (4 to 22) for MVAPICH and Platform MPI; single job on each node.


NAMD Performance - MPI

• Platform MPI and Open MPI provide the same level of performance
  – Platform MPI performs better at cluster sizes below 20 nodes
  – Open MPI becomes better at 24 nodes
• Configurations larger than 24 nodes were not tested

Figure: NAMD (ApoA1) - ns/day (higher is better) vs. number of nodes (4 to 24) for Open MPI and Platform MPI. These results are based on InfiniBand.


STAR-CD Performance - MPI

• Test case
  – Single job over the entire system
  – Input dataset: A-Class
• HP-MPI performs slightly better with CPU affinity enabled

Figure: STAR-CD benchmark results (A-Class) - total elapsed time in seconds (lower is better) at 4, 8, 16, 20, and 24 nodes for Platform MPI and HP-MPI.


CPMD – MPI Profiling

• MPI_Alltoall is the key collective function in CPMD
  – The number of Alltoall messages increases dramatically with cluster size

Figure: CPMD MPI function distribution (C120 inp-1) - number of messages in millions for MPI_Allgather, MPI_Allreduce, MPI_Alltoall, MPI_Bcast, MPI_Isend, MPI_Recv, and MPI_Send at 4, 8, and 16 nodes.


ECLIPSE - MPI Profiling

• The majority of MPI messages are large
• Demonstrates the need for the highest throughput

Figure: ECLIPSE MPI profiling - percentage of messages vs. number of nodes (4 to 24), broken down into MPI_Isend < 128 bytes, MPI_Isend < 256 KB, MPI_Recv < 128 bytes, and MPI_Recv < 256 KB.


Abaqus/Standard - MPI Profiling

• MPI_Test, MPI_Recv, and MPI_Barrier/MPI_Bcast show the highest communication overhead

Figure: Abaqus/Standard MPI profiling (S2A) - MPI runtime in seconds per MPI function (MPI_Allreduce, MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Send, MPI_Ssend, MPI_Test, MPI_Waitall) at 16, 32, and 64 cores.


OpenFOAM MPI Profiling – MPI Functions

• Most used MPI functions
  – MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_Recv
  – The number of MPI function calls increases with cluster size


ECLIPSE - Interconnect Usage

• Total server throughput increases rapidly with cluster size

Figure: ECLIPSE data sent per node (MB/s) over elapsed time (seconds) for 4-, 8-, 16-, and 24-node runs; this data is per node.


NWChem MPI Profiling – Messages Transferred

• Most data-related MPI messages are 8KB-256KB in size
  – The number of messages increases with cluster size
• Shows the need for the highest throughput to ensure the highest system utilization

Figure: NWChem benchmark profiling (Siosi7) - number of messages (thousands) by message size.


STAR-CD MPI Profiling – Data Transferred

• Most point-to-point MPI messages are 1KB to 8KB in size
• The number of messages increases with cluster size

Figure: STAR-CD MPI profiling (MPI_Sendrecv) - number of messages in millions by message size ([0-128B), [128B-1K), [1K-8K), [8K-256K), [256K-1M)) for 4, 8, 12, 16, 20, and 22 nodes.


FLUENT 12 MPI Profiling – Messages Transferred

• Most data-related MPI messages are 256B-1KB in size
• Typical MPI synchronization messages are smaller than 64B
• The number of messages increases with cluster size

Figure: FLUENT 12 benchmark profiling (Truck_poly_14M) - number of messages in thousands by message size ([0..64B] through [4M..2GB]) for 4, 8, and 16 nodes.


ECLIPSE Performance - Productivity

• InfiniBand increases productivity by allowing multiple jobs to run simultaneously
  – Providing the productivity required for reservoir simulations
• Three cases are presented
  – A single job over the entire system
  – Four jobs, each on two cores per CPU per server
  – Eight jobs, each on one CPU core per server
• Eight jobs per node increase productivity by up to 142%

Figure: Schlumberger ECLIPSE (FOURMILL) - number of jobs (higher is better) vs. number of nodes (8 to 24) for 1, 4, and 8 jobs per node; InfiniBand.


MPQC Performance - Productivity

• InfiniBand increases productivity by allowing multiple jobs to run simultaneously
  – Providing the productivity required for MPQC computations
• Two cases are presented
  – A single job over the entire system
  – Two jobs, each on four cores per server
• Two jobs per node increase productivity by up to 35%

Figure: MPQC benchmark results (aug-cc-pVDZ) - jobs/day (higher is better) vs. number of nodes (8 to 24) for 1 job vs. 2 jobs (up to 35% gain); InfiniBand.


LS-DYNA Performance - Productivity

Figure: LS-DYNA 3 Vehicle Collision and Neon Refined Revised - jobs per day (higher is better) vs. number of nodes (4 to 24 nodes, 32 to 192 cores) for 1, 2, 4, and 8 parallel jobs.


FLUENT Performance - Productivity

• Test cases
  – A single job over the entire system
  – 2 jobs, each running on four cores per server
• Running multiple jobs simultaneously improves FLUENT productivity
  – Up to 90% more jobs per day for Eddy_417K
  – Up to 30% more jobs per day for Aircraft_2M
  – Up to 3% more jobs per day for Truck_14M
• The larger the number of elements, the higher the node count required for a productivity gain
  – The CPU becomes the bottleneck at larger server counts

Figure: FLUENT 12.0 productivity result (2 jobs in parallel vs. 1 job) - productivity gain (higher is better) vs. number of nodes (4 to 24) for Eddy_417K, Aircraft_2M, and Truck_14M; InfiniBand DDR.


MM5 Performance with Power Management

• Test scenario
  – 24 servers, 4 processes per node, 2 processes per CPU (socket)
• Similar performance with power management enabled or disabled
  – Only 1.4% performance degradation

Figure: MM5 benchmark results (T3A) - Gflop/s (higher is better) with power management enabled vs. disabled; InfiniBand DDR.


MM5 Benchmark – Power Cost Savings

• Power management saves $673/year for the 24-node cluster
• As cluster size increases, bigger savings are expected

Figure: MM5 benchmark (T3A) power cost comparison for the 24-node cluster - $/year with power management enabled vs. disabled (about 7% lower when enabled); InfiniBand DDR.

$/year = total power consumption per year (kWh) * $0.20
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
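As an illustration with assumed numbers (not measured in this study): a 24-node cluster drawing about 5.5 kW on average consumes roughly 5.5 kW * 8,760 h ≈ 48,000 kWh per year, or about $9,600 at $0.20/kWh, so the 7% reduction shown above corresponds to roughly $670 per year.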


STAR-CD Benchmark – Power Consumption

• Power management reduces total system power consumption by 2%

Figure: STAR-CD benchmark results (A-Class) - power consumption in watts (lower is better) with power management enabled vs. disabled; InfiniBand DDR.


STAR-CD Performance with Power Management

• Test scenario
  – 24 servers, 4 cores per processor
• Nearly identical performance with power management enabled or disabled

Figure: STAR-CD benchmark results (A-Class) - total elapsed time in seconds (lower is better) with power management enabled vs. disabled; InfiniBand DDR.


MPQC Performance with CPU Frequency Scaling

• Enabling CPU frequency scaling (a scripting sketch follows this list)
  – Userspace - reduces the CPU frequency to 1.9GHz
  – Performance - sets the CPU for maximum performance (CPU frequency of 2.6GHz)
  – OnDemand - maximum performance driven by application activity
• Userspace increases job run time since the CPU frequency is reduced
• Performance and OnDemand deliver similar performance
  – Due to the application's high resource demands
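The governor names above match the standard Linux cpufreq interface; a minimal sketch of how such a test could be scripted is shown below. The sysfs paths and the 1.9GHz setpoint are assumptions that vary by kernel, driver, and CPU, and are not taken from the original slides.

    # Sketch: select a cpufreq governor on every core (run as root on each node)
    for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
        echo performance > "$CPU/cpufreq/scaling_governor"    # or: ondemand
    done

    # For the "userspace" case, also pin the frequency (1.9GHz in this study)
    for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
        echo userspace > "$CPU/cpufreq/scaling_governor"
        echo 1900000 > "$CPU/cpufreq/scaling_setspeed"        # value is in kHz
    done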

Figure: MPQC performance on the 24-node cluster - elapsed time in seconds for the aug-cc-pVDZ and aug-cc-pVTZ datasets under the Userspace (1.9GHz), Performance, and OnDemand governors; InfiniBand DDR.


NWChem Benchmark Results

• Input dataset - Siosi7
• Open MPI and HP-MPI provide similar performance and scalability
  – The Intel MKL library enables better performance versus GCC

Figure: NWChem benchmark results (Siosi7) - performance in jobs/day (higher is better) at 8, 12, and 16 nodes for HP-MPI + GCC, HP-MPI + Intel MKL, Open MPI + GCC, and Open MPI + Intel MKL; InfiniBand QDR.


NWChem Benchmark Results

• Input dataset - H2O7
• AMD ACML provides higher performance and scalability than the default BLAS library

Figure: NWChem benchmark results (H2O7 MP2) - wall time in seconds (lower is better) at 4, 8, 16, and 24 nodes for MVAPICH, HP-MPI, and Open MPI, each with and without ACML; InfiniBand DDR.


Tuning MPI for Performance Improvements

• MPI performance tuning with Open MPI (see the example command line below)
  – Enabling CPU affinity
    • mca mpi_paffinity_alone 1
  – Increasing the eager limit over InfiniBand to 32K
    • mca btl_openib_eager_limit 32767
• Performance increase of up to 10%
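For reference, a minimal sketch of how these two Open MPI settings could be passed at launch; the process count, host file, and NAMD binary/input names are placeholders rather than values from the original slides:

    # Sketch: launch NAMD under Open MPI with the tuned parameters from this slide
    mpirun -np 192 --hostfile hosts.txt \
           --mca mpi_paffinity_alone 1 \
           --mca btl_openib_eager_limit 32767 \
           ./namd2 apoa1.namd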

Figure: NAMD (ApoA1) - ns/day (higher is better) vs. number of nodes (4 to 24) for Open MPI with default settings vs. Open MPI tuned.


HPC Applications: Large Scale


Acknowledgments

• The following research was performed under the HPC Advisory Council activities
  – Participating members: AMD, Dell, Jülich, Mellanox, NCAR, ParTec, Sun
  – Compute resources: HPC Advisory Council Cluster Center, Jülich Supercomputing Centre
• For more information please refer to
  – www.hpcadvisorycouncil.com


HOMME

• High-Order Methods Modeling Environment (HOMME)
  – Framework for creating a high-performance, scalable global atmospheric model
  – Configurable for shallow water or the dry/moist primitive equations
  – Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate System Model (CCSM)
  – Supports execution on parallel computers using MPI, OpenMP, or a combination of the two
  – Developed by the Scientific Computing Section at the National Center for Atmospheric Research (NCAR)


POPperf

• POP (Parallel Ocean Program) is an ocean circulation model
  – Simulations of the global ocean
  – Ocean-ice coupled simulations
  – Developed at Los Alamos National Lab
• POPperf is a modified version of POP 2.0
• POPperf improves POP scalability at large processor counts through
  – A rewrite of the conjugate gradient solver to use a 1D data structure
  – The addition of a space-filling-curve partitioning technique
  – Low-memory binary parallel I/O functionality
• Developed by NCAR and freely available to the community


Objectives and Environments

• The presented research was done to provide best practices
  – HOMME and POPperf scalability and optimizations
  – Understanding communication patterns
• HPC Advisory Council system
  – Dell PowerEdge SC 1435 24-node cluster
  – Quad-Core AMD Opteron 2382 (“Shanghai”) CPUs
  – Mellanox® InfiniBand ConnectX HCAs and switches
• Jülich Supercomputing Centre - JuRoPA
  – Sun Blade servers with Intel Nehalem processors
  – Mellanox 40Gb/s InfiniBand HCAs and switches
  – ParTec ParaStation Cluster Operation Software and MPI


ParTec MPI Parameters

• PSP_ONDEMAND (see the usage sketch after this list)
  – Each MPI connection needs a certain amount of memory by default (0.5MB)
  – PSP_ONDEMAND disabled
    • MPI connections are established when the application starts
    • Avoids the overhead of establishing connections dynamically
  – PSP_ONDEMAND enabled
    • MPI connections are established as needed
    • Reduces unnecessary message checking across a large number of connections
    • Reduces the memory footprint
• PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE
  – Define the default send/receive queue sizes (default is 16)
  – Changing them can reduce the memory footprint
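A minimal sketch of how these ParaStation MPI environment variables might be set in a job script; the launcher line, process count, binary name, and the queue-size value of 4 are illustrative assumptions, not recommendations from the original slides:

    # Sketch: enable on-demand connections and shrink the send/receive queues
    # to reduce the per-connection memory footprint at large scale
    export PSP_ONDEMAND=1
    export PSP_OPENIB_SENDQ_SIZE=4      # default is 16
    export PSP_OPENIB_RECVQ_SIZE=4      # default is 16
    mpiexec -np 8192 ./homme < namelist.nl    # placeholder launch line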


Lessons Learned for Large Scale

• Each application needs customized MPI tuning
  – Dynamic MPI connection establishment should be enabled if
    • Setting up all connections simultaneously is not a must
    • Some MPI functions need to check incoming messages from any source
  – Dynamic MPI connection establishment should be disabled if
    • MPI_Alltoall is used in the application
    • The number of calls checking MPI_ANY_SOURCE is small
    • There is enough memory in the system
• Different parameters may be needed at different scales (see the sketch below)
  – PSP_ONDEMAND shouldn't be enabled on very small clusters
• Different MPI libraries have different characteristics
  – Parameter changes for one MPI may not fit other MPIs
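As a sketch of the scale-dependent choice described above; the node-count threshold is an assumption for illustration, not a value from the study:

    # Sketch: choose dynamic connection establishment based on job size;
    # keep it off for small or Alltoall-heavy runs, on for large runs
    if [ "$NODES" -gt 64 ]; then        # threshold is illustrative only
        export PSP_ONDEMAND=1
    else
        export PSP_ONDEMAND=0
    fi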


HOMME Performance at Scale

HOMME Performance at Higher Scale…

HOMME Scalability

POPperf Performance at Scale

POPperf Scalability


Summary

• HOMME and POPperf performance depends on a low-latency network
• InfiniBand enables both HOMME and POPperf to scale
  – Tested over 13,824 cores
  – The same scalability is expected at even higher core counts
• Optimized MPI settings can dramatically improve application performance
  – Understanding the application communication pattern
  – MPI parameter tuning


Acknowledgements

• Thanks to all the participating HPC Advisory Council members
  – Ben Mayer and John Dennis from NCAR, who provided the code and benchmark instructions
  – Bastian Tweddell, Norbert Eicker, and Thomas Lippert, who made the JuRoPA system available for benchmarking
  – Axel Koehler from Sun, and Jens Hauke and Hugo R. Falter from ParTec, who helped review the report


Thank You

HPC Advisory Council
www.hpcadvisorycouncil.com
info@hpcadvisorycouncil.com
