Large scale applications scalability - HPC Advisory Council
<strong>HPC</strong> Applications Scalability<br />
Gilad@hpcadvisorycouncil.com
Applications Best Practices<br />
• LAMMPS - <strong>Large</strong>-<strong>scale</strong> Atomic/Molecular Massively Parallel Simulator<br />
• D. E. Shaw Research Desmond<br />
• NWChem<br />
• MPQC - Massively Parallel Quantum Chemistry Program<br />
• ANSYS FLUENT and CFX<br />
• CD-adapco STAR-CCM+<br />
• CD-adapco STAR-CD<br />
• MM5 - The Fifth-Generation Meso<strong>scale</strong> Model<br />
• NAMD<br />
• LSTC LS-DYNA<br />
• Schlumberger ECLIPSE<br />
• CPMD - Car-Parrinello Molecular Dynamics<br />
• WRF - Weather Research and Forecast Model<br />
• Dacapo - total energy program based on density functional theory<br />
• Lattice QCD<br />
• OpenFOAM<br />
• GAMESS<br />
• HOMME<br />
• OpenAtom<br />
• POPperf<br />
Applications Best Practices<br />
• Performance evaluation<br />
• Profiling – network, I/O<br />
• Recipes (installation, debugging, command lines, etc.)<br />
• MPI libraries<br />
• Power management<br />
• Optimizations at small <strong>scale</strong><br />
• Optimizations at <strong>scale</strong><br />
<strong>HPC</strong> Applications<br />
Small Scale
ECLIPSE Performance Results - Interconnect<br />
• InfiniBand enables highest <strong>scalability</strong><br />
– Performance accelerates with cluster size<br />
• Performance over GigE and 10GigE does not scale<br />
– A slowdown occurs beyond 8 nodes<br />
[Chart: Schlumberger ECLIPSE (FOURMILL) - Elapsed Time (Seconds) vs. Number of Nodes (4 to 24); GigE, 10GigE, InfiniBand; lower is better; single job per cluster size]<br />
MM5 Performance Results - Interconnect<br />
• Input Data: T3A<br />
– Resolution 36KM, grid size 112x136, 33 vertical levels<br />
– 81 second time-step, 3 hour forecast<br />
• InfiniBand DDR delivers higher performance at every cluster size<br />
– Up to 46% versus GigE and 30% versus 10GigE<br />
[Chart: MM5 Benchmark Results - T3A - Mflop/s vs. Number of Nodes (4 to 22); GigE, 10GigE, InfiniBand-DDR; higher is better; Platform MPI]<br />
MPQC Performance Results - Interconnect<br />
• Input Dataset<br />
– MP2 calculations of the uracil dimer binding energy using the aug-cc-pVTZ basis set<br />
• InfiniBand enables higher <strong>scalability</strong><br />
– Performance accelerates with cluster size<br />
– Outperforms GigE by up to 124% and 10GigE by up to 38%<br />
[Chart: MPQC Benchmark Result (aug-cc-pVTZ) - Elapsed Time (Seconds) vs. Number of Nodes (8, 16, 24); GigE, 10GigE, InfiniBand DDR; lower is better]<br />
NAMD Performance Results – Interconnect<br />
• ApoA1 case - benchmark comprises 92K atoms of lipid, protein, and water<br />
– Models a bloodstream lipoprotein particle<br />
– One of the most used data sets for benchmarking NAMD<br />
• InfiniBand 20Gb/s outperforms GigE and 10GigE at every cluster size<br />
– Up to 79% higher performance vs GigE and 49% vs 10GigE<br />
[Chart: NAMD (ApoA1) - ns/Day vs. Number of Nodes (4 to 24); GigE, 10GigE, InfiniBand; higher is better; Open MPI]<br />
CPMD Performance - MPI<br />
• Platform MPI shows better <strong>scalability</strong> than Open MPI<br />
[Charts: CPMD (Wat32 inp-1) and CPMD (Wat32 inp-2) - Elapsed time (Seconds) vs. Number of Nodes (4, 8, 16); Open MPI, Platform MPI; lower is better; results are based on InfiniBand]<br />
MM5 Performance - MPI<br />
• Platform MPI demonstrates higher performance versus MVAPICH<br />
– Up to 25% higher performance<br />
– Platform MPI advantage increases with increased cluster size<br />
[Chart: MM5 Benchmark Results - T3A - Mflop/s vs. Number of Nodes (4 to 22); MVAPICH, Platform MPI; higher is better; single job on each node]<br />
NAMD Performance - MPI<br />
• Platform MPI and Open MPI provide a similar level of performance<br />
– Platform MPI performs better at cluster sizes below 20 nodes<br />
– Open MPI performs better at 24 nodes<br />
• Configurations larger than 24 nodes were not tested<br />
[Chart: NAMD (ApoA1) - ns/Day vs. Number of Nodes (4 to 24); Open MPI, Platform MPI; higher is better; results are based on InfiniBand]<br />
STAR-CD Performance - MPI<br />
• Test case<br />
– Single job over the entire system<br />
– Input Dataset (A-Class)<br />
• HP-MPI has slightly better performance with CPU affinity enabled<br />
[Chart: STAR-CD Benchmark Results (A-Class) - Total Elapsed Time (s) vs. Number of Nodes (4 to 24); Platform MPI, HP-MPI; lower is better]<br />
CPMD – MPI Profiling<br />
• MPI_Alltoall is the key collective function in CPMD<br />
– The number of MPI_Alltoall messages increases dramatically with cluster size<br />
[Chart: MPI Function Distribution (C120 inp-1) - Number of Messages (Millions) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Alltoall, MPI_Bcast, MPI_Isend, MPI_Recv, MPI_Send); 4, 8, and 16 nodes]<br />
ECLIPSE - MPI Profiling<br />
[Chart: ECLIPSE MPI Profiling - Percentage of Messages vs. Number of Nodes (4 to 24); MPI_Isend &lt; 128 Bytes, MPI_Isend &lt; 256 KB, MPI_Recv &lt; 128 Bytes, MPI_Recv &lt; 256 KB]<br />
• The majority of MPI messages are large<br />
– Demonstrating the need for the highest network throughput<br />
Abaqus/Standard - MPI Profiling<br />
• MPI_Test, MPI_Recv, and MPI_Barrier/Bcast show the highest<br />
communication overhead<br />
[Chart: Abaqus/Standard MPI Profiling (S2A) - MPI Runtime (s), log scale, per MPI function (MPI_Allreduce, MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Send, MPI_Ssend, MPI_Test, MPI_Waitall); 16, 32, and 64 cores]<br />
OpenFOAM MPI Profiling – MPI Functions<br />
• Most used MPI functions<br />
– MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_Recv<br />
– The number of MPI function calls increases with cluster size<br />
ECLIPSE - Interconnect Usage<br />
• Total server throughput increases rapidly with cluster size<br />
[Charts: Data Sent, Data Transferred (MB/s) vs. Timing (s), per node, for 4-, 8-, 16-, and 24-node clusters]<br />
NWChem MPI Profiling – Message Transferred<br />
• Most data-related MPI messages are between 8KB and 256KB in size<br />
– Number of messages increases with cluster size<br />
• Shows the need for highest throughput to ensure highest system utilization<br />
[Chart: NWChem Benchmark Profiling (Siosi7) - Number of Messages (K)]<br />
STAR-CD MPI Profiling – Data Transferred<br />
• Most point-to-point MPI messages are between 1KB and 8KB in size<br />
• Number of messages increases with cluster size<br />
[Chart: STAR-CD MPI Profiling (MPI_Sendrecv) - Number of Messages (Millions) by message size ([0-128B&gt;, [128B-1K&gt;, [1K-8K&gt;, [8K-256K&gt;, [256K-1M&gt;); 4 to 22 nodes]<br />
FLUENT 12 MPI Profiling – Message Transferred<br />
• Most data-related MPI messages are between 256B and 1KB in size<br />
• Typical MPI synchronization messages are lower than 64B in size<br />
• Number of messages increases with cluster size<br />
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) - Number of Messages (Thousands) by message size ([0..64B] through [4M..2GB]); 4, 8, and 16 nodes]<br />
ECLIPSE Performance - Productivity<br />
• InfiniBand increases productivity by allowing multiple jobs to run simultaneously<br />
– Providing required productivity for reservoir simulations<br />
• Three cases are presented<br />
– Single job over the entire system<br />
– Four jobs, each on two cores per CPU per server<br />
– Eight jobs, each on one CPU core per server<br />
• Eight jobs per node increases productivity by up to 142%<br />
[Chart: Schlumberger ECLIPSE (FOURMILL) - Number of Jobs vs. Number of Nodes (8 to 24); 1, 4, and 8 jobs per node; higher is better; InfiniBand]<br />
MPQC Performance - Productivity<br />
• InfiniBand increases productivity by allowing multiple jobs to run simultaneously<br />
– Providing required productivity for MPQC computation<br />
• Two cases are presented<br />
– Single job over the entire system<br />
– Two jobs, each on four cores per server<br />
• Two jobs per node increases productivity by up to 35%<br />
[Chart: MPQC Benchmark Result (aug-cc-pVDZ) - Jobs/Day vs. Number of Nodes (8 to 24); 1 job vs. 2 jobs (up to 35% gain); higher is better; InfiniBand]<br />
LS-DYNA Performance - Productivity<br />
[Charts: LS-DYNA - 3 Vehicle Collision and Neon Refined Revised - Jobs per Day vs. Number of Nodes (4 to 24 nodes, 32 to 192 cores); 1, 2, 4, and 8 parallel jobs; higher is better]<br />
FLUENT Performance - Productivity<br />
• Test cases<br />
– Single job over the entire system<br />
– 2 jobs, each runs on four cores per server<br />
• Running multiple jobs simultaneously improves FLUENT productivity<br />
– Up to 90% more jobs per day for Eddy_417K<br />
– Up to 30% more jobs per day for Aircraft_2M<br />
– Up to 3% more jobs per day for Truck_14M<br />
• The larger the number of elements, the higher the node count required for increased productivity<br />
– The CPU becomes the bottleneck at larger server counts<br />
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs. 1 job) - Productivity Gain (%) vs. Number of Nodes (4 to 24); Eddy_417K, Aircraft_2M, Truck_14M; higher is better; InfiniBand DDR]<br />
MM5 Performance with Power Management<br />
• Test Scenario<br />
– 24 servers, 4 processes per node, 2 processes per CPU (socket)<br />
• Similar performance with power management enabled or disabled<br />
– Only 1.4% performance degradation<br />
[Chart: MM5 Benchmark Results - T3A - Gflop/s with power management enabled vs. disabled; higher is better; InfiniBand DDR]<br />
MM5 Benchmark – Power Cost Savings<br />
• Power management saves $673/year for the 24-node cluster<br />
• As cluster size increases, bigger savings are expected<br />
[Chart: MM5 Benchmark - T3A Power Cost Comparison ($/year, 24-node cluster) - power management enabled vs. disabled; 7% saving]<br />
$/year = Total power consumption/year (kWh) * $0.20<br />
For more information - http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf<br />
InfiniBand DDR<br />
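The cost formula above is easy to sanity-check; the 5500 W average draw below is a hypothetical figure for illustration, not a measurement from the benchmark:

```shell
# Annual power cost per the slide's formula:
#   $/year = total energy per year (kWh) * $0.20
# 5500 W is an illustrative average cluster draw, not a measured value
awk 'BEGIN {
    watts = 5500
    kwh_per_year = watts / 1000 * 24 * 365   # 48180 kWh
    printf "$%.0f/year\n", kwh_per_year * 0.20
}'
```

At that draw the formula gives roughly $9,600/year, so a 7% saving on a cluster in this power class is in line with the ~$673/year quoted above.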
STAR-CD Benchmark – Power Consumption<br />
• Power management reduces 2% of total system power consumption<br />
[Chart: STAR-CD Benchmark Results (AClass) - Power Consumption (Watt) with power management enabled vs. disabled; lower is better; InfiniBand DDR]<br />
STAR-CD Performance with Power Management<br />
• Test Scenario<br />
– 24 servers, 4-Cores/Proc<br />
• Nearly identical performance with power management enabled or<br />
disabled<br />
[Chart: STAR-CD Benchmark Results (AClass) - Total Elapsed Time (s) with power management enabled vs. disabled; lower is better; InfiniBand DDR]<br />
MPQC Performance with CPU Frequency Scaling<br />
• Enabling CPU Frequency Scaling<br />
– Userspace – reducing CPU frequency to 1.9GHz<br />
– Performance – setting for maximum performance (CPU frequency of 2.6GHz)<br />
– OnDemand – adjusts CPU frequency according to application activity<br />
• Userspace increases job run time since CPU frequency is reduced<br />
• Performance and OnDemand enable similar performance<br />
– Due to high resource demands from the application<br />
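On Linux, these governors are typically selected through the cpufreq sysfs interface; a sketch, assuming a cpufreq-enabled kernel, root privileges, and repetition per core (paths and the 1.9GHz setpoint are illustrative):

```shell
# Select a governor for core 0 (repeat for each core in the node)
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand    > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Userspace governor pinned to a fixed 1.9GHz, as in the test scenario
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1900000   > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed  # value in kHz
```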
[Chart: MPQC Performance (24-node cluster) - Elapsed Time (s) for the aug-cc-pVDZ and aug-cc-pVTZ datasets; Userspace (1.9GHz), Performance, OnDemand; InfiniBand DDR]<br />
NWChem Benchmark Results<br />
• Input Dataset - Siosi7<br />
• Open MPI and HP-MPI provide similar performance and <strong>scalability</strong><br />
– The Intel MKL library enables better performance than GCC<br />
[Chart: NWChem Benchmark Result (Siosi7) - Performance (Jobs/Day) vs. Number of Nodes (8, 12, 16); HP-MPI + GCC, HP-MPI + Intel MKL, Open MPI + GCC, Open MPI + Intel MKL; higher is better; InfiniBand QDR]<br />
NWChem Benchmark Results<br />
• Input Dataset - H2O7<br />
• AMD ACML provides higher performance and <strong>scalability</strong> versus<br />
the default BLAS library<br />
[Chart: NWChem Benchmark Result (H2O7 MP2) - Wall time (Seconds) vs. Number of Nodes (4, 8, 16, 24); MVAPICH, HP-MPI, and Open MPI, each with and without ACML; lower is better; InfiniBand DDR]<br />
Tuning MPI For Performance Improvements<br />
• MPI Performance tuning<br />
– Enabling CPU affinity<br />
• mca mpi_paffinity_alone 1<br />
– Increasing the eager limit over InfiniBand to 32K<br />
• mca btl_openib_eager_limit 32767<br />
• Performance increase of up to 10%<br />
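Passed on the command line, the two tuning parameters above would look roughly like this (Open MPI mca syntax; the process count, binary name, and input file are placeholders):

```shell
# Launch NAMD with CPU affinity enabled and a 32K openib eager limit
mpirun -np 192 \
    --mca mpi_paffinity_alone 1 \
    --mca btl_openib_eager_limit 32767 \
    namd2 apoa1.namd
```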
[Chart: NAMD (ApoA1) - ns/Day vs. Number of Nodes (4 to 24); Open MPI default vs. Open MPI tuned; higher is better]<br />
<strong>HPC</strong> Applications<br />
<strong>Large</strong> Scale
Acknowledgments<br />
• The following research was performed under the <strong>HPC</strong><br />
<strong>Advisory</strong> <strong>Council</strong> activities<br />
– Participating members: AMD, Dell, Jülich, Mellanox, NCAR,<br />
ParTec, Sun<br />
– Compute resource: <strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong> Cluster Center,<br />
Jülich Supercomputing Centre<br />
• For more info please refer to<br />
– www.hpcadvisorycouncil.com<br />
HOMME<br />
• High-Order Methods Modeling Environment (HOMME)<br />
– Framework for creating a high-performance scalable global atmospheric model<br />
– Configurable for shallow water or the dry/moist primitive equations<br />
– Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate<br />
System Model (CCSM)<br />
– HOMME supports execution on parallel computers using either MPI, OpenMP or a combination of MPI/OpenMP<br />
– Developed by the Scientific Computing Section at the National Center for Atmospheric Research (NCAR)<br />
POPperf<br />
• POP (Parallel Ocean Program) is an ocean circulation model<br />
– Simulations of the global ocean<br />
– Ocean-ice coupled simulations<br />
– Developed at Los Alamos National Lab<br />
• POPperf is a modified version of POP 2.0 (Parallel Ocean Program)<br />
• POPperf improves POP <strong>scalability</strong> on large processor counts<br />
– Re-writing of the conjugate gradient solver to use a 1D data structure<br />
– The addition of a space-filling curve partitioning technique<br />
– Low memory binary parallel I/O functionality<br />
• Developed by NCAR and freely available to the community<br />
Objectives and Environments<br />
• The presented research was done to provide best practices<br />
– HOMME and POPperf <strong>scalability</strong> and optimizations<br />
– Understanding communication patterns<br />
• <strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong> system<br />
– Dell PowerEdge SC 1435 24-node cluster<br />
– Quad-Core AMD Opteron 2382 (“Shanghai”) CPUs<br />
– Mellanox® InfiniBand ConnectX HCAs and switches<br />
• Jülich Supercomputing Centre - JuRoPA<br />
– Sun Blade servers with Intel Nehalem processors<br />
– Mellanox 40Gb/s InfiniBand HCAs and switches<br />
– ParTec ParaStation Cluster Operation Software and MPI<br />
ParTec MPI Parameters<br />
• PSP_ONDEMAND<br />
– Each MPI connection needs a certain amount of memory by default (0.5MB)<br />
– PSP_ONDEMAND disabled<br />
• MPI connections are established when the application starts<br />
• Avoids the overhead of establishing connections dynamically<br />
– PSP_ONDEMAND enabled<br />
• MPI connections are established as needed<br />
• Reduces unnecessary message checking across a large number of connections<br />
• Reduces the memory footprint<br />
• PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE<br />
– Define the default send/receive queue size (default is 16)<br />
– Changing them can reduce the memory footprint<br />
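In practice these are set as environment variables before launching the job; a sketch, where the queue-size value of 8 is illustrative rather than a recommendation from the study:

```shell
# ParaStation MPI tuning, exported before the job launch
export PSP_ONDEMAND=1            # establish MPI connections only as needed
export PSP_OPENIB_SENDQ_SIZE=8   # shrink the send queue (default 16)
export PSP_OPENIB_RECVQ_SIZE=8   # shrink the receive queue (default 16)
```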
Lessons Learned for <strong>Large</strong> Scale<br />
• Each application needs customized MPI tuning<br />
– Dynamic MPI connection establishment should be enabled if<br />
• Simultaneous connection setup is not a must<br />
• Some MPI functions need to check incoming messages from any source<br />
– Dynamic MPI connection establishment should be disabled if<br />
• MPI_Alltoall is used in the application<br />
• The number of calls checking MPI_ANY_SOURCE is small<br />
• There is enough memory in the system<br />
• At different <strong>scale</strong>s, different parameters may be needed<br />
– PSP_ONDEMAND shouldn’t be enabled on very small clusters<br />
• Different MPI libraries have different characteristics<br />
– Parameters chosen for one MPI may not fit other MPIs<br />
HOMME Performance at Scale<br />
HOMME Performance at Higher Scale…<br />
HOMME Scalability<br />
POPperf Performance at Scale<br />
POPperf Scalability<br />
Summary<br />
• HOMME and POPperf performance depends on a low-latency network<br />
• InfiniBand enables both HOMME and POPperf to <strong>scale</strong><br />
– Tested over 13,824 cores<br />
– The same <strong>scalability</strong> is expected at even higher core counts<br />
• Optimized MPI settings can dramatically improve application performance<br />
– Understanding the application communication pattern<br />
– MPI parameter tuning<br />
Acknowledgements<br />
• Thanks to all the participating <strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong> members<br />
– Ben Mayer and John Dennis from NCAR who provided the code and<br />
benchmark instructions<br />
– Bastian Tweddell, Norbert Eicker, and Thomas Lippert who made the<br />
JuRoPA system available for benchmarking<br />
– Axel Koehler from Sun, Jens Hauke and Hugo R. Falter from ParTec<br />
who helped review the report<br />
Thank You<br />
<strong>HPC</strong> <strong>Advisory</strong> <strong>Council</strong><br />
www.hpcadvisorycouncil.com<br />
info@hpcadvisorycouncil.com<br />