Dell Power Solutions

HIGH-PERFORMANCE COMPUTING

Configuring the test environment

The test environment was based on a cluster comprising 32 Dell PowerEdge 3250 servers interconnected with nonblocking Gigabit Ethernet and Myricom Myrinet. The Gigabit Ethernet interconnect comprised one Myricom M3-E128 switch enclosure populated with 16 Myrinet M3-SW16-8E switch line cards, each having 8 Gigabit Ethernet ports (at 1 Gbps link speed) for 128 ports total. The Myrinet interconnect comprised one Myricom M3-E128 switch enclosure populated with 16 Myrinet M3-SW16-8F switch line cards, each having 8 fiber ports (at 2 Gbps link speed) for 128 ports total.

Each PowerEdge 3250 was configured with two Intel® Itanium® 2 processors running at 1.3 GHz with 3 MB of level 3 (L3) cache and 4 GB of double data rate (DDR) RAM operating on a 400 MHz frontside bus. Each PowerEdge 3250 had an Intel® E8870 chip set Scalable Node Controller (SNC), which accommodates up to eight registered DDR dual in-line memory modules (DIMMs). The operating system was Red Hat® Enterprise Linux® AS 2.1 with kernel version 2.4.18-e.37smp. Intel® Fortran and C++ Compilers 7.1 (Build 20030909) were used to link STAR-CD V3.150A.012 with MPICH 1.2.4 and MPICH-GM 1.2.5..10. Figure 1 shows the architectural stack of the computational test environment.

Application       STAR-CD V3.150A.012
Compiler          Intel Fortran and C++ Compilers 7.1
Middleware        MPICH 1.2.4 and MPICH-GM 1.2.5..10
Operating system  Red Hat Enterprise Linux AS 2.1, kernel 2.4.18-e.37smp
Protocol          TCP/IP, GM-2
Interconnect      Gigabit Ethernet, Myrinet
Platform          Dell PowerEdge 3250 servers in a 32-node cluster

Figure 1. Architectural stack of the computational test environment

Measuring application efficiency

The efficiency of parallel applications is usually measured by speedup. The speedup of an application on a cluster with N processors is defined as t1/tN, where t1 is the execution time of the application running on one processor and tN is the execution time of the application running across N processors. In theory, the maximum speedup of a parallel simulation with N processors is N; that is, the program runs N times faster than it would on a single processor. In reality, however, as the number of processors increases, application speedup is usually less than N. The disparity between theoretical and actual performance can be attributed to overheads that grow with the processor count, including parallel job initiation, interprocessor communication, file I/O, and network contention.

Running test cases

CD adapco Group provides a suite of six test cases selected to demonstrate the versatility and robustness of STAR-CD in CFD solutions and the relative performance of the application on different hardware platforms and different processor configurations [2]. CD adapco Group selected these test cases to be industry representative, categorizing them as small, medium, and large. Figure 2 lists benchmark names and brief descriptions of the data sets used in this study.

In this study, the small (Engine block) and the large (A-class) test cases were used to perform the analysis and draw the conclusions regarding the application performance and sensitivity to different hardware configurations. Each test case was performed on 1, 2, 4, 8, 16, and 32 processors to help assess the scalability of the application in different hardware configurations, such as using two processors versus one processor on each node, using the local file system versus NFS, and using Gigabit Ethernet versus Myrinet interconnects. All the benchmark runs were conducted using the double-precision version of the code and the conjugate gradient solver.

When running the application over the Myrinet interconnect using MPICH-GM 1.2.5..10, testers had to make sure that the RAMFILES and TURBO environment variables were properly passed to each node because the mpirun shell script does not handle this task the same way that MPICH 1.2.4 does. Initially, benchmark runs using Gigabit Ethernet were much faster than the runs using Myrinet.
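The article does not say how the RAMFILES and TURBO variables were actually passed to each node. As a minimal sketch of one general workaround, a launcher that does not forward the environment can be given a command that sets the variables itself through the POSIX env utility. The helper name wrap_with_env, the solver path ./star_solver, and the values "1" are hypothetical placeholders, not STAR-CD or MPICH-GM settings.

```python
import re
import shlex

# Conservative pattern for portable environment variable names.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def wrap_with_env(command, variables):
    """Return a shell command string that sets `variables` via `env`
    before running `command`, so each remote process sees them even
    when the launcher does not forward the caller's environment."""
    if not command:
        raise ValueError("command must not be empty")
    assignments = []
    for name, value in variables.items():
        if not _NAME.fullmatch(name):
            raise ValueError(f"invalid variable name: {name!r}")
        # Quote the whole NAME=value word so spaces in values survive the shell.
        assignments.append(shlex.quote(f"{name}={value}"))
    words = ["env", *assignments, *(shlex.quote(arg) for arg in command)]
    return " ".join(words)


# Placeholder values and solver path, for illustration only.
remote_command = wrap_with_env(["./star_solver"], {"RAMFILES": "1", "TURBO": "1"})
print(remote_command)
```

The resulting string would then be handed to whatever launcher starts the per-node processes; how a particular mpirun script treats it depends on that script and should be checked against its documentation.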
By setting the RAMFILES and TURBO environment variables, Dell engineers enabled STAR-CD to use a solver code optimized for Itanium 2 architecture and RAMFILES, resulting in improved application performance with the Myrinet interconnect.

Comparing the performance of the local file system versus NFS

Some applications perform several file I/O operations during execution, while other applications do most of the I/O at the end of the simulation and still other applications do very little I/O at all. Direct (local) or remote (NFS) access to the file system can affect the application’s performance because file I/O is usually slower than other operations such as computation and communication. To observe the performance impact of NFS compared to the local file system, Dell engineers ran the Engine block and A-class test cases in the same cluster test environment with a single processor per node, first using NFS and later using the local file system for the STAR-CD input and output files.

Class  Benchmark     Cells      Mesh        Description
Small  Engine block  156,739    Hexahedral  Engine cooling in automobile engine block
Large  A-class       5,914,426  Hybrid      Turbulent flow around A-class car

Figure 2. STAR-CD benchmarks used in the Dell test

[2] For more information about STAR-CD benchmarks and data sets, visit www.cd-adapco.com/support/bench/315/index.htm.

Reprinted from Dell Power Solutions, February 2005. Copyright © 2005 Dell Inc. All rights reserved.

Figure 3 shows the performance gain from using the local file system over NFS for different numbers of processes running on a single-processor node. Clearly, the file system’s effect on the application’s performance for both test cases was apparent at 16 processes (or processors), and a greater than 50 percent performance improvement was achieved using the local file system at 32 processes (or processors). In other words, using NFS instead of the local file system can drastically decrease the overall application performance at large processor counts. This behavior should be taken into account when designing an HPC cluster for a specific application.

Figure 3. Performance improvement of the local file system compared to NFS (y-axis: performance improvement in percent; x-axis: number of processes, 2 to 32; series: A-class and Engine block)

Figure 4. Performance degradation of two processors per node compared to one processor per node (y-axis: performance degradation in percent; x-axis: number of processes, 2 to 32; series: A-class and Engine block)

Comparing single-processor-per-node versus dual-processor-per-node performance

A parallel application such as STAR-CD running on two processors (one from each of two nodes) usually delivers better performance than a similar parallel application running on two processors that reside on the same node. The better performance can be attributed primarily to the fact that the application does not need to share memory and I/O devices when it is running on multiple nodes. However, a symmetric multiprocessing (SMP) system in which the processors reside on the same node is often a less-expensive solution because of the many resources that can be shared between two or more processors.
Besides sharing internal resources, SMP systems require fewer ports on the network switch because one network interface card (NIC) can be used for each SMP system. The need for fewer ports also helps to decrease the overall cost of an SMP-based cluster.

The A-class and Engine block test cases were executed in the same hardware and software environment. First, only one processor per node was used to perform the simulations. Then, both processors from each node were used to perform the same simulations. Results of this test are shown in Figure 4. The x-axis is the number of processes (equivalent to the number of processors) used, and the y-axis shows the percent degradation in overall parallel application performance for both test cases when using two processors per node, with respect to using one processor per node. Gigabit Ethernet was used as the cluster interconnect, and the local file system was used to perform file I/O.

Using up to 16 processors, the A-class test case exhibited a performance drop of approximately 17 percent. When 32 processors were used, the drop increased to over 60 percent because interprocessor communication took up a larger percentage of total time than in the 16-processor case. With two communicating processes using a shared resource, namely the network interface, there is an increased potential for contention. Network interfaces with lower latency and higher bandwidth can mitigate some of this contention. The comparison of Myrinet and Gigabit Ethernet performance in the two-processor-per-node case, discussed in the section “Comparing the performance of Gigabit Ethernet versus Myrinet interconnects,” demonstrates this mitigation.

On the other hand, the Engine block test case showed roughly 10 percent degradation when using two processors (one SMP node); the degradation diminished as more processors, up to 16 processors (eight SMP nodes), were used.
Because the same workload was decomposed into smaller sub-workloads when more processors were used, each sub-workload started to benefit more from the cache. In addition, the memory contention problem inherent to SMP systems became less significant when each processor worked on a smaller data set. These factors contributed to less performance degradation (from a roughly 10 percent degradation to a 5 percent performance improvement) as indicated in Figure 4. However, when 32 processors were used, more than 20 percent performance degradation was observed because of the negative effect of high-latency interconnects (as discussed in the “Comparing the performance of Gigabit Ethernet versus Myrinet interconnects” section) and increased interprocessor communication.
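The speedup definition given under “Measuring application efficiency” can be made concrete with a short calculation. The Python sketch below computes speedup as t1/tN and, as a commonly used derived measure that the article itself does not define, parallel efficiency as speedup divided by N. The function names are hypothetical and the timings are illustrative placeholders, not measurements from this study.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup on N processors: single-processor time divided by N-processor time."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("execution times must be positive")
    return t1 / tn


def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Fraction of the ideal speedup (N) actually achieved; 1.0 is ideal scaling."""
    if n < 1:
        raise ValueError("processor count must be at least 1")
    return speedup(t1, tn) / n


# Illustrative wall-clock times in seconds, keyed by processor count.
# These numbers are made up for the example and are not from the Dell tests.
example_times = {1: 1600.0, 2: 820.0, 4: 430.0, 8: 240.0, 16: 150.0, 32: 125.0}

t1 = example_times[1]
for n, tn in example_times.items():
    s = speedup(t1, tn)
    e = parallel_efficiency(t1, tn, n)
    print(f"N={n:2d}  speedup={s:6.2f}  efficiency={e:5.2f}")
```

With placeholder data of this shape, the efficiency column falls as N grows, which is the gap between theoretical and actual speedup that the article attributes to job initiation, communication, file I/O, and network contention.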