HIGH-PERFORMANCE COMPUTING

Three query sizes (94,000 words, 206,000 words, and 510,000 words) were chosen to represent small, medium, and large data sets, respectively. The database against which these queries were matched remained constant. The medium and large queries were approximately 2.2 and 5.4 times the size of the small query.

Figure 3 illustrates the relative performance of the PowerEdge 3250 and PowerEdge 1850 servers compared to the PowerEdge 1750 system using dual 130 nm Intel Xeon processors at 3.2 GHz. The PowerEdge 3250 server with the Itanium 2 processor had the best performance for all query sizes and exhibited speedups ranging from 122 percent to 137 percent over the PowerEdge 1750 server. BLAST scaled more efficiently on the PowerEdge 3250 server with the Itanium 2 processor, which is why the Itanium 2 speedup in Figure 3 is slightly higher for the dual-threaded runs. Figure 3 also shows that the frequency increase from 3.2 GHz to 3.6 GHz (13 percent frequency scaling) for the 90 nm Intel Xeon processor resulted in only a 6 to 11 percent improvement for the small and medium query sizes. For the larger query size, the improvement from frequency scaling was slightly higher at 12 to 15 percent. Therefore, BLAST scaled well with increasing processor frequency on the Intel Xeon processors, especially on larger query sizes.

The impact of memory and processors on BLAST performance

Memory performance is important for BLAST. The difference in performance between the PowerEdge 1750 and PowerEdge 1850 servers occurred mainly because of the improved DDR2 technology and the faster FSB in the PowerEdge 1850 server. Performance was better on the PowerEdge 1850 server by 29 percent for the 3.2 GHz processor and 44 percent for the 3.6 GHz processor. The 29 percent speedup was primarily caused by the faster memory subsystem, and the 44 percent speedup can be attributed to the faster memory and frequency scaling.

Cache effectiveness is also important for BLAST. The larger cache size on the Itanium 2 processor, along with its ability to execute a higher number of integer instructions per cycle, helped achieve speedups of 130 percent on the larger query size.

[Figure 3. Performance speedup for the PowerEdge 3250 and PowerEdge 1850 servers over the PowerEdge 1750 server. Vertical axis: relative performance. Horizontal axis: query size/number of threads (94,000; 206,000; and 510,000 words, each with 1 thread and 2 threads). Series: PowerEdge 3250 (Intel Itanium processor at 1.5 GHz), PowerEdge 1850 (Intel Xeon processor at 3.6 GHz), PowerEdge 1850 (Intel Xeon processor at 3.2 GHz), PowerEdge 1750 (Intel Xeon processor at 3.2 GHz).]

Ramesh Radhakrishnan, Ph.D., is a systems engineer in the Scalable Systems Group at Dell. His interests are performance analysis and characterization of enterprise-level benchmarks. Ramesh has a Ph.D. in Computer Engineering from The University of Texas at Austin.

Rizwan Ali is a systems engineer working in the Scalable Systems Group at Dell. His current research interests are performance benchmarking and high-speed interconnects. Rizwan has a B.S. in Electrical Engineering from the University of Minnesota.

Garima Kochhar is a systems engineer in the Scalable Systems Group at Dell. She has a B.S. in Computer Science and Physics from the Birla Institute of Technology and Science (BITS) in Pilani, India. She has an M.S. from The Ohio State University, where she worked in the area of job scheduling.

Kalyana Chadalavada is a senior engineer with the Dell Enterprise Solutions Engineering Group at the Bangalore Development Center. Kalyana has a B.S. in Computer Science and Engineering from Nagarjuna University in India. His current interests include performance characterizations on HPC clusters and processor architectures.

Ramesh Rajagopalan is a lead software engineer in the Database and Applications Engineering team of the Dell Product Group. His current areas of focus include Oracle Real Application Clusters and performance analysis of Dell clusters. Ramesh has a Bachelor of Engineering in Computer Science from the Indian Institute of Science, Bangalore.

FOR MORE INFORMATION

Top 500 Supercomputer Sites: www.top500.org

BLAST benchmark: www.ncbi.nih.gov/BLAST

STREAM benchmark: www.streambench.org

Eunice, Jonathan. “Itanium 2 Performance: Wow!” Illuminata Research Note. August 27, 2002.

Hsieh, Jenwei, Tau Leng, Victor Mashayekhi, and Reza Rooholamini. “Impact of Level 2 Cache and Memory Subsystem on the Scalability of Clusters of Small-Scale SMP Servers.” IEEE International Conference on Cluster Computing. Chemnitz, Germany. November 2000.

Intel Corporation. “DDR2 Advantages for Dual Processor Servers.” August 2004. www.memforum.org/memorybasics/ddr2/DDR2_Whitepaper.pdf.

Koufaty, David and Deborah T. Marr. “Hyper-Threading Technology in the NetBurst Microarchitecture.” IEEE Micro. March/April 2003.

Sharangpani, Harsh and Ken Arora. “Itanium Processor Microarchitecture.” IEEE Micro. September/October 2000.

Reprinted from Dell Power Solutions, February 2005. Copyright © 2005 Dell Inc. All rights reserved.
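A note on the BLAST speedup arithmetic above: the article quotes speedups as percentages over the PowerEdge 1750 baseline and compares a clock increase from 3.2 GHz to 3.6 GHz with the resulting performance gain. The short Python sketch below shows one way to reproduce that arithmetic. The function names and the "scaling efficiency" ratio are illustrative assumptions, not terms or methods taken from the article; only the input numbers (122 and 137 percent, 3.2 and 3.6 GHz, and the 6 to 15 percent gains) come from the text.

```python
def percent_speedup(relative_performance: float) -> float:
    """Convert relative performance (baseline = 1.0) to percent speedup.

    Example: a relative performance of 2.22 is a 122 percent speedup.
    """
    return (relative_performance - 1.0) * 100.0


def frequency_gain_percent(old_ghz: float, new_ghz: float) -> float:
    """Percent increase in processor clock frequency."""
    return (new_ghz / old_ghz - 1.0) * 100.0


def scaling_efficiency(perf_gain_percent: float, freq_gain_percent: float) -> float:
    """Hypothetical metric: performance gain divided by clock gain.

    A value near 1.0 means performance tracked the clock increase;
    a value well below 1.0 suggests another bottleneck (for example memory).
    """
    return perf_gain_percent / freq_gain_percent


if __name__ == "__main__":
    # Clock increase for the 90 nm Intel Xeon processor (3.2 GHz to 3.6 GHz).
    freq_gain = frequency_gain_percent(3.2, 3.6)
    print(f"clock increase: {freq_gain:.1f} percent")

    # Performance gains quoted in the text for each query-size group.
    quoted_gains = {
        "small/medium query, low end": 6.0,
        "small/medium query, high end": 11.0,
        "large query, low end": 12.0,
        "large query, high end": 15.0,
    }
    for label, gain in quoted_gains.items():
        ratio = scaling_efficiency(gain, freq_gain)
        print(f"{label}: {gain:.0f} percent gain, efficiency {ratio:.2f}")
```

Read this way, the small and medium queries recover roughly half to most of the clock increase, while the large query tracks it closely, which is consistent with the article's remark that frequency scaling helps most on larger query sizes.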
Getting the Best Performance from an HPC Cluster: A STAR-CD Case Study

High-performance computing (HPC) clusters represent a new era in supercomputing. Because HPC clusters usually comprise standards-based, commodity components, they differ primarily in node properties, interconnect configuration, file system type, and clustering middleware. This article explains how a multipurpose and multidiscipline computational fluid dynamics application can be used to understand how the performance of an HPC application may be affected by system components such as interconnects (bandwidth and latency), file I/O, and symmetric multiprocessing.

BY BARIS GULER; JENWEI HSIEH, PH.D.; RAJIV KAPOOR; LANCE SHULER; AND JOHN BENNINGHOFF

High-performance computing (HPC) clusters can deliver supercomputing-class performance using off-the-shelf, industry-standard components. As a result, HPC clusters differ primarily in their node properties, interconnect configuration, file system type, and clustering middleware. Application performance is highly sensitive to many characteristics of an HPC cluster, such as cache size, processor speed, memory latency and bandwidth, interconnect latency and bandwidth, file I/O, and so forth.¹ For example, administrators can configure a cluster with a local disk, Network File System (NFS), or a parallel file system and achieve very different performance and cost values for different applications.

Determining HPC cluster application performance through Dell and Intel collaboration

To determine the effect of various HPC cluster properties on application performance, from January to May 2004 Dell engineers tested the CD adapco Group STAR-CD, a popular industrial computational fluid dynamics (CFD) application, on an HPC cluster comprising Dell PowerEdge 3250 servers. The STAR-CD solver uses state-of-the-art numerical methodologies to achieve a high level of accuracy for complex unstructured meshes, in both steady and transient simulations.

Dell engineers cross-checked their results from two STAR-CD workloads with tests by the Intel HPC for Independent Software Vendor (HPC ISV) Enabling Team to verify performance and adjust configurations for optimal performance and scaling of STAR-CD. This cross-checking ranged from a comparison of single-node performance to scaling across a large cluster. Differences in benchmark results between Dell and Intel tests exposed performance issues, such as those arising from using a local file system instead of NFS.

¹ For more information about the effects of processor speed and cache size on application performance, see “Understanding the Behavior of Computational Fluid Dynamics Applications on Dell PowerEdge 3250 Servers” by Baris Guler and Rizwan Ali in Dell Power Solutions, October 2004.