HIGH-PERFORMANCE COMPUTING

Specification | PowerEdge 1750 (IA-32) | PowerEdge 1850 (IA-32) | PowerEdge 3250 (IA-64)
CPU | Dual Intel Xeon processors at 3.2 GHz (130 nm) | Dual Intel Xeon processors at 3.2 GHz and at 3.6 GHz (90 nm) | Dual Intel Itanium 2 processors at 1.5 GHz
FSB | 64 bits wide; 533 MHz | 64 bits wide; 800 MHz | 128 bits wide; 400 MHz
Cache size | L2: 512 KB; L3: 1 MB | L2: 1 MB | L2: 256 KB; L3: 6 MB
Memory | Four 1 GB DDR at 266 MHz | Two 2 GB DDR2 at 400 MHz | Four 1 GB DDR at 266 MHz (operating at 200 MHz)
Memory bandwidth | 4.8 GB/sec | 6.4 GB/sec | 6.4 GB/sec
Operating system | 32-bit Red Hat Enterprise Linux 3, Update 2 | 32-bit Red Hat Enterprise Linux 3, Update 2 | Red Hat Enterprise Linux 3 for IA-64

Figure 1. Servers tested in BLAST performance study

technologies like double data rate 2 (DDR2) memory and Peripheral Component Interconnect (PCI) Express are available in Dell servers supporting the 90 nm Intel Xeon processors.

A continuation of the study reported in the October 2004 issue of Dell Power Solutions,[2] the study discussed in this article compared the performance characteristics of Dell PowerEdge server models equipped with 32-bit Intel Xeon processors, the new 90 nm Intel Xeon processors, and 64-bit Itanium 2 processors. The comparison used a scientific application that is widely used in the field of bioinformatics: the basic local alignment search tool (BLAST). The aim of the study was to understand the impact of the different features and technologies available in Intel processors, as well as the impact of memory technology, on the performance of BLAST. To this end, the Dell engineers used the STREAM benchmark to measure the memory system performance of the Dell servers.
Furthermore, the test team studied the effect of processor clock frequency on the performance of BLAST.

Figure 1 lists the servers tested in the Dell study and the processor technology used by each server. The Dell model names are used to avoid ambiguity between the current-generation 130 nm Intel Xeon processor and the more recent 90 nm Intel Xeon processor. Empirical studies have shown that small-scale symmetric multiprocessing (SMP) systems make excellent platforms for building HPC clusters. Thus, all the servers used in this study were two-processor systems.

Comparison of Intel processor–based systems

The processor and memory subsystems used in the test servers had the biggest impact on the performance of BLAST. The following sections discuss the architecture of these two subsystems.

Overview of processor architectures

The Intel NetBurst microarchitecture is the core of Intel's 130 nm technology, on which the 32-bit Intel Xeon processor used in the Dell PowerEdge 1750 server is based. The NetBurst architecture uses a 20-stage pipeline that allows higher core frequencies than were possible in previous-generation processors. The 130 nm Intel Xeon processors were introduced at 1.8 GHz and are currently available at speeds of up to 3.2 GHz. The FSB scales from 400 MHz in the initial 180 nm Intel Xeon processors to 533 MHz on the 3.2 GHz 130 nm processors.

The Intel Xeon processor is a superscalar processor that combines out-of-order speculative execution with register renaming and branch prediction to enhance performance. The processor uses an Execution Trace Cache that stores pre-decoded micro-operations. Streaming SIMD Extensions 2 (SSE2) instructions are used to speed up media workloads.

The Dell PowerEdge 1850 server is the follow-on to the PowerEdge 1750 server and uses the more recent Intel Xeon processor based on 90 nm technology.
The 90 nm Intel Xeon processor is an extension of the 130 nm Intel Xeon processor. However, some architectural differences between these two Intel Xeon processors can affect application performance. The 90 nm Intel Xeon processor is being introduced at a frequency of 2.8 GHz, coupled with a faster 800 MHz FSB. It uses a longer, 31-stage processor pipeline that will facilitate higher frequencies in future versions.

The Dell PowerEdge 3250 system is based on the Itanium processor family, which uses a 64-bit architecture and implements the Explicitly Parallel Instruction Computing (EPIC) architecture. Instructions are issued in parallel in groups, or bundles, depending on the available resources. The Itanium 2 processors differ from the Intel Xeon processors in that they rely on software (the compiler) to exploit parallelism, as opposed to complex hardware that detects and exploits instruction parallelism. The compiler provides the information needed to execute instructions efficiently in parallel.

Itanium 2 processors are available in frequencies ranging from 1.0 GHz to 1.9 GHz and use level 3 (L3) caches of varying sizes, ranging from 1.5 MB to 9 MB. A 128-bit, 400 MHz FSB is used to connect the processors. The Itanium 2 processor has a large number of registers compared to the Intel Xeon processors: 128 64-bit general-purpose registers (GPRs), which the compiler uses to keep the six integer functional units busy.

Computing clusters built from industry-standard components such as Intel processors are becoming the fastest-growing choice for HPC systems.

[2] See “Performance Characterization of BLAST on Intel EM64T Architecture” by Garima Kochhar, Kalyana Chadalavada, Amina Saify, and Rizwan Ali in Dell Power Solutions, October 2004.

Reprinted from Dell Power Solutions, February 2005. Copyright © 2005 Dell Inc. All rights reserved.
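As a rough check on the FSB figures quoted for the three servers, peak bus bandwidth can be estimated as bus width in bytes multiplied by transfer rate. The sketch below is illustrative only: it assumes that the quoted FSB clock (533, 800, or 400 MHz) is the effective transfer rate and that 1 GB/sec means 10^9 bytes per second. The function and dictionary names are hypothetical and are not taken from the study.

```python
def fsb_peak_gb_per_sec(width_bits, rate_mhz):
    """Estimate peak bus bandwidth in GB/sec (10**9 bytes per second).

    Assumption: one transfer of width_bits occurs per cycle at rate_mhz.
    """
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * rate_mhz * 1e6 / 1e9


# FSB width (bits) and rate (MHz) as listed for each server in the article.
SERVERS = {
    "PowerEdge 1750": (64, 533),
    "PowerEdge 1850": (64, 800),
    "PowerEdge 3250": (128, 400),
}

for name, (width, rate) in SERVERS.items():
    peak = fsb_peak_gb_per_sec(width, rate)
    print(f"{name}: {peak:.2f} GB/sec theoretical FSB peak")
```

Under these assumptions, the PowerEdge 1850 and PowerEdge 3250 buses both work out to 6.4 GB/sec and the PowerEdge 1750 bus to roughly 4.3 GB/sec. These are theoretical ceilings, not sustained rates.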
Differences in memory subsystems

The Dell PowerEdge 1750 and Dell PowerEdge 3250 servers use DDR memory rated at 266 MHz (PC2100). In the PowerEdge 3250 server, however, the memory operates at 200 MHz. The Dell PowerEdge 1850 server uses the new DDR2 memory running at 400 MHz (PC2-3200), which has a theoretical bandwidth of 3.2 GB/sec. Like DDR, the DDR2 architecture is based on industry-standard dynamic RAM (DRAM) technology. The DDR2 standard contains several major internal changes that allow improvements in areas such as reliability and power consumption. One of the most important DDR2 features is the ability to prefetch 4 bits of memory at a time, compared to 2 bits in DDR.

DDR2 transfer rates start where current DDR technology ends, at 400 MHz. In the future, DDR2 is expected to support 533 and 667 mega-transfers/sec (MT/sec) to enable memory bandwidths of 4.3 GB/sec and 5.3 GB/sec. Currently, only DDR2 at 400 MHz is available in Dell PowerEdge servers; this is the memory technology used in the PowerEdge 1850 system.

BLAST scaled well with increasing processor frequency on the Intel Xeon processors, especially on larger query sizes.

Components of the test environment

The goal of the Dell study, which was conducted in July 2004, was to evaluate the impact of processor and memory architecture on the performance of BLAST. Three Dell PowerEdge servers, configured similarly in terms of software and compilers, were used. The main difference between the servers was in the processor architecture and, to some extent, in the memory technology.
The use of three servers allowed the test team to compare three processor architectures, including the impact of processor and FSB (system bus) speeds and the influence of memory technology on BLAST performance.

The PowerEdge 1750 and PowerEdge 1850 servers, which use Intel Xeon processors, had Intel Hyper-Threading Technology turned off. The 90 nm Intel Xeon processors used in the PowerEdge 1850 support 64-bit extensions (EM64T). The 64-bit mode of the EM64T-capable Intel Xeon processor was not used because 64-bit operation was not the focus of this study.

Application background and characteristics

The BLAST family of sequence database-search algorithms serves as the foundation for much biological research. The BLAST algorithms search for similarities between a short query sequence and a large, infrequently changing database of DNA or amino acid sequences.

Version 8.0 Intel compilers for 32-bit applications were used to compile BLAST on the PowerEdge 1750 and PowerEdge 1850 servers. For the Itanium-based PowerEdge 3250 server, version 8.0 Intel compilers for 64-bit applications were used.

BLAST was executed on each of four configurations; the PowerEdge 1850 was tested at two processor frequencies, 3.2 GHz and 3.6 GHz. The test used a database of about 2 million sequences, with about 10 billion total letters. For this study, BLAST was executed against single queries of three lengths: 94,000 words; 206,000 words; and 510,000 words. Runs were conducted using both single and dual threads.

Performance evaluation and analysis

Before testing BLAST performance, the test team used the STREAM benchmark to measure memory bandwidth. STREAM measures the real-world bandwidth sustainable from ordinary user programs, as opposed to the theoretical peak bandwidth quoted by vendors.
By running four simple kernels, the benchmark measures traffic all the way from registers to main memory (and vice versa). Because the arrays are much too large to fit in caches, the benchmark measures a mixture of both read and write traffic. STREAM measures programmer-perceived bandwidth, that is, sustained bandwidth rather than raw or peak bandwidth.

Figure 2 shows the memory bandwidth measured using the STREAM benchmark. The PowerEdge 3250 server showed significant improvements over the PowerEdge 1750 server because of its wider system bus (128 bits). Similarly, the PowerEdge 1850 server showed improvements over the PowerEdge 3250 thanks to its faster memory clock speed (400 MHz) as well as a faster FSB (800 MHz).

Next, the performance of BLAST was evaluated using different query sizes and running single and dual threads. The importance of processor frequency, architecture, and memory subsystem design can be determined from the results obtained on the four tested system configurations.

[Bar chart: throughput in MB/sec (axis from 0 to 4000) for the Copy, Scale, Add, and Triad kernels on the PowerEdge 3250 (Intel Itanium processor at 1.5 GHz), the PowerEdge 1850 (Intel Xeon processor at 3.2 GHz), and the PowerEdge 1750 (Intel Xeon processor at 3.2 GHz). Bar labels as extracted, not attributable to a specific server or kernel: 3666, 3282, 3675, 3737, 3391, 3155, 2427, 2431, 2194, 3524, 3646, 2162.]

Figure 2. Sustainable memory bandwidth measured using STREAM benchmark
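The four STREAM kernels and the way sustained bandwidth is derived from them can be sketched as follows. This is a minimal illustration, not the benchmark itself: it assumes 8-byte double-precision elements and the usual STREAM convention of counting 1 MB as 10^6 bytes, and the function names are hypothetical. Plain Python lists are used only to show the access pattern; timing this code would mostly measure interpreter overhead rather than memory bandwidth.

```python
BYTES_PER_ELEMENT = 8  # assumption: double-precision values

# Arrays touched per loop iteration: Copy and Scale read one array and
# write one; Add and Triad read two arrays and write one.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def run_kernels(a, b, c, scalar):
    """Apply the four kernels in order, modifying the lists in place."""
    n = len(a)
    for i in range(n):
        c[i] = a[i]                    # Copy
    for i in range(n):
        b[i] = scalar * c[i]           # Scale
    for i in range(n):
        c[i] = a[i] + b[i]             # Add
    for i in range(n):
        a[i] = b[i] + scalar * c[i]    # Triad


def sustained_mb_per_sec(kernel, n, seconds):
    """Bandwidth in MB/sec for one kernel pass over n elements."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * n
    return bytes_moved / seconds / 1e6


# Hypothetical timing: a Copy pass over 10 million elements taking 2 seconds.
print(sustained_mb_per_sec("Copy", 10_000_000, 2.0))
```

Because Add and Triad touch three arrays per iteration instead of two, the same elapsed time corresponds to 1.5 times the bandwidth of Copy or Scale, which is why results are reported per kernel.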