POWER SOLUTIONS

SYSTEM ARCHITECTURE

Improving Fault Tolerance Using Memory Redundancy and Hot-Plug Actions in Dell PowerEdge Servers

Features that enable redundancy across physical memory can enhance server reliability and help keep critical business applications available 24/7, particularly when combined with hot-plug capabilities designed into the Dell PowerEdge 6850 server. Allowing administrators to replace failing memory devices and add incremental memory upgrades while a server is running can help reduce system downtime and enhance scalability considerably. This article discusses memory redundancy and hot-plug features of PowerEdge 6850 servers.

BY CHANDRA S. MUGUNDA, VIJAY NIJHAWAN, DENNIS STULTZ, SAURABH GUPTA, AND HARISH JAYAKUMAR

Related Categories: Dell PowerEdge servers; Memory; Multiprocessor (MP); System architecture
Visit www.dell.com/powersolutions for the complete category index.

RAID technology helps provide redundancy, fault tolerance, and high availability in enterprise disk drive subsystems. Different RAID levels, such as RAID-0, RAID-1, RAID-5, and RAID-10, can be configured to enable secondary storage benefits that suit specific long-term organizational requirements. Currently, the low cost of physical memory allows servers to support 32 GB to 64 GB of server RAM cost-effectively in dual in-line memory modules (DIMMs).

Unfortunately, the requirement for large memory capacities can increase the chance of memory errors simply because physical memory devices have the potential for failure.
Thus, the more memory that resides in a system, the greater the potential for memory failure in that system. Fault tolerance, redundancy, and hot-plug capabilities can therefore be crucial for IT environments that must safeguard against memory failures and help ensure uninterrupted application availability.

Hard and soft memory errors

Memory is an electronic storage device, and electronic storage devices have the potential to incorrectly return information; that is, data read from memory can be different from what was originally written to memory. Dynamic RAM (DRAM) stores 1s and 0s as charges on small capacitors residing on the DIMM, which must be continually refreshed to help ensure that the data is not lost. But even a small electrical disturbance near the memory cell can alter the charge in a capacitor, thus causing a memory error.

Typically, two types of error occur in a memory system. The first is called a repeatable, or hard, error.

Reprinted from Dell Power Solutions, August 2005. Copyright © 2005 Dell Inc. All rights reserved.
Hard errors consistently return incorrect results and usually indicate that a piece of hardware is broken. For example, a bit may be stuck such that it always returns a 0 regardless of whether a 1 or a 0 is written to it. Hard errors are relatively easy to diagnose and fix because they are consistent and repeatable.

The second kind of error is called a transient, or soft, error. Soft errors occur when a bit reads back the wrong value once but subsequently functions correctly. Soft errors are more common than hard errors and are also more difficult to diagnose. They are not caused by circuit problems and, once corrected, do not recur.

Memory error detection and correction

Reliability in memory starts at the DIMM level. To help ensure a reliable memory system in servers, it is essential to provide protection from both hard and soft memory errors. Memory error detection or correction schemes such as parity checking or error-correcting code (ECC) are designed to provide protection from hard and soft memory errors.

Parity checking is one of the oldest and most basic forms of memory checking. It is a simple method of detecting single-bit errors in a memory system. Along with the eight bits of data stored in memory, parity checking uses one additional bit to record the parity of the byte, odd or even. However, parity checking detects only errors that flip an odd number of bits (a single-bit error being the common case), and it does not enable administrators to locate and correct these errors. When a parity error is detected, the parity circuit generates what is called a nonmaskable interrupt (NMI), which is generally used to instruct the processor to halt immediately. The processor is halted to help ensure that the incorrect memory does not corrupt other data on the system.

ECC, on the other hand, not only detects both single-bit and multibit errors, but it also corrects single-bit errors. Each time data is stored in memory, ECC memory uses an algorithm to add a block of bits known as check bits.
When this data is retrieved, the sum of the check bits (the checksum) is recomputed. The checksum of the written data is then compared with the checksum of the read data to determine whether any of the data bits are corrupted. If the checksums are identical, this indicates to the ECC memory that there is no error. If they are different, the data contains one or more errors. The ECC memory logic then isolates the errors and reports them to the system. For a correctable single-bit error, the ECC memory logic corrects the error and outputs the corrected data without halting the system. For multibit errors (that is, errors involving two or more bits), ECC memory is capable of detecting but not correcting the errors. When multibit errors occur, ECC handles them by generating an NMI that instructs the system to halt.

As memory capacity increases, the number of soft errors will rise. Typically, a percentage of soft errors are multibit errors, which ECC cannot correct. As a result, administrators should expect the potential for failure in ECC systems to increase as memory is increased.

Memory redundancy options in PowerEdge 6850 servers

To help provide server fault tolerance, maximize memory capacity, and enhance reliability, Dell server hardware is designed with memory redundancy options that can help improve the performance and uptime of servers in memory error situations. The Dell PowerEdge 6850 memory subsystem resides on up to four memory riser boards, or cards, supporting a system maximum of 64 GB when 4 GB DIMMs are installed. Each riser has four double data rate 2 (DDR2) slots arranged logically as two banks and one memory bridge (see Figure 1).

Figure 1. Dell PowerEdge 6850 planar block diagram (labels: CPU 1–4, Northbridge, Memory bridge 1–4, Riser 1–4, DIMMs 1–16)

The PowerEdge 6850 offers three memory redundancy options, which are set in the BIOS: spare-bank memory, memory mirroring, and memory RAID.
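The difference between parity checking and ECC described in the error detection and correction section can be illustrated with a small sketch. It assumes a textbook Hamming code with one extra overall parity bit (often called SECDED: single-error correction, double-error detection) applied to a 4-bit word. The word size, the bit layout, and the function names are illustrative assumptions, not the actual ECC logic of the PowerEdge 6850; real server ECC works on much wider words.

```python
def parity_bit(bits):
    """Even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2


def parity_ok(bits, stored_parity):
    """True if the stored parity bit still matches the data."""
    return parity_bit(bits) == stored_parity


def secded_encode(data):
    """Encode 4 data bits into an 8-bit word: [p0, p1, p2, d1, p4, d2, d3, d4].

    p1, p2, p4 are Hamming check bits at positions 1, 2, 4;
    p0 is an overall parity bit over the other seven bits.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p4, d2, d3, d4]  # Hamming positions 1..7
    return [sum(word) % 2] + word


def secded_decode(codeword):
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(codeword[1:])
    # The syndrome is the XOR of the positions of all set bits;
    # it is 0 for a valid word and equals the position of a single flipped bit.
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    overall_ok = sum(codeword) % 2 == 0

    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means only the overall parity bit flipped.
        if syndrome:
            word[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with good overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [word[2], word[4], word[5], word[6]]


if __name__ == "__main__":
    stored = secded_encode([1, 0, 1, 1])
    damaged = list(stored)
    damaged[5] ^= 1  # simulate a single-bit soft error
    print(secded_decode(damaged))
```

In this sketch a single flipped bit is repaired and the read continues, while two flipped bits are reported as uncorrectable, which corresponds to the case where the article says the system raises an NMI and halts. Plain parity, by contrast, misses any error that flips an even number of bits.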
Organizations select these options when ordering the PowerEdge 6850. They may also opt not to use these three options. To disable these options, organizations should select the Redundancy Disabled option. Figure 2 shows BIOS options for memory redundancy modes.

Spare-bank memory

Within a riser board, memory can be set as a spare bank. Once sparing is enabled, when the rate of correctable errors on a failing DIMM reaches a preset threshold, which the administrator sets in the BIOS, the DIMM's contents are copied to the spare bank. While the copy is in progress, all reads from the