
POWER SOLUTIONS



SYSTEM ARCHITECTURE

Hard errors consistently return incorrect results and usually indicate that a piece of hardware is broken. For example, a bit may be stuck such that it always returns a 0 regardless of whether a 1 or a 0 is written to it. Hard errors are relatively easy to diagnose and fix because they are consistent and repeatable.

The second kind of error is called a transient, or soft, error. Soft errors occur when a bit reads back the wrong value once but subsequently functions correctly. Soft errors are more common than hard errors and are also more difficult to diagnose. They are not caused by circuit problems and, once corrected, do not recur.

Memory error detection and correction

Reliability in memory starts at the DIMM level. To help ensure a reliable memory system in servers, it is essential to provide protection from both hard and soft memory errors. Memory error detection or correction protocols such as parity checking or error-correcting code (ECC) are designed to provide this protection.

Parity checking is one of the oldest and most basic forms of memory checking. It is a simple method of detecting single-bit errors in a memory system. Along with the eight bits of data stored in memory, parity checking uses one additional bit to record the parity of the byte, odd or even. However, parity checking detects only errors that flip an odd number of bits, such as a single-bit error; an error that flips an even number of bits goes unnoticed. Parity checking also does not enable administrators to locate and correct the errors it detects. When a parity error is detected, the parity circuit generates what is called a nonmaskable interrupt (NMI), which is generally used to instruct the processor to halt immediately. The processor is halted to help ensure that the incorrect memory does not corrupt other data on the system.

ECC, on the other hand, not only detects both single-bit and multibit errors, but it also corrects single-bit errors. Each time data is stored in memory, ECC memory uses an algorithm to add a block of bits known as check bits.
When this data is retrieved, the sum of the check bits (the checksum) is recomputed. The checksum of the written data is then compared with the checksum of the read data to determine whether any of the data bits are corrupted. If the checksums are identical, this indicates to the ECC memory that there is no error. If they are different, the data contains one or more errors. The ECC memory logic then isolates the errors and reports them to the system. For a correctable single-bit error, the ECC memory logic corrects the error and outputs the corrected data without halting the system. For multibit errors (that is, errors involving two or more bits), ECC memory is capable of detecting but not correcting the errors. When multibit errors occur, ECC handles them by generating an NMI that instructs the system to halt.

As memory capacity increases, the number of soft errors will rise. Typically, a percentage of soft errors are multibit errors, which ECC cannot correct. As a result, administrators should expect the potential for failure in ECC systems to increase as memory is increased.

Memory redundancy options in PowerEdge 6850 servers

To help provide server fault tolerance, maximize memory capacity, and enhance reliability, Dell server hardware is designed with memory redundancy options that can help improve the performance and uptime of servers in memory error situations. The Dell PowerEdge 6850 memory subsystem resides on up to four memory riser boards, or cards, supporting a system maximum of 64 GB when 4 GB DIMMs are installed (4 risers × 4 slots × 4 GB). Each riser has four double data rate 2 (DDR2) slots arranged logically as two banks and one memory bridge (see Figure 1).

[Figure 1. Dell PowerEdge 6850 planar block diagram: CPU1 through CPU4, northbridge, memory bridges 1–4, risers 1–4, DIMMs 1–16]

The PowerEdge 6850 offers three memory redundancy options, which are set in the BIOS: spare-bank memory, memory mirroring, and memory RAID.
Organizations select these options when ordering the PowerEdge 6850. Organizations that opt not to use any of the three should select the Redundancy Disabled option. Figure 2 shows BIOS options for memory redundancy modes.

Spare-bank memory

Within a riser board, memory can be set as a spare bank. Once sparing is enabled, when the error rate (that is, the rate of correctable errors) of a failing DIMM reaches a preset threshold (set by the administrator in the BIOS), the DIMM's contents are copied to the spare bank. When the copy is in progress, all reads from the

www.dell.com/powersolutions Reprinted from Dell Power Solutions, August 2005. Copyright © 2005 Dell Inc. All rights reserved.
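
The difference between parity checking and ECC described earlier on this page can be illustrated with a small sketch. The article does not say which code the PowerEdge 6850 memory controller uses, so the example below is an assumption: it uses a textbook Hamming(7,4) code plus one overall parity bit (a single-error-correcting, double-error-detecting code) on 4-bit values, whereas server memory protects much wider words. All function names are hypothetical and chosen only for illustration.

```python
def parity_bit(byte):
    """Even-parity bit for 8 data bits: 1 when the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, stored_parity):
    """True when the byte read back still matches its stored parity bit."""
    return parity_bit(byte) == stored_parity


def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (illustrative code, not Dell's).

    Index 0 holds the overall parity bit; indices 1-7 are Hamming positions,
    with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def secded_decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single-bit error: syndrome is its index
        status = "corrected"
    else:
        # Nonzero syndrome with intact overall parity: two bits flipped.
        # A real system would raise an NMI here instead of returning.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    data = 0b10110010
    p = parity_bit(data)
    print("one flipped bit caught by parity: ", not parity_ok(data ^ 0b1, p))
    print("two flipped bits caught by parity:", not parity_ok(data ^ 0b11, p))

    word = secded_encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(secded_decode(word), secded_decode(one_flip), secded_decode(two_flips))
```

In this sketch a single flipped bit is repaired and the data is returned, while two flipped bits are only reported, which mirrors the correct-or-halt behavior the article attributes to ECC memory.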
