11.01.2013 Views

IBM AIX Continuous Availability Features - IBM Redbooks


mirrored volume group on the client. This achieves a potentially greater level of reliability by providing redundancy. Client volume group mirroring is also required when a VIOS logical volume is used as a virtual SCSI device on the client. In this case, the virtual SCSI devices are associated with different SCSI disks, each controlled by one of the two VIOS.

As an example of network high availability, Shared Ethernet Adapter (SEA) failover offers Ethernet redundancy to the client at the virtual level. The client gets one standard virtual Ethernet adapter hosted by two Virtual I/O Servers (VIOS). The two VIOS use a control channel to determine which of them is supplying the Ethernet service to the client. Through this active monitoring between the two VIOS, failure of either will result in the remaining VIOS taking control of the Ethernet service for the client. The client has no special protocol or software configured, and uses the virtual Ethernet adapter as though it were hosted by only one VIOS.
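
The selection behaviour described above can be sketched as a toy model. This is only an illustration, not VIOS code: the `Vios` class, the `priority` field (lower value preferred) and the `healthy` flag are assumptions made for the sketch, and the control channel is reduced to a shared view of each server's health.

```python
from dataclasses import dataclass


@dataclass
class Vios:
    name: str
    priority: int         # assumption of this sketch: lower value is preferred
    healthy: bool = True  # stands in for what the control channel reports


def active_vios(servers):
    """Return the healthy server with the lowest priority value, or None."""
    candidates = [s for s in servers if s.healthy]
    return min(candidates, key=lambda s: s.priority, default=None)


vios1 = Vios("vios1", priority=1)
vios2 = Vios("vios2", priority=2)
pair = [vios1, vios2]

before = active_vios(pair).name  # vios1 supplies the Ethernet service
vios1.healthy = False            # simulate a failure of the active VIOS
after = active_vios(pair).name   # the remaining VIOS takes over
```

The client never appears in the model, which mirrors the point made in the text: the failover decision is taken entirely between the two VIOS.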

2.1.12 Special Uncorrectable Error handling<br />

Although a rare occurrence, an uncorrectable data error can occur in memory or a cache, despite all precautions built into the server. In older generations of servers (prior to IBM POWER4 processor-based offerings), this type of error would eventually result in a system crash. The IBM System p and System i offerings extend the POWER4 technology design and include techniques for handling these types of errors.

On these servers, when an uncorrectable error is identified at one of the many checkers strategically deployed throughout the system's central electronic complex, the detecting hardware modifies the ECC word associated with the data, creating a special ECC code. This code indicates that an uncorrectable error has been identified at the data source and that the data in the “standard” ECC word is no longer valid. The check hardware also signals the Service Processor and identifies the source of the error. The Service Processor then takes appropriate action to handle the error. This technique is called Special Uncorrectable Error (SUE) handling.

Simply detecting an error does not automatically cause termination of a system or partition. In many cases, an uncorrectable error will cause generation of a synchronous machine check interrupt. The machine check interrupt occurs when a processor tries to load the bad data. The firmware provides a pointer to the instruction that referred to the corrupt data, the system continues to operate normally, and the hardware observes the use of the data.
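
The deferred nature of SUE handling (mark the data at detection time, raise a machine check only when the data is loaded) can be illustrated with a small software model. This is a sketch under stated assumptions, not a description of the hardware: `Memory`, `SUE_MARK`, `MachineCheck` and `service_log` are hypothetical names, and the special ECC code is reduced to a string tag.

```python
from dataclasses import dataclass, field

SUE_MARK = "SUE"  # stands in for the special ECC code (hypothetical name)


class MachineCheck(Exception):
    """Raised when a load touches a word that carries the SUE mark."""


@dataclass
class Memory:
    words: dict = field(default_factory=dict)        # address -> (data, ecc_state)
    service_log: list = field(default_factory=list)  # reports to the service processor

    def store(self, addr, data):
        # A store writes the data together with a normal ECC state.
        self.words[addr] = (data, "OK")

    def flag_uncorrectable(self, addr, source):
        # Detection only marks the word and reports the source; nothing halts.
        data, _ = self.words[addr]
        self.words[addr] = (data, SUE_MARK)
        self.service_log.append((addr, source))

    def load(self, addr):
        data, ecc = self.words[addr]
        if ecc == SUE_MARK:
            raise MachineCheck(addr)  # the error surfaces only on use
        return data


mem = Memory()
mem.store(0x10, 42)
mem.flag_uncorrectable(0x10, source="L2 cache")  # detection: mark and report

try:
    mem.load(0x10)
    outcome = "loaded"
except MachineCheck:
    outcome = "machine check"
```

In this model, flagging the word changes nothing for the rest of the system; only the later `load` of that specific address fails.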

The system is designed to mitigate the problem using a number of approaches. For example, if the data is never actually used but is simply overwritten, then the error condition can safely be voided and the system will continue to operate normally.

For AIX V5.2 or greater, if the data is actually referenced for use by a process, then the operating system is informed of the error. The operating system will terminate only the specific user process associated with the corrupt data.
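
The two outcomes just described (corrupt data overwritten before use, or referenced by a user process) can be summarized in a short sketch. The event tuples, the process table and the `handle_sue` function are all hypothetical and exist only to make the containment property visible: at most one process is affected.

```python
def handle_sue(event, processes):
    """Return an updated copy of the process table for one SUE event.

    event is ("overwritten", None) when the bad data was replaced before
    use, or ("referenced", pid) when a user process consumed it.
    """
    kind, pid = event
    updated = dict(processes)
    if kind == "overwritten":
        # The error condition is voided; no process is affected.
        return updated
    if kind == "referenced":
        # Only the process associated with the corrupt data is terminated.
        updated[pid] = "terminated"
        return updated
    raise ValueError(f"unknown event kind: {kind}")


procs = {101: "running", 102: "running"}
after_overwrite = handle_sue(("overwritten", None), procs)
after_reference = handle_sue(("referenced", 102), procs)
```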

New with AIX V6.1

The POWER6 processor adds the ability to report the faulting memory address on an SUE machine check. This hardware characteristic, combined with the AIX V6.1 recovery framework, expands the cases in which AIX will recover from an SUE to include some instances when the error occurs in kernel mode.

Specifically, if an SUE occurs inside one of the copyin() and copyout() family of kernel services, these functions will return an error code and allow the system to continue operating (in contrast, on a POWER4 or POWER5 system, AIX would crash). The new SUE feature integrates the kernel mode handling of SUEs with the FRR recovery framework.
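
As a rough illustration of the difference between crashing and returning an error code, the following sketch models a copy routine that hits poisoned data partway through. It is not the AIX implementation: `copyin_model`, `raw_copy` and `SueMachineCheck` are hypothetical, `SUE_ERROR` is a placeholder rather than the value AIX returns, and a Python `except` clause merely stands in for a recovery routine.

```python
SUE_ERROR = 1  # placeholder non-zero return code, not the real AIX value


class SueMachineCheck(Exception):
    """Stands in for an SUE machine check that reports the faulting address."""

    def __init__(self, address):
        super().__init__(f"SUE at offset {address}")
        self.address = address


def raw_copy(src, poisoned_offsets, length):
    """Copy length bytes, faulting on the first poisoned offset touched."""
    out = bytearray()
    for offset in range(length):
        if offset in poisoned_offsets:
            raise SueMachineCheck(offset)
        out.append(src[offset])
    return bytes(out)


def copyin_model(src, poisoned_offsets, length):
    """Return (rc, data): rc is 0 on success, SUE_ERROR if an SUE was hit."""
    try:
        return 0, raw_copy(src, poisoned_offsets, length)
    except SueMachineCheck:
        # Recovery path: report failure to the caller instead of crashing.
        return SUE_ERROR, b""


ok = copyin_model(b"abcd", set(), 4)   # clean copy
bad = copyin_model(b"abcd", {2}, 4)    # offset 2 carries an SUE
```

A caller in this model checks the return code as it would for any other copy failure, so the error stays confined to the operation that touched the bad data.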
