11.01.2013 Views

IBM AIX Continuous Availability Features - IBM Redbooks


2.2.1 Error checking

Run-Time Error Checking

The Run-Time Error Checking (RTEC) facility gives service personnel a way to manipulate debug capabilities that are already built into product binaries. RTEC provides powerful first failure data capture and second failure data capture error detection features. The basic RTEC framework was introduced in AIX V5.3 TL3 and has since been extended with additional features, including the Consistency Checker and Xmalloc Debug. Features are generally tunable with the errctrl command.

Some features also have attributes or commands specific to a given subsystem, such as the sodebug command associated with new socket debugging capabilities. The enhanced socket debugging facilities are described in the AIX publications, which can be found online at the following site:

http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp

Kernel stack overflow detection

Beginning with the AIX V5.3 TL5 package, the kernel provides enhanced logic to detect stack overflows. All running AIX code maintains an area of memory called a stack, which stores data needed to execute the code. As the code runs, this stack grows and shrinks. A stack can grow beyond its maximum size and overwrite other data, and the resulting problems can be difficult to service. AIX V5.3 TL5 introduces an asynchronous run-time error checking capability that checks whether certain kernel stacks have overflowed. The default action upon overflow detection is to log an entry in the AIX error log. The stack overflow run-time error checking feature is controlled by the ml.stack_overflow component.

AIX V6.1 improves kernel stack overflow detection so that some stacks are guarded with a synchronous overflow detection capability. Additionally, when the recovery framework is enabled, some kernel stack overflows that previously were fatal are now fully recoverable.
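
The idea of asynchronous overflow detection can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not how the AIX kernel implements the check: the guard pattern, the `SimulatedStack` class, and the `check_stacks` function are hypothetical names invented for illustration. Each stack has a guard region below its usable area, and a separate sweep inspects the guards after the fact and logs an entry when one has been overwritten.

```python
GUARD = b"\xde\xad\xbe\xef" * 4  # hypothetical 16-byte guard pattern


class SimulatedStack:
    """A fixed-size stack that grows downward toward a guard region."""

    def __init__(self, size):
        # Layout: [guard region][usable stack bytes]
        self.memory = bytearray(GUARD) + bytearray(size)
        self.sp = len(self.memory)  # stack pointer starts at the top

    def push(self, data):
        # Deliberately unchecked: an oversized push spills into the
        # guard region instead of raising an error at the time of the write.
        new_sp = max(self.sp - len(data), 0)
        written = self.sp - new_sp
        self.memory[new_sp:self.sp] = data[len(data) - written:]
        self.sp = new_sp

    def guard_intact(self):
        return bytes(self.memory[:len(GUARD)]) == GUARD


def check_stacks(stacks, error_log):
    """Sweep all stacks after the fact and log any damaged guard."""
    for name, stack in stacks.items():
        if not stack.guard_intact():
            error_log.append(f"stack overflow detected: {name}")
    return error_log


stacks = {"thread_a": SimulatedStack(64), "thread_b": SimulatedStack(64)}
stacks["thread_a"].push(b"x" * 48)  # fits within the 64 usable bytes
stacks["thread_b"].push(b"y" * 72)  # runs 8 bytes into the guard region
log = check_stacks(stacks, [])
```

In this model the overflow in `thread_b` goes unnoticed at the time of the write and is found only by the later sweep, which is what makes the check asynchronous. A synchronous variant would instead detect the problem at the moment the stack pointer crosses into the guard region.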

Kernel no-execute protection

Also introduced in the AIX V5.3 TL5 package, no-execute protection is set for various kernel data areas that should never be treated as executable code. This exploits the page-level execution enable/disable hardware feature. The benefit is immediate detection if erroneous device driver or kernel code inadvertently makes a stray branch onto one of these pages. Previously, such a branch would likely have led to a crash, but the behavior was undefined.

This enhancement improves kernel reliability and serviceability by catching attempts to execute invalid addresses immediately, before they have a chance to cause further damage or create a difficult-to-debug secondary failure. This feature is largely transparent to the user, because most of the data areas being protected should clearly be non-executable.
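
A minimal model of page-level execute protection may help make the benefit concrete. The sketch below is a simulation, not AIX or hardware code: `Page`, `ExecuteFault`, `fetch_instruction`, and the 4096-byte page size are all assumptions chosen for illustration. Each page carries an execute permission, and an instruction fetch from a page without it faults at once, identifying the offending address.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # illustrative page size, not taken from the source


class ExecuteFault(Exception):
    """Raised when an instruction fetch targets a no-execute page."""


@dataclass
class Page:
    name: str
    executable: bool


def fetch_instruction(pages, address):
    """Model an instruction fetch: fault at once on a no-execute page."""
    page = pages[address // PAGE_SIZE]
    if not page.executable:
        raise ExecuteFault(f"execute attempt at {address:#x} in {page.name}")
    return page.name


pages = [Page("kernel_text", True), Page("kernel_data", False)]

fetch_instruction(pages, 0x0100)      # legitimate branch into code
try:
    fetch_instruction(pages, 0x1008)  # stray branch into a data page
    fault = None
except ExecuteFault as exc:
    fault = str(exc)
```

The fault message names the address of the stray branch itself, rather than some later failure caused by executing data as if it were code.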

2.2.2 Extended Error Handling

In 2001, IBM introduced a methodology that uses a combination of system firmware and Extended Error Handling (EEH) device drivers to allow recovery from intermittent PCI bus errors. This approach works by recovering and resetting the adapter, thereby initiating system recovery for a permanent PCI bus error. Rather than failing immediately, the faulty device is “frozen” and restarted, preventing a machine check. POWER6 technology extends this capability to PCIe bus errors.
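
The freeze-and-restart behavior can be pictured as a small state machine. The following is a conceptual sketch only: the real mechanism involves system firmware and device driver cooperation that is not modeled here, and `SlotState`, `recover_slot`, and the `max_resets` limit are hypothetical names and values introduced for illustration.

```python
from enum import Enum


class SlotState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"
    FAILED = "failed"


def recover_slot(try_reset, max_resets=3):
    """Freeze a slot on a bus error, then attempt to reset the adapter.

    try_reset is a caller-supplied function that returns True when a
    reset brings the adapter back. Returns the final state and the
    sequence of states the slot passed through.
    """
    history = [SlotState.FROZEN]      # the device is isolated first
    for _ in range(max_resets):
        if try_reset():
            history.append(SlotState.NORMAL)
            return SlotState.NORMAL, history
    history.append(SlotState.FAILED)  # treated as a permanent error
    return SlotState.FAILED, history


# Intermittent error: the first reset fails, the second succeeds.
outcomes = iter([False, True])
state, history = recover_slot(lambda: next(outcomes))

# Permanent error: every reset attempt fails.
dead_state, _ = recover_slot(lambda: False)
```

In the intermittent case the slot returns to normal operation after a successful reset. In the permanent case the slot ends in the failed state once the reset attempts are exhausted, so the error is contained to that device.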
