IBM AIX Continuous Availability Features - IBM Redbooks
2.2.1 Error checking<br />
Run-Time Error Checking<br />
The Run-Time Error Checking (RTEC) facility provides service personnel with a method to<br />
manipulate debug capabilities that are already built into product binaries. RTEC provides<br />
powerful first failure data capture and second failure data capture error detection features.<br />
The basic RTEC framework was introduced in <strong>AIX</strong> V5.3 TL3 and has since been extended with<br />
additional features, including the Consistency Checker and Xmalloc Debug. These features<br />
are generally tunable with the errctrl command.<br />
Some features also have attributes or commands specific to a given subsystem, such as the<br />
sodebug command associated with new socket debugging capabilities. The enhanced socket<br />
debugging facilities are described in the <strong>AIX</strong> publications, which can be found online at the<br />
following site:<br />
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp<br />
Kernel stack overflow detection<br />
Beginning with the <strong>AIX</strong> V5.3 TL5 package, the kernel provides enhanced logic to detect stack<br />
overflows. All running <strong>AIX</strong> code maintains an area of memory called a stack, which is used to<br />
store data necessary for the execution of the code. As the code runs, this stack grows and<br />
shrinks. It is possible for a stack to grow beyond its maximum size and overwrite other data.<br />
These problems can be difficult to service. <strong>AIX</strong> V5.3 TL5 introduces an asynchronous<br />
run-time error checking capability that examines whether certain kernel stacks have overflowed. The<br />
default action upon overflow detection is to log an entry in the <strong>AIX</strong> error log. The stack<br />
overflow run-time error checking feature is controlled by the ml.stack_overflow component.<br />
<strong>AIX</strong> V6.1 improves kernel stack overflow detection so that some stacks are guarded with a<br />
synchronous overflow detection capability. Additionally, when the recovery framework is<br />
enabled, some kernel stack overflows that previously were fatal are now fully recoverable.<br />
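The difference between asynchronous and synchronous overflow detection can be illustrated with a small simulation. This is a minimal sketch and not the AIX implementation: the SimStack class, the guard-region layout, and the GUARD_BYTE and GUARD_SIZE values are all hypothetical, chosen only to show that an asynchronous check finds damage after the fact, while a synchronous guard traps the offending write itself.

```python
GUARD_BYTE = 0xA5   # hypothetical fill pattern for the guard region
GUARD_SIZE = 8      # hypothetical number of bytes reserved below the stack


class SimStack:
    """Toy downward-growing stack with a guard region at its low end."""

    def __init__(self, size, synchronous=False):
        self.mem = bytearray([GUARD_BYTE] * GUARD_SIZE) + bytearray(size)
        self.sp = len(self.mem)          # stack pointer starts at the top
        self.synchronous = synchronous

    def push(self, data):
        new_sp = self.sp - len(data)
        if new_sp < 0:
            raise MemoryError("write falls outside the simulated memory")
        if self.synchronous and new_sp < GUARD_SIZE:
            # Synchronous detection: the offending write is trapped
            # before it can damage anything.
            raise OverflowError("stack overflow trapped at time of write")
        self.mem[new_sp:self.sp] = data
        self.sp = new_sp

    def guard_intact(self):
        return all(b == GUARD_BYTE for b in self.mem[:GUARD_SIZE])


def async_check(stacks, error_log):
    """Asynchronous detection: scan the guards later and log any damage."""
    for name, stack in stacks.items():
        if not stack.guard_intact():
            error_log.append(f"stack overflow detected: {name}")
    return error_log


def demo():
    stacks = {"good": SimStack(16), "bad": SimStack(16)}
    stacks["good"].push(b"\x01" * 16)   # exactly fills the usable area
    stacks["bad"].push(b"\x01" * 20)    # spills 4 bytes into the guard
    return async_check(stacks, [])
```

In this sketch, demo() returns a one-entry log naming the "bad" stack, which mirrors the default action of logging an error. A stack created with synchronous=True raises at the moment of the overflowing push and leaves the guard region untouched.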
Kernel no-execute protection<br />
Also introduced in the <strong>AIX</strong> V5.3 TL5 package, no-execute protection is set for various kernel<br />
data areas that should never be treated as executable code. This exploits the page-level<br />
execution enable/disable hardware feature. The benefit is immediate detection if erroneous<br />
device driver or kernel code inadvertently makes a stray branch onto one of these pages.<br />
Previously, the behavior in this case was undefined, although it would likely have led to a crash.<br />
This enhancement improves kernel reliability and serviceability by catching attempts to<br />
execute invalid addresses immediately, before they have a chance to cause further damage<br />
or create a difficult-to-debug secondary failure. This feature is largely transparent to the user,<br />
because most of the data areas being protected should clearly be non-executable.<br />
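The effect of a page-level execute permission can be modeled in a few lines. The following is an illustrative simulation only: SimMemory, ExecuteFault, and the 4096-byte page size are assumptions made for the example, and real hardware enforces the check in the memory management unit rather than in software.

```python
PAGE_SIZE = 4096  # assumed page size for the simulation


class ExecuteFault(Exception):
    """Raised when control transfers to a page marked no-execute."""


class SimMemory:
    """Toy page table that records one execute bit per mapped page."""

    def __init__(self):
        self.executable = {}  # page number -> bool

    def map_page(self, addr, executable):
        self.executable[addr // PAGE_SIZE] = executable

    def branch(self, target):
        # The permission check happens on the branch itself, so a stray
        # jump into data is caught before any "instruction" there runs.
        if not self.executable.get(target // PAGE_SIZE, False):
            raise ExecuteFault(f"branch to non-executable address {target:#x}")
        return target


def demo():
    mem = SimMemory()
    mem.map_page(0x1000, executable=True)    # code page
    mem.map_page(0x2000, executable=False)   # data page
    results = [mem.branch(0x1004)]           # legitimate branch succeeds
    try:
        mem.branch(0x2010)                   # stray branch into data
    except ExecuteFault as fault:
        results.append(str(fault))
    return results
```

The point of the sketch is the ordering: the fault is raised at the branch, so the failure is reported at its source instead of surfacing later as a secondary problem.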
2.2.2 Extended Error Handling<br />
In 2001, <strong>IBM</strong> introduced a methodology that combines system firmware with<br />
Extended Error Handling (EEH) device drivers to allow recovery from intermittent PCI bus<br />
errors. This approach works by recovering and resetting the adapter, thereby initiating system<br />
recovery for a permanent PCI bus error. Rather than failing immediately, the faulty device is<br />
“frozen” and restarted, preventing a machine check. POWER6 technology extends this<br />
capability to PCIe bus errors.<br />
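The freeze-and-restart behavior can be pictured as a small state machine. This is a hedged sketch, not the firmware or driver interface: the SlotState names, the SimSlot class, the MAX_RESETS limit, and the reset_succeeds callback are hypothetical and exist only to contrast an intermittent error, which clears after a reset, with a permanent one, which does not.

```python
from enum import Enum


class SlotState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"   # I/O to and from the slot is blocked
    FAILED = "failed"   # resets exhausted; treated as a permanent error


class SimSlot:
    MAX_RESETS = 3  # hypothetical retry limit

    def __init__(self, reset_succeeds):
        self.state = SlotState.NORMAL
        self.resets = 0
        # reset_succeeds(attempt) -> bool stands in for the real adapter.
        self.reset_succeeds = reset_succeeds

    def report_bus_error(self):
        # Freeze the slot instead of failing the whole system.
        self.state = SlotState.FROZEN

    def recover(self):
        while self.state is SlotState.FROZEN and self.resets < self.MAX_RESETS:
            self.resets += 1
            if self.reset_succeeds(self.resets):
                self.state = SlotState.NORMAL
        if self.state is SlotState.FROZEN:
            self.state = SlotState.FAILED
        return self.state


def demo():
    intermittent = SimSlot(lambda attempt: attempt >= 2)
    permanent = SimSlot(lambda attempt: False)
    outcomes = []
    for slot in (intermittent, permanent):
        slot.report_bus_error()
        outcomes.append((slot.recover(), slot.resets))
    return outcomes
```

In the sketch, the intermittent case returns to NORMAL on the second reset, while the permanent case ends in FAILED after the retry limit, at which point a real system would hand off to whatever higher-level recovery applies.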