IBM AIX Continuous Availability Features - IBM Redbooks
eases the burden placed on system administrators and IT support staff, but also enables<br />
rapid and precise collection of problem data.<br />
On the software side, serviceability is the ability to diagnose and correct or recover from an<br />
error when it occurs. The most significant serviceability capabilities and enablers in <strong>AIX</strong> are<br />
referred to as the software service aids. The primary software service aids are error logging,<br />
system dump, and tracing.<br />
With the advent of next-generation UNIX servers from <strong>IBM</strong>, many hardware reliability-,<br />
availability-, and serviceability-related features, such as memory error detection, LPARs,<br />
and hardware sensors, have been implemented. These features are supported by the<br />
relevant software in <strong>AIX</strong>. These capabilities continue to establish <strong>AIX</strong> as the best UNIX<br />
operating system.<br />
1.7 First Failure Data Capture<br />
<strong>IBM</strong> has implemented a server design that builds in thousands of hardware error checker<br />
stations that capture and help to identify error conditions within the server. The <strong>IBM</strong> System p<br />
p5-595 server, for example, includes almost 80,000 checkers to help capture and identify<br />
error conditions. These are stored in over 29,000 Fault Isolation Register bits. Each of these<br />
checkers is viewed as a “diagnostic probe” into the server, and, when coupled with extensive<br />
diagnostic firmware routines, allows quick and accurate assessment of hardware error<br />
conditions at run-time.<br />
Integrated hardware error detection and fault isolation is a key component of the System p<br />
and System i platform design strategy. It is for this reason that in 1997, <strong>IBM</strong> introduced First<br />
Failure Data Capture (FFDC) for <strong>IBM</strong> POWER servers. FFDC plays a critical role in<br />
delivering servers that can self-diagnose and self-heal. The system effectively traps hardware<br />
errors at system run time.<br />
FFDC is a technique which ensures that when a fault is detected in a system (through error<br />
checkers or other types of detection methods), the root cause of the fault will be captured<br />
without the need to recreate the problem or run any sort of extended tracing or diagnostics<br />
program. For the vast majority of faults, an effective FFDC design means that the root cause<br />
can also be detected automatically without servicer intervention. The pertinent error data<br />
related to the fault is captured and saved for further analysis.<br />
In hardware, FFDC data is collected in fault isolation registers based on the first event that<br />
had occurred. FFDC check stations are carefully positioned within the server logic and data<br />
paths to ensure that potential errors can be quickly identified and accurately tracked to an<br />
individual Field Replaceable Unit (FRU).<br />
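The behavior described above can be sketched in a few lines of Python. This is a purely<br />
illustrative model, not AIX or POWER firmware code: the class name, bit-to-FRU map, and<br />
checker numbering are all assumptions made for the example. The key FFDC property it<br />
demonstrates is that the register latches only the first error event, so the root cause is<br />
preserved even when secondary errors cascade afterwards.<br />

```python
# Illustrative sketch of FFDC-style fault isolation (not a real firmware API).
# A checker station latches the FIRST error bit it sees; later, cascading
# errors do not overwrite it, so the latched bit identifies the root cause
# without reproducing the problem.

class FaultIsolationRegister:
    """Latches the first error bit raised; later bits do not overwrite it."""

    # Hypothetical map from checker bit position to a Field Replaceable Unit.
    BIT_TO_FRU = {0: "memory DIMM 3", 1: "L2 cache bank 0", 2: "I/O bridge"}

    def __init__(self):
        self.first_bit = None   # None until a checker fires

    def raise_checker(self, bit):
        # First Failure Data Capture: keep only the earliest event.
        if self.first_bit is None:
            self.first_bit = bit

    def isolate_fru(self):
        # Resolve the latched bit to a replaceable part; None if no fault yet.
        if self.first_bit is None:
            return None
        return self.BIT_TO_FRU.get(self.first_bit, "unknown FRU")


fir = FaultIsolationRegister()
fir.raise_checker(1)      # root cause: the L2 cache checker fires first
fir.raise_checker(0)      # cascading secondary error is ignored
print(fir.isolate_fru())  # -> L2 cache bank 0
```

The latch-first design is what lets diagnosis point at a single FRU: whatever bit arrived first<br />
is, by construction, the earliest detected event on that logic path.<br />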
This proactive diagnostic strategy is a significant improvement over less accurate “reboot and<br />
diagnose” service approaches. Using projections based on <strong>IBM</strong> internal tracking information,<br />
it is possible to predict that high-impact outages would occur two to three times more<br />
frequently without an FFDC capability.<br />
In fact, without some type of pervasive method for problem diagnosis, even simple problems<br />
that occur intermittently can cause serious and prolonged outages. By using this proactive<br />
diagnostic approach, <strong>IBM</strong> no longer has to rely on an intermittent “reboot and retry” error<br />
detection strategy, but instead knows with some certainty which part is having problems.<br />
This architecture is also the basis for <strong>IBM</strong> predictive failure analysis, because the Service<br />
Processor can now count and log intermittent component errors and can deallocate or take<br />
other corrective actions when an error threshold is reached.<br />
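The counting-and-threshold scheme can be sketched as follows. Again this is a conceptual<br />
model only: the class, component names, and threshold value are assumptions for the<br />
example, not the Service Processor's actual firmware logic. It shows how counting corrected,<br />
intermittent errors per component allows a part to be deallocated before it fails hard.<br />

```python
# Illustrative sketch of threshold-based predictive failure analysis
# (hypothetical; not actual Service Processor firmware). Corrected errors
# are counted per component, and a component is marked for deallocation
# once its count reaches a threshold.

ERROR_THRESHOLD = 3  # corrected errors tolerated before corrective action

class ServiceProcessorLog:
    def __init__(self, threshold=ERROR_THRESHOLD):
        self.threshold = threshold
        self.counts = {}          # component -> intermittent error count
        self.deallocated = set()  # components taken out of use

    def log_corrected_error(self, component):
        """Count an intermittent error; deallocate at the threshold.

        Returns True if the component is now deallocated.
        """
        self.counts[component] = self.counts.get(component, 0) + 1
        if self.counts[component] >= self.threshold:
            self.deallocated.add(component)
        return component in self.deallocated


sp = ServiceProcessorLog()
sp.log_corrected_error("cpu2")
sp.log_corrected_error("cpu2")
print(sp.log_corrected_error("cpu2"))  # third error crosses the threshold -> True
```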
Chapter 1. Introduction 7