11.01.2013 Views

IBM AIX Continuous Availability Features - IBM Redbooks


eases the burden placed on system administrators and IT support staff, but also enables rapid and precise collection of problem data.

On the software side, serviceability is the ability to diagnose and correct or recover from an error when it occurs. The most significant serviceability capabilities and enablers in AIX are referred to as the software service aids. The primary software service aids are error logging, system dump, and tracing.

With the advent of next-generation UNIX servers from IBM, many hardware reliability, availability, and serviceability features, such as memory error detection, LPARs, and hardware sensors, have been implemented. These features are supported by the relevant software in AIX. These abilities continue to establish AIX as the best UNIX operating system.

1.7 First Failure Data Capture<br />

IBM has implemented a server design that builds in thousands of hardware error checker stations that capture and help to identify error conditions within the server. The IBM System p p5-595 server, for example, includes almost 80,000 checkers to help capture and identify error conditions. The captured error conditions are stored in over 29,000 Fault Isolation Register bits. Each of these checkers is viewed as a “diagnostic probe” into the server, and, when coupled with extensive diagnostic firmware routines, allows quick and accurate assessment of hardware error conditions at run time.

Integrated hardware error detection and fault isolation is a key component of the System p and System i platform design strategy. It is for this reason that in 1997, IBM introduced First Failure Data Capture (FFDC) for IBM POWER servers. FFDC plays a critical role in delivering servers that can self-diagnose and self-heal. The system effectively traps hardware errors at system run time.

FFDC is a technique that ensures that when a fault is detected in a system (through error checkers or other types of detection methods), the root cause of the fault is captured without the need to recreate the problem or run any sort of extended tracing or diagnostics program. For the vast majority of faults, an effective FFDC design means that the root cause can also be detected automatically without servicer intervention. The pertinent error data related to the fault is captured and saved for further analysis.

In hardware, FFDC data is collected in fault isolation registers based on the first event that occurred. FFDC check stations are carefully positioned within the server logic and data paths to ensure that potential errors can be quickly identified and accurately tracked to an individual Field Replaceable Unit (FRU).
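The first-event capture described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not IBM's hardware design: the `FaultIsolationRegister` class, the `first_checker` latch, and the checker-to-FRU table are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultIsolationRegister:
    """Toy model (hypothetical): one bit per checker plus a first-event latch."""
    bits: int = 0
    first_checker: Optional[int] = None  # assumed "first failure" record

    def report(self, checker_id: int) -> None:
        # Latch only the first checker that fires. Later errors still set
        # their bits but never overwrite the first-failure record.
        if self.first_checker is None:
            self.first_checker = checker_id
        self.bits |= 1 << checker_id


# Hypothetical mapping from checker number to Field Replaceable Unit.
CHECKER_TO_FRU = {0: "memory DIMM 0", 1: "processor card 1", 2: "I/O adapter 2"}


def isolate(fir: FaultIsolationRegister) -> Optional[str]:
    """Return the FRU implicated by the first captured error, if any."""
    if fir.first_checker is None:
        return None
    return CHECKER_TO_FRU.get(fir.first_checker, "unknown FRU")
```

In this model, an error on checker 1 followed by a secondary error on checker 2 leaves both bits set, but `isolate` still points at the FRU for checker 1, which is the point of capturing data at the first failure rather than after side effects have spread.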

This proactive diagnostic strategy is a significant improvement over less accurate “reboot and diagnose” service approaches. Using projections based on IBM internal tracking information, it is possible to predict that high-impact outages would occur two to three times more frequently without an FFDC capability.

In fact, without some type of pervasive method for problem diagnosis, even simple problems that occur intermittently can cause serious and prolonged outages. By using this proactive diagnostic approach, IBM no longer has to rely on an intermittent “reboot and retry” error detection strategy, but instead knows with some certainty which part is having problems.

This architecture is also the basis for IBM predictive failure analysis, because the Service Processor can now count and log intermittent component errors and can deallocate the affected component or take other corrective actions when an error threshold is reached.
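The threshold behavior described above can be sketched as a simple per-component counter. This is an illustrative model only: the `ErrorThresholdMonitor` class, its method names, and the threshold value are assumptions, and the actual Service Processor policy is not described in this text.

```python
from collections import Counter


class ErrorThresholdMonitor:
    """Hypothetical model of counting intermittent errors per component."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold   # assumed policy value, not an IBM figure
        self.counts = Counter()      # recoverable errors seen per component
        self.deallocated = set()     # components already taken out of service

    def log_error(self, component: str) -> bool:
        """Record one error; return True if this call deallocates the component."""
        if component in self.deallocated:
            return False
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            self.deallocated.add(component)
            return True
        return False
```

With a threshold of 3, the first two calls to `log_error("cpu3")` only increment the count, and the third call deallocates the component before it can cause an unrecoverable failure.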
