11.01.2013 Views

IBM AIX Continuous Availability Features - IBM Redbooks


The new processor autonomously replaces the defective one. Dynamic CPU Guard and Dynamic CPU Sparing work together to protect a customer's investments through their self-diagnosing and self-healing software. More information is available at the following site:

http://www.research.ibm.com/journal/sj/421/jann.html

2.1.4 Predictive CPU deallocation and dynamic processor deallocation

On these systems, AIX implements continuous hardware surveillance and regularly polls the firmware for hardware errors. When the number of processor errors reaches a threshold and the firmware recognizes a distinct probability that the component will fail, the firmware returns an error report. In all cases, the error is logged in the system error log. In addition, on multiprocessor systems, depending on the type of failure, AIX attempts to stop using the untrustworthy processor and deallocate it. More information is available at the following site:

http://www-05.ibm.com/cz/power6/files/zpravy/WhitePaper_POWER6_availabilty.PDF
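The threshold logic described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not AIX or firmware code: the names `Processor`, `SurveillanceModel`, `poll`, and the value of `ERROR_THRESHOLD` are hypothetical, and the dependence on failure type mentioned in the text is left out for brevity.

```python
from dataclasses import dataclass, field

ERROR_THRESHOLD = 3  # hypothetical value; the real threshold is internal to the firmware


@dataclass
class Processor:
    cpu_id: int
    error_count: int = 0
    online: bool = True
    reported: bool = False


@dataclass
class SurveillanceModel:
    processors: list
    error_log: list = field(default_factory=list)

    def poll(self, new_errors):
        """One polling pass; new_errors maps cpu_id -> errors seen since the last pass."""
        for cpu in self.processors:
            cpu.error_count += new_errors.get(cpu.cpu_id, 0)
            if cpu.reported or cpu.error_count < ERROR_THRESHOLD:
                continue
            # The error is logged in all cases, once per processor.
            cpu.reported = True
            self.error_log.append(
                f"cpu{cpu.cpu_id}: predicted failure after {cpu.error_count} errors"
            )
            # Deallocation is attempted only while another processor remains online.
            if sum(p.online for p in self.processors) > 1:
                cpu.online = False


model = SurveillanceModel([Processor(0), Processor(1)])
model.poll({1: 2})  # below the threshold: nothing is logged
model.poll({1: 1})  # threshold reached: logged, then cpu 1 is taken offline
print(model.error_log, [p.online for p in model.processors])
```

On a model with a single processor, the same pass still writes the log entry but leaves the processor online, which mirrors the distinction the text draws between logging "in all cases" and deallocating only on multiprocessor systems.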

2.1.5 Processor recovery and alternate processor

Although this feature relates mainly to System p hardware, AIX still plays a role in the process: code in AIX allows the alternate processor recovery feature to deallocate and deconfigure a failing processor by moving its instruction stream to a spare processor and restarting it there. The POWER Hypervisor and POWER6 hardware can accomplish these operations without application interruption, so processing continues unimpeded. More information is available at the following site:

http://www-05.ibm.com/cz/power6/files/zpravy/WhitePaper_POWER6_availabilty.PDF

2.1.6 Excessive interrupt disablement detection

AIX V5.3 ML3 introduced a feature that detects a period of excessive interrupt disablement on a CPU and creates an error log record to report it. This lets you know whether privileged code running on a system is unduly (and silently) impacting performance. It also helps to identify and improve such offending code paths before the problems manifest in ways that have proven very difficult to diagnose in the past.

This feature employs a kernel profiling approach to detect disabled code that runs for too long. The basic idea is to take advantage of the regularly scheduled clock “ticks” that generally occur every 10 milliseconds, using them to approximately measure continuously disabled stretches of CPU time individually on each logical processor in the configuration. This approach will alert you to partially disabled code sequences by logging one or more hits within the offending code. It will alert you to fully disabled code sequences by logging the i_enable that terminates them.
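The tick-based measurement can be pictured with a toy Python model. It is a sketch under assumptions, not the AIX kernel implementation: the class `DisablementDetector`, its method names, the log tuple layout, and `THRESHOLD_TICKS` are hypothetical; only the 10 ms tick interval and the two reporting cases come from the text. The model assumes that a tick can still sample partially disabled code, while fully disabled code holds ticks off until it re-enables interrupts.

```python
TICK_MS = 10          # tick interval given in the text
THRESHOLD_TICKS = 50  # hypothetical limit (about 500 ms); not taken from the source


class DisablementDetector:
    def __init__(self, n_cpus, threshold_ticks=THRESHOLD_TICKS):
        self.threshold = threshold_ticks
        self.streak = [0] * n_cpus  # consecutive ticks seen in disabled code, per CPU
        self.error_log = []

    def on_tick(self, cpu, partially_disabled, location):
        """Tick delivered: sample whether the interrupted code was partially disabled."""
        if not partially_disabled:
            self.streak[cpu] = 0
            return
        self.streak[cpu] += 1
        if self.streak[cpu] >= self.threshold:
            # Record a "hit" inside the offending code.
            self.error_log.append(("partial", cpu, self.streak[cpu] * TICK_MS, location))

    def on_i_enable(self, cpu, disabled_since_ms, now_ms, caller):
        """Fully disabled code holds ticks off, so the check runs when it re-enables."""
        elapsed_ms = now_ms - disabled_since_ms
        if elapsed_ms // TICK_MS >= self.threshold:
            self.error_log.append(("full", cpu, elapsed_ms, caller))


detector = DisablementDetector(n_cpus=2, threshold_ticks=3)
for _ in range(3):
    detector.on_tick(0, True, "drv_a")  # third consecutive hit crosses the limit
detector.on_i_enable(1, disabled_since_ms=100, now_ms=150, caller="drv_b")
print(detector.error_log)
```

Because the measurement counts whole ticks, it is only approximate, as the text notes: a disabled stretch shorter than one tick interval can go unseen.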

You can turn excessive interrupt disablement detection off and on, respectively, by changing the proc.disa RAS component:

errctrl -c proc.disa errcheckoff
errctrl -c proc.disa errcheckon

Note that the preceding commands only affect the current boot. In AIX 6.1, the -P flag is introduced so that the setting can be changed persistently across reboots, for example:

errctrl -c proc.disa -P errcheckoff
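For scripted administration, the three invocations above could be assembled by a small Python helper. This is a minimal sketch: the function names `errctrl_args` and `set_disablement_detection` are hypothetical, only the flags shown above are used, and combining -P with errcheckon is an assumption based on the statement that the setting can be changed persistently. The helper defaults to a dry run, so nothing is executed unless requested on an AIX system.

```python
import subprocess

COMPONENT = "proc.disa"  # RAS component named in the text


def errctrl_args(enable, persistent=False):
    """Build the errctrl argument list for turning detection on or off."""
    args = ["errctrl", "-c", COMPONENT]
    if persistent:
        args.append("-P")  # AIX 6.1 and later, per the text
    args.append("errcheckon" if enable else "errcheckoff")
    return args


def set_disablement_detection(enable, persistent=False, dry_run=True):
    """Return the command; run it only when dry_run is False (requires AIX and root)."""
    args = errctrl_args(enable, persistent)
    if not dry_run:
        subprocess.run(args, check=True)
    return args


print(" ".join(set_disablement_detection(False, persistent=True)))
```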

