IBM AIX Continuous Availability Features - IBM Redbooks
eases the burden placed on system administrators and IT support staff, but also enables<br />
rapid and precise collection of problem data.<br />
On the software side, serviceability is the ability to diagnose and correct or recover from an<br />
error when it occurs. The most significant serviceability capabilities and enablers in <strong>AIX</strong> are<br />
referred to as the software service aids. The primary software service aids are error logging,<br />
system dump, and tracing.<br />
With the advent of next-generation UNIX servers from <strong>IBM</strong>, many hardware reliability-,<br />
availability-, and serviceability-related features, such as memory error detection, LPARs,<br />
and hardware sensors, have been implemented. These features are supported by the<br />
relevant software in <strong>AIX</strong>. These capabilities continue to establish <strong>AIX</strong> as the best UNIX<br />
operating system.<br />
1.7 First Failure Data Capture<br />
<strong>IBM</strong> has implemented a server design that builds in thousands of hardware error checker<br />
stations that capture and help to identify error conditions within the server. The <strong>IBM</strong> System p<br />
p5-595 server, for example, includes almost 80,000 checkers to help capture and identify<br />
error conditions. These are stored in over 29,000 Fault Isolation Register bits. Each of these<br />
checkers is viewed as a “diagnostic probe” into the server, and, when coupled with extensive<br />
diagnostic firmware routines, allows quick and accurate assessment of hardware error<br />
conditions at run-time.<br />
Integrated hardware error detection and fault isolation is a key component of the System p<br />
and System i platform design strategy. It is for this reason that in 1997, <strong>IBM</strong> introduced First<br />
Failure Data Capture (FFDC) for <strong>IBM</strong> POWER servers. FFDC plays a critical role in<br />
delivering servers that can self-diagnose and self-heal. The system effectively traps hardware<br />
errors at system run time.<br />
FFDC is a technique which ensures that when a fault is detected in a system (through error<br />
checkers or other types of detection methods), the root cause of the fault will be captured<br />
without the need to recreate the problem or run any sort of extended tracing or diagnostics<br />
program. For the vast majority of faults, an effective FFDC design means that the root cause<br />
can also be detected automatically without servicer intervention. The pertinent error data<br />
related to the fault is captured and saved for further analysis.<br />
In hardware, FFDC data is collected in fault isolation registers based on the first event that<br />
had occurred. FFDC check stations are carefully positioned within the server logic and data<br />
paths to ensure that potential errors can be quickly identified and accurately tracked to an<br />
individual Field Replaceable Unit (FRU).<br />
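The behavior described above can be sketched in a few lines of Python. This is a purely<br />
illustrative model, not AIX or POWER firmware code: the class name, bit-to-FRU map, and<br />
checker numbering are all assumptions made for the example. The key FFDC property it<br />
demonstrates is that the register latches only the first error event, so the root cause is<br />
preserved even when secondary errors cascade afterwards.<br />

```python
# Illustrative sketch of FFDC-style fault isolation (not a real firmware API).
# A checker station latches the FIRST error bit it sees; later, cascading
# errors do not overwrite it, so the latched bit identifies the root cause
# without reproducing the problem.

class FaultIsolationRegister:
    """Latches the first error bit raised; later bits do not overwrite it."""

    # Hypothetical map from checker bit position to a Field Replaceable Unit.
    BIT_TO_FRU = {0: "memory DIMM 3", 1: "L2 cache bank 0", 2: "I/O bridge"}

    def __init__(self):
        self.first_bit = None   # None until a checker fires

    def raise_checker(self, bit):
        # First Failure Data Capture: keep only the earliest event.
        if self.first_bit is None:
            self.first_bit = bit

    def isolate_fru(self):
        # Resolve the latched bit to a replaceable part; None if no fault yet.
        if self.first_bit is None:
            return None
        return self.BIT_TO_FRU.get(self.first_bit, "unknown FRU")


fir = FaultIsolationRegister()
fir.raise_checker(1)      # root cause: the L2 cache checker fires first
fir.raise_checker(0)      # cascading secondary error is ignored
print(fir.isolate_fru())  # -> L2 cache bank 0
```

The latch-first design is what lets diagnosis point at a single FRU: whatever bit arrived first<br />
is, by construction, the earliest detected event on that logic path.<br />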
This proactive diagnostic strategy is a significant improvement over less accurate “reboot and<br />
diagnose” service approaches. Using projections based on <strong>IBM</strong> internal tracking information,<br />
it is possible to predict that high-impact outages would occur two to three times more<br />
frequently without an FFDC capability.<br />
In fact, without some type of pervasive method for problem diagnosis, even simple problems<br />
that occur intermittently can cause serious and prolonged outages. By using this proactive<br />
diagnostic approach, <strong>IBM</strong> no longer has to rely on an intermittent “reboot and retry” error<br />
detection strategy, but instead knows with some certainty which part is having problems.<br />
This architecture is also the basis for <strong>IBM</strong> predictive failure analysis, because the Service<br />
Processor can now count and log intermittent component errors and can deallocate or take<br />
other corrective actions when an error threshold is reached.<br />
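The counting-and-threshold scheme can be sketched as follows. Again this is a conceptual<br />
model only: the class, component names, and threshold value are assumptions for the<br />
example, not the Service Processor's actual firmware logic. It shows how counting corrected,<br />
intermittent errors per component allows a part to be deallocated before it fails hard.<br />

```python
# Illustrative sketch of threshold-based predictive failure analysis
# (hypothetical; not actual Service Processor firmware). Corrected errors
# are counted per component, and a component is marked for deallocation
# once its count reaches a threshold.

ERROR_THRESHOLD = 3  # corrected errors tolerated before corrective action

class ServiceProcessorLog:
    def __init__(self, threshold=ERROR_THRESHOLD):
        self.threshold = threshold
        self.counts = {}          # component -> intermittent error count
        self.deallocated = set()  # components taken out of use

    def log_corrected_error(self, component):
        """Count an intermittent error; deallocate at the threshold.

        Returns True if the component is now deallocated.
        """
        self.counts[component] = self.counts.get(component, 0) + 1
        if self.counts[component] >= self.threshold:
            self.deallocated.add(component)
        return component in self.deallocated


sp = ServiceProcessorLog()
sp.log_corrected_error("cpu2")
sp.log_corrected_error("cpu2")
print(sp.log_corrected_error("cpu2"))  # third error crosses the threshold -> True
```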
Chapter 1. Introduction 7