IBM AIX Continuous Availability Features - IBM Redbooks

mirrored volume group on the client. This achieves a potentially greater level of reliability by providing redundancy. Client volume group mirroring is also required when a VIOS logical volume is used as a virtual SCSI device on the client. In this case, the virtual SCSI devices are associated with different SCSI disks, each controlled by one of the two VIOS.

As an example of network high availability, Shared Ethernet Adapter (SEA) failover offers Ethernet redundancy to the client at the virtual level. The client gets one standard virtual Ethernet adapter hosted by two Virtual I/O Servers. The two Virtual I/O Servers use a control channel to determine which of them is supplying the Ethernet service to the client. Through this active monitoring between the two VIOS, failure of either will result in the remaining VIOS taking control of the Ethernet service for the client. The client has no special protocol or software configured, and uses the virtual Ethernet adapter as though it were hosted by only one VIOS.

2.1.12 Special Uncorrectable Error handling

Although rare, an uncorrectable data error can occur in memory or a cache, despite all precautions built into the server. In older generations of servers (prior to IBM POWER4 processor-based offerings), this type of error would eventually result in a system crash. The IBM System p and System i offerings extend the POWER4 technology design and include techniques for handling these types of errors.

On these servers, when an uncorrectable error is identified at one of the many checkers strategically deployed throughout the system's central electronic complex, the detecting hardware modifies the ECC word associated with the data, creating a special ECC code. This code indicates that an uncorrectable error has been identified at the data source and that the data in the "standard" ECC word is no longer valid. The check hardware also signals the Service Processor and identifies the source of the error.
The Service Processor then takes appropriate action to handle the error. This technique is called Special Uncorrectable Error (SUE) handling.

Simply detecting an error does not automatically cause termination of a system or partition. In many cases, an uncorrectable error will cause generation of a synchronous machine check interrupt. The machine check interrupt occurs when a processor tries to load the bad data. The firmware provides a pointer to the instruction that referred to the corrupt data, the system continues to operate normally, and the hardware observes the use of the data.

The system is designed to mitigate the problem using a number of approaches. For example, if the data is never actually used but is simply overwritten, then the error condition can safely be voided and the system will continue to operate normally. For AIX V5.2 or greater, if the data is actually referenced for use by a process, then the operating system is informed of the error. The operating system will terminate only the specific user process associated with the corrupt data.

New with AIX V6.1

The POWER6 processor adds the ability to report the faulting memory address on an SUE machine check. This hardware characteristic, combined with the AIX V6.1 recovery framework, expands the cases in which AIX will recover from an SUE to include some instances when the error occurs in kernel mode. Specifically, if an SUE occurs inside one of the copyin() and copyout() family of kernel services, these functions will return an error code and allow the system to continue operating (in contrast, on a POWER4 or POWER5 system, AIX would crash). The new SUE feature integrates the kernel mode handling of SUEs with the FRR recovery framework.
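The outcomes described above for an SUE can be summarized as a small decision model. The following Python sketch is only an illustration of the prose, not AIX or firmware code: the names `SueEvent`, `Outcome`, and `handle_sue` are hypothetical, and the outcomes for cases the text does not spell out (user-mode errors before AIX V5.2, and kernel-mode errors outside the copyin() and copyout() family) are assumed here to be a system crash.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    CONTINUE = auto()           # bad data was overwritten, never consumed
    TERMINATE_PROCESS = auto()  # only the affected user process is ended
    RETURN_ERROR = auto()       # kernel service returns an error code
    SYSTEM_CRASH = auto()       # no recovery path applies


@dataclass(frozen=True)
class SueEvent:
    """Hypothetical description of one SUE occurrence."""
    data_consumed: bool                    # False if simply overwritten
    kernel_mode: bool = False              # error hit while in kernel mode
    in_copy_service: bool = False          # inside copyin()/copyout() family
    processor: str = "POWER6"              # e.g. "POWER4", "POWER5", "POWER6"
    aix_version: tuple = (6, 1)            # (major, minor)
    kernel_recovery_enabled: bool = False  # disabled by default in AIX V6.1


def handle_sue(ev: SueEvent) -> Outcome:
    """Map an SUE occurrence to the outcome described in the text."""
    # Data that is never used can be ignored safely.
    if not ev.data_consumed:
        return Outcome.CONTINUE

    if not ev.kernel_mode:
        # AIX V5.2 or greater terminates only the affected user process.
        if ev.aix_version >= (5, 2):
            return Outcome.TERMINATE_PROCESS
        return Outcome.SYSTEM_CRASH  # assumption: not stated in the text

    # Kernel mode: recoverable only on POWER6 with AIX V6.1 kernel recovery
    # enabled, and only inside the copyin()/copyout() family of services.
    recoverable = (
        ev.processor == "POWER6"
        and ev.aix_version >= (6, 1)
        and ev.kernel_recovery_enabled
        and ev.in_copy_service
    )
    return Outcome.RETURN_ERROR if recoverable else Outcome.SYSTEM_CRASH
```

For example, `handle_sue(SueEvent(data_consumed=True, kernel_mode=True, in_copy_service=True, kernel_recovery_enabled=True))` yields `Outcome.RETURN_ERROR`, while the same event on a POWER5 system yields `Outcome.SYSTEM_CRASH`.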
Note: The default kernel recovery framework setting is disabled. This means an affirmative action must be taken via SMIT or the raso command to enable recovery. When recovery is not enabled, the behavior will be the same as on AIX 5.3.

2.1.13 Automated system hang recovery

Automatic system hang recovery, with its error detection and fix capabilities, is a key feature of AIX automated system management. It can detect the condition in which high-priority processes monopolize system resources and prevent normal execution. AIX offers system administrators a variety of customizable solutions to remedy the system hang condition.

2.1.14 Recovery framework

Beginning with AIX V6.1, the kernel can recover from errors in selected routines, thus avoiding an unplanned system outage. The kernel recovery framework improves system availability by allowing continued system operation after some unexpected kernel errors.

Kernel recovery

Kernel recovery in AIX V6.1 is disabled by default. This is because the set of errors that can be recovered is limited in AIX V6.1, and kernel recovery, when enabled, requires an extra 4 KB page of memory per thread. To enable, disable, or show the kernel recovery state, use the SMIT path Problem Determination → Kernel Recovery, or use the smitty krecovery command. You can show the current and next boot states, and also enable or disable the kernel recovery framework at the next boot. For the change to become fully active, you must run the /usr/sbin/bosboot command after changing the kernel recovery state, and then reboot the operating system.

During a kernel recovery action, the system might pause for a short time, generally less than two seconds. The following actions occur immediately after a kernel recovery action:

1. The system console displays a message saying that a kernel error recovery action has occurred.
2. AIX adds an entry into the error log.
3. AIX may generate a live dump.
4. You can send the error log data and live dump data to IBM for service (similar to sending data from a full system termination).

Note: Some functions might be lost after a kernel recovery, but the operating system remains in a stable state. If necessary, shut down and restart your system to restore the lost functions.

2.2 System reliability

Over the years the AIX operating system has included many reliability features inspired by IBM technology, and it now includes even more groundbreaking technologies that add to AIX reliability. These include kernel support for POWER6 storage keys, Concurrent AIX Update, dynamic tracing, and enhanced software first failure data capture, to mention just a few of the new features.