IBM AIX Continuous Availability Features - IBM Redbooks
2.2.3 Paging space verification<br />
2.2.4 Storage keys<br />
Finding the root cause of system crashes, hangs, or other symptoms is difficult when that<br />
root cause is data corruption, because the symptoms can appear far downstream from<br />
where the corruption occurred. The paging space verification design is intended to<br />
improve First Failure Data Capture (FFDC) of problems caused by paging space data<br />
corruption by checking that the data read in from paging space matches the data that was<br />
written out.<br />
When a page is paged out, a checksum is computed on the data in the page and saved<br />
in a pinned array associated with the paging device. If and when the page is paged back in, a new<br />
checksum is computed on the data read in from paging space and compared to<br />
the value in the array. If the values do not match, the kernel logs an error and then either halts (if the<br />
error occurred in system memory) or sends an exception to the application (if it occurred in<br />
user memory).<br />
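The page-out and page-in check described above can be sketched as a small user-space model. This is an illustration only, not the AIX kernel implementation: the class and method names are hypothetical, CRC-32 is an assumed checksum (the text does not name the algorithm), and the two failure outcomes (kernel halt or application exception) are collapsed into a single Python exception.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


class PagingChecksumError(Exception):
    """Raised when data read back from paging space fails verification."""


class PagingDevice:
    """Toy model of one paging device with a per-slot checksum array."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots      # simulated paging space blocks
        self.checksums = [None] * num_slots  # stands in for the pinned array

    def page_out(self, slot, data):
        """Compute and save a checksum, then write the page out."""
        if len(data) != PAGE_SIZE:
            raise ValueError("page must be exactly PAGE_SIZE bytes")
        self.checksums[slot] = zlib.crc32(data)  # assumed algorithm
        self.slots[slot] = bytes(data)

    def page_in(self, slot):
        """Read the page back and compare a fresh checksum to the saved one."""
        data = self.slots[slot]
        if data is None:
            raise KeyError(f"slot {slot} holds no paged-out data")
        if zlib.crc32(data) != self.checksums[slot]:
            raise PagingChecksumError(f"checksum mismatch in slot {slot}")
        return data
```

Because the checksum is taken at page-out time, a mismatch at page-in points at the paging path itself (device, driver, or transport) at the first failure, rather than at whatever code later trips over the bad data.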
Paging space verification can be enabled or disabled, on a per-paging space basis, by using<br />
the mkps and chps commands. The details of these commands can be found in their<br />
corresponding <strong>AIX</strong> man pages.<br />
Most application programmers have experienced the inadvertent memory overlay problem,<br />
where a piece of code accidentally writes to a memory location that is not part of the<br />
component’s memory domain. A new hardware feature, called storage protection keys and<br />
referred to as storage keys in this paper, assists application programmers in locating these<br />
inadvertent memory overlays.<br />
Memory overlays and addressing errors are among the most difficult problems to diagnose<br />
and service. The problem is compounded by growing software size and increased<br />
complexity. Under <strong>AIX</strong>, a large global address space is shared among a variety of software<br />
components. This creates a serviceability issue for both applications and the <strong>AIX</strong> kernel.<br />
The <strong>AIX</strong> 64-bit kernel makes extensive use of a large flat address space by design. This is<br />
important in order to avoid costly MMU operations on POWER processors. Although this<br />
design does produce a significant performance advantage, it also adds reliability, availability<br />
and serviceability (RAS) costs. Large 64-bit applications, such as DB2®, use a global<br />
address space for similar reasons and also face issues with memory overlays.<br />
Storage keys were introduced in PowerPC® architecture to provide memory isolation, while<br />
still permitting software to maintain a flat address space. The concept was adopted from the<br />
System z and <strong>IBM</strong> 390 systems. Storage keys allow an address space to be assigned<br />
context-specific protection. Access to the memory regions can be limited to prevent, and<br />
catch, illegal storage references.<br />
A new CPU facility, the Authority Mask Register (AMR), has been added to define the key set that<br />
the CPU has access to. The AMR is implemented as a vector of bit pairs indexed by key number,<br />
with distinct bits to control read and write access for each key. The key protection is in<br />
addition to the existing page protection bits. For any load or store operation, the CPU retrieves<br />
the memory key assigned to the targeted page during the translation process. The key<br />
number is used to select the bit pair in the AMR that determines whether the access is permitted.<br />
A data storage interrupt occurs when this check fails. The AMR is a per-context register that<br />
can be updated efficiently. The TLB/ERAT contains storage key values for each virtual page.<br />
This allows AMR updates to be efficient, since they do not require TLB/ERAT invalidation.<br />
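The key check described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the hardware definition: the number of keys, the position of each bit pair, and the meaning of each bit (set means access denied) are chosen for illustration, and all names are hypothetical. The `page_keys` dictionary stands in for the per-page key held in the TLB/ERAT.

```python
NUM_KEYS = 32        # assumed: a 64-bit register holding one bit pair per key
WRITE_DENY = 0b01    # assumed encoding: low bit of the pair blocks stores
READ_DENY = 0b10     # assumed encoding: high bit of the pair blocks loads


class DataStorageInterrupt(Exception):
    """Stands in for the interrupt raised when the key check fails."""


def amr_set(amr, key, deny_bits):
    """Return a copy of amr with the bit pair for `key` replaced."""
    if not 0 <= key < NUM_KEYS:
        raise ValueError("key out of range")
    shift = 2 * key
    cleared = amr & ~(0b11 << shift)
    return cleared | ((deny_bits & 0b11) << shift)


def access(amr, page_keys, vpage, is_store):
    """Check one load or store against the key of the targeted page."""
    key = page_keys[vpage]              # key retrieved during translation
    pair = (amr >> (2 * key)) & 0b11    # key number selects the bit pair
    if pair & (WRITE_DENY if is_store else READ_DENY):
        kind = "store" if is_store else "load"
        raise DataStorageInterrupt(f"{kind} to page {vpage:#x} denied by key {key}")
    return True
```

In this model, granting or revoking access is a single call to `amr_set`; the `page_keys` mapping is never modified, which mirrors why an AMR update needs no TLB/ERAT invalidation.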
Chapter 2. <strong>AIX</strong> continuous availability features 21