A_Tour_beyond_BIOS_Implementing_APEI_with_UEFI_White_Paper
A_Tour_beyond_BIOS_Implementing_APEI_with_UEFI_White_Paper
A_Tour_beyond_BIOS_Implementing_APEI_with_UEFI_White_Paper
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>White</strong> <strong>Paper</strong><br />
A <strong>Tour</strong> <strong>beyond</strong> <strong>BIOS</strong> <strong>Implementing</strong><br />
the ACPI Platform Error Interface<br />
<strong>with</strong> the Unified Extensible<br />
Firmware Interface<br />
Palsamy Sakthikumar<br />
Intel Corporation<br />
Vincent J. Zimmer<br />
Intel Corporation<br />
January 2013<br />
i
Executive Summary<br />
This paper presents a universal model to design the industry standard WHEA/<strong>APEI</strong><br />
infrastructure in x86 platforms using <strong>UEFI</strong> firmware.<br />
ii
Table of Contents<br />
Overview ...................................................................................................................................... 4<br />
Introduction to <strong>APEI</strong> ............................................................................................................. 4<br />
Goal and Motivation ................................................................................................................. 5<br />
Platform Error Infrastructure ............................................................................................... 6<br />
Error Classifications ............................................................................................................. 7<br />
ACPI Platform Error Interface .............................................................................................. 9<br />
Hardware Error Source Table (HEST) .............................................................................. 9<br />
ERST (Error Record Serialization table (ERST) .............................................................. 9<br />
Error Injection Table (EINJ) ................................................................................................. 9<br />
Boot Error Record Table (BERT) ..................................................................................... 10<br />
Generic Error Status block ................................................................................................ 10<br />
<strong>APEI</strong> Error handling models .............................................................................................. 10<br />
Platform Error Signaling ..................................................................................................... 11<br />
The <strong>APEI</strong> Error source signaling ...................................................................................... 11<br />
<strong>UEFI</strong> Overview ......................................................................................................................... 12<br />
<strong>UEFI</strong> Overview .................................................................................................................... 12<br />
ACPI Table Installation ...................................................................................................... 13<br />
<strong>UEFI</strong> PI DXE SMM ............................................................................................................. 14<br />
<strong>APEI</strong> Design in <strong>UEFI</strong> PI-based host firmware .............................................................. 17<br />
<strong>APEI</strong> Infrastructure in <strong>UEFI</strong> ............................................................................................... 17<br />
<strong>UEFI</strong> Modular Approach to <strong>APEI</strong> ...................................................................................... 18<br />
<strong>APEI</strong> Architecture Modules ............................................................................................... 20<br />
Processor/Chipset Modules .............................................................................................. 20<br />
Platform Specific Modules ................................................................................................. 21<br />
Conclusion ................................................................................................................................. 23<br />
Glossary ...................................................................................................................................... 24<br />
References ................................................................................................................................. 25<br />
iii
Overview<br />
The ACPI Platform Error Interface (<strong>APEI</strong>)/WHEA (Windows Hardware Error Architecture) is<br />
standard defined by ACPI specification for PC and server platforms to report and handle<br />
platform errors in graceful way. It’s important transfer hardware platform error knowledge to<br />
operating systems and management stacks in a meaningful way, in order to perform necessary<br />
corrective actions to recover the system or prevent system from sudden failure.<br />
Introduction to <strong>APEI</strong><br />
As every platform design and implementation are different, OS needs universal way for error<br />
reporting and providing error log information. The <strong>APEI</strong> (ACPI Platform Error Interface)<br />
formally known as WHEA (Windows Hardware Error Architecture) is an ACPI standard. Using<br />
<strong>APEI</strong> and <strong>UEFI</strong> standards, we can easily implement the <strong>APEI</strong> in <strong>UEFI</strong> firmware in a modular<br />
fashion, so that most of the <strong>UEFI</strong> modules are common to all platforms. This commonality<br />
makes it easy to port these capabilities to every platform, including mobile, client and servers.<br />
Summary<br />
This section has provided an overview of error reporting and <strong>APEI</strong>.<br />
4
Goal and Motivation<br />
The ACPI Platform Error Interface (<strong>APEI</strong>) is an industry standard interface defined by the<br />
Advanced Configuration and Power Interface (ACPI) specification. The platform runtime<br />
interface to the platform firmware is largely based upon ACPI, but the Unified Extensible<br />
Firmware Interface (<strong>UEFI</strong>) is a complementary technology that entails a major platform standard<br />
in the industry for platform hardware initialization and OS enabling software. Marrying these to<br />
produce a universal model is key motivation and goal of this paper. Though platforms are<br />
different across different stock keeping units (SKUs) and original equipment manufacturers<br />
(OEMs), most of the <strong>APEI</strong> implementation can be implemented in an interoperable fashion by<br />
applying the modularity offered in <strong>UEFI</strong>.<br />
Error reporting and recovery is a key requirement for the enterprise. In order to hit 5-9’s of<br />
availability, a machine must have maximum uptime. Error reporting forms an important<br />
element of platform Reliability, Availability and Serviceability (RAS) which provides this uptime.<br />
Summary<br />
This section has provided a rationale for leveraging <strong>UEFI</strong> technology to deploy <strong>APEI</strong>.<br />
5
Platform Error Infrastructure<br />
All platforms have different designs and different components that may fail during the life of the<br />
system, such as DRAM failures [GOOGLE]. The error detection and handling is a key function<br />
to determine the failing components and taking corrective action in a timely fashion in order to<br />
prevent any data corruption and also to prolong the life of the system. The error handling is a<br />
cooperative activity between the platform hardware (HW), host firmware (<strong>UEFI</strong> or PC/AT<br />
<strong>BIOS</strong>), and the operating system (OS). The platform errors can be generally classified to<br />
Memory subsystem, IO subsystem, Processor, chipset and platform hardware. The role of the<br />
host firmware and low level out-of-band, management firmware includes configuring the<br />
platform hardware, chipset and processor to detect and report all errors at boot time. In addition<br />
to the pre-OS configuration, when the errors are singled at runtime, these firmware agents read<br />
error registers in order to analyze and report results to the OS promptly in a manner which is<br />
understandable to the OS software. At this point, the OS can initiate appropriate corrective action<br />
and/or inform the administrator to take the actions required. Figure 1 shows the typical platform<br />
error infrastructure and how the errors get propagated from hardware to the OS <strong>with</strong> the help of<br />
the host firmware.<br />
The errors from all the subsystems get signaled to host firmware or the OS directly. When<br />
signaled to the host firmware, the host firmware will read the hardware registers, analyze the<br />
component that generated the error and assess the severity of the error. Host firmware will create<br />
detailed error log information for the OS and notify the OS of the error’s occurrence. Host<br />
firmware may additionally generate a platform log and communicate this log to the management<br />
controllers out of band firmware for system management purposes. Such controllers include but<br />
are not limited to baseboard management controllers (BMC). When the error notification is<br />
conveyed to the OS either directly or from host firmware, the OS will inspect the hardware<br />
registers or host firmware created error log to further analyze the error and initiate corrective<br />
actions. The OS may also inform the administrator to take any corrective action as well.<br />
6
Figure 1 Platform Error Infrastructure<br />
The platform hardware can also provide error injection functionality. For high availability<br />
systems such as enterprise servers, the OS needs a software method or API to inject hardware<br />
errors into platform for purposes of validating the functionality of the error reporting and logging<br />
mechanisms.<br />
Error Classifications<br />
Generally all platform errors can be classified to three categories.<br />
7
• Corrected errors (CE): These are hardware corrected errors or recovered errors, such as<br />
single bit memory errors or memory failover.<br />
• Uncorrected errors (UE): These are errors hardware could not correct or recover. There<br />
is a possibility for software or the OS to recover from this error.<br />
• Fatal errors: These are uncorrectable errors from which neither software or hardware<br />
could recover. Continuing to run in the face of these errors could make system<br />
unreliable.<br />
Some uncorrected errors and all fatal error conditions require immediate containment to prevent<br />
any data corruption. This may require a system shut down or reset to recover or replacing the<br />
failing component, also referred as FRU (Field replacement unit). As such, it is necessary for the<br />
error reporting mechanisms to isolate the error to respective FRU units.<br />
Summary<br />
This section introduced the overall concept of error initiation and reporting.<br />
8
ACPI Platform Error Interface<br />
As every platform design and implementation are different, OS needs universal way for error<br />
reporting and providing error log information. The <strong>APEI</strong> (ACPI Platform Error Interface)<br />
formally known as WHEA (Windows Hardware Error Architecture) is an ACPI standard<br />
defining error reporting interfaces to OS. Using <strong>APEI</strong> and <strong>UEFI</strong> standards, we can easily<br />
implement the <strong>APEI</strong> in <strong>UEFI</strong> host firmware modularly, so that most of the <strong>UEFI</strong> modules are<br />
common to all platforms and making it easy portable for every platform including mobile, client<br />
and servers. The <strong>APEI</strong> defines four ACPI tables for declaring platform error infrastructure and<br />
Error record container for communicating error information to OS.<br />
Hardware Error Source Table (HEST)<br />
The HEST table enables host firmware to declare all errors that platform component can generate<br />
and error signaling for those. The host firmware shall create Error source entries in HEST for<br />
each component (such as, processor, PCIe device, PCIe bridge, etc) and each type of error <strong>with</strong><br />
corresponding error notification mechanism (singling) to OS. These error entries include x86<br />
architectural errors, industry standard errors and generic hardware error source for platform<br />
errors. The x86 architectural errors, MCE and CMC, and standard errors PCIe AER, MSI and<br />
PCI INTx can be handled by OS natively. The generic hardware error source can be used for all<br />
firmware 1 st errors and platform errors (such as memory, board logic) that do not have OS native<br />
signaling, so they have to use platform signaling SCI or NMI.<br />
ERST (Error Record Serialization table (ERST)<br />
The ERST table provides a generic interface for the OS to store and retrieve error records in the<br />
platform persistent storage, such as NVRAM (Non-volatile RAM). The error records stored<br />
through this interface shall persist across system resets till OS clears it. The OS will use this<br />
interface store error information during critical error handling for later extensive error analysis.<br />
Host firmware shall provide serialization instruction using ACPI specification defined actions to<br />
facilitate read, write and clear error records.<br />
Error Injection Table (EINJ)<br />
One of the important functions required in implementing the error is the ability to inject error<br />
conditions by the OS to evaluate the correct functionality of the entire error handling in the<br />
platform hardware, host firmware and the OS. The EINJ table interface facilitates error injection<br />
from the OS. The host firmware shall provide at minimum one error injection method per error<br />
type supported in the platform. The host firmware will provide the generic error serialization<br />
instructions to trigger the error in the hardware.<br />
9
Boot Error Record Table (BERT)<br />
The BERT table provides the interface to report errors that occurred system boot time. In<br />
addition BERT also can be used to report fatal errors that resulted in surprise system reset or<br />
shutdown before OS had the chance to see the error. The host firmware shall build this table <strong>with</strong><br />
error record entries for each error that occurred.<br />
Generic Error Status block<br />
The Error status block is the standard error data container for communicating detailed error<br />
information log to the OS for generic error sources listed in HEST. The generic error source may<br />
include firmware 1 st errors and non-standard platform errors. The error status block can log<br />
multiple errors until the OS reads them and handles these errors appropriately. The standard x86<br />
MCE and CMC error log information are presented in Machine check bank registers in the<br />
processor. In the same fashion, the PCIe AER standard error information is presented in<br />
PCIe/PCI device’s or bridge’s configuration register space. Given these mechanisms, the OS can<br />
directly read and analyze the data. The error record structure and information are specified in<br />
<strong>UEFI</strong> 2.3.1 [<strong>UEFI</strong>] specification.<br />
<strong>APEI</strong> Error handling models<br />
<strong>APEI</strong> offers two error models, Firmware first model and OS Native model.<br />
Firmware 1 st is used when the host firmware needs to initially examine the error and attempt<br />
recovery or corrective action in an OS transparent way. This model is also used when certain<br />
OEMs want more control over error handling before the OS takes control, such as for purposes<br />
of executing some management functions. In the Firmware 1 st model, all errors are initially<br />
signaled to the host firmware via SMI or other General Purpose Input (GPI) events. Then host<br />
firmware analyzes and decides what to do, and at the end of the flow creates a detailed <strong>APEI</strong><br />
error log <strong>with</strong> FRU information to OS. Finally, the host firmware will then signal the OS about<br />
the existence of the error via SCI, NMI, or other interrupts.<br />
The OS native model, on the other hand, provides handling of the error directly by the OS or OS<br />
level software by directly accessing the hardware registers and analyzing the error. This requires<br />
standard architecture in the hardware for providing error information in the hardware and<br />
signaling, for e.g. industry standard PCIe AER and x86 MCA architectures. This model takes the<br />
burden off of host firmware.<br />
Platforms can use combined model also, where some errors are handled firmware 1 st , some<br />
natively and some both (a.k.a. parallel model). Many of the servers employ this combined model<br />
for better handling and managing the server better via remotely.<br />
Note: For the Firmware 1 st model, the host firmware has to program the processor and chipset to<br />
trigger SMI upon hardware error pins or upon processor/chipset error messages. After host<br />
firmware handling the errors 1 st and building error records for OS, it will generate SCI or NMI to<br />
OS.<br />
10
Platform Error Signaling<br />
Whenever an error detected in the platform hardware, the hardware error will be signaled to host<br />
firmware or OS or both. In x86 architecture, the error signals to host firmware are hardware error<br />
pins, hardware error registers or error messages. These error signals will be typically routed to<br />
trigger System management interrupt (SMI) so that host firmware can handle the errors in a<br />
secluded environment in System management mode (SMM). The error signaling to the OS are<br />
interrupts. There interrupts are x86 architecture interrupts, such as MCE (Machine check<br />
architecture error), Corrected Machine Check (CMC) interrupts or standard IO interrupts such as,<br />
PCE AER (Advanced error reporting) MSI (Message signaling interrupt), PCI interrupt (INTx),<br />
or platform NMI (non-maskable interrupt) and ACPI SCI (System control interrupt). Upon these<br />
interrupts being activated the OS will handle errors in a processor driver, ACPI driver or in an<br />
<strong>APEI</strong> driver.<br />
The <strong>APEI</strong> Error source signaling<br />
The following table shows how the various subsystem errors are signaled to <strong>APEI</strong> domain to<br />
host firmware and OS in different error reporting model.<br />
Firmware 1 st Model Native reporting model<br />
Errors Signal to host firmware Signal to OS Signal to OS<br />
Processor CE SMI SCI CMC<br />
Processor UE & Fatal SMI NMI MCE<br />
Memory CE SMI SCI CMC<br />
Memory UE & Fatal SMI NMI MCE<br />
IO CE SMI SCI PCIe AER - MSI/INTx<br />
IO UE/Fatal SMI NMI PCIe AER - MSI/INTx<br />
Chipset CE SMI SCI CMC or None<br />
Chipset UE/Fatal SMI NMI MCE or None<br />
Hardware CE SMI SCI None<br />
Hardware UE/Fatal SMI NMI None<br />
Summary<br />
This section introduced <strong>APEI</strong> and its constituent elements.<br />
11
<strong>UEFI</strong> Overview<br />
<strong>UEFI</strong> Overview<br />
The implementation of the <strong>APEI</strong> elements described herein is based upon host firmware based<br />
upon the Unified Extensible Firmware Interface (<strong>UEFI</strong>) Platform Initialization (PI) specification<br />
sets. The <strong>UEFI</strong> specifications are purposely silent on construction intent and policy. Instead,<br />
the <strong>UEFI</strong> specification is a pure interface specification that admits to conformance testing of the<br />
API’s.<br />
Figure 2 <strong>UEFI</strong> PI Boot Flow<br />
12
Figure 3 <strong>UEFI</strong> PI Software Layering Diagram<br />
As you can see from the boot flow in Figure 2 above, the SEC, PEI, and DXE all run prior to<br />
having the <strong>UEFI</strong> services available. The <strong>UEFI</strong> specification describes a set of interfaces to the<br />
platform, and the <strong>UEFI</strong> PI DXE phase acts as the <strong>UEFI</strong> core. In fact, the DXE core is the<br />
preferred embodiment of the <strong>UEFI</strong> interfaces. The SEC, PEI, and DXE components are<br />
provided by the platform manufacturer. These PI elements are also referred to as the ‘Green H’,<br />
as shown in Figure 3 above.<br />
ACPI Table Installation<br />
One set of DXE drivers have knowledge of the <strong>APEI</strong> infrastructure and the platform topology.<br />
These drivers populate the respective ACPI tables mentioned earlier. A pointer to the ACPI<br />
tables can be found via a reference in the <strong>UEFI</strong> System Table’s set of system configuration<br />
13
tables. ACPI tables have a GUID definition that can be found in the <strong>UEFI</strong> specification [<strong>UEFI</strong>].<br />
See Figure 4 below<br />
<strong>UEFI</strong> System<br />
Table<br />
<strong>UEFI</strong> PI DXE SMM<br />
DXE Driver<br />
Dispatcher<br />
f<br />
Figure 4 <strong>UEFI</strong> PI flow and ACPI tables<br />
<strong>UEFI</strong> & DXE<br />
Executables<br />
Device, Bus,<br />
or Service<br />
Driver<br />
EFI Boot Services Table<br />
EFI Runtime Services Table<br />
System Configuration Table<br />
- ACPI Tables<br />
Of the DXE components, there is a class of DXE Drivers called DXE SMM. The DXE SMM<br />
drivers support the OS runtime interactions <strong>with</strong> the platform. This can include SMI invocations<br />
from ASL or via pin activations from the CPU and chipset.<br />
14
The <strong>UEFI</strong> PI boot flow augmented <strong>with</strong> the PI SMM driver loading is shown in Figure 5 below.<br />
Pre<br />
Verifier<br />
rify<br />
e<br />
v<br />
Security<br />
(SEC)<br />
PEI<br />
Core<br />
Chipset<br />
Init<br />
Board<br />
Init<br />
Pre EFI<br />
Initialization<br />
(PEI)<br />
<strong>UEFI</strong> PI SMM<br />
Core Services<br />
PI DXE<br />
SMM<br />
Constructor<br />
DXE Driver<br />
Dispatcher<br />
Intrinsic<br />
Services<br />
security<br />
Driver<br />
Execution<br />
Environment<br />
(DXE)<br />
Boot Dev<br />
Select<br />
(BDS)<br />
SMM Handler<br />
Transient<br />
System Load<br />
(TSL)<br />
Run Time<br />
(RT)<br />
After<br />
Life<br />
(AL)<br />
Power on [ . . Platform initialization . . ] [ . . . . OS boot . . . . ] Shutdown<br />
Figure 5 <strong>UEFI</strong> PI SMM Driver load<br />
There can be a plurality of DXE SMM drivers, including but not limited to the <strong>APEI</strong> support<br />
modules. The relationship of a DXE SMM driver to the PI DXE SMM core is shown in Figure 6<br />
15
elow.<br />
SMM Entry<br />
(CPU)<br />
Manage(NULL)<br />
SMM Exit<br />
(CPU)<br />
Root<br />
SMI<br />
Handler<br />
(Driver) (Driver)<br />
SMI<br />
Handler<br />
SMI SMI<br />
Handler<br />
Manage<br />
(GUID1)<br />
SMI<br />
Event<br />
Sources<br />
Figure 6 SMM driver topology<br />
Manage<br />
(GUID2)<br />
Manage(GUID3)<br />
<strong>APEI</strong><br />
Child SMI<br />
Handler Driver<br />
SMI Handler<br />
(Driver)<br />
SMI Handler<br />
SMI Handler<br />
SMI Handler<br />
SMI<br />
Event<br />
Sources<br />
The PI SMM DXE drivers are PE/COFF executables like other DXE and <strong>UEFI</strong> drivers. As such,<br />
the functionality for different capabilities can be delivered as separate <strong>UEFI</strong> PI packagingspecification<br />
based source modules, or more likely going forward, as separate binaries. These<br />
source packages or binaries can be factored such that they are unique to pure software elements<br />
like generic <strong>APEI</strong> or error logging processing, versus other drivers which are specific to a given<br />
chipset family error reporting hardware.<br />
Summary<br />
This section has provided an overview of <strong>UEFI</strong>, PI, and the mechanisms by which ACPI tables<br />
and PI SMM drivers are deployed in the platform.<br />
16
<strong>APEI</strong> Design in <strong>UEFI</strong> PI-based host<br />
firmware<br />
<strong>Implementing</strong> a modular <strong>APEI</strong> error infrastructure in a system involves support in platform<br />
hardware, <strong>UEFI</strong> host firmware and the OS. Each entity has multiple components working<br />
cooperatively as part of supporting <strong>APEI</strong>.<br />
<strong>APEI</strong> Infrastructure in <strong>UEFI</strong><br />
Now that the generic overview of <strong>UEFI</strong> and the PI infrastructure has been provided, Figure 7<br />
below shows all components and their interactions to implement the <strong>APEI</strong> error interfaces. This<br />
modular infrastructure facilitates graceful and collective way of handling errors and recovering<br />
from errors. In addition provides portable design to any platform independent of individual<br />
design harnessing <strong>UEFI</strong> and <strong>UEFI</strong> PI host firmware standards. This section will focus on the<br />
<strong>UEFI</strong> PI host firmware components and how they can be designed in a common and platform<br />
independent model.<br />
17
Figure 7 <strong>APEI</strong> Infrastructures in a <strong>UEFI</strong> host firmware-based System<br />
<strong>UEFI</strong> Modular Approach to <strong>APEI</strong><br />
This section describes about different <strong>UEFI</strong> modules and design that segregates platform<br />
agnostic modules from platform design dependant modules. By following this <strong>UEFI</strong><br />
implementation design, only the platform dependent modules will have to be ported from<br />
18
platform to platform and retain most of the <strong>APEI</strong> implementation same for all implementations.<br />
Figure 8 below shows the different modules that <strong>APEI</strong> implementation can follow using <strong>UEFI</strong> PI<br />
standards. The following section will describe each module in details.<br />
ACPI<br />
Driver<br />
Module<br />
<strong>UEFI</strong><br />
variable<br />
services<br />
Architecture<br />
Modules<br />
<strong>APEI</strong> ACPI Table<br />
& Data definitions<br />
<strong>APEI</strong> Support<br />
Module<br />
Error Injection<br />
Module<br />
Boot Error<br />
Module<br />
Error Persistence<br />
Module<br />
Error Source<br />
Definition<br />
Board HW Error<br />
Module<br />
Platform Agnostic<br />
Platform Dependant<br />
Platform Specific Modules<br />
Error Injection<br />
Definition<br />
SMM<br />
Error<br />
Handler<br />
<strong>APEI</strong> ASL<br />
definitions<br />
Processor/Chipset<br />
Modules<br />
<strong>APEI</strong> Error<br />
Signaling<br />
Processor Error<br />
Module<br />
Chipset Error<br />
Module<br />
Memory Error<br />
Module<br />
IO/PCIe Error<br />
Module<br />
19
Figure 8 <strong>APEI</strong> Modules in <strong>UEFI</strong> host firmware<br />
The <strong>APEI</strong> module in <strong>UEFI</strong> host firmware can classified to three types of modules as shown in<br />
figure 8. The Architecture modules are part of <strong>UEFI</strong> PI infrastructure and common in all<br />
implementations. The Processor/Chipset modules also can be part of <strong>UEFI</strong> PI infrastructure and<br />
common to one processor architecture, for e.g. Intel x86 architecture. This may need little<br />
porting for various processor/chipset generations. The Platform specific modules are platform<br />
design dependant and modified for each platform. This modular design reduces the modification<br />
greatly for platform to platform.<br />
<strong>APEI</strong> Architecture Modules<br />
The AEPI architecture modules will use exiting <strong>UEFI</strong> infrastructure such as ACPI driver and<br />
variable services to support <strong>APEI</strong> standard interface and communication to the OS.<br />
<strong>APEI</strong> Tables and Data definition<br />
This module contains the new ACPI table definitions for <strong>APEI</strong> and data structure for Error<br />
record structures, Error injection and Error persistence interfaces.<br />
<strong>APEI</strong> Support Module<br />
<strong>APEI</strong> support module is the collector all AEPI support in the platform and also provider of<br />
creating <strong>APEI</strong> table, data structure and interfaces for other modules. In addition this module also<br />
will create necessary runtime interfaces for OS <strong>APEI</strong> driver. This module will also enquire<br />
platform modules and create all Error sources that are supported in HEST table. The <strong>APEI</strong><br />
support module is also responsible to create and publish EINJ, ERST and BERT tables on behalf<br />
of other modules by using <strong>UEFI</strong> ACPI standard protocols. This module also will provide runtime<br />
and boot interfaces to create Error records for Os in Error status block container.<br />
Error Injection Module<br />
The Error injection module will use <strong>APEI</strong> support module interfaces and install the support error<br />
injection entries and error injection support calls using the platform specific module interfaces.<br />
Boot Error Module<br />
The Boot error module will use processor/chipset modules to scan for boot time errors and last<br />
boot errors that were not reported to OS. Once it determines and collects the error, it will use<br />
<strong>APEI</strong> module interface to create error record BERT for each errors.<br />
Error Persistence Module<br />
The Error persistence module provide runtime interface to OS for storing/retrieving error<br />
records. This module can use <strong>UEFI</strong> variable services for persistent storage for portability. In<br />
addition this module will also provide proper information to <strong>APEI</strong> module to create ERST table.<br />
Processor/Chipset Modules<br />
Processor chipset modules will provide the supporting infrastructure to read hardware registers<br />
for determining error and collecting error information from various components in the system<br />
and also build detailed error information log for the OS using <strong>APEI</strong> Support module interfaces.<br />
20
The top level error handling is done in runtime by SMM error handler in x86 architecture<br />
platforms. This main error handler will use the other subsystem error handlers to detect and<br />
process respective errors in addition to providing the error log record. The error record shall be<br />
created for all CE, UE and Fatal errors.<br />
Processor Error module<br />
The Processor error module is responsible to detect and analyze processor data errors and<br />
internal errors and perform any recover action that can be done. This module also will create<br />
<strong>UEFI</strong> specified processor error record log and pass it to main error handler.<br />
Chipset Error Module<br />
The Chipset error module is responsible to detect and analyze errors in chipset devices such<br />
legacy bridged and integrated devices and if possible perform any recovery actions. This module<br />
also will create <strong>UEFI</strong> specified processor/chipset error record log and pass it to main error<br />
handler.<br />
Memory Error Module<br />
The memory error module is responsible to detect and analyze memory subsystem errors<br />
including data errors, memory device errors, memory channel errors and redundancy failures.<br />
This module will also initiate any recovery action, such as sparing and mirror failover. This<br />
module also will create <strong>UEFI</strong> specified processor error record log for all memory errors and pass<br />
it to main error handler.<br />
IO/PCIe Error Module<br />
The IO/PCIe error module is responsible to detect and process errors in the IO sub system<br />
including data errors, device errors and PCIe link errors. It will determine and perform if any<br />
recovery action possible such as reset PCIe device. This module also will create <strong>UEFI</strong> specified<br />
processor error record log and pass it to main error handler.<br />
Platform Specific Modules<br />
The platform specific modules are the modules that will require modification from platform to<br />
platform. These define platform specific policies and support component errors and error<br />
injection capabilities and hardware error signaling to OS.<br />
Error Source definition<br />
The module will define the support error sources in the platform. These include both Os native<br />
errors and firmware 1 st errors. The non-standard <strong>APEI</strong> errors can be always reported as generic<br />
hardware errors. In addition this module will list out error notification methods for each error.<br />
The <strong>APEI</strong> support module will use this platform specific information to create necessary <strong>APEI</strong><br />
tables and interfaces to OS.<br />
Error Injection definition<br />
The Error injection definition will list out the types of error injection support in the platform and<br />
platform specific error injection mechanism. In general implementation most of the error<br />
21
injection mechanisms can be moved to Processor/Chipset modules, so that they can be retained<br />
part of common modules.<br />
<strong>APEI</strong> Error signaling<br />
This module will implement the platform hardware specific interfaces to trigger AEPI hardware<br />
error notification, for e.g. SCI, NMI, to OS.<br />
Board Hardware Error module<br />
This module is for any additional platform board specific error reporting and handling. The<br />
errors handled by this module can be reported as generic hardware error sources and logged<br />
same way.<br />
<strong>APEI</strong> ASL definitions<br />
<strong>APEI</strong> ASL module will be platform specific module as it will contain the ASL code and device<br />
definition for support <strong>APEI</strong>. Depending on the platform implementation and policy overrides this<br />
module has to be ported from platform to platform.<br />
Summary<br />
This section provided an overview of the various <strong>APEI</strong> <strong>UEFI</strong> PI modules.<br />
22
Conclusion<br />
Error reporting on modern platforms entails several elements whose activities must be<br />
coordinated. These include platform design, host firmware construction, and error-reporting<br />
aware facilities in the OS and OS software. To deliver this class of functionality in the platform<br />
and host firmware, the interface definitions of ACPI and <strong>UEFI</strong> are leveraged so that an OS can<br />
be designed against these standards and not a specific vendor’s implementation. Underneath the<br />
abstractions afforded by ACPI and <strong>UEFI</strong>, though, the modularity of the <strong>UEFI</strong> PI standards and<br />
rich open source implementations like the <strong>UEFI</strong> Development Kit [EDK2] can be used to<br />
support a rich set of vendor platforms and achieve high code re-use. This re-use helps <strong>with</strong> timeto-market<br />
by not having to re-engineer all of the firmware elements and also accrue the<br />
validation efforts on stable binaries of drivers that do not change across SKU’s and generations<br />
of platforms.<br />
23
Glossary<br />
ACPI – Advanced Configuration and Power Interface. Static tables and ACPI Machine<br />
Language (AML) interpreted byte code. Preferred OS runtime interface to the platform.<br />
<strong>APEI</strong> – ACPI Platform Error Interface. Industry standard platform error interface to OS.<br />
AML – ACPI Machine Language<br />
<strong>APEI</strong> – ACPI Platform Error Interface.<br />
ASL – ACPI Source Language.<br />
<strong>BIOS</strong> – Basic Input Output System. Firmware that executes on the host CPUs. Can be PC/AT<br />
<strong>BIOS</strong> or <strong>UEFI</strong> PI-based.<br />
BMC – Baseboard Management Controller.<br />
CMC – Corrected Machine Check.<br />
MCA – Machine Check Architecture. Error signaling mechanism for x86 CPU’s.<br />
MSI – Message Signaled Interrupt.<br />
PI – Platform Initialization. Volume 1-5 of the <strong>UEFI</strong> PI specifications. Volume 3 includes an<br />
execution mode for SMM.<br />
SMM – System Management Mode. x86 CPU operational mode that is isolated from and<br />
transparent to the operating system runtime<br />
<strong>UEFI</strong> – Unified Extensible Firmware Interface. Firmware interface between the platform and<br />
the operating system. Predominate interfaces are in the boot services (BS) or pre-OS. Few<br />
runtime (RT) services.<br />
WHEA – Windows Hardware Error Architecture. Various error management technologies in<br />
Windows.<br />
24
References<br />
[ACPI] ACPI Specification Revision 5.0 www.acpi.info<br />
[EDK2] <strong>UEFI</strong> Developer Kit www.tianocore.org<br />
[FRAMEWORK] Intel Framework Specifications www.intel.com/technology/framework<br />
[GOOGLE] Bianca Schroeder and Eduardo Pinheiro and Wolf-Dietrich Weber<br />
“DRAM Errors in the Wild: A Large-Scale Field Study,” Sigmetrics 2009<br />
http://research.google.com/pubs/pub35162.html<br />
[<strong>UEFI</strong>] Unified Extensible Firmware Interface (<strong>UEFI</strong>) Specification, Version 2.3.1c<br />
www.uefi.org<br />
[<strong>UEFI</strong> Book] Zimmer,, et al, “Beyond <strong>BIOS</strong>: Developing <strong>with</strong> the Unified Extensible Firmware<br />
Interface,” 2 nd edition, Intel Press, January 2011<br />
[<strong>UEFI</strong> Overview] Zimmer, Rothman, Hale, “<strong>UEFI</strong>: From Reset Vector to Operating System,”<br />
Chapter 3 of Hardware-Dependent Software, Springer, February 2009<br />
[<strong>UEFI</strong> PI Specification] <strong>UEFI</strong> Platform Initialization (PI) Specifications, volumes 1-5, Version<br />
1.2 www.uefi.org<br />
[WHEA] Windows Hardware Error Architecture.<br />
http://msdn.microsoft.com/en-us/library/windows/hardware/gg463286.aspx<br />
25
Authors<br />
Palsamy Sakthikumar (psakthik@gmail.com) is a platform architect <strong>with</strong><br />
the Data Center Group at Intel Corporation.<br />
Vincent J. Zimmer (vincent.zimmer@intel.com) is a Principal Engineer <strong>with</strong><br />
the Software and Services Group at Intel Corporation.<br />
26
This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED "AS IS" WITH NO<br />
WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY,<br />
NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE<br />
ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including<br />
liability for infringement of any proprietary rights, relating to use of information in this specification.<br />
No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted<br />
herein.<br />
Intel, the Intel logo, Intel. leap ahead. and Intel. Leap ahead. logo, and other Intel product name are<br />
trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and<br />
other countries.<br />
*Other names and brands may be claimed as the property of others.<br />
Copyright 2013 by Intel Corporation. All rights reserved<br />
27