27.03.2013 Views

A_Tour_beyond_BIOS_Implementing_APEI_with_UEFI_White_Paper

A_Tour_beyond_BIOS_Implementing_APEI_with_UEFI_White_Paper

A_Tour_beyond_BIOS_Implementing_APEI_with_UEFI_White_Paper

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>White</strong> <strong>Paper</strong><br />

A <strong>Tour</strong> <strong>beyond</strong> <strong>BIOS</strong> <strong>Implementing</strong><br />

the ACPI Platform Error Interface<br />

<strong>with</strong> the Unified Extensible<br />

Firmware Interface<br />

Palsamy Sakthikumar<br />

Intel Corporation<br />

Vincent J. Zimmer<br />

Intel Corporation<br />

January 2013<br />

i


Executive Summary<br />

This paper presents a universal model to design the industry standard WHEA/<strong>APEI</strong><br />

infrastructure in x86 platforms using <strong>UEFI</strong> firmware.<br />

ii


Table of Contents<br />

Overview ...................................................................................................................................... 4<br />

Introduction to <strong>APEI</strong> ............................................................................................................. 4<br />

Goal and Motivation ................................................................................................................. 5<br />

Platform Error Infrastructure ............................................................................................... 6<br />

Error Classifications ............................................................................................................. 7<br />

ACPI Platform Error Interface .............................................................................................. 9<br />

Hardware Error Source Table (HEST) .............................................................................. 9<br />

ERST (Error Record Serialization table (ERST) .............................................................. 9<br />

Error Injection Table (EINJ) ................................................................................................. 9<br />

Boot Error Record Table (BERT) ..................................................................................... 10<br />

Generic Error Status block ................................................................................................ 10<br />

<strong>APEI</strong> Error handling models .............................................................................................. 10<br />

Platform Error Signaling ..................................................................................................... 11<br />

The <strong>APEI</strong> Error source signaling ...................................................................................... 11<br />

<strong>UEFI</strong> Overview ......................................................................................................................... 12<br />

<strong>UEFI</strong> Overview .................................................................................................................... 12<br />

ACPI Table Installation ...................................................................................................... 13<br />

<strong>UEFI</strong> PI DXE SMM ............................................................................................................. 14<br />

<strong>APEI</strong> Design in <strong>UEFI</strong> PI-based host firmware .............................................................. 17<br />

<strong>APEI</strong> Infrastructure in <strong>UEFI</strong> ............................................................................................... 17<br />

<strong>UEFI</strong> Modular Approach to <strong>APEI</strong> ...................................................................................... 18<br />

<strong>APEI</strong> Architecture Modules ............................................................................................... 20<br />

Processor/Chipset Modules .............................................................................................. 20<br />

Platform Specific Modules ................................................................................................. 21<br />

Conclusion ................................................................................................................................. 23<br />

Glossary ...................................................................................................................................... 24<br />

References ................................................................................................................................. 25<br />

iii


Overview<br />

The ACPI Platform Error Interface (<strong>APEI</strong>)/WHEA (Windows Hardware Error Architecture) is<br />

standard defined by ACPI specification for PC and server platforms to report and handle<br />

platform errors in graceful way. It’s important transfer hardware platform error knowledge to<br />

operating systems and management stacks in a meaningful way, in order to perform necessary<br />

corrective actions to recover the system or prevent system from sudden failure.<br />

Introduction to <strong>APEI</strong><br />

As every platform design and implementation are different, OS needs universal way for error<br />

reporting and providing error log information. The <strong>APEI</strong> (ACPI Platform Error Interface)<br />

formally known as WHEA (Windows Hardware Error Architecture) is an ACPI standard. Using<br />

<strong>APEI</strong> and <strong>UEFI</strong> standards, we can easily implement the <strong>APEI</strong> in <strong>UEFI</strong> firmware in a modular<br />

fashion, so that most of the <strong>UEFI</strong> modules are common to all platforms. This commonality<br />

makes it easy to port these capabilities to every platform, including mobile, client and servers.<br />

Summary<br />

This section has provided an overview of error reporting and <strong>APEI</strong>.<br />

4


Goal and Motivation<br />

The ACPI Platform Error Interface (<strong>APEI</strong>) is an industry standard interface defined by the<br />

Advanced Configuration and Power Interface (ACPI) specification. The platform runtime<br />

interface to the platform firmware is largely based upon ACPI, but the Unified Extensible<br />

Firmware Interface (<strong>UEFI</strong>) is a complementary technology that entails a major platform standard<br />

in the industry for platform hardware initialization and OS enabling software. Marrying these to<br />

produce a universal model is key motivation and goal of this paper. Though platforms are<br />

different across different stock keeping units (SKUs) and original equipment manufacturers<br />

(OEMs), most of the <strong>APEI</strong> implementation can be implemented in an interoperable fashion by<br />

applying the modularity offered in <strong>UEFI</strong>.<br />

Error reporting and recovery is a key requirement for the enterprise. In order to hit 5-9’s of<br />

availability, a machine must have maximum uptime. Error reporting forms an important<br />

element of platform Reliability, Availability and Serviceability (RAS) which provides this uptime.<br />

Summary<br />

This section has provided a rationale for leveraging <strong>UEFI</strong> technology to deploy <strong>APEI</strong>.<br />

5


Platform Error Infrastructure<br />

All platforms have different designs and different components that may fail during the life of the<br />

system, such as DRAM failures [GOOGLE]. The error detection and handling is a key function<br />

to determine the failing components and taking corrective action in a timely fashion in order to<br />

prevent any data corruption and also to prolong the life of the system. The error handling is a<br />

cooperative activity between the platform hardware (HW), host firmware (<strong>UEFI</strong> or PC/AT<br />

<strong>BIOS</strong>), and the operating system (OS). The platform errors can be generally classified to<br />

Memory subsystem, IO subsystem, Processor, chipset and platform hardware. The role of the<br />

host firmware and low level out-of-band, management firmware includes configuring the<br />

platform hardware, chipset and processor to detect and report all errors at boot time. In addition<br />

to the pre-OS configuration, when the errors are singled at runtime, these firmware agents read<br />

error registers in order to analyze and report results to the OS promptly in a manner which is<br />

understandable to the OS software. At this point, the OS can initiate appropriate corrective action<br />

and/or inform the administrator to take the actions required. Figure 1 shows the typical platform<br />

error infrastructure and how the errors get propagated from hardware to the OS <strong>with</strong> the help of<br />

the host firmware.<br />

The errors from all the subsystems get signaled to host firmware or the OS directly. When<br />

signaled to the host firmware, the host firmware will read the hardware registers, analyze the<br />

component that generated the error and assess the severity of the error. Host firmware will create<br />

detailed error log information for the OS and notify the OS of the error’s occurrence. Host<br />

firmware may additionally generate a platform log and communicate this log to the management<br />

controllers out of band firmware for system management purposes. Such controllers include but<br />

are not limited to baseboard management controllers (BMC). When the error notification is<br />

conveyed to the OS either directly or from host firmware, the OS will inspect the hardware<br />

registers or host firmware created error log to further analyze the error and initiate corrective<br />

actions. The OS may also inform the administrator to take any corrective action as well.<br />

6


Figure 1 Platform Error Infrastructure<br />

The platform hardware can also provide error injection functionality. For high availability<br />

systems such as enterprise servers, the OS needs a software method or API to inject hardware<br />

errors into platform for purposes of validating the functionality of the error reporting and logging<br />

mechanisms.<br />

Error Classifications<br />

Generally all platform errors can be classified to three categories.<br />

7


• Corrected errors (CE): These are hardware corrected errors or recovered errors, such as<br />

single bit memory errors or memory failover.<br />

• Uncorrected errors (UE): These are errors hardware could not correct or recover. There<br />

is a possibility for software or the OS to recover from this error.<br />

• Fatal errors: These are uncorrectable errors from which neither software or hardware<br />

could recover. Continuing to run in the face of these errors could make system<br />

unreliable.<br />

Some uncorrected errors and all fatal error conditions require immediate containment to prevent<br />

any data corruption. This may require a system shut down or reset to recover or replacing the<br />

failing component, also referred as FRU (Field replacement unit). As such, it is necessary for the<br />

error reporting mechanisms to isolate the error to respective FRU units.<br />

Summary<br />

This section introduced the overall concept of error initiation and reporting.<br />

8


ACPI Platform Error Interface<br />

As every platform design and implementation are different, OS needs universal way for error<br />

reporting and providing error log information. The <strong>APEI</strong> (ACPI Platform Error Interface)<br />

formally known as WHEA (Windows Hardware Error Architecture) is an ACPI standard<br />

defining error reporting interfaces to OS. Using <strong>APEI</strong> and <strong>UEFI</strong> standards, we can easily<br />

implement the <strong>APEI</strong> in <strong>UEFI</strong> host firmware modularly, so that most of the <strong>UEFI</strong> modules are<br />

common to all platforms and making it easy portable for every platform including mobile, client<br />

and servers. The <strong>APEI</strong> defines four ACPI tables for declaring platform error infrastructure and<br />

Error record container for communicating error information to OS.<br />

Hardware Error Source Table (HEST)<br />

The HEST table enables host firmware to declare all errors that platform component can generate<br />

and error signaling for those. The host firmware shall create Error source entries in HEST for<br />

each component (such as, processor, PCIe device, PCIe bridge, etc) and each type of error <strong>with</strong><br />

corresponding error notification mechanism (singling) to OS. These error entries include x86<br />

architectural errors, industry standard errors and generic hardware error source for platform<br />

errors. The x86 architectural errors, MCE and CMC, and standard errors PCIe AER, MSI and<br />

PCI INTx can be handled by OS natively. The generic hardware error source can be used for all<br />

firmware 1 st errors and platform errors (such as memory, board logic) that do not have OS native<br />

signaling, so they have to use platform signaling SCI or NMI.<br />

ERST (Error Record Serialization table (ERST)<br />

The ERST table provides a generic interface for the OS to store and retrieve error records in the<br />

platform persistent storage, such as NVRAM (Non-volatile RAM). The error records stored<br />

through this interface shall persist across system resets till OS clears it. The OS will use this<br />

interface store error information during critical error handling for later extensive error analysis.<br />

Host firmware shall provide serialization instruction using ACPI specification defined actions to<br />

facilitate read, write and clear error records.<br />

Error Injection Table (EINJ)<br />

One of the important functions required in implementing the error is the ability to inject error<br />

conditions by the OS to evaluate the correct functionality of the entire error handling in the<br />

platform hardware, host firmware and the OS. The EINJ table interface facilitates error injection<br />

from the OS. The host firmware shall provide at minimum one error injection method per error<br />

type supported in the platform. The host firmware will provide the generic error serialization<br />

instructions to trigger the error in the hardware.<br />

9


Boot Error Record Table (BERT)<br />

The BERT table provides the interface to report errors that occurred system boot time. In<br />

addition BERT also can be used to report fatal errors that resulted in surprise system reset or<br />

shutdown before OS had the chance to see the error. The host firmware shall build this table <strong>with</strong><br />

error record entries for each error that occurred.<br />

Generic Error Status block<br />

The Error status block is the standard error data container for communicating detailed error<br />

information log to the OS for generic error sources listed in HEST. The generic error source may<br />

include firmware 1 st errors and non-standard platform errors. The error status block can log<br />

multiple errors until the OS reads them and handles these errors appropriately. The standard x86<br />

MCE and CMC error log information are presented in Machine check bank registers in the<br />

processor. In the same fashion, the PCIe AER standard error information is presented in<br />

PCIe/PCI device’s or bridge’s configuration register space. Given these mechanisms, the OS can<br />

directly read and analyze the data. The error record structure and information are specified in<br />

<strong>UEFI</strong> 2.3.1 [<strong>UEFI</strong>] specification.<br />

<strong>APEI</strong> Error handling models<br />

<strong>APEI</strong> offers two error models, Firmware first model and OS Native model.<br />

Firmware 1 st is used when the host firmware needs to initially examine the error and attempt<br />

recovery or corrective action in an OS transparent way. This model is also used when certain<br />

OEMs want more control over error handling before the OS takes control, such as for purposes<br />

of executing some management functions. In the Firmware 1 st model, all errors are initially<br />

signaled to the host firmware via SMI or other General Purpose Input (GPI) events. Then host<br />

firmware analyzes and decides what to do, and at the end of the flow creates a detailed <strong>APEI</strong><br />

error log <strong>with</strong> FRU information to OS. Finally, the host firmware will then signal the OS about<br />

the existence of the error via SCI, NMI, or other interrupts.<br />

The OS native model, on the other hand, provides handling of the error directly by the OS or OS<br />

level software by directly accessing the hardware registers and analyzing the error. This requires<br />

standard architecture in the hardware for providing error information in the hardware and<br />

signaling, for e.g. industry standard PCIe AER and x86 MCA architectures. This model takes the<br />

burden off of host firmware.<br />

Platforms can use combined model also, where some errors are handled firmware 1 st , some<br />

natively and some both (a.k.a. parallel model). Many of the servers employ this combined model<br />

for better handling and managing the server better via remotely.<br />

Note: For the Firmware 1 st model, the host firmware has to program the processor and chipset to<br />

trigger SMI upon hardware error pins or upon processor/chipset error messages. After host<br />

firmware handling the errors 1 st and building error records for OS, it will generate SCI or NMI to<br />

OS.<br />

10


Platform Error Signaling<br />

Whenever an error detected in the platform hardware, the hardware error will be signaled to host<br />

firmware or OS or both. In x86 architecture, the error signals to host firmware are hardware error<br />

pins, hardware error registers or error messages. These error signals will be typically routed to<br />

trigger System management interrupt (SMI) so that host firmware can handle the errors in a<br />

secluded environment in System management mode (SMM). The error signaling to the OS are<br />

interrupts. There interrupts are x86 architecture interrupts, such as MCE (Machine check<br />

architecture error), Corrected Machine Check (CMC) interrupts or standard IO interrupts such as,<br />

PCE AER (Advanced error reporting) MSI (Message signaling interrupt), PCI interrupt (INTx),<br />

or platform NMI (non-maskable interrupt) and ACPI SCI (System control interrupt). Upon these<br />

interrupts being activated the OS will handle errors in a processor driver, ACPI driver or in an<br />

<strong>APEI</strong> driver.<br />

The <strong>APEI</strong> Error source signaling<br />

The following table shows how the various subsystem errors are signaled to <strong>APEI</strong> domain to<br />

host firmware and OS in different error reporting model.<br />

Firmware 1 st Model Native reporting model<br />

Errors Signal to host firmware Signal to OS Signal to OS<br />

Processor CE SMI SCI CMC<br />

Processor UE & Fatal SMI NMI MCE<br />

Memory CE SMI SCI CMC<br />

Memory UE & Fatal SMI NMI MCE<br />

IO CE SMI SCI PCIe AER - MSI/INTx<br />

IO UE/Fatal SMI NMI PCIe AER - MSI/INTx<br />

Chipset CE SMI SCI CMC or None<br />

Chipset UE/Fatal SMI NMI MCE or None<br />

Hardware CE SMI SCI None<br />

Hardware UE/Fatal SMI NMI None<br />

Summary<br />

This section introduced <strong>APEI</strong> and its constituent elements.<br />

11


<strong>UEFI</strong> Overview<br />

<strong>UEFI</strong> Overview<br />

The implementation of the <strong>APEI</strong> elements described herein is based upon host firmware based<br />

upon the Unified Extensible Firmware Interface (<strong>UEFI</strong>) Platform Initialization (PI) specification<br />

sets. The <strong>UEFI</strong> specifications are purposely silent on construction intent and policy. Instead,<br />

the <strong>UEFI</strong> specification is a pure interface specification that admits to conformance testing of the<br />

API’s.<br />

Figure 2 <strong>UEFI</strong> PI Boot Flow<br />

12


Figure 3 <strong>UEFI</strong> PI Software Layering Diagram<br />

As you can see from the boot flow in Figure 2 above, the SEC, PEI, and DXE all run prior to<br />

having the <strong>UEFI</strong> services available. The <strong>UEFI</strong> specification describes a set of interfaces to the<br />

platform, and the <strong>UEFI</strong> PI DXE phase acts as the <strong>UEFI</strong> core. In fact, the DXE core is the<br />

preferred embodiment of the <strong>UEFI</strong> interfaces. The SEC, PEI, and DXE components are<br />

provided by the platform manufacturer. These PI elements are also referred to as the ‘Green H’,<br />

as shown in Figure 3 above.<br />

ACPI Table Installation<br />

One set of DXE drivers have knowledge of the <strong>APEI</strong> infrastructure and the platform topology.<br />

These drivers populate the respective ACPI tables mentioned earlier. A pointer to the ACPI<br />

tables can be found via a reference in the <strong>UEFI</strong> System Table’s set of system configuration<br />

13


tables. ACPI tables have a GUID definition that can be found in the <strong>UEFI</strong> specification [<strong>UEFI</strong>].<br />

See Figure 4 below<br />

<strong>UEFI</strong> System<br />

Table<br />

<strong>UEFI</strong> PI DXE SMM<br />

DXE Driver<br />

Dispatcher<br />

f<br />

Figure 4 <strong>UEFI</strong> PI flow and ACPI tables<br />

<strong>UEFI</strong> & DXE<br />

Executables<br />

Device, Bus,<br />

or Service<br />

Driver<br />

EFI Boot Services Table<br />

EFI Runtime Services Table<br />

System Configuration Table<br />

- ACPI Tables<br />

Of the DXE components, there is a class of DXE Drivers called DXE SMM. The DXE SMM<br />

drivers support the OS runtime interactions <strong>with</strong> the platform. This can include SMI invocations<br />

from ASL or via pin activations from the CPU and chipset.<br />

14


The <strong>UEFI</strong> PI boot flow augmented <strong>with</strong> the PI SMM driver loading is shown in Figure 5 below.<br />

Pre<br />

Verifier<br />

rify<br />

e<br />

v<br />

Security<br />

(SEC)<br />

PEI<br />

Core<br />

Chipset<br />

Init<br />

Board<br />

Init<br />

Pre EFI<br />

Initialization<br />

(PEI)<br />

<strong>UEFI</strong> PI SMM<br />

Core Services<br />

PI DXE<br />

SMM<br />

Constructor<br />

DXE Driver<br />

Dispatcher<br />

Intrinsic<br />

Services<br />

security<br />

Driver<br />

Execution<br />

Environment<br />

(DXE)<br />

Boot Dev<br />

Select<br />

(BDS)<br />

SMM Handler<br />

Transient<br />

System Load<br />

(TSL)<br />

Run Time<br />

(RT)<br />

After<br />

Life<br />

(AL)<br />

Power on [ . . Platform initialization . . ] [ . . . . OS boot . . . . ] Shutdown<br />

Figure 5 <strong>UEFI</strong> PI SMM Driver load<br />

There can be a plurality of DXE SMM drivers, including but not limited to the <strong>APEI</strong> support<br />

modules. The relationship of a DXE SMM driver to the PI DXE SMM core is shown in Figure 6<br />

15


elow.<br />

SMM Entry<br />

(CPU)<br />

Manage(NULL)<br />

SMM Exit<br />

(CPU)<br />

Root<br />

SMI<br />

Handler<br />

(Driver) (Driver)<br />

SMI<br />

Handler<br />

SMI SMI<br />

Handler<br />

Manage<br />

(GUID1)<br />

SMI<br />

Event<br />

Sources<br />

Figure 6 SMM driver topology<br />

Manage<br />

(GUID2)<br />

Manage(GUID3)<br />

<strong>APEI</strong><br />

Child SMI<br />

Handler Driver<br />

SMI Handler<br />

(Driver)<br />

SMI Handler<br />

SMI Handler<br />

SMI Handler<br />

SMI<br />

Event<br />

Sources<br />

The PI SMM DXE drivers are PE/COFF executables like other DXE and <strong>UEFI</strong> drivers. As such,<br />

the functionality for different capabilities can be delivered as separate <strong>UEFI</strong> PI packagingspecification<br />

based source modules, or more likely going forward, as separate binaries. These<br />

source packages or binaries can be factored such that they are unique to pure software elements<br />

like generic <strong>APEI</strong> or error logging processing, versus other drivers which are specific to a given<br />

chipset family error reporting hardware.<br />

Summary<br />

This section has provided an overview of <strong>UEFI</strong>, PI, and the mechanisms by which ACPI tables<br />

and PI SMM drivers are deployed in the platform.<br />

16


<strong>APEI</strong> Design in <strong>UEFI</strong> PI-based host<br />

firmware<br />

<strong>Implementing</strong> a modular <strong>APEI</strong> error infrastructure in a system involves support in platform<br />

hardware, <strong>UEFI</strong> host firmware and the OS. Each entity has multiple components working<br />

cooperatively as part of supporting <strong>APEI</strong>.<br />

<strong>APEI</strong> Infrastructure in <strong>UEFI</strong><br />

Now that the generic overview of <strong>UEFI</strong> and the PI infrastructure has been provided, Figure 7<br />

below shows all components and their interactions to implement the <strong>APEI</strong> error interfaces. This<br />

modular infrastructure facilitates graceful and collective way of handling errors and recovering<br />

from errors. In addition provides portable design to any platform independent of individual<br />

design harnessing <strong>UEFI</strong> and <strong>UEFI</strong> PI host firmware standards. This section will focus on the<br />

<strong>UEFI</strong> PI host firmware components and how they can be designed in a common and platform<br />

independent model.<br />

17


Figure 7 <strong>APEI</strong> Infrastructures in a <strong>UEFI</strong> host firmware-based System<br />

<strong>UEFI</strong> Modular Approach to <strong>APEI</strong><br />

This section describes about different <strong>UEFI</strong> modules and design that segregates platform<br />

agnostic modules from platform design dependant modules. By following this <strong>UEFI</strong><br />

implementation design, only the platform dependent modules will have to be ported from<br />

18


platform to platform and retain most of the <strong>APEI</strong> implementation same for all implementations.<br />

Figure 8 below shows the different modules that <strong>APEI</strong> implementation can follow using <strong>UEFI</strong> PI<br />

standards. The following section will describe each module in details.<br />

ACPI<br />

Driver<br />

Module<br />

<strong>UEFI</strong><br />

variable<br />

services<br />

Architecture<br />

Modules<br />

<strong>APEI</strong> ACPI Table<br />

& Data definitions<br />

<strong>APEI</strong> Support<br />

Module<br />

Error Injection<br />

Module<br />

Boot Error<br />

Module<br />

Error Persistence<br />

Module<br />

Error Source<br />

Definition<br />

Board HW Error<br />

Module<br />

Platform Agnostic<br />

Platform Dependant<br />

Platform Specific Modules<br />

Error Injection<br />

Definition<br />

SMM<br />

Error<br />

Handler<br />

<strong>APEI</strong> ASL<br />

definitions<br />

Processor/Chipset<br />

Modules<br />

<strong>APEI</strong> Error<br />

Signaling<br />

Processor Error<br />

Module<br />

Chipset Error<br />

Module<br />

Memory Error<br />

Module<br />

IO/PCIe Error<br />

Module<br />

19


Figure 8 <strong>APEI</strong> Modules in <strong>UEFI</strong> host firmware<br />

The <strong>APEI</strong> module in <strong>UEFI</strong> host firmware can classified to three types of modules as shown in<br />

figure 8. The Architecture modules are part of <strong>UEFI</strong> PI infrastructure and common in all<br />

implementations. The Processor/Chipset modules also can be part of <strong>UEFI</strong> PI infrastructure and<br />

common to one processor architecture, for e.g. Intel x86 architecture. This may need little<br />

porting for various processor/chipset generations. The Platform specific modules are platform<br />

design dependant and modified for each platform. This modular design reduces the modification<br />

greatly for platform to platform.<br />

<strong>APEI</strong> Architecture Modules<br />

The AEPI architecture modules will use exiting <strong>UEFI</strong> infrastructure such as ACPI driver and<br />

variable services to support <strong>APEI</strong> standard interface and communication to the OS.<br />

<strong>APEI</strong> Tables and Data definition<br />

This module contains the new ACPI table definitions for <strong>APEI</strong> and data structure for Error<br />

record structures, Error injection and Error persistence interfaces.<br />

<strong>APEI</strong> Support Module<br />

<strong>APEI</strong> support module is the collector all AEPI support in the platform and also provider of<br />

creating <strong>APEI</strong> table, data structure and interfaces for other modules. In addition this module also<br />

will create necessary runtime interfaces for OS <strong>APEI</strong> driver. This module will also enquire<br />

platform modules and create all Error sources that are supported in HEST table. The <strong>APEI</strong><br />

support module is also responsible to create and publish EINJ, ERST and BERT tables on behalf<br />

of other modules by using <strong>UEFI</strong> ACPI standard protocols. This module also will provide runtime<br />

and boot interfaces to create Error records for Os in Error status block container.<br />

Error Injection Module<br />

The Error injection module will use <strong>APEI</strong> support module interfaces and install the support error<br />

injection entries and error injection support calls using the platform specific module interfaces.<br />

Boot Error Module<br />

The Boot error module will use processor/chipset modules to scan for boot time errors and last<br />

boot errors that were not reported to OS. Once it determines and collects the error, it will use<br />

<strong>APEI</strong> module interface to create error record BERT for each errors.<br />

Error Persistence Module<br />

The Error persistence module provide runtime interface to OS for storing/retrieving error<br />

records. This module can use <strong>UEFI</strong> variable services for persistent storage for portability. In<br />

addition this module will also provide proper information to <strong>APEI</strong> module to create ERST table.<br />

Processor/Chipset Modules<br />

Processor chipset modules will provide the supporting infrastructure to read hardware registers<br />

for determining error and collecting error information from various components in the system<br />

and also build detailed error information log for the OS using <strong>APEI</strong> Support module interfaces.<br />

20


The top level error handling is done in runtime by SMM error handler in x86 architecture<br />

platforms. This main error handler will use the other subsystem error handlers to detect and<br />

process respective errors in addition to providing the error log record. The error record shall be<br />

created for all CE, UE and Fatal errors.<br />

Processor Error module<br />

The Processor error module is responsible to detect and analyze processor data errors and<br />

internal errors and perform any recover action that can be done. This module also will create<br />

<strong>UEFI</strong> specified processor error record log and pass it to main error handler.<br />

Chipset Error Module<br />

The Chipset error module is responsible to detect and analyze errors in chipset devices such<br />

legacy bridged and integrated devices and if possible perform any recovery actions. This module<br />

also will create <strong>UEFI</strong> specified processor/chipset error record log and pass it to main error<br />

handler.<br />

Memory Error Module<br />

The memory error module is responsible to detect and analyze memory subsystem errors<br />

including data errors, memory device errors, memory channel errors and redundancy failures.<br />

This module will also initiate any recovery action, such as sparing and mirror failover. This<br />

module also will create <strong>UEFI</strong> specified processor error record log for all memory errors and pass<br />

it to main error handler.<br />

IO/PCIe Error Module<br />

The IO/PCIe error module is responsible to detect and process errors in the IO sub system<br />

including data errors, device errors and PCIe link errors. It will determine and perform if any<br />

recovery action possible such as reset PCIe device. This module also will create <strong>UEFI</strong> specified<br />

processor error record log and pass it to main error handler.<br />

Platform Specific Modules<br />

The platform specific modules are the modules that will require modification from platform to<br />

platform. These define platform specific policies and support component errors and error<br />

injection capabilities and hardware error signaling to OS.<br />

Error Source definition<br />

The module will define the support error sources in the platform. These include both Os native<br />

errors and firmware 1 st errors. The non-standard <strong>APEI</strong> errors can be always reported as generic<br />

hardware errors. In addition this module will list out error notification methods for each error.<br />

The <strong>APEI</strong> support module will use this platform specific information to create necessary <strong>APEI</strong><br />

tables and interfaces to OS.<br />

Error Injection definition<br />

The Error injection definition will list out the types of error injection support in the platform and<br />

platform specific error injection mechanism. In general implementation most of the error<br />

21


injection mechanisms can be moved to Processor/Chipset modules, so that they can be retained<br />

part of common modules.<br />

<strong>APEI</strong> Error signaling<br />

This module will implement the platform hardware specific interfaces to trigger AEPI hardware<br />

error notification, for e.g. SCI, NMI, to OS.<br />

Board Hardware Error module<br />

This module is for any additional platform board specific error reporting and handling. The<br />

errors handled by this module can be reported as generic hardware error sources and logged<br />

same way.<br />

<strong>APEI</strong> ASL definitions<br />

<strong>APEI</strong> ASL module will be platform specific module as it will contain the ASL code and device<br />

definition for support <strong>APEI</strong>. Depending on the platform implementation and policy overrides this<br />

module has to be ported from platform to platform.<br />

Summary<br />

This section provided an overview of the various <strong>APEI</strong> <strong>UEFI</strong> PI modules.<br />

22


Conclusion<br />

Error reporting on modern platforms entails several elements whose activities must be<br />

coordinated. These include platform design, host firmware construction, and error-reporting<br />

aware facilities in the OS and OS software. To deliver this class of functionality in the platform<br />

and host firmware, the interface definitions of ACPI and <strong>UEFI</strong> are leveraged so that an OS can<br />

be designed against these standards and not a specific vendor’s implementation. Underneath the<br />

abstractions afforded by ACPI and <strong>UEFI</strong>, though, the modularity of the <strong>UEFI</strong> PI standards and<br />

rich open source implementations like the <strong>UEFI</strong> Development Kit [EDK2] can be used to<br />

support a rich set of vendor platforms and achieve high code re-use. This re-use helps <strong>with</strong> timeto-market<br />

by not having to re-engineer all of the firmware elements and also accrue the<br />

validation efforts on stable binaries of drivers that do not change across SKU’s and generations<br />

of platforms.<br />

23


Glossary<br />

ACPI – Advanced Configuration and Power Interface. Static tables and ACPI Machine<br />

Language (AML) interpreted byte code. Preferred OS runtime interface to the platform.<br />

<strong>APEI</strong> – ACPI Platform Error Interface. Industry standard platform error interface to OS.<br />

AML – ACPI Machine Language<br />

<strong>APEI</strong> – ACPI Platform Error Interface.<br />

ASL – ACPI Source Language.<br />

<strong>BIOS</strong> – Basic Input Output System. Firmware that executes on the host CPUs. Can be PC/AT<br />

<strong>BIOS</strong> or <strong>UEFI</strong> PI-based.<br />

BMC – Baseboard Management Controller.<br />

CMC – Corrected Machine Check.<br />

MCA – Machine Check Architecture. Error signaling mechanism for x86 CPU’s.<br />

MSI – Message Signaled Interrupt.<br />

PI – Platform Initialization. Volume 1-5 of the <strong>UEFI</strong> PI specifications. Volume 3 includes an<br />

execution mode for SMM.<br />

SMM – System Management Mode. x86 CPU operational mode that is isolated from and<br />

transparent to the operating system runtime<br />

<strong>UEFI</strong> – Unified Extensible Firmware Interface. Firmware interface between the platform and<br />

the operating system. Predominate interfaces are in the boot services (BS) or pre-OS. Few<br />

runtime (RT) services.<br />

WHEA – Windows Hardware Error Architecture. Various error management technologies in<br />

Windows.<br />

24


References<br />

[ACPI] ACPI Specification Revision 5.0 www.acpi.info<br />

[EDK2] <strong>UEFI</strong> Developer Kit www.tianocore.org<br />

[FRAMEWORK] Intel Framework Specifications www.intel.com/technology/framework<br />

[GOOGLE] Bianca Schroeder and Eduardo Pinheiro and Wolf-Dietrich Weber<br />

“DRAM Errors in the Wild: A Large-Scale Field Study,” Sigmetrics 2009<br />

http://research.google.com/pubs/pub35162.html<br />

[<strong>UEFI</strong>] Unified Extensible Firmware Interface (<strong>UEFI</strong>) Specification, Version 2.3.1c<br />

www.uefi.org<br />

[<strong>UEFI</strong> Book] Zimmer,, et al, “Beyond <strong>BIOS</strong>: Developing <strong>with</strong> the Unified Extensible Firmware<br />

Interface,” 2 nd edition, Intel Press, January 2011<br />

[<strong>UEFI</strong> Overview] Zimmer, Rothman, Hale, “<strong>UEFI</strong>: From Reset Vector to Operating System,”<br />

Chapter 3 of Hardware-Dependent Software, Springer, February 2009<br />

[<strong>UEFI</strong> PI Specification] <strong>UEFI</strong> Platform Initialization (PI) Specifications, volumes 1-5, Version<br />

1.2 www.uefi.org<br />

[WHEA] Windows Hardware Error Architecture.<br />

http://msdn.microsoft.com/en-us/library/windows/hardware/gg463286.aspx<br />

25


Authors<br />

Palsamy Sakthikumar (psakthik@gmail.com) is a platform architect <strong>with</strong><br />

the Data Center Group at Intel Corporation.<br />

Vincent J. Zimmer (vincent.zimmer@intel.com) is a Principal Engineer <strong>with</strong><br />

the Software and Services Group at Intel Corporation.<br />

26


This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED "AS IS" WITH NO<br />

WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY,<br />

NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE<br />

ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including<br />

liability for infringement of any proprietary rights, relating to use of information in this specification.<br />

No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted<br />

herein.<br />

Intel, the Intel logo, Intel. leap ahead. and Intel. Leap ahead. logo, and other Intel product name are<br />

trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and<br />

other countries.<br />

*Other names and brands may be claimed as the property of others.<br />

Copyright 2013 by Intel Corporation. All rights reserved<br />

27

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!