Toward Systematic Design of Fault-Tolerant Systems


... returning the system to a previous, error-free state, or forward, constructing a valid, error-free new state from existing (usually redundant) information. A recovery sequence includes fault diagnosis and removal, error elimination, state restoration, and recovery validation. When diagnosis identifies a permanently faulty subsystem, fault removal is performed by either substituting a good spare subsystem or reconfiguring the system to function without the faulty subsystem. Error elimination and state restoration complete the recovery. Independent validation of successful recovery is desirable for every subsystem. Two special cases of recovery are error correction, which allows a subsystem (for example, memory) to continue with a permanent fault, and masking redundancy (for example, triple modular redundancy with voting), which masks a fault’s presence without further recovery action.

Systemwide integration. The desired result of system partitioning and subsystem design is an integrated set of local, intermediate, and global fault tolerance functions that serve as a protective infrastructure to ensure the timely and correct delivery of system services.

A systematically designed infrastructure is autonomous; that is, it does not depend on other parts of the system (operating system, applications, personnel, and so on) for support. It is also distributed and fault tolerant itself. It has dedicated communication links and can also use the main links. And it is managed by a highly protected “hard core” subsystem (or a hierarchy of such subsystems) that executes global decisions to assure system recovery. These properties are analogous to those of the human immune system.

This phase has two major goals: to verify the infrastructure’s completeness and consistency and to evaluate its ability to handle two or more nearly concurrent fault manifestations.
To accomplish these goals, we need in-depth analysis and experimental fault injection, using the fault and error scenarios developed during specification.

Evaluation

Evaluation of fault tolerance is continuous during system partitioning, subsystem design, and system integration. At each step evaluation is an important design tool that facilitates the choices between fault tolerance techniques and assesses the likelihood of meeting the dependability goals. Successful completion of design requires a convincing verification of the design’s completeness and its potential to meet the dependability goals. Verification consists of two distinct evaluations: first qualitative, then quantitative. Qualitative evaluation generates deterministic predictions. It must be satisfied prior to quantitative evaluation, which generates probabilistic predictions. Otherwise, the evaluation may generate unreasonably optimistic predictions of availability and reliability because unjustified simplifications will go unnoticed.

Qualitative evaluation: Deterministic goals. The outcome of qualitative evaluation is a yes/no conclusion with respect to four goals:

• Fault tolerance completeness and consistency. This evaluation is part of systemwide integration. Here we use checklists of questions derived from the design paradigm, and experimental fault injection using worst-case scenarios.
• Absolute tolerance. This evaluates whether the system can survive one (or more than one) fault from a specified set and then execute a safe shutdown, usually stated as fail operational/fail safe. A detailed design analysis is needed to prove this property.
• Absence of design faults. Here formal and heuristic methods such as proof of correctness, testing, and experimentation are applicable.
• System security goals. To evaluate deterministic security requirements, we use the same tools used to evaluate the absence of design faults.

Quantitative evaluation: Probabilistic goals.
This evaluation requires three steps:

• Describe the design using a system evaluation model that is characterized by sets of physical, structural, repair, fault tolerance, and performance parameters for every subsystem.
• Obtain coverage and execution-time parameters for all local, intermediate, and global detection and recovery functions. Fault injection experiments are essential for this task.
• Use the model to predict system reliability, availability, maintainability, and safety. The existence of multiple service modes necessitates the prediction of mean time between mode reductions and duration of mode reduction in suitable measures (mean, 99th percentile, and so on) instead of a single availability prediction.

Modification

An existing system is modified for repair (the removal of newly discovered faults) or for augmentation of functionality, performance, and/or fault tolerance. In both cases subsystems are modified or new subsystems are added. When this happens, it is essential to modify the specification first and then reimplement the subsystems with a complete reexamination and reevaluation of detection and recovery functions.

54 Computer
Failure to return to the specification may cause gaps in fault tolerance protection.

NASA experienced such a gap on April 10, 1981. A timely synchronization check was omitted after the addition of an alternate reentry program. As a result, the first flight of the US space shuttle program was aborted 19 minutes before launch.

OFF-THE-SHELF APPROACH

The bottom-up approach of the design paradigm results in fault-tolerant systems that are composed of an integrated set of fault-tolerant subsystems. However, development time and cost constraints often lead developers to use off-the-shelf subsystems (including microprocessors, operating systems, and applications) as building blocks in the design of systems that are expected to be highly dependable. OTS items usually have few fault tolerance functions, sometimes none at all.

Pentium Pro limitations

To illustrate the nature of fault tolerance functions in, for example, OTS microprocessors, let’s consider the Intel Pentium Pro. [8] Compared with Sun UltraSparc II, MIPS 10000, HP PA-8000, DEC Alpha 21164, and IBM/Apple/Motorola PowerPC 620 microprocessors, the Pentium Pro appears to have the most complete set of fault tolerance functions among contemporary microprocessors.

An ancestor of the Pentium Pro, Intel’s 486, provided parity checking for data bytes. The Pentium added address parity and introduced parity checks for cache, translation lookaside buffer, and microcode storage arrays; it also introduced a Machine Check Exception with address and type registers. In addition, the Pentium reintroduced the master/checker duplexing (functional redundancy checking) option that Intel pioneered in the 432 processor chips.

The Pentium Pro integrates five Pentium components into a single component. It retains all Pentium fault tolerance techniques, replacing data-byte parity with eight ECC (error-correcting code) bits for SEC/DED (single-error correction/double-error detection) operations.
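To make the SEC/DED idea concrete, here is a minimal sketch of an extended Hamming code over 4 data bits: three Hamming check bits locate a single flipped bit, and one overall parity bit distinguishes a single error (correctable) from a double error (detectable only). This is a toy illustration, not the Pentium Pro's actual code, whose construction the text does not describe; the function names `encode` and `decode` and the bit layout are assumptions made for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits).

    Hypothetical layout: positions 1..7 form a Hamming(7,4) code with check
    bits at positions 1, 2, and 4; position 0 holds an overall parity bit.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions holding a 1
    overall = sum(code) % 2                    # 1 if an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single error. The syndrome is the position
        # of the flipped bit (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, detect only.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Calling `decode(encode(n))` on an undamaged codeword returns the status `"ok"` with the original value; flipping any one bit of the codeword yields `"corrected"`, and flipping any two yields `"uncorrectable"`. A wider memory word follows the same principle with more check bits.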
The Pentium Pro uses two parity bits and provides a retry for the address bus, and it includes parity bits for two groups of control signals. The Machine Check Exception is generalized into a Machine Check Architecture with three global control registers and five banks of four error-reporting registers each. [8]

We can see that as chip complexity increases, more fault tolerance functions are added; however, major OTS drawbacks remain in the Pentium Pro:

• Protection by parity and ECC is limited to storage arrays and communication links, which are easy to check. The more complex data- and instruction-processing logic remains unchecked.
• The extensive system developer’s documentation [8] (more than 1,400 pages) commingles error handling with all other information. There is no comprehensive top-down view of the fault tolerance techniques and their interrelationships. Management of most error conditions is relegated to a “central agent,” which remains unexplained. Developers run a significant risk of overlooking or misinterpreting the details of error handling.
• Use of the Machine Check Architecture is optional and must be enabled by software. This leaves open the possibility of its accidental or malicious disabling during operation. The Machine Check Exception Handler software merely logs machine status and error information and then shuts down the system, since there are no on-chip recovery procedures to invoke.
• The master/checker duplexing is a throwaway solution at twice the cost of one microprocessor. Error detection is delayed until the error reaches the component’s output; by then, shutdown, BIST, and restart is the only recovery option left, even if the cause was only a soft error that a local retry could eliminate.

Retrofit solutions

Systems built from OTS subsystems are very difficult to retrofit for fault tolerance.
The absence of OTS hardware support for fault tolerance means that the only solution is to build a software monitor subsystem (such as the Pentium Pro’s Machine Check Exception Handler) that resides and executes on the OTS hardware elements. A software monitor tries to check all subsystems for indications of failure and records abnormal symptoms. When necessary, it initiates shutdowns, BIST, and restarts. This approach has two weaknesses: The monitor software itself is unprotected because it resides and executes on an OTS processor, and it limits recovery handling to on/off.

A costly but effective method for building high-confidence systems with OTS subsystems is to employ multiple-channel computation with diverse hardware and software in each channel. [7] Variations of this design diversity approach have been used successfully in safety-critical systems, such as flight control and rail transportation, that use well-defined cyclic control algorithms.

However, cost and application complexity preclude this solution in most distributed, heterogeneous systems with OTS components. A potential retrofit solution is to implement a small, highly fault-tolerant hardware subsystem that monitors the system’s operation, ensures data integrity, and manages recovery

April 1997 55