Toward Systematic Design of Fault- Tolerant Systems

More documents

Recommendations

Info

.The possiblecauses offaultsincludenaturalphenomenaof internalor externalorigin, andaccidentalor intentionalhumanactions.C. Gilley, Francis P. Mathur, David A. Rennels, JohnA. Rohr, and David K. Rubin) culminated in a laboratorymodel of the JPL-STAR computer. 6 US PatentNo. 3,517,671 granted to me in 1970 and assignedto NASA, validated the concept’s originality. A flightmodel of the JPL-STAR was designed for a 10- to 15-year space mission, 6 but construction stopped whenNASA discontinued the Grand Tour mission.Nevertheless, the project did afford the study of allaccessible engineering solutions and theory concerningreliability. From a confusing variety of theoriesand techniques 2-5 emerged the unifying concept offault tolerance. 1 And in 1971 JPL and the IEEEComputer Society cosponsored the first InternationalSymposium on Fault-Tolerant Computing.During the succeeding 25 years the set of faults thatfault-tolerant systems had to tolerate grew extensively.The original concept dealt with transient and permanentlogic faults of physical origin. The increasing complexityof software and VLSI chip logic emphasized theimpossibility of removing all design faults prior to operation.Thus, faults due to human design errors wereadded to the demands of fault tolerance. Experiencealso led to the addition of interaction faults—thosefaults inadvertently introduced by humans during computeroperation or maintenance. Finally, consequencesof malicious actions intended to alter or interrupt servicewere recognized as intentional design faults. Thisconcept established a common ground for the unifiedtreatment of security and fault tolerance concerns insystem design. Assuring full compatibility and integrationof security and fault tolerance techniques is amajor challenge for contemporary designers.The possible causes of faults, then, include naturalphenomena of internal or external origin and accidentalor intentional human actions. In either case,faults will cause errors. If error detection and recoverydo not take place in a timely manner, a failure willoccur that will be manifested by the denial or an undesirablechange of service.There are two ways to build a fault-tolerant system.The bottom-up approach entails designing aninfrastructure of autonomously fault-tolerant subsystems(microprocessors, memories, sensors, displays,and so on) and integrating this infrastructurewith global fault tolerance functions such as reconfigurationand externally supported recovery. The topdownapproach allows a system to be built usingexisting (off-the-shelf) subsystems that may have littleor no fault tolerance at all. A global monitoringfunction is then implemented to provide fault tolerance.Top-down design is the prevailing practice.An examination of both approaches clearly makesa case for the long-range merits of the bottom-up technique.Moreover, similarities between fault toleranceand the human immune system suggest an analogythat offers developers and users of high-confidencesystems a greater understanding and acceptance ofbottom-up fault tolerance.DESIGN PARADIGM FORFAULT-TOLERANT SYSTEMSBuilding the JPL-STAR computer involved muchimprovisation and experimentation with design alternatives.It became apparent that the lessons learnedcould serve as the foundation for a more orderlyapproach that would guide designers in treating faulttolerance as a systematic issue.My 1967 paper 1 was followed by publications overthe next 25 years that formulated an evolving view ofhow to systematically introduce fault tolerance inhardware, software, communication, and manmachine-interfacesubsystems and how to integratethese elements during design. 2 The appearance of new,successful fault-tolerant systems offered additionaldesign insights and more operational experience.Here I summarize the most mature version of theguidelines for bottom-up fault tolerance. An abstractionof observed design processes in which steps oftenoverlap, it is offered as a way to minimize the probabilityof oversights, mistakes, and inconsistencies thatmay occur during the implementation of fault tolerance.The first three steps—specification, implementation,and evaluation—deal with the building of anew system. Implementation and evaluation are concurrent.Step four—modification—addresses therepair or augmentation of an existing design.SpecificationSystem requirements that describe the services to bedelivered in terms of functionality and performanceare the starting point in specifying fault tolerance.Mission phases. First we identify different phases,characterized by different environments and operationconditions during the system’s lifetime or mission.Dependability conditions. Then we identify fourdependability conditions for each phase:• Fault classes. I have already summarized thepotential fault classes to be tolerated. Environmentalconditions and operating interfaces willaffect the expected fault classes. Each missionphase may encounter internal faults that arisewithin the system and external faults that introduceerrors into the system via the interfaces: I/Olinks, external fault tolerance support, andhuman-machine interaction.• External support. Here we determine the availabilityof repair by external agents and the formof this support (on-demand or periodic) as well as52 Computer
.the availability of remote support for fault tolerancefunctions (for example, external diagnosisand software reloading).• Service modes. Here we determine the acceptabilityof different modes of service (full, reduced,degraded, emergency, safe shutdown, and so on)for each phase and establish each mode’s requiredservice level.• Dependability goals. Here we determine eachphase’s requirements for system availability, reliability,maintainability, and safety, as well aspotentially varying criticality of service deliveryand system security goals during different phases.Evaluation methods. Now we choose how we aregoing to evaluate the likelihood that the design willmeet the dependability goals. We must also agree uponand specify independent means of validating the evaluationand define fault- and error-occurrence scenariosto enable evaluation. For example, how will thesystem respond to two or more independent, nearcoincidentfaults with overlapping effects? How willit respond to latent errors?Resource allocation. Finally, since the resources formaking a system dependable are limited, we mustdecide before implementation begins how much tospend on fault tolerance versus fault avoidance duringimplementation.ImplementationFault tolerance is implemented through thesequence of system partitioning, subsystem design,and systemwide integration.System partitioning. System partitioning is the definitionof system structure, expressed by the interconnectionof its building blocks and communicationlinks. Functionality, performance, and fault tolerancerequirements, available technologies, and past experienceall affect the choice of the hardware, software,communication, and interface (I/O) subsystems thatconstitute the system being designed. For fault tolerance,partitioning defines potentially replaceable unitsand their error containment boundaries, which areintended to ensure failure independence and theabsence of erroneous outputs for individual subsystems.Partitioning involves three major decisions:• Fault tolerance hierarchy. Here we define a systemwidestrategy for coordinated error detectionand recovery. We may define detection and recoveryfunctions for an individual subsystem (local),to be shared by a set of identical subsystems orextended over a group of diverse subsystems(intermediate), or to be systemwide (global). Ahierarchy of three or all four approaches may beused in larger systems. At this time we define specializedsupport subsystems for fault tolerance(for example, service processors, software monitors,and dedicated communication links) andadd them to the system structure.• Redundancy methods. We can incorporateredundant subsystems into a design using multiple-channelcomputation (duplex, triplex, and soon), spare subsystems, or degradation by exclusionof failed subsystems. Or we can combinethese approaches (for example, hybrid redundancy).Time redundancy (repetition of computations)is another option.• Design diversity. Now we must decide if designdiversity 7 (diverse hardware and software in eachchannel) is to be implemented. Design diversityis an effective way to ensure safety in criticalapplications and to protect the most vital subsystemsof complex systems and networksagainst design faults.Once these decisions have been made, every subsystemis characterized by its own error-containmentrequirements and its participation in set, group,and/or global detection and recovery functions isdetermined. Then we can apportion the system-levelgoals for availability, reliability, maintainability, safety,and security among the sets of subsystems that constitutethe system.Subsystem design. In a subsystem, fault tolerance isimplemented in two parts: local error detection andrecovery and support for higher level (intermediate,global) error detection and recovery. The adequacy ofthe chosen local detection and recovery methods isassessed by evaluations that occur during subsystemdesign and often lead to modification of initial choices.Both parts of the fault tolerance implementation entailthe following steps:• Error detection. Errors are undesired statescaused by active faults and may lead to subsystemfailure. Error detection is either concurrent withdelivery of service or preemptive, in which caseservice delivery is suspended. Error-detectingcodes are an example of the former method; BIST(built-in self-test) exemplifies the latter. Errordetection methods must be able to detect errorsproduced by faults in the fault tolerance mechanismsthemselves. Furthermore, they must be ableto check spare subsystems for dormant faults anddetect latent errors in stored information that isnot subject to other forms of error detection.• Recovery. An error signal invokes a recoveryprocedure, which attempts to restore the systemto a valid state. Recovery is either backward—How will thesystemrespond totwo or moreindependent,nearcoincidentfaults withoverlappingeffects?April 1997 53
Page 1: .Theme FeatureToward SystematicDesi
Page 5 and 6: .Failure to return to the specifica
Page 7 and 8: .• what will motivate the major c

Toward Systematic Design of Fault- Tolerant Systems

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?