13.07.2015 Views

Toward Systematic Design of Fault-Tolerant Systems

returning the system to a previous, error-free state, or forward: constructing a valid, error-free new state from existing (usually redundant) information. A recovery sequence includes fault diagnosis and removal, error elimination, state restoration, and recovery validation. When diagnosis identifies a permanently faulty subsystem, fault removal is performed by either substituting a good spare subsystem or reconfiguring the system to function without the faulty subsystem. Error elimination and state restoration complete the recovery. Independent validation of successful recovery is desirable for every subsystem. Two special cases of recovery are error correction, which allows a subsystem (for example, memory) to continue with a permanent fault, and masking redundancy (for example, triple modular redundancy with voting), which masks a fault’s presence without further recovery action.

Systemwide integration. The desired result of system partitioning and subsystem design is an integrated set of local, intermediate, and global fault tolerance functions that serve as a protective infrastructure to ensure the timely and correct delivery of system services.

A systematically designed infrastructure is autonomous; that is, it does not depend on other parts of the system (operating system, applications, personnel, and so on) for support. It is also distributed and fault tolerant itself. It has dedicated communication links and can also use the main links. And it is managed by a highly protected “hard core” subsystem (or a hierarchy of such subsystems) that executes global decisions to assure system recovery.
These properties are analogous to those of the human immune system.

This phase has two major goals: to verify the infrastructure’s completeness and consistency and to evaluate its ability to handle two or more nearly concurrent fault manifestations. To accomplish these goals, we need in-depth analysis and experimental fault injection, using the fault and error scenarios developed during specification.

Evaluation

Evaluation of fault tolerance is continuous during system partitioning, subsystem design, and system integration. At each step evaluation is an important design tool that facilitates the choices between fault tolerance techniques and assesses the likelihood of meeting the dependability goals. Successful completion of design requires a convincing verification of the design’s completeness and its potential to meet the dependability goals. Verification consists of two distinct evaluations: first qualitative, then quantitative. Qualitative evaluation generates deterministic predictions. It must be satisfied prior to quantitative evaluation, which generates probabilistic predictions. Otherwise, the evaluation may generate unreasonably optimistic predictions of availability and reliability because unjustified simplifications will go unnoticed.

Qualitative evaluation: Deterministic goals. The outcome of qualitative evaluation is a yes/no conclusion with respect to four goals:

• Fault tolerance completeness and consistency. This evaluation is part of systemwide integration. Here we use checklists of questions derived from the design paradigm, and experimental fault injection using worst-case scenarios.
• Absolute tolerance. This evaluates whether the system can survive one (or more than one) fault from a specified set and then execute a safe shutdown, usually stated as fail operational/fail safe.
A detailed design analysis is needed to prove this property.
• Absence of design faults. Here formal and heuristic methods such as proof of correctness, testing, and experimentation are applicable.
• System security goals. To evaluate deterministic security requirements, we use the same tools used to evaluate the absence of design faults.

Quantitative evaluation: Probabilistic goals. This evaluation requires three steps:

• Describe the design using a system evaluation model that is characterized by sets of physical, structural, repair, fault tolerance, and performance parameters for every subsystem.
• Obtain coverage and execution-time parameters for all local, intermediate, and global detection and recovery functions. Fault injection experiments are essential for this task.
• Use the model to predict system reliability, availability, maintainability, and safety. The existence of multiple service modes necessitates the prediction of mean time between mode reductions and duration of mode reduction in suitable measures (mean, 99th percentile, and so on) instead of a single availability prediction.

Modification

An existing system is modified for repair (the removal of newly discovered faults) or for augmentation of functionality, performance, and/or fault tolerance. In both cases subsystems are modified or new subsystems are added. When this happens, it is essential to modify the specification first and then reimplement the subsystems with a complete reexamination and reevaluation of detection and recovery functions.

54 Computer
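
Two ideas from the text above, masking redundancy by triple modular redundancy (TMR) with voting and the use of fault injection to obtain coverage parameters, can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the article's method: the function names (tmr_vote, run_modules, estimate_coverage), the squaring workload, and the fault model (each faulty replica returns a distinct wrong value) are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def tmr_vote(outputs):
    """Return the value produced by at least two of three replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def run_modules(x, faulty):
    """Run three replicas of a toy computation (x squared).

    Assumed fault model: a faulty replica i returns a wrong value that
    differs from the correct one and from every other faulty replica.
    """
    correct = x * x
    return [correct + 1 + i if i in faulty else correct for i in range(3)]


def estimate_coverage(trials, faults_per_trial, seed=0):
    """Inject faults into randomly chosen replicas and return the
    fraction of trials in which the voter still delivers the correct result."""
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        x = rng.randint(0, 1000)
        faulty = set(rng.sample(range(3), faults_per_trial))
        if tmr_vote(run_modules(x, faulty)) == x * x:
            masked += 1
    return masked / trials


if __name__ == "__main__":
    print("coverage, 1 fault :", estimate_coverage(1000, 1))
    print("coverage, 2 faults:", estimate_coverage(1000, 2))
```

Under this fault model a single faulty replica is always outvoted, while two simultaneous faults leave no majority, so the voter reports no result instead of a wrong one. A real fault injection campaign would replace the toy fault model with the fault and error scenarios developed during specification, and the measured coverage would feed the system evaluation model.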
