
Encyclopedia of Computer Science and Technology



F

fault tolerance

Fault tolerance is a design concept that recognizes that all computer-based systems will fail eventually. The question is whether a system as a whole can be designed to “fail gracefully.” This means that even if one or more components fail, the system will continue to operate according to its design specifications, even if its speed or throughput must decrease.

Methods and Implementations

There are a number of ways to make a system more fault tolerant. Individual components such as hard drives can be composed of multiple units so that the remaining units can take over if one fails (see also RAID). If each key component has at least one backup, then there should be time to replace the primary before the backup also fails.

Another way to achieve fault tolerance is to provide multiple paths to successful completion of the task. In fact, this is how packet-switched networks like the Internet work (see TCP/IP). If one communications link is down or too congested, packets are given an alternative routing.

Fault diagnosis software can also play an important role, both in determining how to respond to a problem (beyond any automatic response) and in providing data that will be useful later to system administrators or technicians. Some fault diagnosis systems can use elaborate rules (see expert systems) to pinpoint the cause of a fault and recommend a solution.

The amount of fault tolerance to be provided for a system depends on a number of factors:

• How important is it that the system not fail?
• How critical is a given component to the operation of the system?
• How likely is it that a given component will fail? (Mean time between failures, or MTBF)
• How expensive is it to make the component or system fault tolerant?

A related concept is fail-safe. While fault tolerance emphasizes continued operation despite one or more failures, fail-safe emphasizes the ability to shut down safely in case of an unrecoverable failure. With computer-based systems, fail-safe design can use redundant systems (as in avionics) to perform calculations, with a failing system “outvoted” if necessary by the good ones. In most cases there should also be a provision to alert the pilot or operator in time to take over operations from the automatic system.

Another common example of fail-safe design is the “journal” of pending file operations that modern operating systems keep, which can be used to restore the integrity of the system after a power failure or other abrupt shutdown (see file system).

Further Reading

Isermann, Rolf. Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance. New York: Springer, 2006.

Koren, Israel, and C. Mani Krishna. Fault-Tolerant Systems. San Francisco: Morgan Kaufmann, 2007.

National Institute of Standards and Technology. “A Conceptual Framework for System Fault Tolerance.” Available online. URL:
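
Example: voting among redundant systems

The voting arrangement mentioned in the discussion of fail-safe design, in which redundant systems perform the same calculation and a failing one is outvoted, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names majority_vote, control_step, and Decision are hypothetical, the channel outputs are taken to be integers that can be compared exactly, and a real avionics system would be considerably more involved.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Decision:
    value: Optional[int]   # agreed result, or None if there is no majority
    outvoted: List[int]    # indices of channels that disagreed with the majority
    alert_operator: bool   # True when the automatic system cannot decide


def majority_vote(readings: Sequence[int]) -> Optional[int]:
    """Return the value reported by a strict majority of channels, else None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


def control_step(readings: Sequence[int]) -> Decision:
    """Combine redundant channel outputs into one decision (illustrative only)."""
    agreed = majority_vote(readings)
    if agreed is None:
        # No majority: stop trusting the automatic result and alert the operator.
        return Decision(value=None, outvoted=[], alert_operator=True)
    # A majority exists: record which channels were outvoted for later diagnosis.
    outvoted = [i for i, r in enumerate(readings) if r != agreed]
    return Decision(value=agreed, outvoted=outvoted, alert_operator=False)


if __name__ == "__main__":
    print(control_step([42, 42, 17]))  # channel 2 is outvoted
    print(control_step([1, 2, 3]))     # no majority, operator is alerted
```

With three channels, one faulty channel can be outvoted and the system keeps operating. When no strict majority exists, the sketch stops trusting the automatic result and raises the operator alert instead.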
