4 years ago

Achieving fault-tolerant software with rejuvenation and ... - IEEE Xplore

Achieving fault-tolerant software with rejuvenation and ... - IEEE Xplore

Software Reconfiguration

Software Reconfiguration Products ■ ■ multilayer reconfiguration, in which recovery from a failure at one layer might take place at higher layers either independently or in coordination with each layer having different characteristics; and priority reconfiguration, which might involve re-optimizing an entire network to reconnect disrupted high-priority circuits over currently established lowerpriority circuits. Each of these reconfiguration techniques requires the provisioning of spare resources for redundancy that can be used when a failure occurs. The redundant resources can be dedicated, to guarantee reconfiguration, or shared, in which case recovery might not be possible (spare resources might not be available at the time of a fault). On the other hand, sharing redundant resources is more efficient in environments of low fault probability or when reconfiguration need not be guaranteed. Reconfiguration can be provided at different layers and implemented with different algorithms at each layer. 13 In fact, if all restoration mechanisms are similar at each layer, there is increased whole system vulnerability. 2 For example, if all layers used preplanned mechanisms, then each layer— and the system as a whole—will not be able to handle unexpected fault events. If all layers used real-time search algorithms, then system behavior would be hard to predict. Instead, it is better to use complementary reconfiguration algorithms at different layers and draw on the benefits of each. For example, a preplanned algorithm at a lower layer (for speed), followed by a real-time search mechanism at a higher layer, can handle unexpected faults that lower layers have been unable to handle. In general, reconfiguration of successfully executing software for recovery from a failure in another part of a system should only be performed if it can be accomplished transparently such that it is imperceptible to users. However, there are cases when highpriority software fails and requires resources for recovery. In this scenario, lowerpriority software should be delayed or terminated and its resources reassigned to aid this recovery. 9 Intentional system degradation to maintain essential processing is the most extreme type of reconfiguration. Given the maturation of software reconfiguration techniques, products have begun to appear, particularly in the PC operating system market. These products do not protect from hardware failures, such as CPU malfunction or disk failure, but they can be useful tools against buggy software and human error. In general, these products track software changes (system, application, data file, and registry setting), use a hard disk to make redundant copies, and let the user restore (reconfigure) a system to a previous “snapshot.” Note that there are trade-offs for providing this reconfiguration capability versus system performance and hard disk space requirements. For more information, here is a representative sampling of current products: ■ ■ ■ ■ ConfigSafe v 4.0, by imagine LAN (, GoBack v 2.21, by Roxio (, Rewind, by Power On Software (, and System Restore Utility, included in Microsoft Windows ME (www. Operational state Rejuvenation Aging Failureprobable state Reconfiguration Bug manifestation Failed state Fault-tolerant software requires a whole system approach. 2,9 We have attempted to outline the use of contrasting proactive and reactive approaches to achieve fault-tolerant software, but it’s not easy to say which approach is better. Both approaches are nonexclusive and complementary, such that they work well together in an integrated system (see Figure 1). The high cost of redundancy required for reactive reconfiguration suggests it is better suited for software in which a rejuvenation schedule appears unrealistic due to imminent faults or where an outage’s effect could be catastrophic. Proactive rejuvenation is the preferred solution when faults can be efficiently avoided using a realistic rejuvenation schedule or where the risk an outage presents is low. Because both approaches are relatively new and under study, we direct readers to our references for more details. Figure 1. A model depicting the complementary nature of rejuvenation and reconfiguration. 7 July/August 2001 IEEE SOFTWARE 51

An analogy can be made between these fault-tolerant software approaches and CPU communications in a computer system. Reactive reconfiguration is equivalent to eventdriven interrupts, and proactive rejuvenation is equivalent to polling resources. Preventing failures before they occur might be the best approach when finding all software bugs is possible, just as polling is preferable when CPU communication can be anticipated. However, when finding all bugs is improbable (or maybe testing is not even attempted), then having the flexibility to react to multipriority interrupts with robust service-handling routines might be the critical last line of defense against software faults. “Statistical Non-Parametric Algorithms to Estimate the Optimal Rejuvenation Schedule,” Pacific Rim Int’l Symp. Dependable Computing (PRDC), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 77–84. 8. W. Yurcik and D. Tipper, “Survivable ATM Group Communications: Issues and Techniques,” Eight Intl. Conf. Telecomm. Systems, Vanderbuilt Univ./Owen Graduate School of Management, Nashville, Tenn., 2000, pp. 518–537. 9. D. Wells et al., “Software Survivability,” Proc. DARPA Information Survivability Conf. and Exposition (DIS- CEX), IEEE Computer Soc. Press, Los Alamitos, Calif., vol. 2, Jan. 2000, pp. 241–255. 10. D. Medhi, “Network Reliability and Fault Tolerance,” Wiley Encyclopedia of Electrical and Electronics Engineering, John Wiley, New York, 1999. 11. D. Medhi and D. Tipper, “Multi-Layered Network Survivability—Models, Analysis, Architecture, Framework and Implementation: An Overview,” Proc. DARPA Information Survivability Conference and Exposition (DISCEX), IEEE Computer Soc. Press, Los Alamitos, Calif., vol. 1, Jan. 2000, pp. 173–186. 12. D. Medhi and D. Tipper, “Towards Fault Recovery and Management in Communications Networks,” J. Network and Systems Management, vol. 5, no. 2, Jun. 1997, pp. 101–104. 13. D. Johnson, “Survivability Strategies for Broadband Networks,” IEEE Globecom, 1996, pp. 452–456. Acknowledgments The authors thank Katerina Goseva-Popstojanova, Duke University, who provided an outstanding introduction to the concept of rejuvenation; David Tipper, University of Pittsburgh, and Deep Medhi, University of Missouri–Kansas City, for making significant contributions to the field of fault-tolerant networking using reconfiguration; and Kishor S. Trivedi, Duke University, for making seminal contributions in the development of software rejuvenation. Lastly, we thank past reviewers for their specific feedback that has significantly improved this article. References 1. D. Milojicic, “Fred B. Schneider on Distributed Computing,” IEEE Distributed Systems Online, vol. 1, no. 1, 2000, ds1intprint.htm (current 11 June 2001). 2. W. Yurcik, D. Doss, and H. Kruse, “Survivability- Over-Security: Providing Whole System Assurance,” IEEE/SEI/CERT Information Survivability Workshop (ISW), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 201–204. 3. Y. Huang et al., “Software Rejuvenation: Analysis, Module and Applications,” 25th IEEE Intl. Symp. on Fault Tolerant Computing, 1995, pp. 381–390. 4. K.S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, “Modeling and Analysis of Software Aging and Rejuvenation,” 33rd Ann. Simulation Symp., Soc. Computer Simulation Int’l Press, 2000, pp. 270–279. 5. K. Vaidyanathan et al., “Analysis of Software Rejuvenation in Cluster Systems,” Fast Abstracts, Pacific Rim Int’l Symp. Dependable Computing (PRDC), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 3–4. 6. S. Garg et al., “A Methodology for Detection and Estimation of Software Aging,” Ninth Int’l Symp. Software Reliability Eng., 1998, pp. 283–292. 7. T. Dohi, K. Goseva-Popstojanova, and K.S. Trivedi, For further information on this or any other computing topic, please visit our Digital Library at About the Authors David Doss is an associate professor and the graduate program coordinator in the Department of Applied Computer Science at Illinois State University. He is also a retired Lt. Commander US Navy (SSN). Contact him at William Yurcik is an assistant professor in the Department of Applied Computer Science at Illinois State University. Prior to his academic career, he worked for organizations such as the Naval Research Laboratory, MITRE, and NASA. Contact him at 52 IEEE SOFTWARE July/August 2001

Fault-Tolerant Nanoscale Processors on ... - IEEE Xplore
Techniques of Software Fault Tolerance - IJCSET
Toward hardware-redundant, fault-tolerant logic for ... - IEEE Xplore
New Breed of Network Fault-Tolerant Voltage-Source ... - IEEE Xplore
Fault-Tolerant Computing Systems - IEEE Xplore
Fault-tolerant and Energy-efficient Permutation ... - IEEE Xplore
Minimum energy fault tolerant sensor networks ... - IEEE Xplore
Fault tolerance on star graphs - Parallel Algorithms ... - IEEE Xplore
Power consumption of fault tolerant codes: the active ... - IEEE Xplore
EVENODD: an efficient scheme for tolerating double ... - IEEE Xplore
Tolerant VLSI Processor Arrays - IEEE Xplore
SWIFT: Software Implemented Fault Tolerance - Liberty Research ...
Fault-tolerant mobile agent execution - Computers ... - IEEE Xplore
Fault-Tolerant Control for SSSC Using Neural ... - IEEE Xplore
Software Fault Tolerance: A Tutorial
Fault-tolerant target detection in sensor networks ... - IEEE Xplore
Fault-Tolerant Robust Supervisor for Discrete Event ... - IEEE Xplore
SWIFT: Software Implemented Fault Tolerance
Error Tolerant Multimedia Stream Processing: There's ... - IEEE Xplore
Tolerances and Uncertainties Versus Magnetic ... - IEEE Xplore
Tolerances and Uncertainties Versus Magnetic ... - IEEE Xplore
Fault Tolerance and Recovery - MESL
software modulated fault tolerance - Liberty Research Group ...
Achievable Rate Analysis in Network-Coded ... - IEEE Xplore
Software-Implemented Hardware Fault Tolerance Experiments ...
Achieving Long-Term Surveillance in VigilNet - IEEE Xplore
A Software Defined Testbed for Reconfigurable ... - IEEE Xplore
The Fault Avoidance and The Fault Tolerance Approaches for ... - Inpe