12.07.2015 Views

Unreliable Failure Detectors for Reliable Distributed Systems



228 T. D. CHANDRA AND S. TOUEG

To do so, we introduce the concept of “reducibility” among failure detectors. Informally, a failure detector 𝒟′ is reducible to failure detector 𝒟 if there is a distributed algorithm that can transform 𝒟 into 𝒟′. We also say that 𝒟′ is weaker than 𝒟: Given this reduction algorithm, anything that can be done using failure detector 𝒟′, can be done using 𝒟 instead. Two failure detectors are equivalent if they are reducible to each other. Using the concept of reducibility (extended to classes of failure detectors), we show how to reduce our eight classes of failure detectors to four, and consider how to solve Consensus for each class.

We show that certain failure detectors can be used to solve Consensus in systems with any number of process failures, while others require a majority of correct processes. In order to better understand where the majority requirement becomes necessary, we study an infinite hierarchy of failure detector classes and determine exactly where in this hierarchy the majority requirement becomes necessary.

Of special interest is ◇𝒲, the weakest class of failure detectors considered in this paper. Informally, a failure detector is in ◇𝒲 if it satisfies the following two properties:

Completeness. There is a time after which every process that crashes is permanently suspected by some correct process.

Accuracy. There is a time after which some correct process is never suspected by any correct process.

Such a failure detector can make an infinite number of mistakes: Each local failure detector module can repeatedly add and then remove correct processes from its list of suspects (this reflects the inherent difficulty of determining whether a process is just slow or whether it has crashed). Moreover, some correct processes may be erroneously suspected to have crashed by all the other processes throughout the entire execution.

The two properties of ◇𝒲 state that eventually some conditions must hold forever; of course this cannot be achieved in a real system. However, in practice, it is not really required that these conditions hold forever. When solving a problem that “terminates”, such as Consensus, it is enough that they hold for a “sufficiently long” period of time: This period should be long enough for the algorithm to achieve its goal (e.g., for correct processes to decide). When solving a problem that does not terminate, such as Atomic Broadcast, it is enough that these properties hold for “sufficiently long” periods of time: Each period should be long enough for some progress to occur (e.g., for correct processes to deliver some messages). However, in an asynchronous system it is not possible to quantify “sufficiently long”, since even a single process step is allowed to take an arbitrarily long amount of time. Thus, it is convenient to state the properties of ◇𝒲 in the stronger form given above.⁵

Another desirable feature of ◇𝒲 is the following. If an application assumes a failure detector with the properties of ◇𝒲, but the failure detector that it actually uses “malfunctions” and continuously fails to meet these properties - for example,

we can focus on solving Consensus since all our results will automatically apply to Atomic Broadcast as well.

⁵ Solving a problem with the assumption that certain properties hold for sufficiently long has been done previously, see Dwork et al. [1988].
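As an illustration of the Completeness and Accuracy properties stated above, the following Python sketch checks them over a finite recorded trace of suspect lists. It is not taken from the paper: the trace format, the process names, and the functions `stable_from`, `completeness_from`, and `accuracy_from` are all hypothetical. A finite trace can only show that the conditions hold until the end of the observed window, which corresponds to the “sufficiently long” reading discussed above rather than to “forever”.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical trace format: one snapshot per time step, mapping each
# correct process to the set of processes it currently suspects.
Snapshot = Dict[str, Set[str]]


def stable_from(trace: List[Snapshot], holds: Callable[[Snapshot], bool]) -> Optional[int]:
    """Earliest index i such that holds(s) is true for every snapshot from i to the end."""
    start = None
    for i in range(len(trace) - 1, -1, -1):
        if not holds(trace[i]):
            break
        start = i
    return start


def completeness_from(trace: List[Snapshot], correct: List[str], crashed: List[str]) -> Optional[int]:
    """Index from which every crashed process stays suspected by some single correct process."""
    starts = []
    for c in crashed:
        per_observer = [stable_from(trace, lambda s, p=p, c=c: c in s[p]) for p in correct]
        found = [t for t in per_observer if t is not None]
        if not found:
            return None  # nobody suspects c through the end of the window
        starts.append(min(found))
    return max(starts, default=0)


def accuracy_from(trace: List[Snapshot], correct: List[str]) -> Optional[int]:
    """Index from which some correct process is suspected by no correct process."""
    found = []
    for q in correct:
        t = stable_from(trace, lambda s, q=q: all(q not in s[p] for p in correct))
        if t is not None:
            found.append(t)
    return min(found, default=None)


# Assumed scenario: p1, p2, p3 are correct and p4 has crashed.
correct = ["p1", "p2", "p3"]
crashed = ["p4"]
trace: List[Snapshot] = [
    {"p1": {"p2"}, "p2": set(), "p3": {"p1"}},         # early mistakes, p4 not yet suspected
    {"p1": {"p4"}, "p2": {"p1", "p3"}, "p3": {"p2"}},  # p1 starts suspecting p4
    {"p1": {"p3", "p4"}, "p2": {"p3"}, "p3": set()},   # p1 briefly suspects p3 as well
    {"p1": {"p4"}, "p2": {"p3"}, "p3": set()},         # p2 still wrongly suspects p3
]

print(completeness_from(trace, correct, crashed))  # 1
print(accuracy_from(trace, correct))               # 2
```

In this trace p3 is correct yet remains suspected by p2 up to the last snapshot. Both checks still succeed, because Accuracy only asks that some correct process (here p1 or p2) eventually stops being suspected.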
