12.07.2015 Views

Unreliable Failure Detectors for Reliable Distributed Systems


used to bridge the gap between known impossibility results and the need for practical solutions for fault-tolerant asynchronous systems.

The remainder of this paper is organised as follows: In Section 2, we describe our model and introduce eight classes of failure detectors defined in terms of properties. In Section 3, we use the concept of reduction to show that we can focus on four classes of failure detectors rather than eight. In Section 4, we present Reliable Broadcast, a communication primitive for asynchronous systems used by several of our algorithms. In Section 5, we define the Consensus problem. In Section 6, we show how to solve Consensus for each one of the four equivalence classes of failure detectors. In Section 7, we show that Consensus and Atomic Broadcast are equivalent to each other in asynchronous systems. In Section 8, we complete our comparison of the failure detector classes defined in this paper. In Section 9, we discuss related work, and in particular, we describe an implementation of a failure detector with the properties of $\Diamond\mathcal{W}$ in several models of partial synchrony. Finally, in the Appendix we define an infinite hierarchy of failure detector classes, and determine exactly where in this hierarchy a majority of correct processes is required to solve Consensus.

2. The Model

We consider asynchronous distributed systems in which there is no bound on message delay, clock drift, or the time necessary to execute a step. Our model of asynchronous computation with failure detection is patterned after the one in Fischer et al. [1985]. The system consists of a set of $n$ processes, $\Pi = \{p_1, p_2, \ldots, p_n\}$. Every pair of processes is connected by a reliable communication channel.

To simplify the presentation of our model, we assume the existence of a discrete global clock. This is merely a fictional device: the processes do not have access to it. We take the range $\mathcal{T}$ of the clock's ticks to be the set of natural numbers.

2.1. FAILURES AND FAILURE PATTERNS. Processes can fail by crashing, that is, by prematurely halting. A failure pattern $F$ is a function from $\mathcal{T}$ to $2^{\Pi}$, where $F(t)$ denotes the set of processes that have crashed through time $t$. Once a process crashes, it does not "recover", that is, $\forall t : F(t) \subseteq F(t + 1)$. We define $\mathit{crashed}(F) = \bigcup_{t \in \mathcal{T}} F(t)$ and $\mathit{correct}(F) = \Pi - \mathit{crashed}(F)$. If $p \in \mathit{crashed}(F)$, we say $p$ crashes in $F$ and if $p \in \mathit{correct}(F)$, we say $p$ is correct in $F$. We consider only failure patterns $F$ such that at least one process is correct, that is, $\mathit{correct}(F) \neq \emptyset$.

2.2. FAILURE DETECTORS. Each failure detector module outputs the set of processes that it currently suspects to have crashed.⁹ A failure detector history $H$ is a function from $\Pi \times \mathcal{T}$ to $2^{\Pi}$. $H(p, t)$ is the value of the failure detector module of process $p$ at time $t$. If $q \in H(p, t)$, we say that $p$ suspects $q$ at time $t$ in $H$. We omit references to $H$ when it is obvious from the context. Note that the failure detector modules of two different processes need not agree on the list of processes that are suspected to have crashed, that is, if $p \neq q$, then $H(p, t) \neq H(q, t)$ is possible.

⁹ In Chandra et al. [1992], failure detectors can output values from an arbitrary range
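
The definitions of Sections 2.1 and 2.2 can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper: the process names, the crash times, the example history `H`, and all function names are hypothetical, and the constant `HORIZON` is an assumed finite stand-in for the unbounded range of clock ticks, so the checks only cover ticks up to that bound.

```python
from typing import Callable, Dict, FrozenSet

Process = str
# F maps a clock tick to the set of processes crashed through that tick.
FailurePattern = Callable[[int], FrozenSet[Process]]
# H maps (process, tick) to the set of processes that process suspects.
DetectorHistory = Callable[[Process, int], FrozenSet[Process]]

HORIZON = 10  # assumption: finite stand-in for the unbounded clock range


def make_failure_pattern(crash_times: Dict[Process, int]) -> FailurePattern:
    """Build F from hypothetical crash times (process -> tick of its crash)."""
    def F(t: int) -> FrozenSet[Process]:
        return frozenset(p for p, tc in crash_times.items() if tc <= t)
    return F


def is_valid_pattern(F: FailurePattern, horizon: int = HORIZON) -> bool:
    """Check that F(t) is a subset of F(t + 1) for every tick below the horizon."""
    return all(F(t) <= F(t + 1) for t in range(horizon))


def crashed(F: FailurePattern, horizon: int = HORIZON) -> FrozenSet[Process]:
    """Union of F(t) over the ticks 0..horizon."""
    return frozenset().union(*(F(t) for t in range(horizon + 1)))


def correct(F: FailurePattern, processes: FrozenSet[Process],
            horizon: int = HORIZON) -> FrozenSet[Process]:
    """All processes minus those that crash within the horizon."""
    return frozenset(processes) - crashed(F, horizon)


def suspects(H: DetectorHistory, p: Process, q: Process, t: int) -> bool:
    """True if q is in H(p, t), i.e. p suspects q at time t in H."""
    return q in H(p, t)


# Hypothetical three-process system in which p2 crashes at tick 3.
PI = frozenset({"p1", "p2", "p3"})
F = make_failure_pattern({"p2": 3})


def H(p: Process, t: int) -> FrozenSet[Process]:
    """Example history: the modules at p1 and p3 disagree and can be wrong."""
    if p == "p1" and t >= 5:
        return frozenset({"p2"})   # p1 suspects the crashed p2 from tick 5 on
    if p == "p3" and t == 2:
        return frozenset({"p1"})   # p3 briefly suspects p1, which never crashes
    return frozenset()
```

In this example `H("p1", 6)` and `H("p3", 6)` differ, and p3 suspects the correct process p1 at tick 2. Both are permitted by the definition of a history, which places no accuracy or agreement requirement on the modules.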
