
Software Safety

Lecture 8: System Reliability

Elena Troubitsyna
Anton Tarasyuk

Åbo Akademi University


System Dependability

A. Avizienis, J.-C. Laprie and B. Randell, Dependability and its Threats (2004):

Dependability is the ability of a system to deliver a service that can be justifiably trusted or, alternatively, the ability of a system to avoid service failures that are more frequent or more severe than is acceptable.

Dependability attributes: reliability, availability, maintainability, safety, confidentiality, integrity


Reliability Definition

Reliability: the ability of a system to deliver correct service under given conditions for a specified period of time

Reliability is generally measured by the probability that a system (or system component) S can perform a required function under given conditions over the time interval [0, t]:

R(t) = P{S has not failed over time [0, t]}


More Definitions

Failure rate (or failure intensity): the number of failures per time unit (hour, second, ...) or per natural unit (transaction, run, ...)
• can be considered an alternative way of expressing reliability

Availability: the ability of a system to be in a state to deliver correct service under given conditions at a specified instant of time
• usually measured by the average (over time) probability that a system is operational in a specified environment

Maintainability: the ability of a system to be restored to a state in which it can deliver correct service
• usually measured by the probability that maintenance of the system will restore it within a given time period


Safety-critical Systems

• Pervasive use of computer-based systems in many critical infrastructures
  • flight/traffic control, driverless train/car operation, nuclear plant monitoring, robotic surgery, military applications, etc.
• Extremely high reliability requirements for safety-critical systems
  • avionics domain example: failure rate ≤ 10⁻⁹ failures per hour, i.e., more than a hundred thousand years of operation without encountering a failure


Hardware vs. Software

• Hardware for safety-critical systems is very reliable, and its reliability keeps improving
• Software is not as reliable as hardware; however, its role in safety-critical systems keeps increasing
• The division between hardware and software reliability is somewhat artificial [J. D. Musa, 2004]
• Many concepts of software reliability engineering are adapted from the mature and successful techniques of hardware reliability


Hardware Failures

A system is said to have a failure when its actual behaviour deviates from the intended one specified in design documents

Underlying faults: the largest part of hardware failures is caused by physical wear-out or a physical defect of a component
• transient faults: faults that occur and then disappear after some period of time
• permanent faults: faults that remain in the system until they are repaired
• intermittent faults: recurring transient faults


Failure Rate

The failure rate of a system (or system component) is the mean number of failures within a given period of time

The failure rate is a conditional quantity: it gives the probability that a system, which has been operating over time t, fails during the next time unit

The failure rate of a component normally varies with time: λ(t)


Failure Rate (ctd.)

Classification of failures (time-dependent):
• Early failure period
• Constant failure rate period
• Wear-out failure period


Failure Rate (ctd.)

Let T be the random variable measuring the uptime of some system (or system component) S:

R(t) = P{T > t} = 1 − F(t) = ∫_t^∞ f(s) ds,

where
– F(t) is the cumulative distribution function of T
– f(t) is the (failure) probability density function of T

The failure rate of the system can be defined as:

λ(t) = f(t) / R(t)
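To make the relation λ(t) = f(t)/R(t) concrete, here is a minimal numerical sketch (not from the slides; the Weibull lifetime model and all parameter values are assumptions chosen for illustration). The shape parameter k reproduces the three failure periods listed above: k < 1 gives a decreasing hazard (early failures), k = 1 a constant one, k > 1 an increasing one (wear-out).

```python
import math

def weibull_hazard(t: float, k: float, scale: float) -> float:
    """Hazard rate lambda(t) = f(t) / R(t) for an assumed Weibull lifetime."""
    f = (k / scale) * (t / scale) ** (k - 1) * math.exp(-((t / scale) ** k))  # density f(t)
    R = math.exp(-((t / scale) ** k))                                         # reliability R(t)
    return f / R   # simplifies to (k/scale) * (t/scale)**(k-1)

for k in (0.5, 1.0, 2.0):   # early-failure, constant and wear-out regimes
    print(k, [round(weibull_hazard(t, k, 1000.0), 6) for t in (100.0, 500.0, 1000.0)])
```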


Most Important Distributions

Discrete distributions:
• binomial distribution
• Poisson distribution

Continuous distributions:
• exponential distribution
• normal distribution
• lognormal distribution
• Weibull distribution
• gamma distribution
• Pareto distribution


Repair Rate

The repair rate expresses the probability that a system, failed for a time t, recovers its ability to perform its function in the next time unit

The repair rate of the system can be defined as:

µ(t) = g(t) / (1 − M(t)),

where
– g(t) is the (repair) probability density function and
– M(t) is the system maintainability


Reliability Parameters

MTTF: Mean Time to Failure, i.e., the expected time that a system will operate before the first failure occurs (the term MTBF – Mean Time Between Failures – is often used for repairable systems)

MTTR: Mean Time to Repair, i.e., the average time taken to repair a system that has failed

MTTR includes the time taken to detect the failure, locate the fault, repair and reconfigure the system.

MTTF = ∫_0^∞ R(t) dt        MTTR = ∫_0^∞ (1 − M(t)) dt

Availability of a repairable system is defined as

MTTF / (MTTF + MTTR)
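As a quick illustration (not part of the slides), the steady-state availability formula above can be evaluated directly; the MTTF and MTTR values below are assumed purely for the example.

```python
# Hedged example: steady-state availability A = MTTF / (MTTF + MTTR).
mttf_hours = 2000.0   # assumed mean time to failure
mttr_hours = 4.0      # assumed mean time to repair

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"Availability = {availability:.4f}")   # about 0.9980, i.e. 99.8 %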




Exponential Distribution

The distribution gives:
• failure density: f(t) = λe^(−λt)
• reliability function: R(t) = e^(−λt)
• constant failure and repair rates: λ(t) = λ and µ(t) = µ
• MTTF = 1/λ
• MTTR = 1/µ

The probability of a system working correctly throughout a given period of time decreases exponentially with the length of this time period.
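A minimal sketch of the exponential failure law (the failure rate of 10⁻³ failures/hour is an assumed value, chosen only for illustration):

```python
import math

lam = 1e-3            # assumed constant failure rate (failures/hour)
mttf = 1.0 / lam      # MTTF = 1/lambda = 1000 hours

def reliability(t_hours: float) -> float:
    """Reliability R(t) = exp(-lambda * t) under the exponential failure law."""
    return math.exp(-lam * t_hours)

for t in (100, 500, 1000):
    print(f"R({t} h) = {reliability(t):.3f}")
# Note that R(MTTF) = R(1000 h) = exp(-1), roughly 0.368.
```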


Exponential Distribution (cont.)

The exponential distribution is often used because it makes calculations much simpler

During the useful-life stage, the failure rate is related to the reliability of the component or system by the exponential failure law

It describes neither the case of early failures (decreasing failure rate) nor the case of worn-out components (increasing failure rate)


MTTF Example

A system with a constant failure rate of 0.001 failures per hour has an MTTF of 1000 hours.

This does not mean that the system will operate correctly for 1000 hours!

The reliability of such a system at time t is: R(t) = e^(−λt)

Assume that t = MTTF = 1/λ:

R(MTTF) = e^(−1) ≈ 0.37

Any given system has only a 37% chance of functioning correctly for an amount of time equal to the MTTF (i.e., a 63% chance of failing in this period).
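For completeness, a one-line check (illustrative, not from the slides) of the 37% figure:

```python
import math
print(math.exp(-1))   # 0.3678..., i.e. about a 37 % chance of surviving until t = MTTF
```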


Failure Rate Estimation

The failure rate is often assumed to be constant (this assumption simplifies calculation)

Often no other assumption can be made because of the small number of available event data

In this case, an estimator of the failure rate is given by:

λ = N_f / T_f,

where
– N_f – number of failures observed during operation
– T_f – cumulative operating time


Example: Failure Rate Calculation

Ten identical components are each tested until they either fail or reach 1000 hours, at which time the test is terminated for that component. The results are:

Component      Hours   Failure
Component 1    1000    No failure
Component 2    1000    No failure
Component 3     467    Failed
Component 4    1000    No failure
Component 5     630    Failed
Component 6     590    Failed
Component 7    1000    No failure
Component 8     285    Failed
Component 9     648    Failed
Component 10    882    Failed
Totals         7502    6 failures

The estimated failure rate is:

λ = 6 failures / 7502 hrs ≈ 799.8 · 10⁻⁶ failures/hr

or 799.8 failures for every million hours of operation.

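The same estimate can be reproduced in a few lines of code; this sketch (not part of the slides) simply applies λ = N_f / T_f to the test data above.

```python
# Hedged example: estimating a constant failure rate from the test data above.
# Each entry is (hours on test, failed?) for one component.
test_results = [
    (1000, False), (1000, False), (467, True), (1000, False), (630, True),
    (590, True), (1000, False), (285, True), (648, True), (882, True),
]

failures = sum(1 for _, failed in test_results if failed)        # N_f = 6
operating_hours = sum(hours for hours, _ in test_results)        # T_f = 7502

failure_rate = failures / operating_hours
print(f"lambda = {failure_rate:.6f} failures/hr "
      f"= {failure_rate * 1e6:.1f} failures per million hours")  # ~799.8
```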


Combinational Models

Allow the overall reliability of the system to be calculated from the reliability of its components

Physical separation and fault isolation
• greatly reduce the complexity of the reliability models
• redundancy is used to achieve fault tolerance

Two different situations are distinguished:
• Failure of any component causes system failure – series model
• Several components must fail simultaneously to cause a malfunction – parallel model


Series Systems

A configuration in which failure of any system component causes failure of the entire system

For a series system that consists of N (independent) components, the overall system failure rate is

λ = ∑_{i=1}^{N} λ_i

and, consequently,

R(t) = ∏_{i=1}^{N} R_i(t)
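A small sketch (component reliabilities are assumed values) computing the reliability of a series arrangement as the product of the component reliabilities:

```python
import math

def series_reliability(component_reliabilities):
    """R_series(t) = product of R_i(t) for independent components in series."""
    return math.prod(component_reliabilities)

# Assumed component reliabilities at some fixed mission time t:
print(series_reliability([0.99, 0.95, 0.98]))   # ~0.9217, lower than any single component
```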


Parallel Systems

Redundant system – failure of one component does not lead to failure of the entire system

The system will remain operational if at least one of the parallel elements is functioning correctly

Reliability of the parallel system is determined by considering first the probability of failure (unreliability) of an individual component and then that of the overall system:

R(t) = 1 − ∏_{i=1}^{N} (1 − R_i(t))
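And the parallel counterpart, again with assumed component values (a sketch, not from the slides):

```python
import math

def parallel_reliability(component_reliabilities):
    """R_parallel(t) = 1 - product of (1 - R_i(t)) for independent parallel components."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)

print(parallel_reliability([0.90, 0.90]))   # 0.99: redundancy raises reliability
```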


Series-Parallel Combinations

The most common in practice.

Any such system may be simplified, i.e., reduced to a single equivalent component

A series-parallel arrangement:
(a) The original arrangement
(b) The result of combining the parallel modules
(c) The result of combining the series modules

The overall reliability of the system is represented by that of module 13
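As an illustration of the reduction idea (the arrangement and reliability values below are assumed, not the figure from the slide), a two-module parallel pair in series with a third module can be collapsed step by step:

```python
import math

def series(*rs):
    """Reliability of independent components in series."""
    return math.prod(rs)

def parallel(*rs):
    """Reliability of independent components in parallel."""
    return 1.0 - math.prod(1.0 - r for r in rs)

# Assumed arrangement: (A parallel B) in series with C.
r_a, r_b, r_c = 0.9, 0.9, 0.95
r_ab = parallel(r_a, r_b)        # step (b): combine the parallel modules -> 0.99
r_system = series(r_ab, r_c)     # step (c): combine the series modules  -> 0.9405
print(r_ab, r_system)
```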


Triple Modular Redundancy

Reminder: a TMR system consists of three parallel modules

Reliability of a single module is R_m(t)

Reliability of a TMR system that consists of three identical modules is

R_TMR(t) = R_m^3(t) + 3R_m^2(t)(1 − R_m(t)) = 3R_m^2(t) − 2R_m^3(t)

Reliability of the TMR arrangement may be worse than the reliability of an individual module!
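A quick check of the TMR formula (module reliabilities assumed), including the regime where TMR is worse than a single module (R_m < 0.5):

```python
def tmr_reliability(r_module: float) -> float:
    """R_TMR = 3*R^2 - 2*R^3 for three identical modules with majority voting."""
    return 3 * r_module**2 - 2 * r_module**3

for r in (0.95, 0.5, 0.4):
    print(f"R_m = {r:.2f} -> R_TMR = {tmr_reliability(r):.4f}")
# R_m = 0.95 -> 0.9928 (better than a single module)
# R_m = 0.40 -> 0.3520 (worse than a single module)
```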


M-of-N Arrangement

A system consists of N identical modules

At least M modules must function correctly in order to prevent a system failure

R_{M-of-N}(t) = ∑_{i=0}^{N−M} (N choose i) · R_m^{N−i}(t) · (1 − R_m(t))^i
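A sketch (module reliability assumed) evaluating the M-of-N formula with binomial coefficients; note that 2-of-3 reproduces the TMR result above:

```python
from math import comb

def m_of_n_reliability(m: int, n: int, r_module: float) -> float:
    """Sum over i = 0..N-M failed modules of C(N, i) * R^(N-i) * (1-R)^i."""
    return sum(
        comb(n, i) * r_module**(n - i) * (1 - r_module)**i
        for i in range(n - m + 1)
    )

print(m_of_n_reliability(2, 3, 0.95))   # 0.99275, same as TMR with R_m = 0.95
print(m_of_n_reliability(3, 5, 0.95))   # a 5-module system needing 3 working modules
```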


Software Reliability

• Software is, in general, logically more complex
• Software failures are design failures
  • caused by design faults (which are of human origin)
  • design = all software development steps (from requirements to deployment and maintenance)
  • harder to identify, measure and detect
• Software does not wear out


Failure rate

The term failure intensity is often used instead of failure rate (to avoid confusion)

Typically, software is under permanent development and maintenance, which leads to jumps in the overall failure rate
• May increase after major program/environmental changes
• May decrease due to improvements and bug-fixes


Software Reliability (ctd.)

• Software, unlike hardware, can be fault-free (theoretically :))
  • some formal methods can guarantee the correctness of software (proof-based verification, model checking, etc.)
• Correctness of software does not ensure its reliability!
  • software can satisfy the specification document, yet the specification document itself might already be faulty
• No independence assumption, i.e., copies of software will fail together
  • most hardware fault tolerance mechanisms are ineffective for software
  • design diversity instead of component redundancy (e.g., N-version programming)


Design diversity

Each variant of the software is generated by a separate (independent) team of developers
• higher probability of generating a correct variant
• independent design faults in different variants

Costly, yet leads to an effective reliability improvement

Not as efficient as N-modular redundancy in hardware reliability engineering [J. C. Knight and N. G. Leveson, 1986]
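To make the idea of design diversity concrete, here is a minimal, purely illustrative sketch of N-version programming: three independently developed variants of the same computation are run and their results are voted on. The variant implementations and the majority_vote helper are hypothetical, not taken from the lecture.

```python
from collections import Counter

def majority_vote(results):
    """Return the result produced by a majority of variants, or raise if there is none."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among the variants")
    return value

# Hypothetical, independently designed variants of the same computation.
def variant_a(x): return x * x
def variant_b(x): return x ** 2
def variant_c(x): return sum(x for _ in range(x))   # a third, differently designed variant

print(majority_vote([v(7) for v in (variant_a, variant_b, variant_c)]))   # 49
```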


Appendix: Failure Rate Proof

Proof:

λ(t) = lim_{Δt→0} (1/Δt) · P{A_S^[t,t+Δt] | ¬A_S^[0,t]}

     = lim_{Δt→0} (1/Δt) · P{A_S^[t,t+Δt] ∩ ¬A_S^[0,t]} / P{¬A_S^[0,t]}

     = lim_{Δt→0} (1/Δt) · (P{A_S^[0,t+Δt]} − P{A_S^[0,t]}) / R(t)

     = lim_{Δt→0} (R(t) − R(t+Δt)) / (Δt · R(t))

     = f(t) / R(t),

where
– A_S^[x,y] ≙ S failed over time [x, y]
– ¬A_S^[x,y] ≙ S did not fail over time [x, y]
