Software Safety<br />
Lecture 8: System Reliability<br />
Elena Troubitsyna<br />
Anton Tarasyuk<br />
<strong>Åbo</strong> <strong>Akademi</strong> University
System Dependability<br />
A. Avizienis, J.-C. Laprie and B. Randell:<br />
Dependability and its Threats (2004):<br />
Dependability is the ability of a system to deliver a service that<br />
can be justifiably trusted or, alternatively, the ability of a system<br />
to avoid service failures that are more frequent or more severe than<br />
is acceptable<br />
Dependability attributes: reliability, availability, maintainability,<br />
safety, confidentiality, integrity
Reliability Definition<br />
Reliability: the ability of a system to deliver correct service under<br />
given conditions for a specified period of time<br />
Reliability is generally measured by the probability that a system<br />
(or system component) S can perform a required function under<br />
given conditions for the time interval [0, t]:<br />
R(t) = P {S not failed over time [0, t]}
More Definitions<br />
Failure rate (or failure intensity): the number of failures per time<br />
unit (hour, second, ...) or per natural unit (transaction, run, ...)<br />
• can be considered as an alternative way of expressing reliability<br />
Availability: the ability of a system to be in a state to deliver<br />
correct service under given conditions at a specified instant of time<br />
• usually measured by the average (over time) probability that a<br />
system is operational in a specified environment<br />
Maintainability: the ability of a system to be restored to a state in<br />
which it can deliver correct service<br />
• usually measured by the probability that maintenance of the<br />
system will restore it within a given time period
Safety-critical Systems<br />
• Pervasive use of computer-based systems in many critical<br />
infrastructures<br />
• flight/traffic control, driverless trains/cars operation, nuclear<br />
plant monitoring, robotic surgery, military applications, etc.<br />
• Extremely high reliability requirements for safety-critical<br />
systems<br />
• avionics domain example: failure rate ≤ 10^(−9) failures per hour,<br />
i.e., on average more than a hundred thousand years of operation<br />
without encountering a failure
Hardware vs. Software<br />
• Hardware for safety-critical systems is very reliable and its<br />
reliability is being improved<br />
• Software is not as reliable as hardware; however, its role in<br />
safety-critical systems keeps increasing<br />
• The division between hardware and software reliability is<br />
somewhat artificial [J. D. Musa, 2004]<br />
• Many concepts of software reliability engineering are adapted<br />
from the mature and successful techniques of hardware<br />
reliability
Hardware Failures<br />
The system is said to have a failure when its actual behaviour<br />
deviates from the intended one specified in design documents<br />
Underlying faults: most hardware failures are caused by<br />
physical wear-out or a physical defect of a component<br />
• transient faults: the faults that may occur and then disappear after<br />
some period of time<br />
• permanent faults: the faults that remain in the system until they<br />
are repaired<br />
• intermittent faults: recurring transient faults
Failure Rate<br />
The failure rate of a system (or system component) is the mean<br />
number of failures within a given period of time<br />
The failure rate is a conditional measure: it gives the probability<br />
that a system, which has been operating over time t, fails over the<br />
next time unit<br />
The failure rate of a component normally varies with time: λ(t)
Failure Rate (ctd.)<br />
Classification of failures (time-dependent), forming the well-known bathtub curve:<br />
• Early failure period<br />
• Constant failure rate period<br />
• Wear-out failure period
Failure Rate (ctd.)<br />
Let T be the random variable measuring the uptime of some<br />
system (or system component) S:<br />
R(t) = P{T > t} = 1 − F(t) = ∫_t^∞ f(u) du,<br />
where<br />
– F(t) is the cumulative distribution function of T<br />
– f(t) is the (failure) probability density function of T<br />
The failure rate of the system can be defined as:<br />
λ(t) = f(t) / R(t)
Most Important Distributions<br />
Discrete distributions:<br />
• binomial distribution<br />
• Poisson distribution<br />
Continuous distributions:<br />
• exponential distribution<br />
• normal distribution<br />
• lognormal distribution<br />
• Weibull distribution<br />
• gamma distribution<br />
• Pareto distribution
Repair Rate<br />
The repair rate expresses the probability that a system, failed for a<br />
time t, recovers its ability to perform its function in the next time<br />
unit<br />
The repair rate of the system can be defined as:<br />
µ(t) = g(t) / (1 − M(t)),<br />
where<br />
– g(t) is the (repair) probability density function and<br />
– M(t) is the system maintainability
Reliability Parameters<br />
MTTF: Mean Time to Failure, i.e., the expected time that a system will<br />
operate before the first failure occurs (often term MTBF – Mean Time<br />
Between Failures – is used for repairable systems)<br />
MTTR: Mean Time to Repair, i.e., the average time taken to repair a<br />
system that has failed<br />
MTTR includes the time taken to detect the failure, locate the fault,<br />
repair and reconfigure the system.<br />
MTTF = ∫_0^∞ R(t) dt,  MTTR = ∫_0^∞ (1 − M(t)) dt<br />
Availability of a repairable system is defined as<br />
A = MTTF / (MTTF + MTTR)
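As a quick numeric sketch of the availability formula (the MTTF and MTTR figures below are invented for illustration, not taken from the slides):

```python
# Hypothetical figures for a repairable system, in hours:
mttf = 2000.0   # mean time to failure
mttr = 4.0      # mean time to repair

# Steady-state availability: long-run fraction of time the system is up
availability = mttf / (mttf + mttr)
print(availability)  # ≈ 0.998
```

Even a modest repair time keeps availability close to 1 as long as MTTF dominates MTTR.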
Exponential Distribution<br />
The distribution gives:<br />
• failure density: f(t) = λ·e^(−λt)<br />
• reliability function: R(t) = e^(−λt)<br />
• constant failure and repair rates: λ(t) = λ and µ(t) = µ<br />
• MTTF = 1/λ<br />
• MTTR = 1/µ<br />
The probability of a system working correctly throughout a given<br />
period of time decreases exponentially with the length of this time<br />
period.
Exponential Distribution (cont.)<br />
The exponential distribution is often used because it makes<br />
reliability calculations much simpler<br />
During the useful-life stage, the failure rate is related to the<br />
reliability of the component or system by exponential failure law<br />
It describes neither the case of early failures (decreasing failure rate)<br />
nor the case of worn-out components (increasing failure rate)
MTTF Example<br />
A system with a constant failure rate of 0.001 failures per hour has<br />
an MTTF of 1000 hours.<br />
This does not mean that the system will operate correctly<br />
for 1000 hours!<br />
The reliability of such a system at time t is R(t) = e^(−λt)<br />
Assume that t = MTTF = 1/λ:<br />
R(MTTF) = e^(−1) ≈ 0.37<br />
Any given system has only a 37% chance of functioning correctly<br />
for an amount of time equal to the MTTF (i.e., a 63% chance of<br />
failing in this period).
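The calculation above can be checked in a few lines of Python; this is a sketch under the constant-failure-rate assumption used on the slide:

```python
import math

failure_rate = 0.001          # failures per hour (assumed constant)
mttf = 1 / failure_rate       # 1000 hours

def reliability(t, lam):
    # Exponential failure law: R(t) = e^(-lambda * t)
    return math.exp(-lam * t)

# Probability of operating correctly for one whole MTTF:
print(reliability(mttf, failure_rate))  # e^-1 ≈ 0.3679
```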
Failure Rate Estimation<br />
The failure rate is often assumed to be constant (this assumption<br />
simplifies calculation)<br />
Often no other assumption can be made because of the small<br />
amount of available failure data<br />
In this case, an estimator of the failure rate is given by:<br />
λ = N_f / T_f,<br />
where<br />
– N_f – number of failures observed during operation<br />
– T_f – cumulative operating time
Example: Failure Rate Calculation<br />
Ten identical components are each tested until they either fail or reach<br />
1000 hours, at which time the test is terminated for that component.<br />
The results are:<br />
Component Hours Failure<br />
Component 1 1000 No failure<br />
Component 2 1000 No failure<br />
Component 3 467 Failed<br />
Component 4 1000 No failure<br />
Component 5 630 Failed<br />
Component 6 590 Failed<br />
Component 7 1000 No failure<br />
Component 8 285 Failed<br />
Component 9 648 Failed<br />
Component 10 882 Failed<br />
Totals 7502 6<br />
The estimated failure rate is:<br />
λ = 6 failures / 7502 hrs = 799.8 · 10^(−6) failures/hr,<br />
i.e., 799.8 failures for every million hours of operation.
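The same estimate can be reproduced programmatically; the list below is just the table transcribed (a sketch, not part of the original slides):

```python
# (hours on test, failed?) for each of the ten components in the table
tests = [(1000, False), (1000, False), (467, True), (1000, False),
         (630, True), (590, True), (1000, False), (285, True),
         (648, True), (882, True)]

n_failures = sum(1 for _, failed in tests if failed)   # 6
total_hours = sum(hours for hours, _ in tests)         # 7502
lam = n_failures / total_hours                         # estimator N_f / T_f
print(lam * 1e6)  # ≈ 799.8 failures per million hours
```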
Combinational Models<br />
Allow overall reliability of the system to be calculated from the<br />
reliability of its components<br />
Physical separation and fault isolation<br />
• greatly reduce the complexity of the reliability models<br />
• enable redundancy to achieve fault tolerance<br />
Two different situations are distinguished:<br />
• Failure of any component causes system failure – series<br />
model<br />
• Several components must fail simultaneously to cause a<br />
malfunction – parallel model
Series Systems<br />
A configuration in which the failure of any system component causes<br />
failure of the entire system<br />
For a series system that consists of N (independent) components, the<br />
overall system failure rate is<br />
λ = Σ_{i=1}^N λ_i<br />
and, consequently,<br />
R(t) = Π_{i=1}^N R_i(t)
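For independent components with constant (exponential) failure rates, the two formulas above agree; a minimal sketch with invented per-hour rates:

```python
import math

def series_reliability(t, rates):
    # R(t) = product of e^(-lambda_i * t) over the components in series
    r = 1.0
    for lam in rates:
        r *= math.exp(-lam * t)
    return r

rates = [1e-4, 2e-4, 5e-5]   # hypothetical per-hour failure rates
t = 1000.0
# Equivalent to a single component with lambda = sum(rates):
print(series_reliability(t, rates), math.exp(-sum(rates) * t))
```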
Parallel Systems<br />
Redundant system – failure of one component<br />
does not lead to failure of the entire system<br />
The system will remain operational if at least<br />
one of the parallel elements is functioning<br />
correctly<br />
Reliability of the parallel system is determined by considering first the<br />
probability of failure (unreliability) of an individual component and then<br />
that of the overall system:<br />
R(t) = 1 − Π_{i=1}^N (1 − R_i(t))
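The product-of-unreliabilities rule can be sketched directly (the module reliabilities below are illustrative):

```python
def parallel_reliability(component_reliabilities):
    # The system fails only if every parallel branch fails:
    # R = 1 - product of (1 - R_i)
    unreliability = 1.0
    for r in component_reliabilities:
        unreliability *= (1.0 - r)
    return 1.0 - unreliability

# Two redundant modules, each 0.9 reliable at the time of interest:
print(parallel_reliability([0.9, 0.9]))  # 0.99
```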
Series-Parallel Combinations<br />
The most common in practice.<br />
Any such system may be simplified step by step, i.e.,<br />
reduced to a single equivalent component<br />
A series-parallel arrangement:<br />
(a) The original arrangement<br />
(b) The result of combining parallel<br />
modules<br />
(c) The result of combining the series<br />
modules<br />
The overall reliability of the system is<br />
represented by that of module 13
Triple Modular Redundancy<br />
Reminder: TMR system consists of three<br />
parallel modules<br />
Reliability of a single module is R_M(t)<br />
Reliability of a TMR system that consists of three identical modules is<br />
R_TMR(t) = R_M^3(t) + 3·R_M^2(t)·(1 − R_M(t)) = 3·R_M^2(t) − 2·R_M^3(t)<br />
Reliability of the TMR arrangement may be worse than the reliability of<br />
an individual module (this is the case whenever R_M(t) < 0.5)!
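A short sketch of the TMR formula shows both regimes (the module reliabilities are chosen for illustration):

```python
def tmr_reliability(r_m):
    # Majority voting over three identical modules: 3*R^2 - 2*R^3
    return 3 * r_m**2 - 2 * r_m**3

print(tmr_reliability(0.9))  # 0.972 > 0.9: TMR improves reliability
print(tmr_reliability(0.4))  # 0.352 < 0.4: TMR hurts when R_M < 0.5
```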
M-of-N Arrangement<br />
A system consists of N identical modules<br />
At least M modules must function correctly in order to prevent a<br />
system failure<br />
R_M-of-N(t) = Σ_{i=0}^{N−M} C(N, i) · R_M^{N−i}(t) · (1 − R_M(t))^i
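The general formula reduces to TMR for M = 2, N = 3; a sketch using Python's `math.comb` for the binomial coefficient:

```python
from math import comb

def m_of_n_reliability(m, n, r):
    # Sum over i = 0 .. n-m failed modules:
    # C(n, i) * r^(n-i) * (1-r)^i
    return sum(comb(n, i) * r**(n - i) * (1 - r)**i
               for i in range(n - m + 1))

# TMR as the 2-of-3 special case; matches 3R^2 - 2R^3:
print(m_of_n_reliability(2, 3, 0.9))  # ≈ 0.972
```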
Software Reliability<br />
• Software is, in general, logically more complex<br />
• Software failures are design failures<br />
• caused by design faults (which are of human origin)<br />
• design = all software development steps (from requirements to<br />
deployment and maintenance)<br />
• harder to identify, measure and detect<br />
• Software does not wear out
Failure rate<br />
The term failure intensity is often used instead of failure rate (to<br />
avoid confusion)<br />
Typically, software is under continuous development and<br />
maintenance, which leads to jumps in the overall failure rate<br />
• May increase after major program/environmental changes<br />
• May decrease due to the improvements and bug-fixes
Software Reliability (ctd.)<br />
• Software, unlike hardware, can be fault-free (theoretically :))<br />
• some formal methods can guarantee the correctness of<br />
software (proof-based verification, model checking, etc.)<br />
• Correctness of software does not ensure its reliability!<br />
• software can satisfy the specification document, yet the<br />
specification document itself might already be faulty<br />
• No independence assumption, i.e., copies of software will fail<br />
together<br />
• most hardware fault tolerance mechanisms ineffective for<br />
software<br />
• design diversity instead of component redundancy<br />
(e.g., N-version programming )
Design diversity<br />
Each variant of software is generated by a separate (independent)<br />
team of developers<br />
• higher probability of generating a correct variant<br />
• ideally, independent design faults in different variants<br />
Costly, yet leads to an effective reliability improvement<br />
Not as effective as N-modular redundancy in hardware reliability<br />
engineering [J. C. Knight and N. G. Leveson, 1986]
Appendix: Failure Rate Proof<br />
Proof:<br />
λ(t) = lim_{Δt→0} (1/Δt) · P{A_S^[t,t+Δt] | ¬A_S^[0,t]}<br />
= lim_{Δt→0} (1/Δt) · P{A_S^[t,t+Δt] ∩ ¬A_S^[0,t]} / P{¬A_S^[0,t]}<br />
= lim_{Δt→0} (1/Δt) · (P{A_S^[0,t+Δt]} − P{A_S^[0,t]}) / R(t)<br />
= lim_{Δt→0} (R(t) − R(t + Δt)) / (Δt · R(t))<br />
= f(t) / R(t),<br />
where<br />
– A_S^[x,y] ≙ S failed over time [x, y]<br />
– ¬A_S^[x,y] ≙ S did not fail over time [x, y]