Software Safety<br />
Lecture 8: System Reliability<br />
Elena Troubitsyna<br />
Anton Tarasyuk<br />
<strong>Åbo</strong> <strong>Akademi</strong> University
System Dependability<br />
A. Avizienis, J.-C. Laprie and B. Randell:<br />
Dependability and its Threats (2004):<br />
Dependability is the ability of a system to deliver a service that<br />
can be justifiably trusted or, alternatively, the ability of a system<br />
to avoid service failures that are more frequent or more severe than<br />
is acceptable<br />
Dependability attributes: reliability, availability, maintainability,<br />
safety, confidentiality, integrity
Reliability Definition<br />
Reliability: the ability of a system to deliver correct service under<br />
given conditions for a specified period of time<br />
Reliability is generally measured by the probability that a system<br />
(or system component) S can perform a required function under<br />
given conditions for the time interval [0, t]:<br />
R(t) = P {S not failed over time [0, t]}
More Definitions<br />
Failure rate (or failure intensity): the number of failures per time<br />
unit (hour, second, ...) or per natural unit (transaction, run, ...)<br />
• can be considered as an alternative way of expressing reliability<br />
Availability: the ability of a system to be in a state to deliver<br />
correct service under given conditions at a specified instant of time<br />
• usually measured by the average (over time) probability that a<br />
system is operational in a specified environment<br />
Maintainability: the ability of a system to be restored to a state in<br />
which it can deliver correct service<br />
• usually measured by the probability that maintenance of the<br />
system will restore it within a given time period
Safety-critical Systems<br />
• Pervasive use of computer-based systems in many critical<br />
infrastructures<br />
• flight/traffic control, driverless trains/cars operation, nuclear<br />
plant monitoring, robotic surgery, military applications, etc.<br />
• Extremely high reliability requirements for safety-critical<br />
systems<br />
• avionics domain example: failure rate ≤ 10^(−9) failures per hour,<br />
i.e., on average more than a hundred thousand years of operation<br />
without encountering a failure
Hardware vs. Software<br />
• Hardware for safety-critical systems is very reliable and its<br />
reliability is being improved<br />
• Software is not as reliable as hardware; however, its role in<br />
safety-critical systems keeps increasing<br />
• The division between hardware and software reliability is<br />
somewhat artificial [J. D. Musa, 2004]<br />
• Many concepts of software reliability engineering are adapted<br />
from the mature and successful techniques of hardware<br />
reliability
Hardware Failures<br />
The system is said to have a failure when its actual behaviour<br />
deviates from the intended one specified in design documents<br />
Underlying faults: most hardware failures are caused by<br />
physical wear-out or a physical defect of a component<br />
• transient faults: the faults that may occur and then disappear after<br />
some period of time<br />
• permanent faults: the faults that remain in the system until they<br />
are repaired<br />
• intermittent faults: recurring transient faults
Failure Rate<br />
The failure rate of a system (or system component) is the mean<br />
number of failures within a given period of time<br />
The failure rate is a conditional measure: it gives the probability<br />
that a system, which has been operating over time t, fails over the<br />
next time unit<br />
The failure rate of a component normally varies with time: λ(t)
Failure Rate (ctd.)<br />
Classification of failures (time-dependent), forming the well-known bathtub curve:<br />
• Early failure period<br />
• Constant failure rate period<br />
• Wear-out failure period
Failure Rate (ctd.)<br />
Let T be the random variable measuring the uptime of some<br />
system (or system component) S:<br />
R(t) = P{T > t} = 1 − F(t) = ∫_t^∞ f(u) du,<br />
where<br />
– F(t) is the cumulative distribution function of T<br />
– f(t) is the (failure) probability density function of T<br />
The failure rate of the system can be defined as:<br />
λ(t) = f(t) / R(t)
Most Important Distributions<br />
Discrete distributions:<br />
• binomial distribution<br />
• Poisson distribution<br />
Continuous distributions:<br />
• exponential distribution<br />
• normal distribution<br />
• lognormal distribution<br />
• Weibull distribution<br />
• gamma distribution<br />
• Pareto distribution
Repair Rate<br />
The repair rate expresses the probability that a system, failed for a<br />
time t, recovers its ability to perform its function in the next time<br />
unit<br />
The repair rate of the system can be defined as:<br />
µ(t) = g(t) / (1 − M(t)),<br />
where<br />
– g(t) is the (repair) probability density function and<br />
– M(t) is the system maintainability
Reliability Parameters<br />
MTTF: Mean Time to Failure, i.e., the expected time that a system will<br />
operate before the first failure occurs (often term MTBF – Mean Time<br />
Between Failures – is used for repairable systems)<br />
MTTR: Mean Time to Repair, i.e., the average time taken to repair a<br />
system that has failed<br />
MTTR includes the time taken to detect the failure, locate the fault,<br />
repair and reconfigure the system.<br />
MTTF = ∫_0^∞ R(t) dt,  MTTR = ∫_0^∞ (1 − M(t)) dt<br />
Availability of a repairable system is defined as<br />
A = MTTF / (MTTF + MTTR)
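As a quick numeric sketch of the availability formula (the MTTF and MTTR figures below are invented for illustration, not taken from the slides):

```python
# Hypothetical figures for a repairable system, in hours:
mttf = 2000.0   # mean time to failure
mttr = 4.0      # mean time to repair

# Steady-state availability: long-run fraction of time the system is up
availability = mttf / (mttf + mttr)
print(availability)  # ≈ 0.998
```

Even a modest repair time keeps availability close to 1 as long as MTTF dominates MTTR.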
Exponential Distribution<br />
The distribution gives:<br />
• failure density: f(t) = λ·e^(−λt)<br />
• reliability function: R(t) = e^(−λt)<br />
• constant failure and repair rates: λ(t) = λ and µ(t) = µ<br />
• MTTF = 1/λ<br />
• MTTR = 1/µ<br />
The probability of a system working correctly throughout a given<br />
period of time decreases exponentially with the length of this time<br />
period.
Exponential Distribution (cont.)<br />
The exponential distribution is often used because it makes<br />
reliability calculations much simpler<br />
During the useful-life stage, the failure rate is related to the<br />
reliability of the component or system by exponential failure law<br />
It describes neither the case of early failures (decreasing failure rate)<br />
nor the case of worn-out components (increasing failure rate)
MTTF Example<br />
A system with a constant failure rate of 0.001 failures per hour has<br />
an MTTF of 1000 hours.<br />
This does not mean that the system will operate correctly<br />
for 1000 hours!<br />
The reliability of such a system at time t is R(t) = e^(−λt)<br />
Assume that t = MTTF = 1/λ:<br />
R(MTTF) = e^(−1) ≈ 0.37<br />
Any given system has only a 37% chance of functioning correctly<br />
for an amount of time equal to the MTTF (i.e., a 63% chance of<br />
failing in this period).
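The calculation above can be checked in a few lines of Python; this is a sketch under the constant-failure-rate assumption used on the slide:

```python
import math

failure_rate = 0.001          # failures per hour (assumed constant)
mttf = 1 / failure_rate       # 1000 hours

def reliability(t, lam):
    # Exponential failure law: R(t) = e^(-lambda * t)
    return math.exp(-lam * t)

# Probability of operating correctly for one whole MTTF:
print(reliability(mttf, failure_rate))  # e^-1 ≈ 0.3679
```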
Failure Rate Estimation<br />
The failure rate is often assumed to be constant (this assumption<br />
simplifies calculation)<br />
Often no other assumption can be made because of the small<br />
amount of available failure data<br />
In this case, an estimator of the failure rate is given by:<br />
λ = N_f / T_f,<br />
where<br />
– N_f – number of failures observed during operation<br />
– T_f – cumulative operating time
Example: Failure Rate Calculation<br />
Ten identical components are each tested until they either fail or reach<br />
1000 hours, at which time the test is terminated for that component.<br />
The results are:<br />
Component Hours Failure<br />
Component 1 1000 No failure<br />
Component 2 1000 No failure<br />
Component 3 467 Failed<br />
Component 4 1000 No failure<br />
Component 5 630 Failed<br />
Component 6 590 Failed<br />
Component 7 1000 No failure<br />
Component 8 285 Failed<br />
Component 9 648 Failed<br />
Component 10 882 Failed<br />
Totals 7502 6<br />
The estimated failure rate is:<br />
λ = 6 failures / 7502 hrs = 799.8 · 10^(−6) failures/hr,<br />
i.e., 799.8 failures for every million hours of operation.
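The same estimate can be reproduced programmatically; the list below is just the table transcribed (a sketch, not part of the original slides):

```python
# (hours on test, failed?) for each of the ten components in the table
tests = [(1000, False), (1000, False), (467, True), (1000, False),
         (630, True), (590, True), (1000, False), (285, True),
         (648, True), (882, True)]

n_failures = sum(1 for _, failed in tests if failed)   # 6
total_hours = sum(hours for hours, _ in tests)         # 7502
lam = n_failures / total_hours                         # estimator N_f / T_f
print(lam * 1e6)  # ≈ 799.8 failures per million hours
```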
Combinational Models<br />
Allow overall reliability of the system to be calculated from the<br />
reliability of its components<br />
Physical separation and fault isolation<br />
• greatly reduce the complexity of the reliability models<br />
• enable redundancy to achieve fault tolerance<br />
Two different situations are distinguished:<br />
• Failure of any component causes system failure – series<br />
model<br />
• Several components must fail simultaneously to cause a<br />
malfunction – parallel model
Series Systems<br />
A configuration in which the failure of any system component causes<br />
failure of the entire system<br />
For a series system that consists of N (independent) components, the<br />
overall system failure rate is<br />
λ = Σ_{i=1}^N λ_i<br />
and, consequently,<br />
R(t) = Π_{i=1}^N R_i(t)
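For independent components with constant (exponential) failure rates, the two formulas above agree; a minimal sketch with invented per-hour rates:

```python
import math

def series_reliability(t, rates):
    # R(t) = product of e^(-lambda_i * t) over the components in series
    r = 1.0
    for lam in rates:
        r *= math.exp(-lam * t)
    return r

rates = [1e-4, 2e-4, 5e-5]   # hypothetical per-hour failure rates
t = 1000.0
# Equivalent to a single component with lambda = sum(rates):
print(series_reliability(t, rates), math.exp(-sum(rates) * t))
```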
Parallel Systems<br />
Redundant system – failure of one component<br />
does not lead to failure of the entire system<br />
The system will remain operational if at least<br />
one of the parallel elements is functioning<br />
correctly<br />
Reliability of the parallel system is determined by considering first the<br />
probability of failure (unreliability) of an individual component and then<br />
that of the overall system:<br />
R(t) = 1 − Π_{i=1}^N (1 − R_i(t))
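The product-of-unreliabilities rule can be sketched directly (the module reliabilities below are illustrative):

```python
def parallel_reliability(component_reliabilities):
    # The system fails only if every parallel branch fails:
    # R = 1 - product of (1 - R_i)
    unreliability = 1.0
    for r in component_reliabilities:
        unreliability *= (1.0 - r)
    return 1.0 - unreliability

# Two redundant modules, each 0.9 reliable at the time of interest:
print(parallel_reliability([0.9, 0.9]))  # 0.99
```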
Series-Parallel Combinations<br />
The most common in practice.<br />
Any such system may be simplified step by step, i.e.,<br />
reduced to a single equivalent component<br />
A series-parallel arrangement:<br />
(a) The original arrangement<br />
(b) The result of combining parallel<br />
modules<br />
(c) The result of combining the series<br />
modules<br />
The overall reliability of the system is<br />
represented by that of module 13
Triple Modular Redundancy<br />
Reminder: TMR system consists of three<br />
parallel modules<br />
Reliability of a single module is R_M(t)<br />
Reliability of a TMR system that consists of three identical modules is<br />
R_TMR(t) = R_M^3(t) + 3·R_M^2(t)·(1 − R_M(t)) = 3·R_M^2(t) − 2·R_M^3(t)<br />
Reliability of the TMR arrangement may be worse than the reliability of<br />
an individual module (this is the case whenever R_M(t) < 0.5)!
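A short sketch of the TMR formula shows both regimes (the module reliabilities are chosen for illustration):

```python
def tmr_reliability(r_m):
    # Majority voting over three identical modules: 3*R^2 - 2*R^3
    return 3 * r_m**2 - 2 * r_m**3

print(tmr_reliability(0.9))  # 0.972 > 0.9: TMR improves reliability
print(tmr_reliability(0.4))  # 0.352 < 0.4: TMR hurts when R_M < 0.5
```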
M-of-N Arrangement<br />
A system consists of N identical modules<br />
At least M modules must function correctly in order to prevent a<br />
system failure<br />
R_M-of-N(t) = Σ_{i=0}^{N−M} C(N, i) · R_M^{N−i}(t) · (1 − R_M(t))^i
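The general formula reduces to TMR for M = 2, N = 3; a sketch using Python's `math.comb` for the binomial coefficient:

```python
from math import comb

def m_of_n_reliability(m, n, r):
    # Sum over i = 0 .. n-m failed modules:
    # C(n, i) * r^(n-i) * (1-r)^i
    return sum(comb(n, i) * r**(n - i) * (1 - r)**i
               for i in range(n - m + 1))

# TMR as the 2-of-3 special case; matches 3R^2 - 2R^3:
print(m_of_n_reliability(2, 3, 0.9))  # ≈ 0.972
```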
Software Reliability<br />
• Software is, in general, logically more complex<br />
• Software failures are design failures<br />
• caused by design faults (which are of human origin)<br />
• design = all software development steps (from requirements to<br />
deployment and maintenance)<br />
• harder to identify, measure and detect<br />
• Software does not wear out
Failure rate<br />
The term failure intensity is often used instead of failure rate (to<br />
avoid confusion)<br />
Typically, software is under continuous development and<br />
maintenance, which leads to jumps in the overall failure rate<br />
• May increase after major program/environmental changes<br />
• May decrease due to the improvements and bug-fixes
Software Reliability (ctd.)<br />
• Software, unlike hardware, can be fault-free (theoretically :))<br />
• some formal methods can guarantee the correctness of<br />
software (proof-based verification, model checking, etc.)<br />
• Correctness of software does not ensure its reliability!<br />
• software can satisfy the specification document, yet the<br />
specification document itself might already be faulty<br />
• No independence assumption, i.e., copies of software will fail<br />
together<br />
• most hardware fault tolerance mechanisms ineffective for<br />
software<br />
• design diversity instead of component redundancy<br />
(e.g., N-version programming )
Design diversity<br />
Each variant of software is generated by a separate (independent)<br />
team of developers<br />
• higher probability of generating a correct variant<br />
• ideally, independent design faults in different variants<br />
Costly, yet leads to an effective reliability improvement<br />
Not as effective as N-modular redundancy in hardware reliability<br />
engineering [J. C. Knight and N. G. Leveson, 1986]
Appendix: Failure Rate Proof<br />
Proof:<br />
λ(t) = lim_{Δt→0} (1/Δt) · P{A_S^[t,t+Δt] | ¬A_S^[0,t]}<br />
= lim_{Δt→0} (1/Δt) · P{A_S^[t,t+Δt] ∩ ¬A_S^[0,t]} / P{¬A_S^[0,t]}<br />
= lim_{Δt→0} (1/Δt) · (P{A_S^[0,t+Δt]} − P{A_S^[0,t]}) / R(t)<br />
= lim_{Δt→0} (R(t) − R(t + Δt)) / (Δt · R(t))<br />
= f(t) / R(t),<br />
where<br />
– A_S^[x,y] ≙ S failed over time [x, y]<br />
– ¬A_S^[x,y] ≙ S did not fail over time [x, y]