High Availability Theoretical Basics - Schneider Electric
System Technical Note

How can I... Increase the Availability of a Collaborative Control System
Table of Contents

Introduction to High Availability .............................................5
High Availability Theoretical Basics .......................................9
High Availability with Collaborative Control System ..........31
Conclusion ..............................................................................73
Appendix .................................................................................76
Reliability & Availability Calculation Examples ...................83
Introduction to High Availability

Purpose
The intent of this System Technical Note (STN) is to describe the capabilities of the different Schneider Electric solutions that meet the requirements of the most critical applications, and consequently increase the availability of a Collaborative Control System. It provides a common, readily understandable reference point for end users, system integrators, OEMs, sales people, business support and other parties.

Introduction
Increasingly, process applications require a high availability automation system. Before deciding to install a high availability automation system in your installation, you need to consider the following key questions:
• What is the required security level? This concerns the security of both people and hardware. For instance, the complete start/stop sequences that manage the kiln in a cement plant include a key condition: the most powerful equipment must be the last to start and stop.
• What is the criticality of the process? This point concerns all the processes that involve a reaction (mechanical, chemical, etc.). Consider the example of the kiln again. To avoid its destruction, the complete stop sequence requires a slow cooling of the constituent material. Another typical example is the biological treatment in a wastewater plant, which cannot be stopped every day.
• What is the environmental criticality? Consider the example of an automation system in a tunnel. If a fire can start on one side, the PLCs (the primary and the redundant one) should be installed on opposite sides of the tunnel. Furthermore, will the system face a harsh environment in terms of vibration, temperature, shock, etc.?
• Which other particular circumstances does the system have to address? This last topic includes additional questions, for example: does the system really need redundancy if the electrical network becomes inoperative in the concerned layer of the installation? What is the criticality of the data in the event of a loss of communication?
Availability is a term that is increasingly used to qualify a process asset or system. In addition, Reliability and Maintainability are terms now often used for analyses considered of major usefulness in design improvement and in production diagnostic issues. Accordingly, the design of automation system architectures must consider these types of questions.
Document Overview

The Collaborative Control System provided by Schneider Electric offers different levels of redundancy, allowing you to design an effective high availability system.

This document contains the following chapters:
• The High Availability Theoretical Basics chapter describes the fundamentals of High Availability. On the one hand it presents theory and basics; on the other hand it explains a method to conceptualize series/parallel architectures. Calculation examples illustrate this approach.
• The Collaborative Control System Availability chapter presents different High Availability solutions, especially for the Collaborative Control System, from information management down to the I/O module level.
• The final Conclusion chapter summarizes the customer benefits provided by Schneider Electric High Availability solutions, as well as additional information and references.
The following drawing represents the various levels of an automation system architecture:

As shown in the following chapters, redundancy is a convenient means of raising global system reliability, and thus its availability. In the Collaborative Control System, High Availability (that is, redundancy) can be addressed at the different levels of the architecture:
• Single or redundant I/O Modules (depending on sensor/actuator redundancy).
• Depending on the I/O handling philosophy (for example conventional Remote I/O stations, or I/O Islands distributed on Ethernet), different scenarios can be applied: dual communication medium I/O Bus, or single or dual Self-Healing Ring.
• Programmable Controller CPU redundancy (Hot Standby PAC Station).
• Communication network and port redundancy.
• SCADA System dedicated approaches with multiple operator station location scenarios and resource redundancy capabilities.
Guide Scope
The realization of an automation project includes five main phases: Selection, Design, Configuration, Implementation and Operation. To help you develop a whole project based on these phases, Schneider Electric created the System Technical book concept: Guides (STG) and Notes (STN).

A System Technical Guide provides technical guidelines and recommendations to implement technologies according to your needs and requirements. This guide covers the entire project life cycle, from the selection phase to the operation phase, providing design methodology and even source code examples for all components of a sub-system.
A System Technical Note gives a more theoretical approach, focusing specifically on a system technology. This book describes our complete solution offer for a system, and therefore supports you in the selection phase of a project. STGs and STNs are linked and complementary: to sum up, you will find the technology fundamentals in an STN, and their corresponding tested and validated applications in one or several STGs.
[Diagram: STG and STN scope across the Automation Project Life Cycle]
High Availability Theoretical Basics

This section describes basic high availability terms, concepts and formulas, and includes examples for typical applications.

Fault Tolerant System
A Fault Tolerant System usually refers to a system that can continue to operate even though a hardware component becomes inoperative. The Redundancy principle is often used to implement a Fault Tolerant System, because an alternate component takes over the task transparently.
Lifetime and Failure Rate

Considering a given system or device, its Lifetime corresponds to the number of hours it can function under normal operating conditions. This number is the result of the individual life expectancy of the components used in its assembly.

Lifetime is generally seen as a sequence of three subsequent phases: the "early life" (or "infant mortality") period, the "useful life," and the "wear-out" period.
Failure Rate (λ) is defined as the average (mean) rate at which a system becomes inoperative.

When considering events occurring on a population of systems, for example a group of light bulbs, the units used should normally be failures per unit of system-time, for example failures per machine-hour or failures per system-year. Because the scope of this document is limited to single repairable entities, we will usually discuss failures per unit of time.
Failure Rate Example: For a human being, Failure Rate (λ) measures the probability of death occurring in the next hour. Stating λ(20 years) = 10⁻⁶ per hour would mean that the probability for someone aged 20 to die in the next hour is 10⁻⁶.
Bathtub Curve
The following figure shows the Bathtub Curve, which represents the Failure Rate (λ) as a function of Lifetime (t):

Consider the relation between Failure Rate and Lifetime for a device consisting of assembled electronic parts. This relationship is represented by the "bathtub curve" shown in the previous diagram. In "early life," the system exhibits a high Failure Rate, which gradually decreases until it approaches a constant value that is maintained during its "useful life." The system finally enters the "wear-out" stage of its life, where the Failure Rate increases exponentially.
Note: Useful Life normally starts at the beginning of system use and ends at the beginning of the wear-out phase. Assuming that "early life" corresponds to the "burn-in" period indicated by the manufacturer, we generally consider that Useful Life starts when the end user puts the system into service.
RAMS (Reliability, Availability, Maintainability, Safety)

The following text, from the MIL-HDBK-338B standard, defines the RAM criteria and their probabilistic aspect:
"For the engineering specialties of reliability, availability and maintainability (RAM), the theories are stated in the mathematics of probability and statistics. The underlying reason for the use of these concepts is the inherent uncertainty in predicting a failure. Even given a failure model based on physical or chemical reactions, the results will not be the time a part will fail, but rather the time a given percentage of the parts will fail or the probability that a given part will fail in a specified time."
Along with Reliability, Availability and Maintainability, Safety is the fourth metric of a meta-domain that specialists have named RAMS (also sometimes referred to as dependability).
Metrics
RAMS metrics relate to time allocation and depend on the operational state of a given system.

The following curve defines the state linked to each term:
• MUT: Mean Up Time
MUT qualifies the average duration for which the system is in the operational state.
• MDT: Mean Down Time
MDT qualifies the average duration for which the system is not in the operational state. It comprises the successive portions of time required to detect the error, fix it, and restore the system to its operational state.
• MTBF: Mean Time Between Failures
MTBF is defined by the MIL-HDBK-338 standard as follows: "A basic measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions."
Thus for repairable systems, MTBF is a metric commonly used to appraise Reliability, and corresponds to the average time interval (normally specified in hours) between two consecutive occurrences of inoperative states.
Put simply: MTBF = MUT + MDT
MTBF can be calculated (provisional reliability) based on data books such as UTE C80-810 (RDF2000), MIL-HDBK-217F, FIDES, RDF 93, and BELLCORE. Other inputs include field feedback, laboratory testing, or demonstrated MTBF (operational reliability), or a combination of these. Remember that MTBF only applies to repairable systems.
• MTTF (or MTTFF): Mean Time To First Failure
MTTF is the mean time before the occurrence of the first failure.
MTTF (and MTBF by extension) is often confused with Useful Life, even though these two concepts are not related in any way. For example, a battery may have a Useful Life of 4 hours and an MTTF of 100,000 hours. These figures indicate that for a population of 100,000 batteries in continuous operation, there will be approximately one battery failure every hour (defective batteries being replaced).
Considering a repairable system with exponentially distributed Reliability and a constant Failure Rate (λ), MTTF = 1 / λ.

Mean Down Time is usually very small compared to Mean Up Time, so MTBF is approximately equal to MUT and can be assimilated to MTTF, resulting in the following relationship: MTBF = 1 / λ.

This relationship is widely used in the calculations that follow.
Example:

Given the MTBF of a communication adapter, 618,191 hours, what is the probability for that module to operate without failure for 5 years?

Calculate the module Reliability over a 5-year time period:
R(t) = e^(−λt) = e^(−t / MTBF)

a) Divide 5 years, that is 8,760 × 5 = 43,800 hours, by the given MTBF:
43,800 / 618,191 = 0.07085

b) Then raise e to the power of the negative of that number:
e^(−0.07085) = 0.9316

Thus there is a 93.16% probability that the communication module will not fail over a 5-year period.

• FIT: Failures In Time
Typically used as the Failure Rate measurement for non-repairable electronic components, FIT is the number of failures in one billion (10⁹) hours.
FIT = 10⁹ / MTBF or MTBF = 10⁹ / FIT
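The worked example above and the FIT conversion can be reproduced with a short script. This is a minimal sketch, using the MTBF figure given in the example; the function names are illustrative:

```python
import math

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """R(t) = e^(-t/MTBF), assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

def mtbf_to_fit(mtbf_hours: float) -> float:
    """FIT = number of failures per 10^9 hours."""
    return 1e9 / mtbf_hours

mtbf = 618_191              # communication adapter MTBF, in hours
mission = 8_760 * 5         # 5 years expressed in hours (43,800 h)
print(f"R(5 years) = {reliability(mtbf, mission):.4f}")   # ≈ 0.9316
print(f"FIT        = {mtbf_to_fit(mtbf):.1f}")
```

Running it confirms the 93.16% figure obtained in steps a) and b).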
Safety

Definition
Safety refers to the protection of people, assets and the environment. For example, if an installation has a tank whose internal pressure exceeds a given threshold, there is a high chance of explosion and eventual destruction of the installation (with injury or death of people and damage to the environment). In this example, the Safety System put in place opens a valve to the atmosphere to prevent the pressure threshold from being crossed.
Maintainability

Definition

Maintainability refers to the ability of a system to be maintained in an operational state. This once again relates to probability: Maintainability corresponds to the probability for an inoperative system to be repaired in a given time interval.

While design choices impact the maintainability of a system to a certain extent, the maintenance organization also has a major impact on it. Having the right number of people trained to observe and react with the proper methods, tools, and spare parts are considerations that usually depend more on the customer organization than on the automation system architecture.
Mathematics Basics

Equipment shall be maintainable on-site by trained personnel according to the maintenance strategy. A common metric, Maintainability M(t), gives the probability that a given active maintenance operation can be accomplished in a given time interval.

The relationship between Maintainability and repair is similar to the relationship between Reliability and failure, with the repair rate µ(t) defined in a way analogous to the Failure Rate. When this repair rate is considered constant, it implies an exponential distribution for Maintainability: M(t) = 1 − e^(−µt).

The maintainability of equipment is reflected in MTTR, which is usually considered as the sum of the individual administrative, transport, and repair times.
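Under the constant-repair-rate model, the probability that a repair completes within t hours is the complement of the exponential term, 1 − e^(−µt). A minimal sketch, where the MTTR value is an illustrative placeholder and not a figure from this document:

```python
import math

def maintainability(mu_per_hour: float, t_hours: float) -> float:
    """M(t) = 1 - e^(-mu*t): probability that repair completes within t hours."""
    return 1.0 - math.exp(-mu_per_hour * t_hours)

mttr = 2.0            # illustrative mean time to repair, in hours
mu = 1.0 / mttr       # constant repair rate (repairs per hour)
print(f"M(4 h) = {maintainability(mu, 4):.4f}")   # 1 - e^-2 ≈ 0.8647
```

As expected, the longer the allowed time window relative to MTTR, the closer M(t) gets to 1.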
Availability

Definition
The term High Availability is often used when discussing Fault Tolerant Systems. For example, your telephone line is supposed to offer you a high level of availability: the service you are paying for has to be effectively accessible and dependable. Your line availability relates to the continuity of the service you are provided. As an example, assume you are living in a remote area with occasional violent storms. Because of your location and the damage these storms can cause, long delays are required to fix your line once it is out of order. In these conditions, if on average your line is usable only 50% of the time, you have poor availability. By contrast, if on average each of your attempts is 100% satisfied, then your line has high availability.

This example demonstrates that Availability is the key metric for measuring a system's tolerance level, that it is typically expressed as a percentage (for example 99.999%), and that it belongs to the domain of probability.
Mathematics Basics

The Instantaneous Availability of a device is the probability that this device will be in the functional state for which it was designed, under given conditions and at a given time (t), with the assumption that the required external conditions are met.

Besides Instantaneous Availability, different variants have been specified, each corresponding to a specific definition, including Asymptotic Availability, Intrinsic Availability and Operational Availability.
Asymptotic (or Steady State) Availability: A∞

Asymptotic Availability is the limit of the instantaneous availability function as time approaches infinity:

A∞ = 1 − MDT / (MUT + MDT) = MUT / (MUT + MDT) = MUT / MTBF

where Downtime includes all repair time (corrective and preventive maintenance time), administrative time and logistic time.

The following curve shows an example of asymptotic behavior:
Intrinsic (or Inherent) Availability: Ai

Intrinsic Availability does not include administrative time and logistic time, and usually does not include preventive maintenance time. It is primarily a function of the basic equipment/system design.

Ai = MTBF / (MTBF + MTTR)

We will consider Intrinsic Availability in our Availability calculations.
Operational Availability: Ao

Operational Availability corresponds to the probability that an item will operate satisfactorily at a given point in time when used in an actual or realistic operating and support environment.

Ao = Uptime / Operating Cycle

Operational Availability includes logistics time, ready time, waiting or administrative downtime, and both preventive and corrective maintenance downtime.
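The three availability variants defined above differ only in which time components they count. A minimal sketch comparing them; the MTBF, MTTR and downtime figures are illustrative placeholders, not values from this document:

```python
def asymptotic_availability(mut: float, mdt: float) -> float:
    """A_inf = MUT / (MUT + MDT): MDT includes all repair, admin and logistic time."""
    return mut / (mut + mdt)

def intrinsic_availability(mtbf: float, mttr: float) -> float:
    """A_i = MTBF / (MTBF + MTTR): active repair time only."""
    return mtbf / (mtbf + mttr)

def operational_availability(uptime: float, operating_cycle: float) -> float:
    """A_o = Uptime / Operating Cycle: the availability actually experienced."""
    return uptime / operating_cycle

# Illustrative figures in hours: 4 h of hands-on repair,
# but 20 h of total downtime once logistics are included.
print(intrinsic_availability(50_000, 4))     # repair time only
print(asymptotic_availability(50_000, 20))   # all downtime counted
```

Because Ai ignores logistic and administrative delays, it is always at least as high as the availability computed from total downtime.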
This is the availability that the customer actually experiences. It is essentially the a posteriori availability, based on the actual events that happened to the system.

Classification

A common way to classify a system in terms of Availability consists of counting the number of 9s in its availability figure.

The following table defines the availability classes:
Class | Type of Availability | Availability (%) | Downtime per Year | Number of Nines
1 | Unmanaged         | 90       | 36.5 days    | 1 nine
2 | Managed           | 99       | 3.65 days    | 2 nines
3 | Well Managed      | 99.9     | 8.76 hours   | 3 nines
4 | Tolerant          | 99.99    | 52.6 minutes | 4 nines
5 | High Availability | 99.999   | 5.26 minutes | 5 nines
6 | Very High         | 99.9999  | 31.5 seconds | 6 nines
7 | Ultra High        | 99.99999 | 3.15 seconds | 7 nines

For example, a system that has a five-nines availability rating is 99.999% available, with a system downtime of approximately 5.26 minutes per year.
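The downtime column follows directly from the availability percentage, using a 365-day year. A minimal sketch:

```python
def downtime_per_year_seconds(availability_percent: float) -> float:
    """Yearly downtime in seconds for a given availability percentage."""
    year_seconds = 365 * 24 * 3600      # 31,536,000 seconds per year
    return year_seconds * (1.0 - availability_percent / 100.0)

for a in (99.0, 99.9, 99.99, 99.999):
    print(f"{a}% -> {downtime_per_year_seconds(a) / 60:.2f} minutes/year")
```

Five nines (99.999%) yields 315.36 seconds, that is the 5.26 minutes per year quoted in the table.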
Reliability

Definition

A fundamental associated metric is Reliability. Return to the example of your telephone line. If the wired network is very old, having suffered from many years of lack of maintenance, it may frequently be out of order. Even if the current maintenance personnel are doing their best to repair it in a minimum time, the line can be said to have poor reliability if, for example, it has experienced ten losses of communication during the last year. Notice that Reliability necessarily refers to a given time interval, typically one year. Therefore, Reliability accounts for the absence of shutdown of a system in operation over a given time interval. As with Availability, we consider Reliability in terms of perspective (a prediction), and within the domain of probability.
Mathematics Basics
In many situations a detected disruption fortunately does not mean the end of a device's life. This is usually the case for the automation and control systems being discussed, which are repairable entities. As a result, the ability to predict the number of shutdowns due to a detected disruption over a specified period of time is useful for estimating the budget required for the replacement of inoperative parts.

In addition, knowing this figure can help you maintain an adequate inventory of spare parts. Put simply, the question "Will a device work for a particular period?" can only be answered as a probability; hence the concept of Reliability.
According to the MIL-STD-721C standard, the Reliability R(t) of a given system is the probability that the system performs its intended function under stated conditions for a stated period of time. As an example, a system with a reliability of 0.9999 over a year has a 99.99% probability of functioning properly throughout an entire year.

Note: Reliability is always indicated for a given period of time, for example one year.
Referring to the system model considered with the "bathtub curve," one characteristic is its constant Failure Rate during the useful life. In that portion of its lifetime, the Reliability of the system follows an exponential law, given by the following formula: R(t) = e^(−λt), where λ stands for the Failure Rate.

The following figure illustrates this exponential law:

As shown in the diagram, Reliability starts with a value of 1 at time zero, which represents the moment the system is put into operation. Reliability then falls gradually to zero, following the exponential law. Notably, Reliability is about 37% at t = 1/λ. As an example, assume a given system experiences an average of 0.5 inoperative states per 1-year time unit. The exponential law indicates that such a system would have about a 37% chance of remaining in operation upon reaching 1 / 0.5 = 2 years of service.
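The 37% figure at t = 1/λ can be checked directly from the exponential law. A minimal sketch using the 0.5 failures-per-year example above:

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = e^(-lambda*t), valid during the constant-failure-rate useful life."""
    return math.exp(-failure_rate * t)

lam = 0.5                           # 0.5 inoperative states per year
print(reliability(lam, 1 / lam))    # e^-1 ≈ 0.3679, about 37% at t = 1/lambda
```

Whatever the value of λ, evaluating R at t = 1/λ always gives e^(−1) ≈ 36.8%, which is why the 37% landmark is independent of the failure rate.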
Note: Considering the flat portion of the bathtub curve model, where the Failure Rate is constant over time and remains the same for a unit regardless of the unit's age, the system is said to be "memoryless."
Reliability versus Availability

Reliability is one of the factors influencing Availability, but must not be confused with Availability: 99.99% Reliability does not mean 99.99% Availability. Reliability measures the ability of a system to function without interruptions, while Availability measures the ability of the system to provide a specified application service level. Higher reliability reduces the frequency of inoperative states, thereby increasing overall Availability.
There is a difference between Hardware MTBF and System MTBF. The mean time between hardware component failures occurring on an I/O Module, for example, is referred to as the Hardware MTBF. The mean time between failures occurring on a system considered as a whole, a PLC configuration for example, is referred to as the System MTBF.

As will be demonstrated, hardware component redundancy provides an increase in the overall System MTBF, even though each individual component's MTBF remains the same.
Although Availability is a function of Reliability, it is possible for a system with poor Reliability to achieve high Availability. For example, consider a system that averages 4 failures a year, each of which can be restored with an average outage time of 1 minute. MTBF is then 131,400 minutes (525,600 minutes per year divided by 4 failures), with an MTTR of 1 minute.

In that one-year period:

• Reliability: R(t) = e^(−λt) = e^(−4) = 1.83%, which is very poor Reliability
• (Inherent) Availability: Ai = MTBF / (MTBF + MTTR) = 131,400 / (131,400 + 1) = 99.99924%, which is very good Availability
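The figures in this example can be reproduced as follows; this is a sketch of the calculation described above, with no values beyond those given in the example:

```python
import math

failures_per_year = 4
mttr_minutes = 1
minutes_per_year = 365 * 24 * 60                       # 525,600 minutes

mtbf = minutes_per_year / failures_per_year            # 131,400 minutes
reliability_1y = math.exp(-failures_per_year)          # R(1 year) = e^-4
availability = mtbf / (mtbf + mttr_minutes)            # A_i = MTBF/(MTBF+MTTR)

print(f"R(1 year) = {reliability_1y:.4f}")   # ≈ 0.0183, very poor Reliability
print(f"A_i       = {availability:.7f}")     # ≈ 0.9999924, very good Availability
```

Short outages keep MTTR tiny relative to MTBF, which is what lets Availability stay high despite frequent failures.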
Reliability Block Diagrams (RBD)
Based on basic probability computation rules, RBDs are simple, convenient tools to represent a system and its components in order to determine the Reliability of the system. The target system, for example a PLC rack, must first be interpreted in terms of series and parallel arrangements of elementary parts.
Series-Parallel Systems

The following figure shows a representation of a serial architecture:

As an example, assume that one of the 5 modules (1 power supply and 4 other modules) that populate the PLC rack becomes inoperative. As a consequence, the entire rack is affected, as it is no longer 100% capable of performing its assigned mission, regardless of which module is inoperative. Thus each of the 5 modules is considered a participating member of a 5-part series.

Note: When considering Reliability, two components are described as in series if both are necessary to perform a given function.
The following figure shows a representation of a parallel architecture:

As an example, assume that the PLC rack now contains redundant Power Supply modules, in addition to the 4 other modules. If one Power Supply becomes inoperative, then the other supplies power for the entire rack. These 2 power supplies would be considered a parallel sub-system, in turn coupled in series with the sequence of the 4 other modules.

Note: Two components are in parallel, from a reliability standpoint, when the system works if at least one of the two components works. In this example, Power Supplies 1 and 2 are said to be in active redundancy. The redundancy would be described as passive if one of the parallel components is turned on only when the other is inoperative, for example in the case of auxiliary power generators.
Serial RBD

Reliability
Serial system Reliability is equal to the product of the individual elements' reliabilities:

R_S(t) = R_1(t) × R_2(t) × R_3(t) × ... × R_n(t) = ∏(i=1..n) R_i(t)

where: R_S(t) = System Reliability
R_i(t) = Individual Element Reliability
n = Total number of elements in the serial system

Assuming constant individual Failure Rates:

R_S(t) = R_1(t) × R_2(t) × R_3(t) × ... × R_n(t) = e^(−λ_1·t) × e^(−λ_2·t) × e^(−λ_3·t) × ... × e^(−λ_n·t) = e^(−(λ_1 + λ_2 + λ_3 + ... + λ_n)·t)

That is, λ_S = Σ(i=1..n) λ_i: the equivalent Failure Rate for n serial elements is equal to the sum of the individual Failure Rates of these elements, with R_S(t) = e^(−λ_S·t).
Example 1:

Consider a system with 10 elements, each of them required for the proper operation of the system, for example a 10-module rack. Determine R_S(t), the Reliability of that system over a given time interval t, if each of the elements shows an individual Reliability R_i(t) of 0.99:

R_S(t) = ∏(i=1..10) R_i(t) = (0.99) × (0.99) × ... × (0.99) = (0.99)^10 = 0.9044

Thus the targeted system Reliability is 90.44%.
Example 2:

Consider two serial elements with the following Failure Rates:

Element 1: λ_1 = 120 × 10⁻⁶ h⁻¹
Element 2: λ_2 = 180 × 10⁻⁶ h⁻¹
The System Reliability, over a 1,000- hour mission, is:<br />
<strong>High</strong> <strong>Availability</strong> <strong>Theoretical</strong> <strong>Basics</strong><br />
n<br />
λ<br />
S<br />
= ∑ λ<br />
i 1<br />
i<br />
= λ<br />
1<br />
+ λ<br />
=<br />
2<br />
= 120 x 10 -6 + 180 x 10 -6 = 300 x 10 -6 = 0.3 x 10 -3 h -1<br />
−λ<br />
3 x 103<br />
S<br />
t − 0.3 x 10−<br />
− 0.3<br />
R S(1000<br />
h) = e = e<br />
= e = 0.7408<br />
Thus the targeted system Reliability is 74.08%<br />
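The same result can be reproduced from the constant-failure-rate formulas, again as a sketch using the example's rates:

```python
from math import exp

def serial_failure_rate(rates_per_hour):
    # Equivalent failure rate of serial elements: the sum of individual rates
    return sum(rates_per_hour)

def reliability(rate_per_hour, mission_hours):
    # R(t) = e^(-lambda * t), valid for a constant failure rate
    return exp(-rate_per_hour * mission_hours)

lam_s = serial_failure_rate([120e-6, 180e-6])  # 300e-6 per hour
print(round(reliability(lam_s, 1000), 4))      # 0.7408
```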
<strong>Availability</strong><br />
Serial system <strong>Availability</strong> is equal to the product of the individual elements’<br />
Availabilities.<br />
A_S = A_1 × A_2 × A_3 × ... × A_n = ∏(i=1 to n) A_i

where: A_S = System (asymptotic) Availability
A_i = Individual element (asymptotic) Availability
n = Total number of elements in the serial system
Calculation Example
In this example, we calculate the availability of a PAC Station using shared distributed<br />
I/O Islands. The following illustration shows the final configuration:<br />
This calculation applies the equations given by basic probability analysis. To do this<br />
calculation, a spreadsheet was developed. These are the figures applied in the<br />
spreadsheet:<br />
Failure rate: λ = 1/MTBF
Reliability: R(t) = e^(−λt)
Total serial system Failure Rate: λ_S = ∑(i=1 to n) λ_i
Total serial system MTBF: MTBF_S = 1/λ_S
Availability = MTBF / (MTBF + MTTR)
Unavailability = 1 − Availability
Unavailability over a year: Unavailability × hours (one year = 8,760 hours)
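The spreadsheet logic can be sketched in Python; the module MTBF values below are hypothetical placeholders for the spreadsheet inputs, not Schneider Electric figures:

```python
HOURS_PER_YEAR = 8760

def failure_rate(mtbf_hours):
    return 1.0 / mtbf_hours

def serial_system_mtbf(mtbf_list):
    # Failure rates of serial elements add, so MTBF_S = 1 / sum(1/MTBF_i)
    return 1.0 / sum(failure_rate(m) for m in mtbf_list)

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

def yearly_downtime_hours(avail):
    return (1.0 - avail) * HOURS_PER_YEAR

# Hypothetical rack: CPU + power supply + two I/O modules (MTBF in hours)
rack = [500_000, 800_000, 1_200_000, 1_200_000]
mtbf_s = serial_system_mtbf(rack)
avail_s = availability(mtbf_s, mttr_hours=2)
print(round(mtbf_s), round(avail_s, 6), round(yearly_downtime_hours(avail_s), 2))
```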
The following table shows the method to perform the calculation:<br />
Step Action<br />
1 Perform the calculation of the Standalone CPU.<br />
2 Perform the calculation of a distributed island.<br />
3 Based on the serial structure, add up the results from Steps 1 and 2.<br />
Note: A common variant of in-rack I/O Module stations are I/O Islands, distributed on<br />
an Ethernet communication network. <strong>Schneider</strong> <strong>Electric</strong> offers a versatile family<br />
named Advantys STB, which can be used to define such architectures.<br />
Step 1: Calculation linked to the Standalone CPU Rack.<br />
The following figures represent the Standalone CPU Rack:<br />
The following screenshot is the spreadsheet corresponding to this analysis.<br />
Step 2: Calculation linked to the STB Island.<br />
The following figures represent the Distributed I/O on STB Island:<br />
The following screenshot is the spreadsheet corresponding to this analysis:<br />
Step 3: Calculation of the entire installation. Assume that the communication network used to link the I/O Islands to the CPU is not included here; examples of network Reliability metrics are explored in a subsequent chapter.
The following figures represent the final Distributed architecture:<br />
The following screenshot is the spreadsheet corresponding to the entire analysis:<br />
Note: The highlighted values were calculated in the two previous steps<br />
Considering the results of this Serial System (Rack # 1+ Islands # 1 ... #4), Reliability over<br />
one year is approximately 82 % (the probability that this system will encounter one failure<br />
during one year is approximately 18%).<br />
System MTBF itself is approximately 44,000 hours (about 5 years)<br />
Considering the Availability, with a 2-hour Mean Time To Repair (typical of a very good logistics and maintenance organization), the system would achieve a 4-nines Availability, i.e. an average of approximately 24 minutes of downtime per year.
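These figures can be cross-checked quickly; 44,000 h and 2 h are the values quoted above:

```python
HOURS_PER_YEAR = 8760

mtbf, mttr = 44_000, 2            # system MTBF and MTTR quoted in the text
avail = mtbf / (mtbf + mttr)
downtime_min = (1 - avail) * HOURS_PER_YEAR * 60

print(f"availability = {avail:.6f}")         # roughly 0.99995 (4-nines)
print(f"downtime = {downtime_min:.0f} min")  # roughly 24 minutes per year
```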
Parallel RBD
Reliability
Theory of Probability provides an expression for the Reliability of a Parallel System (for example, a Redundant System), with Q_i(t) (Unreliability) being the complement of R_i(t):
R_Red(t) = 1 − [Q_1(t) × Q_2(t) × Q_3(t) × ... × Q_n(t)] = 1 − ∏(i=1 to n) Q_i(t) = 1 − ∏(i=1 to n) (1 − R_i(t))

with: R_Red(t) = Reliability of the Simple Redundancy System
Q_i(t) = 1 − R_i(t)
∏(i=1 to n) Q_i(t) = Probability of Failure of the System
R_i = Probability of Non-Failure of an Individual Parallelized Element
Q_i = Probability of Failure of an Individual Parallelized Element
n = Total number of Parallelized Elements

Example:
Considering two elements with the following failure rates:<br />
λ1 = 120 × 10⁻⁶ h⁻¹ and λ2 = 180 × 10⁻⁶ h⁻¹
The System Reliability, over a 1,000-hour mission, is:<br />
Reliability of elements 1 and 2 over the 1,000-hour period:<br />
R_1(1,000 h) = e^(−λ1·t) = e^(−120 × 10⁻⁶ × 10³) = 0.8869
R_2(1,000 h) = e^(−λ2·t) = e^(−180 × 10⁻⁶ × 10³) = 0.8353

Unreliability of elements 1 and 2 over the 1,000-hour period:

Q_1(1,000 h) = 1 − R_1(1,000 h) = 1 − 0.8869 = 0.1131
Q_2(1,000 h) = 1 − R_2(1,000 h) = 1 − 0.8353 = 0.1647
Redundant System Reliability over the 1,000-hour period:

R_12(t = 1,000 h) = 1 − [Q_1(1,000 h) × Q_2(1,000 h)] = 1 − (0.1131 × 0.1647) = 0.9814

Thus, with Individual Elements' Reliabilities of 88.69% and 83.53% respectively, the targeted Redundant System Reliability is 98.14%.

Availability

Parallel system Unavailability is equal to the product of the individual elements' Unavailabilities. Thus the Parallel system Availability, given the individual parallelized elements' Availabilities, is:
A_S = 1 − [(1 − A_1) × (1 − A_2) × ... × (1 − A_n)] = 1 − ∏(i=1 to n) (1 − A_i)

where: A_S = System (asymptotic) Availability
A_i = Individual element (asymptotic) Availability
n = Total number of elements in the parallel system
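A short sketch of the parallel RBD formulas, reproducing the 98.14% result from the example above (the failure rates are the example's values):

```python
from math import exp, prod

def parallel_reliability(reliabilities):
    # Parallel RBD: 1 minus the product of element unreliabilities
    return 1.0 - prod(1.0 - r for r in reliabilities)

def parallel_availability(availabilities):
    # Same structure: individual unavailabilities multiply
    return 1.0 - prod(1.0 - a for a in availabilities)

r1 = exp(-120e-6 * 1000)  # about 0.8869
r2 = exp(-180e-6 * 1000)  # about 0.8353
print(round(parallel_reliability([r1, r2]), 4))  # 0.9814
```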
Calculation Example
To illustrate a Parallel system, we perform the calculation of the <strong>Availability</strong> of a<br />
redundant PLC System using shared distributed I/O Islands.<br />
The following illustration shows the final configuration:<br />
The formulas are the same as used in the previous calculation example, except for<br />
the calculation of the reliability for a parallel system, which is as follows:<br />
R_Red(t) = 1 − [Q_1(t) × Q_2(t) × Q_3(t) × ... × Q_n(t)] = 1 − ∏(i=1 to n) Q_i(t) = 1 − ∏(i=1 to n) (1 − R_i(t))

The following table shows the method to perform the calculation:

Step Action
1 Perform the calculation of a standalone CPU.
2 Perform the calculation for the redundant structure, here the two CPUs.
3 Perform the calculation of a distributed island.
4 Concatenate the results from Steps 1, 2 and 3.
Note: the previous results from the serial analysis, regarding the calculation linked to<br />
the standalone elements, are reused.<br />
Step 1: Calculation linked to the Standalone CPU Rack.<br />
Because the analysis is identical to that for the serial case, the following screenshot<br />
shows the spreadsheet corresponding only to the final results:<br />
Step 2: Calculation of the redundant CPU group.<br />
The following figures represent the Redundant CPUs:<br />
The following screenshot is the spreadsheet corresponding to this analysis:<br />
Step 3: Calculation linked to the STB Island.<br />
Because the analysis is identical to that for the serial case, the following screenshot<br />
shows the spreadsheet corresponding to the final results only:<br />
Step 4: Calculation of the entire installation.
The following screenshot is the spreadsheet corresponding to the entire analysis:<br />
Note: The only difference between this architecture and the previous one relates to<br />
the CPU Rack: this one is a redundant one (Premium Hot Standby), while the former<br />
one was a standalone one.<br />
Looking at the results of this Parallel System (Premium CPU Rack Redundancy), Reliability over one year would be approximately 99.9%, compared to 97.4% with a Standalone Premium CPU Rack (i.e. the probability for a Premium Rack System to encounter one failure during one year would have been reduced from 2.6% to 0.1%).

System MTBF itself would increase from 335,000 hours (approximately 38 years) to 503,000 hours (approximately 57 years).

For System Availability, a 2-hour Mean Time To Repair provides approximately a 9-nines resulting Availability (almost 100%).
Note: Other calculation examples are available in the Calculation Examples chapter.

Note: The previous examples cover the PAC station only. To extend the calculation to a whole system, the MTBF of the network components and the SCADA systems (PCs, servers) must be taken into account.
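As a cross-check of the standalone-versus-redundant figures above, the sketch below assumes the classic repair-free result MTBF_pair = 1.5 × MTBF for two identical units in parallel; it reproduces the quoted one-year reliabilities:

```python
from math import exp

HOURS_PER_YEAR = 8760
mtbf_alone = 335_000          # standalone rack MTBF quoted in the text
lam = 1 / mtbf_alone

r_alone = exp(-lam * HOURS_PER_YEAR)   # one-year reliability, standalone
r_pair = 1 - (1 - r_alone) ** 2        # simple redundant pair, repair not modeled
mtbf_pair = 1.5 * mtbf_alone           # classic result for two identical units

print(round(r_alone, 3), round(r_pair, 4), round(mtbf_pair))
```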
Conclusion<br />
Serial System<br />
The above computations demonstrate that the combined availability of two<br />
components in series is always lower than the availability of its individual components.<br />
The following table gives an example of combined availability in serial system:<br />
Component | Availability | Downtime
X | 99% (2-nines) | 3.65 days/year
Y | 99.99% (4-nines) | 52 minutes/year
X and Y combined | 98.99% | 3.69 days/year
This table indicates that even though a very high availability Part Y was used, the<br />
overall availability of the system was reduced by the low availability of Part X. A<br />
common saying indicates that "a chain is as strong as the weakest link", however, in<br />
this instance a chain is actually “weaker than the weakest link.”<br />
Parallel System<br />
The above computations indicate that the combined availability of two components in<br />
parallel is always higher than the availability of its individual components.<br />
The following table gives an example of combined availability in a parallel system:<br />
Component | Availability | Downtime
X | 99% (2-nines) | 3.65 days/year
Two X components operating in parallel | 99.99% (4-nines) | 52 minutes/year
Three X components operating in parallel | 99.9999% (6-nines) | 31 seconds/year
This indicates that even though a very low availability Part X was used, the overall<br />
availability of the system is much higher. Thus redundancy provides a very powerful<br />
mechanism for making a highly reliable system from low reliability components.<br />
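The parallel table above can be reproduced with a short sketch:

```python
HOURS_PER_YEAR = 8760

def n_parallel_availability(a, n):
    # n identical components in parallel: 1 - (1 - a)^n
    return 1.0 - (1.0 - a) ** n

for n in (1, 2, 3):
    avail = n_parallel_availability(0.99, n)
    downtime_s = (1.0 - avail) * HOURS_PER_YEAR * 3600
    print(f"{n} component(s): availability {avail:.6f}, about {downtime_s:.0f} s/year down")
```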
<strong>High</strong> <strong>Availability</strong> with Collaborative Control System<br />
In an automation system, how can you reach the level of availability required to keep the plant in operation? By what means should you reinforce the system architecture to provide and maintain access to the information required to monitor and control the process?
This chapter provides answers to these questions, and reviews the system<br />
architecture from top to bottom, that is, from operator stations and data servers<br />
(Information Management) to Controllers and Devices (Control System Level), via<br />
communication networks (Communication Infrastructure Level).<br />
The following figure illustrates the Collaborative Control System:

The system architecture drawing above shows various redundancy capabilities that can be proposed:
• Dual SCADA Vijeo Citect servers
• Dual Ethernet control network managed by ConneXium switches
• Premium Hot Standby with Ethernet device bus
• Quantum Hot Standby with Remote I/O, Profibus and Ethernet
• Quantum Safety controller Hot Standby with Remote I/O
Information Management Level<br />
Redundancy Level<br />
Key Features<br />
This section explains how to implement various technological means to address<br />
architecture challenges, with examples using a current client/server paradigm. The<br />
section describes:<br />
• A method to define a redundancy level<br />
• The most appropriate system architecture for the defined level<br />
The key features Vijeo Citect SCADA software has to handle relate to:
• Data acquisition
• Events and Alarms (including time stamping)
• Measurements and trends
• Recipes
• Reports
• Graphics (displays and user interfaces)
In addition, any current SCADA package provides access to a treatment/calculation<br />
module (openness module), allowing users to edit program code (IML/CML, Cicode,<br />
Script VB, etc.).<br />
Note: This model is applicable for a single station (PC), including for small<br />
applications. The synthesis between the stakes and the key features will help to<br />
determine the most appropriate redundant solution.<br />
Stakes<br />
Risk analysis<br />
Level Definition<br />
Considering the previously defined key features, stakes when designing a SCADA<br />
system include:<br />
• Is a changeover possible?<br />
• Does a time limit exist?<br />
• What are the critical data?<br />
• Has cost reduction been considered?<br />
Linked to the previous stakes, the risk analysis is essential to defining the redundancy<br />
level. Consider the events the SCADA system will face, i.e. the risk, in terms of the<br />
following:<br />
• Inoperative hardware<br />
• Inoperative power supply<br />
• Environmental events (natural disaster, fire, etc.)<br />
These events can imply loss of data, operator screens, connections with devices, and so on.
Finally, the redundancy level is defined as the compilation of the key features, the<br />
stakes, and the risk analysis with the customer expectations related to the data<br />
criticality level. The following table illustrates the flow from the process analysis to the<br />
redundancy level:<br />
The following table explains the redundancy levels:<br />
Redundancy Level | State of the standby system | Switchover performance
No redundancy | No standby system | Not applicable
Cold Standby | The Standby system is only powered on if the default system becomes inoperative. | Several minutes; large amount of lost data
Warm Standby | The Standby system switches from normal to backup mode. | Several seconds; small amount of lost data
Hot Standby | The Standby system runs together with the default system. | Transparent; no lost data

Architecture Redundancy Solutions Overview
This section examines various redundancy solutions. A Vijeo Citect SCADA system<br />
can be redundant at the following levels:<br />
• Clients, i.e. operator stations<br />
• Data servers<br />
• Control and information network<br />
• Targets, i.e. PAC station, controller, devices<br />
The Vijeo Citect functional organization corresponds directly<br />
to a Client/Server philosophy. An example of a Client/Server<br />
topology is shown in the diagram to the left: a single Display<br />
Client Operator Station with a single I/O Server, in charge of<br />
device data (PLC) communication.<br />
Vijeo Citect architecture is a combination of several operational entities that handle Alarms, Trends and Reports, respectively. In addition, this functional architecture includes at least one I/O Server. The I/O Server acts as a Client to the peripheral devices (PAC) and as a Server to the Alarms, Trends and Reports (ATR) entities.
As shown in the figure above, ATR and I/O Server(s) act either as a Client or as a<br />
Server, depending on the designated relationship. The default mechanism linking<br />
these Clients and Servers is based on a Publisher / Subscriber relation.<br />
As shown in the following screenshot, because of its client server model, Vijeo Citect<br />
can create a dedicated server, depending on the application requirements: for<br />
example for ATR, or for I/O server:<br />
Clients: Operator Workstations
Vijeo Citect is able to manage the redundancy at the operator station level with<br />
several client workstations. These stations can be located in the control room or<br />
distributed in the plant close to the critical part of the process. A web client interface<br />
can also be used to monitor and control the plant using a standard web browser.<br />
If an operator station becomes inoperative, the plant can still be monitored and controlled using an additional operator screen.
Servers: Resource Duplication Solution<br />
This example assumes a Field Devices Communication Server, providing Services to<br />
several types of Clients, such as Alarm Handling, Graphic Display, etc.<br />
A first example of Redundancy is a complete duplication of the first Server. Basically, if a system becomes inoperative, for example Server A, Server B takes over the job and responds to the service requests presented by the Clients.
Vijeo Citect can define Primary and Standby servers within a project, with each<br />
element of a pair being held by different hardware.<br />
The first level of redundancy duplicates the I/O<br />
server or the ATR server, as shown in the<br />
illustration. In this case, a Standby server is<br />
maintained in parallel to the Primary server. In the<br />
event of a detected interruption on the hardware,<br />
the Standby server will assume control of the<br />
communication with the devices.<br />
Based on data criticality, the second level<br />
duplicates all the servers, ATR and I/O. Identical data is maintained on both servers.<br />
Network: Data Path Redundancy

In the diagram to the left, Primary and Standby I/O servers are deployed independently, while Alarms, Trends and Reports servers run as separate processes on common Primary and Standby computers.

Data path redundancy involves not alternative device(s), but alternative data paths between the I/O Server and the connected I/O Devices. Thus if one data path becomes inoperative, the other is used.

Note: Vijeo Citect reconnects through the primary data path when it is returned to service.

On a larger Vijeo Citect system, you can also use data path redundancy to maintain device communications through multiple I/O Server redundancy, as shown in the following diagram.
Target: I/O Device
Redundant LAN<br />
As previously indicated, redundancy of Alarms, Reports, Trends, and I/O Servers is<br />
achieved by adding standby servers. Vijeo Citect can also use the dual end point (or<br />
multiple network interfaces) potentially available on each server, enabling the<br />
specification of a complete and unique network connection between a Client and a<br />
Server.<br />
A given I/O Server is able to handle designated pairs of Devices, Primary and Standby. This Device Redundancy does not rely on a PLC Hot Standby mechanism: Primary and Standby devices are assumed to be concurrently acting on the same process, but no assumption is made concerning the relationship between the two devices. Seen from the I/O Server, this redundancy offers access only to an alternate device, in case the first device becomes inoperative.

Multiple I/O Device Redundancy is an extension of I/O Device Redundancy, providing for more than one Standby I/O Device. Depending on the user configuration, a given order of priority applies when an I/O Server (potentially a redundant one) needs to switch to a Standby I/O Device. For example, in the figure above, I/O Device 3 would be allotted the highest priority, then I/O Device 2, then finally I/O Device 4.
Clustering<br />
In those conditions, in case of a detected interruption occurring on Primary I/O Device<br />
1, a switchover would take place, with I/O Server 2 handling communications, and<br />
with Standby I/O Device 3. If an interruption is now detected on I/O Device 3, a new<br />
switchover would take place, with I/O Server 1 handling communications, with<br />
Standby I/O Device 2. Finally, if there is an interruption on I/O Device 2, another<br />
switchover would take place, with I/O Server 2 handling communications, with Standby I/O Device 4.
Refer again to the diagram on the left, explaining Redundancy basics, and examine the functional representation as a cluster.

A cluster may contain several possibly redundant I/O Servers (maximum of one per machine), and standalone or redundant ATR servers; these latter servers can be implemented either on a common machine or on separate machines.
The cluster concept offers a response to a typical scenario of a system separated into several sites, with each of these sites being controlled by local operators and supported by local redundant servers. The clustering model can concurrently address an additional level of management that requires all sites across the system to be monitored simultaneously from a central control room.
With this scenario, each site is represented with a separate cluster, grouping its<br />
primary and standby servers. Clients on each site are interested only in the local<br />
cluster, whereas clients at the central control room are able to view all clusters.<br />
Based on cluster design, each site can then be addressed independently within its own cluster. As a result, deployment of a control room scenario is fairly straightforward, with the control room itself only needing display clients.
The cluster concept does not actually provide an additional level of redundancy.<br />
Regarding data criticality, clustering organizes servers, and consequently provides<br />
additional flexibility.<br />
Conclusion<br />
Each cluster contains only one pair each of ATR servers. Those pairs of servers,<br />
redundant to each other, must be on different machines.<br />
Each cluster can contain an unlimited number of I/O servers; those servers must also be on different machines, which increases the level of system availability.
The following illustration shows a complete installation, in which the redundant solutions previously discussed can be identified:
• SCADA Clients
• Data Servers
• Control Network
• Targeted Devices
Communication Infrastructure Level<br />
The previous section reviewed various aspects of enhanced availability at the<br />
Information Management level, focusing on SCADA architecture, represented by<br />
Vijeo Citect. This section covers <strong>High</strong> <strong>Availability</strong> concerns between the Information<br />
level and the Control Level.<br />
A proper design at the communication infrastructure level must include:<br />
• Analysis of the plant topology<br />
• Localization of the critical process steps<br />
• The definition of network topologies<br />
• The appropriate use of communication protocols<br />
Plant Topology<br />
The first step of the communication infrastructure level definition is the plant topology<br />
analysis. From this survey, the goal is to gather information to develop a networking<br />
system diagram, prior to defining the network topologies.<br />
This plant topology analysis must be done as a top-down process:<br />
• Break-down of the plant into selected areas<br />
• Localization of the areas to be connected<br />
• Localization of the hazardous areas<br />
• Localization of the station and the nodes included in these areas to be connected<br />
• Localization of the existing networks & cabling paths, in the event of expansion or<br />
redesign<br />
Before defining the network topologies, the following project requirements must be<br />
considered:<br />
• <strong>Availability</strong> expectation, according to the criticality of the process or data.<br />
• Cost constraints<br />
• Operator skill level<br />
From the project and the plant analyses, identify the most critical areas:<br />
Network Topology<br />
Topologies<br />
Following the criticality analysis, the networking diagram can be defined by selecting<br />
the relevant Network topology.<br />
The following table describes the four main topologies from which to choose:<br />
Architecture | Limitations | Advantages | Disadvantages
Bus | The traffic must flow serially, therefore the bandwidth is not used efficiently | Cost-effective solution | If a switch becomes inoperative, the communication is lost
Star | Cable ways and distances | Efficient use of the bandwidth, as the traffic is spread across the star; preferred topology when there is no need for redundancy | If the main switch becomes inoperative, the communication is lost
Ring | Behavior quite similar to Bus | Auto-configuration if used with a self-healing protocol; possible to couple other rings to increase redundancy | The auto-configuration depends on the protocol used
Note: These different topologies can be mixed to define the plant network diagram.<br />
Ring Topology<br />
The following diagram shows the level of availability based on topology:<br />
In automation architecture, Ring (and Dual ring) topologies are the most commonly<br />
used to increase the availability of a system.<br />
Mesh architecture is not used in process applications; therefore we do not discuss it in detail. All these topologies can be implemented using Schneider Electric ConneXium switches.

In a ring topology, four events can occur that lead to a loss of communication:
1) Broken ring line
2) Inoperative ring switch
3) Inoperative end-of-line device line
4) Inoperative end-of-line device switch
The following diagram illustrates these four occurrences:

To protect the network architecture from these events, several communication protocols are proposed and described in the following section.

A solution able to enhance networking availability while preserving budget considerations consists of reducing both limitations of an Ethernet network: distributed as a bus, but built as a ring. At least one specific active component is necessary: a network switch usually named the Redundancy Manager (RM).

Consider an Ethernet loop designed with such an RM switch. In normal conditions, this RM switch opens the loop, which prevents Ethernet frames from circulating endlessly.

If a break occurs, the Redundancy Manager switch reacts immediately and closes the Ethernet loop, bringing the network back to full operating condition.

Note that the term Self-Healing Ring concerns the ring management only; once cut, the cable is not able to repair itself.
A mix of Dual Networking and Network Redundancy is possible. Note that in such a<br />
design, a SCADA I/O Server has to be equipped with two communication boards, and<br />
reciprocally, each device (PLC) has to be allotted two Ethernet ports.<br />
Redundant Coupling of a Ring Network or a Network Segment<br />
Topological considerations may lead to consideration of a network layout aggregating<br />
satellite rings or segments around a backbone network (itself designed as a ring or as<br />
a segment).<br />
This may be an effective design, considering the junction between trunk and satellites,<br />
especially if backbone and satellite networks have been designed as ring networks to<br />
provide for <strong>High</strong> <strong>Availability</strong>.<br />
With the ConneXium product line, Schneider Electric offers switches that afford redundant coupling. Several variations allow connection to the network. Each of these variations features two "departure" switches on the backbone network; each departure switch crosses a separate link to access the satellite network.
These variations include:
1. A single pivot "arrival" switch on the connected network
2. Two different arrival switches on the connected network, with link synchronization making good use of the backbone and satellite networks
3. Two different switches to access the connected network, with link synchronization carried via an additional specific link established between the two arrival switches on the connected network
The following drawing illustrates this architecture:<br />
[Diagram: two RM "departure" switches on the backbone ring, coupled through a redundant line to RM switches on the satellite ring]<br />
Ring coupling capabilities increase the level of networking availability by allowing<br />
different paths to access targeted devices.<br />
The new generation of <strong>Schneider</strong> ConneXium switches supports many architectures<br />
based on dual rings. A single switch is now able to couple two Ethernet rings,<br />
extending the capabilities of the Ethernet architecture.<br />
Dual Ring in One Switch<br />
The following illustration shows the architecture, which allows the combination of two<br />
rings managed by a single switch:<br />
[Diagram: a Premium Hot Standby pair (CPU sync link) whose ETY modules sit on two MRP / RSTP rings coupled by one switch]<br />
Dual Ring in Two Switches<br />
The following architecture bypasses a single detected interruption.<br />
Dual Ring Extension in Two Switches<br />
The following architecture allows extension of the main network to other segments:<br />
This concept of sub-rings enables coupling segments to existing redundant rings.<br />
The devices in the main ring (1) are seen as Sub-Ring Managers (SRM) for the newly<br />
connected sub-ring (2).<br />
Daisy Chain Loop Topology<br />
Ethernet "daisy chain" refers to the integration of a switch function inside a<br />
communicating device; as a result, this "daisy-chainable" device offers two<br />
Ethernet ports, for example one "in" port and one "out" port. The advantage of such<br />
a daisy-chainable device is that its installation inside an Ethernet network requires<br />
only two cables.<br />
In addition, a daisy-chain layout can correspond to either a network segment or a<br />
network loop; a managed Connexium switch featuring an RM capability is able to<br />
handle such a daisy-chained loop.<br />
The first daisy-chainable devices <strong>Schneider</strong> <strong>Electric</strong> plans to offer are:<br />
- Advantys STB dual-port Ethernet communication adapter (STB NIP 2311)<br />
- Advantys ETB IP67 dual-port Ethernet<br />
- Motor controller TeSys T<br />
- Variable speed drive ATV 61/71 (VW3A3310D)<br />
- PROFIBUS DP V1 Remote Master<br />
- ETG 30xx Factorycast gateway<br />
Note: Assuming no specific redundancy protocol is selected to handle the daisy<br />
chain loop, expected loop reconfiguration time on failover is approximately one<br />
second.<br />
[Diagram: SCADA stations and a Primary / Standby Hot Standby pair on redundant MRP / RSTP rings, with daisy-chained devices coupled through RM switches]<br />
Daisy chaining topologies can be coupled to dual Ethernet rings using TCSESM<br />
ConneXium switches.<br />
Redundancy Communication Protocols<br />
The management of an Ethernet ring requires dedicated communication protocols, as<br />
described in the following table. Each protocol is characterized by different<br />
performance criteria in terms of fault detection and global system recovery time:<br />
- MRP: part of IEC 62439 FDIS; recovery time less than 200 ms; 50 switches<br />
maximum; mix of switches in the network (Cisco/3Com/Hirschmann): yes.<br />
- Rapid Spanning Tree (RSTP): recovery time up to 1 second, depending on the<br />
number of switches; mix of switches in the network: yes.<br />
- HIPER-Ring: recovery time 300 or 500 ms; 50 switches maximum; mix of<br />
switches in the network: no.<br />
- Fast HIPER-Ring (new features available only on Extended Connexium<br />
switches): recovery time 10 ms for 5 switches, plus 160 microseconds for each<br />
additional switch; mix of switches in the network: no.<br />
Rapid Spanning Tree Protocol (RSTP)<br />
RSTP stands for Rapid Spanning Tree Protocol (IEEE 802.1w standard) 1 . Based on<br />
STP, RSTP has introduced some additional parameters that must be entered during<br />
the switch configuration. These parameters are used by the RSTP protocol during the<br />
path selection process; because of these, the reconfiguration time is much faster than<br />
with STP (typically less than one second).<br />
1 The new edition of the 802.1D standard, IEEE 802.1D-2004, incorporates the IEEE 802.1t-2001 and IEEE<br />
802.1w standards.<br />
The new release of TCSESM ConneXium switches allows better RSTP performance,<br />
with a detection time of 15 ms and a propagation time of 15 ms per switch.<br />
Considering a 6-switch configuration, the recovery time is about 105 ms<br />
(15 ms + 6 × 15 ms).<br />
HIPER-Ring (Version 1)<br />
Version 1 of the HIPER-Ring networking strategy has been available for<br />
approximately 10 years. It applies to a Self Healing Ring networking layout.<br />
Such a ring structure may include up to 50 switches. It typically features a<br />
reconfiguration delay of 150 milliseconds 2 and a maximum time of 500 ms. As a result,<br />
in case of an issue occurring on a link cable or on one of the switches populating the<br />
ring, the network will take about 150 ms to detect this event, and cause the<br />
Redundancy Manager switch to close the loop.<br />
Note: The Redundancy Manager switch is said to be active when it opens the<br />
network.<br />
HIPER-Ring Version 2 (MRP)<br />
Note: If a recovery time of 500 ms is acceptable, then no switch redundancy<br />
configuration is needed; only DIP switches have to be set up.<br />
2 When configuring a Connexium TCS ESM switch for HIPER-Ring V1, the user is asked to choose<br />
between a maximum Standard Recovery Time, which is 500 ms, and a maximum Accelerated Recovery<br />
Time, which is 300 ms.<br />
MRP is an IEC 62439 industry standard protocol based on HIPER-Ring. Therefore,<br />
any switch manufacturer can implement MRP if it chooses to. This allows a mix of<br />
different manufacturers' switches in an MRP configuration. <strong>Schneider</strong> <strong>Electric</strong><br />
switches support a selectable maximum recovery time of 200 ms or 500 ms and a<br />
50-switch maximum ring configuration.<br />
TCSESM switches also support redundant coupling of MRP rings. MRP rings can<br />
easily be used instead of HIPER-Ring. MRP requires that all switches be configured<br />
via web pages and allows for a recovery time of 200 ms or 500 ms. Additionally, the<br />
I/O network could be an MRP redundant network and the control network<br />
HIPER-Ring, or vice versa.<br />
Fast HIPER-Ring<br />
A new family of Connexium switches, named TCS ESM Extended, is coming. It will<br />
offer a third version of the HIPER-Ring strategy, named Fast HIPER-Ring.<br />
Featuring a guaranteed recovery time of less than 10 milliseconds, the Fast<br />
HIPER-Ring structure allows both a cost-optimized implementation of a redundant<br />
network and maintenance and network extension during operation. This makes Fast<br />
HIPER-Ring especially suitable for complex applications such as the combined<br />
transmission of video, audio and data information.<br />
Selection<br />
To end the communication level section, the following table presents all the<br />
communication protocols, and thus helps you select the most appropriate<br />
installation for your high availability solution:<br />
- Ease of configuration or installed base: HIPER-Ring. If a recovery time of 500 ms<br />
is acceptable, no switch redundancy configuration is needed; only DIP switches<br />
have to be set up.<br />
- New installation: MRP. All switches are configured via web pages; the installation<br />
has one MRM (Media Ring Manager) and X MRCs (Media Ring Clients).<br />
- Open architecture with multiple vendor switches: RSTP. Reconfiguration time:<br />
15 ms (detected fault) + 15 ms per switch.<br />
- Complex architecture: MRP, RSTP or Fast HIPER-Ring. We recommend MRP or<br />
RSTP for <strong>High</strong> <strong>Availability</strong> with dual rings, and Fast HIPER-Ring for high<br />
performance.<br />
Control System Level<br />
Redundancy Principles<br />
Having detailed <strong>High</strong> <strong>Availability</strong> aspects at the Information Management level and at<br />
the Communication Infrastructure level, we will now concentrate on <strong>High</strong> <strong>Availability</strong><br />
concerns at the Control Level. Specific discussion will focus on PAC redundancy.<br />
Modicon Quantum and Premium PAC provide Hot Standby capabilities, and have<br />
several shared principles:<br />
1. The type of architecture is<br />
shared. A Primary unit executes<br />
the program, with a Standby unit<br />
ready, but not executing the<br />
program (apart from the first<br />
section of it). By default, these two units contain an identical application program.<br />
2. The units are synchronized. The standby unit is aligned with the Primary unit.<br />
Also, on each scan, the Primary unit transfers to the Standby unit its "database,"<br />
that is, the application variables (located or not located) and internal data. The<br />
entire database is transferred, except the "Non-Transfer Table", which is a<br />
sequence of Memory Words (%MW). The benefit of this transfer is that, in case of<br />
a switchover, the new Primary unit will continue to handle the process, starting<br />
with updated variables and data values. This is referred to as a "bumpless"<br />
switchover.<br />
3. The Hot Standby redundancy mechanism is controlled via the "Command<br />
Register" (accessed through the %SW60 system word); reciprocally, this Hot Standby<br />
redundancy mechanism is monitored via the "Status Register" (accessed through the<br />
%SW61 system word). As a result, as long as the application creates links<br />
between these system words and located memory words, any HMI can receive<br />
feedback regarding Hot Standby system operating conditions, and, if necessary,<br />
address these operating conditions.<br />
4. For any Ethernet port acting as a server (Modbus/TCP or HTTP protocol) on the<br />
Primary unit, its IP address is implicitly incremented by one on the Standby unit.<br />
In case a switchover occurs, homothetic addresses will automatically be<br />
exchanged. The benefit of this feature is that seen from a SCADA/HMI, the<br />
"active" unit is still accessed at the same IP address. No specific adaptation is<br />
required at the development stage of the SCADA / HMI application.<br />
5. The common programming environment that is used with both solutions is Unity<br />
Pro. No particular restrictions apply when using the standardized (IEC 1131-3)<br />
instruction set. In addition, the portion of code specific to the Hot Standby system<br />
is optional, and is used primarily for monitoring purpose. This means that with any<br />
given application, the difference between its implementation on standalone<br />
architecture and its implementation on a Hot Standby architecture is largely<br />
cosmetic.<br />
Consequently, a user familiar with one type of Hot Standby system does not have to<br />
start from scratch if he has to use a second type; initial investment is preserved and<br />
re-usable, and only a few differences must be learned to differentiate the two<br />
technologies.<br />
Premium and Quantum Hot Standby Architectures<br />
Depending on project constraints or customer requirements (performance, installed<br />
base or project specifications), a specific Hot Standby PAC station topology can be<br />
selected:<br />
- Hot Standby PAC with in-rack I/O or remote I/O<br />
- Hot Standby PAC with distributed I/O, connected on Standard Ethernet or<br />
connected to another device bus, such as Profibus DP.<br />
- Hot Standby PAC mixing different topologies<br />
The following table presents the available configurations with either a Quantum or<br />
Premium PLC:<br />
PLC / layer combinations:<br />
- Quantum: I/O Bus = Configuration 1; Ethernet = Configuration 3; Profibus =<br />
Configuration 5<br />
- Premium: I/O Bus = Configuration 2; Ethernet = Configuration 4; Profibus = not<br />
applicable<br />
Note: A sixth configuration may be considered, which combines all other<br />
configurations listed above.<br />
In-Rack and Remote I/O Architectures<br />
Quantum Hot Standby: Configuration 1<br />
With Quantum Hot Standby, in-rack I/O modules are located in the remote I/O racks.<br />
They are "shared" by both Primary and Standby CPUs, but only the Primary unit<br />
actually handles the I/O communications at any given time. In case of a switchover,<br />
the control takeover executed by the new Primary unit occurs in a bumpless way,<br />
provided that the holdup time parameter allotted to the racks is greater than the<br />
lack-of-communication gap that takes place at the switchover.<br />
The module population of a Quantum CPU rack in a Hot Standby configuration is<br />
very similar to that of a standalone PLC 3 . The only specific requirement is that the<br />
CPU module must be a 140 CPU 671 60.<br />
For redundant PAC architecture, both units require two interlinks to execute different<br />
types of diagnostic - to orient the election of the Primary unit, and to achieve<br />
synchronization between both machines. The first of these "Sync Links,” the CPU<br />
Sync Link, is a dedicated optic fiber link anchored on the Ethernet port local to the<br />
CPU module. This port is dedicated exclusively for this use on Quantum Hot Standby<br />
architecture. The second of these Sync Links, Remote I/O Sync Link, is not an<br />
additional one: the Hot Standby system uses the existing Remote I/O medium,<br />
hosting both machines, thus providing them with an opportunity to communicate.<br />
One benefit of the CPU optic fiber port is its inherent capability to have the two units<br />
installed up to 2 km apart, using 62.5/125 multimode optic fiber. The Remote I/O Sync<br />
Link can also run through optic fiber, provided that Remote I/O Communication<br />
Processor modules are coupled on the optic fiber.<br />
3 All I/O modules are accepted on remote I/O racks, except 140 HLI 340 00 (Interrupt module).<br />
Looking at Ethernet adapters currently available, 140 NWM 100 00 communication module is not<br />
compatible with a Hot Standby system. Also EtherNet/IP adapter (140 NOC 771 00) is not compatible with<br />
Quantum Hot Standby in Step 1.<br />
Up to 6 communication modules 4 , such as NOE Ethernet TCP/IP adapters, can be<br />
handled by a Quantum unit, whether it is a standalone unit or part of a Hot Standby<br />
architecture.<br />
Up to 31 Remote I/O stations can be handled from a Quantum CPU rack, whether<br />
standalone or Hot Standby. Note that the Remote I/O service payload on scan time is<br />
approximately 3 to 4 ms per station.<br />
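Using the figure above (roughly 3 to 4 ms of scan-time payload per Remote I/O station), the expected payload can be bracketed with a small helper; names and structure are ours, for illustration only.<br />

```python
# Back-of-envelope estimate of the Remote I/O service payload on scan time,
# based on the 3 to 4 ms per-station figure quoted in the text.

def rio_payload_ms(n_stations, low_ms=3, high_ms=4):
    """Return the (min, max) added scan time in ms for n_stations."""
    return n_stations * low_ms, n_stations * high_ms

print(rio_payload_ms(31))  # maximum 31-station layout -> (93, 124) ms
```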
Redundant Device Implementation<br />
Using redundant in-rack I/O modules on<br />
Quantum Hot Standby, and having to<br />
interface redundant sensors/actuators, will<br />
require redundant input / output channels.<br />
Preferably, homothetic channels should be<br />
installed on different modules and different<br />
I/O Stations. Even for a simple transfer of<br />
information to both sides of the outputs, the<br />
application must define and implement rules<br />
for selecting and treating the proper input<br />
signals. In addition to information transfer,<br />
the application will have to address<br />
diagnostic requirements.<br />
4 Acceptable communication modules are Modbus Plus adapters, Ethernet TCP/IP adapters, Ethernet/IP<br />
adapters and PROFIBUS DP V1 adapters.<br />
Single Device Implementation<br />
Assuming a Quantum Hot Standby application is required to handle redundant<br />
in-rack I/O channels, but without redundant sensors and actuators, special devices<br />
are used to handle the associated wiring requirements. Any given sensor signal,<br />
either digital or analog, passes through such a dedicated device, which replicates it<br />
and passes it on to homothetic input channels. Reciprocally, any given pair of<br />
homothetic output signals, either digital or analog, are provided to a dedicated<br />
device that selects and transfers the proper signal (i.e. the one taken on the Primary<br />
output) to the target actuator.<br />
Depending on the selected I/O Bus technology, a specific layout may result in<br />
enhanced availability.<br />
Dual Coaxial Cable<br />
Coaxial cable can be installed in<br />
either a single or a redundant<br />
design. With a redundant design,<br />
communications are duplicated on<br />
both channels, providing a "massive<br />
communication redundancy.” Either<br />
the Remote I/O Processor or Remote<br />
I/O adapters are equipped with a pair<br />
of connectors, with each connector<br />
attached to a separate coaxial<br />
distribution.<br />
Self Healing Optical Fiber Ring<br />
Remote I/O Stations can be installed as terminal nodes of a fiber optic segment or<br />
(self-healing) ring. The <strong>Schneider</strong> catalog offers one model of transceiver<br />
(490 NRP 954) applicable for 62.5/125 multimode optic fiber: 5 transceivers<br />
maximum, 10 km maximum ring circumference. When the number of transceivers is<br />
greater than 5, precise optic fiber ring diagnostics are required, or single mode fiber<br />
is required.<br />
Configuration Change on the Fly<br />
A high-level feature is currently being added to Quantum Hot Standby application<br />
design: CCTF, or Configuration Change on the Fly. This new feature will allow you<br />
to modify the configuration of an existing, running PLC application program without<br />
having to stop the PLC. As an example, consider the addition of a new discrete or<br />
analog module on a remote Quantum I/O Station. For a CPU firmware version<br />
upgrade executed on a Quantum Hot Standby architecture, this CCTF will be<br />
executed sequentially, one unit at a time. This is an obvious benefit for applications<br />
that cannot afford any stop, which can now accommodate architecture modifications<br />
or extensions.<br />
Premium Hot Standby: Configuration 2<br />
Premium Hot Standby is able to<br />
handle in-rack I/O modules, installed<br />
on Bus-X racks 5 . The Primary unit<br />
acquires its inputs, executes the logic,<br />
and updates its outputs. As a result of<br />
cyclical Primary to Standby data<br />
transfer, the Standby unit provides<br />
local outputs that are the image of the outputs decided on the Primary unit. In case of<br />
a switchover, the control takeover executed by the new Primary unit occurs in a<br />
bumpless fashion.<br />
5 Initial version of Premium Hot Standby only authorizes a single Bus-X rack on both units.<br />
The module population of a Premium<br />
CPU rack, in a Hot Standby<br />
configuration, is very similar to that of<br />
a standalone PLC (some restrictions<br />
apply on compatible modules 6 ). Two<br />
types of CPU modules are available:<br />
TSX H57 24M and TSX H57 44M,<br />
which differ mainly in regard to memory and communication resources.<br />
The first of the two Sync Links, the<br />
CPU Sync Link is a dedicated copper<br />
link anchored on the Ethernet port<br />
local to the CPU module. With<br />
Premium Hot Standby architecture, the<br />
second Sync Link, the Ethernet Sync Link, is established using a standard Ethernet<br />
TCP/IP module 7 . It corresponds to the communication adapter elected as the<br />
"monitored" adapter.<br />
6 Counting, motion, weighing and safety modules are not accepted. On the communication side, apart from<br />
Modbus modules TSX SCY 11 601/21 601, only currently available Ethernet TCP/IP modules are accepted.<br />
Also EtherNet/IP adapter (TSX ETC 100) is not compatible with Premium Hot Standby in Step 1.<br />
7 TSX ETY 4103 or TSX ETY 5103 communication module<br />
The following picture illustrates the Premium Ethernet configuration, with the CPU<br />
and the ETY Sync links:<br />
This Ethernet configuration is detailed in the following section.<br />
Redundant Device Implementation<br />
In-rack I/O module implementation on Premium<br />
Hot Standby corresponds by default to a<br />
massive redundancy layout: each input and<br />
each output has a physical connection on both<br />
units. Redundant sensors and actuators do not<br />
require additional hardware. Even for a simple<br />
transfer of information to both sides of the outputs, the application must define and<br />
implement rules for selecting and treating the proper input signals. In addition to<br />
information transfer, the application will have to address diagnostic requirements.<br />
Single Device Implementation<br />
Assuming a Premium Hot Standby application<br />
is required to handle redundant in-rack I/O<br />
channels, but without redundant sensors and<br />
actuators, special devices are used to handle<br />
the associated wiring requirements. Any given<br />
sensor signal, either digital or analog, passes<br />
through such a dedicated device, which<br />
replicates it and passes it on to homothetic input channels. Reciprocally, any given<br />
pair of homothetic output signals, either digital or analog, are provided to a dedicated<br />
device that selects and transfers the proper signal (i.e. the one taken on Primary<br />
output) to the target actuator.<br />
Distributed I/O Architectures<br />
Ethernet TCP/IP: Configurations 3&4<br />
<strong>Schneider</strong> <strong>Electric</strong> has supported the Transparent Ready strategy for several years. In<br />
addition to SCADA and HMI, Variable Speed Drives, Power Meters, and a wide<br />
range of gateways, Distributed I/O with Ethernet connectivity, such as Advantys<br />
STB, is also proposed. In addition, many manufacturers are offering devices<br />
capable of communicating on Ethernet using the Modbus TCP 8 protocol. These<br />
different contributions, using the Modbus protocol design legacy, have helped make<br />
Ethernet a preferred general-purpose communication support for automation<br />
architectures.<br />
In addition to Ethernet messaging services solicited through application program<br />
function blocks, a communication service is available on <strong>Schneider</strong> <strong>Electric</strong> PLCs: the<br />
I/O Scanner. The I/O Scanner makes a PLC Ethernet adapter/Copro act as a<br />
Modbus/TCP client, periodically launching a sequence of requests on the network.<br />
These requests correspond to standard Modbus function codes, asking for Registers<br />
(Data Words), Read, Write or Read/Write operations. This sequence is determined by<br />
a list of individual contracts specified in a table defined during the PLC configuration.<br />
The typical target of such a communication contract is an I/O block, hence the name<br />
"I/O Scanner". Also, the I/O Scanner service may be used to implement data<br />
exchanges with any type of equipment, including another PLC, provided that<br />
equipment can behave as a Modbus/TCP server, and respond to multiple words<br />
access requests.<br />
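To make the I/O Scanner's traffic concrete, the sketch below encodes one of the standard requests it periodically issues: a Modbus/TCP Read Holding Registers (function code 3) frame. This is an illustrative client-side encoder based on the public Modbus/TCP framing, not Schneider's implementation.<br />

```python
# Build a Modbus/TCP "Read Holding Registers" request: a 7-byte MBAP
# header (transaction id, protocol id 0, length, unit id) followed by a
# 5-byte PDU (function code 3, start address, register count).
import struct

def read_holding_registers_request(transaction_id, unit_id, start_addr, count):
    """Return the 12-byte request frame for Modbus function code 3."""
    length = 6  # bytes following the length field: unit id + 5-byte PDU
    return struct.pack(">HHHBBHH",
                       transaction_id, 0, length,  # MBAP header
                       unit_id, 3,                 # unit id, function code
                       start_addr, count)          # first register, quantity

frame = read_holding_registers_request(1, 255, 0, 10)
print(frame.hex())  # 000100000006ff030000000a
```

A real I/O Scanner contract repeats such requests cyclically over a TCP connection held open to port 502 on the target device.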
The Ethernet I/O Scanner service is compatible with a Hot Standby implementation,<br />
whether Premium or Quantum. The I/O Scanner is active only on the Primary unit.<br />
In case of a controlled switchover, the Ethernet TCP/IP connections handled by the<br />
former Primary unit are properly closed, and new ones are reopened once the new<br />
Primary gains control. In case of a sudden switchover, resulting, for example, from a<br />
power cut-off, the former Primary may not be able to close the connections it had<br />
opened. These connections will be closed after expiration of a Keep Alive timeout.<br />
8 In step 1, support of Ethernet/IP Quantum/Premium adapters is not available with Hot Standby.<br />
In case of a switchover, proper communications will typically recover after one initial<br />
cycle of I/O scanning. However, the worst case gap for address swap, with I/O<br />
scanner, is 500 ms, plus one initial cycle of I/O scanning. As a result, this mode of<br />
communication, and hence architectures with Distributed I/Os on Ethernet, is not<br />
preferred with a control system that regards time criticality as an essential criterion.<br />
Note: The automatic IP address swap capability is a property inherited by every<br />
Ethernet TCP/IP adapter installed in the CPU rack.<br />
Self Healing Ring<br />
As demonstrated in the previous chapter, Ethernet TCP/IP used with products like<br />
Connexium offers real opportunities to design enhanced availability architectures,<br />
handling communication between the Information Management Level and the Control<br />
Level. Such architectures, based on a Self Healing Ring topology, are also applicable<br />
when using Ethernet TCP/IP as a fieldbus.<br />
Note that Connexium accepts Copper or Optic Fiber rings. In addition, dual<br />
networking is also applicable at the fieldbus level.<br />
Profibus Architecture<br />
I/O Devices Distributed on PROFIBUS DP/PA: Configuration 5<br />
A PROFIBUS DP V1 Master Class 1<br />
communication module in Quantum form factor is<br />
available. It handles cyclic and acyclic data<br />
exchanges, and accepts FDT/DTM Asset<br />
Management System data flow, through its<br />
local Ethernet port.<br />
The PROFIBUS network is configured with<br />
a Configuration Builder software, which<br />
supplies the Unity Pro application program<br />
with data structures corresponding to cyclic<br />
data exchanges and to diagnostic<br />
information.<br />
The Configuration Builder can also be configured to pass Unity Pro a set of DFBs,<br />
allowing easy implementation of Acyclic operations.<br />
Each Quantum PLC can accept up to 6 of these DP Master modules (each of them<br />
handling its own PROFIBUS network). Also, the PTQ PDPM V1 Master Module is<br />
compatible with a Quantum Hot Standby implementation. Only the Master Module in<br />
the Primary unit is active on the PROFIBUS network; the Master Module on the<br />
Standby unit stays in a dormant state unless awakened by a switchover.<br />
PROFIBUS DP V1 Remote Master and Hot Standby PLC<br />
With a smart device such as<br />
PROFIBUS Remote Master 9 ,<br />
an I/O Scanner stream is<br />
handled by the PLC<br />
application 10 and forwarded<br />
to the Remote Master via<br />
Ethernet TCP/IP. In turn,<br />
Remote Master handles the<br />
corresponding cyclic<br />
exchanges with the devices populating the PROFIBUS network. Remote Master can<br />
also handle acyclic data exchanges.<br />
The PROFIBUS network is configured with Unity Pro 11 , which also acts as an FDT<br />
container, able to host manufacturer device DTMs. In addition, Remote Master offers<br />
a comDTM to work with third party FDT/DTM Asset Management Systems.<br />
Automatic symbol generation provides Unity Pro with data structures corresponding<br />
to data exchanges and diagnostic information. A set of DFBs is delivered that allows<br />
an easy implementation of acyclic operations.<br />
Remote Master is compatible with a Quantum, Premium or Modicon M340 Hot<br />
Standby implementation.<br />
9 Planned First Customer Shipment: Q4 2009<br />
10 M340, Premium or Quantum<br />
11 version 5.0<br />
Redundancy and Safety<br />
Requirements for <strong>Availability</strong> and Safety are often considered<br />
to be working against each other. Safety can be the<br />
response to the maxim "stop if any potential danger arises,"<br />
whereas <strong>Availability</strong> can follow the slogan "produce in spite of<br />
everything."<br />
Two models of CPU are available to design a Quantum<br />
Safety configuration: the first model (140 CPU 651 60S) is<br />
dedicated to standalone architectures, whereas the second<br />
model ( 140 CPU 671 60S) is dedicated to redundant<br />
architectures.<br />
The Quantum Safety PLC has some exclusive features: a<br />
specific hardware design for safety modules (CPU and I/O<br />
modules) and a dedicated instruction set.<br />
Otherwise, a Safety Quantum Hot Standby configuration has<br />
much in common with a regular Quantum Hot Standby configuration. The<br />
configuration windows, for example, are almost the same, and the Ethernet<br />
communication adapters inherit the IP Address automatic swap capability. Thus the<br />
Safety Quantum Hot Standby helps to reconcile and integrate the concepts of Safety<br />
and <strong>Availability</strong>.<br />
Mixed Configuration: Configuration 6<br />
Whether Premium or Quantum, application requirements such as topology,<br />
environment, periphery, time criticality, etc. may influence the final architecture<br />
design to adopt both types of design strategies concurrently, i.e. in-rack and<br />
distributed I/O, depending on individual subsystem constraints.<br />
Premium / Quantum Hot Standby Switchover Conditions<br />
System Health Diagnostics<br />
Systematic checks are executed cyclically by any running CPU, in order to detect a<br />
potential hardware corruption, such as a change affecting the integrity of the Copro,<br />
the sub-part of the CPU module that hosts the integrated Ethernet port. Another<br />
example of a systematic check is the continuous check of the voltage levels provided<br />
by the power supply module(s). In case of a negative result during these hardware<br />
health diagnostics, the tested CPU will usually switch to a Stop State.<br />
When the unit in question is part of a Hot Standby System, in addition to these<br />
standard hardware tests separately executed on both machines, more specific tests<br />
are conducted between the units. These additional tests involve both Sync Links. The<br />
basic objective is to confirm that the Primary unit is effectively operational, executing<br />
the application program, and controlling the I/O exchanges. In addition, the system<br />
must verify that the current Standby unit is able to assume control after a switchover.<br />
Controlled Switchover<br />
If an abnormal situation occurs on the current Primary unit, it gives up control and<br />
switches either to Off-Line state (the CPU is not a part of the Hot Standby system<br />
coupling) or to Stop State, depending on the event. The former Standby unit takes<br />
control as the new Primary unit.<br />
As previously indicated, the Hot Standby system is controlled through the %SW60<br />
system Command Register. Each unit owns an individual bit in this register that<br />
declares whether or not that particular unit intends to "hook" to the other unit. An<br />
operational, hooked redundant Hot Standby system requires both units to indicate<br />
this intent. Consequently, executing a switchover controlled by the application on a<br />
hooked system is straightforward; it requires briefly toggling the decision bit that<br />
controls the current Primary unit's "hooking" intent. The first toggle transition<br />
switches the current Primary unit to Off-Line State and makes the former Standby<br />
unit take control. The next toggle transition makes the former Primary unit return<br />
and hook as the new Standby unit.<br />
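The toggle sequence described above can be modeled as a small state machine. The following sketch is purely illustrative (the Python class and unit names are invented; on a real system the "hooking" intent is expressed through individual bits of the Command Register, not through this API):<br />

```python
class HotStandbySystem:
    """Toy model of two Hot Standby units; each exposes a 'hook' flag
    signalling its intent to couple with the other unit."""

    def __init__(self):
        self.hook = {"A": True, "B": True}  # both units intend to hook
        self.primary, self.standby = "A", "B"

    def set_hook(self, unit, value):
        self.hook[unit] = value
        if not value and unit == self.primary:
            # First toggle transition: the Primary drops to Off-Line state
            # and the former Standby takes control as the new Primary.
            self.primary, self.standby = self.standby, None
        elif value and self.standby is None and unit != self.primary:
            # Next toggle transition: the former Primary returns and
            # hooks as the new Standby unit.
            self.standby = unit

hsby = HotStandbySystem()
hsby.set_hook("A", False)  # application briefly drops A's hooking intent
assert hsby.primary == "B" and hsby.standby is None
hsby.set_hook("A", True)   # A returns as the new Standby
assert hsby.primary == "B" and hsby.standby == "A"
```
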
An example of this possibility is a controlled switchover resulting from diagnostics<br />
conducted at the application level. A Quantum Hot Standby, Premium Hot Standby or<br />
Monitored Ethernet Adapter system does not handle a Modbus Plus or Ethernet<br />
communication adapter malfunction as a condition implicitly forcing a switchover. As a<br />
result, these communication modules must be cyclically tested by the application,<br />
both on the Standby and on the Primary. Diagnostic results elaborated on the Standby<br />
are usually transferred to the Primary unit by means of the Reverse Transfer Registers.<br />
Finally, when the application is notified of a persistent malfunction affecting the Primary<br />
unit while the Standby unit is fully operational, it can force a switchover by acting on<br />
the Command Register.<br />
Hence, the application program can decide on a Hot Standby switchover, having<br />
registered a steady state negative diagnostic on the Ethernet adapter linking the<br />
Primary unit to the "Process Network", and being at the same time informed that the<br />
Standby unit is fully operational.<br />
Note: There are two ways to implement a Controlled Switchover: automatically<br />
through configuration of a default DFB, or customized with the creation of a DFB with<br />
its own switchover conditions.<br />
The following illustration represents an example of a DFB handling a Hot Standby<br />
Controlled Switchover:<br />
This in turn makes use of an embedded DFB: HSBY_WR:<br />
Fragment of HSBY_Switch_Decision DFB<br />
Fragment of HSBY_WR DFB<br />
Note: HSBY_WR DFB executes a write access on HSBY Control Register (%SW60).<br />
Switchover Latencies<br />
The following table details the typical and maximum swap time delay encountered<br />
when reestablishing Ethernet services during a Switchover event. (Premium and<br />
Quantum configurations)<br />
Service: Typical Swap Time / Maximum Swap Time<br />
Swap IP Address: 6 ms / 500 ms<br />
I/O Scanning: 1 initial cycle of I/O scanning / 500 ms + 1 initial cycle of I/O scanning<br />
Client Messaging: 1 MAST task cycle / 500 ms + 1 MAST task cycle<br />
Server Messaging: 1 MAST task cycle + the time required by the client to reestablish its connection with the server (1) / 500 ms + the time required by the client to reestablish its connection with the server (1)<br />
FTP/TFTP Server: the time required by the client to reestablish its connection with the server (1) / 500 ms + the time required by the client to reestablish its connection with the server (1)<br />
SNMP: 1 MAST task cycle / 500 ms + 1 MAST task cycle<br />
HTTP Server: the time required by the client to reestablish its connection with the server (1) / 500 ms + the time required by the client to reestablish its connection with the server (1)<br />
(1) The time the client requires to reconnect with the server depends on the client communication loss<br />
timeout settings.<br />
Selection<br />
To end the control level section, the following table presents the main criteria that<br />
help you select the most appropriate configuration for your high availability solution:<br />
Criteria: Cost / <strong>High</strong> Criticality<br />
Switchover Performance: Premium In-Rack Architecture / Quantum In-Rack Architecture<br />
Openness: Premium Distributed Architecture / Quantum Distributed Architecture<br />
Premium / Quantum Hot Standby Solution Reminder<br />
The following tables provide a brief reminder of essential characteristics for Premium<br />
and Quantum Hot Standby solutions, respectively:<br />
Up to 128 per network / ETY I/O Scanner handles up to 64 transactions<br />
Premium Hot Standby Essential Characteristics<br />
Quantum Hot Standby Essential Characteristics<br />
Conclusion<br />
This chapter has covered functional and architectural redundancy aspects, from the<br />
Information Management level, through the Communication Infrastructure level,<br />
down to the Control level.<br />
Conclusion<br />
This section summarizes the main characteristics and properties of <strong>Availability</strong> for<br />
Collaborative Control automation architectures.<br />
Chapter 1 demonstrated that <strong>Availability</strong> depends not only on Reliability, but also<br />
on the Maintenance provided to a given system. The first level of contribution,<br />
Reliability, is primarily a function of the system design and components. Component<br />
and device manufacturers thus have a direct but not exclusive influence on system<br />
<strong>Availability</strong>. The second level of contribution, Maintenance and Logistics, is entirely<br />
dependent on end customer behavior.<br />
Chapter 2 presented some simple Reliability and <strong>Availability</strong> calculation examples,<br />
and demonstrated that beyond basic use cases, dedicated skills and tools are<br />
required to extract figures from real cases.<br />
Chapter 3 explored a central focus of this document, Redundancy, and its application<br />
at the Information Management Level, Communication Infrastructure Level and<br />
Control System Level.<br />
This final chapter summarizes customer benefits provided by <strong>Schneider</strong> <strong>Electric</strong> <strong>High</strong><br />
<strong>Availability</strong> solutions, as well as additional information and references.<br />
Benefits<br />
Standard Offer<br />
<strong>Schneider</strong> <strong>Electric</strong> currently offers a wide range of solutions, providing the best<br />
design to respond to specific customer needs for Redundancy and <strong>Availability</strong> in<br />
Automation and Control Systems.<br />
One key concept of <strong>High</strong> <strong>Availability</strong> is that redundancy is not imposed as a default<br />
design characteristic at any system level. Instead, Redundancy can be added locally,<br />
in most cases using standard components.<br />
Simplicity of Implementation<br />
System Transparency<br />
Information Management Level<br />
At any level, the intrusion of Redundancy into system design and implementation is<br />
minimal, compared to a non-redundant system. For SCADA implementation,<br />
network active component selection, or PLC programming, most of the software<br />
contributions to redundancy depend on selections made during the<br />
configuration phase. Also, Redundancy can be applied selectively.<br />
The transparency of a redundant system, compared to a standalone one, is a<br />
customer requirement. With the <strong>Schneider</strong> <strong>Electric</strong> automation offer, this transparency<br />
is present at each level of the system.<br />
For Client Display Stations, Dual Path Supervisory Networks, Redundant I/O Servers,<br />
or Dual Access to Process Networks, each redundant contribution is handled<br />
separately by the system. For example, a concurrent Display Client communication<br />
flow will be transparently re-routed to the I/O server by the Supervisory Network in<br />
case of a cable disruption. This flow will also be transparently routed to the alternative<br />
I/O Server in case of a sudden malfunction of the first server. Finally, the I/O Server<br />
may transparently abandon its default communication channel if that channel ceases<br />
to operate properly, or if the target PLC no longer responds through it.<br />
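The transparent fallback described above can be sketched as a simple retry strategy. This is a hypothetical illustration only (the channel names and the request() stand-in are invented; the real re-routing is performed by the Supervisory Network and the I/O Server software, not by client code):<br />

```python
DOWN = {"io_server_1"}  # simulate a sudden malfunction of the first server

def request(channel):
    """Stand-in for a real network call: None models a channel that
    ceases to operate properly or a PLC that does not respond."""
    return None if channel in DOWN else f"reply via {channel}"

def transparent_request(channels):
    """Try the default channel first, then fall back to the alternatives."""
    for channel in channels:
        reply = request(channel)
        if reply is not None:
            return reply
    raise ConnectionError("no redundant path left")

print(transparent_request(["io_server_1", "io_server_2"]))  # reply via io_server_2
```
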
Communication Infrastructure Level<br />
Control System Level<br />
Ease of Use<br />
Whether utilized as a Process Network or as a Fieldbus Network, currently available<br />
active network components can easily participate in an automatically reconfigured<br />
network. With continuous enhancements, HIPER-Ring strategy not only offers<br />
simplicity, but also a level of performance compatible with a high-reactivity demand.<br />
The "IP Address automatic switch" for a SCADA application communicating through<br />
Ethernet is an important feature of <strong>Schneider</strong> <strong>Electric</strong> PLCs. Apart from simplifying<br />
the design of the SCADA application implementation, which could otherwise cause<br />
delays and increased cost, this feature also contributes to reducing the payload of a<br />
communication context exchange on a PLC switchover.<br />
As previously stated, increased effort has been made to make the implementation of<br />
a redundant feature simple and straightforward.<br />
The Vijeo Citect, ConneXium Web Pages and Unity Pro software environments offer<br />
clear and accessible configuration windows, along with a dedicated selective help, in<br />
order to execute the required parameterization.<br />
More Detailed RAMS Investigation<br />
In case of a specific need for detailed dependability (RAMS) studies, for any type of<br />
architecture, contact the <strong>Schneider</strong> <strong>Electric</strong> Safety Competency Center. This center<br />
has skilled and experienced individuals ready to help you with all your needs.<br />
Appendix<br />
Glossary<br />
Note: the references in brackets refer to standards, which are listed at the end of<br />
this glossary.<br />
1) Active Redundancy<br />
Redundancy where the different means required to accomplish a given function are<br />
present simultaneously [5]<br />
2) <strong>Availability</strong><br />
Ability of an item to be in a state to perform a required function under given conditions,<br />
at a given instant of time or over a given time interval, assuming that the required<br />
external resources are provided [IEV 191-02-05] (performance) [2]<br />
3) Common Mode Failure<br />
Failure that affects all redundant elements for a given function at the same time [2]<br />
4) Complete failure<br />
Failure which results in the complete inability of an item to perform all required<br />
functions [IEV 191-04-20] [2]<br />
5) Dependability<br />
Collective term used to describe availability performance and its influencing factors:<br />
reliability performance, maintainability performance and maintenance support<br />
performance [IEV 191-02-03] [2]<br />
Note: Dependability is used only for general descriptions in non-quantitative terms.<br />
6) Dormant<br />
A state in which an item is able to function but is not required to function. Not to be<br />
confused with downtime [4]<br />
7) Downtime<br />
Time during which an item is in an operational inventory but is not in condition to<br />
perform its required function [4]<br />
8) Failure<br />
Termination of the ability of an item to perform a required function [IEV 191-04-01] [2]<br />
Note 1: After failure the item detects a fault.<br />
Note 2: "Failure" is an event, as distinguished from "fault", which is a state.<br />
9) Failure Analysis<br />
The act of determining the physical failure mechanism resulting in the functional<br />
failure of a component or piece of equipment [1]<br />
10) Failure Mode and Effects Analysis (FMEA)<br />
Procedure for analyzing each potential failure mode in a product, to determine the<br />
results or effects on the product. When the analysis is extended to classify each<br />
potential failure mode according to its severity and probability of occurrence, it is<br />
called a Failure Mode, Effects, and Criticality Analysis (FMECA).[6]<br />
11) Failure Rate<br />
Total number of failures within an item population, divided by the total number of life<br />
units expended by that population, during a particular measurement period under<br />
stated conditions [4]<br />
12) Fault<br />
State of an item characterized by its inability to perform a required function, excluding<br />
this inability during preventive maintenance or other planned actions, or due to lack of<br />
external resources [IEV 191-05-01] [2]<br />
Note: a fault is often the result of a failure of the item itself, but may exist without prior<br />
failure.<br />
13) Fault Tolerance<br />
Ability to tolerate and accommodate a fault, with or without performance<br />
degradation<br />
14) Fault Tree Analysis (FTA)<br />
Method used to evaluate reliability of engineering systems. FTA is concerned with<br />
fault events. A fault tree may be described as a logical representation of the<br />
relationship of primary or basic fault events that lead to the occurrence of a specified<br />
undesirable fault event known as the “top event.” A fault tree is depicted using a tree<br />
structure with logic gates such as AND and OR [7]<br />
FTA illustration [7]<br />
15) Hidden Failure<br />
Failure occurring that is not detectable by or evident to the operating crew [1]<br />
16) Inherent <strong>Availability</strong> (Intrinsic <strong>Availability</strong>) : Ai<br />
A measure of <strong>Availability</strong> that includes only the effects of an item design and its<br />
application, and does not account for effects of the operational and support<br />
environment. Sometimes referred to as "intrinsic" availability [4]<br />
17) Integrity<br />
Reliability of data which is being processed or stored.<br />
18) Maintainability<br />
Probability that an item can be retained in, or restored to, a specified condition when<br />
maintenance is performed by personnel having specified skill levels, using prescribed<br />
procedures and resources, at each prescribed level of maintenance and repair. [4]<br />
19) Markov Method<br />
A Markov process is a mathematical model that is useful in the study of the<br />
availability of complex systems. The basic concepts of the Markov process are those<br />
of the “state” of the system (for example operating, non-operating) and state<br />
“transition” (from operating to non-operating due to failure, or from non-operating to<br />
operating due to repair). [4]<br />
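As an aside, the simplest two-state Markov model already yields the availability formula used in the calculation examples later in this document. The sketch below is not taken from the source; it applies the standard steady-state result A = μ/(λ+μ) = MTBF/(MTBF+MTTR), with example figures borrowed from the calculation chapter:<br />

```python
# Two states: operating (up) and non-operating (down).
# Transitions: failure at rate lam = 1/MTBF, repair at rate mu = 1/MTTR.
MTBF = 45_500.0  # hours (example figure from the calculation chapter)
MTTR = 2.0       # hours

lam, mu = 1.0 / MTBF, 1.0 / MTTR

# Steady state balances the flow between states: lam * A = mu * (1 - A),
# hence A = mu / (lam + mu), equivalently MTBF / (MTBF + MTTR).
A = mu / (lam + mu)

assert abs(A - MTBF / (MTBF + MTTR)) < 1e-12
# A ≈ 0.999956: the "4-nines" availability quoted later in this document
```
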
Markov Graph illustration [2]<br />
20) MDT: Mean Downtime<br />
Average time a system is unavailable for use due to a failure.<br />
Time includes the actual repair time plus all delay time associated with a repair<br />
person arriving with the appropriate replacement parts [4]<br />
21) MOBF: Mean Operating Time Between Failures<br />
Expectation of the operating time between failures [IEV 191-12-09] [2]<br />
22) MTBF<br />
A basic measure of reliability for repairable items. The mean number of life units<br />
during which all parts of the item perform within their specified limits, during a<br />
particular measurement interval under stated conditions. [4]<br />
23) MTTF : Mean Time To Failure<br />
A basic measure of reliability for non-repairable items. The total number of life units of<br />
an item population divided by the number of failures within that population, during a<br />
particular measurement interval under stated conditions. [4]<br />
Note: Used with repairable items, MTTF stands for Mean Time To First Failure<br />
24) MTTR : Mean Time To Repair<br />
A basic measure of maintainability. The sum of corrective maintenance times at any<br />
specific level of repair, divided by the total number of failures within an item repaired<br />
at that level, during a particular interval under stated conditions. [4]<br />
25) MTTR : Mean Time To Recovery<br />
Expectation of the time to recovery [IEV 191-13-08] [2]<br />
26) Non-Detectable Failure<br />
Failure at the component, equipment, subsystem, or system (product) level that is<br />
identifiable by analysis but cannot be identified through periodic testing or revealed by<br />
an alarm or an indication of an anomaly. [4]<br />
27) Redundancy<br />
Existence in an item of two or more means of performing a required function [IEV<br />
191-15-01] [2]<br />
Note: in this standard, the existence of more than one path (consisting of links and<br />
switches) between end nodes.<br />
Existence of more than one means for accomplishing a given function. Each means<br />
of accomplishing the function need not necessarily be identical. The two basic types<br />
of redundancy are active and standby. [4]<br />
28) Reliability<br />
Ability of an item to perform a required function under given conditions for a given<br />
time interval [IEV 191-02-06] [2]<br />
Note 1: It is generally assumed that an item is in a state to perform this required<br />
function at the beginning of the time interval<br />
Note 2: the term “reliability” is also used as a measure of reliability performance (see<br />
IEV 191-12-01)<br />
29) Repairability<br />
Probability that a failed item will be restored to operable condition within a specified<br />
time of active repair [4]<br />
30) Serviceability<br />
Relative ease with which an item can be serviced (i.e. kept in operating condition). [4]<br />
31) Standby Redundancy<br />
Redundancy wherein a part of the means for performing a required function is<br />
intended to operate, while the remaining part(s) of the means are inoperative until<br />
needed [IEV 191-15-03] [2]<br />
Note: this is also known as dynamic redundancy.<br />
Redundancy in which some or all of the redundant items are not operating<br />
continuously but are activated only upon failure of the primary item performing the<br />
function(s). [4]<br />
32) System Downtime<br />
Time interval between the commencement of work on a system (product) malfunction<br />
and the time when the system has been repaired and/or checked by the maintenance<br />
person, and no further maintenance activity is executed. [4]<br />
33) Total System Downtime<br />
Time interval between the reporting of a system (product) malfunction and the time<br />
when the system has been repaired and/or checked by the maintenance person, and<br />
no further maintenance activity is executed. [4]<br />
34) Unavailability<br />
State of an item of being unable to perform its required function [IEV 603-05-05] [2]<br />
Note: Unavailability is expressed as the fraction of expected operating life that an<br />
item is not available, for example given in minutes per year<br />
Ratio: downtime/(uptime + downtime) [3]<br />
Often expressed as a maximum period of time during which the variable is<br />
unavailable, for example 4 hours per month<br />
35) Uptime<br />
That element of Active Time during which an item is in condition to perform its<br />
required functions. (Increases availability and dependability). [4]<br />
[1] Maintenance & reliability terms - Life Cycle Engineering<br />
[2] IEC 62439: <strong>High</strong> <strong>Availability</strong> automation networks<br />
[3] IEEE Std C37.1-2007: Standard for SCADA and Automation System<br />
[4] MIL-HDBK-338B - Military Handbook - Electronic Reliability Design Handbook<br />
[5] IEC-271-194<br />
[6] The Certified Quality Engineer Handbook - Connie M. Borror, Editor<br />
[7] Reliability, Quality, and Safety for Engineers - B.S. Dhillon - CRC Press<br />
Standards<br />
General purpose<br />
This section contains a selected, non-exhaustive list of reference documents and<br />
standards related to Reliability and <strong>Availability</strong>:<br />
• IEC 60050 (191):1990 - International Electrotechnical Vocabulary (IEV)<br />
FMEA/FMECA<br />
• IEC 60812 (1985) - Analysis techniques for system reliability - Procedures for failure mode and<br />
effect analysis (FMEA)<br />
• MIL-STD 1629A (1980) Procedures for performing a failure mode, effects and criticality analysis<br />
Reliability Block Diagrams<br />
• IEC 61078 (1991) Analysis techniques for dependability - Reliability block diagram method<br />
Fault Tree Analysis<br />
• NUREG-0492 - Fault Tree Handbook - US Nuclear Regulatory Commission<br />
Markov Analysis<br />
• IEC 61165 (2006) Application of Markov techniques<br />
RAMS<br />
• IEC 60300-1 (2003) - Dependability management - Part 1: Dependability management systems<br />
• IEC 62278 (2002) - Railway applications - Specification and demonstration of reliability, availability,<br />
maintainability and safety (RAMS)<br />
Functional Safety<br />
• IEC 61508 - Functional safety of electrical/electronic/programmable electronic safety related<br />
systems (7 parts)<br />
• IEC 61511 (2003) Functional safety - Safety instrumented systems for the process industry sector.<br />
Calculation Examples<br />
Reliability & <strong>Availability</strong> Calculation Examples<br />
The previous chapter provided the necessary knowledge and theoretical basics to<br />
understand simple examples of Reliability and <strong>Availability</strong> calculations. This section<br />
examines concrete, simple but realistic examples, related to current, plausible solutions.<br />
The root piece of information required for all of these calculations, for all major<br />
components we will implement in our architectures, is the MTBF figure. MTBFs are<br />
normally provided by the manufacturer, either on request and/or on dedicated<br />
documents. For <strong>Schneider</strong> <strong>Electric</strong> PLCs, MTBF information can be found on the<br />
intranet site Pl@net, under Product offer>Quality Info.<br />
Note: The MTBF information is usually not accessible via the <strong>Schneider</strong> <strong>Electric</strong><br />
public website.<br />
Redundant PLC System, Using Massive I/O Modules Redundancy<br />
Standalone Architecture<br />
Consider a simple, single-rack Premium PLC configuration.<br />
First, calculate the individual modules' MTBFs, using a spreadsheet that will perform<br />
the calculations (examples include Excel and OpenOffice).<br />
The calculation guidelines are derived from the main conclusion given by Serial RBD<br />
analysis, that is:<br />
λS = Σ (i = 1 to n) λi<br />
The Equivalent Failure Rate for n serial elements is equal to the sum of the individual<br />
Failure Rates of these elements, with<br />
RS(t) = e^(−λS·t)<br />
The first operation will identify individual MTBF figures of the part references<br />
populating the target system. Using these figures, a second sheet will then<br />
subsequently consider the item, group and system levels.<br />
• For each of the identified item part references:<br />
- Individual item Failure Rate λ, calculated by inverting the individual item<br />
MTBF: λ = 1 / MTBF<br />
- Individual item Reliability over 1 year, applying R(t) = e^(−λt), where t = 8760,<br />
the number of hours in one year (365 × 24)<br />
• For each of the identified part-reference groups:<br />
- Group Failure Rate: individual item Failure Rate multiplied by the number of items in<br />
the considered group<br />
- Group Reliability: individual item Reliability raised to the power of the number of items<br />
• For the considered System:<br />
- System Failure Rate: sum of the Group Failure Rates<br />
- System MTBF in hours: inverse of the System Failure Rate<br />
- System Reliability over one year: exp(−System Failure Rate × 8760)<br />
(where 8760 = 365 × 24, the number of hours in one year)<br />
- System <strong>Availability</strong> (with MTTR = 2 h): System MTBF / (System MTBF + 2)<br />
Looking at the example results, Reliability over one year is approximately 83%<br />
(meaning the probability that this system encounters a failure during one year is<br />
approximately 17%).<br />
The System MTBF itself is approximately 45,500 hours (approximately 5 years).<br />
Regarding <strong>Availability</strong>, with a 2 hour Mean Time To Repair (which corresponds to<br />
a very good logistics and maintenance organization), we obtain a 4-nines resulting<br />
<strong>Availability</strong>, i.e. an average of approximately 23 minutes of downtime per year.<br />
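The worksheet procedure above can be reproduced in a few lines of code. The module list and MTBF figures below are invented placeholders (real figures come from Pl@net), chosen only so the totals land in the same range as the example:<br />

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760

# part reference: (MTBF in hours, quantity) -- hypothetical figures
modules = {
    "power supply": (600_000.0, 1),
    "CPU":          (200_000.0, 1),
    "I/O module":   (800_000.0, 12),
}

# Serial RBD: the system failure rate is the sum of the item failure rates,
# each item contributing quantity / MTBF (i.e. quantity * lambda).
lambda_system = sum(qty / mtbf for mtbf, qty in modules.values())

mtbf_system = 1.0 / lambda_system                            # ~46,000 hours
reliability_1y = math.exp(-lambda_system * HOURS_PER_YEAR)   # ~83 %

MTTR = 2.0  # hours
availability = mtbf_system / (mtbf_system + MTTR)            # ~4 nines
downtime_min_per_year = (1 - availability) * HOURS_PER_YEAR * 60  # ~23 min
```
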
Note:<br />
• As previously explained, the System figures produced by our basic calculations<br />
simply apply the equations given by basic probability analysis. In addition to these<br />
calculations, a commercial software tool was used, permitting us to confirm<br />
our figures.<br />
Redundant Architecture<br />
Assuming we would like to increase this system’s<br />
<strong>Availability</strong> by implementing a redundant architecture,<br />
we need to calculate the potential gain for System<br />
Reliability and <strong>Availability</strong>.<br />
We will assume that we have no additional hardware,<br />
apart from the two redundant racks. The required calculation guidelines are derived<br />
from the main conclusion given by Parallel RBD analysis, that is:<br />
RRed(t) = 1 − Π (i = 1 to n) [1 − Ri(t)]<br />
This means, if we consider RS as the Standalone System Reliability and RRed as the<br />
Redundant System Reliability:<br />
RRed(t) = 1 − (1 − RS(t))²<br />
As a result of the redundant architecture, System Reliability over one year is<br />
now approximately 97% (i.e. the probability for the system to encounter one<br />
failure during one year has been reduced to approximately 3%).<br />
System MTBF has increased from 45,500 hours (approximately 5 years) to<br />
68,000 hours (approximately 7.8 years).<br />
Regarding System <strong>Availability</strong>, with a 2 hour Mean Time To Repair we would<br />
obtain an 8-nines resulting <strong>Availability</strong>, which is close to 100%.<br />
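These parallel-architecture results can be checked numerically. A sketch under the stated assumptions (two identical units, no repair for the MTBF estimate, standalone figures taken from the previous example):<br />

```python
import math

HOURS_PER_YEAR = 8760
MTTR = 2.0

lambda_s = 1.0 / 45_500.0                    # standalone failure rate
r_s = math.exp(-lambda_s * HOURS_PER_YEAR)   # standalone reliability, ~83 %

# Parallel RBD with two identical branches:
r_red = 1 - (1 - r_s) ** 2                   # ~97 % over one year

# Without repair, the MTTF of a 1-out-of-2 parallel pair is 3/(2*lambda),
# i.e. 1.5x the single-unit MTBF: 45,500 h -> ~68,250 h.
mtbf_red = 3 / (2 * lambda_s)

# Availability: one branch gives ~4 nines; two in parallel give ~8 nines.
a_s = 45_500.0 / (45_500.0 + MTTR)
a_red = 1 - (1 - a_s) ** 2
```
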
Note: Formal calculations should also take into account undetected errors in the<br />
redundant architecture, which would yield somewhat less optimistic figures.<br />
A complete analysis should also take into account the additional wiring devices typically<br />
used in a massive I/O redundancy strategy, feeding matching input points with the<br />
same input signal, and bringing matching output points onto the same output signal.<br />
Also, with this software, a parallel structure has been retained, in which the Failure Rate<br />
of the Standby rack is the same as that of the Primary rack.<br />
Reminder of Standby Definitions:<br />
Cold Standby: The Standby unit starts only<br />
if the active unit becomes inoperative.<br />
Warm Standby: The Failure Rate of the<br />
Standby unit during the inactive mode is<br />
lower than the Failure Rate when it becomes<br />
active.<br />
Hot Standby: The Failure Rate of the<br />
Standby unit during the inactive mode is the same as the Failure Rate when it<br />
becomes active. This is a simple parallel configuration.<br />
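These three definitions can be compared quantitatively. The sketch below is standard reliability theory rather than material from this document; it assumes two identical units, perfect switching, and no repair:<br />

```python
def two_unit_standby_mttf(lam, lam_dormant):
    """MTTF of a two-unit standby pair: expected time with both units up
    (combined rate lam + lam_dormant), plus the lone survivor's 1/lam."""
    return 1.0 / (lam + lam_dormant) + 1.0 / lam

lam = 1.0 / 45_500.0  # active failure rate (example figure above)

mttf_hot = two_unit_standby_mttf(lam, lam)       # dormant rate = active rate
mttf_warm = two_unit_standby_mttf(lam, lam / 2)  # dormant rate reduced
mttf_cold = two_unit_standby_mttf(lam, 0.0)      # dormant rate = 0

# Hot standby reduces to the simple parallel result 3/(2*lam);
# cold standby (perfect switching) doubles the single-unit MTBF.
assert abs(mttf_hot - 1.5 / lam) < 1e-6
assert abs(mttf_cold - 2.0 / lam) < 1e-6
assert mttf_hot < mttf_warm < mttf_cold
```
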
Redundant PLC System, Using Shared Remote I/O Racks<br />
Simple Architecture with a Standalone CPU Rack<br />
Standalone CPU Rack<br />
Consider a standalone CPU rack equipped<br />
with power supply, CPU, remote I/O<br />
processor and Ethernet communication<br />
modules.<br />
As in the previous section, calculate the<br />
potential gain a redundant architecture would<br />
allow, in terms of Reliability and <strong>Availability</strong>.<br />
First calculate the individual modules' MTBFs,<br />
then establish, for each rack, a<br />
worksheet that provides the Reliability and<br />
<strong>Availability</strong> figures.<br />
Remote I/O Rack<br />
Simple Architecture : Standalone CPU Rack + Remote I/O Rack<br />
For this Serial System (Rack #1 + Rack #2), Reliability over one year is approximately 82.8% (the<br />
probability for this system to encounter a failure during one year is approximately 17.2%).<br />
System MTBF is approximately 46,260 hours (approximately 5.3 years).<br />
Regarding <strong>Availability</strong>, with a 2 hour Mean Time To Repair (which corresponds to a very good logistics<br />
and maintenance organization), we obtain a 4-nines resulting <strong>Availability</strong>, which is an average<br />
of approximately 23 minutes of downtime per year.<br />
Note: As expected, Reliability and <strong>Availability</strong> figures resulting from this Serial<br />
implementation are determined by the weakest link of the chain.<br />
Simple Architecture with a Redundant CPU Rack<br />
One solution to reinforce Rack #1 <strong>Availability</strong> (i.e. the<br />
rack hosting the CPU, and thus the process<br />
control core) is to implement a Hot Standby<br />
configuration.<br />
As a result of Rack #1 Redundancy, its Reliability over one year would be approximately<br />
99.7%, compared to 94.6% with a Standalone Rack #1 (i.e. the probability for the<br />
Redundant Rack #1 System to encounter a failure during one year would have been<br />
reduced from 5.4% to 0.3%).<br />
System MTBF would increase from approximately 157,000 hours (approximately 18 years)<br />
to 235,000 hours (approximately 27 years).<br />
Regarding System <strong>Availability</strong>, with a 2 hour Mean Time To Repair we would obtain a 9-<br />
nines resulting <strong>Availability</strong>, close to 100%.<br />
Note: As expected, Reliability and <strong>Availability</strong> figures resulting from this Parallel<br />
implementation are better than the best of the individual figures for different links of<br />
the chain.<br />
Consider a common misuse of reliability figures:<br />
• The last architecture examined (Distributed Architecture with a Redundant CPU<br />
Rack) demonstrated a real benefit in implementing CPU Rack redundancy: with<br />
this architecture, a sub-system would be elevated to 99.9% Reliability, and<br />
almost 100% <strong>Availability</strong>.<br />
• The common misuse of this Reliability figure would be to argue for a comparable<br />
benefit in Reliability and <strong>Availability</strong> for the whole resulting system.<br />
Note: This example has considered the Sub-System Failure Rate provided by the<br />
reliability software.<br />
Regarding the Serial System formed by the Redundant CPU Racks and the STB Islands,<br />
the worksheet above shows a resulting Reliability (over one year) of 84.06%, the<br />
Standalone System Reliability over the same period being 81.95%.<br />
In addition, the worksheet shows a resulting <strong>Availability</strong> (for a 2-hour MTTR) of<br />
99.96%, the Standalone System <strong>Availability</strong> (for a 2-hour MTTR) being 99.9957%.<br />
As a result, this data suggests that implementing CPU Rack Redundancy would<br />
have almost no benefit.<br />
Of course, this conclusion is incorrect, and the example suggests a simple<br />
rule: always compare comparable items. If we implement redundancy on the CPU<br />
Control Rack in order to increase the Reliability, and by extension the <strong>Availability</strong>,<br />
of the process control core, then we must examine and compare the figures at this<br />
level only, since the rest of the system has not been made redundant.<br />
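This masking effect is easy to demonstrate numerically. The figures below are illustrative (they are not the worksheet's values): a weak element in a serial chain hides, at system level, a large improvement made to another element.<br />

```python
def serial_reliability(*elements):
    """Reliability of a serial chain: the product of the element reliabilities."""
    product = 1.0
    for r in elements:
        product *= r
    return product

# Illustrative figures (not taken from the worksheet above)
r_rack, r_rack_redundant = 0.946, 0.997  # rack-level gain: failure 5.4% -> 0.3%
r_weak_link = 0.82                       # another serial element, left unchanged

before = serial_reliability(r_rack, r_weak_link)           # ~0.776
after = serial_reliability(r_rack_redundant, r_weak_link)  # ~0.818
# At system level the gain looks modest, because the weak link dominates;
# the redundancy must be judged at the rack level, where it reduces the
# failure probability by a factor of ~18.
```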
Recommendations for Utilizing Figures<br />
The previous case studies provide several important lessons concerning the evaluation<br />
of Reliability metrics:<br />
• Provided elementary MTBF figures are available, a "serial" system is quite easy<br />
to evaluate, thanks to the "magic bullet" formula λ = 1 / MTBF (the failure rates of serial elements simply add).<br />
• For a "parallel" sub-system, System Reliability can easily be derived<br />
from the Sub-System Reliability, and System <strong>Availability</strong> can then be almost as easily<br />
calculated from it. But the "magic bullet"<br />
formula no longer applies: under these conditions, the exponential law does not<br />
directly link Reliability to the Failure Rate. However, System MTBF can be<br />
approximated from the Sub-System MTBF:<br />
System MTBF (in hours) ≈ 1.5 × Sub-System MTBF<br />
No direct relationship, however, provides the System Failure Rate.<br />
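The 1.5× approximation in the last bullet can be checked numerically. For two identical units with constant failure rate λ, the parallel-system reliability is R(t) = 1 − (1 − e^(−λt))², and integrating R(t) over time yields the MTTF. A sketch with an illustrative λ:<br />

```python
import math

lam = 1e-4  # illustrative failure rate per unit (1/hour); unit MTBF = 1/lam = 10,000 h

def r_parallel(t):
    """R(t) of two identical units in parallel: the system fails only when both fail."""
    return 1 - (1 - math.exp(-lam * t)) ** 2

# MTTF = integral of R(t) dt, approximated here by a rectangle sum
dt = 1.0
horizon = int(30 / lam)  # integrate far enough that R(t) is negligible
mttf = sum(r_parallel(t) * dt for t in range(horizon))

print(mttf)  # ~15,000 h, i.e. 1.5 x the single-unit MTBF
```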
Because these calculations must be executed precisely, acquiring<br />
commercial reliability software is one solution; soliciting<br />
an external reliability service is another.<br />
<strong>Schneider</strong> <strong>Electric</strong> Industries SAS<br />
Head Office<br />
89, bd Franklin Roosevelt<br />
92506 Rueil-Malmaison Cedex<br />
FRANCE<br />
www.schneider-electric.com<br />
Owing to the evolution of standards and equipment, the characteristics indicated in the text and images<br />
of this document are binding only after confirmation by our departments.<br />
Print:<br />
Version 1 - 06 2009<br />