
System Technical Note

How can I...
Increase the Availability of a
Collaborative Control System

Table of Contents

Introduction to High Availability
High Availability Theoretical Basics
High Availability with Collaborative Control System
Conclusion
Appendix
Reliability & Availability Calculation Examples


Introduction to High Availability

Purpose

Introduction

The intent of this System Technical Note (STN) is to describe the capabilities of the different Schneider Electric solutions that answer the requirements of most critical applications and consequently increase the availability of a Collaborative Control System. It provides a common, readily understandable reference point for end users, system integrators, OEMs, sales people, business support and other parties.

Increasingly, process applications require a high availability automation system. Before deciding to install a high availability automation system in your installation, you need to consider the following key questions:

• What is the security level needed? This concerns the security of both people and hardware. For instance, the start/stop sequences that manage the kiln in a cement plant include a key condition: the most powerful equipment must be the last to start and stop.

• What is the criticality of the process? This point concerns all the processes that involve a reaction (mechanical, chemical, etc.). Consider the example of the kiln again. To avoid its destruction, the complete stop sequence needs a slow cooling of the constituent material. Another typical example is the biological treatment in a wastewater plant, which cannot be stopped every day.

• What is the environmental criticality? Consider the example of an automation system in a tunnel. If a fire can start on one side, the PLCs (the default one and the redundant one) should be installed on opposite sides of the tunnel. In addition, will the system have to face a harsh environment in terms of vibration, temperature, shock, and so on?

• Which other particular circumstances does the system have to address? This last topic includes additional questions, for example: does the system really need redundancy if the electrical network becomes inoperative in the concerned layer of the installation? What is the criticality of the data in the event of a loss of communication?


Availability is a term that is increasingly used to qualify a process asset or system. In addition, Reliability and Maintainability are terms now often used for analyses considered of major usefulness in design improvement and in production diagnostic issues. Accordingly, the design of automation system architectures must consider these types of questions.

Document Overview

The Collaborative Control System provided by Schneider Electric offers different levels of redundancy, allowing you to design an effective high availability system.

This document contains the following chapters:

• The High Availability Theoretical Basics chapter describes the fundamentals of High Availability. On the one hand it presents theory and basics; on the other hand it explains a method to conceptualize series/parallel architectures. Calculation examples illustrate this approach.

• The High Availability with Collaborative Control System chapter presents different High Availability solutions for the Collaborative Control System, from the information management level down to the I/O module level.

• The final Conclusion chapter summarizes the customer benefits provided by Schneider Electric High Availability solutions, as well as additional information and references.


The following drawing represents the various levels of an automation system architecture:

As shown in the following chapters, redundancy is a convenient means of elevating global system reliability, and thus its availability. In the Collaborative Control System, high availability, that is redundancy, can be addressed at the different levels of the architecture:

• Single or redundant I/O modules (depending on sensor/actuator redundancy)
• Depending on the I/O handling philosophy (for example conventional Remote I/O stations, or I/O Islands distributed on Ethernet), different scenarios can be applied: dual communication medium I/O bus, or self-healing ring, single or dual
• Programmable controller CPU redundancy (Hot Standby PAC station)
• Communication network and port redundancy
• SCADA system dedicated approaches, with multiple operator station location scenarios and resource redundancy capabilities

Guide Scope

The realization of an automation project includes five main phases: Selection, Design, Configuration, Implementation and Operation. To help you develop a whole project based on these phases, Schneider Electric created the System Technical book concept: Guides (STG) and Notes (STN).

A System Technical Guide provides technical guidelines and recommendations to implement technologies according to your needs and requirements. It covers the entire project life cycle, from the selection phase to the operation phase, providing design methodology and even source code examples for all components of a sub-system.

A System Technical Note gives a more theoretical approach by focusing specifically on a system technology. It describes our complete solution offer for a given system and therefore supports you in the selection phase of a project.

STG and STN are linked and complementary. To sum up, you will find the fundamentals of the technologies in an STN, and their corresponding tested and validated applications in one or several STGs.

(Figure: STG and STN scope across the Automation Project Life Cycle)


High Availability Theoretical Basics

This section describes basic high availability terms, concepts and formulas, and includes examples for typical applications.

Fault Tolerant System

A Fault Tolerant System usually refers to a system that can operate even though a hardware component becomes inoperative. The Redundancy principle is often used to implement a Fault Tolerant System, because an alternate component takes over the task transparently.

Lifetime and Failure Rate

Considering a given system or device, its Lifetime corresponds to the number of hours it can function under normal operating conditions. This number is the result of the individual life expectancy of the components used in its assembly.

Lifetime is generally seen as a sequence of three subsequent phases: the "early life" (or "infant mortality"), the "useful life," and the "wear-out period."

Failure Rate (λ) is defined as the average (mean) rate at which a system becomes inoperative.

When considering events occurring on a series of systems, for example a group of light bulbs, the units to be used should normally be "failures per unit of system time." Examples include failures per machine-hour or failures per system-year. Because the scope of this document is limited to single repairable entities, we will usually discuss failures per unit of time.

Failure Rate Example: For a human being, Failure Rate (λ) measures the probability of death occurring in the next hour. Stating λ(20 years) = 10⁻⁶ per hour would mean that the probability for someone aged 20 to die in the next hour is 10⁻⁶.

Bathtub Curve

The following figure shows the Bathtub Curve, which represents the Failure Rate (λ) according to the Lifetime (t):

Consider the relation between Failure Rate and Lifetime for a device consisting of assembled electronic parts. This relationship is represented by the "bathtub curve" shown in the diagram. In "early life," the system exhibits a high Failure Rate, which gradually decreases until it approaches a constant value that is maintained during its "useful life." The system finally enters the "wear-out" stage of its life, where the Failure Rate increases exponentially.

Note: Useful Life normally starts at the beginning of system use and ends at the beginning of its wear-out phase. Assuming that "early life" corresponds to the "burn-in" period indicated by the manufacturer, we generally consider that Useful Life starts with the beginning of system use by the end user.

RAMS (Reliability Availability Maintainability Safety)

The following text, from the MIL-HDBK-338B standard, defines the RAM criteria and their probabilistic aspect:

"For the engineering specialties of reliability, availability and maintainability (RAM), the theories are stated in the mathematics of probability and statistics. The underlying reason for the use of these concepts is the inherent uncertainty in predicting a failure. Even given a failure model based on physical or chemical reactions, the results will not be the time a part will fail, but rather the time a given percentage of the parts will fail or the probability that a given part will fail in a specified time."

Along with Reliability, Availability and Maintainability, Safety is the fourth metric of a meta-domain that specialists have named RAMS (also sometimes referred to as dependability).



Metrics

RAMS metrics relate to time allocation and depend on the operational state of a given system. The following curve defines the state linked to each term:

• MUT: Mean Up Time

MUT qualifies the average duration of the system being in an operational state.

• MDT: Mean Down Time

MDT qualifies the average duration of the system not being in an operational state. It comprises the different portions of time required to detect the error, fix it, and restore the system to its operational state.

• MTBF: Mean Time Between Failures

MTBF is defined by the MIL-HDBK-338 standard as follows: "A basic measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions."

Thus for repairable systems, MTBF is a metric commonly used to appraise Reliability, and corresponds to the average time interval (normally specified in hours) between two consecutive occurrences of inoperative states.

Put simply: MTBF = MUT + MDT

MTBF can be calculated (provisional reliability) based on data books such as UTE C80-810 (RDF2000), MIL-HDBK-217F, FIDES, RDF 93, and BELLCORE. Other inputs include field feedback, laboratory testing, or demonstrated MTBF (operational reliability), or a combination of these. Remember that MTBF only applies to repairable systems.

• MTTF (or MTTFF): Mean Time To First Failure

MTTF is the mean time before the occurrence of the first failure.

MTTF (and MTBF by extension) is often confused with Useful Life, even though these two concepts are not related in any way. For example, a battery may have a Useful Life of 4 hours and an MTTF of 100,000 hours. These figures indicate that for a population of 100,000 batteries, there will be approximately one battery failure every hour (defective batteries being replaced).

Considering a repairable system with an exponentially distributed Reliability and a constant Failure Rate (λ), MTTF = 1 / λ.

Mean Down Time is usually very low compared to Mean Up Time. This equivalence is extended to MTBF, which can be assimilated to MTTF, resulting in the following relationship: MTBF = 1 / λ.

This relationship is widely used in subsequent calculations.

Example:

Given the MTBF of a communication adapter, 618,191 hours, what is the probability for that module to operate without failure for 5 years?

Calculate the module Reliability over a 5-year time period: R(t) = e^(-λt) = e^(-t / MTBF)

a) Divide 5 years, that is 8,760 × 5 = 43,800 hours, by the given MTBF:
43,800 / 618,191 = 0.07085

b) Then raise e to the power of the negative value of that number:
e^(-0.07085) = 0.9316

Thus there is a 93.16% probability that the communication module will not fail over a 5-year period.

• FIT: Failures In Time

Typically used as the Failure Rate measurement for non-repairable electronic components, FIT is the number of failures in one billion (10⁹) hours.

FIT = 10⁹ / MTBF, or MTBF = 10⁹ / FIT
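The figures in this example can be cross-checked with a few lines of Python (a minimal sketch, not part of the STN; the function names are illustrative):

```python
import math

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Reliability R(t) = exp(-t / MTBF), assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

def fit_from_mtbf(mtbf_hours: float) -> float:
    """FIT: expected failures per billion hours."""
    return 1e9 / mtbf_hours

mtbf = 618_191                 # hours, communication adapter MTBF from the example
mission = 5 * 8_760            # 5 years expressed in hours
print(f"R(5 years) = {reliability(mtbf, mission):.4f}")   # ~0.9316
print(f"FIT        = {fit_from_mtbf(mtbf):.0f}")          # ~1618 failures per 10^9 hours
```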

Safety

Definition

Safety refers to the protection of people, assets and the environment. For example, if an installation has a tank with an internal pressure that exceeds a given threshold, there is a high chance of explosion and eventual destruction of the installation (with injury or death of people and damage to the environment). In this example, the Safety System put in place opens a valve to the atmosphere, to prevent the pressure threshold from being crossed.

Maintainability

Definition

Maintainability refers to the ability of a system to be maintained in an operational state. This once again relates to probability: Maintainability corresponds to the probability for an inoperative system to be repaired in a given time interval.

While design considerations may to a certain extent impact the maintainability of a system, the maintenance organization will also have a major impact on it. Having the right number of people trained to observe and react with the proper methods, tools, and spare parts are considerations that usually depend more on the customer organization than on the automation system architecture.

Equipment shall be maintainable on-site by trained personnel according to the maintenance strategy.

Mathematics Basics

A common metric named Maintainability, M(t), gives the probability that a required active maintenance operation can be accomplished in a given time interval.

The relationship between Maintainability and repair is similar to the relationship between Reliability and failure, with the repair rate µ(t) defined in a way analogous to the Failure Rate. When this repair rate is considered constant, it implies an exponential distribution for Maintainability: M(t) = 1 − e^(-µt).

The maintainability of equipment is reflected in MTTR, which is usually considered as the sum of the individual times required for administration, transport, and repair.

Availability

Definition

The term High Availability is often used when discussing Fault Tolerant Systems. For example, your telephone line is supposed to offer you a high level of availability: the service you are paying for has to be effectively accessible and dependable. Your line availability relates to the continuity of the service you are provided. As an example, assume you are living in a remote area with occasional violent storms. Because of your location and the damage these storms can cause, long delays are required to fix your line once it is out of order. In these conditions, if on average your line appears to be usable only 50% of the time, you have poor availability. By contrast, if on average each of your attempts is 100% satisfied, then your line has high availability.

This example demonstrates that Availability is the key metric for measuring a system's tolerance level, that it is typically expressed in percent (for example 99.999%), and that it belongs to the domain of probability.

Mathematics Basics

The Instantaneous Availability of a device is the probability that this device will be in the functional state for which it was designed, under given conditions and at a given time (t), with the assumption that the required external conditions are met.

Besides Instantaneous Availability, different variants have been specified, each corresponding to a specific definition, including Asymptotic Availability, Intrinsic Availability and Operational Availability.

Asymptotic (or Steady State) Availability: A∞

Asymptotic Availability is the limit of the instantaneous availability function as time approaches infinity:

A∞ = 1 − MDT / (MUT + MDT) = MUT / (MUT + MDT) = MUT / MTBF

where Downtime includes all repair time (corrective and preventive maintenance time), administrative time and logistic time.

The following curve shows an example of asymptotic behavior:

Intrinsic (or Inherent) Availability: Ai

Intrinsic Availability does not include administrative time and logistic time, and usually does not include preventive maintenance time. It is primarily a function of the basic equipment/system design.

Ai = MTBF / (MTBF + MTTR)

We will consider Intrinsic Availability in our Availability calculations.

Operational Availability

Operational Availability corresponds to the probability that an item will operate satisfactorily at a given point in time when used in an actual or realistic operating and support environment.

Ao = Uptime / Operating Cycle

Operational Availability includes logistics time, ready time, and waiting or administrative downtime, as well as both preventive and corrective maintenance downtime.
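As an illustration of the three definitions above, here is a minimal sketch (not from the STN; the numeric values are assumptions chosen only to exercise the formulas):

```python
def asymptotic_availability(mut_h: float, mdt_h: float) -> float:
    """A_inf = MUT / (MUT + MDT) = MUT / MTBF."""
    return mut_h / (mut_h + mdt_h)

def intrinsic_availability(mtbf_h: float, mttr_h: float) -> float:
    """Ai = MTBF / (MTBF + MTTR), repair time only."""
    return mtbf_h / (mtbf_h + mttr_h)

def operational_availability(uptime_h: float, operating_cycle_h: float) -> float:
    """Ao = Uptime / Operating Cycle, including logistics and administrative time."""
    return uptime_h / operating_cycle_h

# Assumed figures: 50,000 h MTBF, 2 h repair time, 20 h of total downtime in a year.
print(intrinsic_availability(50_000, 2))        # ~0.99996
print(operational_availability(8_740, 8_760))   # ~0.99772
```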

This is the availability that the customer actually experiences. It is essentially the a posteriori availability, based on actual events that happened to the system.

Classification

A quite common way to classify a system in terms of Availability consists of counting the number of 9s in its availability figure.

The following table defines the availability classes:

Class   Type of Availability   Availability (%)   Downtime per Year   Number of Nines
1       Unmanaged              90                 36.5 days           1-nine
2       Managed                99                 3.65 days           2-nines
3       Well Managed           99.9               8.76 hours          3-nines
4       Tolerant               99.99              52.6 minutes        4-nines
5       High Availability      99.999             5.26 minutes        5-nines
6       Very High              99.9999            30.0 seconds        6-nines
7       Ultra High             99.99999           3 seconds           7-nines

For example, a system that has a five-nines availability rating is 99.999% available, with a system downtime of approximately 5.26 minutes per year.
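The "Downtime per Year" column follows directly from the availability percentage. A small sketch (not part of the STN) that reproduces it, based on 8,760 hours per year (exact arithmetic gives about 31.5 s and 3.2 s for the last two classes, slightly different from the rounded table values):

```python
HOURS_PER_YEAR = 8_760

def downtime_per_year(availability_pct: float) -> str:
    """Return the yearly downtime implied by an availability percentage."""
    downtime_h = (1 - availability_pct / 100) * HOURS_PER_YEAR
    if downtime_h >= 24:
        return f"{downtime_h / 24:.2f} days"
    if downtime_h >= 1:
        return f"{downtime_h:.2f} hours"
    if downtime_h * 60 >= 1:
        return f"{downtime_h * 60:.2f} minutes"
    return f"{downtime_h * 3600:.1f} seconds"

for a in (90, 99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    print(f"{a:>9}% -> {downtime_per_year(a)}")
```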

Reliability

Definition

A fundamental associated metric is Reliability. Return to the example of your telephone line. If the wired network is very old, having suffered from many years of lack of maintenance, it may frequently be out of order. Even if the current maintenance personnel are doing their best to repair it in a minimum time, the line can be said to have poor reliability if, for example, it has experienced ten losses of communication during the last year. Notice that Reliability necessarily refers to a given time interval, typically one year. Therefore, Reliability accounts for the absence of shutdowns of a system in operation over a given time interval.

As with Availability, we consider Reliability in terms of perspective (a prediction), and within the domain of probability.

Mathematics Basics

In many situations a detected disruption fortunately does not mean the end of a device's life. This is usually the case for the automation and control systems being discussed, which are repairable entities. As a result, the ability to predict the number of shutdowns due to a detected disruption over a specified period of time is useful to estimate the budget required for the replacement of inoperative parts.

In addition, knowing this figure can help you maintain an adequate inventory of spare parts. Put simply, the question "Will a device work for a particular period?" can only be answered as a probability; hence the concept of Reliability.

According to the MIL-STD-721C standard, the Reliability R(t) of a given system is the probability of that system performing its intended function under stated conditions for a stated period of time. As an example, a system featuring 0.9999 reliability over a year has a 99.99% probability of functioning properly throughout an entire year.

Note: Reliability is systematically indicated for a given period of time, for example one year.

Referring to the system model we considered with the "bathtub curve," one characteristic is its constant Failure Rate during the useful life. In that portion of its lifetime, the Reliability of the considered system follows an exponential law, given by the following formula: R(t) = e^(-λt), where λ stands for the Failure Rate.

The following figure illustrates the exponential law R(t) = e^(-λt):

As shown in the diagram, Reliability starts with a value of 1 at time zero, which represents the moment the system is put into operation. Reliability then falls gradually towards zero, following the exponential law. Therefore, Reliability is about 37% at t = 1/λ. As an example, assume a given system experiences an average of 0.5 inoperative states per 1-year time unit. The exponential law indicates that such a system would have about a 37% chance of still being in operation when reaching 1 / 0.5 = 2 years of service.
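A quick numeric check of this statement (a sketch, not part of the STN):

```python
import math

failure_rate = 0.5            # inoperative states per year, as in the example above
t = 1 / failure_rate          # t = 1 / lambda = 2 years of service
print(math.exp(-failure_rate * t))   # ~0.3679, i.e. about a 37% chance of remaining in operation
```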

Note: Considering the flat portion of the bathtub curve model, where the Failure Rate is constant over time and remains the same for a unit regardless of the unit's age, the system is said to be "memoryless."

Reliability versus Availability

Reliability is one of the factors influencing Availability, but it must not be confused with Availability: 99.99% Reliability does not mean 99.99% Availability. Reliability measures the ability of a system to function without interruption, while Availability measures the ability of this system to provide a specified application service level. Higher reliability reduces the frequency of inoperative states, thereby increasing overall Availability.

There is a difference between Hardware MTBF and System MTBF. The mean time between hardware component failures occurring on an I/O Module, for example, is referred to as the Hardware MTBF. The mean time between failures occurring on a system considered as a whole, a PLC configuration for example, is referred to as the System MTBF.

As will be demonstrated, hardware component redundancy provides an increase in the overall System MTBF, even though the individual component's MTBF remains the same.

Although Availability is a function of Reliability, it is possible for a system with poor Reliability to achieve high Availability. For example, consider a system that averages 4 failures a year, and for each failure the system can be restored with an average outage time of 1 minute. Over that period, MTBF is 131,400 minutes (4 minutes of downtime out of 525,600 minutes per year).

In that one-year period:

• Reliability: R(t) = e^(-λt) = e^(-4) = 1.83%, a very poor Reliability.

• (Inherent) Availability: Ai = MTBF / (MTBF + MTTR) = 131,400 / (131,400 + 1) = 99.99924%, a very good Availability.
Reliability Block Diagrams (RBD)

Series-Parallel Systems

Based on basic probability computation rules, RBDs are simple, convenient tools to represent a system and its components in order to determine the Reliability of the system. The target system, for example a PLC rack, must first be interpreted in terms of series and parallel arrangements of elementary parts.

The following figure shows a representation of a serial architecture:

As an example, assume that one of the 5 modules (1 power supply and 4 other modules) that populate the PLC rack becomes inoperative. As a consequence, the entire rack is affected, as it is no longer 100% capable of performing its assigned mission, regardless of which module is inoperative. Thus each of the 5 modules is considered a participant member of a 5-part series.

Note: When considering Reliability, two components are described as in series if both are necessary to perform a given function.

The following figure shows a representation of a parallel architecture:

As an example, assume that the PLC rack now contains redundant Power Supply modules, in addition to the 4 other modules. If one Power Supply becomes inoperative, the other supplies power for the entire rack. These 2 power supplies would be considered a parallel sub-system, in turn coupled in series with the sequence of the 4 other modules.

Note: Two components are in parallel, from a reliability standpoint, when the system works if at least one of the two components works. In this example, Power Supply 1 and 2 are said to be in active redundancy. The redundancy would be described as passive if one of the parallel components is turned on only when the other is inoperative, for example in the case of auxiliary power generators.

Serial RBD

Reliability

Serial system Reliability is equal to the product of the individual elements' reliabilities:

R_S(t) = R_1(t) × R_2(t) × R_3(t) × ... × R_n(t) = ∏(i=1..n) R_i(t)

where: R_S(t) = System Reliability
       R_i(t) = Individual Element Reliability
       n = Total number of elements in the serial system

Assuming constant individual Failure Rates:

R_S(t) = R_1(t) × R_2(t) × R_3(t) × ... × R_n(t) = e^(-λ1·t) × e^(-λ2·t) × e^(-λ3·t) × ... × e^(-λn·t) = e^(-(λ1 + λ2 + λ3 + ... + λn)·t)

That is, λ_S = Σ(i=1..n) λ_i: the equivalent Failure Rate for n serial elements is equal to the sum of the individual Failure Rates of these elements, with R_S(t) = e^(-λ_S·t).

Example 1:

Consider a system with 10 elements, each of them required for the proper operation of the system, for example a 10-module rack. Determine R_S(t), the Reliability of that system over a given time interval t, if each of the considered elements shows an individual Reliability R_i(t) of 0.99:

R_S(t) = ∏(i=1..10) R_i(t) = (0.99) × (0.99) × ... × (0.99) = (0.99)^10 = 0.9044

Thus the targeted system Reliability is 90.44%.

Example 2:

Consider two elements in series with the following failure rates:

Element 1: λ1 = 120 × 10⁻⁶ h⁻¹
Element 2: λ2 = 180 × 10⁻⁶ h⁻¹

The System Reliability over a 1,000-hour mission is:

λ_S = Σ(i=1..n) λ_i = λ1 + λ2 = 120 × 10⁻⁶ + 180 × 10⁻⁶ = 300 × 10⁻⁶ = 0.3 × 10⁻³ h⁻¹

R_S(1,000 h) = e^(-λ_S·t) = e^(-0.3 × 10⁻³ × 10³) = e^(-0.3) = 0.7408

Thus the targeted system Reliability is 74.08%.
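Both serial examples can be reproduced with a short script (a sketch, not part of the STN):

```python
import math
from functools import reduce

def serial_reliability(reliabilities):
    """R_S(t): product of the individual reliabilities."""
    return reduce(lambda a, b: a * b, reliabilities, 1.0)

def serial_failure_rate(failure_rates):
    """lambda_S: sum of the individual failure rates (constant-rate assumption)."""
    return sum(failure_rates)

print(serial_reliability([0.99] * 10))            # Example 1: ~0.9044
lam_s = serial_failure_rate([120e-6, 180e-6])     # Example 2: 0.3e-3 per hour
print(math.exp(-lam_s * 1_000))                   # ~0.7408 over a 1,000 h mission
```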

Availability

Serial system Availability is equal to the product of the individual elements' Availabilities:

A_S = A1 × A2 × A3 × ... × An = ∏(i=1..n) A_i

where: A_S = system (asymptotic) availability
       A_i = individual element (asymptotic) availability
       n = total number of elements in the serial system

Calculation Example

In this example, we calculate the availability of a PAC Station using shared distributed I/O Islands. The following illustration shows the final configuration:

This calculation applies the equations given by basic probability analysis, implemented in a spreadsheet. These are the formulas applied in the spreadsheet:

Failure rate: λ = 1 / MTBF
Reliability: R(t) = e^(-λt)
Total serial system Failure Rate: λ_S = Σ(i=1..n) λ_i
Total serial system MTBF = 1 / λ_S
Availability = MTBF / (MTBF + MTTR)
Unavailability = 1 − Availability
Unavailability over a year: Unavailability × hours (one year = 8,760 hours)
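These spreadsheet formulas can be expressed as a short script; note that the module MTBF values below are hypothetical placeholders, not the figures used in the Schneider Electric spreadsheet:

```python
MTTR = 2.0                      # hours, assumed repair time
HOURS_PER_YEAR = 8_760

module_mtbf = {                 # hypothetical per-module MTBF values, in hours
    "power supply": 1_200_000,
    "CPU": 800_000,
    "Ethernet module": 900_000,
    "I/O island": 500_000,
}

lambda_s = sum(1 / mtbf for mtbf in module_mtbf.values())   # serial failure rates add up
system_mtbf = 1 / lambda_s
availability = system_mtbf / (system_mtbf + MTTR)
unavailability_h = (1 - availability) * HOURS_PER_YEAR

print(f"System MTBF      : {system_mtbf:,.0f} h")
print(f"Availability     : {availability:.6%}")
print(f"Downtime per year: {unavailability_h * 60:.1f} minutes")
```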

The following table shows the method used to perform the calculation:

Step   Action
1      Perform the calculation for the standalone CPU rack.
2      Perform the calculation for a distributed island.
3      Based on the serial structure, add up the results from Steps 1 and 2.

Note: A common variant of in-rack I/O Module stations are I/O Islands distributed on an Ethernet communication network. Schneider Electric offers a versatile family named Advantys STB, which can be used to define such architectures.

Step 1: Calculation linked to the Standalone CPU Rack.

The following figures represent the Standalone CPU Rack:

The following screenshot shows the spreadsheet corresponding to this analysis.

Step 2: Calculation linked to the STB Island.

The following figures represent the Distributed I/O on the STB Island:

The following screenshot shows the spreadsheet corresponding to this analysis:

Step 3: Calculation of the entire installation. Assume that the communication network used to link the I/O Islands to the CPU is not taken into account in the Reliability metrics (network reliability is explored further in a subsequent chapter).

The following figures represent the final distributed architecture:

The following screenshot shows the spreadsheet corresponding to the entire analysis:

Note: The highlighted values were calculated in the two previous steps.

Considering the results of this Serial System (Rack #1 + Islands #1 ... #4), Reliability over one year is approximately 82% (the probability that this system will encounter a failure during one year is approximately 18%).

The System MTBF itself is approximately 44,000 hours (about 5 years).

Considering Availability, with a 2-hour Mean Time To Repair (typical of a very good logistics and maintenance organization), the system would achieve a 4-nines Availability, corresponding to an average of approximately 24 minutes of downtime per year.

Parallel RBD

Reliability

Probability theory provides the expression of the Reliability of a Parallel System (for example, a Redundant System), with Q_i(t) (unreliability) being the complement of R_i(t), that is Q_i(t) = 1 − R_i(t):

R_Red(t) = 1 − [Q1(t) × Q2(t) × Q3(t) × ... × Qn(t)] = 1 − ∏(i=1..n) Q_i(t) = 1 − ∏(i=1..n) (1 − R_i(t))

with: R_Red = Reliability of the simple redundancy system
      ∏(i=1..n) Q_i = Probability of failure of the system
      R_i = Probability of non-failure of an individual parallelized element
      Q_i = Probability of failure of an individual parallelized element
      n = Total number of parallelized elements

Example:

Consider two elements in parallel with the following failure rates:

Element 1: λ1 = 120 × 10⁻⁶ h⁻¹
Element 2: λ2 = 180 × 10⁻⁶ h⁻¹

The System Reliability over a 1,000-hour mission is calculated as follows.

Reliability of elements 1 and 2 over the 1,000-hour period:

R1(1,000 h) = e^(-λ1·t) = e^(-120 × 10⁻⁶ × 10³) = 0.8869
R2(1,000 h) = e^(-λ2·t) = e^(-180 × 10⁻⁶ × 10³) = 0.8353

Unreliability of elements 1 and 2 over the 1,000-hour period:

Q1(1,000 h) = 1 − R1(1,000 h) = 1 − 0.8869 = 0.1131
Q2(1,000 h) = 1 − R2(1,000 h) = 1 − 0.8353 = 0.1647

Redundant System Reliability over the 1,000-hour period:

R12(t = 1,000 h) = 1 − [Q1(1,000 h) × Q2(1,000 h)] = 1 − (0.1131 × 0.1647) = 0.9814

Thus, with individual element Reliabilities of 88.69% and 83.53% respectively, the targeted Redundant System Reliability is 98.14%.
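The parallel formula and this example can be reproduced as follows (a sketch, not part of the STN):

```python
import math

def parallel_reliability(reliabilities):
    """R_Red(t) = 1 - product of the individual unreliabilities (1 - R_i)."""
    q_product = 1.0
    for r in reliabilities:
        q_product *= (1.0 - r)
    return 1.0 - q_product

t = 1_000                                   # mission time, hours
r1 = math.exp(-120e-6 * t)                  # ~0.8869
r2 = math.exp(-180e-6 * t)                  # ~0.8353
print(parallel_reliability([r1, r2]))       # ~0.9814
```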

Availability

Parallel system Unavailability is equal to the product of the individual elements' Unavailabilities. Thus the Parallel system Availability, given the individual parallelized elements' Availabilities, is:

A_S = 1 − [(1 − A1) × (1 − A2) × ... × (1 − An)] = 1 − ∏(i=1..n) (1 − A_i)

where: A_S = system (asymptotic) availability
       A_i = individual element (asymptotic) availability
       n = total number of elements in the parallel system

Calculation Example

To illustrate a Parallel system, we perform the calculation of the Availability of a redundant PLC system using shared distributed I/O Islands.

The following illustration shows the final configuration:

The formulas are the same as those used in the previous calculation example, except for the calculation of the reliability of the parallel sub-system, which is as follows:

R_Red(t) = 1 − [Q1(t) × Q2(t) × Q3(t) × ... × Qn(t)] = 1 − ∏(i=1..n) Q_i(t) = 1 − ∏(i=1..n) (1 − R_i(t))

The following table shows the method used to perform the calculation:

Step   Action
1      Perform the calculation for a standalone CPU.
2      Perform the calculation for the redundant structure, here the two CPUs.
3      Perform the calculation for a distributed island.
4      Concatenate the results from Steps 1, 2 and 3.

Note: The previous results from the serial analysis, regarding the calculation linked to the standalone elements, are reused.
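The four steps amount to combining one parallel block with several serial blocks. A sketch of that combination (not part of the STN), using hypothetical availability figures rather than the spreadsheet values:

```python
def serial_availability(avails):
    a = 1.0
    for x in avails:
        a *= x                  # series: availabilities multiply
    return a

def parallel_availability(avails):
    u = 1.0
    for x in avails:
        u *= (1.0 - x)          # parallel: unavailabilities multiply
    return 1.0 - u

a_cpu_rack = 0.999994           # hypothetical standalone CPU rack availability (Step 1)
a_island   = 0.999990           # hypothetical STB island availability (Step 3)

a_redundant_cpus = parallel_availability([a_cpu_rack, a_cpu_rack])      # Step 2
a_system = serial_availability([a_redundant_cpus] + [a_island] * 4)     # Step 4
print(f"System availability: {a_system:.7%}")
```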

Step 1: Calculation linked to the Standalone CPU Rack.

Because the analysis is identical to that for the serial case, the following screenshot shows only the final results of the corresponding spreadsheet:

Step 2: Calculation of the redundant CPU group.

The following figures represent the redundant CPUs:

The following screenshot shows the spreadsheet corresponding to this analysis:

Step 3: Calculation linked to the STB Island.

Because the analysis is identical to that for the serial case, the following screenshot shows only the final results of the corresponding spreadsheet:

Step 4: Calculation of the entire installation.

The following screenshot shows the spreadsheet corresponding to the entire analysis:

Note: The only difference between this architecture and the previous one relates to the CPU rack: this one is redundant (Premium Hot Standby), while the former was standalone.

Looking at the results of this Parallel System (Premium CPU rack redundancy), Reliability over one year would be approximately 99.9%, compared to 97.4% with a standalone Premium CPU rack (i.e. the probability for the Premium rack sub-system to encounter a failure during one year is reduced from 2.6% to 0.1%).

The System MTBF itself would increase from 335,000 hours (approximately 38 years) to 503,000 hours (approximately 57 years).

For System Availability, with a 2-hour Mean Time To Repair, the result is approximately a 9-nines Availability (almost 100%).

Note: Other calculation examples are available in the Calculation Examples chapter.

Note: The previous examples cover the PAC station only. To extend the calculation to a whole system, the MTBF of the network components and the SCADA systems (PCs, servers) must be taken into account.

Conclusion

Serial System

The above computations demonstrate that the combined availability of two components in series is always lower than the availability of each individual component.

The following table gives an example of combined availability in a serial system:

Component            Availability        Downtime
X                    99% (2-nines)       3.65 days/year
Y                    99.99% (4-nines)    52 minutes/year
X and Y combined     98.99%              3.69 days/year

This table indicates that even though a very high availability Part Y was used, the overall availability of the system was reduced by the low availability of Part X. A common saying indicates that "a chain is as strong as its weakest link"; in this instance, however, a chain is actually "weaker than its weakest link."

Parallel System

The above computations indicate that the combined availability of two components in parallel is always higher than the availability of each individual component.

The following table gives an example of combined availability in a parallel system:

Component                                   Availability          Downtime
X                                           99% (2-nines)         3.65 days/year
Two X components operating in parallel      99.99% (4-nines)      52 minutes/year
Three X components operating in parallel    99.9999% (6-nines)    31 seconds/year

This indicates that even though a very low availability Part X was used, the overall availability of the system is much higher. Thus redundancy provides a very powerful mechanism for building a highly reliable system from low reliability components.
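Both conclusion tables follow from the serial and parallel availability formulas; a sketch (not part of the STN) that reproduces them:

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(avail: float) -> float:
    return (1 - avail) * MINUTES_PER_YEAR

a_x, a_y = 0.99, 0.9999
serial_xy   = a_x * a_y               # series: availabilities multiply
parallel_2x = 1 - (1 - a_x) ** 2      # parallel: unavailabilities multiply
parallel_3x = 1 - (1 - a_x) ** 3

print(f"X and Y in series  : {serial_xy:.4%}, {downtime_minutes(serial_xy) / 1440:.2f} days/year")
print(f"Two X in parallel  : {parallel_2x:.4%}, {downtime_minutes(parallel_2x):.1f} minutes/year")
print(f"Three X in parallel: {parallel_3x:.6%}, {downtime_minutes(parallel_3x) * 60:.1f} seconds/year")
```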

High Availability with Collaborative Control System

In an automation system, how can you reach the level of availability required to keep the plant in operation? By what means should you reinforce the system architecture, providing and maintaining access to the information required to monitor and control the process?

This chapter provides answers to these questions, and reviews the system architecture from top to bottom, that is, from operator stations and data servers (Information Management Level) to controllers and devices (Control System Level), via communication networks (Communication Infrastructure Level).

The following figure illustrates the Collaborative Control System:

The system architecture drawing above shows various redundancy capabilities that can be proposed:

• Dual SCADA Vijeo Citect servers
• Dual Ethernet control network managed by ConneXium switches
• Premium Hot Standby with Ethernet device bus
• Quantum Hot Standby with Remote I/O, Profibus and Ethernet
• Quantum Safety controller Hot Standby with Remote I/O


Information Management Level

Redundancy Level

This section explains how to implement various technological means to address architecture challenges, with examples using a current client/server paradigm. The section describes:

• A method to define a redundancy level
• The most appropriate system architecture for the defined level

Key Features

The key features that a Vijeo Citect SCADA system has to handle relate to:

• Data acquisition
• Events and alarms (including time stamping)
• Measurements and trends
• Recipes
• Reports
• Graphics (displays and user interfaces)

In addition, any current SCADA package provides access to a treatment/calculation module (openness module), allowing users to edit program code (IML/CML, Cicode, VB Script, etc.).

Note: This model is applicable to a single station (PC), including small applications. The synthesis of the stakes and the key features will help to determine the most appropriate redundant solution.

Stakes

Considering the previously defined key features, the stakes when designing a SCADA system include:

• Is a changeover possible?
• Does a time limit exist?
• What are the critical data?
• Has cost reduction been considered?

Risk Analysis

Linked to the previous stakes, the risk analysis is essential to defining the redundancy level. Consider the events the SCADA system will face, i.e. the risks, in terms of the following:

• Inoperative hardware
• Inoperative power supply
• Environmental events (natural disaster, fire, etc.)

These events can imply loss of data, operator screens, connection with devices, and so on.

Level Definition

Finally, the redundancy level is defined as the compilation of the key features, the stakes, and the risk analysis, together with the customer expectations related to the data criticality level. The following table illustrates the flow from the process analysis to the redundancy level:


The following table explains the redundancy levels:

Redundancy Level   State of the standby system                                            Switchover performance
No redundancy      No standby system                                                      Not applicable
Cold Standby       The standby system is powered on only if the default system            Several minutes; large amount of lost data
                   becomes inoperative.
Warm Standby       The standby system switches from normal to backup mode.                Several seconds; small amount of lost data
Hot Standby        The standby system runs together with the default system.              Transparent; no lost data

Architecture Redundancy Solutions Overview

This section examines various redundancy solutions. A Vijeo Citect SCADA system can be redundant at the following levels:

• Clients, i.e. operator stations
• Data servers
• Control and information network
• Targets, i.e. PAC stations, controllers, devices

The Vijeo Citect functional organization corresponds directly to a Client/Server philosophy. An example of a Client/Server topology is shown in the diagram to the left: a single Display Client Operator Station with a single I/O Server, in charge of device data (PLC) communication.

Vijeo Citect architecture is a combination of several operational entities that handle Alarms, Trends and Reports, respectively. In addition, this functional architecture includes at least one I/O Server. The I/O Server acts as a Client to the peripheral devices (PAC) and as a Server to the Alarms, Trends and Reports (ATR) entities.

As shown in the figure above, ATR and I/O Server(s) act either as a Client or as a Server, depending on the designated relationship. The default mechanism linking these Clients and Servers is based on a Publisher/Subscriber relation.

As shown in the following screenshot, because of its client/server model, Vijeo Citect can create a dedicated server, depending on the application requirements: for example for ATR, or for the I/O server:


Clients: Operator Workstations

Vijeo Citect is able to manage redundancy at the operator station level with several client workstations. These stations can be located in the control room or distributed in the plant close to the critical parts of the process. A web client interface can also be used to monitor and control the plant using a standard web browser.

If an operator station becomes inoperative, the plant can still be monitored and controlled using an additional operator screen.

Servers: Resource Duplication Solution

This example assumes a Field Devices Communication Server providing services to several types of Clients, such as Alarm Handling, Graphic Display, etc.

A first example of redundancy is a complete duplication of the first Server. Basically, if a system becomes inoperative, for example Server A, Server B takes over the job and responds to the service requests presented by the Clients.
36


<strong>High</strong> <strong>Availability</strong> with Collaborative Control System<br />

Vijeo Citect can define Primary and Standby servers within a project, with each<br />

element of a pair being held by different hardware.<br />

The first level of redundancy duplicates the I/O<br />

server or the ATR server, as shown in the<br />

illustration. In this case, a Standby server is<br />

maintained in parallel to the Primary server. In the<br />

event of a detected interruption on the hardware,<br />

the Standby server will assume control of the<br />

communication with the devices.<br />

Based on data criticality, the second level<br />

duplicates all the servers, ATR and I/O. Identical data is maintained on both servers.<br />

In the diagram, Primary and Standby I/O servers are deployed independently, while Alarms, Trends and Reports servers run as separate processes on common Primary and Standby computers.

Network: Data Path Redundancy

Data path redundancy involves not alternative devices, but alternative data paths between the I/O Server and the connected I/O Devices. Thus if one data path becomes inoperative, the other is used.

Note: Vijeo Citect reconnects through the primary data path when it is returned to service.

On a larger Vijeo Citect system, you can also use data path redundancy to maintain device communications through multiple I/O Server redundancy, as shown in the following diagram.

Redundant LAN

As previously indicated, redundancy of Alarms, Reports, Trends, and I/O Servers is achieved by adding standby servers. Vijeo Citect can also use the dual end points (or multiple network interfaces) potentially available on each server, enabling the specification of a complete and unique network connection between a Client and a Server.

Target: I/O Device

A given I/O Server is able to handle designated pairs of Devices, Primary and Standby. This Device Redundancy does not rely on a PLC Hot Standby mechanism: Primary and Standby devices are assumed to be acting concurrently on the same process, but no assumption is made concerning the relationship between the two devices. Seen from the I/O Server, this redundancy simply offers access to an alternate device in case the first device becomes inoperative.

An extension of I/O Device Redundancy provides for more than one Standby I/O Device. Depending on the user configuration, a given order of priority applies when an I/O Server (potentially a redundant one) needs to switch to a Standby I/O Device. For example, in the figure above, I/O Device 3 would be allotted the highest priority, then I/O Device 2, and finally I/O Device 4.

In those conditions, in case of a detected interruption occurring on Primary I/O Device 1, a switchover would take place, with I/O Server 2 handling communications with Standby I/O Device 3. If an interruption is then detected on I/O Device 3, a new switchover would take place, with I/O Server 1 handling communications with Standby I/O Device 2. Finally, if there is an interruption on I/O Device 2, another switchover would take place, with I/O Server 2 handling communications with Standby I/O Device 4.

Clustering

Refer again to the diagram explaining the Redundancy basics, and examine the functional representation as a cluster.

A cluster may contain several, possibly redundant, I/O Servers (maximum of one per machine), and standalone or redundant ATR servers; these latter servers can be implemented either on a common machine or on separate machines.

The cluster concept offers a response to a typical scenario of a system separated into several sites, with each of these sites being controlled by local operators and supported by local redundant servers. The clustering model can concurrently address an additional level of management that requires all sites across the system to be monitored simultaneously from a central control room.

With this scenario, each site is represented by a separate cluster grouping its primary and standby servers. Clients on each site are interested only in the local cluster, whereas clients at the central control room are able to view all clusters.


Based on cluster design, each site can then be addressed independently within its<br />

own cluster. As a result, deployment of a control room scenario is fairly<br />

straightforward, with the control room itself only needing display clients.<br />

The cluster concept does not actually provide an additional level of redundancy.<br />

Regarding data criticality, clustering organizes servers, and consequently provides<br />

additional flexibility.<br />

Conclusion

Each cluster contains only one pair each of ATR servers. Those pairs of servers, redundant to each other, must be on different machines.

Each cluster can contain an unlimited number of I/O servers; those servers must also be on different machines, which increases the level of system availability.

The following illustration shows a complete installation, in which the redundant solutions previously discussed can be identified:

• SCADA Clients
• Data Servers
• Control Network
• Targeted Devices

Communication Infrastructure Level<br />

The previous section reviewed various aspects of enhanced availability at the<br />

Information Management level, focusing on SCADA architecture, represented by<br />

Vijeo Citect. This section covers <strong>High</strong> <strong>Availability</strong> concerns between the Information<br />

level and the Control Level.<br />

A proper design at the communication infrastructure level must include:<br />

• Analysis of the plant topology<br />

• Localization of the critical process steps<br />

• The definition of network topologies<br />

• The appropriate use of communication protocols<br />

Plant Topology

The first step of the communication infrastructure level definition is the plant topology<br />

analysis. From this survey, the goal is to gather information to develop a networking<br />

system diagram, prior to defining the network topologies.<br />

This plant topology analysis must be done as a top-down process:<br />

• Break-down of the plant in selected areas<br />

• Localization of the areas to be connected<br />

• Localization of the hazardous areas<br />

• Localization of the station and the nodes included in these areas to be connected<br />

• Localization of the existing networks & cabling paths, in the event of expansion or<br />

redesign<br />

Before defining the network topologies, the following project requirements must be<br />

considered:<br />

• <strong>Availability</strong> expectation, according to the criticality of the process or data.<br />

• Cost constraints<br />

• Operator skill level<br />

From the project and the plant analyses, identify the most critical areas:

(Diagram: the project requirements and the plant topology analysis are combined to identify the most critical areas)

Network Topology

Following the criticality analysis, the networking diagram can be defined by selecting the relevant network topology. The following table describes the main topologies from which to choose:

Architecture | Limitations | Advantages | Disadvantages
Bus | The traffic must flow serially, therefore the bandwidth is not used efficiently | Cost-effective solution | If a switch becomes inoperative, the communication is lost
Star | Cable ways and distances | Efficient use of the bandwidth, as the traffic is spread across the star; preferred topology when there is no need of redundancy | If the main switch becomes inoperative, the communication is lost
Ring | Behavior quite similar to Bus | Auto-configuration if used with a self-healing protocol; possible to couple other rings to increase redundancy | The auto-configuration depends on the protocol used

Note: These different topologies can be mixed to define the plant network diagram.<br />

Ring Topology

The following diagram shows the level of availability provided by each topology.

In automation architectures, Ring (and Dual Ring) topologies are the most commonly used to increase the availability of a system. Mesh architecture is not used in process applications; therefore we will not discuss it in detail. All of these topologies can be implemented using Schneider Electric ConneXium switches.

In a ring topology, four events can occur that lead to a loss of communication:
1) Broken ring line
2) Inoperative ring switch
3) Inoperative end-of-line device
4) Inoperative end-of-line device switch

The following diagram illustrates these four occurrences.

To protect the network architecture from these events, several communication protocols are proposed; they are described in the following section.

A solution able to enhance networking availability while preserving budget considerations consists of reducing the limitations of an Ethernet bus network: the network is distributed as a bus, but built as a ring. At least one specific active component is necessary: a network switch usually named the Redundancy Manager (RM). Consider an Ethernet loop designed with such an RM switch. In normal conditions, this RM switch opens the loop, which prevents Ethernet frames from circulating endlessly.


If a break occurs, the<br />

Redundancy Manager<br />

switch reacts immediately,<br />

and closes the Ethernet<br />

loop, bringing the network<br />

back to full operating<br />

condition.<br />

Note that the term Self<br />

Healing Ring concerns the<br />

ring management only;<br />

once cut, the cable is not<br />

able to repair itself.<br />


A mix of Dual Networking and Network Redundancy is possible. Note that in such a<br />

design, a SCADA I/O Server has to be equipped with two communication boards, and<br />

reciprocally, each device (PLC) has to be allotted two Ethernet ports.<br />

Redundant Coupling of a Ring Network or a Network Segment<br />

Topological considerations may lead to a network layout aggregating satellite rings or segments around a backbone network (itself designed as a ring or as a segment).

This may be an effective design, considering the junction between trunk and satellites, especially if the backbone and satellite networks have been designed as ring networks to provide for High Availability.

With the ConneXium product line, Schneider Electric offers switches that can provide such a redundant coupling. Several variations allow connection to the satellite network; each variation features two "departure" switches on the backbone network, each departure switch crossing a separate link to access the satellite network. These variations include:

1. A single pivot "arrival" switch on the connected network
2. Two different arrival switches on the connected network, the link synchronization making use of the backbone and satellite networks
3. Two different arrival switches on the connected network, the link synchronization being carried via an additional specific link established between the two arrival switches on the connected network

The following drawing illustrates this architecture:

(Diagram: redundant coupling of a ring network, showing the RM switches, the ETY modules, the sync links and the redundant line)

Ring coupling capabilities increase the level of networking availability by allowing different paths to access the targeted devices. The new generation of Schneider Electric ConneXium switches supports many architectures based on dual rings: a single switch is now able to couple two Ethernet rings, extending the capabilities of the Ethernet architecture.

Dual Ring in One Switch<br />

The following illustration shows the architecture, which allows the combination of two rings managed by a unique switch.

(Diagram: two MRP / RSTP rings coupled by a single switch performing the ring coupling, connecting Premium Hot Standby CPUs with their sync link and ETY Ethernet modules)


Dual Ring in Two Switches<br />

The following architecture bypasses the single detected interruption.<br />

Dual Ring Extension in Two Switches<br />

The following architecture allows extension of the main network to other segments:

This sub-ring concept makes it possible to couple segments to existing redundant rings. The devices in the main ring (1) are seen as Sub-Ring Managers (SRM) for the newly connected sub-ring (2).

Daisy Chain Loop Topology<br />

Ethernet "daisy chain" refers to the integration of a switch function inside a communicating device; as a result, this "daisy-chainable" device offers two Ethernet ports, for example one "in" port and one "out" port. The advantage of such a daisy-chainable device is that its installation inside an Ethernet network requires only two cables.

In addition, a daisy-chain layout can correspond to either a network segment or a network loop; a managed ConneXium switch, featuring an RM capability, is able to handle such a daisy-chained loop.

The first daisy-chainable devices Schneider Electric plans to offer are:
• Advantys STB dual port Ethernet communication adapter (STB NIP 2311)
• Advantys ETB IP67 dual port Ethernet
• Motor controller TeSys T
• Variable speed drive ATV 61/71 (VW3A3310D)
• PROFIBUS DP V1 Remote Master
• ETG 30xx Factorycast gateway

Note: Assuming no specific redundancy protocol is selected to handle the daisy<br />

chain loop, expected loop reconfiguration time on failover is approximately one<br />

second.<br />

(Diagram: daisy-chain loops combined with ring coupling, showing redundant Primary/Standby SCADA servers, RM switches, and Premium Hot Standby CPUs with ETY modules on MRP / RSTP rings, linked by a redundant line)

Daisy chaining topologies can be coupled to dual Ethernet rings using TCSESM<br />

ConneXium switches.<br />

Redundancy Communication Protocols<br />

The management of an Ethernet ring requires dedicated communication protocols, as described in the following table. Each protocol is characterized by different performance criteria in terms of fault detection and global system recovery time.

Rapid Spanning Tree Protocol (RSTP)<br />

RSTP stands for Rapid Spanning Tree Protocol (IEEE 802.1w standard) 1 . Based on<br />

STP, RSTP has introduced some additional parameters that must be entered during<br />

the switch configuration. These parameters are used by the RSTP protocol during the<br />

path selection process; because of these, the reconfiguration time is much faster than<br />

with STP (typically less than one second).<br />

1 The new edition of the 802.1D standard, IEEE 802.1D-2004, incorporates IEEE 802.1t-2001 and IEEE<br />

802.1w standards<br />

Protocol | Mix of switch vendors in the network (Cisco / 3Com / Hirschmann) | Recovery Time
MRP (part of IEC 62439 FDIS) | Yes | Less than 200 ms; 50 switches max
Rapid Spanning Tree (RSTP) | Yes | Up to 1 s, depending on the number of switches
HIPER-Ring | No | 300 or 500 ms; 50 switches max
Fast HIPER-Ring (new feature, available only on Extended ConneXium switches) | No | 10 ms for 5 switches, plus 160 microseconds for each additional switch


The new release of TCSESM ConneXium switches allows better performance of RSTP management, with a detection time of 15 ms and a propagation time of 15 ms for each switch. Considering a 6-switch configuration, the recovery time is about 105 ms.
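As a rough illustration of how these figures combine, the following Python sketch reproduces the recovery-time arithmetic quoted in this section for RSTP (15 ms detection plus 15 ms propagation per switch) and for Fast HIPER-Ring (10 ms for up to 5 switches, plus 160 microseconds per additional switch). It is only a back-of-the-envelope helper based on the numbers above, not a ConneXium tool.

```python
# Back-of-the-envelope ring recovery-time estimates (illustrative only),
# using the figures quoted in this document.

def rstp_recovery_ms(switches: int) -> float:
    """RSTP on the new TCSESM release: 15 ms fault detection
    plus 15 ms propagation per switch in the ring."""
    return 15.0 + 15.0 * switches

def fast_hiper_ring_recovery_ms(switches: int) -> float:
    """Fast HIPER-Ring: about 10 ms for a 5-switch ring,
    plus 160 microseconds for each additional switch."""
    extra = max(0, switches - 5)
    return 10.0 + 0.160 * extra

print(rstp_recovery_ms(6))              # -> 105.0 ms, as quoted above
print(fast_hiper_ring_recovery_ms(20))  # -> 12.4 ms for a 20-switch ring
```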

HIPER-Ring (Version 1)<br />

Version 1 of the HIPER-Ring networking strategy has been available for<br />

approximately 10 years. It applies to a Self Healing Ring networking layout.<br />

Such a ring structure may include up to 50 switches. It typically features a<br />

reconfiguration delay of 150 milliseconds 2 and a maximum time of 500 ms. As a result,<br />

in case of an issue occurring on a link cable or on one of the switches populating the<br />

ring, the network will take about 150 ms to detect this event, and cause the<br />

Redundancy Manager switch to close the loop.<br />

Note: The Redundancy Manager switch is said to be active when it opens the<br />

network.<br />

HIPER-Ring Version 2 (MRP)

Note: If a recovery time of 500 ms is acceptable, then no switch redundancy configuration is needed; only dip switches have to be set up.

2 When configuring a ConneXium TCS ESM switch for HIPER-Ring V1, the user is asked to choose between a maximum Standard Recovery Time, which is 500 ms, and a maximum Accelerated Recovery Time, which is 300 ms.


MRP is an IEC 62439 industry-standard protocol based on HIPER-Ring. Therefore, all switch manufacturers can implement MRP if they choose to, which allows a mix of different manufacturers in an MRP configuration. Schneider Electric switches support a selectable maximum recovery time of 200 ms or 500 ms and a maximum ring configuration of 50 switches.

TCSESM switches also support redundant coupling of MRP rings. MRP rings could easily be used instead of HIPER-Ring. MRP requires that all switches be configured via web pages and allows for a recovery time of 200 ms or 500 ms. Additionally, the I/O network could be an MRP redundant network and the control network HIPER-Ring, or vice versa.

Fast HIPER-Ring

A new family of ConneXium switches, named TCS ESM Extended, is coming. It will offer a third version of the HIPER-Ring strategy, named Fast HIPER-Ring. Featuring a guaranteed recovery time of less than 10 milliseconds, the Fast HIPER-Ring structure allows both a cost-optimized implementation of a redundant network as well as maintenance and network extension during operation. This makes Fast HIPER-Ring especially suitable for complex applications such as a combined transmission of video, audio and data information.

Selection

To end the communication level section, the following table presents all the communication protocols and helps you select the most appropriate installation for your high availability solution:

Selection Criteria | Solution | Comments
Ease of configuration or installed base | HIPER-Ring | If a recovery time of 500 ms is acceptable, then no switch redundancy configuration is needed; dip switches have to be set up only.
New installation | MRP | All switches are configured via web pages; installation with one MRM (Media Ring Manager) and X MRCs (Media Ring Clients).
Open architecture with multiple vendor switches | RSTP | Reconfiguration time: 15 ms (detected fault) + 15 ms per switch.
Complex architecture | MRP, RSTP or Fast HIPER-Ring | We recommend MRP or RSTP for High Availability with dual ring, and Fast HIPER-Ring for high performance.

Control System Level

Redundancy Principles

Having detailed <strong>High</strong> <strong>Availability</strong> aspects at the Information Management level and at<br />

the Communication Infrastructure level, we will now concentrate on <strong>High</strong> <strong>Availability</strong><br />

concerns at the Control Level. Specific discussion will focus on PAC redundancy.<br />

Modicon Quantum and Premium PAC provide Hot Standby capabilities, and have<br />

several shared principles:<br />

1. The type of architecture is<br />

shared. A Primary unit executes<br />

the program, with a Standby unit<br />

ready, but not executing the<br />

program (apart from the first<br />

section of it). By default, these two units contain an identical application program.<br />

2. The units are synchronized. The standby unit is aligned with the Primary unit.<br />

Also, on each scan, the Primary unit transfers to the Standby unit its "database,”<br />

that is, the application variables (located or not located) and internal data. The<br />

entire database is transferred, except the "Non-Transfer Table", which is a<br />

sequence of Memory Words (%MW). The benefit of this transfer is that, in case of<br />

a switchover, the new Primary unit will continue to handle the process, starting<br />

with updated variables and data values. This is referred to as a "bumpless"<br />

switchover.<br />

3. The Hot Standby redundancy mechanism is controlled via the "Command Register" (accessed through the %SW60 system word); reciprocally, this Hot Standby redundancy mechanism is monitored via the "Status Register" (accessed through the %SW61 system word). As a result, as long as the application creates links between these system words and located memory words, any HMI can receive feedback regarding Hot Standby system operating conditions and, if necessary, address these operating conditions.

4. For any Ethernet port acting as a server (Modbus/TCP or HTTP protocol) on the Primary unit, its IP address is implicitly incremented by one on the Standby unit. In case a switchover occurs, these homothetic addresses are automatically exchanged. The benefit of this feature is that, seen from a SCADA/HMI, the "active" unit is still accessed at the same IP address. No specific adaptation is required at the development stage of the SCADA / HMI application (the addressing rule is illustrated in the short sketch below).

5. The common programming environment that is used with both solutions is Unity<br />

Pro. No particular restrictions apply when using the standardized (IEC 1131-3)<br />

instruction set. In addition, the portion of code specific to the Hot Standby system<br />

is optional, and is used primarily for monitoring purpose. This means that with any<br />

given application, the difference between its implementation on standalone<br />

architecture and its implementation on a Hot Standby architecture is largely<br />

cosmetic.<br />

Consequently, a user familiar with one type of Hot Standby system does not have to<br />

start from scratch if he has to use a second type; initial investment is preserved and<br />

re-usable, and only a few differences must be learned to differentiate the two<br />

technologies.<br />
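The following Python fragment is a minimal sketch of the address convention described in point 4 above (Primary at a given address, Standby at that address plus one, and an exchange on switchover). The addresses used are hypothetical examples; the sketch only models the rule, it is not PLC firmware behaviour.

```python
import ipaddress

# Illustrative model of the Hot Standby addressing rule described above:
# the Standby unit implicitly uses the Primary address incremented by one,
# and the two units exchange addresses on switchover, so clients always
# reach the active (Primary) unit at the same IP address.
primary_ip = ipaddress.ip_address("192.168.1.10")            # hypothetical configured address
standby_ip = ipaddress.ip_address(int(primary_ip) + 1)       # implicit Standby address

unit_a = {"role": "Primary", "ip": primary_ip}
unit_b = {"role": "Standby", "ip": standby_ip}

def switchover(old_primary, old_standby):
    """Exchange roles and IP addresses between the two units."""
    old_primary["role"], old_standby["role"] = "Standby", "Primary"
    old_primary["ip"], old_standby["ip"] = old_standby["ip"], old_primary["ip"]

switchover(unit_a, unit_b)
print(unit_b)   # unit B is now Primary and answers on 192.168.1.10
```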

Premium and Quantum Hot Standby Architectures<br />

Depending on project constraints or customer requirements (performance, installed base or project specifications), a specific Hot Standby PAC station topology can be selected:
- Hot Standby PAC with in-rack I/O or remote I/O
- Hot Standby PAC with distributed I/O, connected on standard Ethernet or connected to another device bus, such as Profibus DP
- Hot Standby PAC mixing different topologies

The following table presents the available configurations with either a Quantum or a Premium PLC:

PLC \ Layer | I/O Bus | Ethernet | Profibus
Quantum | Configuration 1 | Configuration 3 | Configuration 5
Premium | Configuration 2 | Configuration 4 | Not applicable

Note: A sixth configuration may be considered, which combines all other<br />

configurations listed above<br />

In-Rack and Remote I/O Architectures

Quantum Hot Standby: Configuration 1

With Quantum Hot Standby, in-rack I/O modules are located in the remote I/O racks. They are "shared" by both Primary and Standby CPUs, but only the Primary unit actually handles the I/O communications at any given time. In case of a switchover, the control takeover executed by the new Primary unit occurs in a bumpless way (provided that the holdup time parameter allotted to the racks is greater than the communication gap that takes place at the switchover). The module population of a Quantum CPU rack in a Hot Standby configuration is very similar to that of a standalone PLC 3. The only specific requirement is that the CPU module must be a 140 CPU 671 60.

For redundant PAC architecture, both units require two interlinks to execute different<br />

types of diagnostic - to orient the election of the Primary unit, and to achieve<br />

synchronization between both machines. The first of these "Sync Links,” the CPU<br />

Sync Link, is a dedicated optic fiber link anchored on the Ethernet port local to the<br />

CPU module. This port is dedicated exclusively for this use on Quantum Hot Standby<br />

architecture. The second of these Sync Links, Remote I/O Sync Link, is not an<br />

additional one: the Hot Standby system uses the existing Remote I/O medium,<br />

hosting both machines, thus providing them with an opportunity to communicate.<br />

One benefit of the CPU optic fiber port is its inherent capability to have the two units<br />

installed up to 2 km apart, using 62.5/125 multimode optic fiber. The Remote I/O Sync<br />

Link can also run through optic fiber, provided that Remote I/O Communication<br />

Processor modules are coupled on the optic fiber.<br />

3 All I/O modules are accepted on remote I/O racks, except 140 HLI 340 00 (Interrupt module).<br />

Looking at Ethernet adapters currently available, 140 NWM 100 00 communication module is not<br />

compatible with a Hot Standby system. Also EtherNet/IP adapter (140 NOC 771 00) is not compatible with<br />

Quantum Hot Standby in Step 1.<br />


Up to 6 communication modules 4 , such as NOE Ethernet TCP/IP adapters, can be<br />

handled by a Quantum unit; whether it is a standalone unit or part of a Hot Standby<br />

architecture.<br />

Up to 31 Remote I/O stations can be handled from a Quantum CPU rack, whether<br />

standalone or Hot Standby. Note that the Remote I/O service payload on scan time is<br />

approximately 3 to 4 ms per station.<br />

Redundant Device Implementation<br />

Using redundant in-rack I/O modules on<br />

Quantum Hot Standby, and having to<br />

interface redundant sensors/actuators, will<br />

require redundant input / output channels.<br />

Preferably, homothetic channels should be<br />

installed on different modules and different<br />

I/O Stations. For even a simple transfer of<br />

information to both sides of the outputs, the<br />

application must define and implement rules<br />

for selecting and treating the proper input<br />

signals. In addition to information transfer,<br />

the application will have to address<br />

diagnostic requirements.<br />

4 Acceptable communication modules are Modbus Plus adapters, Ethernet TCP/IP adapters, Ethernet/IP<br />

adapters and PROFIBUS DP V1 adapters.<br />


Single Device Implementation<br />

Assuming a Quantum Hot Standby application is required to handle redundant in-rack I/O channels, but without redundant sensors and actuators, special devices are used to handle the associated wiring requirements. Any given sensor signal, either digital or analog, passes through such a dedicated device, which replicates it and passes it on to homothetic input channels. Reciprocally, any given pair of homothetic output signals, either digital or analog, is provided to a dedicated device that selects and transfers the proper signal (i.e. the one taken on the Primary output) to the target actuator.

Depending on the selected I/O Bus technology, a specific layout may result in<br />

enhanced availability.<br />

Dual Coaxial Cable<br />

Coaxial cable can be installed with either a single or a redundant design. With a redundant design, communications are duplicated on both channels, providing a "massive communication redundancy." Either the Remote I/O Processor or the Remote I/O adapters are equipped with a pair of connectors, with each connector attached to a separate coaxial distribution.


Self Healing Optical Fiber Ring<br />

Remote I/O Stations can be installed as terminal nodes of a fiber optic segment or (self-healing) ring. The Schneider catalog offers one model of transceiver (490 NRP 954) applicable for 62.5/125 multimode optic fiber: 5 transceivers maximum, 10 km maximum ring circumference. When the number of transceivers is greater than 5, precise optic fiber ring diagnostics are required, or single mode fiber is required.

Configuration Change on the Fly

A high-level feature is currently being added to Quantum Hot Standby application design: CCTF, or Configuration Change on the Fly. This new feature will allow you to modify the configuration of the existing and running PLC application program without having to stop the PLC. As an example, consider the addition of a new discrete or analog module on a remote Quantum I/O Station. As for a CPU firmware version upgrade executed on a Quantum Hot Standby architecture, this CCTF will be sequentially executed, one unit at a time. This is an obvious benefit for applications that cannot afford any stop, which will now become open to architecture modifications or extensions.

Premium Hot Standby: Configuration 2<br />

Premium Hot Standby is able to<br />

handle in-rack I/O modules, installed<br />

on Bus-X racks 5 . The Primary unit<br />

acquires its inputs, executes the logic,<br />

and updates its outputs. As a result of<br />

cyclical Primary to Standby data<br />

transfer, the Standby unit provides<br />

local outputs that are the image of the outputs decided on the Primary unit. In case of<br />

a switchover, the control takeover executed by the new Primary unit occurs in a<br />

bumpless fashion.<br />

5 Initial version of Premium Hot Standby only authorizes a single Bus-X rack on both units.<br />


The module population of a Premium<br />

CPU rack, in a Hot Standby<br />

configuration, is very similar to that of<br />

a standalone PLC (some restrictions<br />

apply on compatible modules 6 ). Two<br />

types of CPU modules are available:<br />

TSX H57 24M and TSX H57 44M,<br />

which differ mainly in regard to memory and communication resources.<br />

The first of the two Sync Links, the<br />

CPU Sync Link is a dedicated copper<br />

link anchored on the Ethernet port<br />

local to the CPU module. With<br />

Premium Hot Standby architecture, the<br />

second Sync Link, Ethernet Sync Link is established using a standard Ethernet<br />

TCP/IP module 7 . It corresponds to the communication adapter elected as the<br />

"monitored" adapter<br />

6 Counting, motion, weighing and safety modules are not accepted. On the communication side, apart from<br />

Modbus modules TSX SCY 11 601/21 601, only currently available Ethernet TCP/IP modules are accepted.<br />

Also EtherNet/IP adapter (TSX ETC 100) is not compatible with Premium Hot Standby in Step 1.<br />

7 TSX ETY 4103 or TSX ETY 5103 communication module<br />


The following picture illustrates the Premium Ethernet configuration, with the CPU<br />

and the ETY Sync links:<br />

This Ethernet configuration is detailed in the following section.<br />

Redundant Device Implementation<br />

In-rack I/O module implementation on Premium<br />

Hot Standby corresponds by default to a<br />

massive redundancy layout: each input and<br />

each output has a physical connection on both<br />

units. Redundant sensors and actuators do not<br />

require additional hardware. For even a simple<br />

transfer of information to both sides of the outputs, the application must define and<br />

implement rules for selecting and treating the proper input signals. In addition to<br />

information transfer, the application will have to address diagnostic requirements.<br />

Single Device Implementation<br />

Assuming a Premium Hot Standby application<br />

is required to handle redundant in-rack I/O<br />

channels, but without redundant sensors and<br />

actuators, special devices are used to handle<br />

the associated wiring requirements. Any given<br />

sensor signal, either digital or analog, passes<br />

through such a dedicated device, which<br />

replicates it and passes it on to homothetic input channels. Reciprocally, any given<br />

pair of homothetic output signals, either digital or analog, are provided to a dedicated<br />

device that selects and transfers the proper signal (i.e. the one taken on Primary<br />

output) to the target actuator.<br />

Distributed I/O Architectures

Ethernet TCP/IP: Configurations 3 & 4

Schneider Electric has supported the Transparent Ready strategy for several years. In addition to SCADA and HMI, Variable Speed Drives, Power Meters, and a wide range of gateways, Distributed I/O with Ethernet connectivity, such as Advantys STB, is also proposed. In addition, many manufacturers offer devices capable of communicating on Ethernet using the Modbus TCP 8 protocol. These different contributions, building on the Modbus protocol design legacy, have helped make Ethernet a preferred general-purpose communication support for automation architectures.

In addition to Ethernet messaging services solicited through application program<br />

function blocks, a communication service is available on <strong>Schneider</strong> <strong>Electric</strong> PLCs: the<br />

I/O Scanner. The I/O Scanner makes a PLC Ethernet adapter/Copro act as a<br />

Modbus/TCP client, periodically launching a sequence of requests on the network.<br />

These requests correspond to standard Modbus function codes, asking for Registers<br />

(Data Words), Read, Write or Read/Write operations. This sequence is determined by<br />

a list of individual contracts specified in a table defined during the PLC configuration.<br />

The typical target of such a communication contract is an I/O block, hence the name<br />

"I/O Scanner". Also, the I/O Scanner service may be used to implement data<br />

exchanges with any type of equipment, including another PLC, provided that<br />

equipment can behave as a Modbus/TCP server, and respond to multiple words<br />

access requests.<br />
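As a rough illustration of the contract-table idea described above (not of the actual Modbus/TCP framing, nor of the Quantum/Premium firmware), the sketch below models one I/O scanning pass in Python. The contract fields, addresses and the read_registers stub are assumptions introduced only for illustration.

```python
# Abstract model of an I/O Scanner contract table (illustrative only).
# Each contract names a target server and a block of registers to poll.
contracts = [
    {"server": "192.168.1.20", "start_register": 0,  "count": 10},  # hypothetical I/O block
    {"server": "192.168.1.21", "start_register": 40, "count": 4},   # hypothetical drive
]

def read_registers(server, start, count):
    """Stub standing in for a Modbus 'read registers' request to one server."""
    return [0] * count  # placeholder data

def scan_cycle(table):
    """One pass over the contract table, as the I/O Scanner performs periodically."""
    results = {}
    for contract in table:
        results[contract["server"]] = read_registers(
            contract["server"], contract["start_register"], contract["count"]
        )
    return results

data = scan_cycle(contracts)   # the real service repeats this on a fixed period
print(data)
```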

The Ethernet I/O Scanner service is compatible with a Hot Standby implementation, whether Premium or Quantum. The I/O Scanner is active only on the Primary unit. In case of a controlled switchover, the Ethernet TCP/IP connections handled by the former Primary unit are properly closed, and new ones are reopened once the new Primary gains control. In case of a sudden switchover, resulting, for example, from a power cut-off, the former Primary may not be able to close the connections it had opened. These connections will be closed after expiration of a Keep Alive timeout.

8 In Step 1, support of EtherNet/IP Quantum/Premium adapters is not available with Hot Standby.

In case of a switchover, proper communications will typically recover after one initial cycle of I/O scanning. However, the worst-case gap for the address swap with the I/O Scanner is 500 ms, plus one initial cycle of I/O scanning. As a result, this mode of communication, and hence architectures with Distributed I/O on Ethernet, is not preferred for a control system that regards time criticality as an essential criterion.

Note: The automatic IP address swap capability is a property inherited by every Ethernet TCP/IP adapter installed in the CPU rack.

Self Healing Ring<br />

As demonstrated in the previous chapter, Ethernet TCP/IP used with products like<br />

Connexium offers real opportunities to design enhanced availability architectures,<br />

handling communication between the Information Management Level and the Control<br />

Level. Such architectures, based on a Self Healing Ring topology, are also applicable<br />

when using Ethernet TCP/IP as a fieldbus.<br />

Note that Connexium accepts Copper or Optic Fiber rings. In addition, dual<br />

networking is also applicable at the fieldbus level.<br />

Profibus Architecture

I/O Devices Distributed on PROFIBUS DP/PA: Configuration 5

A PROFIBUS DP V1 Master Class 1 communication module in Quantum form factor is available. It handles cyclic and acyclic data exchanges, and accepts FDT/DTM Asset Management System data flow through its local Ethernet port.

The PROFIBUS network is configured with a Configuration Builder software, which supplies the Unity Pro application program with data structures corresponding to cyclic data exchanges and to diagnostic information. The Configuration Builder can also be configured to pass Unity Pro a set of DFBs, allowing easy implementation of acyclic operations.

Each Quantum PLC can accept up to 6 of these DP Master modules (each of them<br />

handling its own PROFIBUS network). Also, the PTQ PDPM V1 Master Module is<br />

compatible with a Quantum Hot Standby implementation. Only the Master Module in<br />

the Primary unit is active on the PROFIBUS network; the Master Module on the<br />

Standby unit stays in a dormant state unless awakened by a switchover.<br />

PROFIBUS DP V1 Remote Master and Hot Standby PLC

With a smart device such as<br />

PROFIBUS Remote Master 9 ,<br />

an I/O Scanner stream is<br />

handled by the PLC<br />

application 10 and forwarded<br />

to the Remote Master via<br />

Ethernet TCP/IP. In turn,<br />

Remote Master handles the<br />

corresponding cyclic<br />

exchanges with the devices populating the PROFIBUS network. Remote Master can<br />

also handle acyclic data exchanges.<br />

The PROFIBUS network is configured with Unity Pro 11 , which also acts as an FDT<br />

container, able to host manufacturer device DTMs. In addition, Remote Master offers<br />

a comDTM to work with third party FDT/DTM Asset Management Systems.<br />

Automatic symbol generation provides Unity Pro with data structures corresponding<br />

to data exchanges and diagnostic information. A set of DFBs is delivered that allows<br />

an easy implementation of acyclic operations.<br />

Remote Master is compatible with a Quantum, Premium or Modicon M340 Hot<br />

Standby implementation.<br />

9 Planned First Customer Shipment: Q4 2009<br />

10 M340, Premium or Quantum<br />

11 version 5.0<br />

Redundancy and Safety

Requirements for <strong>Availability</strong> and Safety are often considered<br />

to be working against each other. Safety can be the<br />

response to the maxim "stop if any potential danger arises,’<br />

whereas <strong>Availability</strong> can follow the slogan "Produce in spite of<br />

everything.”<br />

Two models of CPU are available to design a Quantum<br />

Safety configuration: the first model (140 CPU 651 60S) is<br />

dedicated to standalone architectures, whereas the second<br />

model ( 140 CPU 671 60S) is dedicated to redundant<br />

architectures.<br />

The Quantum Safety PLC has some exclusive features: a<br />

specific hardware design for safety modules (CPU and I/O<br />

modules) and a dedicated instruction set.<br />

Otherwise, a Safety Quantum Hot Standby configuration has<br />

much in common with a regular Quantum Hot Standby configuration. The<br />

configuration windows, for example, are almost the same, and the Ethernet<br />

communication adapters inherit the IP Address automatic swap capability. Thus the<br />

Safety Quantum Hot Standby helps to reconcile and integrate the concepts of Safety<br />

and <strong>Availability</strong>.<br />

Mixed Configuration: Configuration 6

Whether Premium or Quantum, application requirements such as topology, environment, periphery, time criticality, etc. may lead the final architecture design to adopt both types of design strategies concurrently, i.e. in-rack and distributed I/O, depending on individual subsystem constraints.

Premium / Quantum Hot Standby Switchover Conditions

System Health Diagnostics

Systematic checks are executed cyclically by any running CPU, in order to detect a<br />

potential hardware corruption, such as a change affecting the integrity of the Copro,<br />

the sub-part of the CPU module that hosts the integrated Ethernet port. Another<br />

example of a systematic check is the continuous check of the voltage levels provided<br />

by the power supply module(s). In case of a negative result during these hardware<br />

health diagnostics, the tested CPU will usually switch to a Stop State.<br />

When the unit in question is part of a Hot Standby System, in addition to these<br />

standard hardware tests separately executed on both machines, more specific tests<br />

are conducted between the units. These additional tests involve both Sync Links. The<br />

basic objective is to confirm that the Primary unit is effectively operational, executing<br />

the application program, and controlling the I/O exchanges. In addition, the system<br />

must verify that the current Standby unit is able to assume control after a switchover.<br />

Controlled Switchover

If an abnormal situation occurs on the current Primary unit, it gives up control and<br />

switches either to Off-Line state (the CPU is not a part of the Hot Standby system<br />

coupling) or to Stop State, depending on the event. The former Standby unit takes<br />

control as the new Primary unit.<br />

As previously indicated, the Hot Standby system is controlled through the Command Register (%SW60 system word). Each unit owns an individual bit in this Command Register that decides whether or not that particular unit makes itself available to "hook" to the other unit. An operational, hooked, redundant Hot Standby system requires both units to indicate this intent. Consequently, executing a switchover controlled by the application on a hooked system is straightforward; it requires briefly toggling the decision bit that controls the current Primary unit's "hooking" intent. The first toggle transition switches the current Primary unit to the Off-Line state and makes the former Standby unit take control. The next toggle transition makes the former Primary unit return and hook as the new Standby unit.

An example of this possibility is a controlled switchover resulting from diagnostics conducted at the application level. A Quantum Hot Standby or Premium Hot Standby system does not handle a Modbus Plus or monitored Ethernet communication adapter malfunction as a condition implicitly forcing a switchover. As a result, these communication modules must be cyclically tested by the application, both on the Standby and on the Primary. Diagnostic results elaborated on the Standby are usually transferred to the Primary unit thanks to the Reverse Transfer Registers. Finally, when the application is informed of a non-fugitive inconsistency affecting the Primary unit while the Standby unit is fully operational, it forces a switchover by acting on the Control Register.

Hence, the application program can decide on a Hot Standby switchover, having registered a steady-state negative diagnostic on the Ethernet adapter linking the Primary unit to the "Process Network", and being at the same time informed that the Standby unit is fully operational.

Note: There are two ways to implement a Controlled Switchover: automatically<br />

through configuration of a default DFB, or customized with the creation of a DFB with<br />

its own switchover conditions.<br />
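The toggle sequence described above can be pictured with a small abstract model. The Python sketch below is not PLC code and does not touch %SW60; it only mimics the described behaviour (clear the Primary's hook bit and the Standby takes over; set it again and the former Primary rejoins as Standby), with unit names chosen arbitrarily for illustration.

```python
# Abstract model of the controlled switchover described above (illustrative only).
class HotStandbyPair:
    def __init__(self):
        self.roles = {"A": "Primary", "B": "Standby"}
        self.hook = {"A": True, "B": True}   # both units intend to "hook"

    def toggle_hook(self, unit):
        """Toggle one unit's 'hook' decision bit and apply the described effect."""
        self.hook[unit] = not self.hook[unit]
        other = "B" if unit == "A" else "A"
        if not self.hook[unit] and self.roles[unit] == "Primary":
            # Un-hooking the Primary: it goes Off-Line, the Standby takes control.
            self.roles[unit], self.roles[other] = "Off-Line", "Primary"
        elif self.hook[unit] and self.roles[unit] == "Off-Line":
            # Re-hooking the Off-Line unit: it returns as the new Standby.
            self.roles[unit] = "Standby"

pair = HotStandbyPair()
pair.toggle_hook("A")        # first transition: A -> Off-Line, B -> Primary
pair.toggle_hook("A")        # second transition: A -> Standby
print(pair.roles)            # {'A': 'Standby', 'B': 'Primary'}
```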


The following illustration represents an example of a DFB handling a Hot Standby<br />

Controlled Switchover:<br />

This in turn makes use of an embedded DFB: HSBY_WR:<br />

Fragment of HSBY_Switch_Decision DFB<br />

Fragment of HSBY_WR DFB<br />

Note: HSBY_WR DFB executes a write access on HSBY Control Register (%SW60).<br />

Switchover Latencies

The following table details the typical and maximum swap time delay encountered<br />

when reestablishing Ethernet services during a Switchover event. (Premium and<br />

Quantum configurations)<br />

Service | Typical Swap Time | Maximum Swap Time
Swap IP Address | 6 ms | 500 ms
I/O Scanning | 1 initial cycle of I/O scanning | 500 ms + 1 initial cycle of I/O scanning
Client Messaging | 1 MAST task cycle | 500 ms + 1 MAST task cycle
Server Messaging | 1 MAST task cycle + the time required by the client to reestablish its connection with the server (1) | 500 ms + the time required by the client to reestablish its connection with the server (1)
FTP/TFTP Server | the time required by the client to reestablish its connection with the server (1) | 500 ms + the time required by the client to reestablish its connection with the server (1)
SNMP | 1 MAST task cycle | 500 ms + 1 MAST task cycle
HTTP Server | the time required by the client to reestablish its connection with the server (1) | 500 ms + the time required by the client to reestablish its connection with the server (1)

(1) The time the client requires to reconnect with the server depends on the client communication loss timeout settings.
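As a quick worked example of how to read this table, the snippet below adds the 500 ms worst-case address swap to a hypothetical 100 ms I/O scanning cycle; the cycle time is an assumption chosen only for illustration.

```python
# Worst-case I/O Scanning recovery after a switchover, per the table above:
# 500 ms address swap + one initial cycle of I/O scanning.
io_scan_cycle_ms = 100          # hypothetical I/O scanning cycle time
worst_case_ms = 500 + io_scan_cycle_ms
print(worst_case_ms)            # -> 600 ms
```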

Selection

To end the control level section, the following table presents the main criteria that help you select the most appropriate configuration for your high availability solution:

Criteria | Cost | High Criticality
Switchover Performance | Premium, In Rack Architecture | Quantum, In Rack Architecture
Openness | Premium, Distributed Architecture | Quantum, Distributed Architecture

Premium / Quantum Hot Standby Solution Reminder

The following tables provide a brief reminder of essential characteristics for the Premium and Quantum Hot Standby solutions, respectively:

(Table: Premium Hot Standby Essential Characteristics; among the listed figures: up to 128 per network, and the ETY I/O Scanner handles up to 64 transactions)

(Table: Quantum Hot Standby Essential Characteristics)

Conclusion

This chapter has covered functional and architectural redundancy aspects, from the<br />

Information Management level to the Control level, and up through the<br />

Communication Infrastructure level.<br />

Conclusion

This section summarizes the main characteristics and properties of <strong>Availability</strong> for<br />

Collaborative Control automation architectures.<br />

Chapter 1 demonstrated that <strong>Availability</strong> is dependent not only on Reliability, but also<br />

on Maintenance as it is provided to a given system. The first level of contribution,<br />

Reliability, is primarily a function of the system design and components. Component<br />

and device manufacturers thus have a direct but not exclusive influence on system<br />

<strong>Availability</strong>. The second level of contribution, Maintenance and Logistics, is totally<br />

dependent on end customer behavior.<br />

Chapter 2 presented some simple Reliability and <strong>Availability</strong> calculation examples,<br />

and demonstrated that beyond basic use cases, dedicated skills and tools are<br />

required to extract figures from real cases.<br />
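As a reminder of the kind of basic calculation referred to here, the short sketch below applies the standard steady-state formulas A = MTBF / (MTBF + MTTR) for a single repairable item and A_sys = 1 - (1 - A)^2 for two redundant items in parallel (assuming independent failures and repair). The MTBF and MTTR values are arbitrary illustrative numbers, not Schneider Electric figures.

```python
# Simple availability arithmetic (illustrative values only).
mtbf_h = 100_000        # hypothetical mean time between failures, in hours
mttr_h = 8              # hypothetical mean time to repair, in hours

availability_single = mtbf_h / (mtbf_h + mttr_h)
# Two units in parallel, assuming independent failures and independent repair:
availability_redundant = 1 - (1 - availability_single) ** 2

print(f"single unit:    {availability_single:.6f}")
print(f"redundant pair: {availability_redundant:.9f}")
```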

Chapter 3 explored a central focus of this document, Redundancy, and its application<br />

at the Information Management Level, Communication Infrastructure Level and<br />

Control System Level.<br />

This final chapter summarizes customer benefits provided by <strong>Schneider</strong> <strong>Electric</strong> <strong>High</strong><br />

<strong>Availability</strong> solutions, as well as additional information and references.<br />

Benefits

Standard Offer

Schneider Electric currently offers a wide range of solutions, providing the best design to respond to specific customer needs for Redundancy and Availability in Automation and Control Systems.

One key concept of High Availability is that redundancy is not a default design characteristic at any system level; rather, Redundancy can be added locally, using in most cases standard components.

Simplicity of Implementation

At any level, the intrusion of Redundancy into system design and implementation is minimal compared to a non-redundant system. For SCADA implementation, network active component selection, or PLC programming, most of the software contributions to redundancy depend on selections executed during the configuration phase. Also, Redundancy can be applied selectively.

System Transparency

The transparency of a redundant system, compared to a standalone one, is a customer requirement. With the Schneider Electric automation offer, this transparency is present at each level of the system.

Information Management Level

For Client Display Stations, Dual Path Supervisory Networks, Redundant I/O Servers, or Dual Access to Process Networks, each redundant contribution is handled separately by the system. For example, a concurrent Display Client's communication flow will be transparently re-routed to the I/O Server by the Supervisory Network in case of a cable disruption. This flow will also be transparently routed to the alternative I/O Server in case of a sudden malfunction of the first server. Finally, the I/O Server may transparently leave the communication channel it uses by default if that channel ceases to operate properly, or if the target PLC does not respond through this channel.

Communication Infrastructure Level

Whether utilized as a Process Network or as a Fieldbus Network, the currently available active network components can easily participate in an automatically reconfigured network. With continuous enhancements, the HIPER-Ring strategy offers not only simplicity, but also a level of performance compatible with a high-reactivity demand.

Control System Level

The "IP address automatic swap" for a SCADA application communicating through Ethernet is an important feature of Schneider Electric PLCs. Apart from simplifying the design of the SCADA application implementation, which could otherwise cause delays and increased cost, this feature also contributes to reducing the payload of a communication context exchange on a PLC switchover.

Ease of Use

As previously stated, an increased effort has been made to make the implementation of a redundant feature simple and straightforward. The Vijeo Citect, ConneXium Web Pages and Unity Pro software environments offer clear and accessible configuration windows, along with dedicated selective help, in order to execute the required parameterization.

More Detailed RAMS Investigation<br />

In case of a specific need for detailed dependability (RAMS) studies, for any type of<br />

architecture, contact the <strong>Schneider</strong> <strong>Electric</strong> Safety Competency Center. This center<br />

has skilled and experienced individuals ready to help you with all your needs.<br />

Appendix

Glossary

Note: The references in brackets refer to standards, which are specified at the end of this glossary.

1) Active Redundancy<br />

Redundancy where the different means required to accomplish a given function are<br />

present simultaneously [5]<br />

2) <strong>Availability</strong><br />

Ability of an item to be in a state to perform a required function under given conditions,<br />

at a given instant of time or over a given time interval, assuming that the required<br />

external resources are provided [IEV 191-02-05] (performance) [2]<br />

3) Common Mode Failure<br />

Failure that affects all redundant elements for a given function at the same time [2]<br />

4) Complete failure<br />

Failure which results in the complete inability of an item to perform all required<br />

functions [IEV 191-04-20] [2]<br />

5) Dependability<br />

Collective term used to describe availability performance and its influencing factors:<br />

reliability performance, maintainability performance and maintenance support<br />

performance [IEV 191-02-03] [2]<br />

Note: Dependability is used only for general descriptions in non-quantitative terms.<br />

6) Dormant<br />

A state in which an item is able to function but is not required to function. Not to be confused with downtime [4]

7) Downtime<br />

Time during which an item is in an operational inventory but is not in condition to<br />

perform its required function [4]<br />

8) Failure<br />

Termination of the ability of an item to perform a required function [IEV 191-04-01] [2]<br />

Note 1: After failure the item has a fault.

Note 2: "Failure" is an event, as distinguished from "fault", which is a state.<br />

9) Failure Analysis

The act of determining the physical failure mechanism resulting in the functional<br />

failure of a component or piece of equipment [1]<br />

10) Failure Mode and Effects Analysis (FMEA)<br />

Procedure for analyzing each potential failure mode in a product, to determine the<br />

results or effects on the product. When the analysis is extended to classify each<br />

potential failure mode according to its severity and probability of occurrence, it is<br />

called a Failure Mode, Effects, and Criticality Analysis (FMECA).[6]<br />

11) Failure Rate<br />

Total number of failures within an item population, divided by the total number of life<br />

units expended by that population, during a particular measurement period under<br />

stated conditions [4]<br />

12) Fault<br />

State of an item characterized by its inability to perform a required function, excluding<br />

this inability during preventive maintenance or other planned actions, or due to lack of<br />

external resources [IEV 191-05-01] [2]<br />

Note: a fault is often the result of a failure of the item itself, but may exist without prior<br />

failure.<br />

13) Fault tolerance

Ability to tolerate and accommodate a fault, with or without performance degradation

14) Fault Tree Analysis (FTA)<br />

Method used to evaluate reliability of engineering systems. FTA is concerned with<br />

fault events. A fault tree may be described as a logical representation of the<br />

relationship of primary or basic fault events that lead to the occurrence of a specified<br />

undesirable fault event known as the “top event.” A fault tree is depicted using a tree<br />

structure with logic gates such as AND and OR [7]<br />

(Figure: FTA illustration [7])

15) Hidden Failure

Failure occurring that is not detectable by or evident to the operating crew [1]

16) Inherent Availability (Intrinsic Availability): Ai

A measure of Availability that includes only the effects of an item design and its application, and does not account for effects of the operational and support environment. Sometimes referred to as "intrinsic" availability [4]

17) Integrity

Reliability of data which is being processed or stored.

18) Maintainability

Probability that an item can be retained in, or restored to, a specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources, at each prescribed level of maintenance and repair. [4]

19) Markov Method

A Markov process is a mathematical model that is useful in the study of the availability of complex systems. The basic concepts of the Markov process are those of the "state" of the system (for example operating, non-operating) and state "transition" (from operating to non-operating due to failure, or from non-operating to operating due to repair). [4]

[Figure: Markov Graph illustration [2]]

20) MDT: Mean Downtime

Average time a system is unavailable for use due to a failure. Time includes the actual repair time plus all delay time associated with a repair person arriving with the appropriate replacement parts [4]

21) MOBF: Mean Operating Time Between Failures

Expectation of the operating time between failures [IEV 191-12-09] [2]

22) MTBF

A basic measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions. [4]

23) MTTF: Mean Time To Failure

A basic measure of reliability for non-repairable items. The total number of life units of an item population divided by the number of failures within that population, during a particular measurement interval under stated conditions. [4]

Note: Used with repairable items, MTTF stands for Mean Time To First Failure

24) MTTR: Mean Time To Repair

A basic measure of maintainability. The sum of corrective maintenance times at any specific level of repair, divided by the total number of failures within an item repaired at that level, during a particular interval under stated conditions. [4]

25) MTTR: Mean Time To Recovery

Expectation of the time to recovery [IEV 191-13-08] [2]


26) Non-Detectable Failure

Failure at the component, equipment, subsystem, or system (product) level that is identifiable by analysis but cannot be identified through periodic testing or revealed by an alarm or an indication of an anomaly. [4]

27) Redundancy

Existence in an item of two or more means of performing a required function [IEV 191-15-01] [2]

Note: in this standard, the existence of more than one path (consisting of links and switches) between end nodes.

Existence of more than one means for accomplishing a given function. Each means of accomplishing the function need not necessarily be identical. The two basic types of redundancy are active and standby. [4]

28) Reliability

Ability of an item to perform a required function under given conditions for a given time interval [IEV 191-02-06] [2]

Note 1: It is generally assumed that an item is in a state to perform this required function at the beginning of the time interval

Note 2: the term "reliability" is also used as a measure of reliability performance (see IEV 191-12-01)

29) Repairability

Probability that a failed item will be restored to operable condition within a specified time of active repair [4]

30) Serviceability

Relative ease with which an item can be serviced (i.e. kept in operating condition). [4]

31) Standby Redundancy

Redundancy wherein a part of the means for performing a required function is intended to operate, while the remaining part(s) of the means are inoperative until needed [IEV 191-15-03] [2]

Note: this is also known as dynamic redundancy.

Redundancy in which some or all of the redundant items are not operating continuously but are activated only upon failure of the primary item performing the function(s). [4]


32) System Downtime

Time interval between the commencement of work on a system (product) malfunction and the time when the system has been repaired and/or checked by the maintenance person, and no further maintenance activity is executed. [4]

33) Total System Downtime

Time interval between the reporting of a system (product) malfunction and the time when the system has been repaired and/or checked by the maintenance person, and no further maintenance activity is executed. [4]

34) Unavailability

State of an item of being unable to perform its required function [IEV 603-05-05] [2]

Note: Unavailability is expressed as the fraction of expected operating life that an item is not available, for example given in minutes per year

Ratio: downtime / (uptime + downtime) [3]

Often expressed as a maximum period of time during which the variable is unavailable, for example 4 hours per month

35) Uptime

That element of Active Time during which an item is in condition to perform its required functions. (Increases availability and dependability). [4]

[1] Maintenance & Reliability Terms - Life Cycle Engineering

[2] IEC 62439: High Availability Automation Networks

[3] IEEE Std C37.1-2007: Standard for SCADA and Automation Systems

[4] MIL-HDBK-338B: Military Handbook - Electronic Reliability Design Handbook

[5] IEC-271-194

[6] The Certified Quality Engineer Handbook - Connie M. Borror, Editor

[7] Reliability, Quality, and Safety for Engineers - B.S. Dhillon - CRC Press


Standards

This section contains a selected, non-exhaustive list of reference documents and standards related to Reliability and Availability:

General purpose

• IEC 60050 (191):1990 - International Electrotechnical Vocabulary (IEV)

FMEA/FMECA

• IEC 60812 (1985) - Analysis techniques for system reliability - Procedures for failure mode and effect analysis (FMEA)

• MIL-STD 1629A (1980) - Procedures for performing a failure mode, effects and criticality analysis

Reliability Block Diagrams

• IEC 61078 (1991) - Analysis techniques for dependability - Reliability block diagram method

Fault Tree Analysis

• NUREG-0492 - Fault Tree Handbook - US Nuclear Regulatory Commission

Markov Analysis

• IEC 61165 (2006) - Application of Markov techniques

RAMS

• IEC 60300-1 (2003) - Dependability management - Part 1: Dependability management systems

• IEC 62278 (2002) - Railway applications - Specification and demonstration of reliability, availability, maintainability and safety (RAMS)

Functional Safety

• IEC 61508 - Functional safety of electrical/electronic/programmable electronic safety related systems (7 parts)

• IEC 61511 (2003) - Functional safety - Safety instrumented systems for the process industry sector.


Reliability & Availability Calculation Examples

The previous chapter provided the necessary knowledge and theory basics to understand simple examples of Reliability and Availability calculations. This section examines concrete, simple but realistic examples, related to current, plausible solutions.

The root piece of information required for all of these calculations, for all major components we will implement in our architectures, is the MTBF figure. MTBFs are normally provided by the manufacturer, either on request and/or in dedicated documents. For Schneider Electric PLCs, MTBF information can be found on the intranet site Pl@net, under Product offer>Quality Info.

Note: The MTBF information is usually not accessible via the Schneider Electric public website.

Redundant PLC System, Using Massive I/O Modules Redundancy

Standalone Architecture

Consider a simple, single-rack Premium PLC configuration.

First, calculate the individual modules' MTBFs, using a spreadsheet that will do the calculations (examples include Excel and Open Office).


The calculation guidelines are derived from the main conclusion given by Serial RBD analysis, that is:

λ_S = Σ (i = 1 .. n) λ_i

i.e. the equivalent Failure Rate of n serial elements is equal to the sum of the individual Failure Rates of these elements, with R_S(t) = e^(−λ_S t).

The first operation identifies the individual MTBF figures of the part references populating the target system. Using these figures, a second sheet then works at the item, group and system levels (a minimal code sketch of the same procedure is given after the list below).

• For each of the identified item part references:

- Individual item Failure Rate λ, calculated simply by inverting the individual item MTBF: λ = 1 / MTBF

- Individual item Reliability over 1 year, applying R(t) = e^(−λt), where t = 8760, the number of hours in one year (365 * 24)

• For each of the identified item reference groups:

- Group Failure Rate: individual item Failure Rate times the number of items in the considered group

- Group Reliability: individual item Reliability raised to the power of the number of items

• For the considered System:

- System Failure Rate: sum of the Group Failure Rates

- System MTBF in hours: reciprocal of the System Failure Rate

- System Reliability over one year: exp(−System Failure Rate * 8760), where 8760 = 365 * 24, the number of hours in one year

- System Availability (with MTTR = 2 h): System MTBF / (System MTBF + 2)
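The same worksheet logic can be sketched in a few lines of Python. This is only a minimal illustration of the procedure above: the part references and MTBF values in the bill of material are hypothetical placeholders, not Schneider Electric data, so the printed results will not match the worksheet figures quoted next.

```python
import math

# Hypothetical bill of material: (part reference, MTBF in hours, quantity).
# These MTBF values are placeholders, not manufacturer data.
bill_of_material = [
    ("Power supply",    900_000, 1),
    ("CPU",             350_000, 1),
    ("Discrete input",  800_000, 4),
    ("Discrete output", 700_000, 4),
]

HOURS_PER_YEAR = 365 * 24   # 8760 h
MTTR = 2.0                  # assumed Mean Time To Repair, in hours

system_failure_rate = 0.0
for name, mtbf, qty in bill_of_material:
    item_rate = 1.0 / mtbf                                   # lambda = 1 / MTBF
    item_reliability = math.exp(-item_rate * HOURS_PER_YEAR) # R(t) = e^(-lambda*t)
    group_rate = item_rate * qty                             # group failure rate
    group_reliability = item_reliability ** qty              # group reliability
    system_failure_rate += group_rate                        # serial RBD: rates add up
    print(f"{name:16s} lambda={item_rate:.2e}/h  group R(1 year)={group_reliability:.4f}")

system_mtbf = 1.0 / system_failure_rate                      # reciprocal of the system rate
system_reliability = math.exp(-system_failure_rate * HOURS_PER_YEAR)
system_availability = system_mtbf / (system_mtbf + MTTR)

print(f"System MTBF        : {system_mtbf:,.0f} h")
print(f"System R (1 year)  : {system_reliability:.4f}")
print(f"System availability: {system_availability:.6f}")
print(f"Expected downtime  : {(1 - system_availability) * HOURS_PER_YEAR * 60:.0f} min/year")
```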

Looking at the example results, Reliability over one year is approximately 83% (this means that the probability for this system to encounter a failure during one year is approximately 17%).

System MTBF itself is approximately 45,500 hours (approximately 5 years).

Regarding Availability, with a 2 hour Mean Time To Repair (which corresponds to a very good logistics and maintenance organization), we would obtain a 4-nines resulting Availability, i.e. on average approximately 23 minutes of downtime per year.

Note:

• As previously explained, System figures produced by our basic calculations only apply the equations given by basic probability analysis. In addition to these calculations, a commercial software tool has been used, permitting us to confirm our figures.


Redundant Architecture

Assuming we would like to increase this system's Availability by implementing a redundant architecture, we need to calculate the potential gain for System Reliability and Availability.

We will assume that we have no additional hardware, apart from the two redundant racks. The required calculation guidelines are derived from the main conclusion given by Parallel RBD analysis, that is:

R_Red(t) = 1 − ∏ (i = 1 .. n) [1 − R_i(t)]

This means that, if we consider R_S as the Standalone System Reliability and R_Red as the Redundant System Reliability:

R_Red(t) = 1 − (1 − R_S(t))²
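As a quick cross-check of this formula, the sketch below starts from the rounded standalone figures quoted earlier (about 83% reliability over one year) and derives the redundant reliability, the approximate MTBF, and the availability of the duplicated pair. It assumes two identical, independent racks in a simple parallel (hot standby) structure with a 2 hour MTTR, and ignores common-mode and switchover effects, so the results are slightly optimistic and only approximately reproduce the worksheet figures.

```python
import math

HOURS_PER_YEAR = 365 * 24
MTTR = 2.0                              # assumed repair time, in hours

R_standalone = 0.83                     # standalone reliability over one year (rounded, from the text)
lam = -math.log(R_standalone) / HOURS_PER_YEAR   # back out the standalone failure rate
mtbf_standalone = 1.0 / lam

# Parallel RBD with two identical elements: R_Red(t) = 1 - (1 - R_S(t))^2
R_redundant = 1.0 - (1.0 - R_standalone) ** 2

# For a 1-out-of-2 hot-standby pair (no repair), MTTF is 1.5x the single-unit value;
# used here, as in the text, as an approximation of the redundant MTBF.
mtbf_redundant = 1.5 * mtbf_standalone

# Steady-state availability of the duplicated pair, each unit repaired with the given MTTR
# (independent units assumed): U_red = U_single^2
u_single = MTTR / (mtbf_standalone + MTTR)
availability_redundant = 1.0 - u_single ** 2

print(f"Standalone: R(1 y) = {R_standalone:.2%}, MTBF = {mtbf_standalone:,.0f} h")
print(f"Redundant : R(1 y) = {R_redundant:.2%}, MTBF = {mtbf_redundant:,.0f} h")
print(f"Redundant availability = {availability_redundant:.10f}")
```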

As a result of the redundant architecture, System Reliability over one year is now approximately 97% (i.e. the probability for the system to encounter a failure during one year has been reduced to approximately 3%).

System MTBF has increased from 45,500 hours (approximately 5 years) to 68,000 hours (approximately 7.8 years).

Regarding System Availability, with a 2 hour Mean Time To Repair we would obtain an 8-nines resulting Availability, which is close to 100%.

Note: Formal calculations should also take into account undetected errors in the redundant architecture, which would give somewhat less optimistic figures.

A complete analysis should also take into account the additional wiring devices typically used in a massive I/O redundancy strategy, feeding the corresponding input points with the same input signal, and bringing the corresponding output points onto the same output signal.

Also, with this software, a parallel structure has been retained, with the same Failure Rate on the Standby rack as on the Primary rack.


Reminder of Standby Definitions:

Cold Standby: The Standby unit starts only if the primary (active) unit becomes inoperative.

Warm Standby: The Failure Rate of the Standby unit during the inactive mode is lower than its Failure Rate when it becomes active.

Hot Standby: The Failure Rate of the Standby unit during the inactive mode is the same as its Failure Rate when it becomes active. This is a simple parallel configuration.
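To make the practical difference concrete, the short sketch below compares the classical textbook reliability expressions for two identical units with failure rate λ: hot standby (a simple parallel structure) and cold standby with an ideal, never-failing switchover. The numeric values are purely illustrative; warm standby falls between the two.

```python
import math

lam = 1.0 / 45_000          # illustrative failure rate of one unit (per hour)
t = 365 * 24                # mission time: one year, in hours

# Hot standby (both units energized): simple parallel structure
r_hot = 1.0 - (1.0 - math.exp(-lam * t)) ** 2        # = 2e^-lt - e^-2lt

# Cold standby (standby unit does not fail while dormant, perfect switching)
r_cold = math.exp(-lam * t) * (1.0 + lam * t)

print(f"Single unit  R(1 y): {math.exp(-lam * t):.4f}")
print(f"Hot standby  R(1 y): {r_hot:.4f}")
print(f"Cold standby R(1 y): {r_cold:.4f}")   # cold standby is the more reliable of the two
```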



Redundant PLC System, Using Shared Remote I/O Racks

Simple Architecture with a Standalone CPU Rack

Standalone CPU Rack

Consider a standalone CPU rack equipped with power supply, CPU, remote I/O processor and Ethernet communication modules.

As in the previous section, calculate the potential gain a redundant architecture would allow, in terms of Reliability and Availability.

First calculate the individual modules' MTBFs, then establish, for each rack, a worksheet that provides the Reliability and Availability figures.


Remote I/O Rack

Simple Architecture: Standalone CPU Rack + Remote I/O Rack


For this Serial System (Rack #1 + Rack #2), Reliability over one year is approximately 82.8% (the probability for this system to encounter a failure during one year is approximately 17%).

System MTBF is approximately 46,260 hours (approximately 5.3 years).

Regarding Availability, with a 2 hour Mean Time To Repair (which corresponds to a very good logistics and maintenance organization), we obtain a 4-nines resulting Availability, i.e. on average approximately 23 minutes of downtime per year.

Note: As expected, the Reliability and Availability figures resulting from this Serial implementation are determined by the weakest link of the chain.
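The availability and downtime figures quoted above can be checked in a few lines from the serial-system MTBF and the assumed 2 hour MTTR:

```python
mtbf = 46_260          # serial system MTBF from the worksheet, in hours
mttr = 2               # assumed Mean Time To Repair, in hours

availability = mtbf / (mtbf + mttr)
downtime_min_per_year = (1 - availability) * 365 * 24 * 60

print(f"Availability: {availability:.6f}")                     # ~0.999957 -> 4 nines
print(f"Downtime    : {downtime_min_per_year:.0f} min/year")   # ~23 min
```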



Simple Architecture with a Redundant CPU Rack

One solution to strengthen Rack #1 Availability (i.e. the rack hosting the CPU, and therefore the process control core) is to implement a Hot Standby configuration.

As a result of Rack #1 Redundancy, Reliability over one year would be approximately 99.7%, compared to 94.6% with a Standalone Rack #1 (i.e. the probability for the Redundant Rack #1 System to encounter a failure during one year would be reduced from 5.4% to 0.3%).

System MTBF would increase from approximately 157,000 hours (approximately 18 years) to 235,000 hours (approximately 27 years).

Regarding System Availability, with a 2 hour Mean Time To Repair we would obtain a 9-nines resulting Availability, close to 100%.
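These figures can be cross-checked against the parallel formula introduced earlier. The sketch below reproduces them from the standalone Rack #1 values quoted in the text; it assumes two identical, independent racks and ignores switchover and common-mode effects, so it is slightly optimistic.

```python
r_rack1 = 0.946                 # standalone Rack #1 reliability over one year (from the text)
mtbf_rack1 = 157_000            # standalone Rack #1 MTBF, in hours (from the text)
mttr = 2                        # assumed Mean Time To Repair, in hours

r_redundant = 1 - (1 - r_rack1) ** 2            # parallel RBD -> ~0.997
mtbf_redundant = 1.5 * mtbf_rack1               # 1-out-of-2 approximation -> ~235,500 h
unavailability = (mttr / (mtbf_rack1 + mttr)) ** 2   # both racks down simultaneously

print(f"Redundant Rack #1 R(1 y): {r_redundant:.3f}")
print(f"Redundant Rack #1 MTBF  : {mtbf_redundant:,.0f} h")
print(f"Redundant availability  : {1 - unavailability:.12f}")
```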

Note: As expected, Reliability and Availability figures resulting from this Parallel implementation are better than the best of the individual figures for different links of the chain.


Consider a common misuse of reliability figures:

• The last architecture examined (Distributed Architecture with a Redundant CPU Rack) showed a real benefit in implementing CPU Rack redundancy: with this architecture the sub-system would be elevated to a 99.9% Reliability, and almost a 100% Availability.

• The common misuse of this Reliability figure would be to use it to argue for a comparable Reliability and Availability benefit for the whole resulting system.

Note: This example has considered the Sub-System Failure Rate provided by the reliability software.

Regarding the Serial System built from the Redundant CPU Racks and the STB Islands, the worksheet above shows a resulting Reliability (over one year) of 84.06%, while the Standalone System Reliability over the same period is 81.95%.

In addition, the worksheet shows a resulting Availability (for a 2 hour MTTR) of 99.96%, the Standalone System Availability (for a 2 hour MTTR) being 99.9957%.

As a result, this data suggests that implementing CPU Rack Redundancy would have almost no benefit.

Of course, this is an incorrect conclusion, and the example suggests a simple rule: always compare comparable items. If we implement redundancy on the CPU Control Rack in order to increase process control core Reliability, and to an extent Availability, we then need to examine and compare the figures at this level only, as the rest of the system has not received any additional redundancy.
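The effect behind this rule is easy to demonstrate with a purely illustrative series calculation (the numbers below are hypothetical, not the worksheet figures): however reliable the redundant sub-system becomes, the series product is capped by the reliability of the non-redundant remainder.

```python
# Hypothetical one-year reliabilities, for illustration only
r_rest = 0.85                    # non-redundant remainder of the system (e.g. distributed I/O)
r_cpu_standalone = 0.946         # CPU rack, standalone
r_cpu_redundant = 0.997          # CPU rack, Hot Standby pair

for label, r_cpu in [("standalone CPU rack", r_cpu_standalone),
                     ("redundant CPU rack ", r_cpu_redundant)]:
    # Series RBD: the whole system works only if both parts work
    print(f"{label}: system R(1 y) = {r_cpu * r_rest:.4f}")

# Output shows roughly 0.80 vs 0.85: the gain at system level stays small,
# because the non-redundant part is now the weakest link of the chain.
```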



Recommendations for Utilizing Figures

The previous case studies provide several important lessons concerning Reliability metrics evaluation:

• Provided elementary MTBF figures are available, a "serial" system is quite easy to evaluate, thanks to the "magic bullet" formula: λ = 1 / MTBF

• If you consider a "parallel" sub-system, System Reliability can easily be derived from the Sub-System Reliability, and System Availability can be calculated almost as easily, starting from the Sub-System Reliability. But the "magic bullet" formula no longer applies: under these conditions, the exponential law no longer directly links Reliability to Failure Rate. However, the System MTBF can be approximated from the Sub-System MTBF (see the numerical check below):

System MTBF (in hours) ≈ 1.5 x Sub-System MTBF

But no direct relationship will provide the System Failure Rate.
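The 1.5 factor quoted above is the classical result for two identical units in parallel without repair: the system MTTF is 1/λ + 1/(2λ) = 1.5/λ. The short sketch below confirms it numerically by integrating the parallel reliability curve; the λ value is arbitrary.

```python
import math

lam = 1.0 / 45_000                       # illustrative single-unit failure rate (per hour)

# MTTF is the integral of R(t) from 0 to infinity.
# For two identical units in parallel: R(t) = 2e^-lt - e^-2lt
mttf_single = 1.0 / lam
mttf_parallel = 2.0 / lam - 1.0 / (2.0 * lam)        # closed-form integral = 1.5 / lam

# Numerical cross-check with a simple trapezoidal integration
dt = 10.0
t_max = 20 * mttf_single
ts = [i * dt for i in range(int(t_max / dt))]
r = [2 * math.exp(-lam * t) - math.exp(-2 * lam * t) for t in ts]
mttf_numeric = sum((a + b) / 2 * dt for a, b in zip(r, r[1:]))

print(f"Single unit MTTF : {mttf_single:,.0f} h")
print(f"Parallel MTTF    : {mttf_parallel:,.0f} h  (= 1.5 x single)")
print(f"Numerical check  : {mttf_numeric:,.0f} h")
```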

Because these calculations must be executed precisely, acquiring a commercial reliability software package is one solution. Another option is to call on an external service.


Schneider Electric Industries SAS

Head Office

89, bd Franklin Roosevelt

92506 Rueil-Malmaison Cedex

FRANCE

www.schneider-electric.com

Due to the evolution of standards and equipment, the characteristics indicated in texts and images in this document are binding only after confirmation by our departments.

Print:

Version 1 - 06 2009
