High Availability Theoretical Basics - Schneider Electric
System Technical Note

How can I... Increase the Availability of a Collaborative Control System
Table of Contents

Introduction to High Availability .............................................5
High Availability Theoretical Basics .......................................9
High Availability with Collaborative Control System ..........31
Conclusion ..............................................................................73
Appendix .................................................................................76
Reliability & Availability Calculation Examples ...................83
Introduction to High Availability

Purpose
The intent of this System Technical Note (STN) is to describe the capabilities of the different Schneider Electric solutions that meet the requirements of the most critical applications, and consequently increase the availability of a Collaborative Control System. It provides a common, readily understandable reference point for end users, system integrators, OEMs, sales people, business support and other parties.

Introduction
Increasingly, process applications require a high availability automation system. Before deciding to install a high availability automation system in your installation, you need to consider the following key questions:
• What is the required security level? This concerns the security of both people and hardware. For instance, the complete start/stop sequences that manage the kiln in a cement plant include a key condition: the most powerful equipment must be the last to start and stop.
• What is the criticality of the process? This point concerns all the processes that involve a reaction (mechanical, chemical, etc.). Consider the example of the kiln again. To avoid its destruction, the complete stop sequence requires a slow cooling of the constituent material. Another typical example is the biological treatment in a wastewater plant, which cannot be stopped every day.
• What is the environmental criticality? Consider the example of an automation system in a tunnel. If a fire can start on one side, the PLCs (the primary and the redundant one) should be installed on opposite sides of the tunnel. Furthermore, will the system face a harsh environment in terms of vibration, temperature, shock, etc.?
• Which other particular circumstances does the system have to address? This last topic includes additional questions, for example: does the system really need redundancy if the electrical network becomes inoperative in the concerned layer of the installation? What is the criticality of the data in the event of a loss of communication?
Availability is a term that is increasingly used to qualify a process asset or system. In addition, Reliability and Maintainability are terms now often used for analyses considered of major usefulness in design improvement and in production diagnostic issues. Accordingly, the design of automation system architectures must consider these types of questions.
Document Overview

The Collaborative Control System provided by Schneider Electric offers different levels of redundancy, allowing you to design an effective high availability system.

This document contains the following chapters:
• The High Availability Theoretical Basics chapter describes the fundamentals of High Availability. On the one hand it presents theory and basics; on the other hand it explains a method to conceptualize series/parallel architectures. Calculation examples illustrate this approach.
• The Collaborative Control System Availability chapter presents different High Availability solutions, especially for the Collaborative Control System, from information management down to the I/O module level.
• The final Conclusion chapter summarizes the customer benefits provided by Schneider Electric High Availability solutions, as well as additional information and references.
The following drawing represents the various levels of an automation system architecture:

As shown in the following chapters, redundancy is a convenient means of raising global system reliability, and thus its availability. In the Collaborative Control System, High Availability (that is, redundancy) can be addressed at the different levels of the architecture:
• Single or redundant I/O Modules (depending on sensor/actuator redundancy).
• Depending on the I/O handling philosophy (for example conventional Remote I/O stations, or I/O Islands distributed on Ethernet), different scenarios can be applied: dual communication medium I/O Bus, or single or dual Self-Healing Ring.
• Programmable Controller CPU redundancy (Hot Standby PAC Station).
• Communication network and port redundancy.
• SCADA System dedicated approaches with multiple operator station location scenarios and resource redundancy capabilities.
Guide Scope
The realization of an automation project includes five main phases: Selection, Design, Configuration, Implementation and Operation. To help you develop a whole project based on these phases, Schneider Electric created the System Technical book concept: Guides (STG) and Notes (STN).

A System Technical Guide provides technical guidelines and recommendations to implement technologies according to your needs and requirements. This guide covers the entire project life cycle, from the selection phase to the operation phase, providing design methodology and even source code examples for all components of a sub-system.
A System Technical Note gives a more theoretical approach, focusing specifically on a system technology. This book describes our complete solution offer for a system, and therefore supports you in the selection phase of a project. STGs and STNs are linked and complementary: to sum up, you will find the technology fundamentals in an STN, and their corresponding tested and validated applications in one or several STGs.
[Diagram: STG and STN scope across the Automation Project Life Cycle]
High Availability Theoretical Basics

This section describes basic high availability terms, concepts and formulas, and includes examples for typical applications.

Fault Tolerant System
A Fault Tolerant System usually refers to a system that can continue to operate even though a hardware component becomes inoperative. The Redundancy principle is often used to implement a Fault Tolerant System, because an alternate component takes over the task transparently.
Lifetime and Failure Rate

Considering a given system or device, its Lifetime corresponds to the number of hours it can function under normal operating conditions. This number is the result of the individual life expectancy of the components used in its assembly.

Lifetime is generally seen as a sequence of three subsequent phases: the "early life" (or "infant mortality") period, the "useful life," and the "wear-out" period.
Failure Rate (λ) is defined as the average (mean) rate at which a system becomes inoperative.

When considering events occurring on a population of systems, for example a group of light bulbs, the units used should normally be failures per unit of system-time, for example failures per machine-hour or failures per system-year. Because the scope of this document is limited to single repairable entities, we will usually discuss failures per unit of time.
Failure Rate Example: For a human being, Failure Rate (λ) measures the probability of death occurring in the next hour. Stating λ(20 years) = 10⁻⁶ per hour would mean that the probability for someone aged 20 to die in the next hour is 10⁻⁶.
Bathtub Curve
The following figure shows the Bathtub Curve, which represents the Failure Rate (λ) as a function of Lifetime (t):

Consider the relation between Failure Rate and Lifetime for a device consisting of assembled electronic parts. This relationship is represented by the "bathtub curve" shown in the previous diagram. In "early life," the system exhibits a high Failure Rate, which gradually decreases until it approaches a constant value that is maintained during its "useful life." The system finally enters the "wear-out" stage of its life, where the Failure Rate increases exponentially.
Note: Useful Life normally starts at the beginning of system use and ends at the beginning of the wear-out phase. Assuming that "early life" corresponds to the "burn-in" period indicated by the manufacturer, we generally consider that Useful Life starts when the end user puts the system into service.
RAMS (Reliability, Availability, Maintainability, Safety)

The following text, from the MIL-HDBK-338B standard, defines the RAM criteria and their probabilistic aspect:
"For the engineering specialties of reliability, availability and maintainability (RAM), the theories are stated in the mathematics of probability and statistics. The underlying reason for the use of these concepts is the inherent uncertainty in predicting a failure. Even given a failure model based on physical or chemical reactions, the results will not be the time a part will fail, but rather the time a given percentage of the parts will fail or the probability that a given part will fail in a specified time."
Along with Reliability, Availability and Maintainability, Safety is the fourth metric of a meta-domain that specialists have named RAMS (also sometimes referred to as dependability).
Metrics
RAMS metrics relate to time allocation and depend on the operational state of a given system.

The following curve defines the state linked to each term:
• MUT: Mean Up Time
MUT qualifies the average duration for which the system is in the operational state.
• MDT: Mean Down Time
MDT qualifies the average duration for which the system is not in the operational state. It comprises the successive portions of time required to detect the error, fix it, and restore the system to its operational state.
• MTBF: Mean Time Between Failures
MTBF is defined by the MIL-HDBK-338 standard as follows: "A basic measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions."
Thus for repairable systems, MTBF is a metric commonly used to appraise Reliability, and corresponds to the average time interval (normally specified in hours) between two consecutive occurrences of inoperative states.
Put simply: MTBF = MUT + MDT
MTBF can be calculated (provisional reliability) based on data books such as UTE C80-810 (RDF2000), MIL-HDBK-217F, FIDES, RDF 93, and BELLCORE. Other inputs include field feedback, laboratory testing, or demonstrated MTBF (operational reliability), or a combination of these. Remember that MTBF only applies to repairable systems.
• MTTF (or MTTFF): Mean Time To First Failure
MTTF is the mean time before the occurrence of the first failure.
MTTF (and MTBF by extension) is often confused with Useful Life, even though these two concepts are not related in any way. For example, a battery may have a Useful Life of 4 hours and an MTTF of 100,000 hours. These figures indicate that for a population of 100,000 batteries in continuous operation, there will be approximately one battery failure every hour (defective batteries being replaced).
Considering a repairable system with exponentially distributed Reliability and a constant Failure Rate (λ), MTTF = 1 / λ.

Mean Down Time is usually very small compared to Mean Up Time, so MTBF is approximately equal to MUT and can be assimilated to MTTF, resulting in the following relationship: MTBF = 1 / λ.

This relationship is widely used in the calculations that follow.
Example:

Given the MTBF of a communication adapter, 618,191 hours, what is the probability for that module to operate without failure for 5 years?

Calculate the module Reliability over a 5-year time period:
R(t) = e^(−λt) = e^(−t / MTBF)

a) Divide 5 years, that is 8,760 × 5 = 43,800 hours, by the given MTBF:
43,800 / 618,191 = 0.07085

b) Then raise e to the power of the negative of that number:
e^(−0.07085) = 0.9316

Thus there is a 93.16% probability that the communication module will not fail over a 5-year period.

• FIT: Failures In Time
Typically used as the Failure Rate measurement for non-repairable electronic components, FIT is the number of failures in one billion (10⁹) hours.
FIT = 10⁹ / MTBF or MTBF = 10⁹ / FIT
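The worked example above and the FIT conversion can be reproduced with a short script. This is a minimal sketch, using the MTBF figure given in the example; the function names are illustrative:

```python
import math

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """R(t) = e^(-t/MTBF), assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

def mtbf_to_fit(mtbf_hours: float) -> float:
    """FIT = number of failures per 10^9 hours."""
    return 1e9 / mtbf_hours

mtbf = 618_191              # communication adapter MTBF, in hours
mission = 8_760 * 5         # 5 years expressed in hours (43,800 h)
print(f"R(5 years) = {reliability(mtbf, mission):.4f}")   # ≈ 0.9316
print(f"FIT        = {mtbf_to_fit(mtbf):.1f}")
```

Running it confirms the 93.16% figure obtained in steps a) and b).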
Safety

Definition
Safety refers to the protection of people, assets and the environment. For example, if an installation has a tank whose internal pressure exceeds a given threshold, there is a high chance of explosion and eventual destruction of the installation (with injury or death of people and damage to the environment). In this example, the Safety System put in place opens a valve to the atmosphere to prevent the pressure threshold from being crossed.
Maintainability

Definition

Maintainability refers to the ability of a system to be maintained in an operational state. This once again relates to probability: Maintainability corresponds to the probability for an inoperative system to be repaired in a given time interval.

While design choices impact the maintainability of a system to a certain extent, the maintenance organization also has a major impact on it. Having the right number of people trained to observe and react with the proper methods, tools, and spare parts are considerations that usually depend more on the customer organization than on the automation system architecture.
Mathematics Basics

Equipment shall be maintainable on-site by trained personnel according to the maintenance strategy. A common metric, Maintainability M(t), gives the probability that a given active maintenance operation can be accomplished in a given time interval.

The relationship between Maintainability and repair is similar to the relationship between Reliability and failure, with the repair rate µ(t) defined in a way analogous to the Failure Rate. When this repair rate is considered constant, it implies an exponential distribution for Maintainability: M(t) = 1 − e^(−µt).

The maintainability of equipment is reflected in MTTR, which is usually considered as the sum of the individual administrative, transport, and repair times.
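Under the constant-repair-rate model, the probability that a repair completes within t hours is the complement of the exponential term, 1 − e^(−µt). A minimal sketch, where the MTTR value is an illustrative placeholder and not a figure from this document:

```python
import math

def maintainability(mu_per_hour: float, t_hours: float) -> float:
    """M(t) = 1 - e^(-mu*t): probability that repair completes within t hours."""
    return 1.0 - math.exp(-mu_per_hour * t_hours)

mttr = 2.0            # illustrative mean time to repair, in hours
mu = 1.0 / mttr       # constant repair rate (repairs per hour)
print(f"M(4 h) = {maintainability(mu, 4):.4f}")   # 1 - e^-2 ≈ 0.8647
```

As expected, the longer the allowed time window relative to MTTR, the closer M(t) gets to 1.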
Availability

Definition
The term High Availability is often used when discussing Fault Tolerant Systems. For example, your telephone line is supposed to offer you a high level of availability: the service you are paying for has to be effectively accessible and dependable. Your line availability relates to the continuity of the service you are provided. As an example, assume you are living in a remote area with occasional violent storms. Because of your location and the damage these storms can cause, long delays are required to fix your line once it is out of order. In these conditions, if on average your line is usable only 50% of the time, you have poor availability. By contrast, if on average each of your attempts is 100% satisfied, then your line has high availability.

This example demonstrates that Availability is the key metric for measuring a system's tolerance level, that it is typically expressed as a percentage (for example 99.999%), and that it belongs to the domain of probability.
Mathematics Basics

The Instantaneous Availability of a device is the probability that this device will be in the functional state for which it was designed, under given conditions and at a given time (t), with the assumption that the required external conditions are met.

Besides Instantaneous Availability, different variants have been specified, each corresponding to a specific definition, including Asymptotic Availability, Intrinsic Availability and Operational Availability.
Asymptotic (or Steady State) Availability: A∞

Asymptotic Availability is the limit of the instantaneous availability function as time approaches infinity:

A∞ = 1 − MDT / (MUT + MDT) = MUT / (MUT + MDT) = MUT / MTBF

where Downtime includes all repair time (corrective and preventive maintenance time), administrative time and logistic time.

The following curve shows an example of asymptotic behavior:
Intrinsic (or Inherent) Availability: Ai

Intrinsic Availability does not include administrative time and logistic time, and usually does not include preventive maintenance time. It is primarily a function of the basic equipment/system design.

Ai = MTBF / (MTBF + MTTR)

We will consider Intrinsic Availability in our Availability calculations.
Operational Availability: Ao

Operational Availability corresponds to the probability that an item will operate satisfactorily at a given point in time when used in an actual or realistic operating and support environment.

Ao = Uptime / Operating Cycle

Operational Availability includes logistics time, ready time, waiting or administrative downtime, and both preventive and corrective maintenance downtime.
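The three availability variants defined above differ only in which time components they count. A minimal sketch comparing them; the MTBF, MTTR and downtime figures are illustrative placeholders, not values from this document:

```python
def asymptotic_availability(mut: float, mdt: float) -> float:
    """A_inf = MUT / (MUT + MDT): MDT includes all repair, admin and logistic time."""
    return mut / (mut + mdt)

def intrinsic_availability(mtbf: float, mttr: float) -> float:
    """A_i = MTBF / (MTBF + MTTR): active repair time only."""
    return mtbf / (mtbf + mttr)

def operational_availability(uptime: float, operating_cycle: float) -> float:
    """A_o = Uptime / Operating Cycle: the availability actually experienced."""
    return uptime / operating_cycle

# Illustrative figures in hours: 4 h of hands-on repair,
# but 20 h of total downtime once logistics are included.
print(intrinsic_availability(50_000, 4))     # repair time only
print(asymptotic_availability(50_000, 20))   # all downtime counted
```

Because Ai ignores logistic and administrative delays, it is always at least as high as the availability computed from total downtime.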
This is the availability that the customer actually experiences. It is essentially the a posteriori availability, based on the actual events that happened to the system.

Classification

A common way to classify a system in terms of Availability consists of counting the number of 9s in its availability figure.

The following table defines the availability classes:
Class | Type of Availability | Availability (%) | Downtime per Year | Number of Nines
1 | Unmanaged         | 90       | 36.5 days    | 1 nine
2 | Managed           | 99       | 3.65 days    | 2 nines
3 | Well Managed      | 99.9     | 8.76 hours   | 3 nines
4 | Tolerant          | 99.99    | 52.6 minutes | 4 nines
5 | High Availability | 99.999   | 5.26 minutes | 5 nines
6 | Very High         | 99.9999  | 31.5 seconds | 6 nines
7 | Ultra High        | 99.99999 | 3.15 seconds | 7 nines

For example, a system that has a five-nines availability rating is 99.999% available, with a system downtime of approximately 5.26 minutes per year.
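The downtime column follows directly from the availability percentage, using a 365-day year. A minimal sketch:

```python
def downtime_per_year_seconds(availability_percent: float) -> float:
    """Yearly downtime in seconds for a given availability percentage."""
    year_seconds = 365 * 24 * 3600      # 31,536,000 seconds per year
    return year_seconds * (1.0 - availability_percent / 100.0)

for a in (99.0, 99.9, 99.99, 99.999):
    print(f"{a}% -> {downtime_per_year_seconds(a) / 60:.2f} minutes/year")
```

Five nines (99.999%) yields 315.36 seconds, that is the 5.26 minutes per year quoted in the table.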
Reliability

Definition

A fundamental associated metric is Reliability. Return to the example of your telephone line. If the wired network is very old, having suffered from many years of lack of maintenance, it may frequently be out of order. Even if the current maintenance personnel are doing their best to repair it in a minimum time, the line can be said to have poor reliability if, for example, it has experienced ten losses of communication during the last year. Notice that Reliability necessarily refers to a given time interval, typically one year. Therefore, Reliability accounts for the absence of shutdown of a system in operation over a given time interval. As with Availability, we consider Reliability in terms of perspective (a prediction), and within the domain of probability.
Mathematics Basics
In many situations a detected disruption fortunately does not mean the end of a device's life. This is usually the case for the automation and control systems being discussed, which are repairable entities. As a result, the ability to predict the number of shutdowns due to a detected disruption over a specified period of time is useful for estimating the budget required for the replacement of inoperative parts.

In addition, knowing this figure can help you maintain an adequate inventory of spare parts. Put simply, the question "Will a device work for a particular period?" can only be answered as a probability; hence the concept of Reliability.
According to the MIL-STD-721C standard, the Reliability R(t) of a given system is the probability that the system performs its intended function under stated conditions for a stated period of time. As an example, a system with a reliability of 0.9999 over a year has a 99.99% probability of functioning properly throughout an entire year.

Note: Reliability is always indicated for a given period of time, for example one year.
Referring to the system model considered with the "bathtub curve," one characteristic is its constant Failure Rate during the useful life. In that portion of its lifetime, the Reliability of the system follows an exponential law, given by the following formula: R(t) = e^(−λt), where λ stands for the Failure Rate.

The following figure illustrates this exponential law:

As shown in the diagram, Reliability starts with a value of 1 at time zero, which represents the moment the system is put into operation. Reliability then falls gradually to zero, following the exponential law. Notably, Reliability is about 37% at t = 1/λ. As an example, assume a given system experiences an average of 0.5 inoperative states per 1-year time unit. The exponential law indicates that such a system would have about a 37% chance of remaining in operation upon reaching 1 / 0.5 = 2 years of service.
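The 37% figure at t = 1/λ can be checked directly from the exponential law. A minimal sketch using the 0.5 failures-per-year example above:

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = e^(-lambda*t), valid during the constant-failure-rate useful life."""
    return math.exp(-failure_rate * t)

lam = 0.5                           # 0.5 inoperative states per year
print(reliability(lam, 1 / lam))    # e^-1 ≈ 0.3679, about 37% at t = 1/lambda
```

Whatever the value of λ, evaluating R at t = 1/λ always gives e^(−1) ≈ 36.8%, which is why the 37% landmark is independent of the failure rate.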
Note: Considering the flat portion of the bathtub curve model, where the Failure Rate is constant over time and remains the same for a unit regardless of the unit's age, the system is said to be "memoryless."
Reliability versus Availability

Reliability is one of the factors influencing Availability, but must not be confused with Availability: 99.99% Reliability does not mean 99.99% Availability. Reliability measures the ability of a system to function without interruptions, while Availability measures the ability of the system to provide a specified application service level. Higher reliability reduces the frequency of inoperative states, thereby increasing overall Availability.
There is a difference between Hardware MTBF and System MTBF. The mean time between hardware component failures occurring on an I/O Module, for example, is referred to as the Hardware MTBF. The mean time between failures occurring on a system considered as a whole, a PLC configuration for example, is referred to as the System MTBF.

As will be demonstrated, hardware component redundancy provides an increase in the overall System MTBF, even though each individual component's MTBF remains the same.
Although Availability is a function of Reliability, it is possible for a system with poor Reliability to achieve high Availability. For example, consider a system that averages 4 failures a year, each of which can be restored with an average outage time of 1 minute. MTBF is then 131,400 minutes (525,600 minutes per year divided by 4 failures), with an MTTR of 1 minute.

In that one-year period:

• Reliability: R(t) = e^(−λt) = e^(−4) = 1.83%, which is very poor Reliability
• (Inherent) Availability: Ai = MTBF / (MTBF + MTTR) = 131,400 / (131,400 + 1) = 99.99924%, which is very good Availability
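The figures in this example can be reproduced as follows; this is a sketch of the calculation described above, with no values beyond those given in the example:

```python
import math

failures_per_year = 4
mttr_minutes = 1
minutes_per_year = 365 * 24 * 60                       # 525,600 minutes

mtbf = minutes_per_year / failures_per_year            # 131,400 minutes
reliability_1y = math.exp(-failures_per_year)          # R(1 year) = e^-4
availability = mtbf / (mtbf + mttr_minutes)            # A_i = MTBF/(MTBF+MTTR)

print(f"R(1 year) = {reliability_1y:.4f}")   # ≈ 0.0183, very poor Reliability
print(f"A_i       = {availability:.7f}")     # ≈ 0.9999924, very good Availability
```

Short outages keep MTTR tiny relative to MTBF, which is what lets Availability stay high despite frequent failures.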
Reliability Block Diagrams (RBD)
Based on basic probability computation rules, RBDs are simple, convenient tools to represent a system and its components in order to determine the Reliability of the system. The target system, for example a PLC rack, must first be interpreted in terms of series and parallel arrangements of elementary parts.
Series-Parallel Systems

The following figure shows a representation of a serial architecture:

As an example, assume that one of the 5 modules (1 power supply and 4 other modules) that populate the PLC rack becomes inoperative. As a consequence, the entire rack is affected, as it is no longer 100% capable of performing its assigned mission, regardless of which module is inoperative. Thus each of the 5 modules is considered a participating member of a 5-part series.

Note: When considering Reliability, two components are described as in series if both are necessary to perform a given function.
The following figure shows a representation of a parallel architecture:

As an example, assume that the PLC rack now contains redundant Power Supply modules, in addition to the 4 other modules. If one Power Supply becomes inoperative, then the other supplies power for the entire rack. These 2 power supplies would be considered a parallel sub-system, in turn coupled in series with the sequence of the 4 other modules.

Note: Two components are in parallel, from a reliability standpoint, when the system works if at least one of the two components works. In this example, Power Supplies 1 and 2 are said to be in active redundancy. The redundancy would be described as passive if one of the parallel components is turned on only when the other is inoperative, for example in the case of auxiliary power generators.
Serial RBD

Reliability
Serial system Reliability is equal to the product of the individual elements' reliabilities:

R_S(t) = R_1(t) × R_2(t) × R_3(t) × ... × R_n(t) = ∏(i=1..n) R_i(t)

where: R_S(t) = System Reliability
R_i(t) = Individual Element Reliability
n = Total number of elements in the serial system

Assuming constant individual Failure Rates:

R_S(t) = R_1(t) × R_2(t) × R_3(t) × ... × R_n(t) = e^(−λ_1·t) × e^(−λ_2·t) × e^(−λ_3·t) × ... × e^(−λ_n·t) = e^(−(λ_1 + λ_2 + λ_3 + ... + λ_n)·t)

That is, λ_S = Σ(i=1..n) λ_i: the equivalent Failure Rate for n serial elements is equal to the sum of the individual Failure Rates of these elements, with R_S(t) = e^(−λ_S·t).
Example 1:

Consider a system with 10 elements, each of them required for the proper operation of the system, for example a 10-module rack. Determine R_S(t), the Reliability of that system over a given time interval t, if each of the elements shows an individual Reliability R_i(t) of 0.99:

R_S(t) = ∏(i=1..10) R_i(t) = (0.99) × (0.99) × ... × (0.99) = (0.99)^10 = 0.9044

Thus the targeted system Reliability is 90.44%.
Example 2:

Consider two serial elements with the following Failure Rates:

Element 1: λ_1 = 120 × 10⁻⁶ h⁻¹
Element 2: λ_2 = 180 × 10⁻⁶ h⁻¹
The System Reliability, over a 1,000- hour mission, is:<br />
<strong>High</strong> <strong>Availability</strong> <strong>Theoretical</strong> <strong>Basics</strong><br />
n<br />
λ<br />
S<br />
= ∑ λ<br />
i 1<br />
i<br />
= λ<br />
1<br />
+ λ<br />
=<br />
2<br />
= 120 x 10 -6 + 180 x 10 -6 = 300 x 10 -6 = 0.3 x 10 -3 h -1<br />
−λ<br />
3 x 103<br />
S<br />
t − 0.3 x 10−<br />
− 0.3<br />
R S(1000<br />
h) = e = e<br />
= e = 0.7408<br />
Thus the targeted system Reliability is 74.08%<br />
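The same result can be reproduced from the constant-failure-rate formulas, again as a sketch using the example's rates:

```python
from math import exp

def serial_failure_rate(rates_per_hour):
    # Equivalent failure rate of serial elements: the sum of individual rates
    return sum(rates_per_hour)

def reliability(rate_per_hour, mission_hours):
    # R(t) = e^(-lambda * t), valid for a constant failure rate
    return exp(-rate_per_hour * mission_hours)

lam_s = serial_failure_rate([120e-6, 180e-6])  # 300e-6 per hour
print(round(reliability(lam_s, 1000), 4))      # 0.7408
```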
<strong>Availability</strong><br />
Serial system <strong>Availability</strong> is equal to the product of the individual elements’<br />
Availabilities.<br />
A_S = A_1 × A_2 × A_3 × ... × A_n = ∏(i=1 to n) A_i

where: A_S = System (asymptotic) Availability
A_i = Individual element (asymptotic) Availability
n = Total number of elements in the serial system
Calculation Example
In this example, we calculate the availability of a PAC Station using shared distributed<br />
I/O Islands. The following illustration shows the final configuration:<br />
This calculation applies the equations given by basic probability analysis. To do this<br />
calculation, a spreadsheet was developed. These are the figures applied in the<br />
spreadsheet:<br />
Failure rate: λ = 1/MTBF
Reliability: R(t) = e^(−λt)
Total serial system Failure Rate: λ_S = ∑(i=1 to n) λ_i
Total serial system MTBF: MTBF_S = 1/λ_S
Availability = MTBF / (MTBF + MTTR)
Unavailability = 1 − Availability
Unavailability over a year: Unavailability × hours (one year = 8,760 hours)
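The spreadsheet logic can be sketched in Python; the module MTBF values below are hypothetical placeholders for the spreadsheet inputs, not Schneider Electric figures:

```python
HOURS_PER_YEAR = 8760

def failure_rate(mtbf_hours):
    return 1.0 / mtbf_hours

def serial_system_mtbf(mtbf_list):
    # Failure rates of serial elements add, so MTBF_S = 1 / sum(1/MTBF_i)
    return 1.0 / sum(failure_rate(m) for m in mtbf_list)

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

def yearly_downtime_hours(avail):
    return (1.0 - avail) * HOURS_PER_YEAR

# Hypothetical rack: CPU + power supply + two I/O modules (MTBF in hours)
rack = [500_000, 800_000, 1_200_000, 1_200_000]
mtbf_s = serial_system_mtbf(rack)
avail_s = availability(mtbf_s, mttr_hours=2)
print(round(mtbf_s), round(avail_s, 6), round(yearly_downtime_hours(avail_s), 2))
```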
The following table shows the method to perform the calculation:<br />
Step Action<br />
1 Perform the calculation of the Standalone CPU.<br />
2 Perform the calculation of a distributed island.<br />
3 Based on the serial structure, add up the results from Steps 1 and 2.<br />
Note: A common variant of in-rack I/O Module stations are I/O Islands, distributed on<br />
an Ethernet communication network. <strong>Schneider</strong> <strong>Electric</strong> offers a versatile family<br />
named Advantys STB, which can be used to define such architectures.<br />
Step 1: Calculation linked to the Standalone CPU Rack.<br />
The following figures represent the Standalone CPU Rack:<br />
The following screenshot is the spreadsheet corresponding to this analysis.<br />
Step 2: Calculation linked to the STB Island.<br />
The following figures represent the Distributed I/O on STB Island:<br />
The following screenshot is the spreadsheet corresponding to this analysis:<br />
Step 3: Calculation of the entire installation. Assume that the communication network used to link the I/O Islands to the CPU is not included here; examples of network Reliability metrics are explored in a subsequent chapter.
The following figures represent the final Distributed architecture:<br />
The following screenshot is the spreadsheet corresponding to the entire analysis:<br />
Note: The highlighted values were calculated in the two previous steps<br />
Considering the results of this Serial System (Rack # 1+ Islands # 1 ... #4), Reliability over<br />
one year is approximately 82 % (the probability that this system will encounter one failure<br />
during one year is approximately 18%).<br />
System MTBF itself is approximately 44,000 hours (about 5 years)<br />
Considering the Availability, with a 2-hour Mean Time To Repair (typical of a very good logistics and maintenance organization), the system would achieve a 4-nines Availability, i.e. an average of approximately 24 minutes of downtime per year.
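These figures can be cross-checked quickly; 44,000 h and 2 h are the values quoted above:

```python
HOURS_PER_YEAR = 8760

mtbf, mttr = 44_000, 2            # system MTBF and MTTR quoted in the text
avail = mtbf / (mtbf + mttr)
downtime_min = (1 - avail) * HOURS_PER_YEAR * 60

print(f"availability = {avail:.6f}")         # roughly 0.99995 (4-nines)
print(f"downtime = {downtime_min:.0f} min")  # roughly 24 minutes per year
```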
Parallel RBD
Reliability
Theory of Probability provides an expression for the Reliability of a Parallel System (for example, a Redundant System), with Q_i(t) (Unreliability) being the complement of R_i(t):
R_Red(t) = 1 − [Q_1(t) × Q_2(t) × Q_3(t) × ... × Q_n(t)] = 1 − ∏(i=1 to n) Q_i(t) = 1 − ∏(i=1 to n) (1 − R_i(t))

with: R_Red(t) = Reliability of the Simple Redundancy System
Q_i(t) = 1 − R_i(t)
∏(i=1 to n) Q_i(t) = Probability of Failure of the System
R_i = Probability of Non-Failure of an Individual Parallelized Element
Q_i = Probability of Failure of an Individual Parallelized Element
n = Total number of Parallelized Elements

Example:
Considering two elements with the following failure rates:<br />
λ1 = 120 × 10⁻⁶ h⁻¹ and λ2 = 180 × 10⁻⁶ h⁻¹
The System Reliability, over a 1,000-hour mission, is:<br />
Reliability of elements 1 and 2 over the 1,000-hour period:<br />
R_1(1,000 h) = e^(−λ1·t) = e^(−120 × 10⁻⁶ × 10³) = 0.8869
R_2(1,000 h) = e^(−λ2·t) = e^(−180 × 10⁻⁶ × 10³) = 0.8353

Unreliability of elements 1 and 2 over the 1,000-hour period:

Q_1(1,000 h) = 1 − R_1(1,000 h) = 1 − 0.8869 = 0.1131
Q_2(1,000 h) = 1 − R_2(1,000 h) = 1 − 0.8353 = 0.1647
Redundant System Reliability over the 1,000-hour period:

R_12(t = 1,000 h) = 1 − [Q_1(1,000 h) × Q_2(1,000 h)] = 1 − (0.1131 × 0.1647) = 0.9814

Thus, with Individual Elements' Reliabilities of 88.69% and 83.53% respectively, the targeted Redundant System Reliability is 98.14%.

Availability

Parallel system Unavailability is equal to the product of the individual elements' Unavailabilities. Thus the Parallel system Availability, given the individual parallelized elements' Availabilities, is:
A_S = 1 − [(1 − A_1) × (1 − A_2) × ... × (1 − A_n)] = 1 − ∏(i=1 to n) (1 − A_i)

where: A_S = System (asymptotic) Availability
A_i = Individual element (asymptotic) Availability
n = Total number of elements in the parallel system
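A short sketch of the parallel RBD formulas, reproducing the 98.14% result from the example above (the failure rates are the example's values):

```python
from math import exp, prod

def parallel_reliability(reliabilities):
    # Parallel RBD: 1 minus the product of element unreliabilities
    return 1.0 - prod(1.0 - r for r in reliabilities)

def parallel_availability(availabilities):
    # Same structure: individual unavailabilities multiply
    return 1.0 - prod(1.0 - a for a in availabilities)

r1 = exp(-120e-6 * 1000)  # about 0.8869
r2 = exp(-180e-6 * 1000)  # about 0.8353
print(round(parallel_reliability([r1, r2]), 4))  # 0.9814
```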
Calculation Example
To illustrate a Parallel system, we perform the calculation of the <strong>Availability</strong> of a<br />
redundant PLC System using shared distributed I/O Islands.<br />
The following illustration shows the final configuration:<br />
The formulas are the same as used in the previous calculation example, except for<br />
the calculation of the reliability for a parallel system, which is as follows:<br />
R_Red(t) = 1 − [Q_1(t) × Q_2(t) × Q_3(t) × ... × Q_n(t)] = 1 − ∏(i=1 to n) Q_i(t) = 1 − ∏(i=1 to n) (1 − R_i(t))

The following table shows the method to perform the calculation:

Step Action
1 Perform the calculation of a standalone CPU.
2 Perform the calculation for the redundant structure, here the two CPUs.
3 Perform the calculation of a distributed island.
4 Concatenate the results from Steps 1, 2 and 3.
Note: the previous results from the serial analysis, regarding the calculation linked to<br />
the standalone elements, are reused.<br />
Step 1: Calculation linked to the Standalone CPU Rack.<br />
Because the analysis is identical to that for the serial case, the following screenshot<br />
shows the spreadsheet corresponding only to the final results:<br />
Step 2: Calculation of the redundant CPU group.<br />
The following figures represent the Redundant CPUs:<br />
The following screenshot is the spreadsheet corresponding to this analysis:<br />
Step 3: Calculation linked to the STB Island.<br />
Because the analysis is identical to that for the serial case, the following screenshot<br />
shows the spreadsheet corresponding to the final results only:<br />
Step 4: Calculation of the entire installation.
The following screenshot is the spreadsheet corresponding to the entire analysis:<br />
Note: The only difference between this architecture and the previous one relates to<br />
the CPU Rack: this one is a redundant one (Premium Hot Standby), while the former<br />
one was a standalone one.<br />
Looking at the results of this Parallel System (Premium CPU Rack Redundancy), Reliability over one year would be approximately 99.9%, compared to 97.4% with a Standalone Premium CPU Rack (i.e. the probability for a Premium Rack System to encounter one failure during one year would have been reduced from 2.6% to 0.1%).

System MTBF itself would increase from 335,000 hours (approximately 38 years) to 503,000 hours (approximately 57 years).

For System Availability, a 2-hour Mean Time To Repair provides approximately a 9-nines resulting Availability (almost 100%).
Note: Other calculation examples are available in the Calculation Examples chapter.

Note: The previous examples cover the PAC station only. To extend the calculation to a whole system, the MTBF of the network components and the SCADA systems (PCs, servers) must be taken into account.
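As a cross-check of the standalone-versus-redundant figures above, the sketch below assumes the classic repair-free result MTBF_pair = 1.5 × MTBF for two identical units in parallel; it reproduces the quoted one-year reliabilities:

```python
from math import exp

HOURS_PER_YEAR = 8760
mtbf_alone = 335_000          # standalone rack MTBF quoted in the text
lam = 1 / mtbf_alone

r_alone = exp(-lam * HOURS_PER_YEAR)   # one-year reliability, standalone
r_pair = 1 - (1 - r_alone) ** 2        # simple redundant pair, repair not modeled
mtbf_pair = 1.5 * mtbf_alone           # classic result for two identical units

print(round(r_alone, 3), round(r_pair, 4), round(mtbf_pair))
```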
Conclusion<br />
Serial System<br />
The above computations demonstrate that the combined availability of two<br />
components in series is always lower than the availability of its individual components.<br />
The following table gives an example of combined availability in serial system:<br />
Component | Availability | Downtime
X | 99% (2-nines) | 3.65 days/year
Y | 99.99% (4-nines) | 52 minutes/year
X and Y combined | 98.99% | 3.69 days/year
This table indicates that even though a very high availability Part Y was used, the<br />
overall availability of the system was reduced by the low availability of Part X. A<br />
common saying indicates that "a chain is as strong as the weakest link", however, in<br />
this instance a chain is actually “weaker than the weakest link.”<br />
Parallel System<br />
The above computations indicate that the combined availability of two components in<br />
parallel is always higher than the availability of its individual components.<br />
The following table gives an example of combined availability in a parallel system:<br />
Component | Availability | Downtime
X | 99% (2-nines) | 3.65 days/year
Two X components operating in parallel | 99.99% (4-nines) | 52 minutes/year
Three X components operating in parallel | 99.9999% (6-nines) | 31 seconds/year
This indicates that even though a very low availability Part X was used, the overall<br />
availability of the system is much higher. Thus redundancy provides a very powerful<br />
mechanism for making a highly reliable system from low reliability components.<br />
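The parallel table above can be reproduced with a short sketch:

```python
HOURS_PER_YEAR = 8760

def n_parallel_availability(a, n):
    # n identical components in parallel: 1 - (1 - a)^n
    return 1.0 - (1.0 - a) ** n

for n in (1, 2, 3):
    avail = n_parallel_availability(0.99, n)
    downtime_s = (1.0 - avail) * HOURS_PER_YEAR * 3600
    print(f"{n} component(s): availability {avail:.6f}, about {downtime_s:.0f} s/year down")
```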
<strong>High</strong> <strong>Availability</strong> with Collaborative Control System<br />
In an automation system, how can you reach the level of availability required to keep the plant in operation? By what means should you reinforce the system architecture to provide and maintain access to the information required to monitor and control the process?
This chapter provides answers to these questions, and reviews the system<br />
architecture from top to bottom, that is, from operator stations and data servers<br />
(Information Management) to Controllers and Devices (Control System Level), via<br />
communication networks (Communication Infrastructure Level).<br />
The following figure illustrates the Collaborative Control System:

The system architecture drawing above shows various redundancy capabilities that can be proposed:
• Dual SCADA Vijeo Citect servers
• Dual Ethernet control network managed by ConneXium switches
• Premium Hot Standby with Ethernet device bus
• Quantum Hot Standby with Remote I/O, Profibus and Ethernet
• Quantum Safety controller Hot Standby with Remote I/O
Information Management Level<br />
Redundancy Level<br />
Key Features<br />
This section explains how to implement various technological means to address<br />
architecture challenges, with examples using a current client/server paradigm. The<br />
section describes:<br />
• A method to define a redundancy level<br />
• The most appropriate system architecture for the defined level<br />
The key features Vijeo Citect SCADA software has to handle relate to:
• Data acquisition
• Events and Alarms (including time stamping)
• Measurements and trends
• Recipes
• Reports
• Graphics (displays and user interfaces)
In addition, any current SCADA package provides access to a treatment/calculation<br />
module (openness module), allowing users to edit program code (IML/CML, Cicode,<br />
Script VB, etc.).<br />
Note: This model is applicable for a single station (PC), including for small<br />
applications. The synthesis between the stakes and the key features will help to<br />
determine the most appropriate redundant solution.<br />
Stakes<br />
Risk analysis<br />
Level Definition<br />
Considering the previously defined key features, stakes when designing a SCADA<br />
system include:<br />
• Is a changeover possible?<br />
• Does a time limit exist?<br />
• What are the critical data?<br />
• Has cost reduction been considered?<br />
Linked to the previous stakes, the risk analysis is essential to defining the redundancy<br />
level. Consider the events the SCADA system will face, i.e. the risk, in terms of the<br />
following:<br />
• Inoperative hardware<br />
• Inoperative power supply<br />
• Environmental events (natural disaster, fire, etc.)<br />
These events can imply loss of data, operator screens, connections with devices, and so on.
Finally, the redundancy level is defined as the compilation of the key features, the<br />
stakes, and the risk analysis with the customer expectations related to the data<br />
criticality level. The following table illustrates the flow from the process analysis to the<br />
redundancy level:<br />
The following table explains the redundancy levels:<br />
Redundancy Level | State of the standby system | Switchover performance
No redundancy | No standby system | Not applicable
Cold Standby | The Standby system is only powered on if the default system becomes inoperative. | Several minutes; large amount of lost data
Warm Standby | The Standby system switches from normal to backup mode. | Several seconds; small amount of lost data
Hot Standby | The Standby system runs together with the default system. | Transparent; no lost data

Architecture Redundancy Solutions Overview
This section examines various redundancy solutions. A Vijeo Citect SCADA system<br />
can be redundant at the following levels:<br />
• Clients, i.e. operator stations<br />
• Data servers<br />
• Control and information network<br />
• Targets, i.e. PAC station, controller, devices<br />
The Vijeo Citect functional organization corresponds directly<br />
to a Client/Server philosophy. An example of a Client/Server<br />
topology is shown in the diagram to the left: a single Display<br />
Client Operator Station with a single I/O Server, in charge of<br />
device data (PLC) communication.<br />
Vijeo Citect architecture is a combination of several operational entities that handle Alarms, Trends and Reports, respectively. In addition, this functional architecture includes at least one I/O Server. The I/O Server acts as a Client to the peripheral devices (PAC) and as a Server to the Alarms, Trends and Reports (ATR) entities.
As shown in the figure above, ATR and I/O Server(s) act either as a Client or as a<br />
Server, depending on the designated relationship. The default mechanism linking<br />
these Clients and Servers is based on a Publisher / Subscriber relation.<br />
As shown in the following screenshot, because of its client server model, Vijeo Citect<br />
can create a dedicated server, depending on the application requirements: for<br />
example for ATR, or for I/O server:<br />
Clients: Operator Workstations
Vijeo Citect is able to manage the redundancy at the operator station level with<br />
several client workstations. These stations can be located in the control room or<br />
distributed in the plant close to the critical part of the process. A web client interface<br />
can also be used to monitor and control the plant using a standard web browser.<br />
If an operator station becomes inoperative, the plant can still be monitored and controlled using an additional operator screen.
Servers: Resource Duplication Solution<br />
This example assumes a Field Devices Communication Server, providing Services to<br />
several types of Clients, such as Alarm Handling, Graphic Display, etc.<br />
A first example of Redundancy is a complete duplication of the first Server. Basically, if a system becomes inoperative, for example Server A, Server B takes over the job and responds to the service requests presented by the Clients.
Vijeo Citect can define Primary and Standby servers within a project, with each<br />
element of a pair being held by different hardware.<br />
The first level of redundancy duplicates the I/O<br />
server or the ATR server, as shown in the<br />
illustration. In this case, a Standby server is<br />
maintained in parallel to the Primary server. In the<br />
event of a detected interruption on the hardware,<br />
the Standby server will assume control of the<br />
communication with the devices.<br />
Based on data criticality, the second level<br />
duplicates all the servers, ATR and I/O. Identical data is maintained on both servers.<br />
Network: Data Path Redundancy

In the diagram to the left, Primary and Standby I/O servers are deployed independently, while Alarms, Trends and Reports servers run as separate processes on common Primary and Standby computers.

Data path redundancy involves not alternative device(s), but alternative data paths between the I/O Server and the connected I/O Devices. Thus if one data path becomes inoperative, the other is used.

Note: Vijeo Citect reconnects through the primary data path when it is returned to service.

On a larger Vijeo Citect system, you can also use data path redundancy to maintain device communications through multiple I/O Server redundancy, as shown in the following diagram.
Target: I/O Device
Redundant LAN<br />
As previously indicated, redundancy of Alarms, Reports, Trends, and I/O Servers is<br />
achieved by adding standby servers. Vijeo Citect can also use the dual end point (or<br />
multiple network interfaces) potentially available on each server, enabling the<br />
specification of a complete and unique network connection between a Client and a<br />
Server.<br />
A given I/O Server is able to handle designated pairs of Devices, Primary and Standby. This Device Redundancy does not rely on a PLC Hot Standby mechanism: Primary and Standby devices are assumed to be concurrently acting on the same process, but no assumption is made concerning the relationship between the two devices. Seen from the I/O Server, this redundancy offers access only to an alternate device, in case the first device becomes inoperative.

Multiple I/O Device Redundancy is an extension of I/O Device Redundancy, providing for more than one Standby I/O Device. Depending on the user configuration, a given order of priority applies when an I/O Server (potentially a redundant one) needs to switch to a Standby I/O Device. For example, in the figure above, I/O Device 3 would be allotted the highest priority, then I/O Device 2, then finally I/O Device 4.
Clustering<br />
In those conditions, in case of a detected interruption occurring on Primary I/O Device<br />
1, a switchover would take place, with I/O Server 2 handling communications, and<br />
with Standby I/O Device 3. If an interruption is now detected on I/O Device 3, a new<br />
switchover would take place, with I/O Server 1 handling communications, with<br />
Standby I/O Device 2. Finally, if there is an interruption on I/O Device 2, another<br />
switchover would take place, with I/O Server 2 handling communications, with Standby I/O Device 4.
Refer again to the diagram on the left, explaining Redundancy basics, and examine the functional representation as a cluster.

A cluster may contain several possibly redundant I/O Servers (maximum of one per machine), and standalone or redundant ATR servers; these latter servers can be implemented either on a common machine or on separate machines.
The cluster concept offers a response to a typical scenario of a system separated into several sites, with each of these sites being controlled by local operators and supported by local redundant servers. The clustering model can concurrently address an additional level of management that requires all sites across the system to be monitored simultaneously from a central control room.
With this scenario, each site is represented with a separate cluster, grouping its<br />
primary and standby servers. Clients on each site are interested only in the local<br />
cluster, whereas clients at the central control room are able to view all clusters.<br />
Based on cluster design, each site can then be addressed independently within its own cluster. As a result, deployment of a control room scenario is fairly straightforward, with the control room itself only needing display clients.
The cluster concept does not actually provide an additional level of redundancy.<br />
Regarding data criticality, clustering organizes servers, and consequently provides<br />
additional flexibility.<br />
Conclusion<br />
Each cluster contains only one pair each of ATR servers. Those pairs of servers,<br />
redundant to each other, must be on different machines.<br />
Each cluster can contain an unlimited number of I/O servers; those servers must also be on different machines, which increases the level of system availability.
The following illustration shows a complete installation, in which the redundant solutions previously discussed can be identified:
• SCADA Clients
• Data Servers
• Control Network
• Targeted Devices
Communication Infrastructure Level<br />
The previous section reviewed various aspects of enhanced availability at the<br />
Information Management level, focusing on SCADA architecture, represented by<br />
Vijeo Citect. This section covers <strong>High</strong> <strong>Availability</strong> concerns between the Information<br />
level and the Control Level.<br />
A proper design at the communication infrastructure level must include:<br />
• Analysis of the plant topology<br />
• Localization of the critical process steps<br />
• The definition of network topologies<br />
• The appropriate use of communication protocols<br />
Plant Topology<br />
The first step of the communication infrastructure level definition is the plant topology<br />
analysis. From this survey, the goal is to gather information to develop a networking<br />
system diagram, prior to defining the network topologies.<br />
This plant topology analysis must be done as a top-down process:<br />
• Break-down of the plant into selected areas<br />
• Localization of the areas to be connected<br />
• Localization of the hazardous areas<br />
• Localization of the station and the nodes included in these areas to be connected<br />
• Localization of the existing networks & cabling paths, in the event of expansion or<br />
redesign<br />
Before defining the network topologies, the following project requirements must be<br />
considered:<br />
• <strong>Availability</strong> expectation, according to the criticality of the process or data.<br />
• Cost constraints<br />
• Operator skill level<br />
From the project and the plant analyses, identify the most critical areas:<br />
Network Topology<br />
Topologies<br />
Following the criticality analysis, the networking diagram can be defined by selecting<br />
the relevant Network topology.<br />
The following table describes the four main topologies from which to choose:<br />
Architecture | Limitations | Advantages | Disadvantages
Bus | The traffic must flow serially, therefore the bandwidth is not used efficiently | Cost-effective solution | If a switch becomes inoperative, the communication is lost
Star | Cable ways and distances | Efficient use of the bandwidth, as the traffic is spread across the star; preferred topology when there is no need for redundancy | If the main switch becomes inoperative, the communication is lost
Ring | Behavior quite similar to Bus | Auto-configuration if used with a self-healing protocol; possible to couple other rings to increase redundancy | The auto-configuration depends on the protocol used
Note: These different topologies can be mixed to define the plant network diagram.<br />
Ring Topology<br />
The following diagram shows the level of availability based on topology:<br />
In automation architecture, Ring (and Dual ring) topologies are the most commonly<br />
used to increase the availability of a system.<br />
Mesh architecture is not used in process applications; therefore we do not discuss it in detail. All these topologies can be implemented using Schneider Electric ConneXium switches.

In a ring topology, four events can occur that lead to a loss of communication:
1) Broken ring line
2) Inoperative ring switch
3) Inoperative end-of-line device line
4) Inoperative end-of-line device switch
The following diagram illustrates these four occurrences:

To protect the network architecture from these events, several communication protocols are proposed and described in the following section.

A solution able to enhance networking availability while preserving budget considerations consists of reducing both limitations of an Ethernet network: distributed as a bus, but built as a ring. At least one specific active component is necessary: a network switch usually named the Redundancy Manager (RM).

Consider an Ethernet loop designed with such an RM switch. In normal conditions, this RM switch opens the loop, which prevents Ethernet frames from circulating endlessly.

If a break occurs, the Redundancy Manager switch reacts immediately and closes the Ethernet loop, bringing the network back to full operating condition.

Note that the term Self-Healing Ring concerns the ring management only; once cut, the cable is not able to repair itself.
A mix of Dual Networking and Network Redundancy is possible. Note that in such a<br />
design, a SCADA I/O Server has to be equipped with two communication boards, and<br />
reciprocally, each device (PLC) has to be allotted two Ethernet ports.<br />
Redundant Coupling of a Ring Network or a Network Segment<br />
Topological considerations may lead to consideration of a network layout aggregating<br />
satellite rings or segments around a backbone network (itself designed as a ring or as<br />
a segment).<br />
This may be an effective design, considering the junction between trunk and satellites,<br />
especially if backbone and satellite networks have been designed as ring networks to<br />
provide for <strong>High</strong> <strong>Availability</strong>.<br />
With the ConneXium product line, Schneider Electric offers switches that afford redundant coupling. Several variations allow connection to the network. Each of these variations features two "departure" switches on the backbone network; each departure switch crosses a separate link to access the satellite network.
These variations include:
1. A single pivot "arrival" switch on the connected network
2. Two different arrival switches on the connected network, with link synchronization making good use of the backbone and satellite networks
3. Two different switches to access the connected network, with link synchronization carried via an additional specific link established between the two arrival switches on the connected network
The following drawing illustrates this architecture:<br />
[Diagram: two RM "departure" switches on the backbone ring, coupled through a redundant line to RM switches on the satellite ring]<br />
Ring coupling capabilities increase the level of networking availability by allowing<br />
different paths to access targeted devices.<br />
The new generation of <strong>Schneider</strong> ConneXium switches supports many architectures<br />
based on dual rings. A single switch is now able to couple two Ethernet rings,<br />
extending the capabilities of the Ethernet architecture.<br />
Dual Ring in One Switch<br />
The following illustration shows the architecture, which allows the combination of two<br />
rings managed by a single switch:<br />
[Diagram: a Premium Hot Standby pair (CPU sync link) whose ETY modules sit on two MRP / RSTP rings coupled by one switch]<br />
Dual Ring in Two Switches<br />
The following architecture bypasses a single detected interruption.<br />
Dual Ring Extension in Two Switches<br />
The following architecture allows extension of the main network to other segments:<br />
This concept of sub-rings enables coupling segments to existing redundant rings.<br />
The devices in the main ring (1) are seen as Sub-Ring Managers (SRM) for the newly<br />
connected sub-ring (2).<br />
Daisy Chain Loop Topology<br />
Ethernet "daisy chain" refers to the integration of a switch function inside a<br />
communicating device; as a result, this "daisy-chainable" device offers two<br />
Ethernet ports, for example one "in" port and one "out" port. The advantage of such<br />
a daisy-chainable device is that its installation inside an Ethernet network requires<br />
only two cables.<br />
In addition, a daisy-chain layout can correspond to either a network segment or a<br />
network loop; a managed Connexium switch featuring an RM capability is able to<br />
handle such a daisy-chained loop.<br />
The first daisy-chainable devices <strong>Schneider</strong> <strong>Electric</strong> plans to offer are:<br />
- Advantys STB dual-port Ethernet communication adapter (STB NIP 2311)<br />
- Advantys ETB IP67 dual-port Ethernet<br />
- Motor controller TeSys T<br />
- Variable speed drive ATV 61/71 (VW3A3310D)<br />
- PROFIBUS DP V1 Remote Master<br />
- ETG 30xx Factorycast gateway<br />
Note: Assuming no specific redundancy protocol is selected to handle the daisy<br />
chain loop, expected loop reconfiguration time on failover is approximately one<br />
second.<br />
[Diagram: SCADA stations and a Primary / Standby Hot Standby pair on redundant MRP / RSTP rings, with daisy-chained devices coupled through RM switches]<br />
Daisy chaining topologies can be coupled to dual Ethernet rings using TCSESM<br />
ConneXium switches.<br />
Redundancy Communication Protocols<br />
The management of an Ethernet ring requires dedicated communication protocols, as<br />
described in the following table. Each protocol is characterized by different<br />
performance criteria in terms of fault detection and global system recovery time:<br />
- MRP: part of IEC 62439 FDIS; recovery time less than 200 ms; 50 switches<br />
maximum; mix of switches in the network (Cisco/3Com/Hirschmann): yes.<br />
- Rapid Spanning Tree (RSTP): recovery time up to 1 second, depending on the<br />
number of switches; mix of switches in the network: yes.<br />
- HIPER-Ring: recovery time 300 or 500 ms; 50 switches maximum; mix of<br />
switches in the network: no.<br />
- Fast HIPER-Ring (new features available only on Extended Connexium<br />
switches): recovery time 10 ms for 5 switches, plus 160 microseconds for each<br />
additional switch; mix of switches in the network: no.<br />
Rapid Spanning Tree Protocol (RSTP)<br />
RSTP stands for Rapid Spanning Tree Protocol (IEEE 802.1w standard) 1 . Based on<br />
STP, RSTP has introduced some additional parameters that must be entered during<br />
the switch configuration. These parameters are used by the RSTP protocol during the<br />
path selection process; because of these, the reconfiguration time is much faster than<br />
with STP (typically less than one second).<br />
1 The new edition of the 802.1D standard, IEEE 802.1D-2004, incorporates the IEEE 802.1t-2001 and IEEE<br />
802.1w standards.<br />
The new release of TCSESM ConneXium switches allows better RSTP performance,<br />
with a detection time of 15 ms and a propagation time of 15 ms per switch.<br />
Considering a 6-switch configuration, the recovery time is about 105 ms<br />
(15 ms + 6 × 15 ms).<br />
HIPER-Ring (Version 1)<br />
Version 1 of the HIPER-Ring networking strategy has been available for<br />
approximately 10 years. It applies to a Self Healing Ring networking layout.<br />
Such a ring structure may include up to 50 switches. It typically features a<br />
reconfiguration delay of 150 milliseconds 2 and a maximum time of 500 ms. As a result,<br />
in case of an issue occurring on a link cable or on one of the switches populating the<br />
ring, the network will take about 150 ms to detect this event, and cause the<br />
Redundancy Manager switch to close the loop.<br />
Note: The Redundancy Manager switch is said to be active when it opens the<br />
network.<br />
HIPER-Ring Version 2 (MRP)<br />
Note: If a recovery time of 500 ms is acceptable, then no switch redundancy<br />
configuration is needed; only DIP switches have to be set up.<br />
2 When configuring a Connexium TCS ESM switch for HIPER-Ring V1, the user is asked to choose<br />
between a maximum Standard Recovery Time, which is 500 ms, and a maximum Accelerated Recovery<br />
Time, which is 300 ms.<br />
MRP is an IEC 62439 industry standard protocol based on HIPER-Ring. Therefore,<br />
any switch manufacturer can implement MRP if it chooses to. This allows a mix of<br />
different manufacturers' switches in an MRP configuration. <strong>Schneider</strong> <strong>Electric</strong><br />
switches support a selectable maximum recovery time of 200 ms or 500 ms and a<br />
50-switch maximum ring configuration.<br />
TCSESM switches also support redundant coupling of MRP rings. MRP rings can<br />
easily be used instead of HIPER-Ring. MRP requires that all switches be configured<br />
via web pages and allows for a recovery time of 200 ms or 500 ms. Additionally, the<br />
I/O network could be an MRP redundant network and the control network<br />
HIPER-Ring, or vice versa.<br />
Fast HIPER-Ring<br />
A new family of Connexium switches, named TCS ESM Extended, is coming. It will<br />
offer a third version of the HIPER-Ring strategy, named Fast HIPER-Ring.<br />
Featuring a guaranteed recovery time of less than 10 milliseconds, the Fast<br />
HIPER-Ring structure allows both a cost-optimized implementation of a redundant<br />
network and maintenance and network extension during operation. This makes Fast<br />
HIPER-Ring especially suitable for complex applications such as the combined<br />
transmission of video, audio and data information.<br />
Selection<br />
To end the communication level section, the following table presents all the<br />
communication protocols, and thus helps you select the most appropriate<br />
installation for your high availability solution:<br />
- Ease of configuration or installed base: HIPER-Ring. If a recovery time of 500 ms<br />
is acceptable, no switch redundancy configuration is needed; only DIP switches<br />
have to be set up.<br />
- New installation: MRP. All switches are configured via web pages; the installation<br />
has one MRM (Media Ring Manager) and X MRCs (Media Ring Clients).<br />
- Open architecture with multiple vendor switches: RSTP. Reconfiguration time:<br />
15 ms (detected fault) + 15 ms per switch.<br />
- Complex architecture: MRP, RSTP or Fast HIPER-Ring. We recommend MRP or<br />
RSTP for <strong>High</strong> <strong>Availability</strong> with dual rings, and Fast HIPER-Ring for high<br />
performance.<br />
Control System Level<br />
Redundancy Principles<br />
Having detailed <strong>High</strong> <strong>Availability</strong> aspects at the Information Management level and at<br />
the Communication Infrastructure level, we will now concentrate on <strong>High</strong> <strong>Availability</strong><br />
concerns at the Control Level. Specific discussion will focus on PAC redundancy.<br />
Modicon Quantum and Premium PAC provide Hot Standby capabilities, and have<br />
several shared principles:<br />
1. The type of architecture is<br />
shared. A Primary unit executes<br />
the program, with a Standby unit<br />
ready, but not executing the<br />
program (apart from the first<br />
section of it). By default, these two units contain an identical application program.<br />
2. The units are synchronized. The standby unit is aligned with the Primary unit.<br />
Also, on each scan, the Primary unit transfers to the Standby unit its "database,"<br />
that is, the application variables (located or not located) and internal data. The<br />
entire database is transferred, except the "Non-Transfer Table", which is a<br />
sequence of Memory Words (%MW). The benefit of this transfer is that, in case of<br />
a switchover, the new Primary unit will continue to handle the process, starting<br />
with updated variables and data values. This is referred to as a "bumpless"<br />
switchover.<br />
3. The Hot Standby redundancy mechanism is controlled via the "Command<br />
Register" (accessed through the %SW60 system word); reciprocally, this Hot Standby<br />
redundancy mechanism is monitored via the "Status Register" (accessed through the<br />
%SW61 system word). As a result, as long as the application creates links<br />
between these system words and located memory words, any HMI can receive<br />
feedback regarding Hot Standby system operating conditions, and, if necessary,<br />
address these operating conditions.<br />
4. For any Ethernet port acting as a server (Modbus/TCP or HTTP protocol) on the<br />
Primary unit, its IP address is implicitly incremented by one on the Standby unit.<br />
In case a switchover occurs, homothetic addresses will automatically be<br />
exchanged. The benefit of this feature is that seen from a SCADA/HMI, the<br />
"active" unit is still accessed at the same IP address. No specific adaptation is<br />
required at the development stage of the SCADA / HMI application.<br />
5. The common programming environment that is used with both solutions is Unity<br />
Pro. No particular restrictions apply when using the standardized (IEC 1131-3)<br />
instruction set. In addition, the portion of code specific to the Hot Standby system<br />
is optional, and is used primarily for monitoring purpose. This means that with any<br />
given application, the difference between its implementation on standalone<br />
architecture and its implementation on a Hot Standby architecture is largely<br />
cosmetic.<br />
Consequently, a user familiar with one type of Hot Standby system does not have to<br />
start from scratch if he has to use a second type; initial investment is preserved and<br />
re-usable, and only a few differences must be learned to differentiate the two<br />
technologies.<br />
Premium and Quantum Hot Standby Architectures<br />
Depending on project constraints or customer requirements (performance, installed<br />
base or project specifications), a specific Hot Standby PAC station topology can be<br />
selected:<br />
- Hot Standby PAC with in-rack I/O or remote I/O<br />
- Hot Standby PAC with distributed I/O, connected on Standard Ethernet or<br />
connected to another device bus, such as Profibus DP.<br />
- Hot Standby PAC mixing different topologies<br />
The following table presents the available configurations with either a Quantum or<br />
Premium PLC:<br />
PLC / layer combinations:<br />
- Quantum: I/O Bus = Configuration 1; Ethernet = Configuration 3; Profibus =<br />
Configuration 5<br />
- Premium: I/O Bus = Configuration 2; Ethernet = Configuration 4; Profibus = not<br />
applicable<br />
Note: A sixth configuration may be considered, which combines all other<br />
configurations listed above.<br />
In-Rack and Remote I/O Architectures<br />
Quantum Hot Standby: Configuration 1<br />
With Quantum Hot Standby, in-rack I/O modules are located in the remote I/O racks.<br />
They are "shared" by both Primary and Standby CPUs, but only the Primary unit<br />
actually handles the I/O communications at any given time. In case of a switchover,<br />
the control takeover executed by the new Primary unit occurs in a bumpless way,<br />
provided that the holdup time parameter allotted to the racks is greater than the<br />
lack-of-communication gap that takes place at the switchover.<br />
The module population of a Quantum CPU rack in a Hot Standby configuration is<br />
very similar to that of a standalone PLC 3 . The only specific requirement is that the<br />
CPU module must be a 140 CPU 671 60.<br />
For redundant PAC architecture, both units require two interlinks to execute different<br />
types of diagnostic - to orient the election of the Primary unit, and to achieve<br />
synchronization between both machines. The first of these "Sync Links,” the CPU<br />
Sync Link, is a dedicated optic fiber link anchored on the Ethernet port local to the<br />
CPU module. This port is dedicated exclusively for this use on Quantum Hot Standby<br />
architecture. The second of these Sync Links, Remote I/O Sync Link, is not an<br />
additional one: the Hot Standby system uses the existing Remote I/O medium,<br />
hosting both machines, thus providing them with an opportunity to communicate.<br />
One benefit of the CPU optic fiber port is its inherent capability to have the two units<br />
installed up to 2 km apart, using 62.5/125 multimode optic fiber. The Remote I/O Sync<br />
Link can also run through optic fiber, provided that Remote I/O Communication<br />
Processor modules are coupled on the optic fiber.<br />
3 All I/O modules are accepted on remote I/O racks, except 140 HLI 340 00 (Interrupt module).<br />
Looking at Ethernet adapters currently available, 140 NWM 100 00 communication module is not<br />
compatible with a Hot Standby system. Also EtherNet/IP adapter (140 NOC 771 00) is not compatible with<br />
Quantum Hot Standby in Step 1.<br />
Up to 6 communication modules 4 , such as NOE Ethernet TCP/IP adapters, can be<br />
handled by a Quantum unit, whether it is a standalone unit or part of a Hot Standby<br />
architecture.<br />
Up to 31 Remote I/O stations can be handled from a Quantum CPU rack, whether<br />
standalone or Hot Standby. Note that the Remote I/O service payload on scan time is<br />
approximately 3 to 4 ms per station.<br />
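Using the figure above (roughly 3 to 4 ms of scan-time payload per Remote I/O station), the expected payload can be bracketed with a small helper; names and structure are ours, for illustration only.<br />

```python
# Back-of-envelope estimate of the Remote I/O service payload on scan time,
# based on the 3 to 4 ms per-station figure quoted in the text.

def rio_payload_ms(n_stations, low_ms=3, high_ms=4):
    """Return the (min, max) added scan time in ms for n_stations."""
    return n_stations * low_ms, n_stations * high_ms

print(rio_payload_ms(31))  # maximum 31-station layout -> (93, 124) ms
```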
Redundant Device Implementation<br />
Using redundant in-rack I/O modules on<br />
Quantum Hot Standby, and having to<br />
interface redundant sensors/actuators, will<br />
require redundant input / output channels.<br />
Preferably, homothetic channels should be<br />
installed on different modules and different<br />
I/O Stations. Even for a simple transfer of<br />
information to both sides of the outputs, the<br />
application must define and implement rules<br />
for selecting and treating the proper input<br />
signals. In addition to information transfer,<br />
the application will have to address<br />
diagnostic requirements.<br />
4 Acceptable communication modules are Modbus Plus adapters, Ethernet TCP/IP adapters, Ethernet/IP<br />
adapters and PROFIBUS DP V1 adapters.<br />
Single Device Implementation<br />
Assuming a Quantum Hot Standby application is required to handle redundant<br />
in-rack I/O channels, but without redundant sensors and actuators, special devices<br />
are used to handle the associated wiring requirements. Any given sensor signal,<br />
either digital or analog, passes through such a dedicated device, which replicates it<br />
and passes it on to homothetic input channels. Reciprocally, any given pair of<br />
homothetic output signals, either digital or analog, are provided to a dedicated<br />
device that selects and transfers the proper signal (i.e. the one taken on the Primary<br />
output) to the target actuator.<br />
Depending on the selected I/O Bus technology, a specific layout may result in<br />
enhanced availability.<br />
Dual Coaxial Cable<br />
Coaxial cable can be installed in<br />
either a single or a redundant<br />
design. With a redundant design,<br />
communications are duplicated on<br />
both channels, providing a "massive<br />
communication redundancy.” Either<br />
the Remote I/O Processor or Remote<br />
I/O adapters are equipped with a pair<br />
of connectors, with each connector<br />
attached to a separate coaxial<br />
distribution.<br />
Self Healing Optical Fiber Ring<br />
Remote I/O Stations can be installed as terminal nodes of a fiber optic segment or<br />
(self-healing) ring. The <strong>Schneider</strong> catalog offers one model of transceiver<br />
(490 NRP 954) applicable for 62.5/125 multimode optic fiber: 5 transceivers<br />
maximum, 10 km maximum ring circumference. When the number of transceivers is<br />
greater than 5, precise optic fiber ring diagnostics are required, or single mode fiber<br />
is required.<br />
Configuration Change on the Fly<br />
A high-level feature is currently being added to Quantum Hot Standby application<br />
design: CCTF, or Configuration Change on the Fly. This new feature will allow you<br />
to modify the configuration of an existing, running PLC application program without<br />
having to stop the PLC. As an example, consider the addition of a new discrete or<br />
analog module on a remote Quantum I/O Station. For a CPU firmware version<br />
upgrade executed on a Quantum Hot Standby architecture, this CCTF will be<br />
executed sequentially, one unit at a time. This is an obvious benefit for applications<br />
that cannot afford any stop, which can now accommodate architecture modifications<br />
or extensions.<br />
Premium Hot Standby: Configuration 2<br />
Premium Hot Standby is able to<br />
handle in-rack I/O modules, installed<br />
on Bus-X racks 5 . The Primary unit<br />
acquires its inputs, executes the logic,<br />
and updates its outputs. As a result of<br />
cyclical Primary to Standby data<br />
transfer, the Standby unit provides<br />
local outputs that are the image of the outputs decided on the Primary unit. In case of<br />
a switchover, the control takeover executed by the new Primary unit occurs in a<br />
bumpless fashion.<br />
5 Initial version of Premium Hot Standby only authorizes a single Bus-X rack on both units.<br />
The module population of a Premium<br />
CPU rack, in a Hot Standby<br />
configuration, is very similar to that of<br />
a standalone PLC (some restrictions<br />
apply on compatible modules 6 ). Two<br />
types of CPU modules are available:<br />
TSX H57 24M and TSX H57 44M,<br />
which differ mainly in regard to memory and communication resources.<br />
The first of the two Sync Links, the<br />
CPU Sync Link is a dedicated copper<br />
link anchored on the Ethernet port<br />
local to the CPU module. With<br />
Premium Hot Standby architecture, the<br />
second Sync Link, the Ethernet Sync Link, is established using a standard Ethernet<br />
TCP/IP module 7 . It corresponds to the communication adapter elected as the<br />
"monitored" adapter.<br />
6 Counting, motion, weighing and safety modules are not accepted. On the communication side, apart from<br />
Modbus modules TSX SCY 11 601/21 601, only currently available Ethernet TCP/IP modules are accepted.<br />
Also EtherNet/IP adapter (TSX ETC 100) is not compatible with Premium Hot Standby in Step 1.<br />
7 TSX ETY 4103 or TSX ETY 5103 communication module<br />
The following picture illustrates the Premium Ethernet configuration, with the CPU<br />
and the ETY Sync links:<br />
This Ethernet configuration is detailed in the following section.<br />
Redundant Device Implementation<br />
In-rack I/O module implementation on Premium<br />
Hot Standby corresponds by default to a<br />
massive redundancy layout: each input and<br />
each output has a physical connection on both<br />
units. Redundant sensors and actuators do not<br />
require additional hardware. Even for a simple<br />
transfer of information to both sides of the outputs, the application must define and<br />
implement rules for selecting and treating the proper input signals. In addition to<br />
information transfer, the application will have to address diagnostic requirements.<br />
Single Device Implementation<br />
Assuming a Premium Hot Standby application<br />
is required to handle redundant in-rack I/O<br />
channels, but without redundant sensors and<br />
actuators, special devices are used to handle<br />
the associated wiring requirements. Any given<br />
sensor signal, either digital or analog, passes<br />
through such a dedicated device, which<br />
replicates it and passes it on to homothetic input channels. Reciprocally, any given<br />
pair of homothetic output signals, either digital or analog, are provided to a dedicated<br />
device that selects and transfers the proper signal (i.e. the one taken on Primary<br />
output) to the target actuator.<br />
Distributed I/O Architectures<br />
Ethernet TCP/IP: Configurations 3&4<br />
<strong>Schneider</strong> <strong>Electric</strong> has supported the Transparent Ready strategy for several years. In<br />
addition to SCADA and HMI, Variable Speed Drives, Power Meters, and a wide<br />
range of gateways, Distributed I/O with Ethernet connectivity, such as Advantys<br />
STB, is also proposed. In addition, many manufacturers are offering devices<br />
capable of communicating on Ethernet using the Modbus TCP 8 protocol. These<br />
different contributions, using the Modbus protocol design legacy, have helped make<br />
Ethernet a preferred general-purpose communication support for automation<br />
architectures.<br />
In addition to Ethernet messaging services solicited through application program<br />
function blocks, a communication service is available on <strong>Schneider</strong> <strong>Electric</strong> PLCs: the<br />
I/O Scanner. The I/O Scanner makes a PLC Ethernet adapter/Copro act as a<br />
Modbus/TCP client, periodically launching a sequence of requests on the network.<br />
These requests correspond to standard Modbus function codes, asking for Registers<br />
(Data Words), Read, Write or Read/Write operations. This sequence is determined by<br />
a list of individual contracts specified in a table defined during the PLC configuration.<br />
The typical target of such a communication contract is an I/O block, hence the name<br />
"I/O Scanner". Also, the I/O Scanner service may be used to implement data<br />
exchanges with any type of equipment, including another PLC, provided that<br />
equipment can behave as a Modbus/TCP server, and respond to multiple words<br />
access requests.<br />
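To make the I/O Scanner's traffic concrete, the sketch below encodes one of the standard requests it periodically issues: a Modbus/TCP Read Holding Registers (function code 3) frame. This is an illustrative client-side encoder based on the public Modbus/TCP framing, not Schneider's implementation.<br />

```python
# Build a Modbus/TCP "Read Holding Registers" request: a 7-byte MBAP
# header (transaction id, protocol id 0, length, unit id) followed by a
# 5-byte PDU (function code 3, start address, register count).
import struct

def read_holding_registers_request(transaction_id, unit_id, start_addr, count):
    """Return the 12-byte request frame for Modbus function code 3."""
    length = 6  # bytes following the length field: unit id + 5-byte PDU
    return struct.pack(">HHHBBHH",
                       transaction_id, 0, length,  # MBAP header
                       unit_id, 3,                 # unit id, function code
                       start_addr, count)          # first register, quantity

frame = read_holding_registers_request(1, 255, 0, 10)
print(frame.hex())  # 000100000006ff030000000a
```

A real I/O Scanner contract repeats such requests cyclically over a TCP connection held open to port 502 on the target device.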
The Ethernet I/O Scanner service is compatible with a Hot Standby implementation,<br />
whether Premium or Quantum. The I/O Scanner is active only on the Primary unit.<br />
In case of a controlled switchover, the Ethernet TCP/IP connections handled by the<br />
former Primary unit are properly closed, and new ones are reopened once the new<br />
Primary gains control. In case of a sudden switchover, resulting, for example, from a<br />
power cut-off, the former Primary may not be able to close the connections it had<br />
opened. These connections will be closed after expiration of a Keep Alive timeout.<br />
8 In step 1, support of Ethernet/IP Quantum/Premium adapters is not available with Hot Standby.<br />
In case of a switchover, proper communications will typically recover after one initial<br />
cycle of I/O scanning. However, the worst case gap for address swap, with I/O<br />
scanner, is 500 ms, plus one initial cycle of I/O scanning. As a result, this mode of<br />
communication, and hence architectures with Distributed I/Os on Ethernet, is not<br />
preferred with a control system that regards time criticality as an essential criterion.<br />
Note: The automatic IP address swap capability is a property inherited by every<br />
Ethernet TCP/IP adapter installed in the CPU rack.<br />
Self Healing Ring<br />
As demonstrated in the previous chapter, Ethernet TCP/IP used with products like<br />
Connexium offers real opportunities to design enhanced availability architectures,<br />
handling communication between the Information Management Level and the Control<br />
Level. Such architectures, based on a Self Healing Ring topology, are also applicable<br />
when using Ethernet TCP/IP as a fieldbus.<br />
Note that Connexium accepts Copper or Optic Fiber rings. In addition, dual<br />
networking is also applicable at the fieldbus level.<br />
Profibus Architecture<br />
I/O Devices Distributed on PROFIBUS DP/PA: Configuration 5<br />
A PROFIBUS DP V1 Master Class 1<br />
communication module in Quantum form factor is<br />
available. It handles cyclic and acyclic data<br />
exchanges, and accepts FDT/DTM Asset<br />
Management System data flow, through its<br />
local Ethernet port.<br />
The PROFIBUS network is configured with<br />
a Configuration Builder software, which<br />
supplies the Unity Pro application program<br />
with data structures corresponding to cyclic<br />
data exchanges and to diagnostic<br />
information.<br />
The Configuration Builder can also be configured to pass Unity Pro a set of DFBs,<br />
allowing easy implementation of Acyclic operations.<br />
Each Quantum PLC can accept up to 6 of these DP Master modules (each of them<br />
handling its own PROFIBUS network). Also, the PTQ PDPM V1 Master Module is<br />
compatible with a Quantum Hot Standby implementation. Only the Master Module in<br />
the Primary unit is active on the PROFIBUS network; the Master Module on the<br />
Standby unit stays in a dormant state unless awakened by a switchover.<br />
PROFIBUS DP V1 Remote Master and Hot Standby PLC<br />
With a smart device such as<br />
PROFIBUS Remote Master 9 ,<br />
an I/O Scanner stream is<br />
handled by the PLC<br />
application 10 and forwarded<br />
to the Remote Master via<br />
Ethernet TCP/IP. In turn,<br />
Remote Master handles the<br />
corresponding cyclic<br />
exchanges with the devices populating the PROFIBUS network. Remote Master can<br />
also handle acyclic data exchanges.<br />
The PROFIBUS network is configured with Unity Pro 11 , which also acts as an FDT<br />
container, able to host manufacturer device DTMs. In addition, Remote Master offers<br />
a comDTM to work with third party FDT/DTM Asset Management Systems.<br />
Automatic symbol generation provides Unity Pro with data structures corresponding<br />
to data exchanges and diagnostic information. A set of DFBs is delivered that allows<br />
an easy implementation of acyclic operations.<br />
Remote Master is compatible with a Quantum, Premium or Modicon M340 Hot<br />
Standby implementation.<br />
9 Planned First Customer Shipment: Q4 2009<br />
10 M340, Premium or Quantum<br />
11 version 5.0<br />
Redundancy and Safety<br />
Requirements for <strong>Availability</strong> and Safety are often considered<br />
to be working against each other. Safety can be the<br />
response to the maxim "stop if any potential danger arises,"<br />
whereas <strong>Availability</strong> can follow the slogan "produce in spite of<br />
everything."<br />
Two models of CPU are available to design a Quantum<br />
Safety configuration: the first model (140 CPU 651 60S) is<br />
dedicated to standalone architectures, whereas the second<br />
model ( 140 CPU 671 60S) is dedicated to redundant<br />
architectures.<br />
The Quantum Safety PLC has some exclusive features: a<br />
specific hardware design for safety modules (CPU and I/O<br />
modules) and a dedicated instruction set.<br />
Otherwise, a Safety Quantum Hot Standby configuration has<br />
much in common with a regular Quantum Hot Standby configuration. The<br />
configuration windows, for example, are almost the same, and the Ethernet<br />
communication adapters inherit the IP Address automatic swap capability. Thus the<br />
Safety Quantum Hot Standby helps to reconcile and integrate the concepts of Safety<br />
and <strong>Availability</strong>.<br />
Mixed Configuration: Configuration 6<br />
Whether Premium or Quantum, application requirements such as topology,<br />
environment, periphery, time criticality, etc. may influence the final architecture<br />
design to adopt both types of design strategies concurrently, i.e. in-rack and<br />
distributed I/O, depending on individual subsystem constraints.<br />
Premium / Quantum Hot Standby Switchover Conditions<br />
System Health Diagnostics<br />
Systematic checks are executed cyclically by any running CPU, in order to detect a<br />
potential hardware corruption, such as a change affecting the integrity of the Copro,<br />
the sub-part of the CPU module that hosts the integrated Ethernet port. Another<br />
example of a systematic check is the continuous check of the voltage levels provided<br />
by the power supply module(s). In case of a negative result during these hardware<br />
health diagnostics, the tested CPU will usually switch to a Stop State.<br />
When the unit in question is part of a Hot Standby System, in addition to these<br />
standard hardware tests separately executed on both machines, more specific tests<br />
are conducted between the units. These additional tests involve both Sync Links. The<br />
basic objective is to confirm that the Primary unit is effectively operational, executing<br />
the application program, and controlling the I/O exchanges. In addition, the system<br />
must verify that the current Standby unit is able to assume control after a switchover.<br />
Controlled Switchover<br />
If an abnormal situation occurs on the current Primary unit, it gives up control and<br />
switches either to Off-Line state (the CPU is not a part of the Hot Standby system<br />
coupling) or to Stop State, depending on the event. The former Standby unit takes<br />
control as the new Primary unit.<br />
As previously indicated, the Hot Standby system is controlled through the %SW60<br />
system Command Register. Each unit owns an individual bit in this register that<br />
declares whether or not that particular unit intends to "hook" to the other unit. An<br />
operational, hooked redundant Hot Standby system requires both units to indicate<br />
this intent. Consequently, executing a switchover controlled by the application on a<br />
hooked system is straightforward; it requires briefly toggling the decision bit that<br />
controls the current Primary unit's "hooking" intent. The first toggle transition<br />
switches the current Primary unit to Off-Line State and makes the former Standby<br />
unit take control. The next toggle transition makes the former Primary unit return<br />
and hook as the new Standby unit.<br />
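The toggle sequence described above can be modeled as a small state machine. The following sketch is purely illustrative (the Python class and unit names are invented; on a real system the "hooking" intent is expressed through individual bits of the Command Register, not through this API):<br />

```python
class HotStandbySystem:
    """Toy model of two Hot Standby units; each exposes a 'hook' flag
    signalling its intent to couple with the other unit."""

    def __init__(self):
        self.hook = {"A": True, "B": True}  # both units intend to hook
        self.primary, self.standby = "A", "B"

    def set_hook(self, unit, value):
        self.hook[unit] = value
        if not value and unit == self.primary:
            # First toggle transition: the Primary drops to Off-Line state
            # and the former Standby takes control as the new Primary.
            self.primary, self.standby = self.standby, None
        elif value and self.standby is None and unit != self.primary:
            # Next toggle transition: the former Primary returns and
            # hooks as the new Standby unit.
            self.standby = unit

hsby = HotStandbySystem()
hsby.set_hook("A", False)  # application briefly drops A's hooking intent
assert hsby.primary == "B" and hsby.standby is None
hsby.set_hook("A", True)   # A returns as the new Standby
assert hsby.primary == "B" and hsby.standby == "A"
```
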
An example of this possibility is a controlled switchover resulting from diagnostics<br />
conducted at the application level. A Quantum Hot Standby, Premium Hot Standby or<br />
Monitored Ethernet Adapter system does not handle a Modbus Plus or Ethernet<br />
communication adapter malfunction as a condition implicitly forcing a switchover. As a<br />
result, these communication modules must be cyclically tested by the application,<br />
both on the Standby and on the Primary. Diagnostic results elaborated on the Standby<br />
are usually transferred to the Primary unit by means of the Reverse Transfer Registers.<br />
Finally, when the application is notified of a persistent malfunction affecting the Primary<br />
unit while the Standby unit is fully operational, it can force a switchover by acting on<br />
the Command Register.<br />
Hence, the application program can decide on a Hot Standby switchover, having<br />
registered a steady state negative diagnostic on the Ethernet adapter linking the<br />
Primary unit to the "Process Network", and being at the same time informed that the<br />
Standby unit is fully operational.<br />
Note: There are two ways to implement a Controlled Switchover: automatically<br />
through configuration of a default DFB, or customized with the creation of a DFB with<br />
its own switchover conditions.<br />
The following illustration represents an example of a DFB handling a Hot Standby<br />
Controlled Switchover:<br />
This in turn makes use of an embedded DFB: HSBY_WR:<br />
Fragment of HSBY_Switch_Decision DFB<br />
Fragment of HSBY_WR DFB<br />
Note: HSBY_WR DFB executes a write access on HSBY Control Register (%SW60).<br />
Switchover Latencies<br />
The following table details the typical and maximum swap time delay encountered<br />
when reestablishing Ethernet services during a Switchover event. (Premium and<br />
Quantum configurations)<br />
Service: Typical Swap Time / Maximum Swap Time<br />
Swap IP Address: 6 ms / 500 ms<br />
I/O Scanning: 1 initial cycle of I/O scanning / 500 ms + 1 initial cycle of I/O scanning<br />
Client Messaging: 1 MAST task cycle / 500 ms + 1 MAST task cycle<br />
Server Messaging: 1 MAST task cycle + the time required by the client to reestablish its connection with the server (1) / 500 ms + the time required by the client to reestablish its connection with the server (1)<br />
FTP/TFTP Server: the time required by the client to reestablish its connection with the server (1) / 500 ms + the time required by the client to reestablish its connection with the server (1)<br />
SNMP: 1 MAST task cycle / 500 ms + 1 MAST task cycle<br />
HTTP Server: the time required by the client to reestablish its connection with the server (1) / 500 ms + the time required by the client to reestablish its connection with the server (1)<br />
(1) The time the client requires to reconnect with the server depends on the client communication loss<br />
timeout settings.<br />
Selection<br />
To end the control level section, the following table presents the main criteria that<br />
help you select the most appropriate configuration for your high availability solution:<br />
Criteria: Cost / <strong>High</strong> Criticality<br />
Switchover Performance: Premium In-Rack Architecture / Quantum In-Rack Architecture<br />
Openness: Premium Distributed Architecture / Quantum Distributed Architecture<br />
Premium / Quantum Hot Standby Solution Reminder<br />
The following tables provide a brief reminder of essential characteristics for Premium<br />
and Quantum Hot Standby solutions, respectively:<br />
Up to 128 per network / ETY I/O Scanner handles up to 64 transactions<br />
Premium Hot Standby Essential Characteristics<br />
Quantum Hot Standby Essential Characteristics<br />
Conclusion<br />
This chapter has covered functional and architectural redundancy aspects, from the<br />
Information Management level, through the Communication Infrastructure level,<br />
down to the Control level.<br />
Conclusion<br />
This section summarizes the main characteristics and properties of <strong>Availability</strong> for<br />
Collaborative Control automation architectures.<br />
Chapter 1 demonstrated that <strong>Availability</strong> depends not only on Reliability, but also<br />
on the Maintenance provided to a given system. The first level of contribution,<br />
Reliability, is primarily a function of the system design and components. Component<br />
and device manufacturers thus have a direct but not exclusive influence on system<br />
<strong>Availability</strong>. The second level of contribution, Maintenance and Logistics, is entirely<br />
dependent on end customer behavior.<br />
Chapter 2 presented some simple Reliability and <strong>Availability</strong> calculation examples,<br />
and demonstrated that beyond basic use cases, dedicated skills and tools are<br />
required to extract figures from real cases.<br />
Chapter 3 explored a central focus of this document, Redundancy, and its application<br />
at the Information Management Level, Communication Infrastructure Level and<br />
Control System Level.<br />
This final chapter summarizes customer benefits provided by <strong>Schneider</strong> <strong>Electric</strong> <strong>High</strong><br />
<strong>Availability</strong> solutions, as well as additional information and references.<br />
Benefits<br />
Standard Offer<br />
<strong>Schneider</strong> <strong>Electric</strong> currently offers a wide range of solutions, providing the best<br />
design to respond to specific customer needs for Redundancy and <strong>Availability</strong> in<br />
Automation and Control Systems.<br />
One key concept of <strong>High</strong> <strong>Availability</strong> is that redundancy is not imposed as a default<br />
design characteristic at any system level. Instead, Redundancy can be added locally,<br />
in most cases using standard components.<br />
Simplicity of Implementation<br />
System Transparency<br />
Information Management Level<br />
At any level, the intrusion of Redundancy into system design and implementation is<br />
minimal, compared to a non-redundant system. For SCADA implementation,<br />
network active component selection, or PLC programming, most of the software<br />
contributions to redundancy depend on selections made during the<br />
configuration phase. Also, Redundancy can be applied selectively.<br />
The transparency of a redundant system, compared to a standalone one, is a<br />
customer requirement. With the <strong>Schneider</strong> <strong>Electric</strong> automation offer, this transparency<br />
is present at each level of the system.<br />
For Client Display Stations, Dual Path Supervisory Networks, Redundant I/O Servers,<br />
or Dual Access to Process Networks, each redundant contribution is handled<br />
separately by the system. For example, a concurrent Display Client communication<br />
flow will be transparently re-routed to the I/O server by the Supervisory Network in<br />
case of a cable disruption. This flow will also be transparently routed to the alternative<br />
I/O Server in case of a sudden malfunction of the first server. Finally, the I/O Server<br />
may transparently abandon its default communication channel if that channel ceases<br />
to operate properly, or if the target PLC no longer responds through it.<br />
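The transparent fallback described above can be sketched as a simple retry strategy. This is a hypothetical illustration only (the channel names and the request() stand-in are invented; the real re-routing is performed by the Supervisory Network and the I/O Server software, not by client code):<br />

```python
DOWN = {"io_server_1"}  # simulate a sudden malfunction of the first server

def request(channel):
    """Stand-in for a real network call: None models a channel that
    ceases to operate properly or a PLC that does not respond."""
    return None if channel in DOWN else f"reply via {channel}"

def transparent_request(channels):
    """Try the default channel first, then fall back to the alternatives."""
    for channel in channels:
        reply = request(channel)
        if reply is not None:
            return reply
    raise ConnectionError("no redundant path left")

print(transparent_request(["io_server_1", "io_server_2"]))  # reply via io_server_2
```
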
Communication Infrastructure Level<br />
Control System Level<br />
Ease of Use<br />
Whether utilized as a Process Network or as a Fieldbus Network, currently available<br />
active network components can easily participate in an automatically reconfigured<br />
network. With continuous enhancements, HIPER-Ring strategy not only offers<br />
simplicity, but also a level of performance compatible with a high-reactivity demand.<br />
The "IP Address automatic switch" for a SCADA application communicating through<br />
Ethernet is an important feature of <strong>Schneider</strong> <strong>Electric</strong> PLCs. Apart from simplifying<br />
the design of the SCADA application implementation, which could otherwise cause<br />
delays and increased cost, this feature also contributes to reducing the payload of a<br />
communication context exchange on a PLC switchover.<br />
As previously stated, increased effort has been made to make the implementation of<br />
a redundant feature simple and straightforward.<br />
The Vijeo Citect, ConneXium Web Pages and Unity Pro software environments offer<br />
clear and accessible configuration windows, along with a dedicated selective help, in<br />
order to execute the required parameterization.<br />
More Detailed RAMS Investigation<br />
In case of a specific need for detailed dependability (RAMS) studies, for any type of<br />
architecture, contact the <strong>Schneider</strong> <strong>Electric</strong> Safety Competency Center. This center<br />
has skilled and experienced individuals ready to help you with all your needs.<br />
Appendix<br />
Glossary<br />
Note: the references in brackets refer to standards, which are listed at the end of<br />
this glossary.<br />
1) Active Redundancy<br />
Redundancy where the different means required to accomplish a given function are<br />
present simultaneously [5]<br />
2) <strong>Availability</strong><br />
Ability of an item to be in a state to perform a required function under given conditions,<br />
at a given instant of time or over a given time interval, assuming that the required<br />
external resources are provided [IEV 191-02-05] (performance) [2]<br />
3) Common Mode Failure<br />
Failure that affects all redundant elements for a given function at the same time [2]<br />
4) Complete failure<br />
Failure which results in the complete inability of an item to perform all required<br />
functions [IEV 191-04-20] [2]<br />
5) Dependability<br />
Collective term used to describe availability performance and its influencing factors:<br />
reliability performance, maintainability performance and maintenance support<br />
performance [IEV 191-02-03] [2]<br />
Note: Dependability is used only for general descriptions in non-quantitative terms.<br />
6) Dormant<br />
A state in which an item is able to function but is not required to function. Not to be<br />
confused with downtime [4]<br />
7) Downtime<br />
Time during which an item is in an operational inventory but is not in condition to<br />
perform its required function [4]<br />
8) Failure<br />
Termination of the ability of an item to perform a required function [IEV 191-04-01] [2]<br />
Note 1: After failure the item detects a fault.<br />
Note 2: "Failure" is an event, as distinguished from "fault", which is a state.<br />
9) Failure Analysis<br />
The act of determining the physical failure mechanism resulting in the functional<br />
failure of a component or piece of equipment [1]<br />
10) Failure Mode and Effects Analysis (FMEA)<br />
Procedure for analyzing each potential failure mode in a product, to determine the<br />
results or effects on the product. When the analysis is extended to classify each<br />
potential failure mode according to its severity and probability of occurrence, it is<br />
called a Failure Mode, Effects, and Criticality Analysis (FMECA).[6]<br />
11) Failure Rate<br />
Total number of failures within an item population, divided by the total number of life<br />
units expended by that population, during a particular measurement period under<br />
stated conditions [4]<br />
12) Fault<br />
State of an item characterized by its inability to perform a required function, excluding<br />
this inability during preventive maintenance or other planned actions, or due to lack of<br />
external resources [IEV 191-05-01] [2]<br />
Note: a fault is often the result of a failure of the item itself, but may exist without prior<br />
failure.<br />
13) Fault Tolerance<br />
Ability to tolerate and accommodate a fault, with or without performance<br />
degradation<br />
14) Fault Tree Analysis (FTA)<br />
Method used to evaluate reliability of engineering systems. FTA is concerned with<br />
fault events. A fault tree may be described as a logical representation of the<br />
relationship of primary or basic fault events that lead to the occurrence of a specified<br />
undesirable fault event known as the “top event.” A fault tree is depicted using a tree<br />
structure with logic gates such as AND and OR [7]<br />
FTA illustration [7]<br />
15) Hidden Failure<br />
Failure occurring that is not detectable by or evident to the operating crew [1]<br />
16) Inherent <strong>Availability</strong> (Intrinsic <strong>Availability</strong>) : Ai<br />
A measure of <strong>Availability</strong> that includes only the effects of an item design and its<br />
application, and does not account for effects of the operational and support<br />
environment. Sometimes referred to as "intrinsic" availability [4]<br />
17) Integrity<br />
Reliability of data which is being processed or stored.<br />
18) Maintainability<br />
Probability that an item can be retained in, or restored to, a specified condition when<br />
maintenance is performed by personnel having specified skill levels, using prescribed<br />
procedures and resources, at each prescribed level of maintenance and repair. [4]<br />
19) Markov Method<br />
A Markov process is a mathematical model that is useful in the study of the<br />
availability of complex systems. The basic concepts of the Markov process are those<br />
of the “state” of the system (for example operating, non-operating) and state<br />
“transition” (from operating to non-operating due to failure, or from non-operating to<br />
operating due to repair). [4]<br />
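As an aside, the simplest two-state Markov model already yields the availability formula used in the calculation examples later in this document. The sketch below is not taken from the source; it applies the standard steady-state result A = μ/(λ+μ) = MTBF/(MTBF+MTTR), with example figures borrowed from the calculation chapter:<br />

```python
# Two states: operating (up) and non-operating (down).
# Transitions: failure at rate lam = 1/MTBF, repair at rate mu = 1/MTTR.
MTBF = 45_500.0  # hours (example figure from the calculation chapter)
MTTR = 2.0       # hours

lam, mu = 1.0 / MTBF, 1.0 / MTTR

# Steady state balances the flow between states: lam * A = mu * (1 - A),
# hence A = mu / (lam + mu), equivalently MTBF / (MTBF + MTTR).
A = mu / (lam + mu)

assert abs(A - MTBF / (MTBF + MTTR)) < 1e-12
# A ≈ 0.999956: the "4-nines" availability quoted later in this document
```
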
Markov Graph illustration [2]<br />
20) MDT: Mean Downtime<br />
Average time a system is unavailable for use due to a failure.<br />
Time includes the actual repair time plus all delay time associated with a repair<br />
person arriving with the appropriate replacement parts [4]<br />
21) MOBF: Mean Operating Time Between Failures<br />
Expectation of the operating time between failures [IEV 191-12-09] [2]<br />
22) MTBF<br />
A basic measure of reliability for repairable items. The mean number of life units<br />
during which all parts of the item perform within their specified limits, during a<br />
particular measurement interval under stated conditions. [4]<br />
23) MTTF : Mean Time To Failure<br />
A basic measure of reliability for non-repairable items. The total number of life units of<br />
an item population divided by the number of failures within that population, during a<br />
particular measurement interval under stated conditions. [4]<br />
Note: Used with repairable items, MTTF stands for Mean Time To First Failure<br />
24) MTTR : Mean Time To Repair<br />
A basic measure of maintainability. The sum of corrective maintenance times at any<br />
specific level of repair, divided by the total number of failures within an item repaired<br />
at that level, during a particular interval under stated conditions. [4]<br />
25) MTTR : Mean Time To Recovery<br />
Expectation of the time to recovery [IEV 191-13-08] [2]<br />
26) Non-Detectable Failure<br />
Failure at the component, equipment, subsystem, or system (product) level that is<br />
identifiable by analysis but cannot be identified through periodic testing or revealed by<br />
an alarm or an indication of an anomaly. [4]<br />
27) Redundancy<br />
Existence in an item of two or more means of performing a required function [IEV<br />
191-15-01] [2]<br />
Note: in this standard, the existence of more than one path (consisting of links and<br />
switches) between end nodes.<br />
Existence of more than one means for accomplishing a given function. Each means<br />
of accomplishing the function need not necessarily be identical. The two basic types<br />
of redundancy are active and standby. [4]<br />
28) Reliability<br />
Ability of an item to perform a required function under given conditions for a given<br />
time interval [IEV 191-02-06] [2]<br />
Note 1: It is generally assumed that an item is in a state to perform this required<br />
function at the beginning of the time interval<br />
Note 2: the term “reliability” is also used as a measure of reliability performance (see<br />
IEV 191-12-01)<br />
29) Repairability<br />
Probability that a failed item will be restored to operable condition within a specified<br />
time of active repair [4]<br />
30) Serviceability<br />
Relative ease with which an item can be serviced (i.e. kept in operating condition). [4]<br />
31) Standby Redundancy<br />
Redundancy wherein a part of the means for performing a required function is<br />
intended to operate, while the remaining part(s) of the means are inoperative until<br />
needed [IEV 191-15-03] [2]<br />
Note: this is also known as dynamic redundancy.<br />
Redundancy in which some or all of the redundant items are not operating<br />
continuously but are activated only upon failure of the primary item performing the<br />
function(s). [4]<br />
32) System Downtime<br />
Time interval between the commencement of work on a system (product) malfunction<br />
and the time when the system has been repaired and/or checked by the maintenance<br />
person, and no further maintenance activity is executed. [4]<br />
33) Total System Downtime<br />
Time interval between the reporting of a system (product) malfunction and the time<br />
when the system has been repaired and/or checked by the maintenance person, and<br />
no further maintenance activity is executed. [4]<br />
34) Unavailability<br />
State of an item of being unable to perform its required function [IEV 603-05-05] [2]<br />
Note: Unavailability is expressed as the fraction of expected operating life that an<br />
item is not available, for example given in minutes per year<br />
Ratio: downtime/(uptime + downtime) [3]<br />
Often expressed as a maximum period of time during which the variable is<br />
unavailable, for example 4 hours per month<br />
35) Uptime<br />
That element of Active Time during which an item is in condition to perform its<br />
required functions. (Increases availability and dependability). [4]<br />
[1] Maintenance & reliability terms - Life Cycle Engineering<br />
[2] IEC 62439: <strong>High</strong> <strong>Availability</strong> automation networks<br />
[3] IEEE Std C37.1-2007: Standard for SCADA and Automation System<br />
[4] MIL-HDBK-338B - Military Handbook - Electronic Reliability Design Handbook<br />
[5] IEC-271-194<br />
[6] The Certified Quality Engineer Handbook - Connie M. Borror, Editor<br />
[7] Reliability, Quality, and Safety for Engineers - B.S. Dhillon - CRC Press<br />
Standards<br />
General purpose<br />
This section contains a selected, non-exhaustive list of reference documents and<br />
standards related to Reliability and <strong>Availability</strong>:<br />
• IEC 60050 (191):1990 - International Electrotechnical Vocabulary (IEV)<br />
FMEA/FMECA<br />
• IEC 60812 (1985) - Analysis techniques for system reliability - Procedures for failure mode and<br />
effect analysis (FMEA)<br />
• MIL-STD 1629A (1980) Procedures for performing a failure mode, effects and criticality analysis<br />
Reliability Block Diagrams<br />
• IEC 61078 (1991) Analysis techniques for dependability - Reliability block diagram method<br />
Fault Tree Analysis<br />
• NUREG-0492 - Fault Tree Handbook - US Nuclear Regulatory Commission<br />
Markov Analysis<br />
• IEC 61165 (2006) Application of Markov techniques<br />
RAMS<br />
• IEC 60300-1 (2003) - Dependability management - Part 1: Dependability management systems<br />
• IEC 62278 (2002) - Railway applications - Specification and demonstration of reliability, availability,<br />
maintainability and safety (RAMS)<br />
Functional Safety<br />
• IEC 61508 - Functional safety of electrical/electronic/programmable electronic safety related<br />
systems (7 parts)<br />
• IEC 61511 (2003) Functional safety - Safety instrumented systems for the process industry sector.<br />
Calculation Examples<br />
Reliability & <strong>Availability</strong> Calculation Examples<br />
The previous chapter provided the necessary knowledge and theoretical basics to<br />
understand simple examples of Reliability and <strong>Availability</strong> calculations. This section<br />
examines concrete, simple but realistic examples, related to current, plausible solutions.<br />
The root piece of information required for all of these calculations, for all major<br />
components we will implement in our architectures, is the MTBF figure. MTBFs are<br />
normally provided by the manufacturer, either on request and/or on dedicated<br />
documents. For <strong>Schneider</strong> <strong>Electric</strong> PLCs, MTBF information can be found on the<br />
intranet site Pl@net, under Product offer>Quality Info.<br />
Note: The MTBF information is usually not accessible via the <strong>Schneider</strong> <strong>Electric</strong><br />
public website.<br />
Redundant PLC System, Using Massive I/O Modules Redundancy<br />
Standalone Architecture<br />
Consider a simple, single-rack Premium PLC configuration.<br />
First, calculate the individual modules' MTBFs, using a spreadsheet that will perform<br />
the calculations (examples include Excel and OpenOffice).<br />
The calculation guidelines are derived from the main conclusion given by Serial RBD<br />
analysis, that is:<br />
λS = Σ (i = 1 to n) λi<br />
The Equivalent Failure Rate for n serial elements is equal to the sum of the individual<br />
Failure Rates of these elements, with<br />
RS(t) = e^(−λS·t)<br />
The first operation will identify individual MTBF figures of the part references<br />
populating the target system. Using these figures, a second sheet will then<br />
subsequently consider the item, group and system levels.<br />
• For each of the identified item part references:<br />
- Individual item Failure Rate λ, calculated by inverting the individual item<br />
MTBF: λ = 1 / MTBF<br />
- Individual item Reliability over 1 year, applying R(t) = e^(−λt), where t = 8760,<br />
the number of hours in one year (365 × 24)<br />
• For each of the identified part-reference groups:<br />
- Group Failure Rate: individual item Failure Rate multiplied by the number of items in<br />
the considered group<br />
- Group Reliability: individual item Reliability raised to the power of the number of items<br />
• For the considered System:<br />
- System Failure Rate: sum of the Group Failure Rates<br />
- System MTBF in hours: inverse of the System Failure Rate<br />
- System Reliability over one year: exp(−System Failure Rate × 8760)<br />
(where 8760 = 365 × 24, the number of hours in one year)<br />
- System <strong>Availability</strong> (with MTTR = 2 h): System MTBF / (System MTBF + 2)<br />
Looking at the example results, Reliability over one year is approximately 83%<br />
(meaning the probability that this system encounters a failure during one year is<br />
approximately 17%).<br />
The System MTBF itself is approximately 45,500 hours (approximately 5 years).<br />
Regarding <strong>Availability</strong>, with a 2 hour Mean Time To Repair (which corresponds to<br />
a very good logistics and maintenance organization), we obtain a 4-nines resulting<br />
<strong>Availability</strong>, i.e. an average of approximately 23 minutes of downtime per year.<br />
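The worksheet procedure above can be reproduced in a few lines of code. The module list and MTBF figures below are invented placeholders (real figures come from Pl@net), chosen only so the totals land in the same range as the example:<br />

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760

# part reference: (MTBF in hours, quantity) -- hypothetical figures
modules = {
    "power supply": (600_000.0, 1),
    "CPU":          (200_000.0, 1),
    "I/O module":   (800_000.0, 12),
}

# Serial RBD: the system failure rate is the sum of the item failure rates,
# each item contributing quantity / MTBF (i.e. quantity * lambda).
lambda_system = sum(qty / mtbf for mtbf, qty in modules.values())

mtbf_system = 1.0 / lambda_system                            # ~46,000 hours
reliability_1y = math.exp(-lambda_system * HOURS_PER_YEAR)   # ~83 %

MTTR = 2.0  # hours
availability = mtbf_system / (mtbf_system + MTTR)            # ~4 nines
downtime_min_per_year = (1 - availability) * HOURS_PER_YEAR * 60  # ~23 min
```
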
Note:<br />
• As previously explained, the System figures produced by our basic calculations<br />
simply apply the equations given by basic probability analysis. In addition to these<br />
calculations, a commercial software tool was used, permitting us to confirm<br />
our figures.<br />
Redundant Architecture<br />
Assuming we would like to increase this system’s<br />
<strong>Availability</strong> by implementing a redundant architecture,<br />
we need to calculate the potential gain for System<br />
Reliability and <strong>Availability</strong>.<br />
We will assume that we have no additional hardware,<br />
apart from the two redundant racks. The required calculation guidelines are derived<br />
from the main conclusion given by Parallel RBD analysis, that is:<br />
RRed(t) = 1 − Π (i = 1 to n) [1 − Ri(t)]<br />
This means, if we consider RS as the Standalone System Reliability and RRed as the<br />
Redundant System Reliability:<br />
RRed(t) = 1 − (1 − RS(t))²<br />
As a result of the redundant architecture, System Reliability over one year is<br />
now approximately 97% (i.e. the probability for the system to encounter one<br />
failure during one year has been reduced to approximately 3%).<br />
System MTBF has increased from 45,500 hours (approximately 5 years) to<br />
68,000 hours (approximately 7.8 years).<br />
Regarding System <strong>Availability</strong>, with a 2 hour Mean Time To Repair we would<br />
obtain an 8-nines resulting <strong>Availability</strong>, which is close to 100%.<br />
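These parallel-architecture results can be checked numerically. A sketch under the stated assumptions (two identical units, no repair for the MTBF estimate, standalone figures taken from the previous example):<br />

```python
import math

HOURS_PER_YEAR = 8760
MTTR = 2.0

lambda_s = 1.0 / 45_500.0                    # standalone failure rate
r_s = math.exp(-lambda_s * HOURS_PER_YEAR)   # standalone reliability, ~83 %

# Parallel RBD with two identical branches:
r_red = 1 - (1 - r_s) ** 2                   # ~97 % over one year

# Without repair, the MTTF of a 1-out-of-2 parallel pair is 3/(2*lambda),
# i.e. 1.5x the single-unit MTBF: 45,500 h -> ~68,250 h.
mtbf_red = 3 / (2 * lambda_s)

# Availability: one branch gives ~4 nines; two in parallel give ~8 nines.
a_s = 45_500.0 / (45_500.0 + MTTR)
a_red = 1 - (1 - a_s) ** 2
```
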
Note: Formal calculations should also take into account undetected errors in the<br />
redundant architecture, which would yield somewhat less optimistic figures.<br />
A complete analysis should also take into account the additional wiring devices typically<br />
used in a massive I/O redundancy strategy, feeding matching input points with the<br />
same input signal, and bringing matching output points onto the same output signal.<br />
Also, with this software, a parallel structure has been retained, in which the Failure Rate<br />
of the Standby rack is the same as that of the Primary rack.<br />
Reminder of Standby Definitions:<br />
Cold Standby: The Standby unit starts only<br />
if the active unit becomes inoperative.<br />
Warm Standby: The Failure Rate of the<br />
Standby unit during the inactive mode is<br />
lower than the Failure Rate when it becomes<br />
active.<br />
Hot Standby: The Failure Rate of the<br />
Standby unit during the inactive mode is the same as the Failure Rate when it<br />
becomes active. This is a simple parallel configuration.<br />
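These three definitions can be compared quantitatively. The sketch below is standard reliability theory rather than material from this document; it assumes two identical units, perfect switching, and no repair:<br />

```python
def two_unit_standby_mttf(lam, lam_dormant):
    """MTTF of a two-unit standby pair: expected time with both units up
    (combined rate lam + lam_dormant), plus the lone survivor's 1/lam."""
    return 1.0 / (lam + lam_dormant) + 1.0 / lam

lam = 1.0 / 45_500.0  # active failure rate (example figure above)

mttf_hot = two_unit_standby_mttf(lam, lam)       # dormant rate = active rate
mttf_warm = two_unit_standby_mttf(lam, lam / 2)  # dormant rate reduced
mttf_cold = two_unit_standby_mttf(lam, 0.0)      # dormant rate = 0

# Hot standby reduces to the simple parallel result 3/(2*lam);
# cold standby (perfect switching) doubles the single-unit MTBF.
assert abs(mttf_hot - 1.5 / lam) < 1e-6
assert abs(mttf_cold - 2.0 / lam) < 1e-6
assert mttf_hot < mttf_warm < mttf_cold
```
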
Redundant PLC System, Using Shared Remote I/O Racks<br />
Simple Architecture with a Standalone CPU Rack<br />
Standalone CPU Rack<br />
Consider a standalone CPU rack equipped<br />
with power supply, CPU, remote I/O<br />
processor and Ethernet communication<br />
modules.<br />
As in the previous section, calculate the<br />
potential gain a redundant architecture would<br />
allow, in terms of Reliability and <strong>Availability</strong>.<br />
First calculate the individual modules' MTBFs,<br />
then establish, for each rack, a<br />
worksheet that provides the Reliability and<br />
<strong>Availability</strong> figures.<br />
Remote I/O Rack<br />
Simple Architecture : Standalone CPU Rack + Remote I/O Rack<br />
For this Serial System (Rack #1 + Rack #2), Reliability over one year is approximately 82.8% (the<br />
probability for this system to encounter a failure during one year is approximately 17.2%).<br />
System MTBF is approximately 46,260 hours (approximately 5.3 years).<br />
Regarding <strong>Availability</strong>, with a 2 hour Mean Time To Repair (which corresponds to a very good logistics<br />
and maintenance organization), we obtain a 4-nines resulting <strong>Availability</strong>, which is an average<br />
of approximately 23 minutes of downtime per year.<br />
Note: As expected, Reliability and <strong>Availability</strong> figures resulting from this Serial<br />
implementation are determined by the weakest link of the chain.<br />
Simple Architecture with a Redundant CPU Rack<br />
One solution to reinforce Rack #1 <strong>Availability</strong> (i.e. the<br />
rack hosting the CPU, and thus the process<br />
control core) is to implement a Hot Standby<br />
configuration.<br />
As a result of Rack #1 Redundancy, its Reliability over one year would be approximately<br />
99.7%, compared to 94.6% with a Standalone Rack #1 (i.e. the probability for the<br />
Redundant Rack #1 System to encounter a failure during one year would have been<br />
reduced from 5.4% to 0.3%).<br />
System MTBF would increase from approximately 157,000 hours (approximately 18 years)<br />
to 235,000 hours (approximately 27 years).<br />
Regarding System <strong>Availability</strong>, with a 2 hour Mean Time To Repair we would obtain a 9-<br />
nines resulting <strong>Availability</strong>, close to 100%.<br />
Note: As expected, Reliability and <strong>Availability</strong> figures resulting from this Parallel<br />
implementation are better than the best of the individual figures for different links of<br />
the chain.<br />
Consider a common misuse of reliability figures:<br />
• The last architecture examined (Distributed Architecture with a Redundant CPU<br />
Rack) demonstrated a real benefit in implementing CPU Rack redundancy: with<br />
this architecture, a sub-system would be elevated to 99.9% Reliability, and<br />
almost 100% <strong>Availability</strong>.<br />
• The common misuse of this Reliability figure would be to argue for a comparable<br />
benefit in Reliability and <strong>Availability</strong> for the whole resulting system.<br />
Note: This example has considered the Sub-System Failure Rate provided by the<br />
reliability software.<br />
Regarding the Serial System formed by the Redundant CPU Racks and the STB Islands,<br />
the worksheet above shows a resulting Reliability (over one year) of 84.06%, the<br />
Standalone System Reliability over the same period being 81.95%.<br />
In addition, the worksheet shows a resulting <strong>Availability</strong> (for a 2-hour MTTR) of<br />
99.96%, the Standalone System <strong>Availability</strong> (for a 2-hour MTTR) being 99.9957%.<br />
As a result, this data suggests that implementing CPU Rack Redundancy would<br />
have almost no benefit.<br />
Of course, this conclusion is incorrect, and the example suggests a simple<br />
rule: always compare comparable items. If we implement redundancy on the CPU<br />
Control Rack in order to increase the Reliability, and by extension the <strong>Availability</strong>,<br />
of the process control core, then we must examine and compare the figures at this<br />
level only, since the rest of the system has not been made redundant.<br />
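This masking effect is easy to demonstrate numerically. The figures below are illustrative (they are not the worksheet's values): a weak element in a serial chain hides, at system level, a large improvement made to another element.<br />

```python
def serial_reliability(*elements):
    """Reliability of a serial chain: the product of the element reliabilities."""
    product = 1.0
    for r in elements:
        product *= r
    return product

# Illustrative figures (not taken from the worksheet above)
r_rack, r_rack_redundant = 0.946, 0.997  # rack-level gain: failure 5.4% -> 0.3%
r_weak_link = 0.82                       # another serial element, left unchanged

before = serial_reliability(r_rack, r_weak_link)           # ~0.776
after = serial_reliability(r_rack_redundant, r_weak_link)  # ~0.818
# At system level the gain looks modest, because the weak link dominates;
# the redundancy must be judged at the rack level, where it reduces the
# failure probability by a factor of ~18.
```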
Recommendations for Utilizing Figures<br />
The previous case studies provide several important lessons concerning the evaluation<br />
of Reliability metrics:<br />
• Provided elementary MTBF figures are available, a "serial" system is quite easy<br />
to evaluate, thanks to the "magic bullet" formula λ = 1 / MTBF (the failure rates of serial elements simply add).<br />
• For a "parallel" sub-system, System Reliability can easily be derived<br />
from the Sub-System Reliability, and System <strong>Availability</strong> can then be almost as easily<br />
calculated from it. But the "magic bullet"<br />
formula no longer applies: under these conditions, the exponential law does not<br />
directly link Reliability to the Failure Rate. However, System MTBF can be<br />
approximated from the Sub-System MTBF:<br />
System MTBF (in hours) ≈ 1.5 × Sub-System MTBF<br />
No direct relationship, however, provides the System Failure Rate.<br />
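The 1.5× approximation in the last bullet can be checked numerically. For two identical units with constant failure rate λ, the parallel-system reliability is R(t) = 1 − (1 − e^(−λt))², and integrating R(t) over time yields the MTTF. A sketch with an illustrative λ:<br />

```python
import math

lam = 1e-4  # illustrative failure rate per unit (1/hour); unit MTBF = 1/lam = 10,000 h

def r_parallel(t):
    """R(t) of two identical units in parallel: the system fails only when both fail."""
    return 1 - (1 - math.exp(-lam * t)) ** 2

# MTTF = integral of R(t) dt, approximated here by a rectangle sum
dt = 1.0
horizon = int(30 / lam)  # integrate far enough that R(t) is negligible
mttf = sum(r_parallel(t) * dt for t in range(horizon))

print(mttf)  # ~15,000 h, i.e. 1.5 x the single-unit MTBF
```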
Because these calculations must be executed precisely, acquiring<br />
commercial reliability software is one solution; soliciting<br />
an external reliability service is another.<br />
<strong>Schneider</strong> <strong>Electric</strong> Industries SAS<br />
Head Office<br />
89, bd Franklin Roosevelt<br />
92506 Rueil-Malmaison Cedex<br />
FRANCE<br />
www.schneider-electric.com<br />
Owing to the evolution of standards and equipment, the characteristics indicated in the text and images<br />
of this document are binding only after confirmation by our departments.<br />
Print:<br />
Version 1 - 06 2009<br />