SCADA and Alarm Management - DCE FEL ČVUT v Praze


SCADA and Alarm Management

Pavel Burget

pavel.burget@fel.cvut.cz

CTU Prague

Faculty of Electrical Engineering

Department of Control Engineering

http://dce.fel.cvut.cz

Lecture contents

• What is alarm management

• Use cases

Objectives

Sources:

D. Rothenberg: Alarm Management for Process Control

EEMUA: Alarm Systems. A Guide to Design, Management and Procurement

The High Performance HMI Handbook

ASM Consortium: Abnormal Situation Management

P. Andow: Best Practices in Process Plant Alarm Management

Process Vue: Alarm Management through Visualization

G. Abreu: Alarm Analysis and Management

13.12.2012



Requirements for safe operation

• Hazards must be recognized and understood

• Equipment must be adequate

• Maintain plant integrity

– Systems must be available and procedures defined

• Competent staff

• Be prepared for emergency

• Monitor actual performance

In the area of alarm management, most companies fail to meet these basic requirements for safe operation.

Plant Performance: various cost elements

[Figure: plant performance and its cost elements. Marked levels: Theoretical Limit, Current Limit, Operating Target, Break-even, Fixed Costs (Idle Plant). Regions: Profit, Loss, Comfort Margin, "Theoretically possible; currently unsustainable". Cost elements: Future upgrades (e.g., Advanced Control), Lost opportunity (Cost of comfort), Lost Profit, Lost Revenue, Additional unplanned costs. Other labels: Efficiency, Incident, Shut down, Accident, Equipment damage, etc.]



A look at plant operations

[Figure: a typical production profile for an asset-intensive facility for a calendar year. Axes: Days per Year and Daily Production (< 60% up to 95% and 100%). Bar values: 5, 30, 16, 8, 47, 79, 95, 62 and 23 days. Marked levels: Production Target set by Enterprise, Plant Operating Target, Plant Capacity Limit.]

Factors affecting plant operations

[Figure: the same production profile annotated with Plant Availability, Planning Constraints, Operational Constraints, Plant Incidents, Production Effectiveness, Asset Utilization and Agility/Flexibility.]



Real-life Examples

Site studies have identified plant lost opportunity: between 3% and 15% in lost capacity is attributed to asset unavailability and incidents.

• One plant had $24.2M in lost capacity due to asset availability and incidents.

• One plant had 5.8% in lost capacity.

• One plant lost $38.5M.

• Another plant lost $33.5M.

[Figure: production profile (Days per Year, Daily Production from < 60% to 95% and 100%) with the annotation "NEW EMPHASIS!!". Labels: Asset Management (Reliability & CMMS); Manufacturing Execution (Scheduling & ERP); Production Management (DCS/APC/optimization efforts); Plant Availability; Plant Incidents; Planning Constraints; Operational Constraints; Plant Operating Target; Plant Capacity Limit.]



Major Profit Potential

Emphasis on plant and equipment reliability improvements and reduced incidents can result in a recovery of 3-15% of lost capacity!

[Figure: production profile (Days per Year, Daily Production from < 60% to 95% and 100%). Labels: Higher Plant Operating Target, Fewer Planning Constraints, Fewer Operational Constraints, Plant Capacity Limit.]

Incident Use Cases

• Three Mile Island – 1979

– A series of failures and operational missteps occurred that resulted in the release of a measurable amount of radioactive material into the air.

Alarm management issues

– Alarms are not applied properly

– The use of alarms is not fully understood

Alarm systems can avoid these incidents



Incident Use Cases

• Piper Alpha Oil Rig – 1988

– An accumulation of errors and questionable decisions caused a catastrophic fire on the offshore platform, causing 167 deaths and billions of dollars' worth of damage.

Alarm management issues

– Inadequate shift handovers

– Issues with false alarms and HMI design

Incident Use Cases

• Texaco Milford Haven Refinery – 1994

– A severe electrical storm caused plant disturbances. An

explosion that occurred five hours later was a combination of

failures in management, equipment and control systems

during the plant upset. Twenty‐six people were injured and

damage of £48 million was caused.

Alarm management issues

– Too many alarms that were poorly prioritised

– Control room displays did not help the operators to

understand what was happening

– During the last 11 minutes before the explosion, the two

operators had to recognise, acknowledge and act on 275

alarms



Incident Use Cases

• Buncefield Oil Storage – 2005

– In December 2005, a number of explosions occurred. A large fire engulfed a high proportion of the site. Over 40 people were injured; fortunately there were no fatalities.

Alarm management issues

– Alarms are not tested regularly

Areas of incident prevention

[Figure: incident prevention model spanning the organization, the workplace and the individual. Labels: Source Failure Types; Functional Failure Types; General Failure Types; Stylistic or Cultural Indicators; Commitment; Competence; Cognizance; Management; data collected & analyzed; diagnostic and remedial measures; Safety Information System; interface between the organization & the individual; Condition Tokens; Precursors; Unsafe Acts; Errors & Violations; Accidents; Incidents; Near-Misses; near miss auditing; 1-10 hit list; Proactive Design; SI Projects; Best Practices.]

• Precursors: poor workplace design, high workload, unsociable hours, inadequate training, poor perception of hazards

• Areas: Alarms, Human Factors, Control room design, Training, Workspace, Motivation, Attitude, Group Factors, Working Practice



Managing Abnormal Situations

[Figure: layers of plant states with the associated operational modes, critical systems, operational goals and plant activities.]

• Plant States: Normal, Abnormal, Out of Control, Accident, Disaster

• Operational Modes: Normal, Abnormal, Emergency

• Critical Systems: Plant Management Systems; Process Equipment, DCS, Automatic Controls; Decision Support System; DCS Alarm System; Safety Shutdown, Protective Systems, Hardwired Emergency Alarms; Physical and Mechanical Containment System; Area Emergency Response System; Site Emergency Response System

• Operational Goals: Keep Normal, Return to Normal, Bring to Safe State, Minimize Impact

• Plant Activities: Preventative Monitoring & Testing; Manual Control & Troubleshooting; Firefighting, First Aid, Rescue, Evacuation

Control by alarms



Alarm System Awareness of Disturbances

[Figure: potential impact of the initiating event plotted against time, from the incident to the point of operator awareness; correct intervention causes return to normal.]

With typical alarm systems, orienting begins after an event creates an abnormal plant state. The extent of the problem can impact the operator's ability to be fully aware of the locations of process disturbances.

As disturbances propagate, the number of conditions to be aware of increases, as do the response requirements and the likelihood of missing important information.

The Setting of a high pre-trip alarm

[Figure (EEMUA Alarm Systems Guide, page 17): alarmed variable plotted against time, with points A and B and the abnormal operating region. Labels: limit of largest normal operational fluctuation; alarm setting; maximum rate of change of alarmed variable during fault; time for operator to respond to alarm and correct fault; limit at which protection operates.]

[Timing labels: failure occurrence in the process or in the safeguarding system; failure is detected; safe status of the process assured.]
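The constraints named in the pre-trip alarm figure can be sketched numerically. This is only an illustration, assuming the alarmed variable ramps linearly at its maximum rate of change during the fault; the function name, parameters and example numbers are hypothetical and are not taken from the EEMUA guide.

```python
def alarm_setting_window(trip_limit, normal_fluctuation_limit,
                         max_rate_of_change, operator_response_time):
    """Return (lowest, highest) admissible high-alarm setting, or None.

    Assumption: during a fault the variable rises at most linearly at
    max_rate_of_change (units per minute); operator_response_time is
    given in minutes.
    """
    # Highest setting that still leaves the operator enough time to
    # respond before the protection system operates.
    highest = trip_limit - max_rate_of_change * operator_response_time
    # Lowest setting that stays clear of normal operational fluctuation.
    lowest = normal_fluctuation_limit
    if highest < lowest:
        return None  # no setting satisfies both constraints
    return (lowest, highest)


# Hypothetical example: trip at 12 bar, normal fluctuation up to 8 bar,
# worst-case ramp of 0.25 bar/min, 10 minutes needed by the operator.
window = alarm_setting_window(12.0, 8.0, 0.25, 10.0)
```

If the function returns None, the operator cannot realistically act before the trip, which suggests the alarm alone is not a sufficient safeguard.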



Alarm System Management of Problems

[Figure: potential impact of the initiating event plotted against time, from the incident through the point of operator awareness; correct intervention causes return to normal.]

• Standing alarms interfere with orientation

• Alarm floods delay evaluation

• Inadequate filtering interferes with action

Good Alarm Management in Situation Awareness

[Figure: potential impact of the initiating event plotted against time, showing the average shift in awareness with decision support.]

• Increases likelihood of awareness of disturbances

• Reduces time to awareness

• Hence, reduces the average impact of initiating events



Impact of Protection System

[Figure: impact of the initiating event plotted against time. Labels: High Alarm, Emergency Alarm, Trip from SIS, operator diagnostic time, Process Safety Time, Emergency.]

[Figure: potential impact of the initiating event plotted against time, with SAFE and UNSAFE regions separated at the FTT. Labels: High, Trip, Incident, Best, Suboptimal, Incorrect, No response, Profit, Quality, Loss.]

FTT = Fault Tolerance Time



Impact of Decision Support System for Optimal Response

[Figure: potential impact of the initiating event plotted against time.]

• Reduces errors

• Decreases time to implement response

• Manages side effects

• Increases awareness

Timing Diagram

[Figure: timeline t starting at the failure occurrence in the process or in the safeguarding system. Intervals: system internal diagnostic time (until the failure is detected), time for corrective action, and time for reaction of the process on the corrective action (until the safe status of the process is assured). These intervals are shown against the fault tolerance time of the process, or Process Safety Time (PST).]
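The timing diagram can be read as a simple budget check. The sketch below assumes that the three intervals in the diagram add up and that their sum must not exceed the fault tolerance time; the function name and the example values are illustrative only.

```python
def response_fits_fault_tolerance_time(diagnostic_time,
                                       corrective_action_time,
                                       process_reaction_time,
                                       fault_tolerance_time):
    """Return (fits, margin) for one failure scenario.

    All times use the same unit (e.g. seconds). The margin is the
    time left over inside the fault tolerance time; a negative margin
    means the safe state is reached too late.
    """
    total = diagnostic_time + corrective_action_time + process_reaction_time
    margin = fault_tolerance_time - total
    return margin >= 0, margin


# Hypothetical example: 2 s diagnostics, 5 s corrective action,
# 10 s process reaction, 20 s fault tolerance time.
fits, margin = response_fits_fault_tolerance_time(2.0, 5.0, 10.0, 20.0)
```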



What is EEMUA?

• Stands for: The Engineering Equipment and Materials Users' Association

• European-based non-profit industry association

• Aims to improve the safety, environmental and operating performance of industrial facilities in the most cost-effective way.

• Published Publication 191, Alarm Systems: A Guide to Design, Management and Procurement

• Updated in 2007 (2nd edition) with field-proven information.

• Leverages information from the ASM Consortium, industry studies and incident analysis.

• Website resources and references: http://www.eemua.co.uk

• Good benchmarking information such as activity rates, risk management (prioritization) and total alarms installed.

• Operator questionnaires to assess the usefulness of the alarm system and guide improvements.

• A resource for rationalization and analysis techniques.

• Overall a great guide that has been deemed the de facto standard for alarm management

What is ISA 18.02?

Management of Alarm Systems for the Process Industries.

• ISA effort to develop a standard that provides a blueprint for developing an effective alarm management strategy.

• It won't tell automation suppliers how to design their alarm system, but it provides guidance to help them put together solutions that allow end users to design their own alarm management strategy.

• Outlines best practices for both existing and new facilities.

• It is expected that the next generation of alarm solutions will provide more metrics and improved identification of floods and chattering alarms, giving users quicker access to the data they need.

• Prescribes a life-cycle-based approach to managing alarms



Alarm Management Life Cycle

Source: ISA 18.02 draft

• Philosophy

• Identification

• Rationalization

• Design

• Implementation

• Operation

• Maintenance

• Change Management

• Monitoring / Assessment

• Audit

Looking at the ISA 18.02 life cycle approach, alarm analysis plays an important role in monitoring, assessing and auditing the alarm management system.

What is the Alarm Analysis Purpose?

What can we do with alarm analysis?

• Find alarms that do not function correctly

• Performance metrics

• Alarm activity / rates

• Frequent alarms (bad actors)

• Isolate and analyze individual plant areas

• Analyze plant upsets or alarm floods

• Priority distribution of alarms

• Stale / long-lasting alarms

• Chattering alarms

• General statistics on alarms

• Performance metrics compared with industry guidelines (EEMUA)



What Data is essential for Alarm Analysis

• Date and Time of Events

• Tag Name

• Tag Type

• Descriptions

• Event Priority

• Event Type (high Alarm, trip, failed)

• Operator Action (Acknowledged, Return)

• Value exceeded

• Plant Area Code (optional)

• Reporting Node (optional)

This data is obtained by different methods, such as OPC Alarms and Events servers or manufacturer-specific publishing servers (via an API).
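The fields listed above map naturally onto one record per alarm event. The following is a minimal sketch of such a record; the class name, field names and the dictionary keys used by the parser are assumptions chosen for illustration, not a format defined by OPC or by any vendor.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AlarmEvent:
    timestamp: datetime               # date and time of the event
    tag_name: str
    tag_type: str
    description: str
    priority: str                     # e.g. "high", "medium", "low"
    event_type: str                   # e.g. "high alarm", "trip", "failed"
    operator_action: Optional[str] = None   # e.g. "acknowledged", "return"
    value: Optional[float] = None            # value exceeded
    plant_area: Optional[str] = None         # optional
    reporting_node: Optional[str] = None     # optional


def parse_event(row: dict) -> AlarmEvent:
    """Build an AlarmEvent from a dict with an ISO 8601 'timestamp' string."""
    fields = dict(row)
    fields["timestamp"] = datetime.fromisoformat(row["timestamp"])
    return AlarmEvent(**fields)


event = parse_event({
    "timestamp": "2012-12-13T08:15:00",
    "tag_name": "LI200",
    "tag_type": "analog input",
    "description": "Tank level",
    "priority": "high",
    "event_type": "high alarm",
})
```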

Outcome of Alarm Analysis. Alarm Activity Analysis

– The average alarm rate is a good indicator of system health.

– Unusually high activity levels are plotted in trend charts.

– Excessive events can result from abnormal conditions.

– About 150 alarms per day is considered acceptable per the EEMUA guidelines.

– Up to about 300 alarms per day is considered manageable.

– Anything beyond these metrics places a load on the operator that compromises safety and efficiency.

– Measuring on an hourly basis can be quite effective in monitoring alarm system performance.

– Measuring alarms over a 10-minute period can effectively identify the beginning of floods, e.g. 10 alarms in 10 minutes.

– Additional activity parameters that are useful:

• Days with max and min activity

• Top activity days

• Percent of totals and accumulated percents

• Alarms per hour, per shift, per day and per 10-minute period, with average, max and min

• Average alarms per hour for different priorities

• Averages per hour with minimum and maximum
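A daily activity analysis of this kind can be sketched in a few lines. The example assumes the alarm log is available as a list of timestamps and reads the benchmark figures as roughly 150 alarms per day being acceptable and roughly 300 per day being the manageable upper bound; the function names are illustrative.

```python
from collections import Counter

# Daily benchmark values as interpreted from the text above (assumption).
ACCEPTABLE_PER_DAY = 150
MANAGEABLE_PER_DAY = 300


def alarms_per_day(timestamps):
    """Count alarms for each calendar day that has at least one alarm."""
    return Counter(ts.date() for ts in timestamps)


def classify_daily_rate(count):
    """Classify one day's alarm count against the benchmark values."""
    if count <= ACCEPTABLE_PER_DAY:
        return "acceptable"
    if count <= MANAGEABLE_PER_DAY:
        return "manageable"
    return "excessive"


def activity_summary(timestamps):
    """Average daily rate plus the days with maximum and minimum activity."""
    per_day = alarms_per_day(timestamps)
    counts = list(per_day.values())
    return {
        "average_per_day": sum(counts) / len(counts),
        "max_day": max(per_day, key=per_day.get),
        "min_day": min(per_day, key=per_day.get),
    }
```

Days without any alarm do not appear in the counter, so the average is taken over active days only; a production analysis would probably fill in the empty days first.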



• Determine periods of activity where the alarm rate is higher than the operator can handle.

• During these periods alarms become a nuisance and a distraction.

• Important alarms can easily be missed.

• The calculation involves measuring how long the system produces alarms above a maximum rate, e.g. 10 alarms in 10 minutes. A minimum rate is also defined to end the flood period, e.g. 5 alarms in 10 minutes.

• Additional data from flood events can include:

– Number of floods per day / week

– Total duration of all floods

– Start / stop time of each flood

– Flood severity: how many alarms occurred during the flood (top ten flood events during the period)

– Percentage of time the alarm system is in a flood condition
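The start and end thresholds for a flood can be turned into a small detector. This sketch counts alarms in consecutive fixed 10-minute windows, which is one possible reading of the rule; it assumes a flood starts when a window reaches the start threshold and ends with the first window that falls below the end threshold. The function name and defaults are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_floods(timestamps, start_threshold=10, end_threshold=5,
                window=timedelta(minutes=10)):
    """Return a list of (start, end, alarm_count) flood periods."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    origin = ordered[0]
    # Number of alarms in each fixed window, indexed from the first alarm.
    counts = Counter((ts - origin) // window for ts in ordered)
    last_index = max(counts)
    floods = []
    current = None  # [index of first flood window, alarms so far]
    for index in range(last_index + 1):
        n = counts.get(index, 0)
        if current is None:
            if n >= start_threshold:
                current = [index, n]
        elif n < end_threshold:
            # Flood ends at the start of the first quiet window.
            floods.append((origin + current[0] * window,
                           origin + index * window, current[1]))
            current = None
        else:
            current[1] += n
    if current is not None:
        # Flood still running at the end of the log.
        floods.append((origin + current[0] * window,
                       origin + (last_index + 1) * window, current[1]))
    return floods
```

From the returned periods, the number of floods, their total duration and the percentage of time spent in flood follow directly.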

• A few signals often produce a large percentage of alarms.

• Frequent alarm analysis helps expose those signals.

• Substantial performance improvement can be gained by addressing these signals first.

• These signals include nuisance alarms, which are suspect and can't be relied upon.

• Potentially hazardous situations can result when operators do not trust the validity of alarms.

• Often 10 to 20 of the most frequent alarms produce between 20 and 80 percent of all alarms.

• A common practice is to suppress nuisance alarms, but without proper rationalization this approach can result in a high volume of non-annunciated alarms, which can have hazardous consequences: financial losses and safety and environmental impact.

• Additional useful information from this analysis:

– Number of occurrences

– Tag names and descriptions

– Priority of the alarms

– Percent of total and accumulated percentage
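A frequent alarm ("bad actor") report is essentially a ranked count with running percentages. A minimal sketch, assuming the log has been reduced to a list of tag names (one entry per alarm occurrence); the tag names in the example are made up.

```python
from collections import Counter


def top_alarms(tags, n=10):
    """Return rows of (tag, count, percent_of_total, cumulative_percent)."""
    counts = Counter(tags)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for tag, count in counts.most_common(n):
        cumulative += count
        rows.append((tag, count,
                     100.0 * count / total,
                     100.0 * cumulative / total))
    return rows


# Hypothetical log: ten alarm occurrences from three tags.
rows = top_alarms(["FIC101"] * 6 + ["LI200"] * 3 + ["TI300"])
```

The cumulative column shows how few tags need to be addressed to remove most of the alarm load.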

Alarm Floods Analysis

Frequent Alarms Analysis



• These are alarms that transition in and out of the alarm state within a short amount of time.

• The time and rate must reflect what is considered too fast for an operator to take action, e.g. 3 occurrences in 1 minute.

• It is common for a chattering alarm to produce hundreds of alarms in a few hours.

• They are a very big distraction for operators and can cause periods of flood.

• They may reflect:

– Instrument problems

– Poor control

– Bad dead bands

– Improper delay settings

• Chattering alarms are easily fixed and should not be ignored.

• Information from the chatter analysis that is useful for correcting the problem:

– Top ten chattering signals

– Total alarms due to chatter

– Percentage of total alarms due to chatter
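The "3 occurrences in 1 minute" rule can be checked per tag with a sliding window. The sketch below assumes the log is a sequence of (timestamp, tag) alarm activations and counts, for each tag, how many activations belong to at least one chatter burst; the names and thresholds are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def chattering_tags(events, min_occurrences=3, window=timedelta(minutes=1)):
    """Return {tag: number of activations that are part of a chatter burst}."""
    by_tag = defaultdict(list)
    for timestamp, tag in events:
        by_tag[tag].append(timestamp)
    result = {}
    for tag, times in by_tag.items():
        times.sort()
        in_burst = set()
        start = 0
        for end in range(len(times)):
            # Advance the left edge until the span fits inside the window.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= min_occurrences:
                in_burst.update(range(start, end + 1))
        if in_burst:
            result[tag] = len(in_burst)
    return result
```

Sorting the result by value gives the top chattering signals, and the sum of the values gives the total alarms due to chatter.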

• These are alarms that the operator does not pay attention to.

• They provide little value to the operator.

• More than 24 hours is a good time window for measuring standing alarms.

• The analysis provides a list of alarms that should be candidates for rationalization and possible suppression under certain conditions.

• Other information that is useful from this analysis:

– Top ten standing alarm signals

– Time standing

– Number of active alarms per hour

– Open alarms in the period, with min and max
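Standing alarms can be listed by comparing each active alarm's activation time with the 24-hour window. A minimal sketch, assuming a mapping from tag to the time the alarm became active and has not yet returned to normal; the names are illustrative.

```python
from datetime import datetime, timedelta


def standing_alarms(active_since, now, threshold=timedelta(hours=24)):
    """Return [(tag, time_standing)] for alarms active longer than threshold.

    The result is sorted with the longest-standing alarm first.
    """
    standing = [(tag, now - since)
                for tag, since in active_since.items()
                if now - since > threshold]
    return sorted(standing, key=lambda item: item[1], reverse=True)
```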

Chattering Alarm Analysis

Long Standing Alarm Analysis



Alarm distribution Analysis (priority and others)

• Distribution analysis by priority can indicate that the alarm system may not be rationalized correctly.

• EEMUA has recommendations on priority distribution.

• Alarm priority builds risk management into the system; a poor priority scheme can compromise safety and efficiency.

• Too many high and medium priority alarms can overwhelm operators during upsets, making it hard to distinguish the importance of each alarm.

• The priority distribution can vary significantly between configuration and occurrence.

• Other distribution analyses that can be useful:

– Alarms by region

– Alarms by signal type

– Alarms by event type (High, Low, etc.)

– Alarms by control node
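The difference between the configured and the occurring priority distribution can be made explicit with a short calculation. This sketch assumes two lists of priority labels, one entry per configured alarm and one per alarm occurrence; the labels and numbers in the example are hypothetical.

```python
from collections import Counter


def priority_distribution(priorities):
    """Return {priority: percent of all entries}."""
    counts = Counter(priorities)
    total = sum(counts.values())
    return {p: 100.0 * c / total for p, c in counts.items()}


def distribution_shift(configured, occurred):
    """Percentage-point difference, occurrence minus configuration."""
    conf = priority_distribution(configured)
    occ = priority_distribution(occurred)
    return {p: occ.get(p, 0.0) - conf.get(p, 0.0)
            for p in set(conf) | set(occ)}


# Hypothetical system: few high priority alarms are configured,
# but they dominate the alarms that actually occur.
configured = ["high"] * 1 + ["medium"] * 3 + ["low"] * 16
occurred = ["high"] * 5 + ["medium"] * 3 + ["low"] * 2
shift = distribution_shift(configured, occurred)
```

A large positive shift for the high priority class is a hint that either the plant is running close to its limits or the priorities need rationalization.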

Correlated Alarm Analysis

• This is also called parent/child analysis or cause/consequence analysis.

• Over an analysis period (1 day, 1 week, etc.), a time window (e.g. 60 seconds) is specified after each occurrence of a parent alarm. If the child alarm occurs within that window a sufficient number of times, the two alarms are considered probabilistically correlated.

• Identifies alarms that are closely linked and may convey the same information, i.e. alarms with the same root cause.

• High alarms and high-high alarms can be correlated if the settings are too close together.

• Parent/child patterns can be referenced to a specific plant event, such as an equipment trip, toxic release, etc.

• Statistical correlation suggests patterns that help expose abnormal plant operations that might otherwise look random.
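The parent/child rule can be sketched as a count of how often the child alarm follows the parent within the time window. The function below is an illustration only; the name, the 60-second default and the choice to report both the hit count and the ratio are assumptions, and the threshold for calling two alarms correlated is left to the analyst.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def follow_statistics(parent_times, child_times,
                      window=timedelta(seconds=60)):
    """Return (hits, ratio): parent alarms followed by a child in the window."""
    children = sorted(child_times)
    hits = 0
    for parent in parent_times:
        # Children with parent <= time <= parent + window.
        first = bisect_left(children, parent)
        last = bisect_right(children, parent + window)
        if last > first:
            hits += 1
    ratio = hits / len(parent_times) if parent_times else 0.0
    return hits, ratio
```

A pair could then be flagged when, for example, both the hit count and the ratio exceed limits chosen for the plant.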



Configuration Analysis

• Provides information about how the alarm system is configured:

– Number of alarms installed

– How many alarms per priority, region, type, node

– Total alarms from instruments

– Alarms from the control system (generated)

• These are generally good statistics to compare with the analysis of the actual events, which is expected to yield understandable results.

• E.g. if the system produces a large number of high priority alarms, the configuration analysis can reveal the percentage of high priority alarms installed. This in turn can uncover whether the system is dangerously operated or simply wrongly configured.

Approach to use Analysis results with life cycle

If the alarm system is new, it is recommended to follow the life cycle approach proposed by the ISA 18.02 standard.

On existing systems, two different approaches may be used to properly utilize alarm analysis results:

a) Short term (sanity actions)

– Root cause analysis of the most frequent alarms

– Elimination of chattering alarms

– Review and redesign of standing points (possible suppression)

– Review of alarm correlations and implementation of possible corrections to abnormal issues

b) Long term

– Update or create an alarm philosophy

– Reevaluate and rationalize all the alarms for effectiveness and risks (priorities)

– Design and implement changes for the gaps identified in the philosophy and rationalization steps

– Fully establish the monitoring, change management and auditing procedures to ensure proper management



Additional tips for Improving an Alarm System

– Re-evaluate priorities from time to time.

– Reduce standing alarms.

– Identify and resolve implementation issues.

– Create an alarm management team to review alarm management strategies.

– Continuous training on new and changed alarms is the key to excellent operational results.

– Uncover new areas of improvement to achieve higher levels of performance.

– Do not assume that all trip points and limits are correct; review them from time to time to make sure they have not changed.

– Keep an eye on any new projects (links, integration, system modifications, etc.) for excessive alarming.

– When an alarm is removed, use another method for situation awareness, e.g. graphics.

Distributed gathering/setting of alarms

[Figure: Asset Management, Engineering Tool and HMI exchanging data with a field device through the automation system. Labels: Diagnosis, Calibration, Maintenance; Configuration, Parameterization, Commissioning; Process variables, Clock synchronization, Status; Operating, Alarm reporting, Monitoring.]



Distributed gathering/setting of alarms

PA profile – device architecture

[Figure: Profibus PA device architecture. Transmitter: PB, TB (absolute pressure), FB (analog input). Actuator: PB, TB (electric valve), FB (analog output).]

• PB (Physical Block)

• FB (Function Block)

• TB (Transducer Block)



PA profile

• Location of individual blocks

[Figure: object directory of a PA device. Labels: header; references to the list of pointers; pointers to individual blocks (PB, TB1 ... TBn, FB1 ... FBn); DP services cyclic; DP services acyclic.]

• PA-Profile parameters (e.g. for pressure transmitters): Status, Measured value, Measuring range, Filter time, Alarm/warning limits, Alarm summary, TAG, Manufacturer-specific parameters



AI function block

Limit values

[Figure: measuring range in bar (axis values as extracted: 12 bar, 8 bar, 0 bar, -12 bar) with the following levels, from top to bottom:]

• Physical limit of the measuring sensor

• Measuring range limit

• HI-HI-LIM (Upper alarm limit)

• HI-LIM (Upper warning limit)

• LO-LIM (Lower warning limit)

• LO-LO-LIM (Lower alarm limit)

• Measuring range limit

• Physical limit of the measuring sensor

Parameters: PV_SCALE (scaling of the measuring range), OUT (measured value)
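The four limit values of the AI function block partition the measuring range into alarm, warning and normal bands. The sketch below illustrates that partition only; whether the comparison is inclusive, and any hysteresis a real device applies, are assumptions left out here, and the limit values in the example are hypothetical.

```python
def classify_against_limits(value, lo_lo_lim, lo_lim, hi_lim, hi_hi_lim):
    """Map a measured value to the band defined by the four limits.

    Assumes lo_lo_lim < lo_lim < hi_lim < hi_hi_lim and inclusive limits.
    """
    if value >= hi_hi_lim:
        return "HI-HI alarm"
    if value >= hi_lim:
        return "HI warning"
    if value <= lo_lo_lim:
        return "LO-LO alarm"
    if value <= lo_lim:
        return "LO warning"
    return "normal"


# Hypothetical pressure limits in bar.
state = classify_against_limits(8.5, lo_lo_lim=1.0, lo_lim=2.0,
                                hi_lim=8.0, hi_hi_lim=10.0)
```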



FDT/DTM – more details

[Figure: FDT/DTM structure communicating over Ethernet. Labels: Communication DTM, Gateway DTM, Device DTM, BTM.]

The measured value itself (4 bytes - float) and status byte

[Figure: bit layout of the status byte from MSB to LSB (bit labels as extracted: GR QS QS QU).]

• Measured value

• Quality (good, bad, uncertain, ...)

• Auxiliary status (other details of the quality, e.g. configuration error, sensor error, service state)

• Limits (within limits, under limit, above limit)
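Decoding the 5-byte measured value (float plus status byte) can be sketched as follows. The bit layout is an assumption based on the usual PROFIBUS PA convention (quality in the two most significant bits, auxiliary status in the next four, limits in the two least significant bits) and big-endian byte order; the slide itself only names the three parts, so the layout should be checked against the profile specification before use.

```python
import struct

# Assumed coding of the two quality bits and the two limit bits.
QUALITY = {0: "bad", 1: "uncertain",
           2: "good (non-cascade)", 3: "good (cascade)"}
LIMITS = {0: "within limits", 1: "low limited",
          2: "high limited", 3: "constant"}


def decode_out(raw: bytes) -> dict:
    """Decode 5 bytes: big-endian IEEE 754 float followed by a status byte."""
    value, status = struct.unpack(">fB", raw)
    return {
        "value": value,
        "quality": QUALITY[status >> 6],        # bits 7..6
        "substatus": (status >> 2) & 0x0F,      # bits 5..2, auxiliary status
        "limits": LIMITS[status & 0x03],        # bits 1..0
    }


# Hypothetical telegram content: 8.5 with status byte 0x8E.
decoded = decode_out(struct.pack(">fB", 8.5, 0x8E))
```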



[Figure: Process display overview – level 1]

[Figure: Process display detail – level 2]



[Figure: Process display detail – level 3]
