Draft Human Error and Safety Risk Analysis (HESRA) Methodology for Federal Aviation Administration Air Traffic Control Maintenance and Operations (Revision 7)

January 2009

Submitted to:
Dino Piccione
FAA
800 Independence Avenue
ATO-P R&D (Room 907)
Washington, DC 20591

Submitted by:
Michael Maddox
Corinna Proctor
HumanCentric Research, LLC.
111 James Jackson Ave.
Suite 221
Cary, NC 27513-3164
(919) 481-0565
(919) 481-0310 Fax
www.humancentricresearch.com


TABLE OF CONTENTS

1.0 REVISION HISTORY
2.0 EXECUTIVE SUMMARY
  2.1 Caveats
3.0 INTRODUCTION
  3.1 Background
  3.2 HESRA Overview
    3.2.1 Applicability to Life Cycle Stage
    3.2.2 Proactive Assessment of Human Error Risk/Potential
    3.2.3 Formalizes Risk Assessment Process
    3.2.4 Based on Engineering Risk and Reliability Assessment Techniques
    3.2.5 Alternative to Quantitative Risk Assessment Techniques
    3.2.6 Use of Data from Incident/Error History
    3.2.7 Use of Data from Formal or Informal Usability Tests
    3.2.8 Establishes Working Group
  3.3 Objectives
4.0 SCOPE AND LIMITATIONS
  4.1 ATC Facility Maintenance Procedures and Systems
  4.2 Extensibility to Other Domains
  4.3 Applicability to Developmental and Existing ATC Facilities and Systems
  4.4 Handles Discrete Errors
  4.5 Handling Multiple Discrete Errors
5.0 ANALYSIS PROCESS
  5.1 Step 1 - Establish Analysis Team
    5.1.1 Composition
    5.1.2 Roles and Responsibilities
    5.1.3 Team Tasks
  5.2 Step 2: Familiarize Team with System to Analyze
  5.3 Step 3: Prioritize Procedures to Analyze
    5.3.1 Select Procedure Subset
    5.3.2 Walk Through Selected Procedure(s)
    5.3.3 Pre-Analysis Risk Reduction
  5.4 Step 4 - Set Analysis Perspective
    5.4.1 User Population
    5.4.2 Usage Environment
    5.4.3 Performance Shaping Factors
    5.4.4 Overall Complexity of System, User Interface and/or Procedure
  5.5 Step 5 – Define Tasks
    5.5.1 Evaluate Level of Task Detail
    5.5.2 The General Task
    5.5.3 Define/Identify More Detailed Tasks Where Required
  5.6 Step 6 – Define Steps
    5.6.1 Identify Steps to Complete Tasks
    5.6.2 Enter Steps into Analysis Tool/Document
  5.7 Step 7 – Define Errors, Causes, and Effects
    5.7.1 Pre-Fill Errors and Causes
    5.7.2 Review Each Step
    5.7.3 Develop Exhaustive Error List
    5.7.4 Relate Errors to Other Steps, Components, Etc., if Appropriate
  5.8 Step 8 – Assign Rating for Error Likelihood
    5.8.1 Review Each Error
    5.8.2 Use Existing Error, Usability Test, and Other Data as Appropriate
    5.8.3 Look Especially for Tasks That Require Skills or Capabilities That Human Users Are Unlikely to Possess
    5.8.4 Team Discussion and Consensus
    5.8.5 Same Error for Different Task/Step
    5.8.6 Internal Consistency
    5.8.7 Influenced by Elements in 5.4
  5.9 Step 9 – Assign Severity Rating
    5.9.1 Worst-Case Scenario
    5.9.2 Team Discussion and Consensus
    5.9.3 Account for Conditional Errors, Sequence of Errors, Etc.
    5.9.4 Not Greatly Influenced by Elements in 5.4
    5.9.5 Internal Consistency
  5.10 Step 10 – Assign Rating for Detection and Recovery
    5.10.1 Automatic Recovery
    5.10.2 Composite of Detection and Recovery
    5.10.3 Influenced by Severity
    5.10.4 Influenced by Elements in 5.4
  5.11 Step 11 – Calculate Hazard Index and RPN
  5.12 Step 12 – Analyze Criticality
    5.12.1 HI Criticality
    5.12.2 RPN Criticality
    5.12.3 Sort by Severity, HI, RPN
    5.12.4 Compare Levels of HI and RPN to Criticality Breakpoints
    5.12.5 Determine Action Requirements for Each Error
  5.13 Step 13 – Reduce Risk
    5.13.1 Develop Initial Risk Reduction Suggestions
    5.13.2 Re-convene HESRA Team
    5.13.3 Assign Ratings Assuming Remediation
    5.13.4 Assess Impact on HI and RPN
    5.13.5 Iterate Remediation if Risk Is Not Sufficiently Reduced
  5.14 Step 14 – Produce Risk Analysis Report
    5.14.1 Overview of System and Procedures Analyzed
    5.14.2 General Statement of Findings
    5.14.3 Overall Recommendations Related to System and Procedures
    5.14.4 Explicit Listing of "High Priority" Errors
    5.14.5 Explicit Listing and Description of Proposed Remedy
    5.14.6 Link or Provide Access to HESRA Analysis Spreadsheet (or Whatever Software Tool Is Used to Support the Analysis)
    5.14.7 Statement of Concurrence of Analysis Team
    5.14.8 Statement(s) of Exceptions from Analysis Team Members
  5.15 Step 15 – Assign Remediation Actions
  5.16 Step 16 - Monitor Remediation to Ensure Actions Are Completed


1.0 REVISION HISTORY

• Version 1.0 (August 2005): Initial draft. Applicable to existing maintenance/operations procedures.
• Version 2.0 (December 27, 2006): Updated with revised prioritization scheme.
• Version 3.0 (April 2007): Updated to be in accord with the scalar directions and nomenclature conventions reflected in FAA's existing SMS documentation.
• Version 4.0 (June 2007): Fixed section references, flipped listing order in rating scale tables, reviewed definitions of scale anchors.
• Version 5.0 (July 2007): Replaced "maintenance" with references to maintenance or operational procedures. Added executive summary.
• Version 6.0 (August 2007): Added schematic of HESRA matrix. Removed spreadsheet illustration figures.
• Version 7.0 (January 2009): Changed criticality breakpoints and definitions of breakpoints for the Hazard Index (Table 9) and Risk Priority Number (Tables 11 and 12) to accurately reflect reversed rating scales.

2.0 EXECUTIVE SUMMARY

Human error is the predominant component of serious incidents in the aviation (and every other) domain. Estimates of human error as an initiating or contributing factor in serious incidents typically fall in the 70-90% range. This will come as no surprise to those who work in the aviation profession. Just as in other segments of the profession, the operations and maintenance components of the U.S. air traffic control (ATC) system exhibit high levels of human error involvement in incidents and accidents. Until recently, however, the FAA has not had at its disposal an objective and straightforward human error risk estimation method.

This is not to imply that viable human error risk analysis methods do not exist. There are quite a few. However, because of the variability in human behavior and the unique organizational requirements of different topical domains, a method that works well in one domain might be a very difficult fit for another.

Most existing proactive human error analysis tools, regardless of their specific application domain, are based in one way or another on the engineering component failure risk estimation technique known as Failure Modes and Effects Analysis, or FMEA. Even with this common ancestry, however, different forms of human error risk analysis have been developed for, for example, the critical healthcare and space exploration fields.

Some of these differences are purely the result of different domain terminology or specific error outcomes that must be reflected in the risk analysis method.


For example, the result of an error in an intensive care unit might be the death of a patient, whereas the most serious outcome of an error in computer network allocation might be a server outage. Also, the granularity of various techniques might make certain techniques more appropriate than others for a particular application.

This document provides the theoretical basis and procedural information required for a human error risk analysis method named HESRA (Human Error and Safety Risk Analysis). HESRA is specifically tailored to the FAA ATO environment. It can be applied either to mature systems or to systems under development.

The FAA sponsored this work to provide system engineers, ATC, Tech Ops, and other groups within the FAA with a tool that can aid in identifying and understanding human error risks. More importantly, HESRA also helps identify ways in which the likelihood of human errors can be reduced and the likelihood of early recovery from errors can be increased.

This document has been updated continually during its development. HESRA's nomenclature and scaling methods currently comply with FAA's SMS 2.0 requirements. In addition, HESRA considers the ability of humans to recover from errors before the errors cause bad consequences; the FAA SMS framework does not consider recovery.

2.1 Caveats

HESRA provides a team of system experts with a way to communicate about human error and risk. This discussion is best led by a practitioner trained in human factors and the elements of human behavior that can lead to human failures. It is unlikely that HESRA will be effective without the active participation of a human factors practitioner.

HESRA has not yet been validated in the sense that it has been applied to multiple systems in many settings. As of the date of this report, HESRA has been applied twice to a maintenance procedure required for the VSCS. HESRA is currently being applied to a system being developed to provide wake turbulence spacing information to air traffic controllers.

HESRA is part of the ATO Human Factors toolbox. It should be considered one of many diagnostic tools that can help identify human error risks in ATC systems.


3.0 INTRODUCTION

In the domain of human performance, error is a constant, diffuse presence. Human factors practitioners have been fascinated by human error for many decades. Errors are one of the most well-studied aspects of human behavior, noted for both their ubiquity and persistence. There is, in fact, no such thing as error-free human performance, at least over any meaningful time period. Even when error-producing conditions are identified, it is very difficult (but certainly not impossible) to significantly reduce errors.

The Federal Aviation Administration (FAA) has recognized this but is committed to finding ways to identify and reduce the potential for human error. Given the potentially catastrophic results of an error in the National Air Space (NAS), FAA wanted to implement a proactive method for analyzing the human error and safety risk associated with systems and procedures used in air traffic control (ATC), specifically ATC facility maintenance/operations.

This document describes the methodology and procedures that were developed for applying human error risk analysis to ATC facility maintenance/operations. The name of the method is Human Error and Safety Risk Analysis (HESRA), and it introduces proactive human error analysis into the FAA ATC facilities maintenance/operations environment.

HESRA is neither the only nor the most comprehensive technique that has been developed and applied in other domains. However, it is a method that has been shown to be workable, applicable, and effective in identifying and mitigating the conditions that are likely to increase human errors.

3.1 Background

There are very few "ground truths" in the realm of human behavior. Over many decades of applied research, it has become clear that human performance in any particular endeavor varies over a very wide range. The reason for this variation is as simple as it is perplexing: a very broad range of variables affects human performance, and the individual effects of each variable and their interactions are simply too complex to allow us to reasonably predict the outcome.

This leads to one of the few ground truths of the human factors domain: humans commit errors. In fact, humans commit errors with such frequency that various researchers have developed classification schemes in an attempt to help us name the errors more consistently. Thus, there are "skill-based, rule-based, and knowledge-based" errors; there are "slips, mistakes, and violations"; and there are errors of commission and errors of omission.

A corollary to the ground truth that humans commit errors is that it is virtually impossible to completely eliminate human errors. The best that can be done is to recognize the conditions that prompt human errors and try to arrange them to minimize their error-causing effects.

Not all is bad news, however. While it is true that human errors are pervasive events, it is also true that most errors have little or no consequence. In fact, the really bad events that occur as the result of human errors are quite rare. That rarity is the product of what is typically known as the "chain of causation". A single, isolated human error is unlikely to have severe effects. When combined with other errors, certain process states, environmental conditions, etc., however, human errors can form a link in a causative chain that can have dramatic and very bad consequences.


Because of the "chain of causation" characteristic of major accidents, we can deal with accident prevention in two ways. First, we can identify and eliminate conditions that elevate the risk of errors. Second, we can provide "cutouts" that short-circuit the chain of causation so isolated errors are not allowed to propagate to an ultimate (bad) event.

There are essentially two methods for preventing errors. The first, and most common, is to wait until something bad happens and then go back and figure out why it happened. A Root Cause Analysis is an example of this type of method. In theory, processes can then be put in place to prevent the same bad thing from happening again. The second method is to examine the process or system and try to figure out which elements place users at the most risk of committing errors. Once the high-risk elements are identified, they can be changed to present much lower risk. In effect, an error is prevented before it occurs.

This methodology concentrates on the second, proactive approach to reducing human errors. Its focus is on ATC facilities maintenance/operations. However, HESRA can be extended to the ATC operations environment.

3.2 HESRA Overview

HESRA is one of a number of human error risk analysis processes developed for specific domains. While its basis is generic, HESRA has been tailored specifically to be applicable in the FAA ATC maintenance/operations environment.

3.2.1 Applicability to Life Cycle Stage

Since it is an a priori method, HESRA can be applied at virtually any stage of the system design, procurement, and implementation cycle. The only absolute prerequisite for conducting a HESRA analysis is that the interaction process between human users and the system must be defined in enough detail to permit its decomposition into tasks and steps.

This is not to say that the output of HESRA will be equally valuable at all stages of system design and implementation. Since the likelihood of human errors is highly dependent on the complexity of procedures and user interfaces, HESRA is more likely to produce detailed and valid output when the user interface(s) and procedures exist in at least prototype form.

However, there can still be great utility in conducting a preliminary risk analysis at very early stages of the system definition process. For example, if the design team is considering particular ways for users to interact with the proposed system, it is likely that HESRA can identify modes of interaction that are more or less likely to produce errors than other modes.

3.2.2 Proactive Assessment of Human Error Risk/Potential

HESRA is a proactive risk analysis method. That is, its goal is to identify elements of process and system design that are most likely to produce human errors before those errors are actually committed, or at least before they result in significantly bad consequences. This a priori risk identification aspect of HESRA is its most attractive feature. One need not wait for bad things to happen in order to identify and fix the causes.

3.2.3 Formalizes Risk Assessment Process

HESRA introduces a formal, objective structure to assessing the risk of human errors in maintenance/operations procedures.


It is the explicit goal of the FAA Air Traffic Organization (ATO) to move toward a safety culture in both operations and maintenance. One aspect of an integrated, diffuse safety culture is an emphasis on identifying and correcting high-risk conditions before they result in harm to people or equipment. However, an effective effort in this regard requires moving beyond individual opinions regarding risk and putting in place a consistent, objective, and practical method of assessing risks.

3.2.4 Based on Engineering Risk and Reliability Assessment Techniques

HESRA is based on a well-developed and widely practiced engineering risk assessment technique known as Failure Modes and Effects Analysis, or FMEA. Because of this engineering heritage, HESRA can draw on a large pool of experienced risk analysis practitioners who can adapt their skills to consider task-related errors instead of component failures. Also, a number of existing commercial software applications support FMEA activities and data. Several of these tools can be adapted to support HESRA.

3.2.5 Alternative to Quantitative Risk Assessment Techniques

The natural tendency in an engineering organization is to frame all risk analysis in terms of precise quantitative estimates. This is a reasonable perspective given the ample availability of well-documented failure data for mechanical and electronic components. In fact, much of the early work on human error analysis attempted to take this same route. However, the lack of quantitative human error data for many (actually, most) practical tasks typically dooms a purely quantitative approach to human error analysis.

The lack of reasonable quantitative methods does not imply that there are no useful alternatives. In fact, human error analysis techniques that apply ordinal scale ratings have been and are being used in a number of sophisticated, complex domains. Examples include NASA and the healthcare and medical products fields.

3.2.6 Use of Data from Incident/Error History

One of the very nice features of HESRA is that it does not depend on the availability of historical error data. That is, it is perfectly acceptable to conduct a human error risk analysis without referring to any particular past incidents or errors. However, just because it is possible to do so does not mean that such information cannot be used if it is available.

If previous incident investigations have been done for a particular ATC system or facility, then the results of those investigations can be directly applied in a HESRA analysis. The most likely effect of having access to error data is to inform the analysis team's consensus on likelihood and detection/mitigation. However, such error data can inform any or all of the three rating scales (likelihood, severity, and detection/mitigation) used in HESRA.

3.2.7 Use of Data from Formal or Informal Usability Tests

Just as HESRA can use existing incident and error data, it can also use data from usability tests. Often, usability tests are designed to elicit more errors than one would normally see during the actual service life of a system. Therefore, the analysis team can use these test data to inform their ratings for likelihood and detection/mitigation.
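The three rating scales just described (likelihood, severity, and detection/mitigation) are later combined into a Hazard Index (HI) and a Risk Priority Number (RPN) in Step 11 (Section 5.11). The following Python sketch illustrates the idea under the standard FMEA convention, in which HI is the product of the likelihood and severity ratings and RPN additionally multiplies in the detection/mitigation rating. The rating ranges, formulas, and the example error mode are assumptions for illustration; the report's own scale tables and criticality breakpoints (Tables 9, 11, and 12) govern the actual method.

```python
from dataclasses import dataclass

@dataclass
class ErrorMode:
    """One error/cause combination for a single task step."""
    step: str
    error: str
    cause: str
    likelihood: int   # ordinal team rating (range assumed here, e.g., 1-5)
    severity: int     # ordinal team rating (range assumed here, e.g., 1-5)
    detection: int    # composite detection/mitigation rating (assumed 1-5)

    @property
    def hazard_index(self) -> int:
        # Assumed convention: HI combines likelihood and severity only.
        return self.likelihood * self.severity

    @property
    def rpn(self) -> int:
        # Assumed convention: RPN also weights in detection/mitigation.
        return self.hazard_index * self.detection

# A single hypothetical error mode as it might be rated by the analysis team.
mode = ErrorMode(step="Remove paper from the printer",
                 error="Paper removed from the wrong printer",
                 cause="Adjacent printers are unlabeled",
                 likelihood=3, severity=2, detection=2)
print(mode.hazard_index, mode.rpn)  # -> 6 12
```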


3.2.8 Establishes Working Group

Introducing HESRA into the ATC maintenance/operations domain requires establishing at least one team of people to perform the analysis. The next section of this document describes the composition of the HESRA analysis team. While this document describes a single team, the same requirements can be applied to more than one HESRA team.

The primary direct effect of establishing a HESRA team is to enable the ATO to identify and reduce risks associated with human errors. This is, after all, the reason for introducing HESRA. However, the indirect effects of establishing HESRA teams can be almost as beneficial as those from the specific analyses. Slowly, a pool of people will be formed with the perspective and breadth of view established by their participation in HESRA activities.

Common effects of participating on a risk analysis team include an appreciation of the perspectives of other people on the team, the adoption of an evaluative view of systems and processes, and the realization that identifying and reducing human error risks is not as complex as one might believe prior to working on an actual analysis effort. These are all positive effects for the organization.

3.3 Objectives

The ultimate goal of adapting HESRA to the FAA ATC facilities maintenance/operations domain is to allow FAA maintainers to provide the highest level of facility safety with the lowest risk of compromising safety through human error. A number of enabling objectives support this goal, including at least the following:

• Provide a proactive method with which high-risk elements of maintenance/operations procedures can be identified before they lead to errors with possible safety consequences.
• Provide an objective method of proactively assessing the risk of various design and operational features of ATC facilities.
• Introduce the formalism of a priori risk analysis into the FAA ATC facilities maintenance/operations domain. This formalism is quite different from the typical post hoc accident investigation methods that are currently in place.
• Provide the perspective, methods, and tools to help the FAA ATO move toward a more diffuse safety culture.

4.0 SCOPE AND LIMITATIONS

While proactive human error risk analysis techniques such as HESRA have been applied in a broad range of domains, this application of the HESRA method has very limited goals and scope.

4.1 ATC Facility Maintenance Procedures and Systems

The current methodology is focused on analyzing human errors associated with the design of procedures and systems included in the ATC facility maintenance/operations domain. Therefore, the rating scales included in this procedure may or may not be applicable to a broader set of domains.

4.2 Extensibility to Other Domains

Since HESRA is based on analyzing potential task-related errors, it should be extensible to any domain or system in which procedures, either formal or informal, can be suitably decomposed into individual tasks and steps.


For existing systems in the FAA maintenance/operations domain, there are usually existing, detailed procedures that lend themselves to task decomposition. For systems under development, it should certainly be feasible, using task analysis, to develop prototype maintenance/operations procedures of sufficient detail to satisfy HESRA requirements.

Whether this is also the case in other FAA domains, such as ATC operations, is not clear. However, even for amorphous or poorly documented procedures or systems, task analysis is likely to yield information that can be subjected to HESRA analysis. The issue, of course, is how difficult it might be to perform such task analyses.

4.3 Applicability to Developmental and Existing ATC Facilities and Systems

One of the very desirable features of HESRA is that it can be applied to existing facilities (systems) and to those in various planning, development, or procurement stages. Since it is a proactive method, HESRA does not depend on the existence of prior operating experience. Rather, the only requirement is a list of tasks or steps required to perform specific operations. Such a list can be developed using detailed specifications, simulations, developmental models, or actual equipment and software. As such, HESRA can be used during most phases of the FAA development lifecycle, including mission analysis, safety analysis, investment analysis, solution implementation, and in-service management.

4.4 Handles Discrete Errors

HESRA is based on a widely used engineering risk analysis method known as Failure Modes and Effects Analysis (FMEA). Risk analysis methods based on FMEA are well adapted to identifying and assessing discrete, i.e., individual, errors. The reason for this ability is quite simple: errors and causes are identified for each task (or step) in a procedure, and then each error/cause combination is evaluated independently. Thus, individual errors are likely to be identified and assigned risk ratings.

4.5 Handling Multiple Discrete Errors

The feature of FMEA-based risk analysis techniques that makes them very good at identifying individual errors also makes them less than adept at identifying multiple, dependent, or conditional errors. Since errors are considered only in the context of individual tasks or steps, the likelihood of identifying a meaningful complex combination of errors depends on the imagination and experience of the risk analysis team.

Conditional and/or multiple errors are considered in HESRA during the evaluation of the severity of particular errors. At this point in the analysis process, the analysis team is essentially answering the question, "What is the worst that can happen if this error occurs?" It is quite common for the answer to be, "Well, it depends. If this error occurs in combination with this other error or equipment failure, then the severity would be quite high. If it occurs in the absence of that other error or failure, then it wouldn't be so bad."

In this way, conditional errors can be noted during the analysis and listed as one of the factors that explain a particular severity rating. However, and this should be clearly noted, there is no explicit activity in HESRA (or, to our knowledge, in any other FMEA-based risk analysis technique) in which the analysis team is required to consider multiple or conditional errors.
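Because HESRA has no explicit multiple-error activity, a conditional dependency survives the analysis only as part of the worst-case severity rationale. The following minimal sketch shows how such a "Well, it depends" answer might be recorded; the field names and the example are purely illustrative, not part of the HESRA form.

```python
# Hypothetical severity record for a conditional error. HESRA prescribes no
# explicit structure for this; the condition simply becomes part of the
# rationale behind the worst-case severity rating.
severity_record = {
    "error": "Reboot initiated without verifying the standby server",
    "severity": 5,  # worst case assumes the conditional failure is present
    "rationale": ("Severe only in combination with a failed standby server; "
                  "in isolation, the active server continues to carry the load."),
}
```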


It might seem obvious, but it is an absolute and non-negotiable requirement that all team members must be neutral in terms of the outcome of the analysis. That is, team members cannot have an axe to grind, other than the desire to increase the safety of the system being analyzed. They can have no material interest (either monetary or political) in the outcome of the analysis. An extreme example in this regard is someone who might lose (or THINKS they might lose) his or her job if the analysis shows a system to be extremely risky (or not).

Having an opinion regarding the risks associated with a particular procedure or system is not the same as having a stake in the outcome of the analysis. Everyone on the team is likely to have opinions prior to the analysis. The key to selecting team members is that every member of the team should be willing to change their opinions if evidence and logic so dictate.

5.1.2 Roles and Responsibilities

The roles and responsibilities of team members are described in Table 1, below. There are three important points to note regarding these roles and responsibilities. First, the Human Factors (HF) Specialist is the leader of the analysis team. This is appropriate for a number of reasons, but primarily because the Human Factors Specialist has the education, background, experience, and perspective to guide the analysis team in their deliberations regarding human errors.

The second point to note in Table 1 is that the role of ATO Manager is optional. An ATO Manager can bring significant benefits to the analysis team, but it is also perfectly acceptable to perform the risk analysis without one.


Table 1. HESRA team roles and responsibilities

Human Factors Specialist / Team Leader

Background:
• Has an understanding of human perception, performance, and cognition
• Knows what sorts of tasks are compatible with common human capabilities and which are not
• Knows how to interpret data from research and usability tests
• Has very strong group facilitation skills

Responsibilities:
The HF Specialist is the leader of the HESRA team. The purpose of conducting the risk analysis is to identify system and procedural elements, or combinations of elements, that pose high risks for human errors. Since the HF Specialist has in-depth knowledge of human performance, cognition, perception, behavior, and errors, it is logical and appropriate that this individual take the lead in assessing risks.

The HF Specialist must become familiar with the system and procedure(s) to be analyzed. The HF Specialist also performs the initial task breakdown and error mode definition.

Maintenance Subject Matter Expert (SME)

Background:
• Understands the procedures associated with maintaining the system to be analyzed
• Ideally will have actual experience performing the tasks to be evaluated

Responsibilities:
The maintenance/operations SME is the users' representative on the analysis team. The focus of this HESRA procedure is ATC facility maintenance/operations procedures. Maintainers are the people who conduct those procedures in the field.

The maintenance/operations SME must help the analysis team understand likely field behavior, actual procedural steps, accepted practices, tools, interactions among work teams, and the influence of environmental, social, and political factors. Also, the maintenance/operations SME is likely to be able to share knowledge of past critical incidents and errors related to a given procedure, whether those incidents were reported or not.


Trainer

Background:
• Should have a great deal of experience conducting the types of tasks that will be evaluated during the risk analysis
• Has taught those tasks to many maintainers over a period of time
• For new systems, should generally have a great deal of detailed knowledge regarding the type of system being evaluated

Responsibilities:
Trainers teach declarative and procedural knowledge about the system to be analyzed. The trainer should inform the discussions of the analysis team by indicating which procedural tasks are difficult for trainees to master.

The trainer can also help team deliberations by describing common mistakes made by maintainers during training, feedback received from the field, and personal experience across many systems and locations.

System Technical Specialist

Background:
• Has in-depth knowledge of system functions, operations, and interactions
• Can fill the roles of both maintainer or trainer and system technical specialist
• Can be a representative of the system developer

Responsibilities:
For any reasonably complex system, it is essential to have a technical specialist on the analysis team, since the range of potential effects of errors may not always be obvious to maintainers. The system technical specialist can best aid team deliberations by providing in-depth knowledge of system functions, the effects of errors, and the interactions of errors and components.

The input of the system technical specialist is most helpful when debating the potential severity of specific errors and the elements of a system that might detect and mitigate the effects of an error.

ATO Scientist

Background:
• Has a long association with particular operational systems, e.g., understands the history of a system or class of systems
• Can fill this role and that of the HF Specialist

Responsibilities:
The ATO Scientist brings a broad technical view to the analysis team. The most useful input to the analysis team from the ATO Scientist is to describe the error history of the procedure or facility under review and the implications of errors across systems. The ATO Scientist can also bring knowledge of past and ongoing research applicable to the specific analysis.

ATO Manager (Optional)

Background:
• Has broad experience with various workers on the system being analyzed, as well as knowledge of the severity of error effects on the ATC system
• Has the ability to explain the effects of errors on management functions

Responsibilities:
Although not absolutely necessary, including an ATO Manager on the risk analysis team can be valuable in a number of ways. Most notably, participating on the risk analysis team provides the manager with detailed familiarity with the risk analysis process and the deliberations of the team.

In addition to enriching the discussions associated with the risk analysis, the manager's participation will lead to fewer back-end issues and faster buy-in of the analysis results. The ATO Manager can help coordinate access to facilities, equipment, and personnel, if such access is necessary.


Finally, it should be noted that Table 1 defines roles and responsibilities, not specific individual identities. It is feasible for a single person to fulfill more than one role on the analysis team. For example, the person acting in the ATO Scientist role might also be able to act as the Human Factors Specialist. Likewise, the System Technical Specialist might also be able to act as the Trainer. It is also possible for a single role to be filled by more than one individual. For example, in an application of HESRA to the Voice Switching and Control System (VSCS), three individuals filled the Trainer role, and one individual filled both the Maintainer SME and System Technical Specialist roles.

5.1.3 Team Tasks

When the team is established, it is necessary also to establish an operating framework with all team members. The actual analysis tasks will be described in subsequent sections of this document. However, certain management and housekeeping tasks need to be completed before the detailed analysis begins.

5.1.3.1 Hold Initial Meeting

The team, once its membership is defined, should arrange an initial meeting to introduce team members to one another and determine how they will operate for the duration of the analysis effort. It is possible for this initial meeting to be held in a virtual environment, e.g., a video or audio conference. During this meeting, the nominal team leader, who is always the HF Specialist, will be identified.

In their invitation to join the analysis team, each member will be informed of the system to be analyzed and the overall timeframe for the analysis. It is likely that not all prospective team members will have undergone HESRA training prior to their participation on the team. The initial meeting will provide an opportunity to introduce the HESRA method and to discuss, in general terms, the system and procedure to be analyzed. In addition, locations for future meetings can be discussed, as well as logistical needs such as LCD projectors and other facilities.

5.1.3.2 Establish Ground Rules

The team should establish simple ground rules for their interaction. Some of this interaction is predicated on the flow of activities in the HESRA method. However, the analysis team has great flexibility regarding how they accomplish each of the tasks in the analysis process.

For example, the team might decide to discuss each error mode in a serial fashion, assigning all three ratings before moving on to the next error mode. Alternatively, a team might decide to assign all likelihood ratings before going back and assigning severity and detection/mitigation ratings.

5.1.3.3 Agree on Level of Effort

It is critically important that analysis team members be able to devote enough time to the analysis process to be effective. It is not acceptable to simply have a majority of the analysis team present for their deliberations. Team members must discuss the required level of effort for the analysis process and then agree, as a team, to provide that level of effort.


5.1.3.4 Record Keeping

While it is important to keep records of the analysis process and outcome, it is also necessary to keep them in such a way that they do not interfere with the primary work of the analysis team. The primary elements and requirements of record keeping are described in this section.

5.1.3.4.1 Select Tool to Support/Document Analysis

Currently, the primary documentation and reporting tool is Microsoft Excel. However, several off-the-shelf tools that can be customized to serve as a support tool for HESRA are being examined as alternatives to Excel.

5.1.3.4.2 Configure Tool for Current Analysis

Configuring an Excel spreadsheet for a particular HESRA analysis consists of opening a new file using the HESRA template and then entering the name of the system or procedure to be analyzed. Appendix B provides an example analysis form. Using the HESRA template is very straightforward: clicking the New Row button creates a new row just below the cursor location, and the new row replicates the Task and Step entries.

5.1.3.4.3 Supplement with Meeting Notes

Regardless of the tool being used to support HESRA, it should be supplemented with copious meeting notes. Inevitably, there will be discussions among the analysis team related to particular tasks, errors, or the reasoning behind assigning ratings. Any word processing application can be used to collect meeting notes, but the notes should definitely be kept in electronic form to allow easy editing and distribution.
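The Excel template itself is not reproduced here, but the record each row captures can be sketched from Steps 5 through 11 of the method. In the Python sketch below, the column names and the New Row behavior are inferred from the description above rather than copied from the actual template (Appendix B shows the real form).

```python
# Illustrative row structure for the HESRA analysis sheet. Column names are
# inferred from Steps 5-11 of the method, not taken from the Excel template.
BLANK_ROW = {"task": "", "step": "", "error": "", "cause": "", "effect": "",
             "likelihood": None, "severity": None, "detection": None}

def new_row(rows: list, at: int) -> dict:
    """Mimic the template's New Row button: insert a blank row just below
    row `at`, replicating that row's Task and Step entries."""
    row = dict(BLANK_ROW)
    row["task"], row["step"] = rows[at]["task"], rows[at]["step"]
    rows.insert(at + 1, row)
    return row

rows = [dict(BLANK_ROW, task="Login", step="Enter user ID")]
new_row(rows, 0)  # a second error mode recorded against the same task/step
```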


5.2 Step 2: Familiarize Team with System to Analyze

It is imperative that all team members have some level of understanding of the system for which procedures are being analyzed. It is somewhat of a paradox that the Human Factors Specialist, who is the nominal head of the analysis team, is likely to have the least familiarity with the ATC system and the maintenance/operations procedure(s) to be analyzed. In the normal course of events, a typical human factors analysis requires the Human Factors Specialist to become very familiar with the system, product, or procedure being analyzed. HESRA is really no different in this respect, except that this familiarization must occur prior to the analysis effort.

There are a number of ways to facilitate such familiarization. These include, in no particular order: visits to ATC sites at which the system is located, walking through representative procedures, reading operation and maintenance/operations manuals, reading vendor information related to the system, interviewing maintainers and operators of the system, and spending time with the training professionals who teach technicians how to maintain the system. We have found that all of these approaches are likely to be used when trying to become familiar with a system.

Until the Human Factors Specialist is familiar with the system to be analyzed, it is unlikely that this individual will be able to reasonably understand the maintenance/operations procedure(s) to be analyzed. This lack of understanding will hamper the HF Specialist's ability to perform the initial task and step identification required to "pre-fill" the analysis spreadsheet. Even if it is possible to fill out the spreadsheet, the lack of a basic understanding of system layout and functions will seriously inhibit the ability to determine the effects of identified errors.

5.3 Step 3: Prioritize Procedures to Analyze

The third step in performing a HESRA is to choose, from all candidate procedures, those that take priority and should be analyzed. This step is important, and we recommend that the approach described here be used.

The analysis team should review each candidate procedure. The review should concentrate on the worst-case scenario that might pertain if serious human errors are committed during the conduct of the procedure. The intent here is not to concoct some wildly unlikely series of events that might lead to the loss of separation or some other very bad event. Rather, it is the analysis team's job to think in practical terms about the consequences of improperly performing each procedure.

The outcome of this process is that a ranking is assigned to each procedure related to the problems it could cause to the local ATC system. The following scale can be used to help the team prioritize certain procedures over others:

1. Immediately brings down the facility or subsystem and adversely affects other facilities or subsystems.
2. Immediately brings down the facility or subsystem, but does not affect other facilities or subsystems, or leaves the facility or subsystem in a nonfunctional mode that is not obvious to observers.
3. Immediate reduction in the function of the facility or subsystem, but partial functionality is retained. No latent effects.
4. Possible delayed minor functional effects on the facility or subsystem. No immediate effects.
5. No serious immediate or latent functional effects on the facility. Effects result in inconvenience and can be easily addressed.

In addition to the above taxonomy, there are other viable methods for generating a pool of candidate procedures that should be submitted for HESRA analysis. The table below presents additional criteria the team can use to generate candidate procedures. To help organize the procedures that could be studied, the team should rank-order them so that procedures known to have severe consequences or known to be difficult during training are considered first, before others that are less severe and might not require HESRA at all. A sketch of this rank-ordering follows.
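As a concrete illustration of the rank-ordering just described, the sketch below sorts a handful of candidate procedures by the 1-5 worst-case impact scale above, breaking ties in favor of procedures known to be difficult during training. The procedure names are taken from the VSCS examples used later in this document; the ranks and flags assigned to them here are hypothetical.

```python
# Hypothetical candidate procedures, each tagged with the 1-5 worst-case
# impact rank defined above (1 = most severe) and a training-difficulty flag.
candidates = [
    {"procedure": "Reboot VCSU Servers",          "impact_rank": 2, "hard_to_train": True},
    {"procedure": "Clean and check WS printers",  "impact_rank": 5, "hard_to_train": False},
    {"procedure": "Perform VCSU server modeover", "impact_rank": 1, "hard_to_train": True},
]

# Severe consequences and known training difficulty float to the top; the
# low-impact items at the bottom may not require a HESRA analysis at all.
candidates.sort(key=lambda p: (p["impact_rank"], not p["hard_to_train"]))
for c in candidates:
    print(c["impact_rank"], c["procedure"])
```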


We recommend that eliminating procedures prior to analysis be done as a two-step process. First, the analysis team should discuss the procedure and agree that it can be eliminated. Second, other individuals should be consulted who can determine that eliminating the procedure will not compromise technical, regulatory, or management requirements. Eliminating a routine maintenance/operations procedure also requires removing it from the Maintenance Management System (MMS).

5.4 Step 4 - Set Analysis Perspective

Risk analysis is not performed in a vacuum, nor is it typically done on a hypothetical system. Usually, but not always, HESRA will be directed at a real maintenance/operations procedure that relates to an actual system operating in the ATC environment. Another alternative is to embed human error risk analysis into the development process for new or replacement systems. That is, the maintenance/operations procedures to be analyzed are for a facility or system that has not yet been deployed.

The elements of the analysis perspective described in this document are maintenance/operations-oriented. However, to be clear, these same considerations apply to any FAA domain, including ATC operations.

In performing a risk analysis (HESRA) for any system or procedure, we have to make certain assumptions concerning the type(s) of users, the usage environments, the task environment, the overall complexity of the user interface, etc. Brief descriptions of these considerations, as applied to the particular product or system and the specific analysis, are provided below.

5.4.1 User Population

The user population for the particular maintenance/operations procedure should be described in as much detail as necessary to allow the analysts to adopt the users' perspective while conducting the procedure.

The level of training and experience that users are expected to have with the particular procedure should be described. For example, will users have undergone training specific to the procedure? Will their training bring every user up to a minimum level of competence? Even if users have been trained on the procedure, will they have any actual job experience with it? That is, for purposes of the analysis, users can be trained but still be considered new users.

5.4.2 Usage Environment

Although there is a wide range of potential usage environments, we assume for purposes of the HESRA that the product or system will be used in the ATC domain. However, even within the ATC domain, there are a number of different types of facilities, e.g., ARTCC, SSC, TRACON, etc. The description of the usage environment should include both its physical aspects, e.g., indoors vs. outdoors, hot vs. cold, etc., and the operational environment. For example, will users be under time stress? Will they be making life-and-death decisions? Are users subject to punitive actions by management? Stressful physical and operational environments elevate the likelihood of human errors.


5.4.3 Performance Shaping Factors

Human performance, especially the likelihood of committing errors, is strongly influenced by a number of factors. For purposes of the HESRA, we should explicitly list those factors we consider to positively and negatively affect users' performance. For example, some common negative factors are the following:

• Time pressure
• Fatigue
• Multi-tasking
• Noise
• Physical exertion
• Poor communication
• Confusing terminology

Positive factors include, but are certainly not limited to, the following:

• Well-designed user interface
• Good communication links
• Good training
• Well-written procedures
• Lack of time pressure
• Quiet workspace

5.4.4 Overall Complexity of System, User Interface and/or Procedure

While feature-rich UIs are often viewed as a good thing, complex user interfaces increase the likelihood of user confusion and errors. In general, the more choices a user has regarding which actions to take, the more likely it is that they will make an incorrect choice. In this regard, the complexity of the user interface for the procedure or system being evaluated should be rated as low, moderate, or high. An explanation of this rating should also be included in the documentation for the analysis.
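One way to keep these perspective elements together during the analysis is a simple record; this is a minimal sketch, and the field names are illustrative assumptions rather than part of the official HESRA template:

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisPerspective:
    """Working notes for Step 4: the assumptions that underlie the ratings."""
    user_population: str                 # training/experience assumed for users (5.4.1)
    usage_environment: str               # facility type plus physical/operational stressors (5.4.2)
    negative_factors: list = field(default_factory=list)   # e.g., time pressure, noise (5.4.3)
    positive_factors: list = field(default_factory=list)   # e.g., good training (5.4.3)
    ui_complexity: str = "moderate"      # rated low, moderate, or high (5.4.4)
    complexity_rationale: str = ""       # explanation required by 5.4.4

perspective = AnalysisPerspective(
    user_population="Trained technicians with little field experience on this procedure",
    usage_environment="TRACON equipment room; indoors, moderate time stress",
    negative_factors=["Time pressure", "Multi-tasking"],
    positive_factors=["Well-written procedures"],
    ui_complexity="high",
    complexity_rationale="Many menu paths lead to similar-looking dialogs",
)
```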


5.5 Step 5 – Define Tasks

The most fundamental activity in the HESRA process is defining or identifying the tasks that will be analyzed for risk. Fortunately, most ATC facility maintenance/operations work is very much procedure-oriented, so for any activity, a detailed, task- or step-oriented procedure is likely to exist. The major exception to this statement is for systems still under development. Detailed maintenance/operations procedures for such developmental systems might not exist when the HESRA process is conducted. If a procedure already exists, then defining tasks requires much less effort than in situations where detailed procedures do not exist. If a procedure does not yet exist, then some form of task analysis is warranted (see 5.5.3, below).

The activities undertaken by technicians to complete a maintenance/operations procedure (or any procedure) can be divided into "tasks" and "steps". This is a somewhat arbitrary distinction, but a helpful one when it comes to organizing maintenance/operations actions for analysis. An easy way to distinguish between tasks and steps is that tasks define what has to be done, but not necessarily how to do it. Steps, which are usually embedded within tasks, tell how to do what the task requires.

The sequence of activities in the HESRA process is to first define tasks and give each a one-word name that can be used to define all the steps and errors associated with it. The actual name given to the task is not particularly important, but it should reflect the nature of the task. For example, if I identify a series of steps that allow me to log onto a computer, then a reasonable name might be "Login" or "Logon".

5.5.1 Evaluate Level of Task Detail

Tasks and steps listed in various FAA maintenance/operations procedures contain a wide range of detail regarding the specific actions associated with them. In most cases, very specific procedures exist for each periodic maintenance/operations task. However, even for these periodic procedures, the level of detail can vary dramatically. In the VSCS procedure "Clean and check WS printers", for example, the second numbered step is "Remove paper from the printer." Contrast this with the second numbered step in the "Reboot VCSU Servers" procedure, i.e., "At the VCSU Ops Console, select the key twice."

There is some leeway in determining the appropriate level of task detail needed for risk analysis. The key to knowing whether there is enough detail in the existing procedure is that it should describe what must be done, but not (necessarily) how to do it. In many cases, ATC maintenance/operations procedures mix both tasks and task steps. Sometimes, the steps are explicitly associated with a task. These instances are typically listed in this syntax: "Do 'X' by completing the following steps –". Then, the steps required to complete the higher-level task are listed.

More often, the HF Specialist must determine the higher-level task associated with a series of detailed steps. For example, in the "Reboot VCSU servers" procedure, the first task is to determine which of the two servers is currently "active" and which is the "standby" server. There are a number of steps associated with this aggregate task, but it is up to the analyst to lump these steps together and give the task group a meaningful definition and label. In this example, we could define this group of steps as "Identify active server" and its task label as "Identify."

The Maintenance Management System (MMS) is a good source of non-specific procedures. The MMS order to "Perform VCSU server modeover" is a good example of describing what must be done without providing any detail about how it is accomplished. This level of detail is not sufficient to perform a HESRA analysis. However, the actual maintenance/operations procedure for this general task, which is referenced in the MMS, contains more detailed task descriptions.

5.5.2 The General Task

In defining tasks, there is one task category that is not necessarily associated with any particular set of steps in the procedure or system function being analyzed. Rather, it is related to the procedure as a whole. This task category is defined as "general" and relates to the common errors of not beginning a procedure or not completing it once it is started. It can also address not performing the procedure in the proper order, if the procedure is typically done in a particular order with other procedures. The "General" task category should always be the first one listed in the HESRA spreadsheet.


5.5.3 Define/Identify More Detailed Tasks Where Required

The lack of task detail will probably not be an issue for most existing ATC facility maintenance/operations procedures. These procedures have been developed and refined over a period of time, as have the training materials for them. It is likely that highly detailed task descriptions exist for these procedures. However, for new systems or those under development, such detailed procedures might not yet exist.

The best course of action for analyzing procedures that do not yet contain detailed task descriptions is to walk through the procedures on representative systems or equipment. Some procedures contain statements related to overall functional activities, for example, "Swap active disk drives". This is a very general statement of what must be done, but it does not include either lower-level tasks or the detailed steps associated with them.

It is possible to fill in the lower-level "what to do" and "how to" information if there is enough expertise and experience on the analysis team. However, if little or no task detail exists and the requisite technical expertise and operational experience is not present on the analysis team, it might be a better use of the analysis team's time to postpone the risk analysis until a separate task analysis effort is completed.

If there are only a few gaps in the level of task detail for a procedure, then the analysis team can probably provide this detail as part of the risk analysis process. The HF Specialist should make the decision regarding whether task analysis is appropriate as part of the risk analysis process.

5.6 Step 6 – Define Steps

The next activity in the HESRA process is to define the steps required to complete the tasks identified as described above. This activity would be considerably simplified if we could assume that the steps listed in the selected maintenance/operations procedure could be entered directly into the HESRA spreadsheet. However, this is typically not a practical approach, for at least two reasons.

First, as we've noted previously, the individual, numbered steps in existing procedures span a wide range of detail. If we proceed by entering those procedural steps with the highest level of detail, then subsequent analytical activities become tremendously detailed and complex. More to the point, addressing technician actions at a very detailed level is unlikely to yield error risk estimates of much practical value.

Second, some procedures are written at a very high level of abstraction and contain few detailed, "how to" steps. A good example of this is the VSCS "Power Fail Recovery" procedure, which contains almost no detail for the individual task categories. This situation is likely to be more prevalent for systems under development. However, as the VSCS example demonstrates, it can also occur for very mature systems and facilities.

5.6.1 Identify Steps to Complete Tasks

An assumption regarding task analysis in the HESRA context is that the "what to do" tasks have already been defined. That is, we assume the analysis can start with a list of tasks that define what must be done, but not necessarily how to do it. If this level of task definition does not exist, then a task analysis effort apart from the risk analysis is warranted.


For each "what to do" task, the analysis team, or a subset of the team consisting of the HF Specialist, Maintenance SME, and Trainer, should identify the steps necessary to actually complete the task. The real trick here is to define individual steps at the appropriate level of detail. There is no commonly accepted definition of the term "step". However, in the context of human error risk analysis, a step can be defined as a specific human action, or series of actions, that can be completed either correctly or incorrectly.

There can be many steps required to complete any given task. For example, in the example used in 5.5.1, "Perform VCSU server modeover", the (existing) detailed procedure lists 15 steps. Some of these steps encompass very small increments of activity, e.g., "…select the key twice." Such highly detailed steps will usually not add much to the error analysis. In fact, they might make it less likely that important errors will be identified, because they simply add noise to the analysis.

One helpful way to define steps is to think of how you might describe the steps to someone who isn't familiar with the system or procedure you are analyzing. First, you would try to list the big, incremental tasks that have to be completed. For example, you might say "First, you have to make sure you know which server to shut down, then you have to shut it down, verify it is shut down, and then re-start it. Once it's restarted, you have to make sure it's running OK, and then make it the active server. Then you repeat the process with the other server."

Then, for each high-level task, you would describe the steps required to accomplish it. For the first task above, you might say "First, you make sure you're logged onto an Ops Control Console, then you identify the standby server, then you make sure all the resources are assigned to the active server, then you change the mode of the standby server to 'offline maintenance/operations'."

This level of step detail often aggregates a number of steps in the actual procedure, which is a good thing from an analysis perspective – so long as the aggregation doesn't hide a potential error that should be evaluated.

5.6.2 Enter Steps into Analysis Tool/Document

As steps are identified, they should be entered into the HESRA spreadsheet. If the step description can be condensed, it is acceptable to do so, as long as a complete step description is kept in supplemental documentation. In addition, each step should be identified with a number. This number serves only to identify the step/error/cause combination within the task grouping. For example, I might want to refer to Step 2 in the "General" task grouping. There is no other implication of the step number.
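For teams that maintain the worksheet programmatically rather than by hand, the task/step structure described above can be represented as simple records; this is a minimal sketch, with illustrative (not official) field names:

```python
from dataclasses import dataclass

@dataclass
class StepEntry:
    """One task/step row before errors and causes are attached."""
    task: str         # short task label, e.g., "Identify"
    step_no: int      # identifier within the task grouping only
    description: str  # condensed step description (full text kept elsewhere)

steps = [
    StepEntry("General",  1, "Begin and complete the procedure in order"),
    StepEntry("Identify", 1, "Log onto an Ops Control Console"),
    StepEntry("Identify", 2, "Identify the standby server"),
]
```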
5.7 Step 7 – Define Errors, Causes, and Effects

The real power of HESRA lies in properly defining the errors that can occur at each step in a maintenance/operations procedure. After all, errors are what we are trying to prevent. To the extent the risk analysis team is thorough and conscientious in defining errors, the remainder of the HESRA process allows them to do a reasonable job of ranking those errors. However, neither HESRA nor any other human error risk analysis method is likely to prevent an error that has not been defined.


This part of the HESRA process tends to become tedious and repetitive. It is likely that the same errors will be defined for many procedural steps. Also, the sheer number of errors that can be associated with a specific procedure step is often surprising.

5.7.1 Pre-Fill Errors and Causes

This is a good point in the process for the HF Specialist to pre-fill the HESRA spreadsheet with potential errors and causes for each step. There are at least two good reasons for the HF Specialist to complete this activity without necessarily soliciting input from the rest of the analysis team. First, the errors and causes in which we are interested are the result of human actions, so the HF Specialist is likely to have the most expertise of any team member. Second, this is a very tedious process, and it is best not to take up the time of the entire team.

It is also appropriate during this activity for the HF Specialist to note the effects of specific errors and to identify mitigating factors. However, these are not a primary purpose of the pre-fill activity. In fact, the value of HESRA is enhanced by team discussions of effects and mitigating factors, so those can certainly be left for the entire team to identify.

During this process, each step will likely spawn a number of entries in the HESRA spreadsheet. Each row within a task category represents one particular type of error related to a single step. Each type of error, in turn, can be duplicated to show how a particular error can be caused by different factors.

Certain types of human errors tend to be committed over and over – and for the same reasons – regardless of the domain in which the errors occur. To make sure that these common errors are considered in the HESRA process, we have adapted a list of these errors and causes from other sources. This type of list is often called a taxonomy, but that term is not important for the work of a HESRA team. We will simply call the lists presented below a framework for identifying human errors.

As an example, let's consider the step "Identify the active server." A proper outcome for this step would be "The active server is properly identified." A fundamental error for this step might be to simply skip it. There can be multiple possible causes for such an error, such as an external distraction, lack of knowledge regarding how to identify the server, time pressure, etc. Each of these causes would merit a separate row in the HESRA spreadsheet.

Table 3. Framework of human errors and causes.

Common Errors
• Skip a step
• Perform a step out of order
• Fail to start a step
• Fail to complete a step
• Use the wrong equipment

External Causes of Errors
• Problems with the procedure
• Equipment design
• Environment (noise, heat, vibration, etc.)
• Time pressure
• Organizational factors
• Improper/poor scheduling
• Distractions

Internal Causes of Errors
• Improper training
• Fatigue
• Stress
• Excessive memory load
• Lack of familiarity (with procedure or system)
• Cognitive capability exceeded
• Physical capability exceeded
• Attention not maintained
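To show how such a pre-fill can be mechanized, here is a minimal sketch that crosses each step with a subset of the framework's common errors and causes, producing one spreadsheet row per step/error/cause combination; the function name and column names are illustrative assumptions, not the official HESRA template:

```python
from itertools import product

COMMON_ERRORS = ["Skip a step", "Perform a step out of order", "Fail to start a step",
                 "Fail to complete a step", "Use the wrong equipment"]
CAUSES = ["Problems with the procedure", "Time pressure", "Distractions",
          "Lack of familiarity", "Excessive memory load"]

def prefill_rows(task, steps):
    """One row per step/error/cause triple; effects and mitigations left blank for the team."""
    rows = []
    for step_no, step in enumerate(steps, start=1):
        for error, cause in product(COMMON_ERRORS, CAUSES):
            rows.append({"Task": task, "Step": step_no, "Description": step,
                         "Error": error, "Cause": cause,
                         "Effects": "", "Mitigating factors": ""})
    return rows

rows = prefill_rows("Identify", ["Log onto an Ops Control Console",
                                 "Identify the standby server"])
```

In practice, the HF Specialist would prune combinations that make no sense for a given step rather than keep the full cross product.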


Remember, this is simply a pre-fill of the HESRA spreadsheet by the HF Specialist. There is no guarantee that this pass will be either complete or accurate. The purpose of doing the pre-fill is to make the analysis process more efficient. When the entire team evaluates the steps, errors, and causes, some might be added, some deleted, and others modified.

5.7.2 Review Each Step

Once the HESRA spreadsheet has been pre-filled, the analysis team should re-convene to perform the detailed analysis. The team should discuss each procedure step in serial fashion. Finish identifying the potential errors associated with one step before proceeding to the next. Each member of the analysis team should ensure that they understand the human and machine actions described in the task step, the conditions under which the step is performed, and the relationship of the step to previous and subsequent steps.

Causal factors for each error should be reviewed by the team and modified as necessary. Some of the most interesting and illuminating discussions will occur when the team identifies the effects of each error on human and system performance. In previous analyses, we have found that some of the errors had never been considered by anyone on the analysis team. Sometimes, the effects of these errors are so subtle or complex that outside experts must be consulted to determine the most likely effects.

For a good example of the type of "effects" discussion that can occur during a HESRA analysis, consider the actual case of a procedural requirement to verify that all processes and resources are assigned to the active server (and not the standby server) before proceeding. What happens if the technician proceeds without verifying resource assignments and they turn out to be incorrectly assigned? The effects of this error took a HESRA analysis team a while to determine and resulted in the team calling in other experts.

When identifying the effects of specific errors, remember that an error can have both system and human effects. System effects usually relate to the temporary or permanent loss or degradation of one or more system functions. Human effects can include increased workload, confusion, injury, and, in rare cases, death.

Finally, any mitigating factors that would impact the error should be described. A common mitigating factor is that it might be immediately obvious that a step in a procedure has been omitted. In the procedural step discussed above, for example, the following step cannot be completed if the key is not pressed twice, because the dialog window required in Step 3 will not be displayed. This is a very strong mitigating factor.

A much less salient mitigating factor is the disk slot numbering scheme on which the VSCS maintainer depends to select the proper disk to remove in the "Perform update of server gold mirrored drives" procedure. If an incorrect disk is removed, there is little to notify the maintainer that this error has occurred. In addition, the slot-numbering scheme is inconsistent with the disk numbers, i.e., slot "0" contains disk "1". This design discrepancy might actually induce the maintainer to remove the wrong disk.


5.7.3 Develop Exhaustive Error List

The goal of the analysis team is to develop an exhaustive list of errors for each task step. The team should not concern itself with the likelihood, severity, or mitigation that might pertain for each error – at least, not at this point in the analysis.

This perspective tends to be problematic for some analysis team members. The tendency is to do a quick mental evaluation of potential errors and then discount (and not mention) those that are thought to have a low probability of occurrence or trivial consequences. Do not do this! It will defeat the purpose of the risk analysis. We are very much interested in low-probability events. There are many opportunities later in the HESRA process to assign low ratings for probability and severity. This is not the place to do it.

The appropriate perspective for this part of the analysis is to be skeptical and evaluative. Do not accept the premise that any step is so easy that there cannot be errors. For example, one error that should always be listed for a procedure step is simply failing to do that step. There can be a number of reasons for skipping a step in a procedure, but it is one of the most common human errors.

5.7.4 Relate Errors to Other Steps, Components, Etc., if Appropriate

One of the weaknesses of FMEA-based risk analysis methods is their inherent lack of a mechanism to easily tie various errors together. We might find during the analysis that certain errors might cause (or prevent) other errors, or at least make other errors more likely.

For example, suppose it is apparent to the team that a certain error is much more likely to occur if an error has already occurred in a previous step. This should definitely be noted in the HESRA documentation.

In HESRA, we have provided a way to at least note these associations for later use. In the "comments" section of the analysis spreadsheet, the analysis team should explicitly note any such connections, even if they are only theoretical or deemed unlikely to actually occur.

5.8 Step 8 – Assign Rating for Error Likelihood

Likelihood refers to the overall probability, in nominal terms, of a particular error occurring due to a specific cause. Each row in the HESRA spreadsheet applies to a single error with a particular cause. The exact same error with a different cause might have a very different likelihood of occurring.

Remember that the training and experience of users, as well as the operational environment and performance shaping factors, will influence which likelihood rating is assigned to a particular error. The team should rate each error using the 5-point scale shown in Table 4, below.


For reference, Table 4 also shows (in parentheses) the 5-point Likelihood scale from FAA's Safety Management System (SMS). SMS and HESRA differ in likelihood definitions due mainly to the difference between human error rates and component failure rates.

Table 4. HESRA Error Likelihood Ratings (FAA SMS Ratings/Terminology)

| Rating | Category | Definition |
|--------|----------|------------|
| 1 (A) | Extremely Likely (Frequent) | Likely to occur on the order of once every 3-4 times the task is performed. |
| 2 (B) | Likely (Probable) | Likely to occur on a regular basis, on the order of once every 10 times the task is performed. |
| 3 (C) | Occasional (Remote) | Likely to occur sporadically over the life of the system, on the order of once every 25 times the task is performed. |
| 4 (D) | Unlikely (Extremely Remote) | Not likely to occur more than 5-10 times over the life of the system. |
| 5 (E) | Extremely Unlikely (Extremely Improbable) | Not likely to occur more than once or twice during the operational life of the system. |

Typically, it takes a risk analysis team some period of time before all members are comfortable with the meaning of specific ratings. For some team members, this will be the first time they have participated in a risk analysis of any kind, much less one in which the risk of human error is considered. The HF Specialist should take the lead in the early discussions of assigning likelihood ratings.

5.8.1 Review Each Error

When assigning likelihood ratings, the analysis team should consider each error in serial fashion. However, it is certainly acceptable during group discussions to compare and contrast one error with another. Often, this helps the team put particular errors in perspective. Also, depending on the specific equipment and task conditions, the exact same error might be rated differently for different maintenance/operations task steps.

5.8.2 Use Existing Error, Usability Test, and Other Data as Appropriate

There is often a lot of discussion among the team regarding the basis upon which error likelihood should be assigned. The HF Specialist and the ATO Scientist can provide information related to the specific error in terms of typical human capabilities and system history, respectively. Also, the Maintenance SME and Trainer might have good insights into the typical field experience with a particular error.

Beyond these sources of information, it is both possible and advisable to use any data that might pertain. For example, maybe a number of known, similar errors have been committed and reported. In some cases, a formal usability test might have been conducted on the system or procedure being analyzed. This is not likely for older systems, but usability test results might exist for newer systems.

These data usually cannot be applied directly to an error mode under analysis. It is rare, but not unheard of, to find data for the exact error mode being evaluated.


More often, existing error and usability data are best applied by informing the discussion among team members. Also, it is very useful for team members to know that a particular error has occurred and that it caused consequences of a specific severity.

5.8.3 Look Especially for Tasks That Require Skills or Capabilities That Human Users Are Unlikely to Possess

Requiring maintainers to complete a task in a way that challenges basic human capabilities often precipitates human errors. For example, suppose a procedural step requires the maintainer to remember readings from a measurement taken in a preceding step. Such a task requires the use of short-term memory, which is notoriously prone to error. Another example is a step that requires numbers or letters to be written down and used in a subsequent step. Transcription is also a task that is subject to fairly high error rates.

When the analysis team discusses potential errors that are due to such fundamental violations, the HF Specialist should make certain the team understands that the likelihood of these types of errors should be elevated.

5.8.4 Team Discussion and Consensus

It is critically important that analysis team members come to a common understanding of what the likelihood ratings mean and how to assign ratings to errors. As noted above, it will usually take a new team some time, typically on the order of a couple of hours, to become comfortable assigning likelihood ratings. Things progress much more smoothly after that point.

Team members who are not HF professionals often have a very difficult time understanding that assigning relatively high likelihood ratings does not reflect badly on the people who will be performing the work being analyzed. In this case, of course, those people are FAA ATC maintainers, so there will be some sensitivity in this regard. The HF Specialist and the ATO Scientist can greatly facilitate this process by using examples and showing how system characteristics are usually the cause of high likelihood ratings – not lack of skill or motivation on the part of maintainers.

5.8.5 Same Error for Different Task/Step

As was discussed above, the same error applied to a different maintenance/operations task could have drastically different likelihood ratings. This is an important point for the analysis team to understand, because the tendency will be to simply assign the same likelihood rating to identical errors.

The use of procedure walkthroughs can illustrate this point nicely. Using the example of measuring voltage at two test points, one of the errors will undoubtedly be placing the test probes on the wrong test points. It will be easy to see that the likelihood of this error might be quite high for test points buried in the innards of a circuit card cage and quite low for test points brought out to the front panel of the card cage.

5.8.6 Internal Consistency

The point just made was that the same errors could have vastly different likelihoods of occurrence if they are subject to vastly different equipment configurations, task environments, etc. However, the converse is also true.
The same error should have roughly the same likelihood rating if the task circumstances are equivalent.

Using the same example as in 5.8.5, suppose the analysis team is assessing the step of measuring voltage at a pair of front panel test points.


If there is another task that duplicates this step at an adjacent panel, then both the types of errors and the likelihood of occurrence of those errors should be roughly the same.

One must be very careful, however, in determining where such internal consistency is warranted and where it is not. Even a small change in the task environment can have a significant effect on likelihood ratings. Suppose, for example, the second instance of this step occurs at a height that is 4 inches above the floor, whereas the first instance occurs at chest level. Or, suppose the second instance of the task has to be done under extreme time pressure, whereas the first instance can be done in a deliberate, unhurried manner.

5.8.7 Influenced by Elements in 5.4

In section 5.4, we described how to set the perspective for a risk analysis. In terms of assigning likelihood ratings, this perspective is very important. All of the elements described in 5.4 can and should influence the analysis team's discussion and ultimate decision regarding the likelihood rating for each error mode.

5.9 Step 9 – Assign Severity Rating

The severity scale is presented in Table 6. It reflects what could happen if the particular error under consideration is actually committed. A severity rating should be assigned if one or more of the definition statements apply.

Of the three ratings that are assigned during a human error risk analysis, the severity rating is the least subject to the wide variances in human behavior. It is also the least amenable to remediation. In other words, the severity of an error outcome is not particularly dependent on anything other than the design of the system.

This is not to say that the severity cannot be reduced, but, for a given system design, there is not likely to be much argument about its potential outcome. There are a number of considerations related to the severity rating. These are described below.


Table 6. Severity Rating Scales (FAA SMS Category Names)

| Rating | Category | Definition |
|--------|----------|------------|
| 1 | Catastrophic (Catastrophic) | Serious injury, death, or permanent loss of one or more equipment functions; extended loss of function/service; major increase in maintainer or ATC workload; increased safety risk for FAA personnel; loss of positive A/T control; extended reduction of safety margin. |
| 2 | Critical (Hazardous) | Serious injury or moderate temporary loss of equipment function; moderate increase in maintainer or ATC workload; no safety margin for FAA personnel; potential loss of A/C separation; brief reduction in local safety margin. |
| 3 | Significant (Major) | Moderate injury or moderate equipment damage; loss of redundancy for a critical component; slight increase in maintainer or ATC workload; decreased safety margin for FAA personnel; increased risk should additional errors or equipment failures occur; potential increased stress on remaining functional equipment. |
| 4 | Marginal (Minor) | Minor injury or slight equipment damage; work-around required; loss of redundancy for a non-critical component; increased risk of more serious effects; minimal decrease of safety margin. |
| 5 | Negligible (No Safety Effect) | No injury or equipment damage; no significant effect on safety, function/service, or schedule. |

5.9.1 Worst-Case Scenario

Errors can have different effects in different circumstances. Our perspective during HESRA, however, should be to identify the worst-case scenario when assigning severity ratings. This is not to say that the analysis team has to imagine the most bizarre combination of circumstances imaginable to arrive at the severity rating. However, when considering a number of possible outcomes of an error, the team should choose the outcome with the most severe consequences.
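Because a numerically lower rating denotes a more severe outcome on this scale, the worst-case rule reduces to taking the minimum of the candidate ratings; a minimal sketch, with hypothetical outcome ratings:

```python
def worst_case_severity(candidate_ratings):
    """Severity 1 is most severe on the Table 6 scale, so worst case = minimum."""
    return min(candidate_ratings)

# Hypothetical outcomes for one error: Marginal (4), Significant (3), Critical (2).
assert worst_case_severity([4, 3, 2]) == 2
```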


5.9.2 Team Discussion and Consensus

There is not likely to be a wide range of opinions on the analysis team regarding the severity of outcomes for a particular error. Remember, the assumption for this part of the analysis is that the error will occur. There is no need to be concerned about the likelihood that it will occur or how its effects might be mitigated. For purposes of severity ratings, assume the error will occur and that it will not be detected or mitigated.

5.9.3 Account for Conditional Errors, Sequence of Errors, Etc.

This is the place in HESRA where the analysis team can take into account error sequences and conditional consequences, i.e., errors with consequences that can vary greatly based on other errors. For example, consider an error like performing a procedural step out of sequence. The consequences of that type of error might vary greatly depending on where, in the overall procedural sequence, the step is actually performed.

Also, the severity of an error might change drastically if another error has been committed earlier in the procedure. These types of error combinations and sequence dependencies should be noted in the HESRA spreadsheet. Even if the team decides to assign a severity rating that is not related to a particular combination or sequence, it is a good idea to document the fact that the team considered them and recognizes their magnifying effect on the severity of the error outcome.

5.9.4 Not Greatly Influenced by Elements in 5.4

Since the primary assumption in this part of the analysis is that the error under discussion will occur, the performance shaping factors listed in 5.4 don't play a role in the severity rating. We're assuming that the error will occur and will not be mitigated. Performance shaping factors don't matter for severity ratings.

5.9.5 Internal Consistency

Once the team has assigned a severity rating for a particular effect, that rating should also be applied to any other identical effects listed in the HESRA spreadsheet. For example, if an error will cause the loss of a particular system component, then losing that component will typically have the exact same severity rating regardless of the particular error that causes its loss.

5.10 Step 10 – Assign Rating for Detection and Recovery

The final rating to be assigned by the analysis team is related to the likelihood and timeliness of detecting and recovering from an error. Detection means that someone or something realizes that an error has been committed. It is important to understand that an error can be detected by an automated piece of equipment or by a person. Each error should be rated for detection and recovery using the rating scales in Table 7.

The FAA SMS framework has no equivalent for the HESRA recovery scale, and recovery is not considered in the SMS determination of risk.


Table 7. Detection/Recovery Rating Scales

| Rating | Category | Definition |
|--------|----------|------------|
| 1 | Very Low | Detection and/or recovery are not likely to occur until the error propagates through the operational system(s). |
| 2 | Low | Detection and/or recovery are delayed until the error causes at least some serious effects on the operational system(s). |
| 3 | Moderate | Detection and/or recovery occur after a moderate delay, but in time to prevent all but minor effects on the operational system(s). |
| 4 | High | Immediate or very quick detection. Recovery requires manual intervention, but is likely to be done before the error causes any operational effects. |
| 5 | Very High | Immediate, automatic detection and/or recovery. |

For maintenance/operations procedures, the person who typically detects an error is the maintainer who committed it. It can also be a person in the vicinity of the maintainer who commits the error, or someone monitoring the system on which the maintainer is working.

Recovery is the process of finding and "fixing" the error so it does not cause a harmful effect. An underlying assumption in all human error risk analysis processes is that, once an error is detected, the person who detects it will recover from it if it is possible to do so. The catch here is that it might not be possible for people to recover from an error in a timely manner.

5.10.1 Automatic Recovery

Sometimes, error recovery requires no human intervention, such as automatically switching over to a redundant backup system when an error causes the primary system to fail. It is entirely possible for automatic recovery to take place without a human being notified that an error has caused a system failure. This can leave the system in a vulnerable state, because subsequent errors or failures cannot be automatically recovered. Fortunately, these instances tend to be rare.

The analysis team should consider this type of recovery when performing the analysis. The tendency is to assume that it will occur, with no thought about what happens if it fails. The mechanism for automatic recovery should be identified in the spreadsheet template comment field, and discussions should be held with others to understand the implications of a failure of the automatic recovery.


In all instances in which automatic recovery is assumed to reduce the overall risk of a particular error-cause pair, an additional HESRA spreadsheet entry should be included in which the automatic recovery is assumed to be inoperative. These entries should be visually coded so they are obvious to those examining the spreadsheet or subsequent report(s).

5.10.2 Composite of Detection and Recovery

The detection and recovery scale, shown in Table 7, is a composite scale. That is, the anchor points on the scale are defined in terms of both detection and recovery. This is intentional, but it might cause some confusion among analysis team members. The basic idea is this: we want to assign this rating based on the timeliness and likelihood that the effects of the error are blocked from propagating through the system. That is, we don't want the effects of the error to spread.

When one considers the risk of human errors, it is really the consequences of those errors that we most want to avoid. The old expression "no harm, no foul" sums up the role of the detection and recovery processes in the overall risk analysis domain. Blocking harmful effects requires both detection and recovery, which is the reason these actions are combined in a single scale.

5.10.3 Influenced by Severity

Both detection and recovery can be influenced by the severity of the effects of the error. For example, suppose a particular error causes a software application to crash in an obvious way. This is likely to be pretty easy to detect. Also, while a lot depends on the exact configuration of the computer, re-initiating a single application isn't likely to be terribly difficult or time-consuming.

However, suppose instead that an error crashes a server. That event might or might not be so obvious, since some equipment can operate without talking to the server for some period of time. Also, bringing a server back up from a crash can be difficult and can take quite a bit of time.

5.10.4 Influenced by Elements in 5.4

Detection and recovery are very much dependent on human capabilities and limitations. As such, they can be heavily influenced by the performance shaping factors described in 5.4. Consider these two fictional situations. In the first, the maintainer is operating in a quiet environment and is performing a familiar procedure. The system on which the procedure is being performed is critical to ATC operations, but a hot backup system is operating perfectly.

In the second scenario, the maintainer is working on exactly the same system, but in a rather noisy and crowded location. The hot backup system is down for a period of time; therefore, the system on which the maintainer is working is absolutely essential to ATC operations.

The maintainer commits an error that brings down the system on which he or she is working. The factors that influence the maintainer's ability to detect and recover from this error will be quite different for these two scenarios. In particular, environmental factors (noise, cramped workspace), time pressure (in the second scenario, it is critical to get the system back up as soon as possible), and external demands for action will likely cause the maintainer in the second scenario to take longer to mitigate the error.


5.11 Step 11 – Calculate Hazard Index and RPN

Once the appropriate ratings are entered into the HESRA spreadsheet, the Hazard Index (HI) and Risk Priority Number (RPN) should be calculated. HI is calculated by multiplying Likelihood and Severity. RPN is calculated by multiplying the HI by the Detection/Recovery rating. When using the HESRA spreadsheet, these values are calculated automatically.
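The arithmetic the spreadsheet performs is simply the product of the ratings; a minimal sketch (the function names are illustrative, not part of the HESRA template):

```python
def hazard_index(likelihood, severity):
    """HI = Likelihood x Severity; each rating is 1-5, so HI ranges 1-25."""
    return likelihood * severity

def risk_priority_number(likelihood, severity, recovery):
    """RPN = HI x Detection/Recovery rating, so RPN ranges 1-125."""
    return hazard_index(likelihood, severity) * recovery

# e.g., an error rated Likely (2), Critical (2), with Low recovery (2):
assert hazard_index(2, 2) == 4
assert risk_priority_number(2, 2, 2) == 8
```

Because numerically low ratings denote the most likely, most severe, and least recoverable errors, a numerically low HI or RPN indicates a high-risk error.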


5.12 Step 12 – Analyze Criticality

Criticality analysis is a general term used in the risk analysis domain. It sounds complex, but it is actually very simple. The idea is that each of the errors identified and rated by the analysis team must be categorized according to its overall potential to cause bad things to happen. That potential is, by definition, related to the Hazard Index and Risk Priority Number associated with each error.

The categorization is driven by comparing the value of the HI and RPN for a specific error-cause combination to pre-defined "breakpoints", where a breakpoint is defined as being equal to or greater than some value of the HI or RPN. While the HESRA spreadsheet calculates these breakpoints and assigns a criticality value as illustrated in Figure 10, the process for arriving at those values is discussed below.

5.12.1 HI Criticality

There are a number of ways of considering the various ratings assigned to specific errors. The Risk Priority Number (RPN) is a metric that takes into account all three error ratings, i.e., likelihood of occurrence, potential severity of effects, and recovery. However, it is also a good practice to consider each failure mode without regard for the likelihood of detection and recovery. Again, this is the Hazard Index, which is found by multiplying Likelihood of occurrence and potential Severity.

The Hazard Index is equivalent, at least in conceptual terms, to the "risk" assigned in the FAA SMS framework. The resulting categories are used to help the analysis team identify errors that should be dealt with immediately, subjected to further analysis, or addressed through other actions.

It is quite possible for an error to have a hazard index that signals serious hazard (a numerically low HI) but an RPN that signals relatively low risk (a numerically high RPN), because favorable detection and recovery ratings raise the RPN. If this is the case, we might want to examine the recovery opportunities very carefully, since failing to recover from such an error will have very serious consequences.

The potential range of the hazard index for any particular error is illustrated in Table 9.

Table 9. Criteria for Rank Ordering Hazard Index Rating

| HI Value | Category | Definition/Action | Notes |
|----------|----------|-------------------|-------|
| 20-25 | Extremely Low | No system or safety implications, even with a high recovery rating. | One rating must be "5"; the other rating can be "5" or "4". |
| 12-16 | Low | Unlikely to have system or safety implications, even with a high recovery rating. | Allows one rating of "3". Does not allow any rating to be either "2" or "1". |
| 8-10 | Moderate | Potentially significant system or safety implications. A moderate recovery rating is required. | Allows one rating of "2". Does not allow any rating to be "1". |
| 3-6 | High | Significant system or safety implications if not recovered. Outcome is highly dependent on recovery. | Allows one rating of "1". Max rating is either both ratings of "2", or one of "1" and the other "3", dependent on recovery. |
| 1-2 | Extremely High | Critical system or safety implications if not recovered. Outcome is highly dependent on recovery. | Each rating is either "1" or "2". |

In the FAA SMS framework, Table 9 is supplanted by a "risk acceptability matrix", which is reproduced below as Table 10. In this matrix, various combinations of likelihood and severity are assigned a color code denoting one of three levels of acceptability, denoted as "high", "medium", and "low." The full definitions for these risk acceptability levels are contained in the FAA SMS Manual; however, they are briefly described as follows:

• High – Unacceptable risk that must be reduced to "medium" or "low" before the change being contemplated is implemented.
• Medium – The minimum acceptable level of risk associated with a change. The change can be implemented, but must be tracked.
• Low – An acceptable level of risk that allows contemplated changes to be made without further monitoring.


Table 10. SMS Risk Acceptability Matrix. (The matrix crosses the five SMS severity columns – No Safety Effect 5, Minor 4, Major 3, Hazardous 2, Catastrophic 1 – with the five SMS likelihood rows – Frequent A, Probable B, Remote C, Extremely Remote D, Extremely Improbable E; each cell is color-coded as High Risk, Medium Risk, or Low Risk.)

5.12.2 RPN Criticality

Risk priority, which is closely associated with the concept of "criticality", is a somewhat arbitrary construct, in that artificial dividing lines have been established among the different risk indices to form RPN categories, as illustrated in Table 11. Defining these breakpoints is not a science. However, such categorization is a useful exercise in that it allows one to prioritize resources so they are directed at the most "serious" errors. The category breakpoints have been established on a purely arithmetic basis, as shown in Table 12.

The FAA SMS framework does not recognize the construct of RPN and assigns no risk criticality based on its value for any potential human error.

Table 11. RPN Risk Categories

| RPN | Category | Definition/Action |
|-----|----------|-------------------|
| N/A | Single Failure Condition | Criticality will become "High Risk" if a component fails or a software error occurs. |
| 90-125 | Extremely Low Risk | No system or safety implications. No further design or evaluation efforts required. |
| 60-89 | Low Risk | No significant system or safety implications. Unlikely that significant design, training, or procedural changes will be required. |
| 28-59 | Moderate Risk | Potentially significant system or safety implications. Possible that significant design, training, or procedural changes will be required. If the system is not yet deployed, the error mode should be further evaluated and then monitored during usability testing. |
| 9-27 | High Risk | Significant system or safety implications. Likely that significant design, training, or procedural changes will be required. The error mode should be further evaluated and specifically addressed with usability testing. |
| 1-8 | Extremely High Risk | Critical system or safety implications. If an existing system, then immediate remediation should take place. If the system is not yet deployed, significant design, training, or procedural changes are required before the system is deployed. Errors should be specifically addressed with usability testing (after a "fix" is made). |

Table 12. RPN Risk Category Breakpoints

| RPN | Category | Definition |
|-----|----------|------------|
| N/A | Single Failure Condition | Automatic detection and mitigation. No human intervention required. |
| 90-125 | Extremely Low Risk | All ratings are "5" or "4". |
| 60-89 | Low Risk | One rating can be a "3". Other ratings can be "5" or "4". |
| 28-59 | Moderate Risk | No rating of "1" is allowed. One rating of "2" is allowed. |
| 9-27 | High Risk | One rating of "1" allowed. Other ratings can be valid combinations of "2"-"5". |
| 1-8 | Extremely High Risk | All three ratings can be "2", "1", or any combination of "2" and "1". It is mathematically feasible for one rating to be a "3", in which case the other two ratings are ones, or one "1" and one "2". |

Risk Priority Number (RPN) criticality breakpoints are based on the maximum values of the ratings that make up the RPN.
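Since 5.12.4 notes that these categories are easiest to handle programmatically, here is a minimal sketch that maps an HI to the Table 9 categories and an RPN to the Table 11 categories; the function names are illustrative, and the "Single Failure Condition" row is treated as a separate flag:

```python
def hi_category(hi):
    """Map a Hazard Index (1-25) to the Table 9 categories."""
    if hi >= 20: return "Extremely Low"
    if hi >= 12: return "Low"
    if hi >= 8:  return "Moderate"
    if hi >= 3:  return "High"
    return "Extremely High"

def rpn_category(rpn, single_failure=False):
    """Map an RPN (1-125) to the Table 11 categories."""
    if single_failure:  # automatic detection/mitigation in place
        return "Single Failure Condition"
    if rpn >= 90: return "Extremely Low Risk"
    if rpn >= 60: return "Low Risk"
    if rpn >= 28: return "Moderate Risk"
    if rpn >= 9:  return "High Risk"
    return "Extremely High Risk"

assert hi_category(4) == "High" and rpn_category(8) == "Extremely High Risk"
```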


FAA <strong>Human</strong> <strong>Error</strong> <strong>Risk</strong> <strong>Analysis</strong> (<strong>HESRA</strong>) MethodAfter each sort, the analysis team should discuss those errors that rise to the highestcriticality levels. These are c<strong>and</strong>idates for immediate action to reduce their risk.Why do we evaluate errors according HI <strong>and</strong> RPN? By examining the highest criticalityerrors from each sort, instances in which detection <strong>and</strong> recovery play major roles inreducing overall risk should be fairly obvious. Team members should be inherentlyskeptical of the RPN for those errors that have the lowest hazard indexes.For errors that have high HI’s <strong>and</strong> low RPN’s, we are essentially saying “This error islikely to occur <strong>and</strong> its consequences are severe, but we’ll detect it <strong>and</strong> make sure thoseconsequences never occur.” This might be an accurate reflection of reality. However, itcan also be an instance of explaining away high-risk errors that are politically sensitive,don’t conform to current management or agency directives, or won’t be acknowledged forother reasons. Be careful about unsupported detection <strong>and</strong> recovery claims. In addition,even when there are low Detection <strong>and</strong> Recovery ratings, an error with a low HI shouldbe considered for remediation due to its potential consequences.5.12.4 Compare Levels of HI <strong>and</strong> RPN to Criticality BreakpointsAs noted in 3.12.1, the criticality categories are essentially defined by numericalbreakpoints. The Hazard Index can take on values from 1 to 25 (both the likelihood <strong>and</strong>severity scales are from 1 to 5). The <strong>Risk</strong> Priority Number can take on values from 1 to125. Therefore, each of the criticality categories must be defined in terms of the lowest<strong>and</strong> highest values of HI <strong>and</strong> RPN that would place the error risk in that category.In reality, it is easiest to program these categories into the <strong>HESRA</strong> spreadsheet <strong>and</strong> thenautomatically assign each error to the appropriate HI <strong>and</strong> RPN criticality category. If thisis done, then the sorting described above can be done using these criticality categories.5.12.5 Determine Action Requirements for Each <strong>Error</strong>Determining action requirements for each error is really a pre-determined step. Note inthe introduction to 3.12, that each criticality category has, as part of its definition, ageneral action assignment. Keep in mind we are not determining exactly what has to bedone. That is the next step in the process (3.13). At this stage, we are simply decidingwhether anything has to be done <strong>and</strong>, if so, how quickly must it be completed.For example, suppose the risk for a particular error falls into the “Extremely High <strong>Risk</strong>”criticality category. Based on the definition for that category, we know that somethinghas to be done <strong>and</strong> that it has to be done quickly. On the other h<strong>and</strong>, if the risk falls intothe “Extremely Low <strong>Risk</strong>” category, we know that no action may be necessary.5.13 Step 13 – Reduce <strong>Risk</strong>The output of the <strong>HESRA</strong> analysis is a list of procedural steps <strong>and</strong> human errors orderedaccording to their risk - higher risks earlier in the list. 
5.13 Step 13 – Reduce Risk
The output of the HESRA analysis is a list of procedural steps and human errors ordered according to their risk, with higher risks earlier in the list. We developed the following strategy for generating risk reduction ideas: the HESRA team must develop at least one remediation strategy for each high-risk error. In effect, the analysis process up to this point has identified the elements of high risk in the procedures that have been analyzed. The next logical question is, "What do we do about it?" These deliberations should be guided by a few ground rules:


• Do not attempt to reduce risk by doing things automatically. In the analytical phase, we considered all automatic mitigations. In this phase, let's fix things the old-fashioned way – by actually fixing them.

• The risk priority number has to go UP as a result of the risk reduction ideas. It is not sufficient to increase the likelihood rating while decreasing the recovery rating, thus having no net effect on the risk priority number. For each idea we come up with, we have to go back and re-rate the likelihood and recovery. If we can't elevate the product of these numbers, then the idea isn't really reducing the risk.

• When attempting to increase the likelihood or recovery rating, be sure to address the lower rating first. For example, if the likelihood rating is a "4" and the recovery rating is a "1", look at ways of increasing the recovery score before addressing the likelihood score.

• When developing risk reduction ideas, use the principle of "as high as is reasonably achievable" – an inversion of the familiar ALARA rule, appropriate here because higher ratings mean lower risk. Don't presuppose that raising the recovery score from a "1" to a "2" will be good enough. Try to increase the score to a "5" and see what happens.

• Every idea has to be reasonable from a technological and policy perspective. It does us no good to come up with ideas that everyone acknowledges will simply not get done within the existing technology and policy framework of the FAA. Money is less of a concern, since the team probably shouldn't be worrying about paying for the risk reduction.

• Even if most (or all) of the identified error modes have low risk ratings, the team should attempt to determine whether the cumulative effect of many small problems reduces the safety of the system.

The HESRA team members should bring to the risk reduction discussion their individual perspectives and expertise, along with what they know to be feasible and possible. It is at this point in the process that technical, management, and policy "fixes" can be reasonably considered. Fixes that are not possible, for whatever reason, should be removed from further consideration. It does no good to propose a risk reduction strategy that stands almost no chance of being implemented because of cost, policy, technical, or other embedded issues.

The analysis team does not have to completely develop the remediation strategies, only suggest alternatives that will raise the scale ratings that caused the low HI or RPN. The details of each remediation strategy should be worked out apart from the risk analysis.

5.13.3 Assign Ratings Assuming Remediation
The purpose of suggesting remediation strategies is to raise the risk indexes that placed the error in a high criticality category. A reasonable question, then, is "How much will this solution reduce the risk?" In order to assess the magnitude of the risk reduction, the analysis team should assume that the suggested remediation is developed and implemented correctly. They should then assign provisional ratings on each of the three risk scales for that remediation using the spreadsheet.
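The "RPN must go up" ground rule above, together with the provisional re-rating described in 5.13.3, amounts to a simple before/after comparison. Continuing the earlier illustrative sketch (the helper names and sample ratings are assumptions, not part of the method):

    def rpn(r: dict) -> int:
        """RPN = likelihood x severity x recovery (each 1-5; higher = safer)."""
        return r["likelihood"] * r["severity"] * r["recovery"]

    def remediation_helps(before: dict, after: dict) -> bool:
        """True only if the provisional (post-fix) ratings raise the RPN.

        A 'fix' that raises one rating while lowering another, leaving the
        product unchanged, is not a real risk reduction.
        """
        return rpn(after) > rpn(before)

    original    = {"likelihood": 2, "severity": 1, "recovery": 4}  # RPN = 8
    provisional = {"likelihood": 4, "severity": 1, "recovery": 4}  # RPN = 16
    assert remediation_helps(original, provisional)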


5.13.4 Assess Impact on HI and RPN
Once the provisional risk scale ratings are assigned for a suggested remediation, it will then be feasible to calculate new HI and RPN values. To state the obvious, the HI and RPN should be higher for the remediation than for the original error. It is up to the analysis team to determine whether the magnitude of risk reduction is sufficient or whether additional remediation is necessary.

At this point, assigning remediation risk ratings is a conceptual exercise. That is, the analysis team is assigning risk ratings as if the remediation is implemented, implemented correctly, and so on. However conceptual the exercise might be, it forces the analysis team to consider and discuss remediation options with an eye toward actually reducing risk.

5.13.5 Iterate Remediation if Risk Is Not Sufficiently Reduced
Remediation typically has the goal, perhaps unstated, of moving the overall risk of an error down to the lowest two criticality categories, i.e., "Low" or "Extremely Low". At the very least, remediation should move the error down one level of criticality. For example, if the original risk analysis places the error mode into the "Extremely High Risk" category, the remediation should aim to at least move it out of that category. When risk scale values are assigned to the remediation and the risk indexes are re-calculated, it will be apparent whether this goal has been met.

The overall risk can be reduced even though the error still falls into the same criticality category as before the remediation. Since criticality is assessed only categorically, such a reduction is not likely to be acceptable. Therefore, another form of remediation should be identified and the process repeated.

It is entirely possible that the risk analysis team will be unable to identify a remediation that sufficiently lowers the criticality of an error. How could this happen? It might be that the severity of an error's effects is so pronounced that modest improvements in the likelihood of occurrence or in detection and recovery don't lower the overall criticality enough. Severity of effects is not typically amenable to easy remediation.

The effect of losing approach radar, for example, is whatever that effect might be. The analysis team isn't going to be able to change that effect without, perhaps, adding a redundant radar system, which is not something they're likely to be able to do.

If sufficient remediation cannot be identified, then the error should be flagged for further analysis. The fact that the analysis team, with all its expertise, cannot think of effective remediation is an indication that the procedure and/or the system itself might need some serious re-design.
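The "move down at least one criticality level" test is also easy to automate. A sketch under the same assumptions as the earlier fragments, reusing the hypothetical rpn_category() helper defined above:

    # Categories ordered from worst to best.
    LEVELS = ["Extremely High Risk", "High Risk", "Moderate Risk",
              "Low Risk", "Extremely Low Risk"]

    def remediation_sufficient(old_rpn: int, new_rpn: int) -> bool:
        """True if the remediated error lands at least one criticality
        level lower (i.e., safer) than the original error."""
        return (LEVELS.index(rpn_category(new_rpn))
                > LEVELS.index(rpn_category(old_rpn)))

    # Example: an RPN of 8 ("Extremely High Risk") improved to 16 ("High
    # Risk") clears the bar. An improvement from 10 to 20 (both "High
    # Risk") does not, and another remediation should be sought.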


5.14 Step 14 – Produce Risk Analysis Report
The product of the HESRA analysis is a report that encapsulates the risk analysis process as applied to the particular procedure or system. The components of the report are described below.

5.14.1 Overview of System and Procedures Analyzed
The report should briefly describe the target of the analysis effort. This might be a particular facility, a system that is being developed, a series of maintenance or operational procedures, etc. It should also indicate the reason(s) for conducting the analysis. For example, the analysis might have been prompted by a particular event, by a requirement in the development or procurement process, or by a pending change to a system or facility.

5.14.2 General Statement of Findings
In general, what did the risk analysis reveal? This part of the report should not provide the details of the analysis. Rather, it should offer a prose statement describing, in general terms, the outcome of the analysis. For example, the team might report that they found the procedure to be generally free of high-risk elements, with the exception of a few tasks for which certain high-probability errors could have severe consequences. The report should also state whether these high-risk elements can be easily "fixed."

5.14.3 Overall Recommendations Related to System and Procedures
It is a fact of life that very few people are going to read a HESRA report from cover to cover. Therefore, it is important for the analysis team to provide a concise description of its recommendations regarding the procedure(s), facility, or system it has analyzed. This can be done with a bullet-point list of recommendations. The idea is to convey to the reader the steps the team feels need to be taken to reduce whatever risks were found to be too high.

5.14.4 Explicit Listing of "High Priority" Errors
Inevitably, some errors will float to the top of the criticality hierarchy. These are the high-risk elements that HESRA is designed to find and, hopefully, eliminate. The HESRA report should explicitly list and describe the errors that the analysis team considers high priority. If there are characteristics of these errors that are counterintuitive, then the report should explain why and how the ratings were assigned. The readers of the report will not be privy to the deliberations of the analysis team, so it is perfectly reasonable to spend some ink explaining how the risk associated with these errors came to be rated so critically.

5.14.5 Explicit Listing and Description of Proposed Remedy
This section of the report should be interwoven with the information contained in the previous section. That is, for each high-priority error mode, the proposed remedy should be listed and described. As noted in the body of this document, the analysis team will not necessarily know the details of each proposed fix. For example, a perfectly valid recommendation is to add a coding dimension for a procedural step that currently uses only color coding.

The level of detail for this information should be sufficient for ATO management to assign responsibility for developing and implementing the remedy.
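The listings called for in 5.14.4 and 5.14.5 can be pulled straight from the analysis records. A final illustrative sketch, again assuming the toy data layout used throughout (the "remedy" field is hypothetical):

    def high_priority(errors):
        """Rows whose RPN category is High or Extremely High Risk."""
        return [e for e in errors
                if rpn_category(e["RPN"]) in ("High Risk",
                                              "Extremely High Risk")]

    # Emit one bullet line per high-priority error, paired with its
    # proposed remedy (or a placeholder if none has been suggested yet).
    for e in high_priority(errors):
        print(f"- Step {e['step']}: {e['error']} "
              f"(HI={e['HI']}, RPN={e['RPN']}) -> {e.get('remedy', 'remedy TBD')}")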
5.14.6 Link or Provide Access to HESRA Analysis Spreadsheet (or Whatever Software Tool Is Used to Support the Analysis)
The risk analysis is embodied in the HESRA spreadsheet. All the work of the risk analysis team, including the task breakdown, errors, ratings, criticality calculations, and notes, is contained in the spreadsheet. HESRA spreadsheets tend to be fairly large.


Therefore, they are not amenable to simply being printed out and included in a paper report.

The best way to convey the HESRA spreadsheet information is to save a read-only version of the file in an accessible location and include a link to it in the HESRA report. Since the report will likely be distributed in soft form (as well as in hard copy), the link will allow any interested person to see the actual, detailed information the spreadsheet contains.

5.14.7 Statement of Concurrence of Analysis Team
The HESRA report should contain a section in which team members state their reasons for concurrence with the general report findings. This might seem a bit odd, as in, "What part of yes don't you understand?" However, various team members might want to clarify why they agree with the report findings. Their reasons might not be obvious or intuitive.

5.14.8 Statement(s) of Exceptions from Analysis Team Members
The findings of the HESRA team are adopted by consensus, not unanimity. Therefore, some team members might want to document their views on, and objections to, certain findings. It is valuable to document these exceptions, since dissenting team members can raise quite valid issues that shed light on the overall report findings. This is the appropriate place for team members to state their concerns and objections to individual risk ratings, to the selection or elimination from consideration of particular risk reduction strategies, or to any other aspect of the HESRA process.

5.15 Step 15 – Assign Remediation Actions
This step is not within the purview of the risk analysis team; it should be done by ATO management with the advice of the appropriate risk analysis team members. It is included in this method description for the sake of completeness. Remediation should be assigned to specific people or organizations. Without such assignment, it is unlikely that the remediation will be undertaken and completed in a timely manner.

5.16 Step 16 – Monitor Remediation to Ensure Actions Are Completed
As with the previous step, monitoring remediation is not within the scope of the analysis team's work. However, it is critically important that the work of implementing the team's recommendations be monitored until it is complete. This step is the responsibility of appropriate ATO management.


Appendix A – Definition of Terms

Brief – A period of time on the order of minutes, up to one hour.

Delays – Any incremental time added to scheduled departure or arrival times due to degraded ATC facilities operation.

Extended – A period of time on the order of hours, or longer.

Safety Margin – The buffer between minimal local ATC safety, i.e., the ability to maintain positive A/C control, and the current level of safety.

Maintainer Safety – The ability of System Specialists to perform their job tasks without a significant risk of injury.

Maintainer Workload – The current requirement for physical, perceptual, and cognitive capacity to perform job tasks related to maintaining and/or restoring ATC functions and services.

Function – The ability of hardware, software, communication channels, etc., to support ATC tasks.

Service – The ability of the ATC facilities to provide a specific functional capability to A/C and ATC. Examples include radar, ILS, A/G comm., etc.


Appendix B – Schematic of the HESRA Matrix

[Schematic of the HESRA matrix: graphic not reproduced in this version.]


