Hazard Log Information

- System, subsystem, unit
- Description
- Cause(s)
- Possible effects, effect on system
- Category (hazard level: probability and severity)
- Design constraints
- Corrective or preventative measures, possible safeguards, recommended action
- Operational phase when hazardous
- Responsible group or person for ensuring safeguards provided
- Tests (verification) to be undertaken to demonstrate safety
- Other proposed and necessary actions
- Status of hazard resolution process
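Such an entry maps naturally onto a record type. Below is a minimal, hypothetical sketch (field names and types are illustrative, not from any standard) in Python:

```python
from dataclasses import dataclass, field

@dataclass
class HazardLogEntry:
    system: str                      # system, subsystem, or unit
    description: str
    causes: list[str]
    possible_effects: str            # effect on the system
    category: str                    # hazard level: probability and severity
    design_constraints: list[str]
    safeguards: list[str]            # corrective/preventative measures, recommended action
    hazardous_phases: list[str]      # operational phases when hazardous
    responsible: str                 # group/person ensuring safeguards are provided
    verification_tests: list[str]    # tests to demonstrate safety
    other_actions: list[str] = field(default_factory=list)
    resolution_status: str = "open"  # status of hazard resolution process
```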
Risk and Hazard Level Measurement

Risk = f(likelihood, severity)

- Impossible to measure risk accurately; instead, use risk assessment.
- Accuracy of such assessments is controversial.

    "To avoid paralysis resulting from waiting for definitive data, we assume we have greater knowledge than scientists actually possess and make decisions based on those assumptions."
        -- William Ruckelshaus

- Cannot evaluate the probability of very rare events directly, so use models of the interaction of events that can lead to an accident.
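One common formalization (an addition here, not on the slide) treats risk as expected loss, summed over the identified hazards:

```latex
% Risk as expected loss: likelihood times severity, summed over hazards.
R \;=\; \sum_i p_i \, c_i
% p_i: likelihood of hazard i;  c_i: severity (cost) of its consequences.
```

Whatever the formalization, the slide's caveat applies: for rare events the p_i cannot be measured directly, so any computed R inherits the assumptions of the model that produced it.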
Risk Modeling

- In practice, models only include events that can be measured.
- Most causal factors involved in major accidents are unmeasurable.
- Unmeasurable factors tend to be ignored or forgotten.
  - Can we measure software? (What does it mean to measure design?)
  - Human error?

    "Risk assessment data can be like the captured spy; if you torture it long enough, it will tell you anything you want to know."
        -- William Ruckelshaus, Risk in a Free Society
Misinterpreting Risk

Risk assessments can easily be misinterpreted:

[Figure: a system boundary drawn inside an extended system boundary, annotated with the probabilities 10^-4 and 10^-3, and the product 10^-3 x 10^-3 = 10^-6.]
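The product in the figure is only valid if the two 10^-3 events are independent; a hedged reconstruction of that step (my reading, not spelled out on the slide):

```latex
% Multiplying failure probabilities assumes independence:
P(A \cap B) \;=\; P(A)\,P(B) \;=\; 10^{-3} \times 10^{-3} \;=\; 10^{-6}
% If A and B share a common cause, P(A \cap B) can be orders of
% magnitude larger than 10^{-6}, and the quoted figure misleads.
```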
Example of Unrealistic Risk Assessment Contributing to an Accident

Design: The system design included a relief valve opened by an operator to protect against overpressurization. A secondary valve was installed as a backup in case the primary valve failed. The operator must know whether the first valve opened so that the second valve can be activated if it did not.

Events: The operator commanded the relief valve to open. The open position indicator light and the open indicator light both illuminated. The operator, thinking the primary relief valve had opened, did not activate the secondary relief valve. However, the primary valve was NOT open, and the system exploded.

Causal Factors: Post-accident examination discovered that the indicator light circuit was wired to indicate the presence of power at the valve, not the valve position. Thus, the indicator showed only that the activation button had been pushed, not that the valve had opened. An extensive quantitative safety analysis of this design had assumed a low probability of simultaneous failure of the two relief valves, but ignored the possibility of a design error in the electrical wiring; the probability of design error was not quantifiable. No safety evaluation of the electrical wiring was made; instead, confidence was established on the basis of the low probability of coincident failure of the two relief valves.

The Therac-25 is another example where unrealistic risk assessment contributed to the losses.
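The wiring flaw is easy to see in miniature. The sketch below is hypothetical (all names invented) and contrasts an indicator driven by the command signal, as built, with one driven by the valve's actual position:

```python
# Hypothetical sketch of the flaw described above: the light was wired
# to show power at the valve (i.e., that the open command was issued),
# not the valve's actual mechanical position.

class ReliefValve:
    def __init__(self) -> None:
        self.commanded_open = False  # power applied to the valve actuator
        self.actually_open = False   # true mechanical position

    def command_open(self) -> None:
        self.commanded_open = True
        # A mechanical failure can leave the valve closed despite the
        # command; we model the stuck-closed case by not changing
        # actually_open.

def indicator_as_built(v: ReliefValve) -> bool:
    """Light mirrors the command signal (the flawed design)."""
    return v.commanded_open

def indicator_position_sensed(v: ReliefValve) -> bool:
    """Light driven by a sensor on the valve itself (the safer design)."""
    return v.actually_open

valve = ReliefValve()
valve.command_open()                         # valve sticks closed
assert indicator_as_built(valve)             # light says "open": misleading
assert not indicator_position_sensed(valve)  # position sensor reveals the truth
```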
Classic Hazard Level Matrix

                               SEVERITY
                   I             II          III         IV
                   Catastrophic  Critical    Marginal    Negligible
  LIKELIHOOD
  A  Frequent      I-A           II-A        III-A       IV-A
  B  Moderate      I-B           II-B        III-B       IV-B
  C  Occasional    I-C           II-C        III-C       IV-C
  D  Remote        I-D           II-D        III-D       IV-D
  E  Unlikely      I-E           II-E        III-E       IV-E
  F  Impossible    I-F           II-F        III-F       IV-F
Another Example Hazard Level Matrix

                     A         B         C           D       E           F
                     Frequent  Probable  Occasional  Remote  Improbable  Impossible
  I    Catastrophic   1         2         3           4       9          12
  II   Critical       3         4         6           7      12          12
  III  Marginal       5         6         8          10      12          12
  IV   Negligible    10        11        12          12      12          12

Cell annotations, as far as they are recoverable from the slide:
- Lowest levels: design action required to eliminate or control the hazard.
- Middle levels: hazard must be controlled or hazard probability reduced.
- Higher levels: hazard control desirable if cost effective.
- Negligible row: negligible hazard.
- Impossible column: assume the hazard will not occur.
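A matrix like this is just a lookup table; a hedged sketch encoding the example above (the numeric levels are as reconstructed from the slide, 1 = most serious, 12 = negligible, so treat them as illustrative):

```python
LIKELIHOODS = ["A", "B", "C", "D", "E", "F"]  # Frequent .. Impossible

# Hazard level by severity class (rows) and likelihood (columns).
HAZARD_LEVEL = {
    "I":   dict(zip(LIKELIHOODS, [1, 2, 3, 4, 9, 12])),      # Catastrophic
    "II":  dict(zip(LIKELIHOODS, [3, 4, 6, 7, 12, 12])),     # Critical
    "III": dict(zip(LIKELIHOODS, [5, 6, 8, 10, 12, 12])),    # Marginal
    "IV":  dict(zip(LIKELIHOODS, [10, 11, 12, 12, 12, 12])), # Negligible
}

def hazard_level(severity: str, likelihood: str) -> int:
    """Look up the hazard level for a severity class I-IV and likelihood A-F."""
    return HAZARD_LEVEL[severity][likelihood]

assert hazard_level("I", "A") == 1    # Catastrophic + Frequent: most serious
assert hazard_level("IV", "F") == 12  # Negligible + Impossible: least serious
```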
Hazard Level Assessment

- Not feasible for complex, human/computer-controlled systems:
  - No way to determine likelihood
  - Almost always involves new designs and new technology
- Severity is often adequate (and can be determined) for planning the effort to spend on eliminating or mitigating a hazard.
- It may be possible to establish qualitative criteria to evaluate potential hazard level and make deployment or technology decisions, but these will depend on the system.
Example of Qualitative Criteria

AATT Safety Criterion: The introduction of AATT tools will not degrade safety from the current level.

Hazard level assessment is based on:
- Severity of the worst possible loss associated with the tool
- Likelihood that introduction of the tool will reduce the current safety level of the ATC system
Example Severity Level (from a proposed JAA standard)

- Class I: Catastrophic
  Unsurvivable accident with hull loss.
- Class II: Critical
  Survivable accident with less than full hull loss; fatalities possible.
- Class III: Marginal
  Equipment loss, with possible injuries and no fatalities.
- Class IV: Negligible
  Some loss of efficiency. Procedures able to compensate, but controller workload likely to be high until overall system demand is reduced. Reportable incident events such as operational errors, pilot deviations, and surface vehicle deviations.
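For later reference, the four classes can be encoded directly; the enum below is an illustrative sketch, not part of the proposed standard:

```python
from enum import IntEnum

class Severity(IntEnum):
    CATASTROPHIC = 1  # Class I: unsurvivable accident with hull loss
    CRITICAL = 2      # Class II: survivable accident; fatalities possible
    MARGINAL = 3      # Class III: equipment loss, possible injuries, no fatalities
    NEGLIGIBLE = 4    # Class IV: some loss of efficiency; reportable incidents
```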
Example Likelihood Level

- User tasks and responsibilities
  Low: Insignificant or no change. Medium: Minor change. High: Significant change.
- Potential for inappropriate human decision making
  Low: Insignificant or no change. Medium: Minor change. High: Significant change.
- Potential for user distraction or disengagement from primary task
  Low: Insignificant or no change. Medium: Minor change. High: Significant change.
Example Likelihood Level (2)

- Safety margins
  Low: Insignificant or no change. Medium: Minor change. High: Significant change.
- Potential for reducing situation awareness
  Low: Insignificant or no change. Medium: Minor change. High: Significant change.
- Skills currently used and those necessary to back up and monitor the new decision support tools
  Low: Insignificant or no change. Medium: Minor change. High: Significant change.
- Introduction of new failure modes and hazard causes
  Low: New tools have the same function and failure modes as the system components they are replacing.
  Medium: Introduced, but well understood, and effective mitigation measures can be designed.
  High: Introduced and cannot be classified under medium.
- Effect of software on current system hazard mitigation measures
  Low: Cannot render them ineffective. High: Can render them ineffective.
- Need for new system hazard mitigation measures
  Low: Potential software errors will not require them. High: Potential software errors could require them.

(A sketch combining these factor ratings into an overall likelihood follows.)
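A hedged sketch of how the factor ratings might be combined; the "worst factor wins" rule is an assumption, since the slides do not say how the factors are aggregated:

```python
ORDER = {"low": 0, "medium": 1, "high": 2}

def overall_likelihood(ratings: dict[str, str]) -> str:
    """Return the worst (highest) rating among all factors."""
    return max(ratings.values(), key=ORDER.__getitem__)

# Hypothetical assessment of one tool against the factors listed above.
ratings = {
    "user tasks and responsibilities": "low",
    "inappropriate decision making": "medium",
    "distraction or disengagement": "low",
    "safety margins": "low",
    "situation awareness": "medium",
    "backup and monitoring skills": "low",
    "new failure modes": "medium",
    "effect on existing mitigations": "low",  # low/high factor
    "need for new mitigations": "low",        # low/high factor
}
assert overall_likelihood(ratings) == "medium"
```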
Causality

Accident causes are often oversimplified:

    "The vessel Baltic Star, registered in Panama, ran aground at full speed on the shore of an island in the Stockholm waters on account of thick fog. One of the boilers had broken down, the steering system reacted only slowly, the compass was maladjusted, the captain had gone down into the ship to telephone, the lookout man on the prow took a coffee break, and the pilot had given an erroneous order in English to the sailor who was tending the rudder. The latter was hard of hearing and understood only Greek."
        -- Le Monde

Larger organizational and economic factors?
Issues in Causality

- Filtering and subjectivity in accident reports
- Root cause seduction
  - The idea of a singular cause is satisfying to our desire for certainty and control.
  - Leads to fixing symptoms.
- The "fixing" orientation
  - Well-understood causes given more attention: component failure, operator error.
- Tendency to look for linear cause-effect relationships
  - Makes it easier to select corrective actions (a "fix").
NASA Procedures and Guidelines: NPG 8621 Draft 1

Root Cause: "Along a chain of events leading to a mishap, the first causal action or failure to act that could have been controlled systematically either by policy/practice/procedure or individual adherence to policy/practice/procedure."

Contributing Cause: "A factor, event, or circumstance that led directly or indirectly to the dominant root cause, or which contributed to the severity of the mishap."
Hierarchical Models

- Level 1: Events or accident mechanism
- Level 2: Conditions
- Level 3: Systemic factors
Hierarchical Analysis Example

[Diagram of the Titan IV/Centaur mishap, reconstructed as three levels; event ordering approximate:]

Level 3 (systemic factors):
- Diffused responsibility and authority
- Organizational and communication problems
- Inadequate review process

Level 2 (conditions):
- Everyone assumes someone else tested using the load tape
- QA did not understand the process

Level 1 (events):
- S/W load tape contains incorrect filter constant
- IMS sends zero roll rate to flight control software
- Centaur becomes unstable; fuel sloshing
- Low acceleration leads to wrong time for engine shutdown
- Centaur separates from Titan IV
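The three-level model is itself a simple data structure. A minimal sketch, assuming the level assignments read off the reconstructed diagram above (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class HierarchicalAnalysis:
    events: list[str]            # Level 1: events or accident mechanism
    conditions: list[str]        # Level 2: conditions behind the events
    systemic_factors: list[str]  # Level 3: systemic factors

centaur = HierarchicalAnalysis(
    events=[
        "S/W load tape contains incorrect filter constant",
        "IMS sends zero roll rate to flight control software",
        "Low acceleration leads to wrong time for engine shutdown",
    ],
    conditions=[
        "Everyone assumes someone else tested using the load tape",
        "QA did not understand the process",
    ],
    systemic_factors=[
        "Diffused responsibility and authority",
        "Inadequate review process",
    ],
)
```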
Systemic Factors in (Software-Related) Accidents

1. Flaws in the Safety Culture

Safety Culture: The general attitude and approach to safety reflected by those who participate in an industry or organization, including management, workers, and government regulators.

- Underestimating or not understanding software risks
- Overconfidence and complacency
  - Assuming risk decreases over time
  - Ignoring warning signs
- Inadequate emphasis on risk management
- Incorrect prioritization of changes to automation
- Slow understanding of problems in human-automation mismatch
- Overrelying on redundancy and protection systems
- Unrealistic risk assessment