Software System Safety
Copyright © Nancy G. Leveson, July 2002
Accident with No Component Failures

[Figure: computer-controlled batch chemical reactor. Diagram labels:
COMPUTER, CATALYST, GEARBOX, REACTOR, VAPOR, REFLUX, CONDENSER,
COOLING WATER, VENT, LA, LC.]

Types of Accidents

Component failure accidents
  - Single or multiple component failures
  - Usually assume random failure

System accidents
  - Arise in interactions among components
  - No components may have "failed"
  - Caused by interactive complexity and tight coupling
  - Exacerbated by the introduction of computers
Interactive Complexity

Complexity is a moving target.
The underlying factor is intellectual manageability:

1. A "simple" system has a small number of unknowns in its interactions
   within the system and with its environment.

2. A system is intellectually unmanageable when the level of interactions
   reaches the point where they cannot be thoroughly planned, understood,
   anticipated, or guarded against.

3. Introducing new technology introduces unknowns and even "unk-unks"
   (unknown unknowns).

Computers and Risk

"We seem not to trust one another as much as would be desirable. In lieu
of trusting each other, are we putting too much trust in our technology?
... Perhaps we are not educating our children sufficiently well to
understand the reasonable uses and limits of technology."

        Thomas B. Sheridan
A Possible Solution

Enforce discipline and control complexity.
  - Limits have changed from structural integrity and physical constraints
    of materials to intellectual limits.

Improve communication among engineers.

Build safety in by enforcing constraints on behavior.

Example (batch reactor):
  System safety constraint:
    Water must be flowing into the reflux condenser whenever catalyst
    is added to the reactor.
  Software safety constraint:
    Software must always open the water valve before the catalyst valve.
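A minimal sketch of how the software safety constraint above could be made
explicit and enforced in the controller itself, rather than left implicit in
the calling sequence. The ReactorController class and its valve interface are
hypothetical illustrations, not from the slides:

    class Valve:
        """Hypothetical stand-in for a real valve actuator interface."""
        def __init__(self, name: str):
            self.name = name

        def open(self):
            print(f"{self.name} valve opened")


    class ReactorController:
        """Enforces the safety constraint: the water valve must be
        opened before the catalyst valve."""

        def __init__(self, water_valve: Valve, catalyst_valve: Valve):
            self.water_valve = water_valve
            self.catalyst_valve = catalyst_valve
            self.water_flowing = False

        def open_water_valve(self):
            self.water_valve.open()
            self.water_flowing = True

        def open_catalyst_valve(self):
            # Check the constraint here instead of trusting every caller
            # to remember the correct ordering.
            if not self.water_flowing:
                raise RuntimeError(
                    "Safety constraint violated: water must be flowing "
                    "into the reflux condenser before catalyst is added")
            self.catalyst_valve.open()


    controller = ReactorController(Valve("water"), Valve("catalyst"))
    controller.open_water_valve()     # must happen first
    controller.open_catalyst_valve()  # now permitted

The point of the sketch is the design choice: the ordering constraint is an
explicit, checkable part of the software's behavior, not an unstated
assumption about how the software happens to be used.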
Stages in Process Control System Evolution

1. Mechanical systems
   - Direct sensory perception of process
   - Displays are directly connected to the process and thus are physical
     extensions of it
   - Design decisions highly constrained by:
       available space
       physics of the underlying process
       limited possibility of action at a distance
Stages in Process Control System Evolution (2)

2. Electromechanical systems
   - Capability for action at a distance
   - Need to provide an image of the process to operators
   - Need to provide feedback on actions taken
   - Relaxed constraints on designers but created new possibilities for
     designer and operator error

Stages in Process Control System Evolution (3)

3. Computer-based systems
   - Allow multiplexing of controls and displays
   - Relaxes even more constraints and introduces more possibility
     for error
   - But the old constraints shaped the environment in ways that
     efficiently transmitted valuable process information and supported
     the cognitive processes of operators
   - Finding it hard to capture and present these qualities in new systems
The Problem to be Solved

The primary safety problem in computer-based systems is the lack of
appropriate constraints on design.

The job of the system safety engineer is to identify the design constraints
necessary to maintain safety and to ensure that the system and software
design enforces them.
Safety ≠ Reliability

Accidents in high-tech systems are changing their nature, and we must
change our approaches to safety accordingly.

Confusing Safety and Reliability

From an FAA report on ATC software architectures:

  "The FAA's en route automation meets the criteria for consideration as
  a safety-critical system. Therefore, en route automation systems must
  possess ultra-high reliability."

From a blue-ribbon panel report on the V-22 Osprey problems:

  "Safety [software]: ... Recommendation: Improve reliability, then verify
  by extensive test/fix/test in challenging environments."
Does Software Fail?

Failure: Nonperformance or inability of a system or component to perform
its intended function for a specified time under specified environmental
conditions. A basic abnormal occurrence, e.g.:
  - burned-out bearing in a pump
  - relay not closing properly when voltage is applied

Fault: Higher-order events, e.g., a relay closes at the wrong time due to
improper functioning of an upstream component.

All failures are faults, but not all faults are failures.

Reliability Engineering Approach to Safety

Reliability: The probability an item will perform its required function in
the specified manner over a given time period and under specified or
assumed conditions.

(Note: Most software-related accidents result from errors in the specified
requirements or function and deviations from assumed conditions.)

Concerned primarily with failures and failure-rate reduction (a numerical
sketch follows this list):
  - Parallel redundancy
  - Standby sparing
  - Safety factors and margins
  - Derating
  - Screening
  - Timed replacements
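To make the quantitative flavor of this approach concrete, here is a sketch
of the standard constant-failure-rate model and the parallel-redundancy
calculation it supports. The formulas are textbook reliability engineering,
not from the slides, and the failure rate and mission time are illustrative
values only:

    import math

    # Constant failure rate model: R(t) = exp(-lambda * t).
    failure_rate = 1e-4      # failures per hour (illustrative)
    mission_time = 1000.0    # hours (illustrative)

    r_single = math.exp(-failure_rate * mission_time)

    # Parallel redundancy: the pair fails only if both components fail,
    # so R_parallel = 1 - (1 - R)^2, assuming independent random failures.
    r_parallel = 1 - (1 - r_single) ** 2

    print(f"R(single)   = {r_single:.6f}")    # ~0.904837
    print(f"R(parallel) = {r_parallel:.6f}")  # ~0.990944

Note what the model buys and what it assumes: redundancy raises the
computed reliability only under the assumptions of random, independent
failures and correct requirements. As the slide's note observes, most
software-related accidents violate exactly those assumptions.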
Reliability Engineering Approach to Safety (2)

Assumes accidents are the result of component failure:
  - Techniques exist to increase component reliability
  - Failure rates in hardware are quantifiable

Omits important factors in accidents; may even decrease safety.

Many accidents occur without any component "failure":
  - Accidents may be caused by equipment operation outside the parameters
    and time limits upon which the reliability analyses are based.
  - Or may be caused by interactions of components all operating according
    to specification.

Highly reliable components are not necessarily safe.
Software Component Reuse

One of the most common factors in software-related accidents.

Software contains assumptions about its environment. Accidents occur when
these assumptions are incorrect:
  - Therac-25
  - Ariane 5
  - U.K. ATC software

Most likely to change: the features embedded in or controlled by
the software.

COTS makes safety analysis more difficult.

Safety and reliability are different qualities!
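A schematic sketch of the reuse pattern behind the Ariane 5 example: inertial
reference software reused from Ariane 4 converted a 64-bit floating-point,
horizontal-velocity-related value to a 16-bit signed integer, a conversion
that was safe under Ariane 4's flight profile but overflowed on Ariane 5's
faster trajectory. The code below is an illustration of that pattern with
the implicit assumption written out as a check, not the actual flight code,
and all values are made up:

    def to_int16(value: float) -> int:
        """Convert to a 16-bit signed integer. The reused code's hidden
        assumption is that the environment keeps the value in range."""
        result = int(value)
        if not -32768 <= result <= 32767:
            # In the reused component this "could not happen."
            raise OverflowError(f"{value} does not fit in 16 bits")
        return result

    print(to_int16(12345.6))   # fine under the old flight profile

    try:
        to_int16(80000.0)      # new environment violates the assumption
    except OverflowError as err:
        print("reused component failed:", err)

The component itself is unchanged and "reliable" by its original
specification; the accident comes from moving it into an environment where
its unstated assumption no longer holds.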
Software-Related Accidents

Are usually caused by flawed requirements:
  - Incomplete or wrong assumptions about the operation of the controlled
    system or the required operation of the computer
  - Unhandled controlled-system states and environmental conditions

Merely trying to get the software "correct" or to make it reliable will
not make it safer under these conditions.