Design for Safety

"Unfortunately, everyone had forgotten why the branch came off the top of the main and nobody realized that this was important."
    Trevor Kletz, What Went Wrong?

"Before a wise man ventures into a pit, he lowers a ladder so he can climb out."
    Rabbi Samuel Ha-Levi Ben Joseph Ibn Nagrela

Design for Safety

- Software design must enforce safety constraints.
- Should be able to trace from requirements to code (and vice versa).
- Design should incorporate basic safety design principles.

Safe Design Precedence

(Figure: the precedence list below, with side arrows labeled "Decreasing cost" and "Increasing effectiveness".)

HAZARD ELIMINATION
- Substitution
- Simplification
- Decoupling
- Elimination of human errors
- Reduction of hazardous materials or conditions

HAZARD REDUCTION
- Design for controllability
- Barriers
  - Lockins, Lockouts, Interlocks
- Failure Minimization
  - Safety Factors and Margins
  - Redundancy

HAZARD CONTROL
- Reducing exposure
- Isolation and containment
- Protection systems and fail-safe design

DAMAGE REDUCTION

Hazard Elimination

SUBSTITUTION
- Use safe or safer materials.
- Simple hardware devices may be safer than using a computer.
- No technological imperative says we MUST use computers to control dangerous devices.
- Introducing new technology introduces unknowns and even unk-unks (unknown unknowns).

SIMPLIFICATION
Criteria for a simple software design:
1. Testable: number of states limited
   - determinism vs. nondeterminism
   - single tasking vs. multitasking
   - polling over interrupts
2. Easily understood and readable
3. Interactions between components are limited and straightforward.
4. Code includes only minimum features and capability required by system.
   - Should not contain unnecessary or undocumented features or unused executable code.
5. Worst-case timing is determinable by looking at code.
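
The testability criteria above (polling over interrupts, single tasking, worst-case timing visible in the code) can be illustrated with a small sketch. This is not taken from the slides: the schedule, the thresholds, and the names `run_cycles`, `decide_pressure`, and `decide_temperature` are hypothetical, and the sensor reads are stubbed with constants.

```python
from typing import Callable, List, Tuple

Poll = Callable[[], float]       # reads one input (stubbed here)
Decide = Callable[[float], str]  # maps a reading to an action name


def run_cycles(schedule: List[Tuple[str, Poll, Decide]], cycles: int) -> List[str]:
    """Poll every input once per cycle, always in the same order.

    Both loops have fixed bounds, so the number of polls per cycle can be
    read straight off the code: len(schedule). No interrupts, no extra tasks.
    """
    log: List[str] = []
    for cycle in range(cycles):
        for name, poll, decide in schedule:
            log.append(f"{cycle}:{name}:{decide(poll())}")
    return log


def decide_pressure(kpa: float) -> str:
    # Hypothetical threshold, for illustration only.
    return "vent" if kpa > 100.0 else "hold"


def decide_temperature(deg_c: float) -> str:
    # Hypothetical threshold, for illustration only.
    return "cool" if deg_c > 350.0 else "hold"


SCHEDULE = [
    ("pressure", lambda: 120.0, decide_pressure),
    ("temperature", lambda: 300.0, decide_temperature),
]

log = run_cycles(SCHEDULE, cycles=2)
```

Because the order of polls never changes, the same inputs always produce the same log, which is what makes exhaustive testing of such a loop feasible.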

SIMPLIFICATION (cont.)
- Reducing and simplifying interfaces will eliminate errors and make designs more testable.
- Easy to add functions to software, hard to practice restraint.
- Constructing a simple design requires discipline, creativity, restraint, and time.
- Design so that structural decomposition matches functional decomposition.

DECOUPLING
- Tightly coupled system is one that is highly interdependent:
  - Each part linked to many other parts.
    - Failure or unplanned behavior in one can rapidly affect status of others.
  - Processes are time-dependent and cannot wait.
    - Little slack in system
  - Sequences are invariant.
  - Only one way to reach a goal.
- System accidents caused by unplanned interactions.
- Coupling creates increased number of interfaces and potential interactions.

DECOUPLING (cont.)
- Computers tend to increase system coupling unless designers are very careful.
- Applying principles of decoupling to software design:
  - Modularization: how the system is split up is crucial to determining effects.
  - Firewalls
  - Read-only or restricted-write memories
  - Eliminate hazardous effects of common hardware failures

ELIMINATION OF HUMAN ERRORS
- Design so there are few opportunities for errors.
  - Make errors impossible, or possible to detect immediately.
- Lots of ways to increase safety of human-machine interaction:
  - Making status of component clear.
  - Designing software to be error tolerant.
  - etc. (will cover separately)
- Programming language design:
  - Not only simple itself (masterable), but should encourage the production of simple and understandable programs.
  - Some language features have been found to be particularly error prone.
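
As a loose software analogy for the read-only or restricted-write memories listed under decoupling, the sketch below exposes critical limits only through a read-only view. Real read-only memory is enforced by hardware; this is merely an illustration. The names `LIMITS` and `try_overwrite` and the limit values are hypothetical.

```python
from types import MappingProxyType

# The underlying dict is created inline, so no writable reference to it
# remains; other modules only ever see the read-only proxy.
LIMITS = MappingProxyType({
    "max_pressure_kpa": 800,  # hypothetical value
    "max_temp_c": 350,        # hypothetical value
})


def try_overwrite(view, key, value) -> bool:
    """Attempt a write through the view; report whether it succeeded."""
    try:
        view[key] = value
    except TypeError:
        # mappingproxy objects do not support item assignment.
        return False
    return True
```

A faulty or unrelated module that tries to change a limit gets an immediate error instead of silently altering the behavior of the modules that depend on it.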

REDUCTION OF HAZARDOUS MATERIALS OR CONDITIONS
- Software should contain only code that is absolutely necessary to achieve required functionality.
  - Implications for COTS
  - Extra code may lead to hazards and may make software analysis more difficult.
- Memory not used should be initialized to a pattern that will revert to a safe state.

Turbine-Generator Example

Safety requirements:
1. Must always be able to close steam valves within a few hundred milliseconds.
2. Under no circumstances can steam valves open spuriously, whatever the nature of internal or external fault.

Divided into two parts (decoupled) on separate processors:
1. Non-critical functions: loss cannot endanger turbine nor cause it to shut down.
   - less important governing functions
   - supervisory, coordination, and management functions
2. Small number of critical functions
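
The earlier point about initializing unused memory to a pattern that reverts to a safe state can be sketched with a toy interpreter. Everything here is hypothetical: the opcodes, the image size, and the `build_image` and `run` helpers are invented for illustration and do not model any real processor.

```python
# Hypothetical one-byte opcodes for a toy machine.
NOP, OPEN_VALVE, HALT, SAFE_TRAP = 0x00, 0x01, 0x02, 0xFF


def build_image(program: bytes, size: int) -> bytes:
    """Place the program at address 0 and fill all unused memory with SAFE_TRAP."""
    if len(program) > size:
        raise ValueError("program does not fit in memory")
    return program + bytes([SAFE_TRAP]) * (size - len(program))


def run(image: bytes, start: int) -> str:
    """Execute from `start`; any fetch from unused memory reverts to the safe state."""
    valve = "closed"
    pc = start
    while pc < len(image):
        op = image[pc]
        if op == SAFE_TRAP:
            return "safe_state"
        if op == HALT:
            return f"halted:{valve}"
        if op == OPEN_VALVE:
            valve = "open"
        pc += 1
    return "safe_state"  # running off the end is also treated as a fault


image = build_image(bytes([NOP, OPEN_VALVE, HALT]), size=16)
```

A normal start at address 0 runs the program, while a wild jump into the unused region (for example `run(image, 7)`) lands on the fill pattern and ends in the safe state rather than executing leftover bytes.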

Turbine-Generator Example (2)
- Uses polling: no interrupts except for fatal store fault (nonmaskable)
  - Timing and sequencing thus defined
  - More rigorous and exhaustive testing possible.
- All messages unidirectional
  - No recovery or contention protocols required
  - Higher level of predictability
- Self-checks of
  - Sensibility of incoming signals
  - Whether processor functioning correctly
- Failure of self-check leads to reversion to safe state through fail-safe hardware.
- State table defines:
  - Scheduling of tasks
  - Self-check criteria appropriate under particular conditions

Hazard Reduction
- Passive safeguards:
  - Maintain safety by their presence
  - Fail into safe states
- Active safeguards:
  - Require hazard or condition to be detected and corrected
- Tradeoffs:
  - Passive rely on physical principles.
  - Active depend on less reliable detection and recovery mechanisms.
  - BUT passive tend to be more restrictive in terms of design freedom and are not always feasible to implement.
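
The state-table idea from the turbine-generator example, where one table defines both the task schedule and the self-check criteria for each condition, might look roughly like the sketch below. The modes, speed ranges, and task names are assumptions made up for illustration, and in the real design the reversion to a safe state goes through fail-safe hardware, which this sketch only represents as a returned action.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Mode:
    tasks: Tuple[str, ...]                   # tasks scheduled in this mode, in order
    sensible_speed_rpm: Tuple[float, float]  # self-check criterion for the speed signal


# Hypothetical state table: modes, ranges, and task names are illustrative.
STATE_TABLE: Dict[str, Mode] = {
    "startup": Mode(("ramp",), (0.0, 3000.0)),
    "running": Mode(("govern", "report"), (2900.0, 3100.0)),
}


def poll_cycle(mode_name: str, speed_rpm: float, processor_ok: bool):
    """One polling cycle: self-check first, then hand back the scheduled tasks."""
    mode = STATE_TABLE[mode_name]
    low, high = mode.sensible_speed_rpm
    # A NaN reading fails the range comparison, so it is also treated as a fault.
    if not processor_ok or not (low <= speed_rpm <= high):
        return "CLOSE_STEAM_VALVES", ()
    return "CONTINUE", mode.tasks
```

The check is written so that anything other than a positively confirmed healthy state results in closing the valves, which matches the second safety requirement: a fault must never open them.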

Design for Controllability
- Make system easier to control, both for humans and computers.
- Use incremental control:
  - Perform critical steps incrementally rather than in one step.
  - Provide feedback
    - To test validity of assumptions and models upon which decisions are made
    - To allow taking corrective action before significant damage is done
  - Provide various types of fallback or intermediate states
- Lower time pressures
- Provide decision aids
- Use monitoring

Monitoring
- Difficult to make monitors independent:
  - Checks require access to the information being monitored, but that access usually involves the possibility of corrupting the information.
  - Depends on assumptions about structure of system and about errors that may or may not occur
    - May be incorrect under certain conditions
    - Common incorrect assumptions may be reflected both in design of monitor and devices being monitored.
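
Incremental control with feedback and a fallback state, as described on the controllability slide, can be sketched as follows. The function `ramp`, the integer percent-open units, the tolerance, and the simulated actuators are all hypothetical; the actuator is assumed to start fully closed (0).

```python
from typing import Callable, List, Tuple


def ramp(target: int, step: int, actuate: Callable[[int], int],
         tolerance: int = 2) -> Tuple[str, int, List[int]]:
    """Move toward `target` in small steps, checking feedback after each one.

    `actuate(setpoint)` commands a position and returns the measured position.
    On a mismatch, fall back to the last verified position and stop.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    verified = 0              # assumed starting position: fully closed
    commands: List[int] = []
    while verified < target:
        setpoint = min(verified + step, target)
        measured = actuate(setpoint)
        commands.append(setpoint)
        if abs(measured - setpoint) > tolerance:
            actuate(verified)           # fallback: return to last verified position
            commands.append(verified)
            return "fallback", verified, commands
        verified = setpoint
    return "done", verified, commands


def healthy(setpoint: int) -> int:
    return setpoint           # simulated actuator that follows commands exactly


def sticking(setpoint: int) -> int:
    return min(setpoint, 30)  # simulated actuator that jams at 30
```

With the healthy actuator the ramp to 50 completes in three steps; with the sticking one the mismatch is caught at the second step, before the full move has been attempted in one go.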

A Hierarchy of Software Checking

(Figure: four levels of checking, listed here from top to bottom. An error "not detected" at one level passes on to the next level; an error detected at no level ends in "Fail".)

- Observe system externally to provide independent view.
  - Use additional hardware or completely separate hardware.
  - Often observe both controlled system and controller.
- Independent monitoring by process separate from that being checked. May check:
  - data being passed between modules
  - consistency of global data structures
  - expected timing of modules or processes
- Can detect coding errors and implementation errors.
  - Use assertions: statements (boolean expressions on system state) about expected state of module at different points in execution or about expected value of parameters passed to module.
  - e.g., range checks, state checks, reasonableness checks
- Used to detect hardware failures and individual instruction errors.
  - e.g., memory protection violation, divide by zero
  - Checksums
  - Often built into hardware or checks included in operating system.

Software Monitoring (Checking)
- In general, the farther down the hierarchy a check can be made, the better:
  - Detect the error closer to the time it occurred and before erroneous data is used.
  - Easier to isolate and diagnose the problem
  - More likely to be able to fix erroneous state rather than recover to safe state.
- Writing effective self-checks is very hard, and their number is usually limited by time and memory.
  - Limit to safety-critical states
  - Use hazard analysis to determine check contents and location
- Added monitoring and checks can cause failures themselves.
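
The three kinds of assertion named in the hierarchy (range, state, and reasonableness checks) can be sketched as below. The limits, the mode names, and the function `check_reading` are assumptions chosen for illustration. Explicit checks are used instead of Python's `assert` statement because `assert` is removed when the interpreter runs with `-O`.

```python
from typing import List

VALID_MODES = frozenset({"startup", "running", "shutdown"})  # hypothetical modes
TEMP_RANGE_C = (-50.0, 600.0)  # hypothetical physical range of the sensor
MAX_STEP_C = 25.0              # hypothetical largest plausible change per sample


def check_reading(prev_temp_c: float, temp_c: float, mode: str) -> List[str]:
    """Return the names of the checks that failed (empty list means all passed)."""
    failures: List[str] = []
    low, high = TEMP_RANGE_C
    if not (low <= temp_c <= high):
        failures.append("range")           # value outside what the sensor can report
    if mode not in VALID_MODES:
        failures.append("state")           # module is in a state it should never be in
    if abs(temp_c - prev_temp_c) > MAX_STEP_C:
        failures.append("reasonableness")  # physically implausible jump
    return failures
```

Returning the list of failed checks, rather than raising on the first one, keeps the diagnosis information that the slide says is easier to obtain when errors are caught close to where they occur.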

© Leveson

Barriers

LOCKOUTS
- Make access to dangerous state difficult or impossible.
- Implications for software:
  - Avoiding EMI
  - Authority limiting
  - Controlling access to and modification of critical variables
    - Can adapt some security techniques

LOCKIN
- Make it difficult or impossible to leave a safe state.
- Need to protect software against environmental conditions,
  - e.g., operator errors
  - data arriving in wrong order or at unexpected speed
- Completeness criteria ensure specified behavior is robust against mistaken environmental conditions.
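
Authority limiting and controlled modification of a critical variable, two of the software lockouts listed above, might be sketched as follows. The class `CriticalSetpoint`, its limits, and the `authorized` flag are hypothetical. Note that the leading underscore in Python is only a convention, so this shows the design intent rather than an enforced barrier.

```python
class CriticalSetpoint:
    """A critical variable that can only change through a guarded request."""

    def __init__(self, initial: float, low: float, high: float, max_step: float) -> None:
        if not low <= initial <= high:
            raise ValueError("initial value outside authority limits")
        self._value = initial
        self._low = low
        self._high = high
        self._max_step = max_step

    @property
    def value(self) -> float:
        # Read-only property: plain assignment to `.value` raises AttributeError.
        return self._value

    def request(self, new_value: float, authorized: bool) -> bool:
        """Apply the change only if every lockout condition is satisfied."""
        if not authorized:
            return False  # access control on the critical variable
        if not self._low <= new_value <= self._high:
            return False  # authority limit on the absolute value
        if abs(new_value - self._value) > self._max_step:
            return False  # authority limit on the size of one change
        self._value = new_value
        return True
```

A rejected request leaves the setpoint unchanged, so an operator error or an out-of-sequence command cannot by itself move the system toward a dangerous state.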