Guy, C.G. "Computer Reliability." The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
98 Computer Reliability

Chris G. Guy
University of Reading

98.1 Introduction
98.2 Definitions of Failure, Fault, and Error
98.3 Failure Rate and Reliability
98.4 Relationship Between Reliability and Failure Rate
98.5 Mean Time to Failure
98.6 Mean Time to Repair
98.7 Mean Time Between Failures
98.8 Availability
98.9 Calculation of Computer System Reliability
98.10 Markov Modeling
98.11 Software Reliability
98.12 Reliability Calculations for Real Systems

98.1 Introduction

This chapter outlines the knowledge needed to estimate the reliability of any electronic system or subsystem within a computer. The word estimate was used in the first sentence to emphasize that the following calculations, even if carried out perfectly correctly, can provide no guarantee that a particular example of a piece of electronic equipment will work for any length of time. However, they can provide a reasonable guide to the probability that something will function as expected over a given time period.

The first step in estimating the reliability of a computer system is to determine the likelihood of failure of each of the individual components, such as resistors, capacitors, integrated circuits, and connectors, that make up the system. This information can then be used in a full system analysis.

98.2 Definitions of Failure, Fault, and Error

A failure occurs when a system or component does not perform as expected. Examples of failures at the component level could be a base-emitter short in a transistor somewhere within a large integrated circuit or a solder joint going open circuit because of vibrations. If a component experiences a failure, it may cause a fault, leading to an error, which may lead to a system failure. A fault may be either the outward manifestation of a component failure or a design fault.

Component failure may be caused by internal physical phenomena or by external environmental effects such as electromagnetic fields or power supply variations. Design faults may be divided into two classes. The first class of design fault is caused by using components outside their rated specification. It should be possible to eliminate this class of faults by careful design checking. The second class, which is characteristic of large digital circuits such as those found in computer systems, is caused by the designer not taking into account every logical condition that could occur during system operation. All computer systems have a software component as an integral part of their operation, and software is especially prone to this kind of design fault.
A fault may be permanent or transitory. Examples of permanent faults are short or open circuits within a component caused by physical failures. Transitory faults can be subdivided further into two classes. The first, usually called transient faults, are caused by such things as alpha-particle radiation or power supply variations. Large random access memory circuits are particularly prone to this kind of fault. By definition, a transient fault is not caused by physical damage to the hardware. The second class is usually called intermittent faults. These faults are temporary but reoccur in an unpredictable manner. They are caused by loose physical connections between components or by components used at the limits of their specification. Intermittent faults often become permanent faults after a period of time.

A fault may be active or inactive. For example, if a fault causes the output of a digital component to be stuck at logic 1, and the desired output is logic 1, then this would be classed as an inactive fault. Once the desired output becomes logic 0, then the fault becomes active.

The consequence for the system operation of a fault is an error. As the error may be caused by a permanent or by a transitory fault, it may be classed as a hard error or a soft error. An error in an individual subsystem may be due to a fault in that subsystem or to the propagation of an error from another part of the overall system. The terms fault and error are sometimes interchanged. The term failure is often used to mean anything covered by these definitions. The definitions given here are those in most common usage.

Physical faults within a component can be characterized by their external electrical effects. These effects are commonly classified into fault models. The intention of any fault model is to take into account every possible failure mechanism, so that the effects on the system can be worked out. The manifestation of faults in a system can be classified according to the likely effects, producing an error model. The purpose of error models is to try to establish what kinds of corrective action need be taken in order to effect repairs.

98.3 Failure Rate and Reliability

An individual component may fail after a random time, so it is impossible to predict any pattern of failure from one example. It is possible, however, to estimate the rate at which members of a group of identical components will fail. This rate can be determined by experimental means using accelerated life tests. In a normal operating environment, the time for a statistically significant number of failures to have occurred in a group of modern digital components could be tens or even hundreds of years. Consequently, the manufacturers must make the environment for the tests extremely unfavorable in order to produce failures in a few hours or days and then extrapolate back to produce the likely number of failures in a normal environment. The failure rate is then defined as the number of failures per unit time, in a given environment, compared with the number of surviving components. It is usually expressed as a number of failures per million hours.

If f(t) is the number of components that have failed up to time t, and s(t) is the number of components that have survived, then z(t), the failure rate or hazard rate, is defined as

$z(t) = \frac{1}{s(t)} \cdot \frac{df(t)}{dt}$   (98.1)

Most electronic components will exhibit a variation of failure rate with time. Many studies have shown that this variation can often be approximated to the pattern shown in Fig. 98.1.

FIGURE 98.1 Variation of failure rate with time.
For obvious reasons this is known as a bathtub curve. The first phase, where the failure rate starts high but is decreasing with time, is where the components are suffering infant mortality; in other words, those that had manufacturing defects are failing. This is often called the burn-in phase. The second part, where the failure rate is roughly constant, is the useful life period of operation for the component. The final part, where the failure rate is increasing with time, is where the components are starting to wear out. Using the same nomenclature as before, if: (98.2) z t s t df t d t ( ) ( ) ( ) ( ) = × 1 s t( ) + = f (t) N
98.4 Relationship Between Reliability and Failure Rate

Using Eqs. (98.1), (98.2), and (98.3), then

$z(t) = -\frac{N}{s(t)} \cdot \frac{dr(t)}{dt}$   (98.4)

λ is commonly used as the symbol for the failure rate z(t) in the period where it is a constant, i.e., the useful life of the component. Consequently, we may write Eq. (98.4) as

$\lambda = -\frac{1}{r(t)} \cdot \frac{dr(t)}{dt}$   (98.5)

Rewriting, integrating, and using the limits of integration as r(t) = 1 at t = 0 and r(t) = 0 at t = ∞ gives the result:

$r(t) = e^{-\lambda t}$   (98.6)

This result is true only for the period of operation where the failure rate is a constant. For most common components, real failure rates can be obtained from such handbooks as the American military MIL-HDBK-217E, as explained in Section 98.12.

It must also be borne in mind that the calculated reliability is a probability function based on lifetime tests. There can be no guarantee that any batch of components will exhibit the same failure rate and hence reliability as those predicted because of variations in manufacturing conditions. Even if the components were made at the same factory as those tested, the process used might have been slightly different and the equipment will be older. Quality assurance standards are imposed on companies to try to guarantee that they meet minimum manufacturing standards, but some cases in the United States have shown that even the largest plants can fall short of these standards.
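As a numerical illustration of Eq. (98.6), the following Python fragment evaluates the reliability of a single component over its useful life; the failure rate used is an arbitrary assumed value, not a figure from any handbook.

```python
import math

# Reliability during the constant-failure-rate (useful life) period, Eq. (98.6):
#   r(t) = exp(-lambda * t)
# The failure rate is an assumed value of 2 failures per 10^6 hours.

failure_rate = 2.0e-6                      # lambda, failures per hour
for hours in (1_000, 10_000, 100_000):
    r = math.exp(-failure_rate * hours)
    print(f"r({hours:>7,} h) = {r:.4f}")   # 0.9980, 0.9802, 0.8187
```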
98.5 Mean Time to Failure

A figure that is commonly quoted because it gives a readier feel for the system performance is the mean time to failure or MTTF. This is defined as

$\mathrm{MTTF} = \int_0^{\infty} r(t)\,dt$   (98.7)

Hence, for the period where the failure rate is constant:

$\mathrm{MTTF} = \frac{1}{\lambda}$   (98.8)

98.6 Mean Time to Repair

For many computer systems it is possible to define a mean time to repair (MTTR). This will be a function of a number of things, including the time taken to detect the failure, the time taken to isolate and replace the faulty component, and the time taken to verify that the system is operating correctly again. While the MTTF is a function of the system design and the operating environment, the MTTR is often a function of unpredictable human factors and, hence, is difficult to quantify.

Figures used for MTTR for a given system in a fixed situation could be predictions based on the experience of the reliability engineers or could be simply the maximum response time given in the maintenance contract for a computer. In either case, MTTR predictions may be subject to some fluctuations. To take an extreme example, if the service engineer has a flat tire while on the way to effect the repair, then the repair time may be many times the predicted MTTR. For some systems no MTTR can be predicted, as they are in situations that make repair impossible or uneconomic. Computers in satellites are a good example. In these cases and all others where no errors in the output can be allowed, fault tolerant approaches must be used in order to extend the MTTF beyond the desired system operational lifetime.

98.7 Mean Time Between Failures

For systems where repair is possible, a figure for the expected time between failures can be defined as

MTBF = MTTF + MTTR   (98.9)

The definitions given for MTTF and MTBF are the most commonly accepted ones. In some texts, MTBF is wrongly used as mean time before failure, confusing it with MTTF. In many real systems, MTTF is very much greater than MTTR, so the values of MTTF and MTBF will be almost identical, in any case.
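A brief worked sketch of Eqs. (98.8) and (98.9) in Python; the failure rate and repair time are illustrative assumptions rather than measured figures.

```python
# MTTF, MTTR and MTBF for a repairable system, Eqs. (98.8) and (98.9).
# Both input figures are assumed values chosen for illustration.

failure_rate = 5.0e-6              # lambda, failures per hour
mttf = 1.0 / failure_rate          # Eq. (98.8): 200,000 hours
mttr = 8.0                         # assumed mean repair time, hours
mtbf = mttf + mttr                 # Eq. (98.9)

print(f"MTTF = {mttf:,.0f} h, MTTR = {mttr:.0f} h, MTBF = {mtbf:,.0f} h")
# Because MTTF >> MTTR, MTBF is almost identical to MTTF, as noted in the text.
```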
98.8 Availability

Availability is defined as the probability that the system will be functioning at a given time during its normal working period.

$\mathrm{Av} = \frac{\text{total working time}}{\text{total time}}$   (98.10)

This can also be written as

$\mathrm{Av} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$   (98.11)

Some systems are designed for extremely high availability. For example, the computers used by AT&T to control its telephone exchanges are designed for an availability of 0.9999999, which corresponds to an unplanned downtime of 2 min in 40 years. In order to achieve this level of availability, fault tolerant techniques have to be used from the design stage, accompanied by a high level of monitoring and maintenance.

98.9 Calculation of Computer System Reliability

For systems that have not been designed to be fault tolerant it is common to assume that the failure of any component implies the failure of the system. Thus, the system failure rate can be determined by the so-called parts count method. If the system contains m types of component, each with a failure rate λ_m, then the system failure rate λ_s can be defined as

$\lambda_s = \sum_{m} N_m \lambda_m$   (98.12)

where N_m is the number of each type of component. The system reliability will be

$r_s(t) = \prod_{m} \left[ r_m(t) \right]^{N_m}$   (98.13)

If the system design is such that the failure of an individual component does not necessarily cause system failure, then the calculations of MTTF and r_s(t) become more complicated. Consider two situations where a computer system is made up of several subsystems. These may be individual components or groups of components, e.g., circuit boards.

The first is where failure of an individual subsystem implies system failure. This is known as the series model and is shown in Fig. 98.2. This is the same case as considered previously, and the parts count method, Eqs. (98.12) and (98.13), can be used.

FIGURE 98.2 Series model.
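A minimal sketch of the parts count method of Eq. (98.12), combined with the MTTF of Eq. (98.8) and the availability of Eq. (98.11); the component counts, failure rates, and repair time below are invented for illustration.

```python
import math

# Parts count method, Eq. (98.12), with hypothetical per-part failure rates
# quoted in failures per 10^6 hours, followed by MTTF (98.8) and availability (98.11).

parts = {                       # type: (count N_m, failure rate lambda_m per 10^6 h)
    "microprocessor": (1, 1.32),
    "DRAM":           (8, 0.90),
    "resistor":       (40, 0.002),
    "connector":      (4, 0.10),
}

lam_sys = sum(n * lam for n, lam in parts.values()) / 1e6   # Eq. (98.12), per hour
mttf = 1.0 / lam_sys                                        # Eq. (98.8), hours
mttr = 24.0                                                 # assumed repair time, hours
availability = mttf / (mttf + mttr)                         # Eq. (98.11)
r_mission = math.exp(-lam_sys * 10_000)                     # series-system reliability over 10,000 h

print(f"system failure rate = {lam_sys * 1e6:.2f} per 10^6 h")
print(f"MTTF = {mttf:,.0f} h, availability = {availability:.6f}")
print(f"r(10,000 h) = {r_mission:.3f}")
```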
The second case is where failure of an individual subsystem does not imply system failure. This is shown in Fig. 98.3. Only the failure of every subsystem means that the system has failed, and the system reliability can be evaluated by the following method. If r(t) is the reliability (or probability of not failing) of each subsystem, then q(t) = 1 - r(t) is the probability of an individual subsystem failing. Hence, the probability of them all failing is

$q_s(t) = [1 - r(t)]^n$   (98.14)

for n subsystems. Hence the system reliability will be:

$r_s(t) = 1 - [1 - r(t)]^n$   (98.15)

FIGURE 98.3 Parallel model.

In practice, systems will be made up of differing combinations of parallel and series networks; the simplest examples are shown in Figs. 98.4 and 98.5.

FIGURE 98.4 Parallel-series model.
FIGURE 98.5 Series-parallel model.
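Before turning to the combined structures, the effect of simple parallel redundancy in Eq. (98.15) is easy to tabulate; the per-unit reliability below is an arbitrary assumed value.

```python
# Reliability of n identical units in parallel, Eqs. (98.14)-(98.15):
#   q_s(t) = (1 - r)^n,   r_s(t) = 1 - (1 - r)^n
# The single-unit reliability r = 0.90 is an illustrative assumption.

r = 0.90
for n in (1, 2, 3, 4):
    r_parallel = 1.0 - (1.0 - r) ** n
    print(f"n = {n}: system reliability = {r_parallel:.4f}")
# Each additional redundant unit multiplies the probability of total failure by (1 - r).
```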
Parallel-Series System

Assuming that the reliability of each subsystem is identical, then the overall reliability can be calculated thus. The reliability of one unit is r; hence the reliability of the series path is r^n. The probability of failure of each path is then q = 1 - r^n. Hence, the probability of failure of all m paths is (1 - r^n)^m, and the reliability of the complete system is

$r_{ps} = 1 - (1 - r^n)^m$   (98.16)

Series-Parallel System

Making similar assumptions, and using a similar method, the reliability can be written as

$r_{sp} = [1 - (1 - r)^n]^m$   (98.17)

It is straightforward to extend these results to systems with subsystems having different reliabilities and in different combinations. It can be seen that these simple models could be used as the basis for a fault tolerant system, i.e., one that is able to carry on performing its designated function even while some of its parts have failed.

Practical Systems Using Parallel Sub-Systems

A computer system that uses parallel sub-systems to improve reliability must incorporate some kind of arbitrator to determine which output to use at any given time. A common method of arbitration involves adding a voter to a system with N parallel modules, where N is an odd number. For example, if N = 3, a single incorrect output can be masked by the two correct outputs outvoting it. Hence, the system output will be correct, even though an error has occurred in one of the sub-systems. This system would be known as Triple-Modular-Redundant (TMR) (Fig. 98.6).

FIGURE 98.6 Triple-modular-redundant system.

The reliability of a TMR system is the probability that any two out of the three units will be working. This can be expressed as

$r_{tmr} = r_1 r_2 r_3 + r_1 r_2 (1 - r_3) + r_1 (1 - r_2) r_3 + (1 - r_1) r_2 r_3$

where r_n (n = 1, 2, 3) is the reliability of each subsystem. If r_1 = r_2 = r_3 = r this reduces to

$r_{tmr} = 3r^2 - 2r^3$

The reliability of the voter must be included when calculating the overall reliability of such a system. As the voter appears in every path from input to output, it can be included as a series element in a series-parallel model. This leads to

$r_{tmr} = r_v \left[ 3r^2 - 2r^3 \right]$   (98.18)

where r_v is the reliability of the voter. More information on methods of using redundancy to improve system reliability can be found in Chapter 93.
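The structures described in this section can be compared numerically. The Python sketch below implements Eqs. (98.16), (98.17), and (98.18) for identical units; the values chosen for r, n, m, and the voter reliability are illustrative assumptions.

```python
# Redundant-structure reliabilities for identical units of reliability r.
# All numerical values below are illustrative assumptions.

def parallel_series(r: float, n: int, m: int) -> float:
    """m parallel paths, each a series chain of n units, Eq. (98.16)."""
    return 1.0 - (1.0 - r**n) ** m

def series_parallel(r: float, n: int, m: int) -> float:
    """Series chain of m stages, each of n parallel units, Eq. (98.17)."""
    return (1.0 - (1.0 - r) ** n) ** m

def tmr(r: float, r_voter: float = 1.0) -> float:
    """Triple-modular-redundant system with voter, Eq. (98.18)."""
    return r_voter * (3 * r**2 - 2 * r**3)

r = 0.95
print(f"parallel-series (n=2, m=2): {parallel_series(r, 2, 2):.4f}")
print(f"series-parallel (n=2, m=2): {series_parallel(r, 2, 2):.4f}")
print(f"TMR, perfect voter:         {tmr(r):.4f}")
print(f"TMR, voter r_v = 0.999:     {tmr(r, 0.999):.4f}")
```

With these figures the series-parallel arrangement comes out slightly ahead of the parallel-series one, and an imperfect voter visibly erodes the TMR gain.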
98.10 Markov Modeling

Another approach to determining the probability of system failure is to use a Markov model of the system, rather than the combinatorial methods outlined previously. Markov models involve the defining of system states and state transitions. The mathematics of Markov modeling are well beyond the scope of this brief introduction, but most engineering mathematics textbooks will cover the technique.

To model the reliability of any system it is necessary to define the various fault-free and faulty states that could exist. For example, a system consisting of two identical units (A and B), either of which has to work for the system to work, would have four possible states. They would be (1) A and B working; (2) A working, B failed; (3) B working, A failed; and (4) A and B failed. The system designer must assign to each state a series of probabilities that determine whether it will remain in the same state or change to another after a given time period. This is usually shown in a state diagram, as in Fig. 98.7. This model does not allow for the possibility of repair, but this could easily be added.

FIGURE 98.7 State diagram for two-unit parallel system.

98.11 Software Reliability

One of the major components in any computer system is its software. Although software is unlikely to wear out in a physical sense, it is still impossible to prove that anything other than the simplest of programs is totally free from bugs. Hence, any piece of software will follow the first and second parts of the normal bathtub curve (Fig. 98.1). The burn-in phase for hardware corresponds to the early release of a complex program, where bugs are commonly found and have to be fixed. The useful life phase for hardware corresponds to the time when the software can be described as stable, even though bugs may still be found. In this phase, where the failure rate can be characterized as constant (even if it is very low), the hardware performance criteria, such as MTTF and MTTR, can be estimated. They must be included in any estimation of the overall availability for the computer system as a whole. Just as with hardware, techniques using redundancy can be used to improve the availability through fault tolerance.
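As a numerical companion to the two-unit example of Section 98.10, the sketch below steps a discrete-time Markov model of the four states in Fig. 98.7 (no repair); the per-step failure probability is an illustrative assumption, not a value taken from the text.

```python
import numpy as np

# Discrete-time Markov model of the two-unit parallel system of Fig. 98.7.
# States: (1) A and B working, (2) A working/B failed, (3) B working/A failed,
# (4) both failed. p is the assumed probability that one unit fails in a step;
# no repair transitions are modeled, as in the figure.

p = 0.01
P = np.array([
    [(1 - p) ** 2, p * (1 - p), (1 - p) * p, p * p],   # from state 1
    [0.0,          1 - p,       0.0,         p    ],   # from state 2
    [0.0,          0.0,         1 - p,       p    ],   # from state 3
    [0.0,          0.0,         0.0,         1.0  ],   # from state 4 (absorbing)
])

state = np.array([1.0, 0.0, 0.0, 0.0])                 # start with both units working
for steps in (10, 50, 100):
    probs = state @ np.linalg.matrix_power(P, steps)
    print(f"after {steps:3d} steps: P(system working) = {1 - probs[3]:.4f}")
```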
98.12 Reliability Calculations for Real Systems

The most common source of basic reliability data for electronic components and circuits is the military handbook Reliability Prediction of Electronic Equipment, published by the U.S. Department of Defense. It has the designation MIL-HDBK-217E in its most recent version. This handbook provides both the basic reliability data and the formulae to modify those data for the application of interest. For example, the formula for predicting the failure rate, λ_p, of a bipolar or MOS microprocessor is given as

$\lambda_p = \pi_Q (C_1 \pi_T \pi_V + C_2 \pi_E) \pi_L$ failures per 10^6 hours

where π_Q is the part quality factor, with several categories, ranging from a full mil-spec part to a commercial part; π_T is the temperature acceleration factor, related to both the technology in use and the actual operating temperature; π_V is the voltage stress derating factor, which is higher for devices operating at higher voltages; π_E is the application environment factor (the handbook gives figures for many categories of environment, ranging from laboratory conditions up to the conditions found in the nose cone of a missile in flight); π_L is the device learning factor, related to how mature the technology is and how long the production of the part has been going on; C_1 is the circuit complexity factor, dependent on the number of transistors on the chip; and C_2 is the package complexity, related to the number of pins and the type of package.

The following figures are given for a 16-bit microprocessor, operating on the ground in a laboratory environment, with a junction temperature of 51°C. The device is assumed to be packaged in a plastic, 64-pin dual in-line package and to have been manufactured using the same technology for several years:

π_Q = 20, π_T = 0.89, π_V = 1, π_E = 0.38, π_L = 1, C_1 = 0.06, C_2 = 0.033

Hence, the failure rate λ_p for this device, operating in the specified environment, is estimated to be 1.32 failures per 10^6 hours. To calculate the predicted failure rate for a system based around this microprocessor would involve similar calculations for all the parts, including the passive components, the PCB, and connectors, and summing all the resultant failure rates. The resulting figure could then be inverted to give a predicted MTTF.

This kind of calculation is repetitive, tedious, and therefore prone to errors, so many companies now provide software to perform the calculations. The official Department of Defense program for automating the calculation of reliability figures is called ORACLE. It is regularly updated to include all the changes since MIL-HDBK-217E was released. Versions for VAX/VMS and the IBM PC are available from the Rome Air Defense Center, RBET, Griffiss Air Force Base, NY 13441-5700. Other software to perform the same function is advertised in the publications listed under Further Information.

Defining Terms

Availability: This figure gives a prediction for the proportion of time that a given part or system will be in full working order. It can be calculated from

$\mathrm{Av} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$

Failure rate: The failure rate, λ, is the (predicted or measured) number of failures per unit time for a specified part or system operating in a given environment. It is usually assumed to be constant during the working life of a component or system.

Mean time to failure: This figure is used to give an expected working lifetime for a given part, in a given environment. It is defined by the equation

$\mathrm{MTTF} = \int_0^{\infty} r(t)\,dt$
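Returning to the 16-bit microprocessor example of Section 98.12, the quoted figure of 1.32 failures per 10^6 hours can be reproduced in a few lines of Python; the factor values are those listed in the text, and the MTTF conversion is simply Eq. (98.8).

```python
# MIL-HDBK-217E style prediction for the 16-bit microprocessor example:
#   lambda_p = pi_Q * (C1*pi_T*pi_V + C2*pi_E) * pi_L   (failures per 10^6 hours)
# Factor values are those quoted in Section 98.12.

pi_Q, pi_T, pi_V, pi_E, pi_L = 20, 0.89, 1, 0.38, 1
C1, C2 = 0.06, 0.033

lambda_p = pi_Q * (C1 * pi_T * pi_V + C2 * pi_E) * pi_L
mttf_hours = 1e6 / lambda_p                  # Eq. (98.8), converting the per-10^6-hour rate

print(f"lambda_p = {lambda_p:.2f} failures per 10^6 hours")   # approx. 1.32
print(f"predicted MTTF = {mttf_hours:,.0f} hours")            # approx. 758,000 hours
```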