The concept of N-version programming was developed to allow certain design flaws in software modules to be tolerated [Chen and Avizienis, 1978]. The basic concept of N-version programming is to design and code the software module N times and to vote on the N results produced by these modules. Each of the N modules is designed and coded by a separate group of programmers. Each group designs the software from the same set of specifications, such that each of the N modules performs the same function. However, it is hoped that by performing the N designs independently, the same mistakes will not be made by the different groups. Therefore, when a fault occurs, the fault will either not occur in all modules or it will occur differently in each module, so that the results generated by the modules will differ. Assuming that the faults are independent, the approach can tolerate (N – 1)/2 faults, where N is odd.

The recovery block approach to software fault tolerance is analogous to the active approaches to hardware fault tolerance, specifically the cold standby sparing approach. N versions of a program are provided, and a single set of acceptance tests is used. One version of the program is designated as the primary version, and the remaining N – 1 versions are designated as spares, or secondary versions. The primary version of the software is always used unless it fails to pass the acceptance tests. If the acceptance tests are failed by the primary version, then the first secondary version is tried. This process continues until one version passes the acceptance tests or the system fails because none of the versions can pass the tests.
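To make the voting step concrete, the following minimal Python sketch runs N independently written versions and takes a strict majority over their results. The function name n_version_vote, the exact-match comparison of results, and the three toy versions are illustrative assumptions, not part of any published N-version framework.

    from collections import Counter
    from typing import Callable, Sequence

    def n_version_vote(versions: Sequence[Callable], *args):
        # Run every version on the same inputs; a real system would run
        # them in isolation (separate processes or machines).
        results = [version(*args) for version in versions]
        # Exact-match voting; results must be hashable and comparable.
        value, count = Counter(results).most_common(1)[0]
        if count >= len(versions) // 2 + 1:   # strict majority
            return value
        raise RuntimeError("no majority: versions disagree")

    # Three hypothetical teams implement x-squared from one specification;
    # the third team's version is faulty for negative inputs.
    v1 = lambda x: x * x
    v2 = lambda x: x ** 2
    v3 = lambda x: x * x if x >= 0 else -(x * x)
    print(n_version_vote([v1, v2, v3], -4))   # 16: the faulty version is outvoted

With N = 3, a single faulty version is masked, matching the (N – 1)/2 bound for odd N.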
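The recovery block structure can be sketched in the same spirit. The names recovery_block and acceptance_test are hypothetical, and the sketch omits the state rollback that a full recovery block performs before retrying with the next version.

    import math

    def recovery_block(primary, secondaries, acceptance_test, *args):
        # Try the primary first, then each secondary in order, exactly as
        # cold standby sparing switches in spares one at a time.
        for version in (primary, *secondaries):
            try:
                result = version(*args)
            except Exception:
                continue                  # a crashed version is a failed version
            if acceptance_test(result):
                return result             # first version to pass the tests wins
        raise RuntimeError("all versions failed the acceptance tests")

    # Hypothetical example: compute sqrt(9.0); a result r is accepted
    # when r * r is close to 9.0.
    faulty_primary = lambda x: x * 0.5    # wrong algorithm, fails the test
    backup = math.sqrt                    # secondary version
    accept = lambda r: abs(r * r - 9.0) < 1e-9
    print(recovery_block(faulty_primary, [backup], accept, 9.0))   # 3.0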
93.6 Dependability Evaluation

Dependability is defined as the quality of service provided by a system [Laprie, 1985]. Perhaps the most important measures of dependability are reliability and availability. Fundamental to reliability calculations is the concept of failure rate. Intuitively, the failure rate is the expected number of failures of a type of device or system over a given time period [Shooman, 1968]. The failure rate is typically denoted as λ when it is assumed to have a constant value.

To more clearly understand the mathematical basis for the concept of a failure rate, first consider the definition of the reliability function. The reliability R(t) of a component, or a system, is the conditional probability that the component operates correctly throughout the interval [t0, t], given that it was operating correctly at time t0.

There are a number of different ways in which the failure rate function can be expressed. For example, the failure rate function z(t) can be written strictly in terms of the reliability function R(t) as

z(t) = –[dR(t)/dt] / R(t)

Similarly, z(t) can be written in terms of the unreliability Q(t) as

z(t) = –[dR(t)/dt] / R(t) = [dQ(t)/dt] / [1 – Q(t)]

where Q(t) = 1 – R(t). The derivative of the unreliability, dQ(t)/dt, is called the failure density function. The failure rate function is clearly dependent upon time; however, experience has shown that the failure rate function for electronic components does have a period where the value of z(t) is approximately constant.
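A useful consequence of these relations is that a constant failure rate corresponds exactly to an exponential reliability function: if R(t) = e^(–λt), then dR(t)/dt = –λe^(–λt) and z(t) = λ for all t. The short sketch below checks this numerically; the value of λ and the differencing step are arbitrary assumptions.

    import math

    lam = 2e-5                    # assumed constant failure rate, failures/hour

    def R(t):                     # exponential reliability, R(t) = exp(-lam * t)
        return math.exp(-lam * t)

    def z(t, h=1e-3):
        # z(t) = -(dR(t)/dt) / R(t), with the derivative approximated
        # by a central difference.
        dR_dt = (R(t + h) - R(t - h)) / (2.0 * h)
        return -dR_dt / R(t)

    for t in (0.0, 1e3, 1e5):
        print(f"t = {t:>8.0f} h   z(t) = {z(t):.6e}")
    # Every line prints ~2.000000e-05: z(t) stays equal to lam.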
The commonly accepted relationship between the failure rate function and time for electronic components is called the bathtub curve and is illustrated in Fig. 93.6. The bathtub curve assumes that during the early life of systems, failures occur frequently due to substandard or weak components. The decreasing part of the bathtub curve is called the early-life or infant mortality region. At the opposite end of the curve is the wear-out region, where systems have been functional for a long period of time and are beginning to experience failures due to the physical wearing of electronic or mechanical components. During the intermediate region, the failure rate function is assumed to be constant. The constant portion of the bathtub curve is called the useful-life phase.
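Purely as an illustration of the bathtub shape, the sketch below composes a decaying infant-mortality term, a constant useful-life rate, and a wear-out term that grows late in life; the functional forms and every parameter value are assumed, not fitted to measured component data.

    import math

    def bathtub_z(t, t_burn_in=1_000.0, t_wear_out=100_000.0, z_useful=2e-5):
        infant = 1e-4 * math.exp(-5.0 * t / t_burn_in)           # early-life region
        wear = 1e-4 * max(0.0, (t - t_wear_out) / t_wear_out)    # wear-out region
        return z_useful + infant + wear                          # useful-life floor

    # High at t = 0, flat near z_useful through mid-life, rising again late:
    for t in (0.0, 5_000.0, 50_000.0, 200_000.0):
        print(f"t = {t:>9.0f} h   z(t) = {bathtub_z(t):.2e}")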