CMSC 5719 MSc Seminar
Fault-Tolerant Computing
XU, Qiang (Johnny) 徐強
[Partly adapted from Koren & Krishna, and B. Parhami slides]
Part 1, Qiang Xu, CUHK, Fall 2012

Why Learn This Stuff?

Outline
◆ Motivation
◆ Fault classification
◆ Redundancy
◆ Metrics for Reliability
◆ Case studies

Fault-Tolerance: Basic Definition
◆ Fault-tolerant systems: ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors
◆ In practice, we can never guarantee the flawless execution of tasks under any circumstances
◆ Limit ourselves to the types of failures and errors which are more likely to occur

Need For Fault-Tolerance
◆ Critical applications require extreme fault tolerance (e.g., aircraft, nuclear reactors, medical equipment, and financial applications)
  - A malfunction of a computer in such applications can lead to catastrophe
  - Their probability of failure must be extremely low, possibly one in a billion per hour of operation
◆ Systems operating in a harsh environment with high failure possibilities
  - Electromagnetic disturbances, particle hits, and the like
◆ Complex systems consisting of millions of devices

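To give a feel for "one in a billion per hour of operation", the sketch below assumes a constant failure rate (an assumption, not something the slide states), so that the probability of at least one failure within t hours is 1 - exp(-λt). It also assumes a system with no redundancy that fails as soon as any one of its independent devices fails, so device failure rates simply add up. The 10-hour mission and the device count are illustrative values only.

```python
import math


def failure_probability(rate_per_hour: float, hours: float) -> float:
    """P(at least one failure within `hours`), assuming a constant failure rate."""
    # -expm1(-x) evaluates 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-rate_per_hour * hours)


def series_failure_rate(device_rate_per_hour: float, n_devices: int) -> float:
    """Failure rate of a system that fails when any one of n independent,
    identical devices fails (no redundancy)."""
    return device_rate_per_hour * n_devices


LAMBDA = 1e-9  # one failure per billion hours of operation

# Hypothetical 10-hour mission for a single unit.
p_mission = failure_probability(LAMBDA, 10)
print(f"P(failure during mission) = {p_mission:.3e}")

# Hypothetical system built from one million such devices, no redundancy.
system_rate = series_failure_rate(LAMBDA, 1_000_000)
print(f"System failure rate       = {system_rate:.3e} per hour")
print(f"P(system failure, 10 h)   = {failure_probability(system_rate, 10):.3e}")
```

Under these assumptions, a target that looks very conservative for one device erodes quickly once millions of devices must all work, which is one reason redundancy is needed.
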
Get to Know the Enemy: What Causes Faults?
◆ Manufacturing defects
◆ Aging (a.k.a. circuit wearout)

Get to Know the Enemy: What Causes Faults?
◆ Internal electronic noise
  - e.g., power supply noise: ΔV = IR + L·ΔI/Δt
◆ Electromagnetic interference

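The power supply noise relation on this slide, ΔV = IR + L·ΔI/Δt, can be tried out with a few numbers. The sketch below is only an illustration: the function name and every component value are hypothetical and were picked to show how the inductive term can dominate when the current changes quickly.

```python
import math


def supply_noise(current_a: float, resistance_ohm: float,
                 inductance_h: float, delta_i_a: float, delta_t_s: float) -> float:
    """Supply voltage disturbance: resistive drop (I*R) plus inductive term (L * dI/dt)."""
    ir_drop = current_a * resistance_ohm
    inductive = inductance_h * (delta_i_a / delta_t_s)
    return ir_drop + inductive


# Hypothetical values: 1 A through 10 mOhm, and a 0.5 A current step
# in 1 ns across 1 nH of supply inductance.
dv = supply_noise(current_a=1.0, resistance_ohm=0.01,
                  inductance_h=1e-9, delta_i_a=0.5, delta_t_s=1e-9)
print(f"Supply noise = {dv:.3f} V")
```

With these made-up values the IR term contributes 0.01 V and the L·ΔI/Δt term about 0.5 V.
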
Get to Know the Enemy: What Causes Faults?
◆ Bugs …
  - Example: Pentium FDIV error
  - Example: "US software 'blew up Russian gas pipeline'" (ZDNet UK news, published 01 Mar 2004): faulty US software was to blame for one of the biggest non-nuclear explosions the world has ever seen, which took place in a Siberian natural gas pipeline, according to a new book
◆ Malicious attack (beyond the scope)

Fault Classification according to Duration
◆ Permanent faults: never go away; the component has to be repaired or replaced
◆ Transient faults: disappear after a relatively short time
  - Example: a memory cell whose contents are changed due to some electromagnetic interference
  - Overwriting the memory cell with the right content will make the fault go away
◆ Intermittent faults: cycle between active and benign states
  - Example: a loose connection
  - An increasing threat, largely due to temperature and voltage fluctuations

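The three fault durations can be contrasted with a toy one-bit memory cell. This is a minimal sketch: the `MemoryCell` class, its `stuck_at` parameter, and the `fault_active` flag are hypothetical names introduced here for illustration, and modelling a hard fault as a stuck-at value is an assumption rather than something the slide specifies.

```python
from typing import Optional


class MemoryCell:
    """Toy one-bit memory cell used to contrast fault durations."""

    def __init__(self, value: int, stuck_at: Optional[int] = None):
        self._value = value
        # Value forced onto the output while a hard fault is active (assumed model).
        self.stuck_at = stuck_at
        # Permanent fault: stays True. Intermittent fault: toggles over time.
        self.fault_active = stuck_at is not None

    def write(self, value: int) -> None:
        self._value = value

    def read(self) -> int:
        if self.stuck_at is not None and self.fault_active:
            return self.stuck_at
        return self._value

    def upset(self) -> None:
        """Transient fault: flip the stored bit once (e.g., due to interference)."""
        self._value ^= 1


# Transient: the content is corrupted, but overwriting restores it.
cell = MemoryCell(1)
cell.upset()
corrupted = cell.read()
cell.write(1)
restored = cell.read()

# Permanent: rewriting does not help; the component must be repaired or replaced.
broken = MemoryCell(1, stuck_at=0)
broken.write(1)
still_wrong = broken.read()

# Intermittent: the same cell alternates between benign and active states.
flaky = MemoryCell(1, stuck_at=0)
flaky.fault_active = False
benign_read = flaky.read()
flaky.fault_active = True
active_read = flaky.read()
```

The transient case is the only one in which a plain rewrite clears the error, which matches the memory-cell example on the slide.
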
Failures during Lifetime
[Figure: observed failure rate vs. time, in three regions: decreasing failure rate (early "infant mortality" failures), constant failure rate (random failures), and increasing failure rate (wear-out failures)]
◆ Three phases of system lifetime
  - Infant mortality (imperfect test, weak components)
  - Normal lifetime (transient/intermittent faults)
  - Wear-out period (circuit aging)

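The decreasing, constant, and increasing failure-rate phases are often illustrated with a Weibull hazard function, h(t) = (k/η)(t/η)^(k-1). The slide does not name a distribution, so the Weibull model and the shape values below are assumptions chosen for illustration: a shape k below 1 gives a falling rate, k equal to 1 a constant rate, and k above 1 a rising rate.

```python
def weibull_hazard(t: float, shape: float, scale: float = 1.0) -> float:
    """Instantaneous failure rate h(t) of a Weibull distribution, for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative shape parameters for the three lifetime phases.
PHASES = {
    "infant mortality (decreasing rate)": 0.5,
    "normal lifetime (constant rate)": 1.0,
    "wear-out (increasing rate)": 3.0,
}

for name, shape in PHASES.items():
    early = weibull_hazard(1.0, shape)
    late = weibull_hazard(2.0, shape)
    if late < early:
        trend = "falls"
    elif late > early:
        trend = "rises"
    else:
        trend = "stays flat"
    print(f"{name}: failure rate {trend} between t=1 and t=2")
```

A single Weibull curve covers only one phase at a time; reproducing the full curve would mean combining all three, which this sketch does not attempt.
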