What is System Hang and How to Handle it

Yian Zhu1, Yue Li2, Jingling Xue2, Tian Tan3, Jialong Shi1, Yang Shen3, Chunyan Ma3

1 School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China
2 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
3 School of Software and Microelectronics, Northwestern Polytechnical University, Xi’an, P.R. China

{zhuya,machunyan}@nwpu.edu.cn {yueli,jingling}@cse.unsw.edu.au {silverbullettt,jialong.tea,yangfields}@gmail.com

Abstract

Almost every computer user has encountered an unresponsive system failure or system hang, which leaves the user no choice but to power off
the computer. In this paper, the causes of such failures are analyzed in detail and one empirical hypothesis for detecting system hang is proposed. This hypothesis exploits a small set of system performance metrics provided by the OS itself, thereby avoiding both modification of the OS kernel and additional cost (e.g., hardware modules). Under this hypothesis, we propose SHFH, a self-healing framework to handle system hang, which can be deployed on an OS dynamically. One unique feature of SHFH is that its “light-heavy” detection strategy is designed to make intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Another feature is that its diagnosis-based recovery strategy offers a finer granularity of recovery from system hang. Our experimental results show that SHFH can cover 95.34% of system hang scenarios, with a false positive rate of 0.58% and 0.6% performance overhead, validating the effectiveness of our empirical hypothesis.

Keywords-System Hang, Operating System, Self-Healing Framework, Fault Detection and Recovery

1. Introduction

Almost every computer user has encountered a scenario in which all windows displayed on a computer monitor become static and the whole computer system ceases to respond to user input. Sometimes even the mouse cursor does not move. “Unresponsiveness”, “freeze” and “hang” have been used to describe such a phenomenon, with “hang” being the most popular [1]–[4], [6], [7], [9], [12]. Note that a single-program unresponsive failure (i.e., one application failing to respond to user input) is regarded as application hang, which is not the focus of this paper.
Unlike other failures (e.g., invalid opcode and general protection fault) whose causes can be detected directly by hardware [13], system hang usually cannot be detected by hardware or even perceived by the operating system (OS), except for some severe cases detected only partially by watchdog mechanisms provided by some modern OSes. This leaves the user no choice but to power the system off. As a result, the OS fails to provide continuous services, causing the user to lose some valuable data. Worse still, if the computer system is deployed in a mission-critical application, e.g., a nuclear reactor, system hang may lead to devastating consequences.

By observing existing studies dealing with system hang, we draw two conclusions. First, most studies, although effective in certain cases, could only address certain system hang scenarios [1]–[5]. One main explanation for this is that it is difficult to analyze the causes of system hang, and accordingly, each study focuses on its own assumptions about those causes. As a result, it is necessary to study the causes of system hang more comprehensively.

Second, most methodologies for detecting system hang need additional assistance, provided by either new hardware modules [7], modified OS kernels [1], [5], or monitor breakpoints inserted dynamically into code regions of interest [4]. Can we rely on the existing services provided by the OS to detect system hang effectively? An attempt made in [2] does this by just monitoring I/O throughput, but it fails if a hang occurs within some OS code not related to I/O. The work of [8] is developed on the assumption that statistical models of processes, for such metrics as CPU and memory utilization, may reveal the slowness of the system (similar to system hang). However, since the causal relationship between the statistical models for processes and the slowness of the system has not been validated, the effectiveness of this assumption remains unclear.
As a result, whether or not existing OS services can be utilized to detect system hang becomes an attractive question, since an affirmative answer implies that no additional cost will be incurred.

The main contributions of this paper are as follows. We give a new characterization of system hang based on the two popular views about it (as described in Section 2.1). In addition, the causes of system hang are analyzed in detail from two aspects: indefinite wait for system resources (resources not released or released slowly) and infinite loop under interrupt and