preemption constraints. Accordingly, we present six types of faults responsible for system hang.

We propose a self-healing framework, referred to as SHFH, that handles system hang automatically and can be deployed dynamically on an OS (it is currently implemented on Linux). One unique feature is its “light-heavy” detection strategy, which makes intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Another feature is its diagnosis-based recovery strategy, which is designed to provide finer granularity for system hang recovery.

We selected UnixBench [22] as our benchmark suite and injected six types of faults into it to cause system hang among 9 benchmark workloads representing at least 95% of kernel usage [26]. By analyzing a total of 68 performance metrics (e.g., context switches per second and number of runnable tasks) provided by the OS itself across 1080 experiments under normal and anomalous workloads, and after further experimental validation using both UnixBench and LTP (Linux Test Project) [21], we find that 9 common performance metrics are sufficient as the basis to detect most system hang problems without requiring any additional assistance (e.g., new hardware modules or kernel modification).

The rest of this paper is organized as follows. Section 2 describes what system hang is and what causes it. Section 3 discusses whether empirical system performance metrics can be utilized to detect system hang. According to the hypothesis presented in Section 3, SHFH is proposed and described in detail in Section 4.
Section 5 evaluates SHFH and thereby validates the hypothesis made in Section 3. Section 6 discusses related work and Section 7 concludes the paper.

2. System Hang and Causes

There is no standard definition of system hang. In Section 2.1, we give a new characterization of system hang, based on the two existing views of it, as the foundation for our analysis. The causes of system hang are analyzed in detail in Section 2.2.

2.1. What is System Hang

There are two popular views. Studies [1], [3], [5], [7] describe system hang as a state in which the OS does not relinquish the processor and does not schedule any process, i.e., the system is in a total hang state that does not allow other tasks to execute or respond to any user input. On the other hand, studies [2], [4], [8], [9], [11] consider that the system enters a hang state when the OS gets partially or completely stalled and does not respond to user-space applications.

We prefer the second view because it covers a broader scope of hang scenarios, which accords with our daily human-computer interaction experience. Based on it, we give a new characterization of system hang below.

System hang is a fuzzy concept that depends on the criteria of the observer: the system gets partially or completely stalled, and most services become unresponsive or respond to user inputs with an obvious latency (an unacceptable length of time according to the observer).

2.2. Causes of System Hang

Tasks need to run effectively to provide services. In other words, if tasks cannot run, or run without doing useful work, users become aware that services are unavailable (unresponsive). Accordingly, whatever causes tasks to be unable to run (i.e., to wait for resources that will never be released) or to do useless work (i.e., to fall into an infinite loop) contributes to system hang.
Note that even when a task falls into an infinite loop, it can still be interrupted or preempted by other tasks. Moreover, some system hangs recover automatically after a period of time because the resources held by other tasks are released slowly. In this situation, if users do not have the patience to wait that long (until the resources are released), a system hang is considered to have occurred.

Consequently, we analyze the causes of system hang from two aspects: infinite loop under interrupt and preemption constraints, and indefinite wait for system resources (resources not released or released slowly). Accordingly, six types of faults are distinguished, as shown in Figure 1.

2.2.1. Infinite Loop

When interrupts are disabled (F1), even a clock interrupt cannot be serviced. As a result, if the running task does not relinquish the CPU on its own, i.e., falls into an infinite loop, other tasks have no chance to execute. With interrupts enabled but preemption disabled (F3), the CPU can respond to interrupts; however, even tasks with higher priority cannot be executed, which makes some services provided by the ready tasks unavailable. Even when both interrupts and preemption are enabled, a task that falls into an infinite loop in the kernel (F2) (certain OSes, e.g., Linux since version 2.6, support a kernel preemption mechanism) still cannot be preempted unless all the locks held by the task are released, the task is blocked, or it explicitly calls the schedule function; however, an infinite loop in the kernel offers little chance of satisfying these conditions, and thus gives the OS little opportunity to schedule other tasks.

Generally, infinite loops can be explained by two scenarios: (1) an interrupt (preemption) enable operation cannot be executed due to an infinite loop formed earlier, and (2) an interrupt (preemption) disable/enable pair falls inside an infinite loop.
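
The three infinite-loop fault types described above can be read as a small decision table. The following Python sketch is purely illustrative and is not part of SHFH: the names `LoopContext` and `classify_infinite_loop` are hypothetical, and returning `None` for a user-space loop with both interrupts and preemption enabled is an assumption based on the earlier remark that such a loop can still be interrupted or preempted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoopContext:
    """Hypothetical description of where an infinite loop occurs."""
    interrupts_enabled: bool
    preemption_enabled: bool
    in_kernel: bool


def classify_infinite_loop(ctx: LoopContext) -> Optional[str]:
    """Map a loop context to the fault label used in Section 2.2.1."""
    if not ctx.interrupts_enabled:
        # F1: interrupts disabled, so even the clock interrupt is not serviced.
        return "F1"
    if not ctx.preemption_enabled:
        # F3: interrupts are serviced, but no other task can take the CPU.
        return "F3"
    if ctx.in_kernel:
        # F2: both enabled, but a kernel loop rarely reaches a preemption point.
        return "F2"
    # Assumption: a user-space loop with both enabled remains preemptible
    # and is not treated as one of the infinite-loop fault types.
    return None


if __name__ == "__main__":
    print(classify_infinite_loop(LoopContext(False, True, True)))
    print(classify_infinite_loop(LoopContext(True, False, True)))
    print(classify_infinite_loop(LoopContext(True, True, True)))
```

The checks are ordered from the most restrictive condition to the least, mirroring the order in which the text rules out interrupt handling, then preemption, then kernel context.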
Faults related to spinlocks, e.g., double spinlocks, are also categorized as F1 (the first scenario) because they busy-wait for locks after interrupts are disabled. Even in a multi-