system, we do not regard it as a successful recovery action. We consider a recovery successful if it preserves the operating system's ability to continue running and providing services after a system hang is detected. In the worst case, however, SHFH cannot work because of severe system hang scenarios (e.g., all CPUs stalled simultaneously). As a result, on average 4.66% of system hang scenarios can be neither recovered nor even restarted by SHFH. According to the coverage, false positive, recovery and restart ratios provided by SHFH (shown in Table 6), the effectiveness of the hypothesis proposed in Section 3.1 is empirically validated.

Table 6. Coverage, false positive, recovery and restart ratio provided by SHFH

Workload     Detection coverage   False positive   Recovery ratio   Panic and restart
LTP          94.45%               1.16%            78.67%           15.78%
Unixbench    96.22%               0%               82.44%           13.78%
average      95.34%               0.58%            80.56%           14.78%

The performance overhead is evaluated using the system performance index reported by Unixbench. By comparing the index obtained when running a standard benchmark with and without SHFH, we find that SHFH incurs a performance overhead of about 0.6%. Recall that our experiments are conducted on a multi-core computer. When SHFH is applied on a single-core computer, the detection coverage and recovery ratio may decrease because the recovery operations cannot be taken when some types of faults, such as F1, occur.

6. Related Work

We discuss related work on the causes of system hang and on methods for detecting and recovering from it. The OS kernel falling into an infinite loop is seen as the reason for system hang [1], [3], [5]; however, that reason may not be appropriate when considering the preemption mechanisms used.
Incorrect usage of synchronization primitives (in particular those related to spinlocks in Linux) is regarded as the main cause of system hang [2]. In addition, the studies reported in [4], [9] also take into account indefinite wait (for an event that will never occur). However, its effectiveness depends on how the wait for the event is performed (e.g., sleeping or busy waiting).

Several methods have been proposed to detect system hang. The improved watchdog timer [5] needs to be periodically reset under normal conditions; otherwise the timer expires and an NMI is triggered. However, this method cannot detect an infinite loop when the process responsible for resetting the timer does not get stuck. SHD (System Hang Detector) [1] counts the number of instructions executed between two consecutive context switches. When the OS does not schedule processes, the counter value increases and exceeds the theoretical maximum value. This approach is only effective against an infinite loop with both interrupts and preemption disabled. Monitoring I/O throughput [2] is an effective way to detect some system hang problems; however, it fails if a hang occurs within OS code not related to I/O. The work of [4] monitors signals, waiting/holding times for critical sections, task scheduling timeouts, and so on. A total of eight variables are monitored for a single process, and the monitors need to be deployed through dynamic probing, using KProbes to place breakpoints in the kernel. If this strategy is applied to monitoring every process, it may obtain sound evidence of system hang with few false positives; however, the performance overhead is likely to be high.

Generally, when system hang is detected, restarting the system is regarded as the default recovery action. Study [5] keeps the OS running by killing the currently running process.
However, when the suspicious process is not the current one, e.g., a process that is sleeping while holding a spinlock or a large block of memory, the other processes needing the spinlock or the memory space consume CPU and memory resources and eventually cause system hang. In this case, killing the current process cannot resolve the system hang. Our recovery strategy varies with the diagnosis results of detection, e.g., killing the sleeping processes (located by the light-heavy detection of SHFH) that hold a large piece of memory while waiting for a signal that will never arrive, rather than just killing the current process or restarting the system.

7. Conclusion

In this paper, we give a new characterization of system hang according to the two existing views about it, and analyze the causes of system hang in detail from two aspects: indefinite wait for system resources (resources not released or released slowly) and infinite loop under interrupt and preemption constraints. Accordingly, six types of faults that may cause system hang are described. To avoid the additional cost incurred by extra assistance (e.g., new hardware modules, kernel modification or breakpoint insertions), we present a hypothesis which uses only a small subset of the system performance metrics to detect system hang. Based on this hypothesis, we propose a self-healing framework named SHFH, which can be deployed dynamically, to handle system hang. SHFH can automatically detect system hang and help the system recover from it. Evaluation results show that SHFH introduces 0.6% performance overhead and can detect system hang with a false positive rate of 0.58% and a coverage rate of 95.34%, indicating the effectiveness of the "light-heavy" detection strategy adopted in SHFH. Given a recovery rate of 80.56% (making the OS continue running and providing services), its diagnosis-based recovery strategy provides a better recovery granularity than the naive approach that resorts to restarting the system.
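The summary figures quoted above can be cross-checked against the per-workload rows of Table 6. The following is a minimal sketch, not part of SHFH itself: the names `TABLE6`, `pct` and `column_averages` are illustrative, and it assumes the "average" row is the unweighted mean of the LTP and Unixbench rows, rounded half up to two decimal places.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-workload rows of Table 6, in percent:
# (detection coverage, false positive, recovery ratio, panic and restart)
TABLE6 = {
    "LTP": ("94.45", "1.16", "78.67", "15.78"),
    "Unixbench": ("96.22", "0", "82.44", "13.78"),
}


def pct(value):
    """Round to two decimal places, half up (assumed rounding rule)."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def column_averages(rows):
    """Unweighted mean of each column over all workloads (assumption)."""
    columns = zip(*[[Decimal(v) for v in row] for row in rows.values()])
    return [pct(sum(col) / len(rows)) for col in columns]


coverage, false_positive, recovery, restart = column_averages(TABLE6)

# Share of hang scenarios outside the average detection coverage.
unhandled = Decimal("100") - coverage

# In every row, recovery ratio plus panic-and-restart equals coverage.
rows_consistent = all(
    Decimal(rec) + Decimal(rst) == Decimal(cov)
    for cov, _fp, rec, rst in TABLE6.values()
)

print(coverage, false_positive, recovery, restart, unhandled, rows_consistent)
```

Under these assumptions the averages reproduce the 95.34% coverage, 0.58% false positive rate and 80.56% recovery rate cited in the conclusion. Numerically, the 4.66% of scenarios that can be neither recovered nor restarted equals 100% minus the average detection coverage.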
Finally, our experimental results also validate the effectiveness of our hypothesis that