questions respectively.

3.3. Which Performance Metrics to Select

In this section, we experimentally investigate which metrics to select for detecting system hang by observing whether a metric changes abnormally under a hang scenario. First, we describe our experimental setup. Then, we use an example to show how these experiments work. Finally, the system performance metrics that have the potential to detect system hang are selected according to our experimental results.

3.3.1. Experiment Setup

The six types of faults (see Section 2) that cause system hang are used as the injected faults; they are implemented as errant kernel modules and loaded dynamically under different workloads.
Accordingly, the activation rate of the injected faults in causing system hang is 100%. We select 68 system performance metrics (e.g., the number of tasks currently blocked and the percentage of time spent on soft interrupt requests) as the observation targets. To observe the general variations of performance metrics under a sufficient range of workloads, 9 programs (context1, dhry, fstime, hanoi, shell8, pipe, spawn, syscall, and execl) in the benchmark suite UnixBench 5.1.2 are selected, which could represent at least 95% of kernel usage [26]. Experiments are performed on two computers: one with an Intel Core i5 650 3.20GHz CPU (seen as 4 CPUs by the OS) and 4GB RAM, and the other with an Intel Pentium 4 3.20GHz CPU (seen as 2 CPUs by the OS) and 512MB RAM. We use a Linux kernel (version 2.6.32) as our experimental operating system. To guarantee the generality of the experimental results, each type of injected fault is loaded and executed under each selected UnixBench workload 10 times on each computer. Consequently, the total number of experiments conducted is 6 × 9 × 10 × 2 = 1080.

3.3.2. An Example

We choose F5 and inject it into the pipe workload of UnixBench running on the computer with the Intel Core i5 650 3.20GHz CPU and 4GB RAM. Although experienced programmers avoid operating on semaphores after acquiring a spinlock so that the unlock operation is executed quickly, they may overlook whether the functions called while the spinlock is held operate on semaphores or sleep. As a result, if the task holding the spinlock falls asleep due to a down() operation on a semaphore or an explicit sleep operation (F5), the tasks waiting for the spinlock to be released have to run on the CPU in a busy-waiting way, leaving no chance for other tasks to run. We inject the sleeping kernel module, which holds a spinlock A, at the 23rd second, and inject the kernel modules that acquire A at the 39th, 51st and 59th seconds consecutively.
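
The F5 scenario described above can be illustrated with a small userspace analogy. The sketch below is not the authors' kernel module and does not touch the kernel: it assumes Python threads as stand-ins for kernel tasks, a `threading.Lock` polled with non-blocking acquires as a stand-in for spinlock A, and hypothetical names (`simulate_f5`, `max_spins`). The holder blocks while owning the lock, so every attempt by the waiter fails and the waiter only burns cycles, which mirrors why the waiting tasks occupy their CPUs.

```python
import threading


def simulate_f5(max_spins=1000):
    """Userspace analogy of F5: a lock holder sleeps, a waiter busy-waits.

    All names here are illustrative assumptions, not taken from the paper.
    """
    lock = threading.Lock()    # stands in for spinlock A
    held = threading.Event()   # set once the holder owns the lock
    wake = threading.Event()   # the wake-up the sleeping holder waits for
    result = {"spins": 0, "acquired": False}

    def holder():
        # Acquire the lock, then block while still holding it
        # (analogous to sleeping after taking a spinlock).
        with lock:
            held.set()
            wake.wait()

    def waiter():
        # Start spinning only after the holder certainly owns the lock.
        held.wait()
        for _ in range(max_spins):
            # Non-blocking attempt: a failed try is one busy-wait iteration.
            if lock.acquire(blocking=False):
                result["acquired"] = True
                lock.release()
                return
            result["spins"] += 1

    h = threading.Thread(target=holder)
    w = threading.Thread(target=waiter)
    h.start()
    w.start()
    w.join()      # the waiter gives up after max_spins failed attempts
    wake.set()    # end the simulation by waking the holder
    h.join()
    return result
```

Calling `simulate_f5(max_spins=500)` returns a dictionary with `spins` equal to 500 and `acquired` equal to False: the waiter exhausts its spin budget without ever obtaining the lock. A real kernel spinlock has no such budget, which is why the waiting tasks in the experiment never yield their CPUs.
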

As shown in Figures 2-(a), 2-(b) and 2-(c), the metric sys (percentage of time spent on system calls and exceptions) reaches and holds 100%, and the value of the metric usr (percentage of time spent by applications) remains zero after the respective kernel module is injected. Finally, CPU0 cannot execute the user program any more when the system enters a hang state (see Figure 2-(d)). In addition, after the 59th second, the number of context switches per second (cs) (as shown in Figure 2-(e)) is small since the other three CPUs are occupied by the injected kernel code. Although some metrics vary noticeably after the injection of the faults, e.g., the number of runnable tasks under the pipe workload (Figure 2-(f)), they may not be selected as detection metrics, since the value of an affected metric may be normal in other workloads (e.g., the number of runnable tasks for the shell8 workload as shown in Figure 2-(f)). After injecting F5 into the pipe workload 10 times and finishing the experiments of F5 in the other 8 workloads of UnixBench, the general detection metrics selected for F5 are usr, sys per CPU, and cs.

3.3.3. Experimental Conclusion

Following the methodology of the above example, the other experiments are carried out, and the experimental results are given in Table 1. The metric iowait represents the percentage of time spent on I/O wait; run is the number of tasks in the running state and blk is the number of tasks currently blocked. The metric pswpout is the number of pages swapped out per second and memfree records the amount of unused memory. util is the percentage of CPU time during which I/O requests were issued to the device. The 9 system performance metrics in Table 1 are considered as the metrics to detect system hang. F1, F2 and F3 have the same detection metrics since they all consume CPU inappropriately. F4 makes tasks sleep while waiting for services provided by tasks that are trapped in deadlock; thus it has no influence on the CPU metrics.
Because F5 makes tasks run on CPUs in a busy-waiting way, its metrics are similar to the ones related to infinite loops. As for F6, since it is related to the consumption of large amounts of resources, its detection metrics should be related to memory and I/O.

Table 1. Performance metrics used to detect system hang

Fault \ Metrics   CPU: sys, usr, iowait   Process: run, blk, cs   Memory: pswpout, memfree   disk I/O: util
F1   √ √ √
F2   √ √
F3   √ √ √
F4   √ √
F5   √ √ √
F6   √ √ √ √ √ √

3.4. How to Determine System Hang

The values of several monitored system metrics under normal execution are quite different from those of a hung system. During normal execution, each value of a monitored metric has an acceptable range. The system is considered healthy when each monitored metric is within its acceptable range. By