the implementations of the light detector, the heavy detector and the recovery component are described.

4.2.1. Light Detector

The light detector can be considered the eyes of SHFH. It collects six system performance metrics: sys, iowait and usr on each CPU, plus run, cs and pswpout (as described in Section 3.3). According to our formal study (see Section 3.4), a system hang can be revealed by these performance metrics. We define conditions under which an alert should be triggered; at the same time, an error code is generated according to which metrics are anomalous. The error codes help the heavy detector perform a further check. The mapping between trigger conditions and error codes is given in Table 2.

Table 2. Mapping model of light detector

Trigger condition | Error code
sys exceeds its upper bound for consecutive monitor intervals and usr does not reach its lower bound | CPU_ERROR
iowait is higher than its upper bound | CPU_ERROR
run surpasses its upper bound | PROC_ERROR
cs is lower than its lower bound | PROC_ERROR
pswpout exceeds its upper bound for consecutive monitor intervals | MEM_ERROR

The light detector consists of two core functions:

• To obtain the performance metrics of the system, we use the sar command to collect data periodically from the /proc file system provided by Linux. By establishing a pipe between the light detector and sar, system performance metrics are obtained dynamically.
• Once the light detector finds that some metrics indicate an anomalous condition, it triggers an alert by sending a message that includes the error code and the metrics the heavy detector needs to perform a further check. Otherwise, it periodically sends an empty message to update the timer of the heavy detector, indicating that the light detector is still working. Sockets are used as the communication medium between the light and heavy detectors.
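The trigger conditions of Table 2 and the alert/heartbeat messages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SHFH implementation: the values in Bounds, the number of consecutive intervals, the use of a single metric sample instead of per-CPU values, the order in which conditions are tested, and the JSON message format are all hypothetical. Reading the metrics from sar through a pipe is omitted.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Bounds:
    """Hypothetical thresholds; concrete values are not given in this section."""
    sys_upper: float = 90.0
    usr_lower: float = 5.0
    iowait_upper: float = 80.0
    run_upper: float = 32.0
    cs_lower: float = 50.0
    pswpout_upper: float = 1000.0
    consecutive: int = 3  # assumed length of "consecutive monitor intervals"


class LightDetector:
    def __init__(self, bounds: Optional[Bounds] = None):
        self.b = bounds or Bounds()
        self.sys_streak = 0   # intervals in a row with high sys and low usr
        self.swap_streak = 0  # intervals in a row with high pswpout

    def check(self, m: Dict[str, float]) -> Optional[str]:
        """Classify one metric sample; return an error code or None."""
        b = self.b
        # Assumption: the streak counts intervals where both parts of the
        # first Table 2 condition hold at the same time.
        if m["sys"] > b.sys_upper and m["usr"] < b.usr_lower:
            self.sys_streak += 1
        else:
            self.sys_streak = 0
        if m["pswpout"] > b.pswpout_upper:
            self.swap_streak += 1
        else:
            self.swap_streak = 0

        # Assumption: the first matching condition wins.
        if self.sys_streak >= b.consecutive or m["iowait"] > b.iowait_upper:
            return "CPU_ERROR"
        if m["run"] > b.run_upper or m["cs"] < b.cs_lower:
            return "PROC_ERROR"
        if self.swap_streak >= b.consecutive:
            return "MEM_ERROR"
        return None


def build_message(code: Optional[str], metrics: Dict[str, float]) -> bytes:
    """Alert carries the error code and metrics; '{}' serves as the heartbeat."""
    if code is None:
        return b"{}\n"
    payload = {"error_code": code, "metrics": metrics}
    return (json.dumps(payload) + "\n").encode()
```

Calling check once per monitor interval and sending the result of build_message over the socket would give the alert-or-heartbeat behaviour described above. Keeping streak counters inside the detector is one simple way to express "for consecutive monitor intervals".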
The light detector acts as a filter for most metrics measured in normal states, which ensures that the costly operations are executed only when the system appears to be in an abnormal state.

4.2.2. Heavy Detector

Unlike the light detector, which generates an alert whenever the system is possibly in a hang state in order to increase the coverage of hang detection, the heavy detector acts as the brain of SHFH. It should confirm whether the system is actually in a hang state, which decreases the false positive rate, and then choose a proper recovery action according to the fault cause, which is identified through a diagnosis process.

The heavy detector can be triggered under one of two conditions: by receiving an alert message from the light detector, or by a timeout signal from the timer that is periodically updated by the light detector. Once triggered, the heavy detector first takes a diagnosis action, checking the error code and some extra performance metrics sent by the light detector to confirm whether the system is in a hang state. This is necessary because the monitored metrics may also seem anomalous to the light detector under some normal conditions (e.g., when the system has a heavy load). Although some performance metrics used to verify system hang are the same as those used by the light detector, other metrics are added or the bounds of the metrics are set differently when confirming system hang. The mapping from an error code to the metrics for verifying system hang is given in Table 3. Because a recovery strategy is chosen based on the type of fault, the mapping rules from an error code (with the extra performance metrics sent by the light detector) to the possible faults and recovery actions are also given in Table 3.

Let us consider an example to see how the mapping rules work.
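In code, the rule-matching procedure might look like the following minimal sketch. It encodes only the MEM_ERROR rule discussed in the text (check blk and util, look for a task consuming memory abnormally, and map the result to fault F6). Table 3 is not reproduced here, so the threshold constants, the max_task_mem_share field standing in for "polling all tasks", and the recovery action name kill_memory_hog are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]

# Hypothetical upper bounds; the real values are not given in this section.
BLK_UPPER = 4.0
UTIL_UPPER = 90.0
MEM_SHARE_UPPER = 0.8


@dataclass
class Rule:
    fault: str                             # fault label, e.g. "F6"
    condition: Callable[[Metrics], bool]   # further diagnosis check
    recovery: str                          # name of the recovery action


def memory_hog_found(m: Metrics) -> bool:
    """blk and util both exceed their bounds and one task dominates memory.

    max_task_mem_share is an assumed, precomputed stand-in for polling
    all tasks to find the one consuming memory abnormally.
    """
    return (
        m.get("blk", 0.0) > BLK_UPPER
        and m.get("util", 0.0) > UTIL_UPPER
        and m.get("max_task_mem_share", 0.0) > MEM_SHARE_UPPER
    )


# Rules are tried in order for each error code.
RULES: Dict[str, List[Rule]] = {
    "MEM_ERROR": [Rule("F6", memory_hog_found, "kill_memory_hog")],
}


def diagnose(error_code: str, metrics: Metrics) -> Optional[Rule]:
    """Return the first matching rule, or None to ignore the alert."""
    for rule in RULES.get(error_code, []):
        if rule.condition(metrics):
            return rule
    return None
```

A returned rule names both the suspected fault and the recovery action to take; a result of None corresponds to the case where no rule matches and the alert is ignored.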
As shown in Table 3, when the error code from the light detector is MEM_ERROR, the heavy detector first checks the values of blk and util, and when both exceed their upper bounds, it polls all tasks to find the task consuming memory abnormally. If this further diagnosis condition is satisfied, F6 is considered the cause of the system hang, and then, according to the mapping rules in Table 3, the operation that kills the task consuming memory abnormally is selected. Otherwise, the heavy detector checks the next mapping rule for MEM_ERROR. If no rule for MEM_ERROR matches, the heavy detector ignores the alert from the light detector.

4.2.3. Recovery

The recovery component of SHFH tries to help the OS recover from a hang state and provide continuous services, or restart in some severe cases. Based on the diagnosis results generated by the heavy detector according to the mapping rules, different recovery operations are taken (shown in Table 3). The recovery component offers three types of recovery actions: kill or stop the suspicious process/thread; send an NMI (Non-Maskable Interrupt) to a particular CPU to wake up the stalled CPU; panic the system and then restart. The recovery component may have to restart the OS when the hang scenario is caused by processes that are in the UNINTERRUPTIBLE state.

5. Evaluation

In order to evaluate SHFH and the effectiveness of the hypothesis described in Section 3.1, we have conducted fault injection experiments.

5.1. Experiment Setup

The experiments are performed on a computer with Intel Core i5 650, 3.20GHz CPU (seen as 4 CPUs