comparing the values of metrics collected during the normal and hang states, subject to a further theoretical analysis, we can empirically set an acceptable range for each monitored metric. When the values of one or more metrics fall outside their acceptable ranges, we can consider that the system has entered a hang state. For example, after observing the statistics of the experimental results, we find that in some hang scenarios sys is more than 95% for a long time (e.g., more than one second) and usr is lower than 4%, which may be caused by a task executing an infinite loop in kernel mode. In the normal state, however, sys can hardly reach 90% and stay there for more than one second, because the CPU time spent on system calls and exception handling is usually very short. In addition, since system calls are invoked by user applications, usr should not be lower than a certain percentage.

Moreover, the influence of a specific hardware module or operating system should also be considered to improve the portability of the system hang detection strategy. Thus, an appropriate platform-independent range for each monitored metric is preferred, e.g., some metrics can be evaluated in the form of a percentage.
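
As a minimal sketch of the sys/usr rule described above, the following Python fragment flags a suspected kernel-mode loop when sys stays above 95% and usr stays below 4% for more than one second. The three thresholds come from the text; the sample format, the names `CpuSample` and `kernel_loop_suspected`, and the requirement that the condition hold without interruption are assumptions made for illustration, not the SHFH implementation.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One CPU utilization sample (hypothetical format)."""
    t: float        # sample time in seconds
    sys_pct: float  # share of CPU time in kernel mode, in percent
    usr_pct: float  # share of CPU time in user mode, in percent


# Thresholds quoted in the text.
SYS_HIGH = 95.0     # sys is more than 95%
USR_LOW = 4.0       # usr is lower than 4%
MIN_DURATION = 1.0  # the condition lasts more than one second


def kernel_loop_suspected(samples, sys_high=SYS_HIGH, usr_low=USR_LOW,
                          min_duration=MIN_DURATION):
    """Return True if sys > sys_high and usr < usr_low hold without
    interruption for more than min_duration seconds."""
    start = None  # time at which the current abnormal run began
    for s in samples:
        if s.sys_pct > sys_high and s.usr_pct < usr_low:
            if start is None:
                start = s.t
            if s.t - start > min_duration:
                return True
        else:
            start = None  # a normal sample ends the abnormal run
    return False
```

Because both conditions are expressed as percentages, the same check can be reused on platforms with different CPU speeds or core counts; only the three constants would need tuning.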
In this case, the ranges of some metrics can be initialized according to the hardware and operating system configuration, which can be captured when the system starts to run on a specific platform.

4. SHFH: Self-Healing Framework for System Hang

In this section, we introduce SHFH (a self-healing framework to handle system hang), which uses the nine empirical system performance metrics described in Section 3 to detect system hang. To automate the whole process of handling system hang, we base the design of SHFH on the idea of self-healing. A traditional self-healing architecture includes detection, diagnosis and recovery components [17]. Its diagnosis part is usually implemented as multiple diagnosis engines that capture different failures, and it is independent of the detection part. SHFH takes system hang as its failure target and monitors only the performance metrics that may indicate system hang. Its diagnosis mechanism is integrated into the detection component to support diagnosis-based recovery. This revision remarkably decreases the performance overhead induced by the self-healing framework and simplifies its structure.

4.1. Overview of SHFH

As shown in Figure 3, SHFH contains three core parts: the light detector, the heavy detector and the recovery component.

Figure 3. An overview of SHFH

In SHFH, the light detector periodically monitors only six system performance metrics (see Section 4.2.1). When it finds that the values of the metrics are abnormal (perhaps caused by system hang), it triggers an alert to wake up the heavy detector. The heavy detector gets further information through some expensive operations (e.g., polling processes). The gathered information is then analyzed by the diagnosis part of the heavy detector.
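
The alert path described so far can be sketched roughly as follows. This is an illustrative Python model only: the names `light_check`, `HeavyDetector` and `light_detector_step`, the dictionary-based metric format, and the callables passed in for data collection and diagnosis are all hypothetical, and the real heavy detector runs inside the kernel rather than as Python code.

```python
def light_check(metrics, ranges):
    """Cheap check: return the names of metrics whose value lies
    outside its acceptable (low, high) range."""
    return [name for name, value in metrics.items()
            if not (ranges[name][0] <= value <= ranges[name][1])]


class HeavyDetector:
    """Runs expensive collection and diagnosis, but only when alerted."""

    def __init__(self, collect_details, diagnose):
        self.collect_details = collect_details  # expensive, e.g. polling processes
        self.diagnose = diagnose                # analyzes the gathered information
        self.wakeups = 0                        # how often the expensive path ran

    def on_alert(self, abnormal):
        self.wakeups += 1
        details = self.collect_details()
        return self.diagnose(abnormal, details)


def light_detector_step(metrics, ranges, heavy):
    """One monitoring period of the light detector."""
    abnormal = light_check(metrics, ranges)
    if not abnormal:
        return None  # no alert, so the heavy detector is not woken up
    return heavy.on_alert(abnormal)
```

The point of the split is that `collect_details` is never called during a normal period, so its cost is paid only after the cheap range check has raised an alert.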
If system hang is asserted to occur, the related recovery operations (which depend on the diagnosis result), e.g., suspending the current task on a particular CPU or restarting the system, are executed; otherwise, the heavy detector ignores the alert triggered by the light detector. One unique feature of SHFH is that its “light-heavy” detection strategy is designed to make intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Because the light detector is lightweight (a user application), the expensive operations that collect further data to detect a hang (and thereby decrease false positives) are incurred by the heavy detector only when the light detector triggers an alert.

The light detector is a real-time user process, and in some scenarios it may have no opportunity to execute due to certain faults (e.g., F1, F2, F3 and F5 described in Section 2). To overcome this problem, a watchdog timer mechanism is introduced in SHFH. The light detector periodically updates the value of the timer in the heavy detector. If the timer is not updated for consecutive periods of time, the services provided by the light detector are regarded as unavailable. The recovery operation is then called, since even the real-time application cannot run (there must be something wrong with the system).

4.2. Implementation of SHFH

We have implemented SHFH in the Linux operating system (kernel 2.6.32). The light detector of SHFH is implemented as a real-time process, and both the heavy detector and the recovery component are implemented as loadable kernel modules. The whole of SHFH can be dynamically loaded and removed with simple shell commands. In this section, the detailed detection and recovery strategies for system hang, and