a small number of performance metrics (9 in SHFH) seem to be sufficient for system hang detection.

Acknowledgement

We thank Roberto Natella and Antonio Bovenzi from Università degli Studi di Napoli Federico II, Haoxiang Lin from Microsoft Research Asia, and Zhongkui Sun from Northwestern Polytechnical University for discussing the causes of system hangs with us. This work is supported by the Aeronautical Science Foundation of China (20100753022), the National Natural Science Foundation of China (61103003), and an Australian Research Council Grant (DP0987236).

References

[1] L. Wang, Z. Kalbarczyk, W. Gu and R. Iyer. “Reliability MicroKernel: Providing Application-Aware Reliability in the OS,” IEEE Transactions on Reliability, 2007, vol. 56, pp. 597-614.
[2] D. Cotroneo, R. Natella and S. Russo. “Assessment and improvement of hang detection in the Linux operating system,” In SRDS, New York, USA, 2009, pp. 288-294.
[3] L. Wang, Z. Kalbarczyk and R. Iyer. “Formalizing System Behavior for Evaluating a System Hang Detector,” In IEEE Symp. on Reliable Distributed Systems, Naples, ITA, 2008, pp. 269-278.
[4] A. Bovenzi, M. Cinque, D. Cotroneo, R. Natella and G. Carrozza. “OS-Level Hang Detection in Complex Software Systems,” Int. J. Critical Computer-Based Systems, 2011, vol. 2, pp. 352-377.
[5] F. M. David, J. C. Carlyle and R. H. Campbell. “Exploring Recovery from Operating System Lockups,” In USENIX Annual Technical Conference, Santa Clara, CA, 2007, pp. 1-6.
[6] X. Song, H. Chen and B. Zang. “Why software hangs and what can be done with it,” In International Conference on Dependable Systems and Networks, Chicago, USA, 2010, pp. 311-316.
[7] N. Nakka, G. P. Saggese, Z. Kalbarczyk and R. K. Iyer. “An Architectural Framework for Detecting Process Hangs/Crashes,” In EDCC, Budapest, HUN, 2005, pp. 103-121.
[8] S. Basu, J. Dunagan and G. Smith. “Why did my PC suddenly slow down?” In Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, Cambridge, USA, 2007.
[9] G. Carrozza, M. Cinque, D. Cotroneo and R. Natella. “Operating System Support to Detect Application Hangs,” In VECoS, Leeds, UK, 2008.
[10] T. Jarboui, J. Arlat, Y. Crouzet, K. Kanoun and T. Marteau. “Analysis of the Effects of Real and Injected Software Faults: Linux as a Case Study,” In PRDC, Tsukuba, Japan, 2002, pp. 51-58.
[11] D. Chen, G. Jacques-Silva, Z. Kalbarczyk, R. K. Iyer and B. Mealey. “Error Behavior Comparison of Multiple Computing Systems: A Case Study Using Linux on Pentium, Solaris on SPARC, and AIX on POWER,” In PRDC, Taipei, TW, 2008, pp. 339-346.
[12] X. Wang et al. “Hang analysis: fighting responsiveness bugs,” In EuroSys, Glasgow, UK, 2008.
[13] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3.
[14] R. Love. Linux Kernel Development, 3rd ed. Addison-Wesley Professional, 2010.
[15] W. Mauerer. Professional Linux Kernel Architecture. Wiley Publishing Inc., 2008.
[16] F. David and R. Campbell. “Building a Self-Healing Operating System,” In DASC, Columbia, USA, 2007, pp. 3-10.
[17] H. Psaier and S. Dustdar. “A survey on self-healing systems: approaches and systems,” Cloud Computing, 2010, vol. 91, pp. 43-73.
[18] A. Avizienis, J. Laprie, B. Randell and C. Landwehr. “Basic concepts and taxonomy of dependable and secure computing,” In IEEE Transactions on Dependable and Secure Computing, Los Alamitos, USA, 2004, pp. 11-33.
[19] I. Lee and R. Iyer. “Faults, Symptoms, and Software Fault Tolerance in Tandem GUARDIAN90 Operating System,” In FTCS, Toulouse, France, 1993, pp. 20-29.
[20] W. Gu, Z. Kalbarczyk and R. K. Iyer. “Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors,” In DSN, Washington, D.C., USA, 2004, pp. 887-896.
[21] Sourceforge.net. Linux Test Project (LTP). URL: http://ltp.sourceforge.net/
[22] Google Project Hosting. unixbench-5.1.2.tar.gz. URL: http://code.google.com/p/byte-unixbench/
[23] D. Bovet and M. Cesati. Understanding the Linux Kernel, 3rd ed. O’Reilly & Associates, Inc., 2005, pp. 228-252.
[24] E. Ciliendo, T. Kunimasa and B. Braswell. Linux Performance and Tuning Guidelines, IBM Redpaper, July 2007.
[25] M. Sullivan and R. Chillarege. “Software Defects and Their Impact on System Availability-A Study of Field Failures in Operating Systems,” In International Symposium on Fault-Tolerant Computing, Nuremberg, Germany, 1991, pp. 2-9.
[26] W. Gu, Z. Kalbarczyk, R. K. Iyer and Z. Yang. “Characterization of Linux kernel behavior under errors,” In DSN, San Francisco, CA, USA, 2003, pp. 459-468.
[27] A. Chou, J. Yang, B. Chelf, S. Hallem and D. Engler. “An empirical study of operating system errors,” In ACM Symp. Operating Sys. Principles, New York, NY, USA, 2001, pp. 73-88.
[28] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall and G. Muller. “Faults in Linux: Ten years later,” In International Conference on Architectural Support for Programming Languages and Operating Systems, Newport Beach, CA, 2011, pp. 305-318.