Find an HA enhancement recommendation $(\langle C_1, n'_1 \rangle, \langle C_2, n'_2 \rangle, \ldots, \langle C_m, n'_m \rangle)$ such that the overall enhancement cost, defined as

$$\mathit{Cost}_{\mathit{HAEnhance}} = \sum_{i=1}^{m} h(C_i, n'_i) - \sum_{i=1}^{m} h(C_i, n_i),$$

is minimized and the following conditions are satisfied:

(1) $\forall i \in [1, n],\; P(W_i) = \prod_{j=1}^{m} P(C_j)^{R_{i,j}} > P_i$
(2) $\forall i \in [1, m],\; \mathit{LowerBound}(C_i) \le n'_i \le \mathit{UpperBound}(C_i)$

Each enhancement parameter is defined as $X_i = n'_i / n_i$ to capture the enhancement degree for resource $C_i$; an enhancement cost function is defined as $f_i(n_i, X_i) = h(C_i, n'_i) - h(C_i, n_i)$ to calculate the cost of HA enhancement for resource $C_i$.
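As a rough illustration of the formulation above, the following Python sketch enumerates candidate redundancy degrees exhaustively. It rests on several assumptions that are not part of the formulation itself: the cost function $h$ is taken to be linear in the redundancy degree (`unit_cost * n`), each enhanced resource is modeled as $n'_i$ identical replicas whose combined availability follows the resource-group formula given later as Eq. (3), and $R_{i,j}$ is read as a 0/1 indicator of whether workflow $W_i$ uses resource $C_j$. All function and parameter names, and all numbers in the toy instance, are hypothetical.

```python
from itertools import product


def replicated_availability(p, n):
    """Availability of n identical replicas, each with availability p.

    Assumption: redundancy follows the resource-group formula, Eq. (3).
    """
    return 1.0 - (1.0 - p) ** n


def workflow_availability(avail, uses):
    """Product of P(C_j) over the resources the workflow uses (R_ij = 1)."""
    result = 1.0
    for a, used in zip(avail, uses):
        if used:
            result *= a
    return result


def cheapest_enhancement(base_avail, unit_cost, current, bounds, workflows, targets):
    """Exhaustively search redundancy degrees n'_i within their bounds.

    Returns (cost, degrees) for the cheapest feasible candidate, or None
    if no candidate meets every workflow availability target.
    Assumption: h(C_i, n) = unit_cost[i] * n (linear, hypothetical).
    """
    best = None
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    for candidate in product(*ranges):
        avail = [replicated_availability(p, n)
                 for p, n in zip(base_avail, candidate)]
        feasible = all(
            workflow_availability(avail, uses) > target
            for uses, target in zip(workflows, targets)
        )
        if not feasible:
            continue
        # Cost_HAEnhance = sum_i h(C_i, n'_i) - sum_i h(C_i, n_i)
        cost = sum(c * (new - old)
                   for c, new, old in zip(unit_cost, candidate, current))
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best


if __name__ == "__main__":
    # Toy instance with made-up numbers: two resources, one workflow using both.
    print(cheapest_enhancement(
        base_avail=[0.9, 0.9], unit_cost=[1, 2], current=[1, 1],
        bounds=[(1, 3), (1, 3)], workflows=[(1, 1)], targets=[0.97],
    ))
```

Exhaustive enumeration is used here only because it mirrors the problem statement directly; the number of candidates grows as the product of the bound ranges, so it is practical only for small instances.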
To find an optimal enhancement solution, we need to iteratively explore different HA solutions for each IT resource and different redundancy degrees for each HA solution. Returning to the example depicted in Fig. 7, finding an optimized HA enhancement recommendation boils down to solving the following constrained optimization problem:

Find an HA enhancement vector $(X_1, X_2, X_3, X_4)$ such that the HA enhancement cost

$$\mathit{Cost}_{\mathit{HAEnhance}} = F(X_1, X_2, X_3, X_4) = f_1(1, X_1) + f_2(1, X_2) + f_3(1, X_3) + f_4(1, X_4)$$

is minimized and the following conditions are satisfied:

(1) $P(W_1) = P(C_1)\,P(C_2)\,P(C_3)\,P(C_4) > 0.999$
(2) $P(W_2) = P(C_1)\,P(C_2)\,P(C_4) > 0.999$
(3) $1 \le X_1 \le 32$
(4) $1 \le X_2 \le 16$
(5) $1 \le X_3 \le 16$
(6) $1 \le X_4 \le 2$

In this problem, the bound inequality constraints are set according to resource type: 32 as the upper bound for the HA cluster of HTTP servers (resource $C_1$), 16 for the WAS application servers (resources $C_2$ and $C_3$), and 2 for the DB2 database server (resource $C_4$).

B. Calculations

Currently, the most frequently used [7][8] definition of availability is the uptime ratio, which is a close approximation of the steady-state availability value and represents the percentage of time a computer system is available throughout its useful lifetime. This uptime ratio can be defined as follows:

$$\text{uptime ratio} = 1 - \frac{\mathit{MTTR}}{\mathit{MTBF}} \tag{1}$$

where MTTR is the expected time to recover from a failure, and MTBF is the expected time interval from one failure of a system to the next. With this definition, the uptime ratio lies in the range from 0 to 1 (in practice, one hopes the lower bound is limited to 0.9 at worst). We assume that MTBF, MTTR, and the uptime ratio can be measured empirically or obtained directly from product documentation.

The availability calculations for IT resources, HA solutions, and workflows are respectively defined as follows.
Usually, an IT resource is a hosting stack composed of different layers of resource components such as middleware, an operating system, and a physical server, with the “top” of the stack being the IT resource that we are really interested in. In a hosting stack, the failure of any component will generally result in the failure of everything “above” it in the stack, which will further lead to unavailable services hosted by this IT resource. Thus, the availability of an IT resource is given by:

$$P(C_i) = \prod_{j=1}^{m} P(RC_j) \tag{2}$$

where $P(C_i)$ is the availability capability for the IT resource $C_i$, which includes $m$ resource components in its hosting stack, and $P(RC_j)$ is the availability capability of resource component $RC_j$.

An HA solution is usually composed of several basic IT resources. When one basic IT resource in an HA solution becomes unavailable, the other resources can take over its workload to guarantee service continuity. For example, if an application server cluster is composed of $m$ application servers, all providing the same capabilities and hosting the same applications, the overall service becomes unavailable only if all application servers fail. In this paper, we view an HA solution as a resource group; its availability can be calculated as follows:

$$P(RG) = 1 - \prod_{j=1}^{m} \bigl(1 - P(C_j)\bigr) \tag{3}$$

where $P(RG)$ is the availability capability for resource group $RG$, and $P(C_j)$ is the availability capability of IT resource $C_j$.

An IT resource workflow links resources that support loosely coupled services. In such a workflow, the failure of any resource results in the failure of the whole workflow. For example, a typical IT resource workflow is a three-tier Web hosting architecture including a Web server, an application server, and a database server. Thus the availability of a workflow is given by:

$$P(W) = \prod_{j=1}^{m} P(C_j) \tag{4}$$

where $P(W)$ is the availability capability for workflow $W$ that links $m$ IT resources, and $P(C_j)$ is the availability capability of IT resource $C_j$.

Fig. 8 depicts an example with three IT resources, one resource group, and an IT resource workflow derived from the business workflow. The availability capability of this business workflow is:

$$P(W) = P(RG_1)\,P(C_3) = \bigl(1 - (1 - P(C_1))(1 - P(C_2))\bigr)\,P(C_3) \tag{5}$$
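The calculations in Eqs. (1) through (5) can be sketched in a few lines of Python. This is only a minimal illustration: the function names are hypothetical, and the MTTR, MTBF, and availability values in the usage example are made-up numbers, not measurements from this paper.

```python
from math import prod


def uptime_ratio(mttr, mtbf):
    """Eq. (1): uptime ratio from mean time to recover and MTBF (same units)."""
    return 1.0 - mttr / mtbf


def stack_availability(component_avails):
    """Eq. (2): a hosting stack is up only if every component is up."""
    return prod(component_avails)


def group_availability(member_avails):
    """Eq. (3): a resource group fails only if all of its members fail."""
    return 1.0 - prod(1.0 - p for p in member_avails)


def workflow_availability(resource_avails):
    """Eq. (4): a workflow is up only if every linked resource is up."""
    return prod(resource_avails)


def example_workflow_availability(p_c1, p_c2, p_c3):
    """Eq. (5): resource group {C1, C2} in series with C3."""
    return workflow_availability([group_availability([p_c1, p_c2]), p_c3])


if __name__ == "__main__":
    # Hypothetical inputs: 4 hours to recover, 1000 hours between failures.
    p_server = uptime_ratio(mttr=4.0, mtbf=1000.0)
    # Hypothetical stack: middleware, operating system, physical server.
    p_c = stack_availability([0.999, 0.9995, p_server])
    print("IT resource availability:", p_c)
    print("Workflow availability:", example_workflow_availability(p_c, p_c, 0.999))
```

Keeping each formula as its own function mirrors the layering in the text: component availabilities feed Eq. (2), IT resource availabilities feed Eq. (3), and resource or resource-group availabilities feed Eq. (4).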