corresponds to the set of temporary states S in our problem. We may also partition the state space S into |E| + 1 disjoint subsets according to the number of edges that have been tested in each state, S = S_0 ∪ S_1 ∪ ... ∪ S_{|E|}. In decision epoch i, the system can only be in a state in S_i.
• Action Sets: For each state s ∈ S, there is a set of actions that can be performed under it. We define the associated action set A_s of state s as the set of edges that have not been tested in s. Additionally, for terminating states, the action set also contains the terminating action ⊥. As a result, the whole action set is A = ∪_{s∈S} A_s = E ∪ {⊥}.
• Transition Probabilities and Rewards: The transition probability function and the reward function characterize the result of choosing an action at a state. Generally speaking, choosing an action at a state yields some reward, and the system evolves probabilistically into another state at the next decision epoch. In our problem, the transition probabilities of action e (testing edge e) are given by the existence probability of edge e. Denote by s · e the temporary state obtained from s by setting s_e to 1, and by s\e the temporary state obtained from s by setting s_e to 0. Formally, the transition probability function is
$$P(s' \mid s, e) = \begin{cases} p(e) & \text{if } s' = s \cdot e,\\ 1 - p(e) & \text{if } s' = s \backslash e,\\ 0 & \text{otherwise,} \end{cases} \qquad P(s' \mid s, \perp) = \begin{cases} 1 & \text{if } s' = s,\\ 0 & \text{otherwise.} \end{cases}$$
It then follows that the reward function is r(s, e) = −c(e) and r(s, ⊥) = 0. Note that the reward function is negative, as it corresponds to the testing cost, and that both the transition probability and the reward function are independent of the decision epoch and of previously visited states, which establishes the Markov property of our problem. (A small code sketch of this state and transition model follows this list.)
• Decision Policy: A decision policy is a mapping from the state space to the action set. Therefore, in our problem, it is equivalent to an adaptive testing strategy.
• Optimality Criterion: In our case, the optimality criterion is the expected total reward criterion, i.e., the decision policy with the maximum expected total reward of the constructed MDP corresponds to the optimal adaptive testing strategy.
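The sketch below is a minimal Python illustration (not the authors' code) of the state and transition model described above. The tuple-based state encoding, the helper names set_edge/transition/reward, and the representation of p and c as lists indexed by edge are assumptions made for illustration; the paper only specifies the MDP abstractly.

```python
# A minimal sketch, assuming a tuple-based encoding of temporary states:
# one entry per edge, with 1 = confirmed present, 0 = confirmed absent,
# '*' = not yet tested.
from typing import Dict, List, Tuple

State = Tuple  # e.g. ('*', 1, 0, '*') for a four-edge graph

def set_edge(s: State, e: int, value) -> State:
    r"""Return the state obtained from s by fixing edge e to value
    (s·e when value == 1, s\e when value == 0)."""
    t = list(s)
    t[e] = value
    return tuple(t)

def transition(s: State, e: int, p: List[float]) -> Dict[State, float]:
    """Transition distribution P(· | s, e) for the action of testing edge e."""
    return {set_edge(s, e, 1): p[e],       # edge exists:    s · e
            set_edge(s, e, 0): 1 - p[e]}   # edge is absent: s \ e

def reward(s: State, e: int, c: List[float]) -> float:
    """Reward of testing edge e in state s: the negated testing cost."""
    return -c[e]
```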
B. Exact Dynamic Programming Algorithm

From the equivalence between our problem and an MDP, it follows that our problem also satisfies the "Principle of Optimality" of MDPs [34], i.e., starting from any state, the optimal adaptive testing strategy incurs the minimum expected cost among all strategies. This enables us to define the optimal utility function u over states, which assigns to each temporary state the expected reward (negative cost) of the optimal strategy starting at that state. Similarly, we define a utility function u_π associated with a strategy π as the reward gained by π starting from each state. By the Bellman equation [34], we have the following lemma.

Lemma 1. For any state s ∈ S, the optimal utility function satisfies
$$u(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, u(s') \Big\}.$$
In particular, if s is a non-terminating state, then
$$u(s) = \max_{e \in A_s} \big\{ -c(e) + p(e)\, u(s \cdot e) + (1 - p(e))\, u(s \backslash e) \big\},$$
and for any terminating state, its utility is 0.

Based on Lemma 1, we design an algorithm that computes the optimal testing strategy π following the standard dynamic programming paradigm, as shown in Algorithm 1. (A Python sketch of this procedure is given after the proof of Theorem 3.)

Algorithm 1 The MDP-based Exact Algorithm
Input: Uncertain graph G(V, E, p, c), source s, destination t
Output: The optimal testing strategy π
1: Initialize: u_π(s) := 0 for all s ∈ S_{|E|}
2: for i = |E| down to 0 do
3:   for all s ∈ S_i do
4:     if s is a terminating state then
5:       u_π(s) := 0, π(s) := ⊥
6:     else
7:       e* := arg max_{e ∈ A_s} {−c(e) + p(e) u_π(s · e) + (1 − p(e)) u_π(s\e)}
8:       u_π(s) := −c(e*) + p(e*) u_π(s · e*) + (1 − p(e*)) u_π(s\e*)
9:       π(s) := e*
10: return π

We prove the correctness of the dynamic programming algorithm in the following theorem.

Theorem 3. For an uncertain graph G, Algorithm 1 yields an optimal adaptive testing strategy and has a complexity of O((|V| + |E|)·3^{|E|}), where |V| denotes the number of nodes and |E| denotes the number of edges in G.

Proof: Denote an optimal testing strategy by π* and the strategy given by Algorithm 1 by π. By backward induction, we prove that the utility function u_π of π is no less than the optimal utility function u_{π*} = u on every state, which implies that π is an optimal strategy.

First, for all s ∈ S_{|E|}, obviously u_π(s) = u_{π*}(s) = 0. Suppose that u_π(s) ≥ u_{π*}(s) for all states s ∈ S_i with i ≥ k; we then prove that u_π(s) ≥ u_{π*}(s) for all states s ∈ S_{k−1}. Indeed, by the selection criterion of the algorithm, for a non-terminating state s ∈ S_{k−1} we have
$$\begin{aligned} u_\pi(s) &= \max_{e \in A_s} \big\{ -c(e) + p(e)\, u_\pi(s \cdot e) + (1 - p(e))\, u_\pi(s \backslash e) \big\} \\ &\ge -c(\pi^*(s)) + p(\pi^*(s))\, u_\pi(s \cdot \pi^*(s)) + (1 - p(\pi^*(s)))\, u_\pi(s \backslash \pi^*(s)) \\ &\ge -c(\pi^*(s)) + p(\pi^*(s))\, u_{\pi^*}(s \cdot \pi^*(s)) + (1 - p(\pi^*(s)))\, u_{\pi^*}(s \backslash \pi^*(s)) \qquad (1) \\ &= u_{\pi^*}(s), \end{aligned}$$
where Inequality (1) follows from the induction hypothesis. And if s is a terminating state, then also u_π(s) = u_{π*}(s) = 0. Hence, following π is optimal from every state s; in particular, starting from the initial all-∗ state, π attains the maximum expected reward, or equivalently, incurs the minimum expected cost.
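Below is a minimal, self-contained Python sketch of Algorithm 1 under the tuple-based state encoding used in the earlier sketch. It is an illustrative reconstruction, not the authors' implementation; in particular, the is_terminating predicate is supplied by the caller, on the assumption (consistent with the problem setup) that a state is terminating once the already-tested edges determine whether s and t are connected.

```python
# A sketch of Algorithm 1: backward dynamic programming over states
# grouped by the number of tested edges (S_|E| down to S_0).
from itertools import combinations, product
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple  # entries in {0, 1, '*'}, one per edge

def exact_dp(num_edges: int,
             p: List[float],
             c: List[float],
             is_terminating: Callable[[State], bool]):
    """Return (u_pi, pi): the utility table and the testing strategy."""
    u: Dict[State, float] = {}
    pi: Dict[State, Optional[int]] = {}

    def states_with(i: int):
        # Enumerate S_i: states with exactly i tested edges.
        for tested in combinations(range(num_edges), i):
            for values in product((0, 1), repeat=i):
                s = ['*'] * num_edges
                for e, v in zip(tested, values):
                    s[e] = v
                yield tuple(s)

    for i in range(num_edges, -1, -1):               # i = |E| down to 0
        for s in states_with(i):
            # Fully tested states are terminating (line 1 of Algorithm 1).
            if i == num_edges or is_terminating(s):
                u[s], pi[s] = 0.0, None              # terminating action ⊥
                continue
            best_val, best_e = float('-inf'), None
            for e in range(num_edges):
                if s[e] != '*':                      # actions are the untested edges
                    continue
                s_dot = tuple(1 if j == e else v for j, v in enumerate(s))   # s · e
                s_back = tuple(0 if j == e else v for j, v in enumerate(s))  # s \ e
                val = -c[e] + p[e] * u[s_dot] + (1 - p[e]) * u[s_back]
                if val > best_val:
                    best_val, best_e = val, e
            u[s], pi[s] = best_val, best_e
    return u, pi
```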
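As a quick usage illustration of the sketch above (the graph, probabilities, and costs here are made up, not taken from the paper), consider two parallel uncertain edges between s and t: a state is terminating as soon as one edge is confirmed present or both are confirmed absent.

```python
# Hypothetical two-edge example: two parallel s-t edges.
p = [0.9, 0.5]   # existence probabilities
c = [1.0, 0.2]   # testing costs

def is_terminating(s):
    # s and t are connected once some edge is confirmed present,
    # and disconnected once every edge is confirmed absent.
    return 1 in s or all(v == 0 for v in s)

u, pi = exact_dp(2, p, c, is_terminating)
print(pi[('*', '*')])   # edge chosen first by the computed strategy
print(u[('*', '*')])    # its expected total reward (negative expected cost)
```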