1 Determining Source-Destination Connectivity in Uncertain Networks:Modeling and Solutions Luoyi Fu,Xinzhe Fu,Zhiying Xu2,Qianyang Pengl,Xinbing Wang2,and Songwu Lu3 Dept.of Computer Science,Shanghai Jiao Tong University,China. Email:{yiluofu,fxz0114,az950512,xwang8 @sjtu.edu.cn. 2Dept.of Electrical Engineering,Shanghai Jiao Tong University,China.Email:{xuzhiying}@sjtu.edu.cn 3Dept.of Computer Science,University of California,Los Angeles,USA.Email:{slu}@cs.ucla.edu Abstract-Determination of source-destination connectivity in the networks investigated are modeled as deterministic graphs networks has long been a fundamental problem,where most 4],[5]with the source-destination connectivity problems existing works are based on deterministic graphs that overlook transformed to the corresponding graph reachability problems. the inherent uncertainty in network links.To overcome such limitation,this paper models the network as an uncertain graph However,as indeterminacy plagues in our life,deterministic where each edge e exists independently with some probability graph often fails to serve as a suitable model for networks p(e).The problem examined is that of determining whether nowadays.Usually,we do not have certain knowledge of a given pair of nodes,a source s and a destination t,are existence of network links.For instance,in social networks. connected by a path or separated by a cut.Assuming that during due to the variability of social ties [6],the relations between each determining process we are associated with an underlying graph,the existence of each edge can be unraveled through edge network nodes may not be known in advance:in communi- testing at a cost of c(e).Our goal is to find an optimal strategy cation systems,established connections between nodes may incurring the minimum expected testing cost with the expectation frequently fail because of the unreliability of data links [7]. taken over all possible underlying graphs that form a product [8].It has also been pointed out that more than 90%of distribution. network links are observed to be unreliable [10].Consequent- Formulating it into a combinatorial optimization problem, we first characterize the computational complexity of optimally ly,we may not obtain deterministic network configuration determining source-destination connectivity in uncertain graphs. from the predesigned topology;sometimes we even have to Specifically,through proving the NP-hardness of two closely intentionally blur the links for privacy reasons [11].All those related problems,we show that,contrary to its counterpart in factors impose a need to incorporate uncertainty into the deterministic graphs,this problem cannot be solved in polynomial network,which can essentially be modeled as an uncertain time unless P=NP.Driven by the necessity of designing an exact algorithm,we then apply the Markov Decision Process graph [11],where,instead of appearing deterministically,each framework to give a dynamic programming algorithm that edge is associated with some prior existence probability.The derives the optimal strategies.As the exact algorithm may have existence probabilities not only are symbols of uncertainty,but prohibitive time complexity in practical situations,we further also bear important attributes of network links.Take social propose two more efficient approximation schemes compromising the optimality.The first one is a simple greedy approach with network again for example.These probabilities may represent linear approximation ratio.Interestingly,we show that naive as the confidence of link prediction [12],or the strength of the it is,it enjoys significantly better performance guarantee than influence that a node has on the other [2].In communication some other seemingly more sophisticated algorithms.Second,by networks such as data center networks,these probabilities harnessing the submodularity of the problem,we further design a reflect the failure frequency of communication links [7]. more elaborate algorithm with better approximation ratio.The When the graph is uncertain,traditional methods such as effectiveness of the proposed algorithms are justified through extensive simulations on three real network datasets,from which depth-first-traversal,breadth-first-traversal and graph labeling we demonstrate that the proposed algorithms yield strategies with are no longer suitable for determining the source-destination smaller expected cost than conventional heuristics. connectivity of networks due to the lack of deterministic information on edges'existence.To hedge the uncertainty,we I.INTRODUCTION need to test the edges to determine whether they truly exist or not.However,such edge testing involves far more complicated Source and destination connectivity of networks has sig- procedures than simply identifying uncertain links and thus nificant applications in real life.It concerns crucial issues may turn out to be more costly.For example,in citation such as reliability,routing,information diffusion [1],[2],etc. networks,we can establish probabilistic relationships between Hence,in the past few decades,a lot of research has been papers just by reference information.In contrast,to unravel the dedicated to this problem [3],4,5]and there have been genuine relation between papers,we have to apply advanced many efficient algorithms proposed under various types of data mining approaches which involves considerably more networks.A common feature shared by all those works is that intensive computation.Consequently,it is extremely desirable The early version of this paper is to appear in the Proceedings of IEEE to test the most cost-effective edges,i.e.,to design a testing INFOCOM 2017 [41]. strategy that determines the source-destination connectivity of
1 Determining Source-Destination Connectivity in Uncertain Networks: Modeling and Solutions Luoyi Fu1 , Xinzhe Fu1 , Zhiying Xu2 , Qianyang Peng1 , Xinbing Wang1,2, and Songwu Lu3 1Dept. of Computer Science, Shanghai Jiao Tong University, China. Email:{yiluofu,fxz0114,az950512,xwang8}@sjtu.edu.cn. 2Dept. of Electrical Engineering, Shanghai Jiao Tong University, China. Email:{xuzhiying}@sjtu.edu.cn 3Dept. of Computer Science, University of California, Los Angeles, USA. Email:{slu}@cs.ucla.edu Abstract—Determination of source-destination connectivity in networks has long been a fundamental problem, where most existing works are based on deterministic graphs that overlook the inherent uncertainty in network links. To overcome such limitation, this paper models the network as an uncertain graph where each edge e exists independently with some probability p(e). The problem examined is that of determining whether a given pair of nodes, a source s and a destination t, are connected by a path or separated by a cut. Assuming that during each determining process we are associated with an underlying graph, the existence of each edge can be unraveled through edge testing at a cost of c(e). Our goal is to find an optimal strategy incurring the minimum expected testing cost with the expectation taken over all possible underlying graphs that form a product distribution. Formulating it into a combinatorial optimization problem, we first characterize the computational complexity of optimally determining source-destination connectivity in uncertain graphs. Specifically, through proving the NP-hardness of two closely related problems, we show that, contrary to its counterpart in deterministic graphs, this problem cannot be solved in polynomial time unless P=NP. Driven by the necessity of designing an exact algorithm, we then apply the Markov Decision Process framework to give a dynamic programming algorithm that derives the optimal strategies. As the exact algorithm may have prohibitive time complexity in practical situations, we further propose two more efficient approximation schemes compromising the optimality. The first one is a simple greedy approach with linear approximation ratio. Interestingly, we show that naive as it is, it enjoys significantly better performance guarantee than some other seemingly more sophisticated algorithms. Second, by harnessing the submodularity of the problem, we further design a more elaborate algorithm with better approximation ratio. The effectiveness of the proposed algorithms are justified through extensive simulations on three real network datasets, from which we demonstrate that the proposed algorithms yield strategies with smaller expected cost than conventional heuristics. I. INTRODUCTION Source and destination connectivity of networks has significant applications in real life. It concerns crucial issues such as reliability, routing, information diffusion [1], [2], etc. Hence, in the past few decades, a lot of research has been dedicated to this problem [3], [4], [5] and there have been many efficient algorithms proposed under various types of networks. A common feature shared by all those works is that The early version of this paper is to appear in the Proceedings of IEEE INFOCOM 2017 [41]. the networks investigated are modeled as deterministic graphs [4], [5] with the source-destination connectivity problems transformed to the corresponding graph reachability problems. However, as indeterminacy plagues in our life, deterministic graph often fails to serve as a suitable model for networks nowadays. Usually, we do not have certain knowledge of existence of network links. For instance, in social networks, due to the variability of social ties [6], the relations between network nodes may not be known in advance; in communication systems, established connections between nodes may frequently fail because of the unreliability of data links [7], [8]. It has also been pointed out that more than 90% of network links are observed to be unreliable [10]. Consequently, we may not obtain deterministic network configuration from the predesigned topology; sometimes we even have to intentionally blur the links for privacy reasons [11]. All those factors impose a need to incorporate uncertainty into the network, which can essentially be modeled as an uncertain graph [11], where, instead of appearing deterministically, each edge is associated with some prior existence probability. The existence probabilities not only are symbols of uncertainty, but also bear important attributes of network links. Take social network again for example. These probabilities may represent the confidence of link prediction [12], or the strength of the influence that a node has on the other [2]. In communication networks such as data center networks, these probabilities reflect the failure frequency of communication links [7]. When the graph is uncertain, traditional methods such as depth-first-traversal, breadth-first-traversal and graph labeling are no longer suitable for determining the source-destination connectivity of networks due to the lack of deterministic information on edges’ existence. To hedge the uncertainty, we need to test the edges to determine whether they truly exist or not. However, such edge testing involves far more complicated procedures than simply identifying uncertain links and thus may turn out to be more costly. For example, in citation networks, we can establish probabilistic relationships between papers just by reference information. In contrast, to unravel the genuine relation between papers, we have to apply advanced data mining approaches which involves considerably more intensive computation. Consequently, it is extremely desirable to test the most cost-effective edges, i.e., to design a testing strategy that determines the source-destination connectivity of
3 uncertain networks incurring minimum cost.Furthermore,to heuristics as they achieve better tradeoff between the fully utilize the results of previous tests,the strategy should be complexity of the algorithm and the optimality of the adaptive,which means that we may determine the next edge to solutions. test based on the edge existence information we have already The rest of the paper is organized as follows.we review acquired through previous tests.And we defer a more detailed related studies in SectionⅡ.In SectionⅡ,.we formally explanation of how the problem of interest can be applied to introduce the definitions and notations related to our problem. other realistic scenarios to the end of Section III-B. In Section IV,we investigate the computational complexity In this paper,we are thus motivated to present a first of the problem.We present our exact dynamic programming look into the problem of determining source-destination con- algorithm based on Markov Decision Process framework in nectivity in uncertain networks.Given a network modeled Section V.In Section VI,we present the two approximation as an uncertain graph with each edge associated with an algorithms and we evaluate our algorithms on real life data existence probability and a testing cost,together with two in Section VII.We give concluding remarks as well as future network nodes s,t designated as source and destination,we directions in Section VIII. aim to derive efficient strategy specifying which edges to test so that we can verify whether s and t are connected II.RELATED WORK by a path or separated by a cut with the minimum cost incurred.Note that the source and destination connectivity 1)Uncertain Networks:Uncertain network has been under is also referred to as s-t connectivity.Comparing with s-t intensive study for long.However,instead of verifying the connectivity in deterministic graphs that can be easily solved existence of some structures in uncertain networks,most by graph traversal methods in polynomial time,by proving the efforts have been devoted to calculating the existence prob- NP-hardness of the problem,we find that the s-t connectivity ability of those structures.One of the fundamental problems in uncertain graphs turns out to be far more complicated in that regard is the network reliability problem,which asks and highly non-trivial.Driven by the necessity of pursuing the probability that uncertain networks are connected [1]. exact algorithms that can capture the features of the optimal Following this.Jin et al.consider the distance-constrained solutions,we proceed by converting our problem into an reachability,i.e..the probability that two nodes are connected equivalent Markov Decision Process (MDP)to give a dynamic by a path shorter than a predefined threshold in an uncertain programming algorithm that yields optimal strategies but has network [13].The work in [14]focuses on discovering sub- exponential running time.Considering that the prohibitive graphs with high reliability measure.In recent years,other time complexity of such exact algorithm renders it unsuitable types of study on uncertain networks(graphs)include reliable for practical applications,we therefore design approximation topology design [15],extracting representative subgraphs for schemes to compromise the optimality of computed strategy the acceleration of various querying processes [16],perfor- for the efficiency of the algorithms.In doing so,we first put mance analysis of unreliable wireless networks [8]as well as forward a simple greedy approach that computes near optimal connectivity and resilience of secure sensor networks [9]. solutions with linear approximation guarantee,which can be The modeling of uncertain networks in the present work is further improved by a second algorithm we propose through similar to random graph which was first introduced by Erdos the exploration of submodularity in our problem. and Renyi in [17].Despite of this similarity,the problems Our key contributions are summarized as follows: investigated are quite different.Specifically,previous works Theory:We formally define the problem of determining on random graphs [17][18]is dedicated to the analysis of s-t connectivity in uncertain networks.We prove compu- model property in an asymptotic sense where the number tational complexity-theoretic results of the problem show-of nodes goes to infinity.In contrast,our focus here lies in ing that it cannot be solved in polynomial time unless the combinatorial optimization problem formulated from the P=NP.The results provide useful insights to the inherent determination of source-destination connectivity in uncertain hardness and combinatoric nature of our problem. networks,with the model serving as a mean to characterize Algorithm:We derive an exact dynamic programming the uncertainty in networks and a parsimonious media for algorithm by converting our problem into an equivalent fi- extracting the essence of the problem. nite horizon Markov Decision Process.To further counter 2)Sequential Testing:The nature of our problem is analo- the problem,we design two approximation schemes.The gous to a class of sequential testing problems which involves first one is a simple greedy approach and we show that diagnosing a system by determining the states of its compo- naive as it is,it can provide non-trivial performance nents through a series of tests.The dependency of the whole guarantee.More surprisingly,its performance is far better system on its components'states can be given by explicit than some other more complicated algorithms.Then,we function or via an oracle.Existing results include optimal further improve the approximation ratio of the greedy diagnosing strategies on series and parallel systems,double algorithm by utilizing the submodularity of the problem regular systems,etc.See [19]for a comprehensive review.A in the second algorithm. special class of sequential testing problems called Stochastic Application:We demonstrate the effectiveness of our Boolean Function Evaluation (SBFE)have close connection algorithms on practical applications through extensive to our problem.In SBFE,each component has two states simulations with real network datasets.It is shown that and thus can be considered as a Boolean variable and the our proposed algorithms are superior to the conventional system is given by a Boolean function.The works in [20]and
2 uncertain networks incurring minimum cost. Furthermore, to fully utilize the results of previous tests, the strategy should be adaptive, which means that we may determine the next edge to test based on the edge existence information we have already acquired through previous tests. And we defer a more detailed explanation of how the problem of interest can be applied to other realistic scenarios to the end of Section III-B. In this paper, we are thus motivated to present a first look into the problem of determining source-destination connectivity in uncertain networks. Given a network modeled as an uncertain graph with each edge associated with an existence probability and a testing cost, together with two network nodes s, t designated as source and destination, we aim to derive efficient strategy specifying which edges to test so that we can verify whether s and t are connected by a path or separated by a cut with the minimum cost incurred. Note that the source and destination connectivity is also referred to as s-t connectivity. Comparing with s-t connectivity in deterministic graphs that can be easily solved by graph traversal methods in polynomial time, by proving the NP-hardness of the problem, we find that the s-t connectivity in uncertain graphs turns out to be far more complicated and highly non-trivial. Driven by the necessity of pursuing exact algorithms that can capture the features of the optimal solutions, we proceed by converting our problem into an equivalent Markov Decision Process (MDP) to give a dynamic programming algorithm that yields optimal strategies but has exponential running time. Considering that the prohibitive time complexity of such exact algorithm renders it unsuitable for practical applications, we therefore design approximation schemes to compromise the optimality of computed strategy for the efficiency of the algorithms. In doing so, we first put forward a simple greedy approach that computes near optimal solutions with linear approximation guarantee, which can be further improved by a second algorithm we propose through the exploration of submodularity in our problem. Our key contributions are summarized as follows: • Theory: We formally define the problem of determining s-t connectivity in uncertain networks. We prove computational complexity-theoretic results of the problem showing that it cannot be solved in polynomial time unless P=NP. The results provide useful insights to the inherent hardness and combinatoric nature of our problem. • Algorithm: We derive an exact dynamic programming algorithm by converting our problem into an equivalent fi- nite horizon Markov Decision Process. To further counter the problem, we design two approximation schemes. The first one is a simple greedy approach and we show that naive as it is, it can provide non-trivial performance guarantee. More surprisingly, its performance is far better than some other more complicated algorithms. Then, we further improve the approximation ratio of the greedy algorithm by utilizing the submodularity of the problem in the second algorithm. • Application: We demonstrate the effectiveness of our algorithms on practical applications through extensive simulations with real network datasets. It is shown that our proposed algorithms are superior to the conventional heuristics as they achieve better tradeoff between the complexity of the algorithm and the optimality of the solutions. The rest of the paper is organized as follows. we review related studies in Section II. In Section III, we formally introduce the definitions and notations related to our problem. In Section IV, we investigate the computational complexity of the problem. We present our exact dynamic programming algorithm based on Markov Decision Process framework in Section V. In Section VI, we present the two approximation algorithms and we evaluate our algorithms on real life data in Section VII. We give concluding remarks as well as future directions in Section VIII. II. RELATED WORK 1) Uncertain Networks: Uncertain network has been under intensive study for long. However, instead of verifying the existence of some structures in uncertain networks, most efforts have been devoted to calculating the existence probability of those structures. One of the fundamental problems in that regard is the network reliability problem, which asks the probability that uncertain networks are connected [1]. Following this, Jin et al. consider the distance-constrained reachability, i.e., the probability that two nodes are connected by a path shorter than a predefined threshold in an uncertain network [13]. The work in [14] focuses on discovering subgraphs with high reliability measure. In recent years, other types of study on uncertain networks (graphs) include reliable topology design [15], extracting representative subgraphs for the acceleration of various querying processes [16], performance analysis of unreliable wireless networks [8] as well as connectivity and resilience of secure sensor networks [9]. The modeling of uncertain networks in the present work is similar to random graph which was first introduced by Erdos¨ and Renyi in [17]. Despite of this similarity, the problems investigated are quite different. Specifically, previous works on random graphs [17] [18] is dedicated to the analysis of model property in an asymptotic sense where the number of nodes goes to infinity. In contrast, our focus here lies in the combinatorial optimization problem formulated from the determination of source-destination connectivity in uncertain networks, with the model serving as a mean to characterize the uncertainty in networks and a parsimonious media for extracting the essence of the problem. 2) Sequential Testing: The nature of our problem is analogous to a class of sequential testing problems which involves diagnosing a system by determining the states of its components through a series of tests. The dependency of the whole system on its components’ states can be given by explicit function or via an oracle. Existing results include optimal diagnosing strategies on series and parallel systems, double regular systems, etc. See [19] for a comprehensive review. A special class of sequential testing problems called Stochastic Boolean Function Evaluation (SBFE) have close connection to our problem. In SBFE, each component has two states and thus can be considered as a Boolean variable and the system is given by a Boolean function. The works in [20] and
0 ⊙ © 02 s 0,', 06 PrG=0.16 PrG)=0.16 Pr(G=0.04 0,l,}9 0,,} )2 K 个 ,, ⊙ ",, plea)=0.5 clez)=5 tcerlain graph a possible underlying graph ,*,0乌 PrG)=0.24 PHG)=0.04 ,0,}% 它 0 f,101乌 ,0,19 a known edge .---a potential edge {,0,049 a known non-edge PrG)=0.06 PHG)=0.24 PrG)=0.06 Fig.2.The table in the left demonstrates an adaptive testing strategy with the action of terminating states omitted.The left entries denote the temporary Fig.1.An uncertain graph with three edges and its eight possible underlying states and the right entries represent the corresponding testing edges.The right graphs.The existence probability of each edge is labeled beside it.For part illustrates the transition of temporary states when the strategy is executed clearance.we do not show the direction of each edge. on the underlying graph in the figure.Following the strategy on the left,when the underlying graph is as shown on the right,the evolution of temporary state [21]propose approximate algorithms for evaluating DNF,CNF is{◆,*,*},{0,*,*},{0,l,*},{0,l,l}.Note that some non-terminating states are and CDNF formulas.Deshpande et al.[22]propose a general not reachable during the testing process starting from the initial state,but we still show them in the figure in accordance to the definition of adaptive testing method called the Q-value approach to approximately solve strategy,which is a mapping defined on the set of all temporary states.For SBFE problems based on the adaptive submodular framework clearance,we do not show the direction of each edge. proposed in [23]. We note that,there are no previous works that study the element“O”,“I”amd“*”.And we define S={0,l,*}Eto same problem as ours except the two from Kowshik [24]be the set of temporary states associated with g. and Fu et al.[25][26],respectively,but in more restrictive settings.Particularly,Kowshik derive the optimal solution for Each temporary state s E S represents a set of outcomes s-t connectivity problem in parallel-series and series-parallel during the testing process,where"0"means the corresponding uncertain graphs [24].Fu et al.[25][26]propose an efficient edge has been tested and found not existing,"1"means algorithm and prove its optimality in an ER graph,i.e,a the corresponding edge has been tested and found existing complete graph where each edge has the same probability of and "*"means the corresponding edge has not been tested. existence and the cost of testing each edge is uniform.Our Additionally,we denote the condition of edge e in state s as work is the first attempt to consider whether optimality exists se.As our goal is to determine the s-t connectivity of the in determination of s-t connectivity in a general uncertain underlying graph for g,for a temporary state s,we define it graph to be a terminating state if either the edge set fe se =1 forms a superset of an s-t path in g or edge set fe se=0 III.MODELS AND PROBLEM FORMULATION forms a superset of an s-t cut'in g.We successfully determine A.Uncertain Graph Model the s-t connectivity by reaching a terminating state We denote an uncertain directed graph by g=(V,E,p,c),Definition 2.(Adaptive Testing Strategy)An adaptive testing where V is the set of vertices,E is the set of edges,p:E strategy is a deterministic mappingπ:S→EU{⊥}.Initially (0.1 is a function that assigns each edge e its corresponding starting from the all-*state,an adaptive testing strategy existence probability,and c:ER+represents the testing specifies which edge to test (or terminate as denoted by L) cost of each edge based on the previous testing outcomes. Following the state of art [13].we assume the existence probability of each edge to be independent.And we interpret In the present work,we restrict our consideration to reason- G as a distribution on the set {G=(V,EG),EC CE}of El able strategies where all the terminating states are mapped toL possible underlying deterministic graphs,where denotes the and no state is mapped to any edge that has already been tested cardinality of a set.The probability of a deterministic graph in that state.Also note that some states may never be reached G(V,EG)being the underlying graph is: but we still include them in the strategy for consistency. During each determining process,we are associated with Pr(G=Πp(e)Π(1-p(e). an underlying graph.The outcome of tests are dictated by the eEE\EG underlying graph and after each test the current temporary state We also use G EG to represent that G is a possible underlying will evolve into a new state.Therefore,an adaptive testing graph for g.We define g to be s-t connected if there exists an strategy may test different sets of edges before termination s-t path in the underlying graph of g.Figure 1 demonstrates when executed on different underlying graphs of C.For a an example of a three-edge uncertain graph with its possible specific underlying graph G.we denote E(G)as the set of underlying graphs. edges strategy m tests on it.Note that as G is deterministic, E(G)is also deterministic.It follows that the expected cost B.Problem Formulation Definition 1.(Temporary State)A temporary state s of an All the cuts in this paper are graph s-t cut,ie.,the minimal cut sets that uncertain graph g(V,E,p,c)is an E-dimension vector with partition s and t into different subsets
3 1e 2 e 3e 1 ( ) 0.2 p e = 2 p(e ) = 0.5 3 ( ) 0.6 p e = Pr(G) = 0.16 Pr(G) = 0.16 Pr(G) = 0.04 Pr(G) = 0.04 Pr(G) = 0.06 Pr(G) = 0.24 Pr(G) = 0.06 Pr(G) = 0.24 s s s s s s s s s t t t t t t t t t 1 2 3 4 5 6 7 8 Fig. 1. An uncertain graph with three edges and its eight possible underlying graphs. The existence probability of each edge is labeled beside it. For clearance, we do not show the direction of each edge. [21] propose approximate algorithms for evaluating DNF, CNF and CDNF formulas. Deshpande et al. [22] propose a general method called the Q-value approach to approximately solve SBFE problems based on the adaptive submodular framework proposed in [23]. We note that, there are no previous works that study the same problem as ours except the two from Kowshik [24] and Fu et al. [25] [26], respectively, but in more restrictive settings. Particularly, Kowshik derive the optimal solution for s-t connectivity problem in parallel-series and series-parallel uncertain graphs [24]. Fu et al. [25] [26] propose an efficient algorithm and prove its optimality in an ER graph, i.e, a complete graph where each edge has the same probability of existence and the cost of testing each edge is uniform. Our work is the first attempt to consider whether optimality exists in determination of s-t connectivity in a general uncertain graph. III. MODELS AND PROBLEM FORMULATION A. Uncertain Graph Model We denote an uncertain directed graph by G = (V, E, p, c), where V is the set of vertices, E is the set of edges, p : E 7→ (0, 1] is a function that assigns each edge e its corresponding existence probability, and c : E 7→ R + represents the testing cost of each edge. Following the state of art [13], we assume the existence probability of each edge to be independent. And we interpret G as a distribution on the set {G = (V, EG), EG ⊆ E} of 2 |E| possible underlying deterministic graphs, where |·| denotes the cardinality of a set. The probability of a deterministic graph G(V, EG) being the underlying graph is: P r(G) = Y e∈EG p(e) Y e∈E\EG (1 − p(e)). We also use G ∈ G to represent that G is a possible underlying graph for G. We define G to be s-t connected if there exists an s-t path in the underlying graph of G. Figure 1 demonstrates an example of a three-edge uncertain graph with its possible underlying graphs. B. Problem Formulation Definition 1. (Temporary State) A temporary state s of an uncertain graph G(V, E, p, c) is an |E|-dimension vector with 1e 2 e 3e 1 ( ) 0.2 p e = 2 p(e ) = 0.5 3 ( ) 0.6 p e = s t s t 1 ( ) 3 c e = 2 c(e ) = 5 3 ( ) 2 c e = 1 2 3 2 3 1 1 1 1 1 1 {*,*,*} {0,*,*} {0,1,*} {0,*,1} {*,1,*} {*,*,1} {*,*,0} {*,0,*} {*,1,0} {*,0,1} {*,0,0} e e e e e e e e e e e 1e 2e 3e uncertain graph a possible underlying graph a known edge a known non - edge a potential edge Fig. 2. The table in the left demonstrates an adaptive testing strategy with the action of terminating states omitted. The left entries denote the temporary states and the right entries represent the corresponding testing edges. The right part illustrates the transition of temporary states when the strategy is executed on the underlying graph in the figure. Following the strategy on the left, when the underlying graph is as shown on the right, the evolution of temporary state is {*,*,*},{0,*,*},{0,1,*},{0,1,1}. Note that some non-terminating states are not reachable during the testing process starting from the initial state, but we still show them in the figure in accordance to the definition of adaptive testing strategy, which is a mapping defined on the set of all temporary states. For clearance, we do not show the direction of each edge. element “0”, “1” and “*”. And we define S = {0, 1, ∗}|E| to be the set of temporary states associated with G. Each temporary state s ∈ S represents a set of outcomes during the testing process, where “0” means the corresponding edge has been tested and found not existing, “1” means the corresponding edge has been tested and found existing and “*” means the corresponding edge has not been tested. Additionally, we denote the condition of edge e in state s as se. As our goal is to determine the s-t connectivity of the underlying graph for G, for a temporary state s, we define it to be a terminating state if either the edge set {e | se = 1} forms a superset of an s-t path in G or edge set {e | se = 0} forms a superset of an s-t cut1 in G. We successfully determine the s-t connectivity by reaching a terminating state. Definition 2. (Adaptive Testing Strategy) An adaptive testing strategy is a deterministic mapping π : S 7→ E∪{⊥}. Initially starting from the all-∗ state, an adaptive testing strategy specifies which edge to test (or terminate as denoted by ⊥) based on the previous testing outcomes. In the present work, we restrict our consideration to reasonable strategies where all the terminating states are mapped to ⊥ and no state is mapped to any edge that has already been tested in that state. Also note that some states may never be reached but we still include them in the strategy for consistency. During each determining process, we are associated with an underlying graph. The outcome of tests are dictated by the underlying graph and after each test the current temporary state will evolve into a new state. Therefore, an adaptive testing strategy may test different sets of edges before termination when executed on different underlying graphs of G. For a specific underlying graph G, we denote Eπ(G) as the set of edges strategy π tests on it. Note that as G is deterministic, Eπ(G) is also deterministic. It follows that the expected cost 1All the cuts in this paper are graph s-t cut, i.e., the minimal cut sets that partition s and t into different subsets
of a is given by: whole network structure,it is reasonable to render the link Cost)=∑[Pr(G)∑c(el existence between nodes as a preassigned probability.And the algorithm for the problem thus helps us more efficiently GEg e∈Er(G) discover the relationship between nodes,which manifests its where (c)c(e)equals to the cost incurred by when significant use in link prediction [311.Moreover,the structure the underlying graph is G,and the expected cost is the weight- of the social networks can also be revealed [32],[33]by ed sum of the costs incurred on all the possible underlying applying the algorithm to different node pairs.Last but not graphs.An example of an adaptive testing strategy on an least,more potential applications of the illustrated scenario uncertain graph is illustrated in Figure 2. also include sensor networks [28],P2P networks [29]and Data Based on all the conditions above,now we give a formal center networks [7]. definition of our problem stated as follows. IV.COMPUTATIONAL COMPLEXITY Definition 3.(The Connectivity Determination Problem)Giv- en an uncertain directed graph2g(V,E,p,c)with two nodes In this section,we investigate the computational complexity s,t EV designated as source and destination,respectively, of the Connectivity Determination problem.By demonstrating the goal is to find an adaptive testing strategy n that incurs the hardness of two closely related problems,we show both the minimum expected cost. computing the testing strategy with the minimum expected cost holistically and sequentially are NP-hard.More specifically, Remark:Apart from deriving the strategy's action in all we first convert our problem into its corresponding decision temporary states at once,an algorithm for the Connectivity version that asks for the existence of an adaptive testing Determination problem can instead compute the strategy se- strategy with expected cost less than some value l for a given quentially,only deciding the next edge to test based on the uncertain graph.Then.we consider the problem of deciding current state.In algorithmic point of view,we consider the which edge to test first in the optimal strategy.The inherent time complexity of an algorithm for Connectivity Determina-tension of the Connectivity Determination problem is therefore tion problem as the maximum time it takes to compute all the disclosed through demonstrating the NP-hardness of these two relevant actions of a determining process.Therefore,finding problems,as stated in Theorems 1 and 2,respectively. the optimal strategy in a sequential fashion,on the surface, may appear to simplify the problem compared to computing it Theorem 1.The decision version of Connectivity Determina- holistically.However,we show in next section that the problem tion Problem is NP-hard. is NP-hard regardless of in which way we compute the optimal Proof:Inspired by [27],we prove the theorem by re- strategy.Table I summarizes the notations that will be used duction from the s-t reliability problem [1]:Given a directed throughout the paper. graph G and two nodes s and t.The s-t reliability is to Applicability of the model:We illustrate how to project our compute the probability of s being connected to t assuming model into real situations using several examples.Let us firstly the edges in G exist independently with probability As s-t take communication networks for example,where the edge reliability problem is #P-hard [1]3,its decision version that existence probability corresponds to the reliability of network quests whether the probability of s being connected to t is links.The testing cost represents the probing costs of the links. larger than some predefined value ro is NP-hard. The connectivity determination problem asks for a minimum The reduction works as follows.For a graph G(V,E),we cost probing strategy determining the source-destination con- transform it to an uncertain graph G(V,E',p,c)by adding an nectivity of the networks,allowing for the prevalent existence edge M between s.t and set the rest of g is just the same of edge uncertainty.Other than communication networks,such as G.Define n as the number of edges in G.We set the cost uncertain graph can also be projected into the structure of a of M as c(M)=n2+1 and the cost of testing other edges priced information,with each edge representing a piece of as 1.Then we assign the existence probability of all edges in information (data)and its testing cost corresponds to the price g as Finally,we designate s,t in g as the source and the of the data.Under such circumstance,the optimal strategy al- destination in the constructed instance. lows us to successfully query the information paying minimum Let r be the s-t reliability in G and I be the expected cost price [30].Another application is a social network graph where incurred by the optimal testing strategy on g.We define a an edge can be treated as a first cousin relationship,and the generic G'as a subgraph resulted from an underlying graph object of interest is to determine whether two individuals from of G with edge M removed.We will show that if we know l, a large population are distant cousins.Obviously,determining then we can efficiently compute r. whether an edge exists between two individuals is usually First,from the definitions,we have r=for some very expensive,involving costly genetic testing that outweighs integer k,and I must obey the following two constraints: any computational cost.An edge could of course represent I>(1-r)c(M)and I<rn +(1-r)c(M).Here,the a variety of other relationships that are expensive to check,first inequality follows from the fact that we have to test M for instance due to confidentiality or physical restrictions.In whenever we find out that s and t is not connected in G those scenarios where users have no prior knowledge of the The second inequality holds since the expected cost of the 2Without loss of generality.we assume the graph with vertex set V and 3#P is a complexity class for counting problems.#P-hard is at least as hard edge set E is s-t connected,i.e.,g is s-t connected if all its edges exist. as NP-hard [1]
4 of π is given by: Cost(π) = X G∈G [P r(G) X e∈Eπ(G) c(e)], where P e∈Eπ(G) c(e) equals to the cost incurred by π when the underlying graph is G, and the expected cost is the weighted sum of the costs incurred on all the possible underlying graphs. An example of an adaptive testing strategy on an uncertain graph is illustrated in Figure 2. Based on all the conditions above, now we give a formal definition of our problem stated as follows. Definition 3. (The Connectivity Determination Problem) Given an uncertain directed graph2 G(V, E, p, c) with two nodes s, t ∈ V designated as source and destination, respectively, the goal is to find an adaptive testing strategy π that incurs the minimum expected cost. Remark: Apart from deriving the strategy’s action in all temporary states at once, an algorithm for the Connectivity Determination problem can instead compute the strategy sequentially, only deciding the next edge to test based on the current state. In algorithmic point of view, we consider the time complexity of an algorithm for Connectivity Determination problem as the maximum time it takes to compute all the relevant actions of a determining process. Therefore, finding the optimal strategy in a sequential fashion, on the surface, may appear to simplify the problem compared to computing it holistically. However, we show in next section that the problem is NP-hard regardless of in which way we compute the optimal strategy. Table I summarizes the notations that will be used throughout the paper. Applicability of the model: We illustrate how to project our model into real situations using several examples. Let us firstly take communication networks for example, where the edge existence probability corresponds to the reliability of network links. The testing cost represents the probing costs of the links. The connectivity determination problem asks for a minimum cost probing strategy determining the source-destination connectivity of the networks, allowing for the prevalent existence of edge uncertainty. Other than communication networks, such uncertain graph can also be projected into the structure of a priced information, with each edge representing a piece of information (data) and its testing cost corresponds to the price of the data. Under such circumstance, the optimal strategy allows us to successfully query the information paying minimum price [30]. Another application is a social network graph where an edge can be treated as a first cousin relationship, and the object of interest is to determine whether two individuals from a large population are distant cousins. Obviously, determining whether an edge exists between two individuals is usually very expensive, involving costly genetic testing that outweighs any computational cost. An edge could of course represent a variety of other relationships that are expensive to check, for instance due to confidentiality or physical restrictions. In those scenarios where users have no prior knowledge of the 2Without loss of generality, we assume the graph with vertex set V and edge set E is s-t connected, i.e., G is s-t connected if all its edges exist. whole network structure, it is reasonable to render the link existence between nodes as a preassigned probability. And the algorithm for the problem thus helps us more efficiently discover the relationship between nodes, which manifests its significant use in link prediction [31]. Moreover, the structure of the social networks can also be revealed [32], [33] by applying the algorithm to different node pairs. Last but not least, more potential applications of the illustrated scenario also include sensor networks [28], P2P networks [29] and Data center networks [7]. IV. COMPUTATIONAL COMPLEXITY In this section, we investigate the computational complexity of the Connectivity Determination problem. By demonstrating the hardness of two closely related problems, we show both computing the testing strategy with the minimum expected cost holistically and sequentially are NP-hard. More specifically, we first convert our problem into its corresponding decision version that asks for the existence of an adaptive testing strategy with expected cost less than some value l for a given uncertain graph. Then, we consider the problem of deciding which edge to test first in the optimal strategy. The inherent tension of the Connectivity Determination problem is therefore disclosed through demonstrating the NP-hardness of these two problems, as stated in Theorems 1 and 2, respectively. Theorem 1. The decision version of Connectivity Determination Problem is NP-hard. Proof: Inspired by [27], we prove the theorem by reduction from the s-t reliability problem [1]: Given a directed graph G and two nodes s and t. The s-t reliability is to compute the probability of s being connected to t assuming the edges in G exist independently with probability 1 2 . As s-t reliability problem is #P-hard [1]3 , its decision version that quests whether the probability of s being connected to t is larger than some predefined value r0 is NP-hard. The reduction works as follows. For a graph G(V, E), we transform it to an uncertain graph G(V, E0 , p, c) by adding an edge M between s, t and set the rest of G is just the same as G. Define n as the number of edges in G. We set the cost of M as c(M) = n2 n+1 and the cost of testing other edges as 1. Then we assign the existence probability of all edges in G as 1 2 . Finally, we designate s, t in G as the source and the destination in the constructed instance. Let r be the s-t reliability in G and l be the expected cost incurred by the optimal testing strategy on G. We define a generic G0 as a subgraph resulted from an underlying graph of G with edge M removed. We will show that if we know l, then we can efficiently compute r. First, from the definitions, we have r = k 2n for some integer k, and l must obey the following two constraints: l ≥ (1 − r)c(M) and l ≤ rn + (1 − r)c(M) . Here, the first inequality follows from the fact that we have to test M whenever we find out that s and t is not connected in G0 . The second inequality holds since the expected cost of the 3#P is a complexity class for counting problems. #P-hard is at least as hard as NP-hard [1]
TABLE I NOTATIONS AND DEFINITIONS Theorem 1 establishes the NP-hardness of the decision version of our problem,which implies the NP-hardness of computing Notation Definition the optimal strategy in a holistic fashion.Theorem 2 shows uncertain directed graph vertex set of the uncertain graph that even computing the optimal testing strategy sequentially edge set of the uncertain graph cannot be completed in polynomial time unless P=NP. e probability function indicating the existence probability of edges in the uncertain graph cost function indicating the testing cost of edges in the uncertain graph G underlying deterministic graph S={8,e3} s,t source and destination set of temporary states 32={8,6} s,s,a,b temporary states S={62,6,6} Se the element corresponding to edge e in state s M 开 adaptive testing strategy ExG)】 set of edges tests before termination Fig.3.The uncertain graph (right)constructed for the set cover instance when the underlying graph is G (left).We omit the probability and cost of edges in the uncertain graph and Cost() the expected cost of strategy m refer them to the Appendix A. utility function in the Markov Decision Process utility function in the Q-value approach the collection of s-t paths in g V.MDP-BASED EXACT ALGORITHM C the collection of s-t cuts in g The NP-hardness analysis in the previous section implies Pe the subfamily of s-t paths in g that edge e lies on Ce the subfamily of s-t cuts in g that edge e lies on that solving the problem exactly may lead to a prohibitively the goal value in the Q-value approach large cost.However,it is still essential to design an exact algorithm to capture the features of the optimal solutions and gain insights of our Connectivity Determination problem. optimal strategy is certainly no greater than that of a simple The main idea of seeking for an exact algorithm is through strategy that first test all the edges in E and test M if no converting our problem into an equivalent Markov Decision s-t path is found.Combining the two inequalities,we have Process (MDP).Adopting the notations in [34],in the sequel, 2mc≤k≤2"c0.Consequently,k=L2e40g」 c(M) c(M)-n c(M)-n- we will first show how the elements in our problem can be Therefore,if we have a polynomial time algorithm that solves naturally mapped to the components in a finite horizon MDP. the decision version of Connectivity Determination problem, then we can efficiently solve the decision version of s-t reli- A.Mapping the Problem into MDP ability problem.Since the latter is NP-hard,we conclude that As a mathematical framework for planning or navigating the decision version of Connectivity Determination problem is ◆ uncertain systems,MDP models the way of a decision maker's also NP-hard. choosing actions so that the system can perform optimally Theorem 2.Deciding the optimal first edge to test (the edge with regard to some predefined criterion.The key components tested by the optimal strategy in the intial state)is NP-hard. of an MDP include decision epochs,state space,action sets, transition probabilities,rewards,decision policy and optimality Proof:We only present a proof sketch here and refer the criterion.Regarding this,now we show the mapping between details to Appendix A.The proof is done by reduction from these components and the elements in our problem one by set cover problem.Given a universe of elements,a family of one.The correspondence is also summarized in Table II. subsets of the universe and a predefined integer k,a cover is a subfamily of sets whose union equals to the universe. TABLE II The set cover problem asks whether there exists a cover of THE CORRELATION BETWEEN CONNECTIVITY DETERMINATION PROBLEM AND MARKOV DECISION PROCESS cardinality less than k.For a set cover instance,we construct a corresponding uncertain graph as follows.We first create Markov Decision Process Connectivity Determination Problem a set vertex for each subset in the family and an element state space set of temporary states S state vertex for each element in the universe.Next,we add three a temporary state s action set testing or termination,EU special vertices:source s,destination t and a special set vertex transition probability function P probability function p of edges sM.Then,we add edges from s to each set vertex,from reward function r cost function c of edges each element vertex to t and from each set vertex to the decision policy adaptive testing strategy element vertices it contains in the original instance.Specially. Decision Epochs:In an MDP,decisions are made at we add edges from sM to all the element vertices.By carefully points of time called decision epochs.In our problem the assigning the cost and probability of each edge,we prove that decision epochs are the times we need to decide which the optimal first edge to test is the edge M from s to sM if and edge to test next or terminate.Since we at most need only if there does not exist a cover of size smaller than k in to test E edges whereE is the number of possible the original set cover instance.Figure 3 presents the uncertain edges in the uncertain graph,our corresponding MDP is graph constructed for a set cover instance. ◆ of finite horizon. Remark:The two theorems characterize the complexity State Space:The state space of an MDP represents of the Connectivity Determination problem from two aspects. the possible states that a system can be in.It naturally
5 TABLE I NOTATIONS AND DEFINITIONS Notation Definition G uncertain directed graph V vertex set of the uncertain graph E edge set of the uncertain graph p probability function indicating the existence probability of edges in the uncertain graph c cost function indicating the testing cost of edges in the uncertain graph G underlying deterministic graph s, t source and destination S set of temporary states s, s 0 , a, b temporary states se the element corresponding to edge e in state s π adaptive testing strategy Eπ(G) set of edges π tests before termination when the underlying graph is G Cost(π) the expected cost of strategy π u utility function in the Markov Decision Process g utility function in the Q-value approach P the collection of s-t paths in G C the collection of s-t cuts in G Pe the subfamily of s-t paths in G that edge e lies on Ce the subfamily of s-t cuts in G that edge e lies on Q the goal value in the Q-value approach optimal strategy is certainly no greater than that of a simple strategy that first test all the edges in E and test M if no s-t path is found. Combining the two inequalities, we have 2 n c(M)−l c(M) ≤ k ≤ 2 n c(M)−l c(M)−n . Consequently, k = b2 n c(M)−l c(M)−n c. Therefore, if we have a polynomial time algorithm that solves the decision version of Connectivity Determination problem, then we can efficiently solve the decision version of s-t reliability problem. Since the latter is NP-hard, we conclude that the decision version of Connectivity Determination problem is also NP-hard. Theorem 2. Deciding the optimal first edge to test (the edge tested by the optimal strategy in the intial state) is NP-hard. Proof: We only present a proof sketch here and refer the details to Appendix A. The proof is done by reduction from set cover problem. Given a universe of elements, a family of subsets of the universe and a predefined integer k, a cover is a subfamily of sets whose union equals to the universe. The set cover problem asks whether there exists a cover of cardinality less than k. For a set cover instance, we construct a corresponding uncertain graph as follows. We first create a set vertex for each subset in the family and an element vertex for each element in the universe. Next, we add three special vertices: source s, destination t and a special set vertex sM. Then, we add edges from s to each set vertex, from each element vertex to t and from each set vertex to the element vertices it contains in the original instance. Specially, we add edges from sM to all the element vertices. By carefully assigning the cost and probability of each edge, we prove that the optimal first edge to test is the edge M from s to sM if and only if there does not exist a cover of size smaller than k in the original set cover instance. Figure 3 presents the uncertain graph constructed for a set cover instance. Remark: The two theorems characterize the complexity of the Connectivity Determination problem from two aspects. Theorem 1 establishes the NP-hardness of the decision version of our problem, which implies the NP-hardness of computing the optimal strategy in a holistic fashion. Theorem 2 shows that even computing the optimal testing strategy sequentially cannot be completed in polynomial time unless P=NP. M s t 1s 2 s 3 s SM ε 1 2 ε 3 ε 4 ε 1 1 3 2 1 4 3 2 3 4 { , } { , } { , , } s s s ε ε ε ε ε ε ε = = = Fig. 3. The uncertain graph (right) constructed for the set cover instance (left). We omit the probability and cost of edges in the uncertain graph and refer them to the Appendix A. V. MDP-BASED EXACT ALGORITHM The NP-hardness analysis in the previous section implies that solving the problem exactly may lead to a prohibitively large cost. However, it is still essential to design an exact algorithm to capture the features of the optimal solutions and gain insights of our Connectivity Determination problem. The main idea of seeking for an exact algorithm is through converting our problem into an equivalent Markov Decision Process (MDP). Adopting the notations in [34], in the sequel, we will first show how the elements in our problem can be naturally mapped to the components in a finite horizon MDP. A. Mapping the Problem into MDP As a mathematical framework for planning or navigating uncertain systems, MDP models the way of a decision maker’s choosing actions so that the system can perform optimally with regard to some predefined criterion. The key components of an MDP include decision epochs, state space, action sets, transition probabilities, rewards, decision policy and optimality criterion. Regarding this, now we show the mapping between these components and the elements in our problem one by one. The correspondence is also summarized in Table II. TABLE II THE CORRELATION BETWEEN CONNECTIVITY DETERMINATION PROBLEM AND MARKOV DECISION PROCESS Markov Decision Process Connectivity Determination Problem state space set of temporary states S state a temporary state s action set testing or termination, E ∪ {⊥} transition probability function P probability function p of edges reward function r cost function c of edges decision policy adaptive testing strategy π • Decision Epochs: In an MDP, decisions are made at points of time called decision epochs. In our problem the decision epochs are the times we need to decide which edge to test next or terminate. Since we at most need to test |E| edges where |E| is the number of possible edges in the uncertain graph, our corresponding MDP is of finite horizon. • State Space: The state space of an MDP represents the possible states that a system can be in. It naturally
corresponds to the set of temporary states S in our Lemma 1.For any state s S,the optimal utility function problem.We may also partition the state space S into satisfies E disjoint subsets based on the number of edges having u(s)=maxaeA.(r(s,a)+ss P(s'Is,e)u(s)) been tested in the states as S=So US1U...USEl.In Particularly,if s is a non-terminating state,then decision epoch i,the system can only be in a state in Si. u(s)=maxeEA.{-c(e)+p(e)u(s.e)+(1-p(e))u(s\e)} Action Sets:For each state s E S.there is a set of actions and for any terminating state,its utility is 0. that can be performed under it.We define the associated action set As of state s as the set of edges that have Based on the Lemma 1,we design an algorithm that not been tested in s.Additionally,for terminating states, computes the optimal testing strategy a following the standard their action set also contains the terminating action L.As dynamic programming paradigm,as shown in Algorithm 1. a result,the whole action set A=Uses As =EU[L). Algorithm 1 The MDP-based Exact Algorithm Transition Probabilities and Rewards:The transition probability function and reward function characterize the Input:Uncertain graph G(V,E,p,c),source s,destination t Output:The optimal testing strategy m result of choosing some action at some state.Generally speaking,at each state,choosing an action will gain 上:Initialize:un(s)=O,for all s∈SEl 2:fori=E]to 0 do some reward and the system will evolve into other states 3: for All s in Si do probabilistically at the next decision epoch.Projecting 4 if s is a terminating state then into our problem,the transition probability of action e 5: ux(s):=0,T(s):=⊥. (testing edge e)is given by the existence probability 6 else of edge e.Denote by s.e the temporary state evolved 7: from s by setting se as 1 and by se the temporary state e*:arg maxeeA {-c(e)+p(e)ur(s.e) evolved from s by setting se as 0.Formally,the transition +(1-p(e)un(se)}, 8: u.(s)=-c(e*)+p(e*)u.(s·e*) probability function is given by: p(e) ifs=s·e, +(1-p(e*)un(s\e*), 9: π(s):=e* P(s's,e)= 1-p(e)if s'=s\e, lo:return元 0 otherwise, and 1 if s'=s, We prove the correctness of the dynamic programming P(sls,⊥)= algorithm in the following theorem. 0 otherwise. Then it follows that the reward function is r(s,e)= Theorem 3.For an uncertain graph g,Algorithm I yields -c(e)and r(s,L)=0.Note that the reward function is an optimal adaptive testing strategy and has a complexity of negative,corresponds to the cost and the transition prob- O((V+E)3E).where V]denotes the number of nodes ability and reward function are independent with regard and E denotes the number of edges in g. to decision epochs or previous state,which demonstrates Proof:Denote an optimal testing strategy as n,the the Markov property of our problem. strategy given by Algorithm 1 as m.By backward induction, Decision Policy:A decision policy is a mapping from we prove that the utility function u of m is no less than the state space to action set.Therefore,in our problem,it is optimal utility function u=u on every state,which implies equivalent to an adaptive testing strategy. that is an optimal strategy. Optimality Criterion:Obviously,in our case,the op- First,for all s E SEl.obviously ur(s)=ur(s)=0. timality criterion is the expected total reward criterion, Suppose for all states s∈Si,i≥k,r(s)≥ur-(s),then we i.e.,the decision policy with the maximum expected prove that for all states sE Sk1,u(s)>u(s).Indeed,by total reward of the constructed MDP corresponds to the the selection criterion of the algorithm,for a state s e S optimal adaptive testing strategy. that is non-terminating,we have (s)-c(e)+p(e)u (s.)+(1-p(e))u(se)) B.Exact Dynamic Programming Algorithm ≥-c(π*(s)+p(π*(s)ux(s·π*(s) From the equivalence between our problem and an MDP, it follows that our problem also satisfies the "Principle of +(1-p(π*(s))ux(s\π*(s)】 Optimality"in MDP [34],i.e.,starting at any states,the ≥-c(π*(s)+p(π*(s)um-(s·π*(s) optimal adaptive testing strategy incurs the minimum expected +(1-p(π*(s))u+(s\π*(s) (1) cost among all strategies.This enables us to define the optimal =u(s), utility function u of states assigning each temporary state the expected reward (negative cost)of the optimal strategy where Inequality (1)follows from the induction hypothesis. starting at that state.Similarly,we define a utility function u And if s is a terminating state,then also u(s)=u(s)=0. associated with strategy m as the reward gained by starting Hence,we prove that under every state s,following is from each state.By the Bellman equation [34],we have the optimal,and particularly from the initial all-*state,returns following lemma. the maximum expected reward,or equivalently,incurs the minimum expected cost
6 corresponds to the set of temporary states S in our problem. We may also partition the state space S into |E| disjoint subsets based on the number of edges having been tested in the states as S = S0 ∪ S1 ∪ . . . ∪ S|E| . In decision epoch i, the system can only be in a state in Si . • Action Sets: For each state s ∈ S, there is a set of actions that can be performed under it. We define the associated action set As of state s as the set of edges that have not been tested in s. Additionally, for terminating states, their action set also contains the terminating action ⊥. As a result, the whole action set A = S s∈S As = E ∪ {⊥}. • Transition Probabilities and Rewards: The transition probability function and reward function characterize the result of choosing some action at some state. Generally speaking, at each state, choosing an action will gain some reward and the system will evolve into other states probabilistically at the next decision epoch. Projecting into our problem, the transition probability of action e (testing edge e) is given by the existence probability of edge e. Denote by s · e the temporary state evolved from s by setting se as 1 and by s\e the temporary state evolved from s by setting se as 0. Formally, the transition probability function is given by: P(s 0 |s, e) = p(e) if s 0 = s · e, 1 − p(e) if s 0 = s\e, 0 otherwise, and P(s 0 |s, ⊥) = ( 1 if s 0 = s, 0 otherwise. Then it follows that the reward function is r(s, e) = −c(e) and r(s, ⊥) = 0. Note that the reward function is negative, corresponds to the cost and the transition probability and reward function are independent with regard to decision epochs or previous state, which demonstrates the Markov property of our problem. • Decision Policy: A decision policy is a mapping from state space to action set. Therefore, in our problem, it is equivalent to an adaptive testing strategy. • Optimality Criterion: Obviously, in our case, the optimality criterion is the expected total reward criterion, i.e., the decision policy with the maximum expected total reward of the constructed MDP corresponds to the optimal adaptive testing strategy. B. Exact Dynamic Programming Algorithm From the equivalence between our problem and an MDP, it follows that our problem also satisfies the “Principle of Optimality” in MDP [34], i.e., starting at any states, the optimal adaptive testing strategy incurs the minimum expected cost among all strategies. This enables us to define the optimal utility function u of states assigning each temporary state the expected reward (negative cost) of the optimal strategy starting at that state. Similarly, we define a utility function uπ associated with strategy π as the reward gained by π starting from each state. By the Bellman equation [34], we have the following lemma. Lemma 1. For any state s ∈ S, the optimal utility function satisfies u(s) = maxa∈As {r(s, a) + P s 0∈S P(s 0 | s, e)u(s 0 )}. Particularly, if s is a non-terminating state, then u(s) = maxe∈As {−c(e) + p(e)u(s · e) + (1 − p(e))u(s\e)} and for any terminating state, its utility is 0. Based on the Lemma 1, we design an algorithm that computes the optimal testing strategy π following the standard dynamic programming paradigm, as shown in Algorithm 1. Algorithm 1 The MDP-based Exact Algorithm Input: Uncertain graph G(V, E, p, c), source s, destination t Output: The optimal testing strategy π 1: Initialize: uπ(s) = 0, for all s ∈ S|E| 2: for i = |E| to 0 do 3: for All s in Si do 4: if s is a terminating state then 5: uπ(s) := 0, π(s) := ⊥. 6: else 7: e ∗ := arg maxe∈As {−c(e) + p(e)uπ(s · e) +(1 − p(e))uπ(s\e)}, 8: uπ(s) := −c(e ∗ ) + p(e ∗ )uπ(s · e ∗ ) +(1 − p(e ∗ ))uπ(s\e ∗ ), 9: π(s) := e ∗ . 10: return π We prove the correctness of the dynamic programming algorithm in the following theorem. Theorem 3. For an uncertain graph G, Algorithm 1 yields an optimal adaptive testing strategy and has a complexity of O((|V | + |E|)3|E| ), where |V | denotes the number of nodes and |E| denotes the number of edges in G. Proof: Denote an optimal testing strategy as π ∗ , the strategy given by Algorithm 1 as π. By backward induction, we prove that the utility function uπ of π is no less than the optimal utility function uπ∗ = u on every state, which implies that π is an optimal strategy. First, for all s ∈ S|E| , obviously uπ(s) = uπ∗ (s) = 0. Suppose for all states s ∈ Si , i ≥ k, uπ(s) ≥ uπ∗ (s), then we prove that for all states s ∈ Sk−1, uπ(s) ≥ uπ∗ (s). Indeed, by the selection criterion of the algorithm, for a state s ∈ Sk−1 that is non-terminating, we have uπ(s) = max e∈As {−c(e) + p(e)uπ(s · e) + (1 − p(e))uπ(s\e)} ≥ − c(π ∗ (s)) + p(π ∗ (s))uπ(s · π ∗ (s)) + (1 − p(π ∗ (s)))uπ(s\π ∗ (s)) ≥ − c(π ∗ (s)) + p(π ∗ (s))uπ∗ (s · π ∗ (s)) + (1 − p(π ∗ (s)))uπ∗ (s\π ∗ (s)) (1) =uπ∗ (s), where Inequality (1) follows from the induction hypothesis. And if s is a terminating state, then also uπ(s) = uπ∗ (s) = 0. Hence, we prove that under every state s, following π is optimal, and particularly from the initial all-∗ state, π returns the maximum expected reward, or equivalently, incurs the minimum expected cost
> The time complexity of the algorithm can be justified as Therefore,.Cost()=∑Geo[Pr(G)∑eeE.(ec(el≤lE follows.There are in total 3El temporary states associated ∑G∈c[Pr(G)Cert(G】≤E·Cost(π*) ■ with the uncertain graph.Qualifying whether a state s is a ter- Strictly speaking,the greedy algorithm yields non-adaptive minating state can be realized by querying the s-t connectivity strategies,i.e.,the strategies it derives do not make decisions on two deterministic graphs G(V,E1)and G2(V,E2),where based on previous results but tests the edges following a E1=fe l se 1}and E2=fe l se 1 or se =*)This is predefined order.We can modify it into an adaptive version, implementable in O(V+E)time by depth-first traversal, which omits the edges that do not effect the s-t connectivity, and selecting the optimal action for each state requires O(E) e.g.the edges that do not lie on any of the s-t paths in the time.Hence,the algorithm terminates and finds the optimal current state.Although the adaptive version of the greedy solution in ((V+E)3El)time. ■ algorithm has better performance,Theorem 4 holds for both Remark:Note that the exact algorithm computes the whole the adaptive and non-adaptive version. testing strategy in one run,meaning that we can obtain the There are three further notes regarding the properties of corresponding action for all temporary states by invoking the the greedy algorithm:(i)Although the proof of Theorem 4 algorithm once.Therefore,on average,the exact algorithm is completed by comparing the performance of the greedy takes O(E)time to determine the optimal action for each algorithm to the minimum certifier cost,the resulting bound temporary state.However,as the time complexity we consider is actually tight,(ii)The approximation ratio of an alternative here is the total time for an algorithm to determine all the greedy algorithm based on the existence probability of edges actions of a testing process and we have to execute the whole is far worse than O(E)and (iii)The greedy algorithm only algorithm at the beginning of each testing process,the exact considers the testing cost of edges.However,it has significant- algorithm still has exponential time complexity.On the other ly better performance guarantee than some seemingly more hand,the approximation algorithms presented in the following sophisticated algorithms that take into account the existence section exploits the sequential way of computing the testing probabilities of edges.The soundness of the above three strategy.Based on each temporary state,they use polynomial arguments is supported in our Appendices C. time to determine the corresponding action,thus can be categorized as polynomial time approximation algorithms. B.Adaptive Submodular Algorithm VI.APPROXIMATION ALGORITHMS To further improve the approximation ratio,we adopt the Q-value approach in Stochastic Boolean Function Evaluation As stated previously,it is unrealistic to pursue efficient exact Problem (SBFE)proposed by [22].Based on that,we utilize algorithm on general uncertain graphs due to the inherent the adaptive submodularity [23]in our problem and propose tension of the Connectivity Determination problem.Therefore. the Adaptive Submodular Algorithm,of which the approxi- in this section,we propose approximation schemes that have mation ratio is logarithmic to the number of edges for most both polynomial time complexity and good approximation uncertain graphs guarantee.Specifically,we focus on analyzing the approxima- tion ratios of the two proposed algorithms.The readers may 1)Preliminaries:The Q-value approach is proposed for refer to Appendix IV for more details of their time complexity. the SBFE problem,which has close connection with our problem.The SBFE problem is that given a Boolean function f {0,1)"10,1}on an unknown input x.Each bit i A.A Simple Greedy Approach of x can only be determined by paying a cost ci.The prior An intuitive greedy algorithm is to test the edge with probability of the value of each bit being 1 is specified by a the minimum cost (breaking ties arbitrarily).Surprisingly,we probability vector p=(p1,p2,...,pn)and x is drawn from show that this greedy algorithm has a non-trivial approxima- the corresponding product distribution.The goal is to derive an tion ratio of (E). evaluation strategy minimizing the expected cost.The relation Theorem 4.Given an instance of our problem with uncer- between our problem and SBFE can be established by consid- tain graph g(V,E,p,c)and two nodes s,t as source and ering each edge as a Boolean variable and see the function as implicitly given by the s-t connectivity of the uncertain graph. destination,let n be a strategy that tests the edges in g according to their costs sorted in an increasing order.Then, However,as constructing such a function from a graph may Cost(π)≤lE·Cost(π*),whereπ*is the optimal strategy. have exponential time complexity,we cannot directly use the algorithms proposed for SBFE problems. Proof:Suppose that we know the underlying graph G We then introduce some useful definitions for the Q-value of G in advance.Denote Cert(G)as the certifier of G's s-t approach adapted for our problem.For two temporary states connectivity with the minimum cost.If s and t are connected a.b E S.a is an extension of b,written as a~b,if a;=b; in G,then a certifier consists of an s-t path in g whose edges for all bi*.A function g:SN is said to be monotone exist in G.If s and t are disconnected in G.then a certifier if for sES and all s'~s,g(s')-g(s)>0.g is submodular consists of an s-t cut in g whose edges do not exist in G.if g(s.e)-g(s)>g(s'.e)-g(s')and g(s\e)-g(s)> Since tests the edges from cheap to expensive,we must have g(s'e)-g(s')whenever s'~s and se =se=*.For a state eE(G)c(e)<ECert(G)for any G.And due to the s E S and edge e with se =*the expected marginal gain fact that even the optimal strategy has no prior knowledge of of the edge to the current state with respect to g is given by the underlying graphG,clearly∑eeE,ec(e)≥Cert(G).p(e)g(s·e)+(1-ple)g(s\e)-gs)
7 The time complexity of the algorithm can be justified as follows. There are in total 3 |E| temporary states associated with the uncertain graph. Qualifying whether a state s is a terminating state can be realized by querying the s-t connectivity on two deterministic graphs G1 s (V, E1) and G2 s (V, E2), where E1 = {e | se = 1} and E2 = {e | se = 1 or se = ∗}. This is implementable in O(|V | + |E|) time by depth-first traversal, and selecting the optimal action for each state requires O(|E|) time. Hence, the algorithm terminates and finds the optimal solution in O((|V | + |E|)3|E| ) time. Remark: Note that the exact algorithm computes the whole testing strategy in one run, meaning that we can obtain the corresponding action for all temporary states by invoking the algorithm once. Therefore, on average, the exact algorithm takes O(|E|) time to determine the optimal action for each temporary state. However, as the time complexity we consider here is the total time for an algorithm to determine all the actions of a testing process and we have to execute the whole algorithm at the beginning of each testing process, the exact algorithm still has exponential time complexity. On the other hand, the approximation algorithms presented in the following section exploits the sequential way of computing the testing strategy. Based on each temporary state, they use polynomial time to determine the corresponding action, thus can be categorized as polynomial time approximation algorithms. VI. APPROXIMATION ALGORITHMS As stated previously, it is unrealistic to pursue efficient exact algorithm on general uncertain graphs due to the inherent tension of the Connectivity Determination problem. Therefore, in this section, we propose approximation schemes that have both polynomial time complexity and good approximation guarantee. Specifically, we focus on analyzing the approximation ratios of the two proposed algorithms. The readers may refer to Appendix IV for more details of their time complexity. A. A Simple Greedy Approach An intuitive greedy algorithm is to test the edge with the minimum cost (breaking ties arbitrarily). Surprisingly, we show that this greedy algorithm has a non-trivial approximation ratio of O(|E|). Theorem 4. Given an instance of our problem with uncertain graph G(V, E, p, c) and two nodes s, t as source and destination, let π be a strategy that tests the edges in G according to their costs sorted in an increasing order. Then, Cost(π) ≤ |E| · Cost(π ∗ ), where π ∗ is the optimal strategy. Proof: Suppose that we know the underlying graph G of G in advance. Denote Cert(G) as the certifier of G’s s-t connectivity with the minimum cost. If s and t are connected in G, then a certifier consists of an s-t path in G whose edges exist in G. If s and t are disconnected in G, then a certifier consists of an s-t cut in G whose edges do not exist in G. Since P π tests the edges from cheap to expensive, we must have e∈Eπ(G) c(e) ≤ |E| · Cert(G) for any G. And due to the fact that even the optimal strategy has no prior knowledge of the underlying graph G, clearly P e∈Eπ∗ (G) c(e) ≥ Cert(G). Therefore, Cost(π) = P G∈G[P r(G) P e∈Eπ(G) P c(e)] ≤ |E| · G∈G[P r(G)Cert(G)] ≤ |E| · Cost(π ∗ ). Strictly speaking, the greedy algorithm yields non-adaptive strategies, i.e., the strategies it derives do not make decisions based on previous results but tests the edges following a predefined order. We can modify it into an adaptive version, which omits the edges that do not effect the s-t connectivity, e.g. the edges that do not lie on any of the s-t paths in the current state. Although the adaptive version of the greedy algorithm has better performance, Theorem 4 holds for both the adaptive and non-adaptive version. There are three further notes regarding the properties of the greedy algorithm: (i) Although the proof of Theorem 4 is completed by comparing the performance of the greedy algorithm to the minimum certifier cost, the resulting bound is actually tight, (ii) The approximation ratio of an alternative greedy algorithm based on the existence probability of edges is far worse than O(|E|) and (iii) The greedy algorithm only considers the testing cost of edges. However, it has significantly better performance guarantee than some seemingly more sophisticated algorithms that take into account the existence probabilities of edges. The soundness of the above three arguments is supported in our Appendices C. B. Adaptive Submodular Algorithm To further improve the approximation ratio, we adopt the Q-value approach in Stochastic Boolean Function Evaluation Problem (SBFE) proposed by [22]. Based on that, we utilize the adaptive submodularity [23] in our problem and propose the Adaptive Submodular Algorithm, of which the approximation ratio is logarithmic to the number of edges for most uncertain graphs. 1) Preliminaries: The Q-value approach is proposed for the SBFE problem, which has close connection with our problem. The SBFE problem is that given a Boolean function f : {0, 1} n 7→ {0, 1} on an unknown input x. Each bit xi of x can only be determined by paying a cost ci . The prior probability of the value of each bit being 1 is specified by a probability vector p = (p1, p2, . . . , pn) and x is drawn from the corresponding product distribution. The goal is to derive an evaluation strategy minimizing the expected cost. The relation between our problem and SBFE can be established by considering each edge as a Boolean variable and see the function as implicitly given by the s-t connectivity of the uncertain graph. However, as constructing such a function from a graph may have exponential time complexity, we cannot directly use the algorithms proposed for SBFE problems. We then introduce some useful definitions for the Q-value approach adapted for our problem. For two temporary states a, b ∈ S, a is an extension of b, written as a ∼ b, if ai = bi for all bi 6= ∗. A function g : S 7→ N is said to be monotone if for s ∈ S and all s 0 ∼ s, g(s 0 ) − g(s) ≥ 0. g is submodular if g(s · e) − g(s) ≥ g(s 0 · e) − g(s 0 ) and g(s\e) − g(s) ≥ g(s 0\e) − g(s 0 ) whenever s 0 ∼ s and se = s 0 e = ∗. For a state s ∈ S and edge e with se = ∗, the expected marginal gain of the edge to the current state with respect to g is given by p(e)g(s · e) + (1 − p(e))g(s\e) − g(s)
The Q-value approach:The Q-value approach [22]states gp(s·e)-gn(s)≥9p(s'·e)-gp(s) that if we have a utility function g:SN that satisfies: gp(s\e)-gp(s)>gp(s'e)-gp(s), (1)g is monotone and submodular,(2)g(*,*,...,*)=0, and (3)for any temporary state sS,g(s)=Q iff s is gc(s·e)-gc(s)≥gc(s'·e)-ge(s) a terminating state,then g is assignment feasible with goal gc(s\e)-gc(s)≥gc(se)-gc(s), value O for our problem.And by using the adaptive greedy algorithm proposed in the Adaptive Submodular framework whenever s's and se=se =*Note that actually, in [23]that suggests testing the edge with the maximum ratio gp(s.e)-gp(s)=gc(s\e)-gc(s)=0 for all s and e such between its expected marginal gain and cost each time,we that se =*Then,combining the fact that g(s.e)-g(s)= yield a solution that is within a factor of(In+1)of the (IPl-gp(s.e))(gc(s.e)-gc(s))and g(s\e)-g(s)=(IC]- optimum.For completeness,we state the relevant result in the ge(s\e))(gp(s\e)-gp(s)).we have g is submodular.Second, since gp(*,*,,*)=gc(*,*,,*)=0,g(*,*,,*)is following lemma. also zero.The third condition is explained above.Hence,g is Lemma 2.(23]Given a utility feasible function g,set of an assignment feasible utility function. ■ states S and goal value Q,we are to select a certain action 3)The Adaptive Submodular Algorithm:By applying the on current states,and the state transition is governed by Q-value approach [22]with our utility function,we have our the action.Our goal is to make the state evolve to some Adaptive Submodular algorithm shown in Algorithm 2.Note terminating state s such that g(s)=Q.Then,the strategy that the algorithm computes the testing strategy sequentially, that maximize the ratio between the gain with respect to g i.e.,in one iteration,it only determines the next edge to test and the cost of the action is a(InQ+1)-approximation of the based on the current temporary state. minimum cost strategy. 4)Performance Guarantee:By Lemma 2,the Adaptive 2)Applying the Q-value Approach:To harness the Q-value Submodular algorithm yields an approximation of O(In Q)= approach,we need to choose appropriate utility function g. O(In(PC)).Since for most graphs,the number of paths And we present in the following the utility function that we and the number of cuts are polynomial to the number of design for our algorithm. edges,Algorithm 2 has logarithmic approximation ratio in Given an uncertain graph g(V,E,p,c),we denote by P most cases.However,in some extreme cases,the number of the collection of s-t paths in g,and by C the collection of paths or cuts can be exponentially large,making the worst s-t cuts in 9.For an edge e,we define Pe as the set of s-t case approximation ratio still turn out to be O(E).Also note paths it lies on in g and Ce as the set of minimal s-t cuts it that in the algorithm we do not specify how to implement the lies on in g.Note that the above definitions interpret g as a selection rule in line 3 of Algorithm 2,hence the algorithm deterministic graph with vertex set V and edge set E.Then, can be viewed as a framework that can embody any valid for each temporary state s,we define two auxiliary functions greedy selection algorithm.Standard implementation of the gp and ge as: selection rule involves counting the number of paths and cuts in graph C,which is #P-hard [1]in general.So,we may gp(s)=1 U Pel,ge(s)=1U Cel. use polynomial time approximate counting schemes [35].[36] e:5e=0 e:se=1 that can preserve an approximation ratio of O(In(PC)), Our utility function g:SN is given by: if the scheme can guarantee that the expected marginal gain g(s)=IPlIC]-(IPl-gp(s))(Icl-gc(s)). of the selected edge is within a factor o of the optimal gain Furthermore,when the number of paths and the number of The intuitive explanation for the functions gp,g and g is that cuts are polynomial to the number of edges,we can also use if we view the determining process as a covering process,then efficient enumerating schemes [37]to select the next edge the non-existence of an edge can be regarded as covering the to test following the criterion of our Adaptive Submodular paths it lies on and the existence of an edge is equivalent algorithm to covering the cuts it lies on.If all the paths in g have been covered in some state s,then we have gp(s)=P and Algorithm 2 The Adaptive Submodular Algorithm conclude that s and t in the underlying graph of g must be Input:Uncertain graph G(V,E,p,c),source and destination disconnected and the converse is also true.The dual case holds nodes s.t. similarly.Therefore,when s is a terminating state,we must Output:An approximate adaptive testing strategy have g(s)=Q.Theorem 5 demonstrates the validity of the 1:Initialize:Current state s :=(*,*,...,*)The set of utility function that we construct. tested edges E as an empty set. Theorem 5.The utility function g is assignment feasible. 2:Repeat until s becomes a terminating state. Proof:We prove the theorem by showing that g satisfies 3:e:=arg maxE((s() c(e) 4:E :=E Ufe*},test e*and observe the outcome. the three conditions mentioned above.First,obviously both gp 5:if edge e*exists then and ge are monotone,it follows that g is also monotone.And 6:se*=1 since gp and ge are easily verified to be submodular,we have 7:else 8: se*=0 4An s-t cut is minimal if and only if no proper subset of it is an s-t cut
8 The Q-value approach: The Q-value approach [22] states that if we have a utility function g : S 7→ N that satisfies: (1) g is monotone and submodular, (2) g(∗, ∗, . . . , ∗) = 0, and (3) for any temporary state s ∈ S, g(s) = Q iff s is a terminating state, then g is assignment feasible with goal value Q for our problem. And by using the adaptive greedy algorithm proposed in the Adaptive Submodular framework in [23] that suggests testing the edge with the maximum ratio between its expected marginal gain and cost each time, we yield a solution that is within a factor of (ln Q + 1) of the optimum. For completeness, we state the relevant result in the following lemma. Lemma 2. [23] Given a utility feasible function g, set of states S and goal value Q, we are to select a certain action on current states, and the state transition is governed by the action. Our goal is to make the state evolve to some terminating state s such that g(s) = Q. Then, the strategy that maximize the ratio between the gain with respect to g and the cost of the action is a (ln Q+ 1)-approximation of the minimum cost strategy. 2) Applying the Q-value Approach: To harness the Q-value approach, we need to choose appropriate utility function g. And we present in the following the utility function that we design for our algorithm. Given an uncertain graph G(V, E, p, c), we denote by P the collection of s-t paths in G, and by C the collection of s-t cuts in G. For an edge e, we define Pe as the set of s-t paths it lies on in G and Ce as the set of minimal s-t cuts4 it lies on in G. Note that the above definitions interpret G as a deterministic graph with vertex set V and edge set E. Then, for each temporary state s, we define two auxiliary functions gp and gc as: gp(s) = | [ e:se=0 Pe|, gc(s) = | [ e:se=1 Ce|. Our utility function g : S 7→ N is given by: g(s) = |P||C| − (|P| − gp(s))(|C| − gc(s)). The intuitive explanation for the functions gp, gc and g is that if we view the determining process as a covering process, then the non-existence of an edge can be regarded as covering the paths it lies on and the existence of an edge is equivalent to covering the cuts it lies on. If all the paths in G have been covered in some state s, then we have gp(s) = |P| and conclude that s and t in the underlying graph of G must be disconnected and the converse is also true. The dual case holds similarly. Therefore, when s is a terminating state, we must have g(s) = Q. Theorem 5 demonstrates the validity of the utility function that we construct. Theorem 5. The utility function g is assignment feasible. Proof: We prove the theorem by showing that g satisfies the three conditions mentioned above. First, obviously both gp and gc are monotone, it follows that g is also monotone. And since gp and gc are easily verified to be submodular, we have 4An s-t cut is minimal if and only if no proper subset of it is an s-t cut. gp(s · e) − gp(s) ≥ gp(s 0 · e) − gp(s 0 ), gp(s\e) − gp(s) ≥ gp(s 0 \e) − gp(s 0 ), gc(s · e) − gc(s) ≥ gc(s 0 · e) − gc(s 0 ), gc(s\e) − gc(s) ≥ gc(s 0 \e) − gc(s 0 ), whenever s 0 ∼ s and s 0 e = se = ∗. Note that actually, gp(s · e) − gp(s) = gc(s\e) − gc(s) = 0 for all s and e such that se = ∗. Then, combining the fact that g(s · e) − g(s) = (|P| − gp(s · e))(gc(s · e) − gc(s)) and g(s\e) − g(s) = (|C| − gc(s\e))(gp(s\e) − gp(s)), we have g is submodular. Second, since gp(∗, ∗, . . . , ∗) = gc(∗, ∗, . . . , ∗) = 0, g(∗, ∗, . . . , ∗) is also zero. The third condition is explained above. Hence, g is an assignment feasible utility function. 3) The Adaptive Submodular Algorithm: By applying the Q-value approach [22] with our utility function, we have our Adaptive Submodular algorithm shown in Algorithm 2. Note that the algorithm computes the testing strategy sequentially, i.e., in one iteration, it only determines the next edge to test based on the current temporary state. 4) Performance Guarantee: By Lemma 2, the Adaptive Submodular algorithm yields an approximation of O(ln Q) = O(ln(|P||C|)). Since for most graphs, the number of paths and the number of cuts are polynomial to the number of edges, Algorithm 2 has logarithmic approximation ratio in most cases. However, in some extreme cases, the number of paths or cuts can be exponentially large, making the worst case approximation ratio still turn out to be O(|E|). Also note that in the algorithm we do not specify how to implement the selection rule in line 3 of Algorithm 2, hence the algorithm can be viewed as a framework that can embody any valid greedy selection algorithm. Standard implementation of the selection rule involves counting the number of paths and cuts in graph G, which is #P-hard [1] in general. So, we may use polynomial time approximate counting schemes [35], [36] that can preserve an approximation ratio of O(ln(α|P||C|)), if the scheme can guarantee that the expected marginal gain of the selected edge is within a factor α of the optimal gain. Furthermore, when the number of paths and the number of cuts are polynomial to the number of edges, we can also use efficient enumerating schemes [37] to select the next edge to test following the criterion of our Adaptive Submodular algorithm. Algorithm 2 The Adaptive Submodular Algorithm Input: Uncertain graph G(V, E, p, c), source and destination nodes s, t. Output: An approximate adaptive testing strategy 1: Initialize: Current state s := (∗, ∗, . . . , ∗), The set of tested edges Eπ as an empty set. 2: Repeat until s becomes a terminating state. 3: e ∗ := arg maxe∈E\Eπ { p(e)g(s·e)+(1−p(e))g(s\e)−g(s) c(e) }. 4: Eπ := Eπ ∪ {e ∗}, test e ∗ and observe the outcome. 5: if edge e ∗ exists then 6: se ∗ := 1 7: else 8: se ∗ := 0
9 25.000 t-Geed山y 25.000 200.00 ★0 pSort 20.000 -PeSort Greedy -Greedy IntSort 20.000 ★-Sort --AdaSub y-PeSort 2150 00 15,000 B-IntSort 1s.000 --AdaSub 100.00 10.00 5.000 5,000 3 5678 (a) (b) (c) Fig.4.The expected cost of the adaptive testing strategies yielded by different algorithms.The r-coordinates of the figure follow the increasing order of the size of the uncertain graphs. VII.SIMULATIONS to approximate the expected cost of the strategy.To further In this section,we present our simulations on the perfor- eliminate the random noise in data,we designate 10 pairs of mance of the proposed algorithms on various datasets.We source and destination in each uncertain graph,and the final first introduce our simulation environment in the following results shown in the figures are the average costs incurred by and show the detailed results in subsequent sections. strategies among all pairs of source and destination. 3)Algorithms Involved in Performance Comparisons:To A.Simulation Settings evaluate the performance of our proposed algorithms,we 1)Simulation Datasets:We adopt three real life datasets in include three additional heuristics adapted from [19].We our simulations.The basic descriptions and statistics are listed briefly introduce the algorithms as follows: as follows: Greedy Algorithm(Greedy):The greedy algorithm that Citation Networks (from Microsoft Academic Graph tests the edges from low cost to high cost proposed in [40]):We extract six citation networks of different sub- Section VI. fields in Microsoft Academic Graph ranging from 749 Adaptive Submodular Algorithm (AdaSub):The nodes (1429 edges)to 273751 nodes (993025 edges)to Adaptive Submodular algorithm based on the Q-value generate six uncertain graphs. approach proposed in Section VI. Internet Peer to Peer Networks [38]:This dataset con- MDP-based Algorithm (MDP):The exact dynamic tains a snapshot of the Gnutella peer-to-peer file sharing programming algorithm applying the Markov Decision network in August 2002.We extract nine subnetworks Process framework proposed in Section 1. ranging from 1000 nodes (1700 edges)to 5000 nodes Optimistic Sort Algorithm [19](OpSort):The algo- (16469 edges)to form nine uncertain graphs. rithm yields a strategy that tests the edges following Twitter Ego Networks [39]:This dataset consists of ego the increasing order of c/p.OpSort is optimal when the networks in Twitter.We select 10 ego networks ranging uncertain graph is a parallel graph [19]. from 95 nodes (1376 edges)to 213 nodes (17930 edges) Pessimistic Sort Algorithm [19](PeSort):The algo- to create 10 uncertain graphs.Note that the ego networks rithm yields a strategy that tests the edges following the we use have high density. increasing order of c/(1-p).PeSort is optimal when the For each uncertain graph generated above,we use Jac- uncertain graph is a serial graph [19]. card's coefficient [12],which is an established metric for Intersection Sort Algorithm [19](IntSort):The algo- link prediction in social networks,to assign the existence rithm that tests the edge with the minimum cost that lies probabilities of the edges.Specifically,for an edge e=(,y) on the intersection of a shortest s-t path and a minimum in uncertain graph g,the existence probability of e is given as s-t cut in the uncertain graph under the current state. p(e)=r()nr()/r()Ur(y),where r()denotes the Due to the prohibitive time complexity of the MDP-based set of neighbors of a node in the graph.We construct the cost algorithm,we only apply it to a sequence of subnetworks with function of the uncertain graphs by assigning the cost of each 20 edges that are extracted from the citation networks. edge from a Gaussian distribution with mean 50 and standard deviation 10.The negative part of the distribution is truncated. 2)Calculation of the Performance Metric:The perfor- B.Evaluation of Proposed Algorithms mance metric in the simulations is the expected cost of the We plot the expected costs of the strategies derived by strategies derived by the algorithms.However,to calculate different algorithms on various uncertain graphs in Figures the exact expected cost of a strategy requires testing it on 4 and 5. all the possible underlying graphs of an uncertain graph,of From Figure 5,we can see that the Adaptive Submodular al- which the number is extremely large.Therefore,instead,we gorithm indeed yields near optimal testing strategies,of which first generate 1000 underlying graphs by sampling from the the expected cost is at most 8%higher than the minimum distribution given by the uncertain graph and then use the expected cost achieved by the MDP-based algorithm.On the average cost of a strategy incurs on the 1000 underlying graphs other hand,although the Greedy algorithm is relatively simple
9 1 2 3 4 5 6 Expected Cost 0 5,000 10,000 15,000 20,000 25,000 Dataset: Citation Networks Greedy OpSort PeSort IntSort AdaSub (a) 1 2 3 4 5 6 7 8 9 Expected Cost 0 5,000 10,000 15,000 20,000 25,000 Dataset: Internet P2P Networks Greedy OpSort PeSort IntSort AdaSub (b) 1 2 3 4 5 6 7 8 9 10 Expected Cost 0 50,000 100,000 150,000 200,000 Dataset: Twitter Ego Networks Greedy OpSort PeSort IntSort AdaSub (c) Fig. 4. The expected cost of the adaptive testing strategies yielded by different algorithms. The x-coordinates of the figure follow the increasing order of the size of the uncertain graphs. VII. SIMULATIONS In this section, we present our simulations on the performance of the proposed algorithms on various datasets. We first introduce our simulation environment in the following and show the detailed results in subsequent sections. A. Simulation Settings 1) Simulation Datasets: We adopt three real life datasets in our simulations. The basic descriptions and statistics are listed as follows: • Citation Networks (from Microsoft Academic Graph [40]): We extract six citation networks of different sub- fields in Microsoft Academic Graph ranging from 749 nodes (1429 edges) to 273751 nodes (993025 edges) to generate six uncertain graphs. • Internet Peer to Peer Networks [38]: This dataset contains a snapshot of the Gnutella peer-to-peer file sharing network in August 2002. We extract nine subnetworks ranging from 1000 nodes (1700 edges) to 5000 nodes (16469 edges) to form nine uncertain graphs. • Twitter Ego Networks [39]: This dataset consists of ego networks in Twitter. We select 10 ego networks ranging from 95 nodes (1376 edges) to 213 nodes (17930 edges) to create 10 uncertain graphs. Note that the ego networks we use have high density. For each uncertain graph generated above, we use Jaccard’s coefficient [12], which is an established metric for link prediction in social networks, to assign the existence probabilities of the edges. Specifically, for an edge e = (x, y) in uncertain graph G, the existence probability of e is given as p(e) = |Γ(x) ∩ Γ(y)|/|Γ(x) ∪ Γ(y)|, where Γ(·) denotes the set of neighbors of a node in the graph. We construct the cost function of the uncertain graphs by assigning the cost of each edge from a Gaussian distribution with mean 50 and standard deviation 10. The negative part of the distribution is truncated. 2) Calculation of the Performance Metric: The performance metric in the simulations is the expected cost of the strategies derived by the algorithms. However, to calculate the exact expected cost of a strategy requires testing it on all the possible underlying graphs of an uncertain graph, of which the number is extremely large. Therefore, instead, we first generate 1000 underlying graphs by sampling from the distribution given by the uncertain graph and then use the average cost of a strategy incurs on the 1000 underlying graphs to approximate the expected cost of the strategy. To further eliminate the random noise in data, we designate 10 pairs of source and destination in each uncertain graph, and the final results shown in the figures are the average costs incurred by strategies among all pairs of source and destination. 3) Algorithms Involved in Performance Comparisons: To evaluate the performance of our proposed algorithms, we include three additional heuristics adapted from [19]. We briefly introduce the algorithms as follows: • Greedy Algorithm (Greedy): The greedy algorithm that tests the edges from low cost to high cost proposed in Section VI. • Adaptive Submodular Algorithm (AdaSub): The Adaptive Submodular algorithm based on the Q-value approach proposed in Section VI. • MDP-based Algorithm (MDP): The exact dynamic programming algorithm applying the Markov Decision Process framework proposed in Section 1. • Optimistic Sort Algorithm [19] (OpSort): The algorithm yields a strategy that tests the edges following the increasing order of c/p. OpSort is optimal when the uncertain graph is a parallel graph [19]. • Pessimistic Sort Algorithm [19] (PeSort): The algorithm yields a strategy that tests the edges following the increasing order of c/(1−p). PeSort is optimal when the uncertain graph is a serial graph [19]. • Intersection Sort Algorithm [19] (IntSort): The algorithm that tests the edge with the minimum cost that lies on the intersection of a shortest s-t path and a minimum s-t cut in the uncertain graph under the current state. Due to the prohibitive time complexity of the MDP-based algorithm, we only apply it to a sequence of subnetworks with 20 edges that are extracted from the citation networks. B. Evaluation of Proposed Algorithms We plot the expected costs of the strategies derived by different algorithms on various uncertain graphs in Figures 4 and 5. From Figure 5, we can see that the Adaptive Submodular algorithm indeed yields near optimal testing strategies, of which the expected cost is at most 8% higher than the minimum expected cost achieved by the MDP-based algorithm. On the other hand, although the Greedy algorithm is relatively simple
10 350 300 Sort 日Intsor 250 MDP 200 150 100 3 Citation Internet P2P Twitter Ego Datasets Fig.5.Comparisons with the exact MDP-based algorithm on six small Fig.6.The adaptivity gap of the Greedy algorithm on different datasets. subgraphs of the citation networks. the expected costs of the transmission schemes it derives are version.We calculate the adaptivity gap for each uncertain within a factor of 2.7 times the optimal ones. graph in the three datasets and show the results in Figure 6. As demonstrated in Figure 4.the Adaptive Submodular It turns out that the adaptivity gap in Citation and Internet algorithm derives the strategies with the minimum expected Peer to Peer networks can be as large as two.For the dense cost among the five compared algorithms in all three data sets. Twitter Ego network,the gap may even reach up to six.Hence, However,the gap between it and the Intersection Sort heuristic harnessing the results of previous tests and making the strategy is small.This can be attributed to the fact that in many cases adaptive bring significant gain in the testing costs. the intersection of a shortest s-t path and a minimum s-t cut is identical to the edge selected by the rule in the Adaptive Submodular algorithm.However,as shown in Appendix C-D, VIII.CONCLUSION AND FUTURE WORK the Intersection Sort heuristic does not possess such theoretical guarantee as the Adaptive Submodular algorithm. In this paper,we modeled the network as an uncertain graph For the other three compared algorithms,although it seems where each edge e exists independently with some probability that the Optimistic Sort and Pessimistic Sort utilize more in- p(e)and examined the problem of determining whether a formation than the Greedy algorithm,as they take into account given pair of source node and destination node are connected the existence probabilities of edges.There exists no significant by a path or separated by a cut.Assuming that during each de- gap between the Greedy algorithm and the other two.An termining process we are associated with an underlying graph, explanation to this is that the real life networks are mixed the existence of each edge can be unraveled through edge with parallel and serial structures.So often the sole selection testing at a cost of c(e).We aimed to find an optimal strategy criterion in OpSort or PeSort does not match the optimum. incurring the minimum expected cost with the expectation Also,as demonstrated in Appendix C-C,surprisingly,the taken over all possible underlying graphs.We have formulated worst case approximation ratios of both OpSort and PeSort are it into a combinatorial optimization problem and first in- far worse than the Greedy algorithm.Therefore,the Greedy vestigated its computational complexity.Specifically,through algorithm indeed achieves a desirable compromise between proving the NP-hardness of two closely related problems,we OpSort and PeSort. have shown that this problem cannot be solved in polynomial Finally,an important observation from our simulation re- time unless P=NP.Then,we have applied the Markov Decision sults is that:larger size is not equivalent to higher cost. Process framework to give an exact dynamic programming Despite that the largest uncertain graph generated by the algorithm with exponential time complexity.Moreover,we citation networks has about one million edges,the cost of have proposed two efficient approximation schemes:a simple determining s-t connectivity in it is still considerably lower greedy approach with linear approximation ratio and a second than in Twitter networks.This phenomenon results from the Adaptive Submodular algorithm with logarithmic approxima- fact that the Twitter networks are far denser than the other tion ratio for most uncertain graphs.Finally,we have justified two datasets,which means that there are significantly larger the effectiveness and superiority of our proposed algorithms number of edges that lie on the paths from source and through theoretical analysis and extensive simulations on real destination,influencing the s-t connectivity.Therefore,instead network datasets. of the total number of edges,it is the number of relevant edges There remains a lot of future directions that can be explored. that determines the scale of the expected testing cost. For example,it is desirable to design an algorithm with better approximation ratio and scalability,so that we can solve the C.Characterization of the Adaptivity Gap connectivity determination problem more efficiently and more We investigate the significance of the adaptivity of testing aptly apply it to large scale networks.Another interesting strategies by capturing the gap between the adaptive and work is to derive the theoretical bound of the adaptivity gap non-adaptive versions of the Greedy algorithm on different of the Greedy algorithm and the approximation ratio of the datasets.Specifically,we define the adaptivity gap of the Intersection Sort Algorithm.Finally,it is also worthwhile to Greedy algorithm as the ratio between the expected cost of the investigate the approximation hardness of the Connectivity strategies derived by the non-adaptive version and the adaptive Determination Problem
10 1 2 3 4 5 6 Expected Cost 50 100 150 200 250 300 350 Greedy OpSort PeSort IntSort AdaSub MDP Fig. 5. Comparisons with the exact MDP-based algorithm on six small subgraphs of the citation networks. the expected costs of the transmission schemes it derives are within a factor of 2.7 times the optimal ones. As demonstrated in Figure 4, the Adaptive Submodular algorithm derives the strategies with the minimum expected cost among the five compared algorithms in all three data sets. However, the gap between it and the Intersection Sort heuristic is small. This can be attributed to the fact that in many cases, the intersection of a shortest s-t path and a minimum s-t cut is identical to the edge selected by the rule in the Adaptive Submodular algorithm. However, as shown in Appendix C-D, the Intersection Sort heuristic does not possess such theoretical guarantee as the Adaptive Submodular algorithm. For the other three compared algorithms, although it seems that the Optimistic Sort and Pessimistic Sort utilize more information than the Greedy algorithm, as they take into account the existence probabilities of edges. There exists no significant gap between the Greedy algorithm and the other two. An explanation to this is that the real life networks are mixed with parallel and serial structures. So often the sole selection criterion in OpSort or PeSort does not match the optimum. Also, as demonstrated in Appendix C-C, surprisingly, the worst case approximation ratios of both OpSort and PeSort are far worse than the Greedy algorithm. Therefore, the Greedy algorithm indeed achieves a desirable compromise between OpSort and PeSort. Finally, an important observation from our simulation results is that: larger size is not equivalent to higher cost. Despite that the largest uncertain graph generated by the citation networks has about one million edges, the cost of determining s-t connectivity in it is still considerably lower than in Twitter networks. This phenomenon results from the fact that the Twitter networks are far denser than the other two datasets, which means that there are significantly larger number of edges that lie on the paths from source and destination, influencing the s-t connectivity. Therefore, instead of the total number of edges, it is the number of relevant edges that determines the scale of the expected testing cost. C. Characterization of the Adaptivity Gap We investigate the significance of the adaptivity of testing strategies by capturing the gap between the adaptive and non-adaptive versions of the Greedy algorithm on different datasets. Specifically, we define the adaptivity gap of the Greedy algorithm as the ratio between the expected cost of the strategies derived by the non-adaptive version and the adaptive Datasets Citation Internet P2P Twitter Ego Adaptivity Gap 1 2 3 4 5 6 Fig. 6. The adaptivity gap of the Greedy algorithm on different datasets. version. We calculate the adaptivity gap for each uncertain graph in the three datasets and show the results in Figure 6. It turns out that the adaptivity gap in Citation and Internet Peer to Peer networks can be as large as two. For the dense Twitter Ego network, the gap may even reach up to six. Hence, harnessing the results of previous tests and making the strategy adaptive bring significant gain in the testing costs. VIII. CONCLUSION AND FUTURE WORK In this paper, we modeled the network as an uncertain graph where each edge e exists independently with some probability p(e) and examined the problem of determining whether a given pair of source node and destination node are connected by a path or separated by a cut. Assuming that during each determining process we are associated with an underlying graph, the existence of each edge can be unraveled through edge testing at a cost of c(e). We aimed to find an optimal strategy incurring the minimum expected cost with the expectation taken over all possible underlying graphs. We have formulated it into a combinatorial optimization problem and first investigated its computational complexity. Specifically, through proving the NP-hardness of two closely related problems, we have shown that this problem cannot be solved in polynomial time unless P=NP. Then, we have applied the Markov Decision Process framework to give an exact dynamic programming algorithm with exponential time complexity. Moreover, we have proposed two efficient approximation schemes: a simple greedy approach with linear approximation ratio and a second Adaptive Submodular algorithm with logarithmic approximation ratio for most uncertain graphs. Finally, we have justified the effectiveness and superiority of our proposed algorithms through theoretical analysis and extensive simulations on real network datasets. There remains a lot of future directions that can be explored. For example, it is desirable to design an algorithm with better approximation ratio and scalability, so that we can solve the connectivity determination problem more efficiently and more aptly apply it to large scale networks. Another interesting work is to derive the theoretical bound of the adaptivity gap of the Greedy algorithm and the approximation ratio of the Intersection Sort Algorithm. Finally, it is also worthwhile to investigate the approximation hardness of the Connectivity Determination Problem