In this paper, we propose a novel ver_中国高校课件下载中心

点击下载：《人工智能、机器学习与大数据》课程教学资源（参考文献）Distributed Power-law Graph Computing - Theoretical and Empirical Analysis

正在加载图片...

In this paper,we propose a novel vertex-cut GP method,called degree-based hashing (DBH),for distributed power-law graph computing.The main contributions of DBH are briefly outlined as follows: DBH can effectively exploit the power-law degree distributions in natural graphs for vertex- cut GP. Theoretical bounds on the communication cost and workload balance for DBH can be de- rived,which show that DBH can achieve lower communication cost than existing methods and can simultaneously guarantee good workload balance. DBH can be implemented as an execution engine for PowerGraph [7],and hence all PowerGraph applications can be seamlessly supported by DBH. Empirical results on several large real graphs and synthetic graphs show that DBH can outperform the state-of-the-art methods. 2 Problem Formulation Let G=(V,E)denote a graph,where V={v1,v2,...,Un}is the set of vertices and E CV x V is the set of edges in G.Let VI denote the cardinality of the set V,and hence VI=n.vi and vj are called neighbors if(vi,vj)EE.The degree of vi is denoted as di,which measures the number of neighbors of vi.Please note that we only need to consider the GP task for undirected graphs because the workload mainly depends on the number of edges no matter directed or undirected graphs the computation is based on.Even if the computation is based on directed graphs,we can also use the undirected counterparts of the directed graphs to get the partitioning results. Assume we have a cluster of p machines.Vertex-cut GP is to assign each edge with the two corre- sponding vertices to one of the p machines in the cluster.The assignment of an edge is unique,while vertices may have replicas across different machines.For DGC frameworks based on vertex-cut GP, the workload(amount of computation)of a machine is roughly linear in the number of edges located in that machine,and the replicas of the vertices incur communication for synchronization.So the goal of vertex-cut GP is to minimize the number of replicas and simultaneously balance the number of edges on each machine Let M(e)E{1,...,p}be the machine edge eE is assigned to,and A(v){1,...,p}be the span of vertex v over different machines.Hence,A(v)is the number of replicas of v among different machines.Similar to PowerGraph [7],one of the replicas of a vertex is chosen as the master and the others are treated as the mirrors of the master.We let Master(v)denote the machine in which the master of v is located.Hence,the goal of vertex-cut GP can be formulated as follows: min∑A(训 47 m日=ml<λg.and m∈v1 Nasder(e)=ml<P分 where m∈{l,.,p}denotes a machine,入≥1 and p≥1 are imbalance factors.Wede fineA(vi)as replication factor,maxHeE M(e)=m)as edge-imbalance,and maxV Master()-mas vertex-imbalance.To get a good balance of workload. and p should be as small as possible. The degrees of natural graphs usually follow skewed power-law distributions [3,1]: Pr(d)d-a, where Pr(d)is the probability that a vertex has degree d and the power parameter a is a positive constant.The lower the a is,the more skewed a graph will be.This power-law degree distribu- tion makes GP challenging [7].Although vertex-cut methods can achieve better performance than edge-cut methods for power-law graphs [7],existing vertex-cut methods,such as random method in PowerGraph and grid-based method in GraphBuilder [9],cannot make effective use of the power- law distribution to achieve satisfactory performance.In this paper, we propose a novel vertex-cut GP method, called degree-based hashing (DBH), for distributed power-law graph computing. The main contributions of DBH are briefly outlined as follows: • DBH can effectively exploit the power-law degree distributions in natural graphs for vertexcut GP. • Theoretical bounds on the communication cost and workload balance for DBH can be derived, which show that DBH can achieve lower communication cost than existing methods and can simultaneously guarantee good workload balance. • DBH can be implemented as an execution engine for PowerGraph [7], and hence all PowerGraph applications can be seamlessly supported by DBH. • Empirical results on several large real graphs and synthetic graphs show that DBH can outperform the state-of-the-art methods. 2 Problem Formulation Let G = (V, E) denote a graph, where V = {v1, v2, . . . , vn} is the set of vertices and E ⊆ V × V is the set of edges in G. Let |V | denote the cardinality of the set V , and hence |V | = n. vi and vj are called neighbors if (vi , vj ) ∈ E. The degree of vi is denoted as di , which measures the number of neighbors of vi . Please note that we only need to consider the GP task for undirected graphs because the workload mainly depends on the number of edges no matter directed or undirected graphs the computation is based on. Even if the computation is based on directed graphs, we can also use the undirected counterparts of the directed graphs to get the partitioning results. Assume we have a cluster of p machines. Vertex-cut GP is to assign each edge with the two corresponding vertices to one of the p machines in the cluster. The assignment of an edge is unique, while vertices may have replicas across different machines. For DGC frameworks based on vertex-cut GP, the workload (amount of computation) of a machine is roughly linear in the number of edges located in that machine, and the replicas of the vertices incur communication for synchronization. So the goal of vertex-cut GP is to minimize the number of replicas and simultaneously balance the number of edges on each machine. Let M(e) ∈ {1, . . . , p} be the machine edge e ∈ E is assigned to, and A(v) ⊆ {1, . . . , p} be the span of vertex v over different machines. Hence, |A(v)| is the number of replicas of v among different machines. Similar to PowerGraph [7], one of the replicas of a vertex is chosen as the master and the others are treated as the mirrors of the master. We let M aster(v) denote the machine in which the master of v is located. Hence, the goal of vertex-cut GP can be formulated as follows: min A 1 n Xn i=1 |A(vi)| s.t. max m |{e ∈ E | M(e) = m}| < λ|E| p , and max m |{v ∈ V | M aster(v) = m}| < ρ n p , where m ∈ {1, . . . , p} denotes a machine, λ ≥ 1 and ρ ≥ 1 are imbalance factors. We de- fine 1 n Pn i=1 |A(vi)| as replication factor, p |E| max m |{e ∈ E | M(e) = m}| as edge-imbalance, and p n max m |{v ∈ V | M aster(v) = m}| as vertex-imbalance. To get a good balance of workload, λ and ρ should be as small as possible. The degrees of natural graphs usually follow skewed power-law distributions [3, 1]: Pr(d) ∝ d −α , where Pr(d) is the probability that a vertex has degree d and the power parameter α is a positive constant. The lower the α is, the more skewed a graph will be. This power-law degree distribution makes GP challenging [7]. Although vertex-cut methods can achieve better performance than edge-cut methods for power-law graphs [7], existing vertex-cut methods, such as random method in PowerGraph and grid-based method in GraphBuilder [9], cannot make effective use of the powerlaw distribution to achieve satisfactory performance. 3

<<向上翻页向下翻页>>

点击下载：《人工智能、机器学习与大数据》课程教学资源（参考文献）Distributed Power-law Graph Computing - Theoretical and Empirical Analysis