Distributed Power-law Graph Computing: Theoretical and Empirical Analysis

Cong Xie
Dept. of Comp. Sci. and Eng.
Shanghai Jiao Tong University
800 Dongchuan Road
Shanghai 200240, China
xcgoner1108@gmail.com

Ling Yan
Dept. of Comp. Sci. and Eng.
Shanghai Jiao Tong University
800 Dongchuan Road
Shanghai 200240, China
yling0718@sjtu.edu.cn

Wu-Jun Li
National Key Lab. for Novel Software Tech.
Dept. of Comp. Sci. and Tech.
Nanjing University
Nanjing 210023, China
liwujun@nju.edu.cn

Zhihua Zhang
Dept. of Comp. Sci. and Eng.
Shanghai Jiao Tong University
800 Dongchuan Road
Shanghai 200240, China
zhang-zh@cs.sjtu.edu.cn

Abstract

With the emergence of big graphs in a variety of real applications like social networks, machine learning based on distributed graph-computing (DGC) frameworks has attracted much attention from the big data machine learning community. In DGC frameworks, the graph partitioning (GP) strategy plays a key role in performance, affecting both workload balance and communication cost. Typically, the degree distributions of natural graphs from real applications follow skewed power laws, which makes GP a challenging task. Recently, many methods have been proposed to solve the GP problem. However, the existing GP methods cannot achieve satisfactory performance for applications with power-law graphs. In this paper, we propose a novel vertex-cut method, called degree-based hashing (DBH), for GP. DBH makes effective use of the skewed degree distributions for GP. We theoretically prove that DBH can achieve lower communication cost than existing methods and can simultaneously guarantee good workload balance. Furthermore, empirical results on several large power-law graphs also show that DBH can outperform the state of the art.

1 Introduction

Recent years have witnessed the emergence of big graphs in a large variety of real applications, such as the web and social network services. Furthermore, many machine learning and data mining algorithms can also be modeled with graphs [14]. Hence, machine learning based on distributed graph-computing (DGC) frameworks has attracted much attention from the big data machine learning community [14, 16, 15, 7, 12, 8]. To perform distributed (parallel) graph-computing on clusters with several machines (servers), one has to partition the whole graph across the machines in a cluster. Graph partitioning (GP) can dramatically affect the performance of DGC frameworks in terms of workload balance and communication cost. Hence, the GP strategy typically plays a key role in DGC frameworks. The ideal GP method should minimize the cross-machine communication cost, and simultaneously keep the workload in every machine approximately balanced.
Existing GP methods can be divided into two main categories: edge-cut and vertex-cut methods. Edge-cut tries to evenly assign the vertices to machines by cutting the edges. In contrast, vertex-cut tries to evenly assign the edges to machines by cutting the vertices. Figure 1 illustrates the edge-cut and vertex-cut partitioning results of an example graph. In Figure 1 (a), the edges (A,C) and (A,E) are cut, and the two machines store the vertex sets {A,B,D} and {C,E}, respectively. In Figure 1 (b), the vertex A is cut, and the two machines store the edge sets {(A,B),(A,D),(B,D)} and {(A,C),(A,E),(C,E)}, respectively. In edge-cut, both machines of a cut edge should maintain a ghost (local replica) of the vertex and the edge data. In vertex-cut, all the machines associated with a cut vertex should maintain a mirror (local replica) of the vertex. The ghosts and mirrors are shown as shaded vertices in Figure 1. In edge-cut, the workload of a machine is determined by the number of vertices located in that machine, and the communication cost of the whole graph is determined by the number of edges spanning different machines. In vertex-cut, the workload of a machine is determined by the number of edges located in that machine, and the communication cost of the whole graph is determined by the number of mirrors of the vertices.

Figure 1: Two strategies for graph partitioning: (a) edge-cut; (b) vertex-cut. Shaded vertices are ghosts and mirrors, respectively.

Most traditional DGC frameworks, such as GraphLab [14] and Pregel [16], use edge-cut methods [10, 19, 20, 21] for GP. Very recently, the authors of PowerGraph [7] found that vertex-cut methods can achieve better performance than edge-cut methods, especially for power-law graphs. Hence, vertex-cut has attracted more and more attention from the DGC research community. For example, PowerGraph [7] adopts a random vertex-cut method and two greedy variants for GP. GraphBuilder [9] provides some heuristics, such as the grid-based constrained solution, to improve the random vertex-cut method.

Large natural graphs usually follow skewed degree distributions like power-law distributions, which makes GP challenging. Different vertex-cut methods can result in different performance for power-law graphs. For example, Figure 2 (a) shows a toy power-law graph with only one vertex having a much higher degree than the others. Figure 2 (b) shows a partitioning strategy obtained by cutting the vertices {E, F, A, C, D}, and Figure 2 (c) shows a partitioning strategy obtained by cutting the vertices {A, E}. We can find that the partitioning strategy in Figure 2 (c) is better than that in Figure 2 (b) because the number of mirrors in Figure 2 (c) is smaller, which means less communication cost. The intuition underlying this example is that cutting higher-degree vertices can result in fewer mirror vertices. Hence, the power-law degree distribution can be used to facilitate GP. Unfortunately, existing vertex-cut methods, including those in PowerGraph and GraphBuilder, make little use of the power-law degree distribution for GP. Hence, they cannot achieve satisfactory performance on natural power-law graphs. PowerLyra [4] tries to combine edge-cut and vertex-cut together by using the power-law degree distribution. However, it lacks a theoretical guarantee.

Figure 2: Partitioning a sample graph with vertex-cut: (a) sample; (b) bad partitioning; (c) good partitioning.
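To make this bookkeeping concrete, here is a minimal Python sketch (ours, not from the paper) that rebuilds the Figure 1 example graph and counts the two communication-cost quantities just described: cut edges for the edge-cut of Figure 1 (a), and mirrors for the vertex-cut of Figure 1 (b). The two partitioning dictionaries are read directly off the figure.

```python
# Sketch: communication-cost bookkeeping for the Figure 1 example graph.
edges = [("A", "B"), ("A", "D"), ("B", "D"), ("A", "C"), ("A", "E"), ("C", "E")]

# Edge-cut of Figure 1 (a): vertices are placed on machines; edges may cross.
vertex_part = {"A": 0, "B": 0, "D": 0, "C": 1, "E": 1}
cut_edges = [e for e in edges if vertex_part[e[0]] != vertex_part[e[1]]]
print(len(cut_edges))  # 2: the cut edges (A,C) and (A,E) require ghosts

# Vertex-cut of Figure 1 (b): edges are placed on machines; vertices may span.
edge_part = {("A", "B"): 0, ("A", "D"): 0, ("B", "D"): 0,
             ("A", "C"): 1, ("A", "E"): 1, ("C", "E"): 1}
span = {}  # A(v): the set of machines touched by each vertex
for (u, v), m in edge_part.items():
    span.setdefault(u, set()).add(m)
    span.setdefault(v, set()).add(m)
print(sum(len(s) - 1 for s in span.values()))  # 1: only vertex A is mirrored
```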
In this paper, we propose a novel vertex-cut GP method, called degree-based hashing (DBH), for distributed power-law graph computing. The main contributions of DBH are briefly outlined as follows:

• DBH can effectively exploit the power-law degree distributions in natural graphs for vertex-cut GP.
• Theoretical bounds on the communication cost and workload balance for DBH can be derived, which show that DBH can achieve lower communication cost than existing methods and can simultaneously guarantee good workload balance.
• DBH can be implemented as an execution engine for PowerGraph [7], and hence all PowerGraph applications can be seamlessly supported by DBH.
• Empirical results on several large real graphs and synthetic graphs show that DBH can outperform the state-of-the-art methods.

2 Problem Formulation

Let $G = (V, E)$ denote a graph, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices and $E \subseteq V \times V$ is the set of edges in $G$. Let $|V|$ denote the cardinality of the set $V$, and hence $|V| = n$. $v_i$ and $v_j$ are called neighbors if $(v_i, v_j) \in E$. The degree of $v_i$ is denoted as $d_i$, which measures the number of neighbors of $v_i$. Please note that we only need to consider the GP task for undirected graphs, because the workload mainly depends on the number of edges regardless of whether the computation is based on directed or undirected graphs. Even if the computation is based on directed graphs, we can use the undirected counterparts of the directed graphs to get the partitioning results.

Assume we have a cluster of $p$ machines. Vertex-cut GP assigns each edge, with the two corresponding vertices, to one of the $p$ machines in the cluster. The assignment of an edge is unique, while vertices may have replicas across different machines. For DGC frameworks based on vertex-cut GP, the workload (amount of computation) of a machine is roughly linear in the number of edges located in that machine, and the replicas of the vertices incur communication for synchronization. So the goal of vertex-cut GP is to minimize the number of replicas and simultaneously balance the number of edges on each machine.

Let $M(e) \in \{1, \ldots, p\}$ be the machine that edge $e \in E$ is assigned to, and $A(v) \subseteq \{1, \ldots, p\}$ be the span of vertex $v$ over different machines. Hence, $|A(v)|$ is the number of replicas of $v$ among different machines. Similar to PowerGraph [7], one of the replicas of a vertex is chosen as the master and the others are treated as the mirrors of the master. We let $Master(v)$ denote the machine in which the master of $v$ is located. Hence, the goal of vertex-cut GP can be formulated as follows:
$$\min_{A} \frac{1}{n} \sum_{i=1}^{n} |A(v_i)| \quad \text{s.t.} \quad \max_{m} |\{e \in E \mid M(e) = m\}| < \lambda \frac{|E|}{p}, \ \text{and} \ \max_{m} |\{v \in V \mid Master(v) = m\}| < \rho \frac{n}{p},$$
where $m \in \{1, \ldots, p\}$ denotes a machine, and $\lambda \geq 1$ and $\rho \geq 1$ are imbalance factors. We define $\frac{1}{n} \sum_{i=1}^{n} |A(v_i)|$ as the replication factor, $\frac{p}{|E|} \max_{m} |\{e \in E \mid M(e) = m\}|$ as the edge-imbalance, and $\frac{p}{n} \max_{m} |\{v \in V \mid Master(v) = m\}|$ as the vertex-imbalance. To get a good balance of workload, $\lambda$ and $\rho$ should be as small as possible.

The degrees of natural graphs usually follow skewed power-law distributions [3, 1]:
$$\Pr(d) \propto d^{-\alpha},$$
where $\Pr(d)$ is the probability that a vertex has degree $d$ and the power parameter $\alpha$ is a positive constant. The lower $\alpha$ is, the more skewed a graph will be. This power-law degree distribution makes GP challenging [7].
Although vertex-cut methods can achieve better performance than edge-cut methods for power-law graphs [7], existing vertex-cut methods, such as the random method in PowerGraph and the grid-based method in GraphBuilder [9], cannot make effective use of the power-law distribution to achieve satisfactory performance.
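The three quantities defined above are easy to compute for any concrete assignment. The following Python sketch (our illustration; the function name and inputs are hypothetical) evaluates them given an edge assignment and a master placement, following the definitions in this section.

```python
# Sketch: the Section 2 metrics for a given vertex-cut assignment.
# `M` maps each edge to a machine and `master` maps each vertex to a machine;
# both are hypothetical inputs produced by some partitioner.
from collections import Counter

def vertex_cut_metrics(edges, M, master, p):
    span = {}              # A(v): machines spanned by vertex v
    edge_load = Counter()  # |{e in E : M(e) = m}| for each machine m
    for e in edges:
        m = M[e]
        edge_load[m] += 1
        for v in e:
            span.setdefault(v, set()).add(m)
    n = len(span)          # assumes no isolated vertices
    replication_factor = sum(len(s) for s in span.values()) / n
    edge_imbalance = p * max(edge_load.values()) / len(edges)
    vertex_imbalance = p * max(Counter(master.values()).values()) / n
    return replication_factor, edge_imbalance, vertex_imbalance
```

For the vertex-cut of Figure 1 (b), for instance, this returns a replication factor of 6/5 = 1.2 and an edge-imbalance of 1.0 on p = 2 machines.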
3 Degree-Based Hashing for GP

In this section, we propose a novel vertex-cut method, called degree-based hashing (DBH), to effectively exploit the power-law distribution for GP.

3.1 Hashing Model

We refer to a certain machine by its index $idx$, and the $idx$-th machine is denoted as $P_{idx}$. We first define two kinds of hash functions: the vertex-hash function $idx = vertex\_hash(v)$, which hashes vertex $v$ to the machine $P_{idx}$, and the edge-hash function $idx = edge\_hash(e)$ or $idx = edge\_hash(v_i, v_j)$, which hashes edge $e = (v_i, v_j)$ to the machine $P_{idx}$.

Our hashing model includes two main components:

• Master-vertex assignment: The master replica of $v_i$ is uniquely assigned to one of the $p$ machines, with equal probability for each machine, by some randomized hash function $vertex\_hash(v_i)$.
• Edge assignment: Each edge $e = (v_i, v_j)$ is assigned to one of the $p$ machines by some hash function $edge\_hash(v_i, v_j)$.

It is easy to find that the above hashing model is a vertex-cut GP method. The master-vertex assignment can be easily implemented, and it can be expected to achieve a low vertex-imbalance score. On the contrary, the edge assignment is much more complicated. Different edge-hash functions can achieve different replication factors and different edge-imbalance scores. Please note that the replication factor reflects communication cost, and the edge-imbalance reflects workload imbalance. Hence, the key of our hashing model lies in the edge-hash function $edge\_hash(v_i, v_j)$.

3.2 Degree-Based Hashing

From the example in Figure 2, we observe that in power-law graphs the replication factor, which is defined as the total number of replicas divided by the total number of vertices, will be smaller if we cut vertices with relatively higher degrees. Based on this intuition, we define $edge\_hash(v_i, v_j)$ as follows:
$$edge\_hash(v_i, v_j) = \begin{cases} vertex\_hash(v_i) & \text{if } d_i < d_j, \\ vertex\_hash(v_j) & \text{otherwise.} \end{cases} \tag{1}$$
It means that we use the vertex-hash function to define the edge-hash function. Furthermore, the edge-hash function value of an edge is determined by the degrees of the two associated vertices. More specifically, the edge-hash function value of an edge is defined by the vertex-hash function value of the associated vertex with the smaller degree. Hence, our method is called degree-based hashing (DBH). DBH can effectively capture the intuition that cutting vertices with higher degrees will get better performance.

Our DBH method for vertex-cut GP is briefly summarized in Algorithm 1, where $[n] = \{1, \ldots, n\}$.

Algorithm 1 Degree-based hashing (DBH) for vertex-cut GP
Input: The set of edges $E$; the set of vertices $V$; the number of machines $p$.
Output: The assignment $M(e) \in [p]$ for each edge $e$.
1: Initialization: count the degree $d_i$ for each $i \in [n]$ in parallel
2: for all $e = (v_i, v_j) \in E$ do
3:   Hash each edge in parallel:
4:   if $d_i < d_j$ then
5:     $M(e) \leftarrow vertex\_hash(v_i)$
6:   else
7:     $M(e) \leftarrow vertex\_hash(v_j)$
8:   end if
9: end for
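As a concrete illustration, here is a minimal single-process Python sketch of Algorithm 1 (ours, not the paper's implementation; the paper runs both the degree counting and the hashing loop in parallel, and Python's built-in `hash` merely stands in for the randomized `vertex_hash` function):

```python
# Sketch of Algorithm 1: degree-based hashing, single-process version.
from collections import Counter

def dbh_partition(edges, p):
    degree = Counter()                    # line 1: count degrees
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    vertex_hash = lambda v: hash(v) % p   # stand-in for a randomized hash to [p]
    M = {}
    for u, v in edges:                    # lines 2-9: hash each edge by its
        low = u if degree[u] < degree[v] else v   # lower-degree endpoint
        M[(u, v)] = vertex_hash(low)
    return M
```

Applied to the toy graph of Figure 2 (a), every edge incident to the high-degree vertex is hashed by its low-degree endpoint, so only the high-degree vertex accumulates mirrors, matching the intuition behind the good partitioning of Figure 2 (c).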
4 Theoretical Analysis

In this section, we present a theoretical analysis of our DBH method. For comparison, the random vertex-cut method (called Random) of PowerGraph [7] and the grid-based constrained solution (called Grid) of GraphBuilder [9] are adopted as baselines. Our analysis is based on randomization. Moreover, we assume that the graph is undirected and that there are no duplicated edges in the graph. We mainly study the performance in terms of the replication factor, edge-imbalance and vertex-imbalance defined in Section 2. Due to space limitation, we put the proofs of all theoretical results into the supplementary material.

4.1 Partitioning Degree-fixed Graphs

Firstly, we assume that the degree sequence $\{d_i\}_{i=1}^{n}$ is fixed. Then we can derive the expected replication factor produced by the different methods.

Random assigns each edge evenly to the $p$ machines via a randomized hash function. The result follows directly from PowerGraph [7].

Lemma 1. Assume that we have a sequence of $n$ vertices $\{v_i\}_{i=1}^{n}$ and the corresponding degree sequence $D = \{d_i\}_{i=1}^{n}$. A simple randomized vertex-cut on $p$ machines has the expected replication factor:
$$\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} |A(v_i)| \,\Big|\, D\right] = \frac{p}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{p}\right)^{d_i}\right].$$

By using the Grid hash function, each vertex has $\sqrt{p}$ rather than $p$ candidate machines compared to Random. Thus we simply replace $p$ with $\sqrt{p}$ to get the following corollary.

Corollary 1. By using Grid for hashing, the expected replication factor on $p$ machines is:
$$\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} |A(v_i)| \,\Big|\, D\right] = \frac{\sqrt{p}}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{\sqrt{p}}\right)^{d_i}\right].$$

Using the DBH method in Section 3.2, we obtain the following result by fixing the sequence $\{h_i\}_{i=1}^{n}$, where $h_i$ is defined as the number of $v_i$'s adjacent edges that are hashed by the neighbors of $v_i$ according to the edge-hash function defined in (1).

Theorem 1. Assume that we have a sequence of $n$ vertices $\{v_i\}_{i=1}^{n}$ and the corresponding degree sequence $D = \{d_i\}_{i=1}^{n}$. For each $v_i$, $d_i - h_i$ of its adjacent edges are hashed by $v_i$ itself. Define $H = \{h_i\}_{i=1}^{n}$. Our DBH method on $p$ machines has the expected replication factor:
$$\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} |A(v_i)| \,\Big|\, H, D\right] = \frac{p}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{p}\right)^{h_i + 1}\right] \leq \frac{p}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{p}\right)^{d_i}\right],$$
where $h_i \leq d_i - 1$ for any $v_i$.

This theorem says that our DBH method has a smaller expected replication factor than Random of PowerGraph [7].

Next we turn to the analysis of the balance constraints. We still fix the degree sequence and have the following result for our DBH method.

Theorem 2. Our DBH method on $p$ machines with the sequences $\{v_i\}_{i=1}^{n}$, $\{d_i\}_{i=1}^{n}$ and $\{h_i\}_{i=1}^{n}$ defined in Theorem 1 has the edge-imbalance:
$$\frac{\max_m |\{e \in E \mid M(e) = m\}|}{|E|/p} = \frac{\sum_{i=1}^{n} \frac{h_i}{p} + \max_{j \in [p]} \sum_{v_i \in P_j} (d_i - h_i)}{2|E|/p}.$$

Although the master vertices are evenly assigned to each machine, we want to show how close the randomized assignment is to the perfect balance. This problem is well studied in the model of uniformly throwing $n$ balls into $p$ bins when $n \gg p (\ln p)^3$ [18].
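Before stating the resulting load bound (Lemma 2 below), we pause for a quick numeric sanity check of the replication-factor closed forms above. This Python sketch is our illustration: the degree sequence and the choice of $h_i$ are made up, with $h_i$ only required to satisfy $h_i \leq d_i - 1$, and are not the values DBH would actually induce on a real graph.

```python
# Sketch: comparing the Lemma 1 (Random) and Theorem 1 (DBH) closed forms
# on an assumed degree sequence; h_i is an arbitrary illustrative choice
# with h_i <= d_i - 1.
p = 48
degrees = [1, 2, 2, 3, 5, 8, 13, 100, 500]   # hypothetical skewed sequence
h = [min(d - 1, 3) for d in degrees]         # low-degree endpoints hash few edges

random_rf = p / len(degrees) * sum(1 - (1 - 1 / p) ** d for d in degrees)
dbh_rf = p / len(degrees) * sum(1 - (1 - 1 / p) ** (hi + 1) for hi in h)
print(random_rf, dbh_rf)  # the DBH value is never larger than the Random one
```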
Lemma 2. The maximum number of master vertices on each machine is bounded as follows:
$$\Pr[MaxLoad > k_a] = o(1) \quad \text{if } a > 1, \qquad \Pr[MaxLoad > k_a] = 1 - o(1) \quad \text{if } 0 < a < 1.$$
Here $MaxLoad = \max_m |\{v \in V \mid Master(v) = m\}|$, and $k_a = \frac{n}{p} + \sqrt{\frac{2n \ln p}{p}} \left(1 - \frac{\ln \ln p}{2a \ln p}\right)$.

4.2 Partitioning Power-law Graphs

Now we change the sequence of fixed degrees into a sequence of random samples generated from the power-law distribution. As a result, upper bounds can be provided for the above three methods, namely Random, Grid and DBH.

Theorem 3. Let the minimal degree be $d_{min}$ and let each $d \in \{d_i\}_{i=1}^{n}$ be sampled from a power-law degree distribution with parameter $\alpha \in (2, 3)$. The expected replication factor of Random on $p$ machines can be approximately bounded by:
$$\mathbb{E}_D\left[\frac{p}{n} \sum_{i=1}^{n} \left(1 - \left(1 - \frac{1}{p}\right)^{d_i}\right)\right] \leq p \left[1 - \left(1 - \frac{1}{p}\right)^{\hat{\Omega}}\right],$$
where $\hat{\Omega} = d_{min} \times \frac{\alpha - 1}{\alpha - 2}$.

This theorem says that when the degree sequence follows a power-law distribution, the upper bound of the expected replication factor increases as $\alpha$ decreases. This implies that Random yields a worse partitioning when the power-law graph is more skewed.

As in Corollary 1, we replace $p$ with $\sqrt{p}$ to get the similar result for Grid.

Corollary 2. By using the Grid method, the expected replication factor on $p$ machines can be approximately bounded by:
$$\mathbb{E}_D\left[\frac{\sqrt{p}}{n} \sum_{i=1}^{n} \left(1 - \left(1 - \frac{1}{\sqrt{p}}\right)^{d_i}\right)\right] \leq \sqrt{p} \left[1 - \left(1 - \frac{1}{\sqrt{p}}\right)^{\hat{\Omega}}\right],$$
where $\hat{\Omega} = d_{min} \times \frac{\alpha - 1}{\alpha - 2}$.

Note that $\sqrt{p}\left[1 - \left(1 - \frac{1}{\sqrt{p}}\right)^{\hat{\Omega}}\right] \leq p\left[1 - \left(1 - \frac{1}{p}\right)^{\hat{\Omega}}\right]$. So Corollary 2 tells us that Grid can reduce the replication factor, but the reduction is not driven by the skewness of the degree distribution.

Theorem 4. Assume each edge is hashed by our DBH method and $h_i \leq d_i - 1$ for any $v_i$. The expected replication factor of DBH on $p$ machines can be approximately bounded by:
$$\mathbb{E}_{H,D}\left[\frac{p}{n} \sum_{i=1}^{n} \left(1 - \left(1 - \frac{1}{p}\right)^{h_i + 1}\right)\right] \leq p \left[1 - \left(1 - \frac{1}{p}\right)^{\hat{\Omega}'}\right],$$
where $\hat{\Omega}' = d_{min} \times \frac{\alpha - 1}{\alpha - 2} - d_{min} \times \frac{\alpha - 1}{2\alpha - 3} + \frac{1}{2}$.

Note that $p\left[1 - \left(1 - \frac{1}{p}\right)^{\hat{\Omega}'}\right] < p\left[1 - \left(1 - \frac{1}{p}\right)^{\hat{\Omega}}\right]$. Therefore, our DBH method can be expected to reduce the replication factor. The term $\frac{\alpha - 1}{2\alpha - 3}$ increases as $\alpha$ decreases, which means our DBH reduces the replication factor more when the power-law graph is more skewed. Note that Grid and our DBH method actually use two different ways to reduce the replication factor: Grid reduces the replication factor more as $p$ grows. These two approaches can be combined to obtain further improvement, which is not the focus of this paper.

Finally, we show that our DBH method also guarantees good edge-balance (workload balance) under power-law distributions.
Theorem 5. Assume each edge is hashed by the DBH method with $d_{min}$, $\{v_i\}_{i=1}^{n}$, $\{d_i\}_{i=1}^{n}$ and $\{h_i\}_{i=1}^{n}$ defined above, and the vertices are evenly assigned. By taking the constant $2|E|/p = \mathbb{E}_D\left[\sum_{i=1}^{n} d_i\right]/p = n\,\mathbb{E}_D[d]/p$, there exists $\epsilon \in (0, 1)$ such that the expected edge-imbalance of DBH on $p$ machines can be bounded w.h.p. (with high probability). That is,
$$\mathbb{E}_{H,D}\left[\sum_{i=1}^{n} \frac{h_i}{p} + \max_{j \in [p]} \sum_{v_i \in P_j} (d_i - h_i)\right] \leq (1 + \epsilon) \frac{2|E|}{p}.$$

Note that any $\epsilon$ satisfying $1/\epsilon \ll n/p$ works for this theorem, which results in a tighter bound for large $n$. Therefore, together with Theorem 4, this theorem shows that our DBH method can reduce the replication factor and simultaneously guarantee good workload balance.

5 Empirical Evaluation

In this section, empirical evaluation on real and synthetic graphs is used to verify the effectiveness of our DBH method. The cluster used for the experiments contains 64 machines connected via 1 GB Ethernet. Each machine has 24 Intel Xeon cores and 96 GB of RAM.

5.1 Datasets

The graph datasets used in our experiments include both synthetic and real-world power-law graphs. Each synthetic power-law graph is generated by a combination of two synthetic directed graphs. The in-degrees and out-degrees of the two directed graphs are sampled from power-law degree distributions with different power parameters $\alpha$ and $\beta$, respectively. This collection of synthetic graphs is separated into two subsets: one subset with $\alpha \geq \beta$, shown in Table 1(a), and the other with $\alpha < \beta$, shown in Table 1(b). The real-world graphs are shown in Table 1(c). Some of the real-world graphs are the same as those in the experiments of PowerGraph, and some additional real-world graphs are from the UF Sparse Matrices Collection [6].

Table 1: Datasets

(a) Synthetic graphs: α ≥ β
Alias  α    β    |E|
S1     2.2  2.2  71,334,974
S2     2.2  2.1  88,305,754
S3     2.2  2.0  134,881,233
S4     2.2  1.9  273,569,812
S5     2.1  2.1  103,838,645
S6     2.1  2.0  164,602,848
S7     2.1  1.9  280,516,909
S8     2.0  2.0  208,555,632
S9     2.0  1.9  310,763,862

(b) Synthetic graphs: α < β
Alias  α    β    |E|
S10    2.1  2.2  88,617,300
S11    2.0  2.2  135,998,503
S12    2.0  2.1  145,307,486
S13    1.9  2.2  280,090,594
S14    1.9  2.1  289,002,621
S15    1.9  2.0  327,718,498

(c) Real-world graphs
Alias  Graph             |V|   |E|
Tw     Twitter [11]      42M   1.47B
Arab   Arabic-2005 [6]   22M   0.6B
Wiki   Wiki [2]          5.7M  130M
LJ     LiveJournal [17]  5.4M  79M
WG     WebGoogle [13]    0.9M  5.1M

5.2 Baselines and Evaluation Metric

In our experiments, we adopt Random of PowerGraph [7] and Grid of GraphBuilder [9]^1 as baselines for empirical comparison. The method Hybrid of PowerLyra [4] is not adopted for comparison because it combines both edge-cut and vertex-cut and hence is not a pure vertex-cut method. One important metric is the replication factor, which reflects the communication cost. To test the speedup for real applications, we use the total execution time of PageRank, which is forced to take 100 iterations. The speedup is defined as $speedup = 100\% \times (\gamma_{Alg} - \gamma_{DBH}) / \gamma_{Alg}$, where $\gamma_{Alg}$ is the execution time of PageRank with the method $Alg$. Here, $Alg$ can be Random or Grid. Because all the methods can achieve good workload balance in our experiments, we do not report it here.

^1 GraphLab 2.2, released in July 2013, has used PowerGraph as its engine, and the Grid GP method has been adopted by GraphLab 2.2 to replace the original Random GP method. Detailed information can be found at: http://graphlab.org/projects/index.html
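The paper does not specify its degree-sequence generator, so the following Python sketch is an assumption-laden illustration: it samples degrees with $\Pr(d) \propto d^{-\alpha}$ via a continuous inverse-CDF (Pareto) approximation, and compares the empirical Lemma 1 value for Random against the closed-form bounds of Theorems 3 and 4.

```python
# Sketch (assumed generator): sample degrees with Pr(d) ~ d^(-alpha) using a
# continuous Pareto inverse-CDF approximation, then evaluate the bounds.
import random

def sample_powerlaw_degrees(n, alpha, d_min=1):
    # Inverse CDF of the continuous Pareto approximation, rounded down.
    return [int(d_min * (1.0 - random.random()) ** (-1.0 / (alpha - 1.0)))
            for _ in range(n)]

p, alpha, d_min = 48, 2.2, 1
degrees = sample_powerlaw_degrees(100_000, alpha, d_min)

# Empirical expected replication factor of Random (Lemma 1 closed form).
empirical = p / len(degrees) * sum(1 - (1 - 1 / p) ** d for d in degrees)

# Closed-form upper bounds of Theorem 3 (Random) and Theorem 4 (DBH).
omega = d_min * (alpha - 1) / (alpha - 2)
omega_dbh = omega - d_min * (alpha - 1) / (2 * alpha - 3) + 0.5
bound = lambda w: p * (1 - (1 - 1 / p) ** w)
print(empirical, bound(omega), bound(omega_dbh))
```

Rerunning with a smaller $\alpha$ (more skew) widens the gap between the two bounds, consistent with the discussion after Theorem 4.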
5.3 Results

Figure 3 shows the replication factor on the two subsets of synthetic graphs. We can find that our DBH method achieves a much lower replication factor than Random and Grid. The replication factor of DBH is reduced by up to 80% compared to Random and 60% compared to Grid.

Figure 3: Experiments on the two subsets of synthetic graphs; panels (a) and (b) plot the replication factor of Random, Grid and DBH. The X-axis denotes the datasets in Table 1(a) and Table 1(b). The number of machines is 48.

Figure 4 (a) shows the replication factor on the real-world graphs. We can also find that DBH achieves the best performance. Figure 4 (b) shows that the relative speedup of DBH is up to 60% over Random and 25% over Grid on the PageRank computation.

Figure 4: Experiments on real-world graphs; panel (a) plots the replication factor, and panel (b) plots the execution speedup over Random and Grid. The number of machines is 48.

Figure 5 shows the replication factor and the execution time of PageRank on the Twitter graph when the number of machines ranges from 8 to 64. We can find that our DBH achieves the best performance in all cases.

Figure 5: Experiments on the Twitter graph; panel (a) plots the replication factor, and panel (b) plots the execution time. The number of machines ranges from 8 to 64.

6 Conclusion

In this paper, we have proposed a new vertex-cut graph partitioning method called degree-based hashing (DBH) for distributed graph-computing frameworks. DBH can effectively exploit the power-law degree distributions in natural graphs to achieve promising performance. Both theoretical and empirical results show that DBH can outperform the state-of-the-art methods. In our future work, we will apply DBH to more big data machine learning tasks.

7 Acknowledgements

This work is supported by the NSFC (No. 61100125, No. 61472182), the 863 Program of China (No. 2012AA011003), and the Fundamental Research Funds for the Central Universities.
References

[1] Lada A Adamic and Bernardo A Huberman. Zipf's law and the internet. Glottometrics, 3(1):143–150, 2002.

[2] Paolo Boldi and Sebastiano Vigna. The webgraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW), 2004.

[3] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1):309–320, 2000.

[4] Rong Chen, Jiaxin Shi, Yanzhe Chen, Haibing Guan, and Haibo Chen. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report IPADSTR-2013-001, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, 2013.

[5] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.

[6] Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1, 2011.

[7] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.

[8] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.

[9] Nilesh Jain, Guangdeng Liao, and Theodore L Willke. GraphBuilder: scalable graph ETL framework. In Proceedings of the First International Workshop on Graph Data Management Experiences and Systems, 2013.

[10] George Karypis and Vipin Kumar. Multilevel graph partitioning schemes. In Proceedings of the International Conference on Parallel Processing (ICPP), 1995.

[11] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

[12] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.

[13] Jure Leskovec. Stanford large network dataset collection. URL http://snap.stanford.edu/data/index.html, 2011.

[14] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new framework for parallel machine learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

[15] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2012.

[16] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2010.

[17] Alan Mislove, Massimiliano Marcon, Krishna P Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, 2007.

[18] Martin Raab and Angelika Steger. "Balls into bins": a simple and tight analysis. In Randomization and Approximation Techniques in Computer Science, pages 159–170. Springer, 1998.

[19] Isabelle Stanton and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2012.

[20] Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM), 2014.

[21] Lu Wang, Yanghua Xiao, Bin Shao, and Haixun Wang. How to partition a billion-node graph. In Proceedings of the International Conference on Data Engineering (ICDE), 2014.
A Proofs

A.1 The Proof of Theorem 1

Proof. Let the indicator $H_j$ denote the event that vertex $v_i$ has at least one of its $h_i$ edges on the $j$-th machine. Then the expectation $\mathbb{E}[H_j]$ is
$$\mathbb{E}[H_j] = 1 - \Pr(\text{none of the } h_i \text{ edges is on machine } j) = 1 - \left(1 - \frac{1}{p}\right)^{h_i}.$$
For some vertex $v_i$, $h_i$ adjacent edges are hashed by the neighbors of $v_i$, and $(d_i - h_i)$ adjacent edges are hashed by $v_i$ itself to the same machine. So for the $(d_i - h_i)$ edges, the number of replicas of $v_i$ is simply 1, due to the assumption $h_i \leq d_i - 1$.

As for the residual $h_i$ adjacent edges, we have $\mathbb{E}[H_j] = 1 - \left(1 - \frac{1}{p}\right)^{h_i}$. Here $H_j$ involves the other $p - 1$ machines except for the one that already has a replica.

Putting the two parts together, we have
$$\mathbb{E}[|A(v_i)|] = 1 + \sum_{j=1}^{p-1} \mathbb{E}[H_j] = 1 + (p - 1)\left[1 - \left(1 - \frac{1}{p}\right)^{h_i}\right] = p\left[1 - \left(1 - \frac{1}{p}\right)^{h_i + 1}\right].$$
Thus, the expected replication factor is:
$$\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} |A(v_i)|\right] = \frac{1}{n} \sum_{i=1}^{n} p\left[1 - \left(1 - \frac{1}{p}\right)^{h_i + 1}\right] = \frac{p}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{p}\right)^{h_i + 1}\right].$$
Note that we assume $h_i \leq d_i - 1$, which means the hash function hashes at least one of the adjacent edges of $v_i$ by $v_i$ itself. This assumption guarantees that there will be no "single" master vertex for which no adjacent edges are in the same partition.

By Lemma 1, we obtain
$$\frac{p}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{p}\right)^{h_i + 1}\right] \leq \frac{p}{n} \sum_{i=1}^{n} \left[1 - \left(1 - \frac{1}{p}\right)^{d_i}\right],$$
which implies that the hash-based vertex-cut via the degree-based approach is at least as good as the randomized vertex-cut in terms of the replication factor. □

A.2 The Proof of Theorem 2

Proof. Since we assume that the vertices are evenly hashed to all the machines, the $h_i$ adjacent edges of $v_i$ are also evenly assigned to all the machines. Subsequently, each machine holds $\frac{h_i}{p}$ of these edges. Summing over all the vertices, we obtain $\sum_{i=1}^{n} \frac{h_i}{p}$. The remaining $d_i - h_i$ adjacent edges of $v_i$ are assigned to the same machine, so this part of the edges incurs imbalance. In the above procedure each edge is actually counted twice (once from each endpoint). Thus, the final result is
$$\frac{\max_m |\{e \in E \mid M(e) = m\}|}{|E|/p} = \frac{\sum_{i=1}^{n} \frac{h_i}{p} + \max_{j \in [p]} \sum_{v_i \in P_j} (d_i - h_i)}{2|E|/p}. \qquad \square$$