Blocking-based Neighbor Sampling for Large-scale Graph Neural Networks

Kai-Lang Yao and Wu-Jun Li*
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, China
yaokl@lamda.nju.edu.cn, liwujun@nju.edu.cn
*Corresponding author

Abstract

The exponential increase in computation and memory complexity with the depth of the network has become the main impediment to the successful application of graph neural networks (GNNs) on large-scale graphs, such as graphs with hundreds of millions of nodes. In this paper, we propose a novel neighbor sampling strategy, dubbed blocking-based neighbor sampling (BNS), for efficient training of GNNs on large-scale graphs. Specifically, BNS adopts a policy to stochastically block the ongoing expansion of neighboring nodes, which can reduce the rate of the exponential increase in computation and memory complexity of GNNs. Furthermore, a reweighted policy is applied to graph convolution to adjust the contribution of blocked and non-blocked neighbors to central nodes. We theoretically prove that BNS provides an unbiased estimation for the original graph convolution operation. Extensive experiments on three benchmark datasets show that, on large-scale graphs, BNS is 2× ∼ 5× faster than state-of-the-art methods when achieving the same accuracy. Moreover, even on small-scale graphs, BNS also demonstrates the advantage of low time cost.

1 Introduction

Graphs have been widely used for describing unstructured data in real applications such as social networks, brain networks, molecular graphs, and knowledge graphs. Edges in graphs depict the complex relationships between samples, and rich relational information between samples is contained in graphs. Making good use of this rich relational information has great potential in boosting the performance of traditional machine learning methods, which are mainly designed for modeling independent and identically distributed (i.i.d.) data. In addition, graph data is now widely available in many applications. Therefore, developing advanced graph learning algorithms is a topic of great interest.

Among various algorithms of representation learning for graphs, graph neural networks (GNNs) [Gori et al., 2005; Bruna et al., 2014] have recently become the most successful and popular ones, due to their powerful ability in modeling complex relationships between samples. Although many advanced GNN models [Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018] have been proposed, most of them have only been successfully applied to small-scale graphs (e.g., graphs with hundreds of thousands of nodes). There are significant challenges in applying existing GNN methods to applications with large-scale graphs (e.g., graphs with hundreds of millions of nodes) because of the expensive computation and memory cost during the training process. Due to the iteratively dependent nature of nodes in GNNs, the number of nodes supporting the computation of the output layer increases exponentially with the depth of the network. Hence, the computation and memory complexity grow exponentially. Moreover, recent works [Li et al., 2019; Verma and Zhang, 2020; Chen et al., 2020c] show the potential to improve the performance of GNN models as the network becomes deeper, which will undoubtedly exacerbate the problem of expensive training cost on large-scale graphs.
Nowadays, in order to speed up the training process, it is a dominant trend to perform training on GPUs. However, many GPUs have limited graphics memory, which hinders GNN models from training with large batch size and as a result leads to a sharp increase in time cost for training. Solutions for the above problem mainly include model simplification methods and sampling-based methods. For model simplification methods [Wu et al., 2019; Klicpera et al., 2019; Chen et al., 2020b], the main idea is to remove the nonlinear transformation between graph convolution layers such that the graph convolution on node features can be preprocessed before training. Although model simplification methods are efficient in training, as stated in [Chen et al., 2020a], it is still an open question whether simplified GNNs’ expressive power can match that of the original GNNs. For sampling-based methods, existing works can be broadly categorized into node-wise sampling [Hamilton et al., 2017; Chen et al., 2018a; Cong et al., 2020], layer-wise sampling [Chen et al., 2018b; Huang et al., 2018; Zou et al., 2019], and subgraph sampling [Chiang et al., 2019; Zeng et al., 2020]. For node-wise sampling, the main idea is to sample a number of neighbors for each node of each layer in a top-down
manner. For layer-wise sampling, the main idea is to independently sample a number of nodes from a candidate set for each layer based on the importance probabilities of nodes. All connections between the nodes of two adjacent layers are used to perform approximate graph convolution. For subgraph sampling, the main idea is to sample a subgraph and feed it to GNN models before each round of mini-batch training. Although the above sampling strategies are applicable to large-scale GNNs, they have some deficiencies or limitations in terms of accuracy, total time cost, or memory cost. For example, existing node-wise sampling strategies need to sample a large number of neighbors for high accuracy, which will lead to a sharp increase in time cost. Layer-wise sampling strategies have a high time cost of preparing data (including sampling) and may suffer from sparse connections between two adjacent layers. Subgraph sampling strategies may also suffer from sparse connections in subgraphs.

In this paper, we propose a novel node-wise sampling strategy, called blocking-based neighbor sampling (BNS), for large-scale training of GNNs. The contributions of this paper are listed as follows:

• We propose a novel blocking mechanism in BNS to stochastically block the ongoing expansion of neighboring nodes, dramatically reducing the computation and memory complexity.
• We further propose a reweighted policy to adjust the contribution of blocked and non-blocked neighboring nodes to central nodes.
• We theoretically prove that BNS provides an unbiased estimation for the original graph convolution operation.
• Extensive experiments on large-scale graphs show that BNS is 2× ∼ 5× faster than existing state-of-the-art methods when achieving the same accuracy. Even on the small-scale graph, BNS also demonstrates the advantage of low time cost.

2 Notations and Problem Definition

2.1 Notations

We use boldface uppercase letters, such as $B$, to denote matrices. The $i$th row and the $j$th column of a matrix $B$ are denoted as $B_{i*}$ and $B_{*j}$, respectively. $B_{ij}$ denotes the element at the $i$th row and $j$th column of $B$. $\|B\|_0$ denotes the number of non-zero entries in $B$, and $\|B\|_F$ denotes the Frobenius norm of $B$.

2.2 Problem Definition

Suppose we have a graph with $N$ nodes. Let $A \in \{0,1\}^{N \times N}$ denote the adjacency matrix of the graph: $A_{ij} = 1$ denotes that there exists an edge between node $i$ and node $j$, and $A_{ij} = 0$ denotes that there is no edge between them. Let $X \in \mathbb{R}^{N \times u}$ denote the node feature matrix, where $u$ denotes the dimension of node features. Suppose the average number of neighbors per node in the graph is $s$, and the mini-batch size of nodes at the output layer is $B$. We use $L$ to denote the number of layers of GNNs.

We take GCN [Kipf and Welling, 2017] as an example to describe the problem of the exponential increase in computation and memory complexity. Let $A' = A + I$ and $\hat{A} = D^{-\frac{1}{2}} A' D^{-\frac{1}{2}}$, where $D$ denotes the diagonal degree matrix of $A'$ with $D_{ii} = \sum_{j=1}^{N} A'_{ij}$. Then GCN can be formulated as follows:

$$Z^{(\ell)}_{i*} = \sum_{j \in \mathcal{N}(i)} \hat{A}_{ij} H^{(\ell-1)}_{j*}, \qquad H^{(\ell)}_{i*} = f\big(Z^{(\ell)}_{i*} W^{(\ell)}\big), \tag{1}$$

where $H^{(0)} = X$, $f(\cdot)$ is the activation function, $\mathcal{N}(i)$ denotes the set of neighbors of node $i$, and $W^{(\ell)} \in \mathbb{R}^{r \times r}$ is a learnable parameter.

From (1), we can see that the output of a node at the $L$th layer iteratively depends on the information of its $1, \cdots, L$-hop neighbors. Such an iteratively dependent nature of nodes leads to the exponential increase in computation and memory complexity with the depth of the network. Let $r$ denote the feature dimension of the hidden layers.
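To make Equation (1) and the neighborhood expansion behind it concrete, here is a minimal sketch. It is our own illustration rather than the authors' code: the helper names (`normalized_adjacency`, `gcn_layer`, `receptive_field`) and the NumPy/SciPy implementation are assumptions, and ReLU is used as one possible choice of f. The last helper counts how many nodes an L-layer GCN must touch to compute the outputs of a single mini-batch, which is the quantity whose exponential growth is analyzed next.

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(A):
    # A_hat = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A' = A + I,
    # as in the GCN formulation of Equation (1). A is a SciPy sparse adjacency matrix.
    A_prime = sp.csr_matrix(A) + sp.eye(A.shape[0], format="csr")
    deg = np.asarray(A_prime.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ A_prime @ d_inv_sqrt

def gcn_layer(A_hat, H, W):
    # One layer: Z = A_hat @ H, followed by H' = f(Z W) with f = ReLU.
    Z = A_hat @ H
    return np.maximum(Z @ W, 0.0)

def receptive_field(A, batch, L):
    # Nodes required to compute the outputs of `batch` with an L-layer GCN.
    # With average degree s, this set grows roughly as s^L.
    A = sp.csr_matrix(A)
    needed = set(batch)
    frontier = set(batch)
    for _ in range(L):
        nxt = set()
        for i in frontier:
            nxt.update(A.indices[A.indptr[i]:A.indptr[i + 1]].tolist())
        frontier = nxt - needed
        needed |= nxt
    return needed
```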
The computation and memory complexity during a mini-batch training step are then $O(s^{L-1} \cdot (sBr + Br^2))$ and $O(Lr^2 + s^L \cdot Br)$, respectively.

3 Blocking-based Neighbor Sampling

In this section, we present the details of BNS. Firstly, we sample a fixed number of neighbors for each node at the current layer ℓ. Secondly, we adopt a policy to stochastically block the ongoing expansion of neighboring nodes at the preceding layers {1, ..., ℓ-1}. Note that once a node is blocked, all of its paths to other nodes are blocked, and it is trapped at its current position. Thirdly, after sampling finishes, reweighted graph convolution is performed to obtain the outputs, in which a reweighted policy is adopted to adjust the contribution of blocked and non-blocked neighbors to central nodes. A visual illustration of BNS is presented in Figure 1.

[Figure 1 here: panel (a) non-sampling, panel (b) BNS]
Figure 1: A visual illustration of BNS. Solid circles refer to nodes. The node within the inner dashed circle refers to the node of the output layer. (a) We assume each node has 5 neighbors. (b) Black solid circles refer to blocked neighbors. The thickness of the solid lines connecting two nodes indicates the magnitude of the weights of nodes in reweighted graph convolution.

3.1 Sampling Algorithm

The entire sampling process is performed in a top-down manner, and it is summarized in Algorithm 1. Let $\mathcal{V}_{in}$ denote a mini-batch of nodes in the output layer. Firstly, we sample $s_n$ neighbors for each node $i$ at layer ℓ in line 5. Sample($\mathcal{N}(i)$, $s_n$) in line 5 is an operation that uniformly
samples $s_n$ elements from $\mathcal{N}(i)$. Then, we randomly select $s_n \times \delta$ ($0 \le \delta \le 1$) nodes from $\mathcal{N}^{\ell}(i)$ in line 6, and stop sampling neighbors for them at the preceding layers {1, ..., ℓ-1} via the operations in line 7 and line 13. Block($\mathcal{N}^{\ell}(i)$, $\delta$) in line 6 is an operation that uniformly samples $|\mathcal{N}^{\ell}(i)| \times \delta$ elements from $\mathcal{N}^{\ell}(i)$ as blocked neighbors. $\mathcal{V}^{\ell}_{b}$ records the blocked nodes at the ℓth layer, which are used in subsequent processing steps. The operations in line 10 and line 11 ensure that the blocked nodes are mapped to the same feature space as non-blocked nodes.

Algorithm 1 Sampling Algorithm
Require: Mini-batch of nodes $\mathcal{V}_{in}$, the number of neighbors sampled for each node $s_n$, ratio of blocked neighbors per node $\delta$.
Ensure: $\{(\mathcal{V}^{\ell}_{nb}, \mathcal{V}^{\ell}_{b}, \{(\mathcal{N}^{\ell}_{nb}(i), \mathcal{N}^{\ell}_{b}(i))\}_{i=1}^{N})\}_{\ell=1}^{L}$
1: $\mathcal{V}^{L}_{nb} = \mathcal{V}_{in}$, $\mathcal{V}_{b} = \emptyset$
2: Sample in a top-down manner:
3: for $\ell = L : 1$ do
4:   for $i \in \mathcal{V}^{\ell}_{nb}$ do
5:     $\mathcal{N}^{\ell}(i) = \text{Sample}(\mathcal{N}(i), s_n)$
6:     $\mathcal{N}^{\ell}_{b}(i) = \text{Block}(\mathcal{N}^{\ell}(i), \delta)$
7:     $\mathcal{N}^{\ell}_{nb}(i) = \mathcal{N}^{\ell}(i) \setminus \mathcal{N}^{\ell}_{b}(i)$
8:   end for
9:   for $i \in \mathcal{V}_{b}$ do
10:    $\mathcal{N}^{\ell}(i) = \mathcal{N}^{\ell}(i) \cup \{i\}$
11:    $\mathcal{N}^{\ell}_{b}(i) = \mathcal{N}^{\ell}_{b}(i) \cup \{i\}$
12:   end for
13:   $\mathcal{V}^{\ell-1}_{nb} = \bigcup_{i \in \mathcal{V}^{\ell}_{nb}} \mathcal{N}^{\ell}_{nb}(i)$
14:   $\mathcal{V}^{\ell-1}_{b} = \bigcup_{i \in \mathcal{V}^{\ell}_{nb}} \mathcal{N}^{\ell}_{b}(i)$
15:   $\mathcal{V}_{b} = \mathcal{V}_{b} \cup \mathcal{V}^{\ell-1}_{b}$
16: end for

3.2 Reweighted Graph Convolution

We first reformulate $Z^{(\ell)}_{i*}$ in Equation (1) into an expectation form:

$$Z^{(\ell)}_{i*} = |\mathcal{N}(i)| \cdot \mathbb{E}_{j \sim p(j \in \mathcal{N}(i) \mid i)}\, \hat{A}_{ij} H^{(\ell-1)}_{j*}, \tag{2}$$

where $p(j \in \mathcal{N}(i) \mid i)$ is a uniform distribution over the neighbors of node $i$, and $|\mathcal{N}(i)|$ denotes the number of elements in $\mathcal{N}(i)$.

For blocked nodes, since their neighbor expansions are blocked, their estimation for Equation (2) is less precise (has larger variance) than that of non-blocked nodes, and the representations of blocked nodes carry little information about the input graph. Therefore, it is reasonable to increase the contribution of non-blocked nodes to the central nodes. We perform reweighted graph convolution to achieve this goal.

After Algorithm 1 is performed, reweighted graph convolution is formulated as follows. For readability, we denote $n^{\ell}_{i,1} = |\mathcal{N}^{\ell}_{b}(i)|$, $n^{\ell}_{i,2} = |\mathcal{N}^{\ell}_{nb}(i)|$ and $n_i = |\mathcal{N}(i)|$:

$$\rho^{\ell}_{i,1} = \rho \cdot \frac{n^{\ell}_{i,1} + n^{\ell}_{i,2}}{n^{\ell}_{i,2}}, \qquad \rho^{\ell}_{i,2} = (1-\rho) \cdot \frac{n^{\ell}_{i,1} + n^{\ell}_{i,2}}{n^{\ell}_{i,1}}, \tag{3}$$

$$\tilde{A}^{\ell}_{ij} = \rho^{\ell}_{i,1} \cdot \frac{n_i}{n^{\ell}_{i,1} + n^{\ell}_{i,2}} \cdot \hat{A}_{ij}, \quad \forall j \in \mathcal{N}^{\ell}_{nb}(i) \text{ and } i \in \mathcal{V}^{\ell}_{nb},$$
$$\tilde{A}^{\ell}_{ij} = \rho^{\ell}_{i,2} \cdot \frac{n_i}{n^{\ell}_{i,1} + n^{\ell}_{i,2}} \cdot \hat{A}_{ij}, \quad \forall j \in \mathcal{N}^{\ell}_{b}(i) \text{ and } i \in \mathcal{V}^{\ell}_{nb},$$
$$\tilde{A}^{\ell}_{ii} = n_i \cdot \hat{A}_{ii}, \quad i \in \mathcal{V}^{\ell}_{b} \setminus \mathcal{V}^{\ell}_{nb},$$
$$Z^{(\ell)}_{i*} \approx \sum_{j \in \mathcal{N}^{(\ell)}(i)} \tilde{A}_{ij} H^{(\ell-1)}_{j*} := \tilde{Z}^{(\ell)}_{i*}, \quad \forall i \in \mathcal{V}^{\ell}_{nb} \cup \mathcal{V}^{\ell}_{b}, \tag{4}$$

$$H^{(\ell)}_{i*} = f\big(\tilde{Z}^{(\ell)}_{i*} W^{(\ell)}\big), \tag{5}$$

where $\rho \in [0,1]$. Compared with the plain Monte-Carlo weight $(|\mathcal{N}(i)|/|\mathcal{N}^{\ell}(i)|) \cdot \hat{A}_{ij}$ corresponding to Equation (2), $\tilde{A}_{ij}$ adopts different weights, $\rho^{\ell}_{i,1}$ and $\rho^{\ell}_{i,2}$, to adjust the contribution of non-blocked and blocked nodes to node $i$. In the following proposition, we prove that $\tilde{Z}^{(\ell)}_{i*}$ is an unbiased estimation of $Z^{(\ell)}_{i*}$, which makes our proposed reweighted graph convolution theoretically sound. In experiments, $\rho$ is set to 0.5 for convenience.

Proposition 1. Suppose $H^{(\ell-1)}$ is given. If $\mathcal{N}^{\ell}(i)$ is uniformly sampled from $\mathcal{N}(i)$, $\mathcal{N}^{\ell}_{b}(i)$ is uniformly sampled from $\mathcal{N}^{\ell}(i)$ and $\rho \in [0,1]$, then $\tilde{Z}^{(\ell)}_{i*}$ defined in Equation (4) is an unbiased estimation of $Z^{(\ell)}_{i*}$.

Proof. The proof can be found in the Appendix (available at https://cs.nju.edu.cn/lwj/).
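The following is a minimal, single-layer Python sketch that puts Algorithm 1 and the reweighted aggregation of Equations (3) and (4) together. It is our own illustration, not the authors' implementation: the function names and the dictionary-based bookkeeping are assumptions, the normalized adjacency is a dense placeholder, and the handling of blocked nodes at lower layers (lines 9-12 of Algorithm 1 and the self-loop case of Equation (4)) is omitted.

```python
import random
import numpy as np

def bns_sample_layer(neighbors, batch, s_n, delta, rng):
    """One layer of Algorithm 1: for each non-blocked node i in `batch`,
    uniformly sample s_n neighbors (line 5), then uniformly mark a fraction
    delta of them as blocked (line 6)."""
    sampled, blocked = {}, {}
    for i in batch:
        nbrs = rng.sample(neighbors[i], min(s_n, len(neighbors[i])))
        blk = set(rng.sample(nbrs, int(len(nbrs) * delta)))
        sampled[i], blocked[i] = nbrs, blk
    return sampled, blocked

def reweighted_aggregate(A_hat, H, neighbors, sampled, blocked, rho=0.5):
    """Z_tilde for the sampled central nodes, following Equations (3) and (4):
    a non-blocked neighbor j gets weight rho * n_i / n_nb * A_hat[i, j],
    a blocked neighbor gets (1 - rho) * n_i / n_b * A_hat[i, j]."""
    Z = {}
    for i, nbrs in sampled.items():
        n_i = len(neighbors[i])                 # |N(i)|
        n_b = len(blocked[i])                   # n_{i,1}: blocked neighbors
        n_nb = len(nbrs) - n_b                  # n_{i,2}: non-blocked neighbors
        s = max(n_b + n_nb, 1)                  # |N^l(i)|
        w_nb = rho * s / max(n_nb, 1)           # rho^l_{i,1}
        w_b = (1.0 - rho) * s / max(n_b, 1)     # rho^l_{i,2}
        z = np.zeros(H.shape[1])
        for j in nbrs:
            w = w_b if j in blocked[i] else w_nb
            z += w * (n_i / s) * A_hat[i, j] * H[j]
        Z[i] = z
    return Z

# Toy usage: a 5-node graph stored as adjacency lists, random features,
# and a constant placeholder for the normalized adjacency.
if __name__ == "__main__":
    rng = random.Random(0)
    neighbors = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
    H = np.random.RandomState(0).rand(5, 8)
    A_hat = np.full((5, 5), 0.2)
    sampled, blocked = bns_sample_layer(neighbors, [0], s_n=4, delta=0.5, rng=rng)
    Z = reweighted_aggregate(A_hat, H, neighbors, sampled, blocked)
```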
3.3 Objective Function

Let $\mathcal{W} = \{W^{(1)}, \cdots, W^{(L)}\}$ denote the learnable parameters defined in Equation (5), and let $\hat{Y} = H^{(L)}$ denote the output of the GNN model. For multi-class classification, $f(\cdot)$ in the last layer denotes the softmax function, while it denotes the sigmoid function for multi-label classification. The objective function for BNS is formulated as follows:

$$\min_{\mathcal{W}} \sum_{i \in \mathcal{V}'} \sum_{c} -Y_{ic} \log \hat{Y}_{ic} + \frac{\lambda}{2} \cdot \sum_{\ell} \|W^{(\ell)}\|_F^2, \tag{6}$$

where $\lambda$ is a hyper-parameter for the regularization term of the parameters $\mathcal{W}$, and $\mathcal{V}'$ denotes the set of nodes in the training set.

3.4 Complexity Analysis

In this subsection, we compare the computation and memory complexity of different methods with those of BNS in a mini-batch training step, as summarized in Table 1. Existing node-wise sampling methods, NS [Hamilton et al., 2017], VRGCN [Chen et al., 2018a] and MVS-GNN [Cong et al., 2020], reduce the growth rate from $s$ to $s_n$, where $s_n$ is much smaller than $s$. In particular, VRGCN and MVS-GNN show that they can achieve accuracy comparable to NS with a smaller $s_n$. The layer-wise sampling method LADIES [Zou et al., 2019] and the subgraph sampling method GraphSAINT [Zeng et al., 2020] reduce the computation and memory complexity to a level that is linear in the depth of the network.

Although the above methods can achieve good performance in terms of accuracy, time cost, and memory cost on small-scale graphs (e.g., graphs with hundreds of thousands of nodes), they are not efficient or even not applicable for large-scale graphs (e.g., graphs with millions of nodes and hundreds of millions of nodes). Some problems and drawbacks of these methods have been overlooked due to the lack of systematic experimental analysis on large-scale graphs.
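To get a feel for the growth rates compared in Table 1 below, here is a small back-of-the-envelope calculation. The depth, degree and sampling sizes are our own illustrative choices, not values reported in the paper; the expressions follow the s^(L-1), s_n^(L-1) and s̃_n^(L-1) factors in Table 1.

```python
# Illustrative only: representative values plugged into the growth terms of Table 1.
L = 5                          # depth of the GNN
s = 50                         # average number of neighbors per node
s_n = 10                       # neighbors sampled per node (node-wise sampling)
delta = 0.5                    # fraction of sampled neighbors that BNS blocks
s_tilde = int(s_n * (1 - delta))   # non-blocked neighbors that keep expanding

full = s ** (L - 1)            # no sampling
ns = s_n ** (L - 1)            # node-wise sampling (NS)
bns = s_tilde ** (L - 1)       # BNS: only non-blocked neighbors expand

print(full, ns, bns)           # 6250000, 10000, 625
```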
Method | Computation complexity | Memory complexity
Non-sampling [Kipf and Welling, 2017] | $O(s^{L-1} \cdot (sBr + Br^2))$ | $O(Lr^2 + s^L \cdot Br)$
NS [Hamilton et al., 2017] | $O(s_n^{L-1} \cdot (s_n Br + Br^2))$ | $O(Lr^2 + s_n^L \cdot Br)$
VRGCN [Chen et al., 2018a] | $O(s_n^{L-1} \cdot ((s_n + s) \cdot Br + Br^2))$ | $O(Lr^2 + s_n^{L-1} \cdot (s + s_n)Br)$
MVS-GNN [Cong et al., 2020] | $O(s_n^{L-1} \cdot ((s_n + s) \cdot Br + Br^2))$ | $O(Lr^2 + s_n^{L-1} \cdot (s + s_n)Br)$
LADIES [Zou et al., 2019] | $O(L \cdot (s_l/N)^2 \cdot \|A\|_0 + L s_l \cdot r^2)$ | $O(Lr^2 + L s_l \cdot r)$
GraphSAINT [Zeng et al., 2020] | $O(L \cdot (s_g/N)^2 \cdot \|A\|_0 + L s_g \cdot r^2)$ | $O(Lr^2 + L s_g \cdot r)$
BNS (ours) | $O(\tilde{s}_n^{L-1} \cdot (s_n Br + (\delta/(1-\delta) + 1) \cdot Br^2))$ | $O(Lr^2 + \tilde{s}_n^{L-1} \cdot s_n Br)$

Table 1: Computation and memory complexity. $s$ denotes the average number of neighbors per node in $A$. $s_n$ denotes the average number of neighbors sampled for each node. $\tilde{s}_n = s_n \times (1-\delta)$, where $\delta$ denotes the ratio of blocked nodes in BNS. $s_l$ denotes the average number of nodes per layer in layer-wise sampling. $s_g$ denotes the average number of nodes per subgraph in subgraph sampling. $B = |\mathcal{V}_{in}|$ denotes the mini-batch size of the output layer. $L$ is the number of layers in GNN models. $r$ is the hidden dimension of the networks.

For example, even with low computation complexity, VRGCN, MVS-GNN and LADIES have a high time cost of preparing data (including sampling) before each round of mini-batch training. In addition, VRGCN brings a huge burden to memory by storing all nodes' historical representations at each layer. MVS-GNN has the same complexity as the non-sampling method at the outer iteration, which might make training infeasible on large-scale graphs because of running out of graphics memory. GraphSAINT faces the problem of sparse connection in subgraphs. Moreover, GraphSAINT adopts the non-sampling strategy at the evaluation and testing stage, which is also inefficient on large-scale graphs.

Similar to existing node-wise sampling methods, BNS reduces the growth rate from $s$ to a small $\tilde{s}_n$, where $\tilde{s}_n$ denotes the number of non-blocked neighbors per node. We will show that with a small $\tilde{s}_n$, BNS can achieve accuracy comparable to NS with a large $s_n$, while BNS has lower computation and memory complexity. Moreover, BNS has a low time cost of preparing data before each round of mini-batch training.

4 Experiments

In this section, we compare BNS with other baselines on five node-classification datasets. BNS is implemented on the PyTorch platform [Paszke et al., 2019] with the PyTorch-Geometric library [Fey and Lenssen, 2019]. All experiments are run on an NVIDIA TitanXP GPU server with 12 GB graphics memory.

4.1 Datasets

Ogbn-products, ogbn-papers100M and ogbn-proteins (https://ogb.stanford.edu/docs/nodeprop/) are publicly available [Hu et al., 2020]. Ogbn-products is a large-scale dataset with millions of nodes. Ogbn-papers100M is a large-scale dataset with hundreds of millions of nodes. Ogbn-proteins is a small-scale dataset with hundreds of thousands of nodes. Amazon and Yelp, used in GraphSAINT, are also used for evaluation. Due to space limitation, the information and results on Amazon and Yelp are moved to the Appendix. The statistics of the datasets can be found in the Appendix.

4.2 Baselines and Settings

We compare BNS with VRGCN [Chen et al., 2018a], LADIES [Zou et al., 2019] and GraphSAINT [Zeng et al., 2020], which are the state-of-the-art methods with node-wise sampling, layer-wise sampling and subgraph sampling, respectively. Additionally, we compare BNS with the classical node-wise sampling method NS [Hamilton et al., 2017].
We do not compare BNS with MVS-GNN since MVS-GNN adopts the non-sampling strategy for training at the outer iteration, which leads to the problem of running out of graphics memory. Besides, comparisons with model simplification methods are moved to the Appendix due to space limitation. Since the original implementations of the above baselines cannot directly scale to the benchmark datasets in this paper, we re-implement them according to the corresponding authors' codes. For a fair comparison, the implementations of all methods, including BNS, only differ in the sampling process. For all methods, the GNN model is instantiated with GraphSAGE [Hamilton et al., 2017], since it can achieve good performance on the benchmark datasets. Note that sampling strategies and settings during inference are the same as those in the training stage for all methods except for GraphSAINT.

The hyper-parameters r, L, T (maximum epoch), λ and p (probability of dropout) are independent of sampling strategies, and hence they are set to be the same for different sampling strategies on one specific dataset. Empirically, r is set to 128 on all datasets, L is set to 5 on both ogbn-proteins and ogbn-products, and L is set to 3 on ogbn-papers100M. T is set to 100 on both ogbn-products and ogbn-papers100M, and to 1,000 on ogbn-proteins. The values of λ and p are obtained by tuning with NS on the benchmark datasets. On ogbn-products, λ = 5 × 10^-6 and p = 0.1. On ogbn-papers100M, λ = 5 × 10^-7 and p = 0.1. On ogbn-proteins, λ = 0 and p = 0. In BNS, we set ρ to 0.5 for convenience and do not tune it. Adam [Kingma and Ba, 2015] is used to optimize the model, and the learning rate η is set to 0.01. For all settings, experiments are run 10 times with different initializations each time, and the mean results of the 10 runs are reported.

4.3 Evaluation Criteria

The ultimate goal of sampling strategies for GNNs is to obtain high accuracy with a low time cost, not just to reduce time and memory cost to extreme cases at the expense of
accuracy. In most cases, reducing memory can also reduce the time cost, since the GNN model can perform training with a larger batch size when the graphics memory cost is lower. Hence, we omit the comparison of memory cost in experiments. In a nutshell, the accuracy of the GNN model and the time cost during training are presented to evaluate the performance of different methods.

One reasonable way to evaluate the performance of different methods is to compare the time cost when achieving the same accuracy. Since batch size has an important impact on time cost and accuracy, we design two kinds of experiments for fair comparison:

• The first experimental setting: On each dataset, for different methods, we train the GNN model with the same batch size. All methods are run with the best setting that can achieve the best accuracy in this case.
• The second experimental setting: On each dataset, for different methods, we train the GNN model with the maximum batch size that can achieve the best accuracy. All methods are run with the best setting that can achieve the best accuracy.

[Figure 2 here: test accuracy (%) versus training time (s) on ogbn-products (top row) and ogbn-papers100M (bottom row); panels correspond to b' = 25, b' = 35, b' = 45 and the minimum b', with curves for BNS (ours), NS, VRGCN, LADIES and GraphSAINT.]
Figure 2: Test accuracy curves on ogbn-products and ogbn-papers100M. Methods that need more than one day to obtain the curves are omitted in the figures. $b' = |\mathcal{V}'|/B$, where $|\mathcal{V}'|$ denotes the number of nodes in the training data and B is the batch size. In each row, the first three figures present the results of the first experimental setting in Section 4.3. The last figure presents the results of the second experimental setting.

Results on ogbn-products:
Methods | Accuracy (%) ↑ | Time (s) ↓ | T1+T2 (s)
NS | 78.64 ± 0.17 | 5.5 × 10^3 | 4.5 × 10^3 + 9.6 × 10^2
VRGCN | 77.07 ± 0.49 | 1.2 × 10^4 | 1.1 × 10^4 + 1.5 × 10^3
LADIES | 78.96 ± 0.50 | 4.7 × 10^3 | 4.5 × 10^3 + 2.5 × 10^2
GraphSAINT | 78.95 ± 0.41 | 7.1 × 10^3 | 4.5 × 10^3 + 2.6 × 10^3
BNS (ours) | 80.14 ± 0.27 | 9.1 × 10^2 | 7.3 × 10^2 + 1.8 × 10^2

Results on ogbn-papers100M:
Methods | Accuracy (%) ↑ | Time (s) ↓ | T1+T2 (s)
NS | 63.61 ± 0.13 | 2.5 × 10^4 | 8.0 × 10^3 + 1.7 × 10^4
VRGCN | 63.34 ± 0.12 | 2.2 × 10^4 | 7.0 × 10^3 + 1.5 × 10^4
LADIES | 63.25 ± 0.21 | 2.5 × 10^4 | 1.2 × 10^4 + 1.3 × 10^4
GraphSAINT | 61.60 ± 0.12 | 2.1 × 10^4 | 8.0 × 10^3 + 1.3 × 10^4
BNS (ours) | 63.88 ± 0.12 | 1.2 × 10^4 | 4.3 × 10^3 + 7.7 × 10^3

Table 2: Results on ogbn-products and ogbn-papers100M. Boldface letters denote the best results. Time denotes the total training time of one run. "T1" refers to the time cost of preparing data. "T2" refers to the time cost of performing forward and backward propagation. The results are obtained under the second experimental setting in Section 4.3.
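The "T1"/"T2" split reported in Table 2 (and in Table 3 below) separates per-iteration data preparation from forward and backward propagation. The following sketch shows one way such a split can be measured; it is our own illustration with hypothetical `sampler`, `model`, `optimizer` and `loss_fn` objects, not the instrumentation used by the authors.

```python
import time

def timed_epoch(sampler, model, optimizer, loss_fn, batches):
    """Accumulate data-preparation time (T1) and forward/backward time (T2)
    over one epoch. All arguments are placeholders supplied by the caller."""
    t1 = t2 = 0.0
    for batch in batches:
        start = time.perf_counter()
        adjs, inputs, targets = sampler(batch)   # sampling + data preparation (T1)
        t1 += time.perf_counter() - start

        start = time.perf_counter()
        optimizer.zero_grad()
        loss = loss_fn(model(inputs, adjs), targets)
        loss.backward()
        optimizer.step()                         # forward + backward (T2)
        t2 += time.perf_counter() - start
    return t1, t2
```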
Detailed settings of each method can be found in the Appendix.

4.4 Results

Results on ogbn-products and ogbn-papers100M are summarized in Figure 2 and Table 2, from which we can draw the following conclusions. Firstly, when achieving the same accuracy under different settings, BNS is faster than all other methods. For example, from Figure 2(a), we can see that BNS is approximately 4× ∼ 5× faster than GraphSAINT (second-best) when achieving the accuracy of 80% on ogbn-products. From Figure 2(b), we can see that BNS is approximately 2× faster than NS (second-best) when achieving the accuracy of 63.5% on ogbn-papers100M. Secondly, compared with other methods, BNS can achieve the best performance in accuracy with the minimum time cost. This point can be drawn from Table 2, which is consistent with the results in Figure 2. Thirdly, VRGCN and LADIES have a high time cost in preparing data, which is even higher than the time cost in performing forward and backward propagation. Finally, from Table 2, we observe an interesting phenomenon, i.e., the accuracy of BNS does not decrease compared to that of NS, and is even higher than that of NS. This phenomenon can be explained by the observations in JK-Net [Xu et al., 2018], i.e., it is important to enhance the influence of local neighborhoods on the central nodes; otherwise the local information of the central nodes in the input graphs will be washed out in a few
steps. We can see that the stochastic blocking policy is helpful for BNS to preserve local information around central nodes.

[Figure 3 here: test ROC-AUC (%) versus training time (s) on ogbn-proteins; panels correspond to b' = 25, b' = 35, b' = 45 and the minimum b', with curves for BNS (ours), NS, VRGCN, LADIES and GraphSAINT.]
Figure 3: Test ROC-AUC curves on ogbn-proteins.

Method | Accuracy (%) ↑ | Time (s) ↓ | T1+T2 (s)
NS | 78.84 ± 0.32 | 1.1 × 10^4 | 3.6 × 10^3 + 7.4 × 10^3
VRGCN | 78.26 ± 0.64 | 1.9 × 10^4 | 8.0 × 10^3 + 1.1 × 10^4
LADIES | 79.49 ± 0.37 | 1.9 × 10^4 | 1.9 × 10^4 + 3.0 × 10^2
GraphSAINT | 78.73 ± 0.45 | 3.3 × 10^4 | 1.1 × 10^4 + 2.2 × 10^4
BNS (ours) | 79.60 ± 0.29 | 9.3 × 10^3 | 8.6 × 10^3 + 7.4 × 10^2

Table 3: Results on ogbn-proteins.

Results on ogbn-proteins, a relatively small graph, are summarized in Figure 3 and Table 3, from which we can draw the following conclusions. Firstly, BNS is faster than all other methods when achieving the same accuracy under different settings. However, the gap in time cost between BNS and the other methods when achieving the same accuracy is small. The main reason is that neighbor expansions can easily cover the entire graph within a few layers or steps on small-scale graphs. Therefore, these methods have the same order of computation and memory complexity, O(Nsr + Nr^2). Secondly, BNS can achieve the best performance in terms of accuracy with the fastest speed. This point can be drawn from Table 3, which is consistent with the results in Figure 3. Thirdly, once again, we observe that LADIES has a high time cost in preparing data. Finally, we observe that GraphSAINT (non-sampling strategy) achieves lower accuracy than NS, LADIES and BNS. This may be caused by the over-smoothing problem of GNNs [Li et al., 2018; Xu et al., 2018; Oono and Suzuki, 2020]. This observation, in turn, shows that the stochasticity introduced by sampling can alleviate the over-smoothing problem of GNNs.

Summary. First, on large-scale graphs, BNS is 2× ∼ 5× faster than existing state-of-the-art methods when achieving the same accuracy. Compared with BNS, other methods have some deficiencies or limitations. For example, NS needs a large $s_n$ to achieve high accuracy. VRGCN and LADIES have a high time cost of preparing data, which is even more expensive than performing forward and backward propagation. Second, even on the small-scale graph, BNS demonstrates the advantage of low time cost. Third, compared with other methods, BNS can achieve the best performance in accuracy with the minimum time cost.

4.5 Ablation Study

We study the effectiveness of the reweighted policy by setting $\rho^{\ell}_{i,1} = 1$ and $\rho^{\ell}_{i,2} = 1$ in Equation (3). With $\rho^{\ell}_{i,1} = 1$ and $\rho^{\ell}_{i,2} = 1$, Equation (5) becomes a plain Monte-Carlo approximation of Equation (2). The results are presented in Table 4, from which we can conclude that the reweighted policy enhances the ability of BNS to utilize the information of blocked neighbors.

Methods | ogbn-products | ogbn-papers100M | ogbn-proteins
BNS w/o rew | 79.12 ± 0.19 | 62.54 ± 0.14 | 79.40 ± 0.21
BNS | 80.14 ± 0.27 | 63.88 ± 0.12 | 79.60 ± 0.29

Table 4: Ablation study on the reweighted policy, reported as Accuracy (%) or ROC-AUC (%). 'w/o rew' means BNS runs without the reweighted policy.
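For concreteness, the 'BNS w/o rew' variant corresponds to dropping the reweighting factors from the aggregation sketch given after Proposition 1, so that every sampled neighbor, blocked or not, receives the plain Monte-Carlo weight |N(i)|/|N^ℓ(i)| implied by Equation (2). A minimal version (ours, not the authors' code):

```python
import numpy as np

def plain_mc_aggregate(A_hat, H, neighbors, sampled):
    """Ablation 'BNS w/o rew' (rho_{i,1} = rho_{i,2} = 1): every sampled
    neighbor gets the plain Monte-Carlo weight |N(i)| / |N^l(i)|."""
    Z = {}
    for i, nbrs in sampled.items():
        scale = len(neighbors[i]) / max(len(nbrs), 1)
        Z[i] = sum(scale * A_hat[i, j] * H[j] for j in nbrs)
    return Z
```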
5 Conclusions

On large-scale graphs (e.g., graphs with hundreds of millions of nodes), existing sampling strategies have deficiencies or limitations in accuracy, time cost, or memory cost. Hence, designing an effective sampling strategy for efficient training of GNNs on large-scale graphs is still challenging. In this paper, we propose a novel neighbor sampling strategy, dubbed blocking-based neighbor sampling (BNS), for training GNNs on large-scale graphs. The main idea is to adopt a policy to stochastically block the ongoing expansion of neighbors, by which computation and memory complexity can be significantly reduced. Furthermore, reweighted graph convolution is proposed to adjust the contribution of blocked and non-blocked neighbors to central nodes. Extensive experiments on large-scale graphs show that, when achieving the same accuracy, BNS is 2× ∼ 5× faster than state-of-the-art methods. Experiments on the small-scale graph also demonstrate the advantage of BNS in terms of time cost.

Acknowledgments

This work is supported by the NSFC-NRF Joint Research Project (No. 61861146001) and NSFC Project (No. 61921006).
References
[Bruna et al., 2014] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2014.
[Chen et al., 2018a] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning, 2018.
[Chen et al., 2018b] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018.
[Chen et al., 2020a] Lei Chen, Zhengdao Chen, and Joan Bruna. On graph neural networks versus graph-augmented MLPs. CoRR, abs/2010.15116, 2020.
[Chen et al., 2020b] Ming Chen, Zhewei Wei, Bolin Ding, Yaliang Li, Ye Yuan, Xiaoyong Du, and Ji-Rong Wen. Scalable graph neural networks via bidirectional propagation. In Advances in Neural Information Processing Systems, 2020.
[Chen et al., 2020c] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, 2020.
[Chiang et al., 2019] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: an efficient algorithm for training deep and large graph convolutional networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[Cong et al., 2020] Weilin Cong, Rana Forsati, Mahmut T. Kandemir, and Mehrdad Mahdavi. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[Fey and Lenssen, 2019] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In International Conference on Learning Representations Workshop on Representation Learning on Graphs and Manifolds, 2019.
[Gori et al., 2005] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks, 2005.
[Hamilton et al., 2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
[Hu et al., 2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, 2020.
[Huang et al., 2018] Wen-bing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems, 2018.
[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2015.
[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
[Klicpera et al., 2019] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In Advances in Neural Information Processing Systems, 2019.
[Li et al., 2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI Conference on Artificial Intelligence, 2018.
[Li et al., 2019] Guohao Li, Matthias Müller, Ali K. Thabet, and Bernard Ghanem. DeepGCNs: can GCNs go as deep as CNNs? In IEEE/CVF International Conference on Computer Vision, 2019.
[Oono and Suzuki, 2020] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.
[Paszke et al., 2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
[Velickovic et al., 2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[Verma and Zhang, 2020] Saurabh Verma and Zhi-Li Zhang. Towards deeper graph neural networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[Wu et al., 2019] Felix Wu, Amauri H. Souza Jr., Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, 2019.
[Xu et al., 2018] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, 2018.
[Zeng et al., 2020] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor K. Prasanna. GraphSAINT: graph sampling based inductive learning method. In International Conference on Learning Representations, 2020.
[Zou et al., 2019] Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems, 2019.
A Proofs
Proposition 1. Suppose $H^{(\ell-1)}$ is given. If $\mathcal{N}^{\ell}(i)$ is uniformly sampled from $\mathcal{N}(i)$, $\mathcal{N}^{\ell}_{b}(i)$ is uniformly sampled from $\mathcal{N}^{\ell}(i)$ and $\rho \in [0, 1]$, then $\tilde{Z}^{(\ell)}_{i*}$ defined in Equation (4) is an unbiased estimator of $Z^{(\ell)}_{i*}$.

Proof. Case 1. If $i \in \mathcal{V}^{\ell}_{b} \setminus \mathcal{V}^{\ell}_{nb}$, then we have
\[
\mathbb{E}[\tilde{Z}^{(\ell)}_{i*}] = \mathbb{E}\big[\,|\mathcal{N}(i)| \cdot \hat{A}_{ii} H^{(\ell-1)}_{i*}\big] = Z^{(\ell)}_{i*}.
\]
Case 2. If $i \in \mathcal{V}^{\ell}_{nb}$, let $\mathcal{N}^{\ell}_{nb}(i)$ denote the set of non-blocked neighbors of node $i$ at layer $\ell$. Then we have
\[
\mathbb{E}[\tilde{Z}^{(\ell)}_{i*}]
= \mathbb{E}\Big[\sum_{j \in \mathcal{N}^{\ell}_{nb}(i)} \tilde{A}_{ij} H^{(\ell-1)}_{j*} + \sum_{t \in \mathcal{N}^{\ell}_{b}(i)} \tilde{A}_{it} H^{(\ell-1)}_{t*}\Big]
= \mathbb{E}\Big[\sum_{j \in \mathcal{N}^{\ell}_{nb}(i)} \tilde{A}_{ij} H^{(\ell-1)}_{j*}\Big] + \mathbb{E}\Big[\sum_{t \in \mathcal{N}^{\ell}_{b}(i)} \tilde{A}_{it} H^{(\ell-1)}_{t*}\Big].
\]
For $\mathbb{E}\big[\sum_{j \in \mathcal{N}^{\ell}_{nb}(i)} \tilde{A}_{ij} H^{(\ell-1)}_{j*}\big]$, we have
\[
\mathbb{E}\Big[\sum_{j \in \mathcal{N}^{\ell}_{nb}(i)} \tilde{A}_{ij} H^{(\ell-1)}_{j*}\Big]
= \mathbb{E}\Big[\sum_{j \in \mathcal{N}^{\ell}_{nb}(i)} \rho^{\ell}_{i,1} \cdot \frac{|\mathcal{N}(i)|}{|\mathcal{N}^{\ell}(i)|} \cdot \hat{A}_{ij} H^{(\ell-1)}_{j*}\Big]
= \mathbb{E}\Big[\sum_{j \in \mathcal{N}^{\ell}_{nb}(i)} \rho \cdot \frac{|\mathcal{N}(i)|}{|\mathcal{N}^{\ell}_{nb}(i)|} \hat{A}_{ij} H^{(\ell-1)}_{j*}\Big]
= \rho Z^{(\ell)}_{i*}.
\]
For $\mathbb{E}\big[\sum_{t \in \mathcal{N}^{\ell}_{b}(i)} \tilde{A}_{it} H^{(\ell-1)}_{t*}\big]$, we have
\[
\mathbb{E}\Big[\sum_{t \in \mathcal{N}^{\ell}_{b}(i)} \tilde{A}_{it} H^{(\ell-1)}_{t*}\Big]
= \mathbb{E}\Big[\sum_{t \in \mathcal{N}^{\ell}_{b}(i)} \rho^{\ell}_{i,2} \cdot \frac{|\mathcal{N}(i)|}{|\mathcal{N}^{\ell}(i)|} \cdot \hat{A}_{it} H^{(\ell-1)}_{t*}\Big]
= \mathbb{E}\Big[\sum_{t \in \mathcal{N}^{\ell}_{b}(i)} (1-\rho) \cdot \frac{|\mathcal{N}(i)|}{|\mathcal{N}^{\ell}_{b}(i)|} \hat{A}_{it} H^{(\ell-1)}_{t*}\Big]
= (1-\rho) Z^{(\ell)}_{i*}.
\]
Summing up the above two parts gives $\mathbb{E}[\tilde{Z}^{(\ell)}_{i*}] = Z^{(\ell)}_{i*}$. Combining Case 1 and Case 2 completes the proof. □
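As a quick numerical sanity check on Proposition 1 (ours, not part of the original supplementary material), the script below simulates the Case 2 estimator for a single node: it repeatedly draws $\mathcal{N}^{\ell}(i)$ uniformly from $\mathcal{N}(i)$ and $\mathcal{N}^{\ell}_{b}(i)$ uniformly from $\mathcal{N}^{\ell}(i)$, applies the reweighting coefficients that appear in the proof, and compares the empirical mean with the exact $Z^{(\ell)}_{i*}$. The neighborhood size, feature dimension and ρ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

deg, s, s_b = 10, 4, 2            # |N(i)|, |N^l(i)|, |N^l_b(i)| (arbitrary)
d, rho = 3, 2.0 / 3.0             # feature dimension and rho (arbitrary)
A_hat = rng.random(deg)           # entries A_hat[i, j] for the deg neighbors of node i
H = rng.random((deg, d))          # rows of H^(l-1) for those neighbors

exact = (A_hat[:, None] * H).sum(axis=0)        # Z_i = sum_j A_hat[i, j] * H_j

est = np.zeros(d)
num_trials = 100_000
for _ in range(num_trials):
    sampled = rng.choice(deg, size=s, replace=False)        # N^l(i), uniform from N(i)
    blocked = rng.choice(sampled, size=s_b, replace=False)  # N^l_b(i), uniform from N^l(i)
    non_blocked = np.setdiff1d(sampled, blocked)
    w_nb = rho * deg / len(non_blocked)         # rho * |N(i)| / |N^l_nb(i)|
    w_b = (1.0 - rho) * deg / len(blocked)      # (1 - rho) * |N(i)| / |N^l_b(i)|
    est += w_nb * (A_hat[non_blocked, None] * H[non_blocked]).sum(axis=0)
    est += w_b * (A_hat[blocked, None] * H[blocked]).sum(axis=0)
est /= num_trials

print("exact   :", exact)
print("estimate:", est)   # should match `exact` up to Monte-Carlo noise
```

The two printed vectors should agree up to Monte-Carlo noise, with the gap shrinking as num_trials grows, which is exactly what the unbiasedness claim predicts.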
B Experimental Settings
B.1 Statistics of Datasets
The statistics of the datasets are presented in Table B.1.

Datasets            ogbn-products   ogbn-papers100M   ogbn-proteins
#Nodes              2,449,029       111,059,956       132,534
#Edges              61,859,140      1,615,685,872     39,561,252
Features/Node       100             128               8
#Classes            47              172               112
#Training Nodes     196,615         1,207,179         86,619
#Validation Nodes   39,323          125,265           21,236
#Test Nodes         2,213,091       214,338           24,679
Task Type           Multi-class     Multi-class       Multi-label
Metric              Accuracy        Accuracy          ROC-AUC
Table B.1: Statistics of benchmark datasets.
B.2 Detailed Experimental Settings
Detailed experimental settings, mainly the hyper-parameters needed to reproduce our results, are presented in Table B.2, Table B.3 and Table B.4.

Dataset            L   r     T      λ           p     η
ogbn-products      5   128   100    5 × 10^-6   0.1   0.01
ogbn-papers100M    3   128   100    5 × 10^-7   0.1   0.01
ogbn-proteins      5   128   1000   0           0     0.01
Table B.2: Hyper-parameters that are independent of the compared methods. L is the number of layers, r is the feature dimension of the hidden layers, T is the maximum number of epochs, λ is the weight of the parameter regularization term, p is the dropout probability, and η is the learning rate of the Adam optimizer.

Dataset            NS         VRGCN      LADIES    GraphSAINT   BNS
ogbn-products      s_n = 5    -          k = 7     w = 50       s̃_n = 3, δ = 1/2
ogbn-papers100M    s_n = 10   s_n = 6    k = 15    w = 20       s̃_n = 6, δ = 2/3
ogbn-proteins      s_n = 9    s_n = 3    k = 0.5   -            s̃_n = 4, δ = 2/3
Table B.3: Method-dependent hyper-parameters for generating the test accuracy curves of b′ = 25, 35, 45 in Figure 2 and Figure 3. For each method and each dataset, the settings for different b′ are the same.

Dataset            NS                  VRGCN              LADIES            GraphSAINT           BNS
ogbn-products      s_n = 5, b′ = 15    s_n = 2, b′ = 15   k = 7, b′ = 15    w = 50, b′ = 15      s̃_n = 3, δ = 1/2, b′ = 5
ogbn-papers100M    s_n = 15, b′ = 50   s_n = 6, b′ = 30   k = 15, b′ = 15   w = 50, b′ = 70      s̃_n = 6, δ = 2/3, b′ = 25
ogbn-proteins      s_n = 9, b′ = 8     s_n = 3, b′ = 8    k = 0.5, b′ = 8   full graph, b′ = 1   s̃_n = 4, δ = 2/3, b′ = 8
Table B.4: Method-dependent hyper-parameters for generating the test accuracy curves of the maximum b′ in Figure 2 and Figure 3. The hyper-parameters in this table are the same as those used to generate the results in Table 3 and Table 4.

C Experiments on Amazon and Yelp
To have a fair comparison with GraphSAINT [Zeng et al., 2020], we conduct experiments on Amazon and Yelp. However, our experimental results show that the differences in accuracy and time cost between different methods are small on these datasets. For example, NS [Hamilton et al., 2017] only needs to sample two neighbors per node to achieve the best performance in accuracy and time cost. Hence, Amazon and Yelp are not suitable for analyzing the differences between sampling strategies.

C.1 Datasets and Settings
Datasets. Amazon has millions of nodes and Yelp has hundreds of thousands of nodes. Details of Amazon and Yelp can be found in [Zeng et al., 2020].
Settings. All hyper-parameters that are independent of the sampling strategies are set as in [Zeng et al., 2020]. All sampling strategies are evaluated with the same GNN model as used by GraphSAINT in [Zeng et al., 2020]. Other detailed settings are presented in Table C.1.

Dataset   NS        VRGCN     LADIES   GraphSAINT   BNS
Yelp      s_n = 2   s_n = 2   k = 1    w = 2        s̃_n = 1, δ = 1/2
Amazon    s_n = 2   -         k = 1    w = 2        s̃_n = 1, δ = 1/2
Table C.1: Method-dependent hyper-parameters for generating the test accuracy curves of b′ = 25, 35, 45 in Figure C.1. For each method and each dataset, the settings for different b′ are the same.

Dataset   NS                 VRGCN              LADIES           GraphSAINT       BNS
Yelp      s_n = 2, b′ = 25   s_n = 2, b′ = 25   k = 1, b′ = 25   w = 2, b′ = 25   s̃_n = 1, δ = 1/2, b′ = 25
Amazon    s_n = 2, b′ = 25   s_n = 2, b′ = 25   k = 1, b′ = 25   w = 2, b′ = 25   s̃_n = 1, δ = 1/2, b′ = 25
Table C.2: Method-dependent hyper-parameters for generating the test accuracy curves of the maximum b′ in Figure C.1. The hyper-parameters in this table are the same as those used to generate the results in Table C.3.

C.2 Results on Amazon and Yelp
Results on Yelp and Amazon are summarized in Figure C.1 and Table C.3.
On Yelp, the gap in time cost between the different methods is small when achieving the same accuracy. On Amazon, NS and BNS are approximately 2× faster than GraphSAINT and LADIES when achieving the same accuracy, but the gap between NS and BNS is small.
[Figure C.1: two rows of four panels. Row (a) Yelp with b′ = 25, 35, 45 and the minimum b′ (curves: BNS (ours), NS, VRGCN, LADIES, GraphSAINT); row (b) Amazon with b′ = 25, 35, 45 and the minimum b′ (curves: BNS (ours), NS, LADIES, GraphSAINT). x-axis: time (s); y-axis: F1-micro (%).]
Figure C.1: Test accuracy curves on Yelp and Amazon. Methods that need more than one day to obtain the curves are omitted from the figures. b′ = |V′|/b, where |V′| denotes the number of nodes in the training data and b is the batch size. In each row, the first three figures present results under the first experimental setting in the subsection Evaluation Criteria, and the last figure presents results under the second experimental setting.

Methods      Yelp: F1-micro (%) ↑   Yelp: Time (s) ↓   Yelp: T1 + T2 (s)           Amazon: F1-micro (%) ↑   Amazon: Time (s) ↓   Amazon: T1 + T2 (s)
NS           65.39 ± 0.05           5.1 × 10^3         4.0 × 10^3 + 1.1 × 10^3     80.89 ± 0.08             1.1 × 10^4           9.0 × 10^3 + 2.1 × 10^3
VRGCN        62.58 ± 0.42           7.9 × 10^3         5.1 × 10^3 + 2.8 × 10^3     -                        -                    -
LADIES       65.36 ± 0.03           5.3 × 10^3         4.6 × 10^3 + 7.2 × 10^2     81.05 ± 0.08             2.7 × 10^4           2.5 × 10^4 + 2.4 × 10^3
GraphSAINT   65.46 ± 0.05           4.9 × 10^3         3.8 × 10^3 + 1.1 × 10^3     80.94 ± 0.07             2.3 × 10^4           1.7 × 10^4 + 5.9 × 10^3
BNS (ours)   65.34 ± 0.04           4.5 × 10^3         3.5 × 10^3 + 9.6 × 10^2     80.91 ± 0.08             9.0 × 10^3           7.1 × 10^3 + 1.9 × 10^3
Table C.3: Results on Yelp and Amazon. Time indicates the total training time of one run. "T1" refers to the time cost of preparing data. "T2" refers to the time cost of performing forward and backward propagation. The results are obtained under the second experimental setting in the subsection Evaluation Criteria.

D Comparisons with Model Simplification Methods
Model simplification methods address the exponential increase problem of GNNs by simplifying the GNN model, which is different from sampling-based methods. Although model simplification methods are efficient in training, as stated in [Chen et al., 2020], it is still an open question whether the expressive power of simplified GNNs can match that of deep GNNs. Therefore, we only compare with SGC [Wu et al., 2019] and leave an extensive evaluation of model simplification methods for future work.
The results are summarized in Table D.1. From Table D.1, we can see that SGC performs worse on ogbn-products, ogbn-proteins, Yelp and Amazon.

Methods      ogbn-products: Accuracy (%) ↑   ogbn-papers100M: Accuracy (%) ↑   ogbn-proteins: ROC-AUC (%) ↑   Yelp: F1-micro (%) ↑   Amazon: F1-micro (%) ↑
SGC          78.18 ± 0.31                    63.29 ± 0.19                      72.45 ± 0.26                   40.53 ± 0.04           38.80 ± 0.05
BNS (ours)   80.14 ± 0.27                    63.88 ± 0.12                      79.60 ± 0.29                   65.34 ± 0.04           80.91 ± 0.08
Table D.1: Comparisons with SGC.
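To make the contrast with model simplification concrete, the snippet below sketches the SGC-style preprocessing of [Wu et al., 2019]: since the non-linear transformations between layers are removed, the K-step propagation S^K X, with S the symmetrically normalized adjacency matrix with self-loops, can be computed once before training, after which only a linear classifier is fit. This is a generic sketch of SGC, not the exact configuration used to produce Table D.1.

```python
import numpy as np
import scipy.sparse as sp

def sgc_precompute(adj, X, K=2):
    """Compute S^K X once, where S = D^{-1/2} (A + I) D^{-1/2} (SGC propagation)."""
    n = adj.shape[0]
    A_tilde = adj + sp.eye(n)                        # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    H = X
    for _ in range(K):                               # K propagation steps, no non-linearity
        H = S @ H
    return H                                         # features for a simple linear classifier

# Usage (hypothetical inputs): adj is a scipy.sparse adjacency matrix, X a dense
# feature matrix; the returned features are fed to, e.g., logistic regression.
```

Training then reduces to fitting a linear model on the precomputed features, which explains SGC's low training cost; Table D.1 suggests that this simplification gives up accuracy on most of the datasets considered here.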
References
[Chen et al., 2020] Lei Chen, Zhengdao Chen, and Joan Bruna. On graph neural networks versus graph-augmented MLPs. CoRR, abs/2010.15116, 2020.
[Hamilton et al., 2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
[Wu et al., 2019] Felix Wu, Amauri H. Souza Jr., Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, 2019.
[Zeng et al., 2020] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor K. Prasanna. GraphSAINT: graph sampling based inductive learning method. In International Conference on Learning Representations, 2020.