oBscence lopsclence. Iop. org Home Search Collections Journals About Contact us My lOPscience Fast unfolding of communities in large networks This content has been downloaded from lOPscience Please scroll down to see the full text J Stat. Mech.(2008 )P10008 (http://iopscience.ioporg/1742-5468/2008/10/p10008) View the table of contents for this issue, or go to the journal homepage for more Download details P Address:202.120.22496 This content was downloaded on 31/05/2017 at 09: 13 Please note that terms and conditions apply You may also be interested in Coarse-grained diffusion distance for community structure detection in complexnetworks Jian Liu and Tingzhan Liu Analysis of the structure of complex networks at different resolution levels A Arenas, A Fernandez and s gomez Comparing community structure identification eon Danon, Albert Diaz-Guilera, Jordi duch et al Extending a configuration model to find communities in complex networks Di Jin, Dongxiao He, Qinghua Hu et al. The effect of size heterogeneity on community identification in complex networks Leon Danon, Albert Diaz-Guilera and Alex Arenas Finding network communities using modularity density Federico Botta and Charo i del genio Fuzzy overlapping communities in networks Steve Gregory A Markov rando under constraint for discovering overlapping communities incomplex networks Di Jin, Bo Yang, Baquero et al Revealing network communities through modularity maximization by a contraction-dilation method Juan Mei, Sheng He, Guiyang Shi et al
This content has been downloaded from IOPscience. Please scroll down to see the full text. Download details: IP Address: 202.120.224.96 This content was downloaded on 31/05/2017 at 09:13 Please note that terms and conditions apply. Fast unfolding of communities in large networks View the table of contents for this issue, or go to the journal homepage for more J. Stat. Mech. (2008) P10008 (http://iopscience.iop.org/1742-5468/2008/10/P10008) Home Search Collections Journals About Contact us My IOPscience You may also be interested in: Coarse-grained diffusion distance for community structure detection in complexnetworks Jian Liu and Tingzhan Liu Analysis of the structure of complex networks at different resolution levels A Arenas, A Fernández and S Gómez Comparing community structure identification Leon Danon, Albert Díaz-Guilera, Jordi Duch et al. Extending a configuration model to find communities in complex networks Di Jin, Dongxiao He, Qinghua Hu et al. The effect of size heterogeneity on community identification in complex networks Leon Danon, Albert Díaz-Guilera and Alex Arenas Finding network communities using modularity density Federico Botta and Charo I del Genio Fuzzy overlapping communities in networks Steve Gregory A Markov random walk under constraint for discovering overlapping communities incomplex networks Di Jin, Bo Yang, Carlos Baquero et al. Revealing network communities through modularity maximization by a contraction–dilation method Juan Mei, Sheng He, Guiyang Shi et al
ournal of Statistical Mechanics: Theory and Experiment An loP and sISSA joumal Fast unfolding of communities in large networks Vincent D Blondel, Jean-Loup Guillaume,2 Renaud Lambotte 3 and Etienne lefebvre I Department of Mathematical Engineering, Universite Catholique de Louvain 4 avenue Georges Lemaitre. B-1348 Louvain-la-Neuve. belgium 2 LIP6, Universite Pierre et Marie Curie, 4 place Jussieu, F-75005 Paris, France 3 Institute for Mathematical Sciences, Imperial College London 53 Prince's Gate, South Kensington Campus, London SW7 2PG E-mail: vincent. blondel@uclouvain be, jean-loup. guillaume@lip6. fr 三 r.lambiotte@imperial.ac.ukandpixetus@hotmail.com Received 18 April 2008 Accepted 3 September 2008 Published 9 October 2008 Online at stacks. iop. org/JSTAT/2008/P10008 doi:10.1088/1742-5468/2008/10/P10008 Abstract. We propose a simple method to extract the community structure of oo large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2 million customers and by analysing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad hoc modular networks Keywords: random graphs, networks, new applications of statistical mechanics ArXiv e Print: 0803.0476 C2008 IOP Publishing Ltd and SISSA 1742-546 P10008+12530.00
J. Stat. Mech. (2008) P10008 ournal of Statistical Mechanics: JAn IOP and SISSA journal Theory and Experiment Fast unfolding of communities in large networks Vincent D Blondel1, Jean-Loup Guillaume1,2, Renaud Lambiotte1,3 and Etienne Lefebvre1 1 Department of Mathematical Engineering, Universit´e Catholique de Louvain, 4 avenue Georges Lemaitre, B-1348 Louvain-la-Neuve, Belgium 2 LIP6, Universit´e Pierre et Marie Curie, 4 place Jussieu, F-75005 Paris, France 3 Institute for Mathematical Sciences, Imperial College London, 53 Prince’s Gate, South Kensington Campus, London SW7 2PG, UK E-mail: vincent.blondel@uclouvain.be, jean-loup.guillaume@lip6.fr, r.lambiotte@imperial.ac.uk and pixetus@hotmail.com Received 18 April 2008 Accepted 3 September 2008 Published 9 October 2008 Online at stacks.iop.org/JSTAT/2008/P10008 doi:10.1088/1742-5468/2008/10/P10008 Abstract. We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2 million customers and by analysing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad hoc modular networks. Keywords: random graphs, networks, new applications of statistical mechanics ArXiv ePrint: 0803.0476 c 2008 IOP Publishing Ltd and SISSA 1742-5468/08/P10008+12$30.00
Fast unfolding of communities in large networks Contents 1. Introduction 2. Method 236 3. Application to large networks 4. Conclusion and discussion Acknowledgments References 1. Introduction Social, technological and information systems can often be described in terms of complex networks that have a topology of interconnected nodes combining organization and 三 and these scales demand new methods to retrieve comprehensive information from their structure. a promising approach consists in decomposing the networks into sub-units or communities, which are sets of highly interconnected nodes [ 3. The identification of these communities is of crucial importance as they may help to uncover a priori unknown functional modules such as topics in information networks or cyber-communities in social networks. Moreover, the resulting meta-network, whose nodes are the communities, may then be used to visualize the original network structure The problem of community detection requires the partition of a network into communities of densely connected nodes, with the nodes belonging to different been proposed to find reasonably good partitions in a reasonably fast way. This search for fast algorithms has attracted much interest in recent years due to the increasing availability of large network datasets and the impact of networks on everyday life. One can distinguish several types of community detection algorithms: divisive algorithms O detect inter-community links and remove them from the network [4]-[61, agglomerative oo algorithms merge similar nodes/communities recursively 7 and optimization methods are based on the maximization of an objective function [8 -[10. The quality of the partitions resulting from these methods is often measured by the so-called modularity of the partition. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities 5, 11. In the case of weighted networks(weighted networks are networks that have weights on their links. such as the number of communications between two mobile phone users), it is defined as [12 n∑4-w0]6(a.9) doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks Contents 1. Introduction 2 2. Method 3 3. Application to large networks 6 4. Conclusion and discussion 10 Acknowledgments 11 References 11 1. Introduction Social, technological and information systems can often be described in terms of complex networks that have a topology of interconnected nodes combining organization and randomness [1, 2]. The typical size of large networks such as social network services, mobile phone networks or the web is now counted in millions, if not billions, of nodes and these scales demand new methods to retrieve comprehensive information from their structure. A promising approach consists in decomposing the networks into sub-units or communities, which are sets of highly interconnected nodes [3]. The identification of these communities is of crucial importance as they may help to uncover a priori unknown functional modules such as topics in information networks or cyber-communities in social networks. Moreover, the resulting meta-network, whose nodes are the communities, may then be used to visualize the original network structure. The problem of community detection requires the partition of a network into communities of densely connected nodes, with the nodes belonging to different communities being only sparsely connected. Precise formulations of this optimization problem are known to be computationally intractable. Several algorithms have therefore been proposed to find reasonably good partitions in a reasonably fast way. This search for fast algorithms has attracted much interest in recent years due to the increasing availability of large network datasets and the impact of networks on everyday life. One can distinguish several types of community detection algorithms: divisive algorithms detect inter-community links and remove them from the network [4]–[6], agglomerative algorithms merge similar nodes/communities recursively [7] and optimization methods are based on the maximization of an objective function [8]–[10]. The quality of the partitions resulting from these methods is often measured by the so-called modularity of the partition. The modularity of a partition is a scalar value between −1 and 1 that measures the density of links inside communities as compared to links between communities [5, 11]. In the case of weighted networks (weighted networks are networks that have weights on their links, such as the number of communications between two mobile phone users), it is defined as [12] Q = 1 2m i,j Aij − kikj 2m δ(ci, cj ), (1) doi:10.1088/1742-5468/2008/10/P10008 2
Fast unfolding of communities in large networks where Aii represents the weight of the edge between i and j, ki A is the the weights of the edges attached to vertex i, ci is the community to which vertex i is assigned, the 8 function d(a, u)is 1 if u= v and 0 otherwise and m=52iiAi Modularity has been used to compare the quality of the partitions obtained by different methods, but also as an objective function to optimize [13. Unfortunately, exact modularity optimization is a problem that is computationally hard [14 and approximation algorithms are necessary when dealing with large networks. The fastest approximation algorithm for optimizing modularity on large networks was proposed by Clauset et al [8. That method consists in recurrently merging communities that optimize the production of modularity. Unfortunately, this greedy algorithm may produce values of modularity that are significantly lower than what can be found by using, for instance, simulated annealing [15]. Moreover, the method proposed in [8 has a tendency to produce super-communities that contain a large fraction of the nodes, even on synthetic networks D that have no significant community structure. This artefact also has the disadvantage of slowing down the algorithm considerably and makes it inapplicable to networks of more than a million nodes. This undesired effect has been circumvented by introducing tricks in order to balance the size of the communities being merged, thereby speeding up the running time and making it possible to deal with networks that have a few million 三 nodes 16 The largest networks that have been dealt with so far in the literature are a protein- protein interaction network of 30 739 nodes [17], a network of about 400000 items on sale on the website of a large on-line retailer 8 and a Japanese social networking system of about 5.5 million users [16. These sizes still leave considerable room for improvement [18] considering that, as of today, the social networking service Facebook has about 64 million active users, the mobile network operator Vodafone has about 200 million customers and Google indexes several billion web-pages. Let us also notice that in most large networks such as those listed above there are several natural organization levels-communities divide themselves into sub-communities-and it is thus desirable to obtain community detection methods that reveal this hierarchical structure [19 2. Method We now introduce our algorithm that finds high modularity partitions of large networks in a short time and that unfolds a complete hierarchical community structure for the O network, thereby giving access to different resolutions of community detection. Contrary o to all the other community detection algorithms, the network size limits that we are facing with our algorithm are due to limited storage capacity rather than limited computation time: identifying communities in a 118 million nodes network took only 152 mint Our algorithm is divided into two phases that are repeated iteratively. Assume that we start with a weighted network of N nodes. First, we assign a different community to ach node of the network. So, in this initial partition there are as many communities as there are nodes. Then, for each node i we consider the neighbours j of i and we evaluate the gain of modularity that would take place by removing i from its community and by All methods described here have been compiled and tested on the same machine: a bi-opteron 22k with 24 GB ofmemoryThecodeisfreelyavailablefordownloadontheweb-pagehttp://findcommunities.googlepages doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks where Aij represents the weight of the edge between i and j, ki = j Aij is the sum of the weights of the edges attached to vertex i, ci is the community to which vertex i is assigned, the δ function δ(u, v) is 1 if u = v and 0 otherwise and m = 1 2 ij Aij . Modularity has been used to compare the quality of the partitions obtained by different methods, but also as an objective function to optimize [13]. Unfortunately, exact modularity optimization is a problem that is computationally hard [14] and so approximation algorithms are necessary when dealing with large networks. The fastest approximation algorithm for optimizing modularity on large networks was proposed by Clauset et al [8]. That method consists in recurrently merging communities that optimize the production of modularity. Unfortunately, this greedy algorithm may produce values of modularity that are significantly lower than what can be found by using, for instance, simulated annealing [15]. Moreover, the method proposed in [8] has a tendency to produce super-communities that contain a large fraction of the nodes, even on synthetic networks that have no significant community structure. This artefact also has the disadvantage of slowing down the algorithm considerably and makes it inapplicable to networks of more than a million nodes. This undesired effect has been circumvented by introducing tricks in order to balance the size of the communities being merged, thereby speeding up the running time and making it possible to deal with networks that have a few million nodes [16]. The largest networks that have been dealt with so far in the literature are a protein– protein interaction network of 30 739 nodes [17], a network of about 400 000 items on sale on the website of a large on-line retailer [8] and a Japanese social networking system of about 5.5 million users [16]. These sizes still leave considerable room for improvement [18] considering that, as of today, the social networking service Facebook has about 64 million active users, the mobile network operator Vodafone has about 200 million customers and Google indexes several billion web-pages. Let us also notice that in most large networks such as those listed above there are several natural organization levels—communities divide themselves into sub-communities—and it is thus desirable to obtain community detection methods that reveal this hierarchical structure [19]. 2. Method We now introduce our algorithm that finds high modularity partitions of large networks in a short time and that unfolds a complete hierarchical community structure for the network, thereby giving access to different resolutions of community detection. Contrary to all the other community detection algorithms, the network size limits that we are facing with our algorithm are due to limited storage capacity rather than limited computation time: identifying communities in a 118 million nodes network took only 152 min4. Our algorithm is divided into two phases that are repeated iteratively. Assume that we start with a weighted network of N nodes. First, we assign a different community to each node of the network. So, in this initial partition there are as many communities as there are nodes. Then, for each node i we consider the neighbours j of i and we evaluate the gain of modularity that would take place by removing i from its community and by 4 All methods described here have been compiled and tested on the same machine: a bi-opteron 2.2k with 24 GB of memory. The code is freely available for download on the web-page http://findcommunities.googlepages.com. doi:10.1088/1742-5468/2008/10/P10008 3
Fast unfolding of communities in large networks placing it in the community of 3. The node i is then placed in the community for which this gain is maximum(in the case of a tie we use a breaking rule), but only if this gain is positive. If no positive gain is possible, i stays in its original community. This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved and the first phase is then complete. Let us insist on the fact that a node may be and often is, considered several times. This first phase stops when a local maxima of the modularity is attained, i. e. when no individual move can improve the modularity. One should also note that the output of the algorithm depends on the order in which the nodes are considered. Preliminary results on several test cases seem to indicate that the orderin of the nodes does not have a significant influence on the modularity that is obtained, while it may affect the computation time, but the reasons for this dependence are not clear. In particular, taking the nodes in a natural order related to the community structure itself (e.g. the order given by a previous community computation, or the postcode)does not D give clear improvement( see section 3). The problem of choosing an order is thus worth studying since it could give good heuristics to enhance the computation time Part of the algorithm's efficiency results from the fact that the gain in modularity AQ obtained by moving an isolated node i into a community C can easily be computed 三 Q 2m )-[需-()-(篇 where > in is the sum of the weights of the links inside C, tot is the sum of the weights node i, kin is the sum of the weights of the links from i to nodes in C and m is the sum N of the weights of all the links in the network. A similar expression is used in order to evaluate the change of modularity when i is removed from its community. In practice, one therefore evaluates the change of modularity by removing i from its community andO then by moving it into a neighbouring community. The second phase of the algorithm consists in building a new network whose nodes are now the communities found during the first phase. To do so, the weights of the links U between the new nodes are given by the sum of the weight of the links between nodes in the corresponding two communities 20. Links between nodes of the same community lead to self-loops for this community in the new network. Once this second phase is completed, it is then possible to reapply the first phase of the algorithm to the resulting weighted network and to iterate. Let us denote by pass'a combination of these two phases. By construction. the number of meta-communities decreases at each pass, and o as a consequence most of the computing time is used in the first pass. The passes are iterated(see figure 1) until there are no more changes and a maximum of modularity is attained. The algorithm is reminiscent of the self-similar nature of complex networks 21 and naturally incorporates a notion of hierarchy, as communities of communities are built during the process. The height of the hierarchy that is constructed is determined by the number of passes and is generally a small number, as will be shown in some examples below In order to decrease the overall running time of the method it is possible to introduce a threshold and then stop the first phase as soon as the relative gain in modularity does not exceed this threshold. The numerical results reported here have been obtained with this minor modification doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks placing it in the community of j. The node i is then placed in the community for which this gain is maximum (in the case of a tie we use a breaking rule), but only if this gain is positive. If no positive gain is possible, i stays in its original community. This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved and the first phase is then complete. Let us insist on the fact that a node may be, and often is, considered several times. This first phase stops when a local maxima of the modularity is attained, i.e. when no individual move can improve the modularity5. One should also note that the output of the algorithm depends on the order in which the nodes are considered. Preliminary results on several test cases seem to indicate that the ordering of the nodes does not have a significant influence on the modularity that is obtained, while it may affect the computation time, but the reasons for this dependence are not clear. In particular, taking the nodes in a natural order related to the community structure itself (e.g. the order given by a previous community computation, or the postcode) does not give clear improvement (see section 3). The problem of choosing an order is thus worth studying since it could give good heuristics to enhance the computation time. Part of the algorithm’s efficiency results from the fact that the gain in modularity ΔQ obtained by moving an isolated node i into a community C can easily be computed by ΔQ = in +2ki,in 2m − tot +ki 2m 2 − in 2m − tot 2m 2 − ki 2m 2 , (2) where in is the sum of the weights of the links inside C, tot is the sum of the weights of the links incident to nodes in C, ki is the sum of the weights of the links incident to node i, ki,in is the sum of the weights of the links from i to nodes in C and m is the sum of the weights of all the links in the network. A similar expression is used in order to evaluate the change of modularity when i is removed from its community. In practice, one therefore evaluates the change of modularity by removing i from its community and then by moving it into a neighbouring community. The second phase of the algorithm consists in building a new network whose nodes are now the communities found during the first phase. To do so, the weights of the links between the new nodes are given by the sum of the weight of the links between nodes in the corresponding two communities [20]. Links between nodes of the same community lead to self-loops for this community in the new network. Once this second phase is completed, it is then possible to reapply the first phase of the algorithm to the resulting weighted network and to iterate. Let us denote by ‘pass’ a combination of these two phases. By construction, the number of meta-communities decreases at each pass, and as a consequence most of the computing time is used in the first pass. The passes are iterated (see figure 1) until there are no more changes and a maximum of modularity is attained. The algorithm is reminiscent of the self-similar nature of complex networks [21] and naturally incorporates a notion of hierarchy, as communities of communities are built during the process. The height of the hierarchy that is constructed is determined by the number of passes and is generally a small number, as will be shown in some examples below. 5 In order to decrease the overall running time of the method it is possible to introduce a threshold and then stop the first phase as soon as the relative gain in modularity does not exceed this threshold. The numerical results reported here have been obtained with this minor modification. doi:10.1088/1742-5468/2008/10/P10008 4
Fast unfolding of communities in large networks 11 1st pass 2nd pass 26 6 3 Figure 1. Visualization of the steps of our algorithm. Each pass is made of 三 two phases: one where modularity is optimized by allowing only local changes of communities: one where the communities found are aggregated in order to O build a new network of communities. The passes are repeated iteratively until lo increase of modularity is possible This simple algorithm has several advantages. First, its steps are intuitive and easy te mplement, and the outcome is unsupervised. Moreover, the algorithm is extremely fast i.e. computer simulations on large ad hoc modular networks suggest that its complexity oo is linear on typical and sparse data. This is due to the fact that the possible gains communities decreases drastically after just a few passes so that most of the running time is concentrated on the first iterations. The so-called resolution limit problem of modularity also seems to be circumvented thanks to the intrinsic multi-level nature of our algorithm. Indeed, it is well known 22 that modularity optimization fails to identify communities smaller than a certain scale, thereby inducing a resolution limit on the O community detected by a pure modularity optimization approach. This observation only partially relevant in our case because the first phase of our algorithm involves the displacement of single nodes from one community to another. Consequently, the probability that two distinct communities can be merged by moving nodes one by one is very low. These communities may possibly be merged in the later passes, after blocks of nodes have been aggregated. However, our algorithm provides a decomposition of the network into communities for different levels of organization. In order to illustrate this feature, let us focus on the ring of 30 cliques discussed in 22, where the cliques are composed of 5 nodes and are interconnected through single links. The first pass of the algorithm finds the natural partition of the network, where each community corresponds to one clique. The second pass finds the global maximum of modularity where cliques are combined into groups of 2. Consequently, if the cliques are indeed merged in the final doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks Figure 1. Visualization of the steps of our algorithm. Each pass is made of two phases: one where modularity is optimized by allowing only local changes of communities; one where the communities found are aggregated in order to build a new network of communities. The passes are repeated iteratively until no increase of modularity is possible. This simple algorithm has several advantages. First, its steps are intuitive and easy to implement, and the outcome is unsupervised. Moreover, the algorithm is extremely fast, i.e. computer simulations on large ad hoc modular networks suggest that its complexity is linear on typical and sparse data. This is due to the fact that the possible gains in modularity are easy to compute with the above formula and that the number of communities decreases drastically after just a few passes so that most of the running time is concentrated on the first iterations. The so-called resolution limit problem of modularity also seems to be circumvented thanks to the intrinsic multi-level nature of our algorithm. Indeed, it is well known [22] that modularity optimization fails to identify communities smaller than a certain scale, thereby inducing a resolution limit on the community detected by a pure modularity optimization approach. This observation is only partially relevant in our case because the first phase of our algorithm involves the displacement of single nodes from one community to another. Consequently, the probability that two distinct communities can be merged by moving nodes one by one is very low. These communities may possibly be merged in the later passes, after blocks of nodes have been aggregated. However, our algorithm provides a decomposition of the network into communities for different levels of organization. In order to illustrate this feature, let us focus on the ring of 30 cliques discussed in [22], where the cliques are composed of 5 nodes and are interconnected through single links. The first pass of the algorithm finds the natural partition of the network, where each community corresponds to one clique. The second pass finds the global maximum of modularity where cliques are combined into groups of 2. Consequently, if the cliques are indeed merged in the final doi:10.1088/1742-5468/2008/10/P10008 5
Fast unfolding of communities in large networks Table 1. Summary of numerical results. This table gives the performances of the algorithm of Clauset et al [8, of Pons and Latapy 7, of Wakita and Tsurumi [16] and of our algorithm for community detection in networks of various sizes. For each method /network, the table displays the modularity that is achieved and the computation time. Empty cells correspond to a computation time over 24 h. Our method clearly performs better in terms of computer time and modularity. It is also interesting to note the small value of Q found by WT for the mobile phone network. This bad modularity result may originate from their heuristic which creates balanced communities, while our approach gives unbalanced communities in this specific network. Karate Arxiv Internet Web nd. edu Phone Web uk-20052001 Nodes/34/779k/24k70k/351k325k/1M2.04M/5.4M39M/783M118M/1B links CNM0.38/0s0.772/3.6s0.692/799s0.927/5034s 0.42/0s0.757/3.3s0.729/57580.895/6666 WT 0.42/0s0.761/0.7s0.667/62s0.898/248s0.553/367s-/ 0.42/0s0.813/0s0.781/1s0.935/3s0.76/44s0.979/738s0.984/152mn 三 gorithm partition due to the resolution limit, they are distinct after the first pass. This resul suggests that the intermediate solutions found by our algorithm may also be meaningful and that the uncovered hierarchical structure may allow the end-user to zoom into the network and to observe its structure with the desired resolution 3. Application to large networks In order to verify the validity of our algorithm, we have applied it on a number of test- case networks that are commonly used for efficiency comparison and we have compared it with three other community detection algorithms(see table 1). The networks that we consider include a small social network [23, a network of 9000 scientific papers and their citations 24], a sub-network of the internet [25 and a web-page network of a few outperforms all the other methods to which it is compared. We also have applied or method on two web networks of unprecedented sizes: a sub-network of the.uk domain of 39 million nodes and 783 million links 27 and a network of 118 million nodes and 1 billion links obtained by the Stanford Web Base crawler 27, 28. Even for these very large networks, the computation time is small(12 min and 152 min, respectively) and makes networks of still larger size, perhaps a billion nodes, accessible to computational analysis It is also interesting to note that the number of passes is usually very small. In the case of the Karate Club [23, for instance, there are only 3 passes: during the first one, the nodes of the network are partitioned into 6 communities; after the second one, only four communities remain; during the third one, nothing happens and the algorithm therefore stops. In the above examples, the number of passes is always smaller than 5 doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks Table 1. Summary of numerical results. This table gives the performances of the algorithm of Clauset et al [8], of Pons and Latapy [7], of Wakita and Tsurumi [16] and of our algorithm for community detection in networks of various sizes. For each method/network, the table displays the modularity that is achieved and the computation time. Empty cells correspond to a computation time over 24 h. Our method clearly performs better in terms of computer time and modularity. It is also interesting to note the small value of Q found by WT for the mobile phone network. This bad modularity result may originate from their heuristic which creates balanced communities, while our approach gives unbalanced communities in this specific network. Karate Arxiv Internet Web nd.edu Phone Web uk-2005 Web WebBase 2001 Nodes/ 34/77 9k/24k 70k/351k 325k/1M 2.04M/5.4M 39M/783M 118M/1B links CNM 0.38/0 s 0.772/3.6 s 0.692/799 s 0.927/5034 s —/— —/— —/— PL 0.42/0 s 0.757/3.3 s 0.729/575 s 0.895/6666 s —/— —/— —/— WT 0.42/0 s 0.761/0.7 s 0.667/62 s 0.898/248 s 0.553/367 s —/— —/— Our 0.42/0 s 0.813/0 s 0.781/1 s 0.935/3 s 0.76/44 s 0.979/738 s 0.984/152 mn algorithm partition due to the resolution limit, they are distinct after the first pass. This result suggests that the intermediate solutions found by our algorithm may also be meaningful and that the uncovered hierarchical structure may allow the end-user to zoom into the network and to observe its structure with the desired resolution. 3. Application to large networks In order to verify the validity of our algorithm, we have applied it on a number of testcase networks that are commonly used for efficiency comparison and we have compared it with three other community detection algorithms (see table 1). The networks that we consider include a small social network [23], a network of 9000 scientific papers and their citations [24], a sub-network of the internet [25] and a web-page network of a few hundred thousand web-pages (the nd.edu domain, see [26]). In all cases, one can observe both the rapidity and the large values of the modularity that are obtained. Our method outperforms all the other methods to which it is compared. We also have applied our method on two web networks of unprecedented sizes: a sub-network of the.uk domain of 39 million nodes and 783 million links [27] and a network of 118 million nodes and 1 billion links obtained by the Stanford WebBase crawler [27, 28]. Even for these very large networks, the computation time is small (12 min and 152 min, respectively) and makes networks of still larger size, perhaps a billion nodes, accessible to computational analysis. It is also interesting to note that the number of passes is usually very small. In the case of the Karate Club [23], for instance, there are only 3 passes: during the first one, the 34 nodes of the network are partitioned into 6 communities; after the second one, only four communities remain; during the third one, nothing happens and the algorithm therefore stops. In the above examples, the number of passes is always smaller than 5. doi:10.1088/1742-5468/2008/10/P10008 6
Fast unfolding of communities in large networks We have also tested the sensitivity of our algorithm by applying it on ad hoc networks that have a known community structure. To do so, we have used networks composed of 128 nodes which are split into 4 communities of 32 nodes each 29. Pairs of nodes belonging to the same community are linked with probability pin while pairs belongin to different communities are linked with probability pout. The accuracy of the method is evaluated by measuring the fraction of correctly identified nodes and the normalized mutual information. In the benchmark proposed in 29, the fraction of correctly identified nodes is 0.67 for zout=8, 0.92 for zout=7 and 0.98 for zout =6, i.e. an accuracy similar to that of the algorithm of Pons and Latapy 7 and of the algorithm of Reichardt and Bornholdt 30. To our knowledge, only two algorithms have a better accuracy than ours the algorithm of Duch and Arenas 31] and the simulated annealing method first proposed in [15, but their computational cost limits their applicability to much smaller networks than the ones considered here. Our algorithm has also been successfully tested on other CD information is nearly 1 for the macro-communities with a mixing parameter ks up to 35 It reaches 0.5 when the mixing parameter is around 55 To validate the communities obtained we have also applied our algorithm to a large network constructed from the records of a belgian mobile phone company. This network is 三 described in detail in [33] where it is shown to exhibits typical features of social networks, ( D composed of 2.6 million customers, between whom weighted links are drawn that account such as a high clustering coefficient and a fat-tailed degree distribution. The network is for their total number of phone calls during a 6 month period. In this paper, we have focused on a subset of 2.04 million customers for whom several entries are associated such as their age, sex, language and the postcode of the place where they live. This large social network is exceptional due to the particular situation of Belgium where two main linguistic communities(French and Dutch) coexist and which provides an excellent way to test the validity of our community detection method by looking at the linguistic homogeneity of communities[34]. From a more sociological point of view, the possibility to co highlight the linguistic, religious or ethnic homogeneity of communities opens perspectives for describing the social cohesion and the potential fragility of a country 35 On this particular network, our community detection algorithm has identified a hierarchy of 6 levels. At the bottom level every customer is a community of its own and at the top level there are 261 communities that have more than 100 customers. These communities account for about 75% of all customers. We have performed a language O analysis of these 261 communities(see figure 2). The homogeneity of a community is characterized by the percentage of those speaking the dominant language in that oo community; this quantity goes to l when the community tends to be monolingual. Our analysis reveals that the network is strongly segregated, with most communities almost monolingual. There are 36 communities with more than 10 000 customers and, except for one community at the interface between the two language clusters, all these communitie have more than 85% of their members speaking the same language(see figure 3 for a complete distribution). It is interesting to analyse more closely the only community that has a more equilibrate distribution of languages. Our hierarchy-revealing algorithm allows us to do this by considering the sub-communities provided by the algorithm at the lower level. As shown in figure 3, these sub-communities are closely connected to each other and are themselves composed of heterogeneous groups of people. TH doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks We have also tested the sensitivity of our algorithm by applying it on ad hoc networks that have a known community structure. To do so, we have used networks composed of 128 nodes which are split into 4 communities of 32 nodes each [29]. Pairs of nodes belonging to the same community are linked with probability pin while pairs belonging to different communities are linked with probability pout. The accuracy of the method is evaluated by measuring the fraction of correctly identified nodes and the normalized mutual information. In the benchmark proposed in [29], the fraction of correctly identified nodes is 0.67 for zout = 8, 0.92 for zout = 7 and 0.98 for zout = 6, i.e. an accuracy similar to that of the algorithm of Pons and Latapy [7] and of the algorithm of Reichardt and Bornholdt [30]. To our knowledge, only two algorithms have a better accuracy than ours, the algorithm of Duch and Arenas [31] and the simulated annealing method first proposed in [15], but their computational cost limits their applicability to much smaller networks than the ones considered here. Our algorithm has also been successfully tested on other benchmarks, such as the ones proposed in [19, 32]. In the case [32], the normalized mutual information is nearly 1 for the macro-communities with a mixing parameter k3 up to 35. It reaches 0.5 when the mixing parameter is around 55. To validate the communities obtained we have also applied our algorithm to a large network constructed from the records of a Belgian mobile phone company. This network is described in detail in [33] where it is shown to exhibits typical features of social networks, such as a high clustering coefficient and a fat-tailed degree distribution. The network is composed of 2.6 million customers, between whom weighted links are drawn that account for their total number of phone calls during a 6 month period. In this paper, we have focused on a subset of 2.04 million customers for whom several entries are associated, such as their age, sex, language and the postcode of the place where they live. This large social network is exceptional due to the particular situation of Belgium where two main linguistic communities (French and Dutch) coexist and which provides an excellent way to test the validity of our community detection method by looking at the linguistic homogeneity of communities [34]. From a more sociological point of view, the possibility to highlight the linguistic, religious or ethnic homogeneity of communities opens perspectives for describing the social cohesion and the potential fragility of a country [35]. On this particular network, our community detection algorithm has identified a hierarchy of 6 levels. At the bottom level every customer is a community of its own and at the top level there are 261 communities that have more than 100 customers. These communities account for about 75% of all customers. We have performed a language analysis of these 261 communities (see figure 2). The homogeneity of a community is characterized by the percentage of those speaking the dominant language in that community; this quantity goes to 1 when the community tends to be monolingual. Our analysis reveals that the network is strongly segregated, with most communities almost monolingual. There are 36 communities with more than 10 000 customers and, except for one community at the interface between the two language clusters, all these communities have more than 85% of their members speaking the same language (see figure 3 for a complete distribution). It is interesting to analyse more closely the only community that has a more equilibrate distribution of languages. Our hierarchy-revealing algorithm allows us to do this by considering the sub-communities provided by the algorithm at the lower level. As shown in figure 3, these sub-communities are closely connected to each other and are themselves composed of heterogeneous groups of people. These doi:10.1088/1742-5468/2008/10/P10008 7
Fast unfolding of communities in large networks 三 Figure 2. Graphical representation of the network of communities extracted from a Belgian mobile phone network. About 2 million customers are represented on this network. The size of a node is proportional to the number of individuals in the language spoken in the community(red for French and green for Dutch). Only the o communities composed of more than 100 customers have been plotted. Notice the intermediate community of mixed colours between the two main language clusters. A zoom at higher resolution reveals that it is made of several sub- communities with less apparent language separation groups of people, where language ceases to be a discriminating factor, might possibly play a crucial role for the integration of the country and for the emergence of consensus between the communities 36. One may indeed wonder what would happen if the community at the interface between the two language clusters in figure 2 was to be removed doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks Figure 2. Graphical representation of the network of communities extracted from a Belgian mobile phone network. About 2 million customers are represented on this network. The size of a node is proportional to the number of individuals in the corresponding community and its colour on a red–green scale represents the main language spoken in the community (red for French and green for Dutch). Only the communities composed of more than 100 customers have been plotted. Notice the intermediate community of mixed colours between the two main language clusters. A zoom at higher resolution reveals that it is made of several subcommunities with less apparent language separation. groups of people, where language ceases to be a discriminating factor, might possibly play a crucial role for the integration of the country and for the emergence of consensus between the communities [36]. One may indeed wonder what would happen if the community at the interface between the two language clusters in figure 2 was to be removed. doi:10.1088/1742-5468/2008/10/P10008 8
Fast unfolding of communities in large networks "2“ ocotoo° m Majority French ▲ Majority Dutch 1000 10000 100000 Community size 三 Figure 3. For the largest communities in the Belgian mobile phone network D we represent the size of the community and the proportion of customers in the community that speak the dominant language of the community. For all but one community of more than 10000 members the dominant language is spoken by more than 85% of the community members Another interesting observation is related to the presence of other languages. There are actually four possible language declarations for the customers of this particular mobile oo phone operator: French, Dutch, English or German. It is interesting to note that, whereas English-speaking customers disperse themselves quite evenly in all communities, more than 60% of the German speaking customers are concentrated in just one community This is probably due to the fact that german-speaking people are mainly concentrated in a small region close to Germany, while English-speaking people are spread over the whole country. Let us finally observe that, as can be visually noticed in figure 2, French-speaking ommunities are much more densely connected than their Dutch-speaking counterparts: O on average, the strength of the links between French-speaking communities is 54% stronger than those between Dutch-speaking communities. This difference of structure o between the two sub-networks seems to indicate that the two linguistic communities are characterized by different social behaviours and therefore suggests to search other topological characteristics for the communities We have also focused on this mobile phone network in order to elucidate the role played by the ordering of the nodes on the output of the method. To do so, we have first performed 100 analyses where the ordering of the nodes is chosen randomly. From the modularity Q i and computation time Ti evaluated at each run i, we have computed the average and variance of these variables. In the case of modularity, the value is almost constant over all the runs with(Q)=0.76 and a deviation of oQ=V(Q2)-(Q)2=10 The fluctuations of the computation time are more important, as(1 44.2 s and doi:10.1088/1742-5468/2008/10/P10008
J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks Figure 3. For the largest communities in the Belgian mobile phone network we represent the size of the community and the proportion of customers in the community that speak the dominant language of the community. For all but one community of more than 10 000 members the dominant language is spoken by more than 85% of the community members. Another interesting observation is related to the presence of other languages. There are actually four possible language declarations for the customers of this particular mobile phone operator: French, Dutch, English or German. It is interesting to note that, whereas English-speaking customers disperse themselves quite evenly in all communities, more than 60% of the German speaking customers are concentrated in just one community. This is probably due to the fact that German-speaking people are mainly concentrated in a small region close to Germany, while English-speaking people are spread over the whole country. Let us finally observe that, as can be visually noticed in figure 2, French-speaking communities are much more densely connected than their Dutch-speaking counterparts: on average, the strength of the links between French-speaking communities is 54% stronger than those between Dutch-speaking communities. This difference of structure between the two sub-networks seems to indicate that the two linguistic communities are characterized by different social behaviours and therefore suggests to search other topological characteristics for the communities. We have also focused on this mobile phone network in order to elucidate the role played by the ordering of the nodes on the output of the method. To do so, we have first performed 100 analyses where the ordering of the nodes is chosen randomly. From the modularity Qi and computation time Ti evaluated at each run i, we have computed the average and variance of these variables. In the case of modularity, the value is almost constant over all the runs with Q = 0.76 and a deviation of σQ ≡ Q2−Q2 = 10−2. The fluctuations of the computation time are more important, as T = 44.2 s and doi:10.1088/1742-5468/2008/10/P10008 9