arXiv:cond-mat/0308217v1 [cond-mat.stat-mech] 11 Aug 2003 Finding and evaluating community structure in networks M. E. J. Newman1, 2 and M. Girvan2, 3 1Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109–1120 2Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501 3Department of Physics, Cornell University, Ithaca, NY 14853–2501 We propose and study a set of algorithms for discovering community structure in networks— natural divisions of network nodes into densely connected subgroups. Our algorithms all share two definitive features: first, they involve iterative removal of edges from the network to split it into communities, the edges removed being identified using one of a number of possible “betweenness” measures, and second, these measures are, crucially, recalculated after each removal. We also propose a measure for the strength of the community structure found by our algorithms, which gives us an objective metric for choosing the number of communities into which a network should be divided. We demonstrate that our algorithms are highly effective at discovering community structure in both computer-generated and real-world network data, and show how they can be used to shed light on the sometimes dauntingly complex structure of networked systems. I. INTRODUCTION Empirical studies and theoretical modeling of networks have been the subject of a large body of recent research in statistical physics and applied mathematics [1, 2, 3, 4]. Network ideas have been applied with great success to topics as diverse as the Internet and the world wide web [5, 6, 7], epidemiology [8, 9, 10, 11], scientific citation and collaboration [12, 13], metabolism [14, 15], and ecosystems [16, 17], to name but a few. A property that seems to be common to many networks is community structure, the division of network nodes into groups within which the network connections are dense, but between which they are sparser—see Fig. 
1. The ability to find and analyze such groups can provide invaluable help in understanding and visualizing the structure of networks. In this paper we show how this can be achieved.

The study of community structure in networks has a long history. It is closely related to the ideas of graph partitioning in graph theory and computer science, and hierarchical clustering in sociology [18, 19]. Before presenting our own findings, it is worth reviewing some of this preceding work, to understand its achievements and where it falls short.

FIG. 1: A small network with community structure of the type considered in this paper. In this case there are three communities, denoted by the dashed circles, which have dense internal links but between which there is only a lower density of external links.

Graph partitioning is a problem that arises in, for example, parallel computing. Suppose we have a number n of intercommunicating computer processes, which we wish to distribute over a number g of computer processors. Processes do not necessarily need to communicate with all others, and the pattern of required communications can be represented by a graph or network in which the vertices represent processes and edges join process pairs that need to communicate. The problem is to allocate the processes to processors in such a way as roughly to balance the load on each processor, while at the same time minimizing the number of edges that run between processors, so that the amount of interprocessor communication (which is normally slow) is minimized. In general, finding an exact solution to a partitioning task of this kind is believed to be an NP-complete problem, making it prohibitively difficult to solve for large graphs, but a wide variety of heuristic algorithms have been developed that give acceptably good solutions in many cases, the best known being perhaps the Kernighan–Lin algorithm [20], which runs in time O(n³) on sparse graphs.
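The partitioning objective just described can be made concrete with a short sketch. The function names and example data below are illustrative only, not from any particular library; the code simply counts the inter-processor edges that a given process-to-processor assignment would incur, alongside the per-processor load that a heuristic such as Kernighan–Lin would try to keep balanced.

```python
from collections import Counter

def cut_size(edges, assignment):
    """Number of communication edges whose endpoints land on different processors."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def loads(assignment):
    """How many processes each processor carries."""
    return Counter(assignment.values())

# six processes on two processors; edges encode required communication
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
balanced = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(cut_size(edges, balanced))  # only the (2, 3) edge crosses -> 1
print(loads(balanced))            # three processes per processor
```

A heuristic partitioner would search over assignments for one minimizing `cut_size` subject to `loads` staying near-equal; an exact search is what is believed to be NP-complete.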
A solution to the graph partitioning problem is however not particularly helpful for analyzing and understanding networks in general. If we merely want to find if and how a given network breaks down into communities, we probably don't know how many such communities there are going to be, and there is no reason why they should be roughly the same size. Furthermore, the number of inter-community edges needn't be strictly minimized either, since more such edges are admissible between large communities than between small ones.

As far as our goals in this paper are concerned, a more useful approach is that taken by social network analysis with the set of techniques known as hierarchical clustering. These techniques are aimed at discovering natural divisions of (social) networks into groups, based on various metrics of similarity or strength of connection between vertices. They fall into two broad classes, agglomerative and divisive [19], depending on whether they focus on the addition or removal of edges to or from the network. In an agglomerative method, similarities are calculated by one method or another between vertex pairs, and edges are then added to an initially empty network (n vertices with no edges) starting with the vertex pairs with highest similarity. The procedure can be halted at any point, and the resulting components in the network are taken to be the communities. Alternatively, the entire progression of the algorithm from empty graph to complete graph can be represented in the form of a tree or dendrogram such as that shown in Fig. 2. Horizontal cuts through the tree represent the communities appropriate to different halting points.

FIG. 2: A hierarchical tree or dendrogram illustrating the type of output generated by the algorithms described here. The circles at the bottom of the figure represent the individual vertices of the network. As we move up the tree the vertices join together to form larger and larger communities, as indicated by the lines, until we reach the top, where all are joined together in a single community. Alternatively, the dendrogram depicts an initially connected network splitting into smaller and smaller communities as we go from top to bottom. A cross-section of the tree at any level, as indicated by the dotted line, will give the communities at that level. The vertical height of the split-points in the tree is indicative only of the order in which the splits (or joins) took place, although it is possible to construct more elaborate dendrograms in which these heights contain other information.

Agglomerative methods based on a wide variety of similarity measures have been applied to different networks. Some networks have natural similarity metrics built in.
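As an illustration of the agglomerative procedure described above, the following sketch adds edges between the most similar vertex pairs first and reads off connected components as communities. The function name and similarity scores are hypothetical, not from the paper; a union-find structure stands in for explicit component tracking.

```python
def agglomerative_communities(n, similarities, stop_at):
    """Add edges between the most similar vertex pairs first and return
    the connected components once stop_at components remain.
    similarities: list of (score, u, v) for vertex pairs."""
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for score, u, v in sorted(similarities, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                  # this edge joins two components
            parent[ru] = rv
            components -= 1
            if components == stop_at:  # one possible halting point
                break
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

# five vertices; the scores suggest groups {0, 1, 2} and {3, 4}
pairs = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 3, 4), (0.1, 2, 3)]
print(agglomerative_communities(5, pairs, stop_at=2))  # -> [[0, 1, 2], [3, 4]]
```

Halting at different values of `stop_at` corresponds to cutting the dendrogram of Fig. 2 at different heights.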
For example, in the widely studied network of collaborations between film actors [21, 22], in which two actors are connected if they have appeared in the same film, one could quantify similarity by how many films actors have appeared in together [23]. Other networks have no natural metric, but suitable ones can be devised using correlation coefficients, path lengths, or matrix methods. A well-known example of an agglomerative clustering method is the Concor algorithm of Breiger et al. [24].

Agglomerative methods have their problems, however. One concern is that they fail with some frequency to find the correct communities in networks where the community structure is known, which makes it difficult to place much trust in them in other cases. Another is their tendency to find only the cores of communities and leave out the periphery. The core nodes in a community often have strong similarity, and hence are connected early in the agglomerative process, but peripheral nodes that have no strong similarity to others tend to get neglected, leading to structures like that shown in Fig. 3. In this figure, there are a number of peripheral nodes whose community membership is obvious to the eye—in most cases they have only a single link to a specific community—but agglomerative methods often fail to place such nodes correctly.

FIG. 3: Agglomerative clustering methods are typically good at discovering the strongly linked cores of communities (bold vertices and edges) but tend to leave out peripheral vertices, even when, as here, most of them clearly belong to one community or another.

In this paper, therefore, we focus on divisive methods. These methods have been relatively little studied in the previous literature, either in social network theory or elsewhere, but, as we will see, seem to offer a lot of promise. In a divisive method, we start with the network of interest and attempt to find the least similar connected pairs of vertices and then remove the edges between them.
By doing this repeatedly, we divide the network into smaller and smaller components, and again we can stop the process at any stage and take the components at that stage to be the network communities. Again, the process can be represented as a dendrogram depicting the successive splits of the network into smaller and smaller groups. The approach we take follows roughly these lines, but adopts a somewhat different philosophical viewpoint. Rather than looking for the most weakly connected vertex pairs, our approach will be to look for the edges in the network that are most “between” other vertices, meaning that the edge is, in some sense, responsible for connecting many pairs of others. Such edges need not be weak at all in the similarity sense. How this idea works out in practice will become clear in the course of the presentation. Briefly then, the outline of this paper is as follows. In Sec. II we describe the crucial concepts behind our methods for finding community structure in networks and show how these concepts can be turned into a concrete prescription for performing calculations. In Sec. III we describe in detail the implementation of our methods. In
Sec. IV we consider ways of determining when a particular division of a network into communities is a good one, allowing us to quantify the success of our community-finding algorithms. And in Sec. V we give a number of applications of our algorithms to particular networks, both real and artificial. In Sec. VI we give our conclusions. A brief report of some of the work contained in this paper has appeared previously as Ref. 25.

II. FINDING COMMUNITIES IN A NETWORK

In this paper we present a class of new algorithms for network clustering, i.e., the discovery of community structure in networks. Our discussion focuses primarily on networks with only a single type of vertex and a single type of undirected, unweighted edge, although generalizations to more complicated network types are certainly possible.

There are two central features that distinguish our algorithms from those that have preceded them. First, our algorithms are divisive rather than agglomerative. Divisive algorithms have occasionally been studied in the past, but, as discussed in the introduction, ours differ in focusing not on removing the edges between vertex pairs with lowest similarity, but on finding edges with the highest "betweenness," where betweenness is some measure that favors edges that lie between communities and disfavors those that lie inside communities.

To make things more concrete, we give some examples of the types of betweenness measures we will be looking at. All of them are based on the same idea. If two communities are joined by only a few inter-community edges, then all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges. Given a suitable set of paths, one can count how many go along each edge in the graph, and this number we then expect to be largest for the inter-community edges, thus providing a method for identifying them. Our different measures correspond to various implementations of this idea.

1. The simplest example of such a betweenness measure is that based on shortest (geodesic) paths: we find the shortest paths between all pairs of vertices and count how many run along each edge. To the best of our knowledge this measure was first introduced by Anthonisse in a never-published technical report in 1971 [26]. Anthonisse called it "rush," but we prefer the term edge betweenness, since the quantity is a natural generalization to edges of the well-known (vertex) betweenness measure of Freeman [27], which was the inspiration for our approach. When we need to distinguish it from the other betweenness measures considered in this paper, we will refer to it as shortest-path betweenness. A fast algorithm for calculating the shortest-path betweenness is given in Sec. III A.

2. The shortest-path betweenness can be thought of in terms of signals traveling through a network. If signals travel from source to destination along geodesic network paths, and all vertices send signals at the same constant rate to all others, then the betweenness is a measure of the rate at which signals pass along each edge. Suppose however that signals do not travel along geodesic paths, but instead just perform a random walk about the network until they reach their destination. This gives us another measure on edges, the random-walk betweenness: we calculate the expected net number of times that a random walk between a particular pair of vertices will pass down a particular edge and sum over all vertex pairs. The random-walk betweenness can be calculated using matrix methods, as described in Sec. III C.

3. Another betweenness measure is motivated by ideas from elementary circuit theory. We consider the circuit created by placing a unit resistance on each edge of the network and unit current source and sink at a particular pair of vertices.
The resulting current flow in the network will travel from source to sink along a multitude of paths, those with least resistance carrying the greatest fraction of the current. The current-flow betweenness for an edge we define to be the absolute value of the current along the edge summed over all source/sink pairs. It can be calculated using Kirchhoff’s laws, as described in Sec. III B. In fact, as we will show, the current-flow betweenness turns out to be exactly the same as the random walk betweenness of the previous paragraph, but we nonetheless consider it separately since it leads to a simpler derivation of the measure. These measures are only suggestions; many others are possible and may well be appropriate for specific applications. Measures (1) and (2) are in some sense extremes in the spectrum of possibilities, one corresponding to signals that know exactly where they are going, and the other to signals that have no idea where they are going. As we will see, however, these two measures actually give rather similar results, indicating that the precise choice of betweenness measure may not, at least for the types of applications considered here, be that important. The second way in which our methods differ from previous ones is in the inclusion of a “recalculation step” in the algorithm. If we were to perform a standard divisive clustering based on edge betweenness we would calculate the edge betweenness for all edges in the network and then remove edges in decreasing order of betweenness to produce a dendrogram like that of Fig. 2, showing the order in which the network split up. However, once the first edge in the network is removed in such an algorithm, the betweenness values for the remaining edges will no longer reflect the network as it now is. This can give rise to unwanted behaviors. For exam-
ple, if two communities are joined by two edges, but, for one reason or another, most paths between the two flow along just one of those edges, then that edge will have a high betweenness score and the other will not. An algorithm that calculated betweennesses only once and then removed edges in betweenness order would remove the first edge early in the course of its operation, but the second might not get removed until much later. Thus the obvious division of the network into two parts might not be discovered by the algorithm. In the worst case the two parts themselves might be individually broken up before the division between the two is made. In practice, problems like this crop up in real networks with some regularity and render algorithms of this type ineffective for the discovery of community structure.

The solution, luckily, is obvious. We simply recalculate our betweenness measure after the removal of each edge. This certainly adds to the computational effort of performing the calculation, but its effect on the results is so desirable that we consider the price worth paying. Thus the general form of our community structure finding algorithm is as follows:

1. Calculate betweenness scores for all edges in the network.

2. Find the edge with the highest score and remove it from the network.

3. Recalculate betweenness for all remaining edges.

4. Repeat from step 2.

In fact, it appears that the recalculation step is the most important feature of the algorithm, as far as getting satisfactory results is concerned. As mentioned above, our studies indicate that, once one hits on the idea of using betweenness measures to weight edges, the exact measure one uses appears not to influence the results highly. The recalculation step, on the other hand, is absolutely crucial to the operation of our methods.
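The four-step procedure above can be sketched directly in code. The following is a minimal Python sketch, not the authors' own implementation; function names and the adjacency-set data layout are our choices. `edge_betweenness` uses the breadth-first path counting of the kind described later in this section, and `split_once` removes the highest-betweenness edge, recalculating after every removal, until the network first splits into more components.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an undirected graph given as
    {vertex: set_of_neighbors}.  For each source s, a breadth-first search
    assigns every vertex a distance and a shortest-path count; edge scores
    are then accumulated from the farthest vertices back toward s."""
    score = {}
    for s in adj:
        dist, npaths = {s: 0}, {s: 1}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for u in adj[v]:
                if u not in dist:               # first time u is reached
                    dist[u], npaths[u] = dist[v] + 1, 0
                    q.append(u)
                if dist[u] == dist[v] + 1:      # v lies on a shortest path to u
                    npaths[u] += npaths[v]
        below = {v: 0.0 for v in order}         # scores on edges below each vertex
        for u in reversed(order):               # farthest vertices first
            for v in adj[u]:
                if dist.get(v) == dist[u] - 1:  # v is one step nearer to s
                    c = (1 + below[u]) * npaths[v] / npaths[u]
                    e = frozenset((u, v))
                    score[e] = score.get(e, 0.0) + c
                    below[v] += c
    # every source/target pair is counted from both ends, so halve
    return {e: x / 2 for e, x in score.items()}

def components(adj):
    """Connected components, found by repeated breadth-first search."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp, q = {s}, deque([s])
            while q:
                for u in adj[q.popleft()]:
                    if u not in comp:
                        comp.add(u)
                        q.append(u)
            seen |= comp
            comps.append(comp)
    return comps

def split_once(adj):
    """Steps 1-4 of the text: repeatedly remove the highest-betweenness
    edge, recalculating all betweennesses after each removal, until the
    network splits into more components than it started with."""
    adj = {v: set(adj[v]) for v in adj}         # work on a copy
    n0 = len(components(adj))
    while len(components(adj)) == n0:
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)
```

On a toy network of two triangles joined by a single bridge edge, the bridge carries all shortest paths between the two sides, receives the highest betweenness, and is removed first, splitting the network into the two obvious communities.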
This step was missing from previous attempts at solving the clustering problem using divisive algorithms, and yet without it the results are very poor indeed, failing to find known community structure even in the simplest of cases. In Sec. V B we give an example comparing the performance of the algorithm on a particular network with and without the recalculation step.

In the following sections we discuss implementation and give examples of our algorithms for finding community structure. For the reader who merely wants to know what algorithm they should use for their own problem, let us give an immediate answer: for most problems, we recommend the algorithm with betweenness scores calculated using the shortest-path betweenness measure (1) above. This measure appears to work well and is the quickest to calculate—as described in Sec. III A, it can be calculated for all edges in time O(mn), where m is the number of edges in the graph and n is the number of vertices. This is the only version of the algorithm that we discussed in Ref. 25 [47]. The other versions we discuss, while being of some pedagogical interest, make greater computational demands, and in practice seem to give results no better than the shortest-path method.

III. IMPLEMENTATION

In theory, the descriptions of the last section completely define the methods we consider in this paper, but in practice there are a number of tricks to their implementation that are important for turning the description into a workable computer algorithm. Essentially all of the work in the algorithm is in the calculation of the betweenness scores for the edges; the job of finding and removing the highest-scoring edge is trivial and not computationally demanding. Let us tackle our three suggested betweenness measures in turn.

A. Shortest-path betweenness

At first sight, it appears that calculating the edge betweenness measure based on geodesic paths for all edges will take O(mn^2) operations on a graph with m edges and n vertices: calculating the shortest path between a particular pair of vertices can be done using breadth-first search in time O(m) [28, 29], and there are O(n^2) vertex pairs. Recently, however, new algorithms have been proposed by Newman [30] and independently by Brandes [31] that can perform the calculation faster than this, finding all betweennesses in O(mn) time. Both Newman and Brandes gave algorithms for the standard Freeman vertex betweenness, but it is trivial to adapt their algorithms for edge betweenness. We describe the resulting method here for the algorithm of Newman.

Breadth-first search can find shortest paths from a single vertex s to all others in time O(m). In the simplest case, when there is only a single shortest path from the source vertex to any other (we will consider other cases in a moment) the resulting set of paths forms a shortest-path tree—see Fig. 4a. We can now use this tree to calculate the contribution to betweenness for each edge from this set of paths as follows. We find first the "leaves" of the tree, i.e., those nodes such that no shortest paths to other nodes pass through them, and we assign a score of 1 to the single edge that connects each to the rest of the tree, as shown in the figure. Then, starting with those edges that are farthest from the source vertex on the tree, i.e., lowest in Fig. 4a, we work upwards, assigning a score to each edge that is 1 plus the sum of the scores on the neighboring edges immediately below it. When we have gone through all edges in the tree, the resulting scores are the betweenness counts for the paths from vertex s. Repeating the process for all possible vertices s and summing the scores, we arrive at the full betweenness scores for shortest paths between all pairs.
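The per-source part of this calculation can be transcribed almost directly. Below is a small Python sketch (function and variable names are ours, not from the paper): it performs the breadth-first search from one source s, recording a path-count weight for every vertex so that the multiple-shortest-path case treated later in this section is also handled, and then scores edges from the farthest vertices upward. The caption of Fig. 4 supplies a handy check: the scores on the edges attached to s must sum to the number of other vertices reachable from s.

```python
from collections import deque

def source_contribution(adj, s):
    """Contribution of one source vertex s to the shortest-path edge
    betweenness.  Stage 1: breadth-first search assigning each vertex a
    distance and a path-count weight w.  Stage 2: score edges from the
    leaves upward, each edge getting (1 + scores below) * w_near/w_far."""
    dist, w = {s: 0}, {s: 1}
    order, q = [], deque([s])
    while q:                                  # stage 1
        i = q.popleft()
        order.append(i)
        for j in adj[i]:
            if j not in dist:                 # first visit to j
                dist[j], w[j] = dist[i] + 1, 0
                q.append(j)
            if dist[j] == dist[i] + 1:        # i lies on a shortest path to j
                w[j] += w[i]
            # if dist[j] < dist[i] + 1: do nothing
    score, below = {}, {i: 0.0 for i in order}
    for j in reversed(order):                 # stage 2: farthest edges first
        for i in adj[j]:
            if dist.get(i) == dist[j] - 1:    # i is one step nearer to s
                c = (1 + below[j]) * w[i] / w[j]
                score[frozenset((i, j))] = c
                below[i] += c
    return w, score
```

For a square network s-a, s-b, a-t, b-t there are two shortest paths from s to t, so t gets weight 2, the edges at t each score 1/2, and the two edges at s together score 3, which equals the three vertices reachable from s.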
The breadth-first search and the process of working up through the tree
both take worst-case time O(m) and there are n vertices total, so the entire calculation takes time O(mn) as claimed.

FIG. 4: Calculation of shortest-path betweenness: (a) When there is only a single shortest path from a source vertex s (top) to all other reachable vertices, those paths necessarily form a tree, which makes the calculation of the contribution to betweenness from this set of paths particularly simple, as described in the text. (b) For cases in which there is more than one shortest path to some vertices, the calculation is more complex. First we must calculate the number of paths from the source to each other vertex (numbers on vertices), and then these are used to weight the path counts appropriately. In either case, we can check the results by confirming that the sum of the betweennesses of the edges connected to the source vertex is equal to the total number of reachable vertices—six in each of the cases illustrated here.

This simple case serves to illustrate the basic principle behind the algorithm. In general, however, it is not the case that there is only a single shortest path between any pair of vertices. Most networks have at least some vertex pairs between which there are several geodesic paths of equal length. Figure 4b shows a simple example of a shortest-path "tree" for a network with this property. The resulting structure is in fact no longer a tree, and in such cases an extra step is required in the algorithm to calculate the betweenness correctly. In the traditional definition of vertex betweenness [27] multiple shortest paths between a pair of vertices are given equal weights summing to 1. For example, if there are three shortest paths, each will be given weight 1/3. We adopt the same definition for our edge betweenness (as did Anthonisse in his original work [26], although other definitions are possible [32]).
Note that the paths may run along the same edge or edges for some part of their length, resulting in edges with greater weight. To calculate correctly what fraction of the paths flow along each edge in the network, we generalize the breadth-first search part of the calculation, as follows. Consider Fig. 4b and suppose we are performing a breadth-first search starting at vertex s. We carry out the following steps:

1. The initial vertex s is given distance ds = 0 and a weight ws = 1.

2. Every vertex i adjacent to s is given distance di = ds + 1 = 1, and weight wi = ws = 1.

3. For each vertex j adjacent to one of those vertices i we do one of three things:

(a) If j has not yet been assigned a distance, it is assigned distance dj = di + 1 and weight wj = wi.

(b) If j has already been assigned a distance and dj = di + 1, then the vertex's weight is increased by wi, that is wj ← wj + wi.

(c) If j has already been assigned a distance and dj < di + 1, we do nothing.

4. Repeat from step 3 until no vertices remain that have assigned distances but whose neighbors do not have assigned distances.

In practice, this algorithm can be implemented most efficiently using a queue or first-in/first-out buffer to store the vertices that have been assigned a distance, just as in the standard breadth-first search.

Physically, the weight on a vertex i represents the number of distinct paths from the source vertex to i. These weights are precisely what we need to calculate our edge betweennesses, because if two vertices i and j are connected, with j farther than i from the source s, then the fraction of a geodesic path from j through i to s is given by wi/wj. Thus, to calculate the contribution to edge betweenness from all shortest paths starting at s, we need only carry out the following steps:

1. Find every "leaf" vertex t, i.e., a vertex such that no paths from s to other vertices go through t.

2. For each vertex i neighboring t assign a score to the edge from t to i of wi/wt.

3. Now, starting with the edges that are farthest from the source vertex s—lower down in a diagram such as Fig. 4b—work up towards s. To the edge from vertex i to vertex j, with j being farther from s than i, assign a score that is 1 plus the sum of the scores on the neighboring edges immediately below it (i.e., those with which it shares a common vertex), all multiplied by wi/wj.

4. Repeat from step 3 until vertex s is reached.

Now repeating this process for all n source vertices s and summing the resulting scores on the edges gives us the total betweenness for all edges in time O(mn). We now have to repeat this calculation for each edge removed from the network, of which there are m, and hence the complete community structure algorithm based on shortest-path betweenness operates in worst-case time O(m^2 n), or O(n^3) time on a sparse graph. In our experience, this typically makes it tractable for networks of up
to about n = 10 000 vertices, with current (circa 2003) desktop computers. In some special cases one can do better. In particular, we note that the removal of an edge only affects the betweenness of other edges that fall in the same component, and hence that we need only recalculate betweennesses in that component. Networks with strong community structure often break apart into separate components quite early in the progress of the algorithm, substantially reducing the amount of work that needs to be done on subsequent steps. Whether this results in a change in the computational complexity of the algorithm for any commonly occurring classes of graphs is an open question, but it certainly gives a substantial speed boost to many of the calculations described in this paper.

Some networks are directed, i.e., their edges run in one direction only. The world wide web is an example; links in the web point in one direction only from one web page to another. One could imagine a generalization of the shortest-path betweenness that allowed for directed edges by counting only those paths that travel in the forward direction along edges. Such a calculation is a trivial variation on the one described above. However, we have found that in many cases it is better to ignore the directed nature of a network in calculating community structure. Often an edge acts simply as an indication of a connection between two nodes, and its direction is unimportant. For example, in Ref. 25 we applied our algorithm to a food web of predator-prey interactions between marine species. Predator-prey interactions are clearly directed—one species may eat another, but it is unlikely that the reverse is simultaneously true. However, as far as community structure is concerned, we want to know only which species have interactions with which others.
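In practice, ignoring edge directions just means symmetrizing the adjacency structure before running the algorithm. A one-function sketch, our own illustration rather than anything from the paper:

```python
def undirected_view(directed_adj):
    """Return the undirected version of a directed network given as
    {vertex: set_of_successors}: vertices i and j are joined if an edge
    runs between them in either direction."""
    und = {v: set() for v in directed_adj}
    for v, successors in directed_adj.items():
        for u in successors:
            und.setdefault(v, set()).add(u)
            und.setdefault(u, set()).add(v)
    return und
```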
We find, therefore, that our algorithm applied to the undirected version of the food web works well at picking out the community structure, and no special algorithm is needed for the directed case. We give another example of our method applied to a directed graph in Sec. V D.

B. Resistor networks

As examples of betweenness measures that take more than just shortest paths into account, we proposed in Sec. II measures based on random walks and resistor networks. In fact, as we now show, when appropriately defined these two measures are precisely the same. Here we derive the resistance measure first, since it turns out to be simpler; in the following section we derive the random walk measure and show that the two are equivalent.

Consider the network created by placing a unit resistance on every edge of our network, a unit current source at vertex s, and a unit current sink at vertex t (see Fig. 5). Clearly the current between s and t will flow primarily along short paths, but some will flow along longer ones, roughly in inverse proportion to their length. We will use the absolute magnitude of the current flow as our betweenness score for each source/sink pair.

FIG. 5: An example of the type of resistor network considered here, in which a unit resistance is placed on each edge and unit current flows into and out of the source and sink vertices.

The current flows in the network are governed by Kirchhoff's laws. To solve them we proceed as follows for each separate component of the graph. Let Vi be the voltage at vertex i, measured relative to any convenient point. Then for all i we have

$$\sum_j A_{ij} (V_i - V_j) = \delta_{is} - \delta_{it}, \tag{1}$$

where Aij is the ij element of the adjacency matrix of the graph, i.e., Aij = 1 if i and j are connected by an edge and Aij = 0 otherwise. The left-hand side of Eq. (1) represents the net current flow out of vertex i along edges of the network, and the right-hand side represents the source and sink.
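Kirchhoff's equations can be solved numerically for each source/sink pair. The sketch below is our own illustrative code, not the authors': following the reduction described in the next paragraphs, it grounds one vertex (here the sink t, so Vt = 0), builds the resulting reduced, nonsingular system, and solves it by exact Gaussian elimination on rationals, returning the absolute current |Vi − Vj| on each edge. Summing these over all source/sink pairs gives the current-flow betweenness.

```python
from fractions import Fraction

def currents(adj, s, t):
    """Edge currents for a unit current injected at source s and removed
    at sink t, with a unit resistor on every edge.  The graph is given as
    {vertex: set_of_neighbors}.  Removing the sink's row and column from
    the Laplacian D - A fixes the reference potential V_t = 0 and also
    removes the -1 sink entry from the source vector."""
    nodes = [v for v in adj if v != t]          # drop the reference row/column
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    L = [[Fraction(0)] * n for _ in range(n)]   # reduced Laplacian D - A
    for v in nodes:
        L[idx[v]][idx[v]] = Fraction(len(adj[v]))   # full degree k_v
        for u in adj[v]:
            if u != t:
                L[idx[v]][idx[u]] -= 1
    b = [Fraction(0)] * n
    b[idx[s]] = Fraction(1)                     # unit current in at s
    # Gauss-Jordan elimination with pivoting, exact in rational arithmetic
    for col in range(n):
        piv = next(r for r in range(col, n) if L[r][col] != 0)
        L[col], L[piv] = L[piv], L[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col and L[r][col] != 0:
                f = L[r][col] / L[col][col]
                L[r] = [x - f * y for x, y in zip(L[r], L[col])]
                b[r] -= f * b[col]
    V = {v: b[idx[v]] / L[idx[v]][idx[v]] for v in nodes}
    V[t] = Fraction(0)
    # |V_i - V_j| on each edge; both orientations collapse to one key
    return {frozenset((u, v)): abs(V[u] - V[v]) for u in adj for v in adj[u]}
```

For the square network s-a, s-b, a-t, b-t the current splits evenly, carrying 1/2 along every edge, and on a simple path every edge carries the full unit current.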
Defining ki = Σj Aij, which is the vertex degree, and creating a diagonal matrix D with these degrees on the diagonal, Dii = ki, this equation can be written in matrix form as (D − A)·V = s, where the source vector s has components

$$s_i = \begin{cases} +1 & \text{for } i = s, \\ -1 & \text{for } i = t, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

We cannot directly invert the matrix D − A to get the voltage vector V, because the matrix (which is just the graph Laplacian) is singular. This is equivalent to saying that there is one undetermined degree of freedom corresponding to the choice of reference potential for measuring the voltages. We can add any constant to a solution for the vertex voltages and get another solution—only the voltage differences matter. In choosing the reference potential, we fix this degree of freedom, leaving only n − 1 more to be determined. In mathematical terms, once any n − 1 of the equations in our matrix formulation are satisfied, the remaining one is also automatically satisfied so long as current is conserved in the network as a whole, i.e., so long as Σi si = 0, which is clearly true in this case. Choosing any vertex v to be the reference point, therefore, we remove the row and column corresponding to
that vertex from D and A before inverting. Denoting the resulting (n − 1) × (n − 1) matrices D_v and A_v, we can then write

V = (D_v − A_v)^{−1} · s.   (3)

Calculation of the currents in the network thus involves inverting D_v − A_v once for any convenient choice of v, and taking the differences of pairs of columns to get the voltage vector V for each possible source/sink pair. (The voltage for the one missing vertex v is always zero, by hypothesis.) The absolute magnitudes of the differences of voltages along each edge give us betweenness scores for the given source and sink. Summing over all sources and sinks, we then get our complete betweenness score.

The matrix inversion takes time O(n^3) in the worst case, while the subsequent calculation of betweennesses takes time O(mn^2), where as before m is the number of edges and n the number of vertices in the graph. Thus, the entire community structure algorithm, including the recalculation step, will take O((n + m)mn^2) time to complete, or O(n^4) on a sparse graph. Although, as we will see, the algorithm is good at finding community structure, this poor performance makes it practical only for smaller graphs; a few hundred vertices is the most that we have been able to do. It is for this reason that we recommend using the shortest-path betweenness algorithm in most cases, which gives results about as good or better with considerably less effort.

C. Random walks

The random-walk betweenness described in Sec. II requires us to calculate how often on average random walks starting at vertex s will pass down a particular edge from vertex v to vertex w (or vice versa) before finding their way to a given target vertex t. To calculate this quantity we proceed as follows for each separate component of the graph.

As before, let A_ij be an element of the adjacency matrix such that A_ij = 1 if vertices i and j are connected by an edge and A_ij = 0 otherwise.
Consider a random walk that on each step decides uniformly between the neighbors of the current vertex j and takes a step to one of them. The number of neighbors is just the degree of the vertex k_j = Σ_i A_ij, and the probability for the transition from j to i is A_ij/k_j, which we can regard as an element of the matrix M = A · D^{−1}, where D is the diagonal matrix with D_ii = k_i.

We are interested in walks that terminate when they reach the target t, so that t is an absorbing state. The most convenient way to represent this is just to remove entirely the vertex t from the graph, so that no walk ever reaches any other vertex from t. Thus let M_t = A_t · D_t^{−1} be the matrix M with the tth row and column removed (and similarly for A_t and D_t).

Now the probability that a walk starts at s, takes n steps, and ends up at some other vertex i (not t), is given by the (i, s) element of M_t^n, which we denote [M_t^n]_{is}. In particular, walks end up at v and w with probabilities [M_t^n]_{vs} and [M_t^n]_{ws}, and of those a fraction 1/k_v and 1/k_w respectively then pass along the edge (v, w) in one direction or the other. (Note that they may also have passed along this edge an arbitrary number of times before reaching this point.) Summing over all n, the mean number of times that a walk of any length traverses the edge from v to w is k_v^{−1}[(I − M_t)^{−1}]_{vs}, and similarly for walks that go from w to v.

To highlight the similarity with the current-flow betweenness of Sec. III B, let us denote these two numbers V_v and V_w respectively. Then we can write

V = D_t^{−1} · (I − M_t)^{−1} · s = (D_t − A_t)^{−1} · s,   (4)

where the source vector s is the vector whose components are all 0 except for a single 1 in the position corresponding to the source vertex s.

Now we define our random-walk betweenness for the edge (v, w) to be the absolute value of the difference of the two probabilities V_v and V_w, i.e., the net number of times the walk passes along the edge in one direction.
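As an illustrative numerical check, the absorbing-walk calculation above can be carried out on a small graph. The sketch below (Python with NumPy; the example graph and vertex labels are our own choices for illustration, not taken from the paper) builds M_t = A_t · D_t^{−1} for an absorbing target t, sums the walk series via (I − M_t)^{−1}, and confirms that the walk-based vector agrees with the resistor-network form (D_t − A_t)^{−1} · s of Eq. (4).

```python
import numpy as np

# A small undirected example graph (a 4-cycle with one chord); this
# particular graph is an arbitrary choice for illustration.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
k = A.sum(axis=0)                      # vertex degrees k_i
t = 3                                  # absorbing target vertex
keep = [i for i in range(len(A)) if i != t]

At = A[np.ix_(keep, keep)]             # A with row and column t removed
Dt = np.diag(k[keep])                  # D with row and column t removed
Mt = At @ np.linalg.inv(Dt)            # transition matrix M_t = A_t . D_t^{-1}

s_vec = np.zeros(len(keep))
s_vec[0] = 1.0                         # walks start at vertex s = 0

# Sum over all walk lengths: sum_n M_t^n = (I - M_t)^{-1}.  Dividing by the
# degree gives the mean number of traversals of each incident edge.
V_walk = np.linalg.inv(Dt) @ np.linalg.inv(np.eye(len(keep)) - Mt) @ s_vec

# Eq. (4): the same vector obtained directly from the resistor-network form.
V_ohm = np.linalg.inv(Dt - At) @ s_vec

# Random-walk betweenness contribution of edge (0, 1) for this s and t:
rw_edge_01 = abs(V_walk[0] - V_walk[1])
```

Summing |V_v − V_w| over all source/target pairs then gives the full betweenness score for each edge, exactly as in the current-flow case.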
This seems a natural definition—it makes little sense to accord an edge high betweenness simply because a walk went back and forth along it many times. It is the difference between the numbers of times the edge is traversed in either direction that matters [48].

But now we see that this method is very similar to the resistor network calculation of Sec. III B. In that calculation we also evaluated (D_t − A_t)^{−1} · s for a suitable source vector and then took differences of the resulting numbers. The only difference is that in the current-flow calculation we had a sink term in s as well as a source. Purely for the purposes of mathematical convenience, we can add such a sink in the present case at the target vertex t—this makes no difference to the solution for V since the tth row has been removed from the equations anyway. By doing this, however, we turn the equations into precisely the form of the current-flow calculation, and hence it becomes clear that the two measures are numerically identical, although their derivation is quite different. (It also immediately follows that we can remove any row or column and still get the same answer—it doesn't have to be row and column t, although physically this choice makes the most sense.)

IV. QUANTIFYING THE STRENGTH OF COMMUNITY STRUCTURE

As we show in Sec. V, our community structure algorithms do an excellent job of recovering known communities both in artificially generated random networks and in real-world examples. However, in practical situations the algorithms will normally be used on networks for which the communities are not known ahead of time. This raises a new problem: how do we know when the communities found by the algorithm are good ones? Our algorithms always produce some division of the network into communities, even in completely random networks that have no meaningful community structure, so it would be useful to have some way of saying how good the structure found is. Furthermore, the algorithms' output is in the form of a dendrogram which represents an entire nested hierarchy of possible community divisions for the network. We would like to know which of these divisions are the best ones for a given network—where we should cut the dendrogram to get a sensible division of the network.

To answer these questions we now define a measure of the quality of a particular division of a network, which we call the modularity. This measure is based on a previous measure of assortative mixing proposed by Newman [33]. Consider a particular division of a network into k communities. Let us define a k × k symmetric matrix e whose element e_ij is the fraction of all edges in the network that link vertices in community i to vertices in community j [49]. (Here we consider all edges in the original network—even after edges have been removed by the community structure algorithm our modularity measure is calculated using the full network.) The trace of this matrix Tr e = Σ_i e_ii gives the fraction of edges in the network that connect vertices in the same community, and clearly a good division into communities should have a high value of this trace. The trace on its own, however, is not a good indicator of the quality of the division since, for example, placing all vertices in a single community would give the maximal value of Tr e = 1 while giving no information about community structure at all.

So we further define the row (or column) sums a_i = Σ_j e_ij, which represent the fraction of edges that connect to vertices in community i. In a network in which edges fall between vertices without regard for the communities they belong to, we would have e_ij = a_i a_j.
Thus we can define a modularity measure by

Q = Σ_i (e_ii − a_i^2) = Tr e − ‖e^2‖,   (5)

where ‖x‖ indicates the sum of the elements of the matrix x. This quantity measures the fraction of the edges in the network that connect vertices of the same type (i.e., within-community edges) minus the expected value of the same quantity in a network with the same community divisions but random connections between the vertices. If the number of within-community edges is no better than random, we will get Q = 0. Values approaching Q = 1, which is the maximum, indicate strong community structure [50]. In practice, values for such networks typically fall in the range from about 0.3 to 0.7. Higher values are rare.

The expected error on Q can be calculated by treating each edge in the network as an independent measurement of the contributions to the elements of the matrix e. A simple jackknife procedure works well [33, 34].

Typically, we will calculate Q for each split of a network into communities as we move down the dendrogram, and look for local peaks in its value, which indicate particularly satisfactory splits. Usually we find that there are only one or two such peaks and, as we will show in the next section, in cases where the community structure is known beforehand by some means we find that the positions of these peaks correspond closely to the expected divisions. The height of a peak is a measure of the strength of the community division.

V. APPLICATIONS

In this section we give a number of applications of our algorithms to particular problems, illustrating their operation, and their use in understanding the structure of complex networks.

A. Tests on computer-generated networks

First, as a controlled test of how well our algorithms perform, we have generated networks with known community structure, to see if the algorithms can recognize and extract this structure.
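Eq. (5) is straightforward to evaluate directly from an edge list and a community assignment. The following sketch (Python with NumPy; the helper name and the two-triangle test graph are our own, for illustration only) builds the k × k matrix e, splitting each edge's 1/m weight between e_ij and e_ji so that the elements of e sum to one, and returns Q.

```python
import numpy as np

def modularity(edges, comm, k):
    """Modularity Q of Eq. (5) for a division into k communities.

    edges: list of (u, v) vertex pairs; comm: sequence mapping each vertex
    to its community label 0..k-1.  Each edge contributes half of its 1/m
    weight to e_ij and half to e_ji, so the elements of e sum to one.
    """
    e = np.zeros((k, k))
    m = len(edges)
    for u, v in edges:
        e[comm[u], comm[v]] += 0.5 / m
        e[comm[v], comm[u]] += 0.5 / m
    a = e.sum(axis=1)                      # row sums a_i
    # Since e is symmetric, ||e^2|| (the sum of elements of e^2) equals
    # the sum of a_i^2, so Q = Tr e - ||e^2||:
    return np.trace(e) - np.sum(a ** 2)

# Two disjoint triangles: a perfect two-community division.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
q_split = modularity(triangles, [0, 0, 0, 1, 1, 1], k=2)
q_lumped = modularity(triangles, [0, 0, 0, 0, 0, 0], k=1)
```

The lumped division illustrates the point made above: placing all vertices in one community maximizes the trace (Tr e = 1) yet yields Q = 0, while the correct split gives Q = 0.5.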
We have generated a large number of graphs with n = 128 vertices, divided into four communities of 32 vertices each. Edges were placed independently at random between vertex pairs with probability p_in for an edge to fall between vertices in the same community and p_out to fall between vertices in different communities. The values of p_in and p_out were chosen to make the expected degree of each vertex equal to 16. In Fig. 6 we show a typical dendrogram from the analysis of such a graph using the shortest-path betweenness version of our algorithm. (In fact, for the sake of clarity, the figure is for a 64-node version of the graph.) Results for the random walk version are similar. At the top of the figure we also show the modularity, Eq. (5), for the same calculation, plotted as a function of position in the dendrogram. That is, the graph is aligned with the dendrogram so that one can read off modularity values for different divisions of the network directly. As we can see, the modularity has a single clear peak at the point where the network breaks into four communities, as we would expect. The peak value is around 0.5, which is typical.

In Fig. 7 we show the fraction of vertices in our computer-generated network sample classified correctly into the four communities by our algorithms, as a function of the mean number z_out of edges from each vertex to vertices in other communities. As the figure shows, both the shortest-path and random-walk versions of the algorithm perform excellently, with more than 90% of all vertices classified correctly from z_out = 0 all the way to around z_out = 6. Only for z_out ≳ 6 does the classification begin to deteriorate markedly. In other words, our algorithm correctly identifies the community structure in the network almost all the way to the point z_out = 8 at
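Test graphs of this kind can be generated in a few lines. This sketch (Python with NumPy; the function name and seeded generator are our own choices) places each possible edge independently, with p_in = z_in/31 inside a community of 32 vertices and p_out = z_out/96 between communities, so that the expected degree is z_in + z_out = 16.

```python
import numpy as np

def planted_partition(z_out, n_groups=4, group_size=32, mean_degree=16.0, seed=0):
    """Random graph with built-in community structure: n_groups communities
    of group_size vertices each, with expected within-community degree
    z_in = mean_degree - z_out and expected between-community degree z_out."""
    rng = np.random.default_rng(seed)
    n = n_groups * group_size
    p_in = (mean_degree - z_out) / (group_size - 1)
    p_out = z_out / (n - group_size)
    comm = np.repeat(np.arange(n_groups), group_size)    # planted labels
    same = comm[:, None] == comm[None, :]                # same-community mask
    p = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < p, k=1)         # independent edges, i < j
    A = (upper | upper.T).astype(int)                    # symmetrize; zero diagonal
    return A, comm

A, comm = planted_partition(z_out=2.0)
```

Feeding such graphs to the algorithms and comparing the recovered communities against the planted labels `comm` gives the classification-accuracy curves of Fig. 7.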
which each vertex has on average the same number of connections to vertices outside its community as it does to those inside.

The shortest-path version of the algorithm does however perform noticeably better than the random-walk version, especially for the more difficult cases where z_out is large. Given that the random-walk algorithm is also more computationally demanding, there seems little reason to use it rather than the shortest-path algorithm, and hence, as discussed previously, we recommend the latter for most applications. (To be fair, the random-walk algorithm does slightly out-perform the shortest-path algorithm in the example addressed in the following section, although, being only a single case, it is hard to know whether this is significant.)

FIG. 6: Plot of the modularity and dendrogram for a 64-vertex random community-structured graph generated as described in the text with, in this case, z_in = 6 and z_out = 2. The shapes on the right denote the four communities in the graph and, as we can see, the peak in the modularity (dotted line) corresponds to a perfect identification of the communities.

FIG. 7: The fraction of vertices correctly identified by our algorithms in the computer-generated graphs described in the text. The two curves show results for the edge betweenness (circles) and random walk (squares) versions of the algorithm as a function of the number of edges vertices have to others outside their own community. The point z_out = 8 at the rightmost edge of the plot represents the point at which the graphs—in this example—have as many connections outside their own community as inside it. Each point is an average over 100 graphs.

B. Zachary's karate club network

We now turn to applications of our methods to real-world network data.
Our first such example is taken from one of the classic studies in social network analysis. Over the course of two years in the early 1970s, Wayne Zachary observed social interactions between the members of a karate club at an American university [35]. He constructed networks of ties between members of the club based on their social interactions both within the club and away from it. By chance, a dispute arose during the course of his study between the club's administrator and its principal karate teacher over whether to raise club fees, and as a result the club eventually split in two, forming two smaller clubs, centered around the administrator and the teacher.

FIG. 8: The network of friendships between individuals in the karate club study of Zachary [35]. The administrator and the instructor are represented by nodes 1 and 33 respectively. Shaded squares represent individuals who ended up aligning with the club's administrator after the fission of the club, open circles those who aligned with the instructor.

In Fig. 8 we show a consensus network structure extracted from Zachary's observations before the split. Feeding this network into our algorithms we find the results shown in Fig. 9. In the left-most two panels we show the dendrograms generated by the shortest-path and random-walk versions of our algorithm, along with the modularity measures for the same. As we see, both algorithms give reasonably high values for the modularity when the network is split into two communities—around 0.4 in each case—indicating that there is a strong natural division at this level. What's more, the divisions in question correspond almost perfectly to the actual divisions in the club revealed by which group each club member joined after the club split up. (The shapes of the vertices representing the two factions are the same as those of Fig. 8.) Only one vertex, vertex 3, is misclassified by the shortest-path version of the method, and none are misclassified by the random-walk version—the latter gets a perfect score on this test. (On the other hand, the two-community split fails to produce a local maximum in the modularity for the random-walk method, unlike the shortest-path method for which there is a local maximum precisely at this point.) In the last panel of Fig.
9 we show the dendrogram and modularity for an algorithm based on shortest-path betweenness but without the crucial recalculation step discussed in Sec. II. As the figure shows, without this step, the algorithm fails to find the division of the network into the two known groups. Furthermore, the modularity doesn’t reach nearly such high values as in the first two panels, indicating that the divisions suggested are much poorer than in the cases with the recalculation. C. Collaboration network For our next example, we look at a collaboration network of scientists. Figure 10a shows the largest component of a network of collaborations between physicists who conduct research on networks. (The authors of the present paper, for instance, are among the nodes in this network.) This network (which appeared previously in Ref. 36) was constructed by taking names of authors appearing in the lengthy bibliography of Ref. 4 and cross-referencing with the Physics E-print Archive at arxiv.org, specifically the condensed matter section of the archive where, for historical reasons, most papers on networks have appeared. Authors appearing in both were added to the network as vertices, and edges between them indicate coauthorship of one or more papers appearing in the archive. Thus the collaborative ties represented in the figure are not limited to papers on topics concerning networks—we were interested primarily in whether people know one another, and collaboration on any topic is a reasonable indicator of acquaintance. The network as presented in Fig. 10a is difficult to interpret. Given the names of the scientists, a knowledgeable reader with too much time on their hands could, no doubt, pick out known groupings, for instance at particular institutions, from the general confusion. But were this a network about which we had no a priori knowledge, we would be hard pressed to understand its underlying structure. 
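For networks like these, the machinery of the preceding sections reduces to a short loop: compute shortest-path betweenness for every edge, remove the highest-scoring edge, recalculate, and record the splits. The sketch below is our own minimal illustration of that loop, not the authors' code (Python; the Brandes-style BFS accumulation and the toy graph of two triangles joined by one bridge are our choices), shown for a single removal step.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness for each edge of an unweighted graph,
    counting all shortest paths between every vertex pair once."""
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                       # BFS, counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):          # accumulate edge dependencies
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                eb[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2 for e, b in eb.items()}   # each pair was seen from both ends

def components(adj):
    """Connected components by BFS."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w); comp.add(w); queue.append(w)
        comps.append(comp)
    return comps

# Two triangles joined by the bridge (2, 3): our illustrative test graph.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

eb = edge_betweenness(adj)
bridge = max(eb, key=eb.get)               # the inter-community bridge
u, v = tuple(bridge)
adj[u].discard(v)
adj[v].discard(u)
parts = components(adj)                    # the two triangles
```

On this toy graph a single removal suffices; on larger networks the betweenness must be recalculated after every removal, as stressed above, with the sequence of splits recorded as a dendrogram.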
Applying the shortest-path version of our algorithm to this network we find that the modularity, Eq. (5), has a strong peak at 13 communities with a value of Q = 0.72 ± 0.02. Extracting the communities from the corresponding dendrogram, we have indicated them with colors in Fig. 10b. The knowledgeable reader will again be able to discern known groups of scientists in this rendering, and more easily now with the help of the colors. Still, however, the structure of the network as a whole and that of the interactions between groups is quite unclear.

In Fig. 10c we have reduced the network to only the groups. In this panel, we have drawn each group as a circle, with size varying roughly with the number of individuals in the group. The lines between groups indicate collaborations between group members, with the thickness of the lines varying in proportion to the number of pairs of scientists who have collaborated. Now the overall structure of the network becomes easy to see. The network is centered around the middle group shown in cyan (which consists of researchers primarily in southern Europe), with a knot of inter-community collaborations going on between the groups on the lower right of the picture (mostly Boston University physicists and their intellectual descendants). Other groups (including the authors' own) are arranged in various attitudes further