Parallel modularity computation for directed weighted graphs with overlapping communities 1

Abstract. The paper presents new versions of the modularity measure for directed weighted graphs with overlapping communities. We consider several approaches to computing modularity and try to extend them. Taking computational complexity into account, we suggest two parallelized extensions that scale to large graphs (more than $10^4$ nodes).


Introduction
The motivation of our research into modularity computation was the need to quantitatively assess and compare the quality of various clustering algorithms applied to mobile call graphs. Since no such graphs with ground-truth community structure were found, we could not use the most popular quality metric, which is based on Normalized Mutual Information (NMI). To evaluate the quality of community detection methods on graphs with unknown reference communities, metrics based on probabilistic models are used. Such metrics include modularity, surprise, significance [19], and ER-modularity [5]. Generative models from model-based community detection methods can also be used to estimate the likelihood of a clustered graph [15,11]. The modularity value characterizes the strength of a particular clustering of a graph. It is high when clusters are dense and sparsely connected to each other, and low when clusters are formed at random. Besides the evaluation of a community cover, modularity is also used as the objective function in some community detection algorithms [16,18]. In [12] modularity is also used for graph partitioning, but only for the case of two communities. Here we consider the modularity metric, its existing extensions for directed and weighted graphs, and its extensions for the case of overlapping communities. Then we describe our extensions of modularity for overlapping communities in directed weighted graphs.
1 This research was carried out in collaboration with and supported by the Russian Research Center, Huawei Technologies.

Notation
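In this paper we use the following notation, most of which is common in graph theory.
$G = (V, E)$: graph with nodes $V$ and edges $E$; nodes $i, j \in V$, edge $(i, j) \in E$;
$A$: adjacency matrix of graph $G$; $A_{ij}$: an element of $A$;
$w_{ij}$: weight of edge $(i, j)$;
$k_i$: degree of node $i$;
$C$: set of communities on graph $G$; $c \in C$: a particular community;
$C_i$: set of communities node $i$ belongs to;
$\bar{s}$: average community size in graph $G$, $\bar{s} = \frac{1}{|C|} \sum_{c \in C} |c|$;
$\overline{s^2}$: average square community size in graph $G$, $\overline{s^2} = \frac{1}{|C|} \sum_{c \in C} |c|^2$.
Sizes of the corresponding sets are denoted by $|V|$, $|E|$, $|C|$ and $|c|$.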

Existing versions of modularity
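Modularity was defined by Newman and Girvan [3] to measure the quality of a partition of a graph into a set of clusters. It is the fraction of edges within the clusters minus the expected value of that fraction in a randomly connected graph with the same nodes and degrees. Modularity was originally defined for undirected unweighted graphs and is given by:
$$Q = \sum_{c \in C} \left[ \frac{|E_c^{in}|}{|E|} - \left( \frac{2|E_c^{in}| + |E_c^{out}|}{2|E|} \right)^2 \right] \tag{1}$$
where $|E_c^{in}|$ is the number of edges between nodes within community $c$ and $|E_c^{out}|$ is the number of edges from the nodes in community $c$ to the nodes outside $c$. Modularity can equivalently be expressed via the adjacency matrix $A_{ij}$ and node degrees $k_i$:
$$Q = \frac{1}{2|E|} \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta(c_i, c_j) \tag{2}$$
where $c_i$ is the community of node $i$, and $\delta(c_i, c_j)$ equals 1 if $c_i = c_j$ and 0 otherwise. There are three main directions of extension of the original modularity definition: to directed graphs, to weighted graphs, and to the case of overlapping communities.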
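As an illustration of the adjacency-matrix form (2), the following minimal Python sketch computes modularity for a small undirected unweighted graph. The function name `modularity`, the edge-list input and the example graph are assumptions made for this illustration only; the sketch favours readability over speed and iterates over all node pairs.

```python
from itertools import product


def modularity(edges, community_of):
    """Modularity (2) of a partition of an undirected, unweighted graph.

    edges        -- iterable of (i, j) pairs, each undirected edge listed once
    community_of -- dict mapping every node to its community label
    """
    nodes = list(community_of)
    adjacency = {}                      # A_ij, stored sparsely
    degree = dict.fromkeys(nodes, 0)    # k_i
    m = 0                               # number of edges
    for i, j in edges:
        adjacency[i, j] = adjacency.get((i, j), 0) + 1
        adjacency[j, i] = adjacency.get((j, i), 0) + 1
        degree[i] += 1
        degree[j] += 1
        m += 1
    total = 0.0
    for i, j in product(nodes, repeat=2):
        if community_of[i] == community_of[j]:      # delta(c_i, c_j)
            total += adjacency.get((i, j), 0) - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


# Hypothetical example: two triangles joined by a single edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TRIANGLES = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_CLUSTER = dict.fromkeys(range(6), "all")
```

For this example, splitting the graph into its two triangles gives 5/14 (about 0.357), while putting all nodes into one cluster gives zero, since the observed and expected fractions of internal edges then coincide.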

Modularity for directed and weighted graphs
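The extension of modularity (2) to directed graphs is rather straightforward [7]:
$$Q = \frac{1}{|E|} \sum_{i,j} \left[ A_{ij} - \frac{k_i^{out} k_j^{in}}{|E|} \right] \delta(c_i, c_j) \tag{3}$$
where $k_i^{out}$ is the out-degree of node $i$ and $k_j^{in}$ is the in-degree of node $j$.
Modularity (2) is easily generalized to weighted graphs as well [2]:
$$Q = \frac{1}{2W} \sum_{i,j} \left[ w_{ij} - \frac{s_i s_j}{2W} \right] \delta(c_i, c_j) \tag{4}$$
where $w_{ij}$ is the weight of edge $(i, j)$, $s_i = \sum_j w_{ij}$ is the sum of the weights of all edges of node $i$, and $W = \frac{1}{2} \sum_{i,j} w_{ij}$ is the total weight of all edges. Moreover, modularity formula (2) for graphs that are both weighted and directed can be written as [6]:
$$Q = \frac{1}{W} \sum_{i,j} \left[ w_{ij} - \frac{s_i^{out} s_j^{in}}{W} \right] \delta(c_i, c_j) \tag{5}$$
where $s_i^{out}$ and $s_j^{in}$ are the total weights of the outgoing edges of node $i$ and the incoming edges of node $j$, and $W = \sum_{i,j} w_{ij}$. Finally, modularity based on LinkRank was suggested for weighted directed graphs [9]:
$$Q = \sum_{i,j} \left[ L_{ij} - \pi_i \pi_j \right] \delta(c_i, c_j), \qquad L_{ij} = \pi_i G_{ij} \tag{6}$$
LinkRank is an analogue of PageRank [14] for links. PageRank $\pi_i$ is the probability of a particular page (node) $i$ being visited by a random surfer and can be defined as the stationary row vector of the Google matrix $G$: $\pi = \pi G$. For directed graphs the Google matrix is
$$G_{ij} = \alpha \frac{w_{ij}}{s_i^{out}} + \frac{1}{|V|} (\alpha a_i + 1 - \alpha)$$
where $\alpha$ is the damping parameter of PageRank (with probability $1 - \alpha$ the random surfer jumps to a random node) and $a_i$ is the indicator of a dangling node: $a_i = 1$ if $s_i^{out} = 0$ and $a_i = 0$ otherwise (for dangling nodes the first term is taken to be zero). This formula originates from an alternative notion of a community as a group of nodes where a random surfer spends more time on average. More technically, this definition of modularity is the deviation between the fraction of time a random walker spends within communities and the expected value of that time.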
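The two directed weighted variants can be illustrated with a short pure-Python sketch. It assumes the usual forms of these measures: for (5), the observed weight $w_{ij}$ minus the expected weight $s_i^{out} s_j^{in} / W$; for (6), LinkRank $L_{ij} = \pi_i G_{ij}$ minus $\pi_i \pi_j$, with PageRank obtained by plain power iteration. All function names, the fixed iteration count and the example graph are assumptions for illustration, not the implementation used in the experiments.

```python
def directed_weighted_modularity(weights, community_of):
    """Modularity in the form (5); weights maps (i, j) to the weight of edge i -> j."""
    s_out = dict.fromkeys(community_of, 0.0)
    s_in = dict.fromkeys(community_of, 0.0)
    for (i, j), w in weights.items():
        s_out[i] += w
        s_in[j] += w
    total_weight = sum(weights.values())
    q = 0.0
    for i in community_of:
        for j in community_of:
            if community_of[i] == community_of[j]:
                q += weights.get((i, j), 0.0) - s_out[i] * s_in[j] / total_weight
    return q / total_weight


def google_matrix(weights, nodes, alpha=0.85):
    """Row-stochastic Google matrix; dangling nodes jump uniformly at random."""
    n = len(nodes)
    s_out = {i: sum(weights.get((i, j), 0.0) for j in nodes) for i in nodes}
    g = {}
    for i in nodes:
        dangling = 1.0 if s_out[i] == 0 else 0.0
        teleport = (alpha * dangling + 1.0 - alpha) / n
        for j in nodes:
            walk = alpha * weights.get((i, j), 0.0) / s_out[i] if s_out[i] else 0.0
            g[i, j] = walk + teleport
    return g


def pagerank(g, nodes, iterations=200):
    """Stationary row vector of g by power iteration (fixed number of steps)."""
    pi = {i: 1.0 / len(nodes) for i in nodes}
    for _ in range(iterations):
        pi = {j: sum(pi[i] * g[i, j] for i in nodes) for j in nodes}
    return pi


def linkrank_modularity(weights, community_of, alpha=0.85):
    """Modularity in the form (6): sum of L_ij - pi_i * pi_j over same-community pairs."""
    nodes = list(community_of)
    g = google_matrix(weights, nodes, alpha)
    pi = pagerank(g, nodes)
    return sum(pi[i] * g[i, j] - pi[i] * pi[j]
               for i in nodes for j in nodes
               if community_of[i] == community_of[j])


# Hypothetical example: two directed 3-cycles joined by the edge 2 -> 3.
WEIGHTS = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0,
           (3, 4): 1.0, (4, 5): 1.0, (5, 3): 1.0, (2, 3): 1.0}
CYCLES = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
```

Both functions return zero when all nodes are placed in a single community, which is a convenient sanity check: the observed and expected quantities then sum to the same total.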

Overlapping modularity
In the case when a node can belong to several communities, belonging coefficients $a_{ic}$ are introduced [8], which indicate how strongly node $i$ belongs to community $c$. These coefficients are non-negative and sum to one: $\sum_{c \in C} a_{ic} = 1$ for every node $i$. This relates to another extension of the community detection problem, called fuzzy community detection [13]. To generalize different approaches to using belonging coefficients, a belonging function $f(a_{ic}, a_{jc'})$ can be defined [17] to characterize the extent to which an edge $(i, j)$ connects communities $c$ and $c'$ respectively. According to this, several approaches to overlapping modularity from the literature can be generalized to two definitions, (7) and (8) [17]. The belonging coefficient can take one of the forms (9), one of which relies on the number of maximal cliques containing edge $(i, j)$ and the number of maximal cliques containing edge $(i, j)$ inside community $c$. The belonging function can take one of the forms (10).
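A minimal Python sketch of these two ingredients follows. It assumes the uniform belonging coefficient (a node in $k$ communities belongs to each with coefficient $1/k$) and three belonging functions that are common in the literature (product, average, maximum); the exact lists (9) and (10) are not reproduced here, so these choices and all names are illustrative assumptions.

```python
def uniform_coefficients(communities):
    """Uniform belonging coefficients a_ic.

    communities -- dict mapping a community label to the set of its nodes
    returns     -- dict mapping (node, label) to 1 / (number of communities of node)
    """
    membership = {}
    for label, members in communities.items():
        for node in members:
            membership.setdefault(node, set()).add(label)
    return {(node, label): 1.0 / len(labels)
            for node, labels in membership.items()
            for label in labels}


# Assumed examples of belonging functions f(a_ic, a_jc).
BELONGING_FUNCTIONS = {
    "product": lambda a, b: a * b,
    "average": lambda a, b: (a + b) / 2,
    "max": max,
}

# Hypothetical cover in which node 3 belongs to both communities.
COVER = {"a": {1, 2, 3}, "b": {3, 4}}
COEFFICIENTS = uniform_coefficients(COVER)
```

By construction the coefficients of every node sum to one, so node 3 above receives 0.5 for each of its two communities and all other nodes receive 1.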

Further extensions of modularity
Besides the node-based extensions, an edge-based extension was suggested [10] for directed graphs, given by (11). Here the edge belonging function can be any of (10), but the authors suggested the variant (12), together with an empirically found expression for its inner function. It is worth noting that, due to the form of the belonging functions, the inner sums over pairs of nodes effectively run only over nodes from community $c$, not over the whole set $V$.
The authors of [17] suggested a density-based version of modularity (1) for overlapping directed graphs, given by (13).

Drawbacks and limitations
The first obvious drawback is that no modularity formula was found that combines all three required properties: support for directed graphs, for weighted graphs, and for overlapping communities. The second limitation is computational complexity. The aforementioned formulas of overlapping modularity are not acceptable for large graphs (measured by the number of nodes within the community cover) due to their high computational complexity.
Expressed in terms of the number of communities, the average community size and the number of nodes, the time complexities of both the density-based formula (13) and the edge-based formula (11) are high; see subsection 4.1 for more details. It is also worth noting that the LinkRank authors [9] provide some evidence that modularity (5) cannot distinguish the direction of links.

Our extensions of modularity
Since we focus on modularity for directed weighted graphs with overlapping communities, we have two possible ways of extension: make overlapping (directed) modularities support weights, or extend directed weighted modularities to the overlapping case. The first approach suggests a naive substitution of the matrix of weights for the adjacency matrix of the graph and of the sum of edge weights for the number of edges. Doing so with the density formula (13) breaks normalization: modularity values start to exceed the admissible range. We will nevertheless still use it in experiments with unweighted graphs. On the other hand, the edge-based formula (11) seems to allow such a generalization, yielding formula (14), but it is still computationally expensive.
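The second approach consists in introducing belonging coefficients (9) and belonging functions (10) into the simple version (5):
$$Q^{S} = \frac{1}{W} \sum_{c \in C} \sum_{i,j \in c} \left[ w_{ij} - \frac{s_i^{out} s_j^{in}}{W} \right] f(a_{ic}, a_{jc}) \tag{15}$$
and into the LinkRank-based version of modularity (6):
$$Q^{LR} = \sum_{c \in C} \sum_{i,j \in c} \left[ L_{ij} - \pi_i \pi_j \right] f(a_{ic}, a_{jc}) \tag{16}$$
where $w_{ij}$ is the weight of edge $(i, j)$, $s_i^{out}$ and $s_j^{in}$ are the total outgoing and incoming edge weights of nodes $i$ and $j$, $W$ is the total weight of all edges, $\pi_i$ is the PageRank of node $i$, and $L_{ij}$ is the LinkRank of link $(i, j)$. Since PageRank (and hence LinkRank) has fast implementations [1,4], these two formulas have much lower computational complexity. We also suggest using in these formulas a normalization coefficient instead of the belonging function:
$$f_{ij} = \frac{1}{|C_i \cap C_j|} \tag{17}$$
where $C_i$ is the set of communities node $i$ belongs to. The intuition is the following. If both nodes belong to the same $k$ communities, the corresponding term occurs $k$ times in the modularity formula, once for each community, so we weight it by a factor of $1/k$. It is easy to see that otherwise modularity can become unbounded: suppose that each community is actually two identical communities counted separately; then the modularity value doubles.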
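The effect of the normalization coefficient can be checked with a small Python sketch. It assumes our reading of the construction: the directed weighted modularity (5) is summed over all node pairs inside each community, and every pair is weighted by one over the number of communities the two nodes share. The function name, the data layout and the example are illustrative assumptions.

```python
def overlapping_modularity(weights, communities):
    """Simple overlapping modularity with the intersection normalization.

    weights     -- dict mapping (i, j) to the weight of the directed edge i -> j
    communities -- dict mapping a community label to the set of its nodes
    """
    membership = {}                       # C_i: labels of communities containing i
    for label, members in communities.items():
        for node in members:
            membership.setdefault(node, set()).add(label)
    s_out, s_in = {}, {}
    for (i, j), w in weights.items():
        s_out[i] = s_out.get(i, 0.0) + w
        s_in[j] = s_in.get(j, 0.0) + w
    total_weight = sum(weights.values())
    q = 0.0
    for members in communities.values():
        for i in members:
            for j in members:
                shared = len(membership[i] & membership[j])   # |C_i intersect C_j| >= 1
                expected = s_out.get(i, 0.0) * s_in.get(j, 0.0) / total_weight
                q += (weights.get((i, j), 0.0) - expected) / shared
    return q / total_weight


# Hypothetical example: two directed 3-cycles joined by the edge 2 -> 3.
WEIGHTS = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0,
           (3, 4): 1.0, (4, 5): 1.0, (5, 3): 1.0, (2, 3): 1.0}
COVER = {"a": {0, 1, 2}, "b": {3, 4, 5}}
# The same cover with every community listed twice under different labels.
DUPLICATED = {"a": {0, 1, 2}, "a2": {0, 1, 2}, "b": {3, 4, 5}, "b2": {3, 4, 5}}
```

On a non-overlapping cover the sketch reduces to (5), and listing every community twice leaves the value unchanged instead of doubling it, which is the property the normalization is meant to provide.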

Computational complexity
Here we calculate the computational complexities of the density-based (13), edge-based (14), simple (15) and LinkRank-based (16) modularity extensions. All complexities are presented in Table 1.
First, we set aside the computational complexity of the belonging function and consider it later. In expression (13), each term of the summation is computed in time determined by the community size. Taking into account that the average square community size is not less than the square of the average size, $\overline{s^2} \ge \bar{s}^2$, this gives the overall complexity of (13). In expression (14), the hardest terms dominate the number of steps and thus determine the overall complexity. Formulas (15) and (16) iterate over all ordered node pairs within each community, that is over $\sum_{c \in C} |c|^2 = |C| \, \overline{s^2}$ pairs, each requiring one evaluation of the belonging function; the PageRank calculation time is ignored as insignificant. Understanding the big-O complexity of the PageRank calculation requires analyzing the code of the pagerank_scipy method from NetworkX. However, Aric Hagberg (NetworkX Lead Programmer) wrote that their implementation has "linear complexity in the number of edges". In practice, PageRank computation time is negligible. Now consider the belonging function. The uniform belonging coefficient may be computed in one operation if the communities of each node are explicitly known, e.g. if each node carries a set of labels. But community detection algorithms usually return a list of communities represented as sets of nodes. This means we need to scan the whole list of communities to find all communities a given node belongs to. The same concerns the fraction belonging coefficient, supposing that the average node membership is not very high. Therefore, the intersection belonging function has the same order of complexity as the others.
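The lookup cost discussed above can be made concrete with a short Python sketch. It contrasts scanning the list of communities for every query with building an inverted index from nodes to community labels once, after which each lookup is a single dictionary access. Both function names and the example cover are assumptions for illustration.

```python
def communities_of_naive(node, communities):
    """Scan every community: one membership test per community for each query."""
    return {label for label, members in enumerate(communities) if node in members}


def membership_index(communities):
    """Build the node -> set of community labels map in one pass over all members."""
    index = {}
    for label, members in enumerate(communities):
        for node in members:
            index.setdefault(node, set()).add(label)
    return index


# Hypothetical cover as returned by a community detection algorithm.
COMMUNITIES = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
INDEX = membership_index(COMMUNITIES)
```

With the index in place, the uniform belonging coefficient of a node is `1 / len(INDEX[node])`, and the number of communities shared by two nodes is `len(INDEX[i] & INDEX[j])`.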

Effects
In order to demonstrate that the estimate based on computed modularity agrees with the intuitive community structure, we computed modularities of several alternative community covers of the example graph (see Fig. 1). We generated a large set of random community covers and sorted them according to the modularity value computed with formula (15). Fig. 1 shows the 3 covers with the highest modularity and the 3 covers with the lowest modularity. We can see that the most intuitive cover corresponds to the highest modularity value. The same holds for formula (16).

Experiments
We implemented in Python four versions of modularity (edge-based, density-based, $Q^S$ and $Q^{LR}$) together with 4 belonging functions (see (10) and (12)). We also conducted a set of experiments: on computation time, on different belonging functions and belonging coefficients, and on parallelization.

Computation time
We compared the modularity value and computation time of the four formulas (edge-based, density-based, $Q^S$ and $Q^{LR}$) on two graphs of different size. Since the density-based formula does not support weights and the fraction belonging coefficient is undefined for directed graphs (due to a possible zero in the denominator), the graphs were chosen to be undirected and unweighted. Experiments with directed weighted graphs are to be conducted later. We took the default belonging functions (suggested in the original papers) and the uniform belonging coefficient for simplicity. The small graph was generated by the CDR-GEN generator. Table 2 shows that as the size of the graph and the size and number of communities grow, the edge-based and density-based formulas become too computationally expensive, so there are only two scalable candidates, $Q^S$ and $Q^{LR}$.

Belonging functions and belonging coefficients
Then we investigated the influence of different belonging functions and belonging coefficients on modularity values. We used the same Wu et al. dataset, clustered by the Clique Percolation algorithm, with 13% of nodes involved in communities. Table 3 shows that the choice of belonging function or belonging coefficient does not make much difference to the resulting modularity. Meanwhile, the intersection belonging function takes the least time. Values of $Q^S$ are consistent with those of the edge-based modularity, which is widely used in the literature.
$Q^{LR}$ values tend to be lower than both. Values of the density-based modularity differ a lot, possibly due to its dissimilar formula structure, but as far as we know this formula has not been compared to other ones in the literature. Table 4 extends the comparison of different belonging functions for $Q^S$ and $Q^{LR}$ to a directed weighted graph with overlapping communities. The belonging coefficient is uniform. We see that the behavior is consistent with that of the undirected unweighted case.

Parallel modularity
The computation of $Q^S$ and $Q^{LR}$ naturally allows parallelization. Since each community and each node pair contributes independently to the modularity value, the iteration over node pairs may be distributed between processors. We implemented two parallel versions. The first one is rather straightforward. The iteration over communities is left sequential. Each time a community larger than a given size threshold is encountered, parallel processes are started. The set of all node pairs within the community is split into equal chunks, which are assigned to these processes (see Algorithm 1).

Algorithm 1: Parallel modularity version 1.
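The first parallel scheme can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: `pair_term` stands for the per-pair contribution to modularity, `min_size` is a hypothetical name for the community size threshold, and a thread pool is used only to keep the sketch short and portable. For real CPU-bound speedup a process pool with a picklable top-level worker function would be substituted.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunked(items, n_chunks):
    """Split a list into n_chunks contiguous pieces whose lengths differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def community_sum(members, pair_term, workers=4, min_size=3):
    """Sum pair_term(i, j) over all ordered node pairs of one community.

    Small communities are processed sequentially; larger ones have their
    node pairs split into equal chunks that are summed by parallel workers.
    """
    pairs = list(product(members, repeat=2))
    if len(members) <= min_size:
        return sum(pair_term(i, j) for i, j in pairs)

    def partial_sum(chunk):
        return sum(pair_term(i, j) for i, j in chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunked(pairs, workers)))


def parallel_total(communities, pair_term, workers=4, min_size=3):
    """Iterate over communities sequentially, parallelizing inside large ones."""
    return sum(community_sum(c, pair_term, workers, min_size) for c in communities)
```

The per-pair term is passed in as a function so that the same skeleton serves both extensions; only the quantity summed over node pairs differs between them.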
The second parallel version is a little more complicated. The idea is to split the set of communities into subsets distributed between processors. In order to balance the load, these subsets should have approximately equal sums of squared community sizes, since a community of size $s$ has $s^2$ ordered node pairs (counting self-loops). To achieve this we used a greedy algorithm, which iterates over communities in descending order of size and assigns each of them to the subset that currently has the smallest sum of squared sizes. The only problem here is that the squared size of the biggest community may be much larger than the sum of squared sizes of the remaining ones, so the subset that gets this community will be overloaded. To overcome this we sort the communities by size in descending order and apply the first parallel approach to the first (biggest) several communities, until we encounter a community small enough to allow balancing of the remaining ones or reach the lower bound on community size. The remaining communities are split into subsets by the greedy algorithm described above. To decide when to start balancing we use a simple condition: the squared size of the current biggest community should not exceed a fixed fraction of the total sum of squared sizes of the communities left at the moment. Formally, with community sizes sorted as $s_1 \ge s_2 \ge \dots \ge s_{|C|}$, the stopping condition at community $k$ compares $s_k^2$ with that fraction of $\sum_{i \ge k} s_i^2$. See Algorithm 2.

Algorithm 2: Parallel modularity version 2.
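The two ingredients of the second scheme, the stopping condition and the greedy balancing, can be sketched in Python as follows. The threshold of the stopping condition is not recoverable from the text, so it appears here as an explicit `fraction` parameter, and `min_size` is a hypothetical name for the lower community size bound; both are assumptions of this illustration.

```python
import heapq


def split_point(sorted_sizes, fraction, min_size):
    """Index of the first community that is handed to greedy balancing.

    sorted_sizes -- community sizes in descending order
    Communities before the returned index are processed one by one with
    the first parallel scheme.
    """
    remaining = sum(s * s for s in sorted_sizes)   # squares of communities left
    for k, size in enumerate(sorted_sizes):
        if size * size <= fraction * remaining or size <= min_size:
            return k
        remaining -= size * size
    return len(sorted_sizes)


def balance(sizes, n_workers):
    """Greedy load balancing by the sum of squared community sizes.

    Communities are visited in descending order of size; each one goes to
    the worker whose current sum of squared sizes is smallest.
    Returns one list of community indices per worker.
    """
    heap = [(0, worker) for worker in range(n_workers)]   # (load, worker id)
    bins = [[] for _ in range(n_workers)]
    order = sorted(range(len(sizes)), key=lambda c: sizes[c], reverse=True)
    for community in order:
        load, worker = heapq.heappop(heap)
        bins[worker].append(community)
        heapq.heappush(heap, (load + sizes[community] ** 2, worker))
    return bins
```

For example, with sizes 10, 3, 3, 2, 2 and `fraction=0.5` only the community of size 10 is processed with the first scheme, and the remaining four are split between two workers with equal loads of 13 node pairs each.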
We compared the speedup of both parallel versions over the sequential computation of modularity for $Q^S$ and $Q^{LR}$; see Table 5. When the number of communities is small, the first method is slightly faster due to its simplicity (results were averaged over 5 runs). With many communities the second version shows its benefit. We also investigated how both parallel implementations scale with the number of processes. The results are shown in Fig. 1.

Conclusion
We investigated existing approaches to computing the modularity measure and developed the $Q^S$ and $Q^{LR}$ modularity extensions for large directed weighted graphs with overlapping communities. These extensions have low computational complexity, which makes them applicable to graphs with more than $10^4$ nodes, and they can also be computed in parallel. The two formulae are based on different notions of community: a group of nodes with denser links ($Q^S$) or a group of nodes where a random surfer tends to spend more time ($Q^{LR}$). Since a surfer walks along the link direction, the second formula is more sensitive to the direction of links in a graph. A possible future direction is to use the new versions of modularity for overlapping community detection in directed weighted graphs.
Keywords: modularity; community detection; PageRank; LinkRank; belonging function; belonging coefficient.

References
[1]. Lawrence Page et al. "The PageRank citation ranking: bringing order to the web".