Network modularity optimization by a fusion-fission process and application to protein-protein interactions networks

Jean-Baptiste Angelelli (1) and Laurence Reboul (2)

(1) INSTITUT DE MATHEMATIQUES DE LUMINY, UMR6206 CNRS, Campus de Luminy, Case 907, 13288 Marseille Cedex 9, France, [email protected]
(2) LABORATOIRE DE MATHEMATIQUES ET APPLICATIONS, UMR6086 CNRS, Université de Poitiers, 86000 Poitiers, France, [email protected]

Abstract: Community detection, also known as graph partitioning, is a useful tool for analysing and understanding data that take the form of undirected, simple networks. In particular, detecting communities in protein-protein interaction (PPI) networks is important for functional prediction. Modularity is a well-known quality criterion for a graph partition. In this paper, we propose a new partitioning method based on the optimization of modularity over the set of partitions of a network, using a fusion-fission process. A simulation study shows that our method is competitive. The method is then used to find communities of proteins in the Drosophila PPI network.

Keywords: Community detection, protein-protein interaction networks, modularity, greedy optimization.

1 Introduction

Representing complex systems by networks is a popular and very useful way to study and understand their structure. The vertices correspond to the elements of the system and the edges to the relations between them. Networks are encountered in very different domains dealing with complex systems. In biology, for instance, systems of proteins linked by chemical interactions can be modeled by protein-protein interaction (PPI) networks [1,2]. A well-documented topic of interest in network analysis is community detection (see [3,4,5]). It aims at detecting highly connected subsets of vertices in the network. Such communities often carry an interesting meaning.
In PPI networks, they are subsets of proteins sharing the same function, and community detection is therefore a powerful tool for functional prediction [1,2]. A good quality criterion for a partition of a network is modularity [4]. Community detection methods most often rely on the maximization of the modularity criterion, so that developing algorithms for modularity optimization has become a key concern in this domain. The problem of finding the partition of maximal modularity over the whole set of partitions of the graph is NP-hard. Therefore, in recent years, attention has been paid to developing heuristics that find high-modularity partitions in polynomial time and space. In particular, greedy algorithms [5] have been developed. In this paper, we propose a new greedy algorithm. It is based on the fusion and fission of communities, starting from a cover that is optimal from a modularity point of view (section 2). The comparison of our algorithm with that of Newman [5] on computer-generated networks shows competitive results (section 3). The algorithm is then used to find communities in the Drosophila PPI network (section 4).

JOBIM 2008

2 Modularity maximization and algorithm

2.1 Algorithm

Let $G = (V, E)$ be a simple graph (i.e. there is 0 or 1 edge between each pair of vertices and a vertex cannot be connected to itself) with $n = |V|$ and $m = |E|$. The main goal of this paper is to find a partitioning method of $V$. For that task, we build a greedy algorithm that maximizes the modularity. The formula used for the modularity is that of Newman [6]. It is rewritten for practical purposes as

$$Q_G(P) = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \alpha_{ij} = K + \frac{1}{2m^2} Q'_G(P) \qquad (1)$$

where $K$ is a constant which depends only on the network, $Q'_G(P) = \sum_{2 \le i \le n} \sum_{1 \le j \le i-1} B_{ij} \alpha_{ij}$, $B_{ij} = 2m A_{ij} - d_i d_j$, $\alpha_{ij}$ equals 1 if vertices $i$ and $j$ belong to the same community in $P$ and 0 if not, $(A_{ij})$ is the adjacency matrix of $G$, and $d_i$ is the degree of vertex $i$. The problem is to find the optimal partition $P$, which is equivalent to finding the matrix $(\alpha_{ij})$.

There exists a very simple solution to this problem which does not require any maximization process. Indeed, $Q'_G(P)$ is maximal when $\alpha_{ij}$ equals 1 if $B_{ij} \ge 0$ and 0 if not. Unfortunately, this solution does not describe a partition in the general case. To describe a partition, we should have $(\alpha_{ij} = 1, \alpha_{jk} = 1) \Rightarrow (\alpha_{ik} = 1)$. Instead, the matrix $(\alpha_{ij})$ describes a cover $R$ of $V$. The idea is as follows: the algorithm starts from this optimal cover. Then, it either fuses or splits pairs of overlapping communities until a partition is reached. At each step, the operation (fusion or fission) that provides the highest $Q'$ is chosen. For that task, one needs to calculate the variation $\Delta Q'$ of modularity obtained by the fusion or by the fission of two given communities.

Case of a fusion

The variation $\Delta Q'_{fus}$ induced by the fusion, or merging, of two subsets $C_1, C_2 \subset V$ is computed as follows: only pairs of vertices with both elements in $C_1 \cup C_2$ are involved in the process. For each pair of vertices $i, j \in C_1 \cup C_2$, we have $\alpha_{ij} = 1$ after the fusion. Since we may or may not already have had $\alpha_{ij} = 1$ before the fusion (consider the case where $i, j \in C_1$ or $i, j \in C_2$), we have

$$\Delta Q'_{fus}(C_1, C_2) = \sum_{i \in C_1} \sum_{j \in C_2} (1 - \alpha_{ij}) B_{ij}. \qquad (2)$$

Case of a fission

Given two subsets $C_1, C_2 \subset V$ with $C_1 \cap C_2 \neq \emptyset$, the fission of $C_1 \cup C_2$ is not unique. Moreover, since it requires the enumeration of all subsets of $C_1 \cup C_2$, finding the fission $C'_1 \cup C'_2$ that maximizes the modularity is an NP-hard problem. Therefore, an approximation method is needed.
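The quantities introduced so far, namely $B_{ij}$, $Q'$, the decomposition of Eq. (1) and the fusion gain of Eq. (2), can be illustrated by a minimal Python sketch. The toy graph (two triangles joined by one edge), the function names and the dense list-of-lists representation are assumptions made for this example and are not taken from the authors' implementation. The explicit form $K = -\sum_i d_i^2 / (4m^2)$ is derived here from Eq. (1) with $\alpha_{ii} = 1$; the source only states that $K$ depends on the network.

```python
def b_matrix(adj):
    """B_ij = 2m*A_ij - d_i*d_j for a simple undirected graph."""
    d = [sum(row) for row in adj]
    two_m = sum(d)  # the degrees sum to 2m
    n = len(adj)
    return [[two_m * adj[i][j] - d[i] * d[j] for j in range(n)]
            for i in range(n)]


def alpha(cover, i, j):
    """1 if i == j or some community of the cover contains both i and j."""
    return int(i == j or any(i in c and j in c for c in cover))


def q_prime(adj, cover):
    """Q' = sum over pairs j < i of B_ij * alpha_ij."""
    B = b_matrix(adj)
    return sum(B[i][j] * alpha(cover, i, j)
               for i in range(len(adj)) for j in range(i))


def modularity(adj, partition):
    """Left-hand side of Eq. (1), computed directly from its definition."""
    d = [sum(row) for row in adj]
    two_m = sum(d)
    n = len(adj)
    total = sum((adj[i][j] - d[i] * d[j] / two_m) * alpha(partition, i, j)
                for i in range(n) for j in range(n))
    return total / two_m


def fusion_gain(adj, cover, c1, c2):
    """Eq. (2): variation of Q' when c1 and c2 are merged."""
    B = b_matrix(adj)
    return sum((1 - alpha(cover, i, j)) * B[i][j] for i in c1 for j in c2)


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, m = 6, len(edges)
adj = [[0] * n for _ in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

partition = [{0, 1, 2}, {3, 4, 5}]
degs = [sum(row) for row in adj]
K = -sum(x * x for x in degs) / (4 * m * m)  # constant term of Eq. (1)
q_direct = modularity(adj, partition)
q_split = K + q_prime(adj, partition) / (2 * m * m)

# Two overlapping communities; merging them only adds the pair (0, 2).
cover = [{0, 1}, {1, 2}]
gain = fusion_gain(adj, cover, {0, 1}, {1, 2})
```

On this toy graph the two ways of computing the modularity agree up to rounding, and the fusion gain reduces to $B_{02}$ because (0, 2) is the only pair not already covered by a common community.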
We consider that only vertices in $C_1 \cap C_2$ are to be moved to either $C'_1$ or $C'_2$. Vertices in $C_1 \setminus C_2$ (resp. $C_2 \setminus C_1$) are directly assigned to $C'_1$ (resp. $C'_2$). For each vertex $i$ in $C_1 \cap C_2$, the quantities $M_1(i) = \sum_{j \in C_1} B_{ij}$ and $M_2(i) = \sum_{j \in C_2} B_{ij}$ are computed. Here, $M_1(i)$ (resp. $M_2(i)$) can be interpreted as a measure of how close vertex $i$ is to $C_1$ (resp. $C_2$): if $M_1(i) > M_2(i)$ then $i$ is assigned to $C'_1$, else it is assigned to $C'_2$. Then, the approximate variation $\Delta Q'_{fis}$ induced by the fission is

$$\Delta Q'_{fis}(C_1, C_2) = - \sum_{i \in C'_1} \sum_{j \in C'_2} \alpha_{ij} B_{ij}. \qquad (3)$$

It is worth noticing that (3) is only an approximation. Firstly, the whole set of subdivisions of $C_1 \cup C_2$ has not been checked. Secondly, two vertices separated by the fission may or may not still be joined in a third community. For these reasons, (3) does not give a "guaranteed" optimal fission of $C_1 \cup C_2$ but a "good enough" solution, quickly computable in polynomial time and space. The algorithm is as follows:

1. Start with $R = \{\{x, y\} \mid B_{xy} \ge 0\}$
2. While $R$ is not a partition:
   (a) For all $\{C_i, C_j\} \subset R$ such that $C_i \cap C_j \neq \emptyset$, consider $\Delta Q'_{fus}(C_i, C_j)$ and $\Delta Q'_{fis}(C_i, C_j)$
   (b) Do the best operation (fusion or fission).

2.2 Analysis and complexity

Since the algorithm stops when there is no pair of overlapping communities left, we must prove that this state is always reached. Let $A_r$ and $B_r$ be respectively the number of communities and the number of pairs of overlapping communities remaining at the $r$th step of the algorithm. Since a fusion (resp. fission) replaces two communities by a new one (resp. two new ones), $A_r$ is a non-increasing sequence. At the initialization step, each community is made of a pair $\{i, j\}$ of vertices such that $B_{ij} \ge 0$. Considering that $B_{ij} = 2m A_{ij} - d_i d_j$, we can notice that $B_{ij}$ can be greater than or equal to 0 only if $A_{ij} = 1$, that is, when there is an edge between $i$ and $j$. Therefore, $A_r \le A_0 \le m$.
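The bound $A_0 \le m$ and the fission rule of Section 2.1 can be checked on a small example. The following Python sketch builds the initial cover of step 1, applies the assignment rule based on $M_1(i)$ and $M_2(i)$, and evaluates Eq. (3) literally. The helper names and the toy graph (two triangles joined by one edge) are assumptions for illustration only; this is not the authors' code.

```python
def b_matrix(adj):
    """B_ij = 2m*A_ij - d_i*d_j for a simple undirected graph."""
    d = [sum(row) for row in adj]
    two_m = sum(d)  # the degrees sum to 2m
    n = len(adj)
    return [[two_m * adj[i][j] - d[i] * d[j] for j in range(n)]
            for i in range(n)]


def initial_cover(B):
    """Step 1: one community {i, j} per pair of vertices with B_ij >= 0."""
    n = len(B)
    return [{i, j} for i in range(n) for j in range(i) if B[i][j] >= 0]


def alpha(cover, i, j):
    """1 if i == j or some community of the cover contains both i and j."""
    return int(i == j or any(i in c and j in c for c in cover))


def fission(B, c1, c2):
    """Approximate fission: only vertices of c1 & c2 are reassigned."""
    new1, new2 = c1 - c2, c2 - c1
    for i in sorted(c1 & c2):
        # The j == i term appears in both sums, so it does not affect the test.
        m1 = sum(B[i][j] for j in c1)
        m2 = sum(B[i][j] for j in c2)
        if m1 > m2:
            new1.add(i)
        else:
            new2.add(i)
    return new1, new2


def fission_gain(B, cover, c1, c2):
    """Eq. (3), with alpha taken from the cover before the fission."""
    new1, new2 = fission(B, c1, c2)
    return -sum(alpha(cover, i, j) * B[i][j] for i in new1 for j in new2)


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, m = 6, len(edges)
adj = [[0] * n for _ in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

B = b_matrix(adj)
cover0 = initial_cover(B)          # here: exactly one community per edge
overlapping = [{0, 1, 2}, {2, 3}]  # two communities sharing vertex 2
split = fission(B, {0, 1, 2}, {2, 3})
gain = fission_gain(B, overlapping, {0, 1, 2}, {2, 3})
```

In this example vertex 2 stays with its triangle, and the fission costs $B_{23}$ in $Q'$ because the pair (2, 3) is no longer covered.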
Let us now consider $B_0$, the number of pairs of overlapping communities at the initialization step of the algorithm. One might think that $B_0 = O(A_0^2)$. However, at step 0, communities exist only where edges exist, so that the number of pairs of overlapping communities is bounded by $\delta \times m$, where $\delta$ denotes the maximum vertex degree in the graph. This remark is crucial because, in practical cases, we always have $\delta \ll m$.

In the case of a fusion between two communities, there is at least one pair of overlapping communities which disappears: the pair of the merging ones, so $B_{r+1} < B_r$. In the case of a fission between two overlapping communities, the same pair of overlapping communities disappears. Furthermore, a third community covering only one of the two communities will only cover one of the two resulting communities, because the fission takes place in the intersection of the two communities being split. Therefore, we still have $B_{r+1} < B_r$ in the case of a fission. Now, since $B_r$ is a strictly decreasing sequence of non-negative integers, it is necessarily finite and the algorithm always stops.

In order to study the time and space complexity of the algorithm, we can use the above remarks. We have seen that the size $B_r$ of the list of pairs of overlapping communities satisfies $B_r = O(\delta m)$. At each step of the algorithm, we must go through this list and test the costs of both fusion and fission. It is easy to see that these are $O(n^2)$ time operations. The entire algorithm runs in $O(\delta n^2 m)$ time. Most real-world networks of interest, including PPI networks, are sparse graphs, which means that $m = O(n)$. In this case, the algorithm runs in $O(\delta n^3)$ time.

We have to keep the list of pairs of overlapping communities. This list uses $O(\delta m)$ space. We also have to keep each community in memory. Since $A_r \le m$ and a community contains at most $n$ elements, the algorithm uses $O(nm)$ space to keep the communities in memory.
Moreover, computing the cost of a fusion or of a fission takes $O(n^2)$ space, but we only keep the results in memory, not the details of each computation, so that the entire algorithm uses $O(\delta m) + O(nm) + O(n^2)$ space. Finally, we always have $\delta \le n \le m$, therefore the space complexity of the algorithm is $O(nm)$, which corresponds to $O(n^2)$ for sparse graphs.

3 Efficiency

3.1 Performance on computer-generated networks

To measure the efficiency of our algorithm, we use it to partition computer-generated random networks whose community structure is known in advance. The method used is described by Newman [5]. Each network is made of 128 vertices divided into 4 communities of 32 elements. We denote by $z$ the average degree of the vertices, by $z_{in}$ the average intra-community degree, which is the average number of neighbours of a vertex belonging to the same community, and by $z_{out}$ the average extra-community degree, which is the average number of neighbours of a vertex belonging to other communities. Each network is generated with $z = z_{in} + z_{out} = 16$, so we use only $z_{out}$ as the variable in the testing process. When $z_{out}$ is low, the corresponding networks have many intra-community edges and few inter-community edges, so that the partitioning problem is easy; it becomes increasingly difficult as $z_{out}$ gets higher. We make $z_{out}$ vary from 0 to 8 in steps of 0.5. For each value of $z_{out}$, we generate 100 networks and partition them with our greedy algorithm as well as with the one proposed by Newman [5]. We then measure the average fraction $\tau_e$ of correctly classified vertices and the average modularity $Q$ of the resulting partitions.
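A benchmark network of this kind can be generated along the following lines. This is a sketch under an explicit assumption: edges are drawn independently, with probability $z_{in}/31$ inside a community and $z_{out}/96$ between communities, which gives the stated average degrees in expectation. The text only refers to Newman [5] for the procedure, so the function names and parameters below are illustrative.

```python
import random


def planted_partition(z_out, z=16.0, groups=4, size=32, seed=None):
    """Return (edges, community) for one random benchmark network."""
    rng = random.Random(seed)
    n = groups * size
    p_in = (z - z_out) / (size - 1)  # z_in expected same-community neighbours
    p_out = z_out / (n - size)       # z_out expected other-community neighbours
    community = [v // size for v in range(n)]
    edges = []
    for i in range(n):
        for j in range(i):
            p = p_in if community[i] == community[j] else p_out
            if rng.random() < p:
                edges.append((j, i))
    return edges, community


def average_degrees(edges, community):
    """Empirical (z_in, z_out) of a generated network."""
    n = len(community)
    inside = sum(1 for a, b in edges if community[a] == community[b])
    outside = len(edges) - inside
    return 2 * inside / n, 2 * outside / n


# z_out from 0 to 8 in steps of 0.5, as in the experiment described above.
z_out_values = [0.5 * k for k in range(17)]

edges, community = planted_partition(z_out=4.0, seed=0)
z_in_hat, z_out_hat = average_degrees(edges, community)
```

Repeating the call 100 times per value in `z_out_values`, with different seeds, would reproduce the size of the experiment; the partitioning and scoring steps are left out of this sketch.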
[Figure 1: plot of the fraction of correctly classified vertices and of the modularity of resulting partitions against the average number of inter-community neighbors per vertex $z_{out}$ (0 to 8). Series: $\tau_e$ (Newman algorithm), $\tau_e$ (this algorithm), $Q$ (Newman algorithm), $Q$ (this algorithm).]

Figure 1. The fraction of correctly classified elements by our algorithm (♦) and Newman's algorithm [5] (+), and the modularity of the resulting partitions by our algorithm (×) and Newman's algorithm [5].

3.2 Analysis

We can see in Fig. 1 that our algorithm (♦) classifies more vertices correctly than Newman's (+) on computer-generated networks. To explain this, we can point out that our algorithm finds a partition by a fusion-fission process, while Newman's is an agglomerative process that uses only fusions. Therefore, in Newman's algorithm, two vertices wrongly put together cannot be separated and errors build up. Using fission in the process helps to tackle this difficulty. We also see in Fig. 1 that the modularities of the partitions provided by our algorithm (×) are almost equal to those provided by Newman's algorithm. This may be somewhat surprising given the difference between the two algorithms in terms of correctly classified elements. This result highlights a limitation of the modularity criterion: two partitions of different qualities may have virtually the same modularity. As an optimization process, Newman's algorithm is almost as efficient as ours. Therefore, the differences in the quality of the obtained partitions can be partially explained by the fact that Newman's algorithm starts with the atomic partition, while our algorithm starts with a cover of the graph where strongly related elements (i.e. with high $B_{ij}$) are already put together.
4 Application to a protein-protein interaction network

In this section, we consider a Drosophila PPI network. The graph used here has 196 vertices and 310 edges and is a connected subgraph of the full Drosophila interactome. Vertices in this network represent proteins and edges represent interactions between proteins provided by [7,8,9,10]. Our partitioning algorithm is used to process this network and finds 12 groups of proteins, shown in Table 1.

group  size  proteins
1      1     okr
2      1     Rbp1
3      14    bip2 CG2926 kn Rad51 p53 His4 pr-set7 CG4005 sd tou CG1244 CG9715 CG12340 stumps
4      12    Sep1 Sep2 pnut Uba2 sip2 snRNP70K SC35 sip1 tra SF2 tra2 U2af35
5      12    trx Pp1alpha-96A flw Pp1-87B Pp1-13C NiPp-1 I-2 CG5053 CG6416 CG12620 scrib CG16812
6      20    noc Su(var)3-9 dlg1 CG12470 Sh CG7357 l(3)j2D3 CG31029 CG14546 Su(var)205 Nap1 CkIIalpha HLHm7 l(1)sc ac emc E(spl) HLHm5 sc da
7      15    ftz Rpd3 Sin3A Smr usp EcR rig alien ftz-f1 Sin Sxl U2af38 snf Hrb87F U2af50
8      11    CG13030 Mi-2 Dref ttk Rlip AP-50 png CycB phyl sina lolal
9      29    Trl Chi Top2 Pc ph-d Tbp ph-p Scm barr Psc Taf4 E(z) Bin1 bcd esc corto His3 tor slbo Taf11 Taf1 Taf6 vap aur cnn ial Taf12 TfIIA-L Taf2
10     30    Hsf Su(fu) Mlf RpII215 ci CG32209 Kap-alpha3 CG4656 fu CkIalpha cos insc raps numb mira pros hh smo eve Gug pan arm sgg CG3402 Axn Hrb27C otu mus309 SH3PX1 Apc2
11     21    smt3 lwr gro CaMKII CtBP nej Mad Sir2 h CG4116 CG6459 Kr TfIIB CG8204 H brk pnr tin Med Smox run
12     30    dl pll Myd88 Tl tub BG4 Pli Traf1 msn dock cher Amph Hrs Ogt Stam Psn cact mbo Dif Rel Dip3 Dredd CG11486 scaf6 CG4349 inaD inaC trpl Cam ninaC

Table 1. Groups of proteins found by our algorithm in the Drosophila PPI network.

First, we notice that groups 1 and 2 contain only a single element. This is not surprising since okr and Rbp1 have a degree equal to 1 in our network. We investigate the biological meaning of the other groups using GOToolBox [11] and find that these communities correspond very well to known functional groups of proteins.
Groups 3 and 7 contain proteins involved in the regulation of cellular and metabolic processes; group 4 contains proteins involved in spliceosome assembly, RNA splicing and processing; proteins of group 5 are involved in protein amino-acid dephosphorylation; groups 6 and 9 contain proteins involved in the regulation of transcription; group 8 contains proteins involved in photoreceptor cell differentiation. We find that group 10 contains proteins involved in the hedgehog pathway (Su(fu), ci, fu, cos, hh, smo) and in the wingless pathway (CkIalpha, pan, arm, sgg, CG3402, Axn, Apc2). These two pathways are very close from a network point of view. That is why our algorithm provides a single community instead of two (they may be separated by running the same algorithm on the subgraph induced by this group of vertices). Proteins in group 11 are transcription factors, and group 12 contains proteins involved in the Toll pathway.

5 Conclusions

Optimizing modularity over the set of partitions of a network is a very effective way to detect interesting communities. However, in our experiments, using a fusion-fission process on overlapping communities provides even better results. This algorithm is not only efficient on computer-generated networks but also finds meaningful communities of proteins in the Drosophila PPI network. In passing, a limitation of modularity appears: partitions with very close modularities may show larger differences in overall quality.

6 Acknowledgements

We thank Christine Brun and Anaïs Baudot for providing us with the PPI network used in this article and for their help in the biological interpretation of the results. J-B.A. is supported by a fellowship from the French 'Ministère de l'Enseignement Supérieur et de la Recherche'.

References

[1] C. Brun, F. Chevenet, D. Martin, J. Wojcik and A. Guénoche, Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network, Genome Biol., 5:R6, 2003.
[2] C. Brun, C. Herrmann and A. Guénoche, Clustering proteins from interaction networks for the prediction of cellular functions, BMC Bioinformatics, 5:95, 2004.
[3] A. Guénoche, Comparing recent methods in graph partitioning, Electronic Notes in Discrete Mathematics, 22:83-89, 2005.
[4] M.E.J. Newman and M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E, 69(2):026113, 2004.
[5] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Phys. Rev. E, 69(6):066133, 2004.
[6] M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E, 74:036104, 2006.
[7] L. Giot, J.S. Bader, C. Brouwer, A. Chauduri, B. Kuang, Y. Li, Y.L. Hao, C.E. Ooi, B. Godwin, E. Vitols, G. Vijayadamodar, P. Pochart, H. Machineni, M. Welsh, Y. Kong, B. Zerhusen, R. Malcolm, Z. Varrone, A. Collis, M. Minto, S. Burgess, L. McDaniel, E. Stimpson, F. Spriggs, J. Williams, K. Neurath, N. Ioime, M. Agee, E. Voss, K. Furtak, R. Renzulli, N. Aanensen, S. Carrolla, E. Bickelhaupt, Y. Lazovatsky, A. DaSilva, J. Zhong, C.A. Stanyon, R.L. Finley, K.P. White, M. Braverman, T. Jarvie, S. Gold, M. Leach, J. Knight, R.A. Shimkets, M.P. McKenna, J. Chant and J.M. Rothberg, A protein interaction map of Drosophila melanogaster, Science, 302:1727-36, 2003.
[8] E. Formstecher, S. Arresta, V. Collura, A. Hamberger, A. Meil, A. Trehin, C. Reverdy, V. Betin, S. Maire, C. Brun, B. Jacq, M. Arpin, Y. Bellaiche, S. Bellusci, P. Benaroch, M. Bornens, R. Chanet, P. Chavrier, O. Delattre, V. Doye, R. Fehon, G. Faye, T. Galli, J.A. Girault, B. Goud, J. de Gunzburg, L. Johannes, M.P. Junier, V. Mirouse, A. Mukherjee, D. Papadopoulo, F. Perez, A. Plessis, M. Rosbach, C. Rossé, S. Saule, D. Stoppa-Lyonnet, A. Vincent, M. White, P. Legrain, J. Wojcik, J. Camonis and L. Daviet, Protein interaction map of Drosophila melanogaster, Genome Res., 15:376-84, 2005.
[9] C.A. Stanyon, G. Liu, B.A. Mangiola, N. Patel, L. Giot, B. Kuang, H. Zhang, J. Zhong and R.L. Finley Jr., A Drosophila protein-interaction map centered on cell-cycle regulators, Genome Biol., 5:R96, 2004.
[10] H. Hermjakob, L. Montecchi-Palazzi, C. Lewington, S. Mudali, S. Kerrien, S. Orchard, M. Vingron, B. Roechert, P. Roepstorff, A. Valencia, H. Margalit, J. Armstrong, A. Bairoch, G. Cesareni, D. Sherman and R. Apweiler, IntAct: an open source molecular interaction database, Nucleic Acids Res., 32:D452-5, 2004.
[11] D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry and B. Jacq, GOToolBox: functional analysis of gene datasets based on Gene Ontology, Genome Biol., 5(12):R101, 2004.