Network modularity optimization by a fusion-fission process
and application to protein-protein interactions networks
Jean-Baptiste Angelelli (1) and Laurence Reboul (2)
(1) INSTITUT DE MATHEMATIQUES DE LUMINY, UMR6206 CNRS, Campus de Luminy, Case 907,
13288 Marseille Cedex 9, France
[email protected]
(2) LABORATOIRE DE MATHEMATIQUES ET APPLICATIONS, UMR6086 CNRS, Université de Poitiers,
86000 Poitiers, France
[email protected]
Abstract: Community detection, also known as graph partitioning, is a useful tool for
analysing and understanding data that take the form of undirected, simple networks. In
particular, detecting communities in protein-protein interaction (PPI) networks is important
for functional prediction. Modularity is a well-known quality criterion for a graph
partition. In this paper, we propose a new partitioning method based on the optimization
of the modularity over the set of partitions of a network using a fusion-fission process. A
simulation study shows that our method is competitive with Newman's greedy algorithm. It is
then used to find communities of proteins in a Drosophila PPI network.
Keywords: Community detection, protein-protein interaction networks, modularity, greedy
optimization.
1 Introduction
Representing complex systems by networks is a popular and very useful way to study and understand
their structure. The vertices correspond to the elements of the system and the edges to relations
between them. Networks are encountered in very different domains dealing with complex systems.
In biology for instance, systems of proteins linked by chemical interactions can be modeled by
protein-protein interaction (PPI) networks [1,2].
A well-documented topic of interest in network analysis is community detection (see [3,4,5]). It aims
at detecting highly connected subsets of vertices in the network. Such communities usually have
a meaningful interpretation. In PPI networks, they tend to be subsets of proteins sharing the same
function, and therefore community detection is a powerful tool for functional prediction [1,2]. A good
quality criterion for a partition of a network is modularity [4]. Community detection methods most
often rely on the maximization of this criterion, so that developing algorithms for optimizing modularity
has become a key concern in this domain. The problem of finding the maximal modularity partition
over the whole set of partitions of the graph is NP-hard. Therefore, in recent years, attention has
been paid to developing heuristics that find high-modularity partitions in polynomial time and
space. In particular, greedy algorithms [5] have been developed. In this paper, we propose a new greedy
algorithm. It is based on the fusion and fission of the communities of a cover that is optimal from a
modularity point of view (section 2). The comparison of our algorithm with that of Newman [5] on
computer-generated networks shows competitive results (section 3). The algorithm is then used to find
communities in a Drosophila PPI network (section 4).
JOBIM 2008
2 Modularity maximization and algorithm
2.1 Algorithm
Let $G = (V, E)$ be a simple graph (i.e. there is at most one edge between each pair of vertices and a
vertex cannot be connected to itself) with $n = |V|$ and $m = |E|$. The main goal of this paper is to
provide a method for partitioning $V$. For that task, we build a greedy algorithm that maximizes the
modularity. The formula used for the modularity is that of Newman [6]. It is rewritten for practical
purposes as

\[
Q_G(P) = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \alpha_{ij}
       = K + \frac{1}{2m^2} Q'_G(P) \tag{1}
\]

where $K$ is a constant which depends only on the network,
$Q'_G(P) = \sum_{2 \le i \le n} \sum_{1 \le j \le i-1} B_{ij} \alpha_{ij}$,
$B_{ij} = 2m A_{ij} - d_i d_j$, $\alpha_{ij}$ equals 1 if vertices $i$ and $j$ belong to the same
community in $P$ and 0 otherwise, $(A_{ij})$ is the adjacency matrix of $G$, and $d_i$ is the degree of
vertex $i$. The problem is to find the optimal partition $P$, which is equivalent to finding the matrix
$(\alpha_{ij})$. There exists a very simple solution to this problem which does not require any
maximization process. Indeed, $Q'_G(P)$ is maximal when $\alpha_{ij}$ equals 1 if $B_{ij} \ge 0$ and 0
otherwise. Unfortunately, this solution does not describe a partition in the general case. To describe a
partition, we should have $(\alpha_{ij} = 1, \alpha_{jk} = 1) \Rightarrow (\alpha_{ik} = 1)$. Instead, the
matrix $(\alpha_{ij})$ describes a cover $R$ of $V$. The idea is as follows: the algorithm starts from
this optimal cover. Then, it either merges (fusion) or splits (fission) pairs of overlapping communities
until a partition is reached. At each step, the operation, fusion or fission, that provides the highest
$Q'$ is chosen. For that task, one needs to calculate the variation $\Delta Q'$ of modularity obtained by
the fusion or by the fission of two given communities.
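The rewriting in (1) can be checked numerically. The sketch below is only an illustration: the function name `modularity_terms` and the toy graph (two triangles joined by one edge) are assumptions, not part of the paper. It also assumes the explicit value $K = -\sum_i d_i^2 / (4m^2)$, which is what the diagonal terms of the double sum give when $\alpha_{ii} = 1$ and $A_{ii} = 0$; the paper itself only states that $K$ depends on the network.

```python
from itertools import combinations

def modularity_terms(n, edges, communities):
    """Return (Q, K, Q_prime) for a partition given as a list of vertex sets."""
    m = len(edges)
    deg = [0] * n
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
        deg[u] += 1
        deg[v] += 1
    label = {v: k for k, c in enumerate(communities) for v in c}

    def same(i, j):
        # alpha_ij: 1 if i and j share a community, 0 otherwise
        return 1 if label[i] == label[j] else 0

    # Left-hand side of (1): direct double sum over all ordered pairs.
    Q = sum((A[i][j] - deg[i] * deg[j] / (2 * m)) * same(i, j)
            for i in range(n) for j in range(n)) / (2 * m)
    # Q': sum of B_ij * alpha_ij over unordered pairs j < i.
    Q_prime = sum((2 * m * A[i][j] - deg[i] * deg[j]) * same(i, j)
                  for j, i in combinations(range(n), 2))
    # Assumed closed form of K (diagonal terms of the double sum).
    K = -sum(d * d for d in deg) / (4 * m * m)
    return Q, K, Q_prime

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
Q, K, Q_prime = modularity_terms(6, edges, [{0, 1, 2}, {3, 4, 5}])
m = len(edges)
print(Q, K + Q_prime / (2 * m * m))  # the two values should agree
```

On this toy graph both sides of (1) evaluate to 5/14, with $Q' = 52$.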
Case of a fusion. The variation $\Delta Q'_{fus}$ induced by the fusion, or merging, of two subsets
$C_1, C_2 \subset V$ is computed as follows: only pairs of vertices with both elements in $C_1 \cup C_2$
are involved. For each pair of vertices $i, j \in C_1 \cup C_2$, we have $\alpha_{ij} = 1$ after the
fusion. Since we may or may not already have had $\alpha_{ij} = 1$ before the fusion (consider the case
where $i, j \in C_1$ or $i, j \in C_2$), we have

\[
\Delta Q'_{fus}(C_1, C_2) = \sum_{i \in C_1} \sum_{j \in C_2} (1 - \alpha_{ij}) B_{ij}. \tag{2}
\]
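As an illustration of (2), the following sketch computes the fusion gain on a toy graph. The helper names `b_matrix`, `together` and `fusion_gain` are hypothetical, and two conventions are assumptions made here: $\alpha_{ij}$ is read off the current cover (1 when some community contains both vertices), and $\alpha_{ii}$ is treated as 1 so that diagonal terms are skipped.

```python
def b_matrix(n, edges):
    """B_ij = 2m * A_ij - d_i * d_j for a simple undirected graph."""
    m = len(edges)
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    linked = {frozenset(e) for e in edges}
    return [[2 * m * (frozenset((i, j)) in linked) - deg[i] * deg[j]
             for j in range(n)] for i in range(n)]

def together(cover, i, j):
    """alpha_ij for a cover: True if some community contains both i and j."""
    return any(i in c and j in c for c in cover)

def fusion_gain(c1, c2, cover, B):
    """Equation (2): sum of B_ij over pairs not yet grouped by the cover."""
    return sum(B[i][j] for i in c1 for j in c2
               if i != j and not together(cover, i, j))

# Toy graph: two triangles joined by the edge (2,3); initial cover = edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
B = b_matrix(6, edges)
cover = [frozenset(e) for e in edges]
print(fusion_gain({0, 1}, {0, 2}, cover, B))  # merging inside a triangle
print(fusion_gain({1, 2}, {2, 3}, cover, B))  # merging across the bridge
```

Merging two edges of the same triangle costs nothing because every pair involved is already grouped, whereas merging across the bridge adds the non-adjacent pair (1, 3) with $B_{13} = -6$.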
Case of a fission. Given two subsets $C_1, C_2 \subset V$ with $C_1 \cap C_2 \ne \emptyset$, the fission
of $C_1 \cup C_2$ is not unique. Moreover, an exhaustive search would require enumerating all subsets of
$C_1 \cup C_2$, and finding the fission $C'_1 \cup C'_2$ that maximizes the modularity is an NP-hard
problem. Therefore, an approximation method is needed. We consider that only vertices in $C_1 \cap C_2$
are to be moved into either $C'_1$ or $C'_2$. Vertices in $C_1 \setminus C_2$ (resp.
$C_2 \setminus C_1$) are directly assigned to $C'_1$ (resp. $C'_2$). For each vertex $i$ in
$C_1 \cap C_2$, the quantities $M_1(i) = \sum_{j \in C_1} B_{ij}$ and $M_2(i) = \sum_{j \in C_2} B_{ij}$
are computed. Here, $M_1(i)$ (resp. $M_2(i)$) can be interpreted as a measure of how close vertex $i$ is
to $C_1$ (resp. $C_2$): if $M_1(i) > M_2(i)$ then $i$ is assigned to $C'_1$, otherwise it is assigned to
$C'_2$. Then, the approximate variation $\Delta Q'_{fis}$ induced by the fission is

\[
\Delta Q'_{fis}(C_1, C_2) = - \sum_{i \in C'_1} \sum_{j \in C'_2} \alpha_{ij} B_{ij}. \tag{3}
\]
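The assignment rule and the approximate gain (3) can be sketched as follows. The names `fission` and `fission_gain` are hypothetical, the toy graph is an assumption, and $\alpha_{ij}$ is again taken from the cover as it stands before the split.

```python
def b_matrix(n, edges):
    """B_ij = 2m * A_ij - d_i * d_j for a simple undirected graph."""
    m = len(edges)
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    linked = {frozenset(e) for e in edges}
    return [[2 * m * (frozenset((i, j)) in linked) - deg[i] * deg[j]
             for j in range(n)] for i in range(n)]

def fission(c1, c2, B):
    """Split c1 | c2: shared vertices go to the side with the larger M value."""
    part1, part2 = set(c1 - c2), set(c2 - c1)
    for i in sorted(c1 & c2):
        m1 = sum(B[i][j] for j in c1)  # M_1(i)
        m2 = sum(B[i][j] for j in c2)  # M_2(i)
        (part1 if m1 > m2 else part2).add(i)
    return part1, part2

def fission_gain(part1, part2, cover, B):
    """Equation (3): minus the B_ij of grouped pairs that the split separates."""
    return -sum(B[i][j] for i in part1 for j in part2
                if any(i in c and j in c for c in cover))

# Toy graph: two triangles joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
B = b_matrix(6, edges)
cover = [frozenset({0, 1, 2}), frozenset({2, 3}), frozenset({3, 4, 5})]
p1, p2 = fission(cover[0], cover[1], B)
print(p1, p2, fission_gain(p1, p2, cover, B))
```

Here vertex 2 has $M_1(2) = 7$ and $M_2(2) = -4$, so it stays with its triangle and the split only gives up the bridge pair (2, 3), for a variation of $-5$.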
It is worth noticing that (3) is only an approximation. Firstly, the whole set of subdivisions of
$C_1 \cup C_2$ has not been checked. Secondly, two vertices separated by the fission may or may not still
be joined in a third community. For these reasons, (3) does not yield a "guaranteed" optimal fission of
$C_1 \cup C_2$ but a "good enough" solution, quickly computable in polynomial time and space. The
algorithm is as follows:
1. Start with $R = \{\{x, y\} \mid B_{xy} \ge 0\}$.
2. While $R$ is not a partition:
(a) For all $\{C_i, C_j\} \subset R$ such that $C_i \cap C_j \ne \emptyset$, compute
$\Delta Q'_{fus}(C_i, C_j)$ and $\Delta Q'_{fis}(C_i, C_j)$.
(b) Perform the best operation, fusion or fission.
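The two steps above can be put together in a minimal sketch. It is not the authors' implementation: the function name `fusion_fission`, the tie-breaking rule (first best candidate in a fixed sorted order, fusion examined before fission), the removal of empty communities after a fission, and the final addition of uncovered vertices as singletons are all assumptions made here so that the example is deterministic and self-contained. No gains are cached, so it is only suitable for small graphs.

```python
from itertools import combinations

def fusion_fission(n, edges):
    """Greedy fusion-fission sketch; returns a sorted list of sorted communities."""
    m = len(edges)
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    linked = {frozenset(e) for e in edges}
    B = [[2 * m * (frozenset((i, j)) in linked) - deg[i] * deg[j]
          for j in range(n)] for i in range(n)]
    # Step 1: optimal cover made of all pairs {x, y} with B_xy >= 0.
    cover = {frozenset(p) for p in combinations(range(n), 2)
             if B[p[0]][p[1]] >= 0}

    def together(i, j):
        # alpha_ij read off the current cover
        return any(i in c and j in c for c in cover)

    # Step 2: loop until no two communities overlap.
    while True:
        best = None  # (gain, replacement communities, c1, c2)
        for c1, c2 in combinations(sorted(cover, key=sorted), 2):
            if not c1 & c2:
                continue
            fus = sum(B[i][j] for i in c1 for j in c2
                      if i != j and not together(i, j))      # equation (2)
            p1, p2 = set(c1 - c2), set(c2 - c1)
            for i in c1 & c2:
                m1 = sum(B[i][j] for j in c1)
                m2 = sum(B[i][j] for j in c2)
                (p1 if m1 > m2 else p2).add(i)
            fis = -sum(B[i][j] for i in p1 for j in p2
                       if together(i, j))                    # equation (3)
            for gain, parts in ((fus, [c1 | c2]), (fis, [p1, p2])):
                if best is None or gain > best[0]:
                    best = (gain, parts, c1, c2)
        if best is None:
            break
        _, parts, c1, c2 = best
        cover = (cover - {c1, c2}) | {frozenset(p) for p in parts if p}
    covered = set().union(*cover)
    singles = [[v] for v in range(n) if v not in covered]  # assumption
    return sorted([sorted(c) for c in cover] + singles)

# Toy graph: two triangles joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(fusion_fission(6, edges))  # [[0, 1, 2], [3, 4, 5]]
```

On this toy graph the sketch first merges the edges of each triangle at zero cost, then splits the bridge community {2, 3}, and ends with the two triangles.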
2.2 Analysis and complexity
Since the algorithm stops when no pair of overlapping communities is left, we must prove that this
always happens. Let $A_r$ and $B_r$ be respectively the number of communities and the number of pairs of
overlapping communities remaining at the $r$th step of the algorithm. Since a fusion (resp. fission)
replaces two communities by a new one (resp. two new ones), $A_r$ is a non-increasing sequence. At the
initialization step, each community is a pair $\{i, j\}$ of vertices such that $B_{ij} \ge 0$. Since
$B_{ij} = 2mA_{ij} - d_i d_j$, and provided that no vertex is isolated, $B_{ij}$ can be greater than or
equal to 0 only if $A_{ij} = 1$, that is, only if there is an edge between $i$ and $j$. Therefore,
$A_r \le A_0 \le m$. Let us now consider $B_0$, the number of pairs of overlapping communities at the
initialization step of the algorithm. One might think that $B_0 = O(A_0^2)$. However, at step 0,
communities exist only where edges exist, so that the number of pairs of overlapping communities is
bounded by $\delta \times m$, where $\delta$ denotes the maximum vertex degree in the graph. This remark
is crucial because, in practical cases, we have $\delta \ll m$.
In the case of a fusion between two communities, at least one pair of overlapping communities
disappears, namely the pair being merged, so $B_{r+1} < B_r$. In the case of a fission between two
overlapping communities, the same pair of overlapping communities disappears. Furthermore, a third
community overlapping only one of the two communities will overlap only one of the two resulting
communities, because the fission takes place in the intersection of the two communities being split.
Therefore, we still have $B_{r+1} < B_r$ in the case of a fission. Now, since $B_r$ is a strictly
decreasing sequence of non-negative integers, it is necessarily finite and the algorithm always stops.
The above remarks can be used to study the time and space complexity of the algorithm. We have seen that
the size $B_r$ of the list of pairs of overlapping communities satisfies $B_r = O(\delta m)$. At each
step of the algorithm, we must go through this list and evaluate the costs of both fusion and fission,
which are $O(n^2)$ time operations. The entire algorithm runs in $O(\delta n^2 m)$ time. Most real-world
networks of interest, including PPI networks, are sparse graphs, which means that $m = O(n)$. In this
case, the algorithm runs in $O(\delta n^3)$ time. We have to keep the list of pairs of overlapping
communities, which uses $O(\delta m)$ space. We also have to keep each community in memory. Since
$A_r \le m$ and a community contains at most $n$ elements, the algorithm uses $O(nm)$ space to keep the
communities in memory. Moreover, computing the cost of a fusion or of a fission is an $O(n^2)$ space
operation, but we only keep the results in memory, not the details of each computation, so that the
entire algorithm uses $O(\delta m) + O(nm) + O(n^2)$ space. Finally, since $\delta < n$ and, for a
connected graph, $m \ge n - 1$, the space complexity of the algorithm is $O(nm)$, which corresponds to
$O(n^2)$ for sparse graphs.
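For reference, the bounds stated in this section combine as in the following summary. It only restates the text above; the labels $T$ and $S$ for time and space are introduced here for convenience.

```latex
\begin{align*}
T &= \underbrace{O(\delta m)}_{\text{pairs examined}} \times
     \underbrace{O(n^2)}_{\text{cost per pair}}
   = O(\delta n^2 m)
   \;\xrightarrow{\; m = O(n) \;}\; O(\delta n^3), \\
S &= \underbrace{O(\delta m)}_{\text{list of pairs}}
   + \underbrace{O(nm)}_{\text{communities}}
   + \underbrace{O(n^2)}_{\text{one cost evaluation}}
   = O(nm)
   \;\xrightarrow{\; m = O(n) \;}\; O(n^2).
\end{align*}
```

The simplification of $S$ uses $\delta m \le nm$ (since $\delta < n$) and $n^2 = O(nm)$ (since $n = O(m)$ for a connected graph).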
3 Efficiency
3.1 Performance on computer-generated networks
To measure the efficiency of our algorithm, we use it to partition computer-generated random networks
whose community structure is known in advance. The method used is described by Newman [5].
Each network is made of 128 vertices divided into 4 communities of 32 elements. We denote by $z$ the
average degree of the vertices, by $z_{in}$ the average intra-community degree, which is the average
number of neighbours of a vertex belonging to the same community, and by $z_{out}$ the average
extra-community degree, which is the average number of neighbours of a vertex belonging to other
communities. Each network is generated with $z = z_{in} + z_{out} = 16$, so we use only $z_{out}$ as the
variable in the testing process. When $z_{out}$ is low, the corresponding networks have many
intra-community edges and few inter-community edges, so that the partitioning problem is easy; it
becomes increasingly difficult as $z_{out}$ gets higher. We make $z_{out}$ vary from 0 to 8 in steps of
0.5. For each value of $z_{out}$, we generate 100 networks and partition them with our greedy algorithm
as well as with the one proposed by Newman [5]. We then measure the average fraction $\tau_e$ of
correctly classified vertices and the average modularity $Q$ of the resulting partitions.
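A generator for such test networks could look like the sketch below. The function name `planted_network` and its parameters are hypothetical, and the construction is an assumption: each pair of vertices is linked independently, with probability $z_{in}/31$ inside a community and $z_{out}/96$ across communities, so that the expected degrees match $z_{in}$ and $z_{out}$. The exact procedure is the one described in [5].

```python
import random

def planted_network(z_out, z=16, groups=4, size=32, seed=None):
    """Random graph with planted communities; returns (edges, true_labels)."""
    rng = random.Random(seed)
    n = groups * size
    # Each vertex has size-1 possible neighbours inside its community
    # and n-size possible neighbours outside it.
    p_in = (z - z_out) / (size - 1)
    p_out = z_out / (n - size)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if i // size == j // size else p_out
            if rng.random() < p:
                edges.append((i, j))
    truth = [i // size for i in range(n)]
    return edges, truth

# Values of z_out used in the experiment: 0, 0.5, ..., 8.
z_values = [k / 2 for k in range(17)]
edges, truth = planted_network(4.0, seed=0)
mean_degree = 2 * len(edges) / len(truth)
print(len(truth), round(mean_degree, 2))
```

The mean degree of a generated network fluctuates around 16; fixing the seed only makes a given run reproducible.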
[Figure 1: plot of the fraction of correctly classified vertices τe and of the modularity Q of the
resulting partitions, for Newman's algorithm and for this algorithm, against the average number of
inter-community neighbors per vertex zout (from 0 to 8).]
Figure 1. The fraction of correctly classified elements by our algorithm (♦) and Newman's algorithm [5] (+),
and the modularity of the resulting partitions by our algorithm (×) and Newman's algorithm [5].
3.2 Analysis
We can see in Fig. 1 that our algorithm (♦) classifies a larger fraction of vertices correctly than
Newman's (+) on computer-generated networks. To explain this, we can point out that our algorithm finds
a partition by a fusion-fission process while Newman's is an agglomerative process, so that it uses only
fusions. Therefore, two vertices wrongly put together cannot be separated and errors build up. Using
fission in the process helps to tackle this difficulty.
We also see in Fig. 1 that the modularities of the partitions provided by our algorithm (×) are almost
equal to those provided by Newman's algorithm. This may be somewhat surprising given the difference
between the two algorithms in terms of correctly classified elements. This result highlights a
limitation of the modularity criterion: two partitions of different qualities may have virtually the
same modularity. As an optimization process, Newman's algorithm is almost as efficient as ours.
Therefore, the differences in the quality of the obtained partitions can be partially explained by the
fact that Newman's algorithm starts with the atomic partition, while our algorithm starts with a cover
of the graph where strongly related elements (i.e. those with high $B_{ij}$) are already put together.
4 Application to a protein-protein interaction network
In this section, we consider a Drosophila PPI network. The graph used here has 196 vertices and
310 edges and is a connected subgraph of the full Drosophila interactome. Vertices in this network
represent proteins and edges represent interactions between proteins provided by [7,8,9,10]. Our
partitioning algorithm is used to process this network and finds 12 groups of proteins shown in Table
1.
group  size  proteins
1      1     okr
2      1     Rbp1
3      14    bip2 CG2926 kn Rad51 p53 His4 pr-set7 CG4005 sd tou CG1244 CG9715 CG12340 stumps
4      12    Sep1 Sep2 pnut Uba2 sip2 snRNP70K SC35 sip1 tra SF2 tra2 U2af35
5      12    trx Pp1alpha-96A flw Pp1-87B Pp1-13C NiPp-1 I-2 CG5053 CG6416 CG12620 scrib CG16812
6      20    noc Su(var)3-9 dlg1 CG12470 Sh CG7357 l(3)j2D3 CG31029 CG14546 Su(var)205 Nap1
             CkIIalpha HLHm7 l(1)sc ac emc E(spl) HLHm5 sc da
7      15    ftz Rpd3 Sin3A Smr usp EcR rig alien ftz-f1 Sin Sxl U2af38 snf Hrb87F U2af50
8      11    CG13030 Mi-2 Dref ttk Rlip AP-50 png CycB phyl sina lolal
9      29    Trl Chi Top2 Pc ph-d Tbp ph-p Scm barr Psc Taf4 E(z) Bin1 bcd esc corto His3 tor slbo
             Taf11 Taf1 Taf6 vap aur cnn ial Taf12 TfIIA-L Taf2
10     30    Hsf Su(fu) Mlf RpII215 ci CG32209 Kap-alpha3 CG4656 fu CkIalpha cos insc raps numb
             mira pros hh smo eve Gug pan arm sgg CG3402 Axn Hrb27C otu mus309 SH3PX1 Apc2
11     21    smt3 lwr gro CaMKII CtBP nej Mad Sir2 h CG4116 CG6459 Kr TfIIB CG8204 H brk
             pnr tin Med Smox run
12     30    dl pll Myd88 Tl tub BG4 Pli Traf1 msn dock cher Amph Hrs Ogt Stam Psn cact mbo
             Dif Rel Dip3 Dredd CG11486 scaf6 CG4349 inaD inaC trpl Cam ninaC
Table 1. Groups of proteins found by our algorithm in the Drosophila PPI network.
First, we notice that groups 1 and 2 each contain a single element. This is not surprising since the
elements okr and Rbp1 have a degree equal to 1 in our network. We investigate the biological meaning
of the other groups using GOToolBox [11] and find that these communities correspond very well to
known functional groups of proteins. Groups 3 and 7 describe proteins involved in the regulation of
cellular and metabolic processes, group 4 contains proteins involved in spliceosome assembly, RNA
splicing and processing, proteins of group 5 are involved in protein amino-acid dephosphorylation,
groups 6 and 9 describe proteins involved in the regulation of transcription, and group 8 contains
proteins involved in photoreceptor cell differentiation. We find that group 10 contains proteins
involved in the hedgehog pathway (Su(fu), ci, fu, cos, hh, smo) and in the wingless pathway (CkIalpha,
pan, arm, sgg, CG3402, Axn, Apc2). These two pathways are very close from a network point of view. That
is why our algorithm provides a single community instead of two (they may be separated by applying the
same algorithm to the subgraph induced by this group of vertices). Proteins in group 11 are
transcription factors, and group 12 describes proteins involved in the toll pathway.
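The refinement step mentioned for group 10 relies on extracting an induced subgraph. A minimal sketch is given below; the function name `induced_subgraph` is hypothetical, and the vertex renumbering (0 to k-1 in sorted order) is an assumption chosen so that the output can be fed to any partitioning routine that expects integer vertex indices.

```python
def induced_subgraph(edges, group):
    """Keep only edges with both ends in `group`, renumbering vertices 0..k-1.

    Returns (k, sub_edges, names), where names[i] is the original vertex
    that became index i.
    """
    names = sorted(group)
    index = {v: k for k, v in enumerate(names)}
    sub_edges = [(index[u], index[v]) for u, v in edges
                 if u in index and v in index]
    return len(names), sub_edges, names

# Toy example with protein-like labels (illustrative edges, not the real data).
toy_edges = [("hh", "smo"), ("smo", "cos"), ("cos", "arm"), ("arm", "pan"),
             ("pan", "okr")]
k, sub_edges, names = induced_subgraph(toy_edges, {"hh", "smo", "cos", "arm"})
print(k, sub_edges, names)
```

The returned `names` list makes it possible to map the communities found on the subgraph back to the original proteins.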
5 Conclusions
Optimizing modularity over the set of partitions of a network is a very effective way to detect
interesting communities. However, in our simulations, using a fusion-fission process on overlapping
communities provides even better results than a purely agglomerative approach. This algorithm is not
only efficient on computer-generated networks but also finds meaningful communities of proteins in the
Drosophila PPI network. Incidentally, a limitation of modularity appears: partitions with very close
modularities may show larger differences in overall quality.
6 Acknowledgements
We thank Christine Brun and Anaïs Baudot for providing us with the PPI network used in this article and
for their help in the biological interpretation of the results. J-B.A. is supported by a fellowship from
the French ’Ministère de l’Enseignement Supérieur et de la Recherche’.
References
[1] C. Brun, F. Chevenet, D. Martin, J. Wojcik and A. Guénoche, Functional classification of proteins for the
prediction of cellular function from a protein-protein interaction network, Genome Biol., 5:R6, 2003.
[2] C. Brun, C. Herrmann and A. Guénoche, Clustering proteins from interaction networks for the prediction
of cellular functions, BMC Bioinformatics, 5:95, 2004.
[3] A. Guénoche, Comparing recent methods in graph partitioning, Electronic Notes in Discrete Mathematics,
22:83-89, 2005.
[4] M.E.J. Newman and M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E,
69(2):026113, 2004.
[5] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Phys. Rev. E,
69(6):066133, 2004.
[6] M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev.
E, 74:036104, 2006.
[7] L. Giot, J.S. Bader, C. Brouwer, A. Chauduri, B. Kuang, Y. Li, Y.L Hao, C.E. Ooi, B. Godwin, E. Vitols,
G. Vijayadamodar, P. Pochart, H. Machineni, M. Welsh, Y. Kong, B. Zerhusen, R. Malcolm, Z. Varrone,
A. Collis, M. Minto, S. Burgess, L. McDaniel, E. Stimpson, F. Spriggs, J. Williams, K. Neurath, N. Ioime,
M. Agee, E. Voss, K. Furtak, R. Renzulli, N. Aanensen, S. Carrolla, E. Bickelhaupt, Y. Lazovatsky, A.
DaSilva, J. Zhong, C.A. Stanyon, R.L. Finley, K.P. White, M. Braverman, T. Jarvie, S. Gold, M. Leach,
J. Knight, R.A. Shimkets, M.P. McKenna, J. Chant and J.M. Rothberg, A protein interaction map of
Drosophila melanogaster, Science, 302:1727-36, 2003.
[8] E. Formstecher, S. Arresta, V. Collura, A. Hamberger, A. Meil, A. Trehin, C. Reverdy, V. Betin, S. Maire,
C. Brun, B. Jacq, M. Arpin, Y. Bellaiche, S. Bellusci, P. Benaroch, M. Bornens, R. Chanet, P. Chavrier,
O. Delattre, V. Doye, R. Fehon, G. Faye, T. Galli, J.A. Girault, B. Goud, J. de Gunzburg, L. Johannes,
M.P. Junier, V. Mirouse, A. Mukherjee, D. Papadopoulo, F. Perez, A. Plessis, M. Rosbach, C. Rossé,
S. Saule, D. Stoppa-Lyonnet, A. Vincent, M. White, P. Legrain, J. Wojcik, J. Camonis, and L. Daviet,
Protein interaction map of Drosophila melanogaster, Genome Res., 15:376-84, 2005.
[9] C.A. Stanyon, G. Liu, B.A. Mangiola, N. Patel, L. Giot, B. Kuang, H. Zhang, J. Zhong, R.L. Finley, Jr, A
Drosophila protein-interaction map centered on cell-cycle regulators, Genome Biol., 5:R96, 2004.
[10] H. Hermjakob, L. Montecchi-Palazzi, C. Lewington, S. Mudali, S. Kerrien, S. Orchard, M. Vingron, B.
Roechert, P. Roepstorff, A. Valencia, H. Margalit, J. Armstrong, A. Bairoch, G. Cesareni, D. Sherman
and R. Apweiler, IntAct: an open source molecular interaction database, Nucleic Acids Res., 32:D452-5,
2004.
[11] D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry and B. Jacq, GOToolBox: functional analysis of
gene datasets based on Gene Ontology, Genome Biology, 5(12):R101, 2004.